Drop columns from RDataFrame

tobychev · October 1, 2018, 9:05am

Hello,
I have a very large ROOT file with many rows and columns. I can filter out most rows, but I don’t understand how to reduce the number of columns.

Currently the dataset is too large to load into a numpy array directly, so I would like to reduce it with ROOT and then do further analysis in Python. I imagine a simple way to do this is to drop some of the columns, save a snapshot, and load the reduced file in Python later.

I want a way to tell ROOT explicitly which columns to drop, rather than which to keep: there are many columns I’m not sure whether I need, and I don’t want to list all 100+ names explicitly.

eguiraud · October 1, 2018, 9:37am

Hi @tobychev,
I think this is an area of RDF where there is room for improvement.
We do not have a way to blacklist columns from a Snapshot.
If it works for your specific data layout (it’s not what the method is originally meant for), you can get a list of all column names with GetColumnNames() and write a little function that takes that std::vector<std::string> and drops a few items. It would look like this:

df.Snapshot("thinned_tree", "out.root", DropColumns(df.GetColumnNames()));

where DropColumns would be something like

#include <algorithm>
#include <string>
#include <vector>

// takes the column list by value, so the result of GetColumnNames() can be passed directly
std::vector<std::string> DropColumns(std::vector<std::string> good_cols)
{
   // your blacklist
   static const std::vector<std::string> blacklist = {"useless", "columns"};
   // a lambda that checks if `s` is in the blacklist
   // (no capture needed: `blacklist` has static storage duration and cannot be captured)
   auto is_blacklisted = [](const std::string &s) { return std::find(blacklist.begin(), blacklist.end(), s) != blacklist.end(); };

   // removing elements from std::vectors is not pretty, see https://en.wikipedia.org/wiki/Erase%E2%80%93remove_idiom
   good_cols.erase(std::remove_if(good_cols.begin(), good_cols.end(), is_blacklisted), good_cols.end());

   return good_cols;
}
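
To try the erase-remove step without a ROOT installation, here is a minimal sketch that works on a plain vector of names. The helper name `DropListed` is hypothetical, and the blacklist is passed as an argument instead of being hard-coded:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: returns `cols` without any name that appears in `blacklist`.
// `cols` is taken by value, so temporaries and named vectors can both be passed.
std::vector<std::string> DropListed(std::vector<std::string> cols,
                                    const std::vector<std::string> &blacklist)
{
   auto is_blacklisted = [&blacklist](const std::string &s) {
      return std::find(blacklist.begin(), blacklist.end(), s) != blacklist.end();
   };
   // erase-remove idiom: remove_if shifts the kept names to the front,
   // erase then trims the leftover tail
   cols.erase(std::remove_if(cols.begin(), cols.end(), is_blacklisted), cols.end());
   return cols;
}
```

Assuming the same df as above, the call would then read df.Snapshot("thinned_tree", "out.root", DropListed(df.GetColumnNames(), {"useless", "columns"}));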

Now for your specific use case: we introduced a TTree::AsMatrix helper function in PyROOT that returns some of the columns as numpy arrays (but does not perform event filtering); there is a tutorial for it.
In the future we plan to add RDataFrame::AsMatrix, which does the same (“export” ROOT data into numpy arrays/pandas) but with the full power of RDF, so things will get nicer.

Let us know if this helps.
Cheers
Enrico

tobychev · October 1, 2018, 10:31am

Hello @eguiraud,
making a function to filter the list is a nice solution! I am using the Python bindings to do this work, so filtering a list should be easy.

The reduction from the event filtering is very significant, so I guess I’ll have to live with the roundtrip to the intermediate file for now.

Thanks for the help!

eguiraud · October 1, 2018, 11:35am

Good!
Please double-check that the list of column names returned by GetColumnNames is correct in your case: for relatively flat ntuples it should be ok, but for deeply nested branches it might not be what you expect, as “column name” is not well defined in that case (but I guess you will notice if you are missing an important column somewhere).
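
One way to make that double-check explicit is to compare the columns you know you need against the list you are about to snapshot. This is only a sketch on plain vectors; `MissingColumns` and the names in the usage note are hypothetical:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: returns the names in `required` that do not appear in `available`.
std::vector<std::string> MissingColumns(const std::vector<std::string> &required,
                                        const std::vector<std::string> &available)
{
   std::vector<std::string> missing;
   for (const auto &name : required) {
      if (std::find(available.begin(), available.end(), name) == available.end())
         missing.push_back(name);
   }
   return missing;
}
```

For example, MissingColumns({"pt", "eta"}, cols) should come back empty for the column list `cols` you pass to Snapshot; anything it returns is a column that would not be written out.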

Cheers,
Enrico
