RDataFrame in Python: Get access to underlying data and reduce number of branches

Hi all,

I want to compare different methods for manipulating large ROOT tuples in Python. So far, I have focused on a combination of uproot/root_pandas and pandas DataFrames to perform filtering, combine different branches, etc. Now I want to do basically the same with RDataFrame. I know that it is very easy to apply filters and define new branches, but I have not yet found out how to do the following:

  1. Access the underlying data. Let's say I have a filtered RDF and want to look at the specific values of an observable. How can I do that? I found a method called Take in the C++ implementation, but I cannot find it in Python (?).

In uproot, I would do the following:

ntuple = uproot.open(path)
tree = ntuple[treename]
print('first 5 entries:', tree.array('myvar', entrystop=5))
# prints [1,2,3,4,5]

  2. Is it also possible to reduce the number of branches based on a regular expression before saving a TTree again? Again, a working example below, this time with root_pandas:

all_vars = ['B_M', '*PID*']
df = read_root(path, treename, columns=all_vars)
print(df.columns)
# prints ['B_M', 'pi_PID_K', 'K_PID_p', ...]

Thanks!

Hi @Kecksdose,

  1. the python equivalent of Take is AsNumpy – see the tutorial here.

  2. Snapshot is the RDF method that saves a new tree. You can tell Snapshot what branches you want in the new tree. The method currently takes a single regular expression, but I think you can specify ['B_M', '*PID*'] as (B_M|.*PID.*).
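
Both answers can be sketched in a few lines of Python. This is only an illustration under assumptions, not an official recipe: the helper names globs_to_regex and first_entries are made up here, the output file name in the usage comment is hypothetical, and the code assumes that Snapshot matches the regular expression against the full branch name. Range is used to mimic entrystop=5 from the uproot example; it is not available when implicit multi-threading is enabled.

```python
import re


def globs_to_regex(patterns):
    """Turn shell-style patterns such as '*PID*' into one regular expression."""
    parts = []
    for pattern in patterns:
        escaped = re.escape(pattern)            # protect regex metacharacters
        escaped = escaped.replace(r"\*", ".*")  # '*' matches any run of characters
        escaped = escaped.replace(r"\?", ".")   # '?' matches a single character
        parts.append(escaped)
    return "(" + "|".join(parts) + ")"


def first_entries(rdf, column, n=5):
    """Return the first n values of one column via AsNumpy."""
    # Range restricts the event loop to the first n entries.
    return rdf.Range(n).AsNumpy(columns=[column])[column]


# Usage with a real RDataFrame (AsNumpy needs a recent enough ROOT;
# 'reduced.root' is a hypothetical output file name):
#   import ROOT
#   rdf = ROOT.RDataFrame(treename, path)
#   print('first 5 entries:', first_entries(rdf, 'myvar'))
#   rdf.Snapshot(treename, 'reduced.root', globs_to_regex(['B_M', '*PID*']))
```

For the list ['B_M', '*PID*'], globs_to_regex builds the expression (B_M|.*PID.*) mentioned above.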

Cheers,
Enrico

P.S.
Please let us know what the outcome of the comparison is! RDF should have a larger constant startup time, but should scale nicely to very large datasets, very large numbers of histograms, and very large numbers of cores.

Hi @eguiraud,

mhm, if I execute the example tutorial you provided I face the following error:

Traceback (most recent call last):
  File "analyse_rdataframe.py", line 12, in <module>
    npy = df.AsNumpy()
AttributeError: 'RInterface<ROOT::Detail::RDF::RLoopManager,void>' object has no attribute 'AsNumpy'

I am using ROOT 6.16/00. Could it be that I need to update my ROOT version in order to use the AsNumpy() features?

One side remark: Is there an RDF method list for Python anywhere? I only found this (https://root.cern.ch/doc/master/classROOT_1_1RDataFrame-members.html), but AsNumpy() is not listed there.

Thanks,
Timon

Hi,
yes you need v6.18 or master. For example you can get v6.18 from cvmfs at /cvmfs/sft.cern.ch/lcg/app/releases/ROOT/6.18.00.
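
A script that has to run on several installations could check the version before calling AsNumpy. The sketch below is a hedged illustration: the helper names are invented, and it assumes that the version string has the form "6.16/00" (as quoted in the question, and as ROOT.gROOT.GetVersion() is expected to return) or "6.18.00" (as in the cvmfs path).

```python
def parse_root_version(version):
    """Convert a version string such as '6.16/00' or '6.18.00' into a tuple of ints."""
    return tuple(int(part) for part in version.replace("/", ".").split("."))


def supports_asnumpy(version):
    """Return True for v6.18 or later, the requirement stated in the reply above."""
    return parse_root_version(version) >= (6, 18)


# Usage with a real installation (assumption: GetVersion returns e.g. '6.16/00'):
#   import ROOT
#   if not supports_asnumpy(ROOT.gROOT.GetVersion()):
#       raise RuntimeError('AsNumpy needs ROOT v6.18 or later')
```

A simpler alternative is hasattr(df, 'AsNumpy'), which tests for the feature directly instead of relying on the version number.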

Calling in @etejedor about the method list for python. There is currently no list, but I know there are plans in that direction (and we’ll add AsNumpy documentation to the RDF users guide, thanks for pointing it out).

Cheers,
Enrico

Hi Timon,

Unfortunately there is no such list of methods yet, but we will take care of it as soon as possible. We are currently working on a new PyROOT and documentation is one of the priorities.

In addition to AsNumpy, there is also another feature in experimental mode which allows you to close the circle and come back to RDF, by creating a NumPy data source for RDF:

Note that, to use that feature, you need to use experimental PyROOT (i.e. build ROOT with -Dpyroot_experimental=ON).

Hi all,

thanks for the additional information. I am now able to save a filtered Snapshot, which is indeed quite fast. I also like the new docstrings of AsNumpy(). I attached two docstrings to show what I mean ;).

With the aim of being completely independent of root_numpy and uproot (while still wanting to use pandas DataFrames from time to time), I am still struggling a bit with the options. Is it possible

  1. to receive numpy arrays based on a regular expression? As you mentioned, it works when I want to Snapshot some data, but it did not work with AsNumpy().
  2. to read array-based variables? When trying to read some of them, I get the following error:
Error in <TBranch::TBranch>: Illegal leaf: B0_ARRAY_M/B0_ARRAY_M[B0_ARRAY_nPV]/F. If this is a variable size C array it's possible that the branch holding the size is not available.

 *** Break *** segmentation violation

Regards,
Timon

Appendix:

In [12]: rdf.Snapshot?
Call signature:  rdf.Snapshot(*args, **kwargs)
Type:            TemplateProxy
String form:     <ROOT.TemplateProxy object at 0x7fda37743f48>
File:            ~/.conda/envs/myroot/lib/python3.6/site-packages/ROOT.py
Docstring:
ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> > ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Snapshot(basic_string_view<char,char_traits<char> > treename, basic_string_view<char,char_traits<char> > filename, const vector<string>& columnList, const ROOT::RDF::RSnapshotOptions& options = ROOT::RDF::RSnapshotOptions())
ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> > ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Snapshot(basic_string_view<char,char_traits<char> > treename, basic_string_view<char,char_traits<char> > filename, basic_string_view<char,char_traits<char> > columnNameRegexp = "", const ROOT::RDF::RSnapshotOptions& options = ROOT::RDF::RSnapshotOptions())
ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> > ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Snapshot(basic_string_view<char,char_traits<char> > treename, basic_string_view<char,char_traits<char> > filename, initializer_list<string> columnList, const ROOT::RDF::RSnapshotOptions& options = ROOT::RDF::RSnapshotOptions())
Class docstring: PyROOT template proxy (internal)
In [13]: rdf.AsNumpy?
Signature: rdf.AsNumpy(columns=None, exclude=None)
Docstring:
Read-out the RDataFrame as a collection of numpy arrays.

The values of the dataframe are read out as numpy array of the respective type
if the type is a fundamental type such as float or int. If the type of the column
is a complex type, such as your custom class or a std::array, the returned numpy
array contains Python objects of this type interpreted via PyROOT.

Be aware that reading out custom types is much less performant than reading out
fundamental types, such as int or float, which are supported directly by numpy.

The reading is performed in multiple threads if the implicit multi-threading of
ROOT is enabled.

Note that this is an instant action of the RDataFrame graph and will trigger the
event-loop.

Parameters:
    columns: If None return all branches as columns, otherwise specify names in iterable.
    exclude: Exclude branches from selection.

Returns:
    dict: Dict with column names as keys and 1D numpy arrays with content as values
File:      ~/.conda/envs/myroot/lib/python3.6/site-packages/ROOT.py
Type:      method
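
Since the docstring says that columns takes an iterable of names rather than a regular expression, one possible workaround is to do the matching in Python and pass the resulting list to AsNumpy. This is a minimal sketch, not a feature of ROOT itself: select_columns and as_numpy_matching are made-up helper names, and the names are matched in full against the pattern.

```python
import re


def select_columns(names, pattern):
    """Return the names that match the regular expression in full."""
    regex = re.compile(pattern)
    return [name for name in names if regex.fullmatch(name)]


def as_numpy_matching(rdf, pattern):
    """Read only the columns whose names match pattern into numpy arrays."""
    # GetColumnNames returns C++ strings; convert them to Python str first.
    names = [str(name) for name in rdf.GetColumnNames()]
    return rdf.AsNumpy(columns=select_columns(names, pattern))


# Usage with a real RDataFrame:
#   arrays = as_numpy_matching(rdf, '(B_M|.*PID.*)')
#   print(sorted(arrays))
```

The returned dict maps column names to 1D numpy arrays, so for columns of fundamental type it should be possible to hand it to pandas.DataFrame(arrays) when a pandas DataFrame is needed.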
