RDataFrame, a modern tool to manipulate and analyze ROOT datasets

What is RDataFrame?

ROOT::RDataFrame is a modern C++ high-level interface for interacting with data in ROOT.
It spent its infancy in the namespace ROOT::Experimental, and with v6.14, released yesterday, it became officially part of ROOT.

The interface design is inspired by other dataframe APIs such as pandas or Spark DataFrames; it also takes ideas from the functional and declarative programming paradigms.
RDataFrame, however, is specially crafted for HEP use-cases and ROOT users. For example:

  • it reads ROOT data as well as CSV files and Arrow tables, and it can be extended to arbitrary data formats
  • it exposes methods to easily fill ROOT 1D, 2D and 3D histograms and profiles and to retrieve cut-flow information, among others; its functionality can be extended by end users
  • it makes it easy to generate ROOT data from scratch or to write out a skimmed, trimmed dataset
  • all of the above is parallelized with one line of code, which activates ROOT’s implicit multi-threading
  • for Python users, RDataFrame comes with PyROOT bindings

How does this look?
This is a simple cut-and-fill with RDataFrame:

ROOT::RDataFrame df("mytree", {"f1.root", "f2.root"});
auto h = df.Filter("x > 0").Histo1D("x");
h->Draw(); // the event loop is run here, upon first access to one of the results

The lazy triggering of the event loop makes it easy to generate multiple results while reading the data only once:

// C++11 lambdas and functions are also supported as filter expressions
auto filtered_df = df.Filter([](float x) { return x > 0; }, {"x"});
auto hx = filtered_df.Histo1D("x");
auto hy = filtered_df.Histo1D("y");
hx->Draw(); // event loop is run here, both hx and hy are filled
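
To make the "book now, run once later" idea concrete, here is a toy model written in plain standard C++. It is only a sketch of the concept and not how RDataFrame is implemented: the names ToyFrame, ToyResult, BookSum and NLoops are hypothetical, and the "dataset" is just a vector of doubles. Booking a result only stores a callback; the first call to Get() on any result runs a single loop that fills every booked result.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Toy handle to a lazily computed value (hypothetical, not a ROOT class).
struct ToyResult {
    std::function<void()> trigger;   // runs the owning frame's loop if still pending
    std::shared_ptr<double> value;   // filled during the loop
    double Get() const { trigger(); return *value; }
};

// Toy "dataframe" over a vector of doubles (hypothetical, not a ROOT class).
class ToyFrame {
public:
    explicit ToyFrame(std::vector<double> data) : data_(std::move(data)) {}
    ToyFrame(const ToyFrame &) = delete;             // results keep a pointer to the frame
    ToyFrame &operator=(const ToyFrame &) = delete;

    // Book the sum of f(x) over all entries passing cut(x). Nothing is computed here.
    ToyResult BookSum(std::function<bool(double)> cut, std::function<double(double)> f) {
        auto value = std::make_shared<double>(0.0);
        actions_.push_back([cut, f, value](double x) { if (cut(x)) *value += f(x); });
        return ToyResult{[this] { Run(); }, value};
    }

    int NLoops() const { return nLoops_; }           // how many times the data was read

private:
    // Single pass over the data that fills all booked results, then forgets them.
    void Run() {
        if (actions_.empty()) return;                // everything booked has already been filled
        ++nLoops_;
        for (double x : data_)
            for (auto &action : actions_) action(x);
        actions_.clear();
    }

    std::vector<double> data_;
    std::vector<std::function<void(double)>> actions_;
    int nLoops_ = 0;
};

// Usage: two results are booked, but the data is read only once.
inline int DemoNumberOfLoops() {
    ToyFrame frame(std::vector<double>{-1.0, 2.0, 3.0});
    auto isPositive = [](double x) { return x > 0; };
    auto sum = frame.BookSum(isPositive, [](double x) { return x; });
    auto sumSq = frame.BookSum(isPositive, [](double x) { return x * x; });
    assert(frame.NLoops() == 0);                     // booking alone does not read the data
    sum.Get();                                       // first access: the loop runs, both sums are filled
    sumSq.Get();                                     // already filled: no second loop
    return frame.NLoops();
}
```

In this toy, as in the RDataFrame snippet above, the order in which results are accessed does not matter: whichever result is read first pays for the loop, and all the others come for free.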

As a last example, let’s filter the events, define a new quantity, produce a control plot and write out the filtered dataset, all in the same multi-threaded event loop:

ROOT::EnableImplicitMT(); // enable multi-threading
ROOT::RDataFrame df(treename, filenames); // create dataframe
auto df2 = df.Filter("x > 0").Define("y", "x*x"); // filter and define new column
auto control_h = df2.Histo1D("y"); // book filling of a control plot
// write out the new dataset: this triggers the event loop and also fills the booked control plot
df2.Snapshot("newtree", "newfile.root", {"x","y"});

More information, a quick crash course and a cheat-sheet are available as part of RDataFrame’s guide.
C++ and Python tutorials are available here.

What’s in RDataFrame's future?
Work is in progress to make the PyROOT bindings more Pythonic, and to let users define an RDataFrame analysis and execute it as is, either locally on multiple cores or distributed over a CPU cluster (e.g. a Spark cluster).
We are listening to the feedback of early adopters to keep improving RDataFrame and smooth out its rough edges: users’ feedback is very welcome, either on this forum as a question or a discussion, or on ROOT’s JIRA as a bug report!
