Splitting DataFrames over a variable efficiently

dhmorton · August 12, 2019, 9:18am

Hello,
I am not too experienced with dataframes and was hoping to check my understanding. Currently I am using ROOT 6.14. Essentially, I was considering what to do when I don’t simply want to filter data but rather divide it into two or more sets over the cut. As an example, say we have a dataframe with columns “p” and “x”. We want to divide it into 3 categories in p and find the mean of “x” in each, i.e.

float cut_edges[]={0.0,2.0,3.0, 99.0}; //cuts defining 3 categories, 99 used for infinity for simplicity
ROOT::RDataFrame df(inputTree, inputFile, {"x"}); //default columns are passed as a list of names
float x_Mean[3];

Now, one could put this in a for loop e.g.

for(int i=0; i<3; i++){
  double lowbound=cut_edges[i];
  double highbound=cut_edges[i+1];
  auto dfCut=df.Filter([lowbound,highbound](float p){return (p<highbound && p>lowbound);},{"p"});
  x_Mean[i]=*dfCut.Mean("x");
}

However, that would require looping over the tree several times (assuming I understand correctly).

On the other hand, writing out the for loop line by line instead could be done lazily, correct? E.g.

auto dfCut1=df.Filter([cut_edges](float p){return (p<cut_edges[1] && p>cut_edges[0]);},{"p"});
auto dfCut2=df.Filter([cut_edges](float p){return (p<cut_edges[2] && p>cut_edges[1]);},{"p"});
auto dfCut3=df.Filter([cut_edges](float p){return (p<cut_edges[3] && p>cut_edges[2]);},{"p"});
//book all three means before reading any of them, so that one event loop can fill them all
auto mean1=dfCut1.Mean("x");
auto mean2=dfCut2.Mean("x");
auto mean3=dfCut3.Mean("x");
x_Mean[0]=*mean1; //the first access triggers the event loop
x_Mean[1]=*mean2;
x_Mean[2]=*mean3;

Would this require only one loop through the data?

Now, if I wanted an arbitrary number of divisions, this method becomes impractical, and I would think there is a more elegant way to divide up a dataframe. Are there any efficient ways of writing this sort of code? I didn’t see anything quite like it in the tutorials.

eguiraud · August 12, 2019, 9:28am

Hi,
your understanding is correct.

You can also have lazy evaluation and still use a for loop (haven’t tested the code, but it should give you an idea):

// `cuts` is assumed to be something like a std::vector<std::pair<float, float>> of (low, high) bounds
std::vector<ROOT::RDF::RResultPtr<double>> means;
for (auto cut : cuts) {
  auto m = df.Filter([cut](float p) { return p > cut.first && p < cut.second; }, {"p"}).Mean("x");
  means.emplace_back(m);
}
// the event loop is triggered the first time you dereference an element of `means`, e.g. *means[0]

On the other hand, this specific problem might be expressed as filling a 2D histogram (with “p” and “x” on the axes), e.g. with df.Histo2D, and that is faster than checking N filters per entry.
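To give a rough idea of why binning is cheaper than N filters, here is a minimal sketch in plain C++ without ROOT. It is not how Histo2D is implemented; `BinnedMeans` is a made-up helper for illustration, and it assumes the bin edges are sorted. Each entry is assigned to its p-bin with one binary search instead of being tested against every cut. Note that it uses the usual histogram convention (lower edge included, upper edge excluded), which differs slightly from the strict inequalities in the filters above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a ROOT API): mean of x in each p-bin, in a single pass.
// `edges` must be sorted ascending; bin i covers edges[i] <= p < edges[i+1].
std::vector<double> BinnedMeans(const std::vector<float>& p,
                                const std::vector<float>& x,
                                const std::vector<float>& edges) {
  assert(p.size() == x.size());
  if (edges.size() < 2) return {};  // no bins defined

  const std::size_t nBins = edges.size() - 1;
  std::vector<double> sum(nBins, 0.0);
  std::vector<std::size_t> count(nBins, 0);

  for (std::size_t i = 0; i < p.size(); ++i) {
    // Binary search for the first edge strictly greater than p[i]
    auto it = std::upper_bound(edges.begin(), edges.end(), p[i]);
    if (it == edges.begin() || it == edges.end()) continue;  // p[i] outside all bins
    const std::size_t bin = static_cast<std::size_t>(it - edges.begin()) - 1;
    sum[bin] += x[i];
    ++count[bin];
  }

  // Empty bins are reported as 0 here; that is an arbitrary choice for this sketch
  std::vector<double> means(nBins, 0.0);
  for (std::size_t b = 0; b < nBins; ++b) {
    if (count[b] > 0) means[b] = sum[b] / static_cast<double>(count[b]);
  }
  return means;
}
```

With the edges from the first post, {0.0, 2.0, 3.0, 99.0}, this returns three means, one per category in p, after reading each entry exactly once.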

Cheers,
Enrico

dhmorton · August 12, 2019, 9:35am

Oh that is something worth experimenting with. I am more used to older C. I considered a TH2D, or closer to my actual use case a TH3D, but wasn’t sure about the speed. Thanks!