Determine mode of difference of entries in two columns of a TTree

I have a directory of .root files, each containing a TTree with two branches (call them A and B) holding integer values. Each entry in B is equal to its corresponding entry in A minus some offset. This offset is the same for most, but not all, entries. My goal is to determine the value of this most common offset, which is different for each file.

I’ve tried a couple of different things. First, I read each file into an RDataFrame, iterated over each entry using RDataFrame.Foreach(), stored the differences in a histogram, and then took the mean of this histogram. This was, of course, horribly inefficient.
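Roughly, that looked like this (a simplified sketch, with placeholder tree/file names, an illustrative binning, and the branches assumed to be Int_t):

```cpp
// Attempt 1: fill a histogram of B - A entry by entry with Foreach().
ROOT::RDataFrame d("tree", "file.root"); // placeholder tree and file names
TH1D h("h", "B - A", 201, -100.5, 100.5); // binning chosen for illustration
d.Foreach([&h](int a, int b) { h.Fill(b - a); }, {"A", "B"});
double mean = h.GetMean(); // I then read the mean off this histogram
```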

Second, I used RDataFrame.Take<>() to read each column in as a vector, took the element-wise difference of the two vectors and stored it in a third vector, then found the mode (most common element) of that third vector, more or less following the method described here.
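Sketched out, continuing from the same dataframe (the mode-finding uses a counting map, as in the method I linked):

```cpp
// Attempt 2: Take both columns, diff them, then find the mode of the diffs.
auto aVals = d.Take<int>("A");
auto bVals = d.Take<int>("B");
std::vector<int> diffs(aVals->size());
for (std::size_t i = 0; i < diffs.size(); ++i)
  diffs[i] = (*bVals)[i] - (*aVals)[i];
std::map<int, std::size_t> counts; // count occurrences, then pick the max
for (int v : diffs) ++counts[v];
int mode = std::max_element(counts.begin(), counts.end(),
                            [](const auto &l, const auto &r) { return l.second < r.second; })
               ->first;
```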

Unfortunately, as I’m working with a large number of files, both methods are (impractically) slow and inefficient. I imagine there must be a better way?

Did you try:

tree->Draw("B-A");

or something like:

ROOT::RDataFrame d(treeName, fileName);
// book a histogram of the per-entry difference B - A
auto hist = d.Define("sub", [](int a, int b) { return b - a; }, {"A", "B"})
             .Histo1D("sub");

and look for the bin with the most entries?
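To then read the most common offset off the filled histogram, something along these lines should work (a sketch: it re-books the histogram with one unit-wide bin per integer offset, and the assumed range [-100, 100] is a guess you’d adapt):

```cpp
// With one bin per integer offset, the center of the fullest bin is
// exactly the most common value of B - A.
auto hist = d.Define("sub", [](int a, int b) { return b - a; }, {"A", "B"})
             .Histo1D({"hsub", "B - A", 201, -100.5, 100.5}, "sub");
int mode = static_cast<int>(hist->GetBinCenter(hist->GetMaximumBin()));
```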


I second this suggestion, probably with a Histo1D<double>(...) instead of Histo1D(...) for extra performance (and make sure to compile the program with optimizations, i.e. the -O2 compilation flag, or to run it as root macro.C+, with the plus, to turn on compiler optimizations).
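For instance, the templated booking could look like this (a sketch: the binning is an assumption, and the difference is cast to double here so the column type matches the template argument):

```cpp
// Templated Histo1D: column types are known at compile time, so RDF
// skips the just-in-time compilation that the jitted call needs.
auto hist = d.Define("sub",
                     [](int a, int b) { return static_cast<double>(b - a); },
                     {"A", "B"})
             .Histo1D<double>({"hsub", "B - A", 201, -100.5, 100.5}, "sub");
```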

You can also parallelize over files:

ROOT::EnableImplicitMT();
std::vector<ROOT::RDF::RResultPtr<TH1D>> histos;
std::vector<ROOT::RDF::RResultHandle> histoHandles;
for (const auto &fname : files) {
  ROOT::RDataFrame df(treeName, fname); // one dataframe per file
  histos.push_back(df.Define(...).Histo1D<double>(...)); // same Define/Histo1D as above
  histoHandles.push_back(histos.back()); // keep the result alive and register its handle
}

ROOT::RDF::RunGraphs(histoHandles);

That will start the processing of all files concurrently. If you have many files and memory usage grows too much when processing them all “at the same time”, you can also process them in batches of 32 or so, as sketched below.
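A possible shape for the batched variant (a sketch: `files` and `treeName` are as above, and `bookHisto` is a hypothetical helper wrapping the Define + Histo1D booking):

```cpp
// Process the file list in batches of 32 to bound peak memory usage.
constexpr std::size_t kBatchSize = 32;
for (std::size_t begin = 0; begin < files.size(); begin += kBatchSize) {
  const std::size_t end = std::min(begin + kBatchSize, files.size());
  std::vector<ROOT::RDF::RResultPtr<TH1D>> histos;
  std::vector<ROOT::RDF::RResultHandle> handles;
  for (std::size_t i = begin; i < end; ++i) {
    ROOT::RDataFrame df(treeName, files[i]);
    histos.push_back(bookHisto(df)); // hypothetical helper doing Define + Histo1D
    handles.push_back(histos.back());
  }
  ROOT::RDF::RunGraphs(handles); // run this batch's event loops concurrently
  // ...extract what you need from `histos` before moving to the next batch...
}
```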

Cheers,
Enrico

