Most efficient way to slice TTree in one variable

Hi,

I am trying to open a file which contains one TTree, and then produce slices of this tree in one certain variable (in this case eta) and then save the sliced trees in a new output file. For this, I wrote the following function:

void sliceTreeInEta(TString inputPath, TString outputPath){
  TFile* inputFile = new TFile(inputPath, "READ"); 
  TTree* originalTree = (TTree*)inputFile->Get("TreeName");
  TFile* outputFile = new TFile(outputPath, "RECREATE"); 
  outputFile->cd();
  float eta = 0;
  while(eta<5.0){
    TTree * slicedTree = originalTree -> CopyTree(Form("(eta>-%.2f && eta<-%.2f) || (eta>%.2f && eta<%.2f)", eta+0.05, eta, eta, eta+0.05));
    slicedTree->SetName(Form("%.0f", eta));
    slicedTree->SetTitle(Form("%.0f", eta));
    eta+=0.05;
  }
  std::cout<<"SUCCESS: Slicing terminated!"<<std::endl;
  outputFile -> Write();
  outputFile -> Close();
  inputFile -> Close(); 
}

This works fine, but I noticed that as soon as I have a large number of events this gets terribly slow. In the end I need to produce 100 slices for 1 000 000 events (i.e. each slice will have roughly 10 000 events). Is there a way to make this more efficient?

Hi,
if I understand correctly, you would run over the input tree once per slice.
A better approach would be to only run over the input tree once and produce your 100 slices in one go.
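
To make the single-pass idea concrete without any ROOT dependency, here is a minimal plain C++ sketch. It assumes slices of width 0.05 in |eta| covering 0 to 5, as in the original post. The names sliceIndex and sliceOnce are hypothetical helpers, not ROOT functions. One difference to be aware of: a value exactly on a slice edge (including eta == 0) lands in a slice here, whereas the strict inequalities in the CopyTree selection would drop it.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper (not part of ROOT): map |eta| to a slice index,
// assuming equal-width slices starting at 0. Returns -1 when out of range.
int sliceIndex(double eta, double width = 0.05, int nSlices = 100)
{
   const double q = std::floor(std::fabs(eta) / width);
   if (!(q < nSlices)) // also rejects NaN and infinities
      return -1;
   return static_cast<int>(q);
}

// One pass over all values: each entry is routed to its slice exactly once,
// instead of rescanning the full input once per slice.
std::vector<std::vector<double>> sliceOnce(const std::vector<double> &etas,
                                           double width = 0.05, int nSlices = 100)
{
   std::vector<std::vector<double>> slices(nSlices > 0 ? nSlices : 0);
   for (double eta : etas) {
      const int idx = sliceIndex(eta, width, nSlices);
      if (idx >= 0)
         slices[idx].push_back(eta);
   }
   return slices;
}
```

With 100 slices the work per entry is one division rather than up to 100 selection tests, which is where the speed-up over repeated full scans would come from.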

You can easily do it with RDataFrame but need to write each slice to a different file (have not tested the code, but it should give you an idea):

ROOT::RDataFrame df("tree", "input.root");
ROOT::RDF::RSnapshotOptions opts;
opts.fLazy = true;

for (float eta = 0.f; eta < 5.f; eta += 0.05f) {
  const std::string filter = Form("(eta>-%.2f && eta<-%.2f) || (eta>%.2f && eta<%.2f)", eta+0.05, eta, eta, eta+0.05);
  const std::string out_file = "output_" + std::to_string(eta) + ".root";
  df.Filter(filter).Snapshot("slice", out_file, {}, opts);
}

// this is to actually trigger the event loop, since all Snapshots were marked "lazy"
df.Count().GetValue(); 

If you add ROOT::EnableImplicitMT() at the beginning, RDF actually runs the procedure in parallel on multiple threads.
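
A side note on the loop itself: accumulating a float step (eta += 0.05f) can drift, so the exact number of iterations is not guaranteed. The following plain C++ sketch builds the same filter strings from an integer slice index instead. makeEtaFilters is a hypothetical helper name, and the defaults (100 slices of width 0.05) are assumptions taken from the original post.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Build one filter expression per |eta| slice. The loop runs over an integer
// index and derives both edges from it, so the number of slices is exact and
// does not depend on accumulated floating-point rounding.
std::vector<std::string> makeEtaFilters(int nSlices = 100, double width = 0.05)
{
   std::vector<std::string> filters;
   for (int i = 0; i < nSlices; ++i) {
      const double lo = i * width;       // lower |eta| edge of slice i
      const double hi = (i + 1) * width; // upper |eta| edge of slice i
      char buf[128];
      std::snprintf(buf, sizeof(buf),
                    "(eta>-%.2f && eta<-%.2f) || (eta>%.2f && eta<%.2f)",
                    hi, lo, lo, hi);
      filters.emplace_back(buf);
   }
   return filters;
}
```

Each string could then be passed to df.Filter(...) as in the loop above, and the index i also gives a stable, unique suffix for the output file name. The %.2f format assumes edges that are multiples of 0.01.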

Hope this helps!
Enrico

Hi @eguiraud,

thanks for your answer. I am currently facing some issues in getting RDataFrame into our CMake setup but this definitely looks promising. So I guess to get a single file the best way would be to hadd them at the end, right?

You could do something like that, yes.
Here is a little project of mine that depends on RDF and has CMakeLists that, at least 2 years ago, worked fine. Maybe it can help. If not, feel free to create a little reproducer of your CMake issue and open a new thread on the forum.

Cheers,
Enrico

Hi @eguiraud,

I managed to add it to my project but for

    df.Filter(filter).Snapshot("slice", out_file, {}, opts);

I am getting this error:

 **error:** **static_assert failed "filter expression returns a type that is not convertible to bool"** 
static_assert(std::is_convertible<FilterRet_t, bool>::value, 

any suggestion?

The error means that your filter function does not return a boolean value (or something convertible to a boolean value).

Hi @eguiraud,

I understand, but even if I write something like filter="Var>10", I get exactly the same error. (And I am sure this variable exists in the tree.)

Uhm, so, with ROOT master, I can do this no problem:

root [0] const auto filter = "var > 10";
root [1] ROOT::RDataFrame(10).Define("var", "1").Filter(filter).Snapshot("t", "f.root")
root [2] TFile("f.root").Get<TTree>("t")->Print()
******************************************************************************
*Tree    :t         : t                                                      *
*Entries :        0 : Total =             305 bytes  File  Size =        170 *
*        :          : Tree compression factor =   1.00                       *
******************************************************************************

What am I missing? I.e. can you give me a step-by-step recipe to reproduce the issue?

So I am running this with ROOT 6.18.04:

ROOT::RDataFrame df("FCS_ParametrizationInput","testFile.root");
ROOT::RDF::RSnapshotOptions opts;
opts.fLazy = true;
const std::string filter = "TruthPx>0";
const std::string out_file = "output.root";
df.Filter(filter).Snapshot("slice", out_file, {}, opts);

on this file https://cernbox.cern.ch/index.php/s/b2sThuMBBKCUGvs and I am getting the following output https://cernbox.cern.ch/index.php/s/Khzsb0gxxNa3lHJ

@eguiraud by the way, I also tried with ROOT 6.20 (even though I am forced to use 6.18.04 for now); same problem.

Hi @eguiraud,

I think I found the problem. The elements stored in my TTree are std::vectors (even though only the first element is filled):

How can I translate your code example to work with the vectors? If I do variable[0]> then it doesn’t complain anymore, but it also doesn’t seem to produce the output file.

Hi @mark1,
ok if you are Filtering on vector branches this explains the error:
In RDataFrame, all arrays and vectors can be read in as RVec<T>. RVecs are vector-like types that also offer a number of useful features. For instance, when vec is an RVec, vec > 0 returns an RVec with 1 at the positions where the condition is satisfied and 0 elsewhere. This is useful, e.g., to quickly select certain entries of an RVec with vec[vec > 0].

So indeed vec > 0, as a Filter expression, is not convertible to bool.
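
The distinction can be illustrated with a plain C++ stand-in that uses std::vector instead of RVec. This is only an analogy for the behaviour described above, not ROOT code; greaterThan and firstElementAbove are hypothetical names.

```cpp
#include <algorithm>
#include <vector>

// Stand-in for an elementwise comparison: returns a 0/1 mask with one entry
// per element, analogous to what is described above for `vec > 0` on an RVec.
// A mask like this is a vector, so it cannot serve as a yes/no event filter.
std::vector<int> greaterThan(const std::vector<float> &vec, float threshold)
{
   std::vector<int> mask(vec.size());
   std::transform(vec.begin(), vec.end(), mask.begin(),
                  [threshold](float x) { return x > threshold ? 1 : 0; });
   return mask;
}

// A per-event filter needs a single bool. For vectors that hold exactly one
// meaningful element, testing the first element (and guarding against empty
// vectors) gives that bool.
bool firstElementAbove(const std::vector<float> &vec, float threshold)
{
   return !vec.empty() && vec[0] > threshold;
}
```

The second function corresponds to writing an expression on element [0] of the branch, which yields one boolean per event.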

Do you maybe want something like:

df.Define("filtered_vec", "vec[vec > 0.5]").Snapshot("slice", out_file, {"filtered_vec"}, opts);

?

(note that, for now, you cannot overwrite existing branches with new Definitions in RDF)

Hi @eguiraud,
thanks for the answer, but I am more confused now than before. The line you sent me produces one TTree which has the filtered vector (e.g. TruthPx), but this is not what I want. I need the whole original TTree to be filtered. As I said, the vectors only contain one element, so it essentially should just throw out events with TruthPx[0]<10, for instance, keeping all other branches and thus mimicking the CopyTree method from the original post. I tried this using your original example:

which seems to run fine but doesn’t produce any output file.

Ah sorry, I missed that your arrays only have one element. Then "TruthPx[0] > 0" is absolutely fine.

doesn’t produce any output file

Ah, that’s my fault, sorry! The Snapshot action is unregistered from the RDataFrame when its return value goes out of scope. I’m cooking up a working example.

EDIT: by the way, a Snapshot action that goes out of scope without being triggered should at least print a warning
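
To illustrate why a discarded return value means no output, here is a toy model in plain C++. It is emphatically not ROOT's implementation: ToyLoop and book are hypothetical names, and the use of shared_ptr and weak_ptr is an assumption chosen only to mimic the behaviour described above, where a booked action is dropped once its handle goes out of scope.

```cpp
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Toy model only, NOT ROOT's implementation: the loop keeps weak references to
// booked actions, and the handle returned by book() is what keeps one alive.
class ToyLoop {
public:
   using Action = std::function<void()>;

   // Book an action lazily; nothing runs until run() is called.
   std::shared_ptr<Action> book(Action a)
   {
      auto handle = std::make_shared<Action>(std::move(a));
      fBooked.emplace_back(handle);
      return handle;
   }

   // Run every action whose handle still exists; return how many ran.
   int run()
   {
      int nRun = 0;
      for (auto &weak : fBooked) {
         if (auto action = weak.lock()) {
            (*action)();
            ++nRun;
         }
      }
      return nRun;
   }

private:
   std::vector<std::weak_ptr<Action>> fBooked;
};
```

In this model, calling loop.book(...) and ignoring the result means the action is gone before run() is called, while storing each handle in a container keeps every booked action alive until the loop runs.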

As you see in the screenshot, I don’t get any warning whatsoever

As you see in the screenshot, I don’t get any warning whatsoever

Yes I saw, I will have to look into that!

In the meantime, this should work:

#include <ROOT/RDataFrame.hxx>
#include <TString.h> // Form
#include <string>
#include <vector>
#include <iostream>

int main()
{
   // ROOT::RDataFrame df("tree", "input.root");
   // simulate an RDF
   auto df = ROOT::RDataFrame(10).Define("eta", [] { return 1.f; });

   ROOT::RDF::RSnapshotOptions opts;
   opts.fLazy = true;

   using SnapshottedDF = ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager>>;
   std::vector<SnapshottedDF> snapshots;

   for (float eta = 0.f; eta < 0.15f; eta += 0.05f) {
     const std::string filter = Form("(eta>-%.2f && eta<-%.2f) || (eta>%.2f && eta<%.2f)", eta+0.05, eta, eta, eta+0.05);
     const std::string out_file = "output_" + std::to_string(eta) + ".root";
     auto s = df.Filter(filter).Snapshot("slice", out_file, ".*", opts);
     snapshots.emplace_back(s);
     std::cout << "done one" << std::endl;
   }

   // trigger event loop
   std::cout << df.Count().GetValue() << std::endl;

   return 0;
}

Hi @eguiraud,

thanks for the example! However, I am getting this error:

  no matching conversion for C-style cast from 'long long' to 'ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager, void> >'

I was compiling and running that code correctly, can you share a recipe to reproduce the issue?

Yes, if I use the simulated df it works fine but as soon as I use my own file (see link before) I get these errors. Also, it should be eta[0] in my case, right? (produces same error)

I’m afraid I can’t tell where the error is coming from just from what you posted. Can you post a reproducer?