RDataFrame in conjunction with TEntryLists: creating a "preselection"

Dear all,

while playing around with RDataFrame I noticed a strange behavior in conjunction with “preselected chains”:

Let’s say I have different TFiles with different TTrees in it and I want to apply some coarse cuts that depend on the actual file/tree pair.

I thought of attaching a combined TEntryList to the TChain before wrapping the chain itself in an RDataFrame and proceeding with the analysis. However, the dataframe does not seem to honor the preselection encoded in the entry lists.

To demonstrate this, I borrowed from @eguiraud’s test that covers a similar situation:

But when I alter the way the entry lists are generated, only e values from the first file seem to be selected (note the offset parameter):

void MakeInputFile(const std::string &filename, int nEntries, int offset=0)
{
   const auto treename = "t";
   auto d = ROOT::RDataFrame(nEntries)
               .Define("e", [offset](ULong64_t e) { return int(e+offset); }, {"rdfentry_"})
               .Snapshot<int>(treename, filename, {"e"});
}

void TestChainWithEntryList()
{
   const auto nEntries = 10;
   const auto treename = "t";
   const auto file1 = "rdfentrylist1.root";
   MakeInputFile(file1, nEntries);
   const auto file2 = "rdfentrylist2.root";
   MakeInputFile(file2, nEntries, 100);

   /* Preselect events by classic TTree::Draw method */
   auto f1 = TFile::Open(file1);
   auto t1 = f1->Get<TTree>(treename);
   gROOT->cd();
   t1->Draw(">>elist1", "e%2==0", "entrylist");
   f1->Close();

   auto f2 = TFile::Open(file2);
   auto t2 = f2->Get<TTree>(treename);
   gROOT->cd();
   t2->Draw(">>elist2", "e%2==0", "entrylist");
   f2->Close();


   // make a TEntryList that contains two TEntryLists in its list of TEntryLists,
   // as required by TChain (see TEntryList's doc)
   TEntryList elists;
   elists.Add(gROOT->Get<TEntryList>("elist1"));
   elists.Add(gROOT->Get<TEntryList>("elist2"));

   TChain c(treename);
   c.Add(file1, nEntries);
   c.Add(file2, nEntries);
   c.SetEntryList(&elists);

   auto entries = ROOT::RDataFrame(c).Take<int>("e");


   /* List all the entries gathered by RDataFrame... */
   for (const auto& e : *entries) {
      std::cout << e << " ";
   }
   std::cout << std::endl;


   /* In contrast, TChain::Scan honors the entry lists */
   c.Scan("e");


   gSystem->Unlink(file1);
   gSystem->Unlink(file2);
}


void test() {
   std::cout << gROOT->GetVersion() << std::endl;
   TestChainWithEntryList();
}

My macro produces the following output, with different results for RDataFrame and TChain::Scan:

6.20/04
0 2 4 6 8 0 2 4 6 8 
************************
*    Row   *         e *
************************
*        0 *         0 *
*        2 *         2 *
*        4 *         4 *
*        6 *         6 *
*        8 *         8 *
*       10 *       100 *
*       12 *       102 *
*       14 *       104 *
*       16 *       106 *
*       18 *       108 *
************************

I am not sure whether my handling of the entry lists is correct, but as the Scan method delivers the expected results, I rather suspect a bug in the RDataFrame/TTreeReader framework.

In general I am not very satisfied with the way the entry lists are produced in my macro (RDataFrame is far better than ancient TTree::Draw :smiley:). But I did not find any other way to apply Filters based solely on the file (name) storing a particular tree. Or is it somehow possible to chain dataframes (with initial filters already applied) instead of trees? Or did I miss something and there is a far better approach?

Many thanks in advance,
Philipp

Hi Philipp,
thank you for the self-contained reproducer! It looks like a bug, I’ll let you know more as soon as possible.

Cheers,
Enrico

Hi,
I can reproduce the problem with TTreeReader only, no RDF. This is now ROOT-10753, I will take care of it asap.

Thanks for the report!
Enrico

Hey Enrico,
thanks a lot for looking into this.

Just for clarification: does this mean that RDF is still affected, as it is backed by TTreeReader? If not, it would be me who is doing something wrong and I could try to work around that.

And do you know of any alternative ways to implement tree-specific cuts within the RDataFrame context?

Hi Philipp,
indeed, RDF is still affected.

There is a workaround: you can attach a TEntryList with global entry numbers to the TChain (which is not what the TEntryList docs say, but again, it is a workaround). If I understand the problem correctly, TTreeReader interprets the entries in the TEntryList as global entry numbers, and RDataFrame just constructs a TTreeReader(chain, chain->GetEntryList()).

Here is a playground with the workaround implemented directly for TTreeReader (no RDF, but the extension to RDF should be as above):

#include <TFile.h>
#include <TTree.h>
#include <TChain.h>
#include <TEntryList.h>
#include <TTreeReader.h>
#include <TTreeReaderValue.h>
#include <iostream>

void MakeInputFiles()
{
   int e = 0;
   for (int i = 1; i <= 2; ++i) {
      TFile f(("f" + std::to_string(i) + ".root").c_str(), "recreate");
      TTree t("t", "t");
      t.Branch("e", &e);
      for (int j = 0; j < 3; ++j) {
         t.Fill();
         ++e;
      }
      t.Write();
      f.Close();
   }
}

int main()
{
   // files "f{1,2}.root" with TTree "t" values "e" = {0,1,2} and {3,4,5}
   MakeInputFiles();

   TEntryList elist1("e", "e", "t", "f1.root");
   elist1.Enter(0);
   elist1.Enter(2);
   TEntryList elist2("e", "e", "t", "f2.root");
   elist2.Enter(0);
   elist2.Enter(2);

   // make a TEntryList that contains two TEntryLists in its list of TEntryLists,
   // as required by TChain (see TEntryList's doc)
   TEntryList elists;
   elists.Add(&elist1);
   elists.Add(&elist2);

   TEntryList elistWithGlobalEntries;
   elistWithGlobalEntries.Enter(0);
   elistWithGlobalEntries.Enter(2);
   elistWithGlobalEntries.Enter(3);
   elistWithGlobalEntries.Enter(5);

   TChain c("t");
   c.Add("f1.root", 3);
   c.Add("f2.root", 3);
   c.SetEntryList(&elists);

   TTreeReader r(&c, &elistWithGlobalEntries);
   TTreeReaderValue<int> e(r, "e");
   while (r.Next())
      std::cout << *e << " ";
   std::cout << std::endl;

   c.Scan("e");

   return 0;
}

Cheers,
Enrico

Thank you once more, I am glad to get some support from the experts :slight_smile:

However, in your example one somehow has to translate local entry numbers (in elist{1,2}) to global ones, which presumably involves computing an offset from the preceding n-1 trees
(is it as easy as: offset = tree_0.GetEntries() + ... + tree_n-1.GetEntries()?).


In my particular use case I have hopefully found another solution: I have a branch eventHash in my chain, which could be used to identify the different sub-trees of the chain and to perform these specific cuts.

Unfortunately the trees are produced with ROOT 6.16 and eventHash is of type ULong64_t, which means that I am hit by this one here. But I can change that (either the branch type or the ROOT version).


I have one more question concerning RDataFrame when accessed from PyROOT:
Not all of the functionality is available within the Python environment. Therefore I thought about defining parts of the processing chain on the C++ side and circumventing the template instantiation problem by relying on runtime polymorphism and the RNode type.
Sadly, this does not seem to work:

import ROOT as R
R.gInterpreter.ProcessLine("""
void list_rdf_column(ROOT::RDF::RNode node) {
    auto res = node.Take<ULong64_t>("rdfentry_");
    for (const auto& e : *res) {
        std::cout << e << std::endl;
    }
}
""")

df = R.ROOT.RDataFrame(MYCHAIN).Filter("rdfentry_ < 10")
R.list_rdf_column(df)

gives:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-e3f2b2ed4663> in <module>
      1 df = R.ROOT.RDataFrame(chain).Filter("rdfentry_ < 10")
----> 2 R.list_rdf_column(df)
      3 
      4 
      5 

TypeError: void ::list_rdf_column(ROOT::RDF::RInterface<ROOT::Detail::RDF::RNodeBase,void> node) =>
    could not convert argument 1

Exchanging the last line with:
R.list_rdf_column(R.ROOT.RDF.RNode(df))
does not help:

TypeError                                 Traceback (most recent call last)
<ipython-input-6-f1383542854e> in <module>
     20 
     21 
---> 22 R.list_rdf_column(R.ROOT.RDF.RNode(df))
     23 
     24 

TypeError: none of the 2 overloaded methods succeeded. Full details:
  ROOT::RDF::RInterface<ROOT::Detail::RDF::RNodeBase,void>::ROOT::RDF::RInterface<ROOT::Detail::RDF::RNodeBase,void>(const ROOT::RDF::RInterface<ROOT::Detail::RDF::RNodeBase,void>&) =>
    could not convert argument 1
  ROOT::RDF::RInterface<ROOT::Detail::RDF::RNodeBase,void>::ROOT::RDF::RInterface<ROOT::Detail::RDF::RNodeBase,void>(ROOT::RDF::RInterface<ROOT::Detail::RDF::RNodeBase,void>&&) =>
    could not convert argument 1 (this method can not (yet) be called)

Do you by any chance have a clue how this can be achieved?

Basically, yes: given that the order of the trees in the TChain is fixed, you know the offset of each TTree entry, and you can derive the global entry number of an event in the TChain given its local entry number in the TTree it pertains to.
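To make the offset arithmetic concrete, here is a minimal sketch in plain C++ (no ROOT involved). The function name ToGlobalEntries and its two arguments are made up for illustration: treeSizes holds the number of entries of each tree in the order the trees were added to the chain, and localEntries holds the selected local entry numbers per tree.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: translate per-tree (local) entry numbers into
// chain-wide (global) entry numbers.
// treeSizes[i]    = number of entries of the i-th tree in the chain
// localEntries[i] = selected local entry numbers of the i-th tree
std::vector<std::int64_t>
ToGlobalEntries(const std::vector<std::int64_t> &treeSizes,
                const std::vector<std::vector<std::int64_t>> &localEntries)
{
   assert(treeSizes.size() == localEntries.size());

   std::vector<std::int64_t> globalEntries;
   std::int64_t offset = 0; // sum of the sizes of all preceding trees
   for (std::size_t i = 0; i < treeSizes.size(); ++i) {
      for (const auto local : localEntries[i])
         globalEntries.push_back(offset + local);
      offset += treeSizes[i];
   }
   return globalEntries;
}
```

For the playground above (two trees with 3 entries each, local entries 0 and 2 selected in both), ToGlobalEntries({3, 3}, {{0, 2}, {0, 2}}) gives 0, 2, 3, 5, which are the numbers entered by hand into elistWithGlobalEntries. In a real chain the sizes would have to come from the individual trees, in chain order.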

Definitely upgrade to v6.20 if you can, RDataFrame got a lot better since (and it’s even better in v6.22, which is coming out soon).

There is a ROOT.RDF.AsRNode helper function that you can use to convert any RDF node to the generic RNode type. If your ROOT version does not have it, the implementation is straightforward:

import ROOT as R
R.gInterpreter.Declare("""
template <typename NodeType>
ROOT::RDF::RNode AsRNode(NodeType node)
{
   return node;
}

void list_rdf_column(ROOT::RDF::RNode node) {
    auto res = node.Take<ULong64_t>("rdfentry_");
    for (const auto& e : *res) {
        std::cout << e << std::endl;
    }
}
""")

df = R.ROOT.RDataFrame(1).Filter("rdfentry_ < 10")
R.list_rdf_column(R.AsRNode(df))

Hope this helps!
Enrico

There is a ROOT.RDF.AsRNode helper function that you can use to convert any RDF node to the generic RNode type.

Yes! Nice one. Thanks a lot.

Maybe one should change the documentation of RNodeBase, which still refers to a function called ROOT::RDF::ToCommonNodeType (a name that I could not find anywhere). Or even make it a static function of RNodeBase?

But anyhow, in my test case it worked flawlessly.

Sorry for asking all these questions, but I really appreciate the direction ROOT is going with RDataFrame. :+1:
I have been desperately missing concepts like that for years…

Cheers,
Philipp

Indeed, the docs need fixing, thanks for pointing that out.

Cheers,
Enrico

Hi @phi,
just so you know, ROOT-10753 is fixed in master and the fix will be available in the upcoming v6.22.

Cheers,
Enrico

FYI, docs will soon be fixed in master. Thanks again!