Sample splitting using RDataFrame not working for trees that contain vector branches

Hi Experts,

I am trying to split a ROOT file that contains vector branches and dictionaries using RDataFrame, and it isn’t working. The same approach worked well when I tried it with floats and doubles. Please let me know if something more needs to be added for the vectors or dictionaries. I also tried setting the branch status of the dictionary to "OFF", but it still doesn’t work (so it doesn’t work for vectors either). I am attaching the file and code.

void split_filter()
{
  auto oldfile1 = TFile::Open("flatntuple_MC_.root");
  TTree *t1 = (TTree *)oldfile1->Get("ntuplizer/tree");
  Int_t nEntries1;
  nEntries1 = t1->GetEntries();
  ROOT::RDataFrame df1("ntuplizer/tree", "flatntuple_MC_.root");
  df1.Range(nEntries1 / 3).Snapshot("events0", "flatntuple_MC_v4_custom_PU_full_final_1.root");
  df1.Range(nEntries1 / 3, 2 * nEntries1 / 3).Snapshot("events1", "flatntuple_MC_v4_custom_PU_full_final_2.root");
  df1.Range(2 * nEntries1 / 3, nEntries1).Snapshot("events2", "flatntuple_MC_v4_custom_PU_full_final_3.root");
}

flatntuple_MC_.root (382.4 KB)

Thanks in advance!
Priyanka




Hi @psadangi,

I am sure @eguiraud can give you some hints here. Alternatively, you can make manual, direct use of the TTree API to split your tree based on the given entry ranges.
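In case it helps, here is a minimal sketch of how the entry ranges for such a split could be computed, whichever API ends up doing the copying. SplitEntryRanges is a hypothetical helper written for illustration, not a ROOT function; it uses the same integer arithmetic as the macro above (nEntries / 3, 2 * nEntries / 3, ...) and returns half-open ranges [begin, end).

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not part of ROOT): split the entries [0, nEntries)
// into `parts` contiguous half-open ranges [begin, end).
std::vector<std::pair<std::int64_t, std::int64_t>>
SplitEntryRanges(std::int64_t nEntries, int parts)
{
  assert(nEntries >= 0 && parts > 0); // precondition of this sketch

  std::vector<std::pair<std::int64_t, std::int64_t>> ranges;
  ranges.reserve(parts);
  for (int i = 0; i < parts; ++i) {
    // Integer division: the remainder is spread over the later ranges,
    // and the last range always ends exactly at nEntries.
    const std::int64_t begin = i * nEntries / parts;
    const std::int64_t end = (i + 1) * nEntries / parts;
    ranges.emplace_back(begin, end);
  }
  return ranges;
}
```

Each (begin, end) pair corresponds to the arguments of a Range(begin, end) call in the macro. Consecutive ranges share their boundary, so no entry is skipped or copied twice.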

Cheers,
J.

Hi @psadangi ,
thank you for the report and for providing a complete reproducer. I see the crash with today’s master as well, investigating…

Hi @psadangi ,
I think the problem is the "HLT_BPH_isFired" branch, which is a map<string, bool> for which ROOT does not have dictionaries by default (like it does for vector<int>, vector<float> etc.).

Cause

This code, which depends only on TTreeReader (the interface RDataFrame uses for reading data under the hood), reproduces the problem:

void repro_treereader() {
  TFile f("flatntuple_MC_.root");
  auto *t = f.Get<TTree>("ntuplizer/tree");
  R__ASSERT(t != nullptr);
  TTreeReader r(t);
  TTreeReaderValue<std::map<std::string, bool>> rv(r, "HLT_BPH_isFired");
  r.Next();
  *rv;
}

Doing the same with TTree directly instead of TTreeReader actually provides a better error message (and the lack of clear errors using TTreeReader is a bug, I’ll open an issue):

void repro_tree() {
  TFile f("flatntuple_MC_.root");
  auto *t = f.Get<TTree>("ntuplizer/tree");
  R__ASSERT(t != nullptr);
  std::map<std::string, bool> *m = nullptr;
  t->SetBranchAddress("HLT_BPH_isFired", &m);
  t->GetEntry(0);
}
$ root -l repro_tree.C
root [0]
Processing repro_tree.C...
Error in <TTree::SetBranchAddress>: The class requested (map<string,bool>) for the branch "HLT_BPH_isFired" is an instance of an stl collection and does not have a compiled CollectionProxy. Please generate the dictionary for this collection (map<string,bool>) to avoid to write corrupted data.

 *** Break *** segmentation violation

So we need to generate dictionaries for map<string, bool> in order to read that branch correctly.

Solution

This version of your original code should work when invoked as root -l -b -q original_repro.C+ (note the +, which compiles your code into a shared library that includes the necessary dictionaries):

#include <TFile.h>
#include <TTree.h>
#include <ROOT/RDataFrame.hxx>

#pragma link C++ class std::map<std::string, bool>;

void original_repro() {
  auto oldfile1 = TFile::Open("flatntuple_MC_.root");
  TTree *t1 = (TTree *)oldfile1->Get("ntuplizer/tree");
  Int_t nEntries1;
  nEntries1 = t1->GetEntries();
  ROOT::RDataFrame df1("ntuplizer/tree", "flatntuple_MC_.root");
  df1.Range(nEntries1 / 3)
      .Snapshot("events0", "flatntuple_MC_v4_custom_PU_full_final_1.root");
  df1.Range(nEntries1 / 3, 2 * nEntries1 / 3)
      .Snapshot("events1", "flatntuple_MC_v4_custom_PU_full_final_2.root");
  df1.Range(2 * nEntries1 / 3, nEntries1)
      .Snapshot("events2", "flatntuple_MC_v4_custom_PU_full_final_3.root");
}

For other ways to generate dictionaries, see the "I/O of custom classes" page on the ROOT website.

Cheers,
Enrico

Hi @eguiraud,

Many thanks for your detailed explanation. The map branch isn’t the only cause, though. The same problem occurs for some branches that are nested vectors (vector<vector> or so). The snapshot is attached. I tried adding "#pragma link C++ class vector<vector< int > >+" for these branches, but it didn’t work.

Thanks
Priyanka

Right, sorry! With v6.26, I also needed:

#pragma link C++ class std::vector<std::vector<int>>;
#pragma link C++ class ROOT::RVec<std::vector<int>>;

So the following script ran without errors:

#include <TFile.h>
#include <TTree.h>
#include <ROOT/RDataFrame.hxx>

#pragma link C++ class std::map<std::string, bool>;
#pragma link C++ class std::vector<std::vector<int>>;
#pragma link C++ class ROOT::RVec<std::vector<int>>;

void fix() {
  auto oldfile1 = TFile::Open("flatntuple_MC_.root");
  TTree *t1 = (TTree *)oldfile1->Get("ntuplizer/tree");
  Int_t nEntries1;
  nEntries1 = t1->GetEntries();
  ROOT::RDataFrame df1("ntuplizer/tree", "flatntuple_MC_.root");
  df1.Range(nEntries1 / 3)
      .Snapshot("events0", "flatntuple_MC_v4_custom_PU_full_final_1.root");
  df1.Range(nEntries1 / 3, 2 * nEntries1 / 3)
      .Snapshot("events1", "flatntuple_MC_v4_custom_PU_full_final_2.root");
  df1.Range(2 * nEntries1 / 3, nEntries1)
      .Snapshot("events2", "flatntuple_MC_v4_custom_PU_full_final_3.root");
}
$ root --version
ROOT Version: 6.26/00
Built for linuxx8664gcc on Mar 05 2022, 12:03:00
From @
$ root -l -b -q fix.C+

Processing fix.C+...
$ # all good

If that does not work, please specify your ROOT version and the ROOT version with which the input file was produced.

Cheers,
Enrico

Hi @eguiraud ,

This doesn’t work for me! The version of ROOT I’m using is 6.24/02.
Built for linuxx8664gcc on Jul 03 2021, 08:02:00
From @

But the file was created with version 6.12/07! Could that be the reason?

Thanks
Priyanka

It should be just a matter of adding the relevant #pragma link C++ lines. I tried with v6.24 installed as a conda package, and this works:

#include <TFile.h>
#include <TTree.h>
#include <ROOT/RDataFrame.hxx>

#pragma link C++ class std::map<std::string, bool>;
#pragma link C++ class std::vector<std::vector<int>>;
#pragma link C++ class ROOT::RVec<std::vector<int>>;
#pragma link C++ class vector<vector<int>,ROOT::Detail::VecOps::RAdoptAllocator<vector<int>>>+;

void fix() {
  auto oldfile1 = TFile::Open("flatntuple_MC_.root");
  TTree *t1 = (TTree *)oldfile1->Get("ntuplizer/tree");
  Int_t nEntries1;
  nEntries1 = t1->GetEntries();
  ROOT::RDataFrame df1("ntuplizer/tree", "flatntuple_MC_.root");
  df1.Range(nEntries1 / 3)
      .Snapshot("events0", "flatntuple_MC_v4_custom_PU_full_final_1.root");
  df1.Range(nEntries1 / 3, 2 * nEntries1 / 3)
      .Snapshot("events1", "flatntuple_MC_v4_custom_PU_full_final_2.root");
  df1.Range(2 * nEntries1 / 3, nEntries1)
      .Snapshot("events2", "flatntuple_MC_v4_custom_PU_full_final_3.root");
}

But I would strongly suggest just upgrading to v6.26 (possibly v6.26.02, coming out in a few days), where the last pragma is not needed because we streamlined the I/O of arrays in RDataFrame.

Cheers,
Enrico

That worked, thanks. I’ll upgrade to the latest version.

Priyanka
