Sample splitting using RDataFrame not working for trees which contain vector branches

Hi Experts,

I am trying to split a ROOT file using RDataFrame. The tree contains vector branches and dictionaries, and the splitting isn't working. When I tried the same with float and double branches, it worked well. Please let me know if something more needs to be added for the vectors or dictionaries. I also tried setting the branch status of the dictionary branch to "OFF", but it still doesn't work (so it doesn't work for vectors either). I am attaching the file and code.

void split_filter()
{
auto oldfile1 = TFile::Open("flatntuple_MC_.root");

TTree *t1=(TTree*)oldfile1->Get("ntuplizer/tree");

Int_t nEntries1;
nEntries1 = t1->GetEntries();

ROOT::RDataFrame df1 ("ntuplizer/tree", "flatntuple_MC_.root");

df1.Range(nEntries1 / 3).Snapshot("events0", "flatntuple_MC_v4_custom_PU_full_final_1.root");

df1.Range(nEntries1 / 3, 2*nEntries1/3 ).Snapshot("events1", "flatntuple_MC_v4_custom_PU_full_final_2.root");

df1.Range(2*nEntries1 / 3, nEntries1).Snapshot("events2", "flatntuple_MC_v4_custom_PU_full_final_3.root");
}

flatntuple_MC_.root (382.4 KB)

Thanks in advance!

ROOT Version: Not Provided
Platform: Not Provided
Compiler: Not Provided

Hi @psadangi,

I am sure @eguiraud can give you some hints here. Alternatively, you can make manual, direct use of the TTree API to split your tree based on the given entry ranges.
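Whichever API does the splitting, the entry ranges themselves are plain integer arithmetic. Below is a minimal sketch of that arithmetic, assuming contiguous, half-open ranges with the same i * n / parts boundaries as the Range() calls in the original post. The helper name splitRanges is hypothetical and not part of ROOT.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper (not a ROOT function): returns the half-open entry
// ranges [begin, end) that split nEntries into `parts` contiguous chunks.
// Boundaries are i * nEntries / parts, as in the Range() calls of the post.
std::vector<std::pair<long long, long long>> splitRanges(long long nEntries, int parts)
{
   assert(nEntries >= 0 && parts > 0);
   std::vector<std::pair<long long, long long>> ranges;
   ranges.reserve(parts);
   for (int i = 0; i < parts; ++i) {
      // Integer division; the last chunk absorbs any remainder.
      const long long begin = i * nEntries / parts;
      const long long end = (i + 1) * nEntries / parts;
      ranges.emplace_back(begin, end);
   }
   return ranges;
}
```

Each pair maps onto the begin and end arguments of Range(). With the TTree API it could instead be passed to TTree::CopyTree as a first entry (begin) and an entry count (end - begin). Using a 64-bit type here is a deliberate choice: TTree::GetEntries() returns a Long64_t, and an expression like 2 * nEntries1 / 3 can overflow an Int_t for very large trees.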


Hi @psadangi ,
thank you for the report and for providing a complete reproducer. I see the crash with today’s master as well, investigating…

Hi @psadangi ,
I think the problem is the "HLT_BPH_isFired" branch, which is a map<string, bool> for which ROOT does not have dictionaries by default (like it does for vector<int>, vector<float> etc.).


This code that only depends on TTreeReader (the interface RDataFrame uses for reading data under the hood) reproduces the problem:

void repro_treereader() {
  TFile f("flatntuple_MC_.root");
  auto *t = f.Get<TTree>("ntuplizer/tree");
  R__ASSERT(t != nullptr);
  TTreeReader r(t);
  TTreeReaderValue<std::map<std::string, bool>> rv(r, "HLT_BPH_isFired");
  r.Next(); // attempt to load the first entry
}

Doing the same with TTree directly instead of TTreeReader actually provides a better error message (and the lack of clear errors using TTreeReader is a bug, I’ll open an issue):

void repro_tree() {
  TFile f("flatntuple_MC_.root");
  auto *t = f.Get<TTree>("ntuplizer/tree");
  R__ASSERT(t != nullptr);
  std::map<std::string, bool> *m = nullptr;
  t->SetBranchAddress("HLT_BPH_isFired", &m);
  t->GetEntry(0); // attempt to read the first entry
}

$ root -l repro_tree.C
root [0]
Processing repro_tree.C...
Error in <TTree::SetBranchAddress>: The class requested (map<string,bool>) for the branch "HLT_BPH_isFired" is an instance of an stl collection and does not have a compiled CollectionProxy. Please generate the dictionary for this collection (map<string,bool>) to avoid to write corrupted data.

 *** Break *** segmentation violation

So we need to generate dictionaries for map<string, bool> in order to read that branch correctly.


This version of your original code should work when invoked as root -l -b -q original_repro.C+ (note the +, which compiles your code into a shared library that includes the necessary dictionaries):

#include <TFile.h>
#include <TTree.h>
#include <ROOT/RDataFrame.hxx>

#pragma link C++ class std::map<std::string, bool>;

void original_repro() {
  auto oldfile1 = TFile::Open("flatntuple_MC_.root");
  TTree *t1 = (TTree *)oldfile1->Get("ntuplizer/tree");
  Int_t nEntries1;
  nEntries1 = t1->GetEntries();
  ROOT::RDataFrame df1("ntuplizer/tree", "flatntuple_MC_.root");
  df1.Range(nEntries1 / 3)
      .Snapshot("events0", "flatntuple_MC_v4_custom_PU_full_final_1.root");
  df1.Range(nEntries1 / 3, 2 * nEntries1 / 3)
      .Snapshot("events1", "flatntuple_MC_v4_custom_PU_full_final_2.root");
  df1.Range(2 * nEntries1 / 3, nEntries1)
      .Snapshot("events2", "flatntuple_MC_v4_custom_PU_full_final_3.root");
}

For other ways to generate dictionaries see I/O of custom classes - ROOT .


Hi @eguiraud,

Many thanks for your detailed explanation. The problem isn't caused by the map branch alone, though. The same happens for some branches of nested vector types such as vector<vector<int>>. A screenshot is attached. Even though I tried adding " #pragma link C++ class vector<vector< int > >+ " for these branches, it didn't work.


Right, sorry! With v6.26, I also needed:

#pragma link C++ class std::vector<std::vector<int>>;
#pragma link C++ class ROOT::RVec<std::vector<int>>;

So the following script ran without errors:

#include <TFile.h>
#include <TTree.h>
#include <ROOT/RDataFrame.hxx>

#pragma link C++ class std::map<std::string, bool>;
#pragma link C++ class std::vector<std::vector<int>>;
#pragma link C++ class ROOT::RVec<std::vector<int>>;

void fix() {
  auto oldfile1 = TFile::Open("flatntuple_MC_.root");
  TTree *t1 = (TTree *)oldfile1->Get("ntuplizer/tree");
  Int_t nEntries1;
  nEntries1 = t1->GetEntries();
  ROOT::RDataFrame df1("ntuplizer/tree", "flatntuple_MC_.root");
  df1.Range(nEntries1 / 3)
      .Snapshot("events0", "flatntuple_MC_v4_custom_PU_full_final_1.root");
  df1.Range(nEntries1 / 3, 2 * nEntries1 / 3)
      .Snapshot("events1", "flatntuple_MC_v4_custom_PU_full_final_2.root");
  df1.Range(2 * nEntries1 / 3, nEntries1)
      .Snapshot("events2", "flatntuple_MC_v4_custom_PU_full_final_3.root");
}

$ root --version
ROOT Version: 6.26/00
Built for linuxx8664gcc on Mar 05 2022, 12:03:00
From @
$ root -l -b -q fix.C+  

Processing fix.C+...
$ # all good

If that does not work, please specify your ROOT version and the ROOT version with which the input file was produced.


Hi @eguiraud ,

This doesn’t work for me! The version of ROOT I’m using is 6.24/02.
Built for linuxx8664gcc on Jul 03 2021, 08:02:00
From @

But the file was created with version 6.12/07! Could that be the reason?


It should be just a matter of adding the relevant #pragma link C++ lines. I tried with v6.24 installed as a conda package, and this works:

#include <TFile.h>
#include <TTree.h>
#include <ROOT/RDataFrame.hxx>

#pragma link C++ class std::map<std::string, bool>;
#pragma link C++ class std::vector<std::vector<int>>;
#pragma link C++ class ROOT::RVec<std::vector<int>>;
#pragma link C++ class vector<vector<int>,ROOT::Detail::VecOps::RAdoptAllocator<vector<int>>>+;

void fix() {
  auto oldfile1 = TFile::Open("flatntuple_MC_.root");
  TTree *t1 = (TTree *)oldfile1->Get("ntuplizer/tree");
  Int_t nEntries1;
  nEntries1 = t1->GetEntries();
  ROOT::RDataFrame df1("ntuplizer/tree", "flatntuple_MC_.root");
  df1.Range(nEntries1 / 3)
      .Snapshot("events0", "flatntuple_MC_v4_custom_PU_full_final_1.root");
  df1.Range(nEntries1 / 3, 2 * nEntries1 / 3)
      .Snapshot("events1", "flatntuple_MC_v4_custom_PU_full_final_2.root");
  df1.Range(2 * nEntries1 / 3, nEntries1)
      .Snapshot("events2", "flatntuple_MC_v4_custom_PU_full_final_3.root");
}

But I would strongly suggest simply upgrading to v6.26 (possibly v6.26.02, coming out in a few days), where the last pragma is not needed because we streamlined the I/O of arrays in RDataFrame.


That worked, thanks. I’ll upgrade to the latest version.

