Hello!
I have a ROOT file and need to process data from it. I have tried three approaches (source code below). The one using TTreeReaderValue works, but I want to use RDataFrame for its parallel processing. However, it seems that RDataFrame can't read an RVec<vector<double>> built from a TClonesArray of objects that have a member of type vector<double>.
Here is a minimal reproducer.
using namespace std;
// very simplified classes from real project
// I can't change source code of it without huge refactoring
class Event : public TObject
{
public:
TClonesArray tracks;
Event() : tracks{"Track", 5} {};
ClassDef(Event, 1);
};
ClassImp(Event);
class Track : public TObject
{
public:
vector<double> hitEnergies;
Track(const vector<double>& v = {}) : hitEnergies{v} {};
ClassDef(Track, 1);
};
ClassImp(Track);
// fill the tree with events, each with a random number of tracks and a random number of hits per track
void write() {
TFile *f1 = new TFile(TString("events.root"), "recreate");
TTree *t1 = new TTree("Events", "events");
Event *event = new Event();
t1->Branch("Event.", "", event);
int nentries = 4;
for (int i = 0; i < nentries; ++i) {
TClonesArray *tracks = &(event->tracks);
tracks->Clear("C");
int nTracks = gRandom->Integer(10) + 1;
for (int j = 0; j < nTracks; ++j) {
Track *track = (Track*)tracks->ConstructedAt(tracks->GetEntries());
int n = gRandom->Integer(10) + 1;
track->hitEnergies.reserve(n);
for (int k = 0; k < n; ++k) {
track->hitEnergies.push_back(gRandom->Rndm());
}
}
t1->Fill();
}
t1->Write();
f1->Close();
}
// this method works, but it is slow for large amounts of data
void read1() {
TChain *events = new TChain("Events");
events->Add("events.root");
// events->Print();
TTreeReader reader(events);
TTreeReaderValue<Event> event(reader, "Event.");
cout << "TTreeReaderValue<Event> says" << endl;
while (reader.Next()) {
cout << "Tracks: " << event->tracks.GetEntries() << endl; // OK, can extract data further
}
}
// this is the method I tried first, but it doesn't work
void read2() {
TChain *events = new TChain("Events");
events->Add("events.root");
// events->Print();
TTreeReader reader(events);
TTreeReaderArray<vector<double> > energies(reader, "Event.tracks.hitEnergies");
cout << "TTreeReaderArray<vector<double> > says" << endl;
while (reader.Next()) {
cout << "Tracks: " << energies.GetSize() << endl; // Zeros! Extracting data leads to segfaults
}
}
// this method I want to use to speed up calculations
void read3() {
ROOT::EnableImplicitMT();
ROOT::RDataFrame d("Events", "events.root");
// d.Describe().Print();
// cout << endl;
cout << "RDataFrame says" << endl;
for (const auto &el : d.Take<ROOT::RVec<vector<double> > >("Event.tracks.hitEnergies")) {
cout << "Tracks: " << el.size() << endl; // Zeros! Extracting data leads to segfaults
}
}
void example() {
write();
read1();
cout << endl;
read2();
cout << endl;
read3();
}
To sum up,
Main problem: RDataFrame fails to read values from RVec<vector<double>>; accessing them segfaults. Investigation showed that the RVec has zero size. Is there any way to fix this?
Less important problem: TTreeReaderArray<vector<double>> seems to have the same issue, but I probably won't use it.
RDataFrame uses TTreeReader under the hood, so the issue with zero-sized RVecs you see in RDataFrame is probably a consequence of the zero-sized arrays returned by TTreeReaderArray.
Alright, there are a few things going on here, mostly related to TTreeReader and ROOT I/O, that make some things work with RDataFrame and others not.
The problem with TTreeReaderArray<vector<double>>(reader, "Event.tracks.hitEnergies") is a bug in TTreeReader; I opened an issue. RDataFrame uses TTreeReaderArray under the hood whenever you read a column as an RVec, so it hits the same bug. RDataFrame also reads columns as RVecs automatically if they are arrays, which is what happens in your reproducer.
The workaround would be d.Take<TClonesArray>("Tracks"), but there is a hiccup: Take needs to read the TClonesArray for every event and copy it into the resulting vector<TClonesArray>. However, a TClonesArray of Track objects is not copyable if you run the program as an interpreted macro, because the default copy constructor of a TObject invokes Clone, and the default Clone implementation requires ROOT I/O dictionaries for the Track class. Running the macro as root -l -b -q works.cpp+ (with the +) generates dictionaries for the Track class before running, so this works:
#include <ROOT/RDataFrame.hxx>
#include <ROOT/RVec.hxx>
#include <TClonesArray.h>
#include <TFile.h>
#include <TObject.h>
#include <TTree.h>
#include <iostream>
#include <vector>
class Track : public TObject {
public:
std::vector<double> hitEnergies;
Track(const std::vector<double> &v = {}) : hitEnergies{v} {};
ClassDef(Track, 1);
};
ClassImp(Track);
void works() {
// write file
{
TFile f("f.root", "recreate");
TTree tree("Events", "events");
TClonesArray arr("Track", 1);
tree.Branch("Tracks", &arr);
arr.ConstructedAt(0);
((Track *)arr.At(0))->hitEnergies.assign({1.0, 2.0, 3.0});
arr.ConstructedAt(1);
((Track *)arr.At(1))->hitEnergies.assign({4.0, 5.0});
tree.Fill();
tree.Write();
}
// this involves a copy of the TClonesArray objects, but copying a
// TClonesArray of an interpreted class does not work (at least not if `Track`
// uses the default `Clone` method, overriding it might fix this issue).
ROOT::RDataFrame d("Events", "f.root");
std::vector<TClonesArray> arrs = d.Take<TClonesArray>("Tracks").GetValue();
std::cout << arrs.size() << '\n';
std::cout << static_cast<Track *>(arrs[0][0])->hitEnergies[1] << '\n';
std::cout << static_cast<Track *>(arrs[0][1])->hitEnergies[0] << '\n';
}
This should be run e.g. as root -l -b -q works.cpp+ (note the +).
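Once the arrays have been copied out with Take as in the macro above, the per-track vectors still have to be pulled out of each TClonesArray. Here is a minimal sketch of that step. The helper name collectHitEnergies is hypothetical (it is not a ROOT function), and FakeTrack and FakeArray are stand-ins I made up so that the snippet compiles without ROOT. The helper only assumes that the array type offers GetEntries() and At(i), as TClonesArray does, and that the track type has a hitEnergies member.

```cpp
#include <utility>
#include <vector>

// Hypothetical stand-in for the Track class (no ROOT needed).
struct FakeTrack {
  std::vector<double> hitEnergies;
};

// Hypothetical stand-in for TClonesArray: it only mimics GetEntries() and At().
class FakeArray {
public:
  void Add(FakeTrack t) { tracks_.push_back(std::move(t)); }
  int GetEntries() const { return static_cast<int>(tracks_.size()); }
  const FakeTrack *At(int i) const { return &tracks_[i]; }

private:
  std::vector<FakeTrack> tracks_;
};

// Copies the hitEnergies of every track in one event's array into a
// vector of vectors (one inner vector per track).
// TrackT is the concrete track class; ArrayT is anything with
// GetEntries() and At(i) returning a pointer convertible to TrackT*.
template <class TrackT, class ArrayT>
std::vector<std::vector<double>> collectHitEnergies(const ArrayT &arr) {
  std::vector<std::vector<double>> out;
  const int n = arr.GetEntries();
  out.reserve(n);
  for (int i = 0; i < n; ++i) {
    const auto *track = static_cast<const TrackT *>(arr.At(i));
    out.push_back(track->hitEnergies);
  }
  return out;
}
```

In the compiled macro, the call would presumably look like collectHitEnergies<Track>(arrs[0]) for the first event. I have only checked the template against the stand-in types, not against a real TClonesArray.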
Processing the TClonesArrays on the fly without copying them out also works, even without dictionaries, e.g.: