Trouble reading RVec of vector from branch with TClonesArray of objects with std::vector member

Hello!
I have a ROOT file and need to process the data in it. I have tried three approaches; please see the source code below. The one using TTreeReaderValue works, but I want to use RDataFrame for its parallel processing. However, it seems that RDataFrame can’t read an RVec<vector<double> > built from a TClonesArray of objects that have a member of type vector<double>.
Here is a minimal reproducer.

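#include <ROOT/RDataFrame.hxx>
#include <ROOT/RVec.hxx>
#include <TChain.h>
#include <TClonesArray.h>
#include <TFile.h>
#include <TObject.h>
#include <TROOT.h>
#include <TRandom.h>
#include <TString.h>
#include <TTree.h>
#include <TTreeReader.h>
#include <TTreeReaderArray.h>
#include <TTreeReaderValue.h>
#include <iostream>
#include <vector>

using namespace std;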

// very simplified classes from real project
// I can't change source code of it without huge refactoring
class Event : public TObject
{
public:
	TClonesArray tracks;

	Event() : tracks{"Track", 5} {};

	ClassDef(Event, 1);
};
ClassImp(Event);


class Track : public TObject
{
public:
	vector<double> hitEnergies;

	Track(const vector<double>& v = {}) : hitEnergies{v} {};

	ClassDef(Track, 1);
};
ClassImp(Track);

// fill the tree with events, each with a random number of tracks and a random number of hits per track
void write() {
	TFile *f1 = new TFile(TString("events.root"), "recreate");
	TTree *t1 = new TTree("Events", "events");

	Event *event = new Event();
	t1->Branch("Event.", "", event);

	int nentries = 4;
	for (int i = 0; i < nentries; ++i) {
		TClonesArray *tracks = &(event->tracks);
		tracks->Clear("C");

		int nTracks = gRandom->Integer(10) + 1;

		for (int j = 0; j < nTracks; ++j) {
			Track *track = (Track*)tracks->ConstructedAt(tracks->GetEntries());

			int n = gRandom->Integer(10) + 1;

			track->hitEnergies.reserve(n);
			for (int k = 0; k < n; ++k) {
				track->hitEnergies.push_back(gRandom->Rndm());
			}
		}
		t1->Fill();
	}
	t1->Write();
	f1->Close();
}

// this method works, but it is slow for large amounts of data
void read1() {
	TChain *events = new TChain("Events");
	events->Add("events.root");
	// events->Print();

	TTreeReader reader(events);
	TTreeReaderValue<Event> event(reader, "Event."); 

	cout << "TTreeReaderValue<Event> says" << endl;
	while (reader.Next()) {
		cout << "Tracks: " << event->tracks.GetEntries() << endl; // OK, can extract data further
	}
}

// this is the method I tried first, but it doesn't work
void read2() {
	TChain *events = new TChain("Events");
	events->Add("events.root");
	// events->Print();

	TTreeReader reader(events);
	TTreeReaderArray<vector<double> > energies(reader, "Event.tracks.hitEnergies");

	cout << "TTreeReaderArray<vector<double> > says" << endl;
	while (reader.Next()) {
		cout << "Tracks: " << energies.GetSize() << endl; // Zeros! Extracting data leads to segfaults
	}
}

// this is the method I want to use to speed up the calculations
void read3() {
	ROOT::EnableImplicitMT();
	ROOT::RDataFrame d("Events", "events.root");

	// d.Describe().Print();
	// cout << endl;

	cout << "RDataFrame says" << endl;
	for (const auto &el : d.Take<ROOT::RVec<vector<double> > >("Event.tracks.hitEnergies")) {
		cout << "Tracks: " << el.size() << endl; // Zeros! Extracting data leads to segfaults
	}
}

void example() {
	write();

	read1();
	cout << endl;
	read2();
	cout << endl;
	read3();
}

To sum up,

  1. Main problem: RDataFrame fails to read values from RVec<vector<double> >, and accessing the elements segfaults. Investigation showed that the RVec has zero size. Is there any way to fix this?
  2. Less important problem: TTreeReaderArray<vector<double> > seems to have the same issue, but I probably won’t use it.

ROOT Version: 6.26/06
Platform: CentOS Linux release 7.9.2009 (Core) x86_64
Compiler: gcc (GCC) 12.2.0


Here is an even more minimal reproducer.

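// test.cpp
#include <ROOT/RDataFrame.hxx>
#include <ROOT/RVec.hxx>
#include <TClonesArray.h>
#include <TObject.h>
#include <TTree.h>
#include <iostream>
#include <vector>

using namespace std;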

class Track : public TObject
{
public:
	vector<double> hitEnergies;

	Track(const vector<double>& v = {}) : hitEnergies{v} {};

	ClassDef(Track, 1);
};
ClassImp(Track);

void test() {
	TTree tree("Events", "events");

	TClonesArray arr("Track", 1);

	tree.Branch("Tracks.", &arr);

	arr.ConstructedAt(0);
	((Track*)arr.At(0))->hitEnergies.assign({1.0, 2.0, 3.0});
	arr.ConstructedAt(1);
	((Track*)arr.At(1))->hitEnergies.assign({4.0, 5.0});

	tree.Fill();
	// tree.Print();

	tree.DrawClone("Tracks.hitEnergies"); // data exist

	ROOT::RDataFrame d(tree);
	// d.Describe().Print();
	// cout << endl;

	for (const auto &el : d.Take<ROOT::RVec<vector<double> > >("Tracks.hitEnergies")) {
		cout << el.at(1).at(0) << endl; // should print 4, but throws an out-of-range error instead
	}
}

The tree is filled successfully, but trying to access the data via RDataFrame fails with

terminate called after throwing an instance of 'std::out_of_range'
  what():  RVecN

Is this a ROOT bug?

Hi @Ako_b,

This probably needs @eguiraud to reply / investigate what is going on. Let’s ping him.

Cheers,
J.

Hi @Ako_b ,

and welcome to the ROOT forum!

RDataFrame uses TTreeReader under the hood, so the issue with zero-sized RVecs you see in RDataFrame is probably a consequence of the zero-sized arrays returned by TTreeReaderArray.

I’m taking a look!
Cheers,
Enrico

Alright, there are a few things going on here, mostly related to TTreeReader and ROOT I/O, that make some things work with RDataFrame and others not.

  1. The problem with TTreeReaderArray<vector<double>>(reader, "Event.tracks.hitEnergies") is a bug in TTreeReader; I opened an issue. RDataFrame uses TTreeReaderArray under the hood whenever you read a column as an RVec, so it hits this bug. RDataFrame also reads columns as RVecs automatically if they are arrays, which is what happens in your reproducer.

  2. The workaround would be d.Take<TClonesArray>("Tracks"), but there is a hiccup: Take needs to read the TClonesArray for every event and copy it into the resulting vector<TClonesArray>. However, a TClonesArray of Track objects is not copyable if you run the program as an interpreted macro, because the default copy-constructor of a TObject invokes Clone and the default Clone implementation requires ROOT I/O dictionaries for the Track class. Running the macro with a trailing +, e.g. root -l -b -q works.cpp+, compiles it and generates dictionaries for the Track class before running, so this works:

#include <ROOT/RDataFrame.hxx>
#include <ROOT/RVec.hxx>
#include <TClonesArray.h>
#include <TFile.h>
#include <TObject.h>
#include <TTree.h>
#include <iostream>
#include <vector>

class Track : public TObject {
public:
  std::vector<double> hitEnergies;

  Track(const std::vector<double> &v = {}) : hitEnergies{v} {};

  ClassDef(Track, 1);
};
ClassImp(Track);

void works() {
  // write file
  {
    TFile f("f.root", "recreate");
    TTree tree("Events", "events");

    TClonesArray arr("Track", 1);

    tree.Branch("Tracks", &arr);

    arr.ConstructedAt(0);
    ((Track *)arr.At(0))->hitEnergies.assign({1.0, 2.0, 3.0});
    arr.ConstructedAt(1);
    ((Track *)arr.At(1))->hitEnergies.assign({4.0, 5.0});

    tree.Fill();
    tree.Write();
  }

  // this involves a copy of the TClonesArray objects, but copying a
  // TClonesArray of an interpreted class does not work (at least not if `Track`
  // uses the default `Clone` method, overriding it might fix this issue).
  ROOT::RDataFrame d("Events", "f.root");
  std::vector<TClonesArray> arrs = d.Take<TClonesArray>("Tracks").GetValue();
  std::cout << arrs.size() << '\n';
  std::cout << static_cast<Track *>(arrs[0][0])->hitEnergies[1] << '\n';
  std::cout << static_cast<Track *>(arrs[0][1])->hitEnergies[0] << '\n';
}

to be run e.g. as root -l -b -q works.cpp+ (note the +).

Processing the TClonesArrays on the fly without copying them out also works, even without dictionaries, e.g.:

ROOT::RDataFrame d("Events", "f.root");
d.Foreach(
    [](const TClonesArray &arr) {
      std::cout << static_cast<Track *>(arr[0])->hitEnergies.at(1) << '\n';
      std::cout << static_cast<Track *>(arr[1])->hitEnergies.at(0) << '\n';
    },
    {"Tracks"});
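
Inside such a callback each track's hitEnergies is an ordinary std::vector<double>, so any plain C++ reduction can be applied per event without copying the TClonesArray out. Here is a minimal sketch of that idea; the helpers totalEnergy and totalEnergies are hypothetical names introduced for illustration (they are not part of ROOT), and the code deliberately uses only the standard library so it can be checked on its own:

```cpp
#include <numeric>
#include <vector>

// Hypothetical helper (not a ROOT function): summed energy of one track's hits.
double totalEnergy(const std::vector<double> &hitEnergies) {
  return std::accumulate(hitEnergies.begin(), hitEnergies.end(), 0.0);
}

// Hypothetical helper: one summed energy per track, in track order.
std::vector<double>
totalEnergies(const std::vector<std::vector<double>> &tracks) {
  std::vector<double> totals;
  totals.reserve(tracks.size());
  for (const auto &hits : tracks)
    totals.push_back(totalEnergy(hits));
  return totals;
}
```

In the Foreach lambda above, this would presumably be called as totalEnergy(static_cast<Track *>(arr[0])->hitEnergies). Keep in mind that with ROOT::EnableImplicitMT() the callback runs concurrently on several threads, so anything it writes to must be thread-safe.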

Overriding the Clone method in Track could also be a workaround for problem number 2.
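
To illustrate why a Clone override might help, here is a minimal, ROOT-free sketch of the mechanism. Everything in it is a stand-in introduced for illustration: ObjectBase plays the role of TObject, TrackLike the role of Track, and cloneAll models a container copy that clones every element through its base pointer. The base-class Clone returning nullptr is only an assumption used to model a default Clone that cannot work without dictionaries; it is not how ROOT actually fails. In real ROOT code the override would have the signature TObject *Clone(const char *newname = "") const override, and I have not verified that it fixes the interpreted-macro case.

```cpp
#include <memory>
#include <vector>

// Stand-in for TObject (hypothetical, NOT ROOT code). The default Clone
// returns nullptr to model a generic clone that cannot do its job.
struct ObjectBase {
  virtual ~ObjectBase() = default;
  virtual ObjectBase *Clone() const { return nullptr; }
};

// Stand-in for Track. Clone is overridden to use the copy constructor,
// so it does not depend on any generic, dictionary-based machinery.
struct TrackLike : ObjectBase {
  std::vector<double> hitEnergies;
  ObjectBase *Clone() const override { return new TrackLike(*this); }
};

// Models a container copy that clones each element via its base pointer.
std::vector<std::unique_ptr<ObjectBase>>
cloneAll(const std::vector<std::unique_ptr<ObjectBase>> &in) {
  std::vector<std::unique_ptr<ObjectBase>> out;
  out.reserve(in.size());
  for (const auto &obj : in)
    out.emplace_back(obj ? obj->Clone() : nullptr);
  return out;
}
```

The design point is that the copy goes through a virtual call, so the derived class decides how it is duplicated; a copy-constructor-based Clone keeps the hitEnergies vector intact in the copy.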

I hope this helps!
Cheers,
Enrico

P.S.
For more information on what dictionaries are and how to generate them for your classes, see the "I/O of custom classes" page of the ROOT manual.

Hello,
Thank you, jalopezg and eguiraud, for your replies!

With this approach it works completely fine. I greatly appreciate your help!