TTreeReaderArray with std::vector<objects> within a split "Event" Tree

malfonsi79 · December 10, 2015, 12:52pm

Dear ROOTers,

I am trying to use the TTreeReader facility (with ROOT 6.05/03/somecommit, while colleagues use 6.04 releases) on a data structure that resembles the “Event” class example (in split mode). For the purpose of this discussion the data can be simplified as follows (I can provide a reduced data file if that helps):

// N.B. ClassDef macros present in all classes, here omitted for clarity

class Event : public TObject {
public:
	// [...] Some stuff, but in particular:
	std::vector< Peak > peaks;
	std::vector< Interaction > interactions;
};

class Peak : public TObject {
public:
	// [...] This is a large object with a lot of basic data types e.g.:
	Float_t area;
	// ... but also another, nested vector (N.B. This is NOT further split in the current version of the TTree)
	std::vector< ReconstructedPosition > reconstructedposition;
};

class Interaction : public TObject {
public:
	//Only basic data types, in particular:
	Int_t s2; //index of the related element within the "peaks" vector
};

//the class ReconstructedPosition has only basic data types

All these class definitions are contained in “classes.{hpp|cpp}” files that are provided together with the ROOT files. I compile them once and for all into a library “classes.so” in a separate ROOT session with gSystem->CompileMacro("classes.cpp","kf","classes"). Then I use gSystem->Load("classes.so") to load the library right after starting the ROOT session.

Then I access the data with a compiled script (via “.L myscript.cc++”) that looks like:

TTreeReader myreader(tree);
TTreeReaderArray<Interaction> interactions(myreader,"interactions");

// EACH TIME ONLY ONE OF THE 3 OPTIONS BELOW:
TTreeReaderArray<Peak> peaks(myreader,"peaks");                    //--> OPTION 1
TTreeReaderArray<Float_t> areas(myreader,"peaks.area");            //--> OPTION 2
TTreeReaderValue< std::vector<Peak> > peaksvec(myreader,"peaks");  //--> OPTION 3

//THEN I ACCESS THE OBJECTS E.G. WITH
while ( myreader.Next() ) {
	for (const auto& interaction : interactions) {
		cout << "S2 id " << interaction.s2 << "\t\t area " 
		//	   << peaks[interaction.s2].area << endl;   //--> FOR OPTION 1
		//	   << areas[interaction.s2] << endl;        //--> FOR OPTION 2
		<< (*peaksvec)[interaction.s2].area << endl;     //--> FOR OPTION 3
	}
}

I observe that the TTreeReaderArray has no problem iterating over the “Interaction” vector, while with “Peak” it starts to return garbage after some events, so I have to stick to OPTION 3.

Q1) Am I doing something wrong or is there a bug with TTreeReaderArray (probably known… I found this report sft.its.cern.ch/jira/browse/ROOT-7581) ?

Q2) Is a “linkdef” file, with the directives to generate dictionaries for all the std::vector, required at the moment of the compilation of the “classes.so” library? Would this actually solve my problem? (However I don’t get any hint message during compilation)

Q3) Is OPTION 2, i.e. accessing a sub-branch of a vector of objects, actually supported? The documentation is not really explicit… I tried, and it worked … for a few events, just like option 1 (at the beginning I thought that the problem was with option 2). If option 2 is not supported, it seems to me that option 1 entails some extra I/O cost due to reading the whole object… am I wrong?

Q4) The workaround of OPTION 3 works, but seems even more expensive in terms of I/O because you have to load the full vector of objects… can you comment on that?

Some other generic but related questions:

Q5) Does the TTreeReader approach in general have some (non-negligible) overhead compared with the old way of manually calling “SetBranchAddress()”, or even GetEntry() on single branches instead of the full tree?

I would like to benchmark the different approaches, and benchmarks are always a bit tricky in the details. Q6) Can I follow the example of $ROOTSYS/test/MainEvent.cxx with the TStopwatch class? Is there any other detail that can matter?

Q7) In the documentation TClonesArray is described as superior in terms of performance to other arrays of objects. Still, it seems to me (I may be wrong) that std::vector is preferred by most people, and I wonder whether, given the large number of users, you have optimised the I/O of std::vector so that TClonesArray is no longer a must.

Thanks in advance for going through this big bunch of questions,
Matteo

pcanal · December 10, 2015, 9:13pm

[quote]Q7) TClonesArray[/quote]The advantage of TClonesArray is mainly in memory management. Part of the gap is bridged by the appearance of emplace_back. The main difference is that the objects in a TClonesArray can be constructed just once and are not destructed until the end of the TClonesArray's lifetime (whereas when you down-size a std::vector the extra objects are destructed and need to be constructed again later). So in very high turn-around situations (like reading the ‘same’ TClonesArray from a TTree many times, each time for a different entry) the number of calls to constructors and destructors can be staggeringly high and reduce the performance.
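
The constructor-count argument can be illustrated without ROOT. The following is only a minimal sketch in plain C++: `Counted`, `largest`, `constructions_with_resize` and `constructions_with_pool` are hypothetical names, and the "pool" merely mimics the idea of keeping objects alive between entries; it is not how TClonesArray is implemented.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a stored object; it only counts constructor calls.
struct Counted {
    static int constructed;
    Counted() { ++constructed; }
    Counted(const Counted&) { ++constructed; }
};
int Counted::constructed = 0;

// Largest per-entry size, used to reserve storage up front so that
// reallocation copies do not distort the count.
std::size_t largest(const std::vector<std::size_t>& sizes) {
    return sizes.empty() ? 0 : *std::max_element(sizes.begin(), sizes.end());
}

// std::vector pattern: the container is resized to each entry's size,
// so shrinking destroys objects and growing constructs them again.
int constructions_with_resize(const std::vector<std::size_t>& sizes) {
    std::vector<Counted> v;
    v.reserve(largest(sizes));
    Counted::constructed = 0;
    for (std::size_t n : sizes) v.resize(n);
    return Counted::constructed;
}

// Pool pattern (the idea only, an assumption for illustration):
// storage never shrinks, so an object is constructed at most once.
int constructions_with_pool(const std::vector<std::size_t>& sizes) {
    std::vector<Counted> pool;
    pool.reserve(largest(sizes));
    Counted::constructed = 0;
    for (std::size_t n : sizes) {
        if (pool.size() < n) pool.resize(n);
        // The first n objects would be refilled in place here.
    }
    return Counted::constructed;
}
```

For per-entry sizes 3, 1, 3 the resize pattern runs five constructors, while the pool pattern runs three.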

[quote]Q6) TStopwatch[/quote]Yes or you can use std::chrono.
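
As a rough sketch of the std::chrono route (the helper name `time_seconds` is made up for this example), one could wrap the code to be benchmarked in a callable and measure wall-clock time around it. Note that this measures elapsed real time only, whereas TStopwatch reports both real and CPU time.

```cpp
#include <chrono>

// Hypothetical helper: runs `work` once and returns the elapsed
// wall-clock time in seconds, measured with a monotonic clock.
template <typename F>
double time_seconds(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

The event loop of each option would go inside the callable, e.g. `time_seconds([&] { /* loop over entries */ })`. Repeating the measurement several times helps to separate disk-cache effects from the cost of the reading code itself.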

[quote]Q5) TTreeReader overhead[/quote]Yes, there is an overhead (which is a trade-off for safety). The overhead should be relatively small (a few additional if statements during the branch access) and depends strongly on the TTree and the usage pattern.

[quote]Q4) Option 3 overhead[/quote]Yes, assuming the vector has been split, you are correct.

[quote]Q3) Is actually OPTION 2, i.e. accessing to a sub-branch of vector of objects, supported? [/quote]Yes if the vector is split.
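
To picture why splitting matters for options 1 to 3, here is a purely conceptual sketch in plain C++. It is not ROOT's implementation: `SplitPeaks`, `split_peaks` and `area_at` are hypothetical names, and the `width` member is invented for the example. The idea is that a split vector of objects is stored member by member, so a reader that needs only `area` touches only that column.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, reduced Peak: only `area` comes from the thread,
// `width` is invented to have a second member.
struct Peak {
    float area;
    float width;
};

// Conceptual picture of a split vector<Peak>: one column per data member.
struct SplitPeaks {
    std::vector<float> area;   // plays the role of the "peaks.area" sub-branch
    std::vector<float> width;
};

// Converts the object-wise layout into the member-wise (split) layout.
SplitPeaks split_peaks(const std::vector<Peak>& peaks) {
    SplitPeaks columns;
    for (const Peak& p : peaks) {
        columns.area.push_back(p.area);
        columns.width.push_back(p.width);
    }
    return columns;
}

// Reads a single member for a single element, with bounds checking,
// without touching the other columns.
float area_at(const SplitPeaks& columns, std::size_t index) {
    return columns.area.at(index);
}
```

In this picture option 2 corresponds to reading one column, while options 1 and 3 correspond to reassembling complete `Peak` objects from all columns.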

[quote]I tried, and it worked … for few events [/quote]If it does not work for all events, there is (of course) a problem that still needs to be addressed. It may or may not be the same as the one you mention in Q1. We would need a running example to investigate.

[quote]Q2) Linkdef and STL[/quote]The dictionary for an STL collection used inside a class that has a dictionary is generated automatically as part of the class’ dictionary.

Cheers,
Philippe.

malfonsi79 · December 11, 2015, 10:34am

Dear Philippe,

thanks a lot for your answers. Just a confirmation:

Is it a typo, or do you really mean that the overhead is NOT SMALL?

BTW I will prepare a code snippet and a short data file that focus on the issue.

pcanal · December 11, 2015, 8:19pm

Hi,

There was indeed a typo in my sentence about overhead (I updated the post). The overhead should be small or negligible.

Thanks,
Philippe.