On-the-fly dictionary generation for vector<myclass> and use as a branch

Hello,

I wonder if it is possible to write a simple vector of my custom class (not inheriting from TObject) to a TTree without any extra files and steps. My script is the following:

class C
{
public:
	std::vector<double> a;

};

int tree_test()
{
	gInterpreter->GenerateDictionary("C;vector<C>", "vector");

	auto t = new TTree("t", "t");
	double am[] = {1.5, 2.5, 3.4};
	C c;
	c.a.assign(am, am+3);
	std::vector<C> vc;
	vc.push_back(c);
	int b[] = {0, 100};
	t->Branch("c", &vc);
	t->Branch("b", b, "b[2]/I");
	t->Fill();
	t->Scan("c:b");
	t->Scan("c[b[0]]:b[0]");

	return 0;

}

And ROOT complains:

Error in TTree::Branch: The class requested (vector) for the branch "c" is an instance of an stl collection and does not have a compiled CollectionProxy. Please generate the dictionary for this collection (vector) to avoid to write corrupted data.

Even though the dictionary's .so is generated. This simple dictionary generation worked in PyROOT, though...


ROOT Version: 6.22.06
Platform: Not Provided
Compiler: Not Provided


GenerateDictionary requires physical files, so you need to put the declaration of C into a header file and name that header file in the 2nd argument of GenerateDictionary.

Hmm, if I move it to a separate file, I also need to #include it before main(); otherwise I cannot use it. But when I #include it, I get

error: redefinition of 'C'

I am trying to run it without compilation, just through the interpreter. Maybe this is the issue?

Btw. how does it work in PyROOT without a separate header file?

I am not sure how GenerateDictionary can work without a header ... even in PyROOT ... what is the exact sequence you use in PyROOT?

error: redefinition of 'C'

Did you remember to add an include guard? And to use the header file in your script?

Thanks, of course I forgot about the code guard. Now it works.
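
For readers who hit the same `redefinition of 'C'` error, here is a minimal sketch of the include-guard pattern. The file name `ccc.h` and the macro name `CCC_H` are assumptions for illustration, and `make_vc` is a hypothetical helper. The guarded block is written out twice in a single file only to mimic the header being pulled in twice (presumably once through the dictionary and once through the script's own `#include`).

```cpp
#include <cassert>
#include <vector>

// ---- first inclusion of the (hypothetical) header ccc.h ----
#ifndef CCC_H
#define CCC_H
class C
{
public:
    std::vector<double> a;
};
#endif // CCC_H

// ---- second inclusion: CCC_H is already defined, so this part is skipped ----
#ifndef CCC_H
#define CCC_H
class C
{
public:
    std::vector<double> a;
};
#endif // CCC_H

// Illustrative helper: build the one-element vector<C> used in the script.
std::vector<C> make_vc()
{
    C c;
    c.a = {1.5, 2.5, 3.4};
    std::vector<C> vc;
    vc.push_back(c);
    assert(vc.size() == 1);
    return vc;
}
```

Without the `#ifndef`/`#define`/`#endif` lines the second copy of `class C` would be a redefinition. In the actual macro the header would then be named in the second argument of GenerateDictionary, as advised above, probably along the lines of `"ccc.h;vector"`; the exact string used in the attached script is not shown in the thread.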

In python I do:

	ROOT.gROOT.ProcessLine("class C\
	{\
	public:\
	    vector<float> SimEfield_X;\
	};")
	
	ROOT.gInterpreter.GenerateDictionary("vector<C>", "vector")

and I was able to create a branch with vector and properly store results into it.

However, I've created the script in the original mail to test the following problem. The result of

t->Scan("c.a:b");
t->Scan("c[b[0]].a:b[0]");

is:


***********************************************
*    Row   * Instance *       c.a *         b *
***********************************************
*        0 *        0 *       1.5 *         0 *
*        0 *        1 *       2.5 *       100 *
*        0 *        2 *       3.4 *           *
*        1 *        0 *       1.5 *       200 *
*        1 *        1 *       2.5 *         2 *
*        1 *        2 *       3.4 *           *
***********************************************
***********************************************
*    Row   * Instance * c[b[0]].a *      b[0] *
***********************************************
*        0 *        0 *           *         0 *
Error in <TArrayI::At>: index 200 out of bounds (size: 1, this: 0x400de80)
*        1 *        0 *           *       200 *
***********************************************

So we can see that the c[0].a value is 1.5. If I scan for c[0].a I get this value. But when I scan for c[b[0]].a, where b[0] is 0, I get empty result. Why is that?

The python code was succeeding (partially) because it did not really detect that there was a mismatch between the interpreter view of the vector and the I/O internal representation (i.e. a bug :frowning: ).

So would there be any advantage if the dictionary was properly loaded? If not, please leave that bug there :slight_smile:

Yes. With the dictionary loaded you are guaranteed the data will be properly read. Without it, it might not be, and I assumed the part "But when I scan for c[b[0]].a, where b[0] is 0, I get empty result. Why is that?" was one of the consequences.

But the result of reading c[b[0]].a is the same even after the dictionary is properly loaded (I tried it just now, right after generating the TTree in C++).

Hmm ... I am somewhat confused about what works as expected and what does not :slight_smile:

Could you provide a complete running example showing the problem (if any)?

Please find attached the script and the class header. The scan of c[b[0]].a does not give any result, even though b[0] is 0 and c[0].a gives a result.
tree_test1.C (524 Bytes)
ccc.h (87 Bytes)

That:

***********************************************
*    Row   * Instance * c[b[0]].a *      b[0] *
***********************************************
*        0 *        0 *           *         0 *

appears to be a bug in TTreeFormula's handling of these nested collections (as shown by the other TTree::Scan, the data is there). Did you try RDataFrame to compare the results?

And how can I print out a specific cell of an array with RDataFrame? I can't find anything like that in the docs, and the TTreeFormula-like format doesn't work. It would be really helpful if there was a TTree documentation equivalent for RDataFrame, showing how to perform the same tasks.

I found out that I can Define() a new branch for b[0] and it works, however such a branch defined for c[0].a or c.a[0] still consists of all the cells of c.a...

Hi @LeWhoo ,

As c.a is an RVec<std::vector<double>> (outer collection is over the std::vector<C>, inner collection is over the a data member of each C), c.a[0] is a vector<double>, so that seems correct?

Here's one way to do it:

df.Define("ca00", "c.a[0][0]").Display("ca00")->Print();

RDataFrame's docs are here and I'm adding a section about working with collections in this PR, any feedback is welcome.

Cheers,
Enrico

Then I am missing RDataFrame logic here completely. Why do I have an outer collection? It is not for entries, it is a collection of something else, but I don't know what.

Thanks, it works, but I have no idea what the first index stands for...

Anyway, unlike TTreeFormula, here the indexing with other variable works:
d.Define("ca01", "c.a[0][b[1]]").Display("ca01")->Print();
gives:

ca01      | 
0.0000000 | 
3.4000000 |

as expected.
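
To see where these two values come from, here is a plain C++ sketch (no ROOT involved) that evaluates the same expression by hand. It assumes the two entries hold `b = {0, 100}` and `b = {200, 2}`, as read off the Scan output earlier in the thread, and that `c` holds a single `C` with `a = {1.5, 2.5, 3.4}`. The names `ca0_at` and `evaluate_entries` are hypothetical. Unlike the expression string, the helper checks bounds explicitly and returns 0 when the index is out of range; the 0 printed for the first entry above most likely comes from an unchecked out-of-range read and should not be relied on.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class C
{
public:
    std::vector<double> a;
};

// Evaluate c[0].a[idx] with explicit bounds checks.
// Returns `fallback` when c is empty or idx is outside c[0].a.
double ca0_at(const std::vector<C> &c, int idx, double fallback = 0.0)
{
    if (c.empty() || idx < 0)
        return fallback;
    const std::vector<double> &a0 = c[0].a;
    if (static_cast<std::size_t>(idx) >= a0.size())
        return fallback;
    return a0[static_cast<std::size_t>(idx)];
}

// Apply the helper to the two entries assumed above.
std::vector<double> evaluate_entries()
{
    C elem;
    elem.a = {1.5, 2.5, 3.4};
    const std::vector<C> c = {elem};
    assert(c.size() == 1);

    // b[2] for entry 0 and entry 1 (values taken from the Scan table).
    const int b[2][2] = {{0, 100}, {200, 2}};

    std::vector<double> out;
    for (const auto &row : b)
        out.push_back(ca0_at(c, row[1])); // c.a[0][b[1]]
    return out;
}
```

For the first entry `b[1]` is 100, which is past the end of the three-element vector, so the helper returns the fallback 0; for the second entry `b[1]` is 2, which selects 3.4.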

Thank you!

Having learned the things above, I have some comments that may not fit this topic. Still:

  1. If I understand correctly, RDataFrame is not a replacement for TTree interfaces. But at the moment one is forced to learn both, as some tasks with RDataFrame are really complicated. This display of an array cell is an example. It involves some strange steps of defining another variable (why? the variable is there!), then some strange extra index, etc.
  2. Those defines+displays are reaaaaally slow.

On the other hand, I understand the cool new capabilities that RDataFrame gives.

About the slowness: it's a constant overhead due to some C++ code being just-in-time compiled (namely a function that evaluates c.a[0][b[1]]). RDataFrame will be fast on large datasets where this constant overhead of a few seconds does not matter. It is routinely used to run multi-thread analyses on hundreds of GBs of data that produce hundreds or thousands of histograms; that's where the performance benefits become apparent, I think. (EDIT: in C++ you can skip just-in-time compilation completely by passing C++ callables rather than strings, but when calling RDF through Python, at least for now, it's required)

About the extra index: it comes directly from your schema, it's not an RDataFrame thing: c is a std::vector<C>, and each element of c has a member a which is a std::vector<double>. So, for each entry, c.a is a collection of collections: a vector<double> for each element in the vector<C>. It's the same in TTree.
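
The same structure can be spelled out in plain C++. This is only an analogy for how the column `c.a` is shaped for one entry, not a description of how ROOT implements it; the function names `column_c_a` and `make_entry` are made up for illustration.

```cpp
#include <cassert>
#include <vector>

class C
{
public:
    std::vector<double> a;
};

// For one entry, gather the member `a` of every element of c.
// The result has one inner vector<double> per element of the vector<C>.
std::vector<std::vector<double>> column_c_a(const std::vector<C> &c)
{
    std::vector<std::vector<double>> out;
    out.reserve(c.size());
    for (const C &elem : c)
        out.push_back(elem.a);
    assert(out.size() == c.size());
    return out;
}

// One entry holding a single C, as in the original script.
std::vector<C> make_entry()
{
    C elem;
    elem.a = {1.5, 2.5, 3.4};
    return {elem};
}
```

With this entry, `column_c_a(make_entry())[0]` is the whole vector `{1.5, 2.5, 3.4}` and `column_c_a(make_entry())[0][0]` is `1.5`: the first index picks the element of the `vector<C>`, the second picks the position inside that element's `a`.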

About why the intermediate Define: separation of concerns and keeping features orthogonal. There is one method to Define a new column, and other methods to do things with whatever columns are present. It's a bit more typing for super simple things, but it scales well to complicated use cases. I completely agree that TTree::Scan or TTree::Draw are faster to type and run for very simple queries, but the other side of the coin is that they don't scale well (in complexity and performance) to hundreds of histograms and complicated expressions.

Cheers,
Enrico

Yes, of course. It should have been obvious to me from the start, especially since Scan on c[0].a was also displaying the whole vector. I am not sure where my brain was, sorry.

So there are two issues here:

  1. I have a feeling that the need to define a separate variable for an array index is just a missing feature. I understand it is just a matter of improving the parser for RDataFrame, perhaps simply borrowing some part from TTreeFormula. There are also those cases with empty indices shown in the TTree::Draw/Scan documentation that I am not certain can currently be done with RDataFrame and Define.
  2. I understand that the main target for the whole ROOT package is CERN and LHC, but there are experiments and people like me that use TTrees for uncomplicated queries. I guess it wouldn't serve ROOT to push such people/experiments further away by replacing functionality that suited them with functionality that doesn't... For sure, it would be a shame for us.

To be clear, no functionality is being replaced. TTree is not being deprecated. I was just mentioning which use case is typically served best by which interface, but of course you should use what works best for you.

Yes, but use of RDataFrame is encouraged, and someone on this forum once replied to me that TTree is being deprecated and the future is RNTuple. Thus it would be really nice if clearly missing features (addressing array indices) and simplifications (such as Scan(), showing the same amount of information) could be added to RNTuple. It would reduce entropy :slight_smile:
