I wonder if it is possible to write a simple vector of my custom class (not inheriting from TObject) to a TTree without any extra files and steps. My script is the following:
class C
{
public:
   std::vector<double> a;
};

int tree_test()
{
   gInterpreter->GenerateDictionary("C;vector<C>", "vector");
   auto t = new TTree("t", "t");
   double am[] = {1.5, 2.5, 3.4};
   C c;
   c.a.assign(am, am+3);
   std::vector<C> vc;
   vc.push_back(c);
   int b[] = {0, 100};
   t->Branch("c", &vc);
   t->Branch("b", b, "b[2]/I");
   t->Fill();
   t->Scan("c:b");
   t->Scan("c[b[0]]:b[0]");
   return 0;
}
And ROOT complains:
Error in TTree::Branch: The class requested (vector) for the branch "c" is an instance of an stl collection and does not have a compiled CollectionProxy. Please generate the dictionary for this collection (vector) to avoid to write corrupted data.
Even though the dictionary's .so is generated. This simple dictionary generation worked in PyROOT though...
ROOT Version: 6.22.06 Platform: Not Provided Compiler: Not Provided
GenerateDictionary requires physical files, so you need to put the declaration of C into a header file and name that header file in the 2nd argument of GenerateDictionary.
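For illustration, here is a minimal sketch of what such a header could look like. The file name ccc.h is the one used for the attachment later in this thread, but its contents are not shown there, so the body below is an assumption that simply moves the class from the macro into a file. The include guard name is made up.

```cpp
// ccc.h: assumed contents, reconstructed from the class in the macro above.
#ifndef CCC_H   // hypothetical guard name
#define CCC_H

#include <vector>

// Same class as in the original macro, now living in a physical file so that
// GenerateDictionary has a header it can include when building the dictionary.
// The macro would then name this file in the 2nd argument, for example
// "ccc.h;vector" (assumed to be semicolon-separated like the class list).
class C
{
public:
   std::vector<double> a;
};

#endif // CCC_H
```

The macro itself would then include this header instead of defining C inline.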
So we can see that the c[0].a value is 1.5. If I scan for c[0].a I get this value. But when I scan for c[b[0]].a, where b[0] is 0, I get an empty result. Why is that?
The Python code was succeeding (partially) because it did not really detect that there was a mismatch between the interpreter view of the vector and the I/O internal representation (i.e. a bug).
Yes. With the dictionary loaded you are guaranteed the data will be properly read. Without it, it might not be, and I assumed the part "But when I scan for c[b[0]].a, where b[0] is 0, I get empty result. Why is that?" was one of the consequences.
But the result of reading c[b[0]].a is the same even after the dictionary is properly loaded (I tried it just now, right after the generation of the TTree in C++).
Please find attached the script and the class header. The scan of c[b[0]].a does not give any result, even though b[0] is 0 and c[0].a gives a result. tree_test1.C (524 Bytes) ccc.h (87 Bytes)
This appears to be a bug in TTreeFormula's handling of these nested collections (as shown by the other TTree::Scan, the data is there). Did you try RDataFrame to compare the results?
And how can I print out a specific cell of an array with RDataFrame? I can't find anything like that in the docs, and the TTreeFormula-like format doesn't work. It would be really helpful if there was a TTree documentation equivalent for RDataFrame, showing how to perform the same tasks.
I found out that I can Define() a new branch for b[0] and it works, however such a branch defined for c[0].a or c.a[0] still consists of all the cells of c.a...
As c.a is an RVec<std::vector<double>> (the outer collection is over the std::vector<C>, the inner collection is over the a data member of each C), c.a[0] is a vector<double>, so that seems correct?
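To make the two levels concrete, here is a plain C++ sketch with no ROOT involved. The helpers project_a and cell are hypothetical names introduced only for this illustration; they mimic what the expression c.a denotes for a single entry and are not part of any ROOT API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Same layout as the class in the question.
class C
{
public:
   std::vector<double> a;
};

// Hypothetical helper: collects member `a` of every element of one entry's
// std::vector<C>. The result has one inner vector per element of `c`.
std::vector<std::vector<double>> project_a(const std::vector<C> &c)
{
   std::vector<std::vector<double>> out;
   out.reserve(c.size());
   for (const C &elem : c)
      out.push_back(elem.a);
   return out;
}

// Hypothetical helper: `outer` selects the element of the std::vector<C>,
// `inner` selects the value inside that element's `a`.
double cell(const std::vector<C> &c, std::size_t outer, std::size_t inner)
{
   assert(outer < c.size() && inner < c[outer].a.size());
   return project_a(c)[outer][inner];
}
```

With a single C holding {1.5, 2.5, 3.4}, project_a returns one inner vector of three values, so only with both indices given, as in cell(vc, 0, 2), do you get a single number.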
Then I am missing RDataFrame logic here completely. Why do I have an outer collection? It is not for entries, it is a collection of something else, but I don't know what.
Thanks, it works, but I have no idea what the first index stands for...
Anyway, unlike TTreeFormula, here the indexing with other variable works: d.Define("ca01", "c.a[0][b[1]]").Display("ca01")->Print();
gives:
ca01 |
0.0000000 |
3.4000000 |
as expected.
Thank you!
With learning the things above, I have comments, that maybe do not fit this topic. Still:
If I understand correctly, RDataFrame is not a replacement for TTree interfaces. But at the moment one is forced to learn both, as some tasks with RDataFrame are really complicated. This display of an array cell is an example. It involves some strange steps of defining another variable (why? the variable is there!), then some strange extra index, etc.
Those defines+displays are reaaaaally slow.
On the other hand, I understand the cool new capabilities that RDataFrame gives.
About the slowness: it's a constant overhead due to some C++ code being just-in-time compiled (namely a function that evaluates c.a[0][b[1]]). RDataFrame will be fast on large datasets where this constant overhead of a few seconds does not matter. It is routinely used to run multi-threaded analyses on hundreds of GBs of data that produce hundreds or thousands of histograms; that's where the performance benefits become apparent, I think. (EDIT: in C++ you can skip just-in-time compilation completely by passing C++ callables rather than strings, but when calling RDF through Python, at least for now, it's required)
About the extra index: it comes directly from your schema, it's not an RDataFrame thing: c is a std::vector<C>, and each element of c has a member a which is a std::vector<double>. So, for each entry, c.a is a collection of collections: a vector<double> for each element in the vector<C>. It's the same in TTree.
About why the intermediate Define: separation of concerns and keeping features orthogonal. There is one method to Define a new column, and other methods to do things with whatever columns are present. It's a bit more typing for super simple things, but it scales well to complicated use cases. I completely agree that TTree::Scan or TTree::Draw are faster to type and run for very simple queries, but the other side of the coin is that they don't scale well (in complexity and performance) to hundreds of histograms and complicated expressions.
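As a rough sketch of the "C++ callable instead of a string" idea mentioned in the EDIT above, the expression c.a[0][b[1]] can be written as an ordinary function. The function name, the std::vector parameter types and the NaN fallback for out-of-range indices are all assumptions made for this illustration. In RDataFrame such a callable would be handed to Define in place of the string, together with the names of the input columns, and the column types RDataFrame actually expects may differ from the ones used here.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Same layout as the class in the question.
class C
{
public:
   std::vector<double> a;
};

// Hypothetical callable equivalent to the string expression "c.a[0][b[1]]".
// Unlike the string version, it checks the indices and returns NaN when the
// requested cell does not exist (an assumed convention, not ROOT behaviour).
double ca01(const std::vector<C> &c, const std::vector<int> &b)
{
   const double missing = std::numeric_limits<double>::quiet_NaN();
   if (c.empty() || b.size() < 2 || b[1] < 0)
      return missing;
   const auto idx = static_cast<std::size_t>(b[1]);
   const std::vector<double> &a0 = c[0].a;   // a of the first C in this entry
   return idx < a0.size() ? a0[idx] : missing;
}
```

Being compiled together with the rest of the program, a function like this avoids the just-in-time compilation step that the string form requires.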
Yes, of course. It should have been obvious to me from the start, especially that Scan on c[0].a was also displaying the whole vector. I am not sure where my brain was, sorry.
So there are two issues here:
I have a feeling that the need to define a separate variable for an array index is just a missing feature. As I understand it, this is just a matter of improving the parser for RDataFrame, perhaps simply by borrowing some part of TTreeFormula. There are also those cases with empty indices shown in the TTree::Draw/Scan documentation that I am not certain can currently be done with RDataFrame and Define.
I understand that the main target for the whole ROOT package is CERN and LHC, but there are experiments and people like me that use TTrees for uncomplicated queries. I guess it wouldn't serve ROOT to push such people/experiments further away by replacing functionality that suited them with one that doesn't... For sure, it would be a shame for us.
To be clear, no functionality is being replaced. TTree is not being deprecated. I was just mentioning what use case is typically served best by what interface, but of course you should use what works best for you.
Yes, but use of RDataFrame is encouraged, and someone on this forum once replied to me that TTree is being deprecated and the future is RNTuple. Thus it would be really nice if clearly missing features (addressing array indices) and simplifications (such as Scan(), showing the same amount of information) could be added to RNTuple. It would reduce entropy.