On-fly dictionary generation for vector<myclass> and use as a branch

Thanks, of course I forgot about the code guard. Now it works.

In python I do:

	ROOT.gROOT.ProcessLine("class C\
	{\
	public:\
	    vector<float> SimEfield_X;\
	};")
	
	ROOT.gInterpreter.GenerateDictionary("vector<C>", "vector")

and I was able to create a branch holding a vector<C> and store results in it properly.

However, I’ve created the script in the original mail to test the following problem. The result of

t->Scan("c.a:b");
t->Scan("c[b[0]].a:b[0]");

is:


***********************************************
*    Row   * Instance *       c.a *         b *
***********************************************
*        0 *        0 *       1.5 *         0 *
*        0 *        1 *       2.5 *       100 *
*        0 *        2 *       3.4 *           *
*        1 *        0 *       1.5 *       200 *
*        1 *        1 *       2.5 *         2 *
*        1 *        2 *       3.4 *           *
***********************************************
***********************************************
*    Row   * Instance * c[b[0]].a *      b[0] *
***********************************************
*        0 *        0 *           *         0 *
Error in <TArrayI::At>: index 200 out of bounds (size: 1, this: 0x400de80)
*        1 *        0 *           *       200 *
***********************************************

So we can see that the c[0].a value is 1.5. If I scan for c[0].a I get this value. But when I scan for c[b[0]].a, where b[0] is 0, I get empty result. Why is that?

The python code was succeeding (partially) because it did not really detect that there was a mismatch between the interpreter view of the vector and the I/O internal representation (i.e. a bug :frowning: ).

So would there be any advantage if the dictionary was properly loaded? If not, please leave that bug there :slight_smile:

Yes. With the dictionary loaded you are guaranteed that the data will be read properly. Without it, it might not be, and I assumed the part “But when I scan for c[b[0]].a, where b[0] is 0, I get empty result. Why is that?” was one of the consequences.

But the result of reading c[b[0]].a is the same even after the dictionary is properly loaded (I tried it just now, right after generating the TTree in C++).

Hmm … I am a bit confused about what works as expected and what does not :slight_smile:

Could you provide a complete running example showing the problem (if any)?

Please find attached the script and the class header. The scan of c[b[0]].a does not give any result, even though b[0] is 0 and c[0].a gives a result.
tree_test1.C (524 Bytes)
ccc.h (87 Bytes)

That:

***********************************************
*    Row   * Instance * c[b[0]].a *      b[0] *
***********************************************
*        0 *        0 *           *         0 *

appears to be a bug in TTreeFormula’s handling of these nested collections (as shown by the other TTree::Scan, the data is there). Did you try RDataFrame to compare the results?
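For reference, the result one would expect from c[b[0]].a can be worked out by hand. The snippet below is a plain-Python model of the two entries, with the values inferred from the outputs posted in this thread; it is not ROOT code. The helper name `scan_c_at_b0` and the choice to return an empty list for an out-of-range index are assumptions made for illustration only.

```python
# Plain-Python model of the two tree entries seen in the Scan output.
# Assumption: each entry holds one element in c, whose member a is
# [1.5, 2.5, 3.4], and b holds two integers per entry.
entries = [
    {"c": [{"a": [1.5, 2.5, 3.4]}], "b": [0, 100]},
    {"c": [{"a": [1.5, 2.5, 3.4]}], "b": [200, 2]},
]


def scan_c_at_b0(entry):
    """Return c[b[0]].a for one entry, or [] if b[0] is out of range."""
    index = entry["b"][0]
    if 0 <= index < len(entry["c"]):
        return entry["c"][index]["a"]
    return []  # assumption: an out-of-range index yields no values


for row, entry in enumerate(entries):
    print(row, scan_c_at_b0(entry))
```

Under this model, row 0 should list all three values of c[0].a, and only row 1 (where b[0] is 200) should come back empty.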

And how can I print out a specific cell of an array with RDataFrame? I can’t find anything like that in the docs, and the TTreeFormula-like format doesn’t work. It would be really helpful if there were an RDataFrame equivalent of the TTree documentation, showing how to perform the same tasks.

I found out that I can Define() a new branch for b[0] and it works, however such a branch defined for c[0].a or c.a[0] still consists of all the cells of c.a…

Hi @LeWhoo ,

As c.a is an RVec<std::vector<double>> (the outer collection is over the std::vector<C>, the inner collection is over the a data member of each C), c.a[0] is a vector<double>, so that seems correct?

Here’s one way to do it:

df.Define("ca00", "c.a[0][0]").Display("ca00")->Print();

RDataFrame’s docs are here and I’m adding a section about working with collections in this PR, any feedback is welcome.

Cheers,
Enrico

Then I am missing the RDataFrame logic here completely. Why do I have an outer collection? It is not for entries, it is a collection of something else, but I don’t know what.

Thanks, it works, but I have no idea what the first index stands for…

Anyway, unlike TTreeFormula, here the indexing with other variable works:
d.Define("ca01", "c.a[0][b[1]]").Display("ca01")->Print();
gives:

ca01      | 
0.0000000 | 
3.4000000 |

as expected.

Thank you!

Having learned the things above, I have some comments that maybe do not fit this topic. Still:

  1. If I understand correctly, RDataFrame is not a replacement for TTree interfaces. But at the moment one is forced to learn both, as some tasks with RDataFrame are really complicated. This display of an array cell is an example. It involves some strange steps of defining another variable (why? the variable is there!), then some strange extra index, etc.
  2. Those defines+displays are reaaaaally slow.

On the other hand, I understand the cool new capabilities that RDataFrame gives.

About the slowness: it’s a constant overhead due to some C++ code being just-in-time compiled (namely a function that evaluates c.a[0][b[1]]). RDataFrame will be fast on large datasets where this constant overhead of a few seconds does not matter. It is routinely used to run multi-thread analyses on hundreds of GBs of data that produce hundreds or thousands of histograms – that’s where the performance benefits become apparent, I think. (EDIT: in C++ you can skip just-in-time compilation completely by passing C++ callables rather than strings, but when calling RDF through Python, at least for now, it’s required)

About the extra index: it comes directly from your schema, it’s not an RDataFrame thing: c is a std::vector<C>, and each element of c has a member a which is a std::vector<double>. So, for each entry, c.a is a collection of collections: a vector<double> for each element in the vector<C>. It’s the same in TTree.
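The shape described above can be mimicked in plain Python. This is only a model of one entry, with the values inferred from the outputs earlier in the thread, and not ROOT code; `project_a` is a hypothetical helper standing in for what the column c.a holds.

```python
# Plain-Python model of one entry: c is a vector<C>, each C has a vector a.
# Assumption: values match the second entry discussed in this thread.
entry = {"c": [{"a": [1.5, 2.5, 3.4]}], "b": [200, 2]}


def project_a(c):
    """Model of the column c.a: one inner list per element of c."""
    return [element["a"] for element in c]


c_a = project_a(entry["c"])
print(c_a)                    # a list of lists, one inner list per C
print(c_a[0])                 # the whole a vector of the first C
print(c_a[0][entry["b"][1]])  # first index picks the C, second the value in a
```

The last line mirrors the expression c.a[0][b[1]]: the first index selects an element of c, the second selects a value inside that element's a.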

About why the intermediate Define: separation of concerns and keeping features orthogonal. There is one method to Define a new column, and other methods to do things with whatever columns are present. It’s a bit more typing for super simple things, but it scales well to complicated usecases. I completely agree that TTree::Scan or TTree::Draw are faster to type and run for very simple queries, but the other side of the coin is that they don’t scale well (in complexity and performance) to hundreds of histograms and complicated expressions.

Cheers,
Enrico

Yes, of course. It should have been obvious to me from the start, especially since Scan on c[0].a was also displaying the whole vector. I am not sure where my brain was, sorry.

So there are two issues here:

  1. I have a feeling that the need to define a separate variable for an array index is just a missing feature. I understand it is just a matter of improving the parser for RDataFrame, perhaps simply by borrowing some part from TTreeFormula. There are also those cases with empty indices shown in the TTree::Draw/Scan documentation that I am not certain can currently be done with RDataFrame and Define.
  2. I understand that the main target for the whole ROOT package is CERN and LHC, but there are experiments and people like me that use TTrees for uncomplicated queries. I guess it wouldn’t serve ROOT to push such people/experiments further away by replacing functionality that suited them with one that doesn’t… For sure, it would be a shame for us.

To be clear, no functionality is being replaced. TTree is not being deprecated. I was just mentioning what usecase is typically served best by what interface, but of course you should use what works best for you.

Yes, but use of RDataFrame is encouraged, and someone on this forum once replied to me that TTree is being deprecated and the future is RNTuple. Thus it would be really nice if clearly missing features (addressing array indices) and simplifications (such as Scan(), showing the same amount of information) could be added to RNTuple. It would reduce entropy :slight_smile:

Again just to clarify: TTree will never be removed and will not become unsupported any time soon, probably ever given the amount of code that relies on it. RNTuple aims to be in production in a few years and it is a modern TTree substitute that will provide important benefits w.r.t. TTree (mostly in terms of performance, storage use and type-safety) but it will not replace all of its features. Also it won’t be backward-compatible, meaning TTree will (have to) stay in ROOT to serve all users that are reading files written before, say, Run 4, and that rely on its features.

I had looked into factoring out TTreeFormula to allow something like df.Filter(FromFormula("yourttreeformulaexpression")), but unfortunately the TTreeFormula parser is too entangled in TTree internals to be factored out that way, so RDataFrame supports all of C++ but has no way to understand TTree::Draw expressions.

Bottom line: there are things in RDataFrame that require extra typing w.r.t. TTree::Draw. There are also several things that are possible in RDataFrame that are not possible at all with TTree::Draw, e.g. writing out new ROOT files and producing multiple histograms with a single (multi-thread) event loop, or calling arbitrary C++ functions during the event loop.
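To illustrate the "several results from a single event loop" point, here is a toy model in plain Python. It is not RDataFrame code: the class `LazyLoop` and its methods `book` and `run` are hypothetical names, and the entries are made-up values. It only shows the idea of registering computations first and then looping over the data once.

```python
# Toy model of "book several results, then loop over the data once".
# This is not RDataFrame code; LazyLoop and its methods are hypothetical.
class LazyLoop:
    def __init__(self, entries):
        self.entries = entries
        self.booked = {}    # name -> function computing a value per entry
        self.loops_run = 0  # counts how many passes over the data were made

    def book(self, name, func):
        """Register a per-entry computation without running anything yet."""
        self.booked[name] = func

    def run(self):
        """Fill every booked result in a single pass over the entries."""
        results = {name: [] for name in self.booked}
        for entry in self.entries:
            for name, func in self.booked.items():
                results[name].append(func(entry))
        self.loops_run += 1
        return results


loop = LazyLoop([{"b": [0, 100]}, {"b": [200, 2]}])
loop.book("b0", lambda e: e["b"][0])
loop.book("b_sum", lambda e: sum(e["b"]))
results = loop.run()
print(results, loop.loops_run)
```

Both results are filled while the data is traversed only once, which is where the extra typing pays off when many outputs are needed.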

In case you find that certain useful features are completely missing in RDataFrame, please ask for them at Issues · root-project/root · GitHub. For some there might be alternative ways to get the same result in RDF (possibly with a bit more verbosity than in TTree::Draw, admittedly :slight_smile: ), some might be things that we do need to implement.

Cheers,
Enrico

@pcanal Is there any workaround for the bug in TTreeFormula? Also, would you generate a bug ticket? I could do it, but I am not sure I ever did…

Just to add to this thread, the optimal way to store multi-dimensional (jagged) arrays in RNTuple is via std::vector<MyClass>, where MyClass itself can contain (nested) std::vectors. As in TTree, a dictionary of MyClass is required for writing, although not necessarily for reading. In RNTuple, there is no overhead from using std::vector for the serialization of collections and the data gets fully split to columnar layout throughout all nesting levels.
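As a rough picture of what "fully split to columnar layout" means for nested collections, the sketch below flattens entries of a vector of objects, each holding an inner vector, into flat columns. It is a toy illustration in plain Python and does not reproduce RNTuple's actual on-disk format; the function name `split_to_columns` and the sample values are made up for the example.

```python
# Toy illustration of splitting nested collections into flat columns.
# Not RNTuple's actual on-disk format; the function name is hypothetical.
def split_to_columns(entries):
    """Flatten entries shaped like vector<MyClass{vector<float> a}>.

    outer_sizes: number of MyClass elements per entry
    inner_sizes: number of values in a, per MyClass element
    values:      all values of a, stored contiguously
    """
    outer_sizes, inner_sizes, values = [], [], []
    for entry in entries:
        outer_sizes.append(len(entry))
        for element in entry:
            inner_sizes.append(len(element))
            values.extend(element)
    return outer_sizes, inner_sizes, values


# Two made-up entries: the first holds two elements, the second holds one.
entries = [[[1.5, 2.5], [3.4]], [[4.0]]]
outer_sizes, inner_sizes, values = split_to_columns(entries)
print(outer_sizes, inner_sizes, values)
```

Each nesting level contributes its own size column, while the leaf values end up in one contiguous column.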

There are some tutorials available that show the basic RNTuple functionality. There is also RDataFrame support for reading (the pending PR #6700 will bring a big improvement). If you would like to give your sample a try with RNTuple, please don’t hesitate to get in touch with me if you have any questions.
