RDataFrame.Histo2D not working with Friends

Hi all,

I have been trying to use RDataFrame with friended trees, but I am running into trouble when drawing a 2D histogram. It seems RDataFrame is not taking the ‘friended’ trees into account.

I have attached some test code that highlights the issue. It compares the RDataFrame method with the normal tree method and produces two plots. With the RDataFrame method, all values are drawn, whereas with the normal tree method only the 100 events that share the same index value are drawn.

Does anyone have any suggestions?
Many thanks,

Lewis
tree_vs_df.py (1.0 KB)


ROOT Version: 6.14.04
Platform: Not Provided
Compiler: Not Provided


Hi,
would it be possible for you to try your reproducer with ROOT v6.16?
It’s available on cvmfs as an lcg release (accessible from lxplus), as a binary release, and also on conda, so you would not need to recompile ROOT to try.

Cheers,
Enrico

Ah I just noticed that you are using an indexed friend (i.e. you call BuildIndex on the friend): this is not supported by RDF, see https://sft.its.cern.ch/jira/browse/ROOT-9559

It’s very bad that RDF just silently returns wrong values in this case. We will need to at least error out, until we add support for indexed trees.
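To make the failure mode concrete, here is a small ROOT-free sketch in plain Python. The lists and the names `main_entries` and `friend_entries` are made up for illustration and are not taken from the attached script. It contrasts pairing entries by position, which is presumably what you get when the index is ignored, with pairing them by index value:

```python
# Toy stand-ins for two trees: one (eventNumber, value) tuple per entry.
# All numbers here are invented for illustration.
main_entries = [(10, 1.0), (11, 1.1), (13, 1.3), (14, 1.4)]
friend_entries = [(13, 9.3), (10, 9.0), (12, 9.2), (14, 9.4)]


def pair_by_position(main, friend):
    """Pair entry i of the main tree with entry i of the friend (index ignored)."""
    return [(m_val, f_val) for (_, m_val), (_, f_val) in zip(main, friend)]


def pair_by_index(main, friend):
    """Pair entries only when their eventNumber matches."""
    lookup = {evt: val for evt, val in friend}
    return [(m_val, lookup[evt]) for evt, m_val in main if evt in lookup]


positional = pair_by_position(main_entries, friend_entries)
indexed = pair_by_index(main_entries, friend_entries)
print(positional)  # every entry is paired, mostly with the wrong partner
print(indexed)     # only entries with a shared eventNumber are paired
```

In this toy case the positional pairing fills four points while the index-based pairing fills three, which mirrors the "all values drawn" versus "only matching events drawn" difference described in the first post.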

Hi Enrico,

Thanks for the response! I will follow the JIRA issue. Do you know how high a priority this is? The BuildIndex functionality is quite useful, and I would very much like to move to using RDF!

Cheers,

Lewis

It’s higher priority now :smile:
It is not a trivial feature to add, and there is some pushback against having it at all because indexed trees are slow, very slow: effectively you are doing random access into the friend TTree on disk rather than reading it sequentially, which defeats a number of important read-speed optimizations both in ROOT and in your operating system.

So I do not feel confident enough to give a time estimate. It would greatly help, however, if you could comment on the JIRA issue stating why indexed trees are a must-have feature for you. In principle, one could always “unroll” the friend tree and avoid the indexing.

Cheers,
Enrico

What does it mean to “unroll” the tree?

I will post my use case here and see if there is a better way of doing it. (If not, I will add it to the JIRA issue.)

In our analysis we have two trees, one for truth and one for reco. I want to build a migration matrix, which basically maps truth to reco. To do so, I add the truth tree as a friend of the reco tree and then use BuildIndex to align them on the ‘eventNumber’ of each tree. The point is that there will be times when an event is in truth but not in reco, or vice versa, so ‘eventNumber’ is used as the index.

If you have any suggestions that would be great!

Cheers,

Lewis

When I say “unroll” I mean that you could in principle do one pass over the friend tree and write out a new friend tree with the correct entry ordering and same number of entries as the main tree.
In other words, from the current friend tree, it must be possible to generate a new friend tree that, when looped over sequentially, yields the same entries as the original friend tree did when looped over following the eventNumber index.
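As a rough illustration of the idea, here is a ROOT-free sketch in plain Python. The function name `unroll_friend`, the toy data, and the use of `None` as a placeholder for main-tree entries with no match are all assumptions made for this example; with real trees you would have to decide yourself what to write for unmatched entries.

```python
def unroll_friend(main_event_numbers, friend_entries, missing=None):
    """Return friend values reordered to line up entry by entry with the main tree.

    `missing` is a hypothetical placeholder for main-tree entries that have
    no matching friend entry.
    """
    # One pass over the friend: remember the value stored for each eventNumber.
    by_event = {evt: val for evt, val in friend_entries}
    # One pass over the main tree: emit the matching friend value or the placeholder.
    return [by_event.get(evt, missing) for evt in main_event_numbers]


# Toy data: eventNumbers of the main tree, and (eventNumber, value) friend entries.
main_event_numbers = [10, 11, 13, 14]
friend_entries = [(13, 9.3), (10, 9.0), (12, 9.2), (14, 9.4)]

unrolled = unroll_friend(main_event_numbers, friend_entries)
print(unrolled)
```

The unrolled list has exactly one element per main-tree entry, so reading it sequentially next to the main tree gives the same pairing that the eventNumber index would have produced.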

RDF does support friend trees, just not ones with indexes.

The event loop with the “sequential”/“unrolled” friend tree will be at least as fast as the event loop with BuildIndex, and faster in most cases. The new friend tree might occupy more space on disk, but I would be surprised if storage space were a limitation for your use case.

I may not be understanding you correctly, but this seems like it would require two loops: first a loop over the nominal tree to find the correct ordering, and then a loop over the friend tree to find the corresponding event to copy to the new tree. Is this what you mean? If so, this is surely much more computationally expensive.

Yes, you would need a first loop to read in the indexed tree and write out a new tree with all entries in the order in which you will loop over them. But you will only need to do this conversion once, so however computationally expensive it might be, if you then run over this dataset several times the speed-up you will get will make up for this preprocessing step.

Just an idea :smile: It’s not clear to me what the advantage of using indexed friend trees is in the first place (again: random access into a TTree is very bad, performance-wise).

Hi Enrico,

The typical use case for an indexed friend tree is connecting a skimmed and trimmed TTree with the original full TTree to get back information that was trimmed away. In that scenario the main TTree has a subset of the entries in the original TTree, and the index helps skip the entries that were dropped. (In this case the access is not random but monotonically increasing … just sparse.)
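A small ROOT-free sketch of that access pattern, in plain Python. The function name `sparse_forward_match` and the toy data are invented for illustration, and the sketch assumes both inputs are sorted by eventNumber:

```python
def sparse_forward_match(skim_event_numbers, full_entries):
    """Match each skimmed entry to the full tree with a single forward scan.

    Assumes both inputs are sorted by eventNumber, as in the skim/trim
    scenario described above.
    """
    matches = []
    pos = 0  # read position in the full tree; it never moves backwards
    for evt in skim_event_numbers:
        # Skip over full-tree entries that were dropped by the skim.
        while pos < len(full_entries) and full_entries[pos][0] < evt:
            pos += 1
        if pos < len(full_entries) and full_entries[pos][0] == evt:
            matches.append((evt, pos, full_entries[pos][1]))
    return matches


# Toy data: (eventNumber, payload) for the full tree, eventNumbers kept by the skim.
full_entries = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e"), (6, "f")]
skim_event_numbers = [2, 5, 6]

matches = sparse_forward_match(skim_event_numbers, full_entries)
positions = [pos for _, pos, _ in matches]
print(matches)
print(positions)
```

The positions visited in the full tree only ever increase, which is the "sparse but not random" pattern: entries are skipped, but the read never jumps backwards.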

Cheers,

Philippe.
