Fill TTree with RDataFrame

RENATO_QUAGLIANI · October 21, 2018, 11:05am

ROOT Version: Not Provided
Platform: Not Provided
Compiler: Not Provided

Dear all,
I have a naive question. Is it possible to declare a lambda that does some calculations and writes the resulting value into the TTree handled by the RDataFrame? So far I have only seen RDataFrame used to read trees and generate histograms, but I would like to understand whether there is an example showing how to fill a TTree using the dataframe (maybe even update the TTree associated with the dataframe itself).
Thanks in advance
Renato

Danilo · October 21, 2018, 1:24pm

Hi Renato,

of course! That’s what is called “Snapshot”. Here you can find the doc of the method and here an example macro.
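
As a rough illustration of the idea, here is a minimal PyROOT sketch, assuming a ROOT build with PyROOT available. The helper name `snapshot_new_columns` and the tree, file and column names in `example` are hypothetical; only `Define` and `Snapshot` are RDataFrame methods. String expressions passed to `Define` play the role of the lambda when working from Python.

```python
def snapshot_new_columns(df, definitions, tree_name, file_name):
    """Define each (name, expression) pair on df, then write only the new columns.

    df is expected to behave like a ROOT.RDataFrame (or any node derived from it).
    """
    node = df
    for name, expression in definitions:
        # String expressions are compiled just in time by ROOT.
        node = node.Define(name, expression)
    # Snapshot accepts a regular expression selecting the columns to write.
    pattern = "^(" + "|".join(name for name, _ in definitions) + ")$"
    return node.Snapshot(tree_name, file_name, pattern)


def example():
    # Hypothetical input: a tree "Tree1" in "input.root" with branches x and y.
    import ROOT
    df = ROOT.RDataFrame("Tree1", "input.root")
    snapshot_new_columns(df, [("r2", "x * x + y * y")], "Tree2", "output.root")
```

Note that this writes a new tree to a new file; it does not modify the input tree in place.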

Cheers,
Danilo

RENATO_QUAGLIANI · October 21, 2018, 1:43pm

Hi @Danilo,
Thanks a lot, indeed I ended up on that example.
I guess that if I also want to carry over all the other branches of the original TTree, I have to loop over all the branches of the starting TTree and forward them in the column list when making the Snapshot, is that right?

I mean:

I create DataFrame1 with TTree Tree1.
I collect the set of Branches this tree has.
I declare my new variable with some expression
I do a snapshot of DataFrame1 with {SetOfBranches + extraVariableWIthExpression}.
Is this the mechanism one has to use to do a sort of
CopyTree( with TCut ) + add some extra branch?

I don’t know if it has already been benchmarked, but I believe the RDataFrame CopyTree [as described above] with a cut will be much faster than the standard TTree::CopyTree.
Am I correct?

For the record, we are currently using RDataFrame from PyROOT for an analysis, and we were wondering whether RDataFrame is the fastest option available so far compared with any of the pandas, PyROOT, numpy, etc. approaches. We first tried to use the TTreeReader, but Python was not happy with the pointer usage, so we ended up trying RDataFrame, which so far gives very fast plotting for more than 20 branches with very large tuples.

Thanks again,
Renato

PS: I am curious to know whether RDataFrame is actually beating all the Python approaches to inspecting TTree tuples or not.

Danilo · October 21, 2018, 2:18pm

Hi Renato,

a lot in your post. Let me chop it.

> I guess that if I also want to carry over all the other branches of the original TTree, I have to loop over all the branches of the starting TTree and forward them in the column list when making the Snapshot, is that right?
> I mean:
> I create DataFrame1 with TTree Tree1.
> I collect the set of Branches this tree has.
> I declare my new variable with some expression
> I do a snapshot of DataFrame1 with {SetOfBranches + extraVariableWIthExpression}.
> Is this the mechanism one has to use to do a sort of
> CopyTree( with TCut ) + add some extra branch?

What you see in the examples is a way to generate a new tree in a new file which contains the newly defined columns.
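
A minimal PyROOT sketch of the "CopyTree with a cut plus an extra branch" pattern, under the assumption that `df` is a `ROOT.RDataFrame` built on the original tree. The helper name `skim_and_extend` and all tree, file and column names are hypothetical. When `Snapshot` is called without a column list or regular expression, it writes all available columns, so the original branches and the newly defined one end up in the output without collecting the branch names by hand.

```python
def skim_and_extend(df, cut, new_column, expression, out_tree, out_file):
    """Apply a cut, add one column, and write all columns to a new tree."""
    # Filter plays the role of the TCut; Define adds the extra branch.
    node = df.Filter(cut).Define(new_column, expression)
    # No column selection given: Snapshot writes every column, old and new.
    return node.Snapshot(out_tree, out_file)


def example():
    # Hypothetical input: a tree "Tree1" in "input.root" with a branch pt.
    import ROOT
    df = ROOT.RDataFrame("Tree1", "input.root")
    skim_and_extend(df, "pt > 10", "pt2", "pt * pt", "Tree2", "skim.root")
```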

> I don’t know if it has already been benchmarked, but I believe the RDataFrame CopyTree [as described above] with a cut will be much faster than the standard TTree::CopyTree.
> Am I correct?

I think there is no “RDataFrame CopyTree”. Certainly what RDataFrame does is what in TTree jargon is called a “slow clone”, i.e. decompression, deserialisation, serialisation, compression and reclustering of all columns selected for the snapshot.
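
For comparison, a sketch of the two TTree-side options from PyROOT. The helper names are hypothetical; `CloneTree` with the `"fast"` option and `CopyTree` with a selection string are TTree methods. The fast clone copies the compressed baskets as they are and therefore cannot apply a cut, while `CopyTree` with a selection has to read every entry.

```python
def fast_clone(tree, out_file):
    """Copy all entries of tree into out_file without unzipping the baskets."""
    out_file.cd()  # the clone is attached to the current directory
    clone = tree.CloneTree(-1, "fast")
    clone.Write()
    return clone


def copy_with_cut(tree, out_file, selection):
    """Copy only the entries passing selection; every entry is read and rewritten."""
    out_file.cd()
    copy = tree.CopyTree(selection)
    copy.Write()
    return copy
```

Here `tree` would be a `TTree` read from an input file and `out_file` a `TFile` opened for writing.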

> For the record, we are currently using RDataFrame from PyROOT for an analysis, and we were wondering whether RDataFrame is the fastest option available so far compared with any of the pandas, PyROOT, numpy, etc. approaches. We first tried to use the TTreeReader, but Python was not happy with the pointer usage, so we ended up trying RDataFrame, which so far gives very fast plotting for more than 20 branches with very large tuples.

Sounds interesting: we are interested in sharing common benchmarks! Do you have an easily usable repository somewhere that we can try and benchmark?

Cheers,
Danilo

RENATO_QUAGLIANI · October 21, 2018, 7:09pm

Hi @Danilo
I will try to extract some code from our analysis framework this week and produce the Python code to compare the various approaches. What we have observed so far is that RDataFrame with MT enabled is quite fast and scales with sqrt( N ), where N is the number of branches; in this case we fill N 1D histograms. I will come back to this once I have some code to benchmark in a more systematic way.
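
The kind of setup meant here can be sketched as follows. This is only an illustration, not the analysis code: the helper name `book_histograms` and the tree, file and branch names are hypothetical, while `EnableImplicitMT`, `Histo1D` and `GetValue` are ROOT calls. All histograms are booked lazily and filled in one event loop, which starts when the first result is accessed.

```python
def book_histograms(df, columns):
    """Book one lazy 1D histogram per column; nothing runs until a result is accessed."""
    return {name: df.Histo1D(name) for name in columns}


def example():
    # Hypothetical input: a tree "Tree1" in "input.root" with branches pt and eta.
    import ROOT
    ROOT.ROOT.EnableImplicitMT()  # enable before constructing the RDataFrame
    df = ROOT.RDataFrame("Tree1", "input.root")
    histos = book_histograms(df, ["pt", "eta"])
    # The first access triggers a single event loop that fills all booked histograms.
    return {name: h.GetValue() for name, h in histos.items()}
```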

Cheers
Renato
