Merging two TTrees with different variables but 1 common key

Hi everyone. I have two TTrees, t1 and t2, from one .root file. I created an RDataFrame for each tree and now want to merge them. The trees contain different variables and have a different number of columns as RDataFrames, but they share one common key variable per event. I managed to do this successfully with pandas DataFrames, but I want to try it in RDataFrame. The code I used with pandas was simply:

df_combined = pd.merge(df_t1, df_t2)

where I didn’t have to specify the common key: because it’s the only column the two DataFrames share, pd.merge picks it up automatically.

I was wondering how I could do the same in RDataFrame, because I didn’t find anything like .Merge in the documentation. Would one way be to friend the trees? But that wouldn’t really friend them along a common key, or is there an option for that?

Thanks for any help.

ROOT Version: JupyROOT 6.28/00

Hello @helton ,

RDataFrames cannot be merged.

Friending the trees is exactly the way to go, using indexed friend trees. This makes me realize we should add a section to the RDF docs that shows an example. Here’s how you do it; the magic happens thanks to the BuildIndex call:

import ROOT

main_file = ROOT.TFile.Open("main.root")
mainTree = main_file.Get("mainTree")

aux_file = ROOT.TFile.Open("aux.root")
aux_tree = aux_file.Get("auxTree")

# if a friend tree has an index on `commonColumn`, when the main tree loads
# a given row, it then loads the row of the friend tree that has the same value
# of `commonColumn`   
aux_tree.BuildIndex("commonColumn")

mainTree.AddFriend(aux_tree)

df = ROOT.RDataFrame(mainTree)

I realize it’s a bit clunkier than the pandas version, but that comes with having to deal with a more complex/powerful data format that handles larger-than-memory datasets.
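To make the lookup semantics described in the code comments above more concrete, here is a small pure-Python sketch. It does not call ROOT at all; the helper names (`build_index`, `iterate_with_friend`) and the toy rows are hypothetical and only mimic the idea of "for each main row, load the friend row with the same value of `commonColumn`". What ROOT itself does when a key is missing from the friend is not covered here; the sketch simply yields `None` in that case.

```python
# Pure-Python illustration only: no ROOT calls, all names are hypothetical.

def build_index(rows, key):
    """Map each key value to the position of the row that holds it."""
    return {row[key]: pos for pos, row in enumerate(rows)}

def iterate_with_friend(main_rows, friend_rows, key):
    """Yield each main row together with the friend row sharing its key value."""
    index = build_index(friend_rows, key)
    for row in main_rows:
        pos = index.get(row[key])
        # None stands in for "no matching friend row" in this sketch
        friend = friend_rows[pos] if pos is not None else None
        yield row, friend

# Toy data: the two "trees" store the same events in a different order.
main_rows = [{"commonColumn": 3, "pt": 10.0}, {"commonColumn": 1, "pt": 20.0}]
aux_rows = [{"commonColumn": 1, "eta": 0.5}, {"commonColumn": 3, "eta": -1.2}]

pairs = list(iterate_with_friend(main_rows, aux_rows, "commonColumn"))
```

Here each main row is paired with the aux row that has the same `commonColumn` value, regardless of the order in which the rows are stored.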

Cheers,
Enrico

P.S.
Note that there was a bug in using indexed friend trees with multi-threading in ROOT v6.28.00; it is fixed in v6.28.02 and later.

Hi Enrico,

thanks so much! This solves my issue.

Cheers

P.S.
I just opened a PR to improve the docs

Hi Enrico,

I have trouble getting this to work for multiple files. They all have the same structure, and the trees have the same names in each file.
Calling .Get() doesn’t work for lists of files. What can I do instead? I tried to create a TChain for each tree and add all the files to each chain:

chain_main = ROOT.TChain("t_main")
chain_aux = ROOT.TChain("t_aux")
for infile in inFiles:
    chain_main.AddFile(infile)
    chain_aux.AddFile(infile)
chain_main.BuildIndex("eventNumber")
chain_aux.BuildIndex("eventNumber")
chain_main.AddFriend(chain_aux)

rdf=ROOT.RDataFrame(chain_reco)

This takes a very long time (I have around 15 files), and then I get these errors:
Error in TChainIndex::TChainIndex: The indices in files of this chain aren’t sorted.
Error in TTreePlayer::BuildIndex: Creating a TChainIndex unsuccessful - switching to TTreeIndex

So how can I just do it with TTreeIndex then?

Sorry for opening the post up again, but I thought it was still close to the original topic.
Cheers

Hi @helton ,

which line fails exactly? Do you need to construct the RDataFrame object to observe the issue? Note that you don’t need to call BuildIndex on the main chain, only on the friend (and RDataFrame(chain_reco) should be RDataFrame(chain_main)).

Cheers,
Enrico

Hi Enrico,

it’s not really an error, it’s more like a warning message, so it doesn’t point to a specific line. It prints about 10 of the lines I posted, I guess one for each file that is added?

But I do get the error even without constructing the rdf. Any idea?

Cheers

Yes but you can play with the code a bit to figure out at which line the error is printed. For example you can put printouts in-between lines to track execution, or run the code in a debugger one line at a time.

Also, as I mentioned, you don’t need to call BuildIndex on the main chain, only on the friend. Does that help?

Hi Enrico,

removing the BuildIndex call on the main chain removes half of the error messages. So instead of this:

Error in TChainIndex::TChainIndex: The indices in files of this chain aren’t sorted.
Error in TChainIndex::TChainIndex: The indices in files of this chain aren’t sorted.
... (this 10 times)
 Error in TTreePlayer::BuildIndex: Creating a TChainIndex unsuccessful - switching to TTreeIndex

Error in TChainIndex::TChainIndex: The indices in files of this chain aren’t sorted.
Error in TChainIndex::TChainIndex: The indices in files of this chain aren’t sorted.
... (this 10 times)
 Error in TTreePlayer::BuildIndex: Creating a TChainIndex unsuccessful - switching to TTreeIndex

I only get half of this, i.e. the block is printed only once. So the line that triggers the error has to be the chain.BuildIndex call, right? It prints one error line per file I’m adding to the chain.

Would something like this work instead?
Load the trees from all the files into the RDataFrame directly:

frame = RDataFrame("TTreeName", "filename*")

and then index/friend it afterwards?

Cheers

Hi @helton ,

alright so we confirmed this is not related to RDataFrame but it’s a TChain issue, and if I understand correctly the following should be a reproducer:

chain_aux = ROOT.TChain("t_aux")
for infile in inFiles:
    chain_aux.AddFile(infile)
chain_aux.BuildIndex("eventNumber")

We need @pcanal 's help to figure out what the problem is exactly, let’s ping him.

Please share a minimal set of input files (maybe 2-3 are enough) that we can use to reproduce the problem on our side and debug it.

About your last question: the indexing and the friendship need to be set up outside of RDF.

Cheers,
Enrico

Hi Enrico,
yes, this produces exactly the error messages I described. I will send you a link to 3 input files.

Cheers

The type of the message is misleading; it is actually more of a Warning.

TChainIndex has an optimization that speeds up the lookup significantly by requiring the index values to increase monotonically “across” files (e.g. so it knows that it never needs to go backward once it reaches a certain index value).
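As a rough illustration of that ordering requirement, the following pure-Python sketch checks whether index values never go backward from one file to the next. It is not ROOT code and not the actual TChainIndex implementation; the function name `files_sorted_across` and the toy value lists are hypothetical, and the lists stand in for the `eventNumber` values stored in each file, in the order the files were added to the chain.

```python
# Pure-Python illustration only: not the TChainIndex implementation.

def files_sorted_across(per_file_values):
    """Return True if no file starts below the largest value seen in earlier files.

    per_file_values: one list of index values per file, in chain order.
    Values inside a single file may appear in any order.
    """
    previous_max = None
    for values in per_file_values:
        if not values:
            continue  # an empty file cannot break the ordering
        if previous_max is not None and min(values) < previous_max:
            return False
        previous_max = max(values)
    return True

# Ranges [1..3] then [4..5]: values only move forward across files.
ordered = files_sorted_across([[2, 1, 3], [5, 4]])

# Ranges [1..5] then [3..7]: the second file goes back below 5.
overlapping = files_sorted_across([[1, 5], [3, 7]])
```

In this sketch `ordered` is `True` and `overlapping` is `False`; the second case corresponds to the situation in which the "indices in files of this chain aren’t sorted" message would be expected.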

To avoid the message (and end up with the exact same result as you get now with the error message), do the following (after translating it to Python):

// Build a TTreeIndex on the chain directly instead of going through BuildIndex
auto t = new TTreeIndex(chain, "eventNumber", "");
if (t->IsZombie()) {
   // ... print error message ...
   delete t;
   return error_code;
}

Hello pcanal,

thanks for clarifying!

Cheers