Inner join between TTrees

jasot · October 21, 2017, 4:37pm

Hello,

I have two TTrees with the same number of entries but different branches, and I would like to merge them. It would be the same as an INNER JOIN in SQL. Of course, I have a common branch that serves as the key variable in both TTrees.

I know how to do it by going event by event and declaring all the branches of both TTrees to store them in a new one, but this is too wordy, and I wonder if there is a more elegant and flexible way to do it.

By elegant I mean that I would not need to declare all the branches in the TTrees (only the key variable/branch). By flexible I mean that I would not need to change the code if a new branch is added to one of the TTrees.

I saw that someone tried to do the same some time ago, but it looks like there was no solution at that time:

Thanks!
J.

eguiraud · October 21, 2017, 5:13pm

Hi,
just to be sure I understand, you have two trees t1 and t2 with the same number of events but different structure, which both have a branch "key" with the same set of unique values – but the order of its values is different for t1 and t2.

You want to write an output tree t3 with the branches of both trees. Each entry in t3 should have branch values that correspond to events in t1 and t2 with the same key.

Is this correct?
When you do the “join” in your current implementation, given an entry in t1 with key value x how do you find which entry in t2 has the same value x? Linear search?

Cheers,
Enrico

behrenhoff · October 21, 2017, 6:16pm

I’d like to support jasot in the request for a general tool for inner joining of trees.

I have written such a tool (can’t share it) that works with non-object branches only* (I didn’t know how to implement copying of arbitrary objects, especially how to create such a branch). The reason why I need this is TMVA: you can only use floats in TMVA (input variables and spectators). Often, I need other variables as spectators, for example strings (/C branches) or longs/L. To work around this limitation, I add a float unique index to the original tree and add this float as spectator. After training, I can join the TMVA tree with the original tree using my float variable to join. The TMVA trees are usually rather small, i.e. you can create a map unique_index->event number, then LoadBaskets, and then do random access on the TMVA tree while looping sequentially over the potentially large original tree.

*Actually, I can deal with objects in one tree, but not in both trees. As TMVA only has float branches, I can do a CloneTree on the original events and only add the relevant branches from TMVA. Therefore this is not a limitation in my use case.
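
The map-based join described in this post can be sketched in plain C++, without ROOT. This is only an illustration under stated assumptions: `hashJoin` and the two key vectors are hypothetical stand-ins for the key branch of each tree read into memory, and the keys are assumed to be unique integers. The returned entry pairs are what one would then use for sequential access on the large tree and random access on the small one.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper: returns (entry in large tree, entry in small tree)
// for every key that appears in both inputs.
std::vector<std::pair<std::size_t, std::size_t>>
hashJoin(const std::vector<long>& largeKeys, const std::vector<long>& smallKeys)
{
    // Pass 1: loop once over the small tree and remember where each key lives.
    std::unordered_map<long, std::size_t> keyToEntry;
    keyToEntry.reserve(smallKeys.size());
    for (std::size_t i = 0; i < smallKeys.size(); ++i) {
        const bool inserted = keyToEntry.emplace(smallKeys[i], i).second;
        assert(inserted && "keys are assumed to be unique");
        (void)inserted;  // silence unused-variable warnings when NDEBUG is set
    }

    // Pass 2: loop sequentially over the large tree and look each key up.
    std::vector<std::pair<std::size_t, std::size_t>> matches;
    for (std::size_t i = 0; i < largeKeys.size(); ++i) {
        const auto it = keyToEntry.find(largeKeys[i]);
        if (it != keyToEntry.end()) {
            matches.emplace_back(i, it->second);  // inner join: keep matches only
        }
    }
    return matches;
}
```

Entries of the large tree without a partner in the small tree are simply skipped, which is the inner-join behaviour asked for at the top of the thread.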

jasot · October 21, 2017, 7:11pm

Hi!
Yes, it is exactly as you explained.
And yes, now I use a linear search to join t1 and t2, although normally both TTrees are ordered by the key, so this is not a problem.

Thanks!

eguiraud · October 21, 2017, 7:16pm

Interesting! When you say the original tree is “large” and the TMVA tree is “small”, you mean in terms of number of branches, right? They must have the same number of events.

Your workaround requires a full loop over the TMVA tree to create the map unique_index->event_number and then a full loop over the original tree while you do random access in the TMVA tree – so it does not scale to many events or two trees both having many branches, or situations in which switching files continuously might be very expensive (e.g. over the network).

That said, I don’t have a better (general) solution. If one of the two trees fits in memory, it might be more efficient to pre-load all of it, sort it by the key, and then loop over both the original tree and the sorted data in parallel, but of course this does not scale either.
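
The "sort one side, then loop over both in parallel" idea is a sort-merge join. A minimal sketch in plain C++ follows, not tied to ROOT. The names `KeyedEntry` and `mergeJoin` are hypothetical, the keys are assumed to be unique integers, and the tree that is read sequentially is assumed to be already ordered by key (as jasot says is normally the case).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical record: a key value plus the entry number it came from.
struct KeyedEntry {
    long key;
    std::size_t entry;
};

// sortedKeys: keys of the tree read sequentially, assumed ascending.
// loaded:     key/entry pairs of the pre-loaded tree, in any order.
// Returns (index into sortedKeys, entry in pre-loaded tree) for equal keys.
std::vector<std::pair<std::size_t, std::size_t>>
mergeJoin(const std::vector<long>& sortedKeys, std::vector<KeyedEntry> loaded)
{
    assert(std::is_sorted(sortedKeys.begin(), sortedKeys.end()));

    // Sort the in-memory side by key once.
    std::sort(loaded.begin(), loaded.end(),
              [](const KeyedEntry& a, const KeyedEntry& b) { return a.key < b.key; });

    // Advance two cursors in parallel; each side is visited exactly once.
    std::vector<std::pair<std::size_t, std::size_t>> matches;
    std::size_t i = 0, j = 0;
    while (i < sortedKeys.size() && j < loaded.size()) {
        if (sortedKeys[i] < loaded[j].key) {
            ++i;                                   // no partner for this key
        } else if (loaded[j].key < sortedKeys[i]) {
            ++j;                                   // no partner for this key
        } else {
            matches.emplace_back(i, loaded[j].entry);
            ++i;
            ++j;
        }
    }
    return matches;
}
```

Compared with a map lookup per entry, this needs no hash table, but the memory cost of holding one side in full is the same, so it does not remove the scaling limit mentioned above.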

eguiraud · October 21, 2017, 7:16pm

In this case can’t you just do the two loops in lockstep?

behrenhoff · October 21, 2017, 7:41pm

No, not necessarily. You can apply cuts inside TMVA, i.e. the resulting tree might only contain a tiny fraction of the events. That said, it is usually much more efficient to copy the tree with the cut and limit the number of cuts in TMVA, as otherwise it might take hours until TMVA actually starts training (I haven’t figured out what other things TMVA is doing apart from throwing away unwanted events - the same applies to unused branches: not having them in the tree at all makes TMVA faster). So the TMVA tree might contain 20 Float_t branches and have 500k events, while the “large” tree might contain anything and potentially orders of magnitude more events. Also, you have a training and a testing tree, both containing only parts of the whole tree.
Another argument for applying the cut first: by using a float as the key, you are limited to 4e9 different values. However, it is much more convenient to have an integer as the key - that reduces the number of usable float values even further. The original tree might be larger. But you cannot use that many events for training anyway (time/memory).
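
To make the last point concrete: assuming Float_t is an IEEE-754 single-precision float with a 24-bit significand, consecutive integers are represented exactly only up to 2^24 = 16,777,216, so an integer-valued float key stops being unique beyond that. The helper names below are hypothetical and only illustrate the arithmetic.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Largest N such that every integer in [0, N] is exactly representable
// as a float: 2^digits, i.e. 2^24 for IEEE-754 single precision.
constexpr std::int64_t maxConsecutiveFloatInt()
{
    return std::int64_t{1} << std::numeric_limits<float>::digits;
}

// True if the index survives a round trip through a float unchanged.
bool fitsInFloat(std::int64_t index)
{
    assert(index >= 0 && "indices are assumed to be non-negative");
    const float asFloat = static_cast<float>(index);
    return static_cast<std::int64_t>(asFloat) == index;
}
```

Under that assumption, `fitsInFloat(16777216)` holds while `fitsInFloat(16777217)` does not, because 16777217 rounds to a neighbouring representable value and would collide with another event's key.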

jasot · October 21, 2017, 7:49pm

Yes, this is what I do now. Two loops, one per TTree, matching the events one by one, and then filling the branches. I can do it, and it works well. My question was whether there exists a more general and more practical “inner join”-type solution, but after behrenhoff’s comment, I guess it has not been implemented yet…

If I understood correctly, I do something similar to behrenhoff. To avoid copying all the branches one by one, I first clone the first tree, and then I bring in some of the relevant branches from the second TTree. The problem is that there are many branches in this second TTree, and they can change from one file to another, so I have to adapt the code every time…

eguiraud · October 21, 2017, 7:56pm

If you don’t do random access in one of the trees but loop over both of them in lockstep, you can just add one tree as a friend of the other (with TTree::AddFriend) to do this automatically.

jasot · October 21, 2017, 8:42pm

Hi again,

I didn’t know this function. I think this is not exactly what I needed, but it could be an alternative solution. With AddFriend, if I understand correctly, instead of creating a new TTree with all the information together, you just create the index in every tree, and then when you analyse the data you just access the other tree as a friend.

Does it work with several “parent” and “friend” trees? I mean, if the events of every tree to join are spread over several files. For example, I would have trees t1_1 and t1_2 with N events each, and I want to join them with the N/2 events in t2_1, N/2 events in t2_2, N/2 events in t2_3 and N/2 events in t2_4.

Something like:

TChain *T = new TChain("treename");
T->Add("treeparent1.root");
T->Add("treeparent2.root");
T->AddFriend("TF","treefriend1.root");
T->AddFriend("TF","treefriend2.root");
T->AddFriend("TF","treefriend3.root");
T->AddFriend("TF","treefriend4.root");
T->Draw("var1:TF.var2");

Thanks for your help!

eguiraud · October 21, 2017, 8:55pm

Hi jasot,
you should create two chains, one for the parent files and one for the friend files, and add one as a friend of the other.
Effectively this means you can access the branches of the second (friend) chain as if they were branches of the first one.
