Merging two datasets with different selections

Hi! I am using RDataFrame in PyROOT. I have two data sets, to which I want to apply different selections. I load my two data sets into different RDataFrames and apply the selections independently. Then, I want to merge them.

I want to know if it is possible to merge two RDataFrames. If not:

  • I understand that one option would be to Snapshot both of them into TTrees and then merge the trees, but this seems memory-demanding.

  • A colleague suggested using the “AsNumpy” function to save the relevant columns and append the two arrays. This would work but it would be a bit inconvenient.

  • Any other option?

Hi @Clara_Landesa_Gomez ,

and welcome to the ROOT forum!

In the following I will assume that the two trees have the same schema (same branches with the same names and types) and that you want to concatenate the datasets vertically (i.e. make a dataset that has the same columns as the two original datasets and the union of their rows). If that’s not the case, please clarify what you mean by merging here.

One option, as you say, is to Snapshot the two skimmed datasets and then read them back together. That will require storage for the skimmed versions of the two datasets, but if that’s not a problem, this might be a one-time (or at least rare) operation, depending on your analysis workflow, and it might prove to be the simplest solution.
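
The Snapshot route from the question could look roughly like the sketch below. It is only an illustration under assumptions: the tree name, file names and cuts are whatever you pass in, and the helper names (skim_path, skim_and_concatenate) are hypothetical, not part of ROOT. ROOT is imported inside the function so that the small path helper can be tried on its own.

```python
import os


def skim_path(path, suffix="_skim"):
    """Hypothetical naming helper: 'f1.root' -> 'f1_skim.root'."""
    base, ext = os.path.splitext(path)
    return f"{base}{suffix}{ext}"


def skim_and_concatenate(tree_name, selections):
    """Write one skimmed file per input, then read all skims back together.

    `selections` maps an input file name to the filter expression that
    should be applied to that file (an assumption made for this sketch).
    """
    import ROOT  # imported lazily; requires a ROOT installation

    skimmed = []
    for path, cut in selections.items():
        out = skim_path(path)
        # Snapshot runs the event loop and writes the selected rows to `out`.
        ROOT.RDataFrame(tree_name, path).Filter(cut).Snapshot(tree_name, out)
        skimmed.append(out)

    # A list of files is read as one vertically concatenated dataset.
    return ROOT.RDataFrame(tree_name, skimmed)
```

With the example values used in this thread, a call would be skim_and_concatenate("Events", {"f1.root": "x > 1.5", "f2.root": "x > 2."}). The skimmed files stay on disk, so later runs can open them directly.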

Otherwise you can read both datasets into the same RDataFrame object (e.g. with RDataFrame("Events", {"f1.root", "f2.root"}), which will concatenate them vertically, or by first building a TChain and then passing that to the RDataFrame constructor). You can then change the behavior of your Filters based on which file is being processed by using DefinePerSample, e.g.:

# single quotes outside, so the C++ string literal "f1" can sit inside the expression
df.DefinePerSample("isMC", 'rdfsampleinfo_.Contains("f1")') \
  .Filter("isMC ? x > 1.5 : x > 2.")
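
For PyROOT, the same idea can be written out in full. This is a minimal sketch rather than a definitive recipe: the tree name "Events", the file names, the column x and the cut values come from the example above, while the helper names (sample_flag_expr, per_sample_cut, select_concatenated) are hypothetical. ROOT is imported inside the function so the two string helpers can be checked without it, and DefinePerSample needs a ROOT version that provides it (6.26 or later, as far as I know).

```python
def sample_flag_expr(substring):
    """C++ expression for DefinePerSample: true when the identifier of the
    sample being processed (file and tree name) contains `substring`."""
    return f'rdfsampleinfo_.Contains("{substring}")'


def per_sample_cut(flag, cut_if_set, cut_otherwise):
    """C++ ternary that picks one of two cuts depending on a boolean column."""
    return f"{flag} ? ({cut_if_set}) : ({cut_otherwise})"


def select_concatenated(tree_name, files, substring, cut_if_match, cut_otherwise):
    """Read all files as one dataset and apply a different cut per sample."""
    import ROOT  # imported lazily; requires a ROOT installation

    df = ROOT.RDataFrame(tree_name, files)  # `files` is a Python list of names
    return (
        df.DefinePerSample("isMC", sample_flag_expr(substring))
          .Filter(per_sample_cut("isMC", cut_if_match, cut_otherwise))
    )
```

Calling select_concatenated("Events", ["f1.root", "f2.root"], "f1", "x > 1.5", "x > 2.") gives a single RDataFrame with both selections applied and nothing written to disk. Note that the match is a plain substring test, so "f1" would also match a file called "f10.root".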

I hope this helps!