Merging two datasets with different selections

Hi! I am using RDataFrame in PyROOT. I have two data sets, to which I want to apply different selections. I load my two data sets into different RDataFrames and apply the selections independently. Then, I want to merge them.

I want to know if it is possible to merge two RDataFrames. If not:

  • I understand that one option would be to Snapshot both of them into TTrees and then merge the trees, but this seems memory-demanding.

  • A colleague suggested using the “AsNumpy” function to save the relevant columns and append the two arrays. This would work but it would be a bit inconvenient.

  • Any other option?


Hi @Clara_Landesa_Gomez ,

and welcome to the ROOT forum!

In the following I will assume that the two trees have the same schema (same branches with the same names and types) and that you want to concatenate the datasets vertically (i.e. make a dataset that has the same columns as the two original datasets and the union of their rows). If that’s not the case, please clarify what you mean by merging here.

Snapshotting both selections and merging the resulting trees will require storage for the skimmed versions of the two datasets. If that’s not a problem, it might prove to be the simplest solution: depending on your analysis workflow, it may be a one-time operation (or a rare one anyway).
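The Snapshot route from the original question could be sketched roughly as follows. This is only an illustration under assumptions: the helper names skim_name and skim_and_recombine, the "skim_" output prefix, and the tree, file, and column names in the usage line are all hypothetical, and both inputs are assumed to share the same tree name and schema.

```python
import os


def skim_name(path, prefix="skim_"):
    """Output file name for the skimmed copy of `path` (hypothetical naming scheme)."""
    return prefix + os.path.basename(path)


def skim_and_recombine(tree_name, cuts_by_file, columns):
    """Snapshot each input with its own cut, then read all skims as one RDataFrame."""
    import ROOT  # imported here so skim_name stays usable without ROOT installed

    outputs = []
    for path, cut in cuts_by_file.items():
        out = skim_name(path)
        # Snapshot runs the event loop and writes only the selected rows
        # and the requested columns to the output file.
        ROOT.RDataFrame(tree_name, path).Filter(cut).Snapshot(tree_name, out, columns)
        outputs.append(out)

    # Several files holding a tree of the same name are concatenated vertically.
    return ROOT.RDataFrame(tree_name, outputs)
```

It might be called as skim_and_recombine("Events", {"f1.root": "x > 1.5", "f2.root": "x > 2."}, ["x"]). Writing only the columns that are needed downstream keeps the extra storage small.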

Otherwise you can read both datasets into the same RDataFrame object (e.g. with RDataFrame("Events", {"f1.root", "f2.root"}), which will concatenate them vertically, or by first building a TChain and then passing that to the RDataFrame constructor) and then change the behavior of your Filters based on which file is being processed by using DefinePerSample, e.g.:

# isMC is true for entries read from a file whose name contains "f1"
df = (df.DefinePerSample("isMC", 'rdfsampleinfo_.Contains("f1")')
        .Filter("isMC ? x > 1.5 : x > 2."))

I hope this helps!
Cheers,
Enrico

Hello, this is Gulshan Negi
In conclusion, PyROOT does not allow for the direct merging of two RDataFrames. However, depending on the size of your datasets, memory constraints, and specific analysis requirements, there are alternative methods you might want to consider.

If memory allows, you can snapshot each RDataFrame into TTrees, selecting only the relevant columns to optimize memory usage. Then, merge the TTrees using TTree merging techniques.

Make use of the AsPandas function to transform each RDataFrame into a Pandas DataFrame. Using Pandas’ merging functions, apply independent selections to each DataFrame and combine them.

Consolidate the two RDataFrames into a single RDataFrame if separate selections are not required. Make use of the filtering capabilities of the merged RDataFrame to apply various selections.

To choose the best strategy for your particular situation, weigh the benefits and drawbacks of memory usage, computational efficiency, and ease of use. For more in-depth instructions and examples, consult the PyROOT and pandas documentation.

Thanks


Hi,

RDF has no AsPandas function. This looks like a ChatGPT answer to me (including the usual hallucinations). @Gulshan212 please refrain from posting unsubstantiated information :slight_smile:
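For reference, the function that does exist is AsNumpy, the one mentioned in the original question. A minimal sketch of that route, assuming both datasets expose the same scalar columns; the helper names concat_columns and selected_columns are made up for this illustration:

```python
import numpy as np


def concat_columns(first, second):
    """Row-wise concatenation of two {column name: numpy array} dicts."""
    if set(first) != set(second):
        raise ValueError("both inputs must have the same columns")
    return {name: np.concatenate([first[name], second[name]]) for name in first}


def selected_columns(tree_name, path, cut, columns):
    """Apply `cut` to one dataset and return the chosen columns as numpy arrays."""
    import ROOT  # imported here so concat_columns works without ROOT installed

    # AsNumpy triggers the event loop and returns a dict of numpy arrays.
    return ROOT.RDataFrame(tree_name, path).Filter(cut).AsNumpy(columns=columns)
```

The two selections would then be combined with concat_columns(selected_columns(...), selected_columns(...)). Everything ends up in memory, so this only suits a modest number of columns and rows; a dict of one-dimensional arrays like this can also be handed to the pandas.DataFrame constructor.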

Cheers,
Enrico
