Making a single RDataFrame from lots of files

Howdy y’all,
Me again. Same issue as last time (trying to read in lots of analysis files), but now with a fun new quirk.

  1. I have about 100 files and about 200GB of data.
  2. This can be split into different categories: files->[data, MC].
  3. These can be split into different components: MC->[signal, background1, background2] etc.
  4. These can be split into different contributions: signal->[[electron events, muon events],[2016, 2017]]
  5. These contributions are then split across multiple datasets: muon16->[id10001,id10002,id10003] etc
  6. These datasets each have several hundred (systematic) trees containing identically named branches.

OK, so what I did was make a class for each component (from 3.). Each class has a dictionary with one entry per tree; each entry holds an RDataFrame built from that tree name and a list of dataset files.

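import ROOT

c = my_component(name="signal component")  # user-defined class, one RDataFrame per systematic tree
cfiles = ROOT.std.vector("string")()  # the element type must be given explicitly
for f in ["one.root", "two.root", "three.root"]:
    cfiles.push_back(f)
for systematic in ["nominal", "up1", "down1", "up2", "down2"]:
    c.syst[systematic] = ROOT.RDataFrame(systematic, cfiles)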

Two issues:
a. Different components (from 4.) have different branches, e.g. electrons, muons etc. This is normally handled with an "if n_electron > 0: electron[0].pt() > 30, else if n_muon > 0: muon[0].pt() > 15" type of affair.
b. Different datasets need scale factors corresponding to their id, e.g. 1.0 for every event in id10001 but 0.8 for every event in id10002.

So my question is this:
I’d like to create a single dataframe from multiple files whilst still remembering which file each event came from, as different files might require different selections.

Relatedly, if I try to plot e.g. the leading muon pt in an event with no muons, I get a segfault (which is very difficult to pin down). Is there no ‘ignore null pointer’ option for plotting?

An idea that has been suggested before is running this multiple times to create more slimmed-down trees that can then be hadded… These files have already been slimmed down twice, so I’m looking for ways to avoid doing this another 2 times. In pandas or R I’d just make multiple DataFrames and concatenate them, but I know this isn’t possible in ROOT.

All the best,


ROOT Version: Not Provided
Platform: Not Provided
Compiler: Not Provided

You can use RVec::at instead of operator[] to access an element with a fallback value in case the element does not exist.
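A minimal sketch of that suggestion, assuming a hypothetical flat column named muon_pt of type RVec<float> (the files in this thread store muon objects with a pt() method, so the expression would need adapting). The helper names and the -1 fallback are illustrative, not part of ROOT:

```python
def leading_or_fallback(column, fallback=-1.0):
    """Build a JIT expression that reads element 0 of an RVec column,
    returning `fallback` instead of reading out of bounds when it is empty."""
    return f"{column}.at(0, {float(fallback)}f)"


def book_leading_pt(df, column="muon_pt", threshold=15.0):
    """Define the leading value with a fallback, cut on it, and book a histogram.

    `df` is an existing RDataFrame node; nothing runs until the result is used.
    Events with an empty collection get the fallback and fail the cut.
    """
    lead = "lead_" + column
    return (df.Define(lead, leading_or_fallback(column))
              .Filter(f"{lead} > {threshold}")
              .Histo1D(lead))
```

With the defaults, leading_or_fallback("muon_pt") produces the expression muon_pt.at(0, -1.0f), which can be passed to Define or Filter directly.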

If the computation graph needs to be different for different input files (e.g. for input A you need to filter events based on n_electron, for input B you need to filter on n_muon), what you want is 2 separate computation graphs. As discussed in the previous thread, RDF assumes that, for the columns used in the computation graph, all files have the same structure (there can be unused columns that are present in some files and absent in others).

You can use ROOT::RDF::RunGraphs to then run the event loops for all your different dataframes concurrently.
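A rough sketch of how separate graphs plus RunGraphs might look in PyROOT. The file names, the flat electron_pt / muon_pt columns and the CHANNELS layout are assumptions made for illustration; only RDataFrame, Filter, Count and ROOT.RDF.RunGraphs are actual ROOT calls:

```python
# Hypothetical per-channel configuration: file names and cuts are placeholders.
CHANNELS = {
    "electron": {
        "files": ["el16.root", "el17.root"],
        "cut": "n_electron > 0 && electron_pt.at(0, -1.f) > 30",
    },
    "muon": {
        "files": ["mu16.root", "mu17.root"],
        "cut": "n_muon > 0 && muon_pt.at(0, -1.f) > 15",
    },
}


def book_channels(tree_name, channels=CHANNELS):
    """Build one computation graph per channel and book a lazy event count on each."""
    import ROOT  # imported lazily so the configuration can be inspected without ROOT

    booked = {}
    for name, cfg in channels.items():
        files = ROOT.std.vector("string")()
        for path in cfg["files"]:
            files.push_back(path)
        df = ROOT.RDataFrame(tree_name, files)
        booked[name] = df.Filter(cfg["cut"], name + " selection").Count()
    return booked


def run_all(booked):
    """Trigger the event loops of all booked graphs in one call, then read the results."""
    import ROOT

    ROOT.RDF.RunGraphs(list(booked.values()))
    return {name: result.GetValue() for name, result in booked.items()}
```

Usage would be something like run_all(book_channels("nominal")). The loops only run concurrently if ROOT.EnableImplicitMT() was called beforehand; otherwise RunGraphs processes them one after another.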


Hi yes, but I realised that my issue is not that the columns are missing (e.g. events with no muons still have a column called muons, it's just normally empty).

Mostly it's a case of how to add an identifier to different files. All files have a column called id, set to 0 for events from one file and 1 for events from another?

You can do exactly the same in a Define.

I don’t understand whether you have the id column in the files already or not. If it’s there, I guess that’s a solution, right? If not: the nice solution would be DefinePerSample (GH issue), which we want to add in the future. In the meantime, you could create friend trees beforehand that contain only one column with the id, and then add those friends to the main TTrees that you process.
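The friend-tree workaround could be sketched roughly as below. The column name sample_id, the "_id" tree suffix and all helper names are hypothetical; the sketch also assumes each friend file is written with exactly as many entries as its input tree and that both chains list their files in the same order:

```python
def scale_factor_expr(scale_factors, id_column="sample_id", default=1.0):
    """Nest ternaries so that each dataset id maps to its scale factor."""
    expr = repr(float(default))
    for sample_id, sf in reversed(list(scale_factors.items())):
        expr = f"({id_column} == {int(sample_id)} ? {float(sf)!r} : {expr})"
    return expr


def write_id_friend(tree_name, in_path, out_path, sample_id, id_column="sample_id"):
    """Write a friend tree with one constant id column and as many entries as the input tree."""
    import ROOT  # imported lazily; only needed when files are actually processed

    n_entries = ROOT.RDataFrame(tree_name, in_path).Count().GetValue()
    (ROOT.RDataFrame(n_entries)
         .Define(id_column, str(int(sample_id)))
         .Snapshot(tree_name + "_id", out_path))


def dataframe_with_ids(tree_name, files_and_friends):
    """Chain the main files and their id friends, then build one RDataFrame on top."""
    import ROOT

    main = ROOT.TChain(tree_name)
    ids = ROOT.TChain(tree_name + "_id")
    for in_path, friend_path in files_and_friends:
        main.Add(in_path)
        ids.Add(friend_path)
    main.AddFriend(ids)
    # The chains are returned as well because they must outlive the dataframe.
    return ROOT.RDataFrame(main), main, ids
```

The per-dataset weight from point b. could then be attached with df.Define("sf", scale_factor_expr({10001: 1.0, 10002: 0.8})). Since the systematic trees may differ in entry count, a separate friend would probably be needed for each tree name.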

Hi yes. The DefinePerSample seems exactly what is needed here. Thanks.
