Merging Datasets using RDataFrame

vcroft · August 26, 2020, 3:26pm

ROOT Version: Not Provided
Platform: Not Provided
Compiler: Not Provided

Howdy folks, I’m looking for a nice way of merging the output from multiple files in an RDataFrame:

file1 = mymuons.root
file2 = myelectrons.root

Now there’s a simple way to do this: just load them into the same RDF, since they even have the same tree name.
Buuuut… mymuons.root contains variables like muon_pt[0], while myelectrons.root has variables like tau_pt[0] and so on.

So what I’ve done until now is define new variables like lepton_pt that are common between both. There are also others, like

has_passed_triggers = '(find(begin(passedTriggers), end(passedTriggers), "HLT_mu20_iloose_L1MU15") != end(passedTriggers) || find(begin(passedTriggers), end(passedTriggers), "HLT_mu26_ivarmedium") != end(passedTriggers) || find(begin(passedTriggers), end(passedTriggers), "HLT_mu50") != end(passedTriggers))'

A single dataset with the same tree might come from 6 different files with 6 different minor differences in variable names.
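A long trigger expression like the one above is easy to get wrong by hand (unbalanced parentheses, wrong quotes). As a minimal sketch, assuming passedTriggers is a branch holding a collection of strings, the expression could be generated from a list of trigger names. The helper name any_trigger_expr is hypothetical, not part of ROOT.

```python
def any_trigger_expr(triggers, column="passedTriggers"):
    """Return a C++ expression that is true if any of `triggers` is in `column`."""
    if not triggers:
        raise ValueError("need at least one trigger name")
    # One find() test per trigger. The names are emitted as C++ string
    # literals (double quotes), since single quotes would be char literals.
    tests = [
        'find(begin({c}), end({c}), "{t}") != end({c})'.format(c=column, t=name)
        for name in triggers
    ]
    return "(" + " || ".join(tests) + ")"


muon_triggers = ["HLT_mu20_iloose_L1MU15", "HLT_mu26_ivarmedium", "HLT_mu50"]
has_passed_triggers = any_trigger_expr(muon_triggers)
# The string can then be passed to RDataFrame, for example:
#   df = df.Define("has_passed_triggers", has_passed_triggers)
```

Each input file can then get its own trigger list while the name of the resulting column stays the same.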

How do I add these together? Is there a clever way of mapping all variables in a subset of the data frame? Maybe MakeLazyDataFrame or Alias?

StephanH · August 26, 2020, 3:40pm

@eguiraud?

Can you build a chain of trees with certain branches missing?

Do you need an RDataSource, which dynamically switches between branches depending on availability?
In a different framework, we had this:

framework.registerBranch("lep_pt", {"lep_pt", "el_pt", "mu_pt", "tau_pt"});

That would just look up those branches from left to right, and expose that as lep_pt if one is found.
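Something similar to that left-to-right lookup can be sketched on top of RDataFrame with GetColumnNames and Alias. This is only an illustration: first_available and register_alias are hypothetical helper names, and the sketch works on one dataframe (one schema) at a time, not across a chain.

```python
def first_available(columns, candidates):
    """Return the first name in `candidates` that appears in `columns`."""
    available = set(str(c) for c in columns)
    for name in candidates:
        if name in available:
            return name
    raise KeyError("none of {} found".format(list(candidates)))


def register_alias(df, alias, candidates):
    """Expose the first available candidate column under the name `alias`."""
    source = first_available(df.GetColumnNames(), candidates)
    # If the tree already has a column with the requested name, use it as-is.
    return df if source == alias else df.Alias(alias, source)


# Intended usage with a ROOT.RDataFrame instance `df`:
#   df = register_alias(df, "lep_pt", ["lep_pt", "el_pt", "mu_pt", "tau_pt"])
```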

eguiraud · August 26, 2020, 3:48pm

Hi,
you can build a chain (or equivalently pass to RDF’s ctor) different trees with different branches and process them in the same event loop, as long as you only touch the branches they have in common. This does not seem to be your case, so you cannot run the exact same computation graph on all these trees in one event loop that chains the trees one after the other. RDF assumes that all trees in a chain have the same schema (at least as far as the branches you use are concerned). @swunsch and I are looking into relaxing this requirement a bit, but there is nothing in RDF at the moment.

What you can do is run separate event loops in separate computation graphs, but reuse code:

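import ROOT

files = ["mymuons.root", "myelectrons.root"]
results = {}
for f in files:
  df = ROOT.RDataFrame("treename", f)
  # convert to plain Python strings so the membership test below is unambiguous
  col_names = [str(c) for c in df.GetColumnNames()]
  df = df.Alias("lepton_pt", "muon_pt" if "muon_pt" in col_names else "tau_pt")
  # now this function can assume that a "lepton_pt" branch is always present
  df = transform_df_as_you_wish(df)
  # results are lazy: each event loop runs when its histogram is first accessed
  results[f] = df.Histo1D("lepton_pt")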

I am not 100% sure I answered the question, but let’s go from here
Cheers,
Enrico

StephanH · August 26, 2020, 4:54pm

With Enrico’s suggestion, you are almost there. If you Snapshot for each input file, you will be left with a bunch of output files.
You can hadd those to merge the trees.

It’s not the most efficient solution, but at least you can do this “off the shelf”.
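The Snapshot-per-file step could look roughly like the sketch below. The output directory "harmonised", the "_common" suffix, and the helper names are assumptions made up for illustration. The sketch uses Define rather than Alias so that lepton_pt is an ordinary new column when it is written out, and it fills a std::vector of strings explicitly because not every ROOT version converts a Python list for Snapshot.

```python
import os


def harmonised_path(input_path, out_dir="harmonised"):
    """Map e.g. 'data/mymuons.root' to '<out_dir>/mymuons_common.root'."""
    stem = os.path.splitext(os.path.basename(input_path))[0]
    return os.path.join(out_dir, stem + "_common.root")


def snapshot_common(input_path, tree_name, candidates, keep, out_dir="harmonised"):
    """Write the `keep` columns plus a unified 'lepton_pt' to an intermediate file."""
    import ROOT  # imported here so harmonised_path works without ROOT installed

    df = ROOT.RDataFrame(tree_name, input_path)
    available = [str(c) for c in df.GetColumnNames()]
    # First candidate present in this file (StopIteration if there is none).
    source = next(c for c in candidates if c in available)
    df = df.Define("lepton_pt", source)

    columns = ROOT.std.vector("string")()
    for name in ["lepton_pt"] + list(keep):
        columns.push_back(name)

    os.makedirs(out_dir, exist_ok=True)
    out_path = harmonised_path(input_path, out_dir)
    df.Snapshot(tree_name, out_path, columns)  # runs the event loop
    return out_path
```

Calling snapshot_common once per input file leaves one intermediate file per input, all with the same tree name and the same columns, which is what hadd needs.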

vcroft · August 27, 2020, 10:34am

Hmmm ok. So at the Snapshot stage I’d have to drop all the RDFs into a new file (with just the common branches), then read them all into a new RDF. Seems kinda messy but doable. Do I need the intermediary output files?

eguiraud · August 27, 2020, 11:27am

I think so. The problem is that RDF (just like TChain) makes the assumption that the trees you run on in a single event loop (i.e. the trees you chain/concatenate into a full dataset) all have the same schema.

This is baked into the API, e.g. df.GetColumnNames() would lose meaning if there was no single list of valid column names.
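To see in advance which columns are safe to use across a set of files, one can compare the per-file column lists. The sketch below is plain Python and assumes the lists were collected beforehand, e.g. from df.GetColumnNames() for each file. The example schema is made up from the names mentioned in this thread, and both helper names are hypothetical.

```python
def common_columns(columns_per_file):
    """Return the sorted column names that are present in every file."""
    sets = [set(str(c) for c in cols) for cols in columns_per_file.values()]
    return sorted(set.intersection(*sets)) if sets else []


def missing_columns(columns_per_file, wanted):
    """Map each file name to the wanted columns it does not contain."""
    return {
        fname: sorted(set(wanted) - set(str(c) for c in cols))
        for fname, cols in columns_per_file.items()
    }


# Illustrative schema only; real lists would come from GetColumnNames().
schema = {
    "mymuons.root": ["muon_pt", "passedTriggers"],
    "myelectrons.root": ["tau_pt", "passedTriggers"],
}
shared = common_columns(schema)
gaps = missing_columns(schema, ["muon_pt", "passedTriggers"])
```

Only the columns in shared can be used directly in a single event loop over all files. Anything listed in gaps needs a per-file Alias or Define first.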

StephanH · August 27, 2020, 12:06pm

Depends. You can either hadd everything into one file and throw the intermediate files away, or put all the intermediate files into a new dataframe. I like the latter approach because there is no merging step, but the former leaves you with a single file, which is handy if you want to keep it longer.
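The second option, reading all intermediate files into one new dataframe, might be sketched as follows. The glob pattern and the helper names are assumptions. The file names are again passed as a std::vector of strings, which the RDataFrame constructor accepts.

```python
import glob


def collect_outputs(pattern):
    """Return the intermediate files matching `pattern`, in a stable order."""
    return sorted(glob.glob(pattern))


def load_merged(tree_name, pattern):
    """Build one RDataFrame over all intermediate files matching `pattern`."""
    import ROOT  # imported here so collect_outputs works without ROOT installed

    files = collect_outputs(pattern)
    if not files:
        raise FileNotFoundError("no files match {!r}".format(pattern))
    names = ROOT.std.vector("string")()
    for fname in files:
        names.push_back(fname)
    return ROOT.RDataFrame(tree_name, names)


# Intended usage, assuming the snapshots were written to harmonised/:
#   df = load_merged("treename", "harmonised/*_common.root")
```

For the first option, the equivalent on the command line is hadd merged.root followed by the list of intermediate files.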