[RDataFrame] How to use sample meta-information in DefinePerSample?

Hello,

I would like your comments on a use case I am struggling with, and on whether it warrants adding some features to RDF. Apologies if this post is a tad too long.

I have a list of samples in a yaml file in the following template.

dh_10000:
   files:
      - /data/Faser*.root
   xsect: 2.246042502785529e-05 * 1000
   mass : 0.23
   coupl: 7.000000e-04
   sumw : 20000.0
   weight : 1
   wtnames : ['Nominal']
...

To load a specific sample (spanning multiple ROOT files), I use the following function.

import ROOT

def get_data_sample(sample):
    # Look up the sample entry once; do not overwrite `sample`, it is needed for the name.
    info = SampleInfo[int(sample)]
    file_names = list(info['files'])
    rsample = ROOT.RDF.Experimental.RSample(f"Sample_{sample}", "nt", file_names)
    spec = ROOT.RDF.Experimental.RDatasetSpec()
    spec.AddSample(rsample)
    rdf = ROOT.RDataFrame(spec)
    return rdf, info

If I need the cross-section to be available, it is easy to add a definition before returning the rdf, e.g. rdf = rdf.Define("xsect", f"{SampleInfo[sample]['xsect']}"). However, this is a workaround that fails when I want to load all samples.

def get_all_samples_dh():
    SampleInfos = {}
    for sample_id in sample_data.keys():
        if "dh_" in sample_id:
            SampleInfos[sample_id] = get_sample_info(sample_id)

    # Use the first file entry of every dh sample.
    file_names = [info["files"][0] for info in SampleInfos.values()]
    rsample = ROOT.RDF.Experimental.RSample("Sample_dh", "nt", file_names)
    spec = ROOT.RDF.Experimental.RDatasetSpec()
    spec.AddSample(rsample)
    rdf = ROOT.RDataFrame(spec)
    return rdf, SampleInfos

In this case, if I wanted to define the cross-section and other sample-specific columns, I would need to use DefinePerSample and match the underlying ROOT::RDF::RSampleInfo id to the sample_info (assuming the id contains some substring, such as the sample name, that can be matched to the sample_info details). This is further complicated by the fact that DefinePerSample needs a C++ function, so I cannot do this matching unless I read the yaml inside that C++ function.

So, perhaps there is a really obvious solution that I am missing. In that case, I apologize for the inconvenience. Let me know if I am overcomplicating the problem here.

In case it is worth considering, a few ways to address this, off the top of my head, would be:

  • Could some of the DefinePerSample functionality be made available in RDatasetSpec, so that I can define sample-specific columns on the fly?
  • Allow concatenation of RDataFrames, so that in the example above get_all_samples_dh could call get_data_sample (with the sample columns implemented there) for each sample and concatenate the results.
  • A pythonization that lets DefinePerSample take a Python callable.

Thanks in advance.

ROOT Version: Not Provided
Platform: Not Provided
Compiler: Not Provided


Hello @lost_soul_519,

I believe that what you want to achieve is to have the same cross section (and other metadata) defined for several files that belong to the same sample. (Let me know if that’s not accurate.)
I see two ways to do this:

  1. You can use RDF’s FromSpec(), but this works only with json.
  2. You can emulate what FromSpec does with yaml.
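
For option 1, a minimal sketch of turning the yaml entries into a JSON document for FromSpec could look like the following. The helper name to_fromspec_json is made up for illustration, the example entry mirrors the template from the question (with the xsect product evaluated by hand), and the layout with "samples", "trees", "files" and "metadata" is my reading of the FromSpec documentation, so please check it against your ROOT version.

```python
import json

def to_fromspec_json(sample_data, tree_name="nt"):
    """Build a FromSpec-style dict from yaml-like sample entries (hypothetical helper)."""
    samples = {}
    for name, info in sample_data.items():
        # Keep only scalar entries (int, float, str) as metadata; skip lists such as wtnames.
        metadata = {
            key: value
            for key, value in info.items()
            if key != "files"
            and isinstance(value, (int, float, str))
            and not isinstance(value, bool)
        }
        samples[name] = {
            "trees": [tree_name],
            "files": list(info["files"]),
            "metadata": metadata,
        }
    return {"samples": samples}

# Example entry mirroring the yaml template from the question.
sample_data = {
    "dh_10000": {
        "files": ["/data/Faser*.root"],
        "xsect": 0.02246042502785529,  # 2.246042502785529e-05 * 1000, evaluated by hand
        "mass": 0.23,
        "coupl": 7.0e-04,
        "sumw": 20000.0,
        "weight": 1,
        "wtnames": ["Nominal"],
    }
}

spec_dict = to_fromspec_json(sample_data)
spec_json = json.dumps(spec_dict, indent=2)

# Assumption: after writing spec_json to a file (say samples.json), it could be loaded with
#   rdf = ROOT.RDF.Experimental.FromSpec("samples.json")
```

The metadata entries would then be available through rdfsampleinfo_ in DefinePerSample, the same way as in option 2.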

If you want to keep the yaml, I guess we should go for number 2:
First, create the metadata for the sample. It should look more or less like this:

meta = ROOT.RDF.Experimental.RMetaData()
meta.Add("sample_name", "xxx")
meta.Add("xsec", 1.337)
...
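
If you prefer to fill the metadata directly from a yaml entry, a rough sketch could be the following. The names parse_number, fill_metadata and RecordingMeta are made up for illustration. Two assumptions are baked in: an xsect written as "2.246042502785529e-05 * 1000" comes out of the yaml loader as a string and has to be evaluated, and list-valued entries (files, wtnames) are skipped because the metadata holds only int, double and string values.

```python
def parse_number(value):
    """Return a number for numeric input or for product strings such as '2.2e-05 * 1000'."""
    if isinstance(value, (int, float)):
        return value
    product = 1.0
    for factor in str(value).split("*"):
        product *= float(factor)  # float() tolerates surrounding whitespace
    return product

def fill_metadata(meta, name, info):
    """Copy the scalar entries of `info` into `meta` (any object with an Add(key, value) method)."""
    meta.Add("sample_name", name)
    for key, value in info.items():
        if key == "xsect":
            meta.Add("xsec", float(parse_number(value)))
        elif isinstance(value, (bool, list, dict)):
            continue  # not representable as int, double or string metadata
        elif isinstance(value, (int, float, str)):
            meta.Add(key, value)
    return meta

class RecordingMeta:
    """Stand-in for RMetaData that only records Add() calls, to try the helper without ROOT."""
    def __init__(self):
        self.entries = {}

    def Add(self, key, value):
        self.entries[key] = value

# Example entry mirroring the yaml template from the question.
info = {
    "files": ["/data/Faser*.root"],
    "xsect": "2.246042502785529e-05 * 1000",
    "mass": 0.23,
    "coupl": 7.0e-04,
    "sumw": 20000.0,
    "weight": 1,
    "wtnames": ["Nominal"],
}
meta = fill_metadata(RecordingMeta(), "dh_10000", info)
# With ROOT: meta = fill_metadata(ROOT.RDF.Experimental.RMetaData(), "dh_10000", info)
```

Note that weight stays an int here, so it would have to be read back with the int accessor rather than the double one.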

Next, create the sample with all the files, and pass the metadata as last argument:

sample = ROOT.RDF.Experimental.RSample(f"Sample_{sample}", "nt",  file_names, meta)

This metadata is now shared by all files in the sample and becomes accessible in DefinePerSample. You can pass either a glob string or an explicit list of file names. Given that you already have a glob string in the yaml, I would probably use that.
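
For the "load all samples" case, the idea would be one RSample per yaml entry, each with its own metadata, rather than a single RSample holding all files. A minimal sketch, assuming sample_data is the dict loaded from the yaml; select_samples and build_spec are hypothetical helper names, and I use startswith("dh_") where the original code checks "dh_" in sample_id:

```python
def select_samples(sample_data, prefix="dh_"):
    """Return the yaml entries whose name starts with `prefix`."""
    return {name: info for name, info in sample_data.items() if name.startswith(prefix)}

def build_spec(sample_data, tree_name="nt", prefix="dh_"):
    """Create an RDatasetSpec with one RSample (and its own metadata) per selected entry."""
    import ROOT  # imported here so that select_samples can be used without ROOT

    spec = ROOT.RDF.Experimental.RDatasetSpec()
    for name, info in select_samples(sample_data, prefix).items():
        meta = ROOT.RDF.Experimental.RMetaData()
        meta.Add("sample_name", name)
        meta.Add("mass", float(info["mass"]))
        # Pass the file entries (globs included) straight from the yaml.
        sample = ROOT.RDF.Experimental.RSample(
            f"Sample_{name}", tree_name, list(info["files"]), meta
        )
        spec.AddSample(sample)
    return spec

# Small example without ROOT: only the dh_ entry is selected.
sample_data = {
    "dh_10000": {"files": ["/data/Faser*.root"], "mass": 0.23},
    "other_1": {"files": ["/data/Other*.root"], "mass": 1.0},
}
selected = select_samples(sample_data)
# With ROOT: rdf = ROOT.RDataFrame(build_spec(sample_data))
```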

Finally, create the RDF and use DefinePerSample (described on the RDataFrame main documentation page):

spec = ROOT.RDF.Experimental.RDatasetSpec()
spec.AddSample(sample)
spec.AddSample(...)  # add the remaining samples in the same way
# Parentheses let the chained calls span several lines in Python.
rdf = (
    ROOT.RDataFrame(spec)
    .DefinePerSample("xsec", 'rdfsampleinfo_.GetD("xsec")')
    .DefinePerSample("name", 'rdfsampleinfo_.GetS("sample_name")')
)

That is, you can access all the key-value pairs that you defined in the metadata through the rdfsampleinfo_ accessors, e.g. rdfsampleinfo_.GetD("<key>"). You can check RSampleInfo for the names of the accessor functions. Unfortunately, you have to pick the accessor that matches the type of each metadata entry, but there are only three: double (GetD), int (GetI) and string (GetS).
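
To avoid picking the accessor by hand for every key, the expression string can be generated from the Python type of the value that went into the metadata. This is only a sketch; accessor_expression and define_metadata_columns are hypothetical helper names, and it assumes every sample in the spec carries the same metadata keys.

```python
def accessor_expression(key, value):
    """Return the DefinePerSample expression matching the Python type of `value`."""
    if isinstance(value, bool):
        raise TypeError("bool metadata is not handled in this sketch")
    if isinstance(value, int):
        getter = "GetI"
    elif isinstance(value, float):
        getter = "GetD"
    elif isinstance(value, str):
        getter = "GetS"
    else:
        raise TypeError(f"unsupported metadata type for {key!r}: {type(value).__name__}")
    return f'rdfsampleinfo_.{getter}("{key}")'

def define_metadata_columns(rdf, metadata):
    """Add one per-sample column for each metadata key; `rdf` is an RDataFrame node."""
    for key, value in metadata.items():
        rdf = rdf.DefinePerSample(key, accessor_expression(key, value))
    return rdf

# The expressions for the two keys used above:
xsec_expr = accessor_expression("xsec", 1.337)
name_expr = accessor_expression("sample_name", "xxx")
```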

Let us know if that solves the problem!

Hi,
Thank you; the fix is almost perfect.
Sorry, I might have overthought the problem and wasn’t aware of the RMetaData class.

I should be able to simplify the workflow with this significantly.
I’ve changed the title to reflect that.
Thanks again.
