Hello,
I would like your comments on a particular use case I am struggling with, and on whether it warrants adding some features to RDF. Apologies if the following post is a tad too long.
I have a list of samples in a YAML file with the following template:
dh_10000:
  files:
    - /data/Faser*.root
  xsect: 2.246042502785529e-05 * 1000
  mass : 0.23
  coupl: 7.000000e-04
  sumw : 20000.0
  weight : 1
  wtnames : ['Nominal']
...
To load a specific sample (spanning multiple ROOT files), I use the following function:
def get_data_sample(sample):
    # Look up the sample info once, and keep the RSample in its own
    # variable so that `sample` is not overwritten before the return.
    info = SampleInfo[int(sample)]
    file_names = list(info['files'])
    rsample = ROOT.RDF.Experimental.RSample(f"Sample_{sample}", "nt", file_names)
    spec = ROOT.RDF.Experimental.RDatasetSpec()
    spec.AddSample(rsample)
    rdf = ROOT.RDataFrame(spec)
    return rdf, info
If, for some reason, I need to have the cross-section available, it is easy to add a definition before returning the rdf:
    rdf = rdf.Define("xsect", f"{SampleInfo[sample]['xsect']}")
This is a workaround, but it fails when I want to load all samples:
def get_all_samples_dh():
    SampleInfos = {}
    for sample_id in sample_data.keys():
        if "dh_" in sample_id:
            sample_info = get_sample_info(sample_id)
            SampleInfos[sample_id] = sample_info
    # Note: only the first entry of each sample's file list is used here.
    files = [info["files"][0] for info in SampleInfos.values()]
    sample = ROOT.RDF.Experimental.RSample("Sample_dh", "nt", files)
    spec = ROOT.RDF.Experimental.RDatasetSpec()
    spec.AddSample(sample)
    rdf = ROOT.RDataFrame(spec)
    return rdf, SampleInfos
In this case, if I wanted to define the cross-section and other sample-specific columns, I would need to use DefinePerSample and match the underlying ROOT::RDF::RSampleInfo::id to the sample_info (assuming the id contains some substring, like the sample name, that can be matched to the sample_info details). This is further complicated by the fact that DefinePerSample needs a C++ function, so I cannot do this matching unless I read the YAML file inside the C++ function.
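One possible way to do the matching without reading the YAML in C++ is to generate the C++ expression from Python. The sketch below is only an illustration under assumptions: it relies on the string overload of DefinePerSample (where rdfsampleinfo_ is available inside the expression), it assumes that the sample id really contains the sample name, and the helper per_sample_expression and the numeric values are hypothetical.

```python
def per_sample_expression(sample_infos, key, default=-1.0):
    """Build a C++ ternary chain that picks sample_infos[name][key] by sample name."""
    expr = repr(float(default))
    # Wrap from the last sample outward, so the first sample is tested first.
    for name, info in reversed(list(sample_infos.items())):
        value = repr(float(info[key]))
        expr = f'rdfsampleinfo_.Contains("{name}") ? {value} : ({expr})'
    return expr


# Hypothetical values, for illustration only.
sample_infos = {
    "dh_10000": {"xsect": 0.5, "mass": 0.23},
    "dh_10001": {"xsect": 0.25, "mass": 0.5},
}
expr = per_sample_expression(sample_infos, "xsect")
print(expr)

# With ROOT available, the string could then be used as:
#   rdf = rdf.DefinePerSample("xsect", expr)
```

Since Contains does substring matching, a name that is a prefix of another (for example dh_1000 and dh_10000) would need a more specific pattern.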
Perhaps there is a really obvious solution that I am missing. In that case, I apologize for the inconvenience. Let me know if I am overcomplicating the problem here.
In case this is worth considering, a few ways to address it, off the top of my head, would be:
- Can some of the DefinePerSample functionality be made available to RDatasetSpec, which would let me define columns on the fly that are specific to those samples?
- Allow a concatenation of RDataFrame objects, so that in the above example, get_all_samples_dh can call get_data_sample (with the sample columns implemented there) for each sample and concatenate the results.
- A pythonization that allows DefinePerSample to take a Python callable could mitigate the issue.
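For the first point, something close to this may already be possible by attaching metadata to each RSample. The following is a rough sketch, not a verified solution: it assumes ROOT 6.28 or newer, where I believe ROOT.RDF.Experimental.RMetaData and RSampleInfo::GetD are available, and it assumes the YAML values have already been converted to plain numbers. The helper names numeric_metadata and build_dataframe are hypothetical.

```python
def numeric_metadata(info, keys=("xsect", "mass", "coupl", "sumw")):
    """Pick the per-sample quantities to attach, converted to float."""
    return {k: float(info[k]) for k in keys if k in info}


def build_dataframe(sample_infos, tree_name="nt"):
    """Create one RDataFrame over all samples, one RSample (with metadata) each."""
    import ROOT  # imported lazily; assumes ROOT 6.28 or newer

    spec = ROOT.RDF.Experimental.RDatasetSpec()
    for name, info in sample_infos.items():
        meta = ROOT.RDF.Experimental.RMetaData()
        for key, value in numeric_metadata(info).items():
            meta.Add(key, value)
        spec.AddSample(
            ROOT.RDF.Experimental.RSample(name, tree_name, list(info["files"]), meta)
        )
    rdf = ROOT.RDataFrame(spec)
    # rdfsampleinfo_ refers to the sample currently being processed.
    return rdf.DefinePerSample("xsect", 'rdfsampleinfo_.GetD("xsect")')
```

Other columns (mass, coupl, sumw) could be defined the same way with further DefinePerSample calls, so no YAML parsing would be needed on the C++ side.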
Thanks in advance.
ROOT Version: Not Provided
Platform: Not Provided
Compiler: Not Provided