I am struggling to find out how to get RDataFrame’s DefinePerSample() working. In my dataframe I would like to define a column with calibration factors that depend on the file where the data is from.
I tried the following code:
vector<string> fileNames;
vector<double> calibrationFactor;
fileNames.push_back("Run1*root");
calibrationFactor.push_back(1);
fileNames.push_back("Run2*root");
calibrationFactor.push_back(2);
ROOT::RDataFrame data1("AnalysisTree",fileNames);
ROOT::RDF::RNode data2 = ROOT::RDataFrame(0);
data2 = data1.DefinePerSample("calibrationFactor", [](unsigned int slot, const ROOT::RDF::RSampleInfo &id)
{
for (unsigned int i=0; i<fileNames.size(); i++){
if (id.Contains(fileNames[i])) double result = calibrationFactor[i];
}
return result;
});
This does not work. The first error before ROOT crashes is at id.Contains(), with the message ROOT_prompt_12:1:5: error: cannot initialize an array element of type 'void *' with an rvalue of type 'const ROOT::RDF::RSampleInfo *'.
Do I have to initialize the ROOT::RDF::RSampleInfo object in advance? How would I have to do it?
I also tried using a simpler expression with DefinePerSample() similar to the one in the example in the documentation, but it resulted in the same error.
Thanks in advance!
Cheers,
Konrad
ROOT Version: 6.26.06 Platform: Not Provided Compiler: Not Provided
But what is the best way to do this iteratively if I have a lot of samples? I could use nested ternary operators within the lambda expression (e.g. return id.Contains("sample1") ? 1 : (id.Contains("sample2") ? 2 : 3);), but I would have to edit it manually whenever I change the data.
As I understand it, a for loop does not work within a lambda, so I have to define a function beforehand, but how do I deal with the slot and id variables? With the following code, for example, I get the error use of undeclared identifier 'slot', and the same for id. I hope this code illustrates what I am trying to achieve:
Doing a for loop is totally okay inside of a lambda! I suppose you want something like:
std::vector<std::string> interesting_files {"f0.root", "f1.root", "f2.root"};
std::vector<double> weights {0., 1., 2.};
auto df = ROOT::RDataFrame("tree", {"f0.root", "f1.root", "f2.root"}).
   DefinePerSample("weightbysample", [&interesting_files, &weights] // notice I need to capture my vectors
      (unsigned int, const ROOT::RDF::RSampleInfo &id) { // notice that DefinePerSample expects precisely this signature: one unsigned int and one RSampleInfo object
         // further notice that I did not name the slot parameter, since it would be unused in our case
         for (unsigned int i=0; i<interesting_files.size(); i++)
            if (id.Contains(interesting_files[i]))
               return weights[i]; // you don't need the extra variables
         return -1.; // without this default value (here -1.), you will obviously get:
         // warning: control reaches end of non-void function [-Wreturn-type]
      });
I just tested that and it works as expected. Let me know if I misunderstood your problem.
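For a long list of samples, the same loop can be driven by a single table of (tag, factor) pairs, so that only the table changes when the data changes. Below is a minimal sketch of that lookup on its own, without ROOT. FactorFor, the table layout and the sample ID strings are hypothetical names made up for illustration. It also assumes that RSampleInfo::Contains behaves like a plain substring test, which is why the tags are written as "Run1" rather than as a glob like "Run1*root".

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of ROOT): return the factor of the first
// table entry whose tag occurs as a substring of sampleId, or `fallback`
// if no tag matches.
double FactorFor(const std::string &sampleId,
                 const std::vector<std::pair<std::string, double>> &table,
                 double fallback = -1.)
{
   for (const auto &entry : table)
      if (sampleId.find(entry.first) != std::string::npos)
         return entry.second; // first match wins
   return fallback;
}
```

Inside DefinePerSample one would capture the table by reference and run the same loop with id.Contains(entry.first) in place of the find call. Since the first match wins, a tag such as "Run1" would also match a file named "Run10...", so more specific tags should come first in the table.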
P.S. I see that the weight is some sort of metadata, i.e. an additional specification corresponding to the files/groups of files. We are currently working on that (see the 145th ROOT Parallelism, Performance and Programming Model Meeting, 20 October 2022, on Indico). Long story short: when ready, in 6.28, the whole body of DefinePerSample in your case becomes return id.GetD("weight");