DefinePerSample() - how to properly use it?

konrad · October 20, 2022, 5:06pm

Hello,

I am struggling to find out how to get RDataFrame’s DefinePerSample() working. In my dataframe I would like to define a column with calibration factors that depend on the file where the data is from.

I tried the following code:

vector<string> fileNames;
vector<double> calibrationFactor;

fileNames.push_back("Run1*root");
calibrationFactor.push_back(1);
fileNames.push_back("Run2*root");
calibrationFactor.push_back(2);

ROOT::RDataFrame data1("AnalysisTree",fileNames);
ROOT::RDF::RNode data2 = ROOT::RDataFrame(0);

data2 = data1.DefinePerSample("calibrationFactor", [](unsigned int slot, const ROOT::RDF::RSampleInfo &id)
	{
		for (unsigned int i=0; i<fileNames.size(); i++){
			if (id.Contains(fileNames[i])) double result = calibrationFactor[i];
		}
		return result;
	});

This does not work. The first error, before ROOT crashes, is at id.Contains() with the message ROOT_prompt_12:1:5: error: cannot initialize an array element of type 'void *' with an rvalue of type 'const ROOT::RDF::RSampleInfo *'.

Do I have to initialize the ROOT::RDF::RSampleInfo object in advance? How would I have to do it?

I also tried using a simpler expression with DefinePerSample() similar to the one in the example in the documentation, but it resulted in the same error.

Thanks in advance!
Cheers,
Konrad

ROOT Version: 6.26.06
Platform: Not Provided
Compiler: Not Provided

eguiraud · October 20, 2022, 5:32pm

Hi @konrad ,

konrad:

data2 = data1.DefinePerSample("calibrationFactor", [](unsigned int slot, const ROOT::RDF::RSampleInfo &id)
	{
		for (unsigned int i=0; i<fileNames.size(); i++){
			if (id.Contains(fileNames[i])) double result = calibrationFactor[i];
		}
		return result;
	});

You are returning result after it went out of scope: it is declared inside the if statement, so it no longer exists when the return is reached. I’m surprised that code compiles at all.
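To show the scope issue in isolation, here is a minimal sketch that compiles without ROOT. FakeSampleInfo and lookupFactor are hypothetical names made up for this illustration; FakeSampleInfo only stands in for ROOT::RDF::RSampleInfo, under the assumption that Contains() is a plain substring match (in which case a glob pattern such as "Run1*root" would not match literally). The point is that result is declared before the loop, so it is still alive at the return.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for ROOT::RDF::RSampleInfo (assumption: Contains() is a substring match).
struct FakeSampleInfo {
  std::string id;
  bool Contains(const std::string &s) const { return id.find(s) != std::string::npos; }
};

// Hypothetical helper: returns the factor of the last matching name, or -1 if nothing matches.
double lookupFactor(const FakeSampleInfo &id,
                    const std::vector<std::string> &names,
                    const std::vector<double> &factors) {
  double result = -1.; // declared before the loop, so it outlives the if statement
  for (std::size_t i = 0; i < names.size(); ++i) {
    if (id.Contains(names[i]))
      result = factors[i]; // assign to the outer variable, do not redeclare it here
  }
  return result;
}
```

In the original snippet the vectors would also have to be captured by the lambda, since a lambda with an empty capture list cannot see local variables.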

As a very simple test, this works for me (and prints 20):

#include <ROOT/RDataFrame.hxx>
#include <iostream>

int main() {
  ROOT::RDataFrame df(10);
  auto df2 = df.DefinePerSample(
      "weightbysample",
      [](unsigned int slot, const ROOT::RDF::RSampleInfo &id) {
        return id.Contains("sample1") ? 1 : 2;
      });
  std::cout << df2.Sum<int>("weightbysample").GetValue() << std::endl;
}

Does it work for you?

Cheers,
Enrico

konrad · October 21, 2022, 3:20pm

Hi @eguiraud ,

thank you, your example is working!

But how could I best do this iteratively if I have a lot of samples? I could use nested ternary operators within the lambda expression (e.g. return id.Contains("sample1") ? 1 : (id.Contains("sample2") ? 2 : 3);), but I would have to edit it manually whenever I change the data.

I understand that a for-loop is not working within a lambda, so I have to define a function before, but how do I deal with the slot and id variables? With the following code for example, I get the error use of undeclared identifier 'slot' and the same for id. I hope this code illustrates what I try to achieve:

#include <ROOT/RDataFrame.hxx>
#include <iostream>

double setWeights(unsigned int slot, const ROOT::RDF::RSampleInfo id, vector<string> fileNames, vector<double> weights){
	double result;
	for (unsigned int i=0; i<fileNames.size(); i++){
		if (id.Contains(fileNames[i])) result = weights[i];
	}
	return result;
	}

int main() {

	vector<string> fileNames;
	vector<double> weights;

	fileNames.push_back("file1");
	weights.push_back(1);
	fileNames.push_back("file2");
	weights.push_back(2);

	ROOT::RDataFrame df(10);
	ROOT::RDF::RNode df2 = ROOT::RDataFrame(0);


	df2 = df.DefinePerSample("weightbysample", setWeights(slot, &id, fileNames, weights);)

	std::cout << df2.Sum<int>("weightbysample").GetValue() << std::endl;
}

Cheers,
Konrad

ikabadzhov · October 23, 2022, 8:30pm

Hello @konrad ,

Doing a for loop is totally okay inside of a lambda! I suppose you want something like:

std::vector<std::string> interesting_files {"f0.root", "f1.root", "f2.root"};
std::vector<double> weights {0., 1., 2.};
auto df = ROOT::RDataFrame("tree", {"f0.root", "f1.root", "f2.root"})
             .DefinePerSample("weightbysample",
                // notice I need to capture my vectors
                [&interesting_files, &weights]
                // notice that DefinePerSample expects precisely this signature: 1 unsigned int and 1 RSampleInfo object
                // further notice that I did not name the slot parameter, since it would be unused in our case
                (unsigned int, const ROOT::RDF::RSampleInfo &id) {
   for (unsigned int i=0; i<interesting_files.size(); i++)
      if (id.Contains(interesting_files[i]))
         return weights[i]; // you don't need the extra variable
   return -1.; // without this default value (here -1.), you will obviously get:
   // warning: control reaches end of non-void function [-Wreturn-type]
});

I just tested that and it works as expected. Let me know if I misunderstood your problem.
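If you would rather keep the logic in a named function, as in the setWeights attempt above, a capturing lambda can adapt it to the two-parameter shape. The sketch below compiles without ROOT: FakeSampleInfo is a hypothetical stand-in for ROOT::RDF::RSampleInfo (assuming Contains() is a substring match), and makeWeightCallback is a name invented for this example. It is meant as an illustration of the pattern, not as the way ROOT requires it to be done.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for ROOT::RDF::RSampleInfo (assumption: Contains() is a substring match).
struct FakeSampleInfo {
  std::string id;
  bool Contains(const std::string &s) const { return id.find(s) != std::string::npos; }
};

// Named function with the extra arguments; returns -1 when no file name matches.
double setWeights(const FakeSampleInfo &id,
                  const std::vector<std::string> &fileNames,
                  const std::vector<double> &weights) {
  for (std::size_t i = 0; i < fileNames.size(); ++i)
    if (id.Contains(fileNames[i]))
      return weights[i];
  return -1.;
}

// Hypothetical factory: wraps setWeights in a callable taking (slot, sample info).
// The vectors are captured by value, so the callable does not depend on the caller's copies.
std::function<double(unsigned int, const FakeSampleInfo &)>
makeWeightCallback(std::vector<std::string> fileNames, std::vector<double> weights) {
  return [fileNames, weights](unsigned int /*slot, unused*/, const FakeSampleInfo &id) {
    return setWeights(id, fileNames, weights);
  };
}
```

With ROOT itself, the equivalent would presumably be a lambda of the form [fileNames, weights](unsigned int, const ROOT::RDF::RSampleInfo &id) { ... } passed directly to DefinePerSample, with setWeights taking the real RSampleInfo type. The slot and id arguments are supplied by RDataFrame when it calls the lambda, which is why they cannot be passed at the call site as in the earlier attempt.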

P.S. I see that the weight is some sort of metadata, i.e. an additional specification attached to files or groups of files. We are currently working on that, see the 145th ROOT Parallelism, Performance and Programming Model Meeting (20 October 2022) on Indico. Long story short: when ready, in 6.28 the whole body of the DefinePerSample lambda in your case becomes return id.GetD("weight");

konrad · October 24, 2022, 3:35pm

Hello @ikabadzhov

thank you very much, I think like this I get exactly what I want!
