RDataFrame performance processing many small files

apetukho · December 7, 2022, 1:55pm

Dear Enrico,

Thank you for the breakdown of the runtime. I’m trying to rewrite the code using DefinePerSample, but I’ve run into some problems implementing it in Python.

As the documentation and this helpful thread suggest, I need to pass the function used in DefinePerSample() some C++ objects that hold the "sample identifier: sample weight" pairs. I’m using std::vector<std::string> and std::vector<double>, as in the thread. I’ve declared this function

ROOT.gInterpreter.Declare('''
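// Return the weight of the first file path that matches the current sample, or -1 if none matches.
// The vectors are taken by const reference to avoid copying them on every call.
double GetSampleWeight(unsigned int slot, const ROOT::RDF::RSampleInfo &id, const std::vector<std::string> &filePathVector, const std::vector<double> &weightVector) {
    for (unsigned int i = 0; i < filePathVector.size(); i++) {
        if (id.Contains(filePathVector[i])) {
            return weightVector[i];
        }
    }
    return -1.;  // no matching file path found
}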
''')

to later use it in the RDataFrame as

df = df.DefinePerSample("sampleWeight", "GetSampleWeight(rdfslot_, rdfsampleinfo_, filePathVector, weightVector)")

but the question is: how exactly do I make filePathVector and weightVector available to this function?
I’ve tried building declaration strings like these and passing them to ROOT.gInterpreter.Declare()

ROOT.gInterpreter.Declare('std::vector<std::string> filePathVector {"file1.root", "file2.root", };')
ROOT.gInterpreter.Declare('std::vector<double> weightVector {1, 2, };')

But when I run my program in a loop over processes (each of which produces a new filePathVector and weightVector), I get a C++ error about the redefinition of std::vector<std::string> filePathVector and std::vector<double> weightVector.
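To make the problem concrete, here is a minimal sketch of one possible workaround: give each process its own variable names, so the interpreter never sees the same name declared twice. The helper make_weight_declarations and the per-process tag are hypothetical names introduced only for this illustration. The sketch assumes the tag is a valid C++ identifier fragment and that the file paths contain no quotes or backslashes. It only builds strings and does not call ROOT itself.

```python
def make_weight_declarations(tag, file_paths, weights):
    """Build C++ declarations with per-process variable names.

    Hypothetical helper: returns the C++ code declaring the two vectors
    and the matching expression string for DefinePerSample.
    """
    if len(file_paths) != len(weights):
        raise ValueError("file_paths and weights must have the same length")

    # Unique names per process avoid redeclaring the same global variable.
    path_name = "filePathVector_" + tag
    weight_name = "weightVector_" + tag

    # Assumes paths need no escaping (no quotes or backslashes).
    paths = ", ".join('"{}"'.format(p) for p in file_paths)
    values = ", ".join(repr(float(w)) for w in weights)

    code = (
        "std::vector<std::string> {pn} {{{p}}};\n"
        "std::vector<double> {wn} {{{w}}};\n"
    ).format(pn=path_name, p=paths, wn=weight_name, w=values)

    expression = "GetSampleWeight(rdfslot_, rdfsampleinfo_, {pn}, {wn})".format(
        pn=path_name, wn=weight_name
    )
    return code, expression


code, expression = make_weight_declarations(
    "proc0", ["file1.root", "file2.root"], [1, 2]
)
print(code)
print(expression)
```

Inside the loop over processes, the returned code would go to ROOT.gInterpreter.Declare() and the returned expression to DefinePerSample("sampleWeight", ...). I have not verified that this is the recommended approach.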

The reproducer and the input files can be found here.

So my question is: how do I properly use DefinePerSample when working in Python?

Best regards,
Aleksandr