RDataFrame: Memory hog issue when trying to combine data from multiple data frames in a loop

konrad · October 7, 2021, 10:27am

ROOT Version: 6.24.02
Platform: Debian GNU/Linux 10 (buster)
Compiler: Not Provided

Hello,

I have several ROOT files and want to combine data from them in a single histogram. For this I wrote a macro that loops over the files, creates an RDataFrame for each file and then applies filters (including some complex ones, which is the reason why I use RDF in the first place). The files themselves are quite big (several GB), but only ~100 events survive the filters. I want to combine around 10 files.

Afterwards I loop over several columns of the dataframe (data from several detector channels) to combine them into vectors, which I in the end use to fill my histogram.
The problem is that this macro uses more and more memory until it runs out of memory and crashes. In addition, it runs painfully slowly.

From what I read in the forum, I suppose it is a memory hog issue caused by the filters, but I have not yet found out how to avoid it.

Below you can find a “sketch” of my code.

Any ideas how I can improve the code and use less memory? Or is there a way to clear the memory after each iteration over a file? Or would a completely different approach be better?
I know that I could load all my files into a single data frame, but the issue is that every file has individual calibration factors, and I am afraid it would be complicated to keep track of that.

Thanks in advance!
Konrad

vector<double> amplitude_vector;
vector<double> peakTime_vector;
vector<double> energy_vector;
vector<double> energies_combined; // vector with the signal energies from all files 
vector<double> peakTimes_combined; // vector with the signal peak times from all files

// Loop over files:
for (unsigned int f = 0; f < fileNameList.size(); f++) { 
    // function that creates the RDF and defines some additional columns:
    auto data_filtered = createRDF(fileNameList[f]); 
    
    // Filters:
    data_filtered = data_filtered.Filter("ColumnName1 > 0");
    data_filtered = data_filtered.Filter("ColumnName2 > 0");
    // ... and many more filters...

    // Loop over columns I want to combine:

    // vector<double> calibration_vector = calibration factors for each column, different for each file
    // vector<string> column_names1 = names of columns which contain signal amplitudes
    // vector<string> column_names2 = names of columns which contain signal peak times
    // int nCol = number of columns (detector channels), something like 60

    for (unsigned int i = 0; i < nCol; i++) { 
        amplitude_vector = data_filtered.Take<double>(column_names1[i]).GetValue();
        peakTime_vector = data_filtered.Take<double>(column_names2[i]).GetValue();
        
        // fill vector with calibrated data (doing it here because due to the lazy action
        // of RDF my calibration got messed up when I defined a calibrated column...)
        energy_vector = amplitude_vector;
        for (unsigned int j = 0; j < energy_vector.size(); j++) {
             energy_vector[j] = amplitude_vector[j] * calibration_vector[i];
        }
        // add data to vectors, which combine the data from the different files
        energies_combined.insert(energies_combined.end(), energy_vector.begin(), energy_vector.end());
        peakTimes_combined.insert(peakTimes_combined.end(), peakTime_vector.begin(), peakTime_vector.end());

        energy_vector.clear();
        peakTime_vector.clear();
        amplitude_vector.clear();

    } // end loop over columns
} // end loop over files

// Create histogram
TH2D* h = new TH2D(...);
for (unsigned int i = 0; i < energies_combined.size(); i++) {
    h->Fill(energies_combined[i], peakTimes_combined[i]);
}
h->Draw("colz");

eguiraud · October 7, 2021, 12:08pm

Hi @konrad ,

Performance

About the slowness: because of the immediate call to GetValue(), this runs two event loops over the data for each of the nCol iterations, so you are running far too many event loops:

amplitude_vector = data_filtered.Take<double>(column_names1[i]).GetValue();
peakTime_vector = data_filtered.Take<double>(column_names2[i]).GetValue();

RDataFrame can retrieve all the vectors in a single event loop if you first book all the Take calls and only afterwards access the results (e.g. by calling GetValue). dataframe.GetNRuns() returns the number of event loops run by the RDF object up to that point.
Also, for performance, make sure to compile the code with optimizations (e.g. the -O2 compiler option), or compile the macro before executing it, i.e. run root macro.C+ instead of root macro.C.
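To make the difference between the two access patterns concrete, here is a toy model in plain C++. It is not ROOT code: ToyFrame, its Take/GetValue/GetNRuns members and the two helper functions are hypothetical stand-ins that only mimic the lazy behaviour described above, namely that the first access to any result triggers one pass that fills every result booked so far.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy model of a lazy data frame. NOT ROOT code; all names are hypothetical.
class ToyFrame {
    struct Slot {
        std::string column;
        std::vector<double> values;
        bool filled = false;
    };
    struct State {
        std::map<std::string, std::vector<double>> data; // stands in for the input file
        std::vector<std::shared_ptr<Slot>> pending;      // booked but not yet filled
        int nRuns = 0;                                   // passes over the data so far
    };
    std::shared_ptr<State> fState = std::make_shared<State>();

public:
    class Result {
        std::shared_ptr<State> fState;
        std::shared_ptr<Slot> fSlot;

    public:
        Result(std::shared_ptr<State> state, std::shared_ptr<Slot> slot)
            : fState(std::move(state)), fSlot(std::move(slot)) {}

        // The first access triggers ONE pass that fills every result booked so far.
        const std::vector<double> &GetValue() {
            if (!fSlot->filled) {
                ++fState->nRuns;
                for (auto &slot : fState->pending) {
                    slot->values = fState->data.at(slot->column);
                    slot->filled = true;
                }
                fState->pending.clear();
            }
            return fSlot->values;
        }
    };

    explicit ToyFrame(std::map<std::string, std::vector<double>> data) { fState->data = std::move(data); }

    // Booking only registers the request; no data is read here.
    Result Take(const std::string &column) {
        auto slot = std::make_shared<Slot>();
        slot->column = column;
        fState->pending.push_back(slot);
        return Result(fState, slot);
    }

    int GetNRuns() const { return fState->nRuns; }
};

// Pattern 1: book and read in the same iteration, which costs one pass per column.
int runsWhenReadingImmediately(ToyFrame &df, const std::vector<std::string> &columns) {
    for (const auto &c : columns) df.Take(c).GetValue();
    return df.GetNRuns();
}

// Pattern 2: book everything first and read afterwards, which costs a single pass.
int runsWhenBookingFirst(ToyFrame &df, const std::vector<std::string> &columns) {
    std::vector<ToyFrame::Result> results;
    for (const auto &c : columns) results.push_back(df.Take(c));
    for (auto &r : results) r.GetValue();
    return df.GetNRuns();
}
```

In this toy, the first pattern needs as many passes as there are columns, while the second needs one pass regardless of the number of columns. With ~60 channels and two Take calls per channel, that is the difference between roughly 120 event loops per file and one.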

Memory usage

The memory hogging probably comes from the just-in-time compilation of all those Filter("expression") calls, where the expression needs to be transformed into actual executable code before we can…well…execute it! That generated code stays in memory until the end of the program. If it’s an option, you can use the non-jitted overload Filter(cpp_function_or_lambda, column_list), which avoids that overhead.
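As a minimal sketch of what such a compiled cut could look like, the two string filters from the question can be written as one ordinary C++ function. The name passesCuts is hypothetical, and the sketch assumes that ColumnName1 and ColumnName2 both hold double values, which the original post does not state.

```cpp
// Hypothetical compiled replacement for Filter("ColumnName1 > 0") and
// Filter("ColumnName2 > 0"). Assumes both columns are of type double.
bool passesCuts(double columnName1, double columnName2)
{
    return columnName1 > 0 && columnName2 > 0;
}
```

It would then be passed to the non-jitted overload together with the column list, along the lines of data_filtered.Filter(passesCuts, {"ColumnName1", "ColumnName2"}), so that no expression string has to be compiled at runtime.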

Possibly a better approach

We just merged a new feature in RDataFrame that would let you do everything in a single RDataFrame computation graph, increasing performance and lowering memory usage on top of the things mentioned above: it’s DefinePerSample, which would let you Define the value of the calibration factor based on the filename.
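The per-file bookkeeping this needs can be kept small. Below is a sketch of a lookup helper in plain C++; CalibrationTable and calibrationFor are hypothetical names, and the assumption is that each file name appears somewhere in the sample identifier that the per-sample callback receives.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical table: file name -> one calibration factor per detector channel.
using CalibrationTable = std::map<std::string, std::vector<double>>;

// Return the factor for `channel` of the file whose name occurs in `sampleId`.
double calibrationFor(const CalibrationTable &table, const std::string &sampleId, std::size_t channel)
{
    for (const auto &entry : table) {
        if (sampleId.find(entry.first) != std::string::npos)
            return entry.second.at(channel); // throws if the channel is out of range
    }
    throw std::runtime_error("no calibration for sample " + sampleId);
}
```

Inside a DefinePerSample callable, which roughly speaking receives a slot number and a ROOT::RDF::RSampleInfo describing the current file, the sample identifier string would be passed to a helper like this to define one calibration column per channel. Please check the reference documentation of your ROOT version for the exact interface.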

Let us know if any of this helps!
If not, it would be useful to be able to run your code and play with it to further investigate.

Cheers,
Enrico

konrad · October 7, 2021, 2:51pm

Hi Enrico,

thank you very much!
After implementing your first suggestion regarding performance, the code runs really fast, and the memory problem disappeared as well.

Now I have a first loop over the columns for the Take() action:

vector<ROOT::RDF::RResultPtr<std::vector<double, std::allocator<double>>>> result;

for (unsigned int i = 0; i < nCol; i++) {
    result.push_back(data_filtered.Take<double>(column_names1[i]));
    ...
}

And in a second loop I fill the vector with GetValue():

for (unsigned int i = 0; i < nCol; i++) {
    amplitude_vector = result[i].GetValue();
    ...
}

Is there a more elegant way to declare this chunky vector at the beginning? To find out the data type for these declarations I always have to provoke a compilation error and then copy it from the error message…

Anyways, my macro now works well and does what it should.

Cheers,
Konrad

eguiraud · October 7, 2021, 2:57pm

Great!

Take<T> returns a ROOT::RDF::RResultPtr<std::vector<T>>, so you can use that shorter spelling of the return type, i.e. vector<ROOT::RDF::RResultPtr<std::vector<double>>> in your case.

system · October 21, 2021, 2:58pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.