ROOT Version: 6.24.02
Platform: Debian GNU/Linux 10 (buster)
Compiler: Not Provided
Hello,
I have several ROOT files and want to combine data from them into a single histogram. For this I wrote a macro that loops over the files, creates an RDataFrame for each file, and then applies filters (some of them complex, which is why I use RDF in the first place). The files themselves are quite large (several GB each), but only ~100 events survive the filters. I want to combine around 10 files.
Afterwards I loop over several columns of the dataframe (data from several detector channels) to combine them into vectors, which I then use to fill my histogram.
The problem is that this macro continuously uses more memory until it runs out of memory and crashes. In addition, it runs painfully slowly.
From what I read on the forum, I suspect it is a memory-hogging issue caused by the filters, but I haven't figured out yet how to avoid it.
Below you can find a “sketch” of my code.
Any ideas how I can improve the code and use less memory? Is there a way to free the memory after each iteration over a file? Or should I rather follow a completely different approach?
I know that I could load all my files into a single dataframe, but the issue is that every file has individual calibration factors, and I am afraid it would be complicated to keep track of them.
Thanks in advance!
Konrad
vector<double> amplitude_vector;
vector<double> peakTime_vector;
vector<double> energy_vector;
vector<double> energies_combined; // vector with the signal energies from all files
vector<double> peakTimes_combined; // vector with the signal peak times from all files
// Loop over files:
for (unsigned int f = 0; f < fileNameList.size(); f++) {
// function that creates the RDF and defines some additional columns:
auto data_filtered = createRDF(fileNameList[f]);
// Filters:
data_filtered = data_filtered.Filter("ColumnName1 > 0");
data_filtered = data_filtered.Filter("ColumnName2 > 0");
// ... and many more filters...
// Loop over columns I want to combine:
// vector<double> calibration_vector = calibration factors for each column, different for each file
// vector<string> column_names1 = names of columns which contain signal amplitudes
// vector<string> column_names2 = names of columns which contain signal peak times
// int nCol = number of columns (detector channels), something like 60
for (unsigned int i = 0; i < nCol; i++) {
amplitude_vector = data_filtered.Take<double>(column_names1[i]).GetValue();
peakTime_vector = data_filtered.Take<double>(column_names2[i]).GetValue();
// fill vector with calibrated data (doing it here because due to the lazy action
// of RDF my calibration got messed up when I defined a calibrated column...)
energy_vector = amplitude_vector; // copy, then scale in place
for (unsigned int j = 0; j < energy_vector.size(); j++) {
energy_vector[j] *= calibration_vector[i];
}
// add data to vectors, which combine the data from the different files
energies_combined.insert(energies_combined.end(), energy_vector.begin(), energy_vector.end());
peakTimes_combined.insert(peakTimes_combined.end(), peakTime_vector.begin(), peakTime_vector.end());
energy_vector.clear();
peakTime_vector.clear();
amplitude_vector.clear();
} // end loop over columns
} // end loop over files
// Create histogram
TH2D* h = new TH2D(...);
for (unsigned int i = 0; i < energies_combined.size(); i++) {
h->Fill(energies_combined[i], peakTimes_combined[i]);
}
h->Draw("colz");