ROOT Version: 6.28/00
Platform: “CentOS Linux 7 (Core)”
Compiler: g++ (GCC) 12.2.1 20221030
Dear all,
I wrote some code that analyzes simulations, stored in ROOT files, using RDataFrame. The code works fine on my PC (MacBook Pro M1); however, I can no longer download the simulations and analyze them locally because they are getting too big (>100 GB). So now I need to run the same code on a farm. However, when I do that, the code stops working because it reaches the memory limit (8 GB). I tried to optimize the code as much as I could, following what is written in the ROOT::RDataFrame Class Reference, but I am clearly missing something. Can you help me optimize/correct it?
In the code, I first initialize the RDataFrames using TChains made of all the simulation files:
// Initialize the TTrees (as TChains)
string files = Directory + "*.root";
cout << "Analyzing root files in folder: " << files << endl;
TChain ch1 ("Events");
TChain ch2 ("RunSummary");
ch1.Add(files.c_str()); //root_files/*.root
ch2.Add(files.c_str());
// Initialize the RDataFrames
RDataFrame Events(ch1);
RDataFrame RunSummary(ch2);
Then I do some operations like:
- counting the number of primaries simulated and creating a new column with the correct weight
// Count the simulated electrons
cout << "Beginning to count primaries simulated..." << endl;
unsigned int nEOT = 0;
RunSummary.Foreach([&nEOT](unsigned int i){ nEOT = nEOT + i;}, {"TotEvents"});
// Lambdas used to define new columns in the dataframe
double n_m = 0.939565378;
auto Ekin_calc = [n_m](double Etot){ return Etot - n_m; };
auto PesoEOT_calc = [nEOT, NormFactor](double weight){ return weight/nEOT * NormFactor; };
// Define two new dataframes to use for the histograms (each new dataframe also inherits all the columns of its parent)
cout << "Creating new dataframes with correct weight..." << endl;
auto Ekin = Events.Define("Ekin", Ekin_calc, {"ETot"}).Define("Peso", PesoEOT_calc, {"Weight1"}); // Create the Ekin DF, which has an Ekin column and a Peso column
auto DF_WeightsEOT = Events.Define("Peso", PesoEOT_calc, {"Weight1"}); // Create the DF with the weights
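To make the two derived quantities explicit, here is a minimal standalone sketch of the same formulas in plain C++, without ROOT. The function names kineticEnergy and eotWeight are only illustrative and are not part of my script, where NormFactor is defined elsewhere.

```cpp
// Standalone sketch (no ROOT): the two per-event formulas used in the Define calls.
// Function names are illustrative, not part of the actual script.

// Kinetic energy: total energy minus the rest mass (both in the same units).
double kineticEnergy(double etot, double restMass) {
    return etot - restMass;
}

// Per-event weight divided by the number of simulated primaries (EOT),
// then scaled by a normalization factor.
double eotWeight(double weight, unsigned int nEOT, double normFactor) {
    return weight / nEOT * normFactor;
}
```

With made-up inputs, eotWeight(2.0, 4, 10.0) gives 5.0 and kineticEnergy(2.0, 0.5) gives 1.5.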
- finding the number of detectors placed
cout << "Counting surfaces..." << endl;
auto SurfaceIDs = Events.Take<unsigned int>("SurfaceID");
sort(SurfaceIDs->begin(), SurfaceIDs->end());
vector<unsigned int>::iterator it;
it = unique(SurfaceIDs->begin(), SurfaceIDs->end());
SurfaceIDs->resize(distance(SurfaceIDs->begin(),it));
SurfaceIDs->erase(std::remove(SurfaceIDs->begin(), SurfaceIDs->end(), 0), SurfaceIDs->end());
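The deduplication above can also be looked at in isolation. This is a minimal sketch on a plain std::vector, without ROOT; the helper name uniqueNonZeroIds and the sample values in the note after it are made up for illustration.

```cpp
#include <algorithm>
#include <vector>

// Return the sorted list of distinct, non-zero IDs found in `ids`.
// The helper name is illustrative; the input is taken by value,
// so the caller's vector is left untouched.
std::vector<unsigned int> uniqueNonZeroIds(std::vector<unsigned int> ids) {
    std::sort(ids.begin(), ids.end());
    // std::unique compacts the distinct values to the front and returns
    // the new logical end; erase everything after it.
    ids.erase(std::unique(ids.begin(), ids.end()), ids.end());
    // Drop ID 0, as in the snippet above.
    ids.erase(std::remove(ids.begin(), ids.end(), 0u), ids.end());
    return ids;
}
```

With the input {3, 0, 1, 3, 2, 0, 1} this returns {1, 2, 3}.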
- calculating some statistics of the simulation
// Total number of simulated electrons
cout << "Total number of Electrons simulated: "<< nEOT << endl;
// Compute the simulation times
auto AvgTime = RunSummary.Histo1D<double>({"AvgTime", "AvgTime", 250, 0, 0}, "AvgTime");
auto TotTime = RunSummary.Histo1D<double>({"TotTime", "TotTime", 250, 0, 0}, "TotTime");
Double_t x, q; q = 0.5; // 0.5 for "median"
AvgTime->ComputeIntegral(); // just a precaution
AvgTime->GetQuantiles(1, &x, &q);
cout<<"Mean time to follow a primary: "<<AvgTime->GetMean()<<endl;
cout<<"Median time to follow a primary: "<<x<<endl;
TotTime->ComputeIntegral(); // just a precaution
TotTime->GetQuantiles(1, &x, &q);
cout<<"Mean time to complete a job: "<<TotTime->GetMean()<<endl;
cout<<"Median time to complete a job: "<<x<<endl;
// Surfaces found
cout << "Total Surfaces found: " << SurfaceIDs->size() << " -> ";
for(auto i : SurfaceIDs){
cout << i << " ";
}
cout << endl;
- building some histograms using this function to get some information about several surfaces (here I pass the new dataframe with the weight column and the surface ID):
void count_on_surf(ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager, void> Ekin, int ID){
string filter = "SurfaceID == " + to_string(ID);
auto nN1 = Ekin.Filter(filter).Count();
double Error;
double Integral = Ekin.Filter(filter).Histo1D<double, double>({"energy", "Energy Det 300; E (GeV); Particles/EOT", 200, 0, 0}, "Ekin", "Weight1")->IntegralAndError(1,200,Error);
cout << "Total entries on surface "<< ID << ": \t" << *nN1 << " \t Integral (NO - EOT): \t"<< Integral << " +/- "<< Error << " (" << Error/Integral*100 << "%)" << endl;
}
- lastly, I build several histograms in a loop using multiple filters (here is an example of what I’m doing):
for(int i = 0; i < SurfaceIDs->size(); i++){
string filter = "SurfaceID == " + std::to_string(SurfaceIDs->at(i));
auto energy = Ekin.Filter(filter).Histo1D<double, double>({Form("%s_Energy_Det%d", particella.c_str(), SurfaceIDs->at(i)), Form("%s energy Det %d; E (GeV); Particles/EOT", particella.c_str(), SurfaceIDs->at(i)), 200, 0, 0},"Ekin", "Peso");
//another 13 histograms made in this way but with different ranges and variables
energy_ranges_vec.push_back(energy); //put all the histograms in a vector
ROOT::RDF::RunGraphs({PFiltered, PxFiltered, PyFiltered, PzFiltered, XFiltered, YFiltered, ZFiltered, energy, energy0_10KeV, energy10_100KeV, energy100KeV_10MeV, energy10_20MeV, energy20_100MeV, energy100MeV_11GeV});
//Energy ranges histograms saving
c_energies = new TCanvas(Form("c_energies%d",i), "c_energies", 600*3, 500*3);
//Some conditions on where to print
energy_ranges_vec[x]->Draw("histe");
c_energies->SaveAs(Form("Graphs/Energy/Energy_Surface_%03d.png",SurfaceIDs->at(i)));
energy_ranges_vec.clear();
}
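For clarity, this is how the per-surface filter expression and the output file name are put together in the loop, as a standalone sketch without ROOT. The helper names surfaceFilter and energyPlotPath are hypothetical, and std::snprintf stands in for the Form call used in my script.

```cpp
#include <cstdio>
#include <string>

// Build the filter expression passed to Filter() for one surface.
// (Hypothetical helper, not part of the actual script.)
std::string surfaceFilter(unsigned int id) {
    return "SurfaceID == " + std::to_string(id);
}

// Build the PNG path for one surface, zero-padding the ID to three digits.
// (Hypothetical helper; the script uses ROOT's Form for this.)
std::string energyPlotPath(unsigned int id) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "Graphs/Energy/Energy_Surface_%03u.png", id);
    return std::string(buf);
}
```

For surface 7, surfaceFilter(7) gives "SurfaceID == 7" and energyPlotPath(7) gives "Graphs/Energy/Energy_Surface_007.png".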
However, the code actually stops before even reaching the loop, at the beginning of point 3.
I will attach the whole script in case you want to see what I’m doing.
Thanks in advance,
Antonino