I’m currently working on a project in which part of the workflow involves merging O(100) files with histograms (TH1F). Each histogram has roughly (it varies a little) 1300 bins. When fully merged, the final file has 975000 histograms.
This whole process can take up to 26 hours (AMD EPYC 7452, 128 cores, 256 GB of memory). The merging is currently done with “hadd”, and TFileMerger gives the same performance (as expected). Parallelization (-j) does not bring improvements; I suspect this is because there is very little overlap of histograms between the original files. Most of the time (but not always) the machine is just copying histograms from source to destination.
Is there any recommendation in cases like this on how to speed up the merging process? Perhaps by taking advantage of the fact that there is very little adding and mostly copying… I am even willing to pursue a lower-level solution, something not already implemented, if it makes sense, of course.
Do you mean merging 10 batches of 10 files each instead of merging all 100 files at once, followed by merging the 10 intermediate files into 1? But that would double the amount of copying, wouldn’t it?
I tried to slice the merging into batches, but it did not perform well.
In the end I wrote the program below to do the merging. I had to accommodate a high memory consumption, but the processing time decreased from 26 hours to 20 minutes.
#include <cstdlib>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>
#include "fmt/format.h"
#include "TFile.h"
#include "TH1F.h"
#include "TKey.h"
auto merger(const std::vector<std::string> &input_files, const std::string &output_file) -> void
{
auto histos = std::unordered_map<std::string, std::unique_ptr<TH1F>>();
for (auto &&file_path : input_files)
{
std::unique_ptr<TFile> root_file(TFile::Open(file_path.c_str()));
if (!root_file || root_file->IsZombie())
{
fmt::print(stderr, "WARNING: could not open {}, skipping.\n", file_path);
continue;
}
TIter keyList(root_file->GetListOfKeys());
TKey *key;
while ((key = (TKey *)keyList()))
{
auto full_name = std::string(key->GetName());
if (full_name.find("[EC_") == 0)
{
if (histos.find(full_name) == histos.end())
{
histos.insert({full_name, std::unique_ptr<TH1F>(static_cast<TH1F *>(key->ReadObj()))});
}
else
{
// ReadObj returns a heap-allocated object; own it so it is freed after Add.
std::unique_ptr<TH1F> histo_to_add(static_cast<TH1F *>(key->ReadObj()));
histos[full_name]->Add(histo_to_add.get());
}
}
}
}
std::unique_ptr<TFile> output_root_file(TFile::Open(output_file.c_str(), "RECREATE", "", 0, 0));
for (auto &&[name, histo] : histos)
{
output_root_file->WriteObject(histo.get(), name.c_str());
}
}
auto main(int argc, char *argv[]) -> int
{
TH1::AddDirectory(false);
TDirectory::AddDirectory(false);
if (argc < 3)
{
fmt::print(stderr, "ERROR: Could not merge files.\nUsage: {} <output> <input1> <input2> ...\n", argv[0]);
std::exit(EXIT_FAILURE);
}
std::string output_file = argv[1];
std::vector<std::string> input_files;
input_files.reserve(argc - 2);
for (int i = 2; i < argc; i++)
{
input_files.push_back(argv[i]);
}
merger(input_files, output_file);
fmt::print("Done: {}\n", output_file);
return EXIT_SUCCESS;
}
I have never used the program hadd, but I would hope that it does exactly what your program does. You see a dramatic improvement of roughly a factor of 80 in time (26 hours down to 20 minutes); any thoughts on that?
The main difference between your code and hadd/TFileMerger(..., kTRUE) is that you can assume that all the histograms have the same binning, while hadd can’t. In other words, you use the fast TH1::Add, whereas hadd needs to call the more general TH1::Merge. That is likely where the difference comes from.