I am trying to create a ROOT file with a folder structure that has many entries (1000 x 1000 folders). I am using this simple macro:
void mwe() {
  std::vector<std::string> samples;
  std::vector<std::string> systematics;
  for (int i = 0; i < 1000; ++i) {
    samples.emplace_back("sample_" + std::to_string(i));
    systematics.emplace_back("syst_" + std::to_string(i));
  }
  std::unique_ptr<TFile> out(TFile::Open("example.root", "RECREATE"));
  for (const auto& isample : samples) {
    for (const auto& isystematic : systematics) {
      out->cd();
      gDirectory->mkdir((isample + "/" + isystematic).c_str());
    }
  }
  out->Close();
}
And it seems to be extremely slow (it has been running for more than 1 hour and has not yet finished). Surprisingly, it also seems to require more than 3 GB of memory. I have tested it with ROOT 6.26.08, 6.28.04, and even the master branch from a few days ago. In all cases the code is extremely slow. Is this expected? Or am I not creating the directories in an efficient way?
Cheers,
Tomas
ROOT Version: 6.26.08 or newer
Platform: Ubuntu or CentOS7
Compiler: gcc (Ubuntu 12.3.0-1ubuntu1~23.04) or gcc (GCC) 11.3.0
That sounds about ‘right’. A TDirectory takes about 4 kB due to pre-allocating for caching and similar accelerations, so 1000 × 1000 directories will lead to about 4 GB (well, apparently a bit less).
There are ways you could reduce this, but depending on what you put inside the directories it might hurt more than help.
You are using a flat structure, so the top-level directory contains 1 million directories, which is not great (and is handled very badly by the destructor).
So you should probably consider using a different organization. To understand which one, there is one important question: how do you plan to use (and read) this data?
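Independent of the layout question, one small change worth trying (a sketch only, reusing the sample/systematic names from the macro above): call mkdir on the parent directory that was just created, instead of parsing a "sample/systematic" path from the file's top level on every iteration. TDirectory::mkdir returns the newly created TDirectory*, so the inner loop never has to resolve a path:

```cpp
// Sketch: same loop as the original macro, but each sample directory is
// created once and then asked directly for its systematic subdirectories.
std::unique_ptr<TFile> out(TFile::Open("example.root", "RECREATE"));
for (const auto& isample : samples) {
    TDirectory* sampleDir = out->mkdir(isample.c_str());
    for (const auto& isystematic : systematics) {
        sampleDir->mkdir(isystematic.c_str());
    }
}
out->Close();
```

This avoids the repeated cd() and the per-call path lookup, though it does not change the per-directory memory cost discussed above.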
Thanks a lot for the replies! We use this format to store histograms in the structure regions/samples/systematics; this is usually O(10) × O(10) × O(100), but in this particular case the numbers I used in the macro are about right. I know this is an unusual setup, but I was still surprised how slow the code was. We use the histograms in the folders to do some processing (smoothing/symmetrising etc.) and then use them in HistFactory to build our model. We found that using the folder structure makes the code significantly faster when doing file->Get<TH1>("histo") many times, which is why we use the folder structure now.
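For concreteness, the read pattern described above looks something like this (the region/sample/systematic/histogram names here are hypothetical, just illustrating the nested lookup):

```cpp
// Sketch: fetch one histogram out of the regions/samples/systematics layout.
std::unique_ptr<TFile> f(TFile::Open("example.root"));
// Get<T> resolves the slash-separated path through the nested directories.
TH1* h = f->Get<TH1>("regionA/sample_0/syst_0/nominal");
if (h) h->Smooth();  // e.g. the smoothing step mentioned above
```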