Very slow and memory-intensive creation of directories inside a ROOT file

Hi,

I am trying to create a ROOT file with a folder structure that has many entries (1000 × 1000 folders). I am using this simple macro:

void mwe() {

  std::vector<std::string> samples;
  std::vector<std::string> systematics;

  for (int i = 0; i < 1000; ++i) {
    samples.emplace_back("sample_" + std::to_string(i));
    systematics.emplace_back("syst_" + std::to_string(i));
  }

  std::unique_ptr<TFile> out(TFile::Open("example.root", "RECREATE"));

  // create sample_i/syst_j for every pair: one million directories in total
  for (const auto& isample : samples) {
    for (const auto& isystematic : systematics) {
      out->cd();
      gDirectory->mkdir((isample + "/" + isystematic).c_str());
    }
  }

  out->Close();
}

And it seems to be extremely slow (it has been running for more than 1 hour and has not yet finished). Surprisingly, it also seems to require more than 3 GB of memory. I have tested it with ROOT 6.26.08, 6.28.04, and even the master branch from a few days ago; in all cases the code is extremely slow. Is this expected? Or am I not creating the directories in an efficient way?

Cheers,
Tomas

ROOT Version: 6.26.08 or newer
Platform: Ubuntu or CentOS7
Compiler: gcc (Ubuntu 12.3.0-1ubuntu1~23.04) or gcc (GCC) 11.3.0

Maybe @pcanal or @Axel can give some hints

This is still an unresolved issue. To work around it, do:

gROOT->GetListOfFiles()->Remove(out);

and do NOT delete the TFile object, i.e. use:

  TFile *out = TFile::Open("example.root", "RECREATE");

  for (const auto& isample : samples) {
    for (const auto& isystematic : systematics) {
      out->mkdir((isample + "/" + isystematic).c_str());
    }
  }
  out->Write(); // no out->Close() and no delete: the TFile is deliberately leaked
}

(Yes, this is a memory leak, and at your problem size the leak itself could be a concern.)
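Putting the two pieces together, the complete workaround macro would look roughly like this (same setup as the original macro; the essential points are the raw pointer, the Write(), the Remove(), and the absence of Close()/delete):

void mwe_workaround() {

  std::vector<std::string> samples;
  std::vector<std::string> systematics;

  for (int i = 0; i < 1000; ++i) {
    samples.emplace_back("sample_" + std::to_string(i));
    systematics.emplace_back("syst_" + std::to_string(i));
  }

  // Raw pointer on purpose: the object is intentionally never deleted.
  TFile *out = TFile::Open("example.root", "RECREATE");

  for (const auto& isample : samples) {
    for (const auto& isystematic : systematics) {
      out->mkdir((isample + "/" + isystematic).c_str());
    }
  }

  out->Write();

  // Detach the file from ROOT's global bookkeeping so the slow recursive
  // cleanup never runs; deliberately no Close() and no delete.
  gROOT->GetListOfFiles()->Remove(out);
}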

That sounds about ‘right’. A TDirectory takes about 4 kB, due to pre-allocating buffers for caching and similar accelerations. So 1000 × 1000 directories will lead to roughly 4 GB (well, apparently a bit less 🙂).
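If you want to check the per-directory cost on your own setup, you can sample the process's resident memory around the directory creation with gSystem->GetProcInfo() (a rough sketch; the macro and directory names are made up, and the exact numbers will vary by platform and ROOT version):

void measure_dir_memory(int ndirs = 10000) {

  std::unique_ptr<TFile> f(TFile::Open("mem_test.root", "RECREATE"));

  ProcInfo_t before, after;
  gSystem->GetProcInfo(&before);

  for (int i = 0; i < ndirs; ++i) {
    f->mkdir(("dir_" + std::to_string(i)).c_str());
  }

  gSystem->GetProcInfo(&after);

  // ProcInfo_t reports memory in kB.
  std::cout << "resident memory per directory: "
            << double(after.fMemResident - before.fMemResident) / ndirs
            << " kB" << std::endl;

  // Keep ndirs moderate: with very many directories the cleanup in
  // Close() is itself the slow path discussed in this thread.
  f->Close();
}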

There are ways you could reduce this, but depending on what you put inside the directories, it might hurt more than help.

You are using a fairly flat structure, so the file ends up containing one million directories in total, which is not great (and is handled very badly by the destructor).

So you should probably consider using a different organization, for instance along the lines of the sketch below. To understand which one fits, there is one important question: how do you plan to use (and read) this data?
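For illustration only (a hypothetical layout, not a specific recommendation from this thread): if the systematic level is not needed for navigation, it can be folded into the histogram names, which cuts the number of TDirectory objects from ~10^6 to ~10^3:

  // Hypothetical alternative: one directory per sample, with the
  // systematic encoded in the histogram name instead of a subdirectory.
  for (const auto& isample : samples) {
    TDirectory *dir = out->mkdir(isample.c_str());
    for (const auto& isystematic : systematics) {
      TH1D h(isystematic.c_str(), isystematic.c_str(), 100, 0., 1.);
      h.SetDirectory(nullptr); // keep the stack object out of gDirectory
      dir->WriteObject(&h, isystematic.c_str());
    }
  }

The histograms can then still be read by path, e.g. f->Get<TH1>("sample_0/syst_3").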

Thanks a lot for the replies! We use this format to store histograms in a regions/samples/systematics structure. This is usually O(10) × O(10) × O(100), but in this particular case the numbers I used in the macro are about right. I know this is an unusual setup, but I was still surprised by how slow the code was. We use the histograms in the folders to do some processing (smoothing/symmetrising etc.) and then use them in HistFactory to build our model. We found that the folder structure makes the code significantly faster when calling file->Get<TH1>("histo") many times, which is why we use it now.
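For context, the read side then looks roughly like this (a sketch with made-up region/sample/systematic/histogram names):

  std::unique_ptr<TFile> f(TFile::Open("example.root", "READ"));
  // Each path component is looked up in its own small directory,
  // instead of scanning one huge flat list of keys.
  TH1 *h = f->Get<TH1>("regionA/sample_0/syst_3/histo");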

Alright, that makes sense. So the workaround above is your best bet until we find a solution.

The PR “io: Do not turn on automatically MustClean for TDirectory” by pcanal (root-project/root#13451, https://github.com/root-project/root/pull/13451) should fix the problem.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.