Very slow and memory-intensive creation of directories inside a ROOT file

Hi,

I am trying to create a ROOT file with a folder structure that has many entries (1000 × 1000 folders). I am using this simple macro:

void mwe() {

  std::vector<std::string> samples;
  std::vector<std::string> systematics;

  for (int i = 0; i < 1000; ++i) {
    samples.emplace_back("sample_" + std::to_string(i));
    systematics.emplace_back("syst_" + std::to_string(i));
  }

  std::unique_ptr<TFile> out(TFile::Open("example.root", "RECREATE"));

  // create sample_i/syst_j for every pair: one million directories in total
  for (const auto& isample : samples) {
    for (const auto& isystematic : systematics) {
      out->cd();
      gDirectory->mkdir((isample + "/" + isystematic).c_str());
    }
  }

  out->Close();
}

And it seems to be extremely slow (it has been running for more than 1 hour and has not yet finished). Surprisingly, it also seems to require more than 3 GB of memory. I have tested it with ROOT 6.26.08, 6.28.04, and even the master branch from a few days ago; in all cases the code is extremely slow. Is this expected? Or am I not creating the directories in an efficient way?

Cheers,
Tomas

ROOT Version: 6.26.08 or newer
Platform: Ubuntu or CentOS7
Compiler: gcc (Ubuntu 12.3.0-1ubuntu1~23.04) or gcc (GCC) 11.3.0

Maybe @pcanal or @Axel can give some hints

This is still an unresolved issue. To work around it, do:

gROOT->GetListOfFiles()->Remove(out);

and do NOT delete the TFile object, i.e. use:

  TFile *out = TFile::Open("example.root", "RECREATE");

  for (const auto& isample : samples) {
    for (const auto& isystematic : systematics) {
      out->mkdir((isample + "/" + isystematic).c_str());
    }
  }
  out->Write(); // no out->Close() and no delete: the TFile is deliberately leaked
}

(Yes, this is a memory leak, and at your problem size the leak itself could be a concern.)
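Putting the two pieces together, the complete workaround macro would look roughly like this (same setup as the original macro; the essential points are the raw pointer, the Write(), the Remove(), and the absence of Close()/delete):

void mwe_workaround() {

  std::vector<std::string> samples;
  std::vector<std::string> systematics;

  for (int i = 0; i < 1000; ++i) {
    samples.emplace_back("sample_" + std::to_string(i));
    systematics.emplace_back("syst_" + std::to_string(i));
  }

  // Raw pointer on purpose: the object is intentionally never deleted.
  TFile *out = TFile::Open("example.root", "RECREATE");

  for (const auto& isample : samples) {
    for (const auto& isystematic : systematics) {
      out->mkdir((isample + "/" + isystematic).c_str());
    }
  }

  out->Write();

  // Detach the file from ROOT's global bookkeeping so the slow recursive
  // cleanup never runs; deliberately no Close() and no delete.
  gROOT->GetListOfFiles()->Remove(out);
}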

That sounds about ‘right’. A TDirectory takes about 4 kB, due to pre-allocating buffers for caching and similar accelerations. So 1000 × 1000 directories will lead to roughly 4 GB (well, apparently a bit less 🙂).
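If you want to check the per-directory cost on your own setup, you can sample the process's resident memory around the directory creation with gSystem->GetProcInfo() (a rough sketch; the macro and directory names are made up, and the exact numbers will vary by platform and ROOT version):

void measure_dir_memory(int ndirs = 10000) {

  std::unique_ptr<TFile> f(TFile::Open("mem_test.root", "RECREATE"));

  ProcInfo_t before, after;
  gSystem->GetProcInfo(&before);

  for (int i = 0; i < ndirs; ++i) {
    f->mkdir(("dir_" + std::to_string(i)).c_str());
  }

  gSystem->GetProcInfo(&after);

  // ProcInfo_t reports memory in kB.
  std::cout << "resident memory per directory: "
            << double(after.fMemResident - before.fMemResident) / ndirs
            << " kB" << std::endl;

  // Keep ndirs moderate: with very many directories the cleanup in
  // Close() is itself the slow path discussed in this thread.
  f->Close();
}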

There are ways you could reduce this, but depending on what you put inside the directories, it might hurt more than help.

You are using a fairly flat structure, so the file ends up containing one million directories in total, which is not great (and is handled very badly by the destructor).

So you should probably consider using a different organization, for instance along the lines of the sketch below. To understand which one fits, there is one important question: how do you plan to use (and read) this data?
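For illustration only (a hypothetical layout, not a specific recommendation from this thread): if the systematic level is not needed for navigation, it can be folded into the histogram names, which cuts the number of TDirectory objects from ~10^6 to ~10^3:

  // Hypothetical alternative: one directory per sample, with the
  // systematic encoded in the histogram name instead of a subdirectory.
  for (const auto& isample : samples) {
    TDirectory *dir = out->mkdir(isample.c_str());
    for (const auto& isystematic : systematics) {
      TH1D h(isystematic.c_str(), isystematic.c_str(), 100, 0., 1.);
      h.SetDirectory(nullptr); // keep the stack object out of gDirectory
      dir->WriteObject(&h, isystematic.c_str());
    }
  }

The histograms can then still be read by path, e.g. f->Get<TH1>("sample_0/syst_3").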

Thanks a lot for the replies! We use this format to store histograms in a regions/samples/systematics structure. This is usually O(10) × O(10) × O(100), but in this particular case the numbers I used in the macro are about right. I know this is an unusual setup, but I was still surprised by how slow the code was. We use the histograms in the folders to do some processing (smoothing/symmetrising etc.) and then use them in HistFactory to build our model. We found that the folder structure makes the code significantly faster when calling file->Get<TH1>("histo") many times, which is why we use it now.
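For context, the read side then looks roughly like this (a sketch with made-up region/sample/systematic/histogram names):

  std::unique_ptr<TFile> f(TFile::Open("example.root", "READ"));
  // Each path component is looked up in its own small directory,
  // instead of scanning one huge flat list of keys.
  TH1 *h = f->Get<TH1>("regionA/sample_0/syst_3/histo");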

Alright, that makes sense. So the workaround above is your best bet until we find a solution.

The PR “io: Do not turn on automatically MustClean for TDirectory” by pcanal (root-project/root#13451, https://github.com/root-project/root/pull/13451) should fix the problem.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.