I have quite a large number of histograms in my C++ code, which uses the ROOT libraries.
The further the analysis progresses, the greater the number of histograms. I run 12 sets of 2000 jobs, one set at a time. Recently I have noticed that, while merging the histos, hadd reports the following warnings:
hadd Source file 1998: data_997.root
hadd Source file 1999: data_998.root
hadd Source file 2000: data_999.root
hadd Target path: t200-32.root:/
hadd Target path: t200-32.root:/demo
hadd Opening the next 923 files
Warning in <TFile::Init>: file data_1914.root probably not closed, trying to recover
Info in <TFile::Recover>: data_1914.root, recovered key TDirectoryFile:demo at address 232
Warning in <TFile::Init>: successfully recovered 1 keys
hadd Target path: t200-32.root:/
hadd Target path: t200-32.root:/demo
hadd Opening the next 154 files
hadd Target path: t200-32.root:/
hadd Target path: t200-32.root:/demo
Moreover, I am observing a loss of statistics.
Searching for information on this warning, I found a ROOT Forum thread reporting a memory-limit issue (max = 2 GB).
Rene Brun said:
“I see from the result of file->Map that you have reached the maximum default Tree size limit of 1.9 GBytes. You should have received an error message or a warning indicating that the system is automatically switching to a new file. Read carefully the documentation of TTree::ChangeFile at root.cern.ch/root/html/TTree.htm … ChangeFile”
It seems that ROOT creates a new output file, which is not properly closed. However, in my case, the previous one gets lost. A solution would be to increase the size limit.
How do I do that in my C++ code?
TTree::LoadBaskets(Long64_t maxmemory = 2000000000)
Read in memory all baskets from all branches up to the limit of maxmemory bytes.
The default for maxmemory is 2000000000 (2 gigabytes).
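Note that LoadBaskets only controls how much of an existing tree is read into memory; the 1.9 GB limit Rene Brun refers to is governed by the static TTree::SetMaxTreeSize. A minimal sketch of raising it before filling, with illustrative names (out.root, demo, x are placeholders, not from the analysis code):

#include "TFile.h"
#include "TTree.h"

void writeBigTree() {
  // Raise the per-file tree size limit from the ~1.9 GB default to 1 TB,
  // so ROOT never triggers the automatic file switch (TTree::ChangeFile).
  TTree::SetMaxTreeSize(1000000000000LL); // 1 TB

  TFile *f = TFile::Open("out.root", "RECREATE");
  f->cd(); // make out.root the current directory so the tree is created inside it
  TTree *t = new TTree("demo", "demo tree");
  double x = 0;
  t->Branch("x", &x);
  for (long i = 0; i < 1000; ++i) { x = i; t->Fill(); }

  // If a file switch did happen, 'f' would be stale; always go through
  // the tree to find the file it currently lives in.
  t->GetCurrentFile()->Write();
  t->GetCurrentFile()->Close();
}

The GetCurrentFile() calls matter because, when the limit is exceeded, ROOT's automatic TTree::ChangeFile switch closes the original file and leaves the old TFile pointer invalid.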
Well, it could be that some “job” that creates individual files died and left a file improperly closed.
But then, adding “mytree->...->Write();” will not help (as the “job” will die anyhow while processing some “peculiar” event).
On the other hand, if it is just a single file out of 2000, then there should be no significant “loss of statistics” (though, yes, one should try to debug why this “job” died for this file and how many events are missing).
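One way to spot the improperly closed files before merging is to open each partial file and check ROOT's recovery flag. A sketch, assuming the data_N.root naming from the log above:

#include "TFile.h"
#include <cstdio>
#include <memory>

// Scan the partial files and report any that were not closed cleanly.
// ROOT sets the kRecovered bit on a TFile whose keys had to be recovered.
void findBadFiles(int nFiles = 2000) {
  for (int i = 0; i < nFiles; ++i) {
    char name[64];
    std::snprintf(name, sizeof(name), "data_%d.root", i);
    std::unique_ptr<TFile> f(TFile::Open(name, "READ"));
    if (!f || f->IsZombie())
      std::printf("unreadable: %s\n", name);
    else if (f->TestBit(TFile::kRecovered))
      std::printf("recovered (not closed cleanly): %s\n", name);
  }
}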
Thanks for the input.
My code generates ROOT files with lots of histograms. I realized that the more histograms I set up in the code, the more output files exhibit the problem above. Although they are always recovered when merging via hadd, I have observed a loss of statistics.
I do not use Write() or Close() in my code; as far as I understand, the framework's TFileService takes care of writing and closing the output file.
Sometimes I get only one problematic file, sometimes dozens.
I had to turn off a bunch of histos in order to minimize this problem.
For instance, in my KshortKshort channel analysis #15 I’ve got 3679 events;
in my analysis #31 I’ve got 3181 events. This is a big difference. That is my concern.
The following method does not work:

#include "TTree.h"

int startup() {
  TTree::SetMaxTreeSize( 1000000000000LL ); // 1 TB
  return 0;
}

namespace { static int i = startup(); }
LD_PRELOAD=startup_C.so hadd …
(startup_C.so is the shared library that ACLiC produces when the snippet above is compiled once as a ROOT macro, e.g. root -b -q startup.C+; preloading it raises the limit inside hadd before any file is opened.)
I have tried:

#include "TTree.h"
…

void
PromptAnalyzer::analyze(const edm::Event& iEvent, const edm::EventSetup& iSetup)
{
  …
  TTree::SetMaxTreeSize( 1000000000000LL );
  …
}
No luck. I also tried:
void
PromptAnalyzer::beginJob()
{
  …
  TTree::SetMaxTreeSize( 1000000000000LL );
  …
  histosTH1F["hpt"] = fs->make<TH1F>("hpt", "p_{T}", nbins_pt, 0, 5);
}
No luck. Also:

void
PromptAnalyzer::beginRun(edm::Run const& run, edm::EventSetup const& es)
{
  …
  TTree::SetMaxTreeSize( 1000000000000LL );
  …
}

No luck.
By the way, let me correct the information I provided: the loss of statistics is not about 500 events as I mentioned above; it is much less. I still have to measure it, though.
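A straightforward way to measure the loss is to sum the entries of one reference histogram over all partial files and compare the total with the merged output. A sketch, assuming the histogram sits under demo/hpt as in the snippets above and the file names from the log (adjust as needed):

#include "TFile.h"
#include "TH1.h"
#include <cstdio>
#include <memory>

// Compare the summed entries in the partial files with the merged file.
void countEntries(int nFiles = 2000) {
  double total = 0;
  for (int i = 0; i < nFiles; ++i) {
    char name[64];
    std::snprintf(name, sizeof(name), "data_%d.root", i);
    std::unique_ptr<TFile> f(TFile::Open(name, "READ"));
    if (!f || f->IsZombie()) continue;
    TH1 *h = nullptr;
    f->GetObject("demo/hpt", h); // path: directory "demo", histogram "hpt"
    if (h) total += h->GetEntries();
  }
  std::unique_ptr<TFile> merged(TFile::Open("t200-32.root", "READ"));
  TH1 *hm = nullptr;
  if (merged) merged->GetObject("demo/hpt", hm);
  std::printf("partial sum: %.0f  merged: %.0f  missing: %.0f\n",
              total, hm ? hm->GetEntries() : 0.0,
              hm ? total - hm->GetEntries() : total);
}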
Maybe another idea … check the “open files” limits in:
ulimit -H -a
ulimit -S -a
Try to increase it (in the shell in which you then run “hadd”), e.g.:
ulimit -S -n 4096
You could also try:
hadd -n 1 ...
And you could also try to increase the “stack size” limit (also in the shell in which you run the “jobs” that produce the partial files, as it is possible that some “job” dies because of it), e.g.:
ulimit -S -s 32768
I guess, to “exclude” some possible “known” problems, try first:
ulimit -S -s 32768; hadd -T -n 1 ...
If “hadd” still dies on the same partial input file, try to set “ulimit -S -s 32768” and (in the same shell) run the job that produces it again.
Actually, maybe you could attach this file here for inspection.
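In the meantime, you can inspect a suspect file yourself with the TFile::Map call mentioned in Rene Brun’s quote above; it prints the physical layout of the file and shows where the data ends. A sketch (data_1914.root stands in for whichever file was recovered):

#include "TFile.h"
#include <memory>

// Print the key structure and physical record map of a suspect file.
void inspectFile(const char *name = "data_1914.root") {
  std::unique_ptr<TFile> f(TFile::Open(name, "READ"));
  if (!f || f->IsZombie()) return;
  f->ls();   // list the keys (directories, histograms, trees)
  f->Map();  // dump the record-by-record layout of the file
}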