TTree memory limit

Hi Wile,

The following method does not work:
#include "TTree.h"
int startup() {
  TTree::SetMaxTreeSize( 1000000000000LL ); // 1 TB
  return 0;
}
namespace { static int i = startup(); }
LD_PRELOAD=startup_C.so
hadd…
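
(For reference, startup_C.so would presumably be the ACLiC build of the macro above, i.e. something along these lines; the merged file name and input pattern below are just placeholders:)

root -b -l -q startup.C+        # ACLiC compiles startup.C into startup_C.so
LD_PRELOAD=./startup_C.so hadd -f merged.root data_*.root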

I have tried:
#include "TTree.h"

void
PromptAnalyzer::analyze(const edm::Event& iEvent, const edm::EventSetup& iSetup)
{

TTree::SetMaxTreeSize( 1000000000000LL );

}
no luck. I tried also:
void
PromptAnalyzer::beginJob()
{

TTree::SetMaxTreeSize( 1000000000000LL );

  histosTH1F["hpt"] = fs->make<TH1F>("hpt","p_{T}",nbins_pt,0,5);
}
no luck. Also:
void
PromptAnalyzer::beginRun(edm::Run const& run, edm::EventSetup const& es)
{

TTree::SetMaxTreeSize( 1000000000000LL );

}
no luck.

Luiz Regis

Hi Luiz,

Is there a TTree stored in the output file of your PromptAnalyzer?

Philippe.

Hi Philippe,

No, only histograms in the output ROOT file.
In my C++ code I have the data (ntuple) accessed via getByToken:

#include <memory>
#include <map>

class PromptAnalyzer : public edm::one::EDAnalyzer<edm::one::SharedResources>  {
public:
      explicit PromptAnalyzer(const edm::ParameterSet&);
      ~PromptAnalyzer();
  private:
      virtual void beginJob() override;
      virtual void analyze(const edm::Event&, const edm::EventSetup&) override;
      virtual void endJob() override;
  
      virtual void beginRun(edm::Run const&, edm::EventSetup const&);
      virtual void endRun(edm::Run const&, edm::EventSetup const&);
      ...
  edm::EDGetTokenT<reco::TrackCollection> trkToken_;
  edm::EDGetTokenT<vector<CTPPSLocalTrackLite> > RPtrkToken_;
  edm::EDGetTokenT<reco::VertexCollection> vtxToken_;
  edm::EDGetTokenT<reco::BeamSpot> beamspotToken_;
  edm::EDGetTokenT<edm::TriggerResults>  trigToken_;
  // V0 ...Luiz
  edm::EDGetTokenT<reco::VertexCompositeCandidateCollection> kshortsToken_;
  edm::EDGetTokenT<reco::VertexCompositeCandidateCollection> lambdasToken_;
  edm::EDGetTokenT<reco::DeDxDataValueMap> dedxsToken_;
  edm::EDGetTokenT<reco::DeDxDataValueMap> dedxPIXsToken_;
  
  HLTConfigProvider hltConfig_;

  map<string,TH1F*> histosTH1F;
  map<string,TH2F*> histosTH2F;

};
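
(For completeness, a minimal sketch of how one of these tokens is typically initialized and then read with getByToken; the "tracks" parameter name in this sketch is an assumption, not taken from my configuration:)

// constructor: register the consumed collection (parameter name "tracks" is hypothetical)
PromptAnalyzer::PromptAnalyzer(const edm::ParameterSet& iConfig)
  : trkToken_(consumes<reco::TrackCollection>(iConfig.getParameter<edm::InputTag>("tracks")))
{
}

// in analyze(): retrieve the collection for the current event
edm::Handle<reco::TrackCollection> tracks;
iEvent.getByToken(trkToken_, tracks);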

By the way, let me correct the information I provided:
the loss is not about 500 events, as I mentioned above; it is much less.
I still have to measure it, though.

thanks,
Luiz

So my advice was irrelevant :frowning:

In your original post, the “failing” file is data_1914.root; is that one of the outputs of PromptAnalyzer?

Yes, it is.

Maybe another idea … check “open files” limits in:
ulimit -H -a
ulimit -S -a

Try to increase it (in a shell in which you then run “hadd”), e.g.:
ulimit -S -n 4096

You could also try: hadd -n 1 ...

And you could also try to increase the “stack size” limit (also in the shell in which you run the “jobs” that produce the partial files, as it is possible that some “job” dies because of it), e.g.:
ulimit -S -s 32768

When you re-run the example that leads to the bad file, is it always the file data_1914.root that causes the problem?

Hi Philippe,

Yes, it is.

Luiz

Hi Wile,

Thanks for the input. I will give it a try as soon as possible.

Luiz

What code do you use to open the files and to close them? When is that code (respectively) called?

Here are the limits:

[lregisem@lxplus751 src]$ ulimit -H -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 116931
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 4096
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 116931
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

[lregisem@lxplus751 src]$ ulimit -S -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 116931
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

I guess, to “exclude” some possible “known” problems, try first: “ulimit -S -s 32768; hadd -T -n 1 ...”
If the “hadd” still dies on the same partial input file, try to set “ulimit -S -s 32768” and (in the same shell) run the job that produces it again.

Actually, maybe you could attach this file here for inspection.

The code I use to access the dataset files is:

python config:

process.source = cms.Source("PoolSource",
    fileNames = cms.untracked.vstring(XXX)
,
lumisToProcess = cms.untracked.VLuminosityBlockRange('319104:1-319104:10',
'319104:15-319104:185','319124:91-319124:277','319125:1-319125:208','319159:125-319159:618',
'319174:1-319174:77','319175:1-319175:139','319176:1-319176:1803','319177:1-319177:232',
'319190:1-319190:317','319222:108-319222:294','319223:1-319223:131','319254:115-319254:263',
'319255:1-319255:164','319256:1-319256:726','319262:10-319262:10','319262:15-319262:16',
'319262:20-319262:23','319262:29-319262:34','319262:39-319262:40','319262:46-319262:58',
'319262:61-319262:78','319262:82-319262:123','319262:129-319262:362','319263:1-319263:367',
'319264:1-319264:57','319265:1-319265:396','319266:1-319266:26','319267:1-319267:204',
'319268:1-319268:467','319270:1-319270:206','319300:1-319300:1132','319311:1-319311:1733'
)
)

process.TFileService = cms.Service("TFileService",
            fileName = cms.string("output.root"),
            closeFileFast = cms.untracked.bool(False)
)

submit-condorRECOall.csh (see the attachment):

  cat ../${template_py} | sed "s|XXX|${mylist}|"  > temp_py
  cat  temp_py       | sed "s|YYY|${i}|" > ${submit_dir}/job_${i}.py
  rm -f temp_py

shell:

./submit-condorRECOall.csh t201 t20.eos_1 

PromptAnalyzer.cc:

// ------------ method called for each event  ------------
void
PromptAnalyzer::analyze(const edm::Event& iEvent, const edm::EventSetup& iSetup)
{
...
}

// ------------ method called once each job just before starting event loop  ------------
void
PromptAnalyzer::beginJob()
{
  //...Luiz
  edm::Service<TFileService> fs;
  
  int nbins_eta = 80;
  int nbins_pt = 100;
  int nbins_phi = 64;

  histosTH1F["hpt"] = fs->make<TH1F>("hpt","p_{T}",nbins_pt,0,5);
  histosTH1F["heta"] = fs->make<TH1F>("heta","#eta",nbins_eta,-4,4);
  histosTH1F["hphi"] = fs->make<TH1F>("hphi","#varphi",nbins_phi,-3.2,3.2);
  histosTH1F["halgo"] = fs->make<TH1F>("halgo","Algo",15,0,15.);
  histosTH1F["hnhits"] = fs->make<TH1F>("hnhits","nhits pix+strip",40,0,40.);
...
  std::cout<<"booked all of Luiz' histograms."<<std::endl;
  //--------------end of my histograms
}

// ------------ method called once each job just after ending the event loop  ------------
void
PromptAnalyzer::endJob()
{
  std::cout<<"ciao ciao..."<<std::endl;
}

// ------------ method called when starting to processes a run  ------------
void 
PromptAnalyzer::beginRun(edm::Run const& run, edm::EventSetup const& es)
{
  bool changed(true);
  if (hltConfig_.init(run, es, "HLT",changed)) {
    hltConfig_.dump("Triggers");
    hltConfig_.dump("PrescaleTable"); 
  }
}

// ------------ method called when ending the processing of a run  ------------
void 
PromptAnalyzer::endRun(edm::Run const&, edm::EventSetup const&)
{
}

job_1914.py

process.source = cms.Source("PoolSource",
    fileNames = cms.untracked.vstring('root://eostotem//eos/totem/data/cmstotem/2018/90m/RECO_copy/TOTEM20/110000/2625EB46-453E-E911-8EB
8-008CFA06473C.root', 'root://eostotem//eos/totem/data/cmstotem/2018/90m/RECO_copy/TOTEM20/110000/261F8013-B83D-E911-9A99-003048F2E8C0.r
oot',)
,
lumisToProcess = cms.untracked.VLuminosityBlockRange('319104:1-319104:10',
'319104:15-319104:185','319124:91-319124:277','319125:1-319125:208','319159:125-319159:618',
'319174:1-319174:77','319175:1-319175:139','319176:1-319176:1803','319177:1-319177:232',
'319190:1-319190:317','319222:108-319222:294','319223:1-319223:131','319254:115-319254:263',
'319255:1-319255:164','319256:1-319256:726','319262:10-319262:10','319262:15-319262:16',
'319262:20-319262:23','319262:29-319262:34','319262:39-319262:40','319262:46-319262:58',
'319262:61-319262:78','319262:82-319262:123','319262:129-319262:362','319263:1-319263:367',
'319264:1-319264:57','319265:1-319265:396','319266:1-319266:26','319267:1-319267:204',
'319268:1-319268:467','319270:1-319270:206','319300:1-319300:1132','319311:1-319311:1733'
)
)
process.TFileService = cms.Service("TFileService",
            fileName = cms.string("output.root"),
            closeFileFast = cms.untracked.bool(False)
)

submit-condorRECOall-csh.txt (3.0 KB)

Since the failing output file is produced by its own job (job_1914.py), does re-running the job lead to a similarly broken file? Does calling hadd on a small subset (a dozen files) that includes the failing 1914 file also fail?
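
For example (every file name here except data_1914.root is a placeholder):

hadd -f subset.root data_1910.root data_1911.root data_1912.root data_1913.root data_1914.root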

dataset: t200

ulimit -S -n 4096
ulimit -S -s 32768
hadd -T -n 50 x200.root data_*.root
… (see the attachment)
hadd Opening the next 49 files
hadd Target path: x200.root:/
hadd Target path: x200.root:/demo
hadd Opening the next 49 files
hadd Target path: x200.root:/
hadd Target path: x200.root:/demo
hadd Opening the next 49 files
Warning in TFile::Init: file data_1422.root probably not closed, trying to recover
Info in TFile::Recover: data_1422.root, recovered key TDirectoryFile:demo at address 232
Warning in TFile::Init: successfully recovered 1 keys
hadd Target path: x200.root:/
hadd Target path: x200.root:/demo

dataset: t210

ulimit -S -n 4096
ulimit -S -s 32768
hadd -T -n 50 x210.root data_*.root
… (see the attachment)
hadd Target path: x210.root:/demo
hadd Opening the next 49 files
hadd Target path: x210.root:/
hadd Target path: x210.root:/demo
hadd Opening the next 49 files
hadd Target path: x210.root:/
hadd Target path: x210.root:/demo
hadd Opening the next 49 files
Warning in TFile::Init: file data_576.root probably not closed, trying to recover
Info in TFile::Recover: data_576.root, recovered key TDirectoryFile:demo at address 232
Warning in TFile::Init: successfully recovered 1 keys
hadd Target path: x210.root:/
hadd Target path: x210.root:/demo
hadd Opening the next 49 files
hadd Target path: x210.root:/
hadd Target path: x210.root:/demo

Full outputs attached. ROOT files attached.

hadd-t200-2.txt (78.9 KB)
hadd-t210.txt (76.3 KB)
data_576.root (339 Bytes)
data_1422.root (339 Bytes)
data_1914.root (1.1 MB)

The 1914.root provided above is now from the t200 dataset, which is good: no error. The previous error on 1914 is from another unknown dataset (sorry, I do not remember which one). So, forget 1914 and focus on 1422 and 576.

“576” and “1422” are completely empty (just an empty “demo” directory is inside).
You need to inspect the jobs which created them (it seems they died before any histograms were written).
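
(A quick way to spot such files before running hadd is to open each one and check whether ROOT had to recover it or whether its “demo” directory is empty; a rough sketch, where checkFile.C is just a hypothetical helper name:)

// checkFile.C -- hypothetical helper, not part of the analyzer code.
// Usage: root -l -b -q 'checkFile.C("data_1422.root")'
#include "TFile.h"
#include "TDirectory.h"
#include <iostream>

void checkFile(const char *name) {
  TFile f(name);
  // a file that was not closed properly is flagged as recovered (or is a zombie)
  if (f.IsZombie() || f.TestBit(TFile::kRecovered)) {
    std::cout << name << " : damaged or had to be recovered" << std::endl;
    return;
  }
  // an otherwise healthy file may still contain only an empty "demo" directory
  TDirectory *demo = f.GetDirectory("demo");
  if (!demo || demo->GetListOfKeys()->GetSize() == 0)
    std::cout << name << " : no histograms inside demo/" << std::endl;
}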

Please check the job 1422 files attached.

job_1422-err.txt (12.3 KB)
job_1422-out.txt (12.4 KB)
job_1422-py.txt (5.4 KB)
job_1422-sh.txt (564 Bytes)
submit_1422.txt (312 Bytes)
files_1422.txt (238 Bytes)

job 576
job_576.err.txt (14.8 KB)
job_576.out.txt (12.4 KB)
job_576.py.txt (5.4 KB)
job_576.sh.txt (561 Bytes)
submit_576.txt (308 Bytes)

I am going to resubmit 1422.