Hi,
Many of those running physics analyses on our DPM system have experienced difficulties using TTreeCache when the DPM load is high. The typical use case involves TTrees containing around six thousand branches, of which only around one thousand are read by the user jobs. The TTreeCache size is set to 100 MB, and the TFile sizes range from 600 MB to 1 GB. Before the I/O limit on the DPM is reached, this configuration produced good results with ROOT versions 5.26, 5.28 and 5.30. Once the I/O limit is reached, however, enabling TTreeCache triggers a cascade of RFIO failures. The errors are typically of the form:
Error in TRFIOFile::TRFIOFile: error doing rfio_read
Error in TBranch::GetBasket: File:
rfio:///dpm/unige.ch/home/atlas/atlaslocalgroupdisk/user/wbell/data11_7TeV/user.wbell.data11_7TeV.00191239.physics_Muons.merge.NTUP_TOPMU.f413_m1019_p694_p722_thin0001_0001.111112032448/user.wbell.005705._00037.output.root
at byte:536958, branch:EF_e45_medium1, entry:459, badread=1, nerrors=1,
basketnumber=1
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check
/var/spool/torque/mom_priv/jobs/1418590.grid07.unige.ch.SC: line 55:
6815 Aborted
It looks as though the read does not block when the process should be waiting for more data. Instead, the process continues with corrupted data and crashes, printing the errors above.
In the user code we have the lines (m_tree is a TTree*):
m_tree->SetCacheSize(104857600); // 100 MB cache.
TTreeCache::SetLearnEntries(1); // Stop learning after 1 entry is read.
m_tree->AddBranchToCache(branchName,true);
If these lines are removed, the job failure rate drops significantly; instead, most jobs just run a little slower. We see this behaviour with ROOT 5.30 and 5.32 (64-bit Linux builds from AFS). Would it be possible for TTreeCache to handle a heavily loaded file system a little more gracefully? Are future updates to TTreeCache planned?
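For reference, here is a minimal self-contained sketch of how we use the cache, folded into a defensive read loop. The file path, tree name, and branch pattern are placeholders, not our real dataset. Checking the return value of TTree::GetEntry (which is -1 on an I/O error) at least lets a job stop cleanly rather than continue on corrupted buffers, but it does not recover the entry:

```cpp
// Sketch only: TTreeCache setup with a defensive event loop.
// File, tree, and branch names below are placeholders.
#include <cstdio>
#include "TFile.h"
#include "TTree.h"
#include "TTreeCache.h"

int main() {
   TFile *file = TFile::Open("rfio:///dpm/.../input.root"); // placeholder path
   if (!file || file->IsZombie()) {
      std::fprintf(stderr, "open failed\n");
      return 1;
   }

   TTree *m_tree = nullptr;
   file->GetObject("physics", m_tree); // "physics" is a placeholder tree name
   if (!m_tree) {
      std::fprintf(stderr, "tree not found\n");
      return 1;
   }

   m_tree->SetCacheSize(104857600);        // 100 MB cache
   TTreeCache::SetLearnEntries(1);         // stop learning after 1 entry is read
   m_tree->AddBranchToCache("EF_*", true); // placeholder branch pattern

   const Long64_t nEntries = m_tree->GetEntries();
   for (Long64_t i = 0; i < nEntries; ++i) {
      // GetEntry returns the number of bytes read, or -1 on an I/O error;
      // bail out instead of processing corrupted data.
      if (m_tree->GetEntry(i) < 0) {
         std::fprintf(stderr, "I/O error at entry %lld, aborting\n", i);
         return 2;
      }
   }
   delete file;
   return 0;
}
```

This guards against silent corruption on the client side, but it cannot make the underlying rfio_read block and retry, which is the behaviour we are hoping for.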
Thanks and best regards,
Will