TTreeCache::FillBuffer error with ROOT 6.14/04

Hi,

Before going to a JIRA ticket with this, I wanted to ask for some help, as I’m having a very hard time just identifying what is even going wrong exactly…

After building the ATLAS (standalone) analysis release against ROOT 6.14/04, one of our test jobs started printing the following error message:

TTreeCache::FillBuffer    ERROR   Inconsistency: fCurrentClusterStart=0 fEntryCurrent=53 fNextClusterStart=100 but fCurrentEntry should not be in between the two

(We use our own output formatting…)

Do you have a suggestion on how I could try to narrow down this issue a bit more? The stack trace of the message doesn’t help much…

#4  0x00007fb934cd610e in xAOD::ErrorHandler(int, bool, char const*, char const*) () at /home/krasznaa/projects/lcg94/athena/Control/xAODRootAccess/Root/Init.cxx:182
#5  0x00007fb942a802bb in ErrorHandler () from /home/krasznaa/projects/lcg94/build/install/AnalysisBaseExternals/21.2.47/InstallArea/x86_64-slc6-gcc62-opt/bin/../lib/libCore.so
#6  0x00007fb942a96e31 in TObject::Error(char const*, char const*, ...) const () from /home/krasznaa/projects/lcg94/build/install/AnalysisBaseExternals/21.2.47/InstallArea/x86_64-slc6-gcc62-opt/bin/../lib/libCore.so
#7  0x00007fb9455e8454 in TTreeCache::FillBuffer() () from /home/krasznaa/projects/lcg94/build/install/AnalysisBaseExternals/21.2.47/InstallArea/x86_64-slc6-gcc62-opt/bin/../lib/libTree.so
#8  0x00007fb9455e9ea5 in TTreeCache::ReadBufferNormal(char*, long long, int) () from /home/krasznaa/projects/lcg94/build/install/AnalysisBaseExternals/21.2.47/InstallArea/x86_64-slc6-gcc62-opt/bin/../lib/libTree.so
#9  0x00007fb9455685c3 in TBasket::ReadBasketBuffers(long long, int, TFile*) () from /home/krasznaa/projects/lcg94/build/install/AnalysisBaseExternals/21.2.47/InstallArea/x86_64-slc6-gcc62-opt/bin/../lib/libTree.so
#10 0x00007fb94556e5ad in TBranch::GetBasket(int) () from /home/krasznaa/projects/lcg94/build/install/AnalysisBaseExternals/21.2.47/InstallArea/x86_64-slc6-gcc62-opt/bin/../lib/libTree.so
#11 0x00007fb94556ec82 in TBranch::GetEntry(long long, int) () from /home/krasznaa/projects/lcg94/build/install/AnalysisBaseExternals/21.2.47/InstallArea/x86_64-slc6-gcc62-opt/bin/../lib/libTree.so
#12 0x00007fb945584703 in TBranchElement::GetEntry(long long, int) () from /home/krasznaa/projects/lcg94/build/install/AnalysisBaseExternals/21.2.47/InstallArea/x86_64-slc6-gcc62-opt/bin/../lib/libTree.so
#13 0x00007fb934cdf6a3 in xAOD::TAuxStore::getData(unsigned long) const () at /home/krasznaa/projects/lcg94/athena/Control/xAODRootAccess/Root/TAuxStore.cxx:308
#14 0x00007fb934c47473 in xAOD::AuxContainerBase::getData(unsigned long) const () at /home/krasznaa/projects/lcg94/athena/Event/xAOD/xAODCore/Root/AuxContainerBase.cxx:194
#15 0x00007fb934c75db3 in xAOD::ShallowAuxContainer::getData(unsigned long) const () at /home/krasznaa/projects/lcg94/athena/Event/xAOD/xAODCore/Root/ShallowAuxContainer.cxx:189
#16 0x00007fb934bced6f in SG::AuxVectorData::isAvailableOol(unsigned long) const () at /home/krasznaa/projects/lcg94/athena/Control/AthContainers/Root/AuxVectorData.cxx:239
#17 0x00007fb92dc86710 in xAOD::Egamma_v1::nCaloClusters() const () at /home/krasznaa/projects/lcg94/athena/Control/AthContainers/AthContainers/AuxVectorData.icc:99
#18 0x00007fb92dc86a65 in xAOD::Egamma_v1::caloCluster(unsigned long) const () at /home/krasznaa/projects/lcg94/athena/Event/xAOD/xAODEgamma/Root/Egamma_v1.cxx:409
#19 0x00007fb92a6c3bf4 in met::METSignificance::AddElectron(xAOD::IParticle const*, float&, float&) () at /home/krasznaa/projects/lcg94/athena/Reconstruction/MET/METUtilities/Root/METSignificance.cxx:503
#20 0x00007fb92a6c58db in met::METSignificance::varianceMET(xAOD::MissingETContainer_v1*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) () at /home/krasznaa/projects/lcg94/athena/Reconstruction/MET/METUtilities/Root/METSignificance.cxx:307
#21 0x00007fb92a72a388 in std::_Function_handler<StatusCode ()(CP::SystematicSet), CP::MetSignificanceAlg::execute()::{lambda(CP::SystematicSet const&)#1}>::_M_invoke(std::_Any_data const&, CP::SystematicSet&&) () at /home/krasznaa/projects/lcg94/athena/PhysicsAnalysis/Algorithms/MetAnalysisAlgorithms/Root/MetSignificanceAlg.cxx:64
#22 0x00007fb92a895a5b in CP::SysListHandle::foreach(std::function<StatusCode ()(CP::SystematicSet)> const&) () at /cvmfs/sft.cern.ch/lcg/releases/gcc/6.2.0-2bc78/x86_64-slc6-gcc62-opt/include/c++/6.2.0/functional:2136
#23 0x00007fb92a7290e7 in CP::MetSignificanceAlg::execute() () at /home/krasznaa/projects/lcg94/athena/PhysicsAnalysis/Algorithms/MetAnalysisAlgorithms/Root/MetSignificanceAlg.cxx:69
#24 0x00007fb934489e82 in EL::AnaAlgorithm::sysExecute() () at /home/krasznaa/projects/lcg94/athena/PhysicsAnalysis/D3PDTools/AnaAlgorithm/Root/AnaAlgorithm.cxx:211
#25 0x00007fb934585f3b in EL::AnaAlgorithmWrapper::execute() () at /home/krasznaa/projects/lcg94/athena/PhysicsAnalysis/D3PDTools/EventLoop/Root/AnaAlgorithmWrapper.cxx:91
#26 0x00007fb9345aeb6a in EL::Worker::algsExecute() () at /home/krasznaa/projects/lcg94/athena/PhysicsAnalysis/D3PDTools/EventLoop/Root/Worker.cxx:500

I even added a few printouts about the open files / existing objects at the time of the error message, but I’m still none the wiser.

Note that it’s only this particular job that prints this message. Other (similar) jobs seem to handle their tree caches correctly. So it must be some issue specific to how this job is set up.

Any suggestions on how I should debug this further?

Cheers,
Attila


ROOT Version: 6.14/04
Platform: x86_64-slc6-gcc62-opt
Compiler: GCC 6.2


Hmm… :confused: Looking more carefully at the log files of all of our tests, I actually see this exact error message appearing in many places. Which helps, as it should make it easier to produce a relatively simple reproducer. (I hope…)

The interesting thing is that it’s always event 53 of the same file, regardless of which variable/branch is being read according to the backtrace.

As I wrote, I’ll try to produce a reproducer in some shape or form…

For a while I suspected that this may have something to do with a read rule not being executed correctly. But I now added a fair amount of extra printouts to the failing job, and it seems to me that our code is calling for the correct entry from the branch that triggers the error.

Namely, right before the error appears, I get this printout from just before the line of code that shows up in the stack trace:

xAOD::TAuxStore::getData  INFO    Called with this = 0x16e96dc0, auxid = 505, m_entry = 103
TTreeCache::FillBuffer    ERROR   Inconsistency: fCurrentClusterStart=0 fEntryCurrent=53 fNextClusterStart=100 but fCurrentEntry should not be in between the two

As you saw in the stack trace before, xAOD::TAuxStore::getData is calling TBranchElement::GetEntry, and from there on the issue is all in “ROOT’s court”.

So… Any idea how the value of 103 becomes 53 in this call tree? :confused:

Hi Attila,

The Inconsistency message was supposed to have been fixed by commit 4d05469146, which is part of v6.14/04. This means that the new TTreeCache::FillBuffer still has a (new) weakness. (And I hope that the data error is a consequence of this one.)

To reproduce it, a minimal setup is likely to need the following (see the skeleton sketch after the list):

  • the file;
  • the order of entry access;
  • the branches added to the cache (especially if they were not added manually).
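
A skeleton of what such a reproducer would need to capture could look like the following (the file name, cache settings and entry order below are all placeholders, not the actual failing configuration):

#include "TFile.h"
#include "TTree.h"

void reproducer()
{
   // 1. The file: the cluster layout on disk is part of the problem.
   TFile *f = TFile::Open("failing_input.root");   // placeholder name
   TTree *t = nullptr;
   f->GetObject("CollectionTree", t);

   // 2. The cache setup: its size and the set of cached branches.
   t->SetCacheSize(30000000);
   t->AddBranchToCache("*", kTRUE);

   // 3. The order of entry access, matching the failing job.
   const Long64_t entries[] = {0, 53, 103};        // placeholder order
   for (Long64_t entry : entries) {
      t->GetEntry(entry);
   }

   f->Close();
}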

Cheers,
Philippe.

Hi Philippe,

If this helps, I get the following when setting gDebug=10:

xAOD::TAuxStore::getData  INFO    Called with this = 0x16a978b0, auxid = 505, m_entry = 103
TTreeCache::FillBuffer    INFO    ***** Called for entry 103
TTreeCache::FillBuffer    INFO    *** Will reset the branch information about baskets
TTreeCache::FillBuffer    INFO    Looking at cluster spanning from 53 to 106
TTreeCache::FillBuffer    ERROR   Inconsistency: fCurrentClusterStart=0 fEntryCurrent=53 fNextClusterStart=100 but fCurrentEntry should not be in between the two
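
(For completeness: the extra verbosity just comes from ROOT’s global debug level — gDebug, declared in TError.h — which I raise before the event loop:)

gDebug = 10;   // make ROOT print its internal debug messages, including those from TTreeCache::FillBuffer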

Will try to find some time today to create a “simpler way” of reproducing it. Unfortunately my current setup for getting this error is a pretty complicated one. :frowning:

Cheers,
Attila

Hi Philippe,

Unfortunately a simple reproducer is still eluding me. (The faulty job has O(10) files open at the point of failure, and the cache of the “event file” is set up in a fairly non-trivial way.)

However, I did find a way of avoiding the issue. You see, our code has been calling TTree::SetCacheSize(...) with a value of 30000000 since (sort of…) the beginning. I don’t even remember why I hardcoded this exact value many years ago.

Reading through the latest documentation, I saw that these days it’s recommended to let the function determine the best size itself. So I updated our code to call TTree::SetCacheSize(-1) on the event tree, and the error went away.
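
In code, the change is essentially the following (just a sketch — the real call is buried inside xAOD::TEvent, and tree here simply stands for the event tree):

// Before: a hard-coded 30 MB cache, set many years ago.
// tree->SetCacheSize(30000000);

// After: a negative value lets ROOT determine the cache size itself.
tree->SetCacheSize(-1);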

Still, this means that there is some bug in the cache code. I’m just having a very hard time producing a piece of code that could be built easily on top of “vanilla” ROOT for reproducing this message…

Cheers,
Attila

Can you upload all the printouts from TTreeCache::FillBuffer in that run?

Thanks,
Philippe.

Hi Attila,

In addition, can you give me the result of

tree->Print("clusters");

on one of the failing files?

Thanks,
Philippe.

Hi Philippe,

Sorry for the delay… The log file with gDebug=10 is now under:

/afs/cern.ch/work/k/krasznaa/public/TTreeCache/met_gDebug.log

Yeah, it’s quite ridiculously large… :stuck_out_tongue:

The file that the job is accessing is named in one of the first messages in the log. (And it’s a publicly accessible file.) Running the requested command on it gives:

root [1] CollectionTree->Print( "clusters" );
******************************************************************************
*Tree    :CollectionTree: CollectionTree                                         *
*Entries :     2157 : Total =      1629544572 bytes  File  Size =  513135049 *
*        :          : Tree compression factor =   3.17                       *
******************************************************************************
Cluster Range #  Entry Start      Last Entry        Size
0                0                2156                53
root [2]

Hope this helps…

Cheers,
Attila

Hi Attila,

I am not able to reproduce the problem yet. So far I have been able to reproduce the TTreeCache::FillBuffer output exactly, except for the last 5 lines …

From the log, it looks like TTreeCache::FillBuffer is called 3 times before the error message. Could you send me the stack traces for those 3 calls and, if you can, the name of the branch in the TBranch or TBranchElement stack frame closest to the call to FillBuffer?

Thanks,
Philippe.

In addition, I would like to know if there was a range set on the TTreeCache:

TTreeCache *tc = CollectionTree->GetReadCache(f,kTRUE);
cout << "Cache minimum entry: " << tc->GetEntryMin() << '\n';
cout << "Cache maximum entry: " << tc->GetEntryMax() << '\n';

thanks,
Philippe.

Hi Philippe,

I sure hope that this will be helpful, as it was a lot more painful to get these than I thought. :frowning:

You can find the log file with the 3 backtraces under:

/afs/cern.ch/work/k/krasznaa/public/TTreeCache/cache_size_30000000.txt

And as a “reference”, I thought I’d do the same for the -1 setting, not realising how many times TTreeCache::FillBuffer gets called in that case. (A lot…) So I had to learn a bit about debugger scripting for that. :stuck_out_tongue: You can find that file under:

/afs/cern.ch/work/k/krasznaa/public/TTreeCache/cache_size_1.txt

The 3 calls in the 30 MB case are for:

  • Connecting to the “Electrons” branch;
  • Connecting to the “MuonsAuxDyn.inDetTrackParticleLink” branch;
  • Calling GetEntry(103) on the “ElectronsAuxDyn.caloClusterLinks” branch.

The last one I just know from previous debugging sessions, not from this particular log file…

Finally, I added the printout that you asked for. I get the following with it:

xAOD::TEvent::readFrom    INFO    Cache minimum entry: 0
xAOD::TEvent::readFrom    INFO    Cache maximum entry: 2157

Now, the interesting/worrisome thing is that I get this printout 3 times in my log. So the “initial setup” of the input tree happens a bit more times than I realised. :confused: And note that the input file should not be closed in between these calls. So this may be a hint as to where the issue is coming from.

Cheers,
Attila

P.S. Note that even the extra printout required a #define protected public hack. As I wrote, the whole thing was a bit more painful than I was hoping…

So the “initial setup” of the input tree happens a bit more times than I realised.

I assume you are talking about some ATLAS code, right? If so, can you point me to the source (so that I can see what the ‘setup’ is doing)?

Thanks. Philippe.

The code is here:

https://gitlab.cern.ch/atlas/athena/blob/21.2/Control/xAODRootAccess/Root/TEvent.cxx#L340-486

I’ll actually have to debug this a bit myself. I’m pretty sure that the job uses a TChain to read the input file(s), and the logic in xAOD::TEvent for using a TChain is a bit convoluted as you can see. (To make sure that the correct metadata is loaded about the current input file as we loop through the events…)

Cheers,
Attila

We’ll follow up about the high call count on: https://its.cern.ch/jira/browse/ATLASG-1459

This is puzzling (/vexing :frowning: ) … So far I can only reproduce the problem if I set the range to 0 through 100.
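
For reference, setting such a range from user code would look something like this (just a sketch of the calls involved; whether anything in the job effectively ends up doing this is exactly the question):

// Either through the tree ...
CollectionTree->SetCacheEntryRange(0, 100);
// ... or directly on the cache object.
TTreeCache *tc = CollectionTree->GetReadCache(CollectionTree->GetCurrentFile(), kTRUE);
if (tc) tc->SetEntryRange(0, 100);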

I cannot think of any other mechanism that would bring the 100 into the error message:

TTreeCache::FillBuffer    ERROR   Inconsistency: fCurrentClusterStart=0 fEntryCurrent=53 fNextClusterStart=100 but fCurrentEntry should not be in between the two

Hmm … maybe there is one … I would actually need to know exactly which branches the TTreeCache is loading. So somewhere after the training phase is over, I would need the printout of:

TTreeCache *tc = dynamic_cast<TTreeCache*>(CollectionTree->GetDirectory()->GetFile()->GetCacheRead(CollectionTree));
tc->GetCachedBranches()->ls();

thanks,
Philippe.

Hi Philippe,

I’ve put the log file from such a job under:

/afs/cern.ch/work/k/krasznaa/public/TTreeCache/cached_branches.txt

I just scheduled the printout for entry 50.

There is something, however, that has been bothering me for a while; I just never looked into it until now. You see, I was convinced that we were flushing our files every 100 entries. But apparently not: for “derivations” we flush every 10 megabytes.

https://gitlab.cern.ch/atlas/athena/blob/21.2/Database/AthenaPOOL/OutputStreamAthenaPool/python/MultipleStreamManager.py#L484

That’s why this file comes out with an auto-flush value of 53.

[bash][tauriel]:run > root -l /cvmfs/atlas-nightlies.cern.ch/repo/data/data-art/CommonInputs/DAOD_PHYSVAL/data16_13TeV.00311321.physics_Main.DAOD_PHYSVAL.r9264_AthDerivation-21.2.1.0.root
root [0] 
Attaching file /cvmfs/atlas-nightlies.cern.ch/repo/data/data-art/CommonInputs/DAOD_PHYSVAL/data16_13TeV.00311321.physics_Main.DAOD_PHYSVAL.r9264_AthDerivation-21.2.1.0.root as _file0...
Warning in <TClass::Init>: no dictionary for class DataHeader_p5 is available
Warning in <TClass::Init>: no dictionary for class DataHeaderElement_p5 is available
...
(TFile *) 0x7fd010300c20
root [1] CollectionTree->GetAutoFlush()
(long long) 53
root [2]
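
If I understand the flushing logic correctly, that is exactly what a byte-based setting does: with a negative SetAutoFlush argument the first flush happens once that many (compressed) bytes have been written, and the auto-flush value is then converted into the corresponding entry count, which in turn defines the cluster size that TTreeCache works with. A standalone sketch (hypothetical tree and payload, nothing ATLAS specific):

#include "TFile.h"
#include "TRandom3.h"
#include "TTree.h"
#include <iostream>

void autoflush_sketch()
{
   TFile out("autoflush_sketch.root", "RECREATE");
   TTree tree("CollectionTree", "sketch");
   tree.SetAutoFlush(-10000000);   // negative value: flush every ~10 MB written

   TRandom3 rng(1234);
   double payload[2000];
   tree.Branch("payload", payload, "payload[2000]/D");
   for (int i = 0; i < 2000; ++i) {
      for (double &x : payload) x = rng.Gaus();   // poorly compressible data
      tree.Fill();
   }
   tree.Write();

   // After the first byte-triggered flush the auto-flush value becomes the
   // number of entries written up to that point; reading the file back,
   // GetAutoFlush() reports that entry count (53 for the file above).
   std::cout << "auto-flush entry count: " << tree.GetAutoFlush() << std::endl;
}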

When we write a (mini-xAOD) file during analysis, the default setting is 200.

https://gitlab.cern.ch/atlas/athena/blob/21.2/Control/xAODRootAccess/xAODRootAccess/TEvent.h#L131-133

But this job does not write a mini-xAOD output. It does, however, write a simple ntuple as its output, and that ntuple is set up with a flush setting of 200.

https://gitlab.cern.ch/atlas/athena/blob/21.2/PhysicsAnalysis/Algorithms/AsgAnalysisAlgorithms/Root/AsgxAODNTupleMakerAlg.cxx#L300-301

And in any case, turning off the output file/tree writing doesn’t make the error disappear.

So long story short, I have no clue where the job would get the 100 value from. :frowning:

Cheers,
Attila

I am attempting to build the ATLAS software … Assuming it is successful, I will take it from there …

thanks,
Philippe.

This is fixed by the following PRs.

master: https://github.com/root-project/root/pull/2786
v6.14: https://github.com/root-project/root/pull/2787

which have been merged.

Thanks a lot Philippe!

Now we just need to put some pressure on getting 6.14/06, and an LCG release with it… :stuck_out_tongue: