Error in <TBranch::GetBasket>

Dear Support,
I’m experiencing a strange problem whose source I’m not able to identify.
I produce several ROOT files with the same analysis code and store them on CASTOR, but when drawing a TChain of them I get an “Error in TBranch::GetBasket” depending on what kind of cuts I apply.

To better understand what is going on I picked up a single STAGED file (/castor/cern.ch/user/b/bifani/na48_2/pigg/sc/SS0-00/0.root) and followed these steps:

  • open it from CASTOR and plot some variables without any cut:
    TChain *ttree = new TChain ("Tree");
    ttree->Add("rfio:///castor/cern.ch/user/b/bifani/na48_2/pigg/sc/SS0-00/0.root")
    ttree->Draw("cda")

NO ERROR

  • open it from CASTOR and plot variables with cuts:
    TChain *ttree = new TChain ("Tree");
    ttree->Add("rfio:///castor/cern.ch/user/b/bifani/na48_2/pigg/sc/SS0-00/0.root")
    ttree->Draw("cda",pigg.c_str())

    **** : trace level set to 3
    stager: stage_get Usertag=NULL Protocol=rfio File=/castor/cern.ch/user/b/bifani/na48_2/pigg/sc/SS8-01/0.root
    stager: Looking up RH host - Using castorpublic
    stager: Looking up RH port - Using 9002
    stager: Looking up service class - Using na48
    stager: Setting euid: 16520
    stager: Setting egid: 1338
    stager: Creating socket for castor callback - Using port 37617
    stager: Nov 4 09:59:08 (1225789148) Sending request
    stager: 49100edc-0000-1000-a46e-c126e8dad8aa SND 0.04 s to send the request
    stager: Waiting for callback from castor
    stager: 49100edc-0000-1000-a46e-c126e8dad8aa CBK 0.43 s before callback from 128.142.162.18 was received
    Error in TBranch::GetBasket: File: rfio:///castor/cern.ch/user/b/bifani/na48_2/pigg/sc/SS8-01/0.root at byte:3757281117880405260, branch:cda, entry:19508, badread=0, nerrors=0, basketnumber=1
    Error in TBranch::GetBasket: File: rfio:///castor/cern.ch/user/b/bifani/na48_2/pigg/sc/SS8-01/0.root at byte:3757281117880405260, branch:cda, entry:51688, badread=0, nerrors=0, basketnumber=2
    Error: Symbol G__exception is not defined in current scope (tmpfile):1:
    Error: type G__exception not defined FILE:(tmpfile) LINE:1

where pigg is a string containing all the cuts I need:

pigg = "tw_m1tp&&nmuon==0&&trk_p->P()>10.&&trk_dch1->Mag()>15.&&trk_dch1->Mag()<100.&&trk_cls_e/trk_p->P()<0.8&&min(g0_p->E(),g1_p->E())>5.&&pk_dist->Z()>10.&&vtx->Z()>-2000.&&vtx->Z()<8000.&&mfake<.460&&trk_dch4->Mag()>15.&&trk_dch4->Mag()<100.&&g_dist>40.&&g_cls_dist>10.&&g_dim<3.&&g_r_flange>0.&&abs(pigg->E()-60.)<5.&&cog->Mag()<2.&&cda<2.&&abs(vtx->Z()-nvtx_z)<200.&&abs(pigg->M()-.493677)<.02";

  • open it from CASTOR and plot variables with fewer cuts:
    TChain *ttree = new TChain ("Tree");
    ttree->Add("rfio:///castor/cern.ch/user/b/bifani/na48_2/pigg/sc/SS0-00/0.root")
    ttree->Draw("cda",pigg_reduced.c_str())

NO ERROR (I removed random requirements until I got no error)

  • copy it to a local disk, open and plot variables without cuts:
    TChain *ttree = new TChain ("Tree");
    ttree->Add("/tmp/bifani/0.root")
    ttree->Draw("cda")

NO ERROR

  • copy it to a local disk, open and plot variables with cuts:
    TChain *ttree = new TChain ("Tree");
    ttree->Add("/tmp/bifani/0.root")
    ttree->Draw("cda",pigg.c_str())

NO ERROR

From checking the ROOT forum, it appears that this type of error is related to a physically corrupted file (bad disk, bad transfer, incomplete transfer, crash of the process writing the file, etc.). If that is true, how is it possible that the corrupted file recovers its good status when copied back from CASTOR to a local disk?
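
As a side check on the error lines above, the value printed after "at byte:" can be compared with the size of the file. The sketch below is a minimal, standalone C++ helper (no ROOT needed) that pulls the byte, branch, and entry fields out of one GetBasket error line and flags an offset that lies at or beyond a given file size. The names BasketError, parseBasketError, and offsetIsImplausible are hypothetical, and the parsing assumes the message layout shown in this post.

```cpp
#include <cstddef>
#include <cstdint>
#include <exception>
#include <string>

// Fields pulled out of one "Error in TBranch::GetBasket" line (hypothetical helper type).
struct BasketError {
    std::uint64_t byte;   // value printed after "at byte:"
    std::string branch;   // value printed after "branch:"
    std::uint64_t entry;  // value printed after "entry:"
};

// Store in `value` the text between `key` and the next comma (or end of line).
static bool fieldAfter(const std::string& line, const std::string& key, std::string& value) {
    const std::size_t start = line.find(key);
    if (start == std::string::npos) return false;
    const std::size_t from = start + key.size();
    const std::size_t end = line.find(',', from);
    value = line.substr(from, end == std::string::npos ? std::string::npos : end - from);
    return true;
}

// Parse one error line; returns false if a field is missing or not numeric.
bool parseBasketError(const std::string& line, BasketError& out) {
    std::string byte, branch, entry;
    if (!fieldAfter(line, "at byte:", byte) ||
        !fieldAfter(line, "branch:", branch) ||
        !fieldAfter(line, "entry:", entry)) {
        return false;
    }
    try {
        out.byte = std::stoull(byte);
        out.entry = std::stoull(entry);
    } catch (const std::exception&) {
        return false;
    }
    out.branch = branch;
    return true;
}

// A basket cannot start at or beyond the end of the file it lives in.
bool offsetIsImplausible(const BasketError& e, std::uint64_t fileSizeBytes) {
    return e.byte >= fileSizeBytes;
}
```

For the first error line above, the parsed offset is 3757281117880405260 bytes, which exceeds any realistic file size (even an assumed upper bound of 1 TiB). That suggests the basket position handed to the read was already wrong, not only the data found there.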

I produced the same file again twice, but the problem is still there (while doing that I also checked that the transfer to CASTOR completed properly).

Finally, I copied the local file back to CASTOR, without any success: the error reappears.

Any idea?

Regards,
Simone

Hi,

The problem seems to be gone (as of today, your example, run with the version of ROOT you mentioned on lxplus215, is working just fine).
So I am guessing this was an (intermittent) problem in CASTOR.

Cheers,
Philippe.

Hi,
did you apply the cuts as described in my previous post? Because I keep on getting the same error.

Btw, after investigating the problem I found out that the error appears only when the files are added to the TChain with the “rfio://” prefix (i.e. ttree->Add("/castor/cern.ch/user/b/bifani/na48_2/pigg/sc/SS0-00/0.root") gives NO ERROR). Is there any reason to use the “rfio” prefix?

Cheers,
Simone

Hi Simone,

Yes, I used exactly the string you provided and cannot reproduce the problem. If you write a small script that reproduces the problem, I will try to run it on lxplus215.

Cheers,
Philippe.

Hi,
I simply opened a ROOT session on lxplus215 and did the following:

  • without “rfio” prefix

root [0] TChain *ttree
root [1] ttree = new TChain ("Tree")
(class TChain*)0x84770f0
root [2] ttree->Add("/castor/cern.ch/user/b/bifani/na48_2/pigg/sc/SS0-00/0.root")
(Int_t)(1)
root [3] ttree->Draw("cog->Mag()","tw_m1tp&&1&&nmuon==0&&trk_p->P()>10.&&trk_dch1->Mag()>15.&&trk_dch1->Mag()<100.&&trk_cls_e/trk_p->P()<0.8&&min(g0_p->E(),g1_p->E())>5.&&pk_dist->Z()>10.&&vtx->Z()>-2000.&&vtx->Z()<8000.&&mfake<.460&&trk_dch4->Mag()>15.&&trk_dch4->Mag()<100.&&g_dist>40.&&g_cls_dist>10.&&g_dim<3.&&g_r_flange>0.&&abs(pigg->E()-60.)<5.&&cog->Mag()<2.&&cda<2.&&abs(vtx->Z()-nvtx_z)<200.&&abs(pigg->M()-.493677)<.02")
**** : trace level set to 3
stager: stage_get Usertag=NULL Protocol=root File=/castor/cern.ch/user/b/bifani/na48_2/pigg/sc/SS0-00/0.root
stager: Looking up RH host - Using castorpublic
stager: Looking up RH port - Using 9002
stager: Looking up service class - Using na48
stager: Setting euid: 16520
stager: Setting egid: 1338
stager: Creating socket for castor callback - Using port 34079
stager: Nov 10 11:28:51 (1226312931) Sending request
stager: 49180ce3-0000-1000-8129-d6c547b38a07 SND 0.02 s to send the request
stager: Waiting for callback from castor
stager: 49180ce3-0000-1000-8129-d6c547b38a07 CBK 0.81 s before callback from 128.142.162.18 was received
TCanvas::MakeDefCanvas: created default TCanvas with name c1
(Long64_t)6

  • with “rfio” prefix

root [0] TChain *ttree
root [1] ttree = new TChain ("Tree")
(class TChain*)0x8475240
root [2] ttree->Add("rfio:///castor/cern.ch/user/b/bifani/na48_2/pigg/sc/SS0-00/0.root")
(Int_t)(1)
root [3] ttree->Draw("cog->Mag()","tw_m1tp&&1&&nmuon==0&&trk_p->P()>10.&&trk_dch1->Mag()>15.&&trk_dch1->Mag()<100.&&trk_cls_e/trk_p->P()<0.8&&min(g0_p->E(),g1_p->E())>5.&&pk_dist->Z()>10.&&vtx->Z()>-2000.&&vtx->Z()<8000.&&mfake<.460&&trk_dch4->Mag()>15.&&trk_dch4->Mag()<100.&&g_dist>40.&&g_cls_dist>10.&&g_dim<3.&&g_r_flange>0.&&abs(pigg->E()-60.)<5.&&cog->Mag()<2.&&cda<2.&&abs(vtx->Z()-nvtx_z)<200.&&abs(pigg->M()-.493677)<.02")
**** : trace level set to 3
stager: stage_get Usertag=NULL Protocol=rfio File=/castor/cern.ch/user/b/bifani/na48_2/pigg/sc/SS0-00/0.root
stager: Looking up RH host - Using castorpublic
stager: Looking up RH port - Using 9002
stager: Looking up service class - Using na48
stager: Setting euid: 16520
stager: Setting egid: 1338
stager: Creating socket for castor callback - Using port 39621
stager: Nov 10 11:32:32 (1226313152) Sending request
stager: 49180dc0-0000-1000-baa9-f516417fe171 SND 0.02 s to send the request
stager: Waiting for callback from castor
stager: 49180dc0-0000-1000-baa9-f516417fe171 CBK 1.92 s before callback from 128.142.162.18 was received
TCanvas::MakeDefCanvas: created default TCanvas with name c1
Error in TBranch::GetBasket: File: rfio:///castor/cern.ch/user/b/bifani/na48_2/pigg/sc/SS0-00/0.root at byte:3757281117880405260, branch:cda, entry:219454, badread=0, nerrors=0, basketnumber=1
Error in TBranch::GetBasket: File: rfio:///castor/cern.ch/user/b/bifani/na48_2/pigg/sc/SS0-00/0.root at byte:3757281117880405260, branch:cda, entry:226703, badread=0, nerrors=0, basketnumber=2
Error: Symbol G__exception is not defined in current scope (tmpfile):1:
Error: type G__exception not defined FILE:(tmpfile) LINE:1
(Long64_t)0
*** Interpreter error recovered ***

That is the shortest way I have found to reproduce the problem systematically.

The difference is in the protocol used to open the file:

stager: stage_get Usertag=NULL Protocol=root File=/castor/cern.ch/user/b/bifani/na48_2/pigg/sc/SS0-00/0.root

vs.

stager: stage_get Usertag=NULL Protocol=rfio File=/castor/cern.ch/user/b/bifani/na48_2/pigg/sc/SS0-00/0.root

Simone

Hi Simone,

Since I could not reproduce the problem last time, I do not trust the copy/paste to be complete (the forum may or may not be ‘eating’ some of the characters in your string).

Also, I notice that your output is actually different from mine. Your rfio seems to output debug messages whereas mine does not! So the difference might actually come from the rfio/castor library that you are using (note that this is also the difference between running with and without the rfio prefix). Can you tell me specifically how you set up both root and rfio/castor?

Thanks,
Philippe.

Hi,
I usually set the following env variables:

setenv ROOTSYS /afs/cern.ch/sw/lcg/external/root/5.20.00/slc4_ia32_gcc34/root
setenv PATH ${PATH}:${ROOTSYS}/bin
setenv LD_LIBRARY_PATH ${ROOTSYS}/lib
setenv STAGER_TRACE 3 (this allows you to trace the stager behavior)

Attached you’ll find a .C file with 2 macros:

  • without(): no “rfio” prefix
  • with(): “rfio” prefix

I tested it and experienced the same problem.

Hope it helps,
Simone
test.C (1.15 KB)

Hi,

As far as I can tell the problem is within the rfio library. For example, using valgrind on the failing case (which somehow is reproducing everywhere today!), I see the following error:

==19495== Invalid read of size 1
==19495== at 0x4906897: memcpy (mac_replace_strmem.c:394)
==19495== by 0x9556E82: rfio_read64_v2 (read64.c:253)
==19495== by 0x9554871: rfio_read_v2 (read.c:128)
==19495== by 0x955460B: rfio_read (read.c:48)
==19495== by 0x93A5F9D: TRFIOFile::SysRead(int, void*, int) (TRFIOFile.cxx:320)
==19495== by 0x746FC75: TFile::ReadBuffer(char*, int) (TFile.cxx:1309)
==19495== by 0x7B105CE: TBasket::ReadBasketBuffers(long long, int, TFile*) (TBasket.cxx:340)
==19495== Address 0x92AA90F is 1 bytes before a block of size 56 alloc'd
==19495== at 0x4904DB5: operator new(unsigned long) (vg_replace_malloc.c:168)
==19495== by 0x4C150A0: TStorage::ObjectAlloc(unsigned long) (TStorage.cxx:328)
==19495== by 0x401196: TObject::operator new(unsigned long) (TObject.h:156)
==19495== by 0x4CADF37: TCint::UpdateListOfGlobals() (TCint.cxx:646)
==19495== by 0x4C11151: TROOT::GetListOfGlobals(bool) (TROOT.cxx:1084)
==19495== by 0x4C8D542: TDataMember::TDataMember(void*, TClass*) (TDataMember.cxx:405)
==19495== by 0x4CAF1AA: TCint::CreateListOfDataMembers(TClass*) (TCint.cxx:895)
==19495== by 0x4C7F4C7: TClass::GetListOfDataMembers() (TClass.cxx:2432)
==19495== by 0x74934E7: TStreamerInfo::BuildCheck() (TStreamerInfo.cxx:718)
==19495== by 0x74738D5: TFile::ReadStreamerInfo() (TFile.cxx:2330)
==19495== by 0x746E1C7: TFile::Init(bool) (TFile.cxx:714)
==19495== by 0x93A57C3: TRFIOFile::TRFIOFile(char const*, char const*, char const*, int) (TRFIOFile.cxx:186)

Note that the memcpy is supposed to be copying from a buffer internal to rfio into the buffer ROOT provides, and it is while reading this internal buffer that valgrind sees a problem …
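
The class of error valgrind flags here, a memcpy whose source range falls outside the buffer it is meant to read, can be illustrated with a small standalone sketch. This is not the rfio code: checkedCopy is a hypothetical helper written only to show what a bounds check in front of such a copy could look like.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Copy `count` bytes starting at `offset` inside `internal` into `dest`.
// Returns false (and copies nothing) if the requested range would start
// before the buffer or run past its end. Hypothetical helper, not rfio code.
bool checkedCopy(const std::vector<char>& internal, std::ptrdiff_t offset,
                 std::size_t count, char* dest) {
    if (offset < 0) return false;  // e.g. a read starting 1 byte before the block
    const std::size_t start = static_cast<std::size_t>(offset);
    if (start > internal.size() || count > internal.size() - start) return false;
    if (count > 0) std::memcpy(dest, internal.data() + start, count);
    return true;
}
```

With a check like this, a request that starts one byte before the buffer (offset -1) is rejected instead of silently reading neighbouring memory, which is the kind of access the report above points at.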

I would recommend that you report this issue to the rfio/castor developers.

Cheers,
Philippe.

Dear Philippe,
to conclude the topic, I would like to know what the proper way is to open a CASTOR file in a ROOT shell. Should I use the “rfio” prefix or not?

Thanks a lot for your help,
Simone

Hi Simone,

There is no simple answer. It depends on which machine you are running on. The general way to access castor is via rfio://…. (For example on lxplus215.cern.ch you can do both but on lxbuild064.cern.ch I can only do the rfio://).
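
Since the choice depends on the machine, one option is to keep the prefix in a single place in your own code. A minimal sketch, assuming a hypothetical helper castorFileName (not part of ROOT) whose result is then passed to TChain::Add:

```cpp
#include <string>

// Build the file name for a CASTOR path, with or without the "rfio://" prefix.
// Hypothetical helper: "rfio://" + "/castor/..." gives "rfio:///castor/...".
std::string castorFileName(const std::string& castorPath, bool useRfioPrefix) {
    return useRfioPrefix ? "rfio://" + castorPath : castorPath;
}
```

For example, ttree->Add(castorFileName(path, true).c_str()) would add the file through rfio, and passing false would use the plain /castor/... path where the machine supports it.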

Cheers,
Philippe.

Hi,
because I usually submit ROOT jobs to LSF I have to put the “rfio” prefix back and wait for the CASTOR support to fix the problem.

Cheers,
Simone

Hello,
I sent an email to the castor support, but they closed my ticket because it is related to the one you opened yesterday at savannah.cern.ch/bugs/?43769
They haven’t found a solution yet, have they? Do you have any news?

Cheers,
Simone

Hi,

Indeed, I was about to write to you about this report; I just realized they are likely the same issue. The good news is that they are able to reproduce it :)

Cheers,
Philippe.

Hi,
the best news would be that they are able to fix it :D
Btw… How is it possible that I was the only one experiencing this problem?
Usually I’m not that lucky!!!

Let’s wait,
s.

Hi,

To close this thread, the problem has been fixed in 2.1.8-4 of the castor client library.

Cheers,
Philippe.