Problem with reading TTree via xrootd

Dear xrootd experts,

I have run into a problem reading a TTree via xrootd. There is no problem if I read the file directly from a local disk on the machine where it resides, nor if I read only a few branches instead of the full event. The problem is reproducible:

ievt=0
ievt=1
[…]
ievt=216
ievt=217
Warning in TBasket::ReadBasketBuffers: basket: has fNevBuf=0 but fEntryOffset=0, pos=18869210, len=689, fNbytes=0, fObjlen=0, trying to repair
Error in TBranchElement::GetBasket: File: root://acas0420//data/test/user.HongMa. … .pool.root at byte:0, branch:m_event_type.m_user_type, entry:217, badread=0
ievt=218
Warning in TBasket::ReadBasketBuffers: basket: has fNevBuf=0 but fEntryOffset=0, pos=18939979, len=689, fNbytes=0, fObjlen=0, trying to repair
Error in TBranchElement::GetBasket: File: root://acas0420//data/test/user.HongMa. … .pool.root at byte:0, branch:m_event_type.m_user_type, entry:218, badread=0
ievt=219
[…]

Any idea?

–Shuwei

Hi Shuwei,

I can check this if you help me reproduce it. The file and a small script that reads it would be perfect.

Alternatively, in the smallest test you can make, you can set

XNet.Debug: 3

and put the (huge) log file somewhere for me to look at.

However, from the short log you posted, I suspect that a read request failed for some strange reason.

Which version/bundle of xrootd (client and server) are you using?

Fabrizio

Hi Fabrizio,

Please find the log file here (it is huge: 80M before compression, 20M after):

usatlas.bnl.gov/computing/dq … f99.log.gz

for a case reading events starting from evt-99 (case-1). The problem disappears if I read from evt-100 (case-2).

I did some debugging and found a difference at line 1311 (within TFile::ReadBufferViaCache) of TFile.cxx:

Int_t st = fCacheRead->ReadBuffer(buf, off, len);

The return value st equals 1 for case-1 and 0 for case-2.

I had similar problems reading dCache files with a small DCACHE_RA_BUFFER. I do not know whether they are related.

My root version is 5.14.00. xrd version is 20060928-1600.

–Shuwei

Hi Shuwei,

thank you for sending me this log file. I have probably spotted the issue, but I have to say that it was fixed a long time ago.

The ROOT version you are using is quite old (a year and a half), and since then there have been at least two rounds of major changes and fixes. It should now be considerably faster too, even for a single client. (If you are interested in this and in how to optimize the transfers, especially for TTrees, please get in touch with me.)

The problem I found is related to the server refusing to reply to a readv transfer that is too big. In the first version of the client-side readv code, the readv maximum size limits were not correctly honored, and the code itself (inside TXNetFile) was probably unable to recover properly.
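To illustrate what "honoring the readv size limits" means on the client side, here is a minimal, self-contained sketch. It is not the actual TXNetFile or XrdClient code: the `Chunk` struct, the `SplitReadV` function and both limit parameters are hypothetical names introduced only for this example, and the real limits are whatever the server advertises or enforces. The idea is simply that every requested (offset, length) chunk is cut into pieces no larger than a maximum chunk length, and the pieces are grouped into requests holding at most a maximum number of chunks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One (offset, length) element of a vector read. Hypothetical type for this sketch.
struct Chunk {
    std::int64_t offset;
    std::int32_t length;
};

// Split the wanted chunks so that no single chunk exceeds maxChunkLen bytes,
// and group the resulting pieces into requests of at most maxChunksPerRequest
// chunks each. Both limits are assumptions passed in by the caller.
std::vector<std::vector<Chunk>> SplitReadV(const std::vector<Chunk>& wanted,
                                           std::int32_t maxChunkLen,
                                           std::size_t maxChunksPerRequest) {
    assert(maxChunkLen > 0 && maxChunksPerRequest > 0);

    std::vector<std::vector<Chunk>> requests;
    std::vector<Chunk> current;

    for (const Chunk& c : wanted) {
        std::int64_t offset = c.offset;
        std::int32_t remaining = c.length;

        // Cut an oversized chunk into consecutive pieces of at most maxChunkLen.
        while (remaining > 0) {
            const std::int32_t piece = std::min(remaining, maxChunkLen);
            current.push_back({offset, piece});
            offset += piece;
            remaining -= piece;

            // Close the current request once it holds the maximum number of chunks.
            if (current.size() == maxChunksPerRequest) {
                requests.push_back(current);
                current.clear();
            }
        }
    }

    // Keep whatever is left over as a final, smaller request.
    if (!current.empty()) {
        requests.push_back(current);
    }
    return requests;
}
```

With limits of 10 bytes per chunk and 3 chunks per request, for example, a wanted list of (0, 10), (100, 25), (500, 4) becomes two requests: the first with (0, 10), (100, 10), (110, 10) and the second with (120, 5), (500, 4). A client that skips this step can send a request the server rejects as too large, which is the kind of failure visible in the log.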

As I said, the code currently in ROOT and XrdClient has been rewritten, and I am not (yet?!) aware of current problems like the one you found.

These are the incriminating log lines, which show what happens at low level:

080318 17:16:24 001 Xrd: ReadPartialAnswer: Server [acas0429.usatlas.bnl.gov:1094] answered [kXR_error] (4003)
080318 17:16:24 001 Xrd: SendGenCommand: Server declared: Single readv transfer is too large(error code: 3008)

So, my strong advice is to update to the latest ROOT release. The issue cannot be patched by just compiling an old ROOT against the newer xrootd client, since the bug was in TXNetFile.

Please let me know
Fabrizio

Hi Fabrizio,

Does this bug also affect TDCacheFile? I have a similar problem reading files from dCache.

I am using an ATLAS release, in which many projects depend on ROOT. It is not easy to upgrade ROOT in the installed release. I will try anyway.

Many thanks for taking time to help investigate this weird problem.

–Shuwei

Hi Shuwei,

For TDCacheFile, I’d say that I don’t know. But, looking at the history, I believe that the readv code was implemented for dCache later than for xrootd, though in almost the same period. So it might be that the same bug slipped in there. After all, it’s very difficult to test all the limit cases, and this is one of them.

For the ROOT upgrade, I don’t believe that there are up-to-date patches for the release you are using, although we might check. But I suppose that the benefit from the countless fixes and enhancements outweighs the burden of having to validate a new version.

In any case, please let me know if you still have data access troubles with the latest version. It will not be the same problem, of course, but in that case I can quickly have a look and debug it. I am also interested in knowing if it works and how…

Fabrizio

Hi Fabrizio,

I just tried ROOT-5.18.00 with the default and other values for XNet.ReadAheadSize and XNet.ReadCacheSize, and the reading error disappears! Many thanks for the suggestion and the fix.

But the reading error still persists when reading dCache files with some small buffer sizes when ReadAhead is enabled. Can you point me to the code location of readv, so I can check whether dCache uses the same or similar readv code and whether the bug is fixed there?

–Shuwei

Hi Shuwei,

Great!
For the ROOT version I’d use at least 5.18b, if not the very latest one, 5.19. The refinement process continues…

For dCache, since I am not a dCache developer, I can only point you to the TDCacheFile code inside ROOT; otherwise I’d suggest reporting this bug to the dCache core developers.

Fabrizio