Need suggestions to troubleshoot myriad error messages

Hello,

I have been running my ATLAS code on the grid, and my job produces about 300 ntuples with TTrees in them. I have a macro that I made via treeName->MakeClass("test"). When I run this macro on these output ntuples, I see error messages for some of the files. Even if I run an empty macro on one of these ROOT files, I still see these messages.

I have been running many jobs on the grid. Sometimes 5% of the ntuple files have these errors, sometimes 10-12%, but there are always problems.

I get messages like:

Error in TFile::ReadBuffer: error reading all requested bytes from file

or

Error in TBranchElement::GetBasket: outputFile1.root at byte :0, branch:TruthTrk_pt, entry:9591, badread=0

If I simply open this file and do treeName->Draw("TruthTrk_pt"), then I get a plot and no error message.

or

SysError in TFile::Seek: cannot seek to position 5305165262144230 in file outputFile1.root retpos=-1 (Invalid argument)

I am looking for suggestions on how to go about troubleshooting these problems. Before I spend a lot of time chasing this down, I wanted to check whether there are any shortcuts.

Is this a problem with

a) my code that produces these ntuple files
b) my macro that runs over them
c) files getting corrupted when I download them from the grid to my local area
d) some peculiarity in how the jobs are run and closed on the worker nodes
e) some funny problem in ROOT
f) all or none of the above.

Vivek

[quote]c) files getting corrupted when I download them from the grid to my local area[/quote]
This is the likely cause.

Cheers,
Philippe.

Hi Philippe,

I re-retrieved two of the root files, and I get the same error message as before. Admittedly, this is very low statistics.

I can re-get a few more and see what happens, but at least this quick test shows that this may not be the only problem.

vivek

Hi Vivek,

It is quite possible that the files are already corrupted at the location you are downloading them from (i.e. their production, their upload, or their long-term storage has had a problem).
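One way to tell a bad transfer apart from a file that was already bad at the source is to compare checksums of two independent downloads, or of one download against whatever checksum the storage catalogue reports. The following is only a minimal sketch in plain C++: Adler-32 is chosen here as an assumption (use whichever checksum type your catalogue actually provides), and the function names `adler32_update` and `adler32_file` are made up for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>

// Fold a chunk of bytes into a running Adler-32 value.
// The low 16 bits hold sum a, the high 16 bits hold sum b, both modulo 65521.
// Start with adler = 1 for a new checksum.
std::uint32_t adler32_update(std::uint32_t adler, const unsigned char* data, std::size_t len) {
    const std::uint32_t MOD = 65521;
    std::uint32_t a = adler & 0xFFFFu;
    std::uint32_t b = (adler >> 16) & 0xFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        a = (a + data[i]) % MOD;
        b = (b + a) % MOD;
    }
    return (b << 16) | a;
}

// Compute the Adler-32 of a whole file, reading it in 64 kB chunks.
// Throws std::runtime_error if the file cannot be opened.
std::uint32_t adler32_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    std::uint32_t adler = 1;
    char buf[1 << 16];
    // The last read is usually partial: it sets failbit but gcount() is still > 0.
    while (in.read(buf, sizeof buf) || in.gcount() > 0) {
        adler = adler32_update(adler,
                               reinterpret_cast<const unsigned char*>(buf),
                               static_cast<std::size_t>(in.gcount()));
    }
    return adler;
}
```

If two separate downloads of the same ntuple give the same value, the transfer itself is probably not what damages the file, and the problem is more likely upstream (production, upload, or storage).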

Cheers,
Philippe.

Hi Philippe,

Well, they have run on at least three different grid sites…

vivek

Hi,

So what is different on the site where it does not work? How do you access the files? Older versions of the dCache library are known to have had a couple of issues where, in some circumstances, a file would not be downloaded correctly (when used via the dCache ROOT plugin).

Cheers,
Philippe.

Hi Philippe,

I can’t swear to this, but I think this problem has existed wherever I have run these jobs (somewhere between 3 and 5 grid sites). I need to check this.

To get the files I use the ATLAS command, dq2-get. I do not know how this works.

Vivek

Hi Philippe,

In the past I used to run at Brookhaven, but lately my jobs have been going elsewhere. I just checked an old output where the jobs had run at BNL, and I see the same problem.

Having said this, I have also run other code on different MC events, and I don’t recall seeing these problems.

So, I am not sure anymore as to where the problem could be…

vivek

Hi Vivek,

You need to simplify the problem first. Download the files to your local machine and see if your analysis works on them.
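Before running the full analysis on a local copy, a quick pre-check for truncation can help, since read and seek errors like the ones quoted above are what one would expect from an incomplete file. The sketch below is plain C++ without ROOT and rests on an assumption: the header layout described in the TFile class documentation (the 4-byte magic "root", then big-endian fVersion and fBEGIN, then fEND stored as 4 bytes, or 8 bytes when fVersion is at least 1000000). The names `RootHeaderCheck` and `check_root_header` are invented for this example.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Result of a quick header sanity check on a local file.
struct RootHeaderCheck {
    bool readable;               // file opened and at least 20 header bytes read
    bool has_magic;              // first four bytes are "root"
    std::uint64_t end_recorded;  // fEND as stored in the header
    std::uint64_t actual_size;   // size of the file on disk
    bool truncated;              // header claims more bytes than are on disk
};

// Decode an unsigned big-endian integer of nbytes bytes.
static std::uint64_t read_be(const unsigned char* p, int nbytes) {
    std::uint64_t v = 0;
    for (int i = 0; i < nbytes; ++i) {
        v = (v << 8) | p[i];
    }
    return v;
}

RootHeaderCheck check_root_header(const std::string& path) {
    RootHeaderCheck r{false, false, 0, 0, false};
    // Open at the end so tellg() gives the file size.
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) {
        return r;
    }
    r.actual_size = static_cast<std::uint64_t>(in.tellg());
    in.seekg(0);
    unsigned char h[20] = {0};
    in.read(reinterpret_cast<char*>(h), sizeof h);
    if (in.gcount() < static_cast<std::streamsize>(sizeof h)) {
        return r;  // too short to contain the fields we need
    }
    r.readable = true;
    r.has_magic = (h[0] == 'r' && h[1] == 'o' && h[2] == 'o' && h[3] == 't');
    if (!r.has_magic) {
        return r;
    }
    // Assumed layout: version at offset 4, fBEGIN at offset 8, fEND at offset 12.
    const std::uint64_t version = read_be(h + 4, 4);
    r.end_recorded = (version >= 1000000) ? read_be(h + 12, 8) : read_be(h + 12, 4);
    r.truncated = r.end_recorded > r.actual_size;
    return r;
}
```

A file that passes this check can still have damaged baskets in the middle, so it only rules out the simplest failure mode. The real test is still to loop over all entries of the tree locally and see whether the errors reappear.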

Cheers,
Philippe.

Hi Philippe,

Yes, I need to do this. However, it’s not the same input files that cause the problem.

I will download the input files that correspond to one of these “bad” ntuple files and try them locally.

vivek