Truncated files

VINX89 · July 2, 2016, 10:53pm

Dear all,

I’m wasting my PhD time fighting a hopeless war against this message:

Error in TNetXNGFile::Init: file is truncated at 1564321698 bytes: should be 1570435367, trying to recover.

This can happen either when I process a very large data file, or when I parallelise the jobs and merge the output files with hadd afterwards.

What could be the origin of this? I have no idea. Am I responsible (e.g., for properly cleaning up memory in the script I use to produce the tuple), or is it just a system failure that doesn’t depend on me (the data samples are very large)?

Thanks for any hint.

Vincenzo

Danilo · July 3, 2016, 11:12am

Hi Vincenzo,

how have these files been produced? What do they contain?
What do you mean by processing large data files?
In general, this can be a sign of an application exiting abruptly, without giving ROOT time to close the files properly.
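The two numbers in the error message are the actual file size and the end-of-file position recorded in the file header; here the file is 6113669 bytes (about 6.1 MB) short. For a local copy of a file, the same comparison can be sketched without ROOT. This is only an illustration, assuming the standard TFile header layout (the "root" magic, then big-endian fVersion, fBEGIN and fEND, with a 64-bit fEND when fVersion is at least 1000000); the function names are hypothetical and it does not work on remote root:// URLs.

```python
import os
import struct


def read_expected_size(path):
    """Return fEND, the end-of-file position recorded in a ROOT file header."""
    with open(path, "rb") as f:
        header = f.read(20)
    if len(header) < 20 or header[:4] != b"root":
        raise ValueError(f"{path} does not look like a ROOT file")
    # Header fields are big-endian: fVersion (int32), then fBEGIN (int32).
    version, _begin = struct.unpack(">ii", header[4:12])
    if version < 1000000:
        (end,) = struct.unpack(">i", header[12:16])  # 32-bit fEND
    else:
        (end,) = struct.unpack(">q", header[12:20])  # 64-bit fEND (large files)
    return end


def missing_bytes(path):
    """Bytes missing relative to the header; 0 means the size is consistent."""
    return max(0, read_expected_size(path) - os.path.getsize(path))
```

Running missing_bytes on each partial output before calling hadd would show which job produced the short file.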

Danilo

VINX89 · July 3, 2016, 9:51pm

Hi Danilo,

thanks for your reply.
I have some input tuples on eos at CERN, and I process these tuples using PyROOT scripts running on lxplus machines.
These input tuples are produced using the LHCb software running on the grid, and they contain data.
I usually run these scripts using the LSF batch system or a screen session.
Sometimes, I first divide the input tuple into many subtuples, and then I parallelise the jobs (I merge the output using hadd).

An example: if I need to filter a tree with some selection string, I first take the input tree

inputFile = TFile.Open(inputfile,"READ")
inputTree = inputFile.Get(inputtree)

then I create an output file (still on eos) in which to store the filtered output tree

outputFile = TFile.Open(outputfile,"RECREATE")
outputTree = inputTree.CopyTree(preselection)

Finally, I try to save everything:

outputFile.cd()
outputTree.Write("",TObject.kWriteDelete)
gDirectory.Delete(inputtree)
outputFile.ls()
outputFile.Flush()
outputFile.Close()

This script usually works, except for a very large tuple I need to process.

These lines are taken from other posts I’ve seen about memory issues. Is there anything obviously wrong here? (For example, I’m not sure about deleting the input file.)
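For comparison, a more defensive version of the same steps could look like the sketch below. It is not a confirmed fix for the truncation error; it only makes failures visible by checking each open and the number of bytes reported by Write() before closing both files explicitly. The function name and its arguments are hypothetical, and the kWriteDelete option and the gDirectory.Delete call are deliberately left out to keep it short.

```python
def filter_tree(input_path, tree_name, output_path, selection):
    """Copy the entries of tree_name that pass selection into output_path."""
    import ROOT  # imported here so the module loads even without PyROOT

    input_file = ROOT.TFile.Open(input_path, "READ")
    if not input_file or input_file.IsZombie():
        raise IOError(f"cannot open input file {input_path}")

    input_tree = input_file.Get(tree_name)
    if not input_tree:
        raise KeyError(f"no object named {tree_name} in {input_path}")

    output_file = ROOT.TFile.Open(output_path, "RECREATE")
    if not output_file or output_file.IsZombie():
        raise IOError(f"cannot create output file {output_path}")

    # CopyTree attaches the new tree to the current directory,
    # so make the output file current before calling it.
    output_file.cd()
    output_tree = input_tree.CopyTree(selection)
    n_entries = output_tree.GetEntries()
    n_bytes = output_tree.Write()  # 0 bytes written signals a failed write

    # Close explicitly; the output tree is owned by the file and must
    # not be used after this point.
    output_file.Close()
    input_file.Close()

    if n_bytes <= 0:
        raise IOError(f"nothing was written to {output_path}")
    return n_entries
```

If the job is killed by the batch system before Close() is reached, the output can still end up incomplete, so the job's exit status is worth checking as well.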

Thanks,
Vincenzo

Danilo · July 4, 2016, 7:24am

Vincenzo,

thanks for the details.
Is the “big file” already causing problems on EOS, or only after your processing? Is this file publicly accessible?

Danilo

VINX89 · July 4, 2016, 8:02am

Hi Danilo.

No, the file before the processing looks ok.
Unfortunately, this file is not accessible to people external to LHCb.

Cheers,
Vincenzo

Axel · July 6, 2016, 2:47pm

Hi,

what does

valgrind --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$ROOTSYS/etc/valgrind-root-python.supp --num-callers=30 …

say? “…” stands for “the command you usually invoke to run your code”, e.g. “python mycode.py”.

Cheers, Axel.

VINX89 · July 9, 2016, 10:22pm

Hi Axel,

sorry for the slow response.

You can find attached:
- the logfile of the run with Valgrind;
- the script(s) I actually used (I ran it with the option “doPreselection” set to true, so it quit before the end of the script).

As I said, unfortunately I can’t provide access to the root file.

Thanks for your help.

Cheers,
Vincenzo
conf_fitBplus2D0Pi_Data.py (1.4 KB)
fitBplus2D0Pi.py (17.9 KB)
logValgrind.txt (20.7 KB)

pcanal · July 11, 2016, 4:00pm

Hi,

Valgrind seems to indicate that the issue might be inside XRootD:

==26780== Invalid read of size 8
==26780==    at 0x1FE1ACE1: XrdCl::XRootDMsgHandler::ReadRawReadV(XrdCl::Message*, int, unsigned int&) (XrdClXRootDMsgHandler.cc:720)
==26780==    by 0x1FE1B2E4: XrdCl::XRootDMsgHandler::ReadMessageBody(XrdCl::Message*, int, unsigned int&) (XrdClXRootDMsgHandler.cc:570)
....
==26780==  Address 0x279e1ca8 is 24 bytes before a block of size 6,144 alloc'd
==26780==    at 0x4C295FC: operator new(unsigned long) (vg_replace_malloc.c:298)
==26780==    by 0x1FE23491: std::vector<XrdCl::ChunkInfo, std::allocator<XrdCl::ChunkInfo> >::_M_insert_aux(__gnu_cxx::__normal_iterator<XrdCl::ChunkInfo*, std::vector<XrdCl::ChunkInfo, std::allocator<XrdCl::ChunkInfo> > >, XrdCl::ChunkInfo const&) (new_allocator.h:104)
==26780==    by 0x1FE3483C: XrdCl::FileStateHandler::VectorRead(std::vector<XrdCl::ChunkInfo, std::allocator<XrdCl::ChunkInfo> > const&, void*, XrdCl::ResponseHandler*, unsigned short) (stl_vector.h:913)
==26780==    by 0x1FE2B470: XrdCl::File::VectorRead(std::vector<XrdCl::ChunkInfo, std::allocator<XrdCl::ChunkInfo> > const&, void*, XrdCl::ResponseHandler*, unsigned short) (XrdClFile.cc:252)

Which version of XRootD are you using? Can you check that the version in the environment is the same as the one used to build ROOT?

Cheers,
Philippe.