Corrupted file when created in EOS

Hello everyone,

I have created a multi-threaded application which produces .root files as part of a producer/consumer pipeline. The files are created and stored directly in EOS using the TFile::Open("root://...") syntax.

My app works quite well. However, in certain isolated cases I’m encountering errors when I attempt to read the files back:

Error in <TBasket::Streamer>: The value of fObjlen is incorrect (-1234746257) ; trying to recover by setting it to zero
Error in <TBasket::Streamer>: The value of fNevBufSize is incorrect (-161071509) ; trying to recover by setting it to zero
Error in <TBranch::GetBasket>: File: root://.../myfile.root at byte:1803550903135067784, branch:BuffB, entry:15965, badread=1, nerrors=1, basketnumber=5

I’m starting to run out of ideas. Since the app works well in most cases, this might be caused by some communication or buffer-flushing bug.

The .root files in question contain 3 trees and are generated like this:

#include "TFile.h"
#include "TROOT.h"
#include "TThread.h"
#include "TTree.h"

void doWork() {
    while (isWorkAvailable()) {
        TFile* file = TFile::Open("root://....", "RECREATE");
        if (!file || file->IsZombie()) {
            delete file; // Open() failed; skip this work item
            continue;
        }

        TTree* tree1 = new TTree("tree1", "first tree");
        tree1->SetDirectory(file);
        tree1->SetAutoSave(-300 * 1024 * 1024); // autosave roughly every 300 MB written (negative = bytes)

        // ... set branch addresses ...

        size_t records = 0;
        const size_t flush_interval = 1024;
        while (haveData()) {
            // ... move data to position ...

            tree1->Fill();
            ++records;

            if (records % flush_interval == 0) {
                file->Flush(); // periodically flush to EOS
            }
        }

        tree1->Write(nullptr, TObject::kOverwrite);
        file->Flush();
        file->Close();
        delete file; // also deletes tree1, which is owned by the file
    }
}

int main() {
    TThread::Initialize(); // make ROOT aware of the worker threads
    gROOT->SetBatch(true); // no graphics output needed

    // ... start worker pool using std::thread with pointer to doWork() ...

    // ... join threads ...

    return 0;
}

I have tried to simplify the code as much as possible. I would welcome any ideas or suggestions regarding this problem (or any other issue that you might see).

Thanks in advance!

Cheers,
Petr

Hi Petr,

can you confirm that the writing is always complete before the files are read back?

Cheers,
D

Yes, I can confirm that.

Nevertheless, it is possible that the writing is not properly finalized. I have included a simplified version of my source code containing the opening and closing segments in the hope that you can validate that.

There are always 4 threads running at any moment. Each thread accesses a single file at a time (I can guarantee that no two threads work on the same file); once a file is finished, the thread moves on to the next one (as indicated by the while() loop).
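
For completeness, the pool is started roughly like this (a simplified sketch; startWorkers, the hard-coded worker count and the omitted queue handling are only for illustration):

#include <thread>
#include <vector>

void doWork(); // as in the snippet above

void startWorkers() {
    const unsigned nWorkers = 4;
    std::vector<std::thread> workers;
    workers.reserve(nWorkers);

    for (unsigned i = 0; i < nWorkers; ++i)
        workers.emplace_back(doWork); // each worker handles one file at a time

    for (auto& worker : workers)
        worker.join(); // wait until no more work is available
}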

Hi,

does this occur even without using EOS, e.g. on a local disk?

Cheers,
D

Hello,
I don’t know. I will perform some tests.

Do you think that there could be a bug in the XRdCl TFile interface? Or is it still possible that my code is wrong?

Cheers,
Petr

Hi,

bugs in that area are possible, as anywhere else, but I think it’s not likely.

D

I have managed to run an extensive test and confirmed that the issue does not occur when the output file is opened with TFile::Open("file://...").
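
For reference, the two configurations differ only in the URL scheme passed to TFile::Open; roughly along these lines (the makeOutputUrl helper and the paths are placeholders, not my actual values):

#include <string>

// Illustrative helper: choose the output backend via the URL scheme.
std::string makeOutputUrl(const std::string& name, bool useEos) {
    if (useEos)
        return "root://eos-endpoint//eos/some/path/" + name; // remote write via XRootD
    return "file:///local/scratch/" + name;                  // plain local disk
}

// e.g. TFile::Open(makeOutputUrl("myfile.root", useEos).c_str(), "RECREATE");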

Moreover, the full log leads me to believe that the problem is in the XRdCl interface.
See the snippet:

... 
Error in <TNetXNGFile::Flush>: [ERROR] Server responded with an error: [3005] Unable to open - capability illegal /eos/atlas/..../myfile.root; timer expired
Error in <TNetXNGFile::TNetXNGFile>: The remote file is not open
Error in <TNetXNGFile::TNetXNGFile>: The remote file is not open
Error in <TNetXNGFile::TNetXNGFile>: The remote file is not open
... (repeated infinitely)

The implementation works for 2 days, then the above error starts occurring. Since I use Kerberos to authenticate access to EOS, this could be caused by token expiration. However, I have made sure to run my application with k5reauth to prevent exactly that issue.

Do you have any ideas?

Cheers,
Petr

Hi Petr,

thanks for the tests you performed. If you are sure that this is not an issue linked to an expired token, perhaps the best thing to do at this point is to contact the administrator of your EOS instance (at CERN this would be a ticket in ServiceNow, perhaps linking to this very thread).

Cheers,
Danilo

Hi Danilo,

I have resolved the issue. It seems that EOS forcibly closes any file descriptor that has been open for more than 48 hours. This breaks the XRdCl connection, which eventually leads to the error in TNetXNGFile::Flush.

Unfortunately, the timeout can’t be extended, so I will just have to produce smaller files over shorter time windows.
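
In case it is useful to anyone else, the workaround is simply to roll over to a new output file well before the 48-hour limit; a minimal sketch (the 40-hour safety margin and the rollOverNeeded helper are only illustrative):

#include <chrono>

// Illustrative helper: true once the current output file has been open
// long enough that it should be closed and a fresh one started.
bool rollOverNeeded(std::chrono::steady_clock::time_point openedAt) {
    using namespace std::chrono;
    return steady_clock::now() - openedAt > hours(40); // stay well below the 48 h EOS limit
}

Inside the fill loop of doWork() this is then checked in addition to haveData():

        const auto openedAt = std::chrono::steady_clock::now();
        while (haveData()) {
            // ... fill the tree as before ...
            if (rollOverNeeded(openedAt))
                break; // finalize this file; the outer loop opens the next one
        }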

Thanks for your help!

Cheers,
Petr


Hi Petr,

good information to know: thanks for sharing it.

Cheers,
D
