Error in TFile::TFile File does not exist

Hello Root experts.

I am using an NtupleDumper framework which reads in a large set of files and dumps the pertinent information into smaller files. When I run this framework, some of the data periods run without error, but a few give me the error:

Error in TFile::TFile: file /path/to/root/file does not exist, which is followed by a “*** Break *** segmentation violation”.
When I then go directly to that directory and look, the file is definitely there. This file is the very first file in the list being read in, and I’ve tried a few things:

  1. Removing a large number of files from the list, leaving only the first three, including the file that originally gives the error. This then runs correctly like the other periods. (Unfortunately there are way too many files to do this repeatedly to find where it stops functioning.)
  2. I then replaced the full list and tried to run again, and got the same error.
  3. I tried swapping the order of the first file and some later file in the list; the error is then given for whatever file is first in the list.
  4. I’ve checked to make sure the lists are created exactly the same for the functioning data periods as for the ones that fail.

Any help with this would be greatly appreciated, as the error message doesn’t give much information.

Regards,
Nate Grieser

This seems to indicate some sort of memory overwrite (likely in the “NtupleDumper framework”). To narrow it down, the best approach is to run the failing process with valgrind.
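For example, a generic invocation (the executable name here is just a placeholder for your framework’s binary; ROOT ships a suppression file for its known internal allocations):

valgrind --suppressions=$ROOTSYS/etc/valgrind-root.supp ./NtupleDumper <your usual arguments>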

Cheers,
Philippe.

Philippe,

Thanks for the advice. I ran valgrind, and the only issues it lists are a number of bytes in blocks that are “still reachable”, which as I understand it is not necessarily a leak or unreachable memory.

Here are the few lines of code associated with where the error occurs:

for ( auto file: files){
   int nbranches, entries;
   TFile *f=TFile::Open(file);
   TH1D *h = (TH1D*)f->Get("MetaData_EventCount");
   if ( h_event_count == 0){
       h_event_count = (TH1D*)h->Clone("evcount");
      h_event_count->SetDirectory(0);
   }
   else { h_event_count->Add(h); }
   int metadata_entries = h->GetBinContent(3);
   TTree* nom_tree = (TTree*)f->Get("FlavourTagging_Nominal");
   if (nom_tree) entries = nom_tree->GetEntries();
   else entries = 0;

   if (metadata_entries < 1) cout << "(0 entries) skip file: " << file << endl;
   else if (entries == 0) { cout << "(0 entries in tree || no nominal tree) skip file: " << file << endl; }
   else { cout << " use file: " << file << endl;
      for (auto chain: tchains){
         chain->Add(file);
      }
   }
   f->Close();
   delete f;
}

Specifically, the line pointed to in the crash is the TFile::Open(file) line. Any suggestions on how to clean this up to prevent any leaks? Unfortunately, using valgrind didn’t help me much with that.

Thanks for all the help!

Regards,
Nate

This is certainly not the code you are actually using, is it?

This code won’t work, e.g. there is no class TH1d with lowercase d, there is no “foe” loop, h_event_count and h_event+count are not the same, and the line TTree* nom_tree = (TTree*)f->GetEntries(); is complete nonsense: a number is not a TTree! Also: where does the “else” in else entries = 0; suddenly come from?

You need to post the code you are using (stripped down as much as possible but still showing the problem).

As general advice (I am posting this again and again): be type safe, avoid casts wherever possible, and where they cannot be avoided, use dynamic_cast (or static_cast) so that the cast stands out as ugly. Especially avoid file->Get; prefer file->GetObject plus error handling. Here, such a check would ensure that what f->Get("MetaData_EventCount") returns really is a TH1D*.
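For illustration, a minimal sketch of that pattern (using the names from the code above):

TH1D *h = nullptr;
f->GetObject("MetaData_EventCount", h); // h remains nullptr if the key is missing or has the wrong type
if (!h) {
   std::cout << "no MetaData_EventCount histogram, skip file: " << file << std::endl;
   // handle the error, e.g. close the file and continue
}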

Behrenhoff,

Sorry, there was an issue with copying the code from the editor into this thread. I’ve checked again and the typos should be gone. Again, this code works for a number of slices, and only has this issue with the larger slices.

After some further looking and piecing together some other help topics, it seems that the TH1D *h is not getting cleared at the end of each iteration over the members of “files”. I tried using h->Delete(); and delete h;, but unfortunately the error persists with these additions. Is there a cleverer way to clear the memory being allocated to this TH1D?

What is the definition of the “files” variable?
Can it be that some subdirectory and / or file names contain non-basic-ASCII characters (or “space” characters)?

Here’s a foolproof version of your loop:

TH1D *h_event_count = 0;
for (auto file: files) {
  TFile *f = TFile::Open(file);
  if ((!f) || (f->IsZombie())) {
    std::cout << "(no file || file is zombie) skip file: "
              << file << std::endl;
    delete f; continue;
  }
  
  TH1D *h; f->GetObject("MetaData_EventCount", h);
  if ((!h) || (h->GetBinContent(3) < 1)) {
    std::cout << "(no histogram || 0 entries in histogram) skip file: "
              << file << std::endl;
    delete f; continue;
  }
  
  TTree *t; f->GetObject("FlavourTagging_Nominal", t); 
  if((!t) || (t->GetEntries() < 1)) {
    std::cout << "(no tree || 0 entries in tree) skip file: "
              << file << std::endl;
    delete f; continue;
  }
  
  if (h_event_count) h_event_count->Add(h);
  else {
    h->SetDirectory(0); // (0) or (gROOT)
    h->SetName("evcount");
    h_event_count = h;
  }
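  // deleting the TFile also closes it; h_event_count survives because SetDirectory(0) detached it from the file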
  delete f;
  
  std::cout << "use file: " << file << std::endl;
  for (auto chain: tchains) chain->Add(file);
}

Coyote,

This seems to have been a fix for two of the four broken periods. However, two of the periods now crash because every file is reported as being a zombie file. The files are definitely filled (7-13MB .root files) and are accessible via the terminal, so I’m not sure what this is saying. Unfortunately, searching for the definition of a zombie file doesn’t really return much, so maybe you could explain this?

Also, files is defined as:

std::vector<TString> files;

Intuitively this doesn’t seem like the best way to do things, but unfortunately this is a hand-me-down software package that is convoluted, so I’m not sure of the repercussions of changing this to a TList.
I also checked the file lists themselves and they seem to be consistent in naming/utilization with the other, non-broken periods. In fact, they are created identically with the same python script.
Thanks for the help and consideration with this!

For some of these “files”, try to run (see if you get what you expect is inside):

rootls -l file

If the above command returns no errors, the only thing I can think of is that, in your code, you open many different files simultaneously (and do not close them when no longer needed) and then hit the limit imposed by the operating system on “open files” / “descriptors” (which I expect is something like 1024). Note that in the “foolproof version of your loop”, I always “delete f;” as soon as it is no longer needed.
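As an illustration (not part of your original code), one way to watch for descriptor exhaustion from inside the job is to ask ROOT how many files it currently tracks as open, e.g. in a ROOT macro or after including TROOT.h:

// minimal sketch: print the number of TFile objects ROOT currently has open
std::cout << "open ROOT files: " << gROOT->GetListOfFiles()->GetEntries() << std::endl;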

There didn’t seem to be anything wrong with the input files.

I searched for any other instance of TFile being opened and the only one is in your foolproof example.

I ended up cutting the list in half and will just merge the resulting root files, which is unfortunately not a legitimate fix. Thanks for all the attempts and help!

Are you saying that with valgrind, even with a large number of files, it succeeds?

No, sorry, that wasn’t very clear. Valgrind returned no leak or loss, but it did state some blocks were still reachable. From what I read in their documentation, I thought that meant these were not an issue. Maybe I misinterpreted?

For the purpose of this test, you could turn off leak checking.

If the problem is a memory overrun of some sort, valgrind should discover it.
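For example (again with a placeholder executable name), leak checking can be disabled like this:

valgrind --leak-check=no --suppressions=$ROOTSYS/etc/valgrind-root.supp ./NtupleDumper <your usual arguments>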

Just for clarity, let me ask: did you run under valgrind with a larger number of files? If/when you do run under valgrind with a large number of files, does the job
a) fail in the same way as without valgrind but have no (non-leak related) valgrind report?
b) fail in the same way as without valgrind and have a (non-leak related) valgrind report?
c) succeed but have no (non-leak related) valgrind report?
d) succeed and have a (non-leak related) valgrind report?

Thanks,
Philippe.

Philippe,

The valgrind run returned (b): it failed in the same way and has the non-leak report. This was run over the full list. I didn’t try running valgrind over the halved lists.

Nate

Do you get this “file is zombie” message directly from the “foolproof loop” or do you get it afterwards from another part of the code?

In this loop, you add all found files into multiple chains.

How many files do you have in a “broken period” and to how many chains do you then add them?

If you multiply the number of files by the number of chains, can it be that you exceed the limit on “open files” / “descriptors”? I’m not really convinced that this “product” is relevant here, as I guess a chain should always have just the “current” file opened, but who knows what “optimizations” may take place (and if you use “multiprocessing” / “PROOF-Lite” then for sure there will be many files opened simultaneously, so maybe the number of chains times the number of “worker processes” is important in this case).

Do you delete these chains after you no longer use them?
If not, and you simply start to create new chains for a new “period”, then ROOT will keep open many files which were not “closed” (so this will work for some “period” but then the next one may fail with “file is zombie” messages).
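For illustration, a minimal sketch of such a cleanup between periods (assuming tchains is a std::vector<TChain*>, as the earlier code suggests):

// hypothetical cleanup between "periods": deleting a TChain releases any file it still holds open
for (auto chain : tchains) delete chain;
tchains.clear();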

Hi Nate,

You replied:

The valgrind run returned (b): it failed in the same way and has the non-leak report.

Since valgrind did report/complain about things that were not-related to leaks, could you share those reports?

Thanks,
Philippe.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.