TFile not found after gunzip

Hi all, I’m using ROOT v5.34/24 on Ubuntu 14.04.1 LTS (and also Mac OS X 10.10.5) to post-process large raw data files. Specifically, I have a bunch of list-mode LYSO scintillator data stored as PulseArea TBranches of TTrees called WaveformTree in about 50 different files, some of which are gzipped and some of which are not. I would like to extract the PulseArea data, fill one TH1F per file with it, and write the collection of TH1Fs to a single ROOT file.

Here is my entire analysis script:

// script to data-reduce list-mode LYSO root files to histogram-mode
// based off of https://root-forum.cern.ch/t/open-files-in-a-directory-with-a-for-loop/12471/1
// Jayson Vavrek, MIT, 2016
void reduceLYSO(const char *dirname="./")
{
  TStopwatch sw;
  sw.Start();

  // these are the two file extensions for LYSO data
  const char *ext=".root";
  const char *extgz=".root.gz";

  // initialize the output file
  TFile *outfile = new TFile("reduced_LYSO_data.root","RECREATE");

  // loop over all the files
  int fileCounter = 0;
  Long64_t totalFileSize = 0; // declared here because it is printed after the loop
  TSystemDirectory dir(dirname, dirname);
  TList *files = dir.GetListOfFiles();
  if (files) {
    TSystemFile *file;
    TCanvas *c1 = new TCanvas();

    cout << "Processing files..." << endl;

    TIter next(files);
    while ((file=(TSystemFile*)next()))
    {
      TString fname = file->GetName();
      bool wasZipped = false;
      if (!file->IsDirectory() && fname.BeginsWith("2016") )
      {
        // if the file is gzip'd, gunzip it and modify the fname var
        if (fname.EndsWith(extgz))
        {
          cout << gSystem->Exec("yes n | gunzip " + fname) << endl; // pass "n" in case it asks to overwrite
          fname.ReplaceAll(".gz","");
          cout << fname << " " << fname.Length() << endl;
          wasZipped = true;
        }

        // now do the heavy lifting
        if (fname.EndsWith(ext))
        {
          TFile *f    = (TFile*) TFile::Open(fname.Data());
          TTree *tree = (TTree*) f->Get("WaveformTree");

          totalFileSize += f->GetSize();

          TString hname = fname;
          hname.ReplaceAll(".root","");
          hname.Prepend("h_");
          cout << "  " << fileCounter << ") " << hname << endl;

          TH1F *h = new TH1F("h","h",30000, 0, 30000);
          tree->Draw("PulseArea>>h","","");
          h->SetName(hname);
          outfile->cd();
          h->Write();
        }

        // if the original file was zipped, rezip it
        if (wasZipped) gSystem->Exec("gzip " + fname);

        ++fileCounter;
      }
    }
  }
  outfile->Close();
  cout << "File " << outfile->GetName() << " created." << endl;
  cout << "Approximately " << totalFileSize/1.0e9 << " Gbytes of list-mode data reduced to "
       << outfile->GetSize()/1.0e3 << " kbytes of histogram-mode data." << endl;
  cout << "Real time spent: " << sw.RealTime()/60.0 << " min." << endl;
}

After processing about nine files (some of which are gzipped and some of which are not), my script fails. Specifically, I get an error:

Error in <TFile::TFile>: file 2016_08_02_14_23_13.root does not exist

This occurs when I gunzip the file then call TFile::Open() on the file name minus the “.gz” extension.

The weird thing is that doing the gunzip (and other lines) manually in the shell or the interpreter works fine. It also works fine if I only have a few files to loop over. Only when I have 50 or so files in the directory to process does it fail, and then it fails on the ninth file. Moreover, if I comment out the tree->Draw() command, it churns through all the files with no problem.

I should also mention that it initially fails on the file that was created last chronologically. I’m not sure why, but the files in the TList have no discernible order, such that the chronologically-last file gets put ninth in the TList. Due to the seemingly random order, the script processes a few gzipped and a few non-gzipped files before failing on the ninth one, so it’s not running into a problem on its first encounter with a gzipped file, for instance. If I move this last file out of the directory, the script then proceeds to fail on the 10th-last chronological file (11th from the start of the TList), and not the second last.
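The unpredictable processing order can be removed by collecting the file names first and sorting them. Because the names begin with a zero-padded timestamp (as in 2016_08_02_14_23_13.root), lexicographic order should match chronological order, assuming the name reflects when the file was created. A minimal sketch in plain C++ rather than ROOT code: the helper name `selectAndSort` is hypothetical, and the input vector stands in for the names read from the TList.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: keep the names that start with `prefix`, then sort them.
// With zero-padded timestamp names, lexicographic order is chronological order.
std::vector<std::string> selectAndSort(const std::vector<std::string>& names,
                                       const std::string& prefix)
{
  std::vector<std::string> selected;
  for (const std::string& name : names) {
    if (name.compare(0, prefix.size(), prefix) == 0) selected.push_back(name);
  }
  std::sort(selected.begin(), selected.end());
  return selected;
}
```

Looping over the sorted vector instead of the raw TList would make a failure reproducible at the same file on every run.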

Any ideas?

Hi,

I have a comment and a question:

  1. Re-compressing a ROOT file with gzip should be unnecessary, since its contents are already compressed by default.
  2. Is the actual file 2016_08_02_14_23_13.root available on disk when ROOT tries to open it?

Cheers,
Danilo

Thanks for the response.

  1. The gzip compression does seem to work: the zipped versions take up 6% of the disk space of the unzipped versions.
  2. The gunzip of the problematic file failed, so the file 2016_08_02_14_23_13.root didn’t exist on disk when ROOT tried to open it (only its gzipped version did).
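Since the failed unzip only surfaced when TFile::Open() complained, one option is to check the command's exit status and confirm that the unzipped file exists before trying to open it. The following is a minimal sketch in plain C++, not ROOT code: `stripGzSuffix`, `fileExists`, and `gunzipChecked` are hypothetical helper names, `std::system` stands in for gSystem->Exec, and it assumes a POSIX shell with gunzip on the PATH and file names without single quotes. It also strips only a trailing ".gz" instead of every occurrence in the name.

```cpp
#include <cstdlib>
#include <fstream>
#include <string>

// Remove a trailing ".gz" only; other occurrences in the name are left alone.
std::string stripGzSuffix(const std::string& name)
{
  const std::string suffix = ".gz";
  if (name.size() >= suffix.size() &&
      name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0) {
    return name.substr(0, name.size() - suffix.size());
  }
  return name;
}

// True if the file can be opened for reading.
bool fileExists(const std::string& path)
{
  std::ifstream in(path.c_str());
  return in.good();
}

// Run gunzip and report success only if the command returned 0
// and the unzipped file is actually present afterwards.
bool gunzipChecked(const std::string& gzName)
{
  const std::string command = "gunzip '" + gzName + "'";
  const int status = std::system(command.c_str());
  return status == 0 && fileExists(stripGzSuffix(gzName));
}
```

Skipping the file (or stopping with a message) when `gunzipChecked` returns false would point at the unzip step directly instead of at a missing .root file.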

Either way, I traced the problem to running out of available memory because I never closed the input files.
Adding

f->Close();
delete f;

right after ++fileCounter solved it.
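The same cleanup can be tied to scope so that it cannot be forgotten. The sketch below is plain C++ with a stand-in class, not ROOT code: `FakeFile` and `processAll` are hypothetical, and `FakeFile` only counts how many handles are open at once. Holding a real TFile in a std::unique_ptr in the same way is an assumption here and needs a C++11-capable setup, which the ROOT 5 interpreter is not. Note also that in the script above `f` is declared inside the `if (fname.EndsWith(ext))` block, so in compiled C++ the Close()/delete pair would have to go inside that block.

```cpp
#include <memory>
#include <string>
#include <vector>

// Stand-in for an input file handle; it only tracks how many are open.
struct FakeFile {
  static int openCount;
  explicit FakeFile(const std::string&) { ++openCount; }
  ~FakeFile() { --openCount; }
};
int FakeFile::openCount = 0;

// Opens each file in turn and returns the largest number of handles
// that were open at the same time.
int processAll(const std::vector<std::string>& names)
{
  int maxOpen = 0;
  for (const std::string& name : names) {
    std::unique_ptr<FakeFile> f(new FakeFile(name));
    if (FakeFile::openCount > maxOpen) maxOpen = FakeFile::openCount;
    // ... read from *f here ...
  }  // f is destroyed at the end of every iteration
  return maxOpen;
}
```

Because the handle is released at the end of each pass, at most one file is open at a time regardless of how many files are in the directory.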

Hi,

glad you could solve 2)!
About 1): if a second round of compression still has an effect and you really need minimal disk usage, you could raise the gzip compression level to its maximum or even test LZMA. Whether this pays off depends very much on the data model and on whether you can trade some CPU time, so it needs to be tested for your specific case.
See root.cern.ch/doc/master/classTF … 909ac86760

Cheers,
Danilo