Problem writing ROOT file

N.B. I corrected all the paths below from …/scratch0/… to …/scratch0/public/… (sorry for the typo).


Hi,

(Sorry for the long post, but I have tried to describe my problem as completely as possible.)

I have a problem running a job on LXBatch (batch system at CERN). The analysis job consists of C++ code (NLOJet++ program) compiled against ROOT 5.21.04.

When I run interactively on an lxplus node, it is fine.

However, when I run the same code on the batch system, the code seems to run okay, but for some reason I am not able to write the output to a TFile. I get the errors:

which can be seen in my stdout and stderr logs:
lxplus.cern.ch:~efeng/scratch0/public/ROOTprogram/logs/

(For some reason the errors seem to occur out of order in the log file, i.e. the ROOT errors are reported after some shell errors when I try to rename/copy the missing file. I guess this is a separate I/O issue.)

After that, there is no output ROOT file which is supposed to appear as:
NLOJet++Moriond/output/DijetMassChi.root

I put the code that writes to the TFile here:
lxplus.cern.ch:~efeng/scratch0/public/ROOTprogram/NLOJet++MoriondProgram/src/DijetMassChi.cpp

where the code that writes the output TFile is the function save():

void histos::save(TString filename){
  TFile myfile(filename,"UPDATE");
  //calculate cross-section from weights info.
  ref_cross->Write("",TObject::kOverwrite);
  ref_cross_scale->Write("",TObject::kOverwrite);
  ref_obs_bins->Write("",TObject::kOverwrite);
  ref_obs_bins_sq->Write("",TObject::kOverwrite);
  myfile.Close();
}

Again, when I run interactively on lxplus, the output file is saved fine without these errors. The problem only occurs on the batch queue.

For completeness, the setup file is:
lxplus.cern.ch:~efeng/scratch0/public/ROOTprogram/NLOJet++MoriondProgram/setup.sh

where I set my environment to:

plat=slc4_amd64_gcc34
export PATH=$PATH:/afs/cern.ch/sw/lcg/external/root/5.21.04/${plat}/root/bin/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/afs/cern.ch/sw/lcg/external/root/5.21.04/${plat}/root/lib

although actually the gcc version is 4.1.2 (same on both lxplus and lxbatch).

The Makefile is:
lxplus.cern.ch:~efeng/scratch0/public/ROOTprogram/NLOJet++MoriondProgram/Makefile

Each time I run a batch job, I copy the directory with my source code and rebuild:

#----- Setup NLOJet++
scp -r lxplus:~/NLOJet++Moriond/ ./
cd NLOJet++Moriond/
source setup.sh
make clean
make -B DijetMassChi.la     # Need to make unconditionally.
rm -f output/*.root         #  Delete existing output (will be updated)

which you can see in my script:
lxplus.cern.ch:~efeng/scratch0/public/ROOTprogram/NLOJet++MoriondScripts/NLOJet++Moriond.run.sh

The rest of the batch submit scripts are here:
lxplus.cern.ch:~efeng/scratch0/public/ROOTprogram/NLOJet++MoriondScripts/

where in particular I execute the first, and then one script calls the next:
lxplus.cern.ch:~efeng/scratch0/public/ROOTprogram/NLOJet++MoriondScripts/NLOJet++Moriond.wrap.sh
lxplus.cern.ch:~efeng/scratch0/public/ROOTprogram/NLOJet++MoriondScripts/NLOJet++Moriond.lsf.sh
lxplus.cern.ch:~efeng/scratch0/public/ROOTprogram/NLOJet++MoriondScripts/NLOJet++Moriond.run.sh

Finally for completeness the NLOJet++ program itself is:
lxplus.cern.ch:~efeng/scratch0/public/ROOTprogram/NLOJet++MoriondProgram/

In the above, my local directories for the C++ code and for the batch scripts are both actually called NLOJet++Moriond (in different paths), so I renamed them NLOJet++MoriondProgram/ and NLOJet++MoriondScripts/ when providing them in my scratch area for you.

I would be very grateful for any suggestions to understand why this problem is occurring and, more importantly, how to fix it.

Thanks,
Eric

[quote]SysError in TFile::TFile: file ./output/DijetMassChi.root can not be opened (No such file or directory)[/quote]
Most likely the output directory does not exist in the ‘current directory’ used on the batch nodes…

Philippe.
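One way to test this suggestion from inside the job is to check the directory right before the TFile is opened. The following is only a minimal sketch in plain POSIX C++, not code from the original program: the helper name outputDirUsable is hypothetical, and it assumes the path is given relative to the process's current directory, as "./output" is in the error message.

```cpp
#include <cassert>
#include <cerrno>
#include <cstring>
#include <iostream>
#include <string>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper (not part of ROOT or NLOJet++): returns true when
// `dir` exists, is a directory, and the current process may create files
// in it. On failure the reason is printed to stderr.
bool outputDirUsable(const std::string& dir) {
    assert(!dir.empty());

    struct stat info;
    if (stat(dir.c_str(), &info) != 0) {
        // ENOENT here is the same "No such file or directory" that the
        // TFile constructor reported.
        std::cerr << "stat(" << dir << ") failed: "
                  << std::strerror(errno) << '\n';
        return false;
    }
    if (!S_ISDIR(info.st_mode)) {
        std::cerr << dir << " exists but is not a directory\n";
        return false;
    }
    // Creating a file needs write and search (execute) permission.
    if (access(dir.c_str(), W_OK | X_OK) != 0) {
        std::cerr << dir << " is not writable: "
                  << std::strerror(errno) << '\n';
        return false;
    }
    return true;
}
```

Calling outputDirUsable("./output") at the top of save() would show whether the directory is visible from the process itself, which is a stronger check than an ls run earlier in the shell script.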

[quote=“pcanal”][quote]SysError in TFile::TFile: file ./output/DijetMassChi.root can not be opened (No such file or directory) [/quote]Most likely the output directory does not exist in the ‘current directory’ used on the batch nodes…

Philippe.[/quote]

Hi Philippe,

Thanks for the suggestion, but in my job I confirm the directory is there. I actually ‘ls’ it:

echo "Contents of output/:"
ls -lart output/

which yields:

Thanks,
Eric

Hi,

Humm … strange. Where you have the file opening in the code, can you add the following (you might need to add #include "TROOT.h"):

gROOT->ProcessLine(".! echo $PWD");
gROOT->ProcessLine(".! ls -lart");
gROOT->ProcessLine(".! touch ./output/testing_writes");
gROOT->ProcessLine(".! ls -lart ./output");

Philippe.

[quote=“pcanal”]Hi,

Humm … strange. Where you have the file opening in the code, can you add the following (you might need to add #include "TROOT.h"):

gROOT->ProcessLine(".! echo $PWD");
gROOT->ProcessLine(".! ls -lart");
gROOT->ProcessLine(".! touch ./output/testing_writes");
gROOT->ProcessLine(".! ls -lart ./output");

Philippe.[/quote]

Hi Philippe,

I tried that, and also added some info lines to make it more readable, and now I get these errors repeatedly (along with the pre-existing TFile errors, also repeatedly):

It seems that somehow ROOT cannot tell what the pwd is? In my shell script that executes the job, I checked before running the NLOJet++ analysis that I am in the right place:

Thanks,
Eric

Hi Eric,

[quote]It seems that somehow ROOT cannot tell what the pwd is?[/quote]
Literally speaking, it is the shell that cannot tell. The error message hints that the directory that was the current directory when root.exe started is being deleted. To check, you could print the current directory before starting root.exe, and then also in various places in your code/script, to see where the deletion happens.

Philippe.
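The check described above can be sketched in plain POSIX C++. This is an illustration only, not code from the thread: reportCwd and cwdLostAfterDelete are hypothetical names, and the temporary directory under /tmp is an assumption made for the demonstration. The first function prints the working directory with a label, so it can be called at several points in the program. The second reproduces the suspected situation in miniature by deleting the directory the process is currently in.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <unistd.h>

// Hypothetical helper: print the current working directory, tagged with
// `label`. Returns false if the directory can no longer be resolved.
bool reportCwd(const char* label) {
    assert(label != nullptr);
    char buf[4096];
    if (getcwd(buf, sizeof(buf)) == nullptr) {
        std::cerr << "[" << label << "] getcwd failed: "
                  << std::strerror(errno) << '\n';
        return false;
    }
    std::cerr << "[" << label << "] cwd = " << buf << '\n';
    return true;
}

// Demonstration: enter a fresh temporary directory, delete it while still
// inside it, and look up the working directory again. Returns true when
// the lookup worked before the delete and failed after it.
bool cwdLostAfterDelete() {
    char original[4096];
    if (getcwd(original, sizeof(original)) == nullptr) return false;

    char tmpl[] = "/tmp/cwd_probe_XXXXXX";  // assumed scratch location
    char* dir = mkdtemp(tmpl);
    if (dir == nullptr) return false;
    if (chdir(dir) != 0) {
        rmdir(dir);
        return false;
    }

    const bool before = reportCwd("before delete");
    const bool removed = (rmdir(dir) == 0);
    const bool after = reportCwd("after delete");

    // Go back to where the process started.
    const bool restored = (chdir(original) == 0);
    return before && removed && !after && restored;
}
```

Placing calls such as reportCwd("start of save") at the start of the job and just before the TFile is opened would narrow down the point at which the directory disappears.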