ROOT file corruption while updating

Dear experts,

I am running a Python script that produces several histograms and saves them in various subdirectories of a ROOT file. I run this script multiple times using HTCondor, and I make sure that no file is being used by two jobs at the same time.

The problem I face is that some output files get corrupted when I open them in “UPDATE” mode. I see errors such as:

Error R__unzip_header: error in header.  Values: 7574 

or

Error in <TFile::ReadKeys>: reading illegal key, exiting after 0 keys

Here is my code snippet:

import os
import ROOT

# make_hist, tree, region, float_weights, og_list, new_wt_list and
# output_dir_nom are defined elsewhere in the script (module scope).
def process_weight(weight, input_file, conditions, output_filename, add_dir,
                   UnfoldVar_reco_list, binning_list, NrBins_list):
    list_hist = []

    # Build one reco-level histogram per unfolding variable
    for var, binning, nr_bins in zip(UnfoldVar_reco_list, binning_list, NrBins_list):
        reco_hist = make_hist(input_file, tree, conditions, weight, var, binning, nr_bins)
        if reco_hist is not None:
            reco_hist.SetName("PL_reco_" + var)
            list_hist.append(reco_hist)

    # Choose the per-weight output directory
    if weight in float_weights:
        weight_directory = os.path.join(output_dir_nom, f"{weight}")
    else:
        weight_directory = os.path.join(output_dir_nom, new_wt_list[og_list.index(f"{weight}")])
    os.makedirs(weight_directory, exist_ok=True)

    if add_dir == "":
        file_path = os.path.join(weight_directory, output_filename)
    else:
        final_dir = os.path.join(weight_directory, add_dir)
        os.makedirs(final_dir, exist_ok=True)
        file_path = os.path.join(final_dir, output_filename)

    # Update the file if it already exists, otherwise create it
    if os.path.exists(file_path):
        outFile = ROOT.TFile.Open(file_path, "UPDATE")
    else:
        outFile = ROOT.TFile.Open(file_path, "RECREATE")

    outFile.cd()

    # Create the subdirectory only if it does not exist yet
    subdir = outFile.GetDirectory(region)
    if not subdir:
        subdir = outFile.mkdir(region)
    subdir.cd()

    # Write all histograms from list_hist into the subdirectory
    for hist in list_hist:
        hist.Write(hist.GetName(), ROOT.TObject.kOverwrite)
    outFile.Write()
    outFile.Close()

    return True

I call this function for around 300 weights, and each weight has its own output file. Out of those, a few output files randomly show these errors. When I rerun the corresponding weight, the file is fine.

I don’t know why this problem occurs only in some of the files and not in all of them, even though I use the same code! Is there something I could do to make sure that the output files are written reliably?

Thank you in advance!
Regards,
Nilima

Hi Nilima,

Welcome to the ROOT community!
I understand from your post that you are writing to the same ROOT file from N different processes (HTCondor jobs). If that is accurate, and please correct me if I am wrong, this is not supported. My suggestion would be to write N separate files and then merge them, for example with the hadd tool, which is rather efficient.
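
For the record, the merge step could be scripted like this (the file names here are purely illustrative):

    import glob
    import subprocess

    # Collect the per-job outputs once all jobs have finished
    parts = sorted(glob.glob("out_job*.root"))
    # hadd ships with ROOT; -f overwrites the target if it already exists
    subprocess.run(["hadd", "-f", "merged.root"] + parts, check=True)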

Best,
D

Hi Danilo,

Thank you for your reply!

I may have caused some confusion, but no two jobs write into the same ROOT file. Let me make it clearer:
Say I submit 300 jobs, each writing its output into its own ROOT file, so 300 individual ROOT files in total. Within each file, the output is written into a subdirectory called Dir1. After the jobs have finished, there are 300 ROOT files, each with its output stored in Dir1.

Then I submit the same number of jobs again, but now each job updates its ROOT file with a new subdirectory, Dir2. This is where I face the file corruption problems.

If I understand your suggestion correctly, should I RECREATE a file for every individual subdirectory and then merge them?

Regards,
Nilima.

Hi Nilima,

Thanks for the explanation. I think you are doing everything right here.
Are you sure that no file is closed prematurely in a nasty way, for example by the batch system killing the job without letting the application finalise the writing?

Cheers,
D

To be honest, I didn’t think of this before, but it is definitely something to look into. Do you have any ideas on how to check this, or rather, what to do to prevent it from happening?

Regards,
Nilima

Hi Nilima,

From the error messages, that would be my first guess.
How to check largely depends on the setup used to submit the jobs. Typically, logs are kept and shipped back to the job submitter, and there one can usually find indications of premature job killing.
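
Independently of the logs, a quick way to spot files that were not closed cleanly is to reopen them and check ROOT's recovery flag. A minimal sketch, with a hypothetical path:

    import ROOT

    def file_looks_healthy(path):
        f = ROOT.TFile.Open(path, "READ")
        # IsZombie() flags files that could not be read at all
        if not f or f.IsZombie():
            return False
        # kRecovered is set when ROOT had to recover the keys of a file
        # that was not closed properly (e.g. the job was killed mid-write)
        recovered = f.TestBit(ROOT.TFile.kRecovered)
        f.Close()
        return not recovered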

Cheers,
D

Hi Danilo,

Sorry for the late reply. I ran some tests to figure out what might be going wrong, but I couldn't pin it down completely. I confirmed that the code and the inputs are fine. I found that if my output file already contains, say, 4 subdirectories, writing the 5th one can crash and corrupt the file. Since I couldn't find a proper solution, I switched to RECREATE mode for every subdirectory and merged the files at the end, roughly as sketched below.
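
Roughly, the workaround looks like this (the names are illustrative): each pass RECREATEs its own file, so no file is ever reopened in UPDATE mode, and everything is combined at the end.

    import ROOT

    def write_region(output_path, region, histograms):
        # One file per pass/region, always RECREATEd, never UPDATEd
        out = ROOT.TFile.Open(output_path, "RECREATE")
        subdir = out.mkdir(region)
        subdir.cd()
        for hist in histograms:
            hist.Write()
        out.Close()

    # After all passes have finished, merge the per-pass files, e.g.:
    #   hadd -f final.root out_Dir1.root out_Dir2.root ...
    # hadd preserves the subdirectory structure of the inputs.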

Regards,
Nilima.