Corrupted file after hadding

I hope this is not an already asked question and this is the right place to post.
I have a very large number of files containing histograms (~1800). Each file is less than 100 KB, and I have to merge them so they can be processed further by an ATLAS tool.
My approach was to run something like this:

hadd out.root ./root-files/*root

the stdout content was the usual: it counted the number of files and, after a very long time (more than an hour), it finally produced the out.root file.

When I ran the tool over the file, I got a segmentation fault due to a corrupted file header.
I then tried creating several intermediate output files, each produced by merging 300 of the original ones.
I then created the final file from the intermediate ones, and it worked smoothly.

My question is:
Is there a limit on the number of files to be merged? Is this a known issue? If yes, would it be possible to limit the number of input files or introduce a sanity check on the final output?
Best Regards,

It looks like this was already addressed in the past:

Hi Couet,
thanks for your reply. I think the issue is quite different.
I am using ROOT 6.04, and the error I get when I try to access one of the histograms contained in the “corrupted” file is:
Error R__unzip_header: error in header

[quote]Is there a limit over the number of files to be merged?[/quote]The only known limit is the allowed number of open file descriptors, and hadd handles that limit automatically.
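The current per-process limit on open file descriptors can be checked from the shell:

```shell
# Print the maximum number of file descriptors this shell (and its
# children, such as hadd) may keep open at once. hadd batches its
# inputs to stay below this value automatically.
ulimit -n
```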

[quote]Is this a known issue?[/quote]Your issue is unlikely to be known, so we would need a set of files reproducing the problem to track it down.


Hi Philippe,
the whole set of files is ~1.4 GB.
I have them on AFS on lxplus; are you able to access them there? If yes, I can put them in a public directory; if not, what option do you suggest?


Was this problem ever solved? I have the same issue when using hadd over more than ~1000 files (each file is between 200 and 400 KB). Specifically, when I try to access any histogram in the corrupted output, I get this error:

Error R__unzip_header: error in header.

My workaround right now is: 1) divide the file set into a few subsets, each with fewer than 1000 files; 2) hadd each subset separately; 3) hadd the subset outputs together. This works, but it’s quite annoying and time-consuming…
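The three steps of the workaround can be scripted. Below is a minimal bash sketch (not an official tool): the chunk size of 500, the `partial_N.root` naming, and the `merge_in_chunks` helper are all my own assumptions, and `RUN=echo` makes it a dry run that only prints the hadd commands; unset `RUN` to actually execute them.

```shell
#!/usr/bin/env bash
# Hypothetical chunked-merge sketch: merge at most CHUNK inputs per
# hadd call, then merge the partial outputs into the final file.
set -euo pipefail
CHUNK=${CHUNK:-500}   # assumed safe batch size, below the ~1000 threshold
RUN=${RUN:-echo}      # RUN=echo -> dry run; RUN= -> really call hadd

merge_in_chunks() {
    local out=$1; shift
    local -a inputs=("$@") partials=()
    local batch=0 i
    for ((i = 0; i < ${#inputs[@]}; i += CHUNK)); do
        # merge one batch of at most CHUNK input files
        local part="partial_${batch}.root"
        $RUN hadd "$part" "${inputs[@]:i:CHUNK}"
        partials+=("$part")
        batch=$((batch + 1))
    done
    # final merge of the intermediate files
    $RUN hadd "$out" "${partials[@]}"
}

merge_in_chunks out.root ./root-files/*.root
```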