ROOT files diff

Hi,

For testing purposes, I would like to check that two files (containing trees) hold the same data. I would like the check to be fast, and to work without loading the data definition libraries which describe the objects in the files.

I tried the md5sum command on each file, but it always returns a different result, even when the files are expected to contain the same data. I guess this comes from auxiliary data in the files which depends on the runtime conditions of the file generation, such as process IDs.

Any idea how to proceed?

David.

Absolutely no idea?

Hi.

What do you mean by “md5 always fails”?

Each time I run my test application, which writes the same data into a file toto.root, the command “md5sum toto.root” gives a different result. But perhaps I misunderstood the use of md5sum?

I have corrected the first message, replacing “md5” with the real command name “md5sum”.

David,

md5sum is useless because the baskets have a date/time stamp.
My suggestion is to:
- read the Tree headers T1 and T2
- compare the total sizes (compressed and uncompressed) via GetZipBytes and GetTotBytes
- compare the number of entries.

If the 3 tests are OK, you have a very high probability of having the same data.

Rene
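
The three checks suggested above can be sketched as a small helper. This is only an illustration, not code from the thread: `sameTreeSummary` and `FakeTree` are hypothetical names, and `FakeTree` is a stand-in so that the sketch compiles without ROOT. The helper only relies on the three accessors named above (GetEntries, GetZipBytes, GetTotBytes), and a match is a strong hint rather than a proof that the data is identical.

```cpp
#include <cstdint>

// Hypothetical stand-in for a tree header, used so the sketch builds without
// ROOT. It exposes the same three accessors that the suggestion relies on.
struct FakeTree {
    std::int64_t entries;
    std::int64_t zipBytes;
    std::int64_t totBytes;
    std::int64_t GetEntries() const { return entries; }
    std::int64_t GetZipBytes() const { return zipBytes; }
    std::int64_t GetTotBytes() const { return totBytes; }
};

// Returns true when both trees report the same number of entries, the same
// compressed size and the same uncompressed size. Works for any type that
// provides the three const accessors used below.
template <class TreeA, class TreeB>
bool sameTreeSummary(const TreeA& a, const TreeB& b) {
    return a.GetEntries() == b.GetEntries()
        && a.GetZipBytes() == b.GetZipBytes()
        && a.GetTotBytes() == b.GetTotBytes();
}
```

With ROOT available, the same helper should accept two trees read from the two files, e.g. `sameTreeSummary(*t1, *t2)` for two `TTree` pointers, since none of the three calls needs the data definition libraries.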

Thanks René.
That sounds promising.

Do you think I could easily do the same with a TChain?

Yes, loop on all files of the TChain using the proposed procedure.

Rene
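
A minimal sketch of the per-file loop, assuming the three header values have already been read for each file of both chains. `TreeSummary`, `sameSummary` and `firstMismatch` are hypothetical names introduced here for illustration; the sketch deliberately avoids ROOT calls so that it compiles on its own.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Header values of the tree found in one file of a chain (hypothetical type).
// With ROOT, each field would be filled from TTree::GetEntries(),
// GetZipBytes() and GetTotBytes() after opening the corresponding file.
struct TreeSummary {
    std::int64_t entries;
    std::int64_t zipBytes;
    std::int64_t totBytes;
};

bool sameSummary(const TreeSummary& a, const TreeSummary& b) {
    return a.entries == b.entries
        && a.zipBytes == b.zipBytes
        && a.totBytes == b.totBytes;
}

// Compares two chains file by file. Returns the index of the first file whose
// summary differs, or -1 if every file matches. A different number of files
// is reported as a mismatch at the first index missing from the shorter list.
long firstMismatch(const std::vector<TreeSummary>& chainA,
                   const std::vector<TreeSummary>& chainB) {
    const std::size_t common =
        chainA.size() < chainB.size() ? chainA.size() : chainB.size();
    for (std::size_t i = 0; i < common; ++i) {
        if (!sameSummary(chainA[i], chainB[i])) return static_cast<long>(i);
    }
    if (chainA.size() != chainB.size()) return static_cast<long>(common);
    return -1;
}
```

In ROOT, the file names of a chain can be obtained from TChain::GetListOfFiles(), where the title of each element is the file name; each file is then opened in turn and its tree header read as in the single-file case.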

Actually, I was wondering whether a method such as TTree::GetZipBytes, inherited by TChain, can be safely called on a TChain object. I guess the answer is no.

When I run the same test program twice, I get the same sizes. Great!

There is another scenario I would like to discuss:

  1. I skim two files together, through a TChain.
  2. I skim the two files one by one, with the same cut
    as before, and finally fast-merge the results.
  3. I compare the two resulting skimmed files.

The compressed sizes differ. That does not surprise me, since the compression algorithm has worked on different subsets of data.

The uncompressed sizes differ too, which is more surprising. If the data inside is the same, I should end up with the same uncompressed sizes, shouldn't I?

Another strange thing: when the tree structure is tuple-like, all the sizes are 0. Is there something I should do to force the sizes to be computed?

[quote]The uncompressed size differs also. It is more surprizing.
If the inside data is the same, I should end with the same
uncompressed sizes, should I ? [/quote]
Not quite, since the number of baskets will be different.

Cheers,
Philippe
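
One way to picture this point, as a sketch only: assume each basket adds a fixed header overhead that is counted in the uncompressed total. The function name and all byte counts below are invented for illustration and are not ROOT's actual values.

```cpp
#include <cstdint>

// Uncompressed total under the stated assumption: the payload itself plus a
// fixed header overhead for every basket that was written.
std::int64_t totalWithOverhead(std::int64_t payloadBytes,
                               std::int64_t nBaskets,
                               std::int64_t headerBytesPerBasket) {
    return payloadBytes + nBaskets * headerBytesPerBasket;
}
```

For example, the same 1,000,000 bytes of payload written as 10 baskets or as 12 baskets, with a made-up 80-byte header each, would give totals of 1,000,800 and 1,000,960 bytes, so identical data can still report different uncompressed sizes.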

When working with a tuple-like tree (the tree has only one level of branches, and all data types are ROOT ones), the size methods return 0. Any idea what is going on?

The basic code I use :

TFile * f = TFile::Open("the file name") ;
// Get() returns a TObject*, so cast it to the tree type
TTree * t = (TTree *) f->Get("the tree name") ;
cout << "%INFO: number of entries is " << t->GetEntries() << endl ;
cout << "%INFO: compressed size is " << t->GetZipBytes() << endl ;
cout << "%INFO: uncompressed size is " << t->GetTotBytes() << endl ;
f->Close() ;

I tried calling TTree::Print(), Scan(), … but the sizes always stay 0.

Could you send me the file?

Philippe

ok

A new “funny” effect with this code. In production, when dealing with the expected big files, the call to GetZipBytes() generates this error:


%INFO: number of entries is 19727939
Error: integer literal too large, add LL or ULL for long long integer /afs/slac.stanford.edu/g/glast/ground/DataServer/v3r5/src/Skimmer.cxx:857:
*** Interpreter error recovered ***
%INFO: compressed size is (class G__CINT_ENDL)153450224

Any idea what is going on?

No :) Can you try the latest code (ROOT 5.16/00)?

Cheers,
Philippe