Reading a Large File & Writing to a Small File

Hi,

I have a job reading from a large file (>2GB), selecting relevant entries, and writing them to a small file (TTree::SetMaxTreeSize has been set to a value <2GB).

Is it normal, for an entry which I transfer, that the number of read bytes differs from the number of written bytes? Why?

Let’s assume it is normal.
I was comparing the read/write bytes to check the validity of my job.
Could you imagine another way to check that an entry has been correctly transferred? Do I have a way to predict the difference between the read and written bytes? I fear it depends on the kind of data…

David.

Hi David,

It depends on ‘which’ read/write byte number you are talking about. The uncompressed number should be the same. The compressed number will vary a lot and unpredictably. If you need to validate the data, I would simply recommend making a histogram (or any other sort of plot) of some of the data for both the original and the resulting (filtered) files.
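For instance, something along these lines (just a rough sketch, not tested; the file names, the tree name "T", the branch name "energy" and the cut are placeholders to adapt to your data):

```cpp
// Rough sketch: draw the same variable, with the same cut, from the input
// chain and from the skimmed output, then compare the two distributions.
// File names, tree name "T", branch "energy" and the cut are placeholders.
#include <cstdio>
#include "TChain.h"
#include "TFile.h"
#include "TTree.h"
#include "TH1.h"

void compareDistributions()
{
   TChain input("T");
   input.Add("big_input.root");                  // the >2GB input file

   TFile out("skimmed.root");                    // the skimmed output file
   TTree *skim = (TTree*)out.Get("T");

   const char *cut = "quality > 0";              // the selection used by the skim

   // Fill one histogram per sample with the same variable and the same cut.
   input.Draw("energy >> hin(100,0,100)", cut, "goff");
   skim->Draw("energy >> hout(100,0,100)", cut, "goff");

   TH1 *hin  = (TH1*)gDirectory->Get("hin");
   TH1 *hout = (TH1*)gDirectory->Get("hout");

   // For a pure skim the two histograms should agree bin by bin.
   printf("input:  mean=%g entries=%g\n", hin->GetMean(),  hin->GetEntries());
   printf("output: mean=%g entries=%g\n", hout->GetMean(), hout->GetEntries());
}
```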

Cheers,
Philippe

I have an input TChain, and I build the output TTree with TChain::CloneTree(0).
I called “read bytes” the result returned by TChain::GetEntry(), and I called “written bytes” the result returned by the following call to TTree::Fill().
So, is it normal that they differ?
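To be fully explicit, the job looks roughly like the sketch below (simplified and untested here; the file names, the tree name "T" and the selection are placeholders, and the selection itself is omitted):

```cpp
// Simplified sketch of the skimming job: clone the input chain structure,
// copy the selected entries, and compare for each entry the bytes reported
// by TChain::GetEntry() and by TTree::Fill().
#include <cstdio>
#include "TChain.h"
#include "TFile.h"
#include "TTree.h"

void skimAndCheck()
{
   TChain input("T");
   input.Add("big_input.root");                  // the >2GB input file

   TFile *out = new TFile("skimmed.root", "RECREATE");
   TTree *skim = input.CloneTree(0);             // empty clone with the same branches
   TTree::SetMaxTreeSize(1900000000LL);          // keep each output file below 2GB

   Long64_t nEntries = input.GetEntries();
   for (Long64_t i = 0; i < nEntries; ++i) {
      Int_t nRead = input.GetEntry(i);           // "read bytes"
      // if (!isRelevant(...)) continue;         // hypothetical selection
      Int_t nWritten = skim->Fill();             // "written bytes"
      if (nRead != nWritten)
         printf("entry %lld: read %d, wrote %d\n", i, nRead, nWritten);
   }

   // The tree may have switched to a new file if SetMaxTreeSize was exceeded.
   out = skim->GetCurrentFile();
   out->Write();
   out->Close();
}
```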

Not really. Those 2 numbers should be the uncompressed size, so they should match, unless there are things that ought to change in the data (like dates or timestamps saved as strings).

Philippe

Could you explain a little more why a “date saved as string” ought to change (in the current context: transfer from a file >2GB to a file <2GB)?

David,

Each branch basket has a date/time stamp. When compressing the basket, you may end up with a slightly different number of bytes depending on the date/time values.

Rene

Hi René,

We were talking about the number of bytes returned by TChain::GetEntry() and TTree::Fill(). If I understand Philippe correctly, those are sizes of uncompressed data. So, the compression of the date/time stamp does not explain the difference. But perhaps this date/time stamp has a different format, even uncompressed, depending on whether the underlying file is >2GB or not?

To give a little more context: we have many jobs where the input file is >2GB, and in such a case the bytes returned by TChain::GetEntry() and TTree::Fill() very often differ (and maybe always). We are trying to make up our minds whether we should worry about this or not.

David.

[quote]we have many jobs where the input file is >2GB[/quote]Ah! :slight_smile: The format is slightly different when writing less or more than 2GB (some internal values are stored using 64 bits instead of 32 bits). This probably explains the differences.

Cheers,
Philippe.

PS. You can also send me a running example to confirm (or not :slight_smile: ).

Ok. So, initial assumption is confirmed. Fine.

Now back to the corollary question: do you know more precisely
which internal values are changing? Is there any chance we could predict,
for a given user data type, how many bytes we will gain
or lose?

Sorry if I am being tedious. We still have the std::vector<Double32_t> bug
coming back here and there, and the test on read/written bytes is our main way to detect a mismatch between our running ROOT version and the file's ROOT version.

David.

David,

When writing a file, if the file pointer is less than 2GB,
a 4-byte integer is used, otherwise an 8-byte integer is used
(this is to save space in the file).

Rene

I do not fully understand what you call “file pointer”.

After rereading the latest version of the ROOT Users Guide,
I/O section, it seems clear that in each record header,
the SeekKey and SeekPdir fields are twice as large
when a file becomes bigger than 2GB. So if the record
header is taken into account in the bytes returned by
GetEntry() and Fill(), I should expect 8 more bytes
(two fields going from 4 to 8 bytes each).

What is less clear in my mind is how a user data member
is stored in the file when it is a classical C++ pointer, and
whether it is affected by the 2GB limit.

And same wondering about a TRef…

What is becoming more and more clear, as I write this message,
is that I will hardly be able to compute the byte difference
between a >2GB and a <2GB file. What's more, if I understood correctly,
the increase in size for SeekKey and SeekPdir only applies within a
file once it has grown past the 2GB limit, a limit which is probably
very difficult to detect while I am reading the entries during my
skimming job.
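
At best I could imagine probing the current input file from within the loop, something like the rough sketch below (untested; I assume TChain::GetFile() and TFile::GetEND() are the right accessors), but that only tells me whether the file as a whole is large, not whether a given basket was written beyond the 2GB mark:

```cpp
// Rough, untested sketch: check whether the chain's current file is larger
// than 2GB. It assumes an entry has already been loaded, so that the chain
// has opened a file; it says nothing about where a given basket sits
// inside that file.
#include "TChain.h"
#include "TFile.h"

bool currentFileIsLarge(TChain &chain)
{
   const Long64_t k2GB = 2147483647LL;           // 2^31 - 1 bytes
   TFile *file = chain.GetFile();                // current file of the chain
   if (!file) return false;                      // no file loaded yet
   return file->GetEND() > k2GB;                 // end-of-file offset beyond 2GB
}
```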

Good time for holidays 8)