Performance vs. N files

davide · March 21, 2012, 3:55pm

Hello PROOF experts,

I recently observed the following behaviour with PROOF.
I have a dataset that is stored in a large number of files (more than 10k).
The files are stored on a dCache system (accessed with the dcap:// protocol).
I can run on the data in two ways:
(1) using the original files
(2) merging the original files (for example in groups of 25) and running on the resulting merged files
What I see is that, although I am running on the same events and on the same nodes, the PROOF speed is significantly different: the number of events processed per second is about 10 times higher in case (2) than in case (1).

Is this expected? Or is there something implemented in PROOF that can affect datasets with a large number of files?
Naively I would think that, if I have the same number of events and the files are processed sequentially on each worker, then the speed should be about the same for (1) and (2), apart from the overhead of opening each file. Also, each file contains a reasonable number of entries (>10k), i.e. I am not opening many minuscule files, in which case this overhead could be important.

Thanks,

davide

ganis · March 21, 2012, 4:58pm

Hi,

Depending on the size of the files, it may be an effect of the TTreeCache, which is enabled by default in PROOF.
You can try to repeat the test with the cache disabled. For that you have to set the parameter ‘PROOF_UseTreeCache’ to 0:

 proof->SetParameter("PROOF_UseTreeCache", 0);

before running the query.

G. Ganis

pcanal · March 21, 2012, 7:38pm

Hi Davide,

What is the size in megabytes of each file?

Philippe.