PROOF/ROOT and IO/CPU performance

Hi all,

I have been benchmarking a test PROOF cluster in an attempt to find bottlenecks and ways we can improve performance before building a bigger cluster for use with physics studies. My ability to come to concrete conclusions has been limited by my ignorance as to what steps in the processing chain involve which hardware, and how the processing chain itself actually works. The test cluster is being run both from local cluster storage and from network xrootd storage.

Considering just the network storage case, my understanding is that for a given data packet, the processing time is given by the sum of:

  1. Data transit time to worker node over network
  2. Local disk write/read time (is the data cached to disk first?)
  3. Decompression and loading of data into memory
  4. Loading the desired entry
  5. Analysis code
  6. Repeating steps 4 and 5 until data packet finished
  7. Request next packet to be sent

Processing time in my tests seems to be mostly spent loading the data into memory and copying it over the network. That said, there was a large performance difference depending on whether the input ntuple was slimmed or merely had the unnecessary branches turned off using SetBranchStatus. Where is SetBranchStatus applied in this processing chain?

Dear bbutler,

The TTreeCache is on by default and, since version 5.25/04, synchronized with the packetizer, so the transfers are somewhat optimized. The network bandwidth is typically the ultimate limit in such a case, provided that the remote server has the capability to serve all the incoming requests.
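For reference, outside of PROOF the cache has to be set up by hand. A minimal sketch, assuming a reasonably recent ROOT version (the file URL, tree name, and cache size are illustrative):

```cpp
// Enable the read-ahead cache when reading a remote file over xrootd
// (URL and tree name are hypothetical).
TFile *f = TFile::Open("root://server//data/ntuple.root");
TTree *tree = (TTree *) f->Get("ntuple");
tree->SetCacheSize(30000000);        // 30 MB cache buffer
tree->AddBranchToCache("*", kTRUE);  // register all branches with the cache
```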

Decompression is a heavy step and definitely dominates once the data are received locally.

Step 5, the analysis code, depends on you … :wink:

Packet request latency is typically small, but the requests are processed serially by the master, so this may become visible in the case of many workers with a single master.

By disabling branches you read less data, so processing goes faster. That is one of the main advantages of the tree structure. [quote]
Where is SetBranchStatus applied in this processing chain?
[/quote]
The right place to set the status is TSelector::Init(TTree *), which is called each time a new TTree object is loaded into memory.
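For concreteness, a minimal sketch of such an Init(), assuming a selector skeleton like the one generated by TTree::MakeSelector (the selector and branch names here are hypothetical):

```cpp
void MySelector::Init(TTree *tree)
{
   if (!tree) return;
   fChain = tree;

   // Disable everything first, then re-enable only the branches
   // that the analysis actually reads.
   fChain->SetBranchStatus("*", 0);
   fChain->SetBranchStatus("pt", 1);   // hypothetical branch
   fChain->SetBranchStatus("eta", 1);  // hypothetical branch

   // Hook the enabled branches up to local variables; the third
   // argument stores the branch pointer for later direct reads.
   fChain->SetBranchAddress("pt",  &fPt,  &b_pt);
   fChain->SetBranchAddress("eta", &fEta, &b_eta);
}
```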

G. Ganis

Is all the data in the TTree decompressed by default? That is, does SetBranchStatus in Init() affect whether certain branches are decompressed?

[quote=“ganis”]

By disabling branches you read less data, so processing goes faster. That is one of the main advantages of the tree structure.
G. Ganis[/quote]

I’m sorry I wasn’t clearer. What I meant was: if I turn off the branches I don’t need using SetBranchStatus in Init(), I get some increase in speed. If I remove those branches from the TTree completely, save the result to new ROOT files, and use those as input, the job runs a LOT faster. Meanwhile, the network data transfer does not seem to be saturated in either case, since the processing rate goes up linearly with each additional worker. There must be some processing step on the worker node which takes much longer with larger TTrees and is not affected by SetBranchStatus. Can you enlighten me as to what is going on here?

Instead of calling SetBranchStatus and then tree.GetEntry, it is much faster to call mybranch.GetEntry.
In the case of a tree with 10000 branches (as in many ATLAS trees) and one million events, it makes a huge difference.
See the example in the tutorial h1analysis.C.
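A minimal sketch of that pattern, in the spirit of h1analysis.C, reusing the hypothetical names from the Init() sketch above (the cuts are purely illustrative):

```cpp
Bool_t MySelector::Process(Long64_t entry)
{
   // Read only the needed branches directly, instead of calling
   // fChain->GetEntry(entry), which touches every enabled branch.
   b_pt->GetEntry(entry);
   b_eta->GetEntry(entry);

   if (fPt > 20.0 && TMath::Abs(fEta) < 2.5) {  // hypothetical cuts
      // ... fill histograms, etc. ...
   }
   return kTRUE;
}
```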

Rene