The optimal way to store a variable track count in a TTree

In this context I/O is decomposed in 3 parts:

  1. raw I/O, which reads the bytes from the disk

  2. (most often) decompressing

  3. deserializing (transforming from the platform-independent format to the in-memory format).

Step 1 is (usually) done through the TTreeCache, cluster by cluster, and when correctly configured it fetches from the disk only the (entire compressed) baskets of the branches used (here traces_, which holds the size of the trace collection, and traces.SimSignal_X). It can be tuned to read only the baskets containing the entries being used, when this is known in advance.

Steps 2 and 3 are done “just in time” when the data for a specific TTree entry is read.
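As an illustration (a sketch, not from the original post: the file name data.root and tree name t are assumptions, as is the 30 MB cache size), configuring the TTreeCache for this access pattern from PyROOT could look roughly like this:

```python
import ROOT

# Illustrative file/tree names; adjust to your setup.
f = ROOT.TFile.Open("data.root")
t = f.Get("t")

# Enable a 30 MB TTreeCache for this tree.
t.SetCacheSize(30 * 1024 * 1024)

# Cache only the branches actually used.
t.AddBranchToCache("traces_", True)
t.AddBranchToCache("traces.SimSignal_X", True)

# Optionally restrict the cache to the entry range that will be read,
# when it is known in advance (here: only entry 499).
t.SetCacheEntryRange(499, 500)
```

With this in place, only the compressed baskets of the two branches (and only those overlapping the requested entry range) should be read from disk.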

Usually 2. is the dominant factor. Whether 1. or 3. is dominant depends on the speed of the disk and the complexity of the data. In your case, the cost of 1. and 3. might also be similar to the cost of just running TTree::Draw on the data (exclusive of the cost of the GetEntry):

entries = t.Draw("traces[100].SimSignal_X", "", "goff")
vs
entries = t.Draw("traces[100].SimSignal_X", "Entry$==499", "goff")

TTree::Draw is not tuned to understand that the filter "Entry$==499" reduces the number of entries actually used, so it requests them all (reads all the baskets for the 2 branches for the whole file). It also needs to evaluate the criterion all 1001 times.

A better way of writing the second one is:

entries = t.Draw("traces[100].SimSignal_X", "", "goff", 1 /* # of entries */, 499 /* first entry */);

This should fetch from the disk only one basket per branch and execute the TTreeFormula only once.

entries = t.Draw("traces.SimSignal_X", "", "goff")
takes ~40 seconds, ~50 times longer than just traces[100].

Most likely this is the cost of filling the histograms (with 180x more entries) that becomes dominant.
Literally (barring unintended deficiencies), the only difference between Draw("traces.SimSignal_X") and Draw("traces[100].SimSignal_X") should be that, for each entry, Draw loops 1000 times in the second case, copying each float out of the (uncompressed) buffer and calling TH1F::Fill (in a slightly more optimized/complex but semantically equivalent way), and does that same loop 180*1000 times per entry in the first case.
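As a sanity check of this explanation (all numbers are the timings quoted in this thread, assuming ~1001 entries): if the per-value cost (copy + TH1F::Fill) dominates, subtracting the common overhead from the two timings should give a plausible per-value cost.

```python
# Timings quoted in the thread (seconds) and value counts:
# traces[100].SimSignal_X : ~1001 entries * 1000 values   -> ~0.8 s
# traces.SimSignal_X      : ~1001 entries * 180000 values -> ~40 s
t_single, n_single = 0.8, 1001 * 1000
t_all, n_all = 40.0, 1001 * 180 * 1000

# Per-value cost of the extra values, assuming the same fixed overhead:
per_value_us = (t_all - t_single) / (n_all - n_single) * 1e6
print(f"{per_value_us:.2f} microseconds per value")  # ~0.22
```

This ~0.2 microseconds per value is consistent with the per-number estimates that come up later in the thread.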

Try

entries = t.Draw("Sum$(traces.SimSignal_X)", "", "goff")

which should fill the histogram with only 1001 values (one sum per entry) and thus be more stable for your purpose, focusing the measurement on the I/O part rather than the histogram-filling part.


Thanks, I’ll try. In general, I’ve considered TTree::Draw() to be the fastest way to get a vector of values following some formula (or not). But from what you write, this may not be the optimal way. So… what is the optimal way? A loop through entries in C++? If so… is there any way to call it in PyROOT without ProcessLine(), etc.?

“Optimal” depends of course on the criteria, for example:

pure speed
time to write the code
maintainability
flexibility
time to learn the subtleties of the interfaces.

And indeed a hand-crafted C++ loop, specific to the use case, will have the lowest run-time, but it will take longer to write than the equivalent TTree::Draw, RDF, or Python code.

Note that in my previous answer I did not say that TTree::Draw was sub-optimal (for the use you described). I mentioned a case (filtering entries with Entry$==) that was a sub-optimal use of TTree::Draw itself, and a case where you were (possibly) not measuring only what you thought you were measuring.

For benchmarking it is always a challenge to make sure that we are comparing “apples to apples” and that the alternatives either really do the same “tasks” or at least that the differences are understood (for example, you may have underestimated the time needed to fill the histogram in your analysis).

Cheers,
Philippe.

PS. These days, commonly, the best choice (best compromise between performance, ease of use, and readability) is to use RDataFrame (especially since it easily leverages multiple cores as the number of entries/files increases).
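A minimal RDataFrame sketch of the equivalent of the Draw calls above (the file name data.root and tree name t are assumptions, not from the original post; depending on the split level, the column may be spelled differently than the branch name):

```python
import ROOT

# Optional: use all available cores for the event loop.
ROOT.EnableImplicitMT()

df = ROOT.RDataFrame("t", "data.root")

# Roughly equivalent to t.Draw("traces.SimSignal_X", "", "goff"):
# histogram every value of the nested collection.
h = df.Histo1D("traces.SimSignal_X")
print(h.GetEntries())
```

The event loop runs lazily when the histogram result is first accessed, so chained Define/Filter/Histo1D operations are evaluated in a single pass over the data.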

Thanks. I am aware that I don’t know if I am comparing apples to apples, thus I write here, and I’ve just learned something important :slight_smile:

entries = t.Draw("traces[100].SimSignal_X", "", "goff", 1 /* # of entries */, 499 /* first entry */);

Takes ~0.02 seconds, which is a huge difference. Thanks! Keeping [100] but adding another index changes nothing; skipping [100] results in ~0.06 seconds. No indices and Sum$ gives ~0.055 s. So I am not sure the difference between 0.02 and 0.06 can be attributed to filling a histogram; maybe just allocating memory for more values starts to play a role here.

entries = t.Draw("Sum$(traces.SimSignal_X)", "", "goff")
takes roughly 35 s, while
entries = t.Draw("Sum$(traces[100].SimSignal_X)", "", "goff")
takes roughly 1 s. Without Sum$ it was ~40 s and ~0.7 s respectively. Interesting. In the case of more numbers Sum$ helped, while in the case of fewer numbers it hurt. Anyway, Sum$ of all traces still takes ~35 times longer than the sum of traces[100], so either Sum$ is still quite slow, or there is something else going on here, judging from the time difference between the two above.

If I got the math right, [100] is 1000*1000 values, [100][0] is 1000 values, and nothing is 180,000,000 values (180*1000*1000). So processing ‘179000000’ values would be ‘costing’: 0.4s / 179000000 = 2 nano-seconds.

The difference between the other 2 numbers (35s - 1s) is (if I calculated right) 189 nano-seconds…

I am not sure where 0.4 s comes from…

0.06 and 0.02 s are the numbers for a single entry. If no index is specified, we have roughly 180,000 numbers in a single entry, if [100] is specified we have 1000 numbers. So from that, we get either ~0.3 or 20 microseconds per number.

Reading all the entries and summing is 35 s for ~180,000,000 numbers and 1 s for 1,000,000 numbers, which gives ~0.2 and 1 microseconds respectively; without the sum it was ~0.2 and 0.7 microseconds respectively. So here I get either an 800 or a 500 nanosecond difference.
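These per-number estimates can be checked with quick arithmetic (all inputs are the timings quoted in this thread):

```python
# Single entry (entry 499): ~180,000 numbers in ~0.06 s without an index,
# 1000 numbers in ~0.02 s with [100]. Results in microseconds per number.
per_num_all_one_entry = 0.06 / 180_000 * 1e6
per_num_100_one_entry = 0.02 / 1_000 * 1e6

# All 1001 entries: Sum$ takes 35 s (180,000,000 numbers) or 1 s (1,000,000);
# without Sum$ it was 40 s and 0.7 s.
sum_all, sum_one = 35 / 180e6 * 1e6, 1 / 1e6 * 1e6
fill_all, fill_one = 40 / 180e6 * 1e6, 0.7 / 1e6 * 1e6

print(round(per_num_all_one_entry, 2), per_num_100_one_entry)  # 0.33 20.0
print(round(sum_one - sum_all, 1), round(fill_one - fill_all, 1))  # 0.8 0.5
```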

If I understood correctly

entries = t.Draw("traces[100].SimSignal_X", "", "goff", 1 /* # of entries */, 499 /* first entry */);

takes .2s

entries = t.Draw("traces[100].SimSignal_X[0]", "", "goff", 1 /* # of entries */, 499 /* first entry */);

also takes .2s
and

entries = t.Draw("traces.SimSignal_X", "", "goff", 1 /* # of entries */, 499 /* first entry */);

takes .6s.
So the “extra” cost of processing the additional 179 traces is .4s. I am focusing on the difference as I expect the only difference between option 1 and 3 to be the processing of the values (i.e. the I/O cost is the same… hmm, actually I am not sure that is the case: if you run both in a row, the second one will have no I/O, as the data will already be in memory).

I tripped myself up by forgetting the entry selection in my calculation, so I was indeed a factor of 1000 off.

Since the cost of processing the additional 179 traces is .4s, each trace “costs” 2.2 milliseconds, and since there are 1000 signal values per trace, each additional number costs 2.2 micro-seconds, which is a bit slow but a plausible cost for TH1F::Fill.
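The corrected arithmetic (a factor of 1000 smaller than the earlier estimate, since only one entry is read; numbers are the .6s and .2s timings above):

```python
extra_time = 0.6 - 0.2      # seconds: full entry minus single-trace entry
extra_traces = 179          # additional traces processed in that one entry
per_trace = extra_time / extra_traces
per_number = per_trace / 1000   # 1000 signal values per trace
print(f"{per_trace * 1e3:.1f} ms per trace, {per_number * 1e6:.1f} us per number")
```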


entries = t.Draw("Sum$(traces.SimSignal_X)", "", "goff")

takes 30s

entries = t.Draw("traces.SimSignal_X", "", "goff")

takes 40s

For the first one, Sum$ sees 180,000,000 numbers and Draw plots 1,000.
For the second one, 180,000,000 numbers are plotted.
The difference from 1 to 2: 180,000,000 fewer operator+= and 180,000,000 more TH1F::Fill.
And 10s divided by 180,000,000 is 0.05 micro-seconds, which seems to indicate that Sum$ is (relatively speaking) slow (given the previous numbers it would cost 1.7 micro-seconds per number).
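The same kind of estimate for the Sum$ comparison (30 s vs 40 s over 180,000,000 values, as quoted above):

```python
# Swapping TH1F::Fill for Sum$'s operator+= saves only 10 s in total,
# so per value the two operations differ by very little.
n_values = 180_000_000
delta_per_value_us = (40 - 30) / n_values * 1e6  # microseconds per value
print(f"{delta_per_value_us:.3f}")  # ~0.056
```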


So TL;DR, it looks like the numbers are more or less consistent, aren’t they?

Thanks, it makes sense now.

I would like to post here the results of my (amateur) speed comparison to HDF5. Maybe it would be useful to someone. However, first I want to add RNTuple results, and for that I need to understand how to utilise its columnar approach. This will be a separate post.

Long story short, ROOT is far faster (30 or 100 times, depending on caching/no caching) than HDF5 for branches with short arrays, but for the traces it is at best 6 times faster. As reading the traces is the most likely scenario and the improvement is not that big, we’ve decided to go with HDF5, as it is more flexible and does not require installing a big code base. Also, a negative sentiment towards ROOT (which I don’t share) played a significant role :frowning: However, if RNTuple proves to be much (more than an order of magnitude) faster than HDF5 in our use scenarios, we need this difference, and we have the manpower, we’ll consider a switch in the future.
