GoHEP/groot: v0.27.0 (root-split, groot-faster-than-root)

hi there,

I am very happy to announce the release of Go-HEP@v0.27.0:

You can easily download standalone binaries (i.e. you don’t need Go installed on your machine) for selected platform+OS combinations from https://go-hep.org/dist (e.g. choosing the latest version).

This release brings quite a few performance improvements in the area of reading ROOT data.

groot can now read data faster than ROOT in the few tests I was able to construct and assemble:

name                               time/op
ReadCMS/GoHEP/Zlib-8               19.2s ± 1%
ReadCMS/ROOT-TreeBranch/Zlib-8     37.5s ± 1%
ReadCMS/ROOT-TreeReader/Zlib-8     26.1s ± 3%
ReadCMS/ROOT-TreeReaderMT/Zlib-8   25.6s ± 5%  (ROOT::EnableImplicitMT())

ReadScalar/GoHEP/None-8            737ms ± 3%
ReadScalar/GoHEP/LZ4-8             769ms ± 3%
ReadScalar/GoHEP/Zlib-8            1.33s ± 1%
ReadScalar/ROOT-TreeBranch/None-8  1.22s ± 3%
ReadScalar/ROOT-TreeBranch/LZ4-8   1.35s ± 3%
ReadScalar/ROOT-TreeBranch/Zlib-8  2.47s ± 1%
ReadScalar/ROOT-TreeReader/None-8  1.43s ± 5%
ReadScalar/ROOT-TreeReader/LZ4-8   1.57s ± 2%
ReadScalar/ROOT-TreeReader/Zlib-8  2.69s ± 1%

The release announcement has some more details about it, but here is the repo that has all the nitty-gritty details:

Happy to get any feedback about ways to improve these benchmarks.
(I tried to be fair, using the ROOT options I knew about, while also sticking to relatively common ROOT defaults, as a “normal” user would.)

and I also released root-split, a command that splits ROOT trees into multiple file+tree pairs:

finally, for the eye candy, a new HStack plotter has been implemented:

cheers,
-s


Hi @sbinet,
that’s impressive, congratulations!

I don’t think TTreeReaderMT is any different than TTreeReader? TTreeReader does not do multi-threading.

Also, @pcanal might correct me if I’m wrong, but the correct usage here would be to set the branch status of all branches to 0 and then selectively set the status of the branches you actually read to 1. In addition, TTree::SetBranchAddress, not TTree::Branch, is usually what is used to read branches.
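
In code, the pattern I have in mind looks roughly like this (a minimal sketch; file, tree and branch names are made up):

#include "TFile.h"
#include "TTree.h"

void read_with_branch_address() {
   // open the file and fetch the tree (hypothetical names)
   TFile *f = TFile::Open("data.root", "READ");
   TTree *t = nullptr;
   f->GetObject("Events", t);

   // disable everything, then re-enable only the branches actually read
   t->SetBranchStatus("*", 0);
   t->SetBranchStatus("px", 1);
   t->SetBranchStatus("py", 1);

   // attach local variables to the branches
   float px = 0.f, py = 0.f;
   t->SetBranchAddress("px", &px);
   t->SetBranchAddress("py", &py);

   for (Long64_t i = 0, n = t->GetEntries(); i < n; ++i) {
      t->GetEntry(i); // only the enabled branches are read
      // ... use px, py ...
   }
   f->Close();
}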

TTreeReader’s advantage is that it lazily loads entries, but in your benchmarks you don’t perform selections, so you always read all TTreeReaderValues anyway, which is the worst case scenario for TTreeReader (and possibly unrealistic for actual analyses).
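
For instance, a sketch of a TTreeReader loop with a selection (file, tree and branch names are made up); the second value is only read for entries that pass the cut:

#include "TFile.h"
#include "TTreeReader.h"
#include "TTreeReaderValue.h"

void read_with_reader() {
   TFile *f = TFile::Open("data.root", "READ");
   TTreeReader reader("Events", f);
   TTreeReaderValue<float> px(reader, "px");
   TTreeReaderValue<float> py(reader, "py");

   while (reader.Next()) {
      if (*px < 10.f)
         continue; // "py" is never dereferenced, so its data is not loaded for this entry
      // ... use *px and *py ...
   }
   f->Close();
}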

Cheers,
Enrico

P.S.

  • The ROOT-TreeBranch runtimes are similar to the ROOT-TreeReader ones, but TTreeBranch should be much faster. Besides what I commented above, it could be that you are measuring ROOT’s startup time together with the actual read-out runtime, which would offset both TTreeBranch and TTreeReader, bringing them closer together?
  • about startup time: ROOT will have a larger start-up time than groot, but that is probably not relevant for actual analysis tasks (it stops mattering once your program runs for longer than a second). So for the benchmarks that last ~1s it might be fairer to measure just the time spent reading, with a stopwatch around the relevant logic (see the sketch below).
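
For instance, a minimal sketch (not necessarily how the benchmark is actually structured; file, tree and branch names are made up) that starts the stopwatch only around the event loop:

#include "TFile.h"
#include "TTree.h"
#include "TStopwatch.h"
#include <cstdio>

void time_read_only() {
   // opening the file and the tree is not timed
   TFile *f = TFile::Open("data.root", "READ");
   TTree *t = nullptr;
   f->GetObject("Events", t);

   float px = 0.f;
   t->SetBranchAddress("px", &px);

   // time only the read-out itself
   TStopwatch sw;
   sw.Start();
   for (Long64_t i = 0, n = t->GetEntries(); i < n; ++i)
      t->GetEntry(i);
   sw.Stop();

   std::printf("read: real=%.3fs cpu=%.3fs\n", sw.RealTime(), sw.CpuTime());
   f->Close();
}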

thanks.

I don’t think TTreeReaderMT is any different than TTreeReader? TTreeReader does not do multi-threading.

I was under the impression that ROOT::EnableImplicitMT() would also enable parallel decompression.
but I’ll remove that until RDF has been added (eventually)

TTreeReader’s advantage is that it lazily loads entries, but in your benchmarks you don’t perform selections, so you always read all TTreeReaderValues anyway, which is the worst case scenario for TTreeReader (and possibly unrealistic for actual analyses).

yeah…
I did set up things (on the groot scaffolding side) to be able to run benchmarks that read only a subset of the branches.
haven’t gotten around to actually doing that, yet.

a similar testbench (with an actual subset of branches to perform an analysis) gives similar ballpark numbers:

in the end, yes, asking for fewer branches should reduce wall-clock time, but it all boils down to how fast you can actually read the branches you do ask for. (so I wanted to get that figure first.)

the plan is also to be able to benchmark against groot’s RNtuple implementation, eventually. (once the C++ side has stabilized a bit more.)

The ROOT-TreeBranch runtimes are similar to the ROOT-TreeReader ones, but TTreeBranch should be much faster. Besides what I commented above, it could be that you are measuring ROOT’s startup time together with the actual read-out runtime, which would offset both TTreeBranch and TTreeReader, bringing them closer together?

I’ll see whether I increase the number of events in these toy-data files or set up a “scout program” that doesn’t read any events (to evaluate the ROOT start-up overhead).
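
something like the following is what I have in mind for the “scout” (names made up): open the file and the tree but never call GetEntry, so its runtime is roughly the fixed ROOT overhead to subtract from the full benchmarks:

#include "TFile.h"
#include "TTree.h"
#include <cstdio>

int main() {
   TFile *f = TFile::Open("data.root", "READ");
   if (!f || f->IsZombie())
      return 1;
   TTree *t = nullptr;
   f->GetObject("Events", t);
   // no GetEntry calls: only start-up + file/tree opening are measured
   std::printf("entries: %lld\n", t ? t->GetEntries() : -1);
   f->Close();
   return 0;
}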

thanks again.

It’s not about reading fewer branches; the difference is that TTree::GetEntry loads all branches that are set up for reading upfront, while TTreeReader does it lazily, on a per-branch basis. So, with selections but equivalent logic between TTree and TTreeReader, TTreeReader ends up reading less data because it’s smarter under the hood.
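
To make that concrete, here is a sketch of the eager side (again with made-up names): with a cut inside the loop, GetEntry has already decompressed every enabled branch before the cut is evaluated, which is exactly what a TTreeReader loop with the same cut avoids:

#include "TFile.h"
#include "TTree.h"

void eager_read_with_selection() {
   TFile *f = TFile::Open("data.root", "READ");
   TTree *t = nullptr;
   f->GetObject("Events", t);

   t->SetBranchStatus("*", 0);
   t->SetBranchStatus("px", 1);
   t->SetBranchStatus("py", 1);

   float px = 0.f, py = 0.f;
   t->SetBranchAddress("px", &px);
   t->SetBranchAddress("py", &py);

   for (Long64_t i = 0, n = t->GetEntries(); i < n; ++i) {
      t->GetEntry(i); // px *and* py are read here, unconditionally
      if (px < 10.f)
         continue;    // too late: py was already decompressed
      // ... use px, py ...
   }
   f->Close();
}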

The weirdest thing, which points at a problem in the benchmarks, is:

ReadCMS/ROOT-TreeBranch/Zlib-8     37.5s ± 1%
ReadCMS/ROOT-TreeReader/Zlib-8     26.1s ± 3%

Raw TTree should be (much) faster than TTreeReader when always reading all branches.

well, sure. but what I was hinting at was that, like for CPU instructions, the fastest code is the code you never have to run: the cheapest branches are the ones you don’t read at all.

according to the logs:

the timings are rather stable (so it’s not because, e.g., TreeBranch is exercised first and warms up the cache for TreeReader)

FYI, here is what I get w/ SetBranchAddress:

name                            time/op
ReadCMS/GoHEP/Zlib-8            18.5s ± 1%
ReadCMS/ROOT-TreeBranch/Zlib-8  30.4s ± 2%
ReadCMS/ROOT-TreeReader/Zlib-8  25.2s ± 4%

The new version of the code with SetBranchAddress looks good to me! As per “odd result” (go-hep/groot-bench issue #2 on GitHub), TTreeReader’s lazy loading seems to be very advantageous on this workload.

I was under the impression that ROOT::EnableImplicitMT() would also enable parallel decompression.

Only if you read many branches at the same time, like with TTree::GetEntry. With TTreeReader’s lazy loading, branch contents are loaded on demand, synchronously.
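
For illustration, a sketch (with made-up names) of the case where implicit MT does kick in: with ROOT::EnableImplicitMT(), a TTree::GetEntry call that touches many branches can decompress them in parallel, while a TTreeReader value is only loaded, synchronously, when you dereference it:

#include "TFile.h"
#include "TTree.h"
#include "TROOT.h"

void imt_getentry() {
   ROOT::EnableImplicitMT(); // let ROOT use its thread pool

   TFile *f = TFile::Open("data.root", "READ");
   TTree *t = nullptr;
   f->GetObject("Events", t);
   // branch addresses omitted for brevity; all active branches are still read

   for (Long64_t i = 0, n = t->GetEntries(); i < n; ++i)
      t->GetEntry(i); // the branches of this entry may be decompressed in parallel
   f->Close();
}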

Cheers,
Enrico

the executive summary is thus:

name                                  time/op
ReadCMSScalar/GoHEP/Zlib-8            3.92s ± 2%  // only read scalar data
ReadCMSScalar/ROOT-TreeBranch/Zlib-8  7.98s ± 2%  // ditto
ReadCMSScalar/ROOT-TreeReader/Zlib-8  6.60s ± 2%  // ditto

name                                  time/op
ReadCMSAll/GoHEP/Zlib-8               18.4s ± 1%  // read all branches
ReadCMSAll/ROOT-TreeBranch/Zlib-8     30.4s ± 2%  // ditto
ReadCMSAll/ROOT-TreeReader/Zlib-8     [N/A]       // comparison meaningless (b/c of loading-on-demand)