Proposed Boost library: Histogram

hdembinski · November 14, 2017, 8:19pm

Hi everyone,

I have proposed a multi-dimensional histogram library for inclusion in Boost. If you are interested in shaping such a class outside of the ROOT framework, now is the time :).

github https://github.com/HDembinski/histogram
documentation https://htmlpreview.github.io/?https://raw.githubusercontent.com/HDembinski/histogram/html/doc/html/index.html

It is a C++11 header-only library that provides a safe, convenient, and fast multi-dimensional histogram for statistical analysis and visualisation. The library offers a safety guarantee: the counts in the histogram cannot overflow. There are many specialized axis types which define how input values are mapped to bins. For example, there is a special circular axis for angles.
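To illustrate what a circular axis does, here is a minimal Python sketch. It is not the library's code or API; the class name `CircularAxis` and its parameters are hypothetical. It assumes that values are folded into one period before the bin index is computed, so that angles differing by a full turn land in the same bin.

```python
import math


class CircularAxis:
    """Hypothetical circular axis: values wrap around with a fixed period."""

    def __init__(self, bins, start=0.0, period=2 * math.pi):
        self.bins = bins
        self.start = start
        self.period = period

    def index(self, x):
        # Fold x into a fraction of one period, measured from `start`.
        turns = ((x - self.start) / self.period) % 1.0
        # min() guards against rounding that would yield an index equal to `bins`.
        return min(int(turns * self.bins), self.bins - 1)


axis = CircularAxis(bins=4)
counts = [0] * axis.bins
# 0.1 and 0.1 + 2*pi are the same angle; -0.1 wraps into the last bin.
for angle in (0.1, 0.1 + 2 * math.pi, -0.1, math.pi):
    counts[axis.index(angle)] += 1

print(counts)
```

A regular axis would need underflow and overflow bins for values outside its range; with wrapping, every angle maps to a valid bin.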

The library is very customizable for the power user, but just works for the casual user. Meta-programming is used to provide a fast histogram implementation that can be used when the histogram configuration is known at compile-time. A dynamic implementation is also provided for the other case when the configuration is only known at run-time. The two implementations share a common interface, so it is easy to switch between them. Python bindings are included for the dynamic implementation. The Python interface supports Numpy arrays to greatly speed up the exchange of data between the Python and C++ side.

I tested the performance of the library in benchmarks against other implementations in the GNU Scientific Library and in ROOT. The benchmark results can be found in the documentation. The performance is very competitive.

Let me know what you guys think.

Axel · November 16, 2017, 9:06am

Thanks, Hans!

I have been following your development for a while. As you probably know, we’re implementing something fairly similar: https://root.cern.ch/doc/master/classROOT_1_1Experimental_1_1THist.html

This is not yet as stable as yours, but it should allow us to see the design differences - and lots of similarities, actually. I like many of your approaches; some differences are because of different (perceived) use cases / requirements, usability, integration targets (ROOT vs Boost) etc.

If you’re ever at CERN I’d love to meet you and go through the designs, to compare rationale. And of course we’ll benchmark ours against yours to make sure we know where we stand.

Keep up the good work!

Axel.

hdembinski · November 16, 2017, 10:41am

Dear Axel,

That is a pleasant surprise. Now I am glad that I posted this announcement here. I enjoyed reading your blog on the ROOT page. For example, I am happy to see that ROOT is moving toward more compatibility with the STL. I was at CERN for the last two weeks. Too bad I didn’t post this then; we could have easily met.

I always mention performance when I talk about the histogram library, because that is the simplest way to catch people’s attention. What I actually care deeply about, though, is simple, consistent interfaces that make software easy and safe to use, without cutting down on flexibility for the power user and without sacrificing performance. These benefits, however, cannot be demonstrated by a number; you actually need to use the library to realize them.

My library demonstrates that it is possible to internalize the management of the data type of bin counts in a way that is memory-efficient and still run-time efficient. If you don’t have to set the type of the bin count, then you cannot shoot yourself in the foot there, and you don’t need to learn about the maximum sizes of various integer types and the funny behavior of floating point numbers. The approach is even faster for multi-dimensional histograms. Despite run-time overheads, the gains in better utilizing the CPU caches outweigh the additional instruction costs.
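One way to picture internally managed bin counts is sketched below in Python. This is only an illustration under stated assumptions, not the library's implementation: the class `GrowingCounts` is hypothetical, and the posts do not spell out the mechanism. The sketch assumes that the counters start as the smallest unsigned integer type and that the whole buffer is widened just before a counter would overflow.

```python
from array import array

# Unsigned integer typecodes, from narrow to wide (assumption: 1, 2, 4, 8 bytes).
_TYPECODES = ("B", "H", "I", "Q")


class GrowingCounts:
    """Hypothetical counter storage that widens instead of overflowing."""

    def __init__(self, size):
        self._level = 0
        self._cells = array(_TYPECODES[0], [0] * size)

    def _limit(self):
        # A plain list of Python ints has no upper limit.
        if isinstance(self._cells, list):
            return None
        return 2 ** (8 * self._cells.itemsize) - 1

    def _grow(self):
        self._level += 1
        values = list(self._cells)
        if self._level < len(_TYPECODES):
            self._cells = array(_TYPECODES[self._level], values)
        else:
            self._cells = values  # fall back to arbitrary-precision ints

    def increment(self, i):
        # Widen the whole buffer while the next increment would not fit.
        while self._limit() is not None and self._cells[i] == self._limit():
            self._grow()
        self._cells[i] += 1

    def value(self, i):
        return self._cells[i]

    def storage_kind(self):
        return "list" if isinstance(self._cells, list) else self._cells.typecode


counts = GrowingCounts(size=2)
for _ in range(300):  # 300 does not fit into one byte (maximum 255)
    counts.increment(0)

print(counts.value(0), counts.storage_kind())
```

In this sketch a sparsely filled histogram stays at one byte per bin, which is the kind of memory saving that helps with CPU caches. The price is the range check on every increment.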

Best regards,
Hans

Axel · November 16, 2017, 11:26am

Hi Hans,

Right, I saw that overflow-prevention mechanism of yours, neat! Most histogramming here is done with floating point precision, which changes priorities quite a bit. E.g. we want to have the statistics configurable (just counting, or moments etc).

Will you “ever” be back at CERN?

I will try yours out, I promise! And then contact you privately in case I have questions.

Cheers, Axel.

hdembinski · November 16, 2017, 1:54pm

Since the library manages the data type internally, I have freedom to do optimizations. I use integers as bin counters as long as the user does not pass a weight to the fill method. In that unweighted case the variance is equal to the sum of counts, because sum w_i = sum w_i^2 for w_i = 1. There is no need to keep track of the variance separately, so a single integer per bin is sufficient.

When a weight is passed to the fill method for the first time, the data type of the counter is internally converted into a struct that holds two doubles, one for the sum of weights and one for the sum of weights squared. From then on, I keep score of the variance = sum w_i^2 separately.
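The switch from integer counters to weighted counters can be sketched in Python as follows. This is a simplified illustration, not the library's code: the class `WeightedCounts` and its method names are hypothetical, and it assumes that every existing integer count n is converted to the pair (n, n), since n unweighted fills contribute n to both sums.

```python
class WeightedCounts:
    """Hypothetical storage: integers first, (sum w, sum w^2) after a weight."""

    def __init__(self, size):
        self._counts = [0] * size  # plain integers until a weight is seen
        self._sums = None          # [sum_w, sum_w2] per bin after conversion

    def fill(self, i, weight=None):
        if weight is None and self._sums is None:
            self._counts[i] += 1
            return
        if self._sums is None:
            # First weighted fill: n unit weights give sum w = sum w^2 = n.
            self._sums = [[float(n), float(n)] for n in self._counts]
            self._counts = None
        w = 1.0 if weight is None else float(weight)
        self._sums[i][0] += w
        self._sums[i][1] += w * w

    def value(self, i):
        return self._counts[i] if self._sums is None else self._sums[i][0]

    def variance(self, i):
        # Before conversion the variance equals the count itself.
        return self._counts[i] if self._sums is None else self._sums[i][1]


h = WeightedCounts(size=2)
h.fill(0)
h.fill(0)
before = (h.value(0), h.variance(0))  # still integers

h.fill(1, weight=0.5)                 # triggers the conversion
h.fill(0)                             # unweighted fill after conversion
after = (h.value(0), h.variance(0), h.value(1), h.variance(1))

print(before, after)
```

The caller never picks a counter type; the storage decides on the first weighted fill, and unweighted fills afterwards are treated as weight 1.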

This approach always produces the first and second moment of the weight distribution. So far I have not seen a use case for higher moments in the wild.

I plan to remain a member of the LHCb collaboration for the near future, so I will come to CERN again, but probably not in the next months.