Idea / Feature proposal : Automatic choice of bin numbers in TH1

thucking · December 17, 2019, 3:51pm

Introduction

Consider the following scenario: you want to histogram a lot of data without
knowing how many values there are, in which interval they lie, or which distribution
they follow. At the moment ROOT offers the possibility of letting the axis range
be chosen automatically. It would be nice to have an automatic choice of the number of bins as well.

At the moment I’m filling the data into a histogram, then I derive some
properties from the histogram (e.g. the standard deviation) to calculate the
optimal number of bins with some formula. After that I create a new histogram
with the correct binning and fill the data again. I would be very happy if
this could be done by ROOT automatically.

Formulas / Implementation

At the moment I’m using the formula by David W. Scott [Biometrika, Vol. 66,
No. 3 (Dec., 1979), pp. 605-610].
According to Scott, the optimal bin width h_n for a data set with n values x_i is given
by:
h_n = 3.49 * s * n^(-1/3)
Here s is the estimate of the standard deviation of the data.
Dividing the data range by h_n, the number of bins N_B would therefore be:
N_B = (max(x_i) - min(x_i) ) * n^(1/3) / ( 3.49 * s )
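
As a minimal sketch of the two formulas above, the bin count could be computed in plain C++ as follows. The function name `ScottBinCount` is hypothetical and not part of ROOT. The sketch assumes that s is the sample standard deviation (n - 1 in the denominator), that N_B is rounded up to the next integer, and that a single bin is returned when the formula is undefined; none of these choices is fixed by the text above.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of ROOT: bin count from Scott's rule.
std::size_t ScottBinCount(const std::vector<double> &x)
{
   const std::size_t n = x.size();
   if (n < 2) return 1; // s is undefined for fewer than two values

   double mean = 0.0;
   for (double v : x) mean += v;
   mean /= static_cast<double>(n);

   // Sample standard deviation s (assumption: n - 1 in the denominator).
   double sumSq = 0.0;
   for (double v : x) sumSq += (v - mean) * (v - mean);
   const double s = std::sqrt(sumSq / static_cast<double>(n - 1));

   const auto mm = std::minmax_element(x.begin(), x.end());
   const double range = *mm.second - *mm.first;
   if (s <= 0.0 || range <= 0.0) return 1; // all values identical

   // h_n = 3.49 * s * n^(-1/3)
   const double h = 3.49 * s * std::pow(static_cast<double>(n), -1.0 / 3.0);

   // N_B = (max - min) / h_n, rounded up (the rounding is an assumption).
   return static_cast<std::size_t>(std::ceil(range / h));
}
```

For the 1000 values 0, 1, ..., 999 this gives s of about 288.8 and h_n of about 100.8, so the sketch returns 10 bins.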

There are also other ways to estimate the number of bins.
See the Wikipedia article here.

Of course there is no single optimal number of bins. Usually it will still be necessary
to adjust the number of bins by hand and to check that the binning is not hiding
some “features” of the data. But to get a first impression, and to avoid seeing a
barcode, it would be a nice feature.

Edit:
Sorry, I posted it accidentally before checking it again. I couldn’t figure out how to format the formulas nicely.

etejedor · December 17, 2019, 5:13pm

Hi @moneta can you provide the user with some feedback about this idea? Thanks!

moneta · December 17, 2019, 10:18pm

Hi,

Thank you for the suggestion. It is true that we could add this: for example, when nbin=0 is provided, the number of bins could be computed automatically from the current data and the number of entries.
However, this will work only when the automatic min/max is computed, i.e. when we temporarily store all the data points filled into the histogram.
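
A rough sketch of this idea, in plain C++ and without any ROOT classes, might look like the following. The class `BufferedHist`, its methods and the `BinRule` callback are all hypothetical names invented for illustration; this is not how ROOT's own buffering is implemented. The sketch assumes that every filled value is kept in memory until `Flush()` is called, and that a bin count of 0 means the count is taken from the supplied rule.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical class, not ROOT code: it only illustrates deferring the
// choice of range and bin count until all values are known.
class BufferedHist {
public:
   using BinRule = std::function<std::size_t(const std::vector<double> &)>;

   // nbins == 0 means "let the rule decide when the buffer is flushed".
   BufferedHist(std::size_t nbins, BinRule rule) : fNbins(nbins), fRule(std::move(rule)) {}

   void Fill(double x) { fBuffer.push_back(x); }

   // Fix min, max and bin count from the buffered values, then bin them.
   void Flush()
   {
      fCounts.clear();
      if (fBuffer.empty()) return;

      const auto mm = std::minmax_element(fBuffer.begin(), fBuffer.end());
      fMin = *mm.first;
      fMax = *mm.second;

      std::size_t nbins = fNbins;
      if (nbins == 0 && fRule) nbins = fRule(fBuffer);
      if (nbins == 0) nbins = 1; // fall back to a single bin

      fCounts.assign(nbins, 0);
      const double width = (fMax - fMin) / static_cast<double>(nbins);
      for (double x : fBuffer) {
         std::size_t i = width > 0.0 ? static_cast<std::size_t>((x - fMin) / width) : 0;
         if (i >= nbins) i = nbins - 1; // x == max belongs to the last bin
         ++fCounts[i];
      }
   }

   const std::vector<std::size_t> &Counts() const { return fCounts; }
   double Min() const { return fMin; }
   double Max() const { return fMax; }

private:
   std::size_t fNbins;
   BinRule fRule;
   std::vector<double> fBuffer;
   std::vector<std::size_t> fCounts;
   double fMin = 0.0;
   double fMax = 0.0;
};
```

Passing the rule as a callback keeps the sketch independent of any particular formula, so Scott's rule or another estimator could be plugged in. The cost is the one noted above: the data must be held in memory until the range and bin count are fixed.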

I think this is not a difficult addition. If you want to contribute by providing us with a pull request, that would be great; otherwise we can plan to add this for 6.22.

Best Regards

Lorenzo

thucking · December 18, 2019, 10:33am

Thanks a lot. I am not very experienced and have not yet made a pull request or changed any ROOT source code myself. If I find some time and can figure out a solution, I will provide a pull request. Otherwise I will wait until 6.22.