GetCorrelationFactor() and GetStdDev() for single bin histograms are nonzero

Hello all,

I am working on a data analysis project and I was looking at a correlation matrix between many different variables. Some variables are completely flat, filled with only a single value, and I notice they give funny correlations. The 2D histogram between flat variables is only a single bin, and the correlation factor returned by such histograms fluctuates wildly.

I found that the GetCorrelationFactor method relies on the GetStdDev method, which I think should return zero for a 1 bin histogram, but does not. So I wrote a script to investigate this further, and I am confused about what I have found. Maybe there is an easy answer, but if not I have attached my file and working script.

The script is below and for convenience the comment output, as well.

Any insight is appreciated! It would be nice if the correlations would either return 1 or 0, and not fluctuate.

-Julian

Processing testScript.C…
Correlation factor returns 0.08. For other 1-bin histograms, this may return >1.0:
CF = 0.089672
GetStdDev returns nonzero. This also fluctuates with different 1 bin histograms:
GetStdDev(1) = 0.000643407
GetStdDev(2) = 0.000306698 <–
Use GetStats() function to re-calculate GetStdDev:
stats1[0] (sumw) = 4032
stats1[1] (sumw2) = 4032
stats1[2] (sumwx) = 7.93209e+06
stats1[3] (sumwx2)= 1.56047e+10
stats1[4] (sumwy) = 6.2331e+06
stats1[5] (sumwy2) = 9.63578e+09
stats1[6] (sumwxy) = 1.22623e+10
mean_x = 1967.29
mean_y = 1545.91
stats[4] behaves as it should, when divided by sumw, it returns mean y:
stats[4]/stats[0] = 1545.91
This should return zero, but doesn’t!:
stats[5]/stats[0] - mean_ymean_y = -9.40636e-08
Successfully reproduced GetStdDev(), using GetStats() and GetMean():
sqrt(abs(stats[5]/stats[0] - mean_y
mean_y)) = 0.000306698 <–
Why doesn’t stats[5] work? Recalculate with bin loop (overkill for single bin!)
I realize that GetStats uses GetBinCenter, not GetMean, and they differ.
binx = 21
biny = 20
x = 1967.2787499999999454
y = 1545.9212499999998727
w = 4032
MeanX = 1967.2850341796875
MeanY = 1545.9066162109375
Value of mystats[5] differs from GetStats() value:
mystats[5] = 9635965965.1646976471
My version of GetStdDev returns zero: …
sqrt(abs(stats[5]/stats[0] - myymyy)) = 0
Calculated from GetStats() and the mean y value, this matches GetStdDev()
sqrt(abs(stats[5]/stats[0] - meany
meany)) = 0.00030669786441408965899

testScript.C (4.01 KB)
testfile.root (3.94 KB)

Hi,

In the histogram the statistics is computed at filling time using the un-binned entries, so what you are getting is expected.
If you want the histogram statistics (mean, std-dev, etc…) using only the bin information, you can obtain it by calling before TH1::ResetStats().
Keep in mind that then the statistics information obtained from the full entries will be lost after calling TH1::ResetStats

Best Regards

Lorenzo

Hi Lorenzo,

That is indeed interesting to know, but both branches used to generate the 2D histogram were filled with only a single value each, so it seems it should not be a loss of information, and the standard deviation should return zero.

x = 1967.2850341796875
y = 1545.9066162109375

This does help to explain why my recalculation of GetStats using bin centers does not agree, though. If I instead use the real value (the mean), the answers are much closer:

From GetStats():
stats[0] (sumw) = 4032
stats[1] (sumw2) = 4032
stats[2] (sumwx) = 7932093.2578125
stats[3] (sumwx2)= 15604688355.813802719
stats[4] (sumwy) = 6233095.4765625
stats[5] (sumwy2) = 9635783536.6920566559
stats[6] (sumwxy) = 12262275447.654586792

From looping over bins:
mystats[0] (sumw) = 4032
mystats[1] (sumw2) = 4032
mystats[2] (sumwx) = 7932093.2578125
mystats[3] (sumwx2)= 15604688355.812133789
mystats[4] (sumwy) = 6233095.4765625
mystats[5] (sumwy2) = 9635783536.6924362183
mystats[6] (sumwxy) = 12262275447.654514313

Ah, I like to forget that problems like this exist, but it seems the difference is in the number of multiplications performed. My loops over bins is just doing 4032meanmean, where the GetStats method is performing mean*mean 4032 times.

Double_t value = corr->GetMean(1);
Double_t weight = 4032.0;
Double_t sum = 0;
for(int c = 0; c < 4032; c++){
sum += valuevalue;
}
cout << " sum = " << sum << ", " << weight
value*value << endl;

Output:
sum = 15604688355.813802719, 15604688355.812133789

Which matches the output above.
I suppose I will settle for something like if(CorrelationFactor < 1e-5) CorrelationFactor = 0.

edit Actually, that does not solve my problem, since many of these return large correlation factors. Is there any other way? Recalculate statistics from the tree myself and put a threshold on standard deviation?

-Julian