Hello all,
I am working on a data analysis project and I was looking at a correlation matrix between many different variables. Some variables are completely flat, filled with only a single value, and I notice they give funny correlations. The 2D histogram between flat variables is only a single bin, and the correlation factor returned by such histograms fluctuates wildly.
I found that the GetCorrelationFactor method relies on the GetStdDev method, which I think should return zero for a 1 bin histogram, but does not. So I wrote a script to investigate this further, and I am confused about what I have found. Maybe there is an easy answer, but if not I have attached my file and working script.
The script is below and for convenience the comment output, as well.
Any insight is appreciated! It would be nice if the correlations would either return 1 or 0, and not fluctuate.
-Julian
Processing testScript.C…
Correlation factor returns 0.08. For other 1-bin histograms, this may return >1.0:
CF = 0.089672
GetStdDev returns nonzero. This also fluctuates with different 1 bin histograms:
GetStdDev(1) = 0.000643407
GetStdDev(2) = 0.000306698 <–
Use GetStats() function to re-calculate GetStdDev:
stats1[0] (sumw) = 4032
stats1[1] (sumw2) = 4032
stats1[2] (sumwx) = 7.93209e+06
stats1[3] (sumwx2)= 1.56047e+10
stats1[4] (sumwy) = 6.2331e+06
stats1[5] (sumwy2) = 9.63578e+09
stats1[6] (sumwxy) = 1.22623e+10
mean_x = 1967.29
mean_y = 1545.91
stats[4] behaves as it should, when divided by sumw, it returns mean y:
stats[4]/stats[0] = 1545.91
This should return zero, but doesn’t!:
stats[5]/stats[0] - mean_ymean_y = -9.40636e-08
Successfully reproduced GetStdDev(), using GetStats() and GetMean():
sqrt(abs(stats[5]/stats[0] - mean_ymean_y)) = 0.000306698 <–
Why doesn’t stats[5] work? Recalculate with bin loop (overkill for single bin!)
I realize that GetStats uses GetBinCenter, not GetMean, and they differ.
binx = 21
biny = 20
x = 1967.2787499999999454
y = 1545.9212499999998727
w = 4032
MeanX = 1967.2850341796875
MeanY = 1545.9066162109375
Value of mystats[5] differs from GetStats() value:
mystats[5] = 9635965965.1646976471
My version of GetStdDev returns zero: …
sqrt(abs(stats[5]/stats[0] - myymyy)) = 0
Calculated from GetStats() and the mean y value, this matches GetStdDev()
sqrt(abs(stats[5]/stats[0] - meanymeany)) = 0.00030669786441408965899
testScript.C (4.01 KB)
testfile.root (3.94 KB)