ROOT error calculation

rahmans · September 25, 2014, 4:11am

I have ROOT files containing the results of a simulation. If I have ten one-million-event files, I combine them in a TChain:

T[i][j][k] = new TChain("T"); nfile[i][j][k] = T[i][j][k]->Add(Form("%s/map_%s_%s_%s_%s/map_%s_%s_%s_%s_*.root", scratchDir, index, map[j], off[j][k], data[i]->GetGenName(), index, map[j], off[j][k], data[i]->GetGenName()));

If I have one ten-million-event file, I just do:

T[i][j][k] = new TChain("T"); nfile[i][j][k] = T[i][j][k]->Add(Form("%s/map_%s_%s_%s_%s/map_%s_%s_%s_%s.root", scratchDir, index, map[j], off[j][k], data[i]->GetGenName(), index, map[j], off[j][k], data[i]->GetGenName()));

After that, the code is identical.
My question is this: once I run my error calculations, the propagated error for the ten one-million-event files is roughly one third of the error calculated for the single ten-million-event file. I would expect them to be the same. Could anyone tell me how ROOT could be miscalculating the error in one of the cases?

Danilo · September 25, 2014, 5:08am

You don’t say anything about which error you are referring to, so it is hard to say.
If any quantity changes between processing ten files and processing one single file with exactly the same content as the ten smaller ones, something must be wrong in your code. ROOT is unlikely to be responsible.

Wile_E_Coyote · September 25, 2014, 8:37am

A factor of 3 in the errors looks to me quite like sqrt(10). You could try processing 4 files and see whether you get a factor of 2 = sqrt(4). That would point to some kind of “statistical errors”, which scale as sqrt(total_counts), and that would seem fine to me.

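{
  // Stats box: kurtosis, skewness, RMS and mean, each shown with its error.
  gStyle->SetOptStat(220002200);
  // Fit box: probability, chi2/ndf, parameter errors and parameter values.
  gStyle->SetOptFit(1111);
  // Two Gaussian samples that differ only in size (factor 9).
  TH1F *h1 = new TH1F("h1", "random gauss with 1 * 1000 entries", 100, -5, 5);
  h1->FillRandom("gaus", (1 * 1000));
  TH1F *h9 = new TH1F("h9", "random gauss with 9 * 1000 entries", 100, -5, 5);
  h9->FillRandom("gaus", (9 * 1000));
  TCanvas *c = new TCanvas("c", "c");
  c->Divide(1, 2);
  // Fit both; the errors on the fitted Mean and Sigma should differ by about sqrt(9) = 3.
  c->cd(1);
  h1->Fit("gaus");
  c->cd(2);
  h9->Fit("gaus");
  c->cd(0);
}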

rahmans · September 25, 2014, 1:47pm

Thanks for the help so far. The events are binned into a histogram to create a rate histogram. I then calculate the integral, and the error on the integral, over a certain range using the IntegralAndError function. The only difference, as I said, is at the beginning, where I read in one big file instead of ten smaller files. I have thought about the statistical-error explanation, so I am testing for that now. The simulation takes some time and the ROOT files take time to process. I will get back if I find something like that.

Wile_E_Coyote · September 25, 2014, 5:46pm

Well, there’s something I don’t understand.
If you use TH1::IntegralAndError and get a factor of 3 in the returned error value then in the simplest case I would expect a corresponding factor 3^2 (i.e. 9) in the returned integral value, too (at least in the first approximation).
Of course, if you set TH1::Sumw2 for this histogram and fill it with “weights” which are not 1 then it may be somehow different.
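
To make the argument above concrete, here is a minimal plain C++ sketch (no ROOT needed) of how an integral and its error relate. `SimpleHist` and `FilledHist` are hypothetical stand-ins, not ROOT classes; the sketch assumes the error on an integral is the square root of the summed squared weights, which is my understanding of what TH1::IntegralAndError returns once Sumw2 is active.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a histogram: per bin it keeps the sum of
// weights and the sum of squared weights (what Sumw2 tracks in ROOT).
struct SimpleHist {
    std::vector<double> sumw;
    std::vector<double> sumw2;

    explicit SimpleHist(std::size_t nbins) : sumw(nbins, 0.0), sumw2(nbins, 0.0) {}

    void Fill(std::size_t bin, double w = 1.0) {
        sumw.at(bin) += w;
        sumw2.at(bin) += w * w;
    }

    // Integral over bins [first, last] and its error, taken as
    // sqrt(sum of squared weights) over the same bins.
    std::pair<double, double> IntegralAndError(std::size_t first, std::size_t last) const {
        double integral = 0.0;
        double err2 = 0.0;
        for (std::size_t b = first; b <= last && b < sumw.size(); ++b) {
            integral += sumw[b];
            err2 += sumw2[b];
        }
        return {integral, std::sqrt(err2)};
    }
};

// Helper for the examples: one bin filled n times with the same weight.
SimpleHist FilledHist(std::size_t n, double w) {
    SimpleHist h(1);
    for (std::size_t i = 0; i < n; ++i) h.Fill(0, w);
    return h;
}
```

With unit weights, `FilledHist(9000, 1.0)` has 9 times the integral and 3 times the error of `FilledHist(1000, 1.0)`. With weights of 1.0 / 9, the 9000-entry histogram has the same integral as the unweighted 1000-entry one but an error three times smaller, which is the "somehow different" weighted case.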

BTW. Can it be that you should set event “weights” which depend on the total number of events in your simulation file? Maybe your “T” tree keeps these “weights”, too (e.g. in a separate branch, as they can change from event to event). I can also imagine that your simulation software “normalizes” events in each file to some “magic value” (e.g. something like the total number of “Protons On Target”). Then, if you take 10 “small” simulations you get events which correspond to 10 * this “magic value”, but when you take 1 “big” simulation you just get events which correspond to 1 * this “magic value” (well, in the simplest case, after filling the histogram you would need to scale it by 1 / total_number_of_used_files, so something like “histo->Scale(1.0 / T[i][j][k]->GetNtrees());”).
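
The normalization scenario described above can be checked with simple arithmetic. This is only a sketch under stated assumptions: every event in a file is assumed to carry the same weight, magic / eventsPerFile, and scaling is assumed to multiply both the integral and its error by the same constant. The names `Estimate`, `WeightedCounts`, `Scale` and `ChainOfFiles` are hypothetical and not part of ROOT.

```cpp
#include <cmath>
#include <cstddef>

// Integral and statistical error of a set of equally weighted events.
struct Estimate {
    double integral;
    double error;
};

// n events, each with weight w: integral = n * w, error = sqrt(n) * w.
Estimate WeightedCounts(std::size_t nEvents, double weight) {
    const double n = static_cast<double>(nEvents);
    return {n * weight, std::sqrt(n) * weight};
}

// Multiplying by a constant scales the integral and the error alike.
Estimate Scale(const Estimate& e, double c) {
    return {e.integral * c, e.error * std::fabs(c)};
}

// Assumed scenario: each file is normalized to the same "magic value",
// so every event carries the weight magic / eventsPerFile.
Estimate ChainOfFiles(std::size_t nFiles, std::size_t eventsPerFile, double magic) {
    const double w = magic / static_cast<double>(eventsPerFile);
    return WeightedCounts(nFiles * eventsPerFile, w);
}
```

Under these assumptions, `ChainOfFiles(10, 1000000, 1.0)` gives an integral of 10 times the magic value, while `ChainOfFiles(1, 10000000, 1.0)` gives 1 times the magic value. After `Scale(ten, 1.0 / 10)`, both the integral and the error of the ten-file case agree with the single-file case.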

rahmans · September 29, 2014, 7:09pm

Thanks for the help. The problem was with the server that I was submitting the jobs to. There was a default max memory limit which I thought was sufficient but I had to readjust that. So with the huge 10 million event file it was omitting data. I redid the simulations and now I can reproduce the results as with 10 one million event files. My analysis code was good.