What is the correct way to save a histogram which takes a lot of data from many input files?

Hello,
This must be a topic that other people have encountered, and this is my first post, so I’m putting it here in the hope that there’s a dead-easy fix out there.

I have a set of over 1000 input files.

Each file has a rather complicated TTree and they all have the same structure of TTree. This is data from an ATLAS Monte Carlo simulation.

I used “MakeClass” with the input files to generate the shell of code, and I use the TChain method to set up the file input, using wildcards to pick up every single one of the 1000+ files to be read in.

The code I have compiles and runs fine. It produces no errors and my output file contains filled histograms of every single quantity that I chose to store.

However, I have noticed a very strange behaviour which I think has to do with the way ROOT does memory management vs. writing what is in memory to an output file.

When I run the code on one input file and I ask to histogram the number of B hadrons in the leading jet in each event I get a histogram that has about 10k entries.
Here is the code where that file is identified:

testOutDev::testOutDev(TTree *tree) : fChain(0)
{
// if parameter tree is not specified (or zero), connect the file
// used to generate this class and read the Tree.
   if (tree == 0) {
     TChain *f = new TChain("bTag_AntiKt4EMTopoJets");
     //
     // This will add the 12 files I uploaded as a test. 
     f->Add("/data/atlas/users/huffman/MCbTagwHITS/user.thuffman.bTagHitsTTree_Akt4EMTo/user.thuffman.18558259.Akt4EMTo._000335.root");
     tree = f;

   }
   Init(tree);
}

Then I put in a wildcard in the filename so that it will run over 10 files.
So the filename is now user.thuffman.18558259.Akt4EMTo._00033*.root
and it works exactly as I would expect. My histogram of the number of B hadrons in every leading jet now has a bit over 100k entries, since it has run over 10 times more files and all the files are approximately the same size.

So next I add another line to the TChain which would include another 10 files, so that I am running over 20 files total. Again all are the same size. Here’s the code that adds the next 10 files.

testOutDev::testOutDev(TTree *tree) : fChain(0) 
{
// if parameter tree is not specified (or zero), connect the file
// used to generate this class and read the Tree.
   if (tree == 0) {
     TChain *f = new TChain("bTag_AntiKt4EMTopoJets");
     //
     // This will add the 12 files I uploaded as a test. 
     f->Add("/data/atlas/users/huffman/MCbTagwHITS/user.thuffman.bTagHitsTTree_Akt4EMTo/user.thuffman.18558259.Akt4EMTo._00033*.root");
     f->Add("/data/atlas/users/huffman/MCbTagwHITS/user.thuffman.bTagHitsTTree_Akt4EMTo/user.thuffman.18558259.Akt4EMTo._00053*.root");
     tree = f;

   }
   Init(tree);
}

BUT now I only get 50k events when I plot that same histogram!
I would have expected to get something like 200k events!

I believe something strange is happening about when I choose to “write” the histogram.
I define the output *.root file in testOutDev.h where it is referred to as ‘fout’.

class testOutDev {
public :
   TTree          *fChain;   //!pointer to the analyzed TTree or TChain
   Int_t           fCurrent; //!current Tree number in a TChain
   //
   // Open a file when you create a new bTagNTkJets1n2 object
   // <BTH> I think you need to change directories to the file directory
   // when you use the "Loop" method by putting in the line "fout->cd();"
   TFile        *fout = new TFile("moneyPlots/testBintOTT001.root","RECREATE");

// Fixed size dimensions of array or collections stored in the TTree if any.

   // Declaration of leaf types
   Int_t           runnb;

Then right at the top of the “Loop()” definition in testOutDev.C I make sure I change directory so that the default directory is ‘fout’ using:
testOutDev::fout->cd();

Here is also where I define the histogram in question which I call ‘hist1nb’

I do not actually call hist1nb->Write(); until ALL of the files have been looped over.
I did this because I found that, if I put the “Write” before my instance of testOutDev completely finishes the Loop() method, I just get multiple copies (cycle numbers) of that histogram…one for every single file…

but I do not WANT that! I want only ONE histogram that contains ALL of the events…not 10 or 20 histograms with the same name that have one file’s worth of events in them.

Is there a way you can tell ROOT that it is time to move what is in memory into the output file and clear the memory such that it can accept more input?

Oh and just so you know, I did read Saving Histograms to Disk in the ROOT manual and it isn’t clear to me what I need to do in order to get my histograms properly filled with only one cycle number.

Actually, indeed, does cycle number even matter? If I have 1000+ cycle numbers of my ‘hist1nb’ histogram and I open the file that has them in ROOT interactively and type:
root> hist1nb->Draw();
will it actually give me a plot that is the sum of ALL those cycles? Or will I only get one of them? (I’m doing a test right now to see if that works)

Cheers and thanks!

Have you run the “MakeClass” on the TChain (mandatory in your case) or on a single TTree (will create a misbehaving class in your case)?

In order to do it correctly for a TChain, in an interactive ROOT session, try something like this:

TChain *t = new TChain("TTreeName");
t->AddFile("SomeFileName.root"); // (at least) one file (is sufficient)
t->LoadTree(0); // just the very first entry
t->MakeClass("NewClassName");
delete t; // cleanup

In principle, the “NewClassName::Loop()” method could look like this:

void NewClassName::Loop()
{
  if (fChain == 0) return; // just a precaution

  TFile *fout = TFile::Open("output_file.root", "recreate");
  if (fout) fout->cd(); // all new histograms will be connected to "fout"
  // ... create / book / initialize all histograms here ...

  Long64_t nentries = fChain->GetEntriesFast();

  Long64_t nbytes = 0, nb = 0;
  for (Long64_t jentry = 0; jentry < nentries; jentry++) {
    Long64_t ientry = LoadTree(jentry);
    if (ientry < 0) break;
    nb = fChain->GetEntry(jentry); nbytes += nb;
    // if (Cut(ientry) < 0) continue;
    // ... fill all histograms with the current "jentry" data here ...
  }

  if (fout) fout->Write(); // save all histograms
  delete fout; // automatically deletes all histograms, too
  fout = 0; // just a precaution
}
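
Writing once, after the loop, is also what avoids the pile of cycle numbers. Each Write() of an object with the same name adds a new key (hist1nb;1, hist1nb;2, ...), and looking the object up by name alone gives back only the highest cycle, never a sum of them. Here is a toy model of that bookkeeping in plain C++. ToyDirectory and WritePerFile are hypothetical names invented for this sketch, not ROOT classes, and the "histogram" is reduced to a running entry count that is assumed not to be reset between files:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Toy stand-in for an output file (hypothetical, NOT ROOT's TDirectory).
// Every Write() of the same name stores a new snapshot under the next cycle.
class ToyDirectory {
public:
    // Store a snapshot of the entry count; returns the new cycle number.
    int Write(const std::string& name, long entries) {
        const int cycle = ++lastCycle_[name];
        keys_[std::make_pair(name, cycle)] = entries;
        return cycle;
    }

    // Lookup by name only: returns the highest cycle's snapshot, or -1 if absent.
    long Get(const std::string& name) const {
        const auto it = lastCycle_.find(name);
        if (it == lastCycle_.end()) return -1;
        return keys_.at(std::make_pair(name, it->second));
    }

    // How many cycles exist for this name (0 if it was never written).
    int NumberOfCycles(const std::string& name) const {
        const auto it = lastCycle_.find(name);
        return it == lastCycle_.end() ? 0 : it->second;
    }

private:
    std::map<std::string, int> lastCycle_;
    std::map<std::pair<std::string, int>, long> keys_;
};

// Fill over nFiles files and write after each file (one cycle per file).
// Assumption: the in-memory histogram is never reset, so each snapshot is cumulative.
long WritePerFile(ToyDirectory& dir, int nFiles, long entriesPerFile) {
    long hist = 0;  // running entry count of the in-memory histogram
    for (int f = 0; f < nFiles; ++f) {
        hist += entriesPerFile;
        dir.Write("hist1nb", hist);
    }
    return dir.Get("hist1nb");  // only the highest cycle comes back
}
```

In this model, writing after each of 10 files leaves 10 cycles on disk and a lookup by name returns just the last one. A single Write() at the end leaves exactly one cycle, which is what the Loop() above does.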


I think I might see my problem??
In this code example, where would you have booked (initialized) the histograms?
Would this have been done in the *.h file? And if so, in which method within the *.h file?

I’ve been booking my histograms inside the Loop() method and just prior to the line

if (fChain == 0) return;

Oh and also, to answer your first question. (this is in ROOT version 6.14 by the way)
I really don’t know how to answer it.
All I do is attach one of the files to an interactive ROOT session
So at the linux command line:
% root <filename.root>

Then when I get the ROOT prompt I just typed in:

root> _file0->ls()

root> TTreeName->MakeClass("NewClassName");

And It makes NewClassName.h and NewClassName.C for me.
I have to put in

#include <TSystem.h>

in NewClassName.h to get it to work…but then the ‘Loop()’ framework and all the variable names that were in the TTree in the original file are available automatically.

I do not know whether this procedure is the same as “…run MakeClass on the TChain…” or not.

Thanks! I will give this a go.

but where should I create the histograms?
Is it correct to do it in the Loop() method right at the top?

Or should I initialize (or create) them somewhere else?

Ah…sorry, missed that earlier. Thanks very much for your help! I’m trying it all out now.

I wanted to add something I discovered using your method above.
Have you ever looked at the value you get for nentries once one has run the line:

Long64_t nentries = fChain->GetEntriesFast();

??
When I look at that number it is huge, and unrelated to the input Chain size.
I notice that GetEntriesFast() is not actually implemented in TChain but it is instead inherited from TTree. When using a TChain this method doesn’t seem to return the correct number.
However by simply using:

Long64_t nentries = fChain->GetEntries();

instead it now fills nentries with the correct integer which is the total number of entries in the whole TChain.

As I said, this is version 6.14 of ROOT, but if GetEntriesFast() isn’t implemented in TChain in the latest version of ROOT maybe it should be considered? Otherwise the Loop() method is terminating upon an error that shows up in ientry rather than terminating normally.

This aside is unrelated to my original problem.
I am still having the original trouble here but I now think it has to do with a small number of corrupt input files. Still investigating.

Calling “fChain->GetEntriesFast()” (which returns TTree::kMaxEntries for a TChain, so the huge value is expected) and then breaking out of the loop when “LoadTree(jentry) < 0” is exactly the logic you should use.
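
To make that loop logic concrete, here is a small sketch in plain C++ that does not use ROOT at all. ToyChain, kBigNumber and CountProcessed are hypothetical names made up for illustration. The sketch also assumes that a file which cannot be read makes LoadTree return a negative value, which would end the loop early and leave the histogram with fewer entries than expected:

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Hypothetical stand-in for a chain of files (NOT ROOT's TChain).
struct ToyChain {
    // Number of entries in each file; -1 marks a file that cannot be read.
    std::vector<long long> entriesPerFile;

    // Sentinel playing the role of the "huge" value seen from GetEntriesFast().
    static constexpr long long kBigNumber = std::numeric_limits<long long>::max();

    long long GetEntriesFast() const { return kBigNumber; }

    // Map a global entry number to a local entry number inside its file.
    // Returns a negative value past the last entry or at an unreadable file.
    long long LoadTree(long long entry) const {
        long long offset = 0;
        for (long long n : entriesPerFile) {
            if (n < 0) return -1;                        // unreadable file (assumed behaviour)
            if (entry < offset + n) return entry - offset;
            offset += n;
        }
        return -1;                                       // past the end of the chain
    }
};

// Same shape as the generated Loop(): the upper bound is only a sentinel,
// and the negative LoadTree result is what actually terminates the loop.
long long CountProcessed(const ToyChain& chain) {
    long long processed = 0;
    const long long nentries = chain.GetEntriesFast();
    for (long long jentry = 0; jentry < nentries; ++jentry) {
        if (chain.LoadTree(jentry) < 0) break;
        ++processed;                                     // histograms would be filled here
    }
    return processed;
}
```

With three readable files of 100 entries each, CountProcessed visits all 300 entries even though nentries is enormous. If the second file is marked unreadable, the loop stops after the first 100, so comparing the processed count against GetEntries() is a cheap way to spot a chain that ended early.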