Features I added to ROOT

bozzochet · October 1, 2010, 10:58am

Dear ROOTers and dear ROOT developer,

I would like to show here two features I needed, and so I wrote and added, to the analysis tool of my experiment and so, pratically, to ROOT…

Let’s start, briefly, explaining how my analysis tools works and so understanding what I needed for it.
In this analysis tool the user create a set of histograms, a set of cuts (functions to be used to select events) and a datacard with the sequence of cuts (with parameter for each cut) to be performed. The programs gives back a file with some service histos (history of events passing cuts, etc…) a directory for each cut, with inside all the histograms (from the set defined by user), as sequence of cuts, as first cut (single cut), and as last cut.

But now come the problems… I the user decides to have N histograms (and a part of them are, for example, TH2…) and decides to have M cuts, and then decides to fill the histograms for all the cuts as first, last and standard (in the cut sequence…) cut, the tool has to book NM3 histograms and there’re mainly 2 problems:

is very simple, in this way, to use 2, 3 or even 10 GB of RAM when M and N are increasing!!!
the N histograms are hold in a TObjArray (or in 3*M TObjArray and these TObjArray, pratically, in another TObjArray) and retrieved (by the user and internally in the tool…) by name. When N is quite large the FindObject() of TObjArray becames very slow…

Now we can discuss if this is the best solution or not for the analysis and bla bla bla, but however this tool was designed in this way…

So I wrote, in some hours of work, these two features that I didn’t find on ROOT:

Disk resident hisotgrams to not to consume RAM
Binary search in a my personal, customized and simplified, version of TObjArray
When you book an histograms an hidden file with a tree with 2 or 3 branches (x and w, or x,y and w) is created without really booking (i.e. allocating memory) and during the histogram filling the tree is filled. At the end of the filling process (a Terminate() method is needed) histogram by histogram is booked really, filled and wrote using RAM only for 1 histogram per time.
This way of managing histograms is, clearly, slower than the standard one but in my case, and from my test is only a factor 2 slower…
I create my version of TObjArray that inside has a map with, as key, the hash of the TObject name (I used an hash algorythm I found designed for web searches -> name very similar between themselves, as, probably, in the case of histogram names) and with the TObject pointer.
In the map the TObject are inserted sorted (by the map key) and you have, from the map, the binary search feature… With 10 objects or less the search is, clearly, not faster than the standard TObjArray way (looping all over the array until the name you requested is found) but with, for example, 100 objects or more, in my simple implementation (and I’m not an computer science engineer!), I can gain a factor 3,4 or even more in the total processing time…

What do you think about it?!

Thanks for the reading,
Matteo

brun · October 1, 2010, 1:43pm

Hi Matteo,

Just a quick reaction. Did you investigate the THnSparse class? This is a way to gain a lot of space in particular if you have many TH2 or TH3s.

Rene

bozzochet · October 1, 2010, 2:37pm

No, I don’t know it… Now I try to study!

Thanks for the very quick reply!

Cheers,
Matteo

bozzochet · October 1, 2010, 4:00pm

I look at the THnSparse…
The idea is similar to the one I had (to not to waste the room or the empty bins…) but THnSparse, that clearly is obviously faster than mine approach, however needs the room on the RAM to store all the “previous” entry.

My previous idea, infact, was to use a normal array to store the pairs/triple of entries. But I realized very soon that this “wins” only if the number of entries is less than the number of bins in the THn… So, then I tried to design a way to reduce the array dimension using the weight: if I have an entry with the x (or x-y), and I have to fill with another entry with the same x (for same I mean “in the same bin”) I calculate the correct weight to “fill” the “sum” of the two entries, and removing the previous entry. This is not a fast work, it depends if you are using Sumw2 or not, and bla bla…

So finally I decided to move to a type of array that is not affected by room space, so an array on disk: a TTree… My BHist (this is the name I gave to mine implementation…), infact, uses this room on disk but on RAM only uses the room to instanciate the TTree object…

To increase the performances of the tools and to make it quite reliable I create this BHist with this feature switchable or not (and in my analysis tools is automatical after a firts dry run to calculate the dimension of booked histograms…): when the feature is off BHist is only a wrapper to standard TH1, TH2 and TProfile and some methods (for example the Terminate()) are simply doing nothing, when, instead, the features is on, the “game” with TTrees is done and these methods do something (Terminate() reads back the tree, books the histogram, fills it and then erase the hidden file with the tree).

In addition the THnSparse class requires only an equally-spaced binning when, instead, in my research field (cosmic rays) a log-binning is very often preferable…

Thanks again for the suggestion,
Matteo

brun · October 1, 2010, 4:53pm

Hi Matteo,

Thanks for this additional info. Now I understand better what you have implemented and effectively it could be an interesting class that one could support. Is your class well documented and tested?
Do you provide a BHist::Merge function such that the class can be used in parallel systems like PROOF?

Rene

bozzochet · October 1, 2010, 5:37pm

Is tested by me and by some guys working with me…
Is not so well documented but is made by only ~ 400 lines of code so for me would be not so hard work to document better it…
I didn’t implement all the functionalities of the standard TH1, TH2 and TProfile but only Sumw2, LineColor, MarkerColor, the name for the axis, and actually the real name would have been BHistF, since is only a “wrapper” for TH1F, TH2F and TProfile (but this can be extended easily)… So, no, I didn’t implement the Merge method but since just now I’m trying to make my analysis tools (that IS a TSelector daughter…) with TProofLite (and then I hope TProof…), but I have still problems withthe directory managing, probably I HAVE to implement it soon…
For other methods (for example in my tool I use Scale(), Add() and something else similar…) that are used only when the filling process has finished (so after the Terminate() that besides to book the real histogram also write it into the “output” TDirectory…) I implemented two stupid methods “TH1* BHist::GetHist()” and void “BHist::SaveHist()”: the user retrieve from “output” TDirectory the real THnF with GetHist(), manipulate it as wanted, and then “fix” again in on the “output” TDirectory with SaveHist() (this obviously uses the kOverWrite flag…)
The “sandbox” whit inside the file where the tree is created, for now (but I think is the best solution…), is managed from “above”: the class managing my custom ObjArray (the other “point” of my original post…) with all the histograms of the set of histograms (i.e. of the same cut…) creates the hidden file (TFile*) with its name + the pid of the process + date and pass it in the BHist constructor (so BHist can create the TTree inside it…) and is its responsability to delete this file at the end…

brun · October 1, 2010, 7:25pm

Once your code is mature and documented, post it if you like and we will investigate if it can be a more general tool.

Rene

bozzochet · October 1, 2010, 10:58pm

Ok! It will be a pleasure for me to do this!

bozzochet · December 12, 2010, 4:30pm

Hi,
after more than two months (my experiment is in the final commissioning and close to the data taking so I’m a little bit busy…) I’m able to ‘release’ my class.
It’s not in a very complete and final implementation but almost all the main functionalities are implemented, is working, was tested and I can define it as ‘stable’…

Here:
dl.dropbox.com/u/3993877/BHist.tar.gz

you can find a tar archive with inside the class (BHist.C and Bhist.h), the linkdef for the dictionary and a valid Makefile.

In addition you can find some ROOT macros I’ve prepared to test it:
Basic.C : this shows the basic functionalities of BHist’s
Speedness.C : this is a sort of benchmark test for the speedness and for the usage of memory comparing BHist and TH1
Add.C ans AddSumw2.C : this shows the use of the Add() and Merge() methods.

Additional work has to be done to improve the class but I prefer to do this only if this class can be usefull in the future and for other people (otherwise, for me, I will implement something else when I will need it…).

Let me know what do you think about my class!

Cheers,
Matteo