Dear ROOTers and dear ROOT developer,
I would like to show here two features I needed, and so I wrote and added, to the analysis tool of my experiment and so, pratically, to ROOT…
Let’s start, briefly, explaining how my analysis tools works and so understanding what I needed for it.
In this analysis tool the user create a set of histograms, a set of cuts (functions to be used to select events) and a datacard with the sequence of cuts (with parameter for each cut) to be performed. The programs gives back a file with some service histos (history of events passing cuts, etc…) a directory for each cut, with inside all the histograms (from the set defined by user), as sequence of cuts, as first cut (single cut), and as last cut.
But now come the problems… I the user decides to have N histograms (and a part of them are, for example, TH2…) and decides to have M cuts, and then decides to fill the histograms for all the cuts as first, last and standard (in the cut sequence…) cut, the tool has to book NM3 histograms and there’re mainly 2 problems:
- is very simple, in this way, to use 2, 3 or even 10 GB of RAM when M and N are increasing!!!
- the N histograms are hold in a TObjArray (or in 3*M TObjArray and these TObjArray, pratically, in another TObjArray) and retrieved (by the user and internally in the tool…) by name. When N is quite large the FindObject() of TObjArray becames very slow…
Now we can discuss if this is the best solution or not for the analysis and bla bla bla, but however this tool was designed in this way…
So I wrote, in some hours of work, these two features that I didn’t find on ROOT:
Disk resident hisotgrams to not to consume RAM
Binary search in a my personal, customized and simplified, version of TObjArray
When you book an histograms an hidden file with a tree with 2 or 3 branches (x and w, or x,y and w) is created without really booking (i.e. allocating memory) and during the histogram filling the tree is filled. At the end of the filling process (a Terminate() method is needed) histogram by histogram is booked really, filled and wrote using RAM only for 1 histogram per time.
This way of managing histograms is, clearly, slower than the standard one but in my case, and from my test is only a factor 2 slower…
I create my version of TObjArray that inside has a map with, as key, the hash of the TObject name (I used an hash algorythm I found designed for web searches -> name very similar between themselves, as, probably, in the case of histogram names) and with the TObject pointer.
In the map the TObject are inserted sorted (by the map key) and you have, from the map, the binary search feature… With 10 objects or less the search is, clearly, not faster than the standard TObjArray way (looping all over the array until the name you requested is found) but with, for example, 100 objects or more, in my simple implementation (and I’m not an computer science engineer!), I can gain a factor 3,4 or even more in the total processing time…
What do you think about it?!
Thanks for the reading,
Just a quick reaction. Did you investigate the THnSparse class? This is a way to gain a lot of space in particular if you have many TH2 or TH3s.
No, I don’t know it… Now I try to study!
Thanks for the very quick reply!
I look at the THnSparse…
The idea is similar to the one I had (to not to waste the room or the empty bins…) but THnSparse, that clearly is obviously faster than mine approach, however needs the room on the RAM to store all the “previous” entry.
My previous idea, infact, was to use a normal array to store the pairs/triple of entries. But I realized very soon that this “wins” only if the number of entries is less than the number of bins in the THn… So, then I tried to design a way to reduce the array dimension using the weight: if I have an entry with the x (or x-y), and I have to fill with another entry with the same x (for same I mean “in the same bin”) I calculate the correct weight to “fill” the “sum” of the two entries, and removing the previous entry. This is not a fast work, it depends if you are using Sumw2 or not, and bla bla…
So finally I decided to move to a type of array that is not affected by room space, so an array on disk: a TTree… My BHist (this is the name I gave to mine implementation…), infact, uses this room on disk but on RAM only uses the room to instanciate the TTree object…
To increase the performances of the tools and to make it quite reliable I create this BHist with this feature switchable or not (and in my analysis tools is automatical after a firts dry run to calculate the dimension of booked histograms…): when the feature is off BHist is only a wrapper to standard TH1, TH2 and TProfile and some methods (for example the Terminate()) are simply doing nothing, when, instead, the features is on, the “game” with TTrees is done and these methods do something (Terminate() reads back the tree, books the histogram, fills it and then erase the hidden file with the tree).
In addition the THnSparse class requires only an equally-spaced binning when, instead, in my research field (cosmic rays) a log-binning is very often preferable…
Thanks again for the suggestion,
Thanks for this additional info. Now I understand better what you have implemented and effectively it could be an interesting class that one could support. Is your class well documented and tested?
Do you provide a BHist::Merge function such that the class can be used in parallel systems like PROOF?
Once your code is mature and documented, post it if you like and we will investigate if it can be a more general tool.
Ok! It will be a pleasure for me to do this!
after more than two months (my experiment is in the final commissioning and close to the data taking so I’m a little bit busy…) I’m able to ‘release’ my class.
It’s not in a very complete and final implementation but almost all the main functionalities are implemented, is working, was tested and I can define it as ‘stable’…
you can find a tar archive with inside the class (BHist.C and Bhist.h), the linkdef for the dictionary and a valid Makefile.
In addition you can find some ROOT macros I’ve prepared to test it:
Basic.C : this shows the basic functionalities of BHist’s
Speedness.C : this is a sort of benchmark test for the speedness and for the usage of memory comparing BHist and TH1
Add.C ans AddSumw2.C : this shows the use of the Add() and Merge() methods.
Additional work has to be done to improve the class but I prefer to do this only if this class can be usefull in the future and for other people (otherwise, for me, I will implement something else when I will need it…).
Let me know what do you think about my class!