How to handle a THnSparse larger than 500 MB

Do you think it would be possible to avoid this size limit by building a kind of THnSparse in a TTree, where the number of entries is the number of filled bins and each entry contains an array with the dimension of the THnSparse? I would prefer to have your opinion on this before I start trying to implement it.

I don’t really know what it is that you actually want to do with your “gammas”.
You can have a simple tree which keeps all your gammas in a “raw format” (i.e. each tree entry could keep all “GammaMult” energies which belong to it, without making any groups of three) and then you can “analyse” this tree, creating 1, 2, 3 dimensional histograms / projections in RAM (see the sketch below). As long as you do not try to write these histograms to disk, you can make them as huge as your RAM allows (note, however, that if you want to draw them, you had better keep the number of bins small, otherwise it will take a very long time to display them). Of course, each time you create a new histogram, you will need to loop over your tree again (but that is what ROOT does well).
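For illustration, here is a minimal sketch of that approach; the file, tree, and branch names (“gammas.root”, “gammas”, “GammaMult”, “GammaEnergy”) are hypothetical and would need to be adapted to your data:

// Minimal sketch of the "raw format" tree idea; the file, tree and
// branch names are hypothetical.
#include "TFile.h"
#include "TTree.h"
#include "TH2D.h"

void project_raw_gammas()
{
   TFile f("gammas.root");
   TTree *t = (TTree*)f.Get("gammas");
   Int_t mult;            // number of gammas in this event
   Double_t energy[100];  // their energies (keV)
   t->SetBranchAddress("GammaMult", &mult);
   t->SetBranchAddress("GammaEnergy", energy);
   // in-RAM gamma-gamma matrix: it is never written to disk, so only
   // the available RAM limits its size
   TH2D *h2 = new TH2D("h2", "gamma-gamma;E_{1} (keV);E_{2} (keV)",
                       4096, 0., 4096., 4096, 0., 4096.);
   h2->SetDirectory(nullptr); // keep it alive after the file is closed
   for (Long64_t i = 0; i < t->GetEntries(); i++) {
      t->GetEntry(i);
      for (Int_t a = 0; a < mult; a++)  // all ordered pairs of gammas
         for (Int_t b = 0; b < mult; b++)
            if (a != b) h2->Fill(energy[a], energy[b]);
   }
   h2->Draw("colz");
}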

What I need in the end is to project onto a 1D histogram the gamma rays which are in coincidence with two others. But this needs to be fast. I cannot read all the entries of a ROOT tree to build it; it would take ages for each projection. Ideally I need a THnSparse or an equivalent structure, such that a projection takes only a few seconds. This kind of cube has been used for many years with other software (like Radware or Gaspware). It’s a pity that I cannot do the same thing with ROOT objects.

I don’t fully understand how it works, but could the TKDTree class be a solution to this kind of problem?

I guess we need @pcanal to state when / if the 1GB ROOT TFile buffer/basket limit will disappear (so that huge histograms could be saved to / retrieved from a ROOT file).

In the meantime … you can try to live with a THnSparseC (1 byte per bin) or a THnSparseS (2 bytes per bin) but, when filling, make sure that you always check that you do not exceed the maximum allowed bin content (i.e. 127 for “Char_t” and 32767 for “Short_t”).

In my case, the limit does not seem to come from the 1GB file size (mine is 515MB); the error comes from the TObjArray size of the THnSparse. Is it the same thing?

But if this is only related to the TFile buffer size, is it possible to store a binary file on disk (larger than 1GB) and then load it into a THnSparse in memory for analysis?

Indeed, I will try with a THnSparseS (it’s a pity that there is no THnSparse for unsigned short…)

// for a "Short_t" based "THnSparseS *hnsparse";
// "hEnG_entry" is the array of bin coordinates to be filled
if (hnsparse->GetBinContent(hnsparse->GetBin(hEnG_entry)) < 32767) {
   hnsparse->Fill(hEnG_entry);
} else {
   std::cout << "SATURATED!" << std::endl;
}

For such large data, we recommend storing it in a split TTree.

The 1GB limit is challenging to remove and thus is likely to only be lifted in the so-called ROOT v7 format.

Cheers,
Philippe.

Hi Philippe,

What do you mean by a split TTree? Will it allow me to make the same kind of projections as with a THnSparse?

Let me ask my question again: is it only a problem of the TFile size? If I store a huge binary file on disk and then load it into a THnSparse in RAM, not in a file, will it work?

Thanks in advance

Jérémie

If I store a huge binary file on disk and then load it into a THnSparse in RAM, not in a file, will it work?

Yes, and using a TTree is an efficient way of storing the data in a ‘huge binary file’. You would be able to reconstruct the THnSparse at read time, and you may be able to read the data in other ways too (see the description of TTree in the User’s Guide and the new RDataFrame analysis tool).

Cheers,
Philippe.

Hmm, I was not aware of this new RDataFrame analysis tool. It seems very interesting… Do you know if examples are available to help me store data from a THnSparse in a TTree, and to build a THnSparse from a TTree using this analysis tool?

Hi,

Using the trick you propose:

  // assuming "f" is the output TFile and "his3D" points to the THnSparse
  TTree *t = new TTree("tree3D", "tree with his3D");
  t->Branch("his3D.", his3D, 32000, 111); // trying to impose the "max" splitting
  t->Fill();
  t->Write();
  delete t;
  f->Write();
  delete f;

I still obtain the following error at the end:

Error in TBufferFile::WriteByteCount: bytecount too large (more than 1073741822)

I think I need to build a tree by hand, rather than storing the THnSparse directly in the TTree.

I’m afraid @pcanal would need to say whether it is possible to get a “better” branch splitting (I was trying to impose the “max” one).

Well, you could implement the following “brutal fix”.

As I understand it, your whole experimental data “sample” produces a THnSparse which is too big to be stored in a ROOT file.

So, try to “split” / “divide” your whole experimental data “sample” into several “subsamples” (or even several tens of “subsamples”, if needed).

Each “subsample” would then (possibly / hopefully) produce a much smaller “partial” THnSparse and you should be able to store these “partial” histograms in a ROOT file, either directly as separate objects or in a TTree. You could create a single ROOT file with all “partial” histograms or one ROOT file per “partial” histogram.

So, if your raw experimental data are spread across multiple files, you could take each raw experimental data file as one physical “subsample” or, if you have just one single file with all raw experimental data, simply divide the total number of events by some number and create that many logical “subsamples” (or one “subsample” per hour, day, or week of measurements).

Another (quite clever) way to split your data into “subsamples” would be to monitor the actual total number of bins of your THnSparse, when you fill it (THnSparse::GetNbins). Once this number reaches a certain maximum value (defined by you, it should be small enough that you can still save this histogram in a ROOT file, let’s say 10 to 50 million bins could be fine, I guess), you simply write the current “partial” THnSparse histogram to a ROOT file, then you recreate the THnSparse histogram (or reset it so that all previous bins are gone) and continue the filling with this next “partial” histogram.
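A minimal sketch of that monitoring loop, assuming a 3-dimensional THnSparseS filled from a tree; the bin limit, function name, and branch handling below are hypothetical placeholders to adapt to your own event loop:

// Sketch of the "partial histogram" trick; kMaxBins and all names
// here are hypothetical placeholders.
#include "TFile.h"
#include "TTree.h"
#include "TString.h"
#include "THnSparse.h"

void fill_in_parts(TTree *t, TFile *outputFile, THnSparseS *hs)
{
   const Long64_t kMaxBins = 20000000; // small enough to still fit in a ROOT file
   Int_t part = 0;
   Double_t coords[3];
   // ... set the branch addresses which fill "coords" ...
   for (Long64_t i = 0; i < t->GetEntries(); i++) {
      t->GetEntry(i);
      hs->Fill(coords);
      if (hs->GetNbins() >= kMaxBins) { // the current "partial" histogram is full
         outputFile->cd();
         hs->Write(TString::Format("hs_part%d", part++));
         hs->Reset(); // all previous bins are gone
      }
   }
   outputFile->cd();
   hs->Write(TString::Format("hs_part%d", part)); // last, incomplete part
}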

For test purposes, I created some 4096x4096x4096 3 dimensional THnSparse histograms and I filled them in a random way (with random values). I have found that the average TFile buffer/basket size needed by such histograms can easily be estimated as follows. For histograms filled without weights one needs “number_of_filled_bins * (sizeof(bin) + 5)”, while for histograms for which THnSparse::Sumw2() has been called one needs “number_of_filled_bins * (sizeof(bin) + 13)” (i.e. weights are always “Double_t”), where the “sizeof(bin)” is 8 for “Double_t” and “Long_t” and 4 for “Float_t” and “Int_t” and the “number_of_filled_bins” is given by THnSparse::GetNbins().
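As a concrete check of that estimate (extending the same rule to a THnSparseS, whose bins are 2-byte “Short_t”): 50 million filled bins without weights would need about 50e6 * (2 + 5) = 350 MB, safely below the 1GB limit, while the same histogram with Sumw2 called would need about 50e6 * (2 + 13) = 750 MB.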

Then, you just need a simple small ROOT macro, which reads / retrieves all “partial” histograms (from a single or from many ROOT files) and adds them in RAM. Well, you will always need to run this macro at the beginning of your ROOT session, of course … but that should really be very fast.
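Such a macro could look like this (the file name “partials.root” and the cloned histogram name “hs_total” are hypothetical):

// Sketch of the "merge at session start" macro; "partials.root"
// and "hs_total" are hypothetical names.
#include "TFile.h"
#include "TKey.h"
#include "TCollection.h"
#include "THnSparse.h"

THnSparse *load_total(const char *fname = "partials.root")
{
   TFile f(fname);
   THnSparse *total = nullptr;
   TIter next(f.GetListOfKeys());
   while (TKey *key = (TKey*)next()) { // loop over all "partial" histograms
      THnSparse *part = dynamic_cast<THnSparse*>(key->ReadObj());
      if (!part) continue;
      if (!total) total = (THnSparse*)part->Clone("hs_total");
      else        total->Add(part);
      delete part; // "Clone" made a copy, so the part can always go
   }
   return total; // the full-statistics histogram, in RAM only
}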

Alternatively (and I cannot be precise without knowing your data flow and data layout), where you do

double some_var = ....;
double some_value = ....;
for( some condition ) {
    some_var = ....;
    some_value = ....;
    sparseHisto->Fill( some_var, some_value );
}

do

double some_var = ....;
double some_value = ....;
tree->Branch("some_var",&some_var);
tree->Branch("some_value",&some_value);
for( some condition ) {
    some_var = ....;
    some_value = ....;
    tree->Fill();
}

then, when reading the file, use RDataFrame (or MakeSelector, or another way of looping through the TTree) to recreate the THnSparse.
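For example, reusing the hypothetical “some_var” / “some_value” branches from the snippet above, and assuming a 2-dimensional THnSparse with made-up axis definitions:

// Sketch: rebuild the THnSparse from the tree with RDataFrame::Foreach;
// the file/tree names and axis definitions are hypothetical.
#include "ROOT/RDataFrame.hxx"
#include "THnSparse.h"

THnSparseD *rebuild_sparse()
{
   Int_t bins[2] = {4096, 4096};
   Double_t xmin[2] = {0., 0.};
   Double_t xmax[2] = {4096., 4096.};
   auto hs = new THnSparseD("hs", "rebuilt", 2, bins, xmin, xmax);
   ROOT::RDataFrame df("tree", "data.root");
   df.Foreach([hs](double v, double w) {
      Double_t coords[2] = {v, w};
      hs->Fill(coords);
   }, {"some_var", "some_value"});
   return hs;
}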

Cheers,
Philippe.

Thank you for your help. Anyway, all of these solutions are not very user friendly. A simple projection will take time, and it needs to be done in a few seconds to be competitive with the old software that handles this kind of cube. I don’t understand how they managed to handle cubes up to 8192x8192x8192, in files of a few GB, doing projections in around one second…

When I want to plot a projection of a gamma-gamma-gamma coincidence (the 1D spectrum of gamma rays which are in coincidence with two others within specific energy ranges), I need the full statistics of my experiment. If I need to do that on many subfiles and then sum them, it will be a nightmare.

Once you have the total THnSparse in RAM (either by summing up “partial” histograms or by creating it directly from a TTree), you can make as many projections as you want (no need to recreate / refill the THnSparse if you do not exit your ROOT session).
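For instance, a gamma-gamma-gamma gate then reduces to restricting two axes of the in-RAM THnSparse and projecting the third; the axis indices and energy windows below are made up:

// Sketch of a gated projection on the in-RAM THnSparse "hs";
// the gates (in keV) are hypothetical.
hs->GetAxis(1)->SetRangeUser(1170., 1175.); // gate on the 2nd gamma
hs->GetAxis(2)->SetRangeUser(1330., 1335.); // gate on the 3rd gamma
TH1D *proj = hs->Projection(0);             // 1D spectrum of the 1st gamma
proj->Draw();

Since the projection loops only over the filled bins of the THnSparse, this should indeed take seconds rather than requiring a full pass over the raw data.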
