Dear ROOTers
In order to better understand the problem, please allow me first to explain the algorithm to be used,
called “quantile normalization”. Assuming a table with n rows and m columns, the algorithm is as follows
(a small sketch follows the list):
1. for each column, determine the rank of each row
2. sort each column according to its rank
3. for each row, compute the mean of the sorted columns
4. replace each column with the common row means
5. re-sort each column individually according to its original rank
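To make the steps concrete, here is a minimal, self-contained C++ sketch of the algorithm on an
in-memory matrix (illustration only; the actual implementation is in the attached macro, and in my
program each column is a tree):

#include <vector>
#include <algorithm>
#include <cstddef>

// Quantile-normalize a table stored column-wise: cols[k][i] is row i of column k.
void QuantileNormalize(std::vector< std::vector<double> > &cols)
{
   const std::size_t m = cols.size();     // number of columns
   const std::size_t n = cols[0].size();  // number of rows

   // step 1: for each column, determine the rank of each row
   // (rank[k][i] = original row index of the i-th smallest value)
   std::vector< std::vector<std::size_t> > rank(m, std::vector<std::size_t>(n));
   for (std::size_t k = 0; k < m; ++k) {
      std::vector<double> &col = cols[k];
      for (std::size_t i = 0; i < n; ++i) rank[k][i] = i;
      std::sort(rank[k].begin(), rank[k].end(),
                [&col](std::size_t a, std::size_t b) { return col[a] < col[b]; });
      // step 2: sort the column itself
      std::sort(col.begin(), col.end());
   }

   // step 3: for each row of the sorted columns, compute the mean
   std::vector<double> mean(n, 0.0);
   for (std::size_t i = 0; i < n; ++i) {
      for (std::size_t k = 0; k < m; ++k) mean[i] += cols[k][i];
      mean[i] /= m;
   }

   // steps 4+5: replace every column by the common means,
   // restored to each column's original row order
   for (std::size_t k = 0; k < m; ++k)
      for (std::size_t i = 0; i < n; ++i)
         cols[k][rank[k][i]] = mean[i];
}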
Attached are a macro, “macroQuantileTest.C”, and a demo table, “TestAB.txt”, which implement the quantile
normalization identically to the implementation in my program, with the only difference that in my program
every column is replaced by a ROOT tree. (The original implementation of the algorithm in C uses a
linearized matrix and thus has severe memory problems when applied to large datasets.)
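(For scale: keeping the dataset described below, 25000 columns with 1.2 million rows each, as one
linearized matrix of doubles would require about 25000 × 1200000 × 8 bytes ≈ 240 GB of RAM.)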
Until now the code worked very well, even for large datasets, but now a user of my program has imported
25000 datasets as ROOT trees into a ROOT file, where each tree has more than 1.2 million entries (the ROOT
file has a size of about 140 GB).
The user then wanted to apply the “quantile normalization” algorithm to all 25000 trees and experienced
a progressive increase in memory consumption and in processing time: after about 1 hr, about 2000 samples
were processed using about 1.4 GB; after 22 hrs only 17000 samples were processed and memory had increased
to about 5 GB, so the user stopped the processing.
The user has compiled root_v5.24.00 for 64 bit on Mac OS X 10.4 and reported that most of the time was
spent in vm_map_enter and vm_map_lookup_entry of the mach_kernel library (although there was no apparent swapping).
The problem may already occur while filling the data trees, which I do as follows:
Int_t Normalize(Int_t numdata, TTree **datatree)
{
// size, arrInten, arrMask are class members

// fill and sort arrays for quantile normalization
   for (Int_t k=0; k<numdata; k++) {
      FillDataArrays(datatree[k], size, arrInten);
      AddArray(size, arrInten, arrMask, datatree[k]->GetName());
   }//for_k

// quantile normalization
   Calculate(size, arrInten, arrInten, arrMask);

// get normalized arrays and fill data trees
   for (Int_t k=0; k<numdata; k++) {
      arrInten    = GetArray(size, arrInten, arrMask, datatree[k]->GetName());
      datatree[k] = FillDataTree(datatree[k], size, arrInten);
   }//for_k

   return 0;
}//Normalize
Here numdata is 25000 trees and size is 1.2 million entries. The functions AddArray(), Calculate() and
GetArray() are described in the attached macro; the other two functions are in principle as follows:
Int_t FillDataArrays(TTree *datatree, Int_t size, Double_t *inten)
{
   XGCCell *gccell = 0;
   datatree->SetBranchAddress("DataBranch", &gccell);

   for (Int_t i=0; i<size; i++) {
      datatree->GetEntry(i);
      inten[i] = gccell->GetIntensity();
   }//for_i

   datatree->ResetBranchAddress(datatree->GetBranch("DataBranch"));
   SafeDelete(gccell);

   return 0;
}//FillDataArrays
TTree *FillDataTree(TTree *oldtree, Int_t size, Double_t *inten)
{
   XGCCell *oldcell = 0;
   oldtree->SetBranchAddress("DataBranch", &oldcell);

   XGCCell *newcell = new XGCCell();
   TTree   *newtree = new TTree(oldtree->GetName(), oldtree->GetTitle());
   newtree->Branch("DataBranch", "XGCCell", &newcell, 64000, 99);

   for (Int_t i=0; i<size; i++) {
      oldtree->GetEntry(i);
      newcell->SetIntensity(inten[i]);
      newtree->Fill();
   }//for_i
   newtree->Write("", TObject::kOverwrite);

   newtree->ResetBranchAddress(newtree->GetBranch("DataBranch"));
   SafeDelete(newcell);

   oldtree->ResetBranchAddress(oldtree->GetBranch("DataBranch"));
   SafeDelete(oldcell);
   SafeDelete(oldtree);

   return newtree;
}//FillDataTree
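For completeness: in my program the trees are obtained from the ROOT file and handed to Normalize(); the
following simplified driver is only meant to illustrate the call pattern (file name and key handling are
schematic, not my actual code):

#include "TFile.h"
#include "TTree.h"
#include "TKey.h"
#include "TCollection.h"

void RunNormalize(const char *filename, Int_t numdata)
{
   // open the file that holds the data trees
   TFile *file = TFile::Open(filename, "UPDATE");
   if (!file || file->IsZombie()) return;

   // collect up to numdata trees from the file's list of keys
   // (key cycles ignored for simplicity)
   TTree **datatree = new TTree*[numdata];
   Int_t   k   = 0;
   TIter   next(file->GetListOfKeys());
   TKey   *key = 0;
   while ((key = (TKey*)next()) && (k < numdata)) {
      datatree[k++] = (TTree*)key->ReadObj();
   }

   Normalize(k, datatree);

   delete [] datatree;
   file->Close();
}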
Moreover, I assume that the problem may also lie in the following code fragment of the function Calculate():
// calculate mean values of all tree entries
   for (Int_t i=0; i<fNMean1; i++) {
   // read entry i from all trees and fill array
      for (Int_t k=0; k<fNData; k++) {
         treek[k]->GetEntry(i);
         arr[k] = sortk[k];
      }//for_k

   // calculate mean of array
      fMean1[i] = Mean(fNData, arr);
      y[i]      = fMean1[i];
   }//for_i
Here fNData is 25000 trees and fNMean1 is 1.2 million entries.
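To put the basket question in numbers: the Branch() call above uses a buffer size of 64000 bytes, so if
each of the 25000 trees keeps just one basket per branch in memory while being read in this loop, that
alone would amount to roughly 25000 × 64000 bytes ≈ 1.6 GB. (This is only meant to illustrate the scale
involved, not as a diagnosis.)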
Do you know what might be the reason for the progressive time and memory consumption?
Is there a way to avoid the progressive memory increase? (My intention is that users should be able to
use my program on computers with only 1-2 GB of RAM.)
What is the behavior of the baskets in this case?
Best regards
Christian
macroQuantileTest.C (8.82 KB)
TestAB.txt (4.02 KB)