Summarizing a large ROOT file

Dear ROOTers,

I have a table of data with 10 million rows and 100 columns, stored in a ROOT tree (TTree). This is transaction-level data that cannot be used directly in my analysis.

To prepare this data for analysis, I first have to summarize it. In SQL, my summary request would be something like:

SELECT SUM(column1), SUM(column2), SUM(column3)
FROM my10MillionRowTable
GROUP BY columnA, columnB, columnC, columnD, columnE ;

Can such a summary be created from existing ROOT functionality? If so, how?

Eventually, I would like to create a method CreateSummary() with a syntax like the one shown above, which could be used to summarize any tree into another tree. The SUM() method could be replaced with any other user-written aggregation method such as MIN(), MAX(), MEDIAN(), WeightedAverage(), etc. Do you think the ROOT tree structure lends itself to this kind of application?
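
To make the intended semantics concrete, here is a minimal sketch in plain Python (not ROOT code). The function name create_summary, its arguments, and the sample rows are all hypothetical and chosen only for illustration. It keeps every row of a group in memory, so it shows the behaviour of a pluggable aggregation rather than something that would scale to 10 million rows.

```python
from collections import defaultdict
from statistics import median


def create_summary(rows, group_by, aggregates):
    """Group rows (dicts) by the group_by columns and apply each aggregate.

    aggregates maps an output name to (input column, function over a list).
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in group_by)
        groups[key].append(row)

    summary = []
    for key, members in groups.items():
        out = dict(zip(group_by, key))
        for name, (column, func) in aggregates.items():
            out[name] = func([m[column] for m in members])
        summary.append(out)
    return summary


# Hypothetical sample data: two grouping columns, one value column.
rows = [
    {"columnA": 1, "columnB": "x", "column1": 2.0},
    {"columnA": 1, "columnB": "x", "column1": 3.0},
    {"columnA": 2, "columnB": "y", "column1": 5.0},
]
summary = create_summary(
    rows,
    group_by=["columnA", "columnB"],
    aggregates={
        "sum1": ("column1", sum),
        "median1": ("column1", median),
        "max1": ("column1", max),
    },
)
```

Any function that takes a list of values and returns one number can be passed in place of sum, which is the property the CreateSummary() idea relies on.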

Thanks.

Hi,

[quote]Can such a summary be created from existing ROOT functionality?[/quote]I assume you need to create a TTree (a histogram would be somewhat easier, as histograms sum the data inherently). If so, this is going to be a bit challenging: the TTree structure is write once / read many times, so you need to compute the full sum for one of the unique values of (columnA, columnB, columnC, columnD, columnE) before you can write it. So essentially you need to do 'by hand':
  find the list of unique quintuplets (columnA, columnB, columnC, columnD, columnE)
  for each of those unique quintuplets
    sum the 3 columns over the whole input, using only the entries matching the quintuplet
    write the new entry
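
The steps above can be sketched in plain Python as follows. This is only an illustration of the logic, not ROOT code: the function name summarize_by_hand, the row layout (tuples), and the sample values are assumptions. Note that, written this literally, the input is rescanned once for every unique key.

```python
def summarize_by_hand(rows, key_idx, sum_idx):
    """rows: list of tuples; key_idx / sum_idx: positions of key and value columns."""

    def key_of(row):
        return tuple(row[i] for i in key_idx)

    # Step 1: find the list of unique key tuples (one pass over the input).
    unique_keys = sorted({key_of(row) for row in rows})

    output = []
    # Step 2: for each unique key, scan the whole input and sum matching entries.
    for key in unique_keys:
        sums = [0.0] * len(sum_idx)
        for row in rows:
            if key_of(row) == key:
                for j, i in enumerate(sum_idx):
                    sums[j] += row[i]
        # Step 3: the entry is complete, so it can now be written once.
        output.append(key + tuple(sums))
    return output


# Hypothetical sample: two key columns (positions 0, 1), two value columns (2, 3).
rows = [
    (1, 7, 1.0, 10.0),
    (2, 8, 2.0, 20.0),
    (1, 7, 3.0, 30.0),
]
result = summarize_by_hand(rows, key_idx=[0, 1], sum_idx=[2, 3])
```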

Philippe.

Dear Philippe,
Thanks for your reply. I do have a few follow-up questions.

  1. Is there a built-in method that can produce a list of unique quintuplets?
    Histograms can be used to get unique triplets. Maybe 5-dimensional parallel coordinates for unique quintuplets?

  2. Assuming there are half a million unique quintuplets, will the logic above mean that the original 10 million rows of data will be scanned once for each quintuplet?

  3. Is it possible to create an index on the 5 fields and then scan the tree once in the index order? This way, all the summaries can be derived in one pass of the tree.
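
The idea in question 3 can be pictured with a small plain-Python sketch (again not ROOT code; summarize_in_index_order and the sample rows are hypothetical). An "index" here is just the list of entry numbers sorted by the key columns. Reading the entries in that order puts all rows of one key next to each other, so each summary row is finished, and could be written, before the next key starts.

```python
from itertools import groupby


def summarize_in_index_order(rows, key_idx, sum_idx):
    """rows: list of tuples; key_idx / sum_idx: positions of key and value columns."""

    def key_of(row):
        return tuple(row[i] for i in key_idx)

    # Build the index: entry numbers ordered by the key columns.
    index = sorted(range(len(rows)), key=lambda n: key_of(rows[n]))

    summary = []
    # One pass in index order; groupby yields each run of equal keys in turn.
    for key, entries in groupby(index, key=lambda n: key_of(rows[n])):
        sums = [0.0] * len(sum_idx)
        for n in entries:
            for j, i in enumerate(sum_idx):
                sums[j] += rows[n][i]
        summary.append(key + tuple(sums))
    return summary


# Hypothetical sample: two key columns (positions 0, 1), one value column (2).
rows = [
    (2, 8, 2.0),
    (1, 7, 1.0),
    (2, 8, 5.0),
    (1, 7, 3.0),
]
result = summarize_in_index_order(rows, key_idx=[0, 1], sum_idx=[2])
```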

Any advice from the ROOT team about the suitability of ROOT for this kind of application would be much appreciated.

Hi,

You could use a THnSparse for histogramming (i.e. counting) in 5 dimensions. Create 3 of them, one per variable, and fill each with the corresponding variable as the weight, so that they sum your three variables as a function of the histogram bin, i.e. the unique quintuplet.
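
A minimal plain-Python sketch of this idea, assuming the five grouping columns hold values that each map to their own bin (for example integer codes). The SparseSum class is a hypothetical toy stand-in for a sparse histogram, not the THnSparse API: like a sparse histogram it stores only the bins that were actually filled, and a weighted fill adds the weight to the bin content. The sample rows are made up.

```python
from collections import defaultdict


class SparseSum:
    """Toy sparse N-dimensional histogram: only filled bins are stored."""

    def __init__(self):
        self.bins = defaultdict(float)

    def fill(self, coords, weight=1.0):
        # A weighted fill adds the weight to the content of the bin.
        self.bins[tuple(coords)] += weight


# Hypothetical rows: five key columns followed by the three columns to sum.
rows = [
    (1, 1, 2, 3, 4, 1.0, 10.0, 100.0),
    (1, 1, 2, 3, 4, 2.0, 20.0, 200.0),
    (5, 1, 2, 3, 4, 4.0, 40.0, 400.0),
]

# One histogram per summed column, all filled in a single pass over the data.
hists = [SparseSum(), SparseSum(), SparseSum()]
for row in rows:
    key, values = row[:5], row[5:]
    for hist, value in zip(hists, values):
        hist.fill(key, value)

# Each filled bin corresponds to one unique quintuplet and one summary row.
summary = [
    key + tuple(h.bins[key] for h in hists)
    for key in sorted(hists[0].bins)
]
```

The summary list built at the end is what would then be written out, one entry per filled bin, if the result is needed as a tree.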

Cheers, Axel.

Great. THnSparse served the purpose. I was even able to convert the result into a tree, based on examples given in the tutorials.

Thanks.