Managing large data for TensorFlow / TMVA training

I am attempting to do some deep learning with some very low-level variables from a detector, and have some perhaps more general questions relating to the whole pipeline…

So the detector has ~36,000 channels. For a bunch of signal and background events I have recorded this data in a .root file as a vector<double> branch, like so:

*............................................................................*
*Br   21 :       Channels : vector<double>                                   *
*Entries :      585 : Total  Size=  172603429 bytes  File Size  =    2269407 *
*Baskets :      585 : Basket Size=      32000 bytes  Compression=  76.05     *
*............................................................................*

Note also that this is only a sample file; the real file has about a million events and is ~20 GB.
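
(For scale: 36,864 doubles × 8 bytes ≈ 295 kB per event, so a million events is roughly 295 GB uncompressed. Whatever format I end up with, the full tensor will clearly have to be read in batches rather than held in memory at once.)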

Now I would like to do some DL on this whole tensor, and also on a reduced-dimension version. The 36k channels come from 288 silicon modules, each with 128 channels, so I could attempt to reduce the whole thing to 288 inputs instead by summing or averaging over the channels of each module. I would like to see if that works too.
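
For concreteness, here is a rough sketch of the per-module reduction I have in mind, assuming the 36,864 entries of each event's vector are ordered module by module (128 consecutive channels per module; that ordering is an assumption about my own data layout):

import numpy as np

N_MODULES = 288
N_CHANNELS_PER_MODULE = 128   # 288 * 128 = 36,864

def reduceDimensionality(channels):
    # Collapse one event's 36,864-channel vector to 288 per-module values.
    # Assumes the first 128 entries belong to module 0, the next 128 to
    # module 1, and so on.
    x = np.asarray(channels, dtype=np.float32)
    return x.reshape(N_MODULES, N_CHANNELS_PER_MODULE).mean(axis=1)   # or .sum(axis=1)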

Anyway, when I throw the whole tensor at TMVA like so:

self.dataloader.AddVariablesArray("Channels", 36864, 'F', 0., 999.)

The whole process then hangs when I attempt to BookModel(), which I guess is quite understandable.

So my questions are:

  1. What is the best way to store this much data? Is a ROOT file the best way, or are there other formats which TensorFlow can interact with (without TMVA even)? How are other big datasets stored? Database? CSV? I'm not too sure what the best option is here… (a sketch of the kind of conversion I have in mind is below, after these questions).

  2. What is the best way to perform this dimensionality reduction? I can probably do something like:

import ROOT as root

f = root.TFile("data.root")
myTree = f.Get("trainingdata")
for entry in myTree:
    # Now you have access to the leaves/branches of each entry in the tree, e.g.
    detectorVector = entry.Channels
    reducedVector = reduceDimensionality(detectorVector)
    # Now save to a new .root file, or as a new branch in the same file, or to CSV, or whatever.

but this feels like it would be very slow, looping over each entry one at a time. Is there a trick I am missing here? (A rough sketch of the columnar approach I am imagining is below, after the questions.) Also, that example is in Python, but I can do it in C++ too.

  3. Is 36,000 inputs just too many?
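
To make questions 1 and 2 a bit more concrete, this is roughly the kind of columnar conversion I was imagining (a sketch only: it uses uproot to read the branch into NumPy, does the module reduction with one reshape instead of an entry-by-entry loop, and writes a compressed .npz that TensorFlow can load without ROOT or TMVA; the file and tree names are just the ones from my sample, and for the real ~20 GB file the read would have to be chunked, e.g. with uproot.iterate):

import numpy as np
import uproot

N_MODULES = 288
N_CHANNELS_PER_MODULE = 128

with uproot.open("data.root") as f:
    tree = f["trainingdata"]
    # Read the whole Channels branch in one go; with library="np" a
    # vector<double> branch comes back as an object array of per-event arrays.
    channels = tree["Channels"].array(library="np")

# Stack into a dense (n_events, 36864) array, assuming every event really
# does have exactly 288 * 128 entries.
channels = np.stack(channels).astype(np.float32)

# Vectorised per-module reduction: (n_events, 288, 128) -> (n_events, 288).
reduced = channels.reshape(-1, N_MODULES, N_CHANNELS_PER_MODULE).mean(axis=2)

# Save in a plain format that TensorFlow can consume directly.
np.savez_compressed("reduced.npz", channels=reduced)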

If this is the wrong place for this kind of discussion, just let me know and I'll try elsewhere, but I was curious about the opinions of people here in the know.

Cheers

Hi,

Adding @moneta in the loop.

Best,
D