I’ve just seen the presentation “Zero-overhead training of machine learning models”, which advertises fantastic features such as TMVA.Experimental.CreatePyTorchGenerators.
I’m using v6.32.04, and the interface doesn’t seem to take an RDataFrame as argument as in the example from the presentation, but rather a tree name and a file name.
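For reference, this is roughly what the call looks like to me in v6.32.04, following the RBatchGenerator_PyTorch tutorial (the tree and file names here are placeholders, and I may be slightly off on the keyword names):

import ROOT

# v6.32.04: the first two arguments are a tree name and a file name,
# not an RDataFrame (names below are placeholders)
gen_train, gen_validation = ROOT.TMVA.Experimental.CreatePyTorchGenerators(
    "events",        # tree name
    "myfile.root",   # file name
    128,             # batch size
    5_000,           # chunk size
    validation_split=0.3,
)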
I guess the presentation is based on cutting-edge features that are not yet available?
Is there a plan for when this will be released? And will there be more documentation (examples, references…)?
This really looks like an ideal solution to feed very large amounts of data to ML frameworks!
Thanks for your interest! Indeed, providing an ideal solution to feed very large amounts of data to ML frameworks is our goal.
The code is currently available in the master branch and we are discussing whether it would make sense to release it as part of 6.34.00, due at the end of the month.
Would that be useful for you?
Having this feature in the next v6.34 would be great!
My use case is nothing very fancy in a HEP context: I need to read data where each event has many features, including scalars and vectors of variable length.
// Branches present in each event:
float missEt;                      // scalar feature
std::vector<float> muon_pt;        // variable-length, one entry per muon
std::vector<float> muon_rapidity;
// ...
std::vector<float> jet_pt;         // variable-length, one entry per jet
std::vector<float> jet_rapidity;
// ... etc.
I have tens of millions of such events spread over several files. In my current setup I need to prepare batches of events as simple concatenated tensors (e.g. the muon_pt tensor contains the concatenation of the muon_pt of all the events in the batch; I also need a muon_N tensor to allow per-event calculations, such as the sum of muon_pt per event).
The difficult part is loading the data from disk into the tensors (PyTorch in my case) in a flexible and performant way across all the files, and it seems the new features you propose would solve exactly that…
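To make the layout concrete, here is a minimal sketch of what one batch looks like in my current setup, with made-up numbers (nothing here is ROOT-specific, it is just the plain PyTorch structure I build by hand):

import torch

# Made-up per-event muon_pt arrays for a 3-event batch:
event_muon_pt = [torch.tensor([35.2, 21.7]),   # event 1: 2 muons
                 torch.tensor([50.1]),         # event 2: 1 muon
                 torch.tensor([])]             # event 3: 0 muons

muon_pt = torch.cat(event_muon_pt)                        # shape (tot_num_muon,)
muon_N  = torch.tensor([len(t) for t in event_muon_pt])   # shape (n_events,)

# Per-event calculation, e.g. the sum of muon_pt in each event,
# recovered from the flat tensor using the companion counts:
event_index = torch.repeat_interleave(torch.arange(len(muon_N)), muon_N)
sum_pt      = torch.zeros(len(muon_N)).index_add_(0, event_index, muon_pt)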
I’m happy to share more details of my use case if this is useful.
I had a look at the RC1 of v6.34 and could test TMVA.Experimental.CreatePyTorchGenerators. This is very promising, but I understand it is currently limited and cannot generate input features with non-constant shapes.
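This is roughly what I tried with the RC1 (file, tree and column names are placeholders, and the keyword names are taken from the tutorials, so they may not be exact):

import ROOT

# 6.34 RC1: the generator is now built from an RDataFrame
rdf = ROOT.RDataFrame("events", "myfile.root")   # placeholder tree/file

gen_train, gen_validation = ROOT.TMVA.Experimental.CreatePyTorchGenerators(
    rdf,
    128,                      # batch size
    5_000,                    # chunk size
    validation_split=0.3,
)

for batch in gen_train:
    ...                       # fixed-shape torch tensors only, as far as I can tell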
This is a rather strong limitation, as many advanced ML projects need to handle variable-length (VL) arrays as in the example in the post above (in HEP this is very common), and using larger-than-needed arrays with padding is often not desirable or not possible.
I think there are two functionalities which would cover most cases:
1. The ability to load a VL array into a fixed-size tensor by simply concatenating the event-level arrays, providing the full batch-level tensor together with a companion “size per event” tensor. In the example in the post above that would be the pair (muon_pt, muon_N).
2. The ability to provide batches consisting of a tuple of heterogeneous tensors. In the example above, where we would need both pt and eta for muons and jets, we would get (muon_features, muon_N, jet_features, jet_N) with muon_features.shape == (tot_num_muon, 2), muon_N.shape == (tot_num_evt,) and similar for jets (see the sketch below).
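A hypothetical example of one such batch, with made-up numbers, just to illustrate the shapes for 3 events with 2 muon features and 2 jet features:

import torch

muon_N        = torch.tensor([2, 1, 4])             # muons per event
jet_N         = torch.tensor([5, 3, 3])             # jets per event
muon_features = torch.randn(int(muon_N.sum()), 2)   # shape (tot_num_muon, 2) == (7, 2)
jet_features  = torch.randn(int(jet_N.sum()), 2)    # shape (tot_num_jet, 2)  == (11, 2)

batch = (muon_features, muon_N, jet_features, jet_N)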
Looking a bit at the code (RBatchGenerator, RChunkLoader), I understand this would be possible, although it would certainly require more development.
Does the ROOT team have plans to support this kind of feature?