Shuffle RDataFrame?

Continuing the discussion from Use RDataFrame to produce small training set:

I would like to randomly select a fixed number of rows from my RDataFrame (a la pandas.DataFrame.sample). Is there a way to do this yet? The post above, from two years ago, said it was in progress.

I think @eguiraud can help you.

Hello @mwilkins,

sorry for the late reply (ACAT…). Random access into a TTree is extremely slow, so RDataFrame doesn’t really have an equivalent of pandas’ sample.

I think the closest thing RDF offers is: df.Filter("some_rng < some_threshold").Range(n_events), where some_rng is a column of random numbers. The Range makes sure you get exactly the number of events you want, provided enough events pass the Filter. Unfortunately, Range is only available in single-thread runs.
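For reference, the Filter-plus-Range idea could be wrapped up roughly like this in Python. This is only a sketch under some assumptions: the helper name `sample_rows`, the column name `sample_rnd`, and the use of `gRandom->Rndm()` as the random-number expression are illustrative choices, not something RDF prescribes, and the tree and file names in the usage comment are hypothetical.

```python
def sample_rows(df, fraction, n_events, rng_expr="gRandom->Rndm()"):
    """Keep each row with probability `fraction`, then stop after `n_events` rows.

    `df` is expected to be a ROOT.RDataFrame (or a node derived from one)
    running single-threaded, since Range is not available with implicit
    multi-threading. `rng_expr` is a C++ expression returning a uniform
    number in [0, 1); gRandom->Rndm() is an assumption, swap in your own.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if n_events <= 0:
        # Range(0) means "no upper limit" in RDF, so reject it here.
        raise ValueError("n_events must be positive")
    return (
        df.Define("sample_rnd", rng_expr)            # one random number per row
          .Filter(f"sample_rnd < {fraction!r}")      # random thinning
          .Range(n_events)                           # cap at the requested count
    )


# Usage (hypothetical tree and file names):
#   import ROOT
#   df = ROOT.RDataFrame("tree", "input.root")
#   sampled = sample_rows(df, fraction=0.01, n_events=1000)
#   sampled.Snapshot("tree", "sample.root")
```

One thing to keep in mind: Range stops at the first n_events rows that pass the Filter, so if the fraction is much larger than n_events divided by the total number of rows, the sample will come mostly from the start of the dataset. If the fraction is too small, you may end up with fewer than n_events rows.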

This summer, in TMVA, we also started developing a Python API that loops over a dataset and returns the data as batches of NumPy arrays to feed into ML tools, but that’s not in production yet.

Cheers,
Enrico

Ah, this is unfortunate. Being able to randomly select rows would be useful for slimming, e.g., calibration datasets where there may be correlations between adjacent rows. I suppose one could create a column of random numbers and use that to randomly select a fraction of the data, as long as you don’t care about getting a precise number of rows at the end.
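To put a number on "not a precise number of rows": if each row is kept independently with the same probability, the final row count follows a binomial distribution. A small sketch of the expected size and its spread (the helper name `expected_sample_size` and the example numbers are just for illustration):

```python
import math


def expected_sample_size(n_rows, fraction):
    """Mean and standard deviation of the number of rows kept when each of
    `n_rows` rows is kept independently with probability `fraction`."""
    if n_rows < 0:
        raise ValueError("n_rows must be non-negative")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    mean = n_rows * fraction
    std = math.sqrt(n_rows * fraction * (1.0 - fraction))
    return mean, std


# Example: thinning one million rows down to roughly 1% of them.
mean, std = expected_sample_size(1_000_000, 0.01)
print(f"expect about {mean:.0f} rows, give or take {std:.0f}")
```

So for large datasets the relative fluctuation is small, which may be good enough when an approximate sample size is acceptable.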
