I would like to randomly select a defined number of rows from my RDataFrame (a la pandas.DataFrame.sample). Is there a way to do this yet? The above post from 2 years ago said it was in progress.
Sorry for the late reply (ACAT…). Random access into a TTree is extremely slow, so we don’t really have an equivalent of pandas’ sample.
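I think the closest thing RDF offers is: df.Filter("some_rng < some_threshold").Range(n_events), where some_rng is a per-event random number and some_threshold is the fraction of events you want to pass. Range keeps only the first n_events entries that pass the filter, so you get exactly the number of events you want (as long as enough events pass), but unfortunately Range is only available in single-thread runs.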
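To make the choice of threshold concrete, here is a minimal pure-Python sketch that emulates the Filter-then-Range pattern on a sequential pass. It does not use ROOT: filter_then_range is a hypothetical helper written only for illustration, and the row count and thresholds are assumed values. It shows that a threshold far above n_events / n_rows exhausts the selection near the start of the dataset, while a threshold only slightly above that ratio spreads the selected rows over most of it.

```python
import random


def filter_then_range(n_rows, threshold, n_events, seed=0):
    """Emulate Filter("rng < threshold").Range(n_events) on one sequential pass.

    Hypothetical helper for illustration only, not part of ROOT.
    Returns the indices of the selected rows.
    """
    rng = random.Random(seed)
    selected = []
    for i in range(n_rows):
        if rng.random() < threshold:        # the Filter step
            selected.append(i)
            if len(selected) == n_events:   # the Range step: stop after n_events passes
                break
    return selected


n_rows, n_events = 100_000, 1_000  # assumed sizes

# Threshold far above n_events / n_rows: only the start of the dataset is sampled.
early = filter_then_range(n_rows, threshold=0.5, n_events=n_events)

# Threshold slightly above n_events / n_rows: rows come from most of the dataset.
spread = filter_then_range(n_rows, threshold=1.2 * n_events / n_rows, n_events=n_events)

print("high threshold: last selected row =", early[-1])
print("tuned threshold: last selected row =", spread[-1])
```

If the threshold is set too close to n_events / n_rows, the pass may end with fewer than n_events rows, so some margin is needed; how much depends on how strictly the exact count matters.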
This summer, in TMVA, we also started the development of a Python API to loop over a dataset and return the data as batches of NumPy arrays, to feed them into ML tools, but that’s not in production yet.
Ah, this is unfortunate. Being able to randomly select rows would be useful for slimming, e.g., calibration datasets where there may be correlations between adjacent rows. I suppose one could create a column of random numbers and use that to randomly select a fraction of the data, as long as you don’t care about getting a precise number of rows at the end.
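As a rough illustration of the "random column" idea and of how imprecise the final row count is, here is a small pure-Python sketch (no ROOT involved). sample_fraction is a hypothetical helper, and the dataset size and fraction are assumed values. Each row is kept independently with probability fraction, so the number of kept rows follows a binomial distribution with mean n_rows * fraction and standard deviation sqrt(n_rows * fraction * (1 - fraction)).

```python
import math
import random


def sample_fraction(n_rows, fraction, seed=0):
    """Keep each row independently with probability `fraction`.

    Hypothetical helper for illustration only; returns the kept row indices.
    """
    rng = random.Random(seed)
    return [i for i in range(n_rows) if rng.random() < fraction]


n_rows, fraction = 200_000, 0.05  # assumed values

kept = sample_fraction(n_rows, fraction)

# Binomial expectation and spread for the number of kept rows.
expected = n_rows * fraction
sigma = math.sqrt(n_rows * fraction * (1 - fraction))

print(f"kept {len(kept)} rows, expected {expected:.0f} +/- {sigma:.0f}")
```

For large datasets the relative fluctuation is small, which is why this approach is usually acceptable when only the approximate size of the slimmed sample matters.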