Shuffle RDataFrame?

mwilkins · October 23, 2022, 3:27pm

Continuing the discussion from Use RDataFrame to produce small training set:

I would like to randomly select a defined number of rows from my RDataFrame (à la pandas.DataFrame.sample). Is there a way to do this yet? The above post from 2 years ago said it was in progress.

couet · October 25, 2022, 9:23am

I think @eguiraud can help you.

eguiraud · October 26, 2022, 7:35am

Hello @mwilkins ,

sorry for the late reply (ACAT…). Random access into a TTree is extremely slow, so we don’t really have something like pandas’ sample.

I think the closest thing RDF offers is: df.Filter("some_rng < some_threshold").Range(n_events). The Range is to make sure you get exactly the number of events you want, but unfortunately Range is only available in single-thread runs.
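To make the semantics of that Filter + Range combination concrete, here is a minimal pure-Python sketch with no ROOT involved. The function `filter_then_range` and its parameters are hypothetical names chosen for illustration; Python's `random` module stands in for the `some_rng` column, and a plain `range` stands in for the dataset.

```python
import random

def filter_then_range(rows, threshold, n_events, seed=0):
    """Sequentially keep each row with probability `threshold`,
    stopping as soon as `n_events` rows have been kept."""
    rng = random.Random(seed)
    selected = []
    for row in rows:
        if len(selected) == n_events:
            break  # plays the role of Range(n_events)
        if rng.random() < threshold:  # plays the role of Filter("some_rng < some_threshold")
            selected.append(row)
    return selected

rows = range(10_000)
sample = filter_then_range(rows, threshold=0.02, n_events=100)
print(len(sample), sample[:5])
```

Two things follow from the sequential pass. If the threshold is too small, fewer than `n_events` rows may pass before the data runs out. If it is much larger than `n_events / len(rows)`, the cut-off is reached early and the selected rows all come from the beginning of the dataset, so the threshold is best chosen only slightly above that ratio.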

This summer, in TMVA, we also started developing a Python API that loops over a dataset and returns the data as batches of NumPy arrays to feed into ML tools, but that’s not in production yet.

Cheers,
Enrico

mwilkins · October 26, 2022, 2:08pm

Ah, this is unfortunate. Being able to randomly select rows would be useful for slimming, e.g., calibration datasets where there may be correlations between adjacent rows. I suppose one could create a column of random numbers and use that to randomly select a fraction of the data, as long as you don’t care about getting a precise number of rows at the end.
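As a rough illustration of that trade-off, the pure-Python sketch below (again without ROOT) keeps each row independently with a fixed probability. `sample_fraction` is a hypothetical helper name, and the list of integers is only a stand-in for real rows.

```python
import random

def sample_fraction(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)  # one random number per row, like a random column
    return [row for row in rows if rng.random() < fraction]

rows = list(range(100_000))

# Repeat the selection with different seeds to see how the row count varies.
counts = [len(sample_fraction(rows, 0.01, seed)) for seed in range(5)]
print(counts)  # values scatter around 1000 rather than hitting it exactly
```

The number of selected rows follows a binomial distribution, so for N rows and fraction p the spread is about sqrt(N p (1 - p)), roughly 31 rows in this example. Because each row is kept or dropped independently of its neighbours, the selection is not concentrated in any one stretch of adjacent rows.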
