I’m afraid that shuffling is not a feature that RDataFrame provides, so you’ll have to work around it. Moreover, the best approach strongly depends on how much of your data you need in memory for the training. I would propose the following solution, which suits most use cases:
- Preprocess your (generic?) samples and write out the pt slices with only the information (i.e., the columns) you need for the training - aka slimming and skimming.
- Load exactly the information you need into memory and then shuffle it from there.
Slimming and skimming:
df = ROOT.RDataFrame("<treename>", filelist) \
       .Define("<var name>", "<preprocessing>") \
       .Snapshot("training_data", "file.root", only_columns_you_need)
Data-loading for machine learning and shuffling:
df = ROOT.RDataFrame("training_data", "file.root")
data = df.AsNumpy()  # no arguments means we load all columns!
x = numpy.vstack([data[v] for v in ["order", "the", "variables"]]).T  # shape (n_events, n_variables)
x_batch = x[numpy.random.choice(x.shape[0], batch_size)]  # pick batch_size random events
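If you prefer to run over all events once per epoch instead of drawing independent random batches, you can also shuffle the whole array with a random permutation. A small sketch, continuing from the snippet above (so x and batch_size are assumed to exist already):
perm = numpy.random.permutation(x.shape[0])  # random ordering of the event indices
x_shuffled = x[perm]
for start in range(0, len(x_shuffled), batch_size):
    x_batch = x_shuffled[start:start + batch_size]  # feed this to your training step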
In principle you can also filter and load in one go (just put the respective Filter call in front of the AsNumpy call), but since you’ll do this many, many times, it’s probably more efficient to have an intermediate step.
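For reference, the one-go version could look like the sketch below; the cut and the column names are only placeholders:
df = ROOT.RDataFrame("<treename>", filelist)
data = df.Filter("pt > 20").AsNumpy(["var1", "var2"])  # example cut and columns, adapt to your selection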
In case you want to split the dataset into multiple parts, assuming that each event is statistically independent, you can use the “magic column” rdfentry_ to make this happen. For example, you can add a two-fold split to the skimming like this:
df = ROOT.RDataFrame(...)
df.Filter("rdfentry_ % 2 == 0").Snapshot(...)
df.Filter("rdfentry_ % 1 == 0").Snapshot(...)
We know that selecting a random subset of a dataset read through RDataFrame is very interesting for machine learning workflows, and we are working on making this happen more smoothly.
In case the proposal above is not efficient enough because you still have too much data after the skimming/slimming, the most efficient solutions depend on the ML library you are targeting and its specialized input pipelines (for example TFRecord in TensorFlow).
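As an illustration (not an official recipe), the arrays returned by AsNumpy can be handed to such a pipeline, e.g. tf.data in TensorFlow, which then takes care of shuffling and batching; writing actual TFRecord files would be a further step. The column names, buffer size and batch size below are only examples:
import numpy
import tensorflow as tf
import ROOT

df = ROOT.RDataFrame("training_data", "file.root")
data = df.AsNumpy(["var1", "var2"])  # placeholder column names
x = numpy.vstack([data[v] for v in ["var1", "var2"]]).T.astype("float32")

# tf.data shuffles with a buffer and batches the events for training
dataset = tf.data.Dataset.from_tensor_slices(x).shuffle(buffer_size=10000).batch(64)
for x_batch in dataset:
    pass  # feed x_batch to your training step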