Use RDataFrame to produce a small training set

Dear all,

I recently stumbled upon the very promising RDataFrame interface and I’d now like to use it to produce a small training set for a neural network. The full data set is a large number of ROOT files (each one with millions of events!), where each file only contains events within a certain range in transverse momentum (pT slices). I’d like to randomly sample a well-defined number of events for training out of all events, while taking into account the statistics of the individual pT slices in each file. How can I achieve this with ROOT’s RDataFrame? And if RDataFrame is not the appropriate choice, what else can I do?
I know how to use the RDataFrame interface to filter a data set and how to create a new one with the “Snapshot” method, but how do I load several ROOT files and “shuffle” them so that the result is a small, representative subset of the full data set?
Thank you very much in advance!

Best regards

@eguiraud @swunsch perhaps you have a simple example here?

Thanks,
Oksana.

Hi Christof!

I’m afraid that shuffling is not a feature that RDataFrame provides, so you’ll have to work around it. Moreover, the best approach strongly depends on how much of your data has to fit in memory for the training. I would propose the following two-step solution, which suits most use cases:

  1. Preprocess your (generic?) samples and write out the pT slices with only the information (i.e., the columns) you need for the training - aka slimming and skimming.
  2. Load exactly the information you need into memory and shuffle from there.

Skimming/slimming/preprocessing:

import ROOT

# Select the slice, compute the training variables and write out
# only the columns you need for the training
df = ROOT.RDataFrame("<treename>", filelist)
df.Filter("<select slice>")\
  .Define("<var name>", "<preprocessing>")\
  .Snapshot("training_data", "file.root", only_columns_you_need)

Data-loading for machine learning and shuffling:

import ROOT
import numpy

df = ROOT.RDataFrame("training_data", "file.root")
data = df.AsNumpy() # no arguments means we load all columns!
# Transpose so that each row is one event and each column one variable
x = numpy.vstack([data[v] for v in ["order", "the", "variables"]]).T
x_batch = x[numpy.random.choice(x.shape[0], batch_size)]

In principle you can also filter and load in one go (just put the respective filter in front of the AsNumpy call), but since you’ll do this many, many times, it’s probably more efficient to have the intermediate step.
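For what it’s worth, the one-go variant could look like this (the cut and column names are again hypothetical):

import ROOT

df = ROOT.RDataFrame("training_data", "file.root")
# Filter and load in a single event loop; only the listed columns
# are materialized as numpy arrays
data = df.Filter("eta > 0").AsNumpy(columns=["pt_log", "eta"])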

In case you want to split the dataset into multiple parts (assuming each event is statistically independent), you can use the “magic column” rdfentry_ to make this happen. For example, you can add a two-fold split to the skimming like this:

df = ROOT.RDataFrame(...)
df.Filter("rdfentry_ % 2 == 0").Snapshot(...) # even entries
df.Filter("rdfentry_ % 2 == 1").Snapshot(...) # odd entries

We know that selecting a random subset of a dataset through RDataFrame is very interesting for machine learning workflows, and we are working on making this happen more smoothly :slight_smile:

In case the proposal above is not efficient enough because you still have too much data after the skimming/slimming, the most efficient solutions depend on the ML library you are targeting and its specialized input pipelines (for example TFRecord in TensorFlow).

Best
Stefan
