Iterate over dataframe

I am using RDataFrame to store my data. For example, if I want to filter for one event and then store it in a numpy array, I'd use:

import ROOT as R  # assuming R is the PyROOT module
df = R.RDataFrame("myTree", "myFile.root")  # dataframe
df_event = df.Filter("EventNr == 2")  # event nr 2
data_np = df_event.AsNumpy()

This works fine, but the problem begins when I want to loop over all EventNr and then store each event in its own numpy array. For example, I tried the following:

df = R.RDataFrame("myTree", "myFile.root")  # dataframe
df_event = df.Filter("EventNr <= 10")  # first 10 events
data_np = df_event.AsNumpy()

This will simply store everything as one numpy array instead of event by event.

Is there maybe a way to achieve this?

Hint: EventNr is one of the leaves in myTree, and there are many others, which RDataFrame exposes as columns.

EDIT:

I can solve my question by simply doing:

df = R.RDataFrame("myTree", "myFile.root")  # dataframe
df_event1 = df.Filter("EventNr == 1")  # event nr 1
data_np1 = df_event1.AsNumpy()

df_event2 = df.Filter("EventNr == 2")  # event nr 2
data_np2 = df_event2.AsNumpy()

# and so on ..

But this is not optimal when I am working with a large number of events.

Creating (a Python dictionary of) numpy arrays with one element each is something of an anti-pattern. Why do you want to extract the data this way? In most cases I would think you’d be better off taking your many arrays of data and reshaping them if you need some kind of ‘per-event’ array of information, e.g. to feed into an MVA algorithm.

Hi @Karl007 ,
I am not sure I understand the question, the numpy arrays will have one entry per event.

So after arrays = df.AsNumpy(), arrays['x'][0] will contain the value of column x for event 0, arrays['x'][1] for event 1, and so on. With the caveat that multi-thread event loops will shuffle the entries of the output arrays w.r.t. their original order (in blocks, while maintaining the correspondence between values of different columns for a given event).

Cheers,
Enrico

Hi @eguiraud

In my data, an event has more than one entry, like this:


[ 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2  2  2
  3  2  2  2  2  2  2  2  3  3  3  3  2  2  2  2  3  3  3  3  3  3  3  3
  3  2  2  2  2  2  2  2  2  2  2  2  3  3  3  3  3  3  3  3  4  4  4  4
  5  5  5  5  5  4  4  4  4  4  4  4  4  5  5  5  5  5  5  5  5  4  4  4
  4  4  4  4  4  4  4  4  5  5  5  5  5  5  5  5  5  5  6  6  6  6  6  7
  7  7  7  7  7  6  6  6  6  6  6  6  7  7  7  7  7  7  6  6  6  6  6  6
  6  6  6  6  6  7  7  7  7  7  7  7  7  7  7  7  8  8  8  8  8  8  8  8
  8  8  8  8  8  8  8  8  8  8  8  8  8  9  9  9  9  9  9  9  9  9  9  9
  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9
  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9 10 10 10 10 10 10 10 10
 10 10 10 10 10 10 10 10 10 10 10 10]

This is why arrays['x'][0] gets you only the first entry of the first event. Sorry if I caused any confusion, let me explain it better: I would like to have the values of each event separately, because I must apply an ML algorithm to each event individually. As I showed earlier, doing this for one event is simple. But when I store a bunch of events together using AsNumpy() and then pass the result to the algorithm, it treats the whole bunch as one event, because all the column values are stored in one array.

Ideally you would have something like (pseudocode):

for event in events:
     apply ML algorithm

Is it more clear now?

Hi @nmangane

I am guessing that by element you mean column or feature? If that’s what you mean then no, there are multiple columns in the dictionary. And I am trying to extract the data event by event to pass each event individually to the ML algorithm.

Ah, a more common way to represent data like that would be to have one array per event.

With your format, one option is to export EventNr and all other columns as numpy arrays, then put the arrays into a pandas dataframe and use groupby to group the column values by EventNr.
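A minimal sketch of that grouping, under some assumptions: the dictionary below is a small hand-written stand-in for what AsNumpy() returns, and the column names "x" and "y" are hypothetical (only EventNr comes from this thread). It is meant as an illustration, not a tuned solution.

```python
import numpy as np
import pandas as pd

# Stand-in for the dictionary returned by df.AsNumpy().
# "x" and "y" are hypothetical column names used only for illustration.
arrays = {
    "EventNr": np.array([1, 1, 2, 3, 2, 3, 3]),
    "x": np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]),
    "y": np.array([10, 11, 12, 13, 14, 15, 16]),
}

# One row per entry, one column per branch.
pdf = pd.DataFrame(arrays)

# Group rows by EventNr and keep one 2D numpy array per event.
# groupby sorts the keys by default, so events come out in ascending order.
per_event = {
    event_nr: group.drop(columns="EventNr").to_numpy()
    for event_nr, group in pdf.groupby("EventNr")
}

for event_nr, features in per_event.items():
    # features has shape (entries in this event, number of feature columns)
    print(event_nr, features.shape)
```

Each value in per_event can then be handed to the per-event algorithm in a plain for loop, which matches the pseudocode above. Entries of the same event do not need to be contiguous in the input for this to work.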


That’s what I was trying to avoid :smile: because it might be too slow when you have too many events.

But it’s indeed one solution so thank you!

One more thing, do you think I could use np.where on the dictionary produced by AsNumpy()?

On the other hand it sounds like something that you only have to do once in a while (when you change the input data), not every time you want to run your analysis.

I am not sure how to apply np.where to this problem, sorry.
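For what it's worth, here is a hedged sketch of how np.where could be used on such a dictionary, together with a sort-and-split alternative that handles all events in one pass. The dictionary is again a hand-written stand-in for the AsNumpy() output, and "x" is a hypothetical column name.

```python
import numpy as np

# Stand-in for the dictionary returned by df.AsNumpy();
# "x" is a hypothetical column name.
arrays = {
    "EventNr": np.array([1, 1, 2, 3, 2, 3, 3]),
    "x": np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]),
}
event_nr = arrays["EventNr"]

# Option 1: np.where selects the entries of a single event.
idx = np.where(event_nr == 2)[0]
x_event2 = arrays["x"][idx]

# Option 2: split every event at once.
# A stable sort keeps the original order of entries within each event.
order = np.argsort(event_nr, kind="stable")
unique_events, first_idx = np.unique(event_nr[order], return_index=True)
# first_idx marks where each event starts in the sorted array.
chunks = np.split(arrays["x"][order], first_idx[1:])
x_per_event = dict(zip(unique_events.tolist(), chunks))

for nr, values in x_per_event.items():
    print(nr, values)
```

Calling np.where once per event scans the full EventNr array every time, so the cost grows with the number of events times the number of entries. The sort-and-split variant sorts once and is likely the better fit when there are many events.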

Sure, thanks!

I also appreciate the AsNumpy() method, it gives a huge speed boost compared to a classical loop. :slight_smile:

(Good! It’s even faster if you manage to do some pre-processing as part of the inner RDF event loop with Filters and Defines before exporting data with AsNumpy :smiley: )
