Iterate over dataframe

Karl007 · March 29, 2022, 1:28pm

I am using RDataFrame to store my data. For example, to filter for a single event and store it in numpy arrays, I use:

import ROOT as R  # assuming R is an alias for the ROOT module
df = R.RDataFrame("myTree", "myFile.root")  # dataframe
df_event = df.Filter("EventNr == 2")  # event nr 2
data_np = df_event.AsNumpy()

This works fine, but the problem begins when I want to loop over all EventNr and then store each event in its own numpy array. For example, I tried the following:

df = R.RDataFrame("myTree", "myFile.root")  # dataframe
df_event = df.Filter("EventNr <= 10")  # first 10 events
data_np = df_event.AsNumpy()

This simply stores all the selected events together in one numpy array per column instead of event by event.

Is there maybe a way to achieve this?

Hint: EventNr is one of the leaves in myTree; there are many others, which RDataFrame exposes as columns.

EDIT:

I can solve my question by simply doing:

df = R.RDataFrame("myTree", "myFile.root")  # dataframe
df_event1 = df.Filter("EventNr == 1")  # event nr 1
data_np1 = df_event1.AsNumpy()

df_event2 = df.Filter("EventNr == 2")  # event nr 2
data_np2 = df_event2.AsNumpy()

# and so on ..

But this is not optimal when I am working with a large number of events.

nmangane · March 29, 2022, 6:54pm

Creating (a Python dictionary of) numpy arrays with one element is a kind of anti-pattern. Why do you want to extract data this way? I would think that in most cases you’d be better off taking your many arrays of data and reshaping them if you need some kind of ‘per-event’ array of information, e.g. to feed into an MVA algorithm.

eguiraud · March 30, 2022, 7:39am

Hi @Karl007 ,
I am not sure I understand the question: the numpy arrays will have one entry per event.

So after arrays = df.AsNumpy(), arrays['x'][0] will contain the value of column x for event 0, arrays['x'][1] for event 1, and so on. One caveat: multi-thread event loops will shuffle the output columns w.r.t. their original order (in blocks, and maintaining the correspondence between values of different columns for a given event).

Cheers,
Enrico

Karl007 · March 30, 2022, 8:55am

Hi @eguiraud

In my data, an event has more than one entry. The EventNr column looks like this:


[ 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  2  2  2
  3  2  2  2  2  2  2  2  3  3  3  3  2  2  2  2  3  3  3  3  3  3  3  3
  3  2  2  2  2  2  2  2  2  2  2  2  3  3  3  3  3  3  3  3  4  4  4  4
  5  5  5  5  5  4  4  4  4  4  4  4  4  5  5  5  5  5  5  5  5  4  4  4
  4  4  4  4  4  4  4  4  5  5  5  5  5  5  5  5  5  5  6  6  6  6  6  7
  7  7  7  7  7  6  6  6  6  6  6  6  7  7  7  7  7  7  6  6  6  6  6  6
  6  6  6  6  6  7  7  7  7  7  7  7  7  7  7  7  8  8  8  8  8  8  8  8
  8  8  8  8  8  8  8  8  8  8  8  8  8  9  9  9  9  9  9  9  9  9  9  9
  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9
  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9 10 10 10 10 10 10 10 10
 10 10 10 10 10 10 10 10 10 10 10 10]

This is why doing arrays['x'][0] gets you only the first entry of the first event. Sorry if I caused any confusion; let me explain it better. I would like to have the values of each event separately because I must apply an ML algorithm to each event individually. As I showed earlier, doing this with one event is simple. But when I store a bunch of events together using AsNumpy() and then pass that to the algorithm, it treats the bunch of events as one event: all the column values appear to come from one event because they are stored in one array.

Ideally you would have something like (pseudocode):

for event in events:
     apply ML algorithm

Is it more clear now?

Karl007 · March 30, 2022, 9:01am

Hi @nmangane

I am guessing that by element you mean column or feature? If that’s what you mean then no, there are multiple columns in the dictionary. And I am trying to extract the data event by event to pass each event individually to the ML algorithm.

eguiraud · March 30, 2022, 9:09am

Ah, a more common way to represent data like that would be to have one array per event.

With your format, one option is to export EventNr and all other columns as numpy arrays, then put the arrays into a pandas dataframe and use groupby to group the column values by EventNr.
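
A minimal sketch of that suggestion, under stated assumptions: the dictionary below is a hand-written stand-in for the output of AsNumpy() (so the snippet runs without ROOT), and the column names x and y are hypothetical. Only EventNr comes from this thread.

```python
import numpy as np
import pandas as pd

# Stand-in for `arrays = df.AsNumpy()`: column name -> 1D numpy array.
# "x" and "y" are hypothetical feature columns used only for illustration.
arrays = {
    "EventNr": np.array([1, 1, 2, 3, 2, 3, 3]),
    "x": np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]),
    "y": np.array([10, 11, 12, 13, 14, 15, 16]),
}

pdf = pd.DataFrame(arrays)

# Group rows by EventNr and keep one 2D feature array per event.
# Rows of one event do not need to be contiguous in the input.
per_event = {
    event_nr: group.drop(columns="EventNr").to_numpy()
    for event_nr, group in pdf.groupby("EventNr")
}

for event_nr, features in per_event.items():
    # features has shape (entries in this event, number of feature columns)
    print(event_nr, features.shape)
```

Each value in per_event could then be passed to the per-event algorithm inside an ordinary Python loop. Whether this is fast enough for a very large number of events would need to be measured on the real data.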

Karl007 · March 30, 2022, 9:13am

That’s what I was trying to avoid because it might be too slow when you have too many events.

But it’s indeed one solution so thank you!

One more thing: do you think I could use np.where on the dictionary produced by AsNumpy()?

eguiraud · March 30, 2022, 9:16am

On the other hand it sounds like something that you only have to do once in a while (when you change the input data), not every time you want to run your analysis.

I am not sure how to apply np.where to this problem, sorry.
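
For reference, one possible reading of the np.where idea, as a sketch only: the dictionary is a hand-written stand-in for the output of AsNumpy(), and the column name x is hypothetical. np.where selects the rows of a single event; splitting all events at once is shown with a stable sort plus np.split, which is a different technique from np.where.

```python
import numpy as np

# Stand-in for `arrays = df.AsNumpy()`; "x" is a hypothetical feature column.
arrays = {
    "EventNr": np.array([1, 1, 2, 3, 2, 3, 3]),
    "x": np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]),
}
event_nr = arrays["EventNr"]

# Option A: np.where returns the row indices that belong to one event.
idx = np.where(event_nr == 2)[0]
x_event2 = arrays["x"][idx]

# Option B: split every event in one pass.
# A stable sort keeps the original order of entries within each event.
order = np.argsort(event_nr, kind="stable")
unique_events, first_index = np.unique(event_nr[order], return_index=True)
chunks = np.split(arrays["x"][order], first_index[1:])
x_per_event = dict(zip(unique_events.tolist(), chunks))

for nr, values in x_per_event.items():
    print(nr, values)
```

Option A scans the whole EventNr array once per event, so it may become slow with many events; Option B sorts once and then only slices.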

Karl007 · March 30, 2022, 9:20am

Sure, thanks!

I also appreciate the AsNumpy() method; it gives a huge speed boost compared to a classical loop.

eguiraud · March 30, 2022, 9:24am

(Good! It’s even faster if you manage to do some pre-processing as part of the inner RDF event loop with Filters and Defines before exporting data with AsNumpy.)
