This works fine, but the problem begins when I want to loop over all EventNr and then store each event in its own numpy array. For example, I tried the following:
Creating (a Python dictionary of) numpy arrays with one element each is a kind of anti-pattern. Why do you want to extract data this way? In most cases you would be better off taking your many arrays of data and reshaping them if you need some kind of per-event array of information, e.g. to feed into an MVA algorithm.
Hi @Karl007 ,
I am not sure I understand the question, the numpy arrays will have one entry per event.
So after arrays = df.AsNumpy(), arrays['x'][0] will contain the value of column x for event 0, arrays['x'][1] for event 1, and so on. With the caveat that multi-thread event loops will shuffle the output columns w.r.t. their original order (in blocks and maintaining the correspondence between values of different columns for a given event).
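To illustrate the layout described above, here is a minimal sketch that mimics the dict-of-arrays structure returned by `df.AsNumpy()` with plain numpy arrays (the values are made-up toy data, no ROOT needed):

```python
import numpy as np

# Stand-in for the dict that df.AsNumpy() returns: one numpy array
# per column, with one entry per event (hypothetical toy values).
arrays = {
    "x": np.array([1.0, 2.0, 3.0]),
    "y": np.array([10.0, 20.0, 30.0]),
}

print(arrays["x"][0])  # value of column x for event 0 -> 1.0
print(arrays["x"][1])  # value of column x for event 1 -> 2.0
```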
This is why doing arrays['x'][0] gets you only the first entry of the first event. That said, sorry if I caused you any confusion; let me explain it better. I would like to have the values of each event separately, because I must apply an ML algorithm to each event individually. As I showed earlier, doing this for one event is simple. But when I store a bunch of events together using the AsNumpy() function and then pass that to the algorithm, it treats the whole bunch as a single event; that is, all the column values are read as if they belonged to one event, because they are stored in one array.
Ideally you would have something like (pseudocode):
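One way such pseudocode could look, sketched as runnable Python: the dict of arrays stands in for the `AsNumpy()` output, and `apply_ml` is a hypothetical placeholder for the per-event ML algorithm (both names are assumptions, not from the original post):

```python
import numpy as np

# Hypothetical stand-in for df.AsNumpy(): one array per column,
# one entry per event.
arrays = {
    "x": np.array([1.0, 2.0, 3.0]),
    "y": np.array([10.0, 20.0, 30.0]),
}

def apply_ml(event):
    # Placeholder for the per-event ML algorithm.
    return event.sum()

n_events = len(next(iter(arrays.values())))
results = []
for i in range(n_events):
    # Build one feature vector per event from the column arrays.
    event = np.array([arrays[col][i] for col in arrays])
    results.append(apply_ml(event))

print(results)  # one ML output per event
```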
I am guessing by element you mean column or feature? If that’s what you mean then no, there are multiple columns in the dictionary. And I am trying to extract event by event to pass them individually to ML algorithm.
Ah, a more common way to represent data like that would be to have one array per event.
With your format, one option is to export EventNr and all other columns as numpy arrays, then put the arrays into a pandas dataframe and use a groupby to group the column values by EventNr.
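A minimal sketch of that pandas route, assuming toy data in which several rows share the same EventNr (e.g. one row per particle); the column names are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the AsNumpy() output, where several
# rows belong to the same event.
arrays = {
    "EventNr": np.array([0, 0, 1, 1, 1, 2]),
    "x":       np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]),
}

df = pd.DataFrame(arrays)
# groupby collects the column values belonging to each event:
# the result is one numpy array per event.
per_event = {nr: g["x"].to_numpy() for nr, g in df.groupby("EventNr")}

print(per_event[1])  # x values of event 1
```

Each value of `per_event` can then be fed to the ML algorithm one event at a time.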
On the other hand it sounds like something that you only have to do once in a while (when you change the input data), not every time you want to run your analysis.
I am not sure how to apply np.where to this problem, sorry.
(Good! It’s even faster if you manage to do some pre-processing as part of the inner RDF event loop with Filters and Defines before exporting data with AsNumpy )