Hi @rooter_03 ,
I would like to provide a generic RDF solution for “adding the contents of a numpy array as an additional RDF column”, but I am having some conceptual problems making it work in the case where the array is added after a filter and the event loop is multi-threaded. Without that case solved I don’t think we can add this to upstream ROOT. A review of the different cases follows; let me know if I’m missing something:
RDF with no filters
In this case something like my first solution in the thread should work: `rdfentry_` is a valid index into the numpy array column if there are no filters, in both single-thread and multi-thread event loops.
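The no-filter case can be sketched in plain Python/numpy (no ROOT needed for the illustration; the array and entry numbers below are made up), simulating what a `Define` that indexes into the array with `rdfentry_` would do:

```python
import numpy as np

# Hypothetical extra data: exactly one value per entry of the dataset.
extra = np.array([10.0, 11.0, 12.0, 13.0, 14.0])

def extra_column(rdfentry):
    # With no filters (and no Range), rdfentry_ runs over 0..N-1, so it
    # is a valid index into the array even if a multi-thread event loop
    # visits the entries in an arbitrary order.
    return extra[rdfentry]

# Simulate a multi-thread loop visiting entries out of order:
values = {entry: extra_column(entry) for entry in (3, 0, 4, 1, 2)}
```

The key point is that the lookup depends only on the entry number, not on the order in which entries are processed.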
RDF with a Range
If a Range is present the event loop will be single-threaded (multi-threading + Range is not supported). In this case you can use your solution above with an extra trick: store the last `rdfentry_` value seen in `get_ind`, and if the new `rdfentry_` value is lower than that, it’s a new event loop and you can automatically reset the index.
I would still suggest not using random numbers for `ran_int`, as that might cause collisions. This is an example solution that makes use of a helper C++ functor: example.py (1004 Bytes)
I think the example above should answer your latest question about automating the index reset (but note that it’s not a thread-safe solution, let me know if you need a thread-safe version).
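The auto-reset trick can be illustrated with a stateful functor. The attached example uses a C++ functor; this is a hypothetical pure-Python equivalent of the same idea (class and variable names are made up), and like the attachment it is not thread-safe:

```python
class GetInd:
    """Returns consecutive indices 0, 1, 2, ... and resets automatically
    when a new event loop starts, detected because the incoming
    rdfentry_ value is lower than the last one seen."""

    def __init__(self):
        self.last_entry = -1
        self.index = -1

    def __call__(self, rdfentry):
        if rdfentry < self.last_entry:
            # rdfentry_ went backwards: a new event loop has started.
            self.index = -1
        self.last_entry = rdfentry
        self.index += 1
        return self.index

get_ind = GetInd()
# First event loop over a Range covering entries 10..14:
first = [get_ind(e) for e in range(10, 15)]
# Second event loop: rdfentry_ starts again from 10 (< 14), so the
# index resets without any manual intervention.
second = [get_ind(e) for e in range(10, 15)]
```

This only works because a Range forces a single-threaded loop, so entries arrive in increasing order within one loop.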
RDF with filters
This is the tricky case. If the RDF dataset has 1000 entries but the numpy arrays only have 100 entries, corresponding to the original entries that pass a certain selection, it is difficult to establish a correspondence between the selected original entries and the array entries in multi-thread event loops, because the original entries will be processed in an unspecified order.
We would need a dictionary that specifies which of the 1000 original entries corresponds to which of the 100 entries in the numpy array.
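The dictionary approach can be sketched as follows (the entry numbers and values are made up for illustration): given the original entry numbers that pass the selection, build a map from original entry number to position in the numpy array, and use it in the `Define`:

```python
import numpy as np

# Hypothetical inputs: the entry numbers (in the original dataset) that
# passed the selection, and the matching array values, one per
# surviving entry.
selected_entries = np.array([2, 5, 7])
array_values = np.array([0.1, 0.2, 0.3])

# Map original entry number -> position in the numpy array. The lookup
# then works regardless of the order in which a multi-thread event
# loop processes the entries.
entry_to_index = {int(e): i for i, e in enumerate(selected_entries)}

def lookup_extra(rdfentry):
    return array_values[entry_to_index[rdfentry]]
```

The cost is that whoever produced the numpy array must also record which original entries it corresponds to.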
Another workaround is to first perform the selection and `Snapshot` the filtered dataset into a new file, and then add the numpy arrays as new columns of the new, filtered dataset.
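The Snapshot workaround reduces the problem back to the no-filter case. The logic can be sketched without ROOT (with RDF, step 1 would be something like `rdf.Filter(...).Snapshot(...)`; here a filtered numpy copy stands in for the snapshotted dataset):

```python
import numpy as np

# Hypothetical original dataset column and a selection.
x = np.arange(10)
mask = x % 2 == 0            # the "filter": 5 of the 10 entries pass

# Step 1: "Snapshot" the filtered dataset (here: just a filtered copy).
filtered_x = x[mask]

# Step 2: the numpy array has exactly one value per surviving entry,
# so in the new dataset plain positional indexing (rdfentry_) is valid
# again, as in the no-filter case.
extra = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
assert len(extra) == len(filtered_x)
```

The price is one extra pass over the data and the disk space for the snapshotted file.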
Cheers,
Enrico