Adding data from an external container to a DataFrame

Hi,

From the tutorial at:

https://root.cern/doc/master/pyroot004__NumbaDeclare_8py.html

it seems that one can write an add_array function like this:

import ROOT
import numpy

#------------------------------------------
def add_array(name, arr_val, df):
    df_size = df.Count().GetValue()
    if arr_val.size != df_size:
        raise ValueError('Array size is different from dataframe size: {}/{}'.format(arr_val.size, df_size))

    fun = '''
@ROOT.Numba.Declare(['int'], 'float')
def get_val_{}(index):
    return arr_val[index]
'''.format(name)
    exec(fun, {'ROOT': ROOT, 'arr_val': arr_val})

    ROOT.gInterpreter.ProcessLine('int index_df = -1;')
    df=df.Define(name, 'index_df+=1; return Numba::get_val_{}(index_df);'.format(name))

    return df
#------------------------------------------
df=ROOT.RDataFrame(10)

arr_val_1=numpy.array([1.] * 10)
arr_val_2=numpy.array([2.] * 10)

df=add_array('x_1', arr_val_1, df)
df=add_array('x_2', arr_val_2, df)

df.Snapshot('tree', 'file.root')

ifile=ROOT.TFile('file.root')
ifile.tree.Scan()

This could be a good implementation of a general function that plugs containers into dataframes. However, I need the exec part, given that a Numba function cannot be declared more than once with the same name.

Also, index_df seems to be needed: if the dataframe has been filtered, rdfentry_ is no longer a good index, since it jumps, e.g. 2, 22, 34, 56. Is there a safer way of doing this? Am I doing anything in an inefficient/wrong way?
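To make the indexing problem concrete, here is a minimal plain-Python sketch (no ROOT involved; the entry numbers and array values are made up): after a Filter, the surviving rows keep their original entry numbers, so rdfentry_ jumps, while a running counter like index_df stays contiguous and can safely index the external array.

```python
# Sketch of why rdfentry_ is not a valid index into an external
# array once a Filter is applied (hypothetical numbers).
entries = list(range(10))                     # original entry numbers 0..9
passing = [e for e in entries if e % 3 == 0]  # a Filter keeps entries 0, 3, 6, 9

arr_val = [10.0, 20.0, 30.0, 40.0]            # external array: one value per *surviving* row

# Wrong: indexing arr_val with the original entry number.
# The last surviving entry is 9, but arr_val has only 4 elements -> IndexError.

# Right: a running counter over surviving rows (what index_df emulates).
paired = [(entry, arr_val[pos]) for pos, entry in enumerate(passing)]
print(paired)  # [(0, 10.0), (3, 20.0), (6, 30.0), (9, 40.0)]
```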

This add_array, or a better version of it, should also be in ROOT, given that it is useful in many cases.

Cheers.



ROOT Version: 6.22/06
Platform: CentOS 7
Compiler: gcc8-opt


Hi @rooter_03 ,
MakeNumpyDataFrame lets you build an RDataFrame starting from a dictionary of numpy arrays; see the tutorial tutorials/dataframe/df032_MakeNumpyDataFrame.py.

Your code above would become:

import ROOT
import numpy
arrays = {'x_1': numpy.full(10, 1.), 'x_2': numpy.full(10, 2.)}
ROOT.RDF.MakeNumpyDataFrame(arrays).Snapshot('tree', 'file.root')
ifile = ROOT.TFile('file.root')
ifile.tree.Scan()

Cheers,
Enrico

Hi,

Thanks for your reply, but that does not work for me, and it will very rarely be needed; at least I have never needed it. The idea is not to build a new dataframe from dictionaries, but to add data to a dataframe already built from a tree. Could you please tell me whether what I showed in the original post is the best (safest, simplest) way to achieve that?

I have already added that function to my personal library and I am seeing this like:

input_line_261:5:7: error: redefinition of 'get_val_Jpsi_M_smeared'
float get_val_Jpsi_M_smeared(int x_0) {
      ^
input_line_158:5:7: note: previous definition is here
float get_val_Jpsi_M_smeared(int x_0) {
      ^
Traceback (most recent call last):
  File "test.py", line 33, in <module>
    check(year, tree)
  File "test.py", line 17, in check

because in this particular case the dataframe is not meant to take the new Jpsi_M_smeared variable multiple times; rather, the process does this for multiple dataframes, and as soon as it happens more than once there seems to be a shared object in memory that prevents a new definition (for a new dataframe) from being made. What seems to fix the issue for good is:

def add_array(name, arr_val, df):
    ran = ROOT.TRandom3(0)
    ran_int = ran.Integer(1000000)

    df_size = df.Count().GetValue()
    if arr_val.size != df_size:
        raise ValueError('Array size is different from dataframe size: {}/{}'.format(arr_val.size, df_size))

    fun = '''
@ROOT.Numba.Declare(['int'], 'float')
def get_val_{}_{}(index):
    return arr_val[index]
'''.format(name, ran_int)
    exec(fun, {'ROOT': ROOT, 'arr_val': arr_val})

    ROOT.gInterpreter.ProcessLine('int index_df = -1;')
    df=df.Define(name, 'index_df+=1; return Numba::get_val_{}_{}(index_df);'.format(name, ran_int))

    return df

i.e. using a random number in the name of the function every time it gets called.

Cheers.

I guess you thought that I wanted to make a dataframe from arrays because of what I put in my first post. What I intended to do there was to build a testing dataframe; otherwise I would have had to attach a ROOT file to the post.

Indeed I was solving for the sample code you shared.

The problem with the multiple definition is that the function you pass to Numba.Declare produces a corresponding C++ function, declared to ROOT’s interpreter, and C++ does not allow function redefinitions. The workaround with the random number will fail if you draw the same number twice; a steadily increasing counter would be better.
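As a sketch of that suggestion (plain Python; unique_name is a hypothetical helper, not a ROOT API): a module-level itertools.count never repeats, so name suffixes drawn from it cannot collide, unlike random draws.

```python
import itertools

# Module-level counter: each call yields a new, never-repeating integer,
# so generated function names like get_val_x_0, get_val_x_1, ... are unique.
_suffix = itertools.count()

def unique_name(base):
    """Return a name guaranteed not to have been produced before."""
    return '{}_{}'.format(base, next(_suffix))

print(unique_name('get_val_x'))  # get_val_x_0
print(unique_name('get_val_x'))  # get_val_x_1
```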

A possibly faster/cleaner way is making use of the Python/C++ interplay that PyROOT allows:

import ROOT
import numpy as np

ROOT.gInterpreter.Declare('''
ROOT::RDF::RNode AddArray(ROOT::RDF::RNode df, ROOT::RVec<double> &v, const std::string &name) {
    return df.Define(name, [&](ULong64_t e) { return v[e]; }, {"rdfentry_"});
}
''')

df = ROOT.RDataFrame(10).Define("x", "42")
arr = np.full(10, 1.)
arr_as_rvec = ROOT.VecOps.AsRVec(arr) # basically free, there is no copy here
df = ROOT.AddArray(ROOT.RDF.AsRNode(df), arr_as_rvec, "y")
df.Display().Print()

For simplicity I left out the extra complication of indexing the array correctly.

Cheers,
Enrico

Hi,

Thanks for your reply. That is going to break like the numba example: if the Declare line is put in a function and that function gets called twice, the interpreter will say that the function was already declared. The only way for that to work would be to put it in rootlogon.C so that it gets loaded once and only once, every time ROOT starts.

However, I am not sure whether that would work when sending jobs to computing clusters. Another thing that could work would be:

if not hasattr(ROOT, 'AddArray'):
    ROOT.gInterpreter.Declare(...

so that the Declare line only kicks in when the function has not been declared yet. Regarding your simplicity argument: could you please post a working example? That way other people who come here can just use it. In my case I cannot get it to work:

import ROOT

import numpy as np  

#-------------
def declare_struc():
    if not hasattr(ROOT, 'RDFAddArray'):
        ROOT.gInterpreter.Declare('''
        ROOT::RDF::RNode RDFAddArray(ROOT::RDF::RNode df, ROOT::RVec<double> &v, const std::string &name) 
        {
            unsigned rdf_entry = 0;
            return df.Define(name, [&]() { auto val = v[rdf_entry]; rdf_entry++; return val; });
        }
        ''')
#-------------

declare_struc()

df = ROOT.RDataFrame(10).Define("x", "42")

arr = np.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5])

arr_as_rvec=ROOT.VecOps.AsRVec(arr)

df = ROOT.RDFAddArray(ROOT.RDF.AsRNode(df), arr_as_rvec, "y")
df = ROOT.RDFAddArray(ROOT.RDF.AsRNode(df), arr_as_rvec, "z")

df.Display().Print()

the result is:

x  | y              | z              | 
42 | 0.0000000      | 1.1000000      | 
42 | 2.2000000      | 3.3000000      | 
42 | 4.4000000      | 5.5000000      | 
42 | 0.0000000      | 2.4209217e-322 | 
42 | 6.8990104e-310 | 9.2860178e+242 | 

it seems that, somehow, there is cross-talk between the call that adds the y column and the one that adds the z column, i.e., they both share the same rdf_entry. Do you know how we could get around this?

Cheers.

No, because you only need to declare it once, while, if I understand the code correctly, your code required declaring one helper function per array.
You can add that Declare to the beginning of your program or in the __init__.py of your Python module, for example.

My example should run correctly with a recent-enough ROOT version, with ROOT master it prints:

x  | y         |
42 | 1.0000000 |
42 | 1.0000000 |
42 | 1.0000000 |
42 | 1.0000000 |
42 | 1.0000000 |

This is a use-after-delete of rdf_entry: by the time the event loop starts, the rdf_entry variable has gone out of scope. It looks like, when you ran the code, both dangling references happened to point to the same memory region.

If you can’t use rdfentry_ to index the arrays I guess you can Define your own counter. This:

import ROOT
import numpy as np

ROOT.gInterpreter.Declare('''
ROOT::RDF::RNode AddArray(ROOT::RDF::RNode df, ROOT::RVec<double> &v, const std::string &name) {
    return df.Define(name, [&](unsigned c) { return v[c]; }, {"counter"});
}

unsigned counter = 0;
''')

df = ROOT.RDataFrame(10).Define("counter", "counter++")
arr1 = ROOT.VecOps.AsRVec(np.full(10, 1.))
arr2 = ROOT.VecOps.AsRVec(np.full(10, 2.))
df = ROOT.AddArray(ROOT.RDF.AsRNode(df), arr1, "y")
df = ROOT.AddArray(ROOT.RDF.AsRNode(df), arr2, "z")
df.Display().Print()

prints this:

counter | y         | z         |
0       | 1.0000000 | 2.0000000 |
1       | 1.0000000 | 2.0000000 |
2       | 1.0000000 | 2.0000000 |
3       | 1.0000000 | 2.0000000 |
4       | 1.0000000 | 2.0000000 |

Cheers,
Enrico

P.S.
note that if you want to support multiple event loops using that same counter variable, you have to set it back to 0 before starting each event loop.

Hi,

Thanks for your reply.

Yes, you are right there; I realized it a few minutes after I submitted my reply. The numba example needs a separate declaration for each input, while here several input containers can be processed with the same function.

The part that was not included was the replacement of rdfentry_. I have noticed that, as I said before, this variable follows the 0, 1, 2… pattern only when Filter has not been called. There is no guarantee that the input has not been filtered; therefore this variable can never be relied upon, and I think it is dangerous to use it.

Ok, this was completely obscure to me. The way all this works is pretty alien, and I did notice strange behaviours when using the new rdf_entry. The definition of a counter like in the example you posted does work. However, I would never have been able to figure this out myself, and I think the same would be true for most people ending up in this thread.

The way I understand Define(x, y) is that you define a column called x as the expression y. Now, counter is not in the dataframe, given that it is brand new and empty, so using it in y seems VERY strange and completely counterintuitive, though not necessarily wrong.

In any case, that is where most people would have had problems adapting your code, and it is good that we now have a working example that people can just pick up.

Cheers.

rdfentry_ gives you the “row number” in the original dataset, so definitely with a Filter in the middle it might skip some numbers. If you could assume that arrays are only added to unfiltered datasets that would definitely simplify things. If we added a ROOT.RDF.AttachArray helper function upstream (which might be a good idea) I would be tempted to impose that restriction – with an extra step you can always Snapshot the filtered dataset and then add the array as an extra column when reading the skimmed dataset. It is not clear to me how indexing into the array should work in multi-thread event loops if the array is added after Filters.

In my code above, counter is a global variable, so you can access it from anywhere, also from inside the lambda – it also means that different threads will access it concurrently, so if you want to do this stuff in a parallel event loop you’ll need a counter per thread or similar.
Alternatively counter could be a static variable defined in AddArray (but that would make it impossible to reset it to zero if needed).
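A plain-Python sketch of the counter-per-thread idea (hypothetical helper names; the C++ analogue would use a thread_local variable): threading.local gives each thread its own independent counter, so concurrent loops do not interfere with each other.

```python
import threading

# Each thread sees its own 'count' attribute, so concurrent event loops
# do not step on each other's index (sketch of the per-thread counter idea).
_local = threading.local()

def next_index():
    if not hasattr(_local, 'count'):
        _local.count = 0
    val = _local.count
    _local.count += 1
    return val

results = {}
def worker(name, n):
    # each worker thread starts its own counter at 0
    results[name] = [next_index() for _ in range(n)]

t1 = threading.Thread(target=worker, args=('a', 3))
t2 = threading.Thread(target=worker, args=('b', 3))
t1.start(); t2.start(); t1.join(); t2.join()
print(results)  # each thread gets its own independent 0, 1, 2 sequence
```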

Hi,

I have been playing with this function for a while and I am having a lot of problems with it, mostly because the counter variable is global: if I access the dataframe multiple times (not just booking, but actually accessing the data), the counter variable needs to be reset every time, as you said:

Otherwise the container holding the data would be read beyond its last element, and that will cause crashes. That puts far too much work on the user and makes the code dangerous. The numba approach actually seems better: it is probably somewhat wasteful, but we do not have to remember to reset any counter.

Cheers.

Hi,

Even my solution with Numba has problems. My code started to crash, and I found out that in that implementation I would also have to reset the counter in some cases, which would make the function utterly useless.

However what is below seems to work:

def add_df_column(df, arr_val, name):
    ran = ROOT.TRandom3(0)
    ran_int = ran.Integer(100000000)

    df_size = df.Count().GetValue()
    if arr_val.size != df_size:
        raise ValueError('Array size is different from dataframe size: {}/{}'.format(arr_val.size, df_size))

    str_ind = '''
@ROOT.Numba.Declare(['int'], 'int')
def get_ind_{0}_{1}(index):
    if index + 1 > arr_val.size:
        return 0
    return index
'''.format(name, ran_int)

    str_eva = '''
@ROOT.Numba.Declare(['int'], 'float')
def get_val_{0}_{1}(index):
    if index + 1 > arr_val.size:
        print('Cannot access array at given index')
        return -1
    return arr_val[index]
'''.format(name, ran_int)

    exec(str_ind, {'ROOT': ROOT, 'arr_val': arr_val})
    exec(str_eva, {'ROOT': ROOT, 'arr_val': arr_val})

    ROOT.gInterpreter.ProcessLine('int index_df = -1;')

    ind_eva = 'Numba::get_ind_{0}_{1}(index_df)'.format(name, ran_int)
    fun_eva = 'Numba::get_val_{0}_{1}(index_df)'.format(name, ran_int)

    df = df.Define(name, 'index_df++; index_df={}; return {};'.format(ind_eva, fun_eva))

    return df

i.e. we need two functions: an index setter and an array reader. Again, there is a one-in-100-million chance that the function causes a crash because of the random number.

Cheers.

Hi,

Ok, I have more news on this. After working for a while and looking at plots that made no sense, I realized that the fix above only works when the container is traversed completely, i.e. if there is a line like:

df=df.Range(1000)

and the size of the dataframe is larger than 1000, the index has to be reset. That resetting is then meant to be done by hand, which is obviously going to cause a myriad of problems everywhere, given that users will forget; it needs to be automated and made foolproof. I will keep trying to find a good solution, and it would be nice if the ROOT developers (@eguiraud) could offer some support here.

Cheers.

Hi @rooter_03 ,
I would like to provide a generic RDF solution for “adding the contents of a numpy array as an additional RDF column” but I am having some conceptual problems making it work in case the array is added after a filter is applied and the event loop is multi-threaded. Without that case solved I don’t think we can add this in upstream ROOT. A review of the different cases follows, let me know if I’m missing something:

RDF with no filters

In this case something like my first solution in the thread should work: rdfentry_ is a valid index for the numpy array column if there are no filters, both in single-thread and multi-thread event loops.

RDF with a Range

If a Range is present the event loop will be single-thread (multi-threading + Range is not supported). In this case you can use your solution above with an extra trick: you can store the last rdfentry_ value seen in get_ind and if the new rdfentry_ value is lower than that, it’s a new event loop so you can automatically reset the index.

I would still suggest to not use random numbers for ran_int as that might cause collisions. This is an example solution that makes use of a helper C++ functor: example.py (1004 Bytes)

I think the example above should answer your latest question about automating the index reset (but note that it’s not a thread-safe solution, let me know if you need a thread-safe version).
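Since the example.py attachment is not inlined in this thread, here is a plain-Python sketch of the auto-reset trick described above (hypothetical names; in the real helper this logic would live in a C++ functor used inside a Define): the index helper remembers the last entry number it saw and, when the new value is lower, concludes that a new event loop has started and resets.

```python
class ArrayIndexer:
    """Sketch of the auto-resetting index helper: maps successive
    rdfentry_ values to consecutive array positions, restarting from 0
    when a new event loop begins (detected as a decreasing entry number)."""

    def __init__(self):
        self.last_entry = -1
        self.position = -1

    def __call__(self, rdfentry):
        if rdfentry < self.last_entry:
            # entry numbers went backwards: a new event loop has started
            self.position = -1
        self.last_entry = rdfentry
        self.position += 1
        return self.position

idx = ArrayIndexer()
# first event loop over a filtered dataset: entry numbers jump, positions don't
loop1 = [idx(e) for e in (2, 22, 34, 56)]
# second event loop over the same dataset: the index resets automatically
loop2 = [idx(e) for e in (2, 22, 34, 56)]
print(loop1, loop2)  # [0, 1, 2, 3] [0, 1, 2, 3]
```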

RDF with filters

This is the tricky case. If the RDF dataset has 1000 entries but the numpy arrays only have 100 that correspond to the original entries that pass certain selections, it is difficult to establish a correspondence between the selected original entries and the array entries in multi-thread event loops, because the original entries will be processed in an unspecified order.
We would need a dictionary that specifies which of the 1000 original entries corresponds to which of the 100 entries in the numpy array.
Another workaround is to first perform the selection and Snapshot the filtered dataset into a new file, and then add the numpy arrays as new columns of the new, filtered dataset.
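A plain-Python sketch of the dictionary idea (made-up selection and values; in ROOT the lookup would happen inside a Define): record which original entries pass the selection, then map each entry number to its position in the shorter array, which works regardless of the order in which entries are processed.

```python
# 1000 original entries; suppose a selection keeps every 10th one.
n_entries = 1000
passing_entries = [e for e in range(n_entries) if e % 10 == 0]  # 100 survivors
arr_val = [float(i) for i in range(len(passing_entries))]       # one value per survivor

# entry number in the original dataset -> position in the short array
entry_to_pos = {entry: pos for pos, entry in enumerate(passing_entries)}

def value_for_entry(rdfentry):
    # safe in any processing order, including multi-thread event loops
    return arr_val[entry_to_pos[rdfentry]]

print(value_for_entry(0), value_for_entry(990))  # 0.0 99.0
```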

Cheers,
Enrico


Important correction: in multi-thread event loops rdfentry_ will not correspond to the input dataset’s entry number, as the entries are processed in an unspecified order. This is documented in the ROOT::RDataFrame class reference, but I forgot about it in my reply above; sorry about that!

In this scenario the stable solution is to write the extra column out into another TTree that you can add as a friend of the main dataset: entries from the main tree and its friends are guaranteed to be processed in sync.