Adding data from an external container to a DataFrame

Hi,

From the tutorial at:

https://root.cern/doc/master/pyroot004__NumbaDeclare_8py.html

it seems that one can write an add_array function like this:

import ROOT
import numpy

#------------------------------------------
def add_array(name, arr_val, df):
    df_size = df.Count().GetValue()
    if arr_val.size != df_size:
        raise ValueError('Array size is different from dataframe size: {}/{}'.format(arr_val.size, df_size))

    fun = '''
@ROOT.Numba.Declare(['int'], 'float')
def get_val_{}(index):
    return arr_val[index]
'''.format(name)
    exec(fun, {'ROOT': ROOT, 'arr_val': arr_val})

    ROOT.gInterpreter.ProcessLine('int index_df = -1;')
    df=df.Define(name, 'index_df+=1; return Numba::get_val_{}(index_df);'.format(name))

    return df
#------------------------------------------
df=ROOT.RDataFrame(10)

arr_val_1=numpy.array([1.] * 10)
arr_val_2=numpy.array([2.] * 10)

df=add_array('x_1', arr_val_1, df)
df=add_array('x_2', arr_val_2, df)

df.Snapshot('tree', 'file.root')

ifile=ROOT.TFile('file.root')
ifile.tree.Scan()

This could be a good implementation of a general function that plugs containers into dataframes. However, I need the exec part, given that a Numba function cannot be declared more than once with the same name.

Also, index_df seems to be needed: if the dataframe has been filtered, rdfentry_ is no longer a good index, since it jumps, e.g. 2, 22, 34, 56. Is there a safer way of doing this? Am I doing anything in an inefficient/wrong way?
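To make the indexing problem concrete, here is a minimal plain-Python sketch (no ROOT involved; the entry numbers and array values are made up): after a Filter, the surviving rows keep their original entry numbers, so rdfentry_ jumps, while a running counter like index_df stays contiguous and can safely index the external array.

```python
# Sketch of why rdfentry_ is not a valid index into an external
# array once a Filter is applied (hypothetical numbers).
entries = list(range(10))                     # original entry numbers 0..9
passing = [e for e in entries if e % 3 == 0]  # a Filter keeps entries 0, 3, 6, 9

arr_val = [10.0, 20.0, 30.0, 40.0]            # external array: one value per *surviving* row

# Wrong: indexing arr_val with the original entry number.
# The last surviving entry is 9, but arr_val has only 4 elements -> IndexError.

# Right: a running counter over surviving rows (what index_df emulates).
paired = [(entry, arr_val[pos]) for pos, entry in enumerate(passing)]
print(paired)  # [(0, 10.0), (3, 20.0), (6, 30.0), (9, 40.0)]
```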

This add_array, or a better version of it, should also be in ROOT, given that it is useful in many cases.

Cheers.



ROOT Version: 6.22/06
Platform: CentOS 7
Compiler: gcc8-opt


Hi @rooter_03 ,
MakeNumpyDataFrame lets you build an RDataFrame starting from a dictionary of numpy arrays; see the tutorial tutorials/dataframe/df032_MakeNumpyDataFrame.py.

Your code above would become:

import ROOT
import numpy
arrays = {'x_1': numpy.full(10, 1.), 'x_2': numpy.full(10, 2.)}
ROOT.RDF.MakeNumpyDataFrame(arrays).Snapshot('tree', 'file.root')
ifile = ROOT.TFile('file.root')
ifile.tree.Scan()

Cheers,
Enrico

Hi,

Thanks for your reply, but that does not work for me, and it will very rarely be needed; at least I have never needed it. The idea is not to build a new dataframe from dictionaries, but to add data to a dataframe already built from a tree. Could you please tell me whether what I showed in the original post is the best (safest, simplest) way to achieve that?

I have already added that function to my personal library and I am seeing this like:

input_line_261:5:7: error: redefinition of 'get_val_Jpsi_M_smeared'
float get_val_Jpsi_M_smeared(int x_0) {
      ^
input_line_158:5:7: note: previous definition is here
float get_val_Jpsi_M_smeared(int x_0) {
      ^
Traceback (most recent call last):
  File "test.py", line 33, in <module>
    check(year, tree)
  File "test.py", line 17, in check

because in this particular case the dataframe is not meant to take the new Jpsi_M_smeared variable multiple times; rather, the process does this for multiple dataframes, and as soon as it happens more than once there seems to be a shared object in memory that prevents a new definition (for a new dataframe) from being made. What seems to fix the issue for good is:

def add_array(name, arr_val, df):
    ran = ROOT.TRandom3(0)
    ran_int = ran.Integer(1000000)

    df_size = df.Count().GetValue()
    if arr_val.size != df_size:
        raise ValueError('Array size is different from dataframe size: {}/{}'.format(arr_val.size, df_size))

    fun = '''
@ROOT.Numba.Declare(['int'], 'float')
def get_val_{}_{}(index):
    return arr_val[index]
'''.format(name, ran_int)
    exec(fun, {'ROOT': ROOT, 'arr_val': arr_val})

    ROOT.gInterpreter.ProcessLine('int index_df = -1;')
    df=df.Define(name, 'index_df+=1; return Numba::get_val_{}_{}(index_df);'.format(name, ran_int))

    return df

i.e. using a random number in the name of the function every time it gets called.

Cheers.

I guess you thought that I wanted to make a dataframe from arrays because of what I put in my first post. What I intended to do there was to build a testing dataframe; otherwise I would have had to attach a ROOT file to the post.

Indeed I was solving for the sample code you shared.

The problem with the multiple definition is that the function you pass to Numba.Declare produces a corresponding C++ function, declared to ROOT’s interpreter, and C++ does not allow function redefinitions. The workaround with the random number will fail if you draw the same number twice; a steadily increasing counter would be better.
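As a sketch of that suggestion (plain Python; unique_name is a hypothetical helper, not a ROOT API): a module-level itertools.count never repeats, so name suffixes drawn from it cannot collide, unlike random draws.

```python
import itertools

# Module-level counter: each call yields a new, never-repeating integer,
# so generated function names like get_val_x_0, get_val_x_1, ... are unique.
_suffix = itertools.count()

def unique_name(base):
    """Return a name guaranteed not to have been produced before."""
    return '{}_{}'.format(base, next(_suffix))

print(unique_name('get_val_x'))  # get_val_x_0
print(unique_name('get_val_x'))  # get_val_x_1
```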

A possibly faster/cleaner way is making use of the Python/C++ interplay that PyROOT allows:

import ROOT
import numpy as np

ROOT.gInterpreter.Declare('''
ROOT::RDF::RNode AddArray(ROOT::RDF::RNode df, ROOT::RVec<double> &v, const std::string &name) {
    return df.Define(name, [&](ULong64_t e) { return v[e]; }, {"rdfentry_"});
}
''')

df = ROOT.RDataFrame(10).Define("x", "42")
arr = np.full(10, 1.)
arr_as_rvec = ROOT.VecOps.AsRVec(arr) # basically free, there is no copy here
df = ROOT.AddArray(ROOT.RDF.AsRNode(df), arr_as_rvec, "y")
df.Display().Print()

For simplicity I left out the extra complication of indexing the array correctly.

Cheers,
Enrico

Hi,

Thanks for your reply. That is going to break like the numba example: if the Declare line is put in a function and that function gets called twice, the interpreter will say that the function was already declared. The only way for that to work would be to put it in rootlogon.C so that it gets loaded once and only once, every time ROOT starts.

However, I am not sure whether that would work when sending jobs to computing clusters. Another thing that could work would be:

if not hasattr(ROOT, 'AddArray'):
    ROOT.gInterpreter.Declare(...

so that the Declare line only kicks in when the function has not been declared yet. Regarding your simplicity argument: could you please post a working example? That way other people who come here can just use it. In my case I cannot get it to work:

import ROOT

import numpy as np  

#-------------
def declare_struc():
    if not hasattr(ROOT, 'RDFAddArray'):
        ROOT.gInterpreter.Declare('''
        ROOT::RDF::RNode RDFAddArray(ROOT::RDF::RNode df, ROOT::RVec<double> &v, const std::string &name) 
        {
            unsigned rdf_entry = 0;
            return df.Define(name, [&]() { auto val = v[rdf_entry]; rdf_entry++; return val; });
        }
        ''')
#-------------

declare_struc()

df = ROOT.RDataFrame(10).Define("x", "42")

arr = np.array([0.0, 1.1, 2.2, 3.3, 4.4, 5.5])

arr_as_rvec=ROOT.VecOps.AsRVec(arr)

df = ROOT.RDFAddArray(ROOT.RDF.AsRNode(df), arr_as_rvec, "y")
df = ROOT.RDFAddArray(ROOT.RDF.AsRNode(df), arr_as_rvec, "z")

df.Display().Print()

the result is:

x  | y              | z              | 
42 | 0.0000000      | 1.1000000      | 
42 | 2.2000000      | 3.3000000      | 
42 | 4.4000000      | 5.5000000      | 
42 | 0.0000000      | 2.4209217e-322 | 
42 | 6.8990104e-310 | 9.2860178e+242 | 

it seems that, somehow, there is cross-talk between the call that adds the y column and the one that adds the z column, i.e., they both share the same rdf_entry. Do you know how we could get around this?

Cheers.

No, because you only need to declare it once, while, if I understand the code correctly, your code required declaring one helper function per array.
You can add that Declare to the beginning of your program or in the __init__.py of your Python module, for example.

My example should run correctly with a recent-enough ROOT version, with ROOT master it prints:

x  | y         |
42 | 1.0000000 |
42 | 1.0000000 |
42 | 1.0000000 |
42 | 1.0000000 |
42 | 1.0000000 |

This is a use-after-delete of rdf_entry: by the time the event loop starts, the rdf_entry variable has gone out of scope. It looks like, when you ran the code, both dangling references happened to point to the same memory region.

If you can’t use rdfentry_ to index the arrays I guess you can Define your own counter. This:

import ROOT
import numpy as np

ROOT.gInterpreter.Declare('''
ROOT::RDF::RNode AddArray(ROOT::RDF::RNode df, ROOT::RVec<double> &v, const std::string &name) {
    return df.Define(name, [&](unsigned c) { return v[c]; }, {"counter"});
}

unsigned counter = 0;
''')

df = ROOT.RDataFrame(10).Define("counter", "counter++")
arr1 = ROOT.VecOps.AsRVec(np.full(10, 1.))
arr2 = ROOT.VecOps.AsRVec(np.full(10, 2.))
df = ROOT.AddArray(ROOT.RDF.AsRNode(df), arr1, "y")
df = ROOT.AddArray(ROOT.RDF.AsRNode(df), arr2, "z")
df.Display().Print()

prints this:

counter | y         | z         |
0       | 1.0000000 | 2.0000000 |
1       | 1.0000000 | 2.0000000 |
2       | 1.0000000 | 2.0000000 |
3       | 1.0000000 | 2.0000000 |
4       | 1.0000000 | 2.0000000 |

Cheers,
Enrico

P.S.
note that if you want to support multiple event loops using that same counter variable, you have to set it back to 0 before starting each event loop.

Hi,

Thanks for your reply.

Yes, you are right there; I realized it a few minutes after I submitted my reply. The numba example needs a separate declaration for each input, while here several input containers can be processed with the same function.

The part that was not included was the replacement of rdfentry_. I have noticed that, as I said before, this variable follows the 0, 1, 2… pattern only when Filter has not been called. There is no guarantee that the input has not been filtered; therefore this variable can never be relied upon, and I think it is dangerous to use it.

Ok, this was completely obscure to me. The way all this works is pretty alien, and I did notice strange behaviours when using the new rdf_entry. The definition of a counter like in the example you posted does work. However, I would never have been able to figure this out myself, and I think the same would be true for most people ending up in this thread.

The way I understand Define(x, y) is that you define a column called x as the expression y. Now, counter is not in the dataframe, given that it is brand new and empty, so using it in y seems VERY strange and completely counterintuitive, though not necessarily wrong.

In any case, that is where most people would have had problems adapting your code, and it is good that we now have a working example that people can just pick up.

Cheers.

rdfentry_ gives you the “row number” in the original dataset, so definitely with a Filter in the middle it might skip some numbers. If you could assume that arrays are only added to unfiltered datasets that would definitely simplify things. If we added a ROOT.RDF.AttachArray helper function upstream (which might be a good idea) I would be tempted to impose that restriction – with an extra step you can always Snapshot the filtered dataset and then add the array as an extra column when reading the skimmed dataset. It is not clear to me how indexing into the array should work in multi-thread event loops if the array is added after Filters.

In my code above, counter is a global variable, so you can access it from anywhere, also from inside the lambda – it also means that different threads will access it concurrently, so if you want to do this stuff in a parallel event loop you’ll need a counter per thread or similar.
Alternatively counter could be a static variable defined in AddArray (but that would make it impossible to reset it to zero if needed).
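A plain-Python sketch of the counter-per-thread idea (hypothetical helper names; the C++ analogue would use a thread_local variable): threading.local gives each thread its own independent counter, so concurrent loops do not interfere with each other.

```python
import threading

# Each thread sees its own 'count' attribute, so concurrent event loops
# do not step on each other's index (sketch of the per-thread counter idea).
_local = threading.local()

def next_index():
    if not hasattr(_local, 'count'):
        _local.count = 0
    val = _local.count
    _local.count += 1
    return val

results = {}
def worker(name, n):
    # each worker thread starts its own counter at 0
    results[name] = [next_index() for _ in range(n)]

t1 = threading.Thread(target=worker, args=('a', 3))
t2 = threading.Thread(target=worker, args=('b', 3))
t1.start(); t2.start(); t1.join(); t2.join()
print(results)  # each thread gets its own independent 0, 1, 2 sequence
```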

Hi,

I have been playing with this function for a while and I am having a lot of problems with it, mostly because the counter variable is global: if I access the dataframe multiple times (not just booking, but actually accessing the data), the counter variable needs to be reset every time, as you said:

Otherwise the container holding the data would be read beyond its last element, and that will cause crashes. That puts far too much work on the user and makes the code dangerous. The numba approach actually seems better: it is probably somewhat wasteful, but we do not have to remember to reset any counter.

Cheers.

Hi,

Even my solution with Numba has problems. My code started to crash, and I found out that in that implementation I would also have to reset the counter in some cases, which would make the function utterly useless.

However what is below seems to work:

def add_df_column(df, arr_val, name):
    ran = ROOT.TRandom3(0)
    ran_int = ran.Integer(100000000)

    df_size = df.Count().GetValue()
    if arr_val.size != df_size:
        raise ValueError('Array size is different from dataframe size: {}/{}'.format(arr_val.size, df_size))

    str_ind = '''
@ROOT.Numba.Declare(['int'], 'int')
def get_ind_{0}_{1}(index):
    if index + 1 > arr_val.size:
        return 0
    return index
'''.format(name, ran_int)

    str_eva = '''
@ROOT.Numba.Declare(['int'], 'float')
def get_val_{0}_{1}(index):
    if index + 1 > arr_val.size:
        print('Cannot access array at given index')
        return -1
    return arr_val[index]
'''.format(name, ran_int)

    exec(str_ind, {'ROOT': ROOT, 'arr_val': arr_val})
    exec(str_eva, {'ROOT': ROOT, 'arr_val': arr_val})

    ROOT.gInterpreter.ProcessLine('int index_df = -1;')

    ind_eva = 'Numba::get_ind_{0}_{1}(index_df)'.format(name, ran_int)
    fun_eva = 'Numba::get_val_{0}_{1}(index_df)'.format(name, ran_int)

    df = df.Define(name, 'index_df++; index_df={}; return {};'.format(ind_eva, fun_eva))

    return df

i.e. we need two functions: an index setter and an array reader. Again, there is a one-in-100-million chance that the function causes a crash because of the random number.

Cheers.

Hi,

Ok, I have more news on this. After working for a while and looking at plots that made no sense, I realized that the fix above only works when the container is traversed completely, i.e. if there is a line like:

df=df.Range(1000)

and the size of the dataframe is larger than 1000, the index has to be reset. That resetting is then meant to be done by hand, which is obviously going to cause a myriad of problems everywhere, given that users will forget; it needs to be automated and made foolproof. I will keep trying to find a good solution, and it would be nice if the ROOT developers (@eguiraud) could offer some support here.

Cheers.

Hi @rooter_03 ,
I would like to provide a generic RDF solution for “adding the contents of a numpy array as an additional RDF column” but I am having some conceptual problems making it work in case the array is added after a filter is applied and the event loop is multi-threaded. Without that case solved I don’t think we can add this in upstream ROOT. A review of the different cases follows, let me know if I’m missing something:

RDF with no filters

In this case something like my first solution in the thread should work: rdfentry_ is a valid index for the numpy array column if there are no filters, both in single-thread and multi-thread event loops.

RDF with a Range

If a Range is present the event loop will be single-thread (multi-threading + Range is not supported). In this case you can use your solution above with an extra trick: you can store the last rdfentry_ value seen in get_ind and if the new rdfentry_ value is lower than that, it’s a new event loop so you can automatically reset the index.

I would still suggest to not use random numbers for ran_int as that might cause collisions. This is an example solution that makes use of a helper C++ functor: example.py (1004 Bytes)

I think the example above should answer your latest question about automating the index reset (but note that it’s not a thread-safe solution, let me know if you need a thread-safe version).
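Since the example.py attachment is not inlined in this thread, here is a plain-Python sketch of the auto-reset trick described above (hypothetical names; in the real helper this logic would live in a C++ functor used inside a Define): the index helper remembers the last entry number it saw and, when the new value is lower, concludes that a new event loop has started and resets.

```python
class ArrayIndexer:
    """Sketch of the auto-resetting index helper: maps successive
    rdfentry_ values to consecutive array positions, restarting from 0
    when a new event loop begins (detected as a decreasing entry number)."""

    def __init__(self):
        self.last_entry = -1
        self.position = -1

    def __call__(self, rdfentry):
        if rdfentry < self.last_entry:
            # entry numbers went backwards: a new event loop has started
            self.position = -1
        self.last_entry = rdfentry
        self.position += 1
        return self.position

idx = ArrayIndexer()
# first event loop over a filtered dataset: entry numbers jump, positions don't
loop1 = [idx(e) for e in (2, 22, 34, 56)]
# second event loop over the same dataset: the index resets automatically
loop2 = [idx(e) for e in (2, 22, 34, 56)]
print(loop1, loop2)  # [0, 1, 2, 3] [0, 1, 2, 3]
```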

RDF with filters

This is the tricky case. If the RDF dataset has 1000 entries but the numpy arrays only have 100 that correspond to the original entries that pass certain selections, it is difficult to establish a correspondence between the selected original entries and the array entries in multi-thread event loops, because the original entries will be processed in an unspecified order.
We would need a dictionary that specifies which of the 1000 original entries corresponds to which of the 100 entries in the numpy array.
Another workaround is to first perform the selection and Snapshot the filtered dataset into a new file, and then add the numpy arrays as new columns of the new, filtered dataset.
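A plain-Python sketch of the dictionary idea (made-up selection and values; in ROOT the lookup would happen inside a Define): record which original entries pass the selection, then map each entry number to its position in the shorter array, which works regardless of the order in which entries are processed.

```python
# 1000 original entries; suppose a selection keeps every 10th one.
n_entries = 1000
passing_entries = [e for e in range(n_entries) if e % 10 == 0]  # 100 survivors
arr_val = [float(i) for i in range(len(passing_entries))]       # one value per survivor

# entry number in the original dataset -> position in the short array
entry_to_pos = {entry: pos for pos, entry in enumerate(passing_entries)}

def value_for_entry(rdfentry):
    # safe in any processing order, including multi-thread event loops
    return arr_val[entry_to_pos[rdfentry]]

print(value_for_entry(0), value_for_entry(990))  # 0.0 99.0
```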

Cheers,
Enrico


Important correction: in multi-thread event loops rdfentry_ will not correspond to the input dataset’s entry number, as the entries are processed in an unspecified order. This is documented in the ROOT::RDataFrame class reference, but I forgot about it in my reply above; sorry about that!

In this scenario the stable solution is to write the extra column out into another TTree that you can add as a friend of the main dataset: entries from the main tree and its friends are guaranteed to be processed in sync.