RDataFrame AsNumpy and selections

Chinmay · March 21, 2022, 12:42pm

Hi.
As described in step 1 of my reply, I am trying to read TTrees into numpy arrays.
This is the code snippet I am using for that purpose:

    def train(self,gamma_file,input_features,target_feature,model_name,precuts,test_split_ratio,bkg_file,**kwargs_algo) :
        dict_features = None
        # input_features may be either a list of variables or a dictionary whose keys work as aliases and whose values are expressions involving input branches
        if isinstance(input_features,list) :
            dict_features = dict(zip(input_features,input_features))
        else :
            dict_features = input_features 

        #create RDataFrame for gamma ray data and background data
        # all the branches are taken 
        # What can be a way to select branches required from input_features
        dfg_full = ROOT.RDataFrame("TriggeredEvents",gamma_file) 
        dfg_precuts = dfg_full.Filter(precuts)
        number_of_gamma = dfg_precuts.Count().GetValue()
        print("Number of gamma samples after precuts = {}".format(number_of_gamma))
       
        # generate filtered numpy arrays with only columns that we need
        # generate empty numpy array for gamma
        np_g = np.empty((number_of_gamma,len(dict_features.keys()))) 
        for i,key,value in zip(range(0,len(dict_features.keys())),dict_features.keys(),dict_features.values()) :
            dfg_feature = dfg_precuts.Define(key,value)
            np_g[:,i] = dfg_feature.AsNumpy([key])[key]

Is there a better way of doing this? I want to avoid:

1. loading the full data into memory, as in

    dfg_full = ROOT.RDataFrame("TriggeredEvents",gamma_file)

2. creating dataframe objects in a loop:

    for i,key,value in zip(range(0,len(dict_features.keys())),dict_features.keys(),dict_features.values()) :
        dfg_feature = dfg_precuts.Define(key,value)
        np_g[:,i] = dfg_feature.AsNumpy([key])[key]

eguiraud · March 21, 2022, 1:00pm

Hi @Chinmay ,
I took the liberty to move this post to a new topic as it does not seem to be about the crash with conda.

I am not sure I understand the first question: ROOT.RDataFrame("TriggeredEvents",gamma_file) does not load all data into memory. Data is only read when an event loop runs.

About 2., currently the for loop at the end of your script runs one event loop per iteration, which might be slow. You can pass lazy=True to AsNumpy, provided you then book all the AsNumpy calls before accessing any of the results. In that case AsNumpy returns a result proxy, and calling GetValue on it triggers the event loop and gives you the actual numpy arrays.
You could also rearrange the code so that you do a single call as AsNumpy([key1, key2, key3]).
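
A minimal sketch of the lazy variant described above, assuming a ROOT version whose AsNumpy accepts lazy=True. The helper name `book_then_collect` is hypothetical, and `df` stands for any RDataFrame node on which the requested columns already exist.

```python
def book_then_collect(df, column_names):
    """Book one lazy AsNumpy result per column, then read them all back.

    `df` is expected to behave like an RDataFrame node. Booking does not
    run an event loop; the first GetValue() call should trigger a single
    loop that fills every result booked on the same dataframe.
    """
    # Step 1: book everything first. Each call returns a result proxy.
    booked = {name: df.AsNumpy([name], lazy=True) for name in column_names}

    # Step 2: only now access the results (this is what runs the loop).
    return {name: result.GetValue()[name] for name, result in booked.items()}
```

With the returned dictionary in hand, something like np.column_stack([arrays[name] for name in column_names]) would give the (events, features) matrix that the original loop filled column by column.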

Cheers,
Enrico

Chinmay · March 21, 2022, 2:29pm

Can the ‘columns’ argument of AsNumpy() contain functions of the columns of the dfg_precuts dataframe?
In that case I can rearrange the code and call ‘AsNumpy([key1,key2,key3…])’.

On the other hand, I need to book my ‘Define’ calls on the same dataframe in a for loop, so that the expressions supplied by the user are accepted dynamically. It’s not clear to me how to do that.

eguiraud · March 21, 2022, 2:48pm

No, you need a Define for that:

    for colname, expression in zip(colnames, expressions):
        df = df.Define(...)
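
Spelled out a little further, here is a sketch of how that loop could be combined with a single AsNumpy call. The helper names `normalize_features` and `define_features` are hypothetical. One assumption worth flagging: when input_features is a plain list, alias and expression are identical, and as far as I know Define raises an error if a column with that name already exists in the tree, so those entries are skipped here.

```python
def normalize_features(input_features):
    """Return {alias: expression} for either a list of names or a dict."""
    if isinstance(input_features, dict):
        return dict(input_features)
    return {name: name for name in input_features}


def define_features(df, features):
    """Chain one Define per derived feature on an RDataFrame-like node."""
    for alias, expression in features.items():
        if alias == expression:
            # Plain branch name: assumed to exist already, nothing to define.
            continue
        # Define returns a new node; rebinding df keeps the chain growing.
        df = df.Define(alias, expression)
    return df


# Hypothetical usage with the names from the first post:
#   features = normalize_features(input_features)
#   df = define_features(dfg_precuts, features)
#   arrays = df.AsNumpy(list(features))   # one event loop for all columns
#   np_g = np.column_stack([arrays[name] for name in features])
```

Rebinding `df` on every iteration is what makes this work: each Define returns a new node that includes all previous definitions, so the final node knows every requested column.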

Cheers,
Enrico

Chinmay · March 21, 2022, 3:08pm

For some reason, I thought it wouldn’t work. It solves all my problems. Thanks.
