Hi.
As described in step 1 of my reply, I am trying to read TTrees into NumPy arrays.
The following is the code snippet I am using for this purpose:
def train(self, gamma_file, input_features, target_feature, model_name, precuts, test_split_ratio, bkg_file, **kwargs_algo):
    dict_features = None
    # I am allowing input_features to be either list of variables or a dictionary where key works as alias and values are expressions involving input branches
    if isinstance(input_features, list):
        dict_features = dict(zip(input_features, input_features))
    else:
        dict_features = input_features
    # create RDataFrame for gamma ray data and background data
    # all the branches are taken
    # What can be a way to select branches required from input_features
    dfg_full = ROOT.RDataFrame("TriggeredEvents", gamma_file)
    dfg_precuts = dfg_full.Filter(precuts)
    number_of_gamma = dfg_precuts.Count().GetValue()
    print("Number of gamma samples after precuts = {}".format(number_of_gamma))
    # generate filtered numpy arrays with only columns that we need
    # generate empty numpy array for gamma
    np_g = np.empty((number_of_gamma, len(dict_features.keys())))
    for i, key, value in zip(range(0, len(dict_features.keys())), dict_features.keys(), dict_features.values()):
        dfg_feature = dfg_precuts.Define(key, value)
        np_g[:, i] = dfg_feature.AsNumpy([key])[key]
Is there a better way of doing this? I want to avoid:
- loading the full data into memory, as in
    dfg_full = ROOT.RDataFrame("TriggeredEvents", gamma_file)
- creating dataframe objects in a loop, as in
    for i, key, value in zip(range(0, len(dict_features.keys())), dict_features.keys(), dict_features.values()):
        dfg_feature = dfg_precuts.Define(key, value)
        np_g[:, i] = dfg_feature.AsNumpy([key])[key]
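For reference, here is a minimal sketch of one possible direction; it is not a tested solution. It relies on two things. As far as I understand, RDataFrame is lazy: constructing it does not read the tree, and only the branches actually used by Filter, Define and AsNumpy are read. Also, AsNumpy accepts a list of column names, so all features can be extracted in one call instead of one call per feature. The helper name `features_to_matrix` is hypothetical, `df` is assumed to behave like an RDataFrame node (for example `dfg_precuts` above), and all features are assumed to be scalar columns. Define is skipped when the alias equals the expression, on the assumption that such a name is already a branch of the tree.

```python
import numpy as np


def features_to_matrix(df, dict_features):
    """Return an (n_events, n_features) array from an RDataFrame-like node.

    `df` is assumed to expose Define() and AsNumpy() like ROOT.RDataFrame;
    `features_to_matrix` is a hypothetical helper name.
    """
    for alias, expression in dict_features.items():
        # A plain branch name (alias == expression) is assumed to exist
        # already, so only genuinely new expressions are booked with Define().
        if alias != expression:
            df = df.Define(alias, expression)
    columns = list(dict_features)
    # One AsNumpy() call: all requested columns are filled together.
    arrays = df.AsNumpy(columns)
    # Stack the 1D column arrays side by side, in the order of dict_features.
    return np.column_stack([arrays[name] for name in columns])
```

With the variables from the snippet above, this would be called as `np_g = features_to_matrix(dfg_precuts, dict_features)`. The number of events after precuts is then `np_g.shape[0]`, so the separate `Count()` and the pre-allocated empty array would no longer be needed.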