RDataFrame, pandas and save to hdf5

eneb · November 12, 2024, 6:05pm

Dear experts,

I am not sure if this is the correct place to ask, but since it has to do at least partially with the RDataFrame functionalities, maybe someone can help me out.
Usually I just use RDataFrame to create histograms. Now I started producing Ntuples with the Snapshot option. This works. For further processing I need to use the AsNumpy functionality, which also works:

import ROOT
import pandas as pd

ROOT.EnableImplicitMT(nCPUs) # not sure if it helps here?
rdf = ROOT.RDataFrame(tree, file_list) # file_list contains all the produced ntuples
np_data = rdf.AsNumpy(columns=variables) # variables is a list containing strings with the names of the columns
df = pd.DataFrame(np_data)
print(df.dtypes)

Looking at the output of the print statement, one column has the dtype “object”. This happens because the variable is a ROOT.VecOps.RVec<float> (at least this is what is shown when I print the np_data object). And this is where things stop working, because I am unable to save this dataframe to an HDF5 file.
df.to_hdf(output_file, key="df", mode="a", append=True) gives the error: TypeError: Cannot serialize the column [test] because its data contents are not [string] but [mixed] object dtype
My naive question now would be: when there is just one entry in the vector, how can I change the datatype, maybe even in the np_data object, to e.g. a float? I know the RVec type has its advantages when using ROOT, but it seems that pandas cannot handle it, as functions like df.astype do not work and give the error: TypeError: float() argument must be a string or a real number, not 'RVec<float>'

Does somebody know a solution? Thank you in advance!

Cheers

dastudillo · November 13, 2024, 5:17am

If the vectors only contain one element (or you always want to keep the first element), one option is to define a new column holding element 0 and then save the dataframe with that column while excluding the vector column, e.g. (not tested on vectors, but I suppose it should work as-is or with some adaptation; in any case this shows the idea):

import ROOT
import pandas as pd

tree = 'mytree'
file_list = ['myfile.root']
variables = ['x','y','myvec']

rdf = ROOT.RDataFrame(tree, file_list)
rdf = rdf.Define("myvalue","myvec[0]")  # add new column to the same dataframe

# see the original
np_data0 = rdf.AsNumpy(columns=variables)
df0 = pd.DataFrame(np_data0)
print(df0.dtypes)
print(df0)

# new
variables2 = ['x','y','myvalue']
np_data = rdf.AsNumpy(columns=variables2)
df = pd.DataFrame(np_data)
print(df.dtypes)
print(df)
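
As an alternative to defining a new column, the one-element vectors could also be unpacked after AsNumpy, directly in the dictionary it returns. The following is only a sketch: the helper name flatten_single_element_columns is made up for illustration, it is written in plain Python so it can be tried without ROOT, and it assumes that the RVec entries support len() and indexing like ordinary Python sequences (plain lists stand in for them in the example).

```python
def flatten_single_element_columns(columns):
    """Return a new dict in which every column made of one-element
    sequences is replaced by a plain list of floats.

    `columns` maps column names to sequences of values, like the
    dictionary returned by AsNumpy. Hypothetical helper, not a ROOT API.
    """
    flat = {}
    for name, values in columns.items():
        entries = list(values)
        # A column counts as a "vector column" if every entry is a sized,
        # non-string object (assumption: RVec behaves like that in Python).
        is_vector = bool(entries) and all(
            hasattr(v, "__len__") and not isinstance(v, (str, bytes))
            for v in entries
        )
        if is_vector:
            if any(len(v) != 1 for v in entries):
                raise ValueError(f"column {name!r} has entries with length != 1")
            flat[name] = [float(v[0]) for v in entries]
        else:
            # Scalar columns are passed through unchanged.
            flat[name] = values
    return flat


# Stand-in for the AsNumpy output: lists instead of NumPy arrays and RVecs.
np_data = {"x": [1.0, 2.0], "myvec": [[0.5], [1.5]]}
flat = flatten_single_element_columns(np_data)
print(flat)  # {'x': [1.0, 2.0], 'myvec': [0.5, 1.5]}
```

The result can be passed to pd.DataFrame in the same way as np_data. The helper deliberately raises an error instead of silently dropping data when a vector has more or fewer than one element.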

Danilo · November 13, 2024, 5:22am

Hi,

Nothing to add to the very nice reply above, but I am curious about the use case, if I may ask: why do you need to leave RDataFrame and switch to pandas? What functionality are you missing?

Cheers,
Danilo

eneb · November 13, 2024, 7:33am

Hi,

Thanks a lot for the quick answer. My hope was to avoid adding new columns somehow, but it makes sense. I will need to include some code which checks the datatype of each variable and then also edits the variable array, but that should not be a big problem I hope.
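
A rough sketch of what such a check could look like, under a few assumptions: the helper name plan_scalar_columns and the "_0" suffix are invented for illustration, the type names are assumed to be strings such as those returned by rdf.GetColumnType(name), and a column is treated as a vector whenever its type name contains "RVec" or "vector". It also assumes that every vector has at least one element, since "myvec[0]" on an empty vector is not safe.

```python
def plan_scalar_columns(column_types, suffix="_0"):
    """Decide which columns need a scalar replacement.

    `column_types` maps column names to their type names as strings.
    Returns (defines, out_columns): `defines` is a list of
    (new_column_name, expression) pairs, `out_columns` is the edited
    list of columns to request from AsNumpy. Hypothetical helper.
    """
    defines = []
    out_columns = []
    for name, type_name in column_types.items():
        if "RVec" in type_name or "vector" in type_name:
            new_name = name + suffix
            defines.append((new_name, f"{name}[0]"))
            out_columns.append(new_name)
        else:
            out_columns.append(name)
    return defines, out_columns


# With ROOT available, the plan could be applied roughly like this:
#   column_types = {v: rdf.GetColumnType(v) for v in variables}
#   defines, variables2 = plan_scalar_columns(column_types)
#   for new_name, expression in defines:
#       rdf = rdf.Define(new_name, expression)
#   np_data = rdf.AsNumpy(columns=variables2)

# Example with hand-written type names, so it runs without ROOT:
column_types = {"x": "float", "y": "double", "myvec": "ROOT::VecOps::RVec<float>"}
defines, variables2 = plan_scalar_columns(column_types)
print(defines)     # [('myvec_0', 'myvec[0]')]
print(variables2)  # ['x', 'y', 'myvec_0']
```

Keeping the decision logic separate from the ROOT calls makes it easy to check the edited variable list before running the event loop.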

eneb · November 13, 2024, 7:41am

Hi Danilo,

I need to run neural network studies using these variables. For the NN, I need the inputs to be some sort of flat (NumPy) arrays (not TTrees), and storing the input data in a dataframe (and then an HDF5 file) seems to make sense for further processing.
(I think also for TMVA the tutorials use the functionality of RDataFrame and AsNumpy for this purpose)

Cheers