I need to save a pandas dataframe as a TTree. Right now I’m saving the data as CSV, loading it with RDataFrame and saving it as a TTree. However, this method changes many of the bool columns into int.
Is there an efficient way to import either a pandas dataframe or a numpy array directly into RDataFrame?
There are several issues here. First, how is a bool represented in your CSV? I’m trying to understand whether the CSV data source of RDataFrame makes an implicit conversion to int, or whether the problem is the CSV format itself, which has no defined way to store boolean values (it’s just a text file!).
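To make the "it’s just a text file" point concrete, here is a minimal sketch using only Python’s standard csv module (not pandas or RDataFrame; the column names are made up for illustration). A boolean written to CSV comes back as the string "True" or "False", so whoever reads the file has to decide how to interpret it.

```python
import csv
import io

# Write two rows with a boolean column to an in-memory CSV "file".
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["x", "flag"])  # hypothetical column names
writer.writerow([1.0, True])
writer.writerow([2.0, False])

# Read the rows back: every field is now a plain string.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
flags_as_text = [row["flag"] for row in rows]

# The reader must convert explicitly; CSV carries no type information.
flags_as_bool = [text == "True" for text in flags_as_text]
print(flags_as_text, flags_as_bool)
```

Writing 0/1 instead only moves the ambiguity: the reader then sees an integer column unless told otherwise.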
However, below is a solution that converts a pandas dataframe to a ROOT TTree. Here too, a direct conversion of boolean numpy arrays to booleans in a TTree is not possible. This time the reason is the memory layout: a boolean numpy array stores one bool per byte, which differs from the bit-packed C++ layout (one bool per bit, as std::vector<bool> typically uses), so the numpy array cannot be read directly.
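As a rough illustration of the layout difference, the sketch below uses numpy only (no ROOT). Here np.packbits merely mimics a bit-packed container such as std::vector<bool>; it is not what RDataFrame does internally.

```python
import numpy as np

flags = np.array([True, False, True])

# A numpy boolean array stores one element per byte.
bytes_per_element = flags.itemsize
total_bytes = flags.nbytes

# A bit-packed layout fits the same three values into a single byte.
# np.packbits pads the remaining bits with zeros.
packed = np.packbits(flags)

# Casting to an integer type gives a layout with a direct C++ counterpart,
# which is the workaround used in this thread.
as_int = flags.astype(np.int32)
print(bytes_per_element, total_bytes, packed.nbytes, as_int.dtype)
```

Because the two layouts differ, the boolean buffer cannot simply be handed over as it is; an integer column can.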
Does this fit your needs?
# Create a pandas dataframe
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['x'] = np.array([1.0, 2.0, 3.0]) # double
df['y'] = np.array([4, 5, 6]) # long
df['z'] = np.array([True, False, True]) # boolean
# Have a look!
print(df)
# Convert data to a dictionary with numpy arrays
data = {key: df[key].values for key in ['x', 'y', 'z']}
# Unfortunately booleans in a numpy array don't have the same
# memory layout as in C++, and therefore we cannot adopt boolean
# columns on the C++ side of RDataFrame.
# The workaround is reading them out as integers.
# (np.int was removed in NumPy 1.24; the builtin int is equivalent.)
data['z'] = data['z'].astype(int)
# Write the dictionary with numpy arrays to a ROOT file
import ROOT
# Note: recent ROOT releases provide this function as ROOT.RDF.FromNumpy
rdf = ROOT.RDF.MakeNumpyDataFrame(data)
rdf = rdf.Define('z_bool', '(bool)z') # Let's rewrite z as bool
rdf.Snapshot('tree', 'file.root')
# Again, have a look!
rdf.Display().Print()
# Output of print(df)
     x  y      z
0  1.0  4   True
1  2.0  5  False
2  3.0  6   True
# Output of rdf.Display().Print()
z_bool | x         | y | z |
true   | 1.0000000 | 4 | 1 |
false  | 2.0000000 | 5 | 0 |
true   | 3.0000000 | 6 | 1 |
I didn’t explain the “bool” issue fully. Initially the bool branches were saved by pandas to CSV as the strings “True”/“False”, and RDataFrame interpreted these as strings. So I decided to use int (0/1) instead, which seemed like the easiest solution. But after loading the CSV in RDataFrame, I couldn’t find a way to convert the branch from int to bool, probably because of the memory layout issue you mentioned.
Thank you very much for the solution. MakeNumpyDataFrame will be very useful.