Dear experts,
I have a large amount of data in a TTree (more than 1M channels, with a number of floats and ints per entry), and I want to handle the data in my code as a pandas data frame.
Hence, I open the TFile, read the TTree into an RDataFrame, and export that via numpy to a pandas data frame. After the heavy work is done, a new pandas data frame (with a similar number of rows but a different number of columns) should be written back to a TTree, not necessarily inside a TFile. In fact, in some use cases I want to store the TTree in a TMemFile in order to serialize it (cf. Serialize and deserialize TTree into coral::Blob). I mention this because RDataFrame::Snapshot apparently writes the TTree into a TFile. Is there a way to do this without the TFile?
I thought the most straightforward approach would be the following:
out = ROOT.RDF.MakeNumpyDataFrame(df.to_dict("list"))
out.Snapshot("tree", "myTestFile.root")
Here df obviously is the pandas data frame, which via to_dict is first exported to a dictionary of the form {"col1": [....lots of values for this column for each row....], "col2": .... etc.}. This apparently is the required input for MakeNumpyDataFrame.
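To illustrate with a tiny toy data frame (hypothetical columns, no ROOT calls): to_dict("list") hands back plain Python lists per column, whereas converting each column with to_numpy() yields typed numpy arrays, which is a rather different kind of input:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data frame (hypothetical column names).
df = pd.DataFrame({"sector": [0, 5, 15], "energy": [1.5, 2.5, 3.5]})

# to_dict("list") yields plain Python lists per column ...
as_lists = df.to_dict("list")
print(type(as_lists["sector"]))  # <class 'list'>

# ... whereas per-column to_numpy() yields typed numpy arrays.
as_arrays = {c: df[c].to_numpy() for c in df.columns}
print(type(as_arrays["sector"]))  # <class 'numpy.ndarray'>
```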
However, ROOT complains about it. The error reads as follows:
Traceback (most recent call last):
File "test3.py", line 75, in <module>
out = ROOT.RDF.MakeNumpyDataFrame(foo)
RuntimeError: Object not convertible: Dictionary entry sector is not convertible with AsRVec.
Here sector is one of the column names; its column contains integers between 0 and 15 inclusive. The data frame has 1’048’586 rows. I’m using Python 3.8.6 with ROOT 6.24/00 and GCC 8.3.0 on linux/lxplus7.
Am I doing something wrong, or is there simply too much data for MakeNumpyDataFrame to handle?
Is there a better or more efficient way of building a TTree out of a pandas data frame (preferably without automatically writing it to a TFile since, as I mentioned already, I want to serialize the TTree)?
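In case it helps to narrow things down, here is a small pre-flight check I can run on my inputs. It is only a sketch: find_unconvertible is a hypothetical helper of my own, and it assumes (based on the AsRVec wording in the error) that the converter wants a dict of 1-D numpy arrays with fundamental numeric dtypes:

```python
import numpy as np
import pandas as pd

def find_unconvertible(columns):
    """Return the names of dict entries that are not 1-D numeric numpy
    arrays (a guess at what AsRVec can digest; hypothetical helper)."""
    bad = []
    for name, values in columns.items():
        if not (isinstance(values, np.ndarray)
                and values.ndim == 1
                and values.dtype.kind in "iuf"):
            bad.append(name)
    return bad

# Toy data frame standing in for the real one.
df = pd.DataFrame({"sector": [0, 15], "energy": [1.5, 2.5]})

print(find_unconvertible(df.to_dict("list")))  # both columns flagged
print(find_unconvertible({c: df[c].to_numpy() for c in df.columns}))  # []
```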
Thank you for any helpful advice!
heico