RDataframe with strings

Hi!

I am trying to convert a pandas dataframe to an RDataFrame so that I can save it as a .root file.

Following previous posts, I convert the pandas df into a dictionary and then build the RDataFrame from it. My problem is that one of my columns, “detector”, has string entries, which leads to an error:

RuntimeError: Object not convertible: Dictionary entry detector is not convertible with AsRVec.

So my approach was to exclude this column first and add it to the rdf afterwards. In the post “Make RDataFrames interoperable with other Python tools” I have seen a solution for floats, which I tried to adapt for strings, but I don’t know how to make it work:

arraydata = {key: df[key].values for key in df.keys()}

det_dict = arraydata.pop("detector")

rdf = ROOT.RDF.FromNumpy(arraydata)
rdf = rdf.Define('detector', 'auto to_eval = "det_dict[" + std::to_string(rdfentry_) + "]"; return TPython::Eval(to_eval.c_str());') 

But with this the rdf looks like this:

+-----+-----------------+-----+
| Row | detector        | ... | 
+-----+-----------------+-----+
| 0   | @0x7f93f0cb1820 | ... | 
+-----+-----------------+-----+
| 1   | @0x7f93f0cb1820 | ... | 
+-----+-----------------+-----+
| 2   | @0x7f93f0cb1820 | ... | 
+-----+-----------------+-----+
| 3   | @0x7f93f0cb1820 | ... | 
+-----+-----------------+-----+
| 4   | @0x7f93f0cb1820 | ... | 
+-----+-----------------+-----+

Thank you!
Viktoria


ROOT Version: 6.28/00
Built for linuxx8664gcc


Dear Viktoria,

Thanks for the interesting post!
Without the original input I cannot reproduce your issue, so I am guessing about the context, but I suspect the problem is linked to the conversion of the TPyReturn object returned by TPython::Eval to a string.
The Define line should look like:

rdf = rdf.Define('detector', 'auto to_eval = "det_dict[" + std::to_string(rdfentry_) + "]"; return (std::string) TPython::Eval(to_eval.c_str());') 
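
To make explicit what that Define expression does for each entry, here is a pure-Python sketch. It only mimics the idea and does not call ROOT: the C++ snippet assembles the text "det_dict[<entry>]", hands it to the Python interpreter, and converts the result to a string. The list contents, the helper names and the explicit namespace dictionary are illustrative assumptions, not part of the original code.

```python
# Stand-in for the popped "detector" column (hypothetical values).
det_dict = ["ecal", "hcal", "tracker"]

# The expression can only be evaluated if the name it mentions is visible
# in the namespace used for the evaluation.
namespace = {"det_dict": det_dict}


def expression_for_entry(entry, name="det_dict"):
    """Build the Python expression that the C++ snippet assembles for one entry."""
    return f"{name}[{entry}]"


def evaluate(entry):
    """Mimic the Eval call with Python's eval, then convert the result to str."""
    return str(eval(expression_for_entry(entry), namespace))


results = [evaluate(entry) for entry in range(len(det_dict))]
print(expression_for_entry(2))
print(results)
```

If the name is missing from the namespace, eval raises a NameError, which is the Python-side analogue of the lookup failing.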

If that does not work, we can iterate further.
I hope this helps!

Cheers,
D

Dear Danilo,

Thank you! The line you suggested runs without errors, but when I try to do rdf.Display().Print() I get loads of errors:

logic_error: ROOT::RDF::RDisplay& ROOT::RDF::RResultPtr<ROOT::RDF::RDisplay>::operator*() =>
    logic_error: basic_string::_M_construct null not valid

input_line_1046:74:12: error: expected member name or ';' after declaration specifiers
 TPyReturn isascii() {
 ~~~~~~~~~ ^
/home/viktoria/anaconda3/x86_64-conda-linux-gnu/sysroot/usr/include/ctype.h:234:22: note: expanded from macro 'isascii'
#  define isascii(c)    __isascii (c)
                        ^
/home/viktoria/anaconda3/x86_64-conda-linux-gnu/sysroot/usr/include/ctype.h:100:26: note: expanded from macro '__isascii'
#define __isascii(c)    (((c) & ~0x7f) == 0)    /* If C is a 7 bit value.  */
                            ^
input_line_1046:74:12: error: expected ')'
/home/viktoria/anaconda3/x86_64-conda-linux-gnu/sysroot/usr/include/ctype.h:234:22: note: expanded from macro 'isascii'
#  define isascii(c)    __isascii (c)
                        ^
/home/viktoria/anaconda3/x86_64-conda-linux-gnu/sysroot/usr/include/ctype.h:100:28: note: expanded from macro '__isascii'
#define __isascii(c)    (((c) & ~0x7f) == 0)    /* If C is a 7 bit value.  */
                              ^
input_line_1046:74:12: note: to match this '('
/home/viktoria/anaconda3/x86_64-conda-linux-gnu/sysroot/usr/include/ctype.h:234:22: note: expanded from macro 'isascii'
#  define isascii(c)    __isascii (c)
                        ^
/home/viktoria/anaconda3/x86_64-conda-linux-gnu/sysroot/usr/include/ctype.h:100:23: note: expanded from macro '__isascii'
#define __isascii(c)    (((c) & ~0x7f) == 0)    /* If C is a 7 bit value.  */
                         ^
input_line_1046:74:12: error: expected ')'
 TPyReturn isascii() {
           ^
/home/viktoria/anaconda3/x86_64-conda-linux-gnu/sysroot/usr/include/ctype.h:234:22: note: expanded from macro 'isascii'
#  define isascii(c)    __isascii (c)
                        ^
/home/viktoria/anaconda3/x86_64-conda-linux-gnu/sysroot/usr/include/ctype.h:100:37: note: expanded from macro '__isascii'
#define __isascii(c)    (((c) & ~0x7f) == 0)    /* If C is a 7 bit value.  */
                                       ^
input_line_1046:74:12: note: to match this '('
/home/viktoria/anaconda3/x86_64-conda-linux-gnu/sysroot/usr/include/ctype.h:234:22: note: expanded from macro 'isascii'
#  define isascii(c)    __isascii (c)
                        ^
/home/viktoria/anaconda3/x86_64-conda-linux-gnu/sysroot/usr/include/ctype.h:100:22: note: expanded from macro '__isascii'
#define __isascii(c)    (((c) & ~0x7f) == 0)    /* If C is a 7 bit value.  */
                        ^
input_line_1046:6:28: error: no member named 'reserve' in 'std::vector<TPyArg, std::allocator<TPyArg> >'
  std::vector<TPyArg> v; v.reserve(0);
                         ~ ^

Same when I try to create a Snapshot.

Cheers,
Viktoria

Hi Viktoria,

Sorry about that. Could you help me reproduce your setup by sharing the input file you are using and a minimal version of the code that reproduces the error? That would be very useful for converging on a solution.

Cheers,
D

Hi,
here is a minimal example with a small fraction of my data:
min_example_string_dataframe.zip (3.5 KB)

In case this would change anything, my root was installed with conda:

root                      6.28.0          py310h9b08913_1    conda-forge
root_base                 6.28.0          py310h720e498_1    conda-forge

Cheers,
Viktoria

Dear Viktoria,

Thanks for the reproducer and your patience. This example exposed a problem in ROOT, more specifically in its interaction with Python.
What is happening is that the dictionary you are consulting inside the TPython::Eval invocation cannot be found. We are working on better interoperability between Python and C++ through PyROOT this year, so progress is to be expected. At the moment I cannot offer a transparent fix, only an easy workaround: just add ROOT.TPython.Class() before the Define call.
This new python script works:

import ROOT
import pandas as pd

# creating the pandas dataframe
df = pd.read_csv("data_example.csv")

# converting it to a dictionary
arraydata = {key: df[key].values for key in df.keys()}

# excluding the detector column
det_dict = arraydata.pop("detector")

# creating the root dataframe
rdf = ROOT.RDF.FromNumpy(arraydata)

# Workaround: initialise the Python interpreter before jitting the Define
ROOT.TPython.Class()

# add the detector column back in
rdf = rdf.Define('detector', 'auto to_eval = "det_dict[" + std::to_string(rdfentry_) + "]"; return std::string(TPython::Eval(to_eval.c_str()));') 

rdf.Display().Print()
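
If the string values themselves are not needed inside the event loop, another option is to store an integer code per entry and keep the code-to-name table on the Python side. This is not the workaround above, only a sketch of a different approach. The detector names, the helper encode_strings and the column name "detector_id" mentioned in the comments are hypothetical.

```python
def encode_strings(values):
    """Return (codes, labels) such that labels[codes[i]] == values[i]."""
    labels = sorted(set(values))
    index = {label: code for code, label in enumerate(labels)}
    codes = [index[value] for value in values]
    return codes, labels


# Hypothetical detector names standing in for the "detector" column.
detector = ["ecal", "hcal", "ecal", "tracker"]
codes, labels = encode_strings(detector)

# The codes are plain integers. Converted to a numeric numpy array, they could
# be stored under a new key (for example "detector_id") in the dictionary
# passed to ROOT.RDF.FromNumpy, instead of the string column.
decoded = [labels[code] for code in codes]
print(codes, labels)
```

The labels list has to be saved separately (or documented) so that the codes can be translated back when the file is read.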

I hope this helps.

Cheers,
Danilo

PS
In case you are really starting from a CSV file (apologies if this is not the case!), you can always use the RDataFrame CSV data source: here is the example.
