RDataFrame from pandas with string columns


Greetings,

I am trying to convert a pandas dataframe with some string-type columns to an RDataFrame and save it to file.

I tried to follow the workaround (adding ROOT.TPython.Class() before Define()) presented in the topic RDataframe with strings:

import pandas as pd
import ROOT

if __name__ == "__main__":

    # load the pickled pandas file
    pandas_df = pd.read_pickle("pandas_df_example.pckl")

    # format the columns to pass them to ROOT
    pandas_df.columns = ["_".join(map(str, col)) for col in pandas_df.columns]
    pandas_df.columns = pandas_df.columns.astype(str)
    

    # check column types
    # print(pandas_df.head())
    # print(pandas_df.dtypes)

    # extract the string-type columns to add them later
    string_cols = pandas_df.select_dtypes("string").columns
    numerical_cols_df = pandas_df.drop(string_cols, axis=1)

    # import the numerical columns to RDataFrame
    data_rdf = ROOT.RDF.FromPandas(numerical_cols_df)

    # add back the string-type columns with Define
    ROOT.TPython.Class()
    for str_df_col in string_cols:
        data_rdf = data_rdf.Define(
            f"{str_df_col}",
            f'auto to_eval = "pandas_df[\'{str_df_col}\'][" + std::to_string(rdfentry_) + "]"; return (std::string) TPython::Eval(to_eval.c_str());',
        )
    
    # save the dataframe to file
    data_rdf.Snapshot("test_tree","Test_file.root")
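
For readers following the Define call above, the sketch below reproduces in plain Python the expression string that the JIT-compiled C++ assembles for each entry before passing it to TPython::Eval. The helper name build_eval_string and the dictionary standing in for the dataframe are made up for illustration; they are not part of ROOT or pandas.

```python
def build_eval_string(column, entry):
    """Mirror the C++ string concatenation used in the Define expression."""
    # C++ side: "pandas_df['<column>'][" + std::to_string(rdfentry_) + "]"
    return f"pandas_df['{column}'][{entry}]"


# Hypothetical stand-in for the real dataframe: one string column, three rows.
pandas_df = {"particle_name": ["nu_mu", "e-", "gamma"]}

for entry in range(3):
    expression = build_eval_string("particle_name", entry)
    # TPython::Eval evaluates the string on the Python side; eval() plays that role here.
    print(expression, "->", eval(expression))
```

With a real pandas Series that has an integer index, series[i] looks up the index label i rather than the i-th row, so this pattern presumably relies on the dataframe keeping a default 0..N-1 index that lines up with rdfentry_.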

For some reason, though, I still get a similar error to the one that the workaround was supposed to address:

input_line_83:74:12: error: expected member name or ';' after declaration specifiers
 TPyReturn isascii() {
 ~~~~~~~~~ ^
/usr/include/ctype.h:225:22: note: expanded from macro 'isascii'
#  define isascii(c)    __isascii (c)
                        ^
/usr/include/ctype.h:99:26: note: expanded from macro '__isascii'
#define __isascii(c)    (((c) & ~0x7f) == 0)    /* If C is a 7 bit value.  */
                            ^
input_line_83:74:12: error: expected ')'
/usr/include/ctype.h:225:22: note: expanded from macro 'isascii'
#  define isascii(c)    __isascii (c)
                        ^
/usr/include/ctype.h:99:28: note: expanded from macro '__isascii'
#define __isascii(c)    (((c) & ~0x7f) == 0)    /* If C is a 7 bit value.  */
                              ^
input_line_83:74:12: note: to match this '('
/usr/include/ctype.h:225:22: note: expanded from macro 'isascii'
#  define isascii(c)    __isascii (c)
                        ^
/usr/include/ctype.h:99:23: note: expanded from macro '__isascii'
#define __isascii(c)    (((c) & ~0x7f) == 0)    /* If C is a 7 bit value.  */
                         ^
input_line_83:74:12: error: expected ')'
 TPyReturn isascii() {
           ^
/usr/include/ctype.h:225:22: note: expanded from macro 'isascii'
#  define isascii(c)    __isascii (c)
                        ^
/usr/include/ctype.h:99:37: note: expanded from macro '__isascii'
#define __isascii(c)    (((c) & ~0x7f) == 0)    /* If C is a 7 bit value.  */
                                       ^
input_line_83:74:12: note: to match this '('
/usr/include/ctype.h:225:22: note: expanded from macro 'isascii'
#  define isascii(c)    __isascii (c)
                        ^
/usr/include/ctype.h:99:22: note: expanded from macro '__isascii'
#define __isascii(c)    (((c) & ~0x7f) == 0)    /* If C is a 7 bit value.  */
                        ^
RDataFrame::Run: event loop was interrupted
input_line_88:74:12: error: expected member name or ';' after declaration specifiers
 TPyReturn isascii() {
 ~~~~~~~~~ ^
/usr/include/ctype.h:225:22: note: expanded from macro 'isascii'
#  define isascii(c)    __isascii (c)
                        ^
/usr/include/ctype.h:99:26: note: expanded from macro '__isascii'
#define __isascii(c)    (((c) & ~0x7f) == 0)    /* If C is a 7 bit value.  */
                            ^
input_line_88:74:12: error: expected ')'
/usr/include/ctype.h:225:22: note: expanded from macro 'isascii'
#  define isascii(c)    __isascii (c)
                        ^
/usr/include/ctype.h:99:28: note: expanded from macro '__isascii'
#define __isascii(c)    (((c) & ~0x7f) == 0)    /* If C is a 7 bit value.  */
                              ^
input_line_88:74:12: note: to match this '('
/usr/include/ctype.h:225:22: note: expanded from macro 'isascii'
#  define isascii(c)    __isascii (c)
                        ^
/usr/include/ctype.h:99:23: note: expanded from macro '__isascii'
#define __isascii(c)    (((c) & ~0x7f) == 0)    /* If C is a 7 bit value.  */
                         ^
input_line_88:74:12: error: expected ')'
 TPyReturn isascii() {
           ^
/usr/include/ctype.h:225:22: note: expanded from macro 'isascii'
#  define isascii(c)    __isascii (c)
                        ^
/usr/include/ctype.h:99:37: note: expanded from macro '__isascii'
#define __isascii(c)    (((c) & ~0x7f) == 0)    /* If C is a 7 bit value.  */
                                       ^
input_line_88:74:12: note: to match this '('
/usr/include/ctype.h:225:22: note: expanded from macro 'isascii'
#  define isascii(c)    __isascii (c)
                        ^
/usr/include/ctype.h:99:22: note: expanded from macro '__isascii'
#define __isascii(c)    (((c) & ~0x7f) == 0)    /* If C is a 7 bit value.  */
                        ^
In module 'std' imported from input_line_1:1:
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_vector.h:680:30: error: no member named '_M_start' in 'std::_Vector_base<TPyArg, std::allocator<TPyArg> >::_Vector_impl'
        std::_Destroy(this->_M_impl._M_start, this->_M_impl._M_finish,
                      ~~~~~~~~~~~~~ ^
input_line_88:6:23: note: in instantiation of member function 'std::vector<TPyArg, std::allocator<TPyArg> >::~vector' requested here
  std::vector<TPyArg> v; v.reserve(0);
                      ^
In module 'std' imported from input_line_1:1:
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/vector.tcc:86:57: error: no member named '_M_start' in 'std::_Vector_base<TPyArg, std::allocator<TPyArg> >::_Vector_impl'
                _GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR(this->_M_impl._M_start),
                                                        ~~~~~~~~~~~~~ ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_iterator.h:2470:41: note: expanded from macro '_GLIBCXX_MAKE_MOVE_IF_NOEXCEPT_ITERATOR'
  std::__make_move_if_noexcept_iterator(_Iter)
                                        ^~~~~
input_line_88:6:28: note: in instantiation of member function 'std::vector<TPyArg, std::allocator<TPyArg> >::reserve' requested here
  std::vector<TPyArg> v; v.reserve(0);
                           ^
In module 'std' imported from input_line_1:1:
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/vector.tcc:88:36: error: no member named '_M_start' in 'std::_Vector_base<TPyArg, std::allocator<TPyArg> >::_Vector_impl'
              std::_Destroy(this->_M_impl._M_start, this->_M_impl._M_finish,
                            ~~~~~~~~~~~~~ ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/vector.tcc:92:32: error: no member named '_M_start' in 'std::_Vector_base<TPyArg, std::allocator<TPyArg> >::_Vector_impl'
          _M_deallocate(this->_M_impl._M_start,
                        ~~~~~~~~~~~~~ ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/vector.tcc:95:18: error: no member named '_M_start' in 'std::_Vector_base<TPyArg, std::allocator<TPyArg> >::_Vector_impl'
          this->_M_impl._M_start = __tmp;
          ~~~~~~~~~~~~~ ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/vector.tcc:96:18: error: no member named '_M_finish' in 'std::_Vector_base<TPyArg, std::allocator<TPyArg> >::_Vector_impl'
          this->_M_impl._M_finish = __tmp + __old_size;
          ~~~~~~~~~~~~~ ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/vector.tcc:97:18: error: no member named '_M_end_of_storage' in 'std::_Vector_base<TPyArg, std::allocator<TPyArg> >::_Vector_impl'
          this->_M_impl._M_end_of_storage = this->_M_impl._M_start + __n;
          ~~~~~~~~~~~~~ ^
In module 'std' imported from input_line_1:1:
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_vector.h:281:22: error: no member named '_M_impl' in 'std::_Vector_base<TPyArg, std::allocator<TPyArg> >'
      { return this->_M_impl; }
               ~~~~  ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_vector.h:924:28: note: in instantiation of member function 'std::_Vector_base<TPyArg, std::allocator<TPyArg> >::_M_get_Tp_allocator' requested here
      { return _S_max_size(_M_get_Tp_allocator()); }
                           ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/vector.tcc:69:23: note: in instantiation of member function 'std::vector<TPyArg, std::allocator<TPyArg> >::max_size' requested here
      if (__n > this->max_size())
                      ^
input_line_88:6:28: note: in instantiation of member function 'std::vector<TPyArg, std::allocator<TPyArg> >::reserve' requested here
  std::vector<TPyArg> v; v.reserve(0);
                           ^
In module 'std' imported from input_line_1:1:
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_vector.h:999:40: error: no member named '_M_end_of_storage' in 'std::_Vector_base<TPyArg, std::allocator<TPyArg> >::_Vector_impl'
      { return size_type(this->_M_impl._M_end_of_storage
                         ~~~~~~~~~~~~~ ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/vector.tcc:71:17: note: in instantiation of member function 'std::vector<TPyArg, std::allocator<TPyArg> >::capacity' requested here
      if (this->capacity() < __n)
                ^
input_line_88:6:28: note: in instantiation of member function 'std::vector<TPyArg, std::allocator<TPyArg> >::reserve' requested here
  std::vector<TPyArg> v; v.reserve(0);
                           ^
In module 'std' imported from input_line_1:1:
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_vector.h:919:40: error: no member named '_M_finish' in 'std::_Vector_base<TPyArg, std::allocator<TPyArg> >::_Vector_impl'
      { return size_type(this->_M_impl._M_finish - this->_M_impl._M_start); }
                         ~~~~~~~~~~~~~ ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/vector.tcc:73:33: note: in instantiation of member function 'std::vector<TPyArg, std::allocator<TPyArg> >::size' requested here
          const size_type __old_size = size();
                                       ^
input_line_88:6:28: note: in instantiation of member function 'std::vector<TPyArg, std::allocator<TPyArg> >::reserve' requested here
  std::vector<TPyArg> v; v.reserve(0);
                           ^
In module 'std' imported from input_line_1:1:
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/vector.tcc:112:20: error: no member named '_M_finish' in 'std::_Vector_base<TPyArg, std::allocator<TPyArg> >::_Vector_impl'
        if (this->_M_impl._M_finish != this->_M_impl._M_end_of_storage)
            ~~~~~~~~~~~~~ ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_vector.h:1204:9: note: in instantiation of function template specialization 'std::vector<TPyArg, std::allocator<TPyArg> >::emplace_back<TPyArg>' requested here
      { emplace_back(std::move(__x)); }
        ^
input_line_88:11:5: note: in instantiation of member function 'std::vector<TPyArg, std::allocator<TPyArg> >::push_back' requested here
  v.push_back(fPyObject);
    ^
In module 'std' imported from input_line_1:1:
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_vector.h:830:39: error: no member named '_M_finish' in 'std::_Vector_base<TPyArg, std::allocator<TPyArg> >::_Vector_impl'
      { return iterator(this->_M_impl._M_finish); }
                        ~~~~~~~~~~~~~ ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_vector.h:1146:11: note: in instantiation of member function 'std::vector<TPyArg, std::allocator<TPyArg> >::end' requested here
        return *(end() - 1);
                 ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/vector.tcc:123:9: note: in instantiation of member function 'std::vector<TPyArg, std::allocator<TPyArg> >::back' requested here
        return back();
               ^
/usr/lib/gcc/x86_64-redhat-linux/11/../../../../include/c++/11/bits/stl_vector.h:1204:9: note: in instantiation of function template specialization 'std::vector<TPyArg, std::allocator<TPyArg> >::emplace_back<TPyArg>' requested here
      { emplace_back(std::move(__x)); }
        ^
input_line_88:11:5: note: in instantiation of member function 'std::vector<TPyArg, std::allocator<TPyArg> >::push_back' requested here
  v.push_back(fPyObject);
    ^
RDataFrame::Run: event loop was interrupted
Traceback (most recent call last):
  File "/storage/gpfs_data/neutrino/users/alrugger/Software/DarkNews/DarkNews-generator/examples/root_mr_example.py", line 32, in <module>
    data_rdf.Snapshot("test_tree","Test_file.root")
cppyy.gbl.std.logic_error: Template method resolution failed:
  ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> > ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Snapshot(string_view treename, string_view filename, string_view columnNameRegexp = "", const ROOT::RDF::RSnapshotOptions& options = RSnapshotOptions()) =>
    logic_error: basic_string::_M_construct null not valid
  ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> > ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Snapshot(string_view treename, string_view filename, string_view columnNameRegexp = "", const ROOT::RDF::RSnapshotOptions& options = RSnapshotOptions()) =>
    logic_error: basic_string::_M_construct null not valid
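
The first errors in the log point at a generated wrapper method named isascii being rewritten by the isascii(c) macro from glibc's /usr/include/ctype.h. Assuming that wrapper mirrors the methods of a Python str object (an inference from the log, not something confirmed in this thread), the following sketch lists which attributes of a Python type share a name with that macro. The function name and the macro set are illustrative only.

```python
# Macro names taken from the compiler notes in the log above (glibc <ctype.h>).
MACROS_SEEN_IN_LOG = frozenset({"isascii"})


def colliding_attributes(py_type, macro_names=MACROS_SEEN_IN_LOG):
    """Return the attribute names of py_type that match a known C macro name."""
    return sorted(name for name in dir(py_type) if name in macro_names)


# str gained an isascii() method in Python 3.7; int has no such attribute.
print("str:", colliding_attributes(str))
print("int:", colliding_attributes(int))
```

If the list for str is non-empty on the failing machine, that would be consistent with the macro clash shown in the log, although it does not by itself explain why the workaround behaves differently on the two platforms.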

A strange aspect is that the same code runs with no issues on my local machine (ROOT: v6.32.08, Python: 3.9.6, macOS 15.1 with Intel chip and Apple clang 16.0.0 compiler).

What could be the problem with my code or setup?
Any feedback that you may have would be greatly appreciated.

I’m attaching the Python script and a shortened pandas file below.
root_mr_example.py (1.1 KB)
pandas_df_example.pckl.zip (162.8 KB)

ROOT Version: v6.32.06
Python version: 3.9.18
Platform: AlmaLinux 9.4
Compiler: linuxx8664gcc


Hi,

Thanks for the interesting post.

I am sorry to read that the solution from the forum did not work out of the box on both platforms. I think the first thing to figure out is the differences between the two setups. One is the ROOT version: patch 08 vs. patch 06. How was ROOT configured in the two cases? Is the version of pandas the same?
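
One way to compare the two setups is to run the same small report on both machines and diff the output. The sketch below is only a suggestion: collect_environment is a hypothetical helper, and it assumes that ROOT.gROOT.GetVersion() and ROOT.gROOT.GetConfigFeatures() are available in the installed PyROOT. Modules that cannot be imported are reported as such instead of raising.

```python
import importlib
import platform
import sys


def collect_environment():
    """Gather version details worth comparing between two machines."""
    info = {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "executable": sys.executable,
    }
    for module_name in ("pandas", "numpy"):
        try:
            module = importlib.import_module(module_name)
            info[module_name] = getattr(module, "__version__", "unknown")
        except ImportError:
            info[module_name] = "not installed"
    try:
        import ROOT  # optional: only queried when PyROOT is importable

        info["ROOT"] = str(ROOT.gROOT.GetVersion())
        info["ROOT features"] = str(ROOT.gROOT.GetConfigFeatures())
    except ImportError:
        info["ROOT"] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

Saving the output to a text file on each machine and comparing the two files should make differences in versions or build features easy to spot.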

Best,
D

Hi,

Thank you for the quick feedback. The Pandas version is 2.2.3 in both setups.
Both the remote and local installations are compiled. For my local installation I just kept the default cmake options. As far as I understand, the remote installation was compiled with the same options as this bash script snippet:

    # Run cmake with specified install and source directories
    cmake -DCMAKE_CXX_STANDARD=17 -Dxrootd=OFF -Dpythia6=ON -DPYTHIA6_LIBRARY=/opt/exp_software/neutrino/al9/PYTHIA6/v6_428/lib/libPythia6.so \
        -Dpythia8=ON -DPYTHIA8_DIR=/opt/exp_software/neutrino/al9/PYTHIA8/8312 -Dmathmore=ON \
        -DCMAKE_INSTALL_PREFIX=../../"$INSTALL_FOLDER_NAME" -DPython3_EXECUTABLE="$(which python3)" ../../"$SOURCE_FOLDER_NAME"

Is there any specific configuration parameter that I should look into?
I’m waiting for your feedback.

Hi,

Thanks. This is not clear to me yet. I cannot reproduce the issue due to the DarkNews module, which apparently is required by the example. Would you be able to remove it from the equation?

Cheers,
D

Hi,

I’m sorry for the issue: the pandas dataframe had some package-dependent attributes, apparently. I removed them and now I can run the example outside my DarkNews installation environments.
The error is still there on the remote machine.

I’m attaching the updated data file and script.

root_mr_example_format.py (1.1 KB)
pandas_df_format.pckl.zip (153.7 KB)