Create RDataFrame from CSV with no header


ROOT Version: 6.30.08
Platform: Rocky Linux 8.10 (Green Obsidian)


I am trying to use ROOT.RDF.FromCSV in Python. The CSV does not have a header row, so I want to pass the colTypes argument in order to define the column names and column types.

The csv consists of 4 columns, 2 strings and 2 integers:

"foo","bar",1,2
"baz","boo",3,4

My understanding is that I have to pass an unordered map as a string. I have tried the following, but with no luck:

Python 3.12.8 (main, Dec 12 2024, 16:30:29) [GCC 8.5.0 20210514 (Red Hat 8.5.0-22)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ROOT
>>> file = "sample.csv"
>>> df = ROOT.RDF.FromCSV(file, False, ",", -1, "{{\"Col0\", \"T\"}}, {{\"Col1\", \"T\"}}, {{\"Col2\", \"L\"}}, {{\"Col3\", \"L\"}}")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ROOT::RDataFrame ROOT::RDF::FromCSV(basic_string_view<char,char_traits<char> > fileName, bool readHeaders = true, char delimiter = ',', Long64_t linesChunkSize = -1LL, unordered_map<string,char>&& colTypes = {}) =>
    TypeError: could not convert argument 5
[fff@scripts]$ cat test.csv
"foo","bar",1,2
"baz","boo",3,4
[fff@scripts]$

My intention is to define the column names at least.

Hi @dimitris_lepipas,

Let me add @vpadulan in the loop who might be able to help you.

Cheers,
Dev

Hi @dimitris_lepipas,

since you’re using Python you can simply do:

import ROOT 
filename = "csvex.csv"
df = ROOT.RDF.FromCSV(filename, False)
df.Display("Col0").Print() # checking what's inside my rdf
+-----+---------+
| Row | Col0    | 
+-----+---------+
| 0   | ""foo"" | 
+-----+---------+
| 1   | ""baz"" | 
+-----+---------+

It should indeed be slightly better documented.

Cheers,
Marta

Hi @dimitris_lepipas.

One more thing: the above uses the latest ROOT release, 6.36. I highly suggest that you update your ROOT version. For example, the pythonization of FromCSV was added in 6.32, and you are also missing out on many new features and bug fixes!

Cheers,
Marta

Thanks for the response. I found out how to read a CSV without a header by passing False as the second argument or as a named parameter:

[lepipas@pocket-n2 scripts]$ cat test.csv
"foo","bar",1,2
"baz","boo",3,4
[lepipas@pocket-n2 scripts]$ python3
Python 3.6.8 (default, Apr 24 2024, 21:55:04)
[GCC 8.5.0 20210514 (Red Hat 8.5.0-22)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ROOT

>>>
>>> file = "test.csv"
>>> df = ROOT.RDF.FromCSV(fileName=file, readHeaders=False, delimiter=",", linesChunkSize=-1)
>>> df.Display().Print()
+-----+-------+-------+------+------+
| Row | Col0  | Col1  | Col2 | Col3 |
+-----+-------+-------+------+------+
| 0   | "foo" | "bar" | 1    | 2    |
+-----+-------+-------+------+------+
| 1   | "baz" | "boo" | 3    | 4    |
+-----+-------+-------+------+------+
>>> del df
>>> df = ROOT.RDF.FromCSV(file, False, ",", -1)
>>> df.Display().Print()
+-----+-------+-------+------+------+
| Row | Col0  | Col1  | Col2 | Col3 |
+-----+-------+-------+------+------+
| 0   | "foo" | "bar" | 1    | 2    |
+-----+-------+-------+------+------+
| 1   | "baz" | "boo" | 3    | 4    |
+-----+-------+-------+------+------+
>>>

But I cannot find a way to use the fifth argument, which the documentation describes as:

1. Column Types (optional, default is an empty map). A map with column names as keys and their type (expressed as a single character, see below) as values.

I tried the following and got a TypeError:

>>> df = ROOT.RDF.FromCSV(file, False, ",", -1, "{{\"Col0\", \"T\"}}, {{\"Col1\", \"T\"}}, {{\"Col2\", \"L\"}}, {{\"Col3\", \"L\"}}")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ROOT::RDataFrame ROOT::RDF::FromCSV(basic_string_view<char,char_traits<char> > fileName, bool readHeaders = true, char delimiter = ',', Long64_t linesChunkSize = -1LL, unordered_map<string,char>&& colTypes = {}) =>
    TypeError: could not convert argument 5
>>>

Is it possible to specify the column names in the constructor? I would like an RDataFrame whose columns are not named Col0, … but have custom names that I provide in the constructor, if possible.

I also updated the info about my OS platform version. Here are the details:

NAME="Rocky Linux"
VERSION="8.10 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.10"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.10 (Green Obsidian)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2029-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-8"
ROCKY_SUPPORT_PRODUCT_VERSION="8.10"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.10"

So I guess that the latest supported version is 6.30.08, as I see in: Overview - rpms/root - src.fedoraproject.org

Regarding the last argument (column types), this works in C++ (at least on v6.32.12, didn’t test on older versions):

root [0] auto df = ROOT::RDF::FromCSV("sample.csv",false,',',-1LL,{{"Col0",'T'},{"Col1",'T'},{"Col2",'L'},{"Col3",'D'}})
(ROOT::RDataFrame &) A data frame associated to the data source "CSV data source"
root [1] df.Describe()
(ROOT::RDF::RDFDescription) Dataframe from datasource RCsv

Property                Value
--------                -----
Columns in total            4
Columns from defines        0
Event loops run             0
Processing slots            1

Column  Type            Origin
------  ----            ------
Col0    std::string     Dataset
Col1    std::string     Dataset
Col2    Long64_t        Dataset
Col3    double          Dataset

root [2]

I was not able to get it working with Python, as I guess you have also tried.

Hi @dimitris_lepipas,

In 6.36 (also 6.34 and 6.32), you can do:

>>> df = ROOT.RDF.FromCSV("csvex.csv", readHeaders=False, delimiter=",", linesChunkSize=-1, columnNames = ['cat', 'dog', 'horse', 'monkey'])
>>> df.Display().Print()
+-----+---------+---------+-------+--------+
| Row | cat     | dog     | horse | monkey | 
+-----+---------+---------+-------+--------+
| 0   | ""foo"" | ""bar"" | 1     | 2      | 
+-----+---------+---------+-------+--------+
| 1   | ""baz"" | ""boo"" | 3     | 4      | 
+-----+---------+---------+-------+--------+

If you cannot or do not want to update your OS to something newer that works with newer versions of ROOT, you should follow the advice from @dastudillo above and run in C++.
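A possible workaround for older versions, as a rough and untested sketch: RDataFrame has an Alias(alias, columnName) method, so the default Col0, Col1, … names could be given extra custom names after loading. The helper names apply_aliases and load_with_names below are hypothetical, and I am assuming that Alias works on columns coming from the CSV data source in 6.30. Note that the original ColN names would remain valid next to the aliases.

```python
def apply_aliases(df, new_names):
    """Chain one Alias call per column so that ColN also answers to new_names[N]."""
    node = df
    for index, name in enumerate(new_names):
        # Alias(alias, columnName) returns a new node; keep chaining on it.
        node = node.Alias(name, "Col{}".format(index))
    return node


def load_with_names(path, new_names):
    """Hypothetical convenience wrapper: read a header-less CSV, then alias its columns."""
    import ROOT  # imported lazily so apply_aliases can be used without ROOT installed

    df = ROOT.RDF.FromCSV(path, False)  # False: the file has no header row
    return apply_aliases(df, new_names)
```

Usage would then look like load_with_names("test.csv", ["cat", "dog", "horse", "monkey"]), after which the returned node can be used like any other RDataFrame node.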

Cheers,
Marta

Regarding what you tried, there are two errors. The first one is more obvious: you have to pass in something that actually looks like a map, e.g. a Python dict, as in:

>>> df = ROOT.RDF.FromCSV("sample.csv", False, ",", -1, {"Col0": "T", "Col1": "T","Col2": "L", "Col3": "L"})
Traceback (most recent call last):
  File "<python-input-2>", line 1, in <module>
    df = ROOT.RDF.FromCSV("sample.csv", False, ",", -1, {"Col0": "T", "Col1": "T","Col2": "L", "Col3": "L"})
TypeError: ROOT::RDataFrame ROOT::RDF::FromCSV(string_view fileName, bool readHeaders = true, char delimiter = ',', Long64_t linesChunkSize = -1LL, unordered_map<string,char>&& colTypes = {}) =>
    TypeError: could not convert argument 5

However, that is not enough to achieve the same as in C++, because the value type of the map has to be convertible to char and we are passing a string here. A char is just an integer in C++, so if we really want to pass a type as in C++, it is possible by, e.g., looking up the codes for T and L in an ASCII table and using those:

>>> df = ROOT.RDF.FromCSV("sample.csv", False, ",", -1, {"Col0": 0x54, "Col1": 0x54,"Col2": 0x4c, "Col3": 0x4c})
>>> df.Describe()
Dataframe from datasource RCsv

Property                Value
--------                -----
Columns in total            4
Columns from defines        0
Event loops run             0
Processing slots            1

Column  Type            Origin
------  ----            ------
Col0    std::string     Dataset
Col1    std::string     Dataset
Col2    Long64_t        Dataset
Col3    Long64_t        Dataset

Not super ergonomic though…
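To avoid looking up ASCII codes by hand, Python's built-in ord() gives the same integers (ord("T") is 0x54 and ord("L") is 0x4c). A minimal sketch, assuming the integer-valued dict keeps being accepted as shown above; the helper names col_types_to_codes and read_csv_typed are made up for illustration:

```python
def col_types_to_codes(col_types):
    """Turn {"Col0": "T", ...} into {"Col0": 84, ...} using ord()."""
    codes = {}
    for name, type_char in col_types.items():
        if len(type_char) != 1:
            raise ValueError("type for {!r} must be a single character".format(name))
        codes[name] = ord(type_char)  # e.g. "T" -> 0x54, "L" -> 0x4c
    return codes


def read_csv_typed(path, col_types):
    """Hypothetical wrapper around the positional call shown above."""
    import ROOT  # imported lazily so the conversion helper works without ROOT

    return ROOT.RDF.FromCSV(path, False, ",", -1, col_types_to_codes(col_types))
```

With this, the call above could be written as read_csv_typed("sample.csv", {"Col0": "T", "Col1": "T", "Col2": "L", "Col3": "L"}).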
