Create RDataFrame from CSV with no header


ROOT Version: 6.30.08
Platform: Rocky Linux 8.10 (Green Obsidian)


I am trying to use ROOT.RDF.FromCSV in Python. The CSV does not have a header row, so I want to pass the colTypes argument in order to define the column names and column types.

The csv consists of 4 columns, 2 strings and 2 integers:

"foo","bar",1,2
"baz","boo",3,4

My understanding is that I have to pass an unordered map as a string. I have tried the following, but with no luck:

Python 3.12.8 (main, Dec 12 2024, 16:30:29) [GCC 8.5.0 20210514 (Red Hat 8.5.0-22)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ROOT
>>> file = "sample.csv"
>>> df = ROOT.RDF.FromCSV(file, False, ",", -1, "{{\"Col0\", \"T\"}}, {{\"Col1\", \"T\"}}, {{\"Col2\", \"L\"}}, {{\"Col3\", \"L\"}}")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ROOT::RDataFrame ROOT::RDF::FromCSV(basic_string_view<char,char_traits<char> > fileName, bool readHeaders = true, char delimiter = ',', Long64_t linesChunkSize = -1LL, unordered_map<string,char>&& colTypes = {}) =>
    TypeError: could not convert argument 5
>>> del df
>>> df = ROOT.RDF.FromCSV(file, False, ",", -1, "{{\"Col0\", \"T\"}}, {{\"Col1\", \"T\"}}, {{\"Col2\", \"L\"}}, {{\"Col3\", \"L\"}}")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ROOT::RDataFrame ROOT::RDF::FromCSV(basic_string_view<char,char_traits<char> > fileName, bool readHeaders = true, char delimiter = ',', Long64_t linesChunkSize = -1LL, unordered_map<string,char>&& colTypes = {}) =>
    TypeError: could not convert argument 5
[fff@scripts]$ cat test.csv
"foo","bar",1,2
"baz","boo",3,4
[fff@scripts]$

My intention is to define the column names at least.

Hi @dimitris_lepipas,

Let me add @vpadulan in the loop who might be able to help you.

Cheers,
Dev

Hi @dimitris_lepipas,

since you’re using Python you can simply do:

import ROOT 
filename = "csvex.csv"
df = ROOT.RDF.FromCSV(filename, False)
df.Display("Col0").Print() # checking what's inside my rdf
+-----+---------+
| Row | Col0    | 
+-----+---------+
| 0   | ""foo"" | 
+-----+---------+
| 1   | ""baz"" | 
+-----+---------+

It should indeed be slightly better documented.

Cheers,
Marta

Hi @dimitris_lepipas.

One more thing: the above uses the latest ROOT release, 6.36. I strongly suggest that you update your ROOT version. For example, the pythonization of FromCSV was added in 6.32, and you are also missing out on many new features and bug fixes!

Cheers,
Marta

Thanks for the response. I found out how to read a CSV without a header by passing False as the second argument or as a named parameter:

[lepipas@pocket-n2 scripts]$ cat test.csv
"foo","bar",1,2
"baz","boo",3,4
[lepipas@pocket-n2 scripts]$ python3
Python 3.6.8 (default, Apr 24 2024, 21:55:04)
[GCC 8.5.0 20210514 (Red Hat 8.5.0-22)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ROOT

>>>
>>> file = "test.csv"
>>> df = ROOT.RDF.FromCSV(fileName=file, readHeaders=False, delimiter=",", linesChunkSize=-1)
>>> df.Display().Print()
+-----+-------+-------+------+------+
| Row | Col0  | Col1  | Col2 | Col3 |
+-----+-------+-------+------+------+
| 0   | "foo" | "bar" | 1    | 2    |
+-----+-------+-------+------+------+
| 1   | "baz" | "boo" | 3    | 4    |
+-----+-------+-------+------+------+
>>> del df
>>> df = ROOT.RDF.FromCSV(file, False, ",", -1)
>>> df.Display().Print()
+-----+-------+-------+------+------+
| Row | Col0  | Col1  | Col2 | Col3 |
+-----+-------+-------+------+------+
| 0   | "foo" | "bar" | 1    | 2    |
+-----+-------+-------+------+------+
| 1   | "baz" | "boo" | 3    | 4    |
+-----+-------+-------+------+------+
>>>

But I cannot find a way to use the fifth argument of FromCSV, which the documentation describes as:

1. Column Types (optional, default is an empty map). A map with column names as keys and their type (expressed as a single character, see below) as values.

I tried the following and got a TypeError:

>>> df = ROOT.RDF.FromCSV(file, False, ",", -1, "{{\"Col0\", \"T\"}}, {{\"Col1\", \"T\"}}, {{\"Col2\", \"L\"}}, {{\"Col3\", \"L\"}}")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ROOT::RDataFrame ROOT::RDF::FromCSV(basic_string_view<char,char_traits<char> > fileName, bool readHeaders = true, char delimiter = ',', Long64_t linesChunkSize = -1LL, unordered_map<string,char>&& colTypes = {}) =>
    TypeError: could not convert argument 5
>>>

Is it possible to specify the column names in the constructor? My intention is to have an RDataFrame whose columns are not named Col0, … but have custom names that I provide in the constructor, if possible.

I have also updated the information about my OS platform version. The full details are:

NAME="Rocky Linux"
VERSION="8.10 (Green Obsidian)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="8.10"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Rocky Linux 8.10 (Green Obsidian)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:8:GA"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2029-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-8"
ROCKY_SUPPORT_PRODUCT_VERSION="8.10"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.10"

So I guess that the latest supported version is 6.30.08, as listed in: Overview - rpms/root - src.fedoraproject.org

Regarding the last argument (column types), this works in C++ (at least on v6.32.12, didn’t test on older versions):

root [0] auto df = ROOT::RDF::FromCSV("sample.csv",false,',',-1LL,{{"Col0",'T'},{"Col1",'T'},{"Col2",'L'},{"Col3",'D'}})
(ROOT::RDataFrame &) A data frame associated to the data source "CSV data source"
root [1] df.Describe()
(ROOT::RDF::RDFDescription) Dataframe from datasource RCsv

Property                Value
--------                -----
Columns in total            4
Columns from defines        0
Event loops run             0
Processing slots            1

Column  Type            Origin
------  ----            ------
Col0    std::string     Dataset
Col1    std::string     Dataset
Col2    Long64_t        Dataset
Col3    double          Dataset

root [2]

I was not able to get it working with Python, as I guess you have also tried.

Hi @dimitris_lepipas,

In 6.36 (also 6.34 and 6.32), you can do:

>>> df = ROOT.RDF.FromCSV("csvex.csv", readHeaders=False, delimiter=",", linesChunkSize=-1, columnNames = ['cat', 'dog', 'horse', 'monkey'])
>>> df.Display().Print()
+-----+---------+---------+-------+--------+
| Row | cat     | dog     | horse | monkey | 
+-----+---------+---------+-------+--------+
| 0   | ""foo"" | ""bar"" | 1     | 2      | 
+-----+---------+---------+-------+--------+
| 1   | ""baz"" | ""boo"" | 3     | 4      | 
+-----+---------+---------+-------+--------+

If you don’t want to/can’t update your OS to something newer which would work with newer versions of ROOT, you should follow the advice from @dastudillo above and run in C++.

Cheers,
Marta

Regarding what you tried, there are two errors. The first one is more obvious: you have to pass in something that actually looks like a map, e.g. a Python dict, as in:

>>> df = ROOT.RDF.FromCSV("sample.csv", False, ",", -1, {"Col0": "T", "Col1": "T","Col2": "L", "Col3": "L"})
Traceback (most recent call last):
  File "<python-input-2>", line 1, in <module>
    df = ROOT.RDF.FromCSV("sample.csv", False, ",", -1, {"Col0": "T", "Col1": "T","Col2": "L", "Col3": "L"})
TypeError: ROOT::RDataFrame ROOT::RDF::FromCSV(string_view fileName, bool readHeaders = true, char delimiter = ',', Long64_t linesChunkSize = -1LL, unordered_map<string,char>&& colTypes = {}) =>
    TypeError: could not convert argument 5

However, that is not enough to achieve the same as in C++, because the value type of the map has to be convertible to char and we pass in a string here. A char is just an integer in C++, so if we really want to pass a type as in C++, it is possible by e.g. looking up the codes for T and L in an ASCII table and using those:

>>> df = ROOT.RDF.FromCSV("sample.csv", False, ",", -1, {"Col0": 0x54, "Col1": 0x54,"Col2": 0x4c, "Col3": 0x4c})
>>> df.Describe()
Dataframe from datasource RCsv

Property                Value
--------                -----
Columns in total            4
Columns from defines        0
Event loops run             0
Processing slots            1

Column  Type            Origin
------  ----            ------
Col0    std::string     Dataset
Col1    std::string     Dataset
Col2    Long64_t        Dataset
Col3    Long64_t        Dataset

Not super ergonomic though…
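
To avoid looking up the ASCII codes by hand, the integers can be derived with Python's built-in ord(). This is only a minimal sketch: to_coltypes is a hypothetical helper name (not part of ROOT), and it assumes the dict-of-integers form shown above is accepted by your ROOT version.

```python
def to_coltypes(types):
    """Convert {"Col0": "T", ...} into {"Col0": 84, ...} (ASCII codes).

    Hypothetical helper, not part of ROOT.
    """
    for name, code in types.items():
        if len(code) != 1:
            raise ValueError(f"type code for {name!r} must be a single character")
    return {name: ord(code) for name, code in types.items()}


col_types = to_coltypes({"Col0": "T", "Col1": "T", "Col2": "L", "Col3": "L"})
print(col_types)

# With ROOT installed, the result would be passed as the fifth argument:
# import ROOT
# df = ROOT.RDF.FromCSV("sample.csv", False, ",", -1, col_types)
```

The ROOT call is left commented out so that the conversion itself can be checked without a ROOT installation.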

I finally installed ROOT with miniconda to try what you suggested, and in version 6.34 (latest stable) it does not work. I am not sure whether it works in 6.36.

(nio_test_root_env) [xxx@yyy ~]$ root --version
ROOT Version: 6.34.04
Built for linuxx8664gcc on Feb 26 2025, 15:57:22
From tags/6-34-04@6-34-04
(nio_test_root_env) [xxx@yyy ~]$ python
Python 3.12.11 | packaged by conda-forge | (main, Jun  4 2025, 14:45:31) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ROOT
>>> f = "scripts/test.csv"
>>>
>>> df = ROOT.RDF.FromCSV(f, readHeaders=False, delimiter=",", linesChunkSize=-1, columnNames = ['cat', 'dog', 'horse', 'monkey'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ROOT::RDataFrame ROOT::RDF::FromCSV(string_view fileName, bool readHeaders = true, char delimiter = ',', Long64_t linesChunkSize = -1LL, unordered_map<string,char>&& colTypes = {}) =>
    TypeError: RDF::FromCSV got an unexpected keyword argument 'columnNames'
>>> df = ROOT.RDF.FromCSV(f, False, ",", -1, {"Col0": 0x54, "Col1": 0x54,"Col2": 0x4c, "Col3": 0x4c})
>>> df.Display().Print()
+-----+-------+-------+------+------+
| Row | Col0  | Col1  | Col2 | Col3 |
+-----+-------+-------+------+------+
| 0   | "foo" | "bar" | 1    | 2    |
+-----+-------+-------+------+------+
| 1   | "baz" | "boo" | 3    | 4    |
+-----+-------+-------+------+------+
>>>

So there does not seem to be a way to define the column names, at least up to version 6.34.

I tried to update ROOT to 6.36 with:
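
One possible workaround on versions without columnNames is to write a copy of the file with a header line prepended and then read that copy with readHeaders=True. This is only a sketch under that assumption: write_with_header and from_csv_with_names are hypothetical helper names, the data lines are copied unchanged (quotes included), and I have not checked it against every ROOT version.

```python
import shutil


def write_with_header(src_path, dst_path, names, delimiter=","):
    """Copy src_path to dst_path with a header line of column names prepended."""
    with open(src_path, "r", newline="") as src, \
         open(dst_path, "w", newline="") as dst:
        dst.write(delimiter.join(names) + "\n")
        shutil.copyfileobj(src, dst)  # data lines are copied verbatim


def from_csv_with_names(src_path, dst_path, names):
    """Hypothetical wrapper: build the RDataFrame from the copy with headers."""
    import ROOT  # imported here so write_with_header works without ROOT
    write_with_header(src_path, dst_path, names)
    # readHeaders=True makes FromCSV take the column names from the first line
    return ROOT.RDF.FromCSV(dst_path, True)
```

Usage would look like df = from_csv_with_names("test.csv", "test_named.csv", ["cat", "dog", "horse", "monkey"]), at the cost of one extra copy of the file on disk.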

(nio_test_root_env) [xxx@yyy ~]$ conda update root
Channels:
 - conda-forge
 - defaults
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done

# All requested packages already installed.

(nio_test_root_env) [xxxx@yyy ~]$ 

but it didn’t work. It seems that 6.34 is the latest version available in conda.

Hi @dimitris_lepipas,

thanks for trying, and I am sorry for the misleading information. Indeed, the feature only works with 6.36. We are working on having the conda package of the newest ROOT ready soon, but as of now, you are right, the latest version is 6.34.04.

I am guessing you don’t have a CERN computing account? If you do, you could use the SWAN platform with the Bleeding Edge configuration.

One more solution could be a Docker image. We have an Ubuntu Docker image of ROOT 6.36 available. You could try to use that, for example, with GitHub Codespaces, as we do for our student course: GitHub - root-project/student-course: ROOT course for students. If you click on the “GH Codespaces” icon, you will be able to use the notebooks from the course and also modify or add your own notebooks or other files. You can also create your own repository using a Dockerfile similar to the one we provide: student-course/.devcontainer/Dockerfile at main · root-project/student-course · GitHub. Let me know if this is a sufficient solution for your work. If it doesn’t fit your workflow, you should use ROOT with C++ for the time being.

Sorry for the inconvenience. Once ROOT 6.36 via Conda is available, it should become easier for you!

Cheers,
Marta

Thanks for the provided solutions. I will stick with conda and wait for the new ROOT version.