Column types for RDataFrame from CSV

danj1011 · January 10, 2023, 5:15pm

Hi.

NB: this question is similar to the topic "RDataFrame column types reading in a csv file", but in that case they could get around the problem by tricking ROOT into expecting a string with quotation marks. Here I don’t know how to tell ROOT that it will lose precision if it forces a column to an integer type, even if the first row makes that look safe.

I’m reading in CSV data with:

auto df = ROOT::RDF::MakeCsvDataFrame(filename_in);

and it looks like ROOT sets the type of each column using the first entry in the CSV. Sometimes, though, for a large number like a momentum, the first entry looks like a long integer when the column is really Double_t if you look at other rows. Is there a way to force ROOT to use the ‘correct’ types for the columns?

ROOT Version: 6.26/10
Platform: Arch
Compiler: Not Provided
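
To make the failure mode concrete, here is a minimal sketch in plain C++ (no ROOT needed) of the two inference strategies. It is not ROOT's actual implementation; the helper names `LooksLikeInteger`, `TypeFromFirstRow` and `TypeFromAllRows` are hypothetical and only illustrate why looking at the first row alone can pick an integer type for a column that really holds doubles.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of ROOT: true if the whole cell is an
// optionally signed run of decimal digits.
bool LooksLikeInteger(const std::string &cell)
{
   if (cell.empty()) return false;
   std::size_t i = (cell[0] == '-' || cell[0] == '+') ? 1 : 0;
   if (i == cell.size()) return false; // a lone sign is not a number
   for (; i < cell.size(); ++i)
      if (cell[i] < '0' || cell[i] > '9') return false;
   return true;
}

// Guess the column type from the first cell only, which is the behaviour
// the post describes. An empty column falls back to "double".
std::string TypeFromFirstRow(const std::vector<std::string> &column)
{
   if (column.empty()) return "double";
   return LooksLikeInteger(column.front()) ? "Long64_t" : "double";
}

// Guess the column type after inspecting every cell: a single non-integer
// cell is enough to make the whole column "double".
std::string TypeFromAllRows(const std::vector<std::string> &column)
{
   if (column.empty()) return "double";
   for (const auto &cell : column)
      if (!LooksLikeInteger(cell)) return "double";
   return "Long64_t";
}
```

For a made-up momentum column `{"1200", "1187.25", "950.5"}`, `TypeFromFirstRow` picks "Long64_t" while `TypeFromAllRows` picks "double". Scanning every row costs a full pass over the file, which may be why a reader would prefer to look at the first row only.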

danj1011 · January 10, 2023, 8:13pm

I got around this problem since I am in control of the CSV contents, too. So I changed the precision of what I dump to CSV so that the column always has a “.” and that forces RDataFrame to read it in as a Double.
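
A minimal sketch of that workaround on the writing side, assuming the CSV is produced with standard C++ streams (the function name `FormatForCsv` is made up for illustration). With default stream formatting a value like 1200.0 is printed as `1200`; `std::fixed` with a set precision, plus `std::showpoint`, keeps the decimal point in every row.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: format a double so the text always contains a '.',
// even when the value happens to be a whole number.
std::string FormatForCsv(double value, int digits = 6)
{
   std::ostringstream out;
   // std::fixed prints exactly `digits` decimals; std::showpoint keeps the
   // '.' even if digits == 0.
   out << std::fixed << std::showpoint << std::setprecision(digits) << value;
   return out.str();
}
```

For example, `FormatForCsv(1200.0, 3)` gives `1200.000`, so a reader that infers the type from the first row sees a floating-point value.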

Incidentally, I’m only writing to CSV because I have a problem with race conditions when writing to separate TTree/TFiles in a ForEachSlot of an RDataFrame. When I write to CSV instead of TFile I get around it easily…

bellenot · January 10, 2023, 8:23pm

Maybe @vpadulan or @Axel can comment about the issue you have

danj1011 · January 14, 2023, 7:44am

Hope so, @bellenot !

dastudillo · January 14, 2023, 2:17pm

One way around it is to first read the csv file into a tree, which does support specifying the type of each column (see TTree::ReadFile), and then when you read the tree with a DataFrame, it will take the types from the tree.

$ cat z.txt
2 209231
3 345
7 25435
1 6732645

$ cat z.C
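void z() {
  // Output file; the tree created below is attached to it.
  TFile f("z.root","RECREATE");
  TTree *T=new TTree("T","tree");
  // Branch descriptor: column "a" is read as Int_t (/I), column "b" as Double_t (/D).
  T->ReadFile("z.txt","a/I:b/D");
  T->Write("T");
  f.Close();  // closing the file also deletes the tree it owns
}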

$ root -l -b -q z.C
(...)

root [0] ROOT::RDataFrame d("T", "z.root")
(ROOT::RDataFrame &) A data frame built on top of the T dataset.
root [1] d.GetColumnType("a")
(std::string) "Int_t"
root [2] d.GetColumnType("b")
(std::string) "Double_t"

eguiraud · January 16, 2023, 4:26pm

Hi,

in ROOT master (and the upcoming v6.28) this problem is resolved by ROOT::RDF::FromCSV, which allows passing a mapping from column name to type.

In v6.26 I think the best workaround is what @dastudillo suggests.

Cheers,
Enrico

danj1011 · January 16, 2023, 4:41pm

Thanks, I’ll do that.

How long until 6.28 comes out? (ballpark)

eguiraud · January 16, 2023, 5:33pm

Very rough estimate: a few weeks
