NaN values in one column break histogram of another column

mwilkins · December 6, 2019, 7:41pm

If I have an RDataFrame where some of the values in some of the columns are NaN, I cannot make a histogram from a column without NaN values:

Exception: TH1D& ROOT::RDF::RResultPtr<TH1D>::operator*() =>
    stoll: no conversion (C++ exception of type invalid_argument)

Reproducer follows.

Suppose I have two CSV files, one with all values filled and one with some empty values:

$ cat temp_good.csv 
a,b,c
1,2,3
1,2,3
1,2,3

$ cat temp_bad.csv 
a,b,c
1,2,3
1,2,
1,,

I can successfully make a histogram of ‘a’ from temp_good.csv, but not temp_bad.csv, even though ‘a’ has all its values defined in both files:

In [1]: import ROOT

In [2]: df_good = ROOT.RDF.MakeCsvDataFrame('temp_good.csv')

In [3]: h_good = df_good.Histo1D('a')

In [4]: h_good.SetFillStyle(3845)

In [5]: df_bad = ROOT.RDF.MakeCsvDataFrame('temp_bad.csv')

In [6]: h_bad = df_bad.Histo1D('a')

In [7]: h_bad.SetFillStyle(3845)
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-7-4aa1c7efd4fb> in <module>()
----> 1 h_bad.SetFillStyle(3845)

Exception: TH1D& ROOT::RDF::RResultPtr<TH1D>::operator*() =>
    stoll: no conversion (C++ exception of type invalid_argument)

ROOT Version: master
Platform: macOS
Compiler: Not Provided
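
One plausible reading of the error, sketched in plain Python rather than taken from ROOT's code: assume the CSV data source infers each column's type from the first data row and then converts every cell of every row, whether or not that column is actually used. Under that assumption the empty cells in 'b' and 'c' fail to convert even when only 'a' is histogrammed. The helper names `infer_types` and `parse_rows` are hypothetical, and Python's `int('')` merely stands in for the failing `stoll` call.

```python
import csv
import io

BAD_CSV = "a,b,c\n1,2,3\n1,2,\n1,,\n"


def infer_types(first_row):
    """Pick a converter per column from the first data row (assumed behaviour)."""
    converters = []
    for cell in first_row:
        for candidate in (int, float):
            try:
                candidate(cell)
            except ValueError:
                continue
            converters.append(candidate)
            break
        else:
            # Neither int nor float worked: keep the cell as text.
            converters.append(str)
    return converters


def parse_rows(text):
    """Convert every cell of every row eagerly, even for unused columns."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = list(reader)
    converters = infer_types(rows[0])
    parsed = [[conv(cell) for conv, cell in zip(converters, row)] for row in rows]
    return header, parsed


try:
    parse_rows(BAD_CSV)
    error = None
except ValueError as exc:
    # int('') raises ValueError, playing the role of stoll's invalid_argument.
    error = str(exc)

print(error)
```

In this toy model the failure comes from the second data row, where 'c' is empty, although column 'a' is complete in every row.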

couet · December 9, 2019, 10:09am

Can’t you make the 2nd file:

a,b,c
1,2,3
1,2,0
1,0,0

?

mwilkins · December 9, 2019, 2:10pm

Well, no, not when I have a dataset for which the correct value really is undefined or NaN. One could, of course, work around this by assigning a dummy value, say -1, instead of NaN, but I’m not sure why a user should have to; the most desirable and intuitive workflow would seem to be:

h_bad_a = df_bad.Histo1D('a')
h_bad_b = df_bad.Filter('!TMath::IsNaN(b)').Histo1D('b')
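
Until the data source handles empty cells, the dummy-value workaround mentioned above could be sketched as a small preprocessing step. This uses only Python's standard `csv` module; the function name `fill_empty_cells`, the file name `temp_fixed.csv`, and the sentinel -1 are illustrative choices, not part of ROOT.

```python
import csv
import os
import tempfile


def fill_empty_cells(src_path, dst_path, sentinel="-1"):
    """Copy a CSV file, replacing empty cells with a sentinel string.

    Returns the number of cells that were replaced.
    """
    replaced = 0
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst, lineterminator="\n")
        writer.writerow(next(reader))  # copy the header unchanged
        for row in reader:
            cleaned = []
            for cell in row:
                if cell.strip() == "":
                    cleaned.append(sentinel)
                    replaced += 1
                else:
                    cleaned.append(cell)
            writer.writerow(cleaned)
    return replaced


# Demo on a copy of temp_bad.csv from the reproducer, kept in a temp directory.
with tempfile.TemporaryDirectory() as tmp:
    bad = os.path.join(tmp, "temp_bad.csv")
    fixed = os.path.join(tmp, "temp_fixed.csv")
    with open(bad, "w") as f:
        f.write("a,b,c\n1,2,3\n1,2,\n1,,\n")
    n_replaced = fill_empty_cells(bad, fixed)
    with open(fixed) as f:
        fixed_text = f.read()

print(n_replaced)
print(fixed_text)
```

The cleaned file can then be read with ROOT.RDF.MakeCsvDataFrame as in the reproducer, and rows carrying the sentinel excluded with a filter such as Filter('b != -1') before histogramming 'b'. This only makes sense if -1 can never be a genuine value in the data.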

eguiraud · December 9, 2019, 2:37pm

Hi,
I’m afraid the CSV datasource does not support NaN values (@etejedor can confirm).
You can request the feature on JIRA, although I am not sure how long it might take for it to float to the top of the to-do list.

Cheers,
Enrico

etejedor · December 9, 2019, 3:29pm

I confirm what Enrico said: unfortunately, NaN values are not supported by the CSV data source.

mwilkins · December 9, 2019, 3:53pm

Okay, I added an issue to JIRA with minor priority. Hopefully a fix will be in the works eventually, as this needlessly limits the utility of RDataFrame.

eguiraud · May 4, 2022, 8:20am

Hi,
it took a while, but thanks to @ikabadzhov this is now fixed in master.