How can I assign NaN

aruntripathi · July 12, 2009, 5:10am

Hi,
How can I assign NaN to numeric variables (int, float, double…) in a way that works universally, across different platforms?

On Solaris, in standalone code, x = 0.0/0.0 assigns NaN to x. But CINT just complains about a divide-by-zero error and exits.

My motivation is the following. I want to be able to handle missing values in the data by assigning a value of NaN when the data for a numeric field is missing. But I cannot find a way to assign NaN which works universally, across different platforms.

By the way, missing data handling is done automatically in most statistical packages, but not in ROOT. It would be very useful to have such a facility in ROOT and all its associated statistical packages, if ROOT is to be a useful tool outside the world of HEP.

In regular applications, missing data is almost the rule rather than an exception. So a way to handle such data is a prerequisite before people can use ROOT outside the highly structured HEP environment.

Many thanks.
-Arun

Axel · July 12, 2009, 10:53am

Hi,

there is std::numeric_limits<double>::quiet_NaN() - but that’s not implemented in CINT (though I should be able to add it if you need it).
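As a minimal sketch of the approach mentioned above, assuming a standard-conforming C++11 compiler (not CINT, which does not implement it per this post): the helper names `missingValue`, `isMissing`, and `nanAssignmentDemo` are hypothetical and are not part of ROOT. Note that only floating-point types have a NaN; `int` has no NaN representation, so a missing integer field would have to be stored in a `double` or tracked separately.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper (not a ROOT function): returns a quiet NaN
// that can be used to mark a missing numeric value.
inline double missingValue() {
    static_assert(std::numeric_limits<double>::has_quiet_NaN,
                  "double has no quiet NaN on this platform");
    return std::numeric_limits<double>::quiet_NaN();
}

// Hypothetical helper: true if x is NaN, i.e. was marked as missing.
inline bool isMissing(double x) {
    return std::isnan(x);
}

// Small demonstration; it can be called from main() or compiled as a macro.
void nanAssignmentDemo() {
    double x = missingValue();
    float  y = std::numeric_limits<float>::quiet_NaN();
    assert(isMissing(x));
    assert(std::isnan(y));

    // Integer types cannot hold NaN at all.
    static_assert(!std::numeric_limits<int>::has_quiet_NaN,
                  "int is not expected to have a NaN");
}
```

Unlike `0.0/0.0`, this does not rely on evaluating a division by zero at run time.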

Do you have an example of what you mean by “missing data”? Are you referring to a measured zero as opposed to a zero as in “lack of measurements”? And are histograms the statistical tools for which you miss this most urgently? If that’s correct, wouldn’t a TGraph be more appropriate in your use case?

Cheers, Axel.

aruntripathi · July 13, 2009, 9:18am

Thanks very much for your reply, Axel.

The idea of missing values is the following. Suppose someone conducts a survey on a population, asking the participants’ name, age, income.

Oftentimes, people taking the survey will not provide answers to all questions, but only to some of them, and which ones will vary from person to person. So, the results of the survey may look like the following (this is a highly simplified case, just to illustrate), in a CSV file:

Name, Age, Income
Name1, 18, 18000
Name2, , 50000
Name3, 50,

In this example, when reading the data, one would treat Name as a character field, and Age and Income as numeric. However, as you can see, the values of Age and Income are not always provided. So in this case, for Name2, Age is missing, and for Name3, Income is missing.

This is just an example of “missing values”, which are very prevalent in nearly all industrial/corporate databases.

Most statistical packages provide the following functionality:

(1) Ability to read “messy” data like above automatically. The data reader will automatically detect “missing” values and keep track of them.
(2) During mathematical manipulation of the data, missing values will be handled automatically, in the following sense:

  if Y = A + B + C, then Y = "missing" if any of A, B, or C is "missing"

similarly for other mathematical operations.
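If NaN is used as the "missing" marker, this propagation rule largely comes for free, assuming IEEE 754 floating-point arithmetic and no aggressive compiler options such as fast-math. The sketch below uses hypothetical helper names (`sumOfThree`, `largerOf`) and also shows one caveat: not every library function propagates NaN.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: Y = A + B + C with ordinary double arithmetic.
// Under IEEE 754 the result is NaN if any operand is NaN.
double sumOfThree(double a, double b, double c) {
    return a + b + c;
}

// Hypothetical helper showing a caveat: std::fmax does NOT propagate NaN.
// If exactly one argument is NaN, the other argument is returned.
double largerOf(double a, double b) {
    return std::fmax(a, b);
}
```

So arithmetic operators behave like the rule above, but functions such as `std::fmax` and `std::fmin` would need an explicit `std::isnan` check if "missing" is supposed to propagate through them too.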

Also, while computing the statistical properties of the data, e.g., average age, missing values will be automatically excluded (without making the software crash), and the incidence of missing values will be reported.
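A minimal sketch of that behaviour, assuming missing values are stored as NaN in a `std::vector<double>`: the names `SummaryStats` and `meanIgnoringMissing` are hypothetical and are not an existing ROOT facility.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical result type: the mean plus how many values were used or missing.
struct SummaryStats {
    double mean;          // NaN if no valid values were present
    std::size_t used;     // number of non-missing values
    std::size_t missing;  // number of NaN (missing) values
};

// Average the non-NaN entries and count the NaN ones.
SummaryStats meanIgnoringMissing(const std::vector<double>& values) {
    double sum = 0.0;
    std::size_t used = 0;
    std::size_t missing = 0;
    for (double v : values) {
        if (std::isnan(v)) {
            ++missing;        // excluded from the sum, but reported
        } else {
            sum += v;
            ++used;
        }
    }
    SummaryStats stats;
    stats.mean = used > 0 ? sum / static_cast<double>(used)
                          : std::numeric_limits<double>::quiet_NaN();
    stats.used = used;
    stats.missing = missing;
    return stats;
}
```

For the ages in the survey example (18, missing, 50), this would give a mean of 34 from two values, with one value reported as missing.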

Assuming the ROOT team expects ROOT to be used outside HEP (and I know that HEP data analysis is the main mandate of the ROOT team), this kind of functionality is absolutely crucial. And this applies all the more to packages like TMVA, RooFit, and RooStats.

It may be useful for the ROOT team to interact with “pure statisticians” (which I am not; I am an ex-HEP person), and bring some of them on board, for the sake of cross-pollination, and to understand data analysis issues outside HEP. Otherwise, I think ROOT will face a very uphill task gaining acceptance in a broader community, which would be very unfortunate in my opinion, considering the solid foundation (which a lot of other statistical packages lack) and all the other amazing functionality that ROOT provides.

Of course, the ROOT team may decide that non-HEP related problems are not their business, which is a perfectly valid argument.

-Arun

brun · July 13, 2009, 9:49am

What you describe is perfectly correct. However, what you request is not totally clear.
Do you simply need a function to set a double, float, or int to NaN? If yes, you can use Axel’s recipe. We could provide TMath::SetNaN to be consistent with TMath::IsNaN.
When filling histograms, NaNs are automatically detected and diverted to the overflow bin.

Rene

aruntripathi · July 18, 2009, 4:35am

Rene,
Ultimately, what I am suggesting is that ROOT should have a way to automatically read in and handle missing values, just like most statistical packages. It seems NaN may be a good value to assign to missing values, but perhaps the ROOT team can come up with an alternative or better solution?

In any case, I think having a SetNaN() function that works in CINT and is consistent with TMath::IsNaN() will be very helpful, and I look forward to this functionality in the near future.

By the way, some time ago I posted code in this forum to read a CSV file, including missing values:

root.cern.ch/phpBB2/viewtopic.ph … highlight=

The idea was to create some code that can read in arbitrarily formatted data easily and automatically. In the above code, I just assigned a default value of -99 to missing values, which is obviously unsatisfactory.

Once we have a SetNaN() method that works in CINT, I can update the code to assign NaN for missing values. That way, during any analysis, missing values can be identified unambiguously.

Thanks very much.
-Arun