Thanks very much for your reply, Axel.
The idea of missing values is the following. Suppose someone conducts a survey of a population, asking for each participant's name, age, and income.
Often, people taking the survey will answer only some of the questions, and which questions are skipped varies from person to person. So the results of the survey may look like the following in a CSV file (this is a highly simplified case, just to illustrate):
Name, Age, Income
Name1, 18, 18000
Name2, , 50000
Name3, 50,
In this example, when reading the data, one would treat Name as a character field, and Age and Income as numeric fields. However, as you can see, the values of Age and Income are not always provided. In this case, Age is missing for Name2, and Income is missing for Name3.
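To make the reading step concrete, here is a minimal sketch in plain Python (standard library only, not ROOT) of a reader that keeps track of blank numeric fields. Representing a missing value as NaN, and the helper names `parse_number` and `read_survey`, are my own assumptions for illustration; real statistical packages use their own missing-value markers.

```python
import csv
import io
import math

# The survey data from the example above, embedded as a string.
SURVEY_CSV = """Name, Age, Income
Name1, 18, 18000
Name2, , 50000
Name3, 50,
"""

def parse_number(field):
    """Return the field as a float, or NaN if the field is blank."""
    text = field.strip()
    return float(text) if text else math.nan

def read_survey(text):
    """Parse the CSV text into a list of dicts, marking blanks as NaN."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header line
    rows = []
    for record in reader:
        if not record:
            continue  # ignore empty lines
        name, age, income = record
        rows.append({
            "Name": name.strip(),
            "Age": parse_number(age),
            "Income": parse_number(income),
        })
    return rows

rows = read_survey(SURVEY_CSV)
for row in rows:
    print(row)
```

Here Name is kept as a string, while a blank Age or Income becomes NaN instead of causing a parse error.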
This is just one example of “missing values”, which are prevalent in nearly all industrial/corporate databases.
Most statistical packages provide the following functionality:
(1) Ability to read “messy” data like above automatically. The data reader will automatically detect “missing” values and keep track of them.
(2) During mathematical manipulation of the data, missing values will be handled automatically, in the following sense:
if Y = A + B + C; then Y = "missing" if any of A, B, or C is "missing",
and similarly for other mathematical operations.
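As a rough illustration of this propagation rule, the sketch below again uses NaN as a stand-in for "missing" (my assumption, not how any particular package stores it). IEEE floating-point arithmetic already propagates NaN through addition, so the sum comes out "missing" whenever any operand is.

```python
import math

MISSING = math.nan  # assumed stand-in for a "missing" value

def total(a, b, c):
    """Y = A + B + C; the result is NaN if any operand is NaN."""
    return a + b + c

complete = total(1.0, 2.0, 3.0)
incomplete = total(1.0, MISSING, 3.0)

# NaN never compares equal to anything, so test with math.isnan().
print(complete, math.isnan(incomplete))
```

One limitation of this stand-in is that NaN cannot distinguish "not provided" from "result of an invalid computation", which is why dedicated packages tend to keep a separate missing marker.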
Also, while computing statistical properties of the data (e.g., the average age), missing values will be automatically excluded without making the software crash, and the incidence of missing values will be reported.
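A minimal sketch of such a summary, assuming once more that missing values are stored as NaN; the function name `mean_excluding_missing` is hypothetical and only meant to show the behaviour described above.

```python
import math

def mean_excluding_missing(values):
    """Return (mean of the non-missing values, number of missing values)."""
    present = [v for v in values if not math.isnan(v)]
    n_missing = len(values) - len(present)
    # If every value is missing, the mean itself is reported as missing.
    mean = sum(present) / len(present) if present else math.nan
    return mean, n_missing

# Ages from the survey example: Name2 did not give an age.
ages = [18.0, math.nan, 50.0]
mean_age, n_missing = mean_excluding_missing(ages)
print(f"mean age = {mean_age}, missing = {n_missing} of {len(ages)}")
```

For the three survey rows this gives a mean age of 34.0 computed from two values, with one missing value reported.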
Assuming the ROOT team expects ROOT to be used outside HEP (and I know that HEP data analysis is the main mandate of the ROOT team), this kind of functionality is absolutely crucial. This applies all the more to packages like TMVA, RooFit, and RooStats.
It may be useful for the ROOT team to interact with “pure statisticians” (which I am not; I am an ex-HEP person) and bring some of them on board, for the sake of cross-pollination and to understand data analysis issues outside HEP. Otherwise, I think ROOT will face a very uphill task gaining acceptance in a broader community. That would be very unfortunate in my opinion, considering the solid foundation (which a lot of other statistical packages lack) and all the other amazing functionality that ROOT provides.
Of course, the ROOT team may decide that non-HEP problems are not their business, which is a perfectly valid position.
-Arun