Hi, what’s the proper way to deal with missing values in an RDataFrame?
For example, consider the training data of the Titanic dataset. Even though ROOT infers that the column Age is numeric, trying to plot a histogram of Age results in an exception because of missing values.
root [] rdf = ROOT::RDF::MakeCsvDataFrame("train.csv");
root [] for(auto name : rdf.GetColumnNames()) {
root [] cout << "\t"<< name << " " << rdf.GetColumnType(name) << "\n";
root [] }
PassengerId Long64_t
Survived Long64_t
Pclass Long64_t
Name std::string
Sex std::string
Age Long64_t
SibSp Long64_t
Parch Long64_t
Ticket std::string
Fare double
Cabin std::string
std::string
root [] h1 = rdf.Histo1D("Age")
(ROOT::RDF::RResultPtr<TH1D> &) @0x7efbff6a1050
root [] h1->Draw();
Error in <TRint::HandleTermInput()>: std::invalid_argument caught: stoll
I know this is just an example. Yet how should we deal with this in practice?
From the error message, I suspect your “train.csv” file contains data lines which are not properly parsed by ROOT (you could attach this file for inspection).
@etejedor There is a bug in ROOT’s “csv” parsing. It improperly parses “DOS” encoded files. You can see it with the attached file, in the very last column name. It should be “Embarked” but ROOT improperly uses the final “carriage return” character as the last character of this name (this character does not appear in “Unix” encoded files, of course).
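As a possible workaround, the trailing carriage return can be stripped from each line before the file is handed to ROOT. Here is a minimal sketch in plain C++; StripCarriageReturn is a hypothetical helper name, not a ROOT function, and it assumes the file is read line by line (for example with std::getline).

```cpp
#include <string>

// Hypothetical helper (not part of ROOT): removes the single trailing
// carriage return that DOS ("\r\n") line endings leave at the end of a
// line that was split on '\n'. Lines without one are returned unchanged.
std::string StripCarriageReturn(std::string line) {
    if (!line.empty() && line.back() == '\r') {
        line.pop_back();
    }
    return line;
}
```

Applied to the header line, this would turn the last column name back into a plain "Embarked".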
@marty1885 The problem is that the “Age” and “Cabin” columns are sometimes empty. ROOT cannot accept it. When creating this file, you would need to make sure that every entry gets some “default” value.
Hi,
this is a known missing feature; the relevant discussion, with a link to the Jira ticket, is here.
Unfortunately, at the moment we do not have free hands to implement support for missing values (I myself will not be at CERN until March), but PRs are of course welcome, as are comments on the Jira ticket explaining why the feature should move up the to-do list.
The simplest workaround is to substitute missing values with some telltale value.
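A minimal sketch of that substitution in plain C++, to be run over each data line of the CSV file before ROOT reads it. FillEmptyFields is a hypothetical helper name, not a ROOT function, and the sentinel is whatever value you choose. It assumes fields are comma-separated and that commas inside double quotes (as in the Name column) belong to the field.

```cpp
#include <string>

// Hypothetical helper (not part of ROOT): returns a copy of one CSV line in
// which every empty field is replaced by `sentinel`. Commas inside double
// quotes are treated as part of the field, not as separators.
// Note: a completely empty line is treated as one empty field.
std::string FillEmptyFields(const std::string &line, const std::string &sentinel) {
    std::string out;
    bool inQuotes = false;
    bool fieldEmpty = true;  // no character seen yet in the current field
    for (char c : line) {
        if (c == '"') {
            inQuotes = !inQuotes;  // entering or leaving a quoted field
        }
        if (c == ',' && !inQuotes) {
            if (fieldEmpty) out += sentinel;  // field ended without content
            out += ',';
            fieldEmpty = true;  // start of the next field
        } else {
            out += c;
            fieldEmpty = false;
        }
    }
    if (fieldEmpty) out += sentinel;  // the last field has no trailing comma
    return out;
}
```

With a sentinel such as -999, the affected rows could then be skipped on the RDataFrame side, for instance with something like rdf.Filter("Age != -999"), assuming Age is still read as an integer column.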
I’d ask @etejedor, the author of the RCSVDataSource, to point to the parts of the code that need upgrading to support missing values. There is also the question of what behavior we want RDF to have when there is a missing value in a CSV file.