Dealing with missing values in RDataFrame (CSV)

Hi, what’s the proper way to deal with missing values in an RDataFrame?

For example, consider the training data of the Titanic dataset. Even though ROOT infers that the column Age is a number, trying to plot a histogram of Age results in an exception due to missing values.

root [] rdf = ROOT::RDF::MakeCsvDataFrame("train.csv");
root [] for(auto name : rdf.GetColumnNames()) {
root []    cout << "\t"<< name << " " << rdf.GetColumnType(name) << "\n";
root [] }
	PassengerId Long64_t
	Survived Long64_t
	Pclass Long64_t
	Name std::string
	Sex std::string
	Age Long64_t
	SibSp Long64_t
	Parch Long64_t
	Ticket std::string
	Fare double
	Cabin std::string
	 std::string
root [] h1 = rdf.Histo1D("Age")
(ROOT::RDF::RResultPtr<TH1D> &) @0x7efbff6a1050
root [] h1->Draw();
Error in <TRint::HandleTermInput()>: std::invalid_argument caught: stoll

I know this is just an example. Yet how should we deal with this in practice?

From the error message, I suspect your “train.csv” file contains data lines which are not properly parsed by ROOT (you could attach this file for inspection).

Hi, thanks. The forum doesn’t support uploading CSV files, so I linked to the train.csv file above. Here it is in txt form: train.txt (59.8 KB)

Yes, that’s the exact problem I described. ROOT failed to parse the file due to missing values.

@etejedor There is a bug in ROOT’s “csv” parsing. It improperly parses “DOS” encoded files. You can see it with the attached file, in the very last column name. It should be “Embarked” but ROOT improperly uses the final “carriage return” character as the last character of this name (this character does not appear in “Unix” encoded files, of course).
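
Until that is fixed, one way to sidestep it is to strip the trailing carriage return from each line before ROOT sees the file. Below is a minimal sketch in plain C++; `stripCarriageReturn` is a hypothetical helper, not something ROOT provides.

```cpp
#include <string>

// Hypothetical helper (not part of ROOT): remove the trailing '\r' that
// DOS line endings leave behind when a file is read line by line.
std::string stripCarriageReturn(std::string line)
{
   if (!line.empty() && line.back() == '\r')
      line.pop_back();
   return line;
}
```

Applied to every line read with std::getline, this would turn the last header field back into a plain “Embarked”.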

@marty1885 The problem is that the “Age” and “Cabin” columns are sometimes empty, and ROOT cannot handle empty fields. When creating this file, you would need to make sure that every entry gets some “default” value.

Hi,
this is a known missing feature; the relevant discussion, with a link to the Jira ticket, is here.

Unfortunately, at the moment we do not have free hands to implement support for missing values (I will not be at CERN until March), but PRs are of course welcome, as are comments on the Jira ticket explaining why the feature should move up the to-do list.

The simplest workaround is to substitute missing values with some telltale value.
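
As a rough sketch of that workaround, the file could be preprocessed line by line in plain C++ before handing it to RDataFrame. `fillEmptyFields` is a hypothetical helper, not part of ROOT, and the choice of sentinel is an assumption (a per-column sentinel may suit string columns such as Cabin better).

```cpp
#include <string>

// Hypothetical helper (not part of ROOT): rewrite one CSV line so that every
// empty field carries `sentinel` instead. Commas inside double quotes are
// treated as part of the field, and carriage returns outside quotes are dropped.
std::string fillEmptyFields(const std::string &line, const std::string &sentinel)
{
   std::string out;
   std::string field;
   bool inQuotes = false;

   // Append the current field (or the sentinel if it is empty) to the output.
   auto flush = [&]() {
      out += field.empty() ? sentinel : field;
      field.clear();
   };

   for (char c : line) {
      if (c == '"') {
         inQuotes = !inQuotes; // toggle on every quote character
         field += c;
      } else if (c == ',' && !inQuotes) {
         flush();              // a separator ends the current field
         out += ',';
      } else if (c == '\r' && !inQuotes) {
         // skip DOS line-ending residue
      } else {
         field += c;
      }
   }
   flush(); // last field on the line
   return out;
}
```

With a sentinel such as -1, the tagged rows could then be excluded on the RDataFrame side, for instance with something like rdf.Filter("Age >= 0").Histo1D("Age").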

Cheers,
Enrico

Sounds like something I can do, hopefully. Would I be able to submit a PR if I don’t work at CERN?

Absolutely, ROOT is open source and welcomes external contributions. You can find the code on GitHub at root-project/root, the official repository for ROOT.

I’d ask @etejedor, the author of the RCSVDataSource, to point to the parts of the code that need upgrading to support missing values. There is also the question of what behavior we want RDF to have when there is a missing value in a CSV file.

Cheers,
Enrico

Hi,

Regarding the DOS format problem, please see my answer here:

As for the RCsvDataSource, this is the relevant file in ROOT:

It may be enough to add support for empty values in ParseValue.
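
To illustrate the idea only: the exception reported above comes from std::stoll, which throws std::invalid_argument on an empty string. A guard for empty cells could look roughly like the sketch below. `parseIntegerCell` is a hypothetical stand-in and not ROOT's actual ParseValue; how RDF should represent the missing value is still the open question mentioned earlier.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-in for an integer-cell parser; this is NOT ROOT's ParseValue.
// An empty cell is reported as "no value" instead of being passed to std::stoll.
std::optional<long long> parseIntegerCell(const std::string &cell)
{
   if (cell.empty())
      return std::nullopt;  // missing value
   return std::stoll(cell); // still throws on non-numeric text such as "abc"
}
```

The caller would then decide what to do with an empty result, for example skip the row or substitute a default.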
