RDataFrame with branches containing NaN


ROOT Version: 6.21/01
Platform: Ubuntu 18.04.4 LTS
Compiler: gcc


I am using RDataFrame to apply some filters, but some columns contain only NaN, and the code does not run.

void filter(){
        int count_e = 0;
        int count_m = 0;
        ROOT::RDataFrame old("UnnormedTree","higgsino_10_eos.root");

        auto cut = [&count_e, &count_m](float a1, float pt) {
                if (a1 == 2 && pt < 10*1000.0 && count_e < 50000 ){
                        count_e +=1;
                        return true;
                }
                if (a1 == 6 && pt< 10*1000.0 && count_m < 50000){
                        count_m +=1;
                        return true;
                }
                return false;
        };
        old.Filter(cut,{"truth_type","ROC_slicing_lep_pT"}).Snapshot("UnnormedTree", "filtered_higgsino10.root");
}

It runs correctly on a tree without NaN columns. But when running on a file with NaN columns, the following error appears:

Error in <TRint::HandleTermInput()>: std::runtime_error caught: Unknown columns: lep_core57,lep_coreCone,lep_pu_corr20,lep_pu_corr30,lep_pu_corr40,lep_pt_corr20,lep_pt_corr30,lep_pt_corr40

We have also tried to blacklist those features.

void filter(){
	int count_e = 0;
	int count_m = 0;
	ROOT::RDataFrame old("UnnormedTree","higgsino_10_eos.root");
	// your blacklist
    static const std::vector<std::string> blacklist = {"lep_core57","lep_coreCone","lep_pu_corr20","lep_pu_corr30","lep_pu_corr40","lep_pt_corr20","lep_pt_corr30","lep_pt_corr40"};
	//define those problematic columns
	int my_number = 42;
	auto old_0 = old;
	auto old_1 = old;
	for(auto i =0; i < blacklist.size(); i++){
		old_0 = old_1;
		old_1 = old_0.Define(blacklist[i], std::to_string(my_number));
	}


	// get good_cols
	TFile f("higgsino_10_eos.root");
	TTree* t = (TTree*) f.Get("UnnormedTree");
	TObjArray* name_array = t->GetListOfBranches();
	std::vector<std::string> good_cols;
	for(int i = 0; i < name_array->GetEntries(); ++i) 
	{ 
		auto value = name_array->At(i)->GetName();
		// keep only branches that are NOT in the blacklist
		if(std::find(blacklist.begin(), blacklist.end(), value) == blacklist.end()){
			good_cols.push_back(value);
		}
	}

	auto cut = [&count_e, &count_m](float a1, float pt) {
		if (a1 == 2 && pt < 10*1000.0 && count_e < 50000 ){
			count_e +=1;
			return true;
		}	
		if (a1 == 6 && pt< 10*1000.0 && count_m < 50000){
			count_m +=1;
			return true;
		}
		return false;
	};
	old_1.Filter(cut,{"truth_type","ROC_slicing_lep_pT"}).Snapshot("UnnormedTree", "unnormed_filtered_higgsino10.root",good_cols);
}

But still we get the same error.
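The column-selection step of the macro above can be checked on its own, without ROOT. The following is a minimal sketch, assuming the intent is to keep every branch whose name is not on the blacklist; `SelectGoodColumns` is a hypothetical helper name, not a ROOT function.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper (not part of ROOT): return the names in `all`
// that do not appear in `blacklist`, preserving their original order.
std::vector<std::string> SelectGoodColumns(const std::vector<std::string> &all,
                                           const std::vector<std::string> &blacklist)
{
    std::vector<std::string> good;
    for (const auto &name : all) {
        // std::find returns end() when the name is NOT in the blacklist
        if (std::find(blacklist.begin(), blacklist.end(), name) == blacklist.end())
            good.push_back(name);
    }
    return good;
}
```

In the macro, `all` would be filled from `t->GetListOfBranches()` and the result passed to `Snapshot` as the column list.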

Thanks so much!

Hi @kai_zheng,
the error message means that RDF does not recognize lep_core57 and the other columns as valid column names for that TTree. What does old.GetColumnNames() return? Also, could you try upgrading to ROOT v6.22/02 and check whether that fixes it?

Cheers,
Enrico

Thanks so much for your reply! I found another way to work around the problem.

But it seems that RDataFrame cannot handle a dataset that contains NaN columns.

Yes, those branch names are in the list. I can actually draw the histograms.

I am using a lab computer, so I have not checked with a newer ROOT version. But I can share the link to the file for further investigation in case you need it.

Best,

Kai Zheng

Hi Kai Zheng,
thanks a lot for sharing the file, I can reproduce the issue.

Can you share the workaround you found? And when you say “I can actually draw the histogram”, do you mean without RDF?

There is some difference in how the two TTrees “UnnormedTree” and “NormalizedTree” are structured, and that’s what’s confusing RDF: note that the problem with certain column names not being recognized is triggered before any values are read, so it’s not the NaNs that are causing the problem but really a difference in the TTree structure.
For example, df.GetColumnNames() lists both lep_core57 and lep_core57.core57 for “NormalizedTree”, but only lep_core57.core57 for “UnnormedTree”.
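To spot this pattern in a list of column names, one could look for names of the form `branch.leaf` whose bare `branch` name is not listed as well. The sketch below works on a plain vector of strings, such as the one returned by `df.GetColumnNames()`; `BranchesWithoutShortName` is a hypothetical helper, and the check only illustrates the symptom described above, not how RDF resolves names internally.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: for every column listed as "branch.leaf", report
// "branch" if the bare branch name is not itself in the list.
std::vector<std::string> BranchesWithoutShortName(const std::vector<std::string> &columns)
{
    std::vector<std::string> result;
    for (const auto &name : columns) {
        const auto dot = name.find('.');
        if (dot == std::string::npos)
            continue; // not a "branch.leaf" style name
        const std::string branch = name.substr(0, dot);
        const bool listed = std::find(columns.begin(), columns.end(), branch) != columns.end();
        const bool reported = std::find(result.begin(), result.end(), branch) != result.end();
        if (!listed && !reported)
            result.push_back(branch);
    }
    return result;
}
```

With the names quoted above, a list containing only `lep_core57.core57` would report `lep_core57`, while a list containing both `lep_core57` and `lep_core57.core57` would report nothing.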

TTree::Print also shows some differences, but I still don’t understand what exactly is confusing RDF:

UnnormedTree

Br   36 :lep_core57 : core57/F

NormalizedTree

Br   36 :lep_core57 : lep_core57/F

I will write here if I find out something else. Feel free to open an issue at github.com/root-project/root/issues : the fact that RDF gets confused like this is a bug.

Cheers,
Enrico

P.S.
I think this is an instance of https://sft.its.cern.ch/jira/browse/ROOT-10625

Hi,
just an update: this issue is fixed in ROOT’s master branch (the fix will be available in tomorrow’s nightly builds) and it will be part of v6.24/02 and v6.26.

More info at https://sft.its.cern.ch/jira/browse/ROOT-9558 and https://sft.its.cern.ch/jira/browse/ROOT-10625 .
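For anyone wanting to check whether a given installation is recent enough, here is a rough sketch that compares a version string against v6.24/02. It assumes the `major.minor/patch` format used earlier in this thread (e.g. `6.21/01`, as printed by `gROOT->GetVersion()`); `HasColumnNameFix` is a hypothetical helper, and development or nightly builds are not treated specially.

```cpp
#include <cstdio>
#include <string>
#include <tuple>

// Hypothetical helper: true if a version string such as "6.21/01"
// is at least 6.24/02. Returns false if the string cannot be parsed.
bool HasColumnNameFix(const std::string &version)
{
    int major = 0, minor = 0, patch = 0;
    if (std::sscanf(version.c_str(), "%d.%d/%d", &major, &minor, &patch) != 3)
        return false;
    // Tuples compare lexicographically: major first, then minor, then patch
    return std::make_tuple(major, minor, patch) >= std::make_tuple(6, 24, 2);
}
```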

Cheers,
Enrico
