RDataFrame filter result incorrect

beojan · October 20, 2018, 1:13pm

I have three filters which should split my data into three categories:

auto boosted = df.Filter("event.truth_type==2", "boosted");
auto intermediate = df.Filter("event.truth_type==1", "intermediate");
auto resolved = df.Filter("event.truth_type==0", "resolved");

However, the cutflow report reads:

boosted   : pass=5454       all=89389      --    6.101 %
intermediate: pass=36263      all=89389      --   40.568 %
resolved  : pass=54524      all=89389      --   60.996 %

That is, the three categories add up to more than 100%. Comparing with the result of TTree::Draw, the problem seems to be with the resolved filter.
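The inconsistency can be checked directly from the numbers in the report. The following is a minimal standalone sketch, not part of the original macro: it needs no ROOT, the helper names `ReportedPasses`, `SumPasses` and `CountsAreConsistent` are hypothetical, and it assumes the three filters are meant to be mutually exclusive.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Pass counts copied from the cutflow report above, in the order
// boosted, intermediate, resolved.
inline std::vector<std::uint64_t> ReportedPasses() {
    return {5454, 36263, 54524};
}

// Total number of passes over a set of filters.
inline std::uint64_t SumPasses(const std::vector<std::uint64_t>& passes) {
    return std::accumulate(passes.begin(), passes.end(), std::uint64_t{0});
}

// Assumption: the filters are mutually exclusive, so together they
// cannot pass more events than were processed ("all" in the report).
inline bool CountsAreConsistent(const std::vector<std::uint64_t>& passes,
                                std::uint64_t all) {
    assert(all > 0);
    return SumPasses(passes) <= all;
}
```

With the reported values, `SumPasses(ReportedPasses())` is 96241, which exceeds all=89389, so `CountsAreConsistent(ReportedPasses(), 89389)` returns false.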

eguiraud · October 20, 2018, 4:26pm

Wow, that’s weird. Thanks for reporting.
Could you provide a minimal reproducer that we can run and debug?

Cheers,
Enrico

beojan · October 21, 2018, 5:24pm

The macro is:

#include <ROOT/RDataFrame.hxx>
#include <fstream>
#include <string>

void ProcessTruth(std::string filename) {
    using namespace std;
    using namespace ROOT;
    // Append one summary line per input file.
    ofstream out("TruthSummary.csv", ios::app);
    RDataFrame df("fullmassplane", filename);
    // Three categories that should be mutually exclusive.
    auto boosted = df.Filter("event.truth_type==2", "boosted");
    auto intermediate = df.Filter("event.truth_type==1", "intermediate");
    auto resolved = df.Filter("event.truth_type==0", "resolved");
    // Further selections applied on top of the resolved category.
    auto correct_selection = resolved.Filter("event.truth_h1_j1 && event.truth_h1_j2 && event.truth_h2_j1 && event.truth_h2_j2", "correct_sel");
    auto correct_pair = resolved.Filter("event.truth_h1_j1==1 && event.truth_h1_j2==1 && event.truth_h2_j1==2 && event.truth_h2_j2==2", "correct_pair");
    // Each Count().GetValue() call books a new count and runs an event loop to fill it.
    auto wrong_selection = resolved.Count().GetValue() - correct_selection.Count().GetValue();
    auto wrong_pair = correct_selection.Count().GetValue() - correct_pair.Count().GetValue();
    out << filename << "," << boosted.Count().GetValue() << "," << intermediate.Count().GetValue() << "," << resolved.Count().GetValue() << ","
        << correct_pair.Count().GetValue() << "," << wrong_pair << "," << wrong_selection << std::endl;
    // Print the cutflow report for all named filters.
    df.Report()->Print();
}

The file I’m running on is at /eos/user/b/bstanisl/DebugFile/M1200/M1200.root and shared with @eguiraud and sft-root.

beojan · October 22, 2018, 9:51am

The correct_selection and correct_pair filters also return 0. Repeating the selection with root_pandas shows that this result is also incorrect.

EDIT:
Working interactively (root -l) in LCG 94 Python 3:

root [0] using namespace ROOT;
root [1] RDataFrame df("fullmassplane", "M1200.root")
(ROOT::RDataFrame &) A data frame built on top of the fullmassplane dataset.
root [2] .ls
root [3] df.Filter("event.truth_type==0").Count()
(ROOT::RDF::RResultPtr<ULong64_t>) @0x58edf30
root [4] df.Filter("event.truth_type==0").Count().GetValue*(
root (cont'ed, cancel with .@) [5]
root (cont'ed, cancel with .@) [5].@
root [6] df.Filter("event.truth_type==0").Count().GetValue()
(const unsigned long long) 47672
root [7] df.Filter("event.truth_type==0").Count().GetValue()
(const unsigned long long) 54524

The incorrect value is only returned if another filter has been added already.
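The two values from the interactive session can be compared against the cutflow report with simple arithmetic. The sketch below is standalone C++ with no ROOT dependency; `ExpectedRemainder` is a hypothetical helper, and it assumes that every event has truth_type 0, 1 or 2, so that the three categories partition the dataset.

```cpp
#include <cassert>
#include <cstdint>

// Number of events left for the last category when the categories are
// assumed to partition the dataset (each event falls in exactly one).
inline std::uint64_t ExpectedRemainder(std::uint64_t all,
                                       std::uint64_t boosted,
                                       std::uint64_t intermediate) {
    assert(boosted + intermediate <= all);
    return all - boosted - intermediate;
}
```

Using the report's numbers, `ExpectedRemainder(89389, 5454, 36263)` gives 47672, which matches the first value returned interactively. The second value, 54524, is the one that appears in the cutflow report and is larger by 6852 events. Under the partition assumption this is consistent with 47672 being the correct resolved count.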

eguiraud · October 22, 2018, 9:35pm

Thanks for the update.
This is now ROOT-9743, let’s continue the discussion there.

I’m not sure I will have time to look into it before next week, but in any case it’s very close to the top of the priority queue.

Cheers,
Enrico
