RDataFrame Sum() results in NaN for a combination of files

apetukho · May 11, 2023, 2:36pm

Dear ROOT experts.

I’ve got a simple python function that computes the sum of weights of a branch in a tree using RDataFrame.

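import ROOT

def MakeWeightedCorrelationMatrixRdf(inputFilePathList, treeName):
    df = ROOT.ROOT.RDataFrame(treeName, inputFilePathList)

    # Book both actions before reading any result, so that a single
    # event loop fills both of them
    eventCountHandle = df.Count()
    sumOfWeightsHandle = df.Sum("weightModified")

    # The first GetValue() call triggers the event loop
    print(f"Event count: {eventCountHandle.GetValue()}")
    print(f"Weight sum: {sumOfWeightsHandle.GetValue()}")
    print()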

I’m trying to run it on different combinations of 3 files: A.root, B.root and C.root.

print("A + B")
testFilePathList = [
    'A.root',
    'B.root',
]
MakeWeightedCorrelationMatrixRdf(testFilePathList, "tree_PFLOW")

print("A")
testFilePathList = [
    'A.root',
]
MakeWeightedCorrelationMatrixRdf(testFilePathList, "tree_PFLOW")

print("B")
testFilePathList = [
    'B.root',
]
MakeWeightedCorrelationMatrixRdf(testFilePathList, "tree_PFLOW")

print("A + C")
testFilePathList = [
    'A.root',
    'C.root',
]
MakeWeightedCorrelationMatrixRdf(testFilePathList, "tree_PFLOW")

print("B + C")
testFilePathList = [
    'B.root',
    'C.root',
]
MakeWeightedCorrelationMatrixRdf(testFilePathList, "tree_PFLOW")

For some reason, the combination of A.root and B.root yields a NaN for the weight sum, even though both the individual files and their combination with C.root are fine.

A + B
Event count: 1548011
Weight sum: nan

A
Event count: 1416179
Weight sum: 5326.039287298693

B
Event count: 131832
Weight sum: 91.8623456310427

A + C
Event count: 1733179
Weight sum: 5326.039287298693

B + C
Event count: 448832
Weight sum: 465.77165986367976

I’ve tried the same with a simple TChain loop in Python and everything works as expected.
Function:

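def GetWeightSum(inputFilePathList, treeName):
    # Chain the input files so they can be looped over as a single tree
    inputChain = ROOT.TChain(treeName)
    for inputFilePath in inputFilePathList:
        inputChain.Add(inputFilePath)

    # Accumulate in a Python float, i.e. in double precision
    weightSum = 0
    count = 0
    for event in inputChain:
        weightSum += event.weightModified
        count += 1

    print(f"Event count: {count}")
    print(f"Weight sum: {weightSum}")
    print()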

Results:

A + B
Event count: 1548011
Weight sum: 5417.901632929743

A
Event count: 1416179
Weight sum: 5326.039287298693

B
Event count: 131832
Weight sum: 91.8623456310427

A + C
Event count: 1733179
Weight sum: 5699.948601531312

B + C
Event count: 448832
Weight sum: 465.77165986367976

The files are unfortunately private, but I can share them and the reproducers in PM.

What could’ve caused the issues with the RDataFrame approach? I want to keep it as it is way faster and easier to work with.

Best regards,
Aleksandr

ROOT Version: 6.28.04
Platform: Ubuntu 20.04
Compiler: Precompiled

mczurylo · May 11, 2023, 3:18pm

Hi @apetukho,

welcome back to the forum! Maybe @vpadulan or @eguiraud have some ideas?

Cheers,
Marta

eguiraud · May 11, 2023, 3:51pm

Hi @apetukho ,

we would have to debug this on our side to figure out where things go wrong. Please go ahead and share the inputs with me and @vpadulan (e.g. via CERNBox).

Cheers,
Enrico

vpadulan · May 17, 2023, 3:57pm

After some debugging, it seems we found the culprit.

In a first investigation I found that, relative to the initial reproducer from the user, the order of the files “A.root” and “B.root” was actually inverted, so the TChain was being created with the order “B.root”, “A.root”. When I reversed the order, the Sum operation returned the correct value.

The cause of the issue seems to be a mismatch between the types of the column “weightModified” in the two files. Consider the following snippet:

import ROOT

df = ROOT.RDataFrame("tree_PFLOW", "A.root")
print("A")
print(f"Column 'weightModified' has type '{df.GetColumnType('weightModified')}'")

df = ROOT.RDataFrame("tree_PFLOW", "B.root")
print("B")
print(f"Column 'weightModified' has type '{df.GetColumnType('weightModified')}'")

The output is

$: python print_column_type.py 
A
Column 'weightModified' has type 'Double_t'
B
Column 'weightModified' has type 'Float_t'

Now, this mismatch by itself is not enough to produce the NaN. For example, if I get the two columns out of RDataFrame as vectors and then sum their values together, it works as long as the double column contains no values outside the float range:

#include <ROOT/RDataFrame.hxx>
#include <iostream>

int main()
{
    ROOT::RDataFrame dfA{"tree_PFLOW", "A.root"};
    auto rptrA = dfA.Take<double>("weightModified");
    auto valsA = *rptrA;

    ROOT::RDataFrame dfB{"tree_PFLOW", "B.root"};
    auto rptrB = dfB.Take<float>("weightModified");
    auto valsB = *rptrB;

    float sum{}; // mimic initial value of df.Sum with B, A file order

    for (std::size_t i = 0; i < valsB.size(); i++)
    {
        sum += valsB[i];
    }

    for (std::size_t i = 0; i < valsA.size(); i++)
    {
        sum += valsA[i];
    }

    std::cout << "Total sum is " << sum << std::endl;
}

The output here is

Total sum is 5377.25
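
Note that 5377.25 is not the 5417.90 obtained for A + B with the Python TChain loop, which accumulates in double precision. A likely contributor, although not verified against these files, is that the accumulator here is a float: once the running sum is large, small addends are partly or entirely rounded away. The following is a minimal pure-Python sketch of that effect. The helpers to_f32 and sum_f32 are hypothetical and the weights are made up for illustration; they are not taken from the files.

```python
import struct

def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def sum_f32(values, start=0.0):
    """Accumulate in single precision, rounding after every addition."""
    total = to_f32(start)
    for v in values:
        total = to_f32(total + to_f32(v))
    return total

# Made-up weights: one large value followed by many tiny ones
weights = [1.0] + [1e-8] * 100_000

sum_single = sum_f32(weights)  # float accumulator
sum_double = sum(weights)      # double accumulator

print(f"float accumulator:  {sum_single}")
print(f"double accumulator: {sum_double}")
```

In this sketch the float accumulator never moves away from 1.0, because each addend is smaller than half the spacing between neighbouring floats at 1.0, while the double accumulator picks up all of the small contributions.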

But if I try to mimic more closely what the RDataFrame Sum does, for example by using a Foreach to increment the float sum variable in the program, then I do get the NaN:

#include <iostream>

#include <ROOT/RDataFrame.hxx>

int main()
{
    ROOT::RDataFrame df{"tree_PFLOW", {"B.root", "A.root"}};
    float sum{}; // mimic initial value of df.Sum with B,A file order

    df.Foreach([&](float val)
               { sum += val; },
               {"weightModified"});

    std::cout << "Total sum is " << sum << std::endl;
}

Then the output is

Total sum is nan
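
One way a type mismatch can end in a NaN is if the bytes of a double end up being read as if they were a float, instead of the value being converted. Whether this is exactly what happens inside RDataFrame when the chain switches trees is an assumption and has not been established in this thread. The sketch below only illustrates the general effect, using a hand-picked bit pattern.

```python
import math
import struct

# An ordinary double of about 0.0035, hand-picked (for illustration only) so
# that its low 32 bits match a single-precision NaN bit pattern.
bits = 0x3F6CAC087FC00000
weight = struct.unpack("<d", struct.pack("<Q", bits))[0]

# Reinterpretation: read the same 8 bytes as two little-endian 4-byte floats
low, high = struct.unpack("<ff", struct.pack("<d", weight))

# Conversion: round the double to the nearest float
converted = struct.unpack("<f", struct.pack("<f", weight))[0]

print(f"double value:        {weight}")
print(f"converted to float:  {converted}")
print(f"reinterpreted bytes: {low}, {high}")
print(f"NaN after reinterpretation: {math.isnan(low)}")
```

A single NaN is enough to poison a running sum, since adding anything to a NaN gives a NaN again.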

There might be a way to impose stronger checks on the types of columns when switching between different trees of a TChain. But it needs some work and I can’t give an estimate right now.

In the meantime, the best approach is to make sure that a column with the same name has the same type in all the trees of the chain.
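
Such a check can be run before building the chain. The following is a minimal sketch: find_type_mismatches and column_types are hypothetical helper names, and only RDataFrame.GetColumnType (used earlier in this post) is an actual ROOT call. The comparison itself is plain Python, so it is shown here with the types reported above for A.root and B.root.

```python
from collections import defaultdict

def find_type_mismatches(types_by_file):
    """Group files by column type; return {} when all files agree.

    types_by_file maps a file path to the type name reported for one column.
    """
    files_by_type = defaultdict(list)
    for path, type_name in types_by_file.items():
        files_by_type[type_name].append(path)
    return dict(files_by_type) if len(files_by_type) > 1 else {}

def column_types(tree_name, column, file_paths):
    """Query each file separately with RDataFrame.GetColumnType (needs ROOT)."""
    import ROOT  # imported lazily so the helper above works without ROOT
    return {
        path: str(ROOT.RDataFrame(tree_name, path).GetColumnType(column))
        for path in file_paths
    }

# Types as reported earlier in this thread for A.root and B.root
reported = {"A.root": "Double_t", "B.root": "Float_t"}
mismatches = find_type_mismatches(reported)
for type_name, paths in mismatches.items():
    print(f"{type_name}: {', '.join(paths)}")
```

With ROOT available, the two helpers can be combined as find_type_mismatches(column_types("tree_PFLOW", "weightModified", ["A.root", "B.root"])); an empty result means the column has the same type in every file.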

Cheers,
Vincenzo
