Inconsistent behaviour of RDataFrame ColumnTypes

Dear experts,

We are observing inconsistent behaviour between reading an RDataFrame column from a TTree and redefining it. For instance, if a TBranch is declared as v[3]/F, the column type when reading from the TTree is ROOT::VecOps::RVec<Float_t>, but after redefining the column it becomes ROOT::VecOps::RVec<float>. A full example that reproduces this is posted below.

This is annoying when calling Vary on multiple columns if some of them come directly from the TTree while others are the result of redefinitions.

A dumb workaround is to redefine every input column to itself, which is of course not ideal and I expect it also brings some overhead.

Any help would be much appreciated.

Many thanks,
Federico

#include <ROOT/RDataFrame.hxx>
#include <ROOT/RVec.hxx>
#include <TFile.h>
#include <TTree.h>

#include <iostream> // std::cout
#include <typeinfo> // typeid

int create_root_file()
{
        TFile* f = TFile::Open("tree.root", "RECREATE");
        TTree *t = new TTree("tree", "tree");
        Float_t v[3] = {0., 1., 2.};
        t->Branch("v", &v, "v[3]/F");
        t->Branch("w", &v, "w[3]/F");
        t->Fill();
        t->Write();
        f->Close();
        return 0;
}


int main()
{
        create_root_file();

        ROOT::RDataFrame df_start("tree", "tree.root");
        auto df = df_start.Filter("true");
        auto t1 = df.GetColumnType("v");
        df = df.Redefine("v", "v");
        auto t2 = df.GetColumnType("v");
        std::cout 
                << t1 << " from TFile\n"
                << t2 << "   from Redefine(...)\n"
                << typeid(float).name() << "\n"
                << typeid(Float_t).name() << "\n"
                << typeid(ROOT::VecOps::RVec<float>).name() << "\n"
                << typeid(ROOT::VecOps::RVec<Float_t>).name() << "\n"
                << ROOT::RDF::RDFInternal::TypeID2TypeName(typeid(ROOT::VecOps::RVec<float>)) << "\n"
                << ROOT::RDF::RDFInternal::TypeID2TypeName(typeid(ROOT::VecOps::RVec<Float_t>)) << "\n"
        ;
        // uncomment the line below to get the error
        // df = df.Vary({"v", "w"}, [] (const ROOT::RVecF& v, const ROOT::RVecF& w) { return ROOT::RVec<ROOT::RVec<ROOT::RVecF>>{{v * 0.9, v * 1.1}, {w * 0.9, w * 1.1}}; }, {"v", "w"}, {"down", "up"}, "variation");
        return 0;
}

Details of the ROOT version used are below:

   ------------------------------------------------------------------
  | Welcome to ROOT 6.32.06                        https://root.cern |
  | (c) 1995-2024, The ROOT Team; conception: R. Brun, F. Rademakers |
  | Built for linuxx8664gcc on Oct 01 2024, 10:44:42                 |
  | From tags/6-32-06@6-32-06                                        |
  | With g++ (Alpine 13.2.1_git20240309) 13.2.1 20240309             |
  | Try '.help'/'.?', '.demo', '.license', '.credits', '.quit'/'.q'  |
   ------------------------------------------------------------------

Welcome to the ROOT Forum!
I’ll let @vpadulan comment on this

Dear @ferri ,

Thank you for the reproducer, I will take a look soon.

Cheers,
Vincenzo

Dear @ferri ,

I can reproduce the issue as you described it. It will need some thinking on our side to understand what to do in such cases: different possibilities may arise, and they might have consequences for other uses of the API. Meanwhile, is there a workaround we can find to strike a middle ground? For example, casting the values of the collections to Float_t only for those columns that actually need redefinition?

Cheers,
Vincenzo

Ciao Vincenzo,

Thanks for your reply. We can think of a workaround, sure.

Which do you think is more efficient: a Redefine to themselves of the columns entering the Vary, to bring them to float along with the already redefined variables, or a cast to Float_t during the Redefine? I am in favour of the solution that minimizes memory allocations (and open to suggestions on how best to cast to Float_t, in that case: can we re-interpret without re-allocations?).

Out of curiosity, why is RVecF defined as

using RVecF = ROOT::VecOps::RVec<float>;

rather than

using RVecF = ROOT::VecOps::RVec<Float_t>;

?

Thanks,
Federico

Dear @ferri ,

I have opened a GitHub issue to keep track of this: "Sanity check for Vary is too restrictive" (issue #17486 in root-project/root). I have also created a PR with a fix: "[df] Fix Vary sanity check for typedefs" (pull request #17478 in root-project/root, by vepadulano).

As for the workaround, I believe the easiest and most efficient approach for now is to Redefine as few columns as possible. I suppose this means calling Redefine only on the nominal columns that will be varied and casting them to float. To minimize allocations, we could play with the idea of using the memory-adopting constructor of RVec to our advantage (see the ROOT::VecOps::RVec<T> class template reference).

Finally, the reason why RVecF is defined with float rather than Float_t is to avoid using I/O related types in general contexts. In principle RVec is used in-memory, although it could be stored on disk when/if desired.
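To illustrate why the two column type names can disagree while the underlying C++ types agree, here is a minimal sketch. It assumes only that Float_t is a plain typedef of float. Vec is a hypothetical stand-in for RVec so that the snippet builds without ROOT, and the function names are made up for illustration. It is not meant to describe how the sanity check in RDataFrame is implemented.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <typeinfo>
#include <vector>

// Assumption: stand-in for ROOT's Float_t, taken to be a plain typedef of float.
using Float_t = float;

// Hypothetical stand-in for ROOT::VecOps::RVec, so the sketch builds without ROOT.
template <typename T>
using Vec = std::vector<T>;

// A typedef introduces a new name, not a new type: the compiler sees one type.
static_assert(std::is_same<Vec<Float_t>, Vec<float>>::value,
              "Vec<Float_t> and Vec<float> are the same C++ type");

// Comparing std::type_info objects is robust against typedefs.
bool same_by_typeid()
{
    return typeid(Vec<Float_t>) == typeid(Vec<float>);
}

// Comparing spellings is not: the strings can differ although the types agree.
bool same_by_spelling(const std::string& a, const std::string& b)
{
    return a == b;
}

// Since the types are identical, a Vec<Float_t> binds to a const Vec<float>&
// parameter directly, with no cast and no copy.
float first_element(const Vec<float>& v)
{
    return v.at(0);
}
```

With these definitions, same_by_typeid() returns true, while same_by_spelling("ROOT::VecOps::RVec<Float_t>", "ROOT::VecOps::RVec<float>") returns false. Under the stated assumption, no re-interpretation of the data is needed at the C++ level; the mismatch lives only in the type-name strings.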

Cheers,
Vincenzo

Dear @vpadulan ,

All this is great, many thanks for your help and quick actions!

Federico

Dear @ferri ,

The linked issue has been resolved. The changes will be available in the next LCG nightlies or with the next ROOT release.

Cheers,
Vincenzo
