RDataFrame snapshot automatic type deduction performance


ROOT Version: 6.16
Platform: Not Provided
Compiler: Not Provided


I am trying to take a few thousand snapshots of a dataframe with different filters. All snapshots write the same columns. From the following code I find that about 1 s per snapshot is spent on the automatic column type deduction. I would like to keep using the automatic deduction, but its performance is a bottleneck for me.

Is there a way to do this deduction only once and somehow apply it for all snapshots?

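#include <ROOT/RDataFrame.hxx>
#include <ctime>
#include <iostream>
#include <string>
#include <vector>

// Recursively define ncols integer columns named x1 ... x<ncols>.
ROOT::RDF::RNode defines(ROOT::RDF::RNode node, int ncols){
    if(ncols > 0){
        return defines(node.Define("x"+std::to_string(ncols), [ncols](){return ncols;}), ncols-1);
    }
    else{
        return node;
    }
}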

int main(){
    ROOT::RDataFrame df_orig(10);
    auto df = defines(df_orig, 3);
    std::time_t start = std::time(0);

    ROOT::RDF::RSnapshotOptions opts;
    opts.fLazy = true;
    using SnapRet_t = ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager>>;
    std::vector<SnapRet_t> rets;
    
    start = std::time(0);
    for (auto i = 0; i < 5; ++i){
        rets.emplace_back(df.Snapshot<int,int,int>("t", "f" + std::to_string(i) + ".root", {"x1","x2","x3"}, opts));
    }
    std::cout << "time with template: " << std::time(0) - start << "s" << std::endl;

    start = std::time(0);    
    for (auto i = 0; i < 5; ++i){
        rets.emplace_back(df.Snapshot("t", "f" + std::to_string(i) + ".root", {"x1","x2","x3"}, opts));
    }
    std::cout << "time without template: " << std::time(0) - start << "s" << std::endl;

    return 0;
}
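
A note on the measurement: std::time typically counts whole seconds, so a loop of only five snapshots gives a coarse number. A minimal sketch of a finer-grained timer based on std::chrono::steady_clock follows. The helper name `time_seconds` is hypothetical and not part of ROOT; it would simply wrap each of the two loops above.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper (not part of ROOT): runs `work` once and returns the
// elapsed wall-clock time in seconds, with sub-second resolution.
template <typename Callable>
double time_seconds(Callable &&work)
{
    const auto start = std::chrono::steady_clock::now();
    std::forward<Callable>(work)();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

For example, passing a lambda that contains the templated Snapshot loop returns the time spent booking those snapshots as a double.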

Hi,

thanks for your report. We are aware of this performance degradation pattern and will implement a solution as soon as possible, but we are not sure it will make it in time for release 6.18.

Now, to address your concrete problem today: how many columns are you snapshotting? Would it be an option to write the types explicitly, perhaps for some of the snapshots?

Cheers,
D

Hi,

thanks!

I am usually snapshotting 5-15 columns. All snapshots have the same columns, so I would only have to hardcode the column types once. The problem is that the number and types of the columns depend on a configuration file.

The config is read from a JSON file in Python. From it, several std::vector<std::string> are filled and passed to a C++ class, which interacts with the dataframe and creates all the snapshots.

So one way might be to check the column types in Python once and then dynamically compile the C++ code with the matching templated Snapshot call.

Cheers,
Christian

Hi Christian,

thanks for clarifying the context.
What about jitting, via gInterpreter->Declare, templated functions that propagate the types to the Snapshot call and take the list of columns as input?
Those would be jitted once and then used by your setup a few thousand times, which eliminates the problem.
I can help you through this if anything is unclear.

Cheers,
Danilo
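
One way to read this suggestion is to generate the source of a small wrapper function once, from the configured type names, and hand that string to gInterpreter->Declare. The sketch below only builds the string, so it runs without ROOT. The helper names `make_template_args` and `make_wrapper_source` are hypothetical, and the type names are assumed to be valid C++ spellings of the column types.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: joins C++ type names into a template argument list,
// e.g. {"int", "float"} becomes "<int,float>".
std::string make_template_args(const std::vector<std::string> &types)
{
    assert(!types.empty() && "at least one column type is required");
    std::string args = "<";
    for (std::size_t i = 0; i < types.size(); ++i) {
        if (i != 0)
            args += ",";
        args += types[i];
    }
    args += ">";
    return args;
}

// Hypothetical helper: returns the source of a non-template function named
// `func_name` that forwards its arguments to Snapshot<types...>.
std::string make_wrapper_source(const std::string &func_name,
                                const std::vector<std::string> &types)
{
    return "ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager>> " +
           func_name +
           "(ROOT::RDF::RNode df, std::string treename, std::string fname, "
           "std::vector<std::string> columnnames, ROOT::RDF::RSnapshotOptions opts)"
           "{ return df.Snapshot" +
           make_template_args(types) +
           "(treename, fname, columnnames, opts); }";
}
```

The returned string would be passed once to gInterpreter->Declare, after which the wrapper can be called for every snapshot without further type deduction.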

Hi Danilo,

Ok, I think that is exactly what I need.

So far I tried this:

#include <ROOT/RDataFrame.hxx>
#include "ROOT/RDF/RInterface.hxx"
#include <iostream>

ROOT::RDF::RNode defines(ROOT::RDF::RNode node, int ncols){
    if(ncols > 0){
        return defines(node.Define("x"+std::to_string(ncols), [ncols](){return ncols;}), ncols-1);
    }
    else{
        return node;
    }
}

int snapshotperf(){
    ROOT::RDataFrame df_orig(10);
    auto df = defines(df_orig, 3);
    std::time_t start = std::time(0);

    ROOT::RDF::RSnapshotOptions opts;
    opts.fLazy = true;
    using SnapRet_t = ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager>>;
    std::vector<SnapRet_t> rets;

    std::vector<std::string> columnnames = {"x1", "x2", "x3"};
    std::vector<std::string> columntypes = {"int", "int", "int"};
    std::string template_expr("<");
    for(int i = 0; i < columntypes.size(); i++){
        template_expr+=columntypes[i];
        if(i!= columntypes.size()-1)
            template_expr+=",";
    }
    template_expr+=">";


    std::string declare_expr(
        "ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager>> make_snap(ROOT::RDF::RNode df, std::string treename, std::string fname, std::vector<std::string> columnnames, ROOT::RDF::RSnapshotOptions opts){"
        "return df.Snapshot"+template_expr+"(treename, fname, columnnames, opts);"
        "}");

    gInterpreter->Declare(declare_expr.c_str());

    TInterpreterValue *tiv = gInterpreter->CreateTemporary();
    std::string eval_str(
            "[](ROOT::RDF::RNode df, std::string treename, std::string fname, std::vector<std::string> columnnames, ROOT::RDF::RSnapshotOptions opts)"
            " {return make_snap(df, treename, fname, columnnames, opts);};"
        );
    gInterpreter->Evaluate(eval_str.c_str(), *tiv);

    using functype = std::function<SnapRet_t(ROOT::RDF::RNode,std::string,std::string,std::vector<std::string>, ROOT::RDF::RSnapshotOptions)>;
    functype make_snap = *(functype*)tiv->GetAsPointer();
    for (auto i = 0; i < 5; ++i){
        SnapRet_t res = make_snap(df, "t", "f" + std::to_string(i) + ".root", columnnames, opts);
        rets.emplace_back(res);
    }

    return df.Count().GetValue();
}

So far this gives me a segmentation violation. Do you know what I did wrong?

Cheers,
Christian

 *** Break *** segmentation violation
[/usr/lib/system/libsystem_platform.dylib] _sigtramp (no debug info)
[<unknown binary>] (no debug info)
[<unknown binary>] (no debug info)
[<unknown binary>] (no debug info)
[<unknown binary>] (no debug info)
[/Users/Christian/work/root/root_v6_16/lib/libCling.so] cling::Interpreter::RunFunction(clang::FunctionDecl const*, cling::Value*) (no debug info)
[/Users/Christian/work/root/root_v6_16/lib/libCling.so] cling::Interpreter::EvaluateInternal(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, cling::CompilationOptions, cling::Value*, cling::Transaction**, unsigned long) (no debug info)
[/Users/Christian/work/root/root_v6_16/lib/libCling.so] cling::MetaSema::actOnxCommand(llvm::StringRef, llvm::StringRef, cling::Value*) (no debug info)
[/Users/Christian/work/root/root_v6_16/lib/libCling.so] cling::MetaParser::isXCommand(cling::MetaSema::ActionResult&, cling::Value*) (no debug info)
[/Users/Christian/work/root/root_v6_16/lib/libCling.so] cling::MetaParser::isCommand(cling::MetaSema::ActionResult&, cling::Value*) (no debug info)
[/Users/Christian/work/root/root_v6_16/lib/libCling.so] cling::MetaProcessor::process(llvm::StringRef, cling::Interpreter::CompilationResult&, cling::Value*, bool) (no debug info)
[/Users/Christian/work/root/root_v6_16/lib/libCling.so] HandleInterpreterException(cling::MetaProcessor*, char const*, cling::Interpreter::CompilationResult&, cling::Value*) (no debug info)
[/Users/Christian/work/root/root_v6_16/lib/libCling.so] TCling::ProcessLine(char const*, TInterpreter::EErrorCode*) (no debug info)
[/Users/Christian/work/root/root_v6_16/lib/libCling.so] TCling::ProcessLineSynch(char const*, TInterpreter::EErrorCode*) (no debug info)
[/Users/Christian/work/root/root_v6_16/lib/libCore.6.16.so] TApplication::ExecuteFile(char const*, int*, bool) (no debug info)
[/Users/Christian/work/root/root_v6_16/lib/libRint.6.16.so] TRint::ProcessLineNr(char const*, char const*, int*) (no debug info)
[/Users/Christian/work/root/root_v6_16/lib/libRint.6.16.so] TRint::Run(bool) (no debug info)
[/Users/Christian/work/root/root_v6_16/bin/root.exe] main (no debug info)
[/usr/lib/system/libdyld.dylib] start (no debug info)
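
The thread does not state what caused this crash. One plausible reading is that the expression passed to Evaluate yields a lambda closure object, which is not a std::function, so `*(functype*)tiv->GetAsPointer()` would reinterpret memory of an unrelated type. The ROOT-free sketch below illustrates that distinction and two well-defined conversions; all names in it are hypothetical.

```cpp
#include <functional>
#include <type_traits>

using BinaryOp = std::function<int(int, int)>;
using BinaryFnPtr = int (*)(int, int);

// A lambda expression produces an object of a unique, unnamed closure type.
const auto adder = [](int a, int b) { return a + b; };

// The closure type is not std::function, so casting a pointer to the closure
// into a std::function* and dereferencing it is undefined behaviour.
static_assert(!std::is_same<std::decay<decltype(adder)>::type, BinaryOp>::value,
              "a closure type is distinct from std::function");

// Well-defined: construct a std::function from the closure.
BinaryOp as_std_function() { return BinaryOp(adder); }

// Well-defined: a captureless lambda converts to a plain function pointer.
BinaryFnPtr as_function_pointer() { return adder; }
```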

Ok I found a working solution.
Thanks a lot for pointing me to this gInterpreter->Declare.

#include <ROOT/RDataFrame.hxx>
#include "ROOT/RDF/RInterface.hxx"
#include "TInterpreter.h"
#include <ctime>
#include <iostream>
#include <string>
#include <vector>

ROOT::RDF::RNode defines(ROOT::RDF::RNode node, int ncols){
    if(ncols > 0){
        return defines(node.Define("x"+std::to_string(ncols), [ncols](){return ncols;}), ncols-1);
    }
    else{
        return node;
    }
}

int snapshotperf(){
    ROOT::RDataFrame df_orig(10);
    auto df = defines(df_orig, 3);
    std::time_t start = std::time(0);

    ROOT::RDF::RSnapshotOptions opts;
    opts.fLazy = true;
    using SnapRet_t = ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager>>;
    std::vector<SnapRet_t> rets;

    std::vector<std::string> columnnames = {"x1", "x2", "x3"};
    std::vector<std::string> columntypes = {"int", "int", "int"};
    std::string template_expr("<");
    for(int i = 0; i < columntypes.size(); i++){
        template_expr+=columntypes[i];
        if(i!= columntypes.size()-1)
            template_expr+=",";
    }
    template_expr+=">";

    start = std::time(0);
    std::string declare_expr(
        "ROOT::RDF::RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager>> make_snap"
        "(ROOT::RDF::RNode df, std::string treename, std::string fname, std::vector<std::string> columnnames, ROOT::RDF::RSnapshotOptions opts){"
        "return df.Snapshot"+template_expr+"(treename, fname, columnnames, opts);"
        "}");

    gInterpreter->Declare(declare_expr.c_str());
    std::cout << "time to declare " << std::time(0)-start << std::endl;

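    start = std::time(0);
    // ProcessLine evaluates the expression "make_snap" and returns its value, here the
    // address of the jitted function, as an integer that is cast back to a function pointer.
    auto make_snap = (SnapRet_t (*)(ROOT::RDF::RNode,std::string,std::string,std::vector<std::string>, ROOT::RDF::RSnapshotOptions)) gInterpreter->ProcessLine("make_snap");
    std::cout << "time to get function " << std::time(0)-start << std::endl;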

    start = std::time(0);
    for (auto i = 0; i < 5; ++i){
        SnapRet_t res = make_snap(df, "t", "f" + std::to_string(i) + ".root", columnnames, opts);
        rets.emplace_back(res);
    }
    std::cout << "time to create snapshots " << std::time(0)-start << std::endl;

    return df.Count().GetValue();
}
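
The cast in the solution above appears to rely on ProcessLine handing back the function's address as an integer. Below is a minimal ROOT-free sketch of the same round trip, assuming a platform where a function pointer fits in std::intptr_t (true on mainstream desktop platforms). The names `add`, `address_of_add` and `to_function_pointer` are hypothetical stand-ins for the jitted function and the interpreter call.

```cpp
#include <cstdint>

using AddFn = int (*)(int, int);

// Stand-in for the jitted function.
int add(int a, int b) { return a + b; }

// Stand-in for an interpreter call that returns a function's address
// packed into an integer.
std::intptr_t address_of_add()
{
    return reinterpret_cast<std::intptr_t>(&add);
}

// Recover a callable function pointer from the integer address. The target
// signature must match the real function exactly; a mismatch is undefined
// behaviour and is not diagnosed by the compiler.
AddFn to_function_pointer(std::intptr_t address)
{
    return reinterpret_cast<AddFn>(address);
}
```

Because nothing checks the signature in the cast, keeping the function pointer type next to the generated declaration string helps the two stay in sync.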

Great!
Thanks for sharing it.

Cheers,
D
