Low performance merging files with RDataFrame plus multiple cycles

rooter_03 · October 3, 2021, 3:03am

Hi,

I am having problems merging trees using RDataFrame. From the code below:

import ROOT
import shutil
import os
import timeit

#------------------------------------
def make_data(nentries):
    filename='file_0.root'
    if os.path.isfile(filename):
        return
        
    df = ROOT.RDataFrame(nentries)
    for i_branch in range(30):
        df = df.Define('a_{}'.format(i_branch), 'TRandom3 r(0); return r.Gaus(0, 1);')
    df.Snapshot('tree', filename)
        
    ifile=ROOT.TFile(filename)
    ifile.ls()
    ifile.Close()

    shutil.copyfile(filename, 'file_1.root')
#------------------------------------
def merger_fm():
    mrg=ROOT.TFileMerger(False)
    mrg.SetFastMethod(True)
    mrg.AddFile('file_0.root')
    mrg.AddFile('file_1.root')
        
    mrg.OutputFile('file_mrg_mr.root')
    mrg.Merge()
        
    ifile=ROOT.TFile('file_mrg_mr.root')
    ifile.ls()
    print(ifile.tree.GetEntries())
    ifile.Close()
#------------------------------------
def merger_df():
    l_file=['file_0.root', 'file_1.root']
    
    df = ROOT.RDataFrame('tree', l_file)
    df.Snapshot('tree', 'file_mrg_df.root')
        
    ifile=ROOT.TFile('file_mrg_df.root')
    ifile.ls()
    print(ifile.tree.GetEntries())
    ifile.Close()
#------------------------------------
make_data(1000000)

val_fm = timeit.timeit('merger_fm()', number=1, globals=locals())
val_df = timeit.timeit('merger_df()', number=1, globals=locals())

print('')
print('{0:<20}{1:<20.3}'.format('TFileMerger', val_fm))
print('{0:<20}{1:<20.3}'.format('RDataFrame' , val_df))

I see:

i.e., the TFileMerger approach takes about 6 times less time and produces only one cycle. Is there a way to:

Speed this up.
Remove all the cycles.

Although the cycles are supposed to not affect us, when I work with trees I usually do:

l_tree = []
l_key = ifile.GetListOfKeys()
for key in l_key:
    obj = key.ReadObj()
    if not obj.InheritsFrom('TTree'):
        continue
    l_tree.append(obj)

fun(l_tree)

which also seems to pick up the cycles as independent trees:

Cheers.

ROOT Version: 6.22/06
Platform: x86_64-centos7-gcc8-opt
Compiler: gcc8-opt

Use LCG_99 with x86_64-centos7-gcc8-opt

Wile_E_Coyote · October 3, 2021, 8:52am

I guess you will get “equal” speeds if you try:

    mrg=ROOT.TFileMerger(True)
    mrg.SetFastMethod(False)

BTW. See:

rooter_03 · October 4, 2021, 1:56am

Hello,

Thanks for your reply. But:

I do not want a way to make my code slower; I want the merge with Snapshot to be faster. How can I achieve that?
OK, so from the links there is no safe way to get an object from a ROOT file using GetListOfKeys. It seems that all the cycles get stored as independent keys, and in the case of trees this means that the last key, which I would probably use, will correspond to an incomplete tree. Thus, I would rather not have more than one cycle saved. How do I prevent Snapshot from saving multiple cycles?

Cheers.

rooter_03 · October 4, 2021, 2:52am

Hi,

This:

def getTrees(directory):
    l_key=directory.GetListOfKeys()

    d_itree={}
    for key in l_key:
        itree=key.ReadObj()
        if not itree.InheritsFrom('TTree'):
            continue

        name = itree.GetName()
        nevt = itree.GetEntries()

        tp_tree = (nevt, itree)
        if name not in d_itree:
            d_itree[name] = [tp_tree]
        else:
            d_itree[name].append(tp_tree)

    l_itree=[]
    for treename, l_tp in d_itree.items():
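        # Sort by entry count only; comparing whole tuples would fall back to
        # comparing the tree objects themselves when two counts are equal.
        l_tp.sort(key=lambda tp: tp[0])
        nevt, tree = l_tp[-1]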
        l_itree.append(tree)

    return l_itree

is a temporary workaround to get the trees corresponding to the latest cycles through the corresponding keys.

Cheers.

Axel · October 4, 2021, 7:21am

Wouldn’t it be simpler to invoke hadd, or am I missing some operation you want to do on the trees?
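
For anyone who wants to call hadd from a Python script rather than a shell, the following is a minimal sketch. It assumes a ROOT installation with `hadd` on the PATH; the helper names `hadd_command` and `run_hadd` and the output file name `file_mrg_hadd.root` are made up for illustration. The `-f` flag tells hadd to overwrite an existing target file.

```python
import subprocess


def hadd_command(target, sources, force=True):
    """Build the argument list for ROOT's hadd command-line tool."""
    if not sources:
        raise ValueError("need at least one input file")
    cmd = ["hadd"]
    if force:
        cmd.append("-f")  # overwrite the target if it already exists
    cmd.append(target)
    cmd.extend(sources)
    return cmd


def run_hadd(target, sources, force=True):
    """Run hadd; this only works where ROOT's hadd is installed."""
    subprocess.run(hadd_command(target, sources, force), check=True)


# Only builds and prints the command; call run_hadd(...) to actually merge.
print(hadd_command("file_mrg_hadd.root", ["file_0.root", "file_1.root"]))
```

Keeping the command construction separate from the call to `subprocess.run` makes it easy to log or inspect the command before running it.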

eguiraud · October 4, 2021, 7:43am

Hi,
to expand on this, RDataFrame is probably not the best tool if you want to only do a merge: TFileMerger or hadd would be more appropriate. The difference is that TFileMerger and hadd know that you only want to merge/copy, and can therefore skip some processing steps or perform certain operations in bulk (e.g. if you just have to copy TTree data from one TFile to another there is no need to decompress it, you can just memcpy the compressed bytes).

RDataFrame, on the other hand, decompresses, reads in and processes the values of each event; that is the price of generality (e.g. you can Filter events, Define new columns, produce control plots, etc., all while writing out the new data).
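
As a rough sketch of the kind of job where RDataFrame earns its cost (modifying and merging in one pass), the snippet below chains Define and Filter calls before a single Snapshot. The helper names `apply_steps` and `merge_and_modify` are hypothetical, as are any column expressions passed to them; only `RDataFrame(treename, files)`, `Define`, `Filter` and `Snapshot` are the calls already used or named in this thread.

```python
def apply_steps(df, defines=None, filters=None):
    """Chain Define and Filter calls on an RDataFrame-like object.

    Columns are defined first so that filter expressions can use them.
    """
    for name, expr in (defines or {}).items():
        df = df.Define(name, expr)
    for expr in (filters or []):
        df = df.Filter(expr)
    return df


def merge_and_modify(files, treename, outname, defines=None, filters=None):
    """Read several files as one dataset, modify it, write a single tree.

    Needs a ROOT installation with PyROOT, hence the local import.
    """
    import ROOT

    df = ROOT.RDataFrame(treename, files)
    df = apply_steps(df, defines, filters)
    df.Snapshot(treename, outname)
```

With the files from the first post, a call could look like `merge_and_modify(['file_0.root', 'file_1.root'], 'tree', 'file_mrg_df.root', defines={'sum01': 'a_0 + a_1'}, filters=['sum01 > 0'])`, where `sum01` is an invented column name.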

Cheers,
Enrico

rooter_03 · October 4, 2021, 3:04pm

Hi @Axel, @eguiraud ,

I guess by hadd you mean TFileMerger, because most of the merging I do is from c++ or python code and it’s easier to just use this class rather than invoking the utility.

The idea was to modify trees, adding columns, and then save those modified trees into one tree. This last step would be equivalent to merging, therefore I did not have the choice to just use TFileMerger because I was modifying the trees and then merging them. However I found out that there were two problems:

The merging is slow. Therefore I made a test and found out that TFileMerger is faster even when you do not modify the trees by adding a column and just merge.
RDataFrame makes many cycles. I thought that this meant that the class was somehow doing something inefficient (like saving more often than necessary) and that it could be turned off. However I did not find anything that can speed it up, therefore I asked.

These cycles, together with the function that retrieves trees to process them (which I showed you above), were causing the cycle before the last one to be used instead; thus, events were dropped. When I found out that events had been dropped, I started trying to figure out why, and that led me to the test I posted originally and the questions I asked.

I will use TFileMerger when just merging. However when merging is just part of the job, I will have to:

Bear the lower performance.
Use the modified function to retrieve trees, which seems to be picking up the latest cycle.

Cheers.

Axel · October 4, 2021, 3:32pm

We need to understand why all those cycles are created. @eguiraud is this in RDF Snapshot?

You can just take the name of the key and strip the trailing ;NNN part (the cycle number); that way you’re guaranteed to read the most recent version. GetListOfKeys() actually guarantees that the newest version is always first, so it should be enough to skip any other TKey with the same name. If this isn’t the case in your example then we need to understand why not…
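
A minimal sketch of that idea in Python, under a few assumptions: the function names `latest_keys` and `get_trees` are invented, trees are recognised by the exact class names `TTree` or `TNtuple` (other derived classes would need to be added), and instead of relying on the key order it compares `TKey.GetCycle()` explicitly, so it should give the same result whether or not the newest key comes first.

```python
def latest_keys(keys, class_names=("TTree", "TNtuple")):
    """Return one key per object name, keeping the highest cycle number."""
    best = {}
    for key in keys:
        # Check the stored class name so nothing is read from disk yet.
        if key.GetClassName() not in class_names:
            continue
        name = key.GetName()  # the name without the ";NNN" cycle suffix
        if name not in best or key.GetCycle() > best[name].GetCycle():
            best[name] = key
    return list(best.values())


def get_trees(directory):
    """Read only the most recent cycle of every tree in a TDirectory."""
    return [key.ReadObj() for key in latest_keys(directory.GetListOfKeys())]
```

Because only the selected keys are passed to `ReadObj()`, older cycles are never loaded, which also avoids reading partially written versions of a tree.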

eguiraud · October 4, 2021, 4:39pm

Yes, I’ll take a better look but I think it’s just intermediate flushes, i.e. normal.

@rooter_03 I’ll also check whether there are obvious performance bottlenecks but again the comparison is between a very generic tool and an API that’s specifically meant for fast TTree merging.

eguiraud · October 6, 2021, 1:29pm

Hi,
about the namecycles, I am not sure I see the same problem as you. Running the reproducer from your first post with ROOT v6.24.00 (installed via conda some time ago):

$ python repro.py
TFile**		file_0.root	
 TFile*		file_0.root	
  KEY: TTree	tree;8	tree
TFile**		file_mrg_mr.root	
 TFile*		file_mrg_mr.root	
  KEY: TTree	tree;1	tree
2000000
TFile**		file_mrg_df.root	
 TFile*		file_mrg_df.root	
  KEY: TTree	tree;15	tree
2000000

TFileMerger         0.526
RDataFrame          25.3

$ python
Python 3.9.7 | packaged by conda-forge | (default, Sep  2 2021, 17:58:34)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ROOT
>>> f1 =  ROOT.TFile("file_0.root")
>>> len(f1.GetListOfKeys())
1
>>> f2 = ROOT.TFile("file_mrg_df.root")
>>> len(f2.GetListOfKeys())
1

So there is only one key in the output files.
What am I doing differently?

Also note that a similar C++ TTree-filling program produces multiple namecycles as well, simply because of intermediate autoflushes (i.e. this is not a quirk of Snapshot):

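#include <TFile.h>
#include <TRandom3.h>
#include <TTree.h>

#include <string>  // std::to_string
#include <vector>  // std::vector

void make_data(unsigned long long nentries) {
  const auto filename = "file_0.root";

  // Direct construction: TFile is not copyable, and this form also compiles before C++17
  TFile f(filename, "recreate");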
  TTree t("t", "t");
  std::vector<double> x(30);
  for (int i_branch = 0; i_branch < 30; ++i_branch) {
    t.Branch(("a_" + std::to_string(i_branch)).c_str(), &x[i_branch]);
  }

  TRandom3 r(0);
  for (auto n = 0ull; n < nentries; ++n) {
    for (auto &e : x)
       e = r.Gaus();
    t.Fill();
  }

  t.Write();

  f.ls();
}

int main() { make_data(1000000); }

About the speed, as you see the difference is even more dramatic on my laptop. I would like to check how that looks in pure C++ but did not have time today, will reply here as soon as I have something.

Cheers,
Enrico

rooter_03 · October 7, 2021, 1:52pm

Hi @eguiraud

Thanks for looking into this. You are using a different version from the one I showed in my example. I tested again and it seems that version 6.24 does not keep old cycles; TFileMerger also seems to be even faster now.

v22.06

v24.00

I will just move to the latest version.

Cheers.

eguiraud · October 13, 2021, 8:25am

Hi,
so I took Python, just-in-time compilation and compilation optimization levels out of the equation to compare the performance of Snapshot and TFileMerger on equal grounds:

// tfilemerger.cpp
#include <TStopwatch.h>
#include <TFileMerger.h>

int main() {
   TFileMerger mrg;
   mrg.SetFastMethod(true);
   mrg.AddFile("file_0.root");
   mrg.AddFile("file_1.root");

   mrg.OutputFile("file_mrg_mr.root");
   TStopwatch sw;
   sw.Start();
   mrg.Merge();
   sw.Stop();
   sw.Print();
}

// snapshot.cpp
#include <ROOT/RDataFrame.hxx>
#include <TStopwatch.h>

void merger_df() {
  auto df = ROOT::RDataFrame("tree", {"file_0.root", "file_1.root"});
  df.Snapshot<double, double, double, double, double, double, double, double,
              double, double, double, double, double, double, double, double,
              double, double, double, double, double, double, double, double,
              double, double, double, double, double, double>(
      "tree", "file_mrg_df.root",
      {
          "a_0",  "a_1",  "a_2",  "a_3",  "a_4",  "a_5",  "a_6",  "a_7",
          "a_8",  "a_9",  "a_10", "a_11", "a_12", "a_13", "a_14", "a_15",
          "a_16", "a_17", "a_18", "a_19", "a_20", "a_21", "a_22", "a_23",
          "a_24", "a_25", "a_26", "a_27", "a_28", "a_29",
      });
}

int main() {
  TStopwatch st;
  st.Start();
  merger_df();
  st.Stop();
  st.Print();
}

I am aware that nobody will ever write a Snapshot invocation like that, but it’s useful for the purposes of making sure that both TFileMerger and Snapshot are compiled ahead of time and with a reasonable optimization level (-O2).

This results in a ~19s runtime for Snapshot and ~0.5s for TFileMerger.

Setting mrg.SetFastMethod(false); brings TFileMerger to a runtime of 17s. Flamegraphs easily show what the difference is (you can open them in their own browser tab to make them interactive – right-click, open in new tab):

tfilemerger

snapshot

As we suspected the difference is simply that Snapshot decompresses and re-compresses all data while TFileMerger does a direct copy of the compressed buffer (an optimization disabled by mrg.SetFastMethod(false)).
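
To make the difference concrete, here is a toy analogy in plain Python with `zlib`. It is not ROOT's implementation: the "baskets" are just compressed byte chunks, and all function names are invented for the illustration. The fast path copies the compressed chunks untouched, while the slow path decompresses and recompresses each one; both yield the same data when read back.

```python
import zlib


def make_baskets(raw, basket_size=1000):
    """Split raw bytes into compressed chunks (a stand-in for TTree baskets)."""
    return [zlib.compress(raw[i:i + basket_size])
            for i in range(0, len(raw), basket_size)]


def fast_merge(*inputs):
    """Copy the compressed chunks as they are; nothing is decompressed."""
    return [chunk for baskets in inputs for chunk in baskets]


def slow_merge(*inputs):
    """Decompress every chunk and compress it again, as a generic tool must."""
    return [zlib.compress(zlib.decompress(chunk))
            for baskets in inputs for chunk in baskets]


def read_back(baskets):
    """Decompress all chunks and join them into the original byte string."""
    return b"".join(zlib.decompress(chunk) for chunk in baskets)
```

Timing `fast_merge` against `slow_merge` on large inputs (for example with `timeit`) should show the same qualitative gap as in the thread, since the slow path pays for compression on every chunk and the fast path only moves references to bytes.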

As Snapshot is more general it would be difficult to perform the same optimization as TFileMerger there (although not impossible, I guess).
I hope this clarifies what you see.

Cheers,
Enrico

system · October 27, 2021, 8:26am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.