I have built a small example to illustrate a performance issue I am having with TDF: dataFramePerformance.C (4.0 KB)
When defining a set of prefilters and then branching several other filters from the last prefilter, the performance drops quite significantly. This is the case even if no event survives the prefilter set.
In the above example, filling two histograms takes approximately 2 seconds. With all the others added, the code runs for more than 20 seconds.
Now I am wondering whether this is caused by wrong usage of TDF on my side or whether it is an internal problem.
Hi,
if adding many filters with everything else equal makes performance drop, it’s a bug, but I don’t think that’s the case.
You are probably seeing worse and worse performance with an increasing number of histograms, independently of the number of filters, correct?
If that’s the case and you are on ROOT v6.10/4, try switching to ROOT v6.10/6 or even to master: previous versions had an issue with some redundant instrumentation being inserted in jitted (just-in-time-compiled) code.
Lastly, if you really want to get the best performance out of TDF, replace Histo1D("x") with Histo1D<double>("x") (or equivalent) to compile that action instead of jitting it at runtime, and absolutely remember to always compile with at least -O2.
Glad I could help!
Which tip though? Are you compiling both ROOT and your program with -O2?
If you got that speedup by changing Histo1D("x") to Histo1D<double>("x"), I’d like to stress that in the latest versions of ROOT (v6.10/6 or master) the “just-in-time-compiled” version (the one without the template parameter) should only add a constant overhead of 1-2s to the program execution (the time it takes ROOT to compile and execute the Histo1D calls at runtime).
The template parameter did the trick. My framework was already compiled with -O2 before. I am using ROOT version 6.10.06, and adding 64 histograms takes 25-30 seconds without the template parameter. I would assume that in this case the just-in-time compiler is called for each histogram that is added, which can take quite a while.
Another thing I have noticed is that snapshotting with implicit multithreading seems to result in lost events. I am currently not sure whether this is caused by my cut functions. Is it safe to pass the branch variables as const references, or can they change during processing?
Yes, this was bug ROOT-9027, now fixed in the master and 6-10-00-patches branches. Definitely report these kinds of things if you see them.
Now that I try on v6-10-00-patches I see this issue as well. It was supposed to be fixed in v6.10/6 (by commit fb0541d374). The good news is that on master this takes ~2 seconds (100 jitted histograms):
#include "ROOT/TDataFrame.hxx"
#include <vector>

int main() {
   ROOT::Experimental::TDataFrame d(10); // empty data source with 10 entries
   auto dd = d.Define("x", "2");         // jitted Define
   std::vector<ROOT::Experimental::TDF::TResultProxy<TH1D>> histos;
   histos.reserve(100);
   for (auto i = 0u; i < 100; ++i)
      histos.emplace_back(dd.Histo1D("x")); // 100 jitted Histo1D bookings
   *histos.front();                         // triggers the single event loop
   if (histos.back()->GetEntries() != 10)
      return 1;
   return 0;
}
I’ll see if I can track down what we forgot to backport to v6.10.
Follow-up: I forgot that in v6.10 we still do one call to the interpreter per Histo1D call, while more recently we switched to one call for all booked actions. That is something that would be painful to backport.
I suggest you switch to 6.11 (should be released today or tomorrow) or 6.12 (second half of November) to get the best performance out of TDataFrame (and quite a handful of new features).
I have tested ROOT version 6.11.03 (SHA 9bb8349ee631929321a609032fc7c6f52891a637). After adapting to the interface changes, the code runs fine and as fast as before. Unfortunately, implicit multithreading seems to be broken.
Using the following statement
ROOT::EnableImplicitMT(4);
returns no error, but ROOT still uses only one core. Is there a new prerequisite that has to be met?
Thanks for your fast reply. The test script runs on all cores… so I have to investigate my problem. A quick profiling run shows 14 to 20% I/O wait, so it could be that the hard disk is too slow and the DataFrame performs much better than in the old version. I will try it on an SSD tomorrow and let you know about the outcome.
I needed a bit more time because I found some bugs on my side. Now implicit multithreading is working; the issue was a bug in my code. I have also verified my code against an old algorithm, and both algorithms yield the same result when processing 20 million events. On my side, the race condition that led to losing events is gone.
Are there plans for a feature to lazily evaluate an output ROOT file? The Snapshot seems to be evaluated instantly.
Thanks a lot for insisting and giving us feedback! Glad to hear that everything works for you now.
Indeed, Snapshot is presently what we call an “instant action”. We do not plan to change it into a lazy operation for the forthcoming 6.12 release, but we will certainly take your comment into account.