I have built a small example to illustrate a performance issue I am having with TDF: dataFramePerformance.C (4.0 KB)
When defining a set of prefilters and then branching several other filters from the last prefilter, the performance drops quite significantly. This is the case even if no event survives the prefilter set.
In the above example, filling two histograms takes approximately 2 seconds. With all the others added, the code runs for more than 20 seconds.
Now I am wondering whether this is caused by wrong usage of TDF on my side or whether it is an internal problem.
Hi,
if adding many filters with everything else equal makes performance drop, it’s a bug, but I don’t think that’s the case.
You are probably seeing worse and worse performance with an increasing number of histograms, independently of the number of filters, correct?
If that’s the case and you are on ROOT v6.10/4, try switching to ROOT v6.10/6 or even to master: previous versions had an issue with some redundant instrumentation being inserted in jitted (just-in-time-compiled) code.
Lastly, if you really want to get the best performance out of TDF, replace Histo1D("x") with Histo1D<double>("x") (or equivalent) to compile that action instead of jitting it at runtime, and absolutely remember to always compile with at least -O2.
Glad I could help!
Which tip though? Are you compiling both ROOT and your program with -O2?
If you got that speedup by changing Histo1D("x") to Histo1D<double>("x"), I’d like to stress that in the latest versions of ROOT (v6.10/6 or master) the “just-in-time-compiled” version (the one without the template parameter) should only add a constant overhead of 1-2s to the program execution (the time it takes ROOT to compile and execute the Histo1D calls at runtime).
The template parameter did the trick. My framework was already compiled with -O2 before. I am using ROOT version 6.10.06, and adding 64 histograms takes 25-30 seconds without the template parameter. I would assume that in this case the just-in-time compiler is called for each histogram that is added, which can take quite a while.
Another thing I have noticed is that snapshotting with implicit multithreading seems to result in lost events. I am currently not sure whether this is caused by my cut functions. Is it safe to pass the branch variables as const references, or can they change during processing?
Yes, this was bug ROOT-9027, now fixed in the master and 6-10-00-patches branches. Definitely report these kinds of things if you see them.
Now that I try on v6-10-00-patches I see this issue as well. It was supposed to be fixed in v6.10/6 (by commit fb0541d374). The good news is that on master this takes ~2 seconds (100 jitted histograms):
#include "ROOT/TDataFrame.hxx"
#include <vector>

int main() {
   ROOT::Experimental::TDataFrame d(10); // empty data source with 10 entries
   auto dd = d.Define("x", "2");         // jitted Define
   std::vector<ROOT::Experimental::TDF::TResultProxy<TH1D>> histos;
   histos.reserve(100);
   for (auto i = 0u; i < 100; ++i)
      histos.emplace_back(dd.Histo1D("x")); // 100 jitted Histo1D bookings
   *histos.front();                         // triggers the single event loop
   if (histos.back()->GetEntries() != 10)
      return 1;
   return 0;
}
I’ll see if I can track down what we forgot to backport to v6.10.
Follow-up: I forgot that in v6.10 we still do one call to the interpreter per Histo1D call, while more recently we switched to one call for all booked actions. That is something that would be painful to backport.
I suggest you switch to 6.11 (should be released today or tomorrow) or 6.12 (second half of November) to get the best performance out of TDataFrame (and quite a handful of new features).
I have tested ROOT version 6.11.03 (SHA 9bb8349ee631929321a609032fc7c6f52891a637). After adapting to the interface changes, the code runs fine and as fast as before. Unfortunately, implicit multithreading seems to be broken.
Using the following statement
ROOT::EnableImplicitMT(4);
returns no error, but ROOT still uses only one core. Is there a new prerequisite that has to be met?
Thanks for your fast reply. The test script runs on all cores… so I have to investigate my problem. A quick profiling run shows 14 to 20% I/O wait, so it could be that the hard disk is too slow and the DataFrame performs much better than in the old version. I will try it on an SSD tomorrow and let you know about the outcome.
I needed a bit more time because I found some bugs on my side. Now implicit multithreading is working; the issue was a bug in my code. I have also verified my code against an old algorithm, and both algorithms yield the same result when processing 20 million events. On my side, the race condition that led to losing events is gone.
Are there plans for a feature to lazily evaluate an output ROOT file? The Snapshot seems to be evaluated instantly.
Thanks a lot for insisting and giving us feedback! Glad to hear that everything works for you now.
Indeed, Snapshot is presently what we call an “instant action”. We do not plan to change it into a lazy operation for the forthcoming 6.12 release, but we will certainly take your comment into account.