Performance deterioration starting with ROOT release v6-20 (vectorization in RooFit)

Hello,

Attached you can find a test macro test.C (3.8 KB) which prints time spent on fitting using RooFit. It is some reduction of the whole fitting problem that we have in our analysis. I cannot exclude that further reducing the example would still show my problem.

The problem is that testing on several different machines and gcc versions, I have 3-5 times degradation of the run time between ROOT before changes around commit v6-19-01-1204-ge419e5715b (vectorization in RooFit) and after that.

If the test is performed in this way:

  root -n -b -l -q test.C++ >& log.txt
  tail -n 1 log.txt # shows number of milliseconds spent on fitting the model 10 times

I get the following on my laptop with Debian 12, gcc12, i9 13900HX, for listed ROOT versions:

v6-19-01-1199-g24266cd51a
2099
v6-19-01-1204-ge419e5715b
9735
v6-21-02
10920
v6-28-04
16150

The funny thing is that testing on an lxplus-like CentOS7 virtual machine at CERN, with this script (couldn’t attach):

#!/bin/bash

for v in LCG_94a/x86_64-centos7-gcc8-opt LCG_96b/x86_64-centos7-gcc8-opt LCG_97/x86_64-centos7-gcc8-opt LCG_104/x86_64-centos7-gcc12-opt ; do
  . /cvmfs/sft.cern.ch/lcg/views/$v/setup.sh
  root-config --version
  root -n -b -l -q test.C++ >& log.txt
  tail -n 1 log.txt
done

I don’t get deterioration, but rather a small improvement:

6.14/08
9375
6.18/04
9381
6.20/02
8718
6.28/04
8415

Although the run times are neither similar to the fastest nor the slowest performance on my local machine. It should be noted, though, that the absolute run time is hard to compare here, because locally it runs on a rather powerful CPU, while at CERN it is on some (probably shared) VM in the computer centre.

I have a feeling that there might be some difference in how I am building ROOT compared to the central installations and that I didn’t change some important build option for the newer ROOT versions. My cmake call is:

cmake -Dxml=1 -Dtmva=1 -Dbuiltin_vc=1 -Ddavix=0 -Dbuiltin_afterimage=0 -Dbuiltin_glew=0 -Dbuiltin_ftgl=0 -Dxrootd=0 -Dbuiltin_veccore=1 -Droofit=1 -Dunuran=1 -Dminuit2=1 -Dmathmore=1 -Dfftw3=1 -Dvdt=1 -Dgenvector=1 -Dopengl=1 -Dsoversion=1 -Dexplicitlink=1 -DCMAKE_CXX_FLAGS='-D__ROOFIT_NOBANNER' -DCMAKE_INSTALL_PREFIX=../install $sources

Did anybody run into similar problems? Is it a regression in ROOT for my case or am doing something wrong in the build?

Regards,
Antoni

Dear @amarcine ,

If a regression is indeed present, then it’s something we want to address. RooFit has undergone a big effort of improvement in performance in the past years with demonstrated results, so the issue you report is definitely unexpected. I believe @jonas may be interested in understanding better with you what the culprit might be.

Cheers,
Vincenzo

Dear @vpadulan and @jonas,

I am aware of the expected improvement in performance. However looking both at the presentation and at the release notes, I got an impression that the improvement is expected for unbinned fits. Is any improvement in performance expected for binned fits?

It would be great to see what you get from my test macro for ROOT 6.18 and 6.28 built in a way you normally build it (before diving into possible deficiencies of my cmake opts). Note that to build 6.18 on a modern compiler (e.g. gcc 12) may require some tweaks.

Cheers,
Antoni

Hi, thanks for the report!

The performance improvements are not expected to improve binned fits significantly, but it also depends on the pdfs you are using in your model.

I can’t reproduce the problem on my machine when comparing the 6-18 branch with the 6-28 branch. The situation is similar to what you observe on CentoOS, although it’s a bit faster overall (I run it on a AMD Ryzen 9 3900):

6.18/04
6073

6.28/06
4699

Here is the root-settings-v6-18.cmake file with my built options (you have to rename the file, the forum didn’t allow me to upload with the .cmake suffix):
root-settings-v6-18.txt (4.9 KB)

You can pass it to cmake with the -C option.

I really have no idea why you see these completely different numbers on your laptop. In particular, the performace for v6-19-01-1199-g24266cd51a is surprisingly good for a laptop CPU. Generally, RooFit is not affected by the build options you are setting, except for fftw3 maybe. But since you are using a RooFormulaVar, the rabbit hole could be potentially very deep to the interpreter, cling and llvm.

Some ideas if you want to understand this problem further:

You can build ROOT with debug info (-DCMAKE_BUILD_TYPE=RelWithDebInfo) and then produce flamegraphs of the call stack like explained here:
https://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html

Also, to exclude that it has to do with the interpreter, try to replace the RooFormulaVar with a compiled custom class, like explained in this tutorial:
https://root.cern.ch/doc/master/rf104__classfactory_8C.html

E.g.:

  RooClassFactory::makeFunction("MyFormula", "x,xmax,thresh", "", "xmax + thresh - x");
  gROOT->ProcessLineSync(".x MyFormula.cxx+");
  RooWorkspace ws("ws");
  ws.import(x);
  ws.factory(TString::Format("MyFormula::mX(x, %f, %f)", xmax, GetKThreshold()));
  RooAbsReal& mX = *ws.function("mX");

I hope these ideas help you to understand what is going on, if you really want to go to the bottom of it! But maybe your performance problems will already be gone by not using RooFormulaVar.

Cheers,
Jonas

Hi,

Thanks for the hints!

Indeed using your code snippet to replace RooFormulaVar affects the performance:

v6-19-01-1199-g24266cd51a
1570
v6-19-01-1204-ge419e5715b
3539
v6-21-02
4448
v6-28-04
6672

The improvement is small for the “old” ROOT version(s), but massive for the ones after vectorization (and I guess some changes to how RooFormulaVar works?). Still the sequencing of performance remains.

I also tried with bindFunction, but I get a bit worse results than with your snippet. Do you know why? Maybe I should also replace bindPdf?

I tried to use your cmake settings, but have problems with missing dependencies. While data bases are clearly irrelevant, I stumbled on Vc. Do you have your own build or does it come from the system? Btw., what compiler do you use and what system is that? Also do you have the same settings for 6.18 and 6.28?

I will continue with flamegraphs and trying to reproduce your build.

Can you also try 6.28 with my build options?

As for the performance of my CPU, it is almost the most powerful what Intel currently gives. It eats 140W on all-core load. And googling a bit it looks like its single core performance can be from 50 to 100% better than AMD Ryzen 9 3900 depending on the benchmark. So it might be that my v6-19-01-1199-g24266cd51a result corresponds roughly to what you get.

Cheers,
Antoni

Turns out the reason for the deterioration was rather trivial… I was building ROOT without optimization flags.

ROOT used to build by default with RelWithDebInfo if nothing was set by the user. I was puzzled when you told me to build with that option and checked in CMakeCache.txt for different versions. Starting with v6-19-01-1204-ge419e5715b the value of CMAKE_BUILD_TYPE is empty.

After setting -DCMAKE_BUILD_TYPE=Release and using your snippet instead of RooFormulaVar I get:

v6-19-01-1199-g24266cd51a
1545
v6-28-04
1457
1 Like

Funny enough Building ROOT from source - ROOT claims that the default for CMAKE_BUILD_TYPE is Release

OK, now I understand what is going on. In v6-19-01-1186-gadacfa3eed the default CMAKE_BUILD_TYPE setting was changed: from just setting it to RelWithDebInfo to setting it to Release, but only if the user doesn’t pass compiler flags. Because I am passing CMAKE_CXX_FLAGS to switch off RooFit banner, I switched off default optimization. And probably I did some mistake in bisecting earlier, because I thought my problem started couple commits later.

@jonas with the main problem out of the way, I still would like to better understand some issues around the vectorization.

  1. Do I understand ROOT Version 6.20 Release Notes correctly, that I should set -DCMAKE_CXX_FLAGS=-march=native and -Dvdt=ON for max performance? Do these affect binned fits?

  2. Are there any other options I should pay attention to regarding RooFit performance?

  3. If I implement custom pdfs, is it described somewhere how to support vectorization in them?

Cheers,
Antoni

  1. No, this is not required anymore since ROOT 6.24 (see release notes). ROOT always compiles the RooFit code now for all different CPU vector instructions, and the right library with the instructions for your CPU will be loaded at runtime.

    Yes, these affect binned fits too. It depends more on the pdf type if it supports vectorized evaluation or not. The thing is, in binned fits people are more likely use RooFit classes that are not easily vectorizable.

  2. If it’s about performance for the vectorized Bachend with BatchMode(), just pay attention that all your pdfs and RooAbsReals support vectorized evaluation by overriding RooAbsReal::computeBatch(), like here for example the RooGaussian implementation. You can get informed about evaluation of non-vectorized pdfs if you enable the RooFit::FastEvaluations message stream at the RooFit::INFO level (see this tutorial on how to use the RooMsgService).

  3. Just implement this computeBatch(). Note that the signature of this function will change again in ROOT 6.30, since this vectorized backend is still constantly improved. But in the future, you can always get a skeleton for implementing you pdf also with the vectorized evaluation with the RooClassFactory (after I merge this PR at least :slight_smile: ).

Note that your model will not benefit greatly from vectorization. You are using the FFT pdf, which works as follows:

  1. Do the fast Fourrier transform with fine discrete samplings and fill template histogram with the convolution results (in a RooHistPdf)
  2. Evaluate the RooHistPdf for each event/bin in the dataset/hist

The second step is a lookup that doesn’t vectorize. You might still see speedups with BatchMode() because the new evaluation backend also has some other optimizations, but these are small.

In any case, you need to pass this BatchMode() option to fitTo() or createNLL(). Otherwise, you will use the old evaluation backend.

I’m happy to answer more questions, should you have any!

Cheers,
Jonas

1 Like

Dear @jonas,

Thank you very much! It is enlightening. Also thanks for the great work on RooFit!

In my test case BatchMode() seems to degrade the performance. Maybe again this is a problem of cmake options. But anyway I am happy without the BatchMode().

What is the differences between NumCPU and Parallelize?

Cheers,
Antoni

Yes, the convolution is a special case because of the cached templates. The BatchMode() doesn’t deal so well with that (yet). It’s strength so far are mathematical expression pdfs without any special logic. So I don’t think it’s another cmake problem.

The NumCPU() option is for the case of no BatchMode() and distributes the computation over several cores. But this has quite some overhead, so for many fits it’s not work it. It’s more for unbinned fits with many events, where you parallelize over events, or for RooSimultaneous with many channels were you parallelize over channels. How you parallelize is defined with the “strat” parameter of NumCPU().

The Parallelize() option is still very experimental. It’s part of an effort by some ATLAS collaborators to redesign the whole parallelization strategy of the test statistics. Of course it would be nice if you give it a try and see if it works for your usecase, but for any further questions on this I will refer you to @egpbos. But since this option is taylored towards the ATLAS Higgs combination workflow, you will for sure run into some problems when doing any other fit with it.

For me Parallelize never performs better than the default (no parallelization of any kind) and furthermore on subsequent runs yields very inconsistent speeds - for just 1x fitting I get 170-200 ms for the default and 300-1700ms with Parallelize(2) independently of the system load.

Is Parallelize meant to substitute NumCPU in the future?

So NumCPU() does not affect BatchMode()? But then does BatchMode() utilize multithreading and if so, how does one control how many cores does it take?

I tried also the IntegrateBins(0) (seems to be meant for my case to better describe the peak), but it takes more than 100x more time and this is already after changing from the default 1D integrator to RooAdaptiveGaussKronrodIntegrator1D with maxSeg=50 and method=31Points (can remember now why I chose this parameters back in 2015…, but it does work significantly faster than the default).

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.