Fast (non-)ROOT Fitting

ROOT Version: 6.24.00
Platform: MacOS Catalina 10.15.7
Compiler: G++

Hi there,

I hope this still remains an appropriate question for the ROOT form and apologies in advance for the abstract nature of the question.

I am looking to increase the speed of a real-time fitting routine that I have written using ROOT, which fits a function to data points of a file that updates with incoming data. The function is non-linear and requires the use of TF1::Integral(). Unfortunately, I cannot attach video clips to this message, but Iā€™ve attached an image of the fit, for whatever use that might be.

LiveFit2.pdf (18.2 KB)

I believe Iā€™ve done all I can do to optimise my code, namely by reducing as much as possible the number of commands to be executed in the infinite loop. Whether or not I choose to not display the fit results with ROOT, I achieve a fit rate between 2-5 Hz.

My question is: is there a potential performance improvement to be gained by moving away from ROOT? Having discussed with some colleagues, it seems that GNU Scientific Library would be the first choice but I have seen that GNUPlot might be another choice. I donā€™t have any experience with either, but if I understand correctly, ROOT makes use of GSL in its own fit routines?

I expect that trying to implement everything I have done in ROOT with either of the options will be a large task, as ROOT makes a lot of these things easier. In particular, the evaluation of integrals. Me and my team wonder that if we strip away the overhead ROOT has, if we can improve the fit rate. We ideally want to reach a fit-rate of about 10 Hz.

I want to avoid going down this alternate route only to find that the results are similar, so any insight that can be provided before I undertake this would be greatly appreciated! Please let me know if thereā€™s any further information you need in order to answer my question.

Hi @SaadShaikh ,
maybe @moneta already has an idea, but the best way to proceed would be to profile where the program spends time exactly.

On Linux you could do so e.g. by running the program under perf for a few seconds and then inspect the output of perf report or visualizing the profiling as a flamegraph. Also make sure your program is compiled with optimizations -O2 compiler flag and that you donā€™t just execute it as a macro as root macro.C. You need at least root macro.C+ (with the trailing +) to have ROOT compile the code into an optimized shared library.

Cheers,
Enrico

Hi Enrico. Thank you for your response. It seems that DTrace and Flame Graph will be my only option for MacOS? Indeed, I compile my code with g++ with those optimisation flags and run the program from the terminal as an executable. I donā€™t use interactive ROOT.

I conducted some rudimentary tests using chrono::high_resolution_clock to try and determine where the program spends the most time.

If we exclude the graphical elements of the plot, such that in each loop the program reads the data for fitting (2 line CSV file, each line containing 32 values), creates the TH1, initialises the fit parameters, creates the TF1 object and then performs the fit. When I wrote the program, I did my best to avoid re-creating objects and try, wherever possible, to modify existing objects. However to get proper behaviour, the TH1 and TF1 had to be re-initialised each time.

With this setting, the rate of the entire program was 2.49 Hz whereas the rate of just the fit (i.e. h->Fit(ā€œFunctionā€,ā€œRQ0ā€);) was 2.63 Hz.

Similarly, introducing the graphical plot to show the results, the rate of the entire program was 2.14 Hz and the rate of fit was 2.70 Hz.

This at least, unsurprisingly, tells us that the fit itself is taking most of the time. @eguiraud Iā€™m not sure if you would want to know more deeply which aspect of the fit is taking the most time?

For interest, the refresh rate of the graphical plot taking out the fit is about 11.8 Hz and without the plot and fit, the refresh rate is about 70Hz. This tells us that the reading/writing of data and creation of objects used in the loop is not the bottleneck.

Hi,

The way to improve the time, is to improve the evaluation of the fitting function. If it uses numerical l integration, you can maybe look if you can use instead analytical integration. Also the function evaluation can be vectorized and the fit performed in multi-threads. All this is supported by ROOT.
Furthermore, your bottleneck will be probably your function evaluation, so using another tool rather than ROOT, I donā€™t think will help you run the fit faster.

Lorenzo

Hi Lorenzo. Thank you for your advice ā€“ I agree that it would definitely be beneficial to try and parallelise the fitting function evaluation before anything else. Just to say, my fitting is analytic rather than numeric. Could you shed some light on how I go about parallelisation? I havenā€™t attempted this before. Would it be useful for me to share my short fitting function code (perhaps privately) to discuss this further?

Many thanks again for your support.

If you consider something alternative to ROOT, take a look at Solving Non-linear Least Squares ā€” Ceres Solver which can be parallelized easily. It automatically calculates analytical derivatives for you :wink: Automatic Derivatives ā€” Ceres Solver

If you need even more speed and can use a GPU, this might interest you: GitHub - gpufit/Gpufit: GPU-accelerated Levenberg-Marquardt curve fitting in CUDA

Hi,
Yes please share your code. Which ROOT version are you using ?
If it is a recent version, multi-thread fitting is normally automatically enabled, once you have called
ROOT::EnableImplicitMT()

Lorenzo

Hi @ferhue. Thank you for the suggestions ā€“ I will investigate these. It may be that we need to move away from ROOT anyway further down the line, to something more widely used by a non-particle physics community. For now, I think I will exhaust all that ROOT can offer.

@moneta I am using ROOT 6.24.00. I think it would be best to share the code privately, just to be safe. Iā€™ll send you a private message with the code.

Thank you again everyone for your help :slight_smile:

While Lorenzo is looking at your code: how many data points do you have in your fit? 2.5Hz is super slow for 100 point, and maybe appropriate for 100 million :slight_smile: Can you increase the fit verbosity to see how often the fit function is invoked?

And yes, a profile would be fantastic to have, to know where we spend the time.

Hi Axel. Only 32 points at the moment! We hope to achieve 10Hz with up to 160 points. Lorenzo identified by profiling my code that most of the time was spent on the evaluation of the integrals in the fitting function. In the way the code was structured, TF1 objects were being created each time the function was evaluatied in order to calculate these. Lorenzo advised to avoid doing that and instead use ROOT::Math::IntegratorOneDim to evaluate the integrals and use the Gaussā€“Legendre alogrithm instead of the default from GSL.

Currently a work in progress as Lorenzo (very kindly!) is helping me implement these changes. From his tests, it seems we should be able to achieve 10-15Hz with these changes.

1 Like

If the integral takes a lot of time, one trick could be ā€˜precalculate itā€™ in many steps and store it into an LUT. If it is a definite integral, then the evaluation of the integral is just subtracting two values from an LUT.

1 Like

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.