Minuit1 vs Minuit2 Tolerance and EDM

Hi, I’ve been running a 400+ amplitude fit using RooMinimizer as the interface to MIGRAD. We configure MIGRAD to use strategy 0 and tolerance 3 through RooMinimizer:

minimizer.setStrategy(0);
minimizer.setEps(3.0);

and the only difference between our runs is whether we use Minuit1 (the default) or Minuit2:

minimizer.setMinimizerType("Minuit2");

The Minuit2 documentation states that Migrad converges when the estimated distance to minimum (EDM) is smaller than

tolerance * 1e-3

which should be 3e-3 in this case. With Minuit1 I get the expected printout:

STARTED MIGRAD MINIMIZATION. STRATEGY 0. CONVERGENCE WHEN EDM .LT. 3.00E-03

but Minuit2 prints something different:

Minuit2Minimizer: Minimize with max-calls 500000 convergence for edm < 3 strategy 0
VariableMetric: start iterating until Edm is < 0.006

which means the convergence criterion has effectively doubled in Minuit2. I have 4 fits that converge in both Minuit1 and Minuit2, and Minuit1 consistently reaches a smaller minimum (attached). On another note, Minuit2 seems to converge more often than Minuit1, although I guess this is expected since Minuit2 has a looser convergence criterion?
both_passed.pdf (16.9 KB)

Which leads me to a few questions:

  1. I recall a mention on this forum that Minuit2 has a fix for small tolerances. What exactly does this fix do?
  2. Is this fix enlarging the tolerance I asked for in Minuit2?
  3. Are there any other differences, aside from the tolerance, between Minuit1 and Minuit2 that would explain the difference in minimum values?
  4. We have about 400k data events and 2.5M MC events (the MC is used to normalize our likelihood function). Does it get harder for the EDM to go below the convergence criterion as the statistics increase?
  5. A bit unrelated: what does strategy 0 assume for the initial second-derivative matrix at the start? Is it the same as the BFGS method in scipy, which assumes the identity matrix?

Since this is an ongoing analysis with unreleased private code, I am unable to provide a working example; I apologize for the inconvenience.

Hi,

The factor of 2 in the tolerance is caused by a different definition of the EDM in Minuit1 and Minuit2. The EDM in Minuit1 is defined as half of the one in Minuit2, and in order to maintain some compatibility between the two, a factor of two is applied to the actual tolerance (edm < 0.006 instead of edm < 0.003 when a tolerance of 3 is requested).
It is, however, strange that you get smaller function values in Minuit1 than in Minuit2. The attached figure shows a very large difference; a small difference can happen, but it should be of order smaller than ~0.006.

Now, concerning your questions: I am not sure which fix you are referring to, can you please post links to the posts? In general there could be some small differences due to fixes applied in Minuit2, but those should in general make Minuit2 more capable of converging than Minuit1; when both converge, both should reach the same minimum value. If this is not the case, I am interested in investigating it further, but I would need to reproduce the results. Otherwise I could have a look at the results obtained with the maximum verbosity mode.

When the statistics increase, it is common to have a larger error in the likelihood computation due to numerical error. It is recommended, if possible, to use compensated summation and to keep the total likelihood value small (not too large) by also applying an overall offset.

When using strategy 0 (and also strategy 1) the initial Hessian matrix is estimated as a diagonal matrix computed from the diagonal second derivatives only, not as an identity matrix. When using strategy 2 the full Hessian matrix is computed and used as the initial state.
The main difference between strategy 0 and 1 is that with strategy 0 the Hessian matrix is never computed at the end of the minimization; only the approximation built up by Migrad is used.

If something is not clear or if you have any other questions, please let me know.

Best,

Lorenzo

The factor of 2 in the tolerance definition is caused by a different definition of the edm between Minuit1 and Minuit2.

Based on what I understood of your reply, the factor of two is added to the calculation of the EDM in Minuit2?

The attached figures shows a very large difference, a small difference can happen, but of the order smaller than ~0.006.

It could be because our parameters are highly correlated. Some of the parameters have a global correlation above 1.0, possibly due to numerical issues. I guess MINUIT’s assumption that the negative log likelihood is a quadratic function around the minimum probably breaks down in our fits. Physics-wise, it is the best model we can come up with, but statistics-wise there are probably too many free parameters.

Now concerning your questions, I am not sure which fix are you referring to, can you please post the links to the posts ?

You mentioned “a couple of fixes” in point 4 of this post, and “due to some issues which have been fixed only in the new version” in this post as well. In fact, you mentioned “some fixes applied in Minuit2” in your reply above too. I was wondering what these fixes are, specifically and in technical terms, since the details were never given in the ROOT forums or the Minuit2 documentation.

Otherwise I could have a look at the obtained results obtained with the maximum verbosity mode.

I can ask my collaborators if they are comfortable sharing the logs with you privately. We do plan to publicly release the fitting code after we publish the analysis (earliest in 6 months, but it can take up to a year). If you don’t mind waiting that long, I will add this to my to-do list for when the analysis and code are public.

if possible, to compute a compensated summation and if possible keep the total likelihood value quite small (not too large) by using also an overall offset.

We calculate the per-event likelihoods using CUDA and sum them with a reduction operation (CUDA boost), which I assume does not use any compensated summation. Regarding the total likelihood, we usually end up with a value around -180,000. I’ve not heard of any amplitude fits using offsets in their minimization, so I’ll either have to ask around or think about how to implement this without compromising the mathematical result.

initial Hessian matrix is estimated as a diagonal matrix computed using the diagonal second derivatives only

Thanks for answering this! Maybe scipy’s BFGS method later multiplies the identity matrix by first/second derivatives in some other part of the code that I did not read.
