Precision problem on some platforms

Hi,

I found a bug in the Pt() member function of TLorentzVector. It seems the computation of pt is sometimes setting a floating point error flag that influences the result of a subsequent comparison. Take a look at the attached example. It gives the right results when interpreted in ROOT, but not when you compile it, neither from ROOT nor external:

a.Pt() > a.Pt() = true ??!

This bug became obvious when I tried to sort a std::vector of TLorentzVector pointers by pt using std::sort and a predicate that compares the result of the Pt() function. Since our analysis framework is relying heavily on STL sorting procedures, this bug renders it rather useless.

I’d appreciate if this could be fixed ASAP.

Thanks,

Johannes Asal

Albert-Ludwigs-Universität Freiburg
Fakultät für Mathematik und Physik
Physikalisches Institut
Hermann-Herder Str. 3
79104 Freiburg

Phone: +49 761 203-8409
Fax : +49 761 203-5931
bugtest.c (893 Bytes)

You should never test if two computed floating point numbers are equal.
You can only check if they are equal within some precision.

Rene

This is true, but that’s not the point. The point is, that a.Pt() > a.Pt() returns true where the resulting double values from both computations in fact contain exactly the same data. That’s not a question of precision. And what about consistency? Even if you not consider this a bug, what would the users say when their former interpreted code gives different results once they compile it?

Johannes

And if that’s not enough evidence yet, I just found out that both
a.Pt() > b.Pt()
AND
b.Pt() > a.Pt()
return true. That’s programming quantum mechanics…

Hi,

Please note that sorting on floating point is explicitly (due to CPU implementation) instable and you can/will get different result on different platform (or even different optimization level). In that context CINT and the compiler are 2 different platforms.

You will need to make sure your algorithm is taking this fundamental unstability of floating comparaison in consideration.

Cheers,
Philippe

Hi,

As another example :slight_smile:, on my machine your test returns:

[pcanal@hennequeville tmp]$ root.exe -b -l 
root [0] .x btest.C 
pt == pt2
a.Pt() == b.Pt()
a.Pt() == a.Pt()
root [1] .x btest.C+
Info in <ACLiC>: script has already been loaded in interpreted mode
Info in <ACLiC>: unloading /var/tmp/./btest.C and compiling it
pt == pt2
a.Pt() == b.Pt()
a.Pt() == a.Pt()

Cheers,
Philippe.

I agree that floating point behavior might be different on different platforms, but how does this explain why a.Pt() > b.Pt() and b.Pt() > a.Pt()? Let’s suppose one of the calculations returns 1.99999999 and the second one returns 2.0 (with exactly the same formula and the same input values, which should sound strange to you already). Then you might get a.Pt() > b.Pt(). But how is it possible that you can get b.Pt() > a.Pt() at the same time then? Remember that those objects have been copied and should contain exactly the same floating point values from which pt is calculated. This would mean, floating point calculations are time dependent or depend on the order of execution. And why does this never happen when I calculate stuff myself and compare it? I can even check for floating point equality on the same platform that TLorentzVector is failing on when I do the Pt calculation manually.

Very interesting is also the fact, that when i do

double x = a.Pt();
double y = b.Pt();

and test for x > y, y > x and x == y (which I should not do, I know, but anyway), then I suddenly get a true for x > y and x == y. This seems to me as if the first comparison is eating up the error flag set by Pt() calculation and after that the second and third work fine… But correct me please if I’m wrong.

Hi,

Please remind me, the behavior you are seeing are they appearing
a) only on CINT
b) only in Compiler
c) in both
and which version of ROOT are you testing with?

Cheers,
Philippe.

Hi,

this is happening only in Compiler, using ROOT v5.18 compiled for slc4 platform.

Btw, meanwhile I found the reason for this behavior. It seems, that the Pt calculation is setting most likely a overflow/underflow exception bit on some platforms, that cause the subsequent comparisons to fail. When I enable double precision rounding by using

asm (“fldcw %0” : : “m” (*&mode));

where mode is 0x27F, the problem vanishes.

I suggest, that ROOT should check for overflow/underflow and clear the corresponding bits before returning any value. Alternatively this rounding mode should be set to give equal results on all platforms, because this is not something that should be left to the user.

Cheers,
Johannes

WHAAAAAT !!! I imagine the number of user complaints !! Please FIX YOUR CODE.
As already said, comparing the computation of 2 floating point numbers cannot be done.
This is well known since day 1 of computing.

Rene

I am not talking about comparison of equality. I’m talking about < and > relations. And this is possible with floating point numbers unless they have infinite precision. If you are not even able to have a strict weak ordering on floats or doubles, what can you do with them other than printing out their values? And how many users have ever checked the overflow bit after retrieving Pt()?

I don’t think any user will complain, if they cannot check whether the computed value they retrieved caused an overflow or not. No one will check for that anyway. It is more important (to me at least) to have defined behavior on comparisons. But it is just a suggestion.

Cheers,
Johannes

I am a bit lost in what you are talking about. Could you explain clearly what is your problem? Of course you can compare /sort floating point numbers. The only thing that you cannot do is to test the equality.
Anything else is up to you in your code.

Rene

Ok, I will explain again:

Look at the LorentzVector I create in the example. It is called ‘a’ and is being copied to vector ‘b’. When I now do something like the following:

a.Pt() > b.Pt()

i get true. This may be explained by precision issues, but not the fact, that the call

b.Pt() > a.Pt()

also returns true. That this is being caused by a floating point exception flag (which is obviously being set on every call to Pt()) can be easily verified by doing the following:

double x = a.Pt();
double y = b.Pt();

If you now do a.Pt() > b.Pt() it will return true, because it is being influenced by the exception flag. Do it again, and you will retrieve false, the correct answer.

It will become very obvious by doing the following (which is also one naive solution to fix the problem in the analysis code):

double x = a.Pt();
double y = b.Pt();

bool dummy = x > y;
bool realValue = x > y;

dummy will be true and realValue will be false. Just check it for yourself. The problem does not appear on lxplus though, so my best guess is it has to do something with the 32bit systems. Also CINT does not show this behavior. It is reproducable only in compiled code.

Cheers,
Johannes

This is a known problem and is due to the use of extended precisions
in the registers (80 bit), when doing the calculation and 64 bit for storing them memory.
You can see for example a discussion of this in this article:

hal.archives-ouvertes.fr/docs/00 … rticle.pdf

According to gcc the option -ffloat-store should improve this and
probably fix it (it fixes in that case).
However we cannot add this option in genersl since it will penalizes
the performances.
I think is the user responsability to deal with this special cases. In your case, I think, you should avoid the sorting of pt of two identical vectors.

Best Regards

Lorenzo

Thanks for your reply,

this is very interesting. If there is a compiler flag to fix this, I agree that the user can take care of it. But I had never heard of this problem and I think all users should be aware of this fact. So it would be good to point it out somewhere.

Sometimes you cannot avoid sorting vectors with same-Pt-LorentzVectors inside, for example if you have a particle collection with particles from two different authors (that’s what crashed it in my case).

Another workaround is to store the computed values and use a dummy comparison. But this has a speed penalty of course.

Cheers,
Johannes

The problem is worse in your case since you are using a 32 bit on a Linux platform. In this case you are using the x87 floating point unit instead of SSE.
If you are using an Intel Mac, gcc is configured by default to use SSE and,
I have tested, your code does not give this problem.

If possible, I would recommend you to use an x86_64 architecture and software compiled natively with 64 bit. In addition to run faster, the results will be also more reliable.

Cheers

Lorenzo