Bug 79436 - [ARM Cortex-M4F] VFMA used in place of subtraction gives inexact results
Status: RESOLVED INVALID
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 6.3.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-02-08 21:42 UTC by Freddie Chopin
Modified: 2017-02-09 13:52 UTC

See Also:
Host:
Target: arm-none-eabi
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
test case (637 bytes, text/x-csrc), 2017-02-08 21:42 UTC, Freddie Chopin
assembly dump of invalid version (1.22 KB, text/plain), 2017-02-08 21:43 UTC, Freddie Chopin
assembly dump of valid version (1.24 KB, text/plain), 2017-02-08 21:43 UTC, Freddie Chopin

Description Freddie Chopin 2017-02-08 21:42:30 UTC
Created attachment 40699 [details]
test case

In some very specific code paths, at specific optimization levels, the compiler generates VNEG (negate) + VFMA (multiply and accumulate) instead of VSUB + VMUL. Most likely this is not a problem in most cases, but in the attached test case it leads to inexact results in well-defined code. The variable "distance", which is basically the length of the difference of two _IDENTICAL_ vectors multiplied by some constant factor, is expected to be 0. "x - x" for the components of each vector gives zero. "0 * 0" gives zero. "0 + 0" gives zero. "sqrtf(0)" gives zero. "0 * x" gives zero. These are basically all the operations in the example. However, the value of "distance" is calculated to be 1.34925369e-06 and the assertion fails.

Several notes:

1. The code seems complicated, but if I simplify it, a different sequence of instructions is generated; the VNEG + VFMA used instead of VSUB + VMUL is essential to the problem. To generate a slightly different sequence it's enough to uncomment the assertion that checks the equality of the vectors.

2. The problem appears at -O2, -O3 and -Os. It does not appear at -O1 or -Og (probably not at -O0 either, but I did not check).

3. The problem can be observed with GCC 6.3.0 and GCC 5.3.1, with or without the ARM patches.

Exact compilation command:

arm-none-eabi-g++ -Wall -Wextra -Wshadow -std=gnu++11 -g -ggdb3 -mcpu=cortex-m4 -mthumb -mfloat-abi=hard -mfpu=fpv4-sp-d16 -O2 -ffunction-sections -fdata-sections -fno-rtti -fno-exceptions -c vfma.cpp -o vfma.o

I also attach assembly output of the "invalid" version (-O2) and "valid" version (-O1).
Comment 1 Freddie Chopin 2017-02-08 21:43:11 UTC
Created attachment 40700 [details]
assembly dump of invalid version
Comment 2 Freddie Chopin 2017-02-08 21:43:35 UTC
Created attachment 40701 [details]
assembly dump of valid version
Comment 3 Andrew Pinski 2017-02-08 23:06:39 UTC
https://gcc.gnu.org/onlinedocs/gcc-6.3.0/gcc/Optimize-Options.html#index-ffp-contract-715

-ffp-contract=style
-ffp-contract=off disables floating-point expression contraction. -ffp-contract=fast enables floating-point expression contraction such as forming of fused multiply-add operations if the target has native support for them. -ffp-contract=on enables floating-point expression contraction if allowed by the language standard. This is currently not implemented and treated equal to -ffp-contract=off.
The default is -ffp-contract=fast. 

In this case what is happening is the following:

we get:
a * b - a * b;
Which gets transformed into:
c = a*b
c1 = -c
d = a*b + (c1)

Which is correct but not what you are expecting.  See the -ffp-contract documentation quoted above.
Comment 4 Freddie Chopin 2017-02-09 07:31:06 UTC
Hello Andrew and thanks for your answer.

Generally I don't care about the sequence of operations or the exact instructions that get generated, but in this case this default behaviour produces invalid results.

Generally the whole calculations are like this:
mx = ex - sx;
my = ey - sy;
distance = sqrtf(mx * mx + my * my) * constant;

The important thing here is that ex and sx are bitwise identical, just as ey and sy. Thus everything above can be transformed to:

mx = x - x;
my = y - y;
distance = sqrtf(mx * mx + my * my) * constant;

and then:

mx = 0;
my = 0;
distance = sqrtf(0 * 0 + 0 * 0) * constant;

However you rearrange that, the expected result is 0 and I see no place for "typical" floating point inaccuracies here. Let me reiterate - "startVector" and "endVector" in my test case are bitwise identical. Yet the code produces 1.34925369e-06 at the end...

The same code compiled at -O2 for x86-64 does not assert. I don't know x86-64 assembly, but I'm pretty sure that it supports this kind of instructions.

If the results of VFMA are considered "good enough" I think that the default value of -ffp-contract should be changed to "off" - after all -funsafe-math-optimizations or -ffast-math are not enabled by default either.

BTW - VFMA is used also with "-std=c++11".

Thus I think that the bug is not invalid and I ask you to reconsider. Thanks in advance!
Comment 5 Andrew Pinski 2017-02-09 07:38:59 UTC
"this default behaviour produces invalid results."

No.
Read http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.22.6768

> The same code compiled at -O2 for x86-64 does not assert. 
That's because the default x86-64 target does not have the fused multiply-add instruction.  On a newer Intel (or AMD) machine, add -march=native and you will see the same behavior.

You are not understanding what the fused multiply-add instruction does and how floating point works.

VFMA is not just multiply and accumulate but rather fused multiply-add.  The multiplication is done in infinite precision, the add is then done in that same infinite precision, and only then is the result rounded to the destination format.  With -ffp-contract=off, the fused multiply-add instruction is not used, which means there is a rounding between the multiply and the addition.
Comment 6 Freddie Chopin 2017-02-09 08:17:26 UTC
> On a newer Intel (or AMD) machine, add -march=native and you will
> see the same behavior.

You are right, adding that switch makes the assertion fire there too...

> VFMA is not just multiply and accumulate but rather fused multiply-add.
> The multiplication is done in infinite precision, the add is then done
> in that same infinite precision, and only then is the result rounded
> to the destination format.  With -ffp-contract=off, the fused
> multiply-add instruction is not used, which means there is a rounding
> between the multiply and the addition.

I've read that info already, but in my (limited) understanding subtracting two identical numbers still gives 0 at infinite precision, no matter how you round the result ;) But I get your point - somehow, after the whole sequence was rearranged, the result became inexact (from my point of view).

Any advice on how to handle this problem in this particular code? I'd prefer not to just set "-ffp-contract=off" for my whole project, as the toolchain libraries are compiled with contraction enabled anyway (I see A LOT of VFMA in functions like sinf() or cosf()), so that "solution" would be partial at best. Is there a "generic" way to write the code so that the result is "correct", while still allowing the optimizations which don't change accuracy?
Comment 7 Richard Earnshaw 2017-02-09 13:52:36 UTC
double __attribute__((optimize("fp-contract=off"))) x (double a, double b, double c)
{
  return a*b + c;
}

You might also need to mark the function as noinline to prevent it from being inlined into functions where the default still applies.