Floating point performance issue

David Brown david@westcontrol.com
Tue Dec 20 11:49:00 GMT 2011


On 20/12/2011 11:20, Ico wrote:
> * On Tue Dec 20 11:05:17 +0100 2011, Marcin Mirosław wrote:
>
>> On 20.12.2011 10:52, Ico wrote:
>
>>> I am able to reproduce this on multiple i686 boxes using various gcc versions
>>> (4.4, 4.6). Compiling on x86_64 does not show this behaviour.
>>>
>>> Is anybody able to reproduce this issue, and how can it be explained?
>>
>> I can reproduce this too. I can only guess that it happens
>> because on i686 the default is -mfpmath=387, while on x86_64 the
>> default is -mfpmath=sse. If you compile your code using "-O3
>> -mfpmath=sse -march=native" (or any other -march value that supports
>> SSE), then both times will be almost equal.
>
> Thanks for testing this.
>
> Still, I'm not sure if sse is part of the problem and/or solution.
>
> I have been reducing the program to see what the smallest code is that still
> shows this behaviour. Latest version is below.
>
>
> $ gcc -msse -mfpmath=sse -O3 -march=native test.c
> $ time ./a.out 0.9
>
> real	0m2.653s
> user	0m2.648s
> sys	0m0.002s
> $ time ./a.out 0.001
>
> real	0m0.144s
> user	0m0.140s
> sys	0m0.002s
>
>
> /* gcc -msse -mfpmath=sse -O3 -march=native test.c  */
>
> #include<stdlib.h>
>
> #define S 20000000
>
> int main(int argc, char **argv)
> {
>          int j;
>          double a = 0;
>          double b = 1;
>          double f = atof(argv[1]);
>
>          for(j=0; j<S; j++) {
>                  a = b * f;
>                  b = a * f;
>          }
>
>          return a;
> }

I've just tried this code (on an i7-920, 64-bit Linux), compiled with 
gcc 4.5.1, command line:

	gcc fp.c -o fp -O2 -Wa,-ahdls=fp.lst

"time ./fp 0.50" takes 0.088 seconds, but "time ./fp 0.51" takes 2.584 
seconds.

The inner loop is using SSE2 scalar instructions (mulsd on the xmm
registers), not MMX or the x87 unit, as can be seen from the listing file:

   18                    .L2:
   19 0020 660F28CA              movapd  %xmm2, %xmm1
   20 0024 83E801                subl    $1, %eax
   21 0027 F20F59C8              mulsd   %xmm0, %xmm1
   22 002b 660F28D1              movapd  %xmm1, %xmm2
   23 002f F20F59D0              mulsd   %xmm0, %xmm2
   24 0033 75EB                  jne     .L2

What is more interesting is that if I compile with the "-ffast-math"
flag, exactly the same listing file is generated, but the program runs
in about 0.088 seconds for any input.

You can get a clue as to the reason behind this if you add a 
"printf("%g\t%g\n", a, b)" statement at the end of the program.  When I 
then run "fp.slow" (no "-ffast-math" flag) and "fp.fast" (with 
"-ffast-math") I get:

[david@davidquad c]$ time ./fp.slow 0.50
0       0.000000

real    0m0.088s
user    0m0.086s
sys     0m0.000s

[david@davidquad c]$ time ./fp.slow 0.51
4.94066e-324    0.000000

real    0m2.575s
user    0m2.567s
sys     0m0.000s

[david@davidquad c]$ time ./fp.fast 0.50
0       0.000000

real    0m0.087s
user    0m0.086s
sys     0m0.000s
[david@davidquad c]$ time ./fp.fast 0.51
0       0.000000

real    0m0.087s
user    0m0.087s
sys     0m0.000s


The key point here is that without the "-ffast-math" flag, rounding
effects mean that for input values strictly greater than 0.5 but less
than 1, "a" ends up stuck on a very small but non-zero value (the
smallest denormal double). With input values of 0.5 or less, "a" quickly
reduces to 0 and the multiplies run at full speed, whereas multiplies
that produce denormal results are much slower on the processor. With the
"-ffast-math" flag active, my guess is that the program startup code
sets up different rounding or saturation modes on the processor,
avoiding this issue.

Best regards,

David



