Bug 53967 - GCC produces slow code for convolution algorithm with -mfpmath=sse (the AMD_64 default)
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.6.2
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-07-14 20:52 UTC by bfriesen
Modified: 2015-02-26 08:57 UTC
CC: 5 users

See Also:
Host:
Target: x86_64-*-* i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2012-07-16 00:00:00


Attachments
Convolution example C file, pre-processed version, build log, assembler output (2.73 KB, application/x-gzip), 2012-07-14 20:52 UTC, bfriesen
Build log (967 bytes, text/plain), 2012-07-14 20:55 UTC, bfriesen
Sample portable source file (834 bytes, text/plain), 2012-07-14 20:56 UTC, bfriesen
Pre-processed source (411 bytes, text/plain), 2012-07-14 20:57 UTC, bfriesen
Generated assembler code (675 bytes, text/plain), 2012-07-14 20:58 UTC, bfriesen
Pre-processed GraphicsMagick source (effect.c) (69.50 KB, application/octet-stream), 2012-07-14 21:42 UTC, bfriesen

Description bfriesen 2012-07-14 20:52:22 UTC
Created attachment 27792 [details]
Convolution example C file, pre-processed version, build log, assembler output

The classic convolution algorithm (as implemented in GraphicsMagick) is observed to run 2X slower with -mfpmath=sse than with -mfpmath=387.  Unfortunately, -mfpmath=sse is the default for -m64 builds on AMD_64, so this has a large impact for users.

Even with -mfpmath=387, other compilers (LLVM, Open64, and Oracle Studio) produce faster code by default: some of them achieve up to 3X better overall run-time performance, and all of them are at least 2X faster than the GCC default for x86-64.

This issue has been verified under Solaris 10, OpenIndiana, and Ubuntu Linux on Opteron and several modern Xeon CPUs.

Please note that AMD Opteron 6200 family CPUs were not observed to suffer from this issue.
Comment 1 bfriesen 2012-07-14 20:55:48 UTC
Created attachment 27793 [details]
Build log
Comment 2 bfriesen 2012-07-14 20:56:55 UTC
Created attachment 27794 [details]
Sample portable source file
Comment 3 bfriesen 2012-07-14 20:57:58 UTC
Created attachment 27795 [details]
Pre-processed source
Comment 4 bfriesen 2012-07-14 20:58:59 UTC
Created attachment 27796 [details]
Generated assembler code
Comment 5 bfriesen 2012-07-14 21:06:27 UTC
Please note that while I mentioned GCC 4.6.2, the same problem is also observed with GCC 4.7.1.
Comment 6 bfriesen 2012-07-14 21:42:38 UTC
Created attachment 27797 [details]
Pre-processed GraphicsMagick source (effect.c).

In case the small sample (which only illustrates the core algorithm) does not satisfy, I have attached a pre-processed version of the real GraphicsMagick code with the performance issue.  Look for ConvolveImage().
Comment 7 Richard Biener 2012-07-16 12:42:15 UTC
What options do you use besides -march=corei7-avx?  The build-log does not tell.
Did you try -march=corei7 instead of -march=corei7-avx?
Comment 8 bfriesen 2012-07-16 14:16:46 UTC
I used -march=native in this case.  It is interesting that this enabled AVX (this particular CPU does support it).

To be clear, the problem also occurs with

-m64 -mtune=generic -march=x86-64 -mfpmath=sse

vs

-m64 -mtune=generic -march=x86-64 -mfpmath=387

and is also observed on a 5-year old Opteron.

With GCC 4.7.1, and for a specific application benchmark case and with generic architecture and tuning, -mfpmath=387 produces 0.133 iter/s and -mfpmath=sse produces 0.047 iter/s.  A different (non-GCC) compiler on the same system produces 0.155 iter/s.

In the course of testing, I have indeed tried -march=corei7 and it did not provide an improvement.
Comment 9 Richard Biener 2012-07-16 14:56:59 UTC
(In reply to comment #8)
> I used -march=native in this case.  It is interesting that this enabled AVX
> (this particular CPU does support it).
> 
> To be clear, the problem also occurs with
> 
> -m64 -mtune=generic -march=x86-64 -mfpmath=sse
> 
> vs
> 
> -m64 -mtune=generic -march=x86-64 -mfpmath=387
> 
> and is also observed on a 5-year old Opteron.
> 
> With GCC 4.7.1, and for a specific application benchmark case and with generic
> architecture and tuning, -mfpmath=387 produces 0.133 iter/s and -mfpmath=sse
> produces 0.047 iter/s.  A different (non-GCC) compiler on the same system
> produces 0.155 iter/s.
> 
> In the course of testing, I have indeed tried -march=corei7 and it did not
> provide an improvement.

What kind of optimization options are you using?  -O3?  Or are you really
using -O0 (aka nothing)?
Comment 10 bfriesen 2012-07-16 15:35:03 UTC
This particular application test was done with these options (i.e. -O2):

-m64 -mtune=generic -march=x86-64 -mfpmath=387 -O2

I have also tried -O3, with no positive benefit.

The Autoconf default is -O2 so that is what I generally test/tune the software with. It is pretty rare to see additional benefit from -O3, although with some versions of GCC I have seen application crashes due to wrong code from the tree vectorizer.

Bob
Comment 11 bfriesen 2012-07-16 15:41:08 UTC
I just verified that -O3 produces similar timings to -O2 for both -mfpmath=387 and -mfpmath=sse
Comment 12 Stupachenko Evgeny 2012-07-18 09:45:15 UTC
I tried it at "-O2" and got low performance with -mfpmath=sse. It looks like it is caused by a register dependency on %xmm0 between:

addss	%xmm0, %xmm1
cvtsi2ss	%eax, %xmm0

Renaming %xmm0 in cvtsi2ss to another free register in all such cases resolves the issue. 

Also, you can try "-O2 -funroll-loops", which made the "sse" code even faster, and "-O2 -fschedule-insns", which significantly reduced the performance loss in the "sse" case.
Comment 13 Richard Biener 2012-07-18 10:49:53 UTC
You can also try -frename-registers
Comment 14 bfriesen 2012-07-18 14:28:04 UTC
With

-m64 -mtune=generic -march=x86-64 -mfpmath=sse -O2 -funroll-loops -fschedule-insns

I see a whole-program performance jump from 0.047 iter/s to 0.156 iter/s (a 3.3X speedup).  That is huge!  Given the fundamental properties of this algorithm (it is the image-processing algorithm most often recommended for moving to a GPU), the world would be a better place if this performance was the normal case.

With

-m64 -mtune=generic -march=x86-64 -mfpmath=sse -O2 -fschedule-insns

I see 0.101 iter/s

These options are evidently not included in -O3, since

-m64 -mtune=generic -march=x86-64 -mfpmath=sse -O3

produces only 0.048 iter/s
Comment 15 bfriesen 2012-07-18 20:42:22 UTC
Testing shows that using

-m64 -march=native -O2 -mfpmath=sse -frename-registers

is sufficient to restore good performance.
Comment 16 bfriesen 2012-07-19 14:29:10 UTC
Is there a way that I can selectively apply the -frename-registers fix to functions which benefit from it in order to work around the bug until the fix is widely available?  I tried

#pragma GCC optimize ("O3,rename-registers")

and

#pragma GCC optimize ("rename-registers")

as well as the function attribute equivalent, and there was no effect.  GCC seems to ignore the request.

I did find another somewhat similar function which benefited significantly from -frename-registers.
Comment 17 bfriesen 2012-07-21 01:04:55 UTC
I discovered that GCC's __attribute__((__optimize__())) and optimization pragmas do not work for OpenMP code, because the OpenMP implementation outlines the actual working code into a differently named, compiler-generated function.  This makes it much more painful to work around this bug.
Comment 18 xunxun 2012-08-12 15:41:35 UTC
Is this bug related to PR19780?