This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug target/65847] SSE2 code for adding two structs is much worse at -O3 than at -O2
- From: "rguenth at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Wed, 22 Apr 2015 14:03:38 +0000
- Subject: [Bug target/65847] SSE2 code for adding two structs is much worse at -O3 than at -O2
- Auto-submitted: auto-generated
- References: <bug-65847-4 at http dot gcc dot gnu dot org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65847
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Target|                            |x86_64-*-*
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2015-04-22
                 CC|                            |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed. The issue is that the vectorizer thinks x and y reside in memory
and thus it vectorizes the code as
  <bb 2>:
  vect__2.5_11 = MEM[(double *)&x];
  vect__3.8_13 = MEM[(double *)&y];
  vect__4.9_14 = vect__2.5_11 + vect__3.8_13;
  MEM[(double *)&D.1840] = vect__4.9_14;
  return D.1840;
which looks good. But then the ABI comes into play and passes x, y and
the return value in registers ...
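For reference, a minimal sketch of the kind of function being discussed.
The reporter's test case is not quoted in this comment, so the type name
`pair`, the field names and the function name `add_pairs` are assumptions
made for illustration only; the GIMPLE above only tells us that x and y
are accessed as pairs of doubles.

```c
/* Hypothetical reconstruction: a struct of two doubles, passed and
   returned by value.  Names are illustrative, not from the bug report. */
typedef struct { double a, b; } pair;

pair add_pairs(pair x, pair y)
{
    pair r;
    r.a = x.a + y.a;   /* two independent scalar adds ...              */
    r.b = x.b + y.b;   /* ... which the vectorizer may group into one  */
    return r;
}
```

Compiling something like this with `gcc -O2 -S` and `gcc -O3 -S` and
comparing the assembly should show whether a given GCC version is affected.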
But even the best vectorized sequence would have four stmts - two to
pack arguments into vector registers, one add and one unpack for the
return value.
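The pack / add / unpack shape can be sketched at the source level with
GCC's generic vector extension. This is only an illustration under
assumptions: `pair`, `v2df` and `add_pairs_vec` are hypothetical names,
and the instructions actually emitted depend on the target and the
compiler version.

```c
/* Hypothetical names throughout; uses the GCC vector_size extension. */
typedef struct { double a, b; } pair;
typedef double v2df __attribute__((vector_size(16)));

pair add_pairs_vec(pair x, pair y)
{
    v2df vx = { x.a, x.b };   /* pack first argument into a vector   */
    v2df vy = { y.a, y.b };   /* pack second argument into a vector  */
    v2df vs = vx + vy;        /* one vector add                      */
    pair r;
    r.a = vs[0];              /* unpack the lanes for the by-value   */
    r.b = vs[1];              /* return                              */
    return r;
}
```

When the arguments arrive in separate registers, the two packing steps
and the unpacking step are pure overhead compared to two scalar adds.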
Thus it seems the vectorizer should be informed of this ABI detail
or, as a simple heuristic, never consider function arguments "memory"
it can perform vector loads on (which probably means disabling
group analysis on them?).
On i?86 with SSE2 we get
        movupd  8(%esp), %xmm1
        movl    4(%esp), %eax
        movupd  24(%esp), %xmm0
        addpd   %xmm1, %xmm0
        movups  %xmm0, (%eax)
vs.
        movsd   16(%esp), %xmm0
        movl    4(%esp), %eax
        movsd   8(%esp), %xmm1
        addsd   32(%esp), %xmm0
        addsd   24(%esp), %xmm1
        movsd   %xmm0, 8(%eax)
        movsd   %xmm1, (%eax)
where the vectorized form possibly even looks profitable (with
-mfpmath=sse). So a simple heuristic might pessimize things too much.
Replicating calls.c code to compute how the arguments are passed sounds
odd though...
Possibly the target could penalize these loads in its cost model
instead (at least it can apply a more reasonable "heuristic").