Bug 19680 - sub-optimial register allocation with sse
Summary: sub-optimial register allocation with sse
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization (show other bugs)
Version: 4.0.0
: P2 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: ra
Depends on:
Blocks: 19714
  Show dependency treegraph
 
Reported: 2005-01-28 23:34 UTC by tbp
Modified: 2005-11-13 16:45 UTC (History)
2 users (show)

See Also:
Host: cygwin
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2005-11-13 16:45:25


Attachments
Various address generation snippets (338 bytes, text/plain)
2005-01-28 23:35 UTC, tbp
Details
the whole deal (1.25 KB, text/plain)
2005-01-30 09:53 UTC, tbp
Details
icc asm dump (1.66 KB, text/plain)
2005-01-30 13:29 UTC, tbp
Details
Take #2 (1.43 KB, text/plain)
2005-01-30 14:03 UTC, tbp
Details
intersect_packet asm dump, icc (1.20 KB, text/plain)
2005-01-31 15:43 UTC, tbp
Details
intersect_packet asm dump, gcc (1.39 KB, text/plain)
2005-01-31 16:01 UTC, tbp
Details
mark builtins constant (2.14 KB, patch)
2005-01-31 18:55 UTC, Richard Henderson
Details | Diff
hack sse simode inputs (628 bytes, patch)
2005-01-31 18:58 UTC, Richard Henderson
Details | Diff
modify modes_tieable_p for sse (1.50 KB, patch)
2005-01-31 22:53 UTC, Richard Henderson
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description tbp 2005-01-28 23:34:17 UTC
When comparing the kind of code ICC outputs vs gcc, it's really obvious gcc
could make a better use of x86 baroque addressing modes.
More specifically i rarely ever see it using the *8 scale factor, even when
addressing nicely power-of-2 sized stuff, and that's definitely a performance
problem when dealing with those large SSE vectors.

In that testcase the *8 scale factor is only used once and even if it's
questionnable the use a of a fancier mode would help in this particular
testcase, there's no doubt it would in Real World.

Also, take note of the horrible code for massage48; in Real World it's even worse:
  4012a6:       mov    $0x30,%edx
  4012af:       imul   %edx,%eax
That's not from the testcase, that's in a loop and edx get reloaded each time.

Tested with today's cvs and something like -O3 -march=k8 -fomit-frame-pointer
-mfpmath=sse and -O3 -march=pentium4 -fomit-frame-pointer -mfpmath=sse.
Comment 1 tbp 2005-01-28 23:35:58 UTC
Created attachment 8095 [details]
Various address generation snippets
Comment 2 Steven Bosscher 2005-01-29 02:34:35 UTC

*** This bug has been marked as a duplicate of 18463 ***
Comment 3 Richard Henderson 2005-01-30 07:30:46 UTC
No, Steven, this is a different problem.  Note that there are not two
memory references, as in the other PR.
Comment 4 Richard Henderson 2005-01-30 07:45:15 UTC
That said, what are you expecting here for massage48?  On K8, the latency
of imul for a 32-bit register operand is 3 cycles.  Alternately, we can
break this down into

   leal (%eax,%eax,2), %eax
   sall $4, $eax
   # Use eax unscaled

Which is 2 cycles latency for the leal and one for the sall, so gcc rightly
chooses to use the single multiply instruction, since the alternative is no
cheaper.  On pentium4, timings are different, and we select the leal+sall
sequence.  I don't see anything in this test case at all that could be made
to use more complex addressing modes.

As for the "Real World" code that uses a variable imul instead of a constant,
that would be more interesting to examine.
Comment 5 tbp 2005-01-30 08:07:16 UTC
I'm sorry for providing such a poor testcase.
Here's the kind of *48 sequence i'm seeing, k8 codegen; that's happening at a
point where's there's quite some register pressure and it really doesn't help
that another register is needed (gcc has to push/pop whereas icc doesn't).

  402520:       mov    (%esi),%ecx
  402522:       mov    $0x30,%eax
  402527:       mov    0x3c(%esp),%ebx
  40252b:       add    $0x4,%esi
  40252e:       imul   %eax,%ecx
  402531:       add    0x94(%esp),%ecx
  402538:       mov    (%ecx),%edi

After that there's a long string of vector operations where gcc computes all
addresses upfront (with shifts) and then use a *1 scale factor; ICC only use *8
for that and is faster.
Comment 6 tbp 2005-01-30 09:53:34 UTC
Created attachment 8105 [details]
the whole deal

Ok, here's a self contained excerpt.
There's a variable imul, simple addressing and some really weird stuff
happening to the hit_t structure: it seems that gcc tries to cache it locally
but then updates both the cached and original version.

Take note that this intersect leaf function is part of a bigger function that
searches for leaves etc; i mean in the real app there's even more registers
being spilled etc. It's not pretty and called million times per seconds.
Comment 7 Richard Henderson 2005-01-30 10:59:43 UTC
Ah hah.  This is a bit of "cleverness" in the backend.  It turns out that
for K8, imul with an 8-bit immediate is vector decoded, and imul with a
register is direct decoded.  In theory, splitting out the constant allows
greater throughput through the instruction decoder.  Now, we *only* do this
split after register allocation, and only if there are in fact a free register.
So in no case should this be causing more register spills.

As an aside, if we had proper control over the exact instruction form being
used, as opposed to yielding that control to the assembler, we'd emit the imul
with a 32-bit immediate (0x69 db 30 00 00 00), since that's direct decoded too.
But emitting machine code directly from the compiler is pie-in-the-sky territory.

Other than that, I guess the rest is just generic register-allocation sucking.
Fancy attaching the assembly produced with icc for the second test case, so
that we know what we're aiming for?
Comment 8 tbp 2005-01-30 13:29:15 UTC
Created attachment 8107 [details]
icc asm dump

Here you are.
Comment 9 tbp 2005-01-30 13:37:48 UTC
But i had to rewrite the hit_t structure in a way more closer to what's found in
the original source to avoid the same useless cloning i noted earlier with gcc.
Something like:
union float4_t {
	float	f[4];
	__m128	v;

	operator const __m128()	const	{ return v; }
	operator __m128()		{ return v; }
};

union int4_t {
	int	i[4];
	__m128i	v;
	operator const __m128i()	const	{ return v; }
	operator __m128i()			{ return v; }
};

struct hit_t {
	float4_t t,u,v;
	int4_t id;
};
to avoid this:
  40125f:       andnps %xmm4,%xmm0
  401262:       orps   0x40(%esp),%xmm0
  401267:       movaps %xmm0,0x40(%esp)
  40126c:       movaps %xmm0,0x10(%eax)

I really don't get why it's happening; and i'm pretty sure it wasn't there with
some previous version of gcc. I'd seriously need a clue.

That aside, ICC's ouput is much cleaner and despite the P4 way to compute
addresses, faster on my K8.
Comment 10 tbp 2005-01-30 14:03:39 UTC
Created attachment 8108 [details]
Take #2

Hmm. I'm attaching another asm dump for ICC because it spits better code for
the function as found in the original app.
Got to love those temperamental compilers.
Comment 11 Richard Henderson 2005-01-30 18:04:00 UTC
Ok, I see what Intel is doing.  It's computing an index by 16 by doing

  addl %ecx,%ecx
  movl (%ebx, %ecx, 8), %eax

instead of

  sall $4, %ecx
  movl (%ebx, %ecx), %eax

which, considering the suckitude of the P4 shifter, is a win.  It should
not be a win for any other cpu.
Comment 12 tbp 2005-01-30 18:40:43 UTC
Yes that's not a win per se but even with those "unrolled" addr computations its
encodings end up generally tighter, ie:
gcc:
  40114d:       c1 e1 04                shl    $0x4,%ecx
  401150:       8d 41 30                lea    0x30(%ecx),%eax
...
  40115a:       0f 58 0c 07             addps  (%edi,%eax,1),%xmm1
...
  40116f:       0f 58 04 0f             addps  (%edi,%ecx,1),%xmm0

icc:
  4236e4:       03 ed                   add    %ebp,%ebp
  4236e6:       0f 28 64 ef 30          movaps 0x30(%edi,%ebp,8),%xmm4
  4236eb:       0f 28 0c ef             movaps (%edi,%ebp,8),%xmm1

Small win (and it's hard to follow as they schedule things very differently and
gcc touches the stack a lot more), but could be even better if shifting was
allowed. And in such a lenghty loop, decoding bandwith is scarce. If gcc wasn't
so greedingly trying to precompute indexes and offsets...

Could you tell me why gcc feels obliged to make a local copy of the hit_t
structure on the stack and then update both the copy and the original?
Ideally i'd like it to not try to outsmart me :) (or maybe i'm missing something
obvious).

Comment 13 tbp 2005-01-30 18:59:34 UTC
Ah! Seems that another temporary isn't eliminated, much like
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19274, this time with _mm_set1_epi32.

  40129b:       89 44 24 1c             mov    %eax,0x1c(%esp)
  40129f:       66 0f 6e 64 24 1c       movd   0x1c(%esp),%xmm4
  4012a5:       66 0f 70 cc 00          pshufd $0x0,%xmm4,%xmm1
Comment 14 Richard Henderson 2005-01-31 05:31:00 UTC
If you're still looking at K8, moving through memory is two cycles faster than
moving directly between the general register file and the sse register file.
Comment 15 tbp 2005-01-31 14:14:58 UTC
Yes, and i'm not asking for a GPR->SSE transfer. What i'm asking is why gcc
feels the urge to copy that memory reference to the stack before fooling around
with it.

The full sequence is:
  401298:       8b 42 28                mov    0x28(%edx),%eax
  40129b:       89 44 24 1c             mov    %eax,0x1c(%esp)
  40129f:       66 0f 6e 64 24 1c       movd   0x1c(%esp),%xmm4
  4012a5:       66 0f 70 cc 00          pshufd $0x0,%xmm4,%xmm1

whereas ICC does:
  40241d:       66 0f 6e 44 ca 28       movd   0x28(%edx,%ecx,8),%xmm0
  40242c:       66 0f 70 c0 00          pshufd $0x0,%xmm0,%xmm0

I've opened http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19714 for this.
Comment 16 tbp 2005-01-31 15:43:10 UTC
Created attachment 8113 [details]
intersect_packet asm dump, icc
Comment 17 tbp 2005-01-31 16:01:06 UTC
Created attachment 8114 [details]
intersect_packet asm dump, gcc

I've double checked what i was doing in the app and that testcase.
In fact in the app i force gcc to not inline the intersect_packet function
because then the generated code makes no sense at all: silly addr generation,
local copy of one structure... you name it.

The function takes 353 bytes for ICC and 387 bytes for gcc. If we factor out
the 
_mm_set1_epi32 problem that's 387-12 = 375 bytes, or +22.
Looking at SSE register movements (movaps), 10 for ICC and 18 for gcc, there's
+8 for gcc (at 3 bytes a piece) or 24 bytes.

So beside scheduling issues (i'd really like gcc to respect my load ordering,
like ICC does, with a 48byte size there might be 2 cache lines addressed) the
main difference seems to be related to register allocation.

I wonder why it's so much worse when inlined, as that require in practice to
workaround by adding a call on top of it.
Comment 18 Richard Henderson 2005-01-31 18:55:13 UTC
Created attachment 8115 [details]
mark builtins constant

This patch marks builtins const or pure as appropriate, and all of them
nothrow.
This could result in slightly better code out of the tree optimizers.  It could

also result in higher register pressure as well.
Comment 19 Richard Henderson 2005-01-31 18:58:37 UTC
Created attachment 8116 [details]
hack sse simode inputs

This patch lies a bit to the rtl optimizers saying that some of the sse insns
take memory inputs when they don't.  The goal here is to convince the rtl
optimizers to feed memory inputs to these instructions directly, which removes
a register temporary, and thus the ability of the register allocator to screw
up and assign that temporary to the wrong register class.
Comment 20 Richard Henderson 2005-01-31 19:04:02 UTC
I think you'll also want to try using -fno-gcse.  The gcse pass is hoisting 
values out of your loop (as it is supposed to), except that we don't have 
enough registers to hold it all, so the values get spilled back to the stack.
This is a known problem.  If we had anything approaching a decent register
allocator, we'd be able to move these values (known to be loop invariant) back
inside the loop instead of spilling them.

Of course, we've got multiple passes that hoist values out of loops, and using
-fno-gcse only prevents half of the spilled values from being moved.
Comment 21 tbp 2005-01-31 20:18:53 UTC
-fno-gcse is a godsend, instant speedup and most of the sillyness when inlining
is gone.

Now i've applied both your patches, and while there's promising they also
triggers their own nastyness; gcc is so fond of memory inputs that it dumps
stuff on the stack only to address them some instructions latter (well, that's
my interpretation :).

For example,
  4010c3:       0f 28 6c 13 30          movaps 0x30(%ebx,%edx,1),%xmm5
  4010c8:       0f 28 f9                movaps %xmm1,%xmm7
  4010cb:       0f 28 cb                movaps %xmm3,%xmm1
  4010ce:       0f 29 6c 24 10          movaps %xmm5,0x10(%esp)
  4010d3:       0f 59 ce                mulps  %xmm6,%xmm1
  4010d6:       0f 59 c4                mulps  %xmm4,%xmm0
  4010d9:       0f 28 6c 16 30          movaps 0x30(%esi,%edx,1),%xmm5
  4010de:       0f 59 5c 24 10          mulps  0x10(%esp),%xmm3

or
  40119d:       0f c2 c1 01             cmpltps %xmm1,%xmm0
  4011a1:       0f 29 04 24             movaps %xmm0,(%esp)
  4011a5:       0f 28 c5                movaps %xmm5,%xmm0
  4011a8:       0f c2 c1 01             cmpltps %xmm1,%xmm0
  4011ac:       0f 28 c8                movaps %xmm0,%xmm1
  4011af:       0f 56 0c 24             orps   (%esp),%xmm1

Other than those quirks, it looks better to me.

Just to be sure i've tried that patched version on my app, and it's slower than
the unpatched version (both with -fno-gcse).
Comment 22 tbp 2005-01-31 20:35:16 UTC
Hmm, there's something fishy with _mm_set1_epi32.

With your patches there's no stack copy anymore but, with
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=19714 testcase, i get:

00401080 <eliminated(int)>:
  401080:       66 0f 6e 4c 24 04       movd   0x4(%esp),%xmm1
  401086:       66 0f 70 c1 00          pshufd $0x0,%xmm1,%xmm0
  40108b:       c3                      ret    

00401090 <not_eliminated(int const*)>:
  401090:       8b 44 24 04             mov    0x4(%esp),%eax
  401094:       66 0f 6e 08             movd   (%eax),%xmm1
  401098:       66 0f 70 c1 00          pshufd $0x0,%xmm1,%xmm0
  40109d:       c3                      ret 

00401050 <not_eliminated_bis(int const&)>:
  401050:       8b 44 24 04             mov    0x4(%esp),%eax
  401054:       66 0f 6e 08             movd   (%eax),%xmm1
  401058:       66 0f 70 c1 00          pshufd $0x0,%xmm1,%xmm0
  40105d:       c3                      ret    

... and that's quite bogus :)
Comment 23 Richard Henderson 2005-01-31 21:02:37 UTC
(In reply to comment #22)

No, it isn't.  Look at your functions again.  The assembly that you
pasted is 100% perfect.  You cannot improve on that in any way.
Comment 24 Richard Henderson 2005-01-31 21:12:49 UTC
(In reply to comment #21)
>   4010ce:       0f 29 6c 24 10          movaps %xmm5,0x10(%esp)
>   4010de:       0f 59 5c 24 10          mulps  0x10(%esp),%xmm3
>   4011a1:       0f 29 04 24             movaps %xmm0,(%esp)
>   4011af:       0f 56 0c 24             orps   (%esp),%xmm1

These instruction patterns didn't change.  I only fibbed wrt "int" inputs.

It's not impossible that some of this is due to adding the const to the
builtin signatures, but...

> Just to be sure i've tried that patched version on my app, and it's slower
> than the unpatched version (both with -fno-gcse).

So, overall the combination of the two patches is a lose?  How about each
patch individually?
Comment 25 tbp 2005-01-31 22:21:57 UTC
Oops, my bad. Thought pshufd mixed both operands à la shufps; i'm obviously not
familiar with the integer side of SSE.

And yes the combination is a lose, albeit a small one around 3%. But i'm timing
the whole thing not just that intersection code.

To give you an idea here's the peak perf for a given tree.
intersection inlined:
ICC: 16.4 fps
gcc: 14.7 unpatched, 14.4 patched
gcc & -fno-gcse: 14.8 unpatched, 14.5 patched

intersection not inlined:
gcc: 14.9 unpatched, 13.7 patched
gcc & -fno-gcse: 14.8 unpatched, 14.0 patched

The problem is that the surrounding code (kdtree search) also change a lot and
gcc doesn't find an optimal way for that either.
What's reassuring is that ICC isn't much smarter than gcc on the tree search.
Each compiler comes up with nice trick here and there but none got the whole
picture right. Still ICC doesn't suck as much on average.
I'm going to replace all that tree search with asm to settle the issue someday.
BTW the difference between ICC and GCC is around ~5% on the scalar path.

I'm going to try each patch in turn as requested.
Comment 26 Richard Henderson 2005-01-31 22:53:17 UTC
Created attachment 8119 [details]
modify modes_tieable_p for sse

One more thing to try, to induce the register allocator to not be quite so
stupid.
At least within a single basic block anyway.  But these short-lived registers
seem to be the bulk of the complete moronitude to begin with.

In particular I'd like to know if this patch makes d-19680-2 unnecessary.
Comment 27 tbp 2005-01-31 22:58:16 UTC
In previous test i've used a crufted string of compilation options; i've removed
all that crap for -O3 -march=k8 -mfpmath=sse -fno-gcse -fno-exceptions.

The second patch, hack sse simode inputs, is a small win or at least a draw,
inlined or not, -fno-gcse or not.

The first one, mark builtins constant, is a lose in every conditions; it's the
one that triggers those stack references/movements noted earlier (and my guess
is what's causing the perf regression, other than that the code seems prettier).
Comment 28 tbp 2005-01-31 23:28:49 UTC
Wow! We got a winner. 15.8 fps with -fno-gcse, inlining and only d-19680-3.

  402680:       66 0f 6f d1             movdqa %xmm1,%xmm2
..
  402688:       66 0f db 50 30          pand   0x30(%eax),%xmm2
  40268d:       66 0f 6e 41 28          movd   0x28(%ecx),%xmm0
  402692:       66 0f 70 c0 00          pshufd $0x0,%xmm0,%xmm0
  402697:       66 0f df c8             pandn  %xmm0,%xmm1
  40269b:       66 0f eb ca             por    %xmm2,%xmm1
  40269f:       0f 29 48 30             movaps %xmm1,0x30(%eax)
That's the final integer update. Perfect.

Want me to try that champ in conjunction with d-19680-1?
Comment 29 tbp 2005-01-31 23:42:50 UTC
d-19680-1 + d-19680-3 isn't as good, 14.9fps, as some silly stack movements are
induced; ie:

  40265f:       0f 29 04 24             movaps %xmm0,(%esp)
  402663:       0f 57 c0                xorps  %xmm0,%xmm0
  402666:       0f c2 d0 01             cmpltps %xmm0,%xmm2
  40266a:       0f 56 14 24             orps   (%esp),%xmm2
Comment 30 CVS Commits 2005-02-02 00:31:02 UTC
Subject: Bug 19680

CVSROOT:	/cvs/gcc
Module name:	gcc
Changes by:	rth@gcc.gnu.org	2005-02-02 00:30:38

Modified files:
	gcc            : ChangeLog 
	gcc/config/i386: i386-protos.h i386.c i386.h 

Log message:
	PR target/19680
	* config/i386/i386.h (MODES_TIEABLE_P): Use ix86_modes_tieable_p.
	* config/i386/i386.c (ix86_hard_regno_mode_ok): Change return
	type to bool.
	(ix86_tieable_integer_mode_p, ix86_modes_tieable_p): New.
	* config/i386/i386-protos.h: Update.

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.7369&r2=2.7370
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/config/i386/i386-protos.h.diff?cvsroot=gcc&r1=1.130&r2=1.131
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/config/i386/i386.c.diff?cvsroot=gcc&r1=1.794&r2=1.795
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/config/i386/i386.h.diff?cvsroot=gcc&r1=1.420&r2=1.421

Comment 31 tbp 2005-05-09 08:46:25 UTC
I'm going to ping this bugreport because there's still some very bad interaction
with gcse in current gcc.
Just compile the 'packet_intersection.cpp' testcase with ie g++-4.1-4120050501
for ia32 to convince yourself (x86-64 is apparently immune, thanks to the lower
register pressure).

It takes a bit of luck, or the use of -fno-gcse, to dodge that issue which
really kills performance.