Bug 21195 - SSE intrinsics not inlined, sometimes.
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.1.0
Importance: P2 normal
Target Milestone: 4.1.0
Assignee: Not yet assigned to anyone
URL: http://gcc.gnu.org/ml/gcc-cvs/2005-06...
Keywords: missed-optimization, patch, ssemmx
Depends on:
Blocks:
 
Reported: 2005-04-24 18:02 UTC by tbp
Modified: 2005-07-21 08:28 UTC

See Also:
Host: x86*
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2005-06-14 08:55:35


Description tbp 2005-04-24 18:02:54 UTC
Under some conditions (generally when you upset the inlining heuristics, i.e. by
force-inlining something), SSE intrinsics don't get inlined and some truly horrible
code ensues; the fix, tinkering with params, isn't much prettier.
Happened to me with various 4.x versions, on x86 and x86-64.

silly testcase:
#include <xmmintrin.h>

static __attribute__ ((always_inline)) bool bloatit(const __m128 a, const __m128 b) {
	const __m128
		v0 = _mm_max_ps(a,b),
		v1 = _mm_min_ps(a,b),
		v2 = _mm_mul_ps(a,b),
		v3 = _mm_div_ps(a,b),
		g0 = _mm_or_ps(_mm_or_ps(_mm_or_ps(v0,v1), v2), v3);

	return _mm_movemask_ps(g0);
}

bool finalblow(const __m128 a, const __m128 b, const __m128 c, const __m128 d,
               const __m128 e, const __m128 f) {
	return bloatit(a,b) & bloatit(c,d) & bloatit(e,f) & bloatit(a,c) & bloatit(b,d)
	     & bloatit(c,e) & bloatit(d,f);
}

int main() { return 0; }


At -O3, on x86-64-linux, g++-4120050417 gets funky with:
0000000000400540 <_mm_mul_ps(float __vector, float __vector)>:
  400540:       mulps  %xmm1,%xmm0
  400543:       retq
...
0000000000400550 <_mm_div_ps(float __vector, float __vector)>:
  400550:       divps  %xmm1,%xmm0
  400553:       retq
...
0000000000400560 <_mm_min_ps(float __vector, float __vector)>:
  400560:       minps  %xmm1,%xmm0
  400563:       retq
...
0000000000400570 <_mm_max_ps(float __vector, float __vector)>:
  400570:       maxps  %xmm1,%xmm0
  400573:       retq
...
0000000000400580 <_mm_or_ps(float __vector, float __vector)>:
  400580:       orps   %xmm1,%xmm0
  400583:       retq
...
0000000000400590 <_mm_movemask_ps(float __vector)>:
  400590:       movmskps %xmm0,%eax
  400593:       retq

... only to conclude with this wonder
00000000004005b0 <finalblow(float __vector, float __vector, float __vector, float __vector, float __vector, float __vector)>:
  4005b0:       push   %rbx
  4005b1:       xor    %ebx,%ebx
  4005b3:       sub    $0x1b0,%rsp
  4005ba:       movaps %xmm2,0x180(%rsp)
  4005c2:       movaps %xmm3,0x170(%rsp)
  4005ca:       movaps %xmm4,0x160(%rsp)
  4005d2:       movaps %xmm5,0x150(%rsp)
  4005da:       movaps %xmm1,0x190(%rsp)
  4005e2:       movaps %xmm0,0x1a0(%rsp)
  4005ea:       callq  400550 <_mm_div_ps(float __vector, float __vector)>
  4005ef:       movaps %xmm0,0x140(%rsp)
  4005f7:       movaps 0x190(%rsp),%xmm1
  4005ff:       movaps 0x1a0(%rsp),%xmm0
  400607:       callq  400540 <_mm_mul_ps(float __vector, float __vector)>
  40060c:       movaps 0x190(%rsp),%xmm1
  400614:       movaps %xmm0,0x130(%rsp)
  40061c:       movaps 0x1a0(%rsp),%xmm0
  400624:       callq  400560 <_mm_min_ps(float __vector, float __vector)>
  400629:       movaps 0x190(%rsp),%xmm1
  400631:       movaps %xmm0,0x120(%rsp)
  400639:       movaps 0x1a0(%rsp),%xmm0
  400641:       callq  400570 <_mm_max_ps(float __vector, float __vector)>
  400646:       movaps 0x120(%rsp),%xmm1
  40064e:       callq  400580 <_mm_or_ps(float __vector, float __vector)>
  400653:       movaps 0x130(%rsp),%xmm1
  40065b:       callq  400580 <_mm_or_ps(float __vector, float __vector)>
  400660:       movaps 0x140(%rsp),%xmm1
  400668:       callq  400580 <_mm_or_ps(float __vector, float __vector)>
  40066d:       callq  400590 <_mm_movemask_ps(float __vector)>
  400672:       movaps 0x170(%rsp),%xmm1
etc...


As said earlier, that's just one way to make it happen.
It would be a real plus if those intrinsics could be unconditionally inlined.
Comment 1 tbp 2005-04-26 12:45:40 UTC
Let's have some more fun.

Take the silly testcase up there, add this:
struct foo_t {
  bool dummy;
  __attribute__ ((always_inline)) foo_t() {}
};

change finalblow into that:
bool finalblow(const __m128 a, const __m128 b, const __m128 c, const __m128 d,
               const __m128 e, const __m128 f) {
  foo_t bar[4];

  return bar[0].dummy &
         bloatit(a,b) & bloatit(c,d) & bloatit(e,f) & bloatit(a,c) &
         bloatit(b,d) & bloatit(c,e) & bloatit(d,f);
}

and with the same compiler & flags you'll get this interesting snippet, from
finalblow:
...
  4005ea:       data16  <-- sure that loop deserves to be aligned
  4005eb:       data16
  4005ec:       nop
  4005ed:       data16
  4005ee:       data16
  4005ef:       nop
  4005f0:       inc    %eax
  4005f2:       cmp    $0x4,%eax
  4005f5:       jne    4005f0 <finalblow(float __vector, float __vector, float __vector, float __vector, float __vector, float __vector)+0x40>
...

In case you're wondering, yes that's the constructor.

Again, that testcase is a bit artificial.
But I've just spent an hour tracking down what was producing such an interesting
aligned empty loop in my app: same symptoms, but triggered differently; the
constructor was empty and not always_inline, but apparently some threshold was
met (lots of inlining around) and tada... instant contribution to global
warming for peanuts :)

I'm certainly not qualified, but I'll dare to say that something's fishy wrt
inlining.
Comment 2 Andrew Pinski 2005-04-26 13:25:18 UTC
(In reply to comment #1)
> and with the same compiler & flags you'll get this interesting snippet, from
> finalblow:
> In case you're wondering, yes that's the constructor.

That is PR 19639.
Comment 3 tbp 2005-04-26 14:29:15 UTC
Subject: Re:  SSE intrinsics not inlined, sometimes.

On 26 Apr 2005 13:25:20 -0000, pinskia at gcc dot gnu dot org
<gcc-bugzilla@gcc.gnu.org> wrote:
> That is PR 19639.
Oh! A patch.

Sorry for the additional noise, but I'm getting a bit overreactive
about that inlining business.
Comment 4 tbp 2005-05-05 23:58:32 UTC
For future reference, I'm including my end-user offline answer to Uros regarding
always_inline usage.

Here we go:
> I was trying to take a quick look at your bug report regarding the
> always_inline attribute. Just a quick remark - using only a plain static
> inline bool .... fixes the problem for me and at -O3 code looks like it
Doesn't surprise me.

> should. Is there a specific reason to have an attribute always_inline
> declared for the function you would like to inline? (Please note, that
> Jan Hubicka is currently working in this area.)
Yes, because inline alone is, in practice, next to useless. You say
below that reg<->mem movements are expensive, but my prime concern in
a hot path is branches.
And if you expect code to be inlined (or more precisely, you expect no
function call) then you have no alternative but to use always_inline.
Though once you start using always_inline you upset the compiler and you
step into a world of pain where you have to babysit it for dependent
code with a combo of always_inline/noinline.

In fact, the always_inline/noinline combo is the only kludge for a number
of other problems:
. when gcc goes nuts, they are useful containment measures (so the
silliness doesn't propagate)
. as said earlier, inline being an (ignored) hint, if you have, say, a
member function doing just one op (like those intrinsics in the
testcase), it makes absolutely no sense not to inline it. Ever. Yet
sometimes it happens.
. gcc doesn't like long sequences of branchless vectorized code, which
are quite common, and a static always_inline function is a way to tell
it to look somewhere else.
. those same static always_inline functions are also a way to tell it
to look closer at some code portion and to try to map its working set
into registers; it also has to do with the lack of an unroll pragma
and generally the lack of any directive to tell the compiler to pay
special attention to specific code.

So in the hot path my code typically ends up being a bunch of
always_inline functions coalesced into a noinline one.
For the non-speed-critical path, I leave it up to the compiler. In that
regard, gcc4.x (and specifically gcc4.1) got a lot wiser, perhaps as
good as icc, but obviously not foolproof :)
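
A minimal sketch of the "always_inline helpers coalesced into a noinline function" pattern described in this comment. The names clamp_ps, scale_ps and hot_path are hypothetical and do not come from the reporter's code; the sketch only assumes a GCC-compatible compiler targeting x86 with SSE enabled.

```cpp
#include <xmmintrin.h>

// Hypothetical one-op helpers: always_inline asks the compiler to
// expand them at every call site instead of emitting a call.
static inline __attribute__((always_inline))
__m128 clamp_ps(const __m128 v, const __m128 lo, const __m128 hi) {
    return _mm_min_ps(_mm_max_ps(v, lo), hi);
}

static inline __attribute__((always_inline))
__m128 scale_ps(const __m128 v, const __m128 s) {
    return _mm_mul_ps(v, s);
}

// Hypothetical hot-path entry point: noinline keeps the expanded body
// in one place so it does not spread into every caller.
__attribute__((noinline))
int hot_path(const __m128 v, const __m128 s) {
    const __m128 lo = _mm_set1_ps(0.0f);
    const __m128 hi = _mm_set1_ps(1.0f);
    const __m128 r  = clamp_ps(scale_ps(v, s), lo, hi);
    // Bit i of the result is set when lane i ended up equal to the upper bound.
    return _mm_movemask_ps(_mm_cmpeq_ps(r, hi));
}
```

Whether the helpers really disappear into hot_path can be checked by disassembling the object file and looking for call instructions inside it, as done in the description.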

Comment 5 Stuart Hastings 2005-06-29 16:49:56 UTC
I marked all the x86 vector intrinsics with always_inline, and this seems to fix both the testcases here.

http://gcc.gnu.org/ml/gcc-cvs/2005-06/msg01059.html
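
For illustration only, and not the actual header change: a sketch of what marking a one-instruction wrapper always_inline looks like. sketch_or_ps, sketch_movemask_ps and sign_bits_of_either are hypothetical names that stand in for the real intrinsic definitions; the attribute placement is an assumption modelled on common GCC usage.

```cpp
#include <xmmintrin.h>

// Hypothetical stand-ins for intrinsic wrappers. With always_inline the
// compiler is asked to expand them at the call site regardless of how
// much inlining budget the caller has already used up.
static inline __m128 __attribute__((__always_inline__))
sketch_or_ps(__m128 a, __m128 b) {
    return _mm_or_ps(a, b);
}

static inline int __attribute__((__always_inline__))
sketch_movemask_ps(__m128 a) {
    return _mm_movemask_ps(a);
}

// Hypothetical caller: returns a 4-bit mask with bit i set when the
// sign bit of lane i is set in either input.
int sign_bits_of_either(__m128 a, __m128 b) {
    return sketch_movemask_ps(sketch_or_ps(a, b));
}
```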
Comment 6 Uroš Bizjak 2005-07-21 08:28:41 UTC
(In reply to comment #5)
> I marked all the x86 vector intrinsics with always_inline, and this seems to
> fix both the testcases here.

  Confirmed.