Bug 42779 - extremely poor code quality, aliasing issues with __m128i
extremely poor code quality, aliasing issues with __m128i
Status: RESOLVED INVALID
Product: gcc
Classification: Unclassified
Component: target
4.5.0
: P3 normal
: ---
Assigned To: Not yet assigned to anyone
: alias, missed-optimization
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2010-01-17 20:13 UTC by Piotr Wyderski
Modified: 2010-03-13 03:21 UTC (History)
5 users (show)

See Also:
Host:
Target: i?86-*-* x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2010-01-17 21:05:55


Attachments
Source code (37.93 KB, text/plain)
2010-01-17 20:15 UTC, Piotr Wyderski
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Piotr Wyderski 2010-01-17 20:13:40 UTC
The attached code compiled with

g++ -std=gnu++0x -O2 -m32 -march=native -msse -msse2 -msse3 -Wall
-Werror -Wno-unused -Wno-strict-aliasing -march=native
-fomit-frame-pointer -Wno-pmf-conversions -g main.cpp

emits code which is, to put it mildly,
far from optimal. For instance, the code of

bit_vector::op_not_or:

    combine([](__m128i x, __m128i y) { return _mm_xor_si128(_mm_or_si128(x, y), g_Mask[128]); }, true, v1, v2);

emits:

00401800 <__ZN10bit_vector9op_not_orERKS_S1_>:
  401800:	55                   	push   %ebp
  401801:	57                   	push   %edi
  401802:	56                   	push   %esi
  401803:	53                   	push   %ebx
  401804:	83 ec 1c             	sub    $0x1c,%esp
  401807:	8b 74 24 30          	mov    0x30(%esp),%esi
  40180b:	8b 7c 24 34          	mov    0x34(%esp),%edi
  40180f:	8b 6c 24 38          	mov    0x38(%esp),%ebp
  401813:	8b 5f 04             	mov    0x4(%edi),%ebx
  401816:	39 ee                	cmp    %ebp,%esi
  401818:	74 46                	je     401860 <__ZN10bit_vector9op_not_orERKS_S1_+0x60>
  40181a:	83 c3 7f             	add    $0x7f,%ebx
  40181d:	c1 eb 07             	shr    $0x7,%ebx
  401820:	85 db                	test   %ebx,%ebx
  401822:	74 30                	je     401854 <__ZN10bit_vector9op_not_orERKS_S1_+0x54>
  401824:	31 c0                	xor    %eax,%eax
  401826:	66 0f 76 c9          	pcmpeqd %xmm1,%xmm1
  40182a:	8d b6 00 00 00 00    	lea    0x0(%esi),%esi
  401830:	8b 0f                	mov    (%edi),%ecx
  401832:	89 c2                	mov    %eax,%edx
  401834:	40                   	inc    %eax
  401835:	c1 e2 04             	shl    $0x4,%edx
  401838:	39 c3                	cmp    %eax,%ebx
  40183a:	66 0f 6f 04 11       	movdqa (%ecx,%edx,1),%xmm0
  40183f:	8b 4d 00             	mov    0x0(%ebp),%ecx
  401842:	66 0f eb 04 11       	por    (%ecx,%edx,1),%xmm0
  401847:	8b 0e                	mov    (%esi),%ecx
  401849:	66 0f ef c1          	pxor   %xmm1,%xmm0
  40184d:	66 0f 7f 04 11       	movdqa %xmm0,(%ecx,%edx,1)
  401852:	75 dc                	jne    401830 <__ZN10bit_vector9op_not_orERKS_S1_+0x30>
  401854:	83 c4 1c             	add    $0x1c,%esp
  401857:	5b                   	pop    %ebx
  401858:	5e                   	pop    %esi
  401859:	5f                   	pop    %edi
  40185a:	5d                   	pop    %ebp
  40185b:	c3                   	ret
Comment 1 Piotr Wyderski 2010-01-17 20:15:00 UTC
Created attachment 19639 [details]
Source code
Comment 2 Paolo Carlini 2010-01-17 20:21:59 UTC
I would suggest providing a snippet *much* smaller and explain more clearly the issue, for example, providing two small functions (say, below 10, 20 lines), one using more high-level C++0x features and the other using old style solutions.

That is, if you hope to have somebody actually working on a fix ;)
Comment 3 Piotr Wyderski 2010-01-17 20:46:09 UTC
This is a generic code, as it covers two bug reports.
In fact, it will probably be used as a base for additional
two missing optimization reports. So I thought it would be
good to provide the code of the entire sandbox.

To be more specific: the vectors passed to
combine() are constant. The compiler should
not re-evaluate the base addresses of the
m_Data arrays every iteration, as above:

    mov    (%edi),%ecx
    ...
    mov    0x0(%ebp),%ecx
    ...
    mov    (%esi),%ecx

A single base address fetch phase and
index-based addressing with scaled
induction variable (by a factor of 16)
will be more optimal, e.g.:
    
   // esi = src1
   // edi = src2
   // ebx = dst
   // edx = induction variable

L0:cmpl   %edx, max_index
   je     L1:
   movdqa (%esi,%edx,1),%xmm0
   por    (%edi,%edx,1),%xmm0
   pxor   %xmm1, %xmm0
   movdqa %xmm0, (%ebx, %edx, 1)
   add    $16, %edx
   jmp    L0

L1:

as I would have written it by hand in assembler.
An aggresively unrolled version (say, four-way)
with prefetching for longer blocks will also be welcome.
Comment 4 Paolo Carlini 2010-01-17 20:49:59 UTC
Still, please provide a short self-contained snippet, no includes, or in any case, only the bare minimum. Thanks. Again, no cross-references.
Comment 5 Richard Biener 2010-01-17 21:05:55 UTC
Well, indeed - it would be far more useful to have a less convoluted testcase
where unrelated functions and source make analysis hard.

Please provide a testcase with a _single_ computation kernel applying it in
a single way (I'm trying to follow op_and ...).

From an inlining perspective it doesn't look so bad - early inlining turns 
the innermost loop into

<bb 7>:
  D.15257_17 = thisD.13894_4(D)->m_DataD.13729;
  D.15256_19 = iD.15246_18 * 16;
  D.15255_20 = D.15257_17 + D.15256_19;
  D.15254_21 = v2D.13892_5(D)->m_DataD.13729;
  D.15256_22 = iD.15246_18 * 16;
  D.15253_23 = D.15254_21 + D.15256_22;
  D.15252_24 = *D.15253_23;
  D.15251_25 = *D.15236_2;
  D.15256_26 = iD.15246_18 * 16;
  D.15250_27 = D.15251_25 + D.15256_26;
  D.15249_28 = *D.15250_27;
  D.15247_29 = __builtin_ia32_pand128D.1150 (D.15249_28, D.15252_24);
  *D.15255_20 = D.15247_29;
  iD.15246_30 = iD.15246_18 + 1;

<bb 8>:
  # iD.15246_18 = PHI <0(6), iD.15246_30(7)>
  if (max_idxD.15245_16 != iD.15246_18)
    goto <bb 7>;

already (_ZN10bit_vector6op_andERKS_S1_).

Now the main issue why the redundant loads are not hoisted is that all
data pointers are ref-all:

  chunk_typeD.13721 * {ref-all} D.15255;

thus you tell the compiler that the store

  *D.15255_20 = D.15247_29;

might possibly clobber all loads in that loop.  Of course chunk_type is
just __m128i and I always complained that this is ref-all which makes
optimization of pointers to __m128i practically useless.

This is really a target header problem.
Comment 6 Richard Biener 2010-01-17 21:08:48 UTC
If you fix that issue (simply remove all __may_alias__ attributes from
preprocessed source) you get

.L62:
        addl    $1, %edx
        movdqa  (%ebx,%eax), %xmm0
        pand    (%esi,%eax), %xmm0
        movdqa  %xmm0, (%edi,%eax)
        addl    $16, %eax
        cmpl    %edx, %ecx
        jne     .L62

for the innermost loop.  Maybe not 100% perfect but reasonable (that's -O2).
Comment 7 Richard Biener 2010-01-17 21:19:04 UTC
Btw, if you know more about data types than nothing then do not use the
__m128* types for data but build your vector types yourself with an
appropriate base type like

typedef T m128T __attribute__((__vector_size__(16)));

It seems that the Intel intrinsics API at no point follows type-based
aliasing rules (they even have vectors of characters, so restricting
types based on possible data types is not going to work out).
Comment 8 Piotr Wyderski 2010-01-17 21:29:07 UTC
It would be great to use my own vector data
types, as in simple cases it allows GCC to
automaticly vectorize them in a portable way,
but in more complex cases it would mean a lot
of explicit type casts => even more unreadable
code, which even now is hard to follow.
Comment 9 rguenther@suse.de 2010-01-17 21:35:02 UTC
Subject: Re:  [C++0x] Variadic templates + lambdas =
 extremely poor code quality, __m128i and aliasing sucks

On Sun, 17 Jan 2010, piotr dot wyderski at gmail dot com wrote:

> ------- Comment #8 from piotr dot wyderski at gmail dot com  2010-01-17 21:29 -------
> It would be great to use my own vector data
> types, as in simple cases it allows GCC to
> automaticly vectorize them in a portable way,
> but in more complex cases it would mean a lot
> of explicit type casts => even more unreadable
> code, which even now is hard to follow.

I don't think you have a choice.
Comment 10 H.J. Lu 2010-01-17 22:11:33 UTC
(In reply to comment #7)
> Btw, if you know more about data types than nothing then do not use the
> __m128* types for data but build your vector types yourself with an
> appropriate base type like
> 
> typedef T m128T __attribute__((__vector_size__(16)));

It isn't a bad idea.

> It seems that the Intel intrinsics API at no point follows type-based
> aliasing rules (they even have vectors of characters, so restricting
> types based on possible data types is not going to work out).
> 

That is true.
Comment 11 Paolo Carlini 2010-01-18 10:46:31 UTC
If I understand correctly (thanks Richard for the quick and detailed analysis) there is nothing C++0x specific in this target/optimization issue. Thus, I'm removing the marker from the Summary and cleaning it. If I'm wrong please tweak it back, adjust it, thanks.
Comment 12 Andrew Pinski 2010-03-13 03:19:56 UTC
I think it is by design that __m128i is marked as may_alias.    In fact it is, see PR 31245.  There is no other way to get around this really.
Comment 13 Andrew Pinski 2010-03-13 03:21:56 UTC
Invalid as mentioned as __m128i has to alias int, short, and char as shown by PR 31245.