42779 – extremely poor code quality, aliasing issues with __m128i

Bug 42779 - extremely poor code quality, aliasing issues with __m128i

Summary: extremely poor code quality, aliasing issues with __m128i

Status:	RESOLVED INVALID

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	4.5.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	alias, missed-optimization

Depends on:
Blocks:

Reported:	2010-01-17 20:13 UTC by Piotr Wyderski
Modified:	2010-03-13 03:21 UTC (History)
CC List:	5 users (show)

See Also:
Host:
Target:	i?86-- x86_64--
Build:
Known to work:
Known to fail:
Last reconfirmed:	2010-01-17 21:05:55

Attachments
Source code (5.18 KB, text/plain) 2010-01-17 20:15 UTC, Piotr Wyderski	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Piotr Wyderski 2010-01-17 20:13:40 UTC

The attached code compiled with

g++ -std=gnu++0x -O2 -m32 -march=native -msse -msse2 -msse3 -Wall
-Werror -Wno-unused -Wno-strict-aliasing -march=native
-fomit-frame-pointer -Wno-pmf-conversions -g main.cpp

emits code which is, to put it mildly,
far from optimal. For instance, the code of

bit_vector::op_not_or:

    combine([](__m128i x, __m128i y) { return _mm_xor_si128(_mm_or_si128(x, y), g_Mask[128]); }, true, v1, v2);

emits:

00401800 <__ZN10bit_vector9op_not_orERKS_S1_>:
  401800:	55                   	push   %ebp
  401801:	57                   	push   %edi
  401802:	56                   	push   %esi
  401803:	53                   	push   %ebx
  401804:	83 ec 1c             	sub    $0x1c,%esp
  401807:	8b 74 24 30          	mov    0x30(%esp),%esi
  40180b:	8b 7c 24 34          	mov    0x34(%esp),%edi
  40180f:	8b 6c 24 38          	mov    0x38(%esp),%ebp
  401813:	8b 5f 04             	mov    0x4(%edi),%ebx
  401816:	39 ee                	cmp    %ebp,%esi
  401818:	74 46                	je     401860 <__ZN10bit_vector9op_not_orERKS_S1_+0x60>
  40181a:	83 c3 7f             	add    $0x7f,%ebx
  40181d:	c1 eb 07             	shr    $0x7,%ebx
  401820:	85 db                	test   %ebx,%ebx
  401822:	74 30                	je     401854 <__ZN10bit_vector9op_not_orERKS_S1_+0x54>
  401824:	31 c0                	xor    %eax,%eax
  401826:	66 0f 76 c9          	pcmpeqd %xmm1,%xmm1
  40182a:	8d b6 00 00 00 00    	lea    0x0(%esi),%esi
  401830:	8b 0f                	mov    (%edi),%ecx
  401832:	89 c2                	mov    %eax,%edx
  401834:	40                   	inc    %eax
  401835:	c1 e2 04             	shl    $0x4,%edx
  401838:	39 c3                	cmp    %eax,%ebx
  40183a:	66 0f 6f 04 11       	movdqa (%ecx,%edx,1),%xmm0
  40183f:	8b 4d 00             	mov    0x0(%ebp),%ecx
  401842:	66 0f eb 04 11       	por    (%ecx,%edx,1),%xmm0
  401847:	8b 0e                	mov    (%esi),%ecx
  401849:	66 0f ef c1          	pxor   %xmm1,%xmm0
  40184d:	66 0f 7f 04 11       	movdqa %xmm0,(%ecx,%edx,1)
  401852:	75 dc                	jne    401830 <__ZN10bit_vector9op_not_orERKS_S1_+0x30>
  401854:	83 c4 1c             	add    $0x1c,%esp
  401857:	5b                   	pop    %ebx
  401858:	5e                   	pop    %esi
  401859:	5f                   	pop    %edi
  40185a:	5d                   	pop    %ebp
  40185b:	c3                   	ret

Comment 1 Piotr Wyderski 2010-01-17 20:15:00 UTC

Created attachment 19639 [details]
Source code

Comment 2 Paolo Carlini 2010-01-17 20:21:59 UTC

I would suggest providing a snippet *much* smaller and explain more clearly the issue, for example, providing two small functions (say, below 10, 20 lines), one using more high-level C++0x features and the other using old style solutions.

That is, if you hope to have somebody actually working on a fix ;)

Comment 3 Piotr Wyderski 2010-01-17 20:46:09 UTC

This is a generic code, as it covers two bug reports.
In fact, it will probably be used as a base for additional
two missing optimization reports. So I thought it would be
good to provide the code of the entire sandbox.

To be more specific: the vectors passed to
combine() are constant. The compiler should
not re-evaluate the base addresses of the
m_Data arrays every iteration, as above:

    mov    (%edi),%ecx
    ...
    mov    0x0(%ebp),%ecx
    ...
    mov    (%esi),%ecx

A single base address fetch phase and
index-based addressing with scaled
induction variable (by a factor of 16)
will be more optimal, e.g.:
    
   // esi = src1
   // edi = src2
   // ebx = dst
   // edx = induction variable

L0:cmpl   %edx, max_index
   je     L1:
   movdqa (%esi,%edx,1),%xmm0
   por    (%edi,%edx,1),%xmm0
   pxor   %xmm1, %xmm0
   movdqa %xmm0, (%ebx, %edx, 1)
   add    $16, %edx
   jmp    L0

L1:

as I would have written it by hand in assembler.
An aggresively unrolled version (say, four-way)
with prefetching for longer blocks will also be welcome.

Comment 4 Paolo Carlini 2010-01-17 20:49:59 UTC

Still, please provide a short self-contained snippet, no includes, or in any case, only the bare minimum. Thanks. Again, no cross-references.

Comment 5 Richard Biener 2010-01-17 21:05:55 UTC

Well, indeed - it would be far more useful to have a less convoluted testcase
where unrelated functions and source make analysis hard.

Please provide a testcase with a _single_ computation kernel applying it in
a single way (I'm trying to follow op_and ...).

From an inlining perspective it doesn't look so bad - early inlining turns 
the innermost loop into

<bb 7>:
  D.15257_17 = thisD.13894_4(D)->m_DataD.13729;
  D.15256_19 = iD.15246_18 * 16;
  D.15255_20 = D.15257_17 + D.15256_19;
  D.15254_21 = v2D.13892_5(D)->m_DataD.13729;
  D.15256_22 = iD.15246_18 * 16;
  D.15253_23 = D.15254_21 + D.15256_22;
  D.15252_24 = *D.15253_23;
  D.15251_25 = *D.15236_2;
  D.15256_26 = iD.15246_18 * 16;
  D.15250_27 = D.15251_25 + D.15256_26;
  D.15249_28 = *D.15250_27;
  D.15247_29 = __builtin_ia32_pand128D.1150 (D.15249_28, D.15252_24);
  *D.15255_20 = D.15247_29;
  iD.15246_30 = iD.15246_18 + 1;

<bb 8>:
  # iD.15246_18 = PHI <0(6), iD.15246_30(7)>
  if (max_idxD.15245_16 != iD.15246_18)
    goto <bb 7>;

already (_ZN10bit_vector6op_andERKS_S1_).

Now the main issue why the redundant loads are not hoisted is that all
data pointers are ref-all:

  chunk_typeD.13721 * {ref-all} D.15255;

thus you tell the compiler that the store

  *D.15255_20 = D.15247_29;

might possibly clobber all loads in that loop.  Of course chunk_type is
just __m128i and I always complained that this is ref-all which makes
optimization of pointers to __m128i practically useless.

This is really a target header problem.

Comment 6 Richard Biener 2010-01-17 21:08:48 UTC

If you fix that issue (simply remove all __may_alias__ attributes from
preprocessed source) you get

.L62:
        addl    $1, %edx
        movdqa  (%ebx,%eax), %xmm0
        pand    (%esi,%eax), %xmm0
        movdqa  %xmm0, (%edi,%eax)
        addl    $16, %eax
        cmpl    %edx, %ecx
        jne     .L62

for the innermost loop.  Maybe not 100% perfect but reasonable (that's -O2).

Comment 7 Richard Biener 2010-01-17 21:19:04 UTC

Btw, if you know more about data types than nothing then do not use the
__m128* types for data but build your vector types yourself with an
appropriate base type like

typedef T m128T __attribute__((__vector_size__(16)));

It seems that the Intel intrinsics API at no point follows type-based
aliasing rules (they even have vectors of characters, so restricting
types based on possible data types is not going to work out).

Comment 8 Piotr Wyderski 2010-01-17 21:29:07 UTC

It would be great to use my own vector data
types, as in simple cases it allows GCC to
automaticly vectorize them in a portable way,
but in more complex cases it would mean a lot
of explicit type casts => even more unreadable
code, which even now is hard to follow.

Comment 9 rguenther@suse.de 2010-01-17 21:35:02 UTC

Subject: Re:  [C++0x] Variadic templates + lambdas =
 extremely poor code quality, __m128i and aliasing sucks

On Sun, 17 Jan 2010, piotr dot wyderski at gmail dot com wrote:

> ------- Comment #8 from piotr dot wyderski at gmail dot com  2010-01-17 21:29 -------
> It would be great to use my own vector data
> types, as in simple cases it allows GCC to
> automaticly vectorize them in a portable way,
> but in more complex cases it would mean a lot
> of explicit type casts => even more unreadable
> code, which even now is hard to follow.

I don't think you have a choice.

Comment 10 H.J. Lu 2010-01-17 22:11:33 UTC

(In reply to comment #7)
> Btw, if you know more about data types than nothing then do not use the
> __m128* types for data but build your vector types yourself with an
> appropriate base type like
> 
> typedef T m128T __attribute__((__vector_size__(16)));

It isn't a bad idea.

> It seems that the Intel intrinsics API at no point follows type-based
> aliasing rules (they even have vectors of characters, so restricting
> types based on possible data types is not going to work out).
> 

That is true.

Comment 11 Paolo Carlini 2010-01-18 10:46:31 UTC

If I understand correctly (thanks Richard for the quick and detailed analysis) there is nothing C++0x specific in this target/optimization issue. Thus, I'm removing the marker from the Summary and cleaning it. If I'm wrong please tweak it back, adjust it, thanks.

Comment 12 Andrew Pinski 2010-03-13 03:19:56 UTC

I think it is by design that __m128i is marked as may_alias.    In fact it is, see PR 31245.  There is no other way to get around this really.

Comment 13 Andrew Pinski 2010-03-13 03:21:56 UTC

Invalid as mentioned as __m128i has to alias int, short, and char as shown by PR 31245.