If you compile the attached code with optimization on a 4.1.x system it will generate a store into a stack temporary in the middle of the loop that is never used. If you compile the code with -DUSE_MACRO where it uses macros instead of inline functions, it will generate the correct code without the extra store. It is still a bug in the 4.3 mainline with a compiler built on March 30th.
Created attachment 13248 [details]
C++ source that shows the bug

This is the source that shows the bug.
Created attachment 13249 [details] This is the assembly language with the extra store in it
Created attachment 13250 [details] This is the good source compiled with -DUSE_MACRO
Inline version:

  r.dst[0].i = MEM[base: d];
  D.6423 = r.dst[0].i;
  D.6449 = __builtin_ia32_paddusb128 (VIEW_CONVERT_EXPR<__v16qi>(D.6423), VIEW_CONVERT_EXPR<__v16qi>(D.6423));
  r.dst[0].i = VIEW_CONVERT_EXPR<__m128i>(D.6449);
  __builtin_ia32_movntdq ((__m128i *) d, r.dst[0].i);
  d = d + 16B;

Macro version:

  D.6414 = MEM[base: d];
  D.6435 = __builtin_ia32_paddusb128 (VIEW_CONVERT_EXPR<__v16qi>(D.6414), VIEW_CONVERT_EXPR<__v16qi>(D.6414));
  __builtin_ia32_movntdq ((__m128i *) d, VIEW_CONVERT_EXPR<__m128i>(D.6435));
  d = d + 16B;

So somehow r.dst[0].i is not being optimized correctly; I did not really look into why.
There's one missed FRE opportunity in that we do not give the two identical VIEW_CONVERT_EXPR<__v16qi>(D.6423) expressions the same value number. This is because VIEW_CONVERT_EXPR is tcc_reference, I believe.

  Created value VH.19 for VH.17.VH.18 vuses: (SFT.32_47)
  Created value VH.20 for VIEW_CONVERT_EXPR<__v16qi>(VH.19)
  Created value VH.21 for VIEW_CONVERT_EXPR<__v16qi>(VH.19)

Does the new VN fix that?
Reduced testcase:

typedef long long __m128i __attribute__ ((__vector_size__ (16), __may_alias__));
typedef long long __v2di __attribute__ ((__vector_size__ (16)));
typedef char __v16qi __attribute__ ((__vector_size__ (16)));

union __attribute__((aligned(16))) MY_M128 { __m128i i; };
struct RegFile { MY_M128 dst[4]; };

__inline__ __attribute__((always_inline)) static void
MEM_OPT_LOAD(MY_M128* reg, __m128i* mem)
{
  reg[0].i = *mem;
}

__inline__ __attribute__((always_inline)) static void
MEM_OPT_STORE(MY_M128* reg, __m128i* mem)
{
  __builtin_ia32_movntdq ((__v2di *)mem, (__v2di)reg[0].i);
}

static __inline __m128i __attribute__((__always_inline__))
_mm_adds_epu8 (__m128i __A, __m128i __B)
{
  return (__m128i)__builtin_ia32_paddusb128 ((__v16qi)__A, (__v16qi)__B);
}

int test(unsigned char *d)
{
  RegFile r;
  MEM_OPT_LOAD((r.dst), ((__m128i*) d));
  r.dst[0].i = _mm_adds_epu8(r.dst[0].i, r.dst[0].i);
  MEM_OPT_STORE((r.dst), (__m128i*) d);
  return 0;
}
store_copyprop is not able to optimize this because the two array refs r.dst[0].i use different types for the index zero (one int, one unsigned long), one has operand2 and operand3 set while the other does not, and operand_equal_p does not compare them by value. Doh.

No idea why operand_equal_p should care about those operands, as

/* Array indexing.
   Operand 0 is the array; operand 1 is a (single) array index.
   Operand 2, if present, is a copy of TYPE_MIN_VALUE of the index.
   Operand 3, if present, is the element size, measured in units of
   the alignment of the element type.  */
DEFTREECODE (ARRAY_REF, "array_ref", tcc_reference, 4)

correctly hints that those do not carry extra information. With those "fixed" we finally get

<bb 2>:
  d.0 = (__m128i *) d;
  # VUSE <SMT.6_18(D)>
  D.2490 = *d.0;
  # SMT.6_21 = VDEF <SMT.6_18(D)>
  D.2496 = __builtin_ia32_paddusb128 (VIEW_CONVERT_EXPR<__v16qi>(D.2490), VIEW_CONVERT_EXPR<__v16qi>(D.2490));
  # SMT.6_23 = VDEF <SMT.6_21>
  __builtin_ia32_movntdq (d.0, VIEW_CONVERT_EXPR<__m128i>(D.2496));

as store_copyprop can replace the loaded value by the stored one. "Fixed" as in

Index: fold-const.c
===================================================================
*** fold-const.c        (revision 123691)
--- fold-const.c        (working copy)
*************** operand_equal_p (tree arg0, tree arg1, u
*** 2884,2894 ****

      case ARRAY_REF:
      case ARRAY_RANGE_REF:
!       /* Operands 2 and 3 may be null.  */
        return (OP_SAME (0)
!               && OP_SAME (1)
!               && OP_SAME_WITH_NULL (2)
!               && OP_SAME_WITH_NULL (3));

      case COMPONENT_REF:
        /* Handle operand 2 the same as for ARRAY_REF.  Operand 0
--- 2884,2896 ----

      case ARRAY_REF:
      case ARRAY_RANGE_REF:
!       /* Operands 2 and 3 do not provide extra information.
!          Compare the array index by value if it is constant first as we
!          may have different types but same value here.  */
        return (OP_SAME (0)
!               && (tree_int_cst_equal (TREE_OPERAND (arg0, 1),
!                                       TREE_OPERAND (arg1, 1))
!                   || OP_SAME (1)));

      case COMPONENT_REF:
        /* Handle operand 2 the same as for ARRAY_REF.  Operand 0

Another thing would be to hunt down where the differently typed indices are produced and to either get rid of operands 2 and 3 from ARRAY_REF or produce them consistently.
Created attachment 13348 [details]
patch

Actually that breaks with some hashing. I have a slightly different approach in testing now. We are inconsistent in gimplifying ARRAY_REFs: in one place we only set operands 2 and 3 if they are not gimple_min_invariant, in another we set them regardless. Comparing array indices by value in operand_equal_p is still the way to go, I think (unless we want to fix all builders of ARRAY_REFs...).
(In reply to comment #8)
> Created an attachment (id=13348) [edit]
> patch
> for one part we only set operands 2 and 3 if they are not gimple_min_invariant,

This issue is PR 24689.
Indeed.
Subject: Bug 31307

Author: rguenth
Date: Thu Apr 12 10:15:53 2007
New Revision: 123736

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=123736
Log:
2007-04-12  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/24689
	PR tree-optimization/31307
	* fold-const.c (operand_equal_p): Compare INTEGER_CST array
	indices by value.
	* gimplify.c (canonicalize_addr_expr): To be consistent with
	gimplify_compound_lval only set operands two and three of
	ARRAY_REFs if they are not gimple_min_invariant.  This makes
	it never at this place.
	* tree-ssa-ccp.c (maybe_fold_offset_to_array_ref): Likewise.

	* g++.dg/tree-ssa/pr31307.C: New testcase.
	* gcc.dg/tree-ssa/pr24689.c: Likewise.

Added:
    trunk/gcc/testsuite/g++.dg/tree-ssa/pr31307.C
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr24689.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/fold-const.c
    trunk/gcc/gimplify.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-ccp.c
Fixed.
How hard would it be to back port the change to 4.1.3 and 4.2?
(In reply to comment #13)
> How hard would it be to back port the change to 4.1.3 and 4.2?

Why do you want to do that? This is not a regression at all. I am tired of people asking questions like this for missed-optimization bugs; it is not as if we have a policy of backporting such patches. That should already be clear.
4.2.0 wouldn't be too difficult (an svn merge of the change to the 4.2.0 branch succeeds without problems), but 4.1.3 has ineffective store copyprop (see PR26135). Of course this is not a regression, so a backport is not appropriate.