If you compile the attached code with optimization on a 4.1.x system it will generate a store into a stack temporary in the middle of the loop that is never used. If you compile the code with -DUSE_MACRO where it uses macros instead of inline functions, it will generate the correct code without the extra store. It is still a bug in the 4.3 mainline with a compiler built on March 30th.
Created attachment 13248 [details]
C++ source that shows the bug

This is the source that shows the bug.
Created attachment 13249 [details] This is the assembly language with the extra store in it
Created attachment 13250 [details] This is the good source compiled with -DUSE_MACRO
Inline version:

  r.dst[0].i = MEM[base: d];
  D.6423 = r.dst[0].i;
  D.6449 = __builtin_ia32_paddusb128 (VIEW_CONVERT_EXPR<__v16qi>(D.6423), VIEW_CONVERT_EXPR<__v16qi>(D.6423));
  r.dst[0].i = VIEW_CONVERT_EXPR<__m128i>(D.6449);
  __builtin_ia32_movntdq ((__m128i *) d, r.dst[0].i);
  d = d + 16B;

Macro version:

  D.6414 = MEM[base: d];
  D.6435 = __builtin_ia32_paddusb128 (VIEW_CONVERT_EXPR<__v16qi>(D.6414), VIEW_CONVERT_EXPR<__v16qi>(D.6414));
  __builtin_ia32_movntdq ((__m128i *) d, VIEW_CONVERT_EXPR<__m128i>(D.6435));
  d = d + 16B;

So somehow r.dst[0].i is not being optimized correctly; I did not really look into why.
There's one missed FRE opportunity in that we do not give the two identical VIEW_CONVERT_EXPR<__v16qi>(D.6423) expressions the same value number. This is because VIEW_CONVERT_EXPR is tcc_reference, I believe.

  Created value VH.19 for VH.17.VH.18 vuses: (SFT.32_47)
  Created value VH.20 for VIEW_CONVERT_EXPR<__v16qi>(VH.19)
  Created value VH.21 for VIEW_CONVERT_EXPR<__v16qi>(VH.19)

Does the new VN fix that?
Reduced testcase:

typedef long long __m128i __attribute__ ((__vector_size__ (16), __may_alias__));
typedef long long __v2di __attribute__ ((__vector_size__ (16)));
typedef char __v16qi __attribute__ ((__vector_size__ (16)));

union __attribute__((aligned(16))) MY_M128 { __m128i i; };
struct RegFile { MY_M128 dst[4]; };

__inline__ __attribute__((always_inline)) static void
MEM_OPT_LOAD(MY_M128* reg, __m128i* mem)
{
  reg[0].i = *mem;
}

__inline__ __attribute__((always_inline)) static void
MEM_OPT_STORE(MY_M128* reg, __m128i* mem)
{
  __builtin_ia32_movntdq ((__v2di *)mem, (__v2di)reg[0].i);
}

static __inline __m128i __attribute__((__always_inline__))
_mm_adds_epu8 (__m128i __A, __m128i __B)
{
  return (__m128i)__builtin_ia32_paddusb128 ((__v16qi)__A, (__v16qi)__B);
}

int test(unsigned char *d)
{
  RegFile r;
  MEM_OPT_LOAD((r.dst), ((__m128i*) d));
  r.dst[0].i = _mm_adds_epu8(r.dst[0].i, r.dst[0].i);
  MEM_OPT_STORE((r.dst), (__m128i*) d);
  return 0;
}
store_copyprop is not able to optimize this because the two array refs r.dst[0].i use different types for the index zero (one int, one unsigned long), one has operand2 and operand3 set while the other does not, and operand_equal_p does not compare them by value. Doh.

No idea why operand_equal_p should care about those operands, as

/* Array indexing.
   Operand 0 is the array; operand 1 is a (single) array index.
   Operand 2, if present, is a copy of TYPE_MIN_VALUE of the index.
   Operand 3, if present, is the element size, measured in units of
   the alignment of the element type.  */
DEFTREECODE (ARRAY_REF, "array_ref", tcc_reference, 4)

correctly hints that those do not carry extra information. With those "fixed" we finally get

<bb 2>:
  d.0 = (__m128i *) d;
  # VUSE <SMT.6_18(D)>
  D.2490 = *d.0;
  # SMT.6_21 = VDEF <SMT.6_18(D)>
  D.2496 = __builtin_ia32_paddusb128 (VIEW_CONVERT_EXPR<__v16qi>(D.2490), VIEW_CONVERT_EXPR<__v16qi>(D.2490));
  # SMT.6_23 = VDEF <SMT.6_21>
  __builtin_ia32_movntdq (d.0, VIEW_CONVERT_EXPR<__m128i>(D.2496));

as store_copyprop can replace the loaded value by the stored one. "Fixed" as in

Index: fold-const.c
===================================================================
*** fold-const.c        (revision 123691)
--- fold-const.c        (working copy)
*************** operand_equal_p (tree arg0, tree arg1, u
*** 2884,2894 ****

      case ARRAY_REF:
      case ARRAY_RANGE_REF:
!       /* Operands 2 and 3 may be null.  */
        return (OP_SAME (0)
!               && OP_SAME (1)
!               && OP_SAME_WITH_NULL (2)
!               && OP_SAME_WITH_NULL (3));

      case COMPONENT_REF:
        /* Handle operand 2 the same as for ARRAY_REF.  Operand 0
--- 2884,2896 ----

      case ARRAY_REF:
      case ARRAY_RANGE_REF:
!       /* Operands 2 and 3 do not provide extra information.
!          Compare the array index by value if it is constant first as we
!          may have different types but same value here.  */
        return (OP_SAME (0)
!               && (tree_int_cst_equal (TREE_OPERAND (arg0, 1),
!                                       TREE_OPERAND (arg1, 1))
!                   || OP_SAME (1)));

      case COMPONENT_REF:
        /* Handle operand 2 the same as for ARRAY_REF.  Operand 0

Another thing would be to hunt down where the differently typed indices are produced and to either get rid of operands 2 and 3 from ARRAY_REF or produce them consistently.
Created attachment 13348 [details]
patch

Actually that breaks with some hashing. I have a slightly different approach in testing now. We are inconsistent in gimplifying ARRAY_REFs: in one place we only set operands 2 and 3 if they are not gimple_min_invariant, in another we set them regardless. Comparing array indices by value in operand_equal_p is still the way to go, I think (unless we want to fix all builders of ARRAY_REFs...).
(In reply to comment #8)
> Created an attachment (id=13348) [edit]
> patch
> for one part we only set operands 2 and 3 if they are not gimple_min_invariant,

This issue is PR 24689.
Indeed.
Subject: Bug 31307

Author: rguenth
Date: Thu Apr 12 10:15:53 2007
New Revision: 123736

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=123736
Log:
2007-04-12  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/24689
	PR tree-optimization/31307
	* fold-const.c (operand_equal_p): Compare INTEGER_CST array
	indices by value.
	* gimplify.c (canonicalize_addr_expr): To be consistent with
	gimplify_compound_lval only set operands two and three of
	ARRAY_REFs if they are not gimple_min_invariant.  This makes
	it never at this place.
	* tree-ssa-ccp.c (maybe_fold_offset_to_array_ref): Likewise.

	* g++.dg/tree-ssa/pr31307.C: New testcase.
	* gcc.dg/tree-ssa/pr24689.c: Likewise.

Added:
    trunk/gcc/testsuite/g++.dg/tree-ssa/pr31307.C
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr24689.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/fold-const.c
    trunk/gcc/gimplify.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-ccp.c
Fixed.
How hard would it be to back port the change to 4.1.3 and 4.2?
(In reply to comment #13)
> How hard would it be to back port the change to 4.1.3 and 4.2?

Why do you want to do that? This is not a regression at all. I am tired of people asking questions like this for missed-optimization bugs; it is not as if we have a policy of backporting such patches. That should already be clear.
4.2.0 wouldn't be too difficult (an svn merge of the change to the 4.2.0 branch succeeds without problems), but 4.1.3 has ineffective store copyprop (see PR26135). Of course this is not a regression, so a backport is not appropriate.