Compiling the testcase below with -O2 works, and fails with -O3. gcc-4.3 works with both -O2 and -O3. $ x86_64-unknown-linux-gnu-gcc-4.4.0 -O2 testcase-min.i $ ./a.out OK $ x86_64-unknown-linux-gnu-gcc-4.4.0 -O3 testcase-min.i $ ./a.out Failed: /.G����7��G��?�?�; $ ./a.out Failed: �G����7 h�G��3�?� $ gcc-4.3 -O3 testcase-min.i $ ./a.out OK Actually it is -ftree-vectorize that causes the failure: $ x86_64-unknown-linux-gnu-gcc-4.4.0 -O1 -ftree-vectorize testcase-min.i $ ./a.out Failed:"�@���@ �?�� The output is random garbage with -O3, but should be '>AUTOIT UNICODE SCRIPT<'. $ x86_64-unknown-linux-gnu-gcc-4.4.0 --version x86_64-unknown-linux-gnu-gcc-4.4.0 (GCC) 4.4.0 20080918 (experimental) Copyright (C) 2008 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. gcc-4.4 is a build from svn: $ svn info Path: . URL: svn://gcc.gnu.org/svn/gcc/trunk Repository Root: svn://gcc.gnu.org/svn/gcc Repository UUID: 138bc75d-0d04-0410-961f-82ee72b054a4 Revision: 140456 Node Kind: directory Schedule: normal Last Changed Author: amacleod Last Changed Rev: 140456 Last Changed Date: 2008-09-18 17:07:35 +0300 (Thu, 18 Sep 2008) $ gcc-4.3 --version gcc-4.3 (Debian 4.3.2-1) 4.3.2 Copyright (C) 2008 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Testcase: extern int printf (__const char *__restrict __format, ...); extern int strcmp (__const char *__s1, __const char *__s2); extern int puts (__const char *__s); typedef unsigned char uint8_t; typedef unsigned int uint32_t; struct MT { uint32_t *next; uint32_t items; uint32_t mt[624]; }; static uint8_t MT_getnext (struct MT *MT) { uint32_t r; if (!--MT->items) { uint32_t *mt = MT->mt; unsigned int i; MT->next = mt; for (i = 0; i < 227; i++) mt[i] = ((((mt[i] ^ mt[i + 1]) & 0x7ffffffe) ^ mt[i]) >> 1) ^ ((0 - (mt[i + 1] & 1)) & 0x9908b0df) ^ mt[i + 397]; } r = *(MT->next++); r ^= (r >> 11); r ^= ((r & 0xff3a58ad) << 7); r ^= ((r & 0xffffdf8c) << 15); r ^= (r >> 18); return (uint8_t) (r >> 1); } static void MT_decrypt (uint8_t * buf, unsigned int size, uint32_t seed) { struct MT MT; unsigned int i; uint32_t *mt = MT.mt; *mt = seed; for (i = 1; i < 624; i++) mt[i] = i + 0x6c078965 * ((mt[i - 1] >> 30) ^ mt[i - 1]); MT.items = 1; while (size--) *buf++ ^= MT_getnext (&MT); }; static uint8_t buf[23] = { 0xc0, 0x49, 0x17, 0x32, 0x62, 0x1e, 0x2e, 0xd5, 0x4c, 0x19, 0x28, 0x49, 0x91, 0xe4, 0x72, 0x83, 0x91, 0x3d, 0x93, 0x83, 0xb3, 0x61, 0x38 }; int main (void) { uint32_t s; s = 23; MT_decrypt (buf, s, s + 0xa25e); if (!strcmp (">AUTOIT UNICODE SCRIPT<", buf)) { puts ("OK"); return 0; } else { printf ("Failed:%s\n", buf); return 1; } }
Created attachment 16357 [details] config.log for gcc 4.4 Some more system info: $ uname -a Linux debian 2.6.26-1-amd64 #1 SMP Wed Sep 10 15:31:12 UTC 2008 x86_64 GNU/Linux $ cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz stepping : 7 cpu MHz : 2000.000 cache size : 6144 KB physical id : 0 siblings : 4 core id : 0 cpu cores : 4 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr sse4_1 lahf_lm bogomips : 5669.97 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz stepping : 7 cpu MHz : 2000.000 cache size : 6144 KB physical id : 0 siblings : 4 core id : 3 cpu cores : 4 apicid : 3 initial apicid : 3 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr sse4_1 lahf_lm bogomips : 5666.05 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: processor : 2 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz stepping : 7 cpu MHz : 2000.000 cache size : 6144 KB physical id : 0 siblings : 4 core id : 1 cpu cores : 4 apicid : 1 initial apicid : 1 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr sse4_1 lahf_lm bogomips : 5666.03 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management: processor : 3 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz stepping : 7 cpu MHz : 2000.000 cache size : 6144 KB physical id : 0 siblings : 4 core id : 2 cpu cores : 4 apicid : 2 initial apicid : 2 fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr sse4_1 lahf_lm bogomips : 5666.03 clflush size : 64 cache_alignment : 64 address sizes : 36 bits physical, 48 bits virtual power management:
Also fails on powerpc-linux-gnu.
Self-contained testcase, which will work even for non-ascii compatible exec-charset: /* PR tree-optimization/37573 */ struct S { unsigned int *a; unsigned int b; unsigned int c[624]; }; static unsigned char __attribute__((noinline)) foo (struct S *s) { unsigned int r; if (!--s->b) { unsigned int *c = s->c; unsigned int i; s->a = c; for (i = 0; i < 227; i++) c[i] = ((((c[i] ^ c[i + 1]) & 0x7ffffffe) ^ c[i]) >> 1) ^ ((0 - (c[i + 1] & 1)) & 0x9908b0df) ^ c[i + 397]; } r = *(s->a++); r ^= (r >> 11); r ^= ((r & 0xff3a58ad) << 7); r ^= ((r & 0xffffdf8c) << 15); r ^= (r >> 18); return (unsigned char) (r >> 1); } static void __attribute__((noinline)) bar (unsigned char *p, unsigned int q, unsigned int r) { struct S s; unsigned int i; unsigned int *c = s.c; *c = r; for (i = 1; i < 624; i++) c[i] = i + 0x6c078965 * ((c[i - 1] >> 30) ^ c[i - 1]); s.b = 1; while (q--) *p++ ^= foo (&s); }; static unsigned char p[23] = { 0xc0, 0x49, 0x17, 0x32, 0x62, 0x1e, 0x2e, 0xd5, 0x4c, 0x19, 0x28, 0x49, 0x91, 0xe4, 0x72, 0x83, 0x91, 0x3d, 0x93, 0x83, 0xb3, 0x61, 0x38 }; static unsigned char q[23] = { 0x3e, 0x41, 0x55, 0x54, 0x4f, 0x49, 0x54, 0x20, 0x55, 0x4e, 0x49, 0x43, 0x4f, 0x44, 0x45, 0x20, 0x53, 0x43, 0x52, 0x49, 0x50, 0x54, 0x3c }; int main (void) { unsigned int s; s = 23; bar (p, s, s + 0xa25e); if (__builtin_memcmp (p, q, s) != 0) __builtin_abort (); return 0; } The bug is in the bar function, as can be proven by compiling it with -O3 and adding optimize (0) attribute to bar.
The data dependence on the previous loop is clearly not considered, the loop is vectorized as if c on the rhs and c on the lhs were different non-overlapping arrays.
(In reply to comment #5) > The data dependence on the previous loop is clearly not considered, the loop is > vectorized as if c on the rhs and c on the lhs were different non-overlapping > arrays. > The data dependence analysis returns chrec_known for that ddr (probably getting confused by *&s.c[0] and *&s.c[1] base objects), so the vectorizer thinks they are independent and vectorizes the loop. *D.1659_13 base_address: &s.c[0] offset from base address: 0 constant offset from base address: 0 step: 4 aligned to: 128 base_object: *&s.c[0] symbol tag: SMT.62 *D.1655_9 base_address: &s.c[1] offset from base address: 0 constant offset from base address: 0 step: 4 aligned to: 128 base_object: *&s.c[1] symbol tag: SMT.62 (compute_affine_dependence (stmt_a = D.1660_14 = *D.1659_13; ) (stmt_b = *D.1655_9 = D.1664_23; ) )
split_constant_offset does not handle correctly the offset of &s.c[1] as this is an ADDR_EXPR whose op0 was set by extract_ops_from_tree to itself, an ADDR_EXPR. Now this code is not handled in split_constant_offset_1 as this calls if (!handled_component_p (op0)) return false; and returns false. This is an early return that stops the extraction of the base address of the array: s.c[0] and the offset 4. This is probably due to the fact that extract_ops_from_tree does return the expression itself for ADDR_EXPRs in this case: else if (grhs_class == GIMPLE_SINGLE_RHS) { *op1_p = expr; *op2_p = NULL_TREE; } I wonder if the fix could be to handle ADDR_EXPRs as GIMPLE_UNARY_RHS instead of GIMPLE_SINGLE_RHS: else if (grhs_class == GIMPLE_UNARY_RHS) { *op1_p = TREE_OPERAND (expr, 0); *op2_p = NULL_TREE; } in which case that would solve the base and offset computations.
No, ADDR_EXPRs are single because they can have an arbitrary number of operands (think of &a_1->b[i_2][j_3] which has three operands, a_1, i_2 and j_3). In your case it is a is_gimple_min_invariant, which may add to the confusion?
IMHO the fix for the tuplification bug(!) is to strip the ADDR_EXPR that is always present on op0 in split_constant_offset_1 so: case ADDR_EXPR: { tree base, poffset; HOST_WIDE_INT pbitsize, pbitpos; enum machine_mode pmode; int punsignedp, pvolatilep; op0 = TREE_OPERAND (op0, 0); if (!handled_component_p (op0)) return false;
Subject: Re: [4.4 Regression] gcc-4.4 regression: incorrect code generation with -O1 -ftree-vectorize On Wed, Oct 15, 2008 at 4:47 PM, rguenth at gcc dot gnu dot org > IMHO the fix for the tuplification bug(!) is to strip the ADDR_EXPR that is > always present on op0 in split_constant_offset_1 so: > > case ADDR_EXPR: > { > tree base, poffset; > HOST_WIDE_INT pbitsize, pbitpos; > enum machine_mode pmode; > int punsignedp, pvolatilep; > > op0 = TREE_OPERAND (op0, 0); > if (!handled_component_p (op0)) > return false; This is exactly what I tried within gdb and it did worked, although not as I expected: the base address ends to be &s and not &s.c[0] as I expected before. This then fixes the bug as we end on the same base address of the structure. Then the alias analysis answers that the two accesses can overlap.
Subject: Re: [4.4 Regression] gcc-4.4 regression: incorrect code generation with -O1 -ftree-vectorize On Thu, 16 Oct 2008, spop at gcc dot gnu dot org wrote: > ------- Comment #10 from spop at gcc dot gnu dot org 2008-10-16 00:02 ------- > Subject: Re: [4.4 Regression] gcc-4.4 regression: incorrect code generation > with -O1 -ftree-vectorize > > On Wed, Oct 15, 2008 at 4:47 PM, rguenth at gcc dot gnu dot org > > IMHO the fix for the tuplification bug(!) is to strip the ADDR_EXPR that is > > always present on op0 in split_constant_offset_1 so: > > > > case ADDR_EXPR: > > { > > tree base, poffset; > > HOST_WIDE_INT pbitsize, pbitpos; > > enum machine_mode pmode; > > int punsignedp, pvolatilep; > > > > op0 = TREE_OPERAND (op0, 0); > > if (!handled_component_p (op0)) > > return false; > > This is exactly what I tried within gdb and it did worked, although > not as I expected: the base address ends to be &s and not &s.c[0] as > I expected before. This then fixes the bug as we end on the same > base address of the structure. Then the alias analysis answers that > the two accesses can overlap. Yes, but that is another thing. It also isn't that simple - for any two accesses you want to disambiguate you have to find a suitable common base. Consider &s.c[1] and &s + i, obviously the accesses can overlap - would you still say so if the base address of the first one would be &s.c[0]? (really the base address of a non-variable access is the access itself, right? &s.c[1] in this case) Richard.
Subject: Re: [4.4 Regression] gcc-4.4 regression: incorrect code generation with -O1 -ftree-vectorize > common base. Consider &s.c[1] and &s + i, obviously the accesses can > overlap - would you still say so if the base address of the first one > would be &s.c[0]? Yes, in the case &s.c[1] versus &s.c[0], we still have to consider the arrays to potentially overlap. > (really the base address of a non-variable access is the access > itself, right? &s.c[1] in this case) No, it cannot be &s.c[1] here. The base object for arrays in structs should be the struct itself. The base address tells you what memory object is accessed with an offset. For structs, you are allowed to access any of their contents using arithmetic. For instance in: struct s { int a[2]; int c[20]; } you could access s.c[10] from the address of struct s with: &s.a + 12.
I just noticed that this testcase also fails with -O3 on gcc version 4.1.2 20070626 (Red Hat 4.1.2-14), but works on gcc version 4.1.3 20080623 (prerelease) (Debian 4.1.2-23)
Mine.
Fixed.
Subject: Bug 37573 Author: rguenth Date: Mon Nov 3 12:32:52 2008 New Revision: 141547 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=141547 Log: 2008-11-03 Richard Guenther <rguenther@suse.de> PR middle-end/37573 * tree-data-ref.c (split_constant_offset_1): Fix tuplification. * gcc.c-torture/execute/pr37573.c: New testcase. Added: trunk/gcc/testsuite/gcc.c-torture/execute/pr37573.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-data-ref.c
Thanks.