The following testcase is miscompiled on powerpc64-linux with -O2 -ftree-vectorize -m32 -mcpu=power7 -mtune=power7:

typedef union U { unsigned short s; unsigned char c; } __attribute__((packed)) U;
struct S { char e __attribute__((aligned (16))); U s[32]; };
struct S t = {0, {{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {10}, {11},
		  {12}, {13}, {14}, {15}, {16}, {17}, {18}, {19}, {20}, {21},
		  {22}, {23}, {24}, {25}, {26}, {27}, {28}, {29}, {30}, {31},
		  {32}}};
unsigned short d[32];

int
main ()
{
  int i;
  for (i = 0; i < 32; i++)
    d[i] = t.s[i].s;
  if (__builtin_memcmp (d, t.s, sizeof d))
    __builtin_abort ();
  return 0;
}

The problem as I see it is that the loop is vectorized with dr_explicit_realign_optimized, and calls vect_create_data_ref_ptr with an offset of TYPE_VECTOR_SUBPARTS (type) - 1, which is 7 in this case.  That function then calls vect_create_addr_base_for_vector_ref, which multiplies the offset by step (2 in this case) and thus adds 14 bytes to the base address.  The base address is &t + 1 and &t is 16-byte aligned, so the first part is loaded from (&t + 1) & -16 (correct) and the second part from (&t + 1 + 14) & -16, which is wrong, because (&t + 15) & -16 is still &t, not the &t + 16 we want to use.  The problem is that the offset given to vect_create_data_ref_ptr is not in bytes but in number of subparts, so after multiplying it by step we end up with an offset that doesn't have the lowest bit(s) set, which is required for this kind of realignment.

Richard, thoughts on this?  Either we'd need the offset to always be in bytes already and make all the callers ensure that, or add some flag saying whether the offset is in bytes or units, or some flag to add not just offset * step, but offset * step + (step - 1), or do that always.
I see a non-NULL offset passed to vect_create_data_ref_ptr by vectorizable_store (for the negative case) and vectorizable_load (for this dr_explicit_realign_optimized case), and to vect_create_addr_base_for_vector_ref in vect_gen_niters_for_prolog_loop (for the negative case) and vect_create_cond_for_align_checks (also for the negative case).  So I guess we'd need to check all these spots for what we really want in each case.
Hmm, from the comment on vect_create_data_ref_ptr I would expect the offset to be in units of the _vector_ size ...

   Output:
   1. Declare a new ptr to vector_type, and have it point to the base of the
      data reference (initial addressed accessed by the data reference).
      For example, for vector of type V8HI, the following code is generated:

      v8hi *ap;
      ap = (v8hi *)initial_address;

      if OFFSET is not supplied:
         initial_address = &a[init];
      if OFFSET is supplied:
         initial_address = &a[init + OFFSET];

      Return the initial_address in INITIAL_ADDRESS.

but looking at the implementation it expects byte offsets.
Btw, this should be reproducible with negative-step cases on x86_64 as well, as there we do

  if (negative)
    offset = size_int (-TYPE_VECTOR_SUBPARTS (vectype) + 1);

?  I suppose interpreting the offset in terms of the scalar element size is what was intended.
Note that vect_create_addr_base_for_vector_ref _does_ multiply the offset by the scalar element size (even if it calls it 'step'):

  tree step = TYPE_SIZE_UNIT (TREE_TYPE (DR_REF (dr)));
  ...
  if (offset)
    {
      offset = fold_build2 (MULT_EXPR, sizetype,
                            fold_convert (sizetype, offset), step);

so that would be consistent with that interpretation and all the uses I can see.
Another testcase:

typedef union U { unsigned short s; unsigned char c; } __attribute__((packed)) U;
struct S { char e __attribute__((aligned (16))); U s[32]; };
struct S t = {0, {{0x5010}, {0x5111}, {0x5212}, {0x5313}, {0x5414}, {0x5515},
		  {0x5616}, {0x5717}, {0x5818}, {0x5919}, {0x5a1a}, {0x5b1b},
		  {0x5c1c}, {0x5d1d}, {0x5e1e}, {0x5f1f}, {0x6020}, {0x6121},
		  {0x6222}, {0x6323}, {0x6424}, {0x6525}, {0x6626}, {0x6727},
		  {0x6828}, {0x6929}, {0x6a2a}, {0x6b2b}, {0x6c2c}, {0x6d2d},
		  {0x6e2e}, {0x6f2f}}};
unsigned short d[32];

int
main ()
{
  int i;
  for (i = 0; i < 32; i++)
    d[i] = t.s[i].s + 4;
  for (i = 0; i < 32; i++)
    if (d[i] != t.s[i].s + 4)
      __builtin_abort ();
    else
      asm volatile ("" : : : "memory");
  return 0;
}

which fails similarly.  For both testcases, if I manually change the addi 9,10,15 instruction to addi 9,10,16, they pass.  That instruction corresponds to the *.vect-created

  vectp_t.9_3 = &MEM[(void *)&t + 15B];

which I change to

  vectp_t.9_3 = &MEM[(void *)&t + 16B];

Here is what the vectorizer emits for this second testcase:

  <bb 2>:
  vectp_t.5_19 = &MEM[(void *)&t + 1B];
  vectp_t.5_5 = vectp_t.5_19 & 4294967280B;
  vect__7.3_4 = MEM[(short unsigned int *)vectp_t.5_5];
  vect__7.6_2 = __builtin_altivec_mask_for_load (vectp_t.5_19);
  vectp_t.9_3 = &MEM[(void *)&t + 15B];
  vect_cst_.13_33 = { 4, 4, 4, 4, 4, 4, 4, 4 };
  vectp_d.15_35 = &d;

  <bb 3>:
  # i_24 = PHI <i_10(4), 0(2)>
  # ivtmp_21 = PHI <ivtmp_20(4), 32(2)>
  # vect__7.7_1 = PHI <vect__7.10_31(4), vect__7.3_4(2)>
  # vectp_t.8_28 = PHI <vectp_t.8_29(4), vectp_t.9_3(2)>
  # vectp_d.14_36 = PHI <vectp_d.14_37(4), vectp_d.15_35(2)>
  # ivtmp_9 = PHI <ivtmp_39(4), 0(2)>
  vectp_t.8_30 = vectp_t.8_28 & 4294967280B;
  vect__7.10_31 = MEM[(short unsigned int *)vectp_t.8_30];
  vect__7.11_32 = REALIGN_LOAD <vect__7.7_1, vect__7.10_31, vect__7.6_2>;
  _7 = t.s[i_24].s;
  vect__8.12_34 = vect__7.11_32 + vect_cst_.13_33;
  _8 = _7 + 4;
  MEM[(short unsigned int *)vectp_d.14_36] = vect__8.12_34;
  i_10 = i_24 + 1;
  ivtmp_20 = ivtmp_21 - 1;
  vectp_t.8_29 = vectp_t.8_28 + 16;
  vectp_d.14_37 = vectp_d.14_36 + 16;
  ivtmp_39 = ivtmp_9 + 1;
  if (ivtmp_39 < 4)
    goto <bb 4>;
  else
    goto <bb 5>;

  <bb 4>:
  goto <bb 3>;

SSA_NAMEs _1 and _31 aren't really used for anything but arguments to REALIGN_LOAD, so as long as the targets that support this (seemingly only rs6000 and spu) handle the misaligned units fine (it is about whether __builtin_altivec_mask_for_load computes the right mask for the permutations), I'd say just fixing up the offset should be all that is needed.  Note that at least two of the three negative == true cases I saw on one x86_64 testcase use a negative offset and depend on it not having the low bits set (so offset -7 must become -14 for V8HImode, not -15).  So, at least as a hack, adding step - 1 to the offset in vect_create_addr_base_for_vector_ref if the offset is non-NULL and positive might DTRT (because in all 3 negative cases the offset will be negative).  But perhaps cleaner would be a bool flag saying the offset is already in bytes, or something similar.
Created attachment 33539 [details]
gcc5-pr63341.patch

Untested fix; it does fix the testcases on ppc32 -mcpu=power7 -mtune=power7, though.
Author: jakub
Date: Thu Sep 25 08:12:49 2014
New Revision: 215583

URL: https://gcc.gnu.org/viewcvs?rev=215583&root=gcc&view=rev
Log:
	PR tree-optimization/63341
	* tree-vectorizer.h (vect_create_data_ref_ptr,
	vect_create_addr_base_for_vector_ref): Add another tree argument
	defaulting to NULL_TREE.
	* tree-vect-data-refs.c (vect_create_data_ref_ptr): Add byte_offset
	argument, pass it down to vect_create_addr_base_for_vector_ref.
	(vect_create_addr_base_for_vector_ref): Add byte_offset argument,
	add that to base_offset too if non-NULL.
	* tree-vect-stmts.c (vectorizable_load): Add byte_offset variable,
	for dr_explicit_realign_optimized set it to vector byte size - 1
	instead of setting offset, pass byte_offset down to
	vect_create_data_ref_ptr.

	* gcc.dg/vect/pr63341-1.c: New test.
	* gcc.dg/vect/pr63341-2.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/vect/pr63341-1.c
    trunk/gcc/testsuite/gcc.dg/vect/pr63341-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-vect-data-refs.c
    trunk/gcc/tree-vect-stmts.c
    trunk/gcc/tree-vectorizer.h
Author: jakub
Date: Thu Sep 25 08:17:32 2014
New Revision: 215585

URL: https://gcc.gnu.org/viewcvs?rev=215585&root=gcc&view=rev
Log:
	PR tree-optimization/63341
	* tree-vectorizer.h (vect_create_data_ref_ptr,
	vect_create_addr_base_for_vector_ref): Add another tree argument
	defaulting to NULL_TREE.
	* tree-vect-data-refs.c (vect_create_data_ref_ptr): Add byte_offset
	argument, pass it down to vect_create_addr_base_for_vector_ref.
	(vect_create_addr_base_for_vector_ref): Add byte_offset argument,
	add that to base_offset too if non-NULL.
	* tree-vect-stmts.c (vectorizable_load): Add byte_offset variable,
	for dr_explicit_realign_optimized set it to vector byte size - 1
	instead of setting offset, pass byte_offset down to
	vect_create_data_ref_ptr.

	* gcc.dg/vect/pr63341-1.c: New test.
	* gcc.dg/vect/pr63341-2.c: New test.

Added:
    branches/gcc-4_9-branch/gcc/testsuite/gcc.dg/vect/pr63341-1.c
    branches/gcc-4_9-branch/gcc/testsuite/gcc.dg/vect/pr63341-2.c
Modified:
    branches/gcc-4_9-branch/gcc/ChangeLog
    branches/gcc-4_9-branch/gcc/testsuite/ChangeLog
    branches/gcc-4_9-branch/gcc/tree-vect-data-refs.c
    branches/gcc-4_9-branch/gcc/tree-vect-stmts.c
    branches/gcc-4_9-branch/gcc/tree-vectorizer.h
Author: jakub
Date: Thu Sep 25 08:19:14 2014
New Revision: 215587

URL: https://gcc.gnu.org/viewcvs?rev=215587&root=gcc&view=rev
Log:
	PR tree-optimization/63341
	* tree-vectorizer.h (vect_create_data_ref_ptr,
	vect_create_addr_base_for_vector_ref): Add another tree argument
	defaulting to NULL_TREE.
	* tree-vect-data-refs.c (vect_create_data_ref_ptr): Add byte_offset
	argument, pass it down to vect_create_addr_base_for_vector_ref.
	(vect_create_addr_base_for_vector_ref): Add byte_offset argument,
	add that to base_offset too if non-NULL.
	* tree-vect-stmts.c (vectorizable_load): Add byte_offset variable,
	for dr_explicit_realign_optimized set it to vector byte size - 1
	instead of setting offset, pass byte_offset down to
	vect_create_data_ref_ptr.

	* gcc.dg/vect/pr63341-1.c: New test.
	* gcc.dg/vect/pr63341-2.c: New test.

Added:
    branches/gcc-4_8-branch/gcc/testsuite/gcc.dg/vect/pr63341-1.c
    branches/gcc-4_8-branch/gcc/testsuite/gcc.dg/vect/pr63341-2.c
Modified:
    branches/gcc-4_8-branch/gcc/ChangeLog
    branches/gcc-4_8-branch/gcc/testsuite/ChangeLog
    branches/gcc-4_8-branch/gcc/tree-vect-data-refs.c
    branches/gcc-4_8-branch/gcc/tree-vect-stmts.c
    branches/gcc-4_8-branch/gcc/tree-vectorizer.h
Should be fixed now.