Here is a testcase: double *a; int b; void test (void) { for (; b; b++) if (b < 7) a[b] = 1.0; } Produced assembler for that loop when compiled for AVX-512: >gcc -O2 -ftree-vectorize -march=skylake-avx512 small.i -S .L4: vpcmpd $2, %zmm2, %zmm0, %k1 addl $1, %r8d vpaddd %zmm3, %zmm0, %zmm0 vmovupd %zmm1, (%rsi){%k1} kshiftrw $8, %k1, %k1 vmovapd %zmm1, 64(%rsi){%k1} subq $-128, %rsi cmpl %edx, %r8d jb .L4 We have two store using the same base. One of them is unaligned and another one is aligned. The difference comes from GIMPLE. Here is a vectorized loop: <bb 6>: # vect_vec_iv_.11_71 = PHI <vect_cst__69(5), vect_vec_iv_.11_72(6)> # ivtmp.26_87 = PHI <0(5), ivtmp.26_6(6)> # ivtmp.28_7 = PHI <ivtmp.28_24(5), ivtmp.28_8(6)> vectp.15_82 = (vector(8) double *) ivtmp.28_7; vect_vec_iv_.11_72 = vect_vec_iv_.11_71 + { 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16 }; mask__24.12_74 = vect_vec_iv_.11_71 <= { 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6 }; mask_patt_28.14_76 = [vec_unpack_lo_expr] mask__24.12_74; mask_patt_28.14_77 = [vec_unpack_hi_expr] mask__24.12_74; MASK_STORE (vectp.15_82, 0B, mask_patt_28.14_76, { 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0 }); _25 = ivtmp.28_7 + 64; _88 = (vector(8) double *) _25; MASK_STORE (_88, 0B, mask_patt_28.14_77, { 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0 }); ivtmp.26_6 = ivtmp.26_87 + 1; ivtmp.28_8 = ivtmp.28_7 + 128; if (ivtmp.26_6 < bnd.8_31) goto <bb 6>; Pointers used for masked stores have different SSA_NAME_PTR_INFO For vectp.15_82 we have {pt = {anything = 0, nonlocal = 1, escaped = 1, ipa_escaped = 0, null = 0, vars_contains_nonlocal = 0, vars_contains_escaped = 0, vars_contains_escaped_heap = 0, vars = 0x7ffff7c13160}, align = 8, misalign = 0} For _88 we have {pt = {anything = 0, nonlocal = 1, escaped = 1, ipa_escaped = 0, null = 0, vars_contains_nonlocal = 0, vars_contains_escaped = 0, vars_contains_escaped_heap = 0, vars = 0x7ffff7c13160}, align = 0, misalign = 0} Zero alignment here for _88 causes TYPE_ALIGN to be used for the second MASK_STORE. TYPE_ALIGN for vector types is its size and therefore we get incorrect aligned memory access.
Looks like an old bug. We may see the same problem on avx2 using older compiler. Here is a test for GCC5: double *a; long long c; int *d; void test (void) { int b; for (b = 0; b < 1024; b++) { d[b] = 1; if (c > a[b]) a[b] = 1.0; } } >gcc -O2 -ftree-vectorize -march=haswell small.i -S -fdump-rtl-final In small.i.final dump: (insn:TI 91 85 89 5 (set (mem:V4DF (plus:DI (reg:DI 0 ax [orig:151 ivtmp.36 ] [151]) (const_int 32 [0x20])) [3 MEM[(double *)_18]+0 S32 A256]) (unspec:V4DF [ (reg:V4DI 22 xmm1 [169]) (reg:V4DF 24 xmm3 [179]) (mem:V4DF (plus:DI (reg:DI 0 ax [orig:151 ivtmp.36 ] [151]) (const_int 32 [0x20])) [3 MEM[(double *)_18]+0 S32 A256]) ] UNSPEC_MASKMOV)) small.i:13 4457 {avx_maskstorepd256} (expr_list:REG_DEAD (reg:V4DI 22 xmm1 [169]) (nil))) (insn:TI 89 91 93 5 (set (mem:V4DF (reg:DI 0 ax [orig:151 ivtmp.36 ] [151]) [3 MEM[(double *)vectp_pretmp.18_75]+0 S32 A64]) (unspec:V4DF [ (reg:V4DI 23 xmm2 [173]) (reg:V4DF 24 xmm3 [179]) (mem:V4DF (reg:DI 0 ax [orig:151 ivtmp.36 ] [151]) [3 MEM[(double *)vectp_pretmp.18_75]+0 S32 A64]) ] UNSPEC_MASKMOV)) small.i:13 4457 {avx_maskstorepd256} (expr_list:REG_DEAD (reg:V4DI 23 xmm2 [173]) (nil))) Again we have different alignment. For avx2 it doesn't matter because we don't have aligned and unaligned variants of maskmov. In case of regular stores alignment is correct. The oldest compiler I tried was 20140625.
The zero comes from bump_vector_ptr I think, a missed optimization. So the issue is that we don't properly use an unaligned type, likely when we expand: static void expand_mask_store_optab_fn (internal_fn, gcall *stmt, convert_optab optab) { struct expand_operand ops[3]; tree type, lhs, rhs, maskt; rtx mem, reg, mask; maskt = gimple_call_arg (stmt, 2); rhs = gimple_call_arg (stmt, 3); type = TREE_TYPE (rhs); lhs = fold_build2 (MEM_REF, type, gimple_call_arg (stmt, 0), gimple_call_arg (stmt, 1)); it just uses 'type' from rhs but that is just the value being stored. type needs to be adjusted with build_aligned_type () for the alignment of the access.
For normal vectorized stores, the alignment is preserved through the MEM_REF/TARGET_MEM_REF type, e.g. 5991 TREE_TYPE (data_ref) 5992 = build_aligned_type (TREE_TYPE (data_ref), 5993 align * BITS_PER_UNIT); but for masked loads/stores we don't have that available, all we record is the pointer (what we put as TREE_OPERAND (mem_ref, 0)), which can have arbitrary type, and the aliasing pointer (with always zero constant). The alignment info we record through: set_ptr_info_alignment (get_ptr_info (dataref_ptr), align, misalign); Unfortunately, it seems that this way we rely on the optimizers not to lose the alignment in the points to info, otherwise get_object_alignment_2 assumes the pointer is fully aligned. In *.vect it is still preserved: # PT = nonlocal escaped # ALIGN = 8, MISALIGN = 0 vectp_pretmp.14_77 = vectp_pretmp.14_74 + 32; # USE = anything # CLB = anything MASK_STORE (vectp_pretmp.14_77, 0B, mask__ifc__37.13_72, vect_cst__73); but then comes ivopts and turns that into: _36 = ivtmp.32_7 + 32; # PT = nonlocal escaped _37 = (vector(4) double *) _36; # PT = nonlocal escaped # ALIGN = 8, MISALIGN = 0 vectp_pretmp.14_77 = _37; # USE = anything # CLB = anything MASK_STORE (vectp_pretmp.14_77, 0B, mask__ifc__37.13_72, vect_cst__73); and then dom3 turns this into: _36 = ivtmp.32_7 + 32; # PT = nonlocal escaped _37 = (vector(4) double *) _36; # PT = nonlocal escaped # ALIGN = 8, MISALIGN = 0 vectp_pretmp.14_77 = _37; # USE = anything # CLB = anything MASK_STORE (_37, 0B, mask__36.12_70, { 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0 }); and the points to info is there already only on a SSA_NAME that has zero uses. So, I think that we need to store the original alignment from the vectorization time somewhere else, so it is not an optimization to have that info preserved. As the second argument of MASK_{LOAD,STORE} ifn is always 0, and all it matters is the type on it, perhaps we could stick the alignment in that argument in there? I.e. call get_object_alignment for the original load/store during tree-if-conversion.c, in the vectorizer store: misalign ? misalign & -misalign : align as INTEGER_CST with the type we were using on the 0 pointer now, and then during expansion use build_int_cst (TREE_TYPE (gimple_call_arg (stmt, 1)), 0) instead of gimple_call_arg (stmt, 1), and the actual value use to build_aligned_type from the rhs type. Richard, are you ok with that representation? And then of course is the optimization question, that it is bad that the points to / alignment info is lost due to what ivopts and following optimization passes perform. Should we have some late ccp pass that recomputes it, or teach ivopts to to add alignment/points to info to the stuff it creates, or teach some optimizations that if they replace one pointer with another defined in the same bb and the old one has more precise points to info (or similarly for integers and value ranges) that it could stick that better info onto the other SSA_NAME?
On Tue, 8 Dec 2015, jakub at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68786 > > --- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> --- > For normal vectorized stores, the alignment is preserved through the > MEM_REF/TARGET_MEM_REF type, e.g. > 5991 TREE_TYPE (data_ref) > 5992 = build_aligned_type (TREE_TYPE (data_ref), > 5993 align * BITS_PER_UNIT); > but for masked loads/stores we don't have that available, all we record is > the pointer (what we put as TREE_OPERAND (mem_ref, 0)), which can have > arbitrary type, and the aliasing pointer (with always zero constant). > The alignment info we record through: > set_ptr_info_alignment (get_ptr_info (dataref_ptr), align, > misalign); > Unfortunately, it seems that this way we rely on the optimizers not to lose the > alignment in the points to info, otherwise get_object_alignment_2 assumes the > pointer is fully aligned. > In *.vect it is still preserved: > # PT = nonlocal escaped > # ALIGN = 8, MISALIGN = 0 > vectp_pretmp.14_77 = vectp_pretmp.14_74 + 32; > # USE = anything > # CLB = anything > MASK_STORE (vectp_pretmp.14_77, 0B, mask__ifc__37.13_72, vect_cst__73); > but then comes ivopts and turns that into: > _36 = ivtmp.32_7 + 32; > # PT = nonlocal escaped > _37 = (vector(4) double *) _36; > # PT = nonlocal escaped > # ALIGN = 8, MISALIGN = 0 > vectp_pretmp.14_77 = _37; > # USE = anything > # CLB = anything > MASK_STORE (vectp_pretmp.14_77, 0B, mask__ifc__37.13_72, vect_cst__73); > and then dom3 turns this into: > _36 = ivtmp.32_7 + 32; > # PT = nonlocal escaped > _37 = (vector(4) double *) _36; > # PT = nonlocal escaped > # ALIGN = 8, MISALIGN = 0 > vectp_pretmp.14_77 = _37; > # USE = anything > # CLB = anything > MASK_STORE (_37, 0B, mask__36.12_70, { 1.0e+0, 1.0e+0, 1.0e+0, 1.0e+0 }); > and the points to info is there already only on a SSA_NAME that has zero uses. > > So, I think that we need to store the original alignment from the vectorization > time somewhere else, so it is not an optimization to have that info preserved. > As the second argument of MASK_{LOAD,STORE} ifn is always 0, and all it matters > is the type on it, perhaps we could stick the alignment in that argument in > there? I.e. call get_object_alignment for the original load/store during > tree-if-conversion.c, in the vectorizer store: > misalign ? misalign & -misalign : align > as INTEGER_CST with the type we were using on the 0 pointer now, and then > during expansion use build_int_cst (TREE_TYPE (gimple_call_arg (stmt, 1)), 0) > instead of gimple_call_arg (stmt, 1), and the actual value use to > build_aligned_type from the rhs type. Richard, are you ok with that > representation? > > And then of course is the optimization question, that it is bad that the points > to / alignment info is lost due to what ivopts and following optimization > passes perform. Should we have some late ccp pass that recomputes it, or teach > ivopts to to add alignment/points to info to the stuff it creates, or teach > some optimizations that if they replace one pointer with another defined in the > same bb and the old one has more precise points to info (or similarly for > integers and value ranges) that it could stick that better info onto the other > SSA_NAME? I think you need the alignment on the actual reference if it is needed for correctness (or always assume unaligned instead of using an aligned type). Yes, the alias type pointer can be used as vehicle for misalign info. Sure it's "bad" when we lose the info but this just can happen... (eventually we can track down offenders and try to preserve it)
Created attachment 36963 [details] gcc6-pr68786.patch Untested fix. Not adding any testcase, because the one included here is bad - Micha's loop splitting optimization is likely going to turn that into two separate loops, plus to avoid undefined behavior one has to call test with b <= 0 and in that case it will be always < 7.
Author: jakub Date: Wed Dec 9 13:42:06 2015 New Revision: 231454 URL: https://gcc.gnu.org/viewcvs?rev=231454&root=gcc&view=rev Log: PR tree-optimization/68786 * tree-if-conv.c: Include builtins.h. (predicate_mem_writes): Put result of get_object_alignment (ref) into second argument's value. * tree-vect-stmts.c (vectorizable_mask_load_store): Put minimum pointer alignment into second argument's value. * tree-data-ref.c (get_references_in_stmt): Use value of second argument for build_aligned_type, and only the type to build a zero second argument for MEM_REF. * internal-fn.c (expand_mask_load_optab_fn, expand_mask_store_optab_fn): Likewise. Modified: trunk/gcc/ChangeLog trunk/gcc/internal-fn.c trunk/gcc/tree-data-ref.c trunk/gcc/tree-if-conv.c trunk/gcc/tree-vect-stmts.c
Fixed.