When compiling the testcase with fully masked AVX512 vectorization,

  -march=znver4 --param=vect-partial-vector-usage=2 -fdiagnostics-plain-output
  -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer
  -finline-functions

RTL sched2 is presented with

(insn 38 35 39 3 (set (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90])
                (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2] <var_decl 0x7fda7413dd80 b>)
                        (const_int -4 [0xfffffffffffffffc])))) [1 MEM <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32])
        (vec_merge:V16SI (reg:V16SI 21 xmm1 [118])
            (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90])
                    (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2] <var_decl 0x7fda7413dd80 b>)
                            (const_int -4 [0xfffffffffffffffc])))) [1 MEM <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32])
            (reg:HI 69 k1 [116]))) "/space/rguenther/src/gcc11queue/gcc/testsuite/gcc.dg/torture/pr58955-2.c":12:12 1942 {avx512f_storev16si_mask}
     (expr_list:REG_DEAD (reg:HI 69 k1 [116])
        (expr_list:REG_DEAD (reg:DI 40 r12 [orig:90 _22 ] [90])
            (expr_list:REG_DEAD (reg:V16SI 21 xmm1 [118])
                (nil)))))
...
(insn 41 39 42 3 (set (reg:CCZ 17 flags)
        (compare:CCZ (mem/c:SI (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2] <var_decl 0x7fda7413dd80 b>)
                        (const_int 4 [0x4]))) [1 b[1]+0 S4 A32])
            (const_int 1 [0x1]))) "/space/rguenther/src/gcc11queue/gcc/testsuite/gcc.dg/torture/pr58955-2.c":15:6 11 {*cmpsi_1}
     (nil))

and it moves the masked store across the load of one of the destination's
elements:

-   32: xmm0:V16QI=vec_duplicate(bx:QI)
-      REG_DEAD bx:QI
-   33: NOTE_INSN_DELETED
-   34: k1:HI=unspec[xmm0:V16QI,[`*.LC0'],0x6] 146
-      REG_DEAD xmm0:V16QI
   36: cx:SI=0x1
      REG_EQUIV 0x1
+   41: flags:CCZ=cmp([const(`b'+0x4)],0x1)
+   32: xmm0:V16QI=vec_duplicate(bx:QI)
+      REG_DEAD bx:QI
   35: xmm1:V16SI=vec_duplicate(cx:SI)
      REG_DEAD cx:SI
      REG_EQUIV const_vector
+   34: k1:HI=unspec[xmm0:V16QI,[`*.LC0'],0x6] 146
+      REG_DEAD xmm0:V16QI
+   39: [`a']=0x2
   38: [r12:DI+const(`b'-0x4)]=vec_merge(xmm1:V16SI,[r12:DI+const(`b'-0x4)],k1:HI)
      REG_DEAD k1:HI
      REG_DEAD r12:DI
      REG_DEAD xmm1:V16SI
-   39: [`a']=0x2
-   41: flags:CCZ=cmp([const(`b'+0x4)],0x1)

The address of the masked store is computed oddly, though:

   14: r12:DI=dx:DI<<0x2+0x4
      REG_DEAD dx:DI

and in the end the assembly contains

        leaq    4(,%rdx,4), %r12
...
        cmpl    $1, b+4(%rip)
...
        vmovdqu32       %zmm1, b-4(%r12){%k1}

(%rdx is zero)
Sorry, was your recent patch g:1c3661e224e3ddfc6f773b095740c0f5a7ddf5fc tackling a different issue?
(In reply to Alexander Monakov from comment #1)
> Sorry, was your recent patch g:1c3661e224e3ddfc6f773b095740c0f5a7ddf5fc
> tackling a different issue?

Yes, the issue in this bug persists after the above fix.
This looks like the same issue as PR110309.  We have

(insn 38 35 39 3 (set (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90])
                (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2] <var_decl 0x7ffff6e28d80 b>)
                        (const_int -4 [0xfffffffffffffffc])))) [1 MEM <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32])
        (vec_merge:V16SI (reg:V16SI 20 xmm0 [118])
            (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90])
                    (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2] <var_decl 0x7ffff6e28d80 b>)
                            (const_int -4 [0xfffffffffffffffc])))) [1 MEM <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32])

so instead of a masked load we see a vec_merge with a (mem:V16SI ...) based on
the vectp_b.12_28 pointer that has full size, but the load of b[1] we try to
disambiguate against refers to int b[10], which is too small for a load of
64 bytes, so we disambiguate based on that.

So quite likely a duplicate.
(In reply to Richard Biener from comment #3)
> This looks like the same issue as PR110309.  We have
> 
> (insn 38 35 39 3 (set (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90])
>                 (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2]
> <var_decl 0x7ffff6e28d80 b>)
>                         (const_int -4 [0xfffffffffffffffc])))) [1 MEM
> <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32])
>         (vec_merge:V16SI (reg:V16SI 20 xmm0 [118])
>             (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90])
>                     (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2]
> <var_decl 0x7ffff6e28d80 b>)
>                             (const_int -4 [0xfffffffffffffffc])))) [1 MEM
> <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32])
> 
> so instead of a masked load we see a vec_merge with a (mem:V16SI ...)
> based on the vectp_b.12_28 pointer that has full size, but the load of b[1]
> we try to disambiguate against refers to int b[10], which is too small for
> a load of 64 bytes, so we disambiguate based on that.

  /* If the pointer based access is bigger than the variable they cannot
     alias.  This is similar to the check below where we use TBAA to
     increase the size of the pointer based access based on the dynamic
     type of a containing object we can infer from it.  */
  poly_int64 dsize2;
  if (known_size_p (size1)              --- should be unknown??
      && poly_int_tree_p (DECL_SIZE (base2), &dsize2)
      && known_lt (dsize2, size1))
    return false;

Should we set MEM_SIZE_KNOWN_P to false for maskstore/maskload?
It seems to me maxsize should be 64 bytes, but the real size should be unknown.
On Tue, 20 Jun 2023, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> 
> --- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
> (In reply to Richard Biener from comment #3)
> > This looks like the same issue as PR110309.  We have
> > 
> > (insn 38 35 39 3 (set (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90])
> >                 (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2]
> > <var_decl 0x7ffff6e28d80 b>)
> >                         (const_int -4 [0xfffffffffffffffc])))) [1 MEM
> > <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32])
> >         (vec_merge:V16SI (reg:V16SI 20 xmm0 [118])
> >             (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90])
> >                     (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2]
> > <var_decl 0x7ffff6e28d80 b>)
> >                             (const_int -4 [0xfffffffffffffffc])))) [1 MEM
> > <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32])
> > 
> > so instead of a masked load we see a vec_merge with a (mem:V16SI ...)
> > based on the vectp_b.12_28 pointer that has full size, but the load of b[1]
> > we try to disambiguate against refers to int b[10], which is too small for
> > a load of 64 bytes, so we disambiguate based on that.
> 
>   /* If the pointer based access is bigger than the variable they cannot
>      alias.  This is similar to the check below where we use TBAA to
>      increase the size of the pointer based access based on the dynamic
>      type of a containing object we can infer from it.  */
>   poly_int64 dsize2;
>   if (known_size_p (size1)              --- should be unknown??
>       && poly_int_tree_p (DECL_SIZE (base2), &dsize2)
>       && known_lt (dsize2, size1))
>     return false;
> 
> Should we set MEM_SIZE_KNOWN_P to false for maskstore/maskload?
> It seems to me maxsize should be 64 bytes, but the real size should be unknown.

Yes, you shouldn't have MEM_ATTRs that indicate the size is known.
(In reply to rguenther@suse.de from comment #5)
> On Tue, 20 Jun 2023, crazylht at gmail dot com wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> > 
> > --- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
> > (In reply to Richard Biener from comment #3)
> > > This looks like the same issue as PR110309.  We have
> > > 
> > > (insn 38 35 39 3 (set (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90])
> > >                 (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2]
> > > <var_decl 0x7ffff6e28d80 b>)
> > >                         (const_int -4 [0xfffffffffffffffc])))) [1 MEM
> > > <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32])
> > >         (vec_merge:V16SI (reg:V16SI 20 xmm0 [118])
> > >             (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90])
> > >                     (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2]
> > > <var_decl 0x7ffff6e28d80 b>)
> > >                             (const_int -4 [0xfffffffffffffffc])))) [1 MEM
> > > <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32])
> > > 
> > > so instead of a masked load we see a vec_merge with a (mem:V16SI ...)
> > > based on the vectp_b.12_28 pointer that has full size, but the load of b[1]
> > > we try to disambiguate against refers to int b[10], which is too small for
> > > a load of 64 bytes, so we disambiguate based on that.
> > 
> >   /* If the pointer based access is bigger than the variable they cannot
> >      alias.  This is similar to the check below where we use TBAA to
> >      increase the size of the pointer based access based on the dynamic
> >      type of a containing object we can infer from it.  */
> >   poly_int64 dsize2;
> >   if (known_size_p (size1)              --- should be unknown??
> >       && poly_int_tree_p (DECL_SIZE (base2), &dsize2)
> >       && known_lt (dsize2, size1))
> >     return false;
> > 
> > Should we set MEM_SIZE_KNOWN_P to false for maskstore/maskload?
> > It seems to me maxsize should be 64 bytes, but the real size should be unknown.
> 
> Yes, you shouldn't have MEM_ATTRs that indicate the size is known.

So it looks like a generic problem that would be better handled in
expand_partial_{load, store}_optab_fn?
> So it looks like a generic problem that would be better handled in
> expand_partial_{load, store}_optab_fn?

There are many other places that assume MEM_SIZE is equal to MODE_SIZE even
when !MEM_SIZE_KNOWN_P, e.g. ao_ref_base will rewrite the size to MODE_SIZE.
(In reply to Hongtao.liu from comment #7)
> > So it looks like a generic problem that would be better handled in
> > expand_partial_{load, store}_optab_fn?
> 
> There are many other places that assume MEM_SIZE is equal to MODE_SIZE even
> when !MEM_SIZE_KNOWN_P, e.g. ao_ref_base will rewrite the size to MODE_SIZE.

Yes, that's because MEM_EXPR isn't really correct ... I guess we could work
around that in ao_ref_from_mem, but expand_partial_store_optab_fn is wrong in
setting that.

There's no "partial memory" MEM, and AFAIK the memory attributes are only
additional info and ignoring them is valid, so indeed a pass could at least
interpret inputs and outputs in mode size even when UNSPECs are involved.
But it must not(?) interpret them as must uses or kills?

I also think that MEM_SIZE is really only relevant for BLKmode MEMs, and
MEM_OFFSET is only relevant for interpreting MEM_EXPR.

Clearing MEM_EXPR and MEM_SIZE results in an ICE:

#0  fancy_abort (file=0x31ce278 "/space/rguenther/src/gcc11queue/gcc/rtlanal.cc", line=469,
    function=0x31d0560 <rtx_addr_can_trap_p_1(rtx_def const*, poly_int<1u, long>, poly_int<1u, long>, machine_mode, bool)::__FUNCTION__> "rtx_addr_can_trap_p_1")
    at /space/rguenther/src/gcc11queue/gcc/diagnostic.cc:2232
#1  0x000000000158b62b in rtx_addr_can_trap_p_1 (x=0x7ffff6d51798, offset=..., size=...,
    mode=E_V16SImode, unaligned_mems=false)
    at /space/rguenther/src/gcc11queue/gcc/rtlanal.cc:467
#2  0x0000000001591a2d in may_trap_p_1 (x=0x7ffff6d51780, flags=0)
    at /space/rguenther/src/gcc11queue/gcc/rtlanal.cc:3160
#3  0x0000000001591f64 in may_trap_p (x=0x7ffff6d51780)
    at /space/rguenther/src/gcc11queue/gcc/rtlanal.cc:3283
#4  0x00000000015f32e6 in simplify_context::simplify_ternary_operation (this=0x7fffffffcd28,
    code=VEC_MERGE, mode=E_V16SImode, op0_mode=E_V16SImode, op0=0x7ffff6e21d90,
    op1=0x7ffff6d51780, op2=0x7ffff6d51108)
    at /space/rguenther/src/gcc11queue/gcc/simplify-rtx.cc:7040
#5  0x0000000000f590c7 in simplify_ternary_operation (code=VEC_MERGE, mode=E_V16SImode,
    op0_mode=E_V16SImode, op0=0x7ffff6e21d90, op1=0x7ffff6d51780, op2=0x7ffff6d51108)
    at /space/rguenther/src/gcc11queue/gcc/rtl.h:3498

(gdb) p debug_rtx (op1)
(mem:V16SI (plus:DI (reg/f:DI 113)
        (reg:DI 90 [ _22 ])) [1 A32])

where I have only preserved MEM_ALIAS_SET and MEM_ALIGN.  It insists on
knowing the MEM's size if it's not BLKmode or VOIDmode.  Indeed, from the mode
we can derive its size.

So we can simply clear only MEM_EXPR (and MEM_OFFSET), which cuts off the
problematic part of alias analysis.  Together with UNSPEC this might be enough
to fix things.  I'm going to test such a patch and seek feedback on the
mailing list.
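For illustration, a minimal sketch of that direction (the helper name is made
up for this example; the actual change would clear these attributes directly
in expand_partial_{load, store}_optab_fn in internal-fn.cc, and this is not
the verbatim patch):

/* Sketch: strip the MEM attributes that make a partial (masked) load or
   store look like a full-width access.  set_mem_expr and clear_mem_offset
   are existing emit-rtl helpers; strip_partial_access_mem_attrs is a
   hypothetical name used only here.  */

static void
strip_partial_access_mem_attrs (rtx mem)
{
  gcc_checking_assert (MEM_P (mem));

  /* MEM_EXPR describes a full vector-sized access on the original object,
     which is what lets the gimple alias oracle disambiguate against decls
     smaller than the vector mode.  */
  set_mem_expr (mem, NULL_TREE);

  /* MEM_OFFSET is only meaningful relative to MEM_EXPR, so drop it too.  */
  clear_mem_offset (mem);

  /* MEM_ALIAS_SET and MEM_ALIGN remain valid for a partial access and are
     deliberately preserved.  */
}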
> So we can simply clear only MEM_EXPR (and MEM_OFFSET), which cuts off the
> problematic part of alias analysis.  Together with UNSPEC this might be
> enough to fix things.

Note maskstore won't be optimized to vpblendd since vpblendd doesn't support a
memory destination, so I guess there's no need to use an UNSPEC for maskstore?
On Sun, 25 Jun 2023, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> 
> --- Comment #9 from Hongtao.liu <crazylht at gmail dot com> ---
> > So we can simply clear only MEM_EXPR (and MEM_OFFSET), which cuts off the
> > problematic part of alias analysis.  Together with UNSPEC this might be
> > enough to fix things.
> 
> Note maskstore won't be optimized to vpblendd since vpblendd doesn't support
> a memory destination, so I guess there's no need to use an UNSPEC for
> maskstore?

A maskstore now looks like

(insn 31 30 32 5 (set (mem:V8DF (plus:DI (reg/v/f:DI 108 [ a ])
                (reg:DI 90 [ ivtmp.28 ])) [1 S64 A64])
        (vec_merge:V8DF (reg:V8DF 115 [ vect__9.14 ])
            (mem:V8DF (plus:DI (reg/v/f:DI 108 [ a ])
                    (reg:DI 90 [ ivtmp.28 ])) [1 S64 A64])
            (reg:QI 100 [ loop_mask_57 ]))) "t.c":6:14 1957 {avx512f_storev8df_mask}

which appears as a full read and a full store, which means earlier stores to
the masked part could rightfully be elided by RTL DSE at least.

If there's any RTL analysis about whether a conditional load could trap, then
for example a full V8DF load from the same address that's currently
conditional but placed after the above could be analyzed as safe to be
scheduled speculatively and unconditionally, while it would not be safe as it
could trap.

I think the DSE issue is real and it should be easy to create a testcase like

void foo (double *d, int mask)
{
  d[5] = 1.;
  _intrinsic_for_mask_store (d, some-val, mask);
}

call that with lane 5 masked and check that d[5] = 1. is preserved.  I think
RTL DSE will remove that store.
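For concreteness, that sketch could be written out with AVX-512F intrinsics
roughly as follows; this is an illustration, not a testsuite file, and the
intrinsic, mask value, and final check are filled in here (it needs -mavx512f
and AVX-512 hardware to actually run):

/* Lane 5 is masked out, so d[5] = 1. must survive the masked store.  */
#include <immintrin.h>

__attribute__((noipa)) void
foo (double *d, __mmask8 mask)
{
  d[5] = 1.;
  _mm512_mask_storeu_pd (d, mask, _mm512_set1_pd (2.));
}

int
main (void)
{
  double d[8] = { 0 };
  foo (d, (__mmask8) ~(1 << 5));  /* all lanes except lane 5 */
  if (d[5] != 1.)
    __builtin_abort ();
  return 0;
}

(As the next comment points out, the masked store's own read of the
destination already makes d[5] a data dependency, so DSE should not in fact
delete the scalar store.)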
The trapping angle seems valid, but I have a really hard time understanding
the DSE issue, and the preceding issue about disambiguation based on RTL
aliasing.

How would DSE optimize out 'd[5] = 1' in your example when the mask_store
reads it? Isn't that a data dependency?

How is the initial issue different from

int f(__m128i *p, __m128i v, int *q)
{
    *q = 0;
    *p = v;
    return *q;
}

that we cannot optimize to 'return 0' due to __attribute__((may_alias))
attached to __m128i?
On Mon, 26 Jun 2023, amonakov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> 
> --- Comment #11 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
> The trapping angle seems valid, but I have a really hard time understanding
> the DSE issue, and the preceding issue about disambiguation based on RTL
> aliasing.
> 
> How would DSE optimize out 'd[5] = 1' in your example when the mask_store
> reads it? Isn't that a data dependency?

Ah, yes - I completely missed that.

> How is the initial issue different from
> 
> int f(__m128i *p, __m128i v, int *q)
> {
>     *q = 0;
>     *p = v;
>     return *q;
> }
> 
> that we cannot optimize to 'return 0' due to __attribute__((may_alias))
> attached to __m128i?

As explained in comment #3, the issue is related to the tree alias oracle part
that gets invoked on the MEM_EXPR for the load, where there is no information
that the load could be partial, so it gets disambiguated against a decl that's
of less size than the full vector.

I've proposed a patch for that issue at hand (clear MEM_EXPR for all partial
load/store MEMs), but didn't yet get an approval.

The question of what RTL analysis can derive from the RTL is also still open
(but it might be we're "safe" because it just doesn't do any
analysis/transform that would be misled).
(In reply to rguenther@suse.de from comment #12)
> As explained in comment #3, the issue is related to the tree alias oracle
> part that gets invoked on the MEM_EXPR for the load, where there is
> no information that the load could be partial, so it gets disambiguated
> against a decl that's of less size than the full vector.

With my example I'm trying to say that types in the IR are wrong if we
disambiguate like that.  People writing C need to attach may_alias to vector
types for plain loads/stores to validly overlap with scalar accesses, and when
the vectorizer introduces vector accesses it needs to do something like that,
or else intermixed scalar accesses may be incorrectly disambiguated against
new vector ones.
On Mon, 26 Jun 2023, amonakov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> 
> --- Comment #13 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #12)
> > As explained in comment #3, the issue is related to the tree alias oracle
> > part that gets invoked on the MEM_EXPR for the load, where there is
> > no information that the load could be partial, so it gets disambiguated
> > against a decl that's of less size than the full vector.
> 
> With my example I'm trying to say that types in the IR are wrong if we
> disambiguate like that.  People writing C need to attach may_alias to vector
> types for plain loads/stores to validly overlap with scalar accesses, and
> when the vectorizer introduces vector accesses it needs to do something like
> that, or else intermixed scalar accesses may be incorrectly disambiguated
> against new vector ones.

Vectors of T and scalar T interoperate TBAA-wise.  What we disambiguate is

int a[2];

int foo(int *p)
{
  a[0] = 1;
  *(v4si *)p = {0,0,0,0};
  return a[0];
}

because the V4SI vector store is too large for the a[] object.  That doesn't
even use TBAA (it works with -fno-strict-aliasing just fine).

If the v4si store is masked we cannot do this anymore, but the IL we seed the
alias oracle with doesn't know the store is partial.  The only way to "fix" it
is to take away all of the information from it.
(In reply to rguenther@suse.de from comment #10)
> On Sun, 25 Jun 2023, crazylht at gmail dot com wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> > 
> > --- Comment #9 from Hongtao.liu <crazylht at gmail dot com> ---
> > > So we can simply clear only MEM_EXPR (and MEM_OFFSET), which cuts off the
> > > problematic part of alias analysis.  Together with UNSPEC this might be
> > > enough to fix things.
> > 
> > Note maskstore won't be optimized to vpblendd since vpblendd doesn't
> > support a memory destination, so I guess there's no need to use an UNSPEC
> > for maskstore?
> 
> A maskstore now looks like
> 
> (insn 31 30 32 5 (set (mem:V8DF (plus:DI (reg/v/f:DI 108 [ a ])
>                 (reg:DI 90 [ ivtmp.28 ])) [1 S64 A64])
>         (vec_merge:V8DF (reg:V8DF 115 [ vect__9.14 ])
>             (mem:V8DF (plus:DI (reg/v/f:DI 108 [ a ])
>                     (reg:DI 90 [ ivtmp.28 ])) [1 S64 A64])
>             (reg:QI 100 [ loop_mask_57 ]))) "t.c":6:14 1957
> {avx512f_storev8df_mask}
> 
> which appears as a full read and a full store, which means earlier
> stores to the masked part could rightfully be elided by RTL DSE
> at least.  If there's any RTL analysis about whether a conditional
> load could trap, then for example a full V8DF load from
> the same address that's currently conditional but placed after the above
> could be analyzed as safe to be scheduled speculatively and
> unconditionally, while it would not be safe as it could trap.

In my understanding, using the whole size for trap analysis is suboptimal but
safe: if the whole-size access is safe, the mask_load/store must be safe.  It
is suboptimal when the whole-size access can trap but the mask_load/store
won't, but we should accept that since the mask is not always known.

I didn't find such a rule in rtx_addr_can_trap_p, only known_subrange_p.

  /* If the pointer based access is bigger than the variable they cannot
     alias.  This is similar to the check below where we use TBAA to
     increase the size of the pointer based access based on the dynamic
     type of a containing object we can infer from it.  */
  poly_int64 dsize2;
  if (known_size_p (size1)
      && poly_int_tree_p (DECL_SIZE (base2), &dsize2)
      && known_lt (dsize2, size1))
    return false;
(In reply to rguenther@suse.de from comment #14)
> Vectors of T and scalar T interoperate TBAA-wise.  What we disambiguate is
> 
> int a[2];
> 
> int foo(int *p)
> {
>   a[0] = 1;
>   *(v4si *)p = {0,0,0,0};
>   return a[0];
> }
> 
> because the V4SI vector store is too large for the a[] object.  That doesn't
> even use TBAA (it works with -fno-strict-aliasing just fine).

Thank you for the example. If we do the same for vector loads, that's a
footgun for users who use vector loads to access small objects:

// alignment to 16 is ensured externally
extern int a[2];

int foo()
{
    a[0] = 1;

    __v4si v = *(__attribute__((may_alias)) __v4si *) &a;
    // mask out extra elements in v and continue
    ...
}

This is a benign data race on data that follows 'a' in the address space, but
otherwise should be a valid and useful technique.

> If the v4si store is masked we cannot do this anymore, but the IL
> we seed the alias oracle with doesn't know the store is partial.
> The only way to "fix" it is to take away all of the information from it.

But that won't fix the trapping issue? I think we need a distinct RTX for
memory accesses where hardware does fault suppression for masked-out elements.
On Mon, 26 Jun 2023, amonakov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> 
> --- Comment #16 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #14)
> > Vectors of T and scalar T interoperate TBAA-wise.  What we disambiguate is
> > 
> > int a[2];
> > 
> > int foo(int *p)
> > {
> >   a[0] = 1;
> >   *(v4si *)p = {0,0,0,0};
> >   return a[0];
> > }
> > 
> > because the V4SI vector store is too large for the a[] object.  That
> > doesn't even use TBAA (it works with -fno-strict-aliasing just fine).
> 
> Thank you for the example. If we do the same for vector loads, that's a
> footgun for users who use vector loads to access small objects:
> 
> // alignment to 16 is ensured externally
> extern int a[2];
> 
> int foo()
> {
>     a[0] = 1;
> 
>     __v4si v = *(__attribute__((may_alias)) __v4si *) &a;
>     // mask out extra elements in v and continue
>     ...
> }
> 
> This is a benign data race on data that follows 'a' in the address space, but
> otherwise should be a valid and useful technique.

Yes, we do the same to loads.  I hope that's not a common technique, though,
but I have to admit the vectorizer itself assesses whether it's safe to
access "gaps" by looking at alignment, so its code generation is prone to
this same "mistake".

Now, is "alignment to 16 is ensured externally" good enough here?
If we consider

static int a[2];

and code doing

  if (is_aligned (a))
    {
      __v4si v = *(__attribute__((may_alias)) __v4si *) &a;
    }

then we cannot even use a DECL_ALIGN that's insufficient for decls that bind
locally.

Note we have similar arguments with aggregate type sizes (and TBAA) where,
when we infer a dynamic type from one access, we check if the other access
would fit.  Wouldn't the above then extend to that as well, given we could
also do aggregate copies of "padding" and ignore the bits if we'd have ensured
the larger access wouldn't trap?

So supporting the above might be a bit of a stretch (though I think we have to
fix the vectorizer here).

> > If the v4si store is masked we cannot do this anymore, but the IL
> > we seed the alias oracle with doesn't know the store is partial.
> > The only way to "fix" it is to take away all of the information from it.
> 
> But that won't fix the trapping issue? I think we need a distinct RTX for
> memory accesses where hardware does fault suppression for masked-out
> elements.

Yes, it doesn't fix that part.  The idea, I guess, would be to use BLKmode
instead of a vector mode for the MEMs, together with specifying MEM_SIZE as
not known.
(In reply to rguenther@suse.de from comment #17)
> Yes, we do the same to loads.  I hope that's not a common technique, though,
> but I have to admit the vectorizer itself assesses whether it's safe to
> access "gaps" by looking at alignment, so its code generation is prone to
> this same "mistake".
> 
> Now, is "alignment to 16 is ensured externally" good enough here?
> If we consider
> 
> static int a[2];
> 
> and code doing
> 
>   if (is_aligned (a))
>     {
>       __v4si v = *(__attribute__((may_alias)) __v4si *) &a;
>     }
> 
> then we cannot even use a DECL_ALIGN that's insufficient for decls that bind
> locally.

I agree. I went with the 'extern' example because there it should be more
obvious the construction ought to work.

> Note we have similar arguments with aggregate type sizes (and TBAA) where,
> when we infer a dynamic type from one access, we check if the other access
> would fit.  Wouldn't the above then extend to that as well, given we could
> also do aggregate copies of "padding" and ignore the bits if we'd have
> ensured the larger access wouldn't trap?

I think a read via a may_alias type just tells you that N bytes are accessible
for reading, not necessarily for writing. So I don't see a problem, but maybe
I didn't quite catch what you are saying.

> So supporting the above might be a bit of a stretch (though I think we have
> to fix the vectorizer here).

What would the solution be? Using a may_alias type for such accesses?

> > > If the v4si store is masked we cannot do this anymore, but the IL
> > > we seed the alias oracle with doesn't know the store is partial.
> > > The only way to "fix" it is to take away all of the information from it.
> > 
> > But that won't fix the trapping issue? I think we need a distinct RTX for
> > memory accesses where hardware does fault suppression for masked-out
> > elements.
> 
> Yes, it doesn't fix that part.  The idea, I guess, would be to use BLKmode
> instead of a vector mode for the MEMs, together with specifying MEM_SIZE as
> not known.

Unfortunate if that works for the trapping side, but not for the aliasing
side.
On Mon, 26 Jun 2023, amonakov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> 
> --- Comment #18 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #17)
> > Yes, we do the same to loads.  I hope that's not a common technique,
> > though, but I have to admit the vectorizer itself assesses whether it's
> > safe to access "gaps" by looking at alignment, so its code generation is
> > prone to this same "mistake".
> > 
> > Now, is "alignment to 16 is ensured externally" good enough here?
> > If we consider
> > 
> > static int a[2];
> > 
> > and code doing
> > 
> >   if (is_aligned (a))
> >     {
> >       __v4si v = *(__attribute__((may_alias)) __v4si *) &a;
> >     }
> > 
> > then we cannot even use a DECL_ALIGN that's insufficient for decls that
> > bind locally.
> 
> I agree. I went with the 'extern' example because there it should be more
> obvious the construction ought to work.
> 
> > Note we have similar arguments with aggregate type sizes (and TBAA) where,
> > when we infer a dynamic type from one access, we check if the other access
> > would fit.  Wouldn't the above then extend to that as well, given we could
> > also do aggregate copies of "padding" and ignore the bits if we'd have
> > ensured the larger access wouldn't trap?
> 
> I think a read via a may_alias type just tells you that N bytes are
> accessible for reading, not necessarily for writing. So I don't see a
> problem, but maybe I didn't quite catch what you are saying.

I wasn't sure how to phrase it; what I was saying is we have this "the access
is too large for the object in consideration, so it cannot alias it" logic in
places where we just work with types within the TBAA framework.  So I wondered
if one can construct a similar case to support that we should not do this.
(tree-ssa-alias.cc: aliasing_component_refs_p)

> > So supporting the above might be a bit of a stretch (though I think we
> > have to fix the vectorizer here).
> 
> What would the solution be? Using a may_alias type for such accesses?

But the size argument doesn't have anything to do with TBAA (and may_alias is
about TBAA).  I don't think we have any way to circumvent C object access
rules.  That is, for example, with -fno-strict-aliasing the following isn't
going to work:

int a;
int b;

int main()
{
  a = 1;
  b = 2;
  if (&a + 1 == &b)  // equality compare of unrelated pointers OK
    {
      long x = *(long *)&a;  // access outside of 'a' not OK
      if (x != 0x0000000100000002)
        abort ();
    }
}

There's no command-line flag or attribute to form a pointer to an object
composing 'a' and 'b' besides changing how the storage is declared.

I don't think we should make an exception for "padding" after an object, and I
don't see any sensible way to constrain the size of the supported "padding"
either.  Pad to the largest possible alignment of the object?  That would be
MAX_OFILE_ALIGNMENT ...

> > > > If the v4si store is masked we cannot do this anymore, but the IL
> > > > we seed the alias oracle with doesn't know the store is partial.
> > > > The only way to "fix" it is to take away all of the information
> > > > from it.
> > > 
> > > But that won't fix the trapping issue? I think we need a distinct RTX
> > > for memory accesses where hardware does fault suppression for
> > > masked-out elements.
> > 
> > Yes, it doesn't fix that part.  The idea, I guess, would be to use BLKmode
> > instead of a vector mode for the MEMs, together with specifying MEM_SIZE
> > as not known.
> 
> Unfortunate if that works for the trapping side, but not for the aliasing
> side.
It should work for both I think, but MEM_EXPR would need changing as well - we do have a perfectly working representation there, it would just be the first CALL_EXPR in such context ...
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:dbf8ab449417aa24669f6ccf50be8c17f8c1278e

commit r14-2116-gdbf8ab449417aa24669f6ccf50be8c17f8c1278e
Author: liuhongt <hongtao.liu@intel.com>
Date:   Mon Jun 26 21:07:09 2023 +0800

    Refine maskstore patterns with UNSPEC_MASKMOV.
    
    Similar like r14-2070-gc79476da46728e
    If mem_addr points to a memory region with less than whole vector size
    bytes of accessible memory and k is a mask that would prevent reading
    the inaccessible bytes from mem_addr, add UNSPEC_MASKMOV to prevent
    it to be transformed to any other whole memory access instructions.
    
    gcc/ChangeLog:
    
            PR rtl-optimization/110237
            * config/i386/sse.md (<avx512>_store<mode>_mask): Refine with
            UNSPEC_MASKMOV.
            (maskstore<mode><avx512fmaskmodelower>): Ditto.
            (*<avx512>_store<mode>_mask): New define_insn, it's renamed
            from original <avx512>_store<mode>_mask.
(In reply to rguenther@suse.de from comment #19)
> But the size argument doesn't have anything to do with TBAA (and may_alias
> is about TBAA).  I don't think we have any way to circumvent C object access
> rules.  That is, for example, with -fno-strict-aliasing the following isn't
> going to work:
> 
> int a;
> int b;
> 
> int main()
> {
>   a = 1;
>   b = 2;
>   if (&a + 1 == &b)  // equality compare of unrelated pointers OK
>     {
>       long x = *(long *)&a;  // access outside of 'a' not OK
>       if (x != 0x0000000100000002)
>         abort ();
>     }
> }
> 
> There's no command-line flag or attribute to form a pointer to an object
> composing 'a' and 'b' besides changing how the storage is declared.

But store-merging and SLP can introduce a wide long-sized access where on the
source level you had two adjacent loads or even memcpy's, so we really seem to
have a problem here and might need to be able to annotate types or individual
accesses as "may-alias-with-oob-ok" in the IR: PR 110431.
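As an aside, a minimal illustration of the kind of compiler-introduced
widening being referred to (my example, not from the PR; whether a given GCC
version actually merges these exact stores is not guaranteed):

/* Illustrative only: gimple store merging may combine the two 4-byte field
   stores into one 8-byte store, an access width the source never wrote.  */
struct S { int a; int b; };

void
g (struct S *s)
{
  s->a = 1;
  s->b = 2;
}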
On Tue, 27 Jun 2023, amonakov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> 
> --- Comment #21 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #19)
> > But the size argument doesn't have anything to do with TBAA (and may_alias
> > is about TBAA).  I don't think we have any way to circumvent C object
> > access rules.  That is, for example, with -fno-strict-aliasing the
> > following isn't going to work:
> > 
> > int a;
> > int b;
> > 
> > int main()
> > {
> >   a = 1;
> >   b = 2;
> >   if (&a + 1 == &b)  // equality compare of unrelated pointers OK
> >     {
> >       long x = *(long *)&a;  // access outside of 'a' not OK
> >       if (x != 0x0000000100000002)
> >         abort ();
> >     }
> > }
> > 
> > There's no command-line flag or attribute to form a pointer to an object
> > composing 'a' and 'b' besides changing how the storage is declared.
> 
> But store-merging and SLP can introduce a wide long-sized access where on
> the source level you had two adjacent loads or even memcpy's, so we really
> seem to have a problem here and might need to be able to annotate types or
> individual accesses as "may-alias-with-oob-ok" in the IR: PR 110431.

But above, 'a' and 'b' are not adjacent; they are only verified to be at
runtime.  The only thing we do, IIRC, is use wider loads to access properly
aligned storage as we know the load wouldn't trap.  That can lead us to the
case you pointed out originally - we load stuff we will ignore, but it might
cause alias disambiguation to disambiguate against a store of the original
non-widened size.
The releases/gcc-11 branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:ad1a3da97f7c4a85778f91b235a8d936bb1c829b

commit r11-10884-gad1a3da97f7c4a85778f91b235a8d936bb1c829b
Author: liuhongt <hongtao.liu@intel.com>
Date:   Mon Jun 26 21:07:09 2023 +0800

    Refine maskstore patterns with UNSPEC_MASKMOV.
    
    Similar like r14-2070-gc79476da46728e
    If mem_addr points to a memory region with less than whole vector size
    bytes of accessible memory and k is a mask that would prevent reading
    the inaccessible bytes from mem_addr, add UNSPEC_MASKMOV to prevent
    it to be transformed to any other whole memory access instructions.
    
    gcc/ChangeLog:
    
            PR rtl-optimization/110237
            * config/i386/sse.md (<avx512>_store<mode>_mask): Refine with
            UNSPEC_MASKMOV.
            (maskstore<mode><avx512fmaskmodelower>): Ditto.
            (*<avx512>_store<mode>_mask): New define_insn, it's renamed
            from original <avx512>_store<mode>_mask.
The releases/gcc-12 branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:a435939ba7e5e489a422071014f943c1a577bfe6

commit r12-9743-ga435939ba7e5e489a422071014f943c1a577bfe6
Author: liuhongt <hongtao.liu@intel.com>
Date:   Mon Jun 26 21:07:09 2023 +0800

    Refine maskstore patterns with UNSPEC_MASKMOV.
    
    Similar like r14-2070-gc79476da46728e
    If mem_addr points to a memory region with less than whole vector size
    bytes of accessible memory and k is a mask that would prevent reading
    the inaccessible bytes from mem_addr, add UNSPEC_MASKMOV to prevent
    it to be transformed to any other whole memory access instructions.
    
    gcc/ChangeLog:
    
            PR rtl-optimization/110237
            * config/i386/sse.md (<avx512>_store<mode>_mask): Refine with
            UNSPEC_MASKMOV.
            (maskstore<mode><avx512fmaskmodelower>): Ditto.
            (*<avx512>_store<mode>_mask): New define_insn, it's renamed
            from original <avx512>_store<mode>_mask.
The releases/gcc-13 branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:8b059560146a93e5174262fef25e8b1aa39bb879

commit r13-7500-g8b059560146a93e5174262fef25e8b1aa39bb879
Author: liuhongt <hongtao.liu@intel.com>
Date:   Mon Jun 26 21:07:09 2023 +0800

    Refine maskstore patterns with UNSPEC_MASKMOV.
    
    Similar like r14-2070-gc79476da46728e
    If mem_addr points to a memory region with less than whole vector size
    bytes of accessible memory and k is a mask that would prevent reading
    the inaccessible bytes from mem_addr, add UNSPEC_MASKMOV to prevent
    it to be transformed to any other whole memory access instructions.
    
    gcc/ChangeLog:
    
            PR rtl-optimization/110237
            * config/i386/sse.md (<avx512>_store<mode>_mask): Refine with
            UNSPEC_MASKMOV.
            (maskstore<mode><avx512fmaskmodelower>): Ditto.
            (*<avx512>_store<mode>_mask): New define_insn, it's renamed
            from original <avx512>_store<mode>_mask.
We are seeing the same problem on riscv and Richi's RFC patch helps (our patterns are already unspec).
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:b09b879e4e9cc24a5d2b0344c1930020c218a104

commit r14-5964-gb09b879e4e9cc24a5d2b0344c1930020c218a104
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Jun 21 09:31:30 2023 +0200

    middle-end/110237 - wrong MEM_ATTRs for partial loads/stores
    
    The following addresses a miscompilation by RTL scheduling related
    to the representation of masked stores.  For that we have
    
    (insn 38 35 39 3 (set (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90])
                    (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2] <var_decl 0x7ffff6e28d80 b>)
                            (const_int -4 [0xfffffffffffffffc])))) [1 MEM <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32])
            (vec_merge:V16SI (reg:V16SI 20 xmm0 [118])
                (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90])
                        (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2] <var_decl 0x7ffff6e28d80 b>)
                                (const_int -4 [0xfffffffffffffffc])))) [1 MEM <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32])
    
    and specifically the memory attributes
    
      [1 MEM <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32]
    
    are problematic.  They tell us the instruction stores and reads a full
    vector, which it of course does not.  There isn't any good MEM_EXPR we
    can use here (we lack a way to just specify a pointer and restrict info
    for example), and since the MEMs have a vector mode it's difficult in
    general as passes do not need to look at the memory attributes at all.
    
    The easiest way to avoid running into the alias analysis problem is
    to scrap the MEM_EXPR when we expand the internal functions for
    partial loads/stores.  That avoids the disambiguation we run into,
    which is realizing that we store to an object of less size as
    the size of the mode we appear to store.
    
    After the patch we see just
    
      [1 S64 A32]
    
    so we preserve the alias set, the alignment and the size (the size is
    redundant if the MEM isn't BLKmode).  That's still not good in case the
    RTL alias oracle would implement the same disambiguation, but it fends
    off the gimple one.
    
    This fixes gcc.dg/torture/pr58955-2.c when built with AVX512 and
    --param=vect-partial-vector-usage=1.
    
            PR middle-end/110237
            * internal-fn.cc (expand_partial_load_optab_fn): Clear
            MEM_EXPR and MEM_OFFSET.
            (expand_partial_store_optab_fn): Likewise.
Fixed.