Bug 110237 - gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload
Summary: gcc.dg/torture/pr58955-2.c is miscompiled by RTL scheduling after reload
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 14.0
Importance: P3 normal
Target Milestone: ---
Assignee: Richard Biener
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2023-06-13 13:50 UTC by Richard Biener
Modified: 2024-01-27 14:29 UTC
CC: 6 users

See Also:
Host:
Target: x86_64-*-* riscv*-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2023-07-31 00:00:00


Description Richard Biener 2023-06-13 13:50:27 UTC
When compiling the testcase with fully masked AVX512 vectorization,
-march=znver4 --param=vect-partial-vector-usage=2 -fdiagnostics-plain-output -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions

RTL sched2 is presented with

(insn 38 35 39 3 (set (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90])
                (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2]  <var_decl 0x7fda7413dd80 b>)
                        (const_int -4 [0xfffffffffffffffc])))) [1 MEM <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32])
        (vec_merge:V16SI (reg:V16SI 21 xmm1 [118])
            (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90])
                    (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2]  <var_decl 0x7fda7413dd80 b>)
                            (const_int -4 [0xfffffffffffffffc])))) [1 MEM <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32])
            (reg:HI 69 k1 [116]))) "/space/rguenther/src/gcc11queue/gcc/testsuite/gcc.dg/torture/pr58955-2.c":12:12 1942 {avx512f_storev16si_mask}
     (expr_list:REG_DEAD (reg:HI 69 k1 [116])
        (expr_list:REG_DEAD (reg:DI 40 r12 [orig:90 _22 ] [90])
            (expr_list:REG_DEAD (reg:V16SI 21 xmm1 [118])
                (nil)))))
...
(insn 41 39 42 3 (set (reg:CCZ 17 flags)
        (compare:CCZ (mem/c:SI (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2]  <var_decl 0x7fda7413dd80 b>)
                        (const_int 4 [0x4]))) [1 b[1]+0 S4 A32])
            (const_int 1 [0x1]))) "/space/rguenther/src/gcc11queue/gcc/testsuite/gcc.dg/torture/pr58955-2.c":15:6 11 {*cmpsi_1}
     (nil))

and it moves the masked store across the load of one of the destination's elements:

-   32: xmm0:V16QI=vec_duplicate(bx:QI)
-      REG_DEAD bx:QI
-   33: NOTE_INSN_DELETED
-   34: k1:HI=unspec[xmm0:V16QI,[`*.LC0'],0x6] 146
-      REG_DEAD xmm0:V16QI
    36: cx:SI=0x1
       REG_EQUIV 0x1
+   41: flags:CCZ=cmp([const(`b'+0x4)],0x1)
+   32: xmm0:V16QI=vec_duplicate(bx:QI)
+      REG_DEAD bx:QI
    35: xmm1:V16SI=vec_duplicate(cx:SI)
       REG_DEAD cx:SI
       REG_EQUIV const_vector
+   34: k1:HI=unspec[xmm0:V16QI,[`*.LC0'],0x6] 146
+      REG_DEAD xmm0:V16QI
+   39: [`a']=0x2
    38: [r12:DI+const(`b'-0x4)]=vec_merge(xmm1:V16SI,[r12:DI+const(`b'-0x4)],k1:HI)
       REG_DEAD k1:HI
       REG_DEAD r12:DI
       REG_DEAD xmm1:V16SI
-   39: [`a']=0x2
-   41: flags:CCZ=cmp([const(`b'+0x4)],0x1)

The address of the masked store is computed oddly, though:

   14: r12:DI=dx:DI<<0x2+0x4
      REG_DEAD dx:DI

and in the end the assembly contains

        leaq    4(,%rdx,4), %r12
...
        cmpl    $1, b+4(%rip)
...
        vmovdqu32       %zmm1, b-4(%r12){%k1}

(%rdx is zero)
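
For reference, here is a hedged reconstruction of the failing shape (not the literal pr58955-2.c source, just the pattern the RTL above corresponds to): the loop is vectorized into the single fully-masked V16SI store (insn 38), so the scalar load of b[1] (insn 41) must not be scheduled before it.

int a, b[10];

int
main (void)
{
  for (int i = 0; i < 10; i++)
    b[i] = 1;          /* becomes the masked vector store, insn 38 */
  a = 2;               /* insn 39 */
  if (b[1] != 1)       /* insn 41, which sched2 hoists above insn 38 */
    __builtin_abort ();
  return 0;
}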
Comment 1 Alexander Monakov 2023-06-14 11:02:59 UTC
Sorry, was your recent patch g:1c3661e224e3ddfc6f773b095740c0f5a7ddf5fc tackling a different issue?
Comment 2 Richard Biener 2023-06-14 12:09:04 UTC
(In reply to Alexander Monakov from comment #1)
> Sorry, was your recent patch g:1c3661e224e3ddfc6f773b095740c0f5a7ddf5fc
> tackling a different issue?

Yes, the issue in this bug persists after the above fix.
Comment 3 Richard Biener 2023-06-20 09:52:32 UTC
This looks like the same issue as PR110309.  We have

(insn 38 35 39 3 (set (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90])
                (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2]  <var_decl 0x7ffff6e28d80 b>)
                        (const_int -4 [0xfffffffffffffffc])))) [1 MEM <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32])
        (vec_merge:V16SI (reg:V16SI 20 xmm0 [118])
            (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90]) 
                    (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2]  <var_decl 0x7ffff6e28d80 b>)
                            (const_int -4 [0xfffffffffffffffc])))) [1 MEM <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32])

so instead of a masked load we see a vec_merge with a (mem:V16SI ...)
based on the vectp_b.12_28 pointer that has full size, but the load of b[1]
we try to disambiguate against refers to int b[10], which is too small for
a load of 64 bytes, so we disambiguate based on that.

So quite likely a duplicate.
Comment 4 Hongtao.liu 2023-06-20 11:35:31 UTC
(In reply to Richard Biener from comment #3)
> This looks like the same issue as PR110309.  We have
> 
> (insn 38 35 39 3 (set (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90])
>                 (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2] 
> <var_decl 0x7ffff6e28d80 b>)
>                         (const_int -4 [0xfffffffffffffffc])))) [1 MEM
> <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32])
>         (vec_merge:V16SI (reg:V16SI 20 xmm0 [118])
>             (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90]) 
>                     (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2] 
> <var_decl 0x7ffff6e28d80 b>)
>                             (const_int -4 [0xfffffffffffffffc])))) [1 MEM
> <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32])
> 
> so instead of a masked load we see a vec_merge with a (mem:V16SI ...)
> based on the vectp_b.12_28 pointer that has full size but the load of b[1]
> we try disambiguate against refers to int b[10] which is too small for
> a load of 64 bytes so we disambiguate based on that.


  /* If the pointer based access is bigger than the variable they cannot
     alias.  This is similar to the check below where we use TBAA to
     increase the size of the pointer based access based on the dynamic
     type of a containing object we can infer from it.  */
  poly_int64 dsize2;
  if (known_size_p (size1) --- should be unknown??
      && poly_int_tree_p (DECL_SIZE (base2), &dsize2)
      && known_lt (dsize2, size1))
    return false;

Should we set MEM_SIZE_KNOWN_P to false for maskstore/maskload?
It seems to me the maxsize should be 64 bytes, but the real size should be unknown.
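
For reference, a minimal sketch of how a consumer would then have to treat the attribute (MEM_SIZE_KNOWN_P, MEM_SIZE and GET_MODE_SIZE are the real rtl.h accessors; the surrounding lines are illustrative only):

  /* Only trust MEM_SIZE when it is known; for a partial access the
     mode size is merely an upper bound.  */
  poly_int64 size;
  if (MEM_SIZE_KNOWN_P (mem))
    size = MEM_SIZE (mem);
  else
    size = -1;  /* unknown; GET_MODE_SIZE (GET_MODE (mem)) is only a max */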
Comment 5 rguenther@suse.de 2023-06-20 12:20:04 UTC
On Tue, 20 Jun 2023, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> 
> --- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
> (In reply to Richard Biener from comment #3)
> > This looks like the same issue as PR110309.  We have
> > 
> > (insn 38 35 39 3 (set (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90])
> >                 (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2] 
> > <var_decl 0x7ffff6e28d80 b>)
> >                         (const_int -4 [0xfffffffffffffffc])))) [1 MEM
> > <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32])
> >         (vec_merge:V16SI (reg:V16SI 20 xmm0 [118])
> >             (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90]) 
> >                     (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2] 
> > <var_decl 0x7ffff6e28d80 b>)
> >                             (const_int -4 [0xfffffffffffffffc])))) [1 MEM
> > <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32])
> > 
> > so instead of a masked load we see a vec_merge with a (mem:V16SI ...)
> > based on the vectp_b.12_28 pointer that has full size but the load of b[1]
> > we try disambiguate against refers to int b[10] which is too small for
> > a load of 64 bytes so we disambiguate based on that.
> 
> 
>   /* If the pointer based access is bigger than the variable they cannot
>      alias.  This is similar to the check below where we use TBAA to
>      increase the size of the pointer based access based on the dynamic
>      type of a containing object we can infer from it.  */
>   poly_int64 dsize2;
>   if (known_size_p (size1) --- should be unknown??
>       && poly_int_tree_p (DECL_SIZE (base2), &dsize2)
>       && known_lt (dsize2, size1))
>     return false;
> 
> Should we set MEM_SIZE_KNOWN_P to false for maskstore/maskload?
> It seems to me maxsize should be 64bytes, but real size should be unknown.

Yes, you shouldn't have MEM_ATTRs that indicate the size is known.
Comment 6 Hongtao.liu 2023-06-20 16:05:25 UTC
(In reply to rguenther@suse.de from comment #5)
> On Tue, 20 Jun 2023, crazylht at gmail dot com wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> > 
> > --- Comment #4 from Hongtao.liu <crazylht at gmail dot com> ---
> > (In reply to Richard Biener from comment #3)
> > > This looks like the same issue as PR110309.  We have
> > > 
> > > (insn 38 35 39 3 (set (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90])
> > >                 (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2] 
> > > <var_decl 0x7ffff6e28d80 b>)
> > >                         (const_int -4 [0xfffffffffffffffc])))) [1 MEM
> > > <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32])
> > >         (vec_merge:V16SI (reg:V16SI 20 xmm0 [118])
> > >             (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90]) 
> > >                     (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2] 
> > > <var_decl 0x7ffff6e28d80 b>)
> > >                             (const_int -4 [0xfffffffffffffffc])))) [1 MEM
> > > <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32])
> > > 
> > > so instead of a masked load we see a vec_merge with a (mem:V16SI ...)
> > > based on the vectp_b.12_28 pointer that has full size but the load of b[1]
> > > we try disambiguate against refers to int b[10] which is too small for
> > > a load of 64 bytes so we disambiguate based on that.
> > 
> > 
> >   /* If the pointer based access is bigger than the variable they cannot
> >      alias.  This is similar to the check below where we use TBAA to
> >      increase the size of the pointer based access based on the dynamic
> >      type of a containing object we can infer from it.  */
> >   poly_int64 dsize2;
> >   if (known_size_p (size1) --- should be unknown??
> >       && poly_int_tree_p (DECL_SIZE (base2), &dsize2)
> >       && known_lt (dsize2, size1))
> >     return false;
> > 
> > Should we set MEM_SIZE_KNOWN_P to false for maskstore/maskload?
> > It seems to me maxsize should be 64bytes, but real size should be unknown.
> 
> Yes, you shouldn't have MEM_ATTRs that indicate the size is known.

So it looks like a generic problem that would better be handled in expand_partial_{load, store}_optab_fn?
Comment 7 Hongtao.liu 2023-06-21 05:48:49 UTC
> So it looks like a generic problem and better to be handled in
> expand_partial_{load, store}_optab_fn?

There are many other places that assume MEM_SIZE is equal to MODE_SIZE even when !MEM_SIZE_KNOWN_P, i.e. ao_ref_base will rewrite the size to MODE_SIZE.
Comment 8 Richard Biener 2023-06-21 07:33:58 UTC
(In reply to Hongtao.liu from comment #7)
> > So it looks like a generic problem and better to be handled in
> > expand_partial_{load, store}_optab_fn?
> 
> There're many other places with assumption MEM_SIZE is equal to MODE_SIZE
> even !MEM_SIZE_KNOWN_P, .i.e. ao_ref_base will rewrite size to MODE_SIZE.

Yes, that's because MEM_EXPR isn't really correct ... I guess we could
work around that in ao_ref_from_mem but expand_partial_store_optab_fn
is wrong in setting that.

There's no "partial memory" MEM, and AFAIK the memory attributes are
only additional info and ignoring them is valid, so indeed a pass
could at least interpret inputs and outputs in mode size even when
UNSPECs are involved.  But it must not(?) interpret them as must uses
or kills?

I also think that MEM_SIZE is really only relevant for BLKmode MEMs,
MEM_OFFSET is only relevant for interpreting MEM_EXPR.

Clearing MEM_EXPR and MEM_SIZE results in an ICE:

#0  fancy_abort (file=0x31ce278 "/space/rguenther/src/gcc11queue/gcc/rtlanal.cc", line=469, 
    function=0x31d0560 <rtx_addr_can_trap_p_1(rtx_def const*, poly_int<1u, long>, poly_int<1u, long>, machine_mode, bool)::__FUNCTION__> "rtx_addr_can_trap_p_1") at /space/rguenther/src/gcc11queue/gcc/diagnostic.cc:2232
#1  0x000000000158b62b in rtx_addr_can_trap_p_1 (x=0x7ffff6d51798, offset=..., size=..., mode=E_V16SImode, unaligned_mems=false)
    at /space/rguenther/src/gcc11queue/gcc/rtlanal.cc:467
#2  0x0000000001591a2d in may_trap_p_1 (x=0x7ffff6d51780, flags=0) at /space/rguenther/src/gcc11queue/gcc/rtlanal.cc:3160
#3  0x0000000001591f64 in may_trap_p (x=0x7ffff6d51780) at /space/rguenther/src/gcc11queue/gcc/rtlanal.cc:3283
#4  0x00000000015f32e6 in simplify_context::simplify_ternary_operation (this=0x7fffffffcd28, code=VEC_MERGE, mode=E_V16SImode, 
    op0_mode=E_V16SImode, op0=0x7ffff6e21d90, op1=0x7ffff6d51780, op2=0x7ffff6d51108)
    at /space/rguenther/src/gcc11queue/gcc/simplify-rtx.cc:7040
#5  0x0000000000f590c7 in simplify_ternary_operation (code=VEC_MERGE, mode=E_V16SImode, op0_mode=E_V16SImode, op0=0x7ffff6e21d90, 
    op1=0x7ffff6d51780, op2=0x7ffff6d51108) at /space/rguenther/src/gcc11queue/gcc/rtl.h:3498

(gdb) p debug_rtx (op1)
(mem:V16SI (plus:DI (reg/f:DI 113)
        (reg:DI 90 [ _22 ])) [1  A32])

where I have only preserved MEM_ALIAS_SET and MEM_ALIGN.  It insists
on knowing the MEM's size if it's not BLKmode or VOIDmode.  Indeed,
from the mode we can derive its size.

So we can simply clear only MEM_EXPR (and MEM_OFFSET), that cuts off the
problematic part of alias analysis.  Together with UNSPEC this might be
enough to fix things.

I'm going to test such a patch and seek feedback on the mailing list.
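
A minimal sketch of such a patch, in internal-fn.cc's expand_partial_load_optab_fn / expand_partial_store_optab_fn (hedged: set_mem_expr and clear_mem_offset are the real emit-rtl interfaces, but the operand name is illustrative; the hunk that was eventually committed is in comment #27):

  /* The MEM built from the call's pointer argument claims a full-vector
     access; its MEM_EXPR feeds the tree alias oracle and enables the
     bogus disambiguation, so drop it together with the offset derived
     from it.  The alias set and alignment are preserved.  */
  set_mem_expr (mem, NULL_TREE);
  clear_mem_offset (mem);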
Comment 9 Hongtao.liu 2023-06-25 04:06:04 UTC
> So we can simply clear only MEM_EXPR (and MEM_OFFSET), that cuts off the
> problematic part of alias analysis.  Together with UNSPEC this might be
> enough to fix things.
> 
Note a maskstore won't be optimized to vpblendd since vpblendd doesn't support a memory destination, so I guess there is no need to use an UNSPEC for maskstore?
Comment 10 rguenther@suse.de 2023-06-26 07:41:12 UTC
On Sun, 25 Jun 2023, crazylht at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> 
> --- Comment #9 from Hongtao.liu <crazylht at gmail dot com> ---
> 
> > So we can simply clear only MEM_EXPR (and MEM_OFFSET), that cuts off the
> > problematic part of alias analysis.  Together with UNSPEC this might be
> > enough to fix things.
> > 
> Note maskstore won't optimized to vpblendd since it doesn't support memory
> dest, so I guess no need to use UNSPEC for maskstore?

A maskstore now looks like

(insn 31 30 32 5 (set (mem:V8DF (plus:DI (reg/v/f:DI 108 [ a ])
                (reg:DI 90 [ ivtmp.28 ])) [1  S64 A64])
        (vec_merge:V8DF (reg:V8DF 115 [ vect__9.14 ])
            (mem:V8DF (plus:DI (reg/v/f:DI 108 [ a ])
                    (reg:DI 90 [ ivtmp.28 ])) [1  S64 A64])
            (reg:QI 100 [ loop_mask_57 ]))) "t.c":6:14 1957 
{avx512f_storev8df_mask}

that appears as a full read and a full store, which means earlier
stores to the masked part could rightfully be elided by RTL DSE,
at least.  And if there's any RTL analysis of whether a conditional
load could trap, then for example a full V8DF load from the same
address that's currently conditional, but placed after the above,
could be analyzed as safe to be scheduled speculatively and
unconditionally, while in fact it would not be safe as it could trap.

I think the DSE issue is real and it should be easy to create
a testcase like

void foo (double *d, int mask)
{
  d[5] = 1.;
  _intrinsic_for_mask_store (d, some-val, mask);
}

call that with lane 5 masked and check that d[5] == 1. is preserved.
I think RTL DSE will remove that store.
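
For concreteness, a hypothetical spelling of that sketch with the real AVX-512 intrinsic (as the next comment points out, the read half of the vec_merge actually creates a data dependency, so this particular DSE worry turns out to be unfounded):

#include <immintrin.h>

void
foo (double *d, __m512d v, __mmask8 mask)
{
  d[5] = 1.;                           /* earlier scalar store */
  _mm512_mask_storeu_pd (d, mask, v);  /* called with bit 5 of mask clear */
}

After the call d[5] must still read 1., so the scalar store may not be treated as dead.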
Comment 11 Alexander Monakov 2023-06-26 07:58:16 UTC
The trapping angle seems valid, but I have a really hard time understanding the DSE issue, and the preceding issue about disambiguation based on RTL aliasing.

How would DSE optimize out 'd[5] = 1' in your example when the mask_store reads it? Isn't that a data dependency?

How is the initial issue different from

int f(__m128i *p, __m128i v, int *q)
{
  *q = 0;
  *p = v;
  return *q;
}

that we cannot optimize to 'return 0' due to __attribute__((may_alias)) attached to __m128i?
Comment 12 rguenther@suse.de 2023-06-26 08:07:29 UTC
On Mon, 26 Jun 2023, amonakov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> 
> --- Comment #11 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
> The trapping angle seems valid, but I have a really hard time understanding the
> DSE issue, and the preceding issue about disambiguation based on RTL aliasing.
> 
> How would DSE optimize out 'd[5] = 1' in your example when the mask_store reads
> it? Isn't that a data dependency?

Ah, yes - I completely missed that.

> How is the initial issue different from
> 
> int f(__m128i *p, __m128i v, int *q)
> {
>   *q = 0;
>   *p = v;
>   return *q;
> }
> 
> that we cannot optimize to 'return 0' due to __attribute__((may_alias))
> attached to __m128i?

As explained in comment #3, the issue is related to the tree alias oracle
part that gets invoked on the MEM_EXPR for the load, where there is
no information that the load could be partial, so it gets disambiguated
against a decl that's of smaller size than the full vector.

I've proposed a patch for that issue at hand (clear MEM_EXPR for
all partial load/store MEMs), but it didn't yet get approval.  The
question of what RTL analysis can derive from the RTL is also still
open (but it might be we're "safe" because it just doesn't do any
analysis/transform that would be misled).
Comment 13 Alexander Monakov 2023-06-26 08:22:11 UTC
(In reply to rguenther@suse.de from comment #12)
> As explained in comment#3 the issue is related to the tree alias oracle
> part that gets invoked on the MEM_EXPR for the load where there is
> no information that the load could be partial so it gets disambiguated
> against a decl that's off less size than the full vector.

With my example I'm trying to say that types in the IR are wrong if we disambiguate like that. People writing C need to attach may_alias to vector types for plain loads/stores to validly overlap with scalar accesses, and when the vectorizer introduces vector accesses it needs to do something like that, or else intermixed scalar accesses may be incorrectly disambiguated against the new vector ones.
Comment 14 rguenther@suse.de 2023-06-26 08:35:34 UTC
On Mon, 26 Jun 2023, amonakov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> 
> --- Comment #13 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #12)
> > As explained in comment#3 the issue is related to the tree alias oracle
> > part that gets invoked on the MEM_EXPR for the load where there is
> > no information that the load could be partial so it gets disambiguated
> > against a decl that's off less size than the full vector.
> 
> With my example I'm trying to say that types in the IR are wrong if we
> disambiguate like that. People writing C need to attach may_alias to vector
> types for plain load/stores to validly overlap with scalar accesses, and when
> vectorizer introduces vector accesses it needs to do something like that, or
> else intermixed scalar accesses may be incorrectly disambiguated against new
> vector ones.

vectors of T and scalar T interoperate TBAA-wise.  What we disambiguate is

int a[2];

int foo(int *p)
{
  a[0] = 1;
  *(v4si *)p = {0,0,0,0};
  return a[0];
}

because the V4SI vector store is too large for the a[] object.  That
doesn't even use TBAA (it works with -fno-strict-aliasing just fine).

If the v4si store is masked we cannot do this anymore, but the IL
we seed the alias oracle with doesn't know the store is partial.
The only way to "fix" it is to take away all of the information from it.
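
A hedged illustration of the masked counterpart, spelled with the 512-bit AVX-512 intrinsic from this bug's target (the lane width is immaterial to the point):

#include <immintrin.h>

int a[2];

int
foo (void *p, __mmask16 k)
{
  a[0] = 1;
  /* Only the lanes enabled in k are written, so the "access is larger
     than a[]" argument no longer disambiguates this store from a[0]
     when p points at a.  */
  _mm512_mask_storeu_epi32 (p, k, _mm512_setzero_si512 ());
  return a[0];
}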
Comment 15 Hongtao.liu 2023-06-26 08:44:01 UTC
(In reply to rguenther@suse.de from comment #10)
> On Sun, 25 Jun 2023, crazylht at gmail dot com wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> > 
> > --- Comment #9 from Hongtao.liu <crazylht at gmail dot com> ---
> > 
> > > So we can simply clear only MEM_EXPR (and MEM_OFFSET), that cuts off the
> > > problematic part of alias analysis.  Together with UNSPEC this might be
> > > enough to fix things.
> > > 
> > Note maskstore won't optimized to vpblendd since it doesn't support memory
> > dest, so I guess no need to use UNSPEC for maskstore?
> 
> A maskstore now looks like
> 
> (insn 31 30 32 5 (set (mem:V8DF (plus:DI (reg/v/f:DI 108 [ a ])
>                 (reg:DI 90 [ ivtmp.28 ])) [1  S64 A64])
>         (vec_merge:V8DF (reg:V8DF 115 [ vect__9.14 ])
>             (mem:V8DF (plus:DI (reg/v/f:DI 108 [ a ])
>                     (reg:DI 90 [ ivtmp.28 ])) [1  S64 A64])
>             (reg:QI 100 [ loop_mask_57 ]))) "t.c":6:14 1957 
> {avx512f_storev8df_mask}
> 
> that appears as a full read and a full store which means earlier
> stores to the masked part could be elided rightfully by RTL DSE
> at least.  If there's any RTL analysis about whether a conditional
> load could trap then for example a full V8DF load from
> the same address that's currently conditional but after the above
> could be analyzed as safe to be scheduled speculatively and
> unconditional while it would not be safe as it could trap.
> 

In my understanding, using the whole size for trap analysis is suboptimal but safe: if the whole-size access is safe, the mask_load/store must be safe.  But it is suboptimal when the whole-size access can trap while the mask_load/store won't; we should accept that, since the mask is not always known.

I didn't find such a rule in rtx_addr_can_trap_p, only known_subrange_p.

  /* If the pointer based access is bigger than the variable they cannot
     alias.  This is similar to the check below where we use TBAA to
     increase the size of the pointer based access based on the dynamic
     type of a containing object we can infer from it.  */
  poly_int64 dsize2;
  if (known_size_p (size1)
      && poly_int_tree_p (DECL_SIZE (base2), &dsize2)
      && known_lt (dsize2, size1))
    return false;
Comment 16 Alexander Monakov 2023-06-26 08:54:15 UTC
(In reply to rguenther@suse.de from comment #14)
> vectors of T and scalar T interoperate TBAA wise.  What we disambiguate is
> 
> int a[2];
> 
> int foo(int *p)
> {
>   a[0] = 1;
>   *(v4si *)p = {0,0,0,0};
>   return a[0];
> }
> 
> because the V4SI vector store is too large for the a[] object.  That
> doesn't even use TBAA (it works with -fno-strict-aliasing just fine).

Thank you for the example. If we do the same for vector loads, that's a footgun for users who use vector loads to access small objects:

// alignment to 16 is ensured externally
extern int a[2];

int foo()
{
  a[0] = 1;
  
  __v4si v = *(__attribute__((may_alias)) __v4si *) &a;
  // mask out extra elements in v and continue
 ...
}

This is a benign data race on data that follows 'a' in the address space, but otherwise should be a valid and useful technique.

> If the v4si store is masked we cannot do this anymore, but the IL
> we seed the alias oracle with doesn't know the store is partial.
> The only way to "fix" it is to take away all of the information from it.

But that won't fix the trapping issue? I think we need a distinct RTX for memory accesses where hardware does fault suppression for masked-out elements.
Comment 17 rguenther@suse.de 2023-06-26 11:14:27 UTC
On Mon, 26 Jun 2023, amonakov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> 
> --- Comment #16 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #14)
> > vectors of T and scalar T interoperate TBAA wise.  What we disambiguate is
> > 
> > int a[2];
> > 
> > int foo(int *p)
> > {
> >   a[0] = 1;
> >   *(v4si *)p = {0,0,0,0};
> >   return a[0];
> > }
> > 
> > because the V4SI vector store is too large for the a[] object.  That
> > doesn't even use TBAA (it works with -fno-strict-aliasing just fine).
> 
> Thank you for the example. If we do the same for vector loads, that's a footgun
> for users who use vector loads to access small objects:
> 
> // alignment to 16 is ensured externally
> extern int a[2];
> 
> int foo()
> {
>   a[0] = 1;
> 
>   __v4si v = (__attribute__((may_alias)) __v4si *) &a;
>   // mask out extra elements in v and continue
>  ...
> }
> 
> This is a benign data race on data that follows 'a' in the address space, but
> otherwise should be a valid and useful technique.

Yes, we do the same to loads.  I hope that's not a common technique,
though, but I have to admit the vectorizer itself assesses whether it's
safe to access "gaps" by looking at alignment, so its code generation
is prone to this same "mistake".

Now, is "alignment to 16 is ensured externally" good enough here?
If we consider

static int a[2];

and code doing

 if (is_aligned (a))
   {
     __v4si v = *(__attribute__((may_alias)) __v4si *) &a;
   }

then we cannot even use a DECL_ALIGN that's insufficient for decls
that bind locally.

Note we have similar arguments with aggregate type sizes (and TBAA)
where, when we infer a dynamic type from one access, we check if
the other access would fit.  Wouldn't the above then extend to that
as well, given we could also do aggregate copies of "padding" and
ignore the bits if we'd have ensured the larger access wouldn't trap?

So supporting the above might be a bit of a stretch (though I think
we have to fix the vectorizer here).

> > If the v4si store is masked we cannot do this anymore, but the IL
> > we seed the alias oracle with doesn't know the store is partial.
> > The only way to "fix" it is to take away all of the information from it.
> 
> But that won't fix the trapping issue? I think we need a distinct RTX for
> memory accesses where hardware does fault suppression for masked-out elements.

Yes, it doesn't fix that part.  The idea of using BLKmode instead of
a vector mode for the MEMs would, I guess, work together with specifying
MEM_SIZE as not known.
Comment 18 Alexander Monakov 2023-06-26 18:19:39 UTC
(In reply to rguenther@suse.de from comment #17)
> Yes, we do the same to loads.  I hope that's not a common technique
> though but I have to admit the vectorizer itself assesses whether it's
> safe to access "gaps" by looking at alignment so its code generation
> is prone to this same "mistake".
> 
> Now, is "alignment to 16 is ensured externally" good enough here?
> If we consider
> 
> static int a[2];
> 
> and code doing
> 
>  if (is_aligned (a))
>    {
>      __v4si v = (__attribute__((may_alias)) __v4si *) &a;
>    }
> 
> then we cannot even use a DECL_ALIGN that's insufficient for decls
> that bind locally.

I agree. I went with the 'extern' example because there it should be more obvious the construction ought to work.


> Note we have similar arguments with aggregate type sizes (and TBAA)
> where when we infer a dynamic type from one access we check if
> the other access would fit.  Wouldn't the above then extend to that
> as well given we could also do aggregate copies of "padding" and
> ignore the bits if we'd have ensured the larger access wouldn't trap?

I think a read via a may_alias type just tells you that N bytes are accessible for reading, not necessarily for writing. So I don't see a problem, but maybe I didn't quite catch what you are saying.

 
> So supporting the above might be a bit of a stretch (though I think
> we have to fix the vectorizer here).

What would the solution be? Using a may_alias type for such accesses?


> > > If the v4si store is masked we cannot do this anymore, but the IL
> > > we seed the alias oracle with doesn't know the store is partial.
> > > The only way to "fix" it is to take away all of the information from it.
> > 
> > But that won't fix the trapping issue? I think we need a distinct RTX for
> > memory accesses where hardware does fault suppression for masked-out elements.
> 
> Yes, it doesn't fix that part.  The idea of using BLKmode instead of
> a vector mode for the MEMs would, I guess, together with specifying
> MEM_SIZE as not known.

Unfortunate if that works for the trapping side, but not for the aliasing side.
Comment 19 rguenther@suse.de 2023-06-27 06:53:46 UTC
On Mon, 26 Jun 2023, amonakov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> 
> --- Comment #18 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #17)
> > Yes, we do the same to loads.  I hope that's not a common technique
> > though but I have to admit the vectorizer itself assesses whether it's
> > safe to access "gaps" by looking at alignment so its code generation
> > is prone to this same "mistake".
> > 
> > Now, is "alignment to 16 is ensured externally" good enough here?
> > If we consider
> > 
> > static int a[2];
> > 
> > and code doing
> > 
> >  if (is_aligned (a))
> >    {
> >      __v4si v = (__attribute__((may_alias)) __v4si *) &a;
> >    }
> > 
> > then we cannot even use a DECL_ALIGN that's insufficient for decls
> > that bind locally.
> 
> I agree. I went with the 'extern' example because there it should be more
> obvious the construction ought to work.
> 
> 
> > Note we have similar arguments with aggregate type sizes (and TBAA)
> > where when we infer a dynamic type from one access we check if
> > the other access would fit.  Wouldn't the above then extend to that
> > as well given we could also do aggregate copies of "padding" and
> > ignore the bits if we'd have ensured the larger access wouldn't trap?
> 
> I think a read via a may_alias type just tells you that N bytes are accessible
> for reading, not necessarily for writing. So I don't see a problem, but maybe I
> didn't quite catch what you are saying.

I wasn't sure how to phrase it; what I was saying is that we have this
"the access is too large for the object in consideration, so it cannot
alias it" reasoning in places where we just work with types within the TBAA
framework.  So I wondered if one can construct a similar case to
support that we should not do this.  (tree-ssa-alias.cc:
aliasing_component_refs_p)

> 
> > So supporting the above might be a bit of a stretch (though I think
> > we have to fix the vectorizer here).
> 
> What would the solution be? Using a may_alias type for such accesses?

But the size argument doesn't have anything to do with TBAA (and
may_alias is about TBAA).  I don't think we have any way to circumvent
C object access rules.  That is, for example, with -fno-strict-aliasing
the following isn't going to work.

int a;
int b;

int main()
{
  a = 1;
  b = 2;
  if (&a + 1 == &b) // equality compare of unrelated pointers OK
    {
      long x = *(long *)&a; // access outside of 'a' not OK
      if (x != 0x0000000100000002)
        abort ();
    }
}

there's no command-line flag or attribute to form a pointer
to an object composing 'a' and 'b' besides changing how the
storage is declared.

I don't think we should make an exception for "padding" after
an object, and I don't see any sensible way to constrain
the size of the supported "padding" either.  Pad to the
largest possible alignment of the object?  That would be
MAX_OFILE_ALIGNMENT ...

> 
> > > > If the v4si store is masked we cannot do this anymore, but the IL
> > > > we seed the alias oracle with doesn't know the store is partial.
> > > > The only way to "fix" it is to take away all of the information from it.
> > > 
> > > But that won't fix the trapping issue? I think we need a distinct RTX for
> > > memory accesses where hardware does fault suppression for masked-out elements.
> > 
> > Yes, it doesn't fix that part.  The idea of using BLKmode instead of
> > a vector mode for the MEMs would, I guess, together with specifying
> > MEM_SIZE as not known.
> 
> Unfortunate if that works for the trapping side, but not for the 
> aliasing side.

It should work for both I think, but MEM_EXPR would need changing
as well - we do have a perfectly working representation there, it
would just be the first CALL_EXPR in such context ...
Comment 20 GCC Commits 2023-06-27 07:30:38 UTC
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:dbf8ab449417aa24669f6ccf50be8c17f8c1278e

commit r14-2116-gdbf8ab449417aa24669f6ccf50be8c17f8c1278e
Author: liuhongt <hongtao.liu@intel.com>
Date:   Mon Jun 26 21:07:09 2023 +0800

    Refine maskstore patterns with UNSPEC_MASKMOV.
    
    Similar to r14-2070-gc79476da46728e
    
    If mem_addr points to a memory region with less than whole-vector-size
    bytes of accessible memory and k is a mask that would prevent reading
    the inaccessible bytes from mem_addr, add UNSPEC_MASKMOV to prevent
    it from being transformed into any other whole-memory-access instruction.
    
    gcc/ChangeLog:
    
            PR rtl-optimization/110237
            * config/i386/sse.md (<avx512>_store<mode>_mask): Refine with
            UNSPEC_MASKMOV.
            (maskstore<mode><avx512fmaskmodelower>): Ditto.
            (*<avx512>_store<mode>_mask): New define_insn, it's renamed
            from original <avx512>_store<mode>_mask.
Comment 21 Alexander Monakov 2023-06-27 09:50:11 UTC
(In reply to rguenther@suse.de from comment #19)
> But the size argument doesn't have anything to do with TBAA (and
> may_alias is about TBAA).  I don't think we have any way to circumvent
> C object access rules.  That is, for example, with -fno-strict-aliasing
> the following isn't going to work.
> 
> int a;
> int b;
> 
> int main()
> {
>   a = 1;
>   b = 2;
>   if (&a + 1 == &b) // equality compare of unrelated pointers OK
>     {
>       long x = *(long *)&a; // access outside of 'a' not OK
>       if (x != 0x0000000100000002)
>         abort ();
>     }
> }
> 
> there's no command-line flag or attribute to form a pointer
> to an object composing 'a' and 'b' besides changing how the
> storage is declared.

But store-merging and SLP can introduce a wide long-sized access where at the source level you had two adjacent loads or even memcpy's, so we really seem to have a problem here and might need to be able to annotate types or individual accesses as "may-alias-with-oob-ok" in the IR: PR 110431.
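
A hypothetical illustration of the store-merging case (the struct and names are made up): GIMPLE store merging can fuse the two adjacent int stores into a single 8-byte store, i.e. an access wider than anything written in the source.

struct s { int a; int b; } x;

void
f (void)
{
  x.a = 1;   /* store merging may fuse this ...        */
  x.b = 2;   /* ... and this into one 64-bit store.    */
}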
Comment 22 rguenther@suse.de 2023-06-27 10:03:11 UTC
On Tue, 27 Jun 2023, amonakov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110237
> 
> --- Comment #21 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #19)
> > But the size argument doesn't have anything to do with TBAA (and
> > may_alias is about TBAA).  I don't think we have any way to circumvent
> > C object access rules.  That is, for example, with -fno-strict-aliasing
> > the following isn't going to work.
> > 
> > int a;
> > int b;
> > 
> > int main()
> > {
> >   a = 1;
> >   b = 2;
> >   if (&a + 1 == &b) // equality compare of unrelated pointers OK
> >     {
> >       long x = *(long *)&a; // access outside of 'a' not OK
> >       if (x != 0x0000000100000002)
> >         abort ();
> >     }
> > }
> > 
> > there's no command-line flag or attribute to form a pointer
> > to an object composing 'a' and 'b' besides changing how the
> > storage is declared.
> 
> But store-merging and SLP can introduce a wide long-sized access where on
> source level you had two adjacent loads or even memcpy's, so we really seem to
> have a problem here and might need to be able to annotate types or individual
> accesses as "may-alias-with-oob-ok" in the IR: PR 110431.

But above, 'a' and 'b' are not adjacent; they are only verified to be
so at runtime.  The only thing we do IIRC is use wider loads to access
properly aligned storage when we know the load wouldn't trap.  That
can lead us to the case you pointed out originally - we load stuff
we will ignore, but this might cause alias disambiguation to disambiguate
against a store of the original non-widened size.
Comment 23 GCC Commits 2023-06-29 01:27:20 UTC
The releases/gcc-11 branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:ad1a3da97f7c4a85778f91b235a8d936bb1c829b

commit r11-10884-gad1a3da97f7c4a85778f91b235a8d936bb1c829b
Author: liuhongt <hongtao.liu@intel.com>
Date:   Mon Jun 26 21:07:09 2023 +0800

    Refine maskstore patterns with UNSPEC_MASKMOV.
    
    Similar to r14-2070-gc79476da46728e
    
    If mem_addr points to a memory region with less than whole-vector-size
    bytes of accessible memory and k is a mask that would prevent reading
    the inaccessible bytes from mem_addr, add UNSPEC_MASKMOV to prevent
    it from being transformed into any other whole-memory-access instruction.
    
    gcc/ChangeLog:
    
            PR rtl-optimization/110237
            * config/i386/sse.md (<avx512>_store<mode>_mask): Refine with
            UNSPEC_MASKMOV.
            (maskstore<mode><avx512fmaskmodelower>): Ditto.
            (*<avx512>_store<mode>_mask): New define_insn, it's renamed
            from original <avx512>_store<mode>_mask.
Comment 24 GCC Commits 2023-06-29 01:28:24 UTC
The releases/gcc-12 branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:a435939ba7e5e489a422071014f943c1a577bfe6

commit r12-9743-ga435939ba7e5e489a422071014f943c1a577bfe6
Author: liuhongt <hongtao.liu@intel.com>
Date:   Mon Jun 26 21:07:09 2023 +0800

    Refine maskstore patterns with UNSPEC_MASKMOV.
    
    Similar to r14-2070-gc79476da46728e
    
    If mem_addr points to a memory region with less than whole-vector-size
    bytes of accessible memory and k is a mask that would prevent reading
    the inaccessible bytes from mem_addr, add UNSPEC_MASKMOV to prevent
    it from being transformed into any other whole-memory-access instruction.
    
    gcc/ChangeLog:
    
            PR rtl-optimization/110237
            * config/i386/sse.md (<avx512>_store<mode>_mask): Refine with
            UNSPEC_MASKMOV.
            (maskstore<mode><avx512fmaskmodelower>): Ditto.
            (*<avx512>_store<mode>_mask): New define_insn, it's renamed
            from original <avx512>_store<mode>_mask.
Comment 25 GCC Commits 2023-06-29 01:29:20 UTC
The releases/gcc-13 branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:8b059560146a93e5174262fef25e8b1aa39bb879

commit r13-7500-g8b059560146a93e5174262fef25e8b1aa39bb879
Author: liuhongt <hongtao.liu@intel.com>
Date:   Mon Jun 26 21:07:09 2023 +0800

    Refine maskstore patterns with UNSPEC_MASKMOV.
    
    Similar to r14-2070-gc79476da46728e
    
    If mem_addr points to a memory region with less than whole-vector-size
    bytes of accessible memory and k is a mask that would prevent reading
    the inaccessible bytes from mem_addr, add UNSPEC_MASKMOV to prevent
    it from being transformed into any other whole-memory-access instruction.
    
    gcc/ChangeLog:
    
            PR rtl-optimization/110237
            * config/i386/sse.md (<avx512>_store<mode>_mask): Refine with
            UNSPEC_MASKMOV.
            (maskstore<mode><avx512fmaskmodelower>): Ditto.
            (*<avx512>_store<mode>_mask): New define_insn, it's renamed
            from original <avx512>_store<mode>_mask.
Comment 26 Robin Dapp 2023-11-27 12:39:24 UTC
We are seeing the same problem on riscv, and Richi's RFC patch helps (our patterns already use an unspec).
Comment 27 GCC Commits 2023-11-29 12:26:58 UTC
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:b09b879e4e9cc24a5d2b0344c1930020c218a104

commit r14-5964-gb09b879e4e9cc24a5d2b0344c1930020c218a104
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Jun 21 09:31:30 2023 +0200

    middle-end/110237 - wrong MEM_ATTRs for partial loads/stores
    
    The following addresses a miscompilation by RTL scheduling related
    to the representation of masked stores.  For that we have
    
    (insn 38 35 39 3 (set (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90])
                    (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2]  <var_decl 0x7ffff6e28d80 b>)
                            (const_int -4 [0xfffffffffffffffc])))) [1 MEM <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32])
            (vec_merge:V16SI (reg:V16SI 20 xmm0 [118])
                (mem:V16SI (plus:DI (reg:DI 40 r12 [orig:90 _22 ] [90])
                        (const:DI (plus:DI (symbol_ref:DI ("b") [flags 0x2]  <var_decl 0x7ffff6e28d80 b>)
                                (const_int -4 [0xfffffffffffffffc])))) [1 MEM <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32])
    
    and specifically the memory attributes
    
      [1 MEM <vector(16) int> [(int *)vectp_b.12_28]+0 S64 A32]
    
    are problematic.  They tell us the instruction stores and reads a full
    vector, which it of course does not.  There isn't any good MEM_EXPR
    we can use here (we lack a way to just specify a pointer and restrict
    info for example), and since the MEMs have a vector mode it's
    difficult in general as passes do not need to look at the memory
    attributes at all.
    
    The easiest way to avoid running into the alias analysis problem is
    to scrap the MEM_EXPR when we expand the internal functions for
    partial loads/stores.  That avoids the disambiguation we run into
    which is realizing that we store to an object of less size as
    the size of the mode we appear to store.
    
    After the patch we see just
    
      [1  S64 A32]
    
    so we preserve the alias set, the alignment and the size (the size
    is redundant if the MEM isn't BLKmode).  That's still not good
    in case the RTL alias oracle would implement the same
    disambiguation but it fends off the gimple one.
    
    This fixes gcc.dg/torture/pr58955-2.c when built with AVX512
    and --param=vect-partial-vector-usage=1.
    
            PR middle-end/110237
            * internal-fn.cc (expand_partial_load_optab_fn): Clear
            MEM_EXPR and MEM_OFFSET.
            (expand_partial_store_optab_fn): Likewise.
Comment 28 Richard Biener 2023-11-29 12:27:20 UTC
Fixed.