Bug 113576 - [14 regression] 502.gcc_r hangs r14-8223-g1c1853a70f9422169190e65e568dcccbce02d95c
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 14.0
Importance: P1 normal
Target Milestone: 14.0
Assignee: Richard Biener
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2024-01-24 10:05 UTC by Hongtao Liu
Modified: 2024-02-18 02:10 UTC
CC: 7 users

See Also:
Host:
Target: x86_64-*-* i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2024-01-24 00:00:00


Attachments
reduced testcase attached. (1.72 KB, text/x-csrc)
2024-01-24 10:05 UTC, Hongtao Liu
Proposed testsuite patch (625 bytes, patch)
2024-02-14 14:31 UTC, Uroš Bizjak

Description Hongtao Liu 2024-01-24 10:05:45 UTC
Created attachment 57206 [details]
reduced testcase attached.

Reduced testcase attached; compiled with gcc -O3 -march=skylake-avx512.

Probably related to PR113539
Comment 1 Hongtao Liu 2024-01-24 10:07:18 UTC
int
__attribute__((noinline))
sbitmap_first_set_bit (const_sbitmap bmap)
{
  unsigned int n = 0;
  sbitmap_iterator sbi;

  EXECUTE_IF_SET_IN_SBITMAP (bmap, 0, n, sbi)
    return n;
  return -1;
}

502.gcc_r hangs in this function; it's a vect early-break case.

.L10:
        shrq    %rdx
        movl    %ecx, %eax
        incl    %ecx
        testb   $1, %dl
        je      .L10
        ret

It hangs in this scalar part.
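The non-termination is easy to model in isolation. A hand-written sketch (the helper name and the iteration cap are made up, not from the testcase): once the word being shifted is zero, the exit condition can never become true.

```c
#include <assert.h>

/* Model of the "skip bits that are zero" loop, with a safety cap so the
   word == 0 case (the one that hangs) stops instead of spinning forever.
   Returns the position of the first set bit, or -1 if the cap is hit.  */
static int skip_zero_bits (unsigned long word, int cap)
{
  int n = 0;
  while ((word & 1) == 0)
    {
      word >>= 1;
      if (++n >= cap)
        return -1;   /* word was 0: the real loop never terminates */
    }
  return n;
}
```

With word == 0 the shift leaves the word zero forever, which is exactly the state the miscompiled vector code hands to this scalar loop.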
Comment 2 Richard Biener 2024-01-24 12:27:31 UTC
Confirmed.  Let me have a look.
Comment 3 Richard Biener 2024-01-24 13:37:01 UTC
So the change enables early exit vectorization since may_be_zero is _10 == 0
here, resulting in an overall

number_of_iterationsm1 == _10 != 0 ? _10 + 4294967295 : 0

and

number_of_iterations = MAX_EXPR <_10, 1>

We're vectorizing the induction to re-start the iteration after exit which
I think is all OK (but it must be broken...).  But we're stuck in a
non-vectorized loop,

        /* Skip bits that are zero.  */
        for (; (i->word & 1) == 0; i->word >>= 1)
          i->bit_num++;

   0x0000000000400890 <+528>:   shr    %rdx
   0x0000000000400893 <+531>:   mov    %ecx,%eax
   0x0000000000400895 <+533>:   inc    %ecx
=> 0x0000000000400897 <+535>:   test   $0x1,%dl
   0x000000000040089a <+538>:   je     0x400890 <sbitmap_first_set_bit+528>

and %rdx is zero.

I think the vector IL is sound.

Disabling cunroll allows the testcase to pass, so it might be an error on
the upper bound of its iterations (but I think that's OK, too).

What looks a bit odd is the condition for skipping the epilogue, which
we should never do for LOOP_VINFO_EARLY_BREAKS_VECT_PEELED; we're using

      /* If we have a peeled vector iteration we will never skip the epilog loop
         and we can simplify the cfg a lot by not doing the edge split.  */
      if (skip_epilog || LOOP_VINFO_EARLY_BREAKS (loop_vinfo))
        {
          guard_cond = fold_build2 (EQ_EXPR, boolean_type_node,
                                    niters, niters_vector_mult_vf);

here, but I think niters_vector_mult_vf is wrong (but it doesn't matter
here, still it looks bogus).
Comment 4 Richard Biener 2024-01-24 13:47:56 UTC
So with the niter analysis part, what's different is how we deal with the
case of 'may_be_zero': for a loop with a non-empty latch, a zero means
we do not execute the latch, while the traditional 'niter' means the number
of latch executions.

That means the vectorizer simply implementing may_be_zero as
niter = may_be_zero ? 0 : niter is flawed (if it were that easy we
wouldn't have this extra field).
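A scalar sketch of the semantics being described (hand-written illustration, not the vectorizer's IL): with a non-empty latch the header still runs once when niter is zero, so collapsing may_be_zero into niter = 0 loses that distinction.

```c
/* Early-exit loop with a non-empty latch: the header runs niter + 1
   times, the latch only niter times.  Sketch for illustration only.  */
static void run_loop (int niter, int *header_execs, int *latch_execs)
{
  int i = 0;
  for (;;)
    {
      ++*header_execs;        /* header/body: runs even when niter == 0 */
      if (i == niter)
        break;                /* early exit before the latch */
      ++*latch_execs;         /* non-empty latch */
      ++i;
    }
}
```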
Comment 5 Richard Biener 2024-01-24 13:50:51 UTC
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index fe631252dc2..28ad03e0b8a 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -991,8 +991,12 @@ vec_init_loop_exit_info (class loop *loop)
        {
          tree may_be_zero = niter_desc.may_be_zero;
          if ((integer_zerop (may_be_zero)
-              || integer_nonzerop (may_be_zero)
-              || COMPARISON_CLASS_P (may_be_zero))
+              /* As we are handling may_be_zero that's not false by
+                 rewriting niter to may_be_zero ? 0 : niter we require
+                 an empty latch.  */
+              || (exit->src == single_pred (loop->latch)
+                  && (integer_nonzerop (may_be_zero)
+                      || COMPARISON_CLASS_P (may_be_zero))))
              && (!candidate
                  || dominated_by_p (CDI_DOMINATORS, exit->src,
                                     candidate->src)))

fixes it, I'm testing this.
Comment 6 Hongtao Liu 2024-01-25 01:03:58 UTC
Another potentially buggy place is

  vexit_reduc_67 = mask_patt_43.28_62 & mask_patt_43.28_63;
  if (vexit_reduc_67 == { -1, -1, -1, -1 })
    goto <bb 6>; [94.50%]
  else


is expanded to 

(insn 69 68 70 8 (set (reg:CCZ 17 flags)
        (compare:CCZ (reg:QI 189 [ vexit_reduc_67 ])
            (const_int -1 [0xffffffffffffffff]))) "test.c":83:18 discrim 1 9 {*cmpqi_1}

But it should only test the lower 4 bits; the upper bits are zeroed by the AVX512 comparison instructions.

(insn 65 64 66 8 (set (reg:QI 186 [ mask_patt_43.28_62 ])
        (unspec:QI [
                (reg:V4DI 124 [ vect__29.26 ])
                (reg:V4DI 185)
                (const_int 0 [0])
            ] UNSPEC_PCMP)) "test.c":83:18 discrim 1 2811 {avx512vl_cmpv4di3}
     (nil))
(insn 66 65 67 8 (set (reg:V4DI 187)
        (const_vector:V4DI [
                (const_int 0 [0]) repeated x4
            ])) "test.c":83:18 discrim 1 2021 {movv4di_internal}
     (nil))
(insn 67 66 68 8 (set (reg:QI 188 [ mask_patt_43.28_63 ])
        (unspec:QI [
                (reg:V4DI 125 [ vect__29.27 ])
                (reg:V4DI 187)
                (const_int 0 [0])
            ] UNSPEC_PCMP)) "test.c":83:18 discrim 1 2811 {avx512vl_cmpv4di3}
     (nil))
(insn 68 67 69 8 (parallel [
            (set (reg:QI 189 [ vexit_reduc_67 ])
                (and:QI (reg:QI 186 [ mask_patt_43.28_62 ])
                    (reg:QI 188 [ mask_patt_43.28_63 ])))
            (clobber (reg:CC 17 flags))
        ]) "test.c":83:18 discrim 1 618 {*andqi_1}
     (nil))
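The wrong test is easy to show on a scalar model (a sketch; the 0x0f constant assumes the 4 active lanes of the V4DI compare above): vpcmp zeroes the upper mask bits, so comparing the whole QImode register against -1 can never be true.

```c
#include <stdint.h>

/* "All lanes true" for a 4-lane mask held in a QImode register.
   The buggy expansion compares all 8 bits; only the low 4 are defined.  */
static int all_true_buggy (uint8_t mask) { return mask == 0xff; }
static int all_true_fixed (uint8_t mask) { return (mask & 0x0f) == 0x0f; }
```

For the mask 0x0f that the AVX512 compare actually produces when all four lanes match, the buggy test is false and the branch to the exit is never taken.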
Comment 7 Hongtao Liu 2024-01-25 06:28:35 UTC
diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 1fd957288d4..33a8d539b4d 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -8032,7 +8032,7 @@ native_encode_vector_part (const_tree expr, unsigned char *ptr, int len,

       unsigned int elts_per_byte = BITS_PER_UNIT / elt_bits;
       unsigned int first_elt = off * elts_per_byte;
-      unsigned int extract_elts = extract_bytes * elts_per_byte;
+      unsigned int extract_elts = count;
       for (unsigned int i = 0; i < extract_elts; ++i)
        {
          tree elt = VECTOR_CST_ELT (expr, first_elt + i);

Shouldn't we use count here?  (It also fixes the hang.)

Also, even though the vector boolean type has only 4 elements, VECTOR_CST_ELT (expr, 5) still returns -1; not sure whether that's reasonable.
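The overcount follows from the byte-granular arithmetic; a scalar model of the index computation (a sketch with a made-up helper name, using the values from this PR: 4 elements of 1 bit each):

```c
#define CEIL(x, y) (((x) + (y) - 1) / (y))

/* Mirror of the index arithmetic in native_encode_vector_part (sketch).
   For sub-byte elements CEIL rounds the byte count up, so the derived
   element count can exceed the vector's real element count.  */
static unsigned extract_elts_for (unsigned count, unsigned elt_bits,
                                  unsigned len, unsigned off)
{
  unsigned total_bytes = CEIL (elt_bits * count, 8);   /* BITS_PER_UNIT */
  unsigned avail = total_bytes - off;
  unsigned extract_bytes = len < avail ? len : avail;
  unsigned elts_per_byte = 8 / elt_bits;
  return extract_bytes * elts_per_byte;
}
```

For count = 4 and elt_bits = 1 this yields 8, so the encoding loop queries four elements past the end of the vector.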
Comment 8 Hongtao Liu 2024-01-25 07:02:42 UTC
maybe 

diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
index 1fd957288d4..6d321f9baef 100644
--- a/gcc/fold-const.cc
+++ b/gcc/fold-const.cc
@@ -8035,6 +8035,9 @@ native_encode_vector_part (const_tree expr, unsigned char *ptr, int len,
       unsigned int extract_elts = extract_bytes * elts_per_byte;
       for (unsigned int i = 0; i < extract_elts; ++i)
        {
+         /* Don't encode any bit beyond the range of the vector.  */
+         if (first_elt + i > count)
+           break;
Comment 9 GCC Commits 2024-01-25 07:39:11 UTC
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:578c7b91f418ebbef1bf169117815409e06f5197

commit r14-8413-g578c7b91f418ebbef1bf169117815409e06f5197
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Jan 24 14:55:49 2024 +0100

    tree-optimization/113576 - non-empty latch and may_be_zero vectorization
    
    We can't support niters with may_be_zero when we end up with a
    non-empty latch due to early exit peeling.  At least not in
    the simplistic way the vectorizer handles this now.  Disallow
    it again for exits that are not the last one.
    
            PR tree-optimization/113576
            * tree-vect-loop.cc (vec_init_loop_exit_info): Only allow
            exits with may_be_zero niters when its the last one.
    
            * gcc.dg/vect/pr113576.c: New testcase.
Comment 10 Hongtao Liu 2024-01-25 08:34:43 UTC
Fixed.
Comment 11 Richard Biener 2024-01-25 08:52:42 UTC
(In reply to Hongtao Liu from comment #8)
> maybe 
> 
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 1fd957288d4..6d321f9baef 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -8035,6 +8035,9 @@ native_encode_vector_part (const_tree expr, unsigned
> char *ptr, int len,
>        unsigned int extract_elts = extract_bytes * elts_per_byte;
>        for (unsigned int i = 0; i < extract_elts; ++i)
>         {
> +         /* Don't encode any bit beyond the range of the vector.  */
> +         if (first_elt + i > count)
> +           break;

Hmm.  I think that VECTOR_CST_ELT should have ICEd for out-of-bound
element queries but it seems to make up elements for us here.  Richard?

But yes, we do

      unsigned int extract_elts = extract_bytes * elts_per_byte;

and since native_encode_* and native_interpret_* operate on bytes we have
difficulties dealing with bit-precision entities with padding.

There's either the possibility to fail encoding when that happens or
do something else.  Note that RTL expansion will do

    case VECTOR_CST:
      {
        tree tmp = NULL_TREE; 
        if (VECTOR_MODE_P (mode))
          return const_vector_from_tree (exp);
        scalar_int_mode int_mode;
        if (is_int_mode (mode, &int_mode))
          {
            tree type_for_mode = lang_hooks.types.type_for_mode (int_mode, 1);
            if (type_for_mode)
              tmp = fold_unary_loc (loc, VIEW_CONVERT_EXPR,
                                    type_for_mode, exp);

which I think should always succeed (otherwise it falls back to expanding
a CTOR).  That means failing to encode/interpret might get into
store_constructor which I think will zero a register destination and thus
fill padding with zeros.

So yeah, something like this looks OK, but I think instead of only
testing against 'count' we should also test against TYPE_VECTOR_SUBPARTS
(that might be variable, so with known_gt).

Would be interesting to see whether this fixes the issue without the
now installed patch.
Comment 12 rguenther@suse.de 2024-01-25 09:15:17 UTC
On Thu, 25 Jan 2024, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576
> 
> --- Comment #7 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> index 1fd957288d4..33a8d539b4d 100644
> --- a/gcc/fold-const.cc
> +++ b/gcc/fold-const.cc
> @@ -8032,7 +8032,7 @@ native_encode_vector_part (const_tree expr, unsigned char
> *ptr, int len,
> 
>        unsigned int elts_per_byte = BITS_PER_UNIT / elt_bits;
>        unsigned int first_elt = off * elts_per_byte;
> -      unsigned int extract_elts = extract_bytes * elts_per_byte;
> +      unsigned int extract_elts = count;
>        for (unsigned int i = 0; i < extract_elts; ++i)
>         {
>           tree elt = VECTOR_CST_ELT (expr, first_elt + i);
> 
> Shouldn't we use count here?(it also fixed the hanged issue).

extract_bytes is capped by the buffer 'len':

      int total_bytes = CEIL (elt_bits * count, BITS_PER_UNIT);
..
      int extract_bytes = MIN (len, total_bytes - off);

we'd still need to effectively do that.  But yeah, using CEIL
makes extract_elts off.  Maybe we should simply calculate
extract_bits instead (but then use uint64 for that)
Comment 13 Richard Sandiford 2024-01-25 09:26:57 UTC
I don't think there's any principle that upper bits must be zero.
How do we end up with a pattern that depends on that being the case?
Comment 14 Hongtao Liu 2024-01-25 10:09:58 UTC
The attached testcase is OK now, but 502.gcc_r is still miscompiled after the installed patch.
Comment 15 Richard Biener 2024-01-25 10:13:30 UTC
(In reply to Richard Sandiford from comment #13)
> I don't think there's any principle that upper bits must be zero.
> How do we end up with a pattern that depends on that being the case?

I think the problem is the cbranch pattern which looks at all of the
QImode mask - but of course it doesn't know it's really V4BImode it's
working on ...

If there's no principle that the upper bits should be zero I think we
need a way for the target to say so.
Comment 16 Richard Sandiford 2024-01-25 12:24:55 UTC
(In reply to Richard Biener from comment #15)
> I think the problem is the cbranch pattern which looks at all of the
> QImode mask - but of course it doesn't know it's really V4BImode it's
> working on ...
Yeah.  Currently building an x86_64 toolchain to have a look, but I think whatever code uses a cbranch with higher precision than the inputs should mask off the excess bits beforehand.  cbranch can also be used for comparing two variable masks.

I suppose we could add a hook to say that padding bits of an integer mask must always be zero (and so it's expand's job to ensure that that holds for any mask operation).  But it feels dangerously close to TRULY_NOOP_TRUNCATION for integers.
Comment 17 Tamar Christina 2024-01-25 13:03:43 UTC
Well, the mid-end has generated the right precision.  The type it generates is vector(4) <signed-boolean:1> vexit_reduc_67;
so it does say it's a single-bit boolean.

Isn't this just an expand problem?
Comment 18 Richard Sandiford 2024-01-25 13:28:05 UTC
(In reply to Tamar Christina from comment #17)
> Well the mid-end has generated the right precision. The type it generates is
> vector(4) <signed-boolean:1> vexit_reduc_67;
> so it does say it's a single bit boolean.
> 
> Isn't this just an expand problem?
That's what I meant.  expand is using a QImode comparison to compare things with 4-bit precision, so I think the masking should happen at that point.

How about doing the masking in do_compare_and_jump?
Comment 19 rguenther@suse.de 2024-01-25 14:12:17 UTC
On Thu, 25 Jan 2024, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576
> 
> --- Comment #18 from Richard Sandiford <rsandifo at gcc dot gnu.org> ---
> (In reply to Tamar Christina from comment #17)
> > Well the mid-end has generated the right precision. The type it generates is
> > vector(4) <signed-boolean:1> vexit_reduc_67;
> > so it does say it's a single bit boolean.
> > 
> > Isn't this just an expand problem?
> That's what I meant.  expand is using a QImode comparison to compare things
> with 4-bit precision, so I think the masking should happen at that point.
> 
> How about doing the masking in do_compare_and_jump?

That sounds sensible.

Note that I wonder how to eliminate redundant maskings?  I suppose
eventually combine tracking nonzero bits where obvious would do
that?  For example for cmp:V4SI we know the bits will be zero but
I wonder if the RTL IL is obvious enough to derive this (or whether
there's a target hook for extra nonzero bit discovery, say for
unspecs).
Comment 20 Hongtao Liu 2024-01-26 02:24:22 UTC
> Note that I wonder how to eliminate redundant maskings?  I suppose
> eventually combine tracking nonzero bits where obvious would do
> that?  For example for cmp:V4SI we know the bits will be zero but
> I wonder if the RTL IL is obvious enough to derive this (or whether
> there's a target hook for extra nonzero bit discovery, say for
> unspecs).

I guess we need extra patterns to let combine know; we already have those for zero_extend.

;; Since vpcmpd implicitly clear the upper bits of dest, transform
;; vpcmpd + zero_extend to vpcmpd since the instruction
(define_insn_and_split "*<avx512>_cmp<V48H_AVX512VL:mode>3_zero_extend<SWI248x:mode>"
  [(set (match_operand:SWI248x 0 "register_operand")
        (zero_extend:SWI248x
          (unspec:<V48H_AVX512VL:avx512fmaskmode>
            [(match_operand:V48H_AVX512VL 1 "nonimmediate_operand")
             (match_operand:V48H_AVX512VL 2 "nonimmediate_operand")
             (match_operand:SI 3 "const_0_to_7_operand")]
            UNSPEC_PCMP)))]
Comment 21 Hongtao Liu 2024-01-26 03:12:14 UTC
typedef unsigned long mp_limb_t;
typedef long mp_size_t;
typedef unsigned long mp_bitcnt_t;

typedef mp_limb_t *mp_ptr;
typedef const mp_limb_t *mp_srcptr;

#define GMP_LIMB_BITS (sizeof(mp_limb_t) * 8)

#define GMP_LIMB_MAX (~ (mp_limb_t) 0)

mp_bitcnt_t
mpn_common_scan (mp_limb_t limb, mp_size_t i, mp_srcptr up, mp_size_t un,
                 mp_limb_t ux)
{
  unsigned cnt;

  while (limb == 0)
    {
      i++;
      if (i == un)
        return (ux == 0 ? ~(mp_bitcnt_t) 0 : un * GMP_LIMB_BITS);
      limb = ux ^ up[i];
    }
  return limb;
}

This one is miscompiled in 502.gcc_r

  <bb 8> [local count: 862990464]:
  _34 = ivtmp.20_20 * 32;
  vect__5.15_59 = MEM <const vector(4) long unsigned int> [(const mp_limb_t *)vectp.14_53 + _34 * 1];
  mask_patt_9.16_61 = vect__5.15_59 == vect_cst__60;
  ivtmp.20_32 = ivtmp.20_20 + 1;
  if (mask_patt_9.16_61 == { -1, -1, -1, -1 })
    goto <bb 5>; [94.50%]
  else
    goto <bb 9>; [5.50%]


is expanded to

.L18:
        movq    %rdi, %rdx
        incq    %rdi
        salq    $5, %rdx
        vpcmpeqq        (%rax,%rdx), %ymm3, %k0
        kmovb   %k0, %edx
        cmpb    $-1, %dl
        jne     .L21
Comment 22 Hongtao Liu 2024-01-26 05:43:44 UTC
typedef unsigned long mp_limb_t;
typedef long mp_size_t;
typedef unsigned long mp_bitcnt_t;

typedef mp_limb_t *mp_ptr;
typedef const mp_limb_t *mp_srcptr;

#define GMP_LIMB_BITS (sizeof(mp_limb_t) * 8)

#define GMP_LIMB_MAX (~ (mp_limb_t) 0)

mp_bitcnt_t
__attribute__((noipa))
mpn_common_scan (mp_limb_t limb, mp_size_t i, mp_srcptr up, mp_size_t un,
                 mp_limb_t ux)
{
  unsigned cnt;

  while (limb == 0)
    {
      i++;
      if (i == un)
        return (ux == 0 ? ~(mp_bitcnt_t) 0 : un * GMP_LIMB_BITS);
      limb = ux ^ up[i];
    }
  return limb;
}

int main ()
{
  mp_limb_t up[10000];
  for (int i = 0; i != 10000; i++)
    up[i] = 1 << 8;
  up[2000] = 1;
  mp_bitcnt_t res = mpn_common_scan (0, 0, up, 10000, 1 << 8);
  if (res != 257)
    __builtin_abort ();
  return 1;
}


Aborts with -O3 -march=skylake-avx512.
Comment 23 Tamar Christina 2024-01-29 17:28:01 UTC
*** Bug 113661 has been marked as a duplicate of this bug. ***
Comment 24 Tamar Christina 2024-01-29 18:31:09 UTC
Just to avoid confusion, are you still working on this one Richi?
Comment 25 Hongtao Liu 2024-01-30 03:29:15 UTC
(In reply to Tamar Christina from comment #24)
> Just to avoid confusion, are you still working on this one Richi?

I'm working on a patch to add a target hook as #c18 mentioned.
Comment 26 Tamar Christina 2024-01-30 07:46:34 UTC
Ah great, just checking it wasn't left unattended :)
Comment 27 Richard Biener 2024-01-30 08:26:45 UTC
(In reply to Hongtao Liu from comment #25)
> (In reply to Tamar Christina from comment #24)
> > Just to avoid confusion, are you still working on this one Richi?
> 
> I'm working on a patch to add a target hook as #c18 mentioned.

Not sure a target hook was suggested - I think it was suggested that
do_compare_and_jump always masks excess bits for integer mode vector masks?
Comment 28 Hongtao Liu 2024-01-30 08:34:03 UTC
I see we already mask off excess bits of integral-mode vector masks in store_constructor:

	/* Use sign-extension for uniform boolean vectors with
	   integer modes and single-bit mask entries.
	   Effectively "vec_duplicate" for bitmasks.  */
	if (elt_size == 1
	    && !TREE_SIDE_EFFECTS (exp)
	    && VECTOR_BOOLEAN_TYPE_P (type)
	    && SCALAR_INT_MODE_P (TYPE_MODE (type))
	    && (elt = uniform_vector_p (exp))
	    && !VECTOR_TYPE_P (TREE_TYPE (elt)))
	  {
	    rtx op0 = force_reg (TYPE_MODE (TREE_TYPE (elt)),
				 expand_normal (elt));
	    rtx tmp = gen_reg_rtx (mode);
	    convert_move (tmp, op0, 0);

	    /* Ensure no excess bits are set.
	       GCN needs this for nunits < 64.
	       x86 needs this for nunits < 8.  */
	    auto nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
	    if (maybe_ne (GET_MODE_PRECISION (mode), nunits))
	      tmp = expand_binop (mode, and_optab, tmp,
				  GEN_INT ((1 << nunits) - 1), target,
				  true, OPTAB_WIDEN);
	    if (tmp != target)
	      emit_move_insn (target, tmp);
	    break;
	  }
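That fixup reduces, for a QImode mask with 4 lanes as in this PR, to the following scalar model (a sketch with a made-up helper name, not the GCC code):

```c
#include <stdint.h>

/* store_constructor's excess-bit masking, modeled for a QImode mask:
   sign-extend the uniform single-bit boolean (-1 or 0), then clear the
   padding so only the low 'nunits' lanes survive.  */
static uint8_t splat_bool_mask (int8_t elt, unsigned nunits)
{
  uint8_t tmp = (uint8_t) elt;                 /* convert_move result */
  if (nunits < 8)                              /* precision != nunits */
    tmp &= (uint8_t) ((1u << nunits) - 1);     /* ensure no excess bits */
  return tmp;
}
```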
Comment 29 Richard Biener 2024-01-30 08:56:24 UTC
(In reply to Hongtao Liu from comment #28)
> I saw we already maskoff integral modes for vector mask in store_constructor
> 
> 	/* Use sign-extension for uniform boolean vectors with
> 	   integer modes and single-bit mask entries.
> 	   Effectively "vec_duplicate" for bitmasks.  */
> 	if (elt_size == 1
> 	    && !TREE_SIDE_EFFECTS (exp)
> 	    && VECTOR_BOOLEAN_TYPE_P (type)
> 	    && SCALAR_INT_MODE_P (TYPE_MODE (type))
> 	    && (elt = uniform_vector_p (exp))
> 	    && !VECTOR_TYPE_P (TREE_TYPE (elt)))
> 	  {
> 	    rtx op0 = force_reg (TYPE_MODE (TREE_TYPE (elt)),
> 				 expand_normal (elt));
> 	    rtx tmp = gen_reg_rtx (mode);
> 	    convert_move (tmp, op0, 0);
> 
> 	    /* Ensure no excess bits are set.
> 	       GCN needs this for nunits < 64.
> 	       x86 needs this for nunits < 8.  */
> 	    auto nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
> 	    if (maybe_ne (GET_MODE_PRECISION (mode), nunits))
> 	      tmp = expand_binop (mode, and_optab, tmp,
> 				  GEN_INT ((1 << nunits) - 1), target,
> 				  true, OPTAB_WIDEN);
> 	    if (tmp != target)
> 	      emit_move_insn (target, tmp);
> 	    break;
> 	  }

But that's just for CONSTRUCTORs; we got the VIEW_CONVERT_EXPR path for
VECTOR_CSTs.  But yeah, that _might_ argue we should perform the same
masking for VECTOR_CST expansion as well, instead of trying to fixup
in do_compare_and_jump?
Comment 30 Richard Sandiford 2024-01-30 10:12:24 UTC
(In reply to Richard Biener from comment #29)
> But that's just for CONSTRUCTORs, we got the VIEW_CONVERT_EXPR path for
> VECTOR_CSTs.  But yeah, that _might_ argue we should perform the same
> masking for VECTOR_CST expansion as well, instead of trying to fixup
> in do_compare_and_jump?
But then how would ~ be implemented for things like 4-bit masks?
If we use notqi2 then I assume the upper bits could be 1 rather than 0.
Comment 31 rguenther@suse.de 2024-01-30 10:39:00 UTC
On Tue, 30 Jan 2024, rsandifo at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576
> 
> --- Comment #30 from Richard Sandiford <rsandifo at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #29)
> > But that's just for CONSTRUCTORs, we got the VIEW_CONVERT_EXPR path for
> > VECTOR_CSTs.  But yeah, that _might_ argue we should perform the same
> > masking for VECTOR_CST expansion as well, instead of trying to fixup
> > in do_compare_and_jump?
> But then how would ~ be implemented for things like 4-bit masks?
> If we use notqi2 then I assume the upper bits could be 1 rather than 0.

Yeah, I guess it's similar to expand_expr_real_1 'reduce_bit_field'
handling - we'd need to insert fixup code in strategic places
(or for ~ use xor with the proper mask).

The difficulty is that we can't make the backend do this unless
there are insn operands that allow it to infer the real precision
of the mode.  And for most insns the excess bits are irrelevant
anyway.

Still the CTOR case showed wrong-code issues with GCN, which possibly
means it has the same issue with VECTOR_CSTs as well.  IIRC that
was that all vectors are 1024 bits, and its "fake" V4SImode insns
rely on accurate masked out upper bits.  That might hint that
compares are not enough here (but for non-compares the backend
might have a chance to fixup by infering the max. number of
active elements).

If we think that compares (but that would also be compares without
jump, aka a == b | c == d) are the only problematical case we can
also fixup at the uses rather than at the defs as 'reduce_bit_field'
tries to do.
Comment 32 Richard Biener 2024-02-01 08:59:23 UTC
Btw, AVX512 knotb will invert all 8 bits and there's no knot just affecting
the lowest 4 or 2 bits.

It all feels like disaster waiting to happen ;)

For example BIT_NOT_EXPR is RTL expanded like

    case BIT_NOT_EXPR:
      op0 = expand_expr (treeop0, subtarget,
                         VOIDmode, EXPAND_NORMAL);
      if (modifier == EXPAND_STACK_PARM)
        target = 0;
      /* In case we have to reduce the result to bitfield precision
         for unsigned bitfield expand this as XOR with a proper constant
         instead.  */
      if (reduce_bit_field && TYPE_UNSIGNED (type))
        {
          int_mode = SCALAR_INT_TYPE_MODE (type);
          wide_int mask = wi::mask (TYPE_PRECISION (type),
                                    false, GET_MODE_PRECISION (int_mode));
  
          temp = expand_binop (int_mode, xor_optab, op0,
                               immed_wide_int_const (mask, int_mode),
                               target, 1, OPTAB_LIB_WIDEN);

so we could, for VECTOR_BOOLEAN_TYPE_P with integer mode and
effective bit-precision set reduce_bit_field and fixup the fallout
(not sure why the above is only for TYPE_UNSIGNED).

At least it feels similar and doing things the opposite for vectors
(fixing up at uses) would be odd?
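Scalar model of the two expansions for a 4-bit mask (a sketch; 0x0f is the precision mask for 4 lanes): a plain NOT flips the padding too, while the reduce_bit_field-style XOR with the precision mask leaves the padding zero.

```c
#include <stdint.h>

/* BIT_NOT_EXPR on a 4-lane QImode mask: knotb/notqi2 flips all 8 bits,
   the XOR-with-precision-mask variant only the 4 defined ones.  */
static uint8_t mask_not_plain (uint8_t m) { return (uint8_t) ~m; }
static uint8_t mask_not_xor   (uint8_t m) { return m ^ 0x0f; }
```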
Comment 33 Jakub Jelinek 2024-02-01 09:06:46 UTC
I guess inverting just 2 or 4 bits can be done with kxorb or knotb + kandb, but we need to have the mask forced into a mask register.
Comment 34 Richard Sandiford 2024-02-01 09:18:55 UTC
(In reply to Richard Biener from comment #32)
> Btw, AVX512 knotb will invert all 8 bits and there's no knot just affecting
> the lowest 4 or 2 bits.
> 
> It all feels like disaster waiting to happen ;)
Yes :)

> For example BIT_NOT_EXPR is RTL expanded like
> 
>     case BIT_NOT_EXPR:
>       op0 = expand_expr (treeop0, subtarget,
>                          VOIDmode, EXPAND_NORMAL);
>       if (modifier == EXPAND_STACK_PARM)
>         target = 0;
>       /* In case we have to reduce the result to bitfield precision
>          for unsigned bitfield expand this as XOR with a proper constant
>          instead.  */
>       if (reduce_bit_field && TYPE_UNSIGNED (type))
>         {
>           int_mode = SCALAR_INT_TYPE_MODE (type);
>           wide_int mask = wi::mask (TYPE_PRECISION (type),
>                                     false, GET_MODE_PRECISION (int_mode));
>   
>           temp = expand_binop (int_mode, xor_optab, op0,
>                                immed_wide_int_const (mask, int_mode),
>                                target, 1, OPTAB_LIB_WIDEN);
> 
> so we could, for VECTOR_BOOLEAN_TYPE_P with integer mode and
> effective bit-precision set reduce_bit_field and fixup the fallout
> (not sure why the above is only for TYPE_UNSIGNED).
>
> At least it feels similar and doing things the opposite for vectors
> (fixing up at uses) would be odd?
Do you know why we take this approach for integers?  Is it for
correctness?  Or is it supposed to be more optimal?

I can imagine that, for arithmetic types, there are going to be many
more instances where upper bits matter (division, right shifts,
MIN/MAX, etc.).  So perhaps reducing every result is a good
trade-off there.

But there's an argument that it should be rare for the padding
bits in a vector to matter, since very few things would look at the
padding bits anyway.  So perhaps the cost should be borne by the
operations that need canonical integers.

Not a strong opinion though, more just devil's advocate.

There again, if e.g. the x86 API guarantees memcmp equality between
two masks whose significant bits are equal, then we probably have
no choice.
Comment 35 Richard Biener 2024-02-01 10:04:45 UTC
(In reply to Richard Sandiford from comment #34)
> (In reply to Richard Biener from comment #32)
> > Btw, AVX512 knotb will invert all 8 bits and there's no knot just affecting
> > the lowest 4 or 2 bits.
> > 
> > It all feels like disaster waiting to happen ;)
> Yes :)
> 
> > For example BIT_NOT_EXPR is RTL expanded like
> > 
> >     case BIT_NOT_EXPR:
> >       op0 = expand_expr (treeop0, subtarget,
> >                          VOIDmode, EXPAND_NORMAL);
> >       if (modifier == EXPAND_STACK_PARM)
> >         target = 0;
> >       /* In case we have to reduce the result to bitfield precision
> >          for unsigned bitfield expand this as XOR with a proper constant
> >          instead.  */
> >       if (reduce_bit_field && TYPE_UNSIGNED (type))
> >         {
> >           int_mode = SCALAR_INT_TYPE_MODE (type);
> >           wide_int mask = wi::mask (TYPE_PRECISION (type),
> >                                     false, GET_MODE_PRECISION (int_mode));
> >   
> >           temp = expand_binop (int_mode, xor_optab, op0,
> >                                immed_wide_int_const (mask, int_mode),
> >                                target, 1, OPTAB_LIB_WIDEN);
> > 
> > so we could, for VECTOR_BOOLEAN_TYPE_P with integer mode and
> > effective bit-precision set reduce_bit_field and fixup the fallout
> > (not sure why the above is only for TYPE_UNSIGNED).
> >
> > At least it feels similar and doing things the opposite for vectors
> > (fixing up at uses) would be odd?
> Do you know why we take this approach for integers?  Is it for
> correctness?  Or is it supposed to be more optimal?

It's done for correctness.  The main thing was bitfields > int which
end up as bit-precision types in GIMPLE.

> I can imagine that, for arithmetic types, there are going to be many
> more instances where upper bits matter (division, right shifts,
> MIN/MAX, etc.).  So perhaps reducing every result is a good
> trade-off there.

I think it was the easiest place to fix up and to make sure later
RTL opts and backends do not interfere.

> But there's an argument that it should be rare for the padding
> bits in a vector to matter, since very few things would look at the
> padding bits anyway.  So perhaps the cost should be borne by the
> operations that need canonical integers.

But what happens when an operation not needing canonical integers
is transformed, say by combine or simplify-rtx, to one that does?
I think the reduce_bitfield code was trying to be safe here.

Actual define_insns not needing canonical padding could do what we now
require for scalar shifts - they could match a variant with
the mask canonicalization op and hope for combine eliminating it.
Of course that will explode in case it's the majority of cases ...

> Not a strong opinion though, more just devil's advocate.

Yeah.  The important thing seems to be that the info that this is a
bit-precision QImode isn't lost ... which suggests that using
plain integer modes was a very bad choice ...

> There again, if e.g. the x86 API guarantees memcmp equality between
> two masks whose significant bits are equal, then we probably have
> no choice.

That's a good question, possibly most relevant for the OpenMP SIMD ABI.

The AVX512 APIs use integer types everywhere, there's no intrinsic
for ktest itself, but _mm512_kortestz and kortestc and also _mm512_knot
(oddly only for HImode).  So it at least seems to - from a quick look -
be "broken" in the intrinsic API as well.  At least we have
_mm_cmp_sd_mask producing a QImode mask, _mm512_knot inverting too
many bits.  So the user must be aware of the "padding".

One could argue it's the vectorizers job to fixup then, but of course
it doesn't get to see the correct vector types here.

Given we have the fixup in CTOR expansion the issue at hand could be
fixed by mirroring that in VECTOR_CST expansion for now.

Andrew - I suppose GCN also has cbranch, can you try to check what happens
for < 8 lane vector modes there?
Comment 36 Richard Biener 2024-02-07 15:08:42 UTC
For example with AVX512VL and the following, using -O -fgimple -mavx512vl
we get simply

        notl    %esi
        orl     %esi, %edi
        cmpb    $15, %dil
        je      .L6

typedef long v4si __attribute__((vector_size(4*sizeof(long))));
typedef v4si v4sib __attribute__((vector_mask));
typedef _Bool sbool1 __attribute__((signed_bool_precision(1)));

void __GIMPLE (ssa) foo (v4sib v1, v4sib v2)
{
  v4sib tem;

__BB(2):
  tem_5 = ~v2_2(D);
  tem_3 = v1_1(D) | tem_5;
  tem_4 = _Literal (v4sib) { _Literal (sbool1) -1, _Literal (sbool1) -1, _Literal (sbool1) -1, _Literal (sbool1) -1 };
  if (tem_3 == tem_4)
    goto __BB3;
  else
    goto __BB4;

__BB(3):
  __builtin_abort ();

__BB(4):
  return;
}


the question is whether that matches the semantics of GIMPLE (the padding
is inverted, too), whether it invokes undefined behavior (don't do it - it
seems for people using intrinsics that's what it is?) or whether we
should avoid affecting padding.

Note after the patch I proposed on the mailing list the constant mask is
now expanded with zero padding.
Comment 37 Hongtao Liu 2024-02-08 01:18:04 UTC
(In reply to Richard Biener from comment #36)
> For example with AVX512VL and the following, using -O -fgimple -mavx512vl
> we get simply
> 
>         notl    %esi
>         orl     %esi, %edi
>         cmpb    $15, %dil
>         je      .L6
> 
> typedef long v4si __attribute__((vector_size(4*sizeof(long))));
> typedef v4si v4sib __attribute__((vector_mask));
> typedef _Bool sbool1 __attribute__((signed_bool_precision(1)));
> 
> void __GIMPLE (ssa) foo (v4sib v1, v4sib v2)
> {
>   v4sib tem;
> 
> __BB(2):
>   tem_5 = ~v2_2(D);
>   tem_3 = v1_1(D) | tem_5;
>   tem_4 = _Literal (v4sib) { _Literal (sbool1) -1, _Literal (sbool1) -1,
> _Literal (sbool1) -1, _Literal (sbool1) -1 };
>   if (tem_3 == tem_4)
>     goto __BB3;
>   else
>     goto __BB4;
> 
> __BB(3):
>   __builtin_abort ();
> 
> __BB(4):
>   return;
> }
> 
> 
> the question is whether that matches the semantics of GIMPLE (the padding
> is inverted, too), whether it invokes undefined behavior (don't do it - it
> seems for people using intrinsics that's what it is?) or whether we
> should avoid affecting padding.
> 
> Note after the patch I proposed on the mailing list the constant mask is
> now expanded with zero padding.

I think we should also mask off the upper bits of variable mask?

        notl    %esi
        orl     %esi, %edi
        notl    %edi
        andl    $15, %edi
        je      .L3
Comment 38 Hongtao Liu 2024-02-08 01:58:47 UTC
> I think we should also mask off the upper bits of variable mask?
> 
>         notl    %esi
>         orl     %esi, %edi
>         notl    %edi
>         andl    $15, %edi
>         je      .L3

with -mbmi, it's 

        andn    %esi, %edi, %edi
        andl    $15, %edi
        je      .L3
Comment 39 Hongtao Liu 2024-02-08 04:22:53 UTC
> > the question is whether that matches the semantics of GIMPLE (the padding
> > is inverted, too), whether it invokes undefined behavior (don't do it - it
> > seems for people using intrinsics that's what it is?)
For the intrinsic, the instructions only care about the lower bits, so it's not a big issue? And it sounds like a similar issue to _BitInt(4)/_BitInt(2); I assume there's garbage in the upper bits.
Comment 40 Jakub Jelinek 2024-02-08 07:36:41 UTC
For unsigned _BitInt(4) or unsigned _BitInt(2) we mask it whenever loading from memory or function argument or whatever other ABI specific spot (and also when storing because that is how RTL expects it; because of that we don't mask it when using it from say automatic variables where we know we've initialized it ourselves).
Comment 41 Richard Biener 2024-02-08 10:35:27 UTC
(In reply to Hongtao Liu from comment #38)
> > I think we should also mask off the upper bits of variable mask?
> > 
> >         notl    %esi
> >         orl     %esi, %edi
> >         notl    %edi
> >         andl    $15, %edi
> >         je      .L3
> 
> with -mbmi, it's 
> 
>         andn    %esi, %edi, %edi
>         andl    $15, %edi
>         je      .L3

Well, yes, the discussion in this bug was whether to do this at consumers
(that's sth new) or with all mask operations (that's how we handle
bit-precision integer operations, so it might be relatively easy to
do that - specifically spot the places eventually needing adjustment).

There's do_store_flag to fixup for uses not in branches and
do_compare_and_jump for conditional jumps.

Note the AND is removed by combine if I add it:

Successfully matched this instruction:
(set (reg:CCZ 17 flags)
    (compare:CCZ (and:HI (not:HI (subreg:HI (reg:QI 102 [ tem_3 ]) 0))
            (const_int 15 [0xf]))
        (const_int 0 [0])))

(*testhi_not)

-    9: {r103:QI=r102:QI&0xf;clobber flags:CC;}
+      REG_DEAD r99:QI
+    9: NOTE_INSN_DELETED
+   12: flags:CCZ=cmp(~r102:QI#0&0xf,0)
       REG_DEAD r102:QI
-      REG_UNUSED flags:CC
-   12: flags:CCZ=cmp(r103:QI,0xf)
-      REG_DEAD r103:QI

and we get

foo:
.LFB0:
        .cfi_startproc
        notl    %esi
        orl     %esi, %edi
        notl    %edi
        testb   $15, %dil
        je      .L6
        ret

which I'm not sure is OK?

diff --git a/gcc/dojump.cc b/gcc/dojump.cc
index e2d2b3cb111..784707c1e55 100644
--- a/gcc/dojump.cc
+++ b/gcc/dojump.cc
@@ -1266,6 +1266,7 @@ do_compare_and_jump (tree treeop0, tree treeop1, enum rtx_code signed_code,
   machine_mode mode;
   int unsignedp;
   enum rtx_code code;
+  unsigned HOST_WIDE_INT nunits;
 
   /* Don't crash if the comparison was erroneous.  */
   op0 = expand_normal (treeop0);
@@ -1308,6 +1309,18 @@ do_compare_and_jump (tree treeop0, tree treeop1, enum rtx_code signed_code,
       emit_insn (targetm.gen_canonicalize_funcptr_for_compare (new_op1, op1));
       op1 = new_op1;
     }
+  else if (VECTOR_BOOLEAN_TYPE_P (type)
+          && mode == QImode
+          && TYPE_VECTOR_SUBPARTS (type).is_constant (&nunits)
+          && nunits < BITS_PER_UNIT)
+    {
+      op0 = expand_binop (mode, and_optab, op0,
+                         GEN_INT ((1 << nunits) - 1), NULL_RTX,
+                         true, OPTAB_WIDEN);
+      op1 = expand_binop (mode, and_optab, op1,
+                         GEN_INT ((1 << nunits) - 1), NULL_RTX,
+                         true, OPTAB_WIDEN);
+    }
 
   do_compare_rtx_and_jump (op0, op1, code, unsignedp, treeop0, mode,
                           ((mode == BLKmode)
Comment 42 Richard Biener 2024-02-08 10:48:37 UTC
And the do_store_flag part:

diff --git a/gcc/expr.cc b/gcc/expr.cc
index fc5e998e329..44d64274071 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -13693,6 +13693,19 @@ do_store_flag (sepops ops, rtx target, machine_mode mode)
     subtarget = 0;
 
   expand_operands (arg0, arg1, subtarget, &op0, &op1, EXPAND_NORMAL);
+  unsigned HOST_WIDE_INT nunits;
+  if (VECTOR_BOOLEAN_TYPE_P (type)
+      && operand_mode == QImode
+      && TYPE_VECTOR_SUBPARTS (type).is_constant (&nunits)
+      && nunits < BITS_PER_UNIT)
+    {
+      op0 = expand_binop (mode, and_optab, op0,
+                         GEN_INT ((1 << nunits) - 1), NULL_RTX,
+                         true, OPTAB_WIDEN);
+      op1 = expand_binop (mode, and_optab, op1,
+                         GEN_INT ((1 << nunits) - 1), NULL_RTX,
+                         true, OPTAB_WIDEN);
+    }
 
   if (target == 0)
     target = gen_reg_rtx (mode);


for the testcase

typedef long v4si __attribute__((vector_size(4*sizeof(long))));
typedef v4si v4sib __attribute__((vector_mask));
typedef _Bool sbool1 __attribute__((signed_bool_precision(1)));
_Bool x;
void __GIMPLE (ssa) foo (v4sib v1, v4sib v2)
{
  v4sib tem;
  _Bool _7;

__BB(2):
  tem_5 = ~v2_2(D);
  tem_3 = v1_1(D) | tem_5;
  tem_4 = _Literal (v4sib) { _Literal (sbool1) -1, _Literal (sbool1) -1, _Literal (sbool1) -1, _Literal (sbool1) -1 };
  _7 = tem_3 == tem_4;
  x = _7;
  return;
}
Comment 43 Hongtao Liu 2024-02-08 14:58:28 UTC
> Well, yes, the discussion in this bug was whether to do this at consumers
> (that's sth new) or with all mask operations (that's how we handle
> bit-precision integer operations, so it might be relatively easy to
> do that - specifically spot the places eventually needing adjustment).
> 
> There's do_store_flag to fixup for uses not in branches and
> do_compare_and_jump for conditional jumps.

reasonable enough for me.
Comment 44 Hongtao Liu 2024-02-08 15:07:14 UTC
> 
> Note the AND is removed by combine if I add it:
> 
> Successfully matched this instruction:
> (set (reg:CCZ 17 flags)
>     (compare:CCZ (and:HI (not:HI (subreg:HI (reg:QI 102 [ tem_3 ]) 0))
>             (const_int 15 [0xf]))
>         (const_int 0 [0])))
> 
> (*testhi_not)
> 
> -    9: {r103:QI=r102:QI&0xf;clobber flags:CC;}
> +      REG_DEAD r99:QI
> +    9: NOTE_INSN_DELETED
> +   12: flags:CCZ=cmp(~r102:QI#0&0xf,0)
>        REG_DEAD r102:QI
> -      REG_UNUSED flags:CC
> -   12: flags:CCZ=cmp(r103:QI,0xf)
> -      REG_DEAD r103:QI
> 
> and we get
> 
> foo:
> .LFB0:
>         .cfi_startproc
>         notl    %esi
>         orl     %esi, %edi
>         notl    %edi
>         testb   $15, %dil
>         je      .L6
>         ret
> 
> which I'm not sure is OK?
> 

Yes, I think it's on purpose:

11508;; Split and;cmp (as optimized by combine) into not;test
11509;; Except when TARGET_BMI provides andn (*andn_<mode>_ccno).
11510(define_insn_and_split "*test<mode>_not"
11511  [(set (reg:CCZ FLAGS_REG)
11512        (compare:CCZ
11513          (and:SWI
11514            (not:SWI (match_operand:SWI 0 "register_operand"))
11515            (match_operand:SWI 1 "<nonmemory_szext_operand>"))
11516          (const_int 0)))]
11517  "ix86_pre_reload_split ()
11518   && (!TARGET_BMI || !REG_P (operands[1]))"
11519  "#"
11520  "&& 1"
11521  [(set (match_dup 2) (not:SWI (match_dup 0)))
11522   (set (reg:CCZ FLAGS_REG)
11523        (compare:CCZ (and:SWI (match_dup 2) (match_dup 1))
11524                     (const_int 0)))]
11525  "operands[2] = gen_reg_rtx (<MODE>mode);")
11526
11527;; Split and;cmp (as optimized by combine) into andn;cmp $0
11528(define_insn_and_split "*test<mode>_not_doubleword"
Comment 45 Hongtao Liu 2024-02-08 15:19:56 UTC
> > There's do_store_flag to fixup for uses not in branches and
> > do_compare_and_jump for conditional jumps.
> 
> reasonable enough for me.
I mean we only handle it at consumers where the upper bits matter.
Comment 46 GCC Commits 2024-02-14 12:07:06 UTC
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:5352ede92483b949e811cbdcdfaec5378f3e06d6

commit r14-8975-g5352ede92483b949e811cbdcdfaec5378f3e06d6
Author: Richard Biener <rguenther@suse.de>
Date:   Fri Feb 9 08:15:44 2024 +0100

    middle-end/113576 - zero padding of vector bools when expanding compares
    
    The following zeros paddings of vector bools when expanding compares
    and the mode used for the compare is an integer mode.  In that case
    targets cannot distinguish between a 4 element and 8 element vector
    compare (both get to the QImode compare optab) so we have to do the
    job in the middle-end.
    
            PR middle-end/113576
            * expr.cc (do_store_flag): For vector bool compares of vectors
            with padding zero that.
            * dojump.cc (do_compare_and_jump): Likewise.
Comment 47 Richard Biener 2024-02-14 12:38:38 UTC
This should now be fixed, but I think the issue might be latent on branches for GCN or for AVX512 via intrinsics (fully masked loops for AVX512 are also only
available in GCC 14).

Let's close this P1; we'll have to consider backporting if we manage to create a miscompiled testcase for older releases.
Comment 48 Uroš Bizjak 2024-02-14 13:48:12 UTC
The runtime testcase fails on non-AVX512F x86 targets due to:

/* { dg-do run } */
/* { dg-options "-O3" } */
/* { dg-additional-options "-march=skylake-avx512" { target { x86_64-*-* i?86-*-* } } } */

but check_vect() only checks runtime support up to AVX2.
Comment 49 Richard Biener 2024-02-14 13:54:09 UTC
(In reply to Uroš Bizjak from comment #48)
> The runtime testcase fails on non-AVX512F x86 targets due to:
> 
> /* { dg-do run } */
> /* { dg-options "-O3" } */
> /* { dg-additional-options "-march=skylake-avx512" { target { x86_64-*-*
> i?86-*-* } } } */
> 
> but check_vect() only checks runtime support up to AVX2.

Hmm, can we fix that?  We could change the above to { target avx512f_runtime }
but that really only checks for AVX512F, not say AVX512VL ...

I do remember using -mavx512vl wasn't enough to trigger the miscompile
nor did it trigger with -march=znver4 ... so I stuck to skylake-avx512 :/
Comment 50 Jakub Jelinek 2024-02-14 13:59:02 UTC
(In reply to Richard Biener from comment #49)
> (In reply to Uroš Bizjak from comment #48)
> > The runtime testcase fails on non-AVX512F x86 targets due to:
> > 
> > /* { dg-do run } */
> > /* { dg-options "-O3" } */
> > /* { dg-additional-options "-march=skylake-avx512" { target { x86_64-*-*
> > i?86-*-* } } } */
> > 
> > but check_vect() only checks runtime support up to AVX2.
> 
> Hmm, can we fix that?  We could change the above to { target avx512f_runtime
> }
> but that really only checks for AVX512F, not say AVX512VL ...
> 
> I do remember using -mavx512vl wasn't enough to trigger the miscompile
> nor did it trigger with -march=znver4 ... so I stuck to skylake-avx512 :/

It is certainly preferable to add -mavx512{bw,dq,vl} or whatever the testcase actually needs and then one can
#define AVX512BW
#define AVX512DQ
#define AVX512VL
#include "avx512-check.h"
and get checks for all those.
Comment 51 Jakub Jelinek 2024-02-14 14:01:44 UTC
From the -mavx* options I think -march=skylake-avx512 implies
-mavx512{f,cd,vl,bw,dq} but -mavx512f is implied by any of the latter 4.
Comment 52 Uroš Bizjak 2024-02-14 14:31:11 UTC
Created attachment 57424 [details]
Proposed testsuite patch

This patch fixes the failure for me (+ some other dg.exp/vect inconsistencies).
Comment 53 Jakub Jelinek 2024-02-14 14:36:02 UTC
Comment on attachment 57424 [details]
Proposed testsuite patch

As skylake-avx512 is -mavx512{f,cd,bw,dq,vl}, requiring just the avx512f effective target and testing it at runtime IMHO isn't enough.
For dg-do run testcases I really think we should avoid those -march= options, because they imply a lot of other stuff, BMI, LZCNT, ...
Comment 54 Richard Biener 2024-02-14 15:07:20 UTC
Please also verify that the bug reproduces with the altered set of options.

What's the reason to have avx512-check.h in addition to tree-vect.h?
At least for the vectorizer testsuite the latter is the canonical one,
can we please merge AVX512 support therein?
Comment 55 Uroš Bizjak 2024-02-14 15:26:55 UTC
(In reply to Jakub Jelinek from comment #53)
> Comment on attachment 57424 [details]
> Proposed testsuite patch
> 
> As skylake-avx512 is -mavx512{f,cd,bw,dq,vl}, requiring just avx512f
> effective target and testing it at runtime IMHO isn't enough.
> For dg-do run testcases I really think we should avoid those -march=
> options, because it means a lot of other stuff, BMI, LZCNT, ...

I think that addition of

+# if defined(__AVX512VL__)
+    want_level = 7, want_b = bit_AVX512VL;
+# elif defined(__AVX512F__)
+    want_level = 7, want_b = bit_AVX512F;
+# elif defined(__AVX2__)

to check_vect would solve all current uses in gcc.dg/vect.
Comment 56 Uroš Bizjak 2024-02-14 20:18:18 UTC
The testcase is fixed with g:430c772be3382134886db33133ed466c02efc71c
Comment 57 Hongtao Liu 2024-02-18 02:10:51 UTC
> For dg-do run testcases I really think we should avoid those -march=
> options, because it means a lot of other stuff, BMI, LZCNT, ...

Makes sense.