Bug 113431 - [14 Regression] Wrong code at -O3
Summary: [14 Regression] Wrong code at -O3
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 14.0
: P1 normal
Target Milestone: 14.0
Assignee: Richard Biener
URL:
Keywords: needs-bisection, wrong-code
Depends on:
Blocks:
 
Reported: 2024-01-16 21:45 UTC by Patrick O'Neill
Modified: 2024-03-19 09:44 UTC (History)
9 users (show)

See Also:
Host:
Target: riscv aarch64-linux-gnu x86_64-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed: 2024-01-17 00:00:00


Attachments
32-bit sparc-sun-solaris2.11 pr113431.c.185t.slp1 (2.47 KB, text/plain)
2024-01-19 08:56 UTC, Rainer Orth
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Patrick O'Neill 2024-01-16 21:45:13 UTC
Testcase:
int a[2][9];
int b;
int c;
int d;
int main() {
  for (b = 0;b < 2; b++)
    for (long e = 8; e > 0; e--)
      a[b][e] = a[0][1] == 0;

  if (!a[1][1])
    return 0;
  else
    return 1;
}

Commands:
> /scratch/tc-testing/tc-jan-16-trunk/build-rv64gcv/bin/riscv64-unknown-linux-gnu-gcc -O3 -march=rv64gcv red_copy.c -o user-config.out
> QEMU_CPU=rv64,vlen=128,v=true,vext_spec=v1.0,Zve32f=true,Zve64f=true timeout --verbose -k 0.1 1 /scratch/tc-testing/tc-jan-16-trunk/build-rv64gcv/bin/qemu-riscv64 user-config.out
> echo $?
1

> /scratch/tc-testing/tc-jan-16-trunk/build-rv64gcv/bin/riscv64-unknown-linux-gnu-gcc -O2 -march=rv64gcv red_copy.c -o user-config.out
> QEMU_CPU=rv64,vlen=128,v=true,vext_spec=v1.0,Zve32f=true,Zve64f=true timeout --verbose -k 0.1 1 /scratch/tc-testing/tc-jan-16-trunk/build-rv64gcv/bin/qemu-riscv64 user-config.out
> echo $?
0

> /scratch/tc-testing/tc-jan-16-trunk/build-rv64gcv/bin/riscv64-unknown-linux-gnu-gcc -O3 -march=rv64gc red_copy.c -o user-config.out
> QEMU_CPU=rv64,vlen=128,v=true,vext_spec=v1.0,Zve32f=true,Zve64f=true timeout --verbose -k 0.1 1 /scratch/tc-testing/tc-jan-16-trunk/build-rv64gcv/bin/qemu-riscv64 user-config.out
> echo $?
0

When a[1][1] is set a[0][1] is true(1) so a[1][1] == 0 should be false(0). At -O3 rv64gcv the condition is evaluated incorrectly.

Godbolt: https://godbolt.org/z/Tfz3d646e
Comment 1 Patrick O'Neill 2024-01-16 21:46:29 UTC
Forgot to mention - this was found with the fuzzer.
Comment 2 JuzheZhong 2024-01-16 22:37:06 UTC
Will take a look today.
Comment 3 Patrick O'Neill 2024-01-16 22:57:39 UTC
This has the same behavior with --param=vsetvl-strategy=simple so this is probably not a vsetvl issue.

> /scratch/tc-testing/tc-jan-16-vsetvl-toggle/build-rv64gcv/bin/riscv64-unknown-linux-gnu-gcc -O3 -march=rv64gcv red_copy.c -o user-config.out --param=vsetvl-strategy=simple
> QEMU_CPU=rv64,vlen=128,v=true,vext_spec=v1.0,Zve32f=true,Zve64f=true timeout --verbose -k 0.1 1 /scratch/tc-testing/tc-jan-16-trunk/build-rv64gcv/bin/qemu-riscv64 user-config.out
> echo $?
1
Comment 4 JuzheZhong 2024-01-17 01:38:44 UTC
a[0][1] seems to be undefined value.

And the test seems to trigger undefined behavior.

I just checked ARM SVE and RVV.

The vectorized IR is totally the same and I don't see anything obviously wrong in
RVV assembler.

https://godbolt.org/z/jz866Kfsv

Hi, Andrew. Could you confirm it whether ARM SVE has the same behavior as RVV?
That is, with vectorized, return 1, whereas no vectorized, return 0;
Comment 5 Patrick O'Neill 2024-01-17 01:42:34 UTC
(In reply to JuzheZhong from comment #4)
> a[0][1] seems to be undefined value.

a is a global variable so the elements are initialized to 0. a[0][1] is within the bounds a[2][9].
Comment 6 JuzheZhong 2024-01-17 01:44:50 UTC
(In reply to Patrick O'Neill from comment #5)
> (In reply to JuzheZhong from comment #4)
> > a[0][1] seems to be undefined value.
> 
> a is a global variable so the elements are initialized to 0. a[0][1] is
> within the bounds a[2][9].

Yes, I believe a[0][1] is 0. Then for all a[b][e] == a[0][1] == 0 should always be
1. 

Then if (!a[1][1]) should be false and then return 1;

I failed to see anything wrong here.
Comment 7 Patrick O'Neill 2024-01-17 01:53:50 UTC
(In reply to JuzheZhong from comment #6)
> (In reply to Patrick O'Neill from comment #5)
> > (In reply to JuzheZhong from comment #4)
> > > a[0][1] seems to be undefined value.
> > 
> > a is a global variable so the elements are initialized to 0. a[0][1] is
> > within the bounds a[2][9].
> 
> Yes, I believe a[0][1] is 0. Then for all a[b][e] == a[0][1] == 0 should
> always be
> 1. 
> 
> Then if (!a[1][1]) should be false and then return 1;
> 
> I failed to see anything wrong here.

The value inside a[0][1] changes during the for loop. It starts out as 0 but is changed to 1 when b=0,e=1. The elements in a after b=0,e=1 should be 1.

Since a[1][1] is set after b=0,e=1 it should evaluate a[0][1]==0 to false and set a[1][1] to 0.
Comment 8 JuzheZhong 2024-01-17 01:56:12 UTC
(In reply to Patrick O'Neill from comment #7)
> (In reply to JuzheZhong from comment #6)
> > (In reply to Patrick O'Neill from comment #5)
> > > (In reply to JuzheZhong from comment #4)
> > > > a[0][1] seems to be undefined value.
> > > 
> > > a is a global variable so the elements are initialized to 0. a[0][1] is
> > > within the bounds a[2][9].
> > 
> > Yes, I believe a[0][1] is 0. Then for all a[b][e] == a[0][1] == 0 should
> > always be
> > 1. 
> > 
> > Then if (!a[1][1]) should be false and then return 1;
> > 
> > I failed to see anything wrong here.
> 
> The value inside a[0][1] changes during the for loop. It starts out as 0 but
> is changed to 1 when b=0,e=1. The elements in a after b=0,e=1 should be 1.
> 
> Since a[1][1] is set after b=0,e=1 it should evaluate a[0][1]==0 to false
> and set a[1][1] to 0.

So I doubt it is not RISC-V backend issue. It's middle-end loop vectorizer issue.
It's vectorized with evaluating whether it is 0 or not.

Let's wait for Andrew confirm it.
Comment 9 Andrew Pinski 2024-01-17 03:33:29 UTC
So some good/bad news, it even fails on x86_64-linux-gnu.
Comment 10 Richard Biener 2024-01-17 08:08:43 UTC
Looks like a dependence issue - we vectorize it as

  _24 = a[0][1];
  vect_cst__13 = {_24, _24, _24, _24};
  mask__19.9_3 = { 0, 0, 0, 0 } == vect_cst__13;
  vect_patt_31.10_1 = VEC_COND_EXPR <mask__19.9_3, { 1, 1, 1, 1 }, { 0, 0, 0, 0 }>;
  
  <bb 3> [local count: 29488088]:
  # b.2_66 = PHI <_4(5), 0(2)>
  # ivtmp_45 = PHI <ivtmp_38(5), 2(2)>
  # ivtmp_8 = PHI <ivtmp_14(5), &MEM <int[2][9]> [(void *)&a + 4B](2)>
  # ivtmp_23 = PHI <ivtmp_25(5), 0(2)>
  _18 = a[0][1];
  _19 = _18 == 0;
  _20 = (int) _19;
  MEM <vector(4) int> [(int *)ivtmp_8] = vect_patt_31.10_1;
  MEM <vector(4) int> [(int *)ivtmp_8 + 16B] = vect_patt_31.10_1;
  ivtmp_22 = ivtmp_8 + 36;
  _4 = b.2_66 + 1;
  ivtmp_38 = ivtmp_45 - 1;
  ivtmp_14 = ivtmp_8 + 36;
  ivtmp_25 = ivtmp_23 + 1;
  if (ivtmp_25 < 2)
    goto <bb 5>; [50.00%]
  else
    goto <bb 4>; [50.00%]

  <bb 5> [local count: 14744044]:
  goto <bb 3>; [100.00%]

(compute_affine_dependence
  ref_a: a[0][1], stmt_a: _18 = a[0][1];
  ref_b: a[b.2_66][1], stmt_b: a[b.2_66][1] = _20;
(analyze_overlapping_iterations
  (chrec_a = 1)
  (chrec_b = 1)
(analyze_ziv_subscript
)
  (overlap_iterations_a = [0])
  (overlap_iterations_b = [0]))
(analyze_overlapping_iterations
  (chrec_a = 0)
  (chrec_b = {0, +, 1}<nw>_1)
(analyze_siv_subscript
)
  (overlap_iterations_a = [0])
  (overlap_iterations_b = [0]))
(Dependence relation cannot be represented by distance vector.)
)

t.c:8:21: missed:   versioning for alias required: bad dist vector for a[0][1] and a[b.2_66][1]
consider run-time aliasing test between a[0][1] and a[b.2_66][1]

that looks OK, but we're not emitting such alias test.  We're using SLP
since we can handle non-grouped loads (guess that what it will bisect to).
Then:

t.c:6:16: note:   === vect_prune_runtime_alias_test_list ===
t.c:6:16: note:   no need for alias check between a[0][1] and a[b.2_66][1] when VF is 1
t.c:6:16: note:   improved number of alias checks from 1 to 0

which is then obviously wrong.
Comment 11 JuzheZhong 2024-01-17 08:31:10 UTC
I think this following:

https://godbolt.org/z/5sWEWWGox

ARM SVE GCC-11 correctly vectorize this codes.

But both GCC-12 and GCC-13 failed to vectorize it.

GCC-14 vectorize it in a wrong way.

Would it be possible to recover it back to GCC-11 ?
Comment 12 Richard Biener 2024-01-17 08:48:55 UTC
(In reply to JuzheZhong from comment #11)
> I think this following:
> 
> https://godbolt.org/z/5sWEWWGox
> 
> ARM SVE GCC-11 correctly vectorize this codes.
> 
> But both GCC-12 and GCC-13 failed to vectorize it.
> 
> GCC-14 vectorize it in a wrong way.
> 
> Would it be possible to recover it back to GCC-11 ?

With GCC 11 it probably SLP vectorizes the completely unrolled nest.  That
can be recovered with -fno-tree-loop-vectorize on trunk which on x86_64
produces

main:
.LFB0:
        .cfi_startproc
        movl    a+4(%rip), %edx
        xorl    %eax, %eax
        movl    $2, b(%rip)
        testl   %edx, %edx
        sete    %al
        movd    %eax, %xmm0
        setne   %al
        movzbl  %al, %eax
        pshufd  $0, %xmm0, %xmm0
        movups  %xmm0, a+4(%rip)
        movd    %eax, %xmm1
        movups  %xmm0, a+20(%rip)
        pshufd  $0, %xmm1, %xmm0
        movups  %xmm0, a+40(%rip)
        movups  %xmm0, a+56(%rip)
        ret
Comment 13 Richard Biener 2024-01-17 12:34:42 UTC
Mine.
Comment 14 Richard Biener 2024-01-17 12:59:20 UTC
So the issue is that we ignore the dependence because vect_preserves_scalar_order_p - which is correct iff we'd execute the
stmt within the vectorized loop body.  But as we discover it invariant
we hoist it, making the outcome of vect_preserves_scalar_order_p invalid.

I have a fix doing

t.c:5:17: note:   === vect_prune_runtime_alias_test_list ===
t.c:5:17: note:   can tell at compile time that a[0][1] and a[b.2_8][1] alias
t.c:7:15: missed:   not vectorized: compilation time alias: _18 = a[0][1];
a[b.2_8][1] = _20;

and then falling back to SLP vectorizing the unrolled loop as desired.
Comment 15 GCC Commits 2024-01-18 07:37:22 UTC
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:b981d5c60b8ef78e2adecd6b5d7e36f9e5e61c54

commit r14-8207-gb981d5c60b8ef78e2adecd6b5d7e36f9e5e61c54
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Jan 17 14:05:42 2024 +0100

    tree-optimization/113431 - wrong dependence with invariant load
    
    The vectorizer dependence analysis is confused with invariant loads
    when figuring whether the circumstances are so that we preserve
    scalar stmt execution order.  The following rectifies this.
    
            PR tree-optimization/113431
            * tree-vect-data-refs.cc (vect_preserves_scalar_order_p):
            When there is an invariant load we might not preserve
            scalar order.
    
            * gcc.dg/vect/pr113431.c: New testcase.
Comment 16 Richard Biener 2024-01-18 07:39:01 UTC
Fixed.
Comment 17 Rainer Orth 2024-01-19 08:55:00 UTC
The new test FAILs on 32 and 64-bit Solaris/SPARC:

+FAIL: gcc.dg/vect/pr113431.c -flto -ffat-lto-objects  scan-tree-dump-times slp1 "optimized: basic block part vectorized" 2
+FAIL: gcc.dg/vect/pr113431.c scan-tree-dump-times slp1 "optimized: basic block part vectorized" 2

I'm attaching the dump.
Comment 18 Rainer Orth 2024-01-19 08:56:03 UTC
Created attachment 57155 [details]
32-bit sparc-sun-solaris2.11 pr113431.c.185t.slp1
Comment 19 Rainer Orth 2024-02-28 14:55:02 UTC
The SPARC dump suggests

/vol/gcc/src/hg/master/local/gcc/testsuite/gcc.dg/vect/pr113431.c:12:15: missed:   unsupported unaligned access
/vol/gcc/src/hg/master/local/gcc/testsuite/gcc.dg/vect/pr113431.c:12:15: missed:   not vectorized: relevant stmt not supported: a[0][1] = _60;

that the tests needs vect_hw_misalign?
Comment 20 Jakub Jelinek 2024-03-15 14:06:13 UTC
Though, trying that in a cross to arm, with -march=armv9-a -munaligned-access it matches (in that case I believe vect_hw_misalign should be true), but it matches even with -march=armv9-a -mno-unaligned-access (and in that case I think it should be !vect_hw_misaligned target).
That said, sure, if it is
/* { dg-final { scan-tree-dump-times "optimized: basic block part vectorized" 2 "slp1" { target { vect_int && vect_hw_misaligned } } } } */
then it will simply not test anything on the non-vect_hw_misaligned targets, rather than say XPASS, so maybe that is ok.

Richi, thoughts on that?
Comment 21 Richard Biener 2024-03-15 15:42:46 UTC
(In reply to Jakub Jelinek from comment #20)
> Though, trying that in a cross to arm, with -march=armv9-a
> -munaligned-access it matches (in that case I believe vect_hw_misalign
> should be true), but it matches even with -march=armv9-a
> -mno-unaligned-access (and in that case I think it should be
> !vect_hw_misaligned target).
> That said, sure, if it is
> /* { dg-final { scan-tree-dump-times "optimized: basic block part
> vectorized" 2 "slp1" { target { vect_int && vect_hw_misaligned } } } } */
> then it will simply not test anything on the non-vect_hw_misaligned targets,
> rather than say XPASS, so maybe that is ok.
> 
> Richi, thoughts on that?

Yeah, I think that's OK.
Comment 22 GCC Commits 2024-03-15 15:52:14 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:ffd47fb63ddc024db847daa07f8ae27fffdfcb28

commit r14-9497-gffd47fb63ddc024db847daa07f8ae27fffdfcb28
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Fri Mar 15 16:50:25 2024 +0100

    testsuite: Fix pr113431.c FAIL on sparc* [PR113431]
    
    As mentioned in the PR, the new testcase FAILs on sparc*-* due to
    lack of support of misaligned store.
    
    This patch restricts that to vect_hw_misalign targets.
    
    2024-03-15  Jakub Jelinek  <jakub@redhat.com>
    
            PR tree-optimization/113431
            * gcc.dg/vect/pr113431.c: Restrict scan-tree-dump-times to
            vect_hw_misalign targets.
Comment 23 Jakub Jelinek 2024-03-15 15:53:50 UTC
Assuming fixed even on sparc*.
Comment 24 ro@CeBiTec.Uni-Bielefeld.DE 2024-03-19 09:44:52 UTC
> --- Comment #23 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> Assuming fixed even on sparc*.

It is.  I've missed this one when collecting instances of missing
vect_hw_misalign like PR tree-optimization/98238.  Thanks.