Bug 114339 - [13 regression] Tor miscompiled with -O2 -mavx -fno-vect-cost-model since r14-6822
Summary: [13 regression] Tor miscompiled with -O2 -mavx -fno-vect-cost-model since r14...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 14.0
: P1 normal
Target Milestone: 13.3
Assignee: Jakub Jelinek
URL:
Keywords: wrong-code
Depends on:
Blocks:
 
Reported: 2024-03-14 18:34 UTC by Sam James
Modified: 2024-03-18 14:42 UTC (History)
2 users (show)

See Also:
Host:
Target: x86_64-linux-gnu
Build:
Known to work: 10.1.0, 12.1.0, 12.3.0, 7.1.0
Known to fail: 13.1.0
Last reconfirmed: 2024-03-14 00:00:00


Attachments
larger.i (8.11 KB, text/plain)
2024-03-14 18:44 UTC, Sam James
Details
gcc14-pr114339.patch (717 bytes, patch)
2024-03-14 22:04 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Sam James 2024-03-14 18:34:17 UTC
Tor fails its test suite with -O3 -march=znver2 -fno-vect-cost-model like:
```
$ ./configure CFLAGS="-O3 -march=znver2 -ggdb3 -fno-vect-cost-model" --enable-all-bugs-are-fatal --disable-html-manual --disable-manpage --disable-asciidoc --disable-memory-sentinels --disable-linker-hardening --disable-seccomp --disable-libscrypt --disable-module-relay --disable-module-dirauth --disable-module-pow && make -j$(nproc)
[...]
$ src/test/test --verbose entrynodes/node_filter --no-fork
assert(num_reachable_filtered_guards(gs, NULL) OP_EQ NUM): 7 vs 7Mar 14 09:50:56.182 [err] tor_assertion_failed_(): Bug: src/feature/client/entrynodes.c:2072: get_retry_schedule: Assertion 0 failed; aborting. (on Tor 0.4.8.10 )
Mar 14 09:50:56.183 [err] Bug: Tor 0.4.8.10: Assertion 0 failed in get_retry_schedule at src/feature/client/entrynodes.c:2072: . Stack trace: (on Tor 0.4.8.10 )
Mar 14 09:50:56.183 [err] Bug:     ./test(log_backtrace_impl+0x58) [0x55d7124029b8] (on Tor 0.4.8.10 )
Mar 14 09:50:56.183 [err] Bug:     ./test(tor_assertion_failed_+0x14f) [0x55d71241323f] (on Tor 0.4.8.10 )
Mar 14 09:50:56.183 [err] Bug:     ./test(+0x4d80e7) [0x55d7122b80e7] (on Tor 0.4.8.10 )
Mar 14 09:50:56.183 [err] Bug:     ./test(entry_guards_update_filtered_sets+0x2c8) [0x55d7122bb048] (on Tor 0.4.8.10 )
Mar 14 09:50:56.183 [err] Bug:     ./test(+0x2810be) [0x55d7120610be] (on Tor 0.4.8.10 )
Mar 14 09:50:56.183 [err] Bug:     ./test(testcase_run_one+0x2f4) [0x55d712239d74] (on Tor 0.4.8.10 )
Mar 14 09:50:56.183 [err] Bug:     ./test(tinytest_main+0x218) [0x55d71223a5f8] (on Tor 0.4.8.10 )
Mar 14 09:50:56.183 [err] Bug:     ./test(main+0x492) [0x55d711e77482] (on Tor 0.4.8.10 )
Mar 14 09:50:56.183 [err] Bug:     /usr/lib64/libc.so.6(+0x25e6a) [0x7f1b5816ce6a] (on Tor 0.4.8.10 )
Mar 14 09:50:56.183 [err] Bug:     /usr/lib64/libc.so.6(__libc_start_main+0x85) [0x7f1b5816cf25] (on Tor 0.4.8.10 )
Mar 14 09:50:56.183 [err] Bug:     ./test(_start+0x21) [0x55d711e775a1] (on Tor 0.4.8.10 )
````

```
Program received signal SIGABRT, Aborted.
0x00007ffff7489cdc in ?? () from /usr/lib64/libc.so.6
(gdb) bt
#0  0x00007ffff7489cdc in ?? () from /usr/lib64/libc.so.6
#1  0x00007ffff7434032 in raise () from /usr/lib64/libc.so.6
#2  0x00007ffff741c4f2 in abort () from /usr/lib64/libc.so.6
#3  0x0000555555b39010 in tor_raw_abort_ () at src/lib/err/torerr.c:225
#4  0x0000555555b46e30 in tor_abort_ () at src/lib/log/util_bug.c:174
#5  0x0000555555a0fcde in get_retry_schedule (failing_since=<optimized out>, now=1710410426, is_primary=<optimized out>) at src/feature/client/entrynodes.c:2072
#6  entry_guard_consider_retry (guard=guard@entry=0x555555e6be60) at src/feature/client/entrynodes.c:2089
#7  0x0000555555a0fff0 in entry_guard_consider_retry (guard=0x555555e6be60) at src/feature/client/entrynodes.c:2084
#8  entry_guard_set_filtered_flags (options=options@entry=0x555555e1a4b0, gs=gs@entry=0x555555e6b500, guard=0x555555e6be60) at src/feature/client/entrynodes.c:1737
#9  0x0000555555a118aa in entry_guards_update_filtered_sets (gs=gs@entry=0x555555e6b500) at src/feature/client/entrynodes.c:1758
#10 0x00005555557cded7 in test_entry_guard_node_filter (arg=<optimized out>) at src/test/test_entrynodes.c:1037
#11 0x000055555599ece5 in testcase_run_bare_ (testcase=testcase@entry=0x555555da13d8 <entrynodes_tests+760>) at src/ext/tinytest.c:107
#12 0x000055555599edb3 in testcase_run_one (group=group@entry=0x555555d99380 <testgroups+864>, testcase=0x555555da13d8 <entrynodes_tests+760>) at src/ext/tinytest.c:272
#13 0x000055555599f60c in tinytest_main (c=c@entry=4, v=v@entry=0x7fffffffd928, groups=groups@entry=0x555555d99020 <testgroups>) at src/ext/tinytest.c:454
#14 0x00005555555eb47b in main (c=4, v=<optimized out>) at src/test/testing_common.c:424
(gdb)
```

```
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/14/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /var/tmp/portage/sys-devel/gcc-14.0.9999/work/gcc-14.0.9999/configure --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/14 --includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/14/include --datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/14 --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/14/man --infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/14/info --with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/14/include/g++-v14 --disable-silent-rules --disable-dependency-tracking --with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/14/python --enable-languages=c,c++,fortran,rust --enable-obsolete --enable-secureplt --disable-werror --with-system-zlib --enable-nls --without-included-gettext --disable-libunwind-exceptions --enable-checking=yes,extra,rtl --with-bugurl=https://bugs.gentoo.org/ --with-pkgversion='Gentoo Hardened 14.0.9999 p, commit 66ed76361b07f18610a134dca21c6945f03c6a6b' --with-gcc-major-version-only --enable-libstdcxx-time --enable-lto --disable-libstdcxx-pch --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --enable-multilib --with-multilib-list=m32,m64 --disable-fixed-point --enable-targets=all --enable-libgomp --disable-libssp --disable-libada --disable-cet --disable-systemtap --enable-valgrind-annotations --disable-vtable-verify --disable-libvtv --with-zstd --with-isl --disable-isl-version-check --enable-default-pie --enable-host-pie --enable-host-bind-now --enable-default-ssp --disable-fixincludes --with-build-config='bootstrap-O3 bootstrap-lto'
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 14.0.1 20240311 (experimental) 31ce2e993d09dcad1ce139a2848a28de5931056d (Gentoo Hardened 14.0.9999 p, commit 66ed76361b07f18610a134dca21c6945f03c6a6b)
```
Comment 1 Sam James 2024-03-14 18:34:39 UTC
The assert is at https://gitlab.torproject.org/tpo/core/tor/-/blob/tor-0.4.8.10/src/feature/client/entrynodes.c#L2072

```
(gdb) p delays
$3 = {{
    maximum = 21600,
    primary_delay = 600,
    nonprimary_delay = 3600
  }, {
    maximum = 345600,
    primary_delay = 5400,
    nonprimary_delay = 14400
  }, {
    maximum = 604800,
    primary_delay = 14400,
    nonprimary_delay = 64800
  }, {
    maximum = 9223372036854775807,
    primary_delay = 32400,
    nonprimary_delay = 129600
  }}
(gdb)
```

The bad loop (confirmed w/ novector pragma) is:
  for (i = 0; i < ARRAY_LENGTH(delays); ++i) {
    if (tdiff <= delays[i].maximum) {
      return is_primary ? delays[i].primary_delay : delays[i].nonprimary_delay;
    }
  }
Comment 2 Sam James 2024-03-14 18:35:15 UTC
I've minimised it to:
```
struct entry_guard_t {
  int is_reachable;
  long failing_since;
  int is_primary;
} *create_guard() {
  struct entry_guard_t *guard = __builtin_malloc(sizeof *guard);
  guard->is_reachable = guard->failing_since = guard->is_primary = 0;
  return guard;
}
int main() {
  struct entry_guard_t guard = *create_guard();
  long tdiff = 10412095 - guard.failing_since;
  struct {
    long maximum;
    int nonprimary_delay;
  } delays[] = {{}, {}, {}, {9223372036854775807, 36 * 60 * 60}};
  unsigned i = 0;
  for (; i < sizeof(delays) / sizeof(delays[0]); ++i)
    if (tdiff <= delays[i].maximum)
      return delays[i].nonprimary_delay;
  __builtin_abort();
}
```

This fails for me with `-O2 -mavx -fno-vect-cost-model`.
Comment 3 Sam James 2024-03-14 18:44:13 UTC Comment hidden (obsolete)
Comment 4 Sam James 2024-03-14 18:54:19 UTC Comment hidden (obsolete)
Comment 5 Jakub Jelinek 2024-03-14 19:10:51 UTC
Started with r14-6822-g01f4251b8775c832a92d55e2df57c9ac72eaceef
Comment 6 Tamar Christina 2024-03-14 20:52:58 UTC
vectorizer generates:

  mask_patt_21.19_58 = vect_perm_even_49 >= vect_cst__57;
  mask_patt_21.19_59 = vect_perm_even_55 >= vect_cst__57;
  vexit_reduc_63 = mask_patt_21.19_58 | mask_patt_21.19_59;
  if (vexit_reduc_63 != { 0, 0 })
    goto <bb 14>; [20.00%]
  else
    goto <bb 5>; [80.00%]

This is changed at loopdone into:

  delays[3].nonprimary_delay = 129600;
  vect_cst__57 = {tdiff_6, tdiff_6};
  mask_patt_21.19_58 = vect_cst__57 <= { 0, 0 };
  mask_patt_21.19_59 = vect_cst__57 <= { 0, 0x7FFFFFFFFFFFFFFF };
  vexit_reduc_63 = mask_patt_21.19_58 | mask_patt_21.19_59;
  if (vexit_reduc_63 != { 0, 0 })
    goto <bb 3>; [20.00%]
  else
    goto <bb 7>; [80.00%]

or in other words, if there's any value where the compare succeeds, find it and return.
This looks correct to me.

It could be that my AVX is rusty but, this generates:

   vmovdqa 0xf9c(%rip),%xmm1        # 0x402010 
   mov    $0x1,%eax
   vmovq  %rcx,%xmm3
   vmovdqa %xmm0,(%rsp)
   vpunpcklqdq %xmm3,%xmm3,%xmm2
   vmovdqa %xmm0,0x10(%rsp)
   vpcmpgtq %xmm2,%xmm1,%xmm1#
   vmovdqa %xmm0,0x20(%rsp)
   vmovq  %rax,%xmm0
   vpunpcklqdq %xmm0,%xmm0,%xmm0
   movl   $0x1fa40,0x38(%rsp)
   vpcmpgtq %xmm2,%xmm0,%xmm0
   vpor   %xmm1,%xmm0,%xmm0
   vptest %xmm0,%xmm0

which looks off, particularly for the second compare it look like it doesn't do a load but instead just duplicates the constant 1.
gdb seems to confirm this. At the first compare:

(gdb) p $xmm2.v2_int64
$4 = {10412095, 10412095}
(gdb) p $xmm0.v2_int64
$5 = {0, 0}

which is what's expected, but at the second compare:

(gdb) p $xmm2.v2_int64
$7 = {10412095, 10412095}
(gdb) p $xmm0.v2_int64
$6 = {1, 1}

at the second it's comparing {1, 1} instead of {0, 0x7FFFFFFFFFFFFFFF}.

on AArch64 where it doesn't fail the comparison is:

   movi    v29.4s, 0
   add     x1, sp, 16
   ldr     x5, [x0, 8]
   mov     w0, 64064
   movk    w0, 0x1, lsl 16
   add     x3, sp, 48
   str     q29, [sp, 64]
   mov     x2, 57407
   mov     x4, 9223372036854775807
   str     x4, [sp, 64]
   movk    x2, 0x9e, lsl 16
   str     w0, [sp, 72]
   sub     x2, x2, x5
   stp     q29, q29, [x1]
   dup     v27.2d, x2
   ld2     {v30.2d - v31.2d}, [x1]
   str     q29, [sp, 48]
   ld2     {v28.2d - v29.2d}, [x3]
   cmge    v30.2d, v30.2d, v27.2d
   cmge    v28.2d, v28.2d, v27.2d
   orr     v30.16b, v30.16b, v28.16b
   umaxp   v30.4s, v30.4s, v30.4s
   fmov    x0, d30
   cbnz    x0, .L12

which has v30.2d being {0, 0} and v28.2d being {0, 0x7FFFFFFFFFFFFFFF} as expected...

On AArch64 we don't inline the constants because whatever is propagating the constants can't understand the LOAD_LANES:

  mask_patt_19.21_50 = vect__2.16_44 >= vect_cst__49;
  mask_patt_19.21_51 = vect__2.19_47 >= vect_cst__49;
  vexit_reduc_55 = mask_patt_19.21_50 | mask_patt_19.21_51;
  if (vexit_reduc_55 != { 0, 0 })
    goto <bb 3>; [20.00%]
  else
    goto <bb 7>; [80.00%]

so could this be another expansion bug?

Note that a simpler reproducer is this:

---
long tdiff = 10412095;

int main() {
  struct {
    long maximum;
    int nonprimary_delay;
  } delays[] = {{}, {}, {}, {9223372036854775807, 36 * 60 * 60}};

  for (unsigned i = 0; i < sizeof(delays) / sizeof(delays[0]); ++i)
    if (tdiff <= delays[i].maximum)
      return delays[i].nonprimary_delay;

  __builtin_abort();
}
---

the key point is that we're not allowed to constprop tdiff at GIMPLE. If we do, e.g:

int main() {
  struct {
    long maximum;
    int nonprimary_delay;
  } delays[] = {{}, {}, {}, {9223372036854775807, 36 * 60 * 60}};
  long tdiff = 10412095;

  for (unsigned i = 0; i < sizeof(delays) / sizeof(delays[0]); ++i)
    if (tdiff <= delays[i].maximum)
      return delays[i].nonprimary_delay;

  __builtin_abort();
}

then after vectorization the const prop the entire expression is evaluated at GIMPLE and it gets the right result.

This makes me believe it's a target expansion bug.
Comment 7 Andrew Pinski 2024-03-14 21:26:33 UTC
This looks wrong:
```
;; mask_patt_17.15_55 = vect_cst__53 <= { 0, 9223372036854775807 };

(insn 21 20 22 (set (reg:V2DI 111)
        (mem/u/c:V2DI (symbol_ref/u:DI ("*.LC1") [flags 0x2]) [0  S16 A128])) -1
     (expr_list:REG_EQUAL (const_vector:V2DI [
                (const_int 1 [0x1])
                (const_int -9223372036854775808 [0x8000000000000000])
            ])
        (nil)))

(insn 22 21 23 (set (reg:V2DI 112)
        (gt:V2DI (reg:V2DI 111)
            (reg:V2DI 100 [ vect_cst__53 ]))) -1
     (nil))

(insn 23 22 0 (set (reg:V2DI 102 [ mask_patt_17.15D.3121 ])
        (reg:V2DI 112)) -1
     (nil))
```

we go from `a <= INT_MAX` to `INT_MIN > a` which is basically going from `true` to `a != INT_MIN`.
Comment 8 Jakub Jelinek 2024-03-14 21:28:11 UTC
Slightly simplified/cleaned up testcase:
/* { dg-do run } */
/* { dg-options "-O2 -fno-vect-cost-model" } */
/* { dg-additional-options "-mavx" { target avx_runtime } } */

struct S { int a; long b; int c; } s;

__attribute__((noipa)) struct S *
foo (void)
{
  return &s;
}

int
main ()
{
  struct S r = *foo ();
  long t = 10412095L - r.b;
  struct { long d; int e; } f[4] = { {}, {}, {}, { __LONG_MAX__, 0 } };
  for (unsigned i = 0; i < 4; ++i)
    if (t <= f[i].d)
      return f[i].e;
  __builtin_abort ();
}
Comment 9 Andrew Pinski 2024-03-14 21:31:35 UTC
Reduced testcase:
```
#define vect128 __attribute__((vector_size(16)))

[[gnu::noinline]]
vect128 long f(vect128 long a)
{
        return a <= (vect128 long){0, 9223372036854775807};
}

int main()
{
  vect128 long t = (vect128 long){0, 0};
  t = f(t);
  if (!t[1])
  __builtin_abort();

}
```
Comment 10 Andrew Pinski 2024-03-14 21:33:01 UTC
(In reply to Andrew Pinski from comment #9)
> Reduced testcase:

This works without -mavx and fails with -mavx even.
Comment 11 Andrew Pinski 2024-03-14 21:34:53 UTC
My reduced testcase started to fail in GCC 13.
Comment 12 Andrew Pinski 2024-03-14 21:39:29 UTC
I suspecting r13-3803-gfa271afb584230 which missed the border case of INT_MAX/INT_MIN .
Comment 13 Jakub Jelinek 2024-03-14 21:39:42 UTC
Nice, further cleaned up:
/* PR target/114339 */
/* { dg-do run } */
/* { dg-options "-O2 -Wno-psabi" } */
/* { dg-additional-options "-mavx" { target avx_runtime } } */

typedef long V __attribute__((vector_size (16)));

__attribute__((noipa)) V
foo (V a)
{
  return a <= (V) {0, __LONG_MAX__ };
}

int
main ()
{
  V t = foo ((V) { 0, 0 });
  if (t[0] != -1L || t[1] != -1L)
    __builtin_abort ();
}
Comment 14 Jakub Jelinek 2024-03-14 21:42:15 UTC
Indeed, r13-3803-gfa271afb584230, so mine.
Comment 15 Jakub Jelinek 2024-03-14 22:04:01 UTC
Created attachment 57707 [details]
gcc14-pr114339.patch

Untested fix.
Comment 16 GCC Commits 2024-03-15 10:04:37 UTC
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:ab2da8fb67b1aa0557a16b62689a888730dba610

commit r14-9494-gab2da8fb67b1aa0557a16b62689a888730dba610
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Fri Mar 15 10:46:47 2024 +0100

    i386: Fix a pasto in ix86_expand_int_sse_cmp [PR114339]
    
    In r13-3803-gfa271afb58 I've added an optimization for LE/LEU/GE/GEU
    comparison against CONST_VECTOR.  As the comments say:
             /* x <= cst can be handled as x < cst + 1 unless there is
                wrap around in cst + 1.  */
    ...
                         /* For LE punt if some element is signed maximum.  */
    ...
                     /* For LEU punt if some element is unsigned maximum.  */
    and
             /* x >= cst can be handled as x > cst - 1 unless there is
                wrap around in cst - 1.  */
    ...
                         /* For GE punt if some element is signed minimum.  */
    ...
                     /* For GEU punt if some element is zero.  */
    Apparently I wrote the GE/GEU (second case) first and then
    copied/adjusted it for LE/LEU, most of the adjustments look correct, but
    I've left if (code == GE) comparison when testing if it should punt for
    signed maximum.  That condition is never true, because this is in
    switch (code) { ... case LE: case LEU: block and we really meant to
    be what the comment says, for LE punt if some element is signed maximum,
    as then cst + 1 wraps around.
    
    The following patch fixes the pasto.
    
    2024-03-15  Jakub Jelinek  <jakub@redhat.com>
    
            PR target/114339
            * config/i386/i386-expand.cc (ix86_expand_int_sse_cmp) <case LE>: Fix
            a pasto, compare code against LE rather than GE.
    
            * gcc.target/i386/pr114339.c: New test.
Comment 17 Jakub Jelinek 2024-03-15 10:30:18 UTC
Fixed on the trunk so far.
Comment 18 GCC Commits 2024-03-15 23:29:43 UTC
The releases/gcc-13 branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>:

https://gcc.gnu.org/g:ec5cb2a0f2436618219ce0ada3086f6088e37332

commit r13-8452-gec5cb2a0f2436618219ce0ada3086f6088e37332
Author: Jakub Jelinek <jakub@redhat.com>
Date:   Fri Mar 15 10:46:47 2024 +0100

    i386: Fix a pasto in ix86_expand_int_sse_cmp [PR114339]
    
    In r13-3803-gfa271afb58 I've added an optimization for LE/LEU/GE/GEU
    comparison against CONST_VECTOR.  As the comments say:
             /* x <= cst can be handled as x < cst + 1 unless there is
                wrap around in cst + 1.  */
    ...
                         /* For LE punt if some element is signed maximum.  */
    ...
                     /* For LEU punt if some element is unsigned maximum.  */
    and
             /* x >= cst can be handled as x > cst - 1 unless there is
                wrap around in cst - 1.  */
    ...
                         /* For GE punt if some element is signed minimum.  */
    ...
                     /* For GEU punt if some element is zero.  */
    Apparently I wrote the GE/GEU (second case) first and then
    copied/adjusted it for LE/LEU, most of the adjustments look correct, but
    I've left if (code == GE) comparison when testing if it should punt for
    signed maximum.  That condition is never true, because this is in
    switch (code) { ... case LE: case LEU: block and we really meant to
    be what the comment says, for LE punt if some element is signed maximum,
    as then cst + 1 wraps around.
    
    The following patch fixes the pasto.
    
    2024-03-15  Jakub Jelinek  <jakub@redhat.com>
    
            PR target/114339
            * config/i386/i386-expand.cc (ix86_expand_int_sse_cmp) <case LE>: Fix
            a pasto, compare code against LE rather than GE.
    
            * gcc.target/i386/pr114339.c: New test.
    
    (cherry picked from commit ab2da8fb67b1aa0557a16b62689a888730dba610)
Comment 19 Jakub Jelinek 2024-03-18 14:42:19 UTC
Fixed for 13.3+ too.