Tor fails its test suite with -O3 -march=znver2 -fno-vect-cost-model like: ``` $ ./configure CFLAGS="-O3 -march=znver2 -ggdb3 -fno-vect-cost-model" --enable-all-bugs-are-fatal --disable-html-manual --disable-manpage --disable-asciidoc --disable-memory-sentinels --disable-linker-hardening --disable-seccomp --disable-libscrypt --disable-module-relay --disable-module-dirauth --disable-module-pow && make -j$(nproc) [...] $ src/test/test --verbose entrynodes/node_filter --no-fork assert(num_reachable_filtered_guards(gs, NULL) OP_EQ NUM): 7 vs 7Mar 14 09:50:56.182 [err] tor_assertion_failed_(): Bug: src/feature/client/entrynodes.c:2072: get_retry_schedule: Assertion 0 failed; aborting. (on Tor 0.4.8.10 ) Mar 14 09:50:56.183 [err] Bug: Tor 0.4.8.10: Assertion 0 failed in get_retry_schedule at src/feature/client/entrynodes.c:2072: . Stack trace: (on Tor 0.4.8.10 ) Mar 14 09:50:56.183 [err] Bug: ./test(log_backtrace_impl+0x58) [0x55d7124029b8] (on Tor 0.4.8.10 ) Mar 14 09:50:56.183 [err] Bug: ./test(tor_assertion_failed_+0x14f) [0x55d71241323f] (on Tor 0.4.8.10 ) Mar 14 09:50:56.183 [err] Bug: ./test(+0x4d80e7) [0x55d7122b80e7] (on Tor 0.4.8.10 ) Mar 14 09:50:56.183 [err] Bug: ./test(entry_guards_update_filtered_sets+0x2c8) [0x55d7122bb048] (on Tor 0.4.8.10 ) Mar 14 09:50:56.183 [err] Bug: ./test(+0x2810be) [0x55d7120610be] (on Tor 0.4.8.10 ) Mar 14 09:50:56.183 [err] Bug: ./test(testcase_run_one+0x2f4) [0x55d712239d74] (on Tor 0.4.8.10 ) Mar 14 09:50:56.183 [err] Bug: ./test(tinytest_main+0x218) [0x55d71223a5f8] (on Tor 0.4.8.10 ) Mar 14 09:50:56.183 [err] Bug: ./test(main+0x492) [0x55d711e77482] (on Tor 0.4.8.10 ) Mar 14 09:50:56.183 [err] Bug: /usr/lib64/libc.so.6(+0x25e6a) [0x7f1b5816ce6a] (on Tor 0.4.8.10 ) Mar 14 09:50:56.183 [err] Bug: /usr/lib64/libc.so.6(__libc_start_main+0x85) [0x7f1b5816cf25] (on Tor 0.4.8.10 ) Mar 14 09:50:56.183 [err] Bug: ./test(_start+0x21) [0x55d711e775a1] (on Tor 0.4.8.10 ) ```` ``` Program received signal SIGABRT, Aborted. 0x00007ffff7489cdc in ?? () from /usr/lib64/libc.so.6 (gdb) bt #0 0x00007ffff7489cdc in ?? () from /usr/lib64/libc.so.6 #1 0x00007ffff7434032 in raise () from /usr/lib64/libc.so.6 #2 0x00007ffff741c4f2 in abort () from /usr/lib64/libc.so.6 #3 0x0000555555b39010 in tor_raw_abort_ () at src/lib/err/torerr.c:225 #4 0x0000555555b46e30 in tor_abort_ () at src/lib/log/util_bug.c:174 #5 0x0000555555a0fcde in get_retry_schedule (failing_since=<optimized out>, now=1710410426, is_primary=<optimized out>) at src/feature/client/entrynodes.c:2072 #6 entry_guard_consider_retry (guard=guard@entry=0x555555e6be60) at src/feature/client/entrynodes.c:2089 #7 0x0000555555a0fff0 in entry_guard_consider_retry (guard=0x555555e6be60) at src/feature/client/entrynodes.c:2084 #8 entry_guard_set_filtered_flags (options=options@entry=0x555555e1a4b0, gs=gs@entry=0x555555e6b500, guard=0x555555e6be60) at src/feature/client/entrynodes.c:1737 #9 0x0000555555a118aa in entry_guards_update_filtered_sets (gs=gs@entry=0x555555e6b500) at src/feature/client/entrynodes.c:1758 #10 0x00005555557cded7 in test_entry_guard_node_filter (arg=<optimized out>) at src/test/test_entrynodes.c:1037 #11 0x000055555599ece5 in testcase_run_bare_ (testcase=testcase@entry=0x555555da13d8 <entrynodes_tests+760>) at src/ext/tinytest.c:107 #12 0x000055555599edb3 in testcase_run_one (group=group@entry=0x555555d99380 <testgroups+864>, testcase=0x555555da13d8 <entrynodes_tests+760>) at src/ext/tinytest.c:272 #13 0x000055555599f60c in tinytest_main (c=c@entry=4, v=v@entry=0x7fffffffd928, groups=groups@entry=0x555555d99020 <testgroups>) at src/ext/tinytest.c:454 #14 0x00005555555eb47b in main (c=4, v=<optimized out>) at src/test/testing_common.c:424 (gdb) ``` ``` Using built-in specs. COLLECT_GCC=gcc COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/14/lto-wrapper Target: x86_64-pc-linux-gnu Configured with: /var/tmp/portage/sys-devel/gcc-14.0.9999/work/gcc-14.0.9999/configure --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/14 --includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/14/include --datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/14 --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/14/man --infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/14/info --with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/14/include/g++-v14 --disable-silent-rules --disable-dependency-tracking --with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/14/python --enable-languages=c,c++,fortran,rust --enable-obsolete --enable-secureplt --disable-werror --with-system-zlib --enable-nls --without-included-gettext --disable-libunwind-exceptions --enable-checking=yes,extra,rtl --with-bugurl=https://bugs.gentoo.org/ --with-pkgversion='Gentoo Hardened 14.0.9999 p, commit 66ed76361b07f18610a134dca21c6945f03c6a6b' --with-gcc-major-version-only --enable-libstdcxx-time --enable-lto --disable-libstdcxx-pch --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --enable-multilib --with-multilib-list=m32,m64 --disable-fixed-point --enable-targets=all --enable-libgomp --disable-libssp --disable-libada --disable-cet --disable-systemtap --enable-valgrind-annotations --disable-vtable-verify --disable-libvtv --with-zstd --with-isl --disable-isl-version-check --enable-default-pie --enable-host-pie --enable-host-bind-now --enable-default-ssp --disable-fixincludes --with-build-config='bootstrap-O3 bootstrap-lto' Thread model: posix Supported LTO compression algorithms: zlib zstd gcc version 14.0.1 20240311 (experimental) 31ce2e993d09dcad1ce139a2848a28de5931056d (Gentoo Hardened 14.0.9999 p, commit 66ed76361b07f18610a134dca21c6945f03c6a6b) ```
The assert is at https://gitlab.torproject.org/tpo/core/tor/-/blob/tor-0.4.8.10/src/feature/client/entrynodes.c#L2072 ``` (gdb) p delays $3 = {{ maximum = 21600, primary_delay = 600, nonprimary_delay = 3600 }, { maximum = 345600, primary_delay = 5400, nonprimary_delay = 14400 }, { maximum = 604800, primary_delay = 14400, nonprimary_delay = 64800 }, { maximum = 9223372036854775807, primary_delay = 32400, nonprimary_delay = 129600 }} (gdb) ``` The bad loop (confirmed w/ novector pragma) is: for (i = 0; i < ARRAY_LENGTH(delays); ++i) { if (tdiff <= delays[i].maximum) { return is_primary ? delays[i].primary_delay : delays[i].nonprimary_delay; } }
I've minimised it to: ``` struct entry_guard_t { int is_reachable; long failing_since; int is_primary; } *create_guard() { struct entry_guard_t *guard = __builtin_malloc(sizeof *guard); guard->is_reachable = guard->failing_since = guard->is_primary = 0; return guard; } int main() { struct entry_guard_t guard = *create_guard(); long tdiff = 10412095 - guard.failing_since; struct { long maximum; int nonprimary_delay; } delays[] = {{}, {}, {}, {9223372036854775807, 36 * 60 * 60}}; unsigned i = 0; for (; i < sizeof(delays) / sizeof(delays[0]); ++i) if (tdiff <= delays[i].maximum) return delays[i].nonprimary_delay; __builtin_abort(); } ``` This fails for me with `-O2 -mavx -fno-vect-cost-model`.
Created attachment 57705 [details] larger.i Ah, wait, that might be a bad reduction. Let me attach a larger one, then I can give the original if needed too.
(In reply to Sam James from comment #3) > Created attachment 57705 [details] > larger.i > > Ah, wait, that might be a bad reduction. Let me attach a larger one, then I > can give the original if needed too. OK, no, I think https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114339#c2 is fine.
Started with r14-6822-g01f4251b8775c832a92d55e2df57c9ac72eaceef
vectorizer generates: mask_patt_21.19_58 = vect_perm_even_49 >= vect_cst__57; mask_patt_21.19_59 = vect_perm_even_55 >= vect_cst__57; vexit_reduc_63 = mask_patt_21.19_58 | mask_patt_21.19_59; if (vexit_reduc_63 != { 0, 0 }) goto <bb 14>; [20.00%] else goto <bb 5>; [80.00%] This is changed at loopdone into: delays[3].nonprimary_delay = 129600; vect_cst__57 = {tdiff_6, tdiff_6}; mask_patt_21.19_58 = vect_cst__57 <= { 0, 0 }; mask_patt_21.19_59 = vect_cst__57 <= { 0, 0x7FFFFFFFFFFFFFFF }; vexit_reduc_63 = mask_patt_21.19_58 | mask_patt_21.19_59; if (vexit_reduc_63 != { 0, 0 }) goto <bb 3>; [20.00%] else goto <bb 7>; [80.00%] or in other words, if there's any value where the compare succeeds, find it and return. This looks correct to me. It could be that my AVX is rusty but, this generates: vmovdqa 0xf9c(%rip),%xmm1 # 0x402010 mov $0x1,%eax vmovq %rcx,%xmm3 vmovdqa %xmm0,(%rsp) vpunpcklqdq %xmm3,%xmm3,%xmm2 vmovdqa %xmm0,0x10(%rsp) vpcmpgtq %xmm2,%xmm1,%xmm1# vmovdqa %xmm0,0x20(%rsp) vmovq %rax,%xmm0 vpunpcklqdq %xmm0,%xmm0,%xmm0 movl $0x1fa40,0x38(%rsp) vpcmpgtq %xmm2,%xmm0,%xmm0 vpor %xmm1,%xmm0,%xmm0 vptest %xmm0,%xmm0 which looks off, particularly for the second compare it look like it doesn't do a load but instead just duplicates the constant 1. gdb seems to confirm this. At the first compare: (gdb) p $xmm2.v2_int64 $4 = {10412095, 10412095} (gdb) p $xmm0.v2_int64 $5 = {0, 0} which is what's expected, but at the second compare: (gdb) p $xmm2.v2_int64 $7 = {10412095, 10412095} (gdb) p $xmm0.v2_int64 $6 = {1, 1} at the second it's comparing {1, 1} instead of {0, 0x7FFFFFFFFFFFFFFF}. on AArch64 where it doesn't fail the comparison is: movi v29.4s, 0 add x1, sp, 16 ldr x5, [x0, 8] mov w0, 64064 movk w0, 0x1, lsl 16 add x3, sp, 48 str q29, [sp, 64] mov x2, 57407 mov x4, 9223372036854775807 str x4, [sp, 64] movk x2, 0x9e, lsl 16 str w0, [sp, 72] sub x2, x2, x5 stp q29, q29, [x1] dup v27.2d, x2 ld2 {v30.2d - v31.2d}, [x1] str q29, [sp, 48] ld2 {v28.2d - v29.2d}, [x3] cmge v30.2d, v30.2d, v27.2d cmge v28.2d, v28.2d, v27.2d orr v30.16b, v30.16b, v28.16b umaxp v30.4s, v30.4s, v30.4s fmov x0, d30 cbnz x0, .L12 which has v30.2d being {0, 0} and v28.2d being {0, 0x7FFFFFFFFFFFFFFF} as expected... On AArch64 we don't inline the constants because whatever is propagating the constants can't understand the LOAD_LANES: mask_patt_19.21_50 = vect__2.16_44 >= vect_cst__49; mask_patt_19.21_51 = vect__2.19_47 >= vect_cst__49; vexit_reduc_55 = mask_patt_19.21_50 | mask_patt_19.21_51; if (vexit_reduc_55 != { 0, 0 }) goto <bb 3>; [20.00%] else goto <bb 7>; [80.00%] so could this be another expansion bug? Note that a simpler reproducer is this: --- long tdiff = 10412095; int main() { struct { long maximum; int nonprimary_delay; } delays[] = {{}, {}, {}, {9223372036854775807, 36 * 60 * 60}}; for (unsigned i = 0; i < sizeof(delays) / sizeof(delays[0]); ++i) if (tdiff <= delays[i].maximum) return delays[i].nonprimary_delay; __builtin_abort(); } --- the key point is that we're not allowed to constprop tdiff at GIMPLE. If we do, e.g: int main() { struct { long maximum; int nonprimary_delay; } delays[] = {{}, {}, {}, {9223372036854775807, 36 * 60 * 60}}; long tdiff = 10412095; for (unsigned i = 0; i < sizeof(delays) / sizeof(delays[0]); ++i) if (tdiff <= delays[i].maximum) return delays[i].nonprimary_delay; __builtin_abort(); } then after vectorization the const prop the entire expression is evaluated at GIMPLE and it gets the right result. This makes me believe it's a target expansion bug.
This looks wrong: ``` ;; mask_patt_17.15_55 = vect_cst__53 <= { 0, 9223372036854775807 }; (insn 21 20 22 (set (reg:V2DI 111) (mem/u/c:V2DI (symbol_ref/u:DI ("*.LC1") [flags 0x2]) [0 S16 A128])) -1 (expr_list:REG_EQUAL (const_vector:V2DI [ (const_int 1 [0x1]) (const_int -9223372036854775808 [0x8000000000000000]) ]) (nil))) (insn 22 21 23 (set (reg:V2DI 112) (gt:V2DI (reg:V2DI 111) (reg:V2DI 100 [ vect_cst__53 ]))) -1 (nil)) (insn 23 22 0 (set (reg:V2DI 102 [ mask_patt_17.15D.3121 ]) (reg:V2DI 112)) -1 (nil)) ``` we go from `a <= INT_MAX` to `INT_MIN > a` which is basically going from `true` to `a != INT_MIN`.
Slightly simplified/cleaned up testcase: /* { dg-do run } */ /* { dg-options "-O2 -fno-vect-cost-model" } */ /* { dg-additional-options "-mavx" { target avx_runtime } } */ struct S { int a; long b; int c; } s; __attribute__((noipa)) struct S * foo (void) { return &s; } int main () { struct S r = *foo (); long t = 10412095L - r.b; struct { long d; int e; } f[4] = { {}, {}, {}, { __LONG_MAX__, 0 } }; for (unsigned i = 0; i < 4; ++i) if (t <= f[i].d) return f[i].e; __builtin_abort (); }
Reduced testcase: ``` #define vect128 __attribute__((vector_size(16))) [[gnu::noinline]] vect128 long f(vect128 long a) { return a <= (vect128 long){0, 9223372036854775807}; } int main() { vect128 long t = (vect128 long){0, 0}; t = f(t); if (!t[1]) __builtin_abort(); } ```
(In reply to Andrew Pinski from comment #9) > Reduced testcase: This works without -mavx and fails with -mavx even.
My reduced testcase started to fail in GCC 13.
I suspecting r13-3803-gfa271afb584230 which missed the border case of INT_MAX/INT_MIN .
Nice, further cleaned up: /* PR target/114339 */ /* { dg-do run } */ /* { dg-options "-O2 -Wno-psabi" } */ /* { dg-additional-options "-mavx" { target avx_runtime } } */ typedef long V __attribute__((vector_size (16))); __attribute__((noipa)) V foo (V a) { return a <= (V) {0, __LONG_MAX__ }; } int main () { V t = foo ((V) { 0, 0 }); if (t[0] != -1L || t[1] != -1L) __builtin_abort (); }
Indeed, r13-3803-gfa271afb584230, so mine.
Created attachment 57707 [details] gcc14-pr114339.patch Untested fix.
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>: https://gcc.gnu.org/g:ab2da8fb67b1aa0557a16b62689a888730dba610 commit r14-9494-gab2da8fb67b1aa0557a16b62689a888730dba610 Author: Jakub Jelinek <jakub@redhat.com> Date: Fri Mar 15 10:46:47 2024 +0100 i386: Fix a pasto in ix86_expand_int_sse_cmp [PR114339] In r13-3803-gfa271afb58 I've added an optimization for LE/LEU/GE/GEU comparison against CONST_VECTOR. As the comments say: /* x <= cst can be handled as x < cst + 1 unless there is wrap around in cst + 1. */ ... /* For LE punt if some element is signed maximum. */ ... /* For LEU punt if some element is unsigned maximum. */ and /* x >= cst can be handled as x > cst - 1 unless there is wrap around in cst - 1. */ ... /* For GE punt if some element is signed minimum. */ ... /* For GEU punt if some element is zero. */ Apparently I wrote the GE/GEU (second case) first and then copied/adjusted it for LE/LEU, most of the adjustments look correct, but I've left if (code == GE) comparison when testing if it should punt for signed maximum. That condition is never true, because this is in switch (code) { ... case LE: case LEU: block and we really meant to be what the comment says, for LE punt if some element is signed maximum, as then cst + 1 wraps around. The following patch fixes the pasto. 2024-03-15 Jakub Jelinek <jakub@redhat.com> PR target/114339 * config/i386/i386-expand.cc (ix86_expand_int_sse_cmp) <case LE>: Fix a pasto, compare code against LE rather than GE. * gcc.target/i386/pr114339.c: New test.
Fixed on the trunk so far.
The releases/gcc-13 branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>: https://gcc.gnu.org/g:ec5cb2a0f2436618219ce0ada3086f6088e37332 commit r13-8452-gec5cb2a0f2436618219ce0ada3086f6088e37332 Author: Jakub Jelinek <jakub@redhat.com> Date: Fri Mar 15 10:46:47 2024 +0100 i386: Fix a pasto in ix86_expand_int_sse_cmp [PR114339] In r13-3803-gfa271afb58 I've added an optimization for LE/LEU/GE/GEU comparison against CONST_VECTOR. As the comments say: /* x <= cst can be handled as x < cst + 1 unless there is wrap around in cst + 1. */ ... /* For LE punt if some element is signed maximum. */ ... /* For LEU punt if some element is unsigned maximum. */ and /* x >= cst can be handled as x > cst - 1 unless there is wrap around in cst - 1. */ ... /* For GE punt if some element is signed minimum. */ ... /* For GEU punt if some element is zero. */ Apparently I wrote the GE/GEU (second case) first and then copied/adjusted it for LE/LEU, most of the adjustments look correct, but I've left if (code == GE) comparison when testing if it should punt for signed maximum. That condition is never true, because this is in switch (code) { ... case LE: case LEU: block and we really meant to be what the comment says, for LE punt if some element is signed maximum, as then cst + 1 wraps around. The following patch fixes the pasto. 2024-03-15 Jakub Jelinek <jakub@redhat.com> PR target/114339 * config/i386/i386-expand.cc (ix86_expand_int_sse_cmp) <case LE>: Fix a pasto, compare code against LE rather than GE. * gcc.target/i386/pr114339.c: New test. (cherry picked from commit ab2da8fb67b1aa0557a16b62689a888730dba610)
Fixed for 13.3+ too.