Bug 102206 - amd zen hosts running zen-optimized gcc: gimplification ICE after r10-7284
Status: WAITING
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 10.1.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Reported: 2021-09-05 07:49 UTC by Greg Turner
Modified: 2021-09-26 03:14 UTC

Last reconfirmed: 2021-09-06 00:00:00


Attachments
xml_grammar_gcc_-E.cpp.xz (252.91 KB, application/octet-stream)
2021-09-06 08:41 UTC, Greg Turner

Description Greg Turner 2021-09-05 07:49:29 UTC
The bad news: why this bug report will be long and confusing
============================================================

There are some things in this bug report that will probably make folks think "This is a hardware or software stability problem and not gcc's fault."

Strictly speaking, I can't entirely disprove this hypothesis, but I will present evidence below which has led me to believe it's probably a legit gcc bug.

AFAICT, due to the nature of binary distributions, this bug manifests exclusively on Gentoo.  I imagine there are like 50 Gentoo users on Zen and 25 of them have experienced the bug, half of whom filed bug reports, and the rest of whom shrugged it off and decided it must have been a cosmic ray or something :)

So, Gentoo-only.  "Rice" is an arguably racially-insensitive term Gentoo people use to describe excessive customization of Gentoo systems resulting in various breakage and non-reproducible problems.  Initially I thought this was probably a "rice"-related problem, but I have taken considerable pains to rule this out.  It's not rice-related.

But wait, there's more!  It's also non-deterministically non-deterministic!

That is, almost everyone experiences this bug non-deterministically.  But, for reasons not yet understood, some users (I was once one of them) have found <software, hardware> configurations in which this bug manifests fully repeatably and deterministically.  Sadly, AFAICT none of these users have managed to preserve these fully deterministic software configurations.

OK, enough prefacing. I just want to prepare the reader: the nature of this bug/issue will raise some doubts, which will likely need to be overcome before this bug looks "legit".  I also wish to encourage the reader not to jump to easy "not-a-bug" conclusions without careful consideration of the circumstances presented below.

Scope/Domain of the bug
=======================

On Gentoo, there are several AMD Zen hardware users who report that they must either

A) avoid building gcc with

  -m{arch,tune}=znver?

and their

  -m{arch,tune}=native

equivalents, or

B) must downgrade to gcc-9 or earlier.

If they fail to do so, the bug will occur, eventually.  The compile which seems to most reliably reproduce the bug is boost (any recent version will do the trick).  But it appears in other builds.

Zen (1xxx) and Zen+ (2xxx) hosts seem most susceptible.  But Zen 2 and Zen 3 hosts (3xxx/4xxx(?) and 5xxx, respectively) also appear to be at least occasionally affected.

Note that once an optimized gcc is built, optimizing the target build with similar -m{arch,tune} options is not a requirement.  But, such target optimizations do seem to reproduce the problem with a considerably greater probability.

Bug Manifestation
=================

The bug itself appears as an ICE, stack smash, or null-pointer-dereference fault during C++ compiles.  The problem seems to always manifest during gimplification and to produce distinctive stack dumps, e.g.:

Thread 2.1 "cc1plus" received signal SIGABRT, Aborted.
[Switching to process 15911]
0x00007ffff7bc3f71 in raise () from /lib64/libc.so.6
#0  0x00007ffff7bc3f71 in raise () from /lib64/libc.so.6
#1  0x00007ffff7bad537 in abort () from /lib64/libc.so.6
#2  0x00007ffff7c08207 in ?? () from /lib64/libc.so.6
#3  0x00007ffff7c99892 in __fortify_fail () from /lib64/libc.so.6
#4  0x00007ffff7c99870 in __stack_chk_fail () from /lib64/libc.so.6
#5  0x000000000065a1f2 in cp_gimplify_expr(tree_node**, gimple**, gimple**) ()
#6  0x00000000009f4ffc in gimplify_expr(tree_node**, gimple**, gimple**, bool (*)(tree_node*), int) ()
#7  0x00000000009faeb1 in ?? ()
#8  0x00000000009f6304 in gimplify_expr(tree_node**, gimple**, gimple**, bool (*)(tree_node*), int) ()
#9  0x00000000009f6196 in gimplify_expr(tree_node**, gimple**, gimple**, bool (*)(tree_node*), int) ()
#10 0x00000000009f5d42 in gimplify_expr(tree_node**, gimple**, gimple**, bool (*)(tree_node*), int) ()
#11 0x00000000009f6196 in gimplify_expr(tree_node**, gimple**, gimple**, bool (*)(tree_node*), int) ()
#12 0x00000000009fd5b9 in ?? ()
#13 0x00000000009f638d in gimplify_expr(tree_node**, gimple**, gimple**, bool (*)(tree_node*), int) ()
#14 0x00000000009f6196 in gimplify_expr(tree_node**, gimple**, gimple**, bool (*)(tree_node*), int) ()
#15 0x00000000009f9ed9 in gimplify_body(tree_node*, bool) ()
#16 0x00000000009fa316 in gimplify_function_tree(tree_node*) ()
#17 0x0000000000885f58 in cgraph_node::analyze() ()
#18 0x0000000000888878 in ?? ()
#19 0x00000000008894c3 in symbol_table::finalize_compilation_unit() ()
#20 0x0000000000c769b1 in ?? ()
#21 0x000000000060065a in toplev::main(int, char**) ()
#22 0x000000000060413c in main ()

This is a pretty manageable example; others report stack traces with very deep gimplify_expr recursion*.

Git Bisect: 94e2418780f1d13235f3e2e6e5c09dbe821c1ce3
====================================================

A few months ago I git bisected this thing.  Since the bug was manifesting nondeterministically, it took some doing; I wrote scripts to repeatedly build boost, treating a point in history as "good" after no less than 50 consecutive successful builds with the resulting optimized compiler*.  Thankfully, this did result in a culprit which was revertible without crippling gcc:

  94e24187 | c++: Avoid unnecessary empty class copy [94175]

I must admit I don't really understand what this commit does.  But reverting it and rebuilding gcc-1{0.{1,2,3},1.{1,2}} results in a compiler which seems to work fine and does not suffer from the bug/issue.

Since then, every reporter so far in Gentoo bug 724314 (where most discussion of this bug has occurred) has reported that applying a patch which reverts this commit also solved the problem for them*.

The specific Gentoo-friendly patch folks have been using is available at:

  https://724314.bugs.gentoo.org/attachment.cgi?id=718944 
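
For anyone wanting to repeat the bisect, here is a minimal sketch of the "good only after 50 consecutive successful builds" rule described above, written so it could serve as the test command for `git bisect run`. It is not the script actually used for this report. The `./rebuild-boost.sh` helper is a hypothetical placeholder for "rebuild boost with the freshly built compiler", and the sketch assumes gcc itself has already been rebuilt at the revision under test.

```python
import subprocess

# Threshold quoted in the report: a revision counts as "good" only after
# this many consecutive successful boost builds.
REQUIRED_PASSES = 50


def classify(run_build, required_passes=REQUIRED_PASSES):
    """Return a `git bisect run` exit code for the current revision.

    run_build is any callable returning 0 on a successful build.
    0 means "good" (every attempt passed); 1 means "bad" (an attempt failed).
    """
    for _ in range(required_passes):
        if run_build() != 0:
            return 1  # first failure: stop early and mark the revision bad
    return 0


def build_boost_once():
    # Hypothetical placeholder command, not taken from the report.
    return subprocess.run(["./rebuild-boost.sh"]).returncode


# To use with `git bisect run`, the script would end with:
#     import sys; sys.exit(classify(build_boost_once))
```

Stopping at the first failure keeps "bad" revisions cheap; only "good" revisions pay for the full run of builds.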

Significance of this finding
============================

?

--
* See https://bugs.gentoo.org/724314 for examples/specifics
Comment 1 Greg Turner 2021-09-06 04:33:14 UTC
(In reply to Greg Turner from comment #0)
> this bug report will be long and confusing

tldr: as of 94e24187 host-optimized gcc exhibits ICEs while building complex C++ projects on AMD zen hosts.

The problem manifests during gimplification, usually nondeterministically.

Reverting 94e24187 resolves the problem; however, the ultimate cause is as yet undiagnosed.

Full details in my original long and confusing post :)
Comment 2 Martin Liška 2021-09-06 06:53:34 UTC
Can you please show how you configure and build GCC (gcc -v)?
And can you please attach a pre-processed boost source (and command-line used) that can reproduce the issue?
Comment 3 Greg Turner 2021-09-06 08:00:48 UTC
(In reply to Martin Liška from comment #2)
> Can you please show how do you configure and build GCC (gcc -v).
> And can you please attach a pre-processed boost source (and command-line
> used) that can reproduce the issue?

Gentoo does all this heavy lifting.  Some of these things I am blissfully ignorant of although I'm happy to drill into the build process and drag some artifacts over here if that will help.

I don't have an affected gcc, since I apply the patch to my one system capable of repro-ing the bug.  I might also have a laptop capable of repro-ing, but I haven't tried.

But here's a gcc -v:

  Using built-in specs.
  COLLECT_GCC=gcc
  COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/11.2.0/lto-wrapper
  Target: x86_64-pc-linux-gnu
  Configured with: /var/tmp/portage/sys-devel/gcc-11.2.0/work/gcc-11.2.0/configure --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/11.2.0 --includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/11.2.0/include --datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/11.2.0 --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/11.2.0/man --infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/11.2.0/info --with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/11.2.0/include/g++-v11 --with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/11.2.0/python --enable-languages=c,c++,go,jit,fortran --enable-obsolete --enable-secureplt --disable-werror --with-system-zlib --enable-nls --without-included-gettext --disable-libunwind-exceptions --enable-checking=release --with-bugurl=https://bugs.gentoo.org/ --with-pkgversion='Gentoo 11.2.0 p1' --disable-esp --enable-libstdcxx-time --enable-host-shared --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --enable-multilib --with-multilib-list=m32,m64 --disable-fixed-point --enable-targets=all --enable-libgomp --disable-libssp --disable-libada --enable-systemtap --enable-valgrind-annotations --disable-vtable-verify --disable-libvtv --with-zstd --enable-lto --with-isl --disable-isl-version-check --enable-default-pie --enable-default-ssp
  Thread model: posix
  Supported LTO compression algorithms: zlib zstd
  gcc version 11.2.0 (Gentoo 11.2.0 p1) 

Without the patch the same configuration would of course enable me to repro; I assume the above would be 100% the same.

Note that this doesn't even capture the most important part of the equation*:

  $ portageq envvar CFLAGS
  -march=znver1 -mtune=znver1 -O2 -pipe -g

tbh I guess I am not sure if I ever confirmed this applied to 11.2; perhaps 11.2 fixed it and I never noticed.  But I very highly doubt it; this bug has been quite persistent.

As for boost, I don't think any special configuration or version is required to make it happen ... [time passes...] got it, the specific build step that tends** to cause the failure is:

/usr/bin/x86_64-pc-linux-gnu-g++ -fvisibility-inlines-hidden -O3 -march=native -pipe -std=c++14 -fPIC -m64 -pthread -finline-functions -Wno-inline -Wall -fvisibility=hidden -ftemplate-depth-255 -fvisibility=hidden -fvisibility-inlines-hidden -DBOOST_ALL_NO_LIB=1 -DBOOST_SERIALIZATION_DYN_LINK=1 -DNDEBUG -I. -c -o bin.v2/libs/serialization/build/gcc-10.1/gentoorelease/pch-off/threading-multi/visibility-hidden/xml_grammar.o libs/serialization/src/xml_grammar.cpp

Anyhow, since we know for sure my system repros the bug, here are the use-flags affecting my boost build:
dev-libs/boost-1.76.0-r1:0/1.76.0::gentoo  USE="bzip2 doc icu lzma nls python threads tools zlib zstd -context -debug -mpi -numpy -static-libs" ABI_X86="(64) -32 (-x32)" PYTHON_TARGETS="python3_8 python3_9 -python3_10"

This passes "${OPTIONS[@]}" to boost's jam invocation which on my system ends up looking like:

  declare -a OPTIONS=(
    [0]="gentoorelease"
    [1]="-j52"
    [2]="-q"
    [3]="-d+2"
    [4]="pch=off"
    [5]="-sICU_PATH=/usr"
    [6]="--without-mpi"
    [7]="--without-context"
    [8]="--without-coroutine"
    [9]="--without-fiber"
    [10]="--without-stacktrace"
    [11]="--boost-build=/usr/share/boost-build/src"
    [12]="--layout=system"
    [13]="threading=multi"
    [14]="link=shared"
    [15]="-sNO_BZIP2=0"
    [16]="-sNO_LZMA=0"
    [17]="-sNO_ZLIB=0"
    [18]="-sNO_ZSTD=0"
  )

See comments 12-14 of the Gentoo bug for some talk/examples of preproc headers.  I do not have an affected compiler at my disposal and am not even sure how all this preproc header stuff works... let me know if that's really a serious need & I'll build a bugged compiler & hit the books or w/e is required to figure out the preproc header things.

--
* if using Gentoo or Gentoo-prefix to repro this (possibly the path of least confusion ime) the use flag "custom-cflags" must be set on sys-devel/gcc to repro (otherwise the c{,xx}flags do not actually apply to gcc).

** rarely, I think I observed similar gimplification failures elsewhere in the build during my git bisect.  Maybe twice during the ~24 builds that would have occurred using affected gcc's during my bisect, based on a guesstimated 4 mean builds before failure for bugged compilers and assuming 50% of my 12-step git bisect was bugged...?
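
The back-of-envelope estimate in the second footnote can be spelled out as below. This is only a sketch: the three input figures are the guesstimates quoted above, and the last step additionally assumes that builds fail independently with a constant per-build probability and that "4 mean builds" counts the failing build, neither of which the report establishes.

```python
# Figures quoted in the footnote above (guesstimates, not measurements).
BISECT_STEPS = 12            # length of the git bisect
FRACTION_BUGGED = 0.5        # share of bisect steps that landed on a bugged gcc
MEAN_BUILDS_TO_FAILURE = 4   # mean boost builds until a bugged gcc fails
PASSES_FOR_GOOD = 50         # consecutive passes required to call a step "good"

# Expected number of boost builds done with bugged compilers during the bisect.
expected_bugged_builds = BISECT_STEPS * FRACTION_BUGGED * MEAN_BUILDS_TO_FAILURE

# ASSUMPTION (not from the report): independent builds with a constant
# failure probability, so the mean of 4 corresponds to p_fail = 1/4.
p_fail = 1 / MEAN_BUILDS_TO_FAILURE

# Chance that a bugged compiler survives all 50 builds and is wrongly
# marked "good" at a single bisect step.
p_false_good = (1 - p_fail) ** PASSES_FOR_GOOD

print(f"expected builds with bugged compilers: {expected_bugged_builds:.0f}")
print(f"chance of a false 'good' per bugged step: {p_false_good:.2e}")
```

Under those assumptions a false "good" verdict is very unlikely, which is the reason for insisting on so many consecutive passes.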
Comment 4 Martin Liška 2021-09-06 08:22:05 UTC
> 
> As for boost, I don't think any special configuration or version is required
> to make it happen ... [time passes...] got it, the specific build step that
> tends** to cause the failure is:
> 
> /usr/bin/x86_64-pc-linux-gnu-g++ -fvisibility-inlines-hidden -O3
> -march=native -pipe -std=c++14 -fPIC -m64 -pthread -finline-functions
> -Wno-inline -Wall -fvisibility=hidden -ftemplate-depth-255
> -fvisibility=hidden -fvisibility-inlines-hidden -DBOOST_ALL_NO_LIB=1
> -DBOOST_SERIALIZATION_DYN_LINK=1 -DNDEBUG -I. -c -o
> bin.v2/libs/serialization/build/gcc-10.1/gentoorelease/pch-off/threading-
> multi/visibility-hidden/xml_grammar.o libs/serialization/src/xml_grammar.cpp

Please attach a pre-processed source file (use -E option) to this bug. Note I don't use
Gentoo Linux, so I can't easily build the boost package.
Comment 5 Greg Turner 2021-09-06 08:33:09 UTC
(In reply to Martin Liška from comment #4)
> (use -E option) to this bug. Note

Oh, /that/ kind of preprocessed!  That's easy... I thought it was some kind of re-usable pre-compiled header file thing, sorry.

I would think you'd want the one generated on the bugged compiler, not mine.  But iiuc I guess they'd be identical, assuming all is well until gimplification?

For a start I'll get you what comes out of my patched gcc, I'll just need a moment.
Comment 6 Martin Liška 2021-09-06 08:41:34 UTC
> I would think you'd want the one generated on the bugged compiler, not mine.
> But iiuc I guess they'd be identical, assuming all is well until
> gimplification?

Yes, that's identical, it's a source file.
Comment 7 Greg Turner 2021-09-06 08:41:38 UTC
Created attachment 51412 [details]
xml_grammar_gcc_-E.cpp.xz

preproc boost cpp file that tends to trigger failure
Comment 8 Greg Turner 2021-09-06 08:48:27 UTC
Actually please ignore that one pending replacement, I probably generated it wrong...
Comment 9 Greg Turner 2021-09-06 08:50:43 UTC
Never mind, corrected version is quite equivalent:

--- xml_grammar_gcc_-E.cpp      2021-09-06 01:38:48.125773266 -0700
+++ xml_grammar_gcc_-E-try2.cpp 2021-09-06 01:49:24.384875598 -0700
@@ -1,4 +1,5 @@
 # 0 "libs/serialization/src/xml_grammar.cpp"
+# 1 "/var/tmp/portage/dev-libs/boost-1.76.0-r1/work/boost_1_76_0-abi_x86_64.amd64//"
 # 0 "<built-in>"
 # 0 "<command-line>"
 # 1 "/usr/include/stdc-predef.h" 1 3 4

:)
Comment 10 Greg Turner 2021-09-06 09:02:31 UTC
If you find yourself not readily reproducing, let me know....

I suspect a pregenerated gentoo prefix might be a nice "drag-and-drop" way to get someone up and running with a fully working reproduction.  Of course it'll take a bunch of time to create one.

But once done, interested parties could just shove it into their userspace sandboxes and stop scratching their heads.

Otherwise, the nondeterminism sort of leads to awful quagmires where you just keep scratching your head and saying "...or maybe I just got lucky..." followed by an unsatisfying period of reading about probability-density-functions and confidence-intervals on Wikipedia :)
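
To put a number on "...or maybe I just got lucky...", here is a rough sketch of how many consecutive clean builds are needed before luck becomes an implausible explanation. It assumes each build with a bugged compiler fails independently with a fixed probability `p_fail`. That is a simplification, and the values tried are illustrative rather than measured.

```python
import math


def clean_runs_needed(p_fail, alpha):
    """Smallest n such that (1 - p_fail) ** n <= alpha.

    p_fail: assumed per-build failure probability of a bugged compiler.
    alpha:  acceptable chance that n clean builds were pure luck.
    """
    if not (0 < p_fail < 1 and 0 < alpha < 1):
        raise ValueError("p_fail and alpha must lie strictly between 0 and 1")
    return math.ceil(math.log(alpha) / math.log(1 - p_fail))


# Illustrative values only: one failure in four builds on average,
# and a one-in-a-thousand tolerance for having been lucky.
print(clean_runs_needed(0.25, 0.001))
```

The rarer the failure, the faster the required number of clean builds grows, which is what makes a low-probability reproduction so painful to rule out.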
Comment 11 Jakub Jelinek 2021-09-06 09:14:58 UTC
If it is really a buffer overflow in cp_gimplify_expr (which is weird, there aren't many arrays in there nor anything else that could result in that, there is
          tree *data[4] = { NULL, NULL, NULL, NULL };
but you aren't compiling with -fopenmp and it also uses just those 4 entries), then perhaps asan built gcc might be more reliable at detecting it.
Of course, if it is miscompiled cp_gimplify_expr, it might be something else...
Comment 12 Martin Liška 2021-09-06 09:39:48 UTC
I tried bootstrapping the current tip of gcc-11 branch with -O2 -march=native on my 
model name	: AMD Ryzen 7 2700X Eight-Core Processor

but I can't reproduce the ICE on the provided boost test-case :/
Comment 13 Alexander Monakov 2021-09-06 16:33:53 UTC
Sergei Trofimovich made substantial progress on diagnosing this on the Gentoo side, and according to his findings the crash is due to reading the stack canary from a wrong location. This indicates that the bug is not in GCC, but in the CPU or maybe the kernel.

Please see comments 73 and 74 in the Gentoo bugreport: https://bugs.gentoo.org/724314#c73