Following the discussion on PR83356, I did some more performance analysis of the AES code with various compiler versions, by running the in-kernel crypto selftest:

  kvm -kernel linux/arch/x86/boot/bzImage \
      -append "tcrypt.mode=200 tcrypt.sec=1 console=ttyS0" \
      -nographic -serial mon:stdio

This showed a very clear slowdown with gcc-7.2 (dated 20171130) compared to 7.1. All numbers are in cycles/byte for AES256+CBC on a 3.1GHz AMD Threadripper; lower numbers are better:

                    default   ubsan   patched   patched+ubsan
  gcc-4.3.6 -O2       14.9     ----     14.9        ----
  gcc-4.6.4 -O2       15.0     ----     15.8        ----
  gcc-4.9.4 -O2       15.5     20.7     15.9        20.9
  gcc-5.5.0 -O2       15.6     47.3     86.4        48.8
  gcc-6.3.1 -O2       14.6     49.4     94.3        50.9
  gcc-7.1.1 -O2       13.5     54.6     15.2        52.0
  gcc-7.2.1 -O2       16.8    124.7     92.0        52.2
  gcc-8.0.0 -O2       14.6     56.6     15.3        53.5
  gcc-7.1.1 -O1       14.6     53.8
  gcc-7.2.1 -O1       15.5     55.9
  gcc-8.0.0 -O1       15.0     50.7
  clang-5   -O1       21.7     58.3
  clang-5   -O2       15.5     49.1
  handwritten asm     16.4

The 'patched' columns are with -ftree-pre and -ftree-sra disabled in the sources, which happened to help performance on gcc-7.2.1 and to work around PR83356, but made things worse in most other cases.

For better reproducibility, I tried doing the same with the libressl implementation of the same cipher, which gives interesting but unfortunately very different results:

  gcc-5.5.0 -O2       49.0
  gcc-6.3.1 -O2       48.8
  gcc-7.1.1 -O2       59.7
  gcc-7.2.1 -O2       60.3
  gcc-8.0.0 -O2       59.6
  gcc-5.5.0 -O1       59.5
  gcc-6.3.1 -O1       48.5
  gcc-7.1.1 -O1       51.6
  gcc-7.2.1 -O1       51.6
  gcc-8.0.0 -O1       51.6

The source code is apparently derived from a common origin but has evolved in different ways, and the version from the kernel appears to be much faster overall. In both cases, we see a ~20% degradation between gcc-6.3.1 and gcc-7.2.1, but gcc-7.1.1 happens to produce the best results for the kernel version and very bad results for the libressl sources. The stack consumption problem from PR83356 does not appear with the libressl sources. I have not managed to run a ubsan-enabled libressl binary for testing.
To put this in context, both libressl and Linux come with architecture-specific versions using SIMD registers for most architectures, and those tend to be much faster, but the C version is used on old x86 CPUs and minor architectures that lack SIMD registers or an AES implementation for them. If there is enough interest in addressing the slowdown, it should be possible to create a version of the kernel AES implementation that can be run in user space, as the current method of reproducing the results is fairly tedious.
Before posting a new workaround for PR83356 (the workaround is to use -Os instead of -O2 for this file), I retested the performance numbers as well, and got slightly different numbers this time. I don't know what caused the difference, but this is what I see now:

              -O2    -Os
  gcc-6.3.1   14.9   15.1
  gcc-7.0.1   14.7   15.3
  gcc-7.1.1   15.3   14.7
  gcc-7.2.1   16.8   15.9
  gcc-8.0.0   15.5   15.6

In particular, the gcc-7.1.1 results are a bit worse than they were, leading to a less significant regression from 7.1.1 to 7.2.1, and the numbers are now closer to what I saw with libressl. In both cases, we still have a 5% to 9% regression between gcc-7.1.1 (20170717) and gcc-7.2.1 (20180102), and a 14% to 23% regression between 6.3.1 and 7.2.1.

I also found my mistake in the libressl numbers I showed in comment #1: they are listed exactly a factor of 3 higher than they should have been, and the actual results are close to the kernel implementation. I've measured these again now as well and come to the following results, using identical compilers as above:

              -O2    -Os
  gcc-6.3.1   16.7   16.7
  gcc-7.0.1   17.5   16.0
  gcc-7.1.1   17.5   16.0
  gcc-7.2.1   17.6   16.0
  gcc-8.0.0   16.8   15.5

To reproduce with libressl, one could use the following steps:

  $ git clone https://github.com/libressl-portable/portable.git
  $ cd portable
  $ ./autogen.sh
  $ sed -i 's/undef FULL_UNROLL/define FULL_UNROLL/' crypto/aes/aes_locl.h
  $ CC=x86_64-linux-gcc-7.2.1 ./configure --disable-asm
  $ make -sj8
  $ ./apps/openssl/openssl speed aes-256-cbc
  The 'numbers' are in 1000s of bytes per second processed.
  type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
  aes-256 cbc     168004.61k   174024.74k   174855.76k   176270.13k   176608.14k
  $ CC=x86_64-linux-gcc-6.3.1 ./configure --disable-asm
  $ touch crypto/aes/aes_core.c
  $ make -sj8
  $ ./apps/openssl/openssl speed aes-256-cbc
  The 'numbers' are in 1000s of bytes per second processed.
  type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
  aes-256 cbc     175366.81k   182261.29k   183131.80k   184369.21k   184611.37k
My kernel patch to use -Os got merged, but caused a regression, so I kept experimenting with the libressl implementation. Apparently, turning off -fcode-hoisting is a better way to address PR83356, and the performance is the same as with -Os. New numbers with libressl, same method as before:

              -O2    -Os    -O2 -fno-code-hoisting
  gcc-6.3.1   16.7   16.7        -
  gcc-7.0.1   17.5   16.0       16.0
  gcc-7.1.1   17.5   16.0       16.0
  gcc-7.2.1   17.5   16.0       16.0
  gcc-8.0.0   16.8   15.5       15.5
(In reply to Arnd Bergmann from comment #0)
> If there is enough interest in addressing the slowdown, it should be
> possible to create a version of the kernel AES implementation that can be
> run in user space, as the current method of reproducing the results is
> fairly tedious.

I would say that a 20% slowdown is significant enough that we should definitely look into this. A user space version would help immensely here.

> The source code is apparently derived from a common source, but has evolved
> in different ways, and the version from the kernel appears to be much faster
> overall.

It looks like you have various benchmarks based on different code bases. This is not good for reproducibility or for diagnosing the problem. Could we settle on one, ideally a (simple) user space version? This would drastically increase the likelihood of finding a solution :).

Also, is this a GCC 8 regression? It looks like in most of the benchmarks you post, GCC 8 performs pretty close to 4.x. Again, settling on one benchmark, preferably in user space, would really help.

Thanks.
(In reply to Aldy Hernandez from comment #3)
> (In reply to Arnd Bergmann from comment #0)
>
> > If there is enough interest in addressing the slowdown, it should be
> > possible to create a version of the kernel AES implementation that can be
> > run in user space, as the current method of reproducing the results is
> > fairly tedious.
>
> I would say that a 20% slowdown is significant enough that we should
> definitely look into this. A user space version would help immensely here.

The 20% number I got was from 7.1.1 to 7.2.1, but I can't reproduce the 7.1.1 performance any more, so it's possible that this was supposed to be 15.3 cycles instead of 13.5 cycles. We'd still have a 13% regression using the kernel implementation, and a 9% regression with libressl, which is probably still significant.

> > The source code is apparently derived from a common source, but has evolved
> > in different ways, and the version from the kernel appears to be much faster
> > overall.
>
> It looks like you have various benchmarks based on different code bases.
> This is not good for reproducibility or for diagnosing the problem. Could we
> settle on one, ideally a (simple) user space version? This would
> drastically increase the likelihood of finding a solution :).

I'd suggest sticking with the libressl test case from comment 1, and ignoring the kernel version until the libressl one is fully understood. It seems very likely that fixing one will also address the other.

Are you able to start with the test procedure from comment 1, or do you need something that can be scripted better, e.g. in a single C file?

> Also, is this a GCC 8 regression? It looks like in most of the benchmarks
> you post, GCC 8 performs pretty close to 4.x. Again, settling on one
> benchmark, preferably in user space, would really help.
I had originally classified it as a "7.2 regression"; Richard changed it to "7/8 regression", which I think is correct: the problem is almost certainly the -fcode-hoisting optimization step, and both gcc-7 and gcc-8 show about a 10% difference between plain "-O2" and "-O2 -fno-code-hoisting"; it's just that gcc-8 is faster overall.
(In reply to Arnd Bergmann from comment #4)
> I'd suggest sticking with the libressl test case from comment 1, and
> ignoring the kernel version until the libressl one is fully understood. It
> seems very likely that fixing one will also address the other.

Alright, let's start with libressl, which is user space.

> Are you able to start with the test procedure from comment 1, or do you need
> something that can be scripted better, e.g. in a single C file?

abulafia:/tmp/portable [master]$ ./autogen.sh
pulling upstream openbsd source
Cloning into 'openbsd'...
...
copying manpages
configure.ac:32: error: possibly undefined macro: AC_PROG_LIBTOOL
      If this token and others are legitimate, please use m4_pattern_allow.
      See the Autoconf documentation.
autoreconf: /usr/bin/autoconf failed with exit status: 1

Perhaps this is sensitive to the autoconf version; I am running 2.69 on Fedora 27. But I think a single C file would be easier. Everyone's running a different (probably Linux) OS variant around here.

Thanks.
Created attachment 43177 [details]
Single-file version of aes benchmark

I've managed to condense the 'openssl speed aes-256-cbc' test into a single file now:

  $ x86_64-linux-gcc-7.2.1 -Wall -O2 aes_test.c -o aes_test
  $ time ./aes_test
  real    0m4.499s
  user    0m4.498s
  sys     0m0.000s
  $ x86_64-linux-gcc-7.2.1 -Wall -O2 aes_test.c -o aes_test -fno-code-hoisting
  $ time ./aes_test
  real    0m4.135s
  user    0m4.134s
  sys     0m0.000s

The test is hardcoded to do 100000 runs of 8192 bytes, so on my 3.1GHz CPU that translates to cycles:

  $ echo $[4499 * 310000000 / 819200000]
  1702    # 17.02 cycles
  $ echo $[4135 * 310000000 / 819200000]
  1564    # 15.64 cycles

Similar results with gcc-8.0.0:

  $ x86_64-linux-gcc-8.0.0 -Wall -O2 aes_test.c -o aes_test
  $ time ./aes_test
  real    0m4.471s
  user    0m4.470s
  sys     0m0.000s
  $ x86_64-linux-gcc-8.0.0 -Wall -O2 aes_test.c -o aes_test -fno-code-hoisting
  $ time ./aes_test
  real    0m4.052s
  user    0m4.052s
  sys     0m0.000s

Hope that helps
Created attachment 43178 [details]
Single-file version of aes benchmark (shorter)

I decided to strip the test case down a bit more by removing the unused decryption side, and checked that nothing else has changed.
I ran the testcase in comment #7 and can confirm that there is a 5.24% performance regression from 6.3.1 to 7.2.1, and a 2.88% regression from 6.3.1 to 8.0.

I ran the test on my unloaded workstation, which is a:

  model name : Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz

  Average user time of 15 runs: a.out.6.3.1 is 3.820000
  Average user time of 15 runs: a.out.7.2.1 is 4.020000
  Average user time of 15 runs: a.out.8.0   is 3.931333

I can also confirm that everything runs much faster with -fno-code-hoisting, but that may or may not be related:

  Average user time of 15 runs: a.out.8.0 -fno-code-hoisting is 3.734667

These regressions seem pretty small, though. Perhaps we could find what regressed from 6.3.1 to 7.2.1, and maybe that's still causing problems in 8 (even though other factors may be causing it to be faster in 8).

Confirmed.
Original regression in 7.x started with the -fcode-hoisting pass in r238242. Things started improving with r254948, though that is probably unrelated. Perhaps Richard can comment.
On Thu, 18 Jan 2018, aldyh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83651
>
> Aldy Hernandez <aldyh at gcc dot gnu.org> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>                  CC|                            |jakub at gcc dot gnu.org,
>                    |                            |rguenth at gcc dot gnu.org
>
> --- Comment #9 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> Original regression in 7.x started with the -fcode-hoisting pass in r238242.
> Things started improving with r254948, though that is probably unrelated.
>
> Perhaps Richard can comment.

Code hoisting does its job - it reduces the number of stmts in the program. Together with PRE, code hoisting enables more PRE and thus causes extra PHIs (not sure if those are the problem). But if you look at code hoisting in isolation (-fcode-hoisting -fno-tree-pre) then it should always be profitable - it's probably the extra PRE that does the harm here.

Numbers on my machine:

> ./xgcc -B. t.c -O2
> /usr//bin/time ./a.out
4.13user 0.00system 0:04.13elapsed 99%CPU (0avgtext+0avgdata 1040maxresident)k
0inputs+0outputs (0major+61minor)pagefaults 0swaps
> /usr//bin/time ./a.out
4.06user 0.00system 0:04.06elapsed 100%CPU (0avgtext+0avgdata 1032maxresident)k
0inputs+0outputs (0major+60minor)pagefaults 0swaps

> ./xgcc -B. t.c -O2 -fno-tree-pre -fcode-hoisting
> /usr//bin/time ./a.out
3.87user 0.00system 0:03.87elapsed 99%CPU (0avgtext+0avgdata 1052maxresident)k
0inputs+0outputs (0major+61minor)pagefaults 0swaps
> /usr//bin/time ./a.out
3.90user 0.00system 0:03.90elapsed 99%CPU (0avgtext+0avgdata 1060maxresident)k
0inputs+0outputs (0major+62minor)pagefaults 0swaps

> ./xgcc -B. t.c -O2 -ftree-pre -fno-code-hoisting
> /usr//bin/time ./a.out
3.85user 0.00system 0:03.85elapsed 100%CPU (0avgtext+0avgdata 1032maxresident)k
0inputs+0outputs (0major+60minor)pagefaults 0swaps
> /usr//bin/time ./a.out
3.85user 0.01system 0:03.87elapsed 99%CPU (0avgtext+0avgdata 1060maxresident)k
0inputs+0outputs (0major+62minor)pagefaults 0swaps

note that both PRE and code-hoisting are sources of increased register pressure.

> ./xgcc -B. t.c -O2 -ftree-pre -fcode-hoisting -S
> grep rsp t.s | wc -l
47
> ./xgcc -B. t.c -O2 -ftree-pre -fno-code-hoisting -S
> grep rsp t.s | wc -l
11
> ./xgcc -B. t.c -O2 -fno-tree-pre -fcode-hoisting -S
> grep rsp t.s | wc -l
11

taming PRE down by decoupling code hoisting and PRE results in

> ./xgcc -B. t.c -O2 -ftree-pre -fcode-hoisting -S
> grep rsp t.s | wc -l
11
> ./xgcc -B. t.c -O2 -ftree-pre -fcode-hoisting
> /usr//bin/time ./a.out
3.90user 0.00system 0:03.90elapsed 100%CPU (0avgtext+0avgdata 1148maxresident)k
0inputs+0outputs (0major+63minor)pagefaults 0swaps
> /usr//bin/time ./a.out
3.89user 0.00system 0:03.89elapsed 100%CPU (0avgtext+0avgdata 1128maxresident)k
0inputs+0outputs (0major+60minor)pagefaults 0swaps

Index: gcc/tree-ssa-pre.c
===================================================================
--- gcc/tree-ssa-pre.c  (revision 256837)
+++ gcc/tree-ssa-pre.c  (working copy)
@@ -3687,15 +3687,23 @@ insert (void)
       if (dump_file && dump_flags & TDF_DETAILS)
         fprintf (dump_file, "Starting insert iteration %d\n", num_iterations);
       new_stuff = insert_aux (ENTRY_BLOCK_PTR_FOR_FN (cfun), flag_tree_pre,
-                             flag_code_hoisting);
+                             false);

       /* Clear the NEW sets before the next iteration.  We have already
          fully propagated its contents.  */
-      if (new_stuff)
+      if (new_stuff || flag_code_hoisting)
        FOR_ALL_BB_FN (bb, cfun)
          bitmap_set_free (NEW_SETS (bb));
     }
   statistics_histogram_event (cfun, "insert iterations", num_iterations);
+
+  if (flag_code_hoisting)
+    {
+      if (dump_file && dump_flags & TDF_DETAILS)
+       fprintf (dump_file, "Starting insert for code hoisting\n");
+      new_stuff = insert_aux (ENTRY_BLOCK_PTR_FOR_FN (cfun), false,
+                             flag_code_hoisting);
+    }
 }

but AFAIU this patch shouldn't have any effect... I guess I have to think about this 2nd order effect again (it might be a missed PRE in the first place, which of course wouldn't help us ;)).

The above FAILs, for example:

FAIL: gcc.dg/tree-ssa/ssa-hoist-3.c scan-tree-dump pre "Insertions: 1"
FAIL: gcc.dg/tree-ssa/ssa-pre-30.c scan-tree-dump-times pre "Replaced MEM" 2
Trying out the patch from comment 10 on the original preprocessed source as attached to pr83356 also shows very noticeable improvements in stack spilling there:

x86_64-linux-gcc-6.3.1 -Wall -O2 -S ./aes_generic.i -Wframe-larger-than=10 -fsanitize=bounds -fsanitize=object-size -fno-strict-aliasing ; grep rsp aes_generic.s | wc -l
/git/arm-soc/crypto/aes_generic.c: In function 'aes_encrypt':
/git/arm-soc/crypto/aes_generic.c:1371:1: warning: the frame size of 48 bytes is larger than 10 bytes [-Wframe-larger-than=]
4075

x86_64-linux-gcc-7.1.1 -Wall -O2 -S aes_generic.i -Wframe-larger-than=10 -fsanitize=bounds -fsanitize=object-size -fno-strict-aliasing ; grep rsp aes_generic.s | wc -l
/git/arm-soc/crypto/aes_generic.c: In function 'aes_encrypt':
/git/arm-soc/crypto/aes_generic.c:1371:1: warning: the frame size of 304 bytes is larger than 10 bytes [-Wframe-larger-than=]
 }
4141

x86_64-linux-gcc-7.2.1 -Wall -O2 -S aes_generic.i -Wframe-larger-than=10 -fsanitize=bounds -fsanitize=object-size -fno-strict-aliasing ; grep rsp aes_generic.s | wc -l
/git/arm-soc/crypto/aes_generic.c: In function 'aes_encrypt':
/git/arm-soc/crypto/aes_generic.c:1371:1: warning: the frame size of 3840 bytes is larger than 10 bytes [-Wframe-larger-than=]
10351

# same as x86_64-linux-gcc-7.2.1 but with patch from comment 10:
./xgcc -Wall -O2 -S ./aes_generic.i -Wframe-larger-than=10 -fsanitize=bounds -fsanitize=object-size -fno-strict-aliasing ; grep rsp aes_generic.s | wc -l
/git/arm-soc/crypto/aes_generic.c: In function 'aes_encrypt':
/git/arm-soc/crypto/aes_generic.c:1371:1: warning: the frame size of 272 bytes is larger than 10 bytes [-Wframe-larger-than=]
4739

My interpretation is that there are two distinct issues: both AES implementations (libressl and linux-kernel) suffer from a 5% to 10% regression that is triggered by the combination of -ftree-pre and -fcode-hoisting, but only the kernel implementation suffers from a second issue that Martin Liška traced back to r251376. This results in another few percent of slowdown in gcc-7.2.1, and a factor of 2.3x slowdown (with a corresponding increase in stack accesses) when -fsanitize=bounds -fsanitize=object-size gets enabled.
So somehow without code hoisting we don't find a single PRE opportunity - that's odd. Ah, so it goes:

int x;
int foo(int cond1, int cond2, int op1, int op2, int op3)
{
  int op;
  if (cond1)
    {
      x = op1 << 8;
      if (cond2)
        op = op2;
      else
        op = op3;
    }
  else
    op = op1;
  return op << 8;
}

When looking at simple PRE, op << 8 is not detected as partially redundant, because while GVN PRE (PHI translation VN) ends up value-numbering op << 8 on the !cond1 path the same as op1 << 8:

[changed] ANTIC_IN[6] := { op1_5(D) (0006), {lshift_expr,op1_5(D),8} (0002) }
[changed] ANTIC_IN[3] := { op1_5(D) (0006), cond2_8(D) (0009), {lshift_expr,op1_5(D),8} (0002) }

it doesn't consider op << 8 available on any path -- there isn't really any redundant computation on any path in the above code.

Now comes code hoisting, which sees op1 << 8 computed twice and hoists it before if (cond1):

int x;
int foo(int cond1, int cond2, int op1, int op2, int op3)
{
  int op;
  int tem = op1 << 8;
  if (cond1)
    {
      x = tem;
      if (cond2)
        op = op2;
      else
        op = op3;
    }
  else
    op = op1;
  return op << 8;
}

Note how it didn't end up removing the redundancy on the !cond1 path on its own! It relies on PRE to clean up after itself here. After this, the PRE algorithm now finds its "available on one path" -- namely on the cond1 path where op << 8 is now computed twice -- and inserts op2 << 8 and op3 << 8 in the other predecessor (note we have a PHI with three args here; if we'd split that, this might also change code generation for the better). And we end up with:

int x;
int foo(int cond1, int cond2, int op1, int op2, int op3)
{
  int tem = op1 << 8;
  if (cond1)
    {
      x = tem;
      if (cond2)
        tem = op2 << 8;
      else
        tem = op3 << 8;
    }
  return tem;
}

Note that with the proposed patch, the handling of the two FAILing testcases becomes less efficient -- doing hoisting first exposes full redundancies and thus avoids useless PRE insertions. For both FAILing testcases, code generation in the end doesn't change.

Now we have to wrap our brains around the above testcase and transform, and decide whether the order of events is good and expected or not.
Created attachment 43185 [details]
Linux kernel version of AES algorithm, ported to standalone executable

I've had another look at extracting a test case from the Linux kernel copy of this code. This now also shows the gcc-7.2.1-specific problem:

$ x86_64-linux-gcc-7.1.1 -Wall -O2 -fsanitize=bounds -fsanitize=object-size aes_generic.c -o aes_generic; time ./aes_generic
real    0m9.406s

$ x86_64-linux-gcc-7.1.1 -Wall -O2 -fsanitize=bounds -fsanitize=object-size aes_generic.c -o aes_generic -fno-code-hoisting; time ./aes_generic
real    0m8.318s

$ x86_64-linux-gcc-7.2.1 -Wall -O2 -fsanitize=bounds -fsanitize=object-size aes_generic.c -o aes_generic; time ./aes_generic
real    0m22.151s

$ x86_64-linux-gcc-7.2.1 -Wall -O2 -fsanitize=bounds -fsanitize=object-size aes_generic.c -o aes_generic -fno-code-hoisting; time ./aes_generic
real    0m8.439s

$ x86_64-linux-gcc-7.1.1 -Wall -O2 aes_generic.c -o aes_generic ; time ./aes_generic
real    0m3.031s

$ x86_64-linux-gcc-7.1.1 -Wall -O2 aes_generic.c -o aes_generic -fno-code-hoisting ; time ./aes_generic
real    0m2.894s

$ x86_64-linux-gcc-7.2.1 -Wall -O2 aes_generic.c -o aes_generic ; time ./aes_generic
real    0m3.307s

$ x86_64-linux-gcc-7.2.1 -Wall -O2 aes_generic.c -o aes_generic -fno-code-hoisting ; time ./aes_generic
real    0m2.875s
On Fri, 19 Jan 2018, arnd at linaro dot org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83651
>
> --- Comment #13 from Arnd Bergmann <arnd at linaro dot org> ---
> Created attachment 43185 [details]
>   --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43185&action=edit
> Linux kernel version of AES algorithm, ported to standalone executable
>
> I've had another look at extracting a test case from the Linux kernel copy of
> this code. This now also shows the gcc-7.2.1 specific problem:
>
> $ x86_64-linux-gcc-7.1.1 -Wall -O2 -fsanitize=bounds -fsanitize=object-size
> aes_generic.c -o aes_generic; time ./aes_generic
> real 0m9.406s
>
> $ x86_64-linux-gcc-7.1.1 -Wall -O2 -fsanitize=bounds -fsanitize=object-size
> aes_generic.c -o aes_generic -fno-code-hoisting; time ./aes_generic
> real 0m8.318s
>
> $ x86_64-linux-gcc-7.2.1 -Wall -O2 -fsanitize=bounds -fsanitize=object-size
> aes_generic.c -o aes_generic; time ./aes_generic
> real 0m22.151s
>
> $ x86_64-linux-gcc-7.2.1 -Wall -O2 -fsanitize=bounds -fsanitize=object-size
> aes_generic.c -o aes_generic -fno-code-hoisting; time ./aes_generic
> real 0m8.439s
>
> $ x86_64-linux-gcc-7.1.1 -Wall -O2 aes_generic.c -o aes_generic ; time
> ./aes_generic
> real 0m3.031s
>
> $ x86_64-linux-gcc-7.1.1 -Wall -O2 aes_generic.c -o aes_generic
> -fno-code-hoisting ; time ./aes_generic
> real 0m2.894s
>
> $ x86_64-linux-gcc-7.2.1 -Wall -O2 aes_generic.c -o aes_generic ; time
> ./aes_generic
> real 0m3.307s
>
> $ x86_64-linux-gcc-7.2.1 -Wall -O2 aes_generic.c -o aes_generic
> -fno-code-hoisting ; time ./aes_generic
> real 0m2.875s

Would be nice if somebody could bisect it. It doesn't look like a PRE-specific issue because there are no relevant PRE changes in the rev. range. I can't reproduce the slowdown when comparing 7.1.0 against 7.2.0, btw, so the regression must occur somewhere between 7.2.0 and now (or 7.1.1 got faster for a few revs).
(In reply to rguenther@suse.de from comment #14)
> Would be nice if somebody can bisect it. It doesn't look like a PRE
> specific issue because there's no relevant PRE changes in the rev. range.
> I can't reproduce the slowdown when comparing 7.1.0 against 7.2.0
> btw, so the regression must occur somewhere between 7.2.0 and now
> (or 7.1.1 got faster for a few revs).

I've checked r251376 (the one I mentioned in comment #11), and confirmed that this caused the difference between my old 7.1.1 and the current 7.2.1.
On Fri, 19 Jan 2018, arnd at linaro dot org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83651
>
> --- Comment #15 from Arnd Bergmann <arnd at linaro dot org> ---
> (In reply to rguenther@suse.de from comment #14)
>
> > Would be nice if somebody can bisect it. It doesn't look like a PRE
> > specific issue because there's no relevant PRE changes in the rev. range.
> > I can't reproduce the slowdown when comparing 7.1.0 against 7.2.0
> > btw, so the regression must occur somewhere between 7.2.0 and now
> > (or 7.1.1 got faster for a few revs).
>
> I've checked r251376 (the one I mentioned in comment #11), and confirmed that
> this caused the difference between my old 7.1.1 and the current 7.2.1.

Ok, this is a bugfix and simply makes PRE do its job "properly" ...
GCC 7.3 is being released, adjusting target milestone.
This is what I see on my machine (model name: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz):

$ gcc-bisect.py bisect 'gcc -O2 aes_generic.c && /usr/bin/time -f '%E' ./a.out' -o

Releases
4.8.0 (e9c762ec4671d77e)(22 Mar 2013 10:05): [took: 6.368s] result: OK 0:03.28
4.8.1 (caa62b4636bfed71)(31 May 2013 09:02): [took: 5.318s] result: OK 0:03.40
4.8.2 (9bcca88e24e64d4e)(16 Oct 2013 07:20): [took: 5.344s] result: OK 0:03.43
4.8.3 (6bbf0dec66c0e719)(22 May 2014 09:10): [took: 5.352s] result: OK 0:03.40
4.8.4 (1a97fa0bb3fa5669)(19 Dec 2014 11:43): [took: 5.283s] result: OK 0:03.39
4.8.5 (cf82a597b0d18985)(23 Jun 2015 07:54): [took: 5.431s] result: OK 0:03.45
4.9.0 (a7aa383874520cd5)(22 Apr 2014 09:43): [took: 5.469s] result: OK 0:03.39
4.9.1 (c6fa1b4126635939)(16 Jul 2014 10:04): [took: 5.500s] result: OK 0:03.39
4.9.2 (c1283af40b65f1ad)(30 Oct 2014 08:27): [took: 5.536s] result: OK 0:03.45
4.9.3 (876d41ed80ce13e0)(26 Jun 2015 17:57): [took: 5.499s] result: OK 0:03.39
4.9.4 (d3191480f376c780)(03 Aug 2016 05:07): [took: 5.444s] result: OK 0:03.38
5.1.0 (d5ad84b309d0d97d)(22 Apr 2015 08:43): [took: 5.770s] result: OK 0:03.37
5.2.0 (7b26e3896e268cd4)(16 Jul 2015 09:13): [took: 5.798s] result: OK 0:03.38
5.3.0 (2bc376d60753a58b)(04 Dec 2015 10:45): [took: 5.789s] result: OK 0:03.37
5.4.0 (9d0507742960aa9f)(03 Jun 2016 08:41): [took: 5.795s] result: OK 0:03.38
5.5.0 (ba9cddfdab8b539b)(10 Oct 2017 08:11): [took: 5.824s] result: OK 0:03.39
6.1.0 (c441d9e8e0438dcf)(27 Apr 2016 08:20): [took: 5.673s] result: OK 0:03.18
6.2.0 (6ac74a62ba725829)(22 Aug 2016 08:01): [took: 5.660s] result: OK 0:03.18
6.3.0 (4b5e15daff8b5444)(21 Dec 2016 07:51): [took: 5.665s] result: OK 0:03.18
6.4.0 (45dd06cef49fe00a)(04 Jul 2017 07:22): [took: 5.832s] result: OK 0:03.18
6.5.0 (e4c9bd2bb2324c32)(26 Oct 2018 09:54): [took: 5.724s] result: OK 0:03.18
7.1.0 (f9105a38249fb57f)(02 May 2017 12:42): [took: 6.186s] result: OK 0:03.34
7.2.0 (1bd23ca8c30f4827)(14 Aug 2017 07:59): [took: 6.149s] result: OK 0:03.33
7.3.0 (87fb575328cc5d95)(25 Jan 2018 08:17): [took: 6.350s] result: OK 0:03.53
8.1.0 (af8bbdf198a7cd61)(02 May 2018 08:13): [took: 6.293s] result: OK 0:03.26
8.2.0 (9fb89fa845c1b2e0)(26 Jul 2018 09:47): [took: 6.165s] result: OK 0:03.26
known-to-work: 4.8.5, 4.9.4, 5.5.0, 6.5.0, 7.3.0, 8.2.0
known-to-fail:

Active branches
6 (f2b648ddec0a9552)(07 Nov 2018 20:52): [took: 5.662s] result: OK 0:03.19
7 (2d21868dbe6a8be0)(02 Jan 2019 00:16): [took: 6.220s] result: OK 0:03.50
8 (687f6e70d3fd0eac)(02 Jan 2019 00:16): [took: 6.226s] result: OK 0:03.26

Active branch bases
4.8-base (daf81a9011692e3e)(16 Mar 2013 02:48): [took: 5.279s] result: OK 0:03.42
4.9-base (ec86f0be138e2f97)(11 Apr 2014 12:47): [took: 5.383s] result: OK 0:03.38
5-base (905be4e64f0f9136)(12 Apr 2015 19:30): [took: 5.669s] result: OK 0:03.37
6-base (a050099a416f013b)(15 Apr 2016 14:51): [took: 5.671s] result: OK 0:03.19
7-base (7369309777f6d6e6)(20 Apr 2017 09:44): [took: 6.165s] result: OK 0:03.33
8-base (941fafa56b52ee23)(25 Apr 2018 07:10): [took: 6.299s] result: OK 0:03.26

Bisecting latest revisions
553d41a8a57496a6(02 Jan 2019 16:30): [took: 6.285s] result: OK 0:03.24
acafca510c97652f(09 Oct 2014 07:40): [took: 5.520s] result: OK 0:03.38

As mentioned in the previous comment, Richi: can we adjust the title and known-to-work fields, as you installed a patch a year ago?
Not sure what you are talking about - your numbers confirm the regression is still present? Or do you mean that GCC 7.1.0 is also bad and only 6.x (which didn't have code hoisting) was OK?
(In reply to Richard Biener from comment #19) > Not sure what you are talking about - your numbers confirm the regression is > still present? Or do you mean that GCC 7.1.0 is also bad and only 6.x was OK > (which didn't have code hoisting?) My numbers show that I can't see any regression/improvement on my machine. And I'm talking about r251376.
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
GCC 8.4.0 has been released, adjusting target milestone.
GCC 8 branch is being closed.
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
GCC 9 branch is being closed.
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
GCC 10 branch is being closed.
GCC 11 branch is being closed.