Bug 83651 - [7/8/9 regression] 20% slowdown of linux kernel AES cipher
Summary: [7/8/9 regression] 20% slowdown of linux kernel AES cipher
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 7.2.1
Importance: P2 normal
Target Milestone: 7.4
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on: 83356
Blocks:
Reported: 2018-01-02 16:59 UTC by Arnd Bergmann
Modified: 2018-01-25 08:27 UTC
CC: 4 users

See Also:
Host:
Target:
Build:
Known to work: 7.1.0
Known to fail:
Last reconfirmed: 2018-01-17 00:00:00


Attachments
Single-file version of aes benchmark (13.97 KB, text/x-csrc)
2018-01-18 16:30 UTC, Arnd Bergmann
Single-file version of aes benchmark (shorter) (8.18 KB, text/x-csrc)
2018-01-18 16:57 UTC, Arnd Bergmann
Linux kernel version of AES algorithm, ported to standalone executable (19.66 KB, text/x-csrc)
2018-01-19 12:09 UTC, Arnd Bergmann

Description Arnd Bergmann 2018-01-02 16:59:57 UTC
Following the discussion on PR83356, I did some more performance analysis of the AES code with various compiler versions, running the in-kernel crypto selftest (kvm -kernel linux/arch/x86/boot/bzImage -append "tcrypt.mode=200 tcrypt.sec=1 console=ttyS0" -nographic -serial mon:stdio). This showed a very clear slowdown with gcc-7.2 (dated 20171130) compared to 7.1. All numbers are in cycles/byte for AES256+CBC on a 3.1GHz AMD Threadripper; lower numbers are better:

                default      ubsan         patched        patched+ubsan
gcc-4.3.6 -O2    14.9        ----           14.9         ----
gcc-4.6.4 -O2    15.0        ----           15.8         ----
gcc-4.9.4 -O2    15.5        20.7           15.9         20.9
gcc-5.5.0 -O2    15.6        47.3           86.4         48.8
gcc-6.3.1 -O2    14.6        49.4           94.3         50.9
gcc-7.1.1 -O2    13.5        54.6           15.2         52.0
gcc-7.2.1 -O2    16.8       124.7           92.0         52.2
gcc-8.0.0 -O2    14.6        56.6           15.3         53.5
gcc-7.1.1 -O1    14.6        53.8
gcc-7.2.1 -O1    15.5        55.9
gcc-8.0.0 -O1    15.0        50.7
clang-5 -O1      21.7        58.3
clang-5 -O2      15.5        49.1
handwritten asm  16.4

The 'patched' columns are with -ftree-pre and -ftree-sra disabled in the sources, which happened to help performance on gcc-7.2.1 and to work around PR83356, but made things worse in most other cases.
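For reference, in the kernel such a per-file override is normally expressed as a kbuild CFLAGS_<object>.o line; the fragment below is only a sketch (the exact Makefile location and file name are assumptions):

```make
# crypto/Makefile (sketch): disable the two passes for this object file only
CFLAGS_aes_generic.o += -fno-tree-pre -fno-tree-sra
```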

For better reproducibility, I tried doing the same with the libressl implementation of the same cipher, which also has interesting but unfortunately very different results:

gcc-5.5.0 -O2    49.0
gcc-6.3.1 -O2    48.8
gcc-7.1.1 -O2    59.7
gcc-7.2.1 -O2    60.3
gcc-8.0.0 -O2    59.6

gcc-5.5.0 -O1    59.5
gcc-6.3.1 -O1    48.5
gcc-7.1.1 -O1    51.6
gcc-7.2.1 -O1    51.6
gcc-8.0.0 -O1    51.6

The source code is apparently derived from a common source, but has evolved in different ways, and the version from the kernel appears to be much faster overall. In both cases, we see a ~20% degradation between gcc-6.3.1 and gcc-7.2.1, but gcc-7.1.1 happens to produce the best results for the kernel version and very bad results for the libressl sources. The stack consumption problem from PR83356 does not appear with the libressl sources. I have not managed to run a ubsan-enabled libressl binary for testing.

To put this in context, both libressl and Linux come with architecture-specific versions using SIMD registers for most architectures, and those tend to be much faster, but the C version is used on old x86 CPUs and minor architectures that lack SIMD registers or an AES implementation for them.

If there is enough interest in addressing the slowdown, it should be possible to create a version of the kernel AES implementation that can be run in user space, as the current method of reproducing the results is fairly tedious.
Comment 1 Arnd Bergmann 2018-01-05 14:52:04 UTC
Before posting a new workaround for PR83356 (the workaround is to use -Os instead of -O2 for this file), I retested the performance numbers as well, and got slightly different numbers this time. I don't know what caused the difference, but this is what I see now:


                      -O2     -Os
      gcc-6.3.1       14.9    15.1
      gcc-7.0.1       14.7    15.3
      gcc-7.1.1       15.3    14.7
      gcc-7.2.1       16.8    15.9
      gcc-8.0.0       15.5    15.6

In particular, the gcc-7.1.1 results are a bit worse than they were, making the regression from 7.1.1 to 7.2.1 less significant, and the numbers are now closer to what I saw with libressl. In both cases, we still have a 5% to 9% regression between gcc-7.1.1 (20170717) and gcc-7.2.1 (20180102), and a 14% to 23% regression between 6.3.1 and 7.2.1.

I also found my mistake in the libressl numbers I showed in the description: they are listed exactly a factor of 3 higher than they should have been, and the actual results are close to the kernel implementation. I've measured these again now as well and come to the following results, using identical compilers as above:

                      -O2     -Os
      gcc-6.3.1       16.7    16.7
      gcc-7.0.1       17.5    16.0
      gcc-7.1.1       17.5    16.0
      gcc-7.2.1       17.6    16.0
      gcc-8.0.0       16.8    15.5

To reproduce with libressl, one could use the following steps:

$ git clone https://github.com/libressl-portable/portable.git
$ cd portable
$ ./autogen.sh
$ sed -i 's/undef FULL_UNROLL/define FULL_UNROLL/' crypto/aes/aes_locl.h
$ CC=x86_64-linux-gcc-7.2.1 ./configure --disable-asm
$ make -sj8
$ ./apps/openssl/openssl speed aes-256-cbc
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256 cbc     168004.61k   174024.74k   174855.76k   176270.13k   176608.14k
$ CC=x86_64-linux-gcc-6.3.1 ./configure --disable-asm
$ touch crypto/aes/aes_core.c 
$ make -sj8
$ ./apps/openssl/openssl speed aes-256-cbc
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-256 cbc     175366.81k   182261.29k   183131.80k   184369.21k   184611.37k
Comment 2 Arnd Bergmann 2018-01-15 10:50:47 UTC
My kernel patch to use -Os got merged, but caused a regression, so I kept experimenting with the libressl implementation. Apparently, turning off -fcode-hoisting is a better way to address PR83356, and the performance is the same as with -Os. New numbers with libressl, same method as before:

                      -O2     -Os    -O2 -fno-code-hoisting
      gcc-6.3.1       16.7    16.7    -
      gcc-7.0.1       17.5    16.0    16.0
      gcc-7.1.1       17.5    16.0    16.0
      gcc-7.2.1       17.5    16.0    16.0
      gcc-8.0.0       16.8    15.5    15.5
Comment 3 Aldy Hernandez 2018-01-17 17:27:09 UTC
(In reply to Arnd Bergmann from comment #0)

> If there is enough interest in addressing the slowdown, it should be
> possible to create a version of the kernel AES implementation that can be
> run in user space, as the current method of reproducing the results is
> fairly tedious.

I would say that a 20% slowdown is significant enough that we should definitely look into this.  A user space version would help immensely here.

> The source code is apparently derived from a common source, but has evolved
> in different ways, and the version from the kernel appears to be much faster
> overall. 

It looks like you have various benchmarks based on different code bases.  This is not good for reproducibility and diagnosing the problem.  Could we settle on one, and ideally a (simple) user space version?  This will drastically increase the likelihood of finding a solution :).

Also, is this a GCC 8 regression?  It looks like in most of the benchmarks you post, GCC 8 performs pretty close to 4.x.  Again, settling on one benchmark, preferably in user space, would really help.

Thanks.
Comment 4 Arnd Bergmann 2018-01-17 19:36:43 UTC
(In reply to Aldy Hernandez from comment #3)
> (In reply to Arnd Bergmann from comment #0)
> 
> > If there is enough interest in addressing the slowdown, it should be
> > possible to create a version of the kernel AES implementation that can be
> > run in user space, as the current method of reproducing the results is
> > fairly tedious.
> 
> I would say that a 20% slowdown is significant enough that we should
> definitely look into this.  A user space version would help immensely here.

The 20% number I got was from 7.1.1 to 7.2.1, but I can't reproduce the
7.1.1 performance any more, so it's possible that this was supposed to be
15.3 cycles instead of 13.5 cycles, but we'd still have a 13% regression
using the kernel implementation, and a 9% regression with libressl, which is
probably still significant.

> > The source code is apparently derived from a common source, but has evolved
> > in different ways, and the version from the kernel appears to be much faster
> > overall. 
> 
> It looks like you have various benchmarks based on different code bases. 
> This is not good for reproducibility and diagnosing the problem.  Could we
> settle on one, and ideally a (simple) user space version?  This will
> drastically increase the likelihood of finding a solution :).

I'd suggest sticking with the libressl test case from comment 1, and ignoring the kernel version until the libressl one is fully understood. It seems very likely that fixing one will also address the other.

Are you able to start with the test procedure from comment 1, or do you need something that can be scripted better, e.g. in a single C file?

> Also, is this a GCC 8 regression?  It looks like in most of the benchmarks
> you post, GCC 8 performs pretty close to 4.x.  Again, settling on one
> benchmark, preferably in user space, would really help.

I had originally classified it as "7.2 regression", Richard changed it to "7/8 regression", which I think is correct: The problem is almost certainly the "-fcode-hoisting" optimization step, and both gcc-7 and gcc-8 show about a 10% difference between the normal "-O2" and "-O2 -fno-code-hoisting", it's just that gcc-8 is faster overall.
Comment 5 Aldy Hernandez 2018-01-18 01:21:55 UTC
(In reply to Arnd Bergmann from comment #4)

> I'd suggest sticking with the libressl test case from comment 1, and
> ignoring the kernel version until the libressl one is fully understood. It
> seems very likely that fixing one will also address the other.

Alright, let's start with libressl which is user-space.

> Are you able to start with the test procedure from comment 1, or do you need
> something that can be scripted better, e.g. in a single C file?

abulafia:/tmp/portable [master]$ ./autogen.sh
pulling upstream openbsd source
Cloning into 'openbsd'...
...
copying manpages
configure.ac:32: error: possibly undefined macro: AC_PROG_LIBTOOL
      If this token and others are legitimate, please use m4_pattern_allow.
      See the Autoconf documentation.
autoreconf: /usr/bin/autoconf failed with exit status: 1

Perhaps this is sensitive to the autoconf version.  I am running 2.69 on Fedora 27.

But I think a single C file would be easier.  Everyone's running a different (probably Linux) OS variant around here.

Thanks.
Comment 6 Arnd Bergmann 2018-01-18 16:30:32 UTC
Created attachment 43177 [details]
Single-file version of aes benchmark

I've managed to condense the 'openssl speed aes-256-cbc' test into a single file now:

$ x86_64-linux-gcc-7.2.1 -Wall -O2 aes_test.c -o aes_test
$ time ./aes_test
real	0m4.499s
user	0m4.498s
sys	0m0.000s

$ x86_64-linux-gcc-7.2.1 -Wall -O2 aes_test.c -o aes_test -fno-code-hoisting
$ time ./aes_test
real	0m4.135s
user	0m4.134s
sys	0m0.000s

The test is hardcoded to do 100000 runs of 8192 bytes, so on my 3.1GHz CPU that translates to cycles:
$ echo $[4499 * 310000000  / 819200000]
1702 # 17.02 cycles
$ echo $[4135 * 310000000  / 819200000]
1564 # 15.6 cycles

Similar results with gcc-8.0.0:

$ x86_64-linux-gcc-8.0.0 -Wall -O2 aes_test.c -o aes_test
$ time ./aes_test
real	0m4.471s
user	0m4.470s
sys	0m0.000s

$ x86_64-linux-gcc-8.0.0 -Wall -O2 aes_test.c -o aes_test -fno-code-hoisting
$ time ./aes_test
real	0m4.052s
user	0m4.052s
sys	0m0.000s

Hope that helps.
Comment 7 Arnd Bergmann 2018-01-18 16:57:01 UTC
Created attachment 43178 [details]
Single-file version of aes benchmark (shorter)

I decided to strip the test case down a bit more by removing the unused decryption side, and checked that nothing else has changed.
Comment 8 Aldy Hernandez 2018-01-18 19:22:12 UTC
I ran the testcase in comment #7 and can confirm that there is a 5.24% performance regression from 6.3.1 to 7.2.1, and a 2.88% regression from 6.3.1 to 8.0.

I ran the test on my unloaded workstation, which is a:

model name      : Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz

Average user time of 15 runs: a.out.6.3.1 is 3.820000
Average user time of 15 runs: a.out.7.2.1 is 4.020000
Average user time of 15 runs: a.out.8.0 is 3.931333

I can also confirm that everything runs much faster with -fno-code-hoisting, but that may or may not be related:

Average user time of 15 runs: a.out.8.0 -fno-code-hoisting is 3.734667

These regressions seem pretty small, though.

Perhaps we could find what regressed from 6.3.1 to 7.2.1, and maybe that's still causing problems in 8 (even though other factors may be causing it to be faster in 8).

Confirmed.
Comment 9 Aldy Hernandez 2018-01-18 20:13:42 UTC
Original regression in 7.x started with the -fcode-hoisting pass in r238242.  Things started improving with r254948, though that is probably unrelated.

Perhaps Richard can comment.
Comment 10 rguenther@suse.de 2018-01-19 09:17:09 UTC
On Thu, 18 Jan 2018, aldyh at gcc dot gnu.org wrote:

> --- Comment #9 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> Original regression in 7.x started with the -fcode-hoisting pass in r238242. 
> Things started improving with r254948, though that is probably unrelated.
> 
> Perhaps Richard can comment.

code-hoisting does its job - it reduces the number of stmts in the
program.  Together with PRE, code-hoisting enables more PRE and
thus causes extra PHIs (not sure if those are the problem).  But
if you look at code-hoisting in isolation (-fcode-hoisting -fno-tree-pre)
then it should always be profitable - it's probably the extra PRE
that does the harm here.  Numbers on my machine:

> ./xgcc -B. t.c -O2
> /usr//bin/time ./a.out 
4.13user 0.00system 0:04.13elapsed 99%CPU (0avgtext+0avgdata 
1040maxresident)k
0inputs+0outputs (0major+61minor)pagefaults 0swaps
> /usr//bin/time ./a.out 
4.06user 0.00system 0:04.06elapsed 100%CPU (0avgtext+0avgdata 
1032maxresident)k
0inputs+0outputs (0major+60minor)pagefaults 0swaps
> ./xgcc -B. t.c -O2 -fno-tree-pre -fcode-hoisting
> /usr//bin/time ./a.out 
3.87user 0.00system 0:03.87elapsed 99%CPU (0avgtext+0avgdata 
1052maxresident)k
0inputs+0outputs (0major+61minor)pagefaults 0swaps
> /usr//bin/time ./a.out 
3.90user 0.00system 0:03.90elapsed 99%CPU (0avgtext+0avgdata 
1060maxresident)k
0inputs+0outputs (0major+62minor)pagefaults 0swaps
> ./xgcc -B. t.c -O2 -ftree-pre -fno-code-hoisting
> /usr//bin/time ./a.out 
3.85user 0.00system 0:03.85elapsed 100%CPU (0avgtext+0avgdata 
1032maxresident)k
0inputs+0outputs (0major+60minor)pagefaults 0swaps
> /usr//bin/time ./a.out 
3.85user 0.01system 0:03.87elapsed 99%CPU (0avgtext+0avgdata 
1060maxresident)k
0inputs+0outputs (0major+62minor)pagefaults 0swaps

note that both PRE and code-hoisting are sources of increased
register pressure.

> ./xgcc -B. t.c -O2 -ftree-pre -fcode-hoisting -S
> grep rsp t.s | wc -l
47
> ./xgcc -B. t.c -O2 -ftree-pre -fno-code-hoisting -S
> grep rsp t.s | wc -l
11
> ./xgcc -B. t.c -O2 -fno-tree-pre -fcode-hoisting -S
> grep rsp t.s | wc -l
11

taming PRE down by decoupling code hoisting and PRE results in

> ./xgcc -B. t.c -O2 -ftree-pre -fcode-hoisting -S
> grep rsp t.s | wc -l
11
> ./xgcc -B. t.c -O2 -ftree-pre -fcode-hoisting 
> /usr//bin/time ./a.out 
3.90user 0.00system 0:03.90elapsed 100%CPU (0avgtext+0avgdata 
1148maxresident)k
0inputs+0outputs (0major+63minor)pagefaults 0swaps
> /usr//bin/time ./a.out 
3.89user 0.00system 0:03.89elapsed 100%CPU (0avgtext+0avgdata 
1128maxresident)k
0inputs+0outputs (0major+60minor)pagefaults 0swaps

Index: gcc/tree-ssa-pre.c
===================================================================
--- gcc/tree-ssa-pre.c  (revision 256837)
+++ gcc/tree-ssa-pre.c  (working copy)
@@ -3687,15 +3687,23 @@ insert (void)
       if (dump_file && dump_flags & TDF_DETAILS)
        fprintf (dump_file, "Starting insert iteration %d\n", 
num_iterations);
       new_stuff = insert_aux (ENTRY_BLOCK_PTR_FOR_FN (cfun), 
flag_tree_pre,
-                             flag_code_hoisting);
+                             false);
 
       /* Clear the NEW sets before the next iteration.  We have already
          fully propagated its contents.  */
-      if (new_stuff)
+      if (new_stuff || flag_code_hoisting)
        FOR_ALL_BB_FN (bb, cfun)
          bitmap_set_free (NEW_SETS (bb));
     }
   statistics_histogram_event (cfun, "insert iterations", num_iterations);
+
+  if (flag_code_hoisting)
+    {
+      if (dump_file && dump_flags & TDF_DETAILS)
+       fprintf (dump_file, "Starting insert for code hoisting\n");
+      new_stuff = insert_aux (ENTRY_BLOCK_PTR_FOR_FN (cfun), false,
+                             flag_code_hoisting);
+    }
 }

but AFAIU this patch shouldn't have any effect...  I guess I have
to think about this 2nd order effect again (might be a missed
PRE in the first place which of course wouldn't help us ;)).
The above patch causes FAILs, for example:

FAIL: gcc.dg/tree-ssa/ssa-hoist-3.c scan-tree-dump pre "Insertions: 1"
FAIL: gcc.dg/tree-ssa/ssa-pre-30.c scan-tree-dump-times pre "Replaced MEM" 2
Comment 11 Arnd Bergmann 2018-01-19 10:00:51 UTC
Trying out the patch from comment 10 on the original preprocessed source as attached to pr83356 also shows very noticeable improvements with stack spilling there:

x86_64-linux-gcc-6.3.1 -Wall -O2 -S ./aes_generic.i  -Wframe-larger-than=10 -fsanitize=bounds -fsanitize=object-size -fno-strict-aliasing ; grep rsp aes_generic.s | wc -l
/git/arm-soc/crypto/aes_generic.c: In function 'aes_encrypt':
/git/arm-soc/crypto/aes_generic.c:1371:1: warning: the frame size of 48 bytes is larger than 10 bytes [-Wframe-larger-than=]
4075

x86_64-linux-gcc-7.1.1 -Wall -O2 -S aes_generic.i  -Wframe-larger-than=10 -fsanitize=bounds -fsanitize=object-size -fno-strict-aliasing ; grep rsp aes_generic.s | wc -l
/git/arm-soc/crypto/aes_generic.c: In function 'aes_encrypt':
/git/arm-soc/crypto/aes_generic.c:1371:1: warning: the frame size of 304 bytes is larger than 10 bytes [-Wframe-larger-than=]
4141

x86_64-linux-gcc-7.2.1 -Wall -O2 -S aes_generic.i  -Wframe-larger-than=10 -fsanitize=bounds -fsanitize=object-size -fno-strict-aliasing ; grep rsp aes_generic.s | wc -l
/git/arm-soc/crypto/aes_generic.c: In function 'aes_encrypt':
/git/arm-soc/crypto/aes_generic.c:1371:1: warning: the frame size of 3840 bytes is larger than 10 bytes [-Wframe-larger-than=]
10351

# same as x86_64-linux-gcc-7.2.1 but with patch from comment 10:
./xgcc -Wall -O2 -S ./aes_generic.i  -Wframe-larger-than=10 -fsanitize=bounds -fsanitize=object-size -fno-strict-aliasing ; grep rsp aes_generic.s | wc -l 
/git/arm-soc/crypto/aes_generic.c: In function 'aes_encrypt':
/git/arm-soc/crypto/aes_generic.c:1371:1: warning: the frame size of 272 bytes is larger than 10 bytes [-Wframe-larger-than=]
4739

My interpretation is that there are two distinct issues: both AES implementations (libressl and linux-kernel) suffer from a 5% to 10% regression that is triggered by the combination of -ftree-pre and -fcode-hoisting, but only the kernel implementation suffers from a second issue that Martin Liška traced back to r251376. This results in another few percent of slowdown in gcc-7.2.1, and a factor-2.3 slowdown (with a corresponding increase in stack accesses) when -fsanitize=bounds -fsanitize=object-size is enabled.
Comment 12 Richard Biener 2018-01-19 10:51:41 UTC
So somehow without code hoisting we don't find a single PRE opportunity - that's odd.  Ah, so it goes

int x;
int foo(int cond1, int cond2, int op1, int op2, int op3)
{
  int op;
  if (cond1)
    {
      x = op1 << 8;
      if (cond2)
        op = op2;
      else
        op = op3;
    }
  else
    op = op1;
  return op << 8;
}

When looking at simple PRE, op << 8 is not detected as partially redundant: while GVN PRE (PHI translation VN) ends up value-numbering op << 8 on the !cond1 path the same as op1 << 8:

[changed] ANTIC_IN[6] := { op1_5(D) (0006), {lshift_expr,op1_5(D),8} (0002) }
[changed] ANTIC_IN[3] := { op1_5(D) (0006), cond2_8(D) (0009), {lshift_expr,op1_5(D),8} (0002) }

it doesn't consider op << 8 available on any path -- there isn't really
any redundant computation on any path in the above code.

Now comes code hoisting, seeing op1 << 8 computed twice and hoists it
before if (cond1).

int x;
int foo(int cond1, int cond2, int op1, int op2, int op3)
{
  int op;
  int tem = op1 << 8;
  if (cond1)
    {
      x = tem;
      if (cond2)
        op = op2;
      else
        op = op3;
    }
  else
    op = op1;
  return op << 8;
}

Note how it didn't end up removing the redundancy on the !cond1 path
on its own!  It relies on PRE to clean up after itself here.

After this, the PRE algorithm now finds op << 8
"available on one path" -- namely on the cond1 path, where op << 8 is
now computed twice -- and inserts op2 << 8 and op3 << 8 in the other
predecessor (note we have a PHI with three args here; if we'd split
that, this might also change code generation for the better).  And we end up
with

int x;
int foo(int cond1, int cond2, int op1, int op2, int op3)
{
  int tem = op1 << 8;
  if (cond1)
    {
      x = tem;
      if (cond2)
        tem = op2 << 8;
      else
        tem = op3 << 8;
    }
  return tem;
}

Note with the proposed patch the handling of the two FAILing testcases
becomes less efficient -- doing hoisting first exposes full redundancies
and thus avoids useless PRE insertions.  For both FAILing testcases
code generation in the end doesn't change.

Now we have to wrap our brains around the above testcase and transform,
and decide if the order of events is good and expected or not.
Comment 13 Arnd Bergmann 2018-01-19 12:09:55 UTC
Created attachment 43185 [details]
Linux kernel version of AES algorithm, ported to standalone executable

I've had another look at extracting a test case from the Linux kernel copy of this code. This now also shows the gcc-7.2.1 specific problem:

$ x86_64-linux-gcc-7.1.1 -Wall -O2 -fsanitize=bounds -fsanitize=object-size aes_generic.c -o aes_generic; time ./aes_generic
real	0m9.406s

$ x86_64-linux-gcc-7.1.1 -Wall -O2 -fsanitize=bounds -fsanitize=object-size aes_generic.c -o aes_generic -fno-code-hoisting; time ./aes_generic
real	0m8.318s

$ x86_64-linux-gcc-7.2.1 -Wall -O2 -fsanitize=bounds -fsanitize=object-size aes_generic.c -o aes_generic; time ./aes_generic
real	0m22.151s

$ x86_64-linux-gcc-7.2.1 -Wall -O2 -fsanitize=bounds -fsanitize=object-size aes_generic.c -o aes_generic -fno-code-hoisting; time ./aes_generic
real	0m8.439s

$ x86_64-linux-gcc-7.1.1 -Wall -O2 aes_generic.c -o aes_generic ; time ./aes_generic
real	0m3.031s

$ x86_64-linux-gcc-7.1.1 -Wall -O2 aes_generic.c -o aes_generic -fno-code-hoisting ; time ./aes_generic
real	0m2.894s

$ x86_64-linux-gcc-7.2.1 -Wall -O2 aes_generic.c -o aes_generic  ; time ./aes_generic
real	0m3.307s

$ x86_64-linux-gcc-7.2.1 -Wall -O2 aes_generic.c -o aes_generic -fno-code-hoisting ; time ./aes_generic
real	0m2.875s
Comment 14 rguenther@suse.de 2018-01-19 12:24:26 UTC
On Fri, 19 Jan 2018, arnd at linaro dot org wrote:

> --- Comment #13 from Arnd Bergmann <arnd at linaro dot org> ---
> I've had another look at extracting a test case from the Linux kernel copy of
> this code. This now also shows the gcc-7.2.1 specific problem:

Would be nice if somebody can bisect it.  It doesn't look like a PRE
specific issue because there's no relevant PRE changes in the rev. range.
I can't reproduce the slowdown when comparing 7.1.0 against 7.2.0
btw, so the regression must occur somewhere between 7.2.0 and now
(or 7.1.1 got faster for a few revs).
Comment 15 Arnd Bergmann 2018-01-19 12:45:35 UTC
(In reply to rguenther@suse.de from comment #14)

> Would be nice if somebody can bisect it.  It doesn't look like a PRE
> specific issue because there's no relevant PRE changes in the rev. range.
> I can't reproduce the slowdown when comparing 7.1.0 against 7.2.0
> btw, so the regression must occur somewhere between 7.2.0 and now
> (or 7.1.1 got faster for a few revs).

I've checked r251376 (the one I mentioned in comment #11), and confirmed that this caused the difference between my old 7.1.1 and the current 7.2.1.
Comment 16 rguenther@suse.de 2018-01-19 12:53:07 UTC
On Fri, 19 Jan 2018, arnd at linaro dot org wrote:

> --- Comment #15 from Arnd Bergmann <arnd at linaro dot org> ---
> I've checked r251376 (the one I mentioned in comment #11), and confirmed that
> this caused the difference between my old 7.1.1 and the current 7.2.1.

Ok, this is a bugfix and simply makes PRE do its job "properly" ...
Comment 17 Richard Biener 2018-01-25 08:27:21 UTC
GCC 7.3 is being released, adjusting target milestone.