Bug 57315 - LTO and/or vectorizer performance regression on salsa20 core, 4.7->4.8
Summary: LTO and/or vectorizer performance regression on salsa20 core, 4.7->4.8
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.8.0
Importance: P3 normal
Target Milestone: 5.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization, ra
Depends on:
Blocks:
 
Reported: 2013-05-17 15:20 UTC by Zack Weinberg
Modified: 2016-01-27 06:07 UTC

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
self-contained test case (2.31 KB, text/x-csrc)
2013-05-28 20:29 UTC, Zack Weinberg

Description Zack Weinberg 2013-05-17 15:20:04 UTC
I'm seeing a significant performance regression from 4.7 to 4.8 (targeting x86-64) on the "salsa20" core function (this is a stream cipher).  Repro instructions:

$ git clone git://github.com/zackw/rngstats.git
# ...

$ make -s cipher-test CC=gcc-4.7 && ./cipher-test >&/dev/null && ./cipher-test
KAT:       aes128... ok
KAT:       aes256... ok
KAT:         arc4... ok
KAT:      isaac64... ok
KAT:  salsa20_128... ok
KAT:  salsa20_256... ok
TIME:      aes128... 2000 keys,   3.47834s ->  574.987 keys/s
TIME:      aes256... 2000 keys,   3.62452s ->  551.797 keys/s
TIME:        arc4... 2000 keys,   2.21746s ->  901.933 keys/s
TIME:     isaac64... 2000 keys,   2.03467s ->  982.962 keys/s
TIME: salsa20_128... 2000 keys,   2.31960s ->  862.217 keys/s
TIME: salsa20_256... 2000 keys,   2.31932s ->  862.320 keys/s

$ make -s clean cipher-test CC=gcc-4.8 && ./cipher-test >&/dev/null && ./cipher-test
KAT:       aes128... ok
KAT:       aes256... ok
KAT:         arc4... ok
KAT:      isaac64... ok
KAT:  salsa20_128... ok
KAT:  salsa20_256... ok
TIME:      aes128... 2000 keys,   2.49224s ->  802.491 keys/s
TIME:      aes256... 2000 keys,   3.62372s ->  551.919 keys/s
TIME:        arc4... 2000 keys,   2.22794s ->  897.689 keys/s
TIME:     isaac64... 2000 keys,   2.05087s ->  975.194 keys/s
TIME: salsa20_128... 2000 keys,   3.53085s ->  566.436 keys/s
TIME: salsa20_256... 2000 keys,   2.53003s ->  790.505 keys/s

The regression shows in the last two TIME: lines for each build.  The relevant code is probably in ciphers/salsa20.c, or else in worker.c.
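As a quick check on the salsa20 TIME: lines above, the following sketch recomputes the throughput from the reported elapsed times and expresses the change as a ratio of elapsed time. The timings are copied from the two runs; the helper names `throughput` and `slowdown` are illustrative only and do not come from the rngstats repository.

```python
# Elapsed times (seconds) copied from the two cipher-test runs above.
KEYS = 2000
elapsed = {
    "salsa20_128": {"gcc-4.7": 2.31960, "gcc-4.8": 3.53085},
    "salsa20_256": {"gcc-4.7": 2.31932, "gcc-4.8": 2.53003},
}


def throughput(keys, seconds):
    """Keys processed per second."""
    return keys / seconds


def slowdown(old_seconds, new_seconds):
    """Ratio of new to old elapsed time; values above 1.0 mean slower."""
    return new_seconds / old_seconds


for name, t in elapsed.items():
    old, new = t["gcc-4.7"], t["gcc-4.8"]
    print(f"{name}: {throughput(KEYS, old):.3f} -> "
          f"{throughput(KEYS, new):.3f} keys/s, "
          f"{slowdown(old, new):.2f}x elapsed time")
```

On these figures the salsa20_128 run takes roughly one and a half times as long with 4.8, while salsa20_256 is about 9% slower.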

Note that there are other programs in this repository, and they require unusual libraries to build.  I recommend you do not attempt a "make all", and if you get errors, try commenting out the CFLAGS.mpi and LIBS.mpi lines in the Makefile.
Comment 1 Richard Biener 2013-05-21 09:37:32 UTC
Please at least reproduce the "core function" as a separate compilable testcase here, together with the flags used for the build.  Also please try to factor out
LTO ...
Comment 2 Zack Weinberg 2013-05-28 20:29:49 UTC
Created attachment 30210
self-contained test case

Here's a self-contained test case.

$ gcc-4.7 -std=c99 -O2 -march=native salsa20-regr.c && ./a.out
 875.178 keys/s
$ gcc-4.8 -std=c99 -O2 -march=native salsa20-regr.c && ./a.out
 808.869 keys/s

$ gcc-4.7 -std=c99 -O3 -march=native salsa20-regr.c && ./a.out
 867.879 keys/s
$ gcc-4.8 -std=c99 -O3 -march=native salsa20-regr.c && ./a.out
 800.794 keys/s

$ gcc-4.7 -std=c99 -O3 -fwhole-program -march=native salsa20-regr.c && ./a.out 
 606.605 keys/s
$ gcc-4.8 -std=c99 -O3 -fwhole-program -march=native salsa20-regr.c && ./a.out 
 571.935 keys/s

These numbers are stable to within about 1 key/s.  So there's a 6-8% regression from 4.7 to 4.8 regardless of optimization level, but also -O3 and -O3 -fwhole-program are inferior to -O2 for this program, with both compilers.  (-O2 -fwhole-program is within noise of just -O2 for both.)
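The size of the regression can be recomputed from the keys/s figures quoted in this comment. This is only an arithmetic sketch; the function name `regression_percent` is made up for illustration.

```python
# keys/s figures copied from this comment: (gcc-4.7, gcc-4.8) per flag set.
results = {
    "-O2": (875.178, 808.869),
    "-O3": (867.879, 800.794),
    "-O3 -fwhole-program": (606.605, 571.935),
}


def regression_percent(old, new):
    """Throughput lost going from old to new, as a percentage of old."""
    return (old - new) / old * 100.0


for flags, (old, new) in results.items():
    print(f"{flags:22s} {regression_percent(old, new):5.1f}% slower with 4.8")
```

The three flag sets come out at roughly 7.6%, 7.7% and 5.7% respectively, which is in line with the 6-8% quoted above.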

With 4.8, -march=native on my computer expands to

-march=corei7-avx -mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mavx -mno-avx2 -msse4.2 -msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd -mno-f16c -mno-fsgsbase -mno-rdseed -mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt --param l1-cache-size=0 --param l1-cache-line-size=0 --param l2-cache-size=256 -mtune=corei7-avx
Comment 3 Richard Biener 2013-05-29 09:01:05 UTC
The tree-level optimized code is essentially the same for 4.8 and 4.7 at -O3 -fwhole-program,
so I believe this boils down to spilling/register allocation (LRA vs. reload).

We inline everything into main () (even at -O2) and we don't
vectorize anything at -O3.
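For readers unfamiliar with the cipher, here is a minimal sketch of the Salsa20 quarter-round as defined in Bernstein's Salsa20 specification. It is written in Python for brevity and is not the code from the attached test case or the rngstats repository. The core applies this step repeatedly to a state of sixteen 32-bit words, and x86-64 offers sixteen general-purpose registers, so it seems plausible (an assumption, not something measured here) that once everything is inlined, small differences in spilling decisions affect throughput.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit value left by n bits (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def quarterround(y0, y1, y2, y3):
    """Salsa20 quarter-round: each output word depends on the previous one."""
    z1 = y1 ^ rotl32((y0 + y3) & MASK32, 7)
    z2 = y2 ^ rotl32((z1 + y0) & MASK32, 9)
    z3 = y3 ^ rotl32((z2 + z1) & MASK32, 13)
    z0 = y0 ^ rotl32((z3 + z2) & MASK32, 18)
    return z0, z1, z2, z3


# Test vector from the Salsa20 specification.
print([hex(w) for w in quarterround(0x00000001, 0, 0, 0)])
```

Each quarter-round is a short chain of add, rotate and xor operations on four words; the full core runs four of these per round over the whole 16-word state.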
Comment 4 Vladimir Makarov 2013-12-04 18:29:43 UTC
Zack, thanks for reporting this.  Crypto algorithms are very interesting cases for RA.  A lot of performance improvements were made to RA during gcc-4.9 development.  Now on Intel Haswell I get

bash-4.2$ /home/cygnus/vmakarov/build/comparison/4.7-64/bin/gcc -std=c99 -O2 -march=native salsa-test.c && ./a.out
 779.132 keys/s
bash-4.2$ /home/cygnus/vmakarov/build/comparison/4.8-64/bin/gcc -std=c99 -O2 -march=native salsa-test.c && ./a.out
 778.976 keys/s
bash-4.2$ /home/cygnus/vmakarov/build1/trunk5/64r/bin/gcc -std=c99 -O2 -march=native salsa-test.c && ./a.out
1392.555 keys/s
bash-4.2$ /home/cygnus/vmakarov/build/comparison/4.7-64/bin/gcc -std=c99 -O3 -fwhole-program -march=native salsa-test.c && ./a.out
1375.610 keys/s
bash-4.2$ /home/cygnus/vmakarov/build/comparison/4.8-64/bin/gcc -std=c99 -O3 -fwhole-program -march=native salsa-test.c && ./a.out
1224.177 keys/s
bash-4.2$ /home/cygnus/vmakarov/build1/trunk5/64r/bin/gcc -std=c99 -O3 -fwhole-program -march=native salsa-test.c && ./a.out
1436.539 keys/s

Here, trunk5 is today's GCC trunk.

Unfortunately, the changes in RA are too big and cannot be backported to gcc-4.8.
Comment 5 Andrew Pinski 2016-01-27 06:07:48 UTC
Fixed, so closing.