57315 – LTO and/or vectorizer performance regression on salsa20 core, 4.7->4.8

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization
Version:	4.8.0

Importance:	P3 normal
Target Milestone:	5.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization, ra

Depends on:
Blocks:

Reported:	2013-05-17 15:20 UTC by Zack Weinberg
Modified:	2016-01-27 06:07 UTC
CC List:	2 users

See Also:
Host:
Target:	x86_64--
Build:
Known to work:
Known to fail:
Last reconfirmed:

Attachments
self-contained test case (2.31 KB, text/x-csrc) 2013-05-28 20:29 UTC, Zack Weinberg

Description Zack Weinberg 2013-05-17 15:20:04 UTC

I'm seeing a significant performance regression from 4.7 to 4.8 (targeting x86-64) on the "salsa20" core function (this is a stream cipher).  Repro instructions:

$ git clone git://github.com/zackw/rngstats.git
# ...

$ make -s cipher-test CC=gcc-4.7 && ./cipher-test >&/dev/null && ./cipher-test
KAT:       aes128... ok
KAT:       aes256... ok
KAT:         arc4... ok
KAT:      isaac64... ok
KAT:  salsa20_128... ok
KAT:  salsa20_256... ok
TIME:      aes128... 2000 keys,   3.47834s ->  574.987 keys/s
TIME:      aes256... 2000 keys,   3.62452s ->  551.797 keys/s
TIME:        arc4... 2000 keys,   2.21746s ->  901.933 keys/s
TIME:     isaac64... 2000 keys,   2.03467s ->  982.962 keys/s
TIME: salsa20_128... 2000 keys,   2.31960s ->  862.217 keys/s
TIME: salsa20_256... 2000 keys,   2.31932s ->  862.320 keys/s

$ make -s clean cipher-test CC=gcc-4.8 && ./cipher-test >&/dev/null && ./cipher-test
KAT:       aes128... ok
KAT:       aes256... ok
KAT:         arc4... ok
KAT:      isaac64... ok
KAT:  salsa20_128... ok
KAT:  salsa20_256... ok
TIME:      aes128... 2000 keys,   2.49224s ->  802.491 keys/s
TIME:      aes256... 2000 keys,   3.62372s ->  551.919 keys/s
TIME:        arc4... 2000 keys,   2.22794s ->  897.689 keys/s
TIME:     isaac64... 2000 keys,   2.05087s ->  975.194 keys/s
TIME: salsa20_128... 2000 keys,   3.53085s ->  566.436 keys/s
TIME: salsa20_256... 2000 keys,   2.53003s ->  790.505 keys/s

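The regression shows in the last two TIME: lines of each build: salsa20_128 drops from about 862 to about 566 keys/s, and salsa20_256 from about 862 to about 791 keys/s.  The relevant code is probably in ciphers/salsa20.c, or else in worker.c.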

Note that there are other programs in this repository, and they require unusual libraries to build.  I recommend you do not attempt a "make all", and if you get errors, try commenting out the CFLAGS.mpi and LIBS.mpi lines in the Makefile.

Comment 1 Richard Biener 2013-05-21 09:37:32 UTC

Please at least reproduce the "core function" as a separate compilable testcase here, together with the flags used for the build.  Also please try to factor out
LTO ...

Comment 2 Zack Weinberg 2013-05-28 20:29:49 UTC

Created attachment 30210
self-contained test case

Here's a self-contained test case.

$ gcc-4.7 -std=c99 -O2 -march=native salsa20-regr.c && ./a.out
 875.178 keys/s
$ gcc-4.8 -std=c99 -O2 -march=native salsa20-regr.c && ./a.out
 808.869 keys/s

$ gcc-4.7 -std=c99 -O3 -march=native salsa20-regr.c && ./a.out
 867.879 keys/s
$ gcc-4.8 -std=c99 -O3 -march=native salsa20-regr.c && ./a.out
 800.794 keys/s

$ gcc-4.7 -std=c99 -O3 -fwhole-program -march=native salsa20-regr.c && ./a.out 
 606.605 keys/s
$ gcc-4.8 -std=c99 -O3 -fwhole-program -march=native salsa20-regr.c && ./a.out 
 571.935 keys/s

These numbers are stable to within about 1 key/s.  So there's a 6-8% regression from 4.7 to 4.8 regardless of optimization level, but also -O3 and -O3 -fwhole-program are inferior to -O2 for this program, with both compilers.  (-O2 -fwhole-program is within noise of just -O2 for both.)
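As a quick check of the percentages quoted above, the relative drop for each pair of measurements can be computed directly. This is only a minimal sketch; the helper name regression_pct is hypothetical and is not part of the attached test case.

```c
#include <stdio.h>

/* Percentage drop in throughput going from `before` to `after`.
 * Hypothetical helper for illustration; not from salsa20-regr.c. */
static double regression_pct(double before, double after)
{
    return 100.0 * (before - after) / before;
}

/* Print the 4.7 -> 4.8 drop for the three flag sets measured above. */
static void print_regressions(void)
{
    printf("-O2:                %.2f%%\n", regression_pct(875.178, 808.869));
    printf("-O3:                %.2f%%\n", regression_pct(867.879, 800.794));
    printf("-O3 -fwhole-program: %.2f%%\n", regression_pct(606.605, 571.935));
}
```

With the numbers above this works out to roughly 7.6%, 7.7% and 5.7% respectively.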

With 4.8, -march=native on my computer expands to

-march=corei7-avx -mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mavx -mno-avx2 -msse4.2 -msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd -mno-f16c -mno-fsgsbase -mno-rdseed -mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt --param l1-cache-size=0 --param l1-cache-line-size=0 --param l2-cache-size=256 -mtune=corei7-avx

Comment 3 Richard Biener 2013-05-29 09:01:05 UTC

The tree-level optimized code is nearly identical for 4.8 and 4.7 at -O3 -fwhole-program,
so I believe this boils down to spilling/register allocation (LRA vs. reload).

We inline everything into main () (even at -O2) and we don't
vectorize anything at -O3.
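For readers unfamiliar with the cipher, here is a rough sketch of why a Salsa20 core tends to stress the register allocator. It is written from the public Salsa20 specification and is not the code in the attached test case; the names rotl32, quarterround and salsa20_core_words are chosen here for illustration only. The core keeps sixteen 32-bit state words live across all twenty rounds, while x86-64 has only sixteen general-purpose integer registers (one of which is the stack pointer), so once everything is inlined some spilling is hard to avoid and the quality of the spill decisions plausibly matters.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by c bits (0 < c < 32). */
static inline uint32_t rotl32(uint32_t v, unsigned c)
{
    return (v << c) | (v >> (32 - c));
}

/* One Salsa20 quarter-round, updating four state words in place. */
static inline void quarterround(uint32_t *a, uint32_t *b,
                                uint32_t *c, uint32_t *d)
{
    *b ^= rotl32(*a + *d, 7);
    *c ^= rotl32(*b + *a, 9);
    *d ^= rotl32(*c + *b, 13);
    *a ^= rotl32(*d + *c, 18);
}

/* Salsa20/20 core on sixteen 32-bit words (illustrative sketch only). */
static void salsa20_core_words(uint32_t out[16], const uint32_t in[16])
{
    uint32_t x[16];
    for (int i = 0; i < 16; i++)
        x[i] = in[i];

    /* Ten double rounds: a column round followed by a row round. */
    for (int round = 0; round < 20; round += 2) {
        quarterround(&x[0],  &x[4],  &x[8],  &x[12]);
        quarterround(&x[5],  &x[9],  &x[13], &x[1]);
        quarterround(&x[10], &x[14], &x[2],  &x[6]);
        quarterround(&x[15], &x[3],  &x[7],  &x[11]);

        quarterround(&x[0],  &x[1],  &x[2],  &x[3]);
        quarterround(&x[5],  &x[6],  &x[7],  &x[4]);
        quarterround(&x[10], &x[11], &x[8],  &x[9]);
        quarterround(&x[15], &x[12], &x[13], &x[14]);
    }

    /* Feed-forward: add the original input to the mixed state. */
    for (int i = 0; i < 16; i++)
        out[i] = x[i] + in[i];
}
```

Every quarter-round reads and writes four of the sixteen words, and consecutive rounds touch all of them, so there is no point in the loop where a large part of the state is dead.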

Comment 4 Vladimir Makarov 2013-12-04 18:29:43 UTC

Zack, thanks for reporting this.  Crypto algorithms are very interesting cases for RA.  Many performance improvements were made to RA during gcc-4.9 development.  On Intel Haswell I now get:

bash-4.2$ /home/cygnus/vmakarov/build/comparison/4.7-64/bin/gcc -std=c99 -O2 -march=native salsa-test.c && ./a.out
 779.132 keys/s
bash-4.2$ /home/cygnus/vmakarov/build/comparison/4.8-64/bin/gcc -std=c99 -O2 -march=native salsa-test.c && ./a.out
 778.976 keys/s
bash-4.2$ /home/cygnus/vmakarov/build1/trunk5/64r/bin/gcc -std=c99 -O2 -march=native salsa-test.c && ./a.out
1392.555 keys/s
bash-4.2$ /home/cygnus/vmakarov/build/comparison/4.7-64/bin/gcc -std=c99 -O3 -fwhole-program -march=native salsa-test.c && ./a.out
1375.610 keys/s
bash-4.2$ /home/cygnus/vmakarov/build/comparison/4.8-64/bin/gcc -std=c99 -O3 -fwhole-program -march=native salsa-test.c && ./a.out
1224.177 keys/s
bash-4.2$ /home/cygnus/vmakarov/build1/trunk5/64r/bin/gcc -std=c99 -O3 -fwhole-program -march=native salsa-test.c && ./a.out
1436.539 keys/s

Here, trunk5 is today's GCC trunk.

Unfortunately, the changes in RA are too big and cannot be ported to gcc-4.8.

Comment 5 Andrew Pinski 2016-01-27 06:07:48 UTC

Fixed, so closing.