I'm seeing a significant performance regression from 4.7 to 4.8 (targeting x86-64) on the "salsa20" core function (this is a stream cipher). Repro instructions:

$ git clone git://github.com/zackw/rngstats.git
# ...
$ make -s cipher-test CC=gcc-4.7 && ./cipher-test >&/dev/null && ./cipher-test
KAT: aes128... ok
KAT: aes256... ok
KAT: arc4... ok
KAT: isaac64... ok
KAT: salsa20_128... ok
KAT: salsa20_256... ok
TIME: aes128... 2000 keys, 3.47834s -> 574.987 keys/s
TIME: aes256... 2000 keys, 3.62452s -> 551.797 keys/s
TIME: arc4... 2000 keys, 2.21746s -> 901.933 keys/s
TIME: isaac64... 2000 keys, 2.03467s -> 982.962 keys/s
TIME: salsa20_128... 2000 keys, 2.31960s -> 862.217 keys/s
TIME: salsa20_256... 2000 keys, 2.31932s -> 862.320 keys/s

$ make -s clean cipher-test CC=gcc-4.8 && ./cipher-test >&/dev/null && ./cipher-test
KAT: aes128... ok
KAT: aes256... ok
KAT: arc4... ok
KAT: isaac64... ok
KAT: salsa20_128... ok
KAT: salsa20_256... ok
TIME: aes128... 2000 keys, 2.49224s -> 802.491 keys/s
TIME: aes256... 2000 keys, 3.62372s -> 551.919 keys/s
TIME: arc4... 2000 keys, 2.22794s -> 897.689 keys/s
TIME: isaac64... 2000 keys, 2.05087s -> 975.194 keys/s
TIME: salsa20_128... 2000 keys, 3.53085s -> 566.436 keys/s
TIME: salsa20_256... 2000 keys, 2.53003s -> 790.505 keys/s

The regression shows in the last two TIME: lines for each build. The relevant code is probably in ciphers/salsa20.c, or else in worker.c.

Note that there are other programs in this repository, and they require unusual libraries to build. I recommend you do not attempt a "make all", and if you get errors, try commenting out the CFLAGS.mpi and LIBS.mpi lines in the Makefile.
Please at least reproduce the "core function" as a separate compilable testcase here, together with the flags used for the build. Also please try to factor out LTO ...
Created attachment 30210
self-contained test case

Here's a self-contained test case.

$ gcc-4.7 -std=c99 -O2 -march=native salsa20-regr.c && ./a.out
875.178 keys/s
$ gcc-4.8 -std=c99 -O2 -march=native salsa20-regr.c && ./a.out
808.869 keys/s
$ gcc-4.7 -std=c99 -O3 -march=native salsa20-regr.c && ./a.out
867.879 keys/s
$ gcc-4.8 -std=c99 -O3 -march=native salsa20-regr.c && ./a.out
800.794 keys/s
$ gcc-4.7 -std=c99 -O3 -fwhole-program -march=native salsa20-regr.c && ./a.out
606.605 keys/s
$ gcc-4.8 -std=c99 -O3 -fwhole-program -march=native salsa20-regr.c && ./a.out
571.935 keys/s

These numbers are stable to within about 1 key/s. So there's a 6-8% regression from 4.7 to 4.8 regardless of optimization level, but also -O3 and -O3 -fwhole-program are inferior to -O2 for this program, with both compilers. (-O2 -fwhole-program is within noise of just -O2 for both.)

With 4.8, -march=native on my computer expands to

  -march=corei7-avx -mcx16 -msahf -mno-movbe -maes -mpclmul -mpopcnt -mno-abm -mno-lwp -mno-fma -mno-fma4 -mno-xop -mno-bmi -mno-bmi2 -mno-tbm -mavx -mno-avx2 -msse4.2 -msse4.1 -mno-lzcnt -mno-rtm -mno-hle -mno-rdrnd -mno-f16c -mno-fsgsbase -mno-rdseed -mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt --param l1-cache-size=0 --param l1-cache-line-size=0 --param l2-cache-size=256 -mtune=corei7-avx
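For readers who want to check the percentages, the arithmetic behind the keys/s figures and the quoted regression can be sketched in C as follows. This is only an illustration: the function names keys_per_second and regression_percent are hypothetical and do not come from the attached test case.

```c
#include <assert.h>

/* Throughput in the form the benchmark prints it: keys per second.
 * Hypothetical helper, not taken from the attachment. */
double keys_per_second(unsigned keys, double seconds)
{
    assert(seconds > 0.0);
    return keys / seconds;
}

/* Relative slowdown of new_rate against old_rate, in percent.
 * A positive result means the newer compiler produced slower code. */
double regression_percent(double old_rate, double new_rate)
{
    assert(old_rate > 0.0);
    return 100.0 * (old_rate - new_rate) / old_rate;
}
```

Feeding in the -O2 pair above (875.178 and 808.869 keys/s) gives a slowdown of roughly 7.6%, and the -O3 -fwhole-program pair (606.605 and 571.935 keys/s) gives a bit under 6%, which is where the 6-8% range comes from.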
The tree opt code is quite the same for 4.8 and 4.7 at -O3 -fwhole-program, so I believe this boils down to spilling/register allocation (LRA vs. reload). We inline everything into main () (even at -O2) and we don't vectorize anything at -O3.
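To make it clearer why this kernel is sensitive to register allocation, here is a minimal sketch of the Salsa20 core as given in the public Salsa20 specification. It is not the code from the attachment or from ciphers/salsa20.c, and the names rotl32, quarterround and salsa20_core are chosen for illustration. The state is 16 32-bit words that are all updated in every double round, while x86-64 has only 16 general-purpose registers (one of them the stack pointer), so some spilling is probably unavoidable in scalar code and the allocator's choices matter.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by n bits; n must be in 1..31. */
static inline uint32_t rotl32(uint32_t v, unsigned n)
{
    assert(n > 0 && n < 32);
    return (v << n) | (v >> (32 - n));
}

/* Salsa20 quarter round: updates four state words in place. */
void quarterround(uint32_t *y0, uint32_t *y1, uint32_t *y2, uint32_t *y3)
{
    *y1 ^= rotl32(*y0 + *y3, 7);
    *y2 ^= rotl32(*y1 + *y0, 9);
    *y3 ^= rotl32(*y2 + *y1, 13);
    *y0 ^= rotl32(*y3 + *y2, 18);
}

/* Salsa20 core: 20 rounds (10 double rounds) over a 16-word state,
 * followed by a feed-forward addition of the input. */
void salsa20_core(uint32_t out[16], const uint32_t in[16])
{
    uint32_t x[16];
    int i, round;

    for (i = 0; i < 16; i++)
        x[i] = in[i];

    for (round = 0; round < 20; round += 2) {
        /* Column round. */
        quarterround(&x[0],  &x[4],  &x[8],  &x[12]);
        quarterround(&x[5],  &x[9],  &x[13], &x[1]);
        quarterround(&x[10], &x[14], &x[2],  &x[6]);
        quarterround(&x[15], &x[3],  &x[7],  &x[11]);
        /* Row round. */
        quarterround(&x[0],  &x[1],  &x[2],  &x[3]);
        quarterround(&x[5],  &x[6],  &x[7],  &x[4]);
        quarterround(&x[10], &x[11], &x[8],  &x[9]);
        quarterround(&x[15], &x[12], &x[13], &x[14]);
    }

    for (i = 0; i < 16; i++)
        out[i] = x[i] + in[i];
}
```

Once the quarter rounds are inlined, all 16 words plus the temporaries for each add/rotate/xor step are live across the loop body, so a change in which values get spilled can plausibly account for a difference of a few percent.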
Zack, thanks for reporting this. Crypto algorithms are very interesting cases for RA.

A lot of performance improvements were made to RA during gcc-4.9 development. Now on Intel Haswell I have

bash-4.2$ /home/cygnus/vmakarov/build/comparison/4.7-64/bin/gcc -std=c99 -O2 -march=native salsa-test.c && ./a.out
779.132 keys/s
bash-4.2$ /home/cygnus/vmakarov/build/comparison/4.8-64/bin/gcc -std=c99 -O2 -march=native salsa-test.c && ./a.out
778.976 keys/s
bash-4.2$ /home/cygnus/vmakarov/build1/trunk5/64r/bin/gcc -std=c99 -O2 -march=native salsa-test.c && ./a.out
1392.555 keys/s
bash-4.2$ /home/cygnus/vmakarov/build/comparison/4.7-64/bin/gcc -std=c99 -O3 -fwhole-program -march=native salsa-test.c && ./a.out
1375.610 keys/s
bash-4.2$ /home/cygnus/vmakarov/build/comparison/4.8-64/bin/gcc -std=c99 -O3 -fwhole-program -march=native salsa-test.c && ./a.out
1224.177 keys/s
bash-4.2$ /home/cygnus/vmakarov/build1/trunk5/64r/bin/gcc -std=c99 -O3 -fwhole-program -march=native salsa-test.c && ./a.out
1436.539 keys/s

Here, trunk5 is today's GCC trunk.

Unfortunately, the changes in RA are too big and cannot be backported to gcc-4.8.
Fixed so closing.