This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug tree-optimization/51017] [4.9/5/6 Regression] GCC performance regression (vs. 4.4/4.5), PRE increases register pressure too much
- From: "rguenth at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Fri, 04 Mar 2016 13:00:55 +0000
- Subject: [Bug tree-optimization/51017] [4.9/5/6 Regression] GCC performance regression (vs. 4.4/4.5), PRE increases register pressure too much
- Auto-submitted: auto-generated
- References: <bug-51017-4 at http dot gcc dot gnu dot org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51017
--- Comment #24 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 4.5 vs. GCC 5 still shows GCC 4.5 being faster almost everywhere:
                  GCC 4.5                             | GCC 5
Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:       3636K c/s real, 3636K c/s virtual   | 3488K c/s real, 3488K c/s virtual
Only one salt:    3047K c/s real, 3047K c/s virtual   | 2896K c/s real, 2896K c/s virtual
Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:       127360 c/s real, 127360 c/s virtual | 108800 c/s real, 108800 c/s virtual
Only one salt:    124288 c/s real, 123057 c/s virtual | 106112 c/s real, 106112 c/s virtual
Benchmarking: FreeBSD MD5 [32/64 X2]... DONE
Raw:              15392 c/s real, 15392 c/s virtual   | 15936 c/s real, 15936 c/s virtual
Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... DONE
Raw:              900 c/s real, 900 c/s virtual       | 892 c/s real, 892 c/s virtual
Benchmarking: Kerberos AFS DES [48/64 4K]... DONE
Short:            478208 c/s real, 473473 c/s virtual | 476672 c/s real, 476672 c/s virtual
Long:             1470K c/s real, 1470K c/s virtual   | 1473K c/s real, 1473K c/s virtual
Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
Raw:              16977K c/s real, 16977K c/s virtual | 14971K c/s real, 14971K c/s virtual
Benchmarking: generic crypt(3) [?/64]... DONE
Many salts:       362784 c/s real, 362784 c/s virtual | 296352 c/s real, 296352 c/s virtual
Only one salt:    361728 c/s real, 361728 c/s virtual | 292182 c/s real, 295104 c/s virtual
Benchmarking: dummy [N/A]... DONE
Raw:              60157K c/s real, 60157K c/s virtual | 53849K c/s real, 53316K c/s virtual
GCC 5 vs. GCC 6 shows some progress (and some small regressions), but none for
BSDI DES:
                  GCC 5                               | GCC 6
Benchmarking: Traditional DES [128/128 BS SSE2-16]... DONE
Many salts:       3488K c/s real, 3488K c/s virtual   | 3446K c/s real, 3446K c/s virtual
Only one salt:    2896K c/s real, 2896K c/s virtual   | 2895K c/s real, 2895K c/s virtual
Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts:       108800 c/s real, 108800 c/s virtual | 104934 c/s real, 105984 c/s virtual
Only one salt:    106112 c/s real, 106112 c/s virtual | 103040 c/s real, 103040 c/s virtual
Benchmarking: FreeBSD MD5 [32/64 X2]... DONE
Raw:              15936 c/s real, 15936 c/s virtual   | 15864 c/s real, 15864 c/s virtual
Benchmarking: OpenBSD Blowfish (x32) [32/64 X2]... DONE
Raw:              892 c/s real, 892 c/s virtual       | 916 c/s real, 916 c/s virtual
Benchmarking: Kerberos AFS DES [48/64 4K]... DONE
Short:            476672 c/s real, 476672 c/s virtual | 471808 c/s real, 471808 c/s virtual
Long:             1473K c/s real, 1473K c/s virtual   | 1449K c/s real, 1449K c/s virtual
Benchmarking: LM DES [128/128 BS SSE2-16]... DONE
Raw:              14971K c/s real, 14971K c/s virtual | 15917K c/s real, 15917K c/s virtual
Benchmarking: generic crypt(3) [?/64]... DONE
Many salts:       296352 c/s real, 296352 c/s virtual | 348096 c/s real, 348096 c/s virtual
Only one salt:    292182 c/s real, 295104 c/s virtual | 347616 c/s real, 347616 c/s virtual
Benchmarking: dummy [N/A]... DONE
Raw:              53849K c/s real, 53316K c/s virtual | 60114K c/s real, 60114K c/s virtual
Note that -fno-tree-pre alone no longer helps. With GCC 5/6 most intrinsics use
a generic implementation and are thus transparent to the GIMPLE middle-end,
apart from __builtin_ia32_pandn128, which is used by _mm_andnot_si128.
What helps is -fno-tree-loop-im in addition to -fno-tree-pre, so it seems the
underlying issue is still register pressure; it is not really the
loop-carried values we introduce but the excessive invariant motion.
Benchmarking: BSDI DES (x725) [128/128 BS SSE2-16]... DONE
Many salts: 118528 c/s real, 117354 c/s virtual
Only one salt: 114944 c/s real, 114944 c/s virtual
movq DES_bs_all+18632(%rip), %rdi
movq DES_bs_all+18624(%rip), %rcx
movq DES_bs_all+18712(%rip), %rbp
movq DES_bs_all+18696(%rip), %r9
movq DES_bs_all+18688(%rip), %r10
movq DES_bs_all+18680(%rip), %r11
movq %rdi, 624(%rsp)
movq %rcx, 616(%rsp)
movq %rbp, 320(%rsp)
etc. Of course it is quite stupid to load something and then spill it immediately.
We are also back to using unaligned loads/stores everywhere.