Created attachment 44672 [details] bug.c Performance hog initially noticed on http://www.xxhash.com project (xxhash-0.6.5). When being compiled on hppa it takes hours for gcc. When being compiled on x86_64 it takes seconds. Minimal example: // $ time /usr/bin/hppa2.0-unknown-linux-gnu-gcc -O2 -fomit-frame-pointer -c -o bug.o bug.C // real 0m9,051s // user 0m9,040s // sys 0m0,009s void a (long); void b (long long c, int s) { switch (s) { case 4: c = 0; case 6: c = c * 13; case 8: c = c + 19; case 28: c = c * 0x85ebca77c2b2ae63LLU; c = c * 9650029242287828579LL; c = c * 9650029242287828579LL; c = c * 9650029242287828579LL; /* this line adds 2 seconds: 7->9 */ c = c * 9650029242287828579LL; /* this line adds 2 seconds: 9->11 */ a (c); } }
Reproducible on today's vanilla gcc-master as well: ./xgcc -B. -v Reading specs from ./specs COLLECT_GCC=./xgcc COLLECT_LTO_WRAPPER=./lto-wrapper Target: hppa2.0-unknown-linux-gnu Configured with: ../gcc/configure --target=hppa2.0-unknown-linux-gnu --prefix=/home/slyfox/dev/git/gcc-hppa2.0/../gcc-hppa2.0-installed --with-sysroot=/usr/hppa2.0-unknown-linux-gnu --disable-bootstrap --enable-languages=c --disable-nls CFLAGS='-O2 -g' CXXFLAGS='-O2 -g' Thread model: posix gcc version 9.0.0 20180908 (experimental) (GCC) $ ./xgcc -B. -O2 -fomit-frame-pointer -c -o bug.o bug.c -ftime-report Time variable usr sys wall GGC phase opt and generate : 10.90 (100%) 0.00 ( 0%) 10.91 (100%) 464 kB ( 30%) df live regs : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 0 kB ( 0%) tree PRE : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 23 kB ( 2%) expand : 10.89 (100%) 0.00 ( 0%) 10.89 (100%) 50 kB ( 3%) integrated RA : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 98 kB ( 6%) TOTAL : 10.90 0.00 10.91 1543 kB Extra diagnostic checks enabled; compiler may run slowly. Configure with --enable-checking=release to disable checks. $ perf record -- ./xgcc -B. -O2 -fomit-frame-pointer -c -o bug.o bug.c -ftime-report $ perf report | cat 96.49% cc1 cc1 [.] synth_mult 2.36% cc1 cc1 [.] optimize_function_for_size_p 0.34% cc1 cc1 [.] optimize_insn_for_size_p 0.32% cc1 cc1 [.] optimize_insn_for_speed_p 0.02% cc1 [kernel.vmlinux] [k] native_irq_return_iret
synth_mult() is called 57M times on this tiny sample. I noticed there is a NUM_ALG_HASH_ENTRIES bucket size for already computed algorithms for multiplication. The below tweak seems to workaround pathological case: --- a/gcc/expmed.h +++ b/gcc/expmed.h @@ -128,7 +128,7 @@ struct alg_hash_entry { /* The number of cache/hash entries. */ #if HOST_BITS_PER_WIDE_INT == 64 -#define NUM_ALG_HASH_ENTRIES 1031 +#define NUM_ALG_HASH_ENTRIES 10311 #else #define NUM_ALG_HASH_ENTRIES 307 #endif $ ./xgcc -B. -O2 -fomit-frame-pointer -c -o bug.o bug.c -ftime-report Time variable usr sys wall GGC phase setup : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 2%) 960 kB ( 62%) phase parsing : 0.00 ( 0%) 0.00 ( 0%) 0.02 ( 4%) 109 kB ( 7%) phase opt and generate : 0.39 (100%) 0.00 ( 0%) 0.46 ( 94%) 464 kB ( 30%) preprocessing : 0.00 ( 0%) 0.00 ( 0%) 0.02 ( 4%) 9 kB ( 1%) dominator optimization : 0.00 ( 0%) 0.00 ( 0%) 0.02 ( 4%) 1 kB ( 0%) expand : 0.37 ( 95%) 0.00 ( 0%) 0.42 ( 86%) 50 kB ( 3%) combiner : 0.01 ( 3%) 0.00 ( 0%) 0.01 ( 2%) 97 kB ( 6%) initialize rtl : 0.01 ( 3%) 0.00 ( 0%) 0.00 ( 0%) 13 kB ( 1%) verify RTL sharing : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 2%) 0 kB ( 0%) TOTAL : 0.39 0.00 0.49 1543 kB Extra diagnostic checks enabled; compiler may run slowly. Configure with --enable-checking=release to disable checks.
On 2018-09-08 3:11 PM, slyfox at inbox dot ru wrote: > -#define NUM_ALG_HASH_ENTRIES 1031 > +#define NUM_ALG_HASH_ENTRIES 10311 Does this help the compile time for xxhash?
Can be reproduced on x86_64-linux on the same testcase, just put a breakpoint on synth_mult and (gdb) set var t = -8796714831421723037 (gdb) set cost_limit->cost = 128 (gdb) set cost_limit->latency = 128 there is quite a recursion. Or even on unsigned long long foo (unsigned long long x) { return x * -8796714831421723037LL; } with -O2 and the same manually bumped cost/latency on the outermost synth_mult. If I instrument synth_mult, in this case there are 4897996 calls to synth_mult, different counts at various levels of synth_mult recursion (first number is number of synth_mult callers of synth_mult, the second number how many such synth_mult calls have been made): 0 2 1 5 2 16 3 36 4 98 5 231 6 528 7 1112 8 2127 9 4007 10 7492 11 13147 12 22400 13 36880 14 59282 15 93213 16 141522 17 201388 18 271570 19 359030 20 440279 21 505613 22 571884 23 587240 24 518362 25 409409 26 299289 27 195388 28 102828 29 40657 30 10932 31 1897 32 132 In any case, with very small cost of addition and shifts and extremely high cost and latency of multiplication, we seem to spend way too much trying to find an optimal synthetic multiplication sequence.
(In reply to dave.anglin from comment #3) > On 2018-09-08 3:11 PM, slyfox at inbox dot ru wrote: > > -#define NUM_ALG_HASH_ENTRIES 1031 > > +#define NUM_ALG_HASH_ENTRIES 10311 > Does this help the compile time for xxhash? A bit: it shrinks compile time from 80s to 13s (other targets do it in sub-second times). Ran the build against gcc master as: ${HOME}/dev/git/gcc-hppa2.0/gcc/xgcc -B${HOME}/dev/git/gcc-hppa2.0/gcc/ -O2 -pipe -fdiagnostics-show-option -frecord-gcc-switches -Wall -Wextra -Wcast-qual -Wcast-align -Wshadow -Wstrict-aliasing=1 -Wswitch-enum -Wdeclaration-after-statement -Wstrict-prototypes -Wundef -c -o xxhash.o xxhash.c -ftime-report Before the patch: Time variable usr sys wall GGC expand : 82.20 ( 93%) 0.01 ( 7%) 82.94 ( 93%) 4372 kB ( 11%) TOTAL : 88.11 0.14 89.06 40621 kB After the patch: Time variable usr sys wall GGC expand : 8.95 ( 68%) 0.00 ( 0%) 8.93 ( 67%) 4372 kB ( 11%) TOTAL : 13.15 0.08 13.24 40623 kB perf record/report still shows most of time is dominated by synth_mult(): 64.59% cc1 cc1 [.] synth_mult 3.00% cc1 cc1 [.] sd_lists_size
For the above case, there are during the recursion 77608 synth_mult calls, which might be ok, so indeed better hashing would help, but 10311 is too large and is not a prime number. Plus this hash table is resolving all collisions by evicting older cash entries. That is perhaps fine for the toplevel synth_mult calls, but for the recursive ones I guess we want to make sure we cache all the values needed to compute the same toplevel call and perhaps only record the toplevel call's entry if it would need to evict older entries. So maybe use the current hash table + another temporary hash_map live just during the outermost synth_mult call (and activate this behavior only if say seeing a recursion count of 16)?
(In reply to Jakub Jelinek from comment #6) > For the above case, there are during the recursion 77608 synth_mult calls, > which might be ok, so indeed better hashing would help, but 10311 is too > large and is not a prime number. Plus this hash table is resolving all > collisions by evicting older cash entries. > That is perhaps fine for the toplevel synth_mult calls, but for the > recursive ones I guess we want to make sure we cache all the values needed > to compute the same toplevel call and perhaps only record the toplevel > call's entry if it would need to evict older entries. > So maybe use the current hash table + another temporary hash_map live just > during the outermost synth_mult call (and activate this behavior only if say > seeing a recursion count of 16)? Sounds reasonable - I'd try activating the behavior on the first collision?
This xxhash.c file is embedded in many different projects, and is really causing problems on gentoo/hppa: zstandard: Fri Mar 15 14:16:42 2019: 7 hours, 29 minutes, 49 seconds Are we any closer to a fix than we were six months ago?
Also appears to occur building glusterfs.
This patch improves things by about a factor 4: Index: expmed.h =================================================================== --- expmed.h (revision 271111) +++ expmed.h (working copy) @@ -128,7 +128,7 @@ /* The number of cache/hash entries. */ #if HOST_BITS_PER_WIDE_INT == 64 -#define NUM_ALG_HASH_ENTRIES 1031 +#define NUM_ALG_HASH_ENTRIES 4093 #else #define NUM_ALG_HASH_ENTRIES 307 #endif
I tend to think there may be an issue with costs. There is no pattern for muldi3 on 32-bit hppa, so we use the generic code from libgcc. How do we know the cost for this routine? If synth_mult is disabled, the testcase takes a fraction of a second to compile. The only difference in the generated code is for "case 6" where the multiplication by 13 is done using shift add.
I started looking at implementing full local cache in complement to global evicting 'struct alg_hash_entry x_alg_hash[NUM_ALG_HASH_ENTRIES];' cache. Noob question: which gcc's data structure should I use to implement a simple append-only hash set of POD-like 'struct alg_hash_entry' keyed by 'alg_hash_entry::t' as a hash? 'gcc/hash-set.h', 'gcc/hash-map.h'? Or they are both wrong? Looks like for hash-* ones I need to define a bunch of traits that define things like 'is_deleted'/'is_empty'. I'm not very clear what it actually means. Do I need to embed/wrap 'is_deleted' flag into alg_hash_entry to accomodate the requirement?
*** Bug 96649 has been marked as a duplicate of this bug. ***
Created attachment 49069 [details] hack-25_all_hppa-faster-synth_mult.patch For those who need to be unblocked right now Gentoo uses the following hack: hack-25_all_hppa-faster-synth_mult.patch I'll ask around in #gcc on how to convert it to proper gcc-friendly hash table based solution discussed above.
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>: https://gcc.gnu.org/g:050fc8b27a852007f8bb667999e1c8cfd31f90e1 commit r11-2851-g050fc8b27a852007f8bb667999e1c8cfd31f90e1 Author: Roger Sayle <roger@nextmovesoftware.com> Date: Tue Aug 25 19:02:45 2020 +0100 hppa: PR middle-end/87256: Improved hppa_rtx_costs avoids synth_mult madness. This is my proposed fix to PR middle-end/87256 where synth_mult takes an unreasonable amount of CPU time determining an optimal sequence of instructions to perform multiplication by (large) integer constants on hppa. One workaround proposed in bugzilla, is to increase the hash table used to cache/reuse intermediate results. This helps but is a workaround for the (hidden) underlying problem. The real issue is that the hppa_rtx_costs function is providing wildly inaccurate values (estimates) to the middle-end. For example, (p*q)+(r*s) would appear to be cheaper than a single multiplication. Another example is that "(ashiftrt:di regA regB)" is claimed to be only be COST_N_INSNS(1) when in fact the hppa backend actually generates slightly more than a single instruction. It turns out that simply tightening up the logic in hppa_rtx_costs to return more reasonable values, dramatically reduces the number of recursive invocations in synth_mult for the test case in PR87256, and presumably also produces faster code (that should be observable in benchmarks). 2020-08-25 Roger Sayle <roger@nextmovesoftware.com> gcc/ChangeLog PR middle-end/87256 * config/pa/pa.c (hppa_rtx_costs_shadd_p): New helper function to check for coefficients supported by shNadd and shladd,l. (hppa_rtx_costs): Rewrite to avoid using estimates based upon FACTOR and enable recursing deeper into RTL expressions.
The releases/gcc-10 branch has been updated by Roger Sayle <sayle@gcc.gnu.org>: https://gcc.gnu.org/g:ed1e7a6a8ad6903ee952e82ac9ddac01efb286dd commit r10-8674-ged1e7a6a8ad6903ee952e82ac9ddac01efb286dd Author: Roger Sayle <roger@nextmovesoftware.com> Date: Wed Aug 26 07:21:08 2020 +0100 hppa: PR middle-end/87256: Improved hppa_rtx_costs avoids synth_mult madness. Backport from master: 2020-08-26 Roger Sayle <roger@nextmovesoftware.com> gcc/ChangeLog PR middle-end/87256 * config/pa/pa.c (hppa_rtx_costs_shadd_p): New helper function to check for coefficients supported by shNadd and shladd,l. (hppa_rtx_costs): Rewrite to avoid using estimates based upon FACTOR and enable recursing deeper into RTL expressions. * config/pa/pa.md (shd_internal): Fix define_expand to provide gen_shd_internal.
Fixed on mainline and the gcc-10 branch.
Thank you!
hppa64 compiler is still slow: $ hppa64-linux-gnu-gcc -S -O2 hog.c hog.c: In function 'b': hog.c:18:15: warning: integer constant is so large that it is unsigned c = c * 9650029242287828579LL; ^~~~~~~~~~~~~~~~~~~~~ hog.c:19:15: warning: integer constant is so large that it is unsigned c = c * 9650029242287828579LL; ^~~~~~~~~~~~~~~~~~~~~ hog.c:20:15: warning: integer constant is so large that it is unsigned c = c * 9650029242287828579LL; /* this line adds 2 seconds: 7->9 */ ^~~~~~~~~~~~~~~~~~~~~ hog.c:21:15: warning: integer constant is so large that it is unsigned c = c * 9650029242287828579LL; /* this line adds 2 seconds: 9->11 */ ^~~~~~~~~~~~~~~~~~~~~ dave@mx3210:~/gnu/gcc$ time hppa64-linux-gnu-gcc -S -O2 hog.c hog.c: In function 'b': hog.c:18:15: warning: integer constant is so large that it is unsigned c = c * 9650029242287828579LL; ^~~~~~~~~~~~~~~~~~~~~ hog.c:19:15: warning: integer constant is so large that it is unsigned c = c * 9650029242287828579LL; ^~~~~~~~~~~~~~~~~~~~~ hog.c:20:15: warning: integer constant is so large that it is unsigned c = c * 9650029242287828579LL; /* this line adds 2 seconds: 7->9 */ ^~~~~~~~~~~~~~~~~~~~~ hog.c:21:15: warning: integer constant is so large that it is unsigned c = c * 9650029242287828579LL; /* this line adds 2 seconds: 9->11 */ ^~~~~~~~~~~~~~~~~~~~~ real 1m48.311s user 1m47.665s sys 0m0.471s