Bug 87256 - hppa spends huge amount of time in synth_mult()
Summary: hppa spends huge amount of time in synth_mult()
Status: REOPENED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 9.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: compile-time-hog
: 96649 (view as bug list)
Depends on:
Blocks:
 
Reported: 2018-09-08 18:30 UTC by Sergei Trofimovich
Modified: 2022-01-09 00:36 UTC (History)
11 users (show)

See Also:
Host:
Target: hppa64-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-09-08 00:00:00


Attachments
bug.c (287 bytes, text/x-csrc)
2018-09-08 18:30 UTC, Sergei Trofimovich
Details
hack-25_all_hppa-faster-synth_mult.patch (753 bytes, application/mbox)
2020-08-17 19:06 UTC, Sergei Trofimovich
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Sergei Trofimovich 2018-09-08 18:30:25 UTC
Created attachment 44672 [details]
bug.c

Performance hog initially noticed on http://www.xxhash.com project (xxhash-0.6.5). When being compiled on hppa it takes hours for gcc. When being compiled on x86_64 it takes seconds.

Minimal example:

// $ time /usr/bin/hppa2.0-unknown-linux-gnu-gcc -O2 -fomit-frame-pointer -c -o bug.o bug.C
// real    0m9,051s
// user    0m9,040s
// sys     0m0,009s
void a (long);
void b (long long c, int s)
{
  switch (s)
    {
    case 4:
      c = 0;
    case 6:
      c = c * 13;
    case 8:
      c = c + 19;
    case 28:
      c = c * 0x85ebca77c2b2ae63LLU;
      c = c * 9650029242287828579LL;
      c = c * 9650029242287828579LL;
      c = c * 9650029242287828579LL; /* this line adds 2 seconds: 7->9 */
      c = c * 9650029242287828579LL; /* this line adds 2 seconds: 9->11 */
      a (c);
    }
}
Comment 1 Sergei Trofimovich 2018-09-08 18:34:48 UTC
Reproducible on today's vanilla gcc-master as well:

./xgcc -B. -v
Reading specs from ./specs
COLLECT_GCC=./xgcc
COLLECT_LTO_WRAPPER=./lto-wrapper
Target: hppa2.0-unknown-linux-gnu
Configured with: ../gcc/configure --target=hppa2.0-unknown-linux-gnu --prefix=/home/slyfox/dev/git/gcc-hppa2.0/../gcc-hppa2.0-installed --with-sysroot=/usr/hppa2.0-unknown-linux-gnu --disable-bootstrap --enable-languages=c --disable-nls CFLAGS='-O2 -g' CXXFLAGS='-O2 -g'
Thread model: posix
gcc version 9.0.0 20180908 (experimental) (GCC)

$ ./xgcc -B. -O2 -fomit-frame-pointer -c -o bug.o bug.c -ftime-report

Time variable                                   usr           sys          wall               GGC
 phase opt and generate             :  10.90 (100%)   0.00 (  0%)  10.91 (100%)     464 kB ( 30%)
 df live regs                       :   0.01 (  0%)   0.00 (  0%)   0.00 (  0%)       0 kB (  0%)
 tree PRE                           :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)      23 kB (  2%)
 expand                             :  10.89 (100%)   0.00 (  0%)  10.89 (100%)      50 kB (  3%)
 integrated RA                      :   0.00 (  0%)   0.00 (  0%)   0.01 (  0%)      98 kB (  6%)
 TOTAL                              :  10.90          0.00         10.91           1543 kB
Extra diagnostic checks enabled; compiler may run slowly.
Configure with --enable-checking=release to disable checks.

$ perf record -- ./xgcc -B. -O2 -fomit-frame-pointer -c -o bug.o bug.c -ftime-report
$ perf report | cat

    96.49%  cc1              cc1               [.] synth_mult
     2.36%  cc1              cc1               [.] optimize_function_for_size_p
     0.34%  cc1              cc1               [.] optimize_insn_for_size_p
     0.32%  cc1              cc1               [.] optimize_insn_for_speed_p
     0.02%  cc1              [kernel.vmlinux]  [k] native_irq_return_iret
Comment 2 Sergei Trofimovich 2018-09-08 19:11:01 UTC
synth_mult() is called 57M times on this tiny sample.

I noticed there is a NUM_ALG_HASH_ENTRIES bucket size for already computed algorithms for multiplication. The below tweak seems to workaround pathological case:

--- a/gcc/expmed.h
+++ b/gcc/expmed.h
@@ -128,7 +128,7 @@ struct alg_hash_entry {
 
 /* The number of cache/hash entries.  */
 #if HOST_BITS_PER_WIDE_INT == 64
-#define NUM_ALG_HASH_ENTRIES 1031
+#define NUM_ALG_HASH_ENTRIES 10311
 #else
 #define NUM_ALG_HASH_ENTRIES 307
 #endif

$ ./xgcc -B. -O2 -fomit-frame-pointer -c -o bug.o bug.c -ftime-report

Time variable                                   usr           sys          wall               GGC
 phase setup                        :   0.00 (  0%)   0.00 (  0%)   0.01 (  2%)     960 kB ( 62%)
 phase parsing                      :   0.00 (  0%)   0.00 (  0%)   0.02 (  4%)     109 kB (  7%)
 phase opt and generate             :   0.39 (100%)   0.00 (  0%)   0.46 ( 94%)     464 kB ( 30%)
 preprocessing                      :   0.00 (  0%)   0.00 (  0%)   0.02 (  4%)       9 kB (  1%)
 dominator optimization             :   0.00 (  0%)   0.00 (  0%)   0.02 (  4%)       1 kB (  0%)
 expand                             :   0.37 ( 95%)   0.00 (  0%)   0.42 ( 86%)      50 kB (  3%)
 combiner                           :   0.01 (  3%)   0.00 (  0%)   0.01 (  2%)      97 kB (  6%)
 initialize rtl                     :   0.01 (  3%)   0.00 (  0%)   0.00 (  0%)      13 kB (  1%)
 verify RTL sharing                 :   0.00 (  0%)   0.00 (  0%)   0.01 (  2%)       0 kB (  0%)
 TOTAL                              :   0.39          0.00          0.49           1543 kB
Extra diagnostic checks enabled; compiler may run slowly.
Configure with --enable-checking=release to disable checks.
Comment 3 dave.anglin 2018-09-08 22:23:20 UTC
On 2018-09-08 3:11 PM, slyfox at inbox dot ru wrote:
> -#define NUM_ALG_HASH_ENTRIES 1031
> +#define NUM_ALG_HASH_ENTRIES 10311
Does this help the compile time for xxhash?
Comment 4 Jakub Jelinek 2018-09-08 23:07:46 UTC
Can be reproduced on x86_64-linux on the same testcase, just put a breakpoint on
synth_mult and
(gdb) set var t = -8796714831421723037
(gdb) set cost_limit->cost = 128
(gdb) set cost_limit->latency = 128
there is quite a recursion.  Or even on
unsigned long long foo (unsigned long long x) { return x * -8796714831421723037LL; }
with -O2 and the same manually bumped cost/latency on the outermost synth_mult.
If I instrument synth_mult, in this case there are 4897996 calls to synth_mult, different counts at various levels of synth_mult recursion (first number is number of synth_mult callers of synth_mult, the second number how many such synth_mult calls have been made):
0 2
1 5
2 16
3 36
4 98
5 231
6 528
7 1112
8 2127
9 4007
10 7492
11 13147
12 22400
13 36880
14 59282
15 93213
16 141522
17 201388
18 271570
19 359030
20 440279
21 505613
22 571884
23 587240
24 518362
25 409409
26 299289
27 195388
28 102828
29 40657
30 10932
31 1897
32 132

In any case, with very small cost of addition and shifts and extremely high cost and latency of multiplication, we seem to spend way too much trying to find an optimal synthetic multiplication sequence.
Comment 5 Sergei Trofimovich 2018-09-08 23:27:46 UTC
(In reply to dave.anglin from comment #3)
> On 2018-09-08 3:11 PM, slyfox at inbox dot ru wrote:
> > -#define NUM_ALG_HASH_ENTRIES 1031
> > +#define NUM_ALG_HASH_ENTRIES 10311
> Does this help the compile time for xxhash?

A bit: it shrinks compile time from 80s to 13s (other targets do it in sub-second times).

Ran the build against gcc master as:

${HOME}/dev/git/gcc-hppa2.0/gcc/xgcc -B${HOME}/dev/git/gcc-hppa2.0/gcc/ -O2 -pipe -fdiagnostics-show-option -frecord-gcc-switches -Wall -Wextra -Wcast-qual -Wcast-align -Wshadow -Wstrict-aliasing=1 -Wswitch-enum -Wdeclaration-after-statement -Wstrict-prototypes -Wundef   -c -o xxhash.o xxhash.c -ftime-report

  Before the patch:
    Time variable                                   usr           sys          wall               GGC
     expand                             :  82.20 ( 93%)   0.01 (  7%)  82.94 ( 93%)    4372 kB ( 11%)
     TOTAL                              :  88.11          0.14         89.06          40621 kB

  After the patch:
    Time variable                                   usr           sys          wall               GGC
     expand                             :   8.95 ( 68%)   0.00 (  0%)   8.93 ( 67%)    4372 kB ( 11%)
     TOTAL                              :  13.15          0.08         13.24          40623 kB

perf record/report still shows most of time is dominated by synth_mult():
    64.59%  cc1              cc1               [.] synth_mult
     3.00%  cc1              cc1               [.] sd_lists_size
Comment 6 Jakub Jelinek 2018-09-08 23:35:19 UTC
For the above case, there are during the recursion 77608 synth_mult calls, which might be ok, so indeed better hashing would help, but 10311 is too large and is not a prime number.  Plus this hash table is resolving all collisions by evicting older cash entries.
That is perhaps fine for the toplevel synth_mult calls, but for the recursive ones I guess we want to make sure we cache all the values needed to compute the same toplevel call and perhaps only record the toplevel call's entry if it would need to evict older entries.
So maybe use the current hash table + another temporary hash_map live just during the outermost synth_mult call (and activate this behavior only if say seeing a recursion count of 16)?
Comment 7 Richard Biener 2018-09-12 08:28:46 UTC
(In reply to Jakub Jelinek from comment #6)
> For the above case, there are during the recursion 77608 synth_mult calls,
> which might be ok, so indeed better hashing would help, but 10311 is too
> large and is not a prime number.  Plus this hash table is resolving all
> collisions by evicting older cash entries.
> That is perhaps fine for the toplevel synth_mult calls, but for the
> recursive ones I guess we want to make sure we cache all the values needed
> to compute the same toplevel call and perhaps only record the toplevel
> call's entry if it would need to evict older entries.
> So maybe use the current hash table + another temporary hash_map live just
> during the outermost synth_mult call (and activate this behavior only if say
> seeing a recursion count of 16)?

Sounds reasonable - I'd try activating the behavior on the first collision?
Comment 8 Matt Turner 2019-03-17 07:15:47 UTC
This xxhash.c file is embedded in many different projects, and is really causing problems on gentoo/hppa:

zstandard: Fri Mar 15 14:16:42 2019: 7 hours, 29 minutes, 49 seconds

Are we any closer to a fix than we were six months ago?
Comment 9 John David Anglin 2019-07-09 22:40:12 UTC
Also appears to occur building glusterfs.
Comment 10 John David Anglin 2019-07-20 16:45:15 UTC
This patch improves things by about a factor 4:

Index: expmed.h
===================================================================
--- expmed.h    (revision 271111)
+++ expmed.h    (working copy)
@@ -128,7 +128,7 @@

 /* The number of cache/hash entries.  */
 #if HOST_BITS_PER_WIDE_INT == 64
-#define NUM_ALG_HASH_ENTRIES 1031
+#define NUM_ALG_HASH_ENTRIES 4093
 #else
 #define NUM_ALG_HASH_ENTRIES 307
 #endif
Comment 11 John David Anglin 2019-07-21 22:30:25 UTC
I tend to think there may be an issue with costs.  There is no pattern
for muldi3 on 32-bit hppa, so we use the generic code from libgcc.  How
do we know the cost for this routine?

If synth_mult is disabled, the testcase takes a fraction of a second to
compile.  The only difference in the generated code is for "case 6" where
the multiplication by 13 is done using shift add.
Comment 12 Sergei Trofimovich 2020-01-07 23:05:01 UTC
I started looking at implementing full local cache in complement to global evicting 'struct alg_hash_entry x_alg_hash[NUM_ALG_HASH_ENTRIES];' cache.

Noob question: which gcc's data structure should I use to implement a simple append-only hash set of POD-like 'struct alg_hash_entry' keyed by 'alg_hash_entry::t' as a hash?

'gcc/hash-set.h', 'gcc/hash-map.h'? Or they are both wrong? Looks like for hash-*  ones I need to define a bunch of traits that define things like 'is_deleted'/'is_empty'. I'm not very clear what it actually means. Do I need to embed/wrap 'is_deleted' flag into alg_hash_entry to accomodate the requirement?
Comment 13 Andrew Pinski 2020-08-17 16:33:02 UTC
*** Bug 96649 has been marked as a duplicate of this bug. ***
Comment 14 Sergei Trofimovich 2020-08-17 19:06:41 UTC
Created attachment 49069 [details]
hack-25_all_hppa-faster-synth_mult.patch

For those who need to be unblocked right now Gentoo uses the following hack: hack-25_all_hppa-faster-synth_mult.patch

I'll ask around in #gcc on how to convert it to proper gcc-friendly hash table based solution discussed above.
Comment 15 GCC Commits 2020-08-25 18:03:48 UTC
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:

https://gcc.gnu.org/g:050fc8b27a852007f8bb667999e1c8cfd31f90e1

commit r11-2851-g050fc8b27a852007f8bb667999e1c8cfd31f90e1
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Tue Aug 25 19:02:45 2020 +0100

    hppa: PR middle-end/87256: Improved hppa_rtx_costs avoids synth_mult madness.
    
    This is my proposed fix to PR middle-end/87256 where synth_mult takes an
    unreasonable amount of CPU time determining an optimal sequence of
    instructions to perform multiplication by (large) integer constants on hppa.
    One workaround proposed in bugzilla, is to increase the hash table used
    to cache/reuse intermediate results. This helps but is a workaround for
    the (hidden) underlying problem.
    
    The real issue is that the hppa_rtx_costs function is providing wildly
    inaccurate values (estimates) to the middle-end.  For example, (p*q)+(r*s)
    would appear to be cheaper than a single multiplication.  Another
    example is that "(ashiftrt:di regA regB)" is claimed to be only be
    COST_N_INSNS(1) when in fact the hppa backend actually generates
    slightly more than a single instruction.
    
    It turns out that simply tightening up the logic in hppa_rtx_costs to
    return more reasonable values, dramatically reduces the number of recursive
    invocations in synth_mult for the test case in PR87256, and presumably
    also produces faster code (that should be observable in benchmarks).
    
    2020-08-25  Roger Sayle  <roger@nextmovesoftware.com>
    
    gcc/ChangeLog
            PR middle-end/87256
            * config/pa/pa.c (hppa_rtx_costs_shadd_p): New helper function
            to check for coefficients supported by shNadd and shladd,l.
            (hppa_rtx_costs):  Rewrite to avoid using estimates based upon
            FACTOR and enable recursing deeper into RTL expressions.
Comment 16 GCC Commits 2020-08-26 06:24:16 UTC
The releases/gcc-10 branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:

https://gcc.gnu.org/g:ed1e7a6a8ad6903ee952e82ac9ddac01efb286dd

commit r10-8674-ged1e7a6a8ad6903ee952e82ac9ddac01efb286dd
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Wed Aug 26 07:21:08 2020 +0100

    hppa: PR middle-end/87256: Improved hppa_rtx_costs avoids synth_mult madness.
    
    Backport from master:
    
    2020-08-26  Roger Sayle  <roger@nextmovesoftware.com>
    
    gcc/ChangeLog
            PR middle-end/87256
            * config/pa/pa.c (hppa_rtx_costs_shadd_p): New helper function
            to check for coefficients supported by shNadd and shladd,l.
            (hppa_rtx_costs):  Rewrite to avoid using estimates based upon
            FACTOR and enable recursing deeper into RTL expressions.
            * config/pa/pa.md (shd_internal): Fix define_expand to provide
            gen_shd_internal.
Comment 17 Roger Sayle 2020-08-26 06:38:47 UTC
Fixed on mainline and the gcc-10 branch.
Comment 18 Sergei Trofimovich 2020-08-29 20:58:13 UTC
Thank you!
Comment 19 John David Anglin 2020-09-06 14:15:56 UTC
hppa64 compiler is still slow:

$ hppa64-linux-gnu-gcc -S -O2 hog.c
hog.c: In function 'b':
hog.c:18:15: warning: integer constant is so large that it is unsigned
       c = c * 9650029242287828579LL;
               ^~~~~~~~~~~~~~~~~~~~~
hog.c:19:15: warning: integer constant is so large that it is unsigned
       c = c * 9650029242287828579LL;
               ^~~~~~~~~~~~~~~~~~~~~
hog.c:20:15: warning: integer constant is so large that it is unsigned
       c = c * 9650029242287828579LL; /* this line adds 2 seconds: 7->9 */
               ^~~~~~~~~~~~~~~~~~~~~
hog.c:21:15: warning: integer constant is so large that it is unsigned
       c = c * 9650029242287828579LL; /* this line adds 2 seconds: 9->11 */
               ^~~~~~~~~~~~~~~~~~~~~
dave@mx3210:~/gnu/gcc$ time hppa64-linux-gnu-gcc -S -O2 hog.c
hog.c: In function 'b':
hog.c:18:15: warning: integer constant is so large that it is unsigned
       c = c * 9650029242287828579LL;
               ^~~~~~~~~~~~~~~~~~~~~
hog.c:19:15: warning: integer constant is so large that it is unsigned
       c = c * 9650029242287828579LL;
               ^~~~~~~~~~~~~~~~~~~~~
hog.c:20:15: warning: integer constant is so large that it is unsigned
       c = c * 9650029242287828579LL; /* this line adds 2 seconds: 7->9 */
               ^~~~~~~~~~~~~~~~~~~~~
hog.c:21:15: warning: integer constant is so large that it is unsigned
       c = c * 9650029242287828579LL; /* this line adds 2 seconds: 9->11 */
               ^~~~~~~~~~~~~~~~~~~~~

real    1m48.311s
user    1m47.665s
sys     0m0.471s