This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug rtl-optimization/59811] [4.9/5/6 Regression] Huge increase in memory usage and compile time in combine
- From: "rguenth at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Thu, 11 Feb 2016 12:59:07 +0000
- Subject: [Bug rtl-optimization/59811] [4.9/5/6 Regression] Huge increase in memory usage and compile time in combine
- Auto-submitted: auto-generated
- References: <bug-59811-4 at http dot gcc dot gnu dot org/bugzilla/>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59811
--- Comment #16 from Richard Biener <rguenth at gcc dot gnu.org> ---
Callgrinding a release-checking stage3 shows get_ref_base_and_extent and
find_hard_regno_for_1 at the top.
And it shows wi::lshift_large called from get_ref_base_and_extent - exactly
what I feared... we do hit both wi::lshift_large and wi::mul_internal.
perf confirms the hot spots get_ref_base_and_extent (9%) and
find_hard_regno_for_1 (19%) but wi::lshift_large is somewhat down (1.8%),
wi::mul_internal is at 1%. Note the shifts are all by 3 (BITS_PER_UNIT
multiplication).
The following loop in lshift_large doesn't seem to be very latency friendly:

      /* The first unfilled output block is a left shift of the first
         block in XVAL.  The other output blocks contain bits from two
         consecutive input blocks.  */
      unsigned HOST_WIDE_INT carry = 0;
      for (unsigned int i = skip; i < len; ++i)
        {
          unsigned HOST_WIDE_INT x = safe_uhwi (xval, xlen, i - skip);
          val[i] = (x << small_shift) | carry;
          carry = x >> (-small_shift % HOST_BITS_PER_WIDE_INT);
        }
  4.02 │ e0:   mov  (%r11,%r9,8),%rax
  5.54 │ e4:   mov  %rax,%rdi
       │       mov  %r8d,%ecx
       │       shl  %cl,%rdi
  4.23 │       mov  %rdi,%rcx
  3.91 │       or   %r15,%rcx
  2.06 │       mov  %rcx,(%r14,%r9,8)
  7.38 │       mov  %r13d,%ecx
  1.41 │       add  $0x1,%r9
       │       shr  %cl,%rax
  3.04 │       cmp  %r12,%r9
  3.37 │       mov  %rax,%r15
  1.95 │     ↓ je   9e
I wonder if GCC can be more efficient here by special-casing skip == 0,
len == 2 and using a __int128 on hosts where that is available.
In this case we're shifting xlen == 1 values but the precision might need 2
(byte to bit precision). Special casing that case might also make sense.
It helps a bit but of course all the testing has an overhead as well.
Maybe a wi::bytes_to_bits helper is a better solution here.
Anyway, caching the get_ref_base_and_extent result (which, by the way, we
re-compute only for the stores, for stmt_may_clobber_ref_p) might help more.
Note that with release-checking the testcase compiles quite fast for me.
 alias stmt walking      :   4.53 (37%) usr   0.04 (18%) sys   4.44 (36%) wall       2 kB ( 0%) ggc
 dead store elim2        :   0.67 ( 5%) usr   0.04 (18%) sys   0.71 ( 6%) wall   87250 kB (57%) ggc
 combiner                :   0.20 ( 2%) usr   0.00 ( 0%) sys   0.20 ( 2%) wall    2709 kB ( 2%) ggc
 integrated RA           :   1.18 (10%) usr   0.01 ( 5%) sys   1.19 (10%) wall    6629 kB ( 4%) ggc
 LRA hard reg assignment :   2.69 (22%) usr   0.01 ( 5%) sys   2.69 (22%) wall       0 kB ( 0%) ggc
 reload CSE regs         :   0.56 ( 5%) usr   0.00 ( 0%) sys   0.56 ( 5%) wall    1064 kB ( 1%) ggc
 TOTAL                   :  12.21            0.22            12.42           152002 kB
(that's w/o mucking with timevars).