This is the mail archive of the gcc@gcc.gnu.org
mailing list for the GCC project.
Re: Predictive commoning leads to register to register moves through memory.
- From: Richard Biener <richard dot guenther at gmail dot com>
- To: Jeff Law <law at redhat dot com>
- Cc: Simon Dardis <Simon dot Dardis at imgtec dot com>, "gcc at gcc dot gnu dot org" <gcc at gcc dot gnu dot org>
- Date: Mon, 31 Aug 2015 12:39:33 +0200
- Subject: Re: Predictive commoning leads to register to register moves through memory.
- Authentication-results: sourceware.org; auth=none
- References: <B83211783F7A334B926F0C0CA42E32CAF21F34 at hhmail02 dot hh dot imgtec dot org> <55E082E9 dot 4000803 at redhat dot com>
On Fri, Aug 28, 2015 at 5:48 PM, Jeff Law <law at redhat dot com> wrote:
> On 08/28/2015 09:43 AM, Simon Dardis wrote:
>> Following Jeff's advice to extract more information from GCC, I've
>> narrowed the cause down to the predictive commoning pass inserting
>> the load in a loop header style basic block. However, the next pass
>> in GCC, tree-cunroll, promptly removes the loop and joins the loop
>> header to the body of the (non)loop. More oddly, disabling the
>> conditional store elimination pass, the dominator optimizations
>> pass, or jump threading (with --param
>> max-jump-thread-duplication-stmts=0) nets the above assembly code. Any
>> ideas on an approach for this issue?
> I'd probably start by looking at the .optimized tree dump in both cases to
> understand the difference, then (most likely) tracing that through the RTL
> optimizers into the register allocator.
It's the known issue of LIM (here the one after pcom and complete unrolling of
the inner loop) being too aggressive with store-motion. Here the complete
array is replaced with registers for the outer loop. Were 'poly' a
we'd have optimized it away completely.
_8 = 1.0e+0 / pretmp_42;
_12 = _8 * _8;
poly = _12;
# prephitmp_30 = PHI <_12(6), _36(9)>
# T_lsm.8_22 = PHI <_8(6), pretmp_42(9)>
poly_I_lsm0.10_38 = MEM[(double *)&poly + 8B];
_2 = prephitmp_30 * poly_I_lsm0.10_38;
_54 = _2 * poly_I_lsm0.10_38;
_67 = poly_I_lsm0.10_38 * _54;
_80 = poly_I_lsm0.10_38 * _67;
_93 = poly_I_lsm0.10_38 * _80;
_106 = poly_I_lsm0.10_38 * _93;
_19 = poly_I_lsm0.10_38 * _106;
count_23 = count_28 + 1;
if (count_23 != iterations_6(D))
  goto <bb 5>;
else
  goto <bb 8>;
poly = _2;
poly = _54;
poly = _67;
poly = _80;
poly = _93;
poly = _106;
poly = _19;
i1 = 9;
T = T_lsm.8_22;
Note that DOM fails to CSE poly (a known defect), but heh, doing that
would only increase register pressure even more.
Note the above is on x86_64.
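
For readers unfamiliar with store motion, here is a rough hand-written
sketch in C of what the transformation amounts to. This is not Simon's
testcase and not what GCC literally emits; the function names, the array
size N and the loop body are made up for illustration. The first function
keeps the global array in memory, the second keeps every element in a
local for the whole outer loop and writes it back once on exit, which is
where the register pressure comes from.

```c
#include <string.h>

#define N 8                 /* hypothetical array size */

double poly[N];             /* global, so its final contents must reach memory */

/* Source form: each outer iteration loads from and stores to poly[]. */
void run_in_memory(int iterations, double x)
{
    for (int count = 0; count < iterations; count++)
        for (int i = 1; i < N; i++)
            poly[i] = poly[i - 1] * x;
}

/* Hand-applied store motion: poly[] is held in locals ("registers")
   across the outer loop and stored back once after the loop exits. */
void run_store_motion(int iterations, double x)
{
    double r[N];

    if (iterations <= 0)    /* loop never runs: memory must stay untouched */
        return;

    for (int i = 0; i < N; i++)
        r[i] = poly[i];                 /* loads hoisted before the loop */

    for (int count = 0; count < iterations; count++)
        for (int i = 1; i < N; i++)
            r[i] = r[i - 1] * x;        /* all N values stay live here */

    for (int i = 0; i < N; i++)
        poly[i] = r[i];                 /* stores sunk after the loop */
}

/* Returns 1 if both variants leave poly[] with identical contents. */
int same_result(int iterations, double x)
{
    double init[N] = { 2.0 };           /* remaining elements are zero */
    double expect[N];

    memcpy(poly, init, sizeof poly);
    run_in_memory(iterations, x);
    memcpy(expect, poly, sizeof poly);

    memcpy(poly, init, sizeof poly);
    run_store_motion(iterations, x);
    return memcmp(expect, poly, sizeof poly) == 0;
}
```

Both variants compute the same values; the second one simply trades the
per-iteration memory traffic for N simultaneously live values, which is
fine for a small N and painful once N exceeds the available registers.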