This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation
Hi Kyrill,
On Thu, Nov 17, 2016 at 02:22:08PM +0000, Kyrill Tkachov wrote:
> >>>>>>I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there
> >>>>>>were some interesting swings.
> >>>>>>458.sjeng +1.45%
> >>>>>>471.omnetpp +2.19%
> >>>>>>445.gobmk -2.01%
> >>>>>>
> >>>>>>On SPECFP:
> >>>>>>453.povray +7.00%
> After looking at the gobmk performance with performance counters it looks
> like more icache pressure.
> I see an increase in misses.
> This looks to me like an effect of code size increase, though it is not
> that large an increase (0.4% with SWS).
Right. I don't see how to improve on this (but see below); ideas welcome :-)
> Branch mispredicts also go up a bit but not as much as icache misses.
I don't see that happening -- for some testcases we get unlucky and have
more branch predictor aliasing, and for some we have less, it's pretty
random. Some testcases are really sensitive to this.
> I don't think there's anything we can do here, or at least that this patch
> can do about it.
> Overall, there's a slight improvement in SPECINT, even with the gobmk
> regression and a slightly larger improvement on SPECFP due to povray.
And that is for only the "normal" GPRs, not LR or FP yet, right?
> Segher, one curious artifact I spotted while looking at codegen differences
> in gobmk was a case where we fail
> to emit load-pairs as effectively in the epilogue and its preceding basic
> block.
> So before we had this epilogue:
> .L43:
>         ldp     x21, x22, [sp, 16]
>         ldp     x23, x24, [sp, 32]
>         ldp     x25, x26, [sp, 48]
>         ldp     x27, x28, [sp, 64]
>         ldr     x30, [sp, 80]
>         ldp     x19, x20, [sp], 112
>         ret
>
> and I see this becoming (among numerous other changes in the function):
>
> .L69:
>         ldp     x21, x22, [sp, 16]
>         ldr     x24, [sp, 40]
> .L43:
>         ldp     x25, x26, [sp, 48]
>         ldp     x27, x28, [sp, 64]
>         ldr     x23, [sp, 32]
>         ldr     x30, [sp, 80]
>         ldp     x19, x20, [sp], 112
>         ret
>
> So this is better in the cases where we jump straight into .L43 because we
> load fewer registers
> but worse when we jump to or fall through to .L69 because x23 and x24 are
> now restored using two loads
> rather than a single load-pair. This hunk isn't critical to performance in
> gobmk though.
Is loading/storing a pair as cheap as loading/storing a single register?
In that case you could shrink-wrap per pair of registers instead.
Segher
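
The per-pair idea above can be illustrated with a small standalone model.
This is only a sketch, not code from the patch or from GCC: the function
names are hypothetical, and it assumes the callee-saved registers x19..x28
are paired by adjacent stack slots as in the quoted epilogue, i.e.
(x19,x20), (x21,x22), ..., (x27,x28). It compares the number of restore
instructions when every register is its own shrink-wrapping component
(worst case, one ldr each) with the number when each pair is one
component (one ldp per pair).

```c
#include <stdint.h>

/* Assumed pairing: (x19,x20), (x21,x22), ..., (x27,x28). */
#define FIRST_REG 19
#define LAST_REG  28

/* Hypothetical helper: round a register set up to whole pairs, so that
   needing either member of a pair marks both as needing a restore.  */
static uint32_t widen_to_pairs(uint32_t regs)
{
    uint32_t out = 0;
    for (int r = FIRST_REG; r < LAST_REG; r += 2) {
        uint32_t pair = (1u << r) | (1u << (r + 1));
        if (regs & pair)
            out |= pair;
    }
    return out;
}

/* Worst case for per-register components: the two members of a pair
   are restored in different blocks, so each one costs a single ldr.  */
static int loads_per_register(uint32_t regs)
{
    int n = 0;
    for (int r = FIRST_REG; r <= LAST_REG; r++)
        if (regs & (1u << r))
            n++;
    return n;
}

/* Per-pair components: one ldp for every pair that has at least one
   member in the set.  */
static int loads_per_pair(uint32_t regs)
{
    int n = 0;
    for (int r = FIRST_REG; r < LAST_REG; r += 2)
        if (regs & ((1u << r) | (1u << (r + 1))))
            n++;
    return n;
}
```

For the x23/x24 case quoted above, loads_per_register gives 2 and
loads_per_pair gives 1. The trade-off is that a path needing only x24
would then also restore x23, which is only a win if a pair access really
costs about the same as a single one.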