This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
On 17/11/16 14:55, Kyrill Tkachov wrote:
> On 17/11/16 14:44, Segher Boessenkool wrote:
>> Hi Kyrill,
>>
>> On Thu, Nov 17, 2016 at 02:22:08PM +0000, Kyrill Tkachov wrote:
>>> I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there
>>> were some interesting swings.
>>>   458.sjeng    +1.45%
>>>   471.omnetpp  +2.19%
>>>   445.gobmk    -2.01%
>>> On SPECFP:
>>>   453.povray   +7.00%
>>>
>>> After looking at the gobmk performance with performance counters it
>>> looks like more icache pressure. I see an increase in misses.
>>> This looks to me like an effect of the code size increase, though it
>>> is not that large an increase (0.4% with SWS).
>>
>> Right.  I don't see how to improve on this (but see below); ideas
>> welcome :-)
>>
>>> Branch mispredicts also go up a bit but not as much as icache misses.
>>
>> I don't see that happening -- for some testcases we get unlucky and
>> have more branch predictor aliasing, and for some we have less; it's
>> pretty random.  Some testcases are really sensitive to this.
>
> Right, I don't think it's branch prediction that is at fault in this
> case, but rather the icache effects above.
>
>>> I don't think there's anything we can do here, or at least that this
>>> patch can do about it. Overall there's a slight improvement in
>>> SPECINT, even with the gobmk regression, and a slightly larger
>>> improvement on SPECFP due to povray.
>>
>> And that is for only the "normal" GPRs, not LR or FP yet, right?
>
> This patch does implement wrapping of the FP registers as well, but not
> LR. Though I remember seeing the improvement even when only the GPRs
> were wrapped in an earlier version of the patch.
>
>>> Segher, one curious artifact I spotted while looking at codegen
>>> differences in gobmk was a case where we fail to emit load-pairs as
>>> effectively in the epilogue and its preceding basic block.
>>> So before we had this epilogue:
>>> .L43:
>>>         ldp     x21, x22, [sp, 16]
>>>         ldp     x23, x24, [sp, 32]
>>>         ldp     x25, x26, [sp, 48]
>>>         ldp     x27, x28, [sp, 64]
>>>         ldr     x30, [sp, 80]
>>>         ldp     x19, x20, [sp], 112
>>>         ret
>>> and I see this becoming (among numerous other changes in the function):
>>> .L69:
>>>         ldp     x21, x22, [sp, 16]
>>>         ldr     x24, [sp, 40]
>>> .L43:
>>>         ldp     x25, x26, [sp, 48]
>>>         ldp     x27, x28, [sp, 64]
>>>         ldr     x23, [sp, 32]
>>>         ldr     x30, [sp, 80]
>>>         ldp     x19, x20, [sp], 112
>>>         ret
>>> So this is better in the cases where we jump straight to .L43,
>>> because we load fewer registers, but worse when we jump or fall
>>> through to .L69, because x23 and x24 are now restored using two
>>> single loads rather than one load-pair. This hunk isn't critical to
>>> performance in gobmk though.
>>
>> Is loading/storing a pair as cheap as loading/storing a single
>> register?  In that case you could shrink-wrap per pair of registers
>> instead.
>
> I suppose it can vary by microarchitecture. For the purposes of codegen
> I'd say it's more expensive than loading/storing a single register (as
> more memory bandwidth is required, after all) but cheaper than two
> separate loads/stores (alignment quirks notwithstanding).
> Interesting idea. That could help with code size too. I'll try it out.
I'm encountering some difficulties implementing this idea. I want to
keep the per-register structures across the hooks, but restrict the
components used in a basic block to an even number of FPRs and GPRs.
I tried doing this in COMPONENTS_FOR_BB, but that ended up not
saving/restoring some of the registers at all: the components that were
"filtered out" that way still made their way into the bitmap passed to
SET_HANDLED_COMPONENTS, so the normal prologue/epilogue didn't save and
restore them either. I don't want to do it in GET_SEPARATE_COMPONENTS,
as that doesn't see each basic block.
Is this something DISQUALIFY_COMPONENTS could be used for?

Thanks,
Kyrill
> Thanks,
> Kyrill

>> Segher