This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Re: [PING][PATCH, AArch64] Disable reg offset in quad-word store for Falkor


Siddhesh Poyarekar wrote:
  
> The current cost model will disable reg offset for loads as well as
> stores, which doesn't work well since loads with reg offset are faster
> for falkor.

Why is that a bad thing? With the patch as is, the testcase generates:

.L4:
	ldr	q0, [x2, x3]
	add	x5, x1, x3
	add	x3, x3, 16
	cmp	x3, x4
	str	q0, [x5]
	bne	.L4

With a change in address cost (for loads and stores) we would get:

.L4:
	ldr	q0, [x3], 16
	str	q0, [x4], 16
	cmp	x3, x5
	bne	.L4

This looks better to me, especially if there are more loads and stores and
some have offsets as well (the writeback happens once per stream, while the
extra add is needed for every store). It may be worth trying both possibilities
on a large body of code to see which comes out smallest/fastest.
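
The instruction counts in the two listings above can be checked with a little
arithmetic. The following is only a rough sketch: the function names are made
up for illustration, and it counts instructions per iteration rather than
modelling Falkor's actual latency or throughput. It assumes that every store
in the register-offset variant needs its own add, and that each post-indexed
stream has at least one access that can carry the writeback.

```c
#include <assert.h>

/* Hypothetical helper: instructions per loop iteration when loads use
   register-offset addressing but each store needs a separate add to
   form its address (the first listing).  */
static int
insns_with_store_adds (int n_loads, int n_stores)
{
  assert (n_loads >= 0 && n_stores >= 0);
  /* loads + (add + store) per store + index increment + cmp + branch.  */
  return n_loads + 2 * n_stores + 1 + 2;
}

/* Hypothetical helper: instructions per iteration when the accesses use
   post-index writeback (the second listing).  The pointer update is
   folded into the memory access, so only cmp + branch remain.  */
static int
insns_with_writeback (int n_loads, int n_stores)
{
  assert (n_loads >= 0 && n_stores >= 0);
  return n_loads + n_stores + 2;
}
```

For the loop above (one load, one store) this gives 6 versus 4 instructions.
With one load and three stores it gives 10 versus 6, since the gap grows by
one add per store.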

Note that using the cost model as intended means the compiler picks the
lowest-cost option, whereas the patch never emits the instruction at all, not
even when optimizing for size. I think it's wrong to always block a valid instruction.
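
To illustrate the distinction, here is a minimal sketch in C of the two
approaches. It is not GCC code: the enum, the cost numbers and the function
names are all invented for the example. With a cost, an expensive addressing
mode loses whenever a cheaper alternative is valid but remains available when
it is the only choice; with a block, it is never used.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical addressing modes for the example.  */
enum addr_mode { ADDR_IMM_OFFSET, ADDR_POST_INDEX, ADDR_REG_OFFSET, ADDR_COUNT };

/* Made-up costs: register offset is expensive but not forbidden.  */
static const int mode_cost[ADDR_COUNT] = { 0, 0, 3 };

/* Return the cheapest mode among those marked valid, or -1 if none is.
   Ties go to the mode listed first.  */
static int
pick_by_cost (const bool valid[ADDR_COUNT])
{
  int best = -1;
  for (int m = 0; m < ADDR_COUNT; m++)
    if (valid[m] && (best < 0 || mode_cost[m] < mode_cost[best]))
      best = m;
  return best;
}

/* The blocking approach: register offset is rejected outright, so a
   result of -1 means the address has to be computed separately.  */
static int
pick_with_block (const bool valid[ADDR_COUNT])
{
  bool filtered[ADDR_COUNT];
  for (int m = 0; m < ADDR_COUNT; m++)
    filtered[m] = valid[m] && m != ADDR_REG_OFFSET;
  return pick_by_cost (filtered);
}
```

When both post-index and register offset are valid, both functions choose
post-index. When register offset is the only valid mode, pick_by_cost still
returns it, while pick_with_block returns -1.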

> Also, this is a very specific tweak for a specific processor, i.e. I
> don't know if there is value in splitting out the costs into loads and
> stores and further into 128-bit and lower just to set the 128 store cost
> higher.  That will increase the size of the change by quite a bit and
> may not make it suitable for inclusion into gcc8 at this stage, while
> the current one still qualifies given its contained impact.

It's not clear whether it is easy to split out the costs today (it could be done
in aarch64_rtx_costs but not aarch64_address_cost, and the latter is what
IVOpt uses).

> Further, it seems like worthwhile work only if there are other parts
> that actually have the same quirk and can use this split.  Do you know
> of any such cores?

Currently there are several supported CPUs which use a much higher cost
for TImode and for register offsets. So it's a common thing to want; however,
I don't know whether splitting load/store address costs helps for those.

I think a special case for Falkor in aarch64_address_cost would be acceptable
in GCC8 - that would be much smaller and cleaner than the current patch. 
If required we could improve upon this in GCC9 and add a way to differentiate
between loads and stores.
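
As a rough illustration of what such a special case might look like, here is
a standalone C model. It does not use GCC's real data structures or the real
signature of aarch64_address_cost; the enums, the struct, the function name
and the cost values are placeholders chosen for this example. Since the
address cost cannot tell loads from stores, the penalty in this sketch applies
to any 128-bit register-offset access.

```c
#include <assert.h>

/* Placeholder types standing in for GCC's tuning and address information.  */
enum cpu_model { CPU_GENERIC, CPU_FALKOR };
enum addr_kind { KIND_IMM_OFFSET, KIND_REG_OFFSET, KIND_POST_MODIFY };

struct addr_query
{
  enum cpu_model cpu;
  enum addr_kind kind;
  int access_bytes;   /* 16 for a quad-word (Q register) access.  */
};

/* Made-up penalty; a real value would have to come from benchmarking.  */
#define FALKOR_QWORD_REG_OFFSET_PENALTY 4

static int
model_address_cost (const struct addr_query *q)
{
  /* Baseline: register offset slightly more expensive than the rest
     (also an invented number).  */
  int cost = (q->kind == KIND_REG_OFFSET) ? 1 : 0;

  /* The special case: only Falkor, only 128-bit, only register offset.  */
  if (q->cpu == CPU_FALKOR
      && q->kind == KIND_REG_OFFSET
      && q->access_bytes == 16)
    cost += FALKOR_QWORD_REG_OFFSET_PENALTY;

  return cost;
}
```

In this model only the Falkor quad-word register-offset case is penalised;
smaller accesses, other CPUs and other addressing modes keep their baseline
cost.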

Wilco
