This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Re: [PING][PATCH, AArch64] Disable reg offset in quad-word store for Falkor


On Wednesday 17 January 2018 08:31 PM, Wilco Dijkstra wrote:
> Why is that a bad thing? With the patch as is, the testcase generates:
> 
> .L4:
> 	ldr	q0, [x2, x3]
> 	add	x5, x1, x3
> 	add	x3, x3, 16
> 	cmp	x3, x4
> 	str	q0, [x5]
> 	bne	.L4
> 
> With a change in address cost (for loads and stores) we would get:
> 
> .L4:
> 	ldr	q0, [x3], 16
> 	str	q0, [x4], 16
> 	cmp	x3, x5
> 	bne	.L4
> 
> This looks better to me, especially if there are more loads and stores and
> some have offsets as well (the writeback is once per stream while the extra
> add happens for every store). It may be worth trying both possibilities
> on a large body of code and see which comes out smallest/fastest.

This is great for the load because of the way the Falkor prefetcher
works, but it is terrible for the store because of the way the pipeline
works.  The only performant store for Falkor is an indirect store with a
constant or zero offset.  Everything else has hidden costs.
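
For illustration, a copy loop of roughly the following shape is the kind
of code where this choice shows up.  This is only a sketch and not the
testcase from the patch: quad_t and copy_quads are made-up names, and
whether the compiler picks the register-offset or the post-index form
for the 16-byte accesses depends on the GCC version, flags and tuning.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 16-byte element, so each element copy can become one
   quad-word load plus one quad-word store.  */
typedef struct { unsigned char bytes[16]; } quad_t;

/* Copy n quad-words from src to dst.  The load address (src + i) and
   the store address (dst + i) advance together, which is what lets the
   compiler choose between "base + index register" and writeback
   addressing for each stream.  */
void copy_quads(quad_t *restrict dst, const quad_t *restrict src, size_t n)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Building something like this at -O2 with and without Falkor tuning and
comparing the loop bodies is one way to see which addressing mode the
store ends up with.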

> Note using the cost model as intended means the compiler tries to use the
> lowest cost possibility rather than never emitting the instruction, not even
> when optimizing for size. I think it's wrong to always block a valid instruction.
<snip>
> It's not clear whether it is easy to split out the costs today (it could be done
> in aarch64_rtx_costs but not aarch64_address_cost, and the latter is what
> IVOpt uses).

I briefly looked at the possibility of splitting the register_offset
cost into load and store, but I realized that I'd have to modify the
target hook for it to be useful, which is way too much work for this
single quirk.

>> Further, it seems like worthwhile work only if there are other parts
>> that actually have the same quirk and can use this split.  Do you know
>> of any such cores?
> 
> Currently there are several supported CPUs which use a much higher cost
> for TImode and for register offsets. So it's a common thing to want, however
> I don't know whether splitting load/store address costs helps for those.

It wouldn't.  This ought to be expressed already using the addr_scale_costs.

> I think a special case for Falkor in aarch64_address_cost would be acceptable
> in GCC8 - that would be much smaller and cleaner than the current patch. 
> If required we could improve upon this in GCC9 and add a way to differentiate
> between loads and stores.

I can't do this in address_cost since I can't determine whether the
address is a load or a store location.  The most minimal way seems to be
using the patterns in the md file.
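
To make the distinction concrete, here is a toy model of the two
places the decision could live.  None of this is GCC code: the enum,
both functions and the cost numbers are hypothetical and only mirror
the shape of the problem.  The first function sees the address alone,
as an address cost query does, so it has to give loads and stores the
same answer.  The second sees the whole access, as an instruction
pattern does, so it can reject just the quad-word store.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy address shapes (hypothetical, not GCC's internal classification).  */
enum addr_kind { ADDR_IMM_OFFSET, ADDR_REG_OFFSET };

/* Address-only view: the query knows the shape of the address but not
   whether it feeds a load or a store.  The numbers are made up.  */
int toy_address_cost(enum addr_kind kind)
{
    return kind == ADDR_REG_OFFSET ? 1 : 0;
}

/* Per-instruction view: direction and access size are known, so a
   register offset can be refused for quad-word stores only.  */
bool toy_operand_ok(enum addr_kind kind, bool is_store, bool is_quad)
{
    if (is_store && is_quad && kind == ADDR_REG_OFFSET)
        return false;
    return true;
}
```

In this model raising toy_address_cost for ADDR_REG_OFFSET would also
penalise the load, which is the case that should stay cheap, while
toy_operand_ok leaves the load alone.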

Siddhesh

