[PATCH i386][google]With -mtune=core2, avoid generating the slow unaligned vector load/store (issue5488054)

Sriraman Tallam tmsriram@google.com
Wed Dec 14 00:04:00 GMT 2011


I updated the patch to add the checks in vectorizable_load and
vectorizable_store themselves.

Thanks,
-Sri.

On Tue, Dec 13, 2011 at 12:16 PM, Xinliang David Li <davidxl@google.com> wrote:
> See instruction tables here:
> http://www.agner.org/optimize/instruction_tables.pdf
>
> My brief reading of the table for core2 and corei7 suggests the following:
>
> 1. On core2
>
> movdqu -- both the load and store forms take up to 8 cycles to complete,
> and the store form produces 8 uops while the load produces 4
>
> movsd load:  1 uop, 2 cycle latency
> movsd store: 1 uop, 3 cycle latency
>
> movhpd, movlpd load: 2 uops, 3 cycle latency
> movhpd store: 2 uops, 5 cycle latency
> movlpd store: 1 uop, 3 cycle latency
>
>
> 2. Core i7
>
> movdqu load: 1 uop, 2 cycle latency
> movdqu store: 1 uop, 3 cycle latency
>
> movsd load: 1 uop, 2 cycle latency
> movsd store: 1 uop, 3 cycle latency
>
> movhpd, movlpd load: 2 uops, 3 cycle latency
> movhpd, movlpd store: 2 uops, 5 cycle latency
>
>
> From the above, it looks like Sri's original simple heuristic should work fine:
>
> 1) for corei7, if the loads and stores cannot be proved to be 128-bit
> aligned, always use movdqu
>
> 2) for core2, experiments can be done to determine whether to look only at
> unaligned stores, or at both unaligned loads and stores, when deciding to
> disable vectorization.
>
> Yes, in the longer term a more precise cost model is probably needed --
> but that requires a lot of work and may not do much better in practice.
>
> What is more important is to beef up gcc infrastructure to allow more
> aggressive alignment (info) propagation.
>
> In 4.4, gcc does alignment (output array) based versioning -- Sri's
> patch has the effect of doing the same thing, but only for selected
> targets.
>
> thanks,
>
> David
>
> On Tue, Dec 13, 2011 at 10:56 AM, Richard Henderson <rth@redhat.com> wrote:
>> On 12/13/2011 10:26 AM, Sriraman Tallam wrote:
>>> Cool, this works for stores!  It generates the movlps + movhps. I have
>>> to also make a similar change to another call to gen_sse2_movdqu for
>>> loads. Would it be ok to not do this when tune=core2?
>>
>> We can work something out.
>>
>> I'd like you to do the benchmarking to know if unaligned loads are really as expensive as unaligned stores, and whether there are reformatting penalties that make the movlps+movhps option for either load or store less attractive.
>>
>>
>> r~



More information about the Gcc-patches mailing list