[PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.

Uros Bizjak ubizjak@gmail.com
Fri Jan 3 16:04:00 GMT 2014


On Fri, Jan 3, 2014 at 3:02 PM, Uros Bizjak <ubizjak@gmail.com> wrote:

>>> Like in the patch below. Please note that the block_tune setting for
>>> nocona is wrong; -march=native on my trusted old P4 returns:
>>>
>>> --param "l1-cache-size=16" --param "l1-cache-line-size=64" --param
>>> "l2-cache-size=2048" "-mtune=nocona"
>>>
>>> which is consistent with the above quote from manual.
>>>
>>> 2014-01-02  Uros Bizjak  <ubizjak@gmail.com>
>>>
>>>     * config/i386/i386.c (ix86_data_alignment): Calculate max_align
>>>     from prefetch_block tune setting.
>>>     (nocona_cost): Correct size of prefetch block to 64.
>>>
>>> The patch was bootstrapped on x86_64-pc-linux-gnu and is currently in
>>> regression testing. If there are no comments, I will commit it to
>>> mainline and release branches after a couple of days.
>>
>> That still has the effect of not aligning (for most tunings) 32 to 63 bytes
>> long aggregates to 32 bytes, while previously they were aligned.  Forcing
>> aligning 32 byte long aggregates to 64 bytes would be overkill, 32 byte
>> alignment is just fine for those (and ensures it never crosses 64 byte
>> boundary), for 33 to 63 bytes perhaps using 64 bytes alignment wouldn't
>> be that bad, just wouldn't match what we have done before.
>
> Please note that the previous value was based on an earlier (pre-P4)
> recommendation and was appropriate for older chips with a 32-byte
> cache line. The value should have been updated long ago, when 64-byte
> cache lines were introduced, but this was probably missed because a
> magic value was used without a comment.
>
> Ah, I see. My patch deals only with structures larger than a cache
> line. As recommended in the "As long as 16-byte boundaries (and cache
> lines) are never crossed, natural alignment is not strictly necessary
> (though it is an easy way to enforce this)." part of the manual, we
> should align smaller structures to 16 or 32 bytes.
>
> Yes, I agree. Can you please merge your patch together with the proposed patch?

On second thought, the crossing of 16-byte boundaries is mentioned
for the data *access* (the instruction itself) when it is not naturally
aligned (please see example 3-40 and fig 3-2), which is *NOT* our
case.

So, we don't have to align 32-byte structures in any way for newer
processors, since this optimization applies only to structures of 64 or
more bytes (larger than or equal to the cache line size). Older
processors are handled correctly, modulo nocona, where the cache line
size value has to be corrected.

Following that, my original patch implements this optimization in the
correct way.
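
For illustration only, the rule described above can be sketched roughly
as follows. This is not the code from the patch: the function name and
its parameters are hypothetical, sizes are in bytes rather than bits,
and the real ix86_data_alignment handles more cases than shown here.

```c
#include <assert.h>

/* Hypothetical sketch, not the actual ix86_data_alignment.
   SIZE is the aggregate size in bytes, ALIGN its current alignment in
   bytes, and PREFETCH_BLOCK the cache line size in bytes taken from the
   tuning table (assumed to be 64 on newer processors, 32 on older
   ones).  */
unsigned int
sketch_data_alignment (unsigned long size, unsigned int align,
                       unsigned int prefetch_block)
{
  /* Assumption of this sketch: the cache line size is a nonzero
     power of two.  */
  assert (prefetch_block != 0
          && (prefetch_block & (prefetch_block - 1)) == 0);

  /* Only aggregates at least as large as a cache line are raised to
     cache line alignment; smaller ones keep their alignment.  */
  if (size >= prefetch_block && align < prefetch_block)
    return prefetch_block;

  return align;
}
```

With a 64-byte prefetch block, a 32-byte structure keeps its alignment,
while a 64-byte or larger structure is raised to 64-byte alignment.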

Uros.


