[PATCH i386 5/8] [AVX-512] Extend vectorizer hooks.

Jan Hubicka hubicka@ucw.cz
Mon May 19 04:48:00 GMT 2014


> > Thanks for the pointer, there is indeed the recommendation in
> > optimization manual [1], section 3.6.4, where it is said:
> >
> > --quote--
> > Misaligned data access can incur significant performance penalties.
> > This is particularly true for cache line
> > splits. The size of a cache line is 64 bytes in the Pentium 4 and
> > other recent Intel processors, including
> > processors based on Intel Core microarchitecture.
> > An access to data unaligned on 64-byte boundary leads to two memory
> > accesses and requires several
> > μops to be executed (instead of one). Accesses that span 64-byte
> > boundaries are likely to incur a large
> > performance penalty, the cost of each stall generally are greater on
> > machines with longer pipelines.
> >
> > ...
> >
> > A 64-byte or greater data structure or array should be aligned so that
> > its base address is a multiple of 64.
> > Sorting data in decreasing size order is one heuristic for assisting
> > with natural alignment. As long as 16-
> > byte boundaries (and cache lines) are never crossed, natural alignment
> > is not strictly necessary (though
> > it is an easy way to enforce this).
> > --/quote--
> >
> > So, this part has nothing to do with AVX512, but with cache line
> > width. And we do have a --param "l1-cache-line-size=64", detected with
> > -march=native that could come handy here.
> >
> > This part should be rewritten (and commented) with the information
> > above in mind.
> 
> Like in the patch below. Please note, that the block_tune setting for
> the nocona is wrong, -march=native on my trusted old P4 returns:
> 
> --param "l1-cache-size=16" --param "l1-cache-line-size=64" --param
> "l2-cache-size=2048" "-mtune=nocona"
> 
> which is consistent with the above quote from manual.
> 
> 2014-01-02  Uros Bizjak  <ubizjak@gmail.com>
> 
>     * config/i386/i386.c (ix86_data_alignment): Calculate max_align
>     from prefetch_block tune setting.
>     (nocona_cost): Correct size of prefetch block to 64.
> 
Uros,
I am looking into LibreOffice size, and data alignment seems to make a huge
difference. The data section has grown from 5.8MB to 6.3MB between GCC 4.8 and 4.9,
while clang produces 5.2MB.

The two patches I posted to not align vtables and RTTI reduce it to 5.7MB, but
perhaps we want to revisit the alignment rules.  The optimization manuals
usually care only about performance-critical loops.  Perhaps we can make the
rules align only bigger data structures, at least for -O2.

Honza


