Vectorization: Loop peeling with misaligned support.

Sun Nov 17 15:41:00 GMT 2013

"Ondřej Bílka" <neleai@seznam.cz> wrote:
>On Sat, Nov 16, 2013 at 11:37:36AM +0100, Richard Biener wrote:
>> "Ondřej Bílka" <neleai@seznam.cz> wrote:
>> >On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
>> 
>> IIRC what can still be seen is store-buffer related slowdowns when
>> you have a big unaligned store load in your loop.  Thus aligning stores
>> still pays back last time I measured this.
>
>Then send your benchmark. What I did is a loop that stores 512 bytes.
>Unaligned stores there are faster than aligned ones, so tell me when
>aligning stores pays for itself. Note that in filling the store buffer you
>must take into account the extra stores needed to make the loop aligned.

The issue is that the effective write bandwidth can be limited by the store buffer if you have multiple write streams.  IIRC at least some AMD CPUs have to use two entries for stores crossing a cache line boundary.
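
To make the cache-line-crossing point concrete, here is a minimal sketch that counts how many stores in one write stream straddle a line boundary. The 64-byte line size and the name count_split_stores are assumptions for illustration only; they are not taken from any vendor manual or from GCC.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size in bytes; check the target's manual. */
#define CACHE_LINE 64u

/* Count how many of n back-to-back stores of `width` bytes, starting at
   address `base`, straddle a cache line boundary.  Hypothetical helper. */
static size_t count_split_stores(uintptr_t base, size_t width, size_t n)
{
    size_t splits = 0;
    for (size_t i = 0; i < n; i++) {
        uintptr_t first = base + i * width;   /* first byte written */
        uintptr_t last = first + width - 1;   /* last byte written */
        if (first / CACHE_LINE != last / CACHE_LINE)
            splits++;
    }
    return splits;
}
```

For the 512-byte case with 16-byte stores (32 stores), a base that is 4 bytes past a line boundary gives 8 split stores. If each split store really takes two store-buffer entries, that stream would occupy 40 entries instead of 32.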

Anyway, a look into the optimization manuals will tell you what to do and the cost model should follow these recommendations.

>Also what do you do with loops that contain no store? If I modify the
>test to
>
>int set(int *p, int *q){
>  int i;
>  int sum = 0;
>  for (i=0; i < 128; i++)
>     sum += 42 * p[i];
>  return sum;
>}
>
>then it still does aligning.

Because the cost model simply does not exist for the decision whether to peel or not. Patches welcome.
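
For reference, peeling for alignment on a reduction like the one quoted above roughly corresponds to the hand-written sketch below. It is not the vectorizer's actual output: the name sum_peeled, the 16-byte vector width, and the inner lane loop standing in for one vector operation are all assumptions.

```c
#include <stdint.h>

/* Assumed vector width in bytes. */
#define VEC_BYTES 16u

/* Hand-written illustration of peeling for alignment (hypothetical). */
int sum_peeled(const int *p, int n)
{
    const int lanes = (int)(VEC_BYTES / sizeof(int));
    int sum = 0;
    int i = 0;

    /* Prologue: scalar iterations until p + i is VEC_BYTES-aligned. */
    while (i < n && ((uintptr_t)(p + i) % VEC_BYTES) != 0) {
        sum += 42 * p[i];
        i++;
    }

    /* Main loop: each outer iteration stands for one aligned vector op. */
    for (; i + lanes <= n; i += lanes)
        for (int k = 0; k < lanes; k++)
            sum += 42 * p[i + k];

    /* Epilogue: leftover elements that do not fill a whole vector. */
    for (; i < n; i++)
        sum += 42 * p[i];

    return sum;
}
```

The prologue and epilogue are the extra code and extra iterations that a peeling cost model would have to weigh against the benefit of aligned accesses in the main loop.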

>There may be a threshold after which aligning the buffer makes sense;
>then you need to show that the loop spends most of its time on sizes
>above that threshold.
>
>Also do you have data on how common store-buffer slowdowns are? Without
>knowing that you risk making a few loops faster at the expense of the
>majority, which could likely slow the whole application down. It would not
>surprise me, as these loops can be run mostly on L1 cache data (which is
>around the same level as assuming that the increased code size fits into
>the instruction cache.)
>
>
>Actually these questions could be answered by a test: first compile
>SPEC2006 with vanilla gcc -O3 and then with a gcc that contains a patch to
>use unaligned loads. Then the results will tell if peeling is also good in
>practice or not.

It should not be an on-or-off decision but rather a decision based on a cost model.
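
As a rough sketch of what such a decision could look like (the struct, its fields, and should_peel are hypothetical and do not correspond to any existing GCC interface; the numbers would have to come from per-target tuning):

```c
#include <stdbool.h>

/* All fields are hypothetical per-target tuning numbers. */
struct peel_costs {
    unsigned aligned_access;    /* cost of one aligned vector access */
    unsigned unaligned_access;  /* cost of one unaligned vector access */
    unsigned scalar_iter;       /* cost of one peeled scalar iteration */
    unsigned peel_overhead;     /* fixed cost of the alignment check/branch */
};

/* Return true if the estimated cost with peeling is lower than without.
   Simplification: the vector iteration count is assumed to be the same
   in both variants. */
static bool should_peel(const struct peel_costs *c,
                        unsigned vec_iters,
                        unsigned expected_peel_iters)
{
    unsigned long long no_peel =
        (unsigned long long)vec_iters * c->unaligned_access;
    unsigned long long peel =
        c->peel_overhead
        + (unsigned long long)expected_peel_iters * c->scalar_iter
        + (unsigned long long)vec_iters * c->aligned_access;
    return peel < no_peel;
}
```

With equal aligned and unaligned access costs this never chooses to peel. With a costlier unaligned access it peels only once the trip count is large enough to amortize the prologue.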

Richard.