This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: Vectorization: Loop peeling with misaligned support.
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Richard Biener <richard dot guenther at gmail dot com>
- Cc: Hendrik Greving <hendrik dot greving dot intel at gmail dot com>, Bingfeng Mei <bmei at broadcom dot com>, "gcc at gcc dot gnu dot org" <gcc at gcc dot gnu dot org>
- Date: Sat, 16 Nov 2013 12:45:34 +0100
- Subject: Re: Vectorization: Loop peeling with misaligned support.
- References: <B71DF1153024A14EABB94E39368E44A6041FE5EA at SJEXCHMB13 dot corp dot ad dot broadcom dot com> <CAFiYyc2cj8GLL5PwVchTRreSMcCvxyVcz6k61_G=QqqXXVvzYQ at mail dot gmail dot com> <B71DF1153024A14EABB94E39368E44A6041FE756 at SJEXCHMB13 dot corp dot ad dot broadcom dot com> <CANc4vho3mLGo42ARGs7_toBax_xj6iG-d1qhA+oMfe3uFGbDKA at mail dot gmail dot com> <20131115222606 dot GA32059 at domone dot podge> <abdba6cc-3d06-4b3a-83a2-891dcc54f949 at email dot android dot com>
On Sat, Nov 16, 2013 at 11:37:36AM +0100, Richard Biener wrote:
> "Ondřej Bílka" <neleai@seznam.cz> wrote:
> >On Fri, Nov 15, 2013 at 09:17:14AM -0800, Hendrik Greving wrote:
>
> IIRC what can still be seen is store-buffer related slowdowns when you have a big unaligned store load in your loop. Thus aligning stores still pays back last time I measured this.
Then send your benchmark. What I did is a loop that stores 512 bytes. Unaligned stores there are faster than aligned ones, so tell me when aligning stores pays for itself. Note that when you count what fills the store buffer, you must also take into account the extra stores needed to make the loop aligned.
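For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of such a test. It is not the benchmark I ran. The names store_block and time_stores, the 16-byte chunk size and the use of clock() are all assumptions made for illustration, and the empty asm barrier is a GCC/Clang extension.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

enum { BLOCK = 512 };   /* bytes stored per call, as in the mail */

/* Store BLOCK bytes starting at dst, 16 bytes at a time.
   A fixed-size 16-byte memcpy is usually compiled to one vector
   store that tolerates any alignment, but that is up to the compiler. */
void store_block(unsigned char *dst, unsigned char value)
{
    unsigned char chunk[16];
    memset(chunk, value, sizeof chunk);
    for (size_t i = 0; i < BLOCK; i += 16)
        memcpy(dst + i, chunk, sizeof chunk);
}

/* Repeat store_block iters times at dst and return the CPU time
   in seconds.  Hypothetical helper, not part of any library. */
double time_stores(unsigned char *dst, long iters)
{
    clock_t start = clock();
    for (long k = 0; k < iters; k++) {
        store_block(dst, (unsigned char)k);
        /* Compiler barrier so the stores are not optimized away. */
        __asm__ volatile("" : : "r"(dst) : "memory");
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

To compare, call time_stores once on a buffer aligned to at least 16 bytes and once on the same buffer plus one byte, with the same iteration count. The absolute numbers depend on the CPU and compiler flags, so only the ratio between the two runs is of interest.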
Also, what do you do with loops that contain no store? If I modify the test to
int set(int *p, int *q){
  int i;
  int sum = 0;
  for (i=0; i < 128; i++)
    sum += 42 * p[i];
  return sum;
}
then it still peels the loop for alignment.
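To make concrete what that peeling costs, here is a hand-written scalar model of the transformation. It is only a sketch and not what GCC actually emits. The function names and the 16-byte vector width are assumptions, and the inner loop over lanes stands in for a single aligned vector load.

```c
#include <stddef.h>
#include <stdint.h>

/* The reduction from the mail, with the trip count as a parameter. */
int sum_plain(const int *p, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += 42 * p[i];
    return sum;
}

enum { VEC_BYTES = 16 };   /* assumed vector width, e.g. SSE */

/* Scalar model of peeling for alignment: prologue, main body, epilogue. */
int sum_peeled(const int *p, size_t n)
{
    const size_t lanes = VEC_BYTES / sizeof(int);
    int sum = 0;
    size_t i = 0;

    /* Prologue: scalar iterations until p + i is VEC_BYTES-aligned. */
    while (i < n && ((uintptr_t)(p + i) % VEC_BYTES) != 0) {
        sum += 42 * p[i];
        i++;
    }

    /* Main body: every group starts at an aligned address. */
    for (; i + lanes <= n; i += lanes)
        for (size_t j = 0; j < lanes; j++)
            sum += 42 * p[i + j];

    /* Epilogue: elements left over after the last full group. */
    for (; i < n; i++)
        sum += 42 * p[i];

    return sum;
}
```

Both functions return the same value for any input. The peeled form just adds a prologue and an epilogue around the main loop, and that is extra code and extra branches even though the loop has no store at all.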
There may be a threshold after which aligning the buffer makes sense; then
you need to show that the loop spends most of its time on sizes above that
threshold.
Also, do you have data on how common store-buffer slowdowns are? Without
knowing that, you risk making a few loops faster at the expense of the
majority, which could likely slow the whole application down. It would not
surprise me, as these loops can run mostly on L1 cache data (an assumption
at about the same level as assuming that the increased code size fits into
the instruction cache).
Actually these questions could be answered by a test: first compile
SPEC2006 with vanilla gcc -O3, and then with a gcc that contains a patch to
use unaligned loads. The results will then tell whether peeling is also
good in practice or not.