This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: Propose moving vectorization from -O3 to -O2.
- From: Ondřej Bílka <neleai at seznam dot cz>
- To: Xinliang David Li <davidxl at google dot com>
- Cc: Richard Biener <richard dot guenther at gmail dot com>, Cong Hou <congh at google dot com>, Zdenek Dvorak <ook at ucw dot cz>, "gcc at gcc dot gnu dot org" <gcc at gcc dot gnu dot org>
- Date: Thu, 22 Aug 2013 10:24:27 +0200
- Subject: Re: Propose moving vectorization from -O3 to -O2.
- References: <CAK=A3=1fd07wXzYFn-t+arozGeSFrGoKDMgzgbdCyzoJkz99og at mail dot gmail dot com> <CAAkRFZLnRdZvzGCLdj7-gGN7oysV900=2cexqxLFX3JsmZVGAQ at mail dot gmail dot com> <ad058eea-ca1e-4ff2-bb19-ea6572be6897 at email dot android dot com> <CAAkRFZKnKXF05BJc24BJHCmBJpK-ng_oyRiub4M0USiExJx3fA at mail dot gmail dot com> <dc0945e4-eec8-4c72-885f-da5396580734 at email dot android dot com> <CAAkRFZ+coco5hKANMG3=2+p6oZ3gKmOU0wV5xx3mt-6h_AGjVQ at mail dot gmail dot com> <CAFiYyc3xz+363NpWp2MEibeSYaiGYVotTz4eigW0HCzvFyxh6w at mail dot gmail dot com> <CAAkRFZKXwHD9OPywaZ_5oOZR_pVJZxmYj3EU0tojcos0fVxdvQ at mail dot gmail dot com>
On Wed, Aug 21, 2013 at 11:50:34PM -0700, Xinliang David Li wrote:
> > The effect on runtime is not correlated to
> > either (which means the vectorizer cost model is rather bad), but integer
> > code usually does not benefit at all.
>
> The cost model does need some tuning. For instance, GCC vectorizer
> does peeling aggressively, but peeling in many cases can be avoided
> while still gaining good performance -- even when target does not have
> efficient unaligned load/store to implement unaligned access. GCC
> reports too high cost for unaligned access while too low for peeling
> overhead.
>
Another issue is that gcc generates very inefficient loop headers. If I
change the example below so that the call becomes

foo(a+rand()%10000, b+rand()%10000, c+rand()%10000, rand()%64);

then I get a vectorizer regression of

gcc-4.7 -O3 x.c -o xa

versus

gcc-4.7 -O2 -funroll-all-loops x.c -o xb
> Example:
>
> #ifndef TYPE
> #define TYPE float
> #endif
> #include <stdlib.h>
>
> __attribute__((noinline)) void
> foo (TYPE *a, TYPE* b, TYPE *c, int n)
> {
> int i;
> for ( i = 0; i < n; i++)
> a[i] = b[i] * c[i];
> }
>
> int g;
> int
> main()
> {
> int i;
> float *a = (float*) malloc (100000*4);
> float *b = (float*) malloc (100000*4);
> float *c = (float*) malloc (100000*4);
>
> for (i = 0; i < 100000; i++)
> foo(a, b, c, 100000);
>
>
> g = a[10];
>
> }
>
>
> 1) by default, GCC's vectorizer will peel the loop in foo, so that
> access to 'a' is aligned and using movaps instruction. The other
> accesses are using movups when -march=corei7 is used
> 2) Same as above, but -march=x86_64. Access to b is split into 'movlps
> and movhps', same for 'c'
>
> 3) Disabling peeling (via a hack) with -march=corei7 --- all three
> accesses are using movups
> 4) Disabling peeling, with -march=x86-64 -- all three accesses are
> using movlps/movhps
>
> Performance:
>
> 1) and 3) -- both 1.58s, but 3) is much smaller than 1). 3)'s text is
> 1462 bytes, and 1) is 1622 bytes
> 3) and 4) and no vectorize -- all very slow -- 4.8s
>
This could be explained by the lack of unrolling. When unrolling is
enabled, the slowdown is only 20% over the sse variant.
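The alignment peeling in case 1) above can be sketched by hand roughly as follows (my own sketch using SSE intrinsics, not GCC's actual output): scalar iterations peel off until the store pointer is 16-byte aligned, then the vector loop uses an aligned store (movaps) to 'a' and unaligned loads (movups) from 'b' and 'c':

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

void
foo_peeled (float *a, const float *b, const float *c, int n)
{
  int i = 0;

  /* Prologue: peel scalar iterations until a[i] is 16-byte aligned.  */
  while (i < n && ((uintptr_t) (a + i) & 15) != 0)
    {
      a[i] = b[i] * c[i];
      i++;
    }

  /* Vector loop: aligned store to 'a', unaligned loads from 'b'/'c'.  */
  for (; i + 4 <= n; i += 4)
    {
      __m128 vb = _mm_loadu_ps (b + i);
      __m128 vc = _mm_loadu_ps (c + i);
      _mm_store_ps (a + i, _mm_mul_ps (vb, vc));
    }

  /* Epilogue: remaining scalar iterations.  */
  for (; i < n; i++)
    a[i] = b[i] * c[i];
}
```

Disabling the peel (cases 3 and 4) simply replaces the prologue with an unaligned store in the vector loop, which is why the code is smaller at the same speed on corei7.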
> > That said, I expect 99% of used software
> > (probably rather 99,99999%) is not compiled on the system it runs on but
> > compiled to run on generic hardware and thus restricts itself to bare x86_64
> > SSE2 features. So what matters for enabling the vectorizer at -O2 is the
> > default architecture features of the given architecture(!) - remember
> > to not only
> > consider x86 here!
> >
This is a non-issue, as sse2 already contains most of the operations
needed; the performance improvement from the additional ss* extensions
is minimal. A performance improvement over sse2 could come from
avx/avx2, but the vectorizer's avx support is still severely lacking.
> > The same argument was done on the fact that GCC does not optimize by default
> > but uses -O0. It's a straw-man argument. All "benchmarking" I see uses
> > -O3 or -Ofast already.
>
> People can just do -O2 performance comparison.
>
When machines spend 95% of their time in code compiled with gcc -O2,
then benchmarking should be done at -O2. With any other flags you will
just get a bunch of numbers that are not very related to real-world
performance.