This is the mail archive of the
gcc-help@gcc.gnu.org
mailing list for the GCC project.
Re: Advice about using SIMD extensions
- From: Brian Budge <brian dot budge at gmail dot com>
- To: Richard Beare <Richard dot Beare at csiro dot au>
- Cc: gcc-help at gcc dot gnu dot org
- Date: Thu, 24 Feb 2005 12:20:43 +0100
- Subject: Re: Advice about using SIMD extensions
- References: <421D1D78.2080904@csiro.au>
- Reply-to: Brian Budge <brian dot budge at gmail dot com>
Hi Richard -
With this kind of example you should get close to a 4x speedup,
since each SSE register holds four floats. One of your issues may be
that gcc doesn't seem to schedule instructions well for vector types
(I haven't confirmed this with anyone). I have also seen similar
slowdowns with xmmintrin.h code when I write things naively.
My advice: try writing the code out longhand using the xmm
intrinsics, interleaving loads and arithmetic, and see if you get a
speedup.
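The suggestion above can be sketched roughly as follows. This is an
illustration only, not code from the original thread: the function name
add_sse is hypothetical, and the sketch assumes that all three arrays are
16-byte aligned and that len is a multiple of 8 floats. The loop handles
two vectors per iteration so that the second pair of loads sits between
the first pair of loads and the first add.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps */

/* Hypothetical helper (not part of the original post).
 * Computes out[i] = in1[i] + in2[i] for len floats.
 * Assumptions: in1, in2 and out are 16-byte aligned,
 * and len is a multiple of 8. */
static void add_sse(const float *in1, const float *in2, float *out, size_t len)
{
    size_t i;

    assert(len % 8 == 0);

    for (i = 0; i < len; i += 8) {
        /* Load the first pair of 4-float vectors. */
        __m128 a0 = _mm_load_ps(in1 + i);
        __m128 b0 = _mm_load_ps(in2 + i);

        /* Start loading the second pair before adding the first,
         * so loads and arithmetic are interleaved in the source. */
        __m128 a1 = _mm_load_ps(in1 + i + 4);
        __m128 s0 = _mm_add_ps(a0, b0);
        __m128 b1 = _mm_load_ps(in2 + i + 4);
        __m128 s1 = _mm_add_ps(a1, b1);

        /* Aligned stores of both results. */
        _mm_store_ps(out + i, s0);
        _mm_store_ps(out + i + 4, s1);
    }
}
```

With the buffers from the quoted program this would be called as
add_sse(I1, I2, OO, 4 * LEN), since posix_memalign already gives 16-byte
alignment there. On 32-bit x86 the intrinsics need SSE enabled at compile
time (for example with -msse); whether the ordering in the source survives
the compiler's own scheduling is exactly what would need checking in the
generated assembly.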
Can anyone confirm whether gcc does sub-optimal instruction scheduling
for vector types?
Brian
On Thu, 24 Feb 2005 11:19:04 +1100, Richard Beare
<Richard.Beare@csiro.au> wrote:
> Hi Everyone,
> This is probably a common query, but I haven't managed to find any hints
> about what I'm doing wrong.
>
> I'm trying to use the SIMD extensions to accelerate array arithmetic. My
> test code is below. I'm running gcc-3.3.3 on a pentium 4 3GHz, running
> Fedora Core 2.
>
> My problem is that the SIMD code seems to be running slower than the
> optimized standard code. In fact if I turn on the optimization and cpu
> flag then I get a huge slowdown.
> I can confirm with objdump that faddp instructions are being generated
> at least some of the time.
>
> I've experimented with a few different compilers (only stable versions)
> but not achieved any consistent speed up.
>
> I'd have thought that this was the simplest example to accelerate.
>
> Am I doing something obviously wrong at the C level? Is there a particular
> compiler version that is known to do this sort of thing well?
>
> I would appreciate any advice.
>
> Here is the log of some test runs:
> ============================================================
> Standardized arithmetic
>
> 19.41user 0.01system 0:19.44elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+60minor)pagefaults 0swaps
>
> Standard with optimization
>
> cc -DDONORMAL -O2 -c -o vectrials.o vectrials.c
> cc -static vectrials.o -o vectrials
>
> Standardized arithmetic
>
> 5.48user 0.00system 0:05.49elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+60minor)pagefaults 0swaps
> ----------------------------
> Vectorized without optimization
>
> cc -DDOVEC -mcpu=pentium4 -c -o vectrials.o vectrials.c
> cc -static vectrials.o -o vectrials
>
> Vectorized arithmetic
>
> 9.02user 0.00system 0:09.03elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+60minor)pagefaults 0swaps
>
> Vectorized with optimization
>
> cc -DDOVEC -O2 -mcpu=pentium4 -c -o vectrials.o vectrials.c
> cc -static vectrials.o -o vectrials
>
> Vectorized arithmetic
>
> 35.89user 0.03system 0:36.17elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
> 0inputs+0outputs (0major+58minor)pagefaults 0swaps
> ============================================================
> And here is the code
>
> #define _XOPEN_SOURCE 600
> #include <errno.h>
> #include <stdlib.h>
> #include <stdio.h>
>
> #define LEN 1000
>
> #define THISTYPE float
>
> /* typedef v8qi myvec; */
> typedef int myvec __attribute__ ((mode(V4SF)));
>
> #define myvecSize (sizeof(myvec)/sizeof(THISTYPE))
>
> /**********************************************/
>
> void * myalloc(size_t size)
> {
>     /* alignment should be on 16-byte boundaries! */
>     const size_t align=2*sizeof(double);
>     void *res=NULL;
>     int status;
>
>     status = posix_memalign(&res, align, size);
>     switch (status) {
>     case EINVAL:
>         fprintf(stderr, "Alignment parameter no good\n");
>         return NULL;
>     case ENOMEM:
>         fprintf(stderr, "Insufficient memory\n");
>         return NULL;
>     default:
>         return res;
>     }
> }
>
> /**********************************************/
>
> void f1(myvec *in1, myvec *in2, myvec *out, int len)
> {
>     int i;
>     /* fprintf(stderr, "Vectorised length =%d\n", len); */
>
>     for (i=0;i<len;i++) {
>         out[i] = in1[i] + in2[i];
>     }
> }
>
> /**********************************************/
>
> void f2(THISTYPE *in1, THISTYPE *in2, THISTYPE *out, int len)
> {
>     int i;
>     /* fprintf(stderr, "Standard length =%d\n", len); */
>     for (i=0;i<len;i++) {
>         out[i] = in1[i] + in2[i];
>     }
> }
>
> /**********************************************/
> void init(THISTYPE *I1, THISTYPE *I2, int len)
> {
>     int i;
>
>     for (i=0;i<len;i++) {
>         I1[i] = 34.0;
>         I2[i] = 354.0;
>     }
> }
>
> void check(THISTYPE *OO, int len)
> {
>     fprintf(stderr, "First=%f, Last=%f\n", OO[0], OO[len-1]);
> }
>
> #define TESTS 1000000
>
> int main()
> {
>     myvec *input1, *input2, *output;
>     THISTYPE *I1, *I2, *OO;
>     int tt;
>
>     /* fprintf(stderr, "(%d, %d, %d)\n", sizeof(double), sizeof(void *),
>        sizeof(myvec)); */
>
>     input1 = (myvec *)myalloc(LEN * sizeof(myvec));
>     input2 = (myvec *)myalloc(LEN * sizeof(myvec));
>     output = (myvec *)myalloc(LEN * sizeof(myvec));
>
>     I1 = (THISTYPE *)input1;
>     I2 = (THISTYPE *)input2;
>     OO = (THISTYPE *)output;
>
>     init(I1, I2, LEN*sizeof(myvec)/sizeof(THISTYPE));
>
> #ifdef DOVEC
>     /* the vectorized one */
>     fprintf(stderr, "Vectorized arithmetic\n");
>     for (tt=0;tt<TESTS;tt++) {
>         f1(input1, input2, output, LEN);
>     }
> #endif
>
> #ifdef DONORMAL
>     fprintf(stderr, "Standardized arithmetic\n");
>     for (tt=0;tt<TESTS;tt++) {
>         f2(I1, I2, OO, LEN * sizeof(myvec)/sizeof(THISTYPE));
>     }
> #endif
>     check(OO, LEN * sizeof(myvec)/sizeof(THISTYPE));
>     return 0;
> }
>
> --
> Richard Beare, CSIRO Mathematical & Information Sciences
> Locked Bag 17, North Ryde, NSW 1670, Australia
> Phone: +61-2-93253221 (GMT+~10hrs) Fax: +61-2-93253200
>
> Richard.Beare@csiro.au
>