This is the mail archive of the gcc-help@gcc.gnu.org mailing list for the GCC project.

Re: Why are vector instructions slower than loops?

From: "Rohit Garg" <rpg dot 314 at gmail dot com>
To: fniles at gnupooh dot org
Cc: gcc-help at gcc dot gnu dot org
Date: Sat, 27 Sep 2008 17:17:33 +0200
Subject: Re: Why are vector instructions slower than loops?
References: <12934.128.29.43.1.1222285358.squirrel@www.gnupooh.org>

It appears that you have a Conroe-based Core 2 Duo rather than a
Penryn-based one. Pre-Penryn CPUs don't have a packed 32-bit integer
multiply instruction (pmulld arrived with SSE4.1); Penryn and later
Intel CPUs do. On older CPUs that multiplication has to be emulated,
for example by moving the elements into scalar registers, multiplying
them there and moving the results back, which is why it is slow. If you
do a floating-point multiply instead, I am sure it will come out
faster. (You may not need it though; I don't know about your app.)
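
To make the missing instruction concrete: the multiply that SSE2 lacks
is the lane-wise 32-bit one. Below is a minimal sketch, assuming an x86
target with SSE2 only, of one common way to emulate it with pmuludq
(_mm_mul_epu32) and shuffles instead of a trip through scalar
registers. The names mul32_sse2 and mul4_i32 are made up for
illustration; this is not necessarily what GCC emits for a v4si
multiply.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: low 32 bits of each lane-wise product, SSE2 only. */
static inline __m128i mul32_sse2(__m128i a, __m128i b)
{
    /* pmuludq multiplies lanes 0 and 2, giving two 64-bit products. */
    __m128i even = _mm_mul_epu32(a, b);
    /* Shift lanes 1 and 3 down into positions 0 and 2, multiply again. */
    __m128i odd = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4));
    /* Gather the low 32 bits of each 64-bit product into lanes 0 and 1. */
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    /* Interleave: p0, p1, p2, p3. */
    return _mm_unpacklo_epi32(even, odd);
}

/* Hypothetical wrapper: out[i] = a[i] * b[i] for exactly four ints. */
static void mul4_i32(const int32_t *a, const int32_t *b, int32_t *out)
{
    assert(a && b && out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, mul32_sse2(va, vb));
}
```

The low 32 bits of an unsigned product equal those of the signed
product, so the same routine works for signed ints. With -msse4.1 the
whole helper collapses to a single _mm_mullo_epi32 call.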

(Use CPU-Z to find out which CPU generation yours belongs to.)

BTW, don't declare such giant vectors (2 KB wide is too much). It is
possible that GCC choked on it and generated scalar code with some
overhead. Use 4-wide vectors and loop over them. GCC will then (I'm
pretty sure about this) generate vector math instructions.
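
A minimal sketch of the "4-wide vectors, then loop" suggestion, using
GCC's vector extension. The type name v4sf and the function mul_v4 are
illustrative assumptions, as is the choice of float elements; the
original poster's code is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Four floats per vector: 16 bytes, the width of one SSE register. */
typedef float v4sf __attribute__((vector_size(16)));

/* out[i] = a[i] * b[i]; n must be a multiple of 4 (assumed for brevity). */
static void mul_v4(const float *a, const float *b, float *out, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        v4sf va, vb, vr;
        memcpy(&va, a + i, sizeof va);   /* load without alignment assumptions */
        memcpy(&vb, b + i, sizeof vb);
        vr = va * vb;                    /* element-wise multiply on the vector type */
        memcpy(out + i, &vr, sizeof vr); /* store the four results */
    }
}
```

The memcpy calls are there only to avoid assuming 16-byte alignment;
with optimization enabled GCC normally turns them into plain vector
loads and stores.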

And yes, at -O3 GCC does auto-vectorization as well. The examples you
posted fall into the class of loops it handles (which GCC does without
bothering you). My hunch is that the loop version is vectorized, but
your GCC vector version isn't.
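
For comparison, a sketch of the kind of plain loop the auto-vectorizer
is usually able to handle at -O3. The function name mul_loop and the
use of restrict are assumptions for illustration, not the original
poster's code. GCC releases from around the time of this message could
report vectorization decisions with -ftree-vectorizer-verbose=N; later
releases use -fopt-info-vec instead.

```c
#include <assert.h>
#include <stddef.h>

/* restrict promises the arrays do not overlap, which removes one common
 * reason for the vectorizer to give up or add runtime alias checks. */
static void mul_loop(const float *restrict a, const float *restrict b,
                     float *restrict out, size_t n)
{
    assert(a && b && out);
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * b[i];   /* independent iterations, unit stride */
}
```

Whether a given loop was actually vectorized is best confirmed from the
compiler's report or the generated assembly (gcc -O3 -S) rather than
assumed.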

GCC vector extensions are nice, but they don't suit my taste (even
when paired with a union). Intrinsics give you more operations than
the vector extensions expose, and those extra intrinsics often help.
They have certainly helped me.

>  * Why are vectors so much slower than plain old loops?  Shouldn't
>  * they be faster?  Do I have to actually call the built-in MMX and
>  * SSE instructions myself?  Shouldn't the compiler be able to do this
>  * given this much information?

No, you don't. Look up Intel's SSE reference guide: it lists C
functions (called intrinsics) for using those instructions, so you
don't have to mess with asm if you don't want to. These functions
mostly map to one CPU instruction per call. I felt the same way as you
do now, so I went ahead and wrote an inlined wrapper library so that I
don't have to bother with arcane intrinsic names.
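
A rough sketch of what such a wrapper might look like, assuming an x86
target with SSE. The names vec4f, vec4f_load, vec4f_store, vec4f_mul
and mul_sse are invented for illustration and are not taken from the
wrapper library mentioned above; the _mm_* calls are the standard
intrinsics from <xmmintrin.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

typedef __m128 vec4f;   /* hypothetical friendlier name for four packed floats */

/* Thin inline wrappers; each maps to a single intrinsic. */
static inline vec4f vec4f_load(const float *p)       { return _mm_loadu_ps(p); }
static inline void  vec4f_store(float *p, vec4f v)   { _mm_storeu_ps(p, v); }
static inline vec4f vec4f_mul(vec4f a, vec4f b)      { return _mm_mul_ps(a, b); }

/* out[i] = a[i] * b[i] for any n: vector body plus scalar tail. */
static void mul_sse(const float *a, const float *b, float *out, size_t n)
{
    assert(a && b && out);
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        vec4f_store(out + i, vec4f_mul(vec4f_load(a + i), vec4f_load(b + i)));
    for (; i < n; ++i)          /* leftover elements when n is not a multiple of 4 */
        out[i] = a[i] * b[i];
}
```

The unaligned load and store variants are used so the sketch works on
any float array; with data known to be 16-byte aligned, _mm_load_ps and
_mm_store_ps could be used instead.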

Intel's compilers are very good when it comes to automagic
vectorization. Try them if you don't like my solution (I am using it
in production code, btw). Your example will sure as hell be
auto-vectorized by them.

HTH

-- 
Rohit Garg

http://rpg-314.blogspot.com/

Senior Undergraduate
Department of Physics
Indian Institute of Technology
Bombay

References:
- Why are vector instructions slower than loops?
  - From: fniles
