This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Better _MM_TRANSPOSE4_PS
- From: Evan Cheng <evan dot cheng at apple dot com>
- To: gcc-patches at gcc dot gnu dot org
- Date: Thu, 6 Oct 2005 15:45:10 -0700
- Subject: Better _MM_TRANSPOSE4_PS
Hi,
We would like to contribute a faster _MM_TRANSPOSE4_PS macro to
config/i386/xmmintrin.h.
This version uses unpacks and high/low moves instead of shuffles. It is
16% faster than the old version on the current generation of Pentium 4
processors.
Thanks,
Evan Cheng
Apple Computers, Inc.
Index: config/i386/xmmintrin.h
===================================================================
RCS file: /cvs/gcc/gcc/gcc/config/i386/xmmintrin.h,v
retrieving revision 1.33.6.3
diff -r1.33.6.3 xmmintrin.h
1200,1201c1200,1201
< #define _MM_TRANSPOSE4_PS(row0, row1, row2, row3) \
< do { \
---
> #define _MM_TRANSPOSE4_PS(row0, row1, row2, row3) \
> do { \
1203,1210c1203,1210
< __v4sf __t0 = __builtin_ia32_shufps (__r0, __r1, 0x44); \
< __v4sf __t2 = __builtin_ia32_shufps (__r0, __r1, 0xEE); \
< __v4sf __t1 = __builtin_ia32_shufps (__r2, __r3, 0x44); \
< __v4sf __t3 = __builtin_ia32_shufps (__r2, __r3, 0xEE); \
< (row0) = __builtin_ia32_shufps (__t0, __t1, 0x88); \
< (row1) = __builtin_ia32_shufps (__t0, __t1, 0xDD); \
< (row2) = __builtin_ia32_shufps (__t2, __t3, 0x88); \
< (row3) = __builtin_ia32_shufps (__t2, __t3, 0xDD); \
---
> __v4sf __t0 = __builtin_ia32_unpcklps (__r0, __r1); \
> __v4sf __t1 = __builtin_ia32_unpcklps (__r2, __r3); \
> __v4sf __t2 = __builtin_ia32_unpckhps (__r0, __r1); \
> __v4sf __t3 = __builtin_ia32_unpckhps (__r2, __r3); \
> (row0) = __builtin_ia32_movlhps (__t0, __t1); \
> (row1) = __builtin_ia32_movhlps (__t1, __t0); \
> (row2) = __builtin_ia32_movlhps (__t2, __t3); \
> (row3) = __builtin_ia32_movhlps (__t3, __t2); \
1212a1213
>
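
For readers who want to trace the lane movement in the new sequence, here
is a rough scalar sketch in portable C. It is not part of the patch. The
vec4 type and the helper functions are hypothetical stand-ins that only
model the lane semantics of the unpcklps, unpckhps, movlhps and movhlps
instructions; they do not call the GCC builtins, so the sketch says nothing
about performance.

```c
/* Scalar stand-in for one 4-float SSE register (hypothetical type). */
typedef struct { float v[4]; } vec4;

/* Models unpcklps: interleave the low halves -> a0 b0 a1 b1. */
static vec4 model_unpcklps(vec4 a, vec4 b)
{
    vec4 r = {{ a.v[0], b.v[0], a.v[1], b.v[1] }};
    return r;
}

/* Models unpckhps: interleave the high halves -> a2 b2 a3 b3. */
static vec4 model_unpckhps(vec4 a, vec4 b)
{
    vec4 r = {{ a.v[2], b.v[2], a.v[3], b.v[3] }};
    return r;
}

/* Models movlhps: low half of a, then low half of b -> a0 a1 b0 b1. */
static vec4 model_movlhps(vec4 a, vec4 b)
{
    vec4 r = {{ a.v[0], a.v[1], b.v[0], b.v[1] }};
    return r;
}

/* Models movhlps: high half of b, then high half of a -> b2 b3 a2 a3. */
static vec4 model_movhlps(vec4 a, vec4 b)
{
    vec4 r = {{ b.v[2], b.v[3], a.v[2], a.v[3] }};
    return r;
}

/* Same operation order and operand order as the patched macro. */
void transpose4_model(vec4 *row0, vec4 *row1, vec4 *row2, vec4 *row3)
{
    vec4 r0 = *row0, r1 = *row1, r2 = *row2, r3 = *row3;
    vec4 t0 = model_unpcklps(r0, r1);   /* a0 b0 a1 b1 */
    vec4 t1 = model_unpcklps(r2, r3);   /* c0 d0 c1 d1 */
    vec4 t2 = model_unpckhps(r0, r1);   /* a2 b2 a3 b3 */
    vec4 t3 = model_unpckhps(r2, r3);   /* c2 d2 c3 d3 */
    *row0 = model_movlhps(t0, t1);      /* a0 b0 c0 d0 */
    *row1 = model_movhlps(t1, t0);      /* a1 b1 c1 d1 */
    *row2 = model_movlhps(t2, t3);      /* a2 b2 c2 d2 */
    *row3 = model_movhlps(t3, t2);      /* a3 b3 c3 d3 */
}

/* Returns 1 if the model agrees with an element-by-element transpose. */
int transpose4_model_ok(void)
{
    vec4 m[4];
    int i, j;

    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            m[i].v[j] = (float)(4 * i + j);

    transpose4_model(&m[0], &m[1], &m[2], &m[3]);

    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            if (m[i].v[j] != (float)(4 * j + i))
                return 0;
    return 1;
}
```

Note the swapped operand order in the two movhlps steps: the high half of
the second operand lands in the low half of the result, so row1 and row3
take (__t1, __t0) and (__t3, __t2) rather than the order used for movlhps.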