This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Better _MM_TRANSPOSE4_PS


Hi,

We would like to contribute a faster _MM_TRANSPOSE4_PS macro for config/i386/xmmintrin.h.

This version replaces the eight shufps instructions with unpacks (unpcklps / unpckhps) followed by high / low moves (movlhps / movhlps). It is 16% faster than the old version on the current generation of Pentium 4 processors.
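
For illustration only, the data flow of the unpack / move sequence can be sketched with the public SSE intrinsics from <xmmintrin.h> rather than the __builtin_ia32_* builtins used in the patch. The helper name transpose4x4 and the row-major float[16] layout are assumptions made for this sketch; they are not part of the patch. Rows are labelled a, b, c, d in the comments.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics; needs an x86 target with SSE enabled */

/* Hypothetical helper (not part of the patch): transposes a row-major
   4x4 float matrix in place using unpacks followed by high / low moves. */
static void transpose4x4(float m[16])
{
    assert(m != 0);

    /* Unaligned loads, so the caller's buffer needs no special alignment. */
    __m128 r0 = _mm_loadu_ps(m + 0);    /* a0 a1 a2 a3 */
    __m128 r1 = _mm_loadu_ps(m + 4);    /* b0 b1 b2 b3 */
    __m128 r2 = _mm_loadu_ps(m + 8);    /* c0 c1 c2 c3 */
    __m128 r3 = _mm_loadu_ps(m + 12);   /* d0 d1 d2 d3 */

    /* Step 1: interleave pairs of rows (unpcklps / unpckhps). */
    __m128 t0 = _mm_unpacklo_ps(r0, r1);  /* a0 b0 a1 b1 */
    __m128 t1 = _mm_unpacklo_ps(r2, r3);  /* c0 d0 c1 d1 */
    __m128 t2 = _mm_unpackhi_ps(r0, r1);  /* a2 b2 a3 b3 */
    __m128 t3 = _mm_unpackhi_ps(r2, r3);  /* c2 d2 c3 d3 */

    /* Step 2: combine 64-bit halves (movlhps / movhlps).
       _mm_movehl_ps(x, y) yields { y2, y3, x2, x3 }, hence the swapped
       argument order compared with _mm_movelh_ps. */
    _mm_storeu_ps(m + 0,  _mm_movelh_ps(t0, t1));  /* a0 b0 c0 d0 */
    _mm_storeu_ps(m + 4,  _mm_movehl_ps(t1, t0));  /* a1 b1 c1 d1 */
    _mm_storeu_ps(m + 8,  _mm_movelh_ps(t2, t3));  /* a2 b2 c2 d2 */
    _mm_storeu_ps(m + 12, _mm_movehl_ps(t3, t2));  /* a3 b3 c3 d3 */
}
```

The swapped operands in the movhlps steps mirror the (__t1, __t0) and (__t3, __t2) argument order in the patch below: the high half of the result comes from the first operand and the low half from the high half of the second.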

Thanks,

Evan Cheng
Apple Computers, Inc.


Index: config/i386/xmmintrin.h
===================================================================
RCS file: /cvs/gcc/gcc/gcc/config/i386/xmmintrin.h,v
retrieving revision 1.33.6.3
diff -r1.33.6.3 xmmintrin.h
1200,1201c1200,1201
< #define _MM_TRANSPOSE4_PS(row0, row1, row2, row3) \
< do { \
---
> #define _MM_TRANSPOSE4_PS(row0, row1, row2, row3) \
> do { \
1203,1210c1203,1210
< __v4sf __t0 = __builtin_ia32_shufps (__r0, __r1, 0x44); \
< __v4sf __t2 = __builtin_ia32_shufps (__r0, __r1, 0xEE); \
< __v4sf __t1 = __builtin_ia32_shufps (__r2, __r3, 0x44); \
< __v4sf __t3 = __builtin_ia32_shufps (__r2, __r3, 0xEE); \
< (row0) = __builtin_ia32_shufps (__t0, __t1, 0x88); \
< (row1) = __builtin_ia32_shufps (__t0, __t1, 0xDD); \
< (row2) = __builtin_ia32_shufps (__t2, __t3, 0x88); \
< (row3) = __builtin_ia32_shufps (__t2, __t3, 0xDD); \
---
> __v4sf __t0 = __builtin_ia32_unpcklps (__r0, __r1); \
> __v4sf __t1 = __builtin_ia32_unpcklps (__r2, __r3); \
> __v4sf __t2 = __builtin_ia32_unpckhps (__r0, __r1); \
> __v4sf __t3 = __builtin_ia32_unpckhps (__r2, __r3); \
> (row0) = __builtin_ia32_movlhps (__t0, __t1); \
> (row1) = __builtin_ia32_movhlps (__t1, __t0); \
> (row2) = __builtin_ia32_movlhps (__t2, __t3); \
> (row3) = __builtin_ia32_movhlps (__t3, __t2); \
1212a1213
>



