The following testcase when compiled with 'g++ -O3 -msse3 -o testcase testcase.C' finishes in 0m0.277s if compiled with gcc-3.4.4 and in 0m44.843s if compiled with gcc-4.0.2 (similar on all 4.x.x). Yuri ----------------------------------------------------------------------- typedef float __v2df __attribute__ ((__vector_size__ (16))); typedef __v2df __m128; static __inline __m128 _mm_sub_pd (__m128 __A, __m128 __B) { return (__m128)__builtin_ia32_subps ((__v2df)__A, (__v2df)__B); } static __inline __m128 _mm_add_pd (__m128 __A, __m128 __B) { return (__m128)__builtin_ia32_addps ((__v2df)__A, (__v2df)__B); } static __inline __m128 _mm_setr_ps (float __Z, float __Y, float __X, float __W) { return __extension__ (__m128)(__v2df){ __Z, __Y, __X, __W }; } struct FF { __m128 d; __inline FF() { } __inline FF(__m128 new_d) : d(new_d) { } __inline FF(float f) : d(_mm_setr_ps(f, f, f, f)) { } __inline FF operator+(FF other) { return (FF(_mm_add_pd(d,other.d))); } __inline FF operator-(FF other) { return (FF(_mm_sub_pd(d,other.d))); } }; float f[1024*1024]; int main() { int i; for (i = 0; i < 1024*1024; i++) { f[i] = 1.f/(1024*1024 + 10 - i); } FF total(0.f); for (int rpt = 0; rpt < 1000; rpt++) { FF p1(0.f), p2(0.), c; __m128 *pf = (__m128*)f; for (i = 0; i < 1024*1024/4; i++) { FF c(*pf++); total = total + c - p2 + p1; p1 = p2; p2 = c; } } }
actually it's the defect in this case: result is not used. But runtimes are very different in any case. 44.9s on 4.x.x vs. 0m2.371s on 3.4.4 --- begin corrected testcase ----------------------------------- #include <iostream> using namespace std; typedef float __v2df __attribute__ ((__vector_size__ (16))); typedef __v2df __m128; static __inline __m128 _mm_sub_pd (__m128 __A, __m128 __B) { return (__m128)__builtin_ia32_subps ((__v2df)__A, (__v2df)__B); } static __inline __m128 _mm_add_pd (__m128 __A, __m128 __B) { return (__m128)__builtin_ia32_addps ((__v2df)__A, (__v2df)__B); } static __inline __m128 _mm_setr_ps (float __Z, float __Y, float __X, float __W) { return __extension__ (__m128)(__v2df){ __Z, __Y, __X, __W }; } struct FF { __m128 d; __inline FF() { } __inline FF(__m128 new_d) : d(new_d) { } __inline FF(float f) : d(_mm_setr_ps(f, f, f, f)) { } __inline FF operator+(FF other) { return (FF(_mm_add_pd(d,other.d))); } __inline FF operator-(FF other) { return (FF(_mm_sub_pd(d,other.d))); } }; float f[1024*1024]; union U { __m128 m; float f[4]; }; int main() { int i; FF gtotal(0.f); for (i = 0; i < 1024*1024; i++) { f[i] = 1.f/(1024*1024 + 10 - i); } FF total(0.f); for (int rpt = 0; rpt < 1000; rpt++) { FF p1(0.f), p2(0.), c; __m128 *pf = (__m128*)f; for (i = 0; i < 1024*1024/4; i++) { FF c(*pf++); total = total + c - p2 + p1; p1 = p2; p2 = c; } gtotal = gtotal + total; } U u; u.m = gtotal.d; cout << (u.f[0]) << endl; } --- end corrected testcase -------------------------------------
I cannot reproduce this on an Athlon 64 running in either 32 or 64 bit mode. Everything I tried shows that 4.x is actually faster than 3.4.4.
Subject: Re: REGREGRESSION: SSE2 vectorized code is many times slower on 4.x.x than on 3.4.4 I run on Athlon64-3200 in i386 compatible mode. Strange. I had he problem with gcc-4.0.1, yesterday I compiled gcc-4.0.2 and same thing. Yuri pinskia at gcc dot gnu dot org wrote: >------- Comment #2 from pinskia at gcc dot gnu dot org 2005-12-20 05:55 ------- >I cannot reproduce this on an Athlon 64 running in either 32 or 64 bit mode. > >Everything I tried shows that 4.x is actually faster than 3.4.4. > > > >
Subject: Re: REGREGRESSION: SSE2 vectorized code is many times slower on 4.x.x than on 3.4.4 Also I use FreeBSD-6.0 if this even can make a difference. pinskia at gcc dot gnu dot org wrote: >------- Comment #2 from pinskia at gcc dot gnu dot org 2005-12-20 05:55 ------- >I cannot reproduce this on an Athlon 64 running in either 32 or 64 bit mode. > >Everything I tried shows that 4.x is actually faster than 3.4.4. > > > >
Subject: Re: REGREGRESSION: SSE2 vectorized code is many times slower on 4.x.x than on 3.4.4 Here's attachment with asms generated in both cases. testcase-old.s is 4.3.3 and testcase-new.s is 4.0.2 In testcase-new.s SSE2 code is kinda diluted with i386 assembly, notably 'rep movsl' which never occurs in 3.4.4 output. Yuri
Created attachment 10534 [details] disasm.tar.bz2
I don't get: rep movsl At all on GNU/Linux, doing a cross compiler to FreeBSD right now.
Can you show what the output of "gcc -v" for the 3.4 compiler and the 4.0 compiler? This looks like just a different using arch by default. a Compiler compiled for i686 by default gives the good code but code compiled for i386 give bad code.
Subject: Re: REGREGRESSION: SSE2 vectorized code is many times slower on 4.x.x than on 3.4.4 ----------------------------------- Using built-in specs. Configured with: FreeBSD/i386 system compiler Thread model: posix gcc version 3.4.4 [FreeBSD] 20050518 ----------------------------------- g++ -v (4.0.2) Using built-in specs. Target: i386-unknown-freebsd6.0 Configured with: ../gcc-4.0.2/configure --prefix=/usr/local/gcc-4.0.2 Thread model: posix gcc version 4.0.2 pinskia at gcc dot gnu dot org wrote: >------- Comment #8 from pinskia at gcc dot gnu dot org 2005-12-20 06:36 ------- >Can you show what the output of "gcc -v" for the 3.4 compiler and the 4.0 >compiler? > >This looks like just a different using arch by default. > >a Compiler compiled for i686 by default gives the good code but code compiled >for i386 give bad code. > > > >
Oh, I looked a little more and yes it depends on the arch you are building for but only for 4.x. Since you are using SSE, you should add also -march=i686 or -march=k8 so that the code is also tuned for the processor you are using. Anyways the problem with i386 with 4.0 is really just PR 14295.
Subject: Re: REGREGRESSION: SSE2 vectorized code is many times slower on 4.x.x than on 3.4.4 Now this huge runtime difference disappeared but now 4.0.2-generated code is always ~> 20% slower. Many memory accesses where they are not needed at all and did not exist for 3.4.4. I tried -march=i686 and -march=k8, both are slower than 3.4.4. Do I also have to recompile gcc with some special options? Yuri pinskia at gcc dot gnu dot org wrote: >------- Comment #10 from pinskia at gcc dot gnu dot org 2005-12-20 06:55 ------- >Oh, I looked a little more and yes it depends on the arch you are building for >but only for 4.x. > >Since you are using SSE, you should add also -march=i686 or -march=k8 so that >the code is also tuned for the processor you are using. > >Anyways the problem with i386 with 4.0 is really just PR 14295. > > > >
Confirmed, it really only effects i386/i486 code (maybe i586 also but I did not try that). The only thing I can think is to change MOVE_COST for those subtargets or just have PR 14295 fixed.
Benchmarking -mtune=i386 tuned code on Athlon64 is simply a bad idea. Either you need to tune for your CPU (or at least some contemporary one like -mtune=pentium4 if you want to run quickly on a wider range of CPUs), or you should be benchmarking on real i386 hardware.
We're generating correct code, so I've marked this as P2, rather than P1.
This issue will not be resolved in GCC 4.1.0; retargeted at GCC 4.1.1.
Will not be fixed in 4.1.1; adjust target milestone to 4.1.2.
struct FF { __m128 d; ..... } Mine I have a patch for this I cannot believe I found this before. The patch has been tested a bit at least in the local tree I have been playing out with. SRA should use element based copy that struct because it is only one element.
One element, but with some additional complication because it is a vector.
This patchlet makes GCC use element-copy for struct FF: Index: expr.c =================================================================== --- expr.c (revision 115990) +++ expr.c (working copy) @@ -4763,7 +4763,7 @@ count_type_elements (tree type, bool all return 2; case VECTOR_TYPE: - return TYPE_VECTOR_SUBPARTS (type); + return TYPE_MODE (type) == BLKmode ? TYPE_VECTOR_SUBPARTS (type) : 1; case INTEGER_TYPE: case REAL_TYPE:
(In reply to comment #19) > This patchlet makes GCC use element-copy for struct FF: You have to be careful when editing count_type_elements so that the elements of a constructor that are not explict are zeroed.
I'll see if I can construct a case where my patch fails (actually a newer one)
patch withdrawn, I'll wait for pinskia's
I should be posting a patch for this next week.
Patch submitted: http://gcc.gnu.org/ml/gcc-patches/2006-11/msg01005.html
Subject: Bug 25500 Author: pinskia Date: Mon Nov 20 20:29:10 2006 New Revision: 119026 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=119026 Log: 2006-11-20 Andrew Pinski <andrew_pinski@playstation.sony.com> PR tree-opt/25500 * tree-sra.c (single_scalar_field_in_record_p): New function. (decide_block_copy): Use it. 2006-11-20 Andrew Pinski <andrew_pinski@playstation.sony.com> PR tree-opt/25500 * gcc.dg/tree-ssa/sra-4.c: New testcase. Added: trunk/gcc/testsuite/gcc.dg/tree-ssa/sra-4.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-sra.c
Fixed for 4.3.0 and above.