Bug 25500 - [4.0/4.1/4.2/4.3 Regression]: SSE2 vectorized code is slower on 4.x.x than previous
Summary: [4.0/4.1/4.2/4.3 Regression]: SSE2 vectorized code is slower on 4.x.x than pr...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.0.2
: P2 normal
Target Milestone: 4.3.0
Assignee: Andrew Pinski
URL: http://gcc.gnu.org/ml/gcc-patches/200...
Keywords: missed-optimization, patch, ssemmx
Depends on: 14295
Blocks:
  Show dependency treegraph
 
Reported: 2005-12-20 05:25 UTC by Yuri
Modified: 2006-11-20 20:29 UTC (History)
1 user (show)

See Also:
Host:
Target: i386-*-*
Build:
Known to work: 3.4.0
Known to fail: 4.0.0 4.1.0 4.2.0
Last reconfirmed: 2005-12-25 01:02:34


Attachments
disasm.tar.bz2 (2.16 KB, application/octet-stream)
2005-12-20 06:19 UTC, Yuri
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Yuri 2005-12-20 05:25:46 UTC
The following testcase when compiled with 'g++ -O3 -msse3 -o testcase testcase.C'
finishes in 0m0.277s if compiled with gcc-3.4.4 and in 0m44.843s if compiled with gcc-4.0.2 (similar on all 4.x.x).

Yuri


-----------------------------------------------------------------------

typedef float __v2df __attribute__ ((__vector_size__ (16)));
typedef __v2df __m128;

static __inline __m128 _mm_sub_pd (__m128 __A, __m128 __B) { return (__m128)__builtin_ia32_subps ((__v2df)__A, (__v2df)__B); }
static __inline __m128 _mm_add_pd (__m128 __A, __m128 __B) { return (__m128)__builtin_ia32_addps ((__v2df)__A, (__v2df)__B); }
static __inline __m128 _mm_setr_ps (float __Z, float __Y, float __X, float __W) { return __extension__ (__m128)(__v2df){ __Z, __Y, __X, __W }; }

struct FF {
  __m128 d;

  __inline FF() { }
  __inline FF(__m128 new_d) : d(new_d) { }
  __inline FF(float f) : d(_mm_setr_ps(f, f, f, f)) { }

  __inline FF operator+(FF other) { return (FF(_mm_add_pd(d,other.d))); }
  __inline FF operator-(FF other) { return (FF(_mm_sub_pd(d,other.d))); }
};

float f[1024*1024];

int main() {
  int i;

  for (i = 0; i < 1024*1024; i++) { f[i] = 1.f/(1024*1024 + 10 - i); }

  FF total(0.f);

  for (int rpt = 0; rpt < 1000; rpt++) {
  FF p1(0.f), p2(0.), c;

  __m128 *pf = (__m128*)f;
  for (i = 0; i < 1024*1024/4; i++) {
    FF c(*pf++);

    total = total + c - p2 + p1;

    p1 = p2;
    p2 = c;
  }
  }
}
Comment 1 Yuri 2005-12-20 05:34:47 UTC
actually it's the defect in this case: result is not used.
But runtimes are very different in any case.
44.9s on 4.x.x vs. 0m2.371s on 3.4.4

--- begin corrected testcase -----------------------------------
#include <iostream>
using namespace std;

typedef float __v2df __attribute__ ((__vector_size__ (16)));
typedef __v2df __m128;

static __inline __m128 _mm_sub_pd (__m128 __A, __m128 __B) { return (__m128)__builtin_ia32_subps ((__v2df)__A, (__v2df)__B); }
static __inline __m128 _mm_add_pd (__m128 __A, __m128 __B) { return (__m128)__builtin_ia32_addps ((__v2df)__A, (__v2df)__B); }
static __inline __m128 _mm_setr_ps (float __Z, float __Y, float __X, float __W) { return __extension__ (__m128)(__v2df){ __Z, __Y, __X, __W }; }

struct FF {
  __m128 d;

  __inline FF() { }
  __inline FF(__m128 new_d) : d(new_d) { }
  __inline FF(float f) : d(_mm_setr_ps(f, f, f, f)) { }

  __inline FF operator+(FF other) { return (FF(_mm_add_pd(d,other.d))); }
  __inline FF operator-(FF other) { return (FF(_mm_sub_pd(d,other.d))); }
};

float f[1024*1024];

    union U {
      __m128 m;
      float f[4];
    };

int main() {
  int i;

  FF gtotal(0.f);
  for (i = 0; i < 1024*1024; i++) { f[i] = 1.f/(1024*1024 + 10 - i); }

  FF total(0.f);

  for (int rpt = 0; rpt < 1000; rpt++) {
    FF p1(0.f), p2(0.), c;

    __m128 *pf = (__m128*)f;
    for (i = 0; i < 1024*1024/4; i++) {
      FF c(*pf++);

      total = total + c - p2 + p1;

      p1 = p2;
      p2 = c;
    }
    gtotal = gtotal + total;
  }

  U u;
  u.m = gtotal.d;

  cout << (u.f[0]) << endl;
}
--- end corrected testcase -------------------------------------
Comment 2 Andrew Pinski 2005-12-20 05:55:43 UTC
I cannot reproduce this on an Athlon 64 running in either 32 or 64 bit mode.

Everything I tried shows that 4.x is actually faster than 3.4.4.
Comment 3 Yuri 2005-12-20 06:01:54 UTC
Subject: Re:  REGREGRESSION: SSE2 vectorized code is many
 times slower on 4.x.x than on 3.4.4

I run on Athlon64-3200 in i386 compatible mode.
Strange.

I had he problem with gcc-4.0.1, yesterday I compiled gcc-4.0.2 and same 
thing.

Yuri


pinskia at gcc dot gnu dot org wrote:

>------- Comment #2 from pinskia at gcc dot gnu dot org  2005-12-20 05:55 -------
>I cannot reproduce this on an Athlon 64 running in either 32 or 64 bit mode.
>
>Everything I tried shows that 4.x is actually faster than 3.4.4.
>
>
>  
>

Comment 4 Yuri 2005-12-20 06:03:37 UTC
Subject: Re:  REGREGRESSION: SSE2 vectorized code is many
 times slower on 4.x.x than on 3.4.4

Also I use FreeBSD-6.0 if this even can make a difference.



pinskia at gcc dot gnu dot org wrote:

>------- Comment #2 from pinskia at gcc dot gnu dot org  2005-12-20 05:55 -------
>I cannot reproduce this on an Athlon 64 running in either 32 or 64 bit mode.
>
>Everything I tried shows that 4.x is actually faster than 3.4.4.
>
>
>  
>

Comment 5 Yuri 2005-12-20 06:19:32 UTC
Subject: Re:  REGREGRESSION: SSE2 vectorized code is many
 times slower on 4.x.x than on 3.4.4

Here's attachment with asms generated in both cases.

testcase-old.s is 4.3.3 and testcase-new.s is 4.0.2

In testcase-new.s SSE2 code is kinda diluted with i386 assembly, notably 
'rep movsl'
which never occurs in 3.4.4 output.

Yuri

Comment 6 Yuri 2005-12-20 06:19:32 UTC
Created attachment 10534 [details]
disasm.tar.bz2
Comment 7 Andrew Pinski 2005-12-20 06:33:16 UTC
I don't get:
        rep
        movsl

At all on GNU/Linux, doing a cross compiler to FreeBSD right now.
Comment 8 Andrew Pinski 2005-12-20 06:36:49 UTC
Can you show what the output of "gcc -v" for the 3.4 compiler and the 4.0 compiler?

This looks like just a different using arch by default.

a Compiler compiled for i686 by default gives the good code but code compiled for i386 give bad code.
Comment 9 Yuri 2005-12-20 06:51:45 UTC
Subject: Re:  REGREGRESSION: SSE2 vectorized code is many
 times slower on 4.x.x than on 3.4.4

-----------------------------------
Using built-in specs.
Configured with: FreeBSD/i386 system compiler
Thread model: posix
gcc version 3.4.4 [FreeBSD] 20050518
-----------------------------------
g++ -v (4.0.2)
Using built-in specs.
Target: i386-unknown-freebsd6.0
Configured with: ../gcc-4.0.2/configure --prefix=/usr/local/gcc-4.0.2
Thread model: posix
gcc version 4.0.2


pinskia at gcc dot gnu dot org wrote:

>------- Comment #8 from pinskia at gcc dot gnu dot org  2005-12-20 06:36 -------
>Can you show what the output of "gcc -v" for the 3.4 compiler and the 4.0
>compiler?
>
>This looks like just a different using arch by default.
>
>a Compiler compiled for i686 by default gives the good code but code compiled
>for i386 give bad code.
>
>
>  
>

Comment 10 Andrew Pinski 2005-12-20 06:55:21 UTC
Oh, I looked a little more and yes it depends on the arch you are building for but only for 4.x.

Since you are using SSE, you should add also -march=i686 or -march=k8 so that the code is also tuned for the processor you are using.  

Anyways the problem with i386 with 4.0 is really just PR 14295.
Comment 11 Yuri 2005-12-20 07:40:50 UTC
Subject: Re:  REGREGRESSION: SSE2 vectorized code is many
 times slower on 4.x.x than on 3.4.4

Now this huge runtime difference disappeared
but now 4.0.2-generated code is always ~> 20% slower.
Many memory accesses where they are not needed at all and did not exist 
for 3.4.4.

I tried -march=i686 and -march=k8, both are slower than 3.4.4.

Do I also have to recompile gcc with some special options?

Yuri


pinskia at gcc dot gnu dot org wrote:

>------- Comment #10 from pinskia at gcc dot gnu dot org  2005-12-20 06:55 -------
>Oh, I looked a little more and yes it depends on the arch you are building for
>but only for 4.x.
>
>Since you are using SSE, you should add also -march=i686 or -march=k8 so that
>the code is also tuned for the processor you are using.  
>
>Anyways the problem with i386 with 4.0 is really just PR 14295.
>
>
>  
>

Comment 12 Andrew Pinski 2005-12-25 01:02:34 UTC
Confirmed, it really only effects i386/i486 code (maybe i586 also but I did not try that).

The only thing I can think is to change MOVE_COST for those subtargets or just have PR 14295 fixed.
Comment 13 Jakub Jelinek 2005-12-28 16:53:26 UTC
Benchmarking -mtune=i386 tuned code on Athlon64 is simply a bad idea.
Either you need to tune for your CPU (or at least some contemporary one
like -mtune=pentium4 if you want to run quickly on a wider range of CPUs),
or you should be benchmarking on real i386 hardware.
Comment 14 Mark Mitchell 2006-01-15 22:13:38 UTC
We're generating correct code, so I've marked this as P2, rather than P1.
Comment 15 Mark Mitchell 2006-02-24 00:26:39 UTC
This issue will not be resolved in GCC 4.1.0; retargeted at GCC 4.1.1.
Comment 16 Mark Mitchell 2006-05-25 02:33:41 UTC
Will not be fixed in 4.1.1; adjust target milestone to 4.1.2.
Comment 17 Andrew Pinski 2006-07-05 09:50:35 UTC
struct FF {
  __m128 d;
.....
}

Mine I have a patch for this I cannot believe I found this before.  The patch has been tested a bit at least in the local tree I have been playing out with.
SRA should use element based copy that struct because it is only one element.
Comment 18 Paolo Bonzini 2006-08-07 07:54:53 UTC
One element, but with some additional complication because it is a vector.
Comment 19 Paolo Bonzini 2006-08-07 07:59:52 UTC
This patchlet makes GCC use element-copy for struct FF:

Index: expr.c
===================================================================
--- expr.c      (revision 115990)
+++ expr.c      (working copy)
@@ -4763,7 +4763,7 @@ count_type_elements (tree type, bool all
       return 2;
 
     case VECTOR_TYPE:
-      return TYPE_VECTOR_SUBPARTS (type);
+      return TYPE_MODE (type) == BLKmode ? TYPE_VECTOR_SUBPARTS (type) : 1;
 
     case INTEGER_TYPE:
     case REAL_TYPE:
Comment 20 Andrew Pinski 2006-08-07 15:35:48 UTC
(In reply to comment #19)
> This patchlet makes GCC use element-copy for struct FF:

You have to be careful when editing count_type_elements so that the elements of a constructor that are not explict are zeroed.
Comment 21 Paolo Bonzini 2006-08-17 08:16:15 UTC
I'll see if I can construct a case where my patch fails (actually a newer one)
Comment 22 Paolo Bonzini 2006-08-18 16:16:07 UTC
patch withdrawn, I'll wait for pinskia's
Comment 23 Andrew Pinski 2006-11-12 08:07:07 UTC
I should be posting a patch for this next week.
Comment 24 Andrew Pinski 2006-11-15 00:38:37 UTC
Patch submitted:
http://gcc.gnu.org/ml/gcc-patches/2006-11/msg01005.html
Comment 25 Andrew Pinski 2006-11-20 20:29:26 UTC
Subject: Bug 25500

Author: pinskia
Date: Mon Nov 20 20:29:10 2006
New Revision: 119026

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=119026
Log:
2006-11-20  Andrew Pinski  <andrew_pinski@playstation.sony.com>

        PR tree-opt/25500
        * tree-sra.c (single_scalar_field_in_record_p): New function.
        (decide_block_copy): Use it.

2006-11-20  Andrew Pinski  <andrew_pinski@playstation.sony.com>

        PR tree-opt/25500
        * gcc.dg/tree-ssa/sra-4.c: New testcase.



Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/sra-4.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-sra.c

Comment 26 Andrew Pinski 2006-11-20 20:29:50 UTC
Fixed for 4.3.0 and above.