25500 – [4.0/4.1/4.2/4.3 Regression]: SSE2 vectorized code is slower on 4.x.x than previous

Bug 25500 - [4.0/4.1/4.2/4.3 Regression]: SSE2 vectorized code is slower on 4.x.x than previous

Summary: [4.0/4.1/4.2/4.3 Regression]: SSE2 vectorized code is slower on 4.x.x than pr...

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	4.0.2

Importance:	P2 normal
Target Milestone:	4.3.0
Assignee:	Andrew Pinski

URL:	http://gcc.gnu.org/ml/gcc-patches/200...
Keywords:	missed-optimization, patch, ssemmx

Depends on:	14295
Blocks:
	Show dependency tree / graph

Reported:	2005-12-20 05:25 UTC by Yuri
Modified:	2006-11-20 20:29 UTC (History)
CC List:	1 user (show)

See Also:
Host:
Target:	i386--
Build:
Known to work:	3.4.0
Known to fail:	4.0.0 4.1.0 4.2.0
Last reconfirmed:	2005-12-25 01:02:34

Attachments
disasm.tar.bz2 (2.16 KB, application/octet-stream) 2005-12-20 06:19 UTC, Yuri	Details
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Yuri 2005-12-20 05:25:46 UTC

The following testcase when compiled with 'g++ -O3 -msse3 -o testcase testcase.C'
finishes in 0m0.277s if compiled with gcc-3.4.4 and in 0m44.843s if compiled with gcc-4.0.2 (similar on all 4.x.x).

Yuri


-----------------------------------------------------------------------

typedef float __v2df __attribute__ ((__vector_size__ (16)));
typedef __v2df __m128;

static __inline __m128 _mm_sub_pd (__m128 __A, __m128 __B) { return (__m128)__builtin_ia32_subps ((__v2df)__A, (__v2df)__B); }
static __inline __m128 _mm_add_pd (__m128 __A, __m128 __B) { return (__m128)__builtin_ia32_addps ((__v2df)__A, (__v2df)__B); }
static __inline __m128 _mm_setr_ps (float __Z, float __Y, float __X, float __W) { return __extension__ (__m128)(__v2df){ __Z, __Y, __X, __W }; }

struct FF {
  __m128 d;

  __inline FF() { }
  __inline FF(__m128 new_d) : d(new_d) { }
  __inline FF(float f) : d(_mm_setr_ps(f, f, f, f)) { }

  __inline FF operator+(FF other) { return (FF(_mm_add_pd(d,other.d))); }
  __inline FF operator-(FF other) { return (FF(_mm_sub_pd(d,other.d))); }
};

float f[1024*1024];

int main() {
  int i;

  for (i = 0; i < 1024*1024; i++) { f[i] = 1.f/(1024*1024 + 10 - i); }

  FF total(0.f);

  for (int rpt = 0; rpt < 1000; rpt++) {
  FF p1(0.f), p2(0.), c;

  __m128 *pf = (__m128*)f;
  for (i = 0; i < 1024*1024/4; i++) {
    FF c(*pf++);

    total = total + c - p2 + p1;

    p1 = p2;
    p2 = c;
  }
  }
}

Comment 1 Yuri 2005-12-20 05:34:47 UTC

actually it's the defect in this case: result is not used.
But runtimes are very different in any case.
44.9s on 4.x.x vs. 0m2.371s on 3.4.4

--- begin corrected testcase -----------------------------------
#include <iostream>
using namespace std;

typedef float __v2df __attribute__ ((__vector_size__ (16)));
typedef __v2df __m128;

static __inline __m128 _mm_sub_pd (__m128 __A, __m128 __B) { return (__m128)__builtin_ia32_subps ((__v2df)__A, (__v2df)__B); }
static __inline __m128 _mm_add_pd (__m128 __A, __m128 __B) { return (__m128)__builtin_ia32_addps ((__v2df)__A, (__v2df)__B); }
static __inline __m128 _mm_setr_ps (float __Z, float __Y, float __X, float __W) { return __extension__ (__m128)(__v2df){ __Z, __Y, __X, __W }; }

struct FF {
  __m128 d;

  __inline FF() { }
  __inline FF(__m128 new_d) : d(new_d) { }
  __inline FF(float f) : d(_mm_setr_ps(f, f, f, f)) { }

  __inline FF operator+(FF other) { return (FF(_mm_add_pd(d,other.d))); }
  __inline FF operator-(FF other) { return (FF(_mm_sub_pd(d,other.d))); }
};

float f[1024*1024];

    union U {
      __m128 m;
      float f[4];
    };

int main() {
  int i;

  FF gtotal(0.f);
  for (i = 0; i < 1024*1024; i++) { f[i] = 1.f/(1024*1024 + 10 - i); }

  FF total(0.f);

  for (int rpt = 0; rpt < 1000; rpt++) {
    FF p1(0.f), p2(0.), c;

    __m128 *pf = (__m128*)f;
    for (i = 0; i < 1024*1024/4; i++) {
      FF c(*pf++);

      total = total + c - p2 + p1;

      p1 = p2;
      p2 = c;
    }
    gtotal = gtotal + total;
  }

  U u;
  u.m = gtotal.d;

  cout << (u.f[0]) << endl;
}
--- end corrected testcase -------------------------------------

Comment 2 Andrew Pinski 2005-12-20 05:55:43 UTC

I cannot reproduce this on an Athlon 64 running in either 32 or 64 bit mode.

Everything I tried shows that 4.x is actually faster than 3.4.4.

Comment 3 Yuri 2005-12-20 06:01:54 UTC

Subject: Re:  REGREGRESSION: SSE2 vectorized code is many
 times slower on 4.x.x than on 3.4.4

I run on Athlon64-3200 in i386 compatible mode.
Strange.

I had he problem with gcc-4.0.1, yesterday I compiled gcc-4.0.2 and same 
thing.

Yuri


pinskia at gcc dot gnu dot org wrote:

>------- Comment #2 from pinskia at gcc dot gnu dot org  2005-12-20 05:55 -------
>I cannot reproduce this on an Athlon 64 running in either 32 or 64 bit mode.
>
>Everything I tried shows that 4.x is actually faster than 3.4.4.
>
>
>  
>

Comment 4 Yuri 2005-12-20 06:03:37 UTC

Subject: Re:  REGREGRESSION: SSE2 vectorized code is many
 times slower on 4.x.x than on 3.4.4

Also I use FreeBSD-6.0 if this even can make a difference.



pinskia at gcc dot gnu dot org wrote:

>------- Comment #2 from pinskia at gcc dot gnu dot org  2005-12-20 05:55 -------
>I cannot reproduce this on an Athlon 64 running in either 32 or 64 bit mode.
>
>Everything I tried shows that 4.x is actually faster than 3.4.4.
>
>
>  
>

Comment 5 Yuri 2005-12-20 06:19:32 UTC

Subject: Re:  REGREGRESSION: SSE2 vectorized code is many
 times slower on 4.x.x than on 3.4.4

Here's attachment with asms generated in both cases.

testcase-old.s is 4.3.3 and testcase-new.s is 4.0.2

In testcase-new.s SSE2 code is kinda diluted with i386 assembly, notably 
'rep movsl'
which never occurs in 3.4.4 output.

Yuri

Comment 6 Yuri 2005-12-20 06:19:32 UTC

Created attachment 10534 [details]
disasm.tar.bz2

Comment 7 Andrew Pinski 2005-12-20 06:33:16 UTC

I don't get:
        rep
        movsl

At all on GNU/Linux, doing a cross compiler to FreeBSD right now.

Comment 8 Andrew Pinski 2005-12-20 06:36:49 UTC

Can you show what the output of "gcc -v" for the 3.4 compiler and the 4.0 compiler?

This looks like just a different using arch by default.

a Compiler compiled for i686 by default gives the good code but code compiled for i386 give bad code.

Comment 9 Yuri 2005-12-20 06:51:45 UTC

Subject: Re:  REGREGRESSION: SSE2 vectorized code is many
 times slower on 4.x.x than on 3.4.4

-----------------------------------
Using built-in specs.
Configured with: FreeBSD/i386 system compiler
Thread model: posix
gcc version 3.4.4 [FreeBSD] 20050518
-----------------------------------
g++ -v (4.0.2)
Using built-in specs.
Target: i386-unknown-freebsd6.0
Configured with: ../gcc-4.0.2/configure --prefix=/usr/local/gcc-4.0.2
Thread model: posix
gcc version 4.0.2


pinskia at gcc dot gnu dot org wrote:

>------- Comment #8 from pinskia at gcc dot gnu dot org  2005-12-20 06:36 -------
>Can you show what the output of "gcc -v" for the 3.4 compiler and the 4.0
>compiler?
>
>This looks like just a different using arch by default.
>
>a Compiler compiled for i686 by default gives the good code but code compiled
>for i386 give bad code.
>
>
>  
>

Comment 10 Andrew Pinski 2005-12-20 06:55:21 UTC

Oh, I looked a little more and yes it depends on the arch you are building for but only for 4.x.

Since you are using SSE, you should add also -march=i686 or -march=k8 so that the code is also tuned for the processor you are using.  

Anyways the problem with i386 with 4.0 is really just PR 14295.

Comment 11 Yuri 2005-12-20 07:40:50 UTC

Subject: Re:  REGREGRESSION: SSE2 vectorized code is many
 times slower on 4.x.x than on 3.4.4

Now this huge runtime difference disappeared
but now 4.0.2-generated code is always ~> 20% slower.
Many memory accesses where they are not needed at all and did not exist 
for 3.4.4.

I tried -march=i686 and -march=k8, both are slower than 3.4.4.

Do I also have to recompile gcc with some special options?

Yuri

pinskia at gcc dot gnu dot org wrote:

>------- Comment #10 from pinskia at gcc dot gnu dot org  2005-12-20 06:55 -------
>Oh, I looked a little more and yes it depends on the arch you are building for
>but only for 4.x.
>
>Since you are using SSE, you should add also -march=i686 or -march=k8 so that
>the code is also tuned for the processor you are using.  
>
>Anyways the problem with i386 with 4.0 is really just PR 14295.
>
>
>  
>

Comment 12 Andrew Pinski 2005-12-25 01:02:34 UTC

Confirmed, it really only effects i386/i486 code (maybe i586 also but I did not try that).

The only thing I can think is to change MOVE_COST for those subtargets or just have PR 14295 fixed.

Comment 13 Jakub Jelinek 2005-12-28 16:53:26 UTC

Benchmarking -mtune=i386 tuned code on Athlon64 is simply a bad idea.
Either you need to tune for your CPU (or at least some contemporary one
like -mtune=pentium4 if you want to run quickly on a wider range of CPUs),
or you should be benchmarking on real i386 hardware.

Comment 14 Mark Mitchell 2006-01-15 22:13:38 UTC

We're generating correct code, so I've marked this as P2, rather than P1.

Comment 15 Mark Mitchell 2006-02-24 00:26:39 UTC

This issue will not be resolved in GCC 4.1.0; retargeted at GCC 4.1.1.

Comment 16 Mark Mitchell 2006-05-25 02:33:41 UTC

Will not be fixed in 4.1.1; adjust target milestone to 4.1.2.

Comment 17 Andrew Pinski 2006-07-05 09:50:35 UTC

struct FF {
  __m128 d;
.....
}

Mine I have a patch for this I cannot believe I found this before.  The patch has been tested a bit at least in the local tree I have been playing out with.
SRA should use element based copy that struct because it is only one element.

Comment 18 Paolo Bonzini 2006-08-07 07:54:53 UTC

One element, but with some additional complication because it is a vector.

Comment 19 Paolo Bonzini 2006-08-07 07:59:52 UTC

This patchlet makes GCC use element-copy for struct FF:

Index: expr.c
===================================================================
--- expr.c      (revision 115990)
+++ expr.c      (working copy)
@@ -4763,7 +4763,7 @@ count_type_elements (tree type, bool all
       return 2;
 
     case VECTOR_TYPE:
-      return TYPE_VECTOR_SUBPARTS (type);
+      return TYPE_MODE (type) == BLKmode ? TYPE_VECTOR_SUBPARTS (type) : 1;
 
     case INTEGER_TYPE:
     case REAL_TYPE:

Comment 20 Andrew Pinski 2006-08-07 15:35:48 UTC

(In reply to comment #19)
> This patchlet makes GCC use element-copy for struct FF:

You have to be careful when editing count_type_elements so that the elements of a constructor that are not explict are zeroed.

Comment 21 Paolo Bonzini 2006-08-17 08:16:15 UTC

I'll see if I can construct a case where my patch fails (actually a newer one)

Comment 22 Paolo Bonzini 2006-08-18 16:16:07 UTC

patch withdrawn, I'll wait for pinskia's

Comment 23 Andrew Pinski 2006-11-12 08:07:07 UTC

I should be posting a patch for this next week.

Comment 24 Andrew Pinski 2006-11-15 00:38:37 UTC

Patch submitted:
http://gcc.gnu.org/ml/gcc-patches/2006-11/msg01005.html

Comment 25 Andrew Pinski 2006-11-20 20:29:26 UTC

Subject: Bug 25500

Author: pinskia
Date: Mon Nov 20 20:29:10 2006
New Revision: 119026

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=119026
Log:
2006-11-20  Andrew Pinski  <andrew_pinski@playstation.sony.com>

        PR tree-opt/25500
        * tree-sra.c (single_scalar_field_in_record_p): New function.
        (decide_block_copy): Use it.

2006-11-20  Andrew Pinski  <andrew_pinski@playstation.sony.com>

        PR tree-opt/25500
        * gcc.dg/tree-ssa/sra-4.c: New testcase.



Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/sra-4.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-sra.c

Comment 26 Andrew Pinski 2006-11-20 20:29:50 UTC

Fixed for 4.3.0 and above.