Bug 34682

Summary: 70% slowdown with SSE enabled
Product: gcc Reporter: Matteo Croce <rootkit85>
Component: target Assignee: Uroš Bizjak <ubizjak>
Status: RESOLVED FIXED    
Severity: normal CC: gcc-bugs
Priority: P3 Keywords: ssemmx
Version: 4.2.3   
Target Milestone: 4.3.0   
URL: http://gcc.gnu.org/ml/gcc-patches/2008-01/msg00254.html
Host: Target:
Build: Known to work:
Known to fail: Last reconfirmed: 2008-01-07 14:02:46
Bug Depends on: 23322    
Bug Blocks:    
Attachments: the source
the source compiled with -mfpmath=387
the source compiled with -mfpmath=sse
minimal testcase
minimal testcase, compiled with -mfpmath=387
minimal testcase, compiled with -mfpmath=sse

Description Matteo Croce 2008-01-05 21:29:04 UTC
I have a piece of code that runs 70% slower with SSE enabled than with plain 387 on a dual-CPU Xeon system.
I'm not an optimization fanatic, but since -mfpmath=sse is enabled by default on amd64, this could cause huge performance losses when building amd64 binaries for this CPU.

The runlog is:

[aguy@enc1 ~]$ uname -a
FreeBSD enc1 6.2-RELEASE FreeBSD 6.2-RELEASE #0: Fri Jan 12 11:05:30 UTC 2007     root@dessler.cse.buffalo.edu:/usr/obj/usr/src/sys/SMP  
[aguy@enc1 ~]$ gcc42 -v
Using built-in specs.
Target: i386-portbld-freebsd6.2
Configured with: ./..//gcc-4.2-20071024/configure --disable-nls --with-system-zlib --with-libiconv-prefix=/usr/local --with-gmp=/usr/local --program-suffix=42 --libdir=/usr/local/lib/gcc-4.2.3 --with-gxx-include-dir=/usr/local/lib/gcc-4.2.3/include/c++/ --disable-rpath --prefix=/usr/local --mandir=/usr/local/man --infodir=/usr/local/info/gcc42 i386-portbld-freebsd6.2
Thread model: posix
gcc version 4.2.3 20071024 (prerelease)
[aguy@enc1 ~]$ gcc42 ssucks.c -O2 -march=prescott -o ssucks-387
[aguy@enc1 ~]$ gcc42 ssucks.c -O2 -march=prescott -o ssucks-sse -mfpmath=sse
[aguy@enc1 ~]$ ssucks-387 ; ssucks-sse

   FLOPS C Program (Double Precision), V2.0 18 Dec 1992

   Module     Error        RunTime      MFLOPS
                            (usec)
     1      4.0146e-13      0.0147    953.0052
     2     -1.4166e-13      0.0061   1149.6845

   FLOPS C Program (Double Precision), V2.0 18 Dec 1992

   Module     Error        RunTime      MFLOPS
                            (usec)
     1      4.0146e-13      0.0146    960.7945
     2     -1.4166e-13      0.0281    249.3171
[aguy@enc1 ~]$

1149.6845 vs 249.3171 MFLOPS: a ~78% slowdown just from enabling SSE
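For reference, the percentage can be reproduced from the two module 2 MFLOPS figures in the run log. This is only a small sketch of the arithmetic; the function names are made up for illustration and are not part of the benchmark.

```c
#include <stdio.h>

/* Fractional slowdown when throughput drops from `before` to `after`
   (both in MFLOPS): 1 - after / before.  Hypothetical helper. */
double slowdown_fraction(double before, double after)
{
    return 1.0 - after / before;
}

/* Print the slowdown as a percentage and return the fraction.
   Hypothetical helper. */
double report_slowdown(double before, double after)
{
    double f = slowdown_fraction(before, after);
    printf("slowdown: %.1f%%\n", 100.0 * f);
    return f;
}
```

Calling report_slowdown(1149.6845, 249.3171) from main gives a value a little above 78%, which matches the figure quoted here rather than the rounder 70% in the summary.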

I have source, assembled files and runlog online here: http://teknoraver.campuslife.it/software/gcc-sse/

Cheers,
Matteo Croce
Comment 1 Matteo Croce 2008-01-05 21:31:20 UTC
Created attachment 14882
the source
Comment 2 Matteo Croce 2008-01-05 21:31:56 UTC
Created attachment 14883
the source compiled with -mfpmath=387
Comment 3 Matteo Croce 2008-01-05 21:32:24 UTC
Created attachment 14884
the source compiled with -mfpmath=sse
Comment 4 Richard Biener 2008-01-06 12:18:38 UTC
Please narrow down the particular loop in your testcase that gets slower.  It
looks like the testsuite measures several things.
Comment 5 Uroš Bizjak 2008-01-07 12:19:54 UTC
Confirmed by following testcase:

--cut here--
#include <stdio.h>

void __attribute__((noinline))
dtime (void) 
{
  __asm__ __volatile__ ("" : : : "memory");
}

double sa, sb, sc, sd;
double one, two, four, five;
double piref, piprg, pierr;

int
main (int argc, char *argv[])
{
  double s, u, v, w, x;

  long i, m;

  piref = 3.14159265358979324;
  one = 1.0;
  two = 2.0;
  four = 4.0;
  five = 5.0;

  m = 512000000;

  dtime();

  s = -five;
  sa = -one;

  dtime();

  for (i = 1; i <= m; i++)
    {
      s = -s;
      sa = sa + s;
    }

  dtime();

  sc = (double) m;

  u = sa;
  v = 0.0;
  w = 0.0;
  x = 0.0;

  dtime();

  for (i = 1; i <= m; i++)
    {
      s = -s;
      sa = sa + s;
      u = u + two;
      x = x + (s - u);
      v = v - s * u;
      w = w + s / u;
    }

  dtime();

  m = (long) (sa * x / sc);
  sa = four * w / five;
  sb = sa + five / v;
  sc = 31.25;
  piprg = sb - sc / (v * v * v);
  pierr = piprg - piref;

  printf ("%13.4le\n", pierr);
  return 0;
}
--cut here--

.L5:
        xorb    $-128, -17(%ebp)        #, s
        addl    $1, %eax        #, i.65
        addsd   %xmm4, %xmm1    # two.16, u
        cmpl    $512000001, %eax        #, i.65
        movsd   -24(%ebp), %xmm0        # s, tmp90
        addsd   -24(%ebp), %xmm2        # s, sa_lsm.48
        mulsd   %xmm1, %xmm0    # u, tmp90
        subsd   %xmm0, %xmm3    # tmp90, v
        movsd   -24(%ebp), %xmm0        # s, tmp91
        divsd   %xmm1, %xmm0    # u, tmp91
        addsd   -16(%ebp), %xmm0        # w, tmp91
        movsd   %xmm0, -16(%ebp)        # tmp91, w
        jne     .L5     #,


It is tolerable that "s" and "w" are not kept in registers, since that follows from the lack of live range splitting (PR 23322). The main problem is that the sign of "s" is flipped in memory with a byte-sized (unaligned) xorb insn on the top byte of the double at -24(%ebp), and the full 8-byte value is then loaded again right away. The first (shorter) loop has the same problem:

.L4:
	xorb	$-128, -17(%ebp)	#, s
	addl	$1, %eax	#, i
	cmpl	$512000001, %eax	#, i
	addsd	-24(%ebp), %xmm0	# s, sa_lsm.97
	jne	.L4	#,


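The performance regression is caused by a partial memory stall [1]: the 1-byte store from xorb is followed immediately by an 8-byte load of the same location, so the stored byte cannot be forwarded to the wider load.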

[1] Agner Fog: How to optimize for the Pentium family of microprocessors, section 14.7
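
The two ways of flipping the sign that are contrasted above can be sketched in portable C. The function names are made up for illustration, and the code assumes a little-endian target with 8-byte IEEE 754 doubles, as on the i386 target in this report. An optimizing compiler is free to emit different instructions for either function, so this only illustrates the access widths involved, not guaranteed codegen.

```c
#include <stdint.h>
#include <string.h>

/* Negate *p by flipping bit 7 of its most significant byte in place,
   mirroring `xorb $-128, -17(%ebp)` for a double stored at -24(%ebp).
   Assumption: little-endian, 8-byte IEEE 754 double, so the sign bit
   lives in byte 7. */
void negate_via_byte_store(double *p)
{
    unsigned char *bytes = (unsigned char *) p;
    bytes[7] ^= 0x80;   /* conceptually a 1-byte read-modify-write */
}

/* Negate by XORing the sign bit of the whole 64-bit pattern, so the
   value is read and written at full width. */
double negate_via_full_width(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);   /* read all 8 bytes */
    bits ^= UINT64_C(1) << 63;        /* flip the sign bit */
    memcpy(&x, &bits, sizeof x);      /* write all 8 bytes */
    return x;
}
```

Both functions produce the same value. The difference that matters for the stall is that the first touches a single byte of a location that is then read as 8 bytes, while the second works on the full value.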
Comment 6 Uroš Bizjak 2008-01-07 14:02:46 UTC
Patch in testing.
Comment 7 Uroš Bizjak 2008-01-07 14:09:33 UTC
Patched gcc:

387:


   FLOPS C Program (Double Precision), V2.0 18 Dec 1992

   Module     Error        RunTime      MFLOPS
                            (usec)
     1     -8.1208e-11      0.0128   1094.6170
     2     -1.5485e-13      0.0061   1145.7086

SSE:

   FLOPS C Program (Double Precision), V2.0 18 Dec 1992

   Module     Error        RunTime      MFLOPS
                            (usec)
     1      4.0146e-13      0.0114   1227.3206
     2     -1.4166e-13      0.0050   1399.9125

   [ 2     -1.4166e-13      0.0269    260.2975 ]

So, about 5.38x faster than the unpatched SSE result for module 2 (shown in brackets above).
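
The figures above come from the full FLOPS benchmark. In the reduced testcase, dtime() is only a compiler barrier and measures nothing, so a quick check of the short loop needs its own timer. The following is a minimal sketch using clock() from the C standard library; the names time_short_loop and sa_demo are hypothetical and do not appear in the attachments.

```c
#include <time.h>

/* Global accumulator, standing in for "sa" from the testcase. */
double sa_demo;

/* Run the short loop from the testcase m times and return the elapsed
   CPU time in seconds as reported by clock(). */
double time_short_loop(long m)
{
    double s = -5.0;
    long i;
    clock_t start, stop;

    sa_demo = -1.0;

    start = clock();
    for (i = 1; i <= m; i++)
      {
        s = -s;                  /* the sign flip under discussion */
        sa_demo = sa_demo + s;
      }
    stop = clock();

    return (double) (stop - start) / CLOCKS_PER_SEC;
}
```

Building this once with -mfpmath=387 and once with -mfpmath=sse and calling time_short_loop(512000000) from main should make the difference visible on an affected compiler. For an even m the accumulator ends at -1.0, which is a cheap sanity check that the loop was not miscompiled.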
Comment 8 Matteo Croce 2008-01-07 19:47:03 UTC
Created attachment 14895
minimal testcase
Comment 9 Matteo Croce 2008-01-07 19:47:36 UTC
Created attachment 14896
minimal testcase, compiled with -mfpmath=387
Comment 10 Matteo Croce 2008-01-07 19:47:53 UTC
Created attachment 14897
minimal testcase, compiled with -mfpmath=sse
Comment 11 Matteo Croce 2008-01-07 19:49:22 UTC
very very minimal testcase added
Comment 12 uros 2008-01-07 20:07:09 UTC
Subject: Bug 34682

Author: uros
Date: Mon Jan  7 20:06:34 2008
New Revision: 131381

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=131381
Log:
        PR target/34682
        * config/i386/i386.md (neg<mode>2): Rename from negsf2, negdf2 and
        negxf2.  Macroize expander using X87MODEF mode iterator.  Change
        predicates of op0 and op1 to register_operand.
        (abs<mode>2): Rename from abssf2, absdf2 and absxf2.  Macroize expander
        using X87MODEF mode iterator.  Change predicates of op0 and op1 to
        register_operand.
        ("*absneg<mode>2_mixed", "*absneg<mode>2_sse"): Rename from
        corresponding patterns and macroize using MODEF macro.  Change
        predicates of op0 and op1 to register_operand and remove
        "m" constraint. Disparage "r" alternative with "!".
        ("*absneg<mode>2_i387"): Rename from corresponding patterns and
        macroize using X87MODEF macro.  Change predicates of op0 and op1
        to register_operand and remove "m" constraint.  Disparage "r"
        alternative with "!".
        (absneg splitter with memory operands): Remove.
        ("*neg<mode>2_1", "*abs<mode>2_1"): Rename from corresponding
        patterns and macroize using X87MODEF mode iterator.
        * config/i386/sse.md (negv4sf2, absv4sf2, negv2df2, absv2df2):
        Change predicate of op1 to register_operand.
        * config/i386/i386.c (ix86_expand_fp_absneg_operator): Remove support
        for memory operands.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/i386.md
    trunk/gcc/config/i386/sse.md

Comment 13 Uroš Bizjak 2008-01-07 20:10:16 UTC
Fixed in SVN.