Bug 21676 - [4.3 Regression] Optimizer regression: SciMark sparse matrix benchmark
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.0.0
Importance: P2 normal
Target Milestone: 4.4.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on: 23286 38824 39200
Blocks: 79703 39201
Reported: 2005-05-20 08:30 UTC by Jason Bucata
Modified: 2017-02-24 09:29 UTC

See Also:
Host:
Target: i686-pc-linux-gnu
Build:
Known to work: 4.4.0
Known to fail: 4.0.4 4.3.3
Last reconfirmed: 2009-02-05 08:03:22


Attachments
preprocessed testcase files (2.87 KB, application/octet-stream)
2005-05-20 08:34 UTC, Jason Bucata

Description Jason Bucata 2005-05-20 08:30:39 UTC
I see a 14% slowdown with the SciMark sparse matrix multiplication benchmark
when going from 3.4.3 to 4.0.0 on my Gentoo box.  Flags are -O3 -march=athlon-xp
-fomit-frame-pointer.  I compiled and linked in one run of gcc, and ran the
executable from the command line with "time".

4.0's performance gets better (closer to 3.4's, which remains roughly constant)
as functions from the other files are moved into main.c.  On those grounds, my
own inexpert opinion is that this regression stems from function inlining.

Earlier tests showed that the two versions are much closer when
-fomit-frame-pointer isn't used.

Will upload the preprocessed test cases that you're so fond of ;).
Comment 1 Jason Bucata 2005-05-20 08:34:25 UTC
Created attachment 8933
preprocessed testcase files
Comment 2 Andrew Pinski 2006-06-04 20:06:12 UTC
It would be nice if we could get 4.1.x numbers.
Comment 3 Peter Doerfler 2006-06-06 11:22:40 UTC
I get the following with -O3 -march=pentium4 -fomit-frame-pointer on a pentium4 gentoo machine:

gcc-3.4.6   gcc-4.0.2   gcc-4.1.1
    2.69s       4.14s       3.26s

These are all with gentoo's patches.
Also, current mainline is the same as gcc-4.1.1

I can confirm that the difference without -fomit-frame-pointer is much smaller. In fact, 3.4.6 and 4.1.1 are almost the same without it. 
Comment 4 Richard Biener 2006-07-10 12:45:09 UTC
On a Pentium 4, with -O3 -march=pentium4 -fomit-frame-pointer -o bench Random.i SparseCompRow.i array.i kernel.i main.i, I get:

3.4.6: 3.48s 
4.0.3: 4.44s
4.1.1: 4.12s
4.2.0: 4.13s

Comment 5 Andrew Pinski 2006-08-16 06:50:17 UTC
Can someone try the mainline again after Paolo B.'s patch?
Comment 6 Uroš Bizjak 2006-08-16 12:15:56 UTC
IMO the problem here is in IVopts. Using gcc-3.x, the innermost loop compiles to:

.L15:
	movl	(%edi,%edx,4), %eax
	fldl	(%ebp,%edx,8)
	addl	$1, %edx
	fmull	(%esi,%eax,8)
	cmpl	%ecx, %edx
	faddp	%st, %st(1)
	jl	.L15

and with current SVN gcc-4.2 into:

.L12:
	movl	(%ecx), %eax
	fldl	(%ebp,%eax,8)
	fmull	(%edx)
	faddp	%st, %st(1)
	addl	$1, %ebx
	addl	$4, %ecx
	addl	$8, %edx
	cmpl	%esi, %ebx
	jne	.L12

Adding -fno-ivopts, this loop gets compiled into:

.L12:
	movl	(%edi,%edx,4), %eax
	fldl	(%esi,%eax,8)
	fmull	(%ebp,%edx,8)
	faddp	%st, %st(1)
	addl	$1, %edx
	cmpl	%edx, %ecx
	jg	.L12
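
For readers without the attachment, the loop shapes above correspond roughly to the C below. This is only a sketch: the function names dot_indexed and dot_pointer_ivs and the array name col are made up for illustration (val, x, i and rowRp1 appear in the compiler annotations later in this report), and the real SciMark source is not reproduced here. The first function keeps a single counter and relies on scaled-index addressing, like the gcc-3.x and -fno-ivopts loops. The second mimics what the gcc-4.2 loop does after ivopts: two pointer induction variables plus a separate trip counter, so three additions per iteration instead of one.

```c
#include <assert.h>
#include <stddef.h>

/* Indexed form: one counter i; col[i] and val[i] are addressed as
   base + i*4 and base + i*8 (sizes assume the i686 target of this report). */
double dot_indexed(const double *val, const int *col, const double *x,
                   int begin, int end)
{
    assert(val != NULL && col != NULL && x != NULL);
    double sum = 0.0;
    for (int i = begin; i < end; i++)
        sum += x[col[i]] * val[i];
    return sum;
}

/* Strength-reduced form: pc and pv walk the arrays directly and k counts
   iterations, mirroring the three add instructions in the ivopts loop. */
double dot_pointer_ivs(const double *val, const int *col, const double *x,
                       int begin, int end)
{
    assert(val != NULL && col != NULL && x != NULL);
    double sum = 0.0;
    if (begin >= end)
        return sum;               /* empty row: the loop body never runs */

    const int *pc = col + begin;
    const double *pv = val + begin;
    int n = end - begin;          /* trip count, computed once up front */
    int k = 0;
    do {
        sum += x[*pc] * *pv;
        k++;
        pc++;
        pv++;
    } while (k != n);
    return sum;
}
```

Both functions return the same value for the same arguments; only the addressing and the number of induction variables differ.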

Timings (-O3 -march=pentium4 -fomit-frame-pointer):

gcc-3.2: 0m2.301s
gcc-4.2: 0m2.713s
gcc-4.2 + -fno-ivopts: 0m2.473s

with:

gcc version 3.2 20020903 (Red Hat Linux 8.0 3.2-7)
gcc version 4.2.0 20060816 (experimental)

I think that remaining time difference is due to strange loop above innermost:
gcc-3.2:

	fld	%st(0)
.L16:
	movl	36(%esp), %eax
	fld	%st(0)
	movl	4(%eax,%ebx,4), %ecx
	movl	(%eax,%ebx,4), %edx
	cmpl	%ecx, %edx
	jge	.L23
.L15:
	movl	(%edi,%edx,4), %eax
	fldl	(%ebp,%edx,8)
	addl	$1, %edx
	fmull	(%esi,%eax,8)
	cmpl	%ecx, %edx
	faddp	%st, %st(1)
	jl	.L15
.L23:
	movl	28(%esp), %eax
	fstpl	(%eax,%ebx,8)
	addl	$1, %ebx
	cmpl	24(%esp), %ebx
	jl	.L16

========
gcc-4.2:

.L8:
	movl	36(%esp), %edx
	movl	(%edx,%edi,4), %eax
	movl	4(%edx,%edi,4), %esi
	fldz
	cmpl	%esi, %eax
	jge	.L11
	fstp	%st(0)
	movl	40(%esp), %ebx
	leal	(%ebx,%eax,4), %ecx
	movl	32(%esp), %ebx
	leal	(%ebx,%eax,8), %edx
	fldz
	xorl	%ebx, %ebx
	subl	%eax, %esi
.L12:
	movl	(%ecx), %eax
	fldl	(%ebp,%eax,8)
	fmull	(%edx)
	faddp	%st, %st(1)
	addl	$1, %ebx
	addl	$4, %ecx
	addl	$8, %edx
	cmpl	%esi, %ebx
	jne	.L12
.L11:
	movl	28(%esp), %eax
	fstpl	(%eax,%edi,8)
	addl	$1, %edi
	cmpl	24(%esp), %edi
	jne	.L8

========
and gcc-4.2 -fno-ivopts:

.L8:
	leal	(%ebx,%ebx), %eax
	movl	40(%esp), %edx
	movl	(%edx,%eax,2), %edx
	movl	%edx, (%esp)
	movl	40(%esp), %edx
	movl	4(%edx,%eax,2), %ecx
	fldz
	cmpl	%ecx, (%esp)
	jge	.L11
	fstp	%st(0)
	movl	(%esp), %edx
	fldz
.L12:
	movl	(%edi,%edx,4), %eax
	fldl	(%esi,%eax,8)
	fmull	(%ebp,%edx,8)
	faddp	%st, %st(1)
	addl	$1, %edx
	cmpl	%edx, %ecx
	jg	.L12
.L11:
	movl	32(%esp), %ecx
	fstpl	(%ecx,%ebx,8)
	addl	$1, %ebx
	cmpl	%ebx, 28(%esp)
	jg	.L8
Comment 7 Uroš Bizjak 2006-08-17 07:21:23 UTC
(In reply to comment #6)

> I think that remaining time difference is due to strange loop above innermost:

... due to strange _header_ above innermost loop ...

The problem is that we load zero in both arms of "if".

This is what I get in .099t.optimized (using gcc-4.2 -O2 -fno-ivopts):

<L1>:;
  r.0 = (unsigned int) r;
  D.1556 = r.0 * 4;
  rowR = *((int *) D.1556 + row);
  rowRp1 = *((int *) D.1556 + row + 4B);
  if (rowR < rowRp1) goto <L41>; else goto <L42>;

<L42>:;
  sum = 0.0;
  goto <bb 5> (<L4>);

<L41>:;
  i = rowR;
  sum = 0.0;

Assignment to sum should be moved before if...

With SSE, the zero load somehow gets CSE'd during the RTL passes:

.L8:
        movl 20(%ebp), %edx
        movapd  %xmm2, %xmm1
        movl (%edx,%ebx,4), %eax
        movl 4(%edx,%ebx,4), %ecx
        cmpl %ecx, %eax
        jge .L11
        movl %eax, %edx
        .p2align 4,,7
.L12:
Comment 8 Uroš Bizjak 2006-08-17 07:45:57 UTC
It is also interesting that -march=pentium4 produces the following "de-optimized" code, adding a couple more instructions and wasting the %eax register:

.L8:
	leal	(%ebx,%ebx), %eax
	movl	40(%esp), %edx
	movl	(%edx,%eax,2), %edx
	movl	%edx, (%esp)
	movl	40(%esp), %edx
	movl	4(%edx,%eax,2), %ecx
	movapd	%xmm2, %xmm1
	cmpl	%ecx, (%esp)
	jge	.L11
	movl	(%esp), %edx
.L12:

Some additional timings (gcc-4.2 -O2 -fomit-frame-pointer):

-march=pentium4: 0m2.756s
-march=pentium4 -fno-ivopts: 0m2.500s
-march=pentium4 -fno-ivopts -mfpmath=sse: 0m2.461s
-msse2 -fno-ivopts -mfpmath=sse: 0m2.311s

In the last case, the generated code is equal to the one generated by gcc-3.2:

.L8:
	movl	36(%esp), %edx
	movapd	%xmm2, %xmm1
	movl	(%edx,%ebx,4), %eax
	movl	4(%edx,%ebx,4), %ecx
	cmpl	%ecx, %eax
	jge	.L11
	movl	%eax, %edx
	.p2align 4,,7
.L12:
	movl	(%edi,%edx,4), %eax
	movsd	(%esi,%eax,8), %xmm0
	mulsd	(%ebp,%edx,8), %xmm0
	addl	$1, %edx
	cmpl	%edx, %ecx
	addsd	%xmm0, %xmm1
	jg	.L12
Comment 9 Andrew Pinski 2006-08-29 05:24:44 UTC
Fixed on the mainline by:
http://gcc.gnu.org/ml/gcc-patches/2006-08/msg01036.html
Comment 10 Uroš Bizjak 2006-08-29 06:12:54 UTC
(In reply to comment #9)
> Fixed on the mainline by:
> http://gcc.gnu.org/ml/gcc-patches/2006-08/msg01036.html

Not really; the above patch fixed only one of the three problems. The other two remain, namely:

- ivopts problem (see comment #6)
- -march=pentium4 (see comment #8)

I'll try to find out which option causes the problems described in comment #8.
Comment 11 Steven Bosscher 2007-12-16 23:17:13 UTC
Open regression with no activity since February 14.  Ping?
Comment 12 Uroš Bizjak 2007-12-16 23:49:02 UTC
(In reply to comment #11)
> Open regression with no activity since February 14.  Ping?

Current 4.3 SVN produces (-O3 -march=pentium4):

.L8:
        movl    (%ebx), %eax    #* ivtmp.33, tmp84
        fldl    (%edi,%eax,8)   #
        fmull   (%ecx)  #* ivtmp.35
        faddp   %st, %st(1)     #,
        addl    $1, %edx        #, i
        addl    $4, %ebx        #, ivtmp.33
        addl    $8, %ecx        #, ivtmp.35
        cmpl    %edx, %esi      # i, rowRp1
        jg      .L8     #,

(+ -fno-ivopts):

.L8:
        movl    (%esi,%edx,4), %eax     #, tmp76
        fldl    (%edi,%eax,8)   #* x
        movl    16(%ebp), %eax  # val,                  <-----here
        fmull   (%eax,%edx,8)   #
        faddp   %st, %st(1)     #,
        addl    $1, %edx        #, i
        cmpl    %edx, %ecx      # i, rowRp1
        jg      .L8     #,

We regressed vs. 4.2 in this case; "val" is loaded inside the loop.

Comment 13 Steven Bosscher 2008-01-12 14:34:36 UTC
Re. comment #7, "Assignment to sum should be moved before if..."

This is called code hoisting, and it is not performed in GCC except with -Os.  See bug 24001, bug 33315, and other reports about the same issue in Bugzilla.
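
To make the transformation concrete, here is a rough C rendering of the outer loop from the dump in comment #7, before and after hoisting. It is a sketch under assumptions: the function names matmult_both_arms and matmult_hoisted, the parameter list, and the array name col are hypothetical, while r, row, rowR, rowRp1, sum and i follow the dump. At the source level a compiler is free to treat the two versions identically; the point is only to show which assignment code hoisting would merge.

```c
#include <assert.h>
#include <stddef.h>

/* Shape of the dump: sum is set to 0.0 separately in each arm of the "if". */
void matmult_both_arms(int M, double *y, const double *val, const int *row,
                       const int *col, const double *x)
{
    assert(y != NULL && val != NULL && row != NULL && col != NULL && x != NULL);
    for (int r = 0; r < M; r++) {
        int rowR = row[r];
        int rowRp1 = row[r + 1];
        double sum;
        if (rowR < rowRp1) {
            int i = rowR;
            sum = 0.0;            /* first copy of the zero load */
            do {
                sum += x[col[i]] * val[i];
                i++;
            } while (i < rowRp1);
        } else {
            sum = 0.0;            /* second copy of the zero load */
        }
        y[r] = sum;
    }
}

/* After hoisting: the common assignment is done once, before the branch. */
void matmult_hoisted(int M, double *y, const double *val, const int *row,
                     const int *col, const double *x)
{
    assert(y != NULL && val != NULL && row != NULL && col != NULL && x != NULL);
    for (int r = 0; r < M; r++) {
        int rowR = row[r];
        int rowRp1 = row[r + 1];
        double sum = 0.0;         /* single zero load, shared by both paths */
        if (rowR < rowRp1) {
            int i = rowR;
            do {
                sum += x[col[i]] * val[i];
                i++;
            } while (i < rowRp1);
        }
        y[r] = sum;
    }
}
```

Both versions compute the same y; the hoisted one simply has one assignment to sum on every path instead of one per arm.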
Comment 14 Uroš Bizjak 2008-02-06 11:05:57 UTC
We still generate:

.L8:
	movl	(%ebx), %eax
	addl	$1, %edx
	addl	$4, %ebx
	fldl	(%edi,%eax,8)
	fmull	(%ecx)
	addl	$8, %ecx
	cmpl	%edx, %esi
	faddp	%st, %st(1)
	jg	.L8

This could IMO be optimized at the RTL level to use %edx as the count variable:

.L8:
	movl	(%ebx,%edx,4), %eax
	fldl	(%edi,%eax,8)
	fmull	(%ecx,%edx,8)
	addl	$1, %edx
	cmpl	%edx, %esi
	faddp	%st, %st(1)
	jg	.L8

This would be optimal code for this loop.
Comment 15 Joseph S. Myers 2008-07-04 16:53:49 UTC
Closing 4.1 branch.
Comment 16 Steven Bosscher 2008-11-22 10:31:44 UTC
See comment #7 and comment #13.
Comment 17 Paolo Bonzini 2009-02-05 08:03:21 UTC
Confirmed:

3.3 -O2               3.11s
4.1 -O2               3.44s
4.4 -O2               3.36s
4.4 -O2 -fno-ivopts   3.00s
Comment 18 Paolo Bonzini 2009-02-16 09:04:06 UTC
For whatever reason, we're now faster than GCC 3.3.

-O2 -fno-ivopts is still faster than -O2, will open an enhancement request for that.
Comment 19 Joseph S. Myers 2009-03-31 18:47:30 UTC
Closing 4.2 branch.
Comment 20 Richard Biener 2009-04-22 15:10:14 UTC
WONTFIX on the 4.3 branch.