Bug 34163 - [4.4 Regression] 10% performance regression since Nov 1 on Polyhedron's "NF" on AMD64
Summary: [4.4 Regression] 10% performance regression since Nov 1 on Polyhedron's "NF" ...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 4.3.0
: P2 normal
Target Milestone: 4.5.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: 39201
  Show dependency treegraph
 
Reported: 2007-11-20 15:00 UTC by Tobias Burnus
Modified: 2012-03-13 13:04 UTC (History)
8 users (show)

See Also:
Host:
Target: x86_64-unknown-linux-gnu
Build:
Known to work: 4.5.0
Known to fail:
Last reconfirmed: 2009-07-03 11:05:43


Attachments
x86_64 asm dump of trisolve procedure (genereated without the patch) (1.12 KB, text/plain)
2008-04-24 19:56 UTC, Uroš Bizjak
Details
patch (1005 bytes, patch)
2009-07-03 11:22 UTC, Richard Biener
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Tobias Burnus 2007-11-20 15:00:21 UTC
According to http://www.suse.de/~gcctest/c++bench/polyhedron/, the run time of "NF" increased from 37.70s to 41.57s (+10%) between 071031 and 071101. This is an AMD Opteron system.

The same result I obtained at http://physik.fu-berlin.de/~tburnus/gcc-trunk/benchmark/
also on an AMD64. Interestingly, while x86-64 (-m64) gets slower, using -m32 I don't see any change (apart of noise) - neither for x87 nor for SSE(3).

To boil it down to version numbers, I believe the change must have happened between the following versions:
Fast:
  4.3.0 20071031 (experimental) [trunk revision 129791]
Slow:
  4.3.0 20071031 (experimental) [trunk revision 129797]

As culpits there are essentially only the following checkins possible:

- PR tree-optimization/32377 (compute_overlap_steps_for_affine_univar)
  Make it work also for unknown number of iterations.  [Sebastian Pop]

- PR middle-end/33779 fold-const.c (extract_muldiv_1): Make sure to not
  introduce new undefined integer overflow.  [Richard Guenther]

- PR fortran/33897 decl.c (gfc_match_entry): Do not make ENTRY name
  global for contained procedures.
  parse.c (gfc_fixup_sibling_symbols): Fix code for determining whether
  a procedure is external.     [Paul Thomas]

The Polyhedron test case is available from:
http://www.polyhedron.co.uk/MFL6VW74649
Comment 1 Uroš Bizjak 2008-04-21 07:11:35 UTC
Confirmed.
Comment 2 Richard Biener 2008-04-21 09:09:53 UTC
Well, this bug needs proper analysis and a testcase, but yes, I also see this
slowdown.
Comment 3 Uroš Bizjak 2008-04-22 16:43:16 UTC
(In reply to comment #2)
> Well, this bug needs proper analysis and a testcase, but yes, I also see this
> slowdown.

Richi, the only difference in generated code is by backing out your patch [1]

[1] http://gcc.gnu.org/viewcvs?view=rev&revision=129796

Other suspected patches have no effect on generated code for "gfortran -march=opteron -ffast-math -funroll-loops -ftree-loop-linear -ftree-vectorize -msse3 -O3"

I will check execution times, but since I have Core2, perhaps there will be no slowdown. Can you try to benchmark on your target with [1] backed out?
Comment 4 Uroš Bizjak 2008-04-22 16:51:22 UTC
Confirmed also on core2:

benchmarked with patch:

22 Apr 2008 18:47:04 gfortran - Compile nf
command=gfortran -march=opteron -ffast-math -funroll-loops -ftree-loop-linear -ftree-vectorize -msse3 -O3 nf.s -o nf
 
22 Apr 2008 18:47:04 gfortran - Execute nf
nf Run #   1    21.46340    21.46340 - Error=100.0000%
nf Run #   2    21.45860    21.46100 - Error=  0.0112%

Geometric Mean Execution Time =      21.46 seconds

benchmarked without patch:

22 Apr 2008 18:44:51 gfortran - Compile nf
command=gfortran -march=opteron -ffast-math -funroll-loops -ftree-loop-linear -ftree-vectorize -msse3 -O3 nf.s -o nf
 
22 Apr 2008 18:44:51 gfortran - Execute nf
nf Run #   1    19.46120    19.46120 - Error=100.0000%
nf Run #   2    19.46200    19.46160 - Error=  0.0021%

Geometric Mean Execution Time =      19.46 seconds
Comment 5 Andrew Pinski 2008-04-22 18:14:12 UTC
(In reply to comment #3)
> [1] http://gcc.gnu.org/viewcvs?view=rev&revision=129796

It was a correctness fix, which usually will slow down generated code. :)

So you have to look at the difference to make sure that the code generated before was actually producing no overflows.

Thanks,
Andrew Pinski
Comment 6 Richard Biener 2008-04-22 22:20:00 UTC
Indeed.  It would be interesting to analyze what optimization the folding enabled
and see if that can be recovered somehow.
Comment 7 Uroš Bizjak 2008-04-24 19:56:40 UTC
Created attachment 15527 [details]
x86_64 asm dump of trisolve procedure (genereated without the patch)

All the difference is in trisolve procedure (attached). The performance will be 10% better, if trisolve in the dump is substituted with attached function.

I'm using -O2 -funroll-loops.

BTW: There are two loops in this asm (.L3 and .L5). In current asm, suspicious parts are:

        movsd   16(%r9), %xmm6
        mulsd   16(%r8), %xmm6

and
        mulsd   -16(%rdx), %xmm0
        mulsd   -16(%r11), %xmm0

That is - loads from different addresses that are not present in non-patched asm.
Comment 8 Uroš Bizjak 2008-04-25 09:55:27 UTC
The problem is indeed in trisolve:

subroutine trisolve(x,i1,i2)
integer :: i1 , i2
real(dpkind),dimension(i2)::x
integer :: i
x(i1) = gi(i1)* x(i1)
do i = i1+1 , i2
   x(i) = gi(i)*(x(i)-au1(i-1)*x(i-1))
enddo
do i = i2-1 , i1 , -1
   x(i) = x(i) - gi(i)*au1(i)*x(i+1)
enddo
end subroutine trisolve

Please note two very tight loops that calculate x[n] from the value x[n-1], where x[n-1] is the result of a previous step.

.127t.optimized tree dump for the the first loop (the second loop is the same, only going from last to first element) in non-regressed case shows:

<bb 4>:
  MEM[base: ivtmp.297] = MEM[base: ivtmp.295] * ((MEM[base: ivtmp.297] - MEM[base: ivtmp.300] * MEM[base: ivtmp.297, offset: 0x0fffffffffffffff8]));
  ivtmp.295 = ivtmp.295 + D.3347;
  ivtmp.297 = ivtmp.297 + 8;
  ivtmp.300 = ivtmp.300 + 8;
  ivtmp.304 = ivtmp.304 + 1;
  if ((integer(kind=4)) ivtmp.304 == D.1652)
    goto <bb 5>;
  else
    goto <bb 4>;

this code results in:

.L3:
	movsd	(%r9), %xmm10
	addl	$4, %edx
	movsd	(%rcx), %xmm9
X+>	mulsd	-8(%rcx), %xmm10
	movsd	8(%rcx), %xmm7
	movsd	16(%rcx), %xmm5
	movsd	24(%rcx), %xmm3
	subsd	%xmm10, %xmm9
	mulsd	(%rax), %xmm9
	addq	%r10, %rax
1->	movsd	%xmm9, (%rcx)
	movsd	8(%r9), %xmm8
1+>	mulsd	%xmm9, %xmm8
	subsd	%xmm8, %xmm7
	mulsd	(%rax), %xmm7
	addq	%r10, %rax
2->	movsd	%xmm7, 8(%rcx)
	movsd	16(%r9), %xmm6
2+>	mulsd	%xmm7, %xmm6
	subsd	%xmm6, %xmm5
	mulsd	(%rax), %xmm5
	addq	%r10, %rax
3->	movsd	%xmm5, 16(%rcx)
	movsd	24(%r9), %xmm4
	addq	$32, %r9
3+>	mulsd	%xmm5, %xmm4
	subsd	%xmm4, %xmm3
	mulsd	(%rax), %xmm3
	addq	%r10, %rax
X->	movsd	%xmm3, 24(%rcx)
	addq	$32, %rcx
	cmpl	%ebp, %edx
	jne	.L3

In the code above, it can be seen how unrolled iterations are linked together. The result from previous iteration (marked with N->) enters next iteration (marked with N+>).

BTW: Optimizer could also link X-> and X+> but this is probably too much...

Patched gcc regressed in this area:

<bb 4>:
  MEM[base: ivtmp.297] = MEM[base: ivtmp.295] * ((MEM[base: ivtmp.297] - MEM[base: ivtmp.300] * MEM[base: ivtmp.302]));
  ivtmp.295 = ivtmp.295 + D.3349;
  ivtmp.297 = ivtmp.297 + 8;
  ivtmp.300 = ivtmp.300 + 8;
  ivtmp.302 = ivtmp.302 + 8;
  ivtmp.304 = ivtmp.304 + 1;
  if ((integer(kind=4)) ivtmp.304 == D.1652)
    goto <bb 5>;
  else
    goto <bb 4>;

this code results in:

.L3:
	movsd	(%r9), %xmm10
	addl	$4, %edx
	movsd	(%rcx), %xmm9
X->	mulsd	(%r8), %xmm10
	movsd	8(%rcx), %xmm7
	movsd	16(%rcx), %xmm5
	movsd	24(%rcx), %xmm3
	subsd	%xmm10, %xmm9
	mulsd	(%rax), %xmm9
	addq	%rbx, %rax
1->	movsd	%xmm9, (%rcx)
	movsd	8(%r9), %xmm8
1+>	mulsd	8(%r8), %xmm8
	subsd	%xmm8, %xmm7
	mulsd	(%rax), %xmm7
	addq	%rbx, %rax
2->	movsd	%xmm7, 8(%rcx)
	movsd	16(%r9), %xmm6
2+>	mulsd	16(%r8), %xmm6
	subsd	%xmm6, %xmm5
	mulsd	(%rax), %xmm5
	addq	%rbx, %rax
3->	movsd	%xmm5, 16(%rcx)
	movsd	24(%r9), %xmm4
	addq	$32, %r9
3+>	mulsd	24(%r8), %xmm4
	addq	$32, %r8
	subsd	%xmm4, %xmm3
	mulsd	(%rax), %xmm3
	addq	%rbx, %rax
X->	movsd	%xmm3, 24(%rcx)
	addq	$32, %rcx
	cmpl	%r12d, %edx
	jne	.L3

In the code above, the links are broken. In ".+>" case, gcc reloads from memory the same value that is otherwise available in the register, marked with ".->".
Comment 9 Richard Biener 2008-04-25 10:23:31 UTC
Not hoisting the load from x(i) is a missed PRE opportunity.  Complete testcase
for the second loop:

subroutine trisolve2(x,i1,i2,nxyz)
integer :: nxyz
real,dimension(nxyz):: au1
real,allocatable,dimension(:) :: gi
integer :: i1 , i2
real,dimension(i2)::x
integer :: i
allocate(gi(nxyz))
do i = i1+1 , i2
   x(i) = gi(i)*(x(i)-au1(i-1)*x(i-1))
enddo
end subroutine trisolve2
Comment 10 Uroš Bizjak 2008-04-25 11:07:59 UTC
(In reply to comment #9)
> Not hoisting the load from x(i) is a missed PRE opportunity.  Complete testcase
> for the second loop:

This is actually the first loop.

Just for reference: "-O2 -funroll-loops" flags are needed.
Comment 11 Richard Biener 2009-01-24 10:20:00 UTC
GCC 4.3.3 is being released, adjusting target milestone.
Comment 12 Rob 2009-01-28 03:54:07 UTC
On the Trunk using "-O2" or "-O3" can produce slower code.

I built gcc version 4.4.0 20090126 [trunk revision 143680] for i386-redhat-linux
and was dismayed to find that libmudflaps had a few FAILs:

Results for 4.4.0 20090126 (experimental) [trunk revision 143680] (GCC) testsuite on i386-redhat-linux-gnu
http://gcc.gnu.org/ml/gcc-testresults/2009-01/msg02853.html


The file "libmudflap.cth/pass40-frag.c" fails with NO optimization due to:
6100 6200 WARNING: program timed out.
FAIL: libmudflap.cth/pass40-frag.c output pattern test
Since it is only completing 62% of it's task the timeout needs an appropriate increase.


The file "libmudflap.cth/pass40-frag.c" fails with "-O2" due to:
4100 4200 4300 4400 4500 4600 4700 4800 4900 WARNING: program timed out.
FAIL: libmudflap.cth/pass40-frag.c (-O2) output pattern test
With "-O2" it only completes 49% (slower than default).


The file "libmudflap.cth/pass40-frag.c" fails with "-O3" due to:
5100 5200 5300 5400 5500 5600 5700 5800 WARNING: program timed out.
FAIL: libmudflap.cth/pass40-frag.c (-O3) output pattern test
With "-O3" it only completes 58% (slower than default, but faster than "-O2").

Rob
Comment 13 Paolo Bonzini 2009-02-16 10:23:10 UTC
Predictive commoning does exactly what you want.
Comment 14 Uroš Bizjak 2009-06-25 08:25:14 UTC
(In reply to comment #13)
> Predictive commoning does exactly what you want.

It is not effective for the testcase in Comment #9. The dumps for innermost loop are the same for -O2 -funroll-loops [-fpredictive-commoning]:

.L6:
	movss	(%rsi), %xmm9
	addl	$4, %r8d
	mulss	(%rcx), %xmm9
	movss	(%rdx), %xmm8
	movss	4(%rdx), %xmm6
	movss	8(%rdx), %xmm4
	movss	12(%rdx), %xmm2
	subss	%xmm9, %xmm8
	mulss	0(%rbp), %xmm8
	movss	%xmm8, (%rdx)
	movss	4(%rsi), %xmm7
	mulss	4(%rcx), %xmm7
	subss	%xmm7, %xmm6
	mulss	4(%rbp), %xmm6
	movss	%xmm6, 4(%rdx)
	movss	8(%rsi), %xmm5
	mulss	8(%rcx), %xmm5
	subss	%xmm5, %xmm4
	mulss	8(%rbp), %xmm4
	movss	%xmm4, 8(%rdx)
	movss	12(%rsi), %xmm3
	addq	$16, %rsi
	mulss	12(%rcx), %xmm3
	addq	$16, %rcx
	subss	%xmm3, %xmm2
	mulss	12(%rbp), %xmm2
	addq	$16, %rbp
	movss	%xmm2, 12(%rdx)
	addq	$16, %rdx
	cmpl	%r9d, %r8d
	jne	.L6
Comment 15 Uroš Bizjak 2009-06-25 08:31:02 UTC
(In reply to comment #14)
> (In reply to comment #13)
> > Predictive commoning does exactly what you want.

Predictive commoning failed: no suitable chains
Comment 16 Richard Biener 2009-06-25 09:01:20 UTC
Executing predictive commoning without unrolling.

with -m32.  One of the cases SCEV is confused about pointer-plus offsets
being sizetype:

(Data Ref:
  stmt: (*x_58(D))[D.1627_54] = D.1638_71;
  ref: (*x_58(D))[D.1627_54];
  base_object: (*x_58(D))[0];
  Access function 0: {(integer(kind=8)) i_43 + -1, +, 1}_1
  Access function 1: 0B

vs.

(Data Ref:
  stmt: D.1634_67 = (*x_58(D))[D.1632_62];
  ref: (*x_58(D))[D.1632_62];
  base_object: (*x_58(D))[0];
  Access function 0: {(integer(kind=8)) (i_43 + -1) + -1, +, 1}_1
  Access function 1: 0B
Comment 17 Uroš Bizjak 2009-07-03 08:46:52 UTC
(In reply to comment #16)

> One of the cases SCEV is confused about pointer-plus offsets being sizetype:

Do we have a solution for this problem...?
Comment 18 rguenther@suse.de 2009-07-03 09:08:31 UTC
Subject: Re:  [4.3/4.4/4.5 Regression] 10% performance
 regression since Nov 1 on Polyhedron's "NF" on AMD64

On Fri, 3 Jul 2009, ubizjak at gmail dot com wrote:

> ------- Comment #17 from ubizjak at gmail dot com  2009-07-03 08:46 -------
> (In reply to comment #16)
> 
> > One of the cases SCEV is confused about pointer-plus offsets being sizetype:
> 
> Do we have a solution for this problem...?

My hope is that no-undefined-overflow will somehow magically solve
these problems ... otherwise no, there is unfortunately no way out
here.

Richard.
Comment 19 Richard Biener 2009-07-03 11:05:43 UTC
In fact, in this case we have the C equivalent

  int i;
  long j = (long)(i - 1);

vs.

  long j = (long)i - 1;

which I believe are equivalent if overflow is undefined (or i - 1 does not wrap).

It is just that fold obviously considers (long)i - 1 to be more expensive
than (long)(i - 1) and thus does not transform the latter into the former
(and it can't transform (long)i - 1 to (long)(i - 1) as if (long)i - 1
does not overflow there is no guarantee that i - 1 does not).

We should be able to do the former transformation during SCEV analysis
though.

I have a patch which results in (-O3 -ffast-math -funroll-loops)

.L6:
        mulss   (%rcx), %xmm0
        movss   (%rdx), %xmm5
        movss   4(%rdx), %xmm4
        addl    $4, %ebp
        subss   %xmm0, %xmm5
        movss   8(%rdx), %xmm0
        mulss   (%rsi), %xmm5
        movss   %xmm5, (%rdx)
        mulss   4(%rcx), %xmm5
        subss   %xmm5, %xmm4
        mulss   4(%rsi), %xmm4
        movss   %xmm4, 4(%rdx)
        movss   8(%rcx), %xmm3
        mulss   %xmm4, %xmm3
        subss   %xmm3, %xmm0
        mulss   8(%rsi), %xmm0
        movss   %xmm0, 8(%rdx)
        movss   12(%rcx), %xmm2
        addq    $16, %rcx
        mulss   %xmm0, %xmm2
        movss   12(%rdx), %xmm0
        subss   %xmm2, %xmm0
        mulss   12(%rsi), %xmm0
        addq    $16, %rsi
        movss   %xmm0, 12(%rdx)
        addq    $16, %rdx
        cmpl    %r8d, %ebp
        jne     .L6
Comment 20 Richard Biener 2009-07-03 11:14:35 UTC
Before:

 Time for setup          0.139
 Time per iteration      0.271
 Total Time              6.649
 Time for setup          0.136
 Time per iteration      0.265
 Total Time             10.210
 Time for setup          0.134
 Time per iteration      0.265
 Total Time              7.276
 Time for setup          0.134
 Time per iteration      0.260
 Total Time             11.572

After:

 Time for setup          0.114
 Time per iteration      0.238
 Total Time              5.834
 Time for setup          0.111
 Time per iteration      0.233
 Total Time              8.948
 Time for setup          0.110
 Time per iteration      0.237
 Total Time              6.504
 Time for setup          0.112
 Time per iteration      0.235
 Total Time             10.454

which seems to exactly recover this regression.
Comment 21 Richard Biener 2009-07-03 11:22:32 UTC
Created attachment 18133 [details]
patch
Comment 22 Richard Biener 2009-07-03 14:11:32 UTC
Subject: Bug 34163

Author: rguenth
Date: Fri Jul  3 14:11:14 2009
New Revision: 149207

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=149207
Log:
2009-07-03  Richard Guenther  <rguenther@suse.de>

	PR middle-end/34163
	* tree-chrec.c (chrec_convert_1): Fold (T2)(t +- x) to
	(T2)t +- (T2)x if t +- x is known to not overflow and
	the conversion widens the operation.
	* Makefile.in (tree-chrec.o): Add $(FLAGS_H) dependency.

	* gfortran.dg/pr34163.f90: New testcase.

Added:
    trunk/gcc/testsuite/gfortran.dg/pr34163.f90
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/Makefile.in
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-chrec.c

Comment 23 Richard Biener 2009-07-03 14:11:41 UTC
Fixed on the trunk.
Comment 24 Richard Biener 2009-08-04 12:28:34 UTC
GCC 4.3.4 is being released, adjusting target milestone.
Comment 25 Richard Biener 2010-05-22 18:11:48 UTC
GCC 4.3.5 is being released, adjusting target milestone.
Comment 26 Richard Biener 2011-06-27 12:13:24 UTC
4.3 branch is being closed, moving to 4.4.7 target.
Comment 27 Jakub Jelinek 2012-03-13 13:04:25 UTC
Fixed in 4.5+, 4.4 is no longer supported.