37367 – [4.4/4.5 Regression] gcc-4.4/4.5 speed regression

Summary: [4.4/4.5 Regression] gcc-4.4/4.5 speed regression

Status:	RESOLVED WONTFIX

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target
Version:	4.4.0

Importance:	P4 normal
Target Milestone:	4.4.4
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2008-09-04 10:01 UTC by tim blechmann
Modified:	2010-03-22 10:49 UTC
CC List:	3 users

See Also:
Host:
Target:	x86_64--
Build:
Known to work:
Known to fail:
Last reconfirmed:

Attachments
gcc-minimal-tune=core2.patch (413 bytes, patch) 2010-03-22 10:49 UTC, Jakub Jelinek

Description tim blechmann 2008-09-04 10:01:46 UTC

using a small piece of digital filter code, i was trying to benchmark several looping constructs. on x86_64 the following code ran 5% faster when compiled with g++-4.3 than with g++-4.4:

float __attribute__ ((noinline)) bench_5(float * out_sample, int n)
{
    float b1 = std::cos(0.01);
    float y1 = 0;
    float y2 = 1;

    do
    {
        float y0 = b1 * y1 - y2;
        *out_sample++ = y0;
        --n;
    }
    while (__builtin_expect(n!=0, 1));
}

tim@thinkpad:~$ g++-4.3 -v
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.3.2-0ubuntu3' --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3 --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --enable-mpfr --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.3.2 (Ubuntu 4.3.2-0ubuntu3) 

tim@thinkpad:~$ g++-4.4 -v
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../gcc-4.4-20080815/configure --enable-languages=c,c++,fortran,objc,obj-c++ --enable-shared --with-system-zlib --enable-mpfr --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --without-included-gettext --enable-threads=posix --enable-nls --with-gxx-include-dir=/usr/local/include/c++/4.4 --program-suffix=-4.4 --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
Thread model: posix
gcc version 4.4.0 20080815 (experimental) (GCC) 

the command line to compile the code was:
g++ benchmarks/loop_benchmark.cpp -O3 -lrt -march=core2
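
The timing code itself is not quoted in this report. As a rough sketch of how such a loop could be timed, assuming a C++11 compiler and `std::chrono` in place of whatever clock the original used (the `-lrt` flag hints at `clock_gettime`, but that is a guess): `bench_5_sketch`, `best_time_ns` and `sink` are hypothetical names, and returning the last sample is an addition, since the snippet above declares a `float` result without returning one.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <vector>

// Same loop body as bench_5 above. The return of the last sample is an
// assumption added here so that the function has a defined result.
float __attribute__((noinline)) bench_5_sketch(float* out_sample, int n)
{
    assert(n > 0);  // the do/while body always runs at least once

    float b1 = std::cos(0.01);
    float y1 = 0;
    float y2 = 1;
    float y0 = 0;

    do
    {
        y0 = b1 * y1 - y2;
        *out_sample++ = y0;
        --n;
    }
    while (__builtin_expect(n != 0, 1));

    return y0;
}

// Hypothetical sink so the result of each call is observably used.
static volatile float sink;

// Hypothetical helper: runs the loop `repeats` times over an n-sample
// buffer and returns the smallest wall-clock time seen, in nanoseconds.
long long best_time_ns(int n, int repeats)
{
    std::vector<float> buffer(n);
    long long best = -1;

    for (int i = 0; i < repeats; ++i)
    {
        auto start = std::chrono::steady_clock::now();
        sink = bench_5_sketch(buffer.data(), n);
        auto stop = std::chrono::steady_clock::now();

        long long ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                           stop - start).count();
        if (best < 0 || ns < best)
            best = ns;
    }
    return best;
}
```

Building this once with each compiler and comparing the values returned by `best_time_ns` for the same `n` and `repeats` gives two numbers to set against the reported 5%. Taking the minimum over several runs is one common way to reduce scheduling noise; it is a design choice of this sketch, not something the report describes.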

the difference in the machine code is the order of the subl and leal instructions:
*** 340,347 ****
  	addq	$16, %rax
  	cmpl	%r8d, %edx
  	jb	.L61
- 	subl	%r9d, %esi
  	leal	0(,%r9,4), %eax
  	mov	%eax, %eax
  	addq	%rax, %rcx
  	cmpl	%r9d, %r10d
--- 340,347 ----
  	addq	$16, %rax
  	cmpl	%r8d, %edx
  	jb	.L61
  	leal	0(,%r9,4), %eax
+ 	subl	%r9d, %esi
  	mov	%eax, %eax
  	addq	%rax, %rcx
  	cmpl	%r9d, %r10d
***************

since i read that gcc-4.4 is supposed to be aimed at code optimization, i thought it may be interesting to report it ...
the complete code can be found at http://tinyurl.com/5socts

Comment 1 Andrew Pinski 2008-09-06 21:43:52 UTC

Sounds like a scheduler issue.

Comment 2 Paolo Bonzini 2009-01-31 14:19:25 UTC

Unfortunately, I do not see any reason why the two should have different speed (which means there's no way to teach GCC the former is better).

I think a WONTFIX is the only possibility.  CCing a release manager.

Comment 3 Richard Biener 2009-01-31 14:33:08 UTC

I would say this needs a much more detailed pipeline description.  (btw, what
is diffed against what?  i.e. which variant is faster? ;))

As we have both %r9 and %r9d accesses, I would say this may be an artifact of
HW register renaming.  Which exact CPU are you using, btw?  Is this maybe
fixed if you specify that exact CPU with -mtune= ?

Comment 4 Richard Biener 2009-02-05 22:03:05 UTC

5% is way below our release criteria threshold.  P4.

Comment 5 Steven Bosscher 2010-03-21 12:20:12 UTC

Bug in WAITING for a long time, no feedback. Very small, hard-to-catch code difference. It's been noted before that the core2 scheduler description (contributed by Intel itself!) often results in worse code than the generic scheduler description. All in all, no reason to track this anymore.

Comment 6 Jack Howarth 2010-03-21 14:44:00 UTC

Shouldn't there be a PR about the suboptimal performance from the core2 tuning (in hopes that original contributors from Intel will revisit these issues)?

Comment 7 H.J. Lu 2010-03-21 16:20:14 UTC

(In reply to comment #6)
> Shouldn't there be a PR about the suboptimal performance from the core2 tuning
> (in hopes that original contributors from Intel will revisit these issues)?
> 

Intel didn't contribute -march=core2. I have been telling
people to use -mtune=generic.

Comment 8 rguenther@suse.de 2010-03-22 10:01:32 UTC

(In reply to comment #7)
> Intel didn't contribute -march=core2. I have been telling
> people to use -mtune=generic.

So should we make -march=core2 turn on -mtune=generic then?

Comment 9 Jakub Jelinek 2010-03-22 10:47:31 UTC

From the numbers Vladimir posted for SPEC2k:
On x86_64, -mtune=generic and -mtune=core2 have the same rate for SPECint, with slightly smaller code size for core2.  For SPECfp, -mtune=core2 has a 0.4% worse rate due to a 10% drop on facerec (otherwise it would be a 0.4% win), again with slightly smaller code for -mtune=core2.
On x86, SPECint is slightly worse with -mtune=core2 and even the code size is slightly larger; SPECfp, on the other hand, has both a slightly better rate and smaller code size with -mtune=core2.
So using generic tuning for core2 is possible.

Comment 10 Jakub Jelinek 2010-03-22 10:49:16 UTC

Created attachment 20157
gcc-minimal-tune=core2.patch

Here is a minimal (and untested too) patch for that.
A bigger patch would drop all PROCESSOR_CORE2, core2_cost, CPU_CORE2 etc.
references (AFAIK there aren't that many in the sources).