using a small piece of code of a digital filter, i was trying to benchmark several looping constructs. on x86_64 the following code was running 5% faster with g++-4.3 than with g++-4.4:

    float __attribute__ ((noinline)) bench_5(float * out_sample, int n)
    {
        float b1 = std::cos(0.01);
        float y1 = 0;
        float y2 = 1;
        do {
            float y0 = b1 * y1 - y2;
            *out_sample++ = y0;
            --n;
        } while (__builtin_expect(n!=0, 1));
    }

tim@thinkpad:~$ g++-4.3 -v
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.3.2-0ubuntu3' --with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --enable-nls --with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3 --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --enable-mpfr --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.3.2 (Ubuntu 4.3.2-0ubuntu3)

tim@thinkpad:~$ g++-4.4 -v
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../gcc-4.4-20080815/configure --enable-languages=c,c++,fortran,objc,obj-c++ --enable-shared --with-system-zlib --enable-mpfr --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --without-included-gettext --enable-threads=posix --enable-nls --with-gxx-include-dir=/usr/local/include/c++/4.4 --program-suffix=-4.4 --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
Thread model: posix
gcc version 4.4.0 20080815 (experimental) (GCC)

the command line to compile the code was:

    g++ benchmarks/loop_benchmark.cpp -O3 -lrt -march=core2

the difference in the machine code is the order of two subl and leal instructions:

    *** 340,347 ****
      addq $16, %rax
      cmpl %r8d, %edx
      jb .L61
    - subl %r9d, %esi
      leal 0(,%r9,4), %eax
      mov %eax, %eax
      addq %rax, %rcx
      cmpl %r9d, %r10d
    --- 340,347 ----
      addq $16, %rax
      cmpl %r8d, %edx
      jb .L61
      leal 0(,%r9,4), %eax
    + subl %r9d, %esi
      mov %eax, %eax
      addq %rax, %rcx
      cmpl %r9d, %r10d
    ***************

since i read that gcc-4.4 is supposed to be aimed at code optimization, i thought it may be interesting to report it ...

the complete code can be found at http://tinyurl.com/5socts
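For anyone who wants to reproduce the timing without the linked file, here is a minimal sketch of a harness around the loop above. It is not the reporter's benchmark. The names bench_kernel and best_time_seconds are hypothetical, std::chrono stands in for whatever timer the original used (the -lrt flag suggests clock_gettime, but that is a guess), the __builtin_expect hint is dropped, and the kernel returns the last sample so that every path returns a value, which the posted function does not.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Kernel mirroring the loop from the report (hypothetical name).
// Assumption: returns the last sample written; the posted version
// declares a float return type but has no return statement.
#if defined(__GNUC__)
__attribute__((noinline))
#endif
float bench_kernel(float* out_sample, int n) {
    assert(n > 0);  // the do/while form writes at least one sample
    float b1 = std::cos(0.01f);
    float y1 = 0;
    float y2 = 1;
    float y0 = 0;
    do {
        // As in the report, y1 and y2 are never updated, so every
        // sample is b1 * 0 - 1.
        y0 = b1 * y1 - y2;
        *out_sample++ = y0;
        --n;
    } while (n != 0);
    return y0;
}

// Hypothetical helper: run the kernel `repeats` times over a buffer of
// n floats and return the best (smallest) wall-clock time in seconds.
double best_time_seconds(int n, int repeats) {
    std::vector<float> buffer(static_cast<std::size_t>(n));
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        auto start = std::chrono::steady_clock::now();
        // volatile sink keeps the call from being optimized away
        volatile float sink = bench_kernel(buffer.data(), n);
        (void)sink;
        auto stop = std::chrono::steady_clock::now();
        double elapsed = std::chrono::duration<double>(stop - start).count();
        if (best < 0.0 || elapsed < best) {
            best = elapsed;
        }
    }
    return best;
}
```

Calling best_time_seconds(1 << 20, 100) from a small main and building the same file once with -O3 -march=core2 and once with -O3 -march=core2 -mtune=generic should show whether the scheduling difference is measurable on a given machine. Taking the minimum over repeats is a design choice to reduce noise from other processes; a 5% gap is small enough that single runs are unlikely to be reliable.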
Sounds like a scheduler issue.
Unfortunately, I do not see any reason why the two should have different speed (which means there's no way to teach GCC the former is better). I think a WONTFIX is the only possibility. CCing a release manager.
I would say this needs a much more detailed pipeline description. (btw, what is diffed against what? i.e. which variant is faster? ;)) As we have %r9 and %r9d access I would say this may be some artifacts in HW register renaming. What exact CPU are you using btw? Is this maybe fixed if you specify that exact CPU with -mtune= ?
5% is way below our release criteria threshold. P4.
Bug in WAITING for a long time, no feedback. Very small, hard-to-catch code difference. It's been noted before that the core2 scheduler description (contributed by Intel itself!) often results in worse code than the generic scheduler description. All in all, no reason to track this anymore.
Shouldn't there be a PR about the suboptimal performance from the core2 tuning (in hopes that original contributors from Intel will revisit these issues)?
(In reply to comment #6)
> Shouldn't there be a PR about the suboptimal performance from the core2 tuning
> (in hopes that original contributors from Intel will revisit these issues)?
>

Intel didn't contribute -march=core2. I have been telling people to use -mtune=generic.
Subject: Re: [4.4/4.5 Regression] gcc-4.4/4.5 speed regression

On Sun, 21 Mar 2010, hjl dot tools at gmail dot com wrote:

> ------- Comment #7 from hjl dot tools at gmail dot com 2010-03-21 16:20 -------
> (In reply to comment #6)
> > Shouldn't there be a PR about the suboptimal performance from the core2 tuning
> > (in hopes that original contributors from Intel will revisit these issues)?
> >
>
> Intel didn't contribute -march=core2. I have been telling
> people to use -mtune=generic.

So should we make -march=core2 turn on -mtune=generic then?
From the numbers Vladimir posted for SPEC2k: on x86_64, -mtune=generic and -mtune=core2 give the same SPECint rate, with slightly smaller code size for core2. For SPECfp, -mtune=core2 has a 0.4% worse rate because of a 10% drop on facerec (without it, core2 would be a 0.4% win), again with slightly smaller code. On x86, SPECint is slightly worse with -mtune=core2 and the code is even slightly larger, while SPECfp has both a slightly better rate and smaller code size with -mtune=core2. So using generic tuning for core2 is possible.
Created attachment 20157
gcc-minimal-tune=core2.patch

Here is a minimal (and untested too) patch for that. A bigger patch would drop all PROCESSOR_CORE2, core2_cost, CPU_CORE2 etc. references (AFAIK there aren't that many in the sources).