This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug c/42621] New: 4.4/4.5 Regression, Computed gotos on AMD 800% slower
- From: "fredrik dot svahn at gmail dot com" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: 5 Jan 2010 11:44:20 -0000
- Subject: [Bug c/42621] New: 4.4/4.5 Regression, Computed gotos on AMD 800% slower
- Reply-to: gcc-bugzilla at gcc dot gnu dot org
When a program that uses computed gotos is compiled with gcc 4.4.2, it runs
significantly slower (up to a factor of 10) than when it is compiled with e.g.
gcc 4.1/4.3 using the same optimization flags (-O2 or -O3). A small dummy test
program without header file dependencies is attached.
I am compiling with a command line like "gcc -O3 test.c -o testp.4.4.2" and
running the generated executable without arguments, like "./testp.4.4.2".
Generating CPU-specific instructions, e.g. with "-march=athlon64", seems to make
no difference. I have also tried "-fno-gcse" (as recommended in the docs) to no
avail. I get the same results with targets x86_64 and i686 on Novell SLES 10 and
Arch Linux.
Interestingly enough I do not see this problem on any Intel processor I have
tried, but I have seen the slowdown on all AMD processors I have tried (e.g.
Dual-Core AMD Opteron Processor 2216 and AMD Turion 64 X2 Mobile Technology
TL-60). In fact, the exact same two binaries resulting from compilation with
gcc 4.4.2 and gcc 4.3 for i686 which show a significant performance difference
on an AMD will not show any significant difference on an Intel Core 2 Duo
T7500.
Some observations:
1. On AMD there is a huge difference in the number of mispredicted branches
between the program compiled with gcc-4.4.2 and the program compiled with
earlier compilers. See for instance the following output from oprofile:
---
Counted RETIRED_INDIRECT_BRANCHES_MISPREDICTED events (Retired Indirect
Branches Mispredicted) with a unit mask of 0x00 (No unit mask) count 500
Counted RETIRED_MISPREDICTED_BRANCH_INSTRUCTIONS events (Retired Mispredicted
Branch Instructions) with a unit mask of 0x00 (No unit mask) count 500
Counted RETIRED_TAKEN_BRANCH_INSTRUCTIONS events (Retired taken branch
instructions) with a unit mask of 0x00 (No unit mask) count 500
RETIRED_INDIRE...|RETIRED_MISPRE...|RETIRED_TAKEN_...|
  samples|      %|  samples|      %|  samples|      %|
------------------------------------------------------
   185416 88.7799    186587 82.8723    381826 48.1913 testp.4.4.2
     5605  2.6838      6275  2.7870    157401 19.8660 testp.4.3
2. Gcc 4.3 generates the following assembler around the "eq:" label in
the attached program:
  4004c0:  48 81 fb 00 e1 f5 05    cmp    $0x5f5e100,%rbx
  4004c7:  74 21                   je     4004ea <main+0x6a>
  4004c9:  0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  4004d0:  48 63 c5                movslq %ebp,%rax
  4004d3:  48 8b 44 c4 b0          mov    -0x50(%rsp,%rax,8),%rax
  4004d8:  ff e0                   jmpq   *%rax
gcc 4.4.2, by contrast, generates an additional jump instruction: the direct
jmp at 4004f0 goes to the indirect jmpq at 4004c0 instead of the indirect jump
being emitted in place:
  4004c0:  ff e0                   jmpq   *%rax
  ...
  4004d8:  48 81 fb 00 e1 f5 05    cmp    $0x5f5e100,%rbx
  4004df:  74 21                   je     400502 <main+0x82>
  4004e1:  0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  4004e8:  48 63 c5                movslq %ebp,%rax
  4004eb:  48 8b 44 c4 88          mov    -0x78(%rsp,%rax,8),%rax
  4004f0:  eb ce                   jmp    4004c0 <main+0x40>
3. I see the same behaviour with a month-old snapshot of gcc 4.5.
Examples of compilers used (I have tried a number of different builds on
different targets):
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --prefix=/usr --enable-shared
--enable-languages=c,c++,fortran,objc,obj-c++,ada
--enable-threads=posix --mandir=/usr/share/man
--infodir=/usr/share/info --enable-__cxa_atexit --disable-multilib
--libdir=/usr/lib --libexecdir=/usr/lib --enable-clocale=gnu
--disable-libstdcxx-pch --with-tune=generic
Thread model: posix
gcc version 4.4.2 (GCC)
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --prefix=/usr --enable-shared
--enable-languages=c,c++ --enable-threads=posix
--mandir=/usr/share/man --infodir=/usr/share/info
--enable-__cxa_atexit --disable-multilib --libdir=/usr/lib
--libexecdir=/usr/lib --enable-clocale=gnu --disable-libstdcxx-pch
--with-tune=generic --disable-werror --enable-checking=release
--program-suffix=-4.3 --enable-version-specific-runtime-libs
Thread model: posix
gcc version 4.3.3 (GCC)
Test program:
=============
#define VALUE 100000000
int main(int argc, char *argv[]) {
    void *ops[] = { &&inc, &&eq, &&gt, &&lt, &&gte, &&lte, &&zero,
                    &&not_implemented, &&exit };
    long i = 0;
    int next_op = argc; //unknown at compile time...
    int fail_op = 0; //inc
    goto *ops[0];
inc:
    i++;
    goto *ops[next_op];
eq:
    if (!(i == VALUE)) goto handle_fail;
    return 0;
gt:
    if (!(i > VALUE)) goto handle_fail;
    return 0;
lt:
    if (!(i < VALUE)) goto handle_fail;
    return 0;
gte:
    if (!(i >= VALUE)) goto handle_fail;
    return 0;
lte:
    if (!(i <= VALUE)) goto handle_fail;
    return 0;
zero:
    if (!(i == 0)) goto handle_fail;
    return 0;
not_implemented:
    fail_op = 8; //exit
    goto handle_fail;
exit:
    return -1;
handle_fail:
    goto *ops[fail_op];
}
--
Summary: 4.4/4.5 Regression, Computed gotos on AMD 800% slower
Product: gcc
Version: 4.4.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: fredrik dot svahn at gmail dot com
GCC build triplet: x86_64-unknown-linux-gnu
GCC host triplet: x86_64-unknown-linux-gnu
GCC target triplet: x86_64-unknown-linux-gnu
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42621