42621 – [4.4 Regression] Computed gotos on AMD 800% slower

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	rtl-optimization
Version:	4.4.2

Importance:	P2 normal
Target Milestone:	4.5.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2010-01-05 11:44 UTC by Fredrik Svahn
Modified:	2012-03-13 13:04 UTC
CC List:	5 users

See Also:
Host:	x86_64-unknown-linux-gnu
Target:	x86_64-unknown-linux-gnu
Build:	x86_64-unknown-linux-gnu
Known to work:	4.5.0
Known to fail:
Last reconfirmed:	2010-01-05 12:50:01

Description Fredrik Svahn 2010-01-05 11:44:19 UTC

When a program with computed gotos is compiled with gcc 4.4.2, it runs significantly slower (up to a factor of 10) than when it is compiled with e.g. gcc 4.1/4.3 with the same optimization flags (-O2 or -O3). A small dummy test program without header file dependencies is attached.

I am compiling with a command line like "gcc -O3 test.c -o testp.4.4.2", and run the generated executable without arguments, like "./testp.4.4.2". Generating CPU-specific instructions, e.g. with "-march=athlon64", seems to make no difference. I have also tried "-fno-gcse" (as recommended in the docs) to no avail. I get the same results with targets x86_64 and i686 on Novell SLES 10 and Arch Linux.

Interestingly enough I do not see this problem on any Intel processor I have tried, but I have seen the slowdown on all AMD processors I have tried (e.g. Dual-Core AMD Opteron Processor 2216 and AMD Turion 64 X2 Mobile Technology TL-60). In fact, the exact same two binaries resulting from compilation with gcc 4.4.2 and gcc 4.3 for i686 which show a significant performance difference on an AMD will not show any significant difference on an Intel Core 2 Duo T7500.

Some observations:

1. On AMD there is a huge difference in the number of mispredicted branches between the program compiled with gcc-4.4.2 and the program compiled with earlier compilers. See for instance the following output from oprofile:

---
Counted RETIRED_INDIRECT_BRANCHES_MISPREDICTED events (Retired Indirect Branches Mispredicted) with a unit mask of 0x00 (No unit mask) count 500
Counted RETIRED_MISPREDICTED_BRANCH_INSTRUCTIONS events (Retired Mispredicted Branch Instructions) with a unit mask of 0x00 (No unit mask) count 500
Counted RETIRED_TAKEN_BRANCH_INSTRUCTIONS events (Retired taken branch instructions) with a unit mask of 0x00 (No unit mask) count 500
RETIRED_INDIRE...|RETIRED_MISPRE...|RETIRED_TAKEN_...|
  samples|      %|  samples|      %|  samples|      %|
------------------------------------------------------
   185416 88.7799    186587 82.8723    381826 48.1913 testp.4.4.2
     5605  2.6838      6275  2.7870    157401 19.8660 testp.4.3


2. Gcc 4.3 generates the following assembler around the "eq:" label in
the attached program:

  4004c0:       48 81 fb 00 e1 f5 05    cmp    $0x5f5e100,%rbx
  4004c7:       74 21                   je     4004ea <main+0x6a>
  4004c9:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  4004d0:       48 63 c5                movslq %ebp,%rax
  4004d3:       48 8b 44 c4 b0          mov    -0x50(%rsp,%rax,8),%rax
  4004d8:       ff e0                   jmpq   *%rax

While gcc 4.4.2 will generate an additional jump instruction:

  4004c0:       ff e0                   jmpq   *%rax
    ...
  4004d8:       48 81 fb 00 e1 f5 05    cmp    $0x5f5e100,%rbx
  4004df:       74 21                   je     400502 <main+0x82>
  4004e1:       0f 1f 80 00 00 00 00    nopl   0x0(%rax)
  4004e8:       48 63 c5                movslq %ebp,%rax
  4004eb:       48 8b 44 c4 88          mov    -0x78(%rsp,%rax,8),%rax
  4004f0:       eb ce                   jmp    4004c0 <main+0x40>

3. I see the same behaviour with a month-old snapshot of gcc 4.5.
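
For reference, the two code shapes compared in observation 2 can be written out at the source level. The following is a minimal sketch and not part of the attached test program: the function names run_threaded and run_factored and the three opcodes are hypothetical, chosen only for illustration. In the first function every handler ends in its own indirect jump (the shape gcc 4.3 emits); in the second every handler jumps back to one shared indirect jump (the shape gcc 4.4.2 emits). Note that the compiler is free to convert either form into the other, so the source layout alone does not guarantee the generated code; that is what this report is about.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for illustration only. */
enum { OP_INC, OP_DEC, OP_HALT };

/* "Unfactored" dispatch: each handler ends in its own indirect jump,
   so the branch predictor sees one jump site per handler. */
long run_threaded(const unsigned char *code) {
    static void *const ops[] = { &&op_inc, &&op_dec, &&op_halt };
    long acc = 0;
    size_t pc = 0;

#define DISPATCH() \
    do { assert(code[pc] <= OP_HALT); goto *ops[code[pc++]]; } while (0)

    DISPATCH();
op_inc:
    acc++;
    DISPATCH();
op_dec:
    acc--;
    DISPATCH();
op_halt:
    return acc;
#undef DISPATCH
}

/* "Factored" dispatch: all handlers share a single indirect jump and
   reach it through an extra direct jump. */
long run_factored(const unsigned char *code) {
    static void *const ops[] = { &&op_inc, &&op_dec, &&op_halt };
    long acc = 0;
    size_t pc = 0;

dispatch:
    assert(code[pc] <= OP_HALT);
    goto *ops[code[pc++]];
op_inc:
    acc++;
    goto dispatch;
op_dec:
    acc--;
    goto dispatch;
op_halt:
    return acc;
}
```

Both functions compute the same result for the same opcode sequence; they differ only in how many indirect jump sites the source contains. Computed gotos are a GNU C extension, so this needs gcc or a compatible compiler.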

Examples of compilers used (I have tried a number of different builds on different targets):

Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --prefix=/usr --enable-shared
--enable-languages=c,c++,fortran,objc,obj-c++,ada
--enable-threads=posix --mandir=/usr/share/man
--infodir=/usr/share/info --enable-__cxa_atexit --disable-multilib
--libdir=/usr/lib --libexecdir=/usr/lib --enable-clocale=gnu
--disable-libstdcxx-pch --with-tune=generic
Thread model: posix
gcc version 4.4.2 (GCC)

Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --prefix=/usr --enable-shared
--enable-languages=c,c++ --enable-threads=posix
--mandir=/usr/share/man --infodir=/usr/share/info
--enable-__cxa_atexit --disable-multilib --libdir=/usr/lib
--libexecdir=/usr/lib --enable-clocale=gnu --disable-libstdcxx-pch
--with-tune=generic --disable-werror --enable-checking=release
--program-suffix=-4.3 --enable-version-specific-runtime-libs
Thread model: posix
gcc version 4.3.3 (GCC)

Test program:
=============

#define VALUE 100000000

int main(int argc, char *argv[]) {
  void *ops[] = { &&inc, &&eq, &&gt, &&lt, &&gte, &&lte, &&zero, &&not_implemented, &&exit };

  long i = 0;
  int next_op = argc; //unknown at compile time...
  int fail_op = 0; //inc
  goto *ops[0];   

  inc: 
    i++;
    goto *ops[next_op]; 

  eq: 
    if (!(i == VALUE)) goto handle_fail;
    return 0;     

  gt: 
    if (!(i > VALUE)) goto handle_fail;
    return 0;     

  lt: 
    if (!(i < VALUE)) goto handle_fail;
    return 0;     

  gte: 
    if (!(i >= VALUE)) goto handle_fail;
    return 0;     

  lte: 
    if (!(i <= VALUE)) goto handle_fail;
    return 0;     

  zero:
    if (!(i == 0)) goto handle_fail;
    return 0;     

  not_implemented: 
    fail_op = 8; //exit
    goto handle_fail;
  
  exit:
    return -1;


  handle_fail: 
    goto *ops[fail_op];
}

Comment 1 Steven Bosscher 2010-01-05 12:50:01 UTC

There is a pass "duplicate_computed_gotos" that should take care of this. Why does it not work in this case?

Comment 2 Steven Bosscher 2010-01-05 21:51:37 UTC

Caused by revision 139760.
http://gcc.gnu.org/viewcvs?view=revision&revision=139760

Comment 3 Andrew Pinski 2010-01-05 21:56:16 UTC

So the profiling information (which is not always accurate without real profiling) says the code is not executed that often.  I guess someone needs to tune them better for computed gotos unless people really want to do profiling runs first to get better performance ...

Comment 4 Steven Bosscher 2010-01-05 22:11:01 UTC

I would just go back to the old behaviour (of GCC 4.3 and earlier) rather than deciding for each basic block individually whether to unfactor or not.

Could you please see if the attached patch makes the slow-down disappear?

Index: bb-reorder.c
===================================================================
--- bb-reorder.c	(revision 155661)
+++ bb-reorder.c	(working copy)
@@ -1981,7 +1981,9 @@ gate_duplicate_computed_gotos (void)
 {
   if (targetm.cannot_modify_jumps_p ())
     return false;
-  return (optimize > 0 && flag_expensive_optimizations);
+  return (optimize > 0
+	  && flag_expensive_optimizations
+	  && ! optimize_function_for_size_p (cfun));
 }
 
 
@@ -2072,9 +2074,6 @@ duplicate_computed_gotos (void)
 	  || single_pred_p (single_succ (bb)))
 	continue;
 
-      if (!optimize_bb_for_size_p (bb))
-	continue;
-
       /* The successor block has to be a duplication candidate.  */
       if (!bitmap_bit_p (candidates, single_succ (bb)->index))
 	continue;

Comment 5 Fredrik Svahn 2010-01-06 11:36:55 UTC

Thanks for the quick patch! 

Unfortunately it only works for me with the option "-march=athlon64". Is this intentional ("-march" is not needed for gcc-4.3)?

Am I doing something wrong?

$ gcc-4.3 -v && /opt/gcc/bin/gcc-4.4.2-new -v
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --prefix=/usr --enable-shared --enable-languages=c,c++ --enable-threads=posix --mandir=/usr/share/man --infodir=/usr/share/info --enable-__cxa_atexit --disable-multilib --libdir=/usr/lib --libexecdir=/usr/lib --enable-clocale=gnu --disable-libstdcxx-pch --with-tune=generic --disable-werror --enable-checking=release --program-suffix=-4.3 --enable-version-specific-runtime-libs
Thread model: posix
gcc version 4.3.3 (GCC) 
Using built-in specs.
Target: x86_64-unknown-linux-gnu
Configured with: ../gcc-4.4.2/configure --prefix=/opt/gcc --enable-shared --enable-languages=c,c++ --enable-threads=posix --mandir=/usr/share/man --infodir=/usr/share/info --enable-__cxa_atexit --disable-multilib --libdir=/usr/lib --libexecdir=/usr/lib --enable-clocale=gnu --disable-libstdcxx-pch --with-tune=generic --disable-werror --enable-checking=release --program-suffix=-4.4.2-new --enable-version-specific-runtime-libs
Thread model: posix
gcc version 4.4.2 (GCC)

$ gcc-4.3 -g -O3 test.c -o testp.4.3 &&  /opt/gcc/bin/gcc-4.4.2-new -g -Wall -O3  test.c -o testp.4.4.2 
$ time ./testp.4.3 && time ./testp.4.4.2 

real    0m0.889s
user    0m0.880s
sys     0m0.000s

real    0m4.043s
user    0m4.036s
sys     0m0.003s
$ gcc-4.3 -g -O3 test.c -o testp.4.3 &&  /opt/gcc/bin/gcc-4.4.2-new -g -Wall -march=athlon64 -O3  test.c -o testp.4.4.2 
$ time ./testp.4.3 && time ./testp.4.4.2 

real    0m0.888s
user    0m0.880s
sys     0m0.000s

real    0m0.638s
user    0m0.627s
sys     0m0.003s

Comment 6 Fredrik Svahn 2010-01-06 11:44:00 UTC

I will try to distclean and rebuild from scratch to confirm my statement above.

Comment 7 Fredrik Svahn 2010-01-06 23:00:17 UTC

Summary:
The patch works great when building gcc from trunk (revision 155680). Both the supplied test program and the real application are optimized.

With gcc-4.4.2 I get the optimization for the test program only with e.g. -march=athlon64 or -mtune=native (which is an improvement; previously I could not get it to work even with these options). Without -mtune/-march, the optimization seems to bail out on the following check in bb-reorder.c@@duplicate_computed_gotos(void):

      /* Obviously the block has to end in a computed jump.  */
      if (!computed_jump_p (BB_END (bb)))
	continue;

I assume the patch was written for 4.5 so maybe testing it on 4.4.2 is a bit premature. Hope it helps anyway.

Comment 8 Steven Bosscher 2010-01-10 23:31:43 UTC

Subject: Bug 42621

Author: steven
Date: Sun Jan 10 23:31:30 2010
New Revision: 155796

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=155796
Log:
	PR rtl-optimization/42621
	* bb-reorder.c (gate_duplicated_computed_gotos): Only run if not
	optimizing for size.
	(duplicate_computed_gotos): Remove now-redundant check.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/bb-reorder.c

Comment 9 Richard Biener 2010-01-13 22:26:27 UTC

Fixed for 4.5 so far.

Comment 10 Carl 2010-01-18 13:14:34 UTC

Please note that computed gotos are factored out because "they are a hell to deal with" in tree-cfg.c:build_gimple_cfg(). This means that they MUST be unfactored out as promised in the comment without leaving this to another optimization step that may or may not be enabled.

Also, for our product there are 97 "extra jumps" and 95 of them are long jumps, i.e:

 12be0:  ff e1           jmp *%ecx
 ...
 12dda:  e9 01 fe ff ff  jmp 12be0 <main_loop+0x220>
 ...

so this is a serious pessimisation of both speed and size :(
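
To put a number on the size cost, here is a small sketch of how an x86 relative jmp is encoded. It is not from GCC or from the product mentioned above; the helper names jmp_disp and jmp_len are hypothetical. It assumes only the two standard encodings: the 2-byte short form (opcode eb plus an 8-bit displacement) and the 5-byte near form (opcode e9 plus a 32-bit displacement), with the displacement measured from the end of the jump instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Displacement stored in a relative jmp: the target address minus the
   address of the next instruction (jump address plus its length). */
int64_t jmp_disp(uint64_t from, uint64_t to, int insn_len) {
    assert(insn_len == 2 || insn_len == 5);
    return (int64_t)to - ((int64_t)from + insn_len);
}

/* Length of the shortest encoding: 2 bytes if the displacement fits in
   a signed 8-bit value, otherwise 5 bytes. Displacements outside the
   32-bit range are not handled in this sketch. */
int jmp_len(uint64_t from, uint64_t to) {
    int64_t d = jmp_disp(from, to, 2);
    return (d >= -128 && d <= 127) ? 2 : 5;
}
```

For the jump at 12dda targeting 12be0, the displacement is -511, which does not fit in 8 bits, hence the 5-byte encoding with displacement bytes 01 fe ff ff shown above. The jump at 4004f0 in the bug description targets 4004c0, a displacement of -50 (byte ce), so it fits the 2-byte form. With 95 long jumps at 5 bytes each, the extra jumps cost at least 475 bytes of code in that product.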

Comment 11 Jeffrey Yasskin 2010-07-14 20:49:34 UTC

Is this the same bug as PR 39284?

Comment 12 Jaak Ristioja 2011-06-10 08:50:42 UTC

(In reply to comment #9)
> Fixed for 4.5 so far.

Doesn't appear to be fixed in GCC 4.5.2 (under Gentoo Linux).

PS: The additional "jmp" instruction (as in the bug description) even appears to be generated in case of -O0.

PPS: As noted by other, this bug is likely a duplicate to bug 39284 and bug 43868.

Comment 13 Jaak Ristioja 2011-06-10 08:52:47 UTC

(In reply to comment #12)
> PPS: As noted by other, this bug is likely a duplicate to bug 39284 and bug
> 43868.

As noted by others, this bug is likely a duplicate to bug 39284 and bug 43686. Sorry about the typos.

Comment 14 Jakub Jelinek 2012-03-13 13:04:41 UTC

Fixed in 4.5+, 4.4 is no longer supported.