19672 – [3.4 Regression] Performance regression in simple loop code

Bug 19672 - [3.4 Regression] Performance regression in simple loop code

Summary: [3.4 Regression] Performance regression in simple loop code

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	3.4.3

Importance:	P2 normal
Target Milestone:	4.0.3
Assignee:	Paolo Bonzini

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2005-01-28 16:20 UTC by Stephan Bergmann
Modified:	2005-11-21 02:26 UTC (History)
CC List:	2 users (show)

See Also:
Host:
Target:	i686--
Build:
Known to work:	2.95.3 3.2.3 4.0.3 4.1.0
Known to fail:	3.3.3 3.4.5 4.0.2
Last reconfirmed:	2005-06-27 07:39:38

Attachments
patch to fix the bug (525 bytes, patch) 2005-10-18 11:41 UTC, Paolo Bonzini	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Stephan Bergmann 2005-01-28 16:20:36 UTC

In the following C source file test.c,

  int compare(char const * p1, int n1, char const * p2, int n2) {
    char const * q1 = p1 + n1;
    char const * q2 = p2 + n2;
    while (p1 < q1 && p2 < q2) {
      int n = *--q1 - *--q2;
      if (n) {
        return n;
      }
    }
    return n1 - n2;
  }
  int main(void) {
    char str[1000];
    int i;
    for (i = 0; i < 1000000; ++i) {
      compare(str, 1000, str, 1000);
    }
  }

compiled with

  gcc -O2 test.c

the loop body within compare takes 14 instructions on GCC 3.4.3 Linux x86,
compared to only 11 instructions on GCC 3.2.2 Linux x86 (see disassemblies
below), and the GCC 3.4.3 output takes substantially longer to run than the GCC
3.2.2 output:

  3.4.3> time ./a.out
  4.690u 0.001s 0:04.71 99.5%

  3.2.2> time ./a.out
  3.533u 0.002s 0:03.55 99.4%

This seems to bite us on OpenOffice.org, which contains an oft-called function
similar to compare above, and new versions of OOo (built with GCC 3.4) run
slower than old versions (built with GCC 3.2).  The question that remains for us
is whether this performance loss is specific to the given function, or could be
a general problem.

The dissasemblies:
08048360 <compare>: ! gcc 3.4.3
8048360: 55                      push   %ebp
8048361: 89 e5                   mov    %esp,%ebp
8048363: 57                      push   %edi
8048364: 56                      push   %esi
8048365: 53                      push   %ebx
8048366: 8b 45 0c                mov    0xc(%ebp),%eax
8048369: 8b 7d 08                mov    0x8(%ebp),%edi
804836c: 8b 75 10                mov    0x10(%ebp),%esi
804836f: 8d 1c 07                lea    (%edi,%eax,1),%ebx
8048372: 8b 45 14                mov    0x14(%ebp),%eax
8048375: 8d 0c 06                lea    (%esi,%eax,1),%ecx
8048378: 90                      nop
8048379: 8d b4 26 00 00 00 00    lea    0x0(%esi),%esi
8048380: 39 df                   cmp    %ebx,%edi      ! <-+
8048382: 0f 92 c0                setb   %al            !   |
8048385: 31 d2                   xor    %edx,%edx      !   |
8048387: 39 ce                   cmp    %ecx,%esi      !   |
8048389: 0f 92 c2                setb   %dl            !   |
804838c: 85 d0                   test   %edx,%eax      !   |
804838e: 74 13                   je     80483a3        !   |
8048390: 4b                      dec    %ebx           !   |
8048391: 49                      dec    %ecx           !   |
8048392: 0f be 01                movsbl (%ecx),%eax    !   |
8048395: 0f be 13                movsbl (%ebx),%edx    !   |
8048398: 29 c2                   sub    %eax,%edx      !   |
804839a: 89 d0                   mov    %edx,%eax      !   |
804839c: 74 e2                   je     8048380        ! --+
804839e: 5b                      pop    %ebx
804839f: 5e                      pop    %esi
80483a0: 5f                      pop    %edi
80483a1: 5d                      pop    %ebp
80483a2: c3                      ret
80483a3: 5b                      pop    %ebx
80483a4: 8b 45 0c                mov    0xc(%ebp),%eax
80483a7: 8b 55 14                mov    0x14(%ebp),%edx
80483aa: 5e                      pop    %esi
80483ab: 29 d0                   sub    %edx,%eax
80483ad: 5f                      pop    %edi
80483ae: 5d                      pop    %ebp
80483af: c3                      ret

08048370 <compare>: ! gcc 3.2.2
8048370: 55                      push   %ebp
8048371: 89 e5                   mov    %esp,%ebp
8048373: 57                      push   %edi
8048374: 8b 45 0c                mov    0xc(%ebp),%eax
8048377: 56                      push   %esi
8048378: 8b 7d 08                mov    0x8(%ebp),%edi
804837b: 53                      push   %ebx
804837c: 8b 75 10                mov    0x10(%ebp),%esi
804837f: 8d 1c 38                lea    (%eax,%edi,1),%ebx
8048382: 8b 45 14                mov    0x14(%ebp),%eax
8048385: 39 df                   cmp    %ebx,%edi
8048387: 8d 0c 30                lea    (%eax,%esi,1),%ecx
804838a: 73 1a                   jae    80483a6
804838c: 39 ce                   cmp    %ecx,%esi
804838e: 73 16                   jae    80483a6
8048390: 4b                      dec    %ebx           ! <-+
8048391: 49                      dec    %ecx           !   |
8048392: 0f be 01                movsbl (%ecx),%eax    !   |
8048395: 0f be 13                movsbl (%ebx),%edx    !   |
8048398: 29 c2                   sub    %eax,%edx      !   |
804839a: 89 d0                   mov    %edx,%eax      !   |
804839c: 75 10                   jne    80483ae        !   |
804839e: 39 df                   cmp    %ebx,%edi      !   |
80483a0: 73 04                   jae    80483a6        !   |
80483a2: 39 ce                   cmp    %ecx,%esi      !   |
80483a4: 72 ea                   jb     8048390        ! --+
80483a6: 8b 45 0c                mov    0xc(%ebp),%eax
80483a9: 8b 55 14                mov    0x14(%ebp),%edx
80483ac: 29 d0                   sub    %edx,%eax
80483ae: 5b                      pop    %ebx
80483af: 5e                      pop    %esi
80483b0: 5f                      pop    %edi
80483b1: 5d                      pop    %ebp
80483b2: c3                      ret

Comment 1 Andrew Pinski 2005-01-28 19:03:08 UTC

I cannot test this right now but this might be fixed on the mainline.

Comment 2 Andrew Pinski 2005-01-30 06:39:21 UTC

Hmm, this looks like a branch cost problem or related to that.

I think you can get all the speed back by supplying -mbranch-cost=1 but I could be wrong.

Hmm, I wonder if 3.2.x was compiled for i486 (or i386) instead of i686 which means this might not be 
a regression.
Can you give the output of "gcc -v" for both GCC's?

Comment 3 Stephan Bergmann 2005-01-31 09:09:15 UTC

"I think you can get all the speed back by supplying -mbranch-cost=1 but I could
be wrong."

No, adding -mbranch-cost=1 leads to only a very minor performance improvement.

"Can you give the output of "gcc -v" for both GCC's?"

3.4.3> gcc -v
Reading specs from
/export/home/sb93797/gcc-3.4.3-inst/lib/gcc/i686-pc-linux-gnu/3.4.3/specs
Configured with: ../gcc-3.4.3/configure --prefix=/export/home/sb93797/gcc-3.4.3-inst
Thread model: posix
gcc version 3.4.3

3.2.2> gcc -v
Reading specs from
/so/env/gcc_3.2.2_linux_libc2.20/bin/../lib/gcc-lib/i686-pc-linux-gnu/3.2.2/specs
Configured with: /export/home/obo/gcc-3.2.2/configure
--prefix=/net/grande.germany/develop6/update/dev/gcc_3.2.2_linux_libc2.20
--with-as=/net/grande.germany/develop6/update/dev/gcc_3.2.2_linux_libc2.20/bin/as
--with-ld=/net/grande.germany/develop6/update/dev/gcc_3.2.2_linux_libc2.20/bin/ld
--enable-languages=c,c++
Thread model: posix
gcc version 3.2.2

Comment 4 Andrew Pinski 2005-06-13 04:06:52 UTC

Confirmed.

Comment 5 dank 2005-06-27 06:59:19 UTC

I just reproduced this on two flavors of pentium 4 using -O3 -mtune=pentium.
The regression is worse, sometimes much worse, with -fPIC.

Times for gcc-2.95.3, gcc-3.4.3, gcc-4.0.0, gcc-4.1-20050603:

test run 1:
/proc/cpuinfo: cpu family 15, model 2, Intel Pentium 4 CPU 2.80GHz.
no -fPIC:   2.2, 2.7. 0.6. 0.6 seconds
with fPIC:  2.2, 2.9, 2.9, 2.9 seconds

test run 2:
/proc/cpuinfo: cpu family 15, model 4, Intel Pentium 4 CPU 3.20GHz
no -fPIC:   1.9, 3.2, 0.5, n/a seconds
with fPIC:  1.9, 3.6, 3.6, n/a seconds

Comment 6 Steven Bosscher 2005-06-27 07:41:58 UTC

For gcc4, the no-PIC case looks pretty good to me ;-)

Comment 7 Steven Bosscher 2005-06-27 07:59:08 UTC

Dan, can you show the assembler output for 2.95.3 and 4.0 (why is 4.1 "n/a"??) 
for the -fPIC case?

Comment 8 Andrew Pinski 2005-07-22 21:13:03 UTC

Moving to 4.0.2 pre Mark.

Comment 9 Paolo Bonzini 2005-10-18 09:48:11 UTC

Can anybody try killloop-branch?  Just a wild guess.

Comment 10 Paolo Bonzini 2005-10-18 09:55:09 UTC

There is also a (minor) register allocation problem in both outputs:

8048392: 0f be 01                movsbl (%ecx),%eax    !   |
8048395: 0f be 13                movsbl (%ebx),%edx    !   |
8048398: 29 c2                   sub    %eax,%edx      !   |
804839a: 89 d0                   mov    %edx,%eax      !   |
804839c: 74 e2                   je     8048380        ! --+

This could obviously become

8048392: 0f be 11                movsbl (%ecx),%edx
8048395: 0f be 03                movsbl (%ebx),%eax
8048398: 2b c2                   sub    %edx,%eax
804839c: 74 e2                   je     8048380

Comment 11 Paolo Bonzini 2005-10-18 10:07:24 UTC

The code I get from -mbranch-cost=1 is extremely similar to what 3.2.2 produced:

.L7:
        cmpl    %ebx, %edi
        jae     .L2
        cmpl    %ecx, %esi
        jae     .L2
        decl    %ebx
        movsbl  -1(%ecx),%eax
        decl    %ecx
        movsbl  (%ebx),%edx
        subl    %eax, %edx
        movl    %edx, %eax
        je      .L7
.L2:

The front-end is correctly producing a TRUTH_AND_EXPR, i.e. the short-circuiting is not required.  RTL expansion will use jae (short-circuited evaluation) or seta, depending on the branch cost.

However, the tuning here is wrong because on a pentium4 I do get a 40% speedup from -mbranch-cost=1.

Comment 12 Paolo Bonzini 2005-10-18 11:41:30 UTC

Created attachment 10016 [details]
patch to fix the bug

Indeed, RTL expansion will use jae (short-circuited evaluation) or seta, depending on the branch cost, but in a very involuted way which screws up the tuning.

This patch makes it more direct.

Comment 13 Paolo Bonzini 2005-10-18 16:04:28 UTC

The patch is not good because I forgot to check fallthrough edges.  The code I added goes above, just before "case CALL_EXPR:".

Restarting a bootstrap with a correct patch.

Comment 14 GCC Commits 2005-10-19 10:37:37 UTC

Subject: Bug 19672

CVSROOT:	/cvs/gcc
Module name:	gcc
Changes by:	bonzini@gcc.gnu.org	2005-10-19 10:37:31

Modified files:
	gcc            : ChangeLog dojump.c 

Log message:
	2005-10-18  Paolo Bonzini  <bonzini@gnu.org>
	
	PR #19672
	* dojump.c (do_jump): Handle TRUTH_AND_EXPR and TRUTH_OR_EXPR here.

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&r1=2.10182&r2=2.10183
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/dojump.c.diff?cvsroot=gcc&r1=1.42&r2=1.43

Comment 15 GCC Commits 2005-10-19 10:41:16 UTC

Subject: Bug 19672

CVSROOT:	/cvs/gcc
Module name:	gcc
Branch: 	gcc-4_0-branch
Changes by:	bonzini@gcc.gnu.org	2005-10-19 10:41:09

Modified files:
	gcc            : dojump.c ChangeLog 

Log message:
	2005-10-18  Paolo Bonzini  <bonzini@gnu.org>
	
	PR #19672
	* dojump.c (do_jump): Handle TRUTH_AND_EXPR and TRUTH_OR_EXPR here.

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/dojump.c.diff?cvsroot=gcc&only_with_tag=gcc-4_0-branch&r1=1.37.8.1&r2=1.37.8.2
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&only_with_tag=gcc-4_0-branch&r1=2.7592.2.470&r2=2.7592.2.471

Comment 16 GCC Commits 2005-10-19 10:50:00 UTC

Subject: Bug 19672

CVSROOT:	/cvs/gcc
Module name:	gcc
Branch: 	gcc-3_4-branch
Changes by:	bonzini@gcc.gnu.org	2005-10-19 10:49:57

Modified files:
	gcc            : ChangeLog dojump.c 

Log message:
	2005-10-19  Paolo Bonzini  <bonzini@gnu.org>
	
	PR #19672
	* dojump.c (do_jump): Handle TRUTH_AND_EXPR and TRUTH_OR_EXPR
	like TRUTH_ANDIF_EXPR and TRUTH_ORIF_EXPR, if the branch cost
	is low enough.

Patches:
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/ChangeLog.diff?cvsroot=gcc&only_with_tag=gcc-3_4-branch&r1=2.2326.2.923&r2=2.2326.2.924
http://gcc.gnu.org/cgi-bin/cvsweb.cgi/gcc/gcc/dojump.c.diff?cvsroot=gcc&only_with_tag=gcc-3_4-branch&r1=1.9.4.2&r2=1.9.4.3

Comment 17 Paolo Bonzini 2005-10-19 10:57:45 UTC

Patches committed to all active branches.

Comment 18 Paolo Bonzini 2005-11-05 12:15:27 UTC

Patch had to be reverted on 3.4

Comment 19 Gabriel Dos Reis 2005-11-21 02:26:32 UTC

Fixed for 4.0.x and higher.
Won't fix in 3.4.x