Bug 37734 - Missing optimization: gcc fails to reuse flags from already calculated expression for condition check with zero
Summary: Missing optimization: gcc fails to reuse flags from already calculated expression for condition check with zero
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 4.3.2
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2008-10-04 02:39 UTC by Siarhei Siamashka
Modified: 2011-06-01 14:31 UTC
CC List: 3 users

See Also:
Host: i686-pc-linux-gnu
Target: i686-pc-linux-gnu
Build: i686-pc-linux-gnu
Known to work:
Known to fail: 4.5.1
Last reconfirmed: 2010-08-15 18:59:54


Description Siarhei Siamashka 2008-10-04 02:39:46 UTC
For the following source:
/****************************************/
extern void a();

int unrolled_loop_fn(int count)
{
    while ((count -= 2) >= 0) {
        a();
        a();
    }
    if (count & 1) {
        a();
    }
}
/****************************************/

'gcc -O2 -c test.c' produces the following quite suboptimal code (note the test %esi,%esi at offset 0xb, even though the sub $0x2,%esi at 0x8 has already set the flags):

00000000 <unrolled_loop_fn>:
   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   56                      push   %esi
   4:   8b 75 08                mov    0x8(%ebp),%esi
   7:   53                      push   %ebx
   8:   83 ee 02                sub    $0x2,%esi
   b:   85 f6                   test   %esi,%esi
   d:   89 f0                   mov    %esi,%eax
   f:   78 1c                   js     2d <unrolled_loop_fn+0x2d>
  11:   89 f3                   mov    %esi,%ebx
  13:   90                      nop
  14:   8d 74 26 00             lea    0x0(%esi),%esi
  18:   e8 fc ff ff ff          call   19 <unrolled_loop_fn+0x19>
  1d:   e8 fc ff ff ff          call   1e <unrolled_loop_fn+0x1e>
  22:   83 eb 02                sub    $0x2,%ebx
  25:   79 f1                   jns    18 <unrolled_loop_fn+0x18>
  27:   83 e6 01                and    $0x1,%esi
  2a:   8d 46 fe                lea    -0x2(%esi),%eax
  2d:   a8 01                   test   $0x1,%al
  2f:   74 05                   je     36 <unrolled_loop_fn+0x36>
  31:   e8 fc ff ff ff          call   32 <unrolled_loop_fn+0x32>
  36:   5b                      pop    %ebx
  37:   5e                      pop    %esi
  38:   5d                      pop    %ebp
  39:   c3                      ret
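
For comparison, here is a hand-written sketch in GNU assembler (AT&T syntax) of how the same function could look if the flags set by the sub were reused directly. This is not compiler output, and the frame setup is only illustrative:

unrolled_loop_fn:
        push   %ebp
        mov    %esp,%ebp
        push   %ebx
        sub    $4,%esp            # keep the stack aligned across the calls
        mov    8(%ebp),%ebx       # count
        jmp    2f
1:      call   a
        call   a
2:      sub    $2,%ebx            # count -= 2 already sets SF/ZF...
        jns    1b                 # ...so the loop test needs no extra test/cmp
        and    $1,%bl             # bit 0 of count is unchanged by the sub $2 steps
        je     3f
        call   a                  # odd count: one more call
3:      add    $4,%esp
        pop    %ebx
        pop    %ebp
        ret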
Comment 1 Siarhei Siamashka 2008-10-04 02:48:32 UTC
For -Os optimization, the generated code is much better:

00000000 <unrolled_loop_fn>:
   0:   55                      push   %ebp
   1:   89 e5                   mov    %esp,%ebp
   3:   53                      push   %ebx
   4:   83 ec 04                sub    $0x4,%esp
   7:   8b 5d 08                mov    0x8(%ebp),%ebx
   a:   eb 0a                   jmp    16 <unrolled_loop_fn+0x16>
   c:   e8 fc ff ff ff          call   d <unrolled_loop_fn+0xd>
  11:   e8 fc ff ff ff          call   12 <unrolled_loop_fn+0x12>
  16:   83 eb 02                sub    $0x2,%ebx
  19:   79 f1                   jns    c <unrolled_loop_fn+0xc>
  1b:   80 e3 01                and    $0x1,%bl
  1e:   74 05                   je     25 <unrolled_loop_fn+0x25>
  20:   e8 fc ff ff ff          call   21 <unrolled_loop_fn+0x21>
  25:   5a                      pop    %edx
  26:   5b                      pop    %ebx
  27:   5d                      pop    %ebp
  28:   c3                      ret
Comment 2 Siarhei Siamashka 2010-08-15 01:01:17 UTC
Here is another test example, now with some performance numbers for gcc 4.5.1 on 64-bit Intel Atom:

$ cat fibbonachi.c
/*******************/
#include <stdlib.h>

int fib(int n)
{
    int sum, previous = -1, result = 1;

    n++;
    while (--n >= 0)
    {
        sum = result + previous;
        previous = result;
        result = sum;
    }

    return result;
}

int main(void)
{
    if (fib(1000000000) != 1532868155)
        abort();
    return 0;
}
/*******************/

$ gcc -O2 -march=atom -o fibbonachi-O2 fibbonachi.c
$ gcc -Os -march=atom -o fibbonachi-Os fibbonachi.c

$ time ./fibbonachi-O2

real    0m3.722s
user    0m3.652s
sys     0m0.000s

$ time ./fibbonachi-Os

real    0m3.078s
user    0m3.044s
sys     0m0.000s


Loop code for -O2 optimizations on x86-64:

  18:   89 d1                   mov    %edx,%ecx
  1a:   89 c2                   mov    %eax,%edx
  1c:   8d 7f ff                lea    -0x1(%rdi),%edi
  1f:   8d 04 0a                lea    (%rdx,%rcx,1),%eax
  22:   83 ff ff                cmp    $0xffffffffffffffff,%edi
  25:   75 f1                   jne    18 <fib+0x18>
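
The cmp against -1 is needed here only because lea does not update the flags. A hand-written sketch of the same loop with the decrement done by a flag-setting sub instead (illustrative only, not compiler output; register roles copied from the listing above):

.Lfib_loop:
        mov    %edx,%ecx
        mov    %eax,%edx
        lea    (%rdx,%rcx,1),%eax   # sum = result + previous
        sub    $1,%edi              # the decrement itself sets the flags...
        jns    .Lfib_loop           # ...so no separate cmp is needed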

Loop code for -Os optimizations on x86-64:

   c:   8d 0c 10                lea    (%rax,%rdx,1),%ecx
   f:   89 c2                   mov    %eax,%edx
  11:   89 c8                   mov    %ecx,%eax
  13:   ff cf                   dec    %edi
  15:   79 f5                   jns    c <fib+0xc>



Also on ARM, the loop code is suboptimal in all cases (a single "subs + bge" pair would do, with no need for a separate "cmn"/"cmp"; see the hand-written sketch after the listings below):

-O2 on ARM:
  10:   e2433001        sub     r3, r3, #1
  14:   e0820001        add     r0, r2, r1
  18:   e3730001        cmn     r3, #1
  1c:   e1a01002        mov     r1, r2
  20:   e1a02000        mov     r2, r0
  24:   1afffff9        bne     10 <fib+0x10>

-Os on ARM:
   c:   e0831002        add     r1, r3, r2
  10:   e2400001        sub     r0, r0, #1
  14:   e1a02003        mov     r2, r3
  18:   e1a03001        mov     r3, r1
  1c:   e3500000        cmp     r0, #0
  20:   aafffff9        bge     c <fib+0xc>

-Os -mthumb on ARM:
   8:   1899            adds    r1, r3, r2
   a:   3801            subs    r0, #1
   c:   461a            mov     r2, r3
   e:   460b            mov     r3, r1
  10:   2800            cmp     r0, #0
  12:   daf9            bge.n   8 <fib+0x8>
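
For reference, a hand-written sketch of the loop using the "subs + bge" form mentioned above (illustrative only, not compiler output; register assignment copied from the -Os listing). The plain mov instructions do not modify the flags, so the result of subs survives until the branch:

.Lfib_loop:
        add     r1, r3, r2      @ sum = result + previous
        subs    r0, r0, #1      @ n-- and set the condition flags in one step
        mov     r2, r3
        mov     r3, r1
        bge     .Lfib_loop      @ loop while n >= 0, no cmp/cmn needed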


The x86 and ARM code show the same pattern here: with -O2, the redundant comparison is performed against the constant -1 in both cases.
Comment 3 Siarhei Siamashka 2010-10-04 23:19:26 UTC
So if I understand it correctly, there are two independent performance issues here:
1. one in the middle-end (the redundant comparison against -1) when -O2 is selected;
2. another in the ARM target, which fails to produce efficient code even with -Os, while x86 does.

I just remembered that Mozilla has been using -Os up until now because it provided the best performance for them:
http://gcc.gnu.org/ml/gcc/2010-06/msg00715.html
I wonder whether this particular missed optimization contributes to the occasional performance advantage of -Os over -O2 (beyond the smaller code size and reduced pressure on the instruction cache). In any case, simple loops and branches in gcc-generated code tend to contain redundant instructions.