Bug 30354 - -Os doesn't optimize a/CONST even if it saves size.
Status: ASSIGNED
Product: gcc
Classification: Unclassified
Component: target
Version: 4.1.1
Importance: P3 normal
Target Milestone: ---
Assigned To: Jan Hubicka
Keywords: missed-optimization
Depends on:
Blocks: 16996 37515 4.7pending
 
Reported: 2007-01-02 22:09 UTC by Denis Vlasenko
Modified: 2013-01-18 10:29 UTC
CC List: 5 users

See Also:
Host:
Target: i386-pc-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed: 2012-06-29 00:00:00


Attachments
Fix: adjust div cost for -Os on i386 (1.44 KB, patch)
2007-07-25 15:05 UTC, Denis Vlasenko
Auto-generated test program with 15000 constant divs/mods (65.56 KB, application/octet-stream)
2007-07-25 15:09 UTC, Denis Vlasenko
Test program generator (1.82 KB, text/plain)
2007-07-25 15:17 UTC, Denis Vlasenko
Comparison of generated code with 4.4.svn.20090528 on x86_64 (11.12 KB, text/plain)
2009-06-21 16:12 UTC, Denis Vlasenko
Comparison of generated code with 4.4.svn.20090528 on i386 (664.08 KB, text/plain)
2009-06-21 16:48 UTC, Denis Vlasenko

Description Denis Vlasenko 2007-01-02 22:09:42 UTC
gcc -O2 usually optimizes a/CONST. In many cases the code is not only significantly faster, but also smaller:

unsigned f(unsigned a) { return a/10; }

gcc 4.1.1 -O2:
        movl    $-858993459, %eax
        mull    4(%esp)
        shrl    $3, %edx
        movl    %edx, %eax
        ret

gcc 4.1.1 -Os:
        movl    4(%esp), %eax
        movl    $10, %edx
        movl    %edx, %ecx
        xorl    %edx, %edx
        divl    %ecx
        ret

Unfortunately, gcc -Os never uses this optimization.

Note that with the code proposed in bug 28417, a/CONST can be optimized even further (we can use a smaller mul constant and avoid the shrl) when we know from VRP that the value of a is always small enough.
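
For reference, here is the reciprocal trick the -O2 code uses, written out in C (-858993459 is 0xCCCCCCCD = ceil(2^35 / 10), and the mull plus shrl $3 of %edx amount to a 35-bit right shift; a sketch for illustration, not gcc's internals):

#include <assert.h>
#include <stdint.h>

/* a / 10 via multiply-by-reciprocal: the mull above leaves the high
   32 bits of a * 0xCCCCCCCD in %edx, and shrl $3 completes a 35-bit
   right shift. 0xCCCCCCCD is ceil(2^35 / 10), so this is exact for
   every 32-bit a. */
static unsigned div10(unsigned a)
{
    return (unsigned)(((uint64_t)a * 0xCCCCCCCDu) >> 35);
}

int main(void)
{
    for (uint64_t a = 0; a <= 0xFFFFFFFFu; a += 9973)  /* sparse check */
        assert(div10((unsigned)a) == (unsigned)a / 10);
    return 0;
}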
Comment 1 Andrew Pinski 2007-01-02 22:45:31 UTC
I think this is really a cost issue with the x86 back-end, rather than with the middle-end.
Comment 2 Denis Vlasenko 2007-07-25 15:05:36 UTC
Created attachment 13973 [details]
Fix: adjust div cost for -Os on i386

The patch was tested with 4.2.1; I guess it will apply to other versions of gcc as well, since it is quite trivial.

A test program with lots of randomly-generated constant divisions and %'s was compiled by the patched gcc with different costs of division:

   text    data     bss     dec     hex filename
 257731       0       0  257731   3eec3 421.org/t-Os.o
 256788       0       0  256788   3eb14 421.7/t-Os.o
 256788       0       0  256788   3eb14 421.8/t-Os.o
 257377       0       0  257377   3ed61 421.9/t-Os.o
 257825       0       0  257825   3ef21 421.10/t-Os.o

It seems that (at least on 4.2.1) a cost of 8 gives the smallest code.

Among the 15000 divisions in the test program, only signed divisions by a power of two grew in size (but they became MUCH faster, as they don't even use a multiply now, let alone a div):

@@ -1703 +1703 @@
-0000000f T id_x_16
+00000012 T id_x_16
@@ -1836 +1836 @@
-0000000f T id_x_2
+0000000e T id_x_2
@@ -2030 +2030 @@
-0000000f T id_x_32
+00000012 T id_x_32

id_x_16 was:
        movl    4(%esp), %eax
        movl    $16, %edx
        movl    %edx, %ecx
        cltd
        idivl   %ecx
        ret
Now it is:
        movl    4(%esp), %edx
        movl    %edx, %eax
        sarl    $31, %eax
        andl    $15, %eax
        addl    %edx, %eax
        sarl    $4, %eax
        ret

Also, unsigned_x / 28, unsigned_x / 13952 and unsigned_x / 56 each grew by 1 byte.

The rest either were unchanged (and still use the div insn) or shrank (typically by 1 byte; the record holders are "unsigned_x / 641" and "unsigned_x / 6700417", which shrank by 4 bytes).
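
For reference, the new id_x_16 corresponds roughly to this C (a sketch; it relies on >> of a negative int being an arithmetic shift, which is implementation-defined in ISO C but is what gcc generates here):

/* Signed x / 16 without a divide: bias negative values by 15 so the
   arithmetic right shift rounds toward zero, as C division requires. */
int id_x_16(int x)
{
    int bias = (x >> 31) & 15;   /* 15 if x < 0, else 0 */
    return (x + bias) >> 4;
}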
Comment 3 Denis Vlasenko 2007-07-25 15:09:37 UTC
Created attachment 13974 [details]
Auto-generated test program with 15000 constant divs/mods

Test program, bzipped.

Build with
gcc -fomit-frame-pointer -Os -c t.c -o t-Os.o
and see sizes:
nm --size-sort t-Os.o | sort -k3
Comment 4 Denis Vlasenko 2007-07-25 15:17:33 UTC
Created attachment 13975 [details]
Test program generator

Test program was generated using gen_test.c
Comment 5 Denis Vlasenko 2007-07-25 15:22:25 UTC
Forgot to mention:

* the generator tests signed and unsigned division and modulus, both const / x and x / const, and also tests the "u = a / b; v = a % b;" construct.

* you need to filter the gen_test output to weed out dups:

./gen_test | sort | uniq >t.c
Comment 6 Bernhard Reutner-Fischer 2009-06-05 16:19:09 UTC
CC'ing honza as i386 maintainer
Comment 7 Jan Hubicka 2009-06-06 13:41:23 UTC
It seems to make sense to bump the cost of idiv a bit, given the fact that there are register pressure implications.

I would, however, like to understand which of the code sequences we produce are estimated to be long but end up shorter in practice.  Would it be possible to give me some examples of constants where it is important to bump the cost to 8?  It is possible we can simply fix the cost estimation in divmod expansion instead.

Honza 
Comment 8 Denis Vlasenko 2009-06-21 16:11:14 UTC
(In reply to comment #7)
> It seems to make sense to bump the cost of idiv a bit, given the fact that
> there are register pressure implications.
> 
> I would, however, like to understand which of the code sequences we produce
> are estimated to be long but end up shorter in practice.  Would it be
> possible to give me some examples of constants where it is important to bump
> the cost to 8?  It is possible we can simply fix the cost estimation in
> divmod expansion instead.

The attached t.c.bz2 is a good source file to experiment with.

With last month's svn snapshot of gcc, I did the following:

/usr/app/gcc-4.4.svn.20090528/bin/gcc -g0 -Os -fomit-frame-pointer -ffunction-sections -c t.c
objdump -dr t.o >t.asm

with and without the patch, and compared the results. (-ffunction-sections is used merely because it makes the "objdump -dr" output much more suitable for diffing.)

Here is the diff between unpatched and patched gcc's code generated for int_x / 16:

 Disassembly of section .text.id_x_16:
 0000000000000000 <id_x_16>:
-   0:  89 f8                   mov    %edi,%eax
-   2:  ba 10 00 00 00          mov    $0x10,%edx
-   7:  89 d1                   mov    %edx,%ecx
-   9:  99                      cltd
-   a:  f7 f9                   idiv   %ecx
-   c:  c3                      retq
+   0:  8d 47 0f                lea    0xf(%rdi),%eax
+   3:  85 ff                   test   %edi,%edi
+   5:  0f 49 c7                cmovns %edi,%eax
+   8:  c1 f8 04                sar    $0x4,%eax
+   b:  c3                      retq

int_x / 2:

 Disassembly of section .text.id_x_2:
 0000000000000000 <id_x_2>:
    0:  89 f8                   mov    %edi,%eax
-   2:  ba 02 00 00 00          mov    $0x2,%edx
-   7:  89 d1                   mov    %edx,%ecx
-   9:  99                      cltd
-   a:  f7 f9                   idiv   %ecx
-   c:  c3                      retq
+   2:  c1 e8 1f                shr    $0x1f,%eax
+   5:  01 f8                   add    %edi,%eax
+   7:  d1 f8                   sar    %eax
+   9:  c3                      retq

As you can see, the code became smaller and *much* faster (not even a mul insn is used now).

Here is an example of unsigned_x / 641. In this case, the code size is the same, but the code is faster:

 Disassembly of section .text.ud_x_641:
 0000000000000000 <ud_x_641>:
-   0:  ba 81 02 00 00          mov    $0x281,%edx
-   5:  89 f8                   mov    %edi,%eax
-   7:  89 d1                   mov    %edx,%ecx
-   9:  31 d2                   xor    %edx,%edx
-   b:  f7 f1                   div    %ecx
+   0:  89 f8                   mov    %edi,%eax
+   2:  48 69 c0 81 3d 66 00    imul   $0x663d81,%rax,%rax
+   9:  48 c1 e8 20             shr    $0x20,%rax
    d:  c3                      retq

There is not a single instance of code growth. Either the newer gcc is better, or maybe the code growth cases occur in 32-bit code only.

I will attach t64.asm.diff; take a look if you want to see all the changes in the generated code.
Comment 9 Denis Vlasenko 2009-06-21 16:12:15 UTC
Created attachment 18040 [details]
Comparison of generated code with 4.4.svn.20090528 on x86_64
Comment 10 Richard Biener 2009-06-21 16:25:55 UTC
Do we have correct size estimates on idiv with a constant argument at all?
I don't see length attributes on it ...
Comment 11 Denis Vlasenko 2009-06-21 16:47:33 UTC
In 32-bit code, there are indeed a few cases of code growth.

Here is a full list (id_XXX are signed divides, ud_XXX are unsigned ones):
-00000000 0000000f T id_x_4
+00000000 00000012 T id_x_4
-00000000 0000000f T id_x_8
+00000000 00000012 T id_x_8
-00000000 0000000f T id_x_16
+00000000 00000012 T id_x_16
-00000000 0000000f T id_x_32
+00000000 00000012 T id_x_32

-00000000 00000010 T ud_x_28
+00000000 00000015 T ud_x_28
-00000000 00000010 T ud_x_56
+00000000 00000015 T ud_x_56
-00000000 00000010 T ud_x_13952
+00000000 00000015 T ud_x_13952

They fall into two groups. Signed divisions by a power of 2 grew by 3 bytes, but they are *much faster* now; considering how often people type "x / 4" and think "this will be optimized to a shift", forgetting that their x is signed and will therefore get a divide insn (!), I see it as a good trade. Code comparison:

 00000000 <id_x_16>:
-   0:  8b 44 24 04             mov    0x4(%esp),%eax
-   4:  ba 10 00 00 00          mov    $0x10,%edx
-   9:  89 d1                   mov    %edx,%ecx
-   b:  99                      cltd
-   c:  f7 f9                   idiv   %ecx
-   e:  c3                      ret
+   0:  8b 54 24 04             mov    0x4(%esp),%edx
+   4:  89 d0                   mov    %edx,%eax
+   6:  c1 f8 1f                sar    $0x1f,%eax
+   9:  83 e0 0f                and    $0xf,%eax
+   c:  01 d0                   add    %edx,%eax
+   e:  c1 f8 04                sar    $0x4,%eax
+  11:  c3                      ret

The second group is just a few rare cases where the "multiply by reciprocal" optimization happens to require more processing and the code is 5 bytes longer:

 00000000 <ud_x_56>:
-   0:  8b 44 24 04             mov    0x4(%esp),%eax
-   4:  ba 38 00 00 00          mov    $0x38,%edx
-   9:  89 d1                   mov    %edx,%ecx
-   b:  31 d2                   xor    %edx,%edx
-   d:  f7 f1                   div    %ecx
-   f:  c3                      ret
+   0:  53                      push   %ebx
+   1:  8b 4c 24 08             mov    0x8(%esp),%ecx
+   5:  bb 25 49 92 24          mov    $0x24924925,%ebx
+   a:  c1 e9 03                shr    $0x3,%ecx
+   d:  89 c8                   mov    %ecx,%eax
+   f:  f7 e3                   mul    %ebx
+  11:  5b                      pop    %ebx
+  12:  89 d0                   mov    %edx,%eax
+  14:  c3                      ret
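
For illustration, here is what the longer sequence computes, written in C: 56 = 8*7, so the code shifts right by 3 and then multiplies by a reciprocal of 7, and the pre-shift shrinks the dividend enough for the 32-bit reciprocal 0x24924925 to be exact. A sketch, not gcc output:

#include <stdint.h>

/* x / 56 as expanded above: 56 == 8 * 7, so divide by 8 with a shift,
   then divide by 7 via its 32-bit reciprocal ceil(2^32 / 7). */
unsigned ud_x_56(unsigned x)
{
    unsigned q = x >> 3;                                   /* x / 8 */
    return (unsigned)(((uint64_t)q * 0x24924925u) >> 32);  /* q / 7 */
}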

This is rare - only three cases in the entire t.c.bz2.

They are far outweighed by 474 cases where the code got smaller.

Most of them save only one byte. For example, unsigned_x / 100:

 00000000 <ud_x_100>:
-   0:  8b 44 24 04             mov    0x4(%esp),%eax
-   4:  ba 64 00 00 00          mov    $0x64,%edx
-   9:  89 d1                   mov    %edx,%ecx
-   b:  31 d2                   xor    %edx,%edx
-   d:  f7 f1                   div    %ecx
-   f:  c3                      ret
+   0:  b8 1f 85 eb 51          mov    $0x51eb851f,%eax
+   5:  f7 64 24 04             mull   0x4(%esp)
+   9:  89 d0                   mov    %edx,%eax
+   b:  c1 e8 05                shr    $0x5,%eax
+   e:  c3                      ret

Some cases got shorter by 2 or 4 bytes:
-00000000 00000010 T ud_x_3
+00000000 0000000e T ud_x_3
-00000000 00000010 T ud_x_9
+00000000 0000000e T ud_x_9
-00000000 00000010 T ud_x_67
+00000000 0000000e T ud_x_67
-00000000 00000010 T ud_x_641
+00000000 0000000c T ud_x_641
-00000000 00000010 T ud_x_6700417
+00000000 0000000c T ud_x_6700417

For example, unsigned_x / 9:

 00000000 <ud_x_9>:
-   0:  8b 44 24 04             mov    0x4(%esp),%eax
-   4:  ba 09 00 00 00          mov    $0x9,%edx
-   9:  89 d1                   mov    %edx,%ecx
-   b:  31 d2                   xor    %edx,%edx
-   d:  f7 f1                   div    %ecx
-   f:  c3                      ret
+   0:  b8 39 8e e3 38          mov    $0x38e38e39,%eax
+   5:  f7 64 24 04             mull   0x4(%esp)
+   9:  89 d0                   mov    %edx,%eax
+   b:  d1 e8                   shr    %eax
+   d:  c3                      ret

and unsigned_x / 641:

 00000000 <ud_x_641>:
-   0:  8b 44 24 04             mov    0x4(%esp),%eax
-   4:  ba 81 02 00 00          mov    $0x281,%edx
-   9:  89 d1                   mov    %edx,%ecx
-   b:  31 d2                   xor    %edx,%edx
-   d:  f7 f1                   div    %ecx
-   f:  c3                      ret
+   0:  b8 81 3d 66 00          mov    $0x663d81,%eax
+   5:  f7 64 24 04             mull   0x4(%esp)
+   9:  89 d0                   mov    %edx,%eax
+   b:  c3                      ret
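
Why these two divisors are the record holders: 641 * 6700417 = 2^32 + 1 (Euler's factorization of the Fermat number F5), so the reciprocal constant 6700417 = (2^32 + 1)/641 is exact and the quotient is simply the high half of the product, with no shift at all. A sketch in C, for illustration only:

#include <stdint.h>

/* 641 * 6700417 == 2^32 + 1, so the high 32 bits of a * 6700417 are
   exactly a / 641 for every 32-bit a; the mull above needs no shr. */
unsigned ud_x_641(unsigned a)
{
    return (unsigned)(((uint64_t)a * 6700417u) >> 32);
}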

I will attach t32.asm.diff now
Comment 12 Denis Vlasenko 2009-06-21 16:48:11 UTC
Created attachment 18041 [details]
Comparison of generated code with 4.4.svn.20090528 on i386
Comment 13 Jan Hubicka 2009-06-30 13:36:11 UTC
Hmm,
looking at the cases, it seems that the main reason for the win is that idiv needs an integer load instruction with a long immediate, and we don't optimize these well for -Os.

I suppose the following is wrong for -Os:
    case CONST_INT:
    case CONST: 
    case LABEL_REF:
    case SYMBOL_REF:
      if (TARGET_64BIT && !x86_64_immediate_operand (x, VOIDmode))
        *total = 3;
      else if (TARGET_64BIT && !x86_64_zext_immediate_operand (x, VOIDmode))
        *total = 2;
      else if (flag_pic && SYMBOLIC_CONST (x)
               && (!TARGET_64BIT
                   || (!GET_CODE (x) != LABEL_REF
                       && (GET_CODE (x) != SYMBOL_REF
                           || !SYMBOL_REF_LOCAL_P (x)))))
        *total = 1;
      else  
        *total = 0;
      return true;

It should probably return the actual size of a load instruction with a full-sized immediate, and the individual cases matching RTL codes should know when the instruction allows a cheap immediate operand encoding and prevent the recursion from counting the operand size itself.
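
A minimal sketch of the kind of byte-accurate estimate meant here (a hypothetical helper, not actual i386.c code; real encodings vary per instruction - most ALU insns have a sign-extended imm8 form, while mov $imm,%reg always carries a full imm32):

/* Hypothetical: the -Os cost, in bytes, of materializing an integer
   constant as an immediate operand, ignoring per-insn encoding quirks.  */
static int
os_immediate_cost (long long v)
{
  if (v >= -128 && v <= 127)
    return 1;   /* fits a sign-extended imm8 */
  if (v >= -2147483648LL && v <= 2147483647LL)
    return 4;   /* full imm32, e.g. the mov $0x281,%edx above */
  return 8;     /* needs a 64-bit movabs with imm64 */
}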

I will look into this, but I won't complain if someone beats me to it :))

Honza

Comment 14 Steven Bosscher 2010-01-08 09:06:02 UTC
Honza, you said in comment #13 that you would look at this -- any news?
Comment 15 Bernhard Reutner-Fischer 2012-06-28 23:28:28 UTC
Honza, did you find time to have a look?

I think this regressed a lot in 4.6

$ for i in 3.4 4.2 4.4 4.5 4.6 4.7 4.8;do gcc-$i -fomit-frame-pointer -m32 -Os -o t-$i.o -c t.c;done
$ size t-*.o
   text	   data	    bss	    dec	    hex	filename
 254731	      0	      0	 254731	  3e30b	t-3.4.o
 257731	      0	      0	 257731	  3eec3	t-4.2.o
 242787	      0	      0	 242787	  3b463	t-4.4.o
 242787	      0	      0	 242787	  3b463	t-4.5.o
 542811	      0	      0	 542811	  8485b	t-4.6.o
 542811	      0	      0	 542811	  8485b	t-4.7.o
 542811	      0	      0	 542811	  8485b	t-4.8.o

where:
gcc version 3.4.6 (Debian 3.4.6-10)
gcc version 4.2.4 (Debian 4.2.4-6)
gcc version 4.4.7 (Debian 4.4.7-1) 
gcc version 4.5.3 (Debian 4.5.3-12) 
gcc version 4.6.3 (Debian 4.6.3-1) 
gcc version 4.7.1 (Debian 4.7.1-2)
4.8 was pristine (just unrelated fixups)
gcc version 4.8.0 20120514 (experimental) [fixups revision 19d3eef:c8f5cfb:cbf2756acd7df8cfb441025e4512b97b6ef2fd10] (GCC)
Comment 16 Denis Vlasenko 2013-01-18 10:29:12 UTC
(In reply to comment #15)
> Honza, did you find time to have a look?
> 
> I think this regressed alot in 4.6

Not really - it's just the .eh_frame section.
I re-ran the tests with the two gcc's I have here, and the sizes look like this:

   text	   data	    bss	    dec	    hex	filename
 257731	      0	      0	 257731	  3eec3	divmod-4.2.1-Os.o
 242787	      0	      0	 242787	  3b463	divmod-4.6.3-Os.o

Stock (unpatched) gcc has improved; it juggles registers better. For example:
int ib_100_x(int x) { return (100 / x) ^ (100 % x); }
    0:  b8 64 00 00 00          mov    $0x64,%eax
    5:  99                      cltd
    6:  f7 7c 24 04             idivl  0x4(%esp)
-   a:  31 c2                   xor    %eax,%edx
-   c:  89 d0                   mov    %edx,%eax
-   e:  c3                      ret
+   a:  31 d0                   xor    %edx,%eax
+   c:  c3                      ret

I believe my patch would still improve things - it is orthogonal to register allocation.

BTW, just so that we are all on the same page wrt compiler options:
here's the script I use to compile, disassemble, and extract function sizes from the test program in comment 3. It is tweakable by setting $PREFIX and/or $CC:

gencode.sh
==========
#!/bin/sh

#PREFIX="i686-"

test "$PREFIX"  || PREFIX=""
test "$CC"      || CC="${PREFIX}gcc"
test "$OBJDUMP" || OBJDUMP="${PREFIX}objdump"
test "$NM"      || NM="${PREFIX}nm"

CC_VER=`$CC --version | sed -n 's/[^ ]* [^ ]* \([3-9]\.[1-9][^ ]*\).*/\1/p'`
test "$CC_VER" || exit 1

build()
{
        opt=$1
        bname=divmod-$CC_VER${opt}

        # -ffunction-sections makes disasm easier to understand
        # (insn offsets start from 0 within every function).
        # -fno-exceptions -fno-asynchronous-unwind-tables: die, .eh_frame, die!
        $CC \
                -m32 \
                -fomit-frame-pointer \
                -ffunction-sections \
                -fno-exceptions \
                -fno-asynchronous-unwind-tables \
                ${opt} t.c -c -o $bname.o \
        && $OBJDUMP -dr $bname.o >$bname.asm \
        && $NM --size-sort $bname.o  | sort -k3 >$bname.nm
}

build -Os
#build -O2  #not interesting
#build -O3  #not interesting
size *.o | tee SIZES
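
For example, to run it with a cross compiler or a specific installation (hypothetical paths - any gcc that accepts -m32 will do):

PREFIX=i686- ./gencode.sh
CC=/usr/app/gcc-4.6.3/bin/gcc ./gencode.sh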