gcc -O2 usually optimizes a/CONST. In many cases the code is not only significantly faster but also smaller:

unsigned f(unsigned a) { return a/10; }

gcc 4.1.1 -O2:
        movl    $-858993459, %eax
        mull    4(%esp)
        shrl    $3, %edx
        movl    %edx, %eax
        ret

gcc 4.1.1 -Os:
        movl    4(%esp), %eax
        movl    $10, %edx
        movl    %edx, %ecx
        xorl    %edx, %edx
        divl    %ecx
        ret

Unfortunately, gcc -Os never uses this optimization.

Note that with the code proposed in bug 28417, a/CONST can be optimized even further (we can use a smaller mul constant and avoid the shrl) when we know from VRP that the value of a is always small enough.
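For readers unfamiliar with the transformation, here is a minimal C sketch of the identity the -O2 sequence above implements (my own illustration, not gcc's internal code): a/10 is computed as the upper bits of a widened product with a precomputed reciprocal constant. 0xCCCCCCCD is the unsigned view of the -858993459 immediate, and the shift by 35 corresponds to taking %edx (the high 32 bits of the product) and then the shrl $3.

    /* Illustration only (not gcc code): unsigned division by 10 via
       multiplication by a 2^35-scaled reciprocal, assuming 32-bit unsigned. */
    unsigned div10_sketch(unsigned a)
    {
        /* 0xCCCCCCCD == ceil(2^35 / 10); the high part of the 64-bit product,
           shifted right 3 more bits, equals a / 10 for every 32-bit a. */
        return (unsigned)(((unsigned long long)a * 0xCCCCCCCDu) >> 35);
    }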
I think this is really a cost issue with the x86 back-end, rather than with the middle-end.
Created attachment 13973 [details]
Fix: adjust div cost for -Os on i386

The patch was tested with 4.2.1; I guess it will apply to other versions of gcc as it is quite trivial.

A test program with lots of randomly-generated constant divisions and %'s was compiled by the patched gcc with different costs of division:

   text    data     bss     dec     hex filename
 257731       0       0  257731   3eec3 421.org/t-Os.o
 256788       0       0  256788   3eb14 421.7/t-Os.o
 256788       0       0  256788   3eb14 421.8/t-Os.o
 257377       0       0  257377   3ed61 421.9/t-Os.o
 257825       0       0  257825   3ef21 421.10/t-Os.o

It seems that (at least on 4.2.1) a cost of 8 gives the smallest code.

Among the 15000 divisions in the test program, only signed divisions by a power of two grew in size (but they become MUCH faster, as they don't even use a multiply now, let alone a div):

@@ -1703 +1703 @@
-0000000f T id_x_16
+00000012 T id_x_16
@@ -1836 +1836 @@
-0000000f T id_x_2
+0000000e T id_x_2
@@ -2030 +2030 @@
-0000000f T id_x_32
+00000012 T id_x_32

id_x_16 was:
        movl    4(%esp), %eax
        movl    $16, %edx
        movl    %edx, %ecx
        cltd
        idivl   %ecx
        ret

Now it is:
        movl    4(%esp), %edx
        movl    %edx, %eax
        sarl    $31, %eax
        andl    $15, %eax
        addl    %edx, %eax
        sarl    $4, %eax
        ret

Also unsigned_x / 28, unsigned_x / 13952 and unsigned_x / 56 grew by 1 byte. The rest either were not changed (and still use the div insn) or shrank (typically by 1 byte; the record holders are "unsigned_x / 641" and "unsigned_x / 6700417", which shrank by 4 bytes).
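As a side note, here is a minimal C sketch of the identity behind the new id_x_16 sequence (my illustration, not code from the patch): signed division by a power of two adds a bias of divisor-1 when the dividend is negative, then arithmetic-shifts, which gives the round-toward-zero result C requires.

    /* Sketch only; assumes 32-bit int and that >> on a negative int is an
       arithmetic shift (true for gcc on x86). Mirrors the
       sarl/andl/addl/sarl sequence shown above. */
    int idiv16_sketch(int x)
    {
        int bias = (x >> 31) & 15;   /* 15 if x < 0, else 0 */
        return (x + bias) >> 4;      /* rounds toward zero, like C division */
    }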
Created attachment 13974 [details]
Auto-generated test program with 15000 constant divs/mods

The test program, bzipped. Build it with

    gcc -fomit-frame-pointer -Os t.c

and see the sizes with

    nm --size-sort t-Os.o | sort -k3
Created attachment 13975 [details]
Test program generator

The test program was generated using gen_test.c
Forgot to mention:
* the generator tests signed and unsigned divisions and modulus, both const / x and x / const, and also tests the "u = a / b; v = a % b;" construct.
* you need to filter gen_test output to weed out dups: ./gen_test | sort | uniq >t.c
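For context, a hypothetical, much-reduced sketch of what such a generator can look like (this is not the attached gen_test.c; the function names, divisor range and fixed seed are made up for illustration):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        srand(1);  /* fixed seed so repeated runs emit the same functions */
        for (int i = 0; i < 1000; i++) {
            unsigned c = (unsigned)(rand() % 100000) + 1;
            printf("unsigned ud_x_%u(unsigned x) { return x / %u; }\n", c, c);
            printf("int id_x_%u(int x) { return x / %d; }\n", c, (int)c);
            printf("unsigned um_x_%u(unsigned x) { return x %% %u; }\n", c, c);
        }
        return 0;
    }

Since each emitted function sits on one line, piping the output through "sort | uniq" as described above removes the duplicate divisors that the random choice inevitably produces.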
CC'ing honza as i386 maintainer
It seems to make sense to bump the cost of idiv a bit, given the fact that there are register pressure implications.

I would however like to understand which code sequences we produce are estimated to be long but end up being shorter in practice. Would it be possible to give me some examples of constants where it is important to bump the cost to 8? It is possible we can simply fix the cost estimation in divmod expansion instead.

Honza
(In reply to comment #7)
> It seems to make sense to bump cost of idiv a bit, given the fact that there
> are register pressure implications.
>
> I would like to however understand what code sequences we produce that are
> estimated to be long but ends up being shorter in practice. Would be possible
> to try to give me some examples of constants where it is important to bump cost
> to 8? It is possible we can simply fix cost estimation in divmod expansion
> instead.

The attached t.c.bz2 is a good source file to experiment with.

With last month's svn snapshot of gcc, I did the following:

    /usr/app/gcc-4.4.svn.20090528/bin/gcc -g0 -Os -fomit-frame-pointer -ffunction-sections -c t.c
    objdump -dr t.o >t.asm

with and without the patch, and compared the results. (-ffunction-sections is used merely because it makes the "objdump -dr" output much more suitable for diffing.)

Here is the diff between unpatched and patched gcc's code generated for int_x / 16:

Disassembly of section .text.id_x_16:

0000000000000000 <id_x_16>:
-   0:  89 f8                   mov    %edi,%eax
-   2:  ba 10 00 00 00          mov    $0x10,%edx
-   7:  89 d1                   mov    %edx,%ecx
-   9:  99                      cltd
-   a:  f7 f9                   idiv   %ecx
-   c:  c3                      retq
+   0:  8d 47 0f                lea    0xf(%rdi),%eax
+   3:  85 ff                   test   %edi,%edi
+   5:  0f 49 c7                cmovns %edi,%eax
+   8:  c1 f8 04                sar    $0x4,%eax
+   b:  c3                      retq

int_x / 2:

Disassembly of section .text.id_x_2:

0000000000000000 <id_x_2>:
    0:  89 f8                   mov    %edi,%eax
-   2:  ba 02 00 00 00          mov    $0x2,%edx
-   7:  89 d1                   mov    %edx,%ecx
-   9:  99                      cltd
-   a:  f7 f9                   idiv   %ecx
-   c:  c3                      retq
+   2:  c1 e8 1f                shr    $0x1f,%eax
+   5:  01 f8                   add    %edi,%eax
+   7:  d1 f8                   sar    %eax
+   9:  c3                      retq

As you can see, the code becomes smaller and *much* faster (not even a mul insn is used now).

Here is an example of unsigned_x / 641. In this case the code size is the same, but the code is faster:

Disassembly of section .text.ud_x_641:

0000000000000000 <ud_x_641>:
-   0:  ba 81 02 00 00          mov    $0x281,%edx
-   5:  89 f8                   mov    %edi,%eax
-   7:  89 d1                   mov    %edx,%ecx
-   9:  31 d2                   xor    %edx,%edx
-   b:  f7 f1                   div    %ecx
+   0:  89 f8                   mov    %edi,%eax
+   2:  48 69 c0 81 3d 66 00    imul   $0x663d81,%rax,%rax
+   9:  48 c1 e8 20             shr    $0x20,%rax
    d:  c3                      retq

There is not a single instance of code growth. Either newer gcc is better, or maybe the code growth cases are in 32-bit code only. I will attach t64.asm.diff; take a look if you want to see all the changes in the generated code.
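A side note on why ud_x_641 (and ud_x_6700417) come out so compact: 641 * 6700417 = 2^32 + 1, so the reciprocal constant needs no extra shift at all - the quotient is simply the high half of the product. A hedged C sketch of the identity (my illustration, not gcc code):

    /* Sketch only, assuming 32-bit unsigned x: because 641 * 6700417 == 2^32 + 1,
       the high 32 bits of x * 6700417 equal x / 641 exactly. */
    unsigned div641_sketch(unsigned x)
    {
        return (unsigned)(((unsigned long long)x * 6700417u) >> 32);
    }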
Created attachment 18040 [details]
Comparison of generated code with 4.4.svn.20090528 on x86_64
Do we have correct size estimates on idiv with a constant argument at all? I don't see length attributes on it ...
In 32-bit code there are indeed a few cases of code growth. Here is the full list (id_XXX are signed divides, ud_XXX are unsigned ones):

-00000000 0000000f T id_x_4
+00000000 00000012 T id_x_4
-00000000 0000000f T id_x_8
+00000000 00000012 T id_x_8
-00000000 0000000f T id_x_16
+00000000 00000012 T id_x_16
-00000000 0000000f T id_x_32
+00000000 00000012 T id_x_32
-00000000 00000010 T ud_x_28
+00000000 00000015 T ud_x_28
-00000000 00000010 T ud_x_56
+00000000 00000015 T ud_x_56
-00000000 00000010 T ud_x_13952
+00000000 00000015 T ud_x_13952

They fall into two groups.

Signed divisions by a power of 2 grew by 3 bytes, but they are *much faster* now, and considering how often people type "x / 4" and think "this will be optimized to a shift", forgetting that their x is signed and will therefore get a divide insn (!), I see it as a good trade. Code comparison:

00000000 <id_x_16>:
-   0:  8b 44 24 04             mov    0x4(%esp),%eax
-   4:  ba 10 00 00 00          mov    $0x10,%edx
-   9:  89 d1                   mov    %edx,%ecx
-   b:  99                      cltd
-   c:  f7 f9                   idiv   %ecx
-   e:  c3                      ret
+   0:  8b 54 24 04             mov    0x4(%esp),%edx
+   4:  89 d0                   mov    %edx,%eax
+   6:  c1 f8 1f                sar    $0x1f,%eax
+   9:  83 e0 0f                and    $0xf,%eax
+   c:  01 d0                   add    %edx,%eax
+   e:  c1 f8 04                sar    $0x4,%eax
+  11:  c3                      ret

The second group is just a few rare cases where the "multiply by reciprocal" optimization happens to require more processing and the code is 5 bytes longer:

00000000 <ud_x_56>:
-   0:  8b 44 24 04             mov    0x4(%esp),%eax
-   4:  ba 38 00 00 00          mov    $0x38,%edx
-   9:  89 d1                   mov    %edx,%ecx
-   b:  31 d2                   xor    %edx,%edx
-   d:  f7 f1                   div    %ecx
-   f:  c3                      ret
+   0:  53                      push   %ebx
+   1:  8b 4c 24 08             mov    0x8(%esp),%ecx
+   5:  bb 25 49 92 24          mov    $0x24924925,%ebx
+   a:  c1 e9 03                shr    $0x3,%ecx
+   d:  89 c8                   mov    %ecx,%eax
+   f:  f7 e3                   mul    %ebx
+  11:  5b                      pop    %ebx
+  12:  89 d0                   mov    %edx,%eax
+  14:  c3                      ret

This is rare - only three cases in the entire t.c.bz2. They are far outweighed by 474 cases where the code got smaller. Most of them save only one byte. For example, unsigned_x / 100:

00000000 <ud_x_100>:
-   0:  8b 44 24 04             mov    0x4(%esp),%eax
-   4:  ba 64 00 00 00          mov    $0x64,%edx
-   9:  89 d1                   mov    %edx,%ecx
-   b:  31 d2                   xor    %edx,%edx
-   d:  f7 f1                   div    %ecx
-   f:  c3                      ret
+   0:  b8 1f 85 eb 51          mov    $0x51eb851f,%eax
+   5:  f7 64 24 04             mull   0x4(%esp)
+   9:  89 d0                   mov    %edx,%eax
+   b:  c1 e8 05                shr    $0x5,%eax
+   e:  c3                      ret

Some cases got shorter by 2 or 4 bytes:

-00000000 00000010 T ud_x_3
+00000000 0000000e T ud_x_3
-00000000 00000010 T ud_x_9
+00000000 0000000e T ud_x_9
-00000000 00000010 T ud_x_67
+00000000 0000000e T ud_x_67
-00000000 00000010 T ud_x_641
+00000000 0000000c T ud_x_641
-00000000 00000010 T ud_x_6700417
+00000000 0000000c T ud_x_6700417

For example, unsigned_x / 9:

00000000 <ud_x_9>:
-   0:  8b 44 24 04             mov    0x4(%esp),%eax
-   4:  ba 09 00 00 00          mov    $0x9,%edx
-   9:  89 d1                   mov    %edx,%ecx
-   b:  31 d2                   xor    %edx,%edx
-   d:  f7 f1                   div    %ecx
-   f:  c3                      ret
+   0:  b8 39 8e e3 38          mov    $0x38e38e39,%eax
+   5:  f7 64 24 04             mull   0x4(%esp)
+   9:  89 d0                   mov    %edx,%eax
+   b:  d1 e8                   shr    %eax
+   d:  c3                      ret

and unsigned_x / 641:

00000000 <ud_x_641>:
-   0:  8b 44 24 04             mov    0x4(%esp),%eax
-   4:  ba 81 02 00 00          mov    $0x281,%edx
-   9:  89 d1                   mov    %edx,%ecx
-   b:  31 d2                   xor    %edx,%edx
-   d:  f7 f1                   div    %ecx
-   f:  c3                      ret
+   0:  b8 81 3d 66 00          mov    $0x663d81,%eax
+   5:  f7 64 24 04             mull   0x4(%esp)
+   9:  89 d0                   mov    %edx,%eax
+   b:  c3                      ret

I will attach t32.asm.diff now.
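As an aside, the longer ud_x_56 sequence above has a simple structure: 56 = 8 * 7, so gcc shifts right by 3 first and then multiplies by an approximation of 2^32/7, keeping only the high half of the product. A hedged C sketch of the same computation (illustration only, not gcc code):

    /* Sketch only, assuming 32-bit unsigned x: 0x24924925 == ceil(2^32 / 7),
       and after the pre-shift by 3 the intermediate value is small enough
       for the approximation to be exact. */
    unsigned div56_sketch(unsigned x)
    {
        return (unsigned)(((unsigned long long)(x >> 3) * 0x24924925u) >> 32);
    }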
Created attachment 18041 [details]
Comparison of generated code with 4.4.svn.20090528 on i86
Hmm, looking at the cases, it seems that the main reason for the win is the fact that idiv needs an integer load instruction with a long immediate, and we don't optimize these well for -Os. I suppose the following is wrong for -Os:

    case CONST_INT:
    case CONST:
    case LABEL_REF:
    case SYMBOL_REF:
      if (TARGET_64BIT && !x86_64_immediate_operand (x, VOIDmode))
        *total = 3;
      else if (TARGET_64BIT && !x86_64_zext_immediate_operand (x, VOIDmode))
        *total = 2;
      else if (flag_pic && SYMBOLIC_CONST (x)
               && (!TARGET_64BIT
                   || (!GET_CODE (x) != LABEL_REF
                       && (GET_CODE (x) != SYMBOL_REF
                           || !SYMBOL_REF_LOCAL_P (x)))))
        *total = 1;
      else
        *total = 0;
      return true;

It probably should return the actual size of a load instruction with a full-sized immediate, and the individual cases matching RTL codes should know where the instruction allows a cheap immediate operand encoding and prevent the recursion from counting the operand size itself.

I will look into this but won't complain if someone beats me :))

Honza
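For readers not familiar with the encoding details: a byte-accurate cost for these RTL codes would essentially have to model x86 immediate sizes. The sketch below is not gcc code - it is only an illustration of the sizes involved (imm8 vs imm32 vs the 64-bit movabs form), with made-up parameter names:

    /* Illustration only (not gcc's rtx_costs): approximate byte cost of
       encoding an integer immediate on x86. */
    static int imm_size_bytes(long long v, int insn_allows_imm8, int needs_64bit_imm)
    {
        if (insn_allows_imm8 && v >= -128 && v <= 127)
            return 1;        /* sign-extended 8-bit immediate form */
        if (needs_64bit_imm)
            return 8;        /* movabs-style full 64-bit immediate */
        return 4;            /* ordinary 32-bit immediate */
    }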
Honza, you said in comment #13 that you would look at this -- got news?
Honza, did you find time to have a look?

I think this regressed a lot in 4.6:

$ for i in 3.4 4.2 4.4 4.5 4.6 4.7 4.8; do gcc-$i -fomit-frame-pointer -m32 -Os -o t-$i.o -c t.c; done
$ size t-*.o
   text    data     bss     dec     hex filename
 254731       0       0  254731   3e30b t-3.4.o
 257731       0       0  257731   3eec3 t-4.2.o
 242787       0       0  242787   3b463 t-4.4.o
 242787       0       0  242787   3b463 t-4.5.o
 542811       0       0  542811   8485b t-4.6.o
 542811       0       0  542811   8485b t-4.7.o
 542811       0       0  542811   8485b t-4.8.o

where:
gcc version 3.4.6 (Debian 3.4.6-10)
gcc version 4.2.4 (Debian 4.2.4-6)
gcc version 4.4.7 (Debian 4.4.7-1)
gcc version 4.5.3 (Debian 4.5.3-12)
gcc version 4.6.3 (Debian 4.6.3-1)
gcc version 4.7.1 (Debian 4.7.1-2)
4.8 was pristine (just unrelated fixups):
gcc version 4.8.0 20120514 (experimental) [fixups revision 19d3eef:c8f5cfb:cbf2756acd7df8cfb441025e4512b97b6ef2fd10] (GCC)
(In reply to comment #15)
> Honza, did you find time to have a look?
>
> I think this regressed a lot in 4.6

Not really - it's just the .eh_frame section. I re-ran the tests with the two gcc's I have here and the sizes look like this:

   text    data     bss     dec     hex filename
 257731       0       0  257731   3eec3 divmod-4.2.1-Os.o
 242787       0       0  242787   3b463 divmod-4.6.3-Os.o

Stock (unpatched) gcc improved; it juggles registers better. For example:

int ib_100_x(int x) { return (100 / x) ^ (100 % x); }

    0:  b8 64 00 00 00          mov    $0x64,%eax
    5:  99                      cltd
    6:  f7 7c 24 04             idivl  0x4(%esp)
-   a:  31 c2                   xor    %eax,%edx
-   c:  89 d0                   mov    %edx,%eax
-   e:  c3                      ret
+   a:  31 d0                   xor    %edx,%eax
+   c:  c3                      ret

I believe my patch would still improve things - it is orthogonal to register allocation.

BTW, just so that we are all on the same page wrt compiler options, here's the script I use to compile, disassemble, and extract function sizes from the test program in comment 3. Tweakable by setting $PREFIX and/or $CC:

gencode.sh
==========
#!/bin/sh

#PREFIX="i686-"
test "$PREFIX" || PREFIX=""
test "$CC" || CC="${PREFIX}gcc"
test "$OBJDUMP" || OBJDUMP="${PREFIX}objdump"
test "$NM" || NM="${PREFIX}nm"
CC_VER=`$CC --version | sed -n 's/[^ ]* [^ ]* \([3-9]\.[1-9][^ ]*\).*/\1/p'`
test "$CC_VER" || exit 1

build() {
    opt=$1
    bname=divmod-$CC_VER${opt}${nail}
    # -ffunction-sections makes disasm easier to understand
    # (insn offsets start from 0 within every function).
    # -fno-exceptions -fno-asynchronous-unwind-tables: die, .eh_frame, die!
    $CC \
        -m32 \
        -fomit-frame-pointer \
        -ffunction-sections \
        -fno-exceptions \
        -fno-asynchronous-unwind-tables \
        ${opt} t.c -c -o $bname.o \
    && $OBJDUMP -dr $bname.o >$bname.asm \
    && $NM --size-sort $bname.o | sort -k3 >$bname.nm
}

build -Os
#build -O2  #not interesting
#build -O3  #not interesting

size *.o | tee SIZES
Any chance of this being finally done?

I proposed a simple, working patch in 2007; it's 2016 now, and all these years users of -Os have suffered from slow divisions in important cases such as "signed_int / 16" and "unsigned_int / 10".

I understand your desire to do it "better" - to make gcc count the size of div/idiv more accurately, without having to lie to it in the insn size table. But with you guys constantly distracted by other, more important issues, what happened here is that _nothing_ was done...

I retested the patch with current svn (the future 7.0.0), using the test program with 15000 divisions from comment 3.

Bumping the division cost up to 8 is no longer enough; that only makes gcc do better on some (not all) 2^N divisors. Bumping the div cost to 9..12 helps with most of the remaining 2^N divisor cases, and with the two exceptional cases of x / 641 and x / 6700417. Only bumping the div cost to 13, namely, changing the div costs as follows:

const
struct processor_costs ix86_size_cost = {/* costs for tuning for size */
...
  {COSTS_N_BYTES (13),                  /* cost of a divide/mod for QI */
   COSTS_N_BYTES (13),                  /*                          HI */
   COSTS_N_BYTES (13),                  /*                          SI */
   COSTS_N_BYTES (13),                  /*                          DI */
   COSTS_N_BYTES (15)},                 /*                          other */

makes it work as it used to in the 4.4.x days: out of the 15000 cases in t.c, 975 cases are optimized so that they don't use "div" anymore.

This should have made the code smaller too... but it did not, because meanwhile gcc has regressed in another area: it now inserts superfluous register moves. See bug 70703, which I just filed. Essentially, instead of

        movl    $6700417, %eax
        mull    4(%esp)
        movl    %edx, %eax
        ret

gcc generates:

        movl    $6700417, %ecx
        movl    %ecx, %eax      ????
        mull    4(%esp)
        movl    %edx, %ecx      ????
        movl    %ecx, %eax
        ret

Sizes of the compiled testcases (objN denotes the cost of "div"; A...D correspond to costs of 10..13):

   text    data     bss     dec     hex filename
 242787       0       0  242787   3b463 gcc.obj3/divmod-7.0.0-Os.o
 242813       0       0  242813   3b47d gcc.obj8/divmod-7.0.0-Os.o
 242838       0       0  242838   3b496 gcc.obj9/divmod-7.0.0-Os.o
 242844       0       0  242844   3b49c gcc.objA/divmod-7.0.0-Os.o
 242844       0       0  242844   3b49c gcc.objB/divmod-7.0.0-Os.o
 242844       0       0  242844   3b49c gcc.objC/divmod-7.0.0-Os.o
 247573       0       0  247573   3c715 gcc.objD/divmod-7.0.0-Os.o

So. Any chance of this patch being accepted sometime before 2100? ;)
Created attachment 38297 [details]
Comparison of generated code with 7.0.0.svn on i86

With div cost of 3:

00000000 <ud_x_100>:
-   0:  8b 44 24 04             mov    0x4(%esp),%eax
-   4:  b9 64 00 00 00          mov    $0x64,%ecx
-   9:  31 d2                   xor    %edx,%edx
-   b:  f7 f1                   div    %ecx
-   d:  c3                      ret

With div cost of 13:

+   0:  b9 1f 85 eb 51          mov    $0x51eb851f,%ecx
+   5:  89 c8                   mov    %ecx,%eax
+   7:  f7 64 24 04             mull   0x4(%esp)
+   b:  89 d1                   mov    %edx,%ecx
+   d:  89 c8                   mov    %ecx,%eax
+   f:  c1 e8 05                shr    $0x5,%eax
+  12:  c3                      ret
(In reply to Denis Vlasenko from comment #17)
> Any chance of this being finally done?
>
> I proposed a simple, working patch in 2007; it's 2016 now, and all these
> years users of -Os have suffered from slow divisions in important cases
> such as "signed_int / 16" and "unsigned_int / 10".
>
> So. Any chance of this patch being accepted sometime before 2100? ;)

For GCC you need to follow https://gcc.gnu.org/contribute.html to submit patches.

https://gcc.gnu.org/contribute.html#legal is the first obstacle, unless you happen to work for a company (like e.g. redhat) that has a company-wide copyright assignment in place. Alternatively, you can put your changes in the public domain or go through an individual copyright assignment process for GCC with the FSF.

Then you have to create proper testcases in dejagnu form, include them in your patch, and submit it to the gcc-patches@ ML after proper bootstrapping and regression testing. It is recommended to Cc: the maintainer(s) of the architectures/subsystems/files you are touching in your patch submission to gain their attention. See the toplevel MAINTAINERS file.

If you want to ensure that object-size increases/regressions in your testcases get noticed, you can add a dejagnu check for that (see gcc/testsuite/lib/scanasm.exp:proc object-size).

For details on patch submission please refer to:
https://gcc.gnu.org/contribute.html#patches

HTH,
(In reply to Bernhard Reutner-Fischer from comment #19)
> (In reply to Denis Vlasenko from comment #17)
> > Any chance of this being finally done?
> >
> > I proposed a simple, working patch in 2007; it's 2016 now, and all these
> > years users of -Os have suffered from slow divisions in important cases
> > such as "signed_int / 16" and "unsigned_int / 10".
> >
> > So. Any chance of this patch being accepted sometime before 2100? ;)
>
> For GCC you need to follow https://gcc.gnu.org/contribute.html to submit
> patches.

There is also a 10-step guide for new contributors here:
https://gcc.gnu.org/wiki/GettingStarted#Basics:_Contributing_to_GCC_in_10_easy_steps