[ Reported to the Debian BTS as report #75773.
  Please CC 75773-quiet@bugs.debian.org on replies.
  Log of report can be found at http://bugs.debian.org/75773 ]

For the following function:

unsigned long foo(unsigned long a, unsigned long b)
{
    unsigned long c = a - b;
    if (a < b) {
        c += 100;
    }
    return c;
}

gcc -O2 -S generates this (the cmpl after the subl is unnecessary):

        .file   "bug-75773.c"
        .text
        .align 2
        .p2align 2,,3
.globl foo
        .type   foo,@function
foo:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %edx
        movl    12(%ebp), %eax
        movl    %edx, %ecx
        subl    %eax, %ecx
        cmpl    %eax, %edx
        jae     .L2
        addl    $100, %ecx
.L2:
        movl    %ecx, %eax
        popl    %ebp
        ret
.Lfe1:
        .size   foo,.Lfe1-foo
        .ident  "GCC: (GNU) 3.1 20010701 (experimental)"

Release:
3.0 (Debian GNU/Linux) and HEAD 20010701

Environment:
System: Debian GNU/Linux (testing/unstable)
Architecture: i686
host: i386-linux
build: i386-linux
target: i386-linux
configured with: ../src/configure -v --enable-languages=c,c++,java,f77,proto,objc --prefix=/usr --infodir=/share/info --mandir=/share/man --enable-shared --with-gnu-as --with-gnu-ld --with-system-zlib --enable-long-long --enable-nls --without-included-gettext --disable-checking --enable-threads=posix --enable-java-gc=boehm --with-cpp-install-dir=bin --enable-objc-gc i386-linux
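For comparison, a hand-written sketch of what could be generated instead (an assumption about the ideal output, not actual compiler output; it relies on the fact that subl sets the carry flag exactly when a < b as unsigned, which is the condition jae tests):

foo:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %eax
        subl    12(%ebp), %eax   # computes a - b and sets CF iff a < b
        jae     .L2              # branch directly on the borrow
        addl    $100, %eax
.L2:
        popl    %ebp
        ret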
State-Changed-From-To: open->analyzed
State-Changed-Why:

(set (reg/v:SI 61)
     (minus:SI (reg/v:SI 59)
               (reg/v:SI 60)))

and

(set (reg:CC 17 flags)
     (compare:CC (reg/v:SI 59)
                 (reg/v:SI 60)))

are not LOG_LINK related, because they share no common destination, so combine doesn't merge them. But after reload they don't have the same arguments, so we can't peephole them there either.
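For combine to merge them, it would have to produce a single insn along these lines (a sketch of the parallel form; the *subsi3_cc_overflow pattern seen later in this report has this shape on i386):

(parallel [(set (reg:CC 17 flags)
                (compare:CC (reg/v:SI 59)
                            (reg/v:SI 60)))
           (set (reg/v:SI 61)
                (minus:SI (reg/v:SI 59)
                          (reg/v:SI 60)))])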
It also happens on powerpc-apple-darwin6.6. Currently GCC produces:

_foo:
        cmplw cr0,r3,r4
        subf r3,r4,r3
        bgelr- cr0
        addi r3,r3,100
        blr

But GCC should be able to produce:

_foo:
        subf. r3,r4,r3
        bgelr- cr0
        addi r3,r3,100
        blr

which is smaller (and faster on everything except maybe the 970 and POWER4).
Replacing a < b with (long) c < 0 sounds like a CSE type of transformation, but it does not quite fit the current framework of replacing expensive expressions with cheaper equivalents that are already available: nobody has computed (long) c < 0 before, so it is not available.
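A sketch of the rewrite in C terms (use() is a hypothetical sink; note the equivalence only holds when the mathematical value of a - b fits in a signed long):

extern void use(unsigned long);

void before(unsigned long a, unsigned long b)
{
    unsigned long c = a - b;
    if (a < b)           /* recomputes the borrow */
        use(c);
}

void after(unsigned long a, unsigned long b)
{
    unsigned long c = a - b;
    if ((long) c < 0)    /* reuses the already-computed difference;
                            valid only while a - b fits in signed long */
        use(c);
}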
Observed on m68k-palmos (2.95.3 plus a boatload of Debian patches), arm-palmos (3.2.2), AVR (3.3, 20030512 prerelease, likely patched), and on i386 (3.3.3, mainline and tree-ssa as of 20040302). I suggest updating $SUMMARY and $TARGET accordingly.

Further testing (see attachment) shows that opportunities can be missed even for simple signed ints: bar() has the problem, while baz() and bat() use either cmp or sub, but not both. Furthermore, baz() seems to show that the combine pass does a good job of finding the right compare instruction when the primary result of the subtraction is thrown away.

Could a simple and systematic replacement

  (a < b)  ===>  (coerce_to_signed_type<>(a - b) < 0)

(done perhaps at the SSA level) resolve this without incurring excessive costs?
Created attachment 5856 [details]
a variant test (with signed ints instead of unsigned longs -- same problem)

bar() shows the same problem as the original poster.
baz() shows a variant of bar() which, although not semantically identical, shows on i386 that the RTL optimizer is able to use a cmp instruction anyway (also on AVR, but NOT on ARM).
bat() is semantically identical to bar() (modulo possible data overflow), and shows the kind of code that would be expected from compiling bar().
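For readers without the attachment, bar() presumably has the following shape (a hypothetical reconstruction from the description above, not the attached file itself):

int bar(int a, int b)
{
    int c = a - b;       /* same problem as the original foo(), */
    if (a < b)           /* but with signed ints */
        c += 100;
    return c;
}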
Still there as of 2005-01-23
Confirmed for AVR. GCC 4.2.1 for avr generates this:

foo:
/* prologue: frame size=0 */
        push r14
        push r15
        push r16
        push r17
/* prologue end (size=4) */
        movw r14,r22
        movw r16,r24
        sub r14,r18
        sbc r15,r19
        sbc r16,r20
        sbc r17,r21
        cp r22,r18
        cpc r23,r19
        cpc r24,r20
        cpc r25,r21
        brsh .L2
        ldi r24,lo8(100)
        ldi r25,hi8(100)
        ldi r26,hlo8(100)
        ldi r27,hhi8(100)
        add r14,r24
        adc r15,r25
        adc r16,r26
        adc r17,r27
.L2:
        movw r24,r16
        movw r22,r14
/* epilogue: frame size=0 */
        pop r17
        pop r16
        pop r15
        pop r14
        ret

Ideally it should be something like:

foo:
/* prologue: frame size=0 */
        sub r22,r18
        sbc r23,r19
        sbc r24,r20
        sbc r25,r21
        brcc .L2
        ldi r18,lo8(100)
        ldi r19,hi8(100)
        ldi r20,hlo8(100)
        ldi r21,hhi8(100)
        add r22,r18
        adc r23,r19
        adc r24,r20
        adc r25,r21
.L2:
/* epilogue: frame size=0 */
        ret

which is less than half the number of instructions. Changing summary and target fields.
Created attachment 14647 [details]
Patch to enhance cse.c

The attached patch can optimize this testcase:

void foo (int);

int bar2 (int a, int b)
{
  int c = a - b;
  if (a < b)
    {
      foo (c);
    }
  return c;
}

Before:

bar2:
        pushl   %ebx            # 33  *pushsi2                        [length = 1]
        subl    $8, %esp        # 34  pro_epilogue_adjust_stack_1/1   [length = 3]
        movl    16(%esp), %edx  # 2   *movsi_1/1                      [length = 4]
        movl    20(%esp), %eax  # 3   *movsi_1/1                      [length = 4]
        movl    %edx, %ebx      # 32  *movsi_1/1                      [length = 2]
        subl    %eax, %ebx      # 7   *subsi_1/1                      [length = 2]
        cmpl    %eax, %edx      # 8   *cmpsi_1_insn/1                 [length = 2]
        jge     .L2             # 9   *jcc_1                          [length = 2]
        movl    %ebx, (%esp)    # 11  *movsi_1/2                      [length = 3]
        call    foo             # 12  *call_0                         [length = 5]
.L2:
        movl    %ebx, %eax      # 19  *movsi_1/1                      [length = 2]
        addl    $8, %esp        # 37  pro_epilogue_adjust_stack_1/1   [length = 3]
        popl    %ebx            # 38  popsi1                          [length = 1]
        ret                     # 39  return_internal                 [length = 1]

After:

bar2:
        pushl   %ebx            # 33  *pushsi2                        [length = 1]
        subl    $8, %esp        # 34  pro_epilogue_adjust_stack_1/1   [length = 3]
        movl    16(%esp), %eax  # 2   *movsi_1/1                      [length = 4]
        movl    %eax, %ebx      # 32  *movsi_1/1                      [length = 2]
        subl    20(%esp), %ebx  # 8   *subsi_2/2                      [length = 4]
        jns     .L2             # 9   *jcc_1                          [length = 2]
        movl    %ebx, (%esp)    # 11  *movsi_1/2                      [length = 3]
        call    foo             # 12  *call_0                         [length = 5]
.L2:
        movl    %ebx, %eax      # 19  *movsi_1/1                      [length = 2]
        addl    $8, %esp        # 37  pro_epilogue_adjust_stack_1/1   [length = 3]
        popl    %ebx            # 38  popsi1                          [length = 1]
        ret                     # 39  return_internal                 [length = 1]

One of the difficulties is that by the time the code gets to the cse1 pass, the statement "c = a - b" might have been moved below the "a < b" test, making it harder to optimize. The "bar2" testcase was crafted so this doesn't happen. But given the need to potentially replace both the sub/cmp insn and the jump insn, it would be better to move this optimization to combine.
Created attachment 14657 [details]
Patch v2 to enhance cse.c

This patch also handles unsigned comparisons and thus optimizes the original testcase too.

Before:

foo:
        movl    4(%esp), %edx   # 2   *movsi_1/1             [length = 4]
        movl    8(%esp), %eax   # 3   *movsi_1/1             [length = 4]
        movl    %edx, %ecx      # 35  *movsi_1/1             [length = 2]
        subl    %eax, %ecx      # 7   *subsi_1/1             [length = 2]
        cmpl    %eax, %edx      # 8   *cmpsi_1_insn/1        [length = 2]
        jae     .L2             # 9   *jcc_1                 [length = 2]
        addl    $100, %ecx      # 11  *addsi_1/1             [length = 3]
.L2:
        movl    %ecx, %eax      # 18  *movsi_1/1             [length = 2]
        ret                     # 38  return_internal        [length = 1]

After:

foo:
        movl    4(%esp), %eax   # 2   *movsi_1/1             [length = 4]
        subl    8(%esp), %eax   # 8   *subsi3_cc_overflow/2  [length = 4]
        jae     .L2             # 9   *jcc_1                 [length = 2]
        addl    $100, %eax      # 11  *addsi_1/1             [length = 3]
.L2:
        rep                     # 39  return_internal_long   [length = 1]
        ret

I was going to abandon this patch, but maybe it deserves a second chance. :-)
> +  for (defs = DF_INSN_DEFS (insn);
> +       *defs && DF_REF_REGNO (*defs) != REGNO (x);
> +       defs++)
> +    ;

Are you aware of df_find_def() ?

> +	  if (minus_elt)
> +	    cmp = cse_find_comparison_use (dest, insn);

Formatting.

IMNSHO, computing DEF-USE chains for this niche optimization loses in the cost/benefit trade-off. The whole patch looks to me like a hack written specifically to deal with this particular bug report. You could, of course, show that this is actually a very important optimization...

I wonder if you can't just integrate this optimization in cse.c as-is by recording an equivalence "a < b" == "signof(c)" when you process "a - b".

In any case, adding this optimization to cse.c is Just Wrong (tm). I don't understand why you didn't even try to optimize this in one of the tree optimizers instead. I thought it was clear GCC is trying to reduce its dependency on RTL optimizations?
In reply to comment #10 from Steven Bosscher 2007-11-28 22:02:

> > +  for (defs = DF_INSN_DEFS (insn);
> > +       *defs && DF_REF_REGNO (*defs) != REGNO (x);
> > +       defs++)
> > +    ;
> Are you aware of df_find_def() ?

Not until now, and it won't work either, because it uses rtx_equal_p() and the mode of DF_REF_REG() doesn't match that of the REG rtx in the insn. See also <URL:http://gcc.gnu.org/ml/gcc/2007-11/msg00719.html>.

> IMNSHO, computing DEF-USE chains for this niche optimization loses in the
> cost/benefit trade-off.

The split of the comparison setter and the comparison user across two insns with no link between them is an interesting case of poor infrastructure. Most back ends can't emit the setter before they know what the user looks like, and therefore always emit the two back-to-back. At the same time, several passes need to find one from the other, but they can't rely on the two being back-to-back, and because there's no link between them (except via DEF-USE/USE-DEF chains), they have to roll their own means of doing so. (So yes, it's been done without DEF-USE before, and it can be done without DEF-USE again.)

> I wonder if you can't just integrate this optimization in cse.c as-is by
> recording an equivalence "a < b" == "signof(c)" when you process "a - b".

That doesn't catch the unsigned comparison. Actually, how would I even know whether it is unsigned or not?

> In any case, adding this optimization to cse.c is Just Wrong (tm). I
> don't understand why you didn't even try to optimize this in one of the
> tree optimizers instead. I thought it was clear GCC is trying to reduce
> its dependency on RTL optimizations?

There's no way to optimize the signed case with -fwrapv (e.g. Java) at the tree level, because we can't arrange for a - b to be computed in the same instruction as a < b. At the RTL level, it is merely difficult. That makes it less interesting to work on this optimization at the tree level.
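A small self-contained program illustrating both objections above (the values are illustrative and assume 32-bit int/unsigned; build with -fwrapv so the signed wraparound in the second check is well-defined):

#include <limits.h>
#include <stdio.h>

int main(void)
{
    /* 1. The sign of c = a - b does not capture unsigned a < b: */
    unsigned a = 1, b = 0x80000002u;
    printf("%d %d\n", a < b, (int)(a - b) < 0);  /* prints "1 0" */

    /* 2. Under -fwrapv, signed a < b is not equivalent to a - b < 0: */
    int x = INT_MIN, y = 1;
    printf("%d %d\n", x < y, (x - y) < 0);       /* x - y wraps to
                                                    INT_MAX: prints "1 0" */
    return 0;
}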
I think this should use find_comparison_args.
Well, here's another test case for the same problem:

extern void print_gt(void);
extern void print_lt(void);
extern void print_eq(void);

void cmp_and_branch(long a, long b)
{
    long diff = a - b;
    if (diff > 0) {
        print_gt();
    } else if (diff < 0) {
        print_lt();
    } else {
        print_eq();
    }
}

In this case, the result of the subtraction is directly used in the branch code. However, gcc -O2 -S still generates this output:

        .file   "gcc_cmp_and_branch_test2.c"
        .text
        .p2align 4,,15
        .globl  cmp_and_branch
        .type   cmp_and_branch, @function
cmp_and_branch:
.LFB0:
        .cfi_startproc
        subq    %rsi, %rdi
        cmpq    $0, %rdi
        jg      .L5
        jne     .L6
        jmp     print_eq
        .p2align 4,,10
        .p2align 3
.L5:
        jmp     print_gt
        .p2align 4,,10
        .p2align 3
.L6:
        jmp     print_lt
        .cfi_endproc
.LFE0:
        .size   cmp_and_branch, .-cmp_and_branch
        .ident  "GCC: (Gentoo 4.7.1) 4.7.1"
        .section .note.GNU-stack,"",@progbits

I'm using Gentoo x86_64:

Using built-in specs.
COLLECT_GCC=/usr/x86_64-pc-linux-gnu/gcc-bin/4.7.1/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/4.7.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /tmp/portage/sys-devel/gcc-4.7.1/work/gcc-4.7.1/configure --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/4.7.1 --includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.1/include --datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.7.1 --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.7.1/man --infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.7.1/info --with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.1/include/g++-v4 --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-altivec --disable-fixed-point --with-ppl --with-cloog --disable-ppl-version-check --with-cloog-include=/usr/include/cloog-ppl --enable-lto --enable-nls --without-included-gettext --with-system-zlib --enable-obsolete --disable-werror --enable-secureplt --enable-multilib --with-multilib-list=m32,m64 --enable-libmudflap --disable-libssp --enable-libgomp --with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/4.7.1/python --enable-checking=release --enable-java-awt=gtk --enable-objc-gc --enable-languages=c,c++,java,objc,obj-c++,fortran --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --enable-targets=all --with-bugurl=http://bugs.gentoo.org/ --with-pkgversion='Gentoo 4.7.1'
Thread model: posix
gcc version 4.7.1 (Gentoo 4.7.1)

I do hope this can be addressed sometime soon. Thanks.
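For reference, a sketch of the hoped-for output (hand-written, assuming signed overflow is undefined so the flags set by subq can feed the signed branches directly, making the cmpq $0 redundant):

cmp_and_branch:
        subq    %rsi, %rdi      # computes diff and sets SF/ZF/OF
        jg      .L5             # diff > 0
        jne     .L6             # diff < 0
        jmp     print_eq        # diff == 0
.L5:
        jmp     print_gt
.L6:
        jmp     print_lt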
x86_64-*-* should be added to the target list. With the original test case, GCC 10.1 produces this:

foo:
        movq    %rdi, %rax
        subq    %rsi, %rax
        cmpq    %rsi, %rdi
        leaq    100(%rax), %rdx
        cmovb   %rdx, %rax
        ret

It should instead produce this:

foo:
        subq    %rsi, %rdi
        leaq    100(%rdi), %rax
        cmovae  %rdi, %rax
        ret

By the way, I thought that using __builtin_usubl_overflow might be a useful workaround for this bug, but it turns out that it currently isn't, since it has a different missed optimization of its own: bug #96289.
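For the record, the workaround alluded to would look like this sketch (the built-in is real GCC API; per bug #96289 it does not currently produce better code):

unsigned long foo(unsigned long a, unsigned long b)
{
    unsigned long c;
    /* __builtin_usubl_overflow returns true exactly when the
       subtraction wraps, i.e. when a < b, so in principle the borrow
       is computed only once. */
    if (__builtin_usubl_overflow(a, b, &c))
        c += 100;
    return c;
}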
Another similar case. For this function:

unsigned wrap(unsigned index, unsigned limit)
{
    if (index >= limit)
        index -= limit;
    return index;
}

GCC 11.1 -O2 generates:

wrap(unsigned int, unsigned int):
        mov     edx, edi
        mov     eax, edi
        sub     edx, esi
        cmp     edi, esi
        cmovnb  eax, edx
        ret

I believe the cmp here is redundant, as the flags are already set by the sub. After removing the cmp we get:

wrap(unsigned int, unsigned int):
        mov     edx, edi
        mov     eax, edi
        sub     edx, esi
        cmovnb  eax, edx
        ret

Now the register edx becomes unneeded:

wrap(unsigned int, unsigned int):
        mov     eax, edi
        sub     edi, esi
        cmovnb  eax, edi
        ret
Rask Ingemann Lambertsen has not been active in GCC development for over 10 years, so unassigning.