Bug 3507 - appalling optimisation with sub/cmp on multiple targets
Summary: appalling optimisation with sub/cmp on multiple targets
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization (show other bugs)
Version: 3.0
: P3 enhancement
Target Milestone: ---
Assignee: Rask Ingemann Lambertsen
URL:
Keywords: missed-optimization
: 3996 (view as bug list)
Depends on:
Blocks:
 
Reported: 2001-07-01 09:16 UTC by 75773-quiet
Modified: 2012-07-10 10:54 UTC (History)
6 users (show)

See Also:
Host: i386-linux
Target: i386-linux, powerpc-*-*, avr-*-*
Build: i386-linux
Known to work:
Known to fail:
Last reconfirmed: 2007-11-02 14:17:51


Attachments
a variant test (with signed ints instead of unsigned longs -- same problem) (115 bytes, text/plain)
2004-03-03 19:52 UTC, Cyrille Chépélov
Details
Patch to enhance cse.c (1.53 KB, patch)
2007-11-27 18:04 UTC, Rask Ingemann Lambertsen
Details | Diff
Patch v2 to enhance cse.c (1.75 KB, patch)
2007-11-28 18:01 UTC, Rask Ingemann Lambertsen
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description 75773-quiet 2001-07-01 09:16:00 UTC
[ Reported to the Debian BTS as report #75773.
  Please CC 75773-quiet@bugs.debian.org on replies.
  Log of report can be found at http://bugs.debian.org/75773 ]
 	

For the file

unsigned long foo(unsigned long a, unsigned long b) {
	unsigned long c = a - b;

	if (a < b) {
		c += 100;
	}

	return c;
}

gcc -O2 -S generates (The cmpl after the subl is unnecessary):

        .file   "bug-75773.c"
        .text
        .align 2
        .p2align 2,,3
.globl foo
        .type   foo,@function
foo:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %edx
        movl    12(%ebp), %eax
        movl    %edx, %ecx
        subl    %eax, %ecx
        cmpl    %eax, %edx
        jae     .L2
        addl    $100, %ecx
.L2:
        movl    %ecx, %eax
        popl    %ebp
        ret
.Lfe1:
        .size   foo,.Lfe1-foo
        .ident  "GCC: (GNU) 3.1 20010701 (experimental)"

Release:
3.0 (Debian GNU/Linux) and HEAD 20010701

Environment:
System: Debian GNU/Linux (testing/unstable)
Architecture: i686
	
host: i386-linux
build: i386-linux
target: i386-linux
configured with: ../src/configure -v --enable-languages=c,c++,java,f77,proto,objc --prefix=/usr --infodir=/share/info --mandir=/share/man --enable-shared --with-gnu-as --with-gnu-ld --with-system-zlib --enable-long-long --enable-nls --without-included-gettext --disable-checking --enable-threads=posix --enable-java-gc=boehm --with-cpp-install-dir=bin --enable-objc-gc i386-linux
Comment 1 Richard Henderson 2002-04-02 02:36:06 UTC
State-Changed-From-To: open->analyzed
State-Changed-Why: (set (reg/v:SI 61)
            (minus:SI (reg/v:SI 59) (reg/v:SI 60)))
    
    and
    
       (set (reg:CC 17 flags)
            (compare:CC (reg/v:SI 59) (reg/v:SI 60))) 
    
    are not LOG_LINK related, because they share no common
    destination, so combine doesn't merge them.  But after
    reload, they don't have the same arguments, so we can't
    peephole them there either.
Comment 2 Andrew Pinski 2003-08-03 17:59:18 UTC
It also happens on powerpc-apple-darwin6.6
Currently GCC produces:
_foo:
        cmplw cr0,r3,r4
        subf r3,r4,r3
        bgelr- cr0
        addi r3,r3,100
        blr
But GCC should be able to produce:
_foo:
        subf. r3,r4,r3
        bgelr- cr0
        addi r3,r3,100
        blr
Which is smaller (and faster on every thing except maybe 970 and Power4).
Comment 3 Kazu Hirata 2004-01-04 08:26:38 UTC
Replacing a < b with (long) c < 0 sounds like a CSE type of work
but does not quite fit in the current framework of replacing
expensive expressions with cheaper equivalents that are already available.
That is, nobody has computed (long) c < 0 before, so it's not available.

Comment 4 Cyrille Chépélov 2004-03-03 19:46:37 UTC
Observed on m68k-palmos (2.95.3+Debian boatload of patches), arm-palmos (3.2.2),
AVR (3.3, 20030512 prerelease, likely patched), and on i386 (3.3.3, mainline and
tree-ssa as of 20040302).

I suggest updating $SUMMARY and $TARGET accordingly.

Further testing (see attachement) show that opportunities can be missed even for
simple signed ints: bar() has the problem, while baz() and bat() use either cmp
or sub, but not both. Furthermore, baz() seems to show that the combine pass is
doing a good job of finding the right compare instruction when the primary
result of the subtraction is thrown away. Could it be possible that a simple and
systematic:

   (a < b)     ===>  (coerce_to_signed_type<>(a - b) < 0) 

replacement (done perhaps at the SSA level) would resolve this, without
incurring excessive costs?
Comment 5 Cyrille Chépélov 2004-03-03 19:52:33 UTC
Created attachment 5856 [details]
a variant test (with signed ints instead of unsigned longs -- same problem)

bar() shows the same problem as the original poster

baz() shows a variant of bar(), which although not semantically identical,
shows on i386 that the RTL optimizer is able to use a cmp instruction anyway
(also on AVR, but NOT on ARM).

bat() is semantically identical to bar() (modulo possible data overflow), and
shows the kind of code which would be expected from compiling bar().
Comment 6 Steven Bosscher 2005-01-23 14:57:43 UTC
Still there as of 2005-01-23 
Comment 7 Eric Weddington 2007-08-23 20:10:25 UTC
Confirmed for AVR. GCC 4.2.1 for avr generates this:

foo:
/* prologue: frame size=0 */
	push r14
	push r15
	push r16
	push r17
/* prologue end (size=4) */
	movw r14,r22
	movw r16,r24
	sub r14,r18
	sbc r15,r19
	sbc r16,r20
	sbc r17,r21
	cp r22,r18
	cpc r23,r19
	cpc r24,r20
	cpc r25,r21
	brsh .L2
	ldi r24,lo8(100)
	ldi r25,hi8(100)
	ldi r26,hlo8(100)
	ldi r27,hhi8(100)
	add r14,r24
	adc r15,r25
	adc r16,r26
	adc r17,r27
.L2:
	movw r24,r16
	movw r22,r14
/* epilogue: frame size=0 */
	pop r17
	pop r16
	pop r15
	pop r14
	ret

Ideally it should be something like:

foo:
/* prologue: frame size=0 */
	sub r22,r18
	sbc r23,r19
	sbc r24,r20
	sbc r25,r21
	brcc .L2
	ldi r18,lo8(100)
	ldi r19,hi8(100)
	ldi r20,hlo8(100)
	ldi r21,hhi8(100)
	add r22,r18
	adc r23,r19
	adc r24,r20
	adc r25,r21
.L2:
/* epilogue: frame size=0 */
	ret

Which is less than half the number of instructions.

Changing summary and target fields
Comment 8 Rask Ingemann Lambertsen 2007-11-27 18:04:31 UTC
Created attachment 14647 [details]
Patch to enhance cse.c

The attached patch can optimize this testcase:

void foo (int);

int bar2 (int a, int b)
{
  int c = a - b;
  if (a < b)
    {
      foo (c);
    }
  return c;
}

Before:
bar2:
	pushl	%ebx		# 33	*pushsi2	[length = 1]
	subl	$8, %esp	# 34	pro_epilogue_adjust_stack_1/1	[length = 3]
	movl	16(%esp), %edx	# 2	*movsi_1/1	[length = 4]
	movl	20(%esp), %eax	# 3	*movsi_1/1	[length = 4]
	movl	%edx, %ebx	# 32	*movsi_1/1	[length = 2]
	subl	%eax, %ebx	# 7	*subsi_1/1	[length = 2]
	cmpl	%eax, %edx	# 8	*cmpsi_1_insn/1	[length = 2]
	jge	.L2		# 9	*jcc_1	[length = 2]
	movl	%ebx, (%esp)	# 11	*movsi_1/2	[length = 3]
	call	foo		# 12	*call_0	[length = 5]
.L2:
	movl	%ebx, %eax	# 19	*movsi_1/1	[length = 2]
	addl	$8, %esp	# 37	pro_epilogue_adjust_stack_1/1	[length = 3]
	popl	%ebx		# 38	popsi1	[length = 1]
	ret			# 39	return_internal	[length = 1]

After:
bar2:
	pushl	%ebx		# 33	*pushsi2	[length = 1]
	subl	$8, %esp	# 34	pro_epilogue_adjust_stack_1/1	[length = 3]
	movl	16(%esp), %eax	# 2	*movsi_1/1	[length = 4]
	movl	%eax, %ebx	# 32	*movsi_1/1	[length = 2]
	subl	20(%esp), %ebx	# 8	*subsi_2/2	[length = 4]
	jns	.L2		# 9	*jcc_1	[length = 2]
	movl	%ebx, (%esp)	# 11	*movsi_1/2	[length = 3]
	call	foo		# 12	*call_0	[length = 5]
.L2:
	movl	%ebx, %eax	# 19	*movsi_1/1	[length = 2]
	addl	$8, %esp	# 37	pro_epilogue_adjust_stack_1/1	[length = 3]
	popl	%ebx		# 38	popsi1	[length = 1]
	ret		# 39	return_internal	[length = 1]

One of the difficulties is that by the time the code gets to the cse1 pass, the statement "c = a - b" might have been moved below the "a < b" test, making it harder to optimize. The "bar2" testcase was crafted so this doesn't happen. But given the need to potentially replace both the sub/cmp insn and the jump insn, it would be better to move this optimization to combine.
Comment 9 Rask Ingemann Lambertsen 2007-11-28 18:01:16 UTC
Created attachment 14657 [details]
Patch v2 to enhance cse.c

This patch also handles unsigned comparisons and thus optimizes the original testcase too. Before:

foo:
	movl	4(%esp), %edx	# 2	*movsi_1/1	[length = 4]
	movl	8(%esp), %eax	# 3	*movsi_1/1	[length = 4]
	movl	%edx, %ecx	# 35	*movsi_1/1	[length = 2]
	subl	%eax, %ecx	# 7	*subsi_1/1	[length = 2]
	cmpl	%eax, %edx	# 8	*cmpsi_1_insn/1	[length = 2]
	jae	.L2		# 9	*jcc_1		[length = 2]
	addl	$100, %ecx	# 11	*addsi_1/1	[length = 3]
.L2:
	movl	%ecx, %eax	# 18	*movsi_1/1	[length = 2]
	ret			# 38	return_internal	[length = 1]

After:
foo:
	movl	4(%esp), %eax	# 2	*movsi_1/1	[length = 4]
	subl	8(%esp), %eax	# 8	*subsi3_cc_overflow/2	[length = 4]
	jae	.L2		# 9	*jcc_1		[length = 2]
	addl	$100, %eax	# 11	*addsi_1/1	[length = 3]
.L2:
	rep			# 39	return_internal_long	[length = 1]
	ret

I was going to abandon this patch, but maybe it deserves a second chance. :-)
Comment 10 Steven Bosscher 2007-11-28 22:02:46 UTC
> +  for (defs = DF_INSN_DEFS (insn);
> +       *defs && DF_REF_REGNO (*defs) != REGNO (x);
> +       defs++)
> +    ;

Are you aware of df_find_def() ?


> +	  if (minus_elt)
> +	      cmp = cse_find_comparison_use (dest, insn);

Formatting.


IMNSHO, computing DEF-USE chains for this niche optimization loses in the cost/benefit trade-off.  The whole patch looks like a hack to me to specifically deal with this particular bug report.  You could, of course, show that this is actually a very important optimization...

I wonder if you can't just integrate this optimization in cse.c as-is by recording an equivalence "a < b" == "signof(c)" when you process "a - b".

In any case, adding this optimization to cse.c is Just Wrong (tm).  I don't understand why you didn't even try to optimize this in one of the tree optimizers instead.  I thought it was clear GCC is trying to reduce its dependency on RTL optimizations?
Comment 11 Rask Ingemann Lambertsen 2007-11-29 00:09:51 UTC
In reply to comment #10 from Steven Bosscher 2007-11-28 22:02:

> > +  for (defs = DF_INSN_DEFS (insn);
> > +       *defs && DF_REF_REGNO (*defs) != REGNO (x);
> > +       defs++)
> > +    ;

> Are you aware of df_find_def() ?

   Not until now, and it won't work either, because it uses rtx_equal_p() and the mode of DF_REF_REG() doesn't match that of the REG rtx in the insn. See also <URL:http://gcc.gnu.org/ml/gcc/2007-11/msg00719.html>.

> IMNSHO, computing DEF-USE chains for this niche optimization loses in the
> cost/benefit trade-off.

   The split of the comparison setter and the comparison user across two
insns with no link between them is an interesting case of poor
infrastructure. Most back ends can't emit the setter before they know what
the user looks like and therefore always emit the two back-to-back. At the
same time, several passes need to find one from the other but can't rely on
them to be back-to-back and because there's no link between them (except if
by DEF-USE/USE-DEF), they have to roll their own means of doing so.

   (So yes, it's been done without DEF-USE before and it can be done without DEF-USE again.)

> I wonder if you can't just integrate this optimization in cse.c as-is by
> recording an equivalence "a < b" == "signof(c)" when you process "a - b".

   That doesn't catch the unsigned comparison. Actually, how would I even
know if it is unsigned or not?

> In any case, adding this optimization to cse.c is Just Wrong (tm).  I
> don't understand why you didn't even try to optimize this in one of the
> tree optimizers instead.  I thought it was clear GCC is trying to reduce
> its dependency on RTL optimizations?

   There's no way to optimize the signed case with -fwrapv (e.g. Java) at
the tree level, because we can't arrange for a - b to be computed in the
same instruction as a < b. At the RTL level, it is merely difficult. That
makes it less interesting to work on this optimization at the tree level.
Comment 12 Paolo Bonzini 2007-11-29 11:43:39 UTC
I think this should use find_comparison_args.
Comment 13 Daniel Santos 2012-07-10 10:23:08 UTC
Well here's another test case for the same problem:

extern print_gt(void);
extern print_lt(void);
extern print_eq(void);

void cmp_and_branch(long a, long b)
{
	long diff = a - b;

	if (diff > 0) {
		print_gt();
	} else if (diff < 0) {
		print_lt();
	} else {
		print_eq();
	}
}

In this case, the result of the subtraction is directly used in the branch code.  However, gcc -O2 -S still generates this output:

	.file	"gcc_cmp_and_branch_test2.c"
	.text
	.p2align 4,,15
	.globl	cmp_and_branch
	.type	cmp_and_branch, @function
cmp_and_branch:
.LFB0:
	.cfi_startproc
	subq	%rsi, %rdi
	cmpq	$0, %rdi
	jg	.L5
	jne	.L6
	jmp	print_eq
	.p2align 4,,10
	.p2align 3
.L5:
	jmp	print_gt
	.p2align 4,,10
	.p2align 3
.L6:
	jmp	print_lt
	.cfi_endproc
.LFE0:
	.size	cmp_and_branch, .-cmp_and_branch
	.ident	"GCC: (Gentoo 4.7.1) 4.7.1"
	.section	.note.GNU-stack,"",@progbits

I'm using Gentoo x86_64:

Using built-in specs.
COLLECT_GCC=/usr/x86_64-pc-linux-gnu/gcc-bin/4.7.1/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/4.7.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /tmp/portage/sys-devel/gcc-4.7.1/work/gcc-4.7.1/configure --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/4.7.1 --includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.1/include --datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.7.1 --mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.7.1/man --infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.7.1/info --with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.7.1/include/g++-v4 --host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-altivec --disable-fixed-point --with-ppl --with-cloog --disable-ppl-version-check --with-cloog-include=/usr/include/cloog-ppl --enable-lto --enable-nls --without-included-gettext --with-system-zlib --enable-obsolete --disable-werror --enable-secureplt --enable-multilib --with-multilib-list=m32,m64 --enable-libmudflap --disable-libssp --enable-libgomp --with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/4.7.1/python --enable-checking=release --enable-java-awt=gtk --enable-objc-gc --enable-languages=c,c++,java,objc,obj-c++,fortran --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --enable-targets=all --with-bugurl=http://bugs.gentoo.org/ --with-pkgversion='Gentoo 4.7.1'
Thread model: posix
gcc version 4.7.1 (Gentoo 4.7.1) 

I do hope this can be addressed sometime soon.  Thanks.
Comment 14 owner 2012-07-10 10:54:26 UTC
Thank you for the additional information you have supplied regarding
this Bug report.

This is an automatically generated reply to let you know your message
has been received.

Your message has not been forwarded to the package maintainers or
other interested parties; you should ensure that the developers are
aware of the problem you have entered into the system - preferably
quoting the Bug reference number, #75773.

If you wish to submit further information on this problem, please
send it to 75773-quiet@bugs.debian.org.

Please do not send mail to owner@bugs.debian.org unless you wish
to report a problem with the Bug-tracking system.