Bug List: (This bug is not in your last search results)   Show last search results      Search page      Enter new bug
Bug#: 3507
Product:  
Component:  
Status: ASSIGNED
Resolution:
Assigned To: Rask Ingemann Lambertsen <rask@gcc.gnu.org>
Host:
Reported against  
Priority:  
Severity:  
Target Milestone:  
 
 
Target:
Reporter: 75773-quiet@bugs.debian.org
Add CC:
CC:
Remove selected CCs
Build:
URL:
Summary:
Keywords:
Known to work:
Known to fail:

Attachment Description Type Created Size Actions
test.c a variant test (with signed ints instead of unsigned longs -- same problem) text/plain 2004-03-03 19:52 115 bytes Edit
pr3507.patch Patch to enhance cse.c patch 2007-11-27 18:04 1.53 KB Edit | Diff
pr3507-v2.patch Patch v2 to enhance cse.c patch 2007-11-28 18:01 1.75 KB Edit | Diff
Create a New Attachment (proposed patch, testcase, etc.) View All

Bug 3507 depends on: Show dependency tree
Show dependency graph
Bug 3507 blocks:

Additional Comments:




Mark bug as waiting for feedback
Mark bug as suspended




View Bug Activity   |   Format For Printing   |   Clone This Bug


Description:   Last confirmed: 2007-11-02 14:17 Opened: 2001-07-01 09:16
[ Reported to the Debian BTS as report #75773.
  Please CC 75773-quiet@bugs.debian.org on replies.
  Log of report can be found at http://bugs.debian.org/75773 ]
 	

For the file

unsigned long foo(unsigned long a, unsigned long b) {
	unsigned long c = a - b;

	if (a < b) {
		c += 100;
	}

	return c;
}

gcc -O2 -S generates (The cmpl after the subl is unnecessary):

        .file   "bug-75773.c"
        .text
        .align 2
        .p2align 2,,3
.globl foo
        .type   foo,@function
foo:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %edx
        movl    12(%ebp), %eax
        movl    %edx, %ecx
        subl    %eax, %ecx
        cmpl    %eax, %edx
        jae     .L2
        addl    $100, %ecx
.L2:
        movl    %ecx, %eax
        popl    %ebp
        ret
.Lfe1:
        .size   foo,.Lfe1-foo
        .ident  "GCC: (GNU) 3.1 20010701 (experimental)"

Release:
3.0 (Debian GNU/Linux) and HEAD 20010701

Environment:
System: Debian GNU/Linux (testing/unstable)
Architecture: i686
	
host: i386-linux
build: i386-linux
target: i386-linux
configured with: ../src/configure -v --enable-languages=c,c++,java,f77,proto,objc --prefix=/usr --infodir=/share/info --mandir=/share/man --enable-shared --with-gnu-as --with-gnu-ld --with-system-zlib --enable-long-long --enable-nls --without-included-gettext --disable-checking --enable-threads=posix --enable-java-gc=boehm --with-cpp-install-dir=bin --enable-objc-gc i386-linux

------- Comment #1 From Richard Henderson 2002-04-02 02:36 -------
State-Changed-From-To: open->analyzed
State-Changed-Why: (set (reg/v:SI 61)
            (minus:SI (reg/v:SI 59) (reg/v:SI 60)))
    
    and
    
       (set (reg:CC 17 flags)
            (compare:CC (reg/v:SI 59) (reg/v:SI 60))) 
    
    are not LOG_LINK related, because they share no common
    destination, so combine doesn't merge them.  But after
    reload, they don't have the same arguments, so we can't
    peephole them there either.

------- Comment #2 From Andrew Pinski 2003-08-03 17:59 -------
It also happens on powerpc-apple-darwin6.6
Currently GCC produces:
_foo:
        cmplw cr0,r3,r4
        subf r3,r4,r3
        bgelr- cr0
        addi r3,r3,100
        blr
But GCC should be able to produce:
_foo:
        subf. r3,r4,r3
        bgelr- cr0
        addi r3,r3,100
        blr
Which is smaller (and faster on every thing except maybe 970 and Power4).

------- Comment #3 From Kazu Hirata 2004-01-04 08:26 -------
Replacing a < b with (long) c < 0 sounds like a CSE type of work
but does not quite fit in the current framework of replacing
expensive expressions with cheaper equivalents that are already available.
That is, nobody has computed (long) c < 0 before, so it's not available.


------- Comment #4 From Cyrille Chépélov 2004-03-03 19:46 -------
Observed on m68k-palmos (2.95.3+Debian boatload of patches), arm-palmos
(3.2.2),
AVR (3.3, 20030512 prerelease, likely patched), and on i386 (3.3.3, mainline
and
tree-ssa as of 20040302).

I suggest updating $SUMMARY and $TARGET accordingly.

Further testing (see attachement) show that opportunities can be missed even
for
simple signed ints: bar() has the problem, while baz() and bat() use either cmp
or sub, but not both. Furthermore, baz() seems to show that the combine pass is
doing a good job of finding the right compare instruction when the primary
result of the subtraction is thrown away. Could it be possible that a simple
and
systematic:

   (a < b)     ===>  (coerce_to_signed_type<>(a - b) < 0) 

replacement (done perhaps at the SSA level) would resolve this, without
incurring excessive costs?

------- Comment #5 From Cyrille Chépélov 2004-03-03 19:52 -------
Created an attachment (id=5856) [edit]
a variant test (with signed ints instead of unsigned longs -- same problem)

bar() shows the same problem as the original poster

baz() shows a variant of bar(), which although not semantically identical,
shows on i386 that the RTL optimizer is able to use a cmp instruction anyway
(also on AVR, but NOT on ARM).

bat() is semantically identical to bar() (modulo possible data overflow), and
shows the kind of code which would be expected from compiling bar(). 

------- Comment #6 From Steven Bosscher 2005-01-23 14:57 -------
Still there as of 2005-01-23 

------- Comment #7 From Eric Weddington 2007-08-23 20:10 -------
Confirmed for AVR. GCC 4.2.1 for avr generates this:

foo:
/* prologue: frame size=0 */
        push r14
        push r15
        push r16
        push r17
/* prologue end (size=4) */
        movw r14,r22
        movw r16,r24
        sub r14,r18
        sbc r15,r19
        sbc r16,r20
        sbc r17,r21
        cp r22,r18
        cpc r23,r19
        cpc r24,r20
        cpc r25,r21
        brsh .L2
        ldi r24,lo8(100)
        ldi r25,hi8(100)
        ldi r26,hlo8(100)
        ldi r27,hhi8(100)
        add r14,r24
        adc r15,r25
        adc r16,r26
        adc r17,r27
.L2:
        movw r24,r16
        movw r22,r14
/* epilogue: frame size=0 */
        pop r17
        pop r16
        pop r15
        pop r14
        ret

Ideally it should be something like:

foo:
/* prologue: frame size=0 */
        sub r22,r18
        sbc r23,r19
        sbc r24,r20
        sbc r25,r21
        brcc .L2
        ldi r18,lo8(100)
        ldi r19,hi8(100)
        ldi r20,hlo8(100)
        ldi r21,hhi8(100)
        add r22,r18
        adc r23,r19
        adc r24,r20
        adc r25,r21
.L2:
/* epilogue: frame size=0 */
        ret

Which is less than half the number of instructions.

Changing summary and target fields

------- Comment #8 From Rask Ingemann Lambertsen 2007-11-27 18:04 -------
Created an attachment (id=14647) [edit]
Patch to enhance cse.c

The attached patch can optimize this testcase:

void foo (int);

int bar2 (int a, int b)
{
  int c = a - b;
  if (a < b)
    {
      foo (c);
    }
  return c;
}

Before:
bar2:
        pushl   %ebx            # 33    *pushsi2        [length = 1]
        subl    $8, %esp        # 34    pro_epilogue_adjust_stack_1/1   [length
= 3]
        movl    16(%esp), %edx  # 2     *movsi_1/1      [length = 4]
        movl    20(%esp), %eax  # 3     *movsi_1/1      [length = 4]
        movl    %edx, %ebx      # 32    *movsi_1/1      [length = 2]
        subl    %eax, %ebx      # 7     *subsi_1/1      [length = 2]
        cmpl    %eax, %edx      # 8     *cmpsi_1_insn/1 [length = 2]
        jge     .L2             # 9     *jcc_1  [length = 2]
        movl    %ebx, (%esp)    # 11    *movsi_1/2      [length = 3]
        call    foo             # 12    *call_0 [length = 5]
.L2:
        movl    %ebx, %eax      # 19    *movsi_1/1      [length = 2]
        addl    $8, %esp        # 37    pro_epilogue_adjust_stack_1/1   [length
= 3]
        popl    %ebx            # 38    popsi1  [length = 1]
        ret                     # 39    return_internal [length = 1]

After:
bar2:
        pushl   %ebx            # 33    *pushsi2        [length = 1]
        subl    $8, %esp        # 34    pro_epilogue_adjust_stack_1/1   [length
= 3]
        movl    16(%esp), %eax  # 2     *movsi_1/1      [length = 4]
        movl    %eax, %ebx      # 32    *movsi_1/1      [length = 2]
        subl    20(%esp), %ebx  # 8     *subsi_2/2      [length = 4]
        jns     .L2             # 9     *jcc_1  [length = 2]
        movl    %ebx, (%esp)    # 11    *movsi_1/2      [length = 3]
        call    foo             # 12    *call_0 [length = 5]
.L2:
        movl    %ebx, %eax      # 19    *movsi_1/1      [length = 2]
        addl    $8, %esp        # 37    pro_epilogue_adjust_stack_1/1   [length
= 3]
        popl    %ebx            # 38    popsi1  [length = 1]
        ret             # 39    return_internal [length = 1]

One of the difficulties is that by the time the code gets to the cse1 pass, the
statement "c = a - b" might have been moved below the "a < b" test, making it
harder to optimize. The "bar2" testcase was crafted so this doesn't happen. But
given the need to potentially replace both the sub/cmp insn and the jump insn,
it would be better to move this optimization to combine.

------- Comment #9 From Rask Ingemann Lambertsen 2007-11-28 18:01 -------
Created an attachment (id=14657) [edit]
Patch v2 to enhance cse.c

This patch also handles unsigned comparisons and thus optimizes the original
testcase too. Before:

foo:
        movl    4(%esp), %edx   # 2     *movsi_1/1      [length = 4]
        movl    8(%esp), %eax   # 3     *movsi_1/1      [length = 4]
        movl    %edx, %ecx      # 35    *movsi_1/1      [length = 2]
        subl    %eax, %ecx      # 7     *subsi_1/1      [length = 2]
        cmpl    %eax, %edx      # 8     *cmpsi_1_insn/1 [length = 2]
        jae     .L2             # 9     *jcc_1          [length = 2]
        addl    $100, %ecx      # 11    *addsi_1/1      [length = 3]
.L2:
        movl    %ecx, %eax      # 18    *movsi_1/1      [length = 2]
        ret                     # 38    return_internal [length = 1]

After:
foo:
        movl    4(%esp), %eax   # 2     *movsi_1/1      [length = 4]
        subl    8(%esp), %eax   # 8     *subsi3_cc_overflow/2   [length = 4]
        jae     .L2             # 9     *jcc_1          [length = 2]
        addl    $100, %eax      # 11    *addsi_1/1      [length = 3]
.L2:
        rep                     # 39    return_internal_long    [length = 1]
        ret

I was going to abandon this patch, but maybe it deserves a second chance. :-)

------- Comment #10 From Steven Bosscher 2007-11-28 22:02 -------
> +  for (defs = DF_INSN_DEFS (insn);
> +       *defs && DF_REF_REGNO (*defs) != REGNO (x);
> +       defs++)
> +    ;

Are you aware of df_find_def() ?


> +	  if (minus_elt)
> +	      cmp = cse_find_comparison_use (dest, insn);

Formatting.


IMNSHO, computing DEF-USE chains for this niche optimization loses in the
cost/benefit trade-off.  The whole patch looks like a hack to me to
specifically deal with this particular bug report.  You could, of course, show
that this is actually a very important optimization...

I wonder if you can't just integrate this optimization in cse.c as-is by
recording an equivalence "a < b" == "signof(c)" when you process "a - b".

In any case, adding this optimization to cse.c is Just Wrong (tm).  I don't
understand why you didn't even try to optimize this in one of the tree
optimizers instead.  I thought it was clear GCC is trying to reduce its
dependency on RTL optimizations?

------- Comment #11 From Rask Ingemann Lambertsen 2007-11-29 00:09 -------
In reply to comment #10 from Steven Bosscher 2007-11-28 22:02:

> > +  for (defs = DF_INSN_DEFS (insn);
> > +       *defs && DF_REF_REGNO (*defs) != REGNO (x);
> > +       defs++)
> > +    ;

> Are you aware of df_find_def() ?

   Not until now, and it won't work either, because it uses rtx_equal_p() and
the mode of DF_REF_REG() doesn't match that of the REG rtx in the insn. See
also <URL:http://gcc.gnu.org/ml/gcc/2007-11/msg00719.html>.

> IMNSHO, computing DEF-USE chains for this niche optimization loses in the
> cost/benefit trade-off.

   The split of the comparison setter and the comparison user across two
insns with no link between them is an interesting case of poor
infrastructure. Most back ends can't emit the setter before they know what
the user looks like and therefore always emit the two back-to-back. At the
same time, several passes need to find one from the other but can't rely on
them to be back-to-back and because there's no link between them (except if
by DEF-USE/USE-DEF), they have to roll their own means of doing so.

   (So yes, it's been done without DEF-USE before and it can be done without
DEF-USE again.)

> I wonder if you can't just integrate this optimization in cse.c as-is by
> recording an equivalence "a < b" == "signof(c)" when you process "a - b".

   That doesn't catch the unsigned comparison. Actually, how would I even
know if it is unsigned or not?

> In any case, adding this optimization to cse.c is Just Wrong (tm).  I
> don't understand why you didn't even try to optimize this in one of the
> tree optimizers instead.  I thought it was clear GCC is trying to reduce
> its dependency on RTL optimizations?

   There's no way to optimize the signed case with -fwrapv (e.g. Java) at
the tree level, because we can't arrange for a - b to be computed in the
same instruction as a < b. At the RTL level, it is merely difficult. That
makes it less interesting to work on this optimization at the tree level.

------- Comment #12 From Paolo Bonzini 2007-11-29 11:43 -------
I think this should use find_comparison_args.

Bug List: (This bug is not in your last search results)   Show last search results      Search page      Enter new bug