30315 – optimize unsigned-add overflow test on x86 to use cpu flags from addl

Bug 30315 - optimize unsigned-add overflow test on x86 to use cpu flags from addl

Summary: optimize unsigned-add overflow test on x86 to use cpu flags from addl

Status:	ASSIGNED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	unknown

Importance:	P3 enhancement
Target Milestone:	---
Assignee:	Rask Ingemann Lambertsen

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2006-12-28 07:09 UTC by Ken Raeburn
Modified:	2019-06-06 12:28 UTC (History)
CC List:	5 users (show)

See Also:
Host:	powerpc-apple-darwin
Target:	i386-unknown-linux
Build:	powerpc-apple-darwin
Known to work:
Known to fail:
Last reconfirmed:	2007-08-19 13:00:47

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Ken Raeburn 2006-12-28 07:09:00 UTC

Using gcc-4.3-20061223 targeting i386-linux and options "-O9 -fomit-frame-pointer -fexpensive-optimizations -S", my test program to detect unsigned integer overflow:

extern void abort(void);
unsigned foo(unsigned a, unsigned b) {
    unsigned sum = a + b;
    if (sum < a) abort(); /* check for overflow (wrapping) */
    if (sum < b) abort(); /* redundant */
    return sum;
}

compiles to:

foo:
        subl    $12, %esp
        movl    16(%esp), %eax
        movl    20(%esp), %edx
        addl    %eax, %edx
        cmpl    %edx, %eax
        ja      .L7
        cmpl    %edx, 20(%esp)
        ja      .L7
        movl    %edx, %eax
        addl    $12, %esp
        ret
.L7:
        call    abort

After the addition, which sets the ALU flags, the compiler issues two compare instructions and conditional branches.  This sequence could be replaced by a conditional branch following the addl, testing one of the flags (overflow? carry? I forget which) set by it.

I realize for other processors (e.g., with -march=pentiumpro), the addl may be replaced by something like leal, which may not set the flags the same way.  But if the flags are set when building for a certain architecture, use them....  And is leal+cmp with data dependence still going to be better than addl?

(Also, I think a different set of register selections could eliminate the last "movl" instruction, by putting the result of the addition into the return-value register.)

Comment 1 Ryan Johnson 2007-06-06 03:39:46 UTC

Happens on x86_64-unknown-linux-gnu as well, for both 4.2.0 and 4.3 (20070605)

The problem is even worse for 128-bit arithmetic because it has to check two registers (with associated branches) before making a decision. This in spite of the fact that sbb sets the flags properly AFAIK:

bool sub128(__uint128_t &dest, __uint128_t a, __uint128_t b) {
  dest = a - b;
  if(dest > a) abort();
}

_Z6sub128Rooo:
.LFB557:
	movq	%rsi, %rax
	movq	%rdx, %r10
	pushq	%rbx
.LCFI0:
	subq	%rcx, %rax
	sbbq	%r8, %rdx
	movq	%rax, (%rdi)
	cmpq	%rdx, %r10
	movq	%rdx, 8(%rdi)
	ja	.L23
	jae	.L24
.L21:
	call	abort
	.p2align 4,,7
.L24:
	cmpq	%rax, %rsi
	.p2align 4,,6
	jb	.L21
	.p2align 4,,7
.L23:
	popq	%rbx
	.p2align 4,,5
	ret

There's not really a way to work around it with inline asm, either, because of the branch on overflow that will most likely come right afterward...

Comment 2 Uroš Bizjak 2007-06-06 08:15:56 UTC

(In reply to comment #0)

> After the addition, which sets the ALU flags, the compiler issues two compare
> instructions and conditional branches.  This sequence could be replaced by a
> conditional branch following the addl, testing one of the flags (overflow?
> carry? I forget which) set by it.

in config/i386/i386-modes.def, documentation says:

   Add CCGOC to indicate comparisons against zero that allows
   unspecified garbage in the Carry and Overflow flag. This
   mode is used to simulate comparisons of (a-b) and (a+b)
   against zero using sub/cmp/add operations.

addl operates in CCGOCmode, but you are requesting GTU conditional jump that requires Carry flag=0 and Zero flag=0. GTU is incompatible with CCGOCmode due to Carry flag handling.

Change one of conditions to "if (sum == 0) abort();" and you will see insn elimination in action.

And BTW - wrapping is undefined operation.

Comment 3 pinskia@gmail.com 2007-06-06 09:10:41 UTC

Subject: Re:  optimize unsigned-add overflow test on x86 to use cpu flags from addl

> And BTW - wrapping is undefined operation.

One for signed integers for unsigned it is actually defined as
wrapping and the reporter used unsigned integers here.

Thanks,
Andrew Pinski

Comment 4 Rask Ingemann Lambertsen 2007-06-06 10:33:25 UTC

   I see no reason to mark this enhancement request as invalid.

   As to generating reasonable x86 code for overflow checks written in C, it isn't completely hopeless. I did an experiment with my 16-bit x86 port. First, I wrote the overflow check differently (and of course changed the data size to 16-bits):

#include <limits.h>

void abort (void);

unsigned short int foo (unsigned short int a, unsigned short int b)
{
  unsigned short int sum;

  sum = a + b;
  if ((unsigned long int) a + (unsigned long int) b > USHRT_MAX)
    abort ();
  return sum;
}

   Second, I added new instructions, of which combine was asking for the first one:

(define_insn_and_split "*addsi3_cconly_highpart_ccz_nc"
  [(set (reg:CCZ_NC CC_REG)
	(compare:CCZ_NC
	    (subreg:HI (plus:SI (match_operand:SI 1 "general_operand" "0")
				(match_operand:SI 2 "general_operand" "g")) 2)
	    (const_int 0)))
   (clobber (match_scratch:HI 0 "=r"))]
  "ia16_arith_operands_p (PLUS, operands)"
  "#"
  ""
  [(parallel
  [(set (match_dup 0) (plus:HI (match_dup 3) (match_dup 4)))
   (set (reg:CCZ_NC CC_REG)
	(compare:CCZ_NC (subreg:HI (plus:SI (zero_extend:SI (match_dup 3))
					    (zero_extend:SI (match_dup 4))) 2)
			(const_int 0)))]
  )]
{
  operands[3] = simplify_gen_subreg (HImode, operands[1], SImode, 0);
  operands[4] = simplify_gen_subreg (HImode, operands[2], SImode, 0);
})

(define_insn "*addhi3_cc_carry_ccz_nc"
  [(set (match_operand:HI 0 "nonimmediate_operand" "=qm,r,m")
	(plus:HI (match_operand:HI 1 "nonimmediate_operand" "%0,0,0")
		 (match_operand:HI 2 "general_operand" "Uo,g,ri")))
   (set (reg:CCZ_NC CC_REG)
	(compare:CCZ_NC
	    (subreg:HI (plus:SI (zero_extend:SI (match_dup 1))
				(zero_extend:SI (match_dup 2))) 2)
	    (const_int 0)))]
  "ia16_arith_operands_p (PLUS, operands)"
  "@
   addb\t%H2,\t%H0
   addw\t%2,\t%0
   addw\t%2,\t%0"
)

(define_insn "*addhi3_cconly_carry_ccz_nc"
  [(set (match_scratch:HI 0 "=qm,r,m")
	(plus:HI (match_operand:HI 1 "nonimmediate_operand" "%0,0,0")
		 (match_operand:HI 2 "general_operand" "Uo,g,ri")))
   (set (reg:CCZ_NC CC_REG)
	(compare:CCZ_NC
	    (subreg:HI (plus:SI (zero_extend:SI (match_dup 1))
				(zero_extend:SI (match_dup 2))) 2)
	    (const_int 0)))]
  "ia16_arith_operands_p (PLUS, operands)"
  "@
   addb\t%H2,\t%H0
   addw\t%2,\t%0
   addw\t%2,\t%0"
)

(define_insn "*beq_ccz_nc"
	[(set (pc)
	      (if_then_else (eq (reg:CCZ_NC CC_REG)
	                        (const_int 0))
	                    (label_ref (match_operand 0))
	                    (pc)))]
	""
	"jnc\t%l0"
)

   Third, I created CCZ_NCmode and modified my SELECT_CC_MODE callback to use it for the comparison in "*addsi3_cconly_highpart_ccz_nc". Result:

foo:
	pushw	%cx		;# 69	*pushhi1_nonimm
	pushw	%si		;# 70	*pushhi1_nonimm
	pushw	%di		;# 71	*pushhi1_nonimm
	pushw	%bp		;# 72	*pushhi1_nonimm
	movw	%sp,	%bp	;# 67	*movhi/1
	movw	10(%bp),%di	;# 2	*movhi/1
	movw	%sp,	%si	;# 68	*movhi/1
	movw	12(%si),%bp	;# 3	*movhi/1
	movw	%bp,	%ax	;# 7	*movhi/1
	addw	%di,	%ax	;# 63	*addhi3_cc_carry_ccz_nc/2
	jnc	.L7		;# 15	*beq_ccz_nc
	call	abort		;# 25	*call
.L7:
	popw	%bp		;# 75	*pophi1
	popw	%di		;# 76	*pophi1
	popw	%si		;# 77	*pophi1
	popw	%cx		;# 78	*pophi1
	ret			;# 79	*return

   I see no reason the i386 back end can't be improved in a similiar fashion.

   The original example is a lot more difficult because it uses two comparisons. The middle end would have to recognize that the two comparisons test the same thing.

Comment 5 Uroš Bizjak 2007-06-06 13:00:13 UTC

(In reply to comment #4)

>    I see no reason the i386 back end can't be improved in a similiar fashion.

Interesting...

I agree that this bugreport can be an enhancement request for the code you provided. We can perhaps operate in just introduced CCCmode, but it looks a lot of added functionality to cover this corner case. I'll try to code something.

Comment 6 Ken Raeburn 2007-06-06 14:51:06 UTC

Subject: Re:  optimize unsigned-add overflow test on x86 to use cpu flags from addl

On Jun 6, 2007, at 04:15, ubizjak at gmail dot com wrote:
> in config/i386/i386-modes.def, documentation says:
>
>    Add CCGOC to indicate comparisons against zero that allows
>    unspecified garbage in the Carry and Overflow flag. This
>    mode is used to simulate comparisons of (a-b) and (a+b)
>    against zero using sub/cmp/add operations.
>
> addl operates in CCGOCmode, but you are requesting GTU conditional  
> jump that
> requires Carry flag=0 and Zero flag=0. GTU is incompatible with  
> CCGOCmode due
> to Carry flag handling.

Actually, I think the test I want is just for the carry flag, which  
can be done with one branch instruction using the flags set by addl.   
If gcc's description of addl doesn't support that, well, then that's  
what I'm requesting be enhanced. :-)

Rask's approach looks interesting, though I wouldn't want to have to  
rewrite simple, portable tests for unsigned overflow to use wider  
types, especially if I'm adding unsigned long long values, where I  
don't have portable wider types.

Ken

Comment 7 Rask Ingemann Lambertsen 2007-06-13 13:36:48 UTC

Looking at this again, I don't think the transformation I'm making with the splitter is valid, because I'm making up a zero extension which wasn't there to begin with. The upper part could have been non-zero before the instruction.
As to the original example, it should be possible to optimize

         addl    %eax, %edx
         cmpl    %edx, %eax
         ja      .L7
into
         addl    %eax, %edx
         jc      .L7

with a CCmode expressing that the carry flag is set if %edx is smaller than %eax.

Comment 8 patchapp@dberlin.org 2007-08-03 19:40:20 UTC

Subject: Bug number PR target/30315

A patch for this bug has been added to the patch tracker.
The mailing list url for the patch is http://gcc.gnu.org/ml/gcc-patches/2007-08/msg00204.html

Comment 9 patchapp@dberlin.org 2007-08-04 13:10:29 UTC

Subject: Bug number PR target/30315

A patch for this bug has been added to the patch tracker.
The mailing list url for the patch is http://gcc.gnu.org/ml/gcc-patches/2007-08/msg00242.html

Comment 10 Rask Ingemann Lambertsen 2007-08-14 14:39:38 UTC

Subject: Bug 30315

Author: rask
Date: Tue Aug 14 14:39:24 2007
New Revision: 127481

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=127481
Log:
	PR target/30315
	* config/i386/i386.h (CANONICALIZE_COMPARISON): New.
	* config/i386/i386.md (plusminus)(addsub)(SWI): New.
	(*<addsub><mode>3_cc_overflow): New.
	(*add<mode>3_cconly_overflow): New.
	(*sub<mode>3_cconly_overflow): New.
	(*<addsub>si3_zext_cc_overflow): New.
	* config/i386/predicates.md (fcmov_comparison_operator): Accept
	CCCmode for LTU, GTU, LEU and GEU.
	(ix86_comparison_operator): Likewise.
	(ix86_carry_flag_operator): Carry flag is set if LTU or GTU in CCCmode.
	* gcc/config/i386/i386.c (put_condition_code): Support CCCmode.
	(ix86_cc_mode): Use CCCmode when testing for overflow of PLUS
	or MINUS expressions.

testsuite/
	PR target/30315
	* gcc.target/i386/pr30315.c: New.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr30315.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/i386.h
    trunk/gcc/config/i386/i386.md
    trunk/gcc/config/i386/predicates.md
    trunk/gcc/testsuite/ChangeLog

Comment 11 Ken Raeburn 2007-08-18 17:55:52 UTC

Snapshot gcc-4.3-20070817 seems to do just fine with the sample code I supplied.

Though perhaps I should've given a bigger test case... :-)

One of the places where I'd want to check for overflow is in computing memory allocation sizes (hence my filing #30314 at the same time, and wanting a carry-flag check there too).  This patch lets gcc check the carry flag after addition, which is great.  However, it seems to prevent gcc from being able to look at the zero flag afterwards.  As in, "if the allocation size is zero, ask for one byte so malloc won't return a null pointer":

void *foo(unsigned a, unsigned b) {
    unsigned sum = a + b;
    if (sum < a) return 0;
    if (sum == 0) sum = 1;
    return malloc(sum);
}

Without the overflow check, gcc just uses the zero flag resulting from the addition.  With the overflow check in there, gcc uses the carry flag for the overflow check, but then re-tests the value to see if it's zero:

        addl    %eax, %edx
        jb      L3 ; L3 just returns null
        testl   %edx, %edx
        movl    $1, %eax
        cmove   %eax, %edx

Still, this is better than doing an explicit comparison for the overflow check and then the test for zero.

Comment 12 Rask Ingemann Lambertsen 2007-08-19 13:00:47 UTC

 void *foo(unsigned a, unsigned b) {
     unsigned sum = a + b;
     if (sum < a) return 0;
     if (sum == 0) sum = 1;

To be able to optimize both comparisons, we need to have the same comparison operands. The only way I can see this happening is to canonicalize the overflow check as "a + b < 0" with the obvious fear that since a and b are both unsigned, GCC will optimize the comparison away as always false - unless taught not to do so. I'll see what I can do.

Additionally, the resulting asm seems to be a bit stupid:

	testl	%edx,	%edx
	movl	$1,	%eax
	cmove	%eax,	%edx

should be just

	testl	%edx,	%edx
	sete	%dl

Comment 13 Uroš Bizjak 2007-08-19 18:03:59 UTC

(In reply to comment #12)

>      if (sum == 0) sum = 1;
> 

> Additionally, the resulting asm seems to be a bit stupid:
> 
>         testl   %edx,   %edx
>         movl    $1,     %eax
>         cmove   %eax,   %edx
> 
> should be just
> 
>         testl   %edx,   %edx
>         sete    %dl

Not, because sete sets dl to 0 (zero) when condition is not met. The c code above doesn't change sum when sum != 0.

Regarding the problem, described in comment #11:

Perhaps we can add CCCZmode as a common mode to otherwise orthogonal CCCmode and CCZmode. Playing with i386_cc_modes_compatible() and ix86_match_cc_mode() we can perhaps teach combine to CSE and merge two comparisons. To achieve this, I think we should use (enhanced) ix86_match_cc_mode() in just added RTL patterns.

Comment 14 patchapp@dberlin.org 2007-08-30 19:45:26 UTC

Subject: Bug number PR 30315

A patch for this bug has been added to the patch tracker.
The mailing list url for the patch is http://gcc.gnu.org/ml/gcc-patches/2007-08/msg02217.html

Comment 15 Rask Ingemann Lambertsen 2007-09-09 19:22:12 UTC

Subject: Bug 30315

Author: rask
Date: Sun Sep  9 19:21:59 2007
New Revision: 128305

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=128305
Log:
	PR target/30315
	* config/i386/i386.h (CANONICALIZE_COMPARISON): Delete.
	* simplify-rtx.c (simplify_relational_operation_1): Add the
	canonicalization from i386.h.
	* doc/md.texi (Canonicalization of Instructions): Document it.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.h
    trunk/gcc/doc/md.texi
    trunk/gcc/simplify-rtx.c

Comment 16 Rask Ingemann Lambertsen 2007-11-10 01:32:19 UTC

Two testcases which aren't optimized:

unsigned int bad1 (unsigned int a)
{
  unsigned int c = a - 1;
  if (c > a)
    abort ();
  else
    return c;
}

unsigned int bad2 (unsigned int a)
{
  unsigned int c = a - 2;
  if (c > a)
    abort ();
  else
    return c;
}

See also <URL:http://gcc.gnu.org/ml/gcc-patches/2007-10/msg01359.html>.

Comment 17 Andrew Pinski 2009-01-02 19:29:00 UTC

>* gcc.target/i386/pr30315.c: New.

This testcase fails with -march=pentium-m turned on.

Comment 18 Joseph Parmelee 2009-11-12 17:28:53 UTC

This bug also still appears in 4.4.2 with --with-arch=pentium3.

Comment 19 Uroš Bizjak 2009-11-16 19:05:11 UTC

(In reply to comment #18)
> This bug also still appears in 4.4.2 with --with-arch=pentium3.

pentium3 fails because it is not TARGET_HIMODE_MATH and TARGET_QIMODE_MATH.

So, it fails following testcase:

unsigned short
plusccsa (unsigned short a, unsigned short b)
{
  unsigned short sum = a + b;
  if (sum < a)
    abort ();
  return sum;
}

It used to generate:

plusccsa:
	pushl	%ebp
	movl	%esp, %ebp
	subl	$8, %esp
	movzwl	8(%ebp), %edx
	movzwl	12(%ebp), %eax
	addl	%edx, %eax
	movzwl	%ax, %eax
>>	cmpw	%ax, %dx
	ja	.L97
	leave
	ret
.L97:
	call	abort

And after the fix for wrong mode of compare insn for these targets [1]:

plusccsa:
	pushl	%ebp
	movl	%esp, %ebp
	subl	$8, %esp
	movzwl	8(%ebp), %edx
	movzwl	12(%ebp), %eax
	addl	%edx, %eax
	movzwl	%ax, %eax
>>	cmpl	%eax, %edx
	ja	.L52
	leave
	ret
.L52:
	call	abort

You don't want instructions that operate on bytes or words on pentium3...

At the end of the day, no bug here.

[1] http://gcc.gnu.org/ml/gcc-patches/2009-11/msg00787.html

Comment 20 Vincent Lefèvre 2019-06-06 12:05:52 UTC

Similar to comment 16:

void d (unsigned int a, unsigned int b)
{
  unsigned int diff = a - b;
  if (diff == 0)
    feq ();
  else if (diff > a)
    flt ();
  else
    fgt ();
}

is not optimized.

Under Debian, GCC 10.0.0 20190527 (experimental) [trunk revision 271671] gives for x86_64 (i386 is similar):

[...]
        xorl    %eax, %eax
        cmpl    %esi, %edi
        setb    %al
        je      .L7
        testl   %eax, %eax
        je      .L5
[...]

I would expect just a cmpl, a je and a jl.