If A and B are both unsigned, then 'A > -1 / B' is a nice predicate for
checking whether A*B would overflow. The compiler should be able to optimize
the otherwise-unused division to a multiplication followed by branch on
overflow.

I've tried to add the corresponding transform to match.pd, but it seems
something else needs to be wired up as well, because it doesn't trigger.
What am I missing? TIA.

diff --git a/gcc/match.pd b/gcc/match.pd
index e511e9a..83c02a8 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -2582,6 +2582,16 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
        && types_match (TREE_TYPE (@0), TREE_TYPE (@1)))
    (out (imagpart @2)
     { build_zero_cst (TREE_TYPE (@0)); }))))
 
+/* Simplify unsigned multiplication overflow check A > -1 / B to a builtin.  */
+(for cmp (gt le)
+     out (ne eq)
+ (simplify
+  (cmp @0 (trunc_div:s integer_all_onesp @1))
+  (if (TYPE_UNSIGNED (TREE_TYPE (@0))
+       && types_match (TREE_TYPE (@0), TREE_TYPE (@1)))
+   (out
+    (imagpart (IFN_MUL_OVERFLOW @0 @1))
+    { build_zero_cst (TREE_TYPE (@0)); }))))
+
 /* Simplification of math builtins.  These rules must all be optimizations
    as well as IL simplifications.  If there is a possibility that the new

(In reply to Alexander Monakov from comment #0)
> I've tried to add the corresponding transform to match.pd, but it seems
> something else needs to be wired up as well, because it doesn't trigger.
> What am I missing? TIA.

What do the dumps look like? Gcc is likely to change things to -1 / B < A,
which you don't handle...

> +/* Simplify unsigned multiplication overflow check A > -1 / B to a builtin.  */
> +(for cmp (gt le)
> +     out (ne eq)
> + (simplify
> +  (cmp @0 (trunc_div:s integer_all_onesp @1))
> +  (if (TYPE_UNSIGNED (TREE_TYPE (@0))
> +       && types_match (TREE_TYPE (@0), TREE_TYPE (@1)))

types_match seems unnecessary here, it is mostly useful when we have some
CONVERT_EXPR, but trunc_div's arguments and result types have to match, same
for cmp arguments, so @0 and @1 should always have matching types.

> +   (out
> +    (imagpart (IFN_MUL_OVERFLOW @0 @1))
> +    { build_zero_cst (TREE_TYPE (@0)); }))))

We probably need to disable such a transformation for vectors.

> What do the dumps look like? Gcc is likely to change things to -1 / B < A,
> which you don't handle...

The dumps didn't help much, but you're right that normally the order is
opposite, thanks (I didn't realize that when looking in the debugger the
first time).

I tried to take into account your comments. I also had to make the output
more complex because emitting bare internal functions doesn't work. It still
ICEs for

typedef unsigned v2si __attribute__((vector_size(8)));
v2si h(v2si a, v2si b)
{
  v2si m = {~0, ~0};
  return (a > m / b);
}

but before debugging that I'd like to know if this is the right approach in
the first place.

diff --git a/gcc/match.pd b/gcc/match.pd
index e511e9a..a7fbcf1 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -2582,6 +2582,32 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
        && types_match (TREE_TYPE (@0), TREE_TYPE (@1)))
    (out (imagpart @2)
     { build_zero_cst (TREE_TYPE (@0)); }))))
 
+/* Simplify unsigned multiplication overflow check -1 / A < B to a builtin.  */
+(for cmp (lt ge)
+     out (ne eq)
+ (simplify
+  (cmp (trunc_div:s integer_all_onesp @0) @1)
+  (if (TYPE_UNSIGNED (TREE_TYPE (@0))
+       && !VECTOR_INTEGER_TYPE_P (TREE_TYPE (@0)))
+   (out
+    (imagpart { build_call_expr_internal_loc (UNKNOWN_LOCATION,
+                                              IFN_MUL_OVERFLOW,
+                                              build_complex_type (TREE_TYPE (@0)),
+                                              2, @0, @1); } )
+    { build_zero_cst (TREE_TYPE (@0)); }))))
+
+(for cmp (gt le)
+     out (ne eq)
+ (simplify
+  (cmp @0 (trunc_div:s integer_all_onesp @1))
+  (if (TYPE_UNSIGNED (TREE_TYPE (@0))
+       && !VECTOR_INTEGER_TYPE_P (TREE_TYPE (@0)))
+   (out
+    (imagpart { build_call_expr_internal_loc (UNKNOWN_LOCATION,
+                                              IFN_MUL_OVERFLOW,
+                                              build_complex_type (TREE_TYPE (@0)),
+                                              2, @0, @1); } )
+    { build_zero_cst (TREE_TYPE (@0)); }))))
+
 /* Simplification of math builtins.  These rules must all be optimizations
    as well as IL simplifications.  If there is a possibility that the new

I think genmatch can handle calls to internal functions, the issue is with
guessing the return type. Maybe we could have a specific heuristic for these
functions in the case where the type is not explicitly specified.

(for cmp (lt ge)
     out (ne eq)
 (simplify
  (cmp (trunc_div:s integer_all_onesp @1) @0)
  (if (TYPE_UNSIGNED (TREE_TYPE (@0))
       && !VECTOR_TYPE_P (TREE_TYPE (@0)))
   (with { tree cpx = build_complex_type (TREE_TYPE (@0)); }
    (out
     (imagpart (IFN_MUL_OVERFLOW:cpx @0 @1))
     { build_zero_cst (TREE_TYPE (@0)); })))))

(and the symmetric)

The approach seems fine to me.

Author: amonakov
Date: Mon May 30 14:37:02 2016
New Revision: 236882

URL: https://gcc.gnu.org/viewcvs?rev=236882&root=gcc&view=rev
Log:
match.pd: optimize unsigned mul overflow check

gcc/
2016-05-28  Alexander Monakov  <amonakov@ispras.ru>
	    Marc Glisse  <marc.glisse@inria.fr>

	PR tree-optimization/71289
	* match.pd (-1 / B < A, A > -1 / B): New transformations.

gcc/testsuite/
2016-05-28  Alexander Monakov  <amonakov@ispras.ru>

	PR tree-optimization/71289
	* gcc.dg/pr71289.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/pr71289.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/match.pd
    trunk/gcc/testsuite/ChangeLog

Fixed on the trunk. Thanks again Marc.

Hello. Thank you for implementing the multiplication overflow optimization;
there have been "complaints" on the web [1] about the lack of such an
optimization. However, from what I tested in the current svn source
(r244933), a construct like (b != 0 && -1 / b < a) still does not yield
optimal code. In particular, an unnecessary (b != 0) test is still
generated...

...for a code like this:

----------------------------------------------
#define UINT_MAX (~0U)
void abort(void); /* from stdlib.h */

unsigned int umul_overflow1(unsigned int a, unsigned int b)
{
    if (b != 0 && UINT_MAX / b < a)
        abort();
    return a * b;
}

unsigned int umul_overflow2(unsigned int a, unsigned int b)
{
    unsigned int res;
    if (__builtin_umul_overflow(a, b, &res))
        abort();
    return res;
}
----------------------------------------------

Generated assembly (-O2, gcc configured with --target=x86_64-pc-linux-gnu
--with-arch-64=generic):

umul_overflow1:
.LFB0:
	.cfi_startproc
	testl	%esi, %esi
	je	.L2
	movl	%edi, %eax
	mull	%esi
	jo	.L13
.L2:
	movl	%esi, %eax
	imull	%edi, %eax
	ret
.L13:
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	call	abort
	.cfi_endproc

umul_overflow2:
.LFB1:
	.cfi_startproc
	movl	%edi, %eax
	mull	%esi
	jo	.L22
	rep ret
.L22:
	subq	$8, %rsp
	.cfi_def_cfa_offset 16
	call	abort
	.cfi_endproc
----------------------------------------------

Ideally, both functions would generate the same machine code.

[1] http://kqueue.org/blog/2012/03/16/fast-integer-overflow-detection/