[PATCH] middle-end: Improve RTL expansion in expand_mul_overflow

Richard Sandiford richard.sandiford@arm.com
Thu Jul 9 08:17:46 GMT 2020


"Roger Sayle" <roger@nextmovesoftware.com> writes:
> This patch improves the RTL that the middle-end generates for testing
> signed overflow following a widening multiplication.  During this
> expansion the middle-end generates a truncation which can get used
> multiple times.  Placing this intermediate value in a pseudo register
> reduces the amount of code generated on platforms where this truncation
> requires an explicit instruction.
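
For reference, the expansion in question is the one behind a signed
__builtin_mul_overflow whose multiplication is carried out in a
double-width mode.  A minimal source-level sketch of the pattern (the
function name here is made up; the test file discussed below generates
many such variants from a template):

#include <stdbool.h>

/* The expander computes the full product in a double-width mode and
   detects overflow by comparing the high half against the truncated
   low half shifted right arithmetically by prec-1 (its sign bit
   replicated).  That truncated low half (RES in the patch) is the
   value that ends up being used more than once.  */
bool
smul_overflows (int a, int b, int *prod)
{
  return __builtin_mul_overflow (a, b, prod);
}
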
>
> This simple call to force_reg eliminates 368 lines of the -S output
> from testsuite/c-c++-common/torture/builtin-arith-overflow-1.c on
> nvptx-none.  An example difference is in t120_1smul, where the following
> 7-instruction sequence, in which the 1st and 6th instructions perform
> the same truncation of %r28:
>
> < cvt.u32.u64     %r31, %r28;           <- truncate %r28
> < shr.s32 %r30, %r31, 31;
> < cvt.u32.u64     %r32, %r29;
> < setp.eq.u32     %r33, %r30, %r32;
> < selp.u32        %r24, 0, 1, %r33;
> < cvt.u32.u64     %r25, %r28;           <- truncate %r28
> < setp.eq.u32     %r34, %r24, 0;
>
> is now generated as a 4-instruction sequence without duplication:
>
>> cvt.u32.u64     %r30, %r28;
>> shr.s32 %r31, %r30, 31;
>> cvt.u32.u64     %r32, %r29;
>> setp.eq.u32     %r33, %r31, %r32;
>
> On x86_64-pc-linux-gnu, where SUBREGs are free, this patch generates
> exactly the same builtin-arith-overflow-1.s as before.
>
> This patch has been tested on both x86_64-pc-linux-gnu with
> "make bootstrap" and nvptx-none with "make", with no new
> testsuite regressions on either platform.
> Ok for mainline?
>
>
> 2020-07-06  Roger Sayle  <roger@nextmovesoftware.com>
>
> gcc/ChangeLog:
> 	* internal-fn.c (expand_mul_overflow): When checking for signed
> 	overflow from a widening multiplication, we access the truncated
> 	lowpart RES twice, so keep this value in a pseudo register.
>
>
> Thanks in advance,
> Roger
> --
> Roger Sayle
> NextMove Software
> Cambridge, UK
>
> diff --git a/gcc/internal-fn.c b/gcc/internal-fn.c
> index 0be2eb4..d1bd6cc 100644
> --- a/gcc/internal-fn.c
> +++ b/gcc/internal-fn.c
> @@ -1627,6 +1627,9 @@ expand_mul_overflow (location_t loc, tree lhs, tree arg0, tree arg1,
>  				     profile_probability::very_likely ());
>  	  else
>  	    {
> +	      /* RES is used more than once, place it in a pseudo.  */
> +	      res = force_reg (mode, res);
> +
>  	      rtx signbit = expand_shift (RSHIFT_EXPR, mode, res, prec - 1,
>  					  NULL_RTX, 0);
>  	      /* RES is low half of the double width result, HIPART

In general, this can be dangerous performance-wise on targets where
subregs are free.  If the move survives to the register allocators,
it increases the risk that the move will become a machine insn.
(The RA will prefer to tie the registers, but that isn't guaranteed.)

But more fundamentally, this can hurt if the SUBREG_REG is live at
the same time as the new pseudo, since the RA then has to treat them
as separate quantities.  From your results, that obviously doesn't
occur in the test case, but I'm not 100% confident that it won't
occur elsewhere.
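
For concreteness, the shapes involved look roughly like this.  It is a
fragment in the style of internal-fn.c rather than standalone code, and
the variable names (product, lopart, lopart_reg) and register numbers
are invented for illustration:

/* Without the patch, each use of the low half is a lowpart SUBREG of
   the wide pseudo, e.g. (subreg:SI (reg:DI 100) 0) on a little-endian
   target: free on x86_64, an explicit cvt on nvptx.  */
rtx lopart = gen_lowpart (mode, product);

/* With the patch, force_reg emits
     (set (reg:SI 101) (subreg:SI (reg:DI 100) 0))
   and later uses read reg 101 instead.  If reg 100 is still live at
   those uses (say because the high half is still needed), the RA has
   to allocate 100 and 101 separately, and if it cannot tie them the
   set above survives as a real move.  */
rtx lopart_reg = force_reg (mode, lopart);
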

If target-independent code is going to optimise for “no subreg operand”
targets like nvptx, I think it needs to know that the target wants that.
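
One possible shape for that, purely as a hypothetical sketch (it is not
part of the posted patch, and truly_noop_truncation may or may not be
the right predicate): make the copy conditional on the target, so that
subreg-is-free targets keep their current expansion.  Here wmode stands
for the double-width mode of the multiplication, whatever that variable
is called at this point in internal-fn.c:

/* Hypothetical guard, not in the posted patch: copy RES into a fresh
   pseudo only when the target reports that truncating from the wide
   mode down to MODE is not a no-op, which is presumably what nvptx
   reports given the explicit cvt instructions above.  */
if (!TRULY_NOOP_TRUNCATION_MODES_P (mode, wmode))
  res = force_reg (mode, res);
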

Thanks,
Richard

