This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH] match.pd: optimize unsigned mul overflow check
- From: Alexander Monakov <amonakov at ispras dot ru>
- To: Marc Glisse <marc dot glisse at inria dot fr>
- Cc: gcc-patches at gcc dot gnu dot org
- Date: Mon, 30 May 2016 10:14:30 +0300 (MSK)
- Subject: Re: [PATCH] match.pd: optimize unsigned mul overflow check
- References: <alpine dot LNX dot 2 dot 20 dot 1605282232330 dot 2043 at monopod dot intra dot ispras dot ru> <alpine dot DEB dot 2 dot 20 dot 1605292303050 dot 1983 at laptop-mg dot saclay dot inria dot fr>
On Sun, 29 May 2016, Marc Glisse wrote:
> On Sat, 28 May 2016, Alexander Monakov wrote:
>
> > For unsigned A, B, 'A > -1 / B' is a nice predicate for checking whether
> > 'A*B' overflows (or 'B && A > -1 / B' if B may be zero). Let's optimize it
> > to an invocation of __builtin_mul_overflow to avoid the divide operation.
>
> I forgot to ask earlier: what does this give for modes / platforms where
> umulv4 does not have a specific implementation? Is the generic implementation
> worse than A>-1/B, in which case we may want to check optab_handler before
> doing the transformation? Or is it always at least as good?
If umulv<mode>4 is unavailable (which today is everywhere except x86), gcc
falls back as follows. First, it tries to see if doing a multiplication in a
2x wider type is possible (which it usually is, as gcc supports __int128_t on
64-bit platforms and 64-bit long long on 32-bit platforms), then it looks at
the high bits of the 2x wide product. This should boil down to a 'high
multiply' instruction if the original operands' type matches the register
size, and a normal multiply plus masking of the high bits if the type is
smaller than a register.
Second, if the above fails (e.g. with 64-bit operands on a 32-bit platform),
then gcc emits a sequence that performs the multiplication by parts in a 2x
narrower type.
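To make the two fallback paths concrete, here is a rough C sketch of what they
amount to. This is only an illustration, not the code GCC actually expands to;
the function names are made up, and the fixed 32/64-bit widths are an
assumption chosen to mirror the cases described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Path 1 (hypothetical helper): 32-bit operands, multiply in a 2x wider
   64-bit type and look at the high half of the product.  */
static inline bool mul_overflows_wide(uint32_t a, uint32_t b)
{
    uint64_t product = (uint64_t)a * b;
    return (product >> 32) != 0;
}

/* Path 2 (hypothetical helper): 64-bit operands with no 128-bit type
   available, so the multiplication is done by parts on 32-bit halves.  */
static inline bool mul_overflows_parts(uint64_t a, uint64_t b)
{
    uint64_t a_hi = a >> 32, a_lo = (uint32_t)a;
    uint64_t b_hi = b >> 32, b_lo = (uint32_t)b;

    /* a_hi * b_hi is scaled by 2^64, so any nonzero value overflows.  */
    if (a_hi != 0 && b_hi != 0)
        return true;

    /* At most one cross term is nonzero here, so the sum cannot wrap.
       It is scaled by 2^32, so it must itself fit in 32 bits.  */
    uint64_t cross = a_hi * b_lo + a_lo * b_hi;
    if ((cross >> 32) != 0)
        return true;

    /* Add the low partial product and detect a carry out of 64 bits.  */
    uint64_t low = a_lo * b_lo;
    uint64_t sum = (cross << 32) + low;
    return sum < low;
}
```

Both helpers should agree with __builtin_mul_overflow on the corresponding
unsigned types; the second one shows why the by-parts path costs several
multiplies and compares rather than a single instruction.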
I think the first, more commonly taken, fallback path results in code that is
always at least as good. In the second case, the eliminated 64-bit divide is
unlikely to have direct hardware support; e.g., on i386 it is a library call
to __udivdi3. This makes the transformation a likely loss for code size but a
likely win for performance. It would be better still if GCC could CSE
REALPART (IFN_MUL_OVERFLOW) with A*B on GIMPLE.
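For reference, a minimal sketch of the source pattern in question and of what
it would be rewritten to. The helper names and the use of size_t are
assumptions for illustration only; checked_size is a hypothetical caller
showing where the CSE opportunity would arise, since the product computed
after the check is the same value the builtin already produces.

```c
#include <stdbool.h>
#include <stddef.h>

/* Division-based predicate, in the 'B && A > -1 / B' form that also
   handles B == 0.  (size_t)-1 is the maximum value of the type.  */
static inline bool overflows_div(size_t a, size_t b)
{
    return b && a > (size_t)-1 / b;
}

/* Roughly what the transformation turns the predicate into: the
   product itself is computed but discarded here.  */
static inline bool overflows_builtin(size_t a, size_t b)
{
    size_t unused;
    return __builtin_mul_overflow(a, b, &unused);
}

/* Hypothetical caller: checks for overflow, then multiplies.  After
   the transformation, this multiply duplicates the one inside the
   builtin unless the two are CSE'd.  */
static inline bool checked_size(size_t count, size_t elem, size_t *out)
{
    if (overflows_div(count, elem))
        return false;
    *out = count * elem;
    return true;
}
```

Written by hand, the caller could of course pass out directly to
__builtin_mul_overflow and avoid the second multiply altogether.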
Thanks.
Alexander