Guard use of modulo in cshift (speedup protein)

Tue Apr 10 15:16:00 GMT 2012

On Tue, Apr 10, 2012 at 4:53 PM, Michael Matz <matz@suse.de> wrote:
> Hi,
>
> this patch speeds up polyhedrons protein on Bulldozer quite a bit.  The
> things is that in this testcase cshift is called with a very short length
> (<=3) and that the shift amount always is less than the length.
> Surprisingly the division instruction takes up considerable amount of
> time, so much that it makes sense to guard it, when the shift is in bound.
>
> Here's some oprofile of _gfortrani_cshift0_i4 (total 31020 cycles):
>
>    23  0.0032 :   caf00:       idiv   %r13
>  13863  1.9055 :   caf03:       lea    (%rdx,%r13,1),%r12
>
> I.e. despite the memory shuffling one third of the cshift cycles are that
> division.  With the patch the time for protein drops from 0m21.367s to
> 0m20.547s on this Bulldozer machine.  I've checked that it has no adverse
> effect on older AMD or Intel cores (0:44.30elapsed vs 0:44.00elapsed,
> still an improvement).
>
> Regstrapped on x86_64-linux.  Okay for trunk?
>
>
> Ciao,
> Michael.
>
>        * m4/cshift0.m4 (cshift0_'rtype_code`): Guard use of modulo.
>
>        * generated/cshift0_c10.c: Regenerated.
>        * generated/cshift0_c16.c: Regenerated.
>        * generated/cshift0_c4.c: Regenerated.
>        * generated/cshift0_c8.c: Regenerated.
>        * generated/cshift0_i16.c: Regenerated.
>        * generated/cshift0_i1.c: Regenerated.
>        * generated/cshift0_i2.c: Regenerated.
>        * generated/cshift0_i4.c: Regenerated.
>        * generated/cshift0_i8.c: Regenerated.
>        * generated/cshift0_r10.c: Regenerated.
>        * generated/cshift0_r16.c: Regenerated.
>        * generated/cshift0_r4.c: Regenerated.
>        * generated/cshift0_r8.c: Regenerated.
>
> Index: m4/cshift0.m4
> ===================================================================
> --- m4/cshift0.m4       (revision 186272)
> +++ m4/cshift0.m4       (working copy)
> @@ -98,9 +98,13 @@ cshift0_'rtype_code` ('rtype` *ret, cons
>   rptr = ret->base_addr;
>   sptr = array->base_addr;
>
> -  shift = len == 0 ? 0 : shift % (ptrdiff_t)len;
> -  if (shift < 0)
> -    shift += len;
> +  /* Avoid the costly modulo for trivially in-bound shifts.  */
> +  if (shift < 0 || shift >= len)
> +    {
> +      shift = len == 0 ? 0 : shift % (ptrdiff_t)len;
> +      if (shift < 0)
> +       shift += len;
> +    }
>
>   while (rptr)
>     {

This is OK.

Do you think it would be worthwhile to do this transformation in the
middle end too, based on profile information for values? IIRC
value-prof handles constant divmod but not ranges for modulo
operations.

Steven