Guard use of modulo in cshift (speedup protein)
Steven Bosscher
stevenb.gcc@gmail.com
Tue Apr 10 15:16:00 GMT 2012
On Tue, Apr 10, 2012 at 4:53 PM, Michael Matz <matz@suse.de> wrote:
> Hi,
>
> this patch speeds up polyhedrons protein on Bulldozer quite a bit. The
> thing is that in this testcase cshift is called with a very short length
> (<=3) and that the shift amount is always less than the length.
> Surprisingly the division instruction takes up considerable amount of
> time, so much that it makes sense to guard it, when the shift is in bound.
>
> Here's some oprofile of _gfortrani_cshift0_i4 (total 31020 cycles):
>
> 23 0.0032 : caf00: idiv %r13
> 13863 1.9055 : caf03: lea (%rdx,%r13,1),%r12
>
> I.e. despite the memory shuffling, one third of the cshift cycles go to
> that division. With the patch the time for protein drops from 0m21.367s
> to 0m20.547s on this Bulldozer machine. I've checked that it has no
> adverse effect on older AMD or Intel cores (0:44.30elapsed vs
> 0:44.00elapsed, still an improvement).
>
> Regstrapped on x86_64-linux. Okay for trunk?
>
>
> Ciao,
> Michael.
>
> * m4/cshift0.m4 (cshift0_'rtype_code`): Guard use of modulo.
>
> * generated/cshift0_c10.c: Regenerated.
> * generated/cshift0_c16.c: Regenerated.
> * generated/cshift0_c4.c: Regenerated.
> * generated/cshift0_c8.c: Regenerated.
> * generated/cshift0_i16.c: Regenerated.
> * generated/cshift0_i1.c: Regenerated.
> * generated/cshift0_i2.c: Regenerated.
> * generated/cshift0_i4.c: Regenerated.
> * generated/cshift0_i8.c: Regenerated.
> * generated/cshift0_r10.c: Regenerated.
> * generated/cshift0_r16.c: Regenerated.
> * generated/cshift0_r4.c: Regenerated.
> * generated/cshift0_r8.c: Regenerated.
>
> Index: m4/cshift0.m4
> ===================================================================
> --- m4/cshift0.m4 (revision 186272)
> +++ m4/cshift0.m4 (working copy)
> @@ -98,9 +98,13 @@ cshift0_'rtype_code` ('rtype` *ret, cons
> rptr = ret->base_addr;
> sptr = array->base_addr;
>
> - shift = len == 0 ? 0 : shift % (ptrdiff_t)len;
> - if (shift < 0)
> - shift += len;
> + /* Avoid the costly modulo for trivially in-bound shifts. */
> + if (shift < 0 || shift >= len)
> + {
> + shift = len == 0 ? 0 : shift % (ptrdiff_t)len;
> + if (shift < 0)
> + shift += len;
> + }
>
> while (rptr)
> {
This is OK.
Do you think it would be worthwhile to do this transformation in the
middle end too, based on profile information for values? IIRC
value-prof handles constant divmod but not ranges for modulo
operations.
Steven