This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Guard use of modulo in cshift (speedup protein)


On Tue, Apr 10, 2012 at 4:53 PM, Michael Matz <matz@suse.de> wrote:
> Hi,
>
> this patch speeds up polyhedrons protein on Bulldozer quite a bit. ?The
> things is that in this testcase cshift is called with a very short length
> (<=3) and that the shift amount always is less than the length.
> Surprisingly the division instruction takes up considerable amount of
> time, so much that it makes sense to guard it, when the shift is in bound.
>
> Here's some oprofile of _gfortrani_cshift0_i4 (total 31020 cycles):
>
> ? ?23 ?0.0032 : ? caf00: ? ? ? idiv ? %r13
> ?13863 ?1.9055 : ? caf03: ? ? ? lea ? ?(%rdx,%r13,1),%r12
>
> I.e. despite the memory shuffling one third of the cshift cycles are that
> division. ?With the patch the time for protein drops from 0m21.367s to
> 0m20.547s on this Bulldozer machine. ?I've checked that it has no adverse
> effect on older AMD or Intel cores (0:44.30elapsed vs 0:44.00elapsed,
> still an improvement).
>
> Regstrapped on x86_64-linux. ?Okay for trunk?
>
>
> Ciao,
> Michael.
>
> ? ? ? ?* m4/cshift0.m4 (cshift0_'rtype_code`): Guard use of modulo.
>
> ? ? ? ?* generated/cshift0_c10.c: Regenerated.
> ? ? ? ?* generated/cshift0_c16.c: Regenerated.
> ? ? ? ?* generated/cshift0_c4.c: Regenerated.
> ? ? ? ?* generated/cshift0_c8.c: Regenerated.
> ? ? ? ?* generated/cshift0_i16.c: Regenerated.
> ? ? ? ?* generated/cshift0_i1.c: Regenerated.
> ? ? ? ?* generated/cshift0_i2.c: Regenerated.
> ? ? ? ?* generated/cshift0_i4.c: Regenerated.
> ? ? ? ?* generated/cshift0_i8.c: Regenerated.
> ? ? ? ?* generated/cshift0_r10.c: Regenerated.
> ? ? ? ?* generated/cshift0_r16.c: Regenerated.
> ? ? ? ?* generated/cshift0_r4.c: Regenerated.
> ? ? ? ?* generated/cshift0_r8.c: Regenerated.
>
> Index: m4/cshift0.m4
> ===================================================================
> --- m4/cshift0.m4 ? ? ? (revision 186272)
> +++ m4/cshift0.m4 ? ? ? (working copy)
> @@ -98,9 +98,13 @@ cshift0_'rtype_code` ('rtype` *ret, cons
> ? rptr = ret->base_addr;
> ? sptr = array->base_addr;
>
> - ?shift = len == 0 ? 0 : shift % (ptrdiff_t)len;
> - ?if (shift < 0)
> - ? ?shift += len;
> + ?/* Avoid the costly modulo for trivially in-bound shifts. ?*/
> + ?if (shift < 0 || shift >= len)
> + ? ?{
> + ? ? ?shift = len == 0 ? 0 : shift % (ptrdiff_t)len;
> + ? ? ?if (shift < 0)
> + ? ? ? shift += len;
> + ? ?}
>
> ? while (rptr)
> ? ? {

This is OK.

Do you think it would be worthwhile to do this transformation in the
middle end too, based on profile information for values? IIRC
value-prof handles constant divmod but not ranges for modulo
operations.

Steven


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]