This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Re: Guard use of modulo in cshift (speedup protein)

From: Steven Bosscher <stevenb dot gcc at gmail dot com>
To: Michael Matz <matz at suse dot de>
Cc: gcc-patches at gcc dot gnu dot org, fortran at gcc dot gnu dot org
Date: Tue, 10 Apr 2012 17:16:05 +0200
Subject: Re: Guard use of modulo in cshift (speedup protein)
References: <Pine.LNX.4.64.1204101646390.25409@wotan.suse.de>

On Tue, Apr 10, 2012 at 4:53 PM, Michael Matz <matz@suse.de> wrote:
> Hi,
>
> this patch speeds up polyhedrons protein on Bulldozer quite a bit.  The
> thing is that in this testcase cshift is called with a very short length
> (<=3) and that the shift amount always is less than the length.
> Surprisingly the division instruction takes up a considerable amount of
> time, so much that it makes sense to guard it when the shift is in bounds.
>
> Here's some oprofile of _gfortrani_cshift0_i4 (total 31020 cycles):
>
>     23  0.0032 :   caf00:       idiv   %r13
>  13863  1.9055 :   caf03:       lea    (%rdx,%r13,1),%r12
>
> I.e. despite the memory shuffling one third of the cshift cycles are that
> division.  With the patch the time for protein drops from 0m21.367s to
> 0m20.547s on this Bulldozer machine.  I've checked that it has no adverse
> effect on older AMD or Intel cores (0:44.30elapsed vs 0:44.00elapsed,
> still an improvement).
>
> Regstrapped on x86_64-linux.  Okay for trunk?
>
>
> Ciao,
> Michael.
>
>         * m4/cshift0.m4 (cshift0_'rtype_code`): Guard use of modulo.
>
>         * generated/cshift0_c10.c: Regenerated.
>         * generated/cshift0_c16.c: Regenerated.
>         * generated/cshift0_c4.c: Regenerated.
>         * generated/cshift0_c8.c: Regenerated.
>         * generated/cshift0_i16.c: Regenerated.
>         * generated/cshift0_i1.c: Regenerated.
>         * generated/cshift0_i2.c: Regenerated.
>         * generated/cshift0_i4.c: Regenerated.
>         * generated/cshift0_i8.c: Regenerated.
>         * generated/cshift0_r10.c: Regenerated.
>         * generated/cshift0_r16.c: Regenerated.
>         * generated/cshift0_r4.c: Regenerated.
>         * generated/cshift0_r8.c: Regenerated.
> Index: m4/cshift0.m4
> ===================================================================
> --- m4/cshift0.m4       (revision 186272)
> +++ m4/cshift0.m4       (working copy)
> @@ -98,9 +98,13 @@ cshift0_'rtype_code` ('rtype` *ret, cons
>    rptr = ret->base_addr;
>    sptr = array->base_addr;
>
> -  shift = len == 0 ? 0 : shift % (ptrdiff_t)len;
> -  if (shift < 0)
> -    shift += len;
> +  /* Avoid the costly modulo for trivially in-bound shifts.  */
> +  if (shift < 0 || shift >= len)
> +    {
> +      shift = len == 0 ? 0 : shift % (ptrdiff_t)len;
> +      if (shift < 0)
> +        shift += len;
> +    }
>
>    while (rptr)
>      {

This is OK.

Do you think it would be worthwhile to do this transformation in the
middle end too, based on profile information for values? IIRC
value-prof handles constant divmod but not ranges for modulo
operations.
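
For reference, the source-level shape of such a transformation could look
roughly like the sketch below. It is only an illustration, not libgfortran
or middle-end code: the names normalize_plain, normalize_guarded and
forms_agree are made up for this example, and len is assumed to be
non-negative. The guarded form skips the division whenever the value is
already in [0, len), and forms_agree checks that both forms give the same
result over a small range.

```c
#include <stddef.h>

/* Unguarded form: always executes the division when len != 0.
   Hypothetical helper; assumes len >= 0.  */
static ptrdiff_t
normalize_plain (ptrdiff_t shift, ptrdiff_t len)
{
  shift = len == 0 ? 0 : shift % len;
  if (shift < 0)
    shift += len;
  return shift;
}

/* Guarded form: the modulo runs only when shift is outside [0, len).
   For an in-range shift the result is the input itself, so the
   division is skipped entirely.  Hypothetical helper; assumes len >= 0.  */
static ptrdiff_t
normalize_guarded (ptrdiff_t shift, ptrdiff_t len)
{
  if (shift < 0 || shift >= len)
    {
      shift = len == 0 ? 0 : shift % len;
      if (shift < 0)
        shift += len;
    }
  return shift;
}

/* Return 1 if both forms agree for every len in [0, max_len] and every
   shift in [-3*len - 1, 3*len + 1], else 0.  */
int
forms_agree (ptrdiff_t max_len)
{
  for (ptrdiff_t len = 0; len <= max_len; len++)
    for (ptrdiff_t shift = -3 * len - 1; shift <= 3 * len + 1; shift++)
      if (normalize_plain (shift, len) != normalize_guarded (shift, len))
        return 0;
  return 1;
}
```

Whether the guard pays off depends on how often the value is already in
range, which is the kind of information a value profile would have to
supply.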

Steven

Follow-Ups:
- Re: Guard use of modulo in cshift (speedup protein)
  - From: Michael Matz

References:
- Guard use of modulo in cshift (speedup protein)
  - From: Michael Matz
