This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Re: [RFC][GCC][rs6000] Remaining work for inline expansion of strncmp/strcmp/memcmp for powerpc

From: Segher Boessenkool <segher at kernel dot crashing dot org>
To: Aaron Sawdey <acsawdey at linux dot ibm dot com>
Cc: Florian Weimer <fw at deneb dot enyo dot de>, gcc at gcc dot gnu dot org, Bill Schmidt <wschmidt at linux dot ibm dot com>, mjw at fedoraproject dot org
Date: Thu, 18 Oct 2018 11:59:47 -0500
Subject: Re: [RFC][GCC][rs6000] Remaining work for inline expansion of strncmp/strcmp/memcmp for powerpc
References: <c1513179-5aba-713b-32e7-3ce9c78cdb49@linux.ibm.com> <8736t4cmkh.fsf@mid.deneb.enyo.de> <8d7280f4-392a-e56b-380f-c7612f4bf670@linux.ibm.com>

On Thu, Oct 18, 2018 at 09:48:22AM -0500, Aaron Sawdey wrote:
> On 10/17/18 4:03 PM, Florian Weimer wrote:
> I'm aware of this. One thing that will help is that I believe the vsx
> expansion for strcmp/strncmp does not have this problem, so with
> current gcc9 trunk the problem should only be seen if one of the strings is
> known at compile time to be less than 16 bytes, or if -mcpu=power7, or
> if vector/vsx is disabled. My position is that it is valgrind's problem
> if it doesn't understand correct code, but I also want valgrind to be a
> useful tool so I'm going to take a look and see if I can find a gpr
> sequence that is equally fast that it can understand.

If we can do that without losing performance, that is nice of course :-)

> > We currently see around 0.5 KiB of instructions for each call to
> > strcmp.  I find it hard to believe that this improves general system
> > performance except in micro-benchmarks.
> 
> The expansion of strcmp where both arguments are strings of unknown
> length at compile time will compare 64 bytes then call strcmp on the
> remainder if no difference is found. If the gpr sequence is used (p7
> or vec/vsx disabled) then the overhead is 91 instructions. If the
> p8 vsx sequence is used, the overhead is 59 instructions. If the p9
> vsx sequence is used, then the overhead is 41 instructions.
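[Editor's illustration] The strategy described above (compare the first 64 bytes inline, then call strcmp on the remainder) can be sketched in portable C. This is only a byte-at-a-time model of the behaviour, not the instruction sequence GCC emits; the function name `strcmp_inline_then_call` and the constant `INLINE_LIMIT` are hypothetical names chosen for this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of the expansion, NOT GCC's generated code:
   the real sequences compare 8 or 16 bytes at a time in gprs or
   vector registers.  INLINE_LIMIT mirrors the 64 bytes mentioned
   in the message.  */
enum { INLINE_LIMIT = 64 };

static int strcmp_inline_then_call(const char *a, const char *b)
{
    for (size_t i = 0; i < INLINE_LIMIT; i++) {
        unsigned char ca = (unsigned char)a[i];
        unsigned char cb = (unsigned char)b[i];
        if (ca != cb)
            return (ca > cb) - (ca < cb);  /* first difference decides */
        if (ca == '\0')
            return 0;                      /* both strings ended: equal */
    }
    /* No difference and no terminator in the first 64 bytes:
       hand the remainder to the library routine.  */
    return strcmp(a + INLINE_LIMIT, b + INLINE_LIMIT);
}
```

In this model, strings that differ early, or that are identical and shorter than 64 bytes, never reach the library call, which is the common case the inline expansion is meant to speed up.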

That is 0.355 KiB, 0.230 KiB, and 0.160 KiB, respectively.
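
[Editor's illustration] The size figures follow from the instruction counts, assuming every instruction in the sequence uses the fixed 4-byte encoding that these processors (p7, p8, p9) have. The helper names `insn_bytes` and `insn_kib` are hypothetical, introduced only for this sketch.

```c
/* Assumption: each instruction occupies 4 bytes.  */
enum { BYTES_PER_INSN = 4 };

/* Code size in bytes for a sequence of `insns` instructions.  */
static unsigned insn_bytes(unsigned insns)
{
    return insns * BYTES_PER_INSN;
}

/* The same size expressed in KiB (1 KiB = 1024 bytes).  */
static double insn_kib(unsigned insns)
{
    return insn_bytes(insns) / 1024.0;
}
```

With the counts quoted above, `insn_bytes` gives 364, 236, and 164 bytes for 91, 59, and 41 instructions, which is where the three KiB values come from.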

> Yes, this will increase the instruction footprint. However, the processors
> that this targets (p7, p8, p9) all have aggressive iprefetch. Doing some
> of the comparison inline makes the common cases (strings that are totally
> different, or identical and <= 64 bytes in length) much faster, and
> avoiding the plt call also means less pressure on the count cache and
> better branch prediction elsewhere.
> 
> If you are aware of any real world code that is faster when built
> with -fno-builtin-strcmp and/or -fno-builtin-strncmp, please let me know
> so I can look at avoiding those situations.

+1

Thanks Aaron!  Both for all the original work, and for looking at it
once again.


Segher
