This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.



Re: [RFC][GCC][rs6000] Remaining work for inline expansion of strncmp/strcmp/memcmp for powerpc


On Thu, Oct 18, 2018 at 09:48:22AM -0500, Aaron Sawdey wrote:
> On 10/17/18 4:03 PM, Florian Weimer wrote:
> I'm aware of this. One thing that will help is that I believe the vsx
> expansion for strcmp/strncmp does not have this problem, so with
> current gcc9 trunk the problem should only be seen if one of the strings is
> known at compile time to be less than 16 bytes, or if -mcpu=power7, or
> if vector/vsx is disabled. My position is that it is valgrind's problem
> if it doesn't understand correct code, but I also want valgrind to be a
> useful tool so I'm going to take a look and see if I can find a gpr
> sequence that is equally fast that it can understand.

If we can do that without losing performance, that is nice of course :-)

> > We currently see around 0.5 KiB of instructions for each call to
> > strcmp.  I find it hard to believe that this improves general system
> > performance except in micro-benchmarks.
> 
> The expansion of strcmp where both arguments are strings of unknown
> length at compile time will compare 64 bytes then call strcmp on the
> remainder if no difference is found. If the gpr sequence is used (p7
> or vec/vsx disabled) then the overhead is 91 instructions. If the
> p8 vsx sequence is used, the overhead is 59 instructions. If the p9
> vsx sequence is used, then the overhead is 41 instructions.

That is 0.355 KiB, 0.230 KiB, and 0.160 KiB respectively (at 4 bytes per
instruction).
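
For readers who want the shape of the expansion described above in concrete terms, here is a rough C sketch. It is not what GCC actually emits: the real sequences compare a doubleword or a vector register at a time, while this version goes byte by byte for clarity. The names strcmp_inline_sketch, overhead_bytes and INLINE_BYTES are made up for illustration. The size helper only assumes the fixed 4-byte instruction width of these processors.

```c
#include <stddef.h>
#include <string.h>

/* Bytes compared inline before falling back to the library call
   (64, per the description in the quoted text). */
#define INLINE_BYTES 64

/* Hypothetical illustration only, not GCC's generated code: compare the
   first 64 bytes inline, then call strcmp on the remainder. */
int strcmp_inline_sketch(const char *a, const char *b)
{
    const unsigned char *ua = (const unsigned char *)a;
    const unsigned char *ub = (const unsigned char *)b;

    for (size_t i = 0; i < INLINE_BYTES; i++) {
        if (ua[i] != ub[i])
            return ua[i] < ub[i] ? -1 : 1;  /* strings differ early */
        if (ua[i] == '\0')
            return 0;                       /* identical and shorter than 64 bytes */
    }
    /* No difference found in the first 64 bytes: let libc do the rest. */
    return strcmp(a + INLINE_BYTES, b + INLINE_BYTES);
}

/* Code size in bytes for a given instruction count, assuming fixed
   4-byte instructions. */
size_t overhead_bytes(size_t instructions)
{
    return instructions * 4;
}
```

With these definitions, overhead_bytes gives 364, 236 and 164 bytes for the 91, 59 and 41 instruction sequences, which matches the figures above when divided by 1024.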

> Yes, this will increase the instruction footprint. However the processors
> that this targets (p7, p8, p9) all have aggressive iprefetch. Doing some
> of the comparison inline makes the common cases of strings being totally
> different, or identical and <= 64 bytes in length very much faster, and
> also avoiding the plt call means less pressure on the count cache and
> better branch prediction elsewhere.
> 
> If you are aware of any real world code that is faster when built
> with -fno-builtin-strcmp and/or -fno-builtin-strncmp, please let me know
> so I can look at avoiding those situations.

+1

Thanks Aaron!  Both for all the original work, and for looking at it
once again.


Segher

