This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


[RFC] memcpy/strcpy on uninitialized memory optimization and custom entry point

From: Ondřej Bílka <neleai at seznam dot cz>
To: gcc at gcc dot gnu dot org
Date: Tue, 16 Jun 2015 10:27:51 +0200
Subject: [RFC] memcpy/strcpy on uninitialized memory optimization and custom entry point

Hi,

This is another comment on the builtin-versus-header discussion. For
alignment/size bounds there is no reason to use a builtin, as gcc should
also optimize a header version that uses
if (__builtin_constant_p(n<42) && n < 42)

But there is information that you can get only with gcc's help. The
issue is that while speculative loads are easy, a library cannot do
speculative writes.
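To make the header-side check above concrete, here is a minimal sketch
of a wrapper that takes a short path only when gcc can prove the bound
at compile time. The names my_memcpy and copy_small are hypothetical
and not from any real libc header; __builtin_constant_p is the gcc
builtin (clang accepts it too).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical short path: plain byte loop for small copies. */
static inline void *copy_small(void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;
  while (n--)
    *d++ = *s++;
  return dst;
}

/* Hypothetical header wrapper. The first test only succeeds when the
   compiler can evaluate n < 42 at compile time; otherwise we fall
   through to the ordinary library call. */
static inline void *my_memcpy(void *dst, const void *src, size_t n)
{
  if (__builtin_constant_p(n < 42) && n < 42)
    return copy_small(dst, src, n);
  return memcpy(dst, src, n);
}
```

Either branch gives the same result; the check only decides which path
the compiler keeps after inlining.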

One performance issue is that copying a single byte with a library
routine is expensive, mainly because that case sits at the tail of a
chain of unpredicted branches, since larger sizes are more probable. An
extreme case is my following benchmark, where copying one byte with an
avx2 memcpy is slower than copying 700 bytes: the benchmark made the
loop likely and well predicted, while the one-byte case is quite
unlikely.
http://kam.mff.cuni.cz/~ondra/benchmark_string/haswell/memcpy_profile/results_rand/result.html

Now the proposal. In reality, most of the time increasing the memcpy
size to a multiple of 8 would improve performance and would not change
the semantics of the program, as these bytes are not accessed by the
application. The same goes for strcpy, which would just copy 8-byte
blocks instead of spending time finding the exact size. However, that
would be hard to do.
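As a rough illustration of the strcpy half of this idea, the sketch
below copies whole 8-byte blocks and stops after the block that
contains the terminator. The name strcpy_blocks is made up, and the
sketch assumes that both buffers are padded to a multiple of 8 bytes,
so that reading and writing the full last block is valid. That
assumption is exactly what a library cannot check on its own.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Classic bit trick: nonzero iff some byte of w is zero. */
static int has_zero_byte(uint64_t w)
{
  return ((w - 0x0101010101010101ULL) & ~w & 0x8080808080808080ULL) != 0;
}

/* Hypothetical block-wise strcpy. ASSUMPTION: src is readable and dst
   is writable up to the end of the 8-byte block holding the NUL. */
static char *strcpy_blocks(char *dst, const char *src)
{
  size_t i = 0;
  for (;;) {
    uint64_t w;
    memcpy(&w, src + i, 8);   /* one 8-byte load */
    memcpy(dst + i, &w, 8);   /* one 8-byte store, may pass the NUL */
    if (has_zero_byte(w))
      return dst;
    i += 8;
  }
}
```

The bytes written after the terminator are the extra writes discussed
above; they are harmless only when nothing else depends on them.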

This also applies to the vectorizer in general, which with freshly
allocated memory could afford a simpler path that does a few extra
writes.

So instead there could be a gcc optimization that detects this by
checking that the destination is freshly allocated memory, so that
writing beyond the allocated boundary wouldn't change the fact that it
is uninitialized.

For example, here you could do only one 8-byte load/store instead of three.

#include <string.h>
char foo[8];
char *fo()
{
  char bar[8];           /* fresh local, all 8 bytes uninitialized */
  memcpy(bar, foo, 7);   /* 7 bytes; bar[7] is never written */
  return strdup(bar);
}
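For comparison, this is roughly what the transformed function could
look like if written by hand. It is an illustration of the idea, not
actual gcc output, and fo_widened is a made-up name. Widening is safe
here because foo and bar are both 8 bytes and bar[7] was uninitialized
anyway.

```c
#define _POSIX_C_SOURCE 200809L  /* for strdup under strict -std modes */
#include <string.h>

char foo[8];

char *fo_widened(void)
{
  char bar[8];
  memcpy(bar, foo, 8);   /* widened from 7: a single 8-byte load/store */
  return strdup(bar);
}
```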

When the exact size isn't known, it would call a function, say
memcpy_j/strcpy_j, that could write extra bytes up to the next 8-byte
boundary. I plan to use these to speed up strdup.
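A minimal sketch of how such an entry point might behave follows. Only
the name memcpy_j comes from the proposal; the signature, the zero
padding, and the strdup_j wrapper are my assumptions. To stay within
portable C the tail is read exactly and only the store is widened; a
real implementation would presumably use a speculative load there as
well.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical entry point. ASSUMPTION: dst has room for n rounded up
   to a multiple of 8, and the caller does not care about the extra
   bytes. */
static void *memcpy_j(void *dst, const void *src, size_t n)
{
  unsigned char *d = dst;
  const unsigned char *s = src;
  size_t i = 0;
  for (; i + 8 <= n; i += 8)
    memcpy(d + i, s + i, 8);       /* whole 8-byte blocks */
  if (i < n) {
    uint64_t tail = 0;
    memcpy(&tail, s + i, n - i);   /* read only the bytes that exist */
    memcpy(d + i, &tail, 8);       /* one 8-byte store, up to 7 extra bytes */
  }
  return dst;
}

/* Hypothetical strdup built on it: the fresh allocation is rounded up
   to 8 bytes, so the extra writes stay inside it. */
static char *strdup_j(const char *s)
{
  size_t n = strlen(s) + 1;              /* include the terminator */
  size_t cap = (n + 7) & ~(size_t)7;     /* round up to a multiple of 8 */
  char *p = malloc(cap);
  if (p != NULL)
    memcpy_j(p, s, n);
  return p;
}
```

The design point is that the caller, here strdup_j, knows the
destination is fresh memory of sufficient size, which is the
information gcc would have to supply in the general case.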

So, comments: could this be generalized more? Or is it too much work
without much benefit?
