[RFC] memcpy/strcpy on uninitialized memory optimization and custom entry point
Tue Jun 16 08:28:00 GMT 2015
This is another comment on the builtin/header discussion. For
alignment/size bounds there is no reason to use a builtin, as gcc should
optimize a header version equally well with
if (__builtin_constant_p(n<42) && n < 42)
but there is information that you can get only with gcc's help. It is
that while speculative loads are easy, a library cannot do speculative
writes.
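To make the header-side check concrete, here is a minimal sketch of such
a wrapper. The name copy_bounded and the byte loop on the small path are
my own illustration, not existing glibc or gcc code; only
__builtin_constant_p and memcpy are real. It assumes gcc or a compatible
compiler.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical header-level wrapper (illustrative name, not a real API).
   If the compiler can prove at compile time that n < 42, take a simple
   path it is free to unroll or specialize; otherwise call the library. */
static inline void *copy_bounded(void *dst, const void *src, size_t n)
{
    if (__builtin_constant_p(n < 42) && n < 42) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        for (size_t i = 0; i < n; i++)   /* small, known-bounded copy */
            d[i] = s[i];
        return dst;
    }
    return memcpy(dst, src, n);          /* generic library path */
}
```

Both paths copy exactly n bytes, so the check only selects code and never
changes the result.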
One of the performance issues is that copying a single byte with a
library routine is expensive, mainly because that case sits at the tail
of a chain of unpredicted branches, as bigger ranges are more probable.
An extreme case is my benchmark where copying one byte with avx2 memcpy
is slower than copying 700 bytes, as the benchmark made the loop likely
and well predicted while one byte is quite unlikely.
Now the proposal. In reality, most of the time increasing the memcpy
size to a multiple of 8 would improve performance and would not change
the semantics of the program, as these bytes are not accessed by the
application. The same holds for strcpy, which would just copy 8-byte
blocks instead of spending time finding the correct size. However, that
would be hard to do.
This also applies to the vectorizer in general, which with freshly
allocated memory could afford a simpler path that does a few extra
writes.
So instead there could be a gcc optimization that detects this by
checking that the memory was just allocated, so that writing beyond the
allocated boundary wouldn't change the fact that it is uninitialized.
For example, here you could do only one 8-byte load/store instead of three:
memcpy(bar, foo, 7);
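Written out by hand, the transformed copy could look roughly like the
sketch below. The function name dup7_padded is hypothetical, and the
sketch assumes that foo has 8 readable bytes and that bar is a fresh
allocation of at least 8 bytes, so the eighth byte written is one the
program never looks at.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical illustration: copy 7 payload bytes into a fresh buffer
   using a single 8-byte load and a single 8-byte store.
   Assumption: foo points to at least 8 readable bytes. */
char *dup7_padded(const char *foo)
{
    char *bar = malloc(8);       /* fresh, still uninitialized memory */
    if (bar == NULL)
        return NULL;
    uint64_t w;
    memcpy(&w, foo, 8);          /* one 8-byte load  */
    memcpy(bar, &w, 8);          /* one 8-byte store */
    return bar;                  /* byte 7 is padding, never read */
}
```

The fixed-size memcpy calls are only a portable way to spell an unaligned
8-byte access; a compiler would normally turn each into a single move.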
When the exact size isn't known, it would call a function, say
memcpy_j/strcpy_j, that could write extra bytes up to the next 8-byte
boundary. I plan to use these to speed up strdup.
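A rough sketch of what such an entry point and a strdup built on it might
look like follows. Only the name memcpy_j comes from the proposal;
round_up8, strdup_j and the exact contract are assumptions of mine. To
stay valid portable C, the sketch reads exactly n source bytes and only
rounds the writes up, which the caller must allow by providing
round_up8(n) writable bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: round n up to the next multiple of 8. */
static size_t round_up8(size_t n)
{
    return (n + 7) & ~(size_t)7;
}

/* Hypothetical memcpy_j: copies n bytes but always stores whole 8-byte
   blocks. Assumed contract: dst has round_up8(n) writable bytes and the
   caller does not care about the contents of the padding. */
static void *memcpy_j(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t full = n & ~(size_t)7;          /* bytes in complete blocks */

    for (size_t i = 0; i < full; i += 8) {
        uint64_t w;
        memcpy(&w, s + i, 8);
        memcpy(d + i, &w, 8);
    }
    if (n != full) {
        uint64_t w = 0;
        memcpy(&w, s + full, n - full);    /* exact read of the tail */
        memcpy(d + full, &w, 8);           /* padded 8-byte write */
    }
    return dst;
}

/* Hypothetical strdup on top of memcpy_j: over-allocate to a multiple
   of 8 so the padded stores stay inside the allocation. */
static char *strdup_j(const char *s)
{
    size_t n = strlen(s) + 1;              /* include the terminator */
    char *p = malloc(round_up8(n));
    if (p == NULL)
        return NULL;
    return memcpy_j(p, s, n);
}
```

A real implementation would presumably also use wider loads on the source
side; that part is left out here because it cannot be expressed in
standard C.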
So, comments? Could it be generalized more? Or is it too much work
without much benefit?