This is the mail archive of the mailing list for the GCC project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Builtin/headers: Constant arguments and adding extra entry points.

I start with simplest suggestion which is precomputing constant arguments 
like saving multiplication cost in strchr with:

char *strchr_c(char *x, unsigned long u);
#define strchr(x,c) \
(__builtin_constant_p(c) ? strchr_c (x, c * (~0ULL / 255)) : strchr (x,c))

Then I am working on using constant n for memset and memcpy. These
cannot be done in gcc alone as you need to choose implementation based
on cpu and for different sizes different are best for different cpu.

Some users try to always do inlining like in rte_memcpy. That works
better than gcc one as its optimized for newer processors.

For sizes beyond 64 bytes trying to fully expand memcpy and memset
doesn't make lot of sense as libcall is faster.

To get benefits of inlining I now work on following approach. For sizes
< 64 use builtin. For n 64-1024 make indirect jump according to
cpu-specific table that you get from libc.

That would allow do unrolling upto size 1024 into sequence of movsqa's
without increasing cache footprint much. Same with memset(x,0,n) except you
need to pass 0.0 argument to have zero xmm register.

Entry point for aligned input doesn't make lot of sense. As input is
short you want to go into copy of header and you save difference between
aligned/unaligned load and crosspage check. As you need to duplicate
header icache cost could be bigger.

What makes sense is inline headers instead expanding whole function.
I am looking at following expansion of strcmp/memcmp:

inline_strcmp (const char *x, const char *y)
  int r = *((unsigned char *) x) - *((unsigned char *) y);
  return r ? r : strcmp(x, y);

inline_memcmp (const void *x, const void *y, size_t n)
  if (n == 0)
    return 0;
  int r = *((unsigned char *) x) - *((unsigned char *) y);
  return r ? r : memcmp(x + 1, y + 1, n - 1);

Note that end is not tested as its unlikely. Same transformation
could be done for strncmp, strcasecmp and strncasecmp but we at libc
would need to improve tls access of tolower which now requires call
which defeats purpose of inline.

That gives considerable savings as in my profile 32.4% calls 
of strcmp and calls of 49.5% differ in first byte. From profiling 
data these branches are almost completely predictable as I see long
sequences of calls that differ at 0 followed by sequence that differ
in other. From programs measured it could harm only make. See attached

On x64 adding match for first 16 bytes using sse would also make sense.
except make all other programs have 90% of calls differ in first 16 bytes.

Same could be done for strchr/memchr headers where first 16 bytes also
form majority.

In case of make we should check if in strchr(x,'/') we have x[0] == '/'
which happens 85.1% times.

In generic case same header would be bigger so question if its
profitable versus code size becomes more significant.

For similar questions I have on todo list add counters for userspace
profiling. Decision if some optimization is profitable depends on
details like average size of input that cannot be directly determined
from profile. For example in strstr we would need digraph that occurs
least often.

I don't know if that could be integrated into -fprofile-generate
-fprofile-use or done before that as it would change control flow or do
it just by macros. If we could convince people to do compilation with
profiling it would also allow to directly precompute tables like below
without large header hacks, and make things like calculating perfect 
hashing possible without external tools.

For precomputed tables I so far know two use-cases

One case would be memchr("abc",x,3) or strchr("abc",x) pattern. 
I found that in libc to test membership which is obviously ineffective.
Second use case is strpbrk family.

These have in common that they could benefit from precomputed table with
1 for present bytes and 0 otherwise. While I could create such table I
couldn't do that without 256 warnings. Following constructs table just
fine but complains

warning: initializer element is not a constant expression

  static char x[256]  =  {strchr("aaa", 'a') == NULL, strchr("aaa", 'b') == NULL};
  printf("%i %i %s", x[0],x[1], x);

Same trick could be used for making bitwise array.

Also its weird what you could and cannot do in static initializers. 
I was surprised that I could use strchr but couldn't evalutate "abc"[2]
as 'c'.

When bug above gets fixed that allows these functions to be lot faster,
as most of time you get match in first 8 bytes.

Attachment: comparison_data
Description: Text document

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]