This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|
Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |
Other format: | [Raw text] |
I start with simplest suggestion which is precomputing constant arguments like saving multiplication cost in strchr with: char *strchr_c(char *x, unsigned long u); #define strchr(x,c) \ (__builtin_constant_p(c) ? strchr_c (x, c * (~0ULL / 255)) : strchr (x,c)) Then I am working on using constant n for memset and memcpy. These cannot be done in gcc alone as you need to choose implementation based on cpu and for different sizes different are best for different cpu. Some users try to always do inlining like in rte_memcpy. That works better than gcc one as its optimized for newer processors. For sizes beyond 64 bytes trying to fully expand memcpy and memset doesn't make lot of sense as libcall is faster. To get benefits of inlining I now work on following approach. For sizes < 64 use builtin. For n 64-1024 make indirect jump according to cpu-specific table that you get from libc. That would allow do unrolling upto size 1024 into sequence of movsqa's without increasing cache footprint much. Same with memset(x,0,n) except you need to pass 0.0 argument to have zero xmm register. Entry point for aligned input doesn't make lot of sense. As input is short you want to go into copy of header and you save difference between aligned/unaligned load and crosspage check. As you need to duplicate header icache cost could be bigger. What makes sense is inline headers instead expanding whole function. I am looking at following expansion of strcmp/memcmp: int inline_strcmp (const char *x, const char *y) { int r = *((unsigned char *) x) - *((unsigned char *) y); return r ? r : strcmp(x, y); } int inline_memcmp (const void *x, const void *y, size_t n) { if (n == 0) return 0; int r = *((unsigned char *) x) - *((unsigned char *) y); return r ? r : memcmp(x + 1, y + 1, n - 1); } Note that end is not tested as its unlikely. Same transformation could be done for strncmp, strcasecmp and strncasecmp but we at libc would need to improve tls access of tolower which now requires call which defeats purpose of inline. That gives considerable savings as in my profile 32.4% calls of strcmp and calls of 49.5% differ in first byte. From profiling data these branches are almost completely predictable as I see long sequences of calls that differ at 0 followed by sequence that differ in other. From programs measured it could harm only make. See attached data. On x64 adding match for first 16 bytes using sse would also make sense. except make all other programs have 90% of calls differ in first 16 bytes. Same could be done for strchr/memchr headers where first 16 bytes also form majority. In case of make we should check if in strchr(x,'/') we have x[0] == '/' which happens 85.1% times. In generic case same header would be bigger so question if its profitable versus code size becomes more significant. For similar questions I have on todo list add counters for userspace profiling. Decision if some optimization is profitable depends on details like average size of input that cannot be directly determined from profile. For example in strstr we would need digraph that occurs least often. I don't know if that could be integrated into -fprofile-generate -fprofile-use or done before that as it would change control flow or do it just by macros. If we could convince people to do compilation with profiling it would also allow to directly precompute tables like below without large header hacks, and make things like calculating perfect hashing possible without external tools. For precomputed tables I so far know two use-cases One case would be memchr("abc",x,3) or strchr("abc",x) pattern. I found that in libc to test membership which is obviously ineffective. Second use case is strpbrk family. These have in common that they could benefit from precomputed table with 1 for present bytes and 0 otherwise. While I could create such table I couldn't do that without 256 warnings. Following constructs table just fine but complains warning: initializer element is not a constant expression int main() { static char x[256] = {strchr("aaa", 'a') == NULL, strchr("aaa", 'b') == NULL}; printf("%i %i %s", x[0],x[1], x); } Same trick could be used for making bitwise array. Also its weird what you could and cannot do in static initializers. I was surprised that I could use strchr but couldn't evalutate "abc"[2] as 'c'. When bug above gets fixed that allows these functions to be lot faster, as most of time you get match in first 8 bytes.
Attachment:
comparison_data
Description: Text document
Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|
Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |