memcmp-intensive code becomes up to 6 times slower if compiled with the -O3 option than with the -g or -O0 option. The reason for this is that the inline memcmp function is *much* slower than the glibc memcmp.
Here's a simple test case:
void* list[1024 * 1024];
int count = sizeof(list) / sizeof(char*);
for (i=0; i < count; i++)
list[i] = calloc(1024, 1);
int dupes = 0;
int start = times(NULL);
for (i=0; i<count-1; i++)
if (!memcmp(list[i], list[i+1], 1024))
int ticks = times(NULL) - start;
printf("Time: %d ticks (%d memcmp/tick)\n", ticks, dupes/ticks);
# gcc -O3 -o test test.c
Time: 188 ticks (5577 memcmp/tick)
# gcc -O0 -o test test.c
Time: 30 ticks (34952 memcmp/tick)
System is Debian testing with libc package 2.10.1-7.
Confirmed. GCC simply emits repz cmpsb. There was even an e-mail with benchmark results and a patch (never applied):
Unless I read those benchmarks wrong, they say that for unaligned memory the inline byte-by-byte version is almost always faster. So surely a multi-byte inline version should at least do an alignment check before proceeding. I don't know if it would still be worth inlining then.
Mine. I think we should go for the rep movq. Ideally we should track the alignment from calloc call to use, perhaps by fusing the loops, but I am just day dreaming with that ;))
I will take a look.
Hmm, I should read testcases curefully. It is memcmp. GCC is not really smart on inlining this; I guess we should just disable the inline unless we optimize for size since we don't really have enough info to inline it well by default.
String/memory functions in the current glibc are VERY FAST,
especially for larger data size under all cases. Gcc should
inline only small/known data size. We are investigating this.
(In reply to comment #4)
> Hmm, I should read testcases curefully. It is memcmp. GCC is not really smart
> on inlining this; I guess we should just disable the inline unless we optimize
> for size since we don't really have enough info to inline it well by default.
We can still inline for very small known sizes where the call overhead
would outweight any optimization in glibc.
Speaking of memcmp/strcmp, perhaps we on the other side should inline for !-Os
commonly used comparisons like
if (strcmp (p, ".") == 0 || strcmp (p, "..") == 0)
glibc has a macro for this, but I think it would be preferable to do it in gcc where we know about -Os, cold basic blocks etc.
Currently we only fold if (strcmp (p, "")) ...
I just did some tests, and on my machine, glibc's memcmp is faster even when the size of the thing we're comparing is 4 bytes. I can't point to a case that this optimization speeds up on my machine.
The basic string/memory functions in glibc 2.13 or above are
super faster in all cases. GCC can't beat glibc if function
call overhead is low.
Can I force gcc not to use its inlined version?
if glibc implementation beats gcc even for size of 4, I guess we could just drop the pattern or enable at at -Os only.
Or are there easy cases we want to inline, like we do for memcpy?
Unlike memcpy, memcmp/strcmp is more difficult to handle because the amount of memory it will process is harder to estimate. I guess still with known alignment and/or small upper bound of object size, inline code would be a win.
Created attachment 24670 [details]
memcpy/memset testing script
can you please run the attached script with new glibc as
sh test_stringop 64 640000000 gcc -march=native | tee out
In my quick testing on glibc2.11 and core i5 & AMD machine, inline memcpy/memset is still win on I5 for all blocks sizes (our optimization table is however wrong since it is inherited from generic one). For blocks of 512b and above however the inline code is about as fast as glibc code and obviously longer.
On AMD machine libcall is win for blocks of 1k to 8k. For large blocks inline seems to be win again, for whatever reason. Probably prefetch logic is wrong on the older glibc.
If glibc stringops has been finally made sane, we ought to revisit the tables we generate inline versions from.
(In reply to comment #12)
> Created attachment 24670 [details]
> memcpy/memset testing script
> can you please run the attached script with new glibc as
> sh test_stringop 64 640000000 gcc -march=native | tee out
Do you think you could spin a script which also tests memcmp?
Created attachment 24676 [details]
Test results from my core i7
> Do you think you could spin a script which also tests memcmp?
memcmp is different story. Few years back I rewrote memcpy/memset codegen to allow choosing
from several basic implementations based on size. The script I attached is basically
testing the individual algorithm so you can set for each CPU target a table choosing best
performing one for given size.
memcmp is still produced in very stupid way. Sane inline implementations of
memcmp are harder than memcpy/memset. I will try to look into it and how
current recommended codegens look for AMD/Intel chips.
2011-08-12 Nick Clifton <firstname.lastname@example.org>
* builtins.c (expand_builtin_memcmp): Do not use cmpstrnsi pattern.
* doc/md.texi (cmpstrn): Note that the comparison stops if both
fetched bytes are zero.
(cmpmem): Note that the comparison does not stop if both of the
fetched bytes are zero.
we no longer expand memcmp inline as x86 does not have a cmpmemsi pattern.
int foo(char *s, char *q)
return memcmp (s, q, 4);
the 1-size case is folded to *s == *q.
This is now a regression, we have to do _something_ about it for 4.7,
at least for constant small sizes.
I guess ideally the middle-end should know how to inline the simple loop (for both strlen and memcmp) and do so when object size is known to be small (probably by target specific value). Or does anyone think it is a bad idea?
We could then make i386 backend to again inline the rep instruction and/or the new string extensions
(In reply to comment #17)
> I guess ideally the middle-end should know how to inline the simple loop (for
> both strlen and memcmp) and do so when object size is known to be small
> (probably by target specific value). Or does anyone think it is a bad idea?
I think that's a bad idea, yes. Apart from maybe simple cases we already
handle for memcpy -> assignment simplifications (thus, aligned compares up to
word_mode size), but supposedly only if the result is checked for ==/!= 0
Just FYI: mesa is now defaulting to -fno-builtin-memcmp to workaround this problem:
GCC 4.7.0 is being released, adjusting target milestone.
GCC 4.7.1 is being released, adjusting target milestone.
GCC 4.7.2 has been released.
GCC 4.7.3 is being released, adjusting target milestone.
The 4.7 branch is being closed, moving target milestone to 4.8.4.
GCC 4.8.4 has been released.
I won't be able to do anything iwth this for 5.0, so I am unassigning myself. Will try to keep this on my radar for next stage1 though.
*** Bug 59048 has been marked as a duplicate of this bug. ***
This is not assigned to anyone
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
GCC 4.9.3 has been released.
Is this PR still a problem ?
I tried building the testcase from comment #1 today and found the memcmp is being called even at -O3. The run times are approximately the same for -O3 and -O0 as well.
I see memcmp being called at -O3 in compilers going all the way back to 4.5. That's expected given Nick's patch to disable cmpstrnM. I guess I'll look into making a cmpmem pattern.
Also, the test program needs some additional repeat to give meaningful times.
Yes, confirmed. -O3 and -O0 now both run the same speed so this bug is fixed.
Sorry for being potentially several years late confirming this.
Closing since PR52171 tracks the other problem.