This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
When memcmp is used for aligned strings I strongly recommend that a loop
be used instead of rep cmps on Intel Pentium processors up to the P3.
For the Athlon (and possibly the P4) rep cmps is best. I have verified
that a loop on the Pentium MMX, Pentium II, and Pentium III is
approximately 3 times faster than rep cmps, and that on the Athlon
rep cmps is faster. On the Athlon the timing for rep cmps is 16 +
(10/3)*c, which will be pretty hard to beat with a loop using integer
operations.
For very short strings where the size is known I strongly recommend
using bswap and compare. On any processor that I know about the
cmps instruction is not very efficient without the rep prefix and with
the rep prefix there is a fixed overhead which is significant when the
string is short. Furthermore it may very well be the case that the
value to compare is already loaded in a register as in the case of:
int x, y; ... memcmp(&x,&y,4) ...
I use such a compare in one of my projects so this is not a
hypothetical case. I can post the code snippet that uses it if anyone
is interested.
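As a rough illustration of the bswap-and-compare idea for a known 4-byte size, here is a minimal sketch. The names swap32 and memcmp4 are made up for this example and are not from my project. It assumes a little-endian machine such as x86, and it relies on the compiler turning the shift-and-mask swap into a single bswap, which GCC usually does when optimizing but which I have not checked for every version.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: byte swap written in portable C.  On x86 an
   optimizing compiler usually reduces this to one bswap instruction. */
static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Hypothetical helper: same sign as memcmp(a, b, 4), assuming a
   little-endian machine. */
int memcmp4(const void *a, const void *b)
{
    uint32_t x, y;
    memcpy(&x, a, 4);   /* becomes a plain 32-bit load when optimized */
    memcpy(&y, b, 4);
    if (x == y)
        return 0;
    /* After the swap the first byte in memory is the most significant
       one, so an unsigned compare orders the words the way memcmp does. */
    return swap32(x) < swap32(y) ? -1 : 1;
}
```

If the two values are already in registers, as in the memcmp(&x,&y,4) case, the loads disappear and only the two swaps and the compare remain.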
For the general aligned case it may be faster to use the MMX or SSE
registers and instructions to perform the compare. I have not looked
into this at all. For all I know this may not even be possible.
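For what it is worth, a vector compare does appear to be possible with SSE2, which is newer than the MMX/SSE of the P3 and is a swapped-in technique here. The following is only a sketch under that assumption, and I make no claim about its speed. The function name memcmp_sse2_sketch is made up for this example. It uses unaligned loads so that it is safe on any buffer (aligned loads could be used when alignment is guaranteed), and it falls back to a plain byte loop when SSE2 is not available.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical helper: returns a value with the same sign as memcmp. */
int memcmp_sse2_sketch(const void *a0, const void *b0, size_t size)
{
    const unsigned char *a = (const unsigned char *)a0;
    const unsigned char *b = (const unsigned char *)b0;
    size_t i = 0;
#if defined(__SSE2__)
    /* Compare 16 bytes at a time while whole blocks are equal. */
    for (; i + 16 <= size; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* Bit k of the mask is set when byte k is equal in both blocks. */
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb));
        if (mask != 0xffffu)
            break;   /* the byte loop below locates the differing byte */
    }
#endif
    /* Finish the differing block, or the tail shorter than 16 bytes. */
    for (; i < size; ++i)
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;
    return 0;
}
```

Unlike the word loop, this version never reads past the end of either buffer.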
The following code will perform memcmp in the aligned case via a loop
and a bswap at the end. It is assumed that it is okay to read off the
end of the string up to the nearest multiple of 4.
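#include <stddef.h>  /* size_t */

int cmps(const void * x0, const void * y0, size_t size)
{
  const unsigned int * x = (const unsigned int *)x0;
  const unsigned int * y = (const unsigned int *)y0;
  size_t i = 0;          /* size_t avoids a signed/unsigned compare with s */
  size_t s = size / 4;   /* number of whole 32-bit words */
  /* Skip the leading words that are equal. */
  while (i < s && x[i] == y[i]) ++i;
  size -= i * 4;         /* bytes left, starting at word i */
  if (size == 0) return 0;
  /* Load the first differing (or final partial) word; this may read up
     to 3 bytes past the end.  bswap moves the first byte in memory to the
     most significant position so an unsigned compare follows memcmp order. */
  unsigned int xx = x[i], yy = y[i];
  asm("bswap %0" : "+r"(xx));
  asm("bswap %0" : "+r"(yy));
  if (size >= 4) {
    /* The loop stopped on a whole word, so xx != yy here. */
    return xx < yy ? -1 : 1;
  } else {
    /* Partial word: shift out the 4 - size bytes beyond the end. */
    unsigned int dis = 8*(4-size);
    xx >>= dis;
    yy >>= dis;
    /* Both values now fit in 24 bits, so the difference has the right sign. */
    return xx - yy;
  }
}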
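The routine above depends on x86 inline assembly, so for checking the logic elsewhere a portable variant can be handy. The sketch below is not the attached code. The names cmps_portable and count_sign_mismatches are made up for this example, the swap is written in plain C, and a little-endian machine is still assumed. Like the original it may read up to the next multiple of 4, so the test buffers are sized as a multiple of 4.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint32_t load32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, 4);   /* avoids alignment and aliasing problems */
    return v;
}

static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Hypothetical portable variant of the word loop plus byte swap. */
int cmps_portable(const void *x0, const void *y0, size_t size)
{
    const unsigned char *x = (const unsigned char *)x0;
    const unsigned char *y = (const unsigned char *)y0;
    size_t i = 0, words = size / 4, left;
    uint32_t xx, yy;

    while (i < words && load32(x + 4 * i) == load32(y + 4 * i))
        ++i;
    left = size - 4 * i;
    if (left == 0)
        return 0;
    xx = swap32(load32(x + 4 * i));   /* may read up to 3 bytes past size */
    yy = swap32(load32(y + 4 * i));
    if (left < 4) {                   /* drop the bytes beyond the end */
        xx >>= 8 * (4 - left);
        yy >>= 8 * (4 - left);
    }
    if (xx == yy)
        return 0;
    return xx < yy ? -1 : 1;
}

static int sign(int v) { return (v > 0) - (v < 0); }

/* Counts results whose sign disagrees with memcmp; 0 is expected. */
int count_sign_mismatches(void)
{
    unsigned char a[32], b[32];   /* multiple of 4: over-reads stay in bounds */
    size_t n, pos;
    int bad = 0;

    for (n = 0; n <= sizeof a; ++n) {
        for (pos = 0; pos < sizeof a; ++pos) {
            memset(a, 0x55, sizeof a);
            memset(b, 0x55, sizeof b);
            b[pos] = 0xaa;        /* one differing byte, possibly past n */
            if (sign(cmps_portable(a, b, n)) != sign(memcmp(a, b, n)))
                ++bad;
            if (sign(cmps_portable(b, a, n)) != sign(memcmp(b, a, n)))
                ++bad;
        }
    }
    return bad;
}
```

Only the sign of the result is compared with memcmp, since the C standard does not fix the magnitude of the return value.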
Here are the results when compared to __builtin_memcmp. The code is
attached. It was compiled with gcc 3.2.2 on Red Hat Linux 9 with -O3
-march=pentium3:
Memory compare 15 bytes:
4190000
1470000
Speed up: 2.850340
Memory compare 16 bytes:
4360000
1520000
Speed up: 2.868421
Memory compare 64 bytes:
11440000
4040000
Speed up: 2.831683
Memory compare 256 bytes:
38000000
12310000
Speed up: 3.086921
I do not know enough about gcc internals to submit a formal patch. A
memcmp implementation is the best I can offer. I hope someone will
take the initiative to create a formal patch out of my implementation
for GCC 3.4.
Thanks.
--
http://kevin.atkinson.dhs.org
Attachment: cmps.c (text document)
Attachment: cmps.s (text document)