[PATCH i386]: Enable push/pop in pro/epilogue for modern CPUs
Xinliang David Li
davidxl@google.com
Thu Dec 13 07:05:00 GMT 2012
Try the following one. Compare 1) -minline-all-stringops
-mstringop-strategy=rep_8byte -O2 vs 2) -mstringop-strategy=libcall
-O2.
David
#include <string.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef LEN
#define LEN 16
#endif

/* noinline keeps LEN from being constant-propagated into the memcpy. */
void copy(char* s1, char* s2, int len) __attribute__((noinline));
void copy(char* s1, char* s2, int len)
{
  memcpy(s2, s1, len);
}

int main() {
  char* s1 = (char*) malloc(LEN + 10);
  char* s2 = (char*) malloc(LEN + 10);
  int i = 0;
  for (i = 0; i < 1000000000; i++)
    {
      /* Offsets 1 and 3 move both pointers off the malloc alignment. */
      copy(s1 + 1, s2 + 3, LEN);
    }
  return 0;
}
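
A minimal sketch, assuming nothing beyond the test above, of how the same
loop could be timed for several sizes in one run instead of rebuilding with
a different LEN each time. The names copy_block and time_copies are
hypothetical helpers, not part of the original test, and clock() is only a
coarse CPU-time measurement.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Same shape as copy() in the test above: noinline keeps the length
   from being known at the memcpy, so the stringop strategy applies. */
__attribute__((noinline))
static void copy_block(char *dst, const char *src, size_t len)
{
    memcpy(dst, src, len);
}

/* Hypothetical helper: perform `iters` copies of `len` bytes and return
   the elapsed CPU time in seconds, or -1.0 if allocation fails. */
static double time_copies(size_t len, long iters)
{
    char *src = calloc(len + 16, 1);
    char *dst = calloc(len + 16, 1);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }

    clock_t start = clock();
    for (long i = 0; i < iters; i++)
        copy_block(dst + 3, src + 1, len);  /* same offsets as the test */
    clock_t end = clock();

    /* Read the destination so the copies have an observable use. */
    volatile char sink = dst[3];
    (void)sink;

    free(src);
    free(dst);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_copies() from main for sizes such as 8, 16, 32 and 8192, in
one binary built with -O2 -minline-all-stringops
-mstringop-strategy=rep_8byte and another built with -O2
-mstringop-strategy=libcall, would give the two sets of numbers being
compared in this thread.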
On Wed, Dec 12, 2012 at 10:21 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Wed, Dec 12, 2012 at 10:09:14PM -0800, Xinliang David Li wrote:
>> On Wed, Dec 12, 2012 at 5:19 PM, Jan Hubicka <hubicka@ucw.cz> wrote:
>> >> > libcall is not faster than the rep sequence up to 8KB, and the rep
>> >> > sequence is better for regalloc/code cache than a fully blown
>> >> > function call.
>> >>
>> >> Be careful with this. My recollection is that the REP sequence is not
>> >> good for all sizes -- for small sizes, the REP initial setup cost is
>> >> too high (10s of cycles), while for large copies it is less efficient
>> >> than the library version.
>> >
>> > Well, this is based on the data from the memtest script.
>> > Core has a good REP implementation - it is a win from rather small blocks
>> > (16 bytes if I recall) and it does not need alignment.
>> > The library version starts to be interesting with caching hints, but I
>> > think up to 80KB it is still not a win for my setup (glibc-2.15).
>>
>> A simple test shows that -mstringop-strategy=libcall always beats
>> -mstringop-strategy=rep_8byte (on core2 and corei7) except for sizes
>> smaller than 8, where the rep_8byte strategy simply bypasses REP movs.
>> Can you share your memtest?
>
> I can't believe that, say, a 16-byte or 32-byte memcpy can ever be faster
> using a libcall. The PLT call overhead is simply too high.
>
> Jakub