This report was prompted by a mail on the lkml which was suggesting to hand-craft memset: http://lkml.org/lkml/2007/8/17/309 . So I wondered if the code generated for __builtin_memset was any good, and could be used instead of hand-crafted code. I tested with (Debian) GCC 3.4.6, 4.1.3, 4.2.1, and also with a snapshot of GCC 4.3. All the results are similar, so I will only show them for GCC 4.2 on x86-64. Compilation was done with -O3. First, the __builtin_memset code: void fill1(char *s, int a) { __builtin_memset(s, a, 15); } GCC generates: 0: 40 0f b6 c6 movzbl %sil,%eax 4: 48 ba 01 01 01 01 01 mov $0x101010101010101,%rdx b: 01 01 01 e: 40 0f b6 ce movzbl %sil,%ecx 12: 48 0f af c2 imul %rdx,%rax 16: 40 88 77 0e mov %sil,0xe(%rdi) 1a: 48 89 07 mov %rax,(%rdi) 1d: 40 0f b6 c6 movzbl %sil,%eax 21: 69 c0 01 01 01 01 imul $0x1010101,%eax,%eax 27: 89 47 08 mov %eax,0x8(%rdi) 2a: 89 c8 mov %ecx,%eax 2c: c1 e0 08 shl $0x8,%eax 2f: 01 c8 add %ecx,%eax 31: 66 89 47 0c mov %ax,0xc(%rdi) 35: c3 retq Notice that GCC first computes %sil * (01)^8 and puts it into %rax, then it computes %sil * (01)^4 and puts it into %eax (where it already was, due to the previous multiplication), then it computes %sil * (01)^2 and puts it into %ax (where it already was, again). Second, some code where multiplication results are reused: void fill2(char *s, int a) { unsigned long long int v = (unsigned char)a * 0x0101010101010101ull; *(unsigned long long int *)s = v; *(unsigned *)(s + 8) = v; *(unsigned short *)(s + 12) = v; *(s + 15) = v; } GCC generates: 0: 40 0f b6 f6 movzbl %sil,%esi 4: 48 b8 01 01 01 01 01 mov $0x101010101010101,%rax b: 01 01 01 e: 48 0f af f0 imul %rax,%rsi 12: 48 89 37 mov %rsi,(%rdi) 15: 89 77 08 mov %esi,0x8(%rdi) 18: 66 89 77 0c mov %si,0xc(%rdi) 1c: 40 88 77 0f mov %sil,0xf(%rdi) 20: c3 retq The function is 21 bytes smaller (-40%), it does not require two additional registers (c and d), and it will not be slower. The same issue arises on x86_32. The hand-written code (with 32bit integers this time) is 14 bytes smaller for memset(,,15).
This is a middle-end issue. What is happening is that it is splitting it up early on with the size. Now what the trunk is doing is better than what was done for 4.0.x.
Re-confirmed.
Fixed on the trunk by r12-285. This is what we get now: fill1(char*, int): movabsq $72340172838076673, %rax movzbl %sil, %esi imulq %rax, %rsi movq %rsi, (%rdi) movq %rsi, 7(%rdi) ret
Dup. *** This bug has been marked as a duplicate of bug 90773 ***