[Bug c/39549] New: Nonoptimal byte load. mov (%rdi),%al better then movzbl (%rdi),%eax
vvv at ru dot ru
gcc-bugzilla@gcc.gnu.org
Tue Mar 24 18:59:00 GMT 2009
> gcc --version
gcc (SUSE Linux) 4.3.2 [gcc-4_3-branch revision 141291]
> cat test.c
// file test.c One byte transfer
void f(char *a,char *b){
    *b=*a;
}
void F(char *a,char *b){
    asm volatile("mov (%rdi),%al\nmov %al,(%rsi)");
}
...
> gcc -g -otest test.c -O2 -mtune=core2
> objdump -d test
....
00000000004004f0 <f>:
4004f0: 0f b6 07 movzbl (%rdi),%eax
4004f3: 88 06 mov %al,(%rsi)
4004f5: c3 retq
4004f6: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
4004fd: 00 00 00
0000000000400500 <F>:
400500: 8a 07 mov (%rdi),%al
400502: 88 06 mov %al,(%rsi)
400504: c3 retq
GCC uses movzbl (%rdi),%eax, but it would be better to use mov (%rdi),%al,
because the latter instruction is 1 byte shorter (2 bytes instead of 3). The
execution time is the same (at least on Core 2 Duo and Core 2 Solo).
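
The size difference can be read straight off the objdump listing above. The
following is only a small sketch that records the two encodings as byte arrays
and computes the saving; the array names and the helper bytes_saved() are
hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/* Encodings copied from the objdump listing in this report. */
static const unsigned char movzbl_rdi_eax[] = { 0x0f, 0xb6, 0x07 }; /* movzbl (%rdi),%eax */
static const unsigned char mov_rdi_al[]     = { 0x8a, 0x07 };       /* mov (%rdi),%al     */

/* Hypothetical helper: code bytes saved per byte load by the shorter form. */
size_t bytes_saved(void)
{
    return sizeof movzbl_rdi_eax - sizeof mov_rdi_al;
}
```

The saving is one byte per load; whether it matters in practice depends on how
many such loads sit in fetch-bound code, which this report does not measure.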
This is probably the result of Intel's recommendation to use movz to avoid a
partial register stall. But a smaller instruction reduces the fetch bandwidth
needed... and
Quote from: Intel® 64 and IA-32 Architectures Optimization Reference Manual
(248966), section 3.5.2.3, Partial Register Stalls:
"The delay of a partial register stall is small in processors based on Intel
Core and NetBurst microarchitectures, and in Pentium M processor (with CPUID
signature family 6, model 13), Intel Core Solo, and Intel Core Duo processors.
Pentium M processors (CPUID signature with family 6, model 9) and the P6 family
incur a large penalty."
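
To make the partial-register issue concrete, here is a rough model in portable
C of what each load does to the full 64-bit RAX register. It is a sketch of the
architectural result only, not of any pipeline timing, and the function names
model_mov_al() and model_movzbl_eax() are made up for this illustration.

```c
#include <stdint.h>

/* Model of `mov (%rdi),%al`: only bits 0-7 of RAX are replaced, so the
   result depends on the previous contents of the register. */
uint64_t model_mov_al(uint64_t old_rax, uint8_t loaded)
{
    return (old_rax & ~UINT64_C(0xff)) | loaded;
}

/* Model of `movzbl (%rdi),%eax`: the byte is zero-extended into EAX, and a
   32-bit register write clears the upper half of RAX, so the previous
   register contents do not matter. */
uint64_t model_movzbl_eax(uint64_t old_rax, uint8_t loaded)
{
    (void)old_rax; /* deliberately unused: no dependency on the old value */
    return loaded;
}
```

The first function needs old_rax as an input while the second ignores it. That
dependency on the old register value is what the movz recommendation avoids,
and the quoted manual text says its cost is small on the newer processors it
lists.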
--
Summary: Nonoptimal byte load. mov (%rdi),%al better then movzbl (%rdi),%eax
Product: gcc
Version: 4.3.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: vvv at ru dot ru
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39549