gcc -O2 -S on this input:

typedef unsigned long long u64;

u64 test()
{
  u64 low, high;
  asm volatile ("rdtsc" : "=a" (low), "=d" (high));
  return low | (high << 32);
}

generates this:

test:
.LFB0:
        .cfi_startproc
#APP
# 6 "rax_rdx.c" 1
        rdtsc
# 0 "" 2
#NO_APP
        movq    %rax, %rcx
        movq    %rdx, %rax
        salq    $32, %rax
        orq     %rcx, %rax
        ret
        .cfi_endproc

which is silly -- both movq instructions are unnecessary.

clang -O3 -fomit-frame-pointer does much better:

test:
.Leh_func_begin0:
#APP
        rdtsc
#NO_APP
        shlq    $32, %rdx
        orq     %rdx, %rax
        ret

Getting rid of the << 32 makes gcc generate the obvious code.

FWIW, this code:

unsigned long long rdtsc (void)
{
  unsigned int tickl, tickh;
  __asm__ __volatile__("rdtsc":"=a"(tickl),"=d"(tickh));
  return ((unsigned long long)tickh << 32)|tickl;
}

is copied verbatim from the "Machine Constraints" section of the manual (http://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html#Machine-Constraints) and generates the same silly code.
If you use

  return __builtin_ia32_rdtsc ();

instead, both 4.6 and 4.7 generate:

        rdtsc
        salq    $32, %rdx
        orq     %rdx, %rax
        ret

Current GCC trunk generates:

#APP
# 7 "pr48877.c" 1
        rdtsc
# 0 "" 2
#NO_APP
        salq    $32, %rdx
        orq     %rax, %rdx
        movq    %rdx, %rax
        ret

for the asm testcase, which isn't as bad as 4.6, but isn't perfect.

What matters for IRA is which pseudo is LHS/RHS1 and which is RHS2 on the orq insn. For the builtin version, LHS/RHS1 is the pseudo set by the unspecv with the "=a" constraint; for the asm version, it is the LHS from the shift insn.
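For anyone who wants to compare the two forms side by side, here is a minimal sketch that puts both in one file. It assumes an x86-64 target and a compiler that provides __builtin_ia32_rdtsc (GCC does). The function names rdtsc_asm and rdtsc_builtin are made up for illustration and are not part of the testcase.

```c
#include <stdint.h>

/* Inline-asm form, same shape as the testcase: rdtsc leaves the low
   32 bits of the counter in EAX and the high 32 bits in EDX.
   Assumes x86-64, where the upper halves of RAX/RDX are cleared. */
static inline uint64_t rdtsc_asm(void)
{
    uint64_t low, high;
    __asm__ __volatile__("rdtsc" : "=a"(low), "=d"(high));
    return low | (high << 32);
}

/* Builtin form: the compiler emits rdtsc itself and combines the
   two halves, so no asm operands are involved. */
static inline uint64_t rdtsc_builtin(void)
{
    return __builtin_ia32_rdtsc();
}
```

Compiling a file like this with -O2 -S and looking at the code after each rdtsc should show whether a given compiler version still emits the extra moves for the asm variant.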
Modern GCC doesn't generate excessive moves for this example. It looks like the problem was fixed in 4.9.0: https://godbolt.org/z/MqE7sP . I think the bug can be closed now.
(In reply to Ivan Sorokin from comment #2)
> Modern GCC doesn't generate excessive moves for this example. It looks like
> the problem was fixed in 4.9.0: https://godbolt.org/z/MqE7sP .
>
> I think the bug can be closed now.

Indeed.