Using 'gcc -Os -fomit-frame-pointer -march=core2 -mtune=core2' for

unsigned short mul_high_c(unsigned short a, unsigned short b)
{
    return (unsigned)(a * b) >> 16;
}

unsigned short mul_high_asm(unsigned short a, unsigned short b)
{
    unsigned short res;
    asm("mulw %w2" : "=d"(res), "+a"(a) : "rm"(b));
    return res;
}

I get:

_mul_high_c:
        subl    $12, %esp
        movzwl  20(%esp), %eax
        movzwl  16(%esp), %edx
        addl    $12, %esp
        imull   %edx, %eax
        shrl    $16, %eax
        ret

_mul_high_asm:
        subl    $12, %esp
        movl    16(%esp), %eax
        mulw    20(%esp)
        addl    $12, %esp
        movl    %edx, %eax
        ret

mulw puts its result in dx:ax, so dx already contains (dx:ax)>>16 and the explicit shift is avoided. Ignoring the odd Darwin stack-adjustment code, the mulw version is somewhat shorter and avoids a movzwl. I'm not sure what the performance difference is; mulw is listed in Agner Fog's instruction tables as fairly low latency, but it requires a length-changing prefix when the operand is in memory. This type of operation is useful in fixed-point math, such as embedded audio codecs or arithmetic coders.
Confirmed. It's probably difficult to expose this to the combine pass, so a peephole may be the only way to catch it.