Bug 39329 - x86 -Os could use mulw for (uint16 * uint16)>>16
Summary: x86 -Os could use mulw for (uint16 * uint16)>>16
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 4.4.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2009-03-01 02:42 UTC by astrange+gcc@gmail.com
Modified: 2021-07-21 05:01 UTC
CC: 1 user

See Also:
Host: i?86-*-*
Target: i?86-*-*
Build: i?86-*-*
Known to work:
Known to fail:
Last reconfirmed: 2009-03-01 11:17:38


Description astrange+gcc@gmail.com 2009-03-01 02:42:30 UTC
Using 'gcc -Os -fomit-frame-pointer -march=core2 -mtune=core2' for

unsigned short mul_high_c(unsigned short a, unsigned short b)
{
    return (unsigned)(a * b) >> 16;
}

unsigned short mul_high_asm(unsigned short a, unsigned short b)
{
    unsigned short res;
    asm("mulw %w2" : "=d"(res),"+a"(a) : "rm"(b));
    return res;
}

I get

_mul_high_c:
	subl	$12, %esp
	movzwl	20(%esp), %eax
	movzwl	16(%esp), %edx
	addl	$12, %esp
	imull	%edx, %eax
	shrl	$16, %eax
	ret
_mul_high_asm:
	subl	$12, %esp
	movl	16(%esp), %eax
	mulw 20(%esp)
	addl	$12, %esp
	movl	%edx, %eax
	ret

mulw puts its 32-bit result in dx:ax, so dx already holds the high 16 bits of the product, i.e. (a*b)>>16, and the explicit shift is avoided.
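
For example, spelling out the arithmetic: with a = b = 0xFFFF the 32-bit product is 0xFFFE0001, so after mulw dx holds 0xFFFE and ax holds 0x0001; dx is exactly (a*b)>>16 with no separate shift needed.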

Ignoring the weird Darwin stack adjustment code, the version with mulw is somewhat shorter and avoids a movzwl. I'm not sure what the performance difference is; mulw is listed in Agner's tables as fairly low latency, but it requires a length-changing prefix in its memory-operand form.

This type of operation is useful in fixed-point math, e.g. in embedded audio codecs or arithmetic coders.
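As a rough sketch (a hypothetical helper, not part of the original report), a Q0.16 fixed-point multiply is exactly this pattern:

/* Hypothetical Q0.16 multiply: a and b are unsigned 16-bit fractions in
   [0,1); the high 16 bits of the 32-bit product give the product in the
   same format, so only the dx half of a mulw result would be needed. */
unsigned short q16_mul(unsigned short a, unsigned short b)
{
    return ((unsigned)a * b) >> 16;
}

(The cast is on one operand so the widening multiply is done as unsigned; the multiply-and-shift pattern is the same as in mul_high_c.)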
Comment 1 Richard Biener 2009-03-01 11:17:38 UTC
Confirmed.  It's probably difficult to expose this to combine, so a
peephole may be the only choice to catch it.