Bug 40893 - ARM and PPC truncate intermediate operations unnecessarily
Summary: ARM and PPC truncate intermediate operations unnecessarily
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 4.4.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2009-07-28 16:28 UTC by David Conrad
Modified: 2010-10-08 17:17 UTC (History)
6 users (show)

See Also:
Host: i386-apple-darwin
Target: arm-none-linux-gnueabi
Build:
Known to work:
Known to fail:
Last reconfirmed: 2009-09-09 16:33:48


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description David Conrad 2009-07-28 16:28:10 UTC
Consider the following C code:

#include <inttypes.h>

void dct2x2dc_dconly( int16_t d[2][2] )
{
    int d0 = d[0][0] + d[0][1];
    int d1 = d[1][0] + d[1][1];
    d[0][0] = d0 + d1;
    d[0][1] = d0 - d1;
}

The following is generated with arm-none-linux-gnueabi-gcc-4.4.0 -O3 -mcpu=cortex-a8 -S
dct2x2dc_dconly:
	ldrsh	ip, [r0, #2]
	ldrsh	r3, [r0, #0]
	ldrsh	r1, [r0, #6]
	ldrsh	r2, [r0, #4]
	add	r3, ip, r3
	add	r2, r1, r2
	uxth	r3, r3
	uxth	r2, r2
	rsb	r1, r2, r3
	add	r3, r2, r3
	strh	r1, [r0, #2]	@ movhi
	strh	r3, [r0, #0]	@ movhi
	bx	lr
(with pre-armv6 targets the two uxth are replaced by asl #16, lsr #16 pairs.)

The following is generated with powerpc-unknown-linux-gnu-gcc-4.4.0 -O3 -mcpu=G4 -S
dct2x2dc_dconly:
	lha 10,2(3)
	lha 0,0(3)
	lha 11,6(3)
	lha 9,4(3)
	add 0,10,0
	rlwinm 0,0,0,0xffff
	add 9,11,9
	rlwinm 9,9,0,0xffff
	subf 11,9,0
	add 0,9,0
	sth 11,2(3)
	sth 0,0(3)
	blr

The two uxth in the ARM version, and the two rlwinm in the PPC version are completely unnecessary, as letting strh/sth truncate will give equivalent results. x86 does not exhibit this behaviour, and removing either d0 + d1 or d0 - d1 will not cause d0 and d1 be truncated to to 16 bits on both ARM and PPC.

powerpc-unknown-linux-gnu-gcc-4.4.0 -v
Using built-in specs.
Target: powerpc-unknown-linux-gnu
Configured with: /var/tmp/portage/sys-devel/gcc-4.4.0/work/gcc-4.4.0/configure --prefix=/usr --bindir=/usr/powerpc-unknown-linux-gnu/gcc-bin/4.4.0 --includedir=/usr/lib/gcc/powerpc-unknown-linux-gnu/4.4.0/include --datadir=/usr/share/gcc-data/powerpc-unknown-linux-gnu/4.4.0 --mandir=/usr/share/gcc-data/powerpc-unknown-linux-gnu/4.4.0/man --infodir=/usr/share/gcc-data/powerpc-unknown-linux-gnu/4.4.0/info --with-gxx-include-dir=/usr/lib/gcc/powerpc-unknown-linux-gnu/4.4.0/include/g++-v4 --host=powerpc-unknown-linux-gnu --build=powerpc-unknown-linux-gnu --enable-altivec --disable-fixed-point --without-ppl --without-cloog --disable-nls --with-system-zlib --disable-checking --disable-werror --enable-secureplt --disable-multilib --disable-libmudflap --disable-libssp --enable-libgomp --enable-cld --disable-libgcj --enable-languages=c,c++,fortran --enable-shared --enable-threads=posix --enable-__cxa_atexit --enable-clocale=gnu --with-bugurl=http://bugs.gentoo.org/ --with-pkgversion='Gentoo 4.4.0 p1.1'
Thread model: posix
gcc version 4.4.0 (Gentoo 4.4.0 p1.1) 

arm-none-linux-gnueabi-gcc-4.4.0 -v
Using built-in specs.
Target: arm-none-linux-gnueabi
Configured with: ../gcc-4.4.0/configure --target=arm-none-linux-gnueabi --prefix=/usr/local/arm --enable-threads --with-sysroot=/usr/local/arm/arm-none-linux-gnueabi/libc
Thread model: posix
gcc version 4.4.0 (GCC)
Comment 1 David Conrad 2009-07-28 22:27:39 UTC
More specifically, on x86_64 the following is generated with gcc-4.4 -O3 -march=core2 -S
_dct2x2dc_dconly:
	movswl	2(%rdi),%edx
	pushq	%rbp
	addw	(%rdi), %dx
	movswl	6(%rdi),%eax
	movq	%rsp, %rbp
	addw	4(%rdi), %ax
	leal	(%rax,%rdx), %ecx
	subw	%ax, %dx
	movw	%cx, (%rdi)
	movw	%dx, 2(%rdi)
	leave
	ret

So it seems that the optimizer realizes that you don't need registers larger than 16-bits, which allows memory operands on x86, which is optimal for this case. However, other architectures follow this too literally, wasting instructions to truncate intermediate results to 16 bits.
Comment 2 paul walmsley 2010-10-05 18:14:35 UTC
Here's a minimal test case:

void foo(unsigned int v)
{
  *(volatile unsigned short *)0xabcdefab = (v);
}

arm-linux-gcc  -O2 -march=armv7-a -c test.c; arm-linux-objdump -DS test.o 
| less


00000000 <foo>:
   0:   e30e3fff        movw    r3, #61439      ; 0xefff
   4:   e34a3bcd        movt    r3, #43981      ; 0xabcd
   8:   e6ff0070        uxth    r0, r0
   c:   e14305b4        strh    r0, [r3, #-84]
  10:   e12fff1e        bx      lr


As David notes, the expected behavior is that the uxth should not be generated for >= armv6 targets, and the two shifts should not be generated on < armv6 targets, as they should be superfluous.

http://marc.info/?l=linux-omap&m=128630215909798&w=2
Comment 3 Richard Earnshaw 2010-10-08 13:04:49 UTC
(In reply to comment #2)
> Here's a minimal test case:
> 
> void foo(unsigned int v)
> {
>   *(volatile unsigned short *)0xabcdefab = (v);
> }
> 
>

The compiler has to be extremely conservative with this code as it has a volatile memory reference to deal with.  It must take extreme care not to modify that operation and one consequence of this is that it is then difficult to remove the narrowing operation.
Comment 4 paul walmsley 2010-10-08 17:17:24 UTC
The bug also appears without volatile:

----

/* generates an unnecessary uxth */
void foo(unsigned short a, unsigned short b, unsigned short c,
	 unsigned short *e, unsigned short *f)
{
	*e = (a + b) + c;
	*f = (a + b) - c;
}

/* works as expected */
void bar(unsigned short a, unsigned short b, unsigned short c,
	 unsigned short *e)
{
	*e = (a + b) + c;
}

/* works as expected */
void baz(unsigned short a, unsigned short b, unsigned short c,
	 unsigned short *e)
{
	*e = (a + b) - c;
}

-----

compiled and dumped with:

arm-linux-gnueabi-gcc -O2 -c test.c ; objdump -DS test.o

produces:

-----

Disassembly of section .text:

00000000 <foo>:
   0:   e0811000        add     r1, r1, r0
   4:   e1a01801        lsl     r1, r1, #16
   8:   e1a01821        lsr     r1, r1, #16
   c:   e0620001        rsb     r0, r2, r1
  10:   e0821001        add     r1, r2, r1
  14:   e1c310b0        strh    r1, [r3]
  18:   e59d3000        ldr     r3, [sp]
  1c:   e1c300b0        strh    r0, [r3]
  20:   e12fff1e        bx      lr

00000024 <bar>:
  24:   e0811000        add     r1, r1, r0
  28:   e0822001        add     r2, r2, r1
  2c:   e1c320b0        strh    r2, [r3]
  30:   e12fff1e        bx      lr

00000034 <baz>:
  34:   e0811000        add     r1, r1, r0
  38:   e0622001        rsb     r2, r2, r1
  3c:   e1c320b0        strh    r2, [r3]
  40:   e12fff1e        bx      lr

-----

gcc -v:

Using built-in specs.
Target: arm-linux-gnueabi
Configured with: ../src/configure -v --with-pkgversion='Debian 4.4.5-2' --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.4 --enable-shared --enable-multiarch --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/arm-linux-gnueabi/include/c++/4.4.5 --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --disable-sjlj-exceptions --enable-checking=release --program-prefix=arm-linux-gnueabi- --includedir=/usr/arm-linux-gnueabi/include --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=arm-linux-gnueabi --with-headers=/usr/arm-linux-gnueabi/include --with-libs=/usr/arm-linux-gnueabi/lib
Thread model: posix
gcc version 4.4.5 (Debian 4.4.5-2)