48789 – missed ARM optimization: use LDMIA

Status:	NEW

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target
Version:	4.6.0

Importance:	P4 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2011-04-27 11:09 UTC by Török Edwin
Modified:	2017-06-16 14:06 UTC
CC List:	1 user

See Also:
Host:	x86_64-linux-gnu
Target:	arm-elf
Build:	x86_64-linux-gnu
Known to work:
Known to fail:
Last reconfirmed:	2017-06-16 00:00:00

Attachments
reverse.c (318 bytes, text/x-csrc) 2011-04-27 11:10 UTC, Török Edwin
test.S (563 bytes, application/octet-stream) 2011-04-27 11:10 UTC, Török Edwin
bench.c (557 bytes, text/x-csrc) 2011-04-27 11:11 UTC, Török Edwin

Description Török Edwin 2011-04-27 11:09:49 UTC

The attached testcase compiles to larger and slower code than the hand-optimized version, although the C code closely follows the structure of the hand-optimized assembly.

To reproduce the missed optimization:
arm-elf-gcc reverse.c -O3 -mcpu=arm946e-s -msoft-float

The code generated for reverse_bytes_order_c2 has too many ldr/str instructions; it should use ldmia/stmia, as seen in the hand-optimized version (reverse_bytes_order2 in test.S).
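
The attachment itself is not reproduced in this report. As a rough illustration only, a loop of the kind being described might look like the sketch below. The function name reverse_words_unrolled8, its signature, and the assumption that the routine byte-swaps 32-bit words are all hypothetical and not taken from reverse.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte-swap one 32-bit word using shifts and masks only
 * (ARMv5TE has no dedicated byte-reverse instruction). */
static inline uint32_t swap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Hypothetical stand-in for the unrolled C routine: processes eight
 * words per iteration. Assumes count_words is a multiple of 8. */
void reverse_words_unrolled8(const uint32_t *src, uint32_t *dst,
                             size_t count_words)
{
    assert(count_words % 8 == 0);

    for (size_t i = 0; i < count_words; i += 8) {
        /* Eight consecutive loads: a candidate for a single ldmia. */
        uint32_t a = src[i + 0], b = src[i + 1];
        uint32_t c = src[i + 2], d = src[i + 3];
        uint32_t e = src[i + 4], f = src[i + 5];
        uint32_t g = src[i + 6], h = src[i + 7];

        a = swap32(a); b = swap32(b); c = swap32(c); d = swap32(d);
        e = swap32(e); f = swap32(f); g = swap32(g); h = swap32(h);

        /* Eight consecutive stores: a candidate for a single stmia. */
        dst[i + 0] = a; dst[i + 1] = b; dst[i + 2] = c; dst[i + 3] = d;
        dst[i + 4] = e; dst[i + 5] = f; dst[i + 6] = g; dst[i + 7] = h;
    }
}
```

With the loads and stores grouped this way, hand-written assembly can issue one ldmia and one stmia per iteration, which is the pattern the report asks GCC to produce instead of individual ldr/str instructions.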

Note: without -msoft-float, GCC generates faster code by using VFP instructions, but my CPU doesn't support them, so I have to turn off floating-point code generation.

Attachments:
reverse.c: the testcase
test.S: the hand-optimized version of reverse_bytes_order_c2, called reverse_bytes_order2 here (code from CHDK's lib/armutil/)
bench.c: a simple benchmark runner to compare gcc's version with the hand-optimized one

This happens both with 4.6 and 4.5:
$ arm-elf-gcc -v
Using built-in specs.
COLLECT_GCC=../build-dir/arm/toolchain/bin/arm-elf-gcc
COLLECT_LTO_WRAPPER=/home/edwin/chdk/build-dir/arm/toolchain/libexec/gcc/arm-elf/4.6.0/lto-wrapper
Target: arm-elf
Configured with: ../gcc-4.6.0/configure --target=arm-elf --prefix=/home/edwin/chdk/build-dir/arm/toolchain --enable-interwork --enable-multilib --enable-languages=c --with-newlib --with-gmp-include=/home/edwin/chdk/build-dir/build/gmp --with-gmp-lib=/home/edwin/chdk/build-dir/build/gmp/.libs --without-headers --disable-libssp --disable-nls --disable-zlib --disable-libc --disable-libm --disable-intl --disable-hardfloat --disable-threads --with-gnu-as --with-gnu-ld
Thread model: single
gcc version 4.6.0 (GCC) 

$ /opt/cfarm/release/4.5.0/bin/gcc -v
Using built-in specs.
COLLECT_GCC=/opt/cfarm/release/4.5.0/bin/gcc
COLLECT_LTO_WRAPPER=/home/guerby/opt/release/4.5.0/bin/../libexec/gcc/armv7l-unknown-linux-gnueabi/4.5.0/lto-wrapper
Target: armv7l-unknown-linux-gnueabi
Configured with: ../gcc-4.5.0/configure --prefix=/opt/cfarm/release/4.5.0 --enable-languages=c,ada --enable-__cxa_atexit --disable-nls --enable-threads=posix --disable-multilib --with-gmp=/opt/cfarm/gmp-4.2.4 --with-mpfr=/opt/cfarm/mpfr-2.4.2 --with-mpc=/opt/cfarm/mpc-0.8 --with-cpu=cortex-a8 --with-fpu=neon --with-float=softfp --disable-werror
Thread model: posix
gcc version 4.5.0 (GCC) 

Some benchmarks (run on gcc33, which supports armv7; my CPU doesn't, so I can only use armv5te):
base: 0.340810 (hand-optimized assembly)
3: 0.840712 (alternate version)
c: 0.379164 (C code, compiled with -O3)
c2: 0.395410 (C code, unrolled 8 times like the hand-written assembly, compiled with -O3)

(Note: run the benchmark as ./a.out; ./a.out; ./a.out. I think some frequency scaling causes the first run to be slower.)

To run benchmark:
/opt/cfarm/release/4.5.0/bin/gcc bench.c reverse.c test.S -O3  -mcpu=arm946e-s -msoft-float
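
bench.c is not reproduced in this report. As a minimal sketch of how such a comparison could be timed, assuming a hypothetical common signature for the routines under test, the harness below performs a few untimed warm-up calls before measuring, which is one way to reduce the first-run slowdown mentioned above. The names reverse_fn, reverse_words_simple, and time_reverse are illustrative and do not come from the attachment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical signature shared by the routines being compared. */
typedef void (*reverse_fn)(const uint32_t *src, uint32_t *dst,
                           size_t count_words);

/* Plain reference routine so the harness has something to time. */
void reverse_words_simple(const uint32_t *src, uint32_t *dst,
                          size_t count_words)
{
    for (size_t i = 0; i < count_words; i++) {
        uint32_t x = src[i];
        dst[i] = (x >> 24) | ((x >> 8) & 0x0000ff00u) |
                 ((x << 8) & 0x00ff0000u) | (x << 24);
    }
}

/* Returns the CPU time in seconds spent on `iters` timed calls,
 * after `warmup` untimed calls. */
double time_reverse(reverse_fn fn, const uint32_t *src, uint32_t *dst,
                    size_t count_words, int warmup, int iters)
{
    assert(warmup >= 0 && iters > 0);

    for (int i = 0; i < warmup; i++)
        fn(src, dst, count_words);

    clock_t start = clock();
    for (int i = 0; i < iters; i++)
        fn(src, dst, count_words);
    clock_t end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Each candidate (the C versions and the assembly routine) would be passed to time_reverse with the same buffers and iteration count, so the reported times are directly comparable.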

Comment 1 Török Edwin 2011-04-27 11:10:35 UTC

Created attachment 24114
reverse.c

Comment 2 Török Edwin 2011-04-27 11:10:49 UTC

Created attachment 24115
test.S

Comment 3 Török Edwin 2011-04-27 11:11:02 UTC

Created attachment 24116
bench.c

Comment 4 Ramana Radhakrishnan 2011-07-27 16:40:09 UTC

There are a number of problems, and not all of them are related to the backend.

- ldm / stm aren't really first-class citizens as far as GCC is concerned. There is no way today to make the register allocator use increasing addresses as a criterion when deciding which register holds which value.

- I suspect the performance issues you are seeing come from the number of spills and fills being generated in this case. If you try -fsched-pressure, life becomes much better, and in fact the amount of stack space used drops to 0. I haven't run any benchmarks to see whether you get better performance in this particular case.

Comment 5 Ramana Radhakrishnan 2017-06-16 14:06:28 UTC

Confirmed, but lower priority given all the other statements in c#2.