Bug 51980 - ARM - Neon code polluted by useless stores to the stack with vuzpq / vzipq / vtrnq
Summary: ARM - Neon code polluted by useless stores to the stack with vuzpq / vzipq / ...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.7.0
: P3 normal
Target Milestone: 4.9.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: 47562
  Show dependency treegraph
 
Reported: 2012-01-24 14:32 UTC by Eric Batut
Modified: 2016-02-22 12:54 UTC (History)
5 users (show)

See Also:
Host:
Target: arm-linux-androideabi, arm*-*-*eabi
Build:
Known to work:
Known to fail:
Last reconfirmed: 2012-01-24 00:00:00


Attachments
Minimal repro case (C file) (625 bytes, application/octet-stream)
2012-01-24 14:32 UTC, Eric Batut
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Eric Batut 2012-01-24 14:32:25 UTC
Created attachment 26442 [details]
Minimal repro case (C file)

When using UZP/ZIP/TRN Neon intrinsics, gcc-trunk generates a whole lot of stack operations (and associated stack alignment operations) even if everything can purely be done using Neon registers. 

Compiler used is GCC trunk, rev 183468, compiled with Android's build-gcc.sh (arm-linux-androideabi).

Command line is:
arm-linux-androideabi-g++ -c -march=armv7-a -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=vfp -flax-vector-conversions -mfpu=neon -O2 -o test.s test.c -S

Generated assembly code for attached C file is:
_Z13sqrlen4D_16u817__simd128_uint8_tS_:
	vabd.u8	q1, q0, q1
	stmfd	sp!, {r4, fp}       <= Unnecessary
	add	fp, sp, #4          <= Unnecessary
	sub	sp, sp, #48         <= Unnecessary
	add	r3, sp, #15         <= Unnecessary
	vmull.u8	q0, d2, d2
	bic	r3, r3, #15         <= Unnecessary
	vmull.u8	q8, d3, d3
	vuzp.32	q0, q8
	vstmia	r3, {d0-d1}         <= Unnecessary, caused by vuzp.32
	vstr	d16, [r3, #16]      <= Unnecessary, caused by vuzp.32
	vstr	d17, [r3, #24]      <= Unnecessary, caused by vuzp.32
	vpaddl.u16	q0, q0
	vpadal.u16	q0, q8
	sub	sp, fp, #4          <= Unnecessary
	ldmfd	sp!, {r4, fp}       <= Unnecessary
	bx	lr

As no stack operation is needed in this function, ideally the following should be generated instead:
_Z13sqrlen4D_16u817__simd128_uint8_tS_:
	vabd.u8	q1, q0, q1
	vmull.u8	q0, d2, d2
	vmull.u8	q8, d3, d3
	vuzp.32	q0, q8
	vpaddl.u16	q0, q0
	vpadal.u16	q0, q8
	bx	lr

This makes even tight Neon functions written with intrinsics much larger and slower than necessary, and makes it very hard to write performance-oriented code with intrinsics in arm-gcc.

gcc -v yields:
Using built-in specs.
COLLECT_GCC=/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/bin/arm-linux-androideabi-g++
COLLECT_LTO_WRAPPER=/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/libexec/gcc/arm-linux-androideabi/4.7.0/lto-wrapper
Target: arm-linux-androideabi
Configured with: /home/eb/android-ndk-r6/src/build/../gcc/gcc-4.7.0/configure --prefix=/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86 --target=arm-linux-androideabi --host=i386-linux-gnu --build=i386-linux-gnu --with-gnu-as --with-gnu-ld --enable-languages=c,c++ --with-gmp=/tmp/ndk-eb/build/toolchain/temp-install --with-mpfr=/tmp/ndk-eb/build/toolchain/temp-install --with-mpc=/tmp/ndk-eb/build/toolchain/temp-install --disable-libssp --enable-threads --disable-nls --disable-libmudflap --disable-libgomp --disable-libstdc__-v3 --disable-sjlj-exceptions --disable-shared --disable-tls --with-float=soft --with-fpu=vfp --with-arch=armv5te --enable-target-optspace --enable-initfini-array --disable-nls --prefix=/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86 --with-sysroot=/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/sysroot --with-binutils-version=2.21.53 --with-mpfr-version=3.0.1 --with-gmp-version=5.0.2 --with-gcc-version=4.7.0 --with-gdb-version=6.6 --with-mpc-version=0.9 --with-arch=armv5te --enable-libstdc__-v3 --program-transform-name='s,^,arm-linux-androideabi-,'
Thread model: posix
gcc version 4.7.0 20120124 (experimental) (GCC) 
COLLECT_GCC_OPTIONS='-c' '-march=armv7-a' '-mcpu=cortex-a9' '-mfloat-abi=hard' '-mfpu=vfp' '-flax-vector-conversions' '-mfpu=neon' '-O2' '-o' 'test.s' '-S' '-v' '-mtls-dialect=gnu'
 /home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/libexec/gcc/arm-linux-androideabi/4.7.0/cc1plus -quiet -v -imultilib armv7-a -D_GNU_SOURCE test.c -mbionic -fPIC -quiet -dumpbase test.c -march=armv7-a -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=vfp -mfpu=neon -mtls-dialect=gnu -auxbase-strip test.s -O2 -version -flax-vector-conversions -o test.s -fno-exceptions -fno-rtti
GNU C++ (GCC) version 4.7.0 20120124 (experimental) (arm-linux-androideabi)
	compiled by GNU C version 4.6.0 20110603 (Red Hat 4.6.0-10), GMP version 5.0.2, MPFR version 3.0.1, MPC version 0.9
GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096
ignoring nonexistent directory "/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/4.7.0/../../../../arm-linux-androideabi/include/c++/4.7.0"
ignoring nonexistent directory "/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/4.7.0/../../../../arm-linux-androideabi/include/c++/4.7.0/arm-linux-androideabi/armv7-a"
ignoring nonexistent directory "/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/4.7.0/../../../../arm-linux-androideabi/include/c++/4.7.0/backward"
ignoring nonexistent directory "/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/sysroot/usr/local/include"
#include "..." search starts here:
#include <...> search starts here:
 /home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/4.7.0/include
 /home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/4.7.0/include-fixed
 /home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/4.7.0/../../../../arm-linux-androideabi/include
 /home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/sysroot/usr/include
End of search list.
GNU C++ (GCC) version 4.7.0 20120124 (experimental) (arm-linux-androideabi)
	compiled by GNU C version 4.6.0 20110603 (Red Hat 4.6.0-10), GMP version 5.0.2, MPFR version 3.0.1, MPC version 0.9
GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096
Compiler executable checksum: d84173bb26a7319ac9d4c1278a6a7e04
COMPILER_PATH=/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/libexec/gcc/arm-linux-androideabi/4.7.0/:/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/libexec/gcc/arm-linux-androideabi/4.7.0/:/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/libexec/gcc/arm-linux-androideabi/:/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/4.7.0/:/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/:/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/4.7.0/../../../../arm-linux-androideabi/bin/
LIBRARY_PATH=/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/4.7.0/armv7-a/:/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/4.7.0/../../../../arm-linux-androideabi/lib/armv7-a/:/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/4.7.0/:/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/4.7.0/../../../../arm-linux-androideabi/lib/:/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/sysroot/usr/lib/
COLLECT_GCC_OPTIONS='-c' '-march=armv7-a' '-mcpu=cortex-a9' '-mfloat-abi=hard' '-mfpu=vfp' '-flax-vector-conversions' '-mfpu=neon' '-O2' '-o' 'test.s' '-S' '-v' '-mtls-dialect=gnu'
Comment 1 Richard Biener 2012-01-24 14:57:43 UTC
It looks like the neon builtins are not properly marked as pure/const, that
certainly is a road-block for optimizations.  The heavy use of UNSPECs is
another.

Confirmed.
Comment 2 Eric Batut 2012-01-27 14:13:08 UTC
Adding the usual suspects for ARM-related bugs.
Comment 3 Ramana Radhakrishnan 2012-01-27 15:25:29 UTC
(In reply to comment #1)
> It looks like the neon builtins are not properly marked as pure/const, that
> certainly is a road-block for optimizations.  



> The heavy use of UNSPECs is
> another.

yes, one other problem is that a lot of the neon intrinsics don't expand into an equivalent RTL - you still need the unspecs for the polynomial types but in general a large number of the intrinsics that are in the form of unspecs could use the underlying vec_ expanders that are also present. 




> 
> Confirmed.
Comment 4 Ramana Radhakrishnan 2012-03-30 07:58:49 UTC
Your testcase is broken - it doesn't honour reinterpret_casts properly . This is  a better testcase. 

#include <arm_neon.h>

uint32x4_t sqrlen4D_16u8( const uint8x16_t A, const uint8x16_t B )
{
 const uint8x16_t absAB = vabdq_u8( A, B );
 const uint16x8_t square_l = vmull_u8( vget_low_u8( absAB ), vget_low_u8( absAB ) );
 const uint16x8_t square_h = vmull_u8( vget_high_u8( absAB ), vget_high_u8( absAB ) );
 const uint32x4x2_t rgrgrgrg_babababa = vuzpq_u32( vreinterpretq_u32_u16 (square_l), vreinterpretq_u32_u16 (square_h) );
 const uint16x8_t rgrgrgrg = vreinterpretq_u16_u32 (rgrgrgrg_babababa.val[0]);
 const uint16x8_t babababa = vreinterpretq_u16_u32 (rgrgrgrg_babababa.val[1]);
 const uint32x4_t rpg_rpg_rpg_rpg = vpaddlq_u16( rgrgrgrg );
 const uint32x4_t dp = vpadalq_u16( rpg_rpg_rpg_rpg, babababa );
 return ( dp );
}
Comment 5 Ramana Radhakrishnan 2012-03-30 08:17:21 UTC
Experimenting with : 

Applying the patch of PR48941 and the patch for lower-subreg here

http://gcc.gnu.org/ml/gcc-patches/2012-03/msg01886.html

I now see : We still have too many moves for my liking but the gratuituous
spilling is now gone. 

      .cpu cortex-a9
        .eabi_attribute 27, 3
        .fpu neon
        .eabi_attribute 20, 1
        .eabi_attribute 21, 1
        .eabi_attribute 23, 3
        .eabi_attribute 24, 1
        .eabi_attribute 25, 1
        .eabi_attribute 26, 2
        .eabi_attribute 30, 2
        .eabi_attribute 34, 1
        .eabi_attribute 18, 4
        .file   "t2.c"
        .text
        .align  2
        .global sqrlen4D_16u8
        .type   sqrlen4D_16u8, %function
sqrlen4D_16u8:
        @ args = 16, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        vmov    d16, r0, r1  @ v16qi
        vmov    d17, r2, r3
        vldmia  sp, {d18-d19}
        vabd.u8 q10, q8, q9
        vmull.u8        q11, d20, d20
        vmull.u8        q10, d21, d21
        vmov    q8, q11  @ v4si  -- unnecessary ? 
        vmov    q9, q10  @ v4si  -- unnecessary ? 
        vuzp.32 q8, q9
        vpaddl.u16      q10, q8
        vmov    q11, q10  @ v4si  -- unnecessary
        vpadal.u16      q11, q9
        vmov    r0, r1, d22  @ v4si
        vmov    r2, r3, d23
        bx      lr
        .size   sqrlen4D_16u8, .-sqrlen4D_16u8
        .ident  "GCC: (GNU) 4.8.0 20120330 (experimental)"
        .section        .note.GNU-stack,"",%progbits

This probably makes it a dup of PR48941 but it's starting to look more
promising now. 

Eric, could you try the 2 patches and see what you get - This isn't something to be gratuitously backported as we still have to see the effects elsewhere but it would be worth seeing if this helps on your intrinsics testcases. 

Ramana
Comment 6 Ramana Radhakrishnan 2012-07-05 16:45:32 UTC
Author: ramana
Date: Thu Jul  5 16:45:18 2012
New Revision: 189294

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=189294
Log:
2012-07-05  Ramana Radhakrishnan  <ramana.radhakrishnan@linaro.org>

        PR target/49891
        PR target/51980
        * gcc/testsuite/gcc.target/arm/neon/vtrnf32.c: Update.
        * gcc/testsuite/gcc.target/arm/neon/vtrns32.c: Update.
        * gcc/testsuite/gcc.target/arm/neon/vtrnu32.c: Update.
        * gcc/testsuite/gcc.target/arm/neon/vzipf32.c: Update.
        * gcc/testsuite/gcc.target/arm/neon/vzips32.c: Update.
        * gcc/testsuite/gcc.target/arm/neon/vzipu32.c: Update.


2012-07-05  Ramana Radhakrishnan  <ramana.radhakrishnan@linaro.org>
	    Julian Brown  <julian@codesourcery.com>

        PR target/49891
        PR target/51980
        * config/arm/neon-gen.ml (return_by_ptr): Delete.
        (print_function): Handle empty strings.
        (return): Delete use of return_by_ptr.
        (mask_shape_for_shuffle): New function.
        (mask_elems): Likewise.
        (shuffle_fn): Likewise.
        (params): Simplify and remove use of return_by_ptr.
        (get_shuffle): New function.
        (print_variant): Update.
        * config/arm/neon.ml (rev_elems): New function.
        (permute_range): Likewise.
        (zip_range): Likewise.
        (uzip_range): Likewise.
        (trn_range): Likewise.
        (zip_elems): Likewise.
        (uzip_elems): Likewise.
        (trn_elems): Likewise.
        (features): New enumeration Use_shuffle. Delete ReturnPtr.
        (pf_su_8_16): New.
        (suf_32): New.
        (ops): Update entries for Vrev64, Vrev32, Vrev16, Vtr, Vzip, Vuzp.
        * config/arm/arm_neon.h: Regenerate.




Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/arm/arm_neon.h
    trunk/gcc/config/arm/neon-gen.ml
    trunk/gcc/config/arm/neon.ml
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/arm/neon/vtrnf32.c
    trunk/gcc/testsuite/gcc.target/arm/neon/vtrns32.c
    trunk/gcc/testsuite/gcc.target/arm/neon/vtrnu32.c
    trunk/gcc/testsuite/gcc.target/arm/neon/vzipf32.c
    trunk/gcc/testsuite/gcc.target/arm/neon/vzips32.c
    trunk/gcc/testsuite/gcc.target/arm/neon/vzipu32.c
Comment 7 mgretton 2013-05-28 19:30:39 UTC
Testing the testcase in #4 with a recent trunk (gcc version 4.9.0 20130528 (experimental) (GCC)) gives the following results:

arm-none-eabi-gcc -march=armv7-a -mfpu=neon -mfloat-abi=softfp -O2 -mthumb:
sqrlen4D_16u8:
        vmov    d18, r0, r1  @ v16qi
        vmov    d19, r2, r3
        vld1.64 {d16-d17}, [sp:64]
        vabd.u8 q8, q9, q8
        vmull.u8        q9, d16, d16
        vmull.u8        q8, d17, d17
        vuzp.32 q9, q8
        vpaddl.u16      q9, q9
        vmov    q10, q9  @ v4si
        vpadal.u16      q10, q8
        vmov    r0, r1, d20  @ v4si
        vmov    r2, r3, d21
        bx      lr


arm-none-eabi-gcc -march=armv7-a -mfpu=neon -mfloat-abi=hard -O2 -mthumb:
sqrlen4D_16u8:
        vabd.u8 q1, q0, q1
        vmull.u8        q0, d2, d2
        vmull.u8        q8, d3, d3
        vuzp.32 q0, q8
        vpaddl.u16      q0, q0
        vpadal.u16      q0, q8
        bx      lr

So code generation seems to be OK for hard-float ABI but the soft-float version has some issues with an extra vmov between the vpaddl and vpadal.
Comment 8 ktkachov 2014-01-22 12:18:55 UTC
> arm-none-eabi-gcc -march=armv7-a -mfpu=neon -mfloat-abi=softfp -O2 -mthumb:
> sqrlen4D_16u8:
>         vmov    d18, r0, r1  @ v16qi
>         vmov    d19, r2, r3
>         vld1.64 {d16-d17}, [sp:64]
>         vabd.u8 q8, q9, q8
>         vmull.u8        q9, d16, d16
>         vmull.u8        q8, d17, d17
>         vuzp.32 q9, q8
>         vpaddl.u16      q9, q9
>         vmov    q10, q9  @ v4si
>         vpadal.u16      q10, q8
>         vmov    r0, r1, d20  @ v4si
>         vmov    r2, r3, d21
>         bx      lr

With current trunk I'm getting for the softfp case:

        push    {lr}    @ 40    *push_multi     [length = 2]
        vmov    d16, r0, r1  @ v16qi    @ 37    *neon_movv16qi/6        [length = 8]
        vmov    d17, r2, r3
        add     lr, sp, #4      @ 36    *arm_addsi3/5   [length = 4]
        vldr    d18, [sp, #4]   @ 3     *neon_movv16qi/4        [length = 8]
        vldr    d19, [sp, #12]
        vabd.u8 q9, q8, q9      @ 7     neon_vabdv16qi  [length = 4]
        vmull.u8        q8, d18, d18    @ 14    neon_vmullv8qi  [length = 4]
        vmull.u8        q9, d19, d19    @ 16    neon_vmullv8qi  [length = 4]
        vuzp.32 q8, q9  @ 18    *neon_vuzpv4si_insn     [length = 4]
        vpaddl.u16      q8, q8  @ 22    neon_vpaddlv8hi [length = 4]
        vpadal.u16      q8, q9  @ 28    neon_vpadalv8hi [length = 4]
        vmov    r0, r1, d16  @ v4si     @ 39    *neon_movv4si/5 [length = 8]
        vmov    r2, r3, d17
        ldr     pc, [sp], #4    @ 45    *ldr_with_return        [length = 4]


The move between the vpad*s is gone, but there's a couple of redundant loads and some register spillage.
Comment 9 StaffLeavers 2014-01-22 12:19:45 UTC
greta.yorsh no longer works for ARM.

Your email will be forwarded to their line manager.


Please do not reply to this email.
If you need more information, please email real-postmaster@arm.com

Thank you.
Comment 10 StaffLeavers 2014-01-22 12:20:27 UTC
greta.yorsh no longer works for ARM.

Your email will be forwarded to their line manager.


Please do not reply to this email.
If you need more information, please email real-postmaster@arm.com

Thank you.
Comment 11 StaffLeavers 2014-01-22 12:21:25 UTC
greta.yorsh no longer works for ARM.

Your email will be forwarded to their line manager.


Please do not reply to this email.
If you need more information, please email real-postmaster@arm.com

Thank you.
Comment 12 StaffLeavers 2014-01-22 12:22:06 UTC
greta.yorsh no longer works for ARM.

Your email will be forwarded to their line manager.


Please do not reply to this email.
If you need more information, please email real-postmaster@arm.com

Thank you.
Comment 13 StaffLeavers 2014-01-22 12:22:29 UTC
greta.yorsh no longer works for ARM.

Your email will be forwarded to their line manager.


Please do not reply to this email.
If you need more information, please email real-postmaster@arm.com

Thank you.
Comment 14 christophe.lyon 2014-06-13 15:37:53 UTC
As of current trunk the softfp case looks like this:
sqrlen4D_16u8:
        vmov    d16, r0, r1  @ v16qi
        vmov    d17, r2, r3
        vld1.64 {d18-d19}, [sp:64]
        vabd.u8 q9, q8, q9
        vmull.u8        q8, d18, d18
        vmull.u8        q9, d19, d19
        vuzp.32 q8, q9
        vpaddl.u16      q8, q8
        vpadal.u16      q8, q9
        vmov    r0, r1, d16  @ v4si
        vmov    r2, r3, d17
        bx      lr

which looks quite good.
Comment 15 Ramana Radhakrishnan 2016-02-22 12:54:22 UTC
Presumed fixed in 4.9.0 by that commit.