Created attachment 26442 [details] Minimal repro case (C file) When using UZP/ZIP/TRN Neon intrinsics, gcc-trunk generates a whole lot of stack operations (and associated stack alignment operations) even if everything can purely be done using Neon registers. Compiler used is GCC trunk, rev 183468, compiled with Android's build-gcc.sh (arm-linux-androideabi). Command line is: arm-linux-androideabi-g++ -c -march=armv7-a -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=vfp -flax-vector-conversions -mfpu=neon -O2 -o test.s test.c -S Generated assembly code for attached C file is: _Z13sqrlen4D_16u817__simd128_uint8_tS_: vabd.u8 q1, q0, q1 stmfd sp!, {r4, fp} <= Unnecessary add fp, sp, #4 <= Unnecessary sub sp, sp, #48 <= Unnecessary add r3, sp, #15 <= Unnecessary vmull.u8 q0, d2, d2 bic r3, r3, #15 <= Unnecessary vmull.u8 q8, d3, d3 vuzp.32 q0, q8 vstmia r3, {d0-d1} <= Unnecessary, caused by vuzp.32 vstr d16, [r3, #16] <= Unnecessary, caused by vuzp.32 vstr d17, [r3, #24] <= Unnecessary, caused by vuzp.32 vpaddl.u16 q0, q0 vpadal.u16 q0, q8 sub sp, fp, #4 <= Unnecessary ldmfd sp!, {r4, fp} <= Unnecessary bx lr As no stack operation is needed in this function, ideally the following should be generated instead: _Z13sqrlen4D_16u817__simd128_uint8_tS_: vabd.u8 q1, q0, q1 vmull.u8 q0, d2, d2 vmull.u8 q8, d3, d3 vuzp.32 q0, q8 vpaddl.u16 q0, q0 vpadal.u16 q0, q8 bx lr This makes even tight Neon functions written with intrinsics much larger and slower than necessary, and makes it very hard to write performance-oriented code with intrinsics in arm-gcc. gcc -v yields: Using built-in specs. COLLECT_GCC=/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/bin/arm-linux-androideabi-g++ COLLECT_LTO_WRAPPER=/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/libexec/gcc/arm-linux-androideabi/4.7.0/lto-wrapper Target: arm-linux-androideabi Configured with: /home/eb/android-ndk-r6/src/build/../gcc/gcc-4.7.0/configure --prefix=/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86 --target=arm-linux-androideabi --host=i386-linux-gnu --build=i386-linux-gnu --with-gnu-as --with-gnu-ld --enable-languages=c,c++ --with-gmp=/tmp/ndk-eb/build/toolchain/temp-install --with-mpfr=/tmp/ndk-eb/build/toolchain/temp-install --with-mpc=/tmp/ndk-eb/build/toolchain/temp-install --disable-libssp --enable-threads --disable-nls --disable-libmudflap --disable-libgomp --disable-libstdc__-v3 --disable-sjlj-exceptions --disable-shared --disable-tls --with-float=soft --with-fpu=vfp --with-arch=armv5te --enable-target-optspace --enable-initfini-array --disable-nls --prefix=/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86 --with-sysroot=/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/sysroot --with-binutils-version=2.21.53 --with-mpfr-version=3.0.1 --with-gmp-version=5.0.2 --with-gcc-version=4.7.0 --with-gdb-version=6.6 --with-mpc-version=0.9 --with-arch=armv5te --enable-libstdc__-v3 --program-transform-name='s,^,arm-linux-androideabi-,' Thread model: posix gcc version 4.7.0 20120124 (experimental) (GCC) COLLECT_GCC_OPTIONS='-c' '-march=armv7-a' '-mcpu=cortex-a9' '-mfloat-abi=hard' '-mfpu=vfp' '-flax-vector-conversions' '-mfpu=neon' '-O2' '-o' 'test.s' '-S' '-v' '-mtls-dialect=gnu' /home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/libexec/gcc/arm-linux-androideabi/4.7.0/cc1plus -quiet -v -imultilib armv7-a -D_GNU_SOURCE test.c -mbionic -fPIC -quiet -dumpbase test.c -march=armv7-a -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=vfp -mfpu=neon -mtls-dialect=gnu -auxbase-strip test.s -O2 -version -flax-vector-conversions -o test.s -fno-exceptions -fno-rtti GNU C++ (GCC) version 4.7.0 20120124 (experimental) (arm-linux-androideabi) compiled by GNU C version 4.6.0 20110603 (Red Hat 4.6.0-10), GMP version 5.0.2, MPFR version 3.0.1, MPC version 0.9 GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096 ignoring nonexistent directory "/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/4.7.0/../../../../arm-linux-androideabi/include/c++/4.7.0" ignoring nonexistent directory "/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/4.7.0/../../../../arm-linux-androideabi/include/c++/4.7.0/arm-linux-androideabi/armv7-a" ignoring nonexistent directory "/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/4.7.0/../../../../arm-linux-androideabi/include/c++/4.7.0/backward" ignoring nonexistent directory "/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/sysroot/usr/local/include" #include "..." search starts here: #include <...> search starts here: /home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/4.7.0/include /home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/4.7.0/include-fixed /home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/4.7.0/../../../../arm-linux-androideabi/include /home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/sysroot/usr/include End of search list. GNU C++ (GCC) version 4.7.0 20120124 (experimental) (arm-linux-androideabi) compiled by GNU C version 4.6.0 20110603 (Red Hat 4.6.0-10), GMP version 5.0.2, MPFR version 3.0.1, MPC version 0.9 GGC heuristics: --param ggc-min-expand=30 --param ggc-min-heapsize=4096 Compiler executable checksum: d84173bb26a7319ac9d4c1278a6a7e04 COMPILER_PATH=/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/libexec/gcc/arm-linux-androideabi/4.7.0/:/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/libexec/gcc/arm-linux-androideabi/4.7.0/:/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/libexec/gcc/arm-linux-androideabi/:/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/4.7.0/:/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/:/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/4.7.0/../../../../arm-linux-androideabi/bin/ LIBRARY_PATH=/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/4.7.0/armv7-a/:/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/4.7.0/../../../../arm-linux-androideabi/lib/armv7-a/:/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/4.7.0/:/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/lib/gcc/arm-linux-androideabi/4.7.0/../../../../arm-linux-androideabi/lib/:/home/eb/android-ndk-r6/toolchains/arm-linux-androideabi-4.7.0/prebuilt/linux-x86/sysroot/usr/lib/ COLLECT_GCC_OPTIONS='-c' '-march=armv7-a' '-mcpu=cortex-a9' '-mfloat-abi=hard' '-mfpu=vfp' '-flax-vector-conversions' '-mfpu=neon' '-O2' '-o' 'test.s' '-S' '-v' '-mtls-dialect=gnu'
It looks like the neon builtins are not properly marked as pure/const, that certainly is a road-block for optimizations. The heavy use of UNSPECs is another. Confirmed.
Adding the usual suspects for ARM-related bugs.
(In reply to comment #1) > It looks like the neon builtins are not properly marked as pure/const, that > certainly is a road-block for optimizations. > The heavy use of UNSPECs is > another. yes, one other problem is that a lot of the neon intrinsics don't expand into an equivalent RTL - you still need the unspecs for the polynomial types but in general a large number of the intrinsics that are in the form of unspecs could use the underlying vec_ expanders that are also present. > > Confirmed.
Your testcase is broken - it doesn't honour reinterpret_casts properly . This is a better testcase. #include <arm_neon.h> uint32x4_t sqrlen4D_16u8( const uint8x16_t A, const uint8x16_t B ) { const uint8x16_t absAB = vabdq_u8( A, B ); const uint16x8_t square_l = vmull_u8( vget_low_u8( absAB ), vget_low_u8( absAB ) ); const uint16x8_t square_h = vmull_u8( vget_high_u8( absAB ), vget_high_u8( absAB ) ); const uint32x4x2_t rgrgrgrg_babababa = vuzpq_u32( vreinterpretq_u32_u16 (square_l), vreinterpretq_u32_u16 (square_h) ); const uint16x8_t rgrgrgrg = vreinterpretq_u16_u32 (rgrgrgrg_babababa.val[0]); const uint16x8_t babababa = vreinterpretq_u16_u32 (rgrgrgrg_babababa.val[1]); const uint32x4_t rpg_rpg_rpg_rpg = vpaddlq_u16( rgrgrgrg ); const uint32x4_t dp = vpadalq_u16( rpg_rpg_rpg_rpg, babababa ); return ( dp ); }
Experimenting with : Applying the patch of PR48941 and the patch for lower-subreg here http://gcc.gnu.org/ml/gcc-patches/2012-03/msg01886.html I now see : We still have too many moves for my liking but the gratuituous spilling is now gone. .cpu cortex-a9 .eabi_attribute 27, 3 .fpu neon .eabi_attribute 20, 1 .eabi_attribute 21, 1 .eabi_attribute 23, 3 .eabi_attribute 24, 1 .eabi_attribute 25, 1 .eabi_attribute 26, 2 .eabi_attribute 30, 2 .eabi_attribute 34, 1 .eabi_attribute 18, 4 .file "t2.c" .text .align 2 .global sqrlen4D_16u8 .type sqrlen4D_16u8, %function sqrlen4D_16u8: @ args = 16, pretend = 0, frame = 0 @ frame_needed = 0, uses_anonymous_args = 0 @ link register save eliminated. vmov d16, r0, r1 @ v16qi vmov d17, r2, r3 vldmia sp, {d18-d19} vabd.u8 q10, q8, q9 vmull.u8 q11, d20, d20 vmull.u8 q10, d21, d21 vmov q8, q11 @ v4si -- unnecessary ? vmov q9, q10 @ v4si -- unnecessary ? vuzp.32 q8, q9 vpaddl.u16 q10, q8 vmov q11, q10 @ v4si -- unnecessary vpadal.u16 q11, q9 vmov r0, r1, d22 @ v4si vmov r2, r3, d23 bx lr .size sqrlen4D_16u8, .-sqrlen4D_16u8 .ident "GCC: (GNU) 4.8.0 20120330 (experimental)" .section .note.GNU-stack,"",%progbits This probably makes it a dup of PR48941 but it's starting to look more promising now. Eric, could you try the 2 patches and see what you get - This isn't something to be gratuitously backported as we still have to see the effects elsewhere but it would be worth seeing if this helps on your intrinsics testcases. Ramana
Author: ramana Date: Thu Jul 5 16:45:18 2012 New Revision: 189294 URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=189294 Log: 2012-07-05 Ramana Radhakrishnan <ramana.radhakrishnan@linaro.org> PR target/49891 PR target/51980 * gcc/testsuite/gcc.target/arm/neon/vtrnf32.c: Update. * gcc/testsuite/gcc.target/arm/neon/vtrns32.c: Update. * gcc/testsuite/gcc.target/arm/neon/vtrnu32.c: Update. * gcc/testsuite/gcc.target/arm/neon/vzipf32.c: Update. * gcc/testsuite/gcc.target/arm/neon/vzips32.c: Update. * gcc/testsuite/gcc.target/arm/neon/vzipu32.c: Update. 2012-07-05 Ramana Radhakrishnan <ramana.radhakrishnan@linaro.org> Julian Brown <julian@codesourcery.com> PR target/49891 PR target/51980 * config/arm/neon-gen.ml (return_by_ptr): Delete. (print_function): Handle empty strings. (return): Delete use of return_by_ptr. (mask_shape_for_shuffle): New function. (mask_elems): Likewise. (shuffle_fn): Likewise. (params): Simplify and remove use of return_by_ptr. (get_shuffle): New function. (print_variant): Update. * config/arm/neon.ml (rev_elems): New function. (permute_range): Likewise. (zip_range): Likewise. (uzip_range): Likewise. (trn_range): Likewise. (zip_elems): Likewise. (uzip_elems): Likewise. (trn_elems): Likewise. (features): New enumeration Use_shuffle. Delete ReturnPtr. (pf_su_8_16): New. (suf_32): New. (ops): Update entries for Vrev64, Vrev32, Vrev16, Vtr, Vzip, Vuzp. * config/arm/arm_neon.h: Regenerate. Modified: trunk/gcc/ChangeLog trunk/gcc/config/arm/arm_neon.h trunk/gcc/config/arm/neon-gen.ml trunk/gcc/config/arm/neon.ml trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.target/arm/neon/vtrnf32.c trunk/gcc/testsuite/gcc.target/arm/neon/vtrns32.c trunk/gcc/testsuite/gcc.target/arm/neon/vtrnu32.c trunk/gcc/testsuite/gcc.target/arm/neon/vzipf32.c trunk/gcc/testsuite/gcc.target/arm/neon/vzips32.c trunk/gcc/testsuite/gcc.target/arm/neon/vzipu32.c
Testing the testcase in #4 with a recent trunk (gcc version 4.9.0 20130528 (experimental) (GCC)) gives the following results: arm-none-eabi-gcc -march=armv7-a -mfpu=neon -mfloat-abi=softfp -O2 -mthumb: sqrlen4D_16u8: vmov d18, r0, r1 @ v16qi vmov d19, r2, r3 vld1.64 {d16-d17}, [sp:64] vabd.u8 q8, q9, q8 vmull.u8 q9, d16, d16 vmull.u8 q8, d17, d17 vuzp.32 q9, q8 vpaddl.u16 q9, q9 vmov q10, q9 @ v4si vpadal.u16 q10, q8 vmov r0, r1, d20 @ v4si vmov r2, r3, d21 bx lr arm-none-eabi-gcc -march=armv7-a -mfpu=neon -mfloat-abi=hard -O2 -mthumb: sqrlen4D_16u8: vabd.u8 q1, q0, q1 vmull.u8 q0, d2, d2 vmull.u8 q8, d3, d3 vuzp.32 q0, q8 vpaddl.u16 q0, q0 vpadal.u16 q0, q8 bx lr So code generation seems to be OK for hard-float ABI but the soft-float version has some issues with an extra vmov between the vpaddl and vpadal.
> arm-none-eabi-gcc -march=armv7-a -mfpu=neon -mfloat-abi=softfp -O2 -mthumb: > sqrlen4D_16u8: > vmov d18, r0, r1 @ v16qi > vmov d19, r2, r3 > vld1.64 {d16-d17}, [sp:64] > vabd.u8 q8, q9, q8 > vmull.u8 q9, d16, d16 > vmull.u8 q8, d17, d17 > vuzp.32 q9, q8 > vpaddl.u16 q9, q9 > vmov q10, q9 @ v4si > vpadal.u16 q10, q8 > vmov r0, r1, d20 @ v4si > vmov r2, r3, d21 > bx lr With current trunk I'm getting for the softfp case: push {lr} @ 40 *push_multi [length = 2] vmov d16, r0, r1 @ v16qi @ 37 *neon_movv16qi/6 [length = 8] vmov d17, r2, r3 add lr, sp, #4 @ 36 *arm_addsi3/5 [length = 4] vldr d18, [sp, #4] @ 3 *neon_movv16qi/4 [length = 8] vldr d19, [sp, #12] vabd.u8 q9, q8, q9 @ 7 neon_vabdv16qi [length = 4] vmull.u8 q8, d18, d18 @ 14 neon_vmullv8qi [length = 4] vmull.u8 q9, d19, d19 @ 16 neon_vmullv8qi [length = 4] vuzp.32 q8, q9 @ 18 *neon_vuzpv4si_insn [length = 4] vpaddl.u16 q8, q8 @ 22 neon_vpaddlv8hi [length = 4] vpadal.u16 q8, q9 @ 28 neon_vpadalv8hi [length = 4] vmov r0, r1, d16 @ v4si @ 39 *neon_movv4si/5 [length = 8] vmov r2, r3, d17 ldr pc, [sp], #4 @ 45 *ldr_with_return [length = 4] The move between the vpad*s is gone, but there's a couple of redundant loads and some register spillage.
greta.yorsh no longer works for ARM. Your email will be forwarded to their line manager. Please do not reply to this email. If you need more information, please email real-postmaster@arm.com Thank you.
As of current trunk the softfp case looks like this: sqrlen4D_16u8: vmov d16, r0, r1 @ v16qi vmov d17, r2, r3 vld1.64 {d18-d19}, [sp:64] vabd.u8 q9, q8, q9 vmull.u8 q8, d18, d18 vmull.u8 q9, d19, d19 vuzp.32 q8, q9 vpaddl.u16 q8, q8 vpadal.u16 q8, q9 vmov r0, r1, d16 @ v4si vmov r2, r3, d17 bx lr which looks quite good.
Presumed fixed in 4.9.0 by that commit.