Missing optimization on ARM NEON
Yufeng Zhang
Yufeng.Zhang@arm.com
Mon Nov 11 16:22:00 GMT 2013
Hi Povilas,
I can confirm that mainline ARM GCC generates code similar to
what you've observed. Could you please raise a Bugzilla report for this
issue? The tracker is at http://gcc.gnu.org/bugzilla/
Thanks,
Yufeng
On 11/11/13 00:57, Povilas Kanapickas wrote:
> Hello,
>
> [ I don't have a way to test the described testcases against a newer
> compiler: could someone verify whether this bug applies to the SVN
> version of GCC? ]
>
> GCC-4.8.1 misses several optimizations when using NEON intrinsics.
> Consider the following snippet:
>
> #include <arm_neon.h>
>
> uint64_t* foo(uint64_t* x, uint32_t y)
> {
>     uint64x2_t d = vreinterpretq_u64_u32(vdupq_n_u32(y));
>     vst1q_u64(x, d);
>     x += 2;
>     vst1q_u64(x, d);
>     x += 2;
>     vst1q_u64(x, d);
>     x += 2;
>     vst1q_u64(x, d);
>     x += 2;
>     vst1q_u64(x, d);
>     x += 2;
>     vst1q_u64(x, d);
>     x += 2;
>     vst1q_u64(x, d);
>     x += 2;
>     vst1q_u64(x, d);
>     x += 2;
>     return x;
> }
>
> 'g++ test.cc -O3 -mfpu=neon --save-temps -c' produces the following
> assembly:
>
> _Z3fooPyj:
>     push    {r4, r5, r6, r7}
>     vdup.32 q8, r1
>     add     r7, r0, #32
>     add     r6, r0, #48
>     add     r5, r0, #64
>     add     r4, r0, #80
>     add     r1, r0, #96
>     add     r2, r0, #112
>     mov     r3, r0
>     adds    r0, r0, #128
>     vst1.64 {d16-d17}, [r3:64]!
>     vst1.64 {d16-d17}, [r3:64]
>     vst1.64 {d16-d17}, [r7:64]
>     vst1.64 {d16-d17}, [r6:64]
>     vst1.64 {d16-d17}, [r5:64]
>     vst1.64 {d16-d17}, [r4:64]
>     vst1.64 {d16-d17}, [r1:64]
>     vst1.64 {d16-d17}, [r2:64]
>     pop     {r4, r5, r6, r7}
>     bx      lr
>
> The code GCC generates here is clearly not optimal. The main problem is
> that the pointer auto-increment (writeback) feature of the vst1.64
> instruction is not fully utilized. GCC apparently uses it for the first
> store, but for the remaining stores it computes each address into a
> separate register. I would expect GCC to produce the following output:
>
> _Z3fooPyj:
>     vdup.32 q8, r1
>     vst1.64 {d16-d17}, [r0:64]!
>     vst1.64 {d16-d17}, [r0:64]!
>     vst1.64 {d16-d17}, [r0:64]!
>     vst1.64 {d16-d17}, [r0:64]!
>     vst1.64 {d16-d17}, [r0:64]!
>     vst1.64 {d16-d17}, [r0:64]!
>     vst1.64 {d16-d17}, [r0:64]!
>     vst1.64 {d16-d17}, [r0:64]!
>     bx      lr
>
> In unrolled loops GCC spills almost all registers to memory, which
> makes the code two to three times slower than the optimal version.
> Unfortunately I couldn't get GCC to generate the optimal sequence by
> any means and had to write it in assembly.
>
> Could someone verify whether the above bug is present in the SVN version?
>
> Thanks,
> Povilas
>
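
For anyone who wants to check other compiler versions, the testcase above
can also be written as a loop, which makes it easy to vary the number of
stores. This is only a sketch: the name fill_blocks and the blocks
parameter are hypothetical and not part of the original report, and the
scalar branch is an assumption added so that the file also builds on
targets without NEON.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

// Hypothetical helper (not from the original post): writes `blocks`
// 16-byte blocks, each holding y replicated into four 32-bit lanes,
// and returns the pointer advanced past the last block.
uint64_t* fill_blocks(uint64_t* x, uint32_t y, std::size_t blocks)
{
    assert(x != nullptr);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    // Same intrinsics as the original testcase.
    const uint64x2_t d = vreinterpretq_u64_u32(vdupq_n_u32(y));
    for (std::size_t i = 0; i < blocks; ++i) {
        vst1q_u64(x, d);  // one 128-bit store per iteration
        x += 2;           // advance by 16 bytes
    }
#else
    // Assumed scalar fallback: two copies of y packed into each
    // 64-bit word give the same memory contents as the NEON path.
    const uint64_t v = (static_cast<uint64_t>(y) << 32) | y;
    for (std::size_t i = 0; i < blocks; ++i) {
        x[0] = v;
        x[1] = v;
        x += 2;
    }
#endif
    return x;
}
```

Building this with the flags from the report ('g++ test.cc -O3 -mfpu=neon
--save-temps -c') and reading the generated .s file shows whether a given
compiler uses the writeback form of vst1.64 for every store.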