Missing optimization on ARM NEON

Mon Nov 11 16:22:00 GMT 2013

Hi Povilas,

I can confirm that mainline ARM GCC generates code similar to what
you've observed. Could you please file a Bugzilla report for this issue at
http://gcc.gnu.org/bugzilla/ ?

Thanks,
Yufeng

On 11/11/13 00:57, Povilas Kanapickas wrote:
> Hello,
>
> [ I don't have a way to test the described testcases against a newer
> compiler: could someone verify whether this bug applies to the SVN
> version of GCC? ]
>
> GCC-4.8.1 misses several optimizations when using NEON intrinsics.
> Consider the following snippet:
>
> #include <arm_neon.h>
>
> uint64_t* foo(uint64_t* x, uint32_t y)
> {
>      uint64x2_t d = vreinterpretq_u64_u32(vdupq_n_u32(y));
>      vst1q_u64(x, d);
>      x+=2;
>      vst1q_u64(x, d);
>      x+=2;
>      vst1q_u64(x, d);
>      x+=2;
>      vst1q_u64(x, d);
>      x+=2;
>      vst1q_u64(x, d);
>      x+=2;
>      vst1q_u64(x, d);
>      x+=2;
>      vst1q_u64(x, d);
>      x+=2;
>      vst1q_u64(x, d);
>      x+=2;
>      return x;
> }
>
> 'g++ test.cc -O3 -mfpu=neon --save-temps -c' produces the following
> assembly:
>
> _Z3fooPyj:
> 	push	{r4, r5, r6, r7}
> 	vdup.32	q8, r1
> 	add	r7, r0, #32
> 	add	r6, r0, #48
> 	add	r5, r0, #64
> 	add	r4, r0, #80
> 	add	r1, r0, #96
> 	add	r2, r0, #112
> 	mov	r3, r0
> 	adds	r0, r0, #128
> 	vst1.64	{d16-d17}, [r3:64]!
> 	vst1.64	{d16-d17}, [r3:64]
> 	vst1.64	{d16-d17}, [r7:64]
> 	vst1.64	{d16-d17}, [r6:64]
> 	vst1.64	{d16-d17}, [r5:64]
> 	vst1.64	{d16-d17}, [r4:64]
> 	vst1.64	{d16-d17}, [r1:64]
> 	vst1.64	{d16-d17}, [r2:64]
> 	pop	{r4, r5, r6, r7}
> 	bx	lr
>
> The GCC approach is clearly not optimal. The main problem is that the
> pointer autoincrement feature of the vst1.64 instruction is not fully
> utilized. GCC uses it for the first store only, then computes every
> remaining address into a separate register, which is also why r4-r7
> have to be saved and restored. I would expect GCC to produce the
> following output:
>
> _Z3fooPyj:
> 	vdup.32	q8, r1
> 	vst1.64	{d16-d17}, [r0:64]!
> 	vst1.64	{d16-d17}, [r0:64]!
> 	vst1.64	{d16-d17}, [r0:64]!
> 	vst1.64	{d16-d17}, [r0:64]!
> 	vst1.64	{d16-d17}, [r0:64]!
> 	vst1.64	{d16-d17}, [r0:64]!
> 	vst1.64	{d16-d17}, [r0:64]!
> 	vst1.64	{d16-d17}, [r0:64]!
> 	bx	lr
>
> In unrolled loops GCC spills almost all registers to memory, which
> makes the code two to three times slower than the optimal version.
> Unfortunately, I couldn't get GCC to generate the optimal code by any
> means and had to write the assembly by hand.
>
> Could someone verify whether the above bug is present in the SVN version?
>
> Thanks,
> Povilas
>
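
For anyone preparing the Bugzilla report, a loop form of the quoted test case
may be a useful second reproducer, since the report mentions that unrolled
loops are affected as well. What follows is only a sketch: the name foo_loop,
the BLOCKS and LANES_PER_BLOCK constants, and the scalar fallback are
assumptions added for illustration and are not part of the original report.
The fallback exists only so the file also builds on targets without NEON. No
claim is made about which addressing mode any particular GCC version picks
for it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Eight 16-byte stores, two 64-bit elements each, as in the original foo(). */
enum { BLOCKS = 8, LANES_PER_BLOCK = 2 };

/* Hypothetical loop form of foo(): broadcast y into every 32-bit lane and
   store the vector to BLOCKS consecutive 16-byte blocks starting at x.
   Returns the pointer just past the last element written. */
uint64_t *foo_loop(uint64_t *x, uint32_t y)
{
    assert(x != NULL);
#if HAVE_NEON
    uint64x2_t d = vreinterpretq_u64_u32(vdupq_n_u32(y));
    for (int i = 0; i < BLOCKS; ++i) {
        vst1q_u64(x, d);          /* candidate for a post-increment store */
        x += LANES_PER_BLOCK;
    }
#else
    /* Scalar fallback (assumption, for non-NEON builds only): each 64-bit
       element holds two copies of y, matching the NEON lane layout. */
    uint64_t d = ((uint64_t)y << 32) | y;
    for (int i = 0; i < BLOCKS * LANES_PER_BLOCK; ++i)
        *x++ = d;
#endif
    return x;
}
```

Compiling this with the same flags as in the report and inspecting the
--save-temps output should show whether the stores use [r0:64]! addressing or
separately computed addresses.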