Bug 51509 - Inefficient neon intrinsic code sequence
Summary: Inefficient neon intrinsic code sequence
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.7.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: 47562
  Show dependency treegraph
 
Reported: 2011-12-12 07:25 UTC by Carrot
Modified: 2021-03-28 07:13 UTC (History)
6 users (show)

See Also:
Host:
Target: arm-linux-androideabi, arm-linux-gnueabi
Build:
Known to work:
Known to fail:
Last reconfirmed: 2011-12-12 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Carrot 2011-12-12 07:25:34 UTC
Compile the following code with options -march=armv7-a -mfloat-abi=softfp -mfpu=neon -mthumb -O2 -Wall -fpic

#include <arm_neon.h>
void simple_vld_intrin(uint8_t *src, uint8_t *dst)
{
  uint8x8x4_t x;
  uint8x8x2_t y;

  x = vld4_lane_u8(src, x, 0);

  y.val[0][0] = x.val[1][0];
 y.val[1][0] = x.val[2][0];

 vst2_lane_u8(dst, y, 0);
}

gcc 4.7 generates:


.LC0:
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.word	0
	.text
	.align	2
	.global	simple_vld_intrin
	.thumb
	.thumb_func
	.type	simple_vld_intrin, %function
simple_vld_intrin:
	@ args = 0, pretend = 0, frame = 32
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	ldr	r2, .L2
	sub	sp, sp, #32
.LPIC0:
	add	r2, pc
	vldmia	r2, {d18-d21}
	vmov.i32	d19, #0  @ v8qi
	vmov	d20, d19  @ v8qi
	vmov	q11, q9  @ ti
	vmov	q12, q10  @ ti
	vmov	d16, d19  @ v8qi
	vmov	d17, d19  @ v8qi
	vld4.8	{d22[0], d23[0], d24[0], d25[0]}, [r0]
	vstmia	sp, {d22-d25}
	ldrb	r2, [sp, #8]	@ zero_extendqisi2
	vmov.8	d16[0], r2
	vmov.u8	r3, d24[0]
	vmov.8	d17[0], r3
	vst2.8	{d16[0], d17[0]}, [r1]
	add	sp, sp, #32
	bx	lr
.L3:
	.align	2
.L2:
	.word	.LC0-(.LPIC0+4)


An ideal result should be:

	vld4.8	{d16[0], d17[0], d18[0], d19[0]}, [r0]
	vmov	d20, d17  @ v8qi
	vmov	d21, d18  @ v8qi
	vst2.8	{d20[0], d21[0]}, [r1]
	bx	lr
Comment 1 rsandifo@gcc.gnu.org 2011-12-13 09:07:38 UTC
At least part of the problem here is the uninitialised
variable in the vld4 call.  GCC tries to create a zero
initialisation of "x" before the vld4, so that the other
lanes have defined values.  Obviously we could be doing
that much better than we are, and perhaps we should have
some kind of special case so that uninitialised NEON vectors
are never zero-initialised (e.g. use a plain clobber instead).
But uninitialised variables aren't really ideal either way.
Something like:

  x = vld4_dup_u8(src);

  y.val[0][0] = x.val[1][0];
  y.val[1][0] = x.val[2][0];

  vst2_lane_u8(dst, y, 0);

would be better in principle.  Unfortunately, we don't
generate good code for that either.  Part of the problem
is introduced by lower-subreg, but it's not good even
with -fno-split-wide-types.
Comment 2 rsandifo@gcc.gnu.org 2011-12-13 09:20:54 UTC
FWIW,

  uint8x8x4_t x;
  uint8x8x2_t y;

  x = vld4_dup_u8(src);

  y.val[0] = x.val[1];
  y.val[1] = x.val[2];

  vst2_lane_u8(dst, y, 0);

does give the expected output.  I.e. the remaining inefficiency
from comment #1 is in the uninitialised parts of y.

Richard
Comment 3 Ramana Radhakrishnan 2012-06-15 00:51:26 UTC
With -fno-split-wide-types I can end up getting identical output to what is expected in this case with FSF trunk. I suspect this might be another of those costs with lower-subreg issues. 


Ramana
Comment 4 Maxim Kuvyrkov 2015-04-13 16:30:43 UTC
Kugan,

Would you please check if your patch for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65375 also affects this one?
Comment 5 Maxim Kuvyrkov 2015-04-13 16:33:10 UTC
Oh, sorry, I missed the fact that PR65375 is for aarch64 and this one is for armv7.  Charles, would you please look at this?
Comment 6 Allan Jensen 2015-11-26 15:24:34 UTC
I have run into a similar problem with vld3 and vst4.

uint8x16x3_t tmp = vld3q_u8(src);
vst4q_u8((uint8_t *)dst, {tmp.val[2], tmp.val[1], tmp.val[0], fullVector});

produces:
  70:   4cdf4061        ld3     {v1.16b-v3.16b}, [x3], #48
  74:   4e083c04        mov     x4, v0.d[0]
  78:   4e183c05        mov     x5, v0.d[1]
  7c:   6f000400        mvni    v0.4s, #0x0
  80:   4e083c4a        mov     x10, v2.d[0]
  84:   4e183c4b        mov     x11, v2.d[1]
  88:   aa0403e2        mov     x2, x4
  8c:   aa0503e1        mov     x1, x5
  90:   4e083c24        mov     x4, v1.d[0]
  94:   4e183c25        mov     x5, v1.d[1]
  98:   a90007e2        stp     x2, x1, [sp]
  9c:   3d800fe0        str     q0, [sp,#48]
  a0:   a9012fea        stp     x10, x11, [sp,#16]
  a4:   aa0403e6        mov     x6, x4
  a8:   a90217e6        stp     x6, x5, [sp,#32]
  ac:   4c4023e0        ld1     {v0.16b-v3.16b}, [sp]
  b0:   4c9f0000        st4     {v0.16b-v3.16b}, [x0], #64


But if I add -fno-split-wide-types it compiles to:
  68:   4cdf4064        ld3     {v4.16b-v6.16b}, [x3], #48
  6c:   4f000400        movi    v0.4s, #0x0
  70:   6f000403        mvni    v3.4s, #0x0
  74:   4ea51ca1        mov     v1.16b, v5.16b
  78:   4ea41c82        mov     v2.16b, v4.16b
  7c:   4c9f0000        st4     {v0.16b-v3.16b}, [x0], #64

This happens with both 4.9 and 5.1 that I have tried.
Comment 7 Eric Gallager 2018-12-12 04:34:40 UTC
(In reply to Maxim Kuvyrkov from comment #5)
> Oh, sorry, I missed the fact that PR65375 is for aarch64 and this one is for
> armv7.  Charles, would you please look at this?

Should Charles still remain the assignee for this?
Comment 8 Christophe Lyon 2018-12-12 09:47:58 UTC
(In reply to Eric Gallager from comment #7)
> (In reply to Maxim Kuvyrkov from comment #5)
> > Oh, sorry, I missed the fact that PR65375 is for aarch64 and this one is for
> > armv7.  Charles, would you please look at this?
> 
> Should Charles still remain the assignee for this?

I'm afraid not: Charles no longer works with us.