65375 – aarch64: poor codegen for vld2q_f32 and vst2q_f32

Bug 65375 - aarch64: poor codegen for vld2q_f32 and vst2q_f32

Summary: aarch64: poor codegen for vld2q_f32 and vst2q_f32

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	5.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	kugan

URL:
Keywords:

Depends on:
Blocks:

Reported:	2015-03-10 08:15 UTC by kugan
Modified:	2019-09-12 04:23 UTC (History)
CC List:	3 users (show)

See Also:	51509 43725 63277
Host:
Target:	aarch64
Build:
Known to work:
Known to fail:
Last reconfirmed:	2015-04-14 00:00:00

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description kugan 2015-03-10 08:15:30 UTC

#include <arm_neon.h>
void hello_vst2(float* fout, float *fin)
{
float32x4x2_t a;
a = vld2q_f32 (fin);
vst2q_f32 (fout, a);
}


with aarch64-none-linux-gnu-gcc  -O2 -ffast-math -unsafe-math-optimisations produces:

	.cpu generic+fp+simd
	.file	"neon.c"
	.text
	.align	2
	.p2align 3,,7
	.global	hello_vst2
	.type	hello_vst2, %function
hello_vst2:
	ld2	{v0.4s - v1.4s}, [x1]
	sub	sp, sp, #32
	umov	x1, v0.d[0]
	umov	x2, v0.d[1]
	str	q1, [sp, 16]
	mov	x5, x1
	stp	x5, x2, [sp]
	ld1	{v0.16b - v1.16b}, [sp]
	st2	{v0.4s - v1.4s}, [x0]
	add	sp, sp, 32
	ret
	.size	hello_vst2, .-hello_vst2
	.ident	"GCC: (GNU) 5.0.0 20150305 (experimental)"
	.section	.note.GNU-stack,"",%progbits

Comment 1 kugan 2015-03-10 08:16:44 UTC

arm-none-linux-gnueabi-gcc  -O2 -ffast-math -unsafe-math-optimisations   -mfpu=neon produces just:

hello_vst2:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	vld2.32	{d16-d19}, [r1]
	vst2.32	{d16-d19}, [r0]
	bx

Comment 2 kugan 2015-03-10 08:17:56 UTC

aarch64-none-linux-gnu-gcc -O2 -ffast-math -unsafe-math-optimisations -fno-split-wide-types produces :

	ld2	{v2.4s - v3.4s}, [x1]
	orr	v0.16b, v2.16b, v2.16b
	orr	v1.16b, v3.16b, v3.16b
	st2	{v0.4s - v1.4s}, [x0]
	ret

Comment 3 kugan 2015-03-10 08:19:15 UTC

aarch64-none-linux-gnu-gcc  -v
Using built-in specs.
COLLECT_GCC=/home/kugan/work/builds/gcc-fsf-gcc/tools/bin/aarch64-none-linux-gnu-gcc
COLLECT_LTO_WRAPPER=/home/kugan/work/builds/gcc-fsf-gcc/tools/libexec/gcc/aarch64-none-linux-gnu/5.0.0/lto-wrapper
Target: aarch64-none-linux-gnu
Configured with: /home/kugan/work/sources/gcc-fsf/gcc/configure --target=aarch64-none-linux-gnu --prefix=/home/kugan/work/builds/gcc-fsf-gcc/tools --with-sysroot=/home/kugan/work/builds/gcc-fsf-gcc/sysroot-aarch64-none-linux-gnu --disable-libssp --disable-libgomp --disable-libmudflap --enable-languages=c,c++,fortran
Thread model: posix
gcc version 5.0.0 20150305 (experimental) (GCC)

Comment 4 Andrew Pinski 2015-03-10 08:32:43 UTC

;; _6 = __builtin_aarch64_get_qregoiv4sf (__o_5, 0);

(insn 8 7 0 (set (reg:V4SF 74 [ D.16774 ])
        (subreg:V4SF (reg/v:OI 73 [ __o ]) 0)) /data1/src/gcc-cavium/toolchain-thunder/thunderx-tools/lib/gcc/aarch64-thunderx-linux-gnu/5.0.0/include/arm_neon.h:15586 -1
     (nil))

;; _7 = __builtin_aarch64_get_qregoiv4sf (__o_5, 1);

(insn 9 8 0 (set (reg:V4SF 75 [ D.16774 ])
        (subreg:V4SF (reg/v:OI 73 [ __o ]) 16)) /data1/src/gcc-cavium/toolchain-thunder/thunderx-tools/lib/gcc/aarch64-thunderx-linux-gnu/5.0.0/include/arm_neon.h:15587 -1
     (nil))


Actually maybe we should use POI here, the partial integer mode will cause splitting subreg not do anything.

Comment 5 Maxim Kuvyrkov 2015-04-13 16:36:04 UTC

Kugan and Jim Wilson have posted a patch for this on March 26th.

Comment 6 James Greenhalgh 2015-04-14 08:05:14 UTC

So, fixed then?

Comment 7 Maxim Kuvyrkov 2015-04-14 08:06:39 UTC

The patch is not approved yet.

Comment 8 kugan 2015-04-14 09:11:12 UTC

Patch is at https://gcc.gnu.org/ml/gcc-patches/2015-03/msg00857.html and not approved yet.

Comment 9 Ramana Radhakrishnan 2015-06-23 13:19:02 UTC

Fixed by this then ? 

https://gcc.gnu.org/ml/gcc-patches/2015-05/msg01776.html

Comment 10 Jim Wilson 2015-06-23 15:40:15 UTC

Improved, but not completely resolved.  We still get unnecessary orr instructions, same as in comment 2.  This is partly an issue with the register allocator not handling partially overlapping register reads/writes very well.  We already have a few other bugs for that.  This is also partly an issue with how the aarch64 builtins work, via __builtin_aarch64_[gs]et_qregoiv4sf which create the partially overlapping register reads/writes.  The ARM builtins don't work this way, they use a union for type punning, and hence don't have the same problem.

Comment 11 Ramana Radhakrishnan 2015-06-24 09:06:45 UTC

(In reply to Jim Wilson from comment #10)
> Improved, but not completely resolved.  We still get unnecessary orr
> instructions, same as in comment 2.  This is partly an issue with the
> register allocator not handling partially overlapping register reads/writes
> very well.  We already have a few other bugs for that.  This is also partly
> an issue with how the aarch64 builtins work, via
> __builtin_aarch64_[gs]et_qregoiv4sf which create the partially overlapping
> register reads/writes.  The ARM builtins don't work this way, they use a
> union for type punning, and hence don't have the same problem.

Both the ARM and the AArch64 ports have the issues with partially overlapping register reads / writes especially with the vzip / vuzip style intrinsics in AArch32 world or even the larger vld3/4 intrinsics in both ARM and AArch64 states. It would be nice to fix that finally.

If that is the only issue left in the ticket - maybe we should just park this example in that ticket - IIRC PR43725 and close this one out ?

regards
Ramana

Comment 12 kugan 2015-06-24 09:13:38 UTC

Fixed in trunk except for the additional orr instruction (overlapping register reads / write). As Ramana mentioned, that is a known problem and tracked in PR43725.

Comment 13 Ramana Radhakrishnan 2015-06-25 20:49:44 UTC

Or indeed PR 63277...