92665 – [AArch64] low lanes select not optimized out for vmlal intrinsics

Bug 92665 - [AArch64] low lanes select not optimized out for vmlal intrinsics

Summary: [AArch64] low lanes select not optimized out for vmlal intrinsics

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	10.0

Importance:	P3 enhancement
Target Milestone:	---
Assignee:	Andrew Pinski

URL:
Keywords:	missed-optimization

Depends on:
Blocks:	95265
	Show dependency tree / graph

Reported:	2019-11-25 18:59 UTC by Sebastian Pop
Modified:	2021-01-29 08:43 UTC (History)
CC List:	1 user (show)

See Also:
Host:
Target:	aarch64-linux-gnu
Build:
Known to work:	11.0
Known to fail:
Last reconfirmed:	2019-11-25 00:00:00

Attachments
Patch which I wrote for GCC 7.3 (1019 bytes, patch) 2019-11-25 19:35 UTC, Andrew Pinski	Details \| Diff
latest patch (1.97 KB, patch) 2020-01-25 13:09 UTC, Andrew Pinski	Details \| Diff
Show Obsolete (1) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Sebastian Pop 2019-11-25 18:59:10 UTC

With gcc as of today I see dup instructions that could be optimized out:

$ cat red.c
#include "arm_neon.h"

int32x4_t fun(int32x4_t a, int16x8_t b, int16x8_t c) {
  a = vmlal_s16(a, vget_low_s16(b), vget_low_s16(c));
  a = vmlal_high_s16(a, b, c);
  return a;
}

$ gcc -O3 -S -o- red.c
fun:
	dup	d3, v1.d[0]
	dup	d4, v2.d[0]
	smlal v0.4s,v3.4h,v4.4h
	smlal2 v0.4s,v1.8h,v2.8h
	ret

$ clang -O3 -S -o- red.c
fun:
	smlal	v0.4s, v1.4h, v2.4h
	smlal2	v0.4s, v1.8h, v2.8h
	ret

Comment 1 Andrew Pinski 2019-11-25 19:07:15 UTC

Confirmed.
I have a patch (or two) that will optimize this.
I was going to be submitting them in the next few days.

Comment 2 Andrew Pinski 2019-11-25 19:35:42 UTC

Created attachment 47356 [details]
Patch which I wrote for GCC 7.3

I have to double check if it applies directly as I had other patches in this area but this is the patch which I had wrote for GCC 7.3.

Comment 3 Wilco 2019-11-25 19:47:47 UTC

(In reply to Andrew Pinski from comment #2)
> Created attachment 47356 [details]
> Patch which I wrote for GCC 7.3
> 
> I have to double check if it applies directly as I had other patches in this
> area but this is the patch which I had wrote for GCC 7.3.

I think it's because many intrinsics in arm_neon.h still use asm which inhibits most optimizations.

Comment 4 Andrew Pinski 2019-11-25 20:00:59 UTC

(In reply to Wilco from comment #3)
> I think it's because many intrinsics in arm_neon.h still use asm which
> inhibits most optimizations.

NO in this case it is not.

Take:
#include "arm_neon.h"

float64x1_t fun(float64x2_t a, float64x2_t b) {
  return vget_low_f64(b);
}
double fun1(float64x2_t a, float64x2_t b) {
  return b[0];
}

---- CUT ----
Both of these should be optimized to just
fmov d0, d1
ret

Even worse take:
#include "arm_neon.h"

float64x1_t fun(float64x2_t a, float64x2_t b) {
  return vget_low_f64(b) + vget_high_f64(b);
}
double fun1(float64x2_t a, float64x2_t b) {
  return b[0] + b[1];
}

---- CUT ---

Comment 5 Andrew Pinski 2020-01-25 05:07:55 UTC

Note some of the moves were removed with g:8c8952918b75f4fa6adbbe44cd641d5fd0bb55e3

But it is not a general solution, it just "splits" the case where dst and source have the same register.  my patch (which I have a few improvements to it and fixing it for GCC 10) handles the case where the dst and the source registers are different which is what is happening in this case.

Comment 6 Andrew Pinski 2020-01-25 13:09:42 UTC

Created attachment 47706 [details]
latest patch

I will submit this tomorrow.  I just need to "xfail" two of the testcases for big-endian.  big-endian does something weird sometimes.

Comment 7 Sebastian Pop 2020-03-31 18:15:33 UTC

Hi Andrew, have you committed the fix for this?

Comment 8 ktkachov 2021-01-29 08:43:44 UTC

The issue in this bug report is that the "get low lane" operation should just be a move rather than a vec_select so that it can be optimised away.
After g:e140f5fd3e235c5a37dc99b79f37a5ad4dc59064 GCC 11 does the right thing for all testcases in this PR

So marking this as fixed.