Bug 92665 - [AArch64] low lanes select not optimized out for vmlal intrinsics
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 10.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Andrew Pinski
URL:
Keywords: missed-optimization
Depends on:
Blocks: 95265
Reported: 2019-11-25 18:59 UTC by Sebastian Pop
Modified: 2021-01-29 08:43 UTC
CC List: 1 user

See Also:
Host:
Target: aarch64-linux-gnu
Build:
Known to work: 11.0
Known to fail:
Last reconfirmed: 2019-11-25 00:00:00


Attachments
Patch which I wrote for GCC 7.3 (1019 bytes, patch), 2019-11-25 19:35 UTC, Andrew Pinski
latest patch (1.97 KB, patch), 2020-01-25 13:09 UTC, Andrew Pinski

Description Sebastian Pop 2019-11-25 18:59:10 UTC
With gcc as of today, I see dup instructions that could be optimized out:

$ cat red.c
#include "arm_neon.h"

int32x4_t fun(int32x4_t a, int16x8_t b, int16x8_t c) {
  a = vmlal_s16(a, vget_low_s16(b), vget_low_s16(c));
  a = vmlal_high_s16(a, b, c);
  return a;
}

$ gcc -O3 -S -o- red.c
fun:
	dup	d3, v1.d[0]
	dup	d4, v2.d[0]
	smlal	v0.4s, v3.4h, v4.4h
	smlal2	v0.4s, v1.8h, v2.8h
	ret

$ clang -O3 -S -o- red.c
fun:
	smlal	v0.4s, v1.4h, v2.4h
	smlal2	v0.4s, v1.8h, v2.8h
	ret
Comment 1 Andrew Pinski 2019-11-25 19:07:15 UTC
Confirmed.
I have a patch (or two) that will optimize this.
I was going to submit them in the next few days.
Comment 2 Andrew Pinski 2019-11-25 19:35:42 UTC
Created attachment 47356 [details]
Patch which I wrote for GCC 7.3

I have to double-check whether it applies directly, as I had other patches in this area, but this is the patch which I had written for GCC 7.3.
Comment 3 Wilco 2019-11-25 19:47:47 UTC
(In reply to Andrew Pinski from comment #2)
> Created attachment 47356 [details]
> Patch which I wrote for GCC 7.3
> 
> I have to double-check whether it applies directly, as I had other patches
> in this area, but this is the patch which I had written for GCC 7.3.

I think it's because many intrinsics in arm_neon.h still use asm which inhibits most optimizations.
Comment 4 Andrew Pinski 2019-11-25 20:00:59 UTC
(In reply to Wilco from comment #3)
> I think it's because many intrinsics in arm_neon.h still use asm which
> inhibits most optimizations.

No, in this case it is not.

Take:
#include "arm_neon.h"

float64x1_t fun(float64x2_t a, float64x2_t b) {
  return vget_low_f64(b);
}
double fun1(float64x2_t a, float64x2_t b) {
  return b[0];
}

---- CUT ----
Both of these should be optimized to just:
	fmov	d0, d1
	ret

Even worse take:
#include "arm_neon.h"

float64x1_t fun(float64x2_t a, float64x2_t b) {
  return vget_low_f64(b) + vget_high_f64(b);
}
double fun1(float64x2_t a, float64x2_t b) {
  return b[0] + b[1];
}

---- CUT ----
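
For reference, the optimal code for both of these is presumably a single pairwise floating-point add, with no lane moves at all; a sketch, assuming the usual AAPCS64 assignment of b to v1:

fun:
	faddp	d0, v1.2d
	ret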
Comment 5 Andrew Pinski 2020-01-25 05:07:55 UTC
Note some of the moves were removed with g:8c8952918b75f4fa6adbbe44cd641d5fd0bb55e3.

But that is not a general solution; it just "splits" the case where the destination and the source have the same register.  My patch (to which I have a few improvements, including fixes for GCC 10) handles the case where the destination and source registers are different, which is what is happening here.
Comment 6 Andrew Pinski 2020-01-25 13:09:42 UTC
Created attachment 47706 [details]
latest patch

I will submit this tomorrow.  I just need to "xfail" two of the testcases for big-endian, since big-endian does something weird sometimes.
Comment 7 Sebastian Pop 2020-03-31 18:15:33 UTC
Hi Andrew, have you committed the fix for this?
Comment 8 ktkachov 2021-01-29 08:43:44 UTC
The issue in this bug report is that the "get low lane" operation should just be a move rather than a vec_select so that it can be optimised away.
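Concretely, the difference is between lowering vget_low_f64 to a lane select versus a plain move of the low half; an illustrative RTL sketch of the two forms (not the exact patterns from aarch64-simd.md):

(set (reg:DF d0) (vec_select:DF (reg:V2DF v1) (parallel [(const_int 0)])))
(set (reg:DF d0) (subreg:DF (reg:V2DF v1) 0))

The subreg form is a simple move that the register allocator can coalesce away; the vec_select form is opaque to that optimisation.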
After g:e140f5fd3e235c5a37dc99b79f37a5ad4dc59064, GCC 11 does the right thing for all testcases in this PR.

So marking this as fixed.
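
With that change, the original testcase should presumably compile to the same two-instruction sequence clang emits above, with both dup instructions gone:

fun:
	smlal	v0.4s, v1.4h, v2.4h
	smlal2	v0.4s, v1.8h, v2.8h
	ret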