With gcc as of today I see dup instructions that could be optimized out: $ cat red.c #include "arm_neon.h" int32x4_t fun(int32x4_t a, int16x8_t b, int16x8_t c) { a = vmlal_s16(a, vget_low_s16(b), vget_low_s16(c)); a = vmlal_high_s16(a, b, c); return a; } $ gcc -O3 -S -o- red.c fun: dup d3, v1.d[0] dup d4, v2.d[0] smlal v0.4s,v3.4h,v4.4h smlal2 v0.4s,v1.8h,v2.8h ret $ clang -O3 -S -o- red.c fun: smlal v0.4s, v1.4h, v2.4h smlal2 v0.4s, v1.8h, v2.8h ret
Confirmed. I have a patch (or two) that will optimize this. I was going to be submitting them in the next few days.
Created attachment 47356 [details] Patch which I wrote for GCC 7.3 I have to double check if it applies directly as I had other patches in this area but this is the patch which I had wrote for GCC 7.3.
(In reply to Andrew Pinski from comment #2) > Created attachment 47356 [details] > Patch which I wrote for GCC 7.3 > > I have to double check if it applies directly as I had other patches in this > area but this is the patch which I had wrote for GCC 7.3. I think it's because many intrinsics in arm_neon.h still use asm which inhibits most optimizations.
(In reply to Wilco from comment #3) > I think it's because many intrinsics in arm_neon.h still use asm which > inhibits most optimizations. NO in this case it is not. Take: #include "arm_neon.h" float64x1_t fun(float64x2_t a, float64x2_t b) { return vget_low_f64(b); } double fun1(float64x2_t a, float64x2_t b) { return b[0]; } ---- CUT ---- Both of these should be optimized to just fmov d0, d1 ret Even worse take: #include "arm_neon.h" float64x1_t fun(float64x2_t a, float64x2_t b) { return vget_low_f64(b) + vget_high_f64(b); } double fun1(float64x2_t a, float64x2_t b) { return b[0] + b[1]; } ---- CUT ---
Note some of the moves were removed with g:8c8952918b75f4fa6adbbe44cd641d5fd0bb55e3 But it is not a general solution, it just "splits" the case where dst and source have the same register. my patch (which I have a few improvements to it and fixing it for GCC 10) handles the case where the dst and the source registers are different which is what is happening in this case.
Created attachment 47706 [details] latest patch I will submit this tomorrow. I just need to "xfail" two of the testcases for big-endian. big-endian does something weird sometimes.
Hi Andrew, have you committed the fix for this?
The issue in this bug report is that the "get low lane" operation should just be a move rather than a vec_select so that it can be optimised away. After g:e140f5fd3e235c5a37dc99b79f37a5ad4dc59064 GCC 11 does the right thing for all testcases in this PR So marking this as fixed.