Take:

#define vector __attribute__((vector_size(2*sizeof(double))))
vector double test_vpasted2 (vector double low, vector double high)
{
  return (vector double){low[0], high[1]};
}

On trunk this produces:

        adrp    x0, .LC0
        ldr     q2, [x0, #:lo12:.LC0]
        tbl     v0.16b, {v0.16b - v1.16b}, v2.16b
        ret

where .LC0 is {0..7, 24..31}, when really this should produce just an ins.

GCC 8.2 produces:

        dup     v0.2d, v0.d[0]
        ins     v0.d[1], v1.d[1]

GCC 9 produces the correct thing, just an ins (though I don't have a compiler at hand to prove that). The problem is that GCC 10 converts the above code to:

  _4 = VEC_PERM_EXPR <low_1(D), high_2(D), { 0, 3 }>;

which the back-end does not optimize into an ins.
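For reference, the same permutation can also be written with __builtin_shuffle; this is only an illustrative reproducer (the mask {0, 3} matches the VEC_PERM_EXPR above), not part of the original report:

#define vector __attribute__((vector_size(2*sizeof(double))))
vector double test_shuffle (vector double low, vector double high)
{
  /* Lane 0 from low, lane 1 from high: the { 0, 3 } permutation.  */
  return __builtin_shuffle (low, high, (vector long){0, 3});
}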
*** Bug 92892 has been marked as a duplicate of this bug. ***
I have a patch which recognizes such a pattern and emits ins instructions. The example in this issue description now compiles fine and produces this assembly:

0000000000000000 <test_vpasted2>:
   0:   6e184420        mov     v0.d[1], v1.d[1]
   4:   d65f03c0        ret

However, there are slightly more complicated examples where the allocated registers prevent optimal assembly from being generated (example: part of the blender benchmark from SPEC CPU):

#include <math.h>

static inline float dot_v3v3(const float a[3], const float b[3])
{
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

static inline float len_v3(const float a[3])
{
  return sqrtf(dot_v3v3(a, a));
}

void window_translate_m4(float winmat[4][4], float perspmat[4][4], const float x, const float y)
{
  if (winmat[2][3] == -1.0f) {
    /* in the case of a win-matrix, this means perspective always */
    float v1[3];
    float v2[3];
    float len1, len2;

    v1[0] = perspmat[0][0];
    v1[1] = perspmat[1][0];
    v1[2] = perspmat[2][0];

    v2[0] = perspmat[0][1];
    v2[1] = perspmat[1][1];
    v2[2] = perspmat[2][1];

    len1 = (1.0f / len_v3(v1));
    len2 = (1.0f / len_v3(v2));

    winmat[2][0] += len1 * winmat[0][0] * x;
    winmat[2][1] += len2 * winmat[1][1] * y;
  }
  else {
    winmat[3][0] += x;
    winmat[3][1] += y;
  }
}

This will produce:

...
  24:   fd400010        ldr     d16, [x0]
  28:   fd400807        ldr     d7, [x0,#16]
...
  34:   6e040611        mov     v17.s[0], v16.s[0]
  38:   6e0c24f1        mov     v17.s[1], v7.s[1]
...
# v16/d16 and d7/v7 are not used in any other places

while it could be:

...
  24:   fd400010        ldr     d16, [x0]
  28:   fd400807        ldr     d7, [x0,#16]
...
  38:   6e0c24f1        mov     v16.s[1], v7.s[1]
...
# with v16 used instead of v17.

It looks like peephole2 could be used to optimize this. I'm currently looking into it.
(In reply to Dmitrij Pochepko from comment #2)
> It looks like peephole2 could be used to optimize this. I'm currently
> looking into it.

Don't. Can you attach your current patch?

My bet is you expand this vec_perm like:

(set (reg:V2DF t1) (reg:V2DF low))
(set (reg:V2DF t)
     (vec_merge:V2DF
       (vec_duplicate:V2DF
         (vec_select:DF (reg:V2DF high)
                        (parallel [(const_int 2)])))
       (reg:V2DF t1)
       (const_int 2)))

since it is only two elements.
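In scalar terms the two-element permute being expanded is just "copy low, then insert lane 1 of high"; a hypothetical C sketch of what this computes (not the actual expander code):

#define vector __attribute__((vector_size(2*sizeof(double))))
vector double merge_sketch (vector double low, vector double high)
{
  vector double t = low;   /* the plain register copy             */
  t[1] = high[1];          /* the lane insert done by the merge   */
  return t;
}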
Note the __builtin_shuffle expansion can be generalized to handle the case where there are more than two elements but only one element is inserted:

#define vector __attribute__((vector_size(16)))
vector float f(vector float a, vector float b)
{
  return __builtin_shuffle (a, b, (vector int){0, 4, 2, 3});
}

This function should generate:

        ins     v0.s[1], v1.s[0]
        ret

#define vector __attribute__((vector_size(16)))
vector float f1(vector float a, vector float b)
{
  return __builtin_shuffle (b, a, (vector int){4, 0, 6, 7});
}

This function should also generate:

        ins     v0.s[1], v1.s[0]
        ret

Even this:

#define vector __attribute__((vector_size(16)))
vector float f(vector float a, vector float b)
{
  return __builtin_shuffle (a, a, (vector int){0, 0, 2, 3});
}

should generate:

        ins     v0.s[1], v0.s[0]
        ret
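The property all three examples share is that the shuffle mask is the identity permutation of one operand except in exactly one lane. A rough way to check that, shown only as an illustrative sketch and not as GCC's actual implementation:

/* Illustration only: return 1 if sel[0..nelts-1] is the identity of
   operand 0 (indices 0..nelts-1) or of operand 1 (indices
   nelts..2*nelts-1) with exactly one lane replaced, i.e. a single
   element insert.  */
static int
is_single_insert (const int *sel, int nelts, int *dst_lane, int *src_idx)
{
  for (int op = 0; op < 2; op++)
    {
      int base = op * nelts;
      int mismatches = 0, lane = -1;
      for (int i = 0; i < nelts; i++)
        if (sel[i] != base + i)
          {
            mismatches++;
            lane = i;
          }
      if (mismatches == 1)
        {
          *dst_lane = lane;       /* lane written by the ins        */
          *src_idx = sel[lane];   /* flattened source element index */
          return 1;
        }
    }
  return 0;
}

For the first example's mask {0, 4, 2, 3} this reports destination lane 1 and source element 4 (operand 1, lane 0), i.e. ins v0.s[1], v1.s[0].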
(In reply to Andrew Pinski from comment #4)
> Note the __builtin_shuffle expansion can be generalized to handle the case
> where there are more than two elements but only one element is inserted:

Note this should be done after zip expansion has happened.
Created attachment 47851 [details] current patch version
Created attachment 47853 [details]
A better version

The attached patch is a generalized version of the vec_perm-to-insert conversion. We don't need new patterns in the .md file as there are already enough to do the correct thing.

Yes, we still have the extra mov, but that seems to be a register allocator issue. I have not looked into it further.
(In reply to Andrew Pinski from comment #7)
> Yes, we still have the extra mov, but that seems to be a register allocator
> issue. I have not looked into it further.

But the original SLP issue is what produces the bad TBL in the first place; I filed that as PR 93771.
Mine.
Created attachment 47855 [details]
New patch with testcases added

Note that some of the testcases depend on PR 82199. I will be submitting both by the end of next week.
Is it still under development/improvement?
GCC 10.1 has been released.
The master branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:c9c87e6f9c795bb36e4570a07501fc182eaad035

commit r11-2192-gc9c87e6f9c795bb36e4570a07501fc182eaad035
Author: Dmitrij Pochepko <dmitrij.pochepko@bell-sw.com>
Date:   Fri Jul 17 10:24:46 2020 +0100

    vector creation from two parts of two vectors produces TBL rather than ins (PR 93720)

    The following patch enables vector permutations optimization by trying to use ins instruction instead of slow and generic tbl.

    example:

    vector float f0(vector float a, vector float b)
    {
      return __builtin_shuffle (a, a, (vector int){3, 1, 2, 3});
    }

    was compiled into:
    ...
            adrp    x0, .LC0
            ldr     q1, [x0, #:lo12:.LC0]
            tbl     v0.16b, {v0.16b}, v1.16b
    ...

    and after patch:
    ...
            ins     v0.s[0], v0.s[3]
    ...

    bootstrapped and tested on aarch64-linux-gnu with no regressions

    gcc/ChangeLog:

    2020-07-17  Andrew Pinski  <apinksi@marvell.com>

            PR target/93720
            * config/aarch64/aarch64.c (aarch64_evpc_ins): New function.
            (aarch64_expand_vec_perm_const_1): Call it.
            * config/aarch64/aarch64-simd.md (aarch64_simd_vec_copy_lane): Make public, and add a "@" prefix.

    gcc/testsuite/ChangeLog:

    2020-07-17  Andrew Pinski  <apinksi@marvell.com>

            PR target/93720
            * gcc.target/aarch64/vins-1.c: New test.
            * gcc.target/aarch64/vins-2.c: New test.
            * gcc.target/aarch64/vins-3.c: New test.

    Co-Authored-By: Dmitrij Pochepko <dmitrij.pochepko@bell-sw.com>
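In lane terms, the mask in the f0 example above is the identity except for lane 0, which is filled from lane 3 of the same vector; a hypothetical scalar sketch of what it computes (not part of the commit):

#define vector __attribute__((vector_size(16)))
vector float f0_sketch (vector float a)
{
  vector float r = a;   /* lanes 1..3 unchanged (identity part of the mask) */
  r[0] = a[3];          /* the single inserted lane: ins v0.s[0], v0.s[3]   */
  return r;
}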
Fixed on trunk.