Bug 93720 - [10/11 Regression] vector creation from two parts of two vectors produces TBL rather than ins
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 10.0
Importance: P2 normal
Target Milestone: 11.0
Assignee: Andrew Pinski
URL:
Keywords: missed-optimization
Duplicates: 92892
Depends on:
Blocks: spec
 
Reported: 2020-02-13 01:03 UTC by Andrew Pinski
Modified: 2023-02-17 21:05 UTC (History)
CC: 3 users

See Also:
Host:
Target: aarch64*-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2020-02-16 00:00:00


Attachments
current patch version (1.43 KB, patch)
2020-02-16 17:03 UTC, Dmitrij Pochepko
Details | Diff
A better version (1.05 KB, patch)
2020-02-16 20:08 UTC, Andrew Pinski
Details | Diff
New patch with testcases added (1.71 KB, patch)
2020-02-17 01:55 UTC, Andrew Pinski
Details | Diff

Description Andrew Pinski 2020-02-13 01:03:27 UTC
Take:
#define vector __attribute__((vector_size(2*sizeof(double) )))

vector double
test_vpasted2 (vector double low, vector double high)
{
          return (vector double){low[0], high[1]};
}

--- CUT ---
This produces on the trunk:
        adrp    x0, .LC0
        ldr     q2, [x0, #:lo12:.LC0]
        tbl     v0.16b, {v0.16b - v1.16b}, v2.16b
        ret
---- CUT ----
Where LC0 is {0..7,24..31}
When really this should produce just an ins.

GCC 8.2 produces:
        dup     v0.2d, v0.d[0]
        ins     v0.d[1], v1.d[1]
GCC 9, though, produces the correct thing of just an ins (although I don't have a compiler on hand to verify that).

The problem is GCC 10 converts the above code to:
  _4 = VEC_PERM_EXPR <low_1(D), high_2(D), { 0, 3 }>;

Which the back-end does not optimize to do an ins.
Comment 1 Andrew Pinski 2020-02-13 01:05:20 UTC
*** Bug 92892 has been marked as a duplicate of this bug. ***
Comment 2 Dmitrij Pochepko 2020-02-14 14:07:42 UTC
I have a patch which recognizes this pattern and emits ins instructions.  The example from this issue's description compiles fine and produces this assembly:

0000000000000000 <test_vpasted2>:
   0:	6e184420 	mov	v0.d[1], v1.d[1]
   4:	d65f03c0 	ret

However, there are slightly more complicated examples where the allocated registers prevent optimal assembly from being generated.
(example: part of the blender benchmark from SPEC CPU)
#include <math.h>

static inline float dot_v3v3(const float a[3], const float b[3])
{
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

static inline float len_v3(const float a[3])
{
        return sqrtf(dot_v3v3(a, a));
}


void window_translate_m4(float winmat[4][4], float perspmat[4][4], const float x, const float y)
{
        if (winmat[2][3] == -1.0f) {
                /* in the case of a win-matrix, this means perspective always */
                float v1[3];
                float v2[3];
                float len1, len2;

                v1[0] = perspmat[0][0];
                v1[1] = perspmat[1][0];
                v1[2] = perspmat[2][0];

                v2[0] = perspmat[0][1];
                v2[1] = perspmat[1][1];
                v2[2] = perspmat[2][1];

                len1 = (1.0f / len_v3(v1));
                len2 = (1.0f / len_v3(v2));

                winmat[2][0] += len1 * winmat[0][0] * x;
                winmat[2][1] += len2 * winmat[1][1] * y;
        }
        else {
                winmat[3][0] += x;
                winmat[3][1] += y;
        }
}


This will produce:
...
  24:	fd400010 	ldr	d16, [x0]
  28:	fd400807 	ldr	d7, [x0,#16]
...
  34:	6e040611 	mov	v17.s[0], v16.s[0]
  38:	6e0c24f1 	mov	v17.s[1], v7.s[1]
...
# d16/v16 and d7/v7 are not used anywhere else

while it can be:

...
  24:	fd400010 	ldr	d16, [x0]
  28:	fd400807 	ldr	d7, [x0,#16]
...
  38:	6e0c24f1 	mov	v16.s[1], v7.s[1]
...
# and v16 is used instead of v17.


It looks like peephole2 can be used to optimize it.  I'm currently looking into it.
Comment 3 Andrew Pinski 2020-02-16 06:37:36 UTC
(In reply to Dmitrij Pochepko from comment #2) 
> It looks like peephole2 can be used to optimize it.  I'm currently looking
> into it.

Don't.  Can you attach your current patch?  
My bet is you expand this vec_perm like:
(set (reg:V2DF t1) (reg:V2DF low))
(set (reg:V2DF t) (vec_merge:V2DF
                   (vec_duplicate:V2DF
                    (vec_select:DF (reg:V2DF high)
                     (parallel [(const_int 2)])))
                  (reg:V2DF t1)
                  (const_int 2)))

Since it is only two elements.
Comment 4 Andrew Pinski 2020-02-16 11:53:56 UTC
Note the __builtin_shuffle expansion can be generalized to handle the case where there are more than two elements but only one element is inserted:

#define vector __attribute__((vector_size(16) ))

vector float f(vector float a, vector float b)
{
  return __builtin_shuffle  (a, b, (vector int){0, 4, 2, 3});
}

This function should generate:
ins     v0.s[1], v1.s[0]
ret

#define vector __attribute__((vector_size(16) ))

vector float f1(vector float a, vector float b)
{
  return __builtin_shuffle  (b, a, (vector int){4, 0, 6, 7});
}
This function should also generate:
ins     v0.s[1], v1.s[0]
ret

Even this:
#define vector __attribute__((vector_size(16) ))

vector float f(vector float a, vector float b)
{
  return __builtin_shuffle  (a, a, (vector int){0, 0, 2, 3});
}
Should generate:
ins     v0.s[1], v0.s[0]
ret
Comment 5 Andrew Pinski 2020-02-16 11:55:01 UTC
(In reply to Andrew Pinski from comment #4)
> Note the __builtin_shuffle expansion can be generalized to handle the case
> where there are more than two elements but only one element is inserted:

Note this should be done after zip expansion has happened.
Comment 6 Dmitrij Pochepko 2020-02-16 17:03:57 UTC
Created attachment 47851 [details]
current patch version
Comment 7 Andrew Pinski 2020-02-16 20:08:17 UTC
Created attachment 47853 [details]
A better version

The attached patch is a generalized version of the vec_perm-to-insert transformation.  We don't need new patterns in the .md file, as there are already enough to do the correct thing.

Yes we still have the extra mov, but that seems to be a register allocator issue.  I have not looked into it further.
Comment 8 Andrew Pinski 2020-02-16 21:32:43 UTC
(In reply to Andrew Pinski from comment #7)
> Yes we still have the extra mov, but that seems to be a register allocator
> issue.  I have not looked into it further.

But the original SLP issue is what produces the bad TBL in the first place; I filed that as PR 93771.
Comment 9 Andrew Pinski 2020-02-17 01:50:20 UTC
Mine.
Comment 10 Andrew Pinski 2020-02-17 01:55:04 UTC
Created attachment 47855 [details]
New patch with testcases added

Note some of the testcases depend on PR 82199.
I will be submitting both by the end of next week.
Comment 11 Dmitrij Pochepko 2020-03-31 13:25:10 UTC
Is it still under development/improvement?
Comment 12 Jakub Jelinek 2020-05-07 11:56:25 UTC
GCC 10.1 has been released.
Comment 13 GCC Commits 2020-07-17 09:25:10 UTC
The master branch has been updated by Richard Sandiford <rsandifo@gcc.gnu.org>:

https://gcc.gnu.org/g:c9c87e6f9c795bb36e4570a07501fc182eaad035

commit r11-2192-gc9c87e6f9c795bb36e4570a07501fc182eaad035
Author: Dmitrij Pochepko <dmitrij.pochepko@bell-sw.com>
Date:   Fri Jul 17 10:24:46 2020 +0100

    vector creation from two parts of two vectors produces TBL rather than ins (PR 93720)
    
    The following patch enables vector permutations optimization by trying
    to use ins instruction instead of slow and generic tbl.
    
    example:
    
    vector float f0(vector float a, vector float b)
    {
      return __builtin_shuffle (a, a, (vector int){3, 1, 2, 3});
    }
    
    was compiled into:
    ...
            adrp    x0, .LC0
            ldr     q1, [x0, #:lo12:.LC0]
            tbl     v0.16b, {v0.16b}, v1.16b
    ...
    
    and after patch:
    ...
            ins     v0.s[0], v0.s[3]
    ...
    
    bootstrapped and tested on aarch64-linux-gnu with no regressions
    
    gcc/ChangeLog:
    
    2020-07-17  Andrew Pinski  <apinksi@marvell.com>
    
            PR target/93720
            * config/aarch64/aarch64.c (aarch64_evpc_ins): New function.
            (aarch64_expand_vec_perm_const_1): Call it.
            * config/aarch64/aarch64-simd.md (aarch64_simd_vec_copy_lane): Make
            public, and add a "@" prefix.
    
    gcc/testsuite/ChangeLog:
    
    2020-07-17  Andrew Pinski  <apinksi@marvell.com>
    
            PR target/93720
            * gcc.target/aarch64/vins-1.c: New test.
            * gcc.target/aarch64/vins-2.c: New test.
            * gcc.target/aarch64/vins-3.c: New test.
    
    Co-Authored-By: Dmitrij Pochepko <dmitrij.pochepko@bell-sw.com>
Comment 14 Richard Sandiford 2020-07-21 10:03:47 UTC
Fixed on trunk.