Bug 74585 - powerpc64: Very poor code generation for homogeneous vector aggregates passed in registers
Summary: powerpc64: Very poor code generation for homogeneous vector aggregates passed...
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 7.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2016-08-11 20:29 UTC by Bill Schmidt
Modified: 2020-07-10 20:59 UTC
CC List: 6 users

See Also:
Host:
Target: powerpc64*-*-*
Build:
Known to work:
Known to fail: 5.4.1, 6.1.1, 7.0
Last reconfirmed:


Description Bill Schmidt 2016-08-11 20:29:07 UTC
For the PowerPC64 ELF V2 ABI, homogeneous aggregates of vectors (any combination of structs and arrays containing only vectors) are passed in the first eight vector registers.  Tree-sra does not understand this and performs scalarization on such aggregates, forcing them to memory in the process.  This causes unnecessary stores/reloads to the stack on function entry and exit, resulting in terrible performance.
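
For illustration only (this snippet is not part of the original report), an aggregate mixing a struct and an array of vectors also meets that definition, since it still contains nothing but vectors:

typedef struct
{
  __vector double vx[2];  /* array of vectors */
  __vector double vy;     /* plus a plain vector member */
} vdouble_mixed_t;        /* still a homogeneous vector aggregate */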

As an example, consider:

--- SNIP ---
#define VEC_DW_H (1)
#define VEC_DW_L (0)

typedef struct
{
  __vector double vx0;
  __vector double vx1;
  __vector double vx2;
  __vector double vx3;
} vdoublex8_t;

vdoublex8_t
test_vecd8_rotate_left (vdoublex8_t a)
{
  __vector double temp;
  vdoublex8_t result;

  temp = a.vx0;

  /* Copy low dword of vx0 and high dword of vx1 to vx0 high / low.  */
  result.vx0[VEC_DW_H] = a.vx0[VEC_DW_L];
  result.vx0[VEC_DW_L] = a.vx1[VEC_DW_H];
  /* Copy low dword of vx1 and high dword of vx2 to vx1 high / low.  */
  result.vx1[VEC_DW_H] = a.vx1[VEC_DW_L];
  result.vx1[VEC_DW_L] = a.vx2[VEC_DW_H];
  /* Copy low dword of vx2 and high dword of vx3 to vx2 high / low.  */
  result.vx2[VEC_DW_H] = a.vx2[VEC_DW_L];
  result.vx2[VEC_DW_L] = a.vx3[VEC_DW_H];
  /* Copy low dword of vx3 and high dword of vx0 to vx3 high / low.  */
  result.vx3[VEC_DW_H] = a.vx3[VEC_DW_L];
  result.vx3[VEC_DW_L] = temp[VEC_DW_H];

  return (result);
}
--- SNIP ---

After 031t.forwprop, we have:


;; Function test_vecd8_rotate_left (test_vecd8_rotate_left, funcdef_no=0, decl_uid=2364, cgraph_uid=0, symbol_order=0)

test_vecd8_rotate_left (struct vdoublex8_t a)
{
  struct vdoublex8_t result;
  __vector double temp;
  struct vdoublex8_t D.2369;
  __vector double _1;
  double _2;
  double _3;
  double _4;
  double _5;
  double _6;
  double _7;
  double _8;
  double _9;

  <bb 2>:
  _1 = a.vx0;
  temp_23 = _1;
  _2 = BIT_FIELD_REF <a.vx0, 64, 0>;
  BIT_FIELD_REF <result.vx0, 64, 64> = _2;
  _3 = BIT_FIELD_REF <a.vx1, 64, 64>;
  BIT_FIELD_REF <result.vx0, 64, 0> = _3;
  _4 = BIT_FIELD_REF <a.vx1, 64, 0>;
  BIT_FIELD_REF <result.vx1, 64, 64> = _4;
  _5 = BIT_FIELD_REF <a.vx2, 64, 64>;
  BIT_FIELD_REF <result.vx1, 64, 0> = _5;
  _6 = BIT_FIELD_REF <a.vx2, 64, 0>;
  BIT_FIELD_REF <result.vx2, 64, 64> = _6;
  _7 = BIT_FIELD_REF <a.vx3, 64, 64>;
  BIT_FIELD_REF <result.vx2, 64, 0> = _7;
  _8 = BIT_FIELD_REF <a.vx3, 64, 0>;
  BIT_FIELD_REF <result.vx3, 64, 64> = _8;
  _9 = BIT_FIELD_REF <_1, 64, 64>;
  BIT_FIELD_REF <result.vx3, 64, 0> = _9;
  D.2369 = result;
  result ={v} {CLOBBER};
  return D.2369;

}

but after 032t.esra, we have:

test_vecd8_rotate_left (struct vdoublex8_t a)
{
  __vector double result$vx3;
  __vector double result$vx2;
  __vector double result$vx1;
  __vector double result$vx0;
  __vector double a$vx3;
  __vector double a$vx2;
  __vector double a$vx1;
  __vector double a$vx0;
  struct vdoublex8_t result;
  __vector double temp;
  struct vdoublex8_t D.2369;
  __vector double _1;
  double _2;
  double _3;
  double _4;
  double _5;
  double _6;
  double _7;
  double _8;
  double _9;
  __vector double _11;
  __vector double _21;
  __vector double _25;
  __vector double _26;

  <bb 2>:
  a$vx0_27 = MEM[(struct  *)&a];
  a$vx1_28 = MEM[(struct  *)&a + 16B];
  a$vx2_29 = MEM[(struct  *)&a + 32B];
  a$vx3_30 = MEM[(struct  *)&a + 48B];
  _1 = a$vx0_27;
  temp_23 = _1;
  _2 = BIT_FIELD_REF <a$vx0_27, 64, 0>;
  BIT_FIELD_REF <result$vx0, 64, 64> = _2;
  _3 = BIT_FIELD_REF <a$vx1_28, 64, 64>;
  BIT_FIELD_REF <result$vx0, 64, 0> = _3;
  _4 = BIT_FIELD_REF <a$vx1_28, 64, 0>;
  BIT_FIELD_REF <result$vx1, 64, 64> = _4;
  _5 = BIT_FIELD_REF <a$vx2_29, 64, 64>;
  BIT_FIELD_REF <result$vx1, 64, 0> = _5;
  _6 = BIT_FIELD_REF <a$vx2_29, 64, 0>;
  BIT_FIELD_REF <result$vx2, 64, 64> = _6;
  _7 = BIT_FIELD_REF <a$vx3_30, 64, 64>;
  BIT_FIELD_REF <result$vx2, 64, 0> = _7;
  _8 = BIT_FIELD_REF <a$vx3_30, 64, 0>;
  BIT_FIELD_REF <result$vx3, 64, 64> = _8;
  _9 = BIT_FIELD_REF <_1, 64, 64>;
  BIT_FIELD_REF <result$vx3, 64, 0> = _9;
  _21 = result$vx0;
  MEM[(struct  *)&D.2369] = _21;
  _11 = result$vx1;
  MEM[(struct  *)&D.2369 + 16B] = _11;
  _25 = result$vx2;
  MEM[(struct  *)&D.2369 + 32B] = _25;
  _26 = result$vx3;
  MEM[(struct  *)&D.2369 + 48B] = _26;
  result$vx0 ={v} {CLOBBER};
  result$vx1 ={v} {CLOBBER};
  result$vx2 ={v} {CLOBBER};
  result$vx3 ={v} {CLOBBER};
  return D.2369;

}

I will spare you the terrible assembly code that we get as a result, but suffice it to say it is filled with unnecessary memory accesses when all this logic can be done efficiently in registers.

I'm not familiar with the tree-sra code, but a quick scan indicates there isn't any sort of target hook to indicate when pushing parameters to memory is a bad idea.  I'm guessing we need one.  Or is there another way that this behavior can be avoided?

There are a number of criteria in find_param_candidates to preclude scalarization, but nothing seems to obviously fit the condition of "the ABI in effect requests that you leave this guy alone."  The issue is complicated by the fact that scalarization would indeed be a reasonable thing to do if we have run out of protocol registers for a parameter; but I am willing to give that up if we can avoid lousy code in the common case.

Our ABI has the same issues if we replace vectors by float, double, or long double.
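
For example (a minimal sketch, not taken from the original report), the analogous homogeneous aggregate of scalar doubles, which the ELFv2 ABI passes in floating-point registers, would be expected to hit the same scalarization path:

typedef struct
{
  double d0;
  double d1;
  double d2;
  double d3;
} double4_t;

double4_t
test_d4_rotate_left (double4_t a)
{
  double4_t result;

  /* Rotate the four doubles left by one position.  */
  result.d0 = a.d1;
  result.d1 = a.d2;
  result.d2 = a.d3;
  result.d3 = a.d0;

  return result;
}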

Any thoughts on how this should be addressed?
Comment 1 Richard Biener 2016-08-12 09:15:29 UTC
The issue is not SRA pushing things to memory - it doesn't.  The issue is that in the GIMPLE IL the parameter appears as "memory" as it is an aggregate type.

The issue is really RTL expansion of the argument (not sure where that's done)
which doesn't take into account that we could happily expand

  a$vx0_27 = MEM[(struct  *)&a];
  a$vx1_28 = MEM[(struct  *)&a + 16B];
  a$vx2_29 = MEM[(struct  *)&a + 32B];
  a$vx3_30 = MEM[(struct  *)&a + 48B];

if a is expanded to registers.

Note comparing assembler with/without -fno-tree-sra doesn't really show me
the obvious badness:

 test_vecd8_rotate_left:
-       addi 6,1,-288
-       li 9,160
+       addi 6,1,-224
+       li 7,96
        xxpermdi 34,34,34,2
        xxpermdi 35,35,35,2
-       li 10,176
+       li 8,112
+       li 10,128
        xxpermdi 36,36,36,2
        xxpermdi 37,37,37,2
+       stxvd2x 34,6,7
+       li 9,144
+       vspltisw 0,0
+       stxvd2x 35,6,8
+       stxvd2x 36,6,10
+       stxvd2x 37,6,9
+       lxvd2x 0,6,7
        li 7,32
-       stxvd2x 34,6,9
+       lxvd2x 7,6,8
        li 8,48
-       stxvd2x 35,6,10
-       li 10,192
-       stxvd2x 36,6,10
-       li 10,208
-       stxvd2x 37,6,10
-       lfd 12,-128(1)
-       lxvd2x 0,6,9
-       li 9,96
+       lxvd2x 9,6,10
        li 10,64
-       stfd 12,-184(1)
-       lfd 12,-104(1)
-       xxpermdi 0,0,0,3
-       stfd 0,-144(1)
-       stfd 12,-192(1)
-       lfd 12,-112(1)
-       stfd 12,-168(1)
-       lfd 12,-88(1)
-       lxvd2x 10,6,9
-       li 9,112
-       stxvd2x 10,6,7
-       stfd 12,-176(1)
-       lfd 12,-96(1)
-       stfd 12,-152(1)
-       lfd 12,-72(1)
        lxvd2x 11,6,9
-       li 9,128
-       lxvd2x 34,6,7
-       stxvd2x 11,6,8
-       stfd 12,-160(1)
-       lfd 12,-80(1)
-       xxpermdi 34,34,34,2
-       stfd 12,-136(1)
-       lxvd2x 12,6,9
-       li 9,144
-       lxvd2x 35,6,8
-       stxvd2x 12,6,10
-       lxvd2x 0,6,9
        li 9,80
-       xxpermdi 35,35,35,2
+       fmr 6,0
+       xxpermdi 0,0,0,3
+       fmr 8,7
+       xxpermdi 7,7,7,3
+       fmr 10,9
+       xxpermdi 9,9,9,3
+       fmr 12,11
+       xxpermdi 11,11,11,3
+       xxpermdi 6,6,32,1
+       xxpermdi 8,8,32,1
+       xxpermdi 10,10,32,1
+       xxpermdi 7,6,7,0
+       xxpermdi 12,12,32,1
+       xxpermdi 9,8,9,0
+       xxpermdi 11,10,11,0
+       xxpermdi 8,7,7,2
+       xxpermdi 0,12,0,0
+       xxpermdi 10,9,9,2
+       stxvd2x 8,6,7
+       xxpermdi 12,11,11,2
+       stxvd2x 10,6,8
+       xxpermdi 0,0,0,2
+       stxvd2x 12,6,10
        stxvd2x 0,6,9
+       lxvd2x 34,6,7
+       lxvd2x 35,6,8
        lxvd2x 36,6,10
        lxvd2x 37,6,9
+       xxpermdi 35,35,35,2
+       xxpermdi 34,34,34,2
        xxpermdi 36,36,36,2
        xxpermdi 37,37,37,2
        blr

 t.s |   80 ++++++++++++++++++++++++++++++++++----------------------------------
 1 file changed, 40 insertions(+), 40 deletions(-)


Generally the GIMPLE phase not knowing about calling conventions has issues
but I don't see an easy way out here other than lowering things much much
earlier ... (with its own downside, esp. considering the "cross-target" games
we play for offloading via LTO).

That said, somebody needs to sit down and see where the copy to memory
is generated and why.
Comment 3 Richard Biener 2016-08-12 09:21:12 UTC
That is, compare DECL_RTL for a in the good/bad case and see how it arrives so.
Comment 4 Richard Biener 2016-08-12 09:30:33 UTC
Bad case:

(mem/c:BLK (plus:DI (reg/f:DI 150 virtual-stack-vars)
        (const_int 64 [0x40])) [1 a+0 S64 A128])

#0  set_decl_rtl (t=<parm_decl 0x2aaaac199000 a>, x=0x2aaaac1ae960)
    at /space/rguenther/src/svn/trunk/gcc/emit-rtl.c:1282
#1  0x00000000008ad5ca in set_rtl (t=<parm_decl 0x2aaaac199000 a>, 
    x=0x2aaaac1ae960) at /space/rguenther/src/svn/trunk/gcc/cfgexpand.c:302
#2  0x00000000008b0236 in set_parm_rtl (parm=<parm_decl 0x2aaaac199000 a>, 
    x=0x2aaaac1ae960) at /space/rguenther/src/svn/trunk/gcc/cfgexpand.c:1275
#3  0x0000000000a7477b in assign_parm_setup_block (all=0x7fffffffd5c0, 
    parm=<parm_decl 0x2aaaac199000 a>, data=0x7fffffffd540)
    at /space/rguenther/src/svn/trunk/gcc/function.c:3109
#4  0x0000000000a76f92 in assign_parms (
    fndecl=<function_decl 0x2aaaac171a00 test_vecd8_rotate_left>)
    at /space/rguenther/src/svn/trunk/gcc/function.c:3775
#5  0x0000000000a7afc6 in expand_function_start (
    subr=<function_decl 0x2aaaac171a00 test_vecd8_rotate_left>)
    at /space/rguenther/src/svn/trunk/gcc/function.c:5211

Good case:

(mem/c:BLK (plus:DI (reg/f:DI 150 virtual-stack-vars)
        (const_int 128 [0x80])) [1 a+0 S64 A128])

so no difference.  We have in the good case

;; _1 = a.vx0;

(insn 17 16 18 (set (reg:V2DF 176 [ _1 ])
        (vec_select:V2DF (mem/c:V2DF (plus:DI (reg/f:DI 150 virtual-stack-vars)
                    (const_int 128 [0x80])) [2 a.vx0+0 S16 A128])
            (parallel [
                    (const_int 1 [0x1])
                    (const_int 0 [0])
                ]))) t.c:18 -1
     (nil))

and in the bad case

;; a$vx0_22 = MEM[(struct  *)&a];

(insn 17 16 18 (set (reg:V2DF 191 [ a$vx0 ])
        (vec_select:V2DF (mem/c:V2DF (plus:DI (reg/f:DI 150 virtual-stack-vars)
                    (const_int 64 [0x40])) [1 MEM[(struct  *)&a]+0 S16 A128])
            (parallel [
                    (const_int 1 [0x1])
                    (const_int 0 [0])
                ]))) -1
     (nil))

again almost the same.  Parameter setup in the bad case:

(insn 2 15 3 2 (set (reg:V4SI 183)
        (reg:V4SI 79 2 [ a ])) t.c:14 -1
     (nil))
(insn 3 2 4 2 (set (reg:V4SI 184)
        (reg:V4SI 80 3 [ a+16 ])) t.c:14 -1
     (nil))
(insn 4 3 5 2 (set (reg:V4SI 185)
        (reg:V4SI 81 4 [ a+32 ])) t.c:14 -1
     (nil))
(insn 5 4 6 2 (set (reg:V4SI 186)
        (reg:V4SI 82 5 [ a+48 ])) t.c:14 -1
     (nil))
(insn 6 5 7 2 (set (reg:V4SI 187)
        (vec_select:V4SI (reg:V4SI 183)
            (parallel [
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                    (const_int 0 [0])
                    (const_int 1 [0x1])
                ]))) t.c:14 -1
     (nil))
(insn 7 6 8 2 (set (mem/c:V4SI (plus:DI (reg/f:DI 150 virtual-stack-vars)
                (const_int 64 [0x40])) [1 a+0 S16 A128])
        (vec_select:V4SI (reg:V4SI 187)
            (parallel [
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                    (const_int 0 [0])
                    (const_int 1 [0x1])
                ]))) t.c:14 -1
     (nil))
(insn 8 7 9 2 (set (reg:V4SI 188)
        (vec_select:V4SI (reg:V4SI 184)
            (parallel [
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                    (const_int 0 [0])
                    (const_int 1 [0x1])
                ]))) t.c:14 -1
     (nil))
(insn 9 8 10 2 (set (mem/c:V4SI (plus:DI (reg/f:DI 150 virtual-stack-vars)
                (const_int 80 [0x50])) [1 a+16 S16 A128])
        (vec_select:V4SI (reg:V4SI 188)
            (parallel [
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                    (const_int 0 [0])
                    (const_int 1 [0x1])
                ]))) t.c:14 -1
     (nil))
(insn 10 9 11 2 (set (reg:V4SI 189)
        (vec_select:V4SI (reg:V4SI 185)
            (parallel [
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                    (const_int 0 [0])
                    (const_int 1 [0x1])
                ]))) t.c:14 -1
     (nil))
(insn 11 10 12 2 (set (mem/c:V4SI (plus:DI (reg/f:DI 150 virtual-stack-vars)
                (const_int 96 [0x60])) [1 a+32 S16 A128])
        (vec_select:V4SI (reg:V4SI 189)
            (parallel [
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                    (const_int 0 [0])
                    (const_int 1 [0x1])
                ]))) t.c:14 -1
     (nil))
(insn 12 11 13 2 (set (reg:V4SI 190)
        (vec_select:V4SI (reg:V4SI 186)
            (parallel [
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                    (const_int 0 [0])
                    (const_int 1 [0x1])
                ]))) t.c:14 -1
     (nil))
(insn 13 12 14 2 (set (mem/c:V4SI (plus:DI (reg/f:DI 150 virtual-stack-vars)
                (const_int 112 [0x70])) [1 a+48 S16 A128])
        (vec_select:V4SI (reg:V4SI 190)
            (parallel [
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                    (const_int 0 [0])
                    (const_int 1 [0x1])
                ]))) t.c:14 -1
     (nil))
(note 14 13 17 2 NOTE_INSN_FUNCTION_BEG)


and in the good case:

(insn 2 15 3 2 (set (reg:V4SI 168)
        (reg:V4SI 79 2 [ a ])) t.c:14 -1
     (nil))
(insn 3 2 4 2 (set (reg:V4SI 169)
        (reg:V4SI 80 3 [ a+16 ])) t.c:14 -1
     (nil))
(insn 4 3 5 2 (set (reg:V4SI 170)
        (reg:V4SI 81 4 [ a+32 ])) t.c:14 -1
     (nil))
(insn 5 4 6 2 (set (reg:V4SI 171)
        (reg:V4SI 82 5 [ a+48 ])) t.c:14 -1
     (nil))
(insn 6 5 7 2 (set (reg:V4SI 172)
        (vec_select:V4SI (reg:V4SI 168)
            (parallel [
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                    (const_int 0 [0])
                    (const_int 1 [0x1])
                ]))) t.c:14 -1
     (nil))
(insn 7 6 8 2 (set (mem/c:V4SI (plus:DI (reg/f:DI 150 virtual-stack-vars)
                (const_int 128 [0x80])) [1 a+0 S16 A128])
        (vec_select:V4SI (reg:V4SI 172)
            (parallel [
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                    (const_int 0 [0])
                    (const_int 1 [0x1])
                ]))) t.c:14 -1
     (nil))
(insn 8 7 9 2 (set (reg:V4SI 173)
        (vec_select:V4SI (reg:V4SI 169)
            (parallel [
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                    (const_int 0 [0])
                    (const_int 1 [0x1])
                ]))) t.c:14 -1
     (nil))
(insn 9 8 10 2 (set (mem/c:V4SI (plus:DI (reg/f:DI 150 virtual-stack-vars)
                (const_int 144 [0x90])) [1 a+16 S16 A128])
        (vec_select:V4SI (reg:V4SI 173)
            (parallel [
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                    (const_int 0 [0])
                    (const_int 1 [0x1])
                ]))) t.c:14 -1
     (nil))
(insn 10 9 11 2 (set (reg:V4SI 174)
        (vec_select:V4SI (reg:V4SI 170)
            (parallel [
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                    (const_int 0 [0])
                    (const_int 1 [0x1])
                ]))) t.c:14 -1
     (nil))
(insn 11 10 12 2 (set (mem/c:V4SI (plus:DI (reg/f:DI 150 virtual-stack-vars)
                (const_int 160 [0xa0])) [1 a+32 S16 A128])
        (vec_select:V4SI (reg:V4SI 174)
            (parallel [
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                    (const_int 0 [0])
                    (const_int 1 [0x1])
                ]))) t.c:14 -1
     (nil))
(insn 12 11 13 2 (set (reg:V4SI 175)
        (vec_select:V4SI (reg:V4SI 171)
            (parallel [
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                    (const_int 0 [0])
                    (const_int 1 [0x1])
                ]))) t.c:14 -1
     (nil))
(insn 13 12 14 2 (set (mem/c:V4SI (plus:DI (reg/f:DI 150 virtual-stack-vars)
                (const_int 176 [0xb0])) [1 a+48 S16 A128])
        (vec_select:V4SI (reg:V4SI 175)
            (parallel [
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                    (const_int 0 [0])
                    (const_int 1 [0x1])
                ]))) t.c:14 -1
     (nil))
(note 14 13 17 2 NOTE_INSN_FUNCTION_BEG)

again exactly the same.  There must be downstream effects that cause the whole
issue during RTL optimization.
Comment 5 Bill Schmidt 2016-08-12 13:46:46 UTC
(In reply to Richard Biener from comment #1)
> The issue is not SRA pushing things to memory - it doesn't.  The issue is
> that in the GIMPLE IL the parameter appears as "memory" as it is an
> aggregate type.
> 
> The issue is really RTL expansion of the argument (not sure where that's
> done)
> which doesn't take into account that we could happily expand
> 
>   a$vx0_27 = MEM[(struct  *)&a];
>   a$vx1_28 = MEM[(struct  *)&a + 16B];
>   a$vx2_29 = MEM[(struct  *)&a + 32B];
>   a$vx3_30 = MEM[(struct  *)&a + 48B];
> 
> if a is expanded to registers.
> 

OK, I'll investigate further.  But the assumption that an aggregate parameter is "memory" seems wrong to me.  SRA is explicitly taking the address of a parameter which has not previously had its address taken, so it seems to me that if it's passed in registers this in itself will force it to memory on procedure entry, despite whatever else is currently going on with -fno-tree-sra.  But this is all talk, I need to do some digging to see what the issues are.  To me the SRA transformation seemed "obviously" :) wrong, so I stopped looking...
Comment 6 rguenther@suse.de 2016-08-12 14:58:00 UTC
On August 12, 2016 3:46:46 PM GMT+02:00, "wschmidt at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org> wrote:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=74585
>
> --- Comment #5 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #1)
>> The issue is not SRA pushing things to memory - it doesn't.  The issue
>> is that in the GIMPLE IL the parameter appears as "memory" as it is an
>> aggregate type.
>>
>> The issue is really RTL expansion of the argument (not sure where
>> that's done) which doesn't take into account that we could happily
>> expand
>>
>>   a$vx0_27 = MEM[(struct  *)&a];
>>   a$vx1_28 = MEM[(struct  *)&a + 16B];
>>   a$vx2_29 = MEM[(struct  *)&a + 32B];
>>   a$vx3_30 = MEM[(struct  *)&a + 48B];
>>
>> if a is expanded to registers.
>
> OK, I'll investigate further.  But the assumption that an aggregate
> parameter is "memory" seems wrong to me.  SRA is explicitly taking the
> address of a parameter which has not previously had its address taken,
> so it seems to me that if it's passed in registers this in itself will
> force it to memory on procedure entry, despite whatever else is
> currently going on with -fno-tree-sra.  But this is all talk, I need to
> do some digging to see what the issues are.  To me the SRA
> transformation seemed "obviously" :) wrong, so I stopped looking...

The MEM_REF quoted does not constitute an address-taking of a.  It's the same semantically as the original component-refs.

Richard.
Comment 7 Bill Schmidt 2016-08-12 17:30:10 UTC
One issue is in the expansion logic, specifically in assign_parm_setup_block in function.c:

  /* If a BLKmode arrives in registers, copy it to a stack slot.  ...  */

The preceding logic handles single-register parameters being placed in a pseudo without a stack slot, but all aggregates are forced to the stack no matter what, so that's where

(mem/c:BLK (plus:DI (reg/f:DI 150 virtual-stack-vars)
        (const_int 128 [0x80])) [0  A128])

comes from initially.  I suspect this generally has to be this way, and subsequent optimization is supposed to try to use the registers where possible and eliminate the stores when they go dead.  The question is whether the optimizer is smart enough to recognize that the BIT_FIELD_REFs correspond to vector extract operations that can be performed on the incoming pseudo.  I'll look further.

Revising title/component accordingly...
Comment 8 Bill Schmidt 2016-08-12 18:09:53 UTC
FYI, adding -mcpu=power9 to the options makes it much easier to read the RTL, as it gets rid of the extra vector swaps needed for POWER8.
Comment 9 Bill Schmidt 2016-08-12 20:51:39 UTC
We do optimize things well for the following:

typedef struct
{
  __vector double vx0;
  __vector double vx1;
  __vector double vx2;
  __vector double vx3;
} vdoublex8_t;

vdoublex8_t
test_vecd8_rotate_left (vdoublex8_t a)
{
  vdoublex8_t b;
  b.vx0 = a.vx0;
  b.vx1 = a.vx1;
  b.vx2 = a.vx2;
  b.vx3 = a.vx3;
  return b;
}

At expansion time we have similar code as above, with all the stack stores and reloads for the argument and the return value, but the optimizer is able to remove all of that.  So the problem only occurs when referencing part of a vector.
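
A minimal sketch of that partial-reference case (hypothetical, not a verified reproducer from this report) would read individual elements of the register-passed aggregate instead of copying whole vectors:

typedef struct
{
  __vector double vx0;
  __vector double vx1;
} vdoublex4_t;

double
test_partial_ref (vdoublex4_t a)
{
  /* Element references into the incoming aggregate, rather than
     whole-vector copies as in the example above.  */
  return a.vx0[0] + a.vx1[1];
}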
Comment 10 Bill Schmidt 2016-08-12 21:18:25 UTC
The dse pass is responsible for removing all the unnecessary stack activity.  I think that we are probably confusing it because the stores are full vector stores, but the loads are vector element loads of smaller size.

Some evidence for this:  I can get the desired code generation by rewriting the code to copy all the vectors in the structure into "scalar vectors" prior to use, and doing the reverse to construct the result structure.  We then get the code we're looking for.

To wit:

typedef struct
	  {
		__vector double vx0;
		__vector double vx1;
		__vector double vx2;
		__vector double vx3;
	  } vdoublex8_t;

vdoublex8_t
test_vecd8_rotate_left (vdoublex8_t a)
{
	__vector double avx0, avx1, avx2, avx3, rvx0, rvx1, rvx2, rvx3;
	__vector double temp;
	vdoublex8_t result;

	avx0 = a.vx0;
	avx1 = a.vx1;
	avx2 = a.vx2;
	avx3 = a.vx3;

	temp = a.vx0;

	/* Copy low dword of vx0 and high dword of vx1 to vx0 high / low.  */
	rvx0[VEC_DW_H] = avx0[VEC_DW_L];
	rvx0[VEC_DW_L] = avx1[VEC_DW_H];
	/* Copy low dword of vx1 and high dword of vx2 to vx1 high / low.  */
	rvx1[VEC_DW_H] = avx1[VEC_DW_L];
	rvx1[VEC_DW_L] = avx2[VEC_DW_H];
	/* Copy low dword of vx2 and high dword of vx3 to vx2 high / low.  */
	rvx2[VEC_DW_H] = avx2[VEC_DW_L];
	rvx2[VEC_DW_L] = avx3[VEC_DW_H];
	/* Copy low dword of vx3 and high dword of vx0 to vx3 high / low.  */
	rvx3[VEC_DW_H] = avx3[VEC_DW_L];
	rvx3[VEC_DW_L] = temp[VEC_DW_H];

	result.vx0 = rvx0;
	result.vx1 = rvx1;
	result.vx2 = rvx2;
	result.vx3 = rvx3;

	return (result);
}

With this we generate pretty tight code with no loads or stores.  (Just lost my network connection to the server I was testing on, so I can't post the code, but it looks good.)
Comment 11 Bill Schmidt 2016-08-15 17:35:21 UTC
With the original test case, -mcpu=power8 is problematic because of the use of the "swapping stores," whose RHS is a vec_select rather than a register or subreg.  This prevents us from saving the RHS of the store for use in replacing subsequent loads, running afoul of this logic in dse.c:record_store ():

  if (GET_CODE (body) == SET
      /* No place to keep the value after ra.  */
      && !reload_completed
      && (REG_P (SET_SRC (body))                   <= this part
          || GET_CODE (SET_SRC (body)) == SUBREG
          || CONSTANT_P (SET_SRC (body)))
      && !MEM_VOLATILE_P (mem)
      /* Sometimes the store and reload is used for truncation and              
         rounding.  */
      && !(FLOAT_MODE_P (GET_MODE (mem)) && (flag_float_store)))

We can circumvent this if we can use stvx to force the parameters to the stack, which is legal since the stack slots are properly aligned.

However, even using -mcpu=power9, we don't handle removing the stores and replacing the partial loads with register logic.
Comment 12 Bill Schmidt 2016-08-15 22:30:33 UTC
The rest of the ugly code (once you ignore the loads/stores) comes from horrible register allocation choices.  We need to understand why we're not making use of the high floating-point registers; there is too much copying back and forth between the low and high vector register sets.
Comment 13 Richard Biener 2016-08-16 07:43:10 UTC
(In reply to Bill Schmidt from comment #11)
> With the original test case, -mcpu=power8 is problematic because of the use
> of the "swapping stores," whose RHS is a vec_select rather than a register
> or subreg.  This prevents us from saving the RHS of the store for use in
> replacing subsequent loads, running afoul of this logic in
> dse.c:record_store ():
> 
>   if (GET_CODE (body) == SET
>       /* No place to keep the value after ra.  */
>       && !reload_completed
>       && (REG_P (SET_SRC (body))                   <= this part
>           || GET_CODE (SET_SRC (body)) == SUBREG
>           || CONSTANT_P (SET_SRC (body)))
>       && !MEM_VOLATILE_P (mem)
>       /* Sometimes the store and reload is used for truncation and
>          rounding.  */
>       && !(FLOAT_MODE_P (GET_MODE (mem)) && (flag_float_store)))
> 
> We can circumvent this if we can use stvx to force the parameters to the
> stack, which is legal since the stack slots are properly aligned.
> 
> However, even using -mcpu=power9, we don't handle removing the stores and
> replacing the partial loads with register logic.

You mean stores like the following?

(insn 13 12 14 2 (set (mem/c:V4SI (plus:DI (reg/f:DI 150 virtual-stack-vars)
                (const_int 112 [0x70])) [1 a+48 S16 A128])
        (vec_select:V4SI (reg:V4SI 190)
            (parallel [
                    (const_int 2 [0x2])
                    (const_int 3 [0x3])
                    (const_int 0 [0])
                    (const_int 1 [0x1])
                ]))) t.c:14 -1
     (nil))

I wonder why dse can't simply force the rhs to a register?  Of course if
power really has stores that do this vec_select but no non-store with
the operation then this might not be valid ...

Now, in the end this example just shows that lowering register passing
only at RTL expansion leads to a load of missed optimizations regarding
parameter setup ... some scheme to apply the lowering on GIMPLE already
would be interesting to explore (albeit quite a bit of work).  We'd
have a second set of "parameter decls" somewhere, like in struct function,
and use that when the IL is in lowered form.  Same for DECL_RESULT of course.
And then the interesting part is whether to expose the stack in some way or
restrict the lowering to decomposition/combining to registers.
Comment 14 Bill Schmidt 2016-08-16 11:33:52 UTC
(In reply to Richard Biener from comment #13)
> 
> You mean stores like the following?
> 
> (insn 13 12 14 2 (set (mem/c:V4SI (plus:DI (reg/f:DI 150 virtual-stack-vars)
>                 (const_int 112 [0x70])) [1 a+48 S16 A128])
>         (vec_select:V4SI (reg:V4SI 190)
>             (parallel [
>                     (const_int 2 [0x2])
>                     (const_int 3 [0x3])
>                     (const_int 0 [0])
>                     (const_int 1 [0x1])
>                 ]))) t.c:14 -1
>      (nil))
> 
> I wonder why dse can't simply force the rhs to a register?  Of course if
> power really has stores that do this vec_select but no non-store with
> the operation then this might not be valid ...

Right, the problem is our limited selection of vector stores on POWER8.  We can either use stvx, which requires that the address be 16-byte aligned, or we can use stxvd2x (what you see here), which has the odd property of being a big-endian store even on a little-endian system.  We can't force the rhs to a register because that would have the unwanted side effect of converting an unaligned store to a forcibly aligned store (masking off the low-order 4 bits of the address).

Now, for parameters, this is legal because the stack slots are 16-byte aligned, but DSE doesn't know that.  I don't think we want to pollute DSE with extra logic for this architectural anomaly, so it's probably best if we do some more work to figure out when we can safely use the aligning store.  (Something that's been on the back burner for a while, but this discovery makes it more important.)

For POWER9, we have unaligned stores with proper endian behavior, so it won't be a problem except for POWER8.

> 
> Now, in the end this example just shows that lowering register passing
> only at RTL expansion leads to a load of missed optimizations regarding
> to parameter setup ... some scheme to apply the lowering on GIMPLE already
> would be interesting to explore (but albeit quite a bit of work).  We'd
> have a second set of "parameter decls" somewhere, like in struct function,
> and use that when the IL is on lowered form.  Same for DECL_RESULT of course.
> And then the interesting part is whether to expose the stack in some way or
> restrict the lowering to decomposition/combining to registers.

Right...lots of work here, and of course some added complexity to ABI implementation for new and existing targets.  But passing small structures/arrays entirely in registers is a not-uncommon ABI feature, and even just exposing those early could be helpful.  (Complications set in when you run out of registers halfway through a structure and so forth, so even just lowering that sort of thing early would not be trivial.)