This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.


[Bug rtl-optimization/74585] powerpc64: Very poor code generation for homogeneous vector aggregates passed in registers

From: "wschmidt at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
To: gcc-bugs at gcc dot gnu dot org
Date: Fri, 12 Aug 2016 21:18:25 +0000
Subject: [Bug rtl-optimization/74585] powerpc64: Very poor code generation for homogeneous vector aggregates passed in registers
Authentication-results: sourceware.org; auth=none
Auto-submitted: auto-generated
References: <bug-74585-4@http.gcc.gnu.org/bugzilla/>

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=74585

--- Comment #10 from Bill Schmidt <wschmidt at gcc dot gnu.org> ---
The dse pass is responsible for removing all the unnecessary stack activity.  I
think that we are probably confusing it because the stores are full vector
stores, but the loads are vector element loads of smaller size.

Some evidence for this:  I can get the desired code generation by rewriting the
code to copy each vector in the structure into a local vector variable (a
"scalar vector") prior to use, and doing the reverse to construct the result
vector.

To wit:

typedef struct
          {
                __vector double vx0;
                __vector double vx1;
                __vector double vx2;
                __vector double vx3;
          } vdoublex8_t;

vdoublex8_t
test_vecd8_rotate_left (vdoublex8_t a)
{
        __vector double avx0, avx1, avx2, avx3, rvx0, rvx1, rvx2, rvx3;
        __vector double temp;
        vdoublex8_t result;

        avx0 = a.vx0;
        avx1 = a.vx1;
        avx2 = a.vx2;
        avx3 = a.vx3;

        temp = a.vx0;

        /* Copy low dword of vx0 and high dword of vx1 to vx0 high / low.  */
        rvx0[VEC_DW_H] = avx0[VEC_DW_L];
        rvx0[VEC_DW_L] = avx1[VEC_DW_H];
        /* Copy low dword of vx1 and high dword of vx2 to vx1 high / low.  */
        rvx1[VEC_DW_H] = avx1[VEC_DW_L];
        rvx1[VEC_DW_L] = avx2[VEC_DW_H];
        /* Copy low dword of vx2 and high dword of vx3 to vx2 high / low.  */
        rvx2[VEC_DW_H] = avx2[VEC_DW_L];
        rvx2[VEC_DW_L] = avx3[VEC_DW_H];
        /* Copy low dword of vx3 and high dword of vx0 to vx3 high / low.  */
        rvx3[VEC_DW_H] = avx3[VEC_DW_L];
        rvx3[VEC_DW_L] = temp[VEC_DW_H];

        result.vx0 = rvx0;
        result.vx1 = rvx1;
        result.vx2 = rvx2;
        result.vx3 = rvx3;

        return (result);
}

With this we generate pretty tight code with no loads or stores.  (I just lost
my network connection to the server I was testing on, so I can't post the
code, but it looks good.)
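
For anyone who wants to try the same pattern without the original test case at
hand, here is a minimal sketch.  It rests on two assumptions that are not taken
from this comment: the type v2df uses the generic GCC/Clang vector_size
attribute as a stand-in for __vector double, and VEC_DW_H / VEC_DW_L are
defined as 0 and 1 (the real definitions in the test case are not shown here
and may depend on endianness).  The function name rotate_left_sketch is made
up for illustration.

```c
/* Assumption: generic vector type standing in for __vector double.  */
typedef double v2df __attribute__ ((vector_size (16)));

/* Assumption: element indices; the original definitions are not shown
   in this comment and may differ.  */
#define VEC_DW_H 0
#define VEC_DW_L 1

typedef struct
{
  v2df vx0;
  v2df vx1;
  v2df vx2;
  v2df vx3;
} vdoublex8_t;

vdoublex8_t
rotate_left_sketch (vdoublex8_t a)
{
  /* Copy every member of the aggregate into a local vector first, so
     the element accesses below operate on locals rather than on the
     struct itself.  */
  v2df avx0 = a.vx0;
  v2df avx1 = a.vx1;
  v2df avx2 = a.vx2;
  v2df avx3 = a.vx3;

  /* Result vectors are zero-initialized so that no element is read
     before it is written.  */
  v2df rvx0 = { 0.0, 0.0 };
  v2df rvx1 = { 0.0, 0.0 };
  v2df rvx2 = { 0.0, 0.0 };
  v2df rvx3 = { 0.0, 0.0 };
  vdoublex8_t result;

  /* Each result vector takes the low dword of its own source vector
     and the high dword of the next one; vx3 wraps around to vx0.  */
  rvx0[VEC_DW_H] = avx0[VEC_DW_L];
  rvx0[VEC_DW_L] = avx1[VEC_DW_H];
  rvx1[VEC_DW_H] = avx1[VEC_DW_L];
  rvx1[VEC_DW_L] = avx2[VEC_DW_H];
  rvx2[VEC_DW_H] = avx2[VEC_DW_L];
  rvx2[VEC_DW_L] = avx3[VEC_DW_H];
  rvx3[VEC_DW_H] = avx3[VEC_DW_L];
  rvx3[VEC_DW_L] = avx0[VEC_DW_H];

  /* Build the returned aggregate from whole vectors only.  */
  result.vx0 = rvx0;
  result.vx1 = rvx1;
  result.vx2 = rvx2;
  result.vx3 = rvx3;

  return result;
}
```

Compiling this with -O2 -S and reading the assembly is one way to check
whether the stack traffic is gone on a given target.  The sketch only shows
the source-level shape of the workaround and makes no claim about the code
generated on any particular target.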

