Bug 31334 - Bad codegen for vector initializer with constants prop'd into a vector initializer
Summary: Bad codegen for vector initializer with constants prop'd into a vector initia...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.3.0
: P3 enhancement
Target Milestone: 4.4.0
Assignee: Andrew Pinski
URL: http://gcc.gnu.org/ml/gcc-patches/200...
Keywords: missed-optimization, patch
Depends on:
Blocks: 4.4pending
  Show dependency treegraph
 
Reported: 2007-03-24 09:14 UTC by Dorit Naishlos
Modified: 2008-03-28 07:30 UTC (History)
3 users (show)

See Also:
Host: powerpc-linux
Target: powerpc-linux
Build: powerpc-linux
Known to work:
Known to fail:
Last reconfirmed: 2007-03-24 13:30:37


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Dorit Naishlos 2007-03-24 09:14:49 UTC
Turns out the code we are generating for vectorized induction ppc is quite terrible - the vector induction variable is advanced by a constant step in the loop (e.g., {4,4,4,4} as in the testcase below). This is the sequence gcc currently creates for altivec in order to generate the {4,4,4,4} vector:

        li 0,4
        stw 0,-48(1)
        lvewx 0,0,9
        vspltw 0,0,0

So, one thing to figure out is why we don't use the immediate form of the splat (vspltiw); The other is - why this sequence ends up getting generated not only before the loop (see insns marked with "<<<1" below), but also inside the loop... (see insns marked with "<<<2" below). 

This is the testcase (it is basically the testcase gcc.dg/vect/no-tree-scev-cprop-vect-iv-1.c with larger loop count to avoid complete unrolling):

int main1 (int X)
{
  int s = X;
  int i;
  for (i = 0; i < 96; i++)
    s += i;
  return s;
}

compiled as follows:
gcc -O2 -ftree-vectorize -maltivec -fno-tree-scev-cprop -S t.c


        li 0,4		<<<1
        stw 0,-48(1)	<<<1
        ld 9,.LC1@toc(2)
        li 0,23
        mr 11,3
        mtctr 0
        lvx 1,0,9
        addi 9,1,-48
        vor 13,1,1
        lvewx 0,0,9	<<<1
        vspltw 0,0,0	<<<1	
        vadduwm 1,1,0
        .p2align 4,,15
.L2:
        li 0,4		<<<2
        addi 9,1,-48
        vadduwm 13,13,1
        stw 0,-48(1)	<<<2
        lvewx 0,0,9	<<<2
        vspltw 0,0,0	<<<2
        vadduwm 1,1,0
        bdnz .L2
Comment 1 David Edelsohn 2007-03-24 13:30:37 UTC
Loading the constant through memory appears to be related to the vector plus expander.

118t.final_cleanup shows:

  vect_vec_iv_.32 = vect_cst_.30 + {4, 4, 4, 4};

and 120r.expand shows:

;; vect_vec_iv_.32 = vect_cst_.30 + {4, 4, 4, 4}
(insn 12 10 13 (set (reg:SI 134)
        (const_int 4 [0x4])) -1 (nil)
    (nil))

(insn 13 12 14 (set (mem/c/i:SI (plus:SI (reg/f:SI 115 virtual-stack-vars)
                (const_int 8 [0x8])) [2 S4 A128])
        (reg:SI 134)) -1 (nil)
    (nil))

(insn 14 13 15 (parallel [
            (set (reg:V4SI 133)
                (mem/c/i:V4SI (plus:SI (reg/f:SI 115 virtual-stack-vars)
                        (const_int 8 [0x8])) [2 S16 A128]))
            (unspec [
                    (const_int 0 [0x0])
                ] 196)
        ]) -1 (nil)
    (nil))

(insn 15 14 16 (set (reg:V4SI 133)
        (vec_duplicate:V4SI (vec_select:SI (reg:V4SI 133)
                (parallel [
                        (const_int 0 [0x0])
                    ])))) -1 (nil)
    (nil))

(insn 16 15 0 (set (reg:V4SI 126 [ vect_vec_iv_.32 ])
        (plus:V4SI (reg:V4SI 127 [ vect_cst_.30 ])
            (reg:V4SI 133))) -1 (nil)
    (nil))

The constants appear to be constructor expressions:

 <plus_expr 0x4019d700
    type <vector_type 0x400e7ea0 __vector signed int
        type <integer_type 0x40074340 int sizes-gimplified public SI
            size <integer_cst 0x40068500 constant invariant 32>
            unit size <integer_cst 0x400682a0 constant invariant 4>
            align 32 symtab 0 alias set 2 canonical type 0x40074340 precision 32 min <integer_cst 0x400684a0 -2147483648> max <integer_cst 0x400684c0 2147483647>
            pointer_to_this <pointer_type 0x40074e38>>
        V4SI
        size <integer_cst 0x40068740 constant invariant 128>
        unit size <integer_cst 0x40068760 constant invariant 16>
        align 128 symtab 0 alias set -1 canonical type 0x400e7ea0 nunits 4>

    arg 0 <var_decl 0x401a3150 vect_vec_iv_.32 type <vector_type 0x400e7ea0 __vector signed int>
        used ignored V4SI file d.c line 2 size <integer_cst 0x40068740 128> unit size <integer_cst 0x40068760 16>
        align 128 context <function_decl 0x4015ae00 main1>
        (reg:V4SI 126 [ vect_vec_iv_.32 ])>
    arg 1 <constructor 0x401a6120 type <vector_type 0x400e7ea0 __vector signed int>
        constant>>


 <constructor 0x401a6120
    type <vector_type 0x400e7ea0 __vector signed int
        type <integer_type 0x40074340 int sizes-gimplified public SI
            size <integer_cst 0x40068500 constant invariant 32>
            unit size <integer_cst 0x400682a0 constant invariant 4>
            align 32 symtab 0 alias set 2 canonical type 0x40074340 precision 32 min <integer_cst 0x400684a0 -2147483648> max <integer_cst 0x400684c0 2147483647>
            pointer_to_this <pointer_type 0x40074e38>>
        V4SI
        size <integer_cst 0x40068740 constant invariant 128>
        unit size <integer_cst 0x40068760 constant invariant 16>
        align 128 symtab 0 alias set -1 canonical type 0x400e7ea0 nunits 4>
    constant>


And the constructors are thrown into memory.  The original expression with a constant never gets converted to RTL and never is evaluated as an easy vector constant.
Comment 2 Andrew Pinski 2007-03-25 08:11:55 UTC
Here is a reduced testcase which shows the same issue:
#define vector __attribute__((__vector_size__(16) ))
vector int f()
{
  int t = 4;
  return (vector int){t,t,t,t};
}
vector int f1()
{
  return (vector int){4,4,4,4};
}


----------------- cut -----------------
These above functions should have the same code gen but they don't.
Comment 3 Andrew Pinski 2007-03-25 08:26:10 UTC
Here is a patch which fixes the problem:
Index: ../../gcc/config/rs6000/rs6000.c
===================================================================
--- ../../gcc/config/rs6000/rs6000.c    (revision 123180)
+++ ../../gcc/config/rs6000/rs6000.c    (working copy)
@@ -2588,6 +2588,7 @@
 
   if (n_var == 0)
     {
+      rtx const_vec = gen_rtx_CONST_VECTOR (mode, XVEC (vals, 0));
       if (mode != V4SFmode && all_const_zero)
        {
          /* Zero register.  */
@@ -2595,10 +2596,10 @@
                                  gen_rtx_XOR (mode, target, target)));
          return;
        }
-      else if (mode != V4SFmode && easy_vector_constant (vals, mode))
+      else if (mode != V4SFmode && easy_vector_constant (const_vec, mode))
        {
          /* Splat immediate.  */
-         emit_insn (gen_rtx_SET (VOIDmode, target, vals));
+         emit_insn (gen_rtx_SET (VOIDmode, target, const_vec));
          return;
        }
       else if (all_same)
@@ -2606,7 +2607,7 @@
       else
        {
          /* Load from constant pool.  */
-         emit_move_insn (target, gen_rtx_CONST_VECTOR (mode, XVEC (vals, 0)));
+         emit_move_insn (target, const_vec);
          return;
        }
     }



-------- CUT---------------
The problem is that we get a parallel:V4SI with vals which is obviously not an easy_vector_constant so we instead create a CONST_VECTOR to check if we have an easy vector constant in this case.  Yes we should most likely get a VECTOR_CST but the reason why we don't is most likely because we prop'd the constant into vector initializer and we never convert that vector initializer into a VECTOR_CST (I had a patch once for doing that in fold but I never got around to fully testing it).

Note the above patch was not bootstrapped or tested.  It was tested only on my testcase in comment #2 and the orignal testcase.

For the orginal testcase we get:
L2:
        vadduwm v13,v13,v0
        vadduwm v0,v0,v12
        bdnz L2
for the inner most loop which now looks good.


I am thinking we should not split this splat up until after reload anyways so it can be pulled out of a loop.
Comment 4 David Edelsohn 2007-03-25 20:56:26 UTC
I do not believe the patch will help with the original missed optimization because  the backend never sees a direct assignment from the CONSTRUCTOR -- it already is placed in memory.  The example in comment #2 is different.
Comment 5 Andrew Pinski 2007-03-25 21:55:14 UTC
(In reply to comment #4)
> I do not believe the patch will help with the original missed optimization
> because  the backend never sees a direct assignment from the CONSTRUCTOR -- it
> already is placed in memory.  The example in comment #2 is different.

Did I miss something because for the original example the back-end does see the CONSTRUCTOR in rs6000_expand_vector_init as a parallel with the mode of V4SI.  
The expansion of vect_cst_.30 + {4, 4, 4, 4} calls rs6000_expand_vector_init with vals being:
(parallel:V4SI  ([ (const_int 4),(const_int 4),(const_int 4),(const_int 4) ]) )

So when we call easy_vector_constant with vals, we always get false as it is not a const_vector.  My patch changes it so we call easy_vector_constant with a const_vector as we already proved in the loop above it is made up of constant elements as n_var is zero.


For the orginal testcase, the assembly now looks like (on powerpc-darwin):
_main1:
        mfspr r0,256
        stw r0,-4(r1)
        oris r0,r0,0xc00c
        mtspr 256,r0
        lis r2,ha16(LC0)
        la r2,lo16(LC0)(r2)
        vspltisw v0,4
        lvx v1,0,r2
        li r2,23
        mtctr r2
        vor v12,v0,v0
        mr r0,r3
        vor v13,v1,v1
        vadduwm v0,v1,v0
L2:
        vadduwm v13,v13,v0
        vadduwm v0,v0,v12
        bdnz L2
        vsldoi v0,v13,v13,8
        addi r2,r1,-20
        lwz r12,-4(r1)
        vadduwm v0,v0,v13
        vsldoi v1,v0,v0,12
        vadduwm v1,v1,v0
        stvewx v1,0,r2
        lwz r3,-20(r1)
        add r3,r3,r0
        mtspr 256,r12
        blr
        .const
        .align 4
LC0:
        .long   0
        .long   1
        .long   2
        .long   3


Notice how we have vspltisw outside the loop.
Comment 6 Andrew Pinski 2007-03-25 22:05:57 UTC
Now the patch does not solve the following testcase:
int main1 (int X)
{
  int s = X;
  int i;
  for (i = 0; i < 96*16; i+=16)
    s += i;
  return s;
}


For this one to be fixed, we need to define an insn which does vec_duplicate as an insn_and_split instead of expanding it up in the rs6000_expand_vector_init.
Comment 7 Andrew Pinski 2007-03-25 22:36:22 UTC
Something like:
(define_insn_and_split "altivec_dup<mode>"
  [(set (match_operand:V 0 "register_operand" "v")
        (vec_duplicate: (match_operand: 0 "r")))
   (clobber (match_operand:V 3 "memory_operand" "=Z"))]
  "TARGET_ALTIVEC"
  "#"
  "&& reload_completed"
  ....


Which then will be generated from rs6000_expand_vector_init.  I can write this if you want, it is just I cannot test this until Monday.
Comment 8 Dorit Naishlos 2007-04-03 20:46:34 UTC
(In reply to comment #7)
> Something like:
> (define_insn_and_split "altivec_dup<mode>"
>   [(set (match_operand:V 0 "register_operand" "v")
>         (vec_duplicate: (match_operand: 0 "r")))
>    (clobber (match_operand:V 3 "memory_operand" "=Z"))]
>   "TARGET_ALTIVEC"
>   "#"
>   "&& reload_completed"
>   ....
> Which then will be generated from rs6000_expand_vector_init.  I can write this
> if you want, it is just I cannot test this until Monday.

yes, please... I'll be very happy to see this fixed. (thanks!!)

Comment 9 Andrew Pinski 2007-12-03 04:33:24 UTC
Mine, but it won't be until after xmas.
Comment 10 Andrew Pinski 2007-12-16 04:53:01 UTC
Aftter 
2007-11-06  Jakub Jelinek  <jakub@redhat.com>

        PR tree-optimization/33993
        * tree-vect-transform.c (vect_get_constant_vectors): Use build_vector
        rather than build_constructor_from_list if all list values are
        constants.
        (get_initial_def_for_induction): Use build_vector instead of
        build_constructor_from_list.

We get for the original testcase:
.L3:
        vadduwm 1,1,0
        vadduwm 0,0,13
        bdnz .L3

As the vectorizer now VECTOR_CST which are handled correctly.
Comment 11 Andrew Pinski 2008-03-27 17:17:11 UTC
(In reply to comment #6)
> Now the patch does not solve the following testcase:
> int main1 (int X)

This is really PR 32107 and PR 32110 which I have patches for, and I will be submitting them this weekend.

The first issue I posted as http://gcc.gnu.org/ml/gcc-patches/2008-03/msg01700.html .
Comment 12 Andrew Pinski 2008-03-28 07:28:00 UTC
Subject: Bug 31334

Author: pinskia
Date: Fri Mar 28 07:27:11 2008
New Revision: 133674

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=133674
Log:
2008-03-28  Andrew Pinski  <andrew_pinski@playstation.sony.com>

        PR target/31334
        * config/rs6000/rs6000.c (rs6000_expand_vector_init): Create a
        const_vector when all the vectors are constant.

2008-03-28  Andrew Pinski  <andrew_pinski@playstation.sony.com>

        PR target/31334
        * gcc.target/powerpc/altivec-25.c: Nnew testcase.


Added:
    trunk/gcc/testsuite/gcc.target/powerpc/altivec-25.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/rs6000/rs6000.c
    trunk/gcc/testsuite/ChangeLog

Comment 13 Andrew Pinski 2008-03-28 07:30:14 UTC
The original issue has now been fixed, the rest is registered as PR 32107 and PR 32110.