[PATCH][RFC] middle-end: slp: new implementation of complex numbers

Richard Biener rguenther@suse.de
Tue Jun 22 12:08:06 GMT 2021


On Mon, 21 Jun 2021, Tamar Christina wrote:

> Hi Richi,
> 
> This patch is still very much incomplete and I do know that it is 
> missing things, but it is complete enough that the examples work and 
> it lets me show what I'm working towards.
> 
> Note that this approach will remove a lot of code in 
> tree-vect-slp-patterns, but to keep the diff readable I've left it in 
> and just commented out the calls or removed them where needed.
> 
> The patch rewrites the complex numbers detection by splitting the 
> detection of structure from dataflow analysis.  In principle the biggest 
> difference between this and the previous implementation is that instead 
> of trying to detect valid complex operations it *makes* an operation a 
> valid complex operation.
> 
> To do this each operation gets a dual optab which matches the same structure but
> has no dataflow requirement.
> 
> i.e. in this patch I added four: ADDSUB, SUBADD, MUL_ADDSUB, MUL_SUBADD.
> 
> There is then a mapping between these and their variants with the dataflow:
> 
> * ADDSUB -> COMPLEX_ADD_ROT270
> * SUBADD -> COMPLEX_ADD_ROT90
> * MUL_ADDSUB -> COMPLEX_MUL_CONJ
> * MUL_SUBADD -> COMPLEX_MUL
> 
> with the intention that when we detect the structure of an operation we query
> the backend for both optabs.
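To make the mapping above concrete, here is a minimal scalar sketch (plain C, not GCC code; all names are illustrative) of the two-lane semantics: ADDSUB/SUBADD applied to a lane-swapped second operand compute exactly the corresponding complex-add rotations, which is why each pair can share one structural pattern.

```c
#include <assert.h>

/* Editorial sketch: scalar models of the dataflow-free ADDSUB/SUBADD
   operations and the complex rotations they map to.  Each "vector" is
   one complex pair: v[0] = real lane, v[1] = imaginary lane.  */

static void addsub (const double a[2], const double b[2], double out[2])
{
  out[0] = a[0] + b[0];   /* even lane: add      */
  out[1] = a[1] - b[1];   /* odd lane:  subtract */
}

static void subadd (const double a[2], const double b[2], double out[2])
{
  out[0] = a[0] - b[0];
  out[1] = a[1] + b[1];
}

/* a + b * -i, i.e. COMPLEX_ADD_ROT270.  */
static void cadd270 (const double a[2], const double b[2], double out[2])
{
  out[0] = a[0] + b[1];
  out[1] = a[1] - b[0];
}

/* a + b * i, i.e. COMPLEX_ADD_ROT90.  */
static void cadd90 (const double a[2], const double b[2], double out[2])
{
  out[0] = a[0] - b[1];
  out[1] = a[1] + b[0];
}

/* ADDSUB equals ROT270 once b's lanes are swapped -- the very permute
   the old pattern matcher had to prove was present in the dataflow.  */
static int addsub_is_rot270 (const double a[2], const double b[2])
{
  double swapped[2] = { b[1], b[0] };
  double x[2], y[2];
  addsub (a, swapped, x);
  cadd270 (a, b, y);
  return x[0] == y[0] && x[1] == y[1];
}

static int subadd_is_rot90 (const double a[2], const double b[2])
{
  double swapped[2] = { b[1], b[0] };
  double x[2], y[2];
  subadd (a, swapped, x);
  cadd90 (a, b, y);
  return x[0] == y[0] && x[1] == y[1];
}
```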
> 
> This should result in one of four states:
> 
>  * Supports neither: move on.
>  * Supports ADDSUB only: rewrite using ADDSUB, set type to 'cannot_transform'.
>  * Supports COMPLEX_ADD_ROT270 only: rewrite using ADDSUB, set type to 'must_transform'.
>  * Supports both: rewrite using ADDSUB, set type to 'can_transform'.
> 
> The idea behind `can_transform` is to check the cost of the inverse
> permute needed to use the complex operation; if that is very expensive then
> stick to ADDSUB.  This requires the target to be able to cost the operations
> reasonably accurately.
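The four-way outcome and the cost check it defers can be sketched as follows (hypothetical illustration only; none of these enum or function names exist in GCC, and the cost model is a stand-in):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the decision states described above.  */
typedef enum {
  NOT_SUPPORTED,      /* neither optab: move on                 */
  CANNOT_TRANSFORM,   /* plain ADDSUB only                      */
  MUST_TRANSFORM,     /* COMPLEX_ADD_ROT270 only                */
  CAN_TRANSFORM       /* both: decide later based on costs      */
} transform_kind;

static transform_kind
classify (bool have_addsub, bool have_complex)
{
  if (!have_addsub && !have_complex)
    return NOT_SUPPORTED;
  if (have_addsub && !have_complex)
    return CANNOT_TRANSFORM;
  if (!have_addsub)
    return MUST_TRANSFORM;
  return CAN_TRANSFORM;
}

/* For CAN_TRANSFORM the complex form wins only if the inverse permute
   it removes costs more than any extra cost of the complex insn.  */
static bool
prefer_complex (transform_kind k, int permute_cost, int extra_insn_cost)
{
  if (k == MUST_TRANSFORM)
    return true;
  if (k != CAN_TRANSFORM)
    return false;
  return permute_cost > extra_insn_cost;
}
```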
> 
> So for ADD this looks like
> 
>  === vect_match_slp_patterns ===
>  Analyzing SLP tree 0x494e970 for patterns
>  Found ADDSUB pattern in SLP tree
>  Target does not support ADDSUB for vector type vector(4) float
>  Found COMPLEX_ADD_ROT270 pattern in SLP tree
>  Target supports COMPLEX_ADD_ROT270 vectorization with mode vector(4) float
> Pattern matched SLP tree
> node 0x494e970 (max_nunits=4, refcnt=1)
> op template: REALPART_EXPR <*_10> = _23;
>   stmt 0 REALPART_EXPR <*_10> = _23;
>   stmt 1 IMAGPART_EXPR <*_10> = _22;
>   children 0x494ea00
> node 0x494ea00 (max_nunits=4, refcnt=1)
> op template: slp_patt_39 = .ADDSUB (_23, _23);
>   stmt 0 _23 = _6 + _13;
>   stmt 1 _22 = _12 - _8;
>   children 0x494eb20 0x494ebb0
> node 0x494eb20 (max_nunits=4, refcnt=1)
> op template: _13 = REALPART_EXPR <*_3>;
>   stmt 0 _13 = REALPART_EXPR <*_3>;
>   stmt 1 _12 = IMAGPART_EXPR <*_3>;
> node 0x494ebb0 (max_nunits=4, refcnt=1)
> op: VEC_PERM_EXPR
>   { }
>   lane permutation { 0[1] 0[0] }
>   children 0x494ec40
> node 0x494ec40 (max_nunits=1, refcnt=2)
> op template: _8 = REALPART_EXPR <*_5>;
>   stmt 0 _8 = REALPART_EXPR <*_5>;
>   stmt 1 _6 = IMAGPART_EXPR <*_5>;
>   load permutation { 0 1 }
> 
> and later during optimize_slp we get
> 
> Transforming SLP expression from ADDSUB to COMPLEX_ADD_ROT270
> processing node 0x494ebb0
> simplifying permute node 0x494ebb0
> Optimized SLP instances:
> node 0x494e970 (max_nunits=4, refcnt=1)
> op template: REALPART_EXPR <*_10> = _23;
>    stmt 0 REALPART_EXPR <*_10> = _23;
>    stmt 1 IMAGPART_EXPR <*_10> = _22;
>    children 0x494ea00
> node 0x494ea00 (max_nunits=4, refcnt=1)
> op template: slp_patt_39 = .COMPLEX_ADD_ROT270 (_23, _23);
>    stmt 0 _23 = _6 + _13;
>    stmt 1 _22 = _12 - _8;
>    children 0x494eb20 0x494ebb0
> node 0x494eb20 (max_nunits=4, refcnt=1)
> op template: _13 = REALPART_EXPR <*_3>;
>    stmt 0 _13 = REALPART_EXPR <*_3>;
>    stmt 1 _12 = IMAGPART_EXPR <*_3>;
> node 0x494ebb0 (max_nunits=4, refcnt=1)
> op: VEC_PERM_EXPR
>    { }
>    lane permutation { 0[0] 0[1] }
>    children 0x494ec40
> node 0x494ec40 (max_nunits=1, refcnt=2)
> op template: _8 = REALPART_EXPR <*_5>;
>    stmt 0 _8 = REALPART_EXPR <*_5>;
>    stmt 1 _6 = IMAGPART_EXPR <*_5>;
> 
> Now I still have to elide the VEC_PERM_EXPR here but that's easy.

So having skimmed half of the patch - this means SLP pattern recog
will initially recognize { +, -, +, - } as ADDSUB for example
but not factor in lane permutes from loads yet.  Now, suppose we
have { +, -, -, + } seen in pattern recog - how's that handled?
It might feed operations where we'd like to have inputs permuted
and thus would end up with ADDSUB in the end?

That said, you do

+         /* Check to see if this node can be transformed during permute
+            materialization.  */
+         bool patt_trans_cand_p = is_pattern_convert_candidate_p (node);
+         if (patt_trans_cand_p)
+           bitmap_set_bit (n_transform, idx);

in the propagation stage (but this looks like a static marking).

And then just for each of those candidates you somehow "undo" the
permutes on the SLP children (by modifying it in-place - uh, see below).
But if you figure one child that is not permuted you've done that
anyway but stop.

So what I've expected is that this "transform" is done during
propagation (where you now set the n_transform bit), by means of
detecting the candidate as materialization point (but I'd have to
recap the kind of permutes we expect to be incoming).  That is,
for a specific set of permutes on the successor edges claim
the node can handle the permute and thus propagate unpermuted
upwards.

At the materialization stage you'd then transform the node.

Btw, how do we handle targets that can do complex_addsub but not
addsub?

> This works for ADD, but it doesn't work well when there's a complicated sequence
> of loads.  So for example
> 
> #include <complex.h>
> 
> #define N 200
> void g (double complex a[restrict N], double complex b[restrict N],
> 	double complex c[restrict N])
> {
>   for (int i=0; i < N; i++)
>     {
>       c[i] =  a[i] * b[i];
>     }
> }
> 
> will result in an SLP tree where each operand of the multiply does not get to
> see all the original vectors:
> 
> Final SLP tree for instance 0x5678a30:
> node 0x55965a0 (max_nunits=4, refcnt=2)
> op template: REALPART_EXPR <*_7> = _25;
>   stmt 0 REALPART_EXPR <*_7> = _25;
>   stmt 1 IMAGPART_EXPR <*_7> = _26;
>   children 0x5596630
> node 0x5596630 (max_nunits=4, refcnt=2)
> op: VEC_PERM_EXPR
>   stmt 0 _25 = _17 - _22;
>   stmt 1 _26 = _23 + _24;
>   lane permutation { 0[0] 1[1] }
>   children 0x5596a20 0x5596ab0
> node 0x5596a20 (max_nunits=1, refcnt=1)
> op template: _25 = _17 - _22;
>   { }
>   children 0x55966c0 0x5596870
> node 0x55966c0 (max_nunits=4, refcnt=3)
> op template: _17 = _10 * _19;
>   stmt 0 _17 = _10 * _19;
>   stmt 1 _23 = _10 * _18;
>   children 0x5596750 0x55967e0
> node 0x5596750 (max_nunits=4, refcnt=2)
> op template: _10 = REALPART_EXPR <*_3>;
>   stmt 0 _10 = REALPART_EXPR <*_3>;
>   stmt 1 _10 = REALPART_EXPR <*_3>;
>   load permutation { 0 0 }
> node 0x55967e0 (max_nunits=4, refcnt=2)
> op template: _19 = REALPART_EXPR <*_5>;
>   stmt 0 _19 = REALPART_EXPR <*_5>;
>   stmt 1 _18 = IMAGPART_EXPR <*_5>;
>   load permutation { 0 1 }
> node 0x5596870 (max_nunits=4, refcnt=3)
> op template: _22 = _9 * _18;
>   stmt 0 _22 = _9 * _18;
>   stmt 1 _24 = _9 * _19;
>   children 0x5596900 0x5596990
> node 0x5596900 (max_nunits=4, refcnt=2)
> op template: _9 = IMAGPART_EXPR <*_3>;
>   stmt 0 _9 = IMAGPART_EXPR <*_3>;
>   stmt 1 _9 = IMAGPART_EXPR <*_3>;
>   load permutation { 1 1 }
> node 0x5596990 (max_nunits=4, refcnt=2)
> op template: _18 = IMAGPART_EXPR <*_5>;
>   stmt 0 _18 = IMAGPART_EXPR <*_5>;
>   stmt 1 _19 = REALPART_EXPR <*_5>;
>   load permutation { 1 0 }
> node 0x5596ab0 (max_nunits=1, refcnt=1)
> op template: _26 = _23 + _24;
>   { }
>   children 0x55966c0 0x5596870
> 
> So depending on the node each multiply either sees REAL 3 + REAL 5 + IMAG 5 or
> IMAG 3 + IMAG 5 + REAL 5.
> 
> Since we are removing the TWO_OPERANDS node we need to drop one of the
> multiplies, and so we need to give the remaining one the original two vectors
> as parameters.  The current implementation emits a permute operation to
> combine the loads again and later pushes the permute down.  This is
> problematic as it again requires us to do dataflow analysis early.
> 
> To counter this, in the patch I have changed loads to no longer come out of
> build_slp with LOAD_PERMUTES but instead to have a VEC_PERM above each load.
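To illustrate the representation change: a VEC_PERM node above a load simply reorders the child's lanes, which the dumps print as e.g. "lane permutation { 0[1] 0[0] }" (operand 0, lanes 1 then 0). A two-lane scalar model (editorial sketch, not GCC code):

```c
#include <assert.h>

/* Editorial model of a two-lane VEC_PERM node sitting above a load:
   perm[i] selects which lane of the child feeds lane i of the parent,
   so { 1, 0 } is the even/odd lane swap used for complex operands.  */
static void
apply_lane_perm (const double child[2], const unsigned perm[2],
                 double out[2])
{
  out[0] = child[perm[0]];
  out[1] = child[perm[1]];
}
```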

Yep.  I've been there before (not sure if I ever sent you my 
work-in-progress here).  There are a few things wrong in your patch, but I
guess doing this exactly for the permutes optimize_slp handles would be fine.

We should do this independently of the stuff above; I can handle
this and will prepare a patch for it later.

> This is only temporary and the idea is that optimize SLP will push them back
> down into the loads once it's deemed there are still permutes needed.
> 
> So COMPLEX MUL becomes:
> 
> Final SLP tree for instance 0x46d9c80:
> node 0x45f76c0 (max_nunits=4, refcnt=2)
> op template: REALPART_EXPR <*_7> = _25;
>   stmt 0 REALPART_EXPR <*_7> = _25;
>   stmt 1 IMAGPART_EXPR <*_7> = _26;
>   children 0x45f7750
> node 0x45f7750 (max_nunits=4, refcnt=2)
> op: VEC_PERM_EXPR
>   stmt 0 _25 = _17 - _22;
>   stmt 1 _26 = _23 + _24;
>   lane permutation { 0[0] 1[1] }
>   children 0x45f7bd0 0x45f7c60
> node 0x45f7bd0 (max_nunits=1, refcnt=1)
> op template: _25 = _17 - _22;
>   { }
>   children 0x45f77e0 0x45f7a20
> node 0x45f77e0 (max_nunits=4, refcnt=3)
> op template: _17 = _10 * _19;
>   stmt 0 _17 = _10 * _19;
>   stmt 1 _23 = _10 * _18;
>   children 0x45f7870 0x45f7990
> node 0x45f7870 (max_nunits=4, refcnt=2)
> op: VEC_PERM_EXPR
>   { }
>   lane permutation { 0[0] 0[0] }
>   children 0x45f7900
> node 0x45f7900 (max_nunits=1, refcnt=4)
> op template: _10 = REALPART_EXPR <*_3>;
>   stmt 0 _10 = REALPART_EXPR <*_3>;
>   stmt 1 _9 = IMAGPART_EXPR <*_3>;
>   load permutation { 0 1 }
> node 0x45f7990 (max_nunits=4, refcnt=3)
> op template: _19 = REALPART_EXPR <*_5>;
>   stmt 0 _19 = REALPART_EXPR <*_5>;
>   stmt 1 _18 = IMAGPART_EXPR <*_5>;
>   load permutation { 0 1 }
> node 0x45f7a20 (max_nunits=4, refcnt=3)
> op template: _22 = _9 * _18;
>   stmt 0 _22 = _9 * _18;
>   stmt 1 _24 = _9 * _19;
>   children 0x45f7ab0 0x45f7b40
> node 0x45f7ab0 (max_nunits=4, refcnt=2)
> op: VEC_PERM_EXPR
>   { }
>   lane permutation { 0[1] 0[1] }
>   children 0x45f7900
> node 0x45f7b40 (max_nunits=4, refcnt=2)
> op: VEC_PERM_EXPR
>   { }
>   lane permutation { 0[1] 0[0] }
>   children 0x45f7990
> node 0x45f7c60 (max_nunits=1, refcnt=1)
> op template: _26 = _23 + _24;
>   { }
>   children 0x45f77e0 0x45f7a20
> 
> This representation has a number of advantages:
> 
>  * complex mul becomes trivial, as trivial as add.
>  * add now detects itself as part of a sequence forming complex mul, i.e. if
>    you only have the add it will be used, instead of failing to follow the
>    dataflow as happens now.
>  * It opens the door a bit more to vectorize the first loop in x264.  The code
>    there has a cross iteration mix of loading the full vector but in an order
>    that the vectorizer thinks it can't use a single linear load for.  e.g.
> 
>    .. = (a[0] + b[0]) - (a[4] + b[4])
>    .. = (a[1] + b[1]) - (a[5] + b[5])
>    .. = (a[2] + b[2]) - (a[6] + b[6])
>    .. = (a[3] + b[3]) - (a[7] + b[7])
> 
>  Ultimately it thinks it's cheaper to use scalar loads here, while (at least
>  for aarch64) if we see the unrolled loop together we can pattern match this
>  sequence and rewrite it to something that uses the linear load.  This works
>  because the additions and subtracts are widening, and the widening
>  arithmetic operations we already have read only half the vector.
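The half-vector observation can be sketched in scalar C (editorial example, not the actual aarch64 pattern): a widening add over bytes consumes only 8 of a 16-byte vector per operation, so one linear 16-byte load can feed two widening operations over its low and high halves.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: a widening vector add over uint8_t produces uint16_t results
   and therefore consumes only half (8 elements) of a 16-byte vector.
   Calling it once on the low half and once on the high half covers one
   full linear 16-byte load.  */
static void
widen_add_half (const uint8_t *a, const uint8_t *b, uint16_t out[8])
{
  for (int i = 0; i < 8; i++)
    out[i] = (uint16_t) a[i] + (uint16_t) b[i];
}
```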
> 
>  For this to work, we need to see all 16 bytes loads, so the full unrolled loop
>  of 4 iterations. Which currently doesn't happen for NEON but does for SVE. (though
>  I wonder if now that you've added SLP decomposition if the initial SLP build
>  shouldn't allow building of SLP trees that are wider than the vector length and
>  later just decompose it or bump up the VF.).  However this is something for
>  another discussion :).
> 
>  Changing the loads like this makes things a lot simpler and we require no
>  additional work anywhere else.  The complexity ends up in optimize SLP which
>  now has the following changes:
> 
>  * loads no longer carry load permutations themselves, so they don't see the
>    permutations.  Instead each permuted load gets a VEC_PERM node above it,
>    which is what the propagation sees.
>  * As we're moving permutes up we stop at call boundaries as well.
>  * As we're calculating materialization points for the permute we inspect the
>    node and check if we can transform it, if so we mark it for transformation.
>  * After calculating the materialization and transform points we iterate over
>    the transform points checking whether we should transform.  If we should we
>    modify the permute at the materialization point (which should be at the
>    transform point) into the inverse permute.
>  * During materialization the transformed (inverse) permute is applied,
>    which undoes the permutation for the transformed instruction.
>  * After computing this we elide the permute node, or push it down towards the
>    load and then back into the load itself, splitting the node if required.
> 
> After this the tree should be back to its normal form.  I still need to 
> work out when and where one needs to push down the permute all the way 
> to the loads.
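The inverse-permute step above (what the patch's vect_reverse_perm computes via sorting) boils down to inverting a bijective lane permutation; a scalar model (editorial sketch, limited to 16 lanes for simplicity):

```c
#include <assert.h>

/* Scalar model of inverting a bijective lane permutation:
   inv[perm[i]] = i.  Applying a permute and then its inverse restores
   the original lane order, which is how a transformed node's permute
   is "undone" at materialization.  Assumes n <= 16.  */
static void
invert_perm (const unsigned *perm, unsigned *inv, unsigned n)
{
  for (unsigned i = 0; i < n; i++)
    inv[perm[i]] = i;
}

static int
is_identity_after_roundtrip (const unsigned *perm, unsigned n)
{
  unsigned inv[16], out[16];
  invert_perm (perm, inv, n);
  for (unsigned i = 0; i < n; i++)
    out[i] = perm[inv[i]];        /* compose perm with its inverse */
  for (unsigned i = 0; i < n; i++)
    if (out[i] != i)
      return 0;
  return 1;
}
```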

It's indeed the case that the current permute "propagation" is away from
loads only and a phase to propagate in the reverse direction is missing
if we now have permute sources other than loads.  The cycle handling
of the propagation is also a bit awkward and incomplete (but it "works").

Richard.

> But this covers my thoughts.  Any feedback is appreciated to get it right early
> on :)
> 
> Regards,
> Tamar
> 
> --- inline copy of patch -- 
> diff --git a/gcc/internal-fn.def b/gcc/internal-fn.def
> index b2f414d2131b867eda337cd30f5ed40ed7c9fa10..73d1d742414d16e06e8312967990d565eddb218d 100644
> --- a/gcc/internal-fn.def
> +++ b/gcc/internal-fn.def
> @@ -277,6 +277,10 @@ DEF_INTERNAL_FLT_FN (SCALB, ECF_CONST, scalb, binary)
>  DEF_INTERNAL_FLT_FLOATN_FN (FMIN, ECF_CONST, fmin, binary)
>  DEF_INTERNAL_FLT_FLOATN_FN (FMAX, ECF_CONST, fmax, binary)
>  DEF_INTERNAL_OPTAB_FN (XORSIGN, ECF_CONST, xorsign, binary)
> +DEF_INTERNAL_OPTAB_FN (ADDSUB, ECF_CONST, addsub, binary)
> +DEF_INTERNAL_OPTAB_FN (SUBADD, ECF_CONST, subadd, binary)
> +DEF_INTERNAL_OPTAB_FN (MUL_ADDSUB, ECF_CONST, mul_addsub, binary)
> +DEF_INTERNAL_OPTAB_FN (MUL_SUBADD, ECF_CONST, mul_subadd, binary)
>  DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT90, ECF_CONST, cadd90, binary)
>  DEF_INTERNAL_OPTAB_FN (COMPLEX_ADD_ROT270, ECF_CONST, cadd270, binary)
>  DEF_INTERNAL_OPTAB_FN (COMPLEX_MUL, ECF_CONST, cmul, binary)
> diff --git a/gcc/optabs.def b/gcc/optabs.def
> index b192a9d070b8aa72e5676b2eaa020b5bdd7ffcc8..73d23c49bdfa7423f67a6240650c558cfaf3f2f0 100644
> --- a/gcc/optabs.def
> +++ b/gcc/optabs.def
> @@ -290,6 +290,10 @@ OPTAB_D (atan_optab, "atan$a2")
>  OPTAB_D (atanh_optab, "atanh$a2")
>  OPTAB_D (copysign_optab, "copysign$F$a3")
>  OPTAB_D (xorsign_optab, "xorsign$F$a3")
> +OPTAB_D (addsub_optab, "addsub$a3")
> +OPTAB_D (subadd_optab, "subadd$a3")
> +OPTAB_D (mul_addsub_optab, "mul_addsub$a3")
> +OPTAB_D (mul_subadd_optab, "mul_subadd$a3")
>  OPTAB_D (cadd90_optab, "cadd90$a3")
>  OPTAB_D (cadd270_optab, "cadd270$a3")
>  OPTAB_D (cmul_optab, "cmul$a3")
> diff --git a/gcc/tree-vect-slp-patterns.c b/gcc/tree-vect-slp-patterns.c
> index 2ed49cd9edcabd7948b365dd60d7405b79079a7b..ce8d2daa1f06f17eda2cb08255d76bac9cf263f0 100644
> --- a/gcc/tree-vect-slp-patterns.c
> +++ b/gcc/tree-vect-slp-patterns.c
> @@ -68,6 +68,39 @@ along with GCC; see the file COPYING3.  If not see
>   * vect_pattern class
>   ******************************************************************************/
>  
> +static bool
> +vect_pattern_validate_optab (internal_fn ifn, slp_tree node);
> +
> +
> +/* Check if the base or complex equivalents are supported.  */
> +static bool
> +vect_pattern_validate_optab_or_eq (internal_fn ifn, slp_tree node)
> +{
> +  if (vect_pattern_validate_optab (ifn, node))
> +    return true;
> +
> +  switch (ifn)
> +  {
> +    case IFN_ADDSUB:
> +      ifn = IFN_COMPLEX_ADD_ROT270;
> +      break;
> +    case IFN_SUBADD:
> +      ifn = IFN_COMPLEX_ADD_ROT90;
> +      break;
> +    case IFN_MUL_ADDSUB:
> +      ifn = IFN_COMPLEX_MUL_CONJ;
> +      break;
> +    case IFN_MUL_SUBADD:
> +      ifn = IFN_COMPLEX_MUL;
> +      break;
> +    default:
> +      gcc_unreachable ();
> +  }
> +
> +  return vect_pattern_validate_optab (ifn, node);
> +}
> +
> +
>  /* Default implementation of recognize that performs matching, validation and
>     replacement of nodes but that can be overriden if required.  */
>  
> @@ -630,8 +663,7 @@ complex_add_pattern::build (vec_info *vinfo)
>  
>    /* First re-arrange the children.  */
>    SLP_TREE_CHILDREN (*this->m_node)[0] = children[0];
> -  SLP_TREE_CHILDREN (*this->m_node)[1] =
> -    vect_build_swap_evenodd_node (children[1]);
> +  SLP_TREE_CHILDREN (*this->m_node)[1] = children[1];
>  
>    SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (*this->m_node)[0])++;
>    SLP_TREE_REF_COUNT (SLP_TREE_CHILDREN (*this->m_node)[1])++;
> @@ -670,9 +702,9 @@ complex_add_pattern::matches (complex_operation_t op,
>        Rotation 0 and 180 can be handled by normal SIMD code, so we don't need
>        to care about them here.  */
>    if (op == MINUS_PLUS)
> -    ifn = IFN_COMPLEX_ADD_ROT90;
> +    ifn = IFN_SUBADD;
>    else if (op == PLUS_MINUS)
> -    ifn = IFN_COMPLEX_ADD_ROT270;
> +    ifn = IFN_ADDSUB;
>    else
>      return ifn;
>  
> @@ -680,17 +712,7 @@ complex_add_pattern::matches (complex_operation_t op,
>       we support.  */
>    gcc_assert (ops->length () == 2);
>  
> -  vec<slp_tree> children = SLP_TREE_CHILDREN ((*ops)[0]);
> -
> -  /* First node must be unpermuted.  */
> -  if (linear_loads_p (perm_cache, children[0]) != PERM_EVENODD)
> -    return IFN_LAST;
> -
> -  /* Second node must be permuted.  */
> -  if (linear_loads_p (perm_cache, children[1]) != PERM_ODDEVEN)
> -    return IFN_LAST;
> -
> -  if (!vect_pattern_validate_optab (ifn, *node))
> +  if (!vect_pattern_validate_optab_or_eq (ifn, *node))
>      return IFN_LAST;
>  
>    return ifn;
> @@ -1501,7 +1523,8 @@ vect_pattern_decl_t slp_patterns[]
>       order patterns from the largest to the smallest.  Especially if they
>       overlap in what they can detect.  */
>  
> -  SLP_PATTERN (complex_operations_pattern),
> +  SLP_PATTERN (complex_add_pattern),
> +  //SLP_PATTERN (complex_operations_pattern),
>  };
>  #undef SLP_PATTERN
>  
> diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
> index 0c1f85beeb2e9f3fb7c66c15d4d30594b2570f9e..f24ad85d15780807c49153895b7340c9b4d4d1f3 100644
> --- a/gcc/tree-vect-slp.c
> +++ b/gcc/tree-vect-slp.c
> @@ -1764,25 +1764,69 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
>  	{
>  	  *max_nunits = this_max_nunits;
>  	  (*tree_size)++;
> -	  node = vect_create_new_slp_node (node, stmts, 0);
> -	  SLP_TREE_VECTYPE (node) = vectype;
> +	  node = vect_create_new_slp_node (node, stmts, 1);
>  	  /* And compute the load permutation.  Whether it is actually
>  	     a permutation depends on the unrolling factor which is
> -	     decided later.  */
> -	  vec<unsigned> load_permutation;
> +	     decided later.  We specifically materialize permutes at this
> +	     early stage and leave it up to optimize SLP to push them back
> +	     into the load if needed.  */
> +	  lane_permutation_t lane_permutation;
> +
> +	  /* Seed the loads with a linear permute, so that optimize_slp
> +	     can later push the permute upwards and downwards without needing
> +	     to inspect the parent node.  */
> +	  load_permutation_t load_permutation;
>  	  int j;
> -	  stmt_vec_info load_info;
> +	  stmt_vec_info load_info, curr_load;
> +	  lane_permutation.create (group_size);
>  	  load_permutation.create (group_size);
> +	  vec<stmt_vec_info> group_stmts;
> +	  bool linear = true;
>  	  stmt_vec_info first_stmt_info
>  	    = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (node)[0]);
> +	  group_stmts.create (DR_GROUP_SIZE (first_stmt_info));
> +	  curr_load = first_stmt_info;
>  	  FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), j, load_info)
>  	    {
>  	      int load_place = vect_get_place_in_interleaving_chain
>  		  (load_info, first_stmt_info);
>  	      gcc_assert (load_place != -1);
> -	      load_permutation.safe_push (load_place);
> +
> +	      lane_permutation.safe_push (std::make_pair (0, load_place));
> +	      load_permutation.quick_push (j);
> +
> +	      group_stmts.quick_push (curr_load);
> +	      linear = linear && curr_load == load_info;
> +	      curr_load = DR_GROUP_NEXT_ELEMENT (curr_load);
> +	    }
> +
> +	  if (!linear)
> +	    {
> +	      slp_tree child;
> +	      if (slp_tree *load = bst_map->get (group_stmts))
> +		child = *load;
> +	      else
> +		{
> +		  child =  vect_create_new_slp_node (group_stmts.copy (), 0);
> +		  SLP_TREE_VECTYPE (child) = vectype;
> +		  /* One for me, one for the BST map.  */
> +		  SLP_TREE_REF_COUNT (child)++;
> +		  bst_map->put (group_stmts, child);
> +		}
> +
> +	      /* Set some metadata on the load node.  */
> +	      SLP_TREE_REF_COUNT (child)++;
> +	      SLP_TREE_LANES (child) = SLP_TREE_LANES (node);
> +	      SLP_TREE_LOAD_PERMUTATION (child) = load_permutation;
> +
> +	      /* But return a permute node.  */
> +	      SLP_TREE_LANE_PERMUTATION (node) = lane_permutation;
> +	      SLP_TREE_CODE (node) = VEC_PERM_EXPR;
> +	      SLP_TREE_CHILDREN (node).quick_push (child);
> +	      SLP_TREE_SCALAR_STMTS (node) = vNULL;
>  	    }
> -	  SLP_TREE_LOAD_PERMUTATION (node) = load_permutation;
> +
> +	  SLP_TREE_VECTYPE (node) = vectype;
>  	  return node;
>  	}
>      }
> @@ -3430,8 +3474,8 @@ vect_analyze_slp (vec_info *vinfo, unsigned max_tree_size)
>        FOR_EACH_VEC_ELT (LOOP_VINFO_SLP_INSTANCES (vinfo), i, instance)
>  	{
>  	  slp_tree root = SLP_INSTANCE_TREE (instance);
> -	  optimize_load_redistribution (bst_map, vinfo, SLP_TREE_LANES (root),
> -					&load_map, root);
> +	  //optimize_load_redistribution (bst_map, vinfo, SLP_TREE_LANES (root),
> +	//				&load_map, root);
>  	}
>      }
>  
> @@ -3548,6 +3592,83 @@ vect_slp_perms_eq (const vec<vec<unsigned> > &perms,
>  			 sizeof (unsigned) * perms[perm_a].length ()) == 0));
>  }
>  
> +static bool
> +is_pattern_convert_candidate_p (slp_tree node)
> +{
> +  stmt_vec_info stmt_info;
> +  if (!(stmt_info = SLP_TREE_REPRESENTATIVE (node))
> +      || !STMT_VINFO_SLP_VECT_ONLY_PATTERN (stmt_info))
> +    return false;
> +
> +  gimple* stmt = STMT_VINFO_STMT (stmt_info);
> +  if (!is_gimple_call (stmt))
> +    return false;
> +
> +  if (!gimple_call_internal_p (stmt))
> +    return false;
> +
> +  internal_fn ifn = gimple_call_internal_fn (stmt);
> +
> +  switch (ifn)
> +    {
> +    case IFN_ADDSUB:
> +    case IFN_SUBADD:
> +    case IFN_MUL_ADDSUB:
> +    case IFN_MUL_SUBADD:
> +      return true;
> +    default:
> +      break;
> +    }
> +
> +  return false;
> +}
> +
> +static internal_fn
> +vect_get_transformed_pattern (internal_fn ifn, slp_tree /* node */)
> +{
> +  switch (ifn)
> +    {
> +    case IFN_ADDSUB:
> +      ifn = IFN_COMPLEX_ADD_ROT270;
> +      break;
> +    case IFN_SUBADD:
> +      ifn = IFN_COMPLEX_ADD_ROT90;
> +      break;
> +    case IFN_MUL_ADDSUB:
> +      ifn = IFN_COMPLEX_MUL_CONJ;
> +      break;
> +    case IFN_MUL_SUBADD:
> +      ifn = IFN_COMPLEX_MUL;
> +      break;
> +    default:
> +      gcc_unreachable ();
> +    }
> +
> +  return ifn;
> +}
> +
> +
> +static load_permutation_t
> +vect_reverse_perm (load_permutation_t perm)
> +{
> +  auto_vec<std::pair<unsigned, unsigned> > pairs;
> +  pairs.create (perm.length ());
> +  unsigned i;
> +  for (i = 0; i < perm.length (); i++)
> +    pairs.quick_push (std::make_pair (i, perm[i]));
> +
> +  typedef const std::pair<unsigned, unsigned>* cmp_t;
> +  pairs.qsort ([](const void* a, const void* b) -> int
> +    { return (int)((cmp_t)a)->second - (int)((cmp_t)b)->second; });
> +
> +  load_permutation_t rev_perm;
> +  rev_perm.create (perm.length ());
> +  for (i = 0; i < perm.length (); i++)
> +    rev_perm.quick_push (pairs[i].first);
> +
> +  return rev_perm;
> +}
> +
>  /* Optimize the SLP graph of VINFO.  */
>  
>  void
> @@ -3578,11 +3699,13 @@ vect_optimize_slp (vec_info *vinfo)
>  
>    auto_sbitmap n_visited (vertices.length ());
>    auto_sbitmap n_materialize (vertices.length ());
> +  auto_sbitmap n_transform (vertices.length ());
>    auto_vec<int> n_perm (vertices.length ());
>    auto_vec<vec<unsigned> > perms;
>  
>    bitmap_clear (n_visited);
>    bitmap_clear (n_materialize);
> +  bitmap_clear (n_transform);
>    n_perm.quick_grow_cleared (vertices.length ());
>    perms.safe_push (vNULL); /* zero is no permute */
>  
> @@ -3602,55 +3725,73 @@ vect_optimize_slp (vec_info *vinfo)
>  	 as entries to the reverse graph.  */
>        if (!slpg->vertices[idx].succ)
>  	bitmap_set_bit (n_visited, idx);
> -      /* Loads are the only thing generating permutes.  */
> -      if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
> -	continue;
>  
> -      /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the
> -	 node unpermuted, record this permute.  */
> -      stmt_vec_info dr_stmt = SLP_TREE_REPRESENTATIVE (node);
> -      if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
> -	continue;
> -      dr_stmt = DR_GROUP_FIRST_ELEMENT (dr_stmt);
> -      unsigned imin = DR_GROUP_SIZE (dr_stmt) + 1, imax = 0;
> -      bool any_permute = false;
> -      for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> +      int perm_seed = 0;
> +
> +      for (graph_edge *pred = slpg->vertices[idx].pred;
> +	   pred; pred = pred->pred_next)
>  	{
> -	  unsigned idx = SLP_TREE_LOAD_PERMUTATION (node)[j];
> -	  imin = MIN (imin, idx);
> -	  imax = MAX (imax, idx);
> -	  if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
> -	    any_permute = true;
> -	}
> -      /* If there's no permute no need to split one out.  */
> -      if (!any_permute)
> -	continue;
> -      /* If the span doesn't match we'd disrupt VF computation, avoid
> -	 that for now.  */
> -      if (imax - imin + 1 != SLP_TREE_LANES (node))
> -	continue;
> +	  int src_idx = pred->src;
> +	  slp_tree src_node = vertices[src_idx];
>  
> -      /* For now only handle true permutes, like
> -	 vect_attempt_slp_rearrange_stmts did.  This allows us to be lazy
> -	 when permuting constants and invariants keeping the permute
> -	 bijective.  */
> -      auto_sbitmap load_index (SLP_TREE_LANES (node));
> -      bitmap_clear (load_index);
> -      for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> -	bitmap_set_bit (load_index, SLP_TREE_LOAD_PERMUTATION (node)[j] - imin);
> -      unsigned j;
> -      for (j = 0; j < SLP_TREE_LANES (node); ++j)
> -	if (!bitmap_bit_p (load_index, j))
> -	  break;
> -      if (j != SLP_TREE_LANES (node))
> -	continue;
> +	  /* Loads are the only thing generating permutes, however the permutes
> +	     are not yet lowered into the loads.  So instead chase up the chain
> +	     to find a permute node.  */
> +	  if (!SLP_TREE_LANE_PERMUTATION (src_node).exists ())
> +	    continue;
> +
> +	  /* If splitting out a SLP_TREE_LANE_PERMUTATION can make the
> +	     node unpermuted, record this permute.  */
> +	  stmt_vec_info dr_stmt = SLP_TREE_REPRESENTATIVE (node);
> +	  if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
> +	    continue;
> +	  dr_stmt = DR_GROUP_FIRST_ELEMENT (dr_stmt);
> +	  unsigned imin = DR_GROUP_SIZE (dr_stmt) + 1, imax = 0;
> +	  bool any_permute = false;
> +	  for (unsigned j = 0; j < SLP_TREE_LANES (src_node); ++j)
> +	    {
> +	      unsigned src_op = SLP_TREE_LANE_PERMUTATION (src_node)[j].first;
> +	      unsigned idx = SLP_TREE_LANE_PERMUTATION (src_node)[j].second;
> +	      /* This isn't a load permute moved out.  */
> +	      if (src_op != 0)
> +		continue;
> +	      imin = MIN (imin, idx);
> +	      imax = MAX (imax, idx);
> +	      if (idx - SLP_TREE_LANE_PERMUTATION (src_node)[0].second != j)
> +		any_permute = true;
> +	    }
> +	  /* If there's no permute no need to split one out.  */
> +	  if (!any_permute)
> +	    continue;
> +	  /* If the span doesn't match we'd disrupt VF computation, avoid
> +	     that for now.  */
> +	  if (imax - imin + 1 != SLP_TREE_LANES (node))
> +	    continue;
> +
> +	  /* For now only handle true permutes, like
> +	     vect_attempt_slp_rearrange_stmts did.  This allows us to be lazy
> +	     when permuting constants and invariants keeping the permute
> +	     bijective.  */
> +	  auto_sbitmap load_index (SLP_TREE_LANES (src_node));
> +	  bitmap_clear (load_index);
> +	  for (unsigned j = 0; j < SLP_TREE_LANES (src_node); ++j)
> +	    bitmap_set_bit (load_index, SLP_TREE_LANE_PERMUTATION (src_node)[j].second - imin);
> +	  unsigned j;
> +	  for (j = 0; j < SLP_TREE_LANES (src_node); ++j)
> +	  if (!bitmap_bit_p (load_index, j))
> +	    break;
> +	  if (j != SLP_TREE_LANES (src_node))
> +	    continue;
>  
> -      vec<unsigned> perm = vNULL;
> -      perm.safe_grow (SLP_TREE_LANES (node), true);
> -      for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> -	perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
> -      perms.safe_push (perm);
> -      n_perm[idx] = perms.length () - 1;
> +	  vec<unsigned> perm = vNULL;
> +	  perm.safe_grow (SLP_TREE_LANES (src_node), true);
> +	  for (unsigned j = 0; j < SLP_TREE_LANES (src_node); ++j)
> +	    perm[j] = SLP_TREE_LANE_PERMUTATION (src_node)[j].second - imin;
> +	  perms.safe_push (perm);
> +	  n_perm[src_idx] = perms.length () - 1;
> +	  perm_seed = -1;
> +       }
> +      n_perm[idx] = perm_seed;
>      }
>  
>    /* Propagate permutes along the graph and compute materialization points.  */
> @@ -3667,12 +3808,12 @@ vect_optimize_slp (vec_info *vinfo)
>  	  slp_tree node = vertices[idx];
>  	  /* For leafs there's nothing to do - we've seeded permutes
>  	     on those above.  */
> -	  if (SLP_TREE_DEF_TYPE (node) != vect_internal_def)
> +	  if (SLP_TREE_DEF_TYPE (node) != vect_internal_def || !slpg->vertices[idx].succ)
>  	    continue;
>  
>  	  bitmap_set_bit (n_visited, idx);
>  
> -	  /* We cannot move a permute across a store.  */
> +	  /* We cannot move a permute across a store.  TODO: or a call.  */
>  	  if (STMT_VINFO_DATA_REF (SLP_TREE_REPRESENTATIVE (node))
>  	      && DR_IS_WRITE
>  		   (STMT_VINFO_DATA_REF (SLP_TREE_REPRESENTATIVE (node))))
> @@ -3723,6 +3864,12 @@ vect_optimize_slp (vec_info *vinfo)
>  	      changed = true;
>  	    }
>  
> +	  /* Check to see if this node can be transformed during permute
> +	     materialization.  */
> +	  bool patt_trans_cand_p = is_pattern_convert_candidate_p (node);
> +	  if (patt_trans_cand_p)
> +	    bitmap_set_bit (n_transform, idx);
> +
>  	  if (perm == 0)
>  	    continue;
>  
> @@ -3764,6 +3911,46 @@ vect_optimize_slp (vec_info *vinfo)
>      }
>    while (changed || iteration == 1);
>  
> +  bitmap_clear (n_visited);
> +
> +  /* Transform and record any permutes for the usage of the instruction.
> +     TODO: Check if it's cheaper to keep ADDSUB if available or to use
> +	   COMPLEX_ADD based on the reverse permute that's needed.  */
> +  sbitmap_iterator bi;
> +  EXECUTE_IF_SET_IN_BITMAP (n_transform, 0, i, bi)
> +    {
> +      slp_tree node = vertices[i];
> +
> +      for (graph_edge *succ = slpg->vertices[i].succ;
> +		       succ; succ = succ->succ_next)
> +	{
> +	  int perm_id = n_perm[succ->dest];
> +	  if (perm_id <= 0)
> +	    continue;
> +
> +	  load_permutation_t linperm = vect_reverse_perm (perms[perm_id]);
> +	  perms[perm_id].release ();
> +	  perms[perm_id] = linperm;
> +
> +          bitmap_set_bit (n_materialize, succ->dest);
> +	}
> +
> +      /* Transform the node if we have determined that it's beneficial to do so.
> +	 when we do so the perm that has been calculated for the children of the
> +	 node will be applied.  */
> +      stmt_vec_info stmt_info = SLP_TREE_REPRESENTATIVE (node);
> +      gimple* stmt = STMT_VINFO_STMT (stmt_info);
> +      internal_fn old_ifn = gimple_call_internal_fn (stmt);
> +      internal_fn ifn = vect_get_transformed_pattern (old_ifn, node);
> +
> +      if (dump_enabled_p ())
> +	dump_printf_loc (MSG_NOTE, vect_location,
> +			 "Transforming SLP expression from %s to %s\n",
> +			 internal_fn_name (old_ifn), internal_fn_name (ifn));
> +      gimple_call_set_internal_fn (as_a <gcall *> (stmt), ifn);
> +    }
> +
> +
>    /* Materialize.  */
>    for (i = 0; i < vertices.length (); ++i)
>      {
> @@ -3798,6 +3985,11 @@ vect_optimize_slp (vec_info *vinfo)
>  			    SLP_TREE_SCALAR_OPS (child), true);
>  	}
>  
> +      if (dump_enabled_p ())
> +	dump_printf_loc (MSG_NOTE, vect_location,
> +			 "processing node %p\n",
> +			 node);
> +
>        if (bitmap_bit_p (n_materialize, i))
>  	{
>  	  if (SLP_TREE_LOAD_PERMUTATION (node).exists ())
> @@ -3986,6 +4178,16 @@ vect_optimize_slp (vec_info *vinfo)
>  	    }
>  	}
>      }
> +
> +  if (dump_enabled_p ())
> +    {
> +      dump_printf_loc (MSG_NOTE, vect_location, "Optimized SLP instances: \n");
> +      hash_set<slp_tree> visited;
> +      slp_instance instance;
> +      FOR_EACH_VEC_ELT (vinfo->slp_instances, i, instance)
> +	vect_print_slp_graph (MSG_NOTE, vect_location,
> +			      SLP_INSTANCE_TREE (instance), visited);
> +    }
>  }
>  
>  /* Gather loads reachable from the individual SLP graph entries.  */
> 
> 
> 

-- 
Richard Biener <rguenther@suse.de>
SUSE Software Solutions Germany GmbH, Maxfeldstrasse 5, 90409 Nuernberg,
Germany; GF: Felix Imendörffer; HRB 36809 (AG Nuernberg)
