
Re: [RFC][PATCH]Merge VEC_COND_EXPR into MASK_STORE after loop vectorization


On 11/20/18 7:57 AM, Renlin Li wrote:
> Hi Richard,
> 
> On 11/14/2018 02:59 PM, Richard Biener wrote:
>> On Fri, Nov 9, 2018 at 4:49 PM Renlin Li <renlin.li@foss.arm.com> wrote:
>>>
>>> Hi Richard,
>>>
>>> On 11/09/2018 11:48 AM, Richard Biener wrote:
>> I don't see the masked load here on x86_64, btw. (I don't see
>> if-conversion generating a load.)  I guess that's again because, when
>> store data races are allowed, it uses a RMW cycle, and vectorization
>> generates the masked variants for the loop mask.  Which means for SVE
>> if-conversion should prefer the masked-store variant even when store
>> data races are allowed?
> 
> Yes, it looks like, for SVE, the masked-store variant is preferred even when store data races are allowed.
> 
> This decision is made in if-cvt.
> 
> MASK_STORE needs a pointer, and if it is created from an array access, we
> need to make sure that data reference analysis can properly analyze the
> relationship between the array reference and the pointer reference, so
> that no versioned loop is generated during loop vectorization.
> (This is a general improvement, and could be done in a different patch?)
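> 
> For illustration (not the exact testcase), a scalar loop of this shape
> is what produces the pattern under discussion:
> 
>     void f (int *a, int *c, int n, int x)
>     {
>       for (int i = 0; i < n; i++)
>         if (c[i])
>           a[i] = x;
>     }
> 
> With store data races allowed, if-conversion rewrites the conditional
> store as an unconditional read-modify-write, a[i] = c[i] ? x : a[i],
> which vectorizes to the masked load / VEC_COND / masked store sequence.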
> 
>>
>>>
>>>>
>>>> I was wondering whether we can implement
>>>>
>>>>     l = [masked]load;
>>>>     tem = cond ? x : l;
>>>>     masked-store = tem;
>>>>
>>>> pattern matching in a regular pass - forwprop for example.  Note the
>>>> load doesn't need to be masked,
>>>> correct?  In fact if it is masked you need to make sure the
>>>> conditional never accesses parts that
>>>> are masked in the load, no?  Or require the mask to be the same as
>>>> that used by the store.  But then
>>>> you still cannot simply replace the store mask with a new mask
>>>> generated from the conditional?
>>>
>>> Yes, this would require that the mask for the load and the store be the same.
>>> This matches the pattern before loop vectorization.
>>> The mask here is the loop mask, to ensure we stay bounded by the number of iterations.
>>>
>>> The new mask is (original mask & condition mask) (example shown above).
>>>
>>> In this case, fewer lanes will be stored.
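>>>
>>> Illustratively (names made up, in the style of the dump below):
>>>
>>>     cond_mask = cond_val != { 0, ... };
>>>     new_mask = loop_mask & cond_mask;
>>>     .MASK_STORE (ptr, 4B, new_mask, val);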
>>>
>>> It is possible to do that in forwprop.
>>> I could try to integrate the change there if that is the correct place for it.
>>>
>>>
>>> As the pattern is initially generated by the loop vectorizer, I made the
>>> change right after it, before the pattern gets converted into other forms.
>>> For example, forwprop will transform the original code into:
>>>
>>>
>>>     vect__2.4_29 = vect_cst__27 + { 1, ... };
>>>     _16 = (void *) ivtmp.13_25;
>>>     _2 = &MEM[base: _16, offset: 0B];
>>>     vect__ifc__24.7_33 = .MASK_LOAD (_2, 4B, loop_mask_32);
>>>     _28 = vect_cst__34 != { 0, ... };
>>>     _35 = .COND_ADD (_28, vect_cst__27, { 1, ... }, vect__ifc__24.7_33);
>>>     vect__ifc__26.8_36 = _35;
>>>     .MASK_STORE (_2, 4B, loop_mask_32, vect__ifc__26.8_36);
>>>     ivtmp_41 = ivtmp_40 + POLY_INT_CST [4, 4];
>>>     next_mask_43 = .WHILE_ULT (ivtmp_41, 16, { 0, ... });
>>>     ivtmp.13_15 = ivtmp.13_25 + POLY_INT_CST [16, 16];
>>>
>>> This makes the pattern matching less straightforward.
>>
>> Ah, that's because of the .COND_ADDs (I wonder about the copy that's
>> left - forwprop should eliminate copies).  Note the pattern-matching
>> could go in the
>>
>>        /* Apply forward propagation to all stmts in the basic-block.
>>           Note we update GSI within the loop as necessary.  */
>>
>> loop which comes before the match.pd pattern matching so you'd
>> still see the form without the .COND_ADD I think.
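>>
>> E.g. dispatch to the new helper from that loop, something like (a
>> sketch only; the exact guards are whatever the helper needs):
>>
>>      else if (is_gimple_assign (stmt)
>>               && gimple_assign_rhs_code (stmt) == VEC_COND_EXPR)
>>        changed = forward_propagate_vec_cond_expr (&gsi);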
>>
>> There _is_ precedent for some masked-store post-processing
>> in the vectorizer (optimize_mask_stores), not that I like that
>> very much either.  Eventually those can be at least combined...
> Thanks for your suggestion; indeed, .COND_ADD is generated later, in the fold_stmt function.
> 
> 
> I updated the patch in the style of "forward propagation".  It starts from
> the VEC_COND and forward-propagates it into the MASK_STORE when the specific
> pattern is found.
> 
> 
>      X = MASK_LOAD (PTR, -, MASK)
>      VAL = ...
>      Y = VEC_COND (cond, VAL, X)
>      MASK_STORE (PTR, -, MASK, Y)
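> 
>      which, when the load and store masks match, gets rewritten to
>      (as the comment in the patch below puts it):
> 
>      AND_MASK = MASK & cond
>      MASK_STORE (PTR, -, AND_MASK, VAL)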
>>
>> That said, I prefer the forwprop place for any pattern matching
>> and the current patch needs more comments to understand
>> what it is doing (the DCE it does is IMHO premature).  You
>> should also modify the masked store in-place rather than
>> building a new one.  I don't like how you need to use
>> find_data_references_in_stmt, can't you simply compare
>> the address and size arguments?  
> 
> find_data_references_in_stmt is used because the two data references are
> created as two new SSA_NAMEs from the same scalar pointer by the loop
> vectorizer.  I cannot directly compare the addresses as they are
> complicated by loop information.
> 
> By moving the functionality into forwprop, the complications are removed
> by the optimizers in between.  This makes a simple comparison possible.
So presumably this has been bootstrap tested?  If so, it's fine once you
add an appropriate ChangeLog.
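
A sample entry, as a sketch (the second line assumes the helper is
called from the main forwprop statement loop):

	* tree-ssa-forwprop.c (forward_propagate_vec_cond_expr): New function.
	(pass_forwprop::execute): Call it.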



> diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c
> index 7449eaf..c954b2d 100644
> --- a/gcc/tree-ssa-forwprop.c
> +++ b/gcc/tree-ssa-forwprop.c
> @@ -629,6 +629,163 @@ forward_propagate_into_cond (gimple_stmt_iterator *gsi_p)
>    return 0;
>  }
>  
> +/* Given the following pattern:
> +     X = MASK_LOAD (PTR, -, MASK)
> +     VAL = ...
> +     Y = VEC_COND (cond, VAL, X)
> +     MASK_STORE (PTR, -, MASK, Y)
> +
> +   They could be combined into a single MASK_STORE with a new mask.
> +   The new mask is the combination of the original mask and the value
> +   selection mask in the VEC_COND_EXPR.
> +
> +     AND_MASK = MASK & cond
> +     MASK_STORE (PTR, -, AND_MASK, VAL)
> +
> +   After the transformation, the MASK_LOAD and VEC_COND_EXPR might be dead.  */
> +
> +static bool
> +forward_propagate_vec_cond_expr (gimple_stmt_iterator *gsi)
> +{
> +  gimple *stmt = gsi_stmt (*gsi);
> +  if (!is_gimple_assign (stmt) || gimple_assign_rhs_code (stmt) != VEC_COND_EXPR)
> +    return false;
> +
> +  imm_use_iterator iter;
> +  gimple *use_stmt;
> +  use_operand_p use_p;
> +  tree lhs = gimple_assign_lhs (stmt);
> +  bool rewrite = false;
> +
> +  FOR_EACH_IMM_USE_FAST (use_p, iter, lhs)
> +    {
> +      use_stmt = USE_STMT (use_p);
> +      if (gimple_call_internal_p (use_stmt, IFN_MASK_STORE)
> +	  && !gimple_has_volatile_ops (use_stmt))
> +	{
> +	  rewrite = true;
> +	  break;
> +	}
> +    }
> +
> +  /* Early return if it doesn't have a user we are interested in.  */
> +  if (!rewrite)
> +    return false;
> +
> +  tree sel_cond = gimple_assign_rhs1 (stmt);
> +  tree true_val = gimple_assign_rhs2 (stmt);
> +  tree false_val = gimple_assign_rhs3 (stmt);
> +
> +  FOR_EACH_IMM_USE_STMT (use_stmt, iter, lhs)
> +    {
> +      if (!gimple_call_internal_p (use_stmt, IFN_MASK_STORE))
> +	continue;
> +
> +      gimple *mask_store = use_stmt;
> +      tree store_ptr = gimple_call_arg (mask_store, 0);
> +      tree vec_op = gimple_call_arg (mask_store, 3);
> +      tree store_mask = gimple_call_arg (mask_store, 2);
> +      if (vec_op != lhs)
> +	continue;
> +
> +      /* Looking for this pattern:
> +	 X = MASK_LOAD (PTR, -, MASK)
> +	 VAL = ...
> +	 Y = VEC_COND (cond, VAL, X)
> +	 MASK_STORE (PTR, -, MASK, Y)  */
> +      gimple *mask_load = NULL;
> +      bool reverse_p = false;
> +
> +      /* A[i] = cond ? val : A[i] */
> +      if (TREE_CODE (false_val) == SSA_NAME)
> +	{
> +	  gimple *def = SSA_NAME_DEF_STMT (false_val);
> +	  if (gimple_call_internal_p (def, IFN_MASK_LOAD))
> +	    mask_load = def;
> +	}
> +      /* A[i] = cond ? A[i] : val
> +	 Transform into:
> +	 A[i] = !cond ? val : A[i]  */
> +      if (mask_load == NULL && TREE_CODE (true_val) == SSA_NAME)
> +	{
> +	  gimple *def = SSA_NAME_DEF_STMT (true_val);
> +	  if (gimple_call_internal_p (def, IFN_MASK_LOAD))
> +	    {
> +	      enum tree_code code = TREE_CODE (sel_cond);
> +	      tree op_type = TREE_TYPE (TREE_OPERAND (sel_cond, 0));
> +	      code = invert_tree_comparison (code, HONOR_NANS (op_type));
> +	      if (code == ERROR_MARK)
> +		return false;
> +
> +	      sel_cond = build2_loc (EXPR_LOCATION (sel_cond), code,
> +				     TREE_TYPE (sel_cond),
> +				     TREE_OPERAND (sel_cond, 0),
> +				     TREE_OPERAND (sel_cond, 1));
> +	      mask_load = def;
> +	      reverse_p = true;
> +	    }
> +	}
> +
> +      /* The pair must be in the same basic block, use the same mask,
> +	 and access the same memory.  */
> +      if (mask_load == NULL
> +	  || gimple_has_volatile_ops (mask_load)
> +	  || gimple_bb (mask_store) != gimple_bb (mask_load)
> +	  || gimple_vuse (mask_store) != gimple_vuse (mask_load)
> +	  || !operand_equal_p (store_mask, gimple_call_arg (mask_load, 2), 0)
> +	  || !operand_equal_p (store_ptr, gimple_call_arg (mask_load, 0), 0)
> +	  || !operand_equal_p (gimple_call_arg (mask_load, 1),
> +			       gimple_call_arg (mask_store, 1),
> +			       0))
> +	continue;
> +
> +      if (dump_file)
> +	{
> +	  fprintf (dump_file, "  Merge '");
> +	  print_gimple_expr (dump_file, mask_load, 0);
> +	  fprintf (dump_file, "' into '");
> +	  print_gimple_expr (dump_file, mask_store, 0);
> +	  fprintf (dump_file, "'\n");
> +	}
> +
> +      tree sel_mask
> +	= force_gimple_operand_gsi (gsi,
> +				    unshare_expr (sel_cond),
> +				    true, NULL_TREE,
> +				    true, GSI_SAME_STMT);
> +
> +
> +      /* Replace the condition in COND_EXPR with the new variable.  */
> +      if (reverse_p)
> +	{
> +	  gimple_set_op (stmt, 1, sel_mask);
> +	  gimple_set_op (stmt, 2, false_val);
> +	  gimple_set_op (stmt, 3, true_val);
> +
> +	  update_stmt (stmt);
> +	  true_val = false_val;
> +	}
> +      else
> +	{
> +	  gimple_set_op (stmt, 1, sel_mask);
> +	  update_stmt (stmt);
> +	}
Is this really correct?   I'm thinking about the case where the
COND_EXPR has multiple uses.  I don't think you can just slam in a new
condition in that case, unless you know all the uses are going to be
transformed in the same way.
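
For example (hypothetical GIMPLE):

  Y = VEC_COND (cond, VAL, X)
  MASK_STORE (PTR, -, MASK, Y)
  Z = Y + W    ;; a second, unrelated use of Y

Any in-place change to Y's defining statement has to preserve Y's value
for that second use as well.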



Or am I missing the intent of what that fragment is doing?

Overall it looks reasonable though.  We should try and wrap up iteration
on the implementation ASAP.  We're well into stage3 at this point.

jeff

