[PATCH 5/7 v6] vect: Support vector load/store with length in vectorizer

Richard Sandiford richard.sandiford@arm.com
Tue Jul 7 10:15:56 GMT 2020


"Kewen.Lin" <linkw@linux.ibm.com> writes:
> Hi Richard,
>
> on 2020/7/1 11:17 PM, Richard Sandiford wrote:
>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>> on 2020/7/1 3:53 AM, Richard Sandiford wrote:
>>>> "Kewen.Lin" <linkw@linux.ibm.com> writes:
>>>>>    poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
>>>>> +  tree length_limit = NULL_TREE;
>>>>> +  /* For length, we need length_limit to check length in range.  */
>>>>> +  if (!vect_for_masking)
>>>>> +    {
>>>>> +      poly_uint64 len_limit = nscalars_per_ctrl * rgc->factor;
>>>>> +      length_limit = build_int_cst (compare_type, len_limit);
>>>>> +    }
>>>>>  
>>>>>    /* Calculate the maximum number of scalar values that the rgroup
>>>>>       handles in total, the number that it handles for each iteration
>>>>> @@ -434,12 +445,12 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>>>>>    tree nscalars_total = niters;
>>>>>    tree nscalars_step = build_int_cst (iv_type, vf);
>>>>>    tree nscalars_skip = niters_skip;
>>>>> -  if (nscalars_per_iter != 1)
>>>>> +  if (nscalars_per_iter_ft != 1)
>>>>>      {
>>>>>        /* We checked before setting LOOP_VINFO_USING_PARTIAL_VECTORS_P that
>>>>>  	 these multiplications don't overflow.  */
>>>>> -      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter);
>>>>> -      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter);
>>>>> +      tree compare_factor = build_int_cst (compare_type, nscalars_per_iter_ft);
>>>>> +      tree iv_factor = build_int_cst (iv_type, nscalars_per_iter_ft);
>>>>>        nscalars_total = gimple_build (preheader_seq, MULT_EXPR, compare_type,
>>>>>  				     nscalars_total, compare_factor);
>>>>>        nscalars_step = gimple_build (preheader_seq, MULT_EXPR, iv_type,
>>>>> @@ -509,7 +520,7 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>>>>>  	     NSCALARS_SKIP to that cannot overflow.  */
>>>>>  	  tree const_limit = build_int_cst (compare_type,
>>>>>  					    LOOP_VINFO_VECT_FACTOR (loop_vinfo)
>>>>> -					    * nscalars_per_iter);
>>>>> +					    * nscalars_per_iter_ft);
>>>>>  	  first_limit = gimple_build (preheader_seq, MIN_EXPR, compare_type,
>>>>>  				      nscalars_total, const_limit);
>>>>>  	  first_limit = gimple_build (preheader_seq, PLUS_EXPR, compare_type,
>>>>
>>>> It looks odd that we don't need to adjust the other nscalars_* values too.
>>>> E.g. the above seems to be comparing an unscaled nscalars_total with
>>>> a scaled nscalars_per_iter.  I think the units ought to “agree”,
>>>> both here and in the rest of the function.
>>>>
>>>
>>> Sorry, I didn't quite follow this comment.  Both nscalars_total and
>>> nscalars_step are scaled here.  The only remaining related nscalars_*
>>> variable seems to be nscalars_skip, but length can't support skip.
>> 
>> Hmm, OK.  But in that case can you update the names of the variables
>> to match?  It's confusing to have some nscalars_* variables actually
>> count scalars (and thus have “nitems” equivalents) and other nscalars_*
>> variables count something else (and thus effectively be nitems_* variables
>> themselves).
>> 
>
> OK.  I'll update names like nscalars_total/nscalars_step and their
> equivalents to nitems_total/... (or would nunits_total be better?)

I agree “items” isn't great.  I was trying to avoid “units” because GCC
often uses that to mean bytes (BITS_PER_UNIT, UNITS_PER_WORD, etc.).
In this context that could be confusing, because sometimes the
“units” actually would be bytes, but not always.

>>>>> @@ -9850,11 +9986,30 @@ vectorizable_condition (vec_info *vinfo,
>>>>>  	  return false;
>>>>>  	}
>>>>>  
>>>>> -      if (loop_vinfo
>>>>> -	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
>>>>> -	  && reduction_type == EXTRACT_LAST_REDUCTION)
>>>>> -	vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
>>>>> -			       ncopies * vec_num, vectype, NULL);
>>>>> +      if (loop_vinfo && for_reduction
>>>>> +	  && LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo))
>>>>> +	{
>>>>> +	  if (reduction_type == EXTRACT_LAST_REDUCTION)
>>>>> +	    vect_record_loop_mask (loop_vinfo, &LOOP_VINFO_MASKS (loop_vinfo),
>>>>> +				   ncopies * vec_num, vectype, NULL);
>>>>> +	  /* Using partial vectors can introduce inactive lanes in the last
>>>>> +	     iteration, since full vector of condition results are operated,
>>>>> +	     it's unsafe here.  But if we can AND the condition mask with
>>>>> +	     loop mask, it would be safe then.  */
>>>>> +	  else if (!loop_vinfo->scalar_cond_masked_set.is_empty ())
>>>>> +	    {
>>>>> +	      scalar_cond_masked_key cond (cond_expr, ncopies * vec_num);
>>>>> +	      if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
>>>>> +		{
>>>>> +		  bool honor_nans = HONOR_NANS (TREE_TYPE (cond.op0));
>>>>> +		  cond.code = invert_tree_comparison (cond.code, honor_nans);
>>>>> +		  if (!loop_vinfo->scalar_cond_masked_set.contains (cond))
>>>>> +		    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>>>> +		}
>>>>> +	    }
>>>>> +	  else
>>>>> +	    LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
>>>>> +	}
>>>>>  
>>>>>        STMT_VINFO_TYPE (stmt_info) = condition_vec_info_type;
>>>>>        vect_model_simple_cost (vinfo, stmt_info, ncopies, dts, ndts, slp_node,
>>>>
>>>> I don't understand this part.
>>>
>>> This is for the regression case on aarch64:
>>>
>>> PASS->FAIL: gcc.target/aarch64/sve/reduc_8.c -march=armv8.2-a+sve  scan-assembler-not \\tcmpeq\\tp[0-9]+\\.s,
>> 
>> OK, if this is an SVE thing, it should really be a separate patch.
>> (And thanks for testing SVE.)
>> 
>>> As you mentioned before, we would expect to record masks for partial-vector
>>> reductions, otherwise the inactive lanes could be unsafe.  For this failed case,
>>> the reduction_type is TREE_CODE_REDUCTION, so we won't record a loop mask.  But
>>> it's still safe, since the mask is further ANDed with some loop mask.  The
>>> difference looks like:
>>>
>>> Without mask AND loop mask optimization:
>>>
>>>   loop_mask =...
>>>   v1 = .MASK_LOAD (a, loop_mask)
>>>   mask1 = v1 == {cst, ...}                // unsafe since it's generated from the full width.
>>>   mask2 = loop_mask & mask1               // safe, since it's ANDed with the loop mask?
>>>   v2 = .MASK_LOAD (b, mask2)
>>>   vres = VEC_COND_EXPR < mask1, vres, v2> // unsafe because of mask1
>>>
>>> With mask AND loop mask optimization:
>>>
>>>   loop_mask =...
>>>   v1 = .MASK_LOAD (a, loop_mask)
>>>   mask1 = v1 == {cst, ...}
>>>   mask2 = loop_mask & mask1       
>>>   v2 = .MASK_LOAD (b, mask2)
>>>   vres = VEC_COND_EXPR < mask2, vres, v2> // safe because of mask2?
>>>
>>>
>>> ANDing with the loop mask can make the unsafe inactive lanes safe.  So the fix
>>> here is to check further whether this optimization can be applied; if it can,
>>> we know it's safe.  Does that make sense?
>> 
>> But in this particular test, we're doing outer loop vectorisation,
>> and the only elements of vres that matter are the ones selected
>> by loop_mask (since those are the only ones that get stored out).
>> So applying the loop mask to the VEC_COND_EXPR is “just” an
>> (important) optimisation, rather than a correctness issue.
>>  
>
> Thanks for the clarification.  It looks like vres is always safe, since its
> further uses are guarded with the loop mask.  But sorry, I didn't catch why
> this is an optimization for this case; is there some difference in backend
> support for these different masks for cond_expr?

No, the idea of the optimisation is to avoid cases in which we have:

    cmp_res = …compare…
    cmp_res' = cmp_res & loop_mask
    IFN_MASK_LOAD (…, cmp_res')
    z = cmp_res ? x : y

The problem here is that cmp_res and cmp_res' are live at the same time,
which prevents cmp_res and cmp_res' from being combined into a single
instruction.  It's better for the final instruction to be:

    z = cmp_res' ? x : y

so that everything uses the same comparison result.

We can't leave that to later passes because nothing in the gimple IL
indicates that only the loop_mask elements of z matter.

>> What's causing the test to start failing with the patch?  I realise
>> you've probably already said, sorry, but it's been a large patch series
>> so it's hard to keep all the details committed to memory.
>> 
>
> No problem, I appreciate your time very much!  Since length-based partial
> vectors don't support any reductions so far, the function has the
> responsibility to disable use_partial_vectors_p for them.  Without the above
> else-if part, since the reduction_type is TREE_CODE_REDUCTION for this case,
> the else part would stop this case from using mask-based partial vectors,
> but the test expects the outer loop to still be able to use mask-based
> partial vectors.
>
> Given your clarification above, the else-if looks wrong.  Probably we can
> change it to check whether the current vectorization is for the outer loop
> and the condition stmt being handled is in the inner loop; in that case we
> could allow partial vectors?

I think it's more whether, for outer loop vectorisation, the reduction
is a double reduction or a simple nested-cycle reduction.  Both have
a COND_EXPR in the inner loop, but the extra elements only matter for
double reductions.

There again, I don't think we actually support double reductions for
COND_EXPR reductions.
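
For reference, a rough C sketch of the two shapes (hypothetical helper
functions, just to illustrate the distinction):

    /* Double reduction: res is live across the outer loop, so the
       COND_EXPR results computed in inactive lanes would leak into
       the final reduction value.  */
    int
    count_positive (int n, int m, int a[n][m])
    {
      int res = 0;
      for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
          res = a[i][j] > 0 ? res + 1 : res;
      return res;
    }

    /* Simple nested cycle: tmp is private to each outer-loop iteration
       and only the lanes selected by the loop mask are stored out, so
       the extra elements are harmless.  */
    void
    count_positive_rows (int n, int m, int a[n][m], int out[n])
    {
      for (int i = 0; i < n; ++i)
        {
          int tmp = 0;
          for (int j = 0; j < m; ++j)
            tmp = a[i][j] > 0 ? tmp + 1 : tmp;
          out[i] = tmp;
        }
    }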

>>>>> @@ -11910,3 +12065,36 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
>>>>>    *nunits_vectype_out = nunits_vectype;
>>>>>    return opt_result::success ();
>>>>>  }
>>>>> +
>>>>> +/* Generate and return statement sequence that sets vector length LEN that is:
>>>>> +
>>>>> +   min_of_start_and_end = min (START_INDEX, END_INDEX);
>>>>> +   left_len = END_INDEX - min_of_start_and_end;
>>>>> +   rhs = min (left_len, LEN_LIMIT);
>>>>> +   LEN = rhs;
>>>>> +
>>>>> +   TODO: for now, rs6000 supported vector with length only cares 8-bits, which
>>>>> +   means if we have left_len in bytes larger than 255, it can't be saturated to
>>>>> +   vector limit (vector size).  One target hook can be provided if other ports
>>>>> +   don't suffer this.
>>>>> +*/
>>>>
>>>> Should be no line break before the */
>>>>
>>>> Personally I think it'd be better to drop the TODO.  This isn't the only
>>>> place that would need to change if we allowed out-of-range lengths,
>>>> whereas the comment might give the impression that it is.
>>>>
>>>
>>> Sorry I might miss something, but all undetermined lengths are generated here,
>>> the other places you meant is doc or elsewhere?
>> 
>> For example, we'd need to start querying the length operand of the optabs
>> to see what length precision the target uses, since it would be invalid
>> to do this optimisation for IVs that are wider than that precision.
>> The routine above doesn't seem the right place to do that.
>> 
>
> OK, but it seems acceptable for the IV to be wider than the precision, since
> we allow it to be out of range?

For example, suppose that a target handled out-of-range values but
still had a QImode length.  If the IV was wider than QI, we'd truncate
0x100 to 0 when generating the pattern, so a full-vector access would
get truncated to an empty-vector access.
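
A minimal standalone C illustration of that truncation (made-up variable
names, using the 0x100 value from the example above):

    #include <stdint.h>
    #include <stdio.h>

    int
    main (void)
    {
      uint16_t iv_len = 0x100;            /* full-vector length held in a wider IV */
      uint8_t len_op = (uint8_t) iv_len;  /* truncated to the QImode length operand */
      printf ("%d -> %d\n", (int) iv_len, (int) len_op);  /* prints "256 -> 0" */
      return 0;
    }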

Thanks,
Richard

