[PATCH] [RFC] Single iteration peeling for gaps is sufficient with loop masking

Richard Sandiford richard.sandiford@arm.com
Thu Nov 14 18:31:13 GMT 2024


Richard Biener <rguenther@suse.de> writes:
>> On 14.11.2024 at 17:38, Richard Sandiford <richard.sandiford@arm.com> wrote:
>> 
>> Richard Biener <rguenther@suse.de> writes:
>>> When we do loop masking via masks or lengths, a single scalar
>>> iteration should be sufficient to avoid excess accesses.  This fixes
>>> the last known FAILs with --param vect-force-slp=1.
>>> 
>>> Bootstrap and regtest running on x86_64-unknown-linux-gnu.
>>> 
>>> Do we know of a case where the peeling isn't sufficient with VL vectors?
>>> 
>>> The CI will probably fail because of dependent patches I just pushed :/
>>> 
>>> Thanks,
>>> Richard.
>>> 
>>>    PR tree-optimization/117558
>>>    * tree-vect-stmts.cc (get_group_load_store_type): Exempt
>>>    VL vector types from the insufficient gap peeling check.
>>> ---
>>> gcc/tree-vect-stmts.cc | 41 +++++++++++++++++++----------------------
>>> 1 file changed, 19 insertions(+), 22 deletions(-)
>>> 
>>> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
>>> index d3552266eee..b5f90803eed 100644
>>> --- a/gcc/tree-vect-stmts.cc
>>> +++ b/gcc/tree-vect-stmts.cc
>>> @@ -2181,33 +2181,30 @@ get_group_load_store_type (vec_info *vinfo, stmt_vec_info stmt_info,
>>> 
>>>           /* Peeling for gaps assumes that a single scalar iteration
>>>              is enough to make sure the last vector iteration doesn't
>>> -            access excess elements.  */
>>> +            access excess elements.  For variable-length vectors the
>>> +            required loop masking ensures a single iteration is always
>>> +            sufficient.  */
>> 
>> VL vectors don't directly imply loop masking.  We support unmasked
>> VLA loops too, in cases where masking fails for some reason.  Admittedly
>> there should be fewer of those cases than there were at one time (we started
>> with no masking, and gradually added more cases), but it's still true in
>> general that we can't assume VLA => loop masking.
>
> I see.  So in that case we’d keep the existing logic, but for VLA vectors, if we cannot statically determine this (I’m still pondering whether the actual condition correctly captures what we need to check…), we either have to fail or set a (new) LOOP_VINFO_NEEDS_MASKING_P so we can fail (VLA) vectorization when we later determine we cannot use masking?

Yeah, I think so.  But like you say, I'm not sure either way what the
condition should be (too much in stage 1 mode to think much about it :)).
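
To make that concrete, I'd imagine the record-and-check approach
looking something like this (just a sketch: LOOP_VINFO_NEEDS_MASKING_P
and where exactly it gets set and tested are hypothetical):

      /* Analysis time: we can't prove statically that the single
         peeled iteration covers the overrun, so record that
         correctness now relies on the loop being masked.
         LOOP_VINFO_NEEDS_MASKING_P is the hypothetical new flag.  */
      if (overrun_p && !nunits.is_constant ())
        LOOP_VINFO_NEEDS_MASKING_P (loop_vinfo) = true;

      /* Later, once the partial-vectors decision is final (e.g. in
         vect_determine_partial_vectors_and_peeling):  */
      if (LOOP_VINFO_NEEDS_MASKING_P (loop_vinfo)
          && !LOOP_VINFO_USING_PARTIAL_VECTORS_P (loop_vinfo))
        return opt_result::failure_at (vect_location,
                                       "gap peeling requires masking, but "
                                       "the loop cannot use partial "
                                       "vectors\n");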

But peeling for gaps is still needed with loop masking for SLP groups
that load beyond the accesses in the original scalar code, since the
mask is computed at group granularity rather than element granularity.
(Or at least, it was.)  That includes load-lanes where the final lane
isn't needed.
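
For concreteness, take an invented loop like:

      for (int i = 0; i < n; ++i)
        sum += a[3 * i] + a[3 * i + 1];

The load group has group_size == 3 with a gap of 1: the scalar code
never reads a[3*i + 2].  A loop mask activates whole groups, so the
last active group still loads its gap element, which for i == n - 1
can lie outside the object.  Peeling the final scalar iteration keeps
that overrun below the elements the peeled iteration itself accesses,
and the same applies to load-lanes (e.g. AArch64 LD3), which loads
all lanes of every active group.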

Thanks,
Richard

>
> Richard.
>
>> Thanks,
>> Richard
>> 
>>> +         unsigned HOST_WIDE_INT cnunits, cvf, cremain, cpart_size;
>>>           if (overrun_p
>>> -             && (!can_div_trunc_p (group_size
>>> -                                   * LOOP_VINFO_VECT_FACTOR (loop_vinfo) - gap,
>>> -                                   nunits, &tem, &remain)
>>> -                 || maybe_lt (remain + group_size, nunits)))
>>> -           {
>>> +             && nunits.is_constant (&cnunits)
>>> +             && LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant (&cvf)
>>> +             && ((cremain = (group_size * cvf - gap) % cnunits), true)
>>> +             && cremain + group_size < cnunits
>>>               /* But peeling a single scalar iteration is enough if
>>>                  we can use the next power-of-two sized partial
>>>                  access and that is sufficiently small to be covered
>>>                  by the single scalar iteration.  */
>>> -             unsigned HOST_WIDE_INT cnunits, cvf, cremain, cpart_size;
>>> -             if (!nunits.is_constant (&cnunits)
>>> -                 || !LOOP_VINFO_VECT_FACTOR (loop_vinfo).is_constant (&cvf)
>>> -                 || (((cremain = (group_size * cvf - gap) % cnunits), true)
>>> -                     && ((cpart_size = (1 << ceil_log2 (cremain))), true)
>>> -                     && (cremain + group_size < cpart_size
>>> -                         || vector_vector_composition_type
>>> -                              (vectype, cnunits / cpart_size,
>>> -                               &half_vtype) == NULL_TREE)))
>>> -               {
>>> -                 if (dump_enabled_p ())
>>> -                   dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>> -                                    "peeling for gaps insufficient for "
>>> -                                    "access\n");
>>> -                 return false;
>>> -               }
>>> +             && ((cpart_size = (1 << ceil_log2 (cremain))), true)
>>> +             && (cremain + group_size < cpart_size
>>> +                 || vector_vector_composition_type
>>> +                      (vectype, cnunits / cpart_size,
>>> +                       &half_vtype) == NULL_TREE))
>>> +           {
>>> +             if (dump_enabled_p ())
>>> +               dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
>>> +                                "peeling for gaps insufficient for "
>>> +                                "access\n");
>>> +             return false;
>>>             }
>>>         }
>>>     }
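
(For reference, plugging invented numbers into the quoted condition:
with group_size == 3, gap == 1, VF == 4 and nunits == 8 we get

      cremain    = (3 * 4 - 1) % 8    = 3
      cpart_size = 1 << ceil_log2 (3) = 4

cremain + group_size == 6 is less than nunits, so a full 8-element
access in the last vector iteration could overrun by more than the
single peeled scalar iteration covers; but 6 >= cpart_size, so a
4-element partial access stays covered, provided
vector_vector_composition_type can build the vector from two halves.)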

