[PATCH] vect: Add a “very cheap” cost model

Richard Sandiford richard.sandiford@arm.com
Mon Nov 16 09:58:45 GMT 2020


Richard Biener <richard.guenther@gmail.com> writes:
> On Fri, Nov 13, 2020 at 7:35 PM Richard Sandiford via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> Currently we have three vector cost models: cheap, dynamic and
>> unlimited.  -O2 -ftree-vectorize uses “cheap” by default, but that's
>> still relatively aggressive about peeling and aliasing checks,
>> and can lead to significant code size growth.
>>
>> This patch adds an even more conservative choice, which for lack of
>> imagination I've called “very cheap”.  It only allows vectorisation
>> if the vector code entirely replaces the scalar code.  It also
>> requires one iteration of the vector loop to pay for itself,
>> regardless of how often the loop iterates.  (If the vector loop
>> needs multiple iterations to be beneficial then things are
>> probably too close to call, and the conservative thing would
>> be to stick with the scalar code.)
>>
>> The idea is that this should be suitable for -O2, although the patch
>> doesn't change any defaults itself.
>>
>> I tested this by building and running a bunch of workloads for SVE,
>> with three options:
>>
>>   (1) -O2
>>   (2) -O2 -ftree-vectorize -fvect-cost-model=very-cheap
>>   (3) -O2 -ftree-vectorize [-fvect-cost-model=cheap]
>>
>> All three builds used the default -msve-vector-bits=scalable and
>> ran with the minimum vector length of 128 bits, which should give
>> a worst-case bound for the performance impact.
>>
>> The workloads included a mixture of microbenchmarks and full
>> applications.  Because it's quite an eclectic mix, there's not
>> much point giving exact figures.  The aim was more to get a general
>> impression.
>>
>> Code size growth with (2) was much lower than with (3).  Only a
>> handful of tests increased by more than 5%, and all of them were
>> microbenchmarks.
>>
>> In terms of performance, (2) was significantly faster than (1)
>> on microbenchmarks (as expected) but also on some full apps.
>> Again, performance only regressed on a handful of tests.
>>
>> As expected, the performance of (3) vs. (1) and (3) vs. (2) is more
>> of a mixed bag.  There are several significant improvements with (3)
>> over (2), but also some (smaller) regressions.  That seems to be in
>> line with -O2 -ftree-vectorize being a kind of -O2.5.
>
> So previous attempts at enabling vectorization at -O2 also factored
> in compile-time requirements.  We've looked mainly at SPEC and
> there even the current "cheap" model doesn't fare very well IIRC
> and costs quite some compile-time and code-size.

Yeah, that seems to match what I was seeing with the cheap model:
the size could increase quite significantly.

> Turning down vectorization even more will have even less impact on
> performance but the compile-time cost will likely not shrink very
> much.

Agreed.  We've already done most of the work by the time we decide not
to go ahead.

I didn't really measure compile time TBH.  This was mostly written
from an SVE point of view: when SVE is enabled, vectorisation is
important enough that it's IMO worth paying the compile-time cost.

> I think we need ways to detect candidates that will end up
> cheap or very cheap without actually doing all of the analysis
> first.

Yeah, that sounds good if it's doable.  But with SVE, the aim
is to reduce the number of cases in which a loop would fail to
be vectorised on cost grounds.  I hope we'll be able to do more
of that for GCC 12.

E.g. one of the uses of the SVE2 WHILERW and WHILEWR instructions
is to clamp the amount of work that the vector loop does based on
runtime aliases.  We don't yet use it for that (it's still on
the TODO list), but once we do, runtime aliases would often not
be a problem even for the very cheap model.  And SVE already removes
two of the other main reasons for aborting early: the need to peel
for alignment and the need to peel for niters.

There are cases like peeling for gaps that should produce scalar code
even with SVE, but they probably aren't common enough to have a
significant impact on compile time.

So in a sense, the aim with SVE is to make that kind of early-out test
redundant as much as possible.

>> The patch reorders vect_cost_model so that values are in order
>> of increasing aggressiveness, which makes it possible to use
>> range checks.  The value 0 still represents “unlimited”,
>> so “if (flag_vect_cost_model)” is still a meaningful check.
>>
>> Tested on aarch64-linux-gnu, arm-linux-gnueabihf and
>> x86_64-linux-gnu.  OK to install?
>
> Does the patch also vectorize with SVE loops that have
> unknown loop bound?  The documentation isn't entirely
> conclusive there.

Yeah, for SVE it vectorises.  How about changing:

  For example, if each iteration of a vectorized loop would handle
  exactly four iterations, …

to:

  For example, if each iteration of a vectorized loop could only
  handle exactly four iterations of the original scalar loop, …

?

> Iff the iteration count is a multiple of two and the target can
> vectorize the loop with both VF 2 and VF 4 but VF 4 would be better if
> we'd use the 'cheap' cost model, does 'very-cheap' not vectorize the
> loop or does it choose VF 2?

It would choose VF 2, if that's still a win over scalar code.

> In itself the patch is reasonable, thus OK.

Thanks.

Richard


More information about the Gcc-patches mailing list