Re: Add missing cases to vect_get_smallest_scalar_type (PR 85286)
- From: Richard Biener <richard dot guenther at gmail dot com>
- To: Jakub Jelinek <jakub at redhat dot com>, GCC Patches <gcc-patches at gcc dot gnu dot org>, Richard Sandiford <richard dot sandiford at linaro dot org>
- Date: Tue, 10 Apr 2018 14:58:16 +0200
- Subject: Re: Add missing cases to vect_get_smallest_scalar_type (PR 85286)
- References: <877epg6zlq.fsf@linaro.org> <20180409180141.GU8577@tucnak> <871sfn739h.fsf@linaro.org>
On Tue, Apr 10, 2018 at 12:40 PM, Richard Sandiford
<richard.sandiford@linaro.org> wrote:
> Jakub Jelinek <jakub@redhat.com> writes:
>> On Mon, Apr 09, 2018 at 06:47:45PM +0100, Richard Sandiford wrote:
>>> In this PR we used WIDEN_SUM_EXPR to vectorise:
>>>
>>> short i, y;
>>> int sum;
>>> [...]
>>> for (i = x; i > 0; i--)
>>> sum += y;
>>>
>>> with 4 ints and 8 shorts per vector. The problem was that we set
>>> the VF based only on the ints, then calculated the number of vector
>>> copies based on the shorts, giving 4/8. Previously that led to
>>> ncopies==0, but after r249897 we pick it up as an ICE.
>>>
>>> In this particular case we could vectorise the reduction by setting
>>> ncopies based on the output type rather than the input type, but it
>>> doesn't seem worth adding a special "optimisation" for such a
>>> pathological case. I think it's really an instance of the more general
>>> problem that we can't vectorise using combinations of (say) 64-bit and
>>> 128-bit vectors on targets that support both.
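A minimal sketch of that mismatch (not GCC code; the names are illustrative),
assuming a 128-bit vector size, i.e. 4 ints or 8 shorts per vector:

  #include <stdio.h>

  int
  main (void)
  {
    int vf = 4;       /* VF chosen from the 4-lane int output vectors.  */
    int nunits = 8;   /* Lanes per vector for the 8-lane short input.   */

    /* The number of vector copies is VF / nunits; integer division
       truncates 4/8 to 0, the ncopies==0 case mentioned above.  */
    int ncopies = vf / nunits;
    printf ("ncopies = %d\n", ncopies);
    return 0;
  }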
>>
>> We badly need that; there are plenty of PRs where we generate really large
>> vectorized loops because of it, e.g. on x86 where we can easily use 128-bit,
>> 256-bit and 512-bit vectors. But I'm afraid it is not stage4 material.
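For illustration (a hypothetical loop, not from any particular PR):

  void
  widen_accum (int *restrict out, const unsigned char *restrict in, int n)
  {
    for (int i = 0; i < n; i++)
      out[i] += in[i];
  }

With a single 512-bit vector size the chars set VF = 64, so every int
statement needs four vector copies; combining 128-bit vectors for the chars
with 512-bit vectors for the ints would let each statement map to one
instruction.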
>
> Yeah. We also need it on AArch64 for a proper implementation of simd
> clones for Advanced SIMD.
>
> I think it's related to one of the most important missed optimisations
> for SVE: when using mixed data sizes, it's usually better to store the
> smaller data unpacked in wider lanes, and there's direct support for
> loading and storing it that way. In both the SVE and non-SVE cases,
> we want the VF sometimes to be based on wider sizes rather than the
> narrowest one.
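As an example (again an illustrative loop, not the one from the PR):

  void
  scale_shorts (short *restrict data, const int *restrict scale, int n)
  {
    for (int i = 0; i < n; i++)
      data[i] = (short) (data[i] * scale[i]);
  }

Here the shorts can stay unpacked in 32-bit lanes, using SVE's LD1SH to load
them sign-extended into .s elements and ST1H to store them back, so basing
the VF on the wider int type needs no separate pack/unpack steps.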
It's unfortunately not very easy to remove the limitation in full, and in
general it widens the space we need to search for the best vectorization
even further...
> FWIW, I have some patches queued for GCC 9 that should make it
> easier to implement this (but no promises). They're also supposed
> to make it possible to compare the costs of different implementations
> side-by-side, rather than always picking the first one that has
> a lower cost than the scalar code.
I also have a similar patch in the works.
Richard.
> Thanks,
> Richard