This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)
- From: Ramana Radhakrishnan <ramana dot gcc at googlemail dot com>
- To: Bill Schmidt <wschmidt at linux dot vnet dot ibm dot com>
- Cc: Alan Hayward <alan dot hayward at arm dot com>, "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>
- Date: Fri, 11 Sep 2015 14:46:32 +0100
- Subject: Re: [PATCH] vectorizing conditional expressions (PR tree-optimization/65947)
- Authentication-results: sourceware.org; auth=none
- References: <D217578B dot 7FE4%alan dot hayward at arm dot com> <1441923254 dot 4772 dot 37 dot camel at oc8801110288 dot ibm dot com> <D2184E16 dot 8003%alan dot hayward at arm dot com> <1441977591 dot 2795 dot 11 dot camel at gnopaine>
On Fri, Sep 11, 2015 at 2:19 PM, Bill Schmidt
<wschmidt@linux.vnet.ibm.com> wrote:
> Hi Alan,
>
> I probably wasn't clear enough. The implementation in the vectorizer is
> fine and I'm not asking that to change per target. What I'm objecting
> to is the equivalence between a REDUC_MAX_EXPR and a cost associated
> with vec_to_scalar. This assumes that the back end will implement a
> REDUC_MAX_EXPR in a specific way that at least some back ends cannot.
> But those back ends should be free to model the cost of the
> REDUC_MAX_EXPR appropriately. Therefore I am asking for a new
> vect_cost_for_stmt type to represent the cost of a REDUC_MAX_EXPR. For
> ARM, this cost will be the same as a vec_to_scalar. For others, it may
> not be; for powerpc, it certainly will not be.
>
> We can produce a perfectly fine sequence for a REDUC_MAX_EXPR during RTL
> expansion, and therefore it is not correct for us to explode this in
> tree-vect-generic. This would expand the code size without providing
> any significant optimization opportunity, and could reduce the ability
> to, for instance, common REDUC_MAX_EXPRs. It would also slow down the
> gimple vectorizers.
>
> I apologize if my loose use of language confused the issue. It isn't
> the whole COND_REDUCTION I'm concerned with, but the REDUC_MAX_EXPRs
> that are used by it.
>
> (The costs in powerpc won't be enormous, but they are definitely
> mode-dependent in a way that vec_to_scalar is not. We'll need 2*log(n)
> instructions, where n is the number of elements in the mode being
> vectorized.)
IIUC, on AArch64 a reduc_max_expr matches with a single reduction
operation but on AArch32 Neon a reduc_smax gets implemented as a
sequence of vpmax instructions which sounds similar to the PowerPC
example as well. Thus mapping a reduc_smax expression to the cost of a
vec_to_scalar is probably not right in this particular situation.
regards
Ramana
>
> A secondary concern for powerpc is that REDUC_MAX_EXPR produces a scalar
> that has to be broadcast back to a vector, and the best way to implement
> it for us already has the max value in all positions of a vector. But
> that is something we should be able to fix with simplify-rtx in the back
> end.
>
> Thanks,
> Bill
>
>
> On Fri, 2015-09-11 at 10:15 +0100, Alan Hayward wrote:
>> Hi Bill,
>>
>> Iâd be a bit worried about asking the backend for the cost of a
>> COND_REDUCTION, as that will rely on the backend understanding the
>> implementation the vectorizer is using - every time the vectorizer
>> changed, the backends would need to be updated too. Iâm hoping soon to get
>> together a patch to reduce the stmts produced on the simpler cases, which
>> would require a different set of costings. I can also imagine further
>> improvements being added for other special cases over time. Having the
>> backends understand every variation would be a little cumbersome.
>>
>> As it stands today, we correctly exit the optimisation if max reduction
>> isnât supported in hardware, which is what the cost model is expecting.
>>
>>
>> If power wanted to use this implementation, then I think itâd probably
>> need some code in tree-vect-generic.c to implement on emulated max
>> reduction, which would then require updates to the costs modelling of
>> anything that uses max reduction (not just cond reduction). All of that is
>> outside the scope of this patch.
>>
>>
>> Thanks,
>> Alan.
>>
>> On 10/09/2015 23:14, "Bill Schmidt" <wschmidt@linux.vnet.ibm.com> wrote:
>>
>> >Hi Alan,
>> >
>> >The cost modeling of the epilogue code seems pretty target-specific ("An
>> >EQ stmt and an AND stmt, reduction of the max index and a reduction of
>> >the found values, a broadcast of the max value," resulting in two
>> >vector_stmts, one vec_to_scalar, and two scalar_to_vecs). On powerpc,
>> >this will not represent the cost accurately, and the cost will indeed be
>> >quite different depending on the mode (logarithmic in the number of
>> >elements). I think that you need to create a new entry in
>> >vect_cost_for_stmt to represent the cost of a COND_REDUCTION, and allow
>> >each target to calculate the cost appropriately.
>> >
>> >(Powerpc doesn't have a max-reduction hardware instruction, but because
>> >the reduction will be only in the epilogue code, it may still be
>> >profitable for us to generate the somewhat expensive reduction sequence
>> >in order to vectorize the loop. But we definitely want to model it as
>> >costly in and of itself. Also, the sequence will produce the maximum
>> >value in all positions without a separate broadcast.)
>> >
>> >Thanks,
>> >Bill
>> >
>> >On Thu, 2015-09-10 at 15:51 +0100, Alan Hayward wrote:
>> >> Hi,
>> >> This patch (attached) adds support for vectorizing conditional
>> >>expressions
>> >> (PR 65947), for example:
>> >>
>> >> int condition_reduction (int *a, int min_v)
>> >> {
>> >> int last = 0;
>> >> for (int i = 0; i < N; i++)
>> >> if (a[i] < min_v)
>> >> last = a[i];
>> >> return last;
>> >> }
>> >>
>> >> To do this the loop is vectorised to create a vector of data results (ie
>> >> of matching a[i] values). Using an induction variable, an additional
>> >> vector is added containing the indexes where the matches occured. In the
>> >> function epilogue this is reduced to a single max value and then used to
>> >> index into the vector of data results.
>> >> When no values are matched in the loop, the indexes vector will contain
>> >> all zeroes, eventually matching the first entry in the data results
>> >>vector.
>> >>
>> >> To vectorize sucessfully, support is required for REDUC_MAX_EXPR. This
>> >>is
>> >> supported by aarch64 and arm. On X86 and powerpc, gcc will complain that
>> >> REDUC_MAX_EXPR is not supported for the required modes, failing the
>> >> vectorization. On mips it complains that the required vcond expression
>> >>is
>> >> not supported. It is suggested the relevant backend experts add the
>> >> required backend support.
>> >>
>> >> Using a simple testcase based around a large number of N and run on an
>> >> aarch64 juno board, with the patch in use, the runtime reduced to 0.8 of
>> >> it's original time.
>> >>
>> >> This patch caused binary differences in three spec2006 binaries on
>> >>aarch64
>> >> - 4.16.gamess, 435.gromacs and 456.hmmer. Running them on a juno board
>> >> showed no improvement or degregation in runtime.
>> >>
>> >>
>> >> In the near future I hope to submit a further patch (as PR 66558) which
>> >> optimises the case where the result is simply the index of the loop, for
>> >> example:
>> >> int condition_reduction (int *a, int min_v)
>> >> {
>> >> int last = 0;
>> >> for (int i = 0; i < N; i++)
>> >> if (a[i] < min_v)
>> >> last = i;
>> >> return last;
>> >> }
>> >> In this case a lot of the new code can be optimized away.
>> >>
>> >> I have run check for aarch64, arm and x86 and have seen no regressions.
>> >>
>> >>
>> >> Changelog:
>> >>
>> >> 2015-08-28 Alan Hayward <alan.hayward@arm.com>
>> >>
>> >> PR tree-optimization/65947
>> >> * tree-vect-loop.c
>> >> (vect_is_simple_reduction_1): Find condition reductions.
>> >> (vect_model_reduction_cost): Add condition reduction costs.
>> >> (get_initial_def_for_reduction): Add condition reduction initial
>> >> var.
>> >> (vect_create_epilog_for_reduction): Add condition reduction
>> >>epilog.
>> >> (vectorizable_reduction): Condition reduction support.
>> >> * tree-vect-stmts.c
>> >> (vectorizable_condition): Add vect reduction arg
>> >> * doc/sourcebuild.texi (Vector-specific attributes): Document
>> >> vect_max_reduc
>> >>
>> >> testsuite/Changelog:
>> >>
>> >> PR tree-optimization/65947
>> >> * lib/target-supports.exp
>> >> (check_effective_target_vect_max_reduc): Add.
>> >> * gcc.dg/vect/pr65947-1.c: New test.
>> >> * gcc.dg/vect/pr65947-2.c: New test.
>> >> * gcc.dg/vect/pr65947-3.c: New test.
>> >> * gcc.dg/vect/pr65947-4.c: New test.
>> >> * gcc.dg/vect/pr65947-5.c: New test.
>> >> * gcc.dg/vect/pr65947-6.c: New test.
>> >> * gcc.dg/vect/pr65947-7.c: New test.
>> >> * gcc.dg/vect/pr65947-8.c: New test.
>> >> * gcc.dg/vect/pr65947-9.c: New test.
>> >> * gcc.dg/vect/pr65947-10.c: New test.
>> >> * gcc.dg/vect/pr65947-11.c: New test.
>> >>
>> >>
>> >>
>> >> Thanks,
>> >> Alan
>> >>
>> >>
>> >
>> >
>>
>>
>
>