This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.
Re: VEC_WIDEN_MULT_(LO|HI)_EXPR vs. VEC_WIDEN_MULT_(EVEN|ODD)_EXPR in vectorization.
- From: Richard Biener <richard dot guenther at gmail dot com>
- To: Bingfeng Mei <bmei at broadcom dot com>
- Cc: "gcc at gcc dot gnu dot org" <gcc at gcc dot gnu dot org>, "gcc-patches at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>
- Date: Wed, 29 Jan 2014 10:32:25 +0100
- Subject: Re: VEC_WIDEN_MULT_(LO|HI)_EXPR vs. VEC_WIDEN_MULT_(EVEN|ODD)_EXPR in vectorization.
On Tue, Jan 28, 2014 at 4:17 PM, Bingfeng Mei <bmei@broadcom.com> wrote:
> I checked the vectorization code. It seems that the only relevant place where vec_widen_mult_even/odd and vec_widen_mult_lo/hi are generated is supportable_widening_operation. One of these pairs is selected, with priority given to vec_widen_mult_even/odd if it is a reduction loop. However, the lo/hi pair seems to have wider usage than the even/odd pair (non-loop? non-reduction?). Maybe that's why AltiVec and x86 still implement both pairs. Is the following patch OK?
Ok.
Thanks,
Richard.
> Index: gcc/ChangeLog
> ===================================================================
> --- gcc/ChangeLog (revision 207183)
> +++ gcc/ChangeLog (working copy)
> @@ -1,3 +1,9 @@
> +2014-01-28 Bingfeng Mei <bmei@broadcom.com>
> +
> + * doc/md.texi: Mention that a target shouldn't implement
> + vec_widen_(s|u)mul_even/odd pair if it is less efficient
> + than hi/lo pair.
> +
> 2014-01-28 Richard Biener <rguenther@suse.de>
>
> Revert
> Index: gcc/doc/md.texi
> ===================================================================
> --- gcc/doc/md.texi (revision 207183)
> +++ gcc/doc/md.texi (working copy)
> @@ -4918,7 +4918,8 @@ the output vector (operand 0).
> Signed/Unsigned widening multiplication. The two inputs (operands 1 and 2)
> are vectors with N signed/unsigned elements of size S@. Multiply the high/low
> or even/odd elements of the two vectors, and put the N/2 products of size 2*S
> -in the output vector (operand 0).
> +in the output vector (operand 0). A target shouldn't implement even/odd pattern
> +pair if it is less efficient than lo/hi one.
>
> @cindex @code{vec_widen_ushiftl_hi_@var{m}} instruction pattern
> @cindex @code{vec_widen_ushiftl_lo_@var{m}} instruction pattern
>
>
> -----Original Message-----
> From: Richard Biener [mailto:richard.guenther@gmail.com]
> Sent: 28 January 2014 12:56
> To: Bingfeng Mei
> Cc: gcc@gcc.gnu.org
> Subject: Re: VEC_WIDEN_MULT_(LO|HI)_EXPR vs. VEC_WIDEN_MULT_(EVEN|ODD)_EXPR in vectorization.
>
> On Tue, Jan 28, 2014 at 12:08 PM, Bingfeng Mei <bmei@broadcom.com> wrote:
>> Thanks, Richard. That is not very clear from the documentation.
>>
>> "Signed/Unsigned widening multiplication. The two inputs (operands 1 and 2)
>> are vectors with N signed/unsigned elements of size S. Multiply the high/low
>> or even/odd elements of the two vectors, and put the N/2 products of size 2*S
>> in the output vector (operand 0)."
>>
>> So I thought that implementing both could help the vectorizer optimize more loops.
>> Maybe we should improve the documentation.
>
> Maybe. But my answer was off the top of my head, so better double-check
> in the vectorizer sources.
>
> Richard.
>
>> Bingfeng
>>
>>
>>
>> -----Original Message-----
>> From: Richard Biener [mailto:richard.guenther@gmail.com]
>> Sent: 28 January 2014 11:02
>> To: Bingfeng Mei
>> Cc: gcc@gcc.gnu.org
>> Subject: Re: VEC_WIDEN_MULT_(LO|HI)_EXPR vs. VEC_WIDEN_MULT_(EVEN|ODD)_EXPR in vectorization.
>>
>> On Wed, Jan 22, 2014 at 1:20 PM, Bingfeng Mei <bmei@broadcom.com> wrote:
>>> Hi,
>>> I noticed a vectorization regression on our port in 4.8 compared with the ancient 4.5. After a bit of investigation, I found the following code, which prefers the even|odd version over the lo|hi one. That preference obviously suits AltiVec and maybe some other targets. But the even|odd versions (which expand to a series of instructions) are less efficient on our target than the lo|hi ones. Shouldn't there be a target-specific hook to make the choice instead of hard-coding it here, or some cost-estimation technique to compare the two alternatives?
>>
>> Hmm, what's the reason for a target to support both? I think the idea
>> was that a target only supports either (the more efficient case).
>>
>> Richard.
>>
>>> /* The result of a vectorized widening operation usually requires
>>> two vectors (because the widened results do not fit into one vector).
>>> The generated vector results would normally be expected to be
>>> generated in the same order as in the original scalar computation,
>>> i.e. if 8 results are generated in each vector iteration, they are
>>> to be organized as follows:
>>> vect1: [res1,res2,res3,res4],
>>> vect2: [res5,res6,res7,res8].
>>>
>>> However, in the special case that the result of the widening
>>> operation is used in a reduction computation only, the order doesn't
>>> matter (because when vectorizing a reduction we change the order of
>>> the computation). Some targets can take advantage of this and
>>> generate more efficient code. For example, targets like Altivec,
>>> that support widen_mult using a sequence of {mult_even,mult_odd}
>>> generate the following vectors:
>>> vect1: [res1,res3,res5,res7],
>>> vect2: [res2,res4,res6,res8].
>>>
>>> When vectorizing outer-loops, we execute the inner-loop sequentially
>>> (each vectorized inner-loop iteration contributes to VF outer-loop
>>> iterations in parallel). We therefore don't allow to change the
>>> order of the computation in the inner-loop during outer-loop
>>> vectorization. */
>>> /* TODO: Another case in which order doesn't *really* matter is when we
>>> widen and then contract again, e.g. (short)((int)x * y >> 8).
>>> Normally, pack_trunc performs an even/odd permute, whereas the
>>> repack from an even/odd expansion would be an interleave, which
>>> would be significantly simpler for e.g. AVX2. */
>>> /* In any case, in order to avoid duplicating the code below, recurse
>>> on VEC_WIDEN_MULT_EVEN_EXPR. If it succeeds, all the return values
>>> are properly set up for the caller. If we fail, we'll continue with
>>> a VEC_WIDEN_MULT_LO/HI_EXPR check. */
>>>   if (vect_loop
>>>       && STMT_VINFO_RELEVANT (stmt_info) == vect_used_by_reduction
>>>       && !nested_in_vect_loop_p (vect_loop, stmt)
>>>       && supportable_widening_operation (VEC_WIDEN_MULT_EVEN_EXPR,
>>>                                          stmt, vectype_out, vectype_in,
>>>                                          code1, code2, multi_step_cvt,
>>>                                          interm_types))
>>>     return true;
>>>
>>>
>>> Thanks,
>>> Bingfeng Mei
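
The lane ordering described in the quoted comment can be modelled in plain scalar C. This is only a minimal sketch, not GCC code: the function names, the vector length N = 8, and the int16_t/int32_t element types are assumptions chosen for illustration, and which half of the input a real target calls "lo" depends on its endianness. It shows that the lo/hi pair keeps the products in scalar order, that the even/odd pair interleaves them, and that a sum reduction gives the same result either way.

```c
#include <assert.h>
#include <stdint.h>

#define N 8 /* elements per input vector (illustrative assumption) */

/* Models the lo/hi pair: each output holds N/2 consecutive widened
   products, so the two outputs together keep the scalar order.
   Treating the first half as "lo" is an assumption of this sketch. */
static void widen_mult_lo_hi(const int16_t a[N], const int16_t b[N],
                             int32_t lo[N / 2], int32_t hi[N / 2])
{
    for (int i = 0; i < N / 2; i++) {
        lo[i] = (int32_t)a[i] * b[i];
        hi[i] = (int32_t)a[i + N / 2] * b[i + N / 2];
    }
}

/* Models the even/odd pair: one output holds the products of the
   even-indexed elements, the other those of the odd-indexed ones. */
static void widen_mult_even_odd(const int16_t a[N], const int16_t b[N],
                                int32_t even[N / 2], int32_t odd[N / 2])
{
    for (int i = 0; i < N / 2; i++) {
        even[i] = (int32_t)a[2 * i] * b[2 * i];
        odd[i] = (int32_t)a[2 * i + 1] * b[2 * i + 1];
    }
}

/* Sum reduction over both result vectors. */
static int64_t reduce_sum(const int32_t v1[N / 2], const int32_t v2[N / 2])
{
    int64_t sum = 0;
    for (int i = 0; i < N / 2; i++)
        sum += (int64_t)v1[i] + v2[i];
    return sum;
}

/* Computes the reduction both ways and checks that lane order does
   not change the result. */
static int64_t demo_reduction(const int16_t a[N], const int16_t b[N])
{
    int32_t lo[N / 2], hi[N / 2], even[N / 2], odd[N / 2];

    widen_mult_lo_hi(a, b, lo, hi);
    widen_mult_even_odd(a, b, even, odd);
    assert(reduce_sum(lo, hi) == reduce_sum(even, odd));
    return reduce_sum(lo, hi);
}
```

If the products were stored back to memory instead of being summed, only the lo/hi layout would match the scalar order without an extra permute, which is consistent with the observation above that the lo/hi pair has wider usage.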