This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: [PATCH] Detect a pack-unpack pattern in GCC vectorizer and optimize it.
- From: Cong Hou <congh at google dot com>
- To: Richard Biener <richard dot guenther at gmail dot com>
- Cc: Richard Biener <rguenther at suse dot de>, GCC Patches <gcc-patches at gcc dot gnu dot org>
- Date: Wed, 25 Jun 2014 10:34:42 -0700
- Subject: Re: [PATCH] Detect a pack-unpack pattern in GCC vectorizer and optimize it.
- Authentication-results: sourceware.org; auth=none
- References: <CAK=A3=2ZcCUrMi60=ViCsr8qe0M8eMXiAsTjog3q1HBDuiBPLQ at mail dot gmail dot com> <alpine dot LSU dot 2 dot 11 dot 1404281252080 dot 18709 at zhemvz dot fhfr dot qr> <CAK=A3=38pXmXv3jWbQdnXnBAQC0j-exKPV56bhoYOePw4Jbmtw at mail dot gmail dot com> <CAFiYyc0Abwy93CA3GV7d1gTLFpqzBjepzgwhyzbie7jx0wLaQQ at mail dot gmail dot com>
On Tue, Jun 24, 2014 at 4:05 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> On Sat, May 3, 2014 at 2:39 AM, Cong Hou <congh@google.com> wrote:
>> On Mon, Apr 28, 2014 at 4:04 AM, Richard Biener <rguenther@suse.de> wrote:
>>> On Thu, 24 Apr 2014, Cong Hou wrote:
>>>
>>>> Given the following loop:
>>>>
>>>> int a[N];
>>>> short b[N*2];
>>>>
>>>> for (int i = 0; i < N; ++i)
>>>> a[i] = b[i*2];
>>>>
>>>>
>>>> After being vectorized, the access to b[i*2] will be compiled into
>>>> several packing statements, while the type promotion from short to int
>>>> will be compiled into several unpacking statements. With this patch,
>>>> each pair of pack/unpack statements will be replaced by less expensive
>>>> statements (with shift or bit-and operations).
>>>>
>>>> On x86_64, the loop above will be compiled into the following assembly
>>>> (with -O2 -ftree-vectorize):
>>>>
>>>> movdqu 0x10(%rcx),%xmm3
>>>> movdqu -0x20(%rcx),%xmm0
>>>> movdqa %xmm0,%xmm2
>>>> punpcklwd %xmm3,%xmm0
>>>> punpckhwd %xmm3,%xmm2
>>>> movdqa %xmm0,%xmm3
>>>> punpcklwd %xmm2,%xmm0
>>>> punpckhwd %xmm2,%xmm3
>>>> movdqa %xmm1,%xmm2
>>>> punpcklwd %xmm3,%xmm0
>>>> pcmpgtw %xmm0,%xmm2
>>>> movdqa %xmm0,%xmm3
>>>> punpckhwd %xmm2,%xmm0
>>>> punpcklwd %xmm2,%xmm3
>>>> movups %xmm0,-0x10(%rdx)
>>>> movups %xmm3,-0x20(%rdx)
>>>>
>>>>
>>>> With this patch, the generated assembly is shown below:
>>>>
>>>> movdqu 0x10(%rcx),%xmm0
>>>> movdqu -0x20(%rcx),%xmm1
>>>> pslld $0x10,%xmm0
>>>> psrad $0x10,%xmm0
>>>> pslld $0x10,%xmm1
>>>> movups %xmm0,-0x10(%rdx)
>>>> psrad $0x10,%xmm1
>>>> movups %xmm1,-0x20(%rdx)
>>>>
>>>>
>>>> Bootstrapped and tested on x86-64. OK for trunk?
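
[For readers of the archive: the pslld $0x10 / psrad $0x10 pairs in the optimized assembly above perform, per 32-bit lane, the operation "keep the low 16-bit half and sign-extend it", which on a little-endian target is exactly b[i*2]. A plain-C scalar sketch of that per-word trick (not code from the patch; the function name is made up for illustration):

```c
#include <stdint.h>

/* Sketch of what pslld $16 / psrad $16 do to one 32-bit lane:
 * shift the low 16-bit half (b[2*i] on a little-endian target) up
 * to the top, then arithmetic-shift it back down, dragging the
 * sign bit along.  Right shift of a negative int32_t is
 * arithmetic on GCC targets. */
static int32_t widen_low_half(uint32_t word)
{
    return (int32_t)(word << 16) >> 16;
}
```

For example, a word whose low half holds the short -1 widens to the int -1, regardless of what the high half contains.]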
>>>
>>> This is an odd place to implement such a transform. Also, whether
>>> it is faster or not depends on the exact ISA you target - for
>>> example ppc has constraints on the maximum number of shifts
>>> carried out in parallel, and the above has 4 in very short
>>> succession, esp. on the sign-extend path.
>>
>> Thank you for the information about ppc. If this is an issue, I think
>> we can do it in a target-dependent way.
>>
>>
>>>
>>> So this looks more like an opportunity for a post-vectorizer
>>> transform on RTL or for the vectorizer special-casing
>>> widening loads with a vectorizer pattern.
>>
>> I am not sure the RTL transform would be easier to implement. I
>> prefer the widening-load method, which can be detected in a pattern
>> recognizer. The target-related issue would be resolved by only
>> expanding the widening load on those targets where this pattern is
>> beneficial. But this requires new tree operations to be defined. What
>> is your suggestion?
>>
>> I apologize for the delayed reply.
>
> Likewise ;)
>
> I suggest implementing this optimization in vector lowering in
> tree-vect-generic.c. For your example this pass sees
>
> vect__5.7_32 = MEM[symbol: b, index: ivtmp.15_13, offset: 0B];
> vect__5.8_34 = MEM[symbol: b, index: ivtmp.15_13, offset: 16B];
> vect_perm_even_35 = VEC_PERM_EXPR <vect__5.7_32, vect__5.8_34, { 0, 2, 4, 6, 8, 10, 12, 14 }>;
> vect__6.9_37 = [vec_unpack_lo_expr] vect_perm_even_35;
> vect__6.9_38 = [vec_unpack_hi_expr] vect_perm_even_35;
>
> where you can apply the pattern matching and transform (after checking
> with the target, of course).
This sounds good to me! I'll try to make a patch following your suggestion.
Thank you!
Cong
>
> Richard.
>
>>
>> thanks,
>> Cong
>>
>>>
>>> Richard.