This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.

Re: [PATCH] Detect a pack-unpack pattern in GCC vectorizer and optimize it.

From: Richard Biener <rguenther at suse dot de>
To: Cong Hou <congh at google dot com>
Cc: GCC Patches <gcc-patches at gcc dot gnu dot org>
Date: Mon, 28 Apr 2014 13:04:27 +0200 (CEST)
Subject: Re: [PATCH] Detect a pack-unpack pattern in GCC vectorizer and optimize it.
Authentication-results: sourceware.org; auth=none
References: <CAK=A3=2ZcCUrMi60=ViCsr8qe0M8eMXiAsTjog3q1HBDuiBPLQ at mail dot gmail dot com>

On Thu, 24 Apr 2014, Cong Hou wrote:

> Given the following loop:
> 
> int a[N];
> short b[N*2];
> 
> for (int i = 0; i < N; ++i)
>   a[i] = b[i*2];
> 
> 
> After being vectorized, the access to b[i*2] will be compiled into
> several packing statements, while the type promotion from short to int
> will be compiled into several unpacking statements. With this patch,
> each pair of pack/unpack statements will be replaced by less expensive
> statements (with shift or bit-and operations).
> 
> On x86_64, the loop above will be compiled into the following assembly
> (with -O2 -ftree-vectorize):
> 
> movdqu 0x10(%rcx),%xmm3
> movdqu -0x20(%rcx),%xmm0
> movdqa %xmm0,%xmm2
> punpcklwd %xmm3,%xmm0
> punpckhwd %xmm3,%xmm2
> movdqa %xmm0,%xmm3
> punpcklwd %xmm2,%xmm0
> punpckhwd %xmm2,%xmm3
> movdqa %xmm1,%xmm2
> punpcklwd %xmm3,%xmm0
> pcmpgtw %xmm0,%xmm2
> movdqa %xmm0,%xmm3
> punpckhwd %xmm2,%xmm0
> punpcklwd %xmm2,%xmm3
> movups %xmm0,-0x10(%rdx)
> movups %xmm3,-0x20(%rdx)
> 
> 
> With this patch, the generated assembly is shown below:
> 
> movdqu 0x10(%rcx),%xmm0
> movdqu -0x20(%rcx),%xmm1
> pslld  $0x10,%xmm0
> psrad  $0x10,%xmm0
> pslld  $0x10,%xmm1
> movups %xmm0,-0x10(%rdx)
> psrad  $0x10,%xmm1
> movups %xmm1,-0x20(%rdx)
> 
> 
> Bootstrapped and tested on x86-64. OK for trunk?
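
As a scalar illustration of why the shift pair in the quoted assembly can stand in for the pack/unpack sequence, the sketch below models a single 32-bit lane holding b[2*i] in its low half and b[2*i+1] in its high half. The function names are hypothetical and do not come from the patch. The lane layout assumes little-endian element order as on x86, and the conversion to a signed type plus the arithmetic right shift rely on GCC's documented implementation-defined behaviour.

```c
#include <assert.h>
#include <stdint.h>

/* Build one 32-bit lane from two adjacent 16-bit elements, low element
   in the low half (little-endian element order is an assumption). */
static uint32_t make_lane(uint16_t lo, uint16_t hi)
{
    return (uint32_t)lo | ((uint32_t)hi << 16);
}

/* Reference semantics of a[i] = b[i*2]: select the even element and
   sign-extend it to 32 bits. */
static int32_t widen_even_ref(const int16_t *b, int i)
{
    return b[2 * i];
}

/* Shift form, mirroring pslld $16 followed by psrad $16: the left shift
   discards the odd element, the arithmetic right shift sign-extends. */
static int32_t widen_even_shift(const int16_t *b, int i)
{
    uint32_t lane = make_lane((uint16_t)b[2 * i], (uint16_t)b[2 * i + 1]);
    return (int32_t)(lane << 16) >> 16;
}

/* Unsigned variant: zero-extension needs only a bit-and per lane. */
static uint32_t widen_even_mask(const uint16_t *b, int i)
{
    uint32_t lane = make_lane(b[2 * i], b[2 * i + 1]);
    return lane & 0xFFFFu;
}

/* Compare the shift form against the reference on n output elements. */
static int shift_matches_ref(const int16_t *b, int n)
{
    for (int i = 0; i < n; ++i)
        if (widen_even_shift(b, i) != widen_even_ref(b, i))
            return 0;
    return 1;
}
```

The signed case costs two shifts per lane while the unsigned case costs a single bit-and, which is why the sign-extend path is the more shift-heavy of the two.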

This is an odd place to implement such a transform.  Also, whether
it is faster depends on the exact ISA you target - for
example, ppc has constraints on the maximum number of shifts
carried out in parallel, and the above has 4 in very short
succession.  Especially for the sign-extend path.

So this looks more like an opportunity for a post-vectorizer
transform on RTL or for the vectorizer special-casing
widening loads with a vectorizer pattern.

Richard.

References:
- [PATCH] Detect a pack-unpack pattern in GCC vectorizer and optimize it.
  - From: Cong Hou
