This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: combine permutations in gimple


On Mon, 13 Aug 2012, Ramana Radhakrishnan wrote:

On 13 August 2012 14:21, Marc Glisse <marc.glisse@inria.fr> wrote:
On Mon, 13 Aug 2012, Richard Guenther wrote:

On Mon, Aug 13, 2012 at 3:12 PM, Ramana Radhakrishnan
<ramana.radhakrishnan@linaro.org> wrote:


I guess people will complain soon enough if this causes horrible
performance regressions in vectorized code.


Not having looked at your patch in great detail, surely what we don't
want is a situation where 2 constant permutations are converted into
one generic permute. Based on a quick read of your patch I couldn't
work that out. It might be that 2 constant permutes are cheaper than
a generic permute. Have you looked at any examples in that space? I
surely wouldn't like to see a sequence of interleave / transpose
change into a generic permute operation on Neon, as that would be far
more expensive. It surely needs more testing than just this bit before
going in, the reason being that this would likely take more registers
and indeed produce loads from a constant pool for the new mask.


What do you call constant / non-constant? The combined permutation is
still constant, although the back-end might fail to expand it
efficiently and fall back to the generic permutation expansion...


That is exactly what I was worried about. By constant I would expect
something that is expanded as a constant permute by the backend - an
interleave operation, a transpose operation, or indeed some of the
funky operations below in ARM / Neon, which is the SIMD architecture
I'm most familiar with.


If you had something that did the following:



uint8x8_t
tst_vrev642_u8 (uint8x8_t __a)
{
  uint8x8_t __rv;
  uint8x8_t __mask1 = { 7, 6, 5, 4, 3, 2, 1, 0 };
  uint8x8_t __mask2 = { 0, 8, 2, 10, 4, 12, 6, 14 };
  return __builtin_shuffle (__builtin_shuffle (__a, __mask1), __mask2);
}



I would expect these instructions

	vrev64.8	d0, d0
	vmov	d16, d0  @ v8qi
	vtrn.8	d0, d16
	bx	lr

With your patch I see

tst_vrev642_u8:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	fldd	d16, .L2
	vtbl.8	d0, {d0}, d16
	bx	lr
.L3:
	.align	3
.L2:
	.byte	7
	.byte	7
	.byte	5
	.byte	5
	.byte	3
	.byte	3
	.byte	1
	.byte	1

Seems to be one instruction shorter at least ;-) Yes, there can be much worse regressions than that because of the patch (like 40 instructions instead of 4, in the x86 backend).
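
To make "one permutation" concrete: a minimal sketch in plain C (not part of the patch; it only assumes the documented __builtin_shuffle behaviour, where mask indices pointing past the single input vector wrap back onto the same data) that composes your two constant masks. It reproduces exactly the bytes that end up in the literal pool above:

#include <stdio.h>

int
main (void)
{
  /* The two constant masks from tst_vrev642_u8 above.  */
  unsigned char mask1[8] = { 7, 6, 5, 4, 3, 2, 1, 0 };
  unsigned char mask2[8] = { 0, 8, 2, 10, 4, 12, 6, 14 };
  unsigned char combined[8];

  /* shuffle (shuffle (a, mask1), mask2)[i] == a[mask1[mask2[i] % 8]],
     so the composed mask is mask1 indexed by mask2 (mod 8, because
     indices >= 8 refer back to the same vector here).  */
  for (int i = 0; i < 8; i++)
    combined[i] = mask1[mask2[i] % 8];

  for (int i = 0; i < 8; i++)
    printf ("%d ", combined[i]);	/* prints 7 7 5 5 3 3 1 1 */
  printf ("\n");
  return 0;
}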


It might be that the backend predicates need tightening in this case,
but why not try to get cost info about such combinations rather than
just doing this gratuitously?

I don't think the word gratuitous is appropriate. The middle-end replaces a + -b with a - b without first asking the backend whether it might be more efficient. One permutation is better than two. It just happens that the range of possible permutations is too large (and the basic instructions are too strange) for backends to do a good job on them, and thus keeping the toplevel input as a hint is important.


The ARM port might be wrong in this case, but in general, when the
vector permute rewrites were done, we chose to go ahead and keep the
generic constant permutes in the backend as the last resort to fall
back to. However, if fwprop starts making such transformations, one
really ought to get relative costs for each of these operations rather
than allowing gratuitous replacements.

Indeed, having costs would help, but that's a lot of work.


As you can see from the original email, I am willing to limit the optimization to the case where the combined permutation is the identity (yes, that's quite drastic...). I should probably implement that special case anyway, because it doesn't require its argument to have a single use for the optimization to make sense.
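
For what it's worth, here is a minimal sketch of that identity special case, using GCC's generic vector extension rather than arm_neon.h (the function and type names are made up for illustration):

/* Hypothetical example: a reversal composed with itself is the
   identity permutation.  */
typedef unsigned char v8u8 __attribute__ ((vector_size (8)));

v8u8
double_rev (v8u8 a)
{
  v8u8 rev = { 7, 6, 5, 4, 3, 2, 1, 0 };
  /* The composition of the two shuffles is the identity, so the whole
     expression can be replaced by plain 'a', and that stays valid even
     if the inner shuffle result has other uses.  */
  return __builtin_shuffle (__builtin_shuffle (a, rev), rev);
}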

--
Marc Glisse

