This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: combine permutations in gimple


On 13 August 2012 14:21, Marc Glisse <marc.glisse@inria.fr> wrote:
> On Mon, 13 Aug 2012, Richard Guenther wrote:
>
>> On Mon, Aug 13, 2012 at 3:12 PM, Ramana Radhakrishnan
>> <ramana.radhakrishnan@linaro.org> wrote:
>>>>
>>>>
>>>> I guess people will complain soon enough if this causes horrible
>>>> performance
>>>> regressions in vectorized code.
>>>
>>>
>>> Not having looked at your patch in great detail, surely what we don't
>>> want is a situation where 2 constant permutations are converted into
>>> one generic permute. Based on a quick read of your patch I couldn't
>>> work that out. It might be that 2 constant permutes are cheaper than
>>> a generic permute. Have you looked at any examples in that space? I
>>> surely wouldn't like to see a sequence of interleave / transpose
>>> change into a generic permute operation on Neon, as that would be far
>>> more expensive. It surely needs more testing than just this bit
>>> before going in, since the generic permute would likely take more
>>> registers and indeed require loads from a constant pool for the new
>>> mask.
>
>
> What do you call constant / non-constant? The combined permutation is still
> constant, although the expansion (in the back-end) might fail to expand it
> efficiently and fall back to the generic permutation expansion...


That is exactly what I was worried about. By constant I would expect
something that is expanded as a constant permute by the backend - an
interleave operation, a transpose operation, or indeed some of the
funky operations shown below in ARM / Neon, which is the SIMD
architecture I'm most familiar with.


If you had something that did the following:


#include <arm_neon.h>

uint8x8_t
tst_vrev642_u8 (uint8x8_t __a)
{
  uint8x8_t __mask1 = { 7, 6, 5, 4, 3, 2, 1, 0 };
  uint8x8_t __mask2 = { 0, 8, 2, 10, 4, 12, 6, 14 };
  return __builtin_shuffle (__builtin_shuffle (__a, __mask1), __mask2);
}



I would expect these instructions:

	vrev64.8	d0, d0
	vmov	d16, d0  @ v8qi
	vtrn.8	d0, d16
	bx	lr

With your patch I see:

tst_vrev642_u8:
	@ args = 0, pretend = 0, frame = 0
	@ frame_needed = 0, uses_anonymous_args = 0
	@ link register save eliminated.
	fldd	d16, .L2
	vtbl.8	d0, {d0}, d16
	bx	lr
.L3:
	.align	3
.L2:
	.byte	7
	.byte	7
	.byte	5
	.byte	5
	.byte	3
	.byte	3
	.byte	1
	.byte	1
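
Those .byte values are just the composition of the two constant masks:
for a single-vector __builtin_shuffle, mask indices are taken modulo
the element count, so the fused mask is mask1[mask2[i] % 8]. Here is a
minimal standalone sketch (not part of the original mail) that
reproduces the pool contents:

#include <stdio.h>

int
main (void)
{
  unsigned char mask1[8] = { 7, 6, 5, 4, 3, 2, 1, 0 };
  unsigned char mask2[8] = { 0, 8, 2, 10, 4, 12, 6, 14 };

  /* shuffle (shuffle (a, mask1), mask2)[i] == a[mask1[mask2[i] % 8]],
     so the fused mask is the composition of the two constants.  */
  for (int i = 0; i < 8; i++)
    printf ("%d ", mask1[mask2[i] % 8]);
  printf ("\n");	/* prints: 7 7 5 5 3 3 1 1 */
  return 0;
}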

It might be that the backend predicates need tightening in this case,
but why not try to get cost information about such combinations rather
than doing the replacement gratuitously? The ARM port might well be
wrong here, but in general, when the vector permute rewrites were done,
we chose to keep the generic constant permute in the backend as a last
resort to fall back on. If fwprop starts making such transformations,
one really ought to get relative costs for each of these operations
rather than allowing gratuitous replacements.
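
To be fair, there are also compositions where folding is a clear win,
which is exactly why a cost query is the right mechanism rather than a
blanket rule either way. A hypothetical example (not from this thread;
the function name is made up, and it assumes the backend recognises the
rev patterns) where the two masks compose to a single byte reversal:

#include <arm_neon.h>

uint8x8_t
tst_compose_to_rev (uint8x8_t __a)
{
  uint8x8_t __mask1 = { 3, 2, 1, 0, 7, 6, 5, 4 };	/* vrev32.8 */
  uint8x8_t __mask2 = { 4, 5, 6, 7, 0, 1, 2, 3 };	/* vrev64.32 */
  /* mask1[mask2[i]] == { 7, 6, 5, 4, 3, 2, 1, 0 }: a full 64-bit
     byte reversal, so the fused shuffle can expand to one vrev64.8
     instead of two instructions.  */
  return __builtin_shuffle (__builtin_shuffle (__a, __mask1), __mask2);
}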



Thanks,
Ramana

