This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
Re: [Patch, RTL] Eliminate redundant vec_select moves.
- From: Richard Sandiford <rdsandiford at googlemail dot com>
- To: Tejas Belagod <tbelagod at arm dot com>
- Cc: Bill Schmidt <wschmidt at linux dot vnet dot ibm dot com>, "gcc-patches\ at gcc dot gnu dot org" <gcc-patches at gcc dot gnu dot org>
- Date: Wed, 04 Dec 2013 17:36:19 +0000
- Subject: Re: [Patch, RTL] Eliminate redundant vec_select moves.
- Authentication-results: sourceware.org; auth=none
- References: <527A4309 dot 70209 at arm dot com> <8738n9sj8o dot fsf at talisman dot default> <527A5EF4 dot 5090505 at arm dot com> <87y551r01p dot fsf at talisman dot default> <527A7612 dot 2080406 at arm dot com> <877gcll9ht dot fsf at talisman dot default> <527BA073 dot 30900 at arm dot com> <87zjpg1d5p dot fsf at sandifor-thinkpad dot stglab dot manchester dot uk dot ibm dot com> <527BD411 dot 6060300 at arm dot com> <878uwwdnx0 dot fsf at talisman dot default> <52962733 dot 7030005 at arm dot com> <87r4a0c1ul dot fsf at talisman dot default> <529F5318 dot 1030505 at arm dot com>
Tejas Belagod <tbelagod@arm.com> writes:
> Richard Sandiford wrote:
>> Tejas Belagod <tbelagod@arm.com> writes:
>>> Richard Sandiford wrote:
>>>> Tejas Belagod <tbelagod@arm.com> writes:
>>>>>> The problem is that one reg rtx can span several hard registers.
>>>>>> E.g. (reg:V4SI 32) might represent one 64-bit register (no. 32),
>>>>>> but it might instead represent two 32-bit registers (nos. 32 and 33).
>>>>>> Obviously the latter's not very likely for vectors this small,
>>>>>> but more likely for larger ones (including on NEON IIRC).
>>>>>>
>>>>>> So if we had 2 32-bit registers being treated as a V4HI, it would be:
>>>>>>
>>>>>> <--32--><--33-->
>>>>>> msb lsb
>>>>>> 0000111122223333
>>>>>> VVVVVVVV
>>>>>> 00001111
>>>>>> msb lsb
>>>>>> <--32-->
>>>>>>
>>>>>> for big endian and:
>>>>>>
>>>>>> <--33--><--32-->
>>>>>> msb lsb
>>>>>> 3333222211110000
>>>>>> VVVVVVVV
>>>>>> 11110000
>>>>>> msb lsb
>>>>>> <--32-->
>>>>>>
>>>>>> for little endian.
>>>>> Ah, ok, that makes things clearer. Thanks for that.
>>>>>
>>>>> I can't find any helper function that figures out if we're writing
>>>>> partial or
>>>>> full result regs. Would something like
>>>>>
>>>>> REGNO (src) == REGNO (dst) &&
>>>>> HARD_REGNO_NREGS (src) == HARD_REGNO_NREGS (dst) == 1
>>>>>
>>>>> be a sane check for partial result regs?
>>>> Yeah, that should work. I think a more general alternative would be:
>>>>
>>>> simplify_subreg_regno (REGNO (src), GET_MODE (src),
>>>> offset, GET_MODE (dst)) == (int) REGNO (dst)
>>>>
>>>> where:
>>>>
>>>> offset = GET_MODE_UNIT_SIZE (GET_MODE (src)) * INTVAL (XVECEXP (sel, 0))
>>>>
>>>> That offset is the byte offset of the first selected element from the
>>>> start of a vector in memory, which is also the way that SUBREG_BYTEs
>>>> are counted. For little-endian it gives the offset of the lsb of the
>>>> slice, while for big-endian it gives the offset of the msb (which is
>>>> also how SUBREG_BYTEs work).
>>>>
>>>> The simplify_subreg_regno should cope with both single-register vectors
>>>> and multi-register vectors.
>>> Sorry for the delayed response to this.
>>>
>>> Thanks for the tip. Here's an improved patch that implements the
>>> simplify_sureg_regno () method of eliminating redundant moves. Regarding the
>>> test case, I failed to get the ppc back-end to generate RTL pattern
>>> that this
>>> patch checks for. I can easily write a test case for aarch64(big and little
>>> endian) on these lines
>>>
>>> typedef float float32x4_t __attribute__ ((__vector_size__ (16)));
>>>
>>> float foo_be (float32x4_t x)
>>> {
>>> return x[3];
>>> }
>>>
>>> float foo_le (float32x4_t x)
>>> {
>>> return x[0];
>>> }
>>>
>>> where I know that the vector indexing will generate a vec_select on
>>> the same src and dst regs that could be optimized away and hence test
>>> it. But I'm struggling to get a test case that the ppc altivec
>>> back-end will generate such a vec_select for. I see that altivec does
>>> not define vec_extract, so a simple indexing like this seems to happen
>>> via memory. Also, I don't know enough about the ppc PCS or
>>> architecture to write a test that will check for this optimization
>>> opportunity on same src and dst hard-registers. Any hints?
>>
>> Me neither, sorry.
>>
>> FWIW, the MIPS tests:
>>
>> typedef float float32x2_t __attribute__ ((__vector_size__ (8)));
>> void bar (float);
>> void foo_be (float32x2_t x) { bar (x[1]); }
>> void foo_le (float32x2_t x) { bar (x[0]); }
>>
>> also exercise it, but I don't think they add anything over the aarch64
>> versions. I can add them to the testsuite anyway if it helps though.
>>
>>> diff --git a/gcc/rtlanal.c b/gcc/rtlanal.c
>>> index 0cd0c7e..ca25ce5 100644
>>> --- a/gcc/rtlanal.c
>>> +++ b/gcc/rtlanal.c
>>> @@ -1180,6 +1180,22 @@ set_noop_p (const_rtx set)
>>> dst = SUBREG_REG (dst);
>>> }
>>>
>>> + /* It is a NOOP if destination overlaps with selected src vector
>>> + elements. */
>>> + if (GET_CODE (src) == VEC_SELECT
>>> + && REG_P (XEXP (src, 0)) && REG_P (dst)
>>> + && HARD_REGISTER_P (XEXP (src, 0))
>>> + && HARD_REGISTER_P (dst))
>>> + {
>>> + rtx par = XEXP (src, 1);
>>> + rtx src0 = XEXP (src, 0);
>>> + HOST_WIDE_INT offset =
>>> + GET_MODE_UNIT_SIZE (GET_MODE (src0)) * INTVAL (XVECEXP (par, 0, 0));
>>> +
>>> + return simplify_subreg_regno (REGNO (src0), GET_MODE (src0),
>>> + offset, GET_MODE (dst)) == (int)REGNO (dst);
>>> + }
>>> +
>>
>> Since this also (correctly) triggers for vector results, we need to keep
>> the check for consecutive indices that you had originally. (It's always
>> the first index that should be used for the simplify_subreg_regno though.)
>>
>> Looks good to me otherwise, thanks.
>
> Thanks Richard. Here is a revised patch. Sorry about the delay - I was
> investigating to make sure an LRA ICE I was seeing on aarch64 was
> unrelated to this patch. I've added a test case that I expect to pass
> for aarch64. I've also added the tests that you suggested for MIPS,
> but haven't checked for the target because I'm not sure what
> optimizations happen on MIPS.
Thanks, looks good to me, but I can't approve it. Just one minor
formatting nit:
> + return simplify_subreg_regno (REGNO (src0), GET_MODE (src0),
> + offset, GET_MODE (dst)) == (int)REGNO (dst);
space after "(int)".
Richard