This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Re: current NEON status
- From: Andrew Pinski <pinskia at gmail dot com>
- To: Marat Zakirov <m dot zakirov at samsung dot com>
- Cc: GCC Mailing List <gcc at gcc dot gnu dot org>, Ramana Radhakrishnan <ramana dot radhakrishnan at arm dot com>, Christophe Lyon <christophe dot lyon at linaro dot org>, "Joseph S. Myers" <joseph at codesourcery dot com>, Richard Earnshaw <Richard dot Earnshaw at arm dot com>, Vladimir Makarov <vmakarov at redhat dot com>, Yury Gribov <y dot gribov at samsung dot com>, Slava Garbuzov <v dot garbuzov at samsung dot com>
- Date: Mon, 11 Aug 2014 23:49:37 -0700
- Subject: Re: current NEON status
- Authentication-results: sourceware.org; auth=none
- References: <53D8F550 dot 2060007 at samsung dot com> <53E9B7B2 dot 6010504 at samsung dot com>
On Mon, Aug 11, 2014 at 11:44 PM, Marat Zakirov <m.zakirov@samsung.com> wrote:
> Hi Vladimir!
>
> I think you are as the main IRA contributor would be appropriate person to
> answer question bellow. Please confirm or refute my statement about
> unsplittable register ranges in GCC IRA.
>
>
> On 07/30/2014 05:38 PM, Marat Zakirov wrote:
>>
>> Hi there!
>>
>> My question came from bug
>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725. I found that GCC
>> considers NEON register ranges as unsplittable. So any subregister may be
>> used only after whole chunk is dead. This issue leads to redundant spill
>> fills which is performance trouble.
>>
>> Example 1: RAL trouble
>>
>> #include <arm_neon.h>
>> #include <inttypes.h>
>>
>> extern uint16x8x4_t m0;
>> extern uint16x8x4_t m1;
>> extern uint16x8x4_t m2;
>> extern uint16x8x4_t m3;
>> extern uint16x8_t m4;
>>
>> void foo1(uint16_t * in_ptr)
>> {
>> uint16x8x4_t t0, t1, t2, t3;
>> t0 = vld4q_u16((uint16_t *)&in_ptr[0 ]);
>> t1 = vld4q_u16((uint16_t *)&in_ptr[64]);
>> t2 = vld4q_u16((uint16_t *)&in_ptr[128]);
>> t3 = vld4q_u16((uint16_t *)&in_ptr[192]);
>> m4 = t0.val[3];
>> m4 = m4 * 3; <<< *
>> t0.val[3] = t1.val[3];
>> m0 = t3;
>> m1 = t2;
>> m2 = t1;
>> m3 = t0;
>> }
>>
>> Here test uses all NEON registers. No spill is needed. Because
>> multiplication requires one Q register which may be obtained from dead
>> t0.val[3] subregister. But GCC makes spill if multiplication (*) exists
>> because of issue described above.
>>
>> Example 2: CSE makes trouble for IRA
>>
>> #include <arm_neon.h>
>> #include <inttypes.h>
>>
>> extern uint16x8x4_t m0;
>> extern uint16x8x4_t m1;
>>
>> void foo2(uint16_t * in_ptr)
>> {
>> uint16x8x4_t t0, t1;
>> t0 = vld4q_u16((uint16_t *)&in_ptr[0 ]);
>> t1 = vld4q_u16((uint16_t *)&in_ptr[64]);
>> t0.val[0] *= 333;
>> t0.val[1] *= 333;
>> t0.val[2] *= 333;
>> t0.val[3] *= 333;
>> t1.val[0] *= 333;
>> t1.val[1] *= 333;
>> t1.val[2] *= 333;
>> t1.val[3] *= 333;
>> m0 = t0;
>> m1 = t1;
>> }
>>
>> Here test uses only half NEON + one Q for '333' factor. But GCC makes
>> spills here too! Briefly speak problem is in partial CSE. GCC generates rtl
>> with the listed bellow form:
>>
>> Before CSE:
>>
>> a = b
>> a0 = a0 * 3
>> a1 = a1 * 3
>> a2 = a2 * 3
>> a3 = a3 * 3
>>
>> After:
>>
>> a = b
>> a0 = b0 * 3
>> a1 = a1 * 3 <<< *
>> a2 = a2 * 3
>> a3 = a3 * 3
>>
>> CSE do not substitute b1 to a1 because at the moment (*) a0 was already
>> defined so actually a != b. Yes but a1 = b1, unfortunately CSE also do not
>> handle register-ranges parts as RA does. Strange thing here is that even if
>> we fix CSE, so CSE could propagate register-ranges subregs, this will make
>> trouble to RAL also because of the same reason: IRA do not handle precisely
>> register ranges parts. I attached a demo patch which forbids partial CSE
>> propagation and removes spills from Ex2. Is this patch OK? Or maybe CSE
>> should be fixed in a different way? Or maybe partial substitution is OK?
>>
>> Main question: Are there any plans to fix/upgrade IRA?
>>
>> --Marat
>
>
>
> gcc/ChangeLog:
>
> 2014-07-30 Marat Zakirov <m.zakirov@samsung.com>
>
> * cse.c (canon_reg): Forbid partial CSE.
> * fwprop.c (forward_propagate_and_simplify): Likewise.
>
> diff --git a/gcc/cse.c b/gcc/cse.c
> index 34f9364..a9e0442 100644
> --- a/gcc/cse.c
> +++ b/gcc/cse.c
> @@ -2862,6 +2862,9 @@ canon_reg (rtx x, rtx insn)
> || ! REGNO_QTY_VALID_P (REGNO (x)))
> return x;
>
> + if (GET_MODE (x) == XImode)
> + return x;
This patch is wrong and even more wrong. XImode is not defined in all targets.
Maybe the better fix is to have lower subreg come along and split up
the moves for a = b and then a pass after reload comes along and
stitches it back together.
Thanks,
Andrew
> +
> q = REG_QTY (REGNO (x));
> ent = &qty_table[q];
> first = ent->first_reg;
> diff --git a/gcc/fwprop.c b/gcc/fwprop.c
> index 547fcd6..eadc729 100644
> --- a/gcc/fwprop.c
> +++ b/gcc/fwprop.c
> @@ -1317,6 +1317,9 @@ forward_propagate_and_simplify (df_ref use, rtx
> def_insn, rtx def_set)
> if (!new_rtx)
> return false;
>
> + if (GET_MODE (reg) == XImode)
> + return false;
> +
> return try_fwprop_subst (use, loc, new_rtx, def_insn, set_reg_equal);
> }
>
>