This is the mail archive of the gcc@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: current NEON status


On 12/08/14 07:49, Andrew Pinski wrote:
> On Mon, Aug 11, 2014 at 11:44 PM, Marat Zakirov <m.zakirov@samsung.com> wrote:
>> Hi Vladimir!
>>
>> I think you are as the main IRA contributor would be appropriate person to
>> answer question bellow. Please confirm or refute my statement about
>> unsplittable register ranges in GCC IRA.
>>
>>
>> On 07/30/2014 05:38 PM, Marat Zakirov wrote:
>>>
>>> Hi there!
>>>
>>> My question came from bug
>>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43725. I found that GCC
>>> considers NEON register ranges as unsplittable. So any subregister may be
>>> used only after whole chunk is dead. This issue leads to redundant spill
>>> fills which is performance trouble.
>>>
>>> Example 1: RAL trouble
>>>
>>> #include <arm_neon.h>
>>> #include <inttypes.h>
>>>
>>> extern  uint16x8x4_t m0;
>>> extern  uint16x8x4_t m1;
>>> extern  uint16x8x4_t m2;
>>> extern  uint16x8x4_t m3;
>>> extern  uint16x8_t   m4;
>>>
>>> void foo1(uint16_t * in_ptr)
>>> {
>>>      uint16x8x4_t t0, t1, t2, t3;
>>>      t0 = vld4q_u16((uint16_t *)&in_ptr[0 ]);
>>>      t1 = vld4q_u16((uint16_t *)&in_ptr[64]);
>>>      t2 = vld4q_u16((uint16_t *)&in_ptr[128]);
>>>      t3 = vld4q_u16((uint16_t *)&in_ptr[192]);
>>>      m4 = t0.val[3];
>>>      m4 = m4 * 3;                       <<< *
>>>      t0.val[3] = t1.val[3];
>>>      m0 = t3;
>>>      m1 = t2;
>>>      m2 = t1;
>>>      m3 = t0;
>>> }
>>>
>>> Here test uses all NEON registers. No spill is needed. Because
>>> multiplication requires one Q register which may be obtained from dead
>>> t0.val[3] subregister. But GCC makes spill if multiplication (*) exists
>>> because of issue described above.
>>>
>>> Example 2: CSE makes trouble for IRA
>>>
>>> #include <arm_neon.h>
>>> #include <inttypes.h>
>>>
>>> extern  uint16x8x4_t m0;
>>> extern  uint16x8x4_t m1;
>>>
>>> void foo2(uint16_t * in_ptr)
>>> {
>>>      uint16x8x4_t t0, t1;
>>>      t0 = vld4q_u16((uint16_t *)&in_ptr[0 ]);
>>>      t1 = vld4q_u16((uint16_t *)&in_ptr[64]);
>>>      t0.val[0] *= 333;
>>>      t0.val[1] *= 333;
>>>      t0.val[2] *= 333;
>>>      t0.val[3] *= 333;
>>>      t1.val[0] *= 333;
>>>      t1.val[1] *= 333;
>>>      t1.val[2] *= 333;
>>>      t1.val[3] *= 333;
>>>      m0 = t0;
>>>      m1 = t1;
>>> }
>>>
>>> Here test uses only half NEON + one Q for '333' factor. But GCC makes
>>> spills here too! Briefly speak problem is in partial CSE. GCC generates rtl
>>> with the listed bellow form:
>>>
>>> Before CSE:
>>>
>>> a = b
>>> a0 = a0 * 3
>>> a1 = a1 * 3
>>> a2 = a2 * 3
>>> a3 = a3 * 3
>>>
>>> After:
>>>
>>> a = b
>>> a0 = b0 * 3
>>> a1 = a1 * 3 <<< *
>>> a2 = a2 * 3
>>> a3 = a3 * 3
>>>
>>> CSE do not substitute b1 to a1 because at the moment (*) a0 was already
>>> defined so actually a != b. Yes but a1 = b1, unfortunately CSE also do not
>>> handle register-ranges parts as RA does. Strange thing here is that even if
>>> we fix CSE, so CSE could propagate register-ranges subregs, this will make
>>> trouble to RAL also because of the same reason: IRA do not handle precisely
>>> register ranges parts. I attached a demo patch which forbids partial CSE
>>> propagation and removes spills from Ex2. Is this patch OK? Or maybe CSE
>>> should be fixed in a different way? Or maybe partial substitution is OK?
>>>
>>> Main question: Are there any plans to fix/upgrade IRA?
>>>
>>> --Marat
>>
>>
>>
>> gcc/ChangeLog:
>>
>> 2014-07-30  Marat Zakirov  <m.zakirov@samsung.com>
>>
>>         * cse.c (canon_reg): Forbid partial CSE.
>>         * fwprop.c (forward_propagate_and_simplify): Likewise.
>>
>> diff --git a/gcc/cse.c b/gcc/cse.c
>> index 34f9364..a9e0442 100644
>> --- a/gcc/cse.c
>> +++ b/gcc/cse.c
>> @@ -2862,6 +2862,9 @@ canon_reg (rtx x, rtx insn)
>>             || ! REGNO_QTY_VALID_P (REGNO (x)))
>>           return x;
>>
>> +        if (GET_MODE (x) == XImode)
>> +          return x;
> 
> This patch is wrong and even more wrong.  XImode is not defined in all targets.
>

Even if it were, this still wouldn't be the right fix.  What if a
machine had a native XImode?  Then you'd be arbitrarily disabling parts
of the compiler.

R.


> Maybe the better fix is to have lower subreg come along and split up
> the moves for a = b and then a pass after reload comes along and
> stitches it back together.
> 
> Thanks,
> Andrew
> 
> 
>> +
>>         q = REG_QTY (REGNO (x));
>>         ent = &qty_table[q];
>>         first = ent->first_reg;
>> diff --git a/gcc/fwprop.c b/gcc/fwprop.c
>> index 547fcd6..eadc729 100644
>> --- a/gcc/fwprop.c
>> +++ b/gcc/fwprop.c
>> @@ -1317,6 +1317,9 @@ forward_propagate_and_simplify (df_ref use, rtx
>> def_insn, rtx def_set)
>>    if (!new_rtx)
>>      return false;
>>
>> +  if (GET_MODE (reg) == XImode)
>> +    return false;
>> +
>>    return try_fwprop_subst (use, loc, new_rtx, def_insn, set_reg_equal);
>>  }
>>
>>
> 



Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]