[PATCH][AArch64] Separate shrink wrapping hooks implementation

Tue Nov 29 11:32:00 GMT 2016

On 29/11/16 11:18, Kyrill Tkachov wrote:
> Hi James,
>
> On 29/11/16 10:57, James Greenhalgh wrote:
>> On Mon, Nov 14, 2016 at 02:25:28PM +0000, Kyrill Tkachov wrote:
>>> On 11/11/16 15:31, Kyrill Tkachov wrote:
>>>> On 11/11/16 10:17, Kyrill Tkachov wrote:
>>>>> On 10/11/16 23:39, Segher Boessenkool wrote:
>>>>>> On Thu, Nov 10, 2016 at 02:42:24PM -0800, Andrew Pinski wrote:
>>>>>>> On Thu, Nov 10, 2016 at 6:25 AM, Kyrill Tkachov
>>>>>>>> I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there were
>>>>>>>> some interesting swings.
>>>>>>>> 458.sjeng     +1.45%
>>>>>>>> 471.omnetpp   +2.19%
>>>>>>>> 445.gobmk     -2.01%
>>>>>>>>
>>>>>>>> On SPECFP:
>>>>>>>> 453.povray    +7.00%
>>>>>>> Wow, this looks really good.  Thank you for implementing this.  If I
>>>>>>> get some time I am going to try it out on other processors than A72
>>>>>>> but I doubt I have time any time soon.
>>>>>> I'd love to hear what causes the slowdown for gobmk as well, btw.
>>>>> I haven't yet gotten a direct answer for that (through performance analysis
>>>>> tools) but I have noticed that load/store pairs are not generated as
>>>>> aggressively as I hoped.  They are being merged by the sched fusion pass
>>>>> and peepholes (which runs after this) but it still misses cases. I've
>>>>> hacked the SWS hooks to generate pairs explicitly and that increases the
>>>>> number of pairs and helps code size to boot. It complicates the logic of
>>>>> the hooks a bit but not too much.
>>>>>
>>>>> I'll make those changes and re-benchmark, hopefully that
>>>>> will help performance.
>>>>>
>>>> And here's a version that explicitly emits pairs. I've looked at assembly
>>>> codegen on SPEC2006 and it generates quite a few more LDP/STP pairs than the
>>>> original version.  I kicked off benchmarks over the weekend to see the
>>>> effect.  Andrew, if you want to try it out (more benchmarking and testing
>>>> always welcome) this is the one to try.
>>>>
>>> And I discovered over the weekend that gamess and wrf have validation errors.
>>> This version runs correctly.  SPECINT results were fine though and there is
>>> even a small overall gain due to sjeng and omnetpp. However, gobmk still has
>>> the regression.  I'll rerun SPECFP with this patch (it's really just a small
>>> bugfix over the previous version) and get on with analysing gobmk.
>> I have some comments in line, most of which are about hardcoding the
>> maximum register number, but at a high level this looks good to me.
>
> Thanks for having a look.
> I'll respin with the comments addressed and I have a couple of comments inline.
>
> Kyrill
>
>> Thanks,
>> James
>>
>>
>
> <snip>
>
>> +
>> +/* Implement TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB.  */
>> +
>> +static sbitmap
>> +aarch64_components_for_bb (basic_block bb)
>> +{
>> +  bitmap in = DF_LIVE_IN (bb);
>> +  bitmap gen = &DF_LIVE_BB_INFO (bb)->gen;
>> +  bitmap kill = &DF_LIVE_BB_INFO (bb)->kill;
>> +
>> +  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
>> +  bitmap_clear (components);
>> +
>> +  /* GPRs are used in a bb if they are in the IN, GEN, or KILL sets.  */
>> +  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
>> The use of R0_REGNUM and V31_REGNUM scare me a little bit, as we're hardcoding
>> where the end of the register file is (does this, for example, fall apart
>> with the SVE work that was recently posted). Something like a
>> LAST_HARDREG_NUM might work?
>>
>
> I think you mean FIRST_PSEUDO_REGISTER. AFAICS the compiler uses
> a loop from 0 to FIRST_PSEUDO_REGISTER to go through the hard registers
> in various places in the midend.
> I'll change it to use that, though if the way to save/restore such new registers becomes
> different from the current approach (i.e. perform a DI/DFmode memory op) the code in these
> hooks will have to be updated anyway to take it into account.
>

And actually trying to implement this blows up. The "hard" registers include CC_REGNUM, which we
definitely want to avoid 'saving/restoring'. We really just want to save the normal register
data registers, so is it okay if I leave it as it is?
The prologue/epilogue code already uses V31_REGNUM, so it would need to change anyway if new
registers are added in the future.

Kyrill

>>> +    if ((!call_used_regs[regno])
>>> +       && (bitmap_bit_p (in, regno)
>>> +       || bitmap_bit_p (gen, regno)
>>> +       || bitmap_bit_p (kill, regno)))
>>> +      bitmap_set_bit (components, regno);
>>> +
>>> +  return components;
>>> +}
>>> +
>>> +/* Implement TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS.
>>> +   Nothing to do for aarch64.  */
>>> +
>>> +static void
>>> +aarch64_disqualify_components (sbitmap, edge, sbitmap, bool)
>>> +{
>>> +}
>> Is there no default "do nothing" hook for this?
>>
>
> I don't see one defined anywhere and the documentation for TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
> says that if it is defined, all other hooks in the group must be defined.
>
>>> +
>>> +/* Return the next set bit in BMP from START onwards.  Return the total number
>>> +   of bits in BMP if no set bit is found at or after START. */
>>> +
>>> +static unsigned int
>>> +aarch64_get_next_set_bit (sbitmap bmp, unsigned int start)
>>> +{
>>> +  unsigned int nbits = SBITMAP_SIZE (bmp);
>>> +  if (start == nbits)
>>> +    return start;
>>> +
>>> +  gcc_assert (start < nbits);
>>> +  for (unsigned int i = start; i < nbits; i++)
>>> +    if (bitmap_bit_p (bmp, i))
>>> +      return i;
>>> +
>>> +  return nbits;
>>> +}
>>> +
>>> +/* Implement TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS.  */
>>> +
>>> +static void
>>> +aarch64_emit_prologue_components (sbitmap components)
>>> +{
>>> +  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
>>> +                 ? HARD_FRAME_POINTER_REGNUM
>>> +                 : STACK_POINTER_REGNUM);
>>> +
>>> +  unsigned total_bits = SBITMAP_SIZE (components);
>> Would this be clearer called last_regno ?
>>
>>> +  unsigned regno = aarch64_get_next_set_bit (components, R0_REGNUM);
>>> +  rtx_insn *insn = NULL;
>>> +
>>> +  while (regno != total_bits)
>>> +    {
>>> +      machine_mode mode = GP_REGNUM_P (regno) ? DImode : DFmode;
>> Comment about why this can be DFmode rather than some 128-bit mode might
>> be useful here.
>>
>>> +      rtx reg = gen_rtx_REG (mode, regno);
>>> +      HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
>>> +      if (!frame_pointer_needed)
>>> +    offset += cfun->machine->frame.frame_size
>>> +          - cfun->machine->frame.hard_fp_offset;
>>> +      rtx addr = plus_constant (Pmode, ptr_reg, offset);
>>> +      rtx mem = gen_frame_mem (mode, addr);
>>> +
>>> +      rtx set = gen_rtx_SET (mem, reg);
>>> +      unsigned regno2 = aarch64_get_next_set_bit (components, regno + 1);
>>> +      /* No more registers to save after REGNO.
>>> +     Emit a single save and exit.  */
>>> +      if (regno2 == total_bits)
>>> +    {
>>> +      insn = emit_insn (set);
>>> +      RTX_FRAME_RELATED_P (insn) = 1;
>>> +      add_reg_note (insn, REG_CFA_OFFSET, copy_rtx (set));
>>> +      break;
>>> +    }
>>> +
>>> +      HOST_WIDE_INT offset2 = cfun->machine->frame.reg_offset[regno2];
>>> +      /* The next register is not of the same class or its offset is not
>>> +     mergeable with the current one into a pair.  */
>>> +      if (!satisfies_constraint_Ump (mem)
>>> +      || GP_REGNUM_P (regno) != GP_REGNUM_P (regno2)
>>> +      || (offset2 - cfun->machine->frame.reg_offset[regno])
>>> +        != GET_MODE_SIZE (DImode))
>>> +    {
>>> +      insn = emit_insn (set);
>>> +      RTX_FRAME_RELATED_P (insn) = 1;
>>> +      add_reg_note (insn, REG_CFA_OFFSET, copy_rtx (set));
>>> +
>>> +      regno = regno2;
>>> +      continue;
>>> +    }
>>> +
>>> +      /* REGNO2 can be stored in a pair with REGNO.  */
>>> +      rtx reg2 = gen_rtx_REG (mode, regno2);
>>> +      if (!frame_pointer_needed)
>>> +    offset2 += cfun->machine->frame.frame_size
>>> +          - cfun->machine->frame.hard_fp_offset;
>>> +      rtx addr2 = plus_constant (Pmode, ptr_reg, offset2);
>>> +      rtx mem2 = gen_frame_mem (mode, addr2);
>>> +      rtx set2 = gen_rtx_SET (mem2, reg2);
>>> +
>>> +      insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2, reg2));
>>> +      RTX_FRAME_RELATED_P (insn) = 1;
>>> +      add_reg_note (insn, REG_CFA_OFFSET, set);
>>> +      add_reg_note (insn, REG_CFA_OFFSET, set2);
>>> +
>>> +      regno = aarch64_get_next_set_bit (components, regno2 + 1);
>>> +    }
>>> +
>>> +}
>>> +
>>> +/* Implement TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS.  */
>>> +
>>> +static void
>>> +aarch64_emit_epilogue_components (sbitmap components)
>> Given the similarity of the logic, is there any way you can refactor this
>> with the prologue code above?
>>
>>> +{
>>> +
>>> +  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
>>> +                 ? HARD_FRAME_POINTER_REGNUM
>>> +                 : STACK_POINTER_REGNUM);
>>> +  unsigned total_bits = SBITMAP_SIZE (components);
>>> +  unsigned regno = aarch64_get_next_set_bit (components, R0_REGNUM);
>>> +  rtx_insn *insn = NULL;
>>> +
>>> +  while (regno != total_bits)
>>> +    {
>>> +      machine_mode mode = GP_REGNUM_P (regno) ? DImode : DFmode;
>>> +      rtx reg = gen_rtx_REG (mode, regno);
>>> +      HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
>>> +      if (!frame_pointer_needed)
>>> +    offset += cfun->machine->frame.frame_size
>>> +          - cfun->machine->frame.hard_fp_offset;
>>> +      rtx addr = plus_constant (Pmode, ptr_reg, offset);
>>> +      rtx mem = gen_frame_mem (mode, addr);
>>> +
>>> +      unsigned regno2 = aarch64_get_next_set_bit (components, regno + 1);
>>> +      /* No more registers after REGNO to restore.
>>> +     Emit a single restore and exit.  */
>>> +      if (regno2 == total_bits)
>>> +    {
>>> +      insn = emit_move_insn (reg, mem);
>>> +      RTX_FRAME_RELATED_P (insn) = 1;
>>> +      add_reg_note (insn, REG_CFA_RESTORE, reg);
>>> +      break;
>>> +    }
>>> +
>>> +      HOST_WIDE_INT offset2 = cfun->machine->frame.reg_offset[regno2];
>>> +      /* The next register is not of the same class or its offset is not
>>> +     mergeable with the current one into a pair or the offset doesn't fit
>>> +     for a load pair.  Emit a single restore and continue from REGNO2.  */
>>> +      if (!satisfies_constraint_Ump (mem)
>>> +      || GP_REGNUM_P (regno) != GP_REGNUM_P (regno2)
>>> +      || (offset2 - cfun->machine->frame.reg_offset[regno])
>>> +        != GET_MODE_SIZE (DImode))
>>> +    {
>>> +      insn = emit_move_insn (reg, mem);
>>> +      RTX_FRAME_RELATED_P (insn) = 1;
>>> +      add_reg_note (insn, REG_CFA_RESTORE, reg);
>>> +
>>> +      regno = regno2;
>>> +      continue;
>>> +    }
>>> +
>>> +      /* REGNO2 can be loaded in a pair with REGNO.  */
>>> +      rtx reg2 = gen_rtx_REG (mode, regno2);
>>> +      if (!frame_pointer_needed)
>>> +    offset2 += cfun->machine->frame.frame_size
>>> +          - cfun->machine->frame.hard_fp_offset;
>>> +      rtx addr2 = plus_constant (Pmode, ptr_reg, offset2);
>>> +      rtx mem2 = gen_frame_mem (mode, addr2);
>>> +
>>> +      insn = emit_insn (aarch64_gen_load_pair (mode, reg, mem, reg2, mem2));
>>> +      RTX_FRAME_RELATED_P (insn) = 1;
>>> +      add_reg_note (insn, REG_CFA_RESTORE, reg);
>>> +      add_reg_note (insn, REG_CFA_RESTORE, reg2);
>>> +
>>> +      regno = aarch64_get_next_set_bit (components, regno2 + 1);
>>> +    }
>>> +}
>>> +
>>> +/* Implement TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS.  */
>>> +
>>> +static void
>>> +aarch64_set_handled_components (sbitmap components)
>>> +{
>>> +  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
>>> +    if (bitmap_bit_p (components, regno))
>>> +      cfun->machine->reg_is_wrapped_separately[regno] = true;
>>> +}
>>> +
>>>   /* AArch64 stack frames generated by this compiler look like:
>>>         +-------------------------------+
>>> @@ -3944,29 +4218,6 @@ aarch64_classify_index (struct aarch64_address_info *info, rtx x,
>>>     return false;
>>>   }
>>>   -bool
>>> -aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
>>> -{
>>> -  return (offset >= -64 * GET_MODE_SIZE (mode)
>>> -      && offset < 64 * GET_MODE_SIZE (mode)
>>> -      && offset % GET_MODE_SIZE (mode) == 0);
>>> -}
>>> -
>>> -static inline bool
>>> -offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED,
>>> -                   HOST_WIDE_INT offset)
>>> -{
>>> -  return offset >= -256 && offset < 256;
>>> -}
>>> -
>>> -static inline bool
>>> -offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
>>> -{
>>> -  return (offset >= 0
>>> -      && offset < 4096 * GET_MODE_SIZE (mode)
>>> -      && offset % GET_MODE_SIZE (mode) == 0);
>>> -}
>>> -
>>>   /* Return true if MODE is one of the modes for which we
>>>      support LDP/STP operations.  */
>>>   @@ -14452,6 +14703,30 @@ aarch64_optab_supported_p (int op, machine_mode mode1, machine_mode,
>>>   #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD \
>>>     aarch64_first_cycle_multipass_dfa_lookahead_guard
>>>   +#undef TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
>>> +#define TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS \
>>> +  aarch64_get_separate_components
>>> +
>>> +#undef TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB
>>> +#define TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB \
>>> +  aarch64_components_for_bb
>>> +
>>> +#undef TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS
>>> +#define TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS \
>>> +  aarch64_disqualify_components
>>> +
>>> +#undef TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS
>>> +#define TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS \
>>> +  aarch64_emit_prologue_components
>>> +
>>> +#undef TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS
>>> +#define TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS \
>>> +  aarch64_emit_epilogue_components
>>> +
>>> +#undef TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS
>>> +#define TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS \
>>> +  aarch64_set_handled_components
>>> +
>>>   #undef TARGET_TRAMPOLINE_INIT
>>>   #define TARGET_TRAMPOLINE_INIT aarch64_trampoline_init
>>>   diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
>>> index 584ff5c..fb89e5a 100644
>>> --- a/gcc/config/aarch64/aarch64.h
>>> +++ b/gcc/config/aarch64/aarch64.h
>>> @@ -591,6 +591,8 @@ struct GTY (()) aarch64_frame
>>>   typedef struct GTY (()) machine_function
>>>   {
>>>     struct aarch64_frame frame;
>>> +  /* One entry for each GPR and FP register.  */
>>> +  bool reg_is_wrapped_separately[V31_REGNUM + 1];
>> Another hardcoded use of V31_REGNUM.
>>
>> Thanks,
>> James
>>
>