This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH][AArch64] Separate shrink wrapping hooks implementation



On 29/11/16 11:18, Kyrill Tkachov wrote:
Hi James,

On 29/11/16 10:57, James Greenhalgh wrote:
On Mon, Nov 14, 2016 at 02:25:28PM +0000, Kyrill Tkachov wrote:
On 11/11/16 15:31, Kyrill Tkachov wrote:
On 11/11/16 10:17, Kyrill Tkachov wrote:
On 10/11/16 23:39, Segher Boessenkool wrote:
On Thu, Nov 10, 2016 at 02:42:24PM -0800, Andrew Pinski wrote:
On Thu, Nov 10, 2016 at 6:25 AM, Kyrill Tkachov
I ran SPEC2006 on a Cortex-A72. Overall scores were neutral but there were
some interesting swings.
458.sjeng     +1.45%
471.omnetpp   +2.19%
445.gobmk     -2.01%

On SPECFP:
453.povray    +7.00%
Wow, this looks really good.  Thank you for implementing this.  If I
get some time I am going to try it out on other processors than A72
but I doubt I have time any time soon.
I'd love to hear what causes the slowdown for gobmk as well, btw.
I haven't yet gotten a direct answer for that (through performance analysis
tools) but I have noticed that load/store pairs are not generated as
aggressively as I hoped.  They are being merged by the sched fusion pass
and peepholes (which runs after this) but it still misses cases. I've
hacked the SWS hooks to generate pairs explicitly and that increases the
number of pairs and helps code size to boot. It complicates the logic of
the hooks a bit but not too much.

I'll make those changes and re-benchmark, hopefully that
will help performance.

And here's a version that explicitly emits pairs. I've looked at assembly
codegen on SPEC2006 and it generates quite a few more LDP/STP pairs than the
original version.  I kicked off benchmarks over the weekend to see the
effect.  Andrew, if you want to try it out (more benchmarking and testing
always welcome) this is the one to try.

And I discovered over the weekend that gamess and wrf have validation errors.
This version runs correctly.  SPECINT results were fine though and there is
even a small overall gain due to sjeng and omnetpp. However, gobmk still has
the regression.  I'll rerun SPECFP with this patch (it's really just a small
bugfix over the previous version) and get on with analysing gobmk.
I have some comments in line, most of which are about hardcoding the
maximum register number, but at a high level this looks good to me.

Thanks for having a look.
I'll respin with the comments addressed and I have a couple of comments inline.

Kyrill

Thanks,
James



<snip>

+
+/* Implement TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB.  */
+
+static sbitmap
+aarch64_components_for_bb (basic_block bb)
+{
+  bitmap in = DF_LIVE_IN (bb);
+  bitmap gen = &DF_LIVE_BB_INFO (bb)->gen;
+  bitmap kill = &DF_LIVE_BB_INFO (bb)->kill;
+
+  sbitmap components = sbitmap_alloc (V31_REGNUM + 1);
+  bitmap_clear (components);
+
+  /* GPRs are used in a bb if they are in the IN, GEN, or KILL sets.  */
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
The use of R0_REGNUM and V31_REGNUM scare me a little bit, as we're hardcoding
where the end of the register file is (does this, for example, fall apart
with the SVE work that was recently posted). Something like a
LAST_HARDREG_NUM might work?


I think you mean FIRST_PSEUDO_REGISTER. AFAICS the compiler uses
a loop from 0 to FIRST_PSEUDO_REGISTER to go through the hard registers
in various places in the midend.
I'll change it to use that, though if the way to save/restore such new registers becomes
different from the current approach (i.e. perform a DI/DFmode memory op) the code in these
hooks will have to be updated anyway to take it into account.


And actually trying to implement this blows up. The "hard" registers include CC_REGNUM, which we
definitely want to avoid 'saving/restoring'. We really just want to save the normal register
data registers, so is it okay if I leave it as it is?
The prologue/epilogue code already uses V31_REGNUM, so it would need to change anyway if new
registers are added in the future.

Kyrill

+    if ((!call_used_regs[regno])
+       && (bitmap_bit_p (in, regno)
+       || bitmap_bit_p (gen, regno)
+       || bitmap_bit_p (kill, regno)))
+      bitmap_set_bit (components, regno);
+
+  return components;
+}
+
+/* Implement TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS.
+   Nothing to do for aarch64.  */
+
+static void
+aarch64_disqualify_components (sbitmap, edge, sbitmap, bool)
+{
+}
Is there no default "do nothing" hook for this?


I don't see one defined anywhere and the documentation for TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
says that if it is defined, all other hooks in the group must be defined.

+
+/* Return the next set bit in BMP from START onwards.  Return the total number
+   of bits in BMP if no set bit is found at or after START. */
+
+static unsigned int
+aarch64_get_next_set_bit (sbitmap bmp, unsigned int start)
+{
+  unsigned int nbits = SBITMAP_SIZE (bmp);
+  if (start == nbits)
+    return start;
+
+  gcc_assert (start < nbits);
+  for (unsigned int i = start; i < nbits; i++)
+    if (bitmap_bit_p (bmp, i))
+      return i;
+
+  return nbits;
+}
+
+/* Implement TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS.  */
+
+static void
+aarch64_emit_prologue_components (sbitmap components)
+{
+  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
+                 ? HARD_FRAME_POINTER_REGNUM
+                 : STACK_POINTER_REGNUM);
+
+  unsigned total_bits = SBITMAP_SIZE (components);
Would this be clearer called last_regno ?

+  unsigned regno = aarch64_get_next_set_bit (components, R0_REGNUM);
+  rtx_insn *insn = NULL;
+
+  while (regno != total_bits)
+    {
+      machine_mode mode = GP_REGNUM_P (regno) ? DImode : DFmode;
Comment about why this can be DFmode rather than some 128-bit mode might
be useful here.

+      rtx reg = gen_rtx_REG (mode, regno);
+      HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
+      if (!frame_pointer_needed)
+    offset += cfun->machine->frame.frame_size
+          - cfun->machine->frame.hard_fp_offset;
+      rtx addr = plus_constant (Pmode, ptr_reg, offset);
+      rtx mem = gen_frame_mem (mode, addr);
+
+      rtx set = gen_rtx_SET (mem, reg);
+      unsigned regno2 = aarch64_get_next_set_bit (components, regno + 1);
+      /* No more registers to save after REGNO.
+     Emit a single save and exit.  */
+      if (regno2 == total_bits)
+    {
+      insn = emit_insn (set);
+      RTX_FRAME_RELATED_P (insn) = 1;
+      add_reg_note (insn, REG_CFA_OFFSET, copy_rtx (set));
+      break;
+    }
+
+      HOST_WIDE_INT offset2 = cfun->machine->frame.reg_offset[regno2];
+      /* The next register is not of the same class or its offset is not
+     mergeable with the current one into a pair.  */
+      if (!satisfies_constraint_Ump (mem)
+      || GP_REGNUM_P (regno) != GP_REGNUM_P (regno2)
+      || (offset2 - cfun->machine->frame.reg_offset[regno])
+        != GET_MODE_SIZE (DImode))
+    {
+      insn = emit_insn (set);
+      RTX_FRAME_RELATED_P (insn) = 1;
+      add_reg_note (insn, REG_CFA_OFFSET, copy_rtx (set));
+
+      regno = regno2;
+      continue;
+    }
+
+      /* REGNO2 can be stored in a pair with REGNO.  */
+      rtx reg2 = gen_rtx_REG (mode, regno2);
+      if (!frame_pointer_needed)
+    offset2 += cfun->machine->frame.frame_size
+          - cfun->machine->frame.hard_fp_offset;
+      rtx addr2 = plus_constant (Pmode, ptr_reg, offset2);
+      rtx mem2 = gen_frame_mem (mode, addr2);
+      rtx set2 = gen_rtx_SET (mem2, reg2);
+
+      insn = emit_insn (aarch64_gen_store_pair (mode, mem, reg, mem2, reg2));
+      RTX_FRAME_RELATED_P (insn) = 1;
+      add_reg_note (insn, REG_CFA_OFFSET, set);
+      add_reg_note (insn, REG_CFA_OFFSET, set2);
+
+      regno = aarch64_get_next_set_bit (components, regno2 + 1);
+    }
+
+}
+
+/* Implement TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS.  */
+
+static void
+aarch64_emit_epilogue_components (sbitmap components)
Given the similarity of the logic, is there any way you can refactor this
with the prologue code above?

+{
+
+  rtx ptr_reg = gen_rtx_REG (Pmode, frame_pointer_needed
+                 ? HARD_FRAME_POINTER_REGNUM
+                 : STACK_POINTER_REGNUM);
+  unsigned total_bits = SBITMAP_SIZE (components);
+  unsigned regno = aarch64_get_next_set_bit (components, R0_REGNUM);
+  rtx_insn *insn = NULL;
+
+  while (regno != total_bits)
+    {
+      machine_mode mode = GP_REGNUM_P (regno) ? DImode : DFmode;
+      rtx reg = gen_rtx_REG (mode, regno);
+      HOST_WIDE_INT offset = cfun->machine->frame.reg_offset[regno];
+      if (!frame_pointer_needed)
+    offset += cfun->machine->frame.frame_size
+          - cfun->machine->frame.hard_fp_offset;
+      rtx addr = plus_constant (Pmode, ptr_reg, offset);
+      rtx mem = gen_frame_mem (mode, addr);
+
+      unsigned regno2 = aarch64_get_next_set_bit (components, regno + 1);
+      /* No more registers after REGNO to restore.
+     Emit a single restore and exit.  */
+      if (regno2 == total_bits)
+    {
+      insn = emit_move_insn (reg, mem);
+      RTX_FRAME_RELATED_P (insn) = 1;
+      add_reg_note (insn, REG_CFA_RESTORE, reg);
+      break;
+    }
+
+      HOST_WIDE_INT offset2 = cfun->machine->frame.reg_offset[regno2];
+      /* The next register is not of the same class or its offset is not
+     mergeable with the current one into a pair or the offset doesn't fit
+     for a load pair.  Emit a single restore and continue from REGNO2.  */
+      if (!satisfies_constraint_Ump (mem)
+      || GP_REGNUM_P (regno) != GP_REGNUM_P (regno2)
+      || (offset2 - cfun->machine->frame.reg_offset[regno])
+        != GET_MODE_SIZE (DImode))
+    {
+      insn = emit_move_insn (reg, mem);
+      RTX_FRAME_RELATED_P (insn) = 1;
+      add_reg_note (insn, REG_CFA_RESTORE, reg);
+
+      regno = regno2;
+      continue;
+    }
+
+      /* REGNO2 can be loaded in a pair with REGNO.  */
+      rtx reg2 = gen_rtx_REG (mode, regno2);
+      if (!frame_pointer_needed)
+    offset2 += cfun->machine->frame.frame_size
+          - cfun->machine->frame.hard_fp_offset;
+      rtx addr2 = plus_constant (Pmode, ptr_reg, offset2);
+      rtx mem2 = gen_frame_mem (mode, addr2);
+
+      insn = emit_insn (aarch64_gen_load_pair (mode, reg, mem, reg2, mem2));
+      RTX_FRAME_RELATED_P (insn) = 1;
+      add_reg_note (insn, REG_CFA_RESTORE, reg);
+      add_reg_note (insn, REG_CFA_RESTORE, reg2);
+
+      regno = aarch64_get_next_set_bit (components, regno2 + 1);
+    }
+}
+
+/* Implement TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS.  */
+
+static void
+aarch64_set_handled_components (sbitmap components)
+{
+  for (unsigned regno = R0_REGNUM; regno <= V31_REGNUM; regno++)
+    if (bitmap_bit_p (components, regno))
+      cfun->machine->reg_is_wrapped_separately[regno] = true;
+}
+
  /* AArch64 stack frames generated by this compiler look like:
        +-------------------------------+
@@ -3944,29 +4218,6 @@ aarch64_classify_index (struct aarch64_address_info *info, rtx x,
    return false;
  }
  -bool
-aarch64_offset_7bit_signed_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
-{
-  return (offset >= -64 * GET_MODE_SIZE (mode)
-      && offset < 64 * GET_MODE_SIZE (mode)
-      && offset % GET_MODE_SIZE (mode) == 0);
-}
-
-static inline bool
-offset_9bit_signed_unscaled_p (machine_mode mode ATTRIBUTE_UNUSED,
-                   HOST_WIDE_INT offset)
-{
-  return offset >= -256 && offset < 256;
-}
-
-static inline bool
-offset_12bit_unsigned_scaled_p (machine_mode mode, HOST_WIDE_INT offset)
-{
-  return (offset >= 0
-      && offset < 4096 * GET_MODE_SIZE (mode)
-      && offset % GET_MODE_SIZE (mode) == 0);
-}
-
  /* Return true if MODE is one of the modes for which we
     support LDP/STP operations.  */
  @@ -14452,6 +14703,30 @@ aarch64_optab_supported_p (int op, machine_mode mode1, machine_mode,
  #define TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD_GUARD \
    aarch64_first_cycle_multipass_dfa_lookahead_guard
  +#undef TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
+#define TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS \
+  aarch64_get_separate_components
+
+#undef TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB
+#define TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB \
+  aarch64_components_for_bb
+
+#undef TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS
+#define TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS \
+  aarch64_disqualify_components
+
+#undef TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS
+#define TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS \
+  aarch64_emit_prologue_components
+
+#undef TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS
+#define TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS \
+  aarch64_emit_epilogue_components
+
+#undef TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS
+#define TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS \
+  aarch64_set_handled_components
+
  #undef TARGET_TRAMPOLINE_INIT
  #define TARGET_TRAMPOLINE_INIT aarch64_trampoline_init
  diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
index 584ff5c..fb89e5a 100644
--- a/gcc/config/aarch64/aarch64.h
+++ b/gcc/config/aarch64/aarch64.h
@@ -591,6 +591,8 @@ struct GTY (()) aarch64_frame
  typedef struct GTY (()) machine_function
  {
    struct aarch64_frame frame;
+  /* One entry for each GPR and FP register.  */
+  bool reg_is_wrapped_separately[V31_REGNUM + 1];
Another hardcoded use of V31_REGNUM.

Thanks,
James




Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]