This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH] [v2][aarch64] Avoid tag collisions for loads falkor


Hi Siddhesh,

On 13/07/18 12:26, Siddhesh Poyarekar wrote:
Hi,

This is a rewrite of the tag collision avoidance patch that Kugan had
written as a machine reorg pass back in February.

The falkor hardware prefetching system uses a combination of the
source, destination and offset to decide which prefetcher unit to
train with the load.  This is great when loads in a loop are
sequential but sub-optimal if there are unrelated loads in a loop that
tag to the same prefetcher unit.

This pass attempts to rename the destination register of such colliding
loads using routines available in regrename.c so that their tags do
not collide.  This shows some performance gains with mcf and xalancbmk
(~5% each) and will be tweaked further.  The pass is placed near the
tail end of the pass list so that subsequent passes don't inadvertently
end up undoing the renames.

A full GCC bootstrap and testsuite run completed successfully on aarch64,
i.e. the patch did not introduce any new regressions.  I also ran make
check with -mcpu=falkor to ensure that there were no regressions.  The
couple of failures I found were target-specific, related to scheduling
and cost differences, and are not correctness issues.

Changes from v1:

- Fixed up issues pointed out by Kyrill
- Avoid renaming R0/V0 since they could be return values
- Fixed minor formatting issues.

2018-07-02  Siddhesh Poyarekar  <siddhesh@sourceware.org>
            Kugan Vivekanandarajah  <kugan.vivekanandarajah@linaro.org>

        * config/aarch64/falkor-tag-collision-avoidance.c: New file.
        * config.gcc (extra_objs): Build it.
        * config/aarch64/t-aarch64 (falkor-tag-collision-avoidance.o):
        Likewise.
        * config/aarch64/aarch64-passes.def
        (pass_tag_collision_avoidance): New pass.
        * config/aarch64/aarch64.c (qdf24xx_tunings): Add
        AARCH64_EXTRA_TUNE_RENAME_LOAD_REGS to tuning_flags.
        (aarch64_classify_address): Remove static qualifier.
        (aarch64_address_info, aarch64_address_type): Move to...
        * config/aarch64/aarch64-protos.h: ... here.
        (make_pass_tag_collision_avoidance): New function.
        * config/aarch64/aarch64-tuning-flags.def (rename_load_regs):
        New tuning flag.


This looks good to me modulo a couple of minor comments inline.
You'll still need an approval from a maintainer.

Thanks,
Kyrill

CC: james.greenhalgh@arm.com
CC: kyrylo.tkachov@foss.arm.com
---
 gcc/config.gcc                                |   2 +-
 gcc/config/aarch64/aarch64-passes.def         |   1 +
 gcc/config/aarch64/aarch64-protos.h           |  49 +
 gcc/config/aarch64/aarch64-tuning-flags.def   |   2 +
 gcc/config/aarch64/aarch64.c                  |  48 +-
 .../aarch64/falkor-tag-collision-avoidance.c  | 856 ++++++++++++++++++
 gcc/config/aarch64/t-aarch64                  |   9 +
 7 files changed, 921 insertions(+), 46 deletions(-)

<snip>

+/* Find the use def chain in which INSN exists and then see if there is a
+   definition inside the loop and outside it.  We use this as a simple
+   approximation to determine whether the base register is an IV.  The basic
+   idea is to find INSN in the use-def chains for its base register and find
+   all definitions that reach it.  Of all these definitions, there should be at
+   least one definition that is a simple addition of a constant value, either
+   as a binary operation or a pre or post update.
+
+   The function returns true if the base register is estimated to be an IV.  */
+static bool
+iv_p (rtx_insn *insn, rtx reg, struct loop *loop)
+{
+  df_ref ause;
+  unsigned regno = REGNO (reg);
+
+  /* Ignore loads from the stack.  */
+  if (regno == SP_REGNUM)
+    return false;
+
+  for (ause= DF_REG_USE_CHAIN (regno); ause; ause = DF_REF_NEXT_REG (ause))
+    {

Space after ause

+      if (!DF_REF_INSN_INFO (ause)
+         || !NONDEBUG_INSN_P (DF_REF_INSN (ause)))
+       continue;
+
+      if (insn != DF_REF_INSN (ause))
+       continue;
+
+      struct df_insn_info *insn_info = DF_INSN_INFO_GET (insn);
+      df_ref def_rec;
+
+      FOR_EACH_INSN_INFO_DEF (def_rec, insn_info)
+       {
+         rtx_insn *insn = DF_REF_INSN (def_rec);
+         basic_block bb = BLOCK_FOR_INSN (insn);
+
+         if (dominated_by_p (CDI_DOMINATORS, bb, loop->header)
+             && bb->loop_father == loop)
+           {
+             if (recog_memoized (insn) < 0)
+               continue;
+
+             rtx pat = PATTERN (insn);
+
+             /* Prefetch or clobber; unlikely to be a constant stride.  The
+                falkor software prefetcher tuning is pretty conservative, so
+                its presence indicates that the access pattern is probably
+                strided but most likely with an unknown stride size or a
+                stride size that is quite large.  */
+             if (GET_CODE (pat) != SET)
+               continue;
+
+             rtx x = SET_SRC (pat);
+             if (GET_CODE (x) == ZERO_EXTRACT
+                 || GET_CODE (x) == ZERO_EXTEND
+                 || GET_CODE (x) == SIGN_EXTEND)
+               x = XEXP (x, 0);
+
+             /* Loading the value from memory; unlikely to be a constant
+                stride.  */
+             if (MEM_P (x))
+               continue;
+
+             /* An increment or decrement by a constant MODE_SIZE amount or
+                the result of a binary expression is likely to be an IV.  */
+             if (GET_CODE (x) == POST_INC
+                 || GET_CODE (x) == POST_DEC
+                 || GET_CODE (x) == PRE_INC
+                 || GET_CODE (x) == PRE_DEC)
+               return true;
+             else if (BINARY_P (x)
+                      && (CONST_INT_P (XEXP (x, 0))
+                          || CONST_INT_P (XEXP (x, 1))))
+               {
+                 rtx stride = (CONST_INT_P (XEXP (x, 0))
+                               ? XEXP (x, 0) : XEXP (x, 1));
+
+                 /* Don't bother with very long strides because the prefetcher
+                    is unable to train on them anyway.  */
+                 if (INTVAL (stride) < MAX_PREFETCH_STRIDE)
+                   return true;
+               }
+           }
+       }
+      return false;
+    }
+  return false;
+}
+
+

<snip>

+
+/* Return true if INSN is a strided load in LOOP.  If it is a strided load, set
+   the DEST, BASE and OFFSET.  Also, if this is a pre/post increment load, set
+   PRE_POST to true.
+
+   The routine does checks on the destination of the insn and depends on
+   STRIDED_LOAD_P to check the source and fill in the BASE and OFFSET.  */
+static bool
+get_load_info (rtx_insn *insn, struct loop *loop, rtx *dest, rtx *base,
+              rtx *offset, bool *pre_post, bool *ldp)
+{
+  if (!INSN_P (insn) || recog_memoized (insn) < 0)
+    return false;
+
+  rtx pat = PATTERN (insn);
+  unsigned code = GET_CODE (pat);
+  bool load_pair = (code == PARALLEL);
+
+  /* For a load pair we need only the first base and destination
+     registers.  We however need to ensure that our pre/post increment
+     offset is doubled; we do that in STRIDED_LOAD_P.  */
+  if (load_pair)
+    {
+      pat = XVECEXP (pat, 0, 0);
+      code = GET_CODE (pat);
+    }
+
+  if (code != SET)
+    return false;
+
+  rtx dest_rtx = SET_DEST (pat);
+
+  if (!REG_P (dest_rtx))
+    return false;
+
+  /* Don't try to rename R0 or V0 because we could be messing with the return
+     value.  */
+  if (REGNO (dest_rtx) == R0_REGNUM || REGNO (dest_rtx) == V0_REGNUM)
+    return false;
+
+  unsigned regno = REGNO (dest_rtx);
+  machine_mode mode = GET_MODE (dest_rtx);
+  machine_mode inner_mode = GET_MODE_INNER (mode);
+
+  /* Falkor does not support SVE vectors.  */
+  gcc_assert (GET_MODE_SIZE (mode).is_constant ());
+

I think this will blow up if someone tries compiling for SVE (-march=armv8.2-a+sve for example)
with -mtune=falkor. We don't want to crash then. I believe you just want to bail out of the optimisation by returning false.
You should update the comment in tag () to reflect this as well.
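Concretely, the suggested change would be something along these lines (a sketch of the reviewer's suggestion, not the committed form): replace the assert with an early bail-out so the pass simply skips SVE modes.

```c
/* Falkor does not support SVE vectors, so variable-sized modes cannot
   be strided loads we care about; bail out of the optimisation rather
   than assert, so -march=armv8.2-a+sve with -mtune=falkor does not ICE.  */
if (!GET_MODE_SIZE (mode).is_constant ())
  return false;
```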

+  /* Ignore vector struct or lane loads.  */
+  if (GET_MODE_SIZE (mode).to_constant ()
+      != GET_MODE_SIZE (inner_mode).to_constant ())
+    return false;
+
+  /* The largest width we want to bother with is a load of a pair of
+     quad-words.  */
+  if ((GET_MODE_SIZE (mode).to_constant () << load_pair)
+      > GET_MODE_SIZE (OImode))
+    return false;
+
+  /* Ignore loads into the stack pointer because it is unlikely to be a
+     stream.  */
+  if (regno == SP_REGNUM)
+    return false;
+
+  if (valid_src_p (SET_SRC (pat), insn, loop, pre_post, base, offset,
+                  load_pair))
+    {
+      *dest = dest_rtx;
+      *ldp = load_pair;
+
+      return true;
+    }
+
+  return false;
+}
+

