This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: [PATCH][RFC] PR middle-end/22141 GIMPLE store widening pass


On Fri, Jul 15, 2016 at 5:13 PM, Kyrill Tkachov
<kyrylo.tkachov@foss.arm.com> wrote:
> Hi all,
>
> This is a GIMPLE pass implementing PR middle-end/22141, that is,
> merging narrow stores of constants into fewer, wider stores.  A 2009
> patch from Jakub [1] contains many testcases, but a simple motivating
> case is:
>
> struct bar {
>   int a;
>   char b;
>   char c;
>   char d;
>   char e;
> }; // packed 64-bit structure
>
> void bar (struct bar *);
>
> void
> foo (struct bar *p)
> {
>   p->b = 0;
>   p->a = 0;
>   p->c = 0;
>   p->d = 1;
>   p->e = 0;
> }
>
> Currently on aarch64 this will generate:
> foo:
>         mov     w1, 1
>         str     wzr, [x0]
>         strb    wzr, [x0, 4]
>         strb    wzr, [x0, 5]
>         strb    w1, [x0, 6]
>         strb    wzr, [x0, 7]
>         ret
>
> With this patch this can be improved into a single unaligned store:
> foo:
>         mov     x1, 0x1000000000000
>         str     x1, [x0]
>         ret
>
> or, if compiled with -mstrict-align:
> foo:
>         mov     w1, 0x10000
>         stp     wzr, w1, [x0]
>         ret
>
> The pass is a tree-ssa pass that runs fairly late in the pipeline,
> after pass_optimize_widening_mul.  I explain the approach taken in the
> comments in the new tree-ssa-store-widening.c file, but essentially it
> has three phases applied to each basic block:
>
> 1) Scan through the statements, recording assignments of constants to
> destinations like ARRAY_REF, COMPONENT_REF and MEM_REF that are
> determined to write to an ultimate common destination.
> get_inner_reference is used to decompose these destinations.  Continue
> recording until we encounter a statement that may interfere with the
> stores we've been recording (a load or store that may alias, a
> volatile access etc.).  These assignments of interest are recorded as
> store_immediate_info objects in the m_store_info vector.
>
> 2) Analyse the stores recorded in phase one (they all write to a destination
> offset from a common base)
> and merge them into wider assignments up to BITS_PER_WORD bits wide. These
> widened assignments are represented
> as merged_store_group objects and they are recorded in the
> m_merged_store_groups vector. This is the
> coalesce_immediate_stores function. It sorts the stores by the bitposition
> they write to and iterates through
> them, merging consecutive stores (it fails the transformation on overlapping
> stores; I don't think that case
> appears often enough to warrant extra logic) up to BITS_PER_WORD-wide
> accesses.
>
> 3) Go through the merged stores recorded in m_merged_store_groups and output
> each widened store. Widened stores
> that are not of a bitsize that is a power of two (for example 48 bits wide)
> are output as multiple stores of decreasing
> power-of-two width. So, for a widened store 48 bits wide this phase
> would emit a 32-bit store followed by a
> 16-bit store. The new sequence is only emitted if it contains fewer
> statements than the original sequence that it
> will replace.  This phase also avoids outputting unaligned stores for
> STRICT_ALIGNMENT targets or targets where
> SLOW_UNALIGNED_ACCESS forbids it. Since some configurations/targets may want
> to avoid generation of unaligned
> stores even when it is legal I've added the new param
> PARAM_STORE_WIDENING_ALLOW_UNALIGNED that can be used
> to disallow unaligned store generation.  Its default setting is to
> allow them (assuming that STRICT_ALIGNMENT and SLOW_UNALIGNED_ACCESS
> allow it).
>
> This is my first GIMPLE-level pass so please do point out places where I'm
> not using the interfaces correctly.
> This patch handles bitfields as well, but only if they are a multiple of
> BITS_PER_UNIT. It should be easily
> extensible to handle other bitfields as well, but I'm not entirely sure of
> the rules for laying out such bitfields
> and in particular the byteswap logic that needs to be applied for big-endian
> targets. If someone can shed some light
> on how they should be handled, I'll be happy to try it out, but I believe this
> patch is an improvement as it is.
>
> This has been bootstrapped and tested on aarch64-none-linux-gnu,
> arm-none-linux-gnueabihf and x86_64-unknown-linux-gnu.
> I've also tested it on the big-endian targets: armeb-none-eabi,
> aarch64_be-none-elf. Also tested aarch64-none-elf/-mabi=ilp32.
>
> I've benchmarked it on SPEC2006 on AArch64 on a Cortex-A72 and there were no
> regressions, the overall score improved a bit
> (about 0.1%). The interesting improvements were:
> 458.sjeng     (+0.8%)
> 483.xalancbmk (+1.1%)
> 416.gamess    (+1.0%)
> 454.calculix  (+1.1%)
>
> An interesting effect was in BZ2_decompress from bzip2 where at -Ofast it
> transformed a long sequence of constant
> byte stores into a much shorter sequence of word-size stores (from ~550
> instructions to ~190).
>
> On x86_64 SPECINT there was no change in the overall score. The code size at
> -Ofast is consistently smaller
> with this patch, but the performance differences on sub-benchmarks are in the
> noise.
>
> I've included the testcases from Jakub's patch [1] and added a few of my
> own.
>
> Is this direction acceptable for the problem this is trying to solve?

+      /* Record the original statements so that we can keep track of
+        statements emitted in this pass and not re-process new
+        statements.  */
+      for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+       {
+         gimple *stmt = gsi_stmt (gsi);
+         if (!is_gimple_debug (stmt))
+           orig_stmts.add (stmt);
+         num_statements++;
+       }

please use gimple_set_visited () instead, that should be cheaper.


+      do
+       {
+         changes_made = false;
+         for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+           {
...
+       }
+      while (changes_made);

looks pretty quadratic to me.  Instead of tracking things with m_curr_base_expr
why not use a hash-map to track stores related to a base?

+                          /* Don't handle bitfields that are not a multiple
+                             of BITS_PER_UNIT for now.  Can be extended
+                             later.  */
+                          && (bitsize % BITS_PER_UNIT == 0)

:(

+                          && !stmt_interferes_with_mem_accesses_p (lhs))

given this function loops over all accesses this is quadratic as well.

I think alias queries could be simplified if you reduce them to alias
checks on the base object (and allow overlapping constant stores
which should be easy to handle during merging).

The VUSE/VDEF handling is somewhat odd.  All stores have both
a VDEF and a VUSE, if you merge a set of them you can simply
re-use the VDEF/VUSE of one of them, effectively replacing the
stmt.  For all other stores you remove you miss a
  unlink_stmt_vdef (stmt);
  release_defs (stmt);
to update virtual SSA form and properly release SSA names.

As you use TBAA in your alias checks you may only _sink_
stores.  Not sure if your merged store placement ensures this.

I think this kind of transform is useful in early optimizations, not only
very late as you perform it.  Of course it should be really cheap there.

Note that options like -ftree-store-widening are discouraged
("tree" means nothing to our users, and store widening isn't
what is done - it's store merging).  Simply name it -fstore-merging.
Also the file name should be gimple-ssa-store-merging.c.

Looking forward to an algorithmically enhanced version.

Richard.


> Thanks,
> Kyrill
>
> N.B. I'm going on vacation until August so I won't be able to respond to any
> feedback until I get back.
>
> [1] https://gcc.gnu.org/ml/gcc-patches/2009-09/msg01745.html
>
> 2016-07-15  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>
>     PR middle-end/22141
>     * Makefile.in (OBJS): Add tree-ssa-store-widening.o.
>     * common.opt (ftree-store-widening): New Optimization option.
>     * opts.c (default_options_table): Add entry for
>     OPT_ftree_store_widening.
>     * params.def (PARAM_STORE_WIDENING_ALLOW_UNALIGNED): Define.
>     * passes.def: Insert pass_tree_store_widening.
>     * tree-pass.h (make_pass_tree_store_widening): Declare extern
>     prototype.
>     * tree-ssa-store-widening.c: New file.
>     * doc/invoke.texi (Optimization Options): Document
>     -ftree-store-widening.
>
> 2016-07-15  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
>             Jakub Jelinek  <jakub@redhat.com>
>
>     PR middle-end/22141
>     * gcc.c-torture/execute/pr22141-1.c: New test.
>     * gcc.c-torture/execute/pr22141-2.c: Likewise.
>     * gcc.target/aarch64/ldp_stp_1.c: Adjust for -ftree-store-widening.
>     * gcc.dg/store_widening_1.c: New test.
>     * gcc.dg/store_widening_2.c: Likewise.
>     * gcc.dg/store_widening_3.c: Likewise.
>     * gcc.dg/store_widening_4.c: Likewise.
>     * gcc.target/i386/pr22141.c: Likewise.

