[PATCH][RFC] PR middle-end/22141 GIMPLE store widening pass

Fri Jul 15 15:14:00 GMT 2016

Hi all,

This is a GIMPLE pass to implement PR middle-end/22141. that is merge narrow stores of constants
into fewer wider stores.  A 2009 patch from Jakub [1] contains many testcases but a simple motivating
case can be:

struct bar {
   int a;
   char b;
   char c;
   char d;
   char e;
}; // packed 64-bit structure

void bar (struct bar *);

void
foo (struct bar *p)
{
   p->b = 0;
   p->a = 0;
   p->c = 0;
   p->d = 1;
   p->e = 0;
}

Currently on aarch64 this will generate:
foo:
         mov     w1, 1
         str     wzr, [x0]
         strb    wzr, [x0, 4]
         strb    wzr, [x0, 5]
         strb    w1, [x0, 6]
         strb    wzr, [x0, 7]
         ret

With this patch this can be improved into a single unaligned store:
foo:
         mov     x1, 0x1000000000000
         str     x1, [x0]
         ret

or, if compiled with -mstrict-align:
foo:
         mov     w1, 0x10000
         stp     wzr, w1, [x0]
         ret

The pass is a tree-ssa pass that runs fairly late in the pipeline, after pass_optimize_widening_mul.
I explain the approach taken in the comments in the new tree-ssa-store-widening.c file but essentially
it has 3 phases applied to each basic block:

1) Scan through the statements recording assignments of constants to destinations like ARRAY_REF,
COMPONENT_REF, MEM_REF which are determined to write to an ultimate common destination. get_inner_reference
is used to decompose these destinations. Continue recording these until we encounter a statement that may
interfere with the stores we've been recording (load or store that may alias, volatile access etc).
These assignments of interest are recorded as store_immediate_info objects in the m_store_info vector.

2) Analyse the stores recorded in phase one (they all write to a destination offset from a common base)
and merge them into wider assignments up to BITS_PER_WORD bits wide. These widened assignments are represented
as merged_store_group objects and they are recorded in the m_merged_store_groups vector. This is the
coalesce_immediate_stores function. It sorts the stores by the bitposition they write to and iterates through
them, merging consecutive stores (it fails the transformation on overlapping stores, I don't think that case
appears often enough to warrant extra logic) up to BITS_PER_WORD-wide accesses.

3) Go through the merged stores recorded in m_merged_store_groups and output each widened store. Widened stores
that are not of a bitsize that is a power of two (for example 48 bits wide) are output as multiple stores of decreasing
power-of-two width. So, for a widened store 48-bits wide this phase would a emit a 32-bit store followed by a
16-bit store. The new sequence is only emitted if it contains fewer statements than the original sequence that it
will replace.  This phase also avoids outputting unaligned stores for STRICT_ALIGNMENT targets or targets where
SLOW_UNALIGNED_ACCESS forbids it. Since some configurations/targets may want to avoid generation of unaligned
stores even when it is legal I've added the new param PARAM_STORE_WIDENING_ALLOW_UNALIGNED that can be used
to disallow unaligned store generation.  Its default setting is to allow them (assuming that STRICT_ALIGNMENT
and SLOW_UNALIGNED_ACCESS allows it).

This is my first GIMPLE-level pass so please do point out places where I'm not using the interfaces correctly.
This patch handles bitfields as well, but only if they are a multiple of BITS_PER_UNIT. It should be easily
extensible to handle other bitfields as well, but I'm not entirely sure of the rules for laying out such bitfields
and in particular the byteswap logic that needs to be applied for big-endian targets. If someone can shed some light
on how they should be handed I'll be happy to try it out, but I believe this patch is an improvement as it is.

This has been bootstrapped and tested on aarch64-none-linux-gnu, arm-none-linux-gnueabihf and x86_64-unknown-linux-gnu.
I've also tested it on the big-endian targets: armeb-none-eabi, aarch64_be-none-elf. Also tested aarch64-none-elf/-mabi=ilp32.

I've benchmarked it on SPEC2006 on AArch64 on a Cortex-A72 and there were no regressions, the overall score improved a bit
(about 0.1%). The interesting improvements were:
458.sjeng     (+0.8%)
483.xalancbmk (+1.1%)
416.gamess    (+1.0%)
454.calculix  (+1.1%)

An interesting effect was in BZ2_decompress from bzip2 where at -Ofast it transformed a long sequence of constant
byte stores into a much shorter sequence of word-size stores (from ~550 instructions to ~190).

On x86_64 SPECINT there was no change in the overall score. The code size at -Ofast is consistently smaller
with this patch but the preformance differences on sub-benchmarks are in the noise.

I've included the testcases from Jakub's patch [1] and added a few of my own.

Is this direction acceptable for the problem this is trying to solve?

Thanks,
Kyrill

N.B. I'm going on vacation until August so I won't be able to respond to any feedback until I get back.

[1] https://gcc.gnu.org/ml/gcc-patches/2009-09/msg01745.html

2016-07-15  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

     PR middle-end/22141
     * Makefile.in (OBJS): Add tree-ssa-store-widening.o.
     * common.opt (ftree-store-widening): New Optimization option.
     * opts.c (default_options_table): Add entry for
     OPT_ftree_store_widening.
     * params.def (PARAM_STORE_WIDENING_ALLOW_UNALIGNED): Define.
     * passes.def: Insert pass_tree_store_widening.
     * tree-pass.h (make_pass_tree_store_widening): Declare extern
     prototype.
     * tree-ssa-store-widening.c: New file.
     * doc/invoke.texi (Optimization Options): Document
     -ftree-store-widening.

2016-07-15  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
             Jakub Jelinek  <jakub@redhat.com>

     PR middle-end/22141
     * gcc.c-torture/execute/pr22141-1.c: New test.
     * gcc.c-torture/execute/pr22141-2.c: Likewise.
     * gcc.target/aarch64/ldp_stp_1.c: Adjust for -ftree-store-widening.
     * gcc.dg/store_widening_1.c: New test.
     * gcc.dg/store_widening_2.c: Likewise.
     * gcc.dg/store_widening_3.c: Likewise.
     * gcc.dg/store_widening_4.c: Likewise.
     * gcc.target/i386/pr22141.c: Likewise.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: widening.patch
Type: text/x-patch
Size: 44961 bytes
Desc: not available
URL: <http://gcc.gnu.org/pipermail/gcc-patches/attachments/20160715/3fba5292/attachment.bin>