[PATCH][RFC] PR middle-end/22141 GIMPLE store widening pass


Hi all,

This is a GIMPLE pass to implement PR middle-end/22141, that is, merge narrow stores of constants
into fewer wider stores.  A 2009 patch from Jakub [1] contains many testcases; a simple motivating
case is:

struct bar {
  int a;
  char b;
  char c;
  char d;
  char e;
}; // packed 64-bit structure

void bar (struct bar *);

void
foo (struct bar *p)
{
  p->b = 0;
  p->a = 0;
  p->c = 0;
  p->d = 1;
  p->e = 0;
}

Currently on aarch64 this will generate:
foo:
        mov     w1, 1
        str     wzr, [x0]
        strb    wzr, [x0, 4]
        strb    wzr, [x0, 5]
        strb    w1, [x0, 6]
        strb    wzr, [x0, 7]
        ret

With this patch this can be improved into a single unaligned store:
foo:
        mov     x1, 0x1000000000000
        str     x1, [x0]
        ret

or, if compiled with -mstrict-align:
foo:
        mov     w1, 0x10000
        stp     wzr, w1, [x0]
        ret

The pass is a tree-ssa pass that runs fairly late in the pipeline, after pass_optimize_widening_mul.
I explain the approach taken in the comments in the new tree-ssa-store-widening.c file but essentially
it has 3 phases applied to each basic block:

1) Scan through the statements recording assignments of constants to destinations like ARRAY_REF,
COMPONENT_REF or MEM_REF that are determined to write to a common base object. get_inner_reference
is used to decompose these destinations. Continue recording these until we encounter a statement that may
interfere with the stores we've been recording (load or store that may alias, volatile access etc).
These assignments of interest are recorded as store_immediate_info objects in the m_store_info vector.
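
For illustration, consider a hypothetical variation on the example above ('baz' and 'q' are not part
of the patch): the store through 'q' may alias '*p', so the stores recorded up to that point are
processed as one chain and a new chain is started afterwards:

void
baz (struct bar *p, int *q)
{
  p->b = 0;  /* Recorded: starts a chain based on '*p'.  */
  p->c = 0;  /* Recorded: same base.  */
  *q = 1;    /* May alias '*p': terminates the chain.  */
  p->d = 0;  /* Recorded: starts a new chain.  */
  p->e = 0;
}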

2) Analyse the stores recorded in phase one (they all write to a destination offset from a common base)
and merge them into wider assignments up to BITS_PER_WORD bits wide. These widened assignments are represented
as merged_store_group objects and they are recorded in the m_merged_store_groups vector. This is the
coalesce_immediate_stores function.  It sorts the stores by the bitposition they write to and iterates through
them, merging consecutive stores into accesses of up to BITS_PER_WORD bits (it fails the transformation on
overlapping stores; I don't think that case appears often enough to warrant extra logic).
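
As a shortened version of the example in the new file's comments: given the recorded byte stores
   [p     ] := 0;
   [p + 1B] := 1;
   [p + 3B] := 0;
   [p + 4B] := 1;
this phase produces two merged_store_group objects, one for the region [p : p + 1] and one for
[p + 3 : p + 4], since the one-byte gap at p + 2 splits the chain into two groups.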

3) Go through the merged stores recorded in m_merged_store_groups and output each widened store. Widened stores
that are not of a bitsize that is a power of two (for example 48 bits wide) are output as multiple stores of decreasing
power-of-two width. So, for a widened store 48 bits wide this phase would emit a 32-bit store followed by a
16-bit store. The new sequence is only emitted if it contains fewer statements than the original sequence that it
will replace.  This phase also avoids outputting unaligned stores for STRICT_ALIGNMENT targets or targets where
SLOW_UNALIGNED_ACCESS forbids it. Since some configurations/targets may want to avoid generation of unaligned
stores even when it is legal I've added the new param PARAM_STORE_WIDENING_ALLOW_UNALIGNED that can be used
to disallow unaligned store generation.  Its default setting is to allow them (assuming that STRICT_ALIGNMENT
and SLOW_UNALIGNED_ACCESS allow it).
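
For example (a hypothetical invocation using the option and param added by this patch), a user who
wants the pass but not the unaligned stores could compile with:

  gcc -O2 --param store-widening-allow-unaligned=0 foo.c

while -fno-tree-store-widening disables the pass entirely.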

This is my first GIMPLE-level pass so please do point out places where I'm not using the interfaces correctly.
This patch handles bitfields as well, but only if they are a multiple of BITS_PER_UNIT. It should be easily
extensible to handle other bitfields as well, but I'm not entirely sure of the rules for laying out such bitfields
and in particular the byteswap logic that needs to be applied for big-endian targets. If someone can shed some light
on how they should be handled I'll be happy to try it out, but I believe this patch is an improvement as it is.

This has been bootstrapped and tested on aarch64-none-linux-gnu, arm-none-linux-gnueabihf and x86_64-unknown-linux-gnu.
I've also tested it on the big-endian targets: armeb-none-eabi, aarch64_be-none-elf. Also tested aarch64-none-elf/-mabi=ilp32.

I've benchmarked it on SPEC2006 on AArch64 on a Cortex-A72; there were no regressions and the overall score
improved a bit (about 0.1%).  The interesting improvements were:
458.sjeng     (+0.8%)
483.xalancbmk (+1.1%)
416.gamess    (+1.0%)
454.calculix  (+1.1%)

An interesting effect was in BZ2_decompress from bzip2 where at -Ofast it transformed a long sequence of constant
byte stores into a much shorter sequence of word-size stores (from ~550 instructions to ~190).

On x86_64 SPECINT there was no change in the overall score. The code size at -Ofast is consistently smaller
with this patch but the performance differences on sub-benchmarks are in the noise.

I've included the testcases from Jakub's patch [1] and added a few of my own.

Is this direction acceptable for the problem this is trying to solve?

Thanks,
Kyrill

N.B. I'm going on vacation until August so I won't be able to respond to any feedback until I get back.

[1] https://gcc.gnu.org/ml/gcc-patches/2009-09/msg01745.html

2016-07-15  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

    PR middle-end/22141
    * Makefile.in (OBJS): Add tree-ssa-store-widening.o.
    * common.opt (ftree-store-widening): New Optimization option.
    * opts.c (default_options_table): Add entry for
    OPT_ftree_store_widening.
    * params.def (PARAM_STORE_WIDENING_ALLOW_UNALIGNED): Define.
    * passes.def: Insert pass_tree_store_widening.
    * tree-pass.h (make_pass_tree_store_widening): Declare extern
    prototype.
    * tree-ssa-store-widening.c: New file.
    * doc/invoke.texi (Optimization Options): Document
    -ftree-store-widening.

2016-07-15  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
            Jakub Jelinek  <jakub@redhat.com>

    PR middle-end/22141
    * gcc.c-torture/execute/pr22141-1.c: New test.
    * gcc.c-torture/execute/pr22141-2.c: Likewise.
    * gcc.target/aarch64/ldp_stp_1.c: Adjust for -ftree-store-widening.
    * gcc.dg/store_widening_1.c: New test.
    * gcc.dg/store_widening_2.c: Likewise.
    * gcc.dg/store_widening_3.c: Likewise.
    * gcc.dg/store_widening_4.c: Likewise.
    * gcc.target/i386/pr22141.c: Likewise.
diff --git a/gcc/Makefile.in b/gcc/Makefile.in
index 2dc88bee3e98d1642d7603c620faccbc8d4d9c54..9992a212f976cdfaa00cc21a6c70525fd345976e 100644
--- a/gcc/Makefile.in
+++ b/gcc/Makefile.in
@@ -1505,6 +1505,7 @@ OBJS = \
 	tree-ssa-sccvn.o \
 	tree-ssa-scopedtables.o \
 	tree-ssa-sink.o \
+	tree-ssa-store-widening.o \
 	tree-ssa-strlen.o \
 	tree-ssa-structalias.o \
 	tree-ssa-tail-merge.o \
diff --git a/gcc/common.opt b/gcc/common.opt
index 72e657ca000ec3333e40b93e264767a6d840d1c9..13d47e17ff016e1af5971d36a046849f53d5bcc2 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2471,6 +2471,10 @@ ftree-sra
 Common Report Var(flag_tree_sra) Optimization
 Perform scalar replacement of aggregates.
 
+ftree-store-widening
+Common Var(flag_tree_store_widening) Optimization
+Use the tree store widening pass.
+
 ftree-ter
 Common Report Var(flag_tree_ter) Optimization
 Replace temporary expressions in the SSA->normal pass.
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 997faa15b6e4e8dc938bad5b8b249f3870598990..b57c24e2207f3f504e186bef51b05dbbdbeacfec 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -411,8 +411,8 @@ Objective-C and Objective-C++ Dialects}.
 -ftree-loop-vectorize @gol
 -ftree-parallelize-loops=@var{n} -ftree-pre -ftree-partial-pre -ftree-pta @gol
 -ftree-reassoc -ftree-sink -ftree-slsr -ftree-sra @gol
--ftree-switch-conversion -ftree-tail-merge -ftree-ter @gol
--ftree-vectorize -ftree-vrp -funconstrained-commons @gol
+-ftree-store-widening -ftree-switch-conversion -ftree-tail-merge @gol
+-ftree-ter -ftree-vectorize -ftree-vrp -funconstrained-commons @gol
 -funit-at-a-time -funroll-all-loops -funroll-loops @gol
 -funsafe-loop-optimizations -funsafe-math-optimizations -funswitch-loops @gol
 -fipa-ra -fvariable-expansion-in-unroller -fvect-cost-model -fvpt @gol
@@ -6333,6 +6333,7 @@ compilation time.
 -ftree-sink @gol
 -ftree-slsr @gol
 -ftree-sra @gol
+-ftree-store-widening @gol
 -ftree-pta @gol
 -ftree-ter @gol
 -funit-at-a-time}
@@ -7651,6 +7652,13 @@ Perform scalar replacement of aggregates.  This pass replaces structure
 references with scalars to prevent committing structures to memory too
 early.  This flag is enabled by default at @option{-O} and higher.
 
+@item -ftree-store-widening
+@opindex ftree-store-widening
+Perform merging of narrow stores to consecutive memory addresses.  This pass
+merges contiguous stores of immediate values narrower than a word into fewer
+wider stores to reduce the number of instructions.  This is enabled by default
+at @option{-O} and higher.
+
 @item -ftree-ter
 @opindex ftree-ter
 Perform temporary expression replacement during the SSA->normal phase.  Single
diff --git a/gcc/opts.c b/gcc/opts.c
index 4053fb1db0a9a814a5623e0604619fbe068b37ab..6f7c4a2504b0bf13ab6f4cb6faff35757381fff2 100644
--- a/gcc/opts.c
+++ b/gcc/opts.c
@@ -463,6 +463,7 @@ static const struct default_options default_options_table[] =
     { OPT_LEVELS_1_PLUS, OPT_ftree_dse, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_ftree_ter, NULL, 1 },
     { OPT_LEVELS_1_PLUS_NOT_DEBUG, OPT_ftree_sra, NULL, 1 },
+    { OPT_LEVELS_1_PLUS_NOT_DEBUG, OPT_ftree_store_widening, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_ftree_fre, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_ftree_copy_prop, NULL, 1 },
     { OPT_LEVELS_1_PLUS, OPT_ftree_sink, NULL, 1 },
diff --git a/gcc/params.def b/gcc/params.def
index 62ec600ba3c88dde78150fae63591e0855e93752..521ca636c08b4e0d845db947e12fd58a033f94e6 100644
--- a/gcc/params.def
+++ b/gcc/params.def
@@ -1095,6 +1095,12 @@ DEFPARAM (PARAM_MAX_TAIL_MERGE_COMPARISONS,
           "Maximum amount of similar bbs to compare a bb with.",
           10, 0, 0)
 
+DEFPARAM (PARAM_STORE_WIDENING_ALLOW_UNALIGNED,
+	  "store-widening-allow-unaligned",
+	  "Allow the store widening pass to introduce unaligned stores "
+	  "if it is legal to do so.",
+          1, 0, 1)
+
 DEFPARAM (PARAM_MAX_TAIL_MERGE_ITERATIONS,
           "max-tail-merge-iterations",
           "Maximum amount of iterations of the pass over a function.",
diff --git a/gcc/passes.def b/gcc/passes.def
index c78dbf2f61beb3940122109c04473e6499846b22..e54c64893c274dcf69fcee9d61be2193f3d8e148 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -324,6 +324,7 @@ along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_phiopt);
       NEXT_PASS (pass_fold_builtins);
       NEXT_PASS (pass_optimize_widening_mul);
+      NEXT_PASS (pass_tree_store_widening);
       NEXT_PASS (pass_tail_calls);
       /* If DCE is not run before checking for uninitialized uses,
 	 we may get false warnings (e.g., testsuite/gcc.dg/uninit-5.c).
diff --git a/gcc/testsuite/gcc.c-torture/execute/pr22141-1.c b/gcc/testsuite/gcc.c-torture/execute/pr22141-1.c
new file mode 100644
index 0000000000000000000000000000000000000000..7c888b469cf39f00ced8ddb8cc6dc245fa30b97b
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/execute/pr22141-1.c
@@ -0,0 +1,122 @@
+/* PR middle-end/22141 */
+
+extern void abort (void);
+
+struct S
+{
+  struct T
+    {
+      char a;
+      char b;
+      char c;
+      char d;
+    } t;
+} u;
+
+struct U
+{
+  struct S s[4];
+};
+
+void __attribute__((noinline))
+c1 (struct T *p)
+{
+  if (p->a != 1 || p->b != 2 || p->c != 3 || p->d != 4)
+    abort ();
+  __builtin_memset (p, 0xaa, sizeof (*p));
+}
+
+void __attribute__((noinline))
+c2 (struct S *p)
+{
+  c1 (&p->t);
+}
+
+void __attribute__((noinline))
+c3 (struct U *p)
+{
+  c2 (&p->s[2]);
+}
+
+void __attribute__((noinline))
+f1 (void)
+{
+  u = (struct S) { { 1, 2, 3, 4 } };
+}
+
+void __attribute__((noinline))
+f2 (void)
+{
+  u.t.a = 1;
+  u.t.b = 2;
+  u.t.c = 3;
+  u.t.d = 4;
+}
+
+void __attribute__((noinline))
+f3 (void)
+{
+  u.t.d = 4;
+  u.t.b = 2;
+  u.t.a = 1;
+  u.t.c = 3;
+}
+
+void __attribute__((noinline))
+f4 (void)
+{
+  struct S v;
+  v.t.a = 1;
+  v.t.b = 2;
+  v.t.c = 3;
+  v.t.d = 4;
+  c2 (&v);
+}
+
+void __attribute__((noinline))
+f5 (struct S *p)
+{
+  p->t.a = 1;
+  p->t.c = 3;
+  p->t.d = 4;
+  p->t.b = 2;
+}
+
+void __attribute__((noinline))
+f6 (void)
+{
+  struct U v;
+  v.s[2].t.a = 1;
+  v.s[2].t.b = 2;
+  v.s[2].t.c = 3;
+  v.s[2].t.d = 4;
+  c3 (&v);
+}
+
+void __attribute__((noinline))
+f7 (struct U *p)
+{
+  p->s[2].t.a = 1;
+  p->s[2].t.c = 3;
+  p->s[2].t.d = 4;
+  p->s[2].t.b = 2;
+}
+
+int
+main (void)
+{
+  struct U w;
+  f1 ();
+  c2 (&u);
+  f2 ();
+  c1 (&u.t);
+  f3 ();
+  c2 (&u);
+  f4 ();
+  f5 (&u);
+  c2 (&u);
+  f6 ();
+  f7 (&w);
+  c3 (&w);
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.c-torture/execute/pr22141-2.c b/gcc/testsuite/gcc.c-torture/execute/pr22141-2.c
new file mode 100644
index 0000000000000000000000000000000000000000..cb9cc79026310260ffc3a83bfdf9bfc92f998a86
--- /dev/null
+++ b/gcc/testsuite/gcc.c-torture/execute/pr22141-2.c
@@ -0,0 +1,122 @@
+/* PR middle-end/22141 */
+
+extern void abort (void);
+
+struct S
+{
+  struct T
+    {
+      char a;
+      char b;
+      char c;
+      char d;
+    } t;
+} u __attribute__((aligned));
+
+struct U
+{
+  struct S s[4];
+};
+
+void __attribute__((noinline))
+c1 (struct T *p)
+{
+  if (p->a != 1 || p->b != 2 || p->c != 3 || p->d != 4)
+    abort ();
+  __builtin_memset (p, 0xaa, sizeof (*p));
+}
+
+void __attribute__((noinline))
+c2 (struct S *p)
+{
+  c1 (&p->t);
+}
+
+void __attribute__((noinline))
+c3 (struct U *p)
+{
+  c2 (&p->s[2]);
+}
+
+void __attribute__((noinline))
+f1 (void)
+{
+  u = (struct S) { { 1, 2, 3, 4 } };
+}
+
+void __attribute__((noinline))
+f2 (void)
+{
+  u.t.a = 1;
+  u.t.b = 2;
+  u.t.c = 3;
+  u.t.d = 4;
+}
+
+void __attribute__((noinline))
+f3 (void)
+{
+  u.t.d = 4;
+  u.t.b = 2;
+  u.t.a = 1;
+  u.t.c = 3;
+}
+
+void __attribute__((noinline))
+f4 (void)
+{
+  struct S v __attribute__((aligned));
+  v.t.a = 1;
+  v.t.b = 2;
+  v.t.c = 3;
+  v.t.d = 4;
+  c2 (&v);
+}
+
+void __attribute__((noinline))
+f5 (struct S *p)
+{
+  p->t.a = 1;
+  p->t.c = 3;
+  p->t.d = 4;
+  p->t.b = 2;
+}
+
+void __attribute__((noinline))
+f6 (void)
+{
+  struct U v __attribute__((aligned));
+  v.s[2].t.a = 1;
+  v.s[2].t.b = 2;
+  v.s[2].t.c = 3;
+  v.s[2].t.d = 4;
+  c3 (&v);
+}
+
+void __attribute__((noinline))
+f7 (struct U *p)
+{
+  p->s[2].t.a = 1;
+  p->s[2].t.c = 3;
+  p->s[2].t.d = 4;
+  p->s[2].t.b = 2;
+}
+
+int
+main (void)
+{
+  struct U w __attribute__((aligned));
+  f1 ();
+  c2 (&u);
+  f2 ();
+  c1 (&u.t);
+  f3 ();
+  c2 (&u);
+  f4 ();
+  f5 (&u);
+  c2 (&u);
+  f6 ();
+  f7 (&w);
+  c3 (&w);
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.dg/store_widening_1.c b/gcc/testsuite/gcc.dg/store_widening_1.c
new file mode 100644
index 0000000000000000000000000000000000000000..6b7d9b3f83a065a5a420281217fa8299d8e1c275
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/store_widening_1.c
@@ -0,0 +1,35 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target non_strict_align } */
+/* { dg-options "-O -fdump-tree-store-widening" } */
+
+struct bar {
+  int a;
+  char b;
+  char c;
+  char d;
+  char e;
+  char f;
+  char g;
+};
+
+void
+foo1 (struct bar *p)
+{
+  p->b = 0;
+  p->a = 0;
+  p->c = 0;
+  p->d = 0;
+  p->e = 0;
+}
+
+void
+foo2 (struct bar *p)
+{
+  p->b = 0;
+  p->a = 0;
+  p->c = 1;
+  p->d = 0;
+  p->e = 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Widening successful" 2 "store-widening" } } */
diff --git a/gcc/testsuite/gcc.dg/store_widening_2.c b/gcc/testsuite/gcc.dg/store_widening_2.c
new file mode 100644
index 0000000000000000000000000000000000000000..e5ac1f1fee24e01cf4d3327afbbb68560919b5e6
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/store_widening_2.c
@@ -0,0 +1,80 @@
+/* { dg-do run } */
+/* { dg-require-effective-target non_strict_align } */
+/* { dg-options "-O -fdump-tree-store-widening" } */
+
+struct bar
+{
+  int a;
+  unsigned char b;
+  unsigned char c;
+  short d;
+  unsigned char e;
+  unsigned char f;
+  unsigned char g;
+};
+
+__attribute__ ((noinline)) void
+foozero (struct bar *p)
+{
+  p->b = 0;
+  p->a = 0;
+  p->c = 0;
+  p->d = 0;
+  p->e = 0;
+  p->f = 0;
+  p->g = 0;
+}
+
+__attribute__ ((noinline)) void
+foo1 (struct bar *p)
+{
+  p->b = 1;
+  p->a = 2;
+  p->c = 3;
+  p->d = 4;
+  p->e = 5;
+  p->f = 0;
+  p->g = 0xff;
+}
+
+__attribute__ ((noinline)) void
+foo2 (struct bar *p, struct bar *p2)
+{
+  p->b = 0xff;
+  p2->b = 0xa;
+  p->a = 0xfffff;
+  p2->c = 0xc;
+  p->c = 0xff;
+  p2->d = 0xbf;
+  p->d = 0xfff;
+}
+
+int
+main (void)
+{
+  struct bar b1, b2;
+  foozero (&b1);
+  foozero (&b2);
+
+  foo1 (&b1);
+  if (b1.b != 1 || b1.a != 2 || b1.c != 3 || b1.d != 4 || b1.e != 5
+      || b1.f != 0 || b1.g != 0xff)
+    __builtin_abort ();
+
+  foozero (&b1);
+  /* Make sure writes to aliasing struct pointers preserve the
+     correct order.  */
+  foo2 (&b1, &b1);
+  if (b1.b != 0xa || b1.a != 0xfffff || b1.c != 0xff || b1.d != 0xfff)
+    __builtin_abort ();
+
+  foozero (&b1);
+  foo2 (&b1, &b2);
+  if (b1.a != 0xfffff || b1.b != 0xff || b1.c != 0xff || b1.d != 0xfff
+      || b2.b != 0xa || b2.c != 0xc || b2.d != 0xbf)
+    __builtin_abort ();
+
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump-times "Widening successful" 2 "store-widening" } } */
diff --git a/gcc/testsuite/gcc.dg/store_widening_3.c b/gcc/testsuite/gcc.dg/store_widening_3.c
new file mode 100644
index 0000000000000000000000000000000000000000..63a76e7ad6ec912749725e92093179b07d02afe3
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/store_widening_3.c
@@ -0,0 +1,32 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target non_strict_align } */
+/* { dg-options "-O -fdump-tree-store-widening" } */
+
+/* Make sure stores to volatile addresses don't get combined with
+   other accesses.  */
+
+struct bar
+{
+  int a;
+  char b;
+  char c;
+  volatile short d;
+  char e;
+  char f;
+  char g;
+};
+
+void
+foozero (struct bar *p)
+{
+  p->b = 0xa;
+  p->a = 0xb;
+  p->c = 0xc;
+  p->d = 0;
+  p->e = 0xd;
+  p->f = 0xe;
+  p->g = 0xf;
+}
+
+/* { dg-final { scan-tree-dump "Volatile access terminates chain" "store-widening" } } */
+/* { dg-final { scan-tree-dump-times "=\{v\} 0;" 1 "store-widening" } } */
\ No newline at end of file
diff --git a/gcc/testsuite/gcc.dg/store_widening_4.c b/gcc/testsuite/gcc.dg/store_widening_4.c
new file mode 100644
index 0000000000000000000000000000000000000000..4f90b9cdc4dfe1a80d46b1dbc6357fc2e6750f79
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/store_widening_4.c
@@ -0,0 +1,32 @@
+/* { dg-do compile } */
+/* { dg-require-effective-target non_strict_align } */
+/* { dg-options "-O -fdump-tree-store-widening" } */
+
+/* Check that we can merge interleaving stores that are guaranteed
+   to be non-aliasing.  */
+
+struct bar
+{
+  int a;
+  char b;
+  char c;
+  short d;
+  char e;
+  char f;
+  char g;
+};
+
+void
+foozero (struct bar *restrict p, struct bar *restrict p2)
+{
+  p->b = 0xff;
+  p2->b = 0xa;
+  p->a = 0xfffff;
+  p2->a = 0xab;
+  p2->c = 0xc;
+  p->c = 0xff;
+  p2->d = 0xbf;
+  p->d = 0xfff;
+}
+
+/* { dg-final { scan-tree-dump-times "Widening successful" 2 "store-widening" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_1.c b/gcc/testsuite/gcc.target/aarch64/ldp_stp_1.c
index f02e55f1cc2f01063aea35b3b88f793bb2f7c532..9de4e771ab1e73bce960d4038f8ec5b49b5c612c 100644
--- a/gcc/testsuite/gcc.target/aarch64/ldp_stp_1.c
+++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_1.c
@@ -3,22 +3,22 @@
 int arr[4][4];
 
 void
-foo ()
+foo (int x, int y)
 {
-  arr[0][1] = 1;
-  arr[1][0] = -1;
-  arr[2][0] = 1;
-  arr[1][1] = -1;
-  arr[0][2] = 1;
-  arr[0][3] = -1;
-  arr[1][2] = 1;
-  arr[2][1] = -1;
-  arr[3][0] = 1;
-  arr[3][1] = -1;
-  arr[2][2] = 1;
-  arr[1][3] = -1;
-  arr[2][3] = 1;
-  arr[3][2] = -1;
+  arr[0][1] = x;
+  arr[1][0] = y;
+  arr[2][0] = x;
+  arr[1][1] = y;
+  arr[0][2] = x;
+  arr[0][3] = y;
+  arr[1][2] = x;
+  arr[2][1] = y;
+  arr[3][0] = x;
+  arr[3][1] = y;
+  arr[2][2] = x;
+  arr[1][3] = y;
+  arr[2][3] = x;
+  arr[3][2] = y;
 }
 
 /* { dg-final { scan-assembler-times "stp\tw\[0-9\]+, w\[0-9\]" 7 } } */
diff --git a/gcc/testsuite/gcc.target/i386/pr22141.c b/gcc/testsuite/gcc.target/i386/pr22141.c
new file mode 100644
index 0000000000000000000000000000000000000000..efded365778417e7b4bf468288f42f5136fdd585
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/pr22141.c
@@ -0,0 +1,126 @@
+/* PR middle-end/22141 */
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+extern void abort (void);
+
+struct S
+{
+  struct T
+    {
+      char a;
+      char b;
+      char c;
+      char d;
+    } t;
+} u;
+
+struct U
+{
+  struct S s[4];
+};
+
+void __attribute__((noinline))
+c1 (struct T *p)
+{
+  if (p->a != 1 || p->b != 2 || p->c != 3 || p->d != 4)
+    abort ();
+  __builtin_memset (p, 0xaa, sizeof (*p));
+}
+
+void __attribute__((noinline))
+c2 (struct S *p)
+{
+  c1 (&p->t);
+}
+
+void __attribute__((noinline))
+c3 (struct U *p)
+{
+  c2 (&p->s[2]);
+}
+
+void __attribute__((noinline))
+f1 (void)
+{
+  u = (struct S) { { 1, 2, 3, 4 } };
+}
+
+void __attribute__((noinline))
+f2 (void)
+{
+  u.t.a = 1;
+  u.t.b = 2;
+  u.t.c = 3;
+  u.t.d = 4;
+}
+
+void __attribute__((noinline))
+f3 (void)
+{
+  u.t.d = 4;
+  u.t.b = 2;
+  u.t.a = 1;
+  u.t.c = 3;
+}
+
+void __attribute__((noinline))
+f4 (void)
+{
+  struct S v;
+  v.t.a = 1;
+  v.t.b = 2;
+  v.t.c = 3;
+  v.t.d = 4;
+  c2 (&v);
+}
+
+void __attribute__((noinline))
+f5 (struct S *p)
+{
+  p->t.a = 1;
+  p->t.c = 3;
+  p->t.d = 4;
+  p->t.b = 2;
+}
+
+void __attribute__((noinline))
+f6 (void)
+{
+  struct U v;
+  v.s[2].t.a = 1;
+  v.s[2].t.b = 2;
+  v.s[2].t.c = 3;
+  v.s[2].t.d = 4;
+  c3 (&v);
+}
+
+void __attribute__((noinline))
+f7 (struct U *p)
+{
+  p->s[2].t.a = 1;
+  p->s[2].t.c = 3;
+  p->s[2].t.d = 4;
+  p->s[2].t.b = 2;
+}
+
+int
+main (void)
+{
+  struct U w;
+  f1 ();
+  c2 (&u);
+  f2 ();
+  c1 (&u.t);
+  f3 ();
+  c2 (&u);
+  f4 ();
+  f5 (&u);
+  c2 (&u);
+  f6 ();
+  f7 (&w);
+  c3 (&w);
+  return 0;
+}
+
+/* { dg-final { scan-assembler-times "67305985\|4030201" 7 } } */
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index 6d4d70cc1850464d1b014b91da363ee6164f4432..5a9ffa79dfb2ad1254b709d032455958afe542f1 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -424,6 +424,7 @@ extern gimple_opt_pass *make_pass_late_warn_uninitialized (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_cse_reciprocals (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_cse_sincos (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_optimize_bswap (gcc::context *ctxt);
+extern gimple_opt_pass *make_pass_tree_store_widening (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_optimize_widening_mul (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_warn_function_return (gcc::context *ctxt);
 extern gimple_opt_pass *make_pass_warn_function_noreturn (gcc::context *ctxt);
diff --git a/gcc/tree-ssa-store-widening.c b/gcc/tree-ssa-store-widening.c
new file mode 100644
index 0000000000000000000000000000000000000000..febd5dff3474d458c580deacf23bbe2d1e3430b0
--- /dev/null
+++ b/gcc/tree-ssa-store-widening.c
@@ -0,0 +1,875 @@
+/* GIMPLE store widening pass.
+   Copyright (C) 2016 Free Software Foundation, Inc.
+   Contributed by ARM Ltd.
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   GCC is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+/* The purpose of this pass is to combine multiple memory stores of
+   constant values to consecutive memory locations into fewer wider stores.
+   For example, if we have a sequence performing four byte stores to
+   consecutive memory locations:
+   [p     ] := imm1;
+   [p + 1B] := imm2;
+   [p + 2B] := imm3;
+   [p + 3B] := imm4;
+   we can transform this into a single 4-byte store if the target supports it:
+   [p] := imm1:imm2:imm3:imm4; // immediates concatenated according to endianness.
+
+   The algorithm is applied to each basic block in three phases:
+
+   1) Scan through the basic block recording constant assignments to
+   destinations that can be expressed as a store to memory of a certain size
+   at a certain bit offset.  Record all such assignments as long as they store
+   to a common base object (the address pointed to by 'p' in the above
+   example).
+   These stores can be a result of structure element initializers, array stores
+   etc.  A store_immediate_info object is recorded for every such store.
+   Record as many such assignments to a single base as possible until a
+   statement that interferes with the store sequence is encountered.
+
+   2) Analyze the chain of stores recorded in phase 1) (i.e. the vector of
+   store_immediate_info objects) and coalesce contiguous stores into
+   merged_store_group objects.
+
+   For example, given the stores:
+   [p     ] := 0;
+   [p + 1B] := 1;
+   [p + 3B] := 0;
+   [p + 4B] := 1;
+   [p + 5B] := 0;
+   [p + 6B] := 0;
+   This phase would produce two merged_store_group objects, one recording the
+   two bytes stored in the memory region [p : p + 1] and another
+   recording the four bytes stored in the memory region [p + 3 : p + 6].
+
+   3) The merged_store_group objects produced in phase 2) are processed
+   to generate the sequence of wider stores that set the contiguous memory
+   regions to the sequences of bytes that correspond to them.  This may emit
+   multiple stores per store group to handle contiguous stores that are not
+   of a size that is a power of 2.  For example it can try to emit a 40-bit
+   store as a 32-bit store followed by an 8-bit store.
+   We try to emit stores as wide as we can while respecting STRICT_ALIGNMENT or
+   SLOW_UNALIGNED_ACCESS rules.
+
+   Note on endianness and example:
+   The bitposition that is written in each access is determined by
+   get_inner_reference which calculates the position in the object rather than
+   the memory itself so some care must be taken when coalescing stores and
+   synthesizing the result in steps 2) and 3) respectively to handle
+   byte-endianness correctly.
+   Consider 2 contiguous 16-bit stores followed by 2 contiguous 8-bit stores:
+   [p     ] := 0x1234;
+   [p + 2B] := 0x5678;
+   [p + 4B] := 0xab;
+   [p + 5B] := 0xcd;
+
+   The memory layout for little-endian (LE) and big-endian (BE) must be:
+  p |LE|BE|
+  ---------
+  0 |34|12|
+  1 |12|34|
+  2 |78|56|
+  3 |56|78|
+  4 |ab|ab|
+  5 |cd|cd|
+
+  To merge these into a single 48-bit merged value 'val' in phase 2)
+  on little-endian we insert stores to higher (consecutive) bitpositions
+  into the most significant bits of the merged value.
+  For the example above that would be:
+  val = 0x1234;
+  val = val | (0x5678 << 16);
+  val = val | (0xab << 32);
+  val = val | (0xcd << 40);
+  To produce a wide_int containing: 0xcdab56781234
+
+  For big-endian we insert stores to higher bitpositions into the least
+  significant bits of the merged value.  So for the above example the
+  operations would be:
+  val = 0x1234;
+  val = 0x5678 | (val << 16);
+  val = 0xab   | (val << 8);
+  val = 0xcd   | (val << 8);
+  to produce a wide_int containing: 0x12345678abcd
+
+  Then, in phase 3), we want to emit this 48-bit value as a 32-bit store
+  followed by a 16-bit store.  Again, we must consider endianness when
+  breaking down the 48-bit value 'val' computed above.
+  For little endian we emit:
+  [p]      (32-bit) := 0x56781234; // val & 0x0000ffffffff;
+  [p + 4B] (16-bit) := 0xcdab;    // (val & 0xffff00000000) >> 32;
+
+  Whereas for big-endian we emit:
+  [p]      (32-bit) := 0x12345678; // (val & 0xffffffff0000) >> 16;
+  [p + 4B] (16-bit) := 0xabcd;     //  val & 0x00000000ffff;
+
+  The same effect could be achieved without changing the insertion order
+  during phase 2) and changing the extraction logic in phase 3) by
+  byte-swapping the values when they are being merged in phase 2) and again
+  after extracting them in phase 3).  But that approach would be harder to
+  extend to deal with potential future improvements including bitfields that
+  are not a multiple of BITS_PER_UNIT or non-constant contiguous stores.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "backend.h"
+#include "tree.h"
+#include "gimple.h"
+#include "builtins.h"
+#include "tree-pass.h"
+#include "ssa.h"
+#include "gimple-pretty-print.h"
+#include "alias.h"
+#include "fold-const.h"
+#include "params.h"
+#include "gimple-iterator.h"
+#include "gimplify.h"
+
+/* The maximum size (in bits) of the stores this pass should generate.  */
+#define MAX_STORE_BITSIZE (BITS_PER_WORD)
+
+namespace {
+
+/* Struct recording the information about a single store of an immediate
+   to memory.  These are created in the first phase and coalesced into
+   merged_store_group objects in the second phase.  */
+
+struct store_immediate_info
+{
+  HOST_WIDE_INT bitsize;
+  HOST_WIDE_INT bitpos;
+  tree val;
+  tree dest;
+  gimple *stmt;
+  store_immediate_info (HOST_WIDE_INT, HOST_WIDE_INT, tree, tree, gimple *);
+};
+
+store_immediate_info::store_immediate_info (HOST_WIDE_INT bs, HOST_WIDE_INT bp,
+					    tree v, tree d, gimple *st)
+  : bitsize (bs), bitpos (bp), val (v), dest (d), stmt (st)
+{
+}
+
+/* Struct representing a group of stores to contiguous memory locations.
+   These are produced by the second phase (coalescing) and consumed in the
+   third phase that outputs the widened stores.  */
+
+struct merged_store_group
+{
+  HOST_WIDE_INT start;
+  HOST_WIDE_INT width;
+  wide_int val;
+  unsigned int align;
+  auto_vec<gimple *> stmts;
+  merged_store_group (store_immediate_info *);
+  void merge_into (store_immediate_info *);
+};
+
+/* Initialize a merged_store_group object from a store_immediate_info
+   object.  Initialize the wide_int recording the value and the vector
+   of statements that will be replaced by this store.  */
+
+merged_store_group::merged_store_group (store_immediate_info *info)
+{
+  start = info->bitpos;
+  width = info->bitsize;
+
+  val = wide_int::from (info->val, MAX_STORE_BITSIZE, UNSIGNED);
+  align = get_object_alignment (info->dest);
+  stmts.create (1);
+  stmts.safe_push (info->stmt);
+}
+
+/* Merge a store recorded by INFO into this widened store.
+   Insert the value stored into the val field in the appropriate
+   position (byte-swapping as appropriate for endianness).
+   Record the original statement that produced the store.
+   See note on endianness at top of file for an explanation of the
+   BYTES_BIG_ENDIAN logic.  */
+
+void
+merged_store_group::merge_into (store_immediate_info *info)
+{
+  HOST_WIDE_INT wid = info->bitsize;
+  wide_int tmp = wide_int::from (info->val, MAX_STORE_BITSIZE, UNSIGNED);
+
+  /* Make sure we're inserting in the position we think we're inserting.  */
+  gcc_assert (info->bitpos == start + width);
+  if (BYTES_BIG_ENDIAN)
+    val = wi::insert (tmp, val, wid, width);
+  else
+    val = wi::insert (val, tmp, width, wid);
+  width += wid;
+  gimple *stmt = info->stmt;
+  stmts.safe_push (stmt);
+}
+
+const pass_data pass_data_tree_store_widening = {
+  GIMPLE_PASS,      /* type */
+  "store-widening", /* name */
+  OPTGROUP_NONE,    /* optinfo_flags */
+  TV_NONE,	  /* tv_id */
+  PROP_ssa,	 /* properties_required */
+  0,		    /* properties_provided */
+  0,		    /* properties_destroyed */
+  0,		    /* todo_flags_start */
+  TODO_update_ssa,  /* todo_flags_finish */
+};
+
+class pass_tree_store_widening : public gimple_opt_pass
+{
+public:
+  pass_tree_store_widening (gcc::context *ctxt)
+    : gimple_opt_pass (pass_data_tree_store_widening, ctxt)
+  {
+  }
+
+  virtual bool
+  gate (function *)
+  {
+    return flag_tree_store_widening && optimize;
+  }
+
+  virtual unsigned int execute (function *);
+
+private:
+  tree m_curr_base_expr;
+  auto_vec<gimple *> m_store_chain;
+  auto_vec<struct store_immediate_info *> m_store_info;
+  auto_vec<gimple *> m_interspersed_mem_accesses;
+  auto_vec<merged_store_group *> m_merged_store_groups;
+
+  bool stmt_terminates_chain_p (gimple *);
+  bool stmt_interferes_with_mem_accesses_p (tree);
+  bool terminate_and_process_chain ();
+  bool coalesce_immediate_stores ();
+  bool output_merged_store (merged_store_group *);
+  bool output_merged_stores ();
+
+}; // class pass_tree_store_widening
+
+/* Sorting function for store_immediate_info objects.
+   Sorts them by bitposition.  */
+
+static int
+sort_by_bitpos (const void *x, const void *y)
+{
+  store_immediate_info *const *tmp = (store_immediate_info * const *) x;
+  store_immediate_info *const *tmp2 = (store_immediate_info * const *) y;
+
+  if ((*tmp)->bitpos < (*tmp2)->bitpos)
+    return -1;
+  else if ((*tmp)->bitpos > (*tmp2)->bitpos)
+    return 1;
+
+  /* Equal bitpositions: a valid qsort comparator must return 0.  */
+  return 0;
+}
+
+/* Go through the candidate stores recorded in m_store_info and merge them
+   into merged_store_group objects recorded into m_merged_store_groups
+   representing the widened stores.  Return true if coalescing was successful
+   and the number of widened stores is fewer than the original number
+   of stores.  */
+
+bool
+pass_tree_store_widening::coalesce_immediate_stores ()
+{
+  /* Anything less can't be processed.  */
+  if (m_store_info.length () < 2)
+    return false;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    fprintf (dump_file, "Attempting to coalesce %u stores in chain.\n",
+	     m_store_info.length ());
+
+  store_immediate_info *info;
+  unsigned int i;
+
+  /* Order the stores by the bitposition they write to.  */
+  m_store_info.qsort (sort_by_bitpos);
+
+  info = m_store_info[0];
+
+  merged_store_group *merged_store = new merged_store_group (info);
+  m_merged_store_groups.safe_push (merged_store);
+
+  FOR_EACH_VEC_ELT (m_store_info, i, info)
+    {
+      if (dump_file)
+	{
+	  fprintf (dump_file, "Store %u:\nbitsize:" HOST_WIDE_INT_PRINT_DEC
+			       " bitpos:" HOST_WIDE_INT_PRINT_DEC
+			       " val:\n", i, info->bitsize, info->bitpos);
+	  print_generic_expr (dump_file, info->val, 0);
+	  fprintf (dump_file, "\n------------\n");
+	}
+
+      /* First element was already handled outside the loop.  */
+      if (i == 0)
+	continue;
+
+      /* |---store 1---|
+	       |---store 2---|
+       Partially overlapping stores are not handled at the moment.  */
+      HOST_WIDE_INT start = info->bitpos;
+      if (IN_RANGE (start, merged_store->start,
+		    merged_store->start + merged_store->width - 1))
+	return false;
+
+      /* |---store 1---| <gap> |---store 2---|.
+	 Gap between stores.  Start a new group.  */
+      if (start != merged_store->start + merged_store->width)
+	{
+	  merged_store = new merged_store_group (info);
+	  m_merged_store_groups.safe_push (merged_store);
+	  continue;
+	}
+
+      /* |---store 1---||---store 2---|
+	 This store is consecutive to the previous one.  */
+
+      unsigned int prev_size = merged_store->width;
+      /* If we will be exceeding the maximum store size start a new group but
+	 record the alignment of the new store appropriately.  */
+      if (prev_size + info->bitsize > MAX_STORE_BITSIZE)
+	{
+	  merged_store_group *new_merged_store = new merged_store_group (info);
+	  new_merged_store->align
+	    = MIN (merged_store->align, merged_store->width);
+	  merged_store = new_merged_store;
+	  m_merged_store_groups.safe_push (merged_store);
+	  continue;
+	}
+      /*  Merge it into the current store group.  */
+      else
+	{
+	  merged_store->merge_into (info);
+	  continue;
+	}
+
+      gcc_unreachable ();
+      return false;
+    }
+
+  gcc_assert (m_merged_store_groups.length () <= m_store_info.length ());
+  bool success = m_merged_store_groups.length () < m_store_info.length ();
+  if (dump_file)
+    {
+      if (success)
+	fprintf (dump_file, "Coalescing successful!\n"
+			     "Merged into %u stores\n",
+		 m_merged_store_groups.length ());
+    }
+
+  return success;
+}
+
+/* Return the type to use for the merged stores described by GROUP.
+   This is needed to get the alias sets right.  */
+
+static tree
+get_type_for_merged_store (merged_store_group *group)
+{
+  gimple *stmt;
+  unsigned int i;
+  tree lhs = gimple_assign_lhs (group->stmts[0]);
+  tree type = reference_alias_ptr_type (lhs);
+  FOR_EACH_VEC_ELT (group->stmts, i, stmt)
+    {
+      if (i == 0)
+	continue;
+
+      lhs = gimple_assign_lhs (stmt);
+      tree type1 = reference_alias_ptr_type (lhs);
+      if (!alias_ptr_types_compatible_p (type, type1))
+	return ptr_type_node;
+    }
+  return type;
+}
+
+/* Given a store group GROUP record in VDEF and VUSE the last
+   vdef in the original statement sequence and also the first vuse.
+   These are to be used in the replacement sequence.  */
+
+static void
+get_vuse_vdef_for_merged_store (merged_store_group *group, tree *vdef,
+				tree *vuse)
+{
+  gimple *stmt;
+  unsigned int i;
+  *vdef = NULL_TREE;
+  *vuse = NULL_TREE;
+  FOR_EACH_VEC_ELT (group->stmts, i, stmt)
+    {
+      if (gimple_vuse (stmt))
+	{
+	  *vuse = gimple_vuse (stmt);
+	  break;
+	}
+    }
+  *vdef = gimple_vdef (group->stmts[group->stmts.length () - 1]);
+  /* What do you mean there's no vdef?  */
+  gcc_assert (*vdef);
+}
+
+/* Return the location_t information we can find among the statements
+   in GROUP.  */
+
+static location_t
+get_merged_store_location (merged_store_group *group)
+{
+  gimple *stmt;
+  unsigned int i;
+
+  FOR_EACH_VEC_ELT (group->stmts, i, stmt)
+    {
+      if (gimple_has_location (stmt))
+	return gimple_location (stmt);
+    }
+  return UNKNOWN_LOCATION;
+}
+
+/* Return true if storing an integer of bitsize SIZE using an unaligned
+   access is prohibitively slow.  */
+
+static bool
+slow_unaligned_size_access (unsigned int size)
+{
+  machine_mode mode = mode_for_size (size, MODE_INT, 0);
+  gcc_assert (mode != BLKmode);
+  return SLOW_UNALIGNED_ACCESS (mode, GET_MODE_ALIGNMENT (mode));
+}
+
+/* Given a merged store group GROUP output the widened version of it.
+   Try store sizes of at most MAX_STORE_BITSIZE bits wide and don't output
+   unaligned stores for STRICT_ALIGNMENT targets or if it's too expensive.
+   Make sure that the number of statements output is less than the number of
+   original statements.  If a better sequence is possible emit it and
+   return true.  See note on endianness at top of file for an explanation of
+   the BYTES_BIG_ENDIAN logic.  */
+
+bool
+pass_tree_store_widening::output_merged_store (merged_store_group *group)
+{
+  HOST_WIDE_INT pos = group->start;
+  HOST_WIDE_INT size = group->width;
+  HOST_WIDE_INT bytepos = pos / BITS_PER_UNIT;
+  unsigned int align = group->align;
+  gcc_assert (IN_RANGE (size, 0, MAX_STORE_BITSIZE));
+
+  unsigned int orig_num_stmts = group->stmts.length ();
+  if (orig_num_stmts < 2)
+    return false;
+
+  /* We don't handle partial bitfields for now.  */
+  if ((size % BITS_PER_UNIT != 0) || (pos % BITS_PER_UNIT != 0))
+    return false;
+
+  bool allow_unaligned
+    = !STRICT_ALIGNMENT && PARAM_VALUE (PARAM_STORE_WIDENING_ALLOW_UNALIGNED);
+  unsigned int try_size = MAX_STORE_BITSIZE;
+  while (try_size > size
+	 || ((!allow_unaligned || slow_unaligned_size_access (try_size))
+	     && try_size > align))
+    {
+      try_size /= 2;
+      if (try_size < BITS_PER_UNIT)
+	{
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    {
+	      fprintf (dump_file, "Failed to output merged store.\n"
+				   "Failed to find starting size meeting"
+				   " alignment requirements.\n");
+	    }
+	  return false;
+	}
+    }
+
+  HOST_WIDE_INT try_pos = bytepos;
+  int wi_offset = BYTES_BIG_ENDIAN ? size - try_size : 0;
+  if (dump_file)
+    {
+      fprintf (dump_file,
+	       "Trying to output merged store at pos " HOST_WIDE_INT_PRINT_DEC
+	       ", of size " HOST_WIDE_INT_PRINT_DEC ", "
+	       "starting size %u and alignment %u\n", pos, size, try_size,
+	       align);
+    }
+
+  gimple_seq seq = NULL;
+  unsigned int num_stmts = 0;
+  tree offset_type = get_type_for_merged_store (group);
+  tree last_vdef, new_vuse;
+  get_vuse_vdef_for_merged_store (group, &last_vdef, &new_vuse);
+  location_t loc = get_merged_store_location (group);
+  gimple *stmt = NULL;
+  while (size > 0)
+    {
+      tree int_type = build_nonstandard_integer_type (try_size, UNSIGNED);
+      SET_TYPE_ALIGN (int_type, align);
+      tree addr = build_fold_addr_expr (m_curr_base_expr);
+      tree dest = fold_build2 (MEM_REF, int_type, addr,
+			       build_int_cst (offset_type, try_pos));
+
+      wide_int wi_val = group->val;
+
+      wi_val
+	= wi::bit_and (wi_val, wi::shifted_mask (wi_offset, try_size, false,
+						 MAX_STORE_BITSIZE));
+
+      wi_val
+	= wi::rshift (wi_val, wide_int::from (wi_offset, MAX_STORE_BITSIZE, UNSIGNED),
+		      UNSIGNED);
+
+      tree src = wide_int_to_tree (int_type, wi_val);
+      stmt = gimple_build_assign (dest, src);
+      gimple_set_vuse (stmt, new_vuse);
+      gimple_seq_add_stmt (&seq, stmt);
+      num_stmts++;
+
+      try_pos += try_size / BITS_PER_UNIT;
+
+      if (!BYTES_BIG_ENDIAN)
+	wi_offset += try_size;
+      size -= try_size;
+      align = try_size;
+      while (size < try_size)
+	try_size /= 2;
+
+      if (BYTES_BIG_ENDIAN)
+	wi_offset -= try_size;
+
+      if (num_stmts >= orig_num_stmts)
+	{
+	  if (dump_file)
+	    {
+	      fprintf (dump_file, "Exceeded original number of stmts (%u)."
+				   "  Not profitable to emit new sequence.\n",
+			orig_num_stmts);
+	    }
+	  return false;
+	}
+
+      tree new_vdef
+	= size ? make_ssa_name (gimple_vop (cfun), stmt) : last_vdef;
+      gimple_set_vdef (stmt, new_vdef);
+      SSA_NAME_DEF_STMT (new_vdef) = stmt;
+      new_vuse = new_vdef;
+    }
+  gcc_assert (size == 0);
+  gcc_assert (seq);
+  annotate_all_with_location (seq, loc);
+  gimple_stmt_iterator gsi = gsi_for_stmt (group->stmts[0]);
+  if (dump_file)
+    {
+      fprintf (dump_file,
+	       "New sequence of %u stmts to replace old one of %u stmts:\n",
+	       num_stmts, orig_num_stmts);
+      print_gimple_seq (dump_file, seq, 0, TDF_VOPS | TDF_MEMSYMS);
+    }
+  gsi_insert_seq_before (&gsi, seq, GSI_SAME_STMT);
+
+  return true;
+}
+
+/* Process the merged_store_group objects created in the coalescing phase.
+   Try to output the widened stores and delete the original statements if
+   successful.  Return true iff any changes were made.  */
+
+bool
+pass_tree_store_widening::output_merged_stores ()
+{
+  unsigned int i;
+  merged_store_group *merged_store;
+  bool ret = false;
+  FOR_EACH_VEC_ELT (m_merged_store_groups, i, merged_store)
+    {
+      bool successful_p = output_merged_store (merged_store);
+      if (successful_p)
+	{
+	  gimple *stmt;
+	  unsigned int j;
+	  FOR_EACH_VEC_ELT (merged_store->stmts, j, stmt)
+	    {
+	      gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+	      gsi_remove (&gsi, true);
+	    }
+	}
+      ret |= successful_p;
+    }
+  if (ret && dump_file)
+    fprintf (dump_file, "Widening successful!\n");
+
+  return ret;
+}
+
+/* Coalesce the store_immediate_info objects recorded in the first phase
+   and output them.  Clear and reinitialize the structures to record the
+   next store chain.  Return true if any changes were made.  */
+
+bool
+pass_tree_store_widening::terminate_and_process_chain (void)
+{
+  /* Process store chain.  */
+  bool ret = false;
+  if (m_store_info.length () > 1)
+    {
+      ret = coalesce_immediate_stores ();
+      if (ret)
+	ret = output_merged_stores ();
+    }
+
+  /* Reinitialize structures.  */
+  store_immediate_info *info;
+  unsigned int i;
+  FOR_EACH_VEC_ELT (m_store_info, i, info)
+    delete info;
+
+  merged_store_group *merged_info;
+  FOR_EACH_VEC_ELT (m_merged_store_groups, i, merged_info)
+    delete merged_info;
+
+  m_merged_store_groups.truncate (0);
+  m_store_info.truncate (0);
+  m_store_chain.truncate (0);
+  m_interspersed_mem_accesses.truncate (0);
+  m_curr_base_expr = NULL_TREE;
+  return ret;
+}
+
+/* Return true iff upon encountering the statement STMT we should terminate
+   the current store chain.  Use aliasing info to determine that.  */
+
+bool
+pass_tree_store_widening::stmt_terminates_chain_p (gimple *stmt)
+{
+  if (!m_curr_base_expr)
+    return false;
+
+  if (!gimple_vuse (stmt) && !gimple_vdef (stmt))
+    return false;
+
+  if (gimple_has_volatile_ops (stmt)
+      || stmt_may_clobber_ref_p (stmt, m_curr_base_expr)
+      || ref_maybe_used_by_stmt_p (stmt, m_curr_base_expr))
+    return true;
+
+  return false;
+}
+
+/* Return true iff storing to LHS may interfere with any store recorded
+   in the m_interspersed_mem_accesses vector.  */
+
+bool
+pass_tree_store_widening::stmt_interferes_with_mem_accesses_p (tree lhs)
+{
+  gimple *tmp_stmt;
+  unsigned int i;
+  FOR_EACH_VEC_ELT (m_interspersed_mem_accesses, i, tmp_stmt)
+    {
+      if (stmt_may_clobber_ref_p (tmp_stmt, lhs)
+	  || ref_maybe_used_by_stmt_p (tmp_stmt, lhs))
+	return true;
+    }
+  return false;
+}
+
+/* Return true iff LHS is a destination potentially interesting for
+   store widening.  */
+
+static bool
+lhs_code_valid_for_store_widening_p (tree lhs)
+{
+  tree_code code = TREE_CODE (lhs);
+  if (code == ARRAY_REF || code == ARRAY_RANGE_REF || code == MEM_REF
+      || code == COMPONENT_REF || code == BIT_FIELD_REF)
+    return true;
+
+  return false;
+}
+
+/* Entry point for the pass.  Go over each basic block recording chains of
+  immediate stores.  Upon encountering a terminating statement (as defined
+  by stmt_terminates_chain_p) process the recorded stores and emit the widened
+  variants.  */
+
+unsigned int
+pass_tree_store_widening::execute (function *fun)
+{
+  basic_block bb;
+  m_curr_base_expr = NULL_TREE;
+  hash_set<gimple *> orig_stmts;
+
+  bool global_changes_made = false;
+  FOR_EACH_BB_FN (bb, fun)
+    {
+      gimple_stmt_iterator gsi;
+      HOST_WIDE_INT num_statements = 0;
+      /* Record the original statements so that we can keep track of
+	 statements emitted in this pass and not re-process new
+	 statements.  */
+      for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	{
+	  gimple *stmt = gsi_stmt (gsi);
+	  if (!is_gimple_debug (stmt))
+	    orig_stmts.add (stmt);
+	  num_statements++;
+	}
+      if (num_statements < 2)
+	continue;
+      bool changes_made;
+      do
+	{
+	  changes_made = false;
+	  for (gsi = gsi_after_labels (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	    {
+	      gimple *stmt = gsi_stmt (gsi);
+
+	      if (!orig_stmts.contains (stmt) || is_gimple_debug (stmt))
+		continue;
+	      else if (is_gimple_assign (stmt)
+		       && lhs_code_valid_for_store_widening_p (
+			    gimple_assign_lhs (stmt)))
+		{
+		  tree lhs = gimple_assign_lhs (stmt);
+
+		  HOST_WIDE_INT bitsize, bitpos;
+		  machine_mode mode;
+		  int unsignedp = 0, reversep = 0, volatilep = 0;
+		  tree offset, base_addr;
+		  base_addr
+		    = get_inner_reference (lhs, &bitsize, &bitpos, &offset,
+					   &mode, &unsignedp, &reversep,
+					   &volatilep);
+		  bool same_base_p
+		    = m_curr_base_expr
+		      && operand_equal_p (m_curr_base_expr, base_addr, 0);
+
+		  if (volatilep)
+		    {
+		      if (dump_file)
+			fprintf (dump_file, "Volatile access terminates "
+					    "chain.\n");
+		      changes_made |= terminate_and_process_chain ();
+		    }
+		  else if (gimple_assign_rhs_code (stmt) == INTEGER_CST
+			   && (!m_curr_base_expr || same_base_p) && !offset
+			   && !reversep && bitsize <= MAX_STORE_BITSIZE
+			   /* Don't handle bitfields that are not a multiple
+			      of BITS_PER_UNIT for now.  Can be extended
+			      later.  */
+			   && (bitsize % BITS_PER_UNIT == 0)
+			   && !stmt_interferes_with_mem_accesses_p (lhs))
+		    {
+		      /* Valid store to continue widening or, in the
+			 !m_curr_base_expr case, to start a chain.  */
+		      gcc_assert (m_curr_base_expr
+				  || m_store_chain.is_empty ());
+		      m_curr_base_expr = base_addr;
+		      m_store_chain.safe_push (stmt);
+		      tree int_val = gimple_assign_rhs1 (stmt);
+		      store_immediate_info *info
+			= new store_immediate_info (bitsize, bitpos, int_val,
+						    lhs, stmt);
+
+		      if (dump_file)
+			{
+			  fprintf (dump_file,
+				   "Recording immediate store from stmt:\n");
+			  print_gimple_stmt (dump_file, stmt, 0, 0);
+			}
+		      m_store_info.safe_push (info);
+		    }
+		  else if (same_base_p)
+		    {
+		      if (dump_file)
+			fprintf (dump_file,
+				 "Non-constant store to the current base.  "
+				 "Recording for alias checking purposes.\n");
+
+		      if (!m_store_info.is_empty ())
+			m_interspersed_mem_accesses.safe_push (stmt);
+		    }
+		  /* Store to a different base.  If it conflicts with
+		     m_curr_base_expr stop the chain, otherwise ignore.  */
+		  else if (stmt_terminates_chain_p (stmt))
+		    {
+		      if (dump_file)
+			fprintf (dump_file,
+			   "Store to a different base terminates chain.\n");
+
+		      changes_made |= terminate_and_process_chain ();
+		    }
+		  else
+		    {
+		      /* stmt doesn't affect anything.  */
+		      continue;
+		    }
+		}
+	      else if (stmt_terminates_chain_p (stmt))
+		{
+		  /* Non-assignment that terminates.  */
+		  if (dump_file)
+		    fprintf (dump_file, "Statement terminates chain.\n");
+
+		  changes_made |= terminate_and_process_chain ();
+		}
+	      else if (gimple_vdef (stmt) || gimple_vuse (stmt))
+		{
+		  if (dump_file)
+		    {
+		      fprintf (dump_file,
+			       "Statement reads/writes memory.  "
+			       "Recording for alias checking purposes.\n");
+		    }
+		  if (!m_store_info.is_empty ())
+		    m_interspersed_mem_accesses.safe_push (stmt);
+		}
+	      /* Non-assignment statement that doesn't impact current chain.  */
+	    }
+	  /* End of basic block.  Terminate chain.  */
+	  if (dump_file && !m_store_info.is_empty ())
+	    fprintf (dump_file,
+		     "End of basic block.  Terminating recorded chain.\n");
+	  changes_made |= terminate_and_process_chain ();
+	  global_changes_made |= changes_made;
+	  num_statements--;
+	  /* Make sure the number of times we went around the block is
+	     not more than the number of statements in the block.  Otherwise
+	     something has gone seriously wrong and we're probably in an
+	     infinite loop.  */
+	  gcc_assert (num_statements >= 0);
+	  if (changes_made && dump_file)
+	    fprintf (dump_file, "Changes made this iteration.  Going through "
+				"basic block again.\n");
+	}
+      while (changes_made);
+    }
+  return global_changes_made ? TODO_cleanup_cfg : 0;
+}
+
+} // anon namespace
+
+/* Construct and return a store widening pass object.  */
+
+gimple_opt_pass *
+make_pass_tree_store_widening (gcc::context *ctxt)
+{
+  return new pass_tree_store_widening (ctxt);
+}
