Bug 22141 - [5/6/7 Regression] Missing optimization when storing structures
Summary: [5/6/7 Regression] Missing optimization when storing structures
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 4.0.1
: P2 minor
Target Milestone: 7.0
Assignee: Not yet assigned to anyone
URL: http://gcc.gnu.org/ml/gcc-patches/200...
Keywords: missed-optimization
: 37135 42594 53417 69616 70784 (view as bug list)
Depends on:
Blocks: 16996 69616
  Show dependency treegraph
 
Reported: 2005-06-22 10:51 UTC by Frank Mehnert
Modified: 2018-07-04 04:22 UTC (History)
23 users (show)

See Also:
Host:
Target:
Build:
Known to work: 7.0
Known to fail: 4.0.4
Last reconfirmed: 2016-01-05 00:00:00


Attachments
gcc8-pr22141-improvements.patch (9.69 KB, patch)
2017-10-27 11:29 UTC, Jakub Jelinek
Details | Diff
gcc8-pr22141-improvements.patch (9.98 KB, patch)
2017-10-27 13:09 UTC, Jakub Jelinek
Details | Diff
gcc8-pr22141-improvements.patch (10.14 KB, patch)
2017-10-27 16:31 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Frank Mehnert 2005-06-22 10:51:51 UTC
In the attached example, the store of type_t is not optimized by gcc. Instead 
of storing one 32-bit value, four 8-bit values are stored. 
 
gcc-4.0.1 produces the following code: 
 
 /home/fm3/foo.c:16 
 80483b7:       c6 05 13 96 04 08 04    movb   $0x4,0x8049613 
 80483be:       c6 05 12 96 04 08 03    movb   $0x3,0x8049612 
 80483c5:       c6 05 11 96 04 08 02    movb   $0x2,0x8049611 
 80483cc:       c6 05 10 96 04 08 01    movb   $0x1,0x8049610 
 
gcc-3.4.5 produces the following code: 
 
 /home/fm3/foo.c:16 
 8048391:       ba 01 02 03 04          mov    $0x4030201,%edx 
 /home/fm3/foo.c:15 
 8048396:       89 e5                   mov    %esp,%ebp 
 8048398:       83 ec 18                sub    $0x18,%esp 
 804839b:       83 e4 f0                and    $0xfffffff0,%esp 
 804839e:       83 ec 10                sub    $0x10,%esp 
 /home/fm3/foo.c:16 
 80483a1:       89 15 1c 96 04 08       mov    %edx,0x804961c 
 
  (better) 
 
gcc-2.95.4 produces the following code: 
 
 /home/fm3/foo.c:16 
 80483f7:       c7 05 84 96 04 08 01    movl   $0x4030201,0x8049684 
 80483fe:       02 03 04 
 
 (best because no register allocation) 
 
The example: 
 
#include <stdio.h> 
 
typedef struct 
{ 
  unsigned char a; 
  unsigned char b; 
  unsigned char c; 
  unsigned char d; 
} type_t; 
 
static type_t u; 
 
int 
main(void) 
{ 
  u = (type_t){ 1, 2, 3, 4}; 
 
  printf("%d %d %d %d\n", u.a, u.b, u.c, u.d); 
 
  return 0; 
} 
 
gcc-4.0 -v -save-temps -g -o foo -W -Wall -Wformat=2 -O3 -Wall -W 
-fomit-frame-pointer -march=pentium3 -g foo.c: 
 
Using built-in specs. 
Target: i486-linux 
Configured with: ../src/configure -v 
--enable-languages=c,c++,java,f95,objc,ada,treelang --prefix=/usr 
--enable-shared --with-system-zlib --libexecdir=/usr/lib --enable-nls 
--without-included-gettext --enable-threads=posix --program-suffix=-4.0 
--enable-__cxa_atexit --enable-libstdcxx-allocator=mt --enable-clocale=gnu 
--enable-libstdcxx-debug --enable-java-gc=boehm --enable-java-awt=gtk 
--enable-mpfr --disable-werror --enable-checking=release i486-linux 
Thread model: posix 
gcc version 4.0.1 20050522 (prerelease) (Debian 4.0.0-9) 
 /usr/lib/gcc/i486-linux/4.0.1/cc1 -E -quiet -v foo.c -march=pentium3 -W -Wall 
-Wformat=2 -Wall -W -fomit-frame-pointer -fworking-directory -O3 
-fpch-preprocess -o foo.i 
ignoring nonexistent directory 
"/usr/lib/gcc/i486-linux/4.0.1/../../../../i486-linux/include" 
#include "..." search starts here: 
#include <...> search starts here: 
 /usr/local/include 
 /usr/lib/gcc/i486-linux/4.0.1/include 
 /usr/include 
End of search list. 
 /usr/lib/gcc/i486-linux/4.0.1/cc1 -fpreprocessed foo.i -quiet -dumpbase foo.c 
-march=pentium3 -auxbase foo -g -g -O3 -W -Wall -Wformat=2 -Wall -W -version 
-fomit-frame-pointer -o foo.s 
GNU C version 4.0.1 20050522 (prerelease) (Debian 4.0.0-9) (i486-linux) 
        compiled by GNU C version 4.0.1 20050522 (prerelease) (Debian 
4.0.0-9). 
GGC heuristics: --param ggc-min-expand=64 --param ggc-min-heapsize=64260 
 as -V -Qy --32 -o foo.o foo.s 
GNU assembler version 2.15 (i486-linux-gnu) using BFD version 2.15 
 /usr/lib/gcc/i486-linux/4.0.1/collect2 --eh-frame-hdr -m elf_i386 
-dynamic-linker /lib/ld-linux.so.2 -o 
foo /usr/lib/gcc/i486-linux/4.0.1/../../../../lib/crt1.o /usr/lib/gcc/i486-linux/4.0.1/../../../../lib/crti.o /usr/lib/gcc/i486-linux/4.0.1/crtbegin.o 
-L/usr/lib/gcc/i486-linux/4.0.1 -L/usr/lib/gcc/i486-linux/4.0.1 
-L/usr/lib/gcc/i486-linux/4.0.1/../../../../lib 
-L/usr/lib/gcc/i486-linux/4.0.1/../../.. -L/lib/../lib -L/usr/lib/../lib foo.o 
-lgcc --as-needed -lgcc_s --no-as-needed -lc -lgcc --as-needed -lgcc_s 
--no-as-needed /usr/lib/gcc/i486-linux/4.0.1/crtend.o /usr/lib/gcc/i486-linux/4.0.1/../../../../lib/crtn.o
Comment 1 Andrew Pinski 2005-06-22 13:07:04 UTC
Reduced testcase for the regression:
typedef struct  type
{
  char a;
  char b;
  char c;
  char d;
} type_t;
type_t u;
void
f(void)
{
  u = (type_t){ 1, 2, 3, 4};
}

3.3 produced:
        lis r0,0x102
        lis r2,ha16(_u)
        ori r0,r0,772
        stw r0,lo16(_u)(r2)
        blr

4.0.0 produces:
        lis r11,ha16(_u)
        li r0,2
        la r9,lo16(_u)(r11)
        li r2,1
        stb r0,1(r9)
        li r0,4
        stb r2,lo16(_u)(r11)
        li r2,3
        stb r0,3(r9)
        stb r2,2(r9)
        blr

Which is much slower and larger.

Also this is a code size regression even at -Os.
Comment 2 Mark Mitchell 2005-10-31 03:54:03 UTC
Leaving as P2.

Do we know what's different?  The structure type is byte-aligned.  How did 2.95 justify using a 4-byte store?
Comment 3 Andrew Pinski 2005-10-31 03:59:03 UTC
(In reply to comment #2)
> Leaving as P2.
> Do we know what's different?
Yes in 4.0 and above there is no CONSTRUCTOR so we don't see the full CONSTRUCTOR in expand so it could expand to just one integer store.

>The structure type is byte-aligned.  How did 2.95 justify using a 4-byte store?
There is no strict alignment requirements for either PPC or x86 which is why GCC did it.
Comment 4 Mark Mitchell 2006-02-24 00:25:57 UTC
This issue will not be resolved in GCC 4.1.0; retargeted at GCC 4.1.1.
Comment 5 Mark Mitchell 2006-05-25 02:33:02 UTC
Will not be fixed in 4.1.1; adjust target milestone to 4.1.2.
Comment 6 Steven Bosscher 2007-12-16 23:17:21 UTC
Open regression with no activity since February 14.  Ping?
Comment 7 Richard Biener 2008-01-17 15:47:38 UTC
Still broken (IMHO rightfully P2).

With 3.3 we get (x86_64):

f:
.LFB3:
        movl    $67305985, u(%rip)
        ret

while on trunk we have:

f:
.LFB2:
        movb    $1, u(%rip)
        movb    $2, u+1(%rip)
        movb    $3, u+2(%rip)
        movb    $4, u+3(%rip)
        ret

regardless of optimization flag (same for -Os and -O).  Gimplification now
produces

f ()
{
  u.a = 1;
  u.b = 2;
  u.c = 3;
  u.d = 4;
}

which makes the task of combining the initializer even harder.
Comment 8 Dinar Temirbulatov 2008-01-17 18:34:23 UTC
This regression happens after the SSA was merged in to the mainline
Comment 9 pinskia@gmail.com 2008-01-18 11:56:38 UTC
Subject: Re:  [4.0/4.1/4.2/4.3 Regression] Missing optimization when storing structures

> ------- Comment #8 from dtemirbulatov at gmail dot com  2008-01-17 18:34 -------
> This regression happens after the SSA was merged in to the mainline

Yes because of the gimplifier ...

-- Pinski
Comment 10 Steven Bosscher 2008-01-18 12:04:13 UTC
Re. comment 7:

What does the initial RTL look like with GCC 3.3 and with a post tree-ssa compiler?
Comment 11 Richard Biener 2008-01-18 12:08:23 UTC
For 3.3-hammer I get (x86_64):

;; Function f

(note 2 0 3 NOTE_INSN_DELETED)

(note 3 2 4 NOTE_INSN_FUNCTION_BEG) 

(note 4 3 5 NOTE_INSN_DELETED)

(note 5 4 6 NOTE_INSN_DELETED)

(note 6 5 8 NOTE_INSN_DELETED)
    
(insn 8 6 9 (nil) (set (reg/v:SI 58 [ <anonymous> ])
        (const_int 0 [0x0])) -1 (nil)
    (nil))
    
(insn 9 8 10 (nil) (set (strict_low_part (subreg:QI (reg/v:SI 58 [ <anonymous> ]) 0))
        (const_int 1 [0x1])) -1 (nil)
    (nil))

(insn 10 9 11 (nil) (set (reg:DI 59)
        (const_int 2 [0x2])) -1 (nil)
    (expr_list:REG_EQUAL (const_int 2 [0x2])
        (nil)))

(insn 11 10 13 (nil) (set (zero_extract:DI (subreg:DI (reg/v:SI 58 [ <anonymous> ]) 0)
            (const_int 8 [0x8])
            (const_int 8 [0x8]))
        (reg:DI 59)) -1 (nil)
    (nil))

(insn 13 11 14 (nil) (parallel [
            (set (reg/v:SI 58 [ <anonymous> ])
                (and:SI (reg/v:SI 58 [ <anonymous> ])
                    (const_int -16711681 [0xffffffffff00ffff])))
            (clobber (reg:CC 17 flags))
        ]) -1 (nil)
    (nil))

(insn 14 13 16 (nil) (parallel [
            (set (reg/v:SI 58 [ <anonymous> ])
                (ior:SI (reg/v:SI 58 [ <anonymous> ])
                    (const_int 196608 [0x30000])))
            (clobber (reg:CC 17 flags))
        ]) -1 (nil)
    (nil))
 
(insn 16 14 17 (nil) (parallel [
            (set (reg/v:SI 58 [ <anonymous> ])
                (and:SI (reg/v:SI 58 [ <anonymous> ])
                    (const_int 16777215 [0xffffff])))
            (clobber (reg:CC 17 flags))
        ]) -1 (nil)
    (nil))

(insn 17 16 18 (nil) (parallel [
            (set (reg/v:SI 58 [ <anonymous> ])
                (ior:SI (reg/v:SI 58 [ <anonymous> ])
                    (const_int 67108864 [0x4000000])))
            (clobber (reg:CC 17 flags))
        ]) -1 (nil)
    (nil))

(insn 18 17 19 (nil) (set (mem/s:SI (symbol_ref:DI ("u")) [2 u+0 S4 A8])
        (reg/v:SI 58 [ <anonymous> ])) -1 (nil)
    (nil))

(note 19 18 21 NOTE_INSN_FUNCTION_END)

(code_label 21 19 0 1 "" [0 uses])


while on the trunk I see:


;; Function f (f)


;;
;; Full RTL generated for this function:
;;
(note 1 0 3 NOTE_INSN_DELETED)

;; Start of basic block ( 0) -> 2
;; Pred edge  ENTRY [100.0%]  (fallthru)
(note 3 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)

(note 2 3 4 2 NOTE_INSN_FUNCTION_BEG)
;; End of basic block 2 -> ( 3) 

;; Succ edge  3 [100.0%]  (fallthru)

;; Start of basic block ( 2) -> 3
;; Pred edge  2 [100.0%]  (fallthru)
(note 4 2 5 3 [bb 3] NOTE_INSN_BASIC_BLOCK)

(insn 5 4 6 3 t.c:12 (set (reg/f:DI 58)
        (symbol_ref:DI ("u") <var_decl 0x2b813a2b8460 u>)) -1 (nil))

(insn 6 5 7 3 t.c:12 (set (mem/s/c:QI (reg/f:DI 58) [0 u.a+0 S1 A8])
        (const_int 1 [0x1])) -1 (nil))

(insn 7 6 8 3 t.c:12 (set (reg/f:DI 59)
        (const:DI (plus:DI (symbol_ref:DI ("u") <var_decl 0x2b813a2b8460 u>)
                (const_int 1 [0x1])))) -1 (nil))

(insn 8 7 9 3 t.c:12 (set (mem/s/c:QI (reg/f:DI 59) [0 u.b+0 S1 A8])
        (const_int 2 [0x2])) -1 (nil))

(insn 9 8 10 3 t.c:12 (set (reg/f:DI 60)
        (const:DI (plus:DI (symbol_ref:DI ("u") <var_decl 0x2b813a2b8460 u>)
                (const_int 2 [0x2])))) -1 (nil))

(insn 10 9 11 3 t.c:12 (set (mem/s/c:QI (reg/f:DI 60) [0 u.c+0 S1 A8])
        (const_int 3 [0x3])) -1 (nil))

(insn 11 10 12 3 t.c:12 (set (reg/f:DI 61)
        (const:DI (plus:DI (symbol_ref:DI ("u") <var_decl 0x2b813a2b8460 u>)
                (const_int 3 [0x3])))) -1 (nil))

(insn 12 11 17 3 t.c:12 (set (mem/s/c:QI (reg/f:DI 61) [0 u.d+0 S1 A8])
        (const_int 4 [0x4])) -1 (nil))
;; End of basic block 3 -> ( 4)

;; Succ edge  4 [100.0%]  (fallthru)

;; Start of basic block ( 3) -> 4
;; Pred edge  3 [100.0%]  (fallthru)
(note 17 12 14 4 [bb 4] NOTE_INSN_BASIC_BLOCK)

(jump_insn 14 17 15 4 t.c:13 (set (pc)
        (label_ref 16)) -1 (nil))
;; End of basic block 4 -> ( 6)

;; Succ edge  6 [100.0%] 

(barrier 15 14 13)

;; Start of basic block () -> 5
(code_label 13 15 18 5 1 "" [0 uses])

(note 18 13 16 5 [bb 5] NOTE_INSN_BASIC_BLOCK)
;; End of basic block 5 -> ( 6) 

;; Succ edge  6 [100.0%]  (fallthru)

;; Start of basic block ( 4 5) -> 6
;; Pred edge  4 [100.0%]
;; Pred edge  5 [100.0%]  (fallthru)
(code_label 16 18 19 6 2 "" [1 uses])

(note 19 16 0 6 [bb 6] NOTE_INSN_BASIC_BLOCK)
;; End of basic block 6 -> ( 1)

;; Succ edge  EXIT [100.0%]  (fallthru)
Comment 12 Steven Bosscher 2008-01-18 12:35:50 UTC
So 3.3 expanded the initializer into sets of the individual components, but with ANDs and ORs and a single MEM store, instead of MEM stores to the individual components.

It seems to me that this is not something you'd want to fix with a tree optimizer.  Maybe some local (i.e. per basic block) RTL pass:
1. Collect all stores
2. Find related stores
3. Combine related stores where possible and cheaper

I don't think this is not fixable for GCC 4.3.
Comment 13 Richard Biener 2008-01-18 12:45:32 UTC
Right.  You also need to watch for TBAA problems in the RTL you create.
Comment 14 Joseph S. Myers 2008-07-04 16:54:33 UTC
Closing 4.1 branch.
Comment 15 Jakub Jelinek 2008-11-19 21:12:27 UTC
The advantage of such a RTL pass (or just adding such optimization to another RTL pass) would be that it would handle also say:
struct S
{
  char a;
  char b;
  char c;
  char d;
} u __attribute__((aligned));

void
f1 (void)
{
  u = (struct S) { 1, 2, 3, 4 };
}

void
f2 (void)
{
  u.a = 1;
  u.b = 2;
  u.c = 3;
  u.d = 4;
}

void
f3 (void)
{
  u.d = 4;
  u.b = 2;
  u.a = 1;
  u.c = 3;
}

where only f1 used to be optimized into a single store by 3.2/3.3.  The question is, do we want to do this only if constants are stored into neighbouring < wordsize MEMs?  Do we need to use cselib to canonicalize the addresses (similarly how DSE does it), or could we just rely on MEM_EXPR/MEM_OFFSET/MEM_SIZE?  If bitfields are involved, combiner often helps with merging adjacent bitfield stores into a single store, so this pass would be desirable after the combiner (and probably before register allocation).
Comment 16 Paolo Bonzini 2009-02-03 14:30:09 UTC
*** Bug 37135 has been marked as a duplicate of this bug. ***
Comment 17 etienne_lorrain 2009-02-03 16:38:26 UTC
(In reply to comment #15)
> The advantage of such a RTL pass (or just adding such optimization to another
> RTL pass) would be that it would handle also say: ...

Why only limit that pass to constants, and only to writes:

etienne@cygne:~$ gcc --version
gcc (Debian 4.3.3-3) 4.3.3
Copyright (C) 2008 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

etienne@cygne:~$ cat tmp.c
struct S
{
  char a;
  char b;
  char c;
  char d;
} u, v;

void fct (void) {
  u.a = v.a;
  u.b = v.b;
  u.c = v.c;
  u.d = v.d;
}

etienne@cygne:~$ gcc -Os -S tmp.c -o tmp.s
etienne@cygne:~$ cat tmp.s
        .file   "tmp.c"
        .text
.globl fct
        .type   fct, @function
fct:
        movb    v, %al
        pushl   %ebp
        movl    %esp, %ebp
        popl    %ebp
        movb    %al, u
        movb    v+1, %al
        movb    %al, u+1
        movb    v+2, %al
        movb    %al, u+2
        movb    v+3, %al
        movb    %al, u+3
        ret
        .size   fct, .-fct
        .comm   u,4,4
        .comm   v,4,4
        .ident  "GCC: (Debian 4.3.3-3) 4.3.3"
        .section        .note.GNU-stack,"",@progbits
etienne@cygne:~$

 A single 32 bits read, and a single 32 bits write should be sufficient,
when none of the fields of the structure is declared volatile.
Comment 18 Joseph S. Myers 2009-03-31 18:49:24 UTC
Closing 4.2 branch.
Comment 19 Richard Biener 2009-08-04 12:26:25 UTC
GCC 4.3.4 is being released, adjusting target milestone.
Comment 20 Richard Biener 2010-01-03 13:30:26 UTC
*** Bug 42594 has been marked as a duplicate of this bug. ***
Comment 21 Richard Biener 2010-05-22 18:10:30 UTC
GCC 4.3.5 is being released, adjusting target milestone.
Comment 22 Richard Biener 2011-06-27 12:14:38 UTC
4.3 branch is being closed, moving to 4.4.7 target.
Comment 23 Richard Biener 2012-01-12 11:20:35 UTC
As we have MEM_REF available we can in theory do the combining on the tree-level
as well (or during gimplification).  In fact, as we have the byteswap
detection pass which does not yet handle byte-swapping memory (thus, a
memory destination) it looks like a perfect place to handle this while
also fixing that deficiency (gather sources for a contiguous piecewise
stored memory region of suitable register size/alignment).
Comment 24 Jakub Jelinek 2012-03-13 12:48:09 UTC
4.4 branch is being closed, moving to 4.5.4 target.
Comment 25 Andrew Pinski 2012-05-19 23:31:40 UTC
*** Bug 53417 has been marked as a duplicate of this bug. ***
Comment 26 Richard Biener 2012-07-02 11:59:02 UTC
The 4.5 branch is being closed, adjusting target milestone.
Comment 27 Jakub Jelinek 2013-04-12 15:17:04 UTC
GCC 4.6.4 has been released and the branch has been closed.
Comment 28 Richard Biener 2014-06-12 13:47:00 UTC
The 4.7 branch is being closed, moving target milestone to 4.8.4.
Comment 29 Jakub Jelinek 2014-12-19 13:30:04 UTC
GCC 4.8.4 has been released.
Comment 30 hariharan.gcc 2015-02-02 19:30:41 UTC
I saw a related problem, this time with bitfields

$more bitfieldtest.c

typedef union {
  struct {
    unsigned int b1:1;
    unsigned int b2:1;
    unsigned int b3:1;
    unsigned int b4:1;
    unsigned int b5:1;
  }fields;
  unsigned int word;
} _t_bitfields;


void _const_populate_bits(_t_bitfields * data)
{
  data->fields.b1 = 1;
  data->fields.b2 = 0;
  data->fields.b3 = 1;
  data->fields.b4 = 1;
  data->fields.b5 = 0;
}

At the end of tree stages, it looks like this

$more bitfieldtest.c.165t.optimized

;; Function _const_populate_bits (_const_populate_bits, funcdef_no=0, decl_uid=1339, cgraph_uid=0)

_const_populate_bits (union _t_bitfields * data)
{
  <bb 2>:
  data_2(D)->fields.b1 = 1;
  data_2(D)->fields.b2 = 0;
  data_2(D)->fields.b3 = 1;
  data_2(D)->fields.b4 = 1;
  data_2(D)->fields.b5 = 0;
  return;

}

Expand expands each one of the assignments in turn and some get combined later on into ok-ish code. It would be nice to be able to combine all 5 assignments into one.

Its kind of related to this PR, but is it sufficiently different to warrant a separate PR for it?
Comment 31 Richard Biener 2015-06-23 08:19:12 UTC
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
Comment 32 Jakub Jelinek 2015-06-26 19:55:15 UTC
GCC 4.9.3 has been released.
Comment 33 Jakub Jelinek 2016-02-16 13:34:06 UTC
This should be handled in some late GIMPLE pass, shortly before expansion, but with some cleanup passes in between that and expansion.  Both bitfields and adjacent fields handled the same or similar way.
Comment 34 Richard Biener 2016-04-25 10:44:48 UTC
*** Bug 70784 has been marked as a duplicate of this bug. ***
Comment 35 Richard Biener 2016-08-03 10:38:07 UTC
GCC 4.9 branch is being closed
Comment 36 Andrew Pinski 2016-09-03 18:13:08 UTC
*** Bug 77456 has been marked as a duplicate of this bug. ***
Comment 37 ktkachov 2016-10-28 14:19:22 UTC
Author: ktkachov
Date: Fri Oct 28 14:18:50 2016
New Revision: 241649

URL: https://gcc.gnu.org/viewcvs?rev=241649&root=gcc&view=rev
Log:
GIMPLE store merging pass

2016-10-28  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>

	PR middle-end/22141
	* Makefile.in (OBJS): Add gimple-ssa-store-merging.o.
	* common.opt (fstore-merging): New Optimization option.
	* opts.c (default_options_table): Add entry for
	OPT_ftree_store_merging.
	* fold-const.h (can_native_encode_type_p): Declare prototype.
	* fold-const.c (can_native_encode_type_p): Define.
	* params.def (PARAM_STORE_MERGING_ALLOW_UNALIGNED): Define.
	(PARAM_MAX_STORES_TO_MERGE): Likewise.
	* timevar.def (TV_GIMPLE_STORE_MERGING): New timevar.
	* passes.def: Insert pass_tree_store_merging.
	* tree-pass.h (make_pass_store_merging): Declare extern
	prototype.
	* gimple-ssa-store-merging.c: New file.
	* doc/invoke.texi (Optimization Options): Document
	-fstore-merging.
	(--param documentation): Document store-merging-allow-unaligned
	and max-stores-to-merge.

2016-10-28  Kyrylo Tkachov  <kyrylo.tkachov@arm.com>
            Jakub Jelinek  <jakub@redhat.com>
            Andrew Pinski  <pinskia@gmail.com>

	PR middle-end/22141
	PR rtl-optimization/23684
	* gcc.c-torture/execute/pr22141-1.c: New test.
	* gcc.c-torture/execute/pr22141-2.c: Likewise.
	* gcc.target/aarch64/ldp_stp_1.c: Adjust for -fstore-merging.
	* gcc.target/aarch64/ldp_stp_4.c: Likewise.
	* gcc.dg/store_merging_1.c: New test.
	* gcc.dg/store_merging_2.c: Likewise.
	* gcc.dg/store_merging_3.c: Likewise.
	* gcc.dg/store_merging_4.c: Likewise.
	* gcc.dg/store_merging_5.c: Likewise.
	* gcc.dg/store_merging_6.c: Likewise.
	* gcc.dg/store_merging_7.c: Likewise.
	* gcc.target/i386/pr22141.c: Likewise.
	* gcc.target/i386/pr34012.c: Add -fno-store-merging to dg-options.
	* g++.dg/init/new17.C: Likewise.



Added:
    trunk/gcc/gimple-ssa-store-merging.c
    trunk/gcc/testsuite/gcc.c-torture/execute/pr22141-1.c
    trunk/gcc/testsuite/gcc.c-torture/execute/pr22141-2.c
    trunk/gcc/testsuite/gcc.dg/store_merging_1.c
    trunk/gcc/testsuite/gcc.dg/store_merging_2.c
    trunk/gcc/testsuite/gcc.dg/store_merging_3.c
    trunk/gcc/testsuite/gcc.dg/store_merging_4.c
    trunk/gcc/testsuite/gcc.dg/store_merging_5.c
    trunk/gcc/testsuite/gcc.dg/store_merging_6.c
    trunk/gcc/testsuite/gcc.dg/store_merging_7.c
    trunk/gcc/testsuite/gcc.target/i386/pr22141.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/Makefile.in
    trunk/gcc/common.opt
    trunk/gcc/doc/invoke.texi
    trunk/gcc/fold-const.c
    trunk/gcc/fold-const.h
    trunk/gcc/opts.c
    trunk/gcc/params.def
    trunk/gcc/passes.def
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/g++.dg/init/new17.C
    trunk/gcc/testsuite/gcc.target/aarch64/ldp_stp_1.c
    trunk/gcc/testsuite/gcc.target/aarch64/ldp_stp_4.c
    trunk/gcc/testsuite/gcc.target/i386/pr34012.c
    trunk/gcc/timevar.def
    trunk/gcc/tree-pass.h
Comment 38 ktkachov 2016-11-02 16:14:59 UTC
Fixed for GCC 7
Comment 39 ktkachov 2016-11-03 12:20:55 UTC
Adjusting target milestone for bookkeeping.
Comment 40 etienne_lorrain 2016-12-15 09:49:07 UTC
Following my comment No 17, the optimisation could also be done for reads - we still have (https://gcc.godbolt.org/ -O2 -m32) that:
struct S
{
  char a;
  char b;
  char c;
  char d;
} u, v;

void fct (void) {
  u.a = v.a;
  u.b = v.b;
  u.c = v.c;
  u.d = v.d;
}

compiled into:
fct():
        movzbl  v, %eax
        movb    %al, u
        movzbl  v+1, %eax
        movb    %al, u+1
        movzbl  v+2, %eax
        movb    %al, u+2
        movzbl  v+3, %eax
        movb    %al, u+3
        ret

Same if fct() contains "u = (struct S) {v.a, v.b, v.c, v.d};"
Where a single read32 / write32 could be used...

Having fct() only do "u = v;" is compiled with a single read32 / write32.
I do not know if it is important enough, if it needs another bugzilla number...
Thanks.
Comment 41 ktkachov 2016-12-15 09:51:49 UTC
(In reply to etienne_lorrain from comment #40)
> Following my comment No 17, the optimisation could also be done for reads -
> we still have (https://gcc.godbolt.org/ -O2 -m32) that:
> struct S
> {
>   char a;
>   char b;
>   char c;
>   char d;
> } u, v;
> 
> void fct (void) {
>   u.a = v.a;
>   u.b = v.b;
>   u.c = v.c;
>   u.d = v.d;
> }
> 
> compiled into:
> fct():
>         movzbl  v, %eax
>         movb    %al, u
>         movzbl  v+1, %eax
>         movb    %al, u+1
>         movzbl  v+2, %eax
>         movb    %al, u+2
>         movzbl  v+3, %eax
>         movb    %al, u+3
>         ret
> 
> Same if fct() contains "u = (struct S) {v.a, v.b, v.c, v.d};"
> Where a single read32 / write32 could be used...
> 
> Having fct() only do "u = v;" is compiled with a single read32 / write32.
> I do not know if it is important enough, if it needs another bugzilla
> number...
> Thanks.

Yeah, I separate bugzilla entry would be good here
Comment 42 etienne_lorrain 2016-12-15 11:26:15 UTC
Separate Bug 78821 has been successfully created following comment 41
Comment 43 Jakub Jelinek 2017-10-27 11:29:04 UTC
Created attachment 42485 [details]
gcc8-pr22141-improvements.patch

Untested store merging improvements, mainly written to address #c30, but also containing various other improvements and fixes.
Comment 44 Jakub Jelinek 2017-10-27 12:13:33 UTC
With the #c43 patch, the included store_merging_10.c improves on x86_64 from:
 	movzbl	(%rdi), %eax
-	andl	$-19, %eax
+	andl	$-32, %eax
 	orl	$13, %eax
 	movb	%al, (%rdi)
in foo and
-	orb	$1, (%rdi)
 	movl	(%rdi), %eax
-	andl	$-131071, %eax
+	andl	$2147352576, %eax
+	orl	$1, %eax
 	movl	%eax, (%rdi)
-	shrl	$24, %eax
-	andl	$127, %eax
-	movb	%al, 3(%rdi)
in bar.  foo is something combine.c managed to optimize too, but bar it couldn't.
In store_merging_11.c on x86_64, bar is the same and foo changed:
-	movabsq	$578437695752115969, %rax
-	movl	$0, 9(%rdi)
-	movb	$0, 15(%rdi)
-	movq	%rax, 1(%rdi)
-	xorl	%eax, %eax
-	movw	%ax, 13(%rdi)
+	movl	$23, %eax
+	movb	$1, 1(%rdi)
+	movl	$117835012, 4(%rdi)
+	movw	%ax, 2(%rdi)
+	movq	$8, 8(%rdi)
which is not only shorter, but all the stores are aligned.
On ppc64le in store_merging_10.c the difference is:
-	lwz 9,0(3)
+	lbz 9,0(3)
 	rlwinm 9,9,0,0,26
 	ori 9,9,0xd
-	stw 9,0(3)
+	stb 9,0(3)
in foo and
 	lwz 9,0(3)
+	rlwinm 9,9,0,1,14
 	ori 9,9,0x1
-	rlwinm 9,9,0,31,14
-	rlwinm 9,9,0,1,31
 	stw 9,0(3)
in bar, and store_merging_11.c the difference is:
-	lis 8,0x807
-	li 9,0
-	ori 8,8,0x605
-	li 10,0
-	sldi 8,8,32
-	stw 9,9(3)
-	sth 9,13(3)
-	oris 8,8,0x400
-	stb 10,15(3)
-	ori 8,8,0x1701
-	mtvsrd 0,8
-	stfd 0,1(3)
+	lis 9,0x706
+	li 7,1
+	li 8,23
+	ori 9,9,0x504
+	li 10,8
+	stb 7,1(3)
+	sth 8,2(3)
+	stw 9,4(3)
+	std 10,8(3)
in foo and no changes in bar.

What the patch doesn't implement yet, but could be also possible for allow_unaligned case is in store_merging_11.c when we are storing 15 bytes store
8 bytes at offset 1 and 8 bytes at offset 8 (i.e. create two overlapping stores, in this case one aligned and one unaligned).
Comment 45 Jakub Jelinek 2017-10-27 13:09:12 UTC
Created attachment 42487 [details]
gcc8-pr22141-improvements.patch

That patch failed to bootstrap due to undesirable -W*uninitialized warnings, the following (still untested) should fix that.
Comment 46 Jakub Jelinek 2017-10-27 16:31:27 UTC
Created attachment 42489 [details]
gcc8-pr22141-improvements.patch

The previous version passed bootstrap on x86_64, i686 and powerpc64le, but
regtest on x86_64 and i686 showed another bug in the gimple-ssa-store-merging.c
changes and a testcase issue in one case.  Will bootstrap/regtest this updated
version now on all 3 again.
Comment 47 Jakub Jelinek 2017-10-30 11:05:21 UTC
Author: jakub
Date: Mon Oct 30 11:04:49 2017
New Revision: 254213

URL: https://gcc.gnu.org/viewcvs?rev=254213&root=gcc&view=rev
Log:
	PR middle-end/22141
	* gimple-ssa-store-merging.c: Include rtl.h and expr.h.
	(struct store_immediate_info): Add bitregion_start and bitregion_end
	fields.
	(store_immediate_info::store_immediate_info): Add brs and bre
	arguments and initialize bitregion_{start,end} from those.
	(struct merged_store_group): Add bitregion_start, bitregion_end,
	align_base and mask fields.  Drop unnecessary struct keyword from
	struct store_immediate_info.  Add do_merge method.
	(clear_bit_region_be): Use memset instead of loop storing zeros.
	(merged_store_group::do_merge): New method.
	(merged_store_group::merge_into): Use do_merge.  Allow gaps in between
	stores as long as the surrounding bitregions have no gaps.
	(merged_store_group::merge_overlapping): Use do_merge.
	(merged_store_group::apply_stores): Test that bitregion_{start,end}
	is byte aligned, rather than requiring that start and width are
	byte aligned.  Drop unnecessary struct keyword from
	struct store_immediate_info.  Allocate and populate also mask array.
	Make start of the arrays relative to bitregion_start rather than
	start and size them according to bitregion_{end,start} difference.
	(struct imm_store_chain_info): Drop unnecessary struct keyword from
	struct store_immediate_info.
	(pass_store_merging::gate): Punt if BITS_PER_UNIT or CHAR_BIT is not 8.
	(pass_store_merging::terminate_all_aliasing_chains): Drop unnecessary
	struct keyword from struct store_immediate_info.
	(imm_store_chain_info::coalesce_immediate_stores): Allow gaps in
	between stores as long as the surrounding bitregions have no gaps.
	Formatting fixes.
	(struct split_store): Add orig non-static data member.
	(split_store::split_store): Initialize orig to false.
	(find_constituent_stmts): Return store_immediate_info *, non-NULL
	if there is exactly a single original stmt.  Change stmts argument
	to pointer from reference, if NULL, don't push anything to it.  Add
	first argument, use it to optimize skipping over orig stmts that
	are known to be before bitpos already.  Simplify.
	(split_group): Return unsigned int count how many stores are or
	would be needed rather than a bool.  Add allow_unaligned argument.
	Change split_stores argument from reference to pointer, if NULL,
	only do a dry run computing how many stores would be produced.
	Rewritten algorithm to use both alignment and misalign if
	!allow_unaligned and handle bitfield stores with gaps.
	(imm_store_chain_info::output_merged_store): Set start_byte_pos
	from bitregion_start instead of start.  Compute allow_unaligned
	here, if true, do 2 split_group dry runs to compute which one
	produces fewer stores and prefer aligned if equal.  Punt if
	new count is bigger or equal than original before emitting any
	statements, rather than during that.  Remove no longer needed
	new_ssa_names tracking.  Replace num_stmts with
	split_stores.length ().  Use 32-bit stack allocated entries
	in split_stores auto_vec.  Try to reuse original store lhs/rhs1
	if possible.  Handle bitfields with gaps.
	(pass_store_merging::execute): Ignore bitsize == 0 stores.
	Compute bitregion_{start,end} for the stores and construct
	store_immediate_info with that.  Formatting fixes.

	* gcc.dg/store_merging_10.c: New test.
	* gcc.dg/store_merging_11.c: New test.
	* gcc.dg/store_merging_12.c: New test.
	* g++.dg/pr71694.C: Add -fno-store-merging to dg-options.

Added:
    trunk/gcc/testsuite/gcc.dg/store_merging_10.c
    trunk/gcc/testsuite/gcc.dg/store_merging_11.c
    trunk/gcc/testsuite/gcc.dg/store_merging_12.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/gimple-ssa-store-merging.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/g++.dg/pr71694.C
Comment 48 Jakub Jelinek 2017-10-30 16:20:55 UTC
Author: jakub
Date: Mon Oct 30 16:20:24 2017
New Revision: 254228

URL: https://gcc.gnu.org/viewcvs?rev=254228&root=gcc&view=rev
Log:
	PR middle-end/22141
	* gimple-ssa-store-merging.c (merged_store_group::apply_stores): Fix
	arguments to clear_bit_region_be.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/gimple-ssa-store-merging.c
Comment 49 ktkachov 2018-05-27 19:12:52 UTC
*** Bug 69616 has been marked as a duplicate of this bug. ***