In the attached example, the store of type_t is not optimized by gcc. Instead of storing one 32-bit value, four 8-bit values are stored. gcc-4.0.1 produces the following code: /home/fm3/foo.c:16 80483b7: c6 05 13 96 04 08 04 movb $0x4,0x8049613 80483be: c6 05 12 96 04 08 03 movb $0x3,0x8049612 80483c5: c6 05 11 96 04 08 02 movb $0x2,0x8049611 80483cc: c6 05 10 96 04 08 01 movb $0x1,0x8049610 gcc-3.4.5 produces the following code: /home/fm3/foo.c:16 8048391: ba 01 02 03 04 mov $0x4030201,%edx /home/fm3/foo.c:15 8048396: 89 e5 mov %esp,%ebp 8048398: 83 ec 18 sub $0x18,%esp 804839b: 83 e4 f0 and $0xfffffff0,%esp 804839e: 83 ec 10 sub $0x10,%esp /home/fm3/foo.c:16 80483a1: 89 15 1c 96 04 08 mov %edx,0x804961c (better) gcc-2.95.4 produces the following code: /home/fm3/foo.c:16 80483f7: c7 05 84 96 04 08 01 movl $0x4030201,0x8049684 80483fe: 02 03 04 (best because no register allocation) The example: #include <stdio.h> typedef struct { unsigned char a; unsigned char b; unsigned char c; unsigned char d; } type_t; static type_t u; int main(void) { u = (type_t){ 1, 2, 3, 4}; printf("%d %d %d %d\n", u.a, u.b, u.c, u.d); return 0; } gcc-4.0 -v -save-temps -g -o foo -W -Wall -Wformat=2 -O3 -Wall -W -fomit-frame-pointer -march=pentium3 -g foo.c: Using built-in specs. Target: i486-linux Configured with: ../src/configure -v --enable-languages=c,c++,java,f95,objc,ada,treelang --prefix=/usr --enable-shared --with-system-zlib --libexecdir=/usr/lib --enable-nls --without-included-gettext --enable-threads=posix --program-suffix=-4.0 --enable-__cxa_atexit --enable-libstdcxx-allocator=mt --enable-clocale=gnu --enable-libstdcxx-debug --enable-java-gc=boehm --enable-java-awt=gtk --enable-mpfr --disable-werror --enable-checking=release i486-linux Thread model: posix gcc version 4.0.1 20050522 (prerelease) (Debian 4.0.0-9) /usr/lib/gcc/i486-linux/4.0.1/cc1 -E -quiet -v foo.c -march=pentium3 -W -Wall -Wformat=2 -Wall -W -fomit-frame-pointer -fworking-directory -O3 -fpch-preprocess -o foo.i ignoring nonexistent directory "/usr/lib/gcc/i486-linux/4.0.1/../../../../i486-linux/include" #include "..." search starts here: #include <...> search starts here: /usr/local/include /usr/lib/gcc/i486-linux/4.0.1/include /usr/include End of search list. /usr/lib/gcc/i486-linux/4.0.1/cc1 -fpreprocessed foo.i -quiet -dumpbase foo.c -march=pentium3 -auxbase foo -g -g -O3 -W -Wall -Wformat=2 -Wall -W -version -fomit-frame-pointer -o foo.s GNU C version 4.0.1 20050522 (prerelease) (Debian 4.0.0-9) (i486-linux) compiled by GNU C version 4.0.1 20050522 (prerelease) (Debian 4.0.0-9). GGC heuristics: --param ggc-min-expand=64 --param ggc-min-heapsize=64260 as -V -Qy --32 -o foo.o foo.s GNU assembler version 2.15 (i486-linux-gnu) using BFD version 2.15 /usr/lib/gcc/i486-linux/4.0.1/collect2 --eh-frame-hdr -m elf_i386 -dynamic-linker /lib/ld-linux.so.2 -o foo /usr/lib/gcc/i486-linux/4.0.1/../../../../lib/crt1.o /usr/lib/gcc/i486-linux/4.0.1/../../../../lib/crti.o /usr/lib/gcc/i486-linux/4.0.1/crtbegin.o -L/usr/lib/gcc/i486-linux/4.0.1 -L/usr/lib/gcc/i486-linux/4.0.1 -L/usr/lib/gcc/i486-linux/4.0.1/../../../../lib -L/usr/lib/gcc/i486-linux/4.0.1/../../.. -L/lib/../lib -L/usr/lib/../lib foo.o -lgcc --as-needed -lgcc_s --no-as-needed -lc -lgcc --as-needed -lgcc_s --no-as-needed /usr/lib/gcc/i486-linux/4.0.1/crtend.o /usr/lib/gcc/i486-linux/4.0.1/../../../../lib/crtn.o
Reduced testcase for the regression: typedef struct type { char a; char b; char c; char d; } type_t; type_t u; void f(void) { u = (type_t){ 1, 2, 3, 4}; } 3.3 produced: lis r0,0x102 lis r2,ha16(_u) ori r0,r0,772 stw r0,lo16(_u)(r2) blr 4.0.0 produces: lis r11,ha16(_u) li r0,2 la r9,lo16(_u)(r11) li r2,1 stb r0,1(r9) li r0,4 stb r2,lo16(_u)(r11) li r2,3 stb r0,3(r9) stb r2,2(r9) blr Which is much slower and larger. Also this is a code size regression even at -Os.
Leaving as P2. Do we know what's different? The structure type is byte-aligned. How did 2.95 justify using a 4-byte store?
(In reply to comment #2) > Leaving as P2. > Do we know what's different? Yes in 4.0 and above there is no CONSTRUCTOR so we don't see the full CONSTRUCTOR in expand so it could expand to just one integer store. >The structure type is byte-aligned. How did 2.95 justify using a 4-byte store? There is no strict alignment requirements for either PPC or x86 which is why GCC did it.
This issue will not be resolved in GCC 4.1.0; retargeted at GCC 4.1.1.
Will not be fixed in 4.1.1; adjust target milestone to 4.1.2.
Open regression with no activity since February 14. Ping?
Still broken (IMHO rightfully P2). With 3.3 we get (x86_64): f: .LFB3: movl $67305985, u(%rip) ret while on trunk we have: f: .LFB2: movb $1, u(%rip) movb $2, u+1(%rip) movb $3, u+2(%rip) movb $4, u+3(%rip) ret regardless of optimization flag (same for -Os and -O). Gimplification now produces f () { u.a = 1; u.b = 2; u.c = 3; u.d = 4; } which makes the task of combining the initializer even harder.
This regression happens after the SSA was merged in to the mainline
Subject: Re: [4.0/4.1/4.2/4.3 Regression] Missing optimization when storing structures > ------- Comment #8 from dtemirbulatov at gmail dot com 2008-01-17 18:34 ------- > This regression happens after the SSA was merged in to the mainline Yes because of the gimplifier ... -- Pinski
Re. comment 7: What does the initial RTL look like with GCC 3.3 and with a post tree-ssa compiler?
For 3.3-hammer I get (x86_64): ;; Function f (note 2 0 3 NOTE_INSN_DELETED) (note 3 2 4 NOTE_INSN_FUNCTION_BEG) (note 4 3 5 NOTE_INSN_DELETED) (note 5 4 6 NOTE_INSN_DELETED) (note 6 5 8 NOTE_INSN_DELETED) (insn 8 6 9 (nil) (set (reg/v:SI 58 [ <anonymous> ]) (const_int 0 [0x0])) -1 (nil) (nil)) (insn 9 8 10 (nil) (set (strict_low_part (subreg:QI (reg/v:SI 58 [ <anonymous> ]) 0)) (const_int 1 [0x1])) -1 (nil) (nil)) (insn 10 9 11 (nil) (set (reg:DI 59) (const_int 2 [0x2])) -1 (nil) (expr_list:REG_EQUAL (const_int 2 [0x2]) (nil))) (insn 11 10 13 (nil) (set (zero_extract:DI (subreg:DI (reg/v:SI 58 [ <anonymous> ]) 0) (const_int 8 [0x8]) (const_int 8 [0x8])) (reg:DI 59)) -1 (nil) (nil)) (insn 13 11 14 (nil) (parallel [ (set (reg/v:SI 58 [ <anonymous> ]) (and:SI (reg/v:SI 58 [ <anonymous> ]) (const_int -16711681 [0xffffffffff00ffff]))) (clobber (reg:CC 17 flags)) ]) -1 (nil) (nil)) (insn 14 13 16 (nil) (parallel [ (set (reg/v:SI 58 [ <anonymous> ]) (ior:SI (reg/v:SI 58 [ <anonymous> ]) (const_int 196608 [0x30000]))) (clobber (reg:CC 17 flags)) ]) -1 (nil) (nil)) (insn 16 14 17 (nil) (parallel [ (set (reg/v:SI 58 [ <anonymous> ]) (and:SI (reg/v:SI 58 [ <anonymous> ]) (const_int 16777215 [0xffffff]))) (clobber (reg:CC 17 flags)) ]) -1 (nil) (nil)) (insn 17 16 18 (nil) (parallel [ (set (reg/v:SI 58 [ <anonymous> ]) (ior:SI (reg/v:SI 58 [ <anonymous> ]) (const_int 67108864 [0x4000000]))) (clobber (reg:CC 17 flags)) ]) -1 (nil) (nil)) (insn 18 17 19 (nil) (set (mem/s:SI (symbol_ref:DI ("u")) [2 u+0 S4 A8]) (reg/v:SI 58 [ <anonymous> ])) -1 (nil) (nil)) (note 19 18 21 NOTE_INSN_FUNCTION_END) (code_label 21 19 0 1 "" [0 uses]) while on the trunk I see: ;; Function f (f) ;; ;; Full RTL generated for this function: ;; (note 1 0 3 NOTE_INSN_DELETED) ;; Start of basic block ( 0) -> 2 ;; Pred edge ENTRY [100.0%] (fallthru) (note 3 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK) (note 2 3 4 2 NOTE_INSN_FUNCTION_BEG) ;; End of basic block 2 -> ( 3) ;; Succ edge 3 [100.0%] (fallthru) ;; Start of basic block ( 2) -> 3 ;; Pred edge 2 [100.0%] (fallthru) (note 4 2 5 3 [bb 3] NOTE_INSN_BASIC_BLOCK) (insn 5 4 6 3 t.c:12 (set (reg/f:DI 58) (symbol_ref:DI ("u") <var_decl 0x2b813a2b8460 u>)) -1 (nil)) (insn 6 5 7 3 t.c:12 (set (mem/s/c:QI (reg/f:DI 58) [0 u.a+0 S1 A8]) (const_int 1 [0x1])) -1 (nil)) (insn 7 6 8 3 t.c:12 (set (reg/f:DI 59) (const:DI (plus:DI (symbol_ref:DI ("u") <var_decl 0x2b813a2b8460 u>) (const_int 1 [0x1])))) -1 (nil)) (insn 8 7 9 3 t.c:12 (set (mem/s/c:QI (reg/f:DI 59) [0 u.b+0 S1 A8]) (const_int 2 [0x2])) -1 (nil)) (insn 9 8 10 3 t.c:12 (set (reg/f:DI 60) (const:DI (plus:DI (symbol_ref:DI ("u") <var_decl 0x2b813a2b8460 u>) (const_int 2 [0x2])))) -1 (nil)) (insn 10 9 11 3 t.c:12 (set (mem/s/c:QI (reg/f:DI 60) [0 u.c+0 S1 A8]) (const_int 3 [0x3])) -1 (nil)) (insn 11 10 12 3 t.c:12 (set (reg/f:DI 61) (const:DI (plus:DI (symbol_ref:DI ("u") <var_decl 0x2b813a2b8460 u>) (const_int 3 [0x3])))) -1 (nil)) (insn 12 11 17 3 t.c:12 (set (mem/s/c:QI (reg/f:DI 61) [0 u.d+0 S1 A8]) (const_int 4 [0x4])) -1 (nil)) ;; End of basic block 3 -> ( 4) ;; Succ edge 4 [100.0%] (fallthru) ;; Start of basic block ( 3) -> 4 ;; Pred edge 3 [100.0%] (fallthru) (note 17 12 14 4 [bb 4] NOTE_INSN_BASIC_BLOCK) (jump_insn 14 17 15 4 t.c:13 (set (pc) (label_ref 16)) -1 (nil)) ;; End of basic block 4 -> ( 6) ;; Succ edge 6 [100.0%] (barrier 15 14 13) ;; Start of basic block () -> 5 (code_label 13 15 18 5 1 "" [0 uses]) (note 18 13 16 5 [bb 5] NOTE_INSN_BASIC_BLOCK) ;; End of basic block 5 -> ( 6) ;; Succ edge 6 [100.0%] (fallthru) ;; Start of basic block ( 4 5) -> 6 ;; Pred edge 4 [100.0%] ;; Pred edge 5 [100.0%] (fallthru) (code_label 16 18 19 6 2 "" [1 uses]) (note 19 16 0 6 [bb 6] NOTE_INSN_BASIC_BLOCK) ;; End of basic block 6 -> ( 1) ;; Succ edge EXIT [100.0%] (fallthru)
So 3.3 expanded the initializer into sets of the individual components, but with ANDs and ORs and a single MEM store, instead of MEM stores to the individual components. It seems to me that this is not something you'd want to fix with a tree optimizer. Maybe some local (i.e. per basic block) RTL pass: 1. Collect all stores 2. Find related stores 3. Combine related stores where possible and cheaper I don't think this is not fixable for GCC 4.3.
Right. You also need to watch for TBAA problems in the RTL you create.
Closing 4.1 branch.
The advantage of such a RTL pass (or just adding such optimization to another RTL pass) would be that it would handle also say: struct S { char a; char b; char c; char d; } u __attribute__((aligned)); void f1 (void) { u = (struct S) { 1, 2, 3, 4 }; } void f2 (void) { u.a = 1; u.b = 2; u.c = 3; u.d = 4; } void f3 (void) { u.d = 4; u.b = 2; u.a = 1; u.c = 3; } where only f1 used to be optimized into a single store by 3.2/3.3. The question is, do we want to do this only if constants are stored into neighbouring < wordsize MEMs? Do we need to use cselib to canonicalize the addresses (similarly how DSE does it), or could we just rely on MEM_EXPR/MEM_OFFSET/MEM_SIZE? If bitfields are involved, combiner often helps with merging adjacent bitfield stores into a single store, so this pass would be desirable after the combiner (and probably before register allocation).
*** Bug 37135 has been marked as a duplicate of this bug. ***
(In reply to comment #15) > The advantage of such a RTL pass (or just adding such optimization to another > RTL pass) would be that it would handle also say: ... Why only limit that pass to constants, and only to writes: etienne@cygne:~$ gcc --version gcc (Debian 4.3.3-3) 4.3.3 Copyright (C) 2008 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. etienne@cygne:~$ cat tmp.c struct S { char a; char b; char c; char d; } u, v; void fct (void) { u.a = v.a; u.b = v.b; u.c = v.c; u.d = v.d; } etienne@cygne:~$ gcc -Os -S tmp.c -o tmp.s etienne@cygne:~$ cat tmp.s .file "tmp.c" .text .globl fct .type fct, @function fct: movb v, %al pushl %ebp movl %esp, %ebp popl %ebp movb %al, u movb v+1, %al movb %al, u+1 movb v+2, %al movb %al, u+2 movb v+3, %al movb %al, u+3 ret .size fct, .-fct .comm u,4,4 .comm v,4,4 .ident "GCC: (Debian 4.3.3-3) 4.3.3" .section .note.GNU-stack,"",@progbits etienne@cygne:~$ A single 32 bits read, and a single 32 bits write should be sufficient, when none of the fields of the structure is declared volatile.
Closing 4.2 branch.
GCC 4.3.4 is being released, adjusting target milestone.
*** Bug 42594 has been marked as a duplicate of this bug. ***
GCC 4.3.5 is being released, adjusting target milestone.
4.3 branch is being closed, moving to 4.4.7 target.
As we have MEM_REF available we can in theory do the combining on the tree-level as well (or during gimplification). In fact, as we have the byteswap detection pass which does not yet handle byte-swapping memory (thus, a memory destination) it looks like a perfect place to handle this while also fixing that deficiency (gather sources for a contiguous piecewise stored memory region of suitable register size/alignment).
4.4 branch is being closed, moving to 4.5.4 target.
*** Bug 53417 has been marked as a duplicate of this bug. ***
The 4.5 branch is being closed, adjusting target milestone.
GCC 4.6.4 has been released and the branch has been closed.
The 4.7 branch is being closed, moving target milestone to 4.8.4.
GCC 4.8.4 has been released.
I saw a related problem, this time with bitfields $more bitfieldtest.c typedef union { struct { unsigned int b1:1; unsigned int b2:1; unsigned int b3:1; unsigned int b4:1; unsigned int b5:1; }fields; unsigned int word; } _t_bitfields; void _const_populate_bits(_t_bitfields * data) { data->fields.b1 = 1; data->fields.b2 = 0; data->fields.b3 = 1; data->fields.b4 = 1; data->fields.b5 = 0; } At the end of tree stages, it looks like this $more bitfieldtest.c.165t.optimized ;; Function _const_populate_bits (_const_populate_bits, funcdef_no=0, decl_uid=1339, cgraph_uid=0) _const_populate_bits (union _t_bitfields * data) { <bb 2>: data_2(D)->fields.b1 = 1; data_2(D)->fields.b2 = 0; data_2(D)->fields.b3 = 1; data_2(D)->fields.b4 = 1; data_2(D)->fields.b5 = 0; return; } Expand expands each one of the assignments in turn and some get combined later on into ok-ish code. It would be nice to be able to combine all 5 assignments into one. Its kind of related to this PR, but is it sufficiently different to warrant a separate PR for it?
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
GCC 4.9.3 has been released.
This should be handled in some late GIMPLE pass, shortly before expansion, but with some cleanup passes in between that and expansion. Both bitfields and adjacent fields handled the same or similar way.
*** Bug 70784 has been marked as a duplicate of this bug. ***
GCC 4.9 branch is being closed
*** Bug 77456 has been marked as a duplicate of this bug. ***
Author: ktkachov Date: Fri Oct 28 14:18:50 2016 New Revision: 241649 URL: https://gcc.gnu.org/viewcvs?rev=241649&root=gcc&view=rev Log: GIMPLE store merging pass 2016-10-28 Kyrylo Tkachov <kyrylo.tkachov@arm.com> PR middle-end/22141 * Makefile.in (OBJS): Add gimple-ssa-store-merging.o. * common.opt (fstore-merging): New Optimization option. * opts.c (default_options_table): Add entry for OPT_ftree_store_merging. * fold-const.h (can_native_encode_type_p): Declare prototype. * fold-const.c (can_native_encode_type_p): Define. * params.def (PARAM_STORE_MERGING_ALLOW_UNALIGNED): Define. (PARAM_MAX_STORES_TO_MERGE): Likewise. * timevar.def (TV_GIMPLE_STORE_MERGING): New timevar. * passes.def: Insert pass_tree_store_merging. * tree-pass.h (make_pass_store_merging): Declare extern prototype. * gimple-ssa-store-merging.c: New file. * doc/invoke.texi (Optimization Options): Document -fstore-merging. (--param documentation): Document store-merging-allow-unaligned and max-stores-to-merge. 2016-10-28 Kyrylo Tkachov <kyrylo.tkachov@arm.com> Jakub Jelinek <jakub@redhat.com> Andrew Pinski <pinskia@gmail.com> PR middle-end/22141 PR rtl-optimization/23684 * gcc.c-torture/execute/pr22141-1.c: New test. * gcc.c-torture/execute/pr22141-2.c: Likewise. * gcc.target/aarch64/ldp_stp_1.c: Adjust for -fstore-merging. * gcc.target/aarch64/ldp_stp_4.c: Likewise. * gcc.dg/store_merging_1.c: New test. * gcc.dg/store_merging_2.c: Likewise. * gcc.dg/store_merging_3.c: Likewise. * gcc.dg/store_merging_4.c: Likewise. * gcc.dg/store_merging_5.c: Likewise. * gcc.dg/store_merging_6.c: Likewise. * gcc.dg/store_merging_7.c: Likewise. * gcc.target/i386/pr22141.c: Likewise. * gcc.target/i386/pr34012.c: Add -fno-store-merging to dg-options. * g++.dg/init/new17.C: Likewise. Added: trunk/gcc/gimple-ssa-store-merging.c trunk/gcc/testsuite/gcc.c-torture/execute/pr22141-1.c trunk/gcc/testsuite/gcc.c-torture/execute/pr22141-2.c trunk/gcc/testsuite/gcc.dg/store_merging_1.c trunk/gcc/testsuite/gcc.dg/store_merging_2.c trunk/gcc/testsuite/gcc.dg/store_merging_3.c trunk/gcc/testsuite/gcc.dg/store_merging_4.c trunk/gcc/testsuite/gcc.dg/store_merging_5.c trunk/gcc/testsuite/gcc.dg/store_merging_6.c trunk/gcc/testsuite/gcc.dg/store_merging_7.c trunk/gcc/testsuite/gcc.target/i386/pr22141.c Modified: trunk/gcc/ChangeLog trunk/gcc/Makefile.in trunk/gcc/common.opt trunk/gcc/doc/invoke.texi trunk/gcc/fold-const.c trunk/gcc/fold-const.h trunk/gcc/opts.c trunk/gcc/params.def trunk/gcc/passes.def trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/g++.dg/init/new17.C trunk/gcc/testsuite/gcc.target/aarch64/ldp_stp_1.c trunk/gcc/testsuite/gcc.target/aarch64/ldp_stp_4.c trunk/gcc/testsuite/gcc.target/i386/pr34012.c trunk/gcc/timevar.def trunk/gcc/tree-pass.h
Fixed for GCC 7
Adjusting target milestone for bookkeeping.
Following my comment No 17, the optimisation could also be done for reads - we still have (https://gcc.godbolt.org/ -O2 -m32) that: struct S { char a; char b; char c; char d; } u, v; void fct (void) { u.a = v.a; u.b = v.b; u.c = v.c; u.d = v.d; } compiled into: fct(): movzbl v, %eax movb %al, u movzbl v+1, %eax movb %al, u+1 movzbl v+2, %eax movb %al, u+2 movzbl v+3, %eax movb %al, u+3 ret Same if fct() contains "u = (struct S) {v.a, v.b, v.c, v.d};" Where a single read32 / write32 could be used... Having fct() only do "u = v;" is compiled with a single read32 / write32. I do not know if it is important enough, if it needs another bugzilla number... Thanks.
(In reply to etienne_lorrain from comment #40) > Following my comment No 17, the optimisation could also be done for reads - > we still have (https://gcc.godbolt.org/ -O2 -m32) that: > struct S > { > char a; > char b; > char c; > char d; > } u, v; > > void fct (void) { > u.a = v.a; > u.b = v.b; > u.c = v.c; > u.d = v.d; > } > > compiled into: > fct(): > movzbl v, %eax > movb %al, u > movzbl v+1, %eax > movb %al, u+1 > movzbl v+2, %eax > movb %al, u+2 > movzbl v+3, %eax > movb %al, u+3 > ret > > Same if fct() contains "u = (struct S) {v.a, v.b, v.c, v.d};" > Where a single read32 / write32 could be used... > > Having fct() only do "u = v;" is compiled with a single read32 / write32. > I do not know if it is important enough, if it needs another bugzilla > number... > Thanks. Yeah, I separate bugzilla entry would be good here
Separate Bug 78821 has been successfully created following comment 41
Created attachment 42485 [details] gcc8-pr22141-improvements.patch Untested store merging improvements, mainly written to address #c30, but also containing various other improvements and fixes.
With the #c43 patch, the included store_merging_10.c improves on x86_64 from: movzbl (%rdi), %eax - andl $-19, %eax + andl $-32, %eax orl $13, %eax movb %al, (%rdi) in foo and - orb $1, (%rdi) movl (%rdi), %eax - andl $-131071, %eax + andl $2147352576, %eax + orl $1, %eax movl %eax, (%rdi) - shrl $24, %eax - andl $127, %eax - movb %al, 3(%rdi) in bar. foo is something combine.c managed to optimize too, but bar it couldn't. In store_merging_11.c on x86_64, bar is the same and foo changed: - movabsq $578437695752115969, %rax - movl $0, 9(%rdi) - movb $0, 15(%rdi) - movq %rax, 1(%rdi) - xorl %eax, %eax - movw %ax, 13(%rdi) + movl $23, %eax + movb $1, 1(%rdi) + movl $117835012, 4(%rdi) + movw %ax, 2(%rdi) + movq $8, 8(%rdi) which is not only shorter, but all the stores are aligned. On ppc64le in store_merging_10.c the difference is: - lwz 9,0(3) + lbz 9,0(3) rlwinm 9,9,0,0,26 ori 9,9,0xd - stw 9,0(3) + stb 9,0(3) in foo and lwz 9,0(3) + rlwinm 9,9,0,1,14 ori 9,9,0x1 - rlwinm 9,9,0,31,14 - rlwinm 9,9,0,1,31 stw 9,0(3) in bar, and store_merging_11.c the difference is: - lis 8,0x807 - li 9,0 - ori 8,8,0x605 - li 10,0 - sldi 8,8,32 - stw 9,9(3) - sth 9,13(3) - oris 8,8,0x400 - stb 10,15(3) - ori 8,8,0x1701 - mtvsrd 0,8 - stfd 0,1(3) + lis 9,0x706 + li 7,1 + li 8,23 + ori 9,9,0x504 + li 10,8 + stb 7,1(3) + sth 8,2(3) + stw 9,4(3) + std 10,8(3) in foo and no changes in bar. What the patch doesn't implement yet, but could be also possible for allow_unaligned case is in store_merging_11.c when we are storing 15 bytes store 8 bytes at offset 1 and 8 bytes at offset 8 (i.e. create two overlapping stores, in this case one aligned and one unaligned).
Created attachment 42487 [details] gcc8-pr22141-improvements.patch That patch failed to bootstrap due to undesirable -W*uninitialized warnings, the following (still untested) should fix that.
Created attachment 42489 [details] gcc8-pr22141-improvements.patch The previous version passed bootstrap on x86_64, i686 and powerpc64le, but regtest on x86_64 and i686 showed another bug in the gimple-ssa-store-merging.c changes and a testcase issue in one case. Will bootstrap/regtest this updated version now on all 3 again.
Author: jakub Date: Mon Oct 30 11:04:49 2017 New Revision: 254213 URL: https://gcc.gnu.org/viewcvs?rev=254213&root=gcc&view=rev Log: PR middle-end/22141 * gimple-ssa-store-merging.c: Include rtl.h and expr.h. (struct store_immediate_info): Add bitregion_start and bitregion_end fields. (store_immediate_info::store_immediate_info): Add brs and bre arguments and initialize bitregion_{start,end} from those. (struct merged_store_group): Add bitregion_start, bitregion_end, align_base and mask fields. Drop unnecessary struct keyword from struct store_immediate_info. Add do_merge method. (clear_bit_region_be): Use memset instead of loop storing zeros. (merged_store_group::do_merge): New method. (merged_store_group::merge_into): Use do_merge. Allow gaps in between stores as long as the surrounding bitregions have no gaps. (merged_store_group::merge_overlapping): Use do_merge. (merged_store_group::apply_stores): Test that bitregion_{start,end} is byte aligned, rather than requiring that start and width are byte aligned. Drop unnecessary struct keyword from struct store_immediate_info. Allocate and populate also mask array. Make start of the arrays relative to bitregion_start rather than start and size them according to bitregion_{end,start} difference. (struct imm_store_chain_info): Drop unnecessary struct keyword from struct store_immediate_info. (pass_store_merging::gate): Punt if BITS_PER_UNIT or CHAR_BIT is not 8. (pass_store_merging::terminate_all_aliasing_chains): Drop unnecessary struct keyword from struct store_immediate_info. (imm_store_chain_info::coalesce_immediate_stores): Allow gaps in between stores as long as the surrounding bitregions have no gaps. Formatting fixes. (struct split_store): Add orig non-static data member. (split_store::split_store): Initialize orig to false. (find_constituent_stmts): Return store_immediate_info *, non-NULL if there is exactly a single original stmt. Change stmts argument to pointer from reference, if NULL, don't push anything to it. Add first argument, use it to optimize skipping over orig stmts that are known to be before bitpos already. Simplify. (split_group): Return unsigned int count how many stores are or would be needed rather than a bool. Add allow_unaligned argument. Change split_stores argument from reference to pointer, if NULL, only do a dry run computing how many stores would be produced. Rewritten algorithm to use both alignment and misalign if !allow_unaligned and handle bitfield stores with gaps. (imm_store_chain_info::output_merged_store): Set start_byte_pos from bitregion_start instead of start. Compute allow_unaligned here, if true, do 2 split_group dry runs to compute which one produces fewer stores and prefer aligned if equal. Punt if new count is bigger or equal than original before emitting any statements, rather than during that. Remove no longer needed new_ssa_names tracking. Replace num_stmts with split_stores.length (). Use 32-bit stack allocated entries in split_stores auto_vec. Try to reuse original store lhs/rhs1 if possible. Handle bitfields with gaps. (pass_store_merging::execute): Ignore bitsize == 0 stores. Compute bitregion_{start,end} for the stores and construct store_immediate_info with that. Formatting fixes. * gcc.dg/store_merging_10.c: New test. * gcc.dg/store_merging_11.c: New test. * gcc.dg/store_merging_12.c: New test. * g++.dg/pr71694.C: Add -fno-store-merging to dg-options. Added: trunk/gcc/testsuite/gcc.dg/store_merging_10.c trunk/gcc/testsuite/gcc.dg/store_merging_11.c trunk/gcc/testsuite/gcc.dg/store_merging_12.c Modified: trunk/gcc/ChangeLog trunk/gcc/gimple-ssa-store-merging.c trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/g++.dg/pr71694.C
Author: jakub Date: Mon Oct 30 16:20:24 2017 New Revision: 254228 URL: https://gcc.gnu.org/viewcvs?rev=254228&root=gcc&view=rev Log: PR middle-end/22141 * gimple-ssa-store-merging.c (merged_store_group::apply_stores): Fix arguments to clear_bit_region_be. Modified: trunk/gcc/ChangeLog trunk/gcc/gimple-ssa-store-merging.c
*** Bug 69616 has been marked as a duplicate of this bug. ***