The stack usage of the code gcc-4.x generates looks inefficient on x86 and
x86_64.  A simple test case is below:

========
#define copy_from_asm(x, addr, err) \
	asm volatile( \
	"1:\tmovl %2, %1\n" \
	"2:\n" \
	".section .fixup,\"ax\"\n" \
	"\txor %1,%1\n" \
	"\tmov $1,%0\n" \
	"\tjmp 2b\n" \
	".previous\n" \
	: "=r" (err), "=r" (x) \
	: "m" (*(int*)(addr)), "0" (err))

#define copy_from(x, addr, err) do { \
	(err) = 0; \
	copy_from_asm((x), (addr), (err)); \
} while (0)

#define copy(x, addr) ({ \
	int __err; \
	copy_from((x), (addr), __err); \
	__err; \
})

int src[32];
int dst[32];

#define my_copy(x) do { err |= copy(dst[x], &src[x]); } while (0)

int test(void)
{
	int err = 0;

	my_copy( 0); my_copy( 1); my_copy( 2); my_copy( 3);
	my_copy( 4); my_copy( 5); my_copy( 6); my_copy( 7);
	my_copy( 8); my_copy( 9); my_copy(10); my_copy(11);
	my_copy(12); my_copy(13); my_copy(14); my_copy(15);
	my_copy(16); my_copy(17); my_copy(18); my_copy(19);
	my_copy(20); my_copy(21); my_copy(22); my_copy(23);
	my_copy(24); my_copy(25); my_copy(26); my_copy(27);
	my_copy(28); my_copy(29); my_copy(30); my_copy(31);

	return err;
}
========

I compiled this test case with gcc-3.4.6, 4.2.4, 4.3.2 and 4.4-20081205,
with the options "-g -Os -mno-red-zone".  The sizes of the resulting
objects are:

$ size test.o.*
   text    data     bss     dec     hex filename
    945       0       0     945     3b1 test.o.34
   1157       0       0    1157     485 test.o.42
   1133       0       0    1133     46d test.o.43
   1201       0       0    1201     4b1 test.o.44

gcc-3.4.6 generates:

0000000000000000 <test>:
   0:	31 c9                	xor    %ecx,%ecx
   2:	8b 05 00 00 00 00    	mov    0x0(%rip),%eax        # 8 <test+0x8>
   8:	89 05 00 00 00 00    	mov    %eax,0x0(%rip)        # e <test+0xe>
   e:	31 c0                	xor    %eax,%eax
  10:	8b 15 00 00 00 00    	mov    0x0(%rip),%edx        # 16 <test+0x16>
  16:	09 c1                	or     %eax,%ecx
  18:	89 15 00 00 00 00    	mov    %edx,0x0(%rip)        # 1e <test+0x1e>
  1e:	31 c0                	xor    %eax,%eax
  20:	8b 15 00 00 00 00    	mov    0x0(%rip),%edx        # 26 <test+0x26>
  26:	09 c1                	or     %eax,%ecx

gcc-4.4 generates:

0000000000000000 <test>:
   0:	41 57                	push   %r15
   2:	31 c0                	xor    %eax,%eax
   4:	41 56                	push   %r14
   6:	41 55                	push   %r13
   8:	41 89 c5             	mov    %eax,%r13d
   b:	41 54                	push   %r12
   d:	55                   	push   %rbp
   e:	53                   	push   %rbx
   f:	48 83 ec 58          	sub    $0x58,%rsp
  13:	8b 15 00 00 00 00    	mov    0x0(%rip),%edx        # 19 <test+0x19>
  19:	89 15 00 00 00 00    	mov    %edx,0x0(%rip)        # 1f <test+0x1f>
  1f:	41 89 c6             	mov    %eax,%r14d
  22:	8b 15 00 00 00 00    	mov    0x0(%rip),%edx        # 28 <test+0x28>
  28:	89 15 00 00 00 00    	mov    %edx,0x0(%rip)        # 2e <test+0x2e>
  2e:	41 89 c4             	mov    %eax,%r12d
	...
  bf:	31 d2                	xor    %edx,%edx
  c1:	44 8b 3d 00 00 00 00 	mov    0x0(%rip),%r15d       # c8 <test+0xc8>
  c8:	89 54 24 04          	mov    %edx,0x4(%rsp)
  cc:	44 89 3d 00 00 00 00 	mov    %r15d,0x0(%rip)       # d3 <test+0xd3>
  d3:	45 31 ff             	xor    %r15d,%r15d
  d6:	8b 15 00 00 00 00    	mov    0x0(%rip),%edx        # dc <test+0xdc>
  dc:	44 89 7c 24 54       	mov    %r15d,0x54(%rsp)
	...
 26d:	0b 54 24 54          	or     0x54(%rsp),%edx
 271:	0b 54 24 50          	or     0x50(%rsp),%edx
 275:	0b 54 24 4c          	or     0x4c(%rsp),%edx
 279:	0b 54 24 48          	or     0x48(%rsp),%edx
	...

With gcc-4.x, the error values are temporarily stored on the stack and only
OR-ed together at the very end.  This stack usage seems inefficient.
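To make the liveness problem concrete, here is a hypothetical reduction
(get_flag() and both functions below are illustrative stand-ins, not part of
the report):

/* Hypothetical reduction: get_flag() stands in for one copy() expansion
   producing an error flag.  */
extern int get_flag(int i);

int good_schedule(void)
{
	int err = 0;

	/* Each flag is OR-ed into err immediately, so at most one flag
	   is live at any point -- the gcc-3.4 shape.  */
	err |= get_flag(0);
	err |= get_flag(1);
	err |= get_flag(2);
	err |= get_flag(3);
	return err;
}

int bad_schedule(void)
{
	/* All producers run first, so every flag is live at once -- the
	   gcc-4.x shape.  With 32 flags, most of them have to be
	   spilled to the stack.  */
	int e0 = get_flag(0);
	int e1 = get_flag(1);
	int e2 = get_flag(2);
	int e3 = get_flag(3);

	return ((e0 | e1) | e2) | e3;
}

Because get_flag() is an opaque call, the compiler must preserve the call
order in both functions, so the two really do differ only in how long the
flags stay live.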
The culprit is tree-ssa-reassoc.c; with -fno-tree-ssa-reassoc the generated
code is comparable to 3.4's.  For some reason it decides to transform

  err_10 = __err_3 | __err_8;

into

  err_10 = __err_8 | __err_3;

(no idea what advantage that has, but anyway).  Instead of reinserting the
transformed stmt into the location where err_10 used to be set, though, it
moves all 30 of the err_A = err_B | __err_C stmts (where each stmt depends
on the preceding one) after the last asm volatile.  This makes register
pressure jump through the roof.  In this case (if the transformation
actually makes sense), the first transformed stmt can certainly be inserted
into the old definition spot and all the following stmts can be left
untouched.
Sounds similar to PR 32183.
In particular, the problem is in linearize_expr_tree.  The |s are already
perfectly linearized (in all but the innermost recursive
linearize_expr_tree call binlhsisreassoc is 1 and binrhsisreassoc is 0; in
the innermost, both are 0), yet:

1645	      gsinow = gsi_for_stmt (stmt);
1646	      gsilhs = gsi_for_stmt (SSA_NAME_DEF_STMT (binlhs));
1647	      gsi_move_before (&gsilhs, &gsinow);

moves all of them to one spot.  Couldn't that moving be deferred till we
actually know it is worthwhile to reassociate something, not done blindly
just in case something will be needed?
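One possible shape for that deferral, as a sketch against the 4.4-era
gimple API inside tree-ssa-reassoc.c (the helper name and the will_rewrite
flag are made up here; this is not the committed fix):

/* Sketch only: move binlhs's defining stmt down to stmt, but only when
   the operand list is actually going to be rewritten.  */
static void
move_def_if_rewriting (gimple stmt, tree binlhs, bool will_rewrite)
{
  gimple_stmt_iterator gsinow, gsilhs;

  if (!will_rewrite)
    return;  /* Nothing to reassociate: leave the chain interleaved.  */

  gsinow = gsi_for_stmt (stmt);
  gsilhs = gsi_for_stmt (SSA_NAME_DEF_STMT (binlhs));
  gsi_move_before (&gsilhs, &gsinow);
}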
Subject: Re: [4.2/4.3/4.4 regression] tree-ssa-reassoc.c increases register
pressure several times

Yes, that looks like a bug.  There are also numerous ways in which the
placement can be improved.  A few people had talked about rewriting it to
do balanced tree placement, etc., so I've never bothered to do even the
small cleanups.  If you want to change placement, go for it. :)
Created attachment 16915
gcc44-pr38533.patch

Patch that cures this.  Bootstrapped/regtested on x86_64-linux, except for
a gfortran.dg/reassoc_4.f failure where the undistribution doesn't happen
any more for some reason.  Looking into it.
The problem with the patch is that useless stmts (already-visited ones
whose lhs satisfies has_zero_uses) are now scattered across the basic
blocks and so aren't removed immediately.
Subject: Bug 38533

Author: jakub
Date: Thu Dec 18 07:54:43 2008
New Revision: 142807

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=142807

Log:
	PR middle-end/38533
	* tree-ssa-reassoc.c (remove_visited_stmt_chain): New function.
	(rewrite_expr_tree): Add moved argument, move stmts together if
	needed.  Call remove_visited_stmt_chain.
	(linearize_expr_tree): Don't move stmts here.
	(reassociate_bb): Call remove_visited_stmt_chain if num ops is 1.
	Adjust rewrite_expr_tree caller.

	* gcc.dg/tree-ssa/pr38533.c: New test.
	* gcc.c-torture/execute/pr38533.c: New test.

Added:
    trunk/gcc/testsuite/gcc.c-torture/execute/pr38533.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr38533.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-reassoc.c
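Going by the log entry, the new remove_visited_stmt_chain helper addresses
the scattered-dead-stmt problem from the previous comment by walking back
through already-visited defining stmts whose results became unused and
deleting them on the spot.  A rough approximation of the idea (my sketch
against the 4.4 gimple API, not the committed code):

/* Sketch: starting from an operand that has become unused, walk up the
   chain of visited defining stmts and remove each dead one.  */
static void
remove_visited_stmt_chain_sketch (tree var)
{
  gimple stmt;
  gimple_stmt_iterator gsi;

  while (TREE_CODE (var) == SSA_NAME && has_zero_uses (var))
    {
      stmt = SSA_NAME_DEF_STMT (var);
      if (!is_gimple_assign (stmt) || !gimple_visited_p (stmt))
	break;
      var = gsi_remove ((gsi = gsi_for_stmt (stmt), &gsi), true)
	    ? gimple_assign_rhs1 (stmt) : gimple_assign_rhs1 (stmt);
      release_defs (stmt);
    }
}

(In a real implementation the rhs1 would simply be fetched before removing
the stmt; the point is only that deleting the chain eagerly keeps dead code
from lingering in the bbs.)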
Fixed on the trunk.
Did the patch cure PR28481 too?
Closing 4.2 branch.
WONTFIX on the 4.3 branch.