Created attachment 38058 [details]
inttostr.c

Code size increase on master for ARM target compared to gcc-5.3.0

Target: arm-none-eabi
Flags: -Os -mcpu=arm966e-s -marm

gcc 5.3.0:

00000000 <inttostr>:
   0:	e3a03000 	mov	r3, #0
   4:	e92d4070 	push	{r4, r5, r6, lr}
   8:	e1a06000 	mov	r6, r0
   c:	e2422001 	sub	r2, r2, #1
  10:	e0205fc0 	eor	r5, r0, r0, asr #31
  14:	e0455fc0 	sub	r5, r5, r0, asr #31
  18:	e0814002 	add	r4, r1, r2
  1c:	e7c13002 	strb	r3, [r1, r2]
  20:	e1a00005 	mov	r0, r5
  24:	e3a0100a 	mov	r1, #10
  28:	ebfffffe 	bl	0 <__aeabi_uidivmod>
  2c:	e2811030 	add	r1, r1, #48	; 0x30
  30:	e5641001 	strb	r1, [r4, #-1]!
  34:	e1a00005 	mov	r0, r5
  38:	e3a0100a 	mov	r1, #10
  3c:	ebfffffe 	bl	0 <__aeabi_uidiv>
  40:	e2505000 	subs	r5, r0, #0
  44:	1afffff5 	bne	20 <inttostr+0x20>
  48:	e3560000 	cmp	r6, #0
  4c:	b3a0302d 	movlt	r3, #45	; 0x2d
  50:	b5443001 	strblt	r3, [r4, #-1]
  54:	b2444001 	sublt	r4, r4, #1
  58:	e1a00004 	mov	r0, r4
  5c:	e8bd8070 	pop	{r4, r5, r6, pc}

gcc-6-20160313 snapshot from master:

00000000 <inttostr>:
   0:	e3a03000 	mov	r3, #0
   4:	e92d41f0 	push	{r4, r5, r6, r7, r8, lr}
   8:	e1a07000 	mov	r7, r0
   c:	e3a0800a 	mov	r8, #10
  10:	e2422001 	sub	r2, r2, #1
  14:	e0206fc0 	eor	r6, r0, r0, asr #31
  18:	e0466fc0 	sub	r6, r6, r0, asr #31
  1c:	e0815002 	add	r5, r1, r2
  20:	e7c13002 	strb	r3, [r1, r2]
  24:	e1a00006 	mov	r0, r6
  28:	e1a01008 	mov	r1, r8
  2c:	ebfffffe 	bl	0 <__aeabi_uidivmod>
  30:	e2811030 	add	r1, r1, #48	; 0x30
  34:	e5451001 	strb	r1, [r5, #-1]
  38:	e1a00006 	mov	r0, r6
  3c:	e1a01008 	mov	r1, r8
  40:	ebfffffe 	bl	0 <__aeabi_uidiv>
  44:	e2506000 	subs	r6, r0, #0
  48:	e2454001 	sub	r4, r5, #1
  4c:	1a000005 	bne	68 <inttostr+0x68>
  50:	e3570000 	cmp	r7, #0
  54:	b3a0302d 	movlt	r3, #45	; 0x2d
  58:	b5443001 	strblt	r3, [r4, #-1]
  5c:	b2454002 	sublt	r4, r5, #2
  60:	e1a00004 	mov	r0, r4
  64:	e8bd81f0 	pop	{r4, r5, r6, r7, r8, pc}
  68:	e1a05004 	mov	r5, r4
  6c:	eaffffec 	b	24 <inttostr+0x24>
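For readers without the attachment: judging from the GIMPLE and assembly dumps in this report, inttostr.c is presumably close to the following sketch (a reconstruction, not the attached source; declarations and types may differ slightly):

#include <stdlib.h>

/* Write the decimal form of I into the end of BUF (of length LEN) and
   return a pointer to its first character.  */
char *
inttostr (int i, char *buf, int len)
{
  unsigned int ui = abs (i);      /* the ABS_EXPR seen in the GIMPLE dumps */
  char *p = buf + len - 1;
  *p = '\0';
  do
    *--p = '0' + ui % 10;         /* first candidate for an auto-dec store */
  while ((ui /= 10) != 0);
  if (i < 0)
    *--p = '-';                   /* second candidate for an auto-dec store */
  return p;
}

The size-relevant property is that both stores want the pre-decremented pointer, which ARM can express with a single strb rN, [rX, #-1]! when the register allocation cooperates, as in the gcc 5.3.0 code above.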
My #1 bet would be FSM threading.
(In reply to Richard Biener from comment #1)
> My #1 bet would be FSM threading.

I doubt it; if I read the asm differences correctly, GCC 6 just no longer does the store with post-increment, and that causes register allocation differences. In other words there is no extra jump threading, just extra moves, like:

  68:	e1a05004 	mov	r5, r4
  6c:	eaffffec 	b	24 <inttostr+0x24>

so it might be caused by out-of-SSA differences too :)
Sorry Richi, no jump threads in this code, so it's not the root cause of any codesize regressions.
Confirmed. There is a size increase going from GCC 5 to GCC 6; the amount varies slightly with CPU tuning options (8 bytes with -march=armv7-a -marm and 12 bytes with -mcpu=arm966e-s).
I see multiple points of increase:
r224048 added 4 bytes.
r224995 added 8 bytes.
r228318 added 4 bytes.

Perhaps the middle one could change from

  (if (single_use (@2)
       || (TREE_CODE (@1) == INTEGER_CST && TREE_CODE (@3) == INTEGER_CST))

to

  (if (single_use (@2)
       || (!optimize_size
           && TREE_CODE (@1) == INTEGER_CST && TREE_CODE (@3) == INTEGER_CST))

(or some optimize*for_speed* predicate)?

CCing the authors of the other commits.  That said, complaining about size regressions is generally warranted only if the size of some larger codebase (CSiBE, some large app, ...) increases significantly; when looking at a single function only, there will always be improvements and degradations, since the compiler keeps changing.
Thanks for your analysis on this.

One comment on this 'complaint': it is not only about size. In my example the compiler pushes and pops two more registers and uses several more instructions, so I think it causes a performance regression as well. I can file these 'complaints' as performance degradations next time if that is better.

This was actually derived from a larger code-base analysis (CSiBE), bug 61578 ("[4.9 regression] Code size increase for ARM thumb compared to 4.8.x when compiling with -Os"), where the CSiBE problems were analysed. Unfortunately that issue got too fuzzy; it is hard to pin down a problem spread over almost 1000 files, so the conclusion was to create separate, smaller examples to work on, because the overall CSiBE benchmark was hard to get an overview of.

I understand that GCC development focuses strongly on performance, but I do think size really is important as well, since GCC is also used very widely for typically ARM-based embedded systems. I will continue to track size on CSiBE at http://gcc.hederstierna.com/csibe, but please say so if pure size-regression reports are unwanted; I can then try to focus on issues that have both speed and size components combined (I guess they often go hand in hand).

Thanks, BR, Fredrik
I would suggest that you definitely keep reporting these things and extracting examples from CSiBE or other benchmarks to show the codesize increases. While some folks will prioritize performance, that doesn't mean codesize is ignored. Similarly for significant changes in compile time, stack usage, etc.
All I wanted to say is that it is to be expected that on some code a newer GCC version ends up needing one or two more instructions, even at -Os, and that what matters is not the size of a single function but of larger codebases. If the testcase here has been extracted from CSiBE and/or represents something that happens often, then it is of course desirable to fix it. So it would be interesting to find out what impact the above-mentioned commits have on larger codebases. If they are in the end neutral or shrink code size, then one testcase going in the other direction is fine; if they have overall negative -Os code size effects, then it is something to work on.
(In reply to Jakub Jelinek from comment #5)
> CCing the authors of the other commits.  That said, complaining about size
> regressions is generally warranted only if the size of some larger codebase
> (CSiBE, some large app, ...) increases significantly; when looking at a
> single function only, there will always be improvements and degradations,
> since the compiler keeps changing.

A few comments about "my" patch (r228318), the reorder-blocks change.

First of all, almost everything it does is heuristic, so exactly as Jakub says, you have to look at bigger tests.

The patch makes -freorder-blocks use a new algorithm ("simple") for -Os; after a few follow-up patches this consistently gives smaller code. The big difference from the old ("stc") algorithm is that the old one can duplicate blocks, which sometimes results in smaller code, while "simple" never does.

It may be interesting to look at before and after of just this patch, to see if the heuristics can be improved.
I am looking into it. -mcpu=arm966e-s does not use TARGET_NEW_GENERIC_COSTS, i.e., the rtx costs set up by the back-end might not be optimal.
Optimized gimple diff between 5.3 and trunk is:

-;; Function inttostr (inttostr, funcdef_no=0, decl_uid=5268, cgraph_uid=0, symbol_order=0)
+;; Function inttostr (inttostr, funcdef_no=0, decl_uid=4222, cgraph_uid=0, symbol_order=0)

 Removing basic block 7
 Removing basic block 8
@@ -43,7 +43,7 @@
     goto <bb 6>;

   <bb 5>:
-  p_22 = p_2 + 4294967294;
+  p_22 = p_16 + 4294967295;
   MEM[(char *)p_16 + 4294967295B] = 45;

   <bb 6>:
However, the diff of the cfgexpand dump is significantly different:

 ;; Full RTL generated for this function:
 ;;
    32: NOTE_INSN_DELETED
-   38: NOTE_INSN_BASIC_BLOCK 2
+   39: NOTE_INSN_BASIC_BLOCK 2
    33: r151:SI=r0:SI
    34: r152:SI=r1:SI
    35: r153:SI=r2:SI
    36: NOTE_INSN_FUNCTION_BEG
-   40: {r141:SI=abs(r151:SI);clobber cc:CC;}
-   41: r154:SI=r153:SI-0x1
-   42: r142:SI=r152:SI+r154:SI
-   43: r155:SI=0
-   44: r156:QI=r155:SI#0
-   45: [r142:SI]=r156:QI
-   61: L61:
-   46: NOTE_INSN_BASIC_BLOCK 4
-   47: r142:SI=r142:SI-0x1
-   48: r1:SI=0xa
-   49: r0:SI=r141:SI
-   50: r0:DI=call [`__aeabi_uidivmod'] argc:0
+   41: {r141:SI=abs(r151:SI);clobber cc:CC;}
+   42: r154:SI=r153:SI-0x1
+   43: r142:SI=r152:SI+r154:SI
+   44: r155:SI=0
+   45: r156:QI=r155:SI#0
+   46: [r142:SI]=r156:QI
+   81: pc=L62
+   82: barrier
+   84: L84:
+   83: NOTE_INSN_BASIC_BLOCK 4
+   37: r142:SI=r150:SI
+   62: L62:
+   47: NOTE_INSN_BASIC_BLOCK 5
+   48: r150:SI=r142:SI-0x1
+   49: r1:SI=0xa
+   50: r0:SI=r141:SI
+   51: r0:DI=call [`__aeabi_uidivmod'] argc:0
       REG_CALL_DECL `__aeabi_uidivmod'
       REG_EH_REGION 0xffffffff80000000
-   51: r162:SI=r1:SI
+   52: r162:SI=r1:SI
       REG_EQUAL umod(r141:SI,0xa)
-   52: r163:QI=r162:SI#0
-   53: r164:SI=r163:QI#0+0x30
-   54: r165:QI=r164:SI#0
-   55: [r142:SI]=r165:QI
-   56: r1:SI=0xa
-   57: r0:SI=r141:SI
-   58: r0:SI=call [`__aeabi_uidiv'] argc:0
+   53: r163:QI=r162:SI#0
+   54: r164:SI=r163:QI#0+0x30
+   55: r165:QI=r164:SI#0
+   56: [r150:SI]=r165:QI
+   57: r1:SI=0xa
+   58: r0:SI=r141:SI
+   59: r0:SI=call [`__aeabi_uidiv'] argc:0
       REG_CALL_DECL `__aeabi_uidiv'
       REG_EH_REGION 0xffffffff80000000
-   59: r169:SI=r0:SI
+   60: r169:SI=r0:SI
       REG_EQUAL udiv(r141:SI,0xa)
-   60: r141:SI=r169:SI
-   62: cc:CC=cmp(r141:SI,0)
-   63: pc={(cc:CC!=0)?L61:pc}
+   61: r141:SI=r169:SI
+   63: cc:CC=cmp(r141:SI,0)
+   64: pc={(cc:CC!=0)?L84:pc}
       REG_BR_PROB 9100
-   64: NOTE_INSN_BASIC_BLOCK 5
-   65: cc:CC=cmp(r151:SI,0)
-   66: pc={(cc:CC>=0)?L72:pc}
+   65: NOTE_INSN_BASIC_BLOCK 6
+   66: cc:CC=cmp(r151:SI,0)
+   67: pc={(cc:CC>=0)?L77:pc}
       REG_BR_PROB 6335
-   67: NOTE_INSN_BASIC_BLOCK 6
-   68: r149:SI=r142:SI-0x1
-   69: r170:SI=0x2d
-   70: r171:QI=r170:SI#0
-   71: [r142:SI-0x1]=r171:QI
-   37: r142:SI=r149:SI
-   72: L72:
-   73: NOTE_INSN_BASIC_BLOCK 7
-   74: r150:SI=r142:SI
+   68: NOTE_INSN_BASIC_BLOCK 7
+   69: r149:SI=r142:SI-0x2
+   70: r170:SI=0x2d
+   71: r171:QI=r170:SI#0
+   72: [r150:SI-0x1]=r171:QI
+   38: r150:SI=r149:SI
+   77: L77:
+   80: NOTE_INSN_BASIC_BLOCK 9
    78: r0:SI=r150:SI
    79: use r0:SI
The change to the assignment of p_22 is made by forwprop1. It does create a situation where p_2 is live outside the loop and hides the CSE opportunity, which may be the cause of the more significant differences at expansion time.
(In reply to Jeffrey A. Law from comment #13)
> The change to the assignment of p_22 is made by forwprop1.
>
> It does create a situation where p_2 is live outside the loop and hides the
> CSE opportunity, which may be the cause of the more significant differences
> at expansion time.

Indeed. This is what I see:

gcc 5 branch with -O2 t2.c -S -Os -marm -mcpu=arm966e-s:                        96 bytes
trunk with -O2 t2.c -c -Os -marm -mcpu=arm966e-s:                              112 bytes
trunk with -O2 t2.c -c -Os -marm:                                              108 bytes
trunk with -O2 t2.c -c -Os -marm -mcpu=arm966e-s -fno-tree-forwprop:            96 bytes
trunk with -O2 t2.c -c -Os -marm -mcpu=arm966e-s and Jakub's changes in c#5:   100 bytes
Created attachment 38185 [details]
tok.c

I took another example from CSiBE and stripped it down. I'm not 100% sure it is the exact same issue, but it looks similar, so I attached it. This gives bigger code for -Os -mcpu=arm966e-s -marm, and also for -mcpu=cortex-m3, though arm7tdmi and cortex-m0 resulted in small code.

gcc 5.3.0:

00000000 <test>:
   0:	e92d4010 	push	{r4, lr}
   4:	e1a04000 	mov	r4, r0
   8:	ebfffffe 	bl	0 <next>
   c:	e3500000 	cmp	r0, #0
  10:	0a000003 	beq	24 <test+0x24>
  14:	e59f002c 	ldr	r0, [pc, #44]	; 48 <test+0x48>
  18:	ebfffffe 	bl	0 <dump>
  1c:	e1a00004 	mov	r0, r4
  20:	eafffff8 	b	8 <test+0x8>
  24:	e1a00004 	mov	r0, r4
  28:	ebfffffe 	bl	0 <next>
  2c:	e3500000 	cmp	r0, #0
  30:	0a000002 	beq	40 <test+0x40>
  34:	e59f000c 	ldr	r0, [pc, #12]	; 48 <test+0x48>
  38:	ebfffffe 	bl	0 <dump>
  3c:	eafffff8 	b	24 <test+0x24>
  40:	e1a00004 	mov	r0, r4
  44:	e8bd8010 	pop	{r4, pc}
  48:	00000000 	andeq	r0, r0, r0

master:

00000000 <test>:
   0:	e92d4070 	push	{r4, r5, r6, lr}
   4:	e1a04000 	mov	r4, r0
   8:	ebfffffe 	bl	0 <next>
   c:	e59f5048 	ldr	r5, [pc, #72]	; 5c <test+0x5c>
  10:	e3500000 	cmp	r0, #0
  14:	1a000006 	bne	34 <test+0x34>
  18:	e1a00004 	mov	r0, r4
  1c:	ebfffffe 	bl	0 <next>
  20:	e59f5034 	ldr	r5, [pc, #52]	; 5c <test+0x5c>
  24:	e3500000 	cmp	r0, #0
  28:	1a000006 	bne	48 <test+0x48>
  2c:	e1a00004 	mov	r0, r4
  30:	e8bd8070 	pop	{r4, r5, r6, pc}
  34:	e1a00005 	mov	r0, r5
  38:	ebfffffe 	bl	0 <dump>
  3c:	e1a00004 	mov	r0, r4
  40:	ebfffffe 	bl	0 <next>
  44:	eafffff1 	b	10 <test+0x10>
  48:	e1a00005 	mov	r0, r5
  4c:	ebfffffe 	bl	0 <dump>
  50:	e1a00004 	mov	r0, r4
  54:	ebfffffe 	bl	0 <next>
  58:	eafffff1 	b	24 <test+0x24>
  5c:	00000000 	andeq	r0, r0, r0
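The stripped-down test is presumably along these lines (a reconstruction from the disassembly above; the real tok.c and the argument handed to dump() may differ):

extern int next (char *);
extern void dump (const char *);

static const char msg[] = "tok";   /* stands in for whatever dump() receives */

char *
test (char *p)
{
  while (next (p))                 /* first loop, 0x8..0x20 in the gcc 5 code */
    dump (msg);
  while (next (p))                 /* second, identical loop */
    dump (msg);
  return p;
}

The pattern of the regression looks similar to inttostr: the gcc 5 code reloads the literal-pool address inside each loop and gets by with r4 alone, while master hoists the load into r5 before each loop and therefore needs two extra callee-saved registers in the prologue/epilogue.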
GCC 6.1 has been released.
GCC 6.2 is being released, adjusting target milestone.
Tested GCC-6.2, still same behavior.
GCC 6.3 is being released, adjusting target milestone.
GCC 6.4 is being released, adjusting target milestone.
For the record, current mainline is even worse than when the testcase was originally reported.  We are now at 116 bytes versus 96 for gcc-5-branch.

(In reply to Jakub Jelinek from comment #5)
> I see multiple points of increase:
> r224048 added 4 bytes.
> r224995 added 8 bytes.
> r228318 added 4 bytes.
> 
> Perhaps the middle one could change from
> 
>   (if (single_use (@2)
>        || (TREE_CODE (@1) == INTEGER_CST && TREE_CODE (@3) == INTEGER_CST))
> 
> to
> 
>   (if (single_use (@2)
>        || (!optimize_size
>            && TREE_CODE (@1) == INTEGER_CST && TREE_CODE (@3) == INTEGER_CST))
> 
> (or some optimize*for_speed* predicate)?

As Jakub mentions in comment #5, there are multiple patches that slowly bloated the generated code, but perhaps it is a fool's errand to tackle them individually.  For instance, associating "buf + len - 1" differently in r224995 reduces the generated code by 8 bytes, but predicating the association on optimize_for_speed seems fragile IMO.

Interestingly, disabling -ftree-forwprop reduces the byte count to 92, which is even smaller than gcc-5-branch, so that is perhaps worth pursuing.  Even on x86, disabling forwprop reduces the byte count by 13.  And on both x86 and arm we get one less branch without forwprop.

The first forwprop change is to replace an equality by a greater-than, which hardly seems worthwhile (it is even a longer byte sequence on x86), but is perhaps not a showstopper:

<   if (ui_21 != 0)
---
>   if (ui_7 > 9)

OTOH, the following changes things quite a bit on arm:

<   p_22 = p_19 + 4294967295;
<   *p_22 = 45;
---
>   p_22 = p_8 + 4294967294;
>   MEM[(char *)p_19 + 4294967295B] = 45;

For context, we are using p_8, which was the previous iteration's value of the pointer, and then subtracting 2 to return p correctly.  What the heck?

  <bb 2> [local count: 118111601]:
  _1 = ABS_EXPR <i_12(D)>;
  ui_13 = (unsigned int) _1;
  len.0_2 = (sizetype) len_14(D);
  _3 = len.0_2 + 4294967295;
  p_16 = buf_15(D) + _3;
  *p_16 = 0;

  <bb 3> [local count: 1073741825]:
  # ui_7 = PHI <ui_13(2), ui_21(3)>
  # p_8 = PHI <p_16(2), p_19(3)>
  _4 = ui_7 % 10;
  _5 = (char) _4;
  p_19 = p_8 + 4294967295;
  _6 = _5 + 48;
  MEM[base: p_19, offset: 0B] = _6;
  ui_21 = ui_7 / 10;
  if (ui_7 > 9)
    goto <bb 3>; [89.00%]
  else
    goto <bb 4>; [11.00%]

  <bb 4> [local count: 118111601]:
  if (i_12(D) < 0)
    goto <bb 5>; [41.00%]
  else
    goto <bb 6>; [59.00%]

  <bb 5> [local count: 48425756]:
  p_22 = p_8 + 4294967294;
  MEM[(char *)p_19 + 4294967295B] = 45;

  <bb 6> [local count: 118111601]:
  # p_9 = PHI <p_22(5), p_19(4)>
  return p_9;

This finally yields assembly without auto-dec, and with an extra (forward!) branch:

.L2:
	mov	r1, #10
	mov	r0, r6
	bl	__aeabi_uidivmod
	umull	r2, r3, r6, r8
	add	r1, r1, #48
	cmp	r6, #9
	sub	r4, r5, #1
	strb	r1, [r5, #-1]
	lsr	r3, r3, #3
	bhi	.L4		;; extra forward branch
	cmp	r7, #0
	movlt	r3, #45
	strblt	r3, [r4, #-1]
	sublt	r4, r5, #2
	mov	r0, r4
	pop	{r4, r5, r6, r7, r8, pc}
.L4:
	mov	r5, r4
	mov	r6, r3
	b	.L2

whereas without -ftree-forwprop we get auto-dec:

.L2:
	mov	r0, r5
	mov	r1, #10
	bl	__aeabi_uidivmod
	umull	r2, r3, r5, r7
	add	r1, r1, #48
	lsrs	r5, r3, #3
	strb	r1, [r4, #-1]!	;; auto-dec, yay
	bne	.L2
	cmp	r6, #0
	movlt	r3, #45
	strblt	r3, [r4, #-1]!	;; auto-dec, yay
	mov	r0, r4
	pop	{r4, r5, r6, r7, r8, pc}
For the curious, on x86 with -ftree-forwprop we get an additional jump:

inttostr:
.LFB0:
	.cfi_startproc
	movl	%edi, %eax
	movslq	%edx, %rdx
	movl	$-858993459, %r9d
	sarl	$31, %eax
	leaq	-1(%rsi,%rdx), %rsi
	movl	%eax, %ecx
	movb	$0, (%rsi)
	xorl	%edi, %ecx
	subl	%eax, %ecx
	jmp	.L2		;; boo!
	.p2align 4,,10
	.p2align 3
.L4:
	movq	%r8, %rsi
	movl	%edx, %ecx
.L2:
	...
	...
	...
	movb	$45, -1(%r8)
	leaq	-2(%rsi), %r8

Not to mention that the last two instructions are slower (ok, larger) than without forwprop, probably because rsi encodes better than r8:

	movb	$45, -1(%rsi)
	subq	$1, %rsi

Also, the inequality comparison on x86 is shorter than comparing with > 9. But that's probably irrelevant.

All in all:

x86 -ftree-forwprop:    139 bytes
x86 -fno-tree-forwprop: 126 bytes
> OTOH, the following changes things quite a bit on arm:
> 
> <   p_22 = p_19 + 4294967295;
> <   *p_22 = 45;
> ---
> >   p_22 = p_8 + 4294967294;
> >   MEM[(char *)p_19 + 4294967295B] = 45;
> 
> For context, we are using p_8, which was the previous iteration's value of
> the pointer, and then subtracting 2 to return p correctly.  What the heck?
> 
>   <bb 2> [local count: 118111601]:
>   _1 = ABS_EXPR <i_12(D)>;
>   ui_13 = (unsigned int) _1;
>   len.0_2 = (sizetype) len_14(D);
>   _3 = len.0_2 + 4294967295;
>   p_16 = buf_15(D) + _3;
>   *p_16 = 0;
> 
>   <bb 3> [local count: 1073741825]:
>   # ui_7 = PHI <ui_13(2), ui_21(3)>
>   # p_8 = PHI <p_16(2), p_19(3)>
>   _4 = ui_7 % 10;
>   _5 = (char) _4;
>   p_19 = p_8 + 4294967295;
>   _6 = _5 + 48;
>   MEM[base: p_19, offset: 0B] = _6;
>   ui_21 = ui_7 / 10;
>   if (ui_7 > 9)
>     goto <bb 3>; [89.00%]
>   else
>     goto <bb 4>; [11.00%]
> 
>   <bb 4> [local count: 118111601]:
>   if (i_12(D) < 0)
>     goto <bb 5>; [41.00%]
>   else
>     goto <bb 6>; [59.00%]
> 
>   <bb 5> [local count: 48425756]:
>   p_22 = p_8 + 4294967294;
>   MEM[(char *)p_19 + 4294967295B] = 45;
> 
>   <bb 6> [local count: 118111601]:
>   # p_9 = PHI <p_22(5), p_19(4)>
>   return p_9;

forwprop aggressively canonicalizes memory references by forwarding all constant address adjustments into their constant offset part.

What usually makes things complicated in the end is when for an IV we get overlapping life-ranges for the before and after value because that inhibits coalescing.  Is this what happens here?  We do have some code trying to rectify IV-related stuff for coalescing; maybe that can be enhanced to handle these kinds of cases...

I fear that in the end somehow exposing autoinc/dec on GIMPLE might be the only way to truly fix and maintain good code...
So we indeed have p_20 and p_9 live as p_9 is used after the loop. Originally this wasn't the case but fold_stmt in the first forwprop pass does this by means of following use-def chains. As I usually say the user could have written the loop that way so a real fix is to try undoing this somewhere. Sprinkling single_use checks in match.pd (or simple_iv_increment_p-like checks) doesn't look like a sustainable approach to me.
So I suggest looking at insert_backedge_copies () to see whether replacing out-of-loop pre-inc uses with the post-inc value is possible.
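At the C level, the difference between the two shapes seen in the dumps can be sketched as follows (a hand-written illustration with made-up function names, not GCC code; the real transform works on GIMPLE):

/* Shape produced on trunk: p_before is the IV value from before the final
   decrement (p_8), p_after the value after it (p_19).  Both stay live, so
   the ARM backend cannot form a writeback store for the '-'.  */
static char *store_sign_trunk (char *p_before, char *p_after)
{
  p_after[-1] = '-';       /* MEM[(char *)p_19 + 4294967295B] = 45 */
  return p_before - 2;     /* p_22 = p_8 + 4294967294              */
}

/* Shape GCC 5 generated, and what rewriting the out-of-loop use in terms of
   the post-decrement value restores: only p_after is needed, so the store
   can become strb rN, [rX, #-1]!.  */
static char *store_sign_fixed (char *p_after)
{
  char *q = p_after - 1;   /* p_22 = p_19 + 4294967295 */
  *q = '-';
  return q;
}

Both functions compute the same address (p_before equals p_after + 1); the difference is only which value the store address and the return value are expressed through.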
Yea, we went through similar issues with Alex's work recently. After that work landed, I went through all the BZs related to autoinc {pre,post}{inc,dec} that a BZ search could find, in the hopes that his work would fix some of them. Sadly only one was fixed.

So one concern, if you try to expose autoincrement addressing in gimple, is that the RTL optimizers aren't prepared to handle it until combine or later. So there's probably a fair amount of downstream work to do.

rth and I sketched out a pass that would replace the optimize_related_values and autoinc passes using the concepts of PRE and SLSR many years ago. It felt much cleaner than what we've been doing and likely more powerful. But we never went forward with an implementation.

I got the sense from that recent BZ review that we're generally missing a fair number of opportunities, but I didn't dig into all of them to gain an understanding of why we're doing so badly.
(In reply to Richard Biener from comment #25)
> What usually makes things complicated in the end is when for an IV
> we get overlapping life-ranges for the before and after value because
> that inhibits coalescing.  Is this what happens here?
...
...
> So we indeed have p_20 and p_9 live as p_9 is used after the loop.
> Originally this wasn't the case but fold_stmt in the first forwprop pass
> does this by means of following use-def chains.

I'm a bit confused.

Where are you looking, after the forwprop1 pass?  Because after forwprop1 I see:

  <bb 5> :
  p_22 = p_8 + 4294967294;
  MEM[(char *)p_19 + 4294967295B] = 45;

  <bb 6> :
  # p_9 = PHI <p_19(4), p_22(5)>
  return p_9;

Which, as I understand it, has:

p_8 being the IV at the beginning of the last iteration.
p_19 being the IV at the end of the last iteration.
p_22 being the IV at the end of the last iteration MINUS 1.

I don't see a p_20 anywhere.  Did you mean that p_8 and p_19 were both live at the end of the loop?

Thanks for all the feedback BTW.
On Fri, 2 Mar 2018, aldyh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
> 
> --- Comment #28 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #25)
> 
> > What usually makes things complicated in the end is when for an IV
> > we get overlapping life-ranges for the before and after value because
> > that inhibits coalescing.  Is this what happens here?
> ...
> ...
> > So we indeed have p_20 and p_9 live as p_9 is used after the loop.
> > Originally this wasn't the case but fold_stmt in the first forwprop pass
> > does this by means of following use-def chains.
> 
> I'm a bit confused.
> 
> Where are you looking, after the forwprop1 pass?  Because after forwprop1 I see:
> 
>   <bb 5> :
>   p_22 = p_8 + 4294967294;
>   MEM[(char *)p_19 + 4294967295B] = 45;
> 
>   <bb 6> :
>   # p_9 = PHI <p_19(4), p_22(5)>
>   return p_9;
> 
> Which, as I understand it, has:
> 
> p_8 being the IV at the beginning of the last iteration.
> p_19 being the IV at the end of the last iteration.
> p_22 being the IV at the end of the last iteration MINUS 1.
> 
> I don't see a p_20 anywhere.  Did you mean that p_8 and p_19 were both live
> at the end of the loop?

Yes.

Not sure if that captures all of the issues in this bug, but that's clearly bad.
Created attachment 43597 [details]
untested patch implementing suggestion in comment 26

The attached untested patch attempts to implement the suggestion in comment 26 of replacing the out-of-loop pre-inc uses with post-inc values.

Richi, is this more or less what you had in mind?

Assuming this:

  LOOP:
    # p_8 = PHI <p_16(2), p_19(3)>
    ...
    p_19 = p_8 + 4294967295;
    goto LOOP;

the patch replaces:

  p_22 = p_8 + 4294967294;
  MEM[(char *)p_19 + 4294967295B] = 45;

with:

  p_22 = p_19 + 4294967295;
  *p_22 = 45;

This allows the backend to use auto-dec in two places:

  strb    r1, [r4, #-1]!
  ...
  strblt  r3, [r4, #-1]!

...reducing the byte count from 116 to 104, still shy of the 96 needed to eliminate the regression.  I will discuss the missing bytes in a follow-up comment, as they are unrelated to this IV adjustment patch.

It is worth noting that x86 also benefits from a reduction of 3 bytes with this patch, as we remove two lea instructions: one within the loop, and one before returning.  Thus, I believe this is a regression across the board, or at least on multiple architectures.

A few comments...

While I see the benefit of hijacking insert_backedge_copies() for this, I am not a big fan of changing the IL after the last tree dump (*t.optimized), as the modified IL would only be visible in *r.expand.  Could we perhaps move this to another spot?  Say after the last forwprop pass, or perhaps right before expand?  Or perhaps add a *t.final dump right before expand?

As mentioned, this is only a proof of concept.  I made the test rather restrictive.  I suppose we could relax the conditions and generalize it a bit.  There are comments throughout showing what I had in mind.
On Thu, 8 Mar 2018, aldyh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
> 
> Aldy Hernandez <aldyh at gcc dot gnu.org> changed:
> 
>            What    |Removed                       |Added
> ----------------------------------------------------------------------------
>            Assignee|unassigned at gcc dot gnu.org |aldyh at gcc dot gnu.org
> 
> --- Comment #30 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> Created attachment 43597 [details]
>   --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43597&action=edit
> untested patch implementing suggestion in comment 26
> 
> The attached untested patch attempts to implement the suggestion in comment
> 26 of replacing the out-of-loop pre-inc uses with post-inc values.
> 
> Richi, is this more or less what you had in mind?

Yes.

> Assuming this:
> 
>   LOOP:
>     # p_8 = PHI <p_16(2), p_19(3)>
>     ...
>     p_19 = p_8 + 4294967295;
>     goto LOOP;
> 
> the patch replaces:
> 
>   p_22 = p_8 + 4294967294;
>   MEM[(char *)p_19 + 4294967295B] = 45;
> 
> with:
> 
>   p_22 = p_19 + 4294967295;
>   *p_22 = 45;
> 
> This allows the backend to use auto-dec in two places:
> 
>   strb    r1, [r4, #-1]!
>   ...
>   strblt  r3, [r4, #-1]!
> 
> ...reducing the byte count from 116 to 104, still shy of the 96 needed to
> eliminate the regression.  I will discuss the missing bytes in a follow-up
> comment, as they are unrelated to this IV adjustment patch.
> 
> It is worth noting that x86 also benefits from a reduction of 3 bytes with
> this patch, as we remove two lea instructions: one within the loop, and one
> before returning.  Thus, I believe this is a regression across the board, or
> at least on multiple architectures.
> 
> A few comments...
> 
> While I see the benefit of hijacking insert_backedge_copies() for this, I am
> not a big fan of changing the IL after the last tree dump (*t.optimized), as
> the modified IL would only be visible in *r.expand.  Could we perhaps move
> this to another spot?  Say after the last forwprop pass, or perhaps right
> before expand?  Or perhaps add a *t.final dump right before expand?

I don't see a big problem here - but yes, for example doing it during uncprop would be possible as well (moving all of insert_backedge_copies then).  I'd not do this at this point though.

> As mentioned, this is only a proof of concept.  I made the test rather
> restrictive.  I suppose we could relax the conditions and generalize it a
> bit.  There are comments throughout showing what I had in mind.

I'd have not restricted the out-of-loop IV use to IV +- CST but instead did the transform

+ LOOP:
+   # p_8 = PHI <p_16(2), p_INC(3)>
+   ...
+   p_INC = p_8 - 1;
+   goto LOOP;
+ ... p_8 uses ...

to

+ LOOP:
+   # p_8 = PHI <p_16(2), p_INC(3)>
+   ...
+   p_INC = p_8 - 1;
+   goto LOOP;
  newtem_12 = p_INC + 1;   // undo IV increment
  ... out-of-loop p_8 uses replaced with newtem_12 ...

so it would always work if we can undo the IV increment.

The disadvantage might be that we then rely on RTL optimizations to combine the original out-of-loop constant add with the newtem computation, but I guess that's not too much to ask ;)
As mentioned in the previous comment, the proposed patch brings the count down from 116 to 108 on ARM, but is shy of the desired 96.

The missing bytes can be attributed to forwprop folding this (IL expanded for illustration):

  if (ui_7 / 10 != 0)

into:

  if (ui_7 > 9)

More specifically, changing this:

  # ui_7 = PHI <ui_13(2), ui_21(3)>
  ...
  ui_21 = ui_7 / 10;
  if (ui_21 != 0)

into:

  # ui_7 = PHI <ui_13(2), ui_21(3)>
  ...
  ui_21 = ui_7 / 10;
  if (ui_7 > 9)

Inhibiting this optimization brings the byte count down to 92, which is even lower than our 96 boogie man, so it is perhaps worth pursuing (this assumes my proposed patch is also applied).  I'm no expert, but isn't an EQ/NE against 0 preferable to a </> against a non-zero constant?

If so, should we restrict the folding somewhat, or clean this up after the fact?

For reference, the folding (in forwprop) is due to this match.pd pattern:

  /* X / C1 op C2 into a simple range test.  */

...though eliminating it causes another pattern to pick up the slack and do the same:

  /* Transform:
   * (X / Y) == 0 -> X < Y if X, Y are unsigned.
   * (X / Y) != 0 -> X >= Y, if X, Y are unsigned.
   */

Eliminating both patterns "fixes" the problem.

Suggestions welcome :).
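For unsigned values the two forms are equivalent, which is what both patterns rely on; a stand-alone illustration in plain C (not code from the testcase):

/* For unsigned x, x / 10 is nonzero exactly when x >= 10, i.e. x > 9, so
   the exit test can be rewritten to no longer look at the division
   result.  */
unsigned loop_again_div (unsigned x) { return (x / 10) != 0; }
unsigned loop_again_cmp (unsigned x) { return x > 9; }   /* same truth value */

Whether the rewrite is a win depends on context: it shortens the critical path (Segher's point below), but in this function at -Os it accounts for the remaining size regression.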
The "> 9" transform reduces path length. But foe -Os that is often not a good idea.
On Thu, 8 Mar 2018, aldyh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
> 
> --- Comment #32 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> As mentioned in the previous comment, the proposed patch brings the count
> down from 116 to 108 on ARM, but is shy of the desired 96.
> 
> The missing bytes can be attributed to forwprop folding this (IL expanded
> for illustration):
> 
>   if (ui_7 / 10 != 0)
> 
> into:
> 
>   if (ui_7 > 9)
> 
> More specifically, changing this:
> 
>   # ui_7 = PHI <ui_13(2), ui_21(3)>
>   ...
>   ui_21 = ui_7 / 10;
>   if (ui_21 != 0)
> 
> into:
> 
>   # ui_7 = PHI <ui_13(2), ui_21(3)>
>   ...
>   ui_21 = ui_7 / 10;
>   if (ui_7 > 9)
> 
> Inhibiting this optimization brings the byte count down to 92, which is even
> lower than our 96 boogie man, so it is perhaps worth pursuing (this assumes
> my proposed patch is also applied).  I'm no expert, but isn't an EQ/NE
> against 0 preferable to a </> against a non-zero constant?
> 
> If so, should we restrict the folding somewhat, or clean this up after the
> fact?
> 
> For reference, the folding (in forwprop) is due to this match.pd pattern:
> 
>   /* X / C1 op C2 into a simple range test.  */
> 
> ...though eliminating it causes another pattern to pick up the slack and do
> the same:
> 
>   /* Transform:
>    * (X / Y) == 0 -> X < Y if X, Y are unsigned.
>    * (X / Y) != 0 -> X >= Y, if X, Y are unsigned.
>    */
> 
> Eliminating both patterns "fixes" the problem.
> 
> Suggestions welcome :).

In other places where this happened we add single_use () checks to the patterns.  The :s isn't really effective for patterns that simplify to "simple" expressions.  A patch doing that to the two patterns in question would be OK I think.  Something like:

Index: gcc/match.pd
===================================================================
--- gcc/match.pd        (revision 258359)
+++ gcc/match.pd        (working copy)
@@ -1290,11 +1290,12 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 /* X / C1 op C2 into a simple range test.  */
 (for cmp (simple_comparison)
  (simplify
-  (cmp (trunc_div:s @0 INTEGER_CST@1) INTEGER_CST@2)
+  (cmp (trunc_div:s@3 @0 INTEGER_CST@1) INTEGER_CST@2)
   (if (INTEGRAL_TYPE_P (TREE_TYPE (@0))
        && integer_nonzerop (@1)
        && !TREE_OVERFLOW (@1)
-       && !TREE_OVERFLOW (@2))
+       && !TREE_OVERFLOW (@2)
+       && single_use (@3))
    (with { tree lo, hi; bool neg_overflow;
           enum tree_code code = fold_div_compare (cmp, @1, @2, &lo, &hi,
                                                   &neg_overflow); }

though there is at least one constant simplification result where we shouldn't apply this restriction, so moving the single_use check to multiple places below, or factoring things a bit, would be appropriate.  Similar for the other pattern then.
On Thu, 8 Mar 2018, segher at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
> 
> --- Comment #33 from Segher Boessenkool <segher at gcc dot gnu.org> ---
> The "> 9" transform reduces path length. But for -Os that is often not a
> good idea.

Hm, indeed - the > 9 form might be notably faster.  So it's not a clear win to disable the folding here.
WRT the division removal. That seems so profitable that a slight increase in codesize is warranted. So if we fix the other issue and the source of the remaining codesize regressions is the removal of the division, I would consider this BZ resolved.
Hi Richi.

(In reply to rguenther@suse.de from comment #31)

> I'd have not restricted the out-of-loop IV use to IV +- CST but
> instead did the transform
> 
> + LOOP:
> +   # p_8 = PHI <p_16(2), p_INC(3)>
> +   ...
> +   p_INC = p_8 - 1;
> +   goto LOOP;
> + ... p_8 uses ...
> 
> to
> 
> + LOOP:
> +   # p_8 = PHI <p_16(2), p_INC(3)>
> +   ...
> +   p_INC = p_8 - 1;
> +   goto LOOP;
>   newtem_12 = p_INC + 1;   // undo IV increment
>   ... out-of-loop p_8 uses replaced with newtem_12 ...
> 
> so it would always work if we can undo the IV increment.
> 
> The disadvantage might be that we then rely on RTL optimizations
> to combine the original out-of-loop constant add with the
> newtem computation, but I guess that's not too much to ask ;)

It looks like the RTL optimizations have a harder time optimizing things when I take the above approach.

Doing what you suggest, we end up optimizing this (simplified for brevity):

  <bb 3>:
  # p_8 = PHI <p_16(2), p_19(3)>
  p_19 = p_8 + 4294967295;
  if (ui_7 > 9)
    goto <bb 3>; [89.00%]
  ...

  <bb 5>:
  p_22 = p_8 + 4294967294;
  MEM[(char *)p_19 + 4294967295B] = 45;

into this:

  <bb 3>:
  # p_8 = PHI <p_16(2), p_19(3)>
  p_19 = p_8 + 4294967295;
  if (ui_7 > 9)
  ...

  <bb 4>:
  _25 = p_19 + 1;     ;; undo the increment
  ...

  <bb 5>:
  p_22 = _25 + 4294967294;
  MEM[(char *)_25 + 4294967294B] = 45;

I haven't dug into the RTL optimizations, but the end result is that we only get one auto-dec inside the loop, and some -2 indexing outside of it:

	strb	r1, [r4, #-1]!
	lsr	r3, r3, #3
	bhi	.L4
	cmp	r6, #0
	movlt	r2, #45
	add	r3, r4, #1
	strblt	r2, [r3, #-2]
	sublt	r4, r4, #1

as opposed to mine:

  <bb 3>:
  # p_8 = PHI <p_16(2), p_19(3)>
  p_19 = p_8 + 4294967295;
  if (ui_7 > 9)
  ...

  <bb 5>:
  p_22 = p_19 + 4294967295;
  *p_22 = 45;

which gives us two auto-decs and much tighter code:

	strb	r1, [r4, #-1]!
	lsr	r3, r3, #3
	bhi	.L4
	cmp	r6, #0
	movlt	r3, #45
	strblt	r3, [r4, #-1]!

Would it be OK to go with my approach, or is it worth looking into the RTL optimizers and seeing what can be done (boo! :))?

Thanks.
On Thu, 15 Mar 2018, aldyh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
> 
> --- Comment #37 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> Hi Richi.
> 
> (In reply to rguenther@suse.de from comment #31)
> 
> > I'd have not restricted the out-of-loop IV use to IV +- CST but
> > instead did the transform
> > 
> > + LOOP:
> > +   # p_8 = PHI <p_16(2), p_INC(3)>
> > +   ...
> > +   p_INC = p_8 - 1;
> > +   goto LOOP;
> > + ... p_8 uses ...
> > 
> > to
> > 
> > + LOOP:
> > +   # p_8 = PHI <p_16(2), p_INC(3)>
> > +   ...
> > +   p_INC = p_8 - 1;
> > +   goto LOOP;
> >   newtem_12 = p_INC + 1;   // undo IV increment
> >   ... out-of-loop p_8 uses replaced with newtem_12 ...
> > 
> > so it would always work if we can undo the IV increment.
> > 
> > The disadvantage might be that we then rely on RTL optimizations
> > to combine the original out-of-loop constant add with the
> > newtem computation, but I guess that's not too much to ask ;)
> 
> It looks like the RTL optimizations have a harder time optimizing things
> when I take the above approach.
> 
> Doing what you suggest, we end up optimizing this (simplified for brevity):
> 
>   <bb 3>:
>   # p_8 = PHI <p_16(2), p_19(3)>
>   p_19 = p_8 + 4294967295;
>   if (ui_7 > 9)
>     goto <bb 3>; [89.00%]
>   ...
> 
>   <bb 5>:
>   p_22 = p_8 + 4294967294;
>   MEM[(char *)p_19 + 4294967295B] = 45;
> 
> into this:
> 
>   <bb 3>:
>   # p_8 = PHI <p_16(2), p_19(3)>
>   p_19 = p_8 + 4294967295;
>   if (ui_7 > 9)
>   ...
> 
>   <bb 4>:
>   _25 = p_19 + 1;     ;; undo the increment
>   ...
> 
>   <bb 5>:
>   p_22 = _25 + 4294967294;
>   MEM[(char *)_25 + 4294967294B] = 45;

Shouldn't the p_19 in the MEM remain?  Why a new BB for the adjustment? (just asking...)

> I haven't dug into the RTL optimizations, but the end result is that we only
> get one auto-dec inside the loop, and some -2 indexing outside of it:
> 
> 	strb	r1, [r4, #-1]!
> 	lsr	r3, r3, #3
> 	bhi	.L4
> 	cmp	r6, #0
> 	movlt	r2, #45
> 	add	r3, r4, #1
> 	strblt	r2, [r3, #-2]
> 	sublt	r4, r4, #1
> 
> as opposed to mine:
> 
>   <bb 3>:
>   # p_8 = PHI <p_16(2), p_19(3)>
>   p_19 = p_8 + 4294967295;
>   if (ui_7 > 9)
>   ...
> 
>   <bb 5>:
>   p_22 = p_19 + 4294967295;
>   *p_22 = 45;
> 
> which gives us two auto-decs and much tighter code:
> 
> 	strb	r1, [r4, #-1]!
> 	lsr	r3, r3, #3
> 	bhi	.L4
> 	cmp	r6, #0
> 	movlt	r3, #45
> 	strblt	r3, [r4, #-1]!
> 
> Would it be OK to go with my approach, or is it worth looking into the RTL
> optimizers and seeing what can be done (boo! :))?

Would be interesting to see what is causing things to break.  If it's only combine doing the trick, then it will obviously be the multiple uses of the adjustment pseudo.  But eventually I thought fwprop would come to the rescue...

I think a good approach would be, instead of just replacing all out-of-loop uses with the new SSA name doing the adjustment, to check whether the use is in a "simple" (thus SSA name +- CST) stmt and forward the adjustment there.

So have the best of both worlds.  I would have hoped this extra complication isn't needed, but as you figured...
(In reply to rguenther@suse.de from comment #38)
> On Thu, 15 Mar 2018, aldyh at gcc dot gnu.org wrote:
> 
> > Doing what you suggest, we end up optimizing this (simplified for brevity):
> > 
> >   <bb 3>:
> >   # p_8 = PHI <p_16(2), p_19(3)>
> >   p_19 = p_8 + 4294967295;
> >   if (ui_7 > 9)
> >     goto <bb 3>; [89.00%]
> >   ...
> > 
> >   <bb 5>:
> >   p_22 = p_8 + 4294967294;
> >   MEM[(char *)p_19 + 4294967295B] = 45;
> > 
> > into this:
> > 
> >   <bb 3>:
> >   # p_8 = PHI <p_16(2), p_19(3)>
> >   p_19 = p_8 + 4294967295;
> >   if (ui_7 > 9)
> >   ...
> > 
> >   <bb 4>:
> >   _25 = p_19 + 1;     ;; undo the increment
> >   ...
> > 
> >   <bb 5>:
> >   p_22 = _25 + 4294967294;
> >   MEM[(char *)_25 + 4294967294B] = 45;
> 
> Shouldn't the p_19 in the MEM remain?  Why a new BB for the adjustment?
> (just asking...)

Oh, I was adjusting/propagating that one as a special case to see if the [_25 + 4294967294] could be CSE'd somehow by the RTL optimizers.  But even with the p_19 remaining, the result is the same in the assembly.  For the record, with the p_19 in place, we would get (see below for a more complete dump):

  p_22 = _25 + 4294967294;
  MEM[(char *)p_19 + 4294967295B] = 45;

There is no new BB.  That was the result of adding the TMP on the non-back edge.  My simplification of the gimple IL for illustration purposes made that bit unclear.

The unadulterated original was:

  <bb 3> [local count: 1073741825]:
  # ui_7 = PHI <ui_13(2), ui_21(3)>
  # p_8 = PHI <p_16(2), p_19(3)>
  _4 = ui_7 % 10;
  _5 = (char) _4;
  p_19 = p_8 + 4294967295;
  _6 = _5 + 48;
  MEM[base: p_19, offset: 0B] = _6;
  ui_21 = ui_7 / 10;
  if (ui_7 > 9)
    goto <bb 3>; [89.00%]
  else
    goto <bb 4>; [11.00%]

  <bb 4> [local count: 118111601]:
  if (i_12(D) < 0)
    goto <bb 5>; [41.00%]
  else
    goto <bb 6>; [59.00%]

  <bb 5> [local count: 48425756]:
  p_22 = p_8 + 4294967294;
  MEM[(char *)p_19 + 4294967295B] = 45;

with a corresponding transformation (keeping the p_19 as you inquired) of:

  <bb 3> [local count: 1073741825]:
  # ui_7 = PHI <ui_13(2), ui_24(3)>
  # p_8 = PHI <p_16(2), p_19(3)>
  _4 = ui_7 % 10;
  _5 = (char) _4;
  p_19 = p_8 + 4294967295;
  _6 = _5 + 48;
  MEM[base: p_19, offset: 0B] = _6;
  ui_21 = ui_7 / 10;
  ui_24 = ui_21;
  if (ui_7 > 9)
    goto <bb 3>; [89.00%]
  else
    goto <bb 4>; [11.00%]

  <bb 4> [local count: 118111601]:
  _25 = p_19 + 1;   ;; undo the increment on the non-back edge
  if (i_12(D) < 0)
    goto <bb 5>; [41.00%]
  else
    goto <bb 6>; [59.00%]

  <bb 5> [local count: 48425756]:
  p_22 = _25 + 4294967294;
  MEM[(char *)p_19 + 4294967295B] = 45;

> > I haven't dug into the RTL optimizations, but the end result is that we
> > only get one auto-dec inside the loop, and some -2 indexing outside of it:
> > 
> > 	strb	r1, [r4, #-1]!
> > 	lsr	r3, r3, #3
> > 	bhi	.L4
> > 	cmp	r6, #0
> > 	movlt	r2, #45
> > 	add	r3, r4, #1
> > 	strblt	r2, [r3, #-2]
> > 	sublt	r4, r4, #1
> > 
> > as opposed to mine:
> > 
> >   <bb 3>:
> >   # p_8 = PHI <p_16(2), p_19(3)>
> >   p_19 = p_8 + 4294967295;
> >   if (ui_7 > 9)
> >   ...
> > 
> >   <bb 5>:
> >   p_22 = p_19 + 4294967295;
> >   *p_22 = 45;
> > 
> > which gives us two auto-decs and much tighter code:
> > 
> > 	strb	r1, [r4, #-1]!
> > 	lsr	r3, r3, #3
> > 	bhi	.L4
> > 	cmp	r6, #0
> > 	movlt	r3, #45
> > 	strblt	r3, [r4, #-1]!
> > 
> > Would it be OK to go with my approach, or is it worth looking into the RTL
> > optimizers and seeing what can be done (boo! :))?
> 
> Would be interesting to see what is causing things to break.  If it's only
> combine doing the trick, then it will obviously be the multiple uses of the
> adjustment pseudo.  But eventually I thought fwprop would come to the
> rescue...

I could post the WIP for this approach if anyone is interested.

> I think a good approach would be, instead of just replacing all out-of-loop
> uses with the new SSA name doing the adjustment, to check whether the use is
> in a "simple" (thus SSA name +- CST) stmt and forward the adjustment there.
> 
> So have the best of both worlds.  I would have hoped this extra complication
> isn't needed, but as you figured...

Hmmm, can we go with my first patch (cleaned up) instead, as it's already working?

Thanks.
On Thu, 15 Mar 2018, aldyh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
> 
> --- Comment #39 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #38)
> > Shouldn't the p_19 in the MEM remain?  Why a new BB for the adjustment?
> > (just asking...)
> 
> Oh, I was adjusting/propagating that one as a special case to see if the
> [_25 + 4294967294] could be CSE'd somehow by the RTL optimizers.  But even
> with the p_19 remaining, the result is the same in the assembly.  For the
> record, with the p_19 in place, we would get (see below for a more complete
> dump):
> 
>   p_22 = _25 + 4294967294;
>   MEM[(char *)p_19 + 4294967295B] = 45;
> 
> There is no new BB.  That was the result of adding the TMP on the non-back
> edge.  My simplification of the gimple IL for illustration purposes made
> that bit unclear.
> 
> The unadulterated original was:
> 
>   <bb 3> [local count: 1073741825]:
>   # ui_7 = PHI <ui_13(2), ui_21(3)>
>   # p_8 = PHI <p_16(2), p_19(3)>
>   _4 = ui_7 % 10;
>   _5 = (char) _4;
>   p_19 = p_8 + 4294967295;
>   _6 = _5 + 48;
>   MEM[base: p_19, offset: 0B] = _6;
>   ui_21 = ui_7 / 10;
>   if (ui_7 > 9)
>     goto <bb 3>; [89.00%]
>   else
>     goto <bb 4>; [11.00%]
> 
>   <bb 4> [local count: 118111601]:
>   if (i_12(D) < 0)
>     goto <bb 5>; [41.00%]
>   else
>     goto <bb 6>; [59.00%]
> 
>   <bb 5> [local count: 48425756]:
>   p_22 = p_8 + 4294967294;
>   MEM[(char *)p_19 + 4294967295B] = 45;
> 
> with a corresponding transformation (keeping the p_19 as you inquired) of:
> 
>   <bb 3> [local count: 1073741825]:
>   # ui_7 = PHI <ui_13(2), ui_24(3)>
>   # p_8 = PHI <p_16(2), p_19(3)>
>   _4 = ui_7 % 10;
>   _5 = (char) _4;
>   p_19 = p_8 + 4294967295;
>   _6 = _5 + 48;
>   MEM[base: p_19, offset: 0B] = _6;
>   ui_21 = ui_7 / 10;
>   ui_24 = ui_21;
>   if (ui_7 > 9)
>     goto <bb 3>; [89.00%]
>   else
>     goto <bb 4>; [11.00%]
> 
>   <bb 4> [local count: 118111601]:
>   _25 = p_19 + 1;   ;; undo the increment on the non-back edge
>   if (i_12(D) < 0)
>     goto <bb 5>; [41.00%]
>   else
>     goto <bb 6>; [59.00%]
> 
>   <bb 5> [local count: 48425756]:
>   p_22 = _25 + 4294967294;
>   MEM[(char *)p_19 + 4294967295B] = 45;
> 
> I could post the WIP for this approach if anyone is interested.
> 
> Hmmm, can we go with my first patch (cleaned up) instead, as it's already
> working?

Well, your patch only replaces increments it can modify, possibly leaving uses unaltered.  That's IMHO not good.

Which is why I suggested to have it like your original patch for those uses you can modify, but for other uses simply substitute the LHS of the compensation stmt.

If the compensation stmt ends up not being used it won't be expanded, I think (you could of course insert/build it just on demand).
(In reply to rguenther@suse.de from comment #40)
> Well, your patch only replaces increments it can modify, possibly leaving
> uses unaltered.  That's IMHO not good.
> 
> Which is why I suggested to have it like your original patch for those uses
> you can modify, but for other uses simply substitute the LHS of the
> compensation stmt.
> 
> If the compensation stmt ends up not being used it won't be expanded, I
> think (you could of course insert/build it just on demand).

Ah, I had misunderstood.  I will do as you suggest and then take the patch upstream so we can finish the rest of the discussion there.

Thanks so much for the feedback.
(In reply to Jeffrey A. Law from comment #36)
> WRT the division removal.  That seems so profitable that a slight increase
> in codesize is warranted.  So if we fix the other issue and the source of
> the remaining codesize regressions is the removal of the division, I would
> consider this BZ resolved.

Richi, Jeff:

Requiring single_use only when optimizing for size may give us the best of both worlds.  Would you like me to post the [untested] patch below upstream?  With this patch, code size is even smaller than with GCC 5.3.

I really don't care either way.  Actually, I'd prefer to do nothing and close the PR ;-).  Up to y'all.

diff --git a/gcc/match.pd b/gcc/match.pd
index f61c4d7440a..5d29bf62dc9 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -1290,11 +1290,12 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 /* X / C1 op C2 into a simple range test.  */
 (for cmp (simple_comparison)
  (simplify
-  (cmp (trunc_div:s @0 INTEGER_CST@1) INTEGER_CST@2)
+  (cmp (trunc_div:s@3 @0 INTEGER_CST@1) INTEGER_CST@2)
   (if (INTEGRAL_TYPE_P (TREE_TYPE (@0))
        && integer_nonzerop (@1)
        && !TREE_OVERFLOW (@1)
-       && !TREE_OVERFLOW (@2))
+       && !TREE_OVERFLOW (@2)
+       && (!optimize_size || single_use (@3)))
    (with { tree lo, hi; bool neg_overflow;
           enum tree_code code = fold_div_compare (cmp, @1, @2, &lo, &hi,
                                                   &neg_overflow); }
@@ -1456,9 +1457,10 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 (for cmp (eq ne)
      ocmp (lt ge)
  (simplify
-  (cmp (trunc_div @0 @1) integer_zerop)
+  (cmp (trunc_div@2 @0 @1) integer_zerop)
   (if (TYPE_UNSIGNED (TREE_TYPE (@0))
-       && (VECTOR_TYPE_P (type) || !VECTOR_TYPE_P (TREE_TYPE (@0))))
+       && (VECTOR_TYPE_P (type) || !VECTOR_TYPE_P (TREE_TYPE (@0)))
+       && (!optimize_size || single_use (@2)))
    (ocmp @0 @1))))
 
 /* X == C - X can never be true if C is odd. */
On Mon, 19 Mar 2018, aldyh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
> 
> --- Comment #42 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> (In reply to Jeffrey A. Law from comment #36)
> > WRT the division removal.  That seems so profitable that a slight increase
> > in codesize is warranted.  So if we fix the other issue and the source of
> > the remaining codesize regressions is the removal of the division, I would
> > consider this BZ resolved.
> 
> Richi, Jeff:
> 
> Requiring single_use only when optimizing for size may give us the best of
> both worlds.  Would you like me to post the [untested] patch below upstream?
> With this patch, code size is even smaller than with GCC 5.3.
> 
> I really don't care either way.  Actually, I'd prefer to do nothing and
> close the PR ;-).  Up to y'all.

I don't like to see plain optimize_size uses in match.pd, so yeah, just close the PR.

Disclaimer: if we ever need to do something like that, we should wrap it in something like optimize_pattern_for_size () to be able to eventually pick up local profiling info.

> diff --git a/gcc/match.pd b/gcc/match.pd
> index f61c4d7440a..5d29bf62dc9 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -1290,11 +1290,12 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  /* X / C1 op C2 into a simple range test.  */
>  (for cmp (simple_comparison)
>   (simplify
> -  (cmp (trunc_div:s @0 INTEGER_CST@1) INTEGER_CST@2)
> +  (cmp (trunc_div:s@3 @0 INTEGER_CST@1) INTEGER_CST@2)
>    (if (INTEGRAL_TYPE_P (TREE_TYPE (@0))
>         && integer_nonzerop (@1)
>         && !TREE_OVERFLOW (@1)
> -       && !TREE_OVERFLOW (@2))
> +       && !TREE_OVERFLOW (@2)
> +       && (!optimize_size || single_use (@3)))
>     (with { tree lo, hi; bool neg_overflow;
>            enum tree_code code = fold_div_compare (cmp, @1, @2, &lo, &hi,
>                                                    &neg_overflow); }
> @@ -1456,9 +1457,10 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  (for cmp (eq ne)
>       ocmp (lt ge)
>   (simplify
> -  (cmp (trunc_div @0 @1) integer_zerop)
> +  (cmp (trunc_div@2 @0 @1) integer_zerop)
>    (if (TYPE_UNSIGNED (TREE_TYPE (@0))
> -       && (VECTOR_TYPE_P (type) || !VECTOR_TYPE_P (TREE_TYPE (@0))))
> +       && (VECTOR_TYPE_P (type) || !VECTOR_TYPE_P (TREE_TYPE (@0)))
> +       && (!optimize_size || single_use (@2)))
>     (ocmp @0 @1))))
>  
>  /* X == C - X can never be true if C is odd. */
(In reply to rguenther@suse.de from comment #43)
> I don't like to see plain optimize_size uses in match.pd, so yeah, just
> close the PR.

Excellent.

> Disclaimer: if we ever need to do something like that, we should wrap it in
> something like optimize_pattern_for_size () to be able to eventually pick up
> local profiling info.

Noted.  Thanks for all your help on this.
Proposed patch: https://gcc.gnu.org/ml/gcc-patches/2018-03/msg00885.html Richi has made some suggestions and will take it from here. Thanks.
GCC 6 branch is being closed
Author: rguenth
Date: Wed Oct 31 11:57:33 2018
New Revision: 265677

URL: https://gcc.gnu.org/viewcvs?rev=265677&root=gcc&view=rev
Log:
2018-10-31  Richard Biener  <rguenther@suse.de>

	PR middle-end/70359
	PR middle-end/86270
	* tree-outof-ssa.c (insert_backedge_copies): Restrict
	copy generation to useful cases.  Place the copy before
	the definition of the backedge value when possible.

	* gcc.target/i386/pr70359.c: New testcase.
	* gcc.target/i386/pr86270.c: Likewise.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr70359.c
    trunk/gcc/testsuite/gcc.target/i386/pr86270.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-outof-ssa.c
Per c#43 and c#48.