Bug 70359 - [7/8/9 Regression] Code size increase for x86/ARM/others compared to gcc-5.3.0
Summary: [7/8/9 Regression] Code size increase for x86/ARM/others compared to gcc-5.3.0
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 6.0
Importance: P2 normal
Target Milestone: 7.4
Assignee: Aldy Hernandez
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2016-03-22 13:49 UTC by Fredrik Hederstierna
Modified: 2018-10-31 11:58 UTC
CC: 12 users

See Also:
Host:
Target: arm-none-eabi, arm-none-linux-gnueabi, x86*
Build:
Known to work: 5.3.0
Known to fail:
Last reconfirmed: 2016-03-23 00:00:00


Attachments
inttostr.c (161 bytes, text/plain)
2016-03-22 13:49 UTC, Fredrik Hederstierna
tok.c (102 bytes, text/plain)
2016-04-04 22:00 UTC, Fredrik Hederstierna
untested patch implementing suggestion in comment 26 (1.66 KB, patch)
2018-03-08 10:15 UTC, Aldy Hernandez

Description Fredrik Hederstierna 2016-03-22 13:49:11 UTC
Created attachment 38058 [details]
inttostr.c

Code size increase on master for ARM target compared to gcc-5.3.0
Target: arm-none-eabi
Flags: -Os -mcpu=arm966e-s -marm

gcc 5.3.0:

00000000 <inttostr>:
   0:   e3a03000        mov     r3, #0
   4:   e92d4070        push    {r4, r5, r6, lr}
   8:   e1a06000        mov     r6, r0
   c:   e2422001        sub     r2, r2, #1
  10:   e0205fc0        eor     r5, r0, r0, asr #31
  14:   e0455fc0        sub     r5, r5, r0, asr #31
  18:   e0814002        add     r4, r1, r2
  1c:   e7c13002        strb    r3, [r1, r2]
  20:   e1a00005        mov     r0, r5
  24:   e3a0100a        mov     r1, #10
  28:   ebfffffe        bl      0 <__aeabi_uidivmod>
  2c:   e2811030        add     r1, r1, #48     ; 0x30
  30:   e5641001        strb    r1, [r4, #-1]!
  34:   e1a00005        mov     r0, r5
  38:   e3a0100a        mov     r1, #10
  3c:   ebfffffe        bl      0 <__aeabi_uidiv>
  40:   e2505000        subs    r5, r0, #0
  44:   1afffff5        bne     20 <inttostr+0x20>
  48:   e3560000        cmp     r6, #0
  4c:   b3a0302d        movlt   r3, #45 ; 0x2d
  50:   b5443001        strblt  r3, [r4, #-1]
  54:   b2444001        sublt   r4, r4, #1
  58:   e1a00004        mov     r0, r4
  5c:   e8bd8070        pop     {r4, r5, r6, pc}


gcc-6-20160313 snapshot from master:

00000000 <inttostr>:
   0:   e3a03000        mov     r3, #0
   4:   e92d41f0        push    {r4, r5, r6, r7, r8, lr}
   8:   e1a07000        mov     r7, r0
   c:   e3a0800a        mov     r8, #10
  10:   e2422001        sub     r2, r2, #1
  14:   e0206fc0        eor     r6, r0, r0, asr #31
  18:   e0466fc0        sub     r6, r6, r0, asr #31
  1c:   e0815002        add     r5, r1, r2
  20:   e7c13002        strb    r3, [r1, r2]
  24:   e1a00006        mov     r0, r6
  28:   e1a01008        mov     r1, r8
  2c:   ebfffffe        bl      0 <__aeabi_uidivmod>
  30:   e2811030        add     r1, r1, #48     ; 0x30
  34:   e5451001        strb    r1, [r5, #-1]
  38:   e1a00006        mov     r0, r6
  3c:   e1a01008        mov     r1, r8
  40:   ebfffffe        bl      0 <__aeabi_uidiv>
  44:   e2506000        subs    r6, r0, #0
  48:   e2454001        sub     r4, r5, #1
  4c:   1a000005        bne     68 <inttostr+0x68>
  50:   e3570000        cmp     r7, #0
  54:   b3a0302d        movlt   r3, #45 ; 0x2d
  58:   b5443001        strblt  r3, [r4, #-1]
  5c:   b2454002        sublt   r4, r5, #2
  60:   e1a00004        mov     r0, r4
  64:   e8bd81f0        pop     {r4, r5, r6, r7, r8, pc}
  68:   e1a05004        mov     r5, r4
  6c:   eaffffec        b       24 <inttostr+0x24>
Comment 1 Richard Biener 2016-03-22 14:19:46 UTC
My #1 bet would be FSM threading.
Comment 2 Andrew Pinski 2016-03-22 15:01:46 UTC
(In reply to Richard Biener from comment #1)
> My #1 bet would be FSM threading.

I doubt it. If I read the asm differences correctly, GCC 6 just no longer does the store with post-increment, and that causes register allocation differences.  AKA there is no extra jump threading; just extra moves.  Like:

  68:   e1a05004        mov     r5, r4
  6c:   eaffffec        b       24 <inttostr+0x24>

so it might be caused by out of SSA differences too :)
Comment 3 Jeffrey A. Law 2016-03-22 15:18:00 UTC
Sorry Richi, there are no jump threads in this code, so it's not the root cause of any code size regressions.
Comment 4 Ramana Radhakrishnan 2016-03-23 12:38:47 UTC
Confirmed. There is a size increase going from GCC 5 to GCC 6 - the delta varies slightly with cpu tuning options (8 bytes with -march=armv7-a -marm and 12 bytes with -mcpu=arm966e-s)
Comment 5 Jakub Jelinek 2016-03-24 15:27:59 UTC
I see multiple points of the increases:
r224048 added 4 bytes.
r224995 added 8 bytes.
r228318 added 4 bytes.

Perhaps the middle one could change from
(if (single_use (@2)
     || (TREE_CODE (@1) == INTEGER_CST && TREE_CODE (@3) == INTEGER_CST))
to
(if (single_use (@2)
     || (!optimize_size && TREE_CODE (@1) == INTEGER_CST && TREE_CODE (@3) == INTEGER_CST))
(or some optimize*for_speed*)?
CCing authors of the other commits.  That said, size regressions are generally worth complaining about only if they (significantly) increase the size of some larger codebase (CSiBE, some large app, ...); looking at a single function only, there will always be improvements and degradations, since the compiler is changing.
Comment 6 Fredrik Hederstierna 2016-03-24 18:56:33 UTC
Thanks for your analysis of this. One comment on this 'complaint': it's not only about size - in my example the compiler pushes and pops two more registers and uses several more instructions, so I think it causes performance regressions as well. I can file such 'complaints' as performance degradations next time if that is better.

Actually, this was derived from a larger code-base analysis (CSiBE):

Bug 61578 - [4.9 regression] Code size increase for ARM thumb compared to 4.8.x when compiling with -Os

where the CSiBE problems were analysed. That issue unfortunately got too fuzzy, since it is hard to define an issue spanning almost 1000 files; the conclusion was then to create separate, smaller examples to work on, because the overall CSiBE benchmark was hard to get an overview of.

I understand that much of the focus in GCC development is on speed, but I do think that size really is important as well, since GCC is also used very widely for typically ARM-based embedded systems.

I will continue to try to track size on CSiBE at http://gcc.hederstierna.com/csibe, but please comment if you think size regressions are unwanted reports; then I can try to focus more on issues that have both speed and size components combined (I guess they often go hand in hand).

Thanks, BR, Fredrik
Comment 7 Jeffrey A. Law 2016-03-24 19:53:55 UTC
I would suggest that you definitely keep reporting these things and extracting examples from csibe or other benchmarks to show the codesize increases.

While some folks will prioritize performance, it doesn't mean that codesize is ignored.  Similarly for significant changes in compile-time, stack usage, etc.
Comment 8 Jakub Jelinek 2016-03-24 20:46:22 UTC
All I wanted to say is that it is to be expected that on some code a newer GCC version ends up needing one or two more instructions, even at -Os, and what matters is not the size of a single function, but larger codebases.  If the testcase here has been extracted from CSiBE and/or represents something that happens often, then it is of course desirable to fix that.
So e.g. it would be interesting to find out what impact the above-mentioned commits have on larger codebases.  If they are in the end neutral or shrink code size, then one testcase going in the other direction is fine; if they have overall negative -Os code size effects, then it is something to work on.
Comment 9 Segher Boessenkool 2016-03-25 00:33:51 UTC
(In reply to Jakub Jelinek from comment #5)
> CCing authors of the other commits.  That said, size regressions are
> generally worth complaining about only if they (significantly) increase
> the size of some larger codebase (CSiBE, some large app, ...); looking
> at a single function only, there will always be improvements and
> degradations, since the compiler is changing.

A few comments about "my" patch (r228318), the reorder-blocks thing.
First of all, almost all it does is heuristic, so exactly as Jakub
says you have to look at bigger tests.  This patch makes -freorder-blocks
use a new algorithm ("simple") for -Os; after a few follow-up patches
this consistently gives smaller code.  The big difference with the old
("stc") algorithm is that that one can duplicate blocks, which sometimes
results in smaller code, while "simple" never does.  It may be interesting
to look at before and after of just this patch, to see if the heuristics
can be improved.
Comment 10 kugan 2016-03-26 08:08:59 UTC
I am looking into it.  -mcpu=arm966e-s does not use TARGET_NEW_GENERIC_COSTS, i.e. the rtx costs set up by the back-end might not be optimal.
Comment 11 kugan 2016-03-27 10:17:34 UTC
Optimized gimple diff between 5.3 and trunk is:

-;; Function inttostr (inttostr, funcdef_no=0, decl_uid=5268, cgraph_uid=0, symbol_order=0)
+;; Function inttostr (inttostr, funcdef_no=0, decl_uid=4222, cgraph_uid=0, symbol_order=0)
 
 Removing basic block 7
 Removing basic block 8
@@ -43,7 +43,7 @@
     goto <bb 6>;
 
   <bb 5>:
-  p_22 = p_2 + 4294967294;
+  p_22 = p_16 + 4294967295;
   MEM[(char *)p_16 + 4294967295B] = 45;
 
   <bb 6>:
Comment 12 kugan 2016-03-27 10:22:22 UTC
However, the diff of cfgexpand is significantly different:
 ;; Full RTL generated for this function:
 ;;
    32: NOTE_INSN_DELETED
-   38: NOTE_INSN_BASIC_BLOCK 2
+   39: NOTE_INSN_BASIC_BLOCK 2
    33: r151:SI=r0:SI
    34: r152:SI=r1:SI
    35: r153:SI=r2:SI
    36: NOTE_INSN_FUNCTION_BEG
-   40: {r141:SI=abs(r151:SI);clobber cc:CC;}
-   41: r154:SI=r153:SI-0x1
-   42: r142:SI=r152:SI+r154:SI
-   43: r155:SI=0
-   44: r156:QI=r155:SI#0
-   45: [r142:SI]=r156:QI
-   61: L61:
-   46: NOTE_INSN_BASIC_BLOCK 4
-   47: r142:SI=r142:SI-0x1
-   48: r1:SI=0xa
-   49: r0:SI=r141:SI
-   50: r0:DI=call [`__aeabi_uidivmod'] argc:0
+   41: {r141:SI=abs(r151:SI);clobber cc:CC;}
+   42: r154:SI=r153:SI-0x1
+   43: r142:SI=r152:SI+r154:SI
+   44: r155:SI=0
+   45: r156:QI=r155:SI#0
+   46: [r142:SI]=r156:QI
+   81: pc=L62
+   82: barrier
+   84: L84:
+   83: NOTE_INSN_BASIC_BLOCK 4
+   37: r142:SI=r150:SI
+   62: L62:
+   47: NOTE_INSN_BASIC_BLOCK 5
+   48: r150:SI=r142:SI-0x1
+   49: r1:SI=0xa
+   50: r0:SI=r141:SI
+   51: r0:DI=call [`__aeabi_uidivmod'] argc:0
       REG_CALL_DECL `__aeabi_uidivmod'
       REG_EH_REGION 0xffffffff80000000
-   51: r162:SI=r1:SI
+   52: r162:SI=r1:SI
       REG_EQUAL umod(r141:SI,0xa)
-   52: r163:QI=r162:SI#0
-   53: r164:SI=r163:QI#0+0x30
-   54: r165:QI=r164:SI#0
-   55: [r142:SI]=r165:QI
-   56: r1:SI=0xa
-   57: r0:SI=r141:SI
-   58: r0:SI=call [`__aeabi_uidiv'] argc:0
+   53: r163:QI=r162:SI#0
+   54: r164:SI=r163:QI#0+0x30
+   55: r165:QI=r164:SI#0
+   56: [r150:SI]=r165:QI
+   57: r1:SI=0xa
+   58: r0:SI=r141:SI
+   59: r0:SI=call [`__aeabi_uidiv'] argc:0
       REG_CALL_DECL `__aeabi_uidiv'
       REG_EH_REGION 0xffffffff80000000
-   59: r169:SI=r0:SI
+   60: r169:SI=r0:SI
       REG_EQUAL udiv(r141:SI,0xa)
-   60: r141:SI=r169:SI
-   62: cc:CC=cmp(r141:SI,0)
-   63: pc={(cc:CC!=0)?L61:pc}
+   61: r141:SI=r169:SI
+   63: cc:CC=cmp(r141:SI,0)
+   64: pc={(cc:CC!=0)?L84:pc}
       REG_BR_PROB 9100
-   64: NOTE_INSN_BASIC_BLOCK 5
-   65: cc:CC=cmp(r151:SI,0)
-   66: pc={(cc:CC>=0)?L72:pc}
+   65: NOTE_INSN_BASIC_BLOCK 6
+   66: cc:CC=cmp(r151:SI,0)
+   67: pc={(cc:CC>=0)?L77:pc}
       REG_BR_PROB 6335
-   67: NOTE_INSN_BASIC_BLOCK 6
-   68: r149:SI=r142:SI-0x1
-   69: r170:SI=0x2d
-   70: r171:QI=r170:SI#0
-   71: [r142:SI-0x1]=r171:QI
-   37: r142:SI=r149:SI
-   72: L72:
-   73: NOTE_INSN_BASIC_BLOCK 7
-   74: r150:SI=r142:SI
+   68: NOTE_INSN_BASIC_BLOCK 7
+   69: r149:SI=r142:SI-0x2
+   70: r170:SI=0x2d
+   71: r171:QI=r170:SI#0
+   72: [r150:SI-0x1]=r171:QI
+   38: r150:SI=r149:SI
+   77: L77:
+   80: NOTE_INSN_BASIC_BLOCK 9
    78: r0:SI=r150:SI
    79: use r0:SI
Comment 13 Jeffrey A. Law 2016-03-29 15:56:15 UTC
The change to the assignment of p_22 is made by forwprop1.

It does create a situation where p_2 is live outside the loop and hides the CSE opportunity, which may be the cause of the more significant differences at expansion time.
Comment 14 kugan 2016-03-30 00:36:53 UTC
(In reply to Jeffrey A. Law from comment #13)
> The change to the assignment of p_22 is made by forwprop1.
> 
> It does create a situation where p_2 is live outside the loop and hides the
> CSE opportunity, which may be the cause of the more significant differences
> at expansion time.

Indeed. This is what I see: 

gcc 5 branch with -O2 t2.c -S -Os -marm -mcpu=arm966e-s:                          96 bytes
trunk with -O2 t2.c -c -Os -marm -mcpu=arm966e-s:                                112 bytes
trunk with -O2 t2.c -c -Os -marm:                                                108 bytes
trunk with -O2 t2.c -c -Os -marm -mcpu=arm966e-s -fno-tree-forwprop:              96 bytes
trunk with -O2 t2.c -c -Os -marm -mcpu=arm966e-s and Jakub's change in comment 5: 100 bytes
Comment 15 Fredrik Hederstierna 2016-04-04 22:00:17 UTC
Created attachment 38185 [details]
tok.c

I took another example from CSiBE and stripped it down. I'm not 100% sure it is the exact same issue, but it looks similar, so I attached it.

This gives bigger code for -Os -mcpu=arm966e-s -marm, and also for -mcpu=cortex-m3, though arm7tdmi and cortex-m0 resulted in small code.

gcc 5.3.0:

00000000 <test>:
   0:   e92d4010        push    {r4, lr}
   4:   e1a04000        mov     r4, r0
   8:   ebfffffe        bl      0 <next>
   c:   e3500000        cmp     r0, #0
  10:   0a000003        beq     24 <test+0x24>
  14:   e59f002c        ldr     r0, [pc, #44]   ; 48 <test+0x48>
  18:   ebfffffe        bl      0 <dump>
  1c:   e1a00004        mov     r0, r4
  20:   eafffff8        b       8 <test+0x8>
  24:   e1a00004        mov     r0, r4
  28:   ebfffffe        bl      0 <next>
  2c:   e3500000        cmp     r0, #0
  30:   0a000002        beq     40 <test+0x40>
  34:   e59f000c        ldr     r0, [pc, #12]   ; 48 <test+0x48>
  38:   ebfffffe        bl      0 <dump>
  3c:   eafffff8        b       24 <test+0x24>
  40:   e1a00004        mov     r0, r4
  44:   e8bd8010        pop     {r4, pc}
  48:   00000000        andeq   r0, r0, r0

master:

00000000 <test>:
   0:   e92d4070        push    {r4, r5, r6, lr}
   4:   e1a04000        mov     r4, r0
   8:   ebfffffe        bl      0 <next>
   c:   e59f5048        ldr     r5, [pc, #72]   ; 5c <test+0x5c>
  10:   e3500000        cmp     r0, #0
  14:   1a000006        bne     34 <test+0x34>
  18:   e1a00004        mov     r0, r4
  1c:   ebfffffe        bl      0 <next>
  20:   e59f5034        ldr     r5, [pc, #52]   ; 5c <test+0x5c>
  24:   e3500000        cmp     r0, #0
  28:   1a000006        bne     48 <test+0x48>
  2c:   e1a00004        mov     r0, r4
  30:   e8bd8070        pop     {r4, r5, r6, pc}
  34:   e1a00005        mov     r0, r5
  38:   ebfffffe        bl      0 <dump>
  3c:   e1a00004        mov     r0, r4
  40:   ebfffffe        bl      0 <next>
  44:   eafffff1        b       10 <test+0x10>
  48:   e1a00005        mov     r0, r5
  4c:   ebfffffe        bl      0 <dump>
  50:   e1a00004        mov     r0, r4
  54:   ebfffffe        bl      0 <next>
  58:   eafffff1        b       24 <test+0x24>
  5c:   00000000        andeq   r0, r0, r0
Comment 16 Jakub Jelinek 2016-04-27 10:55:53 UTC
GCC 6.1 has been released.
Comment 17 Richard Biener 2016-08-22 08:19:12 UTC
GCC 6.2 is being released, adjusting target milestone.
Comment 18 Richard Biener 2016-08-22 08:20:16 UTC
GCC 6.2 is being released, adjusting target milestone.
Comment 19 Fredrik Hederstierna 2016-08-28 20:44:45 UTC
Tested GCC-6.2, still same behavior.
Comment 20 Jakub Jelinek 2016-12-21 10:54:57 UTC
GCC 6.3 is being released, adjusting target milestone.
Comment 21 Richard Biener 2017-07-04 08:43:21 UTC
GCC 6.4 is being released, adjusting target milestone.
Comment 22 Aldy Hernandez 2018-03-02 10:54:10 UTC
For the record, current mainline is even worse than when the testcase was originally reported.  We are now at 116 bytes versus 96 for gcc-5-branch.

(In reply to Jakub Jelinek from comment #5)
> I see multiple points of the increases:
> r224048 added 4 bytes.
> r224995 added 8 bytes.
> r228318 added 4 bytes.
> 
> Perhaps the middle one could change from
> (if (single_use (@2)
>      || (TREE_CODE (@1) == INTEGER_CST && TREE_CODE (@3) == INTEGER_CST))
> to
> (if (single_use (@2)
>      || (!optimize_size && TREE_CODE (@1) == INTEGER_CST && TREE_CODE (@3)
> == INTEGER_CST))
> (or some optimize*for_speed*)?

As Jakub mentions in comment #5, there are multiple patches that slowly bloated the generated code, but perhaps it is a fool's errand to tackle them individually.  For instance, associating "buf + len - 1" differently in r224995 reduces the generated code by 8 bytes, but predicating the association on optimize_for_speed seems fragile IMO.

Interestingly, disabling -ftree-forwprop reduces the byte count to 92, which is even smaller than gcc-5-branch, so this is perhaps worth pursuing.  Even on x86, disabling forwprop reduces the byte count by 13.  And on both x86 and arm, we get one less branch without forwprop.

The first forwprop change is replacing an equality test with a greater-than, which hardly seems worthwhile (it is even a longer byte sequence on x86), but is perhaps not a showstopper:

<   if (ui_21 != 0)
---
>   if (ui_7 > 9)

OTOH, the following changes things quite a bit on arm:

<   p_22 = p_19 + 4294967295;
<   *p_22 = 45;
---
>   p_22 = p_8 + 4294967294;
>   MEM[(char *)p_19 + 4294967295B] = 45;

For context, we are using p_8, which is the previous iteration's value of the pointer, and then subtracting 2 to return p correctly.  What the heck?

  <bb 2> [local count: 118111601]:
  _1 = ABS_EXPR <i_12(D)>;
  ui_13 = (unsigned int) _1;
  len.0_2 = (sizetype) len_14(D);
  _3 = len.0_2 + 4294967295;
  p_16 = buf_15(D) + _3;
  *p_16 = 0;

  <bb 3> [local count: 1073741825]:
  # ui_7 = PHI <ui_13(2), ui_21(3)>
  # p_8 = PHI <p_16(2), p_19(3)>
  _4 = ui_7 % 10;
  _5 = (char) _4;
  p_19 = p_8 + 4294967295;
  _6 = _5 + 48;
  MEM[base: p_19, offset: 0B] = _6;
  ui_21 = ui_7 / 10;
  if (ui_7 > 9)
    goto <bb 3>; [89.00%]
  else
    goto <bb 4>; [11.00%]

  <bb 4> [local count: 118111601]:
  if (i_12(D) < 0)
    goto <bb 5>; [41.00%]
  else
    goto <bb 6>; [59.00%]

  <bb 5> [local count: 48425756]:
  p_22 = p_8 + 4294967294;
  MEM[(char *)p_19 + 4294967295B] = 45;

  <bb 6> [local count: 118111601]:
  # p_9 = PHI <p_22(5), p_19(4)>
  return p_9;

This finally yields assembly without auto dec, and with an extra (forward!) branch:

.L2:
        mov     r1, #10
        mov     r0, r6
        bl      __aeabi_uidivmod
        umull   r2, r3, r6, r8
        add     r1, r1, #48
        cmp     r6, #9
        sub     r4, r5, #1
        strb    r1, [r5, #-1]
        lsr     r3, r3, #3
        bhi     .L4                  ;; extra forward branch
        cmp     r7, #0
        movlt   r3, #45
        strblt  r3, [r4, #-1]
        sublt   r4, r5, #2
        mov     r0, r4
        pop     {r4, r5, r6, r7, r8, pc}
.L4:
        mov     r5, r4
        mov     r6, r3
        b       .L2

whereas without -ftree-forwprop we get auto dec:

.L2:
        mov     r0, r5
        mov     r1, #10
        bl      __aeabi_uidivmod
        umull   r2, r3, r5, r7
        add     r1, r1, #48
        lsrs    r5, r3, #3
        strb    r1, [r4, #-1]!         ;; auto dec, yay
        bne     .L2
        cmp     r6, #0
        movlt   r3, #45
        strblt  r3, [r4, #-1]!         ;; auto dec, yay
        mov     r0, r4
        pop     {r4, r5, r6, r7, r8, pc}
Comment 23 Aldy Hernandez 2018-03-02 11:05:30 UTC
For the curious, on x86 with -ftree-forwprop we get an additional jump:

inttostr:
.LFB0:
        .cfi_startproc
        movl    %edi, %eax
        movslq  %edx, %rdx
        movl    $-858993459, %r9d
        sarl    $31, %eax
        leaq    -1(%rsi,%rdx), %rsi
        movl    %eax, %ecx
        movb    $0, (%rsi)
        xorl    %edi, %ecx
        subl    %eax, %ecx
        jmp     .L2                    ;; boo!
        .p2align 4,,10
        .p2align 3
.L4:
        movq    %r8, %rsi
        movl    %edx, %ecx
.L2:
...
...
...
        movb    $45, -1(%r8)
        leaq    -2(%rsi), %r8

Not to mention that the last two instructions are slower (ok, larger) than without forwprop, probably because rsi encodes better than r8.

        movb    $45, -1(%rsi)
        subq    $1, %rsi

Also, the inequality comparison on x86 is shorter than comparing with > 9.  But that's probably irrelevant.

All in all:

x86 -ftree-forwprop:    139 bytes
x86 -fno-tree-forwprop: 126 bytes
Comment 24 rguenther@suse.de 2018-03-02 11:09:29 UTC
> OTOH, the following changes things quite a bit on arm:
> 
> <   p_22 = p_19 + 4294967295;
> <   *p_22 = 45;
> ---
> >   p_22 = p_8 + 4294967294;
> >   MEM[(char *)p_19 + 4294967295B] = 45;
> 
> For context, we are using p_8, which is the previous iteration's value of
> the pointer, and then subtracting 2 to return p correctly.  What the heck?
> 
>   <bb 2> [local count: 118111601]:
>   _1 = ABS_EXPR <i_12(D)>;
>   ui_13 = (unsigned int) _1;
>   len.0_2 = (sizetype) len_14(D);
>   _3 = len.0_2 + 4294967295;
>   p_16 = buf_15(D) + _3;
>   *p_16 = 0;
> 
>   <bb 3> [local count: 1073741825]:
>   # ui_7 = PHI <ui_13(2), ui_21(3)>
>   # p_8 = PHI <p_16(2), p_19(3)>
>   _4 = ui_7 % 10;
>   _5 = (char) _4;
>   p_19 = p_8 + 4294967295;
>   _6 = _5 + 48;
>   MEM[base: p_19, offset: 0B] = _6;
>   ui_21 = ui_7 / 10;
>   if (ui_7 > 9)
>     goto <bb 3>; [89.00%]
>   else
>     goto <bb 4>; [11.00%]
> 
>   <bb 4> [local count: 118111601]:
>   if (i_12(D) < 0)
>     goto <bb 5>; [41.00%]
>   else
>     goto <bb 6>; [59.00%]
> 
>   <bb 5> [local count: 48425756]:
>   p_22 = p_8 + 4294967294;
>   MEM[(char *)p_19 + 4294967295B] = 45;
> 
>   <bb 6> [local count: 118111601]:
>   # p_9 = PHI <p_22(5), p_19(4)>
>   return p_9;

forwprop aggressively canonicalizes memory references by forwarding
all constant address adjustments into its constant offset part.

What usually makes things complicated in the end is when for an IV
we get overlapping life-ranges for the before and after value because
that inhibits coalescing.  Is this what happens here?

We do have some code trying to rectify IV-related stuff for coalescing,
maybe that can be enhanced to handle these kind of cases...

I fear that in the end somehow exposing autoinc/dec on GIMPLE might
be the only way to truly fix and maintain good code...
Comment 25 Richard Biener 2018-03-02 11:36:07 UTC
So we indeed have p_20 and p_9 live as p_9 is used after the loop.  Originally
this wasn't the case but fold_stmt in the first forwprop pass does this
by means of following use-def chains.

As I usually say the user could have written the loop that way so a real
fix is to try undoing this somewhere.

Sprinkling single_use checks in match.pd (or simple_iv_increment_p-like
checks) doesn't look like a sustainable approach to me.
Comment 26 Richard Biener 2018-03-02 11:42:58 UTC
So I suggest to look at insert_backedge_copies () to see whether replacing
out-of-loop pre-inc uses with the post-inc value is possible.
Comment 27 Jeffrey A. Law 2018-03-02 16:39:14 UTC
Yea, we went through similar issues with Alex's work recently.  After that work landed I went through all the BZs mentioning autoinc {pre,post}{inc,dec} that I could find, in the hopes that his work would fix some of them.

Sadly there was only 1 that was fixed.


--

So one concern, if you try to expose autoincrement addressing in gimple, is that the RTL optimizers aren't prepared to handle it until combine or later.  So there's probably a fair amount of downstream work to do.

rth and I sketched out a pass to replace the optimize_related_values and autoinc passes using the concepts of PRE and SLSR many years ago.  It felt much cleaner than what we've been doing and likely more powerful.  But we never went forward with an implementation.

I got the sense from that recent BZ review that we're generally missing a fair number of opportunities, but I didn't dig into all of them to gain an understanding of why we're doing so badly.
Comment 28 Aldy Hernandez 2018-03-02 17:37:50 UTC
(In reply to Richard Biener from comment #25)


> What usually makes things complicated in the end is when for an IV
> we get overlapping life-ranges for the before and after value because
> that inhibits coalescing.  Is this what happens here?
...
...
> So we indeed have p_20 and p_9 live as p_9 is used after the loop. 
> Originally
> this wasn't the case but fold_stmt in the first forwprop pass does this
> by means of following use-def chains.

I'm a bit confused.

Where are you looking, after the forwprop1 pass?  Because after forwprop1 I see:

  <bb 5> :
  p_22 = p_8 + 4294967294;
  MEM[(char *)p_19 + 4294967295B] = 45;

  <bb 6> :
  # p_9 = PHI <p_19(4), p_22(5)>
  return p_9;

Which as I understand has:

p_8  being the IV at the beginning of the last iteration.  
p_19 being the IV at the end of the last iteration.
p_22 being the IV at the end of the last iteration MINUS 1.

I don't see a p_20 anywhere.  Did you mean that p_8 and p_19 were both live at the end of the loop?

Thanks for all the feedback BTW.
Comment 29 rguenther@suse.de 2018-03-05 10:02:12 UTC
On Fri, 2 Mar 2018, aldyh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
> 
> --- Comment #28 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #25)
> 
> 
> > What usually makes things complicated in the end is when for an IV
> > we get overlapping life-ranges for the before and after value because
> > that inhibits coalescing.  Is this what happens here?
> ...
> ...
> > So we indeed have p_20 and p_9 live as p_9 is used after the loop. 
> > Originally
> > this wasn't the case but fold_stmt in the first forwprop pass does this
> > by means of following use-def chains.
> 
> I'm a bit confused.
> 
> Where are you looking, after the forwprop1 pass?  Because after forwprop1 I see:
> 
>   <bb 5> :
>   p_22 = p_8 + 4294967294;
>   MEM[(char *)p_19 + 4294967295B] = 45;
> 
>   <bb 6> :
>   # p_9 = PHI <p_19(4), p_22(5)>
>   return p_9;
> 
> Which as I understand has:
> 
> p_8  being the IV at the beginning of the last iteration.  
> p_19 being the IV at the end of the last iteration.
> p_22 being the IV at the end of the last iteration MINUS 1.
> 
> I don't see a p_20 anywhere.  Did you mean that p_8 and p_19 were both live
> at the end of the loop?

Yes.  Not sure if that captures all of the issues in this bug but that's
clearly bad.
Comment 30 Aldy Hernandez 2018-03-08 10:15:47 UTC
Created attachment 43597 [details]
untested patch implementing suggestion in comment 26

The attached untested patch attempts to implement the suggestion in comment 26 of replacing the out-of-loop pre-inc with post-inc values.

Richi, is this more or less what you had in mind?

Assuming this:

LOOP:
  # p_8 = PHI <p_16(2), p_19(3)>
  ...
  p_19 = p_8 + 4294967295;
  goto LOOP:

The patch replaces:
  p_22 = p_8 + 4294967294;
  MEM[(char *)p_19 + 4294967295B] = 45;
into:
  p_22 = p_19 + 4294967295;
  *p_22 = 45;

This allows the backend to use auto-dec in two places:

strb	r1, [r4, #-1]!
...
strblt	r3, [r4, #-1]!

...reducing the byte count from 116 to 104, but just shy of the 96 needed to eliminate the regression.  I will discuss the missing bytes in a follow-up comment, as they are unrelated to this IV adjustment patch.

It is worth noting that x86 also benefits from a reduction of 3 bytes with this patch, as we remove 2 lea instructions: one within the loop, and one before returning.  Thus, I believe this is a regression across the board, or at least in multiple architectures.

A few comments...

While I see the benefit of hijacking insert_backedge_copies() for this, I am not a big fan of changing the IL after the last tree dump (*t.optimized), as the modified IL would only be visible in *r.expand.  Could we perhaps move this to another spot?  Say after the last forwprop pass, or perhaps right before expand?  Or perhaps have a *t.final dump right before expand?

As mentioned, this is only a proof of concept.  I made the test rather restrictive.  I suppose we could relax the conditions and generalize it a bit.  There are comments throughout showing what I had in mind.
Comment 31 rguenther@suse.de 2018-03-08 10:32:21 UTC
On Thu, 8 Mar 2018, aldyh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
> 
> Aldy Hernandez <aldyh at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>            Assignee|unassigned at gcc dot gnu.org      |aldyh at gcc dot gnu.org
> 
> --- Comment #30 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> Created attachment 43597 [details]
>   --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43597&action=edit
> untested patch implementing suggestion in comment 26
> 
> The attached untested patch attempts to implement the suggestion in comment 26
> of replacing the out-of-loop pre-inc with post-inc values.
> 
> Richi, is this more or less what you had in mind?

Yes.

> Assuming this:
> 
> LOOP:
>   # p_8 = PHI <p_16(2), p_19(3)>
>   ...
>   p_19 = p_8 + 4294967295;
>   goto LOOP:
> 
> The patch replaces:
>   p_22 = p_8 + 4294967294;
>   MEM[(char *)p_19 + 4294967295B] = 45;
> into:
>   p_22 = p_19 + 4294967295;
>   *p_22 = 45;
> 
> This allows the backend to use auto-dec in two places:
> 
> strb    r1, [r4, #-1]!
> ...
> strblt  r3, [r4, #-1]!
> 
> ...reducing the byte count from 116 to 104, but just shy of the 96 needed to
> eliminate the regression.  I will discuss the missing bytes in a follow-up
> comment, as they are unrelated to this IV adjustment patch.
> 
> It is worth noting that x86 also benefits from a reduction of 3 bytes with this
> patch, as we remove 2 lea instructions: one within the loop, and one before
> returning.  Thus, I believe this is a regression across the board, or at least
> in multiple architectures.
> 
> A few comments...
> 
> While I see the benefit of hijacking insert_backedge_copies() for this, I am
> not a big fan of changing the IL after the last tree dump (*t.optimized), as
> the modified IL would only be visible in *r.expand.  Could we perhaps move this
> to another spot?  Say after the last forwprop pass, or perhaps right before
> expand?  Or perhaps have a *t.final dump right before expand?

I don't see a big problem here - but yes, for example doing it
during uncprop would be possible as well (moving all of
insert_backedge_copies then).  I'd not do this at this point though.

> As mentioned, this is only a proof of concept.  I made the test rather
> restrictive.  I suppose we could relax the conditions and generalize it a bit. 
> There are comments throughout showing what I had in mind.

I'd have not restricted the out-of-loop IV use to IV +- CST but
instead did the transform

+   LOOP:
+     # p_8 = PHI <p_16(2), p_INC(3)>
+     ...
+     p_INC = p_8 - 1;
+     goto LOOP;
+     ... p_8 uses ...

to

+   LOOP:
+     # p_8 = PHI <p_16(2), p_INC(3)>
+     ...
+     p_INC = p_8 - 1;
+     goto LOOP;
      newtem_12 = p_INC + 1; // undo IV increment
      ... p_8 out-of-loop p_8 uses replaced with newtem_12 ...

so it would always work if we can undo the IV increment.

The disadvantage might be that we then rely on RTL optimizations
to combine the original out-of-loop constant add with the
newtem computation but I guess that's not too much to ask ;)
Comment 32 Aldy Hernandez 2018-03-08 10:34:07 UTC
As mentioned in the previous comment, the proposed patch brings down the count from 116 to 108 on ARM, but is shy of the desired 96.

The missing bytes can be attributed to forwprop folding this (IL expanded for illustration):

  if (ui_7 / 10 != 0)

into:

  if (ui_7 > 9)

More specifically, changing this:

  # ui_7 = PHI <ui_13(2), ui_21(3)>
  ...
  ui_21 = ui_7 / 10;
  if (ui_21 != 0)

into:

  # ui_7 = PHI <ui_13(2), ui_21(3)>
  ...
  ui_21 = ui_7 / 10;
  if (ui_7 > 9)

Inhibiting this optimization brings down the byte count to 92, which is even lower than our 96 boogie man, so it is perhaps worth pursuing.  (This assumes my proposed patch is also applied.)  I'm no expert, but isn't an EQ/NE comparison against 0 preferable to a </> comparison against a nonzero constant?
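For the record, the two exit tests are indeed equivalent for unsigned operands; a minimal self-contained check (helper names invented for illustration):

```c
/* For unsigned ui:  ui / 10 != 0  <=>  ui >= 10  <=>  ui > 9.
   The first form tests the already-computed quotient (a cheap NE against
   zero); the folded form instead keeps ui itself live across the
   division.  */
static int exit_via_quotient (unsigned ui) { return ui / 10 != 0; }
static int exit_via_range (unsigned ui)    { return ui > 9; }
```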

If so, should we restrict the folding somewhat, or clean this up after the fact?

For reference, the folding (in forwprop) is due to this match.pd pattern:

/* X / C1 op C2 into a simple range test.  */

...though eliminating it causes another pattern to pick up the slack and do the same:

/* Transform:
 * (X / Y) == 0 -> X < Y if X, Y are unsigned.
 * (X / Y) != 0 -> X >= Y, if X, Y are unsigned.
 */

Eliminating both patterns "fixes" the problem.

Suggestions welcome :).
Comment 33 Segher Boessenkool 2018-03-08 10:37:43 UTC
The "> 9" transform reduces path length, but for -Os that is often not a
good idea.
Comment 34 rguenther@suse.de 2018-03-08 10:47:18 UTC
On Thu, 8 Mar 2018, aldyh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
> 
> --- Comment #32 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> As mentioned in the previous comment, the proposed patch brings down the count
> from 116 to 108 on ARM, but is shy of the desired 96.
> 
> The missing bytes can be attributed to forwprop folding this (IL expanded for
> illustration):
> 
>   if (ui_7 / 10 != 0)
> 
> into:
> 
>   if (ui_7 > 9)
> 
> More specifically, changing this:
> 
>   # ui_7 = PHI <ui_13(2), ui_21(3)>
>   ...
>   ui_21 = ui_7 / 10;
>   if (ui_21 != 0)
> 
> into:
> 
>   # ui_7 = PHI <ui_13(2), ui_21(3)>
>   ...
>   ui_21 = ui_7 / 10;
>   if (ui_7 > 9)
> 
> Inhibiting this optimization brings down the byte count to 92 which is even
> lower than our 96 boogie man, so perhaps worth pursuing.  (Assumes my proposed
> patch is also applied.)  I'm no expert, but isn't a EQ/NE with 0 preferable
> than a <> with a non-zero?
> 
> If so, should we restrict the folding somewhat, or clean this up after the
> fact?
> 
> For reference, the folding (in forwprop) is due to this match.pd pattern:
> 
> /* X / C1 op C2 into a simple range test.  */
> 
> ...though eliminating it causes another pattern to pick up the slack and do the
> same:
> 
> /* Transform:
>  * (X / Y) == 0 -> X < Y if X, Y are unsigned.
>  * (X / Y) != 0 -> X >= Y, if X, Y are unsigned.
>  */
> 
> Eliminating both patterns "fixes" the problem.
> 
> Suggestions welcome :).

In other places where this happened we add single_use () checks to
the patterns.  The :s isn't really effective for patterns that
simplify into "simple" expressions.  A patch doing that to the
two patterns in question would be OK I think.

Sth like

Index: gcc/match.pd
===================================================================
--- gcc/match.pd        (revision 258359)
+++ gcc/match.pd        (working copy)
@@ -1290,11 +1290,12 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 /* X / C1 op C2 into a simple range test.  */
 (for cmp (simple_comparison)
  (simplify
-  (cmp (trunc_div:s @0 INTEGER_CST@1) INTEGER_CST@2)
+  (cmp (trunc_div:s@3 @0 INTEGER_CST@1) INTEGER_CST@2)
   (if (INTEGRAL_TYPE_P (TREE_TYPE (@0))
        && integer_nonzerop (@1)
        && !TREE_OVERFLOW (@1)
-       && !TREE_OVERFLOW (@2))
+       && !TREE_OVERFLOW (@2)
+       && single_use (@3))
    (with { tree lo, hi; bool neg_overflow;
           enum tree_code code = fold_div_compare (cmp, @1, @2, &lo, &hi,
                                                   &neg_overflow); }

though there is at least one constant simplification result where
we shouldn't apply this restriction, so moving the single_use check
to multiple places below, or factoring things a bit, would be
appropriate.

Similar for the other pattern then.
Comment 35 rguenther@suse.de 2018-03-08 10:49:33 UTC
On Thu, 8 Mar 2018, segher at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
> 
> --- Comment #33 from Segher Boessenkool <segher at gcc dot gnu.org> ---
> The "> 9" transform reduces path length.  But foe -Os that is often not a
> good idea.

Hm, indeed - the > 9 might be notably faster.  So it's not a clear
win to disable the folding here.
Comment 36 Jeffrey A. Law 2018-03-09 19:48:58 UTC
WRT the division removal: that seems so profitable that a slight increase in code size is warranted.  So if we fix the other issue, and the source of the remaining code-size regressions is the removal of the division, I would consider this BZ resolved.
Comment 37 Aldy Hernandez 2018-03-15 11:25:51 UTC
Hi Richi.

(In reply to rguenther@suse.de from comment #31)

> I'd have not restricted the out-of-loop IV use to IV +- CST but
> instead did the transform
> 
> +   LOOP:
> +     # p_8 = PHI <p_16(2), p_INC(3)>
> +     ...
> +     p_INC = p_8 - 1;
> +     goto LOOP;
> +     ... p_8 uses ...
> 
> to
> 
> +   LOOP:
> +     # p_8 = PHI <p_16(2), p_INC(3)>
> +     ...
> +     p_INC = p_8 - 1;
> +     goto LOOP;
>       newtem_12 = p_INC + 1; // undo IV increment
>       ... p_8 out-of-loop p_8 uses replaced with newtem_12 ...
> 
> so it would always work if we can undo the IV increment.
> 
> The disadvantage might be that we then rely on RTL optimizations
> to combine the original out-of-loop constant add with the
> newtem computation but I guess that's not too much to ask ;)
> k

It looks like RTL optimizations have a harder time optimizing things when I take the above approach.

Doing what you suggest, we end up optimizing this (simplified for brevity):

  <bb 3>
  # p_8 = PHI <p_16(2), p_19(3)>
  p_19 = p_8 + 4294967295;
  if (ui_7 > 9)
    goto <bb 3>; [89.00%]
  ...

  <bb 5>
  p_22 = p_8 + 4294967294;
  MEM[(char *)p_19 + 4294967295B] = 45;

into this:

  <bb 3>:
  # p_8 = PHI <p_16(2), p_19(3)>
  p_19 = p_8 + 4294967295;
  if (ui_7 > 9)
  ...

  <bb 4>:
  _25 = p_19 + 1;          ;; undo the increment
  ...

  <bb 5>:
  p_22 = _25 + 4294967294;
  MEM[(char *)_25 + 4294967294B] = 45;

I haven't dug into the RTL optimizations, but the end result is that we only get one auto-dec inside the loop, and some -2 indexing outside of it:

        strb    r1, [r4, #-1]!
        lsr     r3, r3, #3
        bhi     .L4
        cmp     r6, #0
        movlt   r2, #45
        add     r3, r4, #1
        strblt  r2, [r3, #-2]
        sublt   r4, r4, #1

as opposed to mine:

  <bb 3>:
  # p_8 = PHI <p_16(2), p_19(3)>
  p_19 = p_8 + 4294967295;
  if (ui_7 > 9)
  ...

  <bb 5>:
  p_22 = p_19 + 4294967295;
  *p_22 = 45;

which gives us two auto-dec, and much tighter code:

        strb    r1, [r4, #-1]!
        lsr     r3, r3, #3
        bhi     .L4
        cmp     r6, #0
        movlt   r3, #45
        strblt  r3, [r4, #-1]!

Would it be OK to go with my approach, or is it worth looking into the RTL optimizers to see what can be done (boo! :))?

Thanks.
Comment 38 rguenther@suse.de 2018-03-15 11:51:33 UTC
On Thu, 15 Mar 2018, aldyh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
> 
> --- Comment #37 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> Hi Richi.
> 
> (In reply to rguenther@suse.de from comment #31)
> 
> > I'd have not restricted the out-of-loop IV use to IV +- CST but
> > instead did the transform
> > 
> > +   LOOP:
> > +     # p_8 = PHI <p_16(2), p_INC(3)>
> > +     ...
> > +     p_INC = p_8 - 1;
> > +     goto LOOP;
> > +     ... p_8 uses ...
> > 
> > to
> > 
> > +   LOOP:
> > +     # p_8 = PHI <p_16(2), p_INC(3)>
> > +     ...
> > +     p_INC = p_8 - 1;
> > +     goto LOOP;
> >       newtem_12 = p_INC + 1; // undo IV increment
> >       ... p_8 out-of-loop p_8 uses replaced with newtem_12 ...
> > 
> > so it would always work if we can undo the IV increment.
> > 
> > The disadvantage might be that we then rely on RTL optimizations
> > to combine the original out-of-loop constant add with the
> > newtem computation but I guess that's not too much to ask ;)
> > k
> 
> It looks like RTL optimizations have a harder time optimizing things when I
> take the above approach.
> 
> Doing what you suggest, we end up optimizing this (simplified for brevity):
> 
>   <bb 3>
>   # p_8 = PHI <p_16(2), p_19(3)>
>   p_19 = p_8 + 4294967295;
>   if (ui_7 > 9)
>     goto <bb 3>; [89.00%]
>   ...
> 
>   <bb 5>
>   p_22 = p_8 + 4294967294;
>   MEM[(char *)p_19 + 4294967295B] = 45;
> 
> into this:
> 
>   <bb 3>:
>   # p_8 = PHI <p_16(2), p_19(3)>
>   p_19 = p_8 + 4294967295;
>   if (ui_7 > 9)
>   ...
> 
>   <bb 4>:
>   _25 = p_19 + 1;          ;; undo the increment
>   ...
> 
>   <bb 5>:
>   p_22 = _25 + 4294967294;
>   MEM[(char *)_25 + 4294967294B] = 45;

shouldn't the p_19 in the MEM remain?  Why a new BB for the adjustment?
(just asking...)

> I haven't dug into the RTL optimizations, but the end result is that we only
> get one auto-dec inside the loop, and some -2 indexing outside of it:
> 
>         strb    r1, [r4, #-1]!
>         lsr     r3, r3, #3
>         bhi     .L4
>         cmp     r6, #0
>         movlt   r2, #45
>         add     r3, r4, #1
>         strblt  r2, [r3, #-2]
>         sublt   r4, r4, #1
> 
> as opposed to mine:
> 
>   <bb 3>:
>   # p_8 = PHI <p_16(2), p_19(3)>
>   p_19 = p_8 + 4294967295;
>   if (ui_7 > 9)
>   ...
> 
>   <bb 5>:
>   p_22 = p_19 + 4294967295;
>   *p_22 = 45;
> 
> which gives us two auto-dec, and much tighter code:
> 
>         strb    r1, [r4, #-1]!
>         lsr     r3, r3, #3
>         bhi     .L4
>         cmp     r6, #0
>         movlt   r3, #45
>         strblt  r3, [r4, #-1]!
> 
> Would it be OK to go with my approach, or is worth looking into the rtl
> optimizers and seeing what can be done (boo! :)).

Would be interesting to see what is causing things to break.  If it's
only combine doing the trick then it will obviously be multi-uses
of the adjustment pseudo.  But eventually I thought fwprop would
come to the rescue...

I think a good approach would be to instead of just replacing
all out-of-loop uses with the new SSA name doing the adjustment
check if the use is in a "simple" (thus SSA name +- cst) stmt
and there forward the adjustment.

So have the best of both worlds.  I would have hoped this
extra complication isn't needed but as you figured...
Comment 39 Aldy Hernandez 2018-03-15 12:23:29 UTC
(In reply to rguenther@suse.de from comment #38)
> On Thu, 15 Mar 2018, aldyh at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
> > 
> > --- Comment #37 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> > Hi Richi.
> > 
> > (In reply to rguenther@suse.de from comment #31)
> > 
> > > I'd have not restricted the out-of-loop IV use to IV +- CST but
> > > instead did the transform
> > > 
> > > +   LOOP:
> > > +     # p_8 = PHI <p_16(2), p_INC(3)>
> > > +     ...
> > > +     p_INC = p_8 - 1;
> > > +     goto LOOP;
> > > +     ... p_8 uses ...
> > > 
> > > to
> > > 
> > > +   LOOP:
> > > +     # p_8 = PHI <p_16(2), p_INC(3)>
> > > +     ...
> > > +     p_INC = p_8 - 1;
> > > +     goto LOOP;
> > >       newtem_12 = p_INC + 1; // undo IV increment
> > >       ... p_8 out-of-loop p_8 uses replaced with newtem_12 ...
> > > 
> > > so it would always work if we can undo the IV increment.
> > > 
> > > The disadvantage might be that we then rely on RTL optimizations
> > > to combine the original out-of-loop constant add with the
> > > newtem computation but I guess that's not too much to ask ;)
> > > k
> > 
> > It looks like RTL optimizations have a harder time optimizing things when I
> > take the above approach.
> > 
> > Doing what you suggest, we end up optimizing this (simplified for brevity):
> > 
> >   <bb 3>
> >   # p_8 = PHI <p_16(2), p_19(3)>
> >   p_19 = p_8 + 4294967295;
> >   if (ui_7 > 9)
> >     goto <bb 3>; [89.00%]
> >   ...
> > 
> >   <bb 5>
> >   p_22 = p_8 + 4294967294;
> >   MEM[(char *)p_19 + 4294967295B] = 45;
> > 
> > into this:
> > 
> >   <bb 3>:
> >   # p_8 = PHI <p_16(2), p_19(3)>
> >   p_19 = p_8 + 4294967295;
> >   if (ui_7 > 9)
> >   ...
> > 
> >   <bb 4>:
> >   _25 = p_19 + 1;          ;; undo the increment
> >   ...
> > 
> >   <bb 5>:
> >   p_22 = _25 + 4294967294;
> >   MEM[(char *)_25 + 4294967294B] = 45;
> 
> shouldn't the p_19 in the MEM remain?  Why a new BB for the adjustment?
> (just asking...)

Oh, I was adjusting/propagating that one as a special case to see if the [_25 + 4294967294] could be CSE'd somehow by the RTL optimizers.  But even with the p_19 remaining, the result is the same in the assembly.  For the record, with the p_19 in place, we would get (see below for a more complete dump):

  p_22 = _25 + 4294967294;
  MEM[(char *)p_19 + 4294967295B] = 45;

There is no new BB.  That was the result of adding the TMP on the non-back edge.  My simplification of the gimple IL for illustration purposes made that bit unclear.

The unadulterated original was:

   <bb 3> [local count: 1073741825]:
  # ui_7 = PHI <ui_13(2), ui_21(3)>
  # p_8 = PHI <p_16(2), p_19(3)>
  _4 = ui_7 % 10;
  _5 = (char) _4;
  p_19 = p_8 + 4294967295;
  _6 = _5 + 48;
  MEM[base: p_19, offset: 0B] = _6;
  ui_21 = ui_7 / 10;
  if (ui_7 > 9)
    goto <bb 3>; [89.00%]
  else
    goto <bb 4>; [11.00%]

  <bb 4> [local count: 118111601]:
  if (i_12(D) < 0)
    goto <bb 5>; [41.00%]
  else
    goto <bb 6>; [59.00%]

  <bb 5> [local count: 48425756]:
  p_22 = p_8 + 4294967294;
  MEM[(char *)p_19 + 4294967295B] = 45;

with a corresponding transformation (keeping the p_19 as you inquired) of:

  <bb 3> [local count: 1073741825]:
  # ui_7 = PHI <ui_13(2), ui_24(3)>
  # p_8 = PHI <p_16(2), p_19(3)>
  _4 = ui_7 % 10;
  _5 = (char) _4;
  p_19 = p_8 + 4294967295;
  _6 = _5 + 48;
  MEM[base: p_19, offset: 0B] = _6;
  ui_21 = ui_7 / 10;
  ui_24 = ui_21;
  if (ui_7 > 9)
    goto <bb 3>; [89.00%]
  else
    goto <bb 4>; [11.00%]

  <bb 4> [local count: 118111601]:
  _25 = p_19 + 1;                  ;; undo the increment on the non-back edge
  if (i_12(D) < 0)
    goto <bb 5>; [41.00%]
  else
    goto <bb 6>; [59.00%]

  <bb 5> [local count: 48425756]:
  p_22 = _25 + 4294967294;
  MEM[(char *)p_19 + 4294967295B] = 45;

> 
> > I haven't dug into the RTL optimizations, but the end result is that we only
> > get one auto-dec inside the loop, and some -2 indexing outside of it:
> > 
> >         strb    r1, [r4, #-1]!
> >         lsr     r3, r3, #3
> >         bhi     .L4
> >         cmp     r6, #0
> >         movlt   r2, #45
> >         add     r3, r4, #1
> >         strblt  r2, [r3, #-2]
> >         sublt   r4, r4, #1
> > 
> > as opposed to mine:
> > 
> >   <bb 3>:
> >   # p_8 = PHI <p_16(2), p_19(3)>
> >   p_19 = p_8 + 4294967295;
> >   if (ui_7 > 9)
> >   ...
> > 
> >   <bb 5>:
> >   p_22 = p_19 + 4294967295;
> >   *p_22 = 45;
> > 
> > which gives us two auto-dec, and much tighter code:
> > 
> >         strb    r1, [r4, #-1]!
> >         lsr     r3, r3, #3
> >         bhi     .L4
> >         cmp     r6, #0
> >         movlt   r3, #45
> >         strblt  r3, [r4, #-1]!
> > 
> > Would it be OK to go with my approach, or is worth looking into the rtl
> > optimizers and seeing what can be done (boo! :)).
> 
> Would be interesting to see what is causing things to break.  If it's
> only combine doing the trick then it will obviously be multi-uses
> of the adjustment pseudo.  But eventually I thought fwprop would
> come to the rescue...

I could post the WIP for this approach if anyone is interested.

> 
> I think a good approach would be to instead of just replacing
> all out-of-loop uses with the new SSA name doing the adjustment
> check if the use is in a "simple" (thus SSA name +- cst) stmt
> and there forward the adjustment.
> 
> So have the best of both worlds.  I would have hoped this
> extra complication isn't needed but as you figured...

Hmmm, can we go with my first patch (cleaned up) instead, as it's already working?

Thanks.
Comment 40 rguenther@suse.de 2018-03-15 12:52:45 UTC
On Thu, 15 Mar 2018, aldyh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
> 
> --- Comment #39 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #38)
> > On Thu, 15 Mar 2018, aldyh at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
> > > 
> > > --- Comment #37 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> > > Hi Richi.
> > > 
> > > (In reply to rguenther@suse.de from comment #31)
> > > 
> > > > I'd have not restricted the out-of-loop IV use to IV +- CST but
> > > > instead did the transform
> > > > 
> > > > +   LOOP:
> > > > +     # p_8 = PHI <p_16(2), p_INC(3)>
> > > > +     ...
> > > > +     p_INC = p_8 - 1;
> > > > +     goto LOOP;
> > > > +     ... p_8 uses ...
> > > > 
> > > > to
> > > > 
> > > > +   LOOP:
> > > > +     # p_8 = PHI <p_16(2), p_INC(3)>
> > > > +     ...
> > > > +     p_INC = p_8 - 1;
> > > > +     goto LOOP;
> > > >       newtem_12 = p_INC + 1; // undo IV increment
> > > >       ... p_8 out-of-loop p_8 uses replaced with newtem_12 ...
> > > > 
> > > > so it would always work if we can undo the IV increment.
> > > > 
> > > > The disadvantage might be that we then rely on RTL optimizations
> > > > to combine the original out-of-loop constant add with the
> > > > newtem computation but I guess that's not too much to ask ;)
> > > > k
> > > 
> > > It looks like RTL optimizations have a harder time optimizing things when I
> > > take the above approach.
> > > 
> > > Doing what you suggest, we end up optimizing this (simplified for brevity):
> > > 
> > >   <bb 3>
> > >   # p_8 = PHI <p_16(2), p_19(3)>
> > >   p_19 = p_8 + 4294967295;
> > >   if (ui_7 > 9)
> > >     goto <bb 3>; [89.00%]
> > >   ...
> > > 
> > >   <bb 5>
> > >   p_22 = p_8 + 4294967294;
> > >   MEM[(char *)p_19 + 4294967295B] = 45;
> > > 
> > > into this:
> > > 
> > >   <bb 3>:
> > >   # p_8 = PHI <p_16(2), p_19(3)>
> > >   p_19 = p_8 + 4294967295;
> > >   if (ui_7 > 9)
> > >   ...
> > > 
> > >   <bb 4>:
> > >   _25 = p_19 + 1;          ;; undo the increment
> > >   ...
> > > 
> > >   <bb 5>:
> > >   p_22 = _25 + 4294967294;
> > >   MEM[(char *)_25 + 4294967294B] = 45;
> > 
> > shouldn't the p_19 in the MEM remain?  Why a new BB for the adjustment?
> > (just asking...)
> 
> Oh, I was adjusting/propagating that one as a special case to see if the [_25 +
> 4294967294] could be CSE'd somehow by the RTL optimizers.  But even with the
> p_19 remaining, the result is the same in the assembly.  For the record, with
> the p_19 in place, we would get (see below for a more complete dump):
> 
>   p_22 = _25 + 4294967294;
>   MEM[(char *)p_19 + 4294967295B] = 45;
> 
> There is no new BB.  That was the result of adding the TMP on the non-back
> edge.  My simplification of the gimple IL for illustration purposes made that
> bit unclear.
> 
> The unadulterated original was:
> 
>    <bb 3> [local count: 1073741825]:
>   # ui_7 = PHI <ui_13(2), ui_21(3)>
>   # p_8 = PHI <p_16(2), p_19(3)>
>   _4 = ui_7 % 10;
>   _5 = (char) _4;
>   p_19 = p_8 + 4294967295;
>   _6 = _5 + 48;
>   MEM[base: p_19, offset: 0B] = _6;
>   ui_21 = ui_7 / 10;
>   if (ui_7 > 9)
>     goto <bb 3>; [89.00%]
>   else
>     goto <bb 4>; [11.00%]
> 
>   <bb 4> [local count: 118111601]:
>   if (i_12(D) < 0)
>     goto <bb 5>; [41.00%]
>   else
>     goto <bb 6>; [59.00%]
> 
>   <bb 5> [local count: 48425756]:
>   p_22 = p_8 + 4294967294;
>   MEM[(char *)p_19 + 4294967295B] = 45;
> 
> with a corresponding transformation (keeping the p_19 as you inquired) of:
> 
>   <bb 3> [local count: 1073741825]:
>   # ui_7 = PHI <ui_13(2), ui_24(3)>
>   # p_8 = PHI <p_16(2), p_19(3)>
>   _4 = ui_7 % 10;
>   _5 = (char) _4;
>   p_19 = p_8 + 4294967295;
>   _6 = _5 + 48;
>   MEM[base: p_19, offset: 0B] = _6;
>   ui_21 = ui_7 / 10;
>   ui_24 = ui_21;
>   if (ui_7 > 9)
>     goto <bb 3>; [89.00%]
>   else
>     goto <bb 4>; [11.00%]
> 
>   <bb 4> [local count: 118111601]:
>   _25 = p_19 + 1;                  ;; undo the increment on the non-back edge
>   if (i_12(D) < 0)
>     goto <bb 5>; [41.00%]
>   else
>     goto <bb 6>; [59.00%]
> 
>   <bb 5> [local count: 48425756]:
>   p_22 = _25 + 4294967294;
>   MEM[(char *)p_19 + 4294967295B] = 45;
> 
> > 
> > > I haven't dug into the RTL optimizations, but the end result is that we only
> > > get one auto-dec inside the loop, and some -2 indexing outside of it:
> > > 
> > >         strb    r1, [r4, #-1]!
> > >         lsr     r3, r3, #3
> > >         bhi     .L4
> > >         cmp     r6, #0
> > >         movlt   r2, #45
> > >         add     r3, r4, #1
> > >         strblt  r2, [r3, #-2]
> > >         sublt   r4, r4, #1
> > > 
> > > as opposed to mine:
> > > 
> > >   <bb 3>:
> > >   # p_8 = PHI <p_16(2), p_19(3)>
> > >   p_19 = p_8 + 4294967295;
> > >   if (ui_7 > 9)
> > >   ...
> > > 
> > >   <bb 5>:
> > >   p_22 = p_19 + 4294967295;
> > >   *p_22 = 45;
> > > 
> > > which gives us two auto-dec, and much tighter code:
> > > 
> > >         strb    r1, [r4, #-1]!
> > >         lsr     r3, r3, #3
> > >         bhi     .L4
> > >         cmp     r6, #0
> > >         movlt   r3, #45
> > >         strblt  r3, [r4, #-1]!
> > > 
> > > Would it be OK to go with my approach, or is worth looking into the rtl
> > > optimizers and seeing what can be done (boo! :)).
> > 
> > Would be interesting to see what is causing things to break.  If it's
> > only combine doing the trick then it will obviously be multi-uses
> > of the adjustment pseudo.  But eventually I thought fwprop would
> > come to the rescue...
> 
> I could post the WIP for this approach if anyone is interested.
> 
> > 
> > I think a good approach would be to instead of just replacing
> > all out-of-loop uses with the new SSA name doing the adjustment
> > check if the use is in a "simple" (thus SSA name +- cst) stmt
> > and there forward the adjustment.
> > 
> > So have the best of both worlds.  I would have hoped this
> > extra complication isn't needed but as you figured...
> 
> Hmmm, can we go with my first patch (cleaned up) instead, as it's already
> working?

Well, your patch only replaces increments it can modify, possibly
leaving uses unaltered.  That's IMHO not good.

Which is why I suggested to have it like your original patch for
those uses you can modify but for other uses simply substitute
the LHS of the compensation stmt.

If the compensation stmt ends up being not used it won't be
expanded I think (you could of course insert/build it just on-demand).
Comment 41 Aldy Hernandez 2018-03-15 13:09:44 UTC
(In reply to rguenther@suse.de from comment #40)

> Well, your patch only replaces increments it can modify possibly
> leaving uses unaltered.  That's IMHO not good.
> 
> Which is why I suggested to have it like your original patch for
> those uses you can modify but for other uses simply substitute
> the LHS of the compensation stmt.
> 
> If the compensation stmt ends up being not used it won't be
> expanded I think (you could of course insert/build it just on-demand).

Ah.  I had misunderstood.

I will do as you suggest and then take the patch upstream so we can finish the rest of the discussion there.

Thanks so much for the feedback.
Comment 42 Aldy Hernandez 2018-03-19 17:58:02 UTC
(In reply to Jeffrey A. Law from comment #36)
> WRT the division removal.  That seems so profitable that a slight increase
> in codesize is warranted.  So if we fix the other issue  and the source of
> the remaining codesize regressions is the removal of the division, I would
> consider this BZ resolved.

Richi.  Jeff.

Limiting the single_use restriction to optimize_size may give us the best of both worlds.  Would you like me to post the [untested] patch below upstream?  With this patch, code size is even smaller than with GCC 5.3.

I really don't care.  Actually, I'd prefer to do nothing and close the PR ;-).  Up to y'all.

diff --git a/gcc/match.pd b/gcc/match.pd
index f61c4d7440a..5d29bf62dc9 100644
--- a/gcc/match.pd
+++ b/gcc/match.pd
@@ -1290,11 +1290,12 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 /* X / C1 op C2 into a simple range test.  */
 (for cmp (simple_comparison)
  (simplify
-  (cmp (trunc_div:s @0 INTEGER_CST@1) INTEGER_CST@2)
+  (cmp (trunc_div:s@3 @0 INTEGER_CST@1) INTEGER_CST@2)
   (if (INTEGRAL_TYPE_P (TREE_TYPE (@0))
        && integer_nonzerop (@1)
        && !TREE_OVERFLOW (@1)
-       && !TREE_OVERFLOW (@2))
+       && !TREE_OVERFLOW (@2)
+       && (!optimize_size || single_use (@3)))
    (with { tree lo, hi; bool neg_overflow;
 	   enum tree_code code = fold_div_compare (cmp, @1, @2, &lo, &hi,
 						   &neg_overflow); }
@@ -1456,9 +1457,10 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
 (for cmp (eq ne)
      ocmp (lt ge)
  (simplify
-  (cmp (trunc_div @0 @1) integer_zerop)
+  (cmp (trunc_div@2 @0 @1) integer_zerop)
   (if (TYPE_UNSIGNED (TREE_TYPE (@0))
-       && (VECTOR_TYPE_P (type) || !VECTOR_TYPE_P (TREE_TYPE (@0))))
+       && (VECTOR_TYPE_P (type) || !VECTOR_TYPE_P (TREE_TYPE (@0)))
+       && (!optimize_size || single_use (@2)))
    (ocmp @0 @1))))
 
 /* X == C - X can never be true if C is odd.  */
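For reference, the identity behind the second guarded pattern only holds for unsigned operands with a nonzero divisor; a quick illustrative check:

```c
/* (X / Y) == 0  <=>  X < Y   and   (X / Y) != 0  <=>  X >= Y,
   for unsigned X, Y with Y != 0 (matching the TYPE_UNSIGNED precondition
   in the pattern; division by zero is undefined, so Y == 0 is excluded).  */
static int div_eq_zero (unsigned x, unsigned y) { return x / y == 0; }
static int folded_lt (unsigned x, unsigned y)   { return x < y; }
```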
Comment 43 rguenther@suse.de 2018-03-20 07:54:35 UTC
On Mon, 19 Mar 2018, aldyh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
> 
> --- Comment #42 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> (In reply to Jeffrey A. Law from comment #36)
> > WRT the division removal.  That seems so profitable that a slight increase
> > in codesize is warranted.  So if we fix the other issue  and the source of
> > the remaining codesize regressions is the removal of the division, I would
> > consider this BZ resolved.
> 
> Richi.  Jeff.
> 
> Limiting the single_use with optimize_size as well may give us the best of both
> worlds.  Would you like me to post the [untested] patch below upstream?  With
> this patch code size is even smaller than GCC 5.3.
> 
> I really don't care.  Actually, I'd prefer to do nothing and close the PR ;-). 
> Up to y'all.

I don't like to see plain optimize_size uses in match.pd, so yeah,
just close the PR.

Disclaimer: if we ever need to do sth like that we should wrap it
in sth like optimize_pattern_for_size () to be able to eventually
pick up local profiling info.

> diff --git a/gcc/match.pd b/gcc/match.pd
> index f61c4d7440a..5d29bf62dc9 100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -1290,11 +1290,12 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  /* X / C1 op C2 into a simple range test.  */
>  (for cmp (simple_comparison)
>   (simplify
> -  (cmp (trunc_div:s @0 INTEGER_CST@1) INTEGER_CST@2)
> +  (cmp (trunc_div:s@3 @0 INTEGER_CST@1) INTEGER_CST@2)
>    (if (INTEGRAL_TYPE_P (TREE_TYPE (@0))
>         && integer_nonzerop (@1)
>         && !TREE_OVERFLOW (@1)
> -       && !TREE_OVERFLOW (@2))
> +       && !TREE_OVERFLOW (@2)
> +       && (!optimize_size || single_use (@3)))
>     (with { tree lo, hi; bool neg_overflow;
>            enum tree_code code = fold_div_compare (cmp, @1, @2, &lo, &hi,
>                                                    &neg_overflow); }
> @@ -1456,9 +1457,10 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  (for cmp (eq ne)
>       ocmp (lt ge)
>   (simplify
> -  (cmp (trunc_div @0 @1) integer_zerop)
> +  (cmp (trunc_div@2 @0 @1) integer_zerop)
>    (if (TYPE_UNSIGNED (TREE_TYPE (@0))
> -       && (VECTOR_TYPE_P (type) || !VECTOR_TYPE_P (TREE_TYPE (@0))))
> +       && (VECTOR_TYPE_P (type) || !VECTOR_TYPE_P (TREE_TYPE (@0)))
> +       && (!optimize_size || single_use (@2)))
>     (ocmp @0 @1))))
> 
>  /* X == C - X can never be true if C is odd.  */
> 
>
Comment 44 Aldy Hernandez 2018-03-20 08:21:34 UTC
(In reply to rguenther@suse.de from comment #43)

> I don't like to see plain optimize_size uses in match.pd, so yeah,
> just close the PR.

Excellent.

> 
> Disclaimer: if we ever need to do sth like that we should wrap it
> in sth like optimize_pattern_for_size () to be able to eventually
> pick up local profiling info.

Noted.

Thanks for all your help on this.
Comment 45 Aldy Hernandez 2018-03-21 07:11:23 UTC
Proposed patch:

https://gcc.gnu.org/ml/gcc-patches/2018-03/msg00885.html

Richi has made some suggestions and will take it from here.

Thanks.
Comment 46 Jakub Jelinek 2018-10-26 10:06:39 UTC
GCC 6 branch is being closed
Comment 47 Richard Biener 2018-10-31 11:58:06 UTC
Author: rguenth
Date: Wed Oct 31 11:57:33 2018
New Revision: 265677

URL: https://gcc.gnu.org/viewcvs?rev=265677&root=gcc&view=rev
Log:
2018-10-31  Richard Biener  <rguenther@suse.de>

	PR middle-end/70359
	PR middle-end/86270
	* tree-outof-ssa.c (insert_backedge_copies): Restrict
	copy generation to useful cases.  Place the copy before
	the definition of the backedge value when possible.

	* gcc.target/i386/pr70359.c: New testcase.
	* gcc.target/i386/pr86270.c: Likewise.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr70359.c
    trunk/gcc/testsuite/gcc.target/i386/pr86270.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-outof-ssa.c