[Bug middle-end/70359] [6/7/8 Regression] Code size increase for x86/ARM/others compared to gcc-5.3.0

Thu Mar 15 12:52:00 GMT 2018

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359

--- Comment #40 from rguenther at suse dot de <rguenther at suse dot de> ---
On Thu, 15 Mar 2018, aldyh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
> 
> --- Comment #39 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> (In reply to rguenther@suse.de from comment #38)
> > On Thu, 15 Mar 2018, aldyh at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
> > > 
> > > --- Comment #37 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> > > Hi Richi.
> > > 
> > > (In reply to rguenther@suse.de from comment #31)
> > > 
> > > > I'd have not restricted the out-of-loop IV use to IV +- CST but
> > > > instead did the transform
> > > > 
> > > > +   LOOP:
> > > > +     # p_8 = PHI <p_16(2), p_INC(3)>
> > > > +     ...
> > > > +     p_INC = p_8 - 1;
> > > > +     goto LOOP;
> > > > +     ... p_8 uses ...
> > > > 
> > > > to
> > > > 
> > > > +   LOOP:
> > > > +     # p_8 = PHI <p_16(2), p_INC(3)>
> > > > +     ...
> > > > +     p_INC = p_8 - 1;
> > > > +     goto LOOP;
> > > >       newtem_12 = p_INC + 1; // undo IV increment
> > > >       ... p_8 out-of-loop p_8 uses replaced with newtem_12 ...
> > > > 
> > > > so it would always work if we can undo the IV increment.
> > > > 
> > > > The disadvantage might be that we then rely on RTL optimizations
> > > > to combine the original out-of-loop constant add with the
> > > > newtem computation but I guess that's not too much to ask ;)
> > > > k
> > > 
> > > It looks like RTL optimizations have a harder time optimizing things when I
> > > take the above approach.
> > > 
> > > Doing what you suggest, we end up optimizing this (simplified for brevity):
> > > 
> > >   <bb 3>
> > >   # p_8 = PHI <p_16(2), p_19(3)>
> > >   p_19 = p_8 + 4294967295;
> > >   if (ui_7 > 9)
> > >     goto <bb 3>; [89.00%]
> > >   ...
> > > 
> > >   <bb 5>
> > >   p_22 = p_8 + 4294967294;
> > >   MEM[(char *)p_19 + 4294967295B] = 45;
> > > 
> > > into this:
> > > 
> > >   <bb 3>:
> > >   # p_8 = PHI <p_16(2), p_19(3)>
> > >   p_19 = p_8 + 4294967295;
> > >   if (ui_7 > 9)
> > >   ...
> > > 
> > >   <bb 4>:
> > >   _25 = p_19 + 1;          ;; undo the increment
> > >   ...
> > > 
> > >   <bb 5>:
> > >   p_22 = _25 + 4294967294;
> > >   MEM[(char *)_25 + 4294967294B] = 45;
> > 
> > shouldn't the p_19 in the MEM remain?  Why a new BB for the adjustment?
> > (just asking...)
> 
> Oh, I was adjusting/propagating that one as a special case to see if the [_25 +
> 4294967294] could be CSE'd somehow by the RTL optimizers.  But even with the
> p_19 remaining, the result is the same in the assembly.  For the record, with
> the p_19 in place, we would get (see below for a more complete dump):
> 
>   p_22 = _25 + 4294967294;
>   MEM[(char *)p_19 + 4294967295B] = 45;
> 
> There is no new BB.  That was the result of adding the TMP on the non-back
> edge.  My simplification of the gimple IL for illustration purposes made that
> bit unclear.
> 
> The unadulterated original was:
> 
>    <bb 3> [local count: 1073741825]:
>   # ui_7 = PHI <ui_13(2), ui_21(3)>
>   # p_8 = PHI <p_16(2), p_19(3)>
>   _4 = ui_7 % 10;
>   _5 = (char) _4;
>   p_19 = p_8 + 4294967295;
>   _6 = _5 + 48;
>   MEM[base: p_19, offset: 0B] = _6;
>   ui_21 = ui_7 / 10;
>   if (ui_7 > 9)
>     goto <bb 3>; [89.00%]
>   else
>     goto <bb 4>; [11.00%]
> 
>   <bb 4> [local count: 118111601]:
>   if (i_12(D) < 0)
>     goto <bb 5>; [41.00%]
>   else
>     goto <bb 6>; [59.00%]
> 
>   <bb 5> [local count: 48425756]:
>   p_22 = p_8 + 4294967294;
>   MEM[(char *)p_19 + 4294967295B] = 45;
> 
> with a corresponding transformation (keeping the p_19 as you inquired) of:
> 
>   <bb 3> [local count: 1073741825]:
>   # ui_7 = PHI <ui_13(2), ui_24(3)>
>   # p_8 = PHI <p_16(2), p_19(3)>
>   _4 = ui_7 % 10;
>   _5 = (char) _4;
>   p_19 = p_8 + 4294967295;
>   _6 = _5 + 48;
>   MEM[base: p_19, offset: 0B] = _6;
>   ui_21 = ui_7 / 10;
>   ui_24 = ui_21;
>   if (ui_7 > 9)
>     goto <bb 3>; [89.00%]
>   else
>     goto <bb 4>; [11.00%]
> 
>   <bb 4> [local count: 118111601]:
>   _25 = p_19 + 1;                  ;; undo the increment on the non-back edge
>   if (i_12(D) < 0)
>     goto <bb 5>; [41.00%]
>   else
>     goto <bb 6>; [59.00%]
> 
>   <bb 5> [local count: 48425756]:
>   p_22 = _25 + 4294967294;
>   MEM[(char *)p_19 + 4294967295B] = 45;
> 
> > 
> > > I haven't dug into the RTL optimizations, but the end result is that we only
> > > get one auto-dec inside the loop, and some -2 indexing outside of it:
> > > 
> > >         strb    r1, [r4, #-1]!
> > >         lsr     r3, r3, #3
> > >         bhi     .L4
> > >         cmp     r6, #0
> > >         movlt   r2, #45
> > >         add     r3, r4, #1
> > >         strblt  r2, [r3, #-2]
> > >         sublt   r4, r4, #1
> > > 
> > > as opposed to mine:
> > > 
> > >   <bb 3>:
> > >   # p_8 = PHI <p_16(2), p_19(3)>
> > >   p_19 = p_8 + 4294967295;
> > >   if (ui_7 > 9)
> > >   ...
> > > 
> > >   <bb 5>:
> > >   p_22 = p_19 + 4294967295;
> > >   *p_22 = 45;
> > > 
> > > which gives us two auto-dec, and much tighter code:
> > > 
> > >         strb    r1, [r4, #-1]!
> > >         lsr     r3, r3, #3
> > >         bhi     .L4
> > >         cmp     r6, #0
> > >         movlt   r3, #45
> > >         strblt  r3, [r4, #-1]!
> > > 
> > > Would it be OK to go with my approach, or is worth looking into the rtl
> > > optimizers and seeing what can be done (boo! :)).
> > 
> > Would be interesting to see what is causing things to break.  If it's
> > only combine doing the trick then it will obviously be multi-uses
> > of the adjustment pseudo.  But eventually I thought fwprop would
> > come to the rescue...
> 
> I could post the WIP for this approach if anyone is interested.
> 
> > 
> > I think a good approach would be to instead of just replacing
> > all out-of-loop uses with the new SSA name doing the adjustment
> > check if the use is in a "simple" (thus SSA name +- cst) stmt
> > and there forward the adjustment.
> > 
> > So have the best of both worlds.  I would have hoped this
> > extra complication isn't needed but as you figured...
> 
> Hmmm, can we go with my first patch (cleaned up) instead, as it's already
> working?

Well, your patch only replaces increments it can modify possibly
leaving uses unaltered.  That's IMHO not good.

Which is why I suggested to have it like your original patch for
those uses you can modify but for other uses simply substitute
the LHS of the compensation stmt.

If the compensation stmt ends up being not used it won't be
expanded I think (you could of course insert/build it just on-demand).