[Bug middle-end/70359] [6/7/8 Regression] Code size increase for x86/ARM/others compared to gcc-5.3.0

Thu Mar 15 12:23:00 GMT 2018

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359

--- Comment #39 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
(In reply to rguenther@suse.de from comment #38)
> On Thu, 15 Mar 2018, aldyh at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70359
> > 
> > --- Comment #37 from Aldy Hernandez <aldyh at gcc dot gnu.org> ---
> > Hi Richi.
> > 
> > (In reply to rguenther@suse.de from comment #31)
> > 
> > > I'd have not restricted the out-of-loop IV use to IV +- CST but
> > > instead did the transform
> > > 
> > > +   LOOP:
> > > +     # p_8 = PHI <p_16(2), p_INC(3)>
> > > +     ...
> > > +     p_INC = p_8 - 1;
> > > +     goto LOOP;
> > > +     ... p_8 uses ...
> > > 
> > > to
> > > 
> > > +   LOOP:
> > > +     # p_8 = PHI <p_16(2), p_INC(3)>
> > > +     ...
> > > +     p_INC = p_8 - 1;
> > > +     goto LOOP;
> > >       newtem_12 = p_INC + 1; // undo IV increment
> > >       ... p_8 out-of-loop p_8 uses replaced with newtem_12 ...
> > > 
> > > so it would always work if we can undo the IV increment.
> > > 
> > > The disadvantage might be that we then rely on RTL optimizations
> > > to combine the original out-of-loop constant add with the
> > > newtem computation but I guess that's not too much to ask ;)
> > > k
> > 
> > It looks like RTL optimizations have a harder time optimizing things when I
> > take the above approach.
> > 
> > Doing what you suggest, we end up optimizing this (simplified for brevity):
> > 
> >   <bb 3>
> >   # p_8 = PHI <p_16(2), p_19(3)>
> >   p_19 = p_8 + 4294967295;
> >   if (ui_7 > 9)
> >     goto <bb 3>; [89.00%]
> >   ...
> > 
> >   <bb 5>
> >   p_22 = p_8 + 4294967294;
> >   MEM[(char *)p_19 + 4294967295B] = 45;
> > 
> > into this:
> > 
> >   <bb 3>:
> >   # p_8 = PHI <p_16(2), p_19(3)>
> >   p_19 = p_8 + 4294967295;
> >   if (ui_7 > 9)
> >   ...
> > 
> >   <bb 4>:
> >   _25 = p_19 + 1;          ;; undo the increment
> >   ...
> > 
> >   <bb 5>:
> >   p_22 = _25 + 4294967294;
> >   MEM[(char *)_25 + 4294967294B] = 45;
> 
> shouldn't the p_19 in the MEM remain?  Why a new BB for the adjustment?
> (just asking...)

Oh, I was adjusting/propagating that one as a special case to see if the [_25 +
4294967294] could be CSE'd somehow by the RTL optimizers.  But even with the
p_19 remaining, the result is the same in the assembly.  For the record, with
the p_19 in place, we would get (see below for a more complete dump):

  p_22 = _25 + 4294967294;
  MEM[(char *)p_19 + 4294967295B] = 45;

There is no new BB.  That was the result of adding the TMP on the non-back
edge.  My simplification of the gimple IL for illustration purposes made that
bit unclear.

The unadulterated original was:

   <bb 3> [local count: 1073741825]:
  # ui_7 = PHI <ui_13(2), ui_21(3)>
  # p_8 = PHI <p_16(2), p_19(3)>
  _4 = ui_7 % 10;
  _5 = (char) _4;
  p_19 = p_8 + 4294967295;
  _6 = _5 + 48;
  MEM[base: p_19, offset: 0B] = _6;
  ui_21 = ui_7 / 10;
  if (ui_7 > 9)
    goto <bb 3>; [89.00%]
  else
    goto <bb 4>; [11.00%]

  <bb 4> [local count: 118111601]:
  if (i_12(D) < 0)
    goto <bb 5>; [41.00%]
  else
    goto <bb 6>; [59.00%]

  <bb 5> [local count: 48425756]:
  p_22 = p_8 + 4294967294;
  MEM[(char *)p_19 + 4294967295B] = 45;

with a corresponding transformation (keeping the p_19 as you inquired) of:

  <bb 3> [local count: 1073741825]:
  # ui_7 = PHI <ui_13(2), ui_24(3)>
  # p_8 = PHI <p_16(2), p_19(3)>
  _4 = ui_7 % 10;
  _5 = (char) _4;
  p_19 = p_8 + 4294967295;
  _6 = _5 + 48;
  MEM[base: p_19, offset: 0B] = _6;
  ui_21 = ui_7 / 10;
  ui_24 = ui_21;
  if (ui_7 > 9)
    goto <bb 3>; [89.00%]
  else
    goto <bb 4>; [11.00%]

  <bb 4> [local count: 118111601]:
  _25 = p_19 + 1;                  ;; undo the increment on the non-back edge
  if (i_12(D) < 0)
    goto <bb 5>; [41.00%]
  else
    goto <bb 6>; [59.00%]

  <bb 5> [local count: 48425756]:
  p_22 = _25 + 4294967294;
  MEM[(char *)p_19 + 4294967295B] = 45;

> 
> > I haven't dug into the RTL optimizations, but the end result is that we only
> > get one auto-dec inside the loop, and some -2 indexing outside of it:
> > 
> >         strb    r1, [r4, #-1]!
> >         lsr     r3, r3, #3
> >         bhi     .L4
> >         cmp     r6, #0
> >         movlt   r2, #45
> >         add     r3, r4, #1
> >         strblt  r2, [r3, #-2]
> >         sublt   r4, r4, #1
> > 
> > as opposed to mine:
> > 
> >   <bb 3>:
> >   # p_8 = PHI <p_16(2), p_19(3)>
> >   p_19 = p_8 + 4294967295;
> >   if (ui_7 > 9)
> >   ...
> > 
> >   <bb 5>:
> >   p_22 = p_19 + 4294967295;
> >   *p_22 = 45;
> > 
> > which gives us two auto-dec, and much tighter code:
> > 
> >         strb    r1, [r4, #-1]!
> >         lsr     r3, r3, #3
> >         bhi     .L4
> >         cmp     r6, #0
> >         movlt   r3, #45
> >         strblt  r3, [r4, #-1]!
> > 
> > Would it be OK to go with my approach, or is worth looking into the rtl
> > optimizers and seeing what can be done (boo! :)).
> 
> Would be interesting to see what is causing things to break.  If it's
> only combine doing the trick then it will obviously be multi-uses
> of the adjustment pseudo.  But eventually I thought fwprop would
> come to the rescue...

I could post the WIP for this approach if anyone is interested.

> 
> I think a good approach would be to instead of just replacing
> all out-of-loop uses with the new SSA name doing the adjustment
> check if the use is in a "simple" (thus SSA name +- cst) stmt
> and there forward the adjustment.
> 
> So have the best of both worlds.  I would have hoped this
> extra complication isn't needed but as you figured...

Hmmm, can we go with my first patch (cleaned up) instead, as it's already
working?

Thanks.