[Bug middle-end/84067] [8 regression] gcc.dg/wmul-1.c regression on aarch64 after r257077

Mon Jan 29 12:54:00 GMT 2018

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067

--- Comment #7 from rguenther at suse dot de <rguenther at suse dot de> ---
On Mon, 29 Jan 2018, ktkachov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067
> 
> --- Comment #6 from ktkachov at gcc dot gnu.org ---
> (In reply to rguenther@suse.de from comment #5)
> > On Mon, 29 Jan 2018, ktkachov at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84067
> > > 
> > > --- Comment #3 from ktkachov at gcc dot gnu.org ---
> > > (In reply to Richard Biener from comment #2)
> > > > So any hint on whether the code after r257077 is better or worse than before?
> > > 
> > > Looks worse unfortunately:
> > > For aarch64 at -O2 it generates:
> > > foo:
> > >         mov     w3, 44
> > >         mov     w2, 40
> > >         mov     w5, 1
> > >         mov     w4, 2
> > >         smull   x3, w1, w3
> > >         smull   x2, w1, w2
> > >         str     w5, [x0, x3]
> > >         add     x2, x2, 400
> > >         add     x1, x2, x1, sxtw 2
> > >         str     w4, [x0, x1]
> > >         ret
> > > 
> > > whereas with r257077 it generates the shorter:
> > > foo:
> > >         mov     w3, 40
> > >         sxtw    x2, w1
> > >         mov     w4, 1
> > >         smaddl  x0, w1, w3, x0
> > >         mov     w3, 2
> > >         add     x1, x0, x2, lsl 2
> > >         str     w4, [x0, x2, lsl 2]
> > >         str     w3, [x1, 400]
> > >         ret
> > 
> > So shorter is worse?  Might be because I don't understand the
> > difference between the 'lsl 2' and the 'sxtw 2' or the cost
> > of the [x1, 400] addressing.
> 
> Sorry, I messed up the writeup. Let me try again.
> The shorter sequence (with the smaddl) is the good one and is produced
> *without* r257077. After r257077 we generate the longer and worse sequence with
> two smull.

I see the shorter sequence with TOT, r257077 included.  The testcase
explicitly checks for no WIDEN_MULT_PLUS_EXPR but we now have two:

  <bb 2> [local count: 1073741825]:
  _17 = Idx_6(D) w* 44;
  _13 = Arr_7(D) + _17;
  MEM[(int[10] *)_13] = 1;
  _4 = WIDEN_MULT_PLUS_EXPR <Idx_6(D), 40, 400>;
  _18 = WIDEN_MULT_PLUS_EXPR <Idx_6(D), 4, _4>;
  _16 = Arr_7(D) + _18;
  MEM[(int[10] *)_16] = 2;
  return;
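
For readers following the arithmetic, here is a rough C model of the two
byte offsets computed in the dump above.  The helper names are made up
for illustration and are not GCC internals; the sketch assumes that
"w*" is a signed 32x32->64 widening multiply and that
WIDEN_MULT_PLUS_EXPR <a, b, c> adds a 64-bit addend c to that widened
product.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper modelling `a w* b`: signed 32x32 -> 64-bit multiply. */
static int64_t widen_mult(int32_t a, int32_t b)
{
  return (int64_t)a * (int64_t)b;
}

/* Hypothetical helper modelling WIDEN_MULT_PLUS_EXPR <a, b, c>:
   the widened product of a and b plus the 64-bit addend c. */
static int64_t widen_mult_plus(int32_t a, int32_t b, int64_t c)
{
  return (int64_t)a * (int64_t)b + c;
}

/* Byte offset of the first store: _17 = Idx w* 44. */
static int64_t first_store_offset(int32_t idx)
{
  return widen_mult(idx, 44);
}

/* Byte offset of the second store:
   _4  = WIDEN_MULT_PLUS_EXPR <Idx, 40, 400>;
   _18 = WIDEN_MULT_PLUS_EXPR <Idx, 4, _4>;  */
static int64_t second_store_offset(int32_t idx)
{
  int64_t t4 = widen_mult_plus(idx, 40, 400);
  return widen_mult_plus(idx, 4, t4);
}

/* Sanity check: the second offset simplifies to Idx * 44 + 400, so the
   two stores are always 400 bytes apart. */
static void check_offsets(int32_t idx)
{
  assert(second_store_offset(idx) - first_store_offset(idx) == 400);
}
```

Under these assumptions the second offset is just the first one plus
400, which is why a single multiply by 44 would in principle serve both
stores.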

Note that the "shorter" sequence I see is

foo:
        mov     x4, 400
        mov     w3, 40
        mov     w2, 44
        mov     w5, 1
        smaddl  x3, w1, w3, x4
        mov     w4, 2
        smull   x2, w1, w2
        add     x1, x3, x1, sxtw 2
        str     w5, [x0, x2]
        str     w4, [x0, x1]
        ret

which doesn't 1:1 match either of yours.