This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs


On Fri, 9 Aug 2019, Uros Bizjak wrote:

> On Mon, Aug 5, 2019 at 3:09 PM Uros Bizjak <ubizjak@gmail.com> wrote:
> 
> > > > > > > (define_mode_iterator MAXMIN_IMODE [SI "TARGET_SSE4_1"] [DI "TARGET_AVX512F"])
> > > > > > >
> > > > > > > and then we need to split DImode for 32bits, too.
> > > > > >
> > > > > > For now, please add "TARGET_64BIT && TARGET_AVX512F" for DImode
> > > > > > condition, I'll provide _doubleword splitter later.
> > > > >
> > > > > Shouldn't that be TARGET_AVX512VL instead?  Or does the insn use %g0 etc.
> > > > > to force use of %zmmN?
> > > >
> > > > It generates V4SI mode, so - yes, AVX512VL.
> > >
> > >     case SMAX:
> > >     case SMIN:
> > >     case UMAX:
> > >     case UMIN:
> > >       if ((mode == DImode && (!TARGET_64BIT || !TARGET_AVX512VL))
> > >           || (mode == SImode && !TARGET_SSE4_1))
> > >         return false;
> > >
> > > so there's no way to use AVX512VL for 32bit?
> >
> > There is a way, but on 32bit targets, we need to split DImode
> > operation to a sequence of SImode operations for unconverted pattern.
> > This is of course doable, but somehow more complex than simply
> > emitting a DImode compare + DImode cmove, which is what current
> > splitter does. So, a follow-up task.
> 
> Please find attached the complete .md part that enables SImode for
> TARGET_SSE4_1 and DImode for TARGET_AVX512VL for both, 32bit and 64bit
> targets. The patterns also allows for memory operand 2, so STV has
> chance to create the vector pattern with implicit load. In case STV
> fails, the memory operand 2 is loaded to the register first;  operand
> 2 is used in compare and cmove instruction, so pre-loading of the
> operand should be beneficial.

Thanks.

> Also note, that splitting should happen rarely. Due to the cost
> function, STV should effectively always convert minmax to a vector
> insn.

I've analyzed the 464.h264ref slowdown on Haswell and it is due to
this kind of "simple" conversion:

  5.50 │1d0:   test   %esi,%es
  0.07 │       mov    $0x0,%ex
       │       cmovs  %eax,%es
  5.84 │       imul   %r8d,%es

to

  0.65 │1e0:   vpxor  %xmm0,%xmm0,%xmm0
  0.32 │       vpmaxs -0x10(%rsp),%xmm0,%xmm0
 40.45 │       vmovd  %xmm0,%eax
  2.45 │       imul   %r8d,%eax

which looks like a RA artifact in the end.  We spill %esi only
with -mstv here as STV introduces a (subreg:V4SI ...) use
of a pseudo ultimatively set from di.  STV creates an additional
pseudo for this (copy-in) but it places that copy next to the
original def rather than next to the start of the chain it
converts which is probably the issue why we spill.  And this
is because it inserts those at each definition of the pseudo
rather than just at the reaching definition(s) or at the
uses of the pseudo in the chain (that because there may be
defs of that pseudo in the chain itself).  Note that STV emits
such "conversion" copies as simple reg-reg moves:

(insn 1094 3 4 2 (set (reg:SI 777)
        (reg/v:SI 438 [ y ])) "refbuf.i":4:1 -1
     (nil))

but those do not prevail very long (this one gets removed by CSE2).
So IRA just sees the (subreg:V4SI (reg/v:SI 438 [ y ]) 0) use
and computes

    r438: preferred NO_REGS, alternative NO_REGS, allocno NO_REGS
    a297(r438,l0) costs: SSE_REGS:5628,5628 MEM:3618,3618

so I wonder if STV shouldn't instead emit gpr->xmm moves
here (but I guess nothing again prevents RTL optimizers from
combining that with the single-use in the max instruction...).

So this boils down to STV splitting live-ranges but other
passes undoing that and then RA not considering splitting
live-ranges here, arriving at unoptimal allocation.

A testcase showing this issue is (simplified from 464.h264ref
UMVLine16Y_11):

unsigned short
UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
{
  if (y != width)
    {
      y = y < 0 ? 0 : y;
      return Pic[y * width];
    }
  return Pic[y];
}

where the condition and the Pic[y] load mimics the other use of y.
Different, even worse spilling is generated by

unsigned short
UMVLine16Y_11 (short unsigned int * Pic, int y, int width)
{
  y = y < 0 ? 0 : y;
  return Pic[y * width] + y;
}

I guess this all shows that STVs "trick" of simply wrapping
integer mode pseudos in (subreg:vector-mode ...) is bad?

I've added a (failing) testcase to reflect the above.

Richard.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]