[PATCH v2] Add vec_const_duplicate optab and TARGET_GEN_MEMSET_SCRATCH_RTX

H.J. Lu hjl.tools@gmail.com
Tue Jun 1 13:29:41 GMT 2021


On Tue, Jun 1, 2021 at 6:25 AM Richard Biener
<richard.guenther@gmail.com> wrote:
>
> On Tue, Jun 1, 2021 at 3:05 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > On Mon, May 31, 2021 at 11:54:53PM -0600, Jeff Law wrote:
> > >
> > >
> > > On 5/31/2021 11:50 PM, Richard Sandiford wrote:
> > > > "H.J. Lu via Gcc-patches" <gcc-patches@gcc.gnu.org> writes:
> > > > > On Mon, May 31, 2021 at 06:32:04AM -0700, H.J. Lu wrote:
> > > > > > On Mon, May 31, 2021 at 6:26 AM Richard Biener
> > > > > > <richard.guenther@gmail.com> wrote:
> > > > > > > On Mon, May 31, 2021 at 3:12 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > > > > > On Mon, May 31, 2021 at 5:46 AM Richard Biener
> > > > > > > > <richard.guenther@gmail.com> wrote:
> > > > > > > > > On Mon, May 31, 2021 at 2:09 PM H.J. Lu <hjl.tools@gmail.com> wrote:
> > > > > > > > > > On Wed, May 26, 2021 at 10:28:16AM +0200, Richard Biener wrote:
> > > > > > > > > > > > > >   -- Target Hook: rtx TARGET_GEN_MEMSET_VALUE (rtx DATA, scalar_int_mode
> > > > > > > > > > > > > >            MODE)
> > > > > > > > > > > > > >       This function returns the RTL of a register containing
> > > > > > > > > > > > > >       'GET_MODE_SIZE (MODE)' consecutive copies of the unsigned char
> > > > > > > > > > > > > >       value given in the RTL register DATA.  For example, if MODE is 4
> > > > > > > > > > > > > >       bytes wide, return the RTL for 0x01010101*DATA.
> > > > > > > > > > > > > For this one I wonder if it should be an optab instead.  Couldn't you
> > > > > > > > > > > > > use the existing vec_duplicate for this by using (paradoxical) subregs
> > > > > > > > > > > > > like (subreg:TI (vec_duplicate:VnQI (subreg:VnQI (reg:QI ...)))?
> > > > > > > > > > > > I tried.   It doesn't even work on x86.  See:
> > > > > > > > > > > >
> > > > > > > > > > > > https://gcc.gnu.org/pipermail/gcc-patches/2021-May/570661.html
> > > > > > > > > > > Not sure what I should read from there...
> > > > > > > > > > >
> > > > > > > > > > > > There are special cases to subreg HI, SI and DI modes of TI mode in
> > > > > > > > > > > > ix86_gen_memset_value_from_prev.   simplify_gen_subreg doesn't
> > > > > > > > > > > > work here.   Each backend may need its own special handling.
> > > > > > > > > > > OK, I guess I'm not (RTL) qualified enough to further review these parts,
> > > > > > > > > > > sorry.  Since we're doing code generation the canonical way to communicate
> > > > > > > > > > > with backends should be optabs, not some set of disconnected target hooks.
> > > > > > > > > > > But as said, I probably don't know enough of RTL to see why it's the only way.
> > > > > > > > > > >
> > > > > > > > > > > Richard.
> > > > > > > > > > Here is the patch to add optabs instead.  Does it look OK?
> > > > > > > > > >
> > > > > > > > > > Thanks.
> > > > > > > > > >
> > > > > > > > > > H.J.
> > > > > > > > > > ---
> > > > > > > > > > Add 2 optabs:
> > > > > > > > > >
> > > > > > > > > > 1. integer_extract: Extract lower bit value from the integer value in
> > > > > > > > > > TImode, OImode or XImode.
> > > > > > > > > That sounds very specific, esp. the restriction to {TI,OI,XI}mode.
> > > > > > > > > It also sounds like it matches (subreg:{TI,OI,XI} (...) 0).  There are
> > > > > > > > > existing target hooks verifying subreg validity - why's that not a good
> > > > > > > > > fit here?  ISTR you say gen_lowpart () doesn't work (or was it
> > > > > > > > > simplify_gen_subreg?), why's that so?
> > > > > > > > {TI,OI,XI}mode are storage only integer types.   subreg doesn't work
> > > > > > > > well on them.  I got
> > > > > > > >
> > > > > > > > [hjl@gnu-cfl-2 pieces]$ cat s2.i
> > > > > > > > extern void *ops;
> > > > > > > >
> > > > > > > > void
> > > > > > > > foo (int c)
> > > > > > > > {
> > > > > > > >    __builtin_memset (ops, c, 34);
> > > > > > > > }
> > > > > > > > [hjl@gnu-cfl-2 pieces]$ make s2.s
> > > > > > > > /export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/xgcc
> > > > > > > > -B/export/build/gnu/tools-build/gcc-gitlab-debug/build-x86_64-linux/gcc/
> > > > > > > > -O2 -march=haswell -S s2.i
> > > > > > > > during RTL pass: reload
> > > > > > > > s2.i: In function ‘foo’:
> > > > > > > > s2.i:7:1: internal compiler error: maximum number of generated reload
> > > > > > > > insns per insn achieved (90)
> > > > > > > >      7 | }
> > > > > > > >        | ^
> > > > > > > > 0x1050734 lra_constraints(bool)
> > > > > > > > /export/gnu/import/git/gitlab/x86-gcc/gcc/lra-constraints.c:5091
> > > > > > > > 0x1039536 lra(_IO_FILE*)
> > > > > > > > /export/gnu/import/git/gitlab/x86-gcc/gcc/lra.c:2336
> > > > > > > > 0xfe1140 do_reload
> > > > > > > > /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:5822
> > > > > > > > 0xfe162e execute
> > > > > > > > /export/gnu/import/git/gitlab/x86-gcc/gcc/ira.c:6008
> > > > > > > > Please submit a full bug report,
> > > > > > > > with preprocessed source if appropriate.
> > > > > > > > Please include the complete backtrace with any bug report.
> > > > > > > > See <https://gcc.gnu.org/bugs/> for instructions.
> > > > > > > > make: *** [Makefile:32: s2.s] Error 1
> > > > > > > > [hjl@gnu-cfl-2 pieces]$
> > > > > > > >
> > > > > > > > due to
> > > > > > > >
> > > > > > > > (insn 12 11 0 (set (mem:HI (plus:DI (reg/f:DI 84)
> > > > > > > >                  (const_int 32 [0x20])) [0 MEM <char[1:34]> [(void
> > > > > > > > *)ops.0_1]+32 S2 A8])
> > > > > > > >          (subreg:HI (reg:OI 51 xmm15) 0)) "s2.i":6:3 -1
> > > > > > > >       (nil))
> > > > > > > >
> > > > > > > > The new optab gives us
> > > > > > > >
> > > > > > > > (insn 12 11 13 2 (set (reg:TI 88)
> > > > > > > >          (reg:TI 51 xmm15)) "s2.i":6:3 -1
> > > > > > > >       (nil))
> > > > > > > > (insn 13 12 14 2 (set (reg:SI 89)
> > > > > > > >          (subreg:SI (reg:TI 88) 0)) "s2.i":6:3 -1
> > > > > > > >       (nil))
> > > > > > > > (insn 14 13 15 2 (set (reg:HI 87)
> > > > > > > >          (subreg:HI (reg:SI 89) 0)) "s2.i":6:3 -1
> > > > > > > >       (nil))
> > > > > > > that looks odd to me - what's the final result after LRA?  I think
> > > > > > I got:
> > > > > >
> > > > > > vmovd %edi, %xmm15
> > > > > > movq ops(%rip), %rdx
> > > > > > vpbroadcastb %xmm15, %ymm15
> > > > > > vmovq %xmm15, %rax    <<<< move to GPR
> > > > > > vmovdqu %ymm15, (%rdx)
> > > > > > movw %ax, 32(%rdx)   <<<< subreg of GPR
> > > > > > vzeroupper
> > > > > > ret
> > > > > >
> > > > > > > we should see to make lowpart_subreg work on {XI,OI,TI}mode.
> > > > > > > Only two steps should be necessary at most:
> > > > > > > xmm -> gpr, grp -> subreg, or gpr -> subreg.  So the expander
> > > > > > > code in memset should try to generate the subreg directly
> > > > > > subreg didn't fail on x86 when I tried.
> > > > > >
> > > > > > > and if that fails, try a word_mode subreg followed by the subreg.
> > > > > > I will try word_mode subreg.
> > > > > >
> > > > > Here is the v2 patch to use word_mode subreg.  For
> > > > >
> > > > > ---
> > > > > extern void *ops;
> > > > >
> > > > > void
> > > > > foo (int c)
> > > > > {
> > > > >    __builtin_memset (ops, 4, 32);
> > > > > }
> > > > > ---
> > > > >
> > > > > without vec_const_duplicate, I got
> > > > >
> > > > >   movl    $4, %eax
> > > > >   movq    ops(%rip), %rdx
> > > > >   movd    %eax, %xmm0
> > > > >   punpcklbw       %xmm0, %xmm0
> > > > >   punpcklwd       %xmm0, %xmm0
> > > > >   pshufd  $0, %xmm0, %xmm0
> > > > >   movups  %xmm0, (%rdx)
> > > > >   movups  %xmm0, 16(%rdx)
> > > > >   ret
> > > > >
> > > > > With vec_const_duplicate, I got
> > > > >
> > > > >   movq    ops(%rip), %rax
> > > > >   movdqa  .LC0(%rip), %xmm0
> > > > >   movups  %xmm0, (%rax)
> > > > >   movups  %xmm0, 16(%rax)
> > > > >   ret
> > > > >
> > > > > If vec_duplicate is allowed to fail, I don't need vec_const_duplicate.
> > > > I don't understand why we need an optab for this though.  If the operand
> > > > is constant then we should just be doing an ordinary move in which the
> > > > source is a CONST_VECTOR.  It's then up to the move patterns to handle
> > > > duplicated constants as efficiently as possible.  (Sorry if this was
> > > > discussed upthread and I missed it.)
> > > That's exactly the point I'm trying to get across as well.
> > >
> >
> > This is what we do today.  But I'd like to generate
> >
> >         movl    $4, %eax
> >         vpbroadcastb    %eax, %ymm15
> >         movq    ops(%rip), %rax
> >         vmovdqu %ymm15, (%rax)
> >         vzeroupper
> >         ret
> >
> > instead of
> >
> >         vmovdqa .LC0(%rip), %ymm15
> >         movq    ops(%rip), %rax
> >         vmovdqu %ymm15, (%rax)
> >         vzeroupper
> >         ret
> >
> > Do I need a vec_dup pattern for it?
>
> I think we have special code sequences to materialize some
> constant vectors already, we should be able to add to that, no?

We can do that for all 0s and all 1s at the final codegen.   For
other values, since we need a GPR, we can't do that.

-- 
H.J.


More information about the Gcc-patches mailing list