This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: Use of vector instructions in memmov/memset expanding
- From: Michael Zolotukhin <michael dot v dot zolotukhin at gmail dot com>
- To: Jan Hubicka <hubicka at ucw dot cz>
- Cc: Jakub Jelinek <jakub at redhat dot com>, Jack Howarth <howarth at bromo dot med dot uc dot edu>, gcc-patches at gcc dot gnu dot org, Richard Guenther <richard dot guenther at gmail dot com>, "H.J. Lu" <hjl dot tools at gmail dot com>, izamyatin at gmail dot com, areg dot melikadamyan at gmail dot com
- Date: Fri, 28 Oct 2011 16:41:03 +0400
- Subject: Re: Use of vector instructions in memmov/memset expanding
- References: <CANtU079YjpFwdy1kXoebsLyKe08KTwjwXDGJ4kY+gEJcZUTdMg@mail.gmail.com> <20110928133105.GA26045@bromo.med.uc.edu> <CANtU07-DbAbwc6gH_KYaYiXqgMRhGLv_G5Gifvv1Nu0CLh7fgw@mail.gmail.com> <20110928215104.GA29339@bromo.med.uc.edu> <CANtU07-4ZvCmH=OFa49Zc5LQWWzj4jOCLbr6PeDRqiPouZrCdQ@mail.gmail.com> <CANtU07-iXjvZ0h80EM-fUgKNdDUTS0GCASNFSe608Su4AFM3Sw@mail.gmail.com> <20110929112159.GT2687@tyan-ft48-01.lab.bos.redhat.com> <CANtU07-TBmKmmoO9tOnE+8VCExDs1pU2fOUqU+HACFMW6bWhzg@mail.gmail.com> <CANtU07_UpRQx92W6UPMKcthR4Kphf2vmw4L+w3btx3Lc4VCZYg@mail.gmail.com> <CANtU078LPkNdAA6JNeCQUjVbJHJ-bH3XyWP5m6g+3Xj+koCoSw@mail.gmail.com> <20111027150909.GA29087@kam.mff.cuni.cz>
Hi Jan!
Thanks for the review; you can find my answers to some of your
remarks below. I'll send a corrected patch soon with answers to the
rest of your remarks.
> - {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> - {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
> + {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> + {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
>
> I am a bit concerned about the explosion of variants, but adding aligned variants probably makes
> sense. I guess we should specify what alignment needs to be known, i.e. is alignment of 2 enough
> or shall the alignment be matching the size of loads/stores produced?
Yes, the alignment should match the size of loads/stores, and the
offset from the alignment boundary must also be known. Otherwise, the
strategies for unknown alignment are chosen.
> This hunk seems dangerous in that emitting the explicit loadq/storeq pairs (and similar) will
> prevent use of integer registers for 64bit/128bit arithmetic.
>
> I guess we could play such tricks for memory-memory moves & constant stores. With gimple optimizations
> we already know pretty well that the moves will stay as they are. That might be enough for you?
Yes, theoretically it could harm 64/128-bit arithmetic, but what else
could we do if we have a DImode mem-to-mem move and our mode is
32-bit? Ideally, the RA should be able to make decisions on how to perform
such moves, but currently it doesn't generate SSE moves - when it is
able to do so, I think we could remove this part and rely on the RA.
One more point: this is quite a special case - here we want to
perform the move via half of a vector register. That is the main reason why
these particular cases are handled in a special, not common, way.
> I wrote the original function, but it is not really clear to me what the function
> does now, i.e. what is code for updating addresses and what means reusing iter.
> I guess reusing iter means that we won't start the loop from 0. Could you
> expand comments a bit more?
>
> I know I did not document them originally, but all the parameters ought to be
> explicitly documented in a function comment.
Yep, you're right - we just don't start the loop from 0. I'll send a
version with the comments soon.
> -/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST. */
> +/* Emit strset instuction. If RHS is constant, and vector mode will be used,
> + then move this consatnt to a vector register before emitting strset. */
> +static void
> +emit_strset (rtx destmem, rtx value,
> + rtx destptr, enum machine_mode mode, int offset)
>
> This seems to more naturally belong into gen_strset expander?
I don't think it matters here, but to make emit_strset look similar to
emit_strmov, most of emit_strset's body really could be moved to
gen_strset.
> if (max_size > 16)
> {
> rtx label = ix86_expand_aligntest (count, 16, true);
> if (TARGET_64BIT)
> {
> - dest = change_address (destmem, DImode, destptr);
> - emit_insn (gen_strset (destptr, dest, value));
> - emit_insn (gen_strset (destptr, dest, value));
> + destmem = change_address (destmem, DImode, destptr);
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> + value)));
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> + value)));
>
> No use for 128bit moves here?
> }
> else
> {
> - dest = change_address (destmem, SImode, destptr);
> - emit_insn (gen_strset (destptr, dest, value));
> - emit_insn (gen_strset (destptr, dest, value));
> - emit_insn (gen_strset (destptr, dest, value));
> - emit_insn (gen_strset (destptr, dest, value));
> + destmem = change_address (destmem, SImode, destptr);
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> + value)));
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> + value)));
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> + value)));
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> + value)));
>
> And here?
For memset prologues/epilogues I avoid using vector moves, as they could
require expensive initialization (we need to create a vector filled
with the given value). For the other cases, I'll re-check whether vector
moves can be used.
> @@ -21286,6 +21528,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
> off = 4;
> emit_insn (gen_strmov (destreg, dst, srcreg, src));
> }
> + if (align_bytes & 8)
> + {
> + if (TARGET_64BIT || TARGET_SSE)
> + {
> + dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
> + src = adjust_automodify_address_nv (src, DImode, srcreg, off);
> + emit_insn (gen_strmov (destreg, dst, srcreg, src));
> + }
> + else
> + {
> + dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
> + src = adjust_automodify_address_nv (src, SImode, srcreg, off);
> + emit_insn (gen_strmov (destreg, dst, srcreg, src));
> + emit_insn (gen_strmov (destreg, dst, srcreg, src));
>
> again, no use for vector moves?
Actually, vector moves are used here when they are available: if 32-bit
mode is used (so we can't do the 8-byte move via a GPR) but SSE is
available, an SSE move is generated.
> +/* Target hook. Returns rtx of mode MODE with promoted value VAL, that is
> + supposed to represent one byte. MODE could be a vector mode.
> + Example:
> + 1) VAL = const_int (0xAB), mode = SImode,
> + the result is const_int (0xABABABAB).
>
> This can be handled in machine independent way, right?
>
> + 2) if VAL isn't const, then the result will be the result of MUL-instruction
> + of VAL and const_int (0x01010101) (for SImode). */
>
> This would probably go better as named expansion pattern, like we do for other
> machine description interfaces.
I don't think it can be done in a machine-independent way - e.g. if
AVX is available, we can use broadcast instructions; if not, we need
to use multiply instructions, and on other architectures there are
probably other, more efficient ways to duplicate a byte value across
the entire vector register. So IMO it's a good place to have a hook.
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 90cef1c..4b7d67b 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -5780,6 +5780,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
> The default is zero which means to not iterate over other vector sizes.
> @end deftypefn
>
> +@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
> +This hook should return true if memory accesses in mode @var{mode} to data
> +aligned by @var{align} bits have a cost many times greater than aligned
> +accesses, for example if they are emulated in a trap handler.
>
> New hooks should go to the machine independent part of the patch.
>
These changes are present in the middle-end part too (of course, in
the full patch they are not duplicated). I left them in this patch as
well to avoid possible build problems.
--
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.
On 27 October 2011 19:09, Jan Hubicka <hubicka@ucw.cz> wrote:
> Hi,
> sorry for delay with the review. This is my first pass through the backend part, hopefully
> someone else will do the middle end bits.
>
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 2c53423..d7c4330 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
>   COSTS_N_BYTES (2),                    /* cost of FABS instruction.  */
>   COSTS_N_BYTES (2),                    /* cost of FCHS instruction.  */
>   COSTS_N_BYTES (2),                    /* cost of FSQRT instruction.  */
> -  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
>     {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> -  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
> +  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
>     {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> +   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
>
> I am a bit concerned about the explosion of variants, but adding aligned variants probably makes
> sense.  I guess we should specify what alignment needs to be known, i.e. is alignment of 2 enough
> or shall the alignment be matching the size of loads/stores produced?
>
> @@ -15266,6 +15362,38 @@ ix86_expand_move (enum machine_mode mode, rtx operands[])
>     }
>   else
>     {
> +      if (mode == DImode
> +          && !TARGET_64BIT
> +          && TARGET_SSE2
> +          && MEM_P (op0)
> +          && MEM_P (op1)
> +          && !push_operand (op0, mode)
> +          && can_create_pseudo_p ())
> +       {
> +         rtx temp = gen_reg_rtx (V2DImode);
> +         emit_insn (gen_sse2_loadq (temp, op1));
> +         emit_insn (gen_sse_storeq (op0, temp));
> +         return;
> +       }
> +      if (mode == DImode
> +          && !TARGET_64BIT
> +          && TARGET_SSE
> +          && !MEM_P (op1)
> +          && GET_MODE (op1) == V2DImode)
> +       {
> +         emit_insn (gen_sse_storeq (op0, op1));
> +         return;
> +       }
> +      if (mode == TImode
> +          && TARGET_AVX2
> +          && MEM_P (op0)
> +          && !MEM_P (op1)
> +          && GET_MODE (op1) == V4DImode)
> +       {
> +         op0 = convert_to_mode (V2DImode, op0, 1);
> +         emit_insn (gen_vec_extract_lo_v4di (op0, op1));
> +         return;
> +       }
>
> This hunk seems dangerous in that emitting the explicit loadq/storeq pairs (and similar) will
> prevent use of integer registers for 64bit/128bit arithmetic.
>
> I guess we could play such tricks for memory-memory moves & constant stores. With gimple optimizations
> we already know pretty well that the moves will stay as they are.  That might be enough for you?
>
> If you go this way, please make a separate patch so it can be benchmarked. While the moves are faster,
> there are always problems with size mismatches in load/store buffers.
>
> I think for string operations we should output the SSE/AVX instruction variants
> by hand + we could try to instruct IRA to actually prefer use of SSE/AVX when
> feasible?  This has traditionally been a problem with the old RA because it did not
> see that, because a pseudo is eventually used for arithmetic, it can not go into an
> SSE register. So it was not possible to make MMX/SSE/AVX the preferred
> variants for 64bit/128bit manipulations w/o hurting performance of code that
> does arithmetic on long long and __int128.  Perhaps IRA can solve this now.
> Vladimir, do you have any idea?
>
> +/* Helper function for expand_set_or_movmem_via_loop.
> +   This function can reuse iter rtx from another loop and don't generate
> +   code for updating the addresses.  */
> +static rtx
> +expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
> +                                         rtx destptr, rtx srcptr, rtx value,
> +                                         rtx count, rtx iter,
> +                                         enum machine_mode mode, int unroll,
> +                                         int expected_size, bool change_ptrs)
>
> I wrote the original function, but it is not really clear to me what the function
> does now, i.e. what is code for updating addresses and what means reusing iter.
> I guess reusing iter means that we won't start the loop from 0.  Could you
> expand comments a bit more?
>
> I know I did not document them originally, but all the parameters ought to be
> explicitly documented in a function comment.
> @@ -20913,7 +21065,27 @@ emit_strmov (rtx destmem, rtx srcmem,
>   emit_insn (gen_strmov (destptr, dest, srcptr, src));
>  }
>
> -/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
> +/* Emit strset instuction.  If RHS is constant, and vector mode will be used,
> +   then move this consatnt to a vector register before emitting strset.  */
> +static void
> +emit_strset (rtx destmem, rtx value,
> +            rtx destptr, enum machine_mode mode, int offset)
>
> This seems to more naturally belong into gen_strset expander?
>        {
> -         if (TARGET_64BIT)
> -           {
> -             dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
> -             emit_insn (gen_strset (destptr, dest, value));
> -           }
> -         else
> -           {
> -             dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
> -             emit_insn (gen_strset (destptr, dest, value));
> -             dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
> -             emit_insn (gen_strset (destptr, dest, value));
> -           }
> -         offset += 8;
> +         if (GET_MODE (destmem) != move_mode)
> +           destmem = change_address (destmem, move_mode, destptr);
> AFAIK change_address is not equivalent to adjust_automodify_address_nv in a way
> it copies memory aliasing attributes, and it is needed to zap them here since
> stringops behave funnily WRT aliasing.
>   if (max_size > 16)
>     {
>       rtx label = ix86_expand_aligntest (count, 16, true);
>       if (TARGET_64BIT)
>        {
> -         dest = change_address (destmem, DImode, destptr);
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> +         destmem = change_address (destmem, DImode, destptr);
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> +                                                               value)));
>
> No use for 128bit moves here?
>        }
>       else
>        {
> -         dest = change_address (destmem, SImode, destptr);
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> +         destmem = change_address (destmem, SImode, destptr);
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
>
> And here?
> @@ -21204,8 +21426,8 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
>   if (align <= 1 && desired_alignment > 1)
>     {
>       rtx label = ix86_expand_aligntest (destptr, 1, false);
> -      srcmem = change_address (srcmem, QImode, srcptr);
> -      destmem = change_address (destmem, QImode, destptr);
> +      srcmem = adjust_automodify_address_1 (srcmem, QImode, srcptr, 0, 1);
> +      destmem = adjust_automodify_address_1 (destmem, QImode, destptr, 0, 1);
>
> You want to always use adjust_automodify_address or adjust_automodify_address_nv;
> adjust_automodify_address_1 is not intended for general use.
> @@ -21286,6 +21528,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
>       off = 4;
>       emit_insn (gen_strmov (destreg, dst, srcreg, src));
>     }
> +  if (align_bytes & 8)
> +    {
> +      if (TARGET_64BIT || TARGET_SSE)
> +       {
> +         dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
> +         src = adjust_automodify_address_nv (src, DImode, srcreg, off);
> +         emit_insn (gen_strmov (destreg, dst, srcreg, src));
> +       }
> +      else
> +       {
> +         dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
> +         src = adjust_automodify_address_nv (src, SImode, srcreg, off);
> +         emit_insn (gen_strmov (destreg, dst, srcreg, src));
> +         emit_insn (gen_strmov (destreg, dst, srcreg, src));
>
> again, no use for vector moves?
> +/* Target hook.  Returns rtx of mode MODE with promoted value VAL, that is
> +   supposed to represent one byte.  MODE could be a vector mode.
> +   Example:
> +   1) VAL = const_int (0xAB), mode = SImode,
> +   the result is const_int (0xABABABAB).
>
> This can be handled in machine independent way, right?
>
> +   2) if VAL isn't const, then the result will be the result of MUL-instruction
> +   of VAL and const_int (0x01010101) (for SImode).  */
>
> This would probably go better as named expansion pattern, like we do for other
> machine description interfaces.
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 90cef1c..4b7d67b 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -5780,6 +5780,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
>  The default is zero which means to not iterate over other vector sizes.
>  @end deftypefn
>
> +@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
> +This hook should return true if memory accesses in mode @var{mode} to data
> +aligned by @var{align} bits have a cost many times greater than aligned
> +accesses, for example if they are emulated in a trap handler.
>
> New hooks should go to the machine independent part of the patch.
>
> Honza
>