This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Re: Use of vector instructions in memmov/memset expanding
- From: Michael Zolotukhin <michael dot v dot zolotukhin at gmail dot com>
- To: Jan Hubicka <hubicka at ucw dot cz>
- Cc: Jakub Jelinek <jakub at redhat dot com>, Jack Howarth <howarth at bromo dot med dot uc dot edu>, gcc-patches at gcc dot gnu dot org, Richard Guenther <richard dot guenther at gmail dot com>, "H.J. Lu" <hjl dot tools at gmail dot com>, izamyatin at gmail dot com, areg dot melikadamyan at gmail dot com
- Date: Fri, 28 Oct 2011 16:41:03 +0400
- Subject: Re: Use of vector instructions in memmov/memset expanding
- References: <CANtU079YjpFwdy1kXoebsLyKe08KTwjwXDGJ4kY+gEJcZUTdMg@mail.gmail.com> <20110928133105.GA26045@bromo.med.uc.edu> <CANtU07-DbAbwc6gH_KYaYiXqgMRhGLv_G5Gifvv1Nu0CLh7fgw@mail.gmail.com> <20110928215104.GA29339@bromo.med.uc.edu> <CANtU07-4ZvCmH=OFa49Zc5LQWWzj4jOCLbr6PeDRqiPouZrCdQ@mail.gmail.com> <CANtU07-iXjvZ0h80EM-fUgKNdDUTS0GCASNFSe608Su4AFM3Sw@mail.gmail.com> <20110929112159.GT2687@tyan-ft48-01.lab.bos.redhat.com> <CANtU07-TBmKmmoO9tOnE+8VCExDs1pU2fOUqU+HACFMW6bWhzg@mail.gmail.com> <CANtU07_UpRQx92W6UPMKcthR4Kphf2vmw4L+w3btx3Lc4VCZYg@mail.gmail.com> <CANtU078LPkNdAA6JNeCQUjVbJHJ-bH3XyWP5m6g+3Xj+koCoSw@mail.gmail.com> <20111027150909.GA29087@kam.mff.cuni.cz>
Hi Jan!
Thanks for the review; you can find my answers to some of your
remarks below. I'll send a corrected patch soon with answers to the
rest of your remarks.
> - {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> - {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
> + {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> + {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
>
> I am a bit concerned about the explosion of variants, but adding aligned variants probably makes
> sense. I guess we should specify what alignment needs to be known, i.e. is alignment of 2 enough
> or shall the alignment be matching the size of loads/stores produced?
Yes, the alignment should match the size of loads/stores, and the
offset from the alignment boundary must also be known. Otherwise, the
strategies for unknown alignment are chosen.
> This hunk seems dangerous in that emitting the explicit loadq/storeq pairs (and similar) will
> prevent use of integer registers for 64bit/128bit arithmetic.
>
> I guess we could play such tricks for memory-memory moves & constant stores. With gimple optimizations
> we already know pretty well that the moves will stay as they are. That might be enough for you?
Yes, theoretically it could harm 64/128-bit arithmetic, but what else
could we do if we have a DImode mem-to-mem move and our mode is
32-bit? Ideally, the RA should be able to make decisions on how to perform
such moves, but currently it doesn't generate SSE moves - when it is
able to do so, I think we could remove this part and rely on the RA.
One more point: this is quite a special case - here we want to
perform the move via half of a vector register. That is the main reason why
these particular cases are handled in a special, not common, way.
> I wrote the original function, but it is not really clear to me what the function
> does now, i.e. what is code for updating addresses and what means reusing iter.
> I guess reusing iter means that we won't start the loop from 0. Could you
> expand comments a bit more?
>
> I know I did not document them originally, but all the parameters ought to be
> explicitly documented in a function comment.
Yep, you're right - we just don't start the loop from 0. I'll send a
version with the comments soon.
> -/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST. */
> +/* Emit strset instuction. If RHS is constant, and vector mode will be used,
> + then move this consatnt to a vector register before emitting strset. */
> +static void
> +emit_strset (rtx destmem, rtx value,
> + rtx destptr, enum machine_mode mode, int offset)
>
> This seems to more naturally belong into gen_strset expander?
I don't think it matters here, but to make emit_strset look similar to
emit_strmov, most of emit_strset's body really could be moved to
gen_strset.
> if (max_size > 16)
> {
> rtx label = ix86_expand_aligntest (count, 16, true);
> if (TARGET_64BIT)
> {
> - dest = change_address (destmem, DImode, destptr);
> - emit_insn (gen_strset (destptr, dest, value));
> - emit_insn (gen_strset (destptr, dest, value));
> + destmem = change_address (destmem, DImode, destptr);
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> + value)));
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> + value)));
>
> No use for 128bit moves here?
> }
> else
> {
> - dest = change_address (destmem, SImode, destptr);
> - emit_insn (gen_strset (destptr, dest, value));
> - emit_insn (gen_strset (destptr, dest, value));
> - emit_insn (gen_strset (destptr, dest, value));
> - emit_insn (gen_strset (destptr, dest, value));
> + destmem = change_address (destmem, SImode, destptr);
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> + value)));
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> + value)));
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> + value)));
> + emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> + value)));
>
> And here?
For memset prologues/epilogues I avoid using vector moves, as they could
require expensive initialization (we need to create a vector filled
with the given value). For the other cases, I'll re-check whether vector
moves can be used.
> @@ -21286,6 +21528,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
> off = 4;
> emit_insn (gen_strmov (destreg, dst, srcreg, src));
> }
> + if (align_bytes & 8)
> + {
> + if (TARGET_64BIT || TARGET_SSE)
> + {
> + dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
> + src = adjust_automodify_address_nv (src, DImode, srcreg, off);
> + emit_insn (gen_strmov (destreg, dst, srcreg, src));
> + }
> + else
> + {
> + dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
> + src = adjust_automodify_address_nv (src, SImode, srcreg, off);
> + emit_insn (gen_strmov (destreg, dst, srcreg, src));
> + emit_insn (gen_strmov (destreg, dst, srcreg, src));
>
> again, no use for vector moves?
Actually, vector moves are used here when they are available: if 32-bit
mode is used (so we can't do the 8-byte move via a GPR) but SSE is
available, an SSE move is generated.
> +/* Target hook. Returns rtx of mode MODE with promoted value VAL, that is
> + supposed to represent one byte. MODE could be a vector mode.
> + Example:
> + 1) VAL = const_int (0xAB), mode = SImode,
> + the result is const_int (0xABABABAB).
>
> This can be handled in machine independent way, right?
>
> + 2) if VAL isn't const, then the result will be the result of MUL-instruction
> + of VAL and const_int (0x01010101) (for SImode). */
>
> This would probably go better as named expansion pattern, like we do for other
> machine description interfaces.
I don't think it can be done in a machine-independent way - e.g. if
AVX is available, we can use broadcast instructions; if not, we need
to use multiply instructions, and on other architectures there are
probably other, more efficient ways to duplicate a byte value across
the entire vector register. So IMO it's a good place to have a hook.
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 90cef1c..4b7d67b 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -5780,6 +5780,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
> The default is zero which means to not iterate over other vector sizes.
> @end deftypefn
>
> +@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
> +This hook should return true if memory accesses in mode @var{mode} to data
> +aligned by @var{align} bits have a cost many times greater than aligned
> +accesses, for example if they are emulated in a trap handler.
>
> New hooks should go to the machine independent part of the patch.
>
These changes are present in the middle-end part too (of course, in
the full patch they are not duplicated). I left them in this patch as
well to avoid possible build problems.
--
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.
On 27 October 2011 19:09, Jan Hubicka <hubicka@ucw.cz> wrote:
> Hi,
> sorry for delay with the review. This is my first pass through the backend part, hopefully
> someone else will do the middle end bits.
>
> diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
> index 2c53423..d7c4330 100644
> --- a/gcc/config/i386/i386.c
> +++ b/gcc/config/i386/i386.c
> @@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
>   COSTS_N_BYTES (2),                    /* cost of FABS instruction.  */
>   COSTS_N_BYTES (2),                    /* cost of FCHS instruction.  */
>   COSTS_N_BYTES (2),                    /* cost of FSQRT instruction.  */
> -  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
>     {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> -  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
> +  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
>     {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> +   {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> +    {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
>
> I am a bit concerned about the explosion of variants, but adding aligned variants probably makes
> sense.  I guess we should specify what alignment needs to be known, i.e. is alignment of 2 enough
> or shall the alignment be matching the size of loads/stores produced?
>
> @@ -15266,6 +15362,38 @@ ix86_expand_move (enum machine_mode mode, rtx operands[])
>     }
>   else
>     {
> +      if (mode == DImode
> +          && !TARGET_64BIT
> +          && TARGET_SSE2
> +          && MEM_P (op0)
> +          && MEM_P (op1)
> +          && !push_operand (op0, mode)
> +          && can_create_pseudo_p ())
> +       {
> +         rtx temp = gen_reg_rtx (V2DImode);
> +         emit_insn (gen_sse2_loadq (temp, op1));
> +         emit_insn (gen_sse_storeq (op0, temp));
> +         return;
> +       }
> +      if (mode == DImode
> +          && !TARGET_64BIT
> +          && TARGET_SSE
> +          && !MEM_P (op1)
> +          && GET_MODE (op1) == V2DImode)
> +       {
> +         emit_insn (gen_sse_storeq (op0, op1));
> +         return;
> +       }
> +      if (mode == TImode
> +          && TARGET_AVX2
> +          && MEM_P (op0)
> +          && !MEM_P (op1)
> +          && GET_MODE (op1) == V4DImode)
> +       {
> +         op0 = convert_to_mode (V2DImode, op0, 1);
> +         emit_insn (gen_vec_extract_lo_v4di (op0, op1));
> +         return;
> +       }
>
> This hunk seems dangerous in that emitting the explicit loadq/storeq pairs (and similar) will
> prevent use of integer registers for 64bit/128bit arithmetic.
>
> I guess we could play such tricks for memory-memory moves & constant stores. With gimple optimizations
> we already know pretty well that the moves will stay as they are.  That might be enough for you?
>
> If you go this way, please make a separate patch so it can be benchmarked. While the moves are faster,
> there are always problems with size mismatches in load/store buffers.
>
> I think for string operations we should output the SSE/AVX instruction variants
> by hand + we could try to instruct IRA to actually prefer use of SSE/AVX when
> feasible?  This has traditionally been a problem with the old RA because it did not
> see that, because a pseudo is eventually used for arithmetic, it can not go into an
> SSE register. So it was not possible to make MMX/SSE/AVX the preferred
> variants for 64bit/128bit manipulations w/o hurting performance of code that
> does arithmetic on long long and __int128.  Perhaps IRA can solve this now.
> Vladimir, do you have any idea?
>
> +/* Helper function for expand_set_or_movmem_via_loop.
> +   This function can reuse iter rtx from another loop and don't generate
> +   code for updating the addresses.  */
> +static rtx
> +expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
> +                                         rtx destptr, rtx srcptr, rtx value,
> +                                         rtx count, rtx iter,
> +                                         enum machine_mode mode, int unroll,
> +                                         int expected_size, bool change_ptrs)
>
> I wrote the original function, but it is not really clear to me what the function
> does now, i.e. what is code for updating addresses and what means reusing iter.
> I guess reusing iter means that we won't start the loop from 0.  Could you
> expand comments a bit more?
>
> I know I did not document them originally, but all the parameters ought to be
> explicitly documented in a function comment.
> @@ -20913,7 +21065,27 @@ emit_strmov (rtx destmem, rtx srcmem,
>   emit_insn (gen_strmov (destptr, dest, srcptr, src));
>  }
>
> -/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
> +/* Emit strset instuction.  If RHS is constant, and vector mode will be used,
> +   then move this consatnt to a vector register before emitting strset.  */
> +static void
> +emit_strset (rtx destmem, rtx value,
> +            rtx destptr, enum machine_mode mode, int offset)
>
> This seems to more naturally belong into gen_strset expander?
>        {
> -         if (TARGET_64BIT)
> -           {
> -             dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
> -             emit_insn (gen_strset (destptr, dest, value));
> -           }
> -         else
> -           {
> -             dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
> -             emit_insn (gen_strset (destptr, dest, value));
> -             dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
> -             emit_insn (gen_strset (destptr, dest, value));
> -           }
> -         offset += 8;
> +         if (GET_MODE (destmem) != move_mode)
> +           destmem = change_address (destmem, move_mode, destptr);
> AFAIK change_address is not equivalent to adjust_automodify_address_nv in a way
> it copies memory aliasing attributes, and it is needed to zap them here since
> stringops behave funnily WRT aliasing.
>   if (max_size > 16)
>     {
>       rtx label = ix86_expand_aligntest (count, 16, true);
>       if (TARGET_64BIT)
>        {
> -         dest = change_address (destmem, DImode, destptr);
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> +         destmem = change_address (destmem, DImode, destptr);
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (DImode,
> +                                                               value)));
>
> No use for 128bit moves here?
>        }
>       else
>        {
> -         dest = change_address (destmem, SImode, destptr);
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> -         emit_insn (gen_strset (destptr, dest, value));
> +         destmem = change_address (destmem, SImode, destptr);
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
> +         emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode,
> +                                                               value)));
>
> And here?
> @@ -21204,8 +21426,8 @@ expand_movmem_prologue (rtx destmem, rtx srcmem,
>   if (align <= 1 && desired_alignment > 1)
>     {
>       rtx label = ix86_expand_aligntest (destptr, 1, false);
> -      srcmem = change_address (srcmem, QImode, srcptr);
> -      destmem = change_address (destmem, QImode, destptr);
> +      srcmem = adjust_automodify_address_1 (srcmem, QImode, srcptr, 0, 1);
> +      destmem = adjust_automodify_address_1 (destmem, QImode, destptr, 0, 1);
>
> You want to always use adjust_automodify_address or adjust_automodify_address_nv;
> adjust_automodify_address_1 is not intended for general use.
> @@ -21286,6 +21528,37 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
>       off = 4;
>       emit_insn (gen_strmov (destreg, dst, srcreg, src));
>     }
> +  if (align_bytes & 8)
> +    {
> +      if (TARGET_64BIT || TARGET_SSE)
> +       {
> +         dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
> +         src = adjust_automodify_address_nv (src, DImode, srcreg, off);
> +         emit_insn (gen_strmov (destreg, dst, srcreg, src));
> +       }
> +      else
> +       {
> +         dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
> +         src = adjust_automodify_address_nv (src, SImode, srcreg, off);
> +         emit_insn (gen_strmov (destreg, dst, srcreg, src));
> +         emit_insn (gen_strmov (destreg, dst, srcreg, src));
>
> again, no use for vector moves?
> +/* Target hook.  Returns rtx of mode MODE with promoted value VAL, that is
> +   supposed to represent one byte.  MODE could be a vector mode.
> +   Example:
> +   1) VAL = const_int (0xAB), mode = SImode,
> +   the result is const_int (0xABABABAB).
>
> This can be handled in machine independent way, right?
>
> +   2) if VAL isn't const, then the result will be the result of MUL-instruction
> +   of VAL and const_int (0x01010101) (for SImode).  */
>
> This would probably go better as named expansion pattern, like we do for other
> machine description interfaces.
> diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
> index 90cef1c..4b7d67b 100644
> --- a/gcc/doc/tm.texi
> +++ b/gcc/doc/tm.texi
> @@ -5780,6 +5780,32 @@ mode returned by @code{TARGET_VECTORIZE_PREFERRED_SIMD_MODE}.
>  The default is zero which means to not iterate over other vector sizes.
>  @end deftypefn
>
> +@deftypefn {Target Hook} bool TARGET_SLOW_UNALIGNED_ACCESS (enum machine_mode @var{mode}, unsigned int @var{align})
> +This hook should return true if memory accesses in mode @var{mode} to data
> +aligned by @var{align} bits have a cost many times greater than aligned
> +accesses, for example if they are emulated in a trap handler.
>
> New hooks should go to the machine independent part of the patch.
>
> Honza
>