This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: Use of vector instructions in memmov/memset expanding


Hi, Jan!
I was just preparing my own version of the patch, but it seems a bit
late for that now. Please see my comments on this and your previous
message below.

By the way, would it be possible to commit the other part of the patch
(the middle-end part) - probably also in small pieces - and some other
tuning after stage 1 closes?


> The patches disabling CSE and forwprop on constants are apparently papering over
> the problem that subregs of vector registers used in the epilogue make IRA think
> that it can't put the value into an SSE register (resulting in NO_REGS class), making
> reload output a load of 0 inside the internal loop.

The problem here isn't about subregs - there is just no way to emit a
store whose destination is a 128-bit memory operand and whose source
is a 128-bit immediate. We should somehow find that we previously
initialized a vector register for use in this particular store -
either IRA should traverse the code trying to find such an
initialization (i.e. IRA would have to revert forwprop's work), or we
simply shouldn't let such situations arise by disabling forwprop for
128-bit immediates. I think the second option is much easier both to
implement and to understand.
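To illustrate what I mean (just a hypothetical test case and sketch,
not code from the patch): x86 has no store form that takes a 128-bit
immediate, so the constant has to be materialized in an XMM register
first and only then stored.

  /* Hypothetical example: a 16-byte constant store.  There is no
     "movdqa m128, imm128" instruction form, so the expected sequence
     materializes the constant in an XMM register first:

         pxor    %xmm0, %xmm0        # make the 128-bit zero
         movdqa  %xmm0, (%rdi)       # 128-bit store from the register

     If forwprop folds the constant 0 straight into the store pattern,
     the register initialization disappears and IRA ends up with
     NO_REGS for the operand.  */
  void
  zero16 (char *p)
  {
    __builtin_memset (p, 0, 16);
  }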


> I also plugged some code paths - the pain here is that the stringops have many
> variants - different algorithms, different alignment, constant/variable
> counts. These increase the testing matrix, and some of the code paths were wrong with
> the new SSE code.

Yep, I also saw such failures, thanks for the fixes. Though I see
another problem here: the main reason for these failures is that when
the size is small we can skip the main loop and thus reach the
epilogue with an uninitialized loop iterator and/or promoted value. To
make the algorithm absolutely correct, we should either perform the
needed initializations at the very beginning (before the zero-test) or
use a byte loop in the epilogue. The second way could greatly hurt
performance, so I think we should just initialize everything before
the main loop, on the assumption that the size is big enough for the
main loop to be used (see the sketch below).
Moreover, this algorithm wasn't intended for small sizes in the first
place - memcpy/memset for small sizes should be expanded earlier, in
move_by_pieces or set_by_pieces (that was in the middle-end part of
the patch). So the assumption about the size should hold.
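Roughly, the expanded code has the following shape (a simplified
C-like sketch of the control flow, not the actual RTL the expander
emits); hoisting the marked initializations above the size test keeps
the epilogue correct even when the main loop is skipped:

  /* Simplified, compilable sketch of the expanded code's shape
     (assumed structure - the real expander emits RTL, and its
     epilogue uses wide stores rather than a byte loop).  */
  #include <stddef.h>

  static void
  memset_sketch (unsigned char *dst, unsigned char value, size_t count)
  {
    enum { CHUNK = 16 };
    size_t iter = 0;                /* init before the size test        */
    unsigned char promoted = value; /* likewise for the promoted value  */

    if (count >= CHUNK)             /* main loop may be skipped         */
      do
        {
          size_t i;
          for (i = 0; i < CHUNK; i++)   /* stands for one wide store    */
            dst[iter + i] = promoted;
          iter += CHUNK;
        }
      while (iter + CHUNK <= count);
    /* Epilogue: uses iter and the promoted value, so both must be
       valid even when the main loop was skipped.  */
    for (; iter < count; iter++)
      dst[iter] = promoted;
  }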


> We still may want to produce SSE moves for 64-bit
> operations in 32-bit codegen, but that is an independent problem, plus the patch as it is seems
> to produce a slight regression on crafty.

Actually, such 8-byte moves aren't critical for this part of the patch
- here they can only be used in prologues/epilogues and don't affect
performance much (assuming the size isn't very small, so a small
performance loss in the prologue/epilogue doesn't affect overall
performance).
But for memcpy/memset of small sizes, which are expanded in the
middle-end part, they could be quite crucial. For example, when
copying 24 bytes with unknown alignment on Atom, three 8-byte SSE
moves could be much faster than six 4-byte moves via GPRs. So it's
definitely good to have the option of generating such moves.
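For concreteness, here is a hypothetical test case together with the
two code sequences I have in mind (the instruction selection is only
illustrative, not output produced by this patch):

  /* Hypothetical 24-byte copy with unknown alignment, built with -m32.
     GPR variant: six 4-byte load/store pairs, e.g.
         movl  0(%esi), %eax
         movl  %eax, 0(%edi)        # repeated six times, 4 bytes each
     SSE variant: three 8-byte load/store pairs via movq, e.g.
         movq  0(%esi), %xmm0
         movq  %xmm0, 0(%edi)       # repeated three times, 8 bytes each
     so the SSE variant halves the number of memory operations.  */
  void
  copy24 (char *dst, const char *src)
  {
    __builtin_memcpy (dst, src, 24);
  }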


> I think it would be better to use V8QI mode
> for the promoted value (since it is what it really is) avoiding the need for changes in
> expand_move and the loadq pattern.

Actually, we rarely operate in byte mode - usually we move/store at
least in Pmode (when we use GPRs). So V4SI or V2DI also looks
reasonable to me here. Also, when promoting the value from a GPR to an
SSE register, we surely need the value in SImode/DImode, not in QImode.
We could do everything in QI/V16QI/V8QI modes, but that could lead to
generating converts in many places (such as in the promotion to a
vector).
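To illustrate the promotion being discussed (just a sketch of the
arithmetic; promote_duplicated_reg in the patch works on RTL, and the
exact sequence it emits may differ):

  /* Replicate a byte value across an SImode word; the vector
     promotion then broadcasts that word into V4SImode (e.g. via the
     vec_dupv4si expander added by the patch) rather than starting
     from QImode.  */
  unsigned int
  promote_byte_to_si (unsigned char c)
  {
    return (unsigned int) c * 0x01010101u;  /* c repeated in each byte */
  }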

> I noticed that the core model still uses generic costs, which is quite bogus.
Yes, I agree. It's better to have a separate cost model for them.


> I also reverted the changes to the generic cost models, since those are the result of a
> discussion between AMD and Intel and any changes here need to be discussed by both
> sides.

Sure, I totally agree.


On 7 November 2011 19:41, Jan Hubicka <hubicka@ucw.cz> wrote:
>
> Hi,
> this is the variant of the patch I hope to commit today after some further testing.
> I removed most of the code changing interfaces outside the i386 backend and also
> reverted the changes to the generic cost models, since those are the result of a
> discussion between AMD and Intel and any changes here need to be discussed by both
> sides.
>
> I noticed that the core model still uses generic costs, which is quite bogus. It also
> seems bogus to have two costs for the 32-bit and 64-bit modes of Core - the only
> reason why there are 32-bit and 64-bit generic models is that 32-bit generic
> still takes into account 32-bit chips (Centrino and Athlon). I think we may drop
> those and remove optimizations targeted at helping those chips.
>
> Bootstrapped/regtested x86_64-linux, intend to commit it today after some
> further testing.
>
> Honza
>
> 2011-11-03  Zolotukhin Michael  <michael.v.zolotukhin@gmail.com>
>             Jan Hubicka  <jh@suse.cz>
>
>        * config/i386/i386.h (processor_costs): Add second dimension to
>        stringop_algs array.
>        * config/i386/i386.c (cost models): Initialize second dimension of
>        stringop_algs arrays.
>        (core_cost): New costs based on generic64 costs with updated stringop
>        values.
>        (promote_duplicated_reg): Add support for vector modes, add
>        declaration.
>        (promote_duplicated_reg_to_size): Likewise.
>        (processor_target): Set core costs for core variants.
>        (expand_set_or_movmem_via_loop_with_iter): New function.
>        (expand_set_or_movmem_via_loop): Enable reuse of the same iters in
>        different loops, produced by this function.
>        (emit_strset): New function.
>        (expand_movmem_epilogue): Add epilogue generation for bigger sizes,
>        use SSE-moves where possible.
>        (expand_setmem_epilogue): Likewise.
>        (expand_movmem_prologue): Likewise for prologue.
>        (expand_setmem_prologue): Likewise.
>        (expand_constant_movmem_prologue): Likewise.
>        (expand_constant_setmem_prologue): Likewise.
>        (decide_alg): Add new argument align_unknown.  Fix algorithm of
>        strategy selection if TARGET_INLINE_ALL_STRINGOPS is set; skip sse_loop.
>        (decide_alignment): Update desired alignment according to chosen move
>        mode.
>        (ix86_expand_movmem): Change unrolled_loop strategy to use SSE-moves.
>        (ix86_expand_setmem): Likewise.
>        (ix86_slow_unaligned_access): Implementation of new hook
>        slow_unaligned_access.
>        * config/i386/i386.md (strset): Enable half-SSE moves.
>        * config/i386/sse.md (vec_dupv4si): Add expand for vec_dupv4si.
>        (vec_dupv2di): Add expand for vec_dupv2di.
>
> Index: i386.h
> ===================================================================
> --- i386.h ? ? ?(revision 181033)
> +++ i386.h ? ? ?(working copy)
> @@ -159,8 +159,12 @@ struct processor_costs {
> ? const int fchs; ? ? ? ? ? ? ?/* cost of FCHS instruction. ?*/
> ? const int fsqrt; ? ? ? ? ? ? /* cost of FSQRT instruction. ?*/
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?/* Specify what algorithm
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?to use for stringops on unknown size. ?*/
> - ?struct stringop_algs memcpy[2], memset[2];
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?to use for stringops on unknown size.
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?First index is used to specify whether
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?alignment is known or not.
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Second - to specify whether 32 or 64 bits
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?are used. ?*/
> + ?struct stringop_algs memcpy[2][2], memset[2][2];
> ? const int scalar_stmt_cost; ? /* Cost of any scalar operation, excluding
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? load and store. ?*/
> ? const int scalar_load_cost; ? /* Cost of scalar load. ?*/
> Index: i386.md
> ===================================================================
> --- i386.md ? ? (revision 181033)
> +++ i386.md ? ? (working copy)
> @@ -15937,6 +15937,17 @@
> ? ? ? ? ? ? ?(clobber (reg:CC FLAGS_REG))])]
> ? ""
> ?{
> + ?rtx vec_reg;
> + ?enum machine_mode mode = GET_MODE (operands[2]);
> + ?if (vector_extensions_used_for_mode (mode)
> + ? ? ?&& CONSTANT_P (operands[2]))
> + ? ?{
> + ? ? ?if (mode == DImode)
> + ? ? ? mode = TARGET_64BIT ? V2DImode : V4SImode;
> + ? ? ?vec_reg = gen_reg_rtx (mode);
> + ? ? ?emit_move_insn (vec_reg, operands[2]);
> + ? ? ?operands[2] = vec_reg;
> + ? ?}
> ? if (GET_MODE (operands[1]) != GET_MODE (operands[2]))
> ? ? operands[1] = adjust_address_nv (operands[1], GET_MODE (operands[2]), 0);
>
> Index: i386-opts.h
> ===================================================================
> --- i386-opts.h (revision 181033)
> +++ i386-opts.h (working copy)
> @@ -37,7 +37,8 @@ enum stringop_alg
> ? ?rep_prefix_8_byte,
> ? ?loop_1_byte,
> ? ?loop,
> - ? unrolled_loop
> + ? unrolled_loop,
> + ? sse_loop
> ?};
>
> ?/* Available call abi. ?*/
> Index: sse.md
> ===================================================================
> --- sse.md ? ? ?(revision 181033)
> +++ sse.md ? ? ?(working copy)
> @@ -7509,6 +7509,16 @@
> ? ?(set_attr "prefix" "maybe_vex,orig,vex,maybe_vex,orig,orig")
> ? ?(set_attr "mode" "V2SF,TI,TI,TI,V4SF,V2SF")])
>
> +(define_expand "vec_dupv4si"
> + ?[(set (match_operand:V4SI 0 "register_operand" "")
> + ? ? ? (vec_duplicate:V4SI
> + ? ? ? ? (match_operand:SI 1 "nonimmediate_operand" "")))]
> + ?"TARGET_SSE"
> +{
> + ?if (!TARGET_AVX)
> + ? ?operands[1] = force_reg (V4SImode, operands[1]);
> +})
> +
> ?(define_insn "*vec_dupv4si"
> ? [(set (match_operand:V4SI 0 "register_operand" ? ? "=x,x,x")
> ? ? ? ?(vec_duplicate:V4SI
> @@ -7525,6 +7535,16 @@
> ? ?(set_attr "prefix" "maybe_vex,vex,orig")
> ? ?(set_attr "mode" "TI,V4SF,V4SF")])
>
> +(define_expand "vec_dupv2di"
> + ?[(set (match_operand:V2DI 0 "register_operand" "")
> + ? ? ? (vec_duplicate:V2DI
> + ? ? ? ? (match_operand:DI 1 "nonimmediate_operand" "")))]
> + ?"TARGET_SSE"
> +{
> + ?if (!TARGET_AVX)
> + ? ?operands[1] = force_reg (V2DImode, operands[1]);
> +})
> +
> ?(define_insn "*vec_dupv2di"
> ? [(set (match_operand:V2DI 0 "register_operand" ? ? "=x,x,x,x")
> ? ? ? ?(vec_duplicate:V2DI
> Index: i386.opt
> ===================================================================
> --- i386.opt ? ?(revision 181033)
> +++ i386.opt ? ?(working copy)
> @@ -324,6 +324,9 @@ Enum(stringop_alg) String(loop) Value(lo
> ?EnumValue
> ?Enum(stringop_alg) String(unrolled_loop) Value(unrolled_loop)
>
> +EnumValue
> +Enum(stringop_alg) String(sse_loop) Value(sse_loop)
> +
> ?mtls-dialect=
> ?Target RejectNegative Joined Var(ix86_tls_dialect) Enum(tls_dialect) Init(TLS_DIALECT_GNU)
> ?Use given thread-local storage dialect
> Index: i386.c
> ===================================================================
> --- i386.c ? ? ?(revision 181033)
> +++ i386.c ? ? ?(working copy)
> @@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost =
> ? COSTS_N_BYTES (2), ? ? ? ? ? ? ? ? ? /* cost of FABS instruction. ?*/
> ? COSTS_N_BYTES (2), ? ? ? ? ? ? ? ? ? /* cost of FCHS instruction. ?*/
> ? COSTS_N_BYTES (2), ? ? ? ? ? ? ? ? ? /* cost of FSQRT instruction. ?*/
> - ?{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + ?{{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> ? ?{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> - ?{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + ? {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + ? {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
> + ?{{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> ? ?{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}},
> + ? {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + ? {rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}}}},
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_stmt_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar load_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_store_cost. ?*/
> @@ -632,10 +636,14 @@ struct processor_costs i386_cost = { ? ? ?/*
> ? COSTS_N_INSNS (22), ? ? ? ? ? ? ? ? ?/* cost of FABS instruction. ?*/
> ? COSTS_N_INSNS (24), ? ? ? ? ? ? ? ? ?/* cost of FCHS instruction. ?*/
> ? COSTS_N_INSNS (122), ? ? ? ? ? ? ? ? /* cost of FSQRT instruction. ?*/
> - ?{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + ?{{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> ? ?DUMMY_STRINGOP_ALGS},
> - ?{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + ? {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + ? DUMMY_STRINGOP_ALGS}},
> + ?{{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> ? ?DUMMY_STRINGOP_ALGS},
> + ? {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
> + ? DUMMY_STRINGOP_ALGS}},
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_stmt_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar load_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_store_cost. ?*/
> @@ -704,10 +712,14 @@ struct processor_costs i486_cost = { ? ? ?/*
> ? COSTS_N_INSNS (3), ? ? ? ? ? ? ? ? ? /* cost of FABS instruction. ?*/
> ? COSTS_N_INSNS (3), ? ? ? ? ? ? ? ? ? /* cost of FCHS instruction. ?*/
> ? COSTS_N_INSNS (83), ? ? ? ? ? ? ? ? ?/* cost of FSQRT instruction. ?*/
> - ?{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
> + ?{{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
> ? ?DUMMY_STRINGOP_ALGS},
> - ?{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
> + ? {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
> + ? DUMMY_STRINGOP_ALGS}},
> + ?{{{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
> ? ?DUMMY_STRINGOP_ALGS},
> + ? {{rep_prefix_4_byte, {{-1, rep_prefix_4_byte}}},
> + ? DUMMY_STRINGOP_ALGS}},
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_stmt_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar load_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_store_cost. ?*/
> @@ -774,10 +786,14 @@ struct processor_costs pentium_cost = {
> ? COSTS_N_INSNS (1), ? ? ? ? ? ? ? ? ? /* cost of FABS instruction. ?*/
> ? COSTS_N_INSNS (1), ? ? ? ? ? ? ? ? ? /* cost of FCHS instruction. ?*/
> ? COSTS_N_INSNS (70), ? ? ? ? ? ? ? ? ?/* cost of FSQRT instruction. ?*/
> - ?{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
> + ?{{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
> ? ?DUMMY_STRINGOP_ALGS},
> - ?{{libcall, {{-1, rep_prefix_4_byte}}},
> + ? {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
> + ? DUMMY_STRINGOP_ALGS}},
> + ?{{{libcall, {{-1, rep_prefix_4_byte}}},
> ? ?DUMMY_STRINGOP_ALGS},
> + ? {{libcall, {{-1, rep_prefix_4_byte}}},
> + ? DUMMY_STRINGOP_ALGS}},
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_stmt_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar load_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_store_cost. ?*/
> @@ -849,12 +865,18 @@ struct processor_costs pentiumpro_cost =
> ? ? ?noticeable win, for bigger blocks either rep movsl or rep movsb is
> ? ? ?way to go. ?Rep movsb has apparently more expensive startup time in CPU,
> ? ? ?but after 4K the difference is down in the noise. ?*/
> - ?{{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
> + ?{{{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
> ? ? ? ? ? ? ? ? ? ? ? ?{8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
> ? ?DUMMY_STRINGOP_ALGS},
> - ?{{rep_prefix_4_byte, {{1024, unrolled_loop},
> - ? ? ? ? ? ? ? ? ? ? ? {8192, rep_prefix_4_byte}, {-1, libcall}}},
> + ? {{rep_prefix_4_byte, {{128, loop}, {1024, unrolled_loop},
> + ? ? ? ? ? ? ? ? ? ? ? {8192, rep_prefix_4_byte}, {-1, rep_prefix_1_byte}}},
> + ? DUMMY_STRINGOP_ALGS}},
> + ?{{{rep_prefix_4_byte, {{1024, unrolled_loop},
> + ? ? ? ? ? ? ? ? ? ? ? {8192, rep_prefix_4_byte}, {-1, libcall}}},
> ? ?DUMMY_STRINGOP_ALGS},
> + ? {{rep_prefix_4_byte, {{1024, unrolled_loop},
> + ? ? ? ? ? ? ? ? ? ? ? {8192, rep_prefix_4_byte}, {-1, libcall}}},
> + ? DUMMY_STRINGOP_ALGS}},
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_stmt_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar load_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_store_cost. ?*/
> @@ -922,10 +944,14 @@ struct processor_costs geode_cost = {
> ? COSTS_N_INSNS (1), ? ? ? ? ? ? ? ? ? /* cost of FABS instruction. ?*/
> ? COSTS_N_INSNS (1), ? ? ? ? ? ? ? ? ? /* cost of FCHS instruction. ?*/
> ? COSTS_N_INSNS (54), ? ? ? ? ? ? ? ? ?/* cost of FSQRT instruction. ?*/
> - ?{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
> + ?{{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
> ? ?DUMMY_STRINGOP_ALGS},
> - ?{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
> + ? {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
> + ? DUMMY_STRINGOP_ALGS}},
> + ?{{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
> ? ?DUMMY_STRINGOP_ALGS},
> + ? {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
> + ? DUMMY_STRINGOP_ALGS}},
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_stmt_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar load_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_store_cost. ?*/
> @@ -995,10 +1021,14 @@ struct processor_costs k6_cost = {
> ? COSTS_N_INSNS (2), ? ? ? ? ? ? ? ? ? /* cost of FABS instruction. ?*/
> ? COSTS_N_INSNS (2), ? ? ? ? ? ? ? ? ? /* cost of FCHS instruction. ?*/
> ? COSTS_N_INSNS (56), ? ? ? ? ? ? ? ? ?/* cost of FSQRT instruction. ?*/
> - ?{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
> + ?{{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
> ? ?DUMMY_STRINGOP_ALGS},
> - ?{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
> + ? {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
> + ? DUMMY_STRINGOP_ALGS}},
> + ?{{{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
> ? ?DUMMY_STRINGOP_ALGS},
> + ? {{libcall, {{256, rep_prefix_4_byte}, {-1, libcall}}},
> + ? DUMMY_STRINGOP_ALGS}},
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_stmt_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar load_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_store_cost. ?*/
> @@ -1068,10 +1098,14 @@ struct processor_costs athlon_cost = {
> ? /* For some reason, Athlon deals better with REP prefix (relative to loops)
> ? ? ?compared to K8. Alignment becomes important after 8 bytes for memcpy and
> ? ? ?128 bytes for memset. ?*/
> - ?{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
> + ?{{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
> ? ?DUMMY_STRINGOP_ALGS},
> - ?{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
> + ? {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
> + ? DUMMY_STRINGOP_ALGS}},
> + ?{{{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
> ? ?DUMMY_STRINGOP_ALGS},
> + ? {{libcall, {{2048, rep_prefix_4_byte}, {-1, libcall}}},
> + ? DUMMY_STRINGOP_ALGS}},
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_stmt_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar load_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_store_cost. ?*/
> @@ -1146,11 +1180,16 @@ struct processor_costs k8_cost = {
> ? /* K8 has optimized REP instruction for medium sized blocks, but for very
> ? ? ?small blocks it is better to use loop. For large blocks, libcall can
> ? ? ?do nontemporary accesses and beat inline considerably. ?*/
> - ?{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> + ?{{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> ? ?{libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> - ?{{libcall, {{8, loop}, {24, unrolled_loop},
> + ? {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> + ? {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
> + ?{{{libcall, {{8, loop}, {24, unrolled_loop},
> ? ? ? ? ? ? ?{2048, rep_prefix_4_byte}, {-1, libcall}}},
> ? ?{libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> + ? {{libcall, {{8, loop}, {24, unrolled_loop},
> + ? ? ? ? ? ? {2048, rep_prefix_4_byte}, {-1, libcall}}},
> + ? {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
> ? 4, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_stmt_cost. ?*/
> ? 2, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar load_cost. ?*/
> ? 2, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_store_cost. ?*/
> @@ -1233,11 +1272,16 @@ struct processor_costs amdfam10_cost = {
> ? /* AMDFAM10 has optimized REP instruction for medium sized blocks, but for
> ? ? ?very small blocks it is better to use loop. For large blocks, libcall can
> ? ? ?do nontemporary accesses and beat inline considerably. ?*/
> - ?{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> - ? {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> - ?{{libcall, {{8, loop}, {24, unrolled_loop},
> + ?{{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> + ? {libcall, {{16, loop}, {512, rep_prefix_8_byte}, {-1, libcall}}}},
> + ? {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> + ? {libcall, {{16, loop}, {512, rep_prefix_8_byte}, {-1, libcall}}}}},
> + ?{{{libcall, {{8, loop}, {24, unrolled_loop},
> ? ? ? ? ? ? ?{2048, rep_prefix_4_byte}, {-1, libcall}}},
> ? ?{libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> + ? {{libcall, {{8, loop}, {24, unrolled_loop},
> + ? ? ? ? ? ? {2048, rep_prefix_4_byte}, {-1, libcall}}},
> + ? {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
> ? 4, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_stmt_cost. ?*/
> ? 2, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar load_cost. ?*/
> ? 2, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_store_cost. ?*/
> @@ -1320,11 +1364,16 @@ struct processor_costs bdver1_cost = {
> ? /* ?BDVER1 has optimized REP instruction for medium sized blocks, but for
> ? ? ? very small blocks it is better to use loop. For large blocks, libcall
> ? ? ? can do nontemporary accesses and beat inline considerably. ?*/
> - ?{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> + ?{{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> ? ?{libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> - ?{{libcall, {{8, loop}, {24, unrolled_loop},
> + ? {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> + ? {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
> + ?{{{libcall, {{8, loop}, {24, unrolled_loop},
> ? ? ? ? ? ? ?{2048, rep_prefix_4_byte}, {-1, libcall}}},
> ? ?{libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> + ? {{libcall, {{8, loop}, {24, unrolled_loop},
> + ? ? ? ? ? ? {2048, rep_prefix_4_byte}, {-1, libcall}}},
> + ? {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
> ? 6, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_stmt_cost. ?*/
> ? 4, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar load_cost. ?*/
> ? 4, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_store_cost. ?*/
> @@ -1407,11 +1456,16 @@ struct processor_costs bdver2_cost = {
> ? /* ?BDVER2 has optimized REP instruction for medium sized blocks, but for
> ? ? ? very small blocks it is better to use loop. For large blocks, libcall
> ? ? ? can do nontemporary accesses and beat inline considerably. ?*/
> - ?{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> + ?{{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> ? ?{libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> - ?{{libcall, {{8, loop}, {24, unrolled_loop},
> + ?{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> + ? {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
> + ?{{{libcall, {{8, loop}, {24, unrolled_loop},
> ? ? ? ? ? ? ?{2048, rep_prefix_4_byte}, {-1, libcall}}},
> ? ?{libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> + ?{{libcall, {{8, loop}, {24, unrolled_loop},
> + ? ? ? ? ? ? {2048, rep_prefix_4_byte}, {-1, libcall}}},
> + ? {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
> ? 6, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_stmt_cost. ?*/
> ? 4, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar load_cost. ?*/
> ? 4, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_store_cost. ?*/
> @@ -1489,11 +1543,16 @@ struct processor_costs btver1_cost = {
> ? /* BTVER1 has optimized REP instruction for medium sized blocks, but for
> ? ? ?very small blocks it is better to use loop. For large blocks, libcall can
> ? ? ?do nontemporary accesses and beat inline considerably. ?*/
> - ?{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> + ?{{{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> ? ?{libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> - ?{{libcall, {{8, loop}, {24, unrolled_loop},
> + ? {{libcall, {{6, loop}, {14, unrolled_loop}, {-1, rep_prefix_4_byte}}},
> + ? {libcall, {{16, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
> + ?{{{libcall, {{8, loop}, {24, unrolled_loop},
> ? ? ? ? ? ? ?{2048, rep_prefix_4_byte}, {-1, libcall}}},
> ? ?{libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> + ? {{libcall, {{8, loop}, {24, unrolled_loop},
> + ? ? ? ? ? ? {2048, rep_prefix_4_byte}, {-1, libcall}}},
> + ? {libcall, {{48, unrolled_loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
> ? 4, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_stmt_cost. ?*/
> ? 2, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar load_cost. ?*/
> ? 2, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_store_cost. ?*/
> @@ -1560,11 +1619,18 @@ struct processor_costs pentium4_cost = {
> ? COSTS_N_INSNS (2), ? ? ? ? ? ? ? ? ? /* cost of FABS instruction. ?*/
> ? COSTS_N_INSNS (2), ? ? ? ? ? ? ? ? ? /* cost of FCHS instruction. ?*/
> ? COSTS_N_INSNS (43), ? ? ? ? ? ? ? ? ?/* cost of FSQRT instruction. ?*/
> - ?{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
> +
> + ?{{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
> ? ?DUMMY_STRINGOP_ALGS},
> - ?{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
> + ? {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
> + ? DUMMY_STRINGOP_ALGS}},
> +
> + ?{{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
> ? ?{-1, libcall}}},
> ? ?DUMMY_STRINGOP_ALGS},
> + ? {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
> + ? {-1, libcall}}},
> + ? DUMMY_STRINGOP_ALGS}},
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_stmt_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar load_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_store_cost. ?*/
> @@ -1631,13 +1697,22 @@ struct processor_costs nocona_cost = {
> ? COSTS_N_INSNS (3), ? ? ? ? ? ? ? ? ? /* cost of FABS instruction. ?*/
> ? COSTS_N_INSNS (3), ? ? ? ? ? ? ? ? ? /* cost of FCHS instruction. ?*/
> ? COSTS_N_INSNS (44), ? ? ? ? ? ? ? ? ?/* cost of FSQRT instruction. ?*/
> - ?{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
> +
> + ?{{{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
> ? ?{libcall, {{32, loop}, {20000, rep_prefix_8_byte},
> ? ? ? ? ? ? ?{100000, unrolled_loop}, {-1, libcall}}}},
> - ?{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
> + ? {{libcall, {{12, loop_1_byte}, {-1, rep_prefix_4_byte}}},
> + ? {libcall, {{32, loop}, {20000, rep_prefix_8_byte},
> + ? ? ? ? ? ? {100000, unrolled_loop}, {-1, libcall}}}}},
> +
> + ?{{{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
> ? ?{-1, libcall}}},
> ? ?{libcall, {{24, loop}, {64, unrolled_loop},
> ? ? ? ? ? ? ?{8192, rep_prefix_8_byte}, {-1, libcall}}}},
> + ? {{libcall, {{6, loop_1_byte}, {48, loop}, {20480, rep_prefix_4_byte},
> + ? {-1, libcall}}},
> + ? {libcall, {{24, loop}, {64, unrolled_loop},
> + ? ? ? ? ? ? {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_stmt_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar load_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_store_cost. ?*/
> @@ -1704,13 +1779,108 @@ struct processor_costs atom_cost = {
> ? COSTS_N_INSNS (8), ? ? ? ? ? ? ? ? ? /* cost of FABS instruction. ?*/
> ? COSTS_N_INSNS (8), ? ? ? ? ? ? ? ? ? /* cost of FCHS instruction. ?*/
> ? COSTS_N_INSNS (40), ? ? ? ? ? ? ? ? ?/* cost of FSQRT instruction. ?*/
> - ?{{libcall, {{11, loop}, {-1, rep_prefix_4_byte}}},
> - ? {libcall, {{32, loop}, {64, rep_prefix_4_byte},
> - ? ? ? ? {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> - ?{{libcall, {{8, loop}, {15, unrolled_loop},
> - ? ? ? ? {2048, rep_prefix_4_byte}, {-1, libcall}}},
> - ? {libcall, {{24, loop}, {32, unrolled_loop},
> - ? ? ? ? {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> +
> + ?/* stringop_algs for memcpy.
> + ? ? SSE loops works best on Atom, but fall back into non-SSE unrolled loop variant
> + ? ? if that fails. ?*/
> + ?{{{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment. ?*/
> + ? ?{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}},
> + ? {{libcall, {{-1, libcall}}}, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* Unknown alignment. ?*/
> + ? ?{libcall, {{2048, sse_loop}, {2048, unrolled_loop},
> + ? ? ? ? ? ? ?{-1, libcall}}}}},
> +
> + ?/* stringop_algs for memset. ?*/
> + ?{{{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment. ?*/
> + ? ?{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}},
> + ? {{libcall, {{1024, sse_loop}, {1024, unrolled_loop}, ? ? ? ? /* Unknown alignment. ?*/
> + ? ? ? ? ? ? ?{-1, libcall}}},
> + ? ?{libcall, {{2048, sse_loop}, {2048, unrolled_loop},
> + ? ? ? ? ? ? ?{-1, libcall}}}}},
> + ?1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_stmt_cost. ?*/
> + ?1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar load_cost. ?*/
> + ?1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_store_cost. ?*/
> + ?1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* vec_stmt_cost. ?*/
> + ?1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* vec_to_scalar_cost. ?*/
> + ?1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_to_vec_cost. ?*/
> + ?1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* vec_align_load_cost. ?*/
> + ?2, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* vec_unalign_load_cost. ?*/
> + ?1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* vec_store_cost. ?*/
> + ?3, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* cond_taken_branch_cost. ?*/
> + ?1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* cond_not_taken_branch_cost. ?*/
> +};
> +
> +/* Core should produce code tuned for core variants. ?*/
> +static const
> +struct processor_costs core_cost = {
> + ?COSTS_N_INSNS (1), ? ? ? ? ? ? ? ? ? /* cost of an add instruction */
> + ?/* On all chips taken into consideration lea is 2 cycles and more. ?With
> + ? ? this cost however our current implementation of synth_mult results in
> + ? ? use of unnecessary temporary registers causing regression on several
> + ? ? SPECfp benchmarks. ?*/
> + ?COSTS_N_INSNS (1) + 1, ? ? ? ? ? ? ? /* cost of a lea instruction */
> + ?COSTS_N_INSNS (1), ? ? ? ? ? ? ? ? ? /* variable shift costs */
> + ?COSTS_N_INSNS (1), ? ? ? ? ? ? ? ? ? /* constant shift costs */
> + ?{COSTS_N_INSNS (3), ? ? ? ? ? ? ? ? ?/* cost of starting multiply for QI */
> + ? COSTS_N_INSNS (4), ? ? ? ? ? ? ? ? ?/* ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? HI */
> + ? COSTS_N_INSNS (3), ? ? ? ? ? ? ? ? ?/* ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? SI */
> + ? COSTS_N_INSNS (4), ? ? ? ? ? ? ? ? ?/* ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? DI */
> + ? COSTS_N_INSNS (2)}, ? ? ? ? ? ? ? ? /* ? ? ? ? ? ? ? ? ? ? ? ? ? ?other */
> + ?0, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* cost of multiply per each bit set */
> + ?{COSTS_N_INSNS (18), ? ? ? ? ? ? ? ? /* cost of a divide/mod for QI */
> + ? COSTS_N_INSNS (26), ? ? ? ? ? ? ? ? /* ? ? ? ? ? ? ? ? ? ? ? ? ?HI */
> + ? COSTS_N_INSNS (42), ? ? ? ? ? ? ? ? /* ? ? ? ? ? ? ? ? ? ? ? ? ?SI */
> + ? COSTS_N_INSNS (74), ? ? ? ? ? ? ? ? /* ? ? ? ? ? ? ? ? ? ? ? ? ?DI */
> + ? COSTS_N_INSNS (74)}, ? ? ? ? ? ? ? ? ? ? ? ?/* ? ? ? ? ? ? ? ? ? ? ? ? ?other */
> + ?COSTS_N_INSNS (1), ? ? ? ? ? ? ? ? ? /* cost of movsx */
> + ?COSTS_N_INSNS (1), ? ? ? ? ? ? ? ? ? /* cost of movzx */
> + ?8, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* "large" insn */
> + ?17, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?/* MOVE_RATIO */
> + ?4, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?/* cost for loading QImode using movzbl */
> + ?{4, 4, 4}, ? ? ? ? ? ? ? ? ? ? ? ? ? /* cost of loading integer registers
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?in QImode, HImode and SImode.
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?Relative to reg-reg move (2). ?*/
> + ?{4, 4, 4}, ? ? ? ? ? ? ? ? ? ? ? ? ? /* cost of storing integer registers */
> + ?4, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* cost of reg,reg fld/fst */
> + ?{12, 12, 12}, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?/* cost of loading fp registers
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?in SFmode, DFmode and XFmode */
> + ?{6, 6, 8}, ? ? ? ? ? ? ? ? ? ? ? ? ? /* cost of storing fp registers
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?in SFmode, DFmode and XFmode */
> + ?2, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* cost of moving MMX register */
> + ?{8, 8}, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?/* cost of loading MMX registers
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?in SImode and DImode */
> + ?{8, 8}, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?/* cost of storing MMX registers
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?in SImode and DImode */
> + ?2, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* cost of moving SSE register */
> + ?{8, 8, 8}, ? ? ? ? ? ? ? ? ? ? ? ? ? /* cost of loading SSE registers
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?in SImode, DImode and TImode */
> + ?{8, 8, 8}, ? ? ? ? ? ? ? ? ? ? ? ? ? /* cost of storing SSE registers
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?in SImode, DImode and TImode */
> + ?5, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* MMX or SSE register to integer */
> + ?32, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?/* size of l1 cache. ?*/
> + ?512, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* size of l2 cache. ?*/
> + ?64, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?/* size of prefetch block */
> + ?6, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* number of parallel prefetches */
> + ?/* Benchmarks shows large regressions on K8 sixtrack benchmark when this
> + ? ? value is increased to perhaps more appropriate value of 5. ?*/
> + ?3, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* Branch cost */
> + ?COSTS_N_INSNS (8), ? ? ? ? ? ? ? ? ? /* cost of FADD and FSUB insns. ?*/
> + ?COSTS_N_INSNS (8), ? ? ? ? ? ? ? ? ? /* cost of FMUL instruction. ?*/
> + ?COSTS_N_INSNS (20), ? ? ? ? ? ? ? ? ?/* cost of FDIV instruction. ?*/
> + ?COSTS_N_INSNS (8), ? ? ? ? ? ? ? ? ? /* cost of FABS instruction. ?*/
> + ?COSTS_N_INSNS (8), ? ? ? ? ? ? ? ? ? /* cost of FCHS instruction. ?*/
> + ?COSTS_N_INSNS (40), ? ? ? ? ? ? ? ? ?/* cost of FSQRT instruction. ?*/
> +
> + ?/* stringop_algs for memcpy. ?*/
> + ?{{{libcall, {{16, loop}, {24, unrolled_loop}, {1024, rep_prefix_4_byte}, {-1, libcall}}}, /* Known alignment. ?*/
> + ? ?{libcall, {{16, loop}, {24, unrolled_loop}, {1024, rep_prefix_8_byte}, {-1, libcall}}}},
> + ? {{libcall, {{16, loop}, {24, unrolled_loop}, {1024, rep_prefix_4_byte}, {-1, libcall}}}, /* Unknown alignment. ?*/
> + ? ?{libcall, {{16, loop}, {24, unrolled_loop}, {1024, rep_prefix_8_byte}, {-1, libcall}}}}},
> +
> + ?/* stringop_algs for memset. ?*/
> + ?{{{libcall, {{256, rep_prefix_4_byte}}}, /* Known alignment. ?*/
> + ? ?{libcall, {{256, rep_prefix_8_byte}}}},
> + ? {{libcall, {{256, rep_prefix_4_byte}}}, /* Unknown alignment. ?*/
> + ? ?{libcall, {{256, rep_prefix_8_byte}}}}},
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_stmt_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar load_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_store_cost. ?*/
> @@ -1724,7 +1894,7 @@ struct processor_costs atom_cost = {
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* cond_not_taken_branch_cost. ?*/
> ?};
>
> -/* Generic64 should produce code tuned for Nocona and K8. ?*/
> +/* Generic64 should produce code tuned for Nocona, Core, ?K8, Amdfam10 and buldozer. ?*/
> ?static const
> ?struct processor_costs generic64_cost = {
> ? COSTS_N_INSNS (1), ? ? ? ? ? ? ? ? ? /* cost of an add instruction */
> @@ -1784,10 +1954,16 @@ struct processor_costs generic64_cost =
> ? COSTS_N_INSNS (8), ? ? ? ? ? ? ? ? ? /* cost of FABS instruction. ?*/
> ? COSTS_N_INSNS (8), ? ? ? ? ? ? ? ? ? /* cost of FCHS instruction. ?*/
> ? COSTS_N_INSNS (40), ? ? ? ? ? ? ? ? ?/* cost of FSQRT instruction. ?*/
> - ?{DUMMY_STRINGOP_ALGS,
> - ? {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> - ?{DUMMY_STRINGOP_ALGS,
> - ? {libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> +
> + ?{{DUMMY_STRINGOP_ALGS,
> + ? ?{libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> + ? {DUMMY_STRINGOP_ALGS,
> + ? ?{libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
> +
> + ?{{DUMMY_STRINGOP_ALGS,
> + ? ?{libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> + ? {DUMMY_STRINGOP_ALGS,
> + ? ?{libcall, {{32, loop}, {8192, rep_prefix_8_byte}, {-1, libcall}}}}},
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_stmt_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar load_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_store_cost. ?*/
> @@ -1801,8 +1977,8 @@ struct processor_costs generic64_cost =
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* cond_not_taken_branch_cost. ?*/
> ?};
>
> -/* Generic32 should produce code tuned for PPro, Pentium4, Nocona,
> - ? Athlon and K8. ?*/
> +/* Generic32 should produce code tuned for PPro, Pentium4, Nocona, Core
> + ? Athlon, K8, amdfam10, buldozer. ?*/
> ?static const
> ?struct processor_costs generic32_cost = {
> ? COSTS_N_INSNS (1), ? ? ? ? ? ? ? ? ? /* cost of an add instruction */
> @@ -1856,10 +2032,16 @@ struct processor_costs generic32_cost =
> ? COSTS_N_INSNS (8), ? ? ? ? ? ? ? ? ? /* cost of FABS instruction. ?*/
> ? COSTS_N_INSNS (8), ? ? ? ? ? ? ? ? ? /* cost of FCHS instruction. ?*/
> ? COSTS_N_INSNS (40), ? ? ? ? ? ? ? ? ?/* cost of FSQRT instruction. ?*/
> - ?{{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
> + ?/* stringop_algs for memcpy. ?*/
> + ?{{{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
> ? ?DUMMY_STRINGOP_ALGS},
> - ?{{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
> + ? {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
> + ? DUMMY_STRINGOP_ALGS}},
> + ?/* stringop_algs for memset. ?*/
> + ?{{{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
> ? ?DUMMY_STRINGOP_ALGS},
> + ? {{libcall, {{32, loop}, {8192, rep_prefix_4_byte}, {-1, libcall}}},
> + ? DUMMY_STRINGOP_ALGS}},
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_stmt_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar load_cost. ?*/
> ? 1, ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? /* scalar_store_cost. ?*/
> @@ -2536,6 +2718,8 @@ static void ix86_set_current_function (t
> ?static unsigned int ix86_minimum_incoming_stack_boundary (bool);
>
> ?static enum calling_abi ix86_function_abi (const_tree);
> +static rtx promote_duplicated_reg (enum machine_mode, rtx);
> +static rtx promote_duplicated_reg_to_size (rtx, int, int, int);
>
>
> ?#ifndef SUBTARGET32_DEFAULT_CPU
> @@ -2582,13 +2766,13 @@ static const struct ptt processor_target
> ? {&k8_cost, 16, 7, 16, 7, 16},
> ? {&nocona_cost, 0, 0, 0, 0, 0},
> ? /* Core 2 32-bit. ?*/
> - ?{&generic32_cost, 16, 10, 16, 10, 16},
> + ?{&core_cost, 16, 10, 16, 10, 16},
> ? /* Core 2 64-bit. ?*/
> - ?{&generic64_cost, 16, 10, 16, 10, 16},
> + ?{&core_cost, 16, 10, 16, 10, 16},
> ? /* Core i7 32-bit. ?*/
> - ?{&generic32_cost, 16, 10, 16, 10, 16},
> + ?{&core_cost, 16, 10, 16, 10, 16},
> ? /* Core i7 64-bit. ?*/
> - ?{&generic64_cost, 16, 10, 16, 10, 16},
> + ?{&core_cost, 16, 10, 16, 10, 16},
> ? {&generic32_cost, 16, 7, 16, 7, 16},
> ? {&generic64_cost, 16, 10, 16, 10, 16},
> ? {&amdfam10_cost, 32, 24, 32, 7, 32},
> @@ -20800,22 +20984,37 @@ counter_mode (rtx count_exp)
> ? return SImode;
> ?}
>
> -/* When SRCPTR is non-NULL, output simple loop to move memory
> +/* Helper function for expand_set_or_movmem_via_loop.
> +
> + ? When SRCPTR is non-NULL, output simple loop to move memory
> ? ?pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
> ? ?overall size is COUNT specified in bytes. ?When SRCPTR is NULL, output the
> ? ?equivalent loop to set memory by VALUE (supposed to be in MODE).
>
> ? ?The size is rounded down to whole number of chunk size moved at once.
> - ? SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info. ?*/
> + ? SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info.
>
> + ? If ITER isn't NULL, than it'll be used in the generated loop without
> + ? initialization (that allows to generate several consequent loops using the
> + ? same iterator).
> + ? If CHANGE_PTRS is specified, DESTPTR and SRCPTR would be increased by
> + ? iterator value at the end of the function (as if they iterate in the loop).
> + ? Otherwise, their vaules'll stay unchanged.
> +
> + ? If EXPECTED_SIZE isn't -1, than it's used to compute branch-probabilities on
> + ? the loop backedge. ?When expected size is unknown (it's -1), the probability
> + ? is set to 80%.
>
> -static void
> -expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?rtx destptr, rtx srcptr, rtx value,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?rtx count, enum machine_mode mode, int unroll,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?int expected_size)
> + ? Return value is rtx of iterator, used in the loop - it could be reused in
> + ? consequent calls of this function. ?*/
> +static rtx
> +expand_set_or_movmem_via_loop_with_iter (rtx destmem, rtx srcmem,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?rtx destptr, rtx srcptr, rtx value,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?rtx count, rtx iter,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?enum machine_mode mode, int unroll,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?int expected_size, bool change_ptrs)
> ?{
> - ?rtx out_label, top_label, iter, tmp;
> + ?rtx out_label, top_label, tmp;
> ? enum machine_mode iter_mode = counter_mode (count);
> ? rtx piece_size = GEN_INT (GET_MODE_SIZE (mode) * unroll);
> ? rtx piece_size_mask = GEN_INT (~((GET_MODE_SIZE (mode) * unroll) - 1));
> @@ -20823,10 +21022,12 @@ expand_set_or_movmem_via_loop (rtx destm
> ? rtx x_addr;
> ? rtx y_addr;
> ? int i;
> + ?bool reuse_iter = (iter != NULL_RTX);
>
> ? top_label = gen_label_rtx ();
> ? out_label = gen_label_rtx ();
> - ?iter = gen_reg_rtx (iter_mode);
> + ?if (!reuse_iter)
> + ? ?iter = gen_reg_rtx (iter_mode);
>
> ? size = expand_simple_binop (iter_mode, AND, count, piece_size_mask,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?NULL, 1, OPTAB_DIRECT);
> @@ -20837,18 +21038,21 @@ expand_set_or_movmem_via_loop (rtx destm
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? true, out_label);
> ? ? ? predict_jump (REG_BR_PROB_BASE * 10 / 100);
> ? ? }
> - ?emit_move_insn (iter, const0_rtx);
> + ?if (!reuse_iter)
> + ? ?emit_move_insn (iter, const0_rtx);
>
> ? emit_label (top_label);
>
> ? tmp = convert_modes (Pmode, iter_mode, iter, true);
> ? x_addr = gen_rtx_PLUS (Pmode, destptr, tmp);
> - ?destmem = change_address (destmem, mode, x_addr);
> + ?destmem =
> + ? ?adjust_automodify_address_nv (copy_rtx (destmem), mode, x_addr, 0);
>
> ? if (srcmem)
> ? ? {
> ? ? ? y_addr = gen_rtx_PLUS (Pmode, srcptr, copy_rtx (tmp));
> - ? ? ?srcmem = change_address (srcmem, mode, y_addr);
> + ? ? ?srcmem =
> + ? ? ? adjust_automodify_address_nv (copy_rtx (srcmem), mode, y_addr, 0);
>
> ? ? ? /* When unrolling for chips that reorder memory reads and writes,
> ? ? ? ? we can save registers by using single temporary.
> @@ -20920,19 +21124,43 @@ expand_set_or_movmem_via_loop (rtx destm
> ? ? }
> ? else
> ? ? predict_jump (REG_BR_PROB_BASE * 80 / 100);
> - ?iter = ix86_zero_extend_to_Pmode (iter);
> - ?tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ?true, OPTAB_LIB_WIDEN);
> - ?if (tmp != destptr)
> - ? ?emit_move_insn (destptr, tmp);
> - ?if (srcptr)
> + ?if (change_ptrs)
> ? ? {
> - ? ? ?tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
> + ? ? ?iter = ix86_zero_extend_to_Pmode (iter);
> + ? ? ?tmp = expand_simple_binop (Pmode, PLUS, destptr, iter, destptr,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? true, OPTAB_LIB_WIDEN);
> - ? ? ?if (tmp != srcptr)
> - ? ? ? emit_move_insn (srcptr, tmp);
> + ? ? ?if (tmp != destptr)
> + ? ? ? emit_move_insn (destptr, tmp);
> + ? ? ?if (srcptr)
> + ? ? ? {
> + ? ? ? ? tmp = expand_simple_binop (Pmode, PLUS, srcptr, iter, srcptr,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?true, OPTAB_LIB_WIDEN);
> + ? ? ? ? if (tmp != srcptr)
> + ? ? ? ? ? emit_move_insn (srcptr, tmp);
> + ? ? ? }
> ? ? }
> ? emit_label (out_label);
> + ?return iter;
> +}
> +
> +/* When SRCPTR is non-NULL, output simple loop to move memory
> + ? pointer to SRCPTR to DESTPTR via chunks of MODE unrolled UNROLL times,
> + ? overall size is COUNT specified in bytes. ?When SRCPTR is NULL, output the
> + ? equivalent loop to set memory by VALUE (supposed to be in MODE).
> +
> + ? The size is rounded down to whole number of chunk size moved at once.
> + ? SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info. ?*/
> +
> +static void
> +expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?rtx destptr, rtx srcptr, rtx value,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?rtx count, enum machine_mode mode, int unroll,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?int expected_size)
> +{
> + ?expand_set_or_movmem_via_loop_with_iter (destmem, srcmem,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?destptr, srcptr, value,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?count, NULL_RTX, mode, unroll,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?expected_size, true);
> ?}
>
> ?/* Output "rep; mov" instruction.
> @@ -21036,7 +21264,18 @@ emit_strmov (rtx destmem, rtx srcmem,
> ? emit_insn (gen_strmov (destptr, dest, srcptr, src));
> ?}
>
> -/* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST. ?*/
> +/* Emit strset instuction. ?If RHS is constant, and vector mode will be used,
> + ? then move this constant to a vector register before emitting strset. ?*/
> +static void
> +emit_strset (rtx destmem, rtx value,
> + ? ? ? ? ? ?rtx destptr, enum machine_mode mode, int offset)
> +{
> + ?rtx dest = adjust_automodify_address_nv (destmem, mode, destptr, offset);
> + ?emit_insn (gen_strset (destptr, dest, value));
> +}
> +
> +/* Output code to copy (COUNT % MAX_SIZE) bytes from SRCPTR to DESTPTR.
> + ? SRCMEM and DESTMEM provide MEMrtx to feed proper aliasing info. ?*/
> ?static void
> ?expand_movmem_epilogue (rtx destmem, rtx srcmem,
> ? ? ? ? ? ? ? ? ? ? ? ?rtx destptr, rtx srcptr, rtx count, int max_size)
> @@ -21047,43 +21286,55 @@ expand_movmem_epilogue (rtx destmem, rtx
> ? ? ? HOST_WIDE_INT countval = INTVAL (count);
> ? ? ? int offset = 0;
>
> - ? ? ?if ((countval & 0x10) && max_size > 16)
> + ? ? ?int remainder_size = countval % max_size;
> + ? ? ?enum machine_mode move_mode = Pmode;
> +
> + ? ? ?/* Firstly, try to move data with the widest possible mode.
> + ? ? ? ?Remaining part we'll move using Pmode and narrower modes. ?*/
> + ? ? ?if (TARGET_SSE)
> ? ? ? ?{
> - ? ? ? ? if (TARGET_64BIT)
> - ? ? ? ? ? {
> - ? ? ? ? ? ? emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
> - ? ? ? ? ? ? emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset + 8);
> - ? ? ? ? ? }
> - ? ? ? ? else
> - ? ? ? ? ? gcc_unreachable ();
> - ? ? ? ? offset += 16;
> + ? ? ? ? if (max_size >= GET_MODE_SIZE (V4SImode))
> + ? ? ? ? ? move_mode = V4SImode;
> + ? ? ? ? else if (max_size >= GET_MODE_SIZE (DImode))
> + ? ? ? ? ? move_mode = DImode;
> ? ? ? ?}
> - ? ? ?if ((countval & 0x08) && max_size > 8)
> +
> + ? ? ?while (remainder_size >= GET_MODE_SIZE (move_mode))
> ? ? ? ?{
> - ? ? ? ? if (TARGET_64BIT)
> - ? ? ? ? ? emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
> - ? ? ? ? else
> - ? ? ? ? ? {
> - ? ? ? ? ? ? emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
> - ? ? ? ? ? ? emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset + 4);
> - ? ? ? ? ? }
> - ? ? ? ? offset += 8;
> + ? ? ? ? emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
> + ? ? ? ? offset += GET_MODE_SIZE (move_mode);
> + ? ? ? ? remainder_size -= GET_MODE_SIZE (move_mode);
> + ? ? ? }
> +
> + ? ? ?/* Move the remaining part of epilogue - its size might be
> + ? ? ? ?a size of the widest mode. ?*/
> + ? ? ?move_mode = Pmode;
> + ? ? ?while (remainder_size >= GET_MODE_SIZE (move_mode))
> + ? ? ? {
> + ? ? ? ? emit_strmov (destmem, srcmem, destptr, srcptr, move_mode, offset);
> + ? ? ? ? offset += GET_MODE_SIZE (move_mode);
> + ? ? ? ? remainder_size -= GET_MODE_SIZE (move_mode);
> ? ? ? ?}
> - ? ? ?if ((countval & 0x04) && max_size > 4)
> +
> + ? ? ?if (remainder_size >= 4)
> ? ? ? ?{
> - ? ? ? ? ?emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
> + ? ? ? ? emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
> ? ? ? ? ?offset += 4;
> + ? ? ? ? remainder_size -= 4;
> ? ? ? ?}
> - ? ? ?if ((countval & 0x02) && max_size > 2)
> + ? ? ?if (remainder_size >= 2)
> ? ? ? ?{
> - ? ? ? ? ?emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
> + ? ? ? ? emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
> ? ? ? ? ?offset += 2;
> + ? ? ? ? remainder_size -= 2;
> ? ? ? ?}
> - ? ? ?if ((countval & 0x01) && max_size > 1)
> + ? ? ?if (remainder_size >= 1)
> ? ? ? ?{
> - ? ? ? ? ?emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
> + ? ? ? ? emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
> ? ? ? ? ?offset += 1;
> + ? ? ? ? remainder_size -= 1;
> ? ? ? ?}
> + ? ? ?gcc_assert (remainder_size == 0);
> ? ? ? return;
> ? ? }
> ? if (max_size > 8)
> @@ -21189,87 +21440,121 @@ expand_setmem_epilogue_via_loop (rtx des
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? 1, max_size / 2);
> ?}
>
> -/* Output code to set at most count & (max_size - 1) bytes starting by DEST. ?*/
> +/* Output code to set with VALUE at most (COUNT % MAX_SIZE) bytes starting from
> + ? DESTPTR.
> + ? DESTMEM provides MEMrtx to feed proper aliasing info.
> + ? PROMOTED_TO_GPR_VALUE is rtx representing a GPR containing broadcasted VALUE.
> + ? PROMOTED_TO_VECTOR_VALUE is rtx representing a vector register containing
> + ? broadcasted VALUE.
> + ? PROMOTED_TO_GPR_VALUE and PROMOTED_TO_VECTOR_VALUE could be NULL if the
> + ? promotion hasn't been generated before. ?*/
> ?static void
> -expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_size)
> +expand_setmem_epilogue (rtx destmem, rtx destptr, rtx promoted_to_vector_value,
> + ? ? ? ? ? ? ? ? ? ? ? rtx promoted_to_gpr_value, rtx value, rtx count,
> + ? ? ? ? ? ? ? ? ? ? ? int max_size)
> ?{
> - ?rtx dest;
> -
> ? if (CONST_INT_P (count))
> ? ? {
> ? ? ? HOST_WIDE_INT countval = INTVAL (count);
> ? ? ? int offset = 0;
>
> - ? ? ?if ((countval & 0x10) && max_size > 16)
> - ? ? ? {
> - ? ? ? ? if (TARGET_64BIT)
> - ? ? ? ? ? {
> - ? ? ? ? ? ? dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
> - ? ? ? ? ? ? emit_insn (gen_strset (destptr, dest, value));
> - ? ? ? ? ? ? dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset + 8);
> - ? ? ? ? ? ? emit_insn (gen_strset (destptr, dest, value));
> - ? ? ? ? ? }
> - ? ? ? ? else
> - ? ? ? ? ? gcc_unreachable ();
> - ? ? ? ? offset += 16;
> - ? ? ? }
> - ? ? ?if ((countval & 0x08) && max_size > 8)
> - ? ? ? {
> - ? ? ? ? if (TARGET_64BIT)
> - ? ? ? ? ? {
> - ? ? ? ? ? ? dest = adjust_automodify_address_nv (destmem, DImode, destptr, offset);
> - ? ? ? ? ? ? emit_insn (gen_strset (destptr, dest, value));
> - ? ? ? ? ? }
> - ? ? ? ? else
> - ? ? ? ? ? {
> - ? ? ? ? ? ? dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
> - ? ? ? ? ? ? emit_insn (gen_strset (destptr, dest, value));
> - ? ? ? ? ? ? dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset + 4);
> - ? ? ? ? ? ? emit_insn (gen_strset (destptr, dest, value));
> - ? ? ? ? ? }
> - ? ? ? ? offset += 8;
> - ? ? ? }
> - ? ? ?if ((countval & 0x04) && max_size > 4)
> + ? ? ?int remainder_size = countval % max_size;
> + ? ? ?enum machine_mode move_mode = Pmode;
> +
> + ? ? ?/* Firstly, try to move data with the widest possible mode.
> + ? ? ? ?Remaining part we'll move using Pmode and narrower modes. ?*/
> +
> + ? ? ?if (promoted_to_vector_value)
> + ? ? ? while (remainder_size >= 16)
> + ? ? ? ? {
> + ? ? ? ? ? if (GET_MODE (destmem) != move_mode)
> + ? ? ? ? ? ? destmem = adjust_automodify_address_nv (destmem, move_mode,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? destptr, offset);
> + ? ? ? ? ? emit_strset (destmem, promoted_to_vector_value, destptr,
> + ? ? ? ? ? ? ? ? ? ? ? ?move_mode, offset);
> +
> + ? ? ? ? ? offset += 16;
> + ? ? ? ? ? remainder_size -= 16;
> + ? ? ? ? }
> +
> + ? ? ?/* Move the remaining part of epilogue - its size might be
> + ? ? ? ?a size of the widest mode. ?*/
> + ? ? ?while (remainder_size >= GET_MODE_SIZE (Pmode))
> + ? ? ? {
> + ? ? ? ? if (!promoted_to_gpr_value)
> + ? ? ? ? ? promoted_to_gpr_value = promote_duplicated_reg (Pmode, value);
> + ? ? ? ? emit_strset (destmem, promoted_to_gpr_value, destptr, Pmode, offset);
> + ? ? ? ? offset += GET_MODE_SIZE (Pmode);
> + ? ? ? ? remainder_size -= GET_MODE_SIZE (Pmode);
> + ? ? ? }
> +
> + ? ? ?if (!promoted_to_gpr_value && remainder_size > 1)
> + ? ? ? promoted_to_gpr_value = promote_duplicated_reg (remainder_size >= 4
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? SImode : HImode, value);
> + ? ? ?if (remainder_size >= 4)
> ? ? ? ?{
> - ? ? ? ? dest = adjust_automodify_address_nv (destmem, SImode, destptr, offset);
> - ? ? ? ? emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
> + ? ? ? ? emit_strset (destmem, gen_lowpart (SImode, promoted_to_gpr_value), destptr,
> + ? ? ? ? ? ? ? ? ? ? ?SImode, offset);
> ? ? ? ? ?offset += 4;
> + ? ? ? ? remainder_size -= 4;
> ? ? ? ?}
> - ? ? ?if ((countval & 0x02) && max_size > 2)
> + ? ? ?if (remainder_size >= 2)
> ? ? ? ?{
> - ? ? ? ? dest = adjust_automodify_address_nv (destmem, HImode, destptr, offset);
> - ? ? ? ? emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
> - ? ? ? ? offset += 2;
> + ? ? ? ? emit_strset (destmem, gen_lowpart (HImode, promoted_to_gpr_value), destptr,
> + ? ? ? ? ? ? ? ? ? ? ?HImode, offset);
> + ? ? ? ? offset +=2;
> + ? ? ? ? remainder_size -= 2;
> ? ? ? ?}
> - ? ? ?if ((countval & 0x01) && max_size > 1)
> + ? ? ?if (remainder_size >= 1)
> ? ? ? ?{
> - ? ? ? ? dest = adjust_automodify_address_nv (destmem, QImode, destptr, offset);
> - ? ? ? ? emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
> + ? ? ? ? emit_strset (destmem,
> + ? ? ? ? ? ? ? ? ? ? ?promoted_to_gpr_value ? gen_lowpart (QImode, promoted_to_gpr_value) : value,
> + ? ? ? ? ? ? ? ? ? ? ? destptr,
> + ? ? ? ? ? ? ? ? ? ? ?QImode, offset);
> ? ? ? ? ?offset += 1;
> + ? ? ? ? remainder_size -= 1;
> ? ? ? ?}
> + ? ? ?gcc_assert (remainder_size == 0);
> ? ? ? return;
> ? ? }
> +
> + ?/* count isn't const. ?*/
> ? if (max_size > 32)
> ? ? {
> - ? ? ?expand_setmem_epilogue_via_loop (destmem, destptr, value, count, max_size);
> + ? ? ?expand_setmem_epilogue_via_loop (destmem, destptr, value, count,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?max_size);
> ? ? ? return;
> ? ? }
> +
> + ?if (!promoted_to_gpr_value)
> + ? ?promoted_to_gpr_value = promote_duplicated_reg_to_size (value,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?GET_MODE_SIZE (Pmode),
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?GET_MODE_SIZE (Pmode),
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?GET_MODE_SIZE (Pmode));
> +
> ? if (max_size > 16)
> ? ? {
> ? ? ? rtx label = ix86_expand_aligntest (count, 16, true);
> - ? ? ?if (TARGET_64BIT)
> + ? ? ?if (TARGET_SSE && promoted_to_vector_value)
> + ? ? ? {
> + ? ? ? ? destmem = change_address (destmem,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? GET_MODE (promoted_to_vector_value),
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? destptr);
> + ? ? ? ? emit_insn (gen_strset (destptr, destmem, promoted_to_vector_value));
> + ? ? ? }
> + ? ? ?else if (TARGET_64BIT)
> ? ? ? ?{
> - ? ? ? ? dest = change_address (destmem, DImode, destptr);
> - ? ? ? ? emit_insn (gen_strset (destptr, dest, value));
> - ? ? ? ? emit_insn (gen_strset (destptr, dest, value));
> + ? ? ? ? destmem = change_address (destmem, DImode, destptr);
> + ? ? ? ? emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
> + ? ? ? ? emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
> ? ? ? ?}
> ? ? ? else
> ? ? ? ?{
> - ? ? ? ? dest = change_address (destmem, SImode, destptr);
> - ? ? ? ? emit_insn (gen_strset (destptr, dest, value));
> - ? ? ? ? emit_insn (gen_strset (destptr, dest, value));
> - ? ? ? ? emit_insn (gen_strset (destptr, dest, value));
> - ? ? ? ? emit_insn (gen_strset (destptr, dest, value));
> + ? ? ? ? destmem = change_address (destmem, SImode, destptr);
> + ? ? ? ? emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
> + ? ? ? ? emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
> + ? ? ? ? emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
> + ? ? ? ? emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
> ? ? ? ?}
> ? ? ? emit_label (label);
> ? ? ? LABEL_NUSES (label) = 1;
> @@ -21279,14 +21564,22 @@ expand_setmem_epilogue (rtx destmem, rtx
> ? ? ? rtx label = ix86_expand_aligntest (count, 8, true);
> ? ? ? if (TARGET_64BIT)
> ? ? ? ?{
> - ? ? ? ? dest = change_address (destmem, DImode, destptr);
> - ? ? ? ? emit_insn (gen_strset (destptr, dest, value));
> + ? ? ? ? destmem = change_address (destmem, DImode, destptr);
> + ? ? ? ? emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
> + ? ? ? }
> +      /* FIXME: When this hunk is output, IRA classifies promoted_to_vector_value
> +         as NO_REGS.  */
> + ? ? ?else if (TARGET_SSE && promoted_to_vector_value && 0)
> + ? ? ? {
> + ? ? ? ? destmem = change_address (destmem, V2SImode, destptr);
> + ? ? ? ? emit_insn (gen_strset (destptr, destmem,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?gen_lowpart (V2SImode, promoted_to_vector_value)));
> ? ? ? ?}
> ? ? ? else
> ? ? ? ?{
> - ? ? ? ? dest = change_address (destmem, SImode, destptr);
> - ? ? ? ? emit_insn (gen_strset (destptr, dest, value));
> - ? ? ? ? emit_insn (gen_strset (destptr, dest, value));
> + ? ? ? ? destmem = change_address (destmem, SImode, destptr);
> + ? ? ? ? emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
> + ? ? ? ? emit_insn (gen_strset (destptr, destmem, promoted_to_gpr_value));
> ? ? ? ?}
> ? ? ? emit_label (label);
> ? ? ? LABEL_NUSES (label) = 1;
> @@ -21294,24 +21587,27 @@ expand_setmem_epilogue (rtx destmem, rtx
> ? if (max_size > 4)
> ? ? {
> ? ? ? rtx label = ix86_expand_aligntest (count, 4, true);
> - ? ? ?dest = change_address (destmem, SImode, destptr);
> - ? ? ?emit_insn (gen_strset (destptr, dest, gen_lowpart (SImode, value)));
> + ? ? ?destmem = change_address (destmem, SImode, destptr);
> + ? ? ?emit_insn (gen_strset (destptr, destmem,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ?gen_lowpart (SImode, promoted_to_gpr_value)));
> ? ? ? emit_label (label);
> ? ? ? LABEL_NUSES (label) = 1;
> ? ? }
> ? if (max_size > 2)
> ? ? {
> ? ? ? rtx label = ix86_expand_aligntest (count, 2, true);
> - ? ? ?dest = change_address (destmem, HImode, destptr);
> - ? ? ?emit_insn (gen_strset (destptr, dest, gen_lowpart (HImode, value)));
> + ? ? ?destmem = change_address (destmem, HImode, destptr);
> + ? ? ?emit_insn (gen_strset (destptr, destmem,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ?gen_lowpart (HImode, promoted_to_gpr_value)));
> ? ? ? emit_label (label);
> ? ? ? LABEL_NUSES (label) = 1;
> ? ? }
> ? if (max_size > 1)
> ? ? {
> ? ? ? rtx label = ix86_expand_aligntest (count, 1, true);
> - ? ? ?dest = change_address (destmem, QImode, destptr);
> - ? ? ?emit_insn (gen_strset (destptr, dest, gen_lowpart (QImode, value)));
> + ? ? ?destmem = change_address (destmem, QImode, destptr);
> + ? ? ?emit_insn (gen_strset (destptr, destmem,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ?gen_lowpart (QImode, promoted_to_gpr_value)));
> ? ? ? emit_label (label);
> ? ? ? LABEL_NUSES (label) = 1;
> ? ? }
> @@ -21327,8 +21623,8 @@ expand_movmem_prologue (rtx destmem, rtx
> ? if (align <= 1 && desired_alignment > 1)
> ? ? {
> ? ? ? rtx label = ix86_expand_aligntest (destptr, 1, false);
> - ? ? ?srcmem = change_address (srcmem, QImode, srcptr);
> - ? ? ?destmem = change_address (destmem, QImode, destptr);
> + ? ? ?srcmem = adjust_automodify_address_nv (srcmem, QImode, srcptr, 0);
> + ? ? ?destmem = adjust_automodify_address_nv (destmem, QImode, destptr, 0);
> ? ? ? emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
> ? ? ? ix86_adjust_counter (count, 1);
> ? ? ? emit_label (label);
> @@ -21337,8 +21633,8 @@ expand_movmem_prologue (rtx destmem, rtx
> ? if (align <= 2 && desired_alignment > 2)
> ? ? {
> ? ? ? rtx label = ix86_expand_aligntest (destptr, 2, false);
> - ? ? ?srcmem = change_address (srcmem, HImode, srcptr);
> - ? ? ?destmem = change_address (destmem, HImode, destptr);
> + ? ? ?srcmem = adjust_automodify_address_nv (srcmem, HImode, srcptr, 0);
> + ? ? ?destmem = adjust_automodify_address_nv (destmem, HImode, destptr, 0);
> ? ? ? emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
> ? ? ? ix86_adjust_counter (count, 2);
> ? ? ? emit_label (label);
> @@ -21347,14 +21643,34 @@ expand_movmem_prologue (rtx destmem, rtx
> ? if (align <= 4 && desired_alignment > 4)
> ? ? {
> ? ? ? rtx label = ix86_expand_aligntest (destptr, 4, false);
> - ? ? ?srcmem = change_address (srcmem, SImode, srcptr);
> - ? ? ?destmem = change_address (destmem, SImode, destptr);
> + ? ? ?srcmem = adjust_automodify_address_nv (srcmem, SImode, srcptr, 0);
> + ? ? ?destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
> ? ? ? emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
> ? ? ? ix86_adjust_counter (count, 4);
> ? ? ? emit_label (label);
> ? ? ? LABEL_NUSES (label) = 1;
> ? ? }
> - ?gcc_assert (desired_alignment <= 8);
> + ?if (align <= 8 && desired_alignment > 8)
> + ? ?{
> + ? ? ?rtx label = ix86_expand_aligntest (destptr, 8, false);
> + ? ? ?if (TARGET_64BIT || TARGET_SSE)
> + ? ? ? {
> + ? ? ? ? srcmem = adjust_automodify_address_nv (srcmem, DImode, srcptr, 0);
> + ? ? ? ? destmem = adjust_automodify_address_nv (destmem, DImode, destptr, 0);
> + ? ? ? ? emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
> + ? ? ? }
> + ? ? ?else
> + ? ? ? {
> + ? ? ? ? srcmem = adjust_automodify_address_nv (srcmem, SImode, srcptr, 0);
> + ? ? ? ? destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
> + ? ? ? ? emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
> + ? ? ? ? emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
> + ? ? ? }
> + ? ? ?ix86_adjust_counter (count, 8);
> + ? ? ?emit_label (label);
> + ? ? ?LABEL_NUSES (label) = 1;
> + ? ?}
> + ?gcc_assert (desired_alignment <= 16);
> ?}
>
> ?/* Copy enough from DST to SRC to align DST known to DESIRED_ALIGN.
> @@ -21409,6 +21725,37 @@ expand_constant_movmem_prologue (rtx dst
> ? ? ? off = 4;
> ? ? ? emit_insn (gen_strmov (destreg, dst, srcreg, src));
> ? ? }
> + ?if (align_bytes & 8)
> + ? ?{
> + ? ? ?if (TARGET_64BIT || TARGET_SSE)
> + ? ? ? {
> + ? ? ? ? dst = adjust_automodify_address_nv (dst, DImode, destreg, off);
> + ? ? ? ? src = adjust_automodify_address_nv (src, DImode, srcreg, off);
> + ? ? ? ? emit_insn (gen_strmov (destreg, dst, srcreg, src));
> + ? ? ? }
> + ? ? ?else
> + ? ? ? {
> + ? ? ? ? dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
> + ? ? ? ? src = adjust_automodify_address_nv (src, SImode, srcreg, off);
> + ? ? ? ? emit_insn (gen_strmov (destreg, dst, srcreg, src));
> + ? ? ? ? emit_insn (gen_strmov (destreg, dst, srcreg, src));
> + ? ? ? }
> + ? ? ?if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
> + ? ? ? set_mem_align (dst, 8 * BITS_PER_UNIT);
> + ? ? ?if (src_align_bytes >= 0)
> + ? ? ? {
> + ? ? ? ? unsigned int src_align = 0;
> + ? ? ? ? if ((src_align_bytes & 7) == (align_bytes & 7))
> + ? ? ? ? ? src_align = 8;
> + ? ? ? ? else if ((src_align_bytes & 3) == (align_bytes & 3))
> + ? ? ? ? ? src_align = 4;
> + ? ? ? ? else if ((src_align_bytes & 1) == (align_bytes & 1))
> + ? ? ? ? ? src_align = 2;
> + ? ? ? ? if (MEM_ALIGN (src) < src_align * BITS_PER_UNIT)
> + ? ? ? ? ? set_mem_align (src, src_align * BITS_PER_UNIT);
> + ? ? ? }
> + ? ? ?off = 8;
> + ? ?}
> ? dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
> ? src = adjust_automodify_address_nv (src, BLKmode, srcreg, off);
> ? if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
> @@ -21416,7 +21763,9 @@ expand_constant_movmem_prologue (rtx dst
> ? if (src_align_bytes >= 0)
> ? ? {
> ? ? ? unsigned int src_align = 0;
> - ? ? ?if ((src_align_bytes & 7) == (align_bytes & 7))
> + ? ? ?if ((src_align_bytes & 15) == (align_bytes & 15))
> + ? ? ? src_align = 16;
> + ? ? ?else if ((src_align_bytes & 7) == (align_bytes & 7))
> ? ? ? ?src_align = 8;
> ? ? ? else if ((src_align_bytes & 3) == (align_bytes & 3))
> ? ? ? ?src_align = 4;
> @@ -21444,7 +21793,7 @@ expand_setmem_prologue (rtx destmem, rtx
> ? if (align <= 1 && desired_alignment > 1)
> ? ? {
> ? ? ? rtx label = ix86_expand_aligntest (destptr, 1, false);
> - ? ? ?destmem = change_address (destmem, QImode, destptr);
> + ? ? ?destmem = adjust_automodify_address_nv (destmem, QImode, destptr, 0);
> ? ? ? emit_insn (gen_strset (destptr, destmem, gen_lowpart (QImode, value)));
> ? ? ? ix86_adjust_counter (count, 1);
> ? ? ? emit_label (label);
> @@ -21453,7 +21802,7 @@ expand_setmem_prologue (rtx destmem, rtx
> ? if (align <= 2 && desired_alignment > 2)
> ? ? {
> ? ? ? rtx label = ix86_expand_aligntest (destptr, 2, false);
> - ? ? ?destmem = change_address (destmem, HImode, destptr);
> + ? ? ?destmem = adjust_automodify_address_nv (destmem, HImode, destptr, 0);
> ? ? ? emit_insn (gen_strset (destptr, destmem, gen_lowpart (HImode, value)));
> ? ? ? ix86_adjust_counter (count, 2);
> ? ? ? emit_label (label);
> @@ -21462,13 +21811,23 @@ expand_setmem_prologue (rtx destmem, rtx
> ? if (align <= 4 && desired_alignment > 4)
> ? ? {
> ? ? ? rtx label = ix86_expand_aligntest (destptr, 4, false);
> - ? ? ?destmem = change_address (destmem, SImode, destptr);
> + ? ? ?destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
> ? ? ? emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
> ? ? ? ix86_adjust_counter (count, 4);
> ? ? ? emit_label (label);
> ? ? ? LABEL_NUSES (label) = 1;
> ? ? }
> - ?gcc_assert (desired_alignment <= 8);
> + ?if (align <= 8 && desired_alignment > 8)
> + ? ?{
> + ? ? ?rtx label = ix86_expand_aligntest (destptr, 8, false);
> + ? ? ?destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
> + ? ? ?emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
> + ? ? ?emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
> + ? ? ?ix86_adjust_counter (count, 8);
> + ? ? ?emit_label (label);
> + ? ? ?LABEL_NUSES (label) = 1;
> + ? ?}
> + ?gcc_assert (desired_alignment <= 16);
> ?}
>
> ?/* Set enough from DST to align DST known to by aligned by ALIGN to
> @@ -21504,6 +21863,19 @@ expand_constant_setmem_prologue (rtx dst
> ? ? ? emit_insn (gen_strset (destreg, dst,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? gen_lowpart (SImode, value)));
> ? ? }
> + ?if (align_bytes & 8)
> + ? ?{
> + ? ? ?dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
> + ? ? ?emit_insn (gen_strset (destreg, dst,
> + ? ? ? ? ? gen_lowpart (SImode, value)));
> + ? ? ?off = 4;
> + ? ? ?dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
> + ? ? ?emit_insn (gen_strset (destreg, dst,
> + ? ? ? ? ? gen_lowpart (SImode, value)));
> + ? ? ?if (MEM_ALIGN (dst) < 8 * BITS_PER_UNIT)
> + ? ? ? set_mem_align (dst, 8 * BITS_PER_UNIT);
> + ? ? ?off = 4;
> + ? ?}
> ? dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
> ? if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
> ? ? set_mem_align (dst, desired_align * BITS_PER_UNIT);
> @@ -21515,7 +21887,7 @@ expand_constant_setmem_prologue (rtx dst
> ?/* Given COUNT and EXPECTED_SIZE, decide on codegen of string operation. ?*/
> ?static enum stringop_alg
> ?decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
> - ? ? ? ? ? int *dynamic_check)
> + ? ? ? ? ? int *dynamic_check, bool align_unknown)
> ?{
> ? const struct stringop_algs * algs;
> ? bool optimize_for_speed;
> @@ -21524,7 +21896,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WI
> ? ? ?consider such algorithms if the user has appropriated those
> ? ? ?registers for their own purposes. */
> ? bool rep_prefix_usable = !(fixed_regs[CX_REG] || fixed_regs[DI_REG]
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? || (memset
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ?|| (memset
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? fixed_regs[AX_REG] : fixed_regs[SI_REG]));
>
> ?#define ALG_USABLE_P(alg) (rep_prefix_usable ? ? ? ? ? ? ? ? ? \
> @@ -21537,7 +21909,7 @@ decide_alg (HOST_WIDE_INT count, HOST_WI
> ? ? ?of time processing large blocks. ?*/
> ? if (optimize_function_for_size_p (cfun)
> ? ? ? || (optimize_insn_for_size_p ()
> - ? ? ? ? ?&& expected_size != -1 && expected_size < 256))
> + ? ? ? ? && expected_size != -1 && expected_size < 256))
> ? ? optimize_for_speed = false;
> ? else
> ? ? optimize_for_speed = true;
> @@ -21546,9 +21918,9 @@ decide_alg (HOST_WIDE_INT count, HOST_WI
>
> ? *dynamic_check = -1;
> ? if (memset)
> - ? ?algs = &cost->memset[TARGET_64BIT != 0];
> + ? ?algs = &cost->memset[align_unknown][TARGET_64BIT != 0];
> ? else
> - ? ?algs = &cost->memcpy[TARGET_64BIT != 0];
> + ? ?algs = &cost->memcpy[align_unknown][TARGET_64BIT != 0];
> ? if (ix86_stringop_alg != no_stringop && ALG_USABLE_P (ix86_stringop_alg))
> ? ? return ix86_stringop_alg;
> ? /* rep; movq or rep; movl is the smallest variant. ?*/
> @@ -21612,29 +21984,33 @@ decide_alg (HOST_WIDE_INT count, HOST_WI
> ? ? ? enum stringop_alg alg;
> ? ? ? int i;
> ? ? ? bool any_alg_usable_p = true;
> + ? ? ?bool only_libcall_fits = true;
>
> ? ? ? for (i = 0; i < MAX_STRINGOP_ALGS; i++)
> - ? ? ? ?{
> - ? ? ? ? ?enum stringop_alg candidate = algs->size[i].alg;
> - ? ? ? ? ?any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
> + ? ? ? {
> + ? ? ? ? enum stringop_alg candidate = algs->size[i].alg;
> + ? ? ? ? any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
>
> - ? ? ? ? ?if (candidate != libcall && candidate
> - ? ? ? ? ? ? ?&& ALG_USABLE_P (candidate))
> - ? ? ? ? ? ? ?max = algs->size[i].max;
> - ? ? ? ?}
> + ? ? ? ? if (candidate != libcall && candidate
> + ? ? ? ? ? ? && ALG_USABLE_P (candidate))
> + ? ? ? ? ? {
> + ? ? ? ? ? ? max = algs->size[i].max;
> + ? ? ? ? ? ? only_libcall_fits = false;
> + ? ? ? ? ? }
> + ? ? ? }
> ? ? ? /* If there aren't any usable algorithms, then recursing on
> - ? ? ? ? smaller sizes isn't going to find anything. ?Just return the
> - ? ? ? ? simple byte-at-a-time copy loop. ?*/
> - ? ? ?if (!any_alg_usable_p)
> - ? ? ? ?{
> - ? ? ? ? ?/* Pick something reasonable. ?*/
> - ? ? ? ? ?if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
> - ? ? ? ? ? ?*dynamic_check = 128;
> - ? ? ? ? ?return loop_1_byte;
> - ? ? ? ?}
> + ? ? ? ?smaller sizes isn't going to find anything. ?Just return the
> + ? ? ? ?simple byte-at-a-time copy loop. ?*/
> + ? ? ?if (!any_alg_usable_p || only_libcall_fits)
> + ? ? ? {
> + ? ? ? ? /* Pick something reasonable. ?*/
> + ? ? ? ? if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
> + ? ? ? ? ? *dynamic_check = 128;
> + ? ? ? ? return loop_1_byte;
> + ? ? ? }
> ? ? ? if (max == -1)
> ? ? ? ?max = 4096;
> - ? ? ?alg = decide_alg (count, max / 2, memset, dynamic_check);
> + ? ? ?alg = decide_alg (count, max / 2, memset, dynamic_check, align_unknown);
> ? ? ? gcc_assert (*dynamic_check == -1);
> ? ? ? gcc_assert (alg != libcall);
> ? ? ? if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
> @@ -21658,9 +22034,14 @@ decide_alignment (int align,
> ? ? ? case no_stringop:
> ? ? ? ?gcc_unreachable ();
> ? ? ? case loop:
> + ? ? ? desired_align = GET_MODE_SIZE (Pmode);
> + ? ? ? break;
> ? ? ? case unrolled_loop:
> ? ? ? ?desired_align = GET_MODE_SIZE (Pmode);
> ? ? ? ?break;
> + ? ? ?case sse_loop:
> + ? ? ? desired_align = 16;
> + ? ? ? break;
> ? ? ? case rep_prefix_8_byte:
> ? ? ? ?desired_align = 8;
> ? ? ? ?break;
> @@ -21748,6 +22129,11 @@ ix86_expand_movmem (rtx dst, rtx src, rt
> ? enum stringop_alg alg;
> ? int dynamic_check;
> ? bool need_zero_guard = false;
> + ?bool align_unknown;
> + ?int unroll_factor;
> + ?enum machine_mode move_mode;
> + ?rtx loop_iter = NULL_RTX;
> + ?int dst_offset, src_offset;
>
> ? if (CONST_INT_P (align_exp))
> ? ? align = INTVAL (align_exp);
> @@ -21771,9 +22157,17 @@ ix86_expand_movmem (rtx dst, rtx src, rt
>
> ? /* Step 0: Decide on preferred algorithm, desired alignment and
> ? ? ?size of chunks to be copied by main loop. ?*/
> -
> - ?alg = decide_alg (count, expected_size, false, &dynamic_check);
> + ?dst_offset = get_mem_align_offset (dst, MOVE_MAX*BITS_PER_UNIT);
> + ?src_offset = get_mem_align_offset (src, MOVE_MAX*BITS_PER_UNIT);
> + ?align_unknown = (dst_offset < 0
> + ? ? ? ? ? ? ? ? ?|| src_offset < 0
> + ? ? ? ? ? ? ? ? ?|| src_offset != dst_offset);
> + ?alg = decide_alg (count, expected_size, false, &dynamic_check, align_unknown);
> ? desired_align = decide_alignment (align, alg, expected_size);
> + ?if (align_unknown)
> + ? ?desired_align = align;
> + ?unroll_factor = 1;
> + ?move_mode = Pmode;
>
> ? if (!TARGET_ALIGN_STRINGOPS)
> ? ? align = desired_align;
> @@ -21792,11 +22186,22 @@ ix86_expand_movmem (rtx dst, rtx src, rt
> ? ? ? gcc_unreachable ();
> ? ? case loop:
> ? ? ? need_zero_guard = true;
> - ? ? ?size_needed = GET_MODE_SIZE (Pmode);
> + ? ? ?move_mode = Pmode;
> + ? ? ?unroll_factor = 1;
> + ? ? ?size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
> ? ? ? break;
> ? ? case unrolled_loop:
> ? ? ? need_zero_guard = true;
> - ? ? ?size_needed = GET_MODE_SIZE (Pmode) * (TARGET_64BIT ? 4 : 2);
> + ? ? ?move_mode = Pmode;
> + ? ? ?unroll_factor = TARGET_64BIT ? 4 : 2;
> + ? ? ?size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
> + ? ? ?break;
> + ? ?case sse_loop:
> + ? ? ?need_zero_guard = true;
> + ? ? ?/* Use SSE instructions, if possible. ?*/
> + ? ? ?move_mode = align_unknown ? DImode : V4SImode;
> + ? ? ?unroll_factor = TARGET_64BIT ? 4 : 2;
> + ? ? ?size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
> ? ? ? break;
> ? ? case rep_prefix_8_byte:
> ? ? ? size_needed = 8;
> @@ -21857,6 +22262,12 @@ ix86_expand_movmem (rtx dst, rtx src, rt
> ? ? ? ?}
> ? ? ? else
> ? ? ? ?{
> +         /* The SSE and unrolled loop algorithms reuse the iteration counter
> +            in the epilogue.  */
> +         if (alg == sse_loop || alg == unrolled_loop)
> +           {
> +             loop_iter = gen_reg_rtx (counter_mode (count_exp));
> +             emit_move_insn (loop_iter, const0_rtx);
> +           }
> ? ? ? ? ?label = gen_label_rtx ();
> ? ? ? ? ?emit_cmp_and_jump_insns (count_exp,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? GEN_INT (epilogue_size_needed),
> @@ -21908,6 +22319,8 @@ ix86_expand_movmem (rtx dst, rtx src, rt
> ? ? ? ? ?dst = change_address (dst, BLKmode, destreg);
> ? ? ? ? ?expand_movmem_prologue (dst, src, destreg, srcreg, count_exp, align,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?desired_align);
> + ? ? ? ? set_mem_align (src, desired_align*BITS_PER_UNIT);
> + ? ? ? ? set_mem_align (dst, desired_align*BITS_PER_UNIT);
> ? ? ? ?}
> ? ? ? else
> ? ? ? ?{
> @@ -21964,12 +22377,16 @@ ix86_expand_movmem (rtx dst, rtx src, rt
> ? ? ? expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? count_exp, Pmode, 1, expected_size);
> ? ? ? break;
> + ? ?case sse_loop:
> ? ? case unrolled_loop:
> - ? ? ?/* Unroll only by factor of 2 in 32bit mode, since we don't have enough
> - ? ? ? ?registers for 4 temporaries anyway. ?*/
> - ? ? ?expand_set_or_movmem_via_loop (dst, src, destreg, srcreg, NULL,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?count_exp, Pmode, TARGET_64BIT ? 4 : 2,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?expected_size);
> +      /* In some cases we want to use the same iterator in several adjacent
> +         loops, so we save the loop iterator rtx here and don't update the
> +         addresses.  */
> + ? ? ?loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?srcreg, NULL,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?count_exp, loop_iter,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?move_mode,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?unroll_factor,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?expected_size, false);
> ? ? ? break;
> ? ? case rep_prefix_8_byte:
> ? ? ? expand_movmem_via_rep_mov (dst, src, destreg, srcreg, count_exp,
> @@ -22020,9 +22437,41 @@ ix86_expand_movmem (rtx dst, rtx src, rt
> ? ? ? LABEL_NUSES (label) = 1;
> ? ? }
>
> +  /* We haven't updated the addresses, so do it now.
> +     Also, if the epilogue seems to be big, generate a (non-unrolled) loop
> +     in it.  We do that only if the alignment is unknown, because in that
> +     case the epilogue would have to copy byte by byte, which is very
> +     slow.  */
> + ?if (alg == sse_loop || alg == unrolled_loop)
> + ? ?{
> + ? ? ?rtx tmp;
> + ? ? ?if (align_unknown && unroll_factor > 1)
> + ? ? ? {
> +         /* Reduce the epilogue's size by creating a non-unrolled loop.  If
> +            we don't do this, the epilogue can become very big: when the
> +            alignment is statically unknown it proceeds byte by byte, which
> +            may be very slow.  */
> + ? ? ? ? loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, src, destreg,
> + ? ? ? ? ? ? srcreg, NULL, count_exp,
> + ? ? ? ? ? ? loop_iter, move_mode, 1,
> + ? ? ? ? ? ? expected_size, false);
> + ? ? ? ? src = change_address (src, BLKmode, srcreg);
> + ? ? ? ? dst = change_address (dst, BLKmode, destreg);
> + ? ? ? ? epilogue_size_needed = GET_MODE_SIZE (move_mode);
> + ? ? ? }
> + ? ? ?tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?true, OPTAB_LIB_WIDEN);
> + ? ? ?if (tmp != destreg)
> + ? ? ? emit_move_insn (destreg, tmp);
> +
> + ? ? ?tmp = expand_simple_binop (Pmode, PLUS, srcreg, loop_iter, srcreg,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?true, OPTAB_LIB_WIDEN);
> + ? ? ?if (tmp != srcreg)
> + ? ? ? emit_move_insn (srcreg, tmp);
> + ? ?}
> ? if (count_exp != const0_rtx && epilogue_size_needed > 1)
> ? ? expand_movmem_epilogue (dst, src, destreg, srcreg, count_exp,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ?epilogue_size_needed);
> +
> ? if (jump_around_label)
> ? ? emit_label (jump_around_label);
> ? return true;
> @@ -22040,7 +22489,37 @@ promote_duplicated_reg (enum machine_mod
> ? rtx tmp;
> ? int nops = mode == DImode ? 3 : 2;
>
> + ?if (VECTOR_MODE_P (mode))
> + ? ?{
> + ? ? ?enum machine_mode inner = GET_MODE_INNER (mode);
> + ? ? ?rtx promoted_val, vec_reg;
> + ? ? ?if (CONST_INT_P (val))
> + ? ? ? return ix86_build_const_vector (mode, true, val);
> +
> + ? ? ?promoted_val = promote_duplicated_reg (inner, val);
> + ? ? ?vec_reg = gen_reg_rtx (mode);
> + ? ? ?switch (mode)
> + ? ? ? {
> + ? ? ? case V2DImode:
> + ? ? ? ? emit_insn (gen_vec_dupv2di (vec_reg, promoted_val));
> + ? ? ? ? break;
> + ? ? ? case V4SImode:
> + ? ? ? ? emit_insn (gen_vec_dupv4si (vec_reg, promoted_val));
> + ? ? ? ? break;
> + ? ? ? default:
> + ? ? ? ? gcc_unreachable ();
> + ? ? ? ? break;
> + ? ? ? }
> +
> + ? ? ?return vec_reg;
> + ? ?}
> ? gcc_assert (mode == SImode || mode == DImode);
> + ?if (mode == DImode && !TARGET_64BIT)
> + ? ?{
> + ? ? ?rtx vec_reg = promote_duplicated_reg (V4SImode, val);
> + ? ? ?vec_reg = convert_to_mode (V2DImode, vec_reg, 1);
> + ? ? ?return vec_reg;
> + ? ?}
> ? if (val == const0_rtx)
> ? ? return copy_to_mode_reg (mode, const0_rtx);
> ? if (CONST_INT_P (val))
> @@ -22106,11 +22585,27 @@ promote_duplicated_reg (enum machine_mod
> ?static rtx
> ?promote_duplicated_reg_to_size (rtx val, int size_needed, int desired_align, int align)
> ?{
> - ?rtx promoted_val;
> + ?rtx promoted_val = NULL_RTX;
>
> - ?if (TARGET_64BIT
> - ? ? ?&& (size_needed > 4 || (desired_align > align && desired_align > 4)))
> - ? ?promoted_val = promote_duplicated_reg (DImode, val);
> + ?if (size_needed > 8 || (desired_align > align && desired_align > 8))
> + ? ?{
> +      /* We want to promote to a vector register, so we expect at least SSE
> +         to be available.  */
> + ? ? ?gcc_assert (TARGET_SSE);
> +
> +      /* When promoting to a vector register, we expect VAL to be a constant
> +         or a value already promoted to a GPR.  */
> + ? ? ?gcc_assert (GET_MODE (val) == Pmode || CONSTANT_P (val));
> + ? ? ?if (TARGET_64BIT)
> + ? ? ? promoted_val = promote_duplicated_reg (V2DImode, val);
> + ? ? ?else
> + ? ? ? promoted_val = promote_duplicated_reg (V4SImode, val);
> + ? ?}
> + ?else if (size_needed > 4 || (desired_align > align && desired_align > 4))
> + ? ?{
> + ? ? ?gcc_assert (TARGET_64BIT);
> + ? ? ?promoted_val = promote_duplicated_reg (DImode, val);
> + ? ?}
> ? else if (size_needed > 2 || (desired_align > align && desired_align > 2))
> ? ? promoted_val = promote_duplicated_reg (SImode, val);
> ? else if (size_needed > 1 || (desired_align > align && desired_align > 1))
> @@ -22138,10 +22633,14 @@ ix86_expand_setmem (rtx dst, rtx count_e
> ? int size_needed = 0, epilogue_size_needed;
> ? int desired_align = 0, align_bytes = 0;
> ? enum stringop_alg alg;
> - ?rtx promoted_val = NULL;
> - ?bool force_loopy_epilogue = false;
> + ?rtx gpr_promoted_val = NULL;
> + ?rtx vec_promoted_val = NULL;
> ? int dynamic_check;
> ? bool need_zero_guard = false;
> + ?bool align_unknown;
> + ?unsigned int unroll_factor;
> + ?enum machine_mode move_mode;
> + ?rtx loop_iter = NULL_RTX;
>
> ? if (CONST_INT_P (align_exp))
> ? ? align = INTVAL (align_exp);
> @@ -22161,8 +22660,11 @@ ix86_expand_setmem (rtx dst, rtx count_e
> ? /* Step 0: Decide on preferred algorithm, desired alignment and
> ? ? ?size of chunks to be copied by main loop. ?*/
>
> - ?alg = decide_alg (count, expected_size, true, &dynamic_check);
> + ?align_unknown = get_mem_align_offset (dst, BITS_PER_UNIT) < 0;
> + ?alg = decide_alg (count, expected_size, true, &dynamic_check, align_unknown);
> ? desired_align = decide_alignment (align, alg, expected_size);
> + ?unroll_factor = 1;
> + ?move_mode = Pmode;
>
> ? if (!TARGET_ALIGN_STRINGOPS)
> ? ? align = desired_align;
> @@ -22180,11 +22682,28 @@ ix86_expand_setmem (rtx dst, rtx count_e
> ? ? ? gcc_unreachable ();
> ? ? case loop:
> ? ? ? need_zero_guard = true;
> - ? ? ?size_needed = GET_MODE_SIZE (Pmode);
> + ? ? ?move_mode = Pmode;
> + ? ? ?size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
> ? ? ? break;
> ? ? case unrolled_loop:
> ? ? ? need_zero_guard = true;
> - ? ? ?size_needed = GET_MODE_SIZE (Pmode) * 4;
> + ? ? ?move_mode = Pmode;
> + ? ? ?unroll_factor = 1;
> +      /* Select the largest available unroll factor: 1, 2 or 4.  */
> + ? ? ?while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
> + ? ? ? ? ? ?&& unroll_factor < 4)
> + ? ? ? unroll_factor *= 2;
> + ? ? ?size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
> + ? ? ?break;
> + ? ?case sse_loop:
> + ? ? ?need_zero_guard = true;
> + ? ? ?move_mode = TARGET_64BIT ? V2DImode : V4SImode;
> + ? ? ?unroll_factor = 1;
> +      /* Select the largest available unroll factor: 1, 2 or 4.  */
> + ? ? ?while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
> + ? ? ? ? ? ?&& unroll_factor < 4)
> + ? ? ? unroll_factor *= 2;
> + ? ? ?size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
> ? ? ? break;
> ? ? case rep_prefix_8_byte:
> ? ? ? size_needed = 8;
> @@ -22229,8 +22748,10 @@ ix86_expand_setmem (rtx dst, rtx count_e
> ? ? ?main loop and epilogue (ie one load of the big constant in the
> ? ? ?front of all code. ?*/
> ? if (CONST_INT_P (val_exp))
> - ? ?promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?desired_align, align);
> + ? ?gpr_promoted_val = promote_duplicated_reg_to_size (val_exp,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?GET_MODE_SIZE (Pmode),
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?GET_MODE_SIZE (Pmode),
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?align);
> ? /* Ensure that alignment prologue won't copy past end of block. ?*/
> ? if (size_needed > 1 || (desired_align > 1 && desired_align > align))
> ? ? {
> @@ -22239,12 +22760,6 @@ ix86_expand_setmem (rtx dst, rtx count_e
> ? ? ? ? Make sure it is power of 2. ?*/
> ? ? ? epilogue_size_needed = smallest_pow2_greater_than (epilogue_size_needed);
>
> - ? ? ?/* To improve performance of small blocks, we jump around the VAL
> - ? ? ? ?promoting mode. ?This mean that if the promoted VAL is not constant,
> - ? ? ? ?we might not use it in the epilogue and have to use byte
> - ? ? ? ?loop variant. ?*/
> - ? ? ?if (epilogue_size_needed > 2 && !promoted_val)
> - ? ? ? ?force_loopy_epilogue = true;
> ? ? ? if (count)
> ? ? ? ?{
> ? ? ? ? ?if (count < (unsigned HOST_WIDE_INT)epilogue_size_needed)
> @@ -22259,6 +22774,12 @@ ix86_expand_setmem (rtx dst, rtx count_e
> ? ? ? ?}
> ? ? ? else
> ? ? ? ?{
> +         /* The SSE and unrolled loop algorithms reuse the iteration counter
> +            in the epilogue.  */
> +         if (alg == sse_loop || alg == unrolled_loop)
> +           {
> +             loop_iter = gen_reg_rtx (counter_mode (count_exp));
> +             emit_move_insn (loop_iter, const0_rtx);
> +           }
> ? ? ? ? ?label = gen_label_rtx ();
> ? ? ? ? ?emit_cmp_and_jump_insns (count_exp,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? GEN_INT (epilogue_size_needed),
> @@ -22284,9 +22805,11 @@ ix86_expand_setmem (rtx dst, rtx count_e
> ? /* Step 2: Alignment prologue. ?*/
>
> ? /* Do the expensive promotion once we branched off the small blocks. ?*/
> - ?if (!promoted_val)
> - ? ?promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?desired_align, align);
> + ?if (!gpr_promoted_val)
> + ? ?gpr_promoted_val = promote_duplicated_reg_to_size (val_exp,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?GET_MODE_SIZE (Pmode),
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?GET_MODE_SIZE (Pmode),
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?align);
> ? gcc_assert (desired_align >= 1 && align >= 1);
>
> ? if (desired_align > align)
> @@ -22298,17 +22821,20 @@ ix86_expand_setmem (rtx dst, rtx count_e
> ? ? ? ? ? ? the pain to maintain it for the first move, so throw away
> ? ? ? ? ? ? the info early. ?*/
> ? ? ? ? ?dst = change_address (dst, BLKmode, destreg);
> - ? ? ? ? expand_setmem_prologue (dst, destreg, promoted_val, count_exp, align,
> + ? ? ? ? expand_setmem_prologue (dst, destreg, gpr_promoted_val, count_exp, align,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?desired_align);
> + ? ? ? ? set_mem_align (dst, desired_align*BITS_PER_UNIT);
> ? ? ? ?}
> ? ? ? else
> ? ? ? ?{
> ? ? ? ? ?/* If we know how many bytes need to be stored before dst is
> ? ? ? ? ? ? sufficiently aligned, maintain aliasing info accurately. ?*/
> - ? ? ? ? dst = expand_constant_setmem_prologue (dst, destreg, promoted_val,
> + ? ? ? ? dst = expand_constant_setmem_prologue (dst, destreg, gpr_promoted_val,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? desired_align, align_bytes);
> ? ? ? ? ?count_exp = plus_constant (count_exp, -align_bytes);
> ? ? ? ? ?count -= align_bytes;
> + ? ? ? ? if (count < (unsigned HOST_WIDE_INT) size_needed)
> + ? ? ? ? ? goto epilogue;
> ? ? ? ?}
> ? ? ? if (need_zero_guard
> ? ? ? ? ?&& (count < (unsigned HOST_WIDE_INT) size_needed
> @@ -22336,7 +22862,7 @@ ix86_expand_setmem (rtx dst, rtx count_e
> ? ? ? emit_label (label);
> ? ? ? LABEL_NUSES (label) = 1;
> ? ? ? label = NULL;
> - ? ? ?promoted_val = val_exp;
> + ? ? ?gpr_promoted_val = val_exp;
> ? ? ? epilogue_size_needed = 1;
> ? ? }
> ? else if (label == NULL_RTX)
> @@ -22350,27 +22876,40 @@ ix86_expand_setmem (rtx dst, rtx count_e
> ? ? case no_stringop:
> ? ? ? gcc_unreachable ();
> ? ? case loop_1_byte:
> - ? ? ?expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
> + ? ? ?expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, val_exp,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? count_exp, QImode, 1, expected_size);
> ? ? ? break;
> ? ? case loop:
> - ? ? ?expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
> + ? ? ?expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, gpr_promoted_val,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? count_exp, Pmode, 1, expected_size);
> ? ? ? break;
> ? ? case unrolled_loop:
> - ? ? ?expand_set_or_movmem_via_loop (dst, NULL, destreg, NULL, promoted_val,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?count_exp, Pmode, 4, expected_size);
> + ? ? ?loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?NULL, gpr_promoted_val, count_exp,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?loop_iter, move_mode, unroll_factor,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?expected_size, false);
> + ? ? ?break;
> + ? ?case sse_loop:
> + ? ? ?vec_promoted_val =
> + ? ? ? promote_duplicated_reg_to_size (gpr_promoted_val,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? GET_MODE_SIZE (move_mode),
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? desired_align, align);
> + ? ? ?loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?NULL, vec_promoted_val, count_exp,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?loop_iter, move_mode, unroll_factor,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?expected_size, false);
> ? ? ? break;
> ? ? case rep_prefix_8_byte:
> - ? ? ?expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
> + ? ? ?gcc_assert (TARGET_64BIT);
> + ? ? ?expand_setmem_via_rep_stos (dst, destreg, gpr_promoted_val, count_exp,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?DImode, val_exp);
> ? ? ? break;
> ? ? case rep_prefix_4_byte:
> - ? ? ?expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
> + ? ? ?expand_setmem_via_rep_stos (dst, destreg, gpr_promoted_val, count_exp,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?SImode, val_exp);
> ? ? ? break;
> ? ? case rep_prefix_1_byte:
> - ? ? ?expand_setmem_via_rep_stos (dst, destreg, promoted_val, count_exp,
> + ? ? ?expand_setmem_via_rep_stos (dst, destreg, gpr_promoted_val, count_exp,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?QImode, val_exp);
> ? ? ? break;
> ? ? }
> @@ -22401,17 +22940,33 @@ ix86_expand_setmem (rtx dst, rtx count_e
> ? ? ? ?}
> ? ? ? emit_label (label);
> ? ? ? LABEL_NUSES (label) = 1;
> +      /* We cannot rely on the fact that the promoted value is known.  */
> + ? ? ?vec_promoted_val = 0;
> ? ? }
> ?epilogue:
> - ?if (count_exp != const0_rtx && epilogue_size_needed > 1)
> + ?if (alg == sse_loop || alg == unrolled_loop)
> ? ? {
> - ? ? ?if (force_loopy_epilogue)
> - ? ? ? expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?epilogue_size_needed);
> - ? ? ?else
> - ? ? ? expand_setmem_epilogue (dst, destreg, promoted_val, count_exp,
> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? epilogue_size_needed);
> + ? ? ?rtx tmp;
> + ? ? ?if (align_unknown && unroll_factor > 1)
> + ? ? ? {
> +         /* Reduce the epilogue's size by creating a non-unrolled loop.  If
> +            we don't do this, the epilogue can become very big: when the
> +            alignment is statically unknown it proceeds byte by byte, which
> +            may be very slow.  */
> + ? ? ? ? loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
> + ? ? ? ? ? ? NULL, vec_promoted_val, count_exp,
> + ? ? ? ? ? ? loop_iter, move_mode, 1,
> + ? ? ? ? ? ? expected_size, false);
> + ? ? ? ? dst = change_address (dst, BLKmode, destreg);
> + ? ? ? ? epilogue_size_needed = GET_MODE_SIZE (move_mode);
> + ? ? ? }
> + ? ? ?tmp = expand_simple_binop (Pmode, PLUS, destreg, loop_iter, destreg,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?true, OPTAB_LIB_WIDEN);
> + ? ? ?if (tmp != destreg)
> + ? ? ? emit_move_insn (destreg, tmp);
> ? ? }
> + ?if (count_exp != const0_rtx && epilogue_size_needed > 1)
> + ? ?expand_setmem_epilogue (dst, destreg, vec_promoted_val, gpr_promoted_val,
> + ? ? ? ? ? ? ? ? ? ? ? ? ? val_exp, count_exp, epilogue_size_needed);
> ? if (jump_around_label)
> ? ? emit_label (jump_around_label);
> ? return true;
>
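
A side note for anyone reviewing the epilogue changes above: the "promoted"
value that the stores reuse is just the single fill byte replicated across the
whole move_mode register (a general-purpose register for loop/unrolled_loop,
a vector register for sse_loop), so one wide store writes
GET_MODE_SIZE (move_mode) copies of the byte at once, and the 4/2/1-byte tails
can be taken with gen_lowpart of the same register.  A rough C sketch of the
idea (broadcast_byte is only an illustrative name, not the actual RTL that
promote_duplicated_reg emits):

    #include <stdint.h>

    /* Conceptual model of GPR promotion for an 8-byte move mode: replicate
       the fill byte into every byte of a 64-bit value.  The vector case then
       broadcasts this further via the vec_dup patterns.  */
    static uint64_t
    broadcast_byte (uint8_t value)
    {
      return (uint64_t) value * 0x0101010101010101ULL;
    }

The real code builds the equivalent value at the RTL level (see
promote_duplicated_reg and the vec_dup cases in the hunks above); the sketch is
only meant to show why a single promotion done up front can serve the
prologue, the main loop and the epilogue.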



-- 
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.

