
Re: Memset/memcpy patch


> > > 
> > > The current x86 memset/memcpy expansion is broken. It miscompiles
> > > many programs, including GCC itself.  Should it be reverted for now?
> > 
> > There was a problem in the new code doing loopy epilogues.
> > I am currently testing the following patch that should fix the problem.
> > We could either revert now and I will apply a combined patch, or I hope
> > to fix it tonight.
> 
> To expand a little bit: I was looking into the code for most of the day today,
> and the patch combines several fixes (a C-level sketch of the code shape these
> fixes target follows the quoted text below):
>    1) the new loopy epilogue code was quite broken. It did not work for memset at all
>       because the promoted value was not always initialized; I fixed that in the version
>       of the patch that is in mainline now. However, it also missed the bounds check in
>       some cases.  This is fixed by the expand_set_or_movmem_via_loop_with_iter change.
>    2) I mis-updated the atom cost description, so 32bit memset was not expanded inline;
>       this is fixed by the memset changes.
>    3) decide_alg was broken in two ways - first, it gave complex algorithms for -O0,
>       and second, it chose the wrong variant when sse_loop is used.
>    4) the epilogue loop was output even in cases where it is not needed - e.g. when the
>       unrolled loop handles 16 bytes at once and the block size is 39, the remaining
>       7 bytes are better handled by the straight-line epilogue.  This is the
>       ix86_expand_movmem and ix86_expand_setmem change.
>    5) The implementations of ix86_expand_movmem/ix86_expand_setmem diverged for no
>       reason, so I got them back in sync. For some reason the SSE code in movmem was
>       not output for 64bit unaligned memcpy; that is fixed too.
>    6) it seems that both bdver and core are good enough at handling misaligned blocks
>       that the alignment prologues can be omitted. This greatly improves the inline
>       sequence and reduces its size. I will however break this out into an independent
>       patch.
> 
> Life would be easier if the changes were made in multiple incremental steps; stringop
> expansion is relatively tricky business and relatively easy to get wrong in some cases,
> since there are so many variants depending on knowledge of size/alignment and target
> architecture.
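
To make fixes 1) and 4) concrete, the code shape the expanders aim to produce
for a variable-size memset is roughly the following C. This is an illustrative
sketch only - sketch_setmem and the constants are invented, and the real
expanders emit RTL directly and also handle alignment prologues and runtime
counts:

#include <stddef.h>
#include <string.h>

void
sketch_setmem (char *dst, int val, size_t count)
{
  const size_t piece = 16;               /* bytes per SSE store */
  const size_t unroll = 2;               /* main loop unroll factor */
  const size_t main_step = piece * unroll;
  size_t iter = 0;

  /* Main unrolled loop: handles count rounded down to main_step.  */
  size_t main_bytes = count & ~(main_step - 1);
  for (; iter < main_bytes; iter += main_step)
    memset (dst + iter, val, main_step);

  /* "Loopy" epilogue, one piece per iteration.  The iter < size bound
     check is the one that was missing when the iterator is reused
     (fix 1).  For count == 39 it runs zero times, since 39 & ~15 == 32
     == main_bytes - which is why fix 4 omits the loop entirely when the
     known remainder is that small.  */
  size_t size = count & ~(piece - 1);
  for (; iter < size; iter += piece)
    memset (dst + iter, val, piece);

  /* Straight-line tail for the remaining count % piece bytes.  */
  memset (dst + iter, val, count - iter);
}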

Hi,
this is the patch I committed after bootstrapping/regtesting x86_64-linux with
--with-arch=core2 --with-cpu=atom.  The gfortran.fortran-torture/execute/arrayarg.f90
failure stays.  As I've explained in the PR log, I believe it is a previously
latent problem elsewhere that is now triggered by the inline memset expansion
being later unrolled.  I would welcome help from someone who understands the
testcase on whether it is aliasing safe or not.

Honza

	PR bootstrap/51134
	* i386.c (atom_cost): Fix 32bit memset description.
	(expand_set_or_movmem_via_loop_with_iter): Output proper bounds check
	for epilogue loops.
	(expand_movmem_epilogue): Handle epilogues up to size 15 w/o producing
	a byte loop.
	(decide_alg): sse_loop is not usable when SSE2 is disabled; when not
	optimizing always use rep movsb or libcall; do not produce word sized
	loops when optimizing memset for size (to avoid need for large
	constants).
	(ix86_expand_movmem): Get into sync with ix86_expand_setmem; choose
	unroll factors better; always do 128bit moves when producing SSE
	loops; do not produce loopy epilogue when size is too small.
	(promote_duplicated_reg_to_size): Do not look into desired alignments
	when doing vector expansion.
	(ix86_expand_setmem): Track better when promoted value is available;
	choose unroll factors more sanely; output loopy epilogue only when
	needed.
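
For reference, the "choose unroll factors better/more sanely" change behaves
like this small standalone sketch (simplified; pick_unroll is an invented
name, not the committed expander code, which works on machine modes):

#include <stdio.h>

/* Pick the 1, 2 or 4 unroll factor the way the new expander code does:
   for unknown counts use the maximum, otherwise double the factor while
   the main loop would still run at least two iterations.  */
static unsigned
pick_unroll (unsigned piece_size, unsigned long count, unsigned max_unroll)
{
  unsigned unroll = 1;
  if (count == 0)		/* count not known at compile time */
    return max_unroll;
  while (piece_size * unroll * 2 < count && unroll < max_unroll)
    unroll *= 2;
  return unroll;
}

int
main (void)
{
  /* 16-byte SSE pieces: a 39-byte block stays at factor 2 so the main
     loop still gets one full 32-byte iteration, while a 200-byte block
     gets the full factor of 4.  */
  printf ("%u\n", pick_unroll (16, 39, 4));	/* prints 2 */
  printf ("%u\n", pick_unroll (16, 200, 4));	/* prints 4 */
  return 0;
}
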
Index: config/i386/i386.c
===================================================================
*** config/i386/i386.c	(revision 181407)
--- config/i386/i386.c	(working copy)
*************** struct processor_costs atom_cost = {
*** 1785,1791 ****
       if that fails.  */
    {{{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
      {libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}},
!    {{libcall, {{-1, libcall}}},			       /* Unknown alignment.  */
      {libcall, {{2048, sse_loop}, {2048, unrolled_loop},
  	       {-1, libcall}}}}},
  
--- 1785,1791 ----
       if that fails.  */
    {{{libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}, /* Known alignment.  */
      {libcall, {{4096, sse_loop}, {4096, unrolled_loop}, {-1, libcall}}}},
!    {{libcall, {{2048, sse_loop}, {2048, unrolled_loop}, {-1, libcall}}}, /* Unknown alignment.  */
      {libcall, {{2048, sse_loop}, {2048, unrolled_loop},
  	       {-1, libcall}}}}},
  
*************** expand_set_or_movmem_via_loop_with_iter 
*** 21149,21168 ****
  
    top_label = gen_label_rtx ();
    out_label = gen_label_rtx ();
-   if (!reuse_iter)
-     iter = gen_reg_rtx (iter_mode);
- 
    size = expand_simple_binop (iter_mode, AND, count, piece_size_mask,
! 			      NULL, 1, OPTAB_DIRECT);
!   /* Those two should combine.  */
!   if (piece_size == const1_rtx)
      {
!       emit_cmp_and_jump_insns (size, const0_rtx, EQ, NULL_RTX, iter_mode,
  			       true, out_label);
-       predict_jump (REG_BR_PROB_BASE * 10 / 100);
      }
-   if (!reuse_iter)
-     emit_move_insn (iter, const0_rtx);
  
    emit_label (top_label);
  
--- 21149,21173 ----
  
    top_label = gen_label_rtx ();
    out_label = gen_label_rtx ();
    size = expand_simple_binop (iter_mode, AND, count, piece_size_mask,
! 		               NULL, 1, OPTAB_DIRECT);
!   if (!reuse_iter)
      {
!       iter = gen_reg_rtx (iter_mode);
!       /* Those two should combine.  */
!       if (piece_size == const1_rtx)
! 	{
! 	  emit_cmp_and_jump_insns (size, const0_rtx, EQ, NULL_RTX, iter_mode,
! 				   true, out_label);
! 	  predict_jump (REG_BR_PROB_BASE * 10 / 100);
! 	}
!       emit_move_insn (iter, const0_rtx);
!     }
!   else
!     {
!       emit_cmp_and_jump_insns (iter, size, GE, NULL_RTX, iter_mode,
  			       true, out_label);
      }
  
    emit_label (top_label);
  
*************** expand_movmem_epilogue (rtx destmem, rtx
*** 21460,21466 ****
        gcc_assert (remainder_size == 0);
        return;
      }
!   if (max_size > 8)
      {
        count = expand_simple_binop (GET_MODE (count), AND, count, GEN_INT (max_size - 1),
  				    count, 1, OPTAB_DIRECT);
--- 21465,21471 ----
        gcc_assert (remainder_size == 0);
        return;
      }
!   if (max_size > 16)
      {
        count = expand_simple_binop (GET_MODE (count), AND, count, GEN_INT (max_size - 1),
  				    count, 1, OPTAB_DIRECT);
*************** expand_movmem_epilogue (rtx destmem, rtx
*** 21475,21480 ****
--- 21480,21504 ----
     */
    if (TARGET_SINGLE_STRINGOP)
      {
+       if (max_size > 8)
+ 	{
+ 	  rtx label = ix86_expand_aligntest (count, 8, true);
+ 	  if (TARGET_64BIT)
+ 	    {
+ 	      src = change_address (srcmem, DImode, srcptr);
+ 	      dest = change_address (destmem, DImode, destptr);
+ 	      emit_insn (gen_strmov (destptr, dest, srcptr, src));
+ 	    }
+ 	  else
+ 	    {
+ 	      src = change_address (srcmem, SImode, srcptr);
+ 	      dest = change_address (destmem, SImode, destptr);
+ 	      emit_insn (gen_strmov (destptr, dest, srcptr, src));
+ 	      emit_insn (gen_strmov (destptr, dest, srcptr, src));
+ 	    }
+ 	  emit_label (label);
+ 	  LABEL_NUSES (label) = 1;
+ 	}
        if (max_size > 4)
  	{
  	  rtx label = ix86_expand_aligntest (count, 4, true);
*************** expand_movmem_epilogue (rtx destmem, rtx
*** 21508,21513 ****
--- 21532,21566 ----
        rtx offset = force_reg (Pmode, const0_rtx);
        rtx tmp;
  
+       if (max_size > 8)
+ 	{
+ 	  rtx label = ix86_expand_aligntest (count, 8, true);
+ 	  if (TARGET_64BIT)
+ 	    {
+ 	      src = change_address (srcmem, DImode, srcptr);
+ 	      dest = change_address (destmem, DImode, destptr);
+ 	      emit_move_insn (dest, src);
+ 	      tmp = expand_simple_binop (Pmode, PLUS, offset, GEN_INT (8), NULL,
+ 				         true, OPTAB_LIB_WIDEN);
+ 	    }
+ 	  else
+ 	    {
+ 	      src = change_address (srcmem, SImode, srcptr);
+ 	      dest = change_address (destmem, SImode, destptr);
+ 	      emit_move_insn (dest, src);
+ 	      tmp = expand_simple_binop (Pmode, PLUS, offset, GEN_INT (4), NULL,
+ 				         true, OPTAB_LIB_WIDEN);
+ 	      if (tmp != offset)
+ 	         emit_move_insn (offset, tmp);
+ 	      tmp = expand_simple_binop (Pmode, PLUS, offset, GEN_INT (4), NULL,
+ 				         true, OPTAB_LIB_WIDEN);
+ 	      emit_move_insn (dest, src);
+ 	    }
+ 	  if (tmp != offset)
+ 	    emit_move_insn (offset, tmp);
+ 	  emit_label (label);
+ 	  LABEL_NUSES (label) = 1;
+ 	}
        if (max_size > 4)
  	{
  	  rtx label = ix86_expand_aligntest (count, 4, true);
*************** expand_setmem_epilogue (rtx destmem, rtx
*** 21588,21604 ****
  	 Remaining part we'll move using Pmode and narrower modes.  */
  
        if (promoted_to_vector_value)
! 	while (remainder_size >= 16)
! 	  {
! 	    if (GET_MODE (destmem) != move_mode)
! 	      destmem = adjust_automodify_address_nv (destmem, move_mode,
! 						      destptr, offset);
! 	    emit_strset (destmem, promoted_to_vector_value, destptr,
! 			 move_mode, offset);
  
! 	    offset += 16;
! 	    remainder_size -= 16;
! 	  }
  
        /* Move the remaining part of epilogue - its size might be
  	 a size of the widest mode.  */
--- 21641,21668 ----
  	 Remaining part we'll move using Pmode and narrower modes.  */
  
        if (promoted_to_vector_value)
! 	{
! 	  if (promoted_to_vector_value)
! 	    {
! 	      if (max_size >= GET_MODE_SIZE (V4SImode))
! 		move_mode = V4SImode;
! 	      else if (max_size >= GET_MODE_SIZE (DImode))
! 		move_mode = DImode;
! 	    }
! 	  while (remainder_size >= GET_MODE_SIZE (move_mode))
! 	    {
! 	      if (GET_MODE (destmem) != move_mode)
! 		destmem = adjust_automodify_address_nv (destmem, move_mode,
! 							destptr, offset);
! 	      emit_strset (destmem,
! 			   promoted_to_vector_value,
! 			   destptr,
! 			   move_mode, offset);
  
! 	      offset += GET_MODE_SIZE (move_mode);
! 	      remainder_size -= GET_MODE_SIZE (move_mode);
! 	    }
! 	}
  
        /* Move the remaining part of epilogue - its size might be
  	 a size of the widest mode.  */
*************** decide_alg (HOST_WIDE_INT count, HOST_WI
*** 22022,22031 ****
  			     || (memset
  				 ? fixed_regs[AX_REG] : fixed_regs[SI_REG]));
  
! #define ALG_USABLE_P(alg) (rep_prefix_usable			\
! 			   || (alg != rep_prefix_1_byte		\
! 			       && alg != rep_prefix_4_byte      \
! 			       && alg != rep_prefix_8_byte))
    const struct processor_costs *cost;
  
    /* Even if the string operation call is cold, we still might spend a lot
--- 22086,22096 ----
  			     || (memset
  				 ? fixed_regs[AX_REG] : fixed_regs[SI_REG]));
  
! #define ALG_USABLE_P(alg) ((rep_prefix_usable			\
! 			    || (alg != rep_prefix_1_byte	\
! 			        && alg != rep_prefix_4_byte      \
! 			        && alg != rep_prefix_8_byte))    \
! 			   && (TARGET_SSE2 || alg != sse_loop))
    const struct processor_costs *cost;
  
    /* Even if the string operation call is cold, we still might spend a lot
*************** decide_alg (HOST_WIDE_INT count, HOST_WI
*** 22037,22042 ****
--- 22102,22110 ----
    else
      optimize_for_speed = true;
  
+   if (!optimize)
+     return (rep_prefix_usable ? rep_prefix_1_byte : libcall);
+ 
    cost = optimize_for_speed ? ix86_cost : &ix86_size_cost;
  
    *dynamic_check = -1;
*************** decide_alg (HOST_WIDE_INT count, HOST_WI
*** 22049,22058 ****
    /* rep; movq or rep; movl is the smallest variant.  */
    else if (!optimize_for_speed)
      {
!       if (!count || (count & 3))
! 	return rep_prefix_usable ? rep_prefix_1_byte : loop_1_byte;
        else
! 	return rep_prefix_usable ? rep_prefix_4_byte : loop;
      }
    /* Very tiny blocks are best handled via the loop, REP is expensive to setup.
     */
--- 22117,22126 ----
    /* rep; movq or rep; movl is the smallest variant.  */
    else if (!optimize_for_speed)
      {
!       if (!count || (count & 3) || memset)
! 	return rep_prefix_usable ? rep_prefix_1_byte : libcall;
        else
! 	return rep_prefix_usable ? rep_prefix_4_byte : libcall;
      }
    /* Very tiny blocks are best handled via the loop, REP is expensive to setup.
     */
*************** decide_alg (HOST_WIDE_INT count, HOST_WI
*** 22106,22118 ****
        int max = -1;
        enum stringop_alg alg;
        int i;
-       bool any_alg_usable_p = true;
        bool only_libcall_fits = true;
  
        for (i = 0; i < MAX_STRINGOP_ALGS; i++)
  	{
  	  enum stringop_alg candidate = algs->size[i].alg;
- 	  any_alg_usable_p = any_alg_usable_p && ALG_USABLE_P (candidate);
  
  	  if (candidate != libcall && candidate
  	      && ALG_USABLE_P (candidate))
--- 22174,22184 ----
*************** decide_alg (HOST_WIDE_INT count, HOST_WI
*** 22124,22130 ****
        /* If there aren't any usable algorithms, then recursing on
  	 smaller sizes isn't going to find anything.  Just return the
  	 simple byte-at-a-time copy loop.  */
!       if (!any_alg_usable_p || only_libcall_fits)
  	{
  	  /* Pick something reasonable.  */
  	  if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
--- 22190,22196 ----
        /* If there aren't any usable algorithms, then recursing on
  	 smaller sizes isn't going to find anything.  Just return the
  	 simple byte-at-a-time copy loop.  */
!       if (only_libcall_fits)
  	{
  	  /* Pick something reasonable.  */
  	  if (TARGET_INLINE_STRINGOPS_DYNAMICALLY)
*************** ix86_expand_movmem (rtx dst, rtx src, rt
*** 22253,22259 ****
    int dynamic_check;
    bool need_zero_guard = false;
    bool align_unknown;
!   int unroll_factor;
    enum machine_mode move_mode;
    rtx loop_iter = NULL_RTX;
    int dst_offset, src_offset;
--- 22319,22325 ----
    int dynamic_check;
    bool need_zero_guard = false;
    bool align_unknown;
!   unsigned int unroll_factor;
    enum machine_mode move_mode;
    rtx loop_iter = NULL_RTX;
    int dst_offset, src_offset;
*************** ix86_expand_movmem (rtx dst, rtx src, rt
*** 22316,22329 ****
      case unrolled_loop:
        need_zero_guard = true;
        move_mode = Pmode;
!       unroll_factor = TARGET_64BIT ? 4 : 2;
        size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
        break;
      case sse_loop:
        need_zero_guard = true;
        /* Use SSE instructions, if possible.  */
!       move_mode = align_unknown ? DImode : V4SImode;
!       unroll_factor = TARGET_64BIT ? 4 : 2;
        size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
        break;
      case rep_prefix_8_byte:
--- 22382,22408 ----
      case unrolled_loop:
        need_zero_guard = true;
        move_mode = Pmode;
!       unroll_factor = 1;
!       /* Select maximal available 1,2 or 4 unroll factor.
! 	 In 32bit we can not afford to use 4 registers inside the loop.  */
!       if (!count)
! 	unroll_factor = TARGET_64BIT ? 4 : 2;
!       else
! 	while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
! 	       && unroll_factor < (TARGET_64BIT ? 4 :2))
! 	  unroll_factor *= 2;
        size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
        break;
      case sse_loop:
        need_zero_guard = true;
        /* Use SSE instructions, if possible.  */
!       move_mode = V4SImode;
!       /* Select maximal available 1,2 or 4 unroll factor.  */
!       if (!count)
! 	unroll_factor = 4;
!       else
! 	while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
! 	       && unroll_factor < 4)
! 	  unroll_factor *= 2;
        size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
        break;
      case rep_prefix_8_byte:
*************** ix86_expand_movmem (rtx dst, rtx src, rt
*** 22568,22574 ****
    if (alg == sse_loop || alg == unrolled_loop)
      {
        rtx tmp;
!       if (align_unknown && unroll_factor > 1)
  	{
  	  /* Reduce epilogue's size by creating not-unrolled loop.  If we won't
  	     do this, we can have very big epilogue - when alignment is statically
--- 22647,22659 ----
    if (alg == sse_loop || alg == unrolled_loop)
      {
        rtx tmp;
!       int remainder_size = epilogue_size_needed;
! 
!       /* We may not need the epilogue loop at all when the count is known
! 	 and alignment is not adjusted.  */
!       if (count && desired_align <= align)
! 	remainder_size = count % epilogue_size_needed;
!       if (remainder_size > 31)
  	{
  	  /* Reduce epilogue's size by creating not-unrolled loop.  If we won't
  	     do this, we can have very big epilogue - when alignment is statically
*************** promote_duplicated_reg_to_size (rtx val,
*** 22710,22716 ****
  {
    rtx promoted_val = NULL_RTX;
  
!   if (size_needed > 8 || (desired_align > align && desired_align > 8))
      {
        /* We want to promote to vector register, so we expect that at least SSE
  	 is available.  */
--- 22795,22801 ----
  {
    rtx promoted_val = NULL_RTX;
  
!   if (size_needed > 8)
      {
        /* We want to promote to vector register, so we expect that at least SSE
  	 is available.  */
*************** promote_duplicated_reg_to_size (rtx val,
*** 22724,22730 ****
        else
  	promoted_val = promote_duplicated_reg (V4SImode, val);
      }
!   else if (size_needed > 4 || (desired_align > align && desired_align > 4))
      {
        gcc_assert (TARGET_64BIT);
        promoted_val = promote_duplicated_reg (DImode, val);
--- 22809,22815 ----
        else
  	promoted_val = promote_duplicated_reg (V4SImode, val);
      }
!   else if (size_needed > 4)
      {
        gcc_assert (TARGET_64BIT);
        promoted_val = promote_duplicated_reg (DImode, val);
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 22764,22769 ****
--- 22849,22855 ----
    unsigned int unroll_factor;
    enum machine_mode move_mode;
    rtx loop_iter = NULL_RTX;
+   bool early_jump = false;
  
    if (CONST_INT_P (align_exp))
      align = INTVAL (align_exp);
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 22783,22789 ****
    /* Step 0: Decide on preferred algorithm, desired alignment and
       size of chunks to be copied by main loop.  */
  
!   align_unknown = CONST_INT_P (align_exp) && INTVAL (align_exp) > 0;
    alg = decide_alg (count, expected_size, true, &dynamic_check, align_unknown);
    desired_align = decide_alignment (align, alg, expected_size);
    unroll_factor = 1;
--- 22869,22875 ----
    /* Step 0: Decide on preferred algorithm, desired alignment and
       size of chunks to be copied by main loop.  */
  
!   align_unknown = !(CONST_INT_P (align_exp) && INTVAL (align_exp) > 0);
    alg = decide_alg (count, expected_size, true, &dynamic_check, align_unknown);
    desired_align = decide_alignment (align, alg, expected_size);
    unroll_factor = 1;
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 22813,22821 ****
        move_mode = Pmode;
        unroll_factor = 1;
        /* Select maximal available 1,2 or 4 unroll factor.  */
!       while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
! 	     && unroll_factor < 4)
! 	unroll_factor *= 2;
        size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
        break;
      case sse_loop:
--- 22899,22910 ----
        move_mode = Pmode;
        unroll_factor = 1;
        /* Select maximal available 1,2 or 4 unroll factor.  */
!       if (!count)
! 	unroll_factor = 4;
!       else
! 	while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
! 	       && unroll_factor < 4)
! 	  unroll_factor *= 2;
        size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
        break;
      case sse_loop:
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 22823,22831 ****
        move_mode = TARGET_64BIT ? V2DImode : V4SImode;
        unroll_factor = 1;
        /* Select maximal available 1,2 or 4 unroll factor.  */
!       while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
! 	     && unroll_factor < 4)
! 	unroll_factor *= 2;
        size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
        break;
      case rep_prefix_8_byte:
--- 22912,22923 ----
        move_mode = TARGET_64BIT ? V2DImode : V4SImode;
        unroll_factor = 1;
        /* Select maximal available 1,2 or 4 unroll factor.  */
!       if (!count)
! 	unroll_factor = 4;
!       else
! 	while (GET_MODE_SIZE (move_mode) * unroll_factor * 2 < count
! 	       && unroll_factor < 4)
! 	  unroll_factor *= 2;
        size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
        break;
      case rep_prefix_8_byte:
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 22904,22909 ****
--- 22996,23002 ----
                emit_move_insn (loop_iter, const0_rtx);
  	    }
  	  label = gen_label_rtx ();
+ 	  early_jump = true;
  	  emit_cmp_and_jump_insns (count_exp,
  				   GEN_INT (epilogue_size_needed),
  				   LTU, 0, counter_mode (count_exp), 1, label);
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 23016,23022 ****
        vec_promoted_val =
  	promote_duplicated_reg_to_size (gpr_promoted_val,
  					GET_MODE_SIZE (move_mode),
! 					desired_align, align);
        loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
  				     NULL, vec_promoted_val, count_exp,
  				     loop_iter, move_mode, unroll_factor,
--- 23109,23115 ----
        vec_promoted_val =
  	promote_duplicated_reg_to_size (gpr_promoted_val,
  					GET_MODE_SIZE (move_mode),
! 					GET_MODE_SIZE (move_mode), align);
        loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
  				     NULL, vec_promoted_val, count_exp,
  				     loop_iter, move_mode, unroll_factor,
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 23065,23085 ****
        LABEL_NUSES (label) = 1;
        /* We can not rely on fact that promoted value is known.  */
        vec_promoted_val = 0;
!       gpr_promoted_val = 0;
      }
   epilogue:
    if (alg == unrolled_loop || alg == sse_loop)
      {
        rtx tmp;
!       if (align_unknown && unroll_factor > 1
! 	  && epilogue_size_needed >= GET_MODE_SIZE (move_mode)
! 	  && vec_promoted_val)
  	{
  	  /* Reduce epilogue's size by creating not-unrolled loop.  If we won't
  	     do this, we can have very big epilogue - when alignment is statically
  	     unknown we'll have the epilogue byte by byte which may be very slow.  */
  	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
! 	      NULL, vec_promoted_val, count_exp,
  	      loop_iter, move_mode, 1,
  	      expected_size, false);
  	  dst = change_address (dst, BLKmode, destreg);
--- 23158,23183 ----
        LABEL_NUSES (label) = 1;
        /* We can not rely on fact that promoted value is known.  */
        vec_promoted_val = 0;
!       if (early_jump)
!         gpr_promoted_val = 0;
      }
   epilogue:
    if (alg == unrolled_loop || alg == sse_loop)
      {
        rtx tmp;
!       int remainder_size = epilogue_size_needed;
!       if (count && desired_align <= align)
! 	remainder_size = count % epilogue_size_needed;
!       /* We may not need the epilogue loop at all when the count is known
! 	 and alignment is not adjusted.  */
!       if (remainder_size > 31 
! 	  && (alg == sse_loop ? vec_promoted_val : gpr_promoted_val))
  	{
  	  /* Reduce epilogue's size by creating not-unrolled loop.  If we won't
  	     do this, we can have very big epilogue - when alignment is statically
  	     unknown we'll have the epilogue byte by byte which may be very slow.  */
  	  loop_iter = expand_set_or_movmem_via_loop_with_iter (dst, NULL, destreg,
! 	      NULL, (alg == sse_loop ? vec_promoted_val : gpr_promoted_val), count_exp,
  	      loop_iter, move_mode, 1,
  	      expected_size, false);
  	  dst = change_address (dst, BLKmode, destreg);
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 23090,23106 ****
        if (tmp != destreg)
  	emit_move_insn (destreg, tmp);
      }
!   if (count_exp == const0_rtx)
      ;
!   else if (!gpr_promoted_val && epilogue_size_needed > 1)
      expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp,
  				     epilogue_size_needed);
    else
!     {
!       if (epilogue_size_needed > 1)
! 	expand_setmem_epilogue (dst, destreg, vec_promoted_val, gpr_promoted_val,
! 				val_exp, count_exp, epilogue_size_needed);
!     }
    if (jump_around_label)
      emit_label (jump_around_label);
    return true;
--- 23188,23201 ----
        if (tmp != destreg)
  	emit_move_insn (destreg, tmp);
      }
!   if (count_exp == const0_rtx || epilogue_size_needed <= 1)
      ;
!   else if (!gpr_promoted_val)
      expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp,
  				     epilogue_size_needed);
    else
!     expand_setmem_epilogue (dst, destreg, vec_promoted_val, gpr_promoted_val,
! 			    val_exp, count_exp, epilogue_size_needed);
    if (jump_around_label)
      emit_label (jump_around_label);
    return true;

