This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH, x86] Use vector moves in memmove expanding


Hi,
Jan, thanks for the review, I hope to prepare an updated version of the patch
shortly.  Please see my answers to your comments below.

Uros, there is a question of a better approach for generation of wide moves.
Could you please comment it (see details in bullets 3 and 5)?

1.
> +static int smallest_pow2_greater_than (int);
>
> Perhaps it is easier to use existing 1<<ceil_log2?
Well, yep.  Actually, this routine has already been used there, so I continued
using it.  I guess we could change its implementation to call
ceil_log2/floor_log2 or remove it entirely.

2.
> -      y_addr = gen_rtx_PLUS (Pmode, srcptr, copy_rtx (tmp));
> -      srcmem = change_address (srcmem, mode, y_addr);
> +      srcmem = offset_address (srcmem, copy_rtx (tmp), piece_size_n);
> +      srcmem = adjust_address (srcmem, mode, 0);
> ...
> This change looks OK and can go into manline independnetly. Just please ensure that changing
> the way address is computed is not making us to preserve alias set. Memmove can not rely on the alias
> set of the src/destination objects.
Could you explain it in more details?  Do you mean that at the beginning DST
and SRC could point to one memory location and have corresponding alias sets,
and I just change addresses they point to without invalidating alias sets?
I haven't thought about this, and that seems like a possible bug, but I guess
it could be simply fixed by calling change_address at the end.

3.
> +  /* Find the widest mode in which we could perform moves.
> +     Start with the biggest power of 2 less than SIZE_TO_MOVE and half
> +     it until move of such size is supported.  */
> +  piece_size = smallest_pow2_greater_than (size_to_move) >> 1;
> +  move_mode = mode_for_size (piece_size * BITS_PER_UNIT, MODE_INT, 0);
>
> I suppose this is a problem with SSE moves ending up in integer register, since
> you get TImode rather than vectorized mode, like V8QImode in here.  Why not stick
> with the original mode parmaeter?
Yes, here we choose TImode instead of a vector mode, but that actually was done
intentionally.  I tried several approaches here and decided that using the
widest integer mode is the best one for now.  We could try to find out a
particular (vector)mode in which we want to perform copying, but isn't it better
to rely on a machine-description here?  My idea here was to just request a copy
of, for instance, 128-bit piece (i.e. one TI-move) and leave it to the compiler
to choose the most optimal way of performing it.  Currently, the compiler thinks
that move of 128bits should be splitted into two moves of 64-bits (this
transformation is done in split2 pass) - if it's actually not so optimal, we
should fix it there, IMHO.

I think Uros could give me an advice on whether it's a reasonable approach or it
should be changed.

Also, I tried to avoid such fixes in this patch - that doesn't mean I'm not
going to work on the fixes, quite the contrary.  But it'd be easier to work on
them if we have a code in the trunk that could reveal the problem.

4.
> Doesn't this effectively kill support for TARGET_SINGLE_STRINGOP? It is useful as
> size optimization.
Do you mean removing emit_strmov?  I don't think it'll kill anything, as new
emit_memmov is capable of doing what emit_strmov did and is just an extended
version of it.  BTW, under TARGET_SINGLE_STRINGOP switch gen_strmov is used, not
emit_strmov - behaviour there hasn't been changed by this patch.

5.
> For SSE codegen, won't we need to track down in destination was aligned to generate aligned/unaligned moves?
We try to achieve a required alignment by prologue, so in the main loop
destination is aligned properly.  Source, meanwhile, could be misaligned, so for
it unaligned moves could be generated.  Here I actually also rely on the fact
that we have an optimal description of aligned/unaligned moves in MD-file, i.e.
if it's better to emit two DI-moves instead of one unaligned TI-mode, then
splits/expands will manage to do that.

6.
> +  else if (TREE_CODE (expr) == MEM_REF)
> +    {
> +      tree base = TREE_OPERAND (expr, 0);
> +      tree byte_offset = TREE_OPERAND (expr, 1);
> +      if (TREE_CODE (base) != ADDR_EXPR
> +         || TREE_CODE (byte_offset) != INTEGER_CST)
> +       return -1;
> +      if (!DECL_P (TREE_OPERAND (base, 0))
> +         || DECL_ALIGN (TREE_OPERAND (base, 0)) < align)
>
> You can use TYPE_ALIGN here? In general can't we replace all the GIMPLE
> handling by get_object_alignment?
>
> +       return -1;
> +      offset += tree_low_cst (byte_offset, 1);
> +    }
>    else
>      return -1;
>
> This change out to go independently. I can not review it.
> I will make first look over the patch shortly, but please send updated patch fixing
> the problem with integer regs.
Actually, I don't know what is a right way to find out alignment, but the
existing one didn't work.  Routine get_mem_align_offset didn't handle MEM_REFs
at all, so I added some handling there - I'm not sure it's complete and
absoulutely correct, but that currently works for me.  I'd be glad to hear any
suggestions of how that should be done - whom should I ask about it?

---
Thanks, Michael


On 17 April 2013 19:12, Jan Hubicka <hubicka@ucw.cz> wrote:
> @@ -2392,6 +2392,7 @@ static void ix86_set_current_function (tree);
>  static unsigned int ix86_minimum_incoming_stack_boundary (bool);
>
>  static enum calling_abi ix86_function_abi (const_tree);
> +static int smallest_pow2_greater_than (int);
>
> Perhaps it is easier to use existing 1<<ceil_log2?
>
>
>  #ifndef SUBTARGET32_DEFAULT_CPU
> @@ -21829,11 +21830,10 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
>  {
>    rtx out_label, top_label, iter, tmp;
>    enum machine_mode iter_mode = counter_mode (count);
> -  rtx piece_size = GEN_INT (GET_MODE_SIZE (mode) * unroll);
> +  int piece_size_n = GET_MODE_SIZE (mode) * unroll;
> +  rtx piece_size = GEN_INT (piece_size_n);
>    rtx piece_size_mask = GEN_INT (~((GET_MODE_SIZE (mode) * unroll) - 1));
>    rtx size;
> -  rtx x_addr;
> -  rtx y_addr;
>    int i;
>
>    top_label = gen_label_rtx ();
> @@ -21854,13 +21854,18 @@ expand_set_or_movmem_via_loop (rtx destmem, rtx srcmem,
>    emit_label (top_label);
>
>    tmp = convert_modes (Pmode, iter_mode, iter, true);
> -  x_addr = gen_rtx_PLUS (Pmode, destptr, tmp);
> -  destmem = change_address (destmem, mode, x_addr);
> +
> +  /* This assert could be relaxed - in this case we'll need to compute
> +     smallest power of two, containing in PIECE_SIZE_N and pass it to
> +     offset_address.  */
> +  gcc_assert ((piece_size_n & (piece_size_n - 1)) == 0);
> +  destmem = offset_address (destmem, tmp, piece_size_n);
> +  destmem = adjust_address (destmem, mode, 0);
>
>    if (srcmem)
>      {
> -      y_addr = gen_rtx_PLUS (Pmode, srcptr, copy_rtx (tmp));
> -      srcmem = change_address (srcmem, mode, y_addr);
> +      srcmem = offset_address (srcmem, copy_rtx (tmp), piece_size_n);
> +      srcmem = adjust_address (srcmem, mode, 0);
>
>        /* When unrolling for chips that reorder memory reads and writes,
>          we can save registers by using single temporary.
> @@ -22039,13 +22044,61 @@ expand_setmem_via_rep_stos (rtx destmem, rtx destptr, rtx value,
>    emit_insn (gen_rep_stos (destptr, countreg, destmem, value, destexp));
>  }
>
> This change looks OK and can go into manline independnetly. Just please ensure that changing
> the way address is computed is not making us to preserve alias set. Memmove can not rely on the alias
> set of the src/destination objects.
>
> -static void
> -emit_strmov (rtx destmem, rtx srcmem,
> -            rtx destptr, rtx srcptr, enum machine_mode mode, int offset)
> -{
> -  rtx src = adjust_automodify_address_nv (srcmem, mode, srcptr, offset);
> -  rtx dest = adjust_automodify_address_nv (destmem, mode, destptr, offset);
> -  emit_insn (gen_strmov (destptr, dest, srcptr, src));
> +/* This function emits moves to copy SIZE_TO_MOVE bytes from SRCMEM to
> +   DESTMEM.
> +   SRC is passed by pointer to be updated on return.
> +   Return value is updated DST.  */
> +static rtx
> +emit_memmov (rtx destmem, rtx *srcmem, rtx destptr, rtx srcptr,
> +            HOST_WIDE_INT size_to_move)
> +{
> +  rtx dst = destmem, src = *srcmem, adjust, tempreg;
> +  enum insn_code code;
> +  enum machine_mode move_mode;
> +  int piece_size, i;
> +
> +  /* Find the widest mode in which we could perform moves.
> +     Start with the biggest power of 2 less than SIZE_TO_MOVE and half
> +     it until move of such size is supported.  */
> +  piece_size = smallest_pow2_greater_than (size_to_move) >> 1;
> +  move_mode = mode_for_size (piece_size * BITS_PER_UNIT, MODE_INT, 0);
>
> I suppose this is a problem with SSE moves ending up in integer register, since
> you get TImode rather than vectorized mode, like V8QImode in here.  Why not stick
> with the original mode parmaeter?
> +  code = optab_handler (mov_optab, move_mode);
> +  while (code == CODE_FOR_nothing && piece_size > 1)
> +    {
> +      piece_size >>= 1;
> +      move_mode = mode_for_size (piece_size * BITS_PER_UNIT, MODE_INT, 0);
> +      code = optab_handler (mov_optab, move_mode);
> +    }
> +  gcc_assert (code != CODE_FOR_nothing);
> +
> +  dst = adjust_automodify_address_nv (dst, move_mode, destptr, 0);
> +  src = adjust_automodify_address_nv (src, move_mode, srcptr, 0);
> +
> +  /* Emit moves.  We'll need SIZE_TO_MOVE/PIECE_SIZES moves.  */
> +  gcc_assert (size_to_move % piece_size == 0);
> +  adjust = GEN_INT (piece_size);
> +  for (i = 0; i < size_to_move; i += piece_size)
> +    {
> +      /* We move from memory to memory, so we'll need to do it via
> +        a temporary register.  */
> +      tempreg = gen_reg_rtx (move_mode);
> +      emit_insn (GEN_FCN (code) (tempreg, src));
> +      emit_insn (GEN_FCN (code) (dst, tempreg));
> +
> +      emit_move_insn (destptr,
> +                     gen_rtx_PLUS (Pmode, copy_rtx (destptr), adjust));
> +      emit_move_insn (srcptr,
> +                     gen_rtx_PLUS (Pmode, copy_rtx (srcptr), adjust));
> +
> +      dst = adjust_automodify_address_nv (dst, move_mode, destptr,
> +                                         piece_size);
> +      src = adjust_automodify_address_nv (src, move_mode, srcptr,
> +                                         piece_size);
> +    }
> +
> +  /* Update DST and SRC rtx.  */
> +  *srcmem = src;
> +  return dst;
>
> Doesn't this effectively kill support for TARGET_SINGLE_STRINGOP? It is useful as
> size optimization.
>  }
>
>  /* Output code to copy at most count & (max_size - 1) bytes from SRC to DEST.  */
> @@ -22057,44 +22110,17 @@ expand_movmem_epilogue (rtx destmem, rtx srcmem,
>    if (CONST_INT_P (count))
>      {
>        HOST_WIDE_INT countval = INTVAL (count);
> -      int offset = 0;
> +      HOST_WIDE_INT epilogue_size = countval % max_size;
> +      int i;
>
> -      if ((countval & 0x10) && max_size > 16)
> -       {
> -         if (TARGET_64BIT)
> -           {
> -             emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
> -             emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset + 8);
> -           }
> -         else
> -           gcc_unreachable ();
> -         offset += 16;
> -       }
> -      if ((countval & 0x08) && max_size > 8)
> -       {
> -         if (TARGET_64BIT)
> -           emit_strmov (destmem, srcmem, destptr, srcptr, DImode, offset);
> -         else
> -           {
> -             emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
> -             emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset + 4);
> -           }
> -         offset += 8;
> -       }
> -      if ((countval & 0x04) && max_size > 4)
> +      /* For now MAX_SIZE should be a power of 2.  This assert could be
> +        relaxed, but it'll require a bit more complicated epilogue
> +        expanding.  */
> +      gcc_assert ((max_size & (max_size - 1)) == 0);
> +      for (i = max_size; i >= 1; i >>= 1)
>         {
> -          emit_strmov (destmem, srcmem, destptr, srcptr, SImode, offset);
> -         offset += 4;
> -       }
> -      if ((countval & 0x02) && max_size > 2)
> -       {
> -          emit_strmov (destmem, srcmem, destptr, srcptr, HImode, offset);
> -         offset += 2;
> -       }
> -      if ((countval & 0x01) && max_size > 1)
> -       {
> -          emit_strmov (destmem, srcmem, destptr, srcptr, QImode, offset);
> -         offset += 1;
> +         if (epilogue_size & i)
> +           destmem = emit_memmov (destmem, &srcmem, destptr, srcptr, i);
>         }
>        return;
>      }
> @@ -22330,47 +22356,33 @@ expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_
>  }
>
>  /* Copy enough from DEST to SRC to align DEST known to by aligned by ALIGN to
> -   DESIRED_ALIGNMENT.  */
> -static void
> +   DESIRED_ALIGNMENT.
> +   Return value is updated DESTMEM.  */
> +static rtx
>  expand_movmem_prologue (rtx destmem, rtx srcmem,
>                         rtx destptr, rtx srcptr, rtx count,
>                         int align, int desired_alignment)
>  {
> -  if (align <= 1 && desired_alignment > 1)
> -    {
> -      rtx label = ix86_expand_aligntest (destptr, 1, false);
> -      srcmem = change_address (srcmem, QImode, srcptr);
> -      destmem = change_address (destmem, QImode, destptr);
> -      emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
> -      ix86_adjust_counter (count, 1);
> -      emit_label (label);
> -      LABEL_NUSES (label) = 1;
> -    }
> -  if (align <= 2 && desired_alignment > 2)
> -    {
> -      rtx label = ix86_expand_aligntest (destptr, 2, false);
> -      srcmem = change_address (srcmem, HImode, srcptr);
> -      destmem = change_address (destmem, HImode, destptr);
> -      emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
> -      ix86_adjust_counter (count, 2);
> -      emit_label (label);
> -      LABEL_NUSES (label) = 1;
> -    }
> -  if (align <= 4 && desired_alignment > 4)
> +  int i;
> +  for (i = 1; i < desired_alignment; i <<= 1)
>      {
> -      rtx label = ix86_expand_aligntest (destptr, 4, false);
> -      srcmem = change_address (srcmem, SImode, srcptr);
> -      destmem = change_address (destmem, SImode, destptr);
> -      emit_insn (gen_strmov (destptr, destmem, srcptr, srcmem));
> -      ix86_adjust_counter (count, 4);
> -      emit_label (label);
> -      LABEL_NUSES (label) = 1;
> +      if (align <= i)
> +       {
> +         rtx label = ix86_expand_aligntest (destptr, i, false);
> +         destmem = emit_memmov (destmem, &srcmem, destptr, srcptr, i);
> +         ix86_adjust_counter (count, i);
> +         emit_label (label);
> +         LABEL_NUSES (label) = 1;
> +         set_mem_align (destmem, i * 2 * BITS_PER_UNIT);
> +       }
>      }
> -  gcc_assert (desired_alignment <= 8);
> +  return destmem;
>  }
>
>  /* Copy enough from DST to SRC to align DST known to DESIRED_ALIGN.
> -   ALIGN_BYTES is how many bytes need to be copied.  */
> +   ALIGN_BYTES is how many bytes need to be copied.
> +   The function updates DST and SRC, namely, it sets proper alignment.
> +   DST is returned via return value, SRC is updated via pointer SRCP.  */
>  static rtx
>  expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
>                                  int desired_align, int align_bytes)
> @@ -22378,62 +22390,34 @@ expand_constant_movmem_prologue (rtx dst, rtx *srcp, rtx destreg, rtx srcreg,
>    rtx src = *srcp;
>    rtx orig_dst = dst;
>    rtx orig_src = src;
> -  int off = 0;
> +  int piece_size = 1;
> +  int copied_bytes = 0;
>    int src_align_bytes = get_mem_align_offset (src, desired_align * BITS_PER_UNIT);
>    if (src_align_bytes >= 0)
>      src_align_bytes = desired_align - src_align_bytes;
> -  if (align_bytes & 1)
> -    {
> -      dst = adjust_automodify_address_nv (dst, QImode, destreg, 0);
> -      src = adjust_automodify_address_nv (src, QImode, srcreg, 0);
> -      off = 1;
> -      emit_insn (gen_strmov (destreg, dst, srcreg, src));
> -    }
> -  if (align_bytes & 2)
> -    {
> -      dst = adjust_automodify_address_nv (dst, HImode, destreg, off);
> -      src = adjust_automodify_address_nv (src, HImode, srcreg, off);
> -      if (MEM_ALIGN (dst) < 2 * BITS_PER_UNIT)
> -       set_mem_align (dst, 2 * BITS_PER_UNIT);
> -      if (src_align_bytes >= 0
> -         && (src_align_bytes & 1) == (align_bytes & 1)
> -         && MEM_ALIGN (src) < 2 * BITS_PER_UNIT)
> -       set_mem_align (src, 2 * BITS_PER_UNIT);
> -      off = 2;
> -      emit_insn (gen_strmov (destreg, dst, srcreg, src));
> -    }
> -  if (align_bytes & 4)
> +
> +  for (piece_size = 1;
> +       piece_size <= desired_align && copied_bytes < align_bytes;
> +       piece_size <<= 1)
>      {
> -      dst = adjust_automodify_address_nv (dst, SImode, destreg, off);
> -      src = adjust_automodify_address_nv (src, SImode, srcreg, off);
> -      if (MEM_ALIGN (dst) < 4 * BITS_PER_UNIT)
> -       set_mem_align (dst, 4 * BITS_PER_UNIT);
> -      if (src_align_bytes >= 0)
> +      if (align_bytes & piece_size)
>         {
> -         unsigned int src_align = 0;
> -         if ((src_align_bytes & 3) == (align_bytes & 3))
> -           src_align = 4;
> -         else if ((src_align_bytes & 1) == (align_bytes & 1))
> -           src_align = 2;
> -         if (MEM_ALIGN (src) < src_align * BITS_PER_UNIT)
> -           set_mem_align (src, src_align * BITS_PER_UNIT);
> +         dst = emit_memmov (dst, &src, destreg, srcreg, piece_size);
> +         copied_bytes += piece_size;
>         }
> -      off = 4;
> -      emit_insn (gen_strmov (destreg, dst, srcreg, src));
>      }
> -  dst = adjust_automodify_address_nv (dst, BLKmode, destreg, off);
> -  src = adjust_automodify_address_nv (src, BLKmode, srcreg, off);
> +
>    if (MEM_ALIGN (dst) < (unsigned int) desired_align * BITS_PER_UNIT)
>      set_mem_align (dst, desired_align * BITS_PER_UNIT);
>    if (src_align_bytes >= 0)
>      {
> -      unsigned int src_align = 0;
> -      if ((src_align_bytes & 7) == (align_bytes & 7))
> -       src_align = 8;
> -      else if ((src_align_bytes & 3) == (align_bytes & 3))
> -       src_align = 4;
> -      else if ((src_align_bytes & 1) == (align_bytes & 1))
> -       src_align = 2;
> +      unsigned int src_align;
> +      for (src_align = desired_align; src_align >= 2; src_align >>= 1)
> +       {
> +         if ((src_align_bytes & (src_align - 1))
> +              == (align_bytes & (src_align - 1)))
> +           break;
> +       }
>        if (src_align > (unsigned int) desired_align)
>         src_align = desired_align;
>        if (MEM_ALIGN (src) < src_align * BITS_PER_UNIT)
> @@ -22666,42 +22650,24 @@ decide_alg (HOST_WIDE_INT count, HOST_WIDE_INT expected_size, bool memset,
>  static int
>  decide_alignment (int align,
>                   enum stringop_alg alg,
> -                 int expected_size)
> +                 int expected_size,
> +                 enum machine_mode move_mode)
>  {
>    int desired_align = 0;
> -  switch (alg)
> -    {
> -      case no_stringop:
> -       gcc_unreachable ();
> -      case loop:
> -      case unrolled_loop:
> -       desired_align = GET_MODE_SIZE (Pmode);
> -       break;
> -      case rep_prefix_8_byte:
> -       desired_align = 8;
> -       break;
> -      case rep_prefix_4_byte:
> -       /* PentiumPro has special logic triggering for 8 byte aligned blocks.
> -          copying whole cacheline at once.  */
> -       if (TARGET_PENTIUMPRO)
> -         desired_align = 8;
> -       else
> -         desired_align = 4;
> -       break;
> -      case rep_prefix_1_byte:
> -       /* PentiumPro has special logic triggering for 8 byte aligned blocks.
> -          copying whole cacheline at once.  */
> -       if (TARGET_PENTIUMPRO)
> -         desired_align = 8;
> -       else
> -         desired_align = 1;
> -       break;
> -      case loop_1_byte:
> -       desired_align = 1;
> -       break;
> -      case libcall:
> -       return 0;
> -    }
> +
> +  gcc_assert (alg != no_stringop);
> +
> +  if (alg == libcall)
> +    return 0;
> +  if (move_mode == VOIDmode)
> +    return 0;
> +
> +  desired_align = GET_MODE_SIZE (move_mode);
> +  /* PentiumPro has special logic triggering for 8 byte aligned blocks.
> +     copying whole cacheline at once.  */
> +  if (TARGET_PENTIUMPRO
> +      && (alg == rep_prefix_4_byte || alg == rep_prefix_1_byte))
> +    desired_align = 8;
>
>    if (optimize_size)
>      desired_align = 1;
> @@ -22709,6 +22675,7 @@ decide_alignment (int align,
>      desired_align = align;
>    if (expected_size != -1 && expected_size < 4)
>      desired_align = align;
> +
>    return desired_align;
>  }
>
> @@ -22765,6 +22732,8 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
>    int dynamic_check;
>    bool need_zero_guard = false;
>    bool noalign;
> +  enum machine_mode move_mode = VOIDmode;
> +  int unroll_factor = 1;
>
>    if (CONST_INT_P (align_exp))
>      align = INTVAL (align_exp);
> @@ -22788,50 +22757,60 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
>
>    /* Step 0: Decide on preferred algorithm, desired alignment and
>       size of chunks to be copied by main loop.  */
> -
>    alg = decide_alg (count, expected_size, false, &dynamic_check, &noalign);
> -  desired_align = decide_alignment (align, alg, expected_size);
> -
> -  if (!TARGET_ALIGN_STRINGOPS || noalign)
> -    align = desired_align;
> -
>    if (alg == libcall)
>      return false;
>    gcc_assert (alg != no_stringop);
> +
>    if (!count)
>      count_exp = copy_to_mode_reg (GET_MODE (count_exp), count_exp);
>    destreg = copy_addr_to_reg (XEXP (dst, 0));
>    srcreg = copy_addr_to_reg (XEXP (src, 0));
> +
> +  unroll_factor = 1;
> +  move_mode = word_mode;
>    switch (alg)
>      {
>      case libcall:
>      case no_stringop:
>        gcc_unreachable ();
> +    case loop_1_byte:
> +      need_zero_guard = true;
> +      move_mode = QImode;
> +      break;
>      case loop:
>        need_zero_guard = true;
> -      size_needed = GET_MODE_SIZE (word_mode);
>        break;
>      case unrolled_loop:
>        need_zero_guard = true;
> -      size_needed = GET_MODE_SIZE (word_mode) * (TARGET_64BIT ? 4 : 2);
> +      unroll_factor = (TARGET_64BIT ? 4 : 2);
> +      break;
> +    case vector_loop:
> +      need_zero_guard = true;
> +      unroll_factor = 4;
> +      /* Find the widest supported mode.  */
> +      move_mode = Pmode;
> +      while (optab_handler (mov_optab, GET_MODE_WIDER_MODE (move_mode))
> +            != CODE_FOR_nothing)
> +         move_mode = GET_MODE_WIDER_MODE (move_mode);
>        break;
>      case rep_prefix_8_byte:
> -      size_needed = 8;
> +      move_mode = DImode;
>        break;
>      case rep_prefix_4_byte:
> -      size_needed = 4;
> +      move_mode = SImode;
>        break;
>      case rep_prefix_1_byte:
> -      size_needed = 1;
> -      break;
> -    case loop_1_byte:
> -      need_zero_guard = true;
> -      size_needed = 1;
> +      move_mode = QImode;
>        break;
>      }
> -
> +  size_needed = GET_MODE_SIZE (move_mode) * unroll_factor;
>    epilogue_size_needed = size_needed;
>
> +  desired_align = decide_alignment (align, alg, expected_size, move_mode);
>
> +  desired_align = decide_alignment (align, alg, expected_size, move_mode);
> +  if (!TARGET_ALIGN_STRINGOPS || noalign)
> +    align = desired_align;
> +
>
> For SSE codegen, won't we need to track down in destination was aligned to generate aligned/unaligned moves?
>
> Otherwise the patch seems resonable.  Thanks for submitting it by pieces so the review is easier.
>
> Honza

--
---
Best regards,
Michael V. Zolotukhin,
Software Engineer
Intel Corporation.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]