PR target/30213 (x86 memset is wrong for small block sizes)


Hi,
The testcase manifests a problem with memset expansion for blocks smaller than 4 bytes.
This is caused by the fact that the overall structure of the expanded memset is:

  if (block is not small)
    {
      prologue: Align destination
      main loop: Copy 4-byte chunks
    }
  epilogue: Copy the rest

In order to be able to write 4-byte chunks, the value has to be promoted from 1
byte.  This is cheap for a constant, but means a multiplication by 0x01010101 for
variables, so my intention was to promote constants before the if statement, so
the epilogue can use wide writes, while promoting variables on the THEN edge of
the if statement.  This was necessary to match the performance of the library
implementation for small blocks, especially on P4, where the multiplication is
expensive and so is the equivalent sequence of shifts.

However, the code to handle the promotion is currently written as follows:

  /* Ensure that alignment prologue won't copy past end of block.  */
  if ((size_needed > 1 || (desired_align > 1 && desired_align > align))
      && !count)
    {
      int size = MAX (size_needed - 1, desired_align - align);
      /* To improve performance of small blocks, we jump around the promoting
         code, so we need to use QImode accesses in epilogue.  */
      if (GET_CODE (val_exp) != CONST_INT && size_needed > 1)
	force_loopy_epilogue = true;
      label = gen_label_rtx ();
      emit_cmp_and_jump_insns (count_exp,
			       GEN_INT (size),
			       LEU, 0, GET_MODE (count_exp), 1, label);
      if (expected_size == -1 || expected_size <= size)
	predict_jump (REG_BR_PROB_BASE * 60 / 100);
      else
	predict_jump (REG_BR_PROB_BASE * 20 / 100);
    }
  if (TARGET_64BIT
      && (size_needed > 4 || (desired_align > align && desired_align > 4)))
    promoted_val = promote_duplicated_reg (DImode, val_exp);
  else if (size_needed > 2 || (desired_align > align && desired_align > 2))
    promoted_val = promote_duplicated_reg (SImode, val_exp);
  else if (size_needed > 1 || (desired_align > align && desired_align > 1))
    promoted_val = promote_duplicated_reg (HImode, val_exp);
  else
    promoted_val = val_exp;
  gcc_assert (desired_align >= 1 && align >= 1);
  if ((size_needed > 1 || (desired_align > 1 && desired_align > align))
      && !count && !label)
    {
      ... again the same body as in the first if conditional.
    }

As you can notice, the condition guarding the code before the promotion and
after the promotion is equivalent.  This is the result of some experimenting,
when I removed the GET_CODE (val_exp) == CONST_INT check in the front.  The
code in the epilogue, however, knows that when val_exp is CONST_INT it is free
to use the promoted value, and thus we end up with a partially uninitialized
use of the register.

It seems to me that the code as written, expecting the same behaviour of the
prologue and epilogue conditionals, is fragile, so in addition to fixing this
by putting the promotion code into a function and avoiding the code
duplication, I also changed the rest of the logic so that the epilogue just
uses the values computed in the prologue, making things more difficult to
confuse.

To make the code more readable, I also added new commentary and hopefully made
the code choosing the epilogue size less tricky (the epilogue always copies up
to a power-of-two-sized block; it has to be big enough to cover the main loop
chunk size and the maximal size copied by the prologue).  Only now did I notice
that I had spread the knowledge about how the whole code is organized into
comments before the many helper functions, which does not make the expanders
themselves readable.

I realize that the patch ended up relatively long; if the cleanups would make
reviewing difficult, I can make the simple fix first (adding the missing test
mentioned above) and rewrite the code in question next.

Bootstrapped/regtested on i686-linux; x86_64-linux regtesting in progress.  OK?

	PR target/30213
	* i386.c (expand_setmem_epilogue): Fix formatting.
	(smallest_pow2_greater_than): New function.
	(ix86_expand_movmem): Improve comments; avoid recomputation of
	epilogue size.
	(promote_duplicated_reg_to_size): Break out from ...
	(expand_setmem): ... this one; reorganize promotion code;
	improve comments; avoid recomputation of epilogue size.

Index: config/i386/i386.c
===================================================================
*** config/i386/i386.c	(revision 120019)
--- config/i386/i386.c	(working copy)
*************** static void
*** 13159,13164 ****
--- 13159,13165 ----
  expand_setmem_epilogue (rtx destmem, rtx destptr, rtx value, rtx count, int max_size)
  {
    rtx dest;
+ 
    if (GET_CODE (count) == CONST_INT)
      {
        HOST_WIDE_INT countval = INTVAL (count);
*************** decide_alignment (int align,
*** 13491,13498 ****
    return desired_align;
  }
  
  /* Expand string move (memcpy) operation.  Use i386 string operations when
!    profitable.  expand_clrmem contains similar code.  */
  int
  ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
  		    rtx expected_align_exp, rtx expected_size_exp)
--- 13492,13531 ----
    return desired_align;
  }
  
+ /* Return the smallest power of 2 greater than VAL.  */
+ static int
+ smallest_pow2_greater_than (int val)
+ {
+   int ret = 1;
+   while (ret <= val)
+     ret <<= 1;
+   return ret;
+ }
+ 
  /* Expand string move (memcpy) operation.  Use i386 string operations when
!    profitable.  expand_clrmem contains similar code.  The code depends
!    on the architecture and the epilogue, but always has the same overall
!    structure:
! 
!    1) Prologue guard: Conditional that jumps up to the epilogue for small
!       blocks that can be handled by the epilogue alone.  This is faster but
!       also needed for correctness, since the prologue assumes the block is
!       greater than the desired alignment.
! 
!       An optional dynamic check for size and a libcall for large
!       blocks are emitted here too, with -minline-stringops-dynamically.
! 
!    2) Prologue: copy the first few bytes in order to get the destination
!       aligned to DESIRED_ALIGN.  It is emitted only when ALIGN is smaller
!       than DESIRED_ALIGN and up to DESIRED_ALIGN - ALIGN bytes can be copied.
!       We emit either a jump tree on power-of-two-sized blocks, or a byte loop.
! 
!    3) Main body: the copying loop itself, copying in SIZE_NEEDED chunks
!       with the specified algorithm.
! 
!    4) Epilogue: code copying the tail of the block that is too small to
!       be handled by the main body (or up to the size guarded by the
!       prologue guard).  */
  int
  ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
  		    rtx expected_align_exp, rtx expected_size_exp)
*************** ix86_expand_movmem (rtx dst, rtx src, rt
*** 13505,13511 ****
    HOST_WIDE_INT align = 1;
    unsigned HOST_WIDE_INT count = 0;
    HOST_WIDE_INT expected_size = -1;
!   int size_needed = 0;
    int desired_align = 0;
    enum stringop_alg alg;
    int dynamic_check;
--- 13538,13544 ----
    HOST_WIDE_INT align = 1;
    unsigned HOST_WIDE_INT count = 0;
    HOST_WIDE_INT expected_size = -1;
!   int size_needed = 0, epilogue_size_needed;
    int desired_align = 0;
    enum stringop_alg alg;
    int dynamic_check;
*************** ix86_expand_movmem (rtx dst, rtx src, rt
*** 13519,13527 ****
    if (GET_CODE (count_exp) == CONST_INT)
      count = expected_size = INTVAL (count_exp);
    if (GET_CODE (expected_size_exp) == CONST_INT && count == 0)
!     {
!       expected_size = INTVAL (expected_size_exp);
!     }
  
    alg = decide_alg (count, expected_size, false, &dynamic_check);
    desired_align = decide_alignment (align, alg, expected_size);
--- 13552,13561 ----
    if (GET_CODE (count_exp) == CONST_INT)
      count = expected_size = INTVAL (count_exp);
    if (GET_CODE (expected_size_exp) == CONST_INT && count == 0)
!     expected_size = INTVAL (expected_size_exp);
! 
!   /* Step 0: decide on the preferred algorithm, desired alignment and
!      size of chunks to be copied by the main loop.  */
  
    alg = decide_alg (count, expected_size, false, &dynamic_check);
    desired_align = decide_alignment (align, alg, expected_size);
*************** ix86_expand_movmem (rtx dst, rtx src, rt
*** 13559,13564 ****
--- 13593,13602 ----
        break;
      }
  
+   epilogue_size_needed = size_needed;
+ 
+   /* Step 1: Prologue guard.  */
+ 
    /* Alignment code needs count to be in register.  */
    if (GET_CODE (count_exp) == CONST_INT && desired_align > align)
      {
*************** ix86_expand_movmem (rtx dst, rtx src, rt
*** 13568,13584 ****
        count_exp = force_reg (mode, count_exp);
      }
    gcc_assert (desired_align >= 1 && align >= 1);
    /* Ensure that alignment prologue won't copy past end of block.  */
    if ((size_needed > 1 || (desired_align > 1 && desired_align > align))
        && !count)
      {
!       int size = MAX (size_needed - 1, desired_align - align);
  
        label = gen_label_rtx ();
        emit_cmp_and_jump_insns (count_exp,
! 			       GEN_INT (size),
! 			       LEU, 0, GET_MODE (count_exp), 1, label);
!       if (expected_size == -1 || expected_size < size)
  	predict_jump (REG_BR_PROB_BASE * 60 / 100);
        else
  	predict_jump (REG_BR_PROB_BASE * 20 / 100);
--- 13606,13627 ----
        count_exp = force_reg (mode, count_exp);
      }
    gcc_assert (desired_align >= 1 && align >= 1);
+ 
    /* Ensure that alignment prologue won't copy past end of block.  */
    if ((size_needed > 1 || (desired_align > 1 && desired_align > align))
        && !count)
      {
!       epilogue_size_needed = MAX (size_needed - 1, desired_align - align);
! 
!       /* The epilogue always copies COUNT_EXP & EPILOGUE_SIZE_NEEDED bytes.
! 	 Make sure it is a power of 2.  */
!       epilogue_size_needed = smallest_pow2_greater_than (epilogue_size_needed);
  
        label = gen_label_rtx ();
        emit_cmp_and_jump_insns (count_exp,
! 			       GEN_INT (epilogue_size_needed),
! 			       LTU, 0, GET_MODE (count_exp), 1, label);
!       if (expected_size == -1 || expected_size < epilogue_size_needed)
  	predict_jump (REG_BR_PROB_BASE * 60 / 100);
        else
  	predict_jump (REG_BR_PROB_BASE * 20 / 100);
*************** ix86_expand_movmem (rtx dst, rtx src, rt
*** 13597,13604 ****
        emit_label (hot_label);
      }
  
  
-   /* Alignment prologue.  */
    if (desired_align > align)
      {
        /* Except for the first move in epilogue, we no longer know
--- 13640,13647 ----
        emit_label (hot_label);
      }
  
+   /* Step 2: Alignment prologue.  */
  
    if (desired_align > align)
      {
        /* Except for the first move in epilogue, we no longer know
*************** ix86_expand_movmem (rtx dst, rtx src, rt
*** 13617,13623 ****
        label = NULL;
      }
  
!   /* Main body.  */
    switch (alg)
      {
      case libcall:
--- 13660,13667 ----
        label = NULL;
      }
  
!   /* Step 3: Main loop.  */
! 
    switch (alg)
      {
      case libcall:
*************** ix86_expand_movmem (rtx dst, rtx src, rt
*** 13665,13689 ****
        dst = change_address (dst, BLKmode, destreg);
      }
  
!   /* Epilogue to copy the remaining bytes.  */
    if (label)
      {
!       if (size_needed < desired_align - align)
  	{
  	  tmp =
  	    expand_simple_binop (GET_MODE (count_exp), AND, count_exp,
  				 GEN_INT (size_needed - 1), count_exp, 1,
  				 OPTAB_DIRECT);
- 	  size_needed = desired_align - align + 1;
  	  if (tmp != count_exp)
  	    emit_move_insn (count_exp, tmp);
  	}
        emit_label (label);
        LABEL_NUSES (label) = 1;
      }
!   if (count_exp != const0_rtx && size_needed > 1)
      expand_movmem_epilogue (dst, src, destreg, srcreg, count_exp,
! 			    size_needed);
    if (jump_around_label)
      emit_label (jump_around_label);
    return 1;
--- 13709,13739 ----
        dst = change_address (dst, BLKmode, destreg);
      }
  
!   /* Step 4: Epilogue to copy the remaining bytes.  */
! 
    if (label)
      {
!       /* When the main loop is done, COUNT_EXP might hold the original
! 	 count, while we want to copy only COUNT_EXP & SIZE_NEEDED bytes.
! 	 Epilogue code will actually copy COUNT_EXP & EPILOGUE_SIZE_NEEDED
! 	 bytes.  Compensate if needed.  */
! 	 
!       if (size_needed < epilogue_size_needed)
  	{
  	  tmp =
  	    expand_simple_binop (GET_MODE (count_exp), AND, count_exp,
  				 GEN_INT (size_needed - 1), count_exp, 1,
  				 OPTAB_DIRECT);
  	  if (tmp != count_exp)
  	    emit_move_insn (count_exp, tmp);
  	}
        emit_label (label);
        LABEL_NUSES (label) = 1;
      }
! 
!   if (count_exp != const0_rtx && epilogue_size_needed > 1)
      expand_movmem_epilogue (dst, src, destreg, srcreg, count_exp,
! 			    epilogue_size_needed);
    if (jump_around_label)
      emit_label (jump_around_label);
    return 1;
*************** promote_duplicated_reg (enum machine_mod
*** 13761,13768 ****
      }
  }
  
  /* Expand string clear operation (bzero).  Use i386 string operations when
!    profitable.  expand_movmem contains similar code.  */
  int
  ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
  		    rtx expected_align_exp, rtx expected_size_exp)
--- 13811,13840 ----
      }
  }
  
+ /* Duplicate value VAL using promote_duplicated_reg into the maximal size that
+    will be needed by the main loop copying SIZE_NEEDED chunks and the prologue
+    getting alignment from ALIGN to DESIRED_ALIGN.  */
+ static rtx
+ promote_duplicated_reg_to_size (rtx val, int size_needed, int desired_align, int align)
+ {
+   rtx promoted_val;
+ 
+   if (TARGET_64BIT
+       && (size_needed > 4 || (desired_align > align && desired_align > 4)))
+     promoted_val = promote_duplicated_reg (DImode, val);
+   else if (size_needed > 2 || (desired_align > align && desired_align > 2))
+     promoted_val = promote_duplicated_reg (SImode, val);
+   else if (size_needed > 1 || (desired_align > align && desired_align > 1))
+     promoted_val = promote_duplicated_reg (HImode, val);
+   else
+     promoted_val = val;
+ 
+   return promoted_val;
+ }
+ 
  /* Expand string clear operation (bzero).  Use i386 string operations when
!    profitable.  See the expand_movmem comment for an explanation of the
!    individual steps performed.  */
  int
  ix86_expand_setmem (rtx dst, rtx count_exp, rtx val_exp, rtx align_exp,
  		    rtx expected_align_exp, rtx expected_size_exp)
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 13774,13783 ****
    HOST_WIDE_INT align = 1;
    unsigned HOST_WIDE_INT count = 0;
    HOST_WIDE_INT expected_size = -1;
!   int size_needed = 0;
    int desired_align = 0;
    enum stringop_alg alg;
!   rtx promoted_val = val_exp;
    bool force_loopy_epilogue = false;
    int dynamic_check;
  
--- 13846,13855 ----
    HOST_WIDE_INT align = 1;
    unsigned HOST_WIDE_INT count = 0;
    HOST_WIDE_INT expected_size = -1;
!   int size_needed = 0, epilogue_size_needed;
    int desired_align = 0;
    enum stringop_alg alg;
!   rtx promoted_val = NULL;
    bool force_loopy_epilogue = false;
    int dynamic_check;
  
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 13792,13797 ****
--- 13864,13872 ----
    if (GET_CODE (expected_size_exp) == CONST_INT && count == 0)
      expected_size = INTVAL (expected_size_exp);
  
+   /* Step 0: decide on the preferred algorithm, desired alignment and
+      size of chunks to be copied by the main loop.  */
+ 
    alg = decide_alg (count, expected_size, true, &dynamic_check);
    desired_align = decide_alignment (align, alg, expected_size);
  
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 13826,13831 ****
--- 13901,13910 ----
        size_needed = 1;
        break;
      }
+   epilogue_size_needed = size_needed;
+ 
+   /* Step 1: Prologue guard.  */
+ 
    /* Alignment code needs count to be in register.  */
    if (GET_CODE (count_exp) == CONST_INT && desired_align > align)
      {
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 13834,13853 ****
  	mode = DImode;
        count_exp = force_reg (mode, count_exp);
      }
    /* Ensure that alignment prologue won't copy past end of block.  */
    if ((size_needed > 1 || (desired_align > 1 && desired_align > align))
        && !count)
      {
!       int size = MAX (size_needed - 1, desired_align - align);
!       /* To improve performance of small blocks, we jump around the promoting
!          code, so we need to use QImode accesses in epilogue.  */
!       if (GET_CODE (val_exp) != CONST_INT && size_needed > 1)
! 	force_loopy_epilogue = true;
        label = gen_label_rtx ();
        emit_cmp_and_jump_insns (count_exp,
! 			       GEN_INT (size),
! 			       LEU, 0, GET_MODE (count_exp), 1, label);
!       if (expected_size == -1 || expected_size <= size)
  	predict_jump (REG_BR_PROB_BASE * 60 / 100);
        else
  	predict_jump (REG_BR_PROB_BASE * 20 / 100);
--- 13913,13945 ----
  	mode = DImode;
        count_exp = force_reg (mode, count_exp);
      }
+   /* Do the cheap promotion to allow better CSE across the
+      main loop and epilogue (i.e. one load of the big constant in
+      front of all the code).  */
+   if (GET_CODE (val_exp) == CONST_INT)
+     promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
+ 						   desired_align, align);
    /* Ensure that alignment prologue won't copy past end of block.  */
    if ((size_needed > 1 || (desired_align > 1 && desired_align > align))
        && !count)
      {
!       epilogue_size_needed = MAX (size_needed - 1, desired_align - align);
! 
!       /* The epilogue always copies COUNT_EXP & EPILOGUE_SIZE_NEEDED bytes.
! 	 Make sure it is a power of 2.  */
!       epilogue_size_needed = smallest_pow2_greater_than (epilogue_size_needed);
! 
!       /* To improve performance of small blocks, we jump around the VAL
! 	 promoting code.  This means that if the promoted VAL is not constant,
! 	 we might not use it in the epilogue and have to use the byte
! 	 loop variant.  */
!       if (epilogue_size_needed > 2 && !promoted_val)
!         force_loopy_epilogue = true;
        label = gen_label_rtx ();
        emit_cmp_and_jump_insns (count_exp,
! 			       GEN_INT (epilogue_size_needed),
! 			       LTU, 0, GET_MODE (count_exp), 1, label);
!       if (expected_size == -1 || expected_size <= epilogue_size_needed)
  	predict_jump (REG_BR_PROB_BASE * 60 / 100);
        else
  	predict_jump (REG_BR_PROB_BASE * 20 / 100);
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 13863,13892 ****
        emit_jump (jump_around_label);
        emit_label (hot_label);
      }
!   if (TARGET_64BIT
!       && (size_needed > 4 || (desired_align > align && desired_align > 4)))
!     promoted_val = promote_duplicated_reg (DImode, val_exp);
!   else if (size_needed > 2 || (desired_align > align && desired_align > 2))
!     promoted_val = promote_duplicated_reg (SImode, val_exp);
!   else if (size_needed > 1 || (desired_align > align && desired_align > 1))
!     promoted_val = promote_duplicated_reg (HImode, val_exp);
!   else
!     promoted_val = val_exp;
    gcc_assert (desired_align >= 1 && align >= 1);
-   if ((size_needed > 1 || (desired_align > 1 && desired_align > align))
-       && !count && !label)
-     {
-       int size = MAX (size_needed - 1, desired_align - align);
  
-       label = gen_label_rtx ();
-       emit_cmp_and_jump_insns (count_exp,
- 			       GEN_INT (size),
- 			       LEU, 0, GET_MODE (count_exp), 1, label);
-       if (expected_size == -1 || expected_size <= size)
- 	predict_jump (REG_BR_PROB_BASE * 60 / 100);
-       else
- 	predict_jump (REG_BR_PROB_BASE * 20 / 100);
-     }
    if (desired_align > align)
      {
        /* Except for the first move in epilogue, we no longer know
--- 13955,13969 ----
        emit_jump (jump_around_label);
        emit_label (hot_label);
      }
! 
!   /* Step 2: Alignment prologue.  */
! 
!   /* Do the expensive promotion once we have branched off the small blocks.  */
!   if (!promoted_val)
!     promoted_val = promote_duplicated_reg_to_size (val_exp, size_needed,
! 						   desired_align, align);
    gcc_assert (desired_align >= 1 && align >= 1);
  
    if (desired_align > align)
      {
        /* Except for the first move in epilogue, we no longer know
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 13903,13908 ****
--- 13980,13988 ----
        LABEL_NUSES (label) = 1;
        label = NULL;
      }
+ 
+   /* Step 3: Main loop.  */
+ 
    switch (alg)
      {
      case libcall:
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 13940,13947 ****
--- 14020,14034 ----
    else
      dst = change_address (dst, BLKmode, destreg);
  
+   /* Step 4: Epilogue to copy the remaining bytes.  */
+ 
    if (label)
      {
+       /* When the main loop is done, COUNT_EXP might hold the original
+ 	 count, while we want to copy only COUNT_EXP & SIZE_NEEDED bytes.
+ 	 Epilogue code will actually copy COUNT_EXP & EPILOGUE_SIZE_NEEDED
+ 	 bytes.  Compensate if needed.  */
+ 
        if (size_needed < desired_align - align)
  	{
  	  tmp =
*************** ix86_expand_setmem (rtx dst, rtx count_e
*** 13955,13961 ****
        emit_label (label);
        LABEL_NUSES (label) = 1;
      }
!   if (count_exp != const0_rtx && size_needed > 1)
      {
        if (force_loopy_epilogue)
  	expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp,
--- 14042,14048 ----
        emit_label (label);
        LABEL_NUSES (label) = 1;
      }
!   if (count_exp != const0_rtx && epilogue_size_needed > 1)
      {
        if (force_loopy_epilogue)
  	expand_setmem_epilogue_via_loop (dst, destreg, val_exp, count_exp,

