This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Re: Use of vector instructions in memmov/memset expanding


diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 2c53423..6ce240a 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -561,10 +561,14 @@ struct processor_costs ix86_size_cost = {/* costs for tuning for size */
   COSTS_N_BYTES (2),			/* cost of FABS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FCHS instruction.  */
   COSTS_N_BYTES (2),			/* cost of FSQRT instruction.  */
-  {{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},
+  {{{rep_prefix_1_byte, {{-1, rep_prefix_1_byte}}},

I do think we will want sse_unrolled_loop (and perhaps sse_loop, I don't know)
algorithms added to enable the SSE codegen: there are chips that do support SSE
but internally split the wide moves into word-sized chunks, so the benefits
are minimal and the setup costs won't pay back.  So we do not want to default to
SSE codegen whenever possible, just when it is expected to pay back.
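
Roughly something like this (a sketch only; the two new enumerator names are
just suggestions, the rest should be the existing enum from i386.h):

enum stringop_alg
{
  no_stringop,
  libcall,
  rep_prefix_1_byte,
  rep_prefix_4_byte,
  rep_prefix_8_byte,
  loop_1_byte,
  loop,
  unrolled_loop,
  sse_loop,           /* suggested: rolled loop using SSE moves */
  sse_unrolled_loop   /* suggested: unrolled loop using SSE moves */
};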

I wonder what we will want to do then with -mno-sse (e.g. for the Linux kernel, where
one cannot implicitly use SSE).  Probably making the sse_unrolled_loop->unrolled_loop
transition is easiest for this quite rare case, even if some other algorithm may
turn out to be superior.
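
I.e. something along these lines in the algorithm selection (just a sketch; the
surrounding variable names may differ in your patch):

  /* Sketch: when SSE is not available (e.g. -mno-sse), degrade the SSE
     variants to their integer counterparts instead of refusing them.  */
  if (!TARGET_SSE)
    {
      if (alg == sse_unrolled_loop)
	alg = unrolled_loop;
      else if (alg == sse_loop)
	alg = loop;
    }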

+  if (align <= 8 && desired_alignment > 8)
+    {
+      rtx label = ix86_expand_aligntest (destptr, 8, false);
+      destmem = adjust_automodify_address_nv (destmem, SImode, destptr, 0);
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      emit_insn (gen_strset (destptr, destmem, gen_lowpart (SImode, value)));
+      ix86_adjust_counter (count, 8);
+      emit_label (label);
+      LABEL_NUSES (label) = 1;
+    }
+  gcc_assert (desired_alignment <= 16);

No vector move because the promoted vector value is not computed yet?  (It would
make sense to bypass it to keep the hot path for small blocks SSE-free.)
     case unrolled_loop:
       need_zero_guard = true;
-      size_needed = GET_MODE_SIZE (Pmode) * (TARGET_64BIT ? 4 : 2);
+      /* Use SSE instructions, if possible.  */
+      move_mode = TARGET_SSE ? (align_unknown ? DImode : V4SImode) : Pmode;
+      unroll_factor = 4;

Where did the logic dropping the unroll factor to 2 for 32-bit integer loops go?  This is
important, otherwise we starve the register allocator.
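
What I would expect to see after move_mode is chosen is something like the
following (a sketch matching the old size_needed = GET_MODE_SIZE (Pmode)
* (TARGET_64BIT ? 4 : 2) behaviour):

  /* Sketch: keep the old behaviour for integer-register loops - unroll by 4
     only on 64-bit, by 2 on 32-bit, so the RA is not starved; the SSE path
     can keep the higher factor.  */
  if (move_mode == Pmode)
    unroll_factor = TARGET_64BIT ? 4 : 2;
  else
    unroll_factor = 4;
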
@@ -21897,9 +22254,41 @@ ix86_expand_movmem (rtx dst, rtx src, rtx count_exp, rtx align_exp,
       LABEL_NUSES (label) = 1;
     }
 
+  /* We haven't updated addresses, so we'll do it now.
+     Also, if the epilogue seems to be big, we'll generate a loop (not
+     unrolled) in it.  We'll do it only if alignment is unknown, because in
+     this case in epilogue we have to perform memmove by bytes, which is very
+     slow.  */

The unrolled epilogue does moves that are at most one byte wide, while the rolled one
handles at most 4*16 bytes.  Given that the odds are that the blocks are small, are you
sure this is not causing performance problems?
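
For concreteness: with the V4SImode/unroll-by-4 path above, size_needed is
4*16 = 64 bytes, so with unknown alignment the epilogue may be left with up to
63 bytes to copy.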

@@ -21983,11 +22402,21 @@ promote_duplicated_reg (enum machine_mode mode, rtx val)
 static rtx
 promote_duplicated_reg_to_size (rtx val, int size_needed, int desired_align, int align)
 {
-  rtx promoted_val;
+  rtx promoted_val = NULL_RTX;
 
-  if (TARGET_64BIT
-      && (size_needed > 4 || (desired_align > align && desired_align > 4)))
-    promoted_val = promote_duplicated_reg (DImode, val);
+  if (size_needed > 8 || (desired_align > align && desired_align > 8))
+    {
+      gcc_assert (TARGET_SSE);
+      if (TARGET_64BIT)
+	promoted_val = promote_duplicated_reg (V2DImode, val);
+      else
+	promoted_val = promote_duplicated_reg (V4SImode, val);

Hmm, it would seem more natural to turn this into V8QImode, since it is really a
vector of 4 duplicated bytes.  This will avoid some TARGET_64BIT tests below.
Also, AMD chips are very slow on integer->SSE moves.  How does the final promotion
sequence look there?
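
To illustrate what I mean by the promotion (a user-level sketch only, not the
RTL the expander actually emits):

/* Replicate one byte across a wide register, the two ways discussed.  */
#include <stdint.h>
#include <emmintrin.h>

static uint64_t
promote_byte_di (uint8_t v)
{
  /* Integer path: 8 copies of v via multiplication.  */
  return (uint64_t) v * 0x0101010101010101ULL;
}

static __m128i
promote_byte_v16qi (uint8_t v)
{
  /* SSE path: 16 copies of v; this typically involves an integer->SSE
     move, which is what worries me on AMD chips.  */
  return _mm_set1_epi8 ((char) v);
}
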
diff --git a/gcc/cse.c b/gcc/cse.c
index ae67685..3b6471d 100644
--- a/gcc/cse.c
+++ b/gcc/cse.c
@@ -4616,7 +4616,10 @@ cse_insn (rtx insn)
 		 to fold switch statements when an ADDR_DIFF_VEC is used.  */
 	      || (GET_CODE (src_folded) == MINUS
 		  && GET_CODE (XEXP (src_folded, 0)) == LABEL_REF
-		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF)))
+		  && GET_CODE (XEXP (src_folded, 1)) == LABEL_REF))
+	      /* Don't propagate vector-constants, as for now no architecture
+		 supports vector immediates.  */
+	  && !vector_extensions_used_for_mode (mode))

This seems quite dubious.  The instruction pattern representing the move should
refuse the constants via its condition or predicates.
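
I.e. I would rather expect the backend to handle it, roughly like this (a
sketch; whether standard_sse_constant_p is the right check here would need
verifying):

  /* Sketch of handling it in the i386 move expander instead of in CSE:
     force vector constants we cannot encode into a register, so CSE never
     produces an invalid immediate operand.  */
  if (GET_CODE (operands[1]) == CONST_VECTOR
      && !standard_sse_constant_p (operands[1]))
    operands[1] = force_reg (mode, operands[1]);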

The i386-specific bits seem quite good to me now and ready for mainline with
the above comments addressed.  It may make your life easier if you tried to
make a version that does not need any of the middle-end changes - I think it is
possible; the middle-end changes are mostly about expanding memcpy/memset without
loops at all.  This can be handled incrementally after your current patch gets
to mainline.

Honza

