This is the mail archive of the mailing list for the GCC project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[committed] Fix gcc.dg/memcpy-1.c for MIPS

memcpy-1.c was failing for 32-bit MIPS targets because SRA thought that
it was better to use movmemsi to implement a 3-int copy, rather than
decompose the 3-int copy into 3 individual copies.  This in turn is
caused by the definitions of MOVE_RATIO and MOVE_BY_PIECES_P:

  - The default MOVE_RATIO is 2 for movmem targets.  mips.h preserves
    this default, only overriding it when movmemsi is not available.

  - The default MOVE_BY_PIECES_P returns false for moves that need
    MOVE_RATIO moves or more.

This is all good at the RTL level, in that sense that movmemsi should
produce code that is as good as or better than move_by_pieces.  However,
there are some sizes and alignments for which movmemsi is effectively
"by pieces" itself.  Although movmemsi's rendering of such moves is
sometimes better than move_by_pieces's rendering, we should still
allow the moves to be decomposed at the tree level, as it often leads
to other optimisations.

The patch below therefore increases MOVE_RATIO to one less than the
smallest number of moves for which movmemsi might use a loop.  It also
makes MOVE_BY_PIECES_P return true at the tree level for moves that are
always "by pieces".

Unfortunately, I'm not able to do performance testing at the moment
(I'm hoping that will change soon).  I therefore used CSiBE at -O2 to
get a sense for that benchmark's code is affected.  The sizes themselves
aren't interesting for -O2; I just wanted to see which files changed.
The results for mipsisa64-elfoabi-gcc -mips32r2 were:

cg_compiler_opensrc:constfold                     5736     5704 :   99.44%
flex:main                                        22024    22016 :   99.96%
jikespg:src/resolve                              15676    15820 :  100.92%
jpeg:jchuff                                       6520     6504 :   99.75%
libpng:pngrtran                                  26044    26028 :   99.94%
linux:net/ipv4/icmp                               5388     5404 :  100.30%
linux:net/ipv4/arp                                6292     6308 :  100.25%
linux:arch/testplatform/mm/init                   3128     3096 :   98.98%
linux:arch/testplatform/kernel/signal             3512     3480 :   99.09%
teem:src/bane/gkmsHvol                            3480     3416 :   98.16%
teem:src/ten/epireg                              17704    17688 :   99.91%
teem:src/ten/chan                                 9916     9868 :   99.52%
Total                                          3893813  3893725 :  100.00%

and the results for mipsisa64-elfoabi-gcc -mips64 were:

cg_compiler_opensrc:symbols                       9328     9320 :   99.91%
libpng:pngrtran                                  27440    27424 :   99.94%
replaypc:ndx-dump                                 4704     4608 :   97.96%
Total                                          3941769  3941649 :  100.00%

All the size reductions were real code improvements.  They were caused
by a combination of gimplification differences and more use of SRA.

The jikespg:src/resolve change was caused by an extra function being
inlined, and from my reading, the new code was an improvement.
We previously assumed that very small structure copies had a fixed
cost of 4, whereas now we give them a smaller and more accurate cost.

The two linux/net changes were caused by cases in which the following
function was not being inlined:

/* Deprecated: use ip_route_output_key directly */
static inline int ip_route_output(struct rtable **rp,
				      u32 daddr, u32 saddr, u32 tos, int oif)
	struct rt_key key = { dst:daddr, src:saddr, oif:oif, tos:tos };

	return ip_route_output_key(rp, &key);

rt_key occupies 5 words, so the change in MOVE_RATIO increased the cost
of key's initialisation from 4 to 5.  The inliner then thought that calls
to ip_route_output were less costly than inlining them.  This isn't true,
so these two files do get worse.  I think the problem is that we
underestimate the cost of the ip_route_output call: we don't take into
account that OIF is passed on the stack, which makes the fifth argument
significantly more costly than the first four, both on the caller's and
the callee's side.  As I understand it, we also don't take into account
the fact that a function that needs a frame for addressable variables
has more prologue and epilogue code than one that might not need a frame
(which is hard to predict, of course).

The inlining heuristics are very delicate, and I'm not convinced that
it's easy to fix this case without regressing others.  Since the patch
is a win for the other cases, I went ahead with it anyway.  Only luck
led to the better linux code: it never really made sense to use a memcpy
cost for constructors like key's, where the total number of fields is
small and the explicitly-initialised fields are all variables.

Regression-tested on mipsisa32-elf, mipsisa64-elfoabi and mips-linux-gnu.
Applied to mainline.


	* config/mips/mips.h (MOVE_MAX): Use UNITS_PER_WORD and describe
	MIPS-specific implementation details.
	for targets with movmemsi.
	(MOVE_BY_PIECES_P): Define.
	* config/mips/mips.c (MAX_MOVE_REGS, MAX_MOVE_BYTES): Delete.
	(mips_block_move_loop): Add a bytes_per_iter argument.
	(mips_expand_block_move): Use MIPS_MAX_MOVE_BYTES_STRAIGHT.
	Update call to mips_block_move_loop.

Index: gcc/config/mips/mips.h
--- gcc/config/mips/mips.h	2007-10-24 14:28:39.000000000 +0100
+++ gcc/config/mips/mips.h	2007-10-24 18:16:38.000000000 +0100
@@ -2338,9 +2338,10 @@ #define CASE_VECTOR_PC_RELATIVE TARGET_M
-/* Max number of bytes we can move from memory to memory
-   in one reasonably fast instruction.  */
-#define MOVE_MAX (TARGET_64BIT ? 8 : 4)
+/* Although LDC1 and SDC1 provide 64-bit moves on 32-bit targets,
+   we generally don't want to use them for copying arbitrary data.
+   A single N-word move is usually the same cost as N single-word moves.  */
 #define MAX_MOVE_MAX 8
 /* Define this macro as a C expression which is nonzero if
@@ -2769,6 +2770,18 @@ #define SIZE_TYPE (POINTER_SIZE == 64 ? 
 #define PTRDIFF_TYPE (POINTER_SIZE == 64 ? "long int" : "int")
+/* The maximum number of bytes that can be copied by one iteration of
+   a movmemsi loop; see mips_block_move_loop.  */
+/* The maximum number of bytes that can be copied by a straight-line
+   implementation of movmemsi; see mips_block_move_straight.  We want
+   to make sure that any loop-based implementation will iterate at
+   least twice.  */
 /* The base cost of a memcpy call, for MOVE_RATIO and friends.  These
    values were determined experimentally by benchmarking with CSiBE.
    In theory, the call overhead is higher for TARGET_ABICALLS (especially
@@ -2778,23 +2791,39 @@ #define PTRDIFF_TYPE (POINTER_SIZE == 64
 #define MIPS_CALL_RATIO 8
-/* Define MOVE_RATIO to encourage use of movmemsi when enabled,
-   since it should always generate code at least as good as
-   move_by_pieces().  But when inline movmemsi pattern is disabled
-   (i.e., with -mips16 or -mmemcpy), instead use a value approximating
-   the length of a memcpy call sequence, so that move_by_pieces will
-   generate inline code if it is shorter than a function call.
-   Since move_by_pieces_ninsns() counts memory-to-memory moves, but
-   we'll have to generate a load/store pair for each, halve the value of 
-   MIPS_CALL_RATIO to take that into account.
-   The default value for MOVE_RATIO when HAVE_movmemsi is true is 2.
-   There is no point to setting it to less than this to try to disable
-   move_by_pieces entirely, because that also disables some desirable 
-   tree-level optimizations, specifically related to optimizing a
-   one-byte string copy into a simple move byte operation.  */
-#define MOVE_RATIO \
+/* Any loop-based implementation of movmemsi will have at least
+   moves, so allow individual copies of fewer elements.
+   When movmemsi is not available, use a value approximating
+   the length of a memcpy call sequence, so that move_by_pieces
+   will generate inline code if it is shorter than a function call.
+   Since move_by_pieces_ninsns counts memory-to-memory moves, but
+   we'll have to generate a load/store pair for each, halve the
+   value of MIPS_CALL_RATIO to take that into account.  */
+#define MOVE_RATIO					\
+  (HAVE_movmemsi					\
+   : MIPS_CALL_RATIO / 2)
+/* movmemsi is meant to generate code that is at least as good as
+   move_by_pieces.  However, movmemsi effectively uses a by-pieces
+   implementation both for moves smaller than a word and for word-aligned
+   moves of no more than MIPS_MAX_MOVE_BYTES_STRAIGHT bytes.  We should
+   allow the tree-level optimisers to do such moves by pieces, as it
+   often exposes other optimization opportunities.  We might as well
+   continue to use movmemsi at the rtl level though, as it produces
+   better code when scheduling is disabled (such as at -O).  */
+  (HAVE_movmemsi						\
+   ? (!currently_expanding_to_rtl				\
+      && ((ALIGN) < BITS_PER_WORD				\
+	  ? (SIZE) < UNITS_PER_WORD				\
+   : (move_by_pieces_ninsns (SIZE, ALIGN, MOVE_MAX_PIECES + 1)	\
+      < (unsigned int) MOVE_RATIO))
 /* For CLEAR_RATIO, when optimizing for size, give a better estimate
    of the length of a memset call, but use the default otherwise.  */
Index: gcc/config/mips/mips.c
--- gcc/config/mips/mips.c	2007-10-24 14:28:39.000000000 +0100
+++ gcc/config/mips/mips.c	2007-10-24 14:28:42.000000000 +0100
@@ -5622,9 +5622,6 @@ mips_expand_fcc_reload (rtx dest, rtx sr
   emit_insn (gen_slt_sf (dest, fp2, fp1));
-#define MAX_MOVE_REGS 4
 /* Emit straight-line code to move LENGTH bytes from SRC to DEST.
    Assume that the areas do not overlap.  */
@@ -5710,22 +5707,23 @@ mips_adjust_block_mem (rtx mem, HOST_WID
   set_mem_align (*loop_mem, MIN (MEM_ALIGN (mem), length * BITS_PER_UNIT));
-/* Move LENGTH bytes from SRC to DEST using a loop that moves MAX_MOVE_BYTES
-   per iteration.  LENGTH must be at least MAX_MOVE_BYTES.  Assume that the
-   memory regions do not overlap.  */
+/* Move LENGTH bytes from SRC to DEST using a loop that moves BYTES_PER_ITER
+   bytes at a time.  LENGTH must be at least BYTES_PER_ITER.  Assume that
+   the memory regions do not overlap.  */
 static void
-mips_block_move_loop (rtx dest, rtx src, HOST_WIDE_INT length)
+mips_block_move_loop (rtx dest, rtx src, HOST_WIDE_INT length,
+		      HOST_WIDE_INT bytes_per_iter)
   rtx label, src_reg, dest_reg, final_src;
   HOST_WIDE_INT leftover;
-  leftover = length % MAX_MOVE_BYTES;
+  leftover = length % bytes_per_iter;
   length -= leftover;
   /* Create registers and memory references for use within the loop.  */
-  mips_adjust_block_mem (src, MAX_MOVE_BYTES, &src_reg, &src);
-  mips_adjust_block_mem (dest, MAX_MOVE_BYTES, &dest_reg, &dest);
+  mips_adjust_block_mem (src, bytes_per_iter, &src_reg, &src);
+  mips_adjust_block_mem (dest, bytes_per_iter, &dest_reg, &dest);
   /* Calculate the value that SRC_REG should have after the last iteration
      of the loop.  */
@@ -5737,11 +5735,11 @@ mips_block_move_loop (rtx dest, rtx src,
   emit_label (label);
   /* Emit the loop body.  */
-  mips_block_move_straight (dest, src, MAX_MOVE_BYTES);
+  mips_block_move_straight (dest, src, bytes_per_iter);
   /* Move on to the next block.  */
-  mips_emit_move (src_reg, plus_constant (src_reg, MAX_MOVE_BYTES));
-  mips_emit_move (dest_reg, plus_constant (dest_reg, MAX_MOVE_BYTES));
+  mips_emit_move (src_reg, plus_constant (src_reg, bytes_per_iter));
+  mips_emit_move (dest_reg, plus_constant (dest_reg, bytes_per_iter));
   /* Emit the loop condition.  */
   if (Pmode == DImode)
@@ -5763,14 +5761,15 @@ mips_expand_block_move (rtx dest, rtx sr
   if (GET_CODE (length) == CONST_INT)
-      if (INTVAL (length) <= 2 * MAX_MOVE_BYTES)
 	  mips_block_move_straight (dest, src, INTVAL (length));
 	  return true;
       else if (optimize)
-	  mips_block_move_loop (dest, src, INTVAL (length));
+	  mips_block_move_loop (dest, src, INTVAL (length),
 	  return true;

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]