RFA: DSE replace_read revisited
Richard Sandiford
rsandifo@nildram.co.uk
Sun Oct 28 11:54:00 GMT 2007
A few weeks ago, I posted a patch to fix some problems in the
new RTL DSE replace_read code:
http://gcc.gnu.org/ml/gcc-patches/2007-09/msg01316.html
However, I found that the testcase added there still fails for some
MIPS targets, and investigating that led me to some wrong-code and
pessimisation issues too. There seem to be several problems:
(1) As I complained about in the message above, find_shift_sequence
does the shift in the mode class of the stored value, rather than
MODE_INT. So if we have a V4HI store and a V2HI read (say), we'd
use a vector shift rather than an integer shift. We'd therefore
shift the individual HI elements right, rather than shift the
whole bit pattern.
(2) When reading the low end of a stored value, the original code used
gen_lowpart to do the mode punning:
emit_move_insn (read_reg, gen_lowpart (read_mode, store_info->rhs));
This ICEd when a DFmode store was followed by an SFmode move, as in:
union u { double d; float f[2]; };
float
foo (union u *u, double d)
{
u->d = d;
return u->f[1];
}
for big-endian targets, because SFmode subregs of DFmode values
are not allowed. My attempt to fix the integer cases for MIPS
meant that we'd now convert the DFmode value into SFmode instead,
which is of course also wrong.
(Although this particular example isn't very realistic,
the equivalent TFmode/DFmode one matters for targets with
IEEE long doubles.)
(3) The code triggers for cases where it is a pessimisation.
E.g. on hard-float MIPS targets, if a DFmode value is stored in
an FPR (as we hope most DFmode values are), we cannot directly
treat the low part as an SFmode value. The sequence generated
for a lowpart would involve a move into and then out of integer
registers, and it would have been better to keep the original load.
However, on soft-float MIPS targets, replacing an SFmode load
of a DFmode store is the right thing to do.
(4) The code requires the store and read to have the same mode class:
if (GET_MODE_CLASS (read_mode) != GET_MODE_CLASS (store_mode))
return false;
This seems unnecessarily restrictive. I guess it was meant to
be a simplifying assumption, but as per (1) and (2) above,
it isn't really.
(5) The shift case can't deal with floats, even though it would be
profitable to do so on some soft-float targets.
(6) The code only deals with REGs:
if (store_info->is_set
/* No place to keep the value after ra. */
&& !reload_completed
/* The careful reviewer may wish to comment my checking that the
rhs of a store is always a reg. */
&& REG_P (SET_SRC (body))
/* Sometimes the store and reload is used for truncation and
rounding. */
&& !(FLOAT_MODE_P (GET_MODE (mem)) && (flag_float_store)))
This does indeed seem unnecessarily restrictive, and is the cause of
the dse-1.c failures that led me here originally. The code could
easily handle subregs and constants as well.
Addressing (1)-(5) together, I think part of the problem is that the
code has too many special exceptions. From a high-level perspective,
I think we need two things:
(a) a way of telling whether a mode change is profitable
(b) a way of extracting the low bits of a value of one arbitrary mode
into a value of another arbitrary mode
(a) we already have in the form of MODES_TIEABLE_P. So rather than
check whether the two modes are the same class, I think we should check
whether each individual mode change we make is profitable accordingly to
MODES_TIEABLE_P.
(b) we don't have (AFAIK), so I added a new routine called extract_low_bits.
Quoting the function's comment:
Try to read the low bits of SRC as an rvalue of mode MODE, preserving
the bit pattern. SRC_MODE is the mode of SRC; if this is smaller than
MODE, fill the upper bits with zeros. Fail if the layout of either
mode is unknown (as for CC modes) or if the extraction would involve
unprofitable mode punning. Return the value on success, otherwise
return null.
This is different from gen_lowpart* in these respects:
- the returned value must always be considered an rvalue
- when MODE is wider than SRC_MODE, the extraction involves
a zero extension
- when MODE is smaller than SRC_MODE, the extraction involves
a truncation (and is thus subject to TRULY_NOOP_TRUNCATION).
In other words, this routine performs a computation, whereas the
gen_lowpart* routines are conceptually lvalue or rvalue subreg
operations.
We don't need the extension case for DSE -- MODE is never bigger than
SRC_MODE -- but it falls out of the wash. We might one day need to do
something like this even if the extraction _would_ involve unprofitable
mode punning; not in DSE, but elsewhere. If that happens, we can always
add a parameter to control the check.
The patch below therefore:
- allows the two modes to have different classes, fixing (4)
- allows the stored value to be a SUBREG or a constant, fixing (6)
- always does the shift in MODE_INT, fixing (1)
- makes find_shift_sequence iterate on the shift mode rather than
access_size, which is more efficient and IMO a little clearer
- uses MODES_TIEABLE_P to decide whether it is profitable to convert
the stored value into the shift mode
- uses gen_simplify_subreg to convert the stored value into the shift
mode; this is a paradoxical subreg operation, and is allowed to
fail if the mode change is not valid
- uses extract_low_bits to convert the shifted value into the read value
- uses extract_low_bits to convert the store value into the read value
when no shift is needed, fixing (2) and (3)
- removes:
if (access_size > UNITS_PER_WORD || FLOAT_MODE_P (store_mode))
return false;
the first check is redundant with find_shift_sequence and removing
the second fixes (4)
- computes the nonshift sequence as well as the shift sequence
before trying the replacement, since the nonshift sequence
can now fail as well.
FWIW, I tested the patch on CSiBE for mipsisa64-elf at -Os.
The results were:
jikespg:src/spacetab 21540 21532 : 99.96%
jikespg:src/produce 12664 12640 : 99.81%
jikespg:src/mkfirst 17024 17012 : 99.93%
jikespg:src/remsp 9036 9024 : 99.87%
libmspack:test/md5 4812 4808 : 99.92%
linux:kernel/sched 5992 5988 : 99.93%
lwip:src/core/tcp_output 3284 3272 : 99.63%
OpenTCP:tcp 7230 7242 : 100.17%
teem:src/limn/cam 3780 3772 : 99.79%
teem:src/dye/methodsDye 2844 2820 : 99.16%
teem:src/gage/sclanswer 4692 4684 : 99.83%
unrarlib:unrarlib/unrarlib 15672 15676 : 100.03%
zlib:deflate 8460 8464 : 100.05%
----------------------------------------------------------------
Total 3605989 3605893 : 100.00%
The three increases are due to cases in which we convert a sign-extending
SImode load into a register extension; the former is a single instruction
whereas the latter takes two. This is something that DSE could do for
REG arguments even before the patch, and it might suggest that we need
to check whether the new instruction is more expensive than the old.
However, the MIPS port gives the two instructions the same cost anyway,
even for -Os, so it wouldn't make any difference as things stand.
I'm afraid that giving a load or store a cost of 1 instruction will lead
to some silly optimisation decisions. It might be worthwhile if we've
being very aggressive about optimising for size (as is usually the case
for MIPS16) but it seems dangerous otherwise. I'm therefore not too
worried about the three extra instructions, given that the net saving
is much higher.
The TCP increase is 3 instructions rather than 1 because of a missed
SIGN_EXTEND optimisation in simplify-rtx.c. I have a separate patch
for that, but I'm not convinced it's stage 3 material.
Bootstrapped & regression-tested on x86_64-linux-gnu. Also
regression-tested on mipsisa64-elf and mipsisa32-elf, fixing
dse-1.c for the former. I've added a case that shows why
handling constants is worthwhile. OK to install?
Richard
gcc/
* Makefile.in (dse.o): Depend on $(TM_P_H).
* optabs.h (extract_low_bits): Declare.
* optabs.c (extract_low_bits): New function.
* rtlhooks.c (gen_lowpart_general): Generalize SUBREG handling.
* dse.c: Include tm_p.h.
(find_shift_sequence): Remove the read_reg argument and return the
read value. Emit the instructions instead of returning them.
Iterate on new_mode rather than calculating it each time.
Check MODES_TIEABLE_P. Use simplify_gen_subreg to convert the
source to NEW_MODE and extract_low_bits to convert the shifted
value to READ_MODE.
(replace_read): Allow the load and store to have different mode
classes. Use extract_low_bits when SHIFT == 0. Create the shift
or extraction instructions before trying the replacement. Update
dump-file code accordingly, avoiding use of REGNO (store_info->rhs).
gcc/testsuite/
* gcc.target/mips/dse-1.c: Add checks for zeros.
Index: gcc/Makefile.in
===================================================================
--- gcc/Makefile.in 2007-10-27 09:25:22.000000000 +0100
+++ gcc/Makefile.in 2007-10-27 09:34:07.000000000 +0100
@@ -2537,8 +2537,8 @@ dce.o : dce.c $(CONFIG_H) $(SYSTEM_H) co
$(TREE_H) $(REGS_H) hard-reg-set.h $(FLAGS_H) $(DF_H) cselib.h \
$(DBGCNT_H) dce.h timevar.h tree-pass.h $(DBGCNT_H)
dse.o : dse.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) \
- $(TREE_H) $(REGS_H) hard-reg-set.h $(FLAGS_H) insn-config.h $(RECOG_H) \
- $(EXPR_H) $(DF_H) cselib.h $(DBGCNT_H) timevar.h tree-pass.h \
+ $(TREE_H) $(TM_P_H) $(REGS_H) hard-reg-set.h $(FLAGS_H) insn-config.h \
+ $(RECOG_H) $(EXPR_H) $(DF_H) cselib.h $(DBGCNT_H) timevar.h tree-pass.h \
alloc-pool.h $(ALIAS_H) dse.h
fwprop.o : fwprop.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) \
toplev.h insn-config.h $(RECOG_H) $(FLAGS_H) $(OBSTACK_H) $(BASIC_BLOCK_H) \
Index: gcc/optabs.h
===================================================================
--- gcc/optabs.h 2007-10-27 09:25:22.000000000 +0100
+++ gcc/optabs.h 2007-10-27 09:34:07.000000000 +0100
@@ -770,6 +770,8 @@ extern rtx expand_vec_cond_expr (tree, r
/* Generate code for VEC_LSHIFT_EXPR and VEC_RSHIFT_EXPR. */
extern rtx expand_vec_shift_expr (tree, rtx);
+extern rtx extract_low_bits (enum machine_mode, enum machine_mode, rtx);
+
#define optab_handler(optab,mode) (&(optab)->handlers[(int) (mode)])
#define convert_optab_handler(optab,mode,mode2) \
(&(optab)->handlers[(int) (mode)][(int) (mode2)])
Index: gcc/optabs.c
===================================================================
--- gcc/optabs.c 2007-10-27 09:25:22.000000000 +0100
+++ gcc/optabs.c 2007-10-28 09:26:43.000000000 +0000
@@ -4984,6 +4984,58 @@ gen_move_insn (rtx x, rtx y)
end_sequence ();
return seq;
}
+
+/* Try to read the low bits of SRC as an rvalue of mode MODE, preserving
+ the bit pattern. SRC_MODE is the mode of SRC; if this is smaller than
+ MODE, fill the upper bits with zeros. Fail if the layout of either
+ mode is unknown (as for CC modes) or if the extraction would involve
+ unprofitable mode punning. Return the value on success, otherwise
+ return null.
+
+ This is different from gen_lowpart* in these respects:
+
+ - the returned value must always be considered an rvalue
+
+ - when MODE is wider than SRC_MODE, the extraction involves
+ a zero extension
+
+ - when MODE is smaller than SRC_MODE, the extraction involves
+ a truncation (and is thus subject to TRULY_NOOP_TRUNCATION).
+
+ In other words, this routine performs a computation, whereas the
+ gen_lowpart* routines are conceptually lvalue or rvalue subreg
+ operations. */
+
+rtx
+extract_low_bits (enum machine_mode mode, enum machine_mode src_mode, rtx src)
+{
+ enum machine_mode int_mode, src_int_mode;
+
+ if (mode == src_mode)
+ return src;
+
+ if (CONSTANT_P (src))
+ return simplify_gen_subreg (mode, src, src_mode,
+ subreg_lowpart_offset (mode, src_mode));
+
+ if (GET_MODE_CLASS (mode) == MODE_CC || GET_MODE_CLASS (src_mode) == MODE_CC)
+ return NULL_RTX;
+
+ src_int_mode = int_mode_for_mode (src_mode);
+ int_mode = int_mode_for_mode (mode);
+ if (src_int_mode == BLKmode || int_mode == BLKmode)
+ return NULL_RTX;
+
+ if (!MODES_TIEABLE_P (src_int_mode, src_mode))
+ return NULL_RTX;
+ if (!MODES_TIEABLE_P (int_mode, mode))
+ return NULL_RTX;
+
+ src = gen_lowpart (src_int_mode, src);
+ src = convert_modes (int_mode, src_int_mode, src, true);
+ src = gen_lowpart (mode, src);
+ return src;
+}
/* Return the insn code used to extend FROM_MODE to TO_MODE.
UNSIGNEDP specifies zero-extension instead of sign-extension. If
Index: gcc/rtlhooks.c
===================================================================
--- gcc/rtlhooks.c 2007-10-27 09:25:22.000000000 +0100
+++ gcc/rtlhooks.c 2007-10-27 09:34:07.000000000 +0100
@@ -43,11 +43,9 @@ gen_lowpart_general (enum machine_mode m
if (result)
return result;
- /* If it's a REG, it must be a hard reg that's not valid in MODE. */
- else if (REG_P (x)
- /* Or we could have a subreg of a floating point value. */
- || (GET_CODE (x) == SUBREG
- && FLOAT_MODE_P (GET_MODE (SUBREG_REG (x)))))
+ /* Handle SUBREGs and hard REGs that were rejected by
+ simplify_gen_subreg. */
+ else if (REG_P (x) || GET_CODE (x) == SUBREG)
{
result = gen_lowpart_common (mode, copy_to_reg (x));
gcc_assert (result != 0);
Index: gcc/dse.c
===================================================================
--- gcc/dse.c 2007-10-27 09:25:22.000000000 +0100
+++ gcc/dse.c 2007-10-27 09:34:07.000000000 +0100
@@ -29,6 +29,7 @@ Software Foundation; either version 3, o
#include "tm.h"
#include "rtl.h"
#include "tree.h"
+#include "tm_p.h"
#include "regs.h"
#include "hard-reg-set.h"
#include "flags.h"
@@ -1381,9 +1382,9 @@ record_store (rtx body, bb_info_t bb_inf
if (store_info->is_set
/* No place to keep the value after ra. */
&& !reload_completed
- /* The careful reviewer may wish to comment my checking that the
- rhs of a store is always a reg. */
- && REG_P (SET_SRC (body))
+ && (REG_P (SET_SRC (body))
+ || GET_CODE (SET_SRC (body)) == SUBREG
+ || CONSTANT_P (SET_SRC (body)))
/* Sometimes the store and reload is used for truncation and
rounding. */
&& !(FLOAT_MODE_P (GET_MODE (mem)) && (flag_float_store)))
@@ -1416,15 +1417,15 @@ dump_insn_info (const char * start, insn
shift. */
static rtx
-find_shift_sequence (rtx read_reg,
- int access_size,
+find_shift_sequence (int access_size,
store_info_t store_info,
read_info_t read_info,
int shift)
{
enum machine_mode store_mode = GET_MODE (store_info->mem);
enum machine_mode read_mode = GET_MODE (read_info->mem);
- rtx chosen_seq = NULL;
+ enum machine_mode new_mode;
+ rtx read_reg = NULL;
/* Some machines like the x86 have shift insns for each size of
operand. Other machines like the ppc or the ia-64 may only have
@@ -1433,21 +1434,31 @@ find_shift_sequence (rtx read_reg,
justify the value we want to read but is available in one insn on
the machine. */
- for (; access_size <= UNITS_PER_WORD; access_size *= 2)
+ for (new_mode = smallest_mode_for_size (access_size * BITS_PER_UNIT,
+ MODE_INT);
+ GET_MODE_BITSIZE (new_mode) <= BITS_PER_WORD;
+ new_mode = GET_MODE_WIDER_MODE (new_mode))
{
- rtx target, new_reg, shift_seq, insn;
- enum machine_mode new_mode;
+ rtx target, new_reg, shift_seq, insn, new_lhs;
int cost;
- /* Try a wider mode if truncating the store mode to ACCESS_SIZE
- bytes requires a real instruction. */
- if (access_size < GET_MODE_SIZE (store_mode)
- && !TRULY_NOOP_TRUNCATION (access_size * BITS_PER_UNIT,
+ /* Try a wider mode if truncating the store mode to NEW_MODE
+ requires a real instruction. */
+ if (GET_MODE_BITSIZE (new_mode) < GET_MODE_BITSIZE (store_mode)
+ && !TRULY_NOOP_TRUNCATION (GET_MODE_BITSIZE (new_mode),
GET_MODE_BITSIZE (store_mode)))
continue;
- new_mode = smallest_mode_for_size (access_size * BITS_PER_UNIT,
- GET_MODE_CLASS (read_mode));
+ /* Also try a wider mode if the necessary punning is either not
+ desirable or not possible. */
+ if (!CONSTANT_P (store_info->rhs)
+ && !MODES_TIEABLE_P (new_mode, store_mode))
+ continue;
+ new_lhs = simplify_gen_subreg (new_mode, copy_rtx (store_info->rhs),
+ store_mode, 0);
+ if (new_lhs == NULL_RTX)
+ continue;
+
new_reg = gen_reg_rtx (new_mode);
start_sequence ();
@@ -1484,31 +1495,13 @@ find_shift_sequence (rtx read_reg,
take the value from the store and put it into the
shift pseudo, then shift it, then generate another
move to put in into the target of the read. */
- start_sequence ();
- emit_move_insn (new_reg, gen_lowpart (new_mode, store_info->rhs));
+ emit_move_insn (new_reg, new_lhs);
emit_insn (shift_seq);
- convert_move (read_reg, new_reg, 1);
-
- if (dump_file)
- {
- fprintf (dump_file, " -- adding extract insn r%d:%s = r%d:%s\n",
- REGNO (new_reg), GET_MODE_NAME (new_mode),
- REGNO (store_info->rhs), GET_MODE_NAME (store_mode));
-
- fprintf (dump_file, " -- with shift of r%d by %d\n",
- REGNO(new_reg), shift);
- fprintf (dump_file, " -- and second extract insn r%d:%s = r%d:%s\n",
- REGNO (read_reg), GET_MODE_NAME (read_mode),
- REGNO (new_reg), GET_MODE_NAME (new_mode));
- }
-
- /* Get the three insn sequence and return it. */
- chosen_seq = get_insns ();
- end_sequence ();
+ read_reg = extract_low_bits (read_mode, new_mode, new_reg);
break;
}
- return chosen_seq;
+ return read_reg;
}
@@ -1551,15 +1544,11 @@ replace_read (store_info_t store_info, i
enum machine_mode read_mode = GET_MODE (read_info->mem);
int shift;
int access_size; /* In bytes. */
- rtx read_reg = gen_reg_rtx (read_mode);
- rtx shift_seq = NULL;
+ rtx insns, read_reg;
if (!dbg_cnt (dse))
return false;
- if (GET_MODE_CLASS (read_mode) != GET_MODE_CLASS (store_mode))
- return false;
-
/* To get here the read is within the boundaries of the write so
shift will never be negative. Start out with the shift being in
bytes. */
@@ -1573,62 +1562,43 @@ replace_read (store_info_t store_info, i
/* From now on it is bits. */
shift *= BITS_PER_UNIT;
- /* We need to keep this in perspective. We are replacing a read
+ /* Create a sequence of instructions to set up the read register.
+ This sequence goes immediately before the store and its result
+ is read by the load.
+
+ We need to keep this in perspective. We are replacing a read
with a sequence of insns, but the read will almost certainly be
in cache, so it is not going to be an expensive one. Thus, we
are not willing to do a multi insn shift or worse a subroutine
call to get rid of the read. */
+ if (dump_file)
+ fprintf (dump_file, "trying to replace %smode load in insn %d"
+ " from %smode store in insn %d\n",
+ GET_MODE_NAME (read_mode), INSN_UID (read_insn->insn),
+ GET_MODE_NAME (store_mode), INSN_UID (store_insn->insn));
+ start_sequence ();
if (shift)
+ read_reg = find_shift_sequence (access_size, store_info, read_info, shift);
+ else
+ read_reg = extract_low_bits (read_mode, store_mode,
+ copy_rtx (store_info->rhs));
+ if (read_reg == NULL_RTX)
{
- if (access_size > UNITS_PER_WORD || FLOAT_MODE_P (store_mode))
- return false;
-
- shift_seq = find_shift_sequence (read_reg, access_size, store_info,
- read_info, shift);
- if (!shift_seq)
- return false;
+ end_sequence ();
+ if (dump_file)
+ fprintf (dump_file, " -- could not extract bits of stored value\n");
+ return false;
}
-
- if (dump_file)
- fprintf (dump_file, "replacing load at %d from store at %d\n",
- INSN_UID (read_insn->insn), INSN_UID (store_insn->insn));
+ /* Force the value into a new register so that it won't be clobbered
+ between the store and the load. */
+ read_reg = copy_to_mode_reg (read_mode, read_reg);
+ insns = get_insns ();
+ end_sequence ();
if (validate_change (read_insn->insn, loc, read_reg, 0))
{
- rtx insns;
deferred_change_t deferred_change = pool_alloc (deferred_change_pool);
- if (read_mode == store_mode)
- {
- start_sequence ();
-
- /* The modes are the same and everything lines up. Just
- generate a simple move. */
- emit_move_insn (read_reg, store_info->rhs);
- if (dump_file)
- fprintf (dump_file, " -- adding move insn r%d = r%d\n",
- REGNO (read_reg), REGNO (store_info->rhs));
- insns = get_insns ();
- end_sequence ();
- }
- else if (shift)
- insns = shift_seq;
- else
- {
- /* The modes are different but the lsb are in the same
- place, we need to extract the value in the right from the
- rhs of the store. */
- start_sequence ();
- convert_move (read_reg, store_info->rhs, 1);
-
- if (dump_file)
- fprintf (dump_file, " -- adding extract insn r%d:%s = r%d:%s\n",
- REGNO (read_reg), GET_MODE_NAME (read_mode),
- REGNO (store_info->rhs), GET_MODE_NAME (store_mode));
- insns = get_insns ();
- end_sequence ();
- }
-
/* Insert this right before the store insn where it will be safe
from later insns that might change it before the read. */
emit_insn_before (insns, store_insn->insn);
@@ -1666,12 +1636,22 @@ replace_read (store_info_t store_info, i
rest of dse, play like this read never happened. */
read_insn->read_rec = read_info->next;
pool_free (read_info_pool, read_info);
+ if (dump_file)
+ {
+ fprintf (dump_file, " -- replaced the loaded MEM with ");
+ print_simple_rtl (dump_file, read_reg);
+ fprintf (dump_file, "\n");
+ }
return true;
}
else
{
if (dump_file)
- fprintf (dump_file, " -- validation failure\n");
+ {
+ fprintf (dump_file, " -- replacing the loaded MEM with ");
+ print_simple_rtl (dump_file, read_reg);
+ fprintf (dump_file, " led to an invalid instruction\n");
+ }
return false;
}
}
Index: gcc/testsuite/gcc.target/mips/dse-1.c
===================================================================
--- gcc/testsuite/gcc.target/mips/dse-1.c 2007-10-27 09:25:22.000000000 +0100
+++ gcc/testsuite/gcc.target/mips/dse-1.c 2007-10-27 09:34:07.000000000 +0100
@@ -14,6 +14,13 @@ #define TEST(ID, TYPE1, TYPE2) \
u->m2 = x; \
return (u->m1[0] \
+ u->m1[sizeof (TYPE2) / sizeof (TYPE1) - 1]); \
+ } \
+ \
+ TYPE1 __attribute__((nomips16)) \
+ g##ID (union u##ID *u) \
+ { \
+ u->m2 = 0; \
+ return (u->m1[0] | u->m1[1]); \
}
TEST (1, unsigned int, unsigned long long);
More information about the Gcc-patches
mailing list