
Yet another GCSE speedup


We're still spending an unusually large amount of time in the GCSE code.
Consider for example the profile when we compile fold-const.i with
the mainline sources:

 CPROP 1               :   0.12 ( 1%) usr   0.00 ( 0%) sys   0.08 ( 1%)
 PRE                   :   0.81 ( 8%) usr   0.02 ( 2%) sys   0.86 ( 7%)
 CPROP 2               :   0.10 ( 1%) usr   0.01 ( 1%) sys   0.11 ( 1%)
 TOTAL                 :  10.45

Most of that 0.81 seconds is burned in compute_transp doing some list
walking and lots of unnecessary load/test sequences.

This change significantly improves the situation with fold-const.i:

 CPROP 1               :   0.10 ( 1%) usr   0.00 ( 0%) sys   0.13 ( 1%)
 PRE                   :   0.21 ( 2%) usr   0.02 ( 2%) sys   0.27 ( 2%)
 CPROP 2               :   0.10 ( 1%) usr   0.00 ( 0%) sys   0.08 ( 1%)
 TOTAL                 :   9.90

Not bad at all.  On a larger sample of codes our total speedup is just
shy of 1%.

How do we get this nice little speedup.... :-)

gcse.c has an array indexed by block number; each entry holds a list of
all the insns which might modify memory in that block.

When we compute the local transparency property we walk those lists to
determine which MEM expressions are transparent through each block, with
code that looks like this:

      FOR_EACH_BB (bb)
        {
          rtx list_entry = canon_modify_mem_list[bb->index];

          while (list_entry)
            {
              rtx dest, dest_addr;

              if (CALL_P (XEXP (list_entry, 0)))
                {
                  if (set_p)
                    SET_BIT (bmap[bb->index], indx);
                  else
                    RESET_BIT (bmap[bb->index], indx);
                  break;
                }
              /* LIST_ENTRY must be an INSN of some kind that sets memory.
                 Examine each hunk of memory that is modified.  */

              dest = XEXP (list_entry, 0);
              list_entry = XEXP (list_entry, 1);
              dest_addr = XEXP (list_entry, 0);

              if (canon_true_dependence (dest, GET_MODE (dest), dest_addr,
                                         x, rtx_addr_varies_p))
                {
                  if (set_p)
                    SET_BIT (bmap[bb->index], indx);
                  else
                    RESET_BIT (bmap[bb->index], indx);
                  break;
                }
              list_entry = XEXP (list_entry, 1);
            }
        }

      x = XEXP (x, 0);
      goto repeat;


Note carefully that we stop the list walk as soon as we hit a call or
a conflicting memory write on our list.

These walks actually get pretty expensive when you consider that we do
them for every MEM expression we're tracking.  We can significantly
improve this with a couple of simple changes.

First, we keep a bitmap of blocks which have calls.  We can test that
bitmap before we walk down CANON_MODIFY_MEM_LIST.  This gives a small
but measurable speedup.
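
In sketch form, the first change looks something like this.  The name
blocks_with_calls and the exact spot where it gets set are illustrative
for this mail rather than quoted from the patch:

  /* One bit per basic block that contains a CALL_INSN; allocated and
     cleared alongside the other GCSE bitmaps.  */
  static bitmap blocks_with_calls;

  /* In the code that records an insn's memory effects, also note
     which blocks contain calls.  */
  if (CALL_P (insn))
    bitmap_set_bit (blocks_with_calls, BLOCK_FOR_INSN (insn)->index);

  /* Then in compute_transp a block with a call clobbers every MEM we
     track, so there is no need to walk its list at all.  */
  FOR_EACH_BB (bb)
    {
      if (bitmap_bit_p (blocks_with_calls, bb->index))
        {
          if (set_p)
            SET_BIT (bmap[bb->index], indx);
          else
            RESET_BIT (bmap[bb->index], indx);
          continue;
        }
      /* ... otherwise walk canon_modify_mem_list[bb->index] exactly
         as before ...  */
    }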

Then, instead of using FOR_EACH_BB, we can use bitmap iterators to walk
first the blocks with calls and then the blocks which write memory but
do not contain calls.  That gets us to just under a 1% compile-time
speedup.
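
The second change replaces FOR_EACH_BB with the bitmap iterators from
bitmap.h.  Again only a sketch; blocks_with_mem_sets stands in for a
bitmap (whatever the patch actually names it) of the blocks that have
entries on canon_modify_mem_list:

  bitmap_iterator bi;
  unsigned bb_index;

  /* Every block with a call kills the expression; no list walk needed.  */
  EXECUTE_IF_SET_IN_BITMAP (blocks_with_calls, 0, bb_index, bi)
    {
      if (set_p)
        SET_BIT (bmap[bb_index], indx);
      else
        RESET_BIT (bmap[bb_index], indx);
    }

  /* Only the blocks which store to memory but contain no calls still
     need the canon_modify_mem_list walk.  */
  EXECUTE_IF_AND_COMPL_IN_BITMAP (blocks_with_mem_sets, blocks_with_calls,
                                  0, bb_index, bi)
    {
      /* ... walk canon_modify_mem_list[bb_index] with
         canon_true_dependence exactly as before ...  */
    }

Blocks which contain neither a call nor a store to memory are never
visited at all, instead of being tested one by one.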

Bootstrapped and regression tested on i686-pc-linux-gnu.


