This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.



Re: target register load optimizations


Richard Henderson wrote:
> I don't like either of the phrases "late-loop" or "target register".
> This has nothing to do with loops, per-se, since afaict this is a
> straight dominator tree operation; "target" is way too overloaded in
> gcc terminology already.  I think the phrase "branch register optimization"
> is a bit better, since this does more to describe what we're actually
> trying to do for SH5.

I've used late-loop.c for the filename because most of the optimizations
could have been done by loop invariant code motion, were it not for the
gcc infrastructure's inability to deal with branch, call and return
instructions that depend on a register load for their target.

Actually, we've described this as 'pt migration' before, as pt is the
mnemonic for the prepare target register instruction, and the objective
is to migrate the pt instructions further up the instruction stream,
the most important instance being to hoist them out of loops.
Considering that other CPUs are likely to use different mnemonics,
though, it seemed appropriate to use a name that describes the process
without referring to a mnemonic.
The target registers do not only hold branch targets, but are also used
for function call and return.  So calling this a branch register
optimization would not be accurate.
I suppose if we want to be more specific about what kind of target we
are talking about here, we have to call it a control flow target.

So that would give us:

Options:
--cft-load-optimize "Optimize loading of control flow target registers before prologue/epilogue threading"
--cft-load-optimize2 "Optimize loading of control flow target registers after prologue/epilogue threading"

filename:
cft-load.c
/* Optimize loading of control flow target registers.  */

flag variables:
flag_cft_load_optimize
flag_cft_load_optimize2

Likewise for other macro / variable names

> Is there anything you're doing here that wouldn't be solved with a
> register allocator that placed spill/fill code at optimal locations?
> I know we don't have that at the moment (even with new-ra), but in
> theory would such an allocator completely obviate this pass?

The problem is that the passes that handle control flow can't grok a
prepare target / branch using target register sequence, and they also
assume that branches can be willy-nilly inverted, and the target changed
to a new label.  To change this would require a rewrite of large parts
of gcc.
So before reload, we pretend that jumps and branches that take labels
for the target exist, but force the labels into target registers with
reload constraints.  That works reasonably well if you have loop-free
code with medium to long basic blocks, but produces atrocious code for
loops.  Your 'optimal' spill code placement algorithm would have to be
able to hoist input reloads out of loops.
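
As a rough illustration of why this matters for loops, here is a
standalone C sketch (not code from the patch).  The names pt_count,
prepare_target, run_unhoisted and run_hoisted are hypothetical; the
counter merely stands in for executed prepare-target instructions.  It
compares a load that stays next to the loop-back branch with one that
has been hoisted in front of the loop.

```c
/* Hypothetical counter standing in for executed "pt" (prepare target)
   instructions.  */
static long pt_count;

static void
prepare_target (void)
{
  pt_count++;
}

/* The target load stays adjacent to the branch that uses it, as an
   input reload would place it: it runs once per iteration.  */
static long
run_unhoisted (int iterations)
{
  pt_count = 0;
  for (int i = 0; i < iterations; i++)
    prepare_target ();
  return pt_count;
}

/* The target load has been moved in front of the loop: it runs once,
   regardless of the iteration count.  */
static long
run_hoisted (int iterations)
{
  pt_count = 0;
  prepare_target ();
  for (int i = 0; i < iterations; i++)
    ;   /* loop body no longer reloads the target register */
  return pt_count;
}
```

For 1000 iterations the unhoisted variant counts 1000 loads and the
hoisted one counts 1.  Note that the hoisted variant still executes its
load when the loop runs zero times, which is the usual trade-off of
hoisting.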
	
> I don't see anywhere an explanation of why it would be useful to run
> this code both before and after prologue generation.  Why wouldn't we
> ONLY run the pass before prologue generation and be done with it?

That is what is done for the SH5 by default.  A disadvantage of this
strategy is that the prologue and epilogue have to handle the return
address, which has to be put into a target register before we can return.
So the prologue and epilogue need some extra logic to make this work well.
On the other hand, at the time you write prologues / epilogues, you know
which target registers actually need saving.
Originally, the pass was done after the prologue/epilogue generation
(well, to be exact it was in MACHINE_DEPENDENT_REORG initially, but that
didn't work out).  I thought I should leave some choice so people could
experiment with what works best on the SH5 and on other processors such as
the F_CPU.
	
> You appear to be replicating a good part of df.c for this.  Is it
> possible to re-use the existing insn scanning code instead?  How about
> if df.c was enhanced to track only registers of a given class (with
> ALL_REGS resulting in what it does now)?

AFAICT there is no support in df.c to keep track of groups of def-use
chains where the definition uses the same constant value.  So this
support would have to be added, too.
(Kind of a note to myself:) I would also need to get rid of spurious
SUBREGs in target register SET_DESTs.
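
To make "groups of definitions that use the same constant value" a bit
more concrete, here is a minimal standalone sketch.  It does not use
df.c or any other GCC data structure; struct toy_def and
group_defs_by_constant are hypothetical names, and the label is
modelled as a plain integer id.

```c
#include <stddef.h>

/* Hypothetical record for one target-register definition: the label
   (constant) it loads, and the group it ends up in.  */
struct toy_def
{
  int label_id;
  int group;
};

/* Give the same group number to all definitions that load the same
   label.  Returns the number of distinct groups.  Quadratic, which is
   fine for a sketch.  */
static int
group_defs_by_constant (struct toy_def *defs, size_t n)
{
  int ngroups = 0;

  for (size_t i = 0; i < n; i++)
    {
      defs[i].group = -1;
      /* Reuse the group of an earlier definition of the same label.  */
      for (size_t j = 0; j < i; j++)
        if (defs[j].label_id == defs[i].label_id)
          {
            defs[i].group = defs[j].group;
            break;
          }
      /* First definition of this label: open a new group.  */
      if (defs[i].group < 0)
        defs[i].group = ngroups++;
    }
  return ngroups;
}
```

Definitions that land in the same group are the candidates for sharing
a single hoisted load.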
> 
> Specifics:
> 
> +   if (flag_optimize_target_registers_2)
> +     {
> +       open_dump_file (DFI_targetregs, decl);
> +
> +       target_registers_optimize (insns, true);
> +
> +       close_dump_file (DFI_targetregs, print_rtl_with_bb, insns);
> +
> +       ggc_collect ();
> +     }
> 
> If you're going to run this twice, you need two dump files.
> Anything else is just confusing when grepping dumps for first
> occurrence of a pattern.

It is possible to run it twice, although the envisaged typical use is
really that you enable it only once - what invocation time is best will
depend on the target processor, how much code size increase you are
willing to accept, and the structure of the source code.  I think the
code needs a somewhat wider user base to find out if running both
passes makes sense.
I'm not sure if I should make the change for two dump file names now,
or wait to see whether this pass is really used twice in the same compilation.

> + static int
> + basic_block_freq (bb)
> +      basic_block bb;
> + {
> +   int loop_depth = MIN (10, (bb->loop_depth));
> +   return 1 << (loop_depth * 3);
> + }
> 
> bb->frequency should always be valid.  If not via profile info,
> then by heuristic estimation.

That makes sense.  I want to benchmark if the heuristics are working
well enough first, though.
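
For reference, the quoted fallback weights each loop nesting level by
a factor of 8 and caps the depth at 10.  A standalone version for
experimenting with the numbers might look like this; struct toy_bb and
toy_basic_block_freq are hypothetical stand-ins, not GCC's basic_block
or the function from the patch.

```c
#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Simplified stand-in for a basic block; only the field used by the
   heuristic is modelled.  */
struct toy_bb
{
  int loop_depth;
};

/* Estimate execution frequency from loop nesting alone: 8 to the
   power of the depth, with the depth capped at 10 so the shift stays
   within a 32-bit int.  */
static int
toy_basic_block_freq (const struct toy_bb *bb)
{
  int loop_depth = MIN (10, bb->loop_depth);
  return 1 << (loop_depth * 3);
}
```

A block outside any loop gets weight 1, a block at depth 1 gets 8, at
depth 2 gets 64, and everything at depth 10 or deeper gets 1 << 30.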
	
> +       /* If there are abnormal edges, we do not attempt
> +          to optimize target register placement.  The only
> +          reason for this is that subsequent runs of
> +          find_basic_blocks() get confused when an assignment
> +          of a label to a target register is not adjacent to a
> +          branch that uses that target register, and
> +          introduce a whole lot _more_ abnormal edges.  In
> +          itself, this is pessimistic but harmless, but it can
> +          introduce apparent control-paths that previously didn't
> +          exist, causing basic block live-at-start sets to change,
> +          and that causes an assertion in flow.c
> +          (verify_local_live_at_start) to fail.
> 
> Is this relevant anymore?  IIRC we no longer re-build the CFG
> from scratch at any point.  Easing this restriction might make
> this pass relevant to IA-64, which also has special branch registers
> but only uses them for indirect jumps.

I can certainly try to take out this code; I reckon a simple regression
test then should answer this question, as we have some tests with abnormal
edges in our testsuite.
There will be some merge work to be done first, though, so I would like to
get consensus on the design issues first.
	
-- 
--------------------------
SuperH (UK) Ltd.
2410 Aztec West / Almondsbury / BRISTOL / BS32 4QX
T:+44 1454 465658

