This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]
Re: Generic hwloop support library

From: Bernd Schmidt <bernds at codesourcery dot com>
To: GCC Patches <gcc-patches at gcc dot gnu dot org>, rdsandiford at googlemail dot com
Cc: Jie Zhang <jzhang918 at gmail dot com>
Date: Tue, 05 Jul 2011 00:31:03 +0200
Subject: Re: Generic hwloop support library
References: <4E009E8B.6010104@codesourcery.com> <87fwmoltrf.fsf@firetop.home>
On 07/02/11 15:03, Richard Sandiford wrote:
> I suppose my main question about this is: did you consider extending
> the cfgloop machinery instead of writing new loop discovery code?

cfgloop only discovers natural loops. Here, we can and should deal with
things that have multiple entry points, since the loop structure can be
mangled by earlier optimizations.

Another thing is that we want to discover every loop-end instruction
(even if it's not really in a loop), as the hwloop_fail function is a
good place to split them.

> Also, loop discovery seems to be based on the liveness of the iterator,
> but what happens if the iterator is a general register that is spilled
> for part of the loop?  Obviously it's best if that doesn't happen, but
> if it does, it looks like you might have a "loop" in which the recorded
> head isn't actually considered to be a member of the loop.

Using liveness of the iterator is important since we're going to shift
that into a different register, a hardware loop counter. On Blackfin,
that's LC0/LC1, on C6X it's called ILC. This means that on the currently
supported targets we can't actually deal with uses of the iterator
inside (or outside) the loop, and a spill would show up as such a use
that would cause us to fail the loop.

However, I've added a test to ensure that the tail is part of the loop
(we start scanning from the head so that's always part of the loop). The
register allocator probably isn't smart enough to cause trouble of this
kind, but it's better to handle it.

> I wonder whether we should use a more specialised name, such as
> "hw_doloop", and similarly in the function names.  "loop_info"
> and "bb_in_loop" sound more like things that belong in cfgloop.

Renamed to hwloop_info. bb_in_loop is now static.

>> +struct hw_doloop_hooks

> For the record, I wondered at first whether these should be target hooks,
> but I agree they shouldn't.  md_reorg is special enough that this new
> infrastructure really does need to be a subroutine of md_reorg rather
> than a new pass.  Having separate hooks is more flexible in that case,
> because it allows the hooks to rely on data that is set up by md_reorg
> beforehand.

Yes. I see it this way - this is private communication between two
pieces of code, used at a specific time, not something that needs to be
exposed to all of the compiler.

> The main comment is above reorg_loops():
> 
>> +/* Run from machine_dependent_reorg, this pass looks for doloop_end
>> +   insns, analyzes the loop structure, and tries to rewrite the RTL of
>> +   these loops so that the target can create whatever hardware looping
>> +   instructions are available.
>> +   
>> +   DO_REORDER should be set to true if we should try to use the
>> +   reorder_loops function to ensure the loop end occurs after the loop
>> +   start.  HOOKS is used to pass in target specific hooks; see
>> +   hw-doloop.h.  */
> 
> I think the combination of this comment and the comments in the hooks
> ought to be self-contained enough that a back-end writer could tell from
> them alone how to call the subroutine, and how the hooks need to be
> defined.  Would you mind expanding it a bit?  E.g. I found the
> after_reorder comment a bit hard to follow because there's no single
> comment that describes what reordering is done, and how the backend
> might want to react to it.  I think it would also help to describe
> (broadly) what kind of hardware instruction you're trying to support.

The easiest way to deal with after_reorder is to remove it. I used to
have some scheduling bits in there for C6X, but it looks like this won't
be needed in the final version.

I've tried to expand the comments a bit.

> The current code seems to assume that the iterator is a single hard
> register.  It might be worth stating (and asserting?)  that somewhere.

Added to end_pattern_reg comment.

> Silly nitlet 1, but the file seems a bit inconsistent about the
> number of lines between the comment and the start of function.

That means it fits in well with overall gcc code :-)
Changed.

> it might be better to allocate using XCNEW instead.

Done.

>> +	  if (GET_MODE (insn) == TImode)
>> +	    loop->timode_insns++;
> 
> I assume this is used by the C6X port?  The information is only going
> to be meaningful to ports like C6X that treat scheduling as a subroutine
> of md_reorg, so I wonder whether it belongs here.

That kind of belongs with the old scheduling in after_reorder thing, and
looks like won't actually be necessary, so I've removed it.

In general, "belongs here" is debatable since different ports will want
different information about the loop. But since we're scanning the loop
anyway we might as well collect as much data as possible here rather
than design a complicated interface with target_hwloop_data structures
or somesuch.

> I think this should be iterating over DF_INSN_DEFS instead.  If that
> isn't accurate for some reason, it needs to be fixed...

Changed.

>> +  if (EDGE_COUNT (tail_bb->succs) != 2)
>> +    {
>> +      loop->bad = true;
>> +      loop->head = loop->successor = NULL;
> 
> What's the significance of clearing "head" and "successor" as well as
> setting "bad"?

Probably a historical thing (head == NULL signifying badness). Changed,
and changed scan_loop to test "bad".

>> +  works = VEC_alloc (basic_block,heap,20);
> 
> Missing spaces.

Fixed.

>> +	      if (!VEC_space (basic_block, works, 1))
>> +		{
>> +		  if (dwork)
>> +		    {
>> +		      VEC_block_remove (basic_block, works, 0, dwork);
>> +		      dwork = 0;
>> +		    }
>> +		  else
>> +		    VEC_reserve (basic_block, heap, works, 1);
>> +		}
>> +	      VEC_quick_push (basic_block, works, succ);
> 
> I'm being dense, sorry, but could you walk me through what this
> is doing (i.e. why it isn't a simple VEC_safe_push)?

No idea. svn archeology suggests Jie wrote that bit, but I'm guessing he
won't remember either. I've simplified this area now; it still seems to
do the same thing.

> I think this would be easier to follow if the discovery part of the
> loop_incoming loop was split out:

Done.

> Could you add commentary explaining what this case is trying to handle?
> It seems like adding the forwarder adds more incoming_dests,
> so presumably the idea is to unify the incoming_srcs.  Something like
> (in insn order):
> 
>     +---- B1
>     |     |
>     |     V
>     | +-- B2	  
>     | |
>     | |   L1 <-+
>     | |   |    |
>     | |   V    |
>     | +-> L2   |
>     |     |    |
>     |     V    |
>     +---> L3   |
>           |    |
>           V    |
>           L4 --+
>           |
>           V
>          ...
> 
> So we'd then consider B2 to be part of the loop.  Is that right?
> If so, how does it help?  B2 still seems a bit of an odd-man-out,
> even if we pretend it's part of the loop.

The trick is that we can then insert the LSETUP instruction in B1, since
that becomes the only block that enters the loop. The idea is to reduce
the number of cases that the optimizer function needs to handle by
giving it either an incoming_src or an incoming_dest. If we did try to
handle something like the above directly, we'd likely insert too many
LSETUP instructions.

This kind of problem seems to be introduced mainly by bb-reorder; with
that turned off I can't make the code trigger.

>> +      /* There's a degenerate case we can handle - an empty loop consisting
>> +	 of only a back branch.  Handle that by deleting the branch.  */
> 
> Do we really need to handle that here?

It does trigger, for example with a testcase from a Linux kernel. I've
added a check for register liveness. Presumably it could also be handled
earlier, but it seems easy enough to also detect the problem here in
case it takes a long time for the last dead insns to disappear.

>> +      bb->aux = loop;
> 
> Where is this used?  Is it provided for the backend hooks?
> If so, that doesn't seem to be documented.

Removed, and changed some places where we clear aux.

> ->outer doesn't seem to be used anywhere, and the choice of loop
> looks somewhat arbitrary.

Removed outer.  I don't think it's arbitrary, but it may be initially
confusing (especially with the "outer"): it doesn't actually construct a
tree; every inner loop is recorded regardless of how many levels down it
is. This works just fine for the DFS we do in optimize_loops.

> We can avoid the temporary here with:

Changed.

>> +#define BB_AUX_INDEX(BB) ((unsigned)(BB)->aux)

Changed to intptr_t. Should have used the most up-to-date version of
the Blackfin backend as source.

>> +      /* Recreate an index for basic blocks that represents their order.  */
>> +      for (bb = ENTRY_BLOCK_PTR->next_bb, index = 0;
>> +	   bb != EXIT_BLOCK_PTR;
>> +	   bb = bb->next_bb, index++)
>> +	bb->aux = (PTR) index;
> 
> I think this would be clearer with FOR_EACH_BB.  It seems a shame
> to recompute this for every loop though, especially if no changes
> are needed.

Changed.

>> +      if (BB_AUX_INDEX (loop->head) < BB_AUX_INDEX (loop->tail))
>> +	continue;
> 
> <=?  A self loop isn't a problem.

Fixed.

>> +	      if (dump_file)
>> +		fprintf (dump_file, ";; Moving block %d before block %d\n",
>> +			 loop->head->index, start_bb->index);

> How confident are we at this stage that the loop can use the hardware
> instruction (i.e. won't be marked bad)?

In theory that's really target-dependent. It could depend on factors
such as nesting levels, registers used, or number of instructions in the
loop.

In practice this function will only be used on Blackfin for now. I have
no numbers about how likely we are to optimize the loop fter this. I
suppose a possible enhancement would be an additional early bad-marking
hook. In practice I don't think it matters since I think this tends to
improve the generated code in any case.

> It might be nice to have a
> comment about the impact of this change in terms of branching,
> both when we can and can't do the actual loop optimisation.

My impression is that it tends to remove branches inside loops, which
ought to be a good thing. I don't think it can add branches to the loop,
at worst it'll switch fallthru/jump edges.

For example, here it moved the L11 block upwards, which should save one
unconditional jump even if the LSETUP conversion hadn't worked:

-       jump.s .L11;
+       LSETUP (.L11, .L19) LC1 = P4;
+.L11:
+       do random stuff
 .L12:
	inner loop, sets cc
        if !cc jump .L12 (bp);
	stuff
+.L19:
-       P4 += -1;
-       cc =P4==0;
-       if !cc jump .L11;
-       jump.s .L18;
-.L11:
-       do random stuff
-       jump.s .L12;
-.L18:



> Is there no possibility of running the optimisation in cfglayout mode
> instead?  It seems from this and the forwarder block stuff above as
> though it might make things easier, but maybe not.

I'm not sure what you mean here. This reordering isn't for the sake of
the optimization code; it's necessary due to the encoding of the
Blackfin LSETUP instruction which assumes that the loop start label is
before the loop end label.

The optimizing functions definitely need to see RTL that corresponds
directly to generated assembly, since they need to count instruction
sizes. So I think cfglayout is out for that.

>> +  /* Every loop contains in its list of inner loops every loop nested inside
>> +     it, even if there are intermediate loops.  This works because we're doing
>> +     a depth-first search here and never visit a loop more than once.  */
>> +  for (ix = 0; VEC_iterate (loop_info, loop->loops, ix, inner); ix++)
>> +    {
>> +      optimize_loop (inner, hooks);
> 
> Perhaps use a worklist here, to avoid O(loops)-deep recursion?

Well, it's O(counting loop nesting level) deep, which I think isn't much
of a problem in practice. One could even argue that it's a O(1) since
the register allocator will eventually run out of loop iteration registers.

> Is it a required part of the interface that we call things like
> after_reorder even if there are no loops?

This is moot now given that after_reorder is gone, I think.

Thanks for the review - new patch below. I think this addresses
everything. Tested by compiling a lot of input files and verifying it
produces identical code as the previous version; will also do a normal
regression test.


Bernd
Attachment: doloop-0704i.diff
Description: Text document
Follow-Ups:
- Re: Generic hwloop support library
  - From: Richard Sandiford
References:
- Re: Generic hwloop support library
  - From: Richard Sandiford
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]