This is the mail archive of the
gcc-patches@gcc.gnu.org
mailing list for the GCC project.
[PATCH] PR26251: -Os vs. tree-ssa jump threading
- From: Roger Sayle <roger at eyesopen dot com>
- To: Jeff Law <law at redhat dot com>, <gcc-patches at gcc dot gnu dot org>
- Date: Sun, 2 Jul 2006 19:27:30 -0600 (MDT)
- Subject: [PATCH] PR26251: -Os vs. tree-ssa jump threading
Hi Jeff,
The following patch is my attempt to resolve PR tree-optimization/26251
which is a P2 code size regression on mainline, caused by jump threading
in tree-ssa's VRP pass.
The issue is that tree-level jump threading is more general than its
RTL predecessor, in that it allows jump threading to duplicate and
specialize entire basic blocks preceding the bypassed/threaded conditional
jump. This code duplication/specialization has always reduces the number
of instructions executed, but may unfortunately increase code size.
For Dan Kegel's example in comment #4, we even duplicate a function
call!
The solution proposed below is, when optimizing for size, to only
thread jumps over basic blocks that require no code duplication.
Trivially, this is either (i) when we're threading the only edge
into a basic block, so that the block won't need to be duplicated,
or (ii) when the threaded basic block is empty except for the control
flow statement being threaded. This second condition matches the
RTL "jump bypassing" pass' original constraints, that the conditional
jump had to occur at the start of an RTL basic blocks.
I'll admit that my first version of this patch only implemented
condition (i), but as shown below we get a better code size
reduction using the more complex test of both (i) and (ii).
Using the code from the bugzilla PR, On x86, we currently generate
with -Os -fomit-frame-pointer:
foo: movl 4(%esp), %edx
movl $2, %eax
movl $0, (%edx)
.L2: movl $0, -4(%edx,%eax,4)
incl %eax
cmpl $11, %eax
jne .L2
ret
where VRP has effectively unpeeled the loop once, resulting in
two stores to memory.
Using condition (i) above, which effectively disables threading
completely for this testcase, results in:
foo: movl 4(%esp), %edx
xorl %eax, %eax
jmp .L2
.L3: movl $0, (%edx,%eax,4)
incl %eax
.L2: cmpl $10, %eax
jne .L3
ret
which hasn't unpeeled the loop, but also has omitted to thread
the inital jump into the loop, hence the loop hasn't been rotated.
[I'm surprised this fumble wasn't picked up by the RTL-level jump
bypassing pass; I didn't think we'd disabled it yet :-(].
However, using both conditions (i) && (ii), effectively adding the
new redirection_block_p predicate test in the patch below, we now
generate:
foo: movl 4(%esp), %edx
movl $1, %eax
.L2: movl $0, -4(%edx,%eax,4)
incl %eax
cmpl $11, %eax
jne .L2
ret
Apart from the curious change in induction variable performed by
tree-ssa's ivopts pass, this is about as good as you can expect
from tree-level jump threading with -Os. Particularly, it resolves
the code size regression, returning us to the code we generated
in gcc 4.0.
To demonstrate the one-off results above are representative, I've
also benchmarked both this patch and its predecessor on the CSiBE
v1.0.1 code size benchmark on x86 using the default -Os flags.
Mainline currently generates 931,515 bytes, the earlier (i) patch
produced a respectable improvement to 928,262 bytes (a saving
of 3253 bytes), and the (i)&(ii) patch below achieves 928,073
bytes (a saving of 3442 bytes). I'd expect similar saving to be
observed on other platforms, as tree-ssa passes are nominally
platform independent and duplicating potentially large basic
blocks is intuitively bad when optimizing for size.
The following patch has been tested on i686-pc-linux-gnu, with
a full "make bootstrap", all default languages including Ada,
and regression tested with a top-level "make -k check" with no
new failures.
Jeff, are you happy with this approach/refinement?
Many thanks in advance,
2006-07-02 Roger Sayle <roger@eyesopen.com>
PR tree-optimization/26251
* tree-ssa-threadupdate.c (redirection_block_p): New function.
(thread_block): When optimizing for size refuse to thread jumps
that would require duplication of blocks other than redirection
blocks.
Index: tree-ssa-threadupdate.c
===================================================================
*** tree-ssa-threadupdate.c (revision 115062)
--- tree-ssa-threadupdate.c (working copy)
***************
*** 1,5 ****
/* Thread edges through blocks and update the control flow and SSA graphs.
! Copyright (C) 2004, 2005 Free Software Foundation, Inc.
This file is part of GCC.
--- 1,5 ----
/* Thread edges through blocks and update the control flow and SSA graphs.
! Copyright (C) 2004, 2005, 2006 Free Software Foundation, Inc.
This file is part of GCC.
*************** redirect_edges (void **slot, void *data)
*** 652,657 ****
--- 652,684 ----
return 1;
}
+ /* Return true if this block has no executable statements other than
+ a simple ctrl flow instruction. When the number of outgoing edges
+ is one, this is equivalent to a "forwarder" block. */
+
+ static bool
+ redirection_block_p (basic_block bb)
+ {
+ block_stmt_iterator bsi;
+
+ /* Advance to the first executable statement. */
+ bsi = bsi_start (bb);
+ while (!bsi_end_p (bsi)
+ && (TREE_CODE (bsi_stmt (bsi)) == LABEL_EXPR
+ || IS_EMPTY_STMT (bsi_stmt (bsi))))
+ bsi_next (&bsi);
+
+ /* Check if this is an empty block. */
+ if (bsi_end_p (bsi))
+ return true;
+
+ /* Test that we've reached the terminating control statement. */
+ return bsi_stmt (bsi)
+ && (TREE_CODE (bsi_stmt (bsi)) == COND_EXPR
+ || TREE_CODE (bsi_stmt (bsi)) == GOTO_EXPR
+ || TREE_CODE (bsi_stmt (bsi)) == SWITCH_EXPR);
+ }
+
/* BB is a block which ends with a COND_EXPR or SWITCH_EXPR and when BB
is reached via one or more specific incoming edges, we know which
outgoing edge from BB will be traversed.
*************** thread_block (basic_block bb)
*** 699,704 ****
--- 726,742 ----
be threaded to a duplicate of BB. */
bool all = true;
+ /* If optimizing for size, only thread this block if we don't have
+ to duplicate it or it's an otherwise empty redirection block. */
+ if (optimize_size
+ && EDGE_COUNT (bb->preds) > 1
+ && !redirection_block_p (bb))
+ {
+ FOR_EACH_EDGE (e, ei, bb->preds)
+ e->aux = NULL;
+ return false;
+ }
+
/* To avoid scanning a linear array for the element we need we instead
use a hash table. For normal code there should be no noticeable
difference. However, if we have a block with a large number of
Roger
--