[PATCH, PR43864] Gimple level duplicate block cleanup.
Tom de Vries
vries@codesourcery.com
Mon Jul 18 00:41:00 GMT 2011
On 07/12/2011 04:07 PM, Richard Guenther wrote:
> On Tue, Jul 12, 2011 at 2:12 PM, Tom de Vries <vries@codesourcery.com> wrote:
>> Hi Richard,
>>
>> here's a new version of the pass. I attempted to address as much as possible
>> your comments. The pass was bootstrapped and reg-tested on x86_64.
>>
>> On 06/14/2011 05:01 PM, Richard Guenther wrote:
>>> On Fri, Jun 10, 2011 at 6:54 PM, Tom de Vries <vries@codesourcery.com> wrote:
>>>> Hi Richard,
>>>>
>>>> thanks for the review.
>>>>
>>>> On 06/08/2011 11:55 AM, Richard Guenther wrote:
>>>>> On Wed, Jun 8, 2011 at 11:42 AM, Tom de Vries <vries@codesourcery.com> wrote:
>>>>>> Hi Richard,
>>>>>>
>>>>>> I have a patch for PR43864. The patch adds a gimple level duplicate block
>>>>>> cleanup. The patch has been bootstrapped and reg-tested on x86_64, and
>>>>>> reg-tested on ARM. The size impact on ARM for spec2000 is shown in the following
>>>>>> table (%, lower is better).
>>>>>>
>>>>>> none pic
>>>>>> thumb1 thumb2 thumb1 thumb2
>>>>>> spec2000 99.9 99.9 99.8 99.8
>>>>>>
>>>>>> PR43864 is currently marked as a duplicate of PR20070, but I'm not sure that the
>>>>>> optimizations proposed in PR20070 would fix this PR.
>>>>>>
>>>>>> The problem in this PR is that when compiling with -O2, the example below should
>>>>>> only have one call to free. The original problem is formulated in terms of -Os,
>>>>>> but currently we generate one call to free with -Os, although still not the
>>>>>> smallest code possible. I'll show here the -O2 case, since that's similar to the
>>>>>> original PR.
>>>>>>
>>>>
>>>> Example A. (naming it for reference below)
>>>>
>>>>>> #include <stdio.h>
>>>>>> void foo (char*, FILE*);
>>>>>> char* hprofStartupp(char *outputFileName, char *ctx)
>>>>>> {
>>>>>> char fileName[1000];
>>>>>> FILE *fp;
>>>>>> sprintf(fileName, outputFileName);
>>>>>> if (access(fileName, 1) == 0) {
>>>>>> free(ctx);
>>>>>> return 0;
>>>>>> }
>>>>>>
>>>>>> fp = fopen(fileName, 0);
>>>>>> if (fp == 0) {
>>>>>> free(ctx);
>>>>>> return 0;
>>>>>> }
>>>>>>
>>>>>> foo(outputFileName, fp);
>>>>>>
>>>>>> return ctx;
>>>>>> }
>>>>>>
>>>>>> AFAIU, there are 2 complementary methods of rtl optimizations proposed in PR20070.
>>>>>> - Merging 2 blocks which are identical expect for input registers, by using a
>>>>>> conditional move to choose between the different input registers.
>>>>>> - Merging 2 blocks which have different local registers, by ignoring those
>>>>>> differences
>>>>>>
>>>>>> Blocks .L6 and.L7 have no difference in local registers, but they have a
>>>>>> difference in input registers: r3 and r1. Replacing the move to r5 by a
>>>>>> conditional move would probably be benificial in terms of size, but it's not
>>>>>> clear what condition the conditional move should be using. Calculating such a
>>>>>> condition would add in size and increase the execution path.
>>>>>>
>>>>>> gcc -O2 -march=armv7-a -mthumb pr43864.c -S:
>>>>>> ...
>>>>>> push {r4, r5, lr}
>>>>>> mov r4, r0
>>>>>> sub sp, sp, #1004
>>>>>> mov r5, r1
>>>>>> mov r0, sp
>>>>>> mov r1, r4
>>>>>> bl sprintf
>>>>>> mov r0, sp
>>>>>> movs r1, #1
>>>>>> bl access
>>>>>> mov r3, r0
>>>>>> cbz r0, .L6
>>>>>> movs r1, #0
>>>>>> mov r0, sp
>>>>>> bl fopen
>>>>>> mov r1, r0
>>>>>> cbz r0, .L7
>>>>>> mov r0, r4
>>>>>> bl foo
>>>>>> .L3:
>>>>>> mov r0, r5
>>>>>> add sp, sp, #1004
>>>>>> pop {r4, r5, pc}
>>>>>> .L6:
>>>>>> mov r0, r5
>>>>>> mov r5, r3
>>>>>> bl free
>>>>>> b .L3
>>>>>> .L7:
>>>>>> mov r0, r5
>>>>>> mov r5, r1
>>>>>> bl free
>>>>>> b .L3
>>>>>> ...
>>>>>>
>>>>>> The proposed patch solved the problem by dealing with the 2 blocks at a level
>>>>>> when they are still identical: at gimple level. It detect that the 2 blocks are
>>>>>> identical, and removes one of them.
>>>>>>
>>>>>> The following table shows the impact of the patch on the example in terms of
>>>>>> size for -march=armv7-a:
>>>>>>
>>>>>> without with delta
>>>>>> Os : 108 104 -4
>>>>>> O2 : 120 104 -16
>>>>>> Os thumb: 68 64 -4
>>>>>> O2 thumb: 76 64 -12
>>>>>>
>>>>>> The gain in size for -O2 is that of removing the entire block, plus the
>>>>>> replacement of 2 moves by a constant set, which also decreases the execution
>>>>>> path. The patch ensures optimal code for both -O2 and -Os.
>>>>>>
>>>>>>
>>>>>> By keeping track of equivalent definitions in the 2 blocks, we can ignore those
>>>>>> differences in comparison. Without this feature, we would only match blocks with
>>>>>> resultless operations, due the the ssa-nature of gimples.
>>>>>> For example, with this feature, we reduce the following function to its minimum
>>>>>> at gimple level, rather than at rtl level.
>>>>>>
>>>>
>>>> Example B. (naming it for reference below)
>>>>
>>>>>> int f(int c, int b, int d)
>>>>>> {
>>>>>> int r, e;
>>>>>>
>>>>>> if (c)
>>>>>> r = b + d;
>>>>>> else
>>>>>> {
>>>>>> e = b + d;
>>>>>> r = e;
>>>>>> }
>>>>>>
>>>>>> return r;
>>>>>> }
>>>>>>
>>>>>> ;; Function f (f)
>>>>>>
>>>>>> f (int c, int b, int d)
>>>>>> {
>>>>>> int e;
>>>>>>
>>>>>> <bb 2>:
>>>>>> e_6 = b_3(D) + d_4(D);
>>>>>> return e_6;
>>>>>>
>>>>>> }
>>>>>>
>>>>>> I'll send the patch with the testcases in a separate email.
>>>>>>
>>>>>> OK for trunk?
>>>>>
>>>>> I don't like that you hook this into cleanup_tree_cfg - that is called
>>>>> _way_ too often.
>>>>>
>>>>
>>>> Here is a reworked patch that addresses several concerns, particularly the
>>>> compile time overhead.
>>>>
>>>> Changes:
>>>> - The optimization is now in a separate file.
>>>> - The optimization is now a pass rather than a cleanup. That allowed me to
>>>> remove the test for pass-local flags.
>>>> New is the pass driver tail_merge_optimize, based on
>>>> tree-cfgcleanup.c:cleanup_tree_cfg_1.
>>>> - The pass is run once, on SSA. Before, the patch would
>>>> fix example A only before SSA and example B only on SSA.
>>>> In order to fix example A on SSA, I added these changes:
>>>> - handle the vop state at entry of bb1 and bb2 as equal (gimple_equal_p)
>>>> - insert vop phi in bb2, and use that one (update_vuses)
>>>> - complete pt_solutions_equal_p.
>>>>
>>>> Passed x86_64 bootstrapping and regression testing, currently regtesting on ARM.
>>>>
>>>> I placed the pass at the earliest point where it fixes example B: After copy
>>>> propagation and dead code elimination, specifically, after the first invocation
>>>> of pass_cd_dce. Do you know (any other points) where the pass should be scheduled?
>>>
>>> It's probably reasonable to run it after IPA inlining has taken place which
>>> means insert it somewhen after the second pass_fre (I'd suggest after
>>> pass_merge_phi).
>>>
>>
>> I placed it there, but I ran into some interaction with
>> pass_late_warn_uninitialized. Addition of the pass makes test
>> gcc.dg/uninit-pred-2_c.c fail.
>>
>> FAIL: gcc.dg/uninit-pred-2_c.c bogus uninitialized var warning
>> (test for bogus messages, line 43)
>> FAIL: gcc.dg/uninit-pred-2_c.c real uninitialized var warning
>> (test for warnings, line 45)
>>
>> int foo_2 (int n, int m, int r)
>> {
>> int flag = 0;
>> int v;
>>
>> if (n)
>> {
>> v = r;
>> flag = 1;
>> }
>>
>> if (m) g++;
>> else bar ();
>>
>> if (flag)
>> blah (v); { dg-bogus "uninitialized" "bogus uninitialized var warning" }
>> else
>> blah (v); { dg-warning "uninitialized" "real uninitialized var warning" }
>>
>> return 0;
>> }
>>
>> The pass replaces the second call to blah with the first one, and eliminates
>> the if. After that, the uninitialized warning is issued for the line number
>> of the first call to blah, while at source level the warning only makes sense
>> for the second call to blah.
>>
>> Shall I try putting the pass after pass_late_warn_uninitialized?
>
> No, simply pass -fno-tree-tail-merge in the testcase.
>
>>> But my general comment still applies - I don't like the structural
>>> comparison code at all and you should really use the value-numbering
>>> machineries we have
>>
>> I now use sccvn.
>
> Good.
>
>>> or even better, merge this pass with FRE itself
>>> (or DOM if that suits you more). For FRE you'd want to hook into
>>> tree-ssa-pre.c:eliminate().
>>>
>>
>> If we need to do the transformation after pass_late_warn_uninitialized, it needs
>> to stay on its own, I suppose.
>
> I suppose part of the high cost of the pass is running SCCVN, so it
> makes sense to share that with the existing FRE run.
Done.
> Any reason
> you use VN_NOWALK?
>
No, that was just a first-try value.
>>>>> This also duplicates the literal matching done on the RTL level - instead
>>>>> I think this optimization would be more related to value-numbering
>>>>> (either that of SCCVN/FRE/PRE or that of DOM which also does
>>>>> jump-threading).
>>>>
>>>> The pass currently does just duplicate block elimination, not cross-jumping.
>>>> If we would like to extend this to cross-jumping, I think we need to do the
>>>> reverse of value numbering: walk backwards over the bb, and keep track of the
>>>> way values are used rather than defined. This will allows us to make a cut
>>>> halfway a basic block.
>>>
>>> I don't understand - I propose to do literal matching but using value-numbering
>>> for tracking equivalences to avoid literal matching for stuff we know is
>>> equivalent. In fact I think it will be mostly calls and stores where we
>>> need to do literal matching, but never intermediate computations on
>>> registers.
>>>
>>
>> I tried to implement that scheme now.
>>
>>> But maybe I miss something here.
>>>
>>>> In general, we cannot do cut halfway a basic block in the current implementation
>>>> (of value numbering and forward matching), since we assume equivalence of the
>>>> incoming vops at bb entry. This assumption is in general only valid if we indeed
>>>> replace the entire block by another entire block.
>>>
>>> Why are VOPs of concern at all?
>>>
>>
>> In the previous version, I inserted the phis for the vops manually.
>> In the current version of the pass, I let TODO_update_ssa_only_virtuals deal
>> with vops, so it's not relevant anymore.
>>
>>>> I imagine that a cross-jumping heuristic would be based on the length of the
>>>> match and the amount of non-vop phis it would introduce. Then value numbering
>>>> would be something orthogonal to this optimization, which would reduce amount of
>>>> phis needed for a cross-jump.
>>>> I think it would make sense to use SCCVN value numbering at the point that we
>>>> have this backward matching.
>>>>
>>>> I'm not sure whether it's a good idea to try to replace the current forward
>>>> local value numbering with SCCVN value numbering, since we currently declare
>>>> vops equal, which are, in the global sense, not equal. And once we go to
>>>> backward matching, we'll need something to keep track of the uses, and we can
>>>> reuse the current infrastructure for that, but not the SCCVN value numbering.
>>>>
>>>> Does that make any sense?
>>>
>>> Ok, let me think about this a bit.
>>>
>>
>> I tried to to be more clear on this in the header comment of the pass.
>>
>>> For now about the patch in general. The functions need renaming to
>>> something more sensible now that this isn't cfg-cleanup anymore.
>>>
>>> I miss a general overview of the pass - it's hard to reverse engineer
>>> its working for me.
>>
>> I added a header comment.
>>
>>> Like (working backwards), you are detecting
>>> duplicate predecessors
>>> - that obviously doesn't work for duplicates
>>> without any successors, like those ending in noreturn calls.
>>>
>>
>> Merging of blocks without successors works now.
>>
>>> + n = EDGE_COUNT (bb->preds);
>>> +
>>> + for (i = 0; i < n; ++i)
>>> + {
>>> + e1 = EDGE_PRED (bb, i);
>>> + if (e1->flags & EDGE_COMPLEX)
>>> + continue;
>>> + for (j = i + 1; j < n; ++j)
>>> + {
>>>
>>> that's quadratic in the number of predecessors.
>>>
>>
>> The quadratic comparison is now limited by PARAM_TAIL_MERGE_MAX_COMPARISONS.
>> Each bb is compared to maximally PARAM_TAIL_MERGE_MAX_COMPARISONS similar bbs
>> per worklist iteration.
>>
>>> + /* Block e1->src might be deleted. If bb and e1->src are the same
>>> + block, delete e2->src instead, by swapping e1 and e2. */
>>> + e1_swapped = (bb == e1->src) ? e2: e1;
>>> + e2_swapped = (bb == e1->src) ? e1: e2;
>>>
>>> is that because you incrementally merge preds two at a time? As you
>>> are deleting blocks don't you need to adjust the quadratic walking?
>>> Thus, with say four equivalent preds won't your code crash anyway?
>>>
>>
>> I think it was to make calculation of dominator info easier, but I use now
>> functions from dominance.c for that, so this piece of code is gone.
>>
>>> I think the code needs to delay the CFG manipulation to the end
>>> of this function.
>>>
>>
>> I now delay the cfg manipulation till after each analysis phase.
>>
>>> +/* Returns whether for all phis in E1->dest the phi alternatives for E1 and
>>> + E2 are either:
>>> + - equal, or
>>> + - defined locally in E1->src and E2->src.
>>> + In the latter case, register the alternatives in *PHI_EQUIV. */
>>> +
>>> +static bool
>>> +same_or_local_phi_alternatives (equiv_t *phi_equiv, edge e1, edge e2)
>>> +{
>>> + int n1 = e1->dest_idx;
>>> + int n2 = e2->dest_idx;
>>> + gimple_stmt_iterator gsi;
>>> + basic_block dest = e1->dest;
>>> + gcc_assert (dest == e2->dest);
>>>
>>> too many asserts in general - I'd say for this case pass in the destination
>>> block as argument.
>>>
>>> + gcc_assert (val1 != NULL_TREE);
>>> + gcc_assert (val2 != NULL_TREE);
>>>
>>> superfluous.
>>>
>>> +static bool
>>> +cleanup_duplicate_preds_1 (equiv_t phi_equiv, edge e1, edge e2)
>>> ...
>>> + VEC (edge,heap) *redirected_edges;
>>> + gcc_assert (bb == e2->dest);
>>>
>>> same.
>>>
>>> + if (e1->flags != e2->flags)
>>> + return false;
>>>
>>> that's bad - it should handle EDGE_TRUE/FALSE_VALUE mismatches
>>> by swapping edges in the preds.
>>>
>>
>> That's handled now.
>>
>>> + /* TODO: We could allow multiple successor edges here, as long as bb1 and bb2
>>> + have the same successors. */
>>> + if (EDGE_COUNT (bb1->succs) != 1 || EDGE_COUNT (bb2->succs) != 1)
>>> + return false;
>>>
>>> hm, ok - that would need fixing, too. Same or mergeable successors
>>> of course, which makes me wonder if doing this whole transformation
>>> incrementally and locally is a good idea ;) Also
>>>
>>
>> Also handled now.
>>
>>> + /* Calculate the changes to be made to the dominator info.
>>> + Calculate bb2_dom. */
>>> ...
>>>
>>> wouldn't be necessary I suppose (just throw away dom info after the
>>> pass).
>>>
>>> That is, I'd globally record BB equivalences (thus, "value-number"
>>> BBs) and apply the CFG manipulations at a single point.
>>>
>>
>> I delay the cfg manipulation till after each analysis phase. Delaying the cfg
>> manipulation till the end of the pass instead might make the analysis code more
>> convoluted.
>>
>>> Btw, I miss where you insert PHI nodes for all uses that flow in
>>> from the preds preds - you do that for VOPs but not for real
>>> operands?
>>>
>>
>> Indeed, inserting phis for non-vops is a todo.
>>
>>> + /* Replace uses of vuse2 with uses of the phi. */
>>> + for (gsi = gsi_start_bb (bb2); !gsi_end_p (gsi); gsi_next (&gsi))
>>> + {
>>>
>>> why not walk immediate uses of the old PHI and SET_USE to
>>> the new one instead (for those uses in the duplicate BB of course)?
>>>
>>
>> And I no longer insert VOP phis, but let a TODO handle that, so this code is gone.
>
> Ok. Somewhat costly in comparison though.
>
I tried to add that back, guarded by update_vops. Handled in update_vuses,
vop_phi, insn_vops, vop_at_entry, replace_block_by.
>>> + case GSS_CALL:
>>> + if (!pt_solution_equal_p (gimple_call_use_set (s1),
>>> + gimple_call_use_set (s2))
>>>
>>> I don't understand why you are concerned about equality of
>>> points-to information. Why not simply ior it (pt_solution_ior_into - note
>>> they are shared so you need to unshare them first).
>>>
>>
>> I let a todo handle the alias info now.
>
> Hmm, that's not going to work if it's needed for correctness.
>
Should be handed my merge_calls now.
>>> +/* Return true if p1 and p2 can be considered equal. */
>>> +
>>> +static bool
>>> +pt_solution_equal_p (struct pt_solution *p1, struct pt_solution *p2)
>>>
>>> would go into tree-ssa-structalias.c instead.
>>>
>>> +static bool
>>> +gimple_base_equal_p (gimple s1, gimple s2)
>>> +{
>>> ...
>>> + if (gimple_modified_p (s1) || gimple_modified_p (s2))
>>> + return false;
>>>
>>> that shouldn't be of concern.
>>>
>>> + if (s1->gsbase.subcode != s2->gsbase.subcode)
>>> + return false;
>>>
>>> for assigns that are of class GIMPLE_SINGLE_RHS we do not
>>> update subcode during transformations so it can differ for now
>>> equal statements.
>>>
>>
>> handled properly now.
>>
>>> I'm not sure if a splay tree for the SSA name version equivalency
>>> map is the best representation - I would have used a simple
>>> array of num_ssa_names size and assign value-numbers
>>> (the lesser version for example).
>>>
>>> Thus equiv_insert would do
>>>
>>> value = MIN (SSA_NAME_VERSION (val1), SSA_NAME_VERSION (val2));
>>> values[SSA_NAME_VERSION (val1)] = value;
>>> values[SSA_NAME_VERSION (val2)] = value;
>>>
>>> if the names are not defined in bb1 resp. bb2 we would have to insert
>>> a PHI node in the merged block - that would be a cost thingy for
>>> doing this value-numbering in a more global way.
>>>
>>
>> local value numbering code has been removed.
>>
>>> You don't seem to be concerned about the SSA names points-to
>>> information, but it surely has the same issues as that of the calls
>>> (so either they should be equal or they should be conservatively
>>> merged). But as-is you don't allow any values to flow into the
>>> merged blocks that are not equal for both edges, no?
>>>
>>
>> Correct, that's still a todo.
>>
>>> + TV_TREE_CFG, /* tv_id */
>>>
>>> add a new timevar. We wan to be able to turn the pass off,
>>> so also add a new option (I can see it might make debugging harder
>>> in some cases).
>>>
>>
>> I added -ftree-tail-merge and TV_TREE_TAIL_MERGE.
>>
>>> Can you assess the effect of the patch on GCC itself (for example
>>> when building cc1?)? What's the size benefit and the compile-time
>>> overhead?
>>>
>>
>> effect on building cc1:
>>
>> real user sys
>> without: 19m50.158s 19m 2.090s 0m20.860s
>> with: 19m59.456s 19m17.170s 0m20.350s
>> ----------
>> +15.080s
>> +1.31%
>
> That's quite a lot of time.
>
Measurement for this version:
real user sys
without 19m59.995s 19m 9.970s 0m21.050s
with 19m56.160s 19m14.830s 0m21.530s
----------
+4.86s
+0.42%
text data bss dec hex filename
17547657 41736 1364384 18953777 1213631 without/cc1
17211049 41736 1364384 18617169 11c1351 with/cc1
--------
-336608
-1.92%
>> $ size without/cc1 with/cc1
>> text data bss dec hex filename
>> 17515986 41320 1364352 18921658 120b8ba without/cc1
>> 17399226 41320 1364352 18804898 11ef0a2 with/cc1
>> --------
>> -116760
>> -0.67%
>>
>> OK for trunk, provided build & reg-testing on ARM is ok?
>
> I miss additions to the testsuite.
>
I will send an updated patch on thread
http://gcc.gnu.org/ml/gcc-patches/2011-06/msg00625.html.
>
> +static bool
> +bb_dominated_by_p (basic_block bb1, basic_block bb2)
>
> please use
>
> + if (TREE_CODE (val1) == SSA_NAME)
> + {
> + if (!same_preds
> && !SSA_NAME_IS_DEFAULT_DEF (val1)
> && !dominated_by_p (bb2, gimple_bb (SSA_NAME_DEF_STMT (val1)))
> return false;
>
> instead. All stmts should have a BB apart from def stmts of default defs
> (which are gimple_nops).
>
Done.
> +/* Return the canonical scc_vn tree for X, if we can use scc_vn_info.
> + Otherwise, return X. */
> +
> +static tree
> +gvn_val (tree x)
> +{
> + return ((scc_vn_ok && x != NULL && TREE_CODE (x) == SSA_NAME)
> + ? VN_INFO ((x))->valnum : x);
> +}
>
> I suppose we want to export vn_valueize from tree-ssa-sccvn.c instead
> which seems to perform the same.
Done.
> Do you ever call the above
> when scc_vn_ok is false or x is NULL?
Not in this version. Earlier, I also ran the pass if sccvn bailed out, but pre
and fre only run if sccvn succeeded.
>
> +static bool
> +gvn_uses_equal (tree val1, tree val2, basic_block bb1,
> + basic_block bb2, bool same_preds)
> +{
> + gimple def1, def2;
> + basic_block def1_bb, def2_bb;
> +
> + if (val1 == NULL_TREE || val2 == NULL_TREE)
> + return false;
>
> does this ever happen?
Not in the current version. Removed.
>
> + if (gvn_val (val1) != gvn_val (val2))
> + return false;
>
> I suppose a shortcut
>
> if (val1 == val2)
> return true;
>
> is possible?
>
Indeed. Added.
> +static int *bb_size;
> +
> +/* Init bb_size administration. */
> +
> +static void
> +init_bb_size (void)
> +{
>
> if you need more per-BB info you can hook it into bb->aux. What's the
> size used for (I guess I'll see below ...)?
>
> + for (gsi = gsi_start_nondebug_bb (bb);
> + !gsi_end_p (gsi); gsi_next_nondebug (&gsi))
> + size++;
>
> is pretty rough. I guess for a quick false result for comparing BBs
> (which means you could initialize the info lazily?)
Done.
>
> +struct same_succ
> +{
> + /* The bbs that have the same successor bbs. */
> + bitmap bbs;
> + /* The successor bbs. */
> + bitmap succs;
> + /* Indicates whether the EDGE_TRUE/FALSE_VALUEs of succ_flags are swapped for
> + bb. */
> + bitmap inverse;
> + /* The edge flags for each of the successor bbs. */
> + VEC (int, heap) *succ_flags;
> + /* Indicates whether the struct is in the worklist. */
> + bool in_worklist;
> +};
>
> looks somewhat odd at first sight - maybe a overall comment what this
> is used for is missing. Well, let's see.
>
Tried to add an overall comment.
> +static hashval_t
> +same_succ_hash (const void *ve)
> +{
> + const_same_succ_t e = (const_same_succ_t)ve;
> + hashval_t hashval = bitmap_hash (e->succs);
> + int flags;
> + unsigned int i;
> + unsigned int first = bitmap_first_set_bit (e->bbs);
> + int size = bb_size [first];
> + gimple_stmt_iterator gsi;
> + gimple stmt;
> + basic_block bb = BASIC_BLOCK (first);
> +
> + hashval = iterative_hash_hashval_t (size, hashval);
> + for (gsi = gsi_start_nondebug_bb (bb);
> + !gsi_end_p (gsi); gsi_next_nondebug (&gsi))
> + {
> + stmt = gsi_stmt (gsi);
> + hashval = iterative_hash_hashval_t (gimple_code (stmt), hashval);
> + if (!is_gimple_call (stmt))
> + continue;
> + if (gimple_call_internal_p (stmt))
> + hashval = iterative_hash_hashval_t
> + ((hashval_t) gimple_call_internal_fn (stmt), hashval);
> + else
> + hashval = iterative_hash_expr (gimple_call_fn (stmt), hashval);
>
> you could also keep a cache of the BB hash as you keep a cache
> of the size (if this function is called multiple times per BB).
Right, I forgot it's a closed hash table. Added the cache of the bb hash.
> The hash looks relatively weak - for all asignments it will hash
> in GIMPLE_ASSIGN only ... I'd at least hash in gimple_assign_rhs_code.
Done.
> The call handling OTOH looks overly complicated to me ;)
>
That's an attempt to handle compiling insn-recog.c efficiently. All bbs without
successors are grouped together, and even after selecting on same function name,
there are still thousands of bbs in that group. I added now the args as well.
> The hash will be dependent on stmt ordering even if that doesn't matter,
> like
>
> i = i + 1;
> j = j - 1;
>
> vs. the swapped variant.
Right, that's a todo, added that in the header comment.
> Similar the successor edges are not sorted,
> so true/false edges may be in different order.
>
I keep a cache of the successor edge flags, in order of bbs.
> Not sure yet if your comparison function would make those BBs
> unequal anyway.
>
> +static bool
> +inverse_flags (const_same_succ_t e1, const_same_succ_t e2)
> +{
> + int f1a, f1b, f2a, f2b;
> + int mask = ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE);
> +
> + if (VEC_length (int, e1->succ_flags) != 2)
> + return false;
> ...
>
> I wonder why you keep a VEC of successor edges in same_succ_t
> instead of using the embedded successor edge vector in the basic_block
> structure?
>
To keep the information in order of bbs, for quick comparison.
> + bb_to_same_succ[bb->index] = *slot;
>
> looks like a candidate for per-BB info in bb->aux, too.
>
Done.
> +static void
> +find_same_succ (void)
> +{
> + int i;
> + same_succ_t same = same_succ_alloc ();
> +
> + for (i = 0; i < last_basic_block; ++i)
> + {
> + find_same_succ_bb (BASIC_BLOCK (i), &same);
> + if (same == NULL)
> + same = same_succ_alloc ();
> + }
>
> I suppose you want FOR_EACH_BB (excluding entry/exit block) or
> FOR_ALL_BB (including them). The above also can
> have BASIC_BLOCK(i) == NULL. Similar in other places.
>
Done.
> + for (i = 0; i < n1; ++i)
> + {
> + ei = EDGE_PRED (bb1, i);
> + for (j = 0; j < n2; ++j)
> + {
> + ej = EDGE_PRED (bb2, j);
> + if (ei->src != ej->src)
> + continue;
> + nr_matches++;
> + break;
> + }
> + }
>
> FOR_EACH_EDGE (ei, iterator, bb1->preds)
> if (!find_edge (ei->src, bb2))
> return false;
>
> is easier to parse.
>
Done.
> +static bool
> +gimple_subcode_equal_p (gimple s1, gimple s2, bool inv_cond)
> +{
> + tree var, var_type;
> + bool honor_nans;
> +
> + if (is_gimple_assign (s1)
> + && gimple_assign_rhs_class (s1) == GIMPLE_SINGLE_RHS)
> + return true;
>
> the subcode for GIMPLE_SINGLE_RHS is gimple_assign_rhs_code
> (TREE_CODE of gimple_assign_rhs1 actually).
>
> +static bool
> +gimple_base_equal_p (gimple s1, gimple s2, bool inv_cond)
>
> I wonder if you still need this given ..
>
> +static bool
> +gimple_equal_p (gimple s1, gimple s2, bool same_preds, bool inv_cond)
> +{
> + unsigned int i;
> + enum gimple_statement_structure_enum gss;
> + tree lhs1, lhs2;
> + basic_block bb1 = gimple_bb (s1), bb2 = gimple_bb (s2);
> +
> + /* Handle omp gimples conservatively. */
> + if (is_gimple_omp (s1) || is_gimple_omp (s2))
> + return false;
> +
> + /* Handle lhs. */
> + lhs1 = gimple_get_lhs (s1);
> + lhs2 = gimple_get_lhs (s2);
> + if (lhs1 != NULL_TREE && lhs2 != NULL_TREE)
> + return (same_preds && TREE_CODE (lhs1) == SSA_NAME
> + && TREE_CODE (lhs2) == SSA_NAME
> + && gvn_val (lhs1) == gvn_val (lhs2));
> + else if (!(lhs1 == NULL_TREE && lhs2 == NULL_TREE))
> + return false;
>
> all lhs equivalency is defered to GVN (which means all GIMPLE_ASSIGN
> and GIMPLE_CALL stmts with a lhs).
>
> That leaves the case of calls without a lhs. I'd rather structure this
> function like
>
> if (gimple_code (s1) != gimple_code (s2))
> return false;
> swithc (gimple_code (s1))
> {
> case GIMPLE_CALL:
> ... compare arguments ...
> if equal ok, if not and we have a lhs use GVN.
>
> case GIMPLE_ASSIGN:
> ... compare GVN of the LHS ...
>
> case GIMPLE_COND:
> ... compare operands ...
>
> default:
> return false;
> }
>
>
Done.
> +static bool
> +bb_gimple_equal_p (basic_block bb1, basic_block bb2, bool same_preds,
> + bool inv_cond)
> +
> +{
>
> you don't do an early out by comparing the pre-computed sizes.
This function is only called for bb with the same size. The hash table equal
fuction does have the size comparison.
> Mind
> you can have hashtable collisions where they still differ (did you
> check hashtable stats on it? how is the collision rate?)
>
I managed to lower the collision rate by specifying n_basic_blocks as hash table
size.
While compiling insn-recog.c, the highest collision rate for a function is 2.69.
> +static bool
> +bb_has_non_vop_phi (basic_block bb)
> +{
> + gimple_seq phis = phi_nodes (bb);
> + gimple phi;
> +
> + if (phis == NULL)
> + return false;
> +
> + if (!gimple_seq_singleton_p (phis))
> + return true;
> +
> + phi = gimple_seq_first_stmt (phis);
> + return !VOID_TYPE_P (TREE_TYPE (gimple_phi_result (phi)));
>
> return is_gimple_reg (gimple_phi_result (phi));
>
Done.
> +static void
> +update_debug_stmts (void)
> +{
> + int i;
> + basic_block bb;
> +
> + for (i = 0; i < last_basic_block; ++i)
> + {
> + gimple stmt;
> + gimple_stmt_iterator gsi;
> +
> + bb = BASIC_BLOCK (i);
>
> FOR_EACH_BB
>
> it must be possible to avoid scanning basic-blocks that are not affected
> by the transform, no? In fact the only affected basic-blocks should be
> those that were merged with another block?
Done. I also check for MAY_HAVE_DEBUG_STMTS now.
>
> + /* Mark vops for updating. Without this, TODO_update_ssa_only_virtuals
> + won't do anything. */
> + mark_sym_for_renaming (gimple_vop (cfun));
>
> it won't insert any PHIs, that's correct. Still somewhat ugly, a manual
> update of PHI nodes should be possible.
>
Added. I'm trying to be lazy about it though:
+ bool update_vops = ((todo & TODO_update_ssa_only_virtuals) == 0
+ || !symbol_marked_for_renaming (gimple_vop (cfun)));
If we're going to insert those phis anyway given the current todo, we don't bother.
> + if (dump_file)
> + {
> + fprintf (dump_file, "Before TODOs.\n");
>
> with TDF_DETAILS only please.
>
Done.
> + free_dominance_info (CDI_DOMINATORS);
>
> if you keep dominance info up-to-date there is no need to free it.
>
Indeed. And by not freeing it, the info was checked, and I hit validation errors
in that info, so the updating code had problems. I reverted back to calculating
when needed and freeing when changed.
> + TODO_verify_ssa | TODO_verify_stmts
> + | TODO_verify_flow | TODO_update_ssa_only_virtuals
> + | TODO_rebuild_alias
>
> please no TODO_rebuild_alias, simply remove it - alias info in merged
> paths should be compatible enough if there is value-equivalence between
> SSA names. At least you can't rely on TODO_rebuild_alias for
> correctness - it is skipped if IPA PTA was run for example.
>
Done.
> + | TODO_cleanup_cfg
>
> is that needed? If so return it from your execute function if you changed
> anything only. But I doubt your transformation introduces cleanup
> opportunities?
>
If all the predeccessor blocks of a block are merged, the block and it's
remaining predecessor block might be merged, so that is a cleanup opportunity.
Removed for the moment.
> New options and params need documentation in doc/invoke.texi.
>
Added.
> Thanks,
> Richard.
>
Bootstrapped and reg-tested on x86_64. Ok for trunk (after ARM testing)?
Thanks,
- Tom
2011-07-17 Tom de Vries <tom@codesourcery.com>
PR middle-end/43864
* tree-ssa-tail-merge.c: New file.
(struct same_succ): Define.
(same_succ_t, const_same_succ_t): New typedef.
(struct bb_cluster): Define.
(bb_cluster_t, const_bb_cluster_t): New typedef.
(struct aux_bb_info): Define.
(BB_SIZE, BB_SAME_SUCC, BB_CLUSTER, BB_VOP_AT_EXIT): Define.
(gvn_uses_equal): New function.
(same_succ_print, same_succ_print_traverse, same_succ_hash)
(inverse_flags, same_succ_equal, same_succ_alloc, same_succ_delete)
(same_succ_reset): New function.
(same_succ_htab, same_succ_edge_flags)
(deleted_bbs, deleted_bb_preds): New var.
(debug_same_succ): New function.
(worklist): New var.
(print_worklist, add_to_worklist, find_same_succ_bb, find_same_succ)
(init_worklist, delete_worklist, delete_basic_block_same_succ)
(same_succ_flush_bbs, update_worklist): New function.
(print_cluster, debug_cluster, same_predecessors)
(add_bb_to_cluster, new_cluster, delete_cluster): New function.
(all_clusters): New var.
(alloc_cluster_vectors, reset_cluster_vectors, delete_cluster_vectors)
(merge_clusters, set_cluster): New function.
(gimple_equal_p, find_duplicate, same_phi_alternatives_1)
(same_phi_alternatives, bb_has_non_vop_phi, find_clusters_1)
(find_clusters): New function.
(merge_calls, update_vuses, vop_phi, insn_vops, vop_at_entry)
(replace_block_by): New function.
(update_bbs): New var.
(apply_clusters): New function.
(update_debug_stmt, update_debug_stmts): New function.
(tail_merge_optimize): New function.
tree-flow.h (tail_merge_optimize): Declare.
* tree-ssa-pre.c (execute_pre): Use tail_merge_optimize.
* Makefile.in (OBJS-common): Add tree-ssa-tail-merge.o.
(tree-ssa-tail-merge.o): New rule.
* opts.c (default_options_table): Set OPT_ftree_tail_merge by default at
OPT_LEVELS_2_PLUS.
* tree-ssa-sccvn.c (vn_valueize): Move to ...
* tree-ssa-sccvn.h (vn_valueize): Here.
* tree-ssa-alias.h (pt_solution_ior_into_shared): Declare.
* tree-ssa-structalias.c (find_what_var_points_to): Factor out and
use ...
(pt_solution_share): New function.
(pt_solution_unshare, pt_solution_ior_into_shared): New function.
(delete_points_to_sets): Nullify shared_bitmap_table after deletion.
* timevar.def (TV_TREE_TAIL_MERGE): New timevar.
* common.opt (ftree-tail-merge): New switch.
* params.def (PARAM_MAX_TAIL_MERGE_COMPARISONS): New parameter.
* doc/invoke.texi (Optimization Options, -O2): Add -ftree-tail-merge.
(-ftree-tail-merge, max-tail-merge-comparisons): New item.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pr43864.31.patch
Type: text/x-patch
Size: 57892 bytes
Desc: not available
URL: <http://gcc.gnu.org/pipermail/gcc-patches/attachments/20110718/dfc13dde/attachment.bin>
More information about the Gcc-patches
mailing list