[Design notes, RFC] Address-lowering prototype design (PR46556)

Richard Guenther richard.guenther@gmail.com
Tue Jun 7 10:07:00 GMT 2011

On Mon, Jun 6, 2011 at 8:07 PM, William J. Schmidt
<wschmidt@linux.vnet.ibm.com> wrote:
> It's been some time since I last posted about the address lowering issue from
> PR46556 (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46556).  I've had a basic
> prototype in place for some time, but I ran into some issues that initially
> caused performance regressions; I've had to jump through several hoops to
> resolve these.  At this time, I'm now seeing decent performance improvements
> from address lowering for PowerPC (about 2% overall on SPEC cpu2000).  A
> couple of minor degradations still occur that I am looking into.
> I thought this would be a good time to put out another RFC.  However, the code
> has grown over time, and I doubt whether all you busy people are particularly
> interested in reviewing a 2700-line prototype patch in detail.  Instead, I've
> written up some notes on the design, some of the challenges I ran into, and my
> current approach to handling those.
> I'd like to first get comments from interested parties on the prototype
> design, and get consensus on whether this is the appropriate overall
> approach.  If so, the next step would be to decide how to go about reviewing
> the code.  Since smaller patches are much easier to review, one possibility
> would be for me to gate the new pass on a flag that's off by default, and
> submit patches incrementally for review.  I.e., start with the bare-bones
> algorithm, and gradually add in the complexities as folks have time to review
> the code.  I'm quite new to the community, so any advice on how to handle
> this would be very welcome.
> In any case, here are my notes.  Thanks in advance for your views!
> Bill
> Basic algorithm
> ===============
> A new pass is added late in the gimple phases, currently between
> pass_phi_only_cprop and pass_cd_dce.  I chose this placement because the new
> pass can generate a fair amount of dead code.
> Address lowering walks the CFG in predominator order.  Common subexpression
> elimination is performed on operations that could participate in addressing
> expressions.  These operations may have been made available in the current
> block, or in any dominating block.
> Any statement that may contain a memory access (excluding calls) is considered
> for address lowering.  The memory access is expanded into an affine
> combination of trees by looking back at feeding statements, using logic
> originally developed for IVopts.  The affine combination is subjected to a
> number of tests to determine whether it's legal and profitable to lower the
> addressing expression into a TARGET_MEM_REF (or MEM_REF, when addressing is
> sufficiently simple).  If so, the conversion to a TMR is performed, again
> using logic originally developed for IVopts.
> The resulting memory reference is again subjected to a number of conditions
> testing for legality and profitability.  If these conditions are all met, the
> reference data is copied from the original expression to the TMR, and the TMR
> replaces the original expression in the gimple statement.
> The conversion of a memory access into a TMR generally creates a number of new
> gimple statements preceding the statement under analysis.  CSE is performed on
> the fly for these new statements as well, using the available expressions data
> discussed earlier.  This produces naturally shared addressability for members
> of the same structure, for example.
> Inhibitors
> ==========
> As mentioned, a number of conditions are checked before replacing a memory
> access with a TMR.  Conditions that inhibit replacement include the following.
>  * There is not yet any handling for bitfields.
>  * An expression containing a direct field reference from a small nonvolatile
>   structure is left alone.  It is assumed that these values will be
>   register-assigned, so exposing their addresses is counterproductive.
>  * I have seen odd cases in C++ code such as
>     lhs = MEM[(struct foo *) 0B].bar;
>   This results in an affine combination of trees with no elements, so an
>   explicit check for such empty combinations is required.
>  * If the affine combination of trees introduces an ADDR_EXPR for any decl
>   that is not already marked as addressable, we refuse to make the
>   replacement.  Taking the address of something for the first time defeats
>   downstream optimizations that would otherwise be possible.
>  * create_mem_ref can occasionally produce a TMR or MEM_REF with a base
>   expression of constant zero.  One way this can happen is with questionable
>   coding practices, such as treating an integer paramter as a pointer.  It's
>   an open question whether create_mem_ref should ever produce such a tree.
>   In any case, the base expression of zero again confuses the may-aliasing
>   machinery, so I refuse the replacement in this case.
>  * If the original expression will be recognized as a "hidden global store" in
>   tree-ssa-sink.c:is_hidden_global_store, but the replacement expression will
>   not, it is possible for the dead code eliminator to remove the modified
>   statement.  It seems to me this shouldn't normally happen in practice.  For
>   now I detect this case and refuse to do the replacement, but I suspect a
>   possible front-end or upstream-optimization problem here.  The test case
>   that fails here is libstdc++-v3/testsuite/23_containers/vector/
>   ext_pointer_modifiers/insert.cc.  More investigation required.

That indeed sounds odd.

> Probably no longer inhibitors
> =============================
> I currently am checking for some cases that I think may no longer be
> necessary, following this change:
> ------------------------------------------------------------------------
> r174282 | rguenth | 2011-05-26 08:01:48 -0500 (Thu, 26 May 2011) | 10 lines
> 2011-05-26  Richard Guenther  <rguenther@suse.de>
>        PR tree-optimization/48702
>        * tree-ssa-address.c (create_mem_ref_raw): Create MEM_REFs
>        only when we know the base address is within bounds.
>        * tree-ssa-alias.c (indirect_ref_may_alias_decl_p): Do not
>        assume the base address of TARGET_MEM_REFs is in bounds.
>        * gcc.dg/torture/pr48702.c: New testcase.
> ------------------------------------------------------------------------
> I am in process of checking whether removing these checks has any untoward
> consequences:
>  * There are several ways in which the affine combination of trees can contain
>   subexpressions with negative coefficients.  This is evil and must be
>   avoided, since it can result in a memory reference whose base address
>   pointer lies outside the object it references.  The may-aliasing machinery
>   cannot currently handle this.  So we refuse to make the replacement when a
>   negative coefficient is detected.
>  * Earlier induction variable optimization may also cause us to end up with a
>   base address below the beginning of an object.  (Ivars are often initialized
>   to the base of element -1 outside of a loop, and incremented at the top of
>   the loop.  Expanding into the affine combination of trees collapses this to
>   produce a base expression pointing outside the object.)  I avoid this by
>   checking whether the base expression is an SSA name for an ivar that is
>   directly or transitively defined by a PHI expression.
> Zero-offset memory references
> =============================
> Following is a fairly common pattern that results from address lowering with
> CSE:
>      <bb 2>:
>        D.2154_8 = n_2(D) * 4;
>        D.2107_3 = MEM[base: p_1(D), index: D.2154_8, offset: 0B];
>        D.2156_10 = p_1(D) + D.2154_8;
>        D.2108_4 = MEM[(struct x *)D.2156_10 + 128B];
>        D.2109_5 = MEM[(struct x *)D.2156_10 + 64B];
>        foo (D.2107_3, D.2108_4, D.2109_5);
>        return;
> When the addressing expression consists of two tree addends and a non-zero
> offset, the addends are pre-added and used as the base expression, with the
> offset becoming the immediate operand in the storage op.  When the offset is
> zero, the base and index are added implicitly in an indexed-form storage op.
> The bias to the indexed form with a zero offset is a good choice, in absence
> of knowledge of the surrounding code; it saves an instruction and a dependency
> with a possible small increase in register pressure.  As can be seen here,
> though, we end up with a missed commoning opportunity, since the second
> instruction could become:
>       D.2107_3 = MEM[(struct x *)D.2156_10 + 0B];
> Since the add into D.2156_10 is going to occur anyway, the advantage of the
> bias to indexed form vanishes.  Instead, it leads to an increase in register
> pressure.  This is more noticeable as expressions are CSEd across blocks.
> As the example shows, when the zero-offset memory reference is first
> encountered, the sum of its base and index components may not be a known
> available expression.  To find all opportunities where local CSE is possible
> for these, the zero-offset expressions are recorded during the basic
> algorithm.  At the end of each block, the zero-offset expressions are
> post-processed.  If the opportunity exists, the base is replaced with the SSA
> representative of the sum, and the index is changed to constant zero.  If
> necessary, the definition of the SSA representative of the sum is moved ahead
> of the zero-offset expression.  This is safe since both addends are known to
> be available there.
> Challenge with zero-offset memory references
> ============================================
> Unfortunately, when I first implemented this post-processing, I didn't see any
> change in code generation.  Upon investigation, it turned out that fwprop.c
> was undoing this optimization in forward_propagate_and_simplify.  To
> circumvent this, I ended up adding some extra code to identify zero-offset
> memory references that fit the pattern of the example (only in RTL).  I
> constrained fwprop.c from propagating into these as locally poor
> replacements.
> I recognize this might be a controversial choice, and I'm open to other
> suggestions.  But I found CSEing these to be a useful optimization that makes
> a difference in common situations, so I didn't want to lose it.
> Loss of aliasing information
> ============================
> The most serious problem I've run into is degraded performance due to poorer
> instruction scheduling choices.  I tracked this down to
> alias.c:nonoverlapping_component_refs_p.
> This code proves that two memory accesses don't overlap by attempting to prove
> that they access different fields of the same structure.  This is done using
> the MEM_EXPRs of the two rtx's, which record the expression trees that were
> translated into the rtx's during expand.  When address lowering is not
> present, a simple COMPONENT_REF will appear in the MEM_EXPR:  x.a, for
> example.  However, address lowering changes the simple COMPONENT_REF into a
> [TARGET_]MEM_REF that is no longer necessarily identifiable as a field
> reference.  Thus the aliasing machinery can no longer prove that two such
> field references are disjoint.
> This has severe consequences for performance, and has to be dealt with if
> address lowering is to be successful.
> I've worked around this with an admittedly fragile solution; I'll discuss the
> drawbacks below.  The idea is to construct a mapping from replacement mem_refs
> to the original expressions that they replaced.  When a MEM_EXPR is being set
> during expand, we first look up the mem_ref in the mapping.  If present, the
> MEM_EXPR is set to the original expression, rather than to the mem_ref.  This
> essentially duplicates the behavior in the absence of address lowering.

Ick.  We had this in the past via TMR_ORIGINAL which caused all sorts
of problems.  Removing it didn't cause much degradation because we now
preserve points-to information.

Originally I played with lowering all memory accesses to MEM_REFs
(see the old mem-ref branch), and the loss of type-based alias
disambiguation was indeed an issue.

But - I definitely do not like the idea of preserving something similar
to TMR_ORIGINAL.  Instead we can try preserving some information
we derive from it.  We keep the original access type that we can use
for TBAA but do not retain knowledge on whether the type of the
MEM_REF is valid for TBAA or if it is view-converted.

> This appears to work well in practice today.  However, it relies on there
> being no optimizations between the address lowering pass and expand that
> modify [TARGET_]MEM_REFs.  Dead store elimination may remove some, which is
> fine, of course.  But if a new pass was to be added that modified operands of
> mem_refs after address lowering, code generation would be quietly degraded.
> We would need some specific scan tests in the test suite to check for this.
> Another (less serious) problem is having to introduce a data structure that
> needs to be maintained beyond the lifetime of its associated pass.  It's a bit
> of a wart to have this side table that must be kept around until expand is
> finished.  It's workable, but not very clean.
> As an alternative, I went looking for ways to carry the original expression
> along with the mem_ref in the remaining gimple phases; but I wasn't able to
> find a good way to do that.  Perhaps others can suggest something.
> As noted, I recognize this is a fragile solution, and I'm very interested in
> other ideas for keeping the necessary information around until it's needed.
> Open issues
> ===========
> CSE of addressing expressions across blocks (or within very large blocks) can
> raise register pressure and potentially introduce spill code.  We may want
> some heuristics to constrain the distance across which expressions are
> considered available.
> I am investigating small degradations in 177.mesa, 187.facerec, 300.twolf, and
> 164.gzip.
> Performance testing on other platforms is needed.

More information about the Gcc-patches mailing list