[Bug tree-optimization/81165] [8 Regression] Regression in GCC-8.0.0's optimizer

aoliva at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Wed Dec 6 12:13:00 GMT 2017


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81165

Alexandre Oliva <aoliva at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot       |aoliva at gcc dot gnu.org
                   |gnu.org                     |

--- Comment #13 from Alexandre Oliva <aoliva at gcc dot gnu.org> ---
Created attachment 42800
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42800&action=edit
patch, first try

This is my first cut at it.  I couldn't quite figure out how to determine
efficiently whether stmts were dead with a forward pass: I'd have to keep
track of all potentially dead stmts, probably without knowing for sure what's
dead and what isn't until the very end.  So I introduced the check as a
backward scan of all copied blocks in the path, performed only when we reach
the stmt count limit.  Scanning backwards, we can tell right away whether each
stmt is dead, so the scan also knows when to stop because we're past the
limit.  A rough sketch of the idea is below.
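
To illustrate the idea outside of GCC (a minimal self-contained toy, not
our GIMPLE data structures; stmt, count_live_stmts and the integer names
are all made-up stand-ins): walking backwards, a stmt is live iff it has
side effects or its def is consumed by a later live stmt, so deadness is
decidable the moment we visit it.

#include <unordered_set>
#include <vector>

struct stmt
{
  int def;                 // defined name, or -1 if none
  std::vector<int> uses;   // names this stmt reads
  bool has_side_effects;   // stores, calls, control flow...
};

/* Scan BLOCK backwards and return how many stmts are live, giving up
   (and returning -1) as soon as the live count exceeds LIMIT.  */
int
count_live_stmts (const std::vector<stmt> &block, int limit)
{
  std::unordered_set<int> live_names;
  int live = 0;
  for (auto it = block.rbegin (); it != block.rend (); ++it)
    {
      bool is_live = it->has_side_effects
                     || (it->def != -1 && live_names.count (it->def));
      if (!is_live)
        continue;      // dead: won't be copied, doesn't count
      if (++live > limit)
        return -1;     // past the limit, no point scanning further
      for (int u : it->uses)
        live_names.insert (u);
    }
  return live;
}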

This indeed solves the problem reported here, but in a kind of weird way.  We
perform very different threading in earlier passes, and we end up with all the
computation in the loop entry test all the way to dse3, past even thread4.
It's cddce3 that optimizes it out, and later we further simplify the
conditional set of t1, and its use, into a single (x1&1)!=0 (testl; je), even
better than the code we generated in 7.x.  However, the code at e.g. dom2
looks much uglier than what we get without this patch.

One thing I haven't done, but that I suspect might be useful, is to cache the
killed_stmts computation per block within a single pass.  I tried to cache it
in the path itself, but (unsurprisingly) we then never got any hits :-)
Do you believe this would be worthwhile?  I gather it would have to touch a
number of passes that call into the jump threading infrastructure, so that
each of them resets the cache.  A sketch of what I mean follows.
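
For concreteness, here's the shape of what I have in mind (hypothetical
again; killed_stmts_cache and the int block ids are stand-ins, not GCC
interfaces).  Instead of making every caller walk the cache to clear it,
a generation counter bumped at pass boundaries invalidates entries
lazily, so each pass only needs the one reset_for_new_pass call:

#include <unordered_map>
#include <unordered_set>

typedef std::unordered_set<int> killed_set;   // stmt ids known dead

struct killed_stmts_cache
{
  struct entry { unsigned generation; killed_set killed; };
  unsigned generation = 0;
  std::unordered_map<int, entry> per_block;

  /* Called by each pass that uses the jump threader; stale entries
     are then ignored on lookup rather than explicitly freed.  */
  void reset_for_new_pass () { ++generation; }

  const killed_set *
  lookup (int block_id) const
  {
    auto it = per_block.find (block_id);
    if (it == per_block.end () || it->second.generation != generation)
      return nullptr;    // missing, or left over from a prior pass
    return &it->second.killed;
  }

  void
  store (int block_id, killed_set killed)
  {
    per_block[block_id] = entry { generation, std::move (killed) };
  }
};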

Another thought that occurred to me was that, instead of scanning insns
backwards, I could start at the final control-flow stmt and just follow the
links from uses to defs.  This could avoid the linear scan, but it would
require somewhat more complexity and memory than the hash_map I'm using now:
I can see it working efficiently with one hash_map mapping SSA_NAMEs to their
remaining use counts within the block, plus one vector or bitmap to hold dead
defs pending back-propagation to their uses.  It feels like this could be
more efficient, but I'm hesitant because of the far more random access
patterns it would entail.  I guess I'll have to implement it and benchmark it
to know for sure...  A sketch follows.
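
Sketching that alternative with the same toy representation as before
(made-up types once more; in GCC the use-to-def links would come from
the SSA web, and uses outside the block would need accounting, which I
ignore here):

#include <unordered_map>
#include <vector>

struct stmt
{
  int id;
  int def;                 // defined name, or -1 if none
  std::vector<int> uses;   // names this stmt reads
  bool has_side_effects;
};

/* Return the ids of BLOCK's dead stmts, worklist-style.  */
std::vector<int>
find_dead_stmts (const std::vector<stmt> &block)
{
  std::unordered_map<int, const stmt *> def_stmt;
  std::unordered_map<int, int> remaining_uses;  // uses within the block
  for (const stmt &s : block)
    {
      if (s.def != -1)
        def_stmt[s.def] = &s;
      for (int u : s.uses)
        ++remaining_uses[u];
    }

  /* Seed with defs that have no uses in the block at all.  In the
     threading scenario we'd instead start by dropping the uses of the
     final control-flow stmt, since threading resolves it away.  */
  std::vector<const stmt *> worklist;
  for (const stmt &s : block)
    if (!s.has_side_effects && s.def != -1 && remaining_uses[s.def] == 0)
      worklist.push_back (&s);

  std::vector<int> dead;
  while (!worklist.empty ())
    {
      const stmt *s = worklist.back ();
      worklist.pop_back ();
      dead.push_back (s->id);
      /* S is dead, so its uses vanish; any def whose remaining use
         count drops to zero becomes dead in turn.  */
      for (int u : s->uses)
        if (--remaining_uses[u] == 0)
          {
            auto it = def_stmt.find (u);
            if (it != def_stmt.end () && !it->second->has_side_effects)
              worklist.push_back (it->second);
          }
    }
  return dead;
}

The hash_map lookups are what worry me: the backward scan touches each
stmt exactly once, in order, whereas this version chases pointers all
over the block.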

