[Bug middle-end/54394] New: fatigue2 -flto run time regression

Tue Aug 28 22:32:00 GMT 2012

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=54394

             Bug #: 54394
           Summary: fatigue2 -flto run time regression
    Classification: Unclassified
           Product: gcc
           Version: 4.8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
        AssignedTo: unassigned@gcc.gnu.org
        ReportedBy: jamborm@gcc.gnu.org
                CC: rguenth@gcc.gnu.org
              Host: x86_64-linux-gnu
            Target: x86_64-linux-gnu

Revision 190346 caused a large run time regression of the fatigue2
Polyhedron benchmark when run with -Ofast -flto.  On an x86_64-linux
box, the run time went from 150 seconds to 215 seconds (about 43%
slower) and there is a similar percentage increase on my i686-linux
desktop.

The commit leading to that revision is:

    2012-08-13  Richard Guenther  <rguenther@suse.de>

        * basic-block.h (struct basic_block): Remove loop_depth
        member, move flags and index members next to each other.
        * cfgloop.h (bb_loop_depth): New inline function.
        * cfghooks.c (split_block): Do not set loop_depth.
        (duplicate_block): Likewise.
        * cfgloop.c (flow_loop_nodes_find): Likewise.
        (flow_loops_find): Likewise.
        (add_bb_to_loop): Likewise.
        (remove_bb_from_loops): Likewise.
        * cfgrtl.c (force_nonfallthru_and_redirect): Likewise.
        * gimple-streamer-in.c (input_bb): Do not stream loop_depth.
        * gimple-streamer-out.c (output_bb): Likewise.
        * bt-load.c: Include cfgloop.h.
        (migrate_btr_defs): Use bb_loop_depth.
        * cfg.c (dump_bb_info): Likewise.
        * final.c (compute_alignments): Likewise.
        * ira.c (update_equiv_regs): Likewise.
        * tree-ssa-copy.c (init_copy_prop): Likewise.
        * tree-ssa-dom.c (loop_depth_of_name): Likewise.
        * tree-ssa-forwprop.c: Include cfgloop.h.
        (forward_propagate_addr_expr): Use bb_loop_depth.
        * tree-ssa-pre.c (insert_into_preds_of_block): Likewise.
        * tree-ssa-sink.c (select_best_block): Likewise.
        * ipa-inline-analysis.c: Include cfgloop.h.
        (estimate_function_body_sizes): Use bb_loop_depth.
        * Makefile.in (tree-ssa-forwprop.o): Depend on $(CFGLOOP_H).
        (ipa-inline-analysis.o): Likewise.
        (bt-load.o): Likewise.

        * gcc.dg/tree-prof/update-loopch.c: Adjust.

I believe the patch was not supposed to alter compiler output in any
(significant) way.  However, inlining decisions are different (file 1
is the dump before the patch, file 2 with it):

  In file 1: extra inlining into function MAIN__.2477/17
    Function __computer_time_m_MOD_computer_time/13 inlined 1 times (as opposed to 0 times)
    Function __perdida_m_MOD_perdida/16 inlined 1 times (as opposed to 0 times)

  In file 2: extra inlining into function MAIN__.2477/17
    Function __free_input_MOD_convert_lower_case/9 inlined 1 times (as opposed to 0 times)
    Function __free_input_MOD_convert_lower_case.part.2.2390/62 inlined 1 times (as opposed to 0 times)
    Function __read_input_m_MOD_read_input/12 inlined 1 times (as opposed to 0 times)

  In file 2: extra un-inlined function __perdida_m_MOD_perdida/16
    Callers: 1, Callees: 27, Inlinees: 0

  In file 1: extra un-inlined function __read_input_m_MOD_read_input.constprop.0/122
    Originally a clone of __read_input_m_MOD_read_input/12
    Callers: 1, Callees: 530, Inlinees: 22

At the same time, this does not seem to be an LTO issue, because the
inline dump from the compile stage (as opposed to the link stage)
before the patch contains the lines:

    __perdida_m_MOD_perdida/9 function not considered for inlining
      loop depth: 2 freq:53666 size:21 time: 30 callee size: 0 stack: 0

which the patch changes to:

    __perdida_m_MOD_perdida/9 function not considered for inlining
      loop depth: 0 freq:53666 size:21 time: 30 callee size: 0 stack: 0

LTO merely lets the heuristics consider inlining perdida as a function
called just once.  Loop depth 0 makes the candidate look cold and not
beneficial, even when we know there are no other callers.

Loop depth is zero because, at the time of inlining analysis,
bb->loop_father is NULL.  So it seems we need to compute loops at the
beginning of inline summary generation?
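
To illustrate the failure mode, here is a minimal standalone sketch,
not GCC source.  The types loop_model and bb_model and all model_*
functions are hypothetical stand-ins invented for this example; they
only assume what the report states, namely that the depth query reads
0 whenever the block has no loop_father.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the compiler's loop
   and basic-block structures.  */
struct loop_model
{
  int depth;                        /* nesting depth of this loop */
};

struct bb_model
{
  struct loop_model *loop_father;   /* NULL until loops are computed */
};

/* Models the behaviour described in the report: without loop
   information the block silently appears to be at depth 0.  */
static int
model_bb_loop_depth (const struct bb_model *bb)
{
  return bb->loop_father ? bb->loop_father->depth : 0;
}

/* Hypothetical stand-in for "compute loops before the analysis".  */
static void
model_compute_loops (struct bb_model *bb, struct loop_model *loop)
{
  bb->loop_father = loop;
}

/* Walks through both states: loops missing, then loops computed.  */
static void
model_demo (void)
{
  struct loop_model inner = { 2 };
  struct bb_model call_site = { NULL };

  /* Loops not computed yet: the call site looks like depth 0.  */
  assert (model_bb_loop_depth (&call_site) == 0);

  /* After computing loops the real nesting depth is visible.  */
  model_compute_loops (&call_site, &inner);
  assert (model_bb_loop_depth (&call_site) == 2);
}
```

In this model the same call site reports depth 0 or depth 2 depending
only on whether loop information was attached before the query, which
matches the difference between the two dump excerpts above.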