Bug 89578 - [9 Regression] 5% runtime regression for 481.wrf at -Ofast -flto
Summary: [9 Regression] 5% runtime regression for 481.wrf at -Ofast -flto
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 9.0
: P3 normal
Target Milestone: 9.0
Assignee: Richard Biener
URL:
Keywords:
Depends on:
Blocks: spec 87609 90328
  Show dependency treegraph
 
Reported: 2019-03-04 14:05 UTC by Richard Biener
Modified: 2019-09-05 12:12 UTC (History)
2 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2019-03-04 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Richard Biener 2019-03-04 14:05:25 UTC
I see with -march=native on haswell in addition to -Ofast -flto ~5% runtime regression for 481.wrf with the last known good rev. being r269093 and
the first known bad one r269146.
Comment 2 Richard Biener 2019-03-04 14:08:44 UTC
Suspicious changes include the fix for PR87609 and the new pass_remove_partial_avx_dependency.  I see no other relevant changes.
Comment 3 Martin Liška 2019-03-04 15:08:50 UTC
(In reply to Richard Biener from comment #2)
> Suspicious changes include the fix for PR87609 and the new
> pass_remove_partial_avx_dependency.  I see no other relevant changes.

Yes, started with r269098.
Comment 4 Richard Biener 2019-03-06 11:29:58 UTC
Investigating, but I guess there's nothing to fix :/
Comment 5 Richard Biener 2019-03-06 13:50:49 UTC
So it looks like the change from r269097 to r269098 is only 2.5% (I've included
the followup fix ontop of r269098):

481.wrf         11170        245       45.7 *   11170        251       44.5 S
481.wrf         11170        246       45.4 S   11170        251       44.5 *

Checking r269096 reveals (BASE = r269096, PEAK = r269098 + followup):

481.wrf         11170        244       45.7 S   11170        254       44.0 S
481.wrf         11170        244       45.8 *   11170        252       44.3 *

Note that fortran guarantees on parameter aliasing are stronger than
those of restrict qualified pointers so using restrict isn't perfect
and it might suffer from the correctness fix unnecessarily.  In a
similar fashion performing inlining might make PTA do more conservative
choices than when looking at the fnspec guarantees.

For r269096 there's the possibility of simply never recomputing restrict
(gate it with !cfun->after_inlining).

Overall I see

   9.90%        201223  wrf_peak.amd64-  wrf_peak.amd64-m64-gcc42-nn  [.] solve_interface_
   8.50%        172668  wrf_base.amd64-  wrf_base.amd64-m64-gcc42-nn  [.] solve_interface_

which explains why we only see this with -flto (this function does nothing
but call other functions...).  It doesn't really explain why r269096 makes
such a big difference though (I guess it must be the PRE PTA recompute
triggering this).

But the function is a huge mess and the profile quite flat and similar...
Comment 6 Richard Biener 2019-03-07 10:55:28 UTC
One difference that is clearly visible is missed vectorization of

 module_small_step_em.fppized.f90:1399:14: note:  LOOP VECTORIZED
 module_small_step_em.fppized.f90:1376:14: note:  LOOP VECTORIZED
-module_small_step_em.fppized.f90:1376:14: note:  LOOP VECTORIZED
 module_small_step_em.fppized.f90:1354:14: note:  LOOP VECTORIZED

This is the following inner loop which we vectorize before the change but not after.  Before we have

create runtime check for data references MEM[(float[0:D.14326] *)_891][_13070] and MEM[(float[0:D.14326] *)_891][_4424]
module_small_step_em.fppized.f90:1376:14: note:  created 1 versioning for alias checks.
module_small_step_em.fppized.f90:1376:14: optimized:  loop versioned for vectorization because of possible aliasing

where afterwards

module_small_step_em.fppized.f90:1376:14: missed:   number of versioning for alias run-time tests exceeds 10 (--param vect-max-version-for-alias-checks)

SUBROUTINE advance_w( w, rw_tend, ww, u, v,       &
...
! Jammed 3 doubly nested loops over k/i into 1 for slight improvement
! in efficiency.  No change in results (bit-for-bit). JM 20040514
! (left a blank line where the other two k/i-loops were)
!
    DO k=2,k_end
      DO i=i_start, i_end
        w(i,k,j)=w(i,k,j)+dts*rw_tend(i,k,j)                       &
                 + msft_inv(i)*cqw(i,k,j)*(                        &
            +.5*dts*g*mut_inv(i)*rdn(k)*                           &
             (c2a(i,k  ,j)*rdnw(k  )                               &
        *((1.+epssm)*(rhs(i,k+1  )-rhs(i,k    ))                   &
         +(1.-epssm)*(ph(i,k+1,j)-ph(i,k  ,j)))                    &
             -c2a(i,k-1,j)*rdnw(k-1)                               &
        *((1.+epssm)*(rhs(i,k    )-rhs(i,k-1  ))                   &
         +(1.-epssm)*(ph(i,k  ,j)-ph(i,k-1,j)))))                  &

                +dts*g*msft_inv(i)*(rdn(k)*                        &
             (c2a(i,k  ,j)*alt(i,k  ,j)*t_2ave(i,k  ,j)            &
             -c2a(i,k-1,j)*alt(i,k-1,j)*t_2ave(i,k-1,j))           &
               +(rdn(k)*(c2a(i,k  ,j)*alb(i,k  ,j)                 &
                        -c2a(i,k-1,j)*alb(i,k-1,j))*mut_inv(i)-1.) &
                     *muave(i,j))
      ENDDO
    ENDDO


There is almost no difference in the IL after if-conversion but changes
like

-  _14697 = MEM[(float[0:D.14440] *)_959 clique 40 base 47][_15604];
+  _14697 = MEM[(float[0:D.14440] *)_959 clique 71 base 47][_15604];
   _9355 = _14697 * 4.905000209808349609375e+0;
-  _9371 = MEM[(float[0:D.14436] *)_957 clique 40 base 28][_15604];
+  _9371 = MEM[(float[0:D.14436] *)_957 clique 71 base 28][_15604];

which is introduced by unroll-and-jam by means of unrolling the outer loop
once.  The restrict clique remapping is strictly speaking required here.

There are of course cases where remapping can be elided but they are
difficult to reconstruct.

Basically each clique > 1 belongs to a specific inline block and whether
we need to remap it when copying a stmt depends on whether we duplicate
"within" that block or we duplicate the block itself.

A good "hint" would be (for example for unroll-and-jam) whether cliques
mentioned in the to be copied loop are also mentioned in stmts belonging
exclusively to the outer loop that is unrolled (but then they could have
been moved there by invariant motion for example).

I'm not sure if we want to look at the BLOCK tree.  Or whether we want to
build a similar thing into the loop structure, noting inline contexts
(the single clique that needs not remapping when copying the loop).

Note I haven't been able to verify whether the above loop is executed
at all but -fno-loop-unroll-and-jam makes performance equal (and faster
than ever...).
Comment 7 Richard Biener 2019-03-08 10:20:36 UTC
Fixed.
Comment 8 Richard Biener 2019-03-08 10:20:47 UTC
Author: rguenth
Date: Fri Mar  8 10:20:12 2019
New Revision: 269484

URL: https://gcc.gnu.org/viewcvs?rev=269484&root=gcc&view=rev
Log:
2019-03-08  Richard Biener  <rguenther@suse.de>

	PR middle-end/89578
	* cfgloop.h (struct loop): Add owned_clique field.
	* cfgloopmanip.c (copy_loop_info): Copy it.
	* tree-cfg.c (gimple_duplicate_bb): Do not remap owned_clique
	cliques.
	* tree-inline.c (copy_loops): Remap owned_clique.
	* lto-streamer-in.c (input_cfg): Stream owned_clique.
	* lto-streamer-out.c (output_cfg): Likewise.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/cfgloop.h
    trunk/gcc/cfgloopmanip.c
    trunk/gcc/lto-streamer-in.c
    trunk/gcc/lto-streamer-out.c
    trunk/gcc/tree-cfg.c
    trunk/gcc/tree-inline.c
Comment 9 Richard Biener 2019-08-30 11:11:33 UTC
Author: rguenth
Date: Fri Aug 30 11:11:01 2019
New Revision: 275069

URL: https://gcc.gnu.org/viewcvs?rev=275069&root=gcc&view=rev
Log:
2019-08-30  Richard Biener  <rguenther@suse.de>

	* lto-streamer.h (LTO_minor_version): Bump.

	Backport from mainline
	2019-05-06  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/90328
	* tree-data-ref.h (dr_may_alias_p): Pass in the actual loop nest.
	* tree-data-ref.c (dr_may_alias_p): Check whether the clique
	is valid in the loop nest before using it.
	(initialize_data_dependence_relation): Adjust.
	* graphite-scop-detection.c (build_alias_set): Pass the SCOP enclosing
	loop as loop-nest to dr_may_alias_p.

	* gcc.dg/torture/pr90328.c: New testcase.

	2019-03-08  Richard Biener  <rguenther@suse.de>

	PR middle-end/89578
	* cfgloop.h (struct loop): Add owned_clique field.
	* cfgloopmanip.c (copy_loop_info): Copy it.
	* tree-cfg.c (gimple_duplicate_bb): Do not remap owned_clique
	cliques.
	* tree-inline.c (copy_loops): Remap owned_clique.
	* lto-streamer-in.c (input_cfg): Stream owned_clique.
	* lto-streamer-out.c (output_cfg): Likewise.

	2019-02-22  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/87609
	* tree-cfg.c (gimple_duplicate_bb): Only remap inlined cliques.

	2019-02-22  Richard Biener  <rguenther@suse.de>

	PR middle-end/87609
	* cfghooks.h (dependence_hash): New typedef.
	(struct copy_bb_data): New type.
	(cfg_hooks::duplicate_block): Adjust to take a copy_bb_data argument.
	(duplicate_block): Likewise.
	* cfghooks.c (duplicate_block): Pass down copy_bb_data.
	(copy_bbs): Create and pass down copy_bb_data.
	* cfgrtl.c (cfg_layout_duplicate_bb): Adjust.
	(rtl_duplicate_bb): Likewise.
	* tree-cfg.c (gimple_duplicate_bb): If the copy_bb_data arg is not NULL
	remap dependence info.

	* gcc.dg/torture/restrict-7.c: New testcase.

	2019-02-22  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/87609
	* tree-core.h (tree_base): Document special clique values.
	* tree-inline.c (remap_dependence_clique): Do not use the
	special clique value of one.
	(maybe_set_dependence_info): Use clique one.
	(clear_dependence_clique): New callback.
	(compute_dependence_clique): Clear clique one from all refs
	before assigning it (again).

Added:
    branches/gcc-8-branch/gcc/testsuite/gcc.dg/torture/pr90328.c
    branches/gcc-8-branch/gcc/testsuite/gcc.dg/torture/restrict-7.c
Modified:
    branches/gcc-8-branch/gcc/ChangeLog
    branches/gcc-8-branch/gcc/cfghooks.c
    branches/gcc-8-branch/gcc/cfghooks.h
    branches/gcc-8-branch/gcc/cfgloop.h
    branches/gcc-8-branch/gcc/cfgloopmanip.c
    branches/gcc-8-branch/gcc/cfgrtl.c
    branches/gcc-8-branch/gcc/graphite-scop-detection.c
    branches/gcc-8-branch/gcc/lto-streamer-in.c
    branches/gcc-8-branch/gcc/lto-streamer-out.c
    branches/gcc-8-branch/gcc/lto-streamer.h
    branches/gcc-8-branch/gcc/testsuite/ChangeLog
    branches/gcc-8-branch/gcc/tree-cfg.c
    branches/gcc-8-branch/gcc/tree-core.h
    branches/gcc-8-branch/gcc/tree-data-ref.c
    branches/gcc-8-branch/gcc/tree-data-ref.h
    branches/gcc-8-branch/gcc/tree-inline.c
    branches/gcc-8-branch/gcc/tree-ssa-structalias.c
Comment 10 Richard Biener 2019-09-05 12:12:29 UTC
Author: rguenth
Date: Thu Sep  5 12:11:52 2019
New Revision: 275405

URL: https://gcc.gnu.org/viewcvs?rev=275405&root=gcc&view=rev
Log:
2019-09-05  Richard Biener  <rguenther@suse.de>

	* lto-streamer.h (LTO_minor_version): Bump.

	Backport from mainline
	2019-05-06  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/90328
	* tree-data-ref.h (dr_may_alias_p): Pass in the actual loop nest.
	* tree-data-ref.c (dr_may_alias_p): Check whether the clique
	is valid in the loop nest before using it.
	(initialize_data_dependence_relation): Adjust.
	* graphite-scop-detection.c (build_alias_set): Pass the SCOP enclosing
	loop as loop-nest to dr_may_alias_p.

	* gcc.dg/torture/pr90328.c: New testcase.

	2019-03-08  Richard Biener  <rguenther@suse.de>

	PR middle-end/89578
	* cfgloop.h (struct loop): Add owned_clique field.
	* cfgloopmanip.c (copy_loop_info): Copy it.
	* tree-cfg.c (gimple_duplicate_bb): Do not remap owned_clique
	cliques.
	* tree-inline.c (copy_loops): Remap owned_clique.
	* lto-streamer-in.c (input_cfg): Stream owned_clique.
	* lto-streamer-out.c (output_cfg): Likewise.

	2019-02-22  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/87609
	* tree-cfg.c (gimple_duplicate_bb): Only remap inlined cliques.

	2019-02-22  Richard Biener  <rguenther@suse.de>

	PR middle-end/87609
	* cfghooks.h (dependence_hash): New typedef.
	(struct copy_bb_data): New type.
	(cfg_hooks::duplicate_block): Adjust to take a copy_bb_data argument.
	(duplicate_block): Likewise.
	* cfghooks.c (duplicate_block): Pass down copy_bb_data.
	(copy_bbs): Create and pass down copy_bb_data.
	* cfgrtl.c (cfg_layout_duplicate_bb): Adjust.
	(rtl_duplicate_bb): Likewise.
	* tree-cfg.c (gimple_duplicate_bb): If the copy_bb_data arg is not NULL
	remap dependence info.

	* gcc.dg/torture/restrict-7.c: New testcase.

	2019-02-22  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/87609
	* tree-core.h (tree_base): Document special clique values.
	* tree-inline.c (remap_dependence_clique): Do not use the
	special clique value of one.
	(maybe_set_dependence_info): Use clique one.
	(clear_dependence_clique): New callback.
	(compute_dependence_clique): Clear clique one from all refs
	before assigning it (again).

Added:
    branches/gcc-7-branch/gcc/testsuite/gcc.dg/torture/pr90328.c
    branches/gcc-7-branch/gcc/testsuite/gcc.dg/torture/restrict-7.c
Modified:
    branches/gcc-7-branch/gcc/ChangeLog
    branches/gcc-7-branch/gcc/cfghooks.c
    branches/gcc-7-branch/gcc/cfghooks.h
    branches/gcc-7-branch/gcc/cfgloop.h
    branches/gcc-7-branch/gcc/cfgloopmanip.c
    branches/gcc-7-branch/gcc/cfgrtl.c
    branches/gcc-7-branch/gcc/graphite-scop-detection.c
    branches/gcc-7-branch/gcc/lto-streamer-in.c
    branches/gcc-7-branch/gcc/lto-streamer-out.c
    branches/gcc-7-branch/gcc/lto-streamer.h
    branches/gcc-7-branch/gcc/testsuite/ChangeLog
    branches/gcc-7-branch/gcc/tree-cfg.c
    branches/gcc-7-branch/gcc/tree-core.h
    branches/gcc-7-branch/gcc/tree-data-ref.c
    branches/gcc-7-branch/gcc/tree-data-ref.h
    branches/gcc-7-branch/gcc/tree-inline.c
    branches/gcc-7-branch/gcc/tree-ssa-structalias.c