Bug 60042 - vectorizer still does too many dependence tests for himeno:jacobi
Summary: vectorizer still does too many dependence tests for himeno:jacobi
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.9.0
Importance: P3 normal
Target Milestone: ---
Assignee: Richard Biener
URL:
Keywords: missed-optimization
Depends on: 23855
Blocks: vectorizer
 
Reported: 2014-02-03 11:19 UTC by Richard Biener
Modified: 2016-08-15 09:28 UTC
3 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2014-02-04 00:00:00


Attachments
himeno bench (2.88 KB, text/plain)
2014-02-03 11:19 UTC, Richard Biener
patch to prune deps to scalar globals (787 bytes, patch)
2014-02-03 11:23 UTC, Richard Biener
patch for the PRE issue (1.81 KB, patch)
2014-02-04 14:53 UTC, Richard Biener
updated patch (1.82 KB, patch)
2014-02-04 15:09 UTC, Richard Biener

Description Richard Biener 2014-02-03 11:19:19 UTC
Created attachment 32026 [details]
himeno bench

I still get

himenobmtxpa.c:296:9: note: disable versioning for alias - max number of generated checks exceeded
himenobmtxpa.c:296:9: note: too long list of versioning for alias run-time tests.

I have a patch that removes some false dependences, but I still get

himenobmtxpa.c:296:9: note: improved number of alias checks from 31 to 21

and 21 is too much - 7 should suffice.
Comment 1 Richard Biener 2014-02-03 11:23:09 UTC
Created attachment 32027 [details]
patch to prune deps to scalar globals

My patch to prune dependences to scalar global vars.
Comment 2 Richard Biener 2014-02-03 13:51:00 UTC
With some more dumping I see

himenobmtxpa.c:296:9: note: === vect_prune_runtime_alias_test_list ===
himenobmtxpa.c:296:9: note: merging ranges for *_205, *_324 and *_49, *_324
himenobmtxpa.c:296:9: note: merging ranges for *_205, *_324 and *_192, *_324
himenobmtxpa.c:296:9: note: merging ranges for *_168, *_324 and *_69, *_324
himenobmtxpa.c:296:9: note: merging ranges for *_168, *_324 and *_154, *_324
himenobmtxpa.c:296:9: note: merging ranges for *_265, *_324 and *_296, *_324
himenobmtxpa.c:296:9: note: merging ranges for *_265, *_324 and *_89, *_324
himenobmtxpa.c:296:9: note: merging ranges for *_174, *_324 and *_248, *_324
himenobmtxpa.c:296:9: note: merging ranges for *_174, *_324 and *_161, *_324
himenobmtxpa.c:296:9: note: merging ranges for *_211, *_324 and *_231, *_324
himenobmtxpa.c:296:9: note: merging ranges for *_211, *_324 and *_199, *_324
himenobmtxpa.c:296:9: note: improved number of alias checks from 31 to 21

and

Creating dr for *_205
analyze_innermost: success.
        base_address: pretmp_1004 + (sizetype) ((long unsigned int) pretmp_1009 * 4)
        offset from base address: 0
        constant offset from base address: 0
        step: 4
        aligned to: 128
        base_object: *pretmp_1004 + (sizetype) ((long unsigned int) pretmp_1009 * 4)
        Access function 0: {0B, +, 4}_7
Creating dr for *_168
analyze_innermost: success.
        base_address: pretmp_1004 + (sizetype) ((long unsigned int) pretmp_1023 * 4)
        offset from base address: 0
        constant offset from base address: 0
        step: 4
        aligned to: 128
        base_object: *pretmp_1004 + (sizetype) ((long unsigned int) pretmp_1023 * 4)
        Access function 0: {0B, +, 4}_7
Creating dr for *_265
analyze_innermost: success.
        base_address: pretmp_1004 + (sizetype) ((long unsigned int) pretmp_1034 * 4)
        offset from base address: 0
        constant offset from base address: 0
        step: 4
        aligned to: 128
        base_object: *pretmp_1004 + (sizetype) ((long unsigned int) pretmp_1034 * 4)
        Access function 0: {0B, +, 4}_7
Creating dr for *_174
analyze_innermost: success.
        base_address: pretmp_1004 + (sizetype) ((long unsigned int) pretmp_1063 * 4)
        offset from base address: 0
        constant offset from base address: 0
        step: 4
        aligned to: 128
        base_object: *pretmp_1004 + (sizetype) ((long unsigned int) pretmp_1063 * 4)
        Access function 0: {0B, +, 4}_7
...

so the remaining DDRs against *_324 all look related.

  pretmp_1062 = pretmp_1020 + pretmp_1047;
  pretmp_1063 = _25 * pretmp_1062;

  pretmp_1033 = j_380 + pretmp_1020;
  pretmp_1034 = _25 * pretmp_1033;

  pretmp_1022 = pretmp_1020 + pretmp_1021;
  pretmp_1023 = _25 * pretmp_1022;

but SCEV doesn't expand stmts before the loop and thus doesn't see this.
It's obviously far from trivial to merge segments with symbolic start
addresses ... these are multi-dimensional accesses:

        for(k=1 ; k<kmax ; k++){
          s0= MR(a,0,i,j,k)*MR(p,0,i+1,j,  k)
            + MR(a,1,i,j,k)*MR(p,0,i,  j+1,k)
            + MR(a,2,i,j,k)*MR(p,0,i,  j,  k+1)
            + MR(b,0,i,j,k)
             *( MR(p,0,i+1,j+1,k) - MR(p,0,i+1,j-1,k)
              - MR(p,0,i-1,j+1,k) + MR(p,0,i-1,j-1,k) )
            + MR(b,1,i,j,k)
             *( MR(p,0,i,j+1,k+1) - MR(p,0,i,j-1,k+1)
              - MR(p,0,i,j+1,k-1) + MR(p,0,i,j-1,k-1) )
            + MR(b,2,i,j,k)
             *( MR(p,0,i+1,j,k+1) - MR(p,0,i-1,j,k+1)
              - MR(p,0,i+1,j,k-1) + MR(p,0,i-1,j,k-1) )
            + MR(c,0,i,j,k) * MR(p,0,i-1,j,  k)
            + MR(c,1,i,j,k) * MR(p,0,i,  j-1,k)
            + MR(c,2,i,j,k) * MR(p,0,i,  j,  k-1)
            + MR(wrk1,0,i,j,k);

          ss= (s0*MR(a,3,i,j,k) - MR(p,0,i,j,k))*MR(bnd,0,i,j,k);

          gosa+= ss*ss;
          MR(wrk2,0,i,j,k)= MR(p,0,i,j,k) + omega*ss;
        }

and AFAIK we manage to merge the fastest-varying-dimension ±1 ones,
but not, for example, the ones for MR(p,0,i+1,j+1,k) and MR(p,0,i+1,j-1,k).
Ideally we would be able to derive a single check for each array
(which would require analyzing the DRs in the outer loops as well to
gather info about the other dimensions).
Comment 3 Richard Biener 2014-02-04 13:24:10 UTC
Bah, and with -fipa-pta -Ofast -fwhole-program we now _do_ see that there
isn't any aliasing but PRE messes up the loop and creates loop carried
dependencies ... :/
Comment 4 Richard Biener 2014-02-04 13:33:56 UTC
(In reply to Richard Biener from comment #3)
> Bah, and with -fipa-pta -Ofast -fwhole-program we now _do_ see that there
> isn't any aliasing but PRE messes up the loop and creates loop carried
> dependencies ... :/

Ok, that's because PRE would be the one to make the array loads' base
addresses simple induction variables (it performs invariant motion of the
address load from the global matrix struct).  Thus inhibit_phi_insertion
returns false.  Looks like the ordering of PRE / LIM isn't too great when
considering such cases.
Comment 5 Richard Biener 2014-02-04 14:53:25 UTC
Created attachment 32038 [details]
patch for the PRE issue

The attached patch delays inhibit_phi_insertion to a point where it can give
a more definitive answer (eliminate () time).
Comment 6 Richard Biener 2014-02-04 15:09:15 UTC
Created attachment 32039 [details]
updated patch
Comment 7 Jakub Jelinek 2014-02-04 16:36:49 UTC
As discussed on IRC, the #c1 patch would need to be moved after the
if (STMT_VINFO_GATHER_P (stmtinfo_a)...) {}
Getting the version checks below the default limit would be really nice; the benchmark numbers look much better when the loop is vectorized, even with
--param vect-max-version-for-alias-checks=32.
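For reference, the cap mentioned above can be raised on the command line to confirm the benchmark behavior; a possible invocation (flags other than the --param are my choice, default limit is 10):

```shell
# Raise the cap on emitted runtime alias checks above the 21 remaining
# here, and dump vectorizer details to verify that the loop at
# himenobmtxpa.c:296 is now vectorized.
gcc -Ofast -fdump-tree-vect-details \
    --param vect-max-version-for-alias-checks=32 \
    himenobmtxpa.c -o himeno
```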
Comment 8 Richard Biener 2014-02-05 11:42:08 UTC
For analysis of the DRs in outer loops we need to fix PR23855.
Comment 9 Richard Biener 2014-02-05 14:35:16 UTC
With PR23855 fixed, the analysis of the remaining DRs in the outermost loop
gives, for example for matrix A:

Creating dr for *_290
analyze_innermost: success.
        base_address: pretmp_1792 + (sizetype) ((long unsigned int) pretmp_1974 * 4)
        offset from base address: 0
        constant offset from base address: 4
        step: 4
        aligned to: 128
        base_object: *pretmp_1792 + (sizetype) ((long unsigned int) (((pretmp_1831 + 1) * pretmp_1794 + 1) * pretmp_1796) * 4)
        Access function 0: {{{4B, +, (sizetype) ((long unsigned int) (pretmp_1794 * pretmp_1796) * 4)}_2, +, (sizetype) ((long unsigned int) pretmp_1796 * 4)}_6, +, 4}_7

Creating dr for *_81
analyze_innermost: success.
        base_address: pretmp_1792 + (sizetype) ((long unsigned int) pretmp_1915 * 4)
        offset from base address: 0
        constant offset from base address: 4
        step: 4
        aligned to: 128
        base_object: *pretmp_1792 + (sizetype) ((long unsigned int) (((pretmp_1804 + 1) * pretmp_1794 + 1) * pretmp_1796) * 4)
        Access function 0: {{{4B, +, (sizetype) ((long unsigned int) (pretmp_1794 * pretmp_1796) * 4)}_2, +, (sizetype) ((long unsigned int) pretmp_1796 * 4)}_6, +, 4}_7

Creating dr for *_60
analyze_innermost: success.
        base_address: pretmp_1792 + (sizetype) ((long unsigned int) pretmp_1902 * 4)
        offset from base address: 0
        constant offset from base address: 4
        step: 4
        aligned to: 128
        base_object: *pretmp_1792 + (sizetype) ((long unsigned int) (((pretmp_1801 + 1) * pretmp_1794 + 1) * pretmp_1796) * 4)
        Access function 0: {{{4B, +, (sizetype) ((long unsigned int) (pretmp_1794 * pretmp_1796) * 4)}_2, +, (sizetype) ((long unsigned int) pretmp_1796 * 4)}_6, +, 4}_7

which still requires hard work to actually combine, as the base objects still
differ (though the access functions are equal).
Comment 10 Richard Biener 2014-04-14 13:57:33 UTC
Author: rguenth
Date: Mon Apr 14 13:57:00 2014
New Revision: 209374

URL: http://gcc.gnu.org/viewcvs?rev=209374&root=gcc&view=rev
Log:
2014-04-14  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/60042
	* tree-ssa-pre.c (inhibit_phi_insertion): Remove.
	(insert_into_preds_of_block): Do not prevent PHI insertion
	for REFERENCE exprs here ...
	(eliminate_dom_walker::before_dom_children): ... but prevent
	their use here under similar conditions when applied to the
	IL after PRE optimizations.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-pre.c
Comment 11 Richard Biener 2016-08-15 09:28:26 UTC
Trunk still has 22 alias checks (so it even got worse).