Created attachment 32026 [details]
himeno bench

I still get

himenobmtxpa.c:296:9: note: disable versioning for alias - max number of generated checks exceeded
himenobmtxpa.c:296:9: note: too long list of versioning for alias run-time tests.

I have a patch to remove some false dependences but still

himenobmtxpa.c:296:9: note: improved number of alias checks from 31 to 21

and 21 is too much - 7 should suffice.
Created attachment 32027 [details]
patch to prune deps to scalar globals

My patch to prune dependences to scalar global vars.
With some more dumping I see

himenobmtxpa.c:296:9: note: === vect_prune_runtime_alias_test_list ===
himenobmtxpa.c:296:9: note: merging ranges for *_205, *_324 and *_49, *_324
himenobmtxpa.c:296:9: note: merging ranges for *_205, *_324 and *_192, *_324
himenobmtxpa.c:296:9: note: merging ranges for *_168, *_324 and *_69, *_324
himenobmtxpa.c:296:9: note: merging ranges for *_168, *_324 and *_154, *_324
himenobmtxpa.c:296:9: note: merging ranges for *_265, *_324 and *_296, *_324
himenobmtxpa.c:296:9: note: merging ranges for *_265, *_324 and *_89, *_324
himenobmtxpa.c:296:9: note: merging ranges for *_174, *_324 and *_248, *_324
himenobmtxpa.c:296:9: note: merging ranges for *_174, *_324 and *_161, *_324
himenobmtxpa.c:296:9: note: merging ranges for *_211, *_324 and *_231, *_324
himenobmtxpa.c:296:9: note: merging ranges for *_211, *_324 and *_199, *_324
himenobmtxpa.c:296:9: note: improved number of alias checks from 31 to 21

and

Creating dr for *_205
        analyze_innermost: success.
        base_address: pretmp_1004 + (sizetype) ((long unsigned int) pretmp_1009 * 4)
        offset from base address: 0
        constant offset from base address: 0
        step: 4
        aligned to: 128
        base_object: *pretmp_1004 + (sizetype) ((long unsigned int) pretmp_1009 * 4)
        Access function 0: {0B, +, 4}_7
Creating dr for *_168
        analyze_innermost: success.
        base_address: pretmp_1004 + (sizetype) ((long unsigned int) pretmp_1023 * 4)
        offset from base address: 0
        constant offset from base address: 0
        step: 4
        aligned to: 128
        base_object: *pretmp_1004 + (sizetype) ((long unsigned int) pretmp_1023 * 4)
        Access function 0: {0B, +, 4}_7
Creating dr for *_265
        analyze_innermost: success.
        base_address: pretmp_1004 + (sizetype) ((long unsigned int) pretmp_1034 * 4)
        offset from base address: 0
        constant offset from base address: 0
        step: 4
        aligned to: 128
        base_object: *pretmp_1004 + (sizetype) ((long unsigned int) pretmp_1034 * 4)
        Access function 0: {0B, +, 4}_7
Creating dr for *_174
        analyze_innermost: success.
        base_address: pretmp_1004 + (sizetype) ((long unsigned int) pretmp_1063 * 4)
        offset from base address: 0
        constant offset from base address: 0
        step: 4
        aligned to: 128
        base_object: *pretmp_1004 + (sizetype) ((long unsigned int) pretmp_1063 * 4)
        Access function 0: {0B, +, 4}_7
...

so the remaining DDRs against *_324 all look related.

  pretmp_1062 = pretmp_1020 + pretmp_1047;
  pretmp_1063 = _25 * pretmp_1062;

  pretmp_1033 = j_380 + pretmp_1020;
  pretmp_1034 = _25 * pretmp_1033;

  pretmp_1022 = pretmp_1020 + pretmp_1021;
  pretmp_1023 = _25 * pretmp_1022;

but SCEV doesn't expand stmts before the loop and thus doesn't see this.
It's obviously far from trivial to merge segments with symbolic start
addresses ... these are multi-dimensional accesses:

    for(k=1 ; k<kmax ; k++){
       s0= MR(a,0,i,j,k)*MR(p,0,i+1,j,  k)
         + MR(a,1,i,j,k)*MR(p,0,i,  j+1,k)
         + MR(a,2,i,j,k)*MR(p,0,i,  j,  k+1)
         + MR(b,0,i,j,k)
          *( MR(p,0,i+1,j+1,k) - MR(p,0,i+1,j-1,k)
            - MR(p,0,i-1,j+1,k) + MR(p,0,i-1,j-1,k) )
         + MR(b,1,i,j,k)
          *( MR(p,0,i,j+1,k+1) - MR(p,0,i,j-1,k+1)
            - MR(p,0,i,j+1,k-1) + MR(p,0,i,j-1,k-1) )
         + MR(b,2,i,j,k)
          *( MR(p,0,i+1,j,k+1) - MR(p,0,i-1,j,k+1)
            - MR(p,0,i+1,j,k-1) + MR(p,0,i-1,j,k-1) )
         + MR(c,0,i,j,k) * MR(p,0,i-1,j,  k)
         + MR(c,1,i,j,k) * MR(p,0,i,  j-1,k)
         + MR(c,2,i,j,k) * MR(p,0,i,  j,  k-1)
         + MR(wrk1,0,i,j,k);

       ss= (s0*MR(a,3,i,j,k) - MR(p,0,i,j,k))*MR(bnd,0,i,j,k);
       gosa+= ss*ss;

       MR(wrk2,0,i,j,k)= MR(p,0,i,j,k) + omega*ss;
    }

and we manage to merge the fastest varying dimension +-1 ones AFAIK, but not
for example the ones for MR(p,0,i+1,j+1,k) and MR(p,0,i+1,j-1,k).

Ideally we would be able to derive a single check for each array (which would
require analyzing the DRs in the outer loops as well to gather info about the
other dimensions).
Bah, and with -fipa-pta -Ofast -fwhole-program we now _do_ see that there isn't any aliasing but PRE messes up the loop and creates loop carried dependencies ... :/
(In reply to Richard Biener from comment #3)
> Bah, and with -fipa-pta -Ofast -fwhole-program we now _do_ see that there
> isn't any aliasing but PRE messes up the loop and creates loop carried
> dependencies ... :/

Ok, that's because PRE would be the one to make the array loads base
addresses simple induction variables (it performs invariant motion of the
address load from the global matrix struct).  Thus inhibit_phi_insertion
returns false.

Looks like ordering of PRE / LIM isn't too great when considering such cases.
Created attachment 32038 [details]
patch for the PRE issue

The attached patch delays inhibit_phi_insertion to a point where it can give
a more definitive answer (eliminate () time).
Created attachment 32039 [details]
updated patch
As discussed on IRC, the #c1 patch would need to be moved after the
if (STMT_VINFO_GATHER_P (stmtinfo_a)...) {} block.

Getting the version checks below the default limit would be really nice; the
benchmark numbers look much better when the loop is vectorized, even with
--param vect-max-version-for-alias-checks=32.
For analysis of the DRs in outer loops we need to fix PR23855.
With PR23855 fixed we'd get, for the analysis of the remaining DRs in the
outermost loop, for example for matrix A:

Creating dr for *_290
        analyze_innermost: success.
        base_address: pretmp_1792 + (sizetype) ((long unsigned int) pretmp_1974 * 4)
        offset from base address: 0
        constant offset from base address: 4
        step: 4
        aligned to: 128
        base_object: *pretmp_1792 + (sizetype) ((long unsigned int) (((pretmp_1831 + 1) * pretmp_1794 + 1) * pretmp_1796) * 4)
        Access function 0: {{{4B, +, (sizetype) ((long unsigned int) (pretmp_1794 * pretmp_1796) * 4)}_2, +, (sizetype) ((long unsigned int) pretmp_1796 * 4)}_6, +, 4}_7
Creating dr for *_81
        analyze_innermost: success.
        base_address: pretmp_1792 + (sizetype) ((long unsigned int) pretmp_1915 * 4)
        offset from base address: 0
        constant offset from base address: 4
        step: 4
        aligned to: 128
        base_object: *pretmp_1792 + (sizetype) ((long unsigned int) (((pretmp_1804 + 1) * pretmp_1794 + 1) * pretmp_1796) * 4)
        Access function 0: {{{4B, +, (sizetype) ((long unsigned int) (pretmp_1794 * pretmp_1796) * 4)}_2, +, (sizetype) ((long unsigned int) pretmp_1796 * 4)}_6, +, 4}_7
Creating dr for *_60
        analyze_innermost: success.
        base_address: pretmp_1792 + (sizetype) ((long unsigned int) pretmp_1902 * 4)
        offset from base address: 0
        constant offset from base address: 4
        step: 4
        aligned to: 128
        base_object: *pretmp_1792 + (sizetype) ((long unsigned int) (((pretmp_1801 + 1) * pretmp_1794 + 1) * pretmp_1796) * 4)
        Access function 0: {{{4B, +, (sizetype) ((long unsigned int) (pretmp_1794 * pretmp_1796) * 4)}_2, +, (sizetype) ((long unsigned int) pretmp_1796 * 4)}_6, +, 4}_7

which still requires hard work to actually combine as the base objects still
differ (but the access function is equal).
Author: rguenth
Date: Mon Apr 14 13:57:00 2014
New Revision: 209374

URL: http://gcc.gnu.org/viewcvs?rev=209374&root=gcc&view=rev
Log:
2014-04-14  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/60042
	* tree-ssa-pre.c (inhibit_phi_insertion): Remove.
	(insert_into_preds_of_block): Do not prevent PHI insertion
	for REFERENCE exprs here ...
	(eliminate_dom_walker::before_dom_children): ... but prevent
	their use here under similar conditions when applied to the
	IL after PRE optimizations.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/tree-ssa-pre.c
Trunk still has 22 alias checks (so it even got worse).