The attached test-case compiled with "-Ofast -fopenmp -march=core-avx2" options contains loop marked with pragma omp simd which is not vectorized: t1.cpp:66:20: note: not vectorized: data ref analysis failed MEM[(struct vec_ *)_9] = 1.0e+0; where index is defined as _12 = GOMP_SIMD_LANE (simduid.0_11(D)); _9 = &D.4231[_12].org; Note that this test was extracted from important benchmark and loop in question is vectorized if we revert fix for 59984.
Created attachment 35541 [details] test-case to reproduce Must be compiled with -Ofast -fopenmp -march=core-avx2 options.
On this testcase, I believe it is most important to change FRE and SRA. Without -fopenmp, we have before fre2: MEM[(struct ray_ *)&ray] = 1.0e+0; MEM[(struct ray_ *)&ray + 4B] = _8; ray.dir.x = x_10; ray.dir.y = y_11; ... _23 = ray.org.x; _26 = ray.org.y; _29 = ray.dir.x; _31 = ray.dir.y; and fre2 optimizes this into the values stored in there. With -fopenmp, we have the SIMD_LANE based array references: _12 = GOMP_SIMD_LANE (simduid.0_11(D)); _9 = &D.2881[_12].org; MEM[(struct vec_ *)_9] = 1.0e+0; MEM[(struct vec_ *)_9 + 4B] = _8; D.2881[_12].dir.x = x_13; D.2881[_12].dir.y = y_14; ... _26 = MEM[(const struct Ray *)&D.2881][_12].org.x; _29 = MEM[(const struct Ray *)&D.2881][_12].org.y; _32 = MEM[(const struct Ray *)&D.2881][_12].dir.x; _34 = MEM[(const struct Ray *)&D.2881][_12].dir.y; that fre2 isn't able to optimize. Doing something here on the vectorizer side is too late, we really need to scalarize those or remove completely (through fre2).
Ah, but fre2 is able to optimize those: D.2881[_12].dir.x = x_13; ... _32 = MEM[(const struct Ray *)&D.2881][_12].dir.x; cases to _32 = x_13; The problem in FRE2 is only with the MEM_REFs. If in the testcase - ray.org = p; + ray.org.x = p.x; + ray.org.y = p.y; then FRE2 handles it completely and the problem is the same between -fopenmp and -fno-openmp - ifcvt not ifconverting the loop.
Ah, and the reason for the ifcvt failure is that we have an unoptimized virtual PHI in the loop that ifcvt gives up on: # .MEM_55 = PHI <.MEM_55(7), .MEM_5(D)(2)> Only phicprop2 pass is able to optimize that away (and replace all uses of .MEM_55 with .MEM_5(D)).
Created attachment 35596 [details] gcc6-pr66142.patch Fix for the ifcvt issue. With this the modified testcase is vectorized.
As for the FRE (actually SCCVN) issue, seems this affects more than GOMP_SIMD_LANE, seems there is insufficient canonicalization of the references performed by SCCVN, so the same references, but in slightly different form, aren't considered the same. Please see e.g. struct A { float x, y; }; struct B { struct A u, v; }; void bar (struct A *); float f1 (struct B *x, int y) { struct A p; p.x = 1.0f; p.y = 2.0f; struct A *q = &x[y].u; *q = p; x[y].v.x = 3.0f; x[y].v.y = 4.0f; float f = x[y].u.x + x[y].u.y + x[y].v.x + x[y].v.y; bar (&p); return f; } float f2 (struct B *x, int y) { struct A p; p.x = 1.0f; p.y = 2.0f; x[y].u = p; x[y].v.x = 3.0f; x[y].v.y = 4.0f; float f = x[y].u.x + x[y].u.y + x[y].v.x + x[y].v.y; bar (&p); return f; } float f3 (struct B *x, int y) { struct A p; p.x = 1.0f; p.y = 2.0f; struct A *q = &x[y].u; __builtin_memcpy (&q, &p.x, sizeof (float)); __builtin_memcpy (&q->y, &p.y, sizeof (float)); *q = p; x[y].v.x = 3.0f; x[y].v.y = 4.0f; float f = x[y].u.x + x[y].u.y + x[y].v.x + x[y].v.y; bar (&p); return f; } float f4 (struct B *x, int y) { struct A p; p.x = 1.0f; p.y = 2.0f; __builtin_memcpy (&x[y].u.x, &p.x, sizeof (float)); __builtin_memcpy (&x[y].u.y, &p.y, sizeof (float)); x[y].v.x = 3.0f; x[y].v.y = 4.0f; float f = x[y].u.x + x[y].u.y + x[y].v.x + x[y].v.y; bar (&p); return f; } at -O2, FRE is able to optimize f2 and f4, but not f1 and f3, despite all 4 functions doing the same thing.
Slightly improved variant of #c6 testcase: struct A { float x, y; }; struct B { struct A u; }; void bar (struct A *); float f1 (struct B *x, int y) { struct A p; p.x = 1.0f; p.y = 2.0f; struct A *q = &x[y].u; *q = p; float f = x[y].u.x + x[y].u.y; bar (&p); return f; } float f2 (struct B *x, int y) { struct A p; p.x = 1.0f; p.y = 2.0f; x[y].u = p; float f = x[y].u.x + x[y].u.y; bar (&p); return f; } float f3 (struct B *x, int y) { struct A p; p.x = 1.0f; p.y = 2.0f; struct A *q = &x[y].u; __builtin_memcpy (&q->x, &p.x, sizeof (float)); __builtin_memcpy (&q->y, &p.y, sizeof (float)); *q = p; float f = x[y].u.x + x[y].u.y; bar (&p); return f; } float f4 (struct B *x, int y) { struct A p; p.x = 1.0f; p.y = 2.0f; __builtin_memcpy (&x[y].u.x, &p.x, sizeof (float)); __builtin_memcpy (&x[y].u.y, &p.y, sizeof (float)); float f = x[y].u.x + x[y].u.y; bar (&p); return f; }
I guess it depends on what exactly SCCVN uses the operand vectors created/optimized by copy_reference_ops_from_ref/valueize_refs_1 for. If it is used only for tracking what is the value of the reference (the value stored into that piece of memory), or also for canonicalization of references to certain more canonical form, or both. If only the value of the reference, then supposedly it would be nice to get far more canonicalization in the references (e.g. turn (perhaps multiple nested) COMPONENT_REFs into MEM_REFs with offset (except perhaps for bitfields, perhaps even do something about ARRAY_REFs, so that say even: struct A { float x, y; }; struct B { struct A u; }; void bar (struct A *); float f2 (struct B *x, int y) { struct A p; p.x = 1.0f; p.y = 2.0f; char *z = (char *) x; z += y * sizeof (struct B); z += __builtin_offsetof (struct B, u.y); x[y].u = p; float f = *(float *) z; bar (&p); return f; } is handled). But, supposedly it would be undesirable to canonicalize all the COMPONENT_REFs to the MEM_REFs (at least early on, e.g. for __builtin_object_size (, 1) purposes, or aliasing etc.). Richi, any thoughts on this?
On Mon, 25 May 2015, jakub at gcc dot gnu.org wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66142 > > Jakub Jelinek <jakub at gcc dot gnu.org> changed: > > What |Removed |Added > ---------------------------------------------------------------------------- > CC| |rguenth at gcc dot gnu.org > > --- Comment #8 from Jakub Jelinek <jakub at gcc dot gnu.org> --- I guess > it depends on what exactly SCCVN uses the operand vectors > created/optimized by copy_reference_ops_from_ref/valueize_refs_1 for. If > it is used only for tracking what is the value of the reference (the > value stored into that piece of memory), or also for canonicalization of > references to certain more canonical form, or both. If only the value It's only for tracking the value (and to get a canonical form for its internal hash-tables and compare function). I think this bug may be related to PR63916. > of the reference, then supposedly it would be nice to get far more > canonicalization in the references (e.g. turn (perhaps multiple nested) > COMPONENT_REFs into MEM_REFs with offset (except perhaps for bitfields, > perhaps even do something about ARRAY_REFs, so that say even: struct A { > float x, y; }; struct B { struct A u; }; void bar (struct A *); > > float > f2 (struct B *x, int y) > { > struct A p; > p.x = 1.0f; > p.y = 2.0f; > char *z = (char *) x; > z += y * sizeof (struct B); > z += __builtin_offsetof (struct B, u.y); > x[y].u = p; > float f = *(float *) z; > bar (&p); > return f; > } > is handled). But, supposedly it would be undesirable to canonicalize all the > COMPONENT_REFs to the MEM_REFs (at least early on, e.g. for > __builtin_object_size (, 1) purposes, or aliasing etc.). > Richi, any thoughts on this? On the f1 testcase it probably fails to look through the aggregate copy in vn_reference_lookup_3. I'll have a short look later.
Testing Index: gcc/tree-ssa-sccvn.c =================================================================== --- gcc/tree-ssa-sccvn.c (revision 223574) +++ gcc/tree-ssa-sccvn.c (working copy) @@ -1894,7 +1894,12 @@ vn_reference_lookup_3 (ao_ref *ref, tree size2 = lhs_ref.size; maxsize2 = lhs_ref.max_size; if (maxsize2 == -1 - || (base != base2 && !operand_equal_p (base, base2, 0)) + || (base != base2 + && (TREE_CODE (base) != MEM_REF + || TREE_CODE (base2) != MEM_REF + || TREE_OPERAND (base, 0) != TREE_OPERAND (base2, 0) + || !tree_int_cst_equal (TREE_OPERAND (base, 1), + TREE_OPERAND (base2, 1)))) || offset2 > offset || offset2 + size2 < offset + maxsize) return (void *)-1; which fixes the comment#7 testcase.
Author: rguenth Date: Tue May 26 13:55:40 2015 New Revision: 223697 URL: https://gcc.gnu.org/viewcvs?rev=223697&root=gcc&view=rev Log: 2015-05-26 Richard Biener <rguenther@suse.de> PR tree-optimization/66142 * tree-ssa-sccvn.c (vn_reference_lookup_3): Manually compare MEM_REFs for the same base address. * gcc.dg/tree-ssa/ssa-fre-44.c: New testcase. Added: trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-44.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-ssa-sccvn.c
Fixed(?).
Original test-case is not vectorized yet with Richard patch for sccvn.
As said in comment #9, this is PR63916 I think. We have _9 = &D.3665[_11].org; MEM[(struct vec_ *)_9] = 1.0e+0; MEM[(struct vec_ *)_9 + 4B] = _8; ... _24 = MEM[(const struct Ray *)&D.3665][_11].org.x; _27 = MEM[(const struct Ray *)&D.3665][_11].org.y; which ultimately is from inlining of D.3665[D.3663].org = p; D.3665[D.3663].dir.x = x; D.3665[D.3663].dir.y = y; D.3664[D.3663].t = 1.0e+10; D.3664[D.3663].h = 0; D.3667 = &D.3665[D.3663]; D.3668 = &D.3664[D.3663]; bar (D.3668, D.3667, &spheres[0]); to p.x = 1.0e+0; p.y = _9; ... D.3665[_15].org = p; D.3665[_15].dir.x = x_16; D.3665[_15].dir.y = y_17; D.3664[_15].t = 1.0e+10; D.3664[_15].h = 0; _23 = &D.3665[_15]; _24 = &D.3664[_15]; _32 = MEM[(const struct Ray *)_23].org.x; forwprop cannot forward &D.3665[_15] into the dereference (it would create a bogus access paths even if types were matching and TBAA would be fine). And SCCVN isn't able to "canonicalize" the accesses (because it doesn't look at stmt defs when building the canonical address of an access). Note that PRE uses the canonical form for expression insertion so we have to be careful what we put in the canonical form (or separate the address compute from the actual dereference). Yes, lowering the access more in the SCCVN machinery would be possible (say, use the affine combination machinery with the powers of affine-comb-expand (and making that use the VN lattice...)), but the issue with this is always PRE insertion and us _not_ wanting to have all PRE inserted loads be fully lowered to address-compute plus indirection. I hope to get to PR63916 this stage1.
Author: rguenth Date: Thu May 28 13:24:53 2015 New Revision: 223816 URL: https://gcc.gnu.org/viewcvs?rev=223816&root=gcc&view=rev Log: 2015-05-28 Richard Biener <rguenther@suse.de> PR tree-optimization/66142 * tree-ssa-sccvn.c (vn_reference_lookup_3): Handle non-GIMPLE values better in memcpy destination handling. Handle non-aliasing we discover here. * gcc.dg/tree-ssa/ssa-fre-44.c: Fixup. Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.dg/tree-ssa/ssa-fre-44.c trunk/gcc/tree-ssa-sccvn.c
Author: jakub Date: Fri May 29 13:06:23 2015 New Revision: 223863 URL: https://gcc.gnu.org/viewcvs?rev=223863&root=gcc&view=rev Log: PR tree-optimization/66142 * tree-if-conv.c (if_convertible_phi_p): Don't give up on virtual phis that feed themselves. * gcc.dg/vect/pr66142.c: New test. Added: trunk/gcc/testsuite/gcc.dg/vect/pr66142.c Modified: trunk/gcc/ChangeLog trunk/gcc/testsuite/ChangeLog trunk/gcc/tree-if-conv.c
Ok, so even with PR63916 rudimentary fixed we hit the issue that in _9 = &D.3665[_11].org; MEM[(struct vec_ *)_9] = 1.0e+0; MEM[(struct vec_ *)_9 + 4B] = _8; ... _24 = MEM[(const struct Ray *)&D.3665][_11].org.x; the store to MEM[(struct vec_ *)_9 + 4B] aliases MEM[(const struct Ray *)&D.3665][_11].org.x. And the rudimentary fix for PR63916 doesn't allow for the forwarding to succeed into MEM[(struct vec_ *)_9 + 4B] either. Without SRA we get D.3665[_11].org = p; D.3665[_11].dir.x = x_12; D.3665[_11].dir.y = y_13; D.3664[_11].t = 1.0e+10; D.3664[_11].h = 0; _25 = MEM[(const struct Ray *)&D.3665][_11].org.x; and the look-through-aggregate-copy code bails out at /* See if the assignment kills REF. */ base2 = ao_ref_base (&lhs_ref); offset2 = lhs_ref.offset; size2 = lhs_ref.size; maxsize2 = lhs_ref.max_size; if (maxsize2 == -1 || (base != base2 && (TREE_CODE (base) != MEM_REF || TREE_CODE (base2) != MEM_REF || TREE_OPERAND (base, 0) != TREE_OPERAND (base2, 0) || !tree_int_cst_equal (TREE_OPERAND (base, 1), TREE_OPERAND (base2, 1)))) || offset2 > offset || offset2 + size2 < offset + maxsize) return (void *)-1; stmt_kills_ref_p fails to compare MEM[(const struct Ray *)&D.3665][_15].org and D.3665[_15].org equal via operand_equal_p.
Let's re-open this for the original testcase.
Richard, would you be able to look at this in some time?
Yes, I plan to come back to this during stage3.
Ok, so if we fix SCCVN further we hit its (designed) limitation with stmt walking not following backedges (but the use is in a loop and the def outside). We avoid doing the work to determine if sth is loop invariant here. This was trying to improve things with SRA disabled - I'm going to still push those changes if testing works out fine. Now with SRA enabled we have a better situation in this regard but face _9 = &D.4309[_12].org; MEM[(struct vec_ *)_9] = 1.0e+0; MEM[(struct vec_ *)_9 + 4B] = _8; D.4309[_12].dir.x = x_13; D.4309[_12].dir.y = y_14; D.4308[_12].t = 1.0e+10; D.4308[_12].h = 0; _20 = &D.4308[_12]; _26 = MEM[(const struct Ray *)&D.4309][_12].org.x; which as said isn't friendly enough for the simplistic address forward propagation we have. The issue here is that we can't disambiguate against MEM[(struct vec_ *)_9 + 4B] = _8; when looking up MEM[(const struct Ray *)&D.4309][_12].org.x; because vn_reference_maybe_forwprop_address doesn't forward into MEM[(struct vec_ *)_9 + 4B] as its offset is not zero: /* If that didn't work because the address isn't invariant propagate the reference tree from the address operation in case the current dereference isn't offsetted. */ if (!addr_base && *i_p == ops->length () - 1 && off == 0 /* This makes us disable this transform for PRE where the reference ops might be also used for code insertion which is invalid. */ && default_vn_walk_kind == VN_WALKREWRITE) { auto_vec<vn_reference_op_s, 32> tem; copy_reference_ops_from_ref (TREE_OPERAND (addr, 0), &tem); ops->pop (); ops->pop (); ops->safe_splice (tem); --*i_p; return true; it also likely wouldn't help vn_reference_lookup_3 to do the disambiguation as that performs just lhs_ops = valueize_refs_1 (lhs_ops, &valueized_anything); if (valueized_anything) { lhs_ref_ok = ao_ref_init_from_vn_reference (&lhs_ref, get_alias_set (lhs), TREE_TYPE (lhs), lhs_ops); if (lhs_ref_ok && !refs_may_alias_p_1 (ref, &lhs_ref, true)) return NULL; but there is still the variable offset to cope with. We'd need to perform some disambiguation on lhs_ref vs. vr. But it's not clear how to represent &D.4309[_12].org forwarded into MEM[(struct vec_ *)_9 + 4B]. It's kind-of a COMPONENT_REF with just a offset added. Let's collect the improvements I have sofar in my local tree and see if they otherwise work.
Author: rguenth Date: Fri Sep 18 07:57:00 2015 New Revision: 227896 URL: https://gcc.gnu.org/viewcvs?rev=227896&root=gcc&view=rev Log: 2015-09-18 Richard Biener <rguenther@suse.de> PR tree-optimization/66142 * fold-const.c (operand_equal_p): When OEP_ADDRESS_OF treat MEM[&x] and x the same. * tree-ssa-sccvn.h (vn_reference_fold_indirect): Remove. * tree-ssa-sccvn.c (vn_reference_fold_indirect): Return true when we simplified sth. (vn_reference_maybe_forwprop_address): Likewise. (valueize_refs_1): When we simplified through vn_reference_fold_indirect or vn_reference_maybe_forwprop_address set valueized_anything to true. (vn_reference_lookup_3): Use stmt_kills_ref_p to see whether one ref kills the other instead of just a offset-based test. * tree-ssa-alias.c (stmt_kills_ref_p): Use OEP_ADDRESS_OF for the operand_equal_p test to compare bases and also compare sizes. Modified: trunk/gcc/ChangeLog trunk/gcc/fold-const.c trunk/gcc/tree-ssa-alias.c trunk/gcc/tree-ssa-sccvn.c trunk/gcc/tree-ssa-sccvn.h
Richard, Do we have any chance to vectorize attached test-case using GCC6 compiler?
On Tue, 8 Dec 2015, ysrumyan at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66142 > > --- Comment #23 from Yuri Rumyantsev <ysrumyan at gmail dot com> --- > Richard, > > Do we have any chance to vectorize attached test-case using GCC6 compiler? No, I don't see any.
If we convert copy structures to copy structure fields test will be vectorized and all mentions of GOMP_SIMD_LANE will be deleted. But if we slightly modify test by introducing new function vdot and insert its call: b = r.x * ray->dir.x + r.y * ray->dir.y; | v b = vdot (r, ray->dir); test won't be vectorized: test2.cpp:70:9: note: not vectorized: not suitable for scatter store D.6062[_9].org.x = 1.0e+0; test2.cpp is attached.
Created attachment 37940 [details] test-case to reproduce Need to be compiled with -Ofast -mavx2 -fopenmp options.