I noticed that the vectorized version of the loop is deleted although the compiler reports that it was successfully vectorized:

t1.c:7:3: note: LOOP VECTORIZED

but afterwards we can see in the vect dump:

Removing basic block 4
basic block 4, loop depth 1
 pred:       45
  # i_16 = PHI <i_73(45)>
  # ivtmp_15 = PHI <ivtmp_76(45)>
  # vectp_a1.11_116 = PHI <vectp_a1.12_114(45)>
  # vectp_a2.19_125 = PHI <vectp_a2.20_123(45)>
  # vectp_a3.22_130 = PHI <vectp_a3.23_128(45)>
  # vectp_a1.25_136 = PHI <vectp_a1.26_134(45)>
  # vectp_a3.34_147 = PHI <vectp_a3.35_145(45)>
  # ivtmp_38 = PHI <0(45)>
  vect__5.13_118 = MEM[(float *)vectp_a1.11_116];
  _5 = a1[i_16];
  _31 = &a2[i_16];
  vect__ifc__32.14_122 = VEC_COND_EXPR <vect__5.13_118 >= vect_cst_.15_119, vect_cst_.16_120, vect_cst_.17_121>;
  _ifc__32 = _5 >= x_6(D) ? 4294967295 : 0;
  vect__7.18_127 = MASK_LOAD (vectp_a2.19_125, 0B, vect__ifc__32.14_122);
  _7 = 0.0;
  _33 = &a3[i_16];
  vect__8.21_132 = MASK_LOAD (vectp_a3.22_130, 0B, vect__ifc__32.14_122);
  _8 = 0.0;
  vect__9.24_133 = vect__7.18_127 + vect__8.21_132;
  _9 = _7 + _8;
  _34 = &a1[i_16];
  MASK_STORE (vectp_a1.25_136, 0B, vect__ifc__32.14_122, vect__9.24_133);
  vect__11.27_140 = vect_cst_.28_37 + vect_cst_.29_139;
  _11 = x_6(D) + 1.0e+0;
  _35 = &a3[i_16];
  vect__ifc__36.30_144 = VEC_COND_EXPR <vect__5.13_118 >= vect_cst_.31_141, vect_cst_.32_142, vect_cst_.33_143>;
  _ifc__36 = _5 >= x_6(D) ? 0 : 4294967295;
  ...

The test case and compile options will be attached.
Created attachment 34867 [details]
test-case to reproduce

The test needs to be compiled with the -Ofast -m64 -march=core-avx2 options.
We apply versioning for aliasing but somehow compute the accesses as always aliasing, so the runtime check gets immediately folded and the vectorized loop removed:

t.c:7:3: note: create runtime check for data references a1[i_16] and *_34
t.c:7:3: note: created 1 versioning for alias checks.
t.c:7:3: note: loop versioned for vectorization because of possible aliasing
...

but I see the alias runtime check nowhere.  The DRs are

(gdb) p debug_data_reference (dr_a.dr)
#(Data Ref:
#  bb: 4
#  stmt: _5 = a1[i_16];
#  ref: a1[i_16];
#  base_object: a1;
#  Access function 0: {0, +, 1}_1
#)
$17 = void
(gdb) p debug_data_reference (dr_b.dr)
#(Data Ref:
#  bb: 4
#  stmt: MASK_STORE (_34, 0B, _ifc__32, _9);
#  ref: *_34;
#  base_object: MEM[(float *)&a1];
#  Access function 0: {0B, +, 4}_1
#)

so maybe the code doing masked loads/stores updates the DRs in a way that will later confuse runtime alias checking.  Or for some reason it doesn't update them enough to make data-dependence analysis handle them.

Clearly this is a must-dependence (but with known distance), so something that data-dependence analysis should handle, and something that the runtime alias checking isn't handling.
Confirmed. CCing Jakub who implemented this stuff.
I'm talking about

(compute_affine_dependence
  stmt_a: _5 = a1[i_16];
  stmt_b: MASK_STORE (_34, 0B, _ifc__32, _9);
) -> dependence analysis failed

somehow it works for

(compute_affine_dependence
  stmt_a: _8 = MASK_LOAD (_33, 0B, _ifc__32);
  stmt_b: MASK_STORE (_35, 0B, _ifc__36, _11);
(analyze_overlapping_iterations
  (chrec_a = {0B, +, 4}_1)
  (chrec_b = {0B, +, 4}_1)
  (overlap_iterations_a = [0])
  (overlap_iterations_b = [0]))
)
(In reply to Richard Biener from comment #4)
> I'm talking about
>
> (compute_affine_dependence
>   stmt_a: _5 = a1[i_16];
>   stmt_b: MASK_STORE (_34, 0B, _ifc__32, _9);
> ) -> dependence analysis failed

Ah - this is a pointer vs. an array access, thus a different "base".  A generic issue that can for example be "solved" by trying to force a pointer base for a non-pointer-based DR when analysis fails and the other DR is based on a pointer.  Hmm.

Mine.

> somehow it works for
>
> (compute_affine_dependence
>   stmt_a: _8 = MASK_LOAD (_33, 0B, _ifc__32);
>   stmt_b: MASK_STORE (_35, 0B, _ifc__36, _11);
> (analyze_overlapping_iterations
>   (chrec_a = {0B, +, 4}_1)
>   (chrec_b = {0B, +, 4}_1)
>   (overlap_iterations_a = [0])
>   (overlap_iterations_b = [0]))
> )
Created attachment 34882 [details]
hack just for the masked load/store case

Incomplete special-casing for the masked load/store case.  We'd need to mark the masked load/store IFN calls somehow to flag the forwarding as valid.

A "real" fix would "duplicate" dr->indices so we always have an alternate analysis for *&dr->ref in case dr->ref is not a pointer evolution.  We could then pick the variant with a compatible dr->indices.base_object.  Of course that may still not handle all cases where the pointer evolution is not enough (like for a[2][j] vs. *(&a[3][j])).  But at least it could help in some general cases.
For the masked load/store case we could also simply put the "real" memory access in place of the pointer argument.  To make that valid GIMPLE we could wrap it inside a fake VIEW_CONVERT_EXPR, for example one converting to char[sizeof (ref)]; this would make it appear as an aggregate.  Of course when we inspect the masked load/store we'd have to strip that VIEW_CONVERT_EXPR again.  But I don't think it would harm anyone seeing that VIEW_CONVERT_EXPR.
Adding some flag on the MASK_STORE or MASK_LOAD is not hard; it can be in another argument, or some GF_* flag, whatever.  But I don't understand what the difference here is between an originally pointer-based and an originally array-based access.  Doesn't forwprop propagate array accesses into pointer stores/loads for non-masked stores and loads anyway?
On Mon, 2 Jan 2017, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65206
>
> --- Comment #8 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> Adding some flag on the MASK_STORE or MASK_LOAD is not hard, it can be in
> another argument, or some GF_*, whatever.  But I don't understand here what
> is the difference between originally pointer based and array based access.
> Doesn't for non-masked stores or loads forwprop propagate array accesses
> into pointer stores/loads anyway?

The issue is that dependence analysis can't handle mixed pointer/array accesses (it would need to treat the array access as pointer-based, thus comment #5, which talks about always providing both sets of DR->indices variants).

Yes, forwprop propagates the loads/stores, but it is correctly conservative when propagating p = &a[i] to *p (it doesn't propagate) or p = &a[1] to *p (it creates MEM[a, 4], not an array access).  The "hack" forwards it as an array access to make dependence analysis happy.

The real fix is comment #5: computing (maybe lazily) a DR->indices variant for array accesses by using get_inner_reference on the full reference and analyzing it as MEM[base + offset, bitoff], i.e. as if it were *&ref.  That makes more refs "compatible" for initialize_data_dependence_relation.  It's just not that easy to implement :/
*** Bug 72739 has been marked as a duplicate of this bug. ***
*** Bug 101548 has been marked as a duplicate of this bug. ***
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:f92901a508305f291fcf2acae0825379477724de

commit r12-3677-gf92901a508305f291fcf2acae0825379477724de
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Sep 8 14:42:31 2021 +0200

    tree-optimization/65206 - dependence analysis on mixed pointer/array

    This adds the capability to analyze the dependence of mixed
    pointer/array accesses.  The example is from where using a masked
    load/store creates the pointer-based access when an otherwise
    unconditional access is array based.  Other examples would include
    accesses to an array mixed with accesses from inlined helpers that
    work on pointers.

    The idea is quite simple and old - analyze the data-ref indices as
    if the reference was pointer-based.  The following change does this
    by changing dr_analyze_indices to work on the indices sub-structure
    and storing an alternate indices substructure in each data reference.
    That alternate set of indices is analyzed lazily by
    initialize_data_dependence_relation when it fails to match-up the
    main set of indices of two data references.
    initialize_data_dependence_relation is refactored into a head and a
    tail worker and changed to work on one of the indices structures and
    thus away from using DR_* access macros which continue to reference
    the main indices substructure.

    There are quite some vectorization and loop distribution
    opportunities unleashed in SPEC CPU 2017, notably 520.omnetpp_r,
    548.exchange2_r, 510.parest_r, 511.povray_r, 521.wrf_r,
    526.blender_r, 527.cam4_r and 544.nab_r see amendments in what they
    report with -fopt-info-loop while the rest of the specrate set sees
    no changes there.  Measuring runtime for the set where changes were
    reported reveals nothing off-noise besides 511.povray_r which seems
    to regress slightly for me (on a Zen2 machine with -Ofast
    -march=native).
2021-09-08  Richard Biener  <rguenther@suse.de>

        PR tree-optimization/65206
        * tree-data-ref.h (struct data_reference): Add alt_indices,
        order it last.
        * tree-data-ref.c (free_data_ref): Release alt_indices.
        (dr_analyze_indices): Work on struct indices and get DR_REF as tree.
        (create_data_ref): Adjust.
        (initialize_data_dependence_relation): Split into head
        and tail.  When the base objects fail to match up try
        again with pointer-based analysis of indices.
        * tree-vectorizer.c (vec_info_shared::check_datarefs): Do
        not compare the lazily computed alternate set of indices.

        * gcc.dg/torture/20210916.c: New testcase.
        * gcc.dg/vect/pr65206.c: Likewise.
Fixed for GCC 12.
*** Bug 69732 has been marked as a duplicate of this bug. ***