Bug 65206 - vectorized version of loop is removed, dependence analysis fails for *&a[i] vs a[j]
Summary: vectorized version of loop is removed, dependence analysis fails for *&a[i] vs a[j]
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 5.0
Importance: P3 normal
Target Milestone: 12.0
Assignee: Richard Biener
URL:
Keywords: missed-optimization
Duplicates: 69732 72739 101548
Depends on:
Blocks: vectorizer 85057
 
Reported: 2015-02-25 14:01 UTC by Yuri Rumyantsev
Modified: 2022-03-29 12:57 UTC (History)
CC List: 3 users

See Also:
Host:
Target:
Build:
Known to work: 12.0
Known to fail: 11.2.1
Last reconfirmed: 2015-02-25 00:00:00


Attachments
test-case to reproduce (146 bytes, text/plain)
2015-02-25 14:04 UTC, Yuri Rumyantsev
hack just for the masked load/store case (834 bytes, patch)
2015-02-26 11:30 UTC, Richard Biener

Description Yuri Rumyantsev 2015-02-25 14:01:31 UTC
I noticed that the vectorized version of the loop is deleted although the compiler reports that it was successfully vectorized:
t1.c:7:3: note: LOOP VECTORIZED

but afterwards we can see in the vect dump:

Removing basic block 4
basic block 4, loop depth 1
 pred:       45
# i_16 = PHI <i_73(45)>
# ivtmp_15 = PHI <ivtmp_76(45)>
# vectp_a1.11_116 = PHI <vectp_a1.12_114(45)>
# vectp_a2.19_125 = PHI <vectp_a2.20_123(45)>
# vectp_a3.22_130 = PHI <vectp_a3.23_128(45)>
# vectp_a1.25_136 = PHI <vectp_a1.26_134(45)>
# vectp_a3.34_147 = PHI <vectp_a3.35_145(45)>
# ivtmp_38 = PHI <0(45)>
vect__5.13_118 = MEM[(float *)vectp_a1.11_116];
_5 = a1[i_16];
_31 = &a2[i_16];
vect__ifc__32.14_122 = VEC_COND_EXPR <vect__5.13_118 >= vect_cst_.15_119, vect_cst_.16_120, vect_cst_.17_121>;
_ifc__32 = _5 >= x_6(D) ? 4294967295 : 0;
vect__7.18_127 = MASK_LOAD (vectp_a2.19_125, 0B, vect__ifc__32.14_122);
_7 = 0.0;
_33 = &a3[i_16];
vect__8.21_132 = MASK_LOAD (vectp_a3.22_130, 0B, vect__ifc__32.14_122);
_8 = 0.0;
vect__9.24_133 = vect__7.18_127 + vect__8.21_132;
_9 = _7 + _8;
_34 = &a1[i_16];
MASK_STORE (vectp_a1.25_136, 0B, vect__ifc__32.14_122, vect__9.24_133);
vect__11.27_140 = vect_cst_.28_37 + vect_cst_.29_139;
_11 = x_6(D) + 1.0e+0;
_35 = &a3[i_16];
vect__ifc__36.30_144 = VEC_COND_EXPR <vect__5.13_118 >= vect_cst_.31_141, vect_cst_.32_142, vect_cst_.33_143>;
_ifc__36 = _5 >= x_6(D) ? 0 : 4294967295;
...

The test case and compile options will be attached.
Comment 1 Yuri Rumyantsev 2015-02-25 14:04:17 UTC
Created attachment 34867 [details]
test-case to reproduce

The test needs to be compiled with the options -Ofast -m64 -mcore-avx2.
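
The attachment itself is not inlined here, but judging from the dump in the
description (a compare a1[i] >= x, masked loads of a2[i] and a3[i], a masked
store back to a1[i], and a store of x + 1.0 to a3[i] under the inverted mask),
the loop likely has roughly the following shape.  This is a reconstruction for
illustration only, not the actual attachment; N and foo are made up:

#define N 1000
float a1[N], a2[N], a3[N];

void
foo (float x)
{
  for (int i = 0; i < N; i++)
    {
      if (a1[i] >= x)
        a1[i] = a2[i] + a3[i];  /* if-converted to MASK_LOADs + MASK_STORE */
      else
        a3[i] = x + 1.0f;       /* MASK_STORE under the inverted mask */
    }
}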
Comment 2 Richard Biener 2015-02-25 14:55:37 UTC
We apply versioning for aliasing but somehow compute the accesses as always
aliasing, so the runtime check gets immediately folded and the vectorized loop
is removed:

t.c:7:3: note: create runtime check for data references a1[i_16] and *_34
t.c:7:3: note: created 1 versioning for alias checks.
t.c:7:3: note: loop versioned for vectorization because of possible aliasing
...

but I don't see the runtime alias check anywhere.

The DRs are

(gdb) p debug_data_reference (dr_a.dr)
#(Data Ref: 
#  bb: 4 
#  stmt: _5 = a1[i_16];
#  ref: a1[i_16];
#  base_object: a1;
#  Access function 0: {0, +, 1}_1
#)
$17 = void
(gdb) p debug_data_reference (dr_b.dr)
#(Data Ref: 
#  bb: 4 
#  stmt: MASK_STORE (_34, 0B, _ifc__32, _9);
#  ref: *_34;
#  base_object: MEM[(float *)&a1];
#  Access function 0: {0B, +, 4}_1
#)

so maybe the code doing masked loads/stores updates the DRs in a way that
will later confuse runtime alias checking.  Or for some reason it doesn't
update them enough for data-dependence analysis to handle them.

Clearly this is a must-dependence (but with known distance), so something
that data dependence analysis should handle and something that the runtime
alias checking currently isn't handling.
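
In source terms the failing pair boils down to two accesses that touch exactly
the same memory but receive different base objects in data-ref analysis.  A
minimal sketch (function names made up, a1 as in the dump; in the actual loop
both accesses of course sit in the same body):

float a1[1024];

float
reader (int i)
{
  return a1[i];       /* array-form DR: base_object a1,
                         access function {0, +, 1} in elements */
}

void
writer (int i, float v)
{
  float *p = &a1[i];  /* what if-conversion materializes for the store */
  *p = v;             /* pointer-form DR: base MEM[(float *)&a1],
                         access function {0B, +, 4} in bytes */
}

This base-object mismatch is what the dependence analysis below trips over.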
Comment 3 Richard Biener 2015-02-25 14:56:29 UTC
Confirmed.  CCing Jakub who implemented this stuff.
Comment 4 Richard Biener 2015-02-25 14:59:51 UTC
I'm talking about

(compute_affine_dependence
  stmt_a: _5 = a1[i_16];
  stmt_b: MASK_STORE (_34, 0B, _ifc__32, _9);
) -> dependence analysis failed

somehow it works for

(compute_affine_dependence
  stmt_a: _8 = MASK_LOAD (_33, 0B, _ifc__32);
  stmt_b: MASK_STORE (_35, 0B, _ifc__36, _11);
(analyze_overlapping_iterations
  (chrec_a = {0B, +, 4}_1)
  (chrec_b = {0B, +, 4}_1)
  (overlap_iterations_a = [0])
  (overlap_iterations_b = [0]))
)
Comment 5 Richard Biener 2015-02-25 15:02:35 UTC
(In reply to Richard Biener from comment #4)
> I'm talking about
> 
> (compute_affine_dependence
>   stmt_a: _5 = a1[i_16];
>   stmt_b: MASK_STORE (_34, 0B, _ifc__32, _9);
> ) -> dependence analysis failed

Ah - this is a pointer vs. an array access, thus a different "base".  A generic
issue that could, for example, be "solved" by trying to force a pointer base for
a non-pointer-based DR when analysis fails and the other DR is pointer-based.

Hmm.

Mine.

> somehow it works for
> 
> (compute_affine_dependence
>   stmt_a: _8 = MASK_LOAD (_33, 0B, _ifc__32);
>   stmt_b: MASK_STORE (_35, 0B, _ifc__36, _11);
> (analyze_overlapping_iterations
>   (chrec_a = {0B, +, 4}_1)
>   (chrec_b = {0B, +, 4}_1)
>   (overlap_iterations_a = [0])
>   (overlap_iterations_b = [0]))
> )
Comment 6 Richard Biener 2015-02-26 11:30:46 UTC
Created attachment 34882 [details]
hack just for the masked load/store case

Incomplete special-casing for the masked load/store case.  We need to mark
the masked load/store IFN calls somehow to flag the forwarding as valid.

A "real" fix would "duplicate" dr->indices to always have an alternate
analysis for *&dr->ref in case dr->ref is not a pointer evolution.  We
could then pick the one with compatible dr->indices.base_object.  Of course
that may still not handle all cases, if the ptr evolution is not enough
(like for having a[2][j] vs. *(&a[3][j])).  But at least it could help in
some general cases.
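
To make the caveat concrete (an illustrative example; f and the array bounds
are made up):

int a[4][8];

void
f (int j)
{
  a[2][j] = *(&a[3][j]);  /* per-dimension analysis sees row 2 vs. row 3
                             and proves independence outright; flattened
                             to pointer evolutions both sides are just
                             &a + const + 4*j, which alone cannot rule
                             out an overlap at some iteration distance */
}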
Comment 7 Richard Biener 2015-02-26 11:40:24 UTC
For the masked load/store case we could also simply put the "real" memory access
in place of the pointer argument.  To make that valid GIMPLE we could wrap it
inside a fake VIEW_CONVERT_EXPR, for example one converting to a
char[sizeof (ref)]; this would make it appear as an aggregate.  Of course when
we inspect the masked load/store we'd have to strip that VIEW_CONVERT_EXPR
again.  But I don't think seeing that VIEW_CONVERT_EXPR would harm anything.
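
On the MASK_STORE from the dump in the description, that proposal would look
roughly like this (a hand-written illustration, not actual compiler output;
char[4] assumes a 4-byte float):

  before:  MASK_STORE (_34, 0B, _ifc__32, _9);
  after:   MASK_STORE (VIEW_CONVERT_EXPR<char[4]>(a1[i_16]), 0B, _ifc__32, _9);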
Comment 8 Jakub Jelinek 2017-01-02 14:45:24 UTC
Adding some flag on the MASK_STORE or MASK_LOAD is not hard; it can be in another argument, or some GF_* flag, whatever.  But I don't understand what the difference is here between an originally pointer-based and an array-based access.  Doesn't forwprop propagate array accesses into pointer stores/loads for non-masked stores and loads anyway?
Comment 9 rguenther@suse.de 2017-01-04 09:41:06 UTC
On Mon, 2 Jan 2017, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65206
> 
> --- Comment #8 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> Adding some flag on the MASK_STORE or MASK_LOAD is not hard; it can be in
> another argument, or some GF_* flag, whatever.  But I don't understand what
> the difference is here between an originally pointer-based and an array-based
> access.  Doesn't forwprop propagate array accesses into pointer stores/loads
> for non-masked stores and loads anyway?

The issue is that dependence analysis can't handle mixed pointer / array
accesses (it would need to treat the array access as pointer-based,
thus comment #6, which talks about always providing both sets of
DR->indices variants).

Yes, forwprop propagates into the loads/stores, but it is correctly conservative
when propagating p = &a[i] to *p (it doesn't propagate) or
p = &a[1] to *p (it creates MEM[a, 4], not an array access).  The "hack"
forwards it as an array access to make dependence analysis happy.

The real fix is the one from comment #6: computing (maybe lazily) a DR->indices
variant for array accesses, using get_inner_reference on the
full reference and analyzing it as MEM[base + offset, bitoff],
thus as if it were *&ref.

That makes more refs "compatible" for initialize_data_dependence_relation.

It's not that easy to implement this, though :/
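
In source terms, the forwprop behavior and the proposed fix look roughly like
this (a sketch, not GCC source; g and the array are made up, float assumed to
be 4 bytes):

float a[16];

void
g (int i, float x)
{
  float *p = &a[i];
  *p = x;        /* kept as a pointer store: &a[i] is not propagated,
                    so this does not become a[i] = x */
  p = &a[1];
  *p = x;        /* becomes MEM[(float *)&a + 4B] = x,
                    not the array store a[1] = x */
}

The comment #6 fix, in the same terms: additionally analyze the array access
a[i] as if it were written *((float *)&a + i), i.e. MEM[base + offset], so
that it gains a second, pointer-form access function ({0B, +, 4}_1) that can
be matched up with a genuinely pointer-based DR.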
Comment 10 Richard Biener 2017-01-12 15:17:09 UTC
*** Bug 72739 has been marked as a duplicate of this bug. ***
Comment 11 Richard Biener 2021-09-08 11:08:55 UTC
*** Bug 101548 has been marked as a duplicate of this bug. ***
Comment 12 GCC Commits 2021-09-20 06:51:19 UTC
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:f92901a508305f291fcf2acae0825379477724de

commit r12-3677-gf92901a508305f291fcf2acae0825379477724de
Author: Richard Biener <rguenther@suse.de>
Date:   Wed Sep 8 14:42:31 2021 +0200

    tree-optimization/65206 - dependence analysis on mixed pointer/array
    
    This adds the capability to analyze the dependence of mixed
    pointer/array accesses.  The example is from where using a masked
    load/store creates the pointer-based access when an otherwise
    unconditional access is array based.  Other examples would include
    accesses to an array mixed with accesses from inlined helpers
    that work on pointers.
    
    The idea is quite simple and old - analyze the data-ref indices
    as if the reference was pointer-based.  The following change does
    this by changing dr_analyze_indices to work on the indices
    sub-structure and storing an alternate indices substructure in
    each data reference.  That alternate set of indices is analyzed
    lazily by initialize_data_dependence_relation when it fails to
    match-up the main set of indices of two data references.
    initialize_data_dependence_relation is refactored into a head
    and a tail worker and changed to work on one of the indices
    structures and thus away from using DR_* access macros which
    continue to reference the main indices substructure.
    
    There are quite some vectorization and loop distribution opportunities
    unleashed in SPEC CPU 2017, notably 520.omnetpp_r, 548.exchange2_r,
    510.parest_r, 511.povray_r, 521.wrf_r, 526.blender_r, 527.cam4_r and
    544.nab_r see amendments in what they report with -fopt-info-loop while
    the rest of the specrate set sees no changes there.  Measuring runtime
    for the set where changes were reported reveals nothing off-noise
    besides 511.povray_r which seems to regress slightly for me
    (on a Zen2 machine with -Ofast -march=native).
    
    2021-09-08  Richard Biener  <rguenther@suse.de>
    
            PR tree-optimization/65206
            * tree-data-ref.h (struct data_reference): Add alt_indices,
            order it last.
            * tree-data-ref.c (free_data_ref): Release alt_indices.
            (dr_analyze_indices): Work on struct indices and get DR_REF as tree.
            (create_data_ref): Adjust.
            (initialize_data_dependence_relation): Split into head
            and tail.  When the base objects fail to match up try
            again with pointer-based analysis of indices.
            * tree-vectorizer.c (vec_info_shared::check_datarefs): Do
            not compare the lazily computed alternate set of indices.
    
            * gcc.dg/torture/20210916.c: New testcase.
            * gcc.dg/vect/pr65206.c: Likewise.
Comment 13 Richard Biener 2021-09-20 06:51:53 UTC
Fixed for GCC 12.
Comment 14 Richard Biener 2022-03-29 12:57:59 UTC
*** Bug 69732 has been marked as a duplicate of this bug. ***