The following test regressed with PR102659, compiled with -O3 -march=armv8.2-a+sve:

  void f(int *restrict x, int *restrict y, int n)
  {
    for (int i = 0; i < n; ++i)
      if (x[i] > 0)
        x[i] = y[i * 2] + y[i * 2 + 1];
  }

Previously we treated the y[] accesses as a linear group and so could use LD2W. Now we treat them as individual gather loads instead:

  .L3:
          ld1w    z1.s, p0/z, [x0, x3, lsl 2]
          lsl     z0.s, z2.s, #1
          cmpgt   p0.s, p0/z, z1.s, #0
          ld1w    z1.s, p0/z, [x1, z0.s, sxtw 2]   // Gather
          ld1w    z0.s, p0/z, [x5, z0.s, sxtw 2]   // Gather
          add     z0.s, z1.s, z0.s
          st1w    z0.s, p0, [x0, x3, lsl 2]
          incw    z2.s
          add     x3, x3, x4
          whilelo p0.s, w3, w2
          b.any   .L3
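For reference, the LD2W shape we want corresponds roughly to the following hand-written ACLE intrinsics version (an illustrative sketch only, assuming the arm_sve.h svld2/svget2/svwhilelt intrinsics; it is not GCC's actual output):

  #include <arm_sve.h>

  /* Same loop, using one de-interleaving structure load (LD2W)
     instead of two gather loads.  */
  void f_ld2 (int *restrict x, int *restrict y, int n)
  {
    for (int i = 0; i < n; i += svcntw ())
      {
        svbool_t pg = svwhilelt_b32 (i, n);
        svint32_t xv = svld1 (pg, &x[i]);
        svbool_t cond = svcmpgt (pg, xv, 0);
        /* LD2W loads y[2*i], y[2*i+1], ... and de-interleaves the
           even and odd elements into the two tuple vectors.  */
        svint32x2_t yv = svld2 (cond, &y[2 * i]);
        svint32_t sum = svadd_x (cond, svget2 (yv, 0), svget2 (yv, 1));
        svst1 (cond, &x[i], sum);
      }
  }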
Confirmed. On x86 with AVX2 we don't get this vectorized anymore, for the same reason:

  t.c:5:15: missed:  failed: evolution of base is not affine.
          base_address:
          offset from base address:
          constant offset from base address:
          step:
          base alignment: 0
          base misalignment: 0
          offset alignment: 0
          step alignment: 0
          base_object: *_8
  Creating dr for *_12

if-conversion now produces

  ...
  _47 = (unsigned long) y_21(D);
  ...
  # i_26 = PHI <i_23(8), 0(15)>
  _1 = (long unsigned int) i_26;
  _2 = _1 * 4;
  _3 = x_20(D) + _2;
  _4 = *_3;
  _45 = (unsigned int) i_26;
  _46 = _45 * 2;
  _5 = (int) _46;
  _6 = (long unsigned int) _5;
  _7 = _6 * 4;
  _48 = _47 + _7;
  _8 = (int *) _48;
  _49 = _4 > 0;
  _9 = .MASK_LOAD (_8, 32B, _49);
  _10 = _6 + 1;
  _11 = _10 * 4;
  _51 = _11 + _47;
  _12 = (int *) _51;
  _13 = .MASK_LOAD (_12, 32B, _49);
  _52 = (unsigned int) _9;
  _53 = (unsigned int) _13;
  _54 = _52 + _53;
  _14 = (int) _54;
  .MASK_STORE (_3, 32B, _49, _14);
  i_23 = i_26 + 1;
  if (n_19(D) > i_23)
    goto <bb 8>; [89.00%]
  else
    goto <bb 6>; [11.00%]

Note that if-conversion is correct in rewriting i*2 and i*2 + 1 to unsigned arithmetic, since those computations now execute unconditionally and can overflow. In the end the issue is that the multiplication by the element size is done in sizetype, so y[i*2] and y[i*2+1] might not be adjacent. What we miss is that, iff the stmts were actually executed, then because of undefined overflow they would always be adjacent.

IMHO the only good way to recover is to scrap the separate if-conversion step and do vectorization on the original IL, or to integrate the two passes enough to allow dataref analysis on the not-if-converted IL. Another possibility (and a long-standing TODO) is to teach SCEV analysis to derive assumptions we can version the loop on - in this case that i*2 + 1 does not overflow.

Note that in this particular case we probably fail to see that i is in [0, INT_MAX - 1] and thus (unsigned) i * 2 + 1 never wraps: with i <= 2147483646, (unsigned) i * 2 + 1 <= 4294967293, which still fits in 32 bits (unless I'm missing something). We have

  <bb 3> [local count: 955630226]:
  # RANGE [0, 2147483647] NONZERO 2147483647
  # i_26 = PHI <i_23(8), 0(15)>
  # RANGE [0, 2147483646] NONZERO 2147483647
  _1 = (long unsigned int) i_26;
  # RANGE [0, 8589934584] NONZERO 8589934588
  _2 = _1 * 4;
  # PT = null { D.2435 } (nonlocal, restrict)
  _3 = x_20(D) + _2;
  _4 = MEM[(int *)_3 clique 1 base 1];
  _45 = (unsigned int) i_26;
  _46 = _45 * 2;
  _5 = (int) _46;
  _6 = (long unsigned int) _5;
  _7 = _6 * 4;
  _48 = _47 + _7;

so unfortunately, while _1 has the correct range, i_26 does not, and the ifcvt-generated stmts don't either. It might be possible to throw ranger at the if-converted body.

Andrew - if we'd like to do that: in tree-if-conv.cc, in tree_if_conversion (), after we've produced the final IL (after the call to ifcvt_hoist_invariants), is there a way to invoke ranger on the stmts of the (single-BB) loop and have it adjust the global ranges? In particular - see above - it would need to somehow improve the global range of the i_26 IV. The pass creates blocks and destroys edges, so I'm not sure we can reasonably use a caching instance over its lifetime, and thus the cost per loop could be a limiting factor.
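A minimal sketch of what invoking ranger there might look like (hypothetical and untested; it assumes the enable_ranger/range_of_stmt/export_global_ranges interfaces from gimple-range.h, and that the loop body is the single block loop->header at that point):

  /* Hypothetical sketch, not actual GCC code: after the call to
     ifcvt_hoist_invariants, walk the single-BB loop body, compute a
     range for each def to populate ranger's cache, then export the
     improved ranges to the global SSA_NAME range info.  */
  gimple_ranger *ranger = enable_ranger (cfun);
  for (gimple_stmt_iterator gsi = gsi_start_bb (loop->header);
       !gsi_end_p (gsi); gsi_next (&gsi))
    {
      gimple *stmt = gsi_stmt (gsi);
      tree lhs = gimple_get_lhs (stmt);
      if (lhs && TREE_CODE (lhs) == SSA_NAME)
        {
          int_range_max r;
          ranger->range_of_stmt (r, stmt, lhs);
        }
    }
  ranger->export_global_ranges ();
  disable_ranger (cfun);

Whether a straight-line walk like this can actually improve the global range of the i_26 PHI is unclear, since that bound comes from the loop exit test rather than from the body stmts themselves.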
GCC 12.1 is being released, retargeting bugs to GCC 12.2.
GCC 12.2 is being released, retargeting bugs to GCC 12.3.
GCC 12.3 is being released, retargeting bugs to GCC 12.4.