The following test regressed with PR102659, compiled with -O3 -march=armv8.2-a+sve:

  void f(int *restrict x, int *restrict y, int n)
  {
    for (int i = 0; i < n; ++i)
      if (x[i] > 0)
        x[i] = y[i * 2] + y[i * 2 + 1];
  }

Previously we treated the y[] accesses as a linear group and so could use LD2W. Now we treat them as individual gather loads instead:

  .L3:
          ld1w    z1.s, p0/z, [x0, x3, lsl 2]
          lsl     z0.s, z2.s, #1
          cmpgt   p0.s, p0/z, z1.s, #0
          ld1w    z1.s, p0/z, [x1, z0.s, sxtw 2]   // Gather
          ld1w    z0.s, p0/z, [x5, z0.s, sxtw 2]   // Gather
          add     z0.s, z1.s, z0.s
          st1w    z0.s, p0, [x0, x3, lsl 2]
          incw    z2.s
          add     x3, x3, x4
          whilelo p0.s, w3, w2
          b.any   .L3
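For reference, the LD2W shape we want corresponds roughly to the following hand-written ACLE intrinsics version (an illustrative sketch only, assuming the arm_sve.h svld2/svget2/svwhilelt intrinsics; it is not GCC's actual output):

  #include <arm_sve.h>

  /* Same loop, using one de-interleaving structure load (LD2W)
     instead of two gather loads.  */
  void f_ld2 (int *restrict x, int *restrict y, int n)
  {
    for (int i = 0; i < n; i += svcntw ())
      {
        svbool_t pg = svwhilelt_b32 (i, n);
        svint32_t xv = svld1 (pg, &x[i]);
        svbool_t cond = svcmpgt (pg, xv, 0);
        /* LD2W loads y[2*i], y[2*i+1], ... and de-interleaves the
           even and odd elements into the two tuple vectors.  */
        svint32x2_t yv = svld2 (cond, &y[2 * i]);
        svint32_t sum = svadd_x (cond, svget2 (yv, 0), svget2 (yv, 1));
        svst1 (cond, &x[i], sum);
      }
  }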
Confirmed. On x86 with AVX2 we don't get this vectorized anymore, for the same reason:

  t.c:5:15: missed:  failed: evolution of base is not affine.
          base_address:
          offset from base address:
          constant offset from base address:
          step:
          base alignment: 0
          base misalignment: 0
          offset alignment: 0
          step alignment: 0
          base_object: *_8
  Creating dr for *_12

if-conversion now produces

  ...
  _47 = (unsigned long) y_21(D);
  ...
  # i_26 = PHI <i_23(8), 0(15)>
  _1 = (long unsigned int) i_26;
  _2 = _1 * 4;
  _3 = x_20(D) + _2;
  _4 = *_3;
  _45 = (unsigned int) i_26;
  _46 = _45 * 2;
  _5 = (int) _46;
  _6 = (long unsigned int) _5;
  _7 = _6 * 4;
  _48 = _47 + _7;
  _8 = (int *) _48;
  _49 = _4 > 0;
  _9 = .MASK_LOAD (_8, 32B, _49);
  _10 = _6 + 1;
  _11 = _10 * 4;
  _51 = _11 + _47;
  _12 = (int *) _51;
  _13 = .MASK_LOAD (_12, 32B, _49);
  _52 = (unsigned int) _9;
  _53 = (unsigned int) _13;
  _54 = _52 + _53;
  _14 = (int) _54;
  .MASK_STORE (_3, 32B, _49, _14);
  i_23 = i_26 + 1;
  if (n_19(D) > i_23)
    goto <bb 8>; [89.00%]
  else
    goto <bb 6>; [11.00%]

Note that if-conversion is correct in rewriting i*2 and i*2 + 1 to unsigned arithmetic, since those computations now execute unconditionally and can overflow. In the end the issue is that the multiplication by the element size is done in sizetype, so y[i*2] and y[i*2+1] might not be adjacent. What we miss is that, iff the stmts were actually executed, then because of undefined overflow they would always be adjacent.

IMHO the only good way to recover is to scrap the separate if-conversion step and do vectorization on the original IL, or to integrate the two passes enough to allow dataref analysis on the not-if-converted IL. Another possibility (and a long-standing TODO) is to teach SCEV analysis to derive assumptions we can version the loop on - in this case that i*2 + 1 does not overflow.

Note that in this particular case we probably fail to see that i is in [0, INT_MAX - 1] and thus (unsigned) i * 2 + 1 never wraps: with i <= 2147483646, (unsigned) i * 2 + 1 <= 4294967293, which still fits in 32 bits (unless I'm missing something). We have

  <bb 3> [local count: 955630226]:
  # RANGE [0, 2147483647] NONZERO 2147483647
  # i_26 = PHI <i_23(8), 0(15)>
  # RANGE [0, 2147483646] NONZERO 2147483647
  _1 = (long unsigned int) i_26;
  # RANGE [0, 8589934584] NONZERO 8589934588
  _2 = _1 * 4;
  # PT = null { D.2435 } (nonlocal, restrict)
  _3 = x_20(D) + _2;
  _4 = MEM[(int *)_3 clique 1 base 1];
  _45 = (unsigned int) i_26;
  _46 = _45 * 2;
  _5 = (int) _46;
  _6 = (long unsigned int) _5;
  _7 = _6 * 4;
  _48 = _47 + _7;

so unfortunately, while _1 has the correct range, i_26 does not, and the ifcvt-generated stmts don't either. It might be possible to throw ranger at the if-converted body.

Andrew - if we'd like to do that: in tree-if-conv.cc, in tree_if_conversion (), after we've produced the final IL (after the call to ifcvt_hoist_invariants), is there a way to invoke ranger on the stmts of the (single-BB) loop and have it adjust the global ranges? In particular - see above - it would need to somehow improve the global range of the i_26 IV. The pass creates blocks and destroys edges, so I'm not sure we can reasonably use a caching instance over its lifetime, and thus the cost per loop could be a limiting factor.
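A minimal sketch of what invoking ranger there might look like (hypothetical and untested; it assumes the enable_ranger/range_of_stmt/export_global_ranges interfaces from gimple-range.h, and that the loop body is the single block loop->header at that point):

  /* Hypothetical sketch, not actual GCC code: after the call to
     ifcvt_hoist_invariants, walk the single-BB loop body, compute a
     range for each def to populate ranger's cache, then export the
     improved ranges to the global SSA_NAME range info.  */
  gimple_ranger *ranger = enable_ranger (cfun);
  for (gimple_stmt_iterator gsi = gsi_start_bb (loop->header);
       !gsi_end_p (gsi); gsi_next (&gsi))
    {
      gimple *stmt = gsi_stmt (gsi);
      tree lhs = gimple_get_lhs (stmt);
      if (lhs && TREE_CODE (lhs) == SSA_NAME)
        {
          int_range_max r;
          ranger->range_of_stmt (r, stmt, lhs);
        }
    }
  ranger->export_global_ranges ();
  disable_ranger (cfun);

Whether a straight-line walk like this can actually improve the global range of the i_26 PHI is unclear, since that bound comes from the loop exit test rather than from the body stmts themselves.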
GCC 12.1 is being released, retargeting bugs to GCC 12.2.
GCC 12.2 is being released, retargeting bugs to GCC 12.3.
GCC 12.3 is being released, retargeting bugs to GCC 12.4.