The testcase pr65947-10.c fails on amdgcn because there are more vector lanes than there is data, and the algorithm created doesn't allow for this. (Actually there's also a backend pattern missing, but I have a patch for that I'll commit shortly.) Here's the affected loop: float last = 0; for (int i = 0; i < 32; i++) if (a[i] < min_v) last = a[i]; Which produces the following code (long lines shortened). vect_cst__33 = {min_v_11(D), .... min_v_11(D)}; vect__4.16_32 = .MASK_LOAD (a_10(D), 4B, { -1, [...] -1, 0, [...] 0 }); vect_last_6.17_34 = VEC_COND_EXPR <vect__4.16_32 < vect_cst__33, vect__4.16_32, { 0.0, [...] 0.0 }>; _38 = VEC_COND_EXPR <vect__4.16_32 < vect_cst__33, { 1, 2, [...] 64 }, { 0, [...] 0 }>; _40 = .REDUC_MAX (_38); _41 = {_40, _40, [...] _40}; _43 = VEC_COND_EXPR <_38 == _41, vect_last_6.17_34, { 0.0, [...] 0.0 }>; _44 = VIEW_CONVERT_EXPR<vector(64) unsigned int>(_43); _45 = .REDUC_MAX (_44); _46 = VIEW_CONVERT_EXPR<float>(_45); return _46; In English: 1. Do a masked load of 32 elements (into 64-lane register). Loads "0.0" into the spare lanes. 2. Compare the all 64-lanes against "min_v". Label all the "true" lanes with the lane number. 3. Use a reduction to find the greatest numbered "true" lane. 4. Zero all the loaded values apart from the one in the greatest lane. 5. Use a reduction to find the value of the lane that isn't zeroed. That's slightly tortuous when we could just do a vec_extract on "_40", but that's an aside. The problem is in step 2: the spare lanes contain 0.0, which means that comparing them against "min_v" returns "true". This means that the algorithm always finds "last = a[63]" which isn't a real value and therefore always ends up being "0.0".
Looks like cond-reduction cannot handle fully masked loops unless we'd somehow mask the condition operation itself?
(In reply to Richard Biener from comment #1) > Looks like cond-reduction cannot handle fully masked loops unless we'd > somehow mask the condition operation itself? Yeah, looks like it. We'd never noticed this for SVE because we always use EXTRACT_LAST reductions instead.
The GCN architecture can handle the masking, but I don't know how we'd represent or apply that in the middle end? I can probably implement extract_last, and that might be more efficient, but I don't see how that will help the logic? The convoluted extraction is working fine; it's the vector comparison that is misleading.
IIRC AVX512 also implements fully masked loops so the testcase should fail there, too, if we adjust N accordingly (to 15 or 31). Hmm, can't seem to trigger the fully masked support here, maybe I misremember. Btw, isn't the issue that the reduction looks at all lanes? That is, I think the code simply assumes that for fully masked loops at least one iteration is performed with all lanes active. So if you bump N to 64 + 32 the test passes on amdgcn?
And the issue is to be fixed in vect_create_epilog_for_reduction where we create the index IV: /* For cond reductions we want to create a new vector (INDEX_COND_EXPR) which is updated with the current index of the loop for every match of the original loop's cond_expr (VEC_STMT). This results in a vector containing the last time the condition passed for that vector lane. The first match will be a 1 to allow 0 to be used for non-matching indexes. If there are no matches at all then the vector will be all zeroes. */ if (STMT_VINFO_REDUC_TYPE (reduc_info) == COND_REDUCTION) { and the masking support needs to be added similarly to how vectorizable_condition handles it.
(In reply to Richard Biener from comment #4) > Btw, isn't the issue that the reduction looks at all lanes? That is, > I think the code simply assumes that for fully masked loops at least > one iteration is performed with all lanes active. So if you bump > N to 64 + 32 the test passes on amdgcn? Yes, only the loads are masked. For most things this works fine, but not for reductions or permutations, etc. If I set N=64, and double the input array, then the test passes indeed. Masking the load of the {1, 2, 3 .. 63} vector would solve the issue, as would masking the comparison or the reduction (not that there's an optab for that).
Thank you for the help, Richard and Richard. I'm electing not to fix this bug, but instead have worked around it for amdgcn by implementing the fold_extract_last pattern. There's two main reasons for this: 1. I'm not confident I can see how to fix it properly, so it's going to take me a long time to get it right. 2. The code that it generates is horrible (and the fix will only make it worse), so I would want to implement fold_extract_last anyway. This would leave this code unused by amdgcn, and the time spent fixing it wasted effort. I've therefore downgraded the priority and severity on this bug, and committed a pointer into the code to assist anybody tripping over the same issue in future. https://gcc.gnu.org/ml/gcc-patches/2019-12/msg01191.html The bug still exists, but is now no longer reproducible on amdgcn.