92772 – wrong code vectorizing masked max

Bug 92772 - wrong code vectorizing masked max

Summary: wrong code vectorizing masked max

Status:	UNCONFIRMED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	10.0

Importance:	P5 minor
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	wrong-code

Depends on:
Blocks:	vectorizer
	Show dependency tree / graph

Reported:	2019-12-03 15:13 UTC by Andrew Stubbs
Modified:	2019-12-17 13:19 UTC (History)
CC List:	2 users (show)

See Also:
Host:
Target:	amdgcn
Build:
Known to work:
Known to fail:
Last reconfirmed:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Andrew Stubbs 2019-12-03 15:13:05 UTC

The testcase pr65947-10.c fails on amdgcn because there are more vector lanes than there is data, and the algorithm created doesn't allow for this. (Actually there's also a backend pattern missing, but I have a patch for that I'll commit shortly.)

Here's the affected loop:

 float last = 0;

 for (int i = 0; i < 32; i++)
   if (a[i] < min_v)
     last = a[i];

Which produces the following code (long lines shortened).

   vect_cst__33 = {min_v_11(D), .... min_v_11(D)};
   vect__4.16_32 = .MASK_LOAD (a_10(D), 4B, { -1, [...] -1, 0, [...] 0 });
   vect_last_6.17_34 = VEC_COND_EXPR <vect__4.16_32 < vect_cst__33, vect__4.16_32, { 0.0, [...] 0.0 }>;
   _38 = VEC_COND_EXPR <vect__4.16_32 < vect_cst__33, { 1, 2, [...] 64 }, { 0, [...] 0 }>;
   _40 = .REDUC_MAX (_38);
   _41 = {_40, _40, [...] _40};
   _43 = VEC_COND_EXPR <_38 == _41, vect_last_6.17_34, { 0.0, [...] 0.0 }>;
   _44 = VIEW_CONVERT_EXPR<vector(64) unsigned int>(_43);
   _45 = .REDUC_MAX (_44);
   _46 = VIEW_CONVERT_EXPR<float>(_45);
   return _46;

In English:

1. Do a masked load of 32 elements (into 64-lane register). Loads "0.0" into the spare lanes.
2. Compare the all 64-lanes against "min_v". Label all the "true" lanes with the lane number.
3. Use a reduction to find the greatest numbered "true" lane.
4. Zero all the loaded values apart from the one in the greatest lane.
5. Use a reduction to find the value of the lane that isn't zeroed.

That's slightly tortuous when we could just do a vec_extract on "_40", but that's an aside.

The problem is in step 2: the spare lanes contain 0.0, which means that comparing them against "min_v" returns "true". This means that the algorithm always finds "last = a[63]" which isn't a real value and therefore always ends up being "0.0".

Comment 1 Richard Biener 2019-12-04 07:08:20 UTC

Looks like cond-reduction cannot handle fully masked loops unless we'd somehow mask the condition operation itself?

Comment 2 Richard Sandiford 2019-12-04 08:14:47 UTC

(In reply to Richard Biener from comment #1)
> Looks like cond-reduction cannot handle fully masked loops unless we'd
> somehow mask the condition operation itself?

Yeah, looks like it.  We'd never noticed this for SVE because
we always use EXTRACT_LAST reductions instead.

Comment 3 Andrew Stubbs 2019-12-04 10:07:37 UTC

The GCN architecture can handle the masking, but I don't know how we'd represent or apply that in the middle end?

I can probably implement extract_last, and that might be more efficient, but I don't see how that will help the logic? The convoluted extraction is working fine; it's the vector comparison that is misleading.

Comment 4 Richard Biener 2019-12-04 10:21:00 UTC

IIRC AVX512 also implements fully masked loops so the testcase should fail there, too, if we adjust N accordingly (to 15 or 31).  Hmm, can't seem to trigger
the fully masked support here, maybe I misremember.

Btw, isn't the issue that the reduction looks at all lanes?  That is,
I think the code simply assumes that for fully masked loops at least
one iteration is performed with all lanes active.  So if you bump
N to 64 + 32 the test passes on amdgcn?

Comment 5 Richard Biener 2019-12-04 10:25:47 UTC

And the issue is to be fixed in vect_create_epilog_for_reduction where
we create the index IV:

  /* For cond reductions we want to create a new vector (INDEX_COND_EXPR)
     which is updated with the current index of the loop for every match of
     the original loop's cond_expr (VEC_STMT).  This results in a vector
     containing the last time the condition passed for that vector lane.
     The first match will be a 1 to allow 0 to be used for non-matching
     indexes.  If there are no matches at all then the vector will be all
     zeroes.  */
  if (STMT_VINFO_REDUC_TYPE (reduc_info) == COND_REDUCTION)
    {

and the masking support needs to be added similarly to how vectorizable_condition
handles it.

Comment 6 Andrew Stubbs 2019-12-04 12:12:46 UTC

(In reply to Richard Biener from comment #4)
> Btw, isn't the issue that the reduction looks at all lanes?  That is,
> I think the code simply assumes that for fully masked loops at least
> one iteration is performed with all lanes active.  So if you bump
> N to 64 + 32 the test passes on amdgcn?

Yes, only the loads are masked. For most things this works fine, but not for reductions or permutations, etc.

If I set N=64, and double the input array, then the test passes indeed.

Masking the load of the {1, 2, 3 .. 63} vector would solve the issue, as would masking the comparison or the reduction (not that there's an optab for that).

Comment 7 Andrew Stubbs 2019-12-17 13:19:28 UTC

Thank you for the help, Richard and Richard.

I'm electing not to fix this bug, but instead have worked around it for amdgcn by implementing the fold_extract_last pattern.

There's two main reasons for this:

1. I'm not confident I can see how to fix it properly, so it's going to take me a long time to get it right.

2. The code that it generates is horrible (and the fix will only make it worse), so I would want to implement fold_extract_last anyway. This would leave this code unused by amdgcn, and the time spent fixing it wasted effort.

I've therefore downgraded the priority and severity on this bug, and committed a pointer into the code to assist anybody tripping over the same issue in future.

https://gcc.gnu.org/ml/gcc-patches/2019-12/msg01191.html

The bug still exists, but is now no longer reproducible on amdgcn.