Bug 104952 - [nvptx][OpenMP] wrong code with OR / AND reduction ('reduction(||:' and '&&') with SIMT
Summary: [nvptx][OpenMP] wrong code with OR / AND reduction ('reduction(||:' and '&&')...
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 12.0
: P3 normal
Target Milestone: 12.0
Assignee: Not yet assigned to anyone
URL:
Keywords: openmp, wrong-code
Depends on:
Blocks:
 
Reported: 2022-03-16 11:42 UTC by Tobias Burnus
Modified: 2022-03-18 14:48 UTC (History)
2 users (show)

See Also:
Host:
Target: nvptx-none
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
Tentative patch with test-cases, rationale and changelog (1.19 KB, text/plain)
2022-03-17 14:12 UTC, Tom de Vries
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Tobias Burnus 2022-03-16 11:42:35 UTC
* Works when compiled with -O0
* Fails when compiled with -O1 → "result" is 0 instead of 1.

Observations:
* Issue occurs with '||' and '&&' – and for char/short/int/long.
* Note: -O1 implies that omp_max_vf() returns != 1
  Thus, only with -O1 there is SIMT, which seems to cause the problem.
* When replacing 'reduction(||:'  by
                 'reduction(|:'   the code passes.

Regarding the latter: Given that the order is not determined (i.e. 'a || b'  and 'b || a' can occur), I think '||' can always be replaced by '|' in the reduction.

@jakub: ^  does this make sense?

* * *

Long testcase is tests/5.0/loop/test_loop_reduction_or_device.c from
https://github.com/SOLLVE/sollve_vv/
[the ..._and_... ('&&') testcase fails in the same way.] — Short testcase is below.

Short testcase (runs into abort with -O1, works with -O0 or with "reduction:(|"):

 * * *

int main () {
  char arr[100];
  int result = 0;
  for (int i = 0; i < 100; ++i)
    arr[i] = 0;
  arr[5] = 1;
#pragma omp target parallel map(tofrom:arr,result)
  #pragma omp loop reduction(||: result)
    for (int i = 0; i < 100; ++i)
      result = result || arr[i];

  if (result != 1)
    __builtin_abort ();
  return 0;
}
Comment 1 Tom de Vries 2022-03-17 10:30:13 UTC
I can reproduce the problem.

I've made the simd explicit (I hope that's still valid openmp code):
...
$ cat libgomp/testsuite/libgomp.c/test.c
#define N 32

static char arr[N];

int
main (void)
{
  unsigned int result = 0;

  for (unsigned int i = 0; i < N; ++i)
    arr[i] = 0;
  arr[5] = 1;

#pragma omp target map(tofrom:result) map(to:arr)
#pragma omp simd reduction(||: result)
  for (unsigned int i = 0; i < N; ++i)
    result = result || arr[i];

  if (result != 1)
    return 1;

  return 0;
}
...

Easy workaround:
...
diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc
index d932d74cb03..bf6845d654e 100644
--- a/gcc/omp-low.cc
+++ b/gcc/omp-low.cc
@@ -4641,6 +4641,15 @@ lower_rec_simd_input_clauses (tree new_var, omp_context *ctx,
                  sctx->max_vf = 1;
                  break;
              }
+
+             if (OMP_CLAUSE_REDUCTION_CODE (c) == TRUTH_ANDIF_EXPR
+                 || OMP_CLAUSE_REDUCTION_CODE (c) == TRUTH_ORIF_EXPR)
+               {
+                 sctx->max_vf = 1;
+                 break;
+               }
            }
        }
       if (maybe_gt (sctx->max_vf, 1U))
...
Comment 2 Tom de Vries 2022-03-17 12:07:50 UTC
I think the problem can be seen already at omp-lower, in the body of the butterfly loop.

Let's first look at what we have if we use reduction op '|':
...
                    D.2173 = .GOMP_SIMT_VF ();
                    D.2164 = 1;
                    D.2161 = 0;
                    goto <D.2175>;
                    <D.2174>:
                    D.2165 = D.2163;
                    D.2165 = D.2163;
                    D.2166 = .GOMP_SIMT_XCHG_BFLY (D.2165, D.2164);
                    D.2167 = D.2165 | D.2166;
                    D.2163 = D.2167;
                    D.2164 = D.2164 << 1;
                    <D.2175>:
                    if (D.2164 < D.2173) goto <D.2174>; else goto <D.2176>;
                    <D.2176>:
...
Fairly straightforward, we have a loop, runs a couple of times, first a shuffle (GOMP_SIMT_XCHG_BFLY), then an update (D.2167 = D.2165 | D.2166).

Now compare that with reduction op '||':
...
                    D.2183 = .GOMP_SIMT_VF ();
                    D.2164 = 1;
                    D.2161 = 0;
                    goto <D.2185>;
                    <D.2184>:
                    D.2169 = D.2163;
                    D.2170 = (_Bool) D.2169;
                    if (D.2170 != 0) goto <D.2166>; else goto <D.2171>;
                    <D.2171>:
                    D.2169 = D.2163;
                    D.2172 = .GOMP_SIMT_XCHG_BFLY (D.2169, D.2164);
                    D.2173 = (_Bool) D.2172;
                    if (D.2173 != 0) goto <D.2166>; else goto <D.2167>;
                    <D.2166>:
                    iftmp.5 = 1;
                    goto <D.2168>;
                    <D.2167>:
                    iftmp.5 = 0;
                    <D.2168>:
                    D.2163 = iftmp.5;
                    D.2164 = D.2164 << 1;
                    <D.2185>:
                    if (D.2164 < D.2183) goto <D.2184>; else goto <D.2186>;
                    <D.2186>:
...

The shuffle is now conditional.  I think the shuffle is inserted too late, in the middle of the update rather than before.
Comment 3 Tom de Vries 2022-03-17 12:46:27 UTC
Hmm, that seems to be actually due to:
...
                      if (sctx.is_simt)
                        {
                          if (!simt_lane)
                            simt_lane = create_tmp_var (unsigned_type_node);
                          x = build_call_expr_internal_loc
                            (UNKNOWN_LOCATION, IFN_GOMP_SIMT_XCHG_BFLY,
                             TREE_TYPE (ivar), 2, ivar, simt_lane);
                          x = build2 (code, TREE_TYPE (ivar), ivar, x);
                          gimplify_assign (ivar, x, &llist[2]);
                        }
...
which gimplifies assigning:
...
(gdb) call debug_generic_expr (x)
D.2163 || .GOMP_SIMT_XCHG_BFLY (D.2163, D.2164)
...
to:
...
(gdb) call debug_generic_expr (ivar)
D.2163
...
Comment 4 Tom de Vries 2022-03-17 12:56:25 UTC
This fixes it:
...
diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc
index d932d74cb03..f2ac8f98e32 100644
--- a/gcc/omp-low.cc
+++ b/gcc/omp-low.cc
@@ -6734,7 +6734,21 @@ lower_rec_input_clauses (tree clauses, gimple_seq *ilist, gimpl
e_seq *dlist,
                          x = build_call_expr_internal_loc
                            (UNKNOWN_LOCATION, IFN_GOMP_SIMT_XCHG_BFLY,
                             TREE_TYPE (ivar), 2, ivar, simt_lane);
-                         x = build2 (code, TREE_TYPE (ivar), ivar, x);
+                         /* Make sure x is evaluated unconditionally.  */
+                         enum tree_code update_code;
+                         switch (OMP_CLAUSE_REDUCTION_CODE (c))
+                           {
+                           case TRUTH_ANDIF_EXPR:
+                             update_code = TRUTH_AND_EXPR;
+                             break;
+                           case TRUTH_ORIF_EXPR:
+                             update_code = TRUTH_OR_EXPR;
+                             break;
+                           default:
+                             update_code = code;
+                             break;
+                           }
+                         x = build2 (update_code, TREE_TYPE (ivar), ivar, x);
                          gimplify_assign (ivar, x, &llist[2]);
                        }
                      tree ivar2 = ivar;
...
Comment 5 Jakub Jelinek 2022-03-17 13:04:20 UTC
Note, I guess this is related to PR94366 and r12-438-g1580fc764423bf89e9b which was fixing it for non-SIMT and quite possibly left out the SIMT stuff out.
Using the TRUTH_{AND,OR}_EXPR instead of TRUTH_{AND,OR}IF_EXPR ought to be fine,
it is only merging the private vars into the original var, so neither has really side-effects, all we want is that we are actually merging orig = (orig != 0) & (private != 0) rather than orig & private etc.
Comment 6 Jakub Jelinek 2022-03-17 13:14:05 UTC
And yes, #c1 is valid.  But would be nice to have similar test with && and
initial result = 2; and arr[] say { 1, 2, 3, 4, 5, 6, 7, ..., 32 } and test
result is 1 at the end to make sure we don't actually do just
orig = orig & (private != 0)
style merging or even just
orig = orig & private;
Comment 7 Tom de Vries 2022-03-17 13:29:05 UTC
Alternative fix that doesn't require fiddling with the 'code' var:
...
diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc
index d932d74cb03..d0ddd4a6142 100644
--- a/gcc/omp-low.cc
+++ b/gcc/omp-low.cc
@@ -6734,7 +6734,10 @@ lower_rec_input_clauses (tree clauses, gimple_seq *ilist, gimpl
e_seq *dlist,
                          x = build_call_expr_internal_loc
                            (UNKNOWN_LOCATION, IFN_GOMP_SIMT_XCHG_BFLY,
                             TREE_TYPE (ivar), 2, ivar, simt_lane);
-                         x = build2 (code, TREE_TYPE (ivar), ivar, x);
+                         /* Make sure x is evaluated unconditionally.  */
+                         tree bfly_var = create_tmp_var (TREE_TYPE (ivar));
+                         gimplify_assign (bfly_var, x, &llist[2]);
+                         x = build2 (code, TREE_TYPE (ivar), ivar, bfly_var);
                          gimplify_assign (ivar, x, &llist[2]);
                        }
                      tree ivar2 = ivar;
...
Comment 8 Tom de Vries 2022-03-17 13:31:24 UTC
(In reply to Jakub Jelinek from comment #6)
> And yes, #c1 is valid.

Thanks for confirming.

> But would be nice to have similar test with && and
> initial result = 2; and arr[] say { 1, 2, 3, 4, 5, 6, 7, ..., 32 } and test
> result is 1 at the end to make sure we don't actually do just
> orig = orig & (private != 0)
> style merging or even just
> orig = orig & private;

Ack, will add that.
Comment 9 Tom de Vries 2022-03-17 14:12:24 UTC
Created attachment 52647 [details]
Tentative patch with test-cases, rationale and changelog

I'll put this through testing, and submit if no problems found.
Comment 10 Jakub Jelinek 2022-03-17 14:17:10 UTC
Comment on attachment 52647 [details]
Tentative patch with test-cases, rationale and changelog

Please change arr[5] = 1; to arr[5] = 42; or so also to test it is doing != 0 comparisons.
Comment 11 GCC Commits 2022-03-18 14:46:56 UTC
The master branch has been updated by Tom de Vries <vries@gcc.gnu.org>:

https://gcc.gnu.org/g:093cdadbce30ce2d36846a05d979b8afc2eff618

commit r12-7702-g093cdadbce30ce2d36846a05d979b8afc2eff618
Author: Tom de Vries <tdevries@suse.de>
Date:   Thu Mar 17 14:37:28 2022 +0100

    [openmp] Fix SIMT reduction using TRUTH_{AND,OR}IF_EXPR
    
    Consider test-case pr104952-1.c, included in this commit, containing:
    ...
      #pragma omp target map(tofrom:result) map(to:arr)
      #pragma omp simd reduction(||: result)
    ...
    
    When run on x86_64 with nvptx accelerator, the test-case either aborts or
    hangs.
    
    The reduction clause is translated by the SIMT code (active for nvptx) as a
    butterfly reduction loop with this butterfly shuffle / update pair:
    ...
      D.2163 = D.2163 || .GOMP_SIMT_XCHG_BFLY (D.2163, D.2164)
    ...
    in the loop body.
    
    The problem is that the butterfly shuffle is possibly not executed, while it
    needs to be executed unconditionally.
    
    Fix this by translating instead as:
    ...
      D.tmp_bfly = .GOMP_SIMT_XCHG_BFLY (D.2163, D.2164)
      D.2163 = D.2163 || D.tmp_bfly
    ...
    
    Tested on x86_64-linux with nvptx accelerator.
    
    gcc/ChangeLog:
    
    2022-03-17  Tom de Vries  <tdevries@suse.de>
    
            PR target/104952
            * omp-low.cc (lower_rec_input_clauses): Make sure GOMP_SIMT_XCHG_BFLY
            is executed unconditionally.
    
    libgomp/ChangeLog:
    
    2022-03-17  Tom de Vries  <tdevries@suse.de>
    
            PR target/104952
            * testsuite/libgomp.c/pr104952-1.c: New test.
            * testsuite/libgomp.c/pr104952-2.c: New test.
Comment 12 Tom de Vries 2022-03-18 14:48:24 UTC
Fixed by "[openmp] Fix SIMT reduction using TRUTH_{AND,OR}IF_EXPR".