[Bug libgomp/87835] nvptx offloading: libgomp.oacc-c-c++-common/asyncwait-1.c execution test intermittently fails at -O2

Fri Jan 18 11:04:00 GMT 2019

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87835

--- Comment #3 from Thomas Schwinge <tschwinge at gcc dot gnu.org> ---
Created attachment 45457
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45457&action=edit
[WIP] libgomp.oacc-c-c++-common/asyncwait-1.c debug

(In reply to Tom de Vries from comment #2)
> (In reply to Tom de Vries from comment #1)
> > (In reply to Thomas Schwinge from comment #0)
> > > After r264397 "[nvptx] Remove use of CUDA unified memory in libgomp", I'm
> > > seeing (intermittently only, and only on some systems):
> > 
> > I see the failure reproduced consistently with a Quadro M1200.

Oh, good -- in a way ;-) -- that it's consistently reproducible for you.  For
me, the failure is rather rare.

> > > I have not yet analyzed what's causing this, but I have some ideas about
> > > pending patches that might cure it.

Unfortunately, the patches I've been thinking of either are on trunk already,
or can't possibly be related to this problem.

The 'async'/'wait' clauses/directives in the test case look correct.

> do you intend to address this before stage4 closes?

I'd like to, yes.

Here is my current status.

With "-O2":

    [...]
      nvptx_exec: kernel main$_omp_fn$37: launch gangs=32, workers=1, vectors=32
      nvptx_exec: kernel main$_omp_fn$37: finished
      GOACC_data_end: restore mappings
      GOACC_data_end: mappings restored
    [abort]

In addition to "main$_omp_fn$37", sometimes also seen with "main$_omp_fn$25",
"main$_omp_fn$29", "main$_omp_fn$33".

So far only seen with OpenACC 'kernels' constructs, but not with the very
similar 'parallel' ones earlier in the file.

For example, without "DEBUG_K":

    [...]
      nvptx_exec: kernel main$_omp_fn$37: launch gangs=32, workers=1, vectors=32
      nvptx_exec: kernel main$_omp_fn$37: finished
    GOACC_wait -2 1
    goacc_wait -2 1
    goacc_wait   1
      GOACC_data_end: restore mappings
      GOACC_data_end: mappings restored
    1007 c[64] 0
    1019 e[64] 13
    1007 c[65] 0
    1019 e[65] 13
    1007 c[66] 0
    1019 e[66] 13
    [...]
    1007 c[125] 0
    1019 e[125] 13
    1007 c[126] 0
    1019 e[126] 13
    1007 c[127] 0
    1019 e[127] 13

With "DEBUG_K":

    [...]
      nvptx_exec: kernel main$_omp_fn$37: launch gangs=1, workers=1, vectors=32
      nvptx_exec: kernel main$_omp_fn$37: finished
    GOACC_wait -2 1
    goacc_wait -2 1
    goacc_wait   1
    966 c[64] 0
    966 c[65] 0
    966 c[66] 0
    [...]
    966 c[125] 0
    966 c[126] 0
    966 c[127] 0

So, the compute kernel ("main$_omp_fn$37") doesn't find the "c" array properly
initialized, even though the initialization of "c" and this kernel are enqueued
on the same 'async', so by definition have to execute in order.
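
To illustrate the ordering I'd expect, here is a minimal sketch.  This is not
the code from asyncwait-1.c: the function name "count_mismatches", the array
contents, and the use of async queue 1 are all made up for illustration.  Two
'kernels' constructs are enqueued on the same 'async', so the second one should
always see the "c" values written by the first one:

```c
#define N 128

/* Hypothetical reduction of the pattern; NOT taken from asyncwait-1.c.
   Returns the number of elements of "b" that hold an unexpected value.  */
int
count_mismatches (void)
{
  int a[N], b[N], c[N];
  int mismatches = 0;

  for (int i = 0; i < N; i++)
    {
      a[i] = 3;
      b[i] = 0;
      c[i] = 0;
    }

#pragma acc data copy (a[0:N], b[0:N], c[0:N])
  {
    /* First compute region: initialize "c" on async queue 1.  */
#pragma acc kernels async (1)
    for (int i = 0; i < N; i++)
      c[i] = a[i] * 2;

    /* Second compute region: same async queue, so it is expected to run
       only after the first one has completed.  */
#pragma acc kernels async (1)
    for (int i = 0; i < N; i++)
      b[i] = c[i] + 1;

    /* Block the host until everything on queue 1 has finished, before the
       data region copies the arrays back.  */
#pragma acc wait (1)
  }

  for (int i = 0; i < N; i++)
    if (b[i] != 7)
      mismatches++;

  return mismatches;
}
```

When compiled without "-fopenacc", the pragmas are ignored and this runs
sequentially on the host, so it only demonstrates the intended semantics; I
have not verified that it reproduces the failure.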

I've only ever seen this with the "c" array.

Sometimes the uninitialized values start already at index 0 (often seen with
"main$_omp_fn$29"), or as late as index 100 (rarely).

Running under "valgrind" repeatedly, until there's an "abort", doesn't print
anything suspicious.

Might this perhaps be a latent issue in OpenACC 'kernels' plus 'async', now
uncovered by the r264397 "[nvptx] Remove use of CUDA unified memory in libgomp"
commit?