Bug 97203 - [nvptx] 'illegal memory access was encountered' with 'omp simd'/SIMT and cexpf call
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 11.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: openmp, wrong-code
Depends on:
Blocks:
 
Reported: 2020-09-25 08:52 UTC by Tobias Burnus
Modified: 2024-06-04 15:56 UTC (History)
CC: 2 users

See Also:
Host:
Target: nvptx
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments
C testcase - compile with -fopenmp and "-O0", "-O1", and "-O1 -funsafe-math-optimizations" (383 bytes, text/plain)
2020-09-25 08:52 UTC, Tobias Burnus

Description Tobias Burnus 2020-09-25 08:52:05 UTC
Created attachment 49269 [details]
C testcase - compile with -fopenmp and "-O0", "-O1", and "-O1 -funsafe-math-optimizations"

My impression is that this is again (→ PR95654) related to SIMT somehow going wrong, but I do not quite understand why.


The code uses 'omp simd ... reduction(…)' — using 'omp parallel do ...' instead works.


The big program works at -O0 and fails with -O1/-O2, but starts working again if -ffast-math is additionally used. The failure is:
  libgomp: cuCtxSynchronize error: invalid program counter
or
  libgomp: cuCtxSynchronize error: unspecified launch failure (perhaps abort was called) 


The attached program is a vastly reduced version that fails in a similar way with a similar pattern, which may or may not have the same cause. In any case:

It uses 'omp simd' and, hence, nvptx's SIMT; inside the 'omp simd' region it has:
            float cosArg = __builtin_cosf(expArg);
            float sinArg = __builtin_sinf(expArg);
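
For reference, a minimal sketch of the surrounding pattern (the names 'test', 'n', 'arg', and 'resp' are made up here; only the two builtins above are from the attachment):

            void
            test (int n, float *arg, float *resp)
            {
              float res;
            #pragma omp target map(to: arg[0:n]) map(from: res)
              {
                float sum = 0.0f;
            #pragma omp simd reduction(+: sum)
                for (int i = 0; i < n; i++)
                  {
                    float expArg = arg[i];
                    float cosArg = __builtin_cosf (expArg);
                    float sinArg = __builtin_sinf (expArg);
                    /* sinf and cosf of the same argument is what later
                       triggers the sincos/cexpf folding discussed below.  */
                    sum += cosArg + sinArg;
                  }
                res = sum;
              }
              *resp = res;
            }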

With -O0, but also with -O1/-O2 plus -funsafe-math-optimizations, it works; with -funsafe-math-optimizations the generated code contains:
                cos.approx.f32  %r73, %r75;
                sin.approx.f32  %r72, %r75;
and with -O0 (and unsafe math disabled):
                call (%value_in), cosf, (%out_arg1);
                call (%value_in), sinf, (%out_arg1);

But with -O1/-O2 it fails with:
   libgomp: cuCtxSynchronize error: an illegal memory access was encountered
Here, the sin/cos pair was turned into BUILT_IN_SINCOSF and we end up with the code:
   call cexpf, (%out_arg1, %out_arg2, %out_arg3);


I have no idea why 'call cosf/sinf' inside 'omp simd' works but 'call cexpf' fails – nor whether that is indeed related to SIMT.


I think there are two issues. Mainly:

FIRST ISSUE: Why does it fail with 'cexpf'?

 * * *

SECOND ISSUE: Missed optimization for BUILT_IN_SINCOSF:

  if (optab_handler (sincos_optab, mode) != CODE_FOR_nothing)
...
  else if (targetm.libc_has_function (function_sincos))
...
  else
...
        fn = builtin_decl_explicit (BUILT_IN_CEXPF);


Seems as if we do the latter. In newlib's ./newlib/libm/complex/cexpf.c:

float complex
cexpf(float complex z)
...
        x = crealf(z);
        y = cimagf(z);
        r = expf(x);
        w = r * cosf(y) + r * sinf(y) * I;
        return w;

which is not really a performance boost compared to just calling sinf/cosf ...
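
For the record, that fallback recovers sin/cos via Euler's formula, cexpf(i*x) = cosf(x) + i*sinf(x). Roughly, at the source level (just a sketch; the helper name is made up):

  #include <complex.h>

  static void
  sincosf_via_cexpf (float x, float *sinx, float *cosx)
  {
    /* cexpf (i*x) = cosf (x) + i*sinf (x), so take the real/imag parts.  */
    float _Complex w = cexpf (x * _Complex_I);
    *cosx = crealf (w);
    *sinx = cimagf (w);
  }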

Note that newlib does have newlib/libm/math/wf_sincos.c which does:

        void
        sincosf (float x, float *sinx, float *cosx)
        {
          *sinx = sinf (x);
          *cosx = cosf (x);
        }
This avoids a bunch of '*' and '+' and in particular an 'expf' call. (It should still be slower than directly calling sinf/cosf due to the call overhead, but much better than cexpf, unless implemented in hardware.)
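
In other words, the preferable lowering, where the libc provides sincosf, would simply be (sketch, reusing the names from the snippet in the description):

  float sinArg, cosArg;
  sincosf (expArg, &sinArg, &cosArg);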
Comment 1 Tobias Burnus 2020-09-25 08:57:54 UTC
Besides PR95654, see PR81778 and PR80053.
Comment 2 Tom de Vries 2020-10-08 09:56:03 UTC
Minimal version (without inlining sinf code from newlib):
...
/* { dg-additional-options "-lm -foffload=-lm" } */

#define N 1

int
main (void) {
  float k[N];
  float res;

  for (int i = 0; i < N; i++)
    k[i] = 300;
  
#pragma omp target map(to:k) map(from:res)
  {
    float sum = 0.0;
#pragma omp simd reduction(+:sum)
    for (int i = 0; i < N; i++)
      sum += __builtin_sinf (k[i]);
    
    res = sum;
  }

  return 0;
}
...
Comment 3 Tom de Vries 2020-10-08 12:57:45 UTC
[ Note, this is with GOMP_NVPTX_JIT=-O0. ]

In sinf, we have:
...
 45:                        return -__kernel_cosf(y[0],y[1]);
...
which translates to:
...
.loc 1 45 12
ld.f32 %r67,[%frame+4];
ld.f32 %r65,[%frame];
{
.param .f32 %value_in;
.param .f32 %out_arg1;
st.param.f32 [%out_arg1],%r65;
.param .f32 %out_arg2;
st.param.f32 [%out_arg2],%r67;
call (%value_in),__kernel_cosf,(%out_arg1,%out_arg2);
ld.param.f32 %r68,[%value_in];
}
.loc 1 45 11
neg.f32 %r37,%r68;
...

If I place (using GOMP_NVPTX_PTXRW) a trap before the first load:
...
 .loc 1 45 12
+trap
 ld.f32 %r67,[%frame+4];
...
I get:
...
libgomp: cuCtxSynchronize error: an illegal instruction was encountered
...

If I place it after the first load, I get:
...
libgomp: cuCtxSynchronize error: an illegal memory access was encountered
...
Comment 4 Tom de Vries 2020-10-08 15:27:34 UTC
So, I think calling functions from simd code is atm not supported for nvptx.

Stack variables in simd code are mapped on a per-thread stack rather than on the
usual per-warp stack.

The functions are compiled with the usual per-warp stack, so calling those functions from simd might mean the different lanes are gonna disagree about what the value in a stack variable should be.

Having said that, for the example in comment 2, there should be only one thread executing the call, so this doesn't explain the illegal memory access.
Comment 5 Tom de Vries 2020-10-08 15:37:00 UTC
FWIW, another aspect here is convergence (as usual).

Looking at the SASS code for main$_omp_fn$0$impl, I don't find evidence of the usual divergence/convergence ops (SSY/SYNC). That might mean the following shfl is executed in divergent mode, so even if we did not get the memory access error, we would not get correct results.
Comment 6 Alexander Monakov 2020-10-09 08:36:30 UTC
(In reply to Tom de Vries from comment #4)
> So, I think calling functions from simd code is atm not supported for nvptx.
> 
> Stack variables in simd code are mapped on a per-thread stack rather than on
> the
> usual per-warp stack.
> 
> The functions are compiled with the usual per-warp stack, so calling those
> functions from simd might mean the different lanes are gonna disagree about
> what the value in a stack variable should be.

This is inaccurate. In -msoft-stack mode there's no baked-in assumption that stacks are always per-warp. The "soft stack" pointer can point either to global memory (outside of SIMD regions), or to local memory (inside SIMD regions). The pointer is switched between per-warp global memory and per-lane local memory by nvptx.c:nvptx_output_softstack_switch.

The main requirement is that functions callable from OpenMP offloaded code are compiled for -mgomp multilib variant. The design allows calling functions even from inside SIMD regions, and it should be supported.
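
To illustrate the idea, a plain-C model of that switching (all names here are invented for illustration; the real switch is PTX emitted by nvptx_output_softstack_switch, and the per-lane buffer lives in .local memory sized via -msoft-stack-reserve-local):

  #define RESERVE_LOCAL 128                  /* cf. -msoft-stack-reserve-local */

  static __thread char *soft_stack;          /* the "soft stack" pointer       */
  static __thread char simtstack[RESERVE_LOCAL]; /* per-lane backing store     */

  static char *
  simt_enter (void)
  {
    char *saved = soft_stack;                /* per-warp, global memory        */
    /* Point the stack at per-lane local memory; stacks grow downwards.  */
    soft_stack = simtstack + sizeof simtstack;
    return saved;
  }

  static void
  simt_exit (char *saved)
  {
    soft_stack = saved;                      /* back to the per-warp stack     */
  }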

It is very disappointing that the first reaction was "I think ... is not supported" without reaching out and asking questions. Lack of efficient communication was a huge issue when OpenMP offloading support was contributed, and it's disappointing to see it again years later.
Comment 7 Tom de Vries 2020-10-12 08:44:41 UTC
(In reply to Alexander Monakov from comment #6)
> (In reply to Tom de Vries from comment #4)
> > So, I think calling functions from simd code is atm not supported for nvptx.
> > 
> > Stack variables in simd code are mapped on a per-thread stack rather than on
> > the
> > usual per-warp stack.
> > 
> > The functions are compiled with the usual per-warp stack, so calling those
> > functions from simd might mean the different lanes are gonna disagree about
> > what the value in a stack variable should be.
> 
> This is inaccurate. In -msoft-stack mode there's no baked-in assumption that
> stacks are always per-warp. The "soft stack" pointer can point either to
> global memory (outside of SIMD regions), or to local memory (inside SIMD
> regions). The pointer is switched between per-warp global memory and
> per-lane local memory by nvptx.c:nvptx_output_softstack_switch.
> 
> The main requirement is that functions callable from OpenMP offloaded code
> are compiled for -mgomp multilib variant. The design allows calling
> functions even from inside SIMD regions, and it should be supported.

I see, that's helpful, thanks.

I guess I was thrown off by seeing a %simtstack_ar of 136 bytes:
...
.local .align 8 .b8 %simtstack_ar[136];
...
which seems more of an amount claimed by a single function.

Is it possible you meant the default of -msoft-stack-reserve-local=128 to mean 128KB (similar to what is claimed in nvptx_stacks_size in the plugin)? Because currently it means 128 bytes.
Comment 8 Alexander Monakov 2020-10-12 09:36:03 UTC
No, -msoft-stack-reserve-local is really meant to be in bytes: it may not exceed the amount of .local memory reserved by CUDA driver (which is just 1-2 KB, unless overridden via cuCtxSetLimit, which nvptx-run.c does, but plugin-nvptx.c does not).

Keep in mind that .local memory reservation is multiplied by number of active contexts, which could be in range 20000-30000 when the code was written: 128KB local memory per active thread would imply a 2.5GB allocation on the GPU.
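
(Back-of-envelope: 128 * 1024 bytes * 20000 contexts = 2,621,440,000 bytes, i.e. roughly 2.44GB, matching the figure above.)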
Comment 9 Tom de Vries 2020-10-12 13:00:04 UTC
(In reply to Tom de Vries from comment #2)
> Minimal version (without inlining sinf code from newlib):
> ...
> /* { dg-additional-options "-lm -foffload=-lm" } */
> 
> #define N 1
> 
> int
> main (void) {
>   float k[N];
>   float res;
> 
>   for (int i = 0; i < N; i++)
>     k[i] = 300;
>   
> #pragma omp target map(to:k) map(from:res)
>   {
>     float sum = 0.0;
> #pragma omp simd reduction(+:sum)
>     for (int i = 0; i < N; i++)
>       sum += __builtin_sinf (k[i]);
>     
>     res = sum;
>   }
> 
>   return 0;
> }
> ...

Starts passing at -foffload=-msoft-stack-reserve-local=346.
Comment 10 Tom de Vries 2020-10-12 15:10:27 UTC
(In reply to Alexander Monakov from comment #8)
> No, -msoft-stack-reserve-local is really meant to be in bytes: it may not
> exceed the amount of .local memory reserved by CUDA driver (which is just
> 1-2 KB, unless overridden via cuCtxSetLimit, which nvptx-run.c does, but
> plugin-nvptx.c does not).
> 
> Keep in mind that .local memory reservation is multiplied by number of
> active contexts, which could be in range 20000-30000 when the code was
> written: 128KB local memory per active thread would imply a 2.5GB allocation
> on the GPU.

With the number of active contexts, do you mean the sm_count * thread_max as used in nvptx-run.c (which, FWIW, is 10240 on my card)?
Comment 11 Alexander Monakov 2020-10-12 15:15:20 UTC
Yes, that.