Created attachment 49269 [details]
C testcase - compile with -fopenmp and "-O0", "-O1", and "-O1 -funsafe-math-optimizations"

My impression is that this is again (→ PR95654) related to SIMT going somehow wrong, but I do not quite understand why. The code uses 'omp simd ... reduction(…)'; using 'omp parallel do ...' instead works.

The big program works at -O0, fails with -O1/-O2, but starts working again if additionally -ffast-math is used. The fail is:

  libgomp: cuCtxSynchronize error: invalid program counter

or:

  libgomp: cuCtxSynchronize error: unspecified launch failure (perhaps abort was called)

The attached program is a vastly reduced version, which has a similar fail and a similar pattern, which may or may not have the same cause. In any case: it uses 'omp simd' and, hence, nvptx's SIMT, and inside 'omp simd':

  float cosArg = __builtin_cosf(expArg);
  float sinArg = __builtin_sinf(expArg);

With -O0, but also with -O1/-O2 -funsafe-math-optimizations, it works; with -funsafe-math-optimizations the code contains:

  cos.approx.f32 %r73, %r75;
  sin.approx.f32 %r72, %r75;

and with -O0 (and unsafe math disabled):

  call (%value_in), cosf, (%out_arg1);
  call (%value_in), sinf, (%out_arg1);

But with -O1/-O2 it fails with:

  libgomp: cuCtxSynchronize error: an illegal memory access was encountered

Here, the sin/cos was turned into BUILT_IN_SINCOSF and we end up with the code:

  call cexpf, (%out_arg1, %out_arg2, %out_arg3);

I have no idea why 'call cosf/sinf' inside 'omp simd' works but 'call cexpf' fails, nor whether that is indeed related to SIMT.

I think there are two issues. Mainly:

FIRST ISSUE: Why does it fail with 'cexpf'?

* * *

SECOND ISSUE: Missed optimization for BUILT_IN_SINCOSF:

  if (optab_handler (sincos_optab, mode) != CODE_FOR_nothing)
    ...
  else if (targetm.libc_has_function (function_sincos))
    ...
  else
    ...
    fn = builtin_decl_explicit (BUILT_IN_CEXPF);

It seems as if we do the latter. In newlib's ./newlib/libm/complex/cexpf.c:

  cexpf(float complex z)
  ...
    x = crealf(z);
    y = cimagf(z);
    r = expf(x);
    w = r * cosf(y) + r * sinf(y) * I;
    return w;

which is not really a performance boost compared to just calling sinf/cosf ...

Note that newlib does have newlib/libm/math/wf_sincos.c which does:

  void sincosf(float x, float *sinx, float *cosx)
  {
    *sinx = sinf (x);
    *cosx = cosf (x);

which avoids a bunch of '*' and '+' and in particular an 'expf' call. (It should still be slower than directly calling sinf/cosf due to the call overhead, but much better than cexpf, unless implemented in hardware.)
Besides PR95654, see PR81778 and PR80053.
Minimal version (without inlining sinf code from newlib):
...
/* { dg-additional-options "-lm -foffload=-lm" } */

#define N 1

int
main (void) {
  float k[N];
  float res;

  for (int i = 0; i < N; i++)
    k[i] = 300;

#pragma omp target map(to:k) map(from:res)
  {
    float sum = 0.0;
#pragma omp simd reduction(+:sum)
    for (int i = 0; i < N; i++)
      sum += __builtin_sinf (k[i]);

    res = sum;
  }

  return 0;
}
...
[ Note, this is with GOMP_NVPTX_JIT=-O0. ]

In sinf, we have:
...
45:	return -__kernel_cosf(y[0],y[1]);
...
which translates to:
...
.loc 1 45 12
	ld.f32	%r67, [%frame+4];
	ld.f32	%r65, [%frame];
	{
		.param .f32 %value_in;
		.param .f32 %out_arg1;
		st.param.f32 [%out_arg1], %r65;
		.param .f32 %out_arg2;
		st.param.f32 [%out_arg2], %r67;
		call (%value_in), __kernel_cosf, (%out_arg1, %out_arg2);
		ld.param.f32 %r68, [%value_in];
	}
.loc 1 45 11
	neg.f32	%r37, %r68;
...

If I place (using GOMP_NVPTX_PTXRW) a trap before the first load:
...
.loc 1 45 12
+	trap
	ld.f32	%r67, [%frame+4];
...
I get:
...
libgomp: cuCtxSynchronize error: an illegal instruction was encountered
...

If I place it after the first load, I get:
...
libgomp: cuCtxSynchronize error: an illegal memory access was encountered
...
So, I think calling functions from simd code is atm not supported for nvptx.

Stack variables in simd code are mapped on a per-thread stack rather than on the usual per-warp stack. The functions are compiled with the usual per-warp stack, so calling those functions from simd might mean the different lanes are gonna disagree about what the value in a stack variable should be.

Having said that, for the example in comment 2, there should be only one thread executing the call, so this doesn't explain the illegal memory access.
FWIW, another aspect here is convergence (as usual).

Looking at the SASS code for main$_omp_fn$0$impl, I don't find evidence of the usual divergence/convergence ops (SSY/SYNC), which might mean that the following shfl is executed in divergent mode. So even if we did not get the memory access error, we would not get correct results.
(In reply to Tom de Vries from comment #4)
> So, I think calling functions from simd code is atm not supported for nvptx.
> 
> Stack variables in simd code are mapped on a per-thread stack rather than
> on the usual per-warp stack.
> 
> The functions are compiled with the usual per-warp stack, so calling those
> functions from simd might mean the different lanes are gonna disagree about
> what the value in a stack variable should be.

This is inaccurate. In -msoft-stack mode there's no baked-in assumption that stacks are always per-warp. The "soft stack" pointer can point either to global memory (outside of SIMD regions), or to local memory (inside SIMD regions). The pointer is switched between per-warp global memory and per-lane local memory by nvptx.c:nvptx_output_softstack_switch.

The main requirement is that functions callable from OpenMP offloaded code are compiled for the -mgomp multilib variant. The design allows calling functions even from inside SIMD regions, and it should be supported.

It is very disappointing that the first reaction was "I think ... is not supported" without reaching out and asking questions. Lack of efficient communication was a huge issue when OpenMP offloading support was contributed, and it's disappointing to see it again years later.
(In reply to Alexander Monakov from comment #6)
> (In reply to Tom de Vries from comment #4)
> > So, I think calling functions from simd code is atm not supported for nvptx.
> > 
> > Stack variables in simd code are mapped on a per-thread stack rather than
> > on the usual per-warp stack.
> > 
> > The functions are compiled with the usual per-warp stack, so calling those
> > functions from simd might mean the different lanes are gonna disagree about
> > what the value in a stack variable should be.
> 
> This is inaccurate. In -msoft-stack mode there's no baked-in assumption that
> stacks are always per-warp. The "soft stack" pointer can point either to
> global memory (outside of SIMD regions), or to local memory (inside SIMD
> regions). The pointer is switched between per-warp global memory and
> per-lane local memory by nvptx.c:nvptx_output_softstack_switch.
> 
> The main requirement is that functions callable from OpenMP offloaded code
> are compiled for -mgomp multilib variant. The design allows calling
> functions even from inside SIMD regions, and it should be supported.

I see, that's helpful, thanks.

I guess I was thrown off by seeing a %simtstack_ar of 136 bytes:
...
.local .align 8 .b8 %simtstack_ar[136];
...
which seems more of an amount claimed by a single function.

Is it possible you meant the default of -msoft-stack-reserve-local=128 to mean 128kb (similar to what is claimed in nvptx_stacks_size in the plugin)? Because currently it means 128 bytes.
No, -msoft-stack-reserve-local is really meant to be in bytes: it may not exceed the amount of .local memory reserved by CUDA driver (which is just 1-2 KB, unless overridden via cuCtxSetLimit, which nvptx-run.c does, but plugin-nvptx.c does not).

Keep in mind that .local memory reservation is multiplied by number of active contexts, which could be in range 20000-30000 when the code was written: 128KB local memory per active thread would imply a 2.5GB allocation on the GPU.
(In reply to Tom de Vries from comment #2)
> Minimal version (without inlining sinf code from newlib):
> ...
> /* { dg-additional-options "-lm -foffload=-lm" } */
> 
> #define N 1
> 
> int
> main (void) {
>   float k[N];
>   float res;
> 
>   for (int i = 0; i < N; i++)
>     k[i] = 300;
> 
> #pragma omp target map(to:k) map(from:res)
>   {
>     float sum = 0.0;
> #pragma omp simd reduction(+:sum)
>     for (int i = 0; i < N; i++)
>       sum += __builtin_sinf (k[i]);
> 
>     res = sum;
>   }
> 
>   return 0;
> }
> ...

Starts passing at -foffload=-msoft-stack-reserve-local=346.
(In reply to Alexander Monakov from comment #8)
> No, -msoft-stack-reserve-local is really meant to be in bytes: it may not
> exceed the amount of .local memory reserved by CUDA driver (which is just
> 1-2 KB, unless overridden via cuCtxSetLimit, which nvptx-run.c does, but
> plugin-nvptx.c does not).
> 
> Keep in mind that .local memory reservation is multiplied by number of
> active contexts, which could be in range 20000-30000 when the code was
> written: 128KB local memory per active thread would imply a 2.5GB allocation
> on the GPU.

With the number of active contexts, do you mean the sm_count * thread_max as used in nvptx-run.c (which, FWIW, is 10240 on my card)?
Yes, that.