:(. So what other options does one have for implementing something like TLS,
even using inline asm or similar? There is %tid, so perhaps one could index
some array with %tid? The trouble with that is that some thread can do
#pragma omp parallel again, and I bet the %tid afterwards would again be
0-(n-1), and if it is an index into a global array, it wouldn't work
well then. Maybe without anything like TLS we can't really support nested
parallelism, only one level of #pragma omp parallel inside of nvptx regions.
But if we add support for #pragma omp teams, we'd need either the array
in gang-local memory, or some other special register to give us the gang id.