gomp-nvptx branch status

Fri Nov 11 08:40:00 GMT 2016

On Thu, Nov 10, 2016 at 08:09:51PM +0300, Alexander Monakov wrote:
> I'd like to provide an overview of the gomp-nvptx branch status. In response to
> this message I'll send two more emails, with libgomp and middle-end changes on
> the branch.  Some of the changes to libgomp such as build machinery adaptations
> have already received substantial comments in 2015, but the middle-end stuff is
> mostly unreviewed I believe.
> 
> Middle-end changes mostly amount to adding SIMD-to-SIMT transforms in omp-low.c,
> as shown on the Cauldron.  SIMT outlining via gimplifier abuse is not there, and
> neither is cloning of SIMD/SIMT loops.  Outlining is required for correctness,
> and cloning is useful as it allows to avoid intermixing SIMD+SIMT and thus be
> sure that SIMT lowering does not 'dirty' SIMD loops and regress host/MIC
> vectorization.  I could argue that it's possible to improve my SIMT lowering to
> avoid some dirtying (like moving loop-invariant calls to GOMP_SIMT_VF()), but
> the need for outlining makes that moot anyway, I think.

Approved with small nits; only very few require immediate action, and the rest
can be handled incrementally once the changes are in.
Please work with Bernd on the config/nvptx bits.

> To get great performance this will need further changes everywhere, including
> in target-independent code, due to accidents like this bug (which I'd like to
> ping given the topic): https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68706 

Do you or anyone else have suggestions on how to find the threshold at which
a single global lock becomes preferable to many separate atomics?  In any
case, we'd need to analyze all the operations to see whether we can use
atomics for them; if we need to use the lock for any of them, then using it
for all of them is probably better than many atomics + one GOMP_atomic_*
pair.  Then there is the case of user defined reductions; we should try
harder to use atomics for them.
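
As a rough illustration of the trade-off (a hand-written sketch, not what
omp-low.c emits), consider three reductions where sum and count can be merged
with atomics but max cannot be written as a plain "omp atomic" update.  The
names struct partials, merge_mixed, merge_single_lock and reduce are made up
for this example, and "#pragma omp critical" merely stands in for the
GOMP_atomic_* pair.  Compiled without -fopenmp the pragmas are ignored and the
code runs serially with the same result.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical set of reduction variables: sum, count and max.  */
struct partials
{
  long sum;
  long count;
  long max;
};

/* Mixed strategy: two atomics for the operations that allow it, plus
   one lock (critical section) for the max update that does not.  */
static void
merge_mixed (struct partials *total, const struct partials *priv)
{
  #pragma omp atomic
  total->sum += priv->sum;
  #pragma omp atomic
  total->count += priv->count;
  #pragma omp critical
  {
    if (priv->max > total->max)
      total->max = priv->max;
  }
}

/* Single-lock strategy: since the lock is needed for max anyway,
   merge all three variables under it and skip the atomics.  */
static void
merge_single_lock (struct partials *total, const struct partials *priv)
{
  #pragma omp critical
  {
    total->sum += priv->sum;
    total->count += priv->count;
    if (priv->max > total->max)
      total->max = priv->max;
  }
}

/* Each thread accumulates privately, then merges once at the end
   using the selected strategy.  */
void
reduce (const long *data, size_t n, int use_single_lock,
        struct partials *out)
{
  struct partials total = { 0, 0, LONG_MIN };

  assert (out != NULL);

  #pragma omp parallel
  {
    struct partials priv = { 0, 0, LONG_MIN };

    #pragma omp for nowait
    for (size_t i = 0; i < n; i++)
      {
        priv.sum += data[i];
        priv.count++;
        if (data[i] > priv.max)
          priv.max = data[i];
      }

    if (use_single_lock)
      merge_single_lock (&total, &priv);
    else
      merge_mixed (&total, &priv);
  }

  *out = total;
}
```

Both strategies compute the same values; they differ only in how many
synchronization operations each thread performs at merge time (two atomics
plus one lock versus one lock), which is the cost a threshold heuristic would
have to weigh.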

> With OpenMP/PTX offloading there are 5 additional failures in check-target-libgomp:
> 
> Two due to tests using 'usleep' in a target region:
> FAIL: libgomp.c/target-32.c (test for excess errors)
> FAIL: libgomp.c/thread-limit-2.c (test for excess errors)

Could these be "solved" say by something like:

--- libgomp/testsuite/libgomp.c/target-32.c.jj	2015-11-14 19:38:31.000000000 +0100
+++ libgomp/testsuite/libgomp.c/target-32.c	2016-11-11 09:29:50.411072865 +0100
@@ -1,7 +1,20 @@
 #include <stdlib.h>
 #include <unistd.h>
+#include <omp.h>
 
-int main ()
+static inline void
+do_sleep (int cnt)
+{
+  int i;
+  if (omp_is_initial_device ())
+    usleep (cnt);
+  else
+    for (i = 0; i < 10 * cnt; i++)
+      asm volatile ("" : : : "memory");
+}
+
+int
+main ()
 {
   int a = 0, b = 0, c = 0, d[7];

plus folding omp_is_initial_device as a builtin in the offloading compiler
(which we want to do anyway, and a similar builtin is already folded for
OpenACC)?

> 
> Two with 'target nowait' (not implemented)
> FAIL: libgomp.c/target-33.c execution test
> FAIL: libgomp.c/target-34.c execution test
> 
> One with 'target link' (not implemented)
> FAIL: libgomp.c/target-link-1.c (test for excess errors)

Can you work on implementing these during stage3?

	Jakub