[Bug middle-end/88703] New: oacc_validate_dims allows invalid dimensions

Sat Jan 5 00:52:00 GMT 2019

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88703

            Bug ID: 88703
           Summary: oacc_validate_dims allows invalid dimensions
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: vries at gcc dot gnu.org
  Target Milestone: ---

Consider oacc_validate_dims on trunk:
...
oacc_validate_dims (tree fn, tree attrs, int *dims, int level, unsigned used)
{
  tree purpose[GOMP_DIM_MAX];
  unsigned ix;
  tree pos = TREE_VALUE (attrs);

  /* Make sure the attribute creator attached the dimension                     
     information.  */
  gcc_assert (pos);

  for (ix = 0; ix != GOMP_DIM_MAX; ix++)
    {
      purpose[ix] = TREE_PURPOSE (pos);
      tree val = TREE_VALUE (pos);
      dims[ix] = val ? TREE_INT_CST_LOW (val) : -1;
      pos = TREE_CHAIN (pos);
    }

  bool changed = targetm.goacc.validate_dims (fn, dims, level);

  /* Default anything left to 1 or a partitioned default.  */
  for (ix = 0; ix != GOMP_DIM_MAX; ix++)
    if (dims[ix] < 0)
      {
        /* The OpenACC spec says 'If the [num_gangs] clause is not              
           specified, an implementation-defined default will be used;           
           the default may depend on the code within the construct.'            
           (2.5.6).  Thus an implementation is free to choose                   
           non-unity default for a parallel region that doesn't have            
           any gang-partitioned loops.  However, it appears that there          
           is a sufficient body of user code that expects non-gang              
           partitioned regions to not execute in gang-redundant mode.           
           So we (a) don't warn about the non-portability and (b) pick          
           the minimum permissible dimension size when there is no              
           partitioned execution.  Otherwise we pick the global                 
           default for the dimension, which the user can control.  The          
           same wording and logic applies to num_workers and                    
           vector_length, however the worker- or vector- single                 
           execution doesn't have the same impact as gang-redundant             
           execution.  (If the minimum gang-level partioning is not 1,          
           the target is probably too confusing.)  */
        dims[ix] = (used & GOMP_DIM_MASK (ix)
                    ? oacc_default_dims[ix] : oacc_min_dims[ix]);
        changed = true;
      }

  if (changed)
    {
      /* Replace the attribute with new values.  */
      pos = NULL_TREE;
      for (ix = GOMP_DIM_MAX; ix--;)
        pos = tree_cons (purpose[ix],
                         build_int_cst (integer_type_node, dims[ix]), pos);
      oacc_replace_fn_attrib (fn, pos);
    }
}
...

It does the following:
- read the dimensions set in the attributes
- call targetm.goacc.validate_dims on those dimensions
- apply oacc_default_dims[ix] or oacc_min_dims[ix] to set remaining unset
  dimensions
- update the dimensions in the attributes

However, it's possible that the resulting dimensions are in fact invalid.

Consider this test-case on og8 branch:
...
$ cat libgomp/testsuite/libgomp.oacc-c-c++-common/test.c
/* { dg-do run { target openacc_nvidia_accel_selected } } */
/* { dg-additional-options "-fopenacc-dim=:32
-foffload=-mlong-vector-in-workers" } */

#include <stdlib.h>

#define N 2048

unsigned int a[N];
unsigned int b[N];
unsigned int c[N];
unsigned int n = N;

int
main (void)
{
#pragma acc parallel vector_length (128) copyin (a,b) copyout (c)
  {
    #pragma acc loop worker
    for (unsigned int i = 0; i < n; i++)
      #pragma acc loop vector
      for (unsigned int j = 0; j < n; j++)
        ;
  }

  return 0;
}
...

This generates these dimensions in the .s file:
...
//:FUNC_MAP "main$_omp_fn$0", 0x1, 0x20, 0x80
...
so, num_workers * vector_length == 0x20 * 0x80 == 32 * 128 == 4096 (while the
maximum allowed is 1024 == maximum CTA size).

This causes a runtime error:
...
libgomp: The Nvidia accelerator has insufficient resources to launch
'main$_omp_fn$0' with num_workers = 32 and vector_length = 128; recompile the
program with 'num_workers = x and vector_length = y' on that offloaded region
or '-fopenacc-dim=-:x:y' where x * y <= 1024.
...

An easy way to detect this problem at compile time is by adding an assert here:
...

diff --git a/gcc/omp-offload.c b/gcc/omp-offload.c
index aac0aa8b27a..7e3efa1032e 100644
--- a/gcc/omp-offload.c
+++ b/gcc/omp-offload.c
@@ -700,6 +700,7 @@ oacc_validate_dims
                    ? oacc_default_dims[ix] : oacc_min_dims[ix]);
        changed = true;
       }
+  gcc_assert (!targetm.goacc.validate_dims (fn, dims, level));

   if (changed)
     {
...

For the test-case, the compiler will enter the second call to
targetm.goacc.validate_dims with dims {1, 32, 128}, which will be updated to
{1, 32, 32}, which will cause targetm.goacc.validate_dims to return true, which
will trigger the assert.

AFAIU, this is a generic problem with the targetm.goacc.validate_dims hook on
both trunk and og8, and not specific to nvptx.