[Bug tree-optimization/91020] New: Enhance SRA to deal with "omp simd array" variables

jakub at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Thu Jun 27 10:50:00 GMT 2019


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91020

            Bug ID: 91020
           Summary: Enhance SRA to deal with "omp simd array" variables
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jakub at gcc dot gnu.org
  Target Milestone: ---

See PR91018 for related info; the PSTL has (apparently not enabled yet?) code
to use user-defined reductions and wrappers.
Here it is written by hand, with the _Combiner template copied from PSTL (so
PSTL licensed):

template <typename _Tp, typename _BinaryOp>
struct _Combiner
{
    _Tp __value;
    _BinaryOp* __bin_op; // A pointer to the functor here, because of the default ctor

    _Combiner() : __value{}, __bin_op(nullptr) {}
    _Combiner(const _Tp& value, const _BinaryOp* bin_op)
        : __value(value), __bin_op(const_cast<_BinaryOp*>(bin_op)) {}
    _Combiner(const _Combiner& __obj) : __value{}, __bin_op(__obj.__bin_op) {}

    void
    operator()(const _Combiner& __obj)
    {
        __value = (*__bin_op)(__value, __obj.__value);
    }
};

int r, a[1024], b[1024];

template <class _Tp, class _BinaryOperation>
static inline void
foo (_Tp *a, _Tp *b, _Tp &r, _BinaryOperation __binary_op)
{
  typedef _Combiner<_Tp, _BinaryOperation> _CombinerType;
  _CombinerType __init_{r, &__binary_op};
  #pragma omp declare reduction(__bin_op : _CombinerType : omp_out(omp_in)) \
    initializer(omp_priv = omp_orig)

  #pragma omp simd reduction (inscan, __bin_op:__init_)
  for (int i = 0; i < 1024; i++)
    {
      __init_.__value = __binary_op(__init_.__value, a[i]);
      #pragma omp scan inclusive(__init_)
      b[i] = __init_.__value;
    }
  r = __init_.__value;
}

__attribute__((noipa)) void
foo (int *a, int *b, int &r)
{
  foo (a, b, r, [](int x, int y){ return x + y; });
}

We can't vectorize this ATM, as the vectorizer supports only whole-element
accesses for the simd lane, i.e. accesses without gaps.  While it is in theory
possible to support those, it would never result in very good code.

What would help in this case is having SRA optimize those "omp simd array"
variables, e.g. by finding out that after inlining the different fields of the
elements of these arrays are only ever accessed separately.  Just as normal SRA
would in that case undo the C++ abstraction penalty by splitting the __init_
variable into __init_$__value and __init_$__bin_op scalar variables, could SRA
in this case split the "omp simd array" arrays into one int array (the __value
elements) and another array holding __binary_op (and ideally find out that the
latter is only ever written to and can thus be thrown away)?

As the "omp simd array" variables are created by the compiler, there are certain
rules one can rely on, e.g. that all code accesses only a single element of
those arrays at a time.

Perhaps it could also help cases where, say, there is a user-defined reduction
containing multiple fields that could be similarly split off, and would be if
it weren't for the "omp simd array" arrays.  E.g.
struct S { int p; int m; };
#pragma omp declare reduction (plusmult: struct S : (omp_out.p += omp_in.p), (omp_out.m *= omp_in.m)) \
  initializer (omp_priv = { 0, 1 })

struct S
foo (int *a, int *b)
{
  struct S s = { 0, 1 };
  int i;
  #pragma omp simd reduction (plusmult: s)
  for (i = 0; i < 1024; ++i)
    {
      s.p += a[i];
      s.m *= b[i];
    }
  return s;
}