Bug 80844

Summary: OpenMP SIMD doesn't know how to efficiently zero a vector (it stores zeros and reloads)
Product: gcc
Reporter: Peter Cordes <pcordes>
Component: tree-optimization
Assignee: Richard Biener <rguenth>
Status: ASSIGNED
Severity: normal
CC: jakub
Priority: P3
Keywords: missed-optimization, openmp
Version: 8.0
Target Milestone: ---
See Also: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833
          https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114635
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-05-23 00:00:00

Description Peter Cordes 2017-05-20 21:16:29 UTC
float sumfloat_omp(const float arr[]) {
  float sum=0;
   #pragma omp simd reduction(+:sum) aligned(arr : 64)
  for (int i=0 ; i<1024 ; i++)
    sum = sum + arr[i];
  return sum;
}
// https://godbolt.org/g/6KnMXM

x86-64 gcc7.1 and gcc8-snapshot-20170520 -mavx2 -ffast-math -funroll-loops -fopenmp -O3 emit:

        # omitted integer code to align the stack by 32
        vpxor   %xmm0, %xmm0, %xmm0      # tmp119
        vmovaps %xmm0, -48(%rbp)         # tmp119, MEM[(void *)&D.2373]
        vmovaps %xmm0, -32(%rbp)         # tmp119, MEM[(void *)&D.2373]
        vmovaps -48(%rbp), %ymm8         # MEM[(float *)&D.2373], vect__23.20
        # then the loop

The store-forwarding stall part of this is very similar to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833

With gcc4/5/6, we get four integer stores like movq $0, -48(%rbp) before the vector load.  Either way this is ridiculous: in the gcc7/8 code the vpxor has already zeroed the whole ymm register, so the store/reload is pure overhead, and in both cases the wide reload from narrower stores causes a store-forwarding stall.

It's also silly because -ffast-math allows 0.0+x to optimize to x.  It could start off by simply loading the first vector instead of adding it to 0, in this special case where the loop count is a compile-time constant.
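For illustration, here is the scalar shape that -ffast-math would permit (a sketch of mine, not compiler output), assuming at least one iteration as with the constant trip count above:

float sumfloat_peeled(const float arr[]) {
  float sum = arr[0];            // start from the first element instead of 0.0f + arr[0]
  for (int i = 1; i < 1024; i++)
    sum = sum + arr[i];
  return sum;
}

The vectorized equivalent would load the first 32 bytes straight into the accumulator instead of materializing a zero vector.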

---

It's not just AVX 256b vectors, although it's far worse there.

With just SSE2:

        pxor    %xmm0, %xmm0
        movaps  %xmm0, -24(%rsp)    # dead store
        pxor    %xmm0, %xmm0


Or, from older gcc versions, the same integer-store -> vector-reload code that induces a store-forwarding stall.

-----


Even though -funroll-loops unrolls, it doesn't use multiple accumulators to run separate dep chains in parallel.  So it still bottlenecks on the 4 cycle latency of VADDPS, instead of the 0.5c throughput (Example numbers for Intel Skylake, and yes the results are the same with -march=skylake).  clang uses 4 vector accumulators, so it runs 4x faster when data is hot in L2.  (up to 8x is possible with data hot in L1).  Intel pre-skylake has VADDPS latency=3c throughput=1c, so there's still a factor of three to be had.  But Haswell has FMA lat=5c, tput=0.5, so you need 10 accumulators to saturate the 2 load&FMA per clock max throughput.

Is there already an open bug for this?  I know it's totally separate from this issue.
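For reference, a source-level sketch of the multiple-accumulator idea (mine, not from the compiler; clang effectively does the vector analogue of this with 4 vector accumulators):

float sumfloat_4acc(const float arr[]) {
  // four independent dependency chains, so the adds can overlap in the pipeline
  float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  for (int i = 0; i < 1024; i += 4) {
    s0 += arr[i + 0];
    s1 += arr[i + 1];
    s2 += arr[i + 2];
    s3 += arr[i + 3];
  }
  return (s0 + s1) + (s2 + s3);  // combine the partial sums once at the end
}

When each of s0..s3 becomes a vector register, the loop can approach the load/FMA throughput limit instead of being serialized on one add's latency.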
Comment 1 Richard Biener 2017-05-23 07:08:50 UTC
Uh.  .optimized:

float sumfloat_omp(const float*) (const float * arr)
{
  unsigned long ivtmp.22;
  vector(8) float D__lsm0.19;
  const vector(8) float vect__23.18;
  const vector(8) float vect__4.16;
  float stmp_sum_19.12;
  vector(8) float vect__18.10;
  float D.2841[8];
  vector(8) float _10;
  void * _77;
  unsigned long _97;

  <bb 2> [1.00%]:
  arr_13 = arr_12(D);
  __builtin_memset (&D.2841, 0, 32);
  _10 = MEM[(float *)&D.2841];
  ivtmp.22_78 = (unsigned long) arr_13;
  _97 = ivtmp.22_78 + 4096;

...

  <bb 4> [1.00%]:
  MEM[(float *)&D.2841] = vect__23.18_58;
  vect__18.10_79 = MEM[(float *)&D.2841];
  stmp_sum_19.12_50 = [reduc_plus_expr] vect__18.10_79;
  return stmp_sum_19.12_50;


well, that explains it ;)  Coming from

  <bb 7> [99.00%]:
  # i_33 = PHI <i_25(8), 0(6)>
  # ivtmp_35 = PHI <ivtmp_28(8), 1024(6)>
  _21 = GOMP_SIMD_LANE (simduid.0_14(D));
  _1 = (long unsigned int) i_33;
  _2 = _1 * 4;
  _3 = arr_13 + _2;
  _4 = *_3;
  _22 = D.2841[_21];
  _23 = _4 + _22;
  D.2841[_21] = _23;
  i_25 = i_33 + 1;
  ivtmp_28 = ivtmp_35 - 1;
  if (ivtmp_28 != 0)
    goto <bb 8>; [98.99%]

so we perform the reduction in memory, then LIM performs store-motion on it
but the memset isn't inlined early enough to rewrite the decl into SSA
(CCP from GOMP_SIMD_VF is missing).  In DOM we have

  __builtin_memset (&D.2841, 0, 32);
  _10 = MEM[(float *)&D.2841];

so we do not fold that.

If OMP SIMD always zeros the vector then it could also emit the maybe easier
to optimize

  WITH_SIZE_EXPR<_3, D.2841> = {};

Of course gimple_fold_builtin_memset should simply be improved to optimize
the now constant-size memset to = {}.

I'll have a look.
Comment 2 Jakub Jelinek 2017-05-23 07:48:23 UTC
(In reply to Richard Biener from comment #1)
> If OMP SIMD always zeros the vector then it could also emit the maybe easier
> to optimize
> 
>   WITH_SIZE_EXPR<_3, D.2841> = {};

It doesn't always zero, it can be pretty arbitrary.  For the default reductions on integral/floating point types it does zero for +/-/|/^/|| reductions, but e.g. 1 for */&&, or ~0 for &, or maximum or minimum for min or max.  For user defined reductions it can be whatever the user requests, constructor for some class type, function call, set to arbitrary value etc.
For other privatization clauses it is again something different (uninitialized for private/lastprivate, some other var + some bias for linear, ...).
And then after the simd loop there is again a reduction or something similar, but again can be quite complex in the general case.  If it helps, we could mark the pre-simd and post-simd loops somehow in the loop structure or something, but the actual work needs to be done later, especially after inlining, including the vectorizer and other passes.
E.g. for the typical reduction where the vectorizer computes the "simd array" in a vector temporary (or collection of them), it would be nice if we were able to pattern recognize simple cases and turn those into vector reduction patterns.
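To make the non-zero identity point concrete, here is a minimal variant of the original testcase (my example, not from the report) where the privatized lanes must start at 1.0f rather than 0:

float prodfloat_omp(const float arr[]) {
  float prod = 1.0f;
  #pragma omp simd reduction(*:prod) aligned(arr : 64)
  for (int i = 0; i < 1024; i++)
    prod = prod * arr[i];
  return prod;   // the per-lane "simd array" has to be initialized to 1.0f, not memset to zero
}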
Comment 3 Peter Cordes 2017-05-23 08:08:29 UTC
(In reply to Jakub Jelinek from comment #2)
> It doesn't always zero, it can be pretty arbitrary.

Is it feasible to have it just load the first vector of elements, instead of broadcasting the identity value?  i.e. do the vector equivalent of 

 sum = a[0]
 for (i=1; ...)

i.e. peel the first iteration and optimize away the computation, leaving just the load.  Another way to handle counts between 1 and 2 full vectors (where the remaining loop body would run zero times) is to put the loop entry point after the first load & accumulate.
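A sketch of that shape with AVX intrinsics (mine, not from the report; assumes a nonzero trip count that is a multiple of 8 and 32-byte-aligned input, as in the testcase):

#include <immintrin.h>

float sumfloat_peeled_vec(const float *arr)   /* arr assumed 32-byte aligned */
{
  __m256 vsum = _mm256_load_ps(arr);          /* first vector is the initial accumulator */
  for (int i = 8; i < 1024; i += 8)           /* loop entry point is after the first load */
    vsum = _mm256_add_ps(vsum, _mm256_load_ps(arr + i));

  /* horizontal sum of the 8 lanes */
  __m128 lo = _mm256_castps256_ps128(vsum);
  __m128 hi = _mm256_extractf128_ps(vsum, 1);
  __m128 s  = _mm_add_ps(lo, hi);
  s = _mm_add_ps(s, _mm_movehl_ps(s, s));
  s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));
  return _mm_cvtss_f32(s);
}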

(BTW, for operations like min/max/AND/OR where duplicate values don't affect the result, an unaligned final vector would be much more efficient than a scalar cleanup for the last less-than-full-vector of elements, but you still need a scalar fallback if the total count can be smaller than 1 full vector...)
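A sketch of that tail-handling trick for a max reduction (again mine, with a hypothetical function name; assumes n >= 8, with a scalar fallback still needed below that):

#include <immintrin.h>
#include <stddef.h>

float maxfloat_overlap_tail(const float *arr, size_t n)
{
  __m256 vmax = _mm256_loadu_ps(arr);
  size_t i = 8;
  for (; i + 8 <= n; i += 8)
    vmax = _mm256_max_ps(vmax, _mm256_loadu_ps(arr + i));
  if (i < n)   /* final unaligned vector overlaps already-processed elements; duplicates are harmless for max */
    vmax = _mm256_max_ps(vmax, _mm256_loadu_ps(arr + n - 8));

  /* reduce the 8 lanes */
  float lanes[8];
  _mm256_storeu_ps(lanes, vmax);
  float m = lanes[0];
  for (int k = 1; k < 8; k++)
    if (lanes[k] > m) m = lanes[k];
  return m;
}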
Comment 4 Jakub Jelinek 2017-05-24 07:59:22 UTC
What we should do is first vectorize the main simd loop and then, once we've determined the vectorization factor thereof etc., see if there is any related preparation and finalization loop around it and try to vectorize those with the same vectorization factor.
Comment 5 Richard Biener 2017-05-24 13:00:34 UTC
(In reply to Jakub Jelinek from comment #2)
> (In reply to Richard Biener from comment #1)
> > If OMP SIMD always zeros the vector then it could also emit the maybe easier
> > to optimize
> > 
> >   WITH_SIZE_EXPR<_3, D.2841> = {};
> 
> It doesn't always zero, it can be pretty arbitrary.

Ah, the memset gets exposed by loop distribution.  Before we have

  <bb 3> [5.67%]:
  # _28 = PHI <_27(13), 0(10)>
  D.2357[_28] = 0.0;
  _27 = _28 + 1;
  if (_15 > _27)
    goto <bb 13>; [85.00%]
  else
    goto <bb 4>; [15.00%]

  <bb 13> [4.82%]:
  goto <bb 3>; [100.00%]

so indeed the other cases will be more "interesting".

For your latest idea to work we have to make sure the prologue / epilogue
loop doesn't get unrolled / pattern matched.

I'll still look at enhancing memset folding (it's pretty conservative
in the cases it handles).

Comment 6 Richard Biener 2017-05-26 07:15:40 UTC
Author: rguenth
Date: Fri May 26 07:14:52 2017
New Revision: 248481

URL: https://gcc.gnu.org/viewcvs?rev=248481&root=gcc&view=rev
Log:
2017-05-26  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/80844
	* tree-vectorizer.c (adjust_simduid_builtins): Propagate
	results.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-vectorizer.c
Comment 7 Richard Biener 2017-05-26 07:29:05 UTC
Maybe we can simply set loop->force_vectorize on the prologue / epilogue loops.
Hmm, seems to be generated before we have a CFG ...