float sumfloat_omp(const float arr[]) {
  float sum = 0;
  #pragma omp simd reduction(+:sum) aligned(arr : 64)
  for (int i = 0; i < 1024; i++)
    sum = sum + arr[i];
  return sum;
}

// https://godbolt.org/g/6KnMXM

x86-64 gcc7.1 and gcc8-snapshot-20170520 with -mavx2 -ffast-math -funroll-loops -fopenmp -O3 emit:

        # omitted integer code to align the stack by 32
        vpxor   %xmm0, %xmm0, %xmm0    # tmp119
        vmovaps %xmm0, -48(%rbp)       # tmp119, MEM[(void *)&D.2373]
        vmovaps %xmm0, -32(%rbp)       # tmp119, MEM[(void *)&D.2373]
        vmovaps -48(%rbp), %ymm8       # MEM[(float *)&D.2373], vect__23.20
        # then the loop

The store-forwarding stall part of this is very similar to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833

With gcc4/5/6, we get four integer stores like movq $0, -48(%rbp) before the vector load.

Either way, this is ridiculous because vpxor already zeroed the whole ymm vector, and it causes a store-forwarding stall.

It's also silly because -ffast-math allows 0.0+x to optimize to x.  In this special case, where the loop count is a compile-time constant, it could start off by simply loading the first vector instead of adding it to 0.

---

It's not just AVX 256b vectors, although it's far worse there.  With just SSE2:

        pxor    %xmm0, %xmm0
        movaps  %xmm0, -24(%rsp)   # dead store
        pxor    %xmm0, %xmm0

Or, from older gcc versions, the same int-store -> vector-reload store-forwarding-stall-inducing code.

-----

Even though -funroll-loops unrolls, it doesn't use multiple accumulators to run separate dependency chains in parallel.  So it still bottlenecks on the 4-cycle latency of VADDPS instead of its 0.5c throughput.  (Example numbers are for Intel Skylake, and yes, the results are the same with -march=skylake.)

clang uses 4 vector accumulators, so it runs 4x faster when data is hot in L2 (up to 8x is possible with data hot in L1).

Intel pre-Skylake has VADDPS latency=3c, throughput=1c, so there's still a factor of three to be had.  But Haswell has FMA lat=5c, tput=0.5c, so you need 10 accumulators to saturate the two-load-and-FMA-per-clock max throughput.

Is there already an open bug for this?  I know it's totally separate from this issue.
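For reference, here is a hand-written sketch (not GCC's output, and not part of the test case above) of the kind of 4-accumulator loop clang generates, written with AVX intrinsics.  The function name, the fixed trip count of 1024, and the 32-byte-alignment assumption are just for illustration:

#include <immintrin.h>

float sumfloat_4acc(const float *arr)
{
    /* start each accumulator from the data itself: no 0.0 + x to begin with */
    __m256 s0 = _mm256_load_ps(arr);
    __m256 s1 = _mm256_load_ps(arr + 8);
    __m256 s2 = _mm256_load_ps(arr + 16);
    __m256 s3 = _mm256_load_ps(arr + 24);

    /* four independent dependency chains hide the VADDPS latency */
    for (int i = 32; i < 1024; i += 32) {
        s0 = _mm256_add_ps(s0, _mm256_load_ps(arr + i));
        s1 = _mm256_add_ps(s1, _mm256_load_ps(arr + i + 8));
        s2 = _mm256_add_ps(s2, _mm256_load_ps(arr + i + 16));
        s3 = _mm256_add_ps(s3, _mm256_load_ps(arr + i + 24));
    }

    /* combine the accumulators, then do a horizontal sum in registers */
    __m256 s = _mm256_add_ps(_mm256_add_ps(s0, s1), _mm256_add_ps(s2, s3));
    __m128 v = _mm_add_ps(_mm256_castps256_ps128(s),
                          _mm256_extractf128_ps(s, 1));
    v = _mm_add_ps(v, _mm_movehl_ps(v, v));      /* add high pair to low pair */
    v = _mm_add_ss(v, _mm_movehdup_ps(v));       /* add element 1 to element 0 */
    return _mm_cvtss_f32(v);
}

Compiled with -mavx (or -march=skylake), this keeps the reduction entirely in registers and never bounces the accumulator through the stack.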
Uh.  .optimized:

float sumfloat_omp(const float*) (const float * arr)
{
  unsigned long ivtmp.22;
  vector(8) float D__lsm0.19;
  const vector(8) float vect__23.18;
  const vector(8) float vect__4.16;
  float stmp_sum_19.12;
  vector(8) float vect__18.10;
  float D.2841[8];
  vector(8) float _10;
  void * _77;
  unsigned long _97;

  <bb 2> [1.00%]:
  arr_13 = arr_12(D);
  __builtin_memset (&D.2841, 0, 32);
  _10 = MEM[(float *)&D.2841];
  ivtmp.22_78 = (unsigned long) arr_13;
  _97 = ivtmp.22_78 + 4096;
...
  <bb 4> [1.00%]:
  MEM[(float *)&D.2841] = vect__23.18_58;
  vect__18.10_79 = MEM[(float *)&D.2841];
  stmp_sum_19.12_50 = [reduc_plus_expr] vect__18.10_79;
  return stmp_sum_19.12_50;

well, that explains it ;)  Coming from

  <bb 7> [99.00%]:
  # i_33 = PHI <i_25(8), 0(6)>
  # ivtmp_35 = PHI <ivtmp_28(8), 1024(6)>
  _21 = GOMP_SIMD_LANE (simduid.0_14(D));
  _1 = (long unsigned int) i_33;
  _2 = _1 * 4;
  _3 = arr_13 + _2;
  _4 = *_3;
  _22 = D.2841[_21];
  _23 = _4 + _22;
  D.2841[_21] = _23;
  i_25 = i_33 + 1;
  ivtmp_28 = ivtmp_35 - 1;
  if (ivtmp_28 != 0)
    goto <bb 8>; [98.99%]

so we perform the reduction in memory and then LIM performs store-motion on it, but the memset isn't inlined early enough to rewrite the decl into SSA (CCP from GOMP_SIMD_VF is missing).  In DOM we have

  __builtin_memset (&D.2841, 0, 32);
  _10 = MEM[(float *)&D.2841];

so we do not fold that.  If OMP SIMD always zeros the vector then it could also emit the maybe easier to optimize

  WITH_SIZE_EXPR<_3, D.2841> = {};

Of course gimple_fold_builtin_memset should simply be improved to optimize a now-constant-size memset to = {}.  I'll have a look.
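A rough C-level picture of what this lowering amounts to, assuming a vectorization factor of 8; priv[] stands in for the "simd array" D.2841, and the i % 8 index is only a stand-in for GOMP_SIMD_LANE, which becomes a real lane number only after vectorization:

float sumfloat_lowered(const float *arr)
{
    float priv[8];                          /* the privatized "simd array" */
    __builtin_memset(priv, 0, sizeof priv); /* the memset that doesn't get folded */

    for (int i = 0; i < 1024; i++)
        priv[i % 8] += arr[i];              /* reduction performed in memory */

    float sum = 0;                          /* final cross-lane reduction */
    for (int lane = 0; lane < 8; lane++)
        sum += priv[lane];
    return sum;
}

After vectorization and store-motion, only the initial memset plus reload of priv and the final store plus reload survive, which is exactly the store-forwarding-stall pattern in the reported asm.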
(In reply to Richard Biener from comment #1)
> If OMP SIMD always zeros the vector then it could also emit the maybe easier
> to optimize
>
> WITH_SIZE_EXPR<_3, D.2841> = {};

It doesn't always zero; it can be pretty arbitrary.  For the default reductions on integral/floating-point types it does zero for +/-/|/^/|| reductions, but e.g. 1 for */&&, ~0 for &, or the maximum or minimum for min or max.  For user-defined reductions it can be whatever the user requests: a constructor for some class type, a function call, a set to an arbitrary value, etc.  For other privatization clauses it is again something different (uninitialized for private/lastprivate, some other var plus some bias for linear, ...).

And then after the simd loop there is again a reduction or something similar, which again can be quite complex in the general case.  If it helps, we could mark the pre-simd and post-simd loops somehow in the loop structure or something, but the actual work needs to be done later, especially after inlining, including in the vectorizer and other passes.

E.g. for the typical reduction where the vectorizer computes the "simd array" in a vector temporary (or a collection of them), it would be nice if we were able to pattern-recognize simple cases and turn those into vector reduction patterns.
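To make the "not always zero" point concrete, here is a small sketch in plain OpenMP (nothing GCC-specific; the function names and the mymin identifier are made up for illustration).  A * reduction's private copies start at 1, and a user-defined reduction can supply an arbitrary initializer:

float prod(const float *arr)
{
    float p = 1.0f;
    #pragma omp simd reduction(*:p)     /* private copies start at 1, not 0 */
    for (int i = 0; i < 1024; i++)
        p *= arr[i];
    return p;
}

/* user-defined reduction: combiner and initializer are arbitrary user code */
#pragma omp declare reduction(mymin : float : \
        omp_out = omp_out < omp_in ? omp_out : omp_in) \
        initializer(omp_priv = __builtin_inff())

float minval(const float *arr)
{
    float m = __builtin_inff();
    #pragma omp simd reduction(mymin:m)
    for (int i = 0; i < 1024; i++)
        m = m < arr[i] ? m : arr[i];
    return m;
}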
(In reply to Jakub Jelinek from comment #2)
> It doesn't always zero, it can be pretty arbitrary.

Is it feasible to have it just load the first vector of elements, instead of broadcasting the identity value?  I.e. do the vector equivalent of

  sum = a[0];
  for (i = 1; ...)

i.e. peel the first iteration and optimize away the computation, leaving just the load.

Another way to handle the actual loop body running zero times, for counts between 1 and 2 full vectors, is to put the loop entry point after the first load & accumulate.

(BTW, for operations like min/max/AND/OR where duplicate values don't affect the result, an unaligned final vector would be much more efficient than a scalar cleanup for the last less-than-full vector of elements, but you still need a scalar fallback if the total count can be smaller than 1 full vector...)
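A minimal sketch of the peeling idea with AVX intrinsics (again hand-written, not proposed compiler output); it assumes 32-byte alignment and a count that is at least 8 and a multiple of 8, i.e. it omits the scalar/short-count fallback mentioned above:

#include <immintrin.h>

float sumfloat_peeled(const float *arr, int n)
{
    __m256 acc = _mm256_load_ps(arr);        /* vector "sum = a[0..7]": no identity needed */
    for (int i = 8; i < n; i += 8)           /* loop starts at the second vector */
        acc = _mm256_add_ps(acc, _mm256_load_ps(arr + i));

    float lanes[8];                          /* simple horizontal sum at the end */
    _mm256_storeu_ps(lanes, acc);
    float sum = 0;
    for (int k = 0; k < 8; k++)
        sum += lanes[k];
    return sum;
}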
What we should do is first vectorize the main simd loop and then, once we've determined its vectorization factor etc., see if there are related preparation and finalization loops around it and try to vectorize those with the same vectorization factor.
(In reply to Jakub Jelinek from comment #2)
> (In reply to Richard Biener from comment #1)
> > If OMP SIMD always zeros the vector then it could also emit the maybe easier
> > to optimize
> >
> > WITH_SIZE_EXPR<_3, D.2841> = {};
>
> It doesn't always zero, it can be pretty arbitrary.

Ah, the memset gets exposed by loop distribution.  Before that we have

  <bb 3> [5.67%]:
  # _28 = PHI <_27(13), 0(10)>
  D.2357[_28] = 0.0;
  _27 = _28 + 1;
  if (_15 > _27)
    goto <bb 13>; [85.00%]
  else
    goto <bb 4>; [15.00%]

  <bb 13> [4.82%]:
  goto <bb 3>; [100.00%]

so indeed the other cases will be more "interesting".  For your latest idea to work we have to make sure the prologue / epilogue loop doesn't get unrolled / pattern-matched.

I'll still look at enhancing memset folding (it's pretty conservative in the cases it handles).
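A C-level stand-in for that prologue loop (the names init_simd_array, priv and vf are made up; vf corresponds to whatever GOMP_SIMD_VF resolves to):

static void init_simd_array(float *priv, int vf, float identity)
{
    /* corresponds to the <bb 3> loop above: element-wise store of the
       reduction identity; loop distribution later recognizes the 0.0
       case as a memset, which is where the un-folded memset comes from */
    for (int lane = 0; lane < vf; lane++)
        priv[lane] = identity;
}

For a non-zero identity (1.0 for *, +inf for min, ...) this loop does not become a memset, which is why those cases will be more "interesting".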
Author: rguenth
Date: Fri May 26 07:14:52 2017
New Revision: 248481

URL: https://gcc.gnu.org/viewcvs?rev=248481&root=gcc&view=rev
Log:
2017-05-26  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/80844
	* tree-vectorizer.c (adjust_simduid_builtins): Propagate results.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-vectorizer.c
Maybe we can simply set loop->force_vectorize on the prologue / epilogue loops.  Hmm, those seem to be generated before we have a CFG ...