| Summary: | 2x slower than clang summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions | | |
|---|---|---|---|
| Product: | gcc | Reporter: | Yichao Yu <yyc1992> |
| Component: | tree-optimization | Assignee: | Not yet assigned to anyone <unassigned> |
| Status: | NEW | | |
| Severity: | normal | CC: | crazylht, drraph, freddie, jamborm, Joost.VandeVondele, liuhongt, rguenth, sjames, tulipawn |
| Priority: | P3 | Keywords: | missed-optimization |
| Version: | 6.1.1 | | |
| Target Milestone: | --- | | |
| Host: | | Target: | x86_64-*-* |
| Build: | | Known to work: | |
| Known to fail: | | Last reconfirmed: | 2016-06-07 00:00:00 |
| Bug Depends on: | | | |
| Bug Blocks: | 53947 | | |
Description (Yichao Yu, 2016-06-04 17:55:12 UTC)
---

Comment #1 (Richard Biener):

The core loop is

```
.L8:
        addq    $1, %rdx
        vaddps  (%r8), %ymm1, %ymm1
        addq    $32, %r8
        cmpq    %rdx, %rcx
        ja      .L8
```

which compared to LLVM is not unrolled. You can use -funroll-loops to force that, which probably fixes the performance compared to LLVM. For the short loop above I also guess this is not the optimal IV choice.

---

Comment #2 (Marc Glisse), in reply to comment #1:

-funroll-loops only gains 10% or so, nowhere near the factor of 2 with clang. Except for the slightly better induction-variable choice in llvm, the two unrolled loops look quite similar; I have a hard time seeing how one can be so much faster than the other. Maybe the alignment somehow ends up better in one case? Or the loop being one instruction shorter lets it fit better in cache?

---

In reply to Marc Glisse from comment #2:

Can you post a full example? The LLVM bug and this copy lack information on what the actual 'a' and 'n' are. Note that unless 'a' fits in L2 I highly doubt one can exceed memory bandwidth (and thus code generation should not matter unless it affects the HW prefetcher).

---

Yichao Yu:

The C code is in the gist linked. `a` is a cacheline-aligned pointer and `n` is 1024, so `a` should even fit in L1d, which is 32kB on both processors I benchmarked.

More precise timing (ns per loop):

6700K
```
% ./benchmark-gcc
80.553456
% ./benchmark-clang37
28.222281
% ./benchmark-clang38
41.782532
```

4702HQ
```
% ./benchmark-gcc
140.744893
% ./benchmark-clang37
50.835441
% ./benchmark-clang38
70.220946
```

Pasting the whole program over for completeness.
The alignment line gives some weird timing on clang without `-mcore-avx2` but doesn't change anything too much with `-Ofast -mcore-avx2`.

```
#include <stdlib.h>
#include <stdint.h>
#include <time.h>
#include <stdio.h>
#include <string.h>

uint64_t gettime_ns()
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec * (uint64_t)1e9 + t.tv_nsec;
}

__attribute__((noinline)) float sum32(float *a, size_t n)
{
    /* a = (float*)__builtin_assume_aligned(a, 64); */
    float s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    __asm__ volatile ("" ::: "memory");
    return s;
}

int main()
{
    float *p = aligned_alloc(64, sizeof(float) * 1024);
    memset(p, 0, sizeof(float) * 1024);
    uint64_t start = gettime_ns();
    for (int i = 0; i < 1024 * 1024; i++)
        sum32(p, 1024);
    free(p);
    uint64_t end = gettime_ns();
    printf("%f\n", (end - start) / (1024.0 * 1024.0));
    return 0;
}
```

---

Comment #5 (Richard Biener):

An interesting observation is that we clone sum32 for IPA-CP of n == 1024 but for some unknown reason figure the Alignment of 'a' as unusable:

```
Lattices:
 Node: main/35:
 Node: sum32/34:
    param [0]: VARIABLE
         ctxs: VARIABLE
         Alignment unusable (BOTTOM)
        AGGS VARIABLE
    param [1]: VARIABLE
               1024 [from: 35(99000)] [loc_time: 65, loc_size: 10, prop_time: 0, prop_size: 0]
         ctxs: VARIABLE
         Alignment unusable (BOTTOM)
        AGGS VARIABLE

Evaluating opportunities for sum32/34.
 - considering value 1024 for param #1 n (caller_count: 1)
     good_cloning_opportunity_p (time: 65, size: 10, freq_sum: 99000) -> evaluation: 643500, threshold: 500
  Creating a specialized node of sum32/34.
    replacing param #1 n with const 1024
 Accounting size:7.00, time:72.78 on predicate:(true)
 Accounting size:3.00, time:2.00 on new predicate:(not inlined)
the new node is sum32.constprop/43.
```

If LLVM disables IPA-CP cloning with 'noinline', the testcase should add 'noclone' as well to be a fair comparison.

The vectorizer decides to peel the loop for alignment (as usual...) and thus creates both a prologue and an epilogue loop. That shouldn't matter in practice, but it likely obfuscates the code enough to make the job for IVOPTs harder. If the desire was to have nothing known about alignment and 'n' in sum32, the above cannot be avoided anyway. We also peel both the prologue and the epilogue loop.

clang 3.6 (the one I have locally) doesn't peel for alignment and thus uses unaligned loads, and it unrolls the loop by 2 only. It doesn't do any IPA-CP with -Ofast.

Note that the difference WRT clang's unrolling and GCC's unrolling is that clang uses two accumulators while GCC just processes multiple loads on the same accumulator with its unrolling. Thus clang can exploit parallelism in the pipeline of the CPU while GCC restricts the CPU due to the dependences. This means that it is the vectorizer that needs to consider using a larger vectorization factor rather than post-vectorization unrolling (that's likely to pay off only for reductions). I wonder what LLVM's heuristic is here.

The IPA-CP alignment question remains though. Martin?

---

Isn't this a case where -fvariable-expansion-in-unroller is helpful?

```
> gcc -Ofast t.c -lrt ; ./a.out
285.670206
> gcc -Ofast -funroll-loops -fvariable-expansion-in-unroller t.c -lrt ; ./a.out
151.246083
> gcc -Ofast -funroll-loops t.c -lrt ; ./a.out
277.047507
```

There is some relation with PR25621 I think.

---

Comment #7 (Yichao Yu):

If I add `-fvariable-expansion-in-unroller` (omg this option is like half the command line ;-p ...), the performance matches the clang one after the clang 3.8 regression.
```
% gcc -funroll-loops -fvariable-expansion-in-unroller -Ofast -march=core-avx2 benchmark.c -o benchmark2
% ./benchmark2
45.588861
% ./benchmark-gcc
80.518152
% ./benchmark-clang38
41.920054
% ./benchmark-clang37
25.093145
```
---

Yeah, but -fvariable-expansion-in-unroller is quite late.
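For illustration, the following is roughly what the multiple-accumulator shape described in comment #5 looks like at the source level; it is also approximately what -fvariable-expansion-in-unroller achieves for the reduction after unrolling. This is a hand-written sketch, not output of either compiler, and the function name and the choice of four partial sums are arbitrary.

```
#include <stddef.h>

/* Hand-written sketch (names are illustrative): the reduction from sum32
   split across four independent partial sums.  Each chain depends only on
   its own previous value, so the CPU can overlap the additions instead of
   serializing them on one long dependency chain. */
float sum32_partial(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i + 0];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Combine the partial sums and handle any remaining tail elements. */
    float s = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)
        s += a[i];
    return s;
}
```

Note that splitting the accumulator reassociates the floating-point additions, which is why the whole comparison here is done under -Ofast; without -ffast-math (or an explicit parallel reduction) the compilers are not allowed to perform this transformation on float.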
---

Comment #9 (Martin Jambor), in reply to Richard Biener from comment #5:

> An interesting observation is that we clone sum32 for IPA-CP of n == 1024 but for some unknown reason figure Alignment of 'a' as unusable:

The function sum32 is not static, so it can be called from outside of the current compilation unit with any alignment. IPA-CP does not clone for alignment, which was my deliberate decision because I found it very difficult to reason about its profitability (I am open to suggestions in this area, of course). Make it static or compile with -flto and you will get:

```
IPA lattices after all propagation:

Lattices:
 Node: main/35:
 Node: sum32/34:
    param [0]: VARIABLE
         ctxs: VARIABLE
         Alignment 64, misalignment 0
        AGGS VARIABLE
    param [1]: 1024 [from: 35(99000)] [loc_time: 0, loc_size: 0, prop_time: 0, prop_size: 0]
         ctxs: VARIABLE
         Alignment unusable (BOTTOM)
        AGGS VARIABLE
```

---

In reply to Martin Jambor from comment #9:

It ends up cloning for the constant second arg though, so I'd have expected the alignment to be set on the first...

---

I've been looking into this and the big difference appears to be that when Clang unrolls the loop it does so using multiple accumulators (and indeed does this without needing to be told to unroll). Given:

```
double acc(double *x, int n)
{
    double a = 0;
    #pragma omp simd
    for (int i = 0; i < n; i++)
        a += x[i];
    return a;
}
```

and compiling with clang -march=native -Ofast -fopenmp -S, the core loop reads as:

```
vaddpd  (%rdi,%rsi,8), %ymm0, %ymm0
vaddpd  32(%rdi,%rsi,8), %ymm1, %ymm1
vaddpd  64(%rdi,%rsi,8), %ymm2, %ymm2
vaddpd  96(%rdi,%rsi,8), %ymm3, %ymm3
vaddpd  128(%rdi,%rsi,8), %ymm0, %ymm0
vaddpd  160(%rdi,%rsi,8), %ymm1, %ymm1
vaddpd  192(%rdi,%rsi,8), %ymm2, %ymm2
vaddpd  224(%rdi,%rsi,8), %ymm3, %ymm3
vaddpd  256(%rdi,%rsi,8), %ymm0, %ymm0
vaddpd  288(%rdi,%rsi,8), %ymm1, %ymm1
vaddpd  320(%rdi,%rsi,8), %ymm2, %ymm2
vaddpd  352(%rdi,%rsi,8), %ymm3, %ymm3
vaddpd  384(%rdi,%rsi,8), %ymm0, %ymm0
vaddpd  416(%rdi,%rsi,8), %ymm1, %ymm1
vaddpd  448(%rdi,%rsi,8), %ymm2, %ymm2
vaddpd  480(%rdi,%rsi,8), %ymm3, %ymm3
```

which is heavily unrolled and uses four separate accumulators to hide the latency of the vector adds. Interestingly, one could argue that Clang is not using enough registers given that Skylake can dual-issue adds and they have a latency of 4 cycles (implying you want 8 separate accumulators).
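(Spelling out that arithmetic: the number of independent accumulators needed to hide the add latency is roughly latency × issue width, here 4 cycles × 2 adds per cycle = 8, so four accumulators can sustain at most about half of the peak vector-add throughput on such a core, memory bandwidth permitting.)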
GCC 10 with gcc -march=skylake -Ofast -fopenmp -S test.c -funroll-loops gives:

```
vaddpd  -224(%r8), %ymm1, %ymm2
vaddpd  -192(%r8), %ymm2, %ymm3
vaddpd  -160(%r8), %ymm3, %ymm4
vaddpd  -128(%r8), %ymm4, %ymm5
vaddpd  -96(%r8), %ymm5, %ymm6
vaddpd  -64(%r8), %ymm6, %ymm7
vaddpd  -32(%r8), %ymm7, %ymm0
```

which, although it is unrolled, is not a useful unrolling due to the dependency chain. Indeed, I would not be surprised if the performance is similar to the non-unrolled code, as the loop-related cruft can be hidden.

This problem has been recently discussed at: https://stackoverflow.com/questions/76407241/why-is-cython-so-much-slower-than-numba-for-this-simple-loop

---

Comment #13 (Richard Biener):

The target now has the ability to tell the vectorizer to choose a larger VF based on the cost info it got for the default VF, so the x86 backend could make use of that. For example, with the following patch we'll unroll the vectorized loops 4 times (of course the actual check for small reduction loops and a register pressure estimate is missing). That generates

```
.L4:
        vaddps  (%rax), %zmm1, %zmm1
        vaddps  64(%rax), %zmm2, %zmm2
        addq    $256, %rax
        vaddps  -128(%rax), %zmm0, %zmm0
        vaddps  -64(%rax), %zmm3, %zmm3
        cmpq    %rcx, %rax
        jne     .L4
        movq    %rdx, %rax
        andq    $-64, %rax
        vaddps  %zmm3, %zmm0, %zmm0
        vaddps  %zmm2, %zmm1, %zmm1
        vaddps  %zmm1, %zmm0, %zmm1
        ... more epilog ...
```

with -march=znver4 on current trunk.

```
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index d4ff56ee8dd..53c09bb9d9c 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -23615,8 +23615,18 @@ class ix86_vector_costs : public vector_costs
 			      stmt_vec_info stmt_info, slp_tree node,
 			      tree vectype, int misalign,
 			      vect_cost_model_location where) override;
+  void finish_cost (const vector_costs *uncast_scalar_costs);
 };
 
+void
+ix86_vector_costs::finish_cost (const vector_costs *uncast_scalar_costs)
+{
+  auto *scalar_costs
+    = static_cast<const ix86_vector_costs *> (uncast_scalar_costs);
+  m_suggested_unroll_factor = 4;
+  vector_costs::finish_cost (scalar_costs);
+}
+
 /* Implement targetm.vectorize.create_costs.  */
 
 static vector_costs *
```
---

Comment #14 (Hongtao Liu), in reply to comment #13:

I remember we have posted a patch for that: https://gcc.gnu.org/pipermail/gcc-patches/2022-October/604186.html

One regression observed is that the VF of the epilogue loop will increase (from xmm to ymm) after unrolling the vectorized loop, and it regressed performance for lower-trip-count loops (similar to -mprefer-vector-width=512).

Also for the case in the PR, I'm trying to enable -fvariable-expansion-in-unroller when -funroll-loops, and the partial sum will break the reduction chain.

---

In reply to Hongtao Liu from comment #14:

> One regression observed is that the VF of the epilogue loop will increase (from xmm to ymm) after unrolling the vectorized loop, and it regressed performance for lower-trip-count loops (similar to -mprefer-vector-width=512).

Ah, yeah. We could resort to checking estimated_number_of_iterations to guide us with profile feedback. I'm also (again) working on fully masked epilogues, which should reduce the impact on low-trip-count loops.
> Also for the case in the PR, I'm trying to enable -fvariable-expansion-in-unroller when -funroll-loops, and the partial sum will break the reduction chain.

Probably also a good idea - maybe -fvariable-expansion-in-unroller can be made smarter and guided by register pressure?