Bug 71414 - 2x slower than clang summing small float array, GCC should consider larger vectorization factor for "unrolling" reductions
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 6.1.1
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
Keywords: missed-optimization
Blocks: vectorizer
Reported: 2016-06-04 17:55 UTC by Yichao Yu
Modified: 2020-10-11 15:39 UTC
CC: 5 users

Target: x86_64-*-*
Last reconfirmed: 2016-06-07 00:00:00


Description Yichao Yu 2016-06-04 17:55:12 UTC
Ref https://llvm.org/bugs/show_bug.cgi?id=28002

C source code.

```c
#include <stddef.h>  /* for size_t */

__attribute__((noinline)) float sum32(float *a, size_t n)
{
    /* a = (float*)__builtin_assume_aligned(a, 64); */
    float s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}
```


See [this gist](https://gist.github.com/yuyichao/5b07f71c1f19248ec5511d758532a4b0) for the assembly output from different compilers. GCC appears to be ~2x slower than clang on the two machines (4702HQ and 6700K) on which I benchmarked this.
Comment 1 Richard Biener 2016-06-06 08:21:39 UTC
The core loop is

.L8:
        addq    $1, %rdx
        vaddps  (%r8), %ymm1, %ymm1
        addq    $32, %r8
        cmpq    %rdx, %rcx
        ja      .L8

which compared to LLVM is not unrolled.  You can use -funroll-loops to
force that, which probably fixes the performance gap with LLVM.  For
the short loop above I also suspect this is not the optimal IV choice.
Comment 2 Marc Glisse 2016-06-06 08:45:57 UTC
(In reply to Richard Biener from comment #1)
> You can use -funroll-loops to force that, which probably fixes the
> performance gap with LLVM.

-funroll-loops only gains 10% or so, nowhere near the factor of 2 with clang. Except for the slightly better induction choice in llvm, the 2 unrolled loops look quite similar, I have a hard time seeing how one can be so much faster than the other. Maybe the alignment somehow ends up better in one case? Or the loop being one instruction shorter lets it fit better in cache?
Comment 3 Richard Biener 2016-06-06 09:37:57 UTC
(In reply to Marc Glisse from comment #2)
> -funroll-loops only gains 10% or so, nowhere near the factor of 2 with
> clang. Except for the slightly better induction choice in llvm, the 2
> unrolled loops look quite similar, I have a hard time seeing how one can be
> so much faster than the other. Maybe the alignment somehow ends up better in
> one case? Or the loop being one instruction shorter lets it fit better in
> cache?

Can you post a full example?  The LLVM bug and this copy lack information
on what actual 'a' and 'n' are used.  Note that unless 'a' fits in L2 I
doubt one can exceed memory bandwidth (and thus code-gen should not matter
unless it affects the HW prefetcher).
Comment 4 Yichao Yu 2016-06-06 11:48:57 UTC
The C code is in the linked gist. `a` is a cacheline-aligned pointer and `n` is 1024, so `a` (4 kB of floats) easily fits in L1d, which is 32 kB on both processors I benchmarked.

More precise timing (ns per loop)

6700K

```
% ./benchmark-gcc           
80.553456
% ./benchmark-clang37 
28.222281
% ./benchmark-clang38 
41.782532
```

4702HQ

```
% ./benchmark-gcc 
140.744893
% ./benchmark-clang37 
50.835441
% ./benchmark-clang38
70.220946
```

Pasting the whole program over for completeness.
The alignment line gives some weird timings on clang without `-march=core-avx2` but doesn't change anything much with `-Ofast -march=core-avx2`.

```c

#include <stdlib.h>
#include <stdint.h>
#include <time.h>
#include <stdio.h>
#include <string.h>

uint64_t gettime_ns(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec * (uint64_t) 1e9 + t.tv_nsec;
}


__attribute__((noinline)) float sum32(float *a, size_t n)
{
    /* a = (float*)__builtin_assume_aligned(a, 64); */
    float s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    __asm__ volatile ("" ::: "memory");
    return s;
}

int main()
{
    float *p = aligned_alloc(64, sizeof(float) * 1024);
    memset(p, 0, sizeof(float) * 1024);
    uint64_t start = gettime_ns();
    for (int i = 0; i < 1024 * 1024; i++)
        sum32(p, 1024);
    uint64_t end = gettime_ns();  /* take the end timestamp before freeing */
    free(p);
    printf("%f\n", (end - start) / (1024.0 * 1024.0));
    return 0;
}
```
Comment 5 Richard Biener 2016-06-07 09:14:34 UTC
An interesting observation is that we clone sum32 for IPA-CP of n == 1024 but
for some unknown reason figure Alignment of 'a' as unusable:

Lattices:
  Node: main/35:
  Node: sum32/34:
    param [0]: VARIABLE
         ctxs: VARIABLE
         Alignment unusable (BOTTOM)
        AGGS VARIABLE
    param [1]: VARIABLE
               1024 [from: 35(99000)] [loc_time: 65, loc_size: 10, prop_time: 0, prop_size: 0]
         ctxs: VARIABLE
         Alignment unusable (BOTTOM)
        AGGS VARIABLE

Evaluating opportunities for sum32/34.
 - considering value 1024 for param #1 n (caller_count: 1)
     good_cloning_opportunity_p (time: 65, size: 10, freq_sum: 99000) -> evaluation: 643500, threshold: 500
  Creating a specialized node of sum32/34.
    replacing param #1 n with const 1024
                Accounting size:7.00, time:72.78 on predicate:(true)
                Accounting size:3.00, time:2.00 on new predicate:(not inlined)
     the new node is sum32.constprop/43.


If LLVM disables IPA-CP cloning with 'noinline', the testcase should add
'noclone' as well to be a fair comparison.

The vectorizer decides to peel the loop for alignment (as usual...) and thus
creates both a prologue and an epilogue loop.  That shouldn't matter in
practice but it likely obfuscates the code enough to make the job for IVOPTs
harder.  If the desire was to have nothing known about alignment and 'n' in
sum32, the above cannot be avoided anyway.  We also completely peel both the
prologue and the epilogue loop.
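
Schematically, the peeled loop has roughly the following shape (an editorial sketch with an illustrative name, not GCC's actual output; the vector body is written as a scalar loop for readability):

```c
#include <stddef.h>
#include <stdint.h>

float sum32_peeled(float *a, size_t n)
{
    float s = 0;
    size_t i = 0;
    /* Prologue: scalar iterations until a + i is 32-byte aligned.  */
    for (; i < n && ((uintptr_t)(a + i) & 31) != 0; i++)
        s += a[i];
    /* Vector body: GCC emits one aligned vaddps per iteration here,
       all feeding a single vector accumulator.  */
    for (; i + 8 <= n; i += 8)
        for (size_t j = 0; j < 8; j++)
            s += a[i + j];
    /* Epilogue: the remaining n % 8 iterations.  */
    for (; i < n; i++)
        s += a[i];
    return s;
}
```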

clang 3.6 (the one I have locally) doesn't peel for alignment and thus uses
unaligned loads and unrolls the loop by 2 only.  It doesn't do any IPA-CP
with -Ofast.

Note that the difference between clang's unrolling and GCC's is that
clang uses two accumulators while GCC just processes multiple loads into
the same accumulator when it unrolls.  Thus clang can exploit parallelism
in the pipeline of the CPU while GCC restricts the CPU due to the dependences.
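
At the source level the two strategies look roughly like this (a scalar sketch with illustrative names; in the generated code each accumulator is a ymm register):

```c
#include <stddef.h>

/* GCC-style unrolling: every add feeds the same accumulator, so the
   adds form one serial dependence chain.  Assumes n % 2 == 0.  */
float sum_one_acc(float *a, size_t n)
{
    float s = 0;
    for (size_t i = 0; i < n; i += 2) {
        s += a[i];      /* depends on the previous iteration's s */
        s += a[i + 1];  /* depends on the add just above */
    }
    return s;
}

/* clang-style unrolling: two independent accumulators whose adds can
   execute in parallel, combined once at the end.  Assumes n % 2 == 0.  */
float sum_two_acc(float *a, size_t n)
{
    float s0 = 0, s1 = 0;
    for (size_t i = 0; i < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    return s0 + s1;
}
```

Note that the rewrite reassociates the floating-point reduction, which is only valid under -ffast-math-style flags such as -Ofast.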

This means that it is the vectorizer that needs to consider using a larger
vectorization factor rather than post-vectorization unrolling (that's likely
to pay off only for reductions).  I wonder what LLVM's heuristic is here.

The IPA-CP alignment question remains though.  Martin?
Comment 6 Joost VandeVondele 2016-06-07 13:52:08 UTC
Isn't this a case where -fvariable-expansion-in-unroller is helpful?

> gcc -Ofast t.c -lrt ; ./a.out
285.670206

> gcc -Ofast -funroll-loops -fvariable-expansion-in-unroller  t.c -lrt ; ./a.out
151.246083

> gcc -Ofast -funroll-loops  t.c -lrt ; ./a.out
277.047507

There is some relation with PR25621, I think.
Comment 7 Yichao Yu 2016-06-07 14:01:03 UTC
If I add `-fvariable-expansion-in-unroller` (omg, this option is like half the command line ;-p ...), the performance matches clang's after its 3.8 regression.

```
% gcc -funroll-loops -fvariable-expansion-in-unroller -Ofast -march=core-avx2 benchmark.c -o benchmark2 
% ./benchmark2 
45.588861
% ./benchmark-gcc
80.518152
% ./benchmark-clang38
41.920054
% ./benchmark-clang37
25.093145
```
Comment 8 rguenther@suse.de 2016-06-08 14:25:52 UTC
On Tue, 7 Jun 2016, yyc1992 at gmail dot com wrote:

> If I add `-fvariable-expansion-in-unroller` (omg, this option is like half
> the command line ;-p ...), the performance matches clang's after its 3.8
> regression.

Yeah, but -fvariable-expansion-in-unroller is quite late.
Comment 9 Martin Jambor 2016-06-10 15:22:39 UTC
(In reply to Richard Biener from comment #5)
> An interesting observation is that we clone sum32 for IPA-CP of n == 1024 but
> for some unknown reason figure Alignment of 'a' as unusable:

The function sum32 is not static, so it can be called from outside of
the current compilation unit with any alignment.  IPA-CP does not
clone for alignment, which was my deliberate decision because I found
it very difficult to reason about its profitability (I am open to
suggestions in this area, of course).

Make it static or compile with -flto and you will get:

IPA lattices after all propagation:

Lattices:
  Node: main/35:
  Node: sum32/34:
    param [0]: VARIABLE
         ctxs: VARIABLE
         Alignment 64, misalignment 0
        AGGS VARIABLE
    param [1]: 1024 [from: 35(99000)] [loc_time: 0, loc_size: 0, prop_time: 0, prop_size: 0]
         ctxs: VARIABLE
         Alignment unusable (BOTTOM)
        AGGS VARIABLE
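
In other words, the fix on the testcase side is a one-word change (a sketch; only `static` is new relative to the reporter's function):

```c
/* With internal linkage IPA-CP can see every caller, so the 64-byte
   alignment of the aligned_alloc'd buffer in main propagates to 'a'.  */
__attribute__((noinline)) static float sum32(float *a, size_t n)
{
    float s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}
```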
Comment 10 rguenther@suse.de 2016-06-10 16:32:03 UTC
On June 10, 2016 5:22:39 PM GMT+02:00, "jamborm at gcc dot gnu.org" <gcc-bugzilla@gcc.gnu.org> wrote:
>The function sum32 is not static, so it can be called from outside of
>the current compilation unit with any alignment.  IPA-CP does not
>clone for alignment, which was my deliberate decision because I found
>it very difficult to reason about its profitability.

It ends up cloning for the constant second arg, though, so I'd have expected the alignment to be set on the first...

Comment 11 Freddie Witherden 2020-10-11 15:39:35 UTC
I've been looking into this and the big difference appears to be that when Clang unrolls the loop it does so using multiple accumulators (and indeed it does this without needing to be told to unroll).  Given:

double acc(double *x, int n)
{
    double a = 0;
    #pragma omp simd
    for (int i = 0; i < n; i++)
        a += x[i];
    return a;
}


and compiling with clang -march=native -Ofast -fopenmp -S, the core loop reads as:

	vaddpd	(%rdi,%rsi,8), %ymm0, %ymm0
	vaddpd	32(%rdi,%rsi,8), %ymm1, %ymm1
	vaddpd	64(%rdi,%rsi,8), %ymm2, %ymm2
	vaddpd	96(%rdi,%rsi,8), %ymm3, %ymm3
	vaddpd	128(%rdi,%rsi,8), %ymm0, %ymm0
	vaddpd	160(%rdi,%rsi,8), %ymm1, %ymm1
	vaddpd	192(%rdi,%rsi,8), %ymm2, %ymm2
	vaddpd	224(%rdi,%rsi,8), %ymm3, %ymm3
	vaddpd	256(%rdi,%rsi,8), %ymm0, %ymm0
	vaddpd	288(%rdi,%rsi,8), %ymm1, %ymm1
	vaddpd	320(%rdi,%rsi,8), %ymm2, %ymm2
	vaddpd	352(%rdi,%rsi,8), %ymm3, %ymm3
	vaddpd	384(%rdi,%rsi,8), %ymm0, %ymm0
	vaddpd	416(%rdi,%rsi,8), %ymm1, %ymm1
	vaddpd	448(%rdi,%rsi,8), %ymm2, %ymm2
	vaddpd	480(%rdi,%rsi,8), %ymm3, %ymm3

which is heavily unrolled and uses four separate accumulators to hide the latency of the vector adds.  Interestingly, one could argue that Clang is not using enough registers given that Skylake can dual-issue adds and they have a latency of 4 cycles (implying you want 8 separate accumulators).
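
For illustration, an eight-accumulator version written directly with AVX intrinsics might look like the following (an editorial sketch, not code from either compiler; the function name is hypothetical and n is assumed to be a multiple of 32):

```c
#include <immintrin.h>
#include <stddef.h>

/* Eight independent ymm accumulators: enough to cover a 4-cycle vaddpd
   latency at two adds per cycle.  Compile with -mavx (or -march=native).  */
double acc8(const double *x, size_t n)
{
    __m256d s[8];
    for (int k = 0; k < 8; k++)
        s[k] = _mm256_setzero_pd();

    for (size_t i = 0; i < n; i += 32)      /* 8 accumulators x 4 doubles */
        for (int k = 0; k < 8; k++)
            s[k] = _mm256_add_pd(s[k], _mm256_loadu_pd(x + i + 4 * k));

    /* Reduce the eight vector accumulators to one scalar.  */
    for (int k = 1; k < 8; k++)
        s[0] = _mm256_add_pd(s[0], s[k]);
    double t[4];
    _mm256_storeu_pd(t, s[0]);
    return t[0] + t[1] + t[2] + t[3];
}
```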

GCC 10 with gcc -march=skylake -Ofast -fopenmp -S test.c -funroll-loops produces:

	vaddpd	-224(%r8), %ymm1, %ymm2
	vaddpd	-192(%r8), %ymm2, %ymm3
	vaddpd	-160(%r8), %ymm3, %ymm4
	vaddpd	-128(%r8), %ymm4, %ymm5
	vaddpd	-96(%r8), %ymm5, %ymm6
	vaddpd	-64(%r8), %ymm6, %ymm7
	vaddpd	-32(%r8), %ymm7, %ymm0

which, although it is unrolled, is not a useful unrolling due to the dependency chain.  Indeed, I would not be surprised if the performance is similar to the non-unrolled code, as the loop-related cruft can be hidden.
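
As a back-of-the-envelope check (editorial, using Skylake's published figures of 4-cycle vaddpd latency and 2 vaddpd issued per cycle): a single-accumulator chain retires at most one vaddpd every 4 cycles, i.e. 1 double per cycle no matter how far the loop is unrolled; Clang's four accumulators allow one vaddpd per cycle (4 doubles per cycle); and only eight independent accumulators can saturate both add ports (8 doubles per cycle).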