Bug 99412 - s352 benchmark of TSVC is vectorized by clang and not by gcc
Summary: s352 benchmark of TSVC is vectorized by clang and not by gcc
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 11.0
Importance: P3 normal
Target Milestone: 13.0
Assignee: Richard Biener
Keywords: missed-optimization
Depends on: 97832
Blocks: vectorizer TSVC
Reported: 2021-03-05 14:54 UTC by Jan Hubicka
Modified: 2024-09-18 21:59 UTC
CC List: 2 users

Last reconfirmed: 2021-03-08 00:00:00


Description Jan Hubicka 2021-03-05 14:54:32 UTC
typedef float real_t;

#define iterations 100000
#define LEN_1D 32000
#define LEN_2D 256

real_t a[LEN_1D],b[LEN_1D];
int main ()
{

//    loop rerolling
//    unrolled dot product

    real_t dot;
    for (int nl = 0; nl < 8*iterations; nl++) {
        dot = (real_t)0.;
        for (int i = 0; i < LEN_1D; i += 5) {
            dot = dot + a[i] * b[i] + a[i + 1] * b[i + 1] + a[i + 2]
                * b[i + 2] + a[i + 3] * b[i + 3] + a[i + 4] * b[i + 4];
        }
    }

    return dot;
}
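
(The exact compiler flags aren't stated; presumably something like

  clang -O3 -ffast-math -mavx2 -mfma -S s352.c

- the flags are an assumption, inferred from the FMA/AVX encodings and
the fast-math reassociation visible in the output below.)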


clang does:
main:                                   # @main
        .cfi_startproc
# %bb.0:
        xorl    %eax, %eax
        .p2align        4, 0x90
.LBB0_1:                                # =>This Loop Header: Depth=1
                                        #     Child Loop BB0_2 Depth 2
        vxorps  %xmm0, %xmm0, %xmm0
        movq    $-5, %rcx
        .p2align        4, 0x90
.LBB0_2:                                #   Parent Loop BB0_1 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
        vmovups b+20(,%rcx,4), %xmm1
        vmovss  b+36(,%rcx,4), %xmm2            # xmm2 = mem[0],zero,zero,zero
        vmulps  a+20(,%rcx,4), %xmm1, %xmm1
        vpermilpd       $1, %xmm1, %xmm3        # xmm3 = xmm1[1,0]
        vaddps  %xmm3, %xmm1, %xmm1
        vmovshdup       %xmm1, %xmm3            # xmm3 = xmm1[1,1,3,3]
        vaddss  %xmm3, %xmm1, %xmm1
        vfmadd231ss     a+36(,%rcx,4), %xmm2, %xmm1 # xmm1 = (xmm2 * mem) + xmm1
        addq    $5, %rcx
        vaddss  %xmm0, %xmm1, %xmm0
        cmpq    $31995, %rcx                    # imm = 0x7CFB
        jb      .LBB0_2
# %bb.3:                                #   in Loop: Header=BB0_1 Depth=1
        incl    %eax
        cmpl    $800000, %eax                   # imm = 0xC3500
        jne     .LBB0_1
# %bb.4:
        vcvttss2si      %xmm0, %eax
        retq
Comment 1 Richard Biener 2021-03-08 08:32:46 UTC
With -fno-tree-reassoc we detect the reduction chain and produce

.L3:
        vmovaps b(%rax), %ymm5
        vmovaps b+32(%rax), %ymm6
        addq    $160, %rax
        vfmadd231ps     a-160(%rax), %ymm5, %ymm1
        vmovaps b-96(%rax), %ymm7
        vfmadd231ps     a-128(%rax), %ymm6, %ymm0
        vmovaps b-64(%rax), %ymm5
        vmovaps b-32(%rax), %ymm6
        vfmadd231ps     a-96(%rax), %ymm7, %ymm2
        vfmadd231ps     a-64(%rax), %ymm5, %ymm3
        vfmadd231ps     a-32(%rax), %ymm6, %ymm4
        cmpq    $128000, %rax
        jne     .L3
        vaddps  %ymm1, %ymm0, %ymm0
        vaddps  %ymm2, %ymm0, %ymm0
        vaddps  %ymm3, %ymm0, %ymm0
        vaddps  %ymm4, %ymm0, %ymm0
        vextractf128    $0x1, %ymm0, %xmm1
        vaddps  %xmm0, %xmm1, %xmm1
        vmovhlps        %xmm1, %xmm1, %xmm0
        vaddps  %xmm1, %xmm0, %xmm0
        vshufps $85, %xmm0, %xmm0, %xmm1
        vaddps  %xmm0, %xmm1, %xmm0
        decl    %edx
        jne     .L2

we're not re-rolling and thus are forced to use a VF of 4 here.
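
For illustration, a minimal sketch of the source form re-rolling would
recover (the unrolled body sums five consecutive products, so the loop
is just a plain dot product that vectorizes with any VF):

    real_t dot = (real_t)0.;
    for (int i = 0; i < LEN_1D; i++)
        dot = dot + a[i] * b[i];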

Note that LLVM doesn't seem to vectorize the loop but instead vectorizes
the basic block, which isn't what TSVC looks for (but that approach would
also work for non-fast-math).
Comment 2 Jan Hubicka 2023-01-11 19:14:05 UTC
This is also seen on zen4 when comparing gcc and aocc (about a 2.3x difference).
Comment 3 Richard Biener 2023-01-12 09:12:06 UTC
(In reply to Jan Hubicka from comment #2)
> This is also seen on zen4 when comparing gcc and aocc (about a 2.3x
> difference)

Disabling

@@ -6877,7 +6887,7 @@ reassociate_bb (basic_block bb)
                          binary op are chosen wisely.  */
                       int len = ops.length ();
                       if (len >= 3)
                         swap_ops_for_binary_stmt (ops, len - 3, stmt);

will naturally create the reduction chain (or leave it in place) given the
current rank computation.  We do have a (somewhat) robust fallback from
reduction chain to reduction (via reduction path support), so I think this
change would be OK.
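
Concretely, the experiment amounts to neutralizing that call, e.g.
(a sketch of the experiment only, not of the final patch):

                       int len = ops.length ();
                       if (0 && len >= 3)  /* swap disabled */
                         swap_ops_for_binary_stmt (ops, len - 3, stmt);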
Comment 4 Richard Biener 2023-01-12 09:30:07 UTC
(In reply to Richard Biener from comment #3)
> (In reply to Jan Hubicka from comment #2)
> > This is also seen on zen4 when comparing gcc and aocc (about a 2.3x
> > difference)
> 
> Disabling
> 
> @@ -6877,7 +6887,7 @@ reassociate_bb (basic_block bb)
>                           binary op are chosen wisely.  */
>                        int len = ops.length ();
>                        if (len >= 3)
>                          swap_ops_for_binary_stmt (ops, len - 3, stmt);
> 
> will naturally create the reduction chain (or leave it in place) given the
> current rank computation.  We do have a (somewhat) robust fallback from
> reduction chain to reduction (via reduction path support), so I think this
> change would be OK.

The code originated from r0-111616-gdf7b0cc4aae062, the reassoc rewrite by
Jeff back in 2005 for GCC 4.3 (or 4.2, I don't have that tree around anymore).
Comment 5 GCC Commits 2023-01-12 13:30:47 UTC
The master branch has been updated by Richard Biener <rguenth@gcc.gnu.org>:

https://gcc.gnu.org/g:b073f2b098ba7819450d6c14a0fb96cb1c09f242

commit r13-5122-gb073f2b098ba7819450d6c14a0fb96cb1c09f242
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Jan 12 11:18:22 2023 +0100

    tree-optimization/99412 - reassoc and reduction chains
    
    With -ffast-math we end up associating reduction chains and breaking
    them - this is because of old code that tries to rectify reductions
    into a shape liked by the vectorizer.  Nowadays the rank computation
    produces the correct association for reduction chains and the vectorizer
    has robust support to fall back to a regular reduction (via a
    reduction path) when it turns out not to be a proper reduction chain.
    
    So this patch removes the special code in reassoc, which makes
    TSVC s352 vectorized with -Ofast (it already is without
    -ffast-math).
    
            PR tree-optimization/99412
            * tree-ssa-reassoc.cc (is_phi_for_stmt): Remove.
            (swap_ops_for_binary_stmt): Remove reduction handling.
            (rewrite_expr_tree_parallel): Adjust.
            (reassociate_bb): Likewise.
            * tree-parloops.cc (build_new_reduction): Handle MINUS_EXPR.
    
            * gcc.dg/vect/pr99412.c: New testcase.
            * gcc.dg/tree-ssa/reassoc-47.c: Adjust comment.
            * gcc.dg/tree-ssa/reassoc-48.c: Remove.
Comment 6 Richard Biener 2023-01-12 13:42:30 UTC
GCC now does

.L2:
        vmovaps b(%rax), %xmm6
        vmulps  a(%rax), %xmm6, %xmm0
        addq    $80, %rax
        vmovaps b-64(%rax), %xmm7
        vmovaps b-48(%rax), %xmm6
        vaddps  %xmm0, %xmm5, %xmm5
        vmulps  a-64(%rax), %xmm7, %xmm0
        vmovaps b-32(%rax), %xmm7
        vaddps  %xmm0, %xmm1, %xmm1
        vmulps  a-48(%rax), %xmm6, %xmm0
        vmovaps b-16(%rax), %xmm6
        vaddps  %xmm0, %xmm4, %xmm4
        vmulps  a-32(%rax), %xmm7, %xmm0
        vaddps  %xmm0, %xmm2, %xmm2
        vmulps  a-16(%rax), %xmm6, %xmm0
        vaddps  %xmm0, %xmm3, %xmm3
        cmpq    $128000, %rax
        jne     .L2

thus uses a VF of 4 with -Ofast.  Your LLVM snippet uses 5 lanes
instead of our 20: four lanes in a V4SF plus one lane in a scalar.
That's interesting, but not something we support.

Re-rolling would mean using a single 4-lane V4SF vector here.  For
a pure SLP loop something like this should be possible without too
much hassle, I think.  We'd just need to try ... (and think about
whether it's worth it in real life).
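
As a rough illustration of that re-rolled single-V4SF form, a
hand-written sketch using SSE intrinsics (not vectorizer output; it
assumes LEN_1D stays a multiple of 4 and that fast-math reassociation
of the reduction is acceptable):

    #include <immintrin.h>

    __m128 acc = _mm_setzero_ps ();
    for (int i = 0; i < LEN_1D; i += 4)
      acc = _mm_add_ps (acc, _mm_mul_ps (_mm_loadu_ps (&a[i]),
                                         _mm_loadu_ps (&b[i])));
    /* reduce the four lanes to a scalar */
    acc = _mm_add_ps (acc, _mm_movehl_ps (acc, acc));
    acc = _mm_add_ss (acc, _mm_shuffle_ps (acc, acc, 1));
    real_t dot = _mm_cvtss_f32 (acc);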

For the purpose of the Summary this is fixed now.