Bug 86174 - Poor vectorization/register allocation with omp simd, FMA
Summary: Poor vectorization/register allocation with omp simd, FMA
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 8.1.1
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: alias, missed-optimization
Depends on:
Blocks: vectorizer
  Show dependency treegraph
 
Reported: 2018-06-16 16:03 UTC by Jed Brown
Modified: 2018-06-18 07:56 UTC (History)
2 users (show)

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-06-18 00:00:00


Attachments
Source demonstrating poor optimization (285 bytes, text/plain)
2018-06-16 16:03 UTC, Jed Brown
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Jed Brown 2018-06-16 16:03:50 UTC
Created attachment 44287 [details]
Source demonstrating poor optimization

The attached code produces lots of needless spills with gcc-8.1 on Linux.

$ gcc -O3 -Wall -march=skylake -ffast-math -fopenmp -c mm-gcc.c

0000000000000080 <mult+0x80> vbroadcastsd ymm2,QWORD PTR [rdx]
0000000000000085 <mult+0x85> vmovupd ymm1,YMMWORD PTR [rcx]
0000000000000089 <mult+0x89> vmovapd ymm0,ymm1
000000000000008d <mult+0x8d> vfmadd213pd ymm0,ymm2,YMMWORD PTR [rsp]
0000000000000093 <mult+0x93> vmovapd YMMWORD PTR [rsp],ymm0
0000000000000098 <mult+0x98> vmovupd ymm0,YMMWORD PTR [rcx+0x20]
000000000000009d <mult+0x9d> vfmadd213pd ymm2,ymm0,YMMWORD PTR [rsp+0x20]
00000000000000a4 <mult+0xa4> vmovapd YMMWORD PTR [rsp+0x20],ymm2
00000000000000aa <mult+0xaa> vbroadcastsd ymm2,QWORD PTR [rdx+0x400]
00000000000000b3 <mult+0xb3> vmovapd ymm3,ymm1
00000000000000b7 <mult+0xb7> vfmadd213pd ymm3,ymm2,YMMWORD PTR [rsp+0x40]
00000000000000be <mult+0xbe> vmovapd YMMWORD PTR [rsp+0x40],ymm3
00000000000000c4 <mult+0xc4> vfmadd213pd ymm2,ymm0,YMMWORD PTR [rsp+0x60]
00000000000000cb <mult+0xcb> vmovapd YMMWORD PTR [rsp+0x60],ymm2
00000000000000d1 <mult+0xd1> vbroadcastsd ymm2,QWORD PTR [rdx+0x800]
00000000000000da <mult+0xda> vmovapd ymm3,ymm1
00000000000000de <mult+0xde> vfmadd213pd ymm3,ymm2,YMMWORD PTR [rsp+0x80]
00000000000000e8 <mult+0xe8> vmovapd YMMWORD PTR [rsp+0x80],ymm3
00000000000000f1 <mult+0xf1> vfmadd213pd ymm2,ymm0,YMMWORD PTR [rsp+0xa0]
00000000000000fb <mult+0xfb> vmovapd YMMWORD PTR [rsp+0xa0],ymm2
0000000000000104 <mult+0x104> vbroadcastsd ymm2,QWORD PTR [rdx+0xc00]
000000000000010d <mult+0x10d> vfmadd213pd ymm1,ymm2,YMMWORD PTR [rsp+0xc0]
0000000000000117 <mult+0x117> vmovapd YMMWORD PTR [rsp+0xc0],ymm1
0000000000000120 <mult+0x120> vfmadd213pd ymm0,ymm2,YMMWORD PTR [rsp+0xe0]
000000000000012a <mult+0x12a> vmovapd YMMWORD PTR [rsp+0xe0],ymm0
0000000000000133 <mult+0x133> add    rdx,0x8
0000000000000137 <mult+0x137> add    rcx,0x400
000000000000013e <mult+0x13e> cmp    rsi,rcx
0000000000000141 <mult+0x141> jne    0000000000000080 <mult+0x80>

GCC does not issue vector instructions if omp simd is removed.  In contrast, clang-6 vectorizes well with or without omp simd:

$ clang -O3 -Wall -march=haswell -ffast-math -c mm-gcc.c

00000000000000e0 <mult+0xe0> vmovapd ymm9,ymm6
00000000000000e4 <mult+0xe4> vbroadcastsd ymm10,QWORD PTR [rdi+rbx*8-0x800]
00000000000000ee <mult+0xee> vmovupd ymm6,YMMWORD PTR [rax-0x20]
00000000000000f3 <mult+0xf3> vmovupd ymm11,YMMWORD PTR [rax]
00000000000000f7 <mult+0xf7> vfmadd231pd ymm1,ymm6,ymm10
00000000000000fc <mult+0xfc> vfmadd231pd ymm7,ymm11,ymm10
0000000000000101 <mult+0x101> vbroadcastsd ymm10,QWORD PTR [rdi+rbx*8-0x400]
000000000000010b <mult+0x10b> vfmadd231pd ymm8,ymm6,ymm10
0000000000000110 <mult+0x110> vfmadd231pd ymm5,ymm11,ymm10
0000000000000115 <mult+0x115> vbroadcastsd ymm10,QWORD PTR [rdi+rbx*8]
000000000000011b <mult+0x11b> vfmadd231pd ymm2,ymm6,ymm10
0000000000000120 <mult+0x120> vfmadd231pd ymm3,ymm11,ymm10
0000000000000125 <mult+0x125> vbroadcastsd ymm10,QWORD PTR [rdi+rbx*8+0x400]
000000000000012f <mult+0x12f> vfmadd213pd ymm6,ymm10,ymm9
0000000000000134 <mult+0x134> vfmadd231pd ymm4,ymm11,ymm10
0000000000000139 <mult+0x139> add    rax,0x400
000000000000013f <mult+0x13f> add    rbx,0x1
0000000000000143 <mult+0x143> jne    00000000000000e0 <mult+0xe0>

(I used -march=haswell instead of -march=skylake due to https://bugs.llvm.org/show_bug.cgi?id=37819.)
Comment 1 Alexander Monakov 2018-06-16 17:42:30 UTC
It might be useful to note that what the testcase "wants" to happen is for the compiler to notice that the temporary array 'double C[Si][Sk]' does not need to live in memory - ideally it would correspond to 8 256-bit (or 4 512-bit) registers.
Comment 2 Richard Biener 2018-06-18 07:56:37 UTC
Confirmed.  There's two things in the way - first we transform the

        #pragma omp simd
        for (int kk=0; kk<Sk; kk++) {
          c[(i+ii)*p+k+kk] = C[ii][kk];
        }

loop to memcpy (we could simply avoid that for force_vectorize loops as a hack).
And if we avoid that, for example with -fno-tree-loop-distribute-patterns then
we fail to elide the stores to C[].  That happens because unrolling doesn't
preserve restrict info and when vectorization makes C addressable it doesn't
make restrict info reflect that it doesn't alias with anything.

We also do not have a late enough scalarization pass that would elide
the array - we'd rely on LIM/DSE here.