Bug 54939 - Very poor vectorization of loops with complex arithmetic
Summary: Very poor vectorization of loops with complex arithmetic
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.8.0
Importance: P3 normal
Target Milestone: ---
Assignee: Richard Biener
URL:
Keywords:
Depends on: 84361
Blocks: vectorizer 37021
 
Reported: 2012-10-16 14:22 UTC by Yuri Rumyantsev
Modified: 2023-07-21 12:28 UTC (History)
CC: 6 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2012-10-16 00:00:00


Attachments
test reproducer (449 bytes, application/octet-stream)
2012-10-16 14:54 UTC, Yuri Rumyantsev

Description Yuri Rumyantsev 2012-10-16 14:22:07 UTC
While analyzing a performance anomaly for spec2000 I found that 168.wupwise with vectorization is slower than without it on x86. The main problem is that gcc does not recognize some special idioms of complex addition and multiplication during loop vectorization. For example, for a simple zaxpy loop icc generates 1.6X faster code than gcc. Here is the assembly for the zaxpy loop produced by icc:

..B1.4:                         # Preds ..B1.2 ..B1.4
        movups    (%rsi,%rdx), %xmm2                            #7.28
        movups    16(%rsi,%rdx), %xmm5                          #7.28
        movups    (%rsi,%rcx), %xmm4                            #7.17
        movups    16(%rsi,%rcx), %xmm7                          #7.17
        movddup   (%rsi,%rdx), %xmm3                            #7.27
        incq      %r8                                           #6.10
        movddup   16(%rsi,%rdx), %xmm6                          #7.27
        unpckhpd  %xmm2, %xmm2                                  #7.27
        unpckhpd  %xmm5, %xmm5                                  #7.27
        mulpd     %xmm1, %xmm3                                  #7.27
        mulpd     %xmm0, %xmm2                                  #7.27
        mulpd     %xmm1, %xmm6                                  #7.27
        mulpd     %xmm0, %xmm5                                  #7.27
        addsubpd  %xmm2, %xmm3                                  #7.27
        addsubpd  %xmm5, %xmm6                                  #7.27
        addpd     %xmm3, %xmm4                                  #7.9
        addpd     %xmm6, %xmm7                                  #7.9
        movups    %xmm4, (%rsi,%rcx)                            #7.9
        movups    %xmm7, 16(%rsi,%rcx)                          #7.9
        addq      $32, %rsi                                     #6.10
        cmpq      %rdi, %r8                                     #6.10
        jb        ..B1.4        # Prob 64%                      #6.10
(I got it with the -xSSE4.2 -O3 options.)  For the gcc compiler the following options were used: -m64 -mfpmath=sse -march=corei7 -O3 -ffast-math.
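For reference, a C sketch of the zaxpy kernel (illustrative only; the attached test reproducer is the Fortran version actually measured, t.f in the dumps below) is:

#include <complex.h>

/* zaxpy: y(i) = y(i) + alpha * x(i) over double-complex data.  */
void
zaxpy (long n, double _Complex alpha, const double _Complex *x,
       double _Complex *y)
{
  for (long i = 0; i < n; i++)
    y[i] += alpha * x[i];
}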
Comment 1 Richard Biener 2012-10-16 14:36:52 UTC
Can you post the zaxpy source here, please?  Also please see the list of bugs
referenced from PR53947; there is likely a duplicate of this issue.
Comment 2 Yuri Rumyantsev 2012-10-16 14:54:50 UTC
Created attachment 28455 [details]
test reproducer
Comment 3 Yuri Rumyantsev 2012-10-16 15:06:19 UTC
I looked through the list of all issues related to vectorization but could not find a duplicate.
Comment 4 Richard Biener 2012-10-16 15:31:52 UTC
Thanks.
Comment 5 Richard Biener 2013-03-27 11:19:49 UTC
Confirmed.  GCC vectorizes this using hybrid SLP: it unrolls the loop once
to be able to vectorize the two subtractions and two additions resulting from
the complex multiplication.

This PR is kind of a duplicate of PR37021, where a reduction and a variable
stride are also involved.  So fixing this bug is required to vectorize
PR37021 more efficiently.

Note that this bug alone has multiple issues that need to be tackled.
I happen to be working on them.
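As an illustration (a sketch with made-up names, not taken from the testcase), the lowered scalar loop body with interleaved re/im storage looks like:

/* Scalar lowering of y(i) = y(i) + alpha*x(i).  Unrolling this once yields
   two subtractions (real parts) and two additions (imaginary parts) from the
   complex multiplications, which hybrid SLP packs into vector operations.  */
void
zaxpy_scalar (long n, double ar, double ai, const double *x, double *y)
{
  for (long i = 0; i < n; i++)
    {
      double xr = x[2*i], xi = x[2*i + 1];
      y[2*i]     += ar * xr - ai * xi;   /* real part of alpha*x(i)      */
      y[2*i + 1] += ar * xi + ai * xr;   /* imaginary part of alpha*x(i) */
    }
}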
Comment 6 Richard Biener 2016-06-08 13:45:42 UTC
So we would now vectorize this, but the cost model computes the SLP vectorization as never profitable.  Generated code with -fno-vect-cost-model (-Ofast -march=corei7):

.L4:
        movupd  (%rdx,%rax), %xmm0
        movapd  %xmm3, %xmm1
        mulpd   %xmm0, %xmm1
        palignr $8, %xmm0, %xmm0
        mulpd   %xmm2, %xmm0
        addsubpd        %xmm0, %xmm1
        movupd  (%rcx,%rax), %xmm0
        addpd   %xmm0, %xmm1
        movups  %xmm1, (%rcx,%rax)
        addq    $16, %rax
        cmpq    %rsi, %rax
        jne     .L4

t.f:6:0: note: Cost model analysis:
  Vector inside of loop cost: 15
  Vector prologue cost: 10
  Vector epilogue cost: 0
  Scalar iteration cost: 14
  Scalar outside cost: 6
  Vector outside cost: 10
  prologue iterations: 0
  epilogue iterations: 0
t.f:6:0: note: cost model: the vector iteration cost = 15 divided by the scalar iteration cost = 14 is greater or equal to the vectorization factor = 1.
t.f:6:0: note: not vectorized: vectorization not profitable.
t.f:6:0: note: not vectorized: vector version will never be profitable.

as it is basically equal to basic-block vectorizing the loop body.  Note that
we pessimistically handle addsubpd as if it were not present and as if the
code would really end up as

  vect__31.16_42 = vect__27.10_49 - vect__28.15_43;
  vect__31.17_41 = vect__27.10_49 + vect__28.15_43;
  _40 = VEC_PERM_EXPR <vect__31.16_42, vect__31.17_41, { 0, 3 }>;

which is how the vectorizer handles this (costing two vector_stmt plus one
vec_perm).  The x86 vectorizer cost model would need to be adjusted
for this.
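For reference (a sketch, not taken from the dump), addsubpd subtracts in the even lane and adds in the odd lane, so a single instruction covers the minus/plus/VEC_PERM triple above:

/* Reference semantics of SSE3 addsubpd on a pair of doubles.  */
static void
addsubpd_ref (const double a[2], const double b[2], double r[2])
{
  r[0] = a[0] - b[0];   /* element 0 of the minus (permute index 0) */
  r[1] = a[1] + b[1];   /* element 1 of the plus  (permute index 3) */
}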

With the cost model enabled we fall back to vectorization using interleaving:

.L5:
        movupd  (%rax), %xmm5
        addl    $1, %r9d
        addq    $32, %rax
        addq    $32, %r8
        movupd  -32(%r8), %xmm1
        movupd  -16(%r8), %xmm0
        movapd  %xmm5, %xmm2
        movapd  %xmm1, %xmm6
        movupd  -16(%rax), %xmm9
        unpckhpd        %xmm0, %xmm1
        unpcklpd        %xmm0, %xmm6
        movapd  %xmm6, %xmm0
        mulpd   %xmm8, %xmm0
        unpcklpd        %xmm9, %xmm2
        unpckhpd        %xmm9, %xmm5
        mulpd   %xmm7, %xmm6
        addpd   %xmm0, %xmm2
        movapd  %xmm1, %xmm0
        mulpd   %xmm7, %xmm0
        mulpd   %xmm8, %xmm1
        subpd   %xmm0, %xmm2
        movapd  %xmm1, %xmm0
        addpd   %xmm6, %xmm0
        movapd  %xmm2, %xmm1
        addpd   %xmm5, %xmm0
        unpcklpd        %xmm0, %xmm1
        unpckhpd        %xmm0, %xmm2
        movups  %xmm1, -32(%rax)
        movups  %xmm2, -16(%rax)
        cmpl    %esi, %r9d
        jb      .L5

t.f:6:0: note: Cost model analysis:
  Vector inside of loop cost: 26
  Vector prologue cost: 10
  Vector epilogue cost: 14
  Scalar iteration cost: 14
  Scalar outside cost: 6
  Vector outside cost: 24
  prologue iterations: 0
  epilogue iterations: 1
  Calculated minimum iters for profitability: 6
t.f:6:0: note:   Runtime profitability threshold = 5
t.f:6:0: note:   Static estimate profitability threshold = 16



Thus this is now a cost model issue.
Comment 7 Richard Biener 2016-06-08 13:47:49 UTC
With AVX2 we indeed generate

.L4:
        vmovupd (%rdx,%rax), %ymm3
        addl    $1, %r9d
        vpermpd $177, %ymm3, %ymm4
        vmovapd %ymm3, %ymm2
        vmulpd  %ymm6, %ymm4, %ymm4
        vfmsub132pd     %ymm5, %ymm4, %ymm2
        vfmadd132pd     %ymm5, %ymm4, %ymm3
        vshufpd $10, %ymm3, %ymm2, %ymm2
        vaddpd  (%rcx,%rax), %ymm2, %ymm2
        vmovupd %ymm2, (%rcx,%rax)
        addq    $32, %rax
        cmpl    %esi, %r9d
        jb      .L4

so either there is no addsub for %ymm or the pattern support for it is
insufficient.  Note that with AVX2 the above is what is generated even with
the cost model enabled, as it is now considered a profitable vectorization.
Comment 8 Richard Biener 2019-01-03 12:07:02 UTC
On trunk we get with -Ofast -msse4.2

.L3:
        movupd  (%rdx,%rax), %xmm0
        movupd  (%rdx,%rax), %xmm1
        movupd  (%rcx,%rax), %xmm4
        mulpd   %xmm3, %xmm1
        palignr $8, %xmm0, %xmm0
        mulpd   %xmm2, %xmm0
        addsubpd        %xmm0, %xmm1
        addpd   %xmm4, %xmm1
        movups  %xmm1, (%rcx,%rax)
        addq    $16, %rax
        cmpq    %rsi, %rax
        jne     .L3

ICC unrolls the body once more.  With -mavx2 we get

.L4:
        vmovupd (%rdx,%rsi), %xmm6
        vinsertf128     $0x1, 16(%rdx,%rsi), %ymm6, %ymm1
        vmovupd (%rcx,%rsi), %xmm7
        vmulpd  %ymm5, %ymm1, %ymm2
        vpermpd $177, %ymm1, %ymm1
        vmulpd  %ymm4, %ymm1, %ymm1
        vaddsubpd       %ymm1, %ymm2, %ymm1
        vinsertf128     $0x1, 16(%rcx,%rsi), %ymm7, %ymm2
        vaddpd  %ymm2, %ymm1, %ymm1
        vmovups %xmm1, (%rcx,%rsi)
        vextractf128    $0x1, %ymm1, 16(%rcx,%rsi)
        addq    $32, %rsi
        cmpq    %rsi, %rdi
        jne     .L4

if I add -mfma, for example via -march=core-avx2, I again get

.L4:
        vpermpd $177, (%rdx,%rsi), %ymm2
        vmovapd %ymm4, %ymm1
        vmulpd  %ymm5, %ymm2, %ymm2
        vfmsub132pd     (%rdx,%rsi), %ymm2, %ymm1
        vfmadd231pd     (%rdx,%rsi), %ymm4, %ymm2
        vshufpd $10, %ymm2, %ymm1, %ymm1
        vaddpd  (%rcx,%rsi), %ymm1, %ymm1
        vmovupd %ymm1, (%rcx,%rsi)
        addq    $32, %rsi
        cmpq    %rsi, %rdi
        jne     .L4

showing that we either lack vfmaddsub patterns or the ones present do not
work properly.  combine sees

   37: r109:V4DF={r127:V4DF*r105:V4DF+-r108:V4DF}
   38: r110:V4DF={r127:V4DF*r105:V4DF+r108:V4DF}
      REG_DEAD r127:V4DF
      REG_DEAD r108:V4DF
   39: r129:V4DF=vec_select(vec_concat(r109:V4DF,r110:V4DF),parallel)
      REG_DEAD r110:V4DF
      REG_DEAD r109:V4DF

Unfortunately the fmaddsub patterns all use UNSPECs, with the comment

;; It would be possible to represent these without the UNSPEC as
;;
;; (vec_merge
;;   (fma op1 op2 op3)
;;   (fma op1 op2 (neg op3))
;;   (merge-const))
;;
;; But this doesn't seem useful in practice.

The AVX512 ones do not seem to suffer from this, but using AVX512 via
-march=knl also only results in

.L4:
        vmovupd (%rdx,%rax), %zmm1
        vpermpd $177, %zmm1, %zmm2
        vmovapd %zmm1, %zmm0
        vmulpd  %zmm4, %zmm2, %zmm2
        vfmsub132pd     %zmm3, %zmm2, %zmm0
        vfmadd132pd     %zmm3, %zmm2, %zmm1
        vshufpd $170, %zmm1, %zmm0, %zmm0
        vaddpd  (%rcx,%rax), %zmm0, %zmm0
        vmovupd %zmm0, (%rcx,%rax)
        leaq    64(%rax), %rax
        cmpq    %rax, %rsi
        jne     .L4

but maybe KNL doesn't have fmaddsub.  Ah, with AVX512 we end up with
vec_merge (like with SSE2), but the 256-bit AVX case above shows concat/select
(avx_shufpd256_1).

Maybe combine itself can try both variants in cases where there is a duality
between (vec_merge ...) and (vec_select (vec_concat ...))?
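For comparison, what the loop should ideally compile to can be written by hand with FMA intrinsics (a hand-written sketch, not compiler output; ar/ai are assumed to hold broadcast copies of the real and imaginary parts of alpha, and n, the number of double-complex elements, is assumed even):

#include <immintrin.h>

/* Complex AXPY inner loop using vfmaddsub; x and y are interleaved
   re/im arrays, two double-complex elements per iteration.  */
void
zaxpy_fmaddsub (long n, __m256d ar, __m256d ai, const double *x, double *y)
{
  for (long i = 0; i < 2 * n; i += 4)
    {
      __m256d xv   = _mm256_loadu_pd (x + i);        /* re0 im0 re1 im1 */
      __m256d xsw  = _mm256_permute_pd (xv, 0x5);    /* im0 re0 im1 re1 */
      __m256d t    = _mm256_mul_pd (ai, xsw);        /* ai*im, ai*re    */
      /* even lanes: ar*re - ai*im; odd lanes: ar*im + ai*re  */
      __m256d prod = _mm256_fmaddsub_pd (ar, xv, t);
      __m256d yv   = _mm256_loadu_pd (y + i);
      _mm256_storeu_pd (y + i, _mm256_add_pd (yv, prod));
    }
}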
Comment 9 Segher Boessenkool 2019-01-03 12:40:19 UTC
I think we should declare one of the forms canonical and make simplify
use that one, if possible.  If we really want one form in some cases and
another in others, something like change_zero_ext will work.
Comment 10 rguenther@suse.de 2019-01-03 12:51:17 UTC
(In reply to Segher Boessenkool from comment #9)
> I think we should declare one of the forms canonical and make simplify
> use that one, if possible.

Yeah, the main issue is that targets have existing patterns preferring
either form.  vec_merge avoids the need for wider intermediate modes,
but with vec_select it is easier to allow constrained operands...

So IMHO RTL shouldn't have both, but it would be convenient not to need
that wider mode... it might call for a vec_concat_and_select
(aka vec_shuffle2).  Another issue with vec_merge is the limit of the
bitmask operand (a CONST_INT), which restricts us to 64 vector elements.

That said, kill vec_merge and allow

 (vec_select:V4DF (vec_concat:V8DF (reg:V4DF ...) (reg:V4DF ...))
                  (parallel [(const_int 0) ...]))

to be written as

 (vec_shuffle:V4DF (reg:V4DF ...) (reg:V4DF ...)
                   (parallel [(const_int 0) ...]))

Note that vec_concat operands are not constrained in any way, but
vec_shuffle operands would be?

That is, in practice targets can probably get rid of vec_merge
by using vec_select/vec_concat "easily", even if that means
introducing "fake" larger vector modes.
Comment 11 Segher Boessenkool 2019-01-03 13:03:14 UTC
(In reply to rguenther@suse.de from comment #10)
> even if that means introducing "fake" larger vector modes.

That would be a good reason not to do this, except that many targets already
need double-length modes to describe their permute instructions?
Comment 12 Uroš Bizjak 2019-01-03 13:05:49 UTC
(In reply to Richard Biener from comment #8)

> Maybe combine itself can try both variants in case there's a duality
> between (vec_merge ...) and (vec_select (vec_concat ...))?


Please note that combine splitters are used to help combine create the addsub pattern.  Please also see the addsub_vm_operator predicate, which handles the vec_merge form, and addsub_vs_operator, which handles the vec_select/vec_concat combination, as well as PR66560.

Probably these predicates and similar splitters can be used to handle fmaddsub instructions.
Comment 13 rguenther@suse.de 2019-01-03 13:22:38 UTC
(In reply to Segher Boessenkool from comment #11)
> That would be a good reason not to do this, except that many targets already
> need double-length modes to describe their permute instructions?

Yes.  I think there is no real need for vec_merge for those targets;
vec_shuffle would of course remove the need for the larger modes.
Comment 14 Richard Biener 2023-07-21 12:28:10 UTC
With SSE4.2 we now get

.L3:
        movupd  (%rdx,%rax), %xmm0
        movupd  (%rcx,%rax), %xmm4
        movapd  %xmm0, %xmm1
        palignr $8, %xmm0, %xmm0
        mulpd   %xmm3, %xmm1
        mulpd   %xmm2, %xmm0
        addpd   %xmm4, %xmm1
        addsubpd        %xmm0, %xmm1
        movups  %xmm1, (%rcx,%rax)
        addq    $16, %rax
        cmpq    %rsi, %rax
        jne     .L3

with AVX and FMA

.L4:
        vmovupd (%rdx,%rax), %ymm0
        vmovapd %ymm4, %ymm1
        vfmadd213pd     (%rcx,%rax), %ymm0, %ymm1
        vpermilpd       $5, %ymm0, %ymm0
        vmulpd  %ymm3, %ymm0, %ymm0
        vaddsubpd       %ymm0, %ymm1, %ymm1
        vmovupd %ymm1, (%rcx,%rax)
        addq    $32, %rax
        cmpq    %rsi, %rax
        jne     .L4

so I'd say this is fixed.  But with AVX512 we now get

.L4:
        vmovupd (%rdi,%rax), %zmm0
        vmovapd %zmm7, %zmm2
        vmovapd %zmm4, %zmm6
        vfmadd213pd     (%rcx,%rax), %zmm0, %zmm2
        vpermilpd       $85, %zmm0, %zmm0
        vfmadd132pd     %zmm0, %zmm2, %zmm6
        vfnmadd132pd    %zmm4, %zmm2, %zmm0
        vmovapd %zmm6, %zmm0{%k1}
        vmovupd %zmm0, (%rcx,%rax)
        addq    $64, %rax
        cmpq    %rax, %rsi
        jne     .L4

It's odd that this only happens with -mprefer-vector-width=512, though.  Do
we possibly miss vec_{fm,}{addsub,subadd} for those?  It looks like we do.

Tracking in PR110767.  The vectorizer side is fixed.