[Bug tree-optimization/97343] AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product

already5chosen at yahoo dot com gcc-bugzilla@gcc.gnu.org
Fri Oct 9 11:58:52 GMT 2020


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97343

--- Comment #2 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Richard Biener from comment #1)
> All below for Part 2.
> 
> Without -ffast-math you are seeing GCC using in-order reductions now while
> with -ffast-math the vectorizer gets a bit confused about reassociations done
> before, for me producing
> 

Just to understand: when you say "Without -ffast-math", does that include
my set1 == "-O3 -ffast-math -fno-associative-math"?

BTW, I like your phrasing: "a bit confused about reassociations"


> .L3:
>         vmovupd 32(%rsi,%rax), %ymm3
>         vmovupd (%rdx,%rax), %ymm7
>         vinsertf128     $1, (%rsi,%rax), %ymm3, %ymm0
>         vinsertf128     $1, 32(%rdx,%rax), %ymm7, %ymm2
>         vmovupd 32(%rsi,%rax), %ymm5
>         vpermpd $136, %ymm0, %ymm4
>         vpermpd $40, %ymm2, %ymm7
>         vpermpd $221, %ymm0, %ymm1
>         vpermpd $125, %ymm2, %ymm3
>         vperm2f128      $49, (%rsi,%rax), %ymm5, %ymm0
>         vmovupd (%rdx,%rax), %ymm2
>         vperm2f128      $49, 32(%rdx,%rax), %ymm2, %ymm2
>         addq    $64, %rax
>         vpermpd $136, %ymm0, %ymm5
>         vpermpd $221, %ymm0, %ymm0
>         vpermpd $40, %ymm2, %ymm8
>         vpermpd $125, %ymm2, %ymm2
>         vmulpd  %ymm8, %ymm5, %ymm5
>         vmulpd  %ymm2, %ymm0, %ymm0
>         vfmadd132pd     %ymm3, %ymm5, %ymm1
>         vfmadd231pd     %ymm7, %ymm4, %ymm0
>         vaddpd  %ymm0, %ymm1, %ymm0
>         vaddpd  %ymm0, %ymm6, %ymm6
>         cmpq    %rcx, %rax
>         jne     .L3
> 
> -ffast-math vs. non-ffast-math we're using a SLP reduction vs. 4 reduction
> chains and this SLP reduction ends up looking like
> 
> t5.c:7:21: note:   Vectorizing SLP tree:
> t5.c:7:21: note:   node 0x4100c20 (max_nunits=4, refcnt=2)
> t5.c:7:21: note:        stmt 0 acc_imre_158 = acc_imre_3 + _34;
> t5.c:7:21: note:        stmt 1 acc_reim_156 = acc_reim_1 + _8;
> t5.c:7:21: note:        stmt 2 acc_imim_154 = _21 + acc_imim_35;
> t5.c:7:21: note:        stmt 3 acc_rere_146 = _11 + acc_rere_29;
> t5.c:7:21: note:        children 0x3f272e0 0x4100bb0
> t5.c:7:21: note:   node 0x3f272e0 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 acc_imre_3 = PHI <acc_imre_158(7), 0.0(8)>
> t5.c:7:21: note:        stmt 1 acc_reim_1 = PHI <acc_reim_156(7), 0.0(8)>
> t5.c:7:21: note:        stmt 2 acc_imim_35 = PHI <acc_imim_154(7), 0.0(8)>
> t5.c:7:21: note:        stmt 3 acc_rere_29 = PHI <acc_rere_146(7), 0.0(8)>
> t5.c:7:21: note:        children 0x4100c20
> t5.c:7:21: note:   node 0x4100bb0 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _34 = _36 + _157;
> t5.c:7:21: note:        stmt 1 _8 = _30 + _155;
> t5.c:7:21: note:        stmt 2 _21 = _15 + _153;
> t5.c:7:21: note:        stmt 3 _11 = _6 + _145;
> t5.c:7:21: note:        children 0x4100920 0x4100b40
> t5.c:7:21: note:   node 0x4100920 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _36 = _37 + _73;
> t5.c:7:21: note:        stmt 1 _30 = _32 + _71;
> t5.c:7:21: note:        stmt 2 _15 = _10 + _69;
> t5.c:7:21: note:        stmt 3 _6 = _31 + _61;
> t5.c:7:21: note:        children 0x41004e0 0x41008b0
> t5.c:7:21: note:   node 0x41004e0 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _37 = _101 + _129;
> t5.c:7:21: note:        stmt 1 _32 = _99 + _127;
> t5.c:7:21: note:        stmt 2 _10 = _97 + _125;
> t5.c:7:21: note:        stmt 3 _31 = _89 + _117;
> t5.c:7:21: note:        children 0x3f2a550 0x3f28700
> t5.c:7:21: note:   node 0x3f2a550 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _101 = _88 * _94;
> t5.c:7:21: note:        stmt 1 _99 = _86 * _96;
> t5.c:7:21: note:        stmt 2 _97 = _94 * _96;
> t5.c:7:21: note:        stmt 3 _89 = _86 * _88;
> t5.c:7:21: note:        children 0x40b6990 0x3f29e00
> t5.c:7:21: note:   node 0x40b6990 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _88 = *_87;
> t5.c:7:21: note:        stmt 1 _96 = *_95;
> t5.c:7:21: note:        stmt 2 _96 = *_95;
> t5.c:7:21: note:        stmt 3 _88 = *_87;
> t5.c:7:21: note:        load permutation { 1 5 5 1 }
> t5.c:7:21: note:   node 0x3f29e00 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _94 = *_93;
> t5.c:7:21: note:        stmt 1 _86 = *_85;
> t5.c:7:21: note:        stmt 2 _94 = *_93;
> t5.c:7:21: note:        stmt 3 _86 = *_85;
> t5.c:7:21: note:        load permutation { 5 1 5 1 }
> t5.c:7:21: note:   node 0x3f28700 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _129 = _116 * _122;
> t5.c:7:21: note:        stmt 1 _127 = _114 * _124;
> t5.c:7:21: note:        stmt 2 _125 = _122 * _124;
> t5.c:7:21: note:        stmt 3 _117 = _114 * _116;
> t5.c:7:21: note:        children 0x3f287e0 0x3f28770
> t5.c:7:21: note:   node 0x3f287e0 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _116 = *_115;
> t5.c:7:21: note:        stmt 1 _124 = *_123;
> t5.c:7:21: note:        stmt 2 _124 = *_123;
> t5.c:7:21: note:        stmt 3 _116 = *_115;
> t5.c:7:21: note:        load permutation { 2 6 6 2 }
> t5.c:7:21: note:   node 0x3f28770 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _122 = *_121;
> t5.c:7:21: note:        stmt 1 _114 = *_113;
> t5.c:7:21: note:        stmt 2 _122 = *_121;
> t5.c:7:21: note:        stmt 3 _114 = *_113;
> t5.c:7:21: note:        load permutation { 6 2 6 2 }
> t5.c:7:21: note:   node 0x41008b0 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _73 = _60 * _66;
> t5.c:7:21: note:        stmt 1 _71 = _58 * _68;
> t5.c:7:21: note:        stmt 2 _69 = _66 * _68;
> t5.c:7:21: note:        stmt 3 _61 = _58 * _60;
> t5.c:7:21: note:        children 0x4100290 0x4100810
> t5.c:7:21: note:   node 0x4100290 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _60 = *_59;
> t5.c:7:21: note:        stmt 1 _68 = *_67;
> t5.c:7:21: note:        stmt 2 _68 = *_67;
> t5.c:7:21: note:        stmt 3 _60 = *_59;
> t5.c:7:21: note:        load permutation { 0 4 4 0 }
> t5.c:7:21: note:   node 0x4100810 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _66 = *_65;
> t5.c:7:21: note:        stmt 1 _58 = *_57;
> t5.c:7:21: note:        stmt 2 _66 = *_65;
> t5.c:7:21: note:        stmt 3 _58 = *_57;
> t5.c:7:21: note:        load permutation { 4 0 4 0 }
> t5.c:7:21: note:   node 0x4100b40 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _157 = _144 * _150;
> t5.c:7:21: note:        stmt 1 _155 = _142 * _152;
> t5.c:7:21: note:        stmt 2 _153 = _150 * _152;
> t5.c:7:21: note:        stmt 3 _145 = _142 * _144;
> t5.c:7:21: note:        children 0x4100990 0x4100a50
> t5.c:7:21: note:   node 0x4100990 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _144 = *_143;
> t5.c:7:21: note:        stmt 1 _152 = *_151;
> t5.c:7:21: note:        stmt 2 _152 = *_151;
> t5.c:7:21: note:        stmt 3 _144 = *_143;
> t5.c:7:21: note:        load permutation { 3 7 7 3 }
> t5.c:7:21: note:   node 0x4100a50 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _150 = *_149;
> t5.c:7:21: note:        stmt 1 _142 = *_141;
> t5.c:7:21: note:        stmt 2 _150 = *_149;
> t5.c:7:21: note:        stmt 3 _142 = *_141;
> t5.c:7:21: note:        load permutation { 7 3 7 3 }
> 
> which eventually shows some non-obvious permute optimization opportunities.
> I'm currently working on a permute optimization phase btw. but for start
> only handling cases that do not help here.
> 
> Btw, if I use -ffast-math but disable reassociation via -fno-tree-reassoc I
> get
> the reduction chain variant which optimizes to
> 
> .L3:
>         vmovupd 32(%rsi,%rax), %ymm6
>         vmovupd 32(%rdx,%rax), %ymm7
>         vmovupd (%rsi,%rax), %ymm5
>         vfmadd231pd     (%rdx,%rax), %ymm6, %ymm0
>         vfmadd231pd     (%rdx,%rax), %ymm5, %ymm3
>         vfmadd231pd     (%rsi,%rax), %ymm7, %ymm1
>         addq    $64, %rax
>         vfmadd231pd     %ymm6, %ymm7, %ymm2
>         cmpq    %rcx, %rax
>         jne     .L3
> 
> even with GCC 10 (-Ofast -march=core-avx2 -fno-tree-reassoc). 

It's very fragile. I made a tiny (and natural for my real app) change in the
source (see below) and the nice MSVC-like inner loop disappeared.

void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_rere = 0;
  double acc_imim = 0;
  double acc_reim = 0;
  double acc_imre = 0;
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_rere += a[c*8+k+0]*b[c*8+k+0];
      acc_imim += a[c*8+k+4]*b[c*8+k+4];
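      // the tiny change vs. the version in comment #1: '-=' here instead
      // of '+=' (res[4] below is adjusted to match)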
      acc_reim -= a[c*8+k+0]*b[c*8+k+4];
      acc_imre += a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = acc_rere+acc_imim;
  res[4] = acc_reim+acc_imre;
}

> Which means
> the following source change helps:
> 
> void __attribute__((optimize("no-tree-reassoc"))) cdot(double* res, const
> double* a, const double* b, int N)
> {
>   double acc_rere = 0;
>   double acc_imim = 0;
>   double acc_reim = 0;
>   double acc_imre = 0;
>   for (int c = 0; c < N; ++c) {
>     for (int k = 0; k < 4; ++k) {
>       acc_rere += a[c*8+k+0]*b[c*8+k+0];
>       acc_imim += a[c*8+k+4]*b[c*8+k+4];
>       acc_reim += a[c*8+k+0]*b[c*8+k+4];
>       acc_imre += a[c*8+k+4]*b[c*8+k+0];
>     }
>   }
>   res[0] = acc_rere+acc_imim;
>   res[4] = acc_imre-acc_reim;
> }
> 

IMHO, options like "no-tree-reassoc", especially within
__attribute__((optimize(""))), are for people like you.
People like me, i.e. not compiler guys, wouldn't consider them not only for
production, but even for a hobby. Unless the hobby is compiler research.

Also, several years ago I was told (not by you, as Richard Biener, but by "you"
as in "gcc maintainers", more specifically by Manuel López-Ibáñez; you, as
Richard Biener, also took part in that discussion but appeared to hold a
different opinion, see PR 70255) that __attribute__((optimize(""))) can't be
relied on in production. Back then we came to the conclusion that this
statement had to be in the official docs. And indeed, since GCC 7, section
6.33.1 contains the following sentences:
"The optimize attribute should be used for debugging purposes only. It is not
suitable in production code."


> the reduction epilogue ends up like
> 
>         vextractf128    $0x1, %ymm3, %xmm4
>         vaddpd  %xmm3, %xmm4, %xmm3
>         vunpckhpd       %xmm3, %xmm3, %xmm4
>         vaddpd  %xmm3, %xmm4, %xmm3
>         vextractf128    $0x1, %ymm2, %xmm4
>         vaddpd  %xmm2, %xmm4, %xmm4
>         vunpckhpd       %xmm4, %xmm4, %xmm2
>         vaddpd  %xmm4, %xmm2, %xmm2
>         vextractf128    $0x1, %ymm1, %xmm4
>         vaddpd  %xmm1, %xmm4, %xmm4
>         vaddsd  %xmm2, %xmm3, %xmm2
>         vunpckhpd       %xmm4, %xmm4, %xmm1
>         vaddpd  %xmm4, %xmm1, %xmm1
>         vextractf128    $0x1, %ymm0, %xmm4
>         vaddpd  %xmm0, %xmm4, %xmm4
>         vunpckhpd       %xmm4, %xmm4, %xmm0
>         vaddpd  %xmm4, %xmm0, %xmm0
>         vsubsd  %xmm1, %xmm0, %xmm0
>         vzeroupper
>         vmovsd  %xmm2, (%rdi)
>         vmovsd  %xmm0, 32(%rdi)
> 
> which is not optimal since we miss the opportunity to vectorize the
> adds of the accumulators
> 
>   res[0] = acc_rere+acc_imim;
>   res[4] = acc_imre-acc_reim;

The epilogue is a tricky matter.
There are many ways to do it with about the same speed, and which variant is
fastest can depend on fine microarchitectural details that may differ even
between quite similar CPUs, such as Skylake and Zen2, Skylake and Ice Lake, or
maybe even Skylake and Skylake-X (well, perhaps the last one is an
exaggeration). The optimal sequence also depends on the surrounding code, e.g.
if I change the source to:
  res[0] = acc_rere+acc_imim;
  res[1] = acc_reim-acc_imre; // results adjacent in memory
it could already be different.
In the latter case it likely would be
  vaddpd     ymm_rere, ymm_imim, ymm_re
  vsubpd     ymm_imre, ymm_reim, ymm_im
  vhaddpd    ymm_im, ymm_re, ymm_reim
  vperm2f128 $1, ymm_reim, ymm_reim, ymm_reimH
  vaddpd     xmm_reimH, xmm_reim, xmm_reim
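
For what it's worth, here is a minimal AVX intrinsics sketch of that epilogue
(my own naming; the reduce_epilogue helper and its __m256d accumulator
parameters are hypothetical, assuming four YMM accumulators and the
res[1] = acc_reim-acc_imre sign convention above):

#include <immintrin.h>

/* Hypothetical helper, not from the bug report: reduce four YMM
   accumulators to res[0] = Re, res[1] = Im, mirroring the sequence above. */
static void reduce_epilogue(double *res,
                            __m256d acc_rere, __m256d acc_imim,
                            __m256d acc_reim, __m256d acc_imre)
{
  __m256d re   = _mm256_add_pd(acc_rere, acc_imim);       /* vaddpd */
  __m256d im   = _mm256_sub_pd(acc_reim, acc_imre);       /* vsubpd */
  /* vhaddpd: {re0+re1, im0+im1, re2+re3, im2+im3} */
  __m256d reim = _mm256_hadd_pd(re, im);
  /* vperm2f128 $1: swap the two 128-bit lanes */
  __m256d high = _mm256_permute2f128_pd(reim, reim, 1);
  __m128d sum  = _mm_add_pd(_mm256_castpd256_pd128(reim),
                            _mm256_castpd256_pd128(high));
  _mm_storeu_pd(res, sum);                     /* res[0]=Re, res[1]=Im */
}

This should compile down to essentially the five-instruction sequence above
plus a store.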

Even icc can't do it perfectly right now.
It would be nice (for you personally, at least) if in this case gcc generated
better code than icc, but that is far [far [far...]] less important than
robust handling of the inner loop.

