Bug 102040 - vectorizer costing is off because of if-conversion
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 12.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2021-08-24 13:03 UTC by Richard Biener
Modified: 2021-08-24 13:04 UTC
CC: 0 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Richard Biener 2021-08-24 13:03:59 UTC
void foo (double * __restrict x, double *y, double *z, int n)
{
  for (int i = 0; i < n; ++i)
    {
      double tem;
      if (i & 8)
        tem = y[i] * 3.;
      else
        tem = z[i] + 5.;
      x[i] = tem;
    }
}

is vectorized using masked loads with -Ofast -mavx2 on x86_64, and the
cost model says

t3.c:3:21: note:  Cost model analysis:
  Vector inside of loop cost: 252
  Vector prologue cost: 56
  Vector epilogue cost: 352
  Scalar iteration cost: 84
  Scalar outside cost: 8
  Vector outside cost: 408
  prologue iterations: 0
  epilogue iterations: 4
  Calculated minimum iters for profitability: 6

In particular, the scalar iteration cost is based on unconditionally
executing all of the if-converted statements:

0x3907200 i_26 & 8 1 times scalar_stmt costs 4 in prologue
0x3907200 _1 != 0 1 times scalar_stmt costs 4 in prologue
0x3907200 .MASK_LOAD (_4, 64B, _39) 1 times scalar_load costs 12 in prologue
0x3907200 _5 * 3.0e+0 1 times scalar_stmt costs 20 in prologue
0x3907200 _1 == 0 1 times scalar_stmt costs 4 in prologue
0x3907200 .MASK_LOAD (_8, 64B, _41) 1 times scalar_load costs 12 in prologue
0x3907200 _9 + 5.0e+0 1 times scalar_stmt costs 12 in prologue
0x3907200 _1 == 0 ? tem_19 : tem_21 1 times scalar_stmt costs 4 in prologue
0x3907200 tem_14 1 times scalar_store costs 12 in prologue
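
For illustration, the if-converted body that this costing is based on
corresponds roughly to the following C (a sketch; foo_ifcvt is a
made-up name, the actual if-converted GIMPLE uses the .MASK_LOADs
shown above):

/* Sketch of the if-converted loop body the scalar cost models: both
   arms are executed unconditionally and merged with a select.
   foo_ifcvt is a hypothetical name for illustration only.  */
void foo_ifcvt (double * __restrict x, double *y, double *z, int n)
{
  for (int i = 0; i < n; ++i)
    {
      double tem1 = y[i] * 3.;       /* costed: load plus multiply */
      double tem2 = z[i] + 5.;       /* costed: load plus add */
      x[i] = (i & 8) ? tem1 : tem2;  /* costed: compare, select, store */
    }
}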

Each scalar iteration will only execute one arm of the condition, and
given the (i & 8) condition, contiguous memory blocks of size
8 * sizeof (double) are loaded alternately from y or z.  It's not
clear how the vector code behaves when using masked loads, but with
accesses that cannot trap we would not use masked loads anyway.

With -mprefer-vector-width=128, and thus V2DFmode vectors, the
conditional code is hardly going to be profitable given that for each
useful lane we're computing one not useful one (consider the condition
being & 1 here for a similarly extreme example; see the sketch below).
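
For reference, the "& 1" variant alluded to above would be (a sketch;
foo1 is a made-up name):

/* With (i & 1) the arms alternate every iteration, so with V2DF
   vectors every vector select throws away half of the lanes computed
   by the two arms.  foo1 is a hypothetical name for illustration.  */
void foo1 (double * __restrict x, double *y, double *z, int n)
{
  for (int i = 0; i < n; ++i)
    x[i] = (i & 1) ? y[i] * 3. : z[i] + 5.;
}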

Besides these statically visible partitions, which eventually could
see an optimal solution (for example by unrolling to their period, as
sketched below), in the dynamic case the scalar costing would need to
consider branch prediction and branch mispredict costs.  An available
profile could provide a hint as to whether we're dealing with a known
well-predicted branch or whether we don't know.
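
To make "unroll to their period" concrete, here is a sketch of the
first loop blocked to the period of its (i & 8) condition (foo_blocked
is a made-up name; remainder handling is kept simple):

/* Blocking to the period (16) of the (i & 8) condition turns each arm
   into an unconditional contiguous loop; this is close to what the
   scalar execution effectively does and what its cost should reflect.
   foo_blocked is a hypothetical name for illustration only.  */
void foo_blocked (double * __restrict x, double *y, double *z, int n)
{
  int i = 0;
  for (; i + 16 <= n; i += 16)
    {
      for (int j = 0; j < 8; ++j)    /* (i + j) & 8 == 0 */
        x[i + j] = z[i + j] + 5.;
      for (int j = 8; j < 16; ++j)   /* (i + j) & 8 != 0 */
        x[i + j] = y[i + j] * 3.;
    }
  for (; i < n; ++i)                 /* remainder iterations */
    x[i] = (i & 8) ? y[i] * 3. : z[i] + 5.;
}

The dynamic-prediction aspect shows up in the second example, where
the condition is data dependent: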

double x[1024], y[1024], z[1024], a[1024];
void foo ()
{
  for (int i = 0; i < 1024; ++i)
    {
      double tem;
      if (a[i] > 0.)
        tem = y[i];
      else
        tem = z[i];
      x[i] = tem;
    }
}

is vectorized with SSE to

.L2:
        movapd  %xmm2, %xmm0
        movapd  y(%rax), %xmm1
        addq    $16, %rax
        cmpltpd a-16(%rax), %xmm0
        andpd   %xmm0, %xmm1
        andnpd  z-16(%rax), %xmm0
        orpd    %xmm1, %xmm0
        movaps  %xmm0, x-16(%rax)
        cmpq    $8192, %rax
        jne     .L2
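
The andpd/andnpd/orpd sequence is a bitwise select; in SSE2 intrinsics
the per-vector work corresponds roughly to (a sketch; select2 is a
made-up name):

#include <emmintrin.h>

/* Sketch of the select done per pair of lanes: the compare produces
   an all-ones/all-zeros mask per lane which then blends y and z
   bitwise.  select2 is a hypothetical name for illustration only.  */
static __m128d select2 (__m128d a, __m128d y, __m128d z)
{
  __m128d mask = _mm_cmpgt_pd (a, _mm_setzero_pd ());  /* cmpltpd with
                                                          operands swapped */
  return _mm_or_pd (_mm_and_pd (mask, y),     /* andpd  */
                    _mm_andnot_pd (mask, z)); /* andnpd */
}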

we manage to obfuscate the non-vectorized variant to

        pxor    %xmm1, %xmm1
        jmp     .L5
        .p2align 4,,10
        .p2align 3
.L11:
        movsd   y(%rax), %xmm0
.L4:
        movsd   %xmm0, x(%rax)
        addq    $8, %rax
        cmpq    $8192, %rax
        je      .L10
.L5:
        movsd   a(%rax), %xmm0
        comisd  %xmm1, %xmm0
        ja      .L11
        movsd   z(%rax), %xmm0
        jmp     .L4
.L10:
        ret

so I'm not actually sure the scalar variant will run faster even when
we make the a[] compares all go the same way (the vector variant runs
almost 3 times faster).

clang ends up using integer moves and cmov here (and it unrolls), but
it does not vectorize.  GCC's vector variant is nearly two times faster.

So intuition when comparing branchy scalar and straight-line vector code
can prove wrong.