69873 – Vectorizer fails to emit runtime profitability check if no peeling/versioning is done

Bug 69873 - Vectorizer fails to emit runtime profitability check if no peeling/versioning is done

Summary: Vectorizer fails to emit runtime profitability check if no peeling/versioning...

Status:	ASSIGNED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	tree-optimization (show other bugs)
Version:	6.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	Richard Biener

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2016-02-19 12:19 UTC by Richard Biener
Modified:	2022-03-29 13:10 UTC (History)
CC List:	0 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed:	2016-02-19 00:00:00

Attachments
patch (1.15 KB, patch) 2016-02-19 12:28 UTC, Richard Biener	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Richard Biener 2016-02-19 12:19:44 UTC

double a[1024];
void foo (unsigned n)
{
  for (unsigned i = 0; i < n * 2; ++i)
    a[i] = 1.;
}

currently says the threshold is > 3 iterations but no such guard is added
(whether that's a reasonable threshold is another question of course).

Comment 1 Richard Biener 2016-02-19 12:28:39 UTC

Created attachment 37740 [details]
patch

Patch fixing this.  Also fixing some incosistencies in cost estimation which unfortunately "improves" the threshold to > 6 iterations.

Issues with the cost model include the odd separation between taken vs. not-taken
branch cost (as opposed to fallthru, not fallthru or well predicted vs. not predicted).  This artificially raises the cost of the vectorized path (on x86_64
generic model, vectorized path has a "taken" branch of cost 3 while unvectorized
has an "not taken" branch of cost 1).

Another issue is the prologue cost for emitting the vector double constant
{ 1., 1. }.  The scalar version also has a constant pool entry but we assume
none exist for the scalar path - reasonable only if insns with immediate forms
are available which IMHO is not reasonable to assume for FP modes.  This skews
the cost by one.

If you disable cunroll then generated code is

        cmpl    $6, %edi
        jbe     .L3
        subl    $2, %edi
        movapd  .LC0(%rip), %xmm0
        shrl    %edi
        xorl    %eax, %eax
        leal    1(%rdi), %ecx
.L4:
        movq    %rax, %rdx
        addq    $1, %rax
        salq    $4, %rdx
        cmpl    %eax, %ecx
        movaps  %xmm0, a(%rdx)
        ja      .L4

        rep ret
.L3:
        movsd   .LC1(%rip), %xmm0
        xorl    %eax, %eax
.L6:
        movsd   %xmm0, a(,%rax,8)
        addq    $1, %rax
        cmpl    %eax, %edi
        ja      .L6
.L1:
        rep ret

so the stupid IV choice makes the runtime check reasonable.

Comment 2 Richard Biener 2016-02-19 12:36:32 UTC

ICC decides to have a cut-off at 8 iterations, having an unrolled vectorized
iteration

..B1.4:                         # Preds ..B1.4 ..B1.3
        addl      $8, %eax                                      #4.3
        movaps    %xmm0, a(,%rdx,8)                             #5.5
        cmpl      %ecx, %eax                                    #4.3
        movaps    %xmm0, 16+a(,%rdx,8)                          #5.5
        movaps    %xmm0, 32+a(,%rdx,8)                          #5.5
        movaps    %xmm0, 48+a(,%rdx,8)                          #5.5
        movl      %eax, %edx                                    #4.3
        jb        ..B1.4 

with a shared scalar epilogue (shared with the non-vectorized path)

..B1.8:                         # Preds ..B1.8 ..B1.7
        incl      %edx                                          #4.3
        movq      %rax, a(,%rcx,8)                              #5.5
        incl      %ecx                                          #4.3
        cmpl      %edi, %edx                                    #4.3
        jb        ..B1.8  

where at least that scalar loop with two IVs looks sub-optimal.

This also shows that unrolling from within the vectorizer might be more
profitable than doing it afterwards (generating an additional vectorized
epilogue for the remaining vectorized iterations).  Code-size wise, of course.

Comment 3 Richard Biener 2022-03-29 13:10:51 UTC

It's difficult to find a testcase that we'd vectorize with a runtime profitability check but whithout any versioning or peeling because we should always be able to statically decide profitability there - more vector iterations
shouldn't be able to offset any outside cost on the vector path because there is none (well, an "excessive" number of emitted splats maybe - but that starts to be quite a bit tricky).