[Bug target/80571] New: AVX allows multiple vcvtsi2ss/sd (integer -> float/double) to reuse a single dep-breaking vxorps, even hoisting it out of loops


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80571

            Bug ID: 80571
           Summary: AVX allows multiple vcvtsi2ss/sd (integer ->
                    float/double) to reuse a single dep-breaking vxorps,
                    even hoisting it out of loops
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

See also more discussion on a clang bug about what's optimal for scalar int->fp
conversion (https://bugs.llvm.org/show_bug.cgi?id=22024#c11).

The important point is that using vcvtsi2sd/ss with a source vector reg different
from the destination reg allows reusing the same "safe" source to avoid false
dependencies, so a single vxorps serves multiple conversions (and can even be
hoisted out of loops).

// see https://godbolt.org/g/c9k4SH for gcc8, clang4, icc17, and MSVC CL
void cvti64f64_loop(double *dp, long long *ip) {
    for (int i=0; i<1000 ; i++) {
        dp[i] = ip[i];
        // int64 can't vectorize without AVX512
    }
}


compiles with gcc6.3.1 and gcc8 20170429 -O3 -march=haswell:

cvti64f64_loop:
        xorl    %eax, %eax
.L6:
        vxorpd  %xmm0, %xmm0, %xmm0
        vcvtsi2sdq      (%rsi,%rax), %xmm0, %xmm0
        vmovsd  %xmm0, (%rdi,%rax)
        addq    $8, %rax
        cmpq    $8000, %rax
        jne     .L6
        ret

But it could compile to

        xorl    %eax, %eax
        vxorpd  %xmm1, %xmm1, %xmm1    #### Hoisted out of the loop
.L6:
        vcvtsi2sdq      (%rsi,%rax), %xmm1, %xmm0
        vmovsd  %xmm0, (%rdi,%rax)
        addq    $8, %rax
        cmpq    $8000, %rax
        jne     .L6
        ret
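
As a rough sketch (not from the original report) of what that desired codegen
looks like at the source level: the merge-form intrinsic _mm_cvtsi64_sd takes
the "safe" source as a separate operand, so the dep-breaking zero is a plain
loop-invariant value.  Whether the compiler actually keeps the vxorpd hoisted
is exactly what this bug is about.

#include <emmintrin.h>   // SSE2: _mm_setzero_pd, _mm_cvtsi64_sd, _mm_store_sd

// Hand-hoisted version: one dep-breaking zeroed register outside the loop,
// reused as the merge source for every conversion.
void cvti64f64_loop_hoisted(double *dp, const long long *ip) {
    __m128d zero = _mm_setzero_pd();   // the single vxorpd
    for (int i = 0; i < 1000; i++) {
        // vcvtsi2sdq: convert ip[i] into the low element, take the upper
        // element from 'zero', and write a separate destination register.
        _mm_store_sd(&dp[i], _mm_cvtsi64_sd(zero, ip[i]));
    }
}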

----

This trick requires AVX, of course.

If the loop needs a vector constant, you can use that instead of a
specially-zeroed register.  (The upper elements don't have to be 0.0 for ...ss
instructions, and the x86-64 SysV ABI allows passing scalar float/double args
with garbage (not zeros) in the high bytes.  This already happens in some code,
so library implementers already have to avoid assuming that the high elements
are always clear, even in hand-written asm.)
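
A minimal sketch of that idea (hypothetical code, not from the report),
assuming the surrounding computation already wants a constant in a register;
it also shows using the constant with the conversion result, as discussed
further below:

#include <emmintrin.h>   // SSE/SSE2 intrinsics

// 'scale' is a constant the code needs anyway; reusing it as the
// vcvtsi2ss merge source avoids a separate vxorps entirely.
float scale_int(int x) {
    __m128 scale = _mm_set1_ps(0.5f);            // constant in a register
    __m128 f = _mm_cvtsi32_ss(scale, x);         // low elem = (float)x; the upper
                                                 // elems keep 0.5f, which is fine
    return _mm_cvtss_f32(_mm_mul_ss(f, scale));  // constant used with the result
}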

clang already hoists the vxorps out of loops, but only by looking for "dead"
registers, not by reusing a reg holding a constant.  (e.g. enabling the
multiplies in the godbolt link, so they use up all the xmm regs, gets clang to
put a vxorps inside the loop instead of merging into a reg holding one of the
constants.)


Using a constant has the downside that loading it might cache-miss, delaying a
bunch of loop-setup work (and loop setup is where int->float conversion is
probably most common).  So maybe it would be best to vxorps-zero a register and
use that for int->FP conversions if several other instructions depend on the
result before anything depends on the constant.  The zeroed register can later
have a constant (or whatever) loaded into it, and that same register can then
serve as the safe source inside the loop.

Or if there's a constant that's about to be used with the result of the int->fp
conversion, it's maybe not bad to load that constant first and use that xmm reg
as the merge target for the scalar conversion.  If OOO execution can do the
load early (and it doesn't miss in cache), there's no extra latency.  And if it
does miss, that dependency chain only pays an extra cycle or two of latency
vs. spending an instruction on vxorpd-zeroing a separate target to convert
into, independent of loading the constant.



----

This trick works for non-loops, of course:

void cvti32f32(float *A, float *B, int x, int y) {
    *B = y;
    *A = x;
}

currently compiles to (-O3 -march=haswell)

cvti32f32:
        vxorps  %xmm0, %xmm0, %xmm0
        vcvtsi2ss       %ecx, %xmm0, %xmm0
        vmovss  %xmm0, (%rsi)
        vxorps  %xmm0, %xmm0, %xmm0
        vcvtsi2ss       %edx, %xmm0, %xmm0
        vmovss  %xmm0, (%rdi)
        ret

But could compile (with no loss of safety against false deps) to

cvti32f32:
        vxorps  %xmm1, %xmm1, %xmm1     # reused for both
        vcvtsi2ss      %ecx, %xmm1, %xmm0
        vmovss  %xmm0, (%rsi)
        vcvtsi2ss      %edx, %xmm1, %xmm0   # this convert can go into xmm1
                                            # if we want both results in regs at once
        vmovss  %xmm0, (%rdi)
        ret


Or, with the same amount of safety and fewer fused-domain uops on Intel
SnB-family CPUs, and not requiring AVX (but of course works fine with AVX):

cvti32f32:
        movd  %ecx, %xmm0         # breaks deps on xmm0
        cvtdq2ps  %xmm0, %xmm0    # 1 uop for port1
        movss     %xmm0, (%rsi)
        movd  %edx, %xmm0
        cvtdq2ps  %xmm0, %xmm0
        movss     %xmm0, (%rdi)
        ret

But this is only good for 32-bit integer -> float, not double.  (cvtdq2pd also
takes a port5 uop, unlike conversion to ps.)  The code-size is also slightly
larger: 2 instructions per conversion instead of 1 for AVX when a safe register
to merge into is already available.
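
For reference, the non-AVX sequence maps to intrinsics like this (a sketch,
assuming SSE2; the helper name is mine, and the compiler may of course pick
different registers):

#include <emmintrin.h>   // SSE2

// movd writes the whole xmm register, so there's no false dependency,
// and the packed convert (cvtdq2ps) is a single port-1 uop on SnB-family.
static inline float int_to_float_nofalsedep(int x) {
    __m128i v = _mm_cvtsi32_si128(x);            // movd: zero-extends into xmm
    return _mm_cvtss_f32(_mm_cvtepi32_ps(v));    // cvtdq2ps, then take low elem
}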

See full analysis of this (uop counts and stuff for Intel SnB-family) at the
bottom of this comment: https://bugs.llvm.org/show_bug.cgi?id=22024#c11
