Bug 68928 - AVX loops on unaligned arrays could generate more efficient startup/cleanup code when peeling
Summary: AVX loops on unaligned arrays could generate more efficient startup/cleanup code when peeling
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 5.3.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization, ssemmx
Depends on:
Blocks: vectorizer
Reported: 2015-12-16 00:19 UTC by Peter Cordes
Modified: 2015-12-16 21:47 UTC
CC: 0 users

See Also:
Host:
Target: x86-64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Peter Cordes 2015-12-16 00:19:16 UTC
I have some suggestions for better code that gcc could use for the prologue/epilogue when vectorizing loops over unaligned buffers.  I haven't looked at gcc's code, just the output, so IDK how one might get gcc to implement these.

---------

Consider the following code:

#include <immintrin.h>

typedef float float_align32 __attribute__ ((aligned (32)));
void floatmul_aligned(float_align32 *a) {
  for (int i=0; i<1024 ; i++)
    a[i] *= 2;
}


void floatmul(float *a) {
  for (int i=0; i<1024 ; i++)
    a[i] *= 2;
}

g++ 5.3.0 -O3 -march=sandybridge emits what you'd expect for the aligned version: 

floatmul_aligned(float*):
        leaq    4096(%rdi), %rax
.L2:
        vmovaps (%rdi), %ymm0
        addq    $32, %rdi
        vaddps  %ymm0, %ymm0, %ymm0
        vmovaps %ymm0, -32(%rdi)
        cmpq    %rdi, %rax
        jne     .L2
        vzeroupper
        ret

*** off-topic ***

It unfortunately uses 5 fused-domain uops in the loop, meaning it can only issue one iteration per 2 clocks.  Other than unrolling, it would probably be more efficient to broadcast 2.0f into %ymm1 and use vmulps (%rdi), %ymm1, %ymm0, avoiding the separate load.
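In intrinsics terms, that would look something like the following (just a sketch of the idea with a made-up function name, not gcc's output; float_align32 is the 32-byte-aligned typedef from the example above):

    #include <immintrin.h>

    // Sketch only: keeping 2.0f broadcast in a register lets a compiler fold
    // the load into vmulps (%rdi), %ymm1, %ymm0 instead of a separate vmovaps.
    void floatmul_aligned_v2(float_align32 *a) {
      const __m256 two = _mm256_set1_ps(2.0f);
      for (int i = 0; i < 1024; i += 8) {
        __m256 v = _mm256_load_ps(a + i);               // may fold into vmulps
        _mm256_store_ps(a + i, _mm256_mul_ps(v, two));
      }
    }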

Doing the loop in reverse order, with an indexed addressing mode counting an index down to zero, would also keep the loop overhead down to one decrement-and-branch uop.  Compilers are allowed to re-order memory accesses, so I assume this would be allowed.  However, it wouldn't actually help on Sandybridge, since two-register addressing modes might not micro-fuse on SnB-family CPUs (see http://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes; Agner Fog says he tested and found that two-register addressing modes do micro-fuse, and he's probably right, but IDK what's wrong with my experiment using perf counters).  That would make the store 2 uops.


*** back on topic ***

Anyway, that wasn't even what I meant to report.  The unaligned case peels off the potentially-unaligned start/end iterations, and unrolls them into a giant amount of code.  This is unlikely to be optimal outside of microbenchmarks, since CPUs with a uop-cache suffer from excessive unrolling.

floatmul(float*):
        movq    %rdi, %rax
        andl    $31, %eax
        shrq    $2, %rax
        negq    %rax
        andl    $7, %eax
        je      .L12
        vmovss  (%rdi), %xmm0
        vaddss  %xmm0, %xmm0, %xmm0
        vmovss  %xmm0, (%rdi)
        cmpl    $1, %eax
        je      .L13
        vmovss  4(%rdi), %xmm0
        vaddss  %xmm0, %xmm0, %xmm0
        vmovss  %xmm0, 4(%rdi)
        cmpl    $2, %eax
        je      .L14
        vmovss  8(%rdi), %xmm0
        ...

        repeated up to  cmpl    $6, %eax
        ...
        some loop setup
.L9:
        vmovaps (%rcx,%rax), %ymm0
        addl    $1, %edx
        vaddps  %ymm0, %ymm0, %ymm0
        vmovaps %ymm0, (%rcx,%rax)
        addq    $32, %rax
        cmpl    %esi, %edx
        jb      .L9

        ...
        another fully-unrolled up-to-7 iteration cleanup loop

Notice that the vectorized part of the loop now has 6 uops.  (Or 7, if the store can't micro-fuse.)  So gcc is even farther from getting this loop to run at one cycle per iteration.  (Which should be possible on Haswell.  On SnB/IvB (and AMD Bulldozer-family), a 256b store takes two cycles anyway.)


Is there any experimental evidence that fully unrolling to make this much code is beneficial?

The most obvious way to improve on this would be to use a 128b xmm vector for the first 4 iterations of the prologue/epilogue loops.

Even simply not unrolling the up-to-7-iteration alignment loops might be a win.  Every unrolled iteration still has a compare-and-branch; by counting down to zero, a rolled loop would have the same per-iteration overhead.  All that changes is branch prediction (one taken branch and many not-taken, vs. a single loop branch taken n times).
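E.g. a rolled scalar peel (sketch only, hypothetical helper; n_peel is the 0-7 count that the and/neg/and sequence above computes):

    // Sketch: scalar prologue that is not unrolled.  Each iteration is one
    // load/add/store plus a single decrement-and-branch.
    static inline float *scalar_peel(float *p, unsigned n_peel) {
      for (; n_peel != 0; --n_peel, ++p)
        *p *= 2.0f;
      return p;   // now 32B-aligned (given the right n_peel), so the vector loop can use aligned ops
    }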

AVX introduces a completely different way to handle this, though: VMASKMOVPS is usable now, since it doesn't have the non-temporal hint that makes the SSE version of it nearly useless.  According to Agner Fog's insn tables, vmaskmovps %ymm, %ymm, m256 is only 4 uops, and has a throughput of one per 2 cycles (SnB/IvB/Haswell).  It's quite slow as a store on AMD Bulldozer-family CPUs, though, so this might only be appropriate with -mtune= set to something other than an AMD CPU.
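(For reference, the intrinsics for this are _mm256_maskload_ps / _mm256_maskstore_ps.  A minimal sketch of a masked peel, assuming we already have a suitable mask; generating one is the problem discussed next:)

    #include <immintrin.h>

    // Sketch only: do the peeled first vector with vmaskmovps instead of up
    // to 7 scalar iterations.  aligned_base is the pointer rounded down to a
    // 32B boundary; mask has the sign bit set in each 32-bit lane that is
    // actually inside the array.
    static inline void masked_peel(float *aligned_base, __m256i mask) {
      __m256 v = _mm256_maskload_ps(aligned_base, mask);   // masked load
      v = _mm256_add_ps(v, v);                             // a[i] *= 2
      _mm256_maskstore_ps(aligned_base, mask, v);          // masked store
    }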

The trouble is turning a misalignment count into a mask.  Most of the useful instructions (like PSRLDQ applied to a vector of all-ones) are only available with immediate counts.  Keeping an array of 7 256b masks seems like a big waste, and having a switch() to run the byte-shift instruction with one of 7 different immediate operands also sucks.  (And it doesn't work anyway, because the 256b version shifts within each 128b lane rather than across the whole register.)  PSLLQ can take a shift count in the low qword of an xmm register, but I'm not sure that helps.

My best idea for generating a mask for VMASKMOVPS requires AVX2: broadcast the misalignment count to all bytes of a ymm register (VPBROADCASTB).  Use VPCMPGTB with another 256b constant (LSB first): { 7,7,7,7, 6,6,6,6, 5,5,5,5, 4,4,4,4, 3,3,3,3, 2,2,2,2, 1,1,1,1, 0,0,0,0 }.  The least significant 4B will always be 0, since the count is in the 0-7 range, and 7>7 is false. (Or put the count into the 1-8 range so we do 32B of useful work in the aligned case, instead of zero, without branching.)
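In intrinsics, that idea would look roughly like this (a sketch of the idea, not proposed gcc output):

    #include <immintrin.h>

    // Sketch: AVX2 mask generation from a misalignment count (0..7).
    // The threshold constant is listed LSB-first, as in the text, so 32-bit
    // lane i ends up all-ones iff count > 7-i; the least significant lane is
    // therefore always zero for count in 0..7.
    static inline __m256i mask_from_count(int count) {
      const __m256i thresh = _mm256_setr_epi8(
          7,7,7,7, 6,6,6,6, 5,5,5,5, 4,4,4,4,
          3,3,3,3, 2,2,2,2, 1,1,1,1, 0,0,0,0);
      __m256i bcast = _mm256_set1_epi8((char)count);  // vpbroadcastb
      return _mm256_cmpgt_epi8(bcast, thresh);        // vpcmpgtb
    }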

The constant could shrink to 64b {7,6,5,4,3,2,1,0}, with an extra instruction:

    vmovd        %eax,  %xmm0      # %eax = misalignment count
    vpbroadcastb %xmm0, %xmm0      # broadcast the low byte to 128b
    vpcmpgtb    .constant, %xmm0, %xmm1   # make sure the 64b .constant is at least 128b from the end of a cache line
      # Our mask is in the low 8 bytes of %xmm1.  The upper 64b is garbage
    vpmovsxbd    %xmm1, %ymm1      # all-ones sign-extends to all-ones

Or if reading "garbage" is too hacky, expand to 256b sooner:

    vmovd        %eax,  %xmm0      # %eax = misalignment count
    vpbroadcastd %xmm0, %ymm0      # broadcast the low 32b
    vpmovzxbd    .constant, %ymm1  # 2 uops, unlike the other forms.
    vpcmpgtd     %ymm1, %ymm0, %ymm2


Haswell (numbers are latency/reciprocal throughput): first version: movd: 1 uop (p5), 1/1.  vpbroadcastb x,x: 1 uop (p5), 1/1.
vpcmpgtb x,mem: 1 uop (p15), 1/0.5.   vpmovsxbd y,x: 1 uop (p5), 3/1.
Total: 4 uops, 6c latency, 1 per 2c throughput (bottleneck on p5).

Second version: movd: 1 uop, 1/1.  vpbroadcastd y,x: 1 uop, 3/1.
vpcmpgtd y,y: 1 uop, 1/0.5.   vpmovzxbd ymm, m64: 2 uops (p5+p23), ?/1.
Total: 5 uops, 5c latency, 1 per 2c throughput (saturating p5).  The movzx load is off the critical path, but can't micro-fuse.
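For reference, the first variant rendered as intrinsics (untested sketch; a compiler will still materialize a full 16-byte constant for the xmm compare, so the 64b-constant saving is really an asm-level detail):

    #include <immintrin.h>

    // Sketch of the movd / vpbroadcastb / vpcmpgtb / vpmovsxbd sequence.
    // Only the low 8 bytes of the compare result matter; vpmovsxbd widens
    // each byte mask (0 or -1) to a full dword mask.
    static inline __m256i mask_from_count_small(int count) {
      const __m128i thresh = _mm_setr_epi8(7,6,5,4,3,2,1,0, 0,0,0,0,0,0,0,0);
      __m128i bcast = _mm_set1_epi8((char)count);     // movd + vpbroadcastb
      __m128i bytes = _mm_cmpgt_epi8(bcast, thresh);  // vpcmpgtb
      return _mm256_cvtepi8_epi32(bytes);             // vpmovsxbd xmm->ymm
    }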


Without AVX2: 

    #### untested / possibly buggy
    imul         $0x1010101, %edx, %eax  # broadcast the misalignment count to the 4 bytes of %eax
    vmovd        %eax,  %xmm0
    vpshufd      $0, %xmm0, %xmm0    # broadcast to all bytes of xmm0  
    
    vpcmpgtb    .constant2, %xmm0, %xmm1   # constant2 = .byte 0,0, 1,1, ...
    vpmovsxwd    %xmm1, %xmm2              # [32b:count>0, 32b:count>1, ...]
    vpunpckhwd   %xmm1, %xmm1, %xmm3       # [32b:count>4, 32b:count>5, ...]

    vinsertf128  $1, %xmm3, %ymm2,  %ymm2  # combine the masks: low lane = xmm2 (count>0..3), high lane = xmm3 (count>4..7)

    # do a masked load, too, to avoid possible NaN slowdowns if there's garbage before the array
    vmaskmovps   (%rdi), %ymm2, %ymm3       # rdi = p & ~0x1F: rounded down to previous 32B boundary
    vaddps       %ymm3, %ymm3, %ymm3
    vmaskmovps   %ymm3, %ymm2, (%rdi)

    mainloop:
      ...
    # epilogue: invert the mask


Or skip the vinsertf128 and do two separate vaddps xmm / vmaskmovps xmm operations, but that's probably worse.  The addresses used with vmaskmovps will be aligned: it's the masking that takes care of the unaligned start/end.

Also, we could broadcast the count just to words and use vpcmpgtw.  We're using words rather than bytes because there's no single instruction to pmovsxbd from the 2nd 32b chunk of a source register (to unpack the high half of the mask); punpckh same,same can read from the high half of a register and double the element size, though.  An extra pmovsxbw or punpcklbw would let us use a 64b constant.
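Putting that together as intrinsics (as untested as the asm above; note this mask enables the *first* count lanes, matching the .constant2 layout, unlike the AVX2 constant earlier which enables the last count lanes):

    #include <immintrin.h>

    // Sketch, AVX1 only (the 128b integer ops are SSE2/SSE4.1): build the
    // 8-dword mask without AVX2.  Lane i of the result is (count > i).
    static inline __m256i mask_from_count_avx1(int count) {
      const __m128i thresh = _mm_setr_epi8(0,0, 1,1, 2,2, 3,3, 4,4, 5,5, 6,6, 7,7);
      __m128i bcast = _mm_set1_epi8((char)count);     // imul/movd/pshufd in the asm
      __m128i words = _mm_cmpgt_epi8(bcast, thresh);  // byte pairs act as word masks
      __m128i lo = _mm_cvtepi16_epi32(words);         // pmovsxwd: count>0 .. count>3
      __m128i hi = _mm_unpackhi_epi16(words, words);  // punpckhwd: count>4 .. count>7
      return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);  // vinsertf128
    }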

Anyway, IDK if this idea is generally useful for gcc to handle arrays that aren't guaranteed to be aligned.

Probably things should be arranged so that in the aligned case, either the mask generation and vmaskmovps are skipped altogether, or the vmaskmovps does a full 256b load/store rather than a fully-masked 0-byte load/store.  Besides the obvious reason of avoiding wasted work, AMD Jaguar's VMASKMOVPS takes ~300 clocks for a load with mask=0, vs. 2 clocks (15 cycle latency) in the normal case.  A VMASKMOVPS 256b store on Jaguar has one per 22c throughput and takes 36 m-ops, so it's not worth using at all when targeting Jaguar; but avoiding Jaguar's catastrophic all-zero-mask case is probably worthwhile even when tuning for something else.
Comment 1 Richard Biener 2015-12-16 08:54:26 UTC
I'd say the easiest way is to avoid peeling for alignment on x86_64 and just use unaligned ops... (we're still left with an epilogue for any remaining scalar iterations)
Comment 2 Peter Cordes 2015-12-16 21:46:16 UTC
Richard wrote: 
> [...] avoid peeling for alignment on x86_64 and just use unaligned ops

Yeah, that's what clang does, and may be optimal.  Certainly it's easy, and gives optimal performance when buffers *are* in fact aligned, even when the programmer has neglected to inform the compiler of any guarantee.

However, with vector sizes getting closer to the cache-line size, unaligned accesses will cross cache lines more of the time (e.g. a 256b AVX loop over an unaligned buffer will have a cache-line split on every other iteration).  If we can *cheaply* avoid this, it may be worth it.

IIRC, all modern x86 / x86-64 CPUs have no penalty for unaligned loads, as long as they don't actually cross a cache-line boundary.  (True for Intel since Nehalem).  Store-forwarding doesn't work well if the stores don't line up with the loads, though.
Comment 3 Peter Cordes 2015-12-16 21:47:40 UTC
I posted this as a question on stackoverflow, and got some useful comments (and had some ideas while writing up a mask-gen answer).

http://stackoverflow.com/questions/34306933/vectorizing-with-unaligned-buffers-using-vmaskmovps-generating-a-mask-from-a-m

Stephen Canon points out that VMASKMOVPS isn't actually useful: you can instead use unaligned loads/stores for the peeled first/last iteration, and do overlapping work.  You just have to make sure you load any data you need before clobbering it.  I posted an answer using that idea, but I'm not sure if it's the sort of thing a compiler could decide to use.
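A sketch of that overlapping idea applied to the example above (my code, assuming n >= 8; not claiming a compiler would emit exactly this):

    #include <immintrin.h>
    #include <stdint.h>
    #include <stddef.h>

    // Sketch of the overlap trick: unaligned vectors for the head and tail,
    // aligned vectors for the interior.  The head/tail are loaded before the
    // interior loop stores to the overlapping elements, and stored after it,
    // so every element gets scaled exactly once.  Assumes n >= 8.
    void floatmul_overlap(float *a, size_t n) {
      __m256 first = _mm256_loadu_ps(a);            // load before any stores
      __m256 last  = _mm256_loadu_ps(a + n - 8);

      float *lo = (float *)(((uintptr_t)a + 31) & ~(uintptr_t)31);   // round up to 32B
      float *hi = (float *)((uintptr_t)(a + n)   & ~(uintptr_t)31);  // round down to 32B
      for (float *p = lo; p < hi; p += 8) {
        __m256 v = _mm256_load_ps(p);
        _mm256_store_ps(p, _mm256_add_ps(v, v));
      }

      // Overwriting the overlap with once-scaled copies of the *original*
      // data is still correct, because first/last were loaded up front.
      _mm256_storeu_ps(a,         _mm256_add_ps(first, first));
      _mm256_storeu_ps(a + n - 8, _mm256_add_ps(last,  last));
    }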


For reduction loops where we need to accumulate each element exactly once, a mask is still useful, but we can use it for ANDPS / ANDNPS instead of VMASKMOV.
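E.g. (sketch only) the peeled first vector of a sum reduction could be:

    #include <immintrin.h>

    // Sketch: peeled first vector of a sum reduction.  The load is from the
    // 32B-aligned base (it can't cross into another page), and ANDing with
    // the mask turns the out-of-range lanes into +0.0f so they don't affect
    // the sum (vandps instead of vmaskmovps).
    static inline __m256 masked_head_terms(const float *aligned_base, __m256i mask) {
      __m256 v = _mm256_load_ps(aligned_base);
      return _mm256_and_ps(v, _mm256_castsi256_ps(mask));
    }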

I improved the mask-generation to a single AVX2 VPMOVSXBD load (with 5 or 7 single-uop integer instructions to generate the index from the start/end address).  VPCMPGT isn't needed: instead just use an index to take the right window of bytes from memory.  This emulates a variable-count VPSLLDQ on a buffer of all-ones.
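Roughly (my sketch of that idea; the table layout and indexing here are illustrative, the real answer derives the index from the start/end address):

    #include <immintrin.h>
    #include <stdint.h>

    // Sketch: emulate a variable-count shift of an all-ones vector with an
    // unaligned 8-byte load from a sliding window, then widen each byte to a
    // dword mask with vpmovsxbd.  k = number of leading lanes to mask off.
    static const int8_t mask_window[16] = {
         0,  0,  0,  0,  0,  0,  0,  0,
        -1, -1, -1, -1, -1, -1, -1, -1
    };

    static inline __m256i head_mask(int k /* 0..8 */) {
      __m128i bytes = _mm_loadl_epi64((const __m128i *)(mask_window + 8 - k));
      return _mm256_cvtepi8_epi32(bytes);   // vpmovsxbd: -1/0 bytes -> dword masks
    }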

This is something gcc could maybe use, but some experimental testing to compare it against just using unaligned accesses is probably warranted before spending any time implementing automatic generation of something this complicated.