Bug 82142 - struct zeroing should use wide stores instead of avoiding overwriting padding
Summary: struct zeroing should use wide stores instead of avoiding overwriting padding
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 8.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: 102672
 
Reported: 2017-09-08 07:20 UTC by Peter Cordes
Modified: 2024-01-27 00:03 UTC
CC: 4 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-09-12 00:00:00


Description Peter Cordes 2017-09-08 07:20:22 UTC
(not sure if tree-optimization is the right "component", please check.)

Zeroing a struct (by assignment from an all-zero struct) is inefficient for structs that have padding but are too small to inline rep stos or memset.  Store coalescing happens (yay, huge improvement over gcc6!), but only if it won't write any padding.  I assume assignment from non-zero struct constants is like this, too.

Note also that gcc zeros two registers for two 16-bit stores, instead of reusing a zeroed reg.

Also note that Haswell doesn't have LCP stalls for a `mov` with a 16-bit immediate, only for ALU instructions, so there's no need to avoid it.  But if you *are* going to zero a register, you should use it as the source for all the integer mov instructions to save code-size (and avoid filling up space in the uop cache with immediates).  Sandybridge-family no longer has register-read-port stalls (unlike P6 / Core2 / Nehalem), and even on those a recently-written register is fine for several cycles.

https://godbolt.org/g/AT7yfs

typedef struct {
    int a,b,c;
    char j,k, k1;
        // 1B of padding
    int l,m,n[8];
    char c1,c2;
        // 2B of padding
}foo;
int sf = sizeof(foo);

sf:
        .long   60   # bytes long

void assignzero(foo *p) {
    foo tmp = {};
    *p = tmp;
}

(GCC-Explorer-Build) 8.0.0 20170907  -xc -std=c11 -O3 -march=haswell


assignzero:
        pushq   %rbp
        xorl    %eax, %eax            # zero a reg to avoid mov $imm16, (mem)
        vpxor   %xmm0, %xmm0, %xmm0
        xorl    %edx, %edx            # and another one, instead of reusing eax??
        movq    $0, (%rdi)
        movl    $0, 8(%rdi)
        movq    %rsp, %rbp      # make a stack frame because of the ymm?  At least it doesn't do something with %r10 like gcc7.2
        movw    %ax, 12(%rdi)
        movb    $0, 14(%rdi)
                           # avoiding a store to 1B of padding
        vmovdqu %ymm0, 16(%rdi)       # bunch of int members
        movq    $0, 48(%rdi)
        movw    %dx, 56(%rdi)
        vzeroupper
        popq    %rbp
        ret

So 1B of padding cost us 4 narrow stores instead of one 16B store at the front of the struct.  This also uses AVX for a single 256b store; very likely not worth it.


// what we would like to see, more or less
void charzero(char *p) {
    __builtin_memset(p, 0, sizeof(foo));
}
charzero:
        vpxor   %xmm0, %xmm0, %xmm0
        movq    $0, 48(%rdi)
        vmovups %xmm0, (%rdi)
        vmovups %xmm0, 16(%rdi)
        vmovups %xmm0, 32(%rdi)
        movl    $0, 56(%rdi)
        ret

This chooses not to use 256b AVX, so there's no vzeroupper and no slow mode while warming up the upper half of the execution units / data bus (Agner's description sounds like it applies to stores: http://agner.org/optimize/blog/read.php?i=415).  It also avoids triggering a lower turbo threshold.  All of this is probably good for a tiny function.

We could save some code-size by replacing the integer mov $0 stores with vmovq %xmm0, 48(%rdi).  That's probably not good for Bulldozer-family (the vector unit is shared between two integer cores), so maybe only enable it when tuning for znver1 or a Sandybridge-family CPU.  (xor-zeroing takes zero cycles on Intel SnB-family, so xmm0 is ready as a source for the store in the same cycle the vpxor issues.  It's possible that an integer store can execute one cycle sooner than an 8-byte AVX store, so keeping an integer store makes sense if there weren't already any stores waiting to execute and store-port throughput was about to become a bottleneck in later code.  Maybe only do it for the last store, using a 4-byte vmovd %xmm0, 56(%rdi).)
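
For the 60-byte case above, the suggested sequence would be something like this (a hand-written sketch of the idea, not current gcc output; the label is made up):

charzero_xmm_tail:                    # sketch only
        vpxor   %xmm0, %xmm0, %xmm0
        vmovups %xmm0, (%rdi)
        vmovups %xmm0, 16(%rdi)
        vmovups %xmm0, 32(%rdi)
        vmovq   %xmm0, 48(%rdi)       # 8B store reusing the zeroed xmm, no immediate
        vmovd   %xmm0, 56(%rdi)       # 4B store for the tail
        ret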

Or better (on Haswell and especially Skylake): use overlapping vector stores like clang has been doing since 3.7, and like glibc memcpy does for small copies:

   # clang (trunk) -O3 -march=haswell
assignzero:                             # same code for clang's memset
        vxorps  %xmm0, %xmm0, %xmm0
        vmovups %ymm0, 28(%rdi)         # last 32 bytes
        vmovups %ymm0, (%rdi)           # first 32 bytes, overlapping by 4
        vzeroupper
        retq

Having to use vzeroupper here is questionable.

4 xmm stores (with the last one overlapping) might be a better choice here, but I haven't tried to measure the effect of scattered uses of AVX.  When tuning for Sandybridge, clang only uses 128b stores here.
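
The 4-store version would look something like this (again just a hand-written sketch, not compiler output: the 60-byte struct is covered by 16B stores at offsets 0, 16, 32 and 44, with the last one overlapping the previous store by 4 bytes):

assignzero_4x_xmm:                    # sketch only, no vzeroupper needed
        vpxor   %xmm0, %xmm0, %xmm0
        vmovups %xmm0, (%rdi)
        vmovups %xmm0, 16(%rdi)
        vmovups %xmm0, 32(%rdi)
        vmovups %xmm0, 44(%rdi)       # bytes 44..59, overlaps the previous store by 4
        ret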
Comment 1 Richard Biener 2017-09-12 09:20:04 UTC
I believe this is SRA's fault for exposing the padding in the first place.  Later
we do try to merge the stores "back", but obviously we can't do that.

Eventually we could make SRA emit CLOBBERs for the padding, but I'm not sure
whether that alone would help.

I think we may have duplicates for this SRA "issue".  W/o vectorization
we end up with

assignzero (struct foo * p)
{
  <bb 2> [100.00%] [count: INV]:
  MEM[(struct foo *)p_2(D)] = 0;
  MEM[(struct foo *)p_2(D) + 8B] = 0;
  MEM[(struct foo *)p_2(D) + 12B] = 0;
  MEM[(struct foo *)p_2(D) + 14B] = 0;
  MEM[(struct foo *)p_2(D) + 16B] = 0;
  MEM[(struct foo *)p_2(D) + 24B] = 0;
  MEM[(struct foo *)p_2(D) + 32B] = 0;
  MEM[(struct foo *)p_2(D) + 40B] = 0;
  MEM[(struct foo *)p_2(D) + 48B] = 0;
  MEM[(struct foo *)p_2(D) + 56B] = 0;
  return;

with vectorization:

assignzero (struct foo * p)
{
  <bb 2> [100.00%] [count: INV]:
  MEM[(struct foo *)p_2(D)] = 0;
  MEM[(struct foo *)p_2(D) + 8B] = 0;
  MEM[(struct foo *)p_2(D) + 12B] = 0;
  MEM[(struct foo *)p_2(D) + 14B] = 0;
  MEM[(struct foo *)p_2(D) + 16B] = { 0, 0, 0, 0, 0, 0, 0, 0 };
  MEM[(struct foo *)p_2(D) + 48B] = 0;
  MEM[(struct foo *)p_2(D) + 56B] = 0;

I think store-merging could make use of

  MEM[(...)p_2(D) + CST] = CLOBBER;

for the holes.

Note that, strictly speaking, since we optimize memset (.., 0, ..) to = {}, SRA
cannot simply replace = {} with setting only the non-padding members to zero.
Likewise we optimize memcpy (A, B, ...) to *A = *B, and the same arguments about
padding apply.  OK, so we don't actually fold that aggressively.