(Not sure if tree-optimization is the right "component", please check.)

Zeroing a struct (by assignment from an all-zero struct) is inefficient for structs too small to inline rep stos or memset, if they have padding. Store coalescing happens (yay, huge improvement over gcc6!), but only if it won't write any padding. I assume assignment from non-zero struct constants behaves the same way.

Note also that gcc zeroes two registers for two 16-bit stores, instead of reusing a zeroed reg. Also note that Haswell doesn't have LCP stalls for a `mov` with a 16-bit immediate, only for ALU, so there's no need to avoid it. But if you *are* going to zero a register, you should use it as the source for all the integer mov instructions to save code size (and avoid filling up space in the uop cache with immediates). Sandybridge-family no longer has the register-read-port stalls of P6 / Core2 / Nehalem, and even there a recently-written register is fine for several cycles.

https://godbolt.org/g/AT7yfs

typedef struct {
    int a,b,c;
    char j,k, k1;   // 1B of padding
    int l,m,n[8];
    char c1,c2;     // 2B of padding
} foo;

int sf = sizeof(foo);
sf:
        .long   60          # bytes long

void assignzero(foo *p) {
    foo tmp = {};
    *p = tmp;
}

(GCC-Explorer-Build) 8.0.0 20170907  -xc -std=c11 -O3 -march=haswell

assignzero:
        pushq   %rbp
        xorl    %eax, %eax          # zero a reg to avoid mov $imm16, (mem)
        vpxor   %xmm0, %xmm0, %xmm0
        xorl    %edx, %edx          # and another one, instead of reusing eax??
        movq    $0, (%rdi)
        movl    $0, 8(%rdi)
        movq    %rsp, %rbp          # make a stack frame because of the ymm?  At least it doesn't do something with %r10 like gcc7.2
        movw    %ax, 12(%rdi)
        movb    $0, 14(%rdi)        # avoiding a store to 1B of padding
        vmovdqu %ymm0, 16(%rdi)     # bunch of int members
        movq    $0, 48(%rdi)
        movw    %dx, 56(%rdi)
        vzeroupper
        popq    %rbp
        ret

So 1B of padding cost us four narrow stores instead of one 16B store at the front of the struct. This also uses AVX for a single 256b store, which is very likely not worth it.
// what we would like to see, more or less
void charzero(char *p) {
    __builtin_memset(p, 0, sizeof(foo));
}

charzero:
        vpxor   %xmm0, %xmm0, %xmm0
        movq    $0, 48(%rdi)
        vmovups %xmm0, (%rdi)
        vmovups %xmm0, 16(%rdi)
        vmovups %xmm0, 32(%rdi)
        movl    $0, 56(%rdi)
        ret

This chooses not to use 256b AVX, so there's no vzeroupper, and no slow mode while warming up the upper half of the execution units / data bus. (Agner's description sounds like it applies to stores: http://agner.org/optimize/blog/read.php?i=415). It also doesn't trigger a lower turbo threshold. All of this is probably good for a tiny function.

We could save some code size by replacing the integer mov $0 stores with vmovq %xmm0, 48(%rdi). That's probably not good for Bulldozer-family (shared vector unit between two integer cores), so maybe only enable that with -march=znver1 or a Sandybridge-family CPU. (xor-zeroing takes zero cycles on Intel SnB-family, so xmm0 is ready as a source for the store in the same cycle that vpxor issues. I guess it's possible that an integer store can execute one cycle sooner than an AVX 8-byte store, so if there weren't already any stores waiting to execute, and store-port throughput was about to become a bottleneck in later code, it makes sense. Maybe only do it for the last store, using a 4-byte vmovd %xmm0, 56(%rdi).)

Or better (on Haswell and especially Skylake): use overlapping vector stores like clang has been doing since 3.7, and like glibc memcpy does for small copies:

# clang (trunk) -O3 -march=haswell
assignzero:                     # same code for clang's memset
        vxorps  %xmm0, %xmm0, %xmm0
        vmovups %ymm0, 28(%rdi)     # last 32 bytes
        vmovups %ymm0, (%rdi)       # first 32 bytes, overlapping by 4
        vzeroupper
        retq

Having to use vzeroupper here is questionable. Four xmm stores (with the last one overlapping) might be a better choice here, but I haven't tried to measure the effect of scattered uses of AVX. When tuning for Sandybridge, clang only uses 128b stores here.
I believe this is SRA's fault for exposing the padding in the first place. Later we do try to merge the stores "back", but obviously we can't do that. Eventually we could make SRA emit CLOBBERs for the padding, but I'm not sure that alone would help. I think we may have duplicates for this SRA "issue".

Without vectorization we end up with

assignzero (struct foo * p)
{
  <bb 2> [100.00%] [count: INV]:
  MEM[(struct foo *)p_2(D)] = 0;
  MEM[(struct foo *)p_2(D) + 8B] = 0;
  MEM[(struct foo *)p_2(D) + 12B] = 0;
  MEM[(struct foo *)p_2(D) + 14B] = 0;
  MEM[(struct foo *)p_2(D) + 16B] = 0;
  MEM[(struct foo *)p_2(D) + 24B] = 0;
  MEM[(struct foo *)p_2(D) + 32B] = 0;
  MEM[(struct foo *)p_2(D) + 40B] = 0;
  MEM[(struct foo *)p_2(D) + 48B] = 0;
  MEM[(struct foo *)p_2(D) + 56B] = 0;
  return;

with vectorization:

assignzero (struct foo * p)
{
  <bb 2> [100.00%] [count: INV]:
  MEM[(struct foo *)p_2(D)] = 0;
  MEM[(struct foo *)p_2(D) + 8B] = 0;
  MEM[(struct foo *)p_2(D) + 12B] = 0;
  MEM[(struct foo *)p_2(D) + 14B] = 0;
  MEM[(struct foo *)p_2(D) + 16B] = { 0, 0, 0, 0, 0, 0, 0, 0 };
  MEM[(struct foo *)p_2(D) + 48B] = 0;
  MEM[(struct foo *)p_2(D) + 56B] = 0;

I think store-merging could make use of MEM[(...)p_2(D) + CST] = CLOBBER; for the holes.

Note that, strictly speaking, since we optimize memset (.., 0, ..) to = {}, SRA cannot simply replace = {} with setting only the non-padding members to zero. Likewise we optimize memcpy (A, B, ...) to *A = *B; the same arguments about padding apply. OK, so we don't actually fold that aggressively.