void *sink;
void bar() {
    int a[100]{1,2,3,4};
    sink = a;           // a escapes the function
    asm("":::"memory"); // and compiler memory barrier
    // forces the compiler to materialize a[] in memory instead of optimizing away
}

gcc 8.1 and gcc 9.2 both make asm like this (even with -O3):

bar():
        push    edi                     # save call-preserved EDI which rep stos uses
        xor     eax, eax                # eax = 0
        mov     ecx, 100                # repeat-count = 100
        sub     esp, 400                # reserve 400 bytes on the stack
        mov     edi, esp                # dst for rep stos
        mov     DWORD PTR sink, esp     # sink = a
        rep stosd                       # memset(a, 0, 400)
        mov     DWORD PTR [esp], 1      # then store the non-zero initializers
        mov     DWORD PTR [esp+4], 2    # over the zeroed part of the array
        mov     DWORD PTR [esp+8], 3
        mov     DWORD PTR [esp+12], 4
        add     esp, 400                # clean up the stack
        pop     edi                     # and restore caller's EDI
        ret
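For reference, the zeroing itself is required by the language: with int a[100]{1,2,3,4} the four listed elements get their values and the remaining 96 elements are value-initialized to zero, so the compiler must clear them somehow. A minimal check of that semantics (first_zero_index is just an illustrative helper, not part of the report):

```cpp
// int a[100]{1,2,3,4}: elements 0..3 take the listed values,
// elements 4..99 are value-initialized to zero.
int first_zero_index() {
    int a[100]{1, 2, 3, 4};
    for (int i = 0; i < 100; ++i)
        if (a[i] == 0)
            return i;   // first element the clearing has to cover
    return -1;
}
```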
It's actually an optimization - it's cheaper to clear the whole object when most of it is zero. What we miss is noticing the special case where only the tail is zeros.

Jeff added memset pruning to DSE, but by the time DSE runs this case looks like

  <bb 2> [local count: 1073741824]:
  a = {};
  MEM <unsigned long> [(int *)&a] = 8589934593;
  MEM <unsigned long> [(int *)&a + 8B] = 17179869187;
  sink = &a;

(the four int initializers have been merged into two 64-bit stores: 8589934593 is 0x2'00000001, 17179869187 is 0x4'00000003). The other obvious place to fix it is of course the gimplifier, which creates the above code in the first place.

The same issue happens with

void *sink;
void bar() {
    int a[100] = { [96]=1,2,3,4 };
    sink = a;           // a escapes the function
    asm("":::"memory"); // and compiler memory barrier
    // forces the compiler to materialize a[] in memory instead of optimizing away
}

or

void *sink;
void bar() {
    int a[100] = { 1,2,3,4,[96]=1,2,3,4 };
    sink = a;           // a escapes the function
    asm("":::"memory"); // and compiler memory barrier
    // forces the compiler to materialize a[] in memory instead of optimizing away
}

though the trailing zeros are probably the most common case.
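In source form, the pruned clearing the DSE/gimplifier fix would aim for looks like this sketch: write the non-zero initializers and memset only the 96-element tail, instead of zeroing all 400 bytes first. fill_pruned is a hypothetical helper operating on a caller-provided buffer (so the effect is observable), not anything GCC emits:

```cpp
#include <cstring>

// Sketch of the desired lowering for int a[100]{1,2,3,4}:
// store the four initializers, then clear only the trailing zeros.
void fill_pruned(int (&a)[100]) {
    a[0] = 1; a[1] = 2; a[2] = 3; a[3] = 4;
    // 384 bytes instead of the full 400
    std::memset(a + 4, 0, sizeof(a) - 4 * sizeof(int));
}
```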
Or in theory store-merging could do this, although right now it can only handle the = {} form, not memset. That said, unconditionally shrinking the clearing store so that it no longer covers the bytes which are overwritten later is not a good idea: it is much faster to start the clear at an 8- or 16-byte aligned address than to skip the first 3 or 7 bytes.
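The alignment concern can be made concrete: instead of starting the clear exactly at the first byte not covered by later stores, round the start down to a natural alignment boundary. Re-clearing a few leading bytes is harmless because the initializer stores overwrite them afterwards anyway. A sketch (clear_tail_aligned is a hypothetical illustration, not a GCC routine):

```cpp
#include <cstddef>
#include <cstring>

// Clear bytes [covered, total) of 'base', but round the start down to a
// 16-byte boundary. The bytes in [start, covered) are re-cleared here and
// then overwritten by the later initializer stores, which is cheaper than
// issuing a misaligned memset that skips 3 or 7 leading bytes.
void clear_tail_aligned(unsigned char *base, std::size_t covered, std::size_t total) {
    std::size_t start = covered & ~static_cast<std::size_t>(15);
    std::memset(base + start, 0, total - start);
}
```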