This is quite long-latent, affecting GCC 7.* and later.

[856] % gcctk -v
Using built-in specs.
COLLECT_GCC=gcctk
COLLECT_LTO_WRAPPER=/local/suz-local/software/local/gcc-trunk/libexec/gcc/x86_64-pc-linux-gnu/12.0.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../gcc-trunk/configure --disable-bootstrap --prefix=/local/suz-local/software/local/gcc-trunk --enable-languages=c,c++ --disable-werror --enable-multilib --with-system-zlib
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 12.0.0 20211030 (experimental) [master r12-4804-g75c9fa318e3] (GCC)
[857] %
[857] % gcctk -O1 small.c; ./a.out
0
[858] % gcctk -O2 small.c
[859] % ./a.out
0
Aborted
[860] %
[860] % cat small.c
int printf(const char *, ...);
int a, *b, c, e, f;
void g() {
  int *d[7];
  d[6] = b = (int *)d;
  printf("0\n");
}
int i() {
  for (c = 0; c < 2; c++) {
    long h[6][2];
    for (e = 0; e < 6; e++)
      for (f = 0; f < 2; f++)
        h[e][f] = 1;
    if (c) {
      g();
      return h[3][0];
    }
  }
  return 0;
}
int main() {
  if (i() != 1)
    __builtin_abort ();
  return 0;
}
Confirmed, here is one which is broken at -O1 and -O2 but ok at -O3:

__attribute__((noipa)) void ff(void){}
int a, *b, c, e, f;
__attribute__((always_inline)) static inline void g() {
  int *d[7];
  d[6] = b = (int *)d;
  ff();
}
__attribute__((noinline)) int i() {
  for (c = 0; c < 2; c++) {
    long h[6][2];
    for (e = 0; e < 6; e++)
      for (f = 0; f < 2; f++)
        h[e][f] = 1;
    if (c) {
      g();
      return h[3][0];
    }
  }
  return 0;
}
int main() {
  if (i() != 1)
    __builtin_abort ();
  return 0;
}
Since noipa does not work with GCC 7, the following fails all the way back to GCC 7:

__attribute__((noipa,noinline,noclone)) void ff(void){asm("":::"memory");}
int a, *b, c, e, f;
__attribute__((always_inline)) static inline void g() {
  int *d[7];
  d[6] = b = (int *)d;
  ff();
}
__attribute__((noinline)) int i() {
  for (c = 0; c < 2; c++) {
    long h[6][2];
    for (e = 0; e < 6; e++)
      for (f = 0; f < 2; f++)
        h[e][f] = 1;
    if (c) {
      g();
      return h[3][0];
    }
  }
  return 0;
}
int main() {
  if (i() != 1)
    __builtin_abort ();
  return 0;
}
This looks like a register allocator issue. In GCC 6 we have:

.L8:
        movq    %rsp, b(%rip)
        movq    %rsp, 48(%rsp)
        call    _Z2ffv
        movq    112(%rsp), %rax
        addq    $160, %rsp
        .cfi_def_cfa_offset 8
        ret

While in GCC 7+ we get:

        movl    $2, f(%rip)
        movl    $6, e(%rip)
        movq    %rsp, b(%rip)
        movq    %rsp, 48(%rsp)
        call    _Z2ffv
        movl    48(%rsp), %eax
        addq    $96, %rsp
        .cfi_def_cfa_offset 8
        ret
Doesn't look to me like an RA issue, but rather like incorrect coalescing of temporary vars. The optimized dump has:

  <bb 2> [local count: 1652516]:
  ivtmp.40_47 = (unsigned long) &h;
  _30 = ivtmp.40_47 + 96;

  <bb 3> [local count: 13370357]:
  # ivtmp.40_38 = PHI <ivtmp.40_47(2), ivtmp.40_44(3)>
  _21 = (void *) ivtmp.40_38;
  MEM[(long int *)_21] = 1;
  MEM[(long int *)_21 + 8B] = 1;
  ivtmp.40_44 = ivtmp.40_38 + 16;
  if (_30 != ivtmp.40_44)
    goto <bb 3>; [89.00%]
  else
    goto <bb 4>; [11.00%]

  <bb 4> [local count: 1652516]:
  h ={v} {CLOBBER};

  <bb 5> [local count: 13370357]:
  # ivtmp.30_29 = PHI <ivtmp.30_48(5), ivtmp.40_47(4)>
  _31 = (void *) ivtmp.30_29;
  MEM[(long int *)_31] = 1;
  MEM[(long int *)_31 + 8B] = 1;
  ivtmp.30_48 = ivtmp.30_29 + 16;
  if (_30 != ivtmp.30_48)
    goto <bb 5>; [89.00%]
  else
    goto <bb 6>; [11.00%]

  <bb 6> [local count: 1652516]:
  f = 2;
  e = 6;
  c = 1;
  b = &d;
  d[6] = &d;
  ff ();
  d ={v} {CLOBBER};
  _5 = h[3][0];
  _18 = (int) _5;
  h ={v} {CLOBBER};
  return _18;

So, the code initializes the whole h array, then has h ={v} {CLOBBER};, then initializes it again, but unfortunately without mentioning the var in the IL - it reuses ivtmp.40_47 for that. It then sets various vars, including the d[6] initialization with &d escaping, clobbers d, reads from h and finally clobbers h again. I guess the var partitioning code from the above thinks h isn't really live in between the first h clobber and the h[3][0] load and so decides:

Partition 0: size 96 align 16
        h       d

and places both h and d into the same stack slot.
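For contrast, a minimal sketch (with hypothetical names) of the case stack slot sharing is designed for: the two arrays have disjoint scopes and every access mentions the variable, so putting them into one partition is fine. The testcase above has the same clobber placement, but the second half of h's accesses only go through the reused ivtmp pointer, so the partitioning never sees h become live again.

void consume (void *);          /* hypothetical sink, keeps the arrays in memory */

void ok_to_share (void)
{
  {
    int d[7];
    consume (d);
  }                             /* d ={v} {CLOBBER} emitted here; d is dead below */
  {
    long h[12];
    consume (h);
  }                             /* h ={v} {CLOBBER}; with disjoint live ranges
                                   h and d may legally share one stack slot       */
}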
What we see is an effect of 3 different optimizations: one is loop unrolling (cunroll in this case), another one is ivopts, which moves the h array references out of the loops but still has separate

  ivtmp.40_47 = (unsigned long) &h;
  _17 = (unsigned long) &h;
  _30 = _17 + 96;
...
  h ={v} {CLOBBER};
  ivtmp.30_33 = (unsigned long) &h;
  _41 = (unsigned long) &h;
  _36 = _41 + 96;

and finally dom3's VN, which replaces the ivtmp.30_33 initializer with ivtmp.40_47 and _36 with _30.
If what the cfg expand var partitioning code does is as designed (I think other passes rely on it too, e.g. compute_live_vars/live_vars_at_stmt), then we need to somehow avoid VN of &var across a var ={v} {CLOBBER} stmt, but it isn't really clear to me how. Unless we change loop unrolling so that, if there is a var clobber in the loop, the different loop iterations actually get different variables (the first iteration the original var and the other iterations copies of that var; perhaps only for addressable vars?). Then naturally VN couldn't merge those, and the RTL partitioning code could decide to put them into the same or different partitions, and later RTL opts could CSE the addresses.
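At the source level the idea would amount to something like the following sketch (illustrative only; the transform would of course be done on GIMPLE, and h_copy1 is a made-up name):

void consume (void *);

/* Original shape: one 'h', a birth and a clobber in every iteration.  */
void orig (void)
{
  for (int c = 0; c < 2; c++)
    {
      long h[6][2];
      consume (h);
    }
}

/* Unrolling with per-iteration variable copies: &h and &h_copy1 are
   different decls, so VN cannot merge their address computations across
   the clobber that ends the first copy's lifetime.  */
void unrolled (void)
{
  {
    long h[6][2];
    consume (h);
  }
  {
    long h_copy1[6][2];
    consume (h_copy1);
  }
}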
Isn't this another case of the still unsolved PR90348?
Looks like that. What do you think about unrolling making variable copies? We'd need to be sure that the scope of the variable is the loop we are unrolling though (or something nested in it).
On Tue, 2 Nov 2021, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103006
>
> --- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> Looks like that.
> What do you think about unrolling making variable copies? We'd need to be sure
> that the scope of the variable is the loop we are unrolling though (or
> something nested in it).

Being able to determine that would solve the very issue we're trying to fix with making the copy. The problem is that we put in CLOBBERs based on the original BINDs but later optimizers do not respect the birth boundary. If we can figure that out we could ignore the respective CLOBBERs for the CFG expansion live compute as well.

I think we may be able to compute the CFG SCC a CLOBBER resides in and in case the CLOBBERed variable is live-in into that SCC we cannot prune it with that CLOBBER. Or so.
We don't have that many unrolling passes though, and perhaps we could use compute_live_vars and decide based on that. Though I guess the addresses of the vars could even then be hoisted before such loops, though it is unclear why that would be done; it can't be VN because the vars don't exist outside of the loop.
Say LIM can do that though...
On Tue, 2 Nov 2021, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103006
>
> Jakub Jelinek <jakub at gcc dot gnu.org> changed:
>
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>            Keywords|ra                          |
>
> --- Comment #9 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> We don't have that many unrolling passes though, and perhaps we could use
> compute_live_vars and decide based on that. Though I guess the addresses of
> the vars could even then be hoisted before such loops, though it is unclear
> why that would be done; it can't be VN because the vars don't exist outside
> of the loop.
> Say LIM can do that though...

Yes, LIM can do it, as can loop header copying or jump threading that happens to peel an iteration. Also there's no dataflow barrier that prevents addresses from being moved, so "no pass does this" isn't a good excuse.

That said, I think the fix is to the code computing stack slot reuse.
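A hedged sketch of the concern (hypothetical names; whether any particular pass actually performs the motion on this exact input is beside the point): the address of h is loop-invariant, so nothing in the IL stops LIM, header copying or a peeled jump thread from materializing it before the loop, i.e. before h's birth.

int *escape;
void use (void);

void f (int n)
{
  for (int i = 0; i < n; i++)
    {
      int h[4];        /* h's scope is the loop body                    */
      h[0] = i;
      escape = h;      /* the address of h escapes                      */
      use ();          /* &h could be computed once, before the loop,
                          outside of h's nominal scope                  */
    }
}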
(In reply to rguenther@suse.de from comment #8)
> On Tue, 2 Nov 2021, jakub at gcc dot gnu.org wrote:
>
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103006
> >
> > --- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> > Looks like that.
> > What do you think about unrolling making variable copies? We'd need to be
> > sure that the scope of the variable is the loop we are unrolling though (or
> > something nested in it).
>
> Being able to determine that would solve the very issue we're trying
> to fix with making the copy. The problem is that we put in CLOBBERs
> based on the original BINDs but later optimizers do not respect the
> birth boundary. If we can figure that out we could ignore the
> respective CLOBBERs for the CFG expansion live compute as well.
>
> I think we may be able to compute the CFG SCC a CLOBBER resides in
> and in case the CLOBBERed variable is live-in into that SCC we cannot
> prune it with that CLOBBER. Or so.

Tried that, but while it solves PR97821 it doesn't fix the case in this bug, because here the CLOBBER that breaks things is not inside a SCC; the issue in this case is that not all accesses to 'h' also mention h, and thus we miss making 'h' live again after the CLOBBER. I also fear that the more variables we expose, the easier it will be to run into this issue. It might be possible to (conservatively) track pointers from ADDR_EXPR mentions, but that's going to give up in most of the interesting cases (it has the same issue as ideas how to prevent "leakage" via pointers).

The other idea that we discussed in the past is to perform stack slot sharing early, when the CLOBBERs are still OK to use. Possibly even without CLOBBERs but with the GIMPLE binds we have even after gimplifying. We'd make the "stack slot" sharing explicit by replacing decls assigned to the same stack slot with either the largest of the decls or anonymous stack memory (a new decl). Doing this before inlining is complete will obviously not catch all important cases. Doing it after inlining requires being careful during early optimizations (jump threading is the one transform we do early that can cause code duplication and followup CSE).
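Expressed at the source level, the "make the sharing explicit early" idea corresponds roughly to this sketch (a rough illustration under the assumption that the two lifetimes have already been proven disjoint; the real transform would rewrite the GIMPLE decls, and the names are made up):

void consume (void *);

/* Before: two decls with disjoint lifetimes.  */
void before (void)
{
  { int d[7];   consume (d); }
  { long h[12]; consume (h); }
}

/* After early sharing: one backing object of the merged size/alignment;
   the union is only a source-level stand-in for the new anonymous decl
   both variables would be rewritten to.  */
void after (void)
{
  union { int d[7]; long h[12]; } slot;
  consume (slot.d);
  consume (slot.h);
}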
Oh, and I think the address-takens are really not the issue; it is the accesses based on them that confuse the simplistic live analysis into not recognizing those as births. So we _can_ introduce explicit birth operations.

The simplest thing we probably can do is to add clobbers there and set a special 'birth' flag on them just for liveness analysis; the rest of the compiler can treat them like clobbers - besides cases where we remove clobbers. We can't remove a birth without also removing all clobbers of a variable (even in copies of birth-death regions). It might be tempting to somehow link a birth and its clobbers (IIRC with cleanups and so on we can have multiple clobbers for one birth), like via SSA def and uses, but when we copy a birth that breaks down. So the alternative is probably to mark a variable as not subject to stack slot sharing when removing a birth clobber.

The initial birth clobber would be at a more conservative position than the current way of treating the first mention as the birth, but we can sink birth clobbers (even across address-takens) and hoist clobbers to shrink live ranges at some point. Both birth and clobber act as an optimization barrier for loads and stores of the affected variable; that's good for the purpose but possibly bad for optimization. I checked, and for example loop store motion does consider clobbers inside a loop a reason not to optimize. And with the current scheme we don't even optimize cases like

struct Foo { int i; int j; int a[24]; };

void bar(struct Foo f);

void baz()
{
  struct Foo f, g;
  f.i = 1;
  bar (f);
  g.j = 2;
  bar (g);
}

as nothing hoists the clobbers we only put at the end of the function, and thus f and g appear to conflict (we only use clobbers to compute live; for not-address-taken vars we could rely on mentions only).

I don't think we can reasonably fix all of this issue on the branches, and I have my doubts for GCC 12.
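For comparison, a sketch of the same testcase with explicit scopes: the clobbers are then emitted at the end of each block rather than only at the end of the function, so the live ranges are visibly disjoint and the slots can be shared.

struct Foo { int i; int j; int a[24]; };

void bar (struct Foo f);

void baz (void)
{
  {
    struct Foo f;
    f.i = 1;
    bar (f);
  }             /* f ={v} {CLOBBER} already sits here, right after the last use */
  {
    struct Foo g;
    g.j = 2;
    bar (g);
  }             /* g ={v} {CLOBBER}; f and g no longer appear to conflict        */
}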
So I have a patch that adds explicit birth markers (using clobbers specially marked). That works well so far, but it conflicts with clobbers (not marked as birth) that are added for clobbering at the start of a variable's lifetime, like C++ does at the beginning of CTORs. I for example see

  inst ={v} {CLOBBER(birth)};
  inst ={v} {CLOBBER};           (**)
  inst.v = 42;
  ...
  inst ={v} {CLOBBER};

where (**) is inserted by the C++ frontend (with -flifetime-dse, which is the default). Indeed, my life analysis for the purpose of stack slot sharing now relies only on the birth/death markers, so it gets confused by the extra clobber.

We now also have use-after-free diagnostics that would likely trip over this, as they assume that a CLOBBER ends the lifetime of the storage. I guess disentangling both use-cases by also marking the end-of-storage-lifetime clobbers specially would solve both issues.
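A minimal C++ sketch of input that produces that pattern (assuming the default -flifetime-dse; the names are made up):

struct S {
  int v;
  S () { v = 42; }    /* with -flifetime-dse the FE emits a clobber of *this
                         at the start of the CTOR; after inlining that becomes
                         the (**) clobber on inst shown above                  */
};

void keep (S &);

void f ()
{
  S inst;             /* inst ={v} {CLOBBER(birth)} would be placed here */
  keep (inst);
}                     /* ordinary end-of-lifetime clobber for inst here  */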
There's an interesting case,

  a = BIRTH
loop:
  b = DEATH
  a = DEATH
  b = BIRTH
  goto loop;

where we end up having both a and b in the live-in set at the loop label, but a is removed before we see the BIRTH of b, which is where we add conflicts based on the current set of active vars. In the case I'm running into, tail-recursion turns

  a = BIRTH
  b = BIRTH
  ...
  a = DEATH
  b = DEATH

into

loop:
  a = BIRTH
  b = BIRTH
  goto loop;
  a = DEATH
  b = DEATH

leading to a similar issue. The issue above can for example arise from loop rotation. In all cases liveness from backedges confuses the "optimization" of only recording conflicts when we add a var to the live set (and it is not already set). The previous code had

          /* If this is the first real instruction in this BB we need
             to add conflicts for everything live at this point now.
             Unlike classical liveness for named objects we can't
             rely on seeing a def/use of the names we're interested in.
             There might merely be indirect loads/stores.  We'd not add any
             conflicts for such partitions.  */

and the easiest is to do the same here (we don't see the backedge "use"), but we could possibly improve on that by noting which vars are only live from a backedge.
GCC 9 branch is being closed
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
GCC 10 branch is being closed.