This change:

commit 6decda1a35be5764101987c210b5693a0d914e58
Author: Richard Biener <rguenther@suse.de>
Date:   Thu Oct 12 11:34:57 2023 +0200

    tree-optimization/111779 - Handle some BIT_FIELD_REFs in SRA

    The following handles byte-aligned, power-of-two and byte-multiple
    sized BIT_FIELD_REF reads in SRA.  In particular this should cover
    BIT_FIELD_REFs created by optimize_bit_field_compare.

    For gcc.dg/tree-ssa/ssa-dse-26.c we now SRA the BIT_FIELD_REF
    appearing there leading to more DSE, fully eliding the aggregates.

    This results in the same false positive -Wuninitialized as the
    older attempt to remove the folding from optimize_bit_field_compare,
    fixed by initializing part of the aggregate unconditionally.

            PR tree-optimization/111779
    gcc/
            * tree-sra.cc (sra_handled_bf_read_p): New function.
            (build_access_from_expr_1): Handle some BIT_FIELD_REFs.
            (sra_modify_expr): Likewise.
            (make_fancy_name_1): Skip over BIT_FIELD_REF.

    gcc/fortran/
            * trans-expr.cc (gfc_trans_assignment_1): Initialize
            lhs_caf_attr and rhs_caf_attr codimension flag to avoid
            false positive -Wuninitialized.

    gcc/testsuite/
            * gcc.dg/tree-ssa/ssa-dse-26.c: Adjust for more DSE.
            * gcc.dg/vect/vect-pr111779.c: New testcase.

causes execute/20040709-2.c to fail on mcore-elf at -O2.  It also results in
what appears to be significantly poorer code generation.

Note I haven't managed to get mcore-elf-gdb to work, so debugging is, umm,
painful.  And I wouldn't put a lot of faith in the simulator correctness.

I have simplified the test to this:

extern void abort (void);
extern void exit (int);

unsigned int
myrnd (void)
{
  static unsigned int s = 1388815473;
  s *= 1103515245;
  s += 12345;
  return (s / 65536) % 2048;
}

struct __attribute__((packed)) K
{
  unsigned int k:6, l:1, j:10, i:15;
};

struct K sK;

unsigned int
fn1K (unsigned int x)
{
  struct K y = sK;
  y.k += x;
  return y.k;
}

void
testK (void)
{
  int i;
  unsigned int mask, v, a, r;
  struct K x;
  char *p = (char *) &sK;

  for (i = 0; i < sizeof (sK); ++i)
    *p++ = myrnd ();

  v = myrnd ();
  a = myrnd ();
  sK.k = v;
  x = sK;
  r = fn1K (a);
  if (x.j != sK.j || x.l != sK.l)
    abort ();
}

int
main (void)
{
  testK ();
  exit (0);
}

which should at least make the poor code gen obvious.  I don't expect to have
time to debug this further anytime in the near future.
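As a side note for readers unfamiliar with the SRA change itself, the criteria
named in the commit message ("byte-aligned, power-of-two and byte-multiple
sized") can be paraphrased as a small standalone check like the sketch below.
This is only an illustration of the stated conditions, not the actual
sra_handled_bf_read_p from tree-sra.cc; the function name and parameters are
made up.

#include <stdbool.h>

/* Illustrative paraphrase of the conditions the commit message states for
   BIT_FIELD_REF reads SRA now handles: the read starts on a byte boundary
   and its width is a power-of-two number of whole bytes (8, 16, 32, ...
   bits).  NOT the real sra_handled_bf_read_p, just a sketch.  */
static bool
handled_bf_read_sketch (unsigned long bit_offset, unsigned long bit_size)
{
  return bit_offset % 8 == 0                      /* byte-aligned */
         && bit_size % 8 == 0 && bit_size != 0    /* whole bytes */
         && (bit_size & (bit_size - 1)) == 0;     /* power of two */
}

Under that reading, a byte-wide read like BIT_FIELD_REF <x, 8, 0> (bit offset
0, size 8), as it appears in the dumps later in this thread, qualifies.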
If I configured correctly, the only change (early) SRA does is

 unsigned int fn1K (unsigned int x)
 {
+  <unnamed-unsigned:6> y$k;
...
@@ -57,13 +53,14 @@
   <bb 2> :
   y = sK;
-  _1 = y.k;
+  y$k_13 = sK.k;
+  _1 = y$k_13;
   _2 = (unsigned char) _1;
   _3 = (unsigned char) x_9(D);
   _4 = _2 + _3;
   _5 = (<unnamed-unsigned:6>) _4;
-  y.k = _5;
-  _6 = y.k;
+  y$k_14 = _5;
+  _6 = y$k_14;
   _11 = (unsigned int) _6;
   y ={v} {CLOBBER(eol)};
   return _11;

and later DSE eliminates 'y' completely.

Note I can't see any big assembly difference between -O2 and -O2 -fno-tree-sra,
so I wonder whether I really reproduced the significantly worse code generation
issue.  In fact the generated code looks better; the stack slot is elided:

> diff -u t.s.good t.s.bad
--- t.s.good    2023-10-16 10:18:27.457651977 +0200
+++ t.s.bad     2023-10-16 10:07:49.997656268 +0200
@@ -19,13 +19,11 @@
        .export fn1K
        .type   fn1K, @function
 fn1K:
-       subi    sp,8
        lrw     r7, sK
        ld.b    r7,(r7)
        addu    r2,r7
        movi    r7,63
        and     r2,r7
-       addi    sp,8
        jmp     r15
        .size   fn1K, .-fn1K
        .align  1
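To visualize what that scalarization plus the later DSE amounts to at the
source level, a rough C equivalent of the resulting fn1K might look like the
sketch below (the function name is made up; struct K and sK are the ones from
the reduced testcase in the report, and this is an illustration, not GCC
output):

/* Rough C-level equivalent of fn1K after SRA and DSE as described above:
   the aggregate temporary 'y' is gone and only the 6-bit 'k' member of sK
   is read.  The final mask mirrors the narrowing to <unnamed-unsigned:6>
   before widening back to unsigned int (cf. the movi 63 / and pair in the
   assembly diff above).  */
unsigned int
fn1K_scalarized (unsigned int x)
{
  unsigned char k = sK.k;        /* y$k_13 = sK.k */
  k = (unsigned char) (k + x);   /* _4 = _2 + _3 */
  return k & 63u;                /* (<unnamed-unsigned:6>) then (unsigned int) */
}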
I wouldn't look at with/without tree-sra.  Instead I would look at the
assembly code before/after your change.  It's clearly worse: testK goes from
~37 instructions to ~70, and the abort call is no longer removed.
OK, so the abort () call is only elided during RTL optimization before the
change.  The change itself first manifests during ESRA as

 void testK ()
 {
-  <unnamed-unsigned:10> x$j;
   unsigned int D.1835;
   static unsigned int s = 1388815473;
   unsigned int D.1834;
@@ -167,14 +157,13 @@
   _5 = (<unnamed-unsigned:6>) _51;
   sK.k = _5;
   x = sK;
-  x$j_55 = sK.j;
   y$k_36 = sK.k;
   _37 = (unsigned char) y$k_36;
   _38 = (unsigned char) _46;
   _39 = _37 + _38;
   _40 = (<unnamed-unsigned:6>) _39;
   _41 = (unsigned int) _40;
-  _6 = x$j_55;
+  _6 = x.j;
   _7 = sK.j;
   if (_6 != _7)
     goto <bb 7>; [INV]
   else
     goto <bb 6>; [INV]

There are other uses of 'x' below, unchanged:

   <bb 6> :
   _8 = BIT_FIELD_REF <x, 8, 0>;
   _9 = BIT_FIELD_REF <sK, 8, 0>;
   _10 = _8 ^ _9;
   _11 = _10 & 64;
   if (_11 != 0)
     goto <bb 7>; [INV]
   else
     goto <bb 8>; [INV]

   <bb 7> :
   abort ();

and with the change we now analyze those and run into

! Disqualifying sK - Address taken in a non-call-argument context.
! Disqualifying x - No or inhibitingly overlapping accesses.

That's because fold somehow transformed one of the bitfield compares but not
the other.  In particular we have the following two accesses

  _8 = BIT_FIELD_REF <x, 8, 0>;
  _6 = x.j;

where the first is offset = 0, size = 8 and the second is offset = 7,
size = 10, and those partially overlap.  Prior to the change SRA was
performing a non-profitable transformation which in the end caused it to
leave 'x' around.

The interesting difference is in FRE though, where without the change we are
able to elide the _6 != _7 compare but with it we fail.  That's because SRA
makes the load of sK.j available - while VN can see through the x = sK
assignment, it won't derive a value from the result unless there's a leader
for it in the IL.

Swapping the compare to

  if (sK.j != x.j || x.l != sK.l)
    abort ();

makes it optimized, and the abort () call vanishes as well - in fact it now
vanishes already in DOM2, well before RTL optimization.

I don't think the SRA change is to be blamed for doing anything wrong; we're
running into natural limitations of CSE and into premature folding inhibiting
SRA (which the SRA change itself was supposed to handle a bit better).  If
there is actual wrong code on mcore now, that is a latent problem uncovered
by the change.  The missed optimization is yet another "dead code elimination
regression".

I don't see an easy way to solve the missed-optimization regression.  On the
VN side, when we just see

  x = sK;
  _3 = x.j;
  _4 = sK.j;

then visiting the x.j load, when running into x = sK, could eventually also
put sK.j into the hash table (with the same VARYING value we assign to x.j).
This would require some extra bookkeeping, as we actually compute this
expression already but do not keep it as an alternate form to record when we
figure out there's no actual leader for it.

Testcase:

struct X { int a; int b; } p, o;
void foo ()
{
  p = o;
  if (p.a != o.a)
    __builtin_abort ();
}

Let me take this for this particular VN issue.
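A small self-contained sketch of the overlap reasoning above, using the bit
offsets and sizes quoted from the dump (the program itself is purely
illustrative and only restates why the two accesses conflict):

#include <assert.h>

/* The two accesses SRA sees on 'x': the byte-wide BIT_FIELD_REF covers bits
   [0, 8) and x.j covers bits [7, 17).  They share bit 7 but neither contains
   the other, i.e. they only partially overlap, which is what the
   "inhibitingly overlapping accesses" disqualification above refers to.  */
int
main (void)
{
  unsigned bf_off = 0, bf_size = 8;   /* _8 = BIT_FIELD_REF <x, 8, 0> */
  unsigned j_off = 7, j_size = 10;    /* _6 = x.j */

  int overlap = bf_off < j_off + j_size && j_off < bf_off + bf_size;
  int contained = (j_off >= bf_off && j_off + j_size <= bf_off + bf_size)
                  || (bf_off >= j_off && bf_off + bf_size <= j_off + j_size);

  assert (overlap && !contained);   /* partial overlap only */
  return 0;
}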
GCC 14.1 is being released, retargeting bugs to GCC 14.2.
GCC 14.2 is being released, retargeting bugs to GCC 14.3.
So even after the recent ifcombine work we still produce during early folding:

  if (x.j != sK.j
      || ((BIT_FIELD_REF <x, 8, 0> ^ BIT_FIELD_REF <sK, 8, 0>) & 64) != 0)
    {
      abort ();
    }

and ifcombine does not optimize this further.  Disabling
optimize_bit_field_compare optimizes this fully even before ifcombine has a
chance to perform the same optimization.
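To make the folded form above concrete, a hand-written C equivalent of what
optimize_bit_field_compare produces for the single-bit x.l != sK.l compare
could look like the sketch below.  It assumes the bitfield layout in effect
here, where l sits at bit 6 of the first byte of the packed struct K from the
reduced testcase (hence the mask 64); the helper name is made up.

/* Hand-written equivalent of the folded compare above: instead of
   extracting the 1-bit 'l' fields, load the byte that contains them from
   each object, XOR, and test bit 6 (mask 64) - the same shape as
   (BIT_FIELD_REF <x, 8, 0> ^ BIT_FIELD_REF <sK, 8, 0>) & 64.
   Purely illustrative; assumes 'l' lives at bit 6 of the first byte.  */
static int
l_fields_differ (const struct K *a, const struct K *b)
{
  unsigned char ba = *(const unsigned char *) a;   /* BIT_FIELD_REF <*a, 8, 0> */
  unsigned char bb = *(const unsigned char *) b;   /* BIT_FIELD_REF <*b, 8, 0> */
  return ((ba ^ bb) & 64) != 0;
}

That also shows the mixed shape ifcombine is left with: one operand of the ||
is still a plain field compare (x.j != sK.j) while the other is already in
this byte/mask form.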