Created attachment 31967 [details]
preprocessed source of ray/src/rt/ambient.c

gcc 4.8.x generates 10-15% slower code compared to 4.7.x for Mark Stock's radiance benchmark (http://markjstock.org/pages/rad_bench.html). I observed this regression on Linux x86_64, and with different CPUs (Ivy Bridge, Haswell, AMD Phenom, Kaveri).

I had suspected the new register allocator, but the actual cause is a difference in loop unrolling. The hotspot is the nested loops with the recursive call at the end of the sumambient() function. When using -Ofast, gcc 4.7.x will unroll the outer loop, which results in some optimization possibilities in the inner loop. gcc 4.8.x does not unroll the outer loop. -funroll-loops does not change the behavior.
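For reference, the shape of the hotspot is roughly the following heavily simplified sketch (the function name, loop bounds, and depth cutoff here are illustrative assumptions, not the actual Radiance source):

```c
#include <assert.h>

/* Heavily simplified sketch of the sumambient() hotspot; the real
   Radiance code differs.  The point is the structure: an outer loop
   over 8 octree children, an inner loop over 3 coordinates with the
   (1<<j & i) test, and a self-recursive call at the bottom, which is
   what trips the "call on hot path" unrolling heuristic.  */
static double sumambient_sketch(const double c0[3], double s, int depth)
{
    if (depth == 0)                       /* recursion cutoff (illustrative) */
        return 1.0;

    double acc = 0.0;
    for (int i = 0; i < 8; i++) {         /* outer loop: 4.7 unrolled this  */
        double ck0[3];
        for (int j = 0; j < 3; j++) {     /* inner loop over coordinates    */
            ck0[j] = c0[j];
            if (1 << j & i)               /* constant once both are unrolled */
                ck0[j] += s;
        }
        acc += sumambient_sketch(ck0, s * 0.5, depth - 1);  /* recursion */
    }
    return acc;
}
```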
Caused by r193246:

193246 hubicka       /* If there is call on a hot path through the loop, then
193246 hubicka          there is most probably not much to optimize.  */
193246 hubicka       else if (size.num_non_pure_calls_on_hot_path)
138522 rguenth         {
138522 rguenth           if (dump_file && (dump_flags & TDF_DETAILS))
193246 hubicka             fprintf (dump_file, "Not unrolling loop %d: "
193246 hubicka                      "contains call and code would grow.\n",
138522 rguenth                      loop->num);
138522 rguenth           return false;
138522 rguenth         }

so we conclude

size: 59-8, last_iteration: 52-5
Loop size: 59
Estimated size after unrolling: 269
Not unrolling loop 2: contains call and code would grow.

so it was disabled on purpose.  Not sure if it's worth special-casing self-recursive calls the same as pure/const ones.
> 193246 hubicka       /* If there is call on a hot path through the loop, then
> 193246 hubicka          there is most probably not much to optimize.  */
> 193246 hubicka       else if (size.num_non_pure_calls_on_hot_path)
> 138522 rguenth         {
> 138522 rguenth           if (dump_file && (dump_flags & TDF_DETAILS))
> 193246 hubicka             fprintf (dump_file, "Not unrolling loop %d: "
> 193246 hubicka                      "contains call and code would grow.\n",
> 138522 rguenth                      loop->num);
> 138522 rguenth           return false;
> 138522 rguenth         }
>
> so we conclude
>
> size: 59-8, last_iteration: 52-5
> Loop size: 59
> Estimated size after unrolling: 269
> Not unrolling loop 2: contains call and code would grow.
>
> so it was disabled on purpose.  Not sure if it's worth special-casing
> self-recursive calls the same as pure/const ones.

Not by the logic above, because self-recursion is not really better understood by optimizers than normal function calls.  I am surprised the unrolling makes a 10-15% difference in this case, then.  Is there something that makes the unrolled loop optimize better?

Honza
It's this conditional in the inner loop.  The expression becomes constant only if both loops are unrolled (i and j are the loop counters):

      if (1<<j & i)
              ck0[j] += s;

This condition is probably not perfectly predictable even by modern branch predictors, so I assume this is the main reason for the regression.

Chris
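To make the folding concrete, here is a small hedged illustration (the function names are invented for this example): once the outer loop is unrolled so that i is a literal, and the inner loop is unrolled so that j is a literal, each (1<<j & i) test is a compile-time constant and the branch disappears entirely.

```c
#include <assert.h>

/* Rolled form: the test depends on the runtime values of i and j,
   so it stays a data-dependent branch in the loop body.  */
static double rolled(const double c0[3], double s, int i)
{
    double sum = 0.0;
    for (int j = 0; j < 3; j++) {
        double ck0 = c0[j];
        if (1 << j & i)          /* data-dependent branch */
            ck0 += s;
        sum += ck0;
    }
    return sum;
}

/* Unrolled by hand for i == 5 (binary 101): every test is now a
   compile-time constant, so straight-line code with no branches
   remains.  */
static double unrolled_i5(const double c0[3], double s)
{
    double sum = 0.0;
    sum += c0[0] + s;   /* (1<<0) & 5 != 0: always add s */
    sum += c0[1];       /* (1<<1) & 5 == 0: never add s  */
    sum += c0[2] + s;   /* (1<<2) & 5 != 0: always add s */
    return sum;
}
```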
GCC 4.8.3 is being released, adjusting target milestone.
GCC 4.8.4 has been released.
In the size estimation we have

 BB: 46, after_exit: 0
  size:   1 _319 = MEM[(double *)c0_188(D) + 16B];
  size:   1 _321 = i_171 >> 2;
   Constant expression will be folded away.
  size:   2 if (_321 != 0)

so somehow we fail to see that the conditional is constant after peeling.  Reworking things so we can use a lattice (and also avoid calling simple_iv too many times) will improve that (but then the code also doesn't only consider MAX (taken-paths-lengths) in case there is an else block).  It looks like reworking this to do a domwalk may be profitable.  The simplistic approach gets the estimated size after unrolling down to 238 (still that won't unroll the loop).

I think we should also consider size->eliminated_by_peeling in the call heuristic - the call itself might be a minor distraction, and especially eliminated branches might result in a good speed benefit.

That said, the core algorithm for likely-eliminated should be improved, but it won't help by itself.
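As a hedged illustration of what the estimator misses (the function names are invented): if the loop over i is fully unrolled, each copy sees a literal i, so the (i >> 2 != 0) test that _321 guards folds to a constant - false in copies 0..3, true in copies 4..7 - and the conditional should contribute no size in any copy.

```c
#include <assert.h>

/* Rolled form: the test is a runtime branch in every iteration.  */
static int rolled_count(void)
{
    int n = 0;
    for (int i = 0; i < 8; i++)
        if (i >> 2 != 0)      /* what "_321 = i_171 >> 2" guards */
            n++;
    return n;
}

/* Hand-"unrolled" form: copies with i = 0..3 fold the test to false
   and emit nothing; copies with i = 4..7 fold it to true and execute
   the guarded statement unconditionally.  */
static int unrolled_count(void)
{
    /* i = 0,1,2,3: (i >> 2) == 0, guarded code eliminated    */
    /* i = 4,5,6,7: (i >> 2) != 0, increment runs every time  */
    return 0 + 4;
}
```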
Let's assign this to me for next stage1.
*** Bug 65207 has been marked as a duplicate of this bug. ***
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
GCC 4.9.3 has been released.
*** Bug 66263 has been marked as a duplicate of this bug. ***
GCC 4.9 branch is being closed
As duplicate bug 65207 points out, this is a SPEC CPU related bug.
GCC 5 branch is being closed
GCC 6 branch is being closed
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
GCC 8.4.0 has been released, adjusting target milestone.
GCC 8 branch is being closed.
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
GCC 9 branch is being closed
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
GCC 10 branch is being closed.
One thing unrelated to the unrolling I noticed is:
```
  if (_221 != 0)
    goto <bb 50>; [50.00%]
  else
    goto <bb 49>; [50.00%]

  <bb 49> [local count: 163152564]:
  _212 = _218 - s_202;
  goto <bb 51>; [100.00%]

  <bb 50> [local count: 163152564]:
  _222 = s_202 + _218;

  <bb 51> [local count: 326305128]:
  # ck0$0_144 = PHI <_218(49), _222(50)>
  # prephitmp_211 = PHI <_212(49), _218(50)>
```
This does not use conditional moves on aarch64 (or x86_64) even though it could/should.  That pattern shows up 3 times due to the unrolling of the inner loop here:
```
      ck0[j] = c0[j];
      if (1<<j & i)
              ck0[j] += s;
      if (r->rop[j] < ck0[j] - 1.0*s)
```
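A hedged sketch of the if-converted form suggested above (function names invented; whether the compiler actually emits csel/fcsel or cmov depends on its cost model): computing both arms and selecting avoids the 50/50 branch entirely.

```c
#include <assert.h>

/* Branchy form, mirroring the GIMPLE above: one arm adds s, the
   other leaves the value alone, and the subtraction feeds the
   ck0[j] - 1.0*s comparison.  */
static void branchy(double c, double s, int cond, double *ck0, double *low)
{
    *ck0 = c;
    if (cond)                 /* the 50/50 branch the GIMPLE shows */
        *ck0 += s;
    *low = *ck0 - s;          /* corresponds to ck0[j] - 1.0*s */
}

/* If-converted form: a select instead of a branch.  The ?: is a
   natural candidate for csel/fcsel (aarch64) or cmov (x86_64).  */
static void branchless(double c, double s, int cond, double *ck0, double *low)
{
    double add = cond ? s : 0.0;   /* select, no control flow */
    *ck0 = c + add;
    *low = *ck0 - s;
}
```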
GCC 11 branch is being closed.