Bug 59967 - [12/13/14/15 Regression] Performance regression from 4.7.x to 4.8.x (loop not unrolled)
Summary: [12/13/14/15 Regression] Performance regression from 4.7.x to 4.8.x (loop not unrolled)
Status: ASSIGNED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 4.8.2
Importance: P2 normal
Target Milestone: 12.5
Assignee: Richard Biener
URL:
Keywords: missed-optimization
Duplicates: 65207 66263
Depends on:
Blocks: spec
 
Reported: 2014-01-28 13:12 UTC by Christoph Breitkopf
Modified: 2024-07-19 12:56 UTC
CC: 3 users

See Also:
Host:
Target:
Build:
Known to work: 4.7.3
Known to fail: 4.8.2
Last reconfirmed: 2014-03-31 00:00:00


Attachments
preprocessed source of ray/src/rt/ambient.c (27.14 KB, text/plain)
2014-01-28 13:12 UTC, Christoph Breitkopf

Description Christoph Breitkopf 2014-01-28 13:12:45 UTC
Created attachment 31967 [details]
preprocessed source of ray/src/rt/ambient.c

gcc 4.8.x generates 10-15% slower code compared to 4.7.x for Mark Stock's radiance benchmark (http://markjstock.org/pages/rad_bench.html).

I observed this regression on Linux x86_64, and with different CPUs (Ivy Bridge, Haswell, AMD Phenom, Kaveri). I had suspected the new register allocator, but the actual cause is a difference in loop unrolling.

The hotspot is the nested loops with the recursive call at the end of the sumambient() function. When using -Ofast, gcc 4.7.x will unroll the outer loop, which results in some optimization possibilities in the inner loop. gcc 4.8.x does not unroll the outer loop. -funroll-loops does not change the behavior.
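For illustration, a minimal sketch of the loop shape being described (a hypothetical reduction, not the actual radiance source; the function name `sum_octant`, the trip counts, and the recursion scheme are assumptions based on this report):

```c
/* Hypothetical reduction of the sumambient() hotspot: a short
   fixed-trip-count outer loop whose body branches on the loop
   counters, followed by a self-recursive call.  Not the real
   benchmark code; names and bounds are illustrative only.  */
static double sum_octant(const double c0[3], double s, int depth)
{
    double total = 0.0;
    for (int i = 0; i < 8; i++) {        /* outer loop: one pass per octant */
        double ck0[3];
        for (int j = 0; j < 3; j++) {    /* inner loop: one pass per coordinate */
            ck0[j] = c0[j];
            if (1 << j & i)              /* constant once i and j are unrolled */
                ck0[j] += s;
            total += ck0[j];
        }
        if (depth > 0)                   /* recursive call sits on the hot path */
            total += sum_octant(ck0, s * 0.5, depth - 1);
    }
    return total;
}
```

With a shape like this, unrolling the outer loop is what lets the inner branch fold away, which is the optimization opportunity gcc 4.7.x found and 4.8.x misses.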
Comment 1 Richard Biener 2014-03-31 09:23:30 UTC
Caused by r193246:

193246    hubicka       /* If there is call on a hot path through the loop, then
193246    hubicka        there is most probably not much to optimize.  */
193246    hubicka       else if (size.num_non_pure_calls_on_hot_path)
138522    rguenth       {
138522    rguenth         if (dump_file && (dump_flags & TDF_DETAILS))
193246    hubicka           fprintf (dump_file, "Not unrolling loop %d: "
193246    hubicka                    "contains call and code would grow.\n",
138522    rguenth                    loop->num);
138522    rguenth         return false;
138522    rguenth       }

so we conclude

size: 59-8, last_iteration: 52-5
  Loop size: 59
  Estimated size after unrolling: 269
Not unrolling loop 2: contains call and code would grow.

so it was disabled on purpose.  Not sure if it's worth special-casing
self-recursive calls the same as pure/const ones.
Comment 2 Jan Hubicka 2014-04-02 23:17:11 UTC
> 193246    hubicka       /* If there is call on a hot path through the loop,
> then
> 193246    hubicka        there is most probably not much to optimize.  */
> 193246    hubicka       else if (size.num_non_pure_calls_on_hot_path)
> 138522    rguenth       {
> 138522    rguenth         if (dump_file && (dump_flags & TDF_DETAILS))
> 193246    hubicka           fprintf (dump_file, "Not unrolling loop %d: "
> 193246    hubicka                    "contains call and code would grow.\n",
> 138522    rguenth                    loop->num);
> 138522    rguenth         return false;
> 138522    rguenth       }
> 
> so we conclude
> 
> size: 59-8, last_iteration: 52-5
>   Loop size: 59
>   Estimated size after unrolling: 269
> Not unrolling loop 2: contains call and code would grow.
> 
> so it was disabled on purpose.  Not sure if it's worth special-casing
> self-recursive calls the same as pure/const ones.

Not by the logic above, because self recursion is not really better understood
by optimizers than normal function calls.  I am surprised the unrolling makes
a 10-15% difference in this case then. Is there something that makes the unrolled
loop optimize better?

Honza
Comment 3 Christoph Breitkopf 2014-04-03 06:18:30 UTC
It's this conditional in the inner loop. The expression becomes constant only if both loops are unrolled (i and j are the loop counters):

   if (1<<j & i)
    ck0[j] += s;

This condition is probably not perfectly predictable even by modern branch predictors, so I assume this is the main reason for the regression.

Chris
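(To spell out what full unrolling buys here: once both i and j are compile-time constants, `1 << j & i` folds and each branch disappears. A sketch of what, say, the i == 5 outer iteration becomes; the helper name is illustrative, not from the benchmark:)

```c
/* What the inner loop body reduces to for the i == 5 (binary 101)
   outer iteration once both loops are fully unrolled: the condition
   `1 << j & i` folds per j, so no branches remain.  */
static void iteration_i5(double ck0[3], const double c0[3], double s)
{
    ck0[0] = c0[0] + s;   /* (1<<0 & 5) != 0: branch folded to taken   */
    ck0[1] = c0[1];       /* (1<<1 & 5) == 0: branch folded to skipped */
    ck0[2] = c0[2] + s;   /* (1<<2 & 5) != 0: branch folded to taken   */
}
```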
Comment 4 Richard Biener 2014-05-22 09:01:17 UTC
GCC 4.8.3 is being released, adjusting target milestone.
Comment 5 Jakub Jelinek 2014-12-19 13:34:46 UTC
GCC 4.8.4 has been released.
Comment 6 Richard Biener 2015-01-15 14:44:07 UTC
In the size estimation we have

 BB: 46, after_exit: 0
  size:   1 _319 = MEM[(double *)c0_188(D) + 16B];
  size:   1 _321 = i_171 >> 2;
   Constant expression will be folded away.
  size:   2 if (_321 != 0)

so somehow we fail to see that the conditional is constant after peeling.

Reworking things so we can use a lattice (also avoid calling simple_iv
too many times) will improve that (but also the code doesn't then
only consider MAX (taken-paths-lengths) in case there is an else block).
It looks like reworking this to do a domwalk may be profitable.

The simplistic approach gets the estimated size after unrolling down
to 238 (which still won't unroll the loop).  I think we should also
consider size->eliminated_by_peeling in the call heuristic - the
call itself might be a minor distraction and especially eliminated
branches might result in a good speed benefit.

That said, the core algorithm for likely-eliminated should be improved,
but it won't help by itself.
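(A toy model of the suggested heuristic change, purely to illustrate the idea; this is not GCC's actual tree-ssa-loop-ivcanon code, and the function and parameter names are made up:)

```c
/* Toy sketch: account for statements likely eliminated by peeling
   *before* comparing against the growth limit, so a loop whose
   branches all fold away can stay an unrolling candidate even
   though it contains a call on the hot path.  Illustrative only.  */
static int toy_should_unroll(int body_size, int eliminated_by_peeling,
                             int n_unroll, int max_size)
{
    int estimated = (body_size - eliminated_by_peeling) * n_unroll;
    return estimated <= max_size;
}
```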
Comment 7 Richard Biener 2015-01-15 14:44:41 UTC
Let's assign this to me for next stage1.
Comment 8 Richard Biener 2015-03-18 13:17:28 UTC
*** Bug 65207 has been marked as a duplicate of this bug. ***
Comment 9 Richard Biener 2015-06-23 08:17:30 UTC
The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.
Comment 10 Jakub Jelinek 2015-06-26 19:58:38 UTC
GCC 4.9.3 has been released.
Comment 11 Richard Biener 2015-06-30 11:02:49 UTC
*** Bug 66263 has been marked as a duplicate of this bug. ***
Comment 12 Richard Biener 2016-08-03 11:34:45 UTC
GCC 4.9 branch is being closed
Comment 13 Andrew Pinski 2016-08-29 04:34:41 UTC
As duplicate bug 65207 points out, this is a SPEC CPU related bug.
Comment 14 Jakub Jelinek 2017-10-10 13:26:55 UTC
GCC 5 branch is being closed
Comment 15 Jakub Jelinek 2018-10-26 10:11:25 UTC
GCC 6 branch is being closed
Comment 16 Richard Biener 2019-11-14 07:57:14 UTC
The GCC 7 branch is being closed, re-targeting to GCC 8.4.
Comment 17 Jakub Jelinek 2020-03-04 09:38:25 UTC
GCC 8.4.0 has been released, adjusting target milestone.
Comment 18 Jakub Jelinek 2021-05-14 09:47:07 UTC
GCC 8 branch is being closed.
Comment 19 Richard Biener 2021-06-01 08:06:12 UTC
GCC 9.4 is being released, retargeting bugs to GCC 9.5.
Comment 20 Richard Biener 2022-05-27 09:35:06 UTC
GCC 9 branch is being closed
Comment 21 Jakub Jelinek 2022-06-28 10:30:50 UTC
GCC 10.4 is being released, retargeting bugs to GCC 10.5.
Comment 22 Richard Biener 2023-07-07 10:30:12 UTC
GCC 10 branch is being closed.
Comment 23 Andrew Pinski 2024-03-11 06:18:07 UTC
One thing I noticed, unrelated to the unrolling, is:
```
  if (_221 != 0)
    goto <bb 50>; [50.00%]
  else
    goto <bb 49>; [50.00%]

  <bb 49> [local count: 163152564]:
  _212 = _218 - s_202;
  goto <bb 51>; [100.00%]

  <bb 50> [local count: 163152564]:
  _222 = s_202 + _218;

  <bb 51> [local count: 326305128]:
  # ck0$0_144 = PHI <_218(49), _222(50)>
  # prephitmp_211 = PHI <_212(49), _218(50)>
```

This does not use conditional moves on aarch64 (or x86_64) even though it could/should.
The pattern shows up 3 times due to the unrolling of the inner loop there.
```
   ck0[j] = c0[j];
   if (1<<j & i)
    ck0[j] += s;
   if (r->rop[j] < ck0[j] - 1.0*s)
```
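(The if-converted form of the PHI pattern above can be sketched like this; a branchless sketch matching the dump's semantics, with illustrative names:)

```c
/* Branchless sketch of the diamond from the dump: both arms are cheap
   arithmetic on the same inputs, so computing both and selecting
   (a conditional move at the ISA level) avoids the 50%-predicted
   branch.  Names are illustrative, not taken from the IL.  */
static void select_pair(int cond, double v, double s,
                        double *ck0_out, double *pre_out)
{
    double plus  = v + s;            /* bb 50: _222 = s_202 + _218   */
    double minus = v - s;            /* bb 49: _212 = _218 - s_202   */
    *ck0_out = cond ? plus : v;      /* ck0$0_144    = PHI <_218, _222> */
    *pre_out = cond ? v : minus;     /* prephitmp_211 = PHI <_212, _218> */
}
```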
Comment 24 Richard Biener 2024-07-19 12:56:47 UTC
GCC 11 branch is being closed.