Bug 88440 - size optimization of memcpy-like code
Summary: size optimization of memcpy-like code
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 8.2.0
Importance: P3 normal
Target Milestone: 10.0
Assignee: Richard Biener
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2018-12-11 00:01 UTC by Trass3r
Modified: 2021-02-16 15:41 UTC
CC List: 4 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2018-12-11 00:00:00


Attachments
patch (1.21 KB, patch)
2019-01-02 09:46 UTC, Richard Biener
SPEC2006 and SPEC2017 report (58.30 KB, application/pdf)
2019-05-22 08:33 UTC, Martin Liška

Description Trass3r 2018-12-11 00:01:26 UTC
https://godbolt.org/z/RTji7B

void foo(char* restrict dst, const char* buf) {
    for (int i=0; i<8; ++i)
        *dst++ = *buf++;
}

$ gcc -Os
$ gcc -O2
.L2:
        mov     dl, BYTE PTR [rsi+rax]
        mov     BYTE PTR [rdi+rax], dl
        inc     rax
        cmp     rax, 8
        jne     .L2

$ gcc -O3
        mov     rax, QWORD PTR [rsi]
        mov     QWORD PTR [rdi], rax

$ arm-none-eabi-gcc -O3 -mthumb -mcpu=cortex-m4
        ldr     r3, [r1]  @ unaligned
        ldr     r2, [r1, #4]      @ unaligned
        str     r2, [r0, #4]      @ unaligned
        str     r3, [r0]  @ unaligned

The -O3 code is both faster and smaller, on both ARM and x64:
"note: Loop 1 distributed: split to 0 loops and 1 library calls."

Should be considered for -O2 and -Os as well.
Comment 1 Andrew Pinski 2018-12-11 00:07:18 UTC
I thought I had a dup of this bug somewhere which was asking for this optimization to be moved to -O2 (and -Os) and above rather than keeping it at -O3 and above.
Comment 2 Richard Biener 2018-12-11 09:28:18 UTC
I think distributing patterns is reasonable for -O[2s].  I think we kept it at -O3 because of compile-time concerns (dependence analysis is quadratic).

But then we still don't do basic vectorization at -O[2s] either... (same
compile-time for not too much gain issue).

So if somebody is willing to do some compile-time / effect numbers
(bootstrap / SPEC?) then we can consider enabling loop-distribute-patterns
for -O[2s] for GCC 10.
Comment 3 Trass3r 2018-12-12 01:09:26 UTC
Adding -ftree-loop-distribute-patterns to -Os does not seem to make a difference though.
Comment 4 rguenther@suse.de 2018-12-12 08:31:35 UTC
On Wed, 12 Dec 2018, hoganmeier at gmail dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440
> 
> --- Comment #3 from krux <hoganmeier at gmail dot com> ---
> Adding -ftree-loop-distribute-patterns to -Os does not seem to make a
> difference though.

Possibly because of

      /* Don't distribute multiple exit edges loop, or cold loop.  */
      if (!single_exit (loop)
          || !optimize_loop_for_speed_p (loop))
        continue;
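
One way this guard could be relaxed for the pattern-only case might look like
the following (just a sketch; the only_patterns_p flag is hypothetical and an
actual patch may do something different):

      /* Don't distribute loops with multiple exit edges.  Distribute cold
         loops only when merely replacing them with builtin patterns.  */
      if (!single_exit (loop)
          || (!only_patterns_p && !optimize_loop_for_speed_p (loop)))
        continue;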
Comment 5 Martin Liška 2018-12-27 08:01:02 UTC
Sure, I can help with measurements during the next stage1. Richi, can you please attach a patch that enables the optimization for -O[2s]?
Comment 6 Richard Biener 2019-01-02 09:46:44 UTC
Created attachment 45313 [details]
patch

This enables distribution of patterns at -O[2s]+ and optimizes the testcase
at -Os by adjusting the guards in loop distribution.

Note that the interesting bits are compile-time, binary-size and performance
mainly at -O2, and possibly size at -Os.

I suspect that at -O2 w/o profiling most loops would be optimize_loop_for_speed
anyway, so changing the heuristics isn't so bad, but of course enabling
distribution at -O2 might incur a penalty.
Comment 7 Martin Liška 2019-05-16 15:20:14 UTC
(In reply to Richard Biener from comment #6)
> Created attachment 45313 [details]
> patch
> 
> This enables distribution of patterns at -O[2s]+ and optimizes the testcase
> at -Os by adjusting the guards in loop distribution.
> 
> Note that the interesting bits are compile-time, binary-size and performance
> mainly at -O2, and possibly size at -Os.
> 
> I suspect that at -O2 w/o profiling most loops would be
> optimize_loop_for_speed
> anyway, so changing the heuristics isn't so bad, but of course enabling
> distribution at -O2 might incur a penalty.

So far I have build numbers from a Zen machine with -j16:

SPEC2006:

  Elapsed compile for '400.perlbench': 00:00:05 (5)
  Elapsed compile for '401.bzip2': 00:00:02 (2)
  Elapsed compile for '403.gcc': 00:00:11 (11)
  Elapsed compile for '429.mcf': 00:00:01 (1)
  Elapsed compile for '445.gobmk': 00:00:04 (4)
  Elapsed compile for '456.hmmer': 00:00:01 (1)
  Elapsed compile for '458.sjeng': 00:00:01 (1)
  Elapsed compile for '462.libquantum': 00:00:01 (1)
  Elapsed compile for '464.h264ref': 00:00:04 (4)
  Elapsed compile for '471.omnetpp': 00:00:05 (5)
  Elapsed compile for '473.astar': 00:00:01 (1)
  Elapsed compile for '483.xalancbmk': 00:00:21 (21)
  Elapsed compile for '410.bwaves': 00:00:01 (1)
  Elapsed compile for '416.gamess': 00:00:20 (20)
  Elapsed compile for '433.milc': 00:00:02 (2)
  Elapsed compile for '434.zeusmp': 00:00:02 (2)
  Elapsed compile for '435.gromacs': 00:00:06 (6)
  Elapsed compile for '436.cactusADM': 00:00:04 (4)
  Elapsed compile for '437.leslie3d': 00:00:04 (4)
  Elapsed compile for '444.namd': 00:00:09 (9)
  Elapsed compile for '447.dealII': 00:00:15 (15)
  Elapsed compile for '450.soplex': 00:00:03 (3)
  Elapsed compile for '453.povray': 00:00:04 (4)
  Elapsed compile for '454.calculix': 00:00:06 (6)
  Elapsed compile for '459.GemsFDTD': 00:00:09 (9)
  Elapsed compile for '465.tonto': 00:00:53 (53)
  Elapsed compile for '470.lbm': 00:00:02 (2)
  Elapsed compile for '481.wrf': 00:00:38 (38)
  Elapsed compile for '482.sphinx3': 00:00:01 (1)

All differences before and after are within 1s, which is the measurement granularity.

SPEC 2017:

  Elapsed compile for '503.bwaves_r': 00:00:01 (1)
  Elapsed compile for '507.cactuBSSN_r': 00:00:25 (25)
  Elapsed compile for '508.namd_r': 00:00:09 (9)
  Elapsed compile for '510.parest_r': 00:00:46 (46)
  Elapsed compile for '511.povray_r': 00:00:04 (4)
  Elapsed compile for '519.lbm_r': 00:00:01 (1)
  Elapsed compile for '521.wrf_r': 00:05:46 (346)
  Elapsed compile for '526.blender_r': 00:00:25 (25)
  Elapsed compile for '527.cam4_r': 00:00:37 (37)
  Elapsed compile for '538.imagick_r': 00:00:11 (11)
  Elapsed compile for '544.nab_r': 00:00:01 (1)
  Elapsed compile for '549.fotonik3d_r': 00:00:07 (7)
  Elapsed compile for '554.roms_r': 00:00:06 (6)
  Elapsed compile for '500.perlbench_r': 00:00:09 (9)
  Elapsed compile for '502.gcc_r': 00:00:44 (44)
  Elapsed compile for '505.mcf_r': 00:00:01 (1)
  Elapsed compile for '520.omnetpp_r': 00:00:12 (12)
  Elapsed compile for '523.xalancbmk_r': 00:00:25 (25)
  Elapsed compile for '525.x264_r': 00:00:09 (9)
  Elapsed compile for '531.deepsjeng_r': 00:00:02 (2)
  Elapsed compile for '541.leela_r': 00:00:03 (3)
  Elapsed compile for '548.exchange2_r': 00:00:04 (4)
  Elapsed compile for '557.xz_r': 00:00:01 (1)

There's only one difference:

521.wrf_r: 310 -> 346s
Comment 8 rguenther@suse.de 2019-05-16 18:55:33 UTC
On Thu, 16 May 2019, marxin at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440
> 
> --- Comment #7 from Martin Liška <marxin at gcc dot gnu.org> ---
> So far I have build numbers from a Zen machine with -j16:
... 
> There's only one difference:
> 
> 521.wrf_r: 310 -> 346s

Ick.  I currently see no limiting on the size of loops in
loop distribution.  One easy fix would be to limit the worklist
size in find_seed_stmts_for_distribution with a --param
we can lower at -O[2s]; another would be to limit
loop nest depth similarly.

A profile might be interesting here as well...
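
A sketch of the kind of cut-off meant here (the param name is made up for
illustration, and the exact placement inside find_seed_stmts_for_distribution
may differ):

  /* Hypothetical cut-off: give up on distributing very large loops instead
     of feeding them to the quadratic dependence analysis.  */
  if (work_list->length () > (unsigned) PARAM_VALUE (PARAM_MAX_LOOP_DIST_SEED_STMTS))
    return false;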
Comment 9 Martin Liška 2019-05-17 08:35:11 UTC
Comparing all files in wrf, I've got:

╔═════════════════════════════════════╤════════╤════════╤═════════╗
║ Filename                            │ Before │ After  │ Change  ║
╠═════════════════════════════════════╪════════╪════════╪═════════╣
║ module_configure.fppized.f90        │ 127.81 │ 163.39 │ 127.84% ║
║ d1fgkb.fppized.f90                  │ 0.21   │ 0.23   │ 109.52% ║
║ solve_interface.fppized.f90         │ 0.35   │ 0.38   │ 108.57% ║
║ module_ltng_crmpr92.fppized.f90     │ 0.28   │ 0.3    │ 107.14% ║
║ module_cu_kf.fppized.f90            │ 1.42   │ 1.51   │ 106.34% ║
║ mradbg.fppized.f90                  │ 0.32   │ 0.34   │ 106.25% ║
║ module_sf_pxlsm.fppized.f90         │ 0.55   │ 0.58   │ 105.45% ║
║ module_domain_type.fppized.f90      │ 0.19   │ 0.2    │ 105.26% ║
║ module_shallowcu_driver.fppized.f90 │ 0.19   │ 0.2    │ 105.26% ║
║ module_bl_gfs.fppized.f90           │ 0.78   │ 0.82   │ 105.13% ║
║ module_bl_myjurb.fppized.f90        │ 0.78   │ 0.82   │ 105.13% ║
║ Meat.fppized.f90                    │ 0.39   │ 0.41   │ 105.13% ║
...

So the only significant offender is the module_configure.fppized.f90 file. Let me profile it.
Comment 10 Martin Liška 2019-05-17 08:54:56 UTC
> So the only significant offender is module_configure.fppized.f90 file. Let
> me profile it.

Time profile before/after:

╔══════════════════════════╤════════╤════════╤═════════╗
║ PASS                     │ Before │ After  │ Change  ║
╠══════════════════════════╪════════╪════════╪═════════╣
║ backwards jump threading │ 6.29   │ 6.16   │ 97.93%  ║
║ integrated RA            │ 6.76   │ 6.41   │ 94.82%  ║
║ tree SSA incremental     │ 9.01   │ 11.16  │ 123.86% ║
║ LRA create live ranges   │ 15.68  │ 40.02  │ 255.23% ║
║ PRE                      │ 23.24  │ 32.32  │ 139.07% ║
║ alias stmt walking       │ 27.69  │ 28.75  │ 103.83% ║
║ phase opt and generate   │ 124.13 │ 163.95 │ 132.08% ║
║ TOTAL                    │ 125.39 │ 165.17 │ 131.73% ║
╚══════════════════════════╧════════╧════════╧═════════╝

Richi, do you want a perf report, or will you come up with a patch that introduces the aforementioned params?
Comment 11 rguenther@suse.de 2019-05-17 09:04:06 UTC
On Fri, 17 May 2019, marxin at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440
> 
> --- Comment #10 from Martin Liška <marxin at gcc dot gnu.org> ---
> > So the only significant offender is module_configure.fppized.f90 file. Let
> > me profile it.
...
> Richi, do you want a perf report, or will you come up with a patch that
> introduces the aforementioned params?

Can you share -fopt-report-loop differences?  From the above I would
guess we split a lot of loops, meaning the memcpy/memmove/memset
calls are in the "middle" and we have to split loops (how many
calls are detected here?).  If that's true another way would be
to only allow calls at head or tail position, thus a single
non-builtin partition.
Comment 12 Martin Liška 2019-05-17 09:15:26 UTC
> 
> Can you share -fopt-report-loop differences?  From the above I would
> guess we split a lot of loops, meaning the memcpy/memmove/memset
> calls are in the "middle" and we have to split loops (how many
> calls are detected here?).  If that's true another way would be
> to only allow calls at head or tail position, thus a single
> non-builtin partition.

I now see ~1400 new lines:

module_configure.fppized.f90:7993:0: optimized: Loop 10 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:7995:0: optimized: Loop 11 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:8000:0: optimized: Loop 15 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:8381:0: optimized: Loop 77 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:8383:0: optimized: Loop 78 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:8498:0: optimized: Loop 105 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:9742:0: optimized: Loop 169 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:9978:0: optimized: Loop 207 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:9979:0: optimized: Loop 208 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:9980:0: optimized: Loop 209 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:9981:0: optimized: Loop 210 distributed: split to 0 loops and 1 library calls.
...
Comment 13 rguenther@suse.de 2019-05-17 10:56:58 UTC
On Fri, 17 May 2019, marxin at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440
> 
> --- Comment #12 from Martin Liška <marxin at gcc dot gnu.org> ---
> I now see ~1400 new lines:
> 
> module_configure.fppized.f90:7993:0: optimized: Loop 10 distributed: split to 0
> loops and 1 library calls.
> ...

All with "0 loops"?  That disputes my theory :/
Comment 14 Martin Liška 2019-05-17 11:44:57 UTC
(In reply to rguenther@suse.de from comment #13)
> All with "0 loops"?  That disputes my theory :/

Yep. All of these have the form:

  <bb 1809> [local count: 118163158]:
  # S.1565_41079 = PHI <1(2028), S.1565_32687(3351)>
  # ivtmp_38850 = PHI <11(2028), ivtmp_38848(3351)>
  _3211 = S.1565_41079 + -1;
  _3212 = fire_ignition_start_y1[_3211];
  MEM[(real(kind=4)[11] *)&model_config_rec + 101040B][_3211] = _3212;
  S.1565_32687 = S.1565_41079 + 1;
  ivtmp_38848 = ivtmp_38850 - 1;
  if (ivtmp_38848 == 0)
    goto <bb 2027>; [9.09%]
  else
    goto <bb 3351>; [90.91%]

  <bb 3351> [local count: 107425740]:
  goto <bb 1809>; [100.00%]

  <bb 2027> [local count: 10737418]:

  <bb 1810> [local count: 118163158]:
  # S.1566_41080 = PHI <1(2027), S.1566_32689(3350)>
  # ivtmp_38856 = PHI <11(2027), ivtmp_38854(3350)>
  _3213 = S.1566_41080 + -1;
  _3214 = fire_ignition_end_x1[_3213];
  MEM[(real(kind=4)[11] *)&model_config_rec + 101084B][_3213] = _3214;
  S.1566_32689 = S.1566_41080 + 1;
  ivtmp_38854 = ivtmp_38856 - 1;
  if (ivtmp_38854 == 0)
    goto <bb 2026>; [9.09%]
  else
    goto <bb 3350>; [90.91%]

  <bb 3350> [local count: 107425740]:
  goto <bb 1810>; [100.00%]

  <bb 2026> [local count: 10737418]:

  <bb 1811> [local count: 118163158]:
  # S.1567_41081 = PHI <1(2026), S.1567_32691(3349)>
  # ivtmp_38860 = PHI <11(2026), ivtmp_38858(3349)>
  _3215 = S.1567_41081 + -1;
  _3216 = fire_ignition_end_y1[_3215];
  MEM[(real(kind=4)[11] *)&model_config_rec + 101128B][_3215] = _3216;
  S.1567_32691 = S.1567_41081 + 1;
  ivtmp_38858 = ivtmp_38860 - 1;
  if (ivtmp_38858 == 0)
    goto <bb 2025>; [9.09%]
  else
    goto <bb 3349>; [90.91%]

  <bb 3349> [local count: 107425740]:
  goto <bb 1811>; [100.00%]

  <bb 2025> [local count: 10737418]:
...


It's a configure module, so it probably contains that many loops for the various configs.
Comment 15 rguenther@suse.de 2019-05-17 12:01:18 UTC
On Fri, 17 May 2019, marxin at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440
> 
> --- Comment #14 from Martin Liška <marxin at gcc dot gnu.org> ---
...
> It's a configure module, so it probably contains that many loops for the
> various configs.

Hmm, so then it might be that we run into some CFG complexity cut-off
for PRE and RA before the patch but not after it, since the CFG should
simplify a lot once we make memcpy out of all of the above loops...
Comment 16 Martin Liška 2019-05-22 08:33:48 UTC
Created attachment 46393 [details]
SPEC2006 and SPEC2017 report

The report presents the difference between master (first gray column) and Richi's patch (the last 2 columns, since the tests were run twice).

There are some significant improvements and some regressions. Note that 436.cactusADM is jumping around by 21%!
Comment 17 Martin Liška 2019-05-22 08:36:03 UTC
> 
> Hmm, so then it might be that we run into some CFG complexity cut-off
> for PRE and RA before the patch but not after it, since the CFG should
> simplify a lot once we make memcpy out of all of the above loops...

I guess so. Note that even without the patch the file takes 2 minutes to compile. It's somewhat weird.

I'm done with the patch measurements. Richi, do you see it as beneficial to enable it with -O2?
Comment 18 Richard Biener 2019-05-22 09:24:33 UTC
OK, let's do it.
Comment 19 rguenther@suse.de 2019-05-22 09:33:13 UTC
On Wed, 22 May 2019, marxin at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88440
> 
> Martin Liška <marxin at gcc dot gnu.org> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|ASSIGNED                    |NEW
>            Assignee|marxin at gcc dot gnu.org          |unassigned at gcc dot gnu.org
> 
> --- Comment #17 from Martin Liška <marxin at gcc dot gnu.org> ---
> 
> I'm done with the patch measurements. Richi, do you see it as beneficial to
> enable it with -O2?

OK, let's do it.
Comment 20 Richard Biener 2019-05-22 11:23:03 UTC
(In reply to rguenther@suse.de from comment #11)
> Can you share -fopt-report-loop differences?  From the above I would
> guess we split a lot of loops, meaning the memcpy/memmove/memset
> calls are in the "middle" and we have to split loops (how many
> calls are detected here?).  If that's true another way would be
> to only allow calls at head or tail position, thus a single
> non-builtin partition.

Some analysis shows, focusing on LRA lives, that unpatched we have

lra live on 53 BBs for wrf_alt_nml_obsolete
lra live on 5 BBs for set_config_as_buffer
lra live on 5 BBs for get_config_as_buffer
lra live on 3231 BBs for initial_config
lra live on 3231 BBs for initial_config

while patched

lra live on 53 BBs for wrf_alt_nml_obsolete
lra live on 5 BBs for set_config_as_buffer
lra live on 5 BBs for get_config_as_buffer
lra live on 465 BBs for initial_config
lra live on 465 BBs for initial_config

so it's the initial_config function.  We need 8 DF worklist iterations
in both cases, but presumably the amount of local work is larger,
or the local work isn't linear in the size of the BBs.  The "work"
it does to not update hardregs by anding ~all_hard_regs_bitmap seems
somewhat pointless unless the functions do not handle those correctly.
But that's micro-optimizing, likewise adding a bitmap_ior_and_compl_and_compl
function to avoid the temporary bitmap in live_trans_fun.

perf tells us most time is spent in process_bb_lives, not in the dataflow
problem though, and there in ix86_hard_regno_call_part_clobbered
(the function has a _lot_ of calls...).

Also w/o pattern detection the lra_simple_p heuristic kicks in since
we have a lot more BBs.

  /* If there are too many pseudos and/or basic blocks (e.g. 10K
     pseudos and 10K blocks or 100K pseudos and 1K blocks), we will
     use simplified and faster algorithms in LRA.  */
  lra_simple_p
    = (ira_use_lra_p
       && max_reg_num () >= (1 << 26) / last_basic_block_for_fn (cfun));

The code is auto-generated and large (I have a single source file using
no modules now but still too large and similar to SPEC to attach here),
so I wouldn't worry too much here.  The above magic constant should be
a --param though.
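
Turning that constant into a knob could look roughly like this (the param
name is made up for illustration):

  /* Hypothetical: replace the hard-coded 1 << 26 with a tunable --param.  */
  lra_simple_p
    = (ira_use_lra_p
       && max_reg_num () >= (PARAM_VALUE (PARAM_LRA_SIMPLE_THRESHOLD)
                             / last_basic_block_for_fn (cfun)));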
Comment 21 Richard Biener 2019-05-22 11:49:22 UTC
Ick.

static inline void
check_pseudos_live_through_calls (int regno,
                                  HARD_REG_SET last_call_used_reg_set,
                                  rtx_insn *call_insn)
{
...
  for (hr = 0; HARD_REGISTER_NUM_P (hr); hr++)
    if (targetm.hard_regno_call_part_clobbered (call_insn, hr,
                                                PSEUDO_REGNO_MODE (regno)))
      add_to_hard_reg_set (&lra_reg_info[regno].conflict_hard_regs,
                           PSEUDO_REGNO_MODE (regno), hr);

this loop repeatedly computes an implicit hard-reg set of the hard regs
that are partly clobbered by the call, for the _same_ actual
instruction, since check_pseudos_live_through_calls is called
via

      /* Mark each defined value as live.  We need to do this for
         unused values because they still conflict with quantities
         that are live at the time of the definition.  */
      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
        {
          if (reg->type != OP_IN)
            {
              update_pseudo_point (reg->regno, curr_point, USE_POINT);
              mark_regno_live (reg->regno, reg->biggest_mode);
              check_pseudos_live_through_calls (reg->regno,
                                                last_call_used_reg_set,
                                                call_insn);
...
        }

and

              EXECUTE_IF_SET_IN_SPARSESET (pseudos_live, j)
                {
                  IOR_HARD_REG_SET (lra_reg_info[j].actual_call_used_reg_set,
                                    this_call_used_reg_set);

                  if (flush)
                    check_pseudos_live_through_calls (j,
                                                      last_call_used_reg_set,
                                                      last_call_insn);
                }

and

      /* Mark each used value as live.  */
      for (reg = curr_id->regs; reg != NULL; reg = reg->next)
        if (reg->type != OP_OUT)
          {
            if (reg->type == OP_IN)
              update_pseudo_point (reg->regno, curr_point, USE_POINT);
            mark_regno_live (reg->regno, reg->biggest_mode);
            check_pseudos_live_through_calls (reg->regno,
                                              last_call_used_reg_set,
                                              call_insn);
          }

and

  EXECUTE_IF_SET_IN_BITMAP (df_get_live_in (bb), FIRST_PSEUDO_REGISTER, j, bi)
    {
      if (sparseset_cardinality (pseudos_live_through_calls) == 0)
        break;
      if (sparseset_bit_p (pseudos_live_through_calls, j))
        check_pseudos_live_through_calls (j, last_call_used_reg_set, call_insn);
    }

the pseudo's mode may change, but I guess usually it doesn't.  I also wonder
why the target hook doesn't return a hard-reg-set ...
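
One way to avoid the repeated hook queries might be to cache the computed set
per (call_insn, mode) pair and reuse it for every pseudo (just a sketch of the
idea; names and placement are hypothetical):

  /* Hypothetical cache inside check_pseudos_live_through_calls: compute the
     partly-clobbered set once per call insn and mode instead of re-querying
     the hook for every pseudo.  */
  static rtx_insn *cached_insn;
  static machine_mode cached_mode;
  static HARD_REG_SET cached_set;
  machine_mode mode = PSEUDO_REGNO_MODE (regno);
  if (call_insn != cached_insn || mode != cached_mode)
    {
      cached_insn = call_insn;
      cached_mode = mode;
      CLEAR_HARD_REG_SET (cached_set);
      for (int hr = 0; HARD_REGISTER_NUM_P (hr); hr++)
        if (targetm.hard_regno_call_part_clobbered (call_insn, hr, mode))
          add_to_hard_reg_set (&cached_set, mode, hr);
    }
  IOR_HARD_REG_SET (lra_reg_info[regno].conflict_hard_regs, cached_set);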

That said, the above code doesn't scale well for functions with a lot of
calls, at least; also, the passed call_insn isn't the current insn and
might even be NULL.  All targets but aarch64 do not even look at the actual
instruction (even more of an argument for re-designing the hook with its use
in mind).

I guess an artificial testcase with a lot of calls and a lot of live
pseudos (even single-BB) should show this issue easily.
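
Something like the following is presumably what is meant (a guess at such a
testcase: one basic block, many calls, many values live across them):

/* Hypothetical stress test: lots of calls in a single BB with many values
   live across them, so check_pseudos_live_through_calls runs very often.  */
extern void ext (void);
double
many_calls (double a, double b, double c, double d,
            double e, double f, double g, double h)
{
  ext (); ext (); ext (); ext (); ext (); ext (); ext (); ext ();
  ext (); ext (); ext (); ext (); ext (); ext (); ext (); ext ();
  return a + b + c + d + e + f + g + h;
}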

Samples: 579  of event 'cycles:ppp', Event count (approx.): 257134187434191     
Overhead  Command  Shared Object     Symbol                                     
  22.26%  f951     f951              [.] process_bb_lives
  15.06%  f951     f951              [.] ix86_hard_regno_call_part_clobbered
   8.55%  f951     f951              [.] concat
   6.88%  f951     f951              [.] find_base_term
   3.60%  f951     f951              [.] get_ref_base_and_extent
   3.27%  f951     f951              [.] find_base_term
   2.95%  f951     f951              [.] make_hard_regno_dead
Comment 22 Richard Biener 2019-05-22 12:25:21 UTC
The code in question was originally added with r202721 by Vlad and likely
became more costly after making the target macro a hook (no inlining
anymore).
Comment 23 Richard Biener 2019-05-23 11:35:47 UTC
Author: rguenth
Date: Thu May 23 11:35:16 2019
New Revision: 271553

URL: https://gcc.gnu.org/viewcvs?rev=271553&root=gcc&view=rev
Log:
2019-05-23  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/88440
	* opts.c (default_options_table): Enable -ftree-loop-distribute-patterns
	at -O[2s]+.
	* tree-loop-distribution.c (generate_memset_builtin): Fold the
	generated call.
	(generate_memcpy_builtin): Likewise.
	(distribute_loop): Pass in whether to only distribute patterns.
	(prepare_perfect_loop_nest): Also allow size optimization.
	(pass_loop_distribution::execute): When optimizing a loop
	nest for size allow pattern replacement.

	* gcc.dg/tree-ssa/ldist-37.c: New testcase.
	* gcc.dg/tree-ssa/ldist-38.c: Likewise.
	* gcc.dg/vect/vect.exp: Add -fno-tree-loop-distribute-patterns.
	* gcc.dg/tree-ssa/ldist-37.c: Adjust.
	* gcc.dg/tree-ssa/ldist-38.c: Likewise.
	* g++.dg/tree-ssa/pr78847.C: Likewise.
	* gcc.dg/autopar/pr39500-1.c: Likewise.
	* gcc.dg/autopar/reduc-1char.c: Likewise.
	* gcc.dg/autopar/reduc-7.c: Likewise.
	* gcc.dg/tree-ssa/ivopts-lt-2.c: Likewise.
	* gcc.dg/tree-ssa/ivopts-lt.c: Likewise.
	* gcc.dg/tree-ssa/predcom-dse-1.c: Likewise.
	* gcc.dg/tree-ssa/predcom-dse-2.c: Likewise.
	* gcc.dg/tree-ssa/predcom-dse-3.c: Likewise.
	* gcc.dg/tree-ssa/predcom-dse-4.c: Likewise.
	* gcc.dg/tree-ssa/prefetch-7.c: Likewise.
	* gcc.dg/tree-ssa/prefetch-8.c: Likewise.
	* gcc.dg/tree-ssa/prefetch-9.c: Likewise.
	* gcc.dg/tree-ssa/scev-11.c: Likewise.
	* gcc.dg/vect/costmodel/i386/costmodel-vect-31.c: Likewise.
	* gcc.dg/vect/costmodel/i386/costmodel-vect-33.c: Likewise.
	* gcc.dg/vect/costmodel/x86_64/costmodel-vect-31.c: Likewise.
	* gcc.dg/vect/costmodel/x86_64/costmodel-vect-33.c: Likewise.
	* gcc.target/i386/pr30970.c: Likewise.
	* gcc.target/i386/vect-double-1.c: Likewise.
	* gcc.target/i386/vect-double-2.c: Likewise.
	* gcc.dg/tree-ssa/gen-vect-2.c: Likewise.
	* gcc.dg/tree-ssa/gen-vect-26.c: Likewise.
	* gcc.dg/tree-ssa/gen-vect-28.c: Likewise.
	* gcc.dg/tree-ssa/gen-vect-32.c: Likewise.
	* gfortran.dg/vect/vect-5.f90: Likewise.
	* gfortran.dg/vect/vect-8.f90: Likewise.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/opts.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/g++.dg/tree-ssa/pr78847.C
    trunk/gcc/testsuite/gcc.dg/autopar/pr39500-1.c
    trunk/gcc/testsuite/gcc.dg/autopar/reduc-1char.c
    trunk/gcc/testsuite/gcc.dg/autopar/reduc-7.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/gen-vect-2.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/gen-vect-26.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/gen-vect-28.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/gen-vect-32.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt-2.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-1.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-2.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-3.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-4.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/prefetch-7.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/prefetch-8.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/prefetch-9.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/scev-11.c
    trunk/gcc/testsuite/gcc.dg/vect/costmodel/i386/costmodel-vect-31.c
    trunk/gcc/testsuite/gcc.dg/vect/costmodel/i386/costmodel-vect-33.c
    trunk/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-vect-31.c
    trunk/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-vect-33.c
    trunk/gcc/testsuite/gcc.dg/vect/vect.exp
    trunk/gcc/testsuite/gcc.target/i386/pr30970.c
    trunk/gcc/testsuite/gcc.target/i386/vect-double-1.c
    trunk/gcc/testsuite/gcc.target/i386/vect-double-2.c
    trunk/gcc/testsuite/gfortran.dg/vect/vect-5.f90
    trunk/gcc/testsuite/gfortran.dg/vect/vect-8.f90
    trunk/gcc/tree-loop-distribution.c
Comment 24 Richard Biener 2019-05-23 11:39:41 UTC
Fixed on trunk.
Comment 25 Richard Biener 2019-05-24 08:49:15 UTC
Author: rguenth
Date: Fri May 24 08:48:14 2019
New Revision: 271595

URL: https://gcc.gnu.org/viewcvs?rev=271595&root=gcc&view=rev
Log:
2019-05-23  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/88440
	* opts.c (default_options_table): Enable -ftree-loop-distribute-patterns
	at -O[2s]+.
	* tree-loop-distribution.c (generate_memset_builtin): Fold the
	generated call.
	(generate_memcpy_builtin): Likewise.
	(distribute_loop): Pass in whether to only distribute patterns.
	(prepare_perfect_loop_nest): Also allow size optimization.
	(pass_loop_distribution::execute): When optimizing a loop
	nest for size allow pattern replacement.

	* gcc.dg/tree-ssa/ldist-37.c: New testcase.
	* gcc.dg/tree-ssa/ldist-38.c: Likewise.
	* gcc.dg/vect/vect.exp: Add -fno-tree-loop-distribute-patterns.
	* gcc.dg/tree-ssa/ldist-37.c: Adjust.
	* gcc.dg/tree-ssa/ldist-38.c: Likewise.
	* g++.dg/tree-ssa/pr78847.C: Likewise.
	* gcc.dg/autopar/pr39500-1.c: Likewise.
	* gcc.dg/autopar/reduc-1char.c: Likewise.
	* gcc.dg/autopar/reduc-7.c: Likewise.
	* gcc.dg/tree-ssa/ivopts-lt-2.c: Likewise.
	* gcc.dg/tree-ssa/ivopts-lt.c: Likewise.
	* gcc.dg/tree-ssa/predcom-dse-1.c: Likewise.
	* gcc.dg/tree-ssa/predcom-dse-2.c: Likewise.
	* gcc.dg/tree-ssa/predcom-dse-3.c: Likewise.
	* gcc.dg/tree-ssa/predcom-dse-4.c: Likewise.
	* gcc.dg/tree-ssa/prefetch-7.c: Likewise.
	* gcc.dg/tree-ssa/prefetch-8.c: Likewise.
	* gcc.dg/tree-ssa/prefetch-9.c: Likewise.
	* gcc.dg/tree-ssa/scev-11.c: Likewise.
	* gcc.dg/vect/costmodel/i386/costmodel-vect-31.c: Likewise.
	* gcc.dg/vect/costmodel/i386/costmodel-vect-33.c: Likewise.
	* gcc.dg/vect/costmodel/x86_64/costmodel-vect-31.c: Likewise.
	* gcc.dg/vect/costmodel/x86_64/costmodel-vect-33.c: Likewise.
	* gcc.target/i386/pr30970.c: Likewise.
	* gcc.target/i386/vect-double-1.c: Likewise.
	* gcc.target/i386/vect-double-2.c: Likewise.
	* gcc.dg/tree-ssa/gen-vect-2.c: Likewise.
	* gcc.dg/tree-ssa/gen-vect-26.c: Likewise.
	* gcc.dg/tree-ssa/gen-vect-28.c: Likewise.
	* gcc.dg/tree-ssa/gen-vect-32.c: Likewise.
	* gfortran.dg/vect/vect-5.f90: Likewise.
	* gfortran.dg/vect/vect-8.f90: Likewise.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ldist-37.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ldist-38.c
Comment 26 Christophe Lyon 2019-05-27 13:38:29 UTC
Author: clyon
Date: Mon May 27 13:37:57 2019
New Revision: 271662

URL: https://gcc.gnu.org/viewcvs?rev=271662&root=gcc&view=rev
Log:
[testsuite,aarch64,arm] PR88440: Fix testcases

2019-05-27  Christophe Lyon  <christophe.lyon@linaro.org>

	PR tree-optimization/88440
	gcc/testsuite/
	* gcc.target/aarch64/sve/index_offset_1.c: Add -fno-tree-loop-distribute-patterns.
	* gcc.target/aarch64/sve/single_1.c: Likewise.
	* gcc.target/aarch64/sve/single_2.c: Likewise.
	* gcc.target/aarch64/sve/single_3.c: Likewise.
	* gcc.target/aarch64/sve/single_4.c: Likewise.
	* gcc.target/aarch64/sve/vec_init_1.c: Likewise.
	* gcc.target/aarch64/vect-fmovd-zero.c: Likewise.
	* gcc.target/aarch64/vect-fmovf-zero.c: Likewise.
	* gcc.target/arm/ivopts.c: Likewise.


Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/aarch64/sve/index_offset_1.c
    trunk/gcc/testsuite/gcc.target/aarch64/sve/single_1.c
    trunk/gcc/testsuite/gcc.target/aarch64/sve/single_2.c
    trunk/gcc/testsuite/gcc.target/aarch64/sve/single_3.c
    trunk/gcc/testsuite/gcc.target/aarch64/sve/single_4.c
    trunk/gcc/testsuite/gcc.target/aarch64/sve/vec_init_1.c
    trunk/gcc/testsuite/gcc.target/aarch64/vect-fmovd-zero.c
    trunk/gcc/testsuite/gcc.target/aarch64/vect-fmovf-zero.c
    trunk/gcc/testsuite/gcc.target/arm/ivopts.c
Comment 27 Mel Chen 2020-03-09 06:09:13 UTC
Related new bug: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092