https://godbolt.org/z/RTji7B

void foo(char* restrict dst, const char* buf)
{
    for (int i=0; i<8; ++i)
        *dst++ = *buf++;
}

$ gcc -Os
$ gcc -O2
.L2:
        mov     dl, BYTE PTR [rsi+rax]
        mov     BYTE PTR [rdi+rax], dl
        inc     rax
        cmp     rax, 8
        jne     .L2

$ gcc -O3
        mov     rax, QWORD PTR [rsi]
        mov     QWORD PTR [rdi], rax

$ arm-none-eabi-gcc -O3 -mthumb -mcpu=cortex-m4
        ldr     r3, [r1]        @ unaligned
        ldr     r2, [r1, #4]    @ unaligned
        str     r2, [r0, #4]    @ unaligned
        str     r3, [r0]        @ unaligned

The -O3 code is both faster and smaller for both ARM and x64:

  note: Loop 1 distributed: split to 0 loops and 1 library calls.

Should be considered for -O2 and -Os as well.
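For reference, the equivalence the -O3 output relies on can be written out in plain C (function names here are mine, not from the PR): the fixed-trip-count copy loop and an 8-byte memcpy are interchangeable, and the loop-distribution pattern detection is the pass that performs this rewrite.

```c
#include <assert.h>
#include <string.h>

/* The loop from the report; the trip count is a known constant 8, so
   the whole loop is equivalent to one fixed-size memcpy.  That is what
   GCC emits at -O3 and what this PR asks for at -O2/-Os. */
void foo_loop(char *restrict dst, const char *buf) {
    for (int i = 0; i < 8; ++i)
        *dst++ = *buf++;
}

/* The library-call form that loop distribution produces; GCC then
   expands the fixed-size memcpy inline as a single 8-byte move. */
void foo_memcpy(char *restrict dst, const char *buf) {
    memcpy(dst, buf, 8);
}
```

Both forms must produce byte-identical results for any non-overlapping buffers, which is what makes the replacement legal here (restrict guarantees no overlap).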
I thought I had a dup of this bug somewhere asking for this optimization to be moved to -O2 (and -Os) and above, rather than keeping it at -O3 and above.
I think distributing patterns is reasonable for -O[2s]. I think we kept it at -O3 because of compile-time concerns (dependence analysis is quadratic). But then we still don't do basic vectorization at -O[2s] either... (the same too-much-compile-time-for-too-little-gain issue). So if somebody is willing to do some compile-time / effect numbers (bootstrap / SPEC?) then we can consider enabling loop-distribute-patterns for -O[2s] for GCC 10.
Adding -ftree-loop-distribute-patterns to -Os does not seem to make a difference though.
> Adding -ftree-loop-distribute-patterns to -Os does not seem to make a
> difference though.

Possibly because of

  /* Don't distribute multiple exit edges loop, or cold loop.  */
  if (!single_exit (loop)
      || !optimize_loop_for_speed_p (loop))
    continue;
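A minimal model of that guard (simplified stand-in types, not the real GCC internals) shows why -Os sees no difference even with the flag: optimize_loop_for_speed_p is false for size-optimized and cold loops, so such loops are skipped before pattern detection even runs.

```c
/* Simplified sketch of the guard quoted above; the struct and helpers
   are invented stand-ins for the GCC loop representation. */
struct loop_model {
    int n_exits;             /* number of exit edges */
    int optimized_for_speed; /* false under -Os or for cold loops */
};

static int single_exit_p(const struct loop_model *l) {
    return l->n_exits == 1;
}

static int optimize_loop_for_speed_p(const struct loop_model *l) {
    return l->optimized_for_speed;
}

/* Mirrors:  if (!single_exit (loop) || !optimize_loop_for_speed_p (loop))
               continue;  */
int considered_for_distribution_p(const struct loop_model *l) {
    if (!single_exit_p(l) || !optimize_loop_for_speed_p(l))
        return 0;   /* skipped: the -Os case before the patch */
    return 1;
}
```

This is exactly the guard the later patch relaxes so that -Os can still form memcpy/memset patterns.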
Sure, I can help with measurements during next stage1. Richi, can you please attach a patch that enables the optimization for -O[2s]?
Created attachment 45313 [details]
patch

This enables distribution of patterns at -O[2s]+ and optimizes the testcase at -Os by adjusting the guards in loop distribution.

Note that the interesting bits are compile-time, binary size and performance mainly at -O2, eventually size at -Os.

I suspect that at -O2 w/o profiling most loops would be optimize_loop_for_speed anyway, so changing the heuristics isn't so bad, but of course enabling distribution at -O2 might incur a penalty.
(In reply to Richard Biener from comment #6)
> Note that the interesting bits are compile-time, binary-size and performance
> at mainly -O2, eventually size at -Os.

I have so far build numbers on a Zen machine with -j16:

SPEC2006:

Elapsed compile for '400.perlbench': 00:00:05 (5)
Elapsed compile for '401.bzip2': 00:00:02 (2)
Elapsed compile for '403.gcc': 00:00:11 (11)
Elapsed compile for '429.mcf': 00:00:01 (1)
Elapsed compile for '445.gobmk': 00:00:04 (4)
Elapsed compile for '456.hmmer': 00:00:01 (1)
Elapsed compile for '458.sjeng': 00:00:01 (1)
Elapsed compile for '462.libquantum': 00:00:01 (1)
Elapsed compile for '464.h264ref': 00:00:04 (4)
Elapsed compile for '471.omnetpp': 00:00:05 (5)
Elapsed compile for '473.astar': 00:00:01 (1)
Elapsed compile for '483.xalancbmk': 00:00:21 (21)
Elapsed compile for '410.bwaves': 00:00:01 (1)
Elapsed compile for '416.gamess': 00:00:20 (20)
Elapsed compile for '433.milc': 00:00:02 (2)
Elapsed compile for '434.zeusmp': 00:00:02 (2)
Elapsed compile for '435.gromacs': 00:00:06 (6)
Elapsed compile for '436.cactusADM': 00:00:04 (4)
Elapsed compile for '437.leslie3d': 00:00:04 (4)
Elapsed compile for '444.namd': 00:00:09 (9)
Elapsed compile for '447.dealII': 00:00:15 (15)
Elapsed compile for '450.soplex': 00:00:03 (3)
Elapsed compile for '453.povray': 00:00:04 (4)
Elapsed compile for '454.calculix': 00:00:06 (6)
Elapsed compile for '459.GemsFDTD': 00:00:09 (9)
Elapsed compile for '465.tonto': 00:00:53 (53)
Elapsed compile for '470.lbm': 00:00:02 (2)
Elapsed compile for '481.wrf': 00:00:38 (38)
Elapsed compile for '482.sphinx3': 00:00:01 (1)

All differences before and after are within 1s, which is the measurement granularity.

SPEC2017:

Elapsed compile for '503.bwaves_r': 00:00:01 (1)
Elapsed compile for '507.cactuBSSN_r': 00:00:25 (25)
Elapsed compile for '508.namd_r': 00:00:09 (9)
Elapsed compile for '510.parest_r': 00:00:46 (46)
Elapsed compile for '511.povray_r': 00:00:04 (4)
Elapsed compile for '519.lbm_r': 00:00:01 (1)
Elapsed compile for '521.wrf_r': 00:05:46 (346)
Elapsed compile for '526.blender_r': 00:00:25 (25)
Elapsed compile for '527.cam4_r': 00:00:37 (37)
Elapsed compile for '538.imagick_r': 00:00:11 (11)
Elapsed compile for '544.nab_r': 00:00:01 (1)
Elapsed compile for '549.fotonik3d_r': 00:00:07 (7)
Elapsed compile for '554.roms_r': 00:00:06 (6)
Elapsed compile for '500.perlbench_r': 00:00:09 (9)
Elapsed compile for '502.gcc_r': 00:00:44 (44)
Elapsed compile for '505.mcf_r': 00:00:01 (1)
Elapsed compile for '520.omnetpp_r': 00:00:12 (12)
Elapsed compile for '523.xalancbmk_r': 00:00:25 (25)
Elapsed compile for '525.x264_r': 00:00:09 (9)
Elapsed compile for '531.deepsjeng_r': 00:00:02 (2)
Elapsed compile for '541.leela_r': 00:00:03 (3)
Elapsed compile for '548.exchange2_r': 00:00:04 (4)
Elapsed compile for '557.xz_r': 00:00:01 (1)

There's only one difference:

521.wrf_r: 310 -> 346s
> There's only one difference:
>
> 521.wrf_r: 310 -> 346s

Ick. I currently see no limiting on the size of loops in loop distribution. One easy fix would be to limit the worklist size in find_seed_stmts_for_distribution with a --param we can lower at -O[2s]; another would be to limit loop nest depth similarly. A profile might be interesting here as well...
Comparing all files in wrf, I've got:

╔═════════════════════════════════════╤════════╤════════╤═════════╗
║ Filename                            │ Before │ After  │ Change  ║
╠═════════════════════════════════════╪════════╪════════╪═════════╣
║ module_configure.fppized.f90        │ 127.81 │ 163.39 │ 127.84% ║
║ d1fgkb.fppized.f90                  │ 0.21   │ 0.23   │ 109.52% ║
║ solve_interface.fppized.f90         │ 0.35   │ 0.38   │ 108.57% ║
║ module_ltng_crmpr92.fppized.f90     │ 0.28   │ 0.3    │ 107.14% ║
║ module_cu_kf.fppized.f90            │ 1.42   │ 1.51   │ 106.34% ║
║ mradbg.fppized.f90                  │ 0.32   │ 0.34   │ 106.25% ║
║ module_sf_pxlsm.fppized.f90         │ 0.55   │ 0.58   │ 105.45% ║
║ module_domain_type.fppized.f90      │ 0.19   │ 0.2    │ 105.26% ║
║ module_shallowcu_driver.fppized.f90 │ 0.19   │ 0.2    │ 105.26% ║
║ module_bl_gfs.fppized.f90           │ 0.78   │ 0.82   │ 105.13% ║
║ module_bl_myjurb.fppized.f90        │ 0.78   │ 0.82   │ 105.13% ║
║ Meat.fppized.f90                    │ 0.39   │ 0.41   │ 105.13% ║
...

So the only significant offender is the module_configure.fppized.f90 file. Let me profile it.
> So the only significant offender is module_configure.fppized.f90 file. Let
> me profile it.

Time profile before/after:

╔══════════════════════════╤════════╤════════╤═════════╗
║ PASS                     │ Before │ After  │ Change  ║
╠══════════════════════════╪════════╪════════╪═════════╣
║ backwards jump threading │ 6.29   │ 6.16   │ 97.93%  ║
║ integrated RA            │ 6.76   │ 6.41   │ 94.82%  ║
║ tree SSA incremental     │ 9.01   │ 11.16  │ 123.86% ║
║ LRA create live ranges   │ 15.68  │ 40.02  │ 255.23% ║
║ PRE                      │ 23.24  │ 32.32  │ 139.07% ║
║ alias stmt walking       │ 27.69  │ 28.75  │ 103.83% ║
║ phase opt and generate   │ 124.13 │ 163.95 │ 132.08% ║
║ TOTAL                    │ 125.39 │ 165.17 │ 131.73% ║
╚══════════════════════════╧════════╧════════╧═════════╝

Richi, do you want a perf report, or will you come up with a patch that introduces the aforementioned params?
Can you share -fopt-report-loop differences? From the above I would guess we split a lot of loops, meaning the memcpy/memmove/memset calls are in the "middle" and we have to split loops (how many calls are detected here?). If that's true, another way would be to only allow calls at head or tail position, thus a single non-builtin partition.
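For illustration, here is a sketch of what such a split looks like at the source level (an invented function, not code from wrf): one statement matches the memset pattern and becomes a library call, the rest stays behind as a residual loop, so the report would read "split to 1 loops and 1 library calls".

```c
#include <assert.h>
#include <string.h>

/* Original loop: a pattern statement and unrelated work mixed. */
void before(double *restrict a, double *restrict b, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = 0.0;          /* matches the memset pattern */
        b[i] = b[i] + 1.0;   /* non-pattern computation */
    }
}

/* What the distributed code is equivalent to: a builtin partition
   (memset of IEEE zero is all-zero bytes, so this is legal for
   doubles) plus the residual loop. */
void after(double *restrict a, double *restrict b, int n) {
    memset(a, 0, (size_t)n * sizeof *a);   /* builtin partition */
    for (int i = 0; i < n; i++)            /* residual loop */
        b[i] = b[i] + 1.0;
}
```

The "calls in the middle" concern above is about dependences forcing the pattern partition to be sandwiched between two residual loops, which costs more than a head- or tail-position call.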
> Can you share -fopt-report-loop differences?

I newly see ~1400 lines:

module_configure.fppized.f90:7993:0: optimized: Loop 10 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:7995:0: optimized: Loop 11 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:8000:0: optimized: Loop 15 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:8381:0: optimized: Loop 77 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:8383:0: optimized: Loop 78 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:8498:0: optimized: Loop 105 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:9742:0: optimized: Loop 169 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:9978:0: optimized: Loop 207 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:9979:0: optimized: Loop 208 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:9980:0: optimized: Loop 209 distributed: split to 0 loops and 1 library calls.
module_configure.fppized.f90:9981:0: optimized: Loop 210 distributed: split to 0 loops and 1 library calls.
...
> I newly see ~1400 lines:
>
> module_configure.fppized.f90:7993:0: optimized: Loop 10 distributed: split
> to 0 loops and 1 library calls.
> ...

All with "0 loops"? That disputes my theory :/
> All with "0 loops"? That disputes my theory :/

Yep.
All these are in a form of:

<bb 1809> [local count: 118163158]:
# S.1565_41079 = PHI <1(2028), S.1565_32687(3351)>
# ivtmp_38850 = PHI <11(2028), ivtmp_38848(3351)>
_3211 = S.1565_41079 + -1;
_3212 = fire_ignition_start_y1[_3211];
MEM[(real(kind=4)[11] *)&model_config_rec + 101040B][_3211] = _3212;
S.1565_32687 = S.1565_41079 + 1;
ivtmp_38848 = ivtmp_38850 - 1;
if (ivtmp_38848 == 0)
  goto <bb 2027>; [9.09%]
else
  goto <bb 3351>; [90.91%]

<bb 3351> [local count: 107425740]:
goto <bb 1809>; [100.00%]

<bb 2027> [local count: 10737418]:

<bb 1810> [local count: 118163158]:
# S.1566_41080 = PHI <1(2027), S.1566_32689(3350)>
# ivtmp_38856 = PHI <11(2027), ivtmp_38854(3350)>
_3213 = S.1566_41080 + -1;
_3214 = fire_ignition_end_x1[_3213];
MEM[(real(kind=4)[11] *)&model_config_rec + 101084B][_3213] = _3214;
S.1566_32689 = S.1566_41080 + 1;
ivtmp_38854 = ivtmp_38856 - 1;
if (ivtmp_38854 == 0)
  goto <bb 2026>; [9.09%]
else
  goto <bb 3350>; [90.91%]

<bb 3350> [local count: 107425740]:
goto <bb 1810>; [100.00%]

<bb 2026> [local count: 10737418]:

<bb 1811> [local count: 118163158]:
# S.1567_41081 = PHI <1(2026), S.1567_32691(3349)>
# ivtmp_38860 = PHI <11(2026), ivtmp_38858(3349)>
_3215 = S.1567_41081 + -1;
_3216 = fire_ignition_end_y1[_3215];
MEM[(real(kind=4)[11] *)&model_config_rec + 101128B][_3215] = _3216;
S.1567_32691 = S.1567_41081 + 1;
ivtmp_38858 = ivtmp_38860 - 1;
if (ivtmp_38858 == 0)
  goto <bb 2025>; [9.09%]
else
  goto <bb 3349>; [90.91%]

<bb 3349> [local count: 107425740]:
goto <bb 1811>; [100.00%]

<bb 2025> [local count: 10737418]:
...

It's a configure module, so it probably contains that many loops for the various configs.
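A rough C analogue of the loops in the dump may help (the record layout and field names are invented; only the shape matches: a fixed 11-iteration copy of a small array into a field of one big config record). Each such loop can be replaced by a single 44-byte memcpy, which is why the report shows "split to 0 loops and 1 library calls" so many times.

```c
#include <assert.h>
#include <string.h>

#define N_CONFIG 11   /* trip count 11, as in the dump's ivtmp PHIs */

float fire_ignition_start_y1[N_CONFIG];

struct model_config {
    float fire_ignition_start_y1[N_CONFIG];
    /* ... the real wrf record has a great many such fields, one per
       namelist entry, hence the ~1400 distributed loops ... */
};
struct model_config model_config_rec;

void copy_config_entry(void) {
    for (int i = 0; i < N_CONFIG; i++)
        model_config_rec.fire_ignition_start_y1[i] = fire_ignition_start_y1[i];
    /* equivalent after distribution:
       memcpy (model_config_rec.fire_ignition_start_y1,
               fire_ignition_start_y1, sizeof fire_ignition_start_y1);  */
}
```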
> It's a configure module, so that it probably contains so many loops for
> various configs.

Hmm, so then it might be that we run into some CFG complexity cut-off before (for PRE and RA) but not after, since the CFG should simplify a lot if we make a memcpy out of each of the above loops...
Created attachment 46393 [details]
SPEC2006 and SPEC2017 report

The report presents the difference between master (first gray column) and Richi's patch (last 2 columns; the tests were run twice). There are both some significant improvements and regressions. Note that 436.cactusADM is jumpy (21%)!
> Hmm, so then it might be we run into some CFG complexity cut-off
> before for PRE and RA but not after since the CFG should simplify
> a lot if we make memcpy from all of the above loops...

I guess so. Note that even without the patch the file takes 2 minutes to compile, which is somewhat weird.

I'm done with the patch measurements. Richi, do you see it as beneficial to enable it with -O2?
OK, let's do it.
Some analysis shows, focusing on LRA lives, that unpatched we have

lra live on 53 BBs for wrf_alt_nml_obsolete
lra live on 5 BBs for set_config_as_buffer
lra live on 5 BBs for get_config_as_buffer
lra live on 3231 BBs for initial_config
lra live on 3231 BBs for initial_config

while patched

lra live on 53 BBs for wrf_alt_nml_obsolete
lra live on 5 BBs for set_config_as_buffer
lra live on 5 BBs for get_config_as_buffer
lra live on 465 BBs for initial_config
lra live on 465 BBs for initial_config

so it's the initial_config function. We need 8 DF worklist iterations in both cases, but eventually the amount of local work is larger, or the local work isn't linear in the size of the BBs.

The "work" it does to not update hardregs by anding ~all_hard_regs_bitmap seems somewhat pointless unless the functions do not handle those correctly. But that's micro-optimizing; likewise adding a bitmap_ior_and_compl_and_compl function to avoid the temporary bitmap in live_trans_fun.

perf tells us most time is spent in process_bb_lives, not in the dataflow problem, and there in ix86_hard_regno_call_part_clobbered (the function has a _lot_ of calls...).

Also, w/o pattern detection the lra_simple_p heuristic kicks in since we have a lot more BBs:

  /* If there are too many pseudos and/or basic blocks (e.g. 10K
     pseudos and 10K blocks or 100K pseudos and 1K blocks), we will
     use simplified and faster algorithms in LRA.  */
  lra_simple_p
    = (ira_use_lra_p
       && max_reg_num () >= (1 << 26) / last_basic_block_for_fn (cfun));

The code is auto-generated and large (I have a single source file using no modules now, but it is still too large and too similar to SPEC to attach here), so I wouldn't worry too much here. The above magic constant should be a --param though.
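The cut-off quoted above is easy to evaluate for the measured BB counts (the helper name here is mine): with 3231 BBs the pseudo threshold is low, so the unpatched initial_config falls into the simplified, faster LRA algorithms, while the 465-BB patched version stays on the precise ones, which is consistent with "LRA create live ranges" getting slower with the patch.

```c
#include <assert.h>

/* The lra_simple_p cut-off as arithmetic: LRA switches to its
   simplified algorithms once
     max_reg_num () >= (1 << 26) / last_basic_block_for_fn (cfun).
   This returns the pseudo count at which the switch happens. */
long lra_simple_pseudo_threshold(long n_basic_blocks) {
    return (1L << 26) / n_basic_blocks;
}

/* unpatched initial_config: 3231 BBs -> threshold 20770 pseudos,
   easy to exceed, so the fast simplified algorithms kick in;
   patched:                   465 BBs -> threshold 144320 pseudos,
   so the precise (slower) algorithms keep running. */
```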
Ick.

static inline void
check_pseudos_live_through_calls (int regno,
				  HARD_REG_SET last_call_used_reg_set,
				  rtx_insn *call_insn)
{
...
  for (hr = 0; HARD_REGISTER_NUM_P (hr); hr++)
    if (targetm.hard_regno_call_part_clobbered (call_insn, hr,
						PSEUDO_REGNO_MODE (regno)))
      add_to_hard_reg_set (&lra_reg_info[regno].conflict_hard_regs,
			   PSEUDO_REGNO_MODE (regno), hr);

This loop repeatedly computes an implicit hard-reg set of the hard regs partly clobbered by the call, for the _same_ actual instruction, since check_pseudos_live_through_calls is called via

  /* Mark each defined value as live.  We need to do this for
     unused values because they still conflict with quantities
     that are live at the time of the definition.  */
  for (reg = curr_id->regs; reg != NULL; reg = reg->next)
    {
      if (reg->type != OP_IN)
	{
	  update_pseudo_point (reg->regno, curr_point, USE_POINT);
	  mark_regno_live (reg->regno, reg->biggest_mode);
	  check_pseudos_live_through_calls (reg->regno,
					    last_call_used_reg_set, call_insn);
...
	}

and

  EXECUTE_IF_SET_IN_SPARSESET (pseudos_live, j)
    {
      IOR_HARD_REG_SET (lra_reg_info[j].actual_call_used_reg_set,
			this_call_used_reg_set);
      if (flush)
	check_pseudos_live_through_calls (j, last_call_used_reg_set,
					  last_call_insn);
    }

and

  /* Mark each used value as live.  */
  for (reg = curr_id->regs; reg != NULL; reg = reg->next)
    if (reg->type != OP_OUT)
      {
	if (reg->type == OP_IN)
	  update_pseudo_point (reg->regno, curr_point, USE_POINT);
	mark_regno_live (reg->regno, reg->biggest_mode);
	check_pseudos_live_through_calls (reg->regno,
					  last_call_used_reg_set, call_insn);
      }

and

  EXECUTE_IF_SET_IN_BITMAP (df_get_live_in (bb), FIRST_PSEUDO_REGISTER, j, bi)
    {
      if (sparseset_cardinality (pseudos_live_through_calls) == 0)
	break;
      if (sparseset_bit_p (pseudos_live_through_calls, j))
	check_pseudos_live_through_calls (j, last_call_used_reg_set,
					  call_insn);
    }

The pseudo's mode may change, but I guess usually it doesn't. I also wonder why the target hook doesn't return a hard-reg-set...
That said, the above code doesn't scale well with functions with a lot of calls at least; also, the passed call_insn isn't the current insn and might even be NULL. All targets but aarch64 do not even look at the actual instruction (even more of an argument for re-designing the hook with its use in mind). I guess an artificial testcase with a lot of calls and a lot of live pseudos (even single-BB) should show this issue easily.

Samples: 579  of event 'cycles:ppp', Event count (approx.): 257134187434191
Overhead  Command  Shared Object  Symbol
  22.26%  f951     f951           [.] process_bb_lives
  15.06%  f951     f951           [.] ix86_hard_regno_call_part_clobbered
   8.55%  f951     f951           [.] concat
   6.88%  f951     f951           [.] find_base_term
   3.60%  f951     f951           [.] get_ref_base_and_extent
   3.27%  f951     f951           [.] find_base_term
   2.95%  f951     f951           [.] make_hard_regno_dead
The code in question was originally added in r202721 by Vlad and likely became more costly after the target macro was turned into a hook (no inlining anymore).
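One direction (a sketch over invented types, not the actual LRA fix) would be to compute the partly-clobbered hard-reg set once per call insn and mode and reuse it for every live pseudo, instead of re-running the per-hard-reg hook query for each regno:

```c
#include <assert.h>
#include <stdint.h>

#define N_HARD_REGS 64
typedef uint64_t hard_reg_set;   /* one bit per hard register */

/* Stand-in for targetm.hard_regno_call_part_clobbered; the real hook
   also takes the call insn.  The clobber rule here is made up. */
static int hard_regno_call_part_clobbered_p(int hard_regno, int mode) {
    (void)mode;
    return hard_regno >= 16 && hard_regno < 32;
}

/* Hoisted computation: build the clobber set once per mode (and, in
   the real code, per call insn) instead of once per live pseudo. */
hard_reg_set call_part_clobbered_set(int mode) {
    static int cached_mode = -1;
    static hard_reg_set cached;
    if (mode != cached_mode) {            /* recompute only on mode change */
        cached = 0;
        for (int hr = 0; hr < N_HARD_REGS; hr++)
            if (hard_regno_call_part_clobbered_p(hr, mode))
                cached |= (hard_reg_set)1 << hr;
        cached_mode = mode;
    }
    return cached;   /* repeated queries for the same mode are O(1) */
}
```

This also hints at the redesign mused about above: a hook returning a hard-reg set per call could be computed once and OR-ed into each pseudo's conflict set.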
Author: rguenth
Date: Thu May 23 11:35:16 2019
New Revision: 271553

URL: https://gcc.gnu.org/viewcvs?rev=271553&root=gcc&view=rev
Log:
2019-05-23  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/88440
	* opts.c (default_options_table): Enable
	-ftree-loop-distribute-patterns at -O[2s]+.
	* tree-loop-distribution.c (generate_memset_builtin): Fold the
	generated call.
	(generate_memcpy_builtin): Likewise.
	(distribute_loop): Pass in whether to only distribute patterns.
	(prepare_perfect_loop_nest): Also allow size optimization.
	(pass_loop_distribution::execute): When optimizing a loop nest
	for size allow pattern replacement.

	* gcc.dg/tree-ssa/ldist-37.c: New testcase.
	* gcc.dg/tree-ssa/ldist-38.c: Likewise.
	* gcc.dg/vect/vect.exp: Add -fno-tree-loop-distribute-patterns.
	* gcc.dg/tree-ssa/ldist-37.c: Adjust.
	* gcc.dg/tree-ssa/ldist-38.c: Likewise.
	* g++.dg/tree-ssa/pr78847.C: Likewise.
	* gcc.dg/autopar/pr39500-1.c: Likewise.
	* gcc.dg/autopar/reduc-1char.c: Likewise.
	* gcc.dg/autopar/reduc-7.c: Likewise.
	* gcc.dg/tree-ssa/ivopts-lt-2.c: Likewise.
	* gcc.dg/tree-ssa/ivopts-lt.c: Likewise.
	* gcc.dg/tree-ssa/predcom-dse-1.c: Likewise.
	* gcc.dg/tree-ssa/predcom-dse-2.c: Likewise.
	* gcc.dg/tree-ssa/predcom-dse-3.c: Likewise.
	* gcc.dg/tree-ssa/predcom-dse-4.c: Likewise.
	* gcc.dg/tree-ssa/prefetch-7.c: Likewise.
	* gcc.dg/tree-ssa/prefetch-8.c: Likewise.
	* gcc.dg/tree-ssa/prefetch-9.c: Likewise.
	* gcc.dg/tree-ssa/scev-11.c: Likewise.
	* gcc.dg/vect/costmodel/i386/costmodel-vect-31.c: Likewise.
	* gcc.dg/vect/costmodel/i386/costmodel-vect-33.c: Likewise.
	* gcc.dg/vect/costmodel/x86_64/costmodel-vect-31.c: Likewise.
	* gcc.dg/vect/costmodel/x86_64/costmodel-vect-33.c: Likewise.
	* gcc.target/i386/pr30970.c: Likewise.
	* gcc.target/i386/vect-double-1.c: Likewise.
	* gcc.target/i386/vect-double-2.c: Likewise.
	* gcc.dg/tree-ssa/gen-vect-2.c: Likewise.
	* gcc.dg/tree-ssa/gen-vect-26.c: Likewise.
	* gcc.dg/tree-ssa/gen-vect-28.c: Likewise.
	* gcc.dg/tree-ssa/gen-vect-32.c: Likewise.
	* gfortran.dg/vect/vect-5.f90: Likewise.
	* gfortran.dg/vect/vect-8.f90: Likewise.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/opts.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/g++.dg/tree-ssa/pr78847.C
    trunk/gcc/testsuite/gcc.dg/autopar/pr39500-1.c
    trunk/gcc/testsuite/gcc.dg/autopar/reduc-1char.c
    trunk/gcc/testsuite/gcc.dg/autopar/reduc-7.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/gen-vect-2.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/gen-vect-26.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/gen-vect-28.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/gen-vect-32.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt-2.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-1.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-2.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-3.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/predcom-dse-4.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/prefetch-7.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/prefetch-8.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/prefetch-9.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/scev-11.c
    trunk/gcc/testsuite/gcc.dg/vect/costmodel/i386/costmodel-vect-31.c
    trunk/gcc/testsuite/gcc.dg/vect/costmodel/i386/costmodel-vect-33.c
    trunk/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-vect-31.c
    trunk/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-vect-33.c
    trunk/gcc/testsuite/gcc.dg/vect/vect.exp
    trunk/gcc/testsuite/gcc.target/i386/pr30970.c
    trunk/gcc/testsuite/gcc.target/i386/vect-double-1.c
    trunk/gcc/testsuite/gcc.target/i386/vect-double-2.c
    trunk/gcc/testsuite/gfortran.dg/vect/vect-5.f90
    trunk/gcc/testsuite/gfortran.dg/vect/vect-8.f90
    trunk/gcc/tree-loop-distribution.c
Fixed on trunk.
Author: rguenth
Date: Fri May 24 08:48:14 2019
New Revision: 271595

URL: https://gcc.gnu.org/viewcvs?rev=271595&root=gcc&view=rev
Log:
2019-05-23  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/88440
	(ChangeLog identical to r271553; this commit adds the testcase
	files omitted from it.)

Added:
	trunk/gcc/testsuite/gcc.dg/tree-ssa/ldist-37.c
	trunk/gcc/testsuite/gcc.dg/tree-ssa/ldist-38.c
Author: clyon
Date: Mon May 27 13:37:57 2019
New Revision: 271662

URL: https://gcc.gnu.org/viewcvs?rev=271662&root=gcc&view=rev
Log:
[testsuite,aarch64,arm] PR88440: Fix testcases

2019-05-27  Christophe Lyon  <christophe.lyon@linaro.org>

	PR tree-optimization/88440
	gcc/testsuite/
	* gcc.target/aarch64/sve/index_offset_1.c: Add
	-fno-tree-loop-distribute-patterns.
	* gcc.target/aarch64/sve/single_1.c: Likewise.
	* gcc.target/aarch64/sve/single_2.c: Likewise.
	* gcc.target/aarch64/sve/single_3.c: Likewise.
	* gcc.target/aarch64/sve/single_4.c: Likewise.
	* gcc.target/aarch64/sve/vec_init_1.c: Likewise.
	* gcc.target/aarch64/vect-fmovd-zero.c: Likewise.
	* gcc.target/aarch64/vect-fmovf-zero.c: Likewise.
	* gcc.target/arm/ivopts.c: Likewise.

Modified:
	trunk/gcc/testsuite/ChangeLog
	trunk/gcc/testsuite/gcc.target/aarch64/sve/index_offset_1.c
	trunk/gcc/testsuite/gcc.target/aarch64/sve/single_1.c
	trunk/gcc/testsuite/gcc.target/aarch64/sve/single_2.c
	trunk/gcc/testsuite/gcc.target/aarch64/sve/single_3.c
	trunk/gcc/testsuite/gcc.target/aarch64/sve/single_4.c
	trunk/gcc/testsuite/gcc.target/aarch64/sve/vec_init_1.c
	trunk/gcc/testsuite/gcc.target/aarch64/vect-fmovd-zero.c
	trunk/gcc/testsuite/gcc.target/aarch64/vect-fmovf-zero.c
	trunk/gcc/testsuite/gcc.target/arm/ivopts.c
Related new bugzilla (size regression from enabling this at -Os): https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94092