Bug 87832 - AMD pipeline models are very costly size-wise
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 9.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks: 84402
Reported: 2018-10-31 14:26 UTC by Alexander Monakov
Modified: 2023-01-02 16:39 UTC

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Alexander Monakov 2018-10-31 14:26:26 UTC
In the i386 insn-automata.o, almost all of the 2.2M of rodata is taken up by very large tables for AMD CPU models. Note that the znver additions account for more than half of the overall size.

What is causing that and can it be improved?

2176 core2_core_transitions
2496 slm_base
2527 bdver3_load_min_issue_delay
2746 glm_base
3892 bdver1_fp_base
4261 insn_latency(rtx_insn*, rtx_insn*)
4444 bdver1_ieu_min_issue_delay
4492 geode_base
4608 bdver3_ieu_transitions
6402 bdver1_load_transitions
7862 athlon_fp_check
7862 athlon_fp_transitions
9433 internal_min_issue_delay(int, DFA_chip*)
10108 bdver3_load_transitions
10360 print_reservation(_IO_FILE*, rtx_insn*)::reservation_names
10498 geode_check
10498 geode_transitions
12575 athlon_fp_min_issue_delay
12599 internal_state_transition(int, DFA_chip*)
12742 btver2_fp_check
12742 btver2_fp_transitions
13896 slm_transitions
13896 slm_check
17776 bdver1_ieu_transitions
20068 bdver1_fp_check
20068 bdver1_fp_transitions
26208 slm_min_issue_delay
27244 bdver1_fp_min_issue_delay
28518 glm_transitions
28518 glm_check
33690 geode_min_issue_delay
46980 bdver3_fp_min_issue_delay
49428 glm_min_issue_delay
53730 btver2_fp_min_issue_delay
68160 znver1_ieu_min_issue_delay
93960 bdver3_fp_transitions
136320 znver1_ieu_transitions
428108 znver1_fp_min_issue_delay
856216 znver1_fp_transitions
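
To check the "more than half" claim against the listing above, the sizes can be summed per CPU model. The following is a minimal sketch, not part of the original report: the helper name `sizes_by_model` is hypothetical, and it assumes each line is "SIZE NAME" or "SIZE TYPE NAME" (as printed by nm), with the model taken to be the text before the first underscore.

```python
from collections import defaultdict


def sizes_by_model(lines):
    """Sum symbol sizes per model prefix (the text before the first '_')."""
    totals = defaultdict(int)
    for line in lines:
        fields = line.split(None, 1)
        if len(fields) != 2 or not fields[0].isdigit():
            continue  # skip blank or malformed lines
        size, rest = int(fields[0]), fields[1].strip()
        # Drop an optional one-letter nm symbol type such as "r" or "t".
        head = rest.split(None, 1)
        if len(head) == 2 and len(head[0]) == 1:
            rest = head[1]
        totals[rest.split("_", 1)[0]] += size
    return dict(totals)


# A few lines copied from the listing in the description.
sample = [
    "68160 znver1_ieu_min_issue_delay",
    "93960 bdver3_fp_transitions",
    "136320 znver1_ieu_transitions",
    "428108 znver1_fp_min_issue_delay",
    "856216 znver1_fp_transitions",
]

for model, total in sorted(sizes_by_model(sample).items(), key=lambda kv: kv[1]):
    print(f"{total:>8} {model}")
```

The four znver1 tables in the listing add up to 1488804 bytes, which is consistent with the statement that znver accounts for more than half of the 2.2M.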
Comment 1 Alexander Monakov 2022-10-24 18:48:48 UTC
Suggested partial fix for the integer-pipe side of the blowup: https://inbox.sourceware.org/gcc-patches/4549f27b-238a-7d77-f72b-cc77df8ae36e@ispras.ru/
Comment 2 GCC Commits 2022-11-01 12:21:13 UTC
The master branch has been updated by Alexander Monakov <amonakov@gcc.gnu.org>:

https://gcc.gnu.org/g:5cee5f94000ee5eabce9b223c44c7923c1c69f61

commit r13-3589-g5cee5f94000ee5eabce9b223c44c7923c1c69f61
Author: Alexander Monakov <amonakov@ispras.ru>
Date:   Mon Oct 31 17:35:57 2022 +0300

    i386: correct integer division modeling in znver.md
    
    In znver.md, division instructions have descriptions like
    
    (define_insn_reservation "znver1_idiv_DI" 41
                            (and (eq_attr "cpu" "znver1,znver2")
                                 (and (eq_attr "type" "idiv")
                                      (and (eq_attr "mode" "DI")
                                           (eq_attr "memory" "none"))))
                            "znver1-double,znver1-ieu2*41")
    
    which says that DImode idiv has latency 41 (which is correct) and that
    it occupies 2nd integer execution unit for 41 consecutive cycles, but
    that is not correct:
    
    1) the division instruction is partially pipelined, and has throughput
       1/14, not 1/41;
    
    2) for the most part it occupies a separate division unit, not the
       general arithmetic unit.
    
    Evidently, interaction of such 41-cycle paths with the rest of
    reservations causes a combinatorial explosion in the automaton.
    
    Fix this by modeling the integer division unit properly, and correcting
    reservations to use the measured reciprocal throughput of those
    instructions (available from uops.info). A similar correction for
    floating-point divisions is left for a followup patch.
    
    Top 5 znver table sizes, before:
    
    68692 r znver1_ieu_check
    68692 r znver1_ieu_transitions
    99792 r znver1_ieu_min_issue_delay
    428108 r znver1_fp_min_issue_delay
    856216 r znver1_fp_transitions
    
    After:
    
    1454 r znver1_ieu_translate
    1454 r znver1_translate
    2304 r znver1_ieu_transitions
    428108 r znver1_fp_min_issue_delay
    856216 r znver1_fp_transitions
    
    gcc/ChangeLog:
    
            PR target/87832
            * config/i386/znver.md (znver1_idiv): New automaton.
            (znver1-idiv): New unit.
            (znver1_idiv_DI): Correct unit and cycles in the reservation.
            (znver1_idiv_SI): Ditto.
            (znver1_idiv_HI): Ditto.
            (znver1_idiv_QI): Ditto.
            (znver1_idiv_mem_DI): Ditto.
            (znver1_idiv_mem_SI): Ditto.
            (znver1_idiv_mem_HI): Ditto.
            (znver1_idiv_mem_QI): Ditto.
            (znver3_idiv_DI): Ditto.
            (znver3_idiv_SI): Ditto.
            (znver3_idiv_HI): Ditto.
            (znver3_idiv_QI): Ditto.
            (znver3_idiv_mem_DI): Ditto.
            (znver3_idiv_mem_SI): Ditto.
            (znver3_idiv_mem_HI): Ditto.
            (znver3_idiv_mem_QI): Ditto.
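
The commit message above attributes the blowup to the 41-cycle reservation interacting with other reservations in the same automaton. A toy model can make that arithmetic visible. This is only a sketch and not genautomata's actual algorithm: each unit is reduced to a single busy counter, the function name `reachable_states` is hypothetical, and the `OTHER` durations are made-up values standing in for other multi-cycle reservations.

```python
from collections import deque


def reachable_states(durations):
    """Count reachable states of a toy automaton with one busy counter per unit.

    durations[i] is the number of cycles one instruction occupies unit i.
    """
    start = tuple(0 for _ in durations)
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        # Transition 1: advance one cycle, every busy counter ticks down.
        successors = [tuple(max(c - 1, 0) for c in state)]
        # Transition 2: issue an instruction on any unit that is free.
        for i, busy in enumerate(state):
            if busy == 0:
                successors.append(state[:i] + (durations[i],) + state[i + 1:])
        for nxt in successors:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)


OTHER = [3, 2]  # hypothetical multi-cycle reservations on two other units

print("shared automaton, 41-cycle divider:", reachable_states([41] + OTHER))
print("split automata, 41-cycle divider:", reachable_states([41]) + reachable_states(OTHER))
print("split automata, 14-cycle divider:", reachable_states([14]) + reachable_states(OTHER))
```

In this toy model the state count of a shared automaton is the product of (duration + 1) over its units, while split automata contribute only the sum of their individual counts. That is why both moving the divider into its own automaton and shortening its reservation reduce the table sizes.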
Comment 3 Alexander Monakov 2022-11-07 11:23:39 UTC
Followup patches have been posted at https://inbox.sourceware.org/gcc-patches/20221101162637.14238-1-amonakov@ispras.ru/
Comment 4 GCC Commits 2022-11-16 13:41:54 UTC
The master branch has been updated by Alexander Monakov <amonakov@gcc.gnu.org>:

https://gcc.gnu.org/g:dd744f06c9952f92738b0860630085f0f0b99574

commit r13-4092-gdd744f06c9952f92738b0860630085f0f0b99574
Author: Alexander Monakov <amonakov@ispras.ru>
Date:   Tue Nov 1 17:04:25 2022 +0300

    i386: correct x87&SSE division modeling in znver.md
    
    Correct modeling of division instructions in the SIMD/FP domain for
    AMD Zen architectures and avoid combinatorial explosion of automaton
    tables by modeling the separate floating-point division unit and
    correcting reservations to reflect reciprocal throughput of the
    corresponding instructions, similar to earlier commit
    5cee5f94000 ("i386: correct integer division modeling in znver.md").
    
    Division is partially pipelined and some instructions have fractional
    throughput (e.g. Zen 3 can issue divss and divsd each 3.5 and 4.5
    cycles on average, respectively). Considering these CPUs implement
    out-of-order execution, the model doesn't need to be exact to the last
    cycle, so simplify it by using 4/5 cycles for SF/DF modes, and not
    modeling the fact that FP3 pipe is occupied for one cycle.
    
    Top znver table sizes in insn-automata.o:
    
    Before:
    
    428108 r znver1_fp_min_issue_delay
    856216 r znver1_fp_transitions
    
    After:
    
    30056 r znver1_fp_min_issue_delay
    120224 r znver1_fp_transitions
    
    gcc/ChangeLog:
    
            PR target/87832
            * config/i386/znver.md (znver1_fdiv): New automaton.
            (znver1-fdiv): New unit.
            (znver1_fp_op_div): Correct unit and cycles in the reservation.
            (znver1_fp_op_div_load): Ditto.
            (znver1_fp_op_idiv_load): Ditto.
            (znver2_fp_op_idiv_load): Ditto.
            (znver1_ssediv_ss_ps): Ditto.
            (znver1_ssediv_ss_ps_load): Ditto.
            (znver1_ssediv_sd_pd): Ditto.
            (znver1_ssediv_sd_pd_load): Ditto.
            (znver1_ssediv_avx256_ps): Ditto.
            (znver1_ssediv_avx256_ps_load): Ditto.
            (znver1_ssediv_avx256_pd): Ditto.
            (znver1_ssediv_avx256_pd_load): Ditto.
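
Reservations count whole cycles, so the fractional reciprocal throughputs quoted above have to be turned into integers. A minimal sketch of that step follows, assuming the values are rounded up (which matches the 4/5 cycles chosen in the commit message; the commit does not state the rounding rule). The helper name `reservation_cycles` is hypothetical.

```python
import math

# Zen 3 reciprocal throughputs quoted in the commit message.
MEASURED = {"divss": 3.5, "divsd": 4.5}


def reservation_cycles(recip_throughput):
    """Round a fractional reciprocal throughput up to whole cycles.

    math.ceil is used on purpose: Python's round() rounds halves to the
    nearest even integer, so round(4.5) would give 4 rather than 5.
    """
    return math.ceil(recip_throughput)


for insn, measured in MEASURED.items():
    modeled = reservation_cycles(measured)
    overestimate = (modeled - measured) / measured
    print(f"{insn}: measured {measured}, modeled {modeled} ({overestimate:.0%} over)")
```

The model overestimates divider occupancy by half a cycle per instruction, which the commit message treats as acceptable for out-of-order CPUs.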
Comment 5 GCC Commits 2022-11-16 13:41:59 UTC
The master branch has been updated by Alexander Monakov <amonakov@gcc.gnu.org>:

https://gcc.gnu.org/g:d4cc7a8c4a623b62dd0d486d7780d91b58eb6f1f

commit r13-4093-gd4cc7a8c4a623b62dd0d486d7780d91b58eb6f1f
Author: Alexander Monakov <amonakov@ispras.ru>
Date:   Tue Nov 1 17:53:13 2022 +0300

    i386: correct x87&SSE multiplication modeling in znver.md
    
    All multiplication instructions are fully pipelined, except AVX256
    instructions on Zen 1, which issue over two cycles on a 128-bit unit.
    Correct the model accordingly to reduce combinatorial explosion in
    automaton tables.
    
    Top znver table sizes in insn-automata.o:
    
    Before:
    
    30056 r znver1_fp_min_issue_delay
    120224 r znver1_fp_transitions
    
    After:
    
    6720 r znver1_fp_min_issue_delay
    53760 r znver1_fp_transitions
    
    gcc/ChangeLog:
    
            PR target/87832
            * config/i386/znver.md: (znver1_fp_op_mul): Correct cycles in
            the reservation.
            (znver1_fp_op_mul_load): Ditto.
            (znver1_mmx_mul): Ditto.
            (znver1_mmx_load): Ditto.
            (znver1_ssemul_ss_ps): Ditto.
            (znver1_ssemul_ss_ps_load): Ditto.
            (znver1_ssemul_avx256_ps): Ditto.
            (znver1_ssemul_avx256_ps_load): Ditto.
            (znver1_ssemul_sd_pd): Ditto.
            (znver1_ssemul_sd_pd_load): Ditto.
            (znver2_ssemul_sd_pd): Ditto.
            (znver2_ssemul_sd_pd_load): Ditto.
            (znver1_ssemul_avx256_pd): Ditto.
            (znver1_ssemul_avx256_pd_load): Ditto.
            (znver1_sseimul): Ditto.
            (znver1_sseimul_avx256): Ditto.
            (znver1_sseimul_load): Ditto.
            (znver1_sseimul_avx256_load): Ditto.
            (znver1_sseimul_di): Ditto.
            (znver1_sseimul_load_di): Ditto.
Comment 6 Alexander Monakov 2022-11-16 13:48:42 UTC
With these patches on trunk, the current situation is:

nm -CS -t d --defined-only gcc/insn-automata.o | sed 's/^[0-9]* 0*//' | sort -n | tail -40
2496 r slm_base
2527 r bdver3_load_min_issue_delay
2746 r glm_base
3892 r bdver1_fp_base
4444 r bdver1_ieu_min_issue_delay
4492 r geode_base
4608 r bdver3_ieu_transitions
6402 r bdver1_load_transitions
6720 r znver1_fp_min_issue_delay
7862 r athlon_fp_check
7862 r athlon_fp_transitions
9122 r lujiazui_core_base
9997 t internal_insn_latency(int, int, rtx_insn*, rtx_insn*)
10108 r bdver3_load_transitions
10498 r geode_check
10498 r geode_transitions
11632 r print_reservation(_IO_FILE*, rtx_insn*)::reservation_names
12575 r athlon_fp_min_issue_delay
12742 r btver2_fp_check
12742 r btver2_fp_transitions
13896 r slm_check
13896 r slm_transitions
17149 t internal_min_issue_delay(int, DFA_chip*)
17349 t internal_state_transition(int, DFA_chip*)
17776 r bdver1_ieu_transitions
20068 r bdver1_fp_check
20068 r bdver1_fp_transitions
26208 r slm_min_issue_delay
27244 r bdver1_fp_min_issue_delay
28518 r glm_check
28518 r glm_transitions
33690 r geode_min_issue_delay
46980 r bdver3_fp_min_issue_delay
49428 r glm_min_issue_delay
53730 r btver2_fp_min_issue_delay
53760 r znver1_fp_transitions
93960 r bdver3_fp_transitions
106102 r lujiazui_core_check
106102 r lujiazui_core_transitions
196123 r lujiazui_core_min_issue_delay

What shall we do with similar blowups in lujiazui and b[dt]ver[123] models?
Comment 7 Jan Hubicka 2022-11-16 14:16:32 UTC
> 53730 r btver2_fp_min_issue_delay
> 53760 r znver1_fp_transitions
> 93960 r bdver3_fp_transitions
> 106102 r lujiazui_core_check
> 106102 r lujiazui_core_transitions
> 196123 r lujiazui_core_min_issue_delay
> 
> What shall we do with similar blowups in lujiazui and b[dt]ver[123] models?
Yes, I think that makes sense...

Honza
Comment 8 Alexander Monakov 2022-11-16 14:30:19 UTC
(In reply to Jan Hubicka from comment #7)
> > 53730 r btver2_fp_min_issue_delay
> > 53760 r znver1_fp_transitions
> > 93960 r bdver3_fp_transitions
> > 106102 r lujiazui_core_check
> > 106102 r lujiazui_core_transitions
> > 196123 r lujiazui_core_min_issue_delay
> > 
> > What shall we do with similar blowups in lujiazui and b[dt]ver[123] models?
> Yes, I think that makes sense...

Do you mean we should fix modeling of divisions there as well? I don't have latency/throughput measurements for those CPUs, nor access so I can run experiments myself, unfortunately.

I guess you mean just making a patch to model division units separately, leaving latency/throughput as in current incorrect models, and leave it to manufacturers to correct it? Alternatively, for AMD Bobcat and Bulldozer we might be able to crowd-source it eventually.
Comment 9 Jan Hubicka 2022-11-16 15:34:06 UTC
> 
> Do you mean we should fix modeling of divisions there as well? I don't have
> latency/throughput measurements for those CPUs, nor access so I can run
> experiments myself, unfortunately.
> 
> I guess you mean just making a patch to model division units separately,
> leaving latency/throughput as in current incorrect models, and leave it to
> manufacturers to correct it? Alternatively, for AMD Bobcat and Bulldozer we
> might be able to crowd-source it eventually.
Actually for older cores I think the manufacturers do not care much.  I
still have a working Bulldozer machine and I can do some testing.
I think in the Bulldozer case I was basing the latency/throughput on data in
Agner Fog's manuals.  How do you test it?
Honza
Comment 10 Alexander Monakov 2022-11-16 17:15:38 UTC
(In reply to Jan Hubicka from comment #9)
> Actually for older cores I think the manufacturers do not care much.  I
> still have a working Bulldozer machine and I can do some testing.
> I think in the Bulldozer case I was basing the latency/throughput on data in
> Agner Fog's manuals.

Ahhh, how could I forget that his manuals have data for those cores too. Thanks for the reminder! This solves the conundrum nicely:

AMD Jaguar ('btver2' in GCC): int/fp division is not pipelined, separate int/fp dividers;

AMD Bulldozer, Steamroller ('bdver1', 'bdver3'): int division is not pipelined (one divider), fp division is slightly pipelined (two independent dividers);

Zhaoxin Lujiazui appears to use the same divider as VIA Nano 3000, which is not pipelined.

So it's already enough to produce a decent patch.

> How do you test it?

For AMD Zen patches I was using measurements by Andreas Abel ( https://uops.info/table_overview.html ) and running a few experiments myself by coding loops in NASM and timing them with 'perf stat' on a Zen 2 CPU.
Comment 11 Alexander Monakov 2022-12-07 15:23:57 UTC
Factoring out Lujiazui divider shrinks its tables by almost 20x:

3 r lujiazui_decoder_min_issue_delay
20 r lujiazui_decoder_transitions
32 r lujiazui_agu_min_issue_delay
126 r lujiazui_agu_transitions
304 r lujiazui_div_base
352 r lujiazui_div_check
352 r lujiazui_div_transitions
1152 r lujiazui_core_min_issue_delay
1592 r lujiazui_agu_translate
1592 r lujiazui_core_translate
1592 r lujiazui_decoder_translate
1592 r lujiazui_div_translate
3952 r lujiazui_div_min_issue_delay
9216 r lujiazui_core_transitions
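
The "almost 20x" figure can be cross-checked from the numbers in this thread. The sketch below sums the lujiazui symbols visible in the comment 6 listing (only the largest ones appear there, so the "before" total is a lower bound) and the full list above. The variable names are illustrative.

```python
# lujiazui symbols visible in the "tail -40" listing of comment 6 (before).
before = {
    "lujiazui_core_base": 9122,
    "lujiazui_core_check": 106102,
    "lujiazui_core_transitions": 106102,
    "lujiazui_core_min_issue_delay": 196123,
}

# All lujiazui tables after factoring out the divider (this comment).
after = {
    "lujiazui_decoder_min_issue_delay": 3,
    "lujiazui_decoder_transitions": 20,
    "lujiazui_agu_min_issue_delay": 32,
    "lujiazui_agu_transitions": 126,
    "lujiazui_div_base": 304,
    "lujiazui_div_check": 352,
    "lujiazui_div_transitions": 352,
    "lujiazui_core_min_issue_delay": 1152,
    "lujiazui_agu_translate": 1592,
    "lujiazui_core_translate": 1592,
    "lujiazui_decoder_translate": 1592,
    "lujiazui_div_translate": 1592,
    "lujiazui_div_min_issue_delay": 3952,
    "lujiazui_core_transitions": 9216,
}

before_total = sum(before.values())
after_total = sum(after.values())
ratio = before_total / after_total
print(f"before >= {before_total} bytes, after = {after_total} bytes, ratio ~ {ratio:.1f}x")
```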
Comment 12 Martin Liška 2022-12-08 09:48:10 UTC
Nice work Alexander!
Comment 13 GCC Commits 2023-01-02 16:39:08 UTC
The master branch has been updated by Alexander Monakov <amonakov@gcc.gnu.org>:

https://gcc.gnu.org/g:ec1db9017939bb8289c9bd63aace66c0f3957ecd

commit r13-4956-gec1db9017939bb8289c9bd63aace66c0f3957ecd
Author: Alexander Monakov <amonakov@ispras.ru>
Date:   Fri Dec 9 20:47:55 2022 +0300

    i386: correct division modeling in lujiazui.md
    
    Model the divider in Lujiazui processors as a separate automaton to
    significantly reduce the overall model size. This should also result
    in improved accuracy, as pipe 0 should be able to accept new
    instructions while the divider is occupied.
    
    It is unclear why integer divisions are modeled as if pipes 0-3 are all
    occupied. I've opted to keep a single-cycle reservation of all four
    pipes together, so GCC should continue trying to pack instructions
    around a division accordingly.
    
    Currently top three symbols in insn-automata.o are:
    
    106102 r lujiazui_core_check
    106102 r lujiazui_core_transitions
    196123 r lujiazui_core_min_issue_delay
    
    This patch shrinks all lujiazui tables to:
    
    3 r lujiazui_decoder_min_issue_delay
    20 r lujiazui_decoder_transitions
    32 r lujiazui_agu_min_issue_delay
    126 r lujiazui_agu_transitions
    304 r lujiazui_div_base
    352 r lujiazui_div_check
    352 r lujiazui_div_transitions
    1152 r lujiazui_core_min_issue_delay
    1592 r lujiazui_agu_translate
    1592 r lujiazui_core_translate
    1592 r lujiazui_decoder_translate
    1592 r lujiazui_div_translate
    3952 r lujiazui_div_min_issue_delay
    9216 r lujiazui_core_transitions
    
    This continues the work on reducing i386 insn-automata.o size started
    with similar fixes for division and multiplication instructions in
    znver.md.
    
    gcc/ChangeLog:
    
            PR target/87832
            * config/i386/lujiazui.md (lujiazui_div): New automaton.
            (lua_div): New unit.
            (lua_idiv_qi): Correct unit in the reservation.
            (lua_idiv_qi_load): Ditto.
            (lua_idiv_hi): Ditto.
            (lua_idiv_hi_load): Ditto.
            (lua_idiv_si): Ditto.
            (lua_idiv_si_load): Ditto.
            (lua_idiv_di): Ditto.
            (lua_idiv_di_load): Ditto.
            (lua_fdiv_SF): Ditto.
            (lua_fdiv_SF_load): Ditto.
            (lua_fdiv_DF): Ditto.
            (lua_fdiv_DF_load): Ditto.
            (lua_fdiv_XF): Ditto.
            (lua_fdiv_XF_load): Ditto.
            (lua_ssediv_SF): Ditto.
            (lua_ssediv_load_SF): Ditto.
            (lua_ssediv_V4SF): Ditto.
            (lua_ssediv_load_V4SF): Ditto.
            (lua_ssediv_V8SF): Ditto.
            (lua_ssediv_load_V8SF): Ditto.
            (lua_ssediv_SD): Ditto.
            (lua_ssediv_load_SD): Ditto.
            (lua_ssediv_V2DF): Ditto.
            (lua_ssediv_load_V2DF): Ditto.
            (lua_ssediv_V4DF): Ditto.
            (lua_ssediv_load_V4DF): Ditto.