Looking at i386 insn-automata.o, out of its 2.2M rodata size almost all is due to very large tables for AMD CPU models. Note how znver additions are more than half of overall size. What is causing that and can it be improved?

    2176 core2_core_transitions
    2496 slm_base
    2527 bdver3_load_min_issue_delay
    2746 glm_base
    3892 bdver1_fp_base
    4261 insn_latency(rtx_insn*, rtx_insn*)
    4444 bdver1_ieu_min_issue_delay
    4492 geode_base
    4608 bdver3_ieu_transitions
    6402 bdver1_load_transitions
    7862 athlon_fp_check
    7862 athlon_fp_transitions
    9433 internal_min_issue_delay(int, DFA_chip*)
   10108 bdver3_load_transitions
   10360 print_reservation(_IO_FILE*, rtx_insn*)::reservation_names
   10498 geode_check
   10498 geode_transitions
   12575 athlon_fp_min_issue_delay
   12599 internal_state_transition(int, DFA_chip*)
   12742 btver2_fp_check
   12742 btver2_fp_transitions
   13896 slm_transitions
   13896 slm_check
   17776 bdver1_ieu_transitions
   20068 bdver1_fp_check
   20068 bdver1_fp_transitions
   26208 slm_min_issue_delay
   27244 bdver1_fp_min_issue_delay
   28518 glm_transitions
   28518 glm_check
   33690 geode_min_issue_delay
   46980 bdver3_fp_min_issue_delay
   49428 glm_min_issue_delay
   53730 btver2_fp_min_issue_delay
   68160 znver1_ieu_min_issue_delay
   93960 bdver3_fp_transitions
  136320 znver1_ieu_transitions
  428108 znver1_fp_min_issue_delay
  856216 znver1_fp_transitions
Suggested partial fix for the integer-pipe side of the blowup: https://inbox.sourceware.org/gcc-patches/4549f27b-238a-7d77-f72b-cc77df8ae36e@ispras.ru/
The master branch has been updated by Alexander Monakov <amonakov@gcc.gnu.org>:

https://gcc.gnu.org/g:5cee5f94000ee5eabce9b223c44c7923c1c69f61

commit r13-3589-g5cee5f94000ee5eabce9b223c44c7923c1c69f61
Author: Alexander Monakov <amonakov@ispras.ru>
Date:   Mon Oct 31 17:35:57 2022 +0300

    i386: correct integer division modeling in znver.md

    In znver.md, division instructions have descriptions like

    (define_insn_reservation "znver1_idiv_DI" 41
                            (and (eq_attr "cpu" "znver1,znver2")
                                 (and (eq_attr "type" "idiv")
                                      (and (eq_attr "mode" "DI")
                                           (eq_attr "memory" "none"))))
                            "znver1-double,znver1-ieu2*41")

    which says that DImode idiv has latency 41 (which is correct) and that
    it occupies 2nd integer execution unit for 41 consecutive cycles, but
    that is not correct:

    1) the division instruction is partially pipelined, and has throughput
       1/14, not 1/41;

    2) for the most part it occupies a separate division unit, not the
       general arithmetic unit.

    Evidently, interaction of such 41-cycle paths with the rest of
    reservations causes a combinatorial explosion in the automaton.

    Fix this by modeling the integer division unit properly, and correcting
    reservations to use the measured reciprocal throughput of those
    instructions (available from uops.info). A similar correction for
    floating-point divisions is left for a followup patch.

    Top 5 znver table sizes, before:

     68692 r znver1_ieu_check
     68692 r znver1_ieu_transitions
     99792 r znver1_ieu_min_issue_delay
    428108 r znver1_fp_min_issue_delay
    856216 r znver1_fp_transitions

    After:

      1454 r znver1_ieu_translate
      1454 r znver1_translate
      2304 r znver1_ieu_transitions
    428108 r znver1_fp_min_issue_delay
    856216 r znver1_fp_transitions

    gcc/ChangeLog:

            PR target/87832
            * config/i386/znver.md (znver1_idiv): New automaton.
            (znver1-idiv): New unit.
            (znver1_idiv_DI): Correct unit and cycles in the reservation.
            (znver1_idiv_SI): Ditto.
            (znver1_idiv_HI): Ditto.
            (znver1_idiv_QI): Ditto.
            (znver1_idiv_mem_DI): Ditto.
            (znver1_idiv_mem_SI): Ditto.
            (znver1_idiv_mem_HI): Ditto.
            (znver1_idiv_mem_QI): Ditto.
            (znver3_idiv_DI): Ditto.
            (znver3_idiv_SI): Ditto.
            (znver3_idiv_HI): Ditto.
            (znver3_idiv_QI): Ditto.
            (znver3_idiv_mem_DI): Ditto.
            (znver3_idiv_mem_SI): Ditto.
            (znver3_idiv_mem_HI): Ditto.
            (znver3_idiv_mem_QI): Ditto.
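
To illustrate why a long reservation on a shared unit inflates the tables, here is a minimal Python sketch of a toy reservation automaton. It is not how genautomata works internally; it only counts reachable "future busy" patterns. The unit names "ieu1", "ieu2" and "idiv", the 3-cycle "mul" reservation, and the helper names are assumptions made for the illustration; the 41-cycle and 14-cycle figures are taken from the commit message above and are not meant to reproduce the exact reservations in the patch.

```python
from collections import deque


def project(reservations, units):
    """Restrict each reservation to the units owned by one automaton."""
    projected = {}
    for name, cycles in reservations.items():
        kept = [frozenset(c) & units for c in cycles]
        while kept and not kept[-1]:      # drop trailing idle cycles
            kept.pop()
        if kept:                          # insn does not touch this automaton otherwise
            projected[name] = tuple(kept)
    return projected


def count_states(reservations):
    """Count reachable states; a state is the tuple of units busy in each future cycle."""
    start = ()
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        successors = [state[1:]]          # advance the clock by one cycle
        for cycles in reservations.values():
            # An instruction can issue only if none of its units is already busy.
            if any(i < len(state) and state[i] & busy
                   for i, busy in enumerate(cycles)):
                continue
            length = max(len(state), len(cycles))
            merged = tuple(
                (state[i] if i < len(state) else frozenset())
                | (cycles[i] if i < len(cycles) else frozenset())
                for i in range(length)
            )
            successors.append(merged)
        for nxt in successors:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)


# Hypothetical toy model: division holds the shared ALU unit for 41 cycles.
old = {"alu": [{"ieu2"}], "mul": [{"ieu1"}] * 3, "div": [{"ieu2"}] * 41}
# Hypothetical toy model: division holds its own unit for 14 cycles.
new = {"alu": [{"ieu2"}], "mul": [{"ieu1"}] * 3, "div": [{"idiv"}] * 14}

main_units = frozenset({"ieu1", "ieu2"})
div_units = frozenset({"idiv"})

old_states = count_states(project(old, main_units))
new_main_states = count_states(project(new, main_units))
new_div_states = count_states(project(new, div_units))
print(old_states, new_main_states, new_div_states)
```

In this toy model the state count of a combined automaton grows roughly with the product of the per-unit busy lengths, while two separate automata cost only their sum. With many more units and reservations, as in the real znver.md, the same effect is much stronger.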
Followup patches have been posted at https://inbox.sourceware.org/gcc-patches/20221101162637.14238-1-amonakov@ispras.ru/
The master branch has been updated by Alexander Monakov <amonakov@gcc.gnu.org>:

https://gcc.gnu.org/g:dd744f06c9952f92738b0860630085f0f0b99574

commit r13-4092-gdd744f06c9952f92738b0860630085f0f0b99574
Author: Alexander Monakov <amonakov@ispras.ru>
Date:   Tue Nov 1 17:04:25 2022 +0300

    i386: correct x87&SSE division modeling in znver.md

    Correct modeling of division instructions in the SIMD/FP domain for
    AMD Zen architectures and avoid combinatorial explosion of automaton
    tables by modeling the separate floating-point division unit and
    correcting reservations to reflect reciprocal throughput of the
    corresponding instructions, similar to earlier commit 5cee5f94000
    ("i386: correct integer division modeling in znver.md").

    Division is partially pipelined and some instructions have fractional
    throughput (e.g. Zen 3 can issue divss and divsd each 3.5 and 4.5
    cycles on average, respectively). Considering these CPUs implement
    out-of-order execution, the model doesn't need to be exact to the last
    cycle, so simplify it by using 4/5 cycles for SF/DF modes, and not
    modeling the fact that FP3 pipe is occupied for one cycle.

    Top znver table sizes in insn-automata.o:

    Before:

    428108 r znver1_fp_min_issue_delay
    856216 r znver1_fp_transitions

    After:

     30056 r znver1_fp_min_issue_delay
    120224 r znver1_fp_transitions

    gcc/ChangeLog:

            PR target/87832
            * config/i386/znver.md (znver1_fdiv): New automaton.
            (znver1-fdiv): New unit.
            (znver1_fp_op_div): Correct unit and cycles in the reservation.
            (znver1_fp_op_div_load): Ditto.
            (znver1_fp_op_idiv_load): Ditto.
            (znver2_fp_op_idiv_load): Ditto.
            (znver1_ssediv_ss_ps): Ditto.
            (znver1_ssediv_ss_ps_load): Ditto.
            (znver1_ssediv_sd_pd): Ditto.
            (znver1_ssediv_sd_pd_load): Ditto.
            (znver1_ssediv_avx256_ps): Ditto.
            (znver1_ssediv_avx256_ps_load): Ditto.
            (znver1_ssediv_avx256_pd): Ditto.
            (znver1_ssediv_avx256_pd_load): Ditto.
The master branch has been updated by Alexander Monakov <amonakov@gcc.gnu.org>:

https://gcc.gnu.org/g:d4cc7a8c4a623b62dd0d486d7780d91b58eb6f1f

commit r13-4093-gd4cc7a8c4a623b62dd0d486d7780d91b58eb6f1f
Author: Alexander Monakov <amonakov@ispras.ru>
Date:   Tue Nov 1 17:53:13 2022 +0300

    i386: correct x87&SSE multiplication modeling in znver.md

    All multiplication instructions are fully pipelined, except AVX256
    instructions on Zen 1, which issue over two cycles on a 128-bit unit.
    Correct the model accordingly to reduce combinatorial explosion in
    automaton tables.

    Top znver table sizes in insn-automata.o:

    Before:

     30056 r znver1_fp_min_issue_delay
    120224 r znver1_fp_transitions

    After:

      6720 r znver1_fp_min_issue_delay
     53760 r znver1_fp_transitions

    gcc/ChangeLog:

            PR target/87832
            * config/i386/znver.md: (znver1_fp_op_mul): Correct cycles in
            the reservation.
            (znver1_fp_op_mul_load): Ditto.
            (znver1_mmx_mul): Ditto.
            (znver1_mmx_load): Ditto.
            (znver1_ssemul_ss_ps): Ditto.
            (znver1_ssemul_ss_ps_load): Ditto.
            (znver1_ssemul_avx256_ps): Ditto.
            (znver1_ssemul_avx256_ps_load): Ditto.
            (znver1_ssemul_sd_pd): Ditto.
            (znver1_ssemul_sd_pd_load): Ditto.
            (znver2_ssemul_sd_pd): Ditto.
            (znver2_ssemul_sd_pd_load): Ditto.
            (znver1_ssemul_avx256_pd): Ditto.
            (znver1_ssemul_avx256_pd_load): Ditto.
            (znver1_sseimul): Ditto.
            (znver1_sseimul_avx256): Ditto.
            (znver1_sseimul_load): Ditto.
            (znver1_sseimul_avx256_load): Ditto.
            (znver1_sseimul_di): Ditto.
            (znver1_sseimul_load_di): Ditto.
With these patches on trunk, current situation is:

nm -CS -t d --defined-only gcc/insn-automata.o | sed 's/^[0-9]* 0*//' | sort -n | tail -40

    2496 r slm_base
    2527 r bdver3_load_min_issue_delay
    2746 r glm_base
    3892 r bdver1_fp_base
    4444 r bdver1_ieu_min_issue_delay
    4492 r geode_base
    4608 r bdver3_ieu_transitions
    6402 r bdver1_load_transitions
    6720 r znver1_fp_min_issue_delay
    7862 r athlon_fp_check
    7862 r athlon_fp_transitions
    9122 r lujiazui_core_base
    9997 t internal_insn_latency(int, int, rtx_insn*, rtx_insn*)
   10108 r bdver3_load_transitions
   10498 r geode_check
   10498 r geode_transitions
   11632 r print_reservation(_IO_FILE*, rtx_insn*)::reservation_names
   12575 r athlon_fp_min_issue_delay
   12742 r btver2_fp_check
   12742 r btver2_fp_transitions
   13896 r slm_check
   13896 r slm_transitions
   17149 t internal_min_issue_delay(int, DFA_chip*)
   17349 t internal_state_transition(int, DFA_chip*)
   17776 r bdver1_ieu_transitions
   20068 r bdver1_fp_check
   20068 r bdver1_fp_transitions
   26208 r slm_min_issue_delay
   27244 r bdver1_fp_min_issue_delay
   28518 r glm_check
   28518 r glm_transitions
   33690 r geode_min_issue_delay
   46980 r bdver3_fp_min_issue_delay
   49428 r glm_min_issue_delay
   53730 r btver2_fp_min_issue_delay
   53760 r znver1_fp_transitions
   93960 r bdver3_fp_transitions
  106102 r lujiazui_core_check
  106102 r lujiazui_core_transitions
  196123 r lujiazui_core_min_issue_delay

What shall we do with similar blowups in lujiazui and b[dt]ver[123] models?
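
For comparing models it can help to total the table sizes per automaton rather than per symbol. The following is a rough Python sketch that does this for a listing in the "size type name" form produced by the pipeline above. The function name, the regular expression, and the choice to group by the part of the name before the table suffix are assumptions of this sketch; the suffixes are the ones that appear in the listings in this thread.

```python
import re
from collections import defaultdict

# Table-name suffixes seen in the listings in this thread.
TABLE_SUFFIX = re.compile(
    r"^(?P<automaton>\w+?)_(?:transitions|check|base|min_issue_delay|translate)$"
)


def sizes_by_automaton(listing):
    """Sum read-only table sizes per automaton from 'size type name' lines."""
    totals = defaultdict(int)
    for line in listing.splitlines():
        fields = line.split(None, 2)      # demangled names may contain spaces
        if len(fields) != 3 or not fields[0].isdigit():
            continue
        size, kind, name = fields
        match = TABLE_SUFFIX.match(name)
        if kind.lower() == "r" and match:
            totals[match.group("automaton")] += int(size)
    return dict(totals)


# A few lines copied from the listings in this thread.
SAMPLE = """\
428108 r znver1_fp_min_issue_delay
856216 r znver1_fp_transitions
17149 t internal_min_issue_delay(int, DFA_chip*)
2496 r slm_base
"""

totals = sizes_by_automaton(SAMPLE)
for automaton, size in sorted(totals.items(), key=lambda item: item[1]):
    print(f"{size:8d} {automaton}")
```

Function symbols (type t) and rodata symbols that do not end in one of the listed suffixes are skipped, so the totals cover only the automaton tables.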
> 53730 r btver2_fp_min_issue_delay
> 53760 r znver1_fp_transitions
> 93960 r bdver3_fp_transitions
> 106102 r lujiazui_core_check
> 106102 r lujiazui_core_transitions
> 196123 r lujiazui_core_min_issue_delay
>
> What shall we do with similar blowups in lujiazui and b[dt]ver[123] models?

Yes, I think that makes sense...

Honza
(In reply to Jan Hubicka from comment #7)
> > 53730 r btver2_fp_min_issue_delay
> > 53760 r znver1_fp_transitions
> > 93960 r bdver3_fp_transitions
> > 106102 r lujiazui_core_check
> > 106102 r lujiazui_core_transitions
> > 196123 r lujiazui_core_min_issue_delay
> >
> > What shall we do with similar blowups in lujiazui and b[dt]ver[123] models?
> Yes, I think that makes sense...

Do you mean we should fix modeling of divisions there as well? Unfortunately, I don't have latency/throughput measurements for those CPUs, nor access to them so I can run experiments myself.

I guess you mean just making a patch to model division units separately, leaving latency/throughput as in the current incorrect models, and leaving it to manufacturers to correct it? Alternatively, for AMD Bobcat and Bulldozer we might be able to crowd-source it eventually.
> Do you mean we should fix modeling of divisions there as well? I don't have
> latency/throughput measurements for those CPUs, nor access so I can run
> experiments myself, unfortunately.
>
> I guess you mean just making a patch to model division units separately,
> leaving latency/throughput as in current incorrect models, and leave it to
> manufacturers to correct it? Alternatively, for AMD Bobcat and Bulldozer we
> might be able to crowd-source it eventually.

Actually, for older cores I think the manufacturers do not care much. I still have a working Bulldozer machine and I can do some testing. I think in the Bulldozer case I was basing the latency/throughput on data in Agner Fog's manuals. How do you test it?

Honza
(In reply to Jan Hubicka from comment #9)
> Actually for older cores I think the manufacturers do not care much. I
> still have a working Bulldozer machine and I can do some testing.
> I think in Bulldozer case I was basing the latency throughput on data in
> Agner Fog's manuals.

Ahhh, how could I forget that his manuals have data for those cores too. Thanks for the reminder! This solves the conundrum nicely:

AMD Jaguar ('btver2' in GCC): int/fp division is not pipelined, separate int/fp dividers;

AMD Bulldozer, Steamroller ('bdver1', 'bdver3'): int division is not pipelined (one divider), fp division is slightly pipelined (two independent dividers);

Zhaoxin Lujiazui appears to use the same divider as VIA Nano 3000, which is not pipelined.

So it's already enough to produce a decent patch.

> How do you test it?

For the AMD Zen patches I was using measurements by Andreas Abel ( https://uops.info/table_overview.html ) and running a few experiments myself by coding loops in NASM and timing them with 'perf stat' on a Zen 2 CPU.
Factoring out Lujiazui divider shrinks its tables by almost 20x:

     3 r lujiazui_decoder_min_issue_delay
    20 r lujiazui_decoder_transitions
    32 r lujiazui_agu_min_issue_delay
   126 r lujiazui_agu_transitions
   304 r lujiazui_div_base
   352 r lujiazui_div_check
   352 r lujiazui_div_transitions
  1152 r lujiazui_core_min_issue_delay
  1592 r lujiazui_agu_translate
  1592 r lujiazui_core_translate
  1592 r lujiazui_decoder_translate
  1592 r lujiazui_div_translate
  3952 r lujiazui_div_min_issue_delay
  9216 r lujiazui_core_transitions
Nice work Alexander!
The master branch has been updated by Alexander Monakov <amonakov@gcc.gnu.org>:

https://gcc.gnu.org/g:ec1db9017939bb8289c9bd63aace66c0f3957ecd

commit r13-4956-gec1db9017939bb8289c9bd63aace66c0f3957ecd
Author: Alexander Monakov <amonakov@ispras.ru>
Date:   Fri Dec 9 20:47:55 2022 +0300

    i386: correct division modeling in lujiazui.md

    Model the divider in Lujiazui processors as a separate automaton to
    significantly reduce the overall model size. This should also result
    in improved accuracy, as pipe 0 should be able to accept new
    instructions while the divider is occupied.

    It is unclear why integer divisions are modeled as if pipes 0-3 are
    all occupied. I've opted to keep a single-cycle reservation of all
    four pipes together, so GCC should continue trying to pack
    instructions around a division accordingly.

    Currently top three symbols in insn-automata.o are:

    106102 r lujiazui_core_check
    106102 r lujiazui_core_transitions
    196123 r lujiazui_core_min_issue_delay

    This patch shrinks all lujiazui tables to:

         3 r lujiazui_decoder_min_issue_delay
        20 r lujiazui_decoder_transitions
        32 r lujiazui_agu_min_issue_delay
       126 r lujiazui_agu_transitions
       304 r lujiazui_div_base
       352 r lujiazui_div_check
       352 r lujiazui_div_transitions
      1152 r lujiazui_core_min_issue_delay
      1592 r lujiazui_agu_translate
      1592 r lujiazui_core_translate
      1592 r lujiazui_decoder_translate
      1592 r lujiazui_div_translate
      3952 r lujiazui_div_min_issue_delay
      9216 r lujiazui_core_transitions

    This continues the work on reducing i386 insn-automata.o size started
    with similar fixes for division and multiplication instructions in
    znver.md.

    gcc/ChangeLog:

            PR target/87832
            * config/i386/lujiazui.md (lujiazui_div): New automaton.
            (lua_div): New unit.
            (lua_idiv_qi): Correct unit in the reservation.
            (lua_idiv_qi_load): Ditto.
            (lua_idiv_hi): Ditto.
            (lua_idiv_hi_load): Ditto.
            (lua_idiv_si): Ditto.
            (lua_idiv_si_load): Ditto.
            (lua_idiv_di): Ditto.
            (lua_idiv_di_load): Ditto.
            (lua_fdiv_SF): Ditto.
            (lua_fdiv_SF_load): Ditto.
            (lua_fdiv_DF): Ditto.
            (lua_fdiv_DF_load): Ditto.
            (lua_fdiv_XF): Ditto.
            (lua_fdiv_XF_load): Ditto.
            (lua_ssediv_SF): Ditto.
            (lua_ssediv_load_SF): Ditto.
            (lua_ssediv_V4SF): Ditto.
            (lua_ssediv_load_V4SF): Ditto.
            (lua_ssediv_V8SF): Ditto.
            (lua_ssediv_load_V8SF): Ditto.
            (lua_ssediv_SD): Ditto.
            (lua_ssediv_load_SD): Ditto.
            (lua_ssediv_V2DF): Ditto.
            (lua_ssediv_load_V2DF): Ditto.
            (lua_ssediv_V4DF): Ditto.
            (lua_ssediv_load_V4DF): Ditto.