The SPEC INTrate 2017 538.imagick_r benchmark is faster when compiled with GCC 14.2
and -Ofast -flto -march=native than with trunk/master on Zen 5 CPUs.

The regression was introduced in r15-3441-g4292297a0f938f (Jan Hubicka: Zen5
tuning part 5: update instruction latencies in x86-tune-costs).  It is the
modification of the "cost of ADDSS/SD SUBSS/SD insns" that is the culprit;
bumping it back to COSTS_N_INSNS (3) (instead of COSTS_N_INSNS (2)) makes the
regression go away.  Nevertheless, Honza claims the cost should be correct.

Perf stat of the slow run:

         116866.57 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
              8347      page-faults:u             #   71.423 /sec
      484499860679      cycles:u                  #    4.146 GHz
       21879349058      stalled-cycles-frontend:u #    4.52% frontend cycles idle
     2030074730877      instructions:u            #    4.19  insn per cycle
                                                  #    0.01  stalled cycles per insn
      224436542157      branches:u                #    1.920 G/sec
        1716173329      branch-misses:u           #    0.76% of all branches

     116.881252465 seconds time elapsed

     116.808499000 seconds user
       0.057350000 seconds sys

Perf report of the slow run (annotated assembly attached):

# Samples: 470K of event 'cycles:Pu'
# Event count (approx.): 484158470552
#
# Overhead       Samples  Command          Shared Object                    Symbol
# ........  ............  ...............  ...............................  .............................................
#
    44.71%        210348  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] MeanShiftImage
    28.76%        135308  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] GetVirtualPixelsFromNexus
    25.50%        120106  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] MorphologyApply

Perf stat of the fast run (with just the one cost reverted):

 Performance counter stats for 'taskset -c 0 specinvoke':

         108805.48 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
              8312      page-faults:u             #   76.393 /sec
      450981792793      cycles:u                  #    4.145 GHz
       22610930072      stalled-cycles-frontend:u #    5.01% frontend cycles idle
     1933965750890      instructions:u            #    4.29  insn per cycle
                                                  #    0.01  stalled cycles per insn
      224433996552      branches:u                #    2.063 G/sec
        1721069495      branch-misses:u           #    0.77% of all branches

     108.819368844 seconds time elapsed

     108.763582000 seconds user
       0.041314000 seconds sys

Perf report of the fast run (annotated assembly attached):

# Samples: 427K of event 'cycles:Pu'
# Event count (approx.): 439380128661
#
# Overhead       Samples  Command          Shared Object                    Symbol
# ........  ............  ...............  ...............................  ..................................................
#
    44.53%        190164  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] MeanShiftImage
    28.13%        120243  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] MorphologyApply
    26.20%        111906  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] GetVirtualPixelsFromNexus
Created attachment 60757 [details]
Perf annotate of the slow case

Created attachment 60758 [details]
Perf annotate of the fast case

Created attachment 60759 [details]
Output of -fopt-info-vec in the slow case

Created attachment 60760 [details]
Output of -fopt-info-vec in the fast case
This is likely a store-to-load-forwarding stall issue.  I have tracked it down
to two SLP1 transformations - dumps below.  Even though MeanShiftImage receives
10% more samples, its assembly is the same; the transformations slowing things
down happen in GetVirtualPixelsFromNexus (which is called from the hottest loop
of MeanShiftImage).

I have verified that the same slow-down happens with -flto-partition=one and
then used the option -fdbg-cnt=vect_slp:1-1043:1046-100000 (using unpatched
master revision 337b9ff4854) to avoid the two SLP vectorizations and confirmed
that the run-time performance is restored.

The two SLP1 transformations are reported to take place at magick/cache.c:2573,
but that is where the function GetVirtualPixelsFromNexus begins.  Looking at
dumps with -lineno, the vectorized statements are followed immediately by one
from magick/random.c:629 (load of the normalize field at the end of function
GetPseudoRandomValue).

The SLP1 transformations are:

--- /home/mjambor/gcc/benchmarks/cpu2017/benchspec/CPU/538.imagick_r/build/build_peak_trunk-lto-nat-m64.0000/gvpfn/imagick_r.ltrans1.ltrans.188t.slp1	2025-04-07 23:28:47.568570350 +0200
+++ gvpfn/imagick_r.ltrans1.ltrans.188t.slp1	2025-04-07 23:28:50.673570142 +0200
@@ -3897,54 +3899,10 @@
 magick/cache.c:2699:7: note: created &virtual_pixel
 magick/cache.c:2699:7: note: add new stmt: MEM <vector(4) short unsigned int> [(short unsigned int *)&virtual_pixel] = { 0, 0, 0, 65535 };
 magick/cache.c:2699:7: note: vectorizing stmts using SLP.
-magick/cache.c:2573:33: optimized: basic block part vectorized using 32 byte vectors
-magick/cache.c:2573:33: note: Vectorizing SLP tree:
-magick/cache.c:2573:33: note: node 0x3648860 (max_nunits=4, refcnt=1) vector(4) long unsigned int
-magick/cache.c:2573:33: note: op template: MEM[(long unsigned int *)prephitmp_900 + 32B] = _929;
-magick/cache.c:2573:33: note: 	stmt 0 MEM[(long unsigned int *)prephitmp_900 + 32B] = _929;
-magick/cache.c:2573:33: note: 	stmt 1 MEM[(long unsigned int *)prephitmp_900 + 40B] = D__lsm.193_931;
-magick/cache.c:2573:33: note: 	stmt 2 MEM[(long unsigned int *)prephitmp_900 + 48B] = D__lsm.194_932;
-magick/cache.c:2573:33: note: 	stmt 3 MEM[(long unsigned int *)prephitmp_900 + 56B] = D__lsm.195_930;
-magick/cache.c:2573:33: note: 	children 0x3648a10
-magick/cache.c:2573:33: note: node (external) 0x3648a10 (max_nunits=1, refcnt=1) vector(4) long unsigned int
-magick/cache.c:2573:33: note: 	{ _929, D__lsm.193_931, D__lsm.194_932, D__lsm.195_930 }
-magick/cache.c:2573:33: note: ------>vectorizing SLP node starting from: MEM[(long unsigned int *)prephitmp_900 + 32B] = _929;
-magick/cache.c:2573:33: note: vect_is_simple_use: operand D__lsm.193_931 = PHI <D__lsm.193_32(80)>, type of def: internal
-magick/cache.c:2573:33: note: vect_is_simple_use: operand D__lsm.194_932 = PHI <D__lsm.194_33(80)>, type of def: internal
-magick/cache.c:2573:33: note: vect_is_simple_use: operand D__lsm.195_930 = PHI <D__lsm.195_474(80)>, type of def: internal
-magick/cache.c:2573:33: note: transform store. ncopies = 1
-magick/cache.c:2573:33: note: create vector_type-pointer variable to type: vector(4) long unsigned int  vectorizing a pointer ref: MEM[(long unsigned int *)prephitmp_900 + 32B]
-magick/cache.c:2573:33: note: created vectp.214_881
-magick/cache.c:2573:33: note: add new stmt: MEM <vector(4) long unsigned int> [(long unsigned int *)vectp.214_881] = _882;
-magick/cache.c:2573:33: note: vectorizing stmts using SLP.
-magick/cache.c:2573:33: optimized: basic block part vectorized using 32 byte vectors
-magick/cache.c:2573:33: note: Vectorizing SLP tree:
-magick/cache.c:2573:33: note: node 0x3648aa0 (max_nunits=4, refcnt=1) vector(4) long unsigned int
-magick/cache.c:2573:33: note: op template: MEM[(long unsigned int *)prephitmp_900 + 32B] = _552;
-magick/cache.c:2573:33: note: 	stmt 0 MEM[(long unsigned int *)prephitmp_900 + 32B] = _552;
-magick/cache.c:2573:33: note: 	stmt 1 MEM[(long unsigned int *)prephitmp_900 + 40B] = D__lsm.189_606;
-magick/cache.c:2573:33: note: 	stmt 2 MEM[(long unsigned int *)prephitmp_900 + 48B] = D__lsm.190_568;
-magick/cache.c:2573:33: note: 	stmt 3 MEM[(long unsigned int *)prephitmp_900 + 56B] = D__lsm.191_342;
-magick/cache.c:2573:33: note: 	children 0x3648c50
-magick/cache.c:2573:33: note: node (external) 0x3648c50 (max_nunits=1, refcnt=1) vector(4) long unsigned int
-magick/cache.c:2573:33: note: 	{ _552, D__lsm.189_606, D__lsm.190_568, D__lsm.191_342 }
-magick/cache.c:2573:33: note: ------>vectorizing SLP node starting from: MEM[(long unsigned int *)prephitmp_900 + 32B] = _552;
-magick/cache.c:2573:33: note: vect_is_simple_use: operand D__lsm.189_606 = PHI <D__lsm.189_574(82)>, type of def: internal
-magick/cache.c:2573:33: note: vect_is_simple_use: operand D__lsm.190_568 = PHI <D__lsm.190_555(82)>, type of def: internal
-magick/cache.c:2573:33: note: vect_is_simple_use: operand D__lsm.191_342 = PHI <D__lsm.191_106(82)>, type of def: internal
-magick/cache.c:2573:33: note: transform store. ncopies = 1
-magick/cache.c:2573:33: note: create vector_type-pointer variable to type: vector(4) long unsigned int  vectorizing a pointer ref: MEM[(long unsigned int *)prephitmp_900 + 32B]
-magick/cache.c:2573:33: note: created vectp.216_874
-magick/cache.c:2573:33: note: add new stmt: MEM <vector(4) long unsigned int> [(long unsigned int *)vectp.216_874] = _879;
-magick/cache.c:2573:33: note: vectorizing stmts using SLP.
 magick/cache.c:2573:33: note: ***** The result for vector mode V64QI would be the same
 
 __attribute__((visibility ("default"), hot))
 const struct PixelPacket * GetVirtualPixelsFromNexus (const struct Image * image, const VirtualPixelMethod virtual_pixel_method, const ssize_t x, const ssize_t y, const size_t columns, const size_t rows, struct NexusInfo * nexus_info, struct ExceptionInfo * exception)
 {
-  long unsigned int * vectp.216;
-  vector(4) long unsigned int * vectp_prephitmp.215;
-  long unsigned int * vectp.214;
-  vector(4) long unsigned int * vectp_prephitmp.213;
   Quantum * vectp.212;
   vector(4) short unsigned int * vectp_virtual_pixel.211;
   Quantum * vectp.210;
@@ -4306,8 +4264,6 @@
   long unsigned int pretmp_875;
   long int _877;
   long int prephitmp_878;
-  vector(4) long unsigned int _879;
-  vector(4) long unsigned int _882;
   long unsigned int pretmp_889;
   long int _891;
   long int prephitmp_892;
@@ -5354,9 +5310,10 @@
   # D__lsm.193_931 = PHI <D__lsm.193_32(80)>
   # D__lsm.195_930 = PHI <D__lsm.195_474(80)>
   # _929 = PHI <_543(80)>
-  _882 = {_929, D__lsm.193_931, D__lsm.194_932, D__lsm.195_930};
-  vectp.214_881 = prephitmp_900 + 32;
-  MEM <vector(4) long unsigned int> [(long unsigned int *)vectp.214_881] = _882;
+  MEM[(long unsigned int *)prephitmp_900 + 40B] = D__lsm.193_931;
+  MEM[(long unsigned int *)prephitmp_900 + 48B] = D__lsm.194_932;
+  MEM[(long unsigned int *)prephitmp_900 + 56B] = D__lsm.195_930;
+  MEM[(long unsigned int *)prephitmp_900 + 32B] = _929;
   [magick/random.c:629:3] # DEBUG BEGIN_STMT
   [magick/random.c:629:21] _544 = [magick/random.c:629:21] MEM[(struct RandomInfo *)prephitmp_900].normalize;
   [magick/random.c:629:32] _545 = (double) _929;
@@ -5413,9 +5370,10 @@
   # D__lsm.191_342 = PHI <D__lsm.191_106(82)>
   # D__lsm.190_568 = PHI <D__lsm.190_555(82)>
   # D__lsm.189_606 = PHI <D__lsm.189_574(82)>
-  _879 = {_552, D__lsm.189_606, D__lsm.190_568, D__lsm.191_342};
-  vectp.216_874 = prephitmp_900 + 32;
-  MEM <vector(4) long unsigned int> [(long unsigned int *)vectp.216_874] = _879;
+  MEM[(long unsigned int *)prephitmp_900 + 40B] = D__lsm.189_606;
+  MEM[(long unsigned int *)prephitmp_900 + 48B] = D__lsm.190_568;
+  MEM[(long unsigned int *)prephitmp_900 + 56B] = D__lsm.191_342;
+  MEM[(long unsigned int *)prephitmp_900 + 32B] = _552;
   [magick/random.c:629:3] # DEBUG BEGIN_STMT
   [magick/random.c:629:32] _531 = (double) _552;
   _502 = _230 * _544;
So this is likely a Zen5 tuning thing that makes the vectorization profitable.
Though since this just transforms stores, this cannot be a STLF fail; instead
it's likely the vector(4) long unsigned int builds from scalars that cause this.
It also doesn't seem possible that the add/sub insn cost change causes this.

Could it be that this is in the end something similar to what we see with LBM?
Hmm, the sequence does not use + at all, but I think I know what is going on.
While the field is called addss, it is used as a kitchen sink for all other
simple operations:

  /* pmuludq under sse2, pmuldq under sse4.1, for sign_extend,
     require extra 4 mul, 4 add, 4 cmp and 2 shift.  */
  if (!TARGET_SSE4_1 && !uns_p)
    extra_cost = (cost->mulss + cost->addss + cost->sse_op) * 4
		 + cost->sse_op * 2;
....
    case FLOAT_EXTEND:
      if (!SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P (mode))
	*total = 0;
      else
	*total = ix86_vec_cost (mode, cost->addss);
      return false;
....
    case FLOAT_TRUNCATE:
      if (!SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P (mode))
	*total = cost->fadd;
      else
	*total = ix86_vec_cost (mode, cost->addss);
      return false;
...
      case scalar_stmt:
	return fp ? ix86_cost->addss : COSTS_N_INSNS (1);
....
      case vector_stmt:
	return ix86_vec_cost (mode,
			      fp ? ix86_cost->addss : ix86_cost->sse_op);

Only addss was sped up (and apparently only in common contexts); other simple
SSE FP operations are still 3 cycles.  We have

  const int sse_op;		/* cost of cheap SSE instruction.  */
  const int addss;		/* cost of ADDSS/SD SUBSS/SD instructions.  */

SSE_OP is used for integer SSE instructions, which are typically 1 cycle, so
perhaps we also want sse_fp_op /* cost of cheap SSE FP instruction */ in
addition to addss.

But to be precise, the builtin_vectorization cost hook would need to know
whether a scalar/vector_stmt is an addition or something else, which AFAIK it
doesn't.
(In reply to Jan Hubicka from comment #7)
> Hmm, the sequence does not use + at all, but I think I know what is going
> on. While the field is called addss, it is used as a kitchen sink for all
> other simple operations:
>
>   /* pmuludq under sse2, pmuldq under sse4.1, for sign_extend,
>      require extra 4 mul, 4 add, 4 cmp and 2 shift.  */
>   if (!TARGET_SSE4_1 && !uns_p)
>     extra_cost = (cost->mulss + cost->addss + cost->sse_op) * 4
>		  + cost->sse_op * 2;

this looks OK?

> ....
>     case FLOAT_EXTEND:
>       if (!SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P (mode))
>	  *total = 0;
>       else
>	  *total = ix86_vec_cost (mode, cost->addss);

not sure why we use addss instead of sse_op here?

>       return false;
> ....
>     case FLOAT_TRUNCATE:
>       if (!SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P (mode))
>	  *total = cost->fadd;
>       else
>	  *total = ix86_vec_cost (mode, cost->addss);

likewise? but this is rtx_cost

>       return false;
> ...
>       case scalar_stmt:
>	  return fp ? ix86_cost->addss : COSTS_N_INSNS (1);

we shouldn't get here, the add_stmt_cost "hook" handles operations
separately.

> ....
>       case vector_stmt:
>	  return ix86_vec_cost (mode,
>				fp ? ix86_cost->addss : ix86_cost->sse_op);

I'm not sure why we use addss cost in case of FP mode.  So this would be
the only two places to try fixing.

The question is for which ops we fall back here.  The following might tell:

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 4f8380c4a58..fe1beefe732 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -25346,6 +25346,7 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
 	  stmt_cost = ix86_cost->add;
 	  break;
 	default:
+	  gcc_unreachable ();
 	  break;
 	}
     }

> Only addss was sped up (and apparently only in common contexts); other
> simple SSE FP operations are still 3 cycles.  We have
>
>   const int sse_op;	/* cost of cheap SSE instruction.  */
>   const int addss;	/* cost of ADDSS/SD SUBSS/SD instructions.  */
>
> SSE_OP is used for integer SSE instructions, which are typically 1 cycle, so
> perhaps we also want sse_fp_op /* cost of cheap SSE FP instruction */ in
> addition to addss.
>
> But to be precise, the builtin_vectorization cost hook would need to know
> whether a scalar/vector_stmt is an addition or something else, which AFAIK
> it doesn't.

It does see it.
(In reply to Richard Biener from comment #8)
> (In reply to Jan Hubicka from comment #7)
[...]
> >       return false;
> > ...
> >       case scalar_stmt:
> >	    return fp ? ix86_cost->addss : COSTS_N_INSNS (1);
> 
> we shouldn't get here, the add_stmt_cost "hook" handles operations
> separately.
> 
> > ....
> >       case vector_stmt:
> >	    return ix86_vec_cost (mode,
> >				  fp ? ix86_cost->addss : ix86_cost->sse_op);
> 
> I'm not sure why we use addss cost in case of FP mode.  So this would be
> the only two places to try fixing.
> 
> The question is for which ops we fall back here.  The following might tell:
> 
> diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> index 4f8380c4a58..fe1beefe732 100644
> --- a/gcc/config/i386/i386.cc
> +++ b/gcc/config/i386/i386.cc
> @@ -25346,6 +25346,7 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>  	  stmt_cost = ix86_cost->add;
>  	  break;
>  	default:
> +	  gcc_unreachable ();
>  	  break;
>  	}
>      }

OK, it's quite an incomplete set.  I would suggest fixing the fallback to use
sse_op and never addss.  Or avoiding the reduction of the addss cost altogether
as a simpler thing for now.
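A minimal sketch of what the first option could look like (untested, only to
make the suggestion concrete; it touches just the two fallback cases quoted
above in ix86_builtin_vectorization_cost, and whether sse_op is the right
magnitude for FP fallbacks is exactly what is being debated here):

      case scalar_stmt:
	/* Fallback only; statements the vectorizer classifies are costed in
	   ix86_vector_costs::add_stmt_cost, so don't assume the cheapest FP
	   operation is an add.  */
	return fp ? ix86_cost->sse_op : COSTS_N_INSNS (1);

      case vector_stmt:
	return ix86_vec_cost (mode, ix86_cost->sse_op);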
(In reply to Richard Biener from comment #9)
> (In reply to Richard Biener from comment #8)
> > (In reply to Jan Hubicka from comment #7)
> [...]
> > >       return false;
> > > ...
> > >       case scalar_stmt:
> > >	      return fp ? ix86_cost->addss : COSTS_N_INSNS (1);
> > 
> > we shouldn't get here, the add_stmt_cost "hook" handles operations
> > separately.
> > 
> > > ....
> > >       case vector_stmt:
> > >	      return ix86_vec_cost (mode,
> > >				    fp ? ix86_cost->addss : ix86_cost->sse_op);
> > 
> > I'm not sure why we use addss cost in case of FP mode.  So this would be
> > the only two places to try fixing.
> > 
> > The question is for which ops we fall back here.  The following might tell:
> > 
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index 4f8380c4a58..fe1beefe732 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -25346,6 +25346,7 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> >  	  stmt_cost = ix86_cost->add;
> >  	  break;
> >  	default:
> > +	  gcc_unreachable ();
> >  	  break;
> >  	}
> >      }
> 
> OK, it's quite an incomplete set.  I would suggest fixing the fallback to use
> sse_op and never addss.  Or avoiding the reduction of the addss cost
> altogether as a simpler thing for now.

Btw, it was your r8-4018-gf6fd8f2bd4e9a9 which added the FP vs. non-FP
difference.
Note that whether the mode is FP is not a good indicator for some operations;
as said upthread, nowadays we should catch the cases where it matters.
> Btw, it was your r8-4018-gf6fd8f2bd4e9a9 which added the FP vs. non-FP difference.

Yep, I know.  With that patch I mostly wanted to limit redundancy of the tables.
The int/FP difference was mostly based on the observation that most integer SSE
operations (for example padd) take 1 cycle, while most FP operations (like
addss) take 3 cycles.  My simplified understanding is that FP operations are
usually pipelined to 3 cycles (since Pentium to today) because they include
normalization, operation and rounding.  The cost table is basically meant to
have a "typical cost" (sse_op and addss) along with all important exceptions
(mul, div, fma, sqrt).

Zen5 is, I think, the first CPU where addss has different timing than other
basic FP arithmetic, which makes addss itself an exception.  (Back then, I
should have renamed the addss cost and made the comment more descriptive.)

So based on this, adding sse_fp_op (set to 3 on Zen5 and the same cost as addss
everywhere else) for a "typical FP operation" and keeping the addss cost for
actual FP add/sub (I will need to benchmark whether sub is also 2 cycles; I am
not sure about that) IMO makes sense.

But indeed we currently use addss for conversions and other stuff, which is not
necessarily good, and we may want to add more entries for these.  Do you know
which ones are important and ought to be fixed?

I am OK with using an addss cost of 3 for trunk & release branches and making
this more precise next stage1.
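To make the direction concrete, a rough sketch of what the stage1 change could
look like (the field name, its placement and the comments are made up and
nothing here is tested; every cost table other than znver5's would simply copy
its existing addss value):

  /* i386.h, struct processor_costs (sketch):  */
   const int sse_op;		/* cost of cheap SSE instruction.  */
+  const int sse_fp_op;		/* cost of a typical cheap SSE FP instruction
+				   (conversions, extends/truncates and other
+				   operations currently costed via addss).  */
   const int addss;		/* cost of ADDSS/SD SUBSS/SD instructions.  */

  /* x86-tune-costs.h, znver5_cost (sketch):  */
+  COSTS_N_INSNS (3),		/* cost of typical cheap SSE FP instruction.  */
   COSTS_N_INSNS (2),		/* cost of ADDSS/SD SUBSS/SD insns
				   (latency 2 on znver5).  */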
On Wed, 9 Apr 2025, hubicka at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119298
> 
> --- Comment #12 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
> > Btw, it was your r8-4018-gf6fd8f2bd4e9a9 which added the FP vs. non-FP difference.
> 
> Yep, I know.  With that patch I mostly wanted to limit redundancy of the
> tables.  The int/FP difference was mostly based on the observation that most
> integer SSE operations (for example padd) take 1 cycle, while most FP
> operations (like addss) take 3 cycles.  My simplified understanding is that
> FP operations are usually pipelined to 3 cycles (since Pentium to today)
> because they include normalization, operation and rounding.  The cost table
> is basically meant to have a "typical cost" (sse_op and addss) along with all
> important exceptions (mul, div, fma, sqrt).
> 
> Zen5 is, I think, the first CPU where addss has different timing than other
> basic FP arithmetic, which makes addss itself an exception.  (Back then, I
> should have renamed the addss cost and made the comment more descriptive.)
> 
> So based on this, adding sse_fp_op (set to 3 on Zen5 and the same cost as
> addss everywhere else) for a "typical FP operation" and keeping the addss
> cost for actual FP add/sub (I will need to benchmark whether sub is also 2
> cycles; I am not sure about that) IMO makes sense.
> 
> But indeed we currently use addss for conversions and other stuff, which is
> not necessarily good, and we may want to add more entries for these.  Do you
> know which ones are important and ought to be fixed?

I'm not sure which ones are important, but we should try to get rid of the
fallback to call the "old" hook from add_stmt_cost so it's more obvious from
which context we come.  But I also expect this area to change quite a bit next
stage1 where I hope to revamp vectorizer costing.

> I am OK with using an addss cost of 3 for trunk & release branches and making
> this more precise next stage1.

That's what we use now?  But I still don't understand why exactly
538.imagick_r regresses.
> > I am OK with using an addss cost of 3 for trunk & release branches and
> > making this more precise next stage1.
> 
> That's what we use now?  But I still don't understand why exactly
> 538.imagick_r regresses.

We currently use 2 for znver5.  In r15-3441-g4292297a0f938f (causing this
regression) I changed

  COSTS_N_INSNS (3),			/* cost of ADDSS/SD SUBSS/SD insns.  */

which was copied from the znver4 tables to

  /* ADDSS has throughput 2 and latency 2
     (in some cases when source is another addition).  */
  COSTS_N_INSNS (2),			/* cost of ADDSS/SD SUBSS/SD insns.  */

since ADDSS has a latency of 2 on znver5 (at least in some, but common, cases).

Martin claims he tested that this particular change causes the regression and,
moreover, that the difference is:

-  _882 = {_929, D__lsm.193_931, D__lsm.194_932, D__lsm.195_930};
-  vectp.214_881 = prephitmp_900 + 32;
-  MEM <vector(4) long unsigned int> [(long unsigned int *)vectp.214_881] = _882;
+  MEM[(long unsigned int *)prephitmp_900 + 40B] = D__lsm.193_931;
+  MEM[(long unsigned int *)prephitmp_900 + 48B] = D__lsm.194_932;
+  MEM[(long unsigned int *)prephitmp_900 + 56B] = D__lsm.195_930;
+  MEM[(long unsigned int *)prephitmp_900 + 32B] = _929;

Which boils down to:

long test[4];
void
foo (unsigned long a, unsigned long b, unsigned long c, unsigned long d)
{
  test[0]=a;
  test[1]=b;
  test[2]=c;
  test[3]=d;
}

being vectorized with a cost of 2 (trunk) as:

_Z3foommmm:
.LFB0:
	.cfi_startproc
	vmovq	%rdx, %xmm2
	vmovq	%rdi, %xmm3
	vpinsrq	$1, %rcx, %xmm2, %xmm1
	vpinsrq	$1, %rsi, %xmm3, %xmm0
	vinserti64x2	$0x1, %xmm1, %ymm0, %ymm0
	vmovdqa	%ymm0, test(%rip)
	vzeroupper
	ret

while with a cost of 3 we get:

_Z3foommmm:
.LFB0:
	.cfi_startproc
	vmovq	%rdi, %xmm2
	vmovq	%rdx, %xmm3
	vpinsrq	$1, %rsi, %xmm2, %xmm1
	vpinsrq	$1, %rcx, %xmm3, %xmm0
	vmovdqa	%xmm1, test(%rip)
	vmovdqa	%xmm0, test+16(%rip)
	ret

I would not guess which one is better.  I can see that trunk's version saves
one move and a potential store->load stall.  In a stand-alone benchmark (which
does not consume the stored values) both variants have the same runtime.

The costing difference is:

a.0_1 1 times scalar_store costs 16 in body
b.1_2 1 times scalar_store costs 16 in body
c.2_3 1 times scalar_store costs 16 in body
d.3_4 1 times scalar_store costs 16 in body
node 0x396b4c0 1 times vec_construct costs 40 in prologue
a.0_1 1 times vector_store costs 24 in body
r.C:5:16: note: Cost model analysis for part in loop 0:
  Vector cost: 64
  Scalar cost: 64

to:

a.0_1 1 times scalar_store costs 16 in body
b.1_2 1 times scalar_store costs 16 in body
c.2_3 1 times scalar_store costs 16 in body
d.3_4 1 times scalar_store costs 16 in body
node 0x396b4c0 1 times vec_construct costs 44 in prologue
a.0_1 1 times vector_store costs 24 in body
r.C:5:16: note: Cost model analysis for part in loop 0:
  Vector cost: 68
  Scalar cost: 64

Counting latencies, I think vinserti64x2 is 1 cycle and vpinsrq is an
integer->SSE move that is slower and set to 4 cycles.  Overall it is wrong that
we use the addss cost to estimate vec_construct:

      case vec_construct:
	{
	  int n = TYPE_VECTOR_SUBPARTS (vectype);
	  /* N - 1 element inserts into an SSE vector, the possible
	     GPR -> XMM move is accounted for in add_stmt_cost.  */
	  if (GET_MODE_BITSIZE (mode) <= 128)
	    return (n - 1) * ix86_cost->sse_op;
	  /* One vinserti128 for combining two SSE vectors for AVX256.  */
	  else if (GET_MODE_BITSIZE (mode) == 256)
	    return ((n - 2) * ix86_cost->sse_op
		    + ix86_vec_cost (mode, ix86_cost->addss));
	  /* One vinserti64x4 and two vinserti128 for combining SSE
	     and AVX256 vectors to AVX512.  */
	  else if (GET_MODE_BITSIZE (mode) == 512)
	    return ((n - 4) * ix86_cost->sse_op
		    + 3 * ix86_vec_cost (mode, ix86_cost->addss));
	  gcc_unreachable ();
	}

I think we may want to have ix86_cost->hard_register->integer_to_sse to cost
the construction in integer modes instead of addss?

Overall r15-3441-g4292297a0f938f is right for addss itself, but wrong for all
the other instructions we glob into the addss cost (conversions, packing etc.).
It was a silly mistake to forget that addss is used in many other cases.  So I
think it still makes sense to have at least one extra cost entry (for a common
simple FP operation, which is more expensive than the common simple integer
operation we already have) and name it in a way that makes it clear it
represents various kinds of typical FP operations (unless we want to have costs
for all of them).
I made a silly stand-alone test:

long test[4];
__attribute__ ((noipa))
void
foo (unsigned long a, unsigned long b, unsigned long c, unsigned long d)
{
  test[0]=a;
  test[1]=b;
  test[2]=c;
  test[3]=d;
}

int
main()
{
  long s = 0;
  for (int i = 0; i < 1000000000; i++)
    {
      foo (i,i+1,i+2,i+3);
      s+=test[0];
      s+=test[1];
      s+=test[2];
      s+=test[3];
    }
  return s;
}

And curiously enough it strongly prefers cost of 2 over 3.

jh@shroud:~/trunk/build/gcc> perf stat ./test-noslp

 Performance counter stats for './test-noslp':

          1,211.17 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                55      page-faults:u             #   45.411 /sec
     5,000,372,342      cycles:u                  #    4.129 GHz
     2,000,253,750      stalled-cycles-frontend:u #   40.00% frontend cycles idle
    17,000,136,662      instructions:u            #    3.40  insn per cycle
                                                  #    0.12  stalled cycles per insn
     3,000,030,827      branches:u                #    2.477 G/sec
             2,592      branch-misses:u           #    0.00% of all branches

       1.211767440 seconds time elapsed

       1.211832000 seconds user
       0.000000000 seconds sys

jh@shroud:~/trunk/build/gcc> perf stat ./test-cost3

 Performance counter stats for './test-cost3':

          7,266.90 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                55      page-faults:u             #    7.569 /sec
    30,001,467,995      cycles:u                  #    4.129 GHz
         1,111,876      stalled-cycles-frontend:u #    0.00% frontend cycles idle
    23,000,138,491      instructions:u            #    0.77  insn per cycle
                                                  #    0.00  stalled cycles per insn
     3,000,032,652      branches:u                #  412.835 M/sec
             4,455      branch-misses:u           #    0.00% of all branches

       7.267898755 seconds time elapsed

       7.267379000 seconds user
       0.000000000 seconds sys

jh@shroud:~/trunk/build/gcc> perf stat ./test-cost2

 Performance counter stats for './test-cost2':

          1,089.54 msec task-clock:u              #    1.000 CPUs utilized
                 0      context-switches:u        #    0.000 /sec
                 0      cpu-migrations:u          #    0.000 /sec
                55      page-faults:u             #   50.480 /sec
     4,501,104,318      cycles:u                  #    4.131 GHz
         5,495,394      stalled-cycles-frontend:u #    0.12% frontend cycles idle
    24,000,136,630      instructions:u            #    5.33  insn per cycle
                                                  #    0.00  stalled cycles per insn
     3,000,030,793      branches:u                #    2.753 G/sec
             2,492      branch-misses:u           #    0.00% of all branches

       1.090067946 seconds time elapsed

       1.090267000 seconds user
       0.000000000 seconds sys

The cost-2 variant does:

00000000004011c0 <_Z3foommmm>:
  4011c0:	c4 e1 f9 6e d2       	vmovq  %rdx,%xmm2
  4011c5:	c4 e1 f9 6e df       	vmovq  %rdi,%xmm3
  4011ca:	c4 e3 e9 22 c9 01    	vpinsrq $0x1,%rcx,%xmm2,%xmm1
  4011d0:	c4 e3 e1 22 c6 01    	vpinsrq $0x1,%rsi,%xmm3,%xmm0
  4011d6:	62 f3 fd 28 38 c1 01 	vinserti64x2 $0x1,%xmm1,%ymm0,%ymm0
  4011dd:	c5 fd 7f 05 5b 2e 00 	vmovdqa %ymm0,0x2e5b(%rip)        # 404040 <test>
  4011e4:	00
  4011e5:	c5 f8 77             	vzeroupper
  4011e8:	c3                   	ret
....
  401059:	c5 fd 6f 0d df 2f 00 	vmovdqa 0x2fdf(%rip),%ymm1        # 404040 <test>
  401060:	00
  401061:	62 f3 fd 28 39 c8 01 	vextracti64x2 $0x1,%ymm1,%xmm0
  401068:	c5 f9 d4 c1          	vpaddq %xmm1,%xmm0,%xmm0
  40106c:	c5 f1 73 d8 08       	vpsrldq $0x8,%xmm0,%xmm1
  401071:	c5 f9 d4 c1          	vpaddq %xmm1,%xmm0,%xmm0
  401075:	c4 e1 f9 7e c0       	vmovq  %xmm0,%rax
  40107a:	49 01 c4             	add    %rax,%r12

while the cost-3 variant does:

00000000004011c0 <_Z3foommmm>:
  4011c0:	c4 e1 f9 6e d7       	vmovq  %rdi,%xmm2
  4011c5:	c4 e1 f9 6e da       	vmovq  %rdx,%xmm3
  4011ca:	c4 e3 e9 22 ce 01    	vpinsrq $0x1,%rsi,%xmm2,%xmm1
  4011d0:	c4 e3 e1 22 c1 01    	vpinsrq $0x1,%rcx,%xmm3,%xmm0
  4011d6:	c5 f9 7f 0d 62 2e 00 	vmovdqa %xmm1,0x2e62(%rip)        # 404040 <test>
  4011dd:	00
  4011de:	c5 f9 7f 05 6a 2e 00 	vmovdqa %xmm0,0x2e6a(%rip)        # 404050 <test+0x10>
  4011e5:	00
  4011e6:	c3                   	ret
....
  401059:	c5 fd 6f 0d df 2f 00 	vmovdqa 0x2fdf(%rip),%ymm1        # 404040 <test>
  401060:	00
  401061:	62 f3 fd 28 39 c8 01 	vextracti64x2 $0x1,%ymm1,%xmm0
  401068:	c5 f9 d4 c1          	vpaddq %xmm1,%xmm0,%xmm0
  40106c:	c5 f1 73 d8 08       	vpsrldq $0x8,%xmm0,%xmm1
  401071:	c5 f9 d4 c1          	vpaddq %xmm1,%xmm0,%xmm0
  401075:	c4 e1 f9 7e c0       	vmovq  %xmm0,%rax
  40107a:	49 01 c4             	add    %rax,%r12

and noslp:

00000000004011a0 <_Z3foommmm>:
  4011a0:	48 89 3d 99 2e 00 00 	mov    %rdi,0x2e99(%rip)        # 404040 <test>
  4011a7:	48 89 35 9a 2e 00 00 	mov    %rsi,0x2e9a(%rip)        # 404048 <test+0x8>
  4011ae:	48 89 15 9b 2e 00 00 	mov    %rdx,0x2e9b(%rip)        # 404050 <test+0x10>
  4011b5:	48 89 0d 9c 2e 00 00 	mov    %rcx,0x2e9c(%rip)        # 404058 <test+0x18>
  4011bc:	c3                   	ret
....
  401046:	48 03 1d f3 2f 00 00 	add    0x2ff3(%rip),%rbx        # 404040 <test>
  40104d:	48 03 1d f4 2f 00 00 	add    0x2ff4(%rip),%rbx        # 404048 <test+0x8>
  401054:	48 03 1d f5 2f 00 00 	add    0x2ff5(%rip),%rbx        # 404050 <test+0x10>
  40105b:	48 03 1d f6 2f 00 00 	add    0x2ff6(%rip),%rbx        # 404058 <test+0x18>
(In reply to Jan Hubicka from comment #15)
> I made a silly stand-alone test:
> 
> [...]
> 
> And curiously enough it strongly prefers cost of 2 over 3.
> 
> [...]
> 
> The cost-2 variant does:
> 
> 00000000004011c0 <_Z3foommmm>:
>   4011c0:	c4 e1 f9 6e d2       	vmovq  %rdx,%xmm2
>   4011c5:	c4 e1 f9 6e df       	vmovq  %rdi,%xmm3
>   4011ca:	c4 e3 e9 22 c9 01    	vpinsrq $0x1,%rcx,%xmm2,%xmm1
>   4011d0:	c4 e3 e1 22 c6 01    	vpinsrq $0x1,%rsi,%xmm3,%xmm0
>   4011d6:	62 f3 fd 28 38 c1 01 	vinserti64x2 $0x1,%xmm1,%ymm0,%ymm0
>   4011dd:	c5 fd 7f 05 5b 2e 00 	vmovdqa %ymm0,0x2e5b(%rip)        # 404040 <test>
>   4011e4:	00
>   4011e5:	c5 f8 77             	vzeroupper
>   4011e8:	c3                   	ret
> ....
>   401059:	c5 fd 6f 0d df 2f 00 	vmovdqa 0x2fdf(%rip),%ymm1        # 404040 <test>

that will forward nicely

>   401060:	00
>   401061:	62 f3 fd 28 39 c8 01 	vextracti64x2 $0x1,%ymm1,%xmm0
>   401068:	c5 f9 d4 c1          	vpaddq %xmm1,%xmm0,%xmm0
>   40106c:	c5 f1 73 d8 08       	vpsrldq $0x8,%xmm0,%xmm1
>   401071:	c5 f9 d4 c1          	vpaddq %xmm1,%xmm0,%xmm0
>   401075:	c4 e1 f9 7e c0       	vmovq  %xmm0,%rax
>   40107a:	49 01 c4             	add    %rax,%r12
> 
> while the cost-3 variant does:
> 
> 00000000004011c0 <_Z3foommmm>:
>   4011c0:	c4 e1 f9 6e d7       	vmovq  %rdi,%xmm2
>   4011c5:	c4 e1 f9 6e da       	vmovq  %rdx,%xmm3
>   4011ca:	c4 e3 e9 22 ce 01    	vpinsrq $0x1,%rsi,%xmm2,%xmm1
>   4011d0:	c4 e3 e1 22 c1 01    	vpinsrq $0x1,%rcx,%xmm3,%xmm0
>   4011d6:	c5 f9 7f 0d 62 2e 00 	vmovdqa %xmm1,0x2e62(%rip)        # 404040 <test>
>   4011dd:	00
>   4011de:	c5 f9 7f 05 6a 2e 00 	vmovdqa %xmm0,0x2e6a(%rip)        # 404050 <test+0x10>
>   4011e5:	00
>   4011e6:	c3                   	ret
> ....
>   401059:	c5 fd 6f 0d df 2f 00 	vmovdqa 0x2fdf(%rip),%ymm1        # 404040 <test>

this will fail to forward, thus a huge penalty.

>   401060:	00
>   401061:	62 f3 fd 28 39 c8 01 	vextracti64x2 $0x1,%ymm1,%xmm0
>   401068:	c5 f9 d4 c1          	vpaddq %xmm1,%xmm0,%xmm0
>   40106c:	c5 f1 73 d8 08       	vpsrldq $0x8,%xmm0,%xmm1
>   401071:	c5 f9 d4 c1          	vpaddq %xmm1,%xmm0,%xmm0
>   401075:	c4 e1 f9 7e c0       	vmovq  %xmm0,%rax
>   40107a:	49 01 c4             	add    %rax,%r12
> 
> and noslp:
> 
> 00000000004011a0 <_Z3foommmm>:
>   4011a0:	48 89 3d 99 2e 00 00 	mov    %rdi,0x2e99(%rip)        # 404040 <test>
>   4011a7:	48 89 35 9a 2e 00 00 	mov    %rsi,0x2e9a(%rip)        # 404048 <test+0x8>
>   4011ae:	48 89 15 9b 2e 00 00 	mov    %rdx,0x2e9b(%rip)        # 404050 <test+0x10>
>   4011b5:	48 89 0d 9c 2e 00 00 	mov    %rcx,0x2e9c(%rip)        # 404058 <test+0x18>
>   4011bc:	c3                   	ret
> ....
>   401046:	48 03 1d f3 2f 00 00 	add    0x2ff3(%rip),%rbx        # 404040 <test>
>   40104d:	48 03 1d f4 2f 00 00 	add    0x2ff4(%rip),%rbx        # 404048 <test+0x8>
>   401054:	48 03 1d f5 2f 00 00 	add    0x2ff5(%rip),%rbx        # 404050 <test+0x10>
>   40105b:	48 03 1d f6 2f 00 00 	add    0x2ff6(%rip),%rbx        # 404058 <test+0x18>
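To illustrate the forwarding difference in isolation, a hand-written sketch of
the two store/load patterns (intrinsics written for this comment, not compiler
output; compile with -mavx or -march=native; whether a given core forwards the
first case is uarch-dependent, the point is only that in the second case the
wide load spans two store-buffer entries):

#include <immintrin.h>

long test[4] __attribute__ ((aligned (32)));

/* One 32-byte store followed by a 32-byte load of the same bytes:
   the load hits a single store-buffer entry and can be forwarded.  */
__attribute__ ((noipa)) __m256i
load_after_one_wide_store (__m256i v)
{
  _mm256_store_si256 ((__m256i *) test, v);
  return _mm256_load_si256 ((const __m256i *) test);
}

/* Two 16-byte stores followed by a 32-byte load spanning both of them:
   store-to-load forwarding fails and the load has to wait for the
   stores to commit.  */
__attribute__ ((noipa)) __m256i
load_after_two_narrow_stores (__m128i lo, __m128i hi)
{
  _mm_store_si128 ((__m128i *) test, lo);
  _mm_store_si128 ((__m128i *) (test + 2), hi);
  return _mm256_load_si256 ((const __m256i *) test);
}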
(In reply to Jan Hubicka from comment #14)
> 
> Counting latencies, I think vinserti64x2 is 1 cycle and vpinsrq is an
> integer->SSE move that is slower and set to 4 cycles.  Overall it is wrong
> that we use the addss cost to estimate vec_construct:
> 
>       case vec_construct:
>	  {
>	    int n = TYPE_VECTOR_SUBPARTS (vectype);
>	    /* N - 1 element inserts into an SSE vector, the possible
>	       GPR -> XMM move is accounted for in add_stmt_cost.  */
>	    if (GET_MODE_BITSIZE (mode) <= 128)
>	      return (n - 1) * ix86_cost->sse_op;
>	    /* One vinserti128 for combining two SSE vectors for AVX256.  */
>	    else if (GET_MODE_BITSIZE (mode) == 256)
>	      return ((n - 2) * ix86_cost->sse_op
>		      + ix86_vec_cost (mode, ix86_cost->addss));
>	    /* One vinserti64x4 and two vinserti128 for combining SSE
>	       and AVX256 vectors to AVX512.  */
>	    else if (GET_MODE_BITSIZE (mode) == 512)
>	      return ((n - 4) * ix86_cost->sse_op
>		      + 3 * ix86_vec_cost (mode, ix86_cost->addss));
>	    gcc_unreachable ();
>	  }
> 
> I think we may want to have ix86_cost->hard_register->integer_to_sse to cost
> the construction in integer modes instead of addss?

I have no recollection of why we are mixing the sse_op and addss costs here ...
It's not an integer-to-SSE conversion either (again, the caller adjusts for
this in this case).  We seem to use sse_op for the element insert into an SSE
reg and addss for the insert of SSE regs into YMM or ZMM.

I think it's reasonable to change this to consistently use sse_op.
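For reference, that change would be just the following (untested sketch; it
mechanically replaces addss with sse_op in the formula quoted above, and
whether sse_op matches vinserti128/vinserti64x4 latency on each subarch is the
remaining question):

      case vec_construct:
	{
	  int n = TYPE_VECTOR_SUBPARTS (vectype);
	  /* N - 1 element inserts into an SSE vector, the possible
	     GPR -> XMM move is accounted for in add_stmt_cost.  */
	  if (GET_MODE_BITSIZE (mode) <= 128)
	    return (n - 1) * ix86_cost->sse_op;
	  /* One vinserti128 for combining two SSE vectors for AVX256.  */
	  else if (GET_MODE_BITSIZE (mode) == 256)
	    return ((n - 2) * ix86_cost->sse_op
		    + ix86_vec_cost (mode, ix86_cost->sse_op));
	  /* One vinserti64x4 and two vinserti128 for combining SSE
	     and AVX256 vectors to AVX512.  */
	  else if (GET_MODE_BITSIZE (mode) == 512)
	    return ((n - 4) * ix86_cost->sse_op
		    + 3 * ix86_vec_cost (mode, ix86_cost->sse_op));
	  gcc_unreachable ();
	}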
(In reply to Richard Biener from comment #17)
> (In reply to Jan Hubicka from comment #14)
> > 
> > Counting latencies, I think vinserti64x2 is 1 cycle and vpinsrq is an
> > integer->SSE move that is slower and set to 4 cycles.  Overall it is wrong
> > that we use the addss cost to estimate vec_construct:
> > 
> > [...]
> > 
> > I think we may want to have ix86_cost->hard_register->integer_to_sse to
> > cost the construction in integer modes instead of addss?
> 
> I have no recollection of why we are mixing the sse_op and addss costs
> here ...  It's not an integer-to-SSE conversion either (again, the caller
> adjusts for this in this case).  We seem to use sse_op for the element
> insert into an SSE reg and addss for the insert of SSE regs into YMM or ZMM.
> 
> I think it's reasonable to change this to consistently use sse_op.

So this was from r8-6815-gbe77ba2a461eef.  addss was chosen for the inserts
into the larger vectors because

"The following tries to account for the fact that when constructing
AVX256 or AVX512 vectors from elements we can only use insertps to
insert into the low 128bits of a vector but have to use
vinserti128 or vinserti64x4 to build larger AVX256/512 vectors.
Those operations also have higher latency (Agner documents
3 cycles for Broadwell for reg-reg vinserti128 while insertps has
one cycle latency).  Agner doesn't have tables for AVX512 yet but
I guess the story is similar for vinserti64x4.  Latency is similar
for FP adds so I re-used ix86_cost->addss for this cost."

On zen4 vinserti128 is 1 cycle latency, on zen2 it was still 3.  But OTOH
PINSRB/W/D/Q is 4 cycles on zen4 while indeed insertps is 1 cycle.  It will
depend very much on the subarch what vec_init will generate, and the costing
is likely not accurate with the simple formula used in
ix86_builtin_vectorization_cost.  Definitely re-using the addss cost now
causes issues.

Maybe we need separate "cost of SSE construction from QI/HI/SI,SF/DI,DF",
"cost of AVX construction from SSE" and "cost of AVX512 construction from
SSE/AVX" tuning cost entries.  Or just a single combined cost

  { 60, 28, 3, 1, 1, 3, 1 }	/* cost of vector construction from elements
				   V16QI, V8HI, V4SI/V4SF, V2DI/V2DF,
				   YMM from XMM, ZMM from XMM, ZMM from YMM  */

though I'd probably split the SSE CTOR costs from the YMM/ZMM costs.  Or we
have a cost for the element inserts into V16QI, V8HI, V4SI/V4SF, V2DI/V2DF
and V2XMM, V4XMM, V2YMM.

That said, given that with an extra reg we build V16QI in a tree way, avoiding
15 * the high pinsrb latency, the combined cost might be (more?) interesting.
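Just to make the shape of the "single combined cost" variant concrete, it could
look roughly like this in the cost tables (the field name and placement are
hypothetical and the numbers are merely the placeholders from the comment
above, not measured):

  /* struct processor_costs (hypothetical new entry):  */
  const int vec_construct_cost[7];	/* cost of vector construction from
					   elements: V16QI, V8HI, V4SI/V4SF,
					   V2DI/V2DF, and of YMM from XMM,
					   ZMM from XMM, ZMM from YMM.  */

  /* per-CPU table entry (placeholder numbers):  */
  {60, 28, 3, 1, 1, 3, 1},		/* cost of vector construction.  */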
The master branch has been updated by Jan Hubicka <hubicka@gcc.gnu.org>:

https://gcc.gnu.org/g:e2011ab13de3e70774f869b356f5f9c750780b34

commit r15-9495-ge2011ab13de3e70774f869b356f5f9c750780b34
Author: Jan Hubicka <hubicka@ucw.cz>
Date:   Tue Apr 15 19:04:15 2025 +0200

    Set ADDSS cost to 3 for znver5
    
    Znver5 has an addss latency of 2 in the typical case while all earlier
    versions have latency 3.  Unfortunately the addss cost is used to cost
    many other SSE instructions than just addss, and setting the cost to 2
    makes us vectorize 4 64-bit stores into one 256-bit store, which in turn
    regresses imagemagick.
    
    This patch sets the cost back to 3.  Next stage1 we can untie addss from
    the other operations and set it correctly.
    
    Bootstrapped/regtested x86_64-linux and also benchmarked on SPEC2k17.
    
    gcc/ChangeLog:
    
    	PR target/119298
    	* config/i386/x86-tune-costs.h (znver5_cost): Set ADDSS cost to 3.
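The change itself is essentially this one hunk in znver5_cost (reconstructed
from the quotes earlier in the thread, so the exact context and the new comment
wording are approximate):

--- a/gcc/config/i386/x86-tune-costs.h
+++ b/gcc/config/i386/x86-tune-costs.h
-  /* ADDSS has throughput 2 and latency 2
-     (in some cases when source is another addition).  */
-  COSTS_N_INSNS (2),			/* cost of ADDSS/SD SUBSS/SD insns.  */
+  /* ADDSS has latency 2 on znver5, but this cost is currently also used
+     for many other SSE FP operations that are still 3 cycles, so keep 3
+     until the costs are untangled.  */
+  COSTS_N_INSNS (3),			/* cost of ADDSS/SD SUBSS/SD insns.  */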