Bug 119298 - [15/16 Regression] 538.imagick_r is faster when compiled with GCC 14.2 and -Ofast -flto -march=native than with master on Zen5 since r15-3441-g4292297a0f938f
Summary: [15/16 Regression] 538.imagick_r is faster when compiled with GCC 14.2 and -O...
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 15.0
Importance: P2 normal
Target Milestone: 15.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: spec
Reported: 2025-03-14 18:05 UTC by Martin Jambor
Modified: 2025-04-17 12:38 UTC
CC List: 4 users

See Also:
Host: x86_64-linux-gnu
Target: x86_64-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed: 2025-04-08 00:00:00


Attachments
Perf annotate of the slow case (347.51 KB, application/x-xz), 2025-03-14 18:06 UTC, Martin Jambor
Perf annotate of the fast case (368.39 KB, application/x-xz), 2025-03-14 18:07 UTC, Martin Jambor
Output of -fopt-info-vec in the slow case (7.28 KB, application/x-xz), 2025-03-14 18:08 UTC, Martin Jambor
Output of -fopt-info-vec in the fast case (7.24 KB, application/x-xz), 2025-03-14 18:08 UTC, Martin Jambor

Description Martin Jambor 2025-03-14 18:05:34 UTC
The SPEC FPrate 2017 benchmark 538.imagick_r is faster when compiled with
GCC 14.2 and -Ofast -flto -march=native than when compiled with
trunk/master, on Zen 5 CPUs.

The regression was introduced by r15-3441-g4292297a0f938f (Jan
Hubicka: Zen5 tuning part 5: update instruction latencies in
x86-tune-costs).

The modification of the "cost of ADDSS/SD SUBSS/SD insns" entry is the
culprit: bumping it back to COSTS_N_INSNS(3) (instead of
COSTS_N_INSNS(2)) makes the regression go away.  Nevertheless, Honza
claims the new cost should be correct.


Perf stat of the slow run:

	 116866.57 msec task-clock:u                     #    1.000 CPUs utilized
		 0      context-switches:u               #    0.000 /sec
		 0      cpu-migrations:u                 #    0.000 /sec
	      8347      page-faults:u                    #   71.423 /sec
      484499860679      cycles:u                         #    4.146 GHz
       21879349058      stalled-cycles-frontend:u        #    4.52% frontend cycles idle
     2030074730877      instructions:u                   #    4.19  insn per cycle
						  #    0.01  stalled cycles per insn
      224436542157      branches:u                       #    1.920 G/sec
	1716173329      branch-misses:u                  #    0.76% of all branches

     116.881252465 seconds time elapsed

     116.808499000 seconds user
       0.057350000 seconds sys

Perf report of the slow run (annotated assembly attached):

  # Samples: 470K of event 'cycles:Pu'
  # Event count (approx.): 484158470552
  #
  # Overhead       Samples  Command          Shared Object                    Symbol
  # ........  ............  ...............  ...............................  .............................................
  #
      44.71%        210348  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] MeanShiftImage
      28.76%        135308  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] GetVirtualPixelsFromNexus
      25.50%        120106  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] MorphologyApply





Perf stat of the fast run (with just the one cost reverted):

 Performance counter stats for 'taskset -c 0 specinvoke':

	 108805.48 msec task-clock:u                     #    1.000 CPUs utilized
		 0      context-switches:u               #    0.000 /sec
		 0      cpu-migrations:u                 #    0.000 /sec
	      8312      page-faults:u                    #   76.393 /sec
      450981792793      cycles:u                         #    4.145 GHz
       22610930072      stalled-cycles-frontend:u        #    5.01% frontend cycles idle
     1933965750890      instructions:u                   #    4.29  insn per cycle
						  #    0.01  stalled cycles per insn
      224433996552      branches:u                       #    2.063 G/sec
	1721069495      branch-misses:u                  #    0.77% of all branches

     108.819368844 seconds time elapsed

     108.763582000 seconds user
       0.041314000 seconds sys

Perf report of the fast run (annotated assembly attached):

# Samples: 427K of event 'cycles:Pu'
# Event count (approx.): 439380128661
#
# Overhead       Samples  Command          Shared Object                    Symbol
# ........  ............  ...............  ...............................  ..................................................
#
    44.53%        190164  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] MeanShiftImage
    28.13%        120243  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] MorphologyApply
    26.20%        111906  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] GetVirtualPixelsFromNexus
Comment 1 Martin Jambor 2025-03-14 18:06:57 UTC
Created attachment 60757
Perf annotate of the slow case
Comment 2 Martin Jambor 2025-03-14 18:07:23 UTC
Created attachment 60758
Perf annotate of the fast case
Comment 3 Martin Jambor 2025-03-14 18:08:14 UTC
Created attachment 60759
Output of -fopt-info-vec in the slow case
Comment 4 Martin Jambor 2025-03-14 18:08:40 UTC
Created attachment 60760
Output of -fopt-info-vec in the fast case
Comment 5 Martin Jambor 2025-04-08 10:37:10 UTC
This is likely a store-to-load-forwarding stall issue.  I have tracked
it down to two SLP1 transformations; dumps below.

Even though MeanShiftImage receives 10% more samples, its assembly is
unchanged; the transformations that slow things down happen in
GetVirtualPixelsFromNexus (which is called from the hottest loop of
MeanShiftImage).

I have verified that the same slow-down happens with
-flto-partition=one, and then used
-fdbg-cnt=vect_slp:1-1043:1046-100000 (on unpatched master revision
337b9ff4854) to skip the two SLP vectorizations (instances 1044 and
1045) and confirmed that run-time performance is restored.

The two SLP1 transformations are reported to take place at
magick/cache.c:2573, but that is where the function
GetVirtualPixelsFromNexus begins.  Looking at dumps with -lineno, the
vectorized statements are immediately followed by one from
magick/random.c:629 (the load of the normalize field at the end of
GetPseudoRandomValue).

The SLP1 transformations are:

--- /home/mjambor/gcc/benchmarks/cpu2017/benchspec/CPU/538.imagick_r/build/build_peak_trunk-lto-nat-m64.0000/gvpfn/imagick_r.ltrans1.ltrans.188t.slp1	2025-04-07 23:28:47.568570350 +0200
+++ gvpfn/imagick_r.ltrans1.ltrans.188t.slp1	2025-04-07 23:28:50.673570142 +0200
@@ -3897,54 +3899,10 @@
 magick/cache.c:2699:7: note: created &virtual_pixel
 magick/cache.c:2699:7: note: add new stmt: MEM <vector(4) short unsigned int> [(short unsigned int *)&virtual_pixel] = { 0, 0, 0, 65535 };
 magick/cache.c:2699:7: note: vectorizing stmts using SLP.
-magick/cache.c:2573:33: optimized: basic block part vectorized using 32 byte vectors
-magick/cache.c:2573:33: note: Vectorizing SLP tree:
-magick/cache.c:2573:33: note: node 0x3648860 (max_nunits=4, refcnt=1) vector(4) long unsigned int
-magick/cache.c:2573:33: note: op template: MEM[(long unsigned int *)prephitmp_900 + 32B] = _929;
-magick/cache.c:2573:33: note: 	stmt 0 MEM[(long unsigned int *)prephitmp_900 + 32B] = _929;
-magick/cache.c:2573:33: note: 	stmt 1 MEM[(long unsigned int *)prephitmp_900 + 40B] = D__lsm.193_931;
-magick/cache.c:2573:33: note: 	stmt 2 MEM[(long unsigned int *)prephitmp_900 + 48B] = D__lsm.194_932;
-magick/cache.c:2573:33: note: 	stmt 3 MEM[(long unsigned int *)prephitmp_900 + 56B] = D__lsm.195_930;
-magick/cache.c:2573:33: note: 	children 0x3648a10
-magick/cache.c:2573:33: note: node (external) 0x3648a10 (max_nunits=1, refcnt=1) vector(4) long unsigned int
-magick/cache.c:2573:33: note: 	{ _929, D__lsm.193_931, D__lsm.194_932, D__lsm.195_930 }
-magick/cache.c:2573:33: note: ------>vectorizing SLP node starting from: MEM[(long unsigned int *)prephitmp_900 + 32B] = _929;
-magick/cache.c:2573:33: note: vect_is_simple_use: operand D__lsm.193_931 = PHI <D__lsm.193_32(80)>, type of def: internal
-magick/cache.c:2573:33: note: vect_is_simple_use: operand D__lsm.194_932 = PHI <D__lsm.194_33(80)>, type of def: internal
-magick/cache.c:2573:33: note: vect_is_simple_use: operand D__lsm.195_930 = PHI <D__lsm.195_474(80)>, type of def: internal
-magick/cache.c:2573:33: note: transform store. ncopies = 1
-magick/cache.c:2573:33: note: create vector_type-pointer variable to type: vector(4) long unsigned int  vectorizing a pointer ref: MEM[(long unsigned int *)prephitmp_900 + 32B]
-magick/cache.c:2573:33: note: created vectp.214_881
-magick/cache.c:2573:33: note: add new stmt: MEM <vector(4) long unsigned int> [(long unsigned int *)vectp.214_881] = _882;
-magick/cache.c:2573:33: note: vectorizing stmts using SLP.
-magick/cache.c:2573:33: optimized: basic block part vectorized using 32 byte vectors
-magick/cache.c:2573:33: note: Vectorizing SLP tree:
-magick/cache.c:2573:33: note: node 0x3648aa0 (max_nunits=4, refcnt=1) vector(4) long unsigned int
-magick/cache.c:2573:33: note: op template: MEM[(long unsigned int *)prephitmp_900 + 32B] = _552;
-magick/cache.c:2573:33: note: 	stmt 0 MEM[(long unsigned int *)prephitmp_900 + 32B] = _552;
-magick/cache.c:2573:33: note: 	stmt 1 MEM[(long unsigned int *)prephitmp_900 + 40B] = D__lsm.189_606;
-magick/cache.c:2573:33: note: 	stmt 2 MEM[(long unsigned int *)prephitmp_900 + 48B] = D__lsm.190_568;
-magick/cache.c:2573:33: note: 	stmt 3 MEM[(long unsigned int *)prephitmp_900 + 56B] = D__lsm.191_342;
-magick/cache.c:2573:33: note: 	children 0x3648c50
-magick/cache.c:2573:33: note: node (external) 0x3648c50 (max_nunits=1, refcnt=1) vector(4) long unsigned int
-magick/cache.c:2573:33: note: 	{ _552, D__lsm.189_606, D__lsm.190_568, D__lsm.191_342 }
-magick/cache.c:2573:33: note: ------>vectorizing SLP node starting from: MEM[(long unsigned int *)prephitmp_900 + 32B] = _552;
-magick/cache.c:2573:33: note: vect_is_simple_use: operand D__lsm.189_606 = PHI <D__lsm.189_574(82)>, type of def: internal
-magick/cache.c:2573:33: note: vect_is_simple_use: operand D__lsm.190_568 = PHI <D__lsm.190_555(82)>, type of def: internal
-magick/cache.c:2573:33: note: vect_is_simple_use: operand D__lsm.191_342 = PHI <D__lsm.191_106(82)>, type of def: internal
-magick/cache.c:2573:33: note: transform store. ncopies = 1
-magick/cache.c:2573:33: note: create vector_type-pointer variable to type: vector(4) long unsigned int  vectorizing a pointer ref: MEM[(long unsigned int *)prephitmp_900 + 32B]
-magick/cache.c:2573:33: note: created vectp.216_874
-magick/cache.c:2573:33: note: add new stmt: MEM <vector(4) long unsigned int> [(long unsigned int *)vectp.216_874] = _879;
-magick/cache.c:2573:33: note: vectorizing stmts using SLP.
 magick/cache.c:2573:33: note: ***** The result for vector mode V64QI would be the same
 __attribute__((visibility ("default"), hot))
 const struct PixelPacket * GetVirtualPixelsFromNexus (const struct Image * image, const VirtualPixelMethod virtual_pixel_method, const ssize_t x, const ssize_t y, const size_t columns, const size_t rows, struct NexusInfo * nexus_info, struct ExceptionInfo * exception)
 {
-  long unsigned int * vectp.216;
-  vector(4) long unsigned int * vectp_prephitmp.215;
-  long unsigned int * vectp.214;
-  vector(4) long unsigned int * vectp_prephitmp.213;
   Quantum * vectp.212;
   vector(4) short unsigned int * vectp_virtual_pixel.211;
   Quantum * vectp.210;
@@ -4306,8 +4264,6 @@
   long unsigned int pretmp_875;
   long int _877;
   long int prephitmp_878;
-  vector(4) long unsigned int _879;
-  vector(4) long unsigned int _882;
   long unsigned int pretmp_889;
   long int _891;
   long int prephitmp_892;
@@ -5354,9 +5310,10 @@
   # D__lsm.193_931 = PHI <D__lsm.193_32(80)>
   # D__lsm.195_930 = PHI <D__lsm.195_474(80)>
   # _929 = PHI <_543(80)>
-  _882 = {_929, D__lsm.193_931, D__lsm.194_932, D__lsm.195_930};
-  vectp.214_881 = prephitmp_900 + 32;
-  MEM <vector(4) long unsigned int> [(long unsigned int *)vectp.214_881] = _882;
+  MEM[(long unsigned int *)prephitmp_900 + 40B] = D__lsm.193_931;
+  MEM[(long unsigned int *)prephitmp_900 + 48B] = D__lsm.194_932;
+  MEM[(long unsigned int *)prephitmp_900 + 56B] = D__lsm.195_930;
+  MEM[(long unsigned int *)prephitmp_900 + 32B] = _929;
   [magick/random.c:629:3] # DEBUG BEGIN_STMT
   [magick/random.c:629:21] _544 = [magick/random.c:629:21] MEM[(struct RandomInfo *)prephitmp_900].normalize;
   [magick/random.c:629:32] _545 = (double) _929;
@@ -5413,9 +5370,10 @@
   # D__lsm.191_342 = PHI <D__lsm.191_106(82)>
   # D__lsm.190_568 = PHI <D__lsm.190_555(82)>
   # D__lsm.189_606 = PHI <D__lsm.189_574(82)>
-  _879 = {_552, D__lsm.189_606, D__lsm.190_568, D__lsm.191_342};
-  vectp.216_874 = prephitmp_900 + 32;
-  MEM <vector(4) long unsigned int> [(long unsigned int *)vectp.216_874] = _879;
+  MEM[(long unsigned int *)prephitmp_900 + 40B] = D__lsm.189_606;
+  MEM[(long unsigned int *)prephitmp_900 + 48B] = D__lsm.190_568;
+  MEM[(long unsigned int *)prephitmp_900 + 56B] = D__lsm.191_342;
+  MEM[(long unsigned int *)prephitmp_900 + 32B] = _552;
   [magick/random.c:629:3] # DEBUG BEGIN_STMT
   [magick/random.c:629:32] _531 = (double) _552;
   _502 = _230 * _544;
Comment 6 Richard Biener 2025-04-08 10:57:24 UTC
So this is likely a Zen5 tuning thing that makes the vectorization profitable.
Though since this only transforms stores it cannot be an STLF fail; instead
it is likely the vector(4) long unsigned int builds from scalars that cause this.

It's also not possible that the add/sub insn cost change causes this.

Could it be that this is in the end something similar to what we see with LBM?
Comment 7 Jan Hubicka 2025-04-08 15:43:24 UTC
Hmm, the sequence does not use + at all, but I think I know what is going on.  While the field is called addss, it is used as a kitchen sink for all other simple operations.
      /* pmuludq under sse2, pmuldq under sse4.1, for sign_extend,
         require extra 4 mul, 4 add, 4 cmp and 2 shift.  */
      if (!TARGET_SSE4_1 && !uns_p)
        extra_cost = (cost->mulss + cost->addss + cost->sse_op) * 4
                      + cost->sse_op * 2;
....
    case FLOAT_EXTEND:
      if (!SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P (mode))
        *total = 0;
      else
        *total = ix86_vec_cost (mode, cost->addss);
      return false;
....
    case FLOAT_TRUNCATE:
      if (!SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P (mode)) 
        *total = cost->fadd;
      else
        *total = ix86_vec_cost (mode, cost->addss);
      return false;
...
      case scalar_stmt:
        return fp ? ix86_cost->addss : COSTS_N_INSNS (1);
....
      case vector_stmt:
        return ix86_vec_cost (mode,
                              fp ? ix86_cost->addss : ix86_cost->sse_op);

Only addss was sped up (and apparently only in common contexts); other simple SSE operations are still 3 cycles.

We have
  const int sse_op;             /* cost of cheap SSE instruction.  */
  const int addss;              /* cost of ADDSS/SD SUBSS/SD instructions.  */

SSE_OP is used for integer SSE instructions, which are typically 1 cycle, so perhaps we also want an sse_fp_op  /* Cost of cheap SSE FP instruction.  */
in addition to addss.

But to be precise, the builtin vectorizer cost would need to know whether a scalar_stmt/vector_stmt is an addition or something else, which AFAIK it doesn't.
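
A rough sketch of what such a split could look like (the sse_fp_op field is a hypothetical addition, not existing GCC code):

  struct processor_costs {
    ...
    const int sse_op;     /* cost of cheap integer SSE instruction.  */
    const int addss;      /* cost of ADDSS/SD SUBSS/SD themselves.  */
    const int sse_fp_op;  /* cost of other cheap FP SSE instructions
                             (conversions, truncations, ...); would be 3
                             on Zen5, equal to addss everywhere else.  */
    ...
  };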
Comment 8 Richard Biener 2025-04-09 11:33:55 UTC
(In reply to Jan Hubicka from comment #7)
> Hmm, the sequence does not use + at all, but I think I know what is going
> on. While the field is called addss it is used as an kitchen sink for all
> other simple operations.  
>       /* pmuludq under sse2, pmuldq under sse4.1, for sign_extend,
>          require extra 4 mul, 4 add, 4 cmp and 2 shift.  */
>       if (!TARGET_SSE4_1 && !uns_p)
>         extra_cost = (cost->mulss + cost->addss + cost->sse_op) * 4
>                       + cost->sse_op * 2;

this looks OK?

> ....
>     case FLOAT_EXTEND:
>       if (!SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P (mode))
>         *total = 0;
>       else
>         *total = ix86_vec_cost (mode, cost->addss);

not sure why we use addss instead of sse_op here?

>       return false;
> ....
>     case FLOAT_TRUNCATE:
>       if (!SSE_FLOAT_MODE_SSEMATH_OR_HFBF_P (mode)) 
>         *total = cost->fadd;
>       else
>         *total = ix86_vec_cost (mode, cost->addss);

likewise?  but this is rtx_cost

>       return false;
> ...
>       case scalar_stmt:
>         return fp ? ix86_cost->addss : COSTS_N_INSNS (1);

we shouldn't get here, the add_stmt_cost "hook" handles operations
separately.

> ....
>       case vector_stmt:
>         return ix86_vec_cost (mode,
>                               fp ? ix86_cost->addss : ix86_cost->sse_op);

I'm not sure why we use the addss cost in the FP-mode case.  So these would be
the only two places to try fixing.

The question is for which ops we fall back here.  The following might tell:

diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 4f8380c4a58..fe1beefe732 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -25346,6 +25346,7 @@ ix86_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
            stmt_cost = ix86_cost->add;
          break;
        default:
+         gcc_unreachable ();
          break;
        }
     }
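
(The added gcc_unreachable () turns the previously silent fallback into an ICE, so rebuilding the compiler with it and compiling the testcase shows exactly which operations still reach this generic addss/sse_op path.)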


> Only addss was sped up (and apparently only in common contexts); other
> simple SSE operations are still 3 cycles.
> 
> We have
>   const int sse_op;             /* cost of cheap SSE instruction.  */
>   const int addss;              /* cost of ADDSS/SD SUBSS/SD instructions. 
> */
> 
> SSE_OP is used for integer SSE instructions, which are typically 1 cycle, so
> perhaps we also want an sse_fp_op /* Cost of cheap SSE FP instruction.
> */
> in addition to addss.
> 
> But to be precise, the builtin vectorizer cost would need to know if
> a scalar/vector_stmt is an addition or something else, which AFAIK it doesn't

It does see it
Comment 9 Richard Biener 2025-04-09 11:40:29 UTC
(In reply to Richard Biener from comment #8)
> [... the scalar_stmt/vector_stmt fallback discussion and the
> gcc_unreachable () patch quoted from comment #8 ...]

OK, it's quite an incomplete set.  I would suggest fixing the fallback to
use sse_op and never addss.  Or avoiding the addss cost reduction altogether
as a simpler fix for now.
Comment 10 Richard Biener 2025-04-09 11:43:19 UTC
(In reply to Richard Biener from comment #9)
> [... quote of comment #8 elided, see above ...]
> 
> OK, it's quite an incomplete set.  I would suggest fixing the fallback to
> use sse_op and never addss.  Or avoiding the addss cost reduction altogether
> as a simpler fix for now.

Btw, it was your r8-4018-gf6fd8f2bd4e9a9 which added the FP vs. non-FP
difference.
Comment 11 Richard Biener 2025-04-09 11:44:45 UTC
Note that whether the mode is FP is not a good indicator for some operations; as said, nowadays we should catch the cases where it matters upthread.
Comment 12 Jan Hubicka 2025-04-09 13:09:44 UTC
> Btw, it was your r8-4018-gf6fd8f2bd4e9a9 which added the FP vs. non-FP difference.

Yep, I know.  With that patch I mostly wanted to limit the redundancy of the tables.  The int/FP difference was mostly based on the observation that most integer SSE operations (for example padd) take 1 cycle, while most FP operations (like addss) take 3 cycles.  My simplified understanding is that FP operations are usually pipelined to 3 cycles (from the Pentium until today) because they include normalization, the operation itself, and rounding.  The cost table is basically meant to have a "typical cost" (sse_op and addss) along with all the important exceptions (mul, div, fma, sqrt).

Zen5 is, I think, the first CPU where addss has different timing than other basic FP arithmetic, which makes addss itself the exception.  (Back then, I should have renamed the addss cost and made the comment more descriptive.)

So based on this, adding sse_fp_op (set to 3 on Zen5 and to the same cost as addss everywhere else) for the "typical FP operation", and keeping the addss cost for the actual FP add/sub, IMO makes sense (I will need to benchmark whether sub is also 2 cycles; I am not sure about that).

But indeed we currently use addss for conversions and other stuff, which is not necessarily good, and we may want to add more entries for these.  Do you know which ones are important and ought to be fixed?

I am OK with using an addss cost of 3 for trunk and the release branches and making this more precise next stage1.
Comment 13 rguenther@suse.de 2025-04-09 13:13:44 UTC
On Wed, 9 Apr 2025, hubicka at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119298
> 
> --- Comment #12 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
> > Btw, it was your r8-4018-gf6fd8f2bd4e9a9 which added the FP vs. non-FP difference.
> 
> Yep, I know.  [... rest of comment #12 quoted above ...]
> 
> But indeed we currently use addss for conversions and other stuff, which is not
> necessarily good, and we may want to add more entries for these.  Do you know
> which ones are important and ought to be fixed?

I'm not sure which ones are important, but we should try to get rid
of the fallback to call the "old" hook from add_stmt_cost so it's more
obvious from which context we come.  But I also expect this area to
change quite a bit next stage1, when I hope to revamp vectorizer
costing.

> I am OK with using addss cost of 3 for trunk&release branches and make this
> more precise next stage1.

That's what we use now?  But I still don't understand why exactly
538.imagick_r regresses.
Comment 14 Jan Hubicka 2025-04-09 14:44:30 UTC
> > I am OK with using addss cost of 3 for trunk&release branches and make this
> > more precise next stage1.
> 
> That's what we use now?  But I still don't understand why exactly 538.imagick_r regresses.

We currently use 2 for znver5.
In r15-3441-g4292297a0f938f (causing this regression) I changed

  COSTS_N_INSNS (3),                    /* cost of ADDSS/SD SUBSS/SD insns.  */

which was copied from znver4 tables to

  /* ADDSS has throughput 2 and latency 2
     (in some cases when source is another addition).  */
  COSTS_N_INSNS (2),                    /* cost of ADDSS/SD SUBSS/SD insns.  */

since ADDSS has a latency of 2 on znver5 (at least in some, but common, cases).  Martin has verified that this particular change causes the regression, and moreover that the difference is:

-  _882 = {_929, D__lsm.193_931, D__lsm.194_932, D__lsm.195_930};
-  vectp.214_881 = prephitmp_900 + 32;
-  MEM <vector(4) long unsigned int> [(long unsigned int *)vectp.214_881] = _882;
+  MEM[(long unsigned int *)prephitmp_900 + 40B] = D__lsm.193_931;
+  MEM[(long unsigned int *)prephitmp_900 + 48B] = D__lsm.194_932;
+  MEM[(long unsigned int *)prephitmp_900 + 56B] = D__lsm.195_930;
+  MEM[(long unsigned int *)prephitmp_900 + 32B] = _929;

Which boils down to:

long test[4];
void
foo (unsigned long a, unsigned long b, unsigned long c, unsigned long d)
{
        test[0]=a;
        test[1]=b;
        test[2]=c;
        test[3]=d;
}

being vectorized with cost of 2 (trunk) as

_Z3foommmm:
.LFB0:
        .cfi_startproc
        vmovq   %rdx, %xmm2
        vmovq   %rdi, %xmm3
        vpinsrq $1, %rcx, %xmm2, %xmm1
        vpinsrq $1, %rsi, %xmm3, %xmm0
        vinserti64x2    $0x1, %xmm1, %ymm0, %ymm0
        vmovdqa %ymm0, test(%rip)
        vzeroupper
        ret

while with cost of 3 we get:

_Z3foommmm:
.LFB0:
        .cfi_startproc
        vmovq   %rdi, %xmm2
        vmovq   %rdx, %xmm3
        vpinsrq $1, %rsi, %xmm2, %xmm1
        vpinsrq $1, %rcx, %xmm3, %xmm0
        vmovdqa %xmm1, test(%rip)
        vmovdqa %xmm0, test+16(%rip)
        ret

I would not guess which one is better.  I can see that trunk's version saves one move and a potential store->load stall.  In a stand-alone benchmark (which does not consume the stored values) both variants have the same runtime.

The costing difference is:

a.0_1 1 times scalar_store costs 16 in body
b.1_2 1 times scalar_store costs 16 in body
c.2_3 1 times scalar_store costs 16 in body
d.3_4 1 times scalar_store costs 16 in body
node 0x396b4c0 1 times vec_construct costs 40 in prologue
a.0_1 1 times vector_store costs 24 in body
r.C:5:16: note: Cost model analysis for part in loop 0:
  Vector cost: 64
  Scalar cost: 64

to:

a.0_1 1 times scalar_store costs 16 in body
b.1_2 1 times scalar_store costs 16 in body
c.2_3 1 times scalar_store costs 16 in body
d.3_4 1 times scalar_store costs 16 in body
node 0x396b4c0 1 times vec_construct costs 44 in prologue
a.0_1 1 times vector_store costs 24 in body
r.C:5:16: note: Cost model analysis for part in loop 0:
  Vector cost: 68
  Scalar cost: 64
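
(The two analyses differ only in the vec_construct prologue cost, 44 vs 40.  Since COSTS_N_INSNS (n) is n * 4 cost units, the 4-unit delta is exactly COSTS_N_INSNS (3) - COSTS_N_INSNS (2), coming from the addss term in the 256-bit branch of vec_construct quoted below; with the cheaper addss the vector and scalar costs tie at 64 and SLP proceeds, while with 3 the vector cost of 68 exceeds the scalar 64 and the SLP instance is rejected.)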

Counting latencies, I think vinserti64x2 is 1 cycle, while vpinsrq is an integer->SSE move, which is slower and set to 4 cycles.
Overall it is wrong that we use the addss cost to estimate vec_construct:

      case vec_construct:
        {
          int n = TYPE_VECTOR_SUBPARTS (vectype);
          /* N - 1 element inserts into an SSE vector, the possible
             GPR -> XMM move is accounted for in add_stmt_cost.  */
          if (GET_MODE_BITSIZE (mode) <= 128)
            return (n - 1) * ix86_cost->sse_op;
          /* One vinserti128 for combining two SSE vectors for AVX256.  */
          else if (GET_MODE_BITSIZE (mode) == 256)
            return ((n - 2) * ix86_cost->sse_op
                    + ix86_vec_cost (mode, ix86_cost->addss));
          /* One vinserti64x4 and two vinserti128 for combining SSE
             and AVX256 vectors to AVX512.  */
          else if (GET_MODE_BITSIZE (mode) == 512)
            return ((n - 4) * ix86_cost->sse_op
                    + 3 * ix86_vec_cost (mode, ix86_cost->addss));
          gcc_unreachable ();
        }

I think we may want to have ix86_cost->hard_register->integer_to_sse to cost the construction in integer modes instead of addss?


Overall r15-3441-g4292297a0f938f is right for addss itself, but wrong for all the other instructions we glob into the addss cost (conversions, packing etc.).  It was a silly mistake to forget that addss is used in many other cases.
So I think it still makes sense to have at least one extra cost entry (for a common simple FP operation, which is more expensive than the common simple integer operation we already have) and to name it so that it is clear it represents various kinds of typical FP operations (unless we want to have costs for all of them).
Comment 15 Jan Hubicka 2025-04-09 14:53:34 UTC
I made a silly stand-alone test:

long test[4];
__attribute__ ((noipa))
void
foo (unsigned long a, unsigned long b, unsigned long c, unsigned long d)
{
        test[0]=a;
        test[1]=b;
        test[2]=c;
        test[3]=d;
}

int
main()
{
        long s = 0;
        for (int i = 0; i < 1000000000; i++)
        {
                foo (i,i+1,i+2,i+3);
                s+=test[0];
                s+=test[1];
                s+=test[2];
                s+=test[3];
        }
        return s;
}

And curiously enough it strongly prefers cost of 2 over 3. 

jh@shroud:~/trunk/build/gcc> perf stat ./test-noslp 

 Performance counter stats for './test-noslp':

          1,211.17 msec task-clock:u                     #    1.000 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
                55      page-faults:u                    #   45.411 /sec                      
     5,000,372,342      cycles:u                         #    4.129 GHz                       
     2,000,253,750      stalled-cycles-frontend:u        #   40.00% frontend cycles idle      
    17,000,136,662      instructions:u                   #    3.40  insn per cycle            
                                                  #    0.12  stalled cycles per insn   
     3,000,030,827      branches:u                       #    2.477 G/sec                     
             2,592      branch-misses:u                  #    0.00% of all branches           

       1.211767440 seconds time elapsed

       1.211832000 seconds user
       0.000000000 seconds sys


jh@shroud:~/trunk/build/gcc> perf stat ./test-cost3

 Performance counter stats for './test-cost3':

          7,266.90 msec task-clock:u                     #    1.000 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
                55      page-faults:u                    #    7.569 /sec                      
    30,001,467,995      cycles:u                         #    4.129 GHz                       
         1,111,876      stalled-cycles-frontend:u        #    0.00% frontend cycles idle      
    23,000,138,491      instructions:u                   #    0.77  insn per cycle            
                                                  #    0.00  stalled cycles per insn   
     3,000,032,652      branches:u                       #  412.835 M/sec                     
             4,455      branch-misses:u                  #    0.00% of all branches           

       7.267898755 seconds time elapsed

       7.267379000 seconds user
       0.000000000 seconds sys


jh@shroud:~/trunk/build/gcc> perf stat ./test-cost2

 Performance counter stats for './test-cost2':

          1,089.54 msec task-clock:u                     #    1.000 CPUs utilized             
                 0      context-switches:u               #    0.000 /sec                      
                 0      cpu-migrations:u                 #    0.000 /sec                      
                55      page-faults:u                    #   50.480 /sec                      
     4,501,104,318      cycles:u                         #    4.131 GHz                       
         5,495,394      stalled-cycles-frontend:u        #    0.12% frontend cycles idle      
    24,000,136,630      instructions:u                   #    5.33  insn per cycle            
                                                  #    0.00  stalled cycles per insn   
     3,000,030,793      branches:u                       #    2.753 G/sec                     
             2,492      branch-misses:u                  #    0.00% of all branches           

       1.090067946 seconds time elapsed

       1.090267000 seconds user
       0.000000000 seconds sys


Cost2 variant does:

00000000004011c0 <_Z3foommmm>:
  4011c0:       c4 e1 f9 6e d2          vmovq  %rdx,%xmm2
  4011c5:       c4 e1 f9 6e df          vmovq  %rdi,%xmm3
  4011ca:       c4 e3 e9 22 c9 01       vpinsrq $0x1,%rcx,%xmm2,%xmm1
  4011d0:       c4 e3 e1 22 c6 01       vpinsrq $0x1,%rsi,%xmm3,%xmm0
  4011d6:       62 f3 fd 28 38 c1 01    vinserti64x2 $0x1,%xmm1,%ymm0,%ymm0
  4011dd:       c5 fd 7f 05 5b 2e 00    vmovdqa %ymm0,0x2e5b(%rip)        # 404040 <test>
  4011e4:       00 
  4011e5:       c5 f8 77                vzeroupper
  4011e8:       c3                      ret
....
  401059:       c5 fd 6f 0d df 2f 00    vmovdqa 0x2fdf(%rip),%ymm1        # 404040 <test>
  401060:       00 
  401061:       62 f3 fd 28 39 c8 01    vextracti64x2 $0x1,%ymm1,%xmm0
  401068:       c5 f9 d4 c1             vpaddq %xmm1,%xmm0,%xmm0
  40106c:       c5 f1 73 d8 08          vpsrldq $0x8,%xmm0,%xmm1
  401071:       c5 f9 d4 c1             vpaddq %xmm1,%xmm0,%xmm0
  401075:       c4 e1 f9 7e c0          vmovq  %xmm0,%rax
  40107a:       49 01 c4                add    %rax,%r12


while cost3 variant does:
00000000004011c0 <_Z3foommmm>:
  4011c0:       c4 e1 f9 6e d7          vmovq  %rdi,%xmm2
  4011c5:       c4 e1 f9 6e da          vmovq  %rdx,%xmm3
  4011ca:       c4 e3 e9 22 ce 01       vpinsrq $0x1,%rsi,%xmm2,%xmm1
  4011d0:       c4 e3 e1 22 c1 01       vpinsrq $0x1,%rcx,%xmm3,%xmm0
  4011d6:       c5 f9 7f 0d 62 2e 00    vmovdqa %xmm1,0x2e62(%rip)        # 404040 <test>
  4011dd:       00 
  4011de:       c5 f9 7f 05 6a 2e 00    vmovdqa %xmm0,0x2e6a(%rip)        # 404050 <test+0x10>
  4011e5:       00 
  4011e6:       c3                      ret
....
  401059:       c5 fd 6f 0d df 2f 00    vmovdqa 0x2fdf(%rip),%ymm1        # 404040 <test>
  401060:       00 
  401061:       62 f3 fd 28 39 c8 01    vextracti64x2 $0x1,%ymm1,%xmm0
  401068:       c5 f9 d4 c1             vpaddq %xmm1,%xmm0,%xmm0
  40106c:       c5 f1 73 d8 08          vpsrldq $0x8,%xmm0,%xmm1
  401071:       c5 f9 d4 c1             vpaddq %xmm1,%xmm0,%xmm0
  401075:       c4 e1 f9 7e c0          vmovq  %xmm0,%rax
  40107a:       49 01 c4                add    %rax,%r12


noslp
00000000004011a0 <_Z3foommmm>:
  4011a0:       48 89 3d 99 2e 00 00    mov    %rdi,0x2e99(%rip)        # 404040 <test>
  4011a7:       48 89 35 9a 2e 00 00    mov    %rsi,0x2e9a(%rip)        # 404048 <test+0x8>
  4011ae:       48 89 15 9b 2e 00 00    mov    %rdx,0x2e9b(%rip)        # 404050 <test+0x10>
  4011b5:       48 89 0d 9c 2e 00 00    mov    %rcx,0x2e9c(%rip)        # 404058 <test+0x18>
  4011bc:       c3                      ret
....
  401046:       48 03 1d f3 2f 00 00    add    0x2ff3(%rip),%rbx        # 404040 <test>
  40104d:       48 03 1d f4 2f 00 00    add    0x2ff4(%rip),%rbx        # 404048 <test+0x8>
  401054:       48 03 1d f5 2f 00 00    add    0x2ff5(%rip),%rbx        # 404050 <test+0x10>
  40105b:       48 03 1d f6 2f 00 00    add    0x2ff6(%rip),%rbx        # 404058 <test+0x18>
Comment 16 Richard Biener 2025-04-10 06:12:17 UTC
(In reply to Jan Hubicka from comment #15)
> I made a silly stand-alone test: [... source elided, quoted from comment #15 above ...]
> 
> And curiously enough it strongly prefers cost of 2 over 3. 
> 
> [... perf stat output for test-noslp, test-cost3 and test-cost2 elided, quoted from comment #15 above ...]
> 
> Cost2 variant does:
> 
> 00000000004011c0 <_Z3foommmm>:
>   4011c0:       c4 e1 f9 6e d2          vmovq  %rdx,%xmm2
>   4011c5:       c4 e1 f9 6e df          vmovq  %rdi,%xmm3
>   4011ca:       c4 e3 e9 22 c9 01       vpinsrq $0x1,%rcx,%xmm2,%xmm1
>   4011d0:       c4 e3 e1 22 c6 01       vpinsrq $0x1,%rsi,%xmm3,%xmm0
>   4011d6:       62 f3 fd 28 38 c1 01    vinserti64x2 $0x1,%xmm1,%ymm0,%ymm0
>   4011dd:       c5 fd 7f 05 5b 2e 00    vmovdqa %ymm0,0x2e5b(%rip)        #
> 404040 <test>
>   4011e4:       00 
>   4011e5:       c5 f8 77                vzeroupper
>   4011e8:       c3                      ret
> ....
>   401059:       c5 fd 6f 0d df 2f 00    vmovdqa 0x2fdf(%rip),%ymm1        #
> 404040 <test>

that will forward nicely

>   401060:       00 
>   401061:       62 f3 fd 28 39 c8 01    vextracti64x2 $0x1,%ymm1,%xmm0
>   401068:       c5 f9 d4 c1             vpaddq %xmm1,%xmm0,%xmm0
>   40106c:       c5 f1 73 d8 08          vpsrldq $0x8,%xmm0,%xmm1
>   401071:       c5 f9 d4 c1             vpaddq %xmm1,%xmm0,%xmm0
>   401075:       c4 e1 f9 7e c0          vmovq  %xmm0,%rax
>   40107a:       49 01 c4                add    %rax,%r12
> 
> 
> while cost3 variant does:
> 00000000004011c0 <_Z3foommmm>:
>   4011c0:       c4 e1 f9 6e d7          vmovq  %rdi,%xmm2
>   4011c5:       c4 e1 f9 6e da          vmovq  %rdx,%xmm3
>   4011ca:       c4 e3 e9 22 ce 01       vpinsrq $0x1,%rsi,%xmm2,%xmm1
>   4011d0:       c4 e3 e1 22 c1 01       vpinsrq $0x1,%rcx,%xmm3,%xmm0
>   4011d6:       c5 f9 7f 0d 62 2e 00    vmovdqa %xmm1,0x2e62(%rip)        #
> 404040 <test>
>   4011dd:       00 
>   4011de:       c5 f9 7f 05 6a 2e 00    vmovdqa %xmm0,0x2e6a(%rip)        #
> 404050 <test+0x10>
>   4011e5:       00 
>   4011e6:       c3                      ret
> ....
>   401059:       c5 fd 6f 0d df 2f 00    vmovdqa 0x2fdf(%rip),%ymm1        #
> 404040 <test>

this will fail to forward, thus a huge penalty.

>   401060:       00 
>   401061:       62 f3 fd 28 39 c8 01    vextracti64x2 $0x1,%ymm1,%xmm0
>   401068:       c5 f9 d4 c1             vpaddq %xmm1,%xmm0,%xmm0
>   40106c:       c5 f1 73 d8 08          vpsrldq $0x8,%xmm0,%xmm1
>   401071:       c5 f9 d4 c1             vpaddq %xmm1,%xmm0,%xmm0
>   401075:       c4 e1 f9 7e c0          vmovq  %xmm0,%rax
>   40107a:       49 01 c4                add    %rax,%r12
> 
> 
> noslp [... assembly elided, quoted from comment #15 above ...]
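
(In numbers: test-cost3 above spends 30.0e9 cycles on 1e9 iterations, i.e. roughly 30 cycles per iteration, against roughly 4.5 for test-cost2 and 5.0 for test-noslp; the extra ~25 cycles per iteration are consistent with the 32-byte load having to wait for the two 16-byte stores to commit instead of being forwarded.)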
Comment 17 Richard Biener 2025-04-10 06:16:44 UTC
(In reply to Jan Hubicka from comment #14)
>
> Counting latencies, I think vinserti64x2 is 1 cycle and vpinst is
> integer->sse move that is slower and set to 4 cycles.
> Overall it is wrong that we use addss cost to estimate vec_construct:
> 
>       case vec_construct:
>         {
>           int n = TYPE_VECTOR_SUBPARTS (vectype);
>           /* N - 1 element inserts into an SSE vector, the possible
>              GPR -> XMM move is accounted for in add_stmt_cost.  */
>           if (GET_MODE_BITSIZE (mode) <= 128)
>             return (n - 1) * ix86_cost->sse_op;
>           /* One vinserti128 for combining two SSE vectors for AVX256.  */
>           else if (GET_MODE_BITSIZE (mode) == 256)
>             return ((n - 2) * ix86_cost->sse_op
>                     + ix86_vec_cost (mode, ix86_cost->addss));
>           /* One vinserti64x4 and two vinserti128 for combining SSE
>              and AVX256 vectors to AVX512.  */
>           else if (GET_MODE_BITSIZE (mode) == 512)
>             return ((n - 4) * ix86_cost->sse_op
>                     + 3 * ix86_vec_cost (mode, ix86_cost->addss));
>           gcc_unreachable ();
>         }
> 
> I think we may want to have ix86_cost->hard_register->integer_to_sse to cost
> the construction in integer modes instead of addss?

I have no recollection of why we are mixing the sse_op and addss costs here ...
It's not an integer-to-SSE conversion either (again, the caller adjusts
for that case).  We seem to use sse_op for the element inserts
into an SSE reg and addss for the inserts of SSE regs into YMM or ZMM.

I think it's reasonable to change this to consistently use sse_op.
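
A minimal sketch of that suggestion, simply substituting sse_op for addss in the vec_construct formula quoted in comment #14 (untested):

      case vec_construct:
        {
          int n = TYPE_VECTOR_SUBPARTS (vectype);
          /* N - 1 element inserts into an SSE vector, the possible
             GPR -> XMM move is accounted for in add_stmt_cost.  */
          if (GET_MODE_BITSIZE (mode) <= 128)
            return (n - 1) * ix86_cost->sse_op;
          /* One vinserti128, now costed as a cheap SSE op.  */
          else if (GET_MODE_BITSIZE (mode) == 256)
            return ((n - 2) * ix86_cost->sse_op
                    + ix86_vec_cost (mode, ix86_cost->sse_op));
          /* One vinserti64x4 and two vinserti128, likewise.  */
          else if (GET_MODE_BITSIZE (mode) == 512)
            return ((n - 4) * ix86_cost->sse_op
                    + 3 * ix86_vec_cost (mode, ix86_cost->sse_op));
          gcc_unreachable ();
        }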
Comment 18 Richard Biener 2025-04-11 09:21:18 UTC
(In reply to Richard Biener from comment #17)
> [... quote of comment #14's vec_construct discussion elided, see comment #17 above ...]
> 
> I have no recollection on why we are mixing sse_op and addss cost here ...
> It's not a integer to SSE conversion either (again the caller adjusts
> for this in this case).  We seem to use sse_op for the element insert
> into SSE reg and addss for the insert of SSE regs into YMM or ZMM.
> 
> I think it's reasonable to change this to consistently use sse_op.

So this was from r8-6815-gbe77ba2a461eef.  addss was chosen for the inserts
into the larger vectors because

"The following tries to account for the fact that when constructing
AVX256 or AVX512 vectors from elements we can only use insertps to
insert into the low 128bits of a vector but have to use
vinserti128 or vinserti64x4 to build larger AVX256/512 vectors.
Those operations also have higher latency (Agner documents
3 cycles for Broadwell for reg-reg vinserti128 while insertps has
one cycle latency).  Agner doesn't have tables for AVX512 yet but
I guess the story is similar for vinserti64x4.

Latency is similar for FP adds so I re-used ix86_cost->addss for
this cost."

On zen4 vinserti128 has 1 cycle latency; on zen2 it was still 3.  But
OTOH PINSRB/W/D/Q is 4 cycles on zen4 while insertps is indeed 1 cycle.
It will depend very much on the subarch what vec_init will generate, and
the costing is likely not accurate with the simple formula used in
ix86_builtin_vectorization_cost.

Definitely re-using addss cost now causes issues.  Maybe we need to
have separate "cost of SSE construction from QI/HI/SI,SF/DI,DF"
and "cost of AVX construction from SSE" and "cost of AVX512 construction
from SSE/AVX" tuning cost entries.  Or just a single combined cost

 { 60, 28, 3, 1, 1, 3, 1 }      /* cost of vector construction from elements
                                   V16QI, V8HI, V4SI/V4SF, V2DI/V2DF, YMM from XMM, ZMM from XMM, ZMM from YMM */

though I'd probably split the SSE CTOR costs from the YMM/ZMM costs.  Or
we have cost for the element inserts into V16QI, V8HI, V4SI/V4SF, V2DI/V2DF
and V2XMM, V4XMM, V2YMM.

That said, given that with an extra register we build V16QI in a tree-like
fashion, avoiding 15 * the high pinsrb latency, the combined cost might be
(more?) interesting.
Comment 19 GCC Commits 2025-04-15 17:05:56 UTC
The master branch has been updated by Jan Hubicka <hubicka@gcc.gnu.org>:

https://gcc.gnu.org/g:e2011ab13de3e70774f869b356f5f9c750780b34

commit r15-9495-ge2011ab13de3e70774f869b356f5f9c750780b34
Author: Jan Hubicka <hubicka@ucw.cz>
Date:   Tue Apr 15 19:04:15 2025 +0200

    Set ADDSS cost to 3 for znver5
    
    Znver5 has an addss latency of 2 in the typical case, while all earlier versions have latency 3.
    Unfortunately the addss cost is used to cost many SSE instructions other than just addss, and
    setting the cost to 2 makes us vectorize 4 64-bit stores into one 256-bit store, which
    in turn regresses imagemagick.
    
    This patch sets the cost back to 3.  Next stage1 we can untie addss from the other operations
    and set it correctly.
    
    bootstrapped/regtested x86_64-linux and also benchmarked on SPEC2k17
    
    gcc/ChangeLog:
    
            PR target/119298
            * config/i386/x86-tune-costs.h (znver5_cost): Set ADDSS cost to 3.
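
For reference, the corresponding znver5_cost change in x86-tune-costs.h, reconstructed from comment #14 (not the verbatim commit diff):

-  /* ADDSS has throughput 2 and latency 2
-     (in some cases when source is another addition).  */
-  COSTS_N_INSNS (2),                    /* cost of ADDSS/SD SUBSS/SD insns.  */
+  COSTS_N_INSNS (3),                    /* cost of ADDSS/SD SUBSS/SD insns.  */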