Bug 91446 - Wrong cost for scalar_load/scalar_store of vector type
Summary: Wrong cost for scalar_load/scalar_store of vector type
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 10.0
: P3 normal
Target Milestone: 10.0
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-08-14 17:50 UTC by H.J. Lu
Modified: 2019-09-18 20:11 UTC (History)
4 users (show)

See Also:
Host:
Target: i386
Build:
Known to work:
Known to fail:
Last reconfirmed:


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description H.J. Lu 2019-08-14 17:50:00 UTC
[hjl@gnu-skx-1 skx-3]$ cat x.i
typedef struct
{
  unsigned long width, height;
  long x, y;
} info;

extern void bar (info *);

void
foo (unsigned long width, unsigned long height, long x, long y)
{
  info t;
  t.width = width;
  t.height = height;
  t.x = x;
  t.y = y;
  bar (&t);
}
[hjl@gnu-skx-1 skx-3]$ make x.s
/export/build/gnu/tools-build/gcc-tuning-debug/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/tools-build/gcc-tuning-debug/build-x86_64-linux/gcc/ -Ofast -march=skylake-avx512 -S x.i
[hjl@gnu-skx-1 skx-3]$ cat x.s
	.file	"x.i"
	.text
	.p2align 4
	.globl	foo
	.type	foo, @function
foo:
.LFB0:
	.cfi_startproc
	subq	$40, %rsp
	.cfi_def_cfa_offset 48
	movq	%rdi, (%rsp)
	movq	%rsp, %rdi
	movq	%rsi, 8(%rsp)
	movq	%rdx, 16(%rsp)
	movq	%rcx, 24(%rsp)
	call	bar
	addq	$40, %rsp
	.cfi_def_cfa_offset 8
	ret
	.cfi_endproc
.LFE0:
	.size	foo, .-foo
	.ident	"GCC: (GNU) 10.0.0 20190812 (experimental)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-skx-1 skx-3]$ 

The problem is

(gdb) f 0
#0  ix86_builtin_vectorization_cost (type_of_cost=scalar_store, 
    vectype=0x7fffea63abd0)
    at /export/gnu/import/git/intel/gcc-tuning/gcc/config/i386/i386.c:21090
21090				      : ix86_cost->int_store [2]) / 2;
(gdb) p mode
$7 = E_V2DImode
(gdb) 

     case scalar_load:
        /* load/store costs are relative to register move which is 2. Recompute
           it to COSTS_N_INSNS so everything have same base.  */
        return COSTS_N_INSNS (fp ? ix86_cost->sse_load[0]
                              : ix86_cost->int_load [2]) / 2;

      case scalar_store:
        return COSTS_N_INSNS (fp ? ix86_cost->sse_store[0]
                              : ix86_cost->int_store [2]) / 2;

2 DImode stores/loads are needed to store/load V2DImode.
Comment 1 Richard Biener 2019-08-15 07:32:39 UTC
Not sure what you are after, we are costing for the scalar cost

0x4ad73f0 width_2(D) 1 times scalar_store costs 6 in body
0x4ad73f0 height_4(D) 1 times scalar_store costs 6 in body
0x4ad73f0 x_6(D) 1 times scalar_store costs 6 in body
0x4ad73f0 y_8(D) 1 times scalar_store costs 6 in body

and for the vectorized cost

0x4bd8330 width_2(D) 1 times vec_construct costs 8 in prologue
0x4bd8330 width_2(D) 1 times vector_store costs 16 in body
0x4bd8330 x_6(D) 1 times vec_construct costs 8 in prologue
0x4bd8330 x_6(D) 1 times vector_store costs 16 in body

t.i:17:3: note:  Cost model analysis:
  Vector inside of basic block cost: 32
  Vector prologue cost: 16
  Vector epilogue cost: 0
  Scalar cost of basic block: 24
t.i:17:3: missed:  not vectorized: vectorization is not profitable.

I assume skylake-avx512 uses skylake_cost.  The only issue I see is fixed
use of [0]/[2] (SImode/SFmode) also because the cost tables do not have
entries for DImode int_load/store.

Skylake costs are odd here:

  {4, 4, 4},                            /* cost of loading integer registers
                                           in QImode, HImode and SImode.
                                           Relative to reg-reg move (2).  */
  {6, 6, 3},                            /* cost of storing integer registers */

Why is SImode store cost 3?  (looks like "benchmark" random-number generator?)

 {6, 6, 6, 10, 20},                    /* cost of loading SSE registers
                                           in 32,64,128,256 and 512-bit */
  {6, 6, 6, 10, 20},                    /* cost of unaligned loads.  */
  {8, 8, 8, 12, 24},                    /* cost of storing SSE registers
                                           in 32,64,128,256 and 512-bit */
  {8, 8, 8, 8, 16},                     /* cost of unaligned stores.  */

again, unaligned SSE stores are cheaper than aligned ones for 256 and 512 bits?!

In the end the scalar code is not vectorized because of vector construction
cost and because the vector store cost is higher than the scalar store cost
which oddly is too cheap (3 vs. expected 6).

So - INVALID?
Comment 2 Richard Biener 2019-08-15 07:37:33 UTC
Index: gcc/config/i386/x86-tune-costs.h
===================================================================
--- gcc/config/i386/x86-tune-costs.h    (revision 274422)
+++ gcc/config/i386/x86-tune-costs.h    (working copy)
@@ -1442,7 +1442,7 @@ struct processor_costs skylake_cost = {
   {4, 4, 4},                           /* cost of loading integer registers
                                           in QImode, HImode and SImode.
                                           Relative to reg-reg move (2).  */
-  {6, 6, 3},                           /* cost of storing integer registers */
+  {6, 6, 6},                           /* cost of storing integer registers */
   2,                                   /* cost of reg,reg fld/fst */
   {6, 6, 8},                           /* cost of loading fp registers
                                           in SFmode, DFmode and XFmode */

produces

foo:
.LFB0:
        .cfi_startproc
        vmovq   %rdi, %xmm1
        subq    $40, %rsp
        .cfi_def_cfa_offset 48
        vpinsrq $1, %rsi, %xmm1, %xmm0
        vmovq   %rdx, %xmm2
        vmovaps %xmm0, (%rsp)
        movq    %rsp, %rdi
        vpinsrq $1, %rcx, %xmm2, %xmm0
        vmovaps %xmm0, 16(%rsp)
        call    bar
        addq    $40, %rsp
        .cfi_def_cfa_offset 8
        ret

it may appear odd that we don't use AVX256, this is because of

t.i:17:3: note:   === vect_analyze_data_ref_accesses ===
t.i:17:3: note:   Detected interleaving store t.width and t.height
t.i:17:3: note:   Detected interleaving store t.x and t.y
t.i:17:3: note:   Detected interleaving store of size 2
t.i:17:3: note:         t.width = width_2(D);
t.i:17:3: note:         t.height = height_4(D);
t.i:17:3: note:   Detected interleaving store of size 2
t.i:17:3: note:         t.x = x_6(D);
t.i:17:3: note:         t.y = y_8(D);

and thus we are "confused" about the different sign of the fields which
ultimatively yields to different vector types which would make a difference
if there is sign-dependent arithmetic performed, but not in this particular
case.  On GIMPLE we'd also need nop-conversions to make the IL checker happy.
On this ground the bug would be valid but not about costs (you may want
to open a separate bug for this issue).
Comment 3 hjl@gcc.gnu.org 2019-09-18 19:49:50 UTC
Author: hjl
Date: Wed Sep 18 19:49:19 2019
New Revision: 275905

URL: https://gcc.gnu.org/viewcvs?rev=275905&root=gcc&view=rev
Log:
i386: Increase Skylake SImode pseudo register store cost

On Skylake, SImode store cost isn't less than half cost of 128-bit vector
store.  This patch increases Skylake SImode pseudo register store cost to
make it the same as QImode and HImode.

gcc/

	PR target/91446
	* config/i386/x86-tune-costs.h (skylake_cost): Increase SImode
	pseudo register store cost from 3 to 6 to make it the same as
	QImode and HImode.

gcc/testsuite/

	PR target/91446
	* gcc.target/i386/pr91446.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr91446.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/x86-tune-costs.h
    trunk/gcc/testsuite/ChangeLog
Comment 4 H.J. Lu 2019-09-18 20:11:18 UTC
(In reply to Richard Biener from comment #2)
> On this ground the bug would be valid but not about costs (you may want
> to open a separate bug for this issue).

Fixed for GCC 10.  I opened PR 91811.