90878 – [8/9/10 Regression] integer -> SSE register move isn't generated

Bug 90878 - [8/9/10 Regression] integer -> SSE register move isn't generated

Summary: [8/9/10 Regression] integer -> SSE register move isn't generated

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	8.3.1

Importance:	P3 normal
Target Milestone:	10.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:	90952
Blocks:
	Show dependency tree / graph

Reported:	2019-06-13 22:17 UTC by H.J. Lu
Modified:	2019-09-18 20:03 UTC (History)
CC List:	1 user (show)

See Also:
Host:
Target:	x86-64
Build:
Known to work:
Known to fail:
Last reconfirmed:

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description H.J. Lu 2019-06-13 22:17:43 UTC

[hjl@gnu-cfl-1 sse-move]$ cat x.i
union ieee754_float
  {
    float f;

    struct
      {
	unsigned int mantissa:23;
	unsigned int exponent:8;
	unsigned int negative:1;
      } ieee;
};

double
foo (float f)
{
  union ieee754_float u;
  u.f = f;
  u.ieee.negative = 0;
  return u.f;
}
[hjl@gnu-cfl-1 sse-move]$ /usr/gcc-9.1.1-x32-tuning/bin/gcc -S -O2 -march=skylake x.i
[hjl@gnu-cfl-1 sse-move]$ cat x.s
	.file	"x.i"
	.text
	.p2align 4
	.globl	foo
	.type	foo, @function
foo:
.LFB0:
	.cfi_startproc
	vmovd	%xmm0, %eax
	andl	$2147483647, %eax
	movl	%eax, -4(%rsp)
	vcvtss2sd	-4(%rsp), %xmm0, %xmm0
	ret
	.cfi_endproc
.LFE0:
	.size	foo, .-foo
	.ident	"GCC: (GNU) 9.1.1 20190517"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-cfl-1 sse-move]$ 

Skylake cost has

 2, 2,                                 /* SSE->integer and integer->SSE moves */

which are the same cost as the normal register move.  But integer->SSE
isn't used since ix86_register_move_cost has

  /* Moves between SSE/MMX and integer unit are expensive.  */
  if (MMX_CLASS_P (class1) != MMX_CLASS_P (class2)
      || SSE_CLASS_P (class1) != SSE_CLASS_P (class2))

    /* ??? By keeping returned value relatively high, we limit the number
       of moves between integer and MMX/SSE registers for all targets.
       Additionally, high value prevents problem with x86_modes_tieable_p(),
       where integer modes in MMX/SSE registers are not tieable
       because of missing QImode and HImode moves to, from or between
       MMX/SSE registers.  */
    return MAX (8, MMX_CLASS_P (class1) || MMX_CLASS_P (class2)
                ? ix86_cost->mmxsse_to_integer : ix86_cost->ssemmx_to_integer);

integer->SSE is always 8 which much more expensive than integer
register store whose cost is 3.

Comment 1 H.J. Lu 2019-06-13 22:36:17 UTC

If we make integer register store more expensive, this testcase will
regress:

[hjl@gnu-cfl-1 unroll]$ cat x.i
void
foo (long p2, long *diag, long d, long i)
{
  long k;
  k = p2 < 3 ? p2 + p2 : p2 + 3;
  while (i < k)
    diag[i++] = d;
}
[hjl@gnu-cfl-1 unroll]$ make
/export/build/gnu/tools-build/gcc-wip-debug/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/tools-build/gcc-wip-debug/build-x86_64-linux/gcc/ -O3 -march=skylake  -S x.i
[hjl@gnu-cfl-1 unroll]$ cat x.s
	.file	"x.i"
	.text
	.p2align 4
	.globl	foo
	.type	foo, @function
foo:
.LFB0:
	.cfi_startproc
	leaq	(%rdi,%rdi), %rax
	leaq	3(%rdi), %r8
	cmpq	$2, %rdi
	cmovle	%rax, %r8
	cmpq	%rcx, %r8
	jle	.L10
	movq	%rcx, %rax
	notq	%rax
	movq	%r8, %r9
	addq	%r8, %rax
	subq	%rcx, %r9
	cmpq	$3, %rax
	jbe	.L5
	movq	%r9, %rdi
	shrq	$2, %rdi
	vmovq	%rdx, %xmm1
	leaq	(%rsi,%rcx,8), %rax
	salq	$5, %rdi
	vpbroadcastq	%xmm1, %ymm0
	addq	%rax, %rdi
	.p2align 4,,10
	.p2align 3
.L6:
	vmovdqu	%ymm0, (%rax)
	addq	$32, %rax
	cmpq	%rdi, %rax
	jne	.L6
	movq	%r9, %rax
	andq	$-4, %rax
	addq	%rax, %rcx
	cmpq	%rax, %r9
	je	.L9
	vzeroupper
.L5:
	leaq	1(%rcx), %rax
	movq	%rdx, (%rsi,%rcx,8)
	cmpq	%r8, %rax
	jge	.L10
	leaq	2(%rcx), %rdi
	movq	%rdx, (%rsi,%rax,8)
	cmpq	%rdi, %r8
	jle	.L10
	addq	$3, %rcx
	movq	%rdx, (%rsi,%rdi,8)
	cmpq	%rcx, %r8
	jle	.L10
	movq	%rdx, (%rsi,%rcx,8)
	ret
	.p2align 4,,10
	.p2align 3
.L9:
	vzeroupper
.L10:
	ret
	.cfi_endproc
.LFE0:
	.size	foo, .-foo
	.ident	"GCC: (GNU) 10.0.0 20190613 (experimental)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-cfl-1 unroll]$ 

since higher integer register store cost will reduce loop unroll count.

Comment 2 Uroš Bizjak 2019-06-13 22:40:27 UTC

Why is this PR marked as a regression?

Comment 3 H.J. Lu 2019-06-13 23:38:19 UTC

(In reply to Uroš Bizjak from comment #2)
> Why is this PR marked as a regression?

GCC 7 doesn't use memory:

[hjl@gnu-cfl-1 sse-move]$ cat x.s
	.file	"x.i"
	.text
	.p2align 4,,15
	.globl	foo
	.type	foo, @function
foo:
.LFB0:
	.cfi_startproc
	vmovd	%xmm0, %eax
	vxorpd	%xmm0, %xmm0, %xmm0
	andl	$2147483647, %eax
	vmovd	%eax, %xmm1
	vcvtss2sd	%xmm1, %xmm0, %xmm0
	ret
	.cfi_endproc
.LFE0:
	.size	foo, .-foo
	.ident	"GCC: (GNU) 7.3.1 20181128"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-cfl-1 sse-move]$

Comment 4 Richard Biener 2019-06-14 07:29:05 UTC

The constant '8' is at least fishy - it should be based on some target specific cost value, like 8 * cost of a add or so.

Comment 5 H.J. Lu 2019-06-14 15:27:40 UTC

(In reply to H.J. Lu from comment #1)
> If we make integer register store more expensive, this testcase will
> regress:
> 
> [hjl@gnu-cfl-1 unroll]$ cat x.i
> void
> foo (long p2, long *diag, long d, long i)
> {
>   long k;
>   k = p2 < 3 ? p2 + p2 : p2 + 3;
>   while (i < k)
>     diag[i++] = d;
> }
> [hjl@gnu-cfl-1 unroll]$ make
> /export/build/gnu/tools-build/gcc-wip-debug/build-x86_64-linux/gcc/xgcc
> -B/export/build/gnu/tools-build/gcc-wip-debug/build-x86_64-linux/gcc/ -O3
> -march=skylake  -S x.i
...
> since higher integer register store cost will reduce loop unroll count.

ix86_builtin_vectorization_cost has

     case scalar_load:
        /* load/store costs are relative to register move which is 2. Recompute
           it to COSTS_N_INSNS so everything have same base.  */
        return COSTS_N_INSNS (fp ? ix86_cost->sse_load[0]
                              : ix86_cost->int_load [2]) / 2;

      case scalar_store:
        return COSTS_N_INSNS (fp ? ix86_cost->sse_store[0]
                              : ix86_cost->int_store [2]) / 2;

sse_load[0], int_load [2], sse_store[0], int_store [2] impact

1. Loop runtime profitability threshold.
2. Selection of memory vs register operands.

Should we add a separate set of costs to processor_costs for vectorizer?

Comment 6 H.J. Lu 2019-06-14 16:17:51 UTC

We have

-- Target Hook: bool TARGET_RTX_COSTS (rtx X, machine_mode MODE, int
          OUTER_CODE, int OPNO, int *TOTAL, bool SPEED)
     This target hook describes the relative costs of RTL expressions.

     The cost may depend on the precise form of the expression, which is
     available for examination in X, and the fact that X appears as
     operand OPNO of an expression with rtx code OUTER_CODE.  That is,
     the ho-- Target Hook: bool TARGET_RTX_COSTS (rtx X, machine_mode MODE, int
          OUTER_CODE, int OPNO, int *TOTAL, bool SPEED)
     This target hook describes the relative costs of RTL expressions.

     The cost may depend on the precise form of the expression, which is
     available for examination in X, and the fact that X appears as
     operand OPNO of an expression with rtx code OUTER_CODE.  That is,
     the hook can assume that there is some rtx Y such that 'GET_CODE
     (Y) == OUTER_CODE' and such that either (a) 'XEXP (Y, OPNO) == X'
     or (b) 'XVEC (Y, OPNO)' contains X.

     MODE is X's machine mode, or for cases like 'const_int' that do not
     have a mode, the mode in which X is used.

     In implementing this hook, you can use the construct 'COSTS_N_INSNS
     (N)' to specify a cost equal to N fast instructions.

     On entry to the hook, '*TOTAL' contains a default estimate for the
     cost of the expression.  The hook should modify this value as
     necessary.  Traditionally, the default costs are 'COSTS_N_INSNS
     (5)' for multiplications, 'COSTS_N_INSNS (7)' for division and
     modulus operations, and 'COSTS_N_INSNS (1)' for all other
     operations.

     When optimizing for code size, i.e. when 'speed' is false, this
     target hook should be used to estimate the relative size cost of an
     expression, again relative to 'COSTS_N_INSNS'.

     The hook returns true when all subexpressions of X have been
     processed, and false when 'rtx_cost' should recurse.
ok can assume that there is some rtx Y such that 'GET_CODE
     (Y) == OUTER_CODE' and such that either (a) 'XEXP (Y, OPNO) == X'
     or (b) 'XVEC (Y, OPNO)' contains X.

     MODE -- Target Hook: bool TARGET_RTX_COSTS (rtx X, machine_mode MODE, int
          OUTER_CODE, int OPNO, int *TOTAL, bool SPEED)
     This target hook describes the relative costs of RTL expressions.

     The cost may depend on the precise form of the expression, which is
     available for examination in X, and the fact that X appears as
     operand OPNO of an expression with rtx code OUTER_CODE.  That is,
     the hook can assume that there is some rtx Y such that 'GET_CODE
     (Y) == OUTER_CODE' and such that either (a) 'XEXP (Y, OPNO) == X'
     or (b) 'XVEC (Y, OPNO)' contains X.

     MODE is X's machine mode, or for cases like 'const_int' that do not
     have a mode, the mode in which X is used.

     In implementing this hook, you can use the construct 'COSTS_N_INSNS
     (N)' to specify a cost equal to N fast instructions.

     On entry to the hook, '*TOTAL' contains a default estimate for the
     cost of the expression.  The hook should modify this value as
     necessary.  Traditionally, the default costs are 'COSTS_N_INSNS
     (5)' for multiplications, 'COSTS_N_INSNS (7)' for division and
     modulus operations, and 'COSTS_N_INSNS (1)' for all other
     operations.

     When optimizing for code size, i.e. when 'speed' is false, this
     target hook should be used to estimate the relative size cost of an
     expression, again relative to 'COSTS_N_INSNS'.

     The hook returns true when all subexpressions of X have been
     processed, and false when 'rtx_cost' should recurse.
is X's machine mode, or for cases like 'const_int' that do not
     have a mode, the mode in which X is used.

     In implementing this hook, you can use the construct 'COSTS_N_INSNS
     (N)' to specify a cost equal to N fast instructions.

     On entry to the hook, '*TOTAL' contains a default estimate for the
     cost of the expression.  The hook should modify this value as
     necessary.  Traditionally, the default costs are 'COSTS_N_INSNS
     (5)' for multiplications, 'COSTS_N_INSNS (7)' for division and
     modulus operations, and 'COSTS_N_INSNS (1)' for all other
     operations.

     When optimizing for code size, i.e. when 'speed' is false, this
     target hook should be used to estimate the relative size cost of an
     expression, again relative to 'COSTS_N_INSNS'.

     The hook returns true when all subexpressions of X have been
     processed, and false when 'rtx_cost' should recurse.

-- Target Hook: int TARGET_REGISTER_MOVE_COST (machine_mode MODE,
          reg_class_t FROM, reg_class_t TO)
     This target hook should return the cost of moving data of mode MODE
     from a register in class FROM to one in class TO.  The classes are
     expressed using the enumeration values such as 'GENERAL_REGS'.  A
     value of 2 is the default; other values are interpreted relative to
     that.

     It is not required that the cost always equal 2 when FROM is the
     same as TO; on some machines it is expensive to move between
     registers if they are not general registers.

     If reload sees an insn consisting of a single 'set' between two
     hard registers, and if 'TARGET_REGISTER_MOVE_COST' applied to their
     classes returns a value of 2, reload does not check to ensure that
     the constraints of the insn are met.  Setting a cost of other than
     2 will allow reload to verify that the constraints are met.  You
     should do this if the 'movM' pattern's constraints do not allow
     such copying.

     The default version of this function returns 2.

TARGET_REGISTER_MOVE_COST and TARGET_RTX_COSTS serve different purposes
with different formulas.  TARGET_REGISTER_MOVE_COST is used by register
allocator.  TARGET_RTX_COSTS describes the relative costs of RTL
expressions.  Using values intended for register allocator for
TARGET_RTX_COSTS isn't appropriate.

Comment 7 hjl@gcc.gnu.org 2019-08-15 18:16:05 UTC

Author: hjl
Date: Thu Aug 15 18:15:33 2019
New Revision: 274543

URL: https://gcc.gnu.org/viewcvs?rev=274543&root=gcc&view=rev
Log:
i386: Separate costs of pseudo registers from hard registers

processor_costs has costs of RTL expressions with pseudo registers and
and costs of hard register moves:

1. Costs of RTL expressions are used to generate the most efficient RTL
operations with pseudo registers.

2. Costs of hard register moves are used by register allocator to
decide how to allocate and move hard registers.

Since relative costs of pseudo register load and store versus pseudo
register moves in RTL expressions can be different from relative costs
of hard registers, we should separate costs of RTL expressions with
pseudo registers from costs of hard registers so that register allocator
and RTL expressions can be improved independently.

This patch moves costs of hard register moves to the new hard_register
field and duplicates costs of moves which are also used for costs of RTL
expressions.

	PR target/90878
	* config/i386/i386.c (inline_memory_move_cost): Use hard_register
	for costs of hard register moves.
	(ix86_register_move_cost): Likewise.
	* config/i386/i386.h (processor_costs): Move costs of hard
	register moves to hard_register.  Add int_load, int_store,
	xmm_move, ymm_move, zmm_move, sse_to_integer, integer_to_sse,
	sse_load, sse_store, sse_unaligned_load and sse_unaligned_store
	for costs of RTL expressions.
	* config/i386/x86-tune-costs.h: Move costs of hard register
	moves to hard_register.  Duplicate int_load, int_store,
	xmm_move, ymm_move, zmm_move, sse_to_integer, integer_to_sse,
	sse_load, sse_store for costs of RTL expressions.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/config/i386/i386.h
    trunk/gcc/config/i386/x86-tune-costs.h

Comment 8 hjl@gcc.gnu.org 2019-09-18 19:51:16 UTC

Author: hjl
Date: Wed Sep 18 19:50:45 2019
New Revision: 275906

URL: https://gcc.gnu.org/viewcvs?rev=275906&root=gcc&view=rev
Log:
i386: Restore Skylake SImode hard register store cost

On Skylake, we should move integer register to SSE register without
going through memory.  This patch restores Skylake SImode hard register
store cost to 6.

gcc/

	PR target/90878
	* config/i386/x86-tune-costs.h (skylake_cost): Restore SImode
	hard register store cost to 6.

gcc/testsuite/

	PR target/90878
	* gcc.target/i386/pr90878.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr90878.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/x86-tune-costs.h
    trunk/gcc/testsuite/ChangeLog

Comment 9 H.J. Lu 2019-09-18 20:03:29 UTC

Fixed for GCC 10.