[hjl@gnu-cfl-1 sse-move]$ cat x.i union ieee754_float { float f; struct { unsigned int mantissa:23; unsigned int exponent:8; unsigned int negative:1; } ieee; }; double foo (float f) { union ieee754_float u; u.f = f; u.ieee.negative = 0; return u.f; } [hjl@gnu-cfl-1 sse-move]$ /usr/gcc-9.1.1-x32-tuning/bin/gcc -S -O2 -march=skylake x.i [hjl@gnu-cfl-1 sse-move]$ cat x.s .file "x.i" .text .p2align 4 .globl foo .type foo, @function foo: .LFB0: .cfi_startproc vmovd %xmm0, %eax andl $2147483647, %eax movl %eax, -4(%rsp) vcvtss2sd -4(%rsp), %xmm0, %xmm0 ret .cfi_endproc .LFE0: .size foo, .-foo .ident "GCC: (GNU) 9.1.1 20190517" .section .note.GNU-stack,"",@progbits [hjl@gnu-cfl-1 sse-move]$ Skylake cost has 2, 2, /* SSE->integer and integer->SSE moves */ which are the same cost as the normal register move. But integer->SSE isn't used since ix86_register_move_cost has /* Moves between SSE/MMX and integer unit are expensive. */ if (MMX_CLASS_P (class1) != MMX_CLASS_P (class2) || SSE_CLASS_P (class1) != SSE_CLASS_P (class2)) /* ??? By keeping returned value relatively high, we limit the number of moves between integer and MMX/SSE registers for all targets. Additionally, high value prevents problem with x86_modes_tieable_p(), where integer modes in MMX/SSE registers are not tieable because of missing QImode and HImode moves to, from or between MMX/SSE registers. */ return MAX (8, MMX_CLASS_P (class1) || MMX_CLASS_P (class2) ? ix86_cost->mmxsse_to_integer : ix86_cost->ssemmx_to_integer); integer->SSE is always 8 which much more expensive than integer register store whose cost is 3.
If we make integer register store more expensive, this testcase will regress: [hjl@gnu-cfl-1 unroll]$ cat x.i void foo (long p2, long *diag, long d, long i) { long k; k = p2 < 3 ? p2 + p2 : p2 + 3; while (i < k) diag[i++] = d; } [hjl@gnu-cfl-1 unroll]$ make /export/build/gnu/tools-build/gcc-wip-debug/build-x86_64-linux/gcc/xgcc -B/export/build/gnu/tools-build/gcc-wip-debug/build-x86_64-linux/gcc/ -O3 -march=skylake -S x.i [hjl@gnu-cfl-1 unroll]$ cat x.s .file "x.i" .text .p2align 4 .globl foo .type foo, @function foo: .LFB0: .cfi_startproc leaq (%rdi,%rdi), %rax leaq 3(%rdi), %r8 cmpq $2, %rdi cmovle %rax, %r8 cmpq %rcx, %r8 jle .L10 movq %rcx, %rax notq %rax movq %r8, %r9 addq %r8, %rax subq %rcx, %r9 cmpq $3, %rax jbe .L5 movq %r9, %rdi shrq $2, %rdi vmovq %rdx, %xmm1 leaq (%rsi,%rcx,8), %rax salq $5, %rdi vpbroadcastq %xmm1, %ymm0 addq %rax, %rdi .p2align 4,,10 .p2align 3 .L6: vmovdqu %ymm0, (%rax) addq $32, %rax cmpq %rdi, %rax jne .L6 movq %r9, %rax andq $-4, %rax addq %rax, %rcx cmpq %rax, %r9 je .L9 vzeroupper .L5: leaq 1(%rcx), %rax movq %rdx, (%rsi,%rcx,8) cmpq %r8, %rax jge .L10 leaq 2(%rcx), %rdi movq %rdx, (%rsi,%rax,8) cmpq %rdi, %r8 jle .L10 addq $3, %rcx movq %rdx, (%rsi,%rdi,8) cmpq %rcx, %r8 jle .L10 movq %rdx, (%rsi,%rcx,8) ret .p2align 4,,10 .p2align 3 .L9: vzeroupper .L10: ret .cfi_endproc .LFE0: .size foo, .-foo .ident "GCC: (GNU) 10.0.0 20190613 (experimental)" .section .note.GNU-stack,"",@progbits [hjl@gnu-cfl-1 unroll]$ since higher integer register store cost will reduce loop unroll count.
Why is this PR marked as a regression?
(In reply to Uroš Bizjak from comment #2) > Why is this PR marked as a regression? GCC 7 doesn't use memory: [hjl@gnu-cfl-1 sse-move]$ cat x.s .file "x.i" .text .p2align 4,,15 .globl foo .type foo, @function foo: .LFB0: .cfi_startproc vmovd %xmm0, %eax vxorpd %xmm0, %xmm0, %xmm0 andl $2147483647, %eax vmovd %eax, %xmm1 vcvtss2sd %xmm1, %xmm0, %xmm0 ret .cfi_endproc .LFE0: .size foo, .-foo .ident "GCC: (GNU) 7.3.1 20181128" .section .note.GNU-stack,"",@progbits [hjl@gnu-cfl-1 sse-move]$
The constant '8' is at least fishy - it should be based on some target specific cost value, like 8 * cost of a add or so.
(In reply to H.J. Lu from comment #1) > If we make integer register store more expensive, this testcase will > regress: > > [hjl@gnu-cfl-1 unroll]$ cat x.i > void > foo (long p2, long *diag, long d, long i) > { > long k; > k = p2 < 3 ? p2 + p2 : p2 + 3; > while (i < k) > diag[i++] = d; > } > [hjl@gnu-cfl-1 unroll]$ make > /export/build/gnu/tools-build/gcc-wip-debug/build-x86_64-linux/gcc/xgcc > -B/export/build/gnu/tools-build/gcc-wip-debug/build-x86_64-linux/gcc/ -O3 > -march=skylake -S x.i ... > since higher integer register store cost will reduce loop unroll count. ix86_builtin_vectorization_cost has case scalar_load: /* load/store costs are relative to register move which is 2. Recompute it to COSTS_N_INSNS so everything have same base. */ return COSTS_N_INSNS (fp ? ix86_cost->sse_load[0] : ix86_cost->int_load [2]) / 2; case scalar_store: return COSTS_N_INSNS (fp ? ix86_cost->sse_store[0] : ix86_cost->int_store [2]) / 2; sse_load[0], int_load [2], sse_store[0], int_store [2] impact 1. Loop runtime profitability threshold. 2. Selection of memory vs register operands. Should we add a separate set of costs to processor_costs for vectorizer?
We have -- Target Hook: bool TARGET_RTX_COSTS (rtx X, machine_mode MODE, int OUTER_CODE, int OPNO, int *TOTAL, bool SPEED) This target hook describes the relative costs of RTL expressions. The cost may depend on the precise form of the expression, which is available for examination in X, and the fact that X appears as operand OPNO of an expression with rtx code OUTER_CODE. That is, the ho-- Target Hook: bool TARGET_RTX_COSTS (rtx X, machine_mode MODE, int OUTER_CODE, int OPNO, int *TOTAL, bool SPEED) This target hook describes the relative costs of RTL expressions. The cost may depend on the precise form of the expression, which is available for examination in X, and the fact that X appears as operand OPNO of an expression with rtx code OUTER_CODE. That is, the hook can assume that there is some rtx Y such that 'GET_CODE (Y) == OUTER_CODE' and such that either (a) 'XEXP (Y, OPNO) == X' or (b) 'XVEC (Y, OPNO)' contains X. MODE is X's machine mode, or for cases like 'const_int' that do not have a mode, the mode in which X is used. In implementing this hook, you can use the construct 'COSTS_N_INSNS (N)' to specify a cost equal to N fast instructions. On entry to the hook, '*TOTAL' contains a default estimate for the cost of the expression. The hook should modify this value as necessary. Traditionally, the default costs are 'COSTS_N_INSNS (5)' for multiplications, 'COSTS_N_INSNS (7)' for division and modulus operations, and 'COSTS_N_INSNS (1)' for all other operations. When optimizing for code size, i.e. when 'speed' is false, this target hook should be used to estimate the relative size cost of an expression, again relative to 'COSTS_N_INSNS'. The hook returns true when all subexpressions of X have been processed, and false when 'rtx_cost' should recurse. ok can assume that there is some rtx Y such that 'GET_CODE (Y) == OUTER_CODE' and such that either (a) 'XEXP (Y, OPNO) == X' or (b) 'XVEC (Y, OPNO)' contains X. MODE -- Target Hook: bool TARGET_RTX_COSTS (rtx X, machine_mode MODE, int OUTER_CODE, int OPNO, int *TOTAL, bool SPEED) This target hook describes the relative costs of RTL expressions. The cost may depend on the precise form of the expression, which is available for examination in X, and the fact that X appears as operand OPNO of an expression with rtx code OUTER_CODE. That is, the hook can assume that there is some rtx Y such that 'GET_CODE (Y) == OUTER_CODE' and such that either (a) 'XEXP (Y, OPNO) == X' or (b) 'XVEC (Y, OPNO)' contains X. MODE is X's machine mode, or for cases like 'const_int' that do not have a mode, the mode in which X is used. In implementing this hook, you can use the construct 'COSTS_N_INSNS (N)' to specify a cost equal to N fast instructions. On entry to the hook, '*TOTAL' contains a default estimate for the cost of the expression. The hook should modify this value as necessary. Traditionally, the default costs are 'COSTS_N_INSNS (5)' for multiplications, 'COSTS_N_INSNS (7)' for division and modulus operations, and 'COSTS_N_INSNS (1)' for all other operations. When optimizing for code size, i.e. when 'speed' is false, this target hook should be used to estimate the relative size cost of an expression, again relative to 'COSTS_N_INSNS'. The hook returns true when all subexpressions of X have been processed, and false when 'rtx_cost' should recurse. is X's machine mode, or for cases like 'const_int' that do not have a mode, the mode in which X is used. In implementing this hook, you can use the construct 'COSTS_N_INSNS (N)' to specify a cost equal to N fast instructions. On entry to the hook, '*TOTAL' contains a default estimate for the cost of the expression. The hook should modify this value as necessary. Traditionally, the default costs are 'COSTS_N_INSNS (5)' for multiplications, 'COSTS_N_INSNS (7)' for division and modulus operations, and 'COSTS_N_INSNS (1)' for all other operations. When optimizing for code size, i.e. when 'speed' is false, this target hook should be used to estimate the relative size cost of an expression, again relative to 'COSTS_N_INSNS'. The hook returns true when all subexpressions of X have been processed, and false when 'rtx_cost' should recurse. -- Target Hook: int TARGET_REGISTER_MOVE_COST (machine_mode MODE, reg_class_t FROM, reg_class_t TO) This target hook should return the cost of moving data of mode MODE from a register in class FROM to one in class TO. The classes are expressed using the enumeration values such as 'GENERAL_REGS'. A value of 2 is the default; other values are interpreted relative to that. It is not required that the cost always equal 2 when FROM is the same as TO; on some machines it is expensive to move between registers if they are not general registers. If reload sees an insn consisting of a single 'set' between two hard registers, and if 'TARGET_REGISTER_MOVE_COST' applied to their classes returns a value of 2, reload does not check to ensure that the constraints of the insn are met. Setting a cost of other than 2 will allow reload to verify that the constraints are met. You should do this if the 'movM' pattern's constraints do not allow such copying. The default version of this function returns 2. TARGET_REGISTER_MOVE_COST and TARGET_RTX_COSTS serve different purposes with different formulas. TARGET_REGISTER_MOVE_COST is used by register allocator. TARGET_RTX_COSTS describes the relative costs of RTL expressions. Using values intended for register allocator for TARGET_RTX_COSTS isn't appropriate.
Author: hjl Date: Thu Aug 15 18:15:33 2019 New Revision: 274543 URL: https://gcc.gnu.org/viewcvs?rev=274543&root=gcc&view=rev Log: i386: Separate costs of pseudo registers from hard registers processor_costs has costs of RTL expressions with pseudo registers and and costs of hard register moves: 1. Costs of RTL expressions are used to generate the most efficient RTL operations with pseudo registers. 2. Costs of hard register moves are used by register allocator to decide how to allocate and move hard registers. Since relative costs of pseudo register load and store versus pseudo register moves in RTL expressions can be different from relative costs of hard registers, we should separate costs of RTL expressions with pseudo registers from costs of hard registers so that register allocator and RTL expressions can be improved independently. This patch moves costs of hard register moves to the new hard_register field and duplicates costs of moves which are also used for costs of RTL expressions. PR target/90878 * config/i386/i386.c (inline_memory_move_cost): Use hard_register for costs of hard register moves. (ix86_register_move_cost): Likewise. * config/i386/i386.h (processor_costs): Move costs of hard register moves to hard_register. Add int_load, int_store, xmm_move, ymm_move, zmm_move, sse_to_integer, integer_to_sse, sse_load, sse_store, sse_unaligned_load and sse_unaligned_store for costs of RTL expressions. * config/i386/x86-tune-costs.h: Move costs of hard register moves to hard_register. Duplicate int_load, int_store, xmm_move, ymm_move, zmm_move, sse_to_integer, integer_to_sse, sse_load, sse_store for costs of RTL expressions. Modified: trunk/gcc/ChangeLog trunk/gcc/config/i386/i386.c trunk/gcc/config/i386/i386.h trunk/gcc/config/i386/x86-tune-costs.h
Author: hjl Date: Wed Sep 18 19:50:45 2019 New Revision: 275906 URL: https://gcc.gnu.org/viewcvs?rev=275906&root=gcc&view=rev Log: i386: Restore Skylake SImode hard register store cost On Skylake, we should move integer register to SSE register without going through memory. This patch restores Skylake SImode hard register store cost to 6. gcc/ PR target/90878 * config/i386/x86-tune-costs.h (skylake_cost): Restore SImode hard register store cost to 6. gcc/testsuite/ PR target/90878 * gcc.target/i386/pr90878.c: New test. Added: trunk/gcc/testsuite/gcc.target/i386/pr90878.c Modified: trunk/gcc/ChangeLog trunk/gcc/config/i386/x86-tune-costs.h trunk/gcc/testsuite/ChangeLog
Fixed for GCC 10.