On x86, r265398 caused: FAIL: gcc.target/i386/avx512dq-concatv2si-1.c scan-assembler-times vpinsrd[^\n\r]*\\$1[^\n\r]*%xmm16[^\n\r]*%xmm3 2 FAIL: gcc.target/i386/avx512dq-concatv2si-1.c scan-assembler vpunpckldq[^\n\r]*%xmm17[^\n\r]*%xmm16[^\n\r]*%xmm3 Before: .file "avx512dq-concatv2si-1.c" .text .p2align 4 .globl f1 .type f1, @function f1: .LFB0: .cfi_startproc movl %edi, -4(%rsp) vmovd -4(%rsp), %xmm16 movl %esi, -4(%rsp) vmovd -4(%rsp), %xmm17 vpunpckldq %xmm17, %xmm16, %xmm3 ret .cfi_endproc .LFE0: .size f1, .-f1 After: .file "avx512dq-concatv2si-1.c" .text .p2align 4 .globl f1 .type f1, @function f1: .LFB0: .cfi_startproc movl %edi, -4(%rsp) vmovd -4(%rsp), %xmm16 movl %esi, -4(%rsp) vmovd -4(%rsp), %xmm17 vmovd %xmm17, %eax vpinsrd $1, %eax, %xmm16, %xmm3 ret .cfi_endproc .LFE0: .size f1, .-f1
*** Bug 87717 has been marked as a duplicate of this bug. ***
Following testcase: --cut here-- typedef int V __attribute__((vector_size (8))); void foo (int x, int y) { register int a __asm ("xmm1"); register int b __asm ("xmm2"); register V c __asm ("xmm3"); a = x; b = y; asm volatile ("" : "+v" (a), "+v" (b)); c = (V) { a, b }; asm volatile ("" : "+v" (c)); } --cut here-- gets compiled with -O2 -mavx -mtune=intel: vmovd %edi, %xmm1 vmovd %esi, %xmm2 vmovd %xmm2, %eax vpinsrd $1, %eax, %xmm1, %xmm3 ret The relevant pattern is defined as: (define_insn "*vec_concatv2si_sse4_1" [(set (match_operand:V2SI 0 "register_operand" "=Yr,*x, x, v,Yr,*x, v, v, *y,*y") (vec_concat:V2SI (match_operand:SI 1 "nonimmediate_operand" " 0, 0, x,Yv, 0, 0,Yv,rm, 0,rm") (match_operand:SI 2 "nonimm_or_0_operand" " rm,rm,rm,rm,Yr,*x,Yv, C,*ym, C")))] "TARGET_SSE4_1 && !(MEM_P (operands[1]) && MEM_P (operands[2]))" "@ pinsrd\t{$1, %2, %0|%0, %2, 1} pinsrd\t{$1, %2, %0|%0, %2, 1} vpinsrd\t{$1, %2, %1, %0|%0, %1, %2, 1} vpinsrd\t{$1, %2, %1, %0|%0, %1, %2, 1} punpckldq\t{%2, %0|%0, %2} punpckldq\t{%2, %0|%0, %2} vpunpckldq\t{%2, %1, %0|%0, %1, %2} %vmovd\t{%1, %0|%0, %1} punpckldq\t{%2, %0|%0, %2} movd\t{%1, %0|%0, %1}" but for some reason RA chooses alternative 2 (x<-x,rm) instead of alternative 6 (v<-Yv,Yv), although alternative 2 needs an extra reload from %xmm2 to %eax.
Vlad, could you please have a look?
(In reply to Uroš Bizjak from comment #2) > Following testcase: > > --cut here-- > typedef int V __attribute__((vector_size (8))); > > void foo (int x, int y) > { > register int a __asm ("xmm1"); > register int b __asm ("xmm2"); > register V c __asm ("xmm3"); > a = x; > b = y; > asm volatile ("" : "+v" (a), "+v" (b)); > c = (V) { a, b }; > asm volatile ("" : "+v" (c)); > } > --cut here-- > > gets compiled with -O2 -mavx -mtune=intel: > > vmovd %edi, %xmm1 > vmovd %esi, %xmm2 > vmovd %xmm2, %eax > vpinsrd $1, %eax, %xmm1, %xmm3 > ret > > The relevant pattern is defined as: > > (define_insn "*vec_concatv2si_sse4_1" > [(set (match_operand:V2SI 0 "register_operand" > "=Yr,*x, x, v,Yr,*x, v, v, *y,*y") > (vec_concat:V2SI > (match_operand:SI 1 "nonimmediate_operand" > " 0, 0, x,Yv, 0, 0,Yv,rm, 0,rm") > (match_operand:SI 2 "nonimm_or_0_operand" > " rm,rm,rm,rm,Yr,*x,Yv, C,*ym, C")))] > "TARGET_SSE4_1 && !(MEM_P (operands[1]) && MEM_P (operands[2]))" > "@ > pinsrd\t{$1, %2, %0|%0, %2, 1} > pinsrd\t{$1, %2, %0|%0, %2, 1} > vpinsrd\t{$1, %2, %1, %0|%0, %1, %2, 1} > vpinsrd\t{$1, %2, %1, %0|%0, %1, %2, 1} > punpckldq\t{%2, %0|%0, %2} > punpckldq\t{%2, %0|%0, %2} > vpunpckldq\t{%2, %1, %0|%0, %1, %2} > %vmovd\t{%1, %0|%0, %1} > punpckldq\t{%2, %0|%0, %2} > movd\t{%1, %0|%0, %1}" > > but for some reason RA chooses alternative 2 (x<-x,rm) instead of > alternative 6 (v<-Yv,Yv), although alternative 2 needs an extra reload from > %xmm2 to %eax. I dig this a bit and looks like we missed something in combine pass, hence fail to get a pattern that can match alternative 6. The combine pass dump of old gcc shows: ------------------- REG_UNUSED flags:CC insn_cost 4 for 10: r82:SI=xmm16:SI REG_DEAD xmm16:SI insn_cost 4 for 11: r83:SI=xmm17:SI REG_DEAD xmm17:SI insn_cost 4 for 12: r87:V2SI=vec_concat(r82:SI,r83:SI) REG_DEAD r83:SI REG_DEAD r82:SI ------------------- then we got: ------------------- Trying 10 -> 12: 10: r82:SI=xmm16:SI REG_DEAD xmm16:SI 12: r87:V2SI=vec_concat(r82:SI,r83:SI) REG_DEAD r83:SI REG_DEAD r82:SI Successfully matched this instruction: (set (reg:V2SI 87) (vec_concat:V2SI (reg/v:SI 52 xmm16 [ a ]) (reg:SI 83 [ b.1_2 ]))) allowing combination of insns 10 and 12 original costs 4 + 4 = 8 replacement cost 4 deferring deletion of insn with uid = 10. modifying insn i3 12: r87:V2SI=vec_concat(xmm16:SI,r83:SI) REG_DEAD xmm16:SI REG_DEAD r83:SI deferring rescan insn with uid = 12. Trying 11 -> 12: 11: r83:SI=xmm17:SI REG_DEAD xmm17:SI 12: r87:V2SI=vec_concat(xmm16:SI,r83:SI) REG_DEAD xmm16:SI REG_DEAD r83:SI Successfully matched this instruction: (set (reg:V2SI 87) (vec_concat:V2SI (reg/v:SI 52 xmm16 [ a ]) (reg/v:SI 53 xmm17 [ b ]))) allowing combination of insns 11 and 12 original costs 4 + 4 = 8 replacement cost 4 deferring deletion of insn with uid = 11. modifying insn i3 12: r87:V2SI=vec_concat(xmm16:SI,xmm17:SI) REG_DEAD xmm17:SI REG_DEAD xmm16:SI deferring rescan insn with uid = 12. ------------------- There are two successful combine attempts. We end up with pattern that can match alternative 6. However dump from current GCC trunk shows: ------------------- insn_cost 4 for 19: r90:SI=xmm16:SI REG_DEAD xmm16:SI insn_cost 4 for 10: r82:SI=r90:SI REG_DEAD r90:SI insn_cost 4 for 20: r91:SI=xmm17:SI REG_DEAD xmm17:SI insn_cost 4 for 11: r83:SI=r91:SI REG_DEAD r91:SI insn_cost 4 for 12: r87:V2SI=vec_concat(r82:SI,r83:SI) REG_DEAD r83:SI REG_DEAD r82:SI insn_cost 4 for 13: xmm3:V2SI=r87:V2SI REG_DEAD r87:V2SI ------------------- Trying 11 -> 12: 11: r83:SI=r91:SI REG_DEAD r91:SI 12: r87:V2SI=vec_concat(r90:SI,r83:SI) REG_DEAD r90:SI REG_DEAD r83:SI Successfully matched this instruction: (set (reg:V2SI 87) (vec_concat:V2SI (reg:SI 90) (reg:SI 91))) allowing combination of insns 11 and 12 original costs 4 + 4 = 8 replacement cost 4 deferring deletion of insn with uid = 11. modifying insn i3 12: r87:V2SI=vec_concat(r90:SI,r91:SI) REG_DEAD r91:SI REG_DEAD r90:SI deferring rescan insn with uid = 12. ------------------- We end up with "12: r87:V2SI=vec_concat(r90:SI,r91:SI)", later in LRA pass, the operand r90 is replaced with XMM register, the r91 is kept as general register. Then no chance match against preferred alternative 6.
In general moving from propagation of hard regs is good thing for RA. Although there are exception as this PR. The problem starts with IRA. It decides that r91 should be a general regs based on cost calculation. The cost calculation code in IRA is very sensitive. A change there usually results in new PRs with unexpected code generation. I'll investigate more the PR and how to fix the PR with minimal effect to other targets and tests. But right now I can guess that the cost of move of greg with sse-regs and move of sse-regs is the same for intel but if we choose sse-regs they are coalesced and the move is removed.
The culprit for the bad code generation is the following insn description (define_insn "*movsi_internal" [(set (match_operand:SI 0 "nonimmediate_operand" "=r,m ,*y,*y,?*y,?m,?r,?*y,*v,*v,*v,m ,?r,?*v,*k,*k ,*rm") (match_operand:SI 1 "general_operand" "g ,re,C ,*y,m ,*y,*y,r ,C ,*v,m ,*v,*v,r ,*r,*km,*k"))] Alternatives with sse regs are not considered at all (hint *) for cost calculation even if one operand is sse hard reg. And therefore sse class for another operand with pseudo is too costly. Removing the hints is not a solution. I believe we will have even more problems with GCC testsuite. So I am trying to solve it with specific treatment of moves for cost calculations. The patch I am working on solves the PR but currently creates a few GCC testsuite failures (unexpected but correct code generation). So I am continuing to work on the PR.
Author: vmakarov Date: Thu Nov 22 17:25:57 2018 New Revision: 266385 URL: https://gcc.gnu.org/viewcvs?rev=266385&root=gcc&view=rev Log: 2018-11-22 Vladimir Makarov <vmakarov@redhat.com> PR rtl-optimization/87718 * ira-costs.c: Remove trailing white-spaces. (record_operand_costs): Add a special treatment for moves involving a hard register. 2018-11-22 Vladimir Makarov <vmakarov@redhat.com> PR rtl-optimization/87718 * gcc.target/i386/pr82361-1.c: Check only the first operand of moves. Modified: trunk/gcc/ChangeLog trunk/gcc/ira-costs.c trunk/gcc/testsuite/ChangeLog trunk/gcc/testsuite/gcc.target/i386/pr82361-1.c
Fixed, thanks.
This change has quite a negative effect on the cost model in AArch64 an ICE due to the new costs and register classes it picks. See PR88282 Thanks, Tamar