87718 – [9 Regression] FAIL: gcc.target/i386/avx512dq-concatv2si-1.c

Bug 87718 - [9 Regression] FAIL: gcc.target/i386/avx512dq-concatv2si-1.c

Summary: [9 Regression] FAIL: gcc.target/i386/avx512dq-concatv2si-1.c

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	rtl-optimization (show other bugs)
Version:	9.0

Importance:	P3 normal
Target Milestone:	9.0
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization, ra

Duplicates (1):	87717 (view as bug list)
Depends on:
Blocks:

Reported:	2018-10-24 00:32 UTC by H.J. Lu
Modified:	2021-09-27 07:05 UTC (History)
CC List:	4 users (show)

See Also:	81012
Host:
Target:	x86-64
Build:
Known to work:
Known to fail:
Last reconfirmed:	2018-10-24 00:00:00

Attachments
Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description H.J. Lu 2018-10-24 00:32:05 UTC

On x86, r265398 caused:

FAIL: gcc.target/i386/avx512dq-concatv2si-1.c scan-assembler-times vpinsrd[^\n\r]*\\$1[^\n\r]*%xmm16[^\n\r]*%xmm3 2
FAIL: gcc.target/i386/avx512dq-concatv2si-1.c scan-assembler vpunpckldq[^\n\r]*%xmm17[^\n\r]*%xmm16[^\n\r]*%xmm3

Before:

	.file	"avx512dq-concatv2si-1.c"
	.text
	.p2align 4
	.globl	f1
	.type	f1, @function
f1:
.LFB0:
	.cfi_startproc
	movl	%edi, -4(%rsp)
	vmovd	-4(%rsp), %xmm16
	movl	%esi, -4(%rsp)
	vmovd	-4(%rsp), %xmm17
	vpunpckldq	%xmm17, %xmm16, %xmm3
	ret
	.cfi_endproc
.LFE0:
	.size	f1, .-f1

After:

	.file	"avx512dq-concatv2si-1.c"
	.text
	.p2align 4
	.globl	f1
	.type	f1, @function
f1:
.LFB0:
	.cfi_startproc
	movl	%edi, -4(%rsp)
	vmovd	-4(%rsp), %xmm16
	movl	%esi, -4(%rsp)
	vmovd	-4(%rsp), %xmm17
	vmovd	%xmm17, %eax
	vpinsrd	$1, %eax, %xmm16, %xmm3
	ret
	.cfi_endproc
.LFE0:
	.size	f1, .-f1

Comment 1 Uroš Bizjak 2018-11-12 08:38:45 UTC

*** Bug 87717 has been marked as a duplicate of this bug. ***

Comment 2 Uroš Bizjak 2018-11-12 08:51:18 UTC

Following testcase:

--cut here--
typedef int V __attribute__((vector_size (8)));

void foo (int x, int y)
{
  register int a __asm ("xmm1");
  register int b __asm ("xmm2");
  register V c __asm ("xmm3");
  a = x;
  b = y;
  asm volatile ("" : "+v" (a), "+v" (b));
  c = (V) { a, b };
  asm volatile ("" : "+v" (c));
}
--cut here--

gets compiled with -O2 -mavx -mtune=intel:

        vmovd   %edi, %xmm1
        vmovd   %esi, %xmm2
        vmovd   %xmm2, %eax
        vpinsrd $1, %eax, %xmm1, %xmm3
        ret

The relevant pattern is defined as:

(define_insn "*vec_concatv2si_sse4_1"
  [(set (match_operand:V2SI 0 "register_operand"
	  "=Yr,*x, x, v,Yr,*x, v, v, *y,*y")
	(vec_concat:V2SI
	  (match_operand:SI 1 "nonimmediate_operand"
	  "  0, 0, x,Yv, 0, 0,Yv,rm,  0,rm")
	  (match_operand:SI 2 "nonimm_or_0_operand"
	  " rm,rm,rm,rm,Yr,*x,Yv, C,*ym, C")))]
  "TARGET_SSE4_1 && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
  "@
   pinsrd\t{$1, %2, %0|%0, %2, 1}
   pinsrd\t{$1, %2, %0|%0, %2, 1}
   vpinsrd\t{$1, %2, %1, %0|%0, %1, %2, 1}
   vpinsrd\t{$1, %2, %1, %0|%0, %1, %2, 1}
   punpckldq\t{%2, %0|%0, %2}
   punpckldq\t{%2, %0|%0, %2}
   vpunpckldq\t{%2, %1, %0|%0, %1, %2}
   %vmovd\t{%1, %0|%0, %1}
   punpckldq\t{%2, %0|%0, %2}
   movd\t{%1, %0|%0, %1}"

but for some reason RA chooses alternative 2 (x<-x,rm) instead of alternative 6 (v<-Yv,Yv), although alternative 2 needs an extra reload from %xmm2 to %eax.

Comment 3 Jakub Jelinek 2018-11-13 16:05:14 UTC

Vlad, could you please have a look?

Comment 4 Terry Guo 2018-11-14 02:50:56 UTC

(In reply to Uroš Bizjak from comment #2)
> Following testcase:
> 
> --cut here--
> typedef int V __attribute__((vector_size (8)));
> 
> void foo (int x, int y)
> {
>   register int a __asm ("xmm1");
>   register int b __asm ("xmm2");
>   register V c __asm ("xmm3");
>   a = x;
>   b = y;
>   asm volatile ("" : "+v" (a), "+v" (b));
>   c = (V) { a, b };
>   asm volatile ("" : "+v" (c));
> }
> --cut here--
> 
> gets compiled with -O2 -mavx -mtune=intel:
> 
>         vmovd   %edi, %xmm1
>         vmovd   %esi, %xmm2
>         vmovd   %xmm2, %eax
>         vpinsrd $1, %eax, %xmm1, %xmm3
>         ret
> 
> The relevant pattern is defined as:
> 
> (define_insn "*vec_concatv2si_sse4_1"
>   [(set (match_operand:V2SI 0 "register_operand"
> 	  "=Yr,*x, x, v,Yr,*x, v, v, *y,*y")
> 	(vec_concat:V2SI
> 	  (match_operand:SI 1 "nonimmediate_operand"
> 	  "  0, 0, x,Yv, 0, 0,Yv,rm,  0,rm")
> 	  (match_operand:SI 2 "nonimm_or_0_operand"
> 	  " rm,rm,rm,rm,Yr,*x,Yv, C,*ym, C")))]
>   "TARGET_SSE4_1 && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
>   "@
>    pinsrd\t{$1, %2, %0|%0, %2, 1}
>    pinsrd\t{$1, %2, %0|%0, %2, 1}
>    vpinsrd\t{$1, %2, %1, %0|%0, %1, %2, 1}
>    vpinsrd\t{$1, %2, %1, %0|%0, %1, %2, 1}
>    punpckldq\t{%2, %0|%0, %2}
>    punpckldq\t{%2, %0|%0, %2}
>    vpunpckldq\t{%2, %1, %0|%0, %1, %2}
>    %vmovd\t{%1, %0|%0, %1}
>    punpckldq\t{%2, %0|%0, %2}
>    movd\t{%1, %0|%0, %1}"
> 
> but for some reason RA chooses alternative 2 (x<-x,rm) instead of
> alternative 6 (v<-Yv,Yv), although alternative 2 needs an extra reload from
> %xmm2 to %eax.

I dig this a bit and looks like we missed something in combine pass, hence fail to get a pattern that can match alternative 6. The combine pass dump of old gcc shows:
-------------------
      REG_UNUSED flags:CC
insn_cost 4 for    10: r82:SI=xmm16:SI
      REG_DEAD xmm16:SI
insn_cost 4 for    11: r83:SI=xmm17:SI
      REG_DEAD xmm17:SI
insn_cost 4 for    12: r87:V2SI=vec_concat(r82:SI,r83:SI)
      REG_DEAD r83:SI
      REG_DEAD r82:SI
-------------------

then we got:
-------------------
Trying 10 -> 12:
   10: r82:SI=xmm16:SI
      REG_DEAD xmm16:SI
   12: r87:V2SI=vec_concat(r82:SI,r83:SI)
      REG_DEAD r83:SI
      REG_DEAD r82:SI
Successfully matched this instruction:
(set (reg:V2SI 87)
    (vec_concat:V2SI (reg/v:SI 52 xmm16 [ a ])
        (reg:SI 83 [ b.1_2 ])))
allowing combination of insns 10 and 12
original costs 4 + 4 = 8
replacement cost 4
deferring deletion of insn with uid = 10.
modifying insn i3    12: r87:V2SI=vec_concat(xmm16:SI,r83:SI)
      REG_DEAD xmm16:SI
      REG_DEAD r83:SI
deferring rescan insn with uid = 12.

Trying 11 -> 12:
   11: r83:SI=xmm17:SI
      REG_DEAD xmm17:SI
   12: r87:V2SI=vec_concat(xmm16:SI,r83:SI)
      REG_DEAD xmm16:SI
      REG_DEAD r83:SI
Successfully matched this instruction:
(set (reg:V2SI 87)
    (vec_concat:V2SI (reg/v:SI 52 xmm16 [ a ])
        (reg/v:SI 53 xmm17 [ b ])))
allowing combination of insns 11 and 12
original costs 4 + 4 = 8
replacement cost 4
deferring deletion of insn with uid = 11.
modifying insn i3    12: r87:V2SI=vec_concat(xmm16:SI,xmm17:SI)
      REG_DEAD xmm17:SI
      REG_DEAD xmm16:SI
deferring rescan insn with uid = 12.
-------------------

There are two successful combine attempts. We end up with pattern that can match alternative 6.

However dump from current GCC trunk shows:
-------------------
insn_cost 4 for    19: r90:SI=xmm16:SI
      REG_DEAD xmm16:SI
insn_cost 4 for    10: r82:SI=r90:SI
      REG_DEAD r90:SI
insn_cost 4 for    20: r91:SI=xmm17:SI
      REG_DEAD xmm17:SI
insn_cost 4 for    11: r83:SI=r91:SI
      REG_DEAD r91:SI
insn_cost 4 for    12: r87:V2SI=vec_concat(r82:SI,r83:SI)
      REG_DEAD r83:SI
      REG_DEAD r82:SI
insn_cost 4 for    13: xmm3:V2SI=r87:V2SI
      REG_DEAD r87:V2SI
-------------------
Trying 11 -> 12:
   11: r83:SI=r91:SI
      REG_DEAD r91:SI
   12: r87:V2SI=vec_concat(r90:SI,r83:SI)
      REG_DEAD r90:SI
      REG_DEAD r83:SI
Successfully matched this instruction:
(set (reg:V2SI 87)
    (vec_concat:V2SI (reg:SI 90)
        (reg:SI 91)))
allowing combination of insns 11 and 12
original costs 4 + 4 = 8
replacement cost 4
deferring deletion of insn with uid = 11.
modifying insn i3    12: r87:V2SI=vec_concat(r90:SI,r91:SI)
      REG_DEAD r91:SI
      REG_DEAD r90:SI
deferring rescan insn with uid = 12.
-------------------

We end up with "12: r87:V2SI=vec_concat(r90:SI,r91:SI)", later in LRA pass, the operand r90 is replaced with XMM register, the r91 is kept as general register. Then no chance match against preferred alternative 6.

Comment 5 Vladimir Makarov 2018-11-14 21:42:39 UTC

  In general moving from propagation of hard regs is good thing for RA.  Although there are exception as this PR.

  The problem starts with IRA.  It decides that r91 should be a general regs based  on cost calculation.  The cost calculation code in IRA is very sensitive.  A change there usually results in new PRs with unexpected code generation.

  I'll investigate more the PR and how to fix the PR with minimal effect to other targets and tests.  But right now I can guess that the cost of move of greg with sse-regs and move of sse-regs is the same for intel but if we choose sse-regs they are coalesced and the move is removed.

Comment 6 Vladimir Makarov 2018-11-19 19:12:50 UTC

The culprit for the bad code generation is the following insn description

(define_insn "*movsi_internal"
  [(set (match_operand:SI 0 "nonimmediate_operand"
    "=r,m ,*y,*y,?*y,?m,?r,?*y,*v,*v,*v,m ,?r,?*v,*k,*k ,*rm")
        (match_operand:SI 1 "general_operand"
    "g ,re,C ,*y,m  ,*y,*y,r  ,C ,*v,m ,*v,*v,r  ,*r,*km,*k"))]

Alternatives with sse regs are not considered at all (hint *) for cost calculation even if one operand is sse hard reg.  And therefore sse class for another operand with pseudo is too costly.

Removing the hints is not a solution.  I believe we will have even more problems with GCC testsuite.  So I am trying to solve it with specific treatment of moves for cost calculations.  The patch I am working on solves the PR but currently creates a few GCC testsuite failures (unexpected but correct code generation).  So I am continuing to work on the PR.

Comment 7 Vladimir Makarov 2018-11-22 17:26:29 UTC

Author: vmakarov
Date: Thu Nov 22 17:25:57 2018
New Revision: 266385

URL: https://gcc.gnu.org/viewcvs?rev=266385&root=gcc&view=rev
Log:
2018-11-22  Vladimir Makarov  <vmakarov@redhat.com>

	PR rtl-optimization/87718
	* ira-costs.c: Remove trailing white-spaces.
	(record_operand_costs): Add a special treatment for moves
	involving a hard register.

2018-11-22  Vladimir Makarov  <vmakarov@redhat.com>

	PR rtl-optimization/87718
	* gcc.target/i386/pr82361-1.c: Check only the first operand of
	moves.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/ira-costs.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/i386/pr82361-1.c

Comment 8 Jakub Jelinek 2018-11-27 15:52:46 UTC

Fixed, thanks.

Comment 9 Tamar Christina 2018-11-30 16:45:19 UTC

This change has quite a negative effect on the cost model in AArch64 an ICE due to the new costs and register classes it picks.

See PR88282

Thanks,
Tamar