Bug 70155 - Use SSE for TImode load/store
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 6.0
Importance: P3 normal
Target Milestone: 7.0
Assignee: Not yet assigned to anyone
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-03-09 16:54 UTC by H.J. Lu
Modified: 2016-04-29 17:28 UTC

See Also:
Host:
Target: x86-64
Build:
Known to work:
Known to fail:
Last reconfirmed: 2016-03-15 00:00:00

Description H.J. Lu 2016-03-09 16:54:17 UTC
[hjl@gnu-mic-2 int128]$ cat x.i
extern __int128 a, b;

struct foo
{
  __int128 i;
}__attribute__ ((packed));

extern struct foo x, y;

void
foo (void)
{
  a = b;
  x = y;
}
[hjl@gnu-mic-2 int128]$ gcc -S -O2 x.i
[hjl@gnu-mic-2 int128]$ cat x.s
	.file	"x.i"
	.section	.text.unlikely,"ax",@progbits
.LCOLDB0:
	.text
.LHOTB0:
	.p2align 4,,15
	.globl	foo
	.type	foo, @function
foo:
.LFB0:
	.cfi_startproc
	movq	b(%rip), %rax
	movq	b+8(%rip), %rdx
	movq	%rax, a(%rip)
	movq	%rdx, a+8(%rip)
	movq	y(%rip), %rax
	movq	y+8(%rip), %rdx
	movq	%rax, x(%rip)
	movq	%rdx, x+8(%rip)
	ret
	.cfi_endproc
.LFE0:
	.size	foo, .-foo
	.section	.text.unlikely
.LCOLDE0:
	.text
.LHOTE0:
	.ident	"GCC: (GNU) 5.3.1 20160212 (Red Hat 5.3.1-4)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-mic-2 int128]$ 

We could generate

foo:
.LFB0:
	.cfi_startproc
	movdqa	b(%rip), %xmm0
	movaps	%xmm0, a(%rip)
	movdqu	y(%rip), %xmm0
	movups	%xmm0, x(%rip)
	ret
	.cfi_endproc
.LFE0:
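
For illustration, the proposed code corresponds roughly to the following SSE2 intrinsics sketch. The function names copy128_aligned and copy128_unaligned are hypothetical and appear nowhere in GCC; the aligned variant models the a = b copy (movdqa/movaps) and the unaligned variant models the packed x = y copy (movdqu/movups). A memcpy fallback is included only so that the sketch also builds where SSE2 is unavailable.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics; SSE2 is baseline on x86-64 */
#endif

/* Copy 16 bytes between 16-byte-aligned locations with one aligned
   vector load and one aligned vector store.  */
void copy128_aligned(void *dst, const void *src)
{
  /* Aligned SSE moves fault on misaligned addresses, so check first.  */
  assert(((uintptr_t) dst & 15) == 0 && ((uintptr_t) src & 15) == 0);
#if defined(__SSE2__)
  _mm_store_si128((__m128i *) dst, _mm_load_si128((const __m128i *) src));
#else
  memcpy(dst, src, 16);  /* portable fallback, not a vector move */
#endif
}

/* Copy 16 bytes with no alignment requirement, as needed for the
   packed struct, using unaligned vector load and store.  */
void copy128_unaligned(void *dst, const void *src)
{
#if defined(__SSE2__)
  _mm_storeu_si128((__m128i *) dst, _mm_loadu_si128((const __m128i *) src));
#else
  memcpy(dst, src, 16);  /* portable fallback, not a vector move */
#endif
}
```

Each function replaces the four 64-bit integer moves per copy in the current output with two 128-bit moves.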
Comment 1 Uroš Bizjak 2016-03-09 20:01:06 UTC
This can be tweaked in processor_cost table.

However, is an SSE move really faster? The cost tables don't say so.
Comment 2 H.J. Lu 2016-03-09 21:00:35 UTC
(In reply to Uroš Bizjak from comment #1)
> This can be tweaked in processor_cost table.

RA will use integer registers for TImode.

> However, is an SSE move really faster? The cost tables don't say so.

Yes, that is what vector instructions are used for.
Comment 3 H.J. Lu 2016-03-15 21:25:43 UTC
We can extend the STV pass to 64-bit mode to convert loads and stores of
128-bit integers to 128-bit SSE loads and stores. This is implemented
on the hjl/pr70155/master branch.

As a follow-on optimization, we can add TImode bitwise patterns to
use SSE for TImode bitwise operations.
Comment 4 Uroš Bizjak 2016-03-16 09:15:33 UTC
(In reply to H.J. Lu from comment #3)
> We can extend STV pass to 64-bit mode to convert load and store of
> 128-bit integers to 128-bit SSE load and store, which is implemented
> on hjl/pr70155/master branch.
> 
> As a follow-on optimization, we can add TImode bitwise patterns to
> use SSE for TImode bitwise operation.

IMO, this is a good idea, and the infrastructure is already in place.
Comment 5 hjl@gcc.gnu.org 2016-04-27 17:33:12 UTC
Author: hjl
Date: Wed Apr 27 17:32:40 2016
New Revision: 235518

URL: https://gcc.gnu.org/viewcvs?rev=235518&root=gcc&view=rev
Log:
Extend STV pass to 64-bit mode

128-bit SSE load and store instructions can be used for load and store
of 128-bit integers if they are the only operations on 128-bit integers.
To convert load and store of 128-bit integers to 128-bit SSE load and
store, the original STV pass, which is designed to convert 64-bit integer
operations to SSE2 operations in 32-bit mode, is extended to 64-bit mode
in the following ways:

1. Class scalar_chain is turned into a base class.  The 32-bit specific
member functions are moved to the new derived class, dimode_scalar_chain.
The new derived class, timode_scalar_chain, is added to convert load and
store of 128-bit integers to 128-bit SSE load and store.
2. Add the 64-bit version of scalar_to_vector_candidate_p and
remove_non_convertible_regs.  Only TImode load and store are allowed
for conversion.  If any instruction on the chain of dependent
instructions isn't a TImode load or store, the chain of instructions
won't be converted.
3. In 64-bit mode, we only convert from TImode to V1TImode, which have
the same size.  The only difference is that V1TImode allows only vector
registers, so that 128-bit SSE load and store instructions will be used
for load and store of 128-bit integers.
4. Put the 64-bit STV pass before the CSE pass so that instructions
changed or generated by the STV pass can be CSEed.
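
The all-or-nothing rule in item 2 can be sketched with a toy model. The enum insn_kind and the function chain_convertible_p below are hypothetical names made up for this illustration; they are not GCC's RTL codes or the functions added by this patch, and the real pass works on RTL instructions and def-use chains rather than a flat array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical instruction kinds, for illustration only.  */
enum insn_kind { INSN_TI_LOAD, INSN_TI_STORE, INSN_TI_OTHER };

/* Return true only if every instruction on the chain is a TImode load
   or store.  A single other instruction rejects the whole chain.  */
bool chain_convertible_p(const enum insn_kind *chain, size_t n)
{
  assert(chain != NULL || n == 0);
  for (size_t i = 0; i < n; i++)
    if (chain[i] != INSN_TI_LOAD && chain[i] != INSN_TI_STORE)
      return false;
  return true;
}
```

Under this model, a chain of {load, store} is converted, while {load, other, store} is left in integer registers.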

convert_scalars_to_vector calls free_dominance_info in 64-bit mode to
work around ICE in fwprop pass:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70807

when building libgcc on Linux/x86-64.

gcc/

	PR target/70155
	* config/i386/i386.c (scalar_to_vector_candidate_p): Renamed
	to ...
	(dimode_scalar_to_vector_candidate_p): This.
	(timode_scalar_to_vector_candidate_p): New function.
	(scalar_to_vector_candidate_p): Likewise.
	(timode_check_non_convertible_regs): Likewise.
	(timode_remove_non_convertible_regs): Likewise.
	(remove_non_convertible_regs): Likewise.
	(remove_non_convertible_regs): Renamed to ...
	(dimode_remove_non_convertible_regs): This.
	(scalar_chain::~scalar_chain): Make it virtual.
	(scalar_chain::compute_convert_gain): Make it pure virtual.
	(scalar_chain::mark_dual_mode_def): Likewise.
	(scalar_chain::convert_insn): Likewise.
	(scalar_chain::convert_registers): Likewise.
	(scalar_chain::add_to_queue): Make it protected.
	(scalar_chain::emit_conversion_insns): Likewise.
	(scalar_chain::replace_with_subreg): Likewise.
	(scalar_chain::replace_with_subreg_in_insn): Likewise.
	(scalar_chain::convert_op): Likewise.
	(scalar_chain::convert_reg): Likewise.
	(scalar_chain::make_vector_copies): Likewise.
	(scalar_chain::convert_registers): New pure virtual function.
	(class dimode_scalar_chain): New class.
	(class timode_scalar_chain): Likewise.
	(scalar_chain::mark_dual_mode_def): Renamed to ...
	(dimode_scalar_chain::mark_dual_mode_def): This.
	(timode_scalar_chain::mark_dual_mode_def): New function.
	(timode_scalar_chain::convert_insn): Likewise.
	(dimode_scalar_chain::convert_registers): Likewise.
	(scalar_chain::compute_convert_gain): Renamed to ...
	(dimode_scalar_chain::compute_convert_gain): This.
	(scalar_chain::replace_with_subreg): Renamed to ...
	(dimode_scalar_chain::replace_with_subreg): This.
	(scalar_chain::replace_with_subreg_in_insn): Renamed to ...
	(dimode_scalar_chain::replace_with_subreg_in_insn): This.
	(scalar_chain::make_vector_copies): Renamed to ...
	(dimode_scalar_chain::make_vector_copies): This.
	(scalar_chain::convert_reg): Renamed to ...
	(dimode_scalar_chain::convert_reg): This.
	(scalar_chain::convert_op): Renamed to ...
	(dimode_scalar_chain::convert_op): This.
	(scalar_chain::convert_insn): Renamed to ...
	(dimode_scalar_chain::convert_insn): This.
	(scalar_chain::convert): Call convert_registers.
	(convert_scalars_to_vector): Change to scalar_chain pointer to
	use timode_scalar_chain in 64-bit mode and dimode_scalar_chain
	in 32-bit mode.  Delete scalar_chain pointer.  Call
	free_dominance_info in 64-bit mode.
	(pass_stv::gate): Remove TARGET_64BIT check.
	(ix86_option_override): Put the 64-bit STV pass before the CSE
	pass.

gcc/testsuite/

	PR target/70155
	* gcc.target/i386/pr55247-2.c: Updated to check movti_internal
	and movv1ti_internal patterns.
	* gcc.target/i386/pr70155-1.c: New test.
	* gcc.target/i386/pr70155-2.c: Likewise.
	* gcc.target/i386/pr70155-3.c: Likewise.
	* gcc.target/i386/pr70155-4.c: Likewise.
	* gcc.target/i386/pr70155-5.c: Likewise.
	* gcc.target/i386/pr70155-6.c: Likewise.
	* gcc.target/i386/pr70155-7.c: Likewise.
	* gcc.target/i386/pr70155-8.c: Likewise.
	* gcc.target/i386/pr70155-9.c: Likewise.
	* gcc.target/i386/pr70155-10.c: Likewise.
	* gcc.target/i386/pr70155-11.c: Likewise.
	* gcc.target/i386/pr70155-12.c: Likewise.
	* gcc.target/i386/pr70155-13.c: Likewise.
	* gcc.target/i386/pr70155-14.c: Likewise.
	* gcc.target/i386/pr70155-15.c: Likewise.
	* gcc.target/i386/pr70155-16.c: Likewise.
	* gcc.target/i386/pr70155-17.c: Likewise.
	* gcc.target/i386/pr70155-18.c: Likewise.
	* gcc.target/i386/pr70155-19.c: Likewise.
	* gcc.target/i386/pr70155-20.c: Likewise.
	* gcc.target/i386/pr70155-21.c: Likewise.
	* gcc.target/i386/pr70155-22.c: Likewise.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr70155-1.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-10.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-11.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-12.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-13.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-14.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-15.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-16.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-17.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-18.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-19.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-2.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-20.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-21.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-22.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-3.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-4.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-5.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-6.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-7.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-8.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-9.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.c
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/i386/pr55247-2.c
Comment 6 H.J. Lu 2016-04-27 17:33:53 UTC
Fixed for GCC 7.
Comment 7 hjl@gcc.gnu.org 2016-04-29 17:28:31 UTC
Author: hjl
Date: Fri Apr 29 17:27:59 2016
New Revision: 235647

URL: https://gcc.gnu.org/viewcvs?rev=235647&root=gcc&view=rev
Log:
Update scan-assembler-not in PR target/70155 tests

Since PIC leads to the *movdi_internal pattern, check for nonexistence
of the *movdi_internal pattern in PR target/70155 tests only if PIC is
off.
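
As an illustrative sketch only (not the contents of any of the tests listed below), a DejaGnu test with such a conditional check might look like the following. The option set, including -dp to annotate the assembly with pattern names, and the exact directives are assumptions; the committed tests may differ.

```c
/* { dg-do compile { target int128 } } */
/* { dg-options "-O2 -msse2 -dp" } */

extern __int128 a, b;

void
foo (void)
{
  a = b;
}

/* Expect the vector move pattern for the TImode copy.  */
/* { dg-final { scan-assembler "movv1ti_internal" } } */
/* Require absence of the DImode move pattern only when PIC is off.  */
/* { dg-final { scan-assembler-not "\\*movdi_internal" { target nonpic } } } */
```

The nonpic selector restricts the negative check to non-PIC compilation, where no *movdi_internal move is expected.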

	* gcc.target/i386/pr70155-1.c: Check for nonexistence of the
	*movdi_internal pattern only if PIC is off.
	* gcc.target/i386/pr70155-2.c: Likewise.
	* gcc.target/i386/pr70155-3.c: Likewise.
	* gcc.target/i386/pr70155-4.c: Likewise.
	* gcc.target/i386/pr70155-5.c: Likewise.
	* gcc.target/i386/pr70155-6.c: Likewise.
	* gcc.target/i386/pr70155-7.c: Likewise.
	* gcc.target/i386/pr70155-8.c: Likewise.
	* gcc.target/i386/pr70155-15.c: Likewise.
	* gcc.target/i386/pr70155-17.c: Likewise.
	* gcc.target/i386/pr70155-22.c: Likewise.

Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.target/i386/pr70155-1.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-15.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-17.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-2.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-22.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-3.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-4.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-5.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-6.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-7.c
    trunk/gcc/testsuite/gcc.target/i386/pr70155-8.c