[Bug target/92265] New: [x86] Dubious target costs for vec_construct
rsandifo at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Tue Oct 29 12:16:00 GMT 2019
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92265
Bug ID: 92265
Summary: [x86] Dubious target costs for vec_construct
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: rsandifo at gcc dot gnu.org
CC: amonakov at gcc dot gnu.org, uros at gcc dot gnu.org
Target Milestone: ---
Target: x86_64-linux-gnu
The x86 costs for vec_construct look a little low, especially
for -m32. E.g. gcc.target/i386/pr84101.c has:
---------------------------------------------------
typedef struct uint64_pair uint64_pair_t ;
struct uint64_pair
{
  unsigned long w0 ;
  unsigned long w1 ;
} ;
uint64_pair_t pair(int num)
{
  uint64_pair_t p ;
  p.w0 = num << 1 ;
  p.w1 = num >> 1 ;
  return p ;
}
---------------------------------------------------
where uint64_pair is actually a uint32_pair for -m32.
If we consider applying SLP vectorisation to the store,
we have the difference between:
- 2 scalar_stores
- 1 vec_construct + 1 vector_store
The vec_construct cost for 64-bit and 128-bit vectors is:
    int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
i.e. one SSE op per element. With -mtune=intel this gives:
- 2 scalar_stores = 3 + 3 insns
- 1 vec_construct + 1 vector_store = 2 + 3 insns
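The arithmetic above can be sketched as follows. The constants are just
the -mtune=intel figures quoted in this report (with the per-element SSE op
cost taken as 1, inferred from the 2-element vec_construct costing 2); they
are not read from the GCC cost tables, and the function and enum names are
made up for illustration.

```c
#include <assert.h>

/* Hypothetical per-operation costs, taken from the figures quoted in this
   report for -mtune=intel.  SSE_OP_COST = 1 is an inference from the
   2-element vec_construct costing 2, not a value copied from GCC.  */
enum { SCALAR_STORE_COST = 3, VECTOR_STORE_COST = 3, SSE_OP_COST = 1 };

/* Mirrors the quoted formula: one SSE op per vector element.  */
static int
vec_construct_cost (int nunits)
{
  assert (nunits > 0);
  return nunits * SSE_OP_COST;
}

/* Cost of storing NUNITS elements with separate scalar stores.  */
static int
scalar_total (int nunits)
{
  assert (nunits > 0);
  return nunits * SCALAR_STORE_COST;
}

/* Cost of building one vector from NUNITS elements and storing it.  */
static int
vector_total (int nunits)
{
  return vec_construct_cost (nunits) + VECTOR_STORE_COST;
}
```

With these numbers, scalar_total (2) is 6 and vector_total (2) is 5, so the
vector form wins on paper regardless of what the construct really expands to.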
But for integer elements, the vec_construct actually needs two
integer-to-vector moves followed by an SSE unpack to combine them:
        movd    %eax, %xmm1
        movd    %ecx, %xmm0
        punpckldq       %xmm1, %xmm0
        movq    %xmm0, (%edx)
compared to:
        movl    %eax, 4(%edx)
        movl    %ecx, (%edx)
I don't know enough about the Intel uarchs to know if there's
a significant difference between these two in practice.
But as Alexander points out, things are much worse if the
elements are DImode rather than SImode, i.e. if we change
the above "unsigned long"s to "__UINT64_TYPE__"s. We then
end up spilling the four registers to the stack, loading
them into a vector register, and then storing that vector
register out separately:
        movl    %edx, 8(%esp)
        ...
        movl    %edx, 12(%esp)
        movq    8(%esp), %xmm0
        movl    %eax, 8(%esp)
        ...
        movl    %edx, 12(%esp)
        movhps  8(%esp), %xmm0
        movups  %xmm0, (%ecx)
vs. 4 scalar stores directly to (%ecx). Here we're operating
on DIs and V2DIs, but the costs are the same as for SI vs. V2SI:
- 2 scalar_stores = 3 + 3 insns
- 1 vec_construct + 1 vector_store = 2 + 3 insns
So as far as the vectoriser is concerned, the vector form
seems cheaper.
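For reference, the DImode variant described above would look something
like this. It is only a sketch of the substitution mentioned in the report
(the two "unsigned long" members replaced by "__UINT64_TYPE__"), not a file
from the testsuite; __UINT64_TYPE__ is the macro GCC predefines for its
64-bit unsigned type.

```c
/* Variant of the test case with 64-bit members, so that each element
   store is DImode even when compiling with -m32.  */
typedef struct uint64_pair uint64_pair_t;
struct uint64_pair
{
  __UINT64_TYPE__ w0;
  __UINT64_TYPE__ w1;
};

uint64_pair_t
pair (int num)
{
  uint64_pair_t p;
  p.w0 = num << 1;   /* first 64-bit element */
  p.w1 = num >> 1;   /* second 64-bit element */
  return p;
}
```

Compiling this with -m32 and SLP vectorisation enabled should be enough to
compare the V2DI construct against the four scalar stores.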