[Bug target/92265] New: [x86] Dubious target costs for vec_construct

rsandifo at gcc dot gnu.org gcc-bugzilla@gcc.gnu.org
Tue Oct 29 12:16:00 GMT 2019


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92265

            Bug ID: 92265
           Summary: [x86] Dubious target costs for vec_construct
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rsandifo at gcc dot gnu.org
                CC: amonakov at gcc dot gnu.org, uros at gcc dot gnu.org
  Target Milestone: ---
            Target: x86_64-linux-gnu

The x86 costs for vec_construct look a little low, especially
for -m32.  E.g. gcc.target/i386/pr84101.c has:

---------------------------------------------------
typedef struct uint64_pair uint64_pair_t ;
struct uint64_pair
{
  unsigned long w0 ;
  unsigned long w1 ;
} ;

uint64_pair_t pair(int num)
{
  uint64_pair_t p ;

  p.w0 = num << 1 ;
  p.w1 = num >> 1 ;

  return p ;
}
---------------------------------------------------

where uint64_pair is effectively a uint32_pair for -m32,
since unsigned long is only 32 bits there.
If we consider applying SLP vectorisation to the store,
we have the difference between:

- 2 scalar_stores
- 1 vec_construct + 1 vector_store

The vec_construct cost for 64-bit and 128-bit vectors is:

          int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;

i.e. one SSE op per element.  With -mtune=intel this gives:

- 2 scalar_stores = 3 + 3 = 6 insns
- 1 vec_construct + 1 vector_store = 2 + 3 = 5 insns

But for integer elements, the vec_construct actually needs two
integer-to-vector moves followed by an SSE pack:

        movd    %eax, %xmm1
        movd    %ecx, %xmm0
        punpckldq       %xmm1, %xmm0
        movq    %xmm0, (%edx)

compared to:

        movl    %eax, 4(%edx)
        movl    %ecx, (%edx)

I don't know enough about the Intel uarchs to know if there's
a significant difference between these two in practice.

But as Alexander points out, things are much worse if the
elements are DImode rather than SImode, i.e. if we change
the above "unsigned long"s to "__UINT64_TYPE__"s.  We then
end up spilling the four registers to the stack, loading
them into a vector register, and then storing that vector
register out separately:

        movl    %edx, 8(%esp)
        ...
        movl    %edx, 12(%esp)
        movq    8(%esp), %xmm0
        movl    %eax, 8(%esp)
        ...
        movl    %edx, 12(%esp)
        movhps  8(%esp), %xmm0
        movups  %xmm0, (%ecx)

vs. 4 scalar stores directly to (%ecx).  Here we're operating
on DIs and V2DIs, but the costs are the same as for SI vs. V2SI:

- 2 scalar_stores = 3 + 3 = 6 insns
- 1 vec_construct + 1 vector_store = 2 + 3 = 5 insns

So as far as the vectoriser is concerned, the vector form
seems cheaper.
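
For reference, the DImode variant Alexander describes (the same
testcase with the "unsigned long" fields replaced by
"__UINT64_TYPE__", a macro GCC predefines) is:

```c
typedef struct uint64_pair uint64_pair_t ;
struct uint64_pair
{
  __UINT64_TYPE__ w0 ;
  __UINT64_TYPE__ w1 ;
} ;

uint64_pair_t pair(int num)
{
  uint64_pair_t p ;

  p.w0 = num << 1 ;
  p.w1 = num >> 1 ;

  return p ;
}
```

With -m32 this is the version that produces the spill-and-reload
sequence shown above.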
