Bug 96654

Summary: Failure to optimize vectorized conversion to `int` with AVX
Product: gcc Reporter: Gabriel Ravier <gabravier>
Component: tree-optimization Assignee: Not yet assigned to anyone <unassigned>
Status: NEW ---    
Severity: normal CC: crazylht, hjl.tools, rguenth, rsandifo
Priority: P3 Keywords: missed-optimization
Version: 11.0   
Target Milestone: ---   
Host: Target: x86_64-* i?86-*-*
Build: Known to work:
Known to fail: Last reconfirmed: 2020-08-17 00:00:00
Bug Depends on: 36844    
Bug Blocks: 53947    

Description Gabriel Ravier 2020-08-17 11:57:33 UTC
void f(double *src, int *dst)
{
    for (int i = 0; i < 4; i++)
        dst[i] = (int)src[i];
}

With -O3 -mavx, LLVM outputs this :

f(double*, int*):
  vcvttpd2dq xmm0, ymmword ptr [rdi]
  vmovupd xmmword ptr [rsi], xmm0

GCC outputs this :

f(double*, int*):
  push rbp
  vmovupd xmm1, XMMWORD PTR [rdi]
  vinsertf128 ymm0, ymm1, XMMWORD PTR [rdi+16], 0x1
  mov rbp, rsp
  vcvttpd2dq xmm0, ymm0
  vmovdqu XMMWORD PTR [rsi], xmm0
  pop rbp
Comment 1 Uroš Bizjak 2020-08-17 19:04:22 UTC
The relevant pattern is present in sse.md:

(define_insn "fix_truncv4dfv4si2<mask_name>"
  [(set (match_operand:V4SI 0 "register_operand" "=v")
	(fix:V4SI (match_operand:V4DF 1 "nonimmediate_operand" "vm")))]
  "vcvttpd2dq{y}\t{%1, %0<mask_operand2>|%0<mask_operand2>, %1}"

but for some reason it is not exercised by the target-independent part of the compiler.

Confirmed as a tree optimization problem.
Comment 2 Marc Glisse 2020-08-22 16:32:47 UTC
gcc doesn't seem very fond of using 2 different vector bitsizes at the same time, so VEC_PACK_FIX_TRUNC_EXPR takes 2 vectors of 2 doubles and gives one vector of 4 ints. At the RTL level, we have a vec_concat:V4DF of 2 V2DF adjacent in memory, but nothing knows to turn that into a single load. (The conversion of 4 doubles to int itself is fine.)
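To make the two shapes concrete, here is a rough sketch using the GCC/Clang generic vector extensions. It only models the data layout described above (two 2 x double inputs packed into one 4 x int result, versus a single 4 x double input); it is not what the vectorizer emits internally, and it makes no claim about the instructions a given compiler version will pick. The type and function names (v2df, v4df, v4si, convert_two_halves, convert_one_wide, self_check) are made up for illustration, and __builtin_convertvector is assumed to be available (GCC 9 or later, or Clang).

```c
#include <assert.h>
#include <string.h>

/* Generic vector types (GCC/Clang extension); the names are illustrative. */
typedef double v2df __attribute__((vector_size(16)));
typedef double v4df __attribute__((vector_size(32)));
typedef int    v4si __attribute__((vector_size(16)));

/* Shape of a pack-and-truncate: two 2 x double inputs, one 4 x int result. */
void convert_two_halves(const double *src, int *dst)
{
    v2df lo, hi;
    memcpy(&lo, src, sizeof lo);       /* first two doubles */
    memcpy(&hi, src + 2, sizeof hi);   /* next two doubles, adjacent in memory */
    v4si out = { (int)lo[0], (int)lo[1], (int)hi[0], (int)hi[1] };
    memcpy(dst, &out, sizeof out);
}

/* Shape of the single-load form: one 4 x double input, one 4 x int result. */
void convert_one_wide(const double *src, int *dst)
{
    v4df in;
    memcpy(&in, src, sizeof in);       /* all four doubles at once */
    v4si out = __builtin_convertvector(in, v4si);
    memcpy(dst, &out, sizeof out);
}

/* Both forms truncate toward zero and must agree element by element. */
void self_check(void)
{
    const double src[4] = { 1.9, -2.7, 3.5, -4.2 };
    int a[4], b[4];
    convert_two_halves(src, a);
    convert_one_wide(src, b);
    assert(memcmp(a, b, sizeof a) == 0);
}
```

Both functions compute the same values; the difference is only whether the four doubles are presented as one 256-bit vector or as two 128-bit halves, which is the distinction the missed single load comes down to.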
Comment 3 Richard Biener 2020-08-25 10:40:28 UTC
The pattern is exercised directly by BB vectorization only; loop vectorization
still uses a fixed vector size.  Still, the assembly shows basically the same
code when doing BB vectorization only:

        vmovupd (%rdi), %xmm1
        vinsertf128     $0x1, 16(%rdi), %ymm1, %ymm0
        vcvttpd2dqy     %ymm0, %xmm0
        vmovdqu %xmm0, (%rsi)

This is probably because of some tuning (splitting unaligned loads, not using
a memory operand for vcvttpd2dqy).

With -O3 -fno-tree-loop-vectorize -march=core-avx2 I get

        vcvttpd2dqy     (%rdi), %xmm0
        vmovdqu %xmm0, (%rsi)