void f(double *src, int *dst)
{
  for (int i = 0; i < 4; i++)
    dst[i] = (int)src[i];
}

With -O3 -mavx, LLVM outputs this:

f(double*, int*):
        vcvttpd2dq      xmm0, ymmword ptr [rdi]
        vmovupd xmmword ptr [rsi], xmm0
        ret

GCC outputs this:

f(double*, int*):
        push    rbp
        vmovupd xmm1, XMMWORD PTR [rdi]
        vinsertf128     ymm0, ymm1, XMMWORD PTR [rdi+16], 0x1
        mov     rbp, rsp
        vcvttpd2dq      xmm0, ymm0
        vmovdqu XMMWORD PTR [rsi], xmm0
        vzeroupper
        pop     rbp
        ret
The relevant pattern is present in sse.md:

(define_insn "fix_truncv4dfv4si2<mask_name>"
  [(set (match_operand:V4SI 0 "register_operand" "=v")
        (fix:V4SI (match_operand:V4DF 1 "nonimmediate_operand" "vm")))]
  "TARGET_AVX || (TARGET_AVX512VL && TARGET_AVX512F)"
  "vcvttpd2dq{y}\t{%1, %0<mask_operand2>|%0<mask_operand2>, %1}"

but for some reason it is not exercised by the target-independent part of the compiler. Confirmed as a tree optimization problem.
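For reference, the V4DF -> V4SI conversion that this pattern describes can also be written by hand with the GNU vector extensions. This is only a sketch: the name f_vec and the two typedefs are made up for illustration, __builtin_convertvector needs GCC 9 or later (or Clang), and whether the compiler selects the single-instruction form for it still depends on the target flags and tuning in effect.

```c
#include <string.h>

/* Illustrative typedefs (not taken from the report):
   4 x double in 32 bytes, 4 x int in 16 bytes. */
typedef double v4df __attribute__((vector_size(32)));
typedef int    v4si __attribute__((vector_size(16)));

/* Hypothetical hand-vectorized variant of f(). */
void f_vec(const double *src, int *dst)
{
  v4df in;
  v4si out;

  memcpy(&in, src, sizeof in);              /* unaligned 32-byte load */
  out = __builtin_convertvector(in, v4si);  /* element-wise (int) cast, truncating toward zero */
  memcpy(dst, &out, sizeof out);            /* unaligned 16-byte store */
}
```

The memcpy calls are used instead of pointer casts so that the sketch makes no alignment assumption about src or dst.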
GCC doesn't seem very fond of using two different vector sizes at the same time, so VEC_PACK_FIX_TRUNC_EXPR takes two vectors of 2 doubles and gives one vector of 4 ints. At the RTL level, we have a vec_concat:V4DF of two V2DF values that are adjacent in memory, but nothing knows to turn that into a single load. (The conversion itself, from 4 doubles to 4 ints, is fine.)
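To make the shape of that operation concrete, here is a scalar C model of what the pack-and-truncate step computes. It is not GCC internals: the struct types and the function name are hypothetical, and placing the first operand in the low two lanes is an assumption made for the illustration.

```c
/* Hypothetical model types standing in for V2DF and V4SI. */
typedef struct { double e[2]; } v2df_model;
typedef struct { int e[4]; } v4si_model;

/* Model of a pack + fix-trunc step: two 2-double inputs, one 4-int result.
   Assumption: lo fills lanes 0-1 and hi fills lanes 2-3. */
v4si_model pack_fix_trunc_model(v2df_model lo, v2df_model hi)
{
  v4si_model r;

  r.e[0] = (int)lo.e[0];   /* truncate toward zero, as a C cast does */
  r.e[1] = (int)lo.e[1];
  r.e[2] = (int)hi.e[0];
  r.e[3] = (int)hi.e[1];
  return r;
}
```

For f(), lo and hi would correspond to src[0..1] and src[2..3], which is why the two halves end up adjacent in memory and could in principle be fetched with one 32-byte load.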
The pattern is exercised directly by BB vectorization only; loop vectorization still uses a fixed vector size. Still, the assembly shows basically the same code when doing BB vectorization only:

f:
.LFB0:
        .cfi_startproc
        vmovupd (%rdi), %xmm1
        vinsertf128     $0x1, 16(%rdi), %ymm1, %ymm0
        vcvttpd2dqy     %ymm0, %xmm0
        vmovdqu %xmm0, (%rsi)
        vzeroupper
        ret

This is probably because of some tuning (split unaligned loads, not using a memory operand for vcvttpd2dqy). With -O3 -fno-tree-loop-vectorize -march=core-avx2 I get

f:
.LFB0:
        .cfi_startproc
        vcvttpd2dqy     (%rdi), %xmm0
        vmovdqu %xmm0, (%rsi)
        ret