The following code:

#include <cstdint>

struct B {
    std::int64_t x;
    std::int32_t y;
    std::int32_t z;
};

B f(std::int64_t x, std::int32_t y, std::int32_t z) {
    return {x, y, z};
}

Compiled with `gcc -O3 -std=gnu++17 -march=skylake` generates the following assembly:

f(long, int, int):
        mov     QWORD PTR [rsp-16], 0
        mov     QWORD PTR [rsp-24], rdi
        vmovdqa xmm1, XMMWORD PTR [rsp-24]
        vpinsrd xmm0, xmm1, esi, 2
        vpinsrd xmm2, xmm0, edx, 3
        vmovdqa XMMWORD PTR [rsp-24], xmm2
        mov     rax, QWORD PTR [rsp-24]
        mov     rdx, QWORD PTR [rsp-16]
        ret

Which looks a bit excessive. Whereas when compiled with `clang-9.0 -O3 -std=gnu++17 -march=skylake` it produces the expected:

f(long, int, int):
        mov     rax, rdi
        shl     rdx, 32
        mov     ecx, esi
        or      rdx, rcx
        ret

https://gcc.godbolt.org/z/udsiyF
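For readers less familiar with the calling convention, here is a rough C++ model of what the short clang sequence computes. It assumes the x86-64 System V ABI, where the 16-byte struct comes back in rax:rdx, with `x` in rax and `y`/`z` packed into the low/high halves of rdx. `RegPair` and `pack_b` are made-up names for illustration only; they are not anything from GCC or clang.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the two integer return registers.
struct RegPair {
    std::uint64_t rax;
    std::uint64_t rdx;
};

// Models: mov rax, rdi ; shl rdx, 32 ; mov ecx, esi ; or rdx, rcx
constexpr RegPair pack_b(std::int64_t x, std::int32_t y, std::int32_t z) {
    // mov ecx, esi: a 32-bit move zero-extends y into a 64-bit register.
    const std::uint64_t lo = static_cast<std::uint32_t>(y);
    // shl rdx, 32: z moves into the upper half, stale upper bits fall off.
    const std::uint64_t hi =
        static_cast<std::uint64_t>(static_cast<std::uint32_t>(z)) << 32;
    // mov rax, rdi ; or rdx, rcx
    return {static_cast<std::uint64_t>(x), hi | lo};
}

// Small self-check of the packing (little-endian: y is the low half).
inline void pack_b_example() {
    const RegPair r = pack_b(1, 2, 3);
    assert(r.rax == 1u);
    assert(r.rdx == 0x0000000300000002ULL);
}
```

The point is that the whole struct can be built with integer register operations only, so no stack slot or vector register is needed.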
This is a vector cost model issue.
We're "correctly" costing an extra spill and the two loads:

t.C:10:24: note: vect_model_store_cost: inside_cost = 16, prologue_cost = 40 .
0x59cc2a0 y_4(D) 1 times vec_construct costs 8 in prologue
0x59cc2a0 y_4(D) 1 times vector_store costs 16 in body
0x59cc2a0 y_4(D) 1 times vector_store costs 16 in epilogue
0x59cc2a0 y_4(D) 2 times scalar_load costs 16 in epilogue
0x59f1130 y_4(D) 1 times scalar_store costs 12 in body
0x59f1130 z_6(D) 1 times scalar_store costs 12 in body
t.C:10:24: note: Cost model analysis:
  Vector inside of basic block cost: 16
  Vector prologue cost: 8
  Vector epilogue cost: 32
  Scalar cost of basic block: 24
t.C:10:24: missed: not vectorized: vectorization is not profitable.

and expand from

  <bb 2> [local count: 1073741824]:
  D.2953.x = x_2(D);
  D.2953.y = y_4(D);
  D.2953.z = z_6(D);
  return D.2953;

but somehow RTL expansion ends up doing

;; Generating RTL for gimple basic block 2

;; D.2953.x = x_2(D);

(insn 8 7 0 (set (subreg:DI (reg:TI 82 [ D.2953 ]) 0)
        (reg/v:DI 84 [ x ])) "t.C":10:24 -1
     (nil))

;; D.2953.y = y_4(D);

(insn 9 8 10 (set (reg:V4SI 87)
        (vec_merge:V4SI (vec_duplicate:V4SI (reg/v:SI 85 [ y ]))
            (subreg:V4SI (reg:TI 82 [ D.2953 ]) 0)
            (const_int 4 [0x4]))) "t.C":10:24 -1
     (nil))

(insn 10 9 0 (set (reg:TI 82 [ D.2953 ])
        (subreg:TI (reg:V4SI 87) 0)) "t.C":10:24 -1
     (nil))

;; D.2953.z = z_6(D);

(insn 11 10 12 (set (reg:V4SI 88)
        (vec_merge:V4SI (vec_duplicate:V4SI (reg/v:SI 86 [ z ]))
            (subreg:V4SI (reg:TI 82 [ D.2953 ]) 0)
            (const_int 8 [0x8]))) "t.C":10:24 -1
     (nil))

(insn 12 11 0 (set (reg:TI 82 [ D.2953 ])
        (subreg:TI (reg:V4SI 88) 0)) "t.C":10:24 -1
     (nil))

;; return D.2953;

!?
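To make the RTL above easier to read: in `vec_merge`, a set bit i in the mask takes lane i from the first operand and a clear bit takes it from the second. So `const_int 4` replaces lane 2 (the `y` field at byte offset 8) and `const_int 8` replaces lane 3 (`z` at offset 12), which is what the two `vpinsrd` instructions do. A small C++ model of that, as a sketch only: `V4SI`, `vec_merge`, `vec_duplicate` and `build_b` here are illustrative helpers, not GCC code, and the upper lanes start at 0 purely as a placeholder.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using V4SI = std::array<std::int32_t, 4>;  // four 32-bit lanes

// Model of RTL vec_duplicate: broadcast a scalar to every lane.
constexpr V4SI vec_duplicate(std::int32_t v) { return {v, v, v, v}; }

// Model of RTL vec_merge: bit i set -> lane i from vec1, else from vec2.
constexpr V4SI vec_merge(const V4SI& vec1, const V4SI& vec2, unsigned mask) {
    V4SI out{};
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = ((mask >> i) & 1u) ? vec1[i] : vec2[i];
    return out;
}

// Replays insns 8..12: store x in the low 64 bits, then insert y and z.
inline V4SI build_b(std::int64_t x, std::int32_t y, std::int32_t z) {
    const auto ux = static_cast<std::uint64_t>(x);
    // insn 8: low DImode half of the TImode register gets x (lanes 0 and 1).
    V4SI reg82{static_cast<std::int32_t>(static_cast<std::uint32_t>(ux)),
               static_cast<std::int32_t>(static_cast<std::uint32_t>(ux >> 32)),
               0, 0};  // placeholder for the not-yet-written upper half
    reg82 = vec_merge(vec_duplicate(y), reg82, 4u);  // insns 9-10: lane 2
    reg82 = vec_merge(vec_duplicate(z), reg82, 8u);  // insns 11-12: lane 3
    return reg82;
}

// Small self-check: lanes are {low(x), high(x), y, z}.
inline void build_b_example() {
    assert((build_b(0x0000000200000001LL, 7, 9) == V4SI{1, 2, 7, 9}));
}
```

Each field store becomes a full-vector read-modify-write of the TImode register, which is where the extra stack traffic in the GCC output comes from.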
It seems to get triggered by uint32_t, see also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96562

Any plans to fix this bug?
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:7d5de349d21479d7ec61dd0153e6f0958ad7384f

commit r11-2733-g7d5de349d21479d7ec61dd0153e6f0958ad7384f
Author: liuhongt <hongtao.liu@intel.com>
Date:   Wed Aug 12 10:48:17 2020 +0800

    Don't use pinsr/pextr for struct initialization/extraction.

    gcc/
            PR target/96562
            PR target/93897
            * config/i386/i386-expand.c (ix86_expand_pinsr): Don't use
            pinsr for TImode.
            (ix86_expand_pextr): Don't use pextr for TImode.

    gcc/testsuite/
            * gcc.target/i386/pr96562-1.c: New test.
The releases/gcc-10 branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:a49452d964e3bbd1d9aa0d809355f41347b3ec05

commit r10-8636-ga49452d964e3bbd1d9aa0d809355f41347b3ec05
Author: liuhongt <hongtao.liu@intel.com>
Date:   Wed Aug 12 10:48:17 2020 +0800

    Don't use pinsr/pextr for struct initialization/extraction.

    gcc/
            PR target/96562
            PR target/93897
            * config/i386/i386-expand.c (ix86_expand_pinsr): Don't use
            pinsr for TImode.
            (ix86_expand_pextr): Don't use pextr for TImode.

    gcc/testsuite/
            * gcc.target/i386/pr96562-1.c: New test.
Fixed in GCC 11 and backported to GCC 10.
Another testcase:

[hjl@gnu-cfl-1 tmp]$ cat x.c
extern int foo();
extern int bar();

typedef int (*func_t)(int);

struct test
{
  func_t func1;
  func_t func2;
};

void
mainfunc (struct test *iface)
{
  iface->func1 = foo;
  iface->func2 = bar;
}
[hjl@gnu-cfl-1 tmp]$ gcc -S -O2 x.c
[hjl@gnu-cfl-1 tmp]$ cat x.s
	.file	"x.c"
	.text
	.p2align 4
	.globl	mainfunc
	.type	mainfunc, @function
mainfunc:
.LFB0:
	.cfi_startproc
	movq	$foo, (%rdi)
	movq	$bar, 8(%rdi)
	ret
	.cfi_endproc
.LFE0:
	.size	mainfunc, .-mainfunc
	.ident	"GCC: (GNU) 11.1.1 20210531 (Red Hat 11.1.1-3)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-cfl-1 tmp]$ gcc -S -O3 x.c -march=skylake
[hjl@gnu-cfl-1 tmp]$ cat x.s
	.file	"x.c"
	.text
	.p2align 4
	.globl	mainfunc
	.type	mainfunc, @function
mainfunc:
.LFB0:
	.cfi_startproc
	movl	$foo, %edx
	movl	$bar, %eax
	vmovq	%rdx, %xmm0
	vpinsrq	$1, %rax, %xmm0, %xmm0
	vmovdqu	%xmm0, (%rdi)
	ret
	.cfi_endproc
.LFE0:
	.size	mainfunc, .-mainfunc
	.ident	"GCC: (GNU) 11.1.1 20210531 (Red Hat 11.1.1-3)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-cfl-1 tmp]$
> mainfunc:
> .LFB0:
> 	.cfi_startproc
> 	movl	$foo, %edx
> 	movl	$bar, %eax
> 	vmovq	%rdx, %xmm0
> 	vpinsrq	$1, %rax, %xmm0, %xmm0
> 	vmovdqu	%xmm0, (%rdi)
> 	ret
> 	.cfi_endproc
> .LFE0:
> 	.size	mainfunc, .-mainfunc
> 	.ident	"GCC: (GNU) 11.1.1 20210531 (Red Hat 11.1.1-3)"
> 	.section	.note.GNU-stack,"",@progbits
> [hjl@gnu-cfl-1 tmp]$

This is related to the cost model of vector construct.