Bug 93897 - Poor trivial structure initialization code with -O3
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 10.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: argument, return
Reported: 2020-02-23 20:55 UTC by Maxim Egorushkin
Modified: 2024-06-18 23:45 UTC (History)

See Also:
Host:
Target: x86_64-*-*-*
Build:
Known to work:
Known to fail: 10.0, 7.5.0, 8.3.1, 9.2.1
Last reconfirmed: 2020-02-23 00:00:00


Description Maxim Egorushkin 2020-02-23 20:55:12 UTC
The following code:

    #include <cstdint>

    struct B {
        std::int64_t x;
        std::int32_t y;
        std::int32_t z;
    };
    
    B f(std::int64_t x, std::int32_t y, std::int32_t z) { 
        return {x, y, z}; 
    }

Compiled with `gcc -O3 -std=gnu++17 -march=skylake` generates the following assembly:

    f(long, int, int):
            mov     QWORD PTR [rsp-16], 0
            mov     QWORD PTR [rsp-24], rdi
            vmovdqa xmm1, XMMWORD PTR [rsp-24]
            vpinsrd xmm0, xmm1, esi, 2
            vpinsrd xmm2, xmm0, edx, 3
            vmovdqa XMMWORD PTR [rsp-24], xmm2
            mov     rax, QWORD PTR [rsp-24]
            mov     rdx, QWORD PTR [rsp-16]
            ret

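This looks excessive: the struct is spilled to the stack, assembled in a vector register, stored back, and then reloaded into rax and rdx.

When compiled with `clang-9.0 -O3 -std=gnu++17 -march=skylake`, the same code produces the expected: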

    f(long, int, int):
            mov     rax, rdi
            shl     rdx, 32
            mov     ecx, esi
            or      rdx, rcx
            ret

https://gcc.godbolt.org/z/udsiyF
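
For reference, the clang sequence passes `x` through in `rax` and packs `y` and `z` into `rdx`, which is where the x86-64 System V ABI returns the second 8 bytes of a 16-byte struct like `B`. A minimal sketch of that computation in portable C++ follows. The helper names `pack_yz` and `matches_layout` are hypothetical and only illustrate the expected result, and a little-endian target is assumed:

```cpp
#include <cstdint>
#include <cstring>

struct B {
    std::int64_t x;
    std::int32_t y;
    std::int32_t z;
};

// Hypothetical helper: builds the second 8 bytes of B (the value returned
// in rdx) the way the clang output does.
std::uint64_t pack_yz(std::int32_t y, std::int32_t z) {
    // mov ecx, esi : zero-extend y into a 64-bit register
    std::uint64_t lo = static_cast<std::uint32_t>(y);
    // shl rdx, 32 : move z into the upper 32 bits
    std::uint64_t hi = static_cast<std::uint64_t>(static_cast<std::uint32_t>(z)) << 32;
    // or rdx, rcx : combine both halves
    return hi | lo;
}

// Hypothetical check: the packed value equals the in-memory bytes of {y, z}
// (assumes a little-endian target such as x86-64).
bool matches_layout(std::int64_t x, std::int32_t y, std::int32_t z) {
    B b{x, y, z};
    std::uint64_t second_half;
    std::memcpy(&second_half, reinterpret_cast<const char*>(&b) + 8, sizeof second_half);
    return second_half == pack_yz(y, z);
}
```

No vector registers or stack traffic are needed for this, which is why the two-register sequence is the expected output.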
Comment 1 Andrew Pinski 2020-02-23 21:37:17 UTC
This is a vector cost model issue.
Comment 2 Richard Biener 2020-02-24 12:47:31 UTC
We're "correctly" costing an extra spill and the two loads:

t.C:10:24: note:   vect_model_store_cost: inside_cost = 16, prologue_cost = 40 .
0x59cc2a0 y_4(D) 1 times vec_construct costs 8 in prologue
0x59cc2a0 y_4(D) 1 times vector_store costs 16 in body
0x59cc2a0 y_4(D) 1 times vector_store costs 16 in epilogue
0x59cc2a0 y_4(D) 2 times scalar_load costs 16 in epilogue
0x59f1130 y_4(D) 1 times scalar_store costs 12 in body
0x59f1130 z_6(D) 1 times scalar_store costs 12 in body
t.C:10:24: note:  Cost model analysis:
  Vector inside of basic block cost: 16
  Vector prologue cost: 8
  Vector epilogue cost: 32
  Scalar cost of basic block: 24
t.C:10:24: missed:  not vectorized: vectorization is not profitable.

and expand from

  <bb 2> [local count: 1073741824]:
  D.2953.x = x_2(D);
  D.2953.y = y_4(D);
  D.2953.z = z_6(D);
  return D.2953;

but somehow RTL expansion ends up doing

;; Generating RTL for gimple basic block 2

;; D.2953.x = x_2(D);

(insn 8 7 0 (set (subreg:DI (reg:TI 82 [ D.2953 ]) 0)
        (reg/v:DI 84 [ x ])) "t.C":10:24 -1
     (nil))

;; D.2953.y = y_4(D);

(insn 9 8 10 (set (reg:V4SI 87)
        (vec_merge:V4SI (vec_duplicate:V4SI (reg/v:SI 85 [ y ]))
            (subreg:V4SI (reg:TI 82 [ D.2953 ]) 0)
            (const_int 4 [0x4]))) "t.C":10:24 -1
     (nil))

(insn 10 9 0 (set (reg:TI 82 [ D.2953 ])
        (subreg:TI (reg:V4SI 87) 0)) "t.C":10:24 -1
     (nil))

;; D.2953.z = z_6(D);

(insn 11 10 12 (set (reg:V4SI 88)
        (vec_merge:V4SI (vec_duplicate:V4SI (reg/v:SI 86 [ z ]))
            (subreg:V4SI (reg:TI 82 [ D.2953 ]) 0)
            (const_int 8 [0x8]))) "t.C":10:24 -1
     (nil))

(insn 12 11 0 (set (reg:TI 82 [ D.2953 ])
        (subreg:TI (reg:V4SI 88) 0)) "t.C":10:24 -1
     (nil))

;; return D.2953;

!?
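
To make the arithmetic in the cost dump above explicit, here is a small sketch that adds up the reported figures. The numbers are copied from the dump. The struct, the function names, and the comparison rule (vectorize only when the vector total is strictly below the scalar cost) are assumptions for illustration, not the vectorizer's actual code:

```cpp
// Figures copied from the dump; everything else here is illustrative.
struct VectorCost {
    int prologue;  // vec_construct: 8
    int inside;    // vector_store in body: 16
    int epilogue;  // vector_store 16 + two scalar_loads 16: 32
};

constexpr int total(VectorCost c) {
    return c.prologue + c.inside + c.epilogue;
}

// Assumed rule: vectorization pays off only if the vector total is
// strictly lower than the scalar cost of the same basic block.
constexpr bool profitable(VectorCost vec, int scalar_cost) {
    return total(vec) < scalar_cost;
}

constexpr VectorCost kDumpVector{8, 16, 32};
constexpr int kDumpScalar = 12 + 12;  // two scalar_store entries in the body

static_assert(total(kDumpVector) == 56, "8 + 16 + 32");
static_assert(!profitable(kDumpVector, kDumpScalar), "56 is not below 24");
```

Under these assumptions the vector total of 56 against a scalar cost of 24 matches the "not profitable" verdict in the dump, so the vector code in the final assembly comes from RTL expansion rather than from the vectorizer.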
Comment 3 Maxim Egorushkin 2020-08-11 11:32:27 UTC
It seems to be triggered by uint32_t; see also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96562

Any plans to fix this bug?
Comment 4 GCC Commits 2020-08-18 06:20:50 UTC
The master branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:7d5de349d21479d7ec61dd0153e6f0958ad7384f

commit r11-2733-g7d5de349d21479d7ec61dd0153e6f0958ad7384f
Author: liuhongt <hongtao.liu@intel.com>
Date:   Wed Aug 12 10:48:17 2020 +0800

    Don't use pinsr/pextr for struct initialization/extraction.
    
    gcc/
            PR target/96562
            PR target/93897
            * config/i386/i386-expand.c (ix86_expand_pinsr): Don't use
            pinsr for TImode.
            (ix86_expand_pextr): Don't use pextr for TImode.
    
    gcc/testsuite/
            * gcc.target/i386/pr96562-1.c: New test.
Comment 5 GCC Commits 2020-08-18 06:24:56 UTC
The releases/gcc-10 branch has been updated by hongtao Liu <liuhongt@gcc.gnu.org>:

https://gcc.gnu.org/g:a49452d964e3bbd1d9aa0d809355f41347b3ec05

commit r10-8636-ga49452d964e3bbd1d9aa0d809355f41347b3ec05
Author: liuhongt <hongtao.liu@intel.com>
Date:   Wed Aug 12 10:48:17 2020 +0800

    Don't use pinsr/pextr for struct initialization/extraction.
    
    gcc/
            PR target/96562
            PR target/93897
            * config/i386/i386-expand.c (ix86_expand_pinsr): Don't use
            pinsr for TImode.
            (ix86_expand_pextr): Don't use pextr for TImode.
    
    gcc/testsuite/
            * gcc.target/i386/pr96562-1.c: New test.
Comment 6 Hongtao.liu 2020-08-18 06:28:47 UTC
Fixed in GCC 11 and backported to GCC 10.
Comment 7 H.J. Lu 2021-07-14 12:58:29 UTC
Another testcase:

[hjl@gnu-cfl-1 tmp]$ cat x.c
extern int foo();
extern int bar();

typedef int (*func_t)(int);

struct test
{
        func_t func1;
        func_t func2;
};

void mainfunc (struct test *iface)
{
  iface->func1 = foo;
  iface->func2 = bar;
}
[hjl@gnu-cfl-1 tmp]$ gcc -S -O2 x.c
[hjl@gnu-cfl-1 tmp]$ cat x.s
	.file	"x.c"
	.text
	.p2align 4
	.globl	mainfunc
	.type	mainfunc, @function
mainfunc:
.LFB0:
	.cfi_startproc
	movq	$foo, (%rdi)
	movq	$bar, 8(%rdi)
	ret
	.cfi_endproc
.LFE0:
	.size	mainfunc, .-mainfunc
	.ident	"GCC: (GNU) 11.1.1 20210531 (Red Hat 11.1.1-3)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-cfl-1 tmp]$ gcc -S -O3 x.c -march=skylake
[hjl@gnu-cfl-1 tmp]$ cat x.s
	.file	"x.c"
	.text
	.p2align 4
	.globl	mainfunc
	.type	mainfunc, @function
mainfunc:
.LFB0:
	.cfi_startproc
	movl	$foo, %edx
	movl	$bar, %eax
	vmovq	%rdx, %xmm0
	vpinsrq	$1, %rax, %xmm0, %xmm0
	vmovdqu	%xmm0, (%rdi)
	ret
	.cfi_endproc
.LFE0:
	.size	mainfunc, .-mainfunc
	.ident	"GCC: (GNU) 11.1.1 20210531 (Red Hat 11.1.1-3)"
	.section	.note.GNU-stack,"",@progbits
[hjl@gnu-cfl-1 tmp]$
Comment 8 Hongtao.liu 2021-08-17 10:16:22 UTC
> mainfunc:
> .LFB0:
> 	.cfi_startproc
> 	movl	$foo, %edx
> 	movl	$bar, %eax
> 	vmovq	%rdx, %xmm0
> 	vpinsrq	$1, %rax, %xmm0, %xmm0
> 	vmovdqu	%xmm0, (%rdi)
> 	ret
> 	.cfi_endproc
> .LFE0:
> 	.size	mainfunc, .-mainfunc
> 	.ident	"GCC: (GNU) 11.1.1 20210531 (Red Hat 11.1.1-3)"
> 	.section	.note.GNU-stack,"",@progbits
> [hjl@gnu-cfl-1 tmp]$

This is related to the cost model for vector construction (vec_construct).