Bug 79357 - Doubling a single complex float gives inefficient code
Summary: Doubling a single complex float gives inefficient code
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization (show other bugs)
Version: 7.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer argument, return
  Show dependency treegraph
 
Reported: 2017-02-03 13:54 UTC by Raphael C
Modified: 2021-09-01 03:05 UTC (History)
0 users

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-02-06 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Raphael C 2017-02-03 13:54:50 UTC
Consider the following code:

#include <complex.h>
complex float f(complex float x) {
  return 2*x;
}

gcc with -O3  -march=core-avx2  gives:

f:
        vmovq   QWORD PTR [rsp-8], xmm0
        vmovss  xmm1, DWORD PTR [rsp-8]
        vmovss  xmm0, DWORD PTR [rsp-4]
        vaddss  xmm1, xmm1, xmm1
        vaddss  xmm0, xmm0, xmm0
        vmovss  DWORD PTR [rsp-16], xmm1
        vmovss  DWORD PTR [rsp-12], xmm0
        vmovq   xmm0, QWORD PTR [rsp-16]
        ret

Better and still suitable without -ffast-math would be:

f:
        vmulps    xmm0, xmm0, XMMWORD PTR .L_2il0floatpacket.1[rip] #3.12
        ret
Comment 1 Raphael C 2017-02-03 22:54:34 UTC
I omitted 

.L_2il0floatpacket.1:
        .long   0x40000000,0x40000000,0x00000000,0x00000000
        .long   0x40000000,0x00000000

from the assembly. In other words it just multiplies by 2.0f.
Comment 2 Richard Biener 2017-02-06 11:52:52 UTC
Confirmed.  We expand from

f (complex float x)
{
  float _2;
  float _3;
  float _4;
  float _5;
  complex float _6;

  <bb 2>:
  _2 = REALPART_EXPR <x_1(D)>;
  _3 = _2 * 2.0e+0;
  _4 = IMAGPART_EXPR <x_1(D)>;
  _5 = _4 * 2.0e+0;
  _6 = COMPLEX_EXPR <_3, _5>;
  return _6;

and this is another case where the ABI is not exposed and thus we can't really
do better on trees.

With _Complex double we manage to optimize it to

f:
.LFB0:
        .cfi_startproc
        vaddsd  %xmm1, %xmm1, %xmm1
        vaddsd  %xmm0, %xmm0, %xmm0
        ret

because there the ABI says real/imagpart are passed in different registers.

Maybe we can teach RTL forwprop about this case(s)... (I teached it somewhat
similar cases on power):

(note 9 0 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 2 9 3 2 (set (reg:DI 97 [ x ])
        (reg:DI 21 xmm0 [ x ])) "t.c":2 81 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 21 xmm0 [ x ])
        (nil)))
(insn 3 2 4 2 (set (mem/c:DI (plus:DI (reg/f:DI 20 frame)
                (const_int -8 [0xfffffffffffffff8])) [0  S8 A64])
        (reg:DI 97 [ x ])) "t.c":2 81 {*movdi_internal}
     (expr_list:REG_DEAD (reg:DI 97 [ x ])
        (nil)))
(insn 4 3 5 2 (set (reg:SF 95)
        (mem/c:SF (plus:DI (reg/f:DI 20 frame)
                (const_int -8 [0xfffffffffffffff8])) [0  S4 A64])) "t.c":2 125 {*movsf_internal}
     (nil))
(insn 5 4 8 2 (set (reg:SF 96)
        (mem/c:SF (plus:DI (reg/f:DI 20 frame)
                (const_int -4 [0xfffffffffffffffc])) [0  S4 A32])) "t.c":2 125 {*movsf_internal}
     (nil))
(note 8 5 11 2 NOTE_INSN_FUNCTION_BEG)

OTOH this is already a blocker -- we expand the initial real/imagpart extracts
to loads from the stack after spilling the complex arg...  initial RTL:

(note 9 1 2 2 [bb 2] NOTE_INSN_BASIC_BLOCK)
(insn 2 9 3 2 (set (reg:DI 97)
        (reg:DI 21 xmm0 [ x ])) "t.c":2 -1
     (nil))
(insn 3 2 4 2 (set (mem/c:DI (plus:DI (reg/f:DI 82 virtual-stack-vars)
                (const_int -8 [0xfffffffffffffff8])) [0  S8 A64])
        (reg:DI 97)) "t.c":2 -1
     (nil))
(insn 4 3 5 2 (set (reg:SF 95)
        (mem/c:SF (plus:DI (reg/f:DI 82 virtual-stack-vars)
                (const_int -8 [0xfffffffffffffff8])) [0  S4 A64])) "t.c":2 -1
     (nil))
(insn 5 4 6 2 (set (reg:SF 96)
        (mem/c:SF (plus:DI (reg/f:DI 82 virtual-stack-vars)
                (const_int -4 [0xfffffffffffffffc])) [0  S4 A32])) "t.c":2 -1
     (nil))
(insn 6 5 7 2 (set (reg/v:SF 93 [ x ])
        (reg:SF 95)) "t.c":2 -1
     (nil))
(insn 7 6 8 2 (set (reg/v:SF 94 [ x+4 ])
        (reg:SF 96)) "t.c":2 -1
     (nil))
(note 8 7 11 2 NOTE_INSN_FUNCTION_BEG)


handling the initial RTL for the return value will be equally "interesting":

(insn 18 14 19 2 (set (reg:SF 100)
        (reg:SF 91 [ <retval> ])) "t.c":4 -1
     (nil))
(insn 19 18 20 2 (set (reg:SF 101)
        (reg:SF 92 [ <retval>+4 ])) "t.c":4 -1
     (nil))
(insn 20 19 21 2 (set (mem/c:SF (plus:DI (reg/f:DI 82 virtual-stack-vars)
                (const_int -16 [0xfffffffffffffff0])) [0  S4 A32])
        (reg:SF 100)) "t.c":4 -1
     (nil))
(insn 21 20 22 2 (set (mem/c:SF (plus:DI (reg/f:DI 82 virtual-stack-vars)
                (const_int -12 [0xfffffffffffffff4])) [0  S4 A32])
        (reg:SF 101)) "t.c":4 -1
     (nil))
(insn 22 21 23 2 (set (reg:DI 21 xmm0)
        (mem/c:DI (plus:DI (reg/f:DI 82 virtual-stack-vars)
                (const_int -16 [0xfffffffffffffff0])) [0  S8 A32])) "t.c":4 -1
     (nil))
(insn 23 22 0 2 (use (reg:DI 21 xmm0)) "t.c":4 -1
     (nil))
Comment 3 Marc Glisse 2017-02-12 13:53:14 UTC
Note that we do not generate better code for
typedef float vec __attribute__((vector_size(8)));
vec g(vec x){return 2*x;}
(we don't consider larger vector modes when lowering/expanding vector operations, there is already a PR about that)
Comment 4 Andrew Pinski 2021-09-01 03:05:54 UTC
(In reply to Marc Glisse from comment #3)
> Note that we do not generate better code for
> typedef float vec __attribute__((vector_size(8)));
> vec g(vec x){return 2*x;}
> (we don't consider larger vector modes when lowering/expanding vector
> operations, there is already a PR about that)

The testcase in comment #3 was fixed for GCC 11+.
The original testcase still has issues has SLP does not happen but then we are to another problem.
See:
#define complex _Complex
typedef float vec __attribute__((vector_size(8)));

complex float f(complex float x) {
  vec t = *(vec*)&x;
  t= (2*t);
  return *(complex float*)&t;
}

Note I think there is a dup for the above issue too.