Bug 55266 - vector expansion: 24 movs for 4 adds
Summary: vector expansion: 24 movs for 4 adds
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 4.8.0
Importance: P3 normal
Target Milestone: ---
Assignee: Richard Biener
URL:
Keywords:
Depends on: 52436 65832
Blocks: vectorizer genvector
 
Reported: 2012-11-10 15:09 UTC by Marc Glisse
Modified: 2023-07-21 12:12 UTC
CC List: 2 users

See Also:
Host:
Target: x86_64-linux-gnu
Build:
Known to work:
Known to fail:
Last reconfirmed: 2012-12-09 00:00:00


Description Marc Glisse 2012-11-10 15:09:43 UTC
I already mentioned this example, but I don't think it is in any PR:

typedef double vec __attribute__((vector_size(4*sizeof(double))));
void f(vec*x){
  *x+=*x+*x;
}

compiled with -S -O3 -msse4, this produces 4 add insns (as expected) and 24 mov insns, which is a bit much... For comparison, it should be equivalent to the following code, which generates only 6 mov insns:

typedef double vec __attribute__((vector_size(2*sizeof(double))));
void f(vec*x){
  x[0]+=x[0]+x[0];
  x[1]+=x[1]+x[1];
}

One minor enhancement would be to have fold_ternary handle BIT_FIELD_REF of CONSTRUCTOR of vectors (I think it is already tracked elsewhere, though I couldn't find it).
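
A rough source-level picture of that enhancement (a sketch with made-up names; in the real GIMPLE the CONSTRUCTOR elements are whole V2DF vectors rather than individual scalars):

typedef double v2df __attribute__((vector_size(2*sizeof(double))));
typedef double v4df __attribute__((vector_size(4*sizeof(double))));

/* 'whole' plays the role of the CONSTRUCTOR {_15, _18}, and reading its
   first two lanes plays the role of the BIT_FIELD_REF; folding the
   extraction should reduce the function to 'return lo;'.  */
v2df extract_lo(v2df lo, v2df hi){
  v4df whole = { lo[0], lo[1], hi[0], hi[1] };
  v2df half  = { whole[0], whole[1] };
  return half;
}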

But the main issue is with copying these fake vectors. Their fake "registers" live in memory, and copying between them (4 times 2 movs going through rax in DImode; I assume that is faster than going through xmm registers?) isn't optimized away. In this example, the contents of *x are first copied to a fake register. Then the V2DF parts are extracted, added, and put back in memory. That fake register is then copied to a new fake register. The V2DF parts are taken from it, added to the V2DF values that were still around, and stored to memory. And that result is finally copied back to the memory location *x.

I don't know how that should be improved. Maybe the vector lowering pass should go even further, turn the first program into the second one, and not leave any of the overly long vectors for the back-end to handle? It doesn't seem easy to optimize in the back-end; by then it is too late. Or maybe something can be done at expansion time?
Comment 1 Marc Glisse 2012-11-13 10:23:03 UTC
The first copy is PR 52436.

The second copy has a patch posted here:
http://gcc.gnu.org/ml/gcc-patches/2012-11/msg00900.html

The last copy would require turning:

  gimple_assign <constructor, _5, {_15, _18}, NULL, NULL>
  gimple_assign <ssa_name, *x_2(D), _5, NULL, NULL>

into:

  gimple_assign <ssa_name, *x_2(D), _15, NULL, NULL>
  gimple_assign <ssa_name, MEM[(vec *)x_2(D) + 16B], _18, NULL, NULL>

(not sure if endianness matters here)

which might more easily be done by splitting the memory write (when the vector type is not supported) into a suitable number of BIT_FIELD_REF extractions and component stores, and then relying on forwprop4 to simplify the BIT_FIELD_REFs of the constructor.
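
At the source level that split store would look roughly like the sketch below (made-up names, purely illustrative, ignoring aliasing subtleties; the actual transformation would of course happen on GIMPLE):

typedef double v2df __attribute__((vector_size(2*sizeof(double))));
typedef double v4df __attribute__((vector_size(4*sizeof(double))));

/* 'lo' and 'hi' stand in for the SSA names _15 and _18: instead of
   materializing the whole v4df and storing it in one piece, store the
   two supported-size halves directly.  */
void store_split(v4df*x, v2df lo, v2df hi){
  v2df*p = (v2df*)x;
  p[0] = lo;   /* *x_2(D) = _15 */
  p[1] = hi;   /* MEM[(vec *)x_2(D) + 16B] = _18 */
}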
Comment 2 Marc Glisse 2012-11-28 10:11:31 UTC
Author: glisse
Date: Wed Nov 28 10:11:27 2012
New Revision: 193884

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=193884
Log:
2012-11-28  Marc Glisse  <marc.glisse@inria.fr>

	PR middle-end/55266
	* fold-const.c (fold_ternary_loc) [BIT_FIELD_REF]: Handle
	CONSTRUCTOR with vector elements.
	* tree-ssa-propagate.c (valid_gimple_rhs_p): Handle CONSTRUCTOR
	and BIT_FIELD_REF.


Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/fold-const.c
    trunk/gcc/tree-ssa-propagate.c
Comment 3 Andrew Pinski 2012-12-09 02:07:58 UTC
The other issue is that no DCE pass runs after forwprop4.
Comment 4 vincenzo Innocente 2013-03-03 11:58:24 UTC
I still see problems when calling inline functions.
It seems that the code needed to satisfy the calling ABI is generated anyway.

Take the example below and compare the code generated for "dotd1" with the code generated for "dotd2": dotd2 has a "storm" of moves before the reduction.

c++ -std=c++11 -Ofast -march=corei7 -S conversions.cc -fabi-version=0

The AVX version is better, except for dotd4 (actually dotd1 looks similar).

typedef float __attribute__( ( vector_size( 16 ) ) ) float32x4_t;
typedef double  __attribute__( ( vector_size( 32 ) ) ) float64x4_t;


inline 
float64x4_t convert(float32x4_t f) {
  return float64x4_t{f[0],f[1],f[2],f[3]};
}

float dotf(float32x4_t x, float32x4_t y) {
  float ret=0;
  for (int i=0;i!=4;++i) ret+=x[i]*y[i];
  return ret;
}

inline
double dotd(float64x4_t x, float64x4_t y) {
  double ret=0;
  for (int i=0;i!=4;++i) ret+=x[i]*y[i];
  return ret;
}



float dotd1(float32x4_t x, float32x4_t y) {
  float64x4_t dx,dy;
  for (int i=0;i!=4;++i) {
    dx[i]=x[i]; dy[i]=y[i];
  }
  double ret=0;
  for (int i=0;i!=4;++i) ret+=dx[i]*dy[i];
  return ret;
}

float dotd2(float32x4_t x, float32x4_t y) {
  float64x4_t dx=convert(x);
  float64x4_t dy=convert(y);
  return dotd(dx,dy);
}


float dotd3(float32x4_t x, float32x4_t y) {
  float64x4_t dx{x[0],x[1],x[2],x[3]};
  float64x4_t dy{y[0],y[1],y[2],y[3]};
  double ret=0;
  for (int i=0;i!=4;++i) ret+=dx[i]*dy[i];
  return ret;
}

float dotd4(float32x4_t x, float32x4_t y) {
  float64x4_t dx,dy;
  for (int i=0;i!=4;++i) {
    dx[i]=x[i]; dy[i]=y[i];
  }
  return dotd(dx,dy);
}
Comment 5 Richard Biener 2016-06-29 13:30:21 UTC
The latter is because of 'convert' leaving us with

  _1 = BIT_FIELD_REF <x_32(D), 32, 0>;
  _2 = (double) _1;
  _3 = BIT_FIELD_REF <x_32(D), 32, 32>;
  _4 = (double) _3;
  _5 = BIT_FIELD_REF <x_32(D), 32, 64>;
  _6 = (double) _5;
  _7 = BIT_FIELD_REF <x_32(D), 32, 96>;
  _8 = (double) _7;
  _9 = {_2, _4, _6, _8};

rather than

  vect__1.83_46 = x;
  vect__2.84_47 = [vec_unpack_lo_expr] vect__1.83_46;
  vect__2.84_48 = [vec_unpack_hi_expr] vect__1.83_46;
  MEM[(vector(4) double *)&dx] = vect__2.84_47;
  MEM[(vector(4) double *)&dx + 16B] = vect__2.84_48;

(which is in itself not optimal because it is not in SSA form).

This means generic vector support lacks widening/shortening conversions, and thus you
have to jump through hoops with things like 'convert'.  And SLP vectorization
doesn't "vectorize" with vector CONSTRUCTORs as the root (a possible enhancement, I think).

For the original testcase it's a duplicate of PR65832 as we get

  <bb 2>:
  _1 = *x_5(D);
  _7 = BIT_FIELD_REF <_1, 128, 0>;
  _9 = _7 + _7;
  _10 = BIT_FIELD_REF <_1, 128, 128>;
  _12 = _10 + _10;
  _14 = _7 + _9;
  _16 = _10 + _12;
  _3 = {_14, _16};
  *x_5(D) = _3;

Without fixing PR65832, this can be improved by "combining" the loads with the extracts and the CONSTRUCTOR with the store.

I have done something similar for COMPLEX_EXPR in tree-ssa-forwprop.c ... (not
that I am very proud of that - heh).
Comment 6 Richard Biener 2023-07-21 12:12:11 UTC
The original issue is fixed.

f:
.LFB0:
        .cfi_startproc
        movapd  (%rdi), %xmm2
        movapd  16(%rdi), %xmm1
        movapd  %xmm2, %xmm0
        addpd   %xmm2, %xmm0
        addpd   %xmm2, %xmm0
        movaps  %xmm0, (%rdi)
        movapd  %xmm1, %xmm0
        addpd   %xmm1, %xmm0
        addpd   %xmm1, %xmm0
        movaps  %xmm0, 16(%rdi)
        ret

The issue in comment #4 is fixed as well, I think:

_Z5dotd1Dv4_fS_:
.LFB3:
        .cfi_startproc
        movaps  %xmm1, %xmm3
        pxor    %xmm2, %xmm2
        movhlps %xmm0, %xmm2
        cvtps2pd        %xmm0, %xmm0
        cvtps2pd        %xmm2, %xmm1
        pxor    %xmm2, %xmm2
        movhlps %xmm3, %xmm2
        cvtps2pd        %xmm3, %xmm3
        cvtps2pd        %xmm2, %xmm2
        mulpd   %xmm3, %xmm0
        mulpd   %xmm2, %xmm1
        addpd   %xmm0, %xmm1
        movapd  %xmm1, %xmm0
        unpckhpd        %xmm1, %xmm0
        addpd   %xmm1, %xmm0
        cvtsd2ss        %xmm0, %xmm0
        ret
        .cfi_endproc
.LFE3:
        .size   _Z5dotd1Dv4_fS_, .-_Z5dotd1Dv4_fS_
        .p2align 4
        .globl  _Z5dotd2Dv4_fS_
        .type   _Z5dotd2Dv4_fS_, @function
_Z5dotd2Dv4_fS_:
.LFB4:
        .cfi_startproc
        movaps  %xmm1, %xmm3
        cvtps2pd        %xmm0, %xmm4
        pxor    %xmm2, %xmm2
        movhlps %xmm0, %xmm2
        pxor    %xmm0, %xmm0
        movhlps %xmm3, %xmm0
        cvtps2pd        %xmm2, %xmm2
        cvtps2pd        %xmm1, %xmm1
        cvtps2pd        %xmm0, %xmm0
        mulpd   %xmm4, %xmm1
        mulpd   %xmm0, %xmm2
        addpd   %xmm2, %xmm1
        movapd  %xmm1, %xmm0
        unpckhpd        %xmm1, %xmm0
        addpd   %xmm1, %xmm0
        cvtsd2ss        %xmm0, %xmm0
        ret