This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.



[Bug tree-optimization/79336] Poor vectorisation of additive reduction of complex array, final SLP reduction step inefficient


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79336

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-02-02
          Component|c                           |tree-optimization
             Blocks|                            |53947
            Summary|Poor vectorisation of       |Poor vectorisation of
                   |additive reduction of       |additive reduction of
                   |complex array               |complex array, final SLP
                   |                            |reduction step inefficient
     Ever confirmed|0                           |1

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Confirmed.  The reduction loop itself is fine; it is the final reduction step
involving the SLP reduction result (we reduce two scalars) that is handled
less than optimally:

  <bb 3> [96.97%]:
  # i_16 = PHI <i_11(4), 0(2)>
  # p$real_13 = PHI <_17(4), 1.0e+0(2)>
  # p$imag_14 = PHI <_18(4), 0.0(2)>
  # ivtmp_34 = PHI <ivtmp_33(4), 32(2)>
  _1 = (long unsigned int) i_16;
  _2 = _1 * 8;
  _3 = x_9(D) + _2;
  _7 = REALPART_EXPR <*_3>;
  _12 = IMAGPART_EXPR <*_3>;
  _17 = _7 + p$real_13;
  _18 = _12 + p$imag_14;
  i_11 = i_16 + 1;
  ivtmp_33 = ivtmp_34 - 1;
  if (ivtmp_33 != 0)
    goto <bb 4>; [96.88%]
  else
    goto <bb 5>; [3.12%]

  <bb 4> [93.94%]:
  goto <bb 3>; [100.00%]

  <bb 5> [3.03%]:
  # _36 = PHI <_17(3)>
  # _35 = PHI <_18(3)>
  p_10 = COMPLEX_EXPR <_36, _35>;

Here we simply first produce _36 and _35 from the vectorized reduction
result and then build the complex function result:

  <bb 5> [3.03%]:
  # _36 = PHI <_17(3)>
  # _35 = PHI <_18(3)>
  # vect__17.8_22 = PHI <vect__17.8_24(3)>
  stmp__17.9_21 = BIT_FIELD_REF <vect__17.8_22, 32, 0>;
  stmp__17.9_20 = BIT_FIELD_REF <vect__17.8_22, 32, 32>;
  stmp__17.9_19 = BIT_FIELD_REF <vect__17.8_22, 32, 64>;
  stmp__17.9_15 = BIT_FIELD_REF <vect__17.8_22, 32, 96>;
  stmp__17.9_6 = BIT_FIELD_REF <vect__17.8_22, 32, 128>;
  stmp__17.9_5 = BIT_FIELD_REF <vect__17.8_22, 32, 160>;
  stmp__17.9_4 = BIT_FIELD_REF <vect__17.8_22, 32, 192>;
  stmp__17.9_29 = BIT_FIELD_REF <vect__17.8_22, 32, 224>;
  stmp__17.9_28 = stmp__17.9_21 + stmp__17.9_19;
  stmp__17.9_27 = stmp__17.9_20 + stmp__17.9_15;
  stmp__17.9_26 = stmp__17.9_28 + stmp__17.9_6;
  stmp__17.9_37 = stmp__17.9_27 + stmp__17.9_5;
  stmp__17.9_38 = stmp__17.9_26 + stmp__17.9_4;
  stmp__17.9_39 = stmp__17.9_37 + stmp__17.9_29;
  p_10 = COMPLEX_EXPR <stmp__17.9_38, stmp__17.9_39>;
  return p_10;

This doesn't take advantage of the fact that we can do this kind of
final SLP reduction more efficiently (I didn't try to decipher exactly
what ICC does here).  It may require ABI details or knowing that
we can type-pun a vector to a complex...  (but only for complex float;
for complex double the ABI doesn't work out this way!)
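
To make the difference concrete, here is a rough C sketch.  It is neither
GCC's actual output nor what ICC does.  It assumes the GNU C vector and
__real__/__imag__ extensions; the loop is reconstructed from the dump above
(32 iterations, p starting at 1.0 + 0.0i) and may differ from the testcase
in the report, and all function names are made up for illustration.
reduce_lanewise mirrors the eight BIT_FIELD_REF extracts and six scalar adds
shown above; reduce_halving first adds the upper vector half to the lower
half, leaving one vector add and two scalar adds.  The two variants associate
the additions differently, which is acceptable only under the reassociation
assumptions the vectorized reduction already relies on.

```c
#include <assert.h>
#include <complex.h>

/* GNU C vector types: 8 floats hold four interleaved (re, im) pairs. */
typedef float v8sf __attribute__((vector_size(32)));
typedef float v4sf __attribute__((vector_size(16)));

/* Hypothetical helper: build a complex float with the GNU
   __real__/__imag__ extensions. */
float complex make_cf(float re, float im)
{
    float complex z;
    __real__ z = re;
    __imag__ z = im;
    return z;
}

/* Scalar reference, reconstructed from the dump (an assumption, not
   necessarily the testcase attached to the bug). */
float complex sum_scalar(const float complex *x)
{
    float complex p = make_cf(1.0f, 0.0f);
    for (int i = 0; i < 32; i++)
        p += x[i];
    return p;
}

/* Epilogue in the style of the dump: extract every lane, then chain-add
   the even (real) and odd (imaginary) lanes as scalars. */
float complex reduce_lanewise(const v8sf *v)
{
    float re = (*v)[0] + (*v)[2];
    float im = (*v)[1] + (*v)[3];
    re += (*v)[4];  im += (*v)[5];
    re += (*v)[6];  im += (*v)[7];
    return make_cf(re, im);
}

/* One possible cheaper epilogue: add the upper half to the lower half
   (lane parity, and so the re/im interleaving, is preserved), then
   finish with two scalar adds. */
float complex reduce_halving(const v8sf *v)
{
    v4sf lo = { (*v)[0], (*v)[1], (*v)[2], (*v)[3] };
    v4sf hi = { (*v)[4], (*v)[5], (*v)[6], (*v)[7] };
    v4sf s = lo + hi;
    return make_cf(s[0] + s[2], s[1] + s[3]);
}

/* Hand-written stand-in for the vectorized loop: four complex floats per
   iteration, with the initial 1.0 + 0.0i placed in lanes 0 and 1. */
float complex sum_vector(const float complex *x, int use_halving)
{
    v8sf acc = { 1.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };
    for (int i = 0; i < 32; i += 4) {
        v8sf chunk = {
            __real__ x[i],     __imag__ x[i],
            __real__ x[i + 1], __imag__ x[i + 1],
            __real__ x[i + 2], __imag__ x[i + 2],
            __real__ x[i + 3], __imag__ x[i + 3]
        };
        acc += chunk;
    }
    return use_halving ? reduce_halving(&acc) : reduce_lanewise(&acc);
}

/* Self-check on small integer inputs, which are exactly representable,
   so the different summation orders give identical results. */
void check_reductions_agree(void)
{
    float complex x[32];
    for (int i = 0; i < 32; i++)
        x[i] = make_cf((float)i, (float)(2 * i));
    assert(sum_vector(x, 0) == sum_scalar(x));
    assert(sum_vector(x, 1) == sum_scalar(x));
}
```

The vectors are passed by pointer only to keep the sketch free of ABI
warnings when AVX is not enabled.  Whether the two-lane result can then be
returned directly as a complex float is exactly the ABI question raised
above, and the sketch does not attempt to answer it.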


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
