Bug 79726 - Missing optimisation: Type conversion not vectorised in simple additive reduction
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 7.0.1
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2017-02-27 14:22 UTC by Raphael C
Modified: 2021-02-23 11:00 UTC
CC List: 0 users

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2017-02-27 00:00:00


Description Raphael C 2017-02-27 14:22:50 UTC
Consider:

double f(double x[]) {
  float p = 1.0;
  for (int i = 0; i < 16; i++)
    p += x[i];
  return p;
}

gcc with -O3 -march=core-avx2 -ffast-math gives:

f:
        vmovsd  xmm0, QWORD PTR .LC0[rip]
        vaddsd  xmm0, xmm0, QWORD PTR [rdi]
        vcvtsd2ss       xmm0, xmm0, xmm0
        vcvtss2sd       xmm0, xmm0, xmm0
        vaddsd  xmm0, xmm0, QWORD PTR [rdi+8]
        vcvtsd2ss       xmm0, xmm0, xmm0
        vcvtss2sd       xmm0, xmm0, xmm0
        vaddsd  xmm0, xmm0, QWORD PTR [rdi+16]
        vcvtsd2ss       xmm0, xmm0, xmm0
        vcvtss2sd       xmm0, xmm0, xmm0
        vaddsd  xmm0, xmm0, QWORD PTR [rdi+24]
        vcvtsd2ss       xmm0, xmm0, xmm0
        vcvtss2sd       xmm0, xmm0, xmm0
        vaddsd  xmm0, xmm0, QWORD PTR [rdi+32]
        vcvtsd2ss       xmm0, xmm0, xmm0
        vcvtss2sd       xmm0, xmm0, xmm0
        vaddsd  xmm0, xmm0, QWORD PTR [rdi+40]
        vcvtsd2ss       xmm0, xmm0, xmm0
        vcvtss2sd       xmm0, xmm0, xmm0
        vaddsd  xmm0, xmm0, QWORD PTR [rdi+48]
        vcvtsd2ss       xmm0, xmm0, xmm0
        vcvtss2sd       xmm0, xmm0, xmm0
        vaddsd  xmm0, xmm0, QWORD PTR [rdi+56]
        vcvtsd2ss       xmm0, xmm0, xmm0
        vcvtss2sd       xmm0, xmm0, xmm0
        vaddsd  xmm0, xmm0, QWORD PTR [rdi+64]
        vcvtsd2ss       xmm0, xmm0, xmm0
        vcvtss2sd       xmm0, xmm0, xmm0
        vaddsd  xmm0, xmm0, QWORD PTR [rdi+72]
        vcvtsd2ss       xmm0, xmm0, xmm0
        vcvtss2sd       xmm0, xmm0, xmm0
        vaddsd  xmm0, xmm0, QWORD PTR [rdi+80]
        vcvtsd2ss       xmm0, xmm0, xmm0
        vcvtss2sd       xmm0, xmm0, xmm0
        vaddsd  xmm0, xmm0, QWORD PTR [rdi+88]
        vcvtsd2ss       xmm0, xmm0, xmm0
        vcvtss2sd       xmm0, xmm0, xmm0
        vaddsd  xmm0, xmm0, QWORD PTR [rdi+96]
        vcvtsd2ss       xmm0, xmm0, xmm0
        vcvtss2sd       xmm0, xmm0, xmm0
        vaddsd  xmm0, xmm0, QWORD PTR [rdi+104]
        vcvtsd2ss       xmm0, xmm0, xmm0
        vcvtss2sd       xmm0, xmm0, xmm0
        vaddsd  xmm0, xmm0, QWORD PTR [rdi+112]
        vcvtsd2ss       xmm0, xmm0, xmm0
        vcvtss2sd       xmm0, xmm0, xmm0
        vaddsd  xmm0, xmm0, QWORD PTR [rdi+120]
        vcvtsd2ss       xmm0, xmm0, xmm0
        vcvtss2sd       xmm0, xmm0, xmm0
        ret
.LC0:
        .long   0
        .long   1072693248
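
The repeated vcvtsd2ss/vcvtss2sd pairs in the listing above follow from C's usual arithmetic conversions: in p += x[i], the float p is promoted to double, the addition is done in double, and the result is rounded back to float on every iteration. A minimal sketch with those conversions written out, assuming an IEEE 754 target without excess precision and a build without -ffast-math (f_explicit is a hypothetical name used only for illustration):

```c
/* Original loop from the report. */
double f(double x[]) {
  float p = 1.0;
  for (int i = 0; i < 16; i++)
    p += x[i];
  return p;
}

/* Same computation with the implicit conversions made explicit.
   f_explicit is a hypothetical name, not part of the report. */
double f_explicit(double x[]) {
  float p = 1.0f;
  for (int i = 0; i < 16; i++) {
    double wide = (double)p + x[i];  /* widen p, add in double (vaddsd) */
    p = (float)wide;                 /* round back to float (vcvtsd2ss) */
  }
  return (double)p;                  /* final widening for the return value */
}
```

Because each iteration rounds through float before the next add, the reduction is a serial chain of double adds and narrowing conversions, which is the pattern the vectorizer has to recognise.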


However, the following would be more efficient:

f:
        vcvtpd2ps xmm0, YMMWORD PTR [rdi]                       #4.5
        vcvtpd2ps xmm1, YMMWORD PTR [32+rdi]                    #4.5
        vcvtpd2ps xmm2, YMMWORD PTR [64+rdi]                    #4.5
        vcvtpd2ps xmm3, YMMWORD PTR [96+rdi]                    #4.5
        vaddps    xmm4, xmm0, xmm1                              #2.11
        vaddps    xmm5, xmm2, xmm3                              #2.11
        vaddps    xmm6, xmm4, xmm5                              #2.11
        vmovhlps  xmm7, xmm6, xmm6                              #2.11
        vaddps    xmm8, xmm6, xmm7                              #2.11
        vshufps   xmm9, xmm8, xmm8, 245                         #2.11
        vaddss    xmm10, xmm8, xmm9                             #2.11
        vaddss    xmm0, xmm10, DWORD PTR .L_2il0floatpacket.0[rip] #2.11
        vcvtss2sd xmm0, xmm0, xmm0                              #5.10
        vzeroupper                                              #5.10
        ret                                                     #5.10
.L_2il0floatpacket.0:
        .long   0x3f800000
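
For readers who prefer C to assembly, the suggested sequence above can be sketched as a scalar emulation. This is only an illustration of what the listing computes, not compiler output; f_lanes and the helper arrays are hypothetical names. Note that it narrows every element to float before adding and sums in a tree order, so it is not bit-for-bit equivalent to the original loop in general.

```c
/* Scalar emulation of the suggested vector sequence.
   f_lanes, v and s are hypothetical names used for illustration. */
double f_lanes(const double x[]) {
  float v[4][4];

  /* Four vcvtpd2ps: narrow 4 doubles at a time into 4 float lanes. */
  for (int k = 0; k < 4; k++)
    for (int j = 0; j < 4; j++)
      v[k][j] = (float)x[4 * k + j];

  /* Three vaddps: (v0 + v1) + (v2 + v3), lane by lane. */
  float s[4];
  for (int j = 0; j < 4; j++)
    s[j] = (v[0][j] + v[1][j]) + (v[2][j] + v[3][j]);

  /* vmovhlps + vaddps: fold the upper two lanes onto the lower two. */
  float lo = s[0] + s[2];
  float hi = s[1] + s[3];

  /* vshufps + vaddss: add the two remaining lanes. */
  float sum = lo + hi;

  /* vaddss with the constant 0x3f800000, which is 1.0f. */
  sum += 1.0f;

  /* vcvtss2sd: widen the result for the double return value. */
  return (double)sum;
}
```

For inputs that are exactly representable in float and whose partial sums are too, such as x[i] = i + 1, this returns the same value as the original loop.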
Comment 1 Richard Biener 2017-02-27 15:08:59 UTC
Confirmed.  We do not handle widen-sum reduction for floating-point (or in general "complex" expressions).
Comment 2 Richard Biener 2021-02-23 11:00:48 UTC
Demoting the double add to a float add is also a general -ffast-math
optimization that we currently miss.  If you write

double f(double x[]) {
  float p = 1.0;
  for (int i = 0; i < 16; i++)
    p += (float)x[i];
  return p;
}

the loop is vectorized in the way you expect.

Note that such demotion can produce +-Inf where none appeared before, for
example when x[0] lies just below the most negative finite float, so that
(float)x[0] is not representable but (float)(x[0] + 1.) is.
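
The effect can be reproduced with a small sketch. It uses a variant of the case above with a large accumulator value instead of 1., so that the difference is clearly visible at float precision. It assumes IEEE 754 semantics (out-of-range narrowing yields an infinity) and a build without -ffast-math; step_original and step_demoted are hypothetical names.

```c
#include <math.h>

/* One step of the original loop: add in double, then round to float.
   step_original is a hypothetical name used for illustration. */
float step_original(float p, double x) {
  return (float)((double)p + x);
}

/* One step after demotion: round x to float first, then add in float.
   step_demoted is a hypothetical name used for illustration. */
float step_demoted(float p, double x) {
  return p + (float)x;
}
```

With p = 3.0e38f and x = -4.0e38, step_original returns a finite value near -1.0e38, because the double sum fits in a float. step_demoted returns -Inf, because x alone is outside the float range and the infinity survives the add.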

Still "correct" vectorization should also be possible but is not yet
implemented.