This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug tree-optimization/78164] New: SLP vectorizer: prologue cost biased by redundancies
- From: "glisse at gcc dot gnu.org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Sun, 30 Oct 2016 17:17:21 +0000
- Subject: [Bug tree-optimization/78164] New: SLP vectorizer: prologue cost biased by redundancies
- Auto-submitted: auto-generated
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78164
Bug ID: 78164
Summary: SLP vectorizer: prologue cost biased by redundancies
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: enhancement
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: glisse at gcc dot gnu.org
Target Milestone: ---
From http://stackoverflow.com/q/39947582/1918193
void testfunc_flat(double a, double b, double* dst)
{
  dst[0] = 0.1 + ( a)*(1.0 + 0.5*( a));
  dst[1] = 0.1 + ( b)*(1.0 + 0.5*( b));
  dst[2] = 0.1 + (-a)*(1.0 + 0.5*(-a));
  dst[3] = 0.1 + (-b)*(1.0 + 0.5*(-b));
}
We fail to vectorize with AVX, which is understandable because the operations
are different. More surprising is that we also reject SSE vectorization:
Vector inside of basic block cost: 14
Vector prologue cost: 10
Vector epilogue cost: 0
Scalar cost of basic block: 22
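To make the rejection concrete, here is a minimal sketch of the comparison, assuming the decision reduces to checking the sum of the three vector costs against the scalar cost. The helper name bb_vectorization_profitable is hypothetical and this is not GCC's actual implementation, which may treat the equal-cost case differently.

```c
#include <stdbool.h>

/* Hypothetical helper, not GCC's actual code: accept vectorization only
   when the summed vector cost is strictly below the scalar cost.  */
bool bb_vectorization_profitable(int inside, int prologue,
                                 int epilogue, int scalar)
{
    return inside + prologue + epilogue < scalar;
}
```

With the numbers from the dump, bb_vectorization_profitable(14, 10, 0, 22) is false, because 14 + 10 + 0 = 24 is not below 22.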
However, if I disable the cost model, I can see this prologue that is supposed
to have cost 10:
vect_cst__47 = { 1.000000000000000055511151231257827021181583404541015625e-1,
1.000000000000000055511151231257827021181583404541015625e-1 };
vect_cst__44 = { 1.0e+0, 1.0e+0 };
vect_cst__42 = { 5.0e-1, 5.0e-1 };
vect_cst__40 = {a_19(D), b_23(D)};
vect_cst__38 = {a_19(D), b_23(D)};
vect_cst__34 = { 1.000000000000000055511151231257827021181583404541015625e-1,
1.000000000000000055511151231257827021181583404541015625e-1 };
vect_cst__32 = {a_19(D), b_23(D)};
vect_cst__30 = { 1.0e+0, 1.0e+0 };
vect_cst__28 = { 5.0e-1, 5.0e-1 };
vect_cst__27 = {a_19(D), b_23(D)};
Only 4 of these 10 vectors are distinct, so some very basic CSE would bring the
prologue cost down to 4 and allow vectorization, as LLVM does.
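For illustration, this is roughly what the deduplicated prologue looks like when written by hand with the GCC/Clang vector extension. It is only a sketch: the name testfunc_flat_slp and the local variable names are made up for this example, and it is not the code that GCC or LLVM actually emits. Assuming the costs are simply summed, four prologue vectors would give 14 + 4 = 18, below the scalar cost of 22.

```c
#include <string.h>

/* Two-lane double vector using the GCC/Clang vector extension.  */
typedef double v2df __attribute__((vector_size(16)));

/* Hypothetical hand-vectorized counterpart of testfunc_flat.  */
void testfunc_flat_slp(double a, double b, double *dst)
{
    /* Prologue after CSE: each invariant vector is built exactly once.  */
    const v2df c01 = { 0.1, 0.1 };
    const v2df c10 = { 1.0, 1.0 };
    const v2df c05 = { 0.5, 0.5 };
    const v2df ab  = { a, b };

    /* Body: the same expression applied to {a, b} and to {-a, -b}.  */
    const v2df nab = -ab;
    const v2df lo = c01 + ab  * (c10 + c05 * ab);
    const v2df hi = c01 + nab * (c10 + c05 * nab);

    /* memcpy avoids assuming that dst is 16-byte aligned.  */
    memcpy(dst, &lo, sizeof lo);
    memcpy(dst + 2, &hi, sizeof hi);
}
```

The constants 0.1, 1.0 and 0.5 and the pair {a, b} each appear once here, whereas the dump above materializes them ten times in total.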