GCC Bugzilla – Full Text Bug Listing


| Summary: | suboptimal SLP for reduced case from namd_r |
|---|---|
| Product: | gcc |
| Reporter: | Kewen Lin <linkw> |
| Component: | tree-optimization |
| Assignee: | Not yet assigned to anyone <unassigned> |
| Status: | UNCONFIRMED |
| Severity: | enhancement |
| CC: | bill.schmidt, crazylht, linkw, rguenth, rsandifo, segher |
| Priority: | P3 |
| Keywords: | missed-optimization |
| Version: | 12.0 |
| Target Milestone: | --- |
| Bug Blocks: | 53947 |

Description
Kewen Lin, 2021-08-17 09:26:48 UTC
The original costing showed the vectorized version winning. Checking the costings revealed that the cost of lane extraction was not modeled; a patch for that was posted at: https://gcc.gnu.org/pipermail/gcc-patches/2021-August/577422.html

With the proposed adjustment, the costings become:

```
reduc.c:24:34: note: Cost model analysis for part in loop 0:
  Vector cost: 17
  Scalar cost: 17
```

Since we still consider vectorization profitable when both costs are equal, the SLP is still performed.

One thing that could make a difference: when we do the costing, math optimization has not happened yet, so there are no FMA-style operations; only later are some multiplies and subtractions combined into FMS. If the scalar costing saw two multiply-and-subtract operations (counted as 2) instead of two multiplies plus two subtractions (counted as 4), while the vector costing saw 1 instead of 2, it would end up with scalar 15 vs. vector 16. But this does not seem practical, since we cannot predict the later processing well; I tried to hack pass_optimize_widening_mul to run before SLP, and saw it fail earlier.

Looking at the optimized IR, I think the problem is that the vectorized version has a longer critical path (total latency) to the reduc_plus result. For the vectorized version:

```
_51 = diffa_41(D) * 1.666666666666666574148081281236954964697360992431640625e-1;
_59 = {_51, 2.5e-1};
vect__20.13_60 = vect_vdw_d_37.12_56 * _59;
_61 = .REDUC_PLUS (vect__20.13_60);
```

The critical path is: scalar mult -> vect CTOR -> vector mult -> reduc_plus.

For the scalar version:

```
_51 = diffa_41(D) * 1.666666666666666574148081281236954964697360992431640625e-1;
_21 = vdw_c_38 * 2.5e-1;
_22 = .FMA (vdw_d_37, _51, _21);
```

The two scalar mults can run in parallel, and the computation further ends up as one FMA. On Power9 we do not have a single REDUC_PLUS insn for double; it takes three insns: vector shift + vector addition + vector extraction.
I'm not sure whether this is a problem on platforms that support an efficient REDUC_PLUS, but it seems a bad idea to SLP a case where the root is a reduction op, its feeders are not isomorphic, their types are V2*, and they can be math-optimized.

Comment #3, Richard Biener:

On x86 we even have

```
Vector cost: 136
Scalar cost: 196
```

Note that we seem to vectorize the reduction, but that only happens with -ffast-math, not -O2 -ftree-slp-vectorize?

One issue is the association of

```
((diffa * vdw_d * (1 / 6.) + vdw_c * (1 / 4.)) * diffa + vdw_b * (1 / 2.)) * diffa + vdw_a
```

which we fail to reduce as

```
diffa*diffa*diffa*(1/6.)*vdw_d + diffa*diffa*(1/4.)*vdw_c + diffa*(1/2.)*vdw_b + 1.0*vdw_a
```

Note that vectorizer costing does not look at dependencies at all; it just sums up individual instruction latencies (and assumes unlimited throughput as well).

Kewen Lin:

(In reply to Richard Biener from comment #3)
> On x86 we even have
>
> Vector cost: 136
> Scalar cost: 196
>
> note that we seem to vectorize the reduction but that only happens with
> -ffast-math, not -O2 -ftree-slp-vectorize?

I don't quite follow this question; I may have misunderstood it. Yes, -ffast-math is required: -O2 does not implicitly enable vectorization now, so the explicit SLP option is needed.

> One issue is the association of
>
> ((diffa * vdw_d * (1 / 6.) + vdw_c * (1 / 4.)) * diffa + vdw_b * (1 / 2.)) *
> diffa + vdw_a
>
> which we fail to reduce as
>
> diffa*diffa*diffa*(1/6.)*vdw_d + diffa*diffa*(1/4.)*vdw_c +
> diffa*(1/2.)*vdw_b + 1.0*vdw_a

Good point!