Consider:

float f(float x[]) {
  float p = 1.0;
  for (int i = 0; i < 202; i++)
    p += 1;
  return p;
}

and

float f(float x[]) {
  float p = 1.0;
  for (int i = 0; i < 200; i++)
    p += 1;
  return p;
}

both compiled in gcc 7 (20170210 snapshot) with -Ofast.

In the former case (the 202 case) you get:

f:
        movss   xmm0, DWORD PTR .LC0[rip]
        ret
.LC0:
        .long   1128988672

In the latter case (the 200 case) you get:

f:
        movaps  xmm0, XMMWORD PTR .LC0[rip]
        xor     eax, eax
        movaps  xmm3, XMMWORD PTR .LC1[rip]
        movaps  xmm2, XMMWORD PTR .LC2[rip]
.L2:
        movaps  xmm1, xmm0
        add     eax, 1
        cmp     eax, 50
        addps   xmm0, xmm3
        addps   xmm1, xmm2
        jne     .L2
        shufps  xmm1, xmm1, 255
        movaps  xmm0, xmm1
        ret
.LC0:
        .long   1065353216
        .long   1073741824
        .long   1077936128
        .long   1082130432
.LC1:
        .long   1082130432
        .long   1082130432
        .long   1082130432
        .long   1082130432
.LC2:
        .long   1065353216
        .long   1065353216
        .long   1065353216
        .long   1065353216

There are a lot of other pairs of consecutive even-numbered limits where one optimizes well and the other doesn't, for example 194 and 196.
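The constants decode consistently: 1128988672 is the bit pattern of 203.0f (the exact result of 202 additions of 1 starting from 1.0), and the .LC0/.LC1/.LC2 vectors are {1,2,3,4}, {4,4,4,4}, and {1,1,1,1}. The following standalone C sketch (function names are mine, not from the PR) checks the scalar results and emulates what the vectorized 200-case loop computes:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret an IEEE-754 single-precision bit pattern as a float.  */
static float bits_to_float(uint32_t bits)
{
  float f;
  memcpy(&f, &bits, sizeof f);
  return f;
}

/* Scalar reference: p starts at 1.0 and 1.0 is added n times.  Every
   intermediate value is a small integer, exactly representable in
   float, so the result is exact: 1.0 + n.  */
static float f_scalar(int n)
{
  float p = 1.0f;
  for (int i = 0; i < n; i++)
    p += 1;
  return p;
}

/* Emulation of the vectorized asm for the 200 case: xmm0 starts as
   {1,2,3,4} (.LC0), the loop runs 50 times adding {4,4,4,4} (.LC1),
   xmm1 holds xmm0 + {1,1,1,1} (.LC2), and shufps xmm1, xmm1, 255
   broadcasts lane 3, which becomes the returned scalar.  */
static float f_vectorized_200(void)
{
  float acc[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
  float res[4] = { 0 };
  for (int i = 0; i < 50; i++)
    {
      for (int l = 0; l < 4; l++)
        res[l] = acc[l] + 1.0f;   /* addps xmm1, .LC2 */
      for (int l = 0; l < 4; l++)
        acc[l] += 4.0f;           /* addps xmm0, .LC1 */
    }
  return res[3];                  /* shufps xmm1, xmm1, 255 */
}
/* bits_to_float(1128988672u) == 203.0f == f_scalar(202);
   f_vectorized_200() == 201.0f == f_scalar(200)  */
```

So the vectorized loop does compute the correct value (201.0f for the 200 case); the complaint is only that it is computed at run time instead of being folded to a constant as in the 202 case.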
After some experimentation (also carried out by Hagen von Eitzen), it seems that any limit of at least 72 which is also a multiple of 4 causes the same optimisation problem; that is, the loop is *not* optimised out in these cases. Perhaps this is an example of one optimisation (SIMD vectorisation) conflicting with another?
(In reply to Raphael C from comment #1)
> After some experimentation (also carried out by Hagen von Eitzen), it seems
> that any limit of at least 72 which is also a multiple of 4 causes the same
> optimisation problem. That is the loop is *not* optimised out in these
> cases.
>
> Perhaps this is an example of one optimisation (SIMD vectorisation)
> conflicting with another?

IMHO, this is "scev propagation" on floating point values when -ffast-math is enabled. We shouldn't rely on the vectorizer to propagate the final value; it should be done somewhere before the vectorizer. Thanks.
In this case it is complete unrolling that can estimate the non-vector code to constant fold but not the vectorized code. OTOH it's quite excessive work done by the unroller when doing this for large N... And yes, SCEV final value replacement doesn't know how to handle float reductions (we have a different PR for that). Let's track the unroll cost model issue here (I believe I have partial patches for that somewhere, so mine).
(In reply to Richard Biener from comment #3)
> In this case it is complete unrolling that can estimate the non-vector code
> to constant fold but not the vectorized code. OTOH it's quite excessive
> work done by the unroller when doing this for large N...
>
> And yes, SCEV final value replacement doesn't know how to handle float
> reductions (we have a different PR for that).

Doesn't handle float reductions nor vector (integer or float) reductions. Even the vector ones would be useful: if e.g. every iteration adds a VECTOR_CST or similar to a vector, then it could still be nicely optimized.

For the 202 case, it seems we are generating a scalar loop epilogue (not needed for 200), and somehow something in the vectorizer is actually able to figure out the floating point final value, because we get:
  # p_2 = PHI <2.01e+2(5), p_12(7)>
  # i_3 = PHI <200(5), i_13(7)>
on the scalar loop epilogue. So if something in the vectorizer is able to figure it out, why can't it just use that even in the case where no epilogue loop is needed?
(In reply to Jakub Jelinek from comment #4)
> Doesn't handle float reductions nor vector (integer or float) reductions.
> Even the vector ones would be useful: if e.g. every iteration adds a
> VECTOR_CST or similar to a vector, then it could still be nicely optimized.

The integer version should already be supported now.

> For the 202 case, it seems we are generating a scalar loop epilogue (not
> needed for 200), and somehow something in the vectorizer is actually able
> to figure out the floating point final value, because we get:
>   # p_2 = PHI <2.01e+2(5), p_12(7)>
>   # i_3 = PHI <200(5), i_13(7)>
> on the scalar loop epilogue. So if something in the vectorizer is able to
> figure it out, why can't it just use that even in the case where no
> epilogue loop is needed?

IIUC, scev-ccp should be made a query-based interface so that it can be called for each loop closed phi at different compilation stages. It also needs to be extended to cover basic floating point cases like this. Effectively, it needs to do the same transformation as the vectorizer does now; I just thought this might be a better place to do it.
On Mon, 13 Feb 2017, amker at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79460
>
> --- Comment #5 from amker at gcc dot gnu.org ---
> IIUC, scev-ccp should be made a query-based interface so that it can be
> called for each loop closed phi at different compilation stages. It also
> needs to be extended to cover basic floating point cases like this.
> Effectively, it needs to do the same transformation as the vectorizer does
> now; I just thought this might be a better place to do it.

Yeah, the vectorizer does this in vect_update_ivs_after_vectorizer by accident I think - it sees the float "IV" and replaces the prologue loop init by init + niter * step, which is on the border of invalid (without -ffp-contract=on/fast).
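The transformation Richard describes can be sketched at the C level (function names are mine, not GCC's): the loop over the float "IV" is replaced by one multiply and one add. For this testcase every value is a small integer, so the two agree exactly; in general they can round differently, which is why the replacement is "on the border of invalid".

```c
/* Closed-form final value of a float induction variable, as the
   vectorizer effectively computes it: init + niter * step.
   One multiply plus one add instead of niter additions.  */
static float final_value(float init, float step, int niter)
{
  return init + (float) niter * step;
}

/* Reference: the value the loop actually computes.  */
static float by_iteration(float init, float step, int niter)
{
  float p = init;
  for (int i = 0; i < niter; i++)
    p += step;
  return p;
}
/* For this PR's testcase the two agree exactly:
   final_value(1.0f, 1.0f, 200) == by_iteration(1.0f, 1.0f, 200) == 201.0f */
```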
At least if the vectorizer can do this then final value replacement can do so as well with

Index: gcc/tree-scalar-evolution.c
===================================================================
--- gcc/tree-scalar-evolution.c	(revision 245417)
+++ gcc/tree-scalar-evolution.c	(working copy)
@@ -3718,13 +3718,6 @@ final_value_replacement_loop (struct loo
 	  continue;
 	}
 
-      if (!POINTER_TYPE_P (TREE_TYPE (def))
-	  && !INTEGRAL_TYPE_P (TREE_TYPE (def)))
-	{
-	  gsi_next (&psi);
-	  continue;
-	}
-
       bool folded_casts;
       def = analyze_scalar_evolution_in_loop (ex_loop, loop,
					      def, &folded_casts);

(rather than removing the condition, replace it with a validity check - like FP contraction? etc...). But ideally SCEV itself would contain those (or compute exact results with rounding effects). Like maybe simply

Index: gcc/tree-scalar-evolution.c
===================================================================
--- gcc/tree-scalar-evolution.c	(revision 245417)
+++ gcc/tree-scalar-evolution.c	(working copy)
@@ -3718,8 +3718,10 @@ final_value_replacement_loop (struct loo
 	  continue;
 	}
 
-      if (!POINTER_TYPE_P (TREE_TYPE (def))
-	  && !INTEGRAL_TYPE_P (TREE_TYPE (def)))
+      if (! (POINTER_TYPE_P (TREE_TYPE (def))
+	     || INTEGRAL_TYPE_P (TREE_TYPE (def))
+	     || (FLOAT_TYPE_P (TREE_TYPE (def))
+		 && flag_fp_contract_mode == FP_CONTRACT_FAST)))
 	{
 	  gsi_next (&psi);
 	  continue;

Richard.
Shouldn't it (both in the vectorizer and in scev) be dependent not just on flag_fp_contract_mode but also on some -ffast-math subflag? Doing several additions can e.g. raise different exceptions and have different roundings from doing it as just one multiply.
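Jakub's concern is concrete: a chain of float additions rounds at every step, while the closed-form multiply rounds only once, so the two can yield different bit patterns (and raise different exceptions along the way). A small illustration, with the step 0.1f chosen by me for this sketch:

```c
/* Summing 0.1f ten times accumulates a rounding error at each
   addition; 0.1f is not exactly representable in binary float.  */
static float sum_by_addition(float step, int n)
{
  float s = 0.0f;
  for (int i = 0; i < n; i++)
    s += step;
  return s;
}

/* The closed form rounds only once, in the final multiply.  */
static float sum_by_multiply(float step, int n)
{
  return (float) n * step;
}
/* sum_by_multiply(0.1f, 10) == 1.0f, but sum_by_addition(0.1f, 10)
   is slightly above 1.0f - the results differ, so replacing the loop
   by the multiply changes the computed value.  */
```

This is why final value replacement of float reductions needs an unsafe-math-style flag rather than being done unconditionally.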
On Tue, 14 Feb 2017, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79460
>
> --- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> Shouldn't it (both in the vectorizer and in scev) be dependent not just on
> flag_fp_contract_mode but also on some -ffast-math subflag? Doing several
> additions can e.g. raise different exceptions and have different roundings
> from doing it as just one multiply.

Maybe, but I don't see which (apart from flag_unsafe_math_optimizations of course, but we'd want to get rid of that...). reassoc uses flag_unsafe_math_optimizations (and does this for SCALAR_FLOAT_TYPE_P only). The docs only mention FMA for fp-contract...
(In reply to rguenther@suse.de from comment #8)
> Maybe, but I don't see which (apart from flag_unsafe_math_optimizations
> of course, but we'd want to get rid of that...). reassoc uses
> flag_unsafe_math_optimizations (and does this for SCALAR_FLOAT_TYPE_P
> only). The docs only mention FMA for fp-contract...

fp-contract is, I believe, only about whether you are allowed to transform x = a + b * c; into an FMA (or similar expressions), so I think it isn't even related to this. Even if we want to get rid of flag_unsafe_math_optimizations eventually, it is better to use that if we don't know what exactly, rather than emit wrong code with -fno-fast-math; we'll have to eventually add some new flags or figure it out.
(In reply to Jakub Jelinek from comment #9)
> fp-contract is, I believe, only about whether you are allowed to transform
> x = a + b * c; into an FMA (or similar expressions), so I think it isn't
> even related to this. Even if we want to get rid of
> flag_unsafe_math_optimizations eventually, it is better to use that if we
> don't know what exactly, rather than emit wrong code with -fno-fast-math;
> we'll have to eventually add some new flags or figure it out.

I also think we need flag_unsafe_math_optimizations or similar for this; it is a risky transformation and may put iterative numerical algorithms in danger.
On Tue, 14 Feb 2017, amker at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79460
>
> --- Comment #10 from amker at gcc dot gnu.org ---
> I also think we need flag_unsafe_math_optimizations or similar for this;
> it is a risky transformation and may put iterative numerical algorithms in
> danger.

And here's that unroll heuristic patch (cleaned up a bit). It makes us consider all operations on constants (where SCEV can compute an evolution) as constant, and actually also folds them in the poor-man's cprop code. It only works up to 16 iterations of course (thus, for the testcase, for an upper bound of 16*4). Kind of unrelated to this PR of course.
Index: gcc/tree-ssa-loop-ivcanon.c
===================================================================
--- gcc/tree-ssa-loop-ivcanon.c	(revision 245417)
+++ gcc/tree-ssa-loop-ivcanon.c	(working copy)
@@ -157,8 +157,6 @@ struct loop_size
 static bool
 constant_after_peeling (tree op, gimple *stmt, struct loop *loop)
 {
-  affine_iv iv;
-
   if (is_gimple_min_invariant (op))
     return true;
@@ -189,11 +187,9 @@ constant_after_peeling
     }
 
   /* Induction variables are constants.  */
-  if (!simple_iv (loop, loop_containing_stmt (stmt), op, &iv, false))
-    return false;
-  if (!is_gimple_min_invariant (iv.base))
-    return false;
-  if (!is_gimple_min_invariant (iv.step))
+  tree ev = analyze_scalar_evolution (loop, op);
+  if (chrec_contains_undetermined (ev)
+      || chrec_contains_symbols (ev))
     return false;
   return true;
 }
@@ -1259,7 +1255,7 @@ propagate_constants_for_unrolling (basic
       if (! SSA_NAME_OCCURS_IN_ABNORMAL_PHI (result)
 	  && gimple_phi_num_args (phi) == 1
-	  && TREE_CODE (arg) == INTEGER_CST)
+	  && CONSTANT_CLASS_P (arg))
 	{
 	  replace_uses_by (result, arg);
 	  gsi_remove (&gsi, true);
@@ -1276,7 +1272,7 @@ propagate_constants_for_unrolling (basic
       tree lhs;
 
       if (is_gimple_assign (stmt)
-	  && gimple_assign_rhs_code (stmt) == INTEGER_CST
+	  && TREE_CODE_CLASS (gimple_assign_rhs_code (stmt)) == tcc_constant
	  && (lhs = gimple_assign_lhs (stmt),
	      TREE_CODE (lhs) == SSA_NAME)
 	  && !SSA_NAME_OCCURS_IN_ABNORMAL_PHI (lhs))
 	{
(In reply to Jakub Jelinek from comment #7)
> Shouldn't it (both in the vectorizer and in scev) be dependent not just on
> flag_fp_contract_mode but also on some -ffast-math subflag? Doing several
> additions can e.g. raise different exceptions and have different roundings
> from doing it as just one multiply.

This needs -fassociative-math I think?