Consider:

float f(float x[]) {
  float p = 1.0;
  for (int i = 0; i < 202; i++)
    p += 1;
  return p;
}

and

float f(float x[]) {
  float p = 1.0;
  for (int i = 0; i < 200; i++)
    p += 1;
  return p;
}

both compiled in gcc 7 (20170210 snapshot) with -Ofast.

In the former case (the 202 case) you get:

f:
        movss   xmm0, DWORD PTR .LC0[rip]
        ret
.LC0:
        .long   1128988672

In the latter case (the 200 case) you get:

f:
        movaps  xmm0, XMMWORD PTR .LC0[rip]
        xor     eax, eax
        movaps  xmm3, XMMWORD PTR .LC1[rip]
        movaps  xmm2, XMMWORD PTR .LC2[rip]
.L2:
        movaps  xmm1, xmm0
        add     eax, 1
        cmp     eax, 50
        addps   xmm0, xmm3
        addps   xmm1, xmm2
        jne     .L2
        shufps  xmm1, xmm1, 255
        movaps  xmm0, xmm1
        ret
.LC0:
        .long   1065353216
        .long   1073741824
        .long   1077936128
        .long   1082130432
.LC1:
        .long   1082130432
        .long   1082130432
        .long   1082130432
        .long   1082130432
.LC2:
        .long   1065353216
        .long   1065353216
        .long   1065353216
        .long   1065353216

There are a lot of other pairs of consecutive even-numbered limits where one optimizes well and the other doesn't, for example 194 and 196.
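The constants decode consistently: 1128988672 is the bit pattern of 203.0f (the exact result of 202 additions of 1 starting from 1.0), and the .LC0/.LC1/.LC2 vectors are {1,2,3,4}, {4,4,4,4}, and {1,1,1,1}. The following standalone C sketch (function names are mine, not from the PR) checks the scalar results and emulates what the vectorized 200-case loop computes:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret an IEEE-754 single-precision bit pattern as a float.  */
static float bits_to_float(uint32_t bits)
{
  float f;
  memcpy(&f, &bits, sizeof f);
  return f;
}

/* Scalar reference: p starts at 1.0 and 1.0 is added n times.  Every
   intermediate value is a small integer, exactly representable in
   float, so the result is exact: 1.0 + n.  */
static float f_scalar(int n)
{
  float p = 1.0f;
  for (int i = 0; i < n; i++)
    p += 1;
  return p;
}

/* Emulation of the vectorized asm for the 200 case: xmm0 starts as
   {1,2,3,4} (.LC0), the loop runs 50 times adding {4,4,4,4} (.LC1),
   xmm1 holds xmm0 + {1,1,1,1} (.LC2), and shufps xmm1, xmm1, 255
   broadcasts lane 3, which becomes the returned scalar.  */
static float f_vectorized_200(void)
{
  float acc[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
  float res[4] = { 0 };
  for (int i = 0; i < 50; i++)
    {
      for (int l = 0; l < 4; l++)
        res[l] = acc[l] + 1.0f;   /* addps xmm1, .LC2 */
      for (int l = 0; l < 4; l++)
        acc[l] += 4.0f;           /* addps xmm0, .LC1 */
    }
  return res[3];                  /* shufps xmm1, xmm1, 255 */
}
/* bits_to_float(1128988672u) == 203.0f == f_scalar(202);
   f_vectorized_200() == 201.0f == f_scalar(200)  */
```

So the vectorized loop does compute the correct value (201.0f for the 200 case); the complaint is only that it is computed at run time instead of being folded to a constant as in the 202 case.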
After some experimentation (also carried out by Hagen von Eitzen), it seems that any limit of at least 72 which is also a multiple of 4 causes the same optimisation problem; that is, the loop is *not* optimised out in these cases. Perhaps this is an example of one optimisation (SIMD vectorisation) conflicting with another?
(In reply to Raphael C from comment #1)
> After some experimentation (also carried out by Hagen von Eitzen), it seems
> that any limit of at least 72 which is also a multiple of 4 causes the same
> optimisation problem. That is the loop is *not* optimised out in these
> cases.
>
> Perhaps this is an example of one optimisation (SIMD vectorisation)
> conflicting with another?

IMHO, this is "scev propagation" on floating point values when -ffast-math is enabled. We shouldn't rely on the vectorizer to propagate the final value; it should be done somewhere before the vectorizer. Thanks.
In this case it is complete unrolling that can estimate the non-vector code to constant fold but not the vectorized code. OTOH it's quite excessive work done by the unroller when doing this for large N... And yes, SCEV final value replacement doesn't know how to handle float reductions (we have a different PR for that). Let's track the unroll cost model issue here (I believe I have partial patches for that somewhere, so mine).
(In reply to Richard Biener from comment #3)
> In this case it is complete unrolling that can estimate the non-vector code
> to constant fold but not the vectorized code. OTOH it's quite excessive
> work done by the unroller when doing this for large N...
>
> And yes, SCEV final value replacement doesn't know how to handle float
> reductions (we have a different PR for that).

Doesn't handle float reductions nor vector (integer or float) reductions. Even the vector ones would be useful: if e.g. every iteration adds a VECTOR_CST or similar to a vector, then it could still be nicely optimized.

For the 202 case, it seems we are generating a scalar loop epilogue (not needed for 200), and somehow something in the vectorizer is actually able to figure out the floating point final value, because we get:
  # p_2 = PHI <2.01e+2(5), p_12(7)>
  # i_3 = PHI <200(5), i_13(7)>
on the scalar loop epilogue. So if something in the vectorizer is able to figure it out, why can't it just use that even in the case where no epilogue loop is needed?
(In reply to Jakub Jelinek from comment #4)
> Doesn't handle float reductions nor vector (integer or float) reductions.
> Even the vector ones would be useful: if e.g. every iteration adds a
> VECTOR_CST or similar to a vector, then it could still be nicely optimized.

The integer version should already be supported now.

> For the 202 case, it seems we are generating a scalar loop epilogue (not
> needed for 200), and somehow something in the vectorizer is actually able
> to figure out the floating point final value, because we get:
>   # p_2 = PHI <2.01e+2(5), p_12(7)>
>   # i_3 = PHI <200(5), i_13(7)>
> on the scalar loop epilogue. So if something in the vectorizer is able to
> figure it out, why can't it just use that even in the case where no
> epilogue loop is needed?

IIUC, scev-ccp should be made a query-based interface so that it can be called for each loop closed phi at different compilation stages. It also needs to be extended to cover basic floating point cases like this. Effectively, it needs to do the same transformation as the vectorizer does now; I just thought this might be a better place to do it.
On Mon, 13 Feb 2017, amker at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79460
>
> --- Comment #5 from amker at gcc dot gnu.org ---
> IIUC, scev-ccp should be made a query-based interface so that it can be
> called for each loop closed phi at different compilation stages. It also
> needs to be extended to cover basic floating point cases like this.
> Effectively, it needs to do the same transformation as the vectorizer does
> now; I just thought this might be a better place to do it.

Yeah, the vectorizer does this in vect_update_ivs_after_vectorizer by accident I think - it sees the float "IV" and replaces the prologue loop init by init + niter * step, which is on the border of invalid (without -ffp-contract=on/fast).
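The transformation Richard describes can be sketched at the C level (function names are mine, not GCC's): the loop over the float "IV" is replaced by one multiply and one add. For this testcase every value is a small integer, so the two agree exactly; in general they can round differently, which is why the replacement is "on the border of invalid".

```c
/* Closed-form final value of a float induction variable, as the
   vectorizer effectively computes it: init + niter * step.
   One multiply plus one add instead of niter additions.  */
static float final_value(float init, float step, int niter)
{
  return init + (float) niter * step;
}

/* Reference: the value the loop actually computes.  */
static float by_iteration(float init, float step, int niter)
{
  float p = init;
  for (int i = 0; i < niter; i++)
    p += step;
  return p;
}
/* For this PR's testcase the two agree exactly:
   final_value(1.0f, 1.0f, 200) == by_iteration(1.0f, 1.0f, 200) == 201.0f */
```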
At least if the vectorizer can do this then final value replacement can do so as well with

Index: gcc/tree-scalar-evolution.c
===================================================================
--- gcc/tree-scalar-evolution.c	(revision 245417)
+++ gcc/tree-scalar-evolution.c	(working copy)
@@ -3718,13 +3718,6 @@ final_value_replacement_loop (struct loo
 	  continue;
 	}
 
-      if (!POINTER_TYPE_P (TREE_TYPE (def))
-	  && !INTEGRAL_TYPE_P (TREE_TYPE (def)))
-	{
-	  gsi_next (&psi);
-	  continue;
-	}
-
       bool folded_casts;
       def = analyze_scalar_evolution_in_loop (ex_loop, loop,
					      def, &folded_casts);

(rather than removing the condition, replace it with a validity check - like FP contraction? etc...). But ideally SCEV itself would contain those (or compute exact results with rounding effects). Like maybe simply

Index: gcc/tree-scalar-evolution.c
===================================================================
--- gcc/tree-scalar-evolution.c	(revision 245417)
+++ gcc/tree-scalar-evolution.c	(working copy)
@@ -3718,8 +3718,10 @@ final_value_replacement_loop (struct loo
 	  continue;
 	}
 
-      if (!POINTER_TYPE_P (TREE_TYPE (def))
-	  && !INTEGRAL_TYPE_P (TREE_TYPE (def)))
+      if (! (POINTER_TYPE_P (TREE_TYPE (def))
+	     || INTEGRAL_TYPE_P (TREE_TYPE (def))
+	     || (FLOAT_TYPE_P (TREE_TYPE (def))
+		 && flag_fp_contract_mode == FP_CONTRACT_FAST)))
 	{
 	  gsi_next (&psi);
 	  continue;

Richard.
Shouldn't it (both in the vectorizer and in scev) be dependent not just on flag_fp_contract_mode but also on some -ffast-math subflag? Doing several additions can e.g. raise different exceptions and have different roundings from doing it as just one multiply.
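Jakub's concern is concrete: a chain of float additions rounds at every step, while the closed-form multiply rounds only once, so the two can yield different bit patterns (and raise different exceptions along the way). A small illustration, with the step 0.1f chosen by me for this sketch:

```c
/* Summing 0.1f ten times accumulates a rounding error at each
   addition; 0.1f is not exactly representable in binary float.  */
static float sum_by_addition(float step, int n)
{
  float s = 0.0f;
  for (int i = 0; i < n; i++)
    s += step;
  return s;
}

/* The closed form rounds only once, in the final multiply.  */
static float sum_by_multiply(float step, int n)
{
  return (float) n * step;
}
/* sum_by_multiply(0.1f, 10) == 1.0f, but sum_by_addition(0.1f, 10)
   is slightly above 1.0f - the results differ, so replacing the loop
   by the multiply changes the computed value.  */
```

This is why final value replacement of float reductions needs an unsafe-math-style flag rather than being done unconditionally.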
On Tue, 14 Feb 2017, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79460
>
> --- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> Shouldn't it (both in the vectorizer and in scev) be dependent not just on
> flag_fp_contract_mode but also on some -ffast-math subflag? Doing several
> additions can e.g. raise different exceptions and have different roundings
> from doing it as just one multiply.

Maybe, but I don't see which (apart from flag_unsafe_math_optimizations of course, but we'd want to get rid of that...). reassoc uses flag_unsafe_math_optimizations (and does this for SCALAR_FLOAT_TYPE_P only). The docs only mention FMA for fp-contract...
(In reply to rguenther@suse.de from comment #8)
> Maybe, but I don't see which (apart from flag_unsafe_math_optimizations
> of course, but we'd want to get rid of that...). reassoc uses
> flag_unsafe_math_optimizations (and does this for SCALAR_FLOAT_TYPE_P
> only). The docs only mention FMA for fp-contract...

fp-contract is, I believe, only about whether you are allowed to transform x = a + b * c; into an FMA (or similar expressions), so I think it isn't even related to this. Even if we want to get rid of flag_unsafe_math_optimizations eventually, it is better to use that if we don't know what exactly, rather than emit wrong code with -fno-fast-math; we'll have to eventually add some new flags or figure it out.
(In reply to Jakub Jelinek from comment #9)
> fp-contract is, I believe, only about whether you are allowed to transform
> x = a + b * c; into an FMA (or similar expressions), so I think it isn't
> even related to this. Even if we want to get rid of
> flag_unsafe_math_optimizations eventually, it is better to use that if we
> don't know what exactly, rather than emit wrong code with -fno-fast-math;
> we'll have to eventually add some new flags or figure it out.

I also think we need flag_unsafe_math_optimizations or similar for this; it is a risky transformation and may put iterative numerical algorithms in danger.
On Tue, 14 Feb 2017, amker at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79460
>
> --- Comment #10 from amker at gcc dot gnu.org ---
> I also think we need flag_unsafe_math_optimizations or similar for this;
> it is a risky transformation and may put iterative numerical algorithms in
> danger.

And here's that unroll heuristic patch (cleaned up a bit). It makes us consider all operations on constants (where SCEV can compute an evolution) as constant, and actually also folds them in the poor-man's cprop code. It only works up to 16 iterations of course (thus, for the testcase, for an upper bound of 16*4). Kind of unrelated to this PR of course.
Index: gcc/tree-ssa-loop-ivcanon.c
===================================================================
--- gcc/tree-ssa-loop-ivcanon.c	(revision 245417)
+++ gcc/tree-ssa-loop-ivcanon.c	(working copy)
@@ -157,8 +157,6 @@ struct loop_size
 static bool
 constant_after_peeling (tree op, gimple *stmt, struct loop *loop)
 {
-  affine_iv iv;
-
   if (is_gimple_min_invariant (op))
     return true;
@@ -189,11 +187,9 @@ constant_after_peeling
     }
 
   /* Induction variables are constants.  */
-  if (!simple_iv (loop, loop_containing_stmt (stmt), op, &iv, false))
-    return false;
-  if (!is_gimple_min_invariant (iv.base))
-    return false;
-  if (!is_gimple_min_invariant (iv.step))
+  tree ev = analyze_scalar_evolution (loop, op);
+  if (chrec_contains_undetermined (ev)
+      || chrec_contains_symbols (ev))
     return false;
   return true;
 }
@@ -1259,7 +1255,7 @@ propagate_constants_for_unrolling (basic
       if (! SSA_NAME_OCCURS_IN_ABNORMAL_PHI (result)
 	  && gimple_phi_num_args (phi) == 1
-	  && TREE_CODE (arg) == INTEGER_CST)
+	  && CONSTANT_CLASS_P (arg))
 	{
 	  replace_uses_by (result, arg);
 	  gsi_remove (&gsi, true);
@@ -1276,7 +1272,7 @@ propagate_constants_for_unrolling (basic
       tree lhs;
 
       if (is_gimple_assign (stmt)
-	  && gimple_assign_rhs_code (stmt) == INTEGER_CST
+	  && TREE_CODE_CLASS (gimple_assign_rhs_code (stmt)) == tcc_constant
	  && (lhs = gimple_assign_lhs (stmt),
	      TREE_CODE (lhs) == SSA_NAME)
 	  && !SSA_NAME_OCCURS_IN_ABNORMAL_PHI (lhs))
 	{
(In reply to Jakub Jelinek from comment #7)
> Shouldn't it (both in the vectorizer and in scev) be dependent not just on
> flag_fp_contract_mode but also on some -ffast-math subflag? Doing several
> additions can e.g. raise different exceptions and have different roundings
> from doing it as just one multiply.

This needs -fassociative-math I think?