[PATCH] Fix PR66313
Richard Biener
rguenther@suse.de
Tue Jun 13 12:03:00 GMT 2017
On Tue, 13 Jun 2017, Bin.Cheng wrote:
> On Tue, Jun 13, 2017 at 12:48 PM, Richard Biener <rguenther@suse.de> wrote:
> > On Tue, 13 Jun 2017, Richard Sandiford wrote:
> >
> >> Richard Biener <rguenther@suse.de> writes:
> >> > On Tue, 13 Jun 2017, Richard Sandiford wrote:
> >> >> Richard Biener <rguenther@suse.de> writes:
> >> >> > So I've come back to PR66313 and found a solution to the tailrecursion
> >> >> > missed optimization when fixing the factoring folding to use an unsigned
> >> >> > type when we're not sure of overflow.
> >> >> >
> >> >> > The folding part is identical to my last try from 2015, the tailrecursion
> >> >> > part makes us handle intermittent stmts that were introduced by
> >> >> > foldings that "clobber" our walk of the single-use chain of stmts
> >> >> > between the call and the return (previously we failed at all stmts
> >> >> > that are not part of said chain).  A simple solution is to move the
> >> >> > stmts that are not part of the chain, and that we can safely move,
> >> >> > before the call.  That handles the leaf conversions that now appear
> >> >> > for tree-ssa/tailrecursion-6.c.
> >> >> >
> >> >> > Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.
> >> >> >
> >> >> > Richard.
> >> >> >
> >> >> > 2017-05-31 Richard Biener <rguenther@suse.de>
> >> >> >
> >> >> > PR middle-end/66313
> >> >> > * fold-const.c (fold_plusminus_mult_expr): If the factored
> >> >> > factor may be zero use a wrapping type for the inner operation.
> >> >> > * tree-tailcall.c (independent_of_stmt_p): Pass in to_move bitmap
> >> >> > and handle moved defs.
> >> >> > (process_assignment): Properly guard the unary op case.  Return a
> >> >> > tri-state indicating that moving the stmt before the call may allow
> >> >> > us to continue.  Pass through to_move.
> >> >> > (find_tail_calls): Handle moving unrelated defs before
> >> >> > the call.
> >> >> >
> >> >> > * c-c++-common/ubsan/pr66313.c: New testcase.
> >> >> > * gcc.dg/tree-ssa/loop-15.c: Adjust.
> >> >> >
> >> >> > Index: gcc/fold-const.c
> >> >> > ===================================================================
> >> >> > *** gcc/fold-const.c.orig 2015-10-29 12:32:33.302782318 +0100
> >> >> > --- gcc/fold-const.c 2015-10-29 14:08:39.936497739 +0100
> >> >> > *************** fold_plusminus_mult_expr (location_t loc
> >> >> > *** 6916,6925 ****
> >> >> > }
> >> >> > same = NULL_TREE;
> >> >> >
> >> >> > ! if (operand_equal_p (arg01, arg11, 0))
> >> >> > ! same = arg01, alt0 = arg00, alt1 = arg10;
> >> >> > ! else if (operand_equal_p (arg00, arg10, 0))
> >> >> > same = arg00, alt0 = arg01, alt1 = arg11;
> >> >> > else if (operand_equal_p (arg00, arg11, 0))
> >> >> > same = arg00, alt0 = arg01, alt1 = arg10;
> >> >> > else if (operand_equal_p (arg01, arg10, 0))
> >> >> > --- 6916,6926 ----
> >> >> > }
> >> >> > same = NULL_TREE;
> >> >> >
> >> >> > ! /* Prefer factoring a common non-constant. */
> >> >> > ! if (operand_equal_p (arg00, arg10, 0))
> >> >> > same = arg00, alt0 = arg01, alt1 = arg11;
> >> >> > + else if (operand_equal_p (arg01, arg11, 0))
> >> >> > + same = arg01, alt0 = arg00, alt1 = arg10;
> >> >> > else if (operand_equal_p (arg00, arg11, 0))
> >> >> > same = arg00, alt0 = arg01, alt1 = arg10;
> >> >> > else if (operand_equal_p (arg01, arg10, 0))
> >> >> > *************** fold_plusminus_mult_expr (location_t loc
> >> >> > *** 6974,6987 ****
> >> >> > }
> >> >> > }
> >> >> >
> >> >> > ! if (same)
> >> >> > return fold_build2_loc (loc, MULT_EXPR, type,
> >> >> > fold_build2_loc (loc, code, type,
> >> >> > fold_convert_loc (loc, type, alt0),
> >> >> > fold_convert_loc (loc, type, alt1)),
> >> >> > fold_convert_loc (loc, type, same));
> >> >> >
> >> >> > ! return NULL_TREE;
> >> >> > }
> >> >> >
> >> >> > /* Subroutine of native_encode_expr. Encode the INTEGER_CST
> >> >> > --- 6975,7010 ----
> >> >> > }
> >> >> > }
> >> >> >
> >> >> > ! if (!same)
> >> >> > ! return NULL_TREE;
> >> >> > !
> >> >> > ! if (! INTEGRAL_TYPE_P (type)
> >> >> > ! || TYPE_OVERFLOW_WRAPS (type)
> >> >> > ! /* We are neither factoring zero nor minus one. */
> >> >> > ! || TREE_CODE (same) == INTEGER_CST)
> >> >> > return fold_build2_loc (loc, MULT_EXPR, type,
> >> >> > fold_build2_loc (loc, code, type,
> >> >> > fold_convert_loc (loc, type, alt0),
> >> >> > fold_convert_loc (loc, type, alt1)),
> >> >> > fold_convert_loc (loc, type, same));
> >> >> >
> >> >> > ! /* Same may be zero and thus the operation 'code' may overflow. Likewise
> >> >> > ! same may be minus one and thus the multiplication may overflow. Perform
> >> >> > ! the operations in an unsigned type. */
> >> >> > ! tree utype = unsigned_type_for (type);
> >> >> > ! tree tem = fold_build2_loc (loc, code, utype,
> >> >> > ! fold_convert_loc (loc, utype, alt0),
> >> >> > ! fold_convert_loc (loc, utype, alt1));
> >> >> > ! /* If the sum evaluated to a constant that is not -INF the multiplication
> >> >> > ! cannot overflow. */
> >> >> > ! if (TREE_CODE (tem) == INTEGER_CST
> >> >> > ! && ! wi::eq_p (tem, wi::min_value (TYPE_PRECISION (utype), SIGNED)))
> >> >> > ! return fold_build2_loc (loc, MULT_EXPR, type,
> >> >> > ! fold_convert (type, tem), same);
> >> >> > !
> >> >> > ! return fold_convert_loc (loc, type,
> >> >> > ! fold_build2_loc (loc, MULT_EXPR, utype, tem,
> >> >> > ! fold_convert_loc (loc, utype, same)));
> >> >> > }
> >> >> >
> >> >> > /* Subroutine of native_encode_expr. Encode the INTEGER_CST
> >> >>
> >> >> Sorry if you already know, but this part means that we can no longer
> >> >> vectorise:
> >> >>
> >> >> int
> >> >> f (int *x, int b1, int b2, int b3)
> >> >> {
> >> >> int foo = 0;
> >> >> for (int i1 = 0; i1 < b1; ++i1)
> >> >> for (int i2 = 0; i2 < b2; ++i2)
> >> >> for (int i3 = 0; i3 < b3; ++i3)
> >> >> foo += x[i1 * b2 * b3 + i2 * b3 + (i3 - 1)];
> >> >> return foo;
> >> >> }
> >> >>
> >> >> We now convert all the arithmetic in the [...] to unsigned int
> >> >> and reassociate it so that the "- 1" is applied last. We then assume
> >> >> that the overflow in this subtraction is well-defined and that the
> >> >> &x[...] might not be linear for the inner loop.
> >> >
> >> > Can you file a PR?
> >>
> >> OK, filed as PR81082.
> >>
> >> > # i3_44 = PHI <i3_32(6), 0(4)>
> >> > i3.2_7 = (unsigned int) i3_44;
> >> > _9 = i3.2_7 + _34;
> >> > _10 = (int) _9;
> >> > _11 = (long unsigned int) _10;
> >> > _12 = _11 * 4;
> >> > _13 = x_30(D) + _12;
> >> > _14 = *_13;
> >> > ...
> >> > i3_32 = i3_44 + 1;
> >> >
> >> > so we have *(x_30(D) + 4 * ((sizetype)(int)((unsigned){ 0, + , 1} + X))
> >> >
> >> > It might mean that we need to avoid this re-association. Not sure how
> >> > to detect worthwhile vs. not worthwhile cases during folding. At least
> >> > I see no easy way to recover from it in SCEV analysis unless the
> >> > number of iterations is constrained more than in your example.
> >>
> >> Yeah, in the example that this was reduced from, not reassociating
> >> is far more preferable to changing the types. But like you say,
> >> I've no idea how we'd know that at this stage.
> >
> > In the past I played with not obfuscating things so early during FE
> > lowering, namely not exposing the * 4 but keeping the original types of
> > array / pointer indexes.  On the original mem-ref branch I had
> > an IDX_EXPR that allowed chaining those (a widen-multiply-accumulate
> > op).
> >
> > That said, the context where this re-association is not interesting is
> > the address context, with the multiplication to the byte offset.
> >
> > Of course in your case there's also b3 factored out, which looks
> > like a generally interesting transform (though possibly harmful in
> > address context again).
> >
> > From the FE we get
> >
> > *(x + (sizetype) ((long unsigned int) (((i1 * b2) * b3 + i2 * b3) + (i3 +
> > -1)) * 4))
> >
> > and SCEV uses the fact that the mults/adds have undefined behavior
> > on overflow. The same missed optimization occurs if you make all those
> > vars unsigned (with or without the folding in question):
> But with an unsigned type it's not a missed optimization, because the
> computation could overflow and result in a non-SCEV address.
Correct.
> In this
> case the loop needs to be versioned under overflow/non-overflow to be
> parallelized/vectorized, right?  Or split for cases where we can easily
> infer the boundary condition of overflow.
Yes.
Richard.
> Thanks,
> bin
> >
> > #define U unsigned
> > int
> > f (int *x, U int b1, U int b2, U int b3)
> > {
> > int foo = 0;
> > for (U int i1 = 0; i1 < b1; ++i1)
> > for (U int i2 = 0; i2 < b2; ++i2)
> > for (U int i3 = 0; i3 < b3; ++i3)
> > foo += x[i1 * b2 * b3 + i2 * b3 + (i3 - 1)];
> > return foo;
> > }
> >
> > It works again for unsigned long, as there can be no valid object
> > large enough for the computation to wrap around.
> >
> > Richard.
> >
>
>
--
Richard Biener <rguenther@suse.de>
SUSE LINUX GmbH, GF: Felix Imendoerffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nuernberg)