Since ca. 20140818 (r214099), gcc.dg/graphite/vect-pr43423.c FAILs on SPARC.
I'm attaching the tree dump.

	Rainer
Created attachment 33425 [details] vect dump
Also appears on arm and aarch64
Also x86_64. Should be at least investigated.
7: note: === vect_analyze_data_refs ===
Creating dr for a[_56]
analyze_innermost: failed: evolution of offset is not affine.
        base_address:
        offset from base address:
        constant offset from base address:
        step:
        aligned to:
        base_object: a
        Access function 0: (int) {(unsigned int) graphite_IV.5_50, +, 1}_3;

the graphite IVs are signed long:

  <bb 8>:
  _46 = (signed long) mid_6(D);
  ...

  <bb 9>:
  graphite_IV.5_50 = MAX_EXPR <_46, 0>;
  _51 = (signed long) n_5(D);
  _52 = _51 + -1;

  <bb 10>:
  # graphite_IV.5_53 = PHI <graphite_IV.5_50(9), graphite_IV.5_54(11)>
  _56 = (int) graphite_IV.5_53;
  _55 = a[_56];
  _57 = c[_56];
  _58 = _55 + _57;
  _64 = (int) graphite_IV.5_53;
  a[_64] = _58;
  graphite_IV.5_54 = graphite_IV.5_53 + 1;
  if (graphite_IV.5_53 < _52)
    goto <bb 11>;
  else
    goto <bb 13>;

  <bb 11>:
  goto <bb 10>;

(instantiate_scev
  (instantiate_below = 9)
  (evolution_loop = 3)
  (chrec = (ssizetype) (int) {(unsigned int) graphite_IV.5_50, +, 1}_3)
  (res = (ssizetype) (int) {(unsigned int) graphite_IV.5_50, +, 1}_3))

so it appears to SCEV that the evolution may wrap.  With GCC 4.9 we chose
signed int as the GRAPHITE IV.  Looks like the IV type is statically
determined by graphite_expression_type_precision as

  /* We always try to use signed 128 bit types, but fall back to smaller types
     in case a platform does not provide types of these sizes. In the future we
     should use isl to derive the optimal type for each subexpression. */
  static int max_mode_int_precision =
    GET_MODE_PRECISION (mode_for_size (MAX_FIXED_MODE_SIZE, MODE_INT, 0));
  static int graphite_expression_type_precision = 128 <= max_mode_int_precision ?
						  128 : max_mode_int_precision;

Not sure how this doesn't end up with __int128_t on x86_64, but ...

It's obviously a bad choice to use __int128_t (or even signed long)
everywhere.

Let's XFAIL this testcase if nobody fixes the underlying issue.

What does it take to "use ISL to derive the optimal type for each
subexpression"?  Is that even possible?
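For reference, the testcase appears to have roughly the following shape, reconstructed here from the dumps (this is a sketch, not the verbatim gcc.dg/graphite/vect-pr43423.c): graphite splits the conditional body into two loops, one over [0, min(n, mid)) using b[] and one over [max(mid, 0), n) using c[], which is where the MIN/MAX_EXPRs and the widened graphite_IV come from.

```c
#include <assert.h>

int a[100], b[100], c[100];

/* Reconstructed shape of the testcase (hypothetical, inferred from the
   dumps above).  Graphite versions the i < mid condition into two loops,
   introducing MIN/MAX bounds and a new, wider induction variable.  */
void foo (int n, int mid)
{
  int i;
  for (i = 0; i < n; i++)
    {
      if (i < mid)
        a[i] = a[i] + b[i];
      else
        a[i] = a[i] + c[i];
    }
}
```

After loop splitting, the first loop indexes only b[] and the second only c[], matching the two basic-block groups in the dump.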
Any progress on this? Or are we just going to xfail the testcase?
I think we have to XFAIL the testcase - the graphite folks seem to be gone again.
Graphite generates MAX/MIN expressions.

I've modified Graphite to use the original types of "n" and "mid" in MIN
and MAX, and not to generate the casts of "n" and "mid" to a longer
signed int before the MIN/MAX, and the vectorization succeeded.

It seems that this is not a Graphite problem but a scalar evolution one:
scalar evolution is not able to handle MIN/MAX expressions in the
presence of casts.  Besides vectorization, further unrolling is also
prevented.
(In reply to Mircea Namolaru from comment #7)
> Graphite generates MAX/MIN expressions.
>
> I've modified Graphite to use the original types of "n" and "mid" in MIN
> and MAX, and to not generate the casts of "n" and "mid" to a longer signed
> INT before MIN/MAX, and the vectorization succeeded.
>
> It seems that it is not a Graphite problem but a scalar evolution one.
> Scalar evolution is not able to handle MIN/MAX expressions in the presence
> of casts.  Beside vectorization also further unrolling is prevented.

Can you share a patch with that modification?  I'd like to look at the
differences it makes with respect to SCEV / niter analysis.

Note that neither SCEV nor niter analysis handle MIN/MAX_EXPRs
explicitly.  It might be that

  <bb 2>:
  if (n_5(D) > 0)
    goto <bb 4>;
  ...

  <bb 4>:
  _28 = (signed long) n_5(D);
  _29 = (signed long) mid_6(D);
  _30 = MIN_EXPR <_28, _29>;
  _31 = _30 > 0;
  if (_31 != 0)

can be simplified to mid_6(D) > 0 by expansion/folding in some way,
though, if there are no casts in the way.  Not sure.

I suppose ISL doesn't get to know that n > 0 if the loop is entered (and
doesn't exploit that knowledge)?
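The simplification suggested above rests on a simple identity, sketched here outside of GCC (min_gt_zero is a hypothetical helper, not GCC code): MIN (n, mid) > 0 holds iff both n > 0 and mid > 0, so under a dominating guard n > 0 the test collapses to mid > 0.

```c
#include <assert.h>

/* Hand-written illustration of the folding opportunity: the exit guard
   MIN_EXPR <_28, _29> > 0 is equivalent to (n > 0 && mid > 0), hence to
   mid > 0 once the dominating n > 0 check has been taken.  */
static int min_gt_zero (long n, long mid)
{
  long m = n < mid ? n : mid;
  return m > 0;
}
```

The casts to signed long in the dump are what keep fold from seeing this.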
Note that as far as vectorization is concerned the issue is that the IV
used to perform the memory accesses is not affine:

  <bb 6>:
  # graphite_IV.4_34 = PHI <0(5), graphite_IV.4_35(7)>
  _37 = (int) graphite_IV.4_34;
  _36 = a[_37];
  _38 = b[_37];
  _39 = _36 + _38;
  _45 = (int) graphite_IV.4_34;
  a[_45] = _39;
  graphite_IV.4_35 = graphite_IV.4_34 + 1;
  if (graphite_IV.4_34 < _33)
    goto <bb 7>;
  else
    goto <bb 8>;

  <bb 7>:
  goto <bb 6>;

The IV _37 is (int) { 0, +, 1 } with graphite_IV being signed long, thus
_37 may wrap if graphite_IV becomes larger than INT_MAX.  I don't see how
this is related to the MIN/MAX_EXPRs.  Thus it would be nice if ISL could
compute bounds for the IVs (and if we could input bounds on the original
IVs, of course).

VRP may be able to see that _33 is computed as

  if (n_5(D) > 0)
    goto <bb 4>;
  ...

  <bb 4>:
  _28 = (signed long) n_5(D);
  _29 = (signed long) mid_6(D);
  _30 = MIN_EXPR <_28, _29>;
  _31 = _30 > 0;
  if (_31 != 0)
    goto <bb 5>;
  ...

  <bb 5>:
  _32 = MIN_EXPR <_28, _29>;
  _33 = _32 + -1;

and thus that _33 is [0, INT_MAX - 1], but it isn't able to do that (and
VRP runs too late anyway).  VRP misses that from _30 > 0 it can infer
mid_6 > 0, but that's probably too difficult a thing to assess with it
working as an SSA propagator and ASSERT_EXPR insertion happening before
the actual propagation (it needs to know n_5 > 0 to be able to do that
insertion).  A DOM-style propagation would have fewer issues with
deriving proper value-ranges here.
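Why (int) of a signed long IV is not affine can be seen in isolation (narrow is a hypothetical illustration, not GCC code; it relies on GCC's implementation-defined modulo-2^32 semantics for out-of-range signed narrowing conversions):

```c
#include <assert.h>
#include <limits.h>

/* Sketch of the wrap problem: converting a 64-bit IV to int wraps once
   the IV passes INT_MAX (GCC defines the conversion as reduction modulo
   2^32), so the sequence of index values is not monotone and the access
   function (int) { 0, +, 1 } is not a valid affine IV.  */
static int narrow (long iv)
{
  return (int) iv;   /* jumps from INT_MAX back to INT_MIN */
}
```

This is exactly what SCEV has to rule out before data-ref analysis can treat the index as affine, which is why bounds on graphite_IV matter.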
On my Intel x86-64 platform I changed in graphite-isl-ast-to-gimple.c:

-  static int graphite_expression_type_precision = 128 <= max_mode_int_precision ?
-						  128 : max_mode_int_precision;
+  static int graphite_expression_type_precision = 32;

The computations are done on int, and you get this code for basic block 4
(and vectorization is performed):

  _28 = MIN_EXPR <n_5(D), mid_6(D)>;
  _29 = _28 > 0;
  if (_29 != 0)
    goto <bb 5>;
  else
    goto <bb 8>;

In my opinion the casts introduced cause problems for scalar evolution
(and you are right, MIN/MAX are not the problem).

I will look into two directions, and choose the quicker one for fixing
the regression:

1) Not generating casts in Graphite, if correctness is not affected (as
   in this case).  But determining when the use of a longer-size signed
   type is required is not so simple.

2) Modifying the handling of casts in scalar evolution.  But I am not
   familiar with this code.
Btw, graphite generates expressions like

  MIN_EXPR <(long)n, (long)mid> < 0

that fold does not simplify.  Adding a match.pd pattern to shorten
min/max expressions we end up with

  <bb 4>:
  _28 = MIN_EXPR <n_5(D), mid_6(D)>;
  _29 = _28 > 0;
  if (_29 != 0)
    goto <bb 5>;
  else
    goto <bb 8>;

  <bb 5>:
  _30 = MIN_EXPR <n_5(D), mid_6(D)>;
  _31 = (signed long) _30;
  _32 = _31 + -1;

but still niter analysis isn't able to constrain the max iterations to
something that SCEV could later use to prove that (int) graphite_IV
doesn't wrap:

  Analyzing # of iterations of loop 2
    exit condition [0, + , 1](no_overflow) < (signed long) _30 + -1
    bounds on difference of bases: -9223372036854775808 ... 9223372036854775806
    result:
      zero if _30 <= 0
      # of iterations (unsigned long) ((signed long) _30 + -1), bounded by 9223372036854775806
  /space/rguenther/src/svn/trunk/gcc/testsuite/gcc.dg/graphite/vect-pr43423.c:12:17: note: Symbolic number of iterations is (unsigned long) MAX_EXPR <_30, 1>

so niter analysis can be improved here, as _30 is of type int and even if
_30 were INT_MIN the upper bound would be 2147483647.  We can improve
that by stripping sign-/zero-extensions from vars in bound_difference.

With those two improvements we can vectorize the first loop.
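The shortening the match.pd pattern performs relies on the identity MIN ((long)n, (long)mid) == (long) MIN (n, mid) when both operands are widening conversions from the same narrower type. A stand-alone sketch (hand-written helpers, not the GCC pattern itself):

```c
#include <assert.h>

/* Wide form: the MIN is computed after widening both operands, as in
   MIN_EXPR <(long)n, (long)mid>.  */
static long min_wide (int n, int mid)
{
  long a = (long) n, b = (long) mid;
  return a < b ? a : b;
}

/* Shortened form the pattern rewrites to: MIN in the narrow type, with
   a single widening conversion of the result.  Widening conversions are
   monotone, so the two always agree.  */
static long min_narrow (int n, int mid)
{
  int m = n < mid ? n : mid;
  return (long) m;
}
```

Doing the MIN in int is what lets the later comparison and niter analysis see operands of the original type, without the casts in the way.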
Created attachment 34801 [details]
use VRP to interpret an expr and compute its range

The attached patch is a prototype that tries to replace the
bound_difference value-range estimation code in niter analysis with VRP.
I've chosen to replace only the case with a non-equal variable part
(which is what needs improvement for this testcase), but in theory most
of the code could be replaced.  The bound-using-guard code needs
refinement (the niter parts go to greater lengths, expanding simple
operations and whatnot - in theory we can walk up SSA use->def chains in
the recursion as well, it just has a cost).

Note that without the MIN/MAX shortening this patch doesn't help, as
niter analysis doesn't present VRP with sufficiently expanded
expressions, and of course none of the IL has value-ranges associated
with it, as graphite rewrote it completely.
Ah, expand_simple_operations will not expand the MIN/MAX_EXPRs.  If we
change that, the patch makes data-ref analysis fail differently (we can
then correctly compute bounds for the loops!), as we still may wrap with

  (int) {(unsigned int) graphite_IV.5_50, +, 1}_3;

(non-zero initial value) even with an niter bound of 2147483646.

That said, the loop conditions and guards are set up in a funny enough
way to confuse GCC ...

Index: gcc/tree-ssa-loop-niter.c
===================================================================
--- gcc/tree-ssa-loop-niter.c	(revision 220784)
+++ gcc/tree-ssa-loop-niter.c	(working copy)
@@ -1674,6 +1687,12 @@ expand_simple_operations (tree expr, tre
       ee = expand_simple_operations (e, stop);
       return fold_build2 (code, TREE_TYPE (expr), ee, e1);
 
+    case MIN_EXPR:
+    case MAX_EXPR:
+      e1 = expand_simple_operations (gimple_assign_rhs1 (stmt));
+      ee = expand_simple_operations (gimple_assign_rhs2 (stmt));
+      return fold_build2 (code, TREE_TYPE (expr), e1, ee);
+
     default:
       return expr;
     }

(not a good idea I think)

The match.pd pattern:

Index: gcc/match.pd
===================================================================
--- gcc/match.pd	(revision 220784)
+++ gcc/match.pd	(working copy)
@@ -567,6 +567,18 @@ (define_operator_list inverted_tcc_compa
        && operand_equal_p (@1, TYPE_MAX_VALUE (type), OEP_ONLY_CONST))
    @1))
 
+/* We can remove extensions of min/max operands and shorten the operation.  */
+(for minmax (min max)
+ (simplify
+  (minmax (convert @0) (convert? @1))
+  (if (INTEGRAL_TYPE_P (type)
+       && TYPE_PRECISION (TREE_TYPE (@0)) < TYPE_PRECISION (type)
+       && (((GENERIC
+             && TYPE_MAIN_VARIANT (TREE_TYPE (@0)) == TYPE_MAIN_VARIANT (TREE_TYPE (@1)))
+            || (GIMPLE && types_compatible_p (TREE_TYPE (@0), TREE_TYPE (@1))))
+           || (TREE_CODE (@1) == INTEGER_CST
+               && int_fits_type_p (@1, TREE_TYPE (@0))))
+       && TYPE_PRECISION (TREE_TYPE (@0)) == GET_MODE_PRECISION (TYPE_MODE (TREE_TYPE (@0))))
+   (convert (minmax @0 (convert @1))))))

should probably be restricted to apply when feeding another conversion or
a comparison (all shorten_* operations are probably profitable when
feeding comparisons as well).
It seems to me that scalar evolution succeeds in determining the number
of iterations for the case of signed longs.  Looking at the vectorization
dump, first a symbolic expression for the number of iterations of a loop
is found, and then vect_analyze_data_refs is entered.  The problem is
that the code expects the offset of a load to be an induction variable,
but in our case the offset is only a cast of an induction variable, like
below:

  _56 = (intD.6) graphite_IV.5_53;
  _55 = aD.1830[_56];

The offset is found not to be an affine expression, and vectorization
doesn't succeed.  But as the offset is a cast of an induction variable,
it has the same behaviour as an induction variable even if formally it is
not one.  It seems to me that somehow extending the code to support casts
of induction variables will solve this problem.
(In reply to Mircea Namolaru from comment #14)
> It seems to me that scalar evolution succeeds to determine
> the number of iterations for the case of signed longs.  Looking
> in vectorization dump, first a symbolic expression for the number of
> iterations of a loop is found, and then vect_analyze_refs is entered.
> The problem is that the code expect an offset of a load to be an
> induction variable, but in our case an offset is only a cast of an
> induction variable, like below:
>
>   _56 = (intD.6) graphite_IV.5_53;
>   _55 = aD.1830[_56];
>
> The offset is found not to be an affine expression, and vectorization
> don't succeed.  But as the offset is a cast of an induction variable,
> it has the same behaviour as an induction variable even if formally is
> not one.

This is not true - the cast of the induction variable is not an affine IV,
as the cast introduces wrapping if the IV exceeds the range of the
casted-to type.  Number-of-iterations analysis can come to the rescue
here, but it has a very hard time, especially on the 2nd loop.  It would
be good to improve it anyway, but we can hardly blame it solely for the
problems ;)

> It seems to me that somehow extending the code to support casts of
> induction variables will solve this problem.
Yes, but it seems to me that the cast (not in the original code) should
not be generated at all if it cannot be guaranteed that the casted-to
type is large enough to accommodate the value.  Otherwise you introduce
a cast from a longer signed type to a shorter signed one whose behaviour
is undefined by the C standard and was not in the original code.

So the cast in the following code is problematic (when graphite_IV, a
signed long, is not in the range of a signed int):

  _56 = (intD.6) graphite_IV.5_53;
  _55 = aD.1830[_56];

The solution to fix this is to make Graphite not generate casts like
this.  An alternative is to infer the range of graphite_IV like you do
and remove the cast (but this seems more complicated and risky, as the
analysis may not succeed, in which case the problematic cast is not
removed).
On Fri, 20 Feb 2015, mircea.namolaru at inria dot fr wrote:

> Yes, but it seems to me that the cast (not in the original code) should
> not be generated at all if it cannot be guaranteed that the casted-to
> type is large enough to accommodate the value.  Otherwise you introduce
> a cast from a longer signed type to a shorter signed one whose behaviour
> is undefined by the C standard and was not in the original code.

Such casts have defined semantics in GNU C (and presumably in GIMPLE).
(In ISO C, "either the result is implementation-defined or an
implementation-defined signal is raised" - not undefined behavior.)
I've managed to work out why these casts are generated, and they seem
correct.

Graphite introduces new induction variables with a larger-size type
(than the type of the original induction variable) to make sure they can
accommodate the iteration count.  Otherwise an overflow not present in
the original program could occur.  On the other hand, in order not to
hide overflows in computations using the original induction variables,
you still need to use the original types.  This explains the casts to
variables with smaller types.  These casts will cause an overflow (i.e.
a value not in the range of the smaller type) only when an overflow in
the original program occurs: they were introduced to mimic the behaviour
of the original program in the presence of overflows.

So, my understanding is that there is nothing wrong with Graphite or
scalar evolution.  The vectorization used to succeed because no
larger-size type was used, but that was unsafe.  To make it work again,
you need some supplementary analysis to determine that the casts are
redundant.  This is not a simple problem, but with Richard's patch the
vectorization works.  Unfortunately, I don't see other, simpler
solutions.  It would perhaps be possible to catch in Graphite that the
transformation is the identity and the original lower bounds of the
loops are zero; this would ensure that a larger-size type is not needed
for the induction variable.  But that seems like a modification intended
to make this particular test work, and nothing more.

Btw, the only potential problem I found may be with the code for gathers
in vectorization.  After the attempt to use a vector load fails, the
vectorizer detects an opportunity for a gather instruction, but as it
doesn't find a suitable one (this depends on the architecture)
vectorization fails.  It seems to me that the analysis for gathers
doesn't take into account the possibility of overflows.  For this test I
could modify the code to use a vector load as the gather instruction
(even though this was found not to be safe), and the vectorization would
succeed.  But I'm not entirely sure about this ...
I think in any case a graphite issue is not severe enough to justify P1.
XFAILed.
Author: rguenth
Date: Wed Mar 25 12:54:12 2015
New Revision: 221662

URL: https://gcc.gnu.org/viewcvs?rev=221662&root=gcc&view=rev
Log:
2015-03-25  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/62630
	* gcc.dg/graphite/vect-pr43423.c: XFAIL.

Modified:
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/testsuite/gcc.dg/graphite/vect-pr43423.c
This is target independent.
GCC 6.1 has been released.
GCC 7.1 has been released.