In http://gcc.gnu.org/ml/gcc/2010-08/msg00326.html Revital complained about MAX_EXPR no longer being recognized in:

int foo (const unsigned char *tmp, int i, int val)
{
  return (unsigned char)(((tmp[i] + val)>0xFF)?0xFF:(((tmp[i] + val)<0)?0:(tmp[i] + val)));
}

It is still recognized when using:

int bar (const unsigned char *tmp, int i, int val)
{
  int x = (((tmp[i] + val)>0xFF)?0xFF:(((tmp[i] + val)<0)?0:(tmp[i] + val)));
  return (unsigned char)x;
}

The regression is caused by folding being delayed in the C FE: before, the inner COND_EXPR was optimized by fold_ternary* into MAX_EXPR; now that no longer happens immediately, so convert_to_integer propagates the narrowing (unsigned char) casts down into the operands, and by the time fold_ternary* is actually called it is too late, as the operands are no longer considered equal. Perhaps it would be nice for -O2 to consider, for e.g. int a and b, that (unsigned char) (a + b) is the same as (unsigned char) a + (unsigned char) b during VN.

On the C FE side, perhaps these optimizations should be moved from convert_to_integer to e.g. c_fully_fold_internal, where the operand would be folded first and only afterwards the narrowing-conversion optimization would be applied. That would solve this testcase too, but wouldn't help the case where the narrowing conversion has already been written by the user:

int foo (const unsigned char *a, int b, int c)
{
  int x = (unsigned char) (a[b] + c);
  int y = a[b] + c;
  int z = (unsigned char) y;
  return x == z;
}

is another testcase which isn't currently optimized at the tree level (the RTL level can figure this out though).

Created attachment 21556 [details]
minmax.patch

I've also noticed that the folder only recognizes the inner MAX_EXPR, but doesn't recognize the outer MIN_EXPR. Apparently (when convert_to_integer doesn't harm it, i.e. the bar testcase instead of foo) phiopt1 later figures this out. The attached (untested) patch handles that case early, but I'm not sure it is worth it.

Confirmed.

GCC 4.5.2 is being released, adjusting target milestone.

GCC 4.5.3 is being released, adjusting target milestone.

Hmm, we could solve this via forward propagation in gimple, too. I have a patch which does this, but I'm not sure it is material for the current stage. The missing patterns in forward propagation are:
- X ==/!= X -> true/false
- (type) X ==/!= (type) X -> true/false
- X code Y ==/!= X code Z -> Y ==/!= Z (with code +, -, or ^)
- (type) X ==/!= Y & CST -> X ==/!= (type-of-X) (Y & CST), if type-of-X has smaller or equal precision than type and is unsigned, type-of-X and type are of integral kind, and CST fits into type-of-X
- (type) (X op CST) -> (type) X op CST', with CST' = (type) CST, if type has smaller or equal precision than type-of-X and both types are of integral kind (for op = +, -, &, |, ^)
- (type) ((type2) X op Y) -> X op (type) Y, if type has smaller or equal precision than type2, type-of-X is compatible to type, all types are of integral kind, and op is a +, -, &, |, or ^ expression

Not a stage4 material.

Another testcase (loosely based on PR50182), -O3 -mavx:

signed char a[1024], b[1024];

void foo (void)
{
  int i, s, t;
  for (i = 0; i < 1024; i++)
    {
      s = a[i];
      t = b[i];
      s += t;
      a[i] = s;
    }
}

void bar (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    a[i] += b[i];
}

shows terrible code for the first loop and nice code for the second one, even though both can be implemented as the second loop.

Note that one can't use a signed type in the narrowing + of course; it needs to be unsigned char addition to avoid overflows.

Another testcase:

signed char a[1024], b[1024];

void baz (void)
{
  int i, s, t;
  for (i = 0; i < 1024; i++)
    {
      s = a[i];
      t = b[i];
      s += t + 0x12345600;
      a[i] = s;
    }
}

We should be able to optimize the addition of the constant away completely, but we don't.

Related to PR47477.

GCC 4.8.0 is being released, adjusting target milestone.

GCC 4.8.1 has been released.

GCC 4.8.2 has been released.

GCC 4.8.3 is being released, adjusting target milestone.

GCC 4.8.4 has been released.

The gcc-4_8-branch is being closed, re-targeting regressions to 4.9.3.

GCC 4.9.3 has been released.

When this deficiency is finally resolved, we should be able to remove the code in convert_to_integer dealing with narrowing operands:

 738     case PLUS_EXPR:
 739     case MINUS_EXPR:
 740     case BIT_AND_EXPR:
 741     case BIT_IOR_EXPR:
 742     case BIT_XOR_EXPR:
 743     trunc1:
 744       {
...
 821         return convert (type,
 822                         fold_build2 (ex_form, typex,
 823                                      convert (typex, arg0),
 824                                      convert (typex, arg1)));

GCC 4.9 branch is being closed

Couldn't we solve this with a pattern like this in match.pd:

(simplify
 (convert (plus:c @0 (convert @1)))
 (if (INTEGRAL_TYPE_P (TREE_TYPE (@1))
      && INTEGRAL_TYPE_P (TREE_TYPE (@0))
      && INTEGRAL_TYPE_P (type)
      && types_match (@1, type)
      && (TYPE_PRECISION (TREE_TYPE (@1))
          > TYPE_PRECISION (TREE_TYPE (@0))))
  (bit_and (plus (convert @0) @1)
           { wide_int_to_tree (type,
                               wi::mask (TYPE_PRECISION (TREE_TYPE (@0)),
                                         false, TYPE_PRECISION (type))); })))

Essentially this widens the arithmetic, then masks off the undesirable bits in the result. As written, I think this doesn't reduce the number of expressions consistently, even though it helps the testcase. The reason it helps the testcase is that the (convert @0) in the output pattern turns out to be a common subexpression. If (convert @0) is CSE-able, then this does reduce the number of expressions. Though we probably want to avoid widening beyond a target's word size (which I'm loath to test for). Thoughts anyone?

I think the pattern to identify is the *tmp + val saturating addition. One of the issues is that we have tmp[i] + val in two forms:

  _5 = (int) tmp[i] + val;

and

  _iftmp.1_13 = (int) (tmp[i] + (unsigned char) val);

where the latter is guarded with an overflow check (_5 <= 255 && _5 >= 0). IMHO the unfortunate thing is that we arrive at this inconsistent narrowing here in the first place, before we had a chance to clean things up. Of course, as Jakub notes in comment #1, the user could have written it this way. And _only_ if the no-overflow condition holds are the two expressions the same (contrary to Jakub's example, which compares both values as unsigned char). As suggested above, we may want to do "saturated OP" pattern matching here and then re-expand in a more optimal form (using min/max as suggested, or using an IFN like we have for the overflow cases). Another option is to teach VRP to detect the equivalence (but that looks hard and even more ad hoc). As for using CSE in some way, it would need to see the range of _5 and conditionally add (int)(unsigned char)expr-of-_5 simplified and hope things get detected that way... (but first of all it would need a value-range lattice). So the only "reasonable" suggestion is pattern matching the whole thing, which fits phiopt. I do not think that having a general 'widening' pattern is the way to go; that has the same issues as the current narrowing one. Moving the narrowing optimization down from the FEs' folding code is a step in the right direction anyway (but it would need putting elsewhere to not "regress"). I think that all but maybe a saturating-OP replacement in phiopt is GCC 8 material. Adjusting target milestone to GCC 7 meanwhile.

The model of shortening as much as possible for gimple, then widening to word mode at the gimple/RTL boundary, is probably too simplistic. We really need the ability to widen when doing so allows for simplification. That's what the prototype pattern I posted was meant to show. By widening we can expose the common subexpressions, and once the common subexpressions are exposed, min/max detection can do its job. The problem is we don't have range data, so we have to mask the result; nor do we know whether the widened operand is a common subexpression or not. Thus profitability is unknown at transformation time. I was about to write off VRP again because it lacks much of the CSE capability we want, but VRP has the information to know that the result is value-preserving if computed in the wider mode. So at VRP time we could do the widening without needing to mask the result, at which point widening is known to be profitable. As written we have two casts plus an addition. VRP could turn that into an addition with a single cast (both of which are common subexpressions, but VRP doesn't need to detect that to ensure profitability). DOM/FRE have CSE infrastructure, but the range data presented to them is not path-sensitive and thus useless in this case. I'm hesitant to try to do all this in phi-opt -- phi-opt would have to look through the casts, prove equivalences, etc. Ick.

(In reply to Jeffrey A. Law from comment #23)
> The model of shortening as much as possible for gimple, then widening to
> word mode at the gimple/RTL boundary is probably too simplistic.  We really
> need the ability to widen when doing so allows for simplification.

But the key point here is "when doing so allows for simplification" ...

> That's what the prototype pattern I posted was meant to show.  By widening
> we can expose the common subexpressions, and once the common subexpressions
> are exposed, min/max can do its job.  The problem is we don't have range
> data, so we have to do masking of the result, nor do we know if the widened
> operand is a common subexpression or not.  Thus profitability is unknown at
> transformation time.
>
> I was about to write off VRP again because it lacks much of the CSE
> capability we want, but VRP has the information to know that the result is
> value preserving if done in the wider mode.  So at VRP time we could do the
> widening without needing to mask the result, at which point widening is
> known to be profitable.  As written we have two casts + addition.  VRP could
> turn that into an addition with a single cast (both of which are common
> subexpressions, but VRP doesn't need to detect that to ensure profitability).
>
> DOM/FRE have CSE infrastructure, but the range data presented to them is not
> path sensitive and thus useless in this case.

Note that mid-term I'd like our value-numbering infrastructure (VRP is also a kind of value numbering) to improve to a point where we simply do VRP alongside CSE, constant, copy and known-zero/one-bits propagation. You "simply" track multiple lattices (configurable) from a common iteration (configurable as either non-iterating, pessimistically handling backedges, or iterating, optimistically handling them).

> I'm hesitant to try to do all this in phi-opt -- phi-opt would have to look
> through the casts, prove equivalences, etc.  ick.

Yeah, ICK.
But still, this case is very much about pattern matching (and getting min/max in the end). It _is_ a saturating operation, which might be common enough to pattern-match (heh, some targets even have instruction support for it, and the middle-end has saturating types!). So short-term, pattern matching in phiopt is the way to go (I'd even consider that a regression fix).

"When doing so allows for simplification" is actually a pretty low bar here. We just have to prove the widening cast is a common subexpression. Once we do that, we know widening is a win. And that's relatively easy to discover: you go back to the SSA_NAME_DEF_STMT of the object you want to widen, then walk over its immediate uses to see if one is a proper widening op that dominates the expression we're trying to simplify.

That's it; nothing else is needed to prove profitability. We only get ~20 hits in a bootstrap, but I didn't expect this idiom to show up all that much.

--

Trying to do all the pattern matching in phi-opt and ultimately generate the min/max is going to be *hideous*, because we're trying to do too many things at once. We have components of CSE, VRP and DCE that we'd have to bring into phi-opt, which seems wrong to me.

Contrast that with simplifying the IL like my match.pd pattern does: we make a simple, profitable change to the IL. That one profitable change has ripple effects that allow subsequent passes to do their jobs, with the final result that the minmax detection code is presented with exactly the IL it knows how to transform.

--

Alternately (and I haven't prototyped this), we could try again to look at this as a redundancy-elimination problem. Given X = A + B in DOM, where A comes from a narrowed operand (A'), look up a widened B in the expression table (B'). If found, then look up A' + B'. If found (lhs), then replace X = A + B with X = (typeof X) ((lhs) & mask).

That's essentially doing the same thing as the prototype match.pd pattern, except that we know the operand widening as well as the widened arithmetic are both CSEs.

On Fri, 17 Feb 2017, law at redhat dot com wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=45397
>
> --- Comment #25 from Jeffrey A. Law <law at redhat dot com> ---
> "When doing so allows for simplification" is actually a pretty low bar here.
> We just have to prove the widening cast is a common subexpression. Once we do
> that, we know widening is a win. And that's relatively easy to discover. You
> go back to the SSA_NAME_DEF_STMT of the object you want to widen, then walk
> over its immediate uses to see if one is a proper widening op that dominates
> the expression we're trying to simplify.
>
> That's it, nothing else needed to prove profitability. We only get ~20 hits in
> a bootstrap, but I didn't expect this idiom to show up all that much.

I find this kind of "technique" of finding CSE opportunities quite ugly, and I'm not sure where you'd actually implement it...

>
> --
>
> Trying to do all the pattern matching in phi-opt and ultimately generate the
> min/max is going to be *hideous* because we're trying to do too many things at
> once. We have components of CSE, VRP and DCE that we'd have to bring into
> phi-opt, which seems wrong to me.

Well, we are pattern matching the overflow builtins as well. I think it would be quite natural to pattern match saturating operations, with the premise of generating better code for them (either by generic expansion to min/max or by utilizing CPU instructions -- I think SSE has saturating ops, for example). You don't really need CSE, VRP and DCE; you "simply" pattern match the comparisons. You'd then replace the PHI node by the saturated op and let DCE do the rest for you.

> Contrast that to simplifying the IL like my match.pd pattern does. We make a
> simple, profitable, change to the IL. That one profitable change in the IL has
> ripple effects that allow subsequent passes to do their jobs with the final
> result that the minmax detection code is presented with exactly the IL it knows
> how to transform.
>
> --
>
> Alternately (and I haven't prototyped this) we could try again to look at this
> as a redundancy elimination problem.
>
> Given X = A + B in DOM, where A comes from a narrowed operand (A').  Lookup a
> widened B in the expression table (B').  If found, then lookup A' + B'.  If
> found (lhs), then replace X = A + B with X = (typeof X) ((lhs) & mask).
>
> That's essentially doing the same thing as the prototype match.pd pattern.
> Except that we know the operand widening as well as the widened arithmetic are
> both CSEs.

CSE already does similar stuff for loads, transparently adding VIEW_CONVERT_EXPRs. I've added generic helpers for this (vn_nary_build_or_lookup). The question is whether it's worth it. You'd basically add to the list of special patterns in visit_nary_op, where we already handle ops in different sign (you'd add "op in different width" and possibly add two stmts instead of one -- the conversion and the masking). Oh, it happens to be a patch in my local tree already (I no longer remember what the reason was to invent it, but surely the question is how many of these special cases are worth the extra lookups):

Index: gcc/tree-ssa-sccvn.c
===================================================================
--- gcc/tree-ssa-sccvn.c	(revision 245417)
+++ gcc/tree-ssa-sccvn.c	(working copy)
@@ -3453,17 +3493,78 @@ visit_copy (tree lhs, tree rhs)
 static bool
 visit_nary_op (tree lhs, gimple *stmt)
 {
-  bool changed = false;
   tree result = vn_nary_op_lookup_stmt (stmt, NULL);
-  if (result)
-    changed = set_ssa_val_to (lhs, result);
-  else
-    {
-      changed = set_ssa_val_to (lhs, lhs);
-      vn_nary_op_insert_stmt (stmt, lhs);
+  if (result)
+    return set_ssa_val_to (lhs, result);
+  else if (INTEGRAL_TYPE_P (TREE_TYPE (lhs))
+           && (TYPE_PRECISION (TREE_TYPE (lhs))
+               == GET_MODE_PRECISION (TYPE_MODE (TREE_TYPE (lhs))))
+           && is_gimple_assign (stmt))
+    {
+      /* For operations on integers that have the same result independent
+         of the signedness of their operands try to lookup a result with
+         opposite sign.  */
+      enum tree_code code = gimple_assign_rhs_code (stmt);
+      switch (code)
+        {
+        case NEGATE_EXPR:
+          {
+            tree type = TREE_TYPE (lhs);
+            type = signed_or_unsigned_type_for (! TYPE_UNSIGNED (type), type);
+            tree ops[3];
+            ops[0] = gimple_assign_rhs1 (stmt);
+            ops[0] = vn_nary_op_lookup_pieces (1, NOP_EXPR, type, ops, NULL);
+            if (ops[0])
+              {
+                ops[0] = vn_nary_op_lookup_pieces (1, code, type,
+                                                   ops, NULL);
+                if (ops[0])
+                  {
+                    result = vn_nary_build_or_lookup (NOP_EXPR,
+                                                      TREE_TYPE (lhs), ops);
+                    if (result)
+                      return set_ssa_val_to (lhs, result);
+                  }
+              }
+            break;
+          }
+        case PLUS_EXPR:
+        case MINUS_EXPR:
+        case MULT_EXPR:
+        case LSHIFT_EXPR:
+          {
+            tree type = TREE_TYPE (lhs);
+            type = signed_or_unsigned_type_for (! TYPE_UNSIGNED (type), type);
+            tree ops[3];
+            ops[0] = gimple_assign_rhs1 (stmt);
+            ops[0] = vn_nary_op_lookup_pieces (1, NOP_EXPR, type, ops, NULL);
+            if (ops[0])
+              {
+                ops[1] = gimple_assign_rhs2 (stmt);
+                if (code != LSHIFT_EXPR)
+                  ops[1] = vn_nary_op_lookup_pieces (1, NOP_EXPR, type,
+                                                     &ops[1], NULL);
+                if (ops[1])
+                  {
+                    ops[0] = vn_nary_op_lookup_pieces (2, code, type,
+                                                       ops, NULL);
+                    if (ops[0])
+                      {
+                        result = vn_nary_build_or_lookup (NOP_EXPR,
+                                                          TREE_TYPE (lhs), ops);
+                        if (result)
+                          return set_ssa_val_to (lhs, result);
+                      }
+                  }
+              }
+            break;
+          }
+        default:;
+        }
     }
+  bool changed = set_ssa_val_to (lhs, lhs);
+  vn_nary_op_insert_stmt (stmt, lhs);
   return changed;
 }

The patch misses some ops = {} and ops[1] = NULL_TREE; initializations needed to avoid ICEing.

WRT comment #26: yes, I would agree that finding CSEs that way is rather gross, but significantly less so than what would be required to address this problem in phi-opt. Pattern matching this is going to be significantly more complex than ADD_OVERFLOW/SUB_OVERFLOW; I've looked at that code via PR79095, and catching these saturating idioms is significantly more complex.

I prototyped the idea of having DOM do extra lookups for a widened version of X OP Y. It's a bit ugly relative to the match.pd approach, but possibly manageable. However, doing that in DOM exposes some lameness we'd have to address in VRP. Prior to DOM2 we have:

;; basic block 4, loop depth 0, count 0, freq 4665, maybe hot
;;  prev block 3, next block 5, flags: (NEW, REACHABLE, VISITED)
;;  pred:       3 [67.6%]  (TRUE_VALUE,EXECUTABLE)
  _6 = (unsigned char) val_12(D);
  _7 = _3 + _6;
  iftmp.1_13 = (int) _7;

We transform that into:

;; basic block 4, loop depth 0, count 0, freq 4665, maybe hot
;;  prev block 3, next block 5, flags: (NEW, REACHABLE, VISITED)
;;  pred:       3 [67.6%]  (TRUE_VALUE,EXECUTABLE)
  iftmp.1_13 = _15 & 255;

That's value-preserving and clearly an improvement. Unfortunately we have to wait until vrp2 to discover that the masking is redundant and simplify the statement into:

  iftmp.1_13 = _15;

The problem is we do not propagate that copy into the PHI node at BB4's successor. By the time we do finally propagate away the copy, there aren't any additional phi-opt passes left to turn things into a MIN/MAX. The lack of copy propagation when VRP simplifies a statement like that is due to using op_with_constant_singleton_value_range as the callback for substitute_and_fold. op_with_constant_singleton_value_range only returns exactly what you would think -- constant singleton ranges. Thus we don't discover or exploit copy-propagation opportunities created by VRP's statement simplification.
Enhancing the callback to return an SSA_NAME for cases where VRP simplifies arithmetic/logicals to copies allows the propagation step to propagate the copy into the PHI node in BB4's successor. That in turn allows phi-opt to do its job, and by the .optimized dump we have:

;; basic block 2, loop depth 0, count 0, freq 10000, maybe hot
;;  prev block 0, next block 1, flags: (NEW, REACHABLE, VISITED)
;;  pred:       ENTRY [100.0%]  (FALLTHRU,EXECUTABLE)
  _1 = (sizetype) i_9(D);
  _2 = tmp_10(D) + _1;
  _3 = *_2;
  _4 = (int) _3;
  _5 = _4 + val_12(D);
  _16 = MAX_EXPR <_5, 0>;
  iftmp.0_7 = MIN_EXPR <_16, 255>;
  return iftmp.0_7;
;;  succ:       EXIT [100.0%]

which is what we want at this stage. Transforming something like that into saturating arithmetic for processors which support such insns is much easier (but IMHO out of the scope of this BZ). Anyway, I'm offline the next few days and largely booked on non-technical stuff for much of March. I don't know if I'll be able to push this further over the next few weeks or not.

Created attachment 40822 [details]
patch doing this in SCCVN

This implements it in SCCVN instead of in DOM, which makes the saturated ops recognized in phiopt1. The other testcases in this bug involve other peculiarities, like having & 255 instead of a conversion to match, or having A + B + CST. This shows that this kind of matching is really a hack... (or, well, it shows once more that handling CSE and larger expressions that are associatable is difficult).

Dropping the regression marker. The other cases are requests for additional optimizations, not regressions.

Author: rguenth
Date: Mon Feb 27 08:51:28 2017
New Revision: 245752

URL: https://gcc.gnu.org/viewcvs?rev=245752&root=gcc&view=rev

Log:
2017-02-27  Richard Biener  <rguenther@suse.de>

	PR tree-optimization/45397
	* tree-ssa-pre.c (eliminate_insert): Handle BIT_AND_EXPR.
	* tree-ssa-sccvn.c (valueized_wider_op): New helper.
	(visit_nary_op): Add pattern matching for CSEing sign-changed
	or truncated operations with wider ones.

	* gcc.dg/tree-ssa/pr45397.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/pr45397.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-pre.c
    trunk/gcc/tree-ssa-sccvn.c

Regression fixed on trunk.

GCC 7.1 has been released.

GCC 7.3 is being released, adjusting target milestone.

GCC 6 branch is closed, thus closing.