Based on a discussion on stackoverflow: https://stackoverflow.com/questions/57673825/how-to-force-gcc-to-assume-that-a-floating-point-expression-is-non-negative. With gcc-trunk and '-std=c++17 -O3', the function float test (float x) { return std::sqrt(x*x); } produces the following assembly: test(float): mulss xmm0, xmm0 pxor xmm2, xmm2 ucomiss xmm2, xmm0 movaps xmm1, xmm0 sqrtss xmm1, xmm1 ja .L8 movaps xmm0, xmm1 ret .L8: sub rsp, 24 movss DWORD PTR [rsp+12], xmm1 call sqrtf movss xmm1, DWORD PTR [rsp+12] add rsp, 24 movaps xmm0, xmm1 ret As far as I can tell, it calls sqrtf, unless the argument to sqrt is >= 0, to check for negatives/NaN's and set the appropriate errno. The behavior is reasonable, as expected. Adding '-fno-math-errno', '-ffast-math', or '-ffinite-math-only' removes all the clutter and compiles the same code into the neat test(float): mulss xmm0, xmm0 sqrtss xmm0, xmm0 ret Now, the problem is that GCC doesn't seem to optimize away the call to sqrtf based on some surrounding code. As an example, it would be neat to have this (or something similar) to get compiled into the same mulss-sqrtss-ret: float test (float x) { float y = x*x; if (y >= 0.f) return std::sqrt(y); __builtin_unreachable(); } If I understand it correctly, the 'y >= 0.f' excludes 'y' being NaN and 'y' being negative (though this is excluded by 'y = x*x'), so there is no need to check if the argument to `std::sqrt` is any bad, enabling to just do 'sqrtss' and return. Furthemore, adding e.g. '#pragma GCC optimize ("no-math-errno")' before the 'test' function doesn't lead to optimizing it either, though I'm not sure whether this is expected to work and/or requires a separate bugtracker issue.
Confirmed. We have no value-range analysis for floating-point, plus call-DCE emits an unordered >= while your test is using an ordered compare. <bb 2> [local count: 1073741824]: y_4 = x_3(D) * x_3(D); if (y_4 >= 0.0) goto <bb 3>; [26.75%] else goto <bb 6>; [73.25%] <bb 3> [local count: 287225938]: if (y_4 u>= 0.0) goto <bb 4>; [99.95%] else goto <bb 5>; [0.05%] <bb 4> [local count: 287082327]: _7 = .SQRT (y_4); goto <bb 6>; [100.00%] <bb 5> [local count: 143611]: _10 = __builtin_sqrtf (y_4); and jump threading doesn't know that if y >= 0.0 is true then isgreaterequal (y, 0.0) is as well. OTOH I'm not sure that's actually true since IIRC isgreaterequal (y, 0.0) will return true if isunordered (y, 0.0) while for y >= 0.0 the result is unspecified? So maybe you simply have to adjust your precondition appropriately like y = x * x; if (isless (y, 0.0)) __builtin_unreachable (); return std::sqrt (y); that way it works for me?
If by 'isless(y, 0.0)' you mean 'y < 0.f', then no, it doesn't change anything, it produces the same 'ucomiss ... call sqrtf' boilerplate. May I have misunderstood you? By the way, what about '#pragma GCC optimize ("no-math-errno")'? Is it supposed to work? Should I issue another bug on that matter?
(In reply to Richard Biener from comment #1) > for y >= 0.0 the result is unspecified? NaN >= 0.0 is false, but even if it was unspecified, the implication would still be true. (In reply to Nikita Lisitsa from comment #2) > If by 'isless(y, 0.0)' you mean 'y < 0.f', He is talking about the standard macro isless, defined in math.h.
On Wed, 4 Sep 2019, lisyarus at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91645 > > --- Comment #2 from Nikita Lisitsa <lisyarus at gmail dot com> --- > If by 'isless(y, 0.0)' you mean 'y < 0.f', then no, it doesn't change anything, > it produces the same 'ucomiss ... call sqrtf' boilerplate. May I have > misunderstood you? I meant isless literally, it's from math.h > By the way, what about '#pragma GCC optimize ("no-math-errno")'? Is it supposed > to work? Should I issue another bug on that matter? It should work if the #pragma is before the function start (those pragmas only work on function granularity)
Oh, thank you a lot! Indeed, this version compiles to just mulss & sqrtss float test (float x) { float y = x*x; if (std::isless(y, 0.f)) __builtin_unreachable(); return std::sqrt(y); } Yet, I still don't quite understand what is happening here. Is it because the standard '<' operator is still subject to FE_* ? Concerning pragmas, the code #pragma GCC optimize ("no-math-errno") float test (float x) { return std::sqrt(x*x); } produces the following assembly std::sqrt(float): pxor xmm2, xmm2 movaps xmm1, xmm0 ucomiss xmm2, xmm0 sqrtss xmm1, xmm1 ja .L8 movaps xmm0, xmm1 ret .L8: sub rsp, 24 movss DWORD PTR [rsp+12], xmm1 call sqrtf movss xmm1, DWORD PTR [rsp+12] add rsp, 24 movaps xmm0, xmm1 ret test(float): mulss xmm0, xmm0 jmp std::sqrt(float) So, the only notable difference is that now 'std::sqrt(float)' is not inlined, but is tail-called instead. Thus, the pragma seems not to work?
On Thu, 5 Sep 2019, lisyarus at gmail dot com wrote: > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91645 > > --- Comment #5 from Nikita Lisitsa <lisyarus at gmail dot com> --- > Oh, thank you a lot! Indeed, this version compiles to just mulss & sqrtss > > float test (float x) > { > float y = x*x; > if (std::isless(y, 0.f)) > __builtin_unreachable(); > return std::sqrt(y); > } > > Yet, I still don't quite understand what is happening here. Is it because the > standard '<' operator is still subject to FE_* ? It's probably because our jump-threading is imperfect if I interpret Marcs comment correctly. When seeing LE_EXPR we only consider if (FLOAT_TYPE_P (TREE_TYPE (op0))) { build_and_record_new_cond (ORDERED_EXPR, op0, op1, p); } so we know it's ORDERED but we don't register it's known to be UNLE as well. > Concerning pragmas, the code > > #pragma GCC optimize ("no-math-errno") > float test (float x) > { > return std::sqrt(x*x); > } > > produces the following assembly > > std::sqrt(float): > pxor xmm2, xmm2 > movaps xmm1, xmm0 > ucomiss xmm2, xmm0 > sqrtss xmm1, xmm1 > ja .L8 > movaps xmm0, xmm1 > ret > .L8: > sub rsp, 24 > movss DWORD PTR [rsp+12], xmm1 > call sqrtf > movss xmm1, DWORD PTR [rsp+12] > add rsp, 24 > movaps xmm0, xmm1 > ret > test(float): > mulss xmm0, xmm0 > jmp std::sqrt(float) > > So, the only notable difference is that now 'std::sqrt(float)' is not inlined, > but is tail-called instead. Thus, the pragma seems not to work? True, there may already be a bug about this. The issue is that whether functions set errno or not is decided globally and this info isn't changed between functions according to pragmas.
> Now, the problem is that GCC doesn't seem to optimize away the call to sqrtf > based on some surrounding code. As an example, it would be neat to have this > (or something similar) to get compiled into the same mulss-sqrtss-ret: > > float test (float x) > { > float y = x*x; > if (y >= 0.f) > return std::sqrt(y); > __builtin_unreachable(); > } > > If I understand it correctly, the 'y >= 0.f' excludes 'y' being NaN and 'y' > being negative (though this is excluded by 'y = x*x'), so there is no need > to check if the argument to `std::sqrt` is any bad, enabling to just do > 'sqrtss' and return. I'm not an FP expert, but I think we have enough information to do this right now. The evrp dump now has: =========== BB 2 ============ Imports: y_2 Exports: y_2 <bb 2> : y_2 = x_1(D) * x_1(D); if (y_2 >= 0.0) goto <bb 3>; [INV] else goto <bb 4>; [INV] 2->3 (T) y_2 : [frange] float [0.0, Inf] !NAN 2->4 (F) y_2 : [frange] float [ -Inf, 0.0] =========== BB 3 ============ y_2 [frange] float [0.0, Inf] !NAN <bb 3> : _6 = __builtin_sqrtf (y_2); return _6; Which means that y_2 is known to be [0.0, Inf] excluding a NAN. What needs to happen for the call to __builtin_sqrtf to be optimized to sqrtss?
First, there's magic bits which turn a standard sqrt call into something like if (exceptional condition) call libm's sqrt else use hardware sqrt The primary goal is to get errno set properly for those exceptional conditions. That code (tree-ssa-math-opts.cc?) can be updated to query the range to determine if the exceptional conditions can happen. Essentially that eliminates the need for -fno-math-errno. That may be enough for targets with real sqrt instructions. More work is likely needed for targets that use an estimator + correction steps.
It looks like what we want for this test is actually !isgreaterequal() not isless(), since we want to exclude the possibility of a NAN. Like this: float test (float x) { if (!__builtin_isgreaterequal (x, 0.0)) __builtin_unreachable(); return sqrtf (x); } After VRP1 removes the unreachable, the range for x_1 is correctly exported as >= -0.0 without a NAN: Global Exported (via unreachable): x_1(D) = [frange] float [-0.0 (-0x0.0p+0), +Inf] We end up with this: float test (float x) { float _4; <bb 2> [local count: 1073741824]: _4 = sqrtf (x_1(D)); return _4; } which then CDCE expands with the unnecessary checking code: <bb 2> [local count: 1073741824]: DCE_COND_LB.2_5 = x_1(D); DCE_COND_LB_TEST.3_6 = DCE_COND_LB.2_5 u>= 0.0; if (DCE_COND_LB_TEST.3_6 != 0) goto <bb 5>; [99.95%] else goto <bb 4>; [0.05%] <bb 5> [local count: 1073204960]: _4 = .SQRT (x_1(D)); goto <bb 3>; [100.00%] <bb 4> [local count: 536864]: _7 = sqrtf (x_1(D)); So the CDCE pass needs to be enhanced to use the ranger, instead of heuristics, to determine that the argument to sqrtf is neither negative nor NAN. In this particular case, we could use global ranges for free, but there's no reason we can't use an actual on-demand ranger for more complex scenarios. Just a guess here, but use_internal_fn() in CDCE shrink wraps the call to sqrt into a check with appropriate dispatch. We could emit the .SQRT call directly if the range of x_1 is not NAN and not negative.
*** Bug 103559 has been marked as a duplicate of this bug. ***
(In reply to Aldy Hernandez from comment #9) > It looks like what we want for this test is actually !isgreaterequal() not > isless(), since we want to exclude the possibility of a NAN. Like this: > > float test (float x) > { > if (!__builtin_isgreaterequal (x, 0.0)) > __builtin_unreachable(); > return sqrtf (x); > } No, what we emit for sqrtf is effectively: if (!__builtin_isless (x, 0.0)) asm ("sqrtss %0, %0" : "+x" (x)); else x = sqrtf (x); // The library version return x; So, the if (__builtin_isless (x, 0.0) __builtin_unreachable (); is all we need. We should either ask in tree-call-cdce.cc before trying to create this whether x range is [-0.0, +Inf] NAN or some subset thereof, in that case just use the .SQRT internal call, or if it is [-Inf, nextafterf (-0.0, -Inf)] or subset thereof, then use the library call only. Or try to fold it in vrp2 or when, but in that case we don't know that easily whether the comparison can be optimized away completely (it still raises exceptions on sNaN).
WIP which still doesn't work: --- gcc/value-query.cc.jj 2023-03-23 15:25:47.069740988 +0100 +++ gcc/value-query.cc 2023-03-30 14:56:58.809298424 +0200 @@ -230,9 +230,11 @@ range_query::get_tree_range (vrange &r, default: break; } - if (BINARY_CLASS_P (expr)) + if (BINARY_CLASS_P (expr) || COMPARISON_CLASS_P (expr)) { - range_op_handler op (TREE_CODE (expr), type); + range_op_handler op (TREE_CODE (expr), + BINARY_CLASS_P (expr) ? type + : TREE_TYPE (TREE_OPERAND (expr, 0))); if (op) { Value_Range r0 (TREE_TYPE (TREE_OPERAND (expr, 0))); --- gcc/tree-call-cdce.cc.jj 2023-01-02 09:32:45.940944935 +0100 +++ gcc/tree-call-cdce.cc 2023-03-30 14:54:25.248544702 +0200 @@ -36,6 +36,7 @@ along with GCC; see the file COPYING3. #include "builtins.h" #include "internal-fn.h" #include "tree-dfa.h" +#include "gimple-range.h" /* This pass serves two closely-related purposes: @@ -425,12 +426,9 @@ comparison_code_if_no_nans (tree_code co null tree. */ static void -gen_one_condition (tree arg, int lbub, - enum tree_code tcode, - const char *temp_name1, - const char *temp_name2, - vec<gimple *> conds, - unsigned *nconds) +gen_one_condition (tree arg, int lbub, enum tree_code tcode, + const char *temp_name1, const char *temp_name2, + vec<gimple *> conds, unsigned *nconds, gimple *stmt) { if (!HONOR_NANS (arg)) tcode = comparison_code_if_no_nans (tcode); @@ -451,10 +449,24 @@ gen_one_condition (tree arg, int lbub, gimple_assign_set_lhs (stmt1, tempn); tempc = create_tmp_var (boolean_type_node, temp_name2); - stmt2 = gimple_build_assign (tempc, - fold_build2 (tcode, - boolean_type_node, - tempn, lbub_real_cst)); + tree tcond = build2 (tcode, boolean_type_node, arg, lbub_real_cst); + int_range_max r; + range_query *q = get_range_query (cfun); + if (q == get_global_range_query ()) + q = enable_ranger (cfun); + /* Ask the ranger whether it knows the condition will be always false or + always true. */ + if (!q->range_of_expr (r, tcond, stmt) || r.undefined_p ()) + tcond = NULL_TREE; + else if (r.upper_bound () == 0) + tcond = boolean_false_node; + else if (r.lower_bound () == 1) + tcond = boolean_true_node; + else + tcond = NULL_TREE; + if (!tcond) + tcond = fold_build2 (tcode, boolean_type_node, tempn, lbub_real_cst); + stmt2 = gimple_build_assign (tempc, tcond); tempcn = make_ssa_name (tempc, stmt2); gimple_assign_set_lhs (stmt2, tempcn); @@ -475,16 +487,15 @@ gen_one_condition (tree arg, int lbub, for lower bound check, one for upper bound check. */ static void -gen_conditions_for_domain (tree arg, inp_domain domain, - vec<gimple *> conds, - unsigned *nconds) +gen_conditions_for_domain (tree arg, inp_domain domain, vec<gimple *> conds, + unsigned *nconds, gimple *stmt) { if (domain.has_lb) gen_one_condition (arg, domain.lb, (domain.is_lb_inclusive ? UNGE_EXPR : UNGT_EXPR), "DCE_COND_LB", "DCE_COND_LB_TEST", - conds, nconds); + conds, nconds, stmt); if (domain.has_ub) { @@ -496,7 +507,7 @@ gen_conditions_for_domain (tree arg, inp (domain.is_ub_inclusive ? UNLE_EXPR : UNLT_EXPR), "DCE_COND_UB", "DCE_COND_UB_TEST", - conds, nconds); + conds, nconds, stmt); } } @@ -518,9 +529,8 @@ gen_conditions_for_domain (tree arg, inp and *NCONDS is the number of logical conditions. */ static void -gen_conditions_for_pow_cst_base (tree base, tree expn, - vec<gimple *> conds, - unsigned *nconds) +gen_conditions_for_pow_cst_base (tree base, tree expn, vec<gimple *> conds, + unsigned *nconds, gimple *stmt) { inp_domain exp_domain; /* Validate the range of the base constant to make @@ -532,11 +542,9 @@ gen_conditions_for_pow_cst_base (tree ba real_from_integer (&mv, TYPE_MODE (TREE_TYPE (base)), 256, UNSIGNED); gcc_assert (!real_less (&mv, &bcv)); - exp_domain = get_domain (0, false, false, - 127, true, false); + exp_domain = get_domain (0, false, false, 127, true, false); - gen_conditions_for_domain (expn, exp_domain, - conds, nconds); + gen_conditions_for_domain (expn, exp_domain, conds, nconds, stmt); } /* Generate error condition code for pow calls with @@ -554,9 +562,8 @@ gen_conditions_for_pow_cst_base (tree ba conditions. */ static void -gen_conditions_for_pow_int_base (tree base, tree expn, - vec<gimple *> conds, - unsigned *nconds) +gen_conditions_for_pow_int_base (tree base, tree expn, vec<gimple *> conds, + unsigned *nconds, gimple *stmt) { gimple *base_def; tree base_val0; @@ -600,11 +607,9 @@ gen_conditions_for_pow_int_base (tree ba /* Generate condition in reverse order -- first the condition for the exp argument. */ - exp_domain = get_domain (0, false, false, - max_exp, true, true); + exp_domain = get_domain (0, false, false, max_exp, true, true); - gen_conditions_for_domain (expn, exp_domain, - conds, nconds); + gen_conditions_for_domain (expn, exp_domain, conds, nconds, stmt); /* Now generate condition for the base argument. Note it does not use the helper function @@ -660,9 +665,9 @@ gen_conditions_for_pow (gcall *pow_call, bc = TREE_CODE (base); if (bc == REAL_CST) - gen_conditions_for_pow_cst_base (base, expn, conds, nconds); + gen_conditions_for_pow_cst_base (base, expn, conds, nconds, pow_call); else if (bc == SSA_NAME) - gen_conditions_for_pow_int_base (base, expn, conds, nconds); + gen_conditions_for_pow_int_base (base, expn, conds, nconds, pow_call); else gcc_unreachable (); } @@ -852,7 +857,7 @@ gen_shrink_wrap_conditions (gcall *bi_ca inp_domain domain = get_no_error_domain (fnc); *nconds = 0; arg = gimple_call_arg (bi_call, 0); - gen_conditions_for_domain (arg, domain, conds, nconds); + gen_conditions_for_domain (arg, domain, conds, nconds, bi_call); } return; @@ -1290,6 +1295,8 @@ pass_call_cdce::execute (function *fun) return 0; shrink_wrap_conditional_dead_built_in_calls (cond_dead_built_in_calls); + if (get_range_query (fun) != get_global_range_query ()) + disable_ranger (fun); free_dominance_info (CDI_POST_DOMINATORS); /* As we introduced new control-flow we need to insert PHI-nodes for the call-clobbers of the remaining call. */
Created attachment 54789 [details] gcc13-pr91645.patch This seems to work right on the testcase I've been trying: float sqrtf (float); float foo (float x) { return sqrtf (x); } float bar (float x) { if (__builtin_isless (x, 0)) __builtin_unreachable (); return sqrtf (x); } float baz (float x) { if (!__builtin_isless (x, 0)) __builtin_unreachable (); return sqrtf (x); } float qux (float x) { x = x * x; return sqrtf (x); } Do we want this for GCC 13, or at least parts of it like the range-op-float.cc fixes and/or the value-range.cc change, or defer everything for GCC 14?
The master branch has been updated by Jakub Jelinek <jakub@gcc.gnu.org>: https://gcc.gnu.org/g:e02c9d9116f243643c0daba8dbcc5d1795c827c3 commit r13-6956-ge02c9d9116f243643c0daba8dbcc5d1795c827c3 Author: Jakub Jelinek <jakub@redhat.com> Date: Fri Mar 31 13:41:34 2023 +0200 range-op-float, value-range: Fix up handling of UN{LT,LE,GT,GE,EQ}_EXPR and handle comparisons in get_tree_range [PR91645] When looking into PR91645, I've noticed we handle UN{LT,LE,GT,GE,EQ}_EXPR comparisons incorrectly. All those are unordered or ..., we correctly return [1, 1] if one or both operands are known NANs, and correctly ask the non-UN prefixed op to fold_range if neither operand may be NAN. But for the case where one or both operands may be NAN, we always return [0, 1]. The UN* fold_range tries to handle it by asking the non-UN prefixed fold_range and if it returns [1, 1] return that, if it returns [0, 0] or [0, 1] return [0, 1], which makes sense, because the maybe NAN means that it is the non-UN prefixed fold_range unioned with [1, 1] in case the maybe NAN is actually NAN at runtime. The problem is that the non-UN prefixed fold_range always returns [0, 1] because those fold_range implementations are like: if (op1.known_isnan () || op2.known_isnan ()) r = range_false (type); else if (!maybe_isnan (op1, op2)) { ... } else r = range_true_and_false (type); and so if maybe_isnan, they always return [0, 1]. Now, thinking about it, this is unnecessary pessimization, for the case where the ... block returns range_false (type) we actually could do it also if maybe_isnan (op1, op2), because if one or both operands are NAN, the comparison will be false, and if neither is NAN, the comparison will be also false. Will fix incrementally today. Anyway, the following patch fixes it by asking the non-UN prefixed fold_range on ranges with NAN cleared, which I think does the right thing in all cases. Another change in the patch is that range_query::get_tree_range always returned VARYING for comparisons, this patch allows to ask about those as well (they are very much like binary ops, except they take the important type from the types of the operands rather than result). Initially I've developed this patch together with changes to tree-call-cdce.cc, but those result in one regression and apparently aren't actually needed to fix this bug, the range-op-float.cc changes are enough. 2023-03-31 Jakub Jelinek <jakub@redhat.com> PR tree-optimization/91645 * range-op-float.cc (foperator_unordered_lt::fold_range, foperator_unordered_le::fold_range, foperator_unordered_gt::fold_range, foperator_unordered_ge::fold_range, foperator_unordered_equal::fold_range): Call the ordered fold_range on ranges with cleared NaNs. * value-query.cc (range_query::get_tree_range): Handle also COMPARISON_CLASS_P trees. * gcc.target/i386/pr103559-1.c: New test. * gcc.target/i386/pr103559-2.c: New test. * gcc.target/i386/pr103559-3.c: New test. * gcc.target/i386/pr103559-4.c: New test.