Bug 106952 - Missed optimization: x < y ? x : y not lowered to minss
Summary: Missed optimization: x < y ? x : y not lowered to minss
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 13.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2022-09-15 15:30 UTC by Tavian Barnes
Modified: 2023-07-21 08:18 UTC (History)
4 users (show)

See Also:
Host:
Target: x86_64-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2022-09-17 00:00:00


Attachments
Assembly from gcc -O3 -S bug.c (531 bytes, text/plain)
2022-09-15 15:30 UTC, Tavian Barnes
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Tavian Barnes 2022-09-15 15:30:30 UTC
Created attachment 53580 [details]
Assembly from gcc -O3 -S bug.c

The following is an implementation of a ray/axis-aligned box intersection test:

struct ray {
    float origin[3];
    float dir_inv[3];
};

struct box {
    float min[3];
    float max[3];
};

static inline float min(float x, float y) {
    return x < y ? x : y;
}

static inline float max(float x, float y) {
    return x < y ? x : y;
}

_Bool intersection(const struct ray *ray, const struct box *box) {
    float tmin = 0.0, tmax = 1.0 / 0.0;

    for (int i = 0; i < 3; ++i) {
        float t1 = (box->min[i] - ray->origin[i]) * ray->dir_inv[i];
        float t2 = (box->max[i] - ray->origin[i]) * ray->dir_inv[i];

        tmin = min(max(t1, tmin), max(t2, tmin));
        tmax = max(min(t1, tmax), min(t2, tmax));
    }

    return tmin < tmax;
}

However, gcc -O3 doesn't use minss/maxss for every min()/max().  Instead, some of them are lowered to conditional jumps which regresses performance significantly since the branches are unpredictable.

Simpler variants like

        tmin = max(tmin, min(t1, t2));
        tmax = min(tmax, max(t1, t2));

get the desired codegen, but that behaves differently if t1 or t2 is NaN.

"Bisecting" with godbolt.org, it seems this is an old regression: 4.8.5 was good, but 4.9.0 was bad.
Comment 1 Alexander Monakov 2022-09-15 15:42:21 UTC
Note, your 'max' function is the same as 'min' (the issue remains with that corrected).
Comment 2 Tavian Barnes 2022-09-15 15:54:08 UTC
(In reply to Alexander Monakov from comment #1)
> Note, your 'max' function is the same as 'min' (the issue remains with that
> corrected).

Whoops, thanks.

Also I just noticed that GCC 12.2 does better (but not perfect) with 

#define min(x, y) ((x) < (y) ? (x) : (y))
#define max(x, y) ((x) > (y) ? (x) : (y))

instead of the inline functions.  Doesn't seem to help GCC trunk though.
Comment 3 Richard Biener 2022-09-17 04:47:44 UTC
The main issue is that combine is confused with the more complicated setup and the x86 ieee min/max patterns and on the GIMPLE level we are not forming IEEE compatible .FMIN and .FMAX plus the x86 backend doesn't announce any.
Comment 4 Richard Biener 2023-07-18 10:39:05 UTC
With the proposed patches for PR88540 and PR105715 I get with -O3 -msse4.1

intersection:
.LFB2:
        .cfi_startproc
        movss   .LC0(%rip), %xmm5
        pxor    %xmm2, %xmm2
        movss   (%rdi), %xmm4
        movss   12(%rsi), %xmm1
        movss   12(%rdi), %xmm0
        divss   %xmm2, %xmm5
        movss   (%rsi), %xmm3
        subss   %xmm4, %xmm1
        subss   %xmm4, %xmm3
        pxor    %xmm4, %xmm4
        mulss   %xmm0, %xmm1
        mulss   %xmm0, %xmm3
        movaps  %xmm1, %xmm0
        cmpnltss        %xmm2, %xmm0
        blendvps        %xmm0, %xmm1, %xmm4
        movaps  %xmm3, %xmm0
        cmpnltss        %xmm2, %xmm0
        pxor    %xmm2, %xmm2
        blendvps        %xmm0, %xmm3, %xmm2
        movss   16(%rsi), %xmm0
        minss   %xmm5, %xmm3
        minss   %xmm5, %xmm1
        movss   4(%rdi), %xmm5
        minss   %xmm4, %xmm2
        movss   16(%rdi), %xmm4
        subss   %xmm5, %xmm0
        maxss   %xmm3, %xmm1
        movss   4(%rsi), %xmm3
        subss   %xmm5, %xmm3
        mulss   %xmm4, %xmm0
        movss   8(%rdi), %xmm5
        mulss   %xmm4, %xmm3
        movaps  %xmm2, %xmm4
        maxss   %xmm0, %xmm4
        minss   %xmm1, %xmm0
        maxss   %xmm3, %xmm2
        minss   %xmm1, %xmm3
        movss   8(%rsi), %xmm1
        subss   %xmm5, %xmm1
        maxss   %xmm3, %xmm0
        movss   20(%rsi), %xmm3
        minss   %xmm4, %xmm2
        movss   20(%rdi), %xmm4
        subss   %xmm5, %xmm3
        mulss   %xmm4, %xmm1
        movaps  %xmm2, %xmm5
        mulss   %xmm4, %xmm3
        movaps  %xmm2, %xmm4
        maxss   %xmm1, %xmm4
        minss   %xmm0, %xmm1
        movaps  %xmm3, %xmm2
        maxss   %xmm3, %xmm5
        minss   %xmm0, %xmm2
        minss   %xmm5, %xmm4
        maxss   %xmm1, %xmm2
        comiss  %xmm4, %xmm2
        seta    %al
        ret

there's the existing issue that RTL conditional move expansion doesn't
preserve the equality of constants for

  _33 = t2_34 < 0.0;
  _12 = _33 ? 0.0 : t2_34;

but it emits two loads from the constant pool for 0.0 here which in the x86
backend fail to be recognized as min/max.
Comment 5 Richard Biener 2023-07-21 08:18:03 UTC
After the latest fixes we still fail to recognize min/max early for

  float < 0.0 ? float : 0.0

because prepare_cmp_insn doesn't push the FP 0.0 constant to a reg
since RTX cost for this seems to be zero.  We then call insn_operand_matches
which ultimatively fails in ix86_fp_comparison_operator as
ix86_fp_comparison_strategy is IX86_FPCMP_COMI here and
ix86_trivial_fp_comparison_operator for

(lt (reg/v:SF 110 [ t2 ])
    (const_double:SF 0.0 [0x0.0p+0]))

returns false.

If I fix things so we try (gt (const_double:SF 0.0 [0x0.0p+0]) (reg:SF ..))
then maybe_legitimize_operands "breaks" things here since it forces
the cond operand to a register but not the comparison operand so
ix86_expand_fp_movcc again FAILs.

I'm not sure why the x86 backend allows any CONST_DOUBLE as part of
comparisons (during expansion only?).  This and maybe special-handling
of rtx_cost with this special constant and LT/GT code makes the first
compares not recognized as MIN/MAX.

The rest is fixed now.

Patch for trying (gt ..):

diff --git a/gcc/optabs.cc b/gcc/optabs.cc
index 32ff379ffc3..3ff8ba88bbb 100644
--- a/gcc/optabs.cc
+++ b/gcc/optabs.cc
@@ -4607,6 +4607,14 @@ prepare_cmp_insn (rtx x, rtx y, enum rtx_code comparison, rtx size,
        break;
     }
 
+  if (FLOAT_MODE_P (mode))
+    {
+      prepare_cmp_insn (y, x, swap_condition (comparison),
+                       size, unsignedp, methods, ptest, pmode);
+      if (*ptest)
+       return;
+    }
+
   if (methods != OPTAB_LIB_WIDEN)
     goto fail;