[Bug c/94497] New: Branchless clamp in the general case gets a branch in a particular case?

grasland at lal dot in2p3.fr gcc-bugzilla@gcc.gnu.org
Mon Apr 6 09:35:30 GMT 2020


https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94497

            Bug ID: 94497
           Summary: Branchless clamp in the general case gets a branch in
                     a particular case?
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: grasland at lal dot in2p3.fr
  Target Milestone: ---

(Triage note: I think this is probably a compiler middle-end or back-end issue,
but I am not knowledgeable enough about the structure of the GCC codebase to
pick the right component.)

---

I am trying to make a floating-point computation autovectorization-friendly,
without mandating the use of -ffast-math for optimal performance as that is a
numerical stability and compiler portability hazard. This turned out to be an
interesting exercise in IEEE-754 pedantry, of course, but I can live with that.

However, while trying to optimize a "clamp" computation, I ended up at a point
where the behavior of the GCC optimizer just does not make sense to me and I
could use the opinion of an expert.

Consider the following functions:

```
double fast_min(double x, double y) {
    return (x < y) ? x : y;
}

double fast_max(double x, double y) {
    return (x > y) ? x : y;
}
```

The definitions of fast_min and fast_max are carefully crafted to match the
semantics of x86's min and max instruction family, and indeed, if I compile
this code at -O1 or above, I get minsd/maxsd or vminsd/vmaxsd instructions,
depending on which vector instruction sets are enabled.
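
To spell out what "matching the x86 semantics" means here: like minsd/maxsd,
both ternaries return their second operand whenever the comparison is false,
which covers the NaN cases and the -0.0 vs +0.0 case. A quick standalone
sanity check (the test values are arbitrary, just to illustrate the point):

```
#include <math.h>
#include <stdio.h>

double fast_min(double x, double y) { return (x < y) ? x : y; }
double fast_max(double x, double y) { return (x > y) ? x : y; }

int main(void) {
    printf("%g\n", fast_min(NAN, 1.0));   /* 1: second operand, like minsd */
    printf("%g\n", fast_min(1.0, NAN));   /* nan: second operand, like minsd */
    printf("%g\n", fast_max(-0.0, 0.0));  /* 0: second operand, like maxsd */
    return 0;
}
```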

This is exactly what I wanted; so far I'm happy. And if I now try to use these
min and max functions to write a clamp function...

```
double fast_clamp(double x, double min, double max) {
    return fast_max(fast_min(x, max), min);
}
```

...again, at -O1 optimization level and above, I get a minsd/maxsd pair, short
and sweet:

```
fast_clamp(double, double, double):
        minsd   xmm0, xmm2
        maxsd   xmm0, xmm1
        ret
```

This perfect picture gets tainted, however, as soon as I try to _use_ this
function with certain min/max arguments.

```
double use_fast_clamp(double x) {
    return fast_clamp(x, 0.0, 1.0);
}
```

All of a sudden, the assembly becomes branchy and terrible-looking, even in -O3
mode!

```
use_fast_clamp(double):
        movapd  xmm1, xmm0
        movsd   xmm0, QWORD PTR .LC0[rip]
        comisd  xmm0, xmm1
        jbe     .L13
        maxsd   xmm1, QWORD PTR .LC1[rip]
        movapd  xmm0, xmm1
.L13:
        ret
.LC0:
        .long   0
        .long   1072693248
.LC1:
        .long   0
        .long   0
```

I can make the generated code go back to a minsd/maxsd pair if I enable
-ffast-math (more precisely -ffinite-math-only -funsafe-math-optimizations),
but to the best of my knowledge, I shouldn't need fast-math flags here.

Further, even if I had overlooked some IEEE-754 oddity that requires fast-math
flags, that would only mean that the above compilation of the general
fast_clamp function is incorrect: if that output is valid for any pair of
"min" and "max" double-precision arguments, then it is trivially valid when
min is 0.0 and max is 1.0. So one way or another, I think the GCC optimizer
is doing something strange here.
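
One way I can think of to probe this reasoning directly on godbolt is to
compare a caller where the bounds are visible constants with one where they
are hidden from the optimizer. The volatile laundering and the helper names
below are just an ad-hoc hack of mine, and constant propagation being the
trigger is only a guess on my part:

```
double fast_min(double x, double y) { return (x < y) ? x : y; }
double fast_max(double x, double y) { return (x > y) ? x : y; }
double fast_clamp(double x, double min, double max) {
    return fast_max(fast_min(x, max), min);
}

/* Bounds spelled out as literals, visible to constant propagation. */
double clamp_visible_bounds(double x) {
    return fast_clamp(x, 0.0, 1.0);
}

/* Same bounds, but laundered through volatile locals so the optimizer only
   ever sees runtime doubles, as in the general fast_clamp case above. */
double clamp_opaque_bounds(double x) {
    volatile double lo = 0.0, hi = 1.0;
    return fast_clamp(x, lo, hi);
}
```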

---

This is the most minimal example of this behavior that I managed to come up
with. Using the fast_min or fast_max functions in isolation behaves as
expected and compiles down to a single minsd or maxsd:

```
double use_fast_min(double x) {
    return fast_min(x, 1.0);
}

double use_fast_max(double x) {
    return fast_max(x, 0.0);
}
```

I observed similar behavior on every GCC build I could get my hands on, all the
way from the most recent trunk build currently available on godbolt (10.0.1
20200405) down to the most ancient build godbolt provides (4.1.2).

Both my local system and the godbolt runs are Linux-based.

My local GCC build was configured with  ../configure --prefix=/usr
--infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib64
--libexecdir=/usr/lib64 --enable-languages=c,c++,objc,fortran,obj-c++,ada,go,d
--enable-offload-targets=hsa,nvptx-none=/usr/nvptx-none, --without-cuda-driver
--disable-werror --with-gxx-include-dir=/usr/include/c++/9 --enable-ssp
--disable-libssp --disable-libvtv --disable-cet --disable-libcc1
--enable-plugin --with-bugurl=https://bugs.opensuse.org/
--with-pkgversion='SUSE Linux' --with-slibdir=/lib64 --with-system-zlib
--enable-libstdcxx-allocator=new --disable-libstdcxx-pch --enable-libphobos
--enable-version-specific-runtime-libs --with-gcc-major-version-only
--enable-linker-build-id --enable-linux-futex --enable-gnu-indirect-function
--program-suffix=-9 --without-system-libunwind --enable-multilib
--with-arch-32=x86-64 --with-tune=generic
--with-build-config=bootstrap-lto-lean --enable-link-mutex
--build=x86_64-suse-linux --host=x86_64-suse-linux

As for the godbolt builds, it is easy to go to godbolt.org and add a -v to the
compiler options of the build you're interested in, so I'll invite you to do
that rather than clutter this already long bug report further.

---

FWIW, clang 10 behaves the way I would expect without fast-math flags (and also
generates the zero in place with a xorpd instead of loading it from memory,
which is kind of cool), but I'm well aware of the danger of comparing the
floating-point behavior of various compiler optimizers. So I wouldn't read too
much into that:

```
.LCPI5_0:
        .quad   4607182418800017408     # double 1
use_fast_clamp(double):                    # @use_fast_clamp(double)
        minsd   xmm0, qword ptr [rip + .LCPI5_0]
        xorpd   xmm1, xmm1
        maxsd   xmm0, xmm1
        ret
```

If you'd like to experiment on godbolt too, here's my setup:
https://godbolt.org/z/eD-guY .

