[Bug c/94497] New: Branchless clamp in the general case gets a branch in a particular case ?
grasland at lal dot in2p3.fr
gcc-bugzilla@gcc.gnu.org
Mon Apr 6 09:35:30 GMT 2020
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94497
Bug ID: 94497
Summary: Branchless clamp in the general case gets a branch in
a particular case ?
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: grasland at lal dot in2p3.fr
Target Milestone: ---
(Triage note: I think this is probably a compiler middle-end or back-end issue,
but I am not knowledgeable enough about the structure of the GCC codebase to
pick the right component.)
---
I am trying to make a floating-point computation autovectorization-friendly,
without mandating the use of -ffast-math for optimal performance as that is a
numerical stability and compiler portability hazard. This turned out to be an
interesting exercise in IEEE-754 pedantry, of course, but I can live with that.
However, while trying to optimize a "clamp" computation, I ended up at a point
where the behavior of the GCC optimizer just does not make sense to me and I
could use the opinion of an expert.
Consider the following functions:
```
double fast_min(double x, double y) {
    return (x < y) ? x : y;
}
double fast_max(double x, double y) {
    return (x > y) ? x : y;
}
```
The definitions of fast_min and fast_max are carefully crafted to match the
semantics of x86's min and max instruction family, and indeed if I compile this
code with -O1 or above I get minsd/maxsd or vminsd/vmaxsd instructions
depending on which vector instruction sets are enabled.
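As a minimal sketch of the semantics in question: any comparison involving a NaN is false, and -0.0 compares equal to +0.0, so in both cases the ternary falls through to its second operand, which is the same rule minsd/maxsd apply to their source operand. The helper name check_min_max_edge_cases below is hypothetical and not part of the original test case; the sketch assumes compilation without fast-math flags.

```c
#include <assert.h>
#include <math.h>

double fast_min(double x, double y) {
    return (x < y) ? x : y;
}
double fast_max(double x, double y) {
    return (x > y) ? x : y;
}

/* Hypothetical helper (not from the original report): asserts the
   "second operand wins" behavior for NaN and signed-zero inputs. */
void check_min_max_edge_cases(void) {
    /* Comparisons with NaN are false, so the second operand is returned. */
    assert(fast_min(NAN, 1.0) == 1.0);
    assert(isnan(fast_min(1.0, NAN)));
    assert(fast_max(NAN, 0.0) == 0.0);
    assert(isnan(fast_max(0.0, NAN)));

    /* -0.0 < 0.0 and 0.0 < -0.0 are both false: second operand again. */
    assert(signbit(fast_min(-0.0, 0.0)) == 0);  /* yields +0.0 */
    assert(signbit(fast_min(0.0, -0.0)) != 0);  /* yields -0.0 */
}
```

That second-operand rule is what lets each ternary map onto a single min or max instruction.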
This is exactly what I wanted, so I'm happy so far. And if I now try to use
these min and max functions to write a clamp function...
```
double fast_clamp(double x, double min, double max) {
    return fast_max(fast_min(x, max), min);
}
```
...again, at -O1 optimization level and above, I get a minsd/maxsd pair, short
and sweet:
```
fast_clamp(double, double, double):
        minsd   xmm0, xmm2
        maxsd   xmm0, xmm1
        ret
```
Where this perfect picture becomes tainted, however, is as soon as I try to
_use_ this function with certain min/max arguments.
```
double use_fast_clamp(double x) {
    return fast_clamp(x, 0.0, 1.0);
}
```
All of a sudden, the assembly becomes branchy and terrible-looking, even in -O3
mode!
```
use_fast_clamp(double):
        movapd  xmm1, xmm0
        movsd   xmm0, QWORD PTR .LC0[rip]
        comisd  xmm0, xmm1
        jbe     .L13
        maxsd   xmm1, QWORD PTR .LC1[rip]
        movapd  xmm0, xmm1
.L13:
        ret
.LC0:
        .long   0
        .long   1072693248
.LC1:
        .long   0
        .long   0
```
I can make the generated code go back to a minsd/maxsd pair if I enable
-ffast-math (more precisely -ffinite-math-only -funsafe-math-optimizations),
but to the best of my knowledge, I shouldn't need fast-math flags here.
Further, even if I did forget about an IEEE-754 oddity that requires fast-math
flags, it would still mean that the above compilation of the general fast_clamp
function is incorrect: if this compilation output should work for any pair of
"min" and "max" double-precision arguments, then it trivially should work when
the min is 0.0 and max is 1.0. So one way or another, I think the GCC optimizer
is doing something strange here.
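The argument above can be checked at runtime with a small harness. This is only a sketch under stated assumptions: same_bits, count_clamp_mismatches and check_clamp_agreement are hypothetical helpers, not part of the original test case, and the volatile bounds are just one way to keep the compiler from seeing 0.0 and 1.0 as constants in the general call. It compares results bit for bit so that +0.0 versus -0.0 would count as a difference.

```c
#include <assert.h>
#include <math.h>
#include <string.h>

double fast_min(double x, double y) {
    return (x < y) ? x : y;
}
double fast_max(double x, double y) {
    return (x > y) ? x : y;
}
double fast_clamp(double x, double min, double max) {
    return fast_max(fast_min(x, max), min);
}
double use_fast_clamp(double x) {
    return fast_clamp(x, 0.0, 1.0);
}

/* Hypothetical helper: bitwise equality, which distinguishes +0.0
   from -0.0 (a plain == comparison would not). */
static int same_bits(double a, double b) {
    return memcmp(&a, &b, sizeof a) == 0;
}

/* Hypothetical helper: counts inputs for which the constant-bounds
   version and the general version disagree. */
int count_clamp_mismatches(void) {
    /* volatile reads hide the constant bounds from the optimizer */
    volatile double lo = 0.0, hi = 1.0;
    const double inputs[] = { -INFINITY, -1.0, -0.0, 0.0, 0.5,
                              1.0, 2.0, INFINITY, NAN };
    int mismatches = 0;
    for (size_t i = 0; i < sizeof inputs / sizeof inputs[0]; ++i) {
        double general = fast_clamp(inputs[i], lo, hi);
        double special = use_fast_clamp(inputs[i]);
        if (!same_bits(general, special))
            ++mismatches;
    }
    return mismatches;
}

/* Without fast-math flags, both versions should agree everywhere. */
void check_clamp_agreement(void) {
    assert(count_clamp_mismatches() == 0);
}
```

Calling check_clamp_agreement() from a main function built without fast-math flags should pass silently. Since a NaN input is replaced by the upper bound in the first fast_min step, none of the listed inputs produces a NaN result, so the bitwise comparison does not depend on NaN payloads.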
---
This is the most minimal example of this behavior that I managed to come up
with. Using only the fast_min or fast_max functions in isolation will behave
as expected and codegen into a single minsd or maxsd:
```
double use_fast_min(double x) {
    return fast_min(x, 1.0);
}
double use_fast_max(double x) {
    return fast_max(x, 0.0);
}
```
I observed similar behavior on any GCC build I could get my hands on, all the
way from the most recent GCC trunk build currently available on godbolt (10.0.1
20200405) to the most ancient build provided by godbolt (4.1.2).
Both my local system and godbolt are Linux-based.
My local GCC build was configured with ../configure --prefix=/usr
--infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib64
--libexecdir=/usr/lib64 --enable-languages=c,c++,objc,fortran,obj-c++,ada,go,d
--enable-offload-targets=hsa,nvptx-none=/usr/nvptx-none, --without-cuda-driver
--disable-werror --with-gxx-include-dir=/usr/include/c++/9 --enable-ssp
--disable-libssp --disable-libvtv --disable-cet --disable-libcc1
--enable-plugin --with-bugurl=https://bugs.opensuse.org/
--with-pkgversion='SUSE Linux' --with-slibdir=/lib64 --with-system-zlib
--enable-libstdcxx-allocator=new --disable-libstdcxx-pch --enable-libphobos
--enable-version-specific-runtime-libs --with-gcc-major-version-only
--enable-linker-build-id --enable-linux-futex --enable-gnu-indirect-function
--program-suffix=-9 --without-system-libunwind --enable-multilib
--with-arch-32=x86-64 --with-tune=generic
--with-build-config=bootstrap-lto-lean --enable-link-mutex
--build=x86_64-suse-linux --host=x86_64-suse-linux
As for godbolt builds, it is easy to go to godbolt.org and add a -v to the
compiler options of the build you're interested in, so I will invite you to do
that instead of cluttering this already long bug report further.
---
FWIW, clang 10 behaves the way I would expect without fast-math flags (and also
generates the zero in place with a xorpd instead of loading it from memory,
which is kind of cool), but I'm well aware of the danger of comparing the
floating-point behavior of various compiler optimizers. So I wouldn't read too
much into that:
```
.LCPI5_0:
        .quad   4607182418800017408     # double 1
use_fast_clamp(double):                 # @use_fast_clamp(double)
        minsd   xmm0, qword ptr [rip + .LCPI5_0]
        xorpd   xmm1, xmm1
        maxsd   xmm0, xmm1
        ret
```
If you like to experiment on godbolt too, here's my setup:
https://godbolt.org/z/eD-guY .