Bug 94497 - Branchless clamp in the general case gets a branch in a particular case?
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 10.0
: P3 enhancement
Target Milestone: 14.0
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2020-04-06 09:35 UTC by Hadrien Grasland
Modified: 2023-07-20 08:38 UTC (History)
2 users

See Also:
Host:
Target: x86_64-*-* i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2020-04-06 00:00:00


Attachments
incomplete patch (1.69 KB, patch)
2020-04-06 13:34 UTC, Richard Biener

Description Hadrien Grasland 2020-04-06 09:35:30 UTC
(Triage note: I think this is probably a compiler middle-end or back-end issue, but I am not knowledgeable enough about the structure of the GCC codebase to pick the right component.)

---

I am trying to make a floating-point computation autovectorization-friendly, without mandating the use of -ffast-math for optimal performance as that is a numerical stability and compiler portability hazard. This turned out to be an interesting exercise in IEEE-754 pedantry, of course, but I can live with that.

However, while trying to optimize a "clamp" computation, I ended up at a point where the behavior of the GCC optimizer just does not make sense to me and I could use the opinion of an expert.

Consider the following functions:

```
double fast_min(double x, double y) {
    return (x < y) ? x : y;
}

double fast_max(double x, double y) {
    return (x > y) ? x : y;
}
```

The definitions of fast_min and fast_max are carefully crafted to match the semantics of x86's min and max instruction family, and indeed if I compile this code with -O1 or above I get minsd/maxsd or vminsd/vmaxsd instructions depending on which vector instruction sets are enabled.

This is exactly what I wanted, so far I'm happy. And if I now try to use these min and max functions to write a clamp function...

```
double fast_clamp(double x, double min, double max) {
    return fast_max(fast_min(x, max), min);
}
```

...again, at -O1 optimization level and above, I get a minsd/maxsd pair, short and sweet:

```
fast_clamp(double, double, double):
        minsd   xmm0, xmm2
        maxsd   xmm0, xmm1
        ret
```

Where this perfect picture becomes tainted, however, is as soon as I try to _use_ this function with certain min/max arguments.

```
double use_fast_clamp(double x) {
    return fast_clamp(x, 0.0, 1.0);
}
```

All of a sudden, the assembly becomes branchy and terrible-looking, even in -O3 mode!

```
use_fast_clamp(double):
        movapd  xmm1, xmm0
        movsd   xmm0, QWORD PTR .LC0[rip]
        comisd  xmm0, xmm1
        jbe     .L13
        maxsd   xmm1, QWORD PTR .LC1[rip]
        movapd  xmm0, xmm1
.L13:
        ret
.LC0:
        .long   0
        .long   1072693248
.LC1:
        .long   0
        .long   0
```

I can make the generated code go back to a minsd/maxsd pair if I enable -ffast-math (more precisely -ffinite-math-only -funsafe-math-optimizations), but to the best of my knowledge, I shouldn't need fast-math flags here.

Further, even if I had overlooked some IEEE-754 oddity that genuinely requires fast-math flags, the branchless compilation of the general fast_clamp function above would then itself be incorrect: if that output is valid for any pair of "min" and "max" double-precision arguments, it must in particular be valid when min is 0.0 and max is 1.0. So one way or another, I think the GCC optimizer is doing something strange here.

---

This is the most minimal example of this behavior that I managed to come up with. Using only the fast_min or fast_max functions in isolation behaves as expected and compiles down to a single minsd or maxsd:

```
double use_fast_min(double x) {
    return fast_min(x, 1.0);
}

double use_fast_max(double x) {
    return fast_max(x, 0.0);
}
```

I observed similar behavior on every GCC build I could get my hands on, all the way from the most recent GCC trunk build currently available on godbolt (10.0.1 20200405) to the most ancient build provided by godbolt (4.1.2).

Both my local system and godbolt run are Linux-based.

My local GCC build was configured with  ../configure --prefix=/usr --infodir=/usr/share/info --mandir=/usr/share/man --libdir=/usr/lib64 --libexecdir=/usr/lib64 --enable-languages=c,c++,objc,fortran,obj-c++,ada,go,d --enable-offload-targets=hsa,nvptx-none=/usr/nvptx-none, --without-cuda-driver --disable-werror --with-gxx-include-dir=/usr/include/c++/9 --enable-ssp --disable-libssp --disable-libvtv --disable-cet --disable-libcc1 --enable-plugin --with-bugurl=https://bugs.opensuse.org/ --with-pkgversion='SUSE Linux' --with-slibdir=/lib64 --with-system-zlib --enable-libstdcxx-allocator=new --disable-libstdcxx-pch --enable-libphobos --enable-version-specific-runtime-libs --with-gcc-major-version-only --enable-linker-build-id --enable-linux-futex --enable-gnu-indirect-function --program-suffix=-9 --without-system-libunwind --enable-multilib --with-arch-32=x86-64 --with-tune=generic --with-build-config=bootstrap-lto-lean --enable-link-mutex --build=x86_64-suse-linux --host=x86_64-suse-linux

As for godbolt builds, it is easy to go to godbolt.org and add a -v to the compiler options of the build you're interested in, so I will invite you to do that instead of cluttering this already long bug report further.

---

FWIW, clang 10 behaves the way I would expect without fast-math flags (and also generates the zero in place with a xorpd instead of loading it from memory, which is kind of cool), but I'm well aware of the danger of comparing the floating-point behavior of various compiler optimizers. So I wouldn't read too much into that:

```
.LCPI5_0:
        .quad   4607182418800017408     # double 1
use_fast_clamp(double):                    # @use_fast_clamp(double)
        minsd   xmm0, qword ptr [rip + .LCPI5_0]
        xorpd   xmm1, xmm1
        maxsd   xmm0, xmm1
        ret
```

If you like to experiment on godbolt too, here's my setup: https://godbolt.org/z/eD-guY .
Comment 1 Andrew Pinski 2020-04-06 09:39:16 UTC
So there might be some jump threading going on which is causing this issue.  There might already be a bug about that case too.
Comment 2 Richard Biener 2020-04-06 13:33:53 UTC
The issue is that our GIMPLE-level if-conversion does not do the min/max
replacement because our MIN/MAX_EXPR IL elements are not IEEE conforming
with respect to NaN behavior.  There _are_ now FMIN/FMAX internal functions
which targets can support (not sure whether x86 supports those named patterns)
and which if-conversion could use (but it doesn't).  There's a bug about exactly
that I think.  PR88540 might be a close fit.

I do have a not-yet-working patch to address this, queued for GCC 11.
Comment 3 Richard Biener 2020-04-06 13:34:54 UTC
Created attachment 48213 [details]
incomplete patch

In case anybody is interested to complete it ...
Comment 4 Richard Biener 2020-04-06 13:37:24 UTC
As a workaround you can use -ffinite-math-only -fno-signed-zeros if that is applicable to the rest of your application.
Comment 5 Hadrien Grasland 2020-04-06 14:04:39 UTC
Thanks for the clarifications! We could probably live with -fno-signed-zeros, but I think -ffinite-math-only would be too much as an application-wide flag, as I've spotted several occurrences of the "do the computation, then check if the result is normal" pattern around...

I tried to experiment with #pragma optimize as a finer-grained alternative to a global fast-math flag, which would probably be acceptable. But that seemed to have the side-effect of preventing inlining of the functions to which the attributes are applied into other functions that don't have it (I guess that attribute is implemented by splitting the code into multiple compilation units?), whereas in this part of the codebase we want all the inlining we can get...

So since I don't have the skills and time to work on GCC myself, I guess I will need to wait for now and see what you come up with.
Comment 6 Richard Biener 2020-04-06 16:31:42 UTC
(In reply to Hadrien Grasland from comment #5)
> Thanks for the clarifications! We could probably live with
> -fno-signed-zeros, but I think -ffinite-math-only would be too much as an
> application-wide flag, as I've spotted several occurences of the "do the
> computation, then check if the result is normal" pattern around...
> 
> I tried to experiment with #pragma optimize as a finer-grained alternative
> to a global fast-math flag, which would probably be acceptable. But that
> seemed to have the side-effect of preventing inlining of the functions to
> which the attributes are applied into other functions that don't have it (I
> guess that attribute is implemented by splitting the code into multiple
> compilation units?), whereas in this part of the codebase we want all the
> inlining we can get...

Yeah, since the #pragma optimize works at a function level we cannot permit
inlining of functions with conflicting settings...

> So since I don't have the skills and time to work on GCC myself, I guess I
> will need to wait for now and see what you come up with.

There's also the possibility to use XMM intrinsics or simply fmin()/fmax()
(but the latter is only optimized for targets implementing fmin/fmax expanders
which unfortunately x86 doesn't).
Comment 7 Andrew Pinski 2021-08-08 23:59:34 UTC
On a related note, I notice we are doing a load from memory to form the 0.0 constant here.  That seems less than optimal.
Comment 8 Richard Biener 2023-07-20 08:38:19 UTC
This is now fixed with the fix for PR61747 (this bug would have been a better fit though).

Fixed in GCC 14.