82170 – gcc optimizes int range-checking poorly on x86-64

Bug 82170 - gcc optimizes int range-checking poorly on x86-64

Summary: gcc optimizes int range-checking poorly on x86-64

Status:	NEW

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	middle-end (show other bugs)
Version:	8.3.1

Importance:	P3 enhancement
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2017-09-10 22:16 UTC by Paul Eggert
Modified:	2023-12-25 11:14 UTC (History)
CC List:	5 users (show)

See Also:	56711
Host:
Target:	x86_64--, powerpc--*
Build:
Known to work:
Known to fail:
Last reconfirmed:	2021-09-04 00:00:00

Attachments
source code that is poorly optimized on x86-64 (224 bytes, text/plain) 2017-09-10 22:16 UTC, Paul Eggert	Details
assembly-language output for poorly-optimized code (350 bytes, text/plain) 2017-09-10 22:16 UTC, Paul Eggert	Details
source code that is poorly optimized on x86-64, version 2 (204 bytes, text/plain) 2017-09-11 06:48 UTC, Paul Eggert	Details
assembly-language output for poorly-optimized code, version 2 (350 bytes, text/plain) 2017-09-11 06:49 UTC, Paul Eggert	Details
Show Obsolete (2) View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Paul Eggert 2017-09-10 22:16:20 UTC

Created attachment 42148 [details]
source code that is poorly optimized on x86-64

GCC on Fedora 26 x86-64 (GCC 7.1.1 20170622 (Red Hat 7.1.1-3) generates poorly-optimized machine instructions for C code that tests in the obvious way whether a 'long long' fits in 'int'. If n is long long, the expression (INT_MIN <= n && n <= INT_MAX) should generate code that is no worse than (!__builtin_add_overflow_p (n, 0, 0)). However, with -O2 optimization before the conditional branch, the former expression generates four instructions containing two literal constants, whereas the latter generates just two instructions that reference only registers.

For the code that prompted this bug report we will likely use an "#if __GNUC__ < 7" that uses __builtin_add_overflow_p for GCC 7 and later, and the portable code otherwise. It'd be nicer, though, if we could just use the portable code.  See:

https://sourceware.org/ml/libc-alpha/2017-09/msg00411.html

Attached are a source program inrange.c and an assembly-language file inrange.s produced by 'gcc -O2 -S' that illustrates the problem. Akthough the two functions checked_arg_GCC7 and checked_arg_portable should have the same machine code, the former is more efficient than the latter.

Comment 1 Paul Eggert 2017-09-10 22:16:59 UTC

Created attachment 42149 [details]
assembly-language output for poorly-optimized code

Comment 2 Marc Glisse 2017-09-11 06:25:20 UTC

Note that n==(int)n (gcc documents that this must work) may work with more gcc versions and is more readable.

Comment 3 Paul Eggert 2017-09-11 06:44:26 UTC

(In reply to Marc Glisse from comment #2)
> Note that n==(int)n (gcc documents that this must work) may work with more
> gcc versions and is more readable.

Thanks, good point, I'll suggest switching to that in glibc, and will update the attachements accordingly. Still, the point remains that the portable code should compile to something just as efficient as 'n == (int) n', which is not portable.

Comment 4 Paul Eggert 2017-09-11 06:48:11 UTC

Created attachment 42152 [details]
source code that is poorly optimized on x86-64, version 2

Comment 5 Paul Eggert 2017-09-11 06:49:37 UTC

Created attachment 42153 [details]
assembly-language output for poorly-optimized code, version 2

Comment 6 Jakub Jelinek 2017-09-11 13:10:15 UTC

I'll have a look.

Comment 7 Jakub Jelinek 2017-09-14 09:25:31 UTC

More complete testcase:
extern inline int f1 (long long n) { return -__INT_MAX__ - 1 <= n && n <= __INT_MAX__; }
extern inline int f2 (long long n) { return n == (int) n; }
extern inline int f3 (unsigned long long n) { return n <= ~0U; }
extern inline int f4 (unsigned long long n) { return n == (unsigned int) n; }
extern inline int f5 (long long n) { return -__SHRT_MAX__ - 1 <= n && n <= __SHRT_MAX__; }
extern inline int f6 (long long n) { return n == (short) n; }
extern inline int f7 (unsigned long long n) { return n <= (unsigned short) ~0U; }
extern inline int f8 (unsigned long long n) { return n == (unsigned short) n; }
extern inline int f9 (long long n) { return -__SCHAR_MAX__ - 1 <= n && n <= __SCHAR_MAX__; }
extern inline int f10 (long long n) { return n == (signed char) n; }
extern inline int f11 (unsigned long long n) { return n <= (unsigned char) ~0U; }
extern inline int f12 (unsigned long long n) { return n == (unsigned char) n; }
extern inline int f13 (int n) { return -__SHRT_MAX__ - 1 <= n && n <= __SHRT_MAX__; }
extern inline int f14 (int n) { return n == (short) n; }
extern inline int f15 (unsigned int n) { return n <= (unsigned short) ~0U; }
extern inline int f16 (unsigned int n) { return n == (unsigned short) n; }
extern inline int f17 (int n) { return -__SCHAR_MAX__ - 1 <= n && n <= __SCHAR_MAX__; }
extern inline int f18 (int n) { return n == (signed char) n; }
extern inline int f19 (unsigned int n) { return n <= (unsigned char) ~0U; }
extern inline int f20 (unsigned int n) { return n == (unsigned char) n; }
extern inline int f21 (short int n) { return -__SCHAR_MAX__ - 1 <= n && n <= __SCHAR_MAX__; }
extern inline int f22 (short int n) { return n == (signed char) n; }
extern inline int f23 (unsigned short int n) { return n <= (unsigned char) ~0U; }
extern inline int f24 (unsigned short int n) { return n == (unsigned char) n; }
extern void foo (void);
void s1 (long long n) { if (f1 (n)) foo (); }
void s2 (long long n) { if (f2 (n)) foo (); }
void s3 (unsigned long long n) { if (f3 (n)) foo (); }
void s4 (unsigned long long n) { if (f4 (n)) foo (); }
void s5 (long long n) { if (f5 (n)) foo (); }
void s6 (long long n) { if (f6 (n)) foo (); }
void s7 (unsigned long long n) { if (f7 (n)) foo (); }
void s8 (unsigned long long n) { if (f8 (n)) foo (); }
void s9 (long long n) { if (f9 (n)) foo (); }
void s10 (long long n) { if (f10 (n)) foo (); }
void s11 (unsigned long long n) { if (f11 (n)) foo (); }
void s12 (unsigned long long n) { if (f12 (n)) foo (); }
void s13 (int n) { if (f13 (n)) foo (); }
void s14 (int n) { if (f14 (n)) foo (); }
void s15 (unsigned int n) { if (f15 (n)) foo (); }
void s16 (unsigned int n) { if (f16 (n)) foo (); }
void s17 (int n) { if (f17 (n)) foo (); }
void s18 (int n) { if (f18 (n)) foo (); }
void s19 (unsigned int n) { if (f19 (n)) foo (); }
void s20 (unsigned int n) { if (f20 (n)) foo (); }
void s21 (short int n) { if (f21 (n)) foo (); }
void s22 (short int n) { if (f22 (n)) foo (); }
void s23 (unsigned short int n) { if (f23 (n)) foo (); }
void s24 (unsigned short int n) { if (f24 (n)) foo (); }

So, this seems to be an instruction selection thing.  Comparing each pair of functions shows that at least for instruction counts the latter is often, but not always, shorter.

One question is if we want to canonicalize this during gimple fold (either the
n ==/!= (narrower type) n, or n + narrower_type_min_as_unsigned_wider <=/> narrower_type_max_as_unsigned_wider to the other form if single use); if we do and it is the former form, we'd also need to adjust the range discovery code that reassoc uses in the range optimization.

Then there is a question how do we want to generate optimal sequence.  Do we e.g. want to hook into the expansion (somewhere in do_store_flag and do_compare_and_jump), check for this pattern (using TER info) and perhaps try to expand both sequences, compute costs of both and see what is cheaper?

Hardcoding just one way I'm afraid is not going to be always a win.

Or shall the combiner be able to do something?

Comment 8 Jakub Jelinek 2017-09-14 10:04:45 UTC

To summarize IRC discussions about this, the first step should be to introduce SEXT_EXPR (split from Prathamesh's patch, improve), then add match.pd canonicalization of these range testing to SEXT_EXPR + EQ_EXPR/NE_EXPR or
BIT_AND_EXPR + EQ/NE, recognize those in range discovery for reassoc.
Combine probably isn't able to handle transformation of one seq to the other, and in any case, combiner would be only one way (it simplifies/canonicalizes the IL to something, but doesn't have easy way to try two completely different but equivalent sequences; also, often it is more than 4 instructions).
So we should try to do something in the expansion to try both sequences and pick the less costly.

Comment 9 Segher Boessenkool 2017-09-14 13:19:51 UTC

For range tests, one of the first things the tree optimisers do (in
003t.original) is change  (a <= x && x < b)  into  (x - a u< b - a)
(where u< is unsigned compare).  And this in as narrow a type as
possible.

This survives all the way through the gimple optimisers.  There is
no way combine can deal with this in general.

Maybe there should be a separate EXPR for range tests?

(The generated code looks relatively *good* on x86, btw ;-) )

Comment 10 Paul Eggert 2019-03-18 20:58:43 UTC

I just now ran into this problem in another situation and so was reminded of this bug report.

For what it's worth, clang does a good job of optimizing these comparisons. I just now compiled the sample C source code attached to this bug report, and clang version 7.0.1 (Fedora 7.0.1-4.fc29) x86-64 generates the same code for checked_arg_portable that it does for checked_arg_GCC7, whereas GCC 8.3.1 20190223 (Red Hat 8.3.1-2) x86-64 still generates poorly-optimized code for checked_arg_portable.