Bug 97445 - Some fonctions marked static inline in Linux kernel are not inlined
Summary: Some fonctions marked static inline in Linux kernel are not inlined
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: ipa (show other bugs)
Version: 10.1.0
: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on: 97503 97519
Blocks:
  Show dependency treegraph
 
Reported: 2020-10-15 13:41 UTC by Christophe Leroy
Modified: 2020-11-01 12:00 UTC (History)
6 users (show)

See Also:
Host:
Target:
Build:
Known to work:
Known to fail:
Last reconfirmed: 2020-10-19 00:00:00


Attachments
pipe.i from build of fs/pipe.o (315.95 KB, application/x-gzip)
2020-10-19 12:57 UTC, Christophe Leroy
Details
pipe.s from build of fs/pipe.o (10.30 KB, text/plain)
2020-10-19 12:59 UTC, Christophe Leroy
Details
x86_64 pre-processed source file (244.38 KB, application/zstd)
2020-10-19 15:06 UTC, Martin Liška
Details
pipe.i from build of fs/pipe.o with GCC 9.2 (302.95 KB, application/x-gzip)
2020-10-20 05:21 UTC, Christophe Leroy
Details
pipe.s from build of fs/pipe.o with GCC 9.2 (165.05 KB, application/x-gzip)
2020-10-20 05:21 UTC, Christophe Leroy
Details
skbuff.s from build of net/core/skbuff.o with GCC 10 (473.90 KB, application/x-gzip)
2020-10-20 07:10 UTC, Christophe Leroy
Details
skbuff.i from build of net/core/skbuff.o with GCC 10 (484.78 KB, application/x-gzip)
2020-10-20 07:10 UTC, Christophe Leroy
Details
hintsp (3.02 KB, text/plain)
2020-10-20 13:12 UTC, Jan Hubicka
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Christophe Leroy 2020-10-15 13:41:31 UTC
Some functions declared 'static inline' in Linux kernel sources are not inlined and are seen multiple times in the vmlinux binary.

Two exemple: get_order() and csum_partial():

static inline __attribute_const__ int get_order(unsigned long size)
{
	if (__builtin_constant_p(size)) {
		if (!size)
			return BITS_PER_LONG - PAGE_SHIFT;

		if (size < (1UL << PAGE_SHIFT))
			return 0;

		return ilog2((size) - 1) - PAGE_SHIFT + 1;
	}

	size--;
	size >>= PAGE_SHIFT;
#if BITS_PER_LONG == 32
	return fls(size);
#else
	return fls64(size);
#endif
}

I get 75 times the following code:

c0016790 <get_order>:
c0016790:	38 63 ff ff 	addi    r3,r3,-1
c0016794:	54 63 a3 3e 	rlwinm  r3,r3,20,12,31
c0016798:	7c 63 00 34 	cntlzw  r3,r3
c001679c:	20 63 00 20 	subfic  r3,r3,32
c00167a0:	4e 80 00 20 	blr



static inline __wsum csum_partial(const void *buff, int len, __wsum sum)
{
	if (__builtin_constant_p(len) && len <= 16 && (len & 1) == 0) {
		if (len == 2)
			sum = csum_add(sum, (__force __wsum)*(const u16 *)buff);
		if (len >= 4)
			sum = csum_add(sum, (__force __wsum)*(const u32 *)buff);
		if (len == 6)
			sum = csum_add(sum, (__force __wsum)
					    *(const u16 *)(buff + 4));
		if (len >= 8)
			sum = csum_add(sum, (__force __wsum)
					    *(const u32 *)(buff + 4));
		if (len == 10)
			sum = csum_add(sum, (__force __wsum)
					    *(const u16 *)(buff + 8));
		if (len >= 12)
			sum = csum_add(sum, (__force __wsum)
					    *(const u32 *)(buff + 8));
		if (len == 14)
			sum = csum_add(sum, (__force __wsum)
					    *(const u16 *)(buff + 12));
		if (len >= 16)
			sum = csum_add(sum, (__force __wsum)
					    *(const u32 *)(buff + 12));
	} else if (__builtin_constant_p(len) && (len & 3) == 0) {
		sum = csum_add(sum, ip_fast_csum_nofold(buff, len >> 2));
	} else {
		sum = __csum_partial(buff, len, sum);
	}
	return sum;
}

	ppc-linux-objdump -d vmlinux | grep -e "<csum_partial>" -e "<__csum_partial>"

With gcc9 I get:

	c0017ef8 <__csum_partial>:
	c00182fc:	4b ff fb fd 	bl      c0017ef8 <__csum_partial>
	c0018478:	4b ff fa 80 	b       c0017ef8 <__csum_partial>
	c03e8458:	4b c2 fa a0 	b       c0017ef8 <__csum_partial>
	c03e8518:	4b c2 f9 e1 	bl      c0017ef8 <__csum_partial>
	c03ef410:	4b c2 8a e9 	bl      c0017ef8 <__csum_partial>
	c03f0b24:	4b c2 73 d5 	bl      c0017ef8 <__csum_partial>
	c04279a4:	4b bf 05 55 	bl      c0017ef8 <__csum_partial>
	c0429820:	4b be e6 d9 	bl      c0017ef8 <__csum_partial>
	c0429944:	4b be e5 b5 	bl      c0017ef8 <__csum_partial>
	c042b478:	4b be ca 81 	bl      c0017ef8 <__csum_partial>
	c042b554:	4b be c9 a5 	bl      c0017ef8 <__csum_partial>
	c045f15c:	4b bb 8d 9d 	bl      c0017ef8 <__csum_partial>
	c0492190:	4b b8 5d 69 	bl      c0017ef8 <__csum_partial>
	c0492310:	4b b8 5b e9 	bl      c0017ef8 <__csum_partial>
	c0495594:	4b b8 29 65 	bl      c0017ef8 <__csum_partial>
	c049c420:	4b b7 ba d9 	bl      c0017ef8 <__csum_partial>
	c049c870:	4b b7 b6 89 	bl      c0017ef8 <__csum_partial>
	c049c930:	4b b7 b5 c9 	bl      c0017ef8 <__csum_partial>
	c04a9ca0:	4b b6 e2 59 	bl      c0017ef8 <__csum_partial>
	c04bdde4:	4b b5 a1 15 	bl      c0017ef8 <__csum_partial>
	c04be480:	4b b5 9a 79 	bl      c0017ef8 <__csum_partial>
	c04be710:	4b b5 97 e9 	bl      c0017ef8 <__csum_partial>
	c04c969c:	4b b4 e8 5d 	bl      c0017ef8 <__csum_partial>
	c04ca2fc:	4b b4 db fd 	bl      c0017ef8 <__csum_partial>
	c04cf5bc:	4b b4 89 3d 	bl      c0017ef8 <__csum_partial>
	c04d0440:	4b b4 7a b9 	bl      c0017ef8 <__csum_partial>

With gcc10 I get:

	c0018d08 <__csum_partial>:
	c0019020 <csum_partial>:
	c0019020:	4b ff fc e8 	b       c0018d08 <__csum_partial>
	c001914c:	4b ff fe d4 	b       c0019020 <csum_partial>
	c0019250:	4b ff fd d1 	bl      c0019020 <csum_partial>
	c03e404c <csum_partial>:
	c03e404c:	4b c3 4c bc 	b       c0018d08 <__csum_partial>
	c03e4050:	4b ff ff fc 	b       c03e404c <csum_partial>
	c03e40fc:	4b ff ff 51 	bl      c03e404c <csum_partial>
	c03e6680:	4b ff d9 cd 	bl      c03e404c <csum_partial>
	c03e68c4:	4b ff d7 89 	bl      c03e404c <csum_partial>
	c03e7934:	4b ff c7 19 	bl      c03e404c <csum_partial>
	c03e7bf8:	4b ff c4 55 	bl      c03e404c <csum_partial>
	c03eb148:	4b ff 8f 05 	bl      c03e404c <csum_partial>
	c03ecf68:	4b c2 bd a1 	bl      c0018d08 <__csum_partial>
	c04275b8 <csum_partial>:
	c04275b8:	4b bf 17 50 	b       c0018d08 <__csum_partial>
	c0427884:	4b ff fd 35 	bl      c04275b8 <csum_partial>
	c0427b18:	4b ff fa a1 	bl      c04275b8 <csum_partial>
	c0427bd8:	4b ff f9 e1 	bl      c04275b8 <csum_partial>
	c0427cd4:	4b ff f8 e5 	bl      c04275b8 <csum_partial>
	c0427e34:	4b ff f7 85 	bl      c04275b8 <csum_partial>
	c045a1c0:	4b bb eb 49 	bl      c0018d08 <__csum_partial>
	c0489464 <csum_partial>:
	c0489464:	4b b8 f8 a4 	b       c0018d08 <__csum_partial>
	c04896b0:	4b ff fd b5 	bl      c0489464 <csum_partial>
	c048982c:	4b ff fc 39 	bl      c0489464 <csum_partial>
	c0490158:	4b b8 8b b1 	bl      c0018d08 <__csum_partial>
	c0492f0c <csum_partial>:
	c0492f0c:	4b b8 5d fc 	b       c0018d08 <__csum_partial>
	c049326c:	4b ff fc a1 	bl      c0492f0c <csum_partial>
	c049333c:	4b ff fb d1 	bl      c0492f0c <csum_partial>
	c0493b18:	4b ff f3 f5 	bl      c0492f0c <csum_partial>
	c0493f50:	4b ff ef bd 	bl      c0492f0c <csum_partial>
	c0493ffc:	4b ff ef 11 	bl      c0492f0c <csum_partial>
	c04a0f78:	4b b7 7d 91 	bl      c0018d08 <__csum_partial>
	c04b3e3c:	4b b6 4e cd 	bl      c0018d08 <__csum_partial>
	c04b40d0 <csum_partial>:
	c04b40d0:	4b b6 4c 38 	b       c0018d08 <__csum_partial>
	c04b4448:	4b ff fc 89 	bl      c04b40d0 <csum_partial>
	c04b46f4:	4b ff f9 dd 	bl      c04b40d0 <csum_partial>
	c04bf448:	4b b5 98 c0 	b       c0018d08 <__csum_partial>
	c04c5264:	4b b5 3a a5 	bl      c0018d08 <__csum_partial>
	c04c61e4:	4b b5 2b 25 	bl      c0018d08 <__csum_partial>

gcc10 defines multiple versions of csum_partial() which are just
an unconditionnal branch to __csum_partial().
Comment 1 Jakub Jelinek 2020-10-15 13:46:52 UTC
See https://gcc.gnu.org/bugs.html, you haven't provided either preprocessed source, nor gcc command line options.
inline keyword itself is not a guarantee that the function will be inlined, it is inlined if it is possible and seems beneficial to the inlining heuristics.
If you want to always inline, use __attribute__((always_inline)) in addition to inline keyword.
Comment 2 Richard Biener 2020-10-15 14:27:22 UTC
alternatively use inline w/o static to get C99 inline semantics (you have to
provide a single out of line copy yourself then via the appropriate declaration)
Comment 3 Segher Boessenkool 2020-10-15 14:49:41 UTC
AFAICS the point is that this always compiles to just a handful of insns,
and the inliner should be able to see that (even if the source is biggish).
Comment 4 Jakub Jelinek 2020-10-15 14:52:31 UTC
Even if it is just a few insns, if it is larger than the function call, the caller might already trigger threshold of how much it can be enlarged by inlining.
If this bugreport would come with the requested details, we could look at the inlining dump and see why gcc has decided not to inline it in particular cases.  Without that we can't do anything.
Comment 5 Christophe Leroy 2020-10-15 15:08:46 UTC
GCC version with the BUG:

Using built-in specs.
COLLECT_GCC=/opt/gcc-10.1.0-nolibc/powerpc64-linux/bin/powerpc64-linux-gcc
COLLECT_LTO_WRAPPER=/home/opt/gcc-10.1.0-nolibc/powerpc64-linux/bin/../libexec/gcc/powerpc64-linux/10.1.0/lto-wrapper
Target: powerpc64-linux
Configured with: /home/arnd/git/gcc/configure --target=powerpc64-linux --enable-targets=all --prefix=/home/arnd/cross/x86_64/gcc-10.1.0-nolibc/powerpc64-linux --enable-languages=c --without-headers --disable-bootstrap --disable-nls --disable-threads --disable-shared --disable-libmudflap --disable-libssp --disable-libgomp --disable-decimal-float --disable-libquadmath --disable-libatomic --disable-libcc1 --disable-libmpx --enable-checking=release
Thread model: single
Supported LTO compression algorithms: zlib
gcc version 10.1.0 (GCC)

GCC version without the BUG:

Using built-in specs.
COLLECT_GCC=/opt/gcc-9.2.0-nolibc/powerpc64-linux/bin/powerpc64-linux-gcc
COLLECT_LTO_WRAPPER=/home/opt/gcc-9.2.0-nolibc/powerpc64-linux/bin/../libexec/gcc/powerpc64-linux/9.2.0/lto-wrapper
Target: powerpc64-linux
Configured with: /home/arnd/git/gcc/configure --target=powerpc64-linux --enable-targets=all --prefix=/home/arnd/cross/x86_64/gcc-9.2.0-nolibc/powerpc64-linux --enable-languages=c --without-headers --disable-bootstrap --disable-nls --disable-threads --disable-shared --disable-libmudflap --disable-libssp --disable-libgomp --disable-decimal-float --disable-libquadmath --disable-libatomic --disable-libcc1 --disable-libmpx --enable-checking=release
Thread model: single
gcc version 9.2.0 (GCC)

Arguments used:

powerpc64-linux-gcc -Wp,-MMD,arch/powerpc/kernel/.setup-common.o.d  -nostdinc -isystem /home/opt/gcc-10.1.0-nolibc/powerpc64-linux/bin/../lib/gcc/powerpc64-linux/10.1.0/include -I./arch/powerpc/include -I./arch/powerpc/include/generated  -I./include -I./arch/powerpc/include/uapi -I./arch/powerpc/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -I ./arch/powerpc -Wall -Wundef -Werror=strict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE -Werror=implicit-function-declaration -Werror=implicit-int -Wno-format-security -std=gnu89 -mcpu=powerpc -mcpu=powerpc -m32 -msoft-float -pipe -ffixed-r2 -mmultiple -mno-readonly-in-sdata -mcpu=powerpc64 -mno-altivec -mno-vsx -fno-asynchronous-unwind-tables -mno-string -mbig-endian -mstack-protector-guard=tls -mstack-protector-guard-reg=r2 -fno-delete-null-pointer-checks -Wno-frame-address -Wno-format-truncation -Wno-format-overflow -Wno-address-of-packed-member -O2 -fno-allow-store-data-races -Wframe-larger-than=1024 -fstack-protector -Wno-unused-but-set-variable -Wimplicit-fallthrough -Wno-unused-const-variable -fomit-frame-pointer -fno-var-tracking-assignments -Wdeclaration-after-statement -Wvla -Wno-pointer-sign -Wno-stringop-truncation -Wno-zero-length-bounds -Wno-array-bounds -Wno-stringop-overflow -Wno-restrict -Wno-maybe-uninitialized -fno-strict-overflow -fno-merge-all-constants -fmerge-constants -fno-stack-check -fconserve-stack -Werror=date-time -Werror=incompatible-pointer-types -Werror=designated-init -fmacro-prefix-map=./= -Wno-packed-not-aligned -mstack-protector-guard-offset=552 -Werror    -DKBUILD_MODFILE='"arch/powerpc/kernel/setup-common"' -DKBUILD_BASENAME='"setup_common"' -DKBUILD_MODNAME='"setup_common"' -c -o arch/powerpc/kernel/setup-common.o arch/powerpc/kernel/setup-common.c
Comment 6 Christophe Leroy 2020-10-15 15:13:16 UTC
Sorry, the above command is for another problem I'm about to report.

The command in question in this bug report is:

powerpc64-linux-gcc -Wp,-MMD,arch/powerpc/kernel/.setup-common.o.d  -nostdinc -isystem /home/opt/gcc-10.1.0-nolibc/powerpc64-linux/bin/../lib/gcc/powerpc64-linux/10.1.0/include -I./arch/powerpc/include -I./arch/powerpc/include/generated  -I./include -I./arch/powerpc/include/uapi -I./arch/powerpc/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -I ./arch/powerpc -Wall -Wundef -Werror=strict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE -Werror=implicit-function-declaration -Werror=implicit-int -Wno-format-security -std=gnu89 -mcpu=powerpc -mcpu=powerpc -m32 -msoft-float -pipe -ffixed-r2 -mmultiple -mno-readonly-in-sdata -mcpu=powerpc64 -mno-altivec -mno-vsx -fno-asynchronous-unwind-tables -mno-string -mbig-endian -fno-delete-null-pointer-checks -Wno-frame-address -Wno-format-truncation -Wno-format-overflow -Wno-address-of-packed-member -O2 -fno-allow-store-data-races -Wframe-larger-than=1024 -fno-stack-protector -Wno-unused-but-set-variable -Wimplicit-fallthrough -Wno-unused-const-variable -fomit-frame-pointer -fno-var-tracking-assignments -Wdeclaration-after-statement -Wvla -Wno-pointer-sign -Wno-stringop-truncation -Wno-zero-length-bounds -Wno-array-bounds -Wno-stringop-overflow -Wno-restrict -Wno-maybe-uninitialized -fno-strict-overflow -fno-merge-all-constants -fmerge-constants -fno-stack-check -fconserve-stack -Werror=date-time -Werror=incompatible-pointer-types -Werror=designated-init -fmacro-prefix-map=./= -Wno-packed-not-aligned -Werror    -DKBUILD_MODFILE='"arch/powerpc/kernel/setup-common"' -DKBUILD_BASENAME='"setup_common"' -DKBUILD_MODNAME='"setup_common"' -c -o arch/powerpc/kernel/setup-common.o arch/powerpc/kernel/setup-common.c
Comment 7 Christophe Leroy 2020-10-17 16:23:16 UTC
With get_order(), that's even worse: there are instances of it that are never called:

c000f94c <get_order>:
c005a7ac <get_order>:
c005a9c4:	4b ff fd e9 	bl      c005a7ac <get_order>
c005ab38:	4b ff fc 75 	bl      c005a7ac <get_order>
c005ac50:	4b ff fb 5d 	bl      c005a7ac <get_order>
c005b540 <get_order>:
c005b588:	4b ff ff b9 	bl      c005b540 <get_order>
c0074fe4 <get_order>:
c00815c8 <get_order>:
c00b8690 <get_order>:
c00ba898:	4b ff dd f9 	bl      c00b8690 <get_order>
c00ba9d4:	4b ff dc bd 	bl      c00b8690 <get_order>
c00bd650 <get_order>:
c00c3cbc:	4b ff 99 95 	bl      c00bd650 <get_order>
c00c3e90:	4b ff 97 c1 	bl      c00bd650 <get_order>
c00cfd08 <get_order>:
c00d02a8:	4b ff fa 61 	bl      c00cfd08 <get_order>
c00d0310:	4b ff f9 f9 	bl      c00cfd08 <get_order>
c00d0ac8:	4b ff f2 41 	bl      c00cfd08 <get_order>
c00d0c90:	4b ff f0 79 	bl      c00cfd08 <get_order>
c00d0d60:	4b ff ef a9 	bl      c00cfd08 <get_order>
c00d0dcc:	4b ff ef 3d 	bl      c00cfd08 <get_order>
c00d0e10:	4b ff ee f9 	bl      c00cfd08 <get_order>
c00d10ac:	4b ff ec 5d 	bl      c00cfd08 <get_order>
c00d19e4:	4b ff e3 25 	bl      c00cfd08 <get_order>
c00d4fa4:	4b ff ad 65 	bl      c00cfd08 <get_order>
c00d5cb4:	4b ff a0 55 	bl      c00cfd08 <get_order>
c00d5cc0:	4b ff a0 49 	bl      c00cfd08 <get_order>
c00d60fc:	4b ff 9c 0d 	bl      c00cfd08 <get_order>
c00e9c70 <get_order>:
c0114b50 <get_order>:
c013643c <get_order>:
c013a520 <get_order>:
c013b3f4:	4b ff f1 2d 	bl      c013a520 <get_order>
c013b40c:	4b ff f1 15 	bl      c013a520 <get_order>
c013b438:	4b ff f0 e9 	bl      c013a520 <get_order>
c013b454:	4b ff f0 cd 	bl      c013a520 <get_order>
c014236c:	4b ff 81 b5 	bl      c013a520 <get_order>
c01423d4:	4b ff 81 4d 	bl      c013a520 <get_order>
c0150ae8 <get_order>:
c015b968 <get_order>:
c0162040 <get_order>:
c016a710 <get_order>:
c0182aec <get_order>:
c01abe78 <get_order>:
c01cd598 <get_order>:
c01d2764 <get_order>:
c01ee3e0 <get_order>:
c01fea40 <get_order>:
c020bfd4 <get_order>:
c021cd80 <get_order>:
c021e510 <get_order>:
c02204b4 <get_order>:
c0225534 <get_order>:
c0228bec <get_order>:
c022c5a4 <get_order>:
c0231100 <get_order>:
c02314c0:	4b ff fc 41 	bl      c0231100 <get_order>
c02537a8 <get_order>:
c025a620 <get_order>:
c025d6bc:	4b ff cf 65 	bl      c025a620 <get_order>
c025d750:	4b ff ce d1 	bl      c025a620 <get_order>
c0270144 <get_order>:
c0287110 <get_order>:
c028717c:	4b ff ff 95 	bl      c0287110 <get_order>
c02871f0:	4b ff ff 21 	bl      c0287110 <get_order>
c028f3b8 <get_order>:
c029fa00 <get_order>:
c02a3ae0:	4b ff bf 21 	bl      c029fa00 <get_order>
c02a4c68:	4b ff ad 99 	bl      c029fa00 <get_order>
c02a5020:	4b ff a9 e1 	bl      c029fa00 <get_order>
c02a520c:	4b ff a7 f5 	bl      c029fa00 <get_order>
c02a5644:	4b ff a3 bd 	bl      c029fa00 <get_order>
c02ad00c <get_order>:
c02ad0c0:	4b ff ff 4d 	bl      c02ad00c <get_order>
c02b4ce0 <get_order>:
c02ba234 <get_order>:
c02bd758 <get_order>:
c02c292c <get_order>:
c02ccd00 <get_order>:
c0326dcc <get_order>:
c0327fc4:	4b ff ee 09 	bl      c0326dcc <get_order>
c032836c:	4b ff ea 61 	bl      c0326dcc <get_order>
c0328784:	4b ff e6 49 	bl      c0326dcc <get_order>
c0328810:	4b ff e5 bd 	bl      c0326dcc <get_order>
c0328860:	4b ff e5 6d 	bl      c0326dcc <get_order>
c032a754 <get_order>:
c03335d4 <get_order>:
c033c37c <get_order>:
c033e4ac:	4b ff de d1 	bl      c033c37c <get_order>
c033e4bc:	4b ff de c1 	bl      c033c37c <get_order>
c03447b4 <get_order>:
c0345cf8:	4b ff ea bd 	bl      c03447b4 <get_order>
c035a3fc <get_order>:
c0362694 <get_order>:
c0375710 <get_order>:
c041ab78:	4b ca 2a d9 	bl      c00bd650 <get_order>
c0429068:	4b c9 45 e9 	bl      c00bd650 <get_order>
Comment 8 Christophe Leroy 2020-10-17 16:31:32 UTC
(In reply to Jakub Jelinek from comment #1)
> See https://gcc.gnu.org/bugs.html, you haven't provided either preprocessed
> source, nor gcc command line options.
> inline keyword itself is not a guarantee that the function will be inlined,
> it is inlined if it is possible and seems beneficial to the inlining
> heuristics.
> If you want to always inline, use __attribute__((always_inline)) in addition
> to inline keyword.

When adding -save-temps as explained in https://gcc.gnu.org/bugs.html I get an error:

powerpc64-linux-gcc -Wp,-MMD,fs/.pipe.o.d  -nostdinc -isystem /home/opt/gcc-10.1.0-nolibc/powerpc64-linux/bin/../lib/gcc/powerpc64-linux/10.1.0/include -I./arch/powerpc/include -I./arch/powerpc/include/generated  -I./include -I./arch/powerpc/include/uapi -I./arch/powerpc/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -I ./arch/powerpc -Wall -Wundef -Werror=strict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE -Werror=implicit-function-declaration -Werror=implicit-int -Wno-format-security -std=gnu89 -mcpu=powerpc -m32 -msoft-float -pipe -ffixed-r2 -mmultiple -mno-readonly-in-sdata -mcpu=860 -mno-altivec -mno-vsx -fno-asynchronous-unwind-tables -mno-string -mbig-endian -fno-delete-null-pointer-checks -Wno-frame-address -Wno-format-truncation -Wno-format-overflow -Wno-address-of-packed-member -O2 -fno-allow-store-data-races -Wframe-larger-than=1024 -fno-stack-protector -Wno-unused-but-set-variable -Wimplicit-fallthrough -Wno-unused-const-variable -fomit-frame-pointer -fno-var-tracking-assignments -g -Wdeclaration-after-statement -Wvla -Wno-pointer-sign -Wno-stringop-truncation -Wno-zero-length-bounds -Wno-array-bounds -Wno-stringop-overflow -Wno-restrict -Wno-maybe-uninitialized -fno-strict-overflow -fno-merge-all-constants -fmerge-constants -fno-stack-check -fconserve-stack -Werror=date-time -Werror=incompatible-pointer-types -Werror=designated-init -fmacro-prefix-map=./= -Wno-packed-not-aligned    -DKBUILD_MODFILE='"fs/pipe"' -DKBUILD_BASENAME='"pipe"' -DKBUILD_MODNAME='"pipe"' -c -o fs/pipe.o fs/pipe.c -save-temps
powerpc64-linux-gcc.br_real: warning: -pipe ignored because -save-temps specified
powerpc64-linux-gcc.br_real: error: unrecognized command line option ‘-fno-allow-store-data-races’
Comment 9 Jakub Jelinek 2020-10-19 12:33:50 UTC
-fno-allow-store-data-races is fairly new option, previously it has been
--param=allow-store-data-races=0

I have no idea how you've tried to add -save-temps, so can't answer why you get the error.
What I was suggesting is build with make V=1 and copy/paste the command line used to compile a particular source file and append -save-temps to those options
Comment 10 Christophe Leroy 2020-10-19 12:39:03 UTC
(In reply to Jakub Jelinek from comment #9)
> What I was suggesting is build with make V=1 and copy/paste the command line
> used to compile a particular source file and append -save-temps to those
> options

That's exactly what I did, I built with V=1, then I copy/pasted the line and added  -save-temps at the end as you see in comment #8
Comment 11 Christophe Leroy 2020-10-19 12:57:23 UTC
Created attachment 49398 [details]
pipe.i from build of fs/pipe.o
Comment 12 Christophe Leroy 2020-10-19 12:59:56 UTC
Created attachment 49399 [details]
pipe.s from build of fs/pipe.o

fs/pipe.o includes an instance of get_order() allthouth get_order() is declared static and not called from any function.
Comment 13 Christophe Leroy 2020-10-19 13:01:24 UTC
(In reply to Christophe Leroy from comment #10)
> (In reply to Jakub Jelinek from comment #9)
> > What I was suggesting is build with make V=1 and copy/paste the command line
> > used to compile a particular source file and append -save-temps to those
> > options
> 
> That's exactly what I did, I built with V=1, then I copy/pasted the line and
> added  -save-temps at the end as you see in comment #8

This was a problem with PATH, when I was copying the command line, it was using another (older) version of GCC which was the one in the PATH. Sorry.
Comment 14 Martin Liška 2020-10-19 15:05:40 UTC
All right, I can reproduce it. I'm also attaching x86_64 pre-processes source file.
Comment 15 Martin Liška 2020-10-19 15:06:23 UTC
Created attachment 49401 [details]
x86_64 pre-processed source file
Comment 16 Martin Liška 2020-10-19 15:11:46 UTC
The following happens:

get_order is called by kmalloc_large which is called in kmalloc. And kmalloc calls the function for larger allocations. Problem is that we eliminate all calls to get_order late

pipe.i.108t.thread1:;; Function get_order (get_order, funcdef_no=295, decl_uid=4528, cgraph_uid=300, symbol_order=303)
pipe.i.108t.thread1:get_order (long unsigned int size)
pipe.i.108t.thread1:  _125 = get_order (_114);
pipe.i.108t.thread1:  _67 = get_order (_56);
pipe.i.109t.cdce:;; Function get_order (get_order, funcdef_no=295, decl_uid=4396, cgraph_uid=300, symbol_order=303)
pipe.i.109t.cdce:get_order (long unsigned int size)

so remove_unreachable_nodes is not called any more.
Comment 17 Jan Hubicka 2020-10-19 15:18:08 UTC
> The following happens:
> 
> get_order is called by kmalloc_large which is called in kmalloc. And kmalloc
> calls the function for larger allocations. Problem is that we eliminate all
> calls to get_order late
> 
> pipe.i.108t.thread1:;; Function get_order (get_order, funcdef_no=295,
> decl_uid=4528, cgraph_uid=300, symbol_order=303)
> pipe.i.108t.thread1:get_order (long unsigned int size)
> pipe.i.108t.thread1:  _125 = get_order (_114);
> pipe.i.108t.thread1:  _67 = get_order (_56);
> pipe.i.109t.cdce:;; Function get_order (get_order, funcdef_no=295,
> decl_uid=4396, cgraph_uid=300, symbol_order=303)
> pipe.i.109t.cdce:get_order (long unsigned int size)
> 
> so remove_unreachable_nodes is not called any more.
Yep, that is by design - we are already outputting functions to
assembler file, so there is not much we can do at this moent. Option
wold be to do threading early of course.  How often this happen in
practice?

Also note that -Winline outputs reasons why the static inline is not
inlined (which is also by design a decision of the inliner heuristics).
I suppose here the inliner sees the function called multiple times and
since it is quite long it decides to keep it offline.  Opitmizing all
references late if of course unfortunate.

Honza
Comment 18 Martin Liška 2020-10-19 15:33:43 UTC
(In reply to Jan Hubicka from comment #17)
> > The following happens:
> > 
> > get_order is called by kmalloc_large which is called in kmalloc. And kmalloc
> > calls the function for larger allocations. Problem is that we eliminate all
> > calls to get_order late
> > 
> > pipe.i.108t.thread1:;; Function get_order (get_order, funcdef_no=295,
> > decl_uid=4528, cgraph_uid=300, symbol_order=303)
> > pipe.i.108t.thread1:get_order (long unsigned int size)
> > pipe.i.108t.thread1:  _125 = get_order (_114);
> > pipe.i.108t.thread1:  _67 = get_order (_56);
> > pipe.i.109t.cdce:;; Function get_order (get_order, funcdef_no=295,
> > decl_uid=4396, cgraph_uid=300, symbol_order=303)
> > pipe.i.109t.cdce:get_order (long unsigned int size)
> > 
> > so remove_unreachable_nodes is not called any more.
> Yep, that is by design - we are already outputting functions to
> assembler file, so there is not much we can do at this moent. Option
> wold be to do threading early of course.  How often this happen in
> practice?
> 
> Also note that -Winline outputs reasons why the static inline is not

You are right, this is printed:

gcc -O2 pipe.i -c -fdump-tree-all -Winline
In file included from ./arch/x86/include/asm/page.h:77,
                 from ./arch/x86/include/asm/thread_info.h:12,
                 from ./include/linux/thread_info.h:38,
                 from ./arch/x86/include/asm/preempt.h:7,
                 from ./include/linux/preempt.h:78,
                 from ./include/linux/spinlock.h:51,
                 from ./include/linux/mmzone.h:8,
                 from ./include/linux/gfp.h:6,
                 from ./include/linux/mm.h:10,
                 from fs/pipe.c:8:
./include/linux/slab.h: In function ‘alloc_pipe_info’:
./include/asm-generic/getorder.h:29:146: warning: inlining failed in call to ‘get_order’: --param max-inline-insns-single limit reached [-Winline]
   29 | static inline __attribute_const__ int get_order(unsigned long size)
      |                                                                                                                                                  ^        
In file included from fs/pipe.c:11:
./include/linux/slab.h:482:30: note: called from here
  482 |         unsigned int order = get_order(size);
      |                              ^~~~~~~~~~~~~~~
In file included from ./arch/x86/include/asm/page.h:77,
                 from ./arch/x86/include/asm/thread_info.h:12,
                 from ./include/linux/thread_info.h:38,
                 from ./arch/x86/include/asm/preempt.h:7,
                 from ./include/linux/preempt.h:78,
                 from ./include/linux/spinlock.h:51,
                 from ./include/linux/mmzone.h:8,
                 from ./include/linux/gfp.h:6,
                 from ./include/linux/mm.h:10,
                 from fs/pipe.c:8:
./include/linux/slab.h: In function ‘pipe_resize_ring’:
./include/asm-generic/getorder.h:29:146: warning: inlining failed in call to ‘get_order’: --param max-inline-insns-single limit reached [-Winline]
   29 | static inline __attribute_const__ int get_order(unsigned long size)
      |                                                                                                                                                  ^        
In file included from fs/pipe.c:11:
./include/linux/slab.h:482:30: note: called from here
  482 |         unsigned int order = get_order(size);
      |                              ^~~~~~~~~~~~~~~


> inlined (which is also by design a decision of the inliner heuristics).
> I suppose here the inliner sees the function called multiple times and
> since it is quite long it decides to keep it offline.  Opitmizing all
> references late if of course unfortunate.
> 
> Honza
Comment 19 Jan Hubicka 2020-10-19 16:13:12 UTC
get_order unwinds to:

  <bb 2> [local count: 1073741824]:
  _1 = __builtin_constant_p (size_68(D));
  if (_1 != 0)
    goto <bb 3>; [50.00%]
  else
    goto <bb 71>; [50.00%]

  <bb 3> [local count: 536870913]:
  if (size_68(D) == 0)
    goto <bb 72>; [21.72%]
  else
    goto <bb 4>; [78.28%]

  <bb 4> [local count: 420262548]:
  if (size_68(D) <= 4095)
    goto <bb 72>; [50.00%]
  else
    goto <bb 5>; [50.00%]

  <bb 5> [local count: 210131274]:
  _2 = size_68(D) + 18446744073709551615;
  _3 = __builtin_constant_p (_2);
  if (_3 != 0)
    goto <bb 6>; [50.00%]
  else
    goto <bb 69>; [50.00%]

  <bb 6> [local count: 105065637]:
  _4 = (signed long) _2;
  if (_4 >= 0)
    goto <bb 7>; [59.00%]
  else
    goto <bb 70>; [41.00%]

... [very long code]

  <bb 69> [local count: 105065637]:
  __asm__("bsrq %1,%q0" : "=r" bitpos_75 : "rm" _2, "0" -1);
  iftmp.1_73 = bitpos_75 + -11;

  <bb 70> [local count: 210131274]:
  # iftmp.1_67 = PHI <52(6), iftmp.1_73(69), 51(7), 50(8), 49(9), 48(10), 47(11), 46(12), 45(13), 44(14), 43(15), 42(16), 41(17), 40(18), 39(19), 38(20), 37(21), 36(22), 35(23), 34(24), 33(25), 32(26), 31(27), 30(28), 29(29), 28(30), 27(31), 26(32), 25(33), 24(34), 23(35), 22(36), 21(37), 20(38), 19(39), 18(40), 17(41), 16(42), 15(43), 14(44), 13(45), 12(46), 11(47), 10(48), 9(49), 8(50), 7(51), 6(52), 5(53), 4(54), 3(55), 2(56), 1(57), 0(58), -1(59), -2(60), -3(61), -4(62), -5(63), -6(64), -7(65), -8(66), -10(68), -9(67)>
  goto <bb 72>; [100.00%]

  <bb 71> [local count: 536870913]:
  size_69 = size_68(D) + 18446744073709551615;
  size_70 = size_69 >> 12;
  __asm__("bsrq %1,%q0" : "=r" bitpos_72 : "rm" size_70, "0" -1);
  _74 = bitpos_72 + 1;

  <bb 72> [local count: 1073741824]:
  # _66 = PHI <52(3), 0(4), iftmp.1_67(70), _74(71)>
  return _66;

We get summary:

IPA function summary for get_order/303 inlinable                                
  global time:     8.716289                                                     
  self size:       201                                                          
  global size:     201                                                          
  min size:       4                                                             
  self stack:      0                                                            
  global stack:    0                                                            
    size:4.000000, time:3.000000                                                
    size:3.000000, time:2.000000,  executed if:(not inlined)                    
    size:4.000000, time:2.000000,  executed if:(op0 not constant)               
    size:2.000000, time:0.782800,  executed if:(op0 != 0)                       
    size:3.000000, time:0.391400,  executed if:(op0 > 4095) && (op0 != 0)       
    size:2.000000, time:0.195700,  executed if:(op0 > 4095) && (op0 != 0) && (op0 not constant)
    size:3.000000, time:0.173194,  executed if:(op0,(# + 18446744073709551615),((signed long) #) >= 0) && (op0 > 4095) && (op0 != 0)
    size:3.000000, time:0.086597,  executed if:(op0,(# + 18446744073709551615),(# & 4611686018427387904) == 0) && (op0,(# + 18446744073709551615),((signed long) #) >= 0) && (op0 > 4095) && (op0 != 0)
    size:3.000000, time:0.043299,  executed if:(op0,(# + 18446744073709551615),(# & 2305843009213693952) == 0) && (op0,(# + 18446744073709551615),(# & 4611686018427387904) == 0) && (op0,(# + 18446744073709551615),((signed long) #) >= 0) && (op0 > 4095) && (op0 != 0)
    size:3.000000, time:0.021649,  executed if:(op0,(# + 18446744073709551615),(# & 1152921504606846976) == 0) && (op0,(# + 18446744073709551615),(# & 2305843009213693952) == 0) && (op0,(# + 18446744073709551615),(# & 4611686018427387904) == 0) && (op0,(# + 18446744073709551615),((signed long) #) >= 0) && (op0 > 4095) && (op0 != 0)
    size:3.000000, time:0.010825,  executed if:(op0,(# + 18446744073709551615),(# & 576460752303423488) == 0) && (op0,(# + 18446744073709551615),(# & 1152921504606846976) == 0) && (op0,(# + 18446744073709551615),(# & 2305843009213693952) == 0) && (op0,(# + 18446744073709551615),(# & 4611686018427387904) == 0) && (op0,(# + 18446744073709551615),((signed long) #) >= 0) && (op0 > 4095) && (op0 != 0)
    size:168.000000, time:0.010825,  executed if:(op0,(# + 18446744073709551615),(# & 288230376151711744) == 0) && (op0,(# + 18446744073709551615),(# & 576460752303423488) == 0) && (op0,(# + 18446744073709551615),(# & 1152921504606846976) == 0) && (op0,(# + 18446744073709551615),(# & 2305843009213693952) == 0) && (op0,(# + 18446744073709551615),(# & 4611686018427387904) == 0) && (op0,(# + 18446744073709551615),((signed long) #) >= 0) && (op0 > 4095) && (op0 != 0)
  calls:                                                                        
    __builtin_constant_p/4546 function body not available                       
      freq:0.20 loop depth: 0 size: 0 time:  0 predicate: (op0 > 4095) && (op0 != 0)
       op0 points to local or readonly memory                                   
    __builtin_constant_p/4546 function body not available                       
      freq:1.00 loop depth: 0 size: 0 time:  0                                  

and then in calls to get_inline we do not know the constant parameter:

   Estimating body: get_order/303                                               
   Known to be false: not inlined                                               
   size:198 time:6.716289 nonspec time:8.716289 loops with known iterations:0.000000 known strides:0.000000

the problem here is size of 198 instructions while we inline up to 70 instructions.  Of course after concluding that parameter is not constant this would all collapse to just few instrutions.

It is difficult to handle builtin_constant_p correctly at this stage: ipa-prop is missing a lot of known constants and it is quite possible parameter will be folded to constant post inlining and thus we keep both variant.

We could teach ipa-predicates that the if is exclusive and thus only max of both variants should be accounted byt it does not fit the way predicates works very well.  One option would be to takea hint that function with builtin_constant_p on parameters really wants to be inlined and increase the bounds (I think this owuld be good idea to do along with functions having vector builtins in them), but that would cure only some cases, certainly not all.

It is always possible to always_inline functions that are intended to be always inlined.
Honza
Comment 20 Jan Hubicka 2020-10-19 16:20:45 UTC
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97445
> 
> --- Comment #19 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
> get_order unwinds to:
> 
>   <bb 2> [local count: 1073741824]:
>   _1 = __builtin_constant_p (size_68(D));
>   if (_1 != 0)
>     goto <bb 3>; [50.00%]
>   else
>     goto <bb 71>; [50.00%]
> 
>   <bb 3> [local count: 536870913]:
>   if (size_68(D) == 0)
>     goto <bb 72>; [21.72%]
>   else
>     goto <bb 4>; [78.28%]
> 
>   <bb 4> [local count: 420262548]:
>   if (size_68(D) <= 4095)
>     goto <bb 72>; [50.00%]
>   else
>     goto <bb 5>; [50.00%]
> 
>   <bb 5> [local count: 210131274]:
>   _2 = size_68(D) + 18446744073709551615;
>   _3 = __builtin_constant_p (_2);
Forgot to note, things would be easier if we folded this to _1 :)
Among other we run on out of the limit on number of conditionals
recorded by the fnsummary pass.

Honza
Comment 21 Jakub Jelinek 2020-10-19 16:52:47 UTC
(In reply to Jan Hubicka from comment #20)
> >   _2 = size_68(D) + 18446744073709551615;
> >   _3 = __builtin_constant_p (_2);
> Forgot to note, things would be easier if we folded this to _1 :)
> Among other we run on out of the limit on number of conditionals
> recorded by the fnsummary pass.

Maybe better just have a match.pd rule that would fold
y = z binop cst;
x = __builtin_constant_p (y);
to
x = __builtin_constant_p (z);
and let SCCVN do the rest (or do it in fwprop or whatever else if we'd want to write it in C without having to enumerate all possible binops we want to do it for).

Not sure if we shouldn't stop on ops that could trap though, or on ops that could possibly be invalid...
Comment 22 Jan Hubicka 2020-10-19 17:13:17 UTC
> Maybe better just have a match.pd rule that would fold
> y = z binop cst;
> x = __builtin_constant_p (y);
> to
> x = __builtin_constant_p (z);
> and let SCCVN do the rest (or do it in fwprop or whatever else if we'd want to
> write it in C without having to enumerate all possible binops we want to do it
> for).
> 
> Not sure if we shouldn't stop on ops that could trap though, or on ops that
> could possibly be invalid...

We need to establish that z binop cst folds to a constant whenever z is
a constant or we may run into cases where we go into builtin_constant_p
== true case and then fail to fold the actual value.

Honza
> 
> -- 
> You are receiving this mail because:
> You are on the CC list for the bug.
Comment 23 Christophe Leroy 2020-10-20 05:19:49 UTC
(In reply to Jan Hubicka from comment #19)
> 
> It is always possible to always_inline functions that are intended to be
> always inlined.
> Honza

Yes and I sent a patch for that to the Linux kernel, but what I would like to understand is why does GCC 10 completely fails to inline that while GCC 9 was doing things properly ?

Find attached the same temp files generated with GCC 9. GCC9 sees that get_order() is not used and doesn't generate it.
Comment 24 Christophe Leroy 2020-10-20 05:21:14 UTC
Created attachment 49403 [details]
pipe.i from build of fs/pipe.o with GCC 9.2
Comment 25 Christophe Leroy 2020-10-20 05:21:36 UTC
Created attachment 49404 [details]
pipe.s from build of fs/pipe.o with GCC 9.2
Comment 26 Christophe Leroy 2020-10-20 06:17:29 UTC
In reality it is not perfect with GCC 9.2, but that's better, only two instances are unused. 

[root@po17688vm linux-powerpc]# ppc-linux-objdump -d vmlinux | grep get_order
c00c0470 <get_order>:
c013e238 <get_order>:
c0225f68 <get_order>:
c041ebfc:	4b ca 18 75 	bl      c00c0470 <get_order>
c042d34c:	4b c9 31 25 	bl      c00c0470 <get_order>

In the few places where get_order() remains with GCC 9.2, I get the following warning:

warning: inlining failed in call to 'get_order': call is unlikely and code size would grow [-Winline]
Comment 27 Jan Hubicka 2020-10-20 06:44:50 UTC
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97445
> 
> --- Comment #23 from Christophe Leroy <christophe.leroy at csgroup dot eu> ---
> (In reply to Jan Hubicka from comment #19)
> > 
> > It is always possible to always_inline functions that are intended to be
> > always inlined.
> > Honza
> 
> Yes and I sent a patch for that to the Linux kernel, but what I would like to
> understand is why does GCC 10 completely fails to inline that while GCC 9 was
> doing things properly ?

It is because --param inline-insns-single was reduced for -O2 from 200
to 70.  GCC 10 has newly different set of parameters for -O2 and -O3 and
enables auto-inlining at -O2.

Problem with inlininig funtions declared inline is that C++ codebases
tends to abuse this keyword for things that are really too large (and
get_order would be such example if it did not have builtin_constant_p
check which inliner does not understand well). So having same limit at
-O2 and -O3 turned out to be problematic with respect to code size and
especially with respect to LTO, where a lot more inlining oppurtunities
appear.

I will implement the heuristics to push up inline limits of functions
having builtin_constant_p of parameter which should help a bit in this
case (but not very systematically: as dicussed in the PR log it is quite
hard problem to get builtin_constant_p right in the code size metrics
used by inliner before it knows exactly what is going to be constant and
what is not).

Honza
Comment 28 Christophe Leroy 2020-10-20 07:09:09 UTC
Find attached the temporary files for net/core/skbuff.c, as it is indeed what initially triggered my focus on this issue. I don't expect such a function at all:

00000cc8 <csum_partial>:
     cc8:       48 00 00 00     b       cc8 <csum_partial>
                        cc8: R_PPC_REL24        __csum_partial


Every function should call __csum_partial() directly.

I guess that's the same problem, powerpc csum_partial() does sums 'manually' for some particular values of constant 'len', then fallbacks on __csum_partial().
When 'len' is not a constant, csum_partial() is reduced to a call to __csum_partial(), and I'd expect GCC 10 to call __csum_partial() directly as all previous GCC have always done.

Though, getting the following warning when building:

warning: inlining failed in call to 'csum_partial': --param max-inline-insns-single limit reached [-Winline]
Comment 29 Christophe Leroy 2020-10-20 07:10:25 UTC
Created attachment 49406 [details]
skbuff.s from build of net/core/skbuff.o with GCC 10
Comment 30 Christophe Leroy 2020-10-20 07:10:59 UTC
Created attachment 49407 [details]
skbuff.i from build of net/core/skbuff.o with GCC 10
Comment 31 Segher Boessenkool 2020-10-20 09:55:12 UTC
(In reply to Jan Hubicka from comment #27)
> It is because --param inline-insns-single was reduced for -O2 from 200
> to 70.  GCC 10 has newly different set of parameters for -O2 and -O3 and
> enables auto-inlining at -O2.
> 
> Problem with inlininig funtions declared inline is that C++ codebases
> tends to abuse this keyword for things that are really too large (and
> get_order would be such example if it did not have builtin_constant_p
> check which inliner does not understand well). So having same limit at
> -O2 and -O3 turned out to be problematic with respect to code size and
> especially with respect to LTO, where a lot more inlining oppurtunities
> appear.

Do the heuristics account for that not inlining a "static inline" results
in multiple copies?

> I will implement the heuristics to push up inline limits of functions
> having builtin_constant_p of parameter which should help a bit in this
> case

Thank you!

> (but not very systematically: as dicussed in the PR log it is quite
> hard problem to get builtin_constant_p right in the code size metrics
> used by inliner before it knows exactly what is going to be constant and
> what is not).

That is true for many other inlining things as well...  builtin_constant_p
is worse than most I guess ;-)
Comment 32 Jan Hubicka 2020-10-20 10:16:19 UTC
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97445
> 
> --- Comment #31 from Segher Boessenkool <segher at gcc dot gnu.org> ---
> (In reply to Jan Hubicka from comment #27)
> > It is because --param inline-insns-single was reduced for -O2 from 200
> > to 70.  GCC 10 has newly different set of parameters for -O2 and -O3 and
> > enables auto-inlining at -O2.
> > 
> > Problem with inlininig funtions declared inline is that C++ codebases
> > tends to abuse this keyword for things that are really too large (and
> > get_order would be such example if it did not have builtin_constant_p
> > check which inliner does not understand well). So having same limit at
> > -O2 and -O3 turned out to be problematic with respect to code size and
> > especially with respect to LTO, where a lot more inlining oppurtunities
> > appear.
> 
> Do the heuristics account for that not inlining a "static inline" results
> in multiple copies?

It prevents inlining only when there are multiple calls in the unit
being compiled (there is no way to know that the same inline function is
duplicated in other units).
This is what happens here: there are multiple calls so inliner concludes
inlining would cost too much of code size and later they are optimized
away.

get_order is a wrapper around ffs64.  This can be implemented w/o asm
statement as follows:
int
my_fls64 (__u64 x)
{
  if (!x)
      return 0;
  return 64 - __builtin_clzl (x);
}

This results in longer assembly than the kernel asm implementation. If
that matters I would replace builtin_constnat_p part of get_order by this
implementation that is more transparent to the code size estimation and
things will get inlined.

Honza
Comment 33 Jakub Jelinek 2020-10-20 10:40:48 UTC
(In reply to Jan Hubicka from comment #32)
> get_order is a wrapper around ffs64.  This can be implemented w/o asm
> statement as follows:
> int
> my_fls64 (__u64 x)
> {
>   if (!x)
>       return 0;
>   return 64 - __builtin_clzl (x);
> }
> 
> This results in longer assembly than the kernel asm implementation. If
> that matters I would replace builtin_constnat_p part of get_order by this
> implementation that is more transparent to the code size estimation and
> things will get inlined.

Better __builtin_clzll so that it works also on 32-bit arches.
Anyway, if kernel's fls64 results in better code than the my_fls64, we should look at GCC's code generation for that case.

And, perhaps kernel's const_ilog2 should be reimplemented using __builtin_clz*?
Or, maybe even better, keep const_ilog2 as is because as it is declared it should be usable even in pedantic C constant expressions, and just change ilog2 to:
#define ilog2(n) \
( \
        __builtin_constant_p(n) ?       \
        ((n) < 2 ? 0 : 63 - __builtin_clzll (n)) : \
        (sizeof(n) <= 4) ?              \
        __ilog2_u32(n) :                \
        __ilog2_u64(n)                  \
 )
Comment 34 Jan Hubicka 2020-10-20 11:12:52 UTC
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97445
> 
> --- Comment #33 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
> (In reply to Jan Hubicka from comment #32)
> > get_order is a wrapper around ffs64.  This can be implemented w/o asm
> > statement as follows:
> > int
> > my_fls64 (__u64 x)
> > {
> >   if (!x)
> >       return 0;
> >   return 64 - __builtin_clzl (x);
> > }
> > 
> > This results in longer assembly than the kernel asm implementation. If
> > that matters I would replace builtin_constnat_p part of get_order by this
> > implementation that is more transparent to the code size estimation and
> > things will get inlined.
> 
> Better __builtin_clzll so that it works also on 32-bit arches.
> Anyway, if kernel's fls64 results in better code than the my_fls64, we should
> look at GCC's code generation for that case.

Original asm is:

__attribute__ ((noinline))
int fls64(__u64 x)
{
 int bitpos = -1;
 asm("bsrq %1,%q0"
     : "+r" (bitpos)
     : "rm" (x));
 return bitpos + 1;
}

There seems to be bug in bsr{q} pattern.  I can make GCC produce same
code with:

__attribute__ ((noinline))
int
my_fls64 (__u64 x)
{
  asm volatile ("movl $-1, %eax");
  return (__builtin_clzll (x) ^ 63) + 1;
}

But obviously the volatile asm should not be needed.  I think bsrq is
incorrectly modelled as returning full register

(define_insn "bsr_rex64"
  [(set (match_operand:DI 0 "register_operand" "=r")
        (minus:DI (const_int 63)
                  (clz:DI (match_operand:DI 1 "nonimmediate_operand" "rm"))))
   (clobber (reg:CC FLAGS_REG))]
  "TARGET_64BIT"
  "bsr{q}\t{%1, %0|%0, %1}"
  [(set_attr "type" "alu1")
   (set_attr "prefix_0f" "1")
   (set_attr "znver1_decode" "vector")
   (set_attr "mode" "DI")])
Comment 35 Jan Hubicka 2020-10-20 11:17:19 UTC
> 
> Original asm is:
> 
> __attribute__ ((noinline))
> int fls64(__u64 x)
> {
>  int bitpos = -1;
>  asm("bsrq %1,%q0"
>      : "+r" (bitpos)
>      : "rm" (x));
>  return bitpos + 1;
> }
> 
> There seems to be bug in bsr{q} pattern.  I can make GCC produce same
> code with:
> 
> __attribute__ ((noinline))
> int
> my_fls64 (__u64 x)
> {
>   asm volatile ("movl $-1, %eax");
>   return (__builtin_clzll (x) ^ 63) + 1;
> }

Aha, bsr is not doing anything if parameter is 0, so pattern is correct
(just the instruction is undefined for 0 which makes sense).
But with that pattern GCC can't synthetize the code sequence above :)

Honza
Comment 36 Jan Hubicka 2020-10-20 11:45:02 UTC
> Find attached the temporary files for net/core/skbuff.c, as it is indeed what
> initially triggered my focus on this issue. I don't expect such a function at
> all:
> 
> 00000cc8 <csum_partial>:
>      cc8:       48 00 00 00     b       cc8 <csum_partial>
>                         cc8: R_PPC_REL24        __csum_partial
> 
> 
> Every function should call __csum_partial() directly.

Here csum_artial is:

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__)) __attribute__((__no_instrument_function__)) __wsum csum_add(__wsum csum, __wsum addend)
{
 if (__builtin_constant_p(csum) && csum == 0)
  return addend;
 if (__builtin_constant_p(addend) && addend == 0)
  return csum;
 asm("addc %0,%0,%1;"
     "addze %0,%0;"
     : "+r" (csum) : "r" (addend) : "xer");
 return csum;
}
static inline __attribute__((__gnu_inline__)) __attribute__((__unused__)) __attribute__((__no_instrument_function__)) __wsum csum_partial(const void *buff, int len, __wsum sum)
{
 if (__builtin_constant_p(len) && len <= 16 && (len & 1) == 0) {
  if (len == 2)
   sum = csum_add(sum, ( __wsum)*(const u16 *)buff);
  if (len >= 4)
   sum = csum_add(sum, ( __wsum)*(const u32 *)buff);
  if (len == 6)
   sum = csum_add(sum, ( __wsum)
         *(const u16 *)(buff + 4));
  if (len >= 8)
   sum = csum_add(sum, ( __wsum)
         *(const u32 *)(buff + 4));
  if (len == 10)
   sum = csum_add(sum, ( __wsum)
         *(const u16 *)(buff + 8));
  if (len >= 12)
   sum = csum_add(sum, ( __wsum)
         *(const u32 *)(buff + 8));
  if (len == 14)
   sum = csum_add(sum, ( __wsum)
         *(const u16 *)(buff + 12));
  if (len >= 16)
   sum = csum_add(sum, ( __wsum)
         *(const u32 *)(buff + 12));
 } else if (__builtin_constant_p(len) && (len & 3) == 0) {
  sum = csum_add(sum, ip_fast_csum_nofold(buff, len >> 2));
 } else {
  sum = __csum_partial(buff, len, sum);
 }
 return sum;
}
So again it expands to really large decision tree with many
builtion_constant_p checks that makes inliner to give up.
You should see all such cases easilly with -Winline
Comment 37 Jan Hubicka 2020-10-20 13:12:20 UTC
Created attachment 49408 [details]
hintsp

Hi,
this implements the heuristics increasing bounds for functions having
__builtin_constant_p on parameters.  Note that for get_order this is
still not enough, since we increase the bound twice if hint applies, so
it goes from 70 to 140 and not to 190 needed, however it will handle
ohter similar cases.

If hint weight is increased to 300%, so 210 I get:
hubicka@lomikamen-jh:/aux/hubicka/trunk-git/build10/gcc$ ./xgcc -B ./ -O2 -Winline pipe.i --param inline-heuristics-hint-percent=300
In file included from fs/pipe.c:11:
./include/linux/slab.h: In function ‘alloc_pipe_info’:
./include/linux/slab.h:586:121: warning: inlining failed in call to ‘kmalloc_array.constprop’: --param max-inline-insns-single limit reached
[-Winline]
./include/linux/slab.h:605:9: note: called from here
./include/linux/slab.h: In function ‘pipe_resize_ring’:
./include/linux/slab.h:586:121: warning: inlining failed in call to ‘kmalloc_array.constprop’: --param max-inline-insns-single limit reached [-Winline]
./include/linux/slab.h:605:9: note: called from here

So the problem only shifts to not inlininig kmalloc_array.
(that is why it would be nice to update kernel with the easier
get_order)

However it shows different problem: ipa-cp produces cone of
kmalloc_array since it is always used by constant size, but the clone
does not update the predicates, so we lose track about the parameter
being constant and that is why we optimize out only late.

Martin, I think this is caused by long lasting TODO in
ipa_fn_summary_t::duplicate and probably we should implement it: based
on the known partial assignment of params to constant we should fold the
conditions in predicates.

Indeed with ./xgcc -B ./ -O2 -Winline pipe.i  -fno-ipa-cp --param inline-heuristics-hint-percent=300
the warning goes away.  We still need the stronger hint though.
Comment 38 Christophe Leroy 2020-10-20 13:28:19 UTC
(In reply to Jan Hubicka from comment #32)
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97445
> > 
> > --- Comment #31 from Segher Boessenkool <segher at gcc dot gnu.org> ---
> > (In reply to Jan Hubicka from comment #27)
> > > It is because --param inline-insns-single was reduced for -O2 from 200
> > > to 70.  GCC 10 has newly different set of parameters for -O2 and -O3 and
> > > enables auto-inlining at -O2.
> > > 
> > > Problem with inlininig funtions declared inline is that C++ codebases
> > > tends to abuse this keyword for things that are really too large (and
> > > get_order would be such example if it did not have builtin_constant_p
> > > check which inliner does not understand well). So having same limit at
> > > -O2 and -O3 turned out to be problematic with respect to code size and
> > > especially with respect to LTO, where a lot more inlining oppurtunities
> > > appear.
> > 
> > Do the heuristics account for that not inlining a "static inline" results
> > in multiple copies?
> 
> It prevents inlining only when there are multiple calls in the unit
> being compiled (there is no way to know that the same inline function is
> duplicated in other units).
> This is what happens here: there are multiple calls so inliner concludes
> inlining would cost too much of code size and later they are optimized
> away.
> 
> get_order is a wrapper around ffs64.  This can be implemented w/o asm
> statement as follows:
> int
> my_fls64 (__u64 x)
> {
>   if (!x)
>       return 0;
>   return 64 - __builtin_clzl (x);
> }
> 
> This results in longer assembly than the kernel asm implementation. If
> that matters I would replace builtin_constnat_p part of get_order by this
> implementation that is more transparent to the code size estimation and
> things will get inlined.
> 

But on powerpc that's already the case and it doesn't solve the issue.

static inline int fls(unsigned int x)
{
	return 32 - __builtin_clz(x);
}

static inline int fls64(__u64 x)
{
	return 64 - __builtin_clzll(x);
}
Comment 39 Jakub Jelinek 2020-10-20 13:31:23 UTC
(In reply to Christophe Leroy from comment #38)
> But on powerpc that's already the case and it doesn't solve the issue.
> 
> static inline int fls(unsigned int x)
> {
> 	return 32 - __builtin_clz(x);
> }
> 
> static inline int fls64(__u64 x)
> {
> 	return 64 - __builtin_clzll(x);
> }

That is clearly a kernel bug (__builtin_clz* is documented undefined for 0, while fls* wants to be well defined there), and shouldn't change anything, because
in the if (__builtin_constant_p (size))
case get_order doesn't use fls*, but ilog2.  And it is ilog2 that should be changed.
Comment 40 Christophe Leroy 2020-10-20 13:51:18 UTC
(In reply to Jakub Jelinek from comment #39)
> (In reply to Christophe Leroy from comment #38)
> > But on powerpc that's already the case and it doesn't solve the issue.
> > 
> > static inline int fls(unsigned int x)
> > {
> > 	return 32 - __builtin_clz(x);
> > }
> > 
> > static inline int fls64(__u64 x)
> > {
> > 	return 64 - __builtin_clzll(x);
> > }
> 
> That is clearly a kernel bug (__builtin_clz* is documented undefined for 0,
> while fls* wants to be well defined there), and shouldn't change anything,
> because
> in the if (__builtin_constant_p (size))
> case get_order doesn't use fls*, but ilog2.  And it is ilog2 that should be
> changed.

What's the bug ?

int f(int x)
{
  return __builtin_clz(x);
}

Compiles into

<f>:
  cntlzw r3, r3
  blr


Powerpc 32 bits documentation says:

cntlzw : Count Leading Zeros Word

A count of the number of consecutive zero bits starting at bit 0 of rS is placed into rA. This number ranges from 0 to 32, inclusive.

I can't see a problem when x == 0
Comment 41 Jakub Jelinek 2020-10-20 14:18:38 UTC
It is documented to be undefined:
 -- Built-in Function: int __builtin_clz (unsigned int x)
     Returns the number of leading 0-bits in X, starting at the most
     significant bit position.  If X is 0, the result is undefined.
Especially GCC 11 (but e.g. clang too) will e.g. during value range propagation assume that e.g. the builtin return value will be only 0 to 31, not to 32, etc.
The portable way how to write this is x ? __builtin_clz (x) : whatever_value_you_want_for_clz_0;
and the compiler should recognize that and if the instruction is well defined for 0 and matches your choice, use optimal sequence.
Comment 42 Christophe Leroy 2020-10-20 14:22:54 UTC
(In reply to Jakub Jelinek from comment #41)
> It is documented to be undefined:
>  -- Built-in Function: int __builtin_clz (unsigned int x)
>      Returns the number of leading 0-bits in X, starting at the most
>      significant bit position.  If X is 0, the result is undefined.
> Especially GCC 11 (but e.g. clang too) will e.g. during value range
> propagation assume that e.g. the builtin return value will be only 0 to 31,
> not to 32, etc.
> The portable way how to write this is x ? __builtin_clz (x) :
> whatever_value_you_want_for_clz_0;
> and the compiler should recognize that and if the instruction is well
> defined for 0 and matches your choice, use optimal sequence.

int f(int x)
{
	return x ? __builtin_clz(x) : 32;
}

Is compiled into (with -O2):

00000000 <f>:
   0:	2c 03 00 00 	cmpwi   r3,0
   4:	40 82 00 0c 	bne     10 <f+0x10>
   8:	38 60 00 20 	li      r3,32
   c:	4e 80 00 20 	blr
  10:	7c 63 00 34 	cntlzw  r3,r3
  14:	4e 80 00 20 	blr



Allthough

int g(void)
{
Comment 43 Christophe Leroy 2020-10-20 14:24:55 UTC
(In reply to Christophe Leroy from comment #42)
> (In reply to Jakub Jelinek from comment #41)
> > It is documented to be undefined:
> >  -- Built-in Function: int __builtin_clz (unsigned int x)
> >      Returns the number of leading 0-bits in X, starting at the most
> >      significant bit position.  If X is 0, the result is undefined.
> > Especially GCC 11 (but e.g. clang too) will e.g. during value range
> > propagation assume that e.g. the builtin return value will be only 0 to 31,
> > not to 32, etc.
> > The portable way how to write this is x ? __builtin_clz (x) :
> > whatever_value_you_want_for_clz_0;
> > and the compiler should recognize that and if the instruction is well
> > defined for 0 and matches your choice, use optimal sequence.
> 
> int f(int x)
> {
> 	return x ? __builtin_clz(x) : 32;
> }
> 
> Is compiled into (with -O2):
> 
> 00000000 <f>:
>    0:	2c 03 00 00 	cmpwi   r3,0
>    4:	40 82 00 0c 	bne     10 <f+0x10>
>    8:	38 60 00 20 	li      r3,32
>    c:	4e 80 00 20 	blr
>   10:	7c 63 00 34 	cntlzw  r3,r3
>   14:	4e 80 00 20 	blr
> 
> 
> 
> Allthough
> 
> int g(void)
> {

int g(int x)
{
	return __builtin_clz(0);
}

Gives

00000018 <g>:
  18:	38 60 00 20 	li      r3,32
  1c:	4e 80 00 20 	blr
Comment 44 Jakub Jelinek 2020-10-20 14:29:08 UTC
Then perhaps some backends need to be improved.
Try e.g.:
void bar (void);

void
foo (int x)
{
  if (__builtin_clz (x) == 32)
    bar ();
}
with trunk GCC if you don't trust me.
Comment 45 Jakub Jelinek 2020-10-20 16:58:52 UTC
> > __attribute__ ((noinline))
> > int
> > my_fls64 (__u64 x)
> > {
> >   asm volatile ("movl $-1, %eax");
> >   return (__builtin_clzll (x) ^ 63) + 1;
> > }
> 
> Aha, bsr is not doing anything if parameter is 0, so pattern is correct
> (just the instruction is undefined for 0 which makes sense).
> But with that pattern GCC can't synthetize the code sequence above :)

The docs explicitly say that bsf/bsr DEST is undefined if ZF is set (operand is zero), so relying on it preserving the previous value of the register is quite dangerous.
Comment 46 Segher Boessenkool 2020-10-20 21:19:40 UTC
(In reply to Christophe Leroy from comment #43)
> int g(int x)
> {
> 	return __builtin_clz(0);
> }
> 
> Gives
> 
> 00000018 <g>:
>   18:	38 60 00 20 	li      r3,32
>   1c:	4e 80 00 20 	blr

That is because rs6000 has

/* The cntlzw and cntlzd instructions return 32 and 64 for input of zero.  */
#define CLZ_DEFINED_VALUE_AT_ZERO(MODE, VALUE) \
  ((VALUE) = GET_MODE_BITSIZE (MODE), 2)

This says that at RTL level and in the optabs, clz of 0 *is* defined,
for rs6000.  But the builtin is not valid with an arg of 0!
Comment 47 Christophe Leroy 2020-10-21 05:38:14 UTC
(In reply to Segher Boessenkool from comment #46)
> (In reply to Christophe Leroy from comment #43)
> > int g(int x)
> > {
> > 	return __builtin_clz(0);
> > }
> > 
> > Gives
> > 
> > 00000018 <g>:
> >   18:	38 60 00 20 	li      r3,32
> >   1c:	4e 80 00 20 	blr
> 
> That is because rs6000 has
> 
> /* The cntlzw and cntlzd instructions return 32 and 64 for input of zero.  */
> #define CLZ_DEFINED_VALUE_AT_ZERO(MODE, VALUE) \
>   ((VALUE) = GET_MODE_BITSIZE (MODE), 2)
> 
> This says that at RTL level and in the optabs, clz of 0 *is* defined,
> for rs6000.  But the builtin is not valid with an arg of 0!

I opened bug #97503 for that
Comment 48 Jan Hubicka 2020-10-21 15:04:25 UTC
Changing component to IPA.

Concerning comment #37 about summaries not being updated after ipa-cp, I was actually wrong there: they are updated and the behaviour is quite sane. We work out that kmalloc has constant argument and produce specialized clone for it. Because it is estimated quite large it is not inlined.  While when ipa-cp is disabled we work out that inlining it will simplify body a lot and bump up the limits.

Jakub, concerning
 asm volatile ("movl $-1, %eax") 
that was of course a hack.  I was confused about bsr instruction - for some time I tought it stores only 8bit value until I re-read the manual.

Honza
Comment 49 Jan Hubicka 2020-10-21 15:16:39 UTC
Patch posted for the inline heuristics change https://gcc.gnu.org/pipermail/gcc-patches/2020-October/556685.html

Also opened spearate PR on builtin_constant_p folding. I am not sure how to implement that correctly (what are the conditions that make this valid - perhaps for all "i op cst" after all?)

Martin, how does the if chain conversion behave on the example?
Comment 50 CVS Commits 2020-10-21 18:01:33 UTC
The master branch has been updated by Jan Hubicka <hubicka@gcc.gnu.org>:

https://gcc.gnu.org/g:caaa218f912ccf932fdb79243ded68bb462bbe63

commit r11-4192-gcaaa218f912ccf932fdb79243ded68bb462bbe63
Author: Jan Hubicka <jh@suse.cz>
Date:   Wed Oct 21 20:00:22 2020 +0200

    Inline functions with builtin_constant_p more agressively.
    
    This patch implements heuristics that increases inline limits (by the hints
    mechanism) for inline functions that use builtin_constant_p on parameter. Those
    are very likely intended to be always inlined and simplify after inlining.
    
    The PR is about a function that we used to inline with
     --param inline-insns-single=200 but with new default of 70 for -O2 we no longer
    do so.  Hints are currently configured to bump the bound up twice, so we
    get limit of 140 that is still not enough to inline the particular testcase
    but it should help in general.  I can implement a stronger bump if that seems
    useful (maybe it is). The example is bit operation written as a decision chain
    with 64 conditions.
    This blows up the limit on number of conditions we track per funtion (which is
    30) and thus the size/time estimates are not working that well.
    
    gcc/ChangeLog:
    
            PR ipa/97445
            * ipa-fnsummary.c (ipa_dump_hints): Add INLINE_HINT_builtin_constant_p.
            (ipa_fn_summary::~ipa_fn_summary): Free builtin_constant_p_parms.
            (ipa_fn_summary_t::duplicate): Duplicate builtin_constant_p_parms.
            (ipa_dump_fn_summary): Dump builtin_constant_p_parms.
            (add_builtin_constant_p_parm): New function
            (set_cond_stmt_execution_predicate): Update builtin_constant_p_parms.
            (ipa_call_context::estimate_size_and_time): Set
            INLINE_HINT_builtin_constant_p..
            (ipa_merge_fn_summary_after_inlining): Merge builtin_constant_p_parms.
            (inline_read_section): Read builtin_constant_p_parms.
            (ipa_fn_summary_write): Write builtin_constant_p_parms.
            * ipa-fnsummary.h (enum ipa_hints_vals): Add
            INLINE_HINT_builtin_constant_p.
            * ipa-inline.c (want_inline_small_function_p): Use
            INLINE_HINT_builtin_constant_p.
            (edge_badness): Use INLINE_HINT_builtin_constant_p.
    
    gcc/testsuite/ChangeLog:
    
            PR ipa/97445
            * gcc.dg/ipa/inlinehint-5.c: New test.
Comment 51 Martin Liška 2020-10-21 18:21:25 UTC
> Martin, how does the if chain conversion behave on the example?

I don't see how can if-to-switch conversion pass help us here. It's designed to identify compact intervals. In this case we see:

  <bb 12> :
  _16 = _2 & 576460752303423488;
  if (_16 == 0)
    goto <bb 13>; [INV]
  else
    goto <bb 72>; [INV]

  <bb 13> :
  _18 = _2 & 288230376151711744;
  if (_18 == 0)
    goto <bb 14>; [INV]
  else
    goto <bb 72>; [INV]
...

So it's a series of masking. Do you see Honza how to convert it?
Comment 52 Jan Hubicka 2020-10-21 18:24:50 UTC
> I don't see how can if-to-switch conversion pass help us here. It's designed to
> identify compact intervals. In this case we see:
> 
>   <bb 12> :
>   _16 = _2 & 576460752303423488;
>   if (_16 == 0)
>     goto <bb 13>; [INV]
>   else
>     goto <bb 72>; [INV]
> 
>   <bb 13> :
>   _18 = _2 & 288230376151711744;
>   if (_18 == 0)
>     goto <bb 14>; [INV]
>   else
>     goto <bb 72>; [INV]
> ...
> 
> So it's a series of masking. Do you see Honza how to convert it?
It goes from 1 to 1<<63, so each of tests translates to a range.

Honza
Comment 53 Martin Liška 2020-10-21 18:25:53 UTC
> It goes from 1 to 1<<63, so each of tests translates to a range.

Yes, but these ranges are very large, nothing for a jump table or a bit-test.
Comment 54 Jan Hubicka 2020-10-21 18:34:40 UTC
> > It goes from 1 to 1<<63, so each of tests translates to a range.
> 
> Yes, but these ranges are very large, nothing for a jump table or a bit-test.
Yep, but theoretically you can recover the decision table and pattern
match it is a bit builtin.  Not sure how many open coded bit builtins
are there.
Comment 55 Martin Liška 2020-10-21 18:38:54 UTC
(In reply to Jan Hubicka from comment #54)
> > > It goes from 1 to 1<<63, so each of tests translates to a range.
> > 
> > Yes, but these ranges are very large, nothing for a jump table or a bit-test.
> Yep, but theoretically you can recover the decision table and pattern
> match it is a bit builtin.  Not sure how many open coded bit builtins
> are there.

Such a pattern matching is a bit similar to PR90838.
Anyway, not planning to implement it now.
Comment 56 CVS Commits 2020-10-21 23:42:31 UTC
The master branch has been updated by Jan Hubicka <hubicka@gcc.gnu.org>:

https://gcc.gnu.org/g:3fd5876793ddf882994acafc9c5b28e3be8897bd

commit r11-4196-g3fd5876793ddf882994acafc9c5b28e3be8897bd
Author: Jan Hubicka <jh@suse.cz>
Date:   Thu Oct 22 01:42:11 2020 +0200

    Strenghten bound for bulitin_constant_p hint.
    
    this patch makes builtin_constant_p hint to combine with other loop hints
    we already support.
    
    gcc/ChangeLog:
    
    2020-10-22  Jan Hubicka  <hubicka@ucw.cz>
    
            PR ipa/97445
            * ipa-inline.c (inline_insns_single): Add hint2 parameter.
            (inline_insns_auto): Add hint2 parameter.
            (can_inline_edge_by_limits_p): Update.
            (want_inline_small_function_p): Update.
            (wrapper_heuristics_may_apply): Update.