This is the mail archive of the
gcc-bugs@gcc.gnu.org
mailing list for the GCC project.
[Bug tree-optimization/58727] New: Sub-optimal code for bit clear/set sequence
- From: "niels at penneman dot org" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Mon, 14 Oct 2013 20:01:43 +0000
- Subject: [Bug tree-optimization/58727] New: Sub-optimal code for bit clear/set sequence
- Auto-submitted: auto-generated
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58727
Bug ID: 58727
Summary: Sub-optimal code for bit clear/set sequence
Product: gcc
Version: 4.9.0
Status: UNCONFIRMED
Severity: minor
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: niels at penneman dot org
GCC generates sub-optimal but functionally correct code when successively
clearing a bit and setting another bit in an integral value, as shown in the
"clear_set" and "set_clear" functions below. To be more specific, at all
O-levels except "-O1", it emits code to clear the bit being set, prior to
setting it. When splitting the clear and set operations with exactly the same
bits, and calling them one after the other, the problem does not occur. It
looks like this behaviour is caused by an optimization pass prior to inlining.
On ARM this may cause a superfluous instruction to be emitted. In most
instructions, the 32-bit ARM ISA can only encode 8-bit immediate values. Most
of those instructions provide 4 extra bits in their encoding to specify a
rotation. This implies that when the bit being set ("SET" in the test case
below) is close enough to the bit being cleared ("CLEAR"), the code generation
phase merges the superflous clear for "SET" with the one for "CLEAR". When the
bits are too far apart, however, an extra clear instruction is generated, even
when optimizing with "-Os".
It looks like the root cause is a problem in the optimizer and not the code
generator, since the same behaviour can be observed on x86_64. However, in the
latter case this "bug" does not alter the number of instructions required. My
knowledge of the x86 architecture is limited so I have no clue whether it
causes any inefficiencies at all (instruction width, perhaps?).
The test case below contains six functions:
- "clear_set" and "set_clear" suffer from the problem described above: the
compiler usually emits a superfluous clear for the bit being set;
- "clear" and "set" perform the individual operations;
- "clear_set_inline" and "set_clear_inline" are used to demonstrate that the
problem does not occur when splitting and inlining the operations.
The tests were performed on a Gentoo x86_64 Linux system; GCC versions for both
native and cross-compiler are given below.
## testcase.c #################################################################
enum masks { CLEAR = 0x400000, SET = 0x02 };
unsigned clear_set(unsigned a) { return (a & ~CLEAR) | SET; }
unsigned set_clear(unsigned a) { return (a | SET) & ~CLEAR; }
unsigned clear(unsigned a) { return a & ~CLEAR; }
unsigned set(unsigned a) { return a | SET; }
__attribute__((flatten)) unsigned clear_set_inline(unsigned a) { return
set(clear(a)); }
__attribute__((flatten)) unsigned set_clear_inline(unsigned a) { return
clear(set(a)); }
## GCC version (ARM) ##########################################################
$ arm-none-eabi-gcc -###
Using built-in specs.
COLLECT_GCC=/path/to/toolchain/bin/arm-none-eabi-gcc
COLLECT_LTO_WRAPPER=/path/to/toolchain/libexec/gcc/arm-none-eabi/4.8.1/lto-wrapper
Target: arm-none-eabi
Configured with: /path/to/builddir/src/gcc-4.8.1/configure
--build=x86_64-build_unknown-linux-gnu --host=x86_64-build_unknown-linux-gnu
--target=arm-none-eabi --prefix=/path/to/toolchain
--with-local-prefix=/path/to/toolchain/arm-none-eabi/sysroot
--disable-libmudflap --with-sysroot=/path/to/toolchain/arm-none-eabi/sysroot
--with-newlib --enable-threads=no --disable-shared
--with-pkgversion='crosstool-NG 1.19.0' --with-float=hard
--disable-__cxa_atexit --with-gmp=/path/to/builddir/arm-none-eabi/buildtools
--with-mpfr=/path/to/builddir/arm-none-eabi/buildtools
--with-mpc=/path/to/builddir/arm-none-eabi/buildtools --with-ppl=no
--with-isl=no --with-cloog=no --with-libelf=no --disable-lto
--with-host-libstdcxx='-static-libgcc -Wl,-Bstatic,-lstdc++,-Bdynamic -lm'
--enable-target-optspace --disable-libgomp --disable-libmudflap --disable-nls
--disable-multilib --enable-languages=c
Thread model: single
gcc version 4.8.1 (crosstool-NG 1.19.0)
## GCC version (x86_64) #######################################################
$ gcc -###
Using built-in specs.
COLLECT_GCC=/usr/x86_64-pc-linux-gnu/gcc-bin/4.9.0-alpha20131013/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/4.9.0-alpha20131013/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with:
/var/tmp/portage/sys-devel/gcc-4.9.0_alpha20131013/work/gcc-4.9-20131013/configure
--prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/4.9.0-alpha20131013
--includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.0-alpha20131013/include
--datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.9.0-alpha20131013
--mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.9.0-alpha20131013/man
--infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.9.0-alpha20131013/info
--with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.0-alpha20131013/include/g++-v4
--host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-altivec
--disable-fixed-point --with-cloog --disable-isl-version-check --enable-lto
--enable-nls --without-included-gettext --with-system-zlib --enable-obsolete
--disable-werror --enable-secureplt --enable-multilib
--with-multilib-list=m32,m64 --enable-libmudflap --disable-libssp
--enable-libgomp
--with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/4.9.0-alpha20131013/python
--enable-checking=release --disable-libgcj --enable-libstdcxx-time
--enable-languages=c,c++,fortran --enable-shared --enable-threads=posix
--enable-__cxa_atexit --enable-clocale=gnu --enable-targets=all
--with-bugurl=http://bugs.gentoo.org/ --with-pkgversion='Gentoo
4.9.0_alpha20131013'
Thread model: posix
gcc version 4.9.0-alpha20131013 20131013 (experimental) (Gentoo
4.9.0_alpha20131013)
## Compiler invocation / steps to reproduce ###################################
arm-none-eabi-gcc -O0 -S -fverbose-asm -o olevel0.s -c testcase.c
arm-none-eabi-gcc -O1 -fno-aggressive-loop-optimizations -fno-auto-inc-dec
-fno-branch-count-reg -fno-combine-stack-adjustments -fno-common
-fno-compare-elim -fno-cprop-registers -fno-defer-pop
-fno-delete-null-pointer-checks -fno-dwarf2-cfi-asm -fno-early-inlining
-fno-eliminate-unused-debug-types -fno-forward-propagate -fno-function-cse
-fno-gcse-lm -fno-guess-branch-probability -fno-ident -fno-if-conversion
-fno-if-conversion2 -fno-inline-atomics -fno-ipa-profile -fno-ipa-pure-const
-fno-ipa-reference -fno-ira-hoist-pressure -fno-ira-share-save-slots
-fno-ira-share-spill-slots -fno-ivopts -fno-keep-static-consts
-fno-leading-underscore -fno-math-errno -fno-merge-constants
-fno-merge-debug-strings -fno-move-loop-invariants -fno-peephole
-fno-prefetch-loop-arrays -fno-reg-struct-return
-fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic
-fno-sched-group-heuristic -fno-sched-interblock -fno-sched-last-insn-heuristic
-fno-sched-pressure -fno-sched-rank-heuristic -fno-sched-spec
-fno-sched-spec-insn-heuristic -fno-sched-stalled-insns-dep
-fno-section-anchors -fno-show-column -fno-shrink-wrap -fno-signed-zeros
-fno-split-ivs-in-unroller -fno-split-wide-types -fno-strict-volatile-bitfields
-fno-sync-libcalls -fno-toplevel-reorder -fno-trapping-math -fno-tree-bit-ccp
-fno-tree-ccp -fno-tree-ch -fno-tree-coalesce-vars -fno-tree-copy-prop
-fno-tree-copyrename -fno-tree-cselim -fno-tree-dce -fno-tree-dominator-opts
-fno-tree-dse -fno-tree-forwprop -fno-tree-fre -fno-tree-loop-if-convert
-fno-tree-loop-im -fno-tree-loop-ivcanon -fno-tree-loop-optimize
-fno-tree-phiprop -fno-tree-pta -fno-tree-reassoc -fno-tree-scev-cprop
-fno-tree-sink -fno-tree-slp-vectorize -fno-tree-slsr -fno-tree-sra
-fno-tree-ter -fno-tree-vect-loop-version -fno-unit-at-a-time
-fno-zero-initialized-in-bss -S -fverbose-asm -o olevel1-all-disabled.s -c
testcase.c
arm-none-eabi-gcc -O1 -S -fverbose-asm -o olevel1.s -c testcase.c
arm-none-eabi-gcc -O1 -fexpensive-optimizations -S -fverbose-asm -o olevel2.s
-c testcase.c
arm-none-eabi-gcc -Os -S -fverbose-asm -o olevels.s -c testcase.c
gcc -O0 -S -fverbose-asm -o olevel0-x86.s -c testcase.c
gcc -O1 -S -fverbose-asm -o olevel1-x86.s -c testcase.c
gcc -O1 -fexpensive-optimizations -S -fverbose-asm -o olevel2-x86.s -c
testcase.c
## Observations & expected results ############################################
When adding "-Wall -Wextra" the compiler does not emit any warnings for any of
the commands above.
The above commands generate 5 assembly files with different optimization
levels:
1) olevel0.s at -O0
2) olevel1-all-disabled.s at -O1 with -fno-* for every -f* listed in olevel1.s
3) olevel1.s at -O1
4) olevel2.s at -O1 with -fexpensive-optimizations because it is the only -f*
option additionally enabled with -O2 that causes the superflous clear to be
emitted
5) olevels.s at -Os
The first interesting observation is that the superfluous clear is generated at
-O0 ("olevel0.s"). The generated code for "clear_set" and "set_clear" is
exactly the same, excluding comments/annotations (stripped for clarity):
1: str fp, [sp, #-4]!
2: add fp, sp, #0
3: sub sp, sp, #12
4: str r0, [fp, #-8]
5: ldr r3, [fp, #-8]
6: bic r3, r3, #4194304
7: bic r3, r3, #2
8: orr r3, r3, #2
9: mov r0, r3
10: sub sp, fp, #0
11: ldr fp, [sp], #4
12: bx lr
Just for the record: bic = clear bits in the specified mask, 0x400000 =
4194304. If we take a look at lines 6-8, the following happens:
6: A &= ~CLEAR
7: A &= ~SET
8: A |= SET
The clear instruction at line 7 is superfluous.
Looking at the "*_inline" versions at "-O0" doesn't make sense since the
compiler does not perform inlining at "-O0".
At "-O1", the code does not contain any superfluous instructions ("cat
olevel1.s | grep -vE '^\s*[@\.]' | sed -E 's/\s*@.*$//'"):
clear_set:
bic r0, r0, #4194304
orr r0, r0, #2
bx lr
set_clear:
bic r0, r0, #4194304
orr r0, r0, #2
bx lr
clear:
bic r0, r0, #4194304
bx lr
set:
orr r0, r0, #2
bx lr
clear_set_inline:
bic r0, r0, #4194304
orr r0, r0, #2
bx lr
set_clear_inline:
bic r0, r0, #4194304
orr r0, r0, #2
bx lr
Note that "clear_set", "set_clear", "clear_set_inline" and "set_clear_inline"
are all identical. I have been unable to find out which particular optimization
prevents the superfluous instruction from being emitted. I have tried to
reverse all "-f*" optimization parameters enabled by "-O1" as found in
"olevel1.s", but as should be visible from comparing "olevel1-all-disabled.s"
and "olevel1.s" there is no difference in "clear_set" and "set_clear" other
than in comments/annotations.
At "-O2" the superfluous clear is emitted once again. I have gone through the
optimizations enabled by -O2 which are not enabled by -O1 by comparing the
(verbose) assembly output. I have tracked down the cause to
"-fexpensive-optimizations". Enabling this optimization flag in combination
with -O1 results in the following output ("cat olevel2.s | grep -vE '^\s*[@\.]'
| sed -E 's/\s*@.*$//'"):
clear_set:
bic r0, r0, #4194304
bic r0, r0, #2
orr r0, r0, #2
bx lr
set_clear:
bic r0, r0, #4194304
bic r0, r0, #2
orr r0, r0, #2
bx lr
clear:
bic r0, r0, #4194304
bx lr
set:
orr r0, r0, #2
bx lr
clear_set_inline:
bic r0, r0, #4194304
orr r0, r0, #2
bx lr
set_clear_inline:
bic r0, r0, #4194304
orr r0, r0, #2
bx lr
Note that the "*_inline" functions do not contain the superfluous clear
instruction. The "-Os" case generates exactly the same instructions. Since
"-Os" also enables "-fexpensive-optimizations" as can be seen from "olevels.s",
the cause of the extra instruction at "-Os" is most likely the same as for
"-O2".
The above examples all use the ARM 32-bit ISA. By providing the arguments
"-march=armv7-m -mthumb" to the cross-compiler the exact same behaviour can be
observed for Thumb.
On x86_64, the following happens with "-O0" ("cat olevel0-x86.s | grep -vE
'^\s*[@\.]' | sed -E 's/\s*#.*$//'") in both "clear_set" and "set_clear":
1: pushq %rbp
2: movq %rsp, %rbp
3: movl %edi, -4(%rbp)
4: movl -4(%rbp), %eax
5: andl $-4194307, %eax
6: orl $2, %eax
7: popq %rbp
8: ret
At line 5, the constant -4194307 is 0xFFBFFFFD in hex, so ~(CLEAR | SET). Once
again, the bit that is set at line 6 with the "orl" instruction is first
cleared in the "andl" instruction.
Surprisingly, at "-O1" on x86_64, the superfluous clear remains in place. In
the inlined version the mask is correct, however ("cat olevel1-x86.s | grep -vE
'^\s*[@\.]' | sed -E 's/\s*#.*$//'"):
clear_set:
movl %edi, %eax
andl $-4194307, %eax
orl $2, %eax
ret
set_clear:
movl %edi, %eax
andl $-4194307, %eax
orl $2, %eax
ret
clear:
movl %edi, %eax
andl $-4194305, %eax
ret
set:
movl %edi, %eax
orl $2, %eax
ret
clear_set_inline:
movl %edi, %eax
andl $-4194305, %eax
orl $2, %eax
ret
set_clear_inline:
movl %edi, %eax
andl $-4194305, %eax
orl $2, %eax
ret
The same behaviour is observed with "-Os", "-O2" and "-O3".
I guess something internal in GCC is causing the bit to be cleared before it is
set even though there is absolutely no valid reason to do so from the
perspective of the program. While on x86_64 it probably does not affect code
performance, on ARM it adversely affects code size and perhaps performance
(untested).