This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]
[Bug tree-optimization/58727] New: Sub-optimal code for bit clear/set sequence

From: "niels at penneman dot org" <gcc-bugzilla at gcc dot gnu dot org>
To: gcc-bugs at gcc dot gnu dot org
Date: Mon, 14 Oct 2013 20:01:43 +0000
Subject: [Bug tree-optimization/58727] New: Sub-optimal code for bit clear/set sequence
Auto-submitted: auto-generated
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58727

            Bug ID: 58727
           Summary: Sub-optimal code for bit clear/set sequence
           Product: gcc
           Version: 4.9.0
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: niels at penneman dot org

GCC generates sub-optimal but functionally correct code when successively
clearing a bit and setting another bit in an integral value, as shown in the
"clear_set" and "set_clear" functions below. To be more specific, at all
O-levels except "-O1", it emits code to clear the bit being set, prior to
setting it. When splitting the clear and set operations with exactly the same
bits, and calling them one after the other, the problem does not occur. It
looks like this behaviour is caused by an optimization pass prior to inlining.

On ARM this may cause a superfluous instruction to be emitted. In most
instructions, the 32-bit ARM ISA can only encode 8-bit immediate values. Most
of those instructions provide 4 extra bits in their encoding to specify a
rotation. This implies that when the bit being set ("SET" in the test case
below) is close enough to the bit being cleared ("CLEAR"), the code generation
phase merges the superflous clear for "SET" with the one for "CLEAR". When the
bits are too far apart, however, an extra clear instruction is generated, even
when optimizing with "-Os".

It looks like the root cause is a problem in the optimizer and not the code
generator, since the same behaviour can be observed on x86_64. However, in the
latter case this "bug" does not alter the number of instructions required. My
knowledge of the x86 architecture is limited so I have no clue whether it
causes any inefficiencies at all (instruction width, perhaps?).

The test case below contains six functions:
- "clear_set" and "set_clear" suffer from the problem described above: the
compiler usually emits a superfluous clear for the bit being set;
- "clear" and "set" perform the individual operations;
- "clear_set_inline" and "set_clear_inline" are used to demonstrate that the
problem does not occur when splitting and inlining the operations.

The tests were performed on a Gentoo x86_64 Linux system; GCC versions for both
native and cross-compiler are given below.

## testcase.c #################################################################

enum masks { CLEAR = 0x400000, SET = 0x02 };
unsigned clear_set(unsigned a) { return (a & ~CLEAR) | SET; }
unsigned set_clear(unsigned a) { return (a | SET) & ~CLEAR; }
unsigned clear(unsigned a) { return a & ~CLEAR; }
unsigned set(unsigned a) { return a | SET; }
__attribute__((flatten)) unsigned clear_set_inline(unsigned a) { return
set(clear(a)); }
__attribute__((flatten)) unsigned set_clear_inline(unsigned a) { return
clear(set(a)); }

## GCC version (ARM) ##########################################################

$ arm-none-eabi-gcc -###
Using built-in specs.
COLLECT_GCC=/path/to/toolchain/bin/arm-none-eabi-gcc
COLLECT_LTO_WRAPPER=/path/to/toolchain/libexec/gcc/arm-none-eabi/4.8.1/lto-wrapper
Target: arm-none-eabi
Configured with: /path/to/builddir/src/gcc-4.8.1/configure
--build=x86_64-build_unknown-linux-gnu --host=x86_64-build_unknown-linux-gnu
--target=arm-none-eabi --prefix=/path/to/toolchain
--with-local-prefix=/path/to/toolchain/arm-none-eabi/sysroot
--disable-libmudflap --with-sysroot=/path/to/toolchain/arm-none-eabi/sysroot
--with-newlib --enable-threads=no --disable-shared
--with-pkgversion='crosstool-NG 1.19.0' --with-float=hard
--disable-__cxa_atexit --with-gmp=/path/to/builddir/arm-none-eabi/buildtools
--with-mpfr=/path/to/builddir/arm-none-eabi/buildtools
--with-mpc=/path/to/builddir/arm-none-eabi/buildtools --with-ppl=no
--with-isl=no --with-cloog=no --with-libelf=no --disable-lto
--with-host-libstdcxx='-static-libgcc -Wl,-Bstatic,-lstdc++,-Bdynamic -lm'
--enable-target-optspace --disable-libgomp --disable-libmudflap --disable-nls
--disable-multilib --enable-languages=c
Thread model: single
gcc version 4.8.1 (crosstool-NG 1.19.0)

## GCC version (x86_64) #######################################################

$ gcc -###
Using built-in specs.
COLLECT_GCC=/usr/x86_64-pc-linux-gnu/gcc-bin/4.9.0-alpha20131013/gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-pc-linux-gnu/4.9.0-alpha20131013/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with:
/var/tmp/portage/sys-devel/gcc-4.9.0_alpha20131013/work/gcc-4.9-20131013/configure
--prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/gcc-bin/4.9.0-alpha20131013
--includedir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.0-alpha20131013/include
--datadir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.9.0-alpha20131013
--mandir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.9.0-alpha20131013/man
--infodir=/usr/share/gcc-data/x86_64-pc-linux-gnu/4.9.0-alpha20131013/info
--with-gxx-include-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/4.9.0-alpha20131013/include/g++-v4
--host=x86_64-pc-linux-gnu --build=x86_64-pc-linux-gnu --disable-altivec
--disable-fixed-point --with-cloog --disable-isl-version-check --enable-lto
--enable-nls --without-included-gettext --with-system-zlib --enable-obsolete
--disable-werror --enable-secureplt --enable-multilib
--with-multilib-list=m32,m64 --enable-libmudflap --disable-libssp
--enable-libgomp
--with-python-dir=/share/gcc-data/x86_64-pc-linux-gnu/4.9.0-alpha20131013/python
--enable-checking=release --disable-libgcj --enable-libstdcxx-time
--enable-languages=c,c++,fortran --enable-shared --enable-threads=posix
--enable-__cxa_atexit --enable-clocale=gnu --enable-targets=all
--with-bugurl=http://bugs.gentoo.org/ --with-pkgversion='Gentoo
4.9.0_alpha20131013'
Thread model: posix
gcc version 4.9.0-alpha20131013 20131013 (experimental) (Gentoo
4.9.0_alpha20131013)

## Compiler invocation / steps to reproduce ###################################

arm-none-eabi-gcc -O0 -S -fverbose-asm -o olevel0.s -c testcase.c
arm-none-eabi-gcc -O1 -fno-aggressive-loop-optimizations -fno-auto-inc-dec
-fno-branch-count-reg -fno-combine-stack-adjustments -fno-common
-fno-compare-elim -fno-cprop-registers -fno-defer-pop
-fno-delete-null-pointer-checks -fno-dwarf2-cfi-asm -fno-early-inlining
-fno-eliminate-unused-debug-types -fno-forward-propagate -fno-function-cse
-fno-gcse-lm -fno-guess-branch-probability -fno-ident -fno-if-conversion
-fno-if-conversion2 -fno-inline-atomics -fno-ipa-profile -fno-ipa-pure-const
-fno-ipa-reference -fno-ira-hoist-pressure -fno-ira-share-save-slots
-fno-ira-share-spill-slots -fno-ivopts -fno-keep-static-consts
-fno-leading-underscore -fno-math-errno -fno-merge-constants
-fno-merge-debug-strings -fno-move-loop-invariants -fno-peephole
-fno-prefetch-loop-arrays -fno-reg-struct-return
-fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic
-fno-sched-group-heuristic -fno-sched-interblock -fno-sched-last-insn-heuristic
-fno-sched-pressure -fno-sched-rank-heuristic -fno-sched-spec
-fno-sched-spec-insn-heuristic -fno-sched-stalled-insns-dep
-fno-section-anchors -fno-show-column -fno-shrink-wrap -fno-signed-zeros
-fno-split-ivs-in-unroller -fno-split-wide-types -fno-strict-volatile-bitfields
-fno-sync-libcalls -fno-toplevel-reorder -fno-trapping-math -fno-tree-bit-ccp
-fno-tree-ccp -fno-tree-ch -fno-tree-coalesce-vars -fno-tree-copy-prop
-fno-tree-copyrename -fno-tree-cselim -fno-tree-dce -fno-tree-dominator-opts
-fno-tree-dse -fno-tree-forwprop -fno-tree-fre -fno-tree-loop-if-convert
-fno-tree-loop-im -fno-tree-loop-ivcanon -fno-tree-loop-optimize
-fno-tree-phiprop -fno-tree-pta -fno-tree-reassoc -fno-tree-scev-cprop
-fno-tree-sink -fno-tree-slp-vectorize -fno-tree-slsr -fno-tree-sra
-fno-tree-ter -fno-tree-vect-loop-version -fno-unit-at-a-time
-fno-zero-initialized-in-bss -S -fverbose-asm -o olevel1-all-disabled.s -c
testcase.c
arm-none-eabi-gcc -O1 -S -fverbose-asm -o olevel1.s -c testcase.c
arm-none-eabi-gcc -O1 -fexpensive-optimizations -S -fverbose-asm -o olevel2.s
-c testcase.c
arm-none-eabi-gcc -Os -S -fverbose-asm -o olevels.s -c testcase.c

gcc -O0 -S -fverbose-asm -o olevel0-x86.s -c testcase.c
gcc -O1 -S -fverbose-asm -o olevel1-x86.s -c testcase.c
gcc -O1 -fexpensive-optimizations -S -fverbose-asm -o olevel2-x86.s -c
testcase.c

## Observations & expected results ############################################

When adding "-Wall -Wextra" the compiler does not emit any warnings for any of
the commands above.

The above commands generate 5 assembly files with different optimization
levels:
1) olevel0.s at -O0
2) olevel1-all-disabled.s at -O1 with -fno-* for every -f* listed in olevel1.s
3) olevel1.s at -O1
4) olevel2.s at -O1 with -fexpensive-optimizations because it is the only -f*
option additionally enabled with -O2 that causes the superflous clear to be
emitted
5) olevels.s at -Os

The first interesting observation is that the superfluous clear is generated at
-O0 ("olevel0.s"). The generated code for "clear_set" and "set_clear" is
exactly the same, excluding comments/annotations (stripped for clarity):

 1:     str     fp, [sp, #-4]!
 2:     add     fp, sp, #0
 3:     sub     sp, sp, #12
 4:     str     r0, [fp, #-8]
 5:     ldr     r3, [fp, #-8]
 6:     bic     r3, r3, #4194304
 7:     bic     r3, r3, #2
 8:     orr     r3, r3, #2
 9:     mov     r0, r3
10:     sub     sp, fp, #0
11:     ldr     fp, [sp], #4
12:     bx      lr

Just for the record: bic = clear bits in the specified mask, 0x400000 =
4194304. If we take a look at lines 6-8, the following happens:
6: A &= ~CLEAR
7: A &= ~SET
8: A |= SET
The clear instruction at line 7 is superfluous.

Looking at the "*_inline" versions at "-O0" doesn't make sense since the
compiler does not perform inlining at "-O0".

At "-O1", the code does not contain any superfluous instructions ("cat
olevel1.s  | grep -vE '^\s*[@\.]' | sed -E 's/\s*@.*$//'"):

clear_set:
        bic     r0, r0, #4194304
        orr     r0, r0, #2
        bx      lr
set_clear:
        bic     r0, r0, #4194304
        orr     r0, r0, #2
        bx      lr
clear:
        bic     r0, r0, #4194304
        bx      lr
set:
        orr     r0, r0, #2
        bx      lr
clear_set_inline:
        bic     r0, r0, #4194304
        orr     r0, r0, #2
        bx      lr
set_clear_inline:
        bic     r0, r0, #4194304
        orr     r0, r0, #2
        bx      lr

Note that "clear_set", "set_clear", "clear_set_inline" and "set_clear_inline"
are all identical. I have been unable to find out which particular optimization
prevents the superfluous instruction from being emitted. I have tried to
reverse all "-f*" optimization parameters enabled by "-O1" as found in
"olevel1.s", but as should be visible from comparing "olevel1-all-disabled.s"
and "olevel1.s" there is no difference in "clear_set" and "set_clear" other
than in comments/annotations.

At "-O2" the superfluous clear is emitted once again. I have gone through the
optimizations enabled by -O2 which are not enabled by -O1 by comparing the
(verbose) assembly output. I have tracked down the cause to
"-fexpensive-optimizations". Enabling this optimization flag in combination
with -O1 results in the following output ("cat olevel2.s | grep -vE '^\s*[@\.]'
| sed -E 's/\s*@.*$//'"):

clear_set:
        bic     r0, r0, #4194304
        bic     r0, r0, #2
        orr     r0, r0, #2
        bx      lr
set_clear:
        bic     r0, r0, #4194304
        bic     r0, r0, #2
        orr     r0, r0, #2
        bx      lr
clear:
        bic     r0, r0, #4194304
        bx      lr
set:
        orr     r0, r0, #2
        bx      lr
clear_set_inline:
        bic     r0, r0, #4194304
        orr     r0, r0, #2
        bx      lr
set_clear_inline:
        bic     r0, r0, #4194304
        orr     r0, r0, #2
        bx      lr

Note that the "*_inline" functions do not contain the superfluous clear
instruction. The "-Os" case generates exactly the same instructions. Since
"-Os" also enables "-fexpensive-optimizations" as can be seen from "olevels.s",
the cause of the extra instruction at "-Os" is most likely the same as for
"-O2".

The above examples all use the ARM 32-bit ISA. By providing the arguments
"-march=armv7-m -mthumb" to the cross-compiler the exact same behaviour can be
observed for Thumb.

On x86_64, the following happens with "-O0" ("cat olevel0-x86.s | grep -vE
'^\s*[@\.]' | sed -E 's/\s*#.*$//'") in both "clear_set" and "set_clear":

1:      pushq   %rbp
2:      movq    %rsp, %rbp
3:      movl    %edi, -4(%rbp)
4:      movl    -4(%rbp), %eax
5:      andl    $-4194307, %eax
6:      orl     $2, %eax
7:      popq    %rbp
8:      ret

At line 5, the constant -4194307 is 0xFFBFFFFD in hex, so ~(CLEAR | SET). Once
again, the bit that is set at line 6 with the "orl" instruction is first
cleared in the "andl" instruction.

Surprisingly, at "-O1" on x86_64, the superfluous clear remains in place. In
the inlined version the mask is correct, however ("cat olevel1-x86.s | grep -vE
'^\s*[@\.]' | sed -E 's/\s*#.*$//'"):

clear_set:
        movl    %edi, %eax
        andl    $-4194307, %eax
        orl     $2, %eax
        ret
set_clear:
        movl    %edi, %eax
        andl    $-4194307, %eax
        orl     $2, %eax
        ret
clear:
        movl    %edi, %eax
        andl    $-4194305, %eax
        ret
set:
        movl    %edi, %eax
        orl     $2, %eax
        ret
clear_set_inline:
        movl    %edi, %eax
        andl    $-4194305, %eax
        orl     $2, %eax
        ret
set_clear_inline:
        movl    %edi, %eax
        andl    $-4194305, %eax
        orl     $2, %eax
        ret

The same behaviour is observed with "-Os", "-O2" and "-O3".

I guess something internal in GCC is causing the bit to be cleared before it is
set even though there is absolutely no valid reason to do so from the
perspective of the program. While on x86_64 it probably does not affect code
performance, on ARM it adversely affects code size and perhaps performance
(untested).
Follow-Ups:
- [Bug rtl-optimization/58727] Sub-optimal code for bit clear/set sequence
  - From: pinskia at gcc dot gnu.org
- [Bug rtl-optimization/58727] Sub-optimal code for bit clear/set sequence
  - From: niels at penneman dot org
- [Bug tree-optimization/58727] Sub-optimal code for bit clear/set sequence
  - From: pinskia at gcc dot gnu.org
- [Bug tree-optimization/58727] Sub-optimal code for bit clear/set sequence
  - From: pinskia at gcc dot gnu.org
- [Bug tree-optimization/58727] Sub-optimal code for bit clear/set sequence
  - From: jakub at gcc dot gnu.org
- [Bug tree-optimization/58727] Sub-optimal code for bit clear/set sequence
  - From: niels at penneman dot org
Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]