Bug 18041

Summary: OR of two single-bit bitfields is inefficient
Product: gcc Reporter: Kazu Hirata <kazu>
Component: middle-endAssignee: Richard Biener <rguenth>
Status: RESOLVED FIXED    
Severity: enhancement CC: gcc-bugs, segher
Priority: P2 Keywords: missed-optimization
Version: unknown   
Target Milestone: ---   
Host: Target:
Build: Known to work: 9.0
Known to fail: Last reconfirmed: 2005-12-31 20:21:54
Bug Depends on:    
Bug Blocks: 19466, 81161    

Description Kazu Hirata 2004-10-17 16:36:06 UTC
Consider:

struct B {
  unsigned bit0 : 1;
  unsigned bit1 : 1;
};

void
foo (struct B *b)
{
  b->bit0 = b->bit0 | b->bit1;
}

./cc1 -O2 -fomit-frame-pointer -mregparm=3 generates

foo:
	movb	(%eax), %dl  <- one load
	movb	%dl, %cl
	shrb	%cl
	orl	%edx, %ecx
	andl	$1, %ecx
	movl	(%eax), %edx <- another load from the same place
	andl	$-2, %edx
	orl	%ecx, %edx   <- the second OR
	movl	%edx, (%eax)
	ret

We could do something like

	movb	(%eax), %cl
	movb	%cl, %dl
	shrb	%dl
	andl	$1, %edx
	orl	%ecx, %edx
	movb	%dl, (%eax)
	ret

or

	movb	(%eax), %dl
	testb	$2, %dl
	je	.L6
	orl	$1, %edx
	movb	%dl, (%eax)
.L6:
	ret

expr.c actually has code intended to emit the second suggestion
(look for "Check for |= or &= of a bitfield" in expr.c), but it is
practically disabled because we get tree like this

  b->bit0 = (<unnamed type>) (unsigned char)
              ((signed char) b->bit0 | (signed char) b->bit1)

whereas the code in expr.c expects

  b->bit0 = b->bit0 | b->bit1;

The code is not triggered even in gcc-3.3.  Probably it is practically
disabled for a long time.
Comment 1 Andrew Pinski 2004-10-17 17:01:28 UTC
Hmm, there is only one load on PPC (with either side):
same bit layout as below:
        lwz r0,0(r3)
        rlwinm r2,r0,0,31,31
        rlwinm r9,r0,31,31,31
        or r2,r2,r9
        rlwimi r0,r2,0,31,31
        stw r0,0(r3)
        blr
The oposite bit layout:
        lwz r0,0(r3)
        srwi r2,r0,31
        rlwinm r9,r0,2,31,31
        or r2,r2,r9
        rlwimi r0,r2,31,0,0
        stw r0,0(r3)
        blr
Comment 2 Andrew Pinski 2004-10-17 17:05:33 UTC
Confirmed about the extra and (I don't know why the extra load is in x86).
Comment 3 Andrew Pinski 2005-02-07 06:02:42 UTC
This is a much harder problem than doing a simplification at combine time because we have five 
instructions to worry about.
Comment 4 Richard Biener 2011-05-06 13:23:11 UTC
Mine.  With bitfield lowering I see

<bb 2>:
  BF.0_2 = MEM[(struct B *)b_1(D)];
  D.2686_3 = (<unnamed-unsigned:1>) BF.0_2;
  D.2687_4 = (unsigned char) D.2686_3;
  D.2694_6 = BF.0_2 >> 1;
  D.2688_7 = (<unnamed-unsigned:1>) D.2694_6;
  D.2689_8 = (unsigned char) D.2688_7;
  D.2690_9 = D.2689_8 | D.2687_4;
  D.2691_10 = (<unnamed-unsigned:1>) D.2690_9;
  D.2696_12 = BF.0_2 & 4294967294;
  D.2697_13 = (unsigned int) D.2691_10;
  BF.2_14 = D.2697_13 | D.2696_12;
  MEM[(struct B *)b_1(D)] = BF.2_14;
  return;

there is the possibility to associate truncations/widenings with |& so
D.2689_8 | D.2687_4 becomes (unsigned char) D.2688_7 | D.2686_3
and the truncation (<unnamed-unsigned:1>) D.2690_9 can be combined
with it.
Comment 5 Richard Biener 2011-05-10 11:38:50 UTC
With a patch I have we now optimize at the tree level to

<bb 2>:
  D.2686_2 = b_1(D)->bit0;
  D.2688_4 = b_1(D)->bit1;
  D.2693_10 = D.2688_4 ^ D.2686_2;
  b_1(D)->bit0 = D.2693_10;
  return;

and with bitfield lowering applied to

<bb 2>:
  BF.0_2 = MEM[(struct B *)b_1(D)];
  D.2694_6 = BF.0_2 >> 1;
  D.2701_18 = D.2694_6 ^ BF.0_2;
  D.2696_12 = BF.0_2 & 4294967294;
  D.2697_13 = D.2701_18 & 1;
  BF.2_14 = D.2697_13 | D.2696_12;
  MEM[(struct B *)b_1(D)] = BF.2_14;
  return;
Comment 6 Richard Biener 2011-05-11 10:53:00 UTC
Author: rguenth
Date: Wed May 11 10:52:57 2011
New Revision: 173650

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=173650
Log:
2011-05-11  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/18041
	* tree-ssa-forwprop.c (simplify_bitwise_and): Rename to ...
	(simplify_bitwise_binary): ... this.  Handle operand conversions
	by applying them to the result instead.
	(tree_ssa_forward_propagate_single_use_vars): Adjust.  CSE tree code.

	* gcc.dg/tree-ssa/forwprop-13.c: New testcase.

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/forwprop-13.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-forwprop.c
Comment 7 Richard Biener 2016-06-29 13:09:25 UTC
GCC 6 at -O2 on x86_64 produces

foo:
.LFB0:
        .cfi_startproc
        movzbl  (%rdi), %eax
        movl    %eax, %edx
        shrb    %dl
        orl     %eax, %edx
        andl    $-2, %eax
        andl    $1, %edx
        orl     %edx, %eax
        movb    %al, (%rdi)
        ret

If we lower bitfield accesses to BIT_FIELD_REF / BIT_INSERT_EXPR we get

  <bb 2>:
  _7 = b_5(D)->D.1753;
  _1 = BIT_FIELD_REF <_7, 1, 0>;
  _2 = BIT_FIELD_REF <_7, 1, 1>;
  _3 = _1 | _2;
  _8 = BIT_INSERT_EXPR <_7, _3, 0 (1 bits)>;
  b_5(D)->D.1753 = _8;
  return;

if we lower to shifts/masks then

  <bb 2>:
  _7 = b_5(D)->D.1753;
  _11 = _7 >> 1;
  _1 = _7 | _11;
  _14 = _7 & 254;
  _10 = _1 & 1;
  _16 = _10 | _14;
  b_5(D)->D.1753 = _16;
  return;

it fails to notice that the & 254 is not necessary.  That's easier to
grasp from the first lowering.
Comment 8 Richard Biener 2018-11-05 14:04:09 UTC
We're still "stuck" on GIMPLE, on x86_64 we manage to elide the redundant load now and get

foo:
.LFB0:
        .cfi_startproc
        movzbl  (%rdi), %eax
        movl    %eax, %edx
        shrb    %dl
        orl     %eax, %edx
        andl    $-2, %eax
        andl    $1, %edx
        orl     %edx, %eax
        movb    %al, (%rdi)
        ret

where we fail to notice the RMW.  A simpler testcase is

struct B { unsigned bit0 : 1; };

void
bar  (struct B *b, _Bool x)
{
  b->bit0 |= x;
}

which generates

bar:
.LFB0:
        .cfi_startproc
        movzbl  (%rdi), %eax
        orl     %eax, %esi
        andl    $-2, %eax
        andl    $1, %esi
        orl     %esi, %eax
        movb    %al, (%rdi)
        ret

we'd need to recognize

            (set (reg:QI 96)
                (ior:QI (and:QI (reg:QI 90 [ *b_3(D) ])
                         (const_int -2 [0xfffffffffffffffe])))
                    (and:QI (ior:QI (reg:QI 90 [ *b_3(D) ])
                             (subreg:QI (reg:SI 98) 0))
                     (const_int 1 [0x1]))))

(r90 & -2) | ((r90 | rx) & 1)
-> (r90 & -2) | (r90 & 1) | (rx & 1)
-> r90 | (rx & 1)

I have a patch for simplify-rtx.c that recognizes this generating

foo:
.LFB0:
        .cfi_startproc
        movzbl  (%rdi), %edx
        movl    %edx, %eax
        shrb    %al
        andl    $1, %eax
        orl     %edx, %eax
        movb    %al, (%rdi)
        ret

and

bar:
.LFB1:
        .cfi_startproc
        andl    $1, %esi
        orb     %sil, (%rdi)
        ret
Comment 9 Richard Biener 2018-11-05 14:22:01 UTC
Looks like combine doesn't want

(insn 11 10 13 2 (parallel [
            (set (reg:QI 91)
                (ior:QI (mem/c:QI (plus:SI (reg/f:SI 16 argp)
                            (const_int 4 [0x4])) [4 x+0 S1 A32])
                    (reg:QI 90 [ *b_3(D) ])))
            (clobber (reg:CC 17 flags))
        ]) "t.c":12:11 429 {*iorqi_1}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))

for a combination (on x86_64 with -m32).

(insn 13 11 15 2 (parallel [
            (set (reg:QI 93)
                (and:QI (reg:QI 91)
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) "t.c":12:11 396 {*andqi_1}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (expr_list:REG_DEAD (reg:QI 91)
            (nil))))
(insn 15 13 16 2 (parallel [
            (set (reg:QI 95)
                (and:QI (reg:QI 90 [ *b_3(D) ])
                    (const_int -2 [0xfffffffffffffffe])))
            (clobber (reg:CC 17 flags))
        ]) "t.c":12:11 396 {*andqi_1}
     (expr_list:REG_DEAD (reg:QI 90 [ *b_3(D) ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
(insn 16 15 17 2 (parallel [
            (set (reg:QI 96)
                (ior:QI (reg:QI 95)
                    (reg:QI 93)))
            (clobber (reg:CC 17 flags))
        ]) "t.c":12:11 429 {*iorqi_1}
     (expr_list:REG_DEAD (reg:QI 95)
        (expr_list:REG_DEAD (reg:QI 93)
            (expr_list:REG_UNUSED (reg:CC 17 flags)
                (nil)))))

Trying 11, 15, 13 -> 16:
   11: {r91:QI=[argp:SI+0x4]|r90:QI;clobber flags:CC;}
      REG_UNUSED flags:CC
   15: {r95:QI=r90:QI&0xfffffffffffffffe;clobber flags:CC;}
      REG_DEAD r90:QI
      REG_UNUSED flags:CC
   13: {r93:QI=r91:QI&0x1;clobber flags:CC;}
      REG_UNUSED flags:CC
      REG_DEAD r91:QI
   16: {r96:QI=r95:QI|r93:QI;clobber flags:CC;}
      REG_DEAD r95:QI
      REG_DEAD r93:QI
      REG_UNUSED flags:CC
Failed to match this instruction:
(parallel [
        (set (reg:QI 96)
            (ior:QI (and:QI (mem/c:QI (plus:SI (reg/f:SI 16 argp)
                            (const_int 4 [0x4])) [4 x+0 S1 A32])
                    (const_int 1 [0x1]))
                (reg:QI 90 [ *b_3(D) ])))
        (clobber (reg:CC 17 flags))
    ])
Failed to match this instruction:
(set (reg:QI 96)
    (ior:QI (and:QI (mem/c:QI (plus:SI (reg/f:SI 16 argp)
                    (const_int 4 [0x4])) [4 x+0 S1 A32])
            (const_int 1 [0x1]))
        (reg:QI 90 [ *b_3(D) ])))
Failed to match this instruction:
(set (reg:QI 95)
    (and:QI (mem/c:QI (plus:SI (reg/f:SI 16 argp)
                (const_int 4 [0x4])) [4 x+0 S1 A32])
        (const_int 1 [0x1])))

looks like it doesn't try to "factor" out the load?
Comment 10 Segher Boessenkool 2018-11-05 20:02:14 UTC
If combine tries to split RTL into two instructions, it tries to do that
one way (and one way only).  It picked the AND here.  It did not work.

You can add some define_split to your target to help combine along.
Comment 11 Richard Biener 2018-11-06 08:09:36 UTC
Author: rguenth
Date: Tue Nov  6 08:09:03 2018
New Revision: 265829

URL: https://gcc.gnu.org/viewcvs?rev=265829&root=gcc&view=rev
Log:
2018-11-06  Richard Biener  <rguenther@suse.de>

	PR middle-end/18041
	* simplify-rtx.c (simplify_binary_operation_1): Add pattern
	matching bitfield insertion.

	* gcc.target/i386/pr18041-1.c: New testcase.
	* gcc.target/i386/pr18041-2.c: Likewise.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr18041-1.c
    trunk/gcc/testsuite/gcc.target/i386/pr18041-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/simplify-rtx.c
    trunk/gcc/testsuite/ChangeLog
Comment 12 Richard Biener 2018-11-06 08:42:32 UTC
Fixed.