| Summary: | OR of two single-bit bitfields is inefficient | | |
|---|---|---|---|
| Product: | gcc | Reporter: | Kazu Hirata <kazu> |
| Component: | middle-end | Assignee: | Richard Biener <rguenth> |
| Status: | RESOLVED FIXED | | |
| Severity: | enhancement | CC: | gcc-bugs, segher |
| Priority: | P2 | Keywords: | missed-optimization |
| Version: | unknown | | |
| Target Milestone: | --- | | |
| Host: | | Target: | |
| Build: | | Known to work: | 9.0 |
| Known to fail: | | Last reconfirmed: | 2005-12-31 20:21:54 |
| Bug Depends on: | | | |
| Bug Blocks: | 19466, 81161 | | |
**Description** (Kazu Hirata, 2004-10-17 16:36:06 UTC)
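The testcase itself is not reproduced in this excerpt of the report. Judging from the dumps below (`foo`, `b_1(D)->bit0`, `b_1(D)->bit1`), it was presumably along these lines (the exact declaration is an assumption):

```c
/* Presumed reconstruction of the PR 18041 testcase; the field names
   match the identifiers that appear in the tree dumps below.  */
struct B
{
  unsigned bit0 : 1;
  unsigned bit1 : 1;
};

void
foo (struct B *b)
{
  b->bit0 = b->bit0 | b->bit1;
}
```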
Hmm, there is only one load on PPC (with either bit layout). Same bit layout as below:

```
lwz r0,0(r3)
rlwinm r2,r0,0,31,31
rlwinm r9,r0,31,31,31
or r2,r2,r9
rlwimi r0,r2,0,31,31
stw r0,0(r3)
blr
```

The opposite bit layout:

```
lwz r0,0(r3)
srwi r2,r0,31
rlwinm r9,r0,2,31,31
or r2,r2,r9
rlwimi r0,r2,31,0,0
stw r0,0(r3)
blr
```

---

Confirmed regarding the extra `and` (I don't know why the extra load appears on x86). This is a much harder problem than doing a simplification at combine time, because we have five instructions to worry about.

---

Mine. With bitfield lowering I see

```
<bb 2>:
  BF.0_2 = MEM[(struct B *)b_1(D)];
  D.2686_3 = (<unnamed-unsigned:1>) BF.0_2;
  D.2687_4 = (unsigned char) D.2686_3;
  D.2694_6 = BF.0_2 >> 1;
  D.2688_7 = (<unnamed-unsigned:1>) D.2694_6;
  D.2689_8 = (unsigned char) D.2688_7;
  D.2690_9 = D.2689_8 | D.2687_4;
  D.2691_10 = (<unnamed-unsigned:1>) D.2690_9;
  D.2696_12 = BF.0_2 & 4294967294;
  D.2697_13 = (unsigned int) D.2691_10;
  BF.2_14 = D.2697_13 | D.2696_12;
  MEM[(struct B *)b_1(D)] = BF.2_14;
  return;
```

There is the possibility to associate truncations/widenings with `|` and `&`, so that `D.2689_8 | D.2687_4` becomes `(unsigned char) D.2688_7 | D.2686_3` and the truncation `(<unnamed-unsigned:1>) D.2690_9` can be combined with it.

With a patch I have, we now optimize at the tree level to

```
<bb 2>:
  D.2686_2 = b_1(D)->bit0;
  D.2688_4 = b_1(D)->bit1;
  D.2693_10 = D.2688_4 ^ D.2686_2;
  b_1(D)->bit0 = D.2693_10;
  return;
```

and, with bitfield lowering applied, to

```
<bb 2>:
  BF.0_2 = MEM[(struct B *)b_1(D)];
  D.2694_6 = BF.0_2 >> 1;
  D.2701_18 = D.2694_6 ^ BF.0_2;
  D.2696_12 = BF.0_2 & 4294967294;
  D.2697_13 = D.2701_18 & 1;
  BF.2_14 = D.2697_13 | D.2696_12;
  MEM[(struct B *)b_1(D)] = BF.2_14;
  return;
```

---

Author: rguenth
Date: Wed May 11 10:52:57 2011
New Revision: 173650

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=173650
Log:

```
2011-05-11  Richard Guenther  <rguenther@suse.de>

	PR tree-optimization/18041
	* tree-ssa-forwprop.c (simplify_bitwise_and): Rename to ...
	(simplify_bitwise_binary): ... this.  Handle operand conversions
	by applying them to the result instead.
	(tree_ssa_forward_propagate_single_use_vars): Adjust.  CSE tree code.

	* gcc.dg/tree-ssa/forwprop-13.c: New testcase.
```

Added:
    trunk/gcc/testsuite/gcc.dg/tree-ssa/forwprop-13.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-forwprop.c

---

GCC 6 at -O2 on x86_64 produces

```
foo:
.LFB0:
	.cfi_startproc
	movzbl	(%rdi), %eax
	movl	%eax, %edx
	shrb	%dl
	orl	%eax, %edx
	andl	$-2, %eax
	andl	$1, %edx
	orl	%edx, %eax
	movb	%al, (%rdi)
	ret
```

If we lower bitfield accesses to BIT_FIELD_REF / BIT_INSERT_EXPR we get

```
<bb 2>:
  _7 = b_5(D)->D.1753;
  _1 = BIT_FIELD_REF <_7, 1, 0>;
  _2 = BIT_FIELD_REF <_7, 1, 1>;
  _3 = _1 | _2;
  _8 = BIT_INSERT_EXPR <_7, _3, 0 (1 bits)>;
  b_5(D)->D.1753 = _8;
  return;
```

If we lower to shifts/masks instead, then

```
<bb 2>:
  _7 = b_5(D)->D.1753;
  _11 = _7 >> 1;
  _1 = _7 | _11;
  _14 = _7 & 254;
  _10 = _1 & 1;
  _16 = _10 | _14;
  b_5(D)->D.1753 = _16;
  return;
```

it fails to notice that the `& 254` is not necessary. That is easier to see from the first lowering.

---

We're still "stuck" on GIMPLE. On x86_64 we manage to elide the redundant load now and get

```
foo:
.LFB0:
	.cfi_startproc
	movzbl	(%rdi), %eax
	movl	%eax, %edx
	shrb	%dl
	orl	%eax, %edx
	andl	$-2, %eax
	andl	$1, %edx
	orl	%edx, %eax
	movb	%al, (%rdi)
	ret
```

where we fail to notice the RMW.
A simpler testcase is

```c
struct B { unsigned bit0 : 1; };

void bar (struct B *b, _Bool x)
{
  b->bit0 |= x;
}
```

which generates

```
bar:
.LFB0:
	.cfi_startproc
	movzbl	(%rdi), %eax
	orl	%eax, %esi
	andl	$-2, %eax
	andl	$1, %esi
	orl	%esi, %eax
	movb	%al, (%rdi)
	ret
```

We'd need to recognize

```
(set (reg:QI 96)
     (ior:QI (and:QI (reg:QI 90 [ *b_3(D) ])
                     (const_int -2 [0xfffffffffffffffe]))
             (and:QI (ior:QI (reg:QI 90 [ *b_3(D) ])
                             (subreg:QI (reg:SI 98) 0))
                     (const_int 1 [0x1]))))
```

that is,

```
(r90 & -2) | ((r90 | rx) & 1)
  -> (r90 & -2) | (r90 & 1) | (rx & 1)
  -> r90 | (rx & 1)
```

I have a patch for simplify-rtx.c that recognizes this, generating

```
foo:
.LFB0:
	.cfi_startproc
	movzbl	(%rdi), %edx
	movl	%edx, %eax
	shrb	%al
	andl	$1, %eax
	orl	%edx, %eax
	movb	%al, (%rdi)
	ret
```

and

```
bar:
.LFB1:
	.cfi_startproc
	andl	$1, %esi
	orb	%sil, (%rdi)
	ret
```

---

Looks like combine doesn't want

```
(insn 11 10 13 2 (parallel [
            (set (reg:QI 91)
                (ior:QI (mem/c:QI (plus:SI (reg/f:SI 16 argp)
                            (const_int 4 [0x4])) [4 x+0 S1 A32])
                    (reg:QI 90 [ *b_3(D) ])))
            (clobber (reg:CC 17 flags))
        ]) "t.c":12:11 429 {*iorqi_1}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))
```

for a combination (on x86_64 with -m32).
```
(insn 13 11 15 2 (parallel [
            (set (reg:QI 93)
                (and:QI (reg:QI 91)
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) "t.c":12:11 396 {*andqi_1}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (expr_list:REG_DEAD (reg:QI 91)
            (nil))))
(insn 15 13 16 2 (parallel [
            (set (reg:QI 95)
                (and:QI (reg:QI 90 [ *b_3(D) ])
                    (const_int -2 [0xfffffffffffffffe])))
            (clobber (reg:CC 17 flags))
        ]) "t.c":12:11 396 {*andqi_1}
     (expr_list:REG_DEAD (reg:QI 90 [ *b_3(D) ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
(insn 16 15 17 2 (parallel [
            (set (reg:QI 96)
                (ior:QI (reg:QI 95)
                    (reg:QI 93)))
            (clobber (reg:CC 17 flags))
        ]) "t.c":12:11 429 {*iorqi_1}
     (expr_list:REG_DEAD (reg:QI 95)
        (expr_list:REG_DEAD (reg:QI 93)
            (expr_list:REG_UNUSED (reg:CC 17 flags)
                (nil)))))
```

```
Trying 11, 15, 13 -> 16:
   11: {r91:QI=[argp:SI+0x4]|r90:QI;clobber flags:CC;}
      REG_UNUSED flags:CC
   15: {r95:QI=r90:QI&0xfffffffffffffffe;clobber flags:CC;}
      REG_DEAD r90:QI
      REG_UNUSED flags:CC
   13: {r93:QI=r91:QI&0x1;clobber flags:CC;}
      REG_UNUSED flags:CC
      REG_DEAD r91:QI
   16: {r96:QI=r95:QI|r93:QI;clobber flags:CC;}
      REG_DEAD r95:QI
      REG_DEAD r93:QI
      REG_UNUSED flags:CC
Failed to match this instruction:
(parallel [
        (set (reg:QI 96)
            (ior:QI (and:QI (mem/c:QI (plus:SI (reg/f:SI 16 argp)
                            (const_int 4 [0x4])) [4 x+0 S1 A32])
                    (const_int 1 [0x1]))
                (reg:QI 90 [ *b_3(D) ])))
        (clobber (reg:CC 17 flags))
    ])
Failed to match this instruction:
(set (reg:QI 96)
    (ior:QI (and:QI (mem/c:QI (plus:SI (reg/f:SI 16 argp)
                    (const_int 4 [0x4])) [4 x+0 S1 A32])
            (const_int 1 [0x1]))
        (reg:QI 90 [ *b_3(D) ])))
Failed to match this instruction:
(set (reg:QI 95)
    (and:QI (mem/c:QI (plus:SI (reg/f:SI 16 argp)
                (const_int 4 [0x4])) [4 x+0 S1 A32])
        (const_int 1 [0x1])))
```

Looks like it doesn't try to "factor" out the load?

---

If combine tries to split RTL into two instructions, it tries to do that one way (and one way only). It picked the AND here; it did not work. You can add some define_split to your target to help combine along.
Author: rguenth
Date: Tue Nov 6 08:09:03 2018
New Revision: 265829

URL: https://gcc.gnu.org/viewcvs?rev=265829&root=gcc&view=rev
Log:

```
2018-11-06  Richard Biener  <rguenther@suse.de>

	PR middle-end/18041
	* simplify-rtx.c (simplify_binary_operation_1): Add pattern
	matching bitfield insertion.

	* gcc.target/i386/pr18041-1.c: New testcase.
	* gcc.target/i386/pr18041-2.c: Likewise.
```

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr18041-1.c
    trunk/gcc/testsuite/gcc.target/i386/pr18041-2.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/simplify-rtx.c
    trunk/gcc/testsuite/ChangeLog

---

Fixed.