Bug 42972 - Very bad bit field code
Summary: Very bad bit field code
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end
Version: 4.5.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: 16996
Reported: 2010-02-05 11:12 UTC by Steven Bosscher
Modified: 2015-03-31 08:55 UTC (History)
6 users

See Also:
Host:
Target: arm-eabi
Build:
Known to work:
Known to fail:
Last reconfirmed: 2012-03-18 12:45:53


Attachments

Description Steven Bosscher 2010-02-05 11:12:50 UTC
Taken from http://hardwarebug.org/2010/01/30/bit-field-badness/

-----------------------------------
struct bf1_31 {
    unsigned a:1;
    unsigned b:31;
};

void func(struct bf1_31 *p, int n, int a)
{
    int i = 0;
    do {
        if (p[i].a)
            p[i].b += a;
    } while (++i < n);
}
-----------------------------------

GCC produces this dreadful code for arm-elf with options -march=armv5te -O3:

        .file   "t.c"
        .text
        .align  2
        .global func
        .type   func, %function
func:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        mov     r3, #0
        str     r4, [sp, #-4]!
.L3:
        ldrb    ip, [r0, #0]    @ zero_extendqisi2
        add     r3, r3, #1
        tst     ip, #1
        ldrne   ip, [r0, #0]
        addne   r4, r2, ip, lsr #1
        andne   ip, ip, #1
        orrne   ip, ip, r4, asl #1
        strne   ip, [r0, #0]
        cmp     r3, r1
        add     r0, r0, #4
        blt     .L3
        ldmfd   sp!, {r4}
        bx      lr
        .size   func, .-func
        .ident  "GCC: (GNU) 4.5.0 20100204 (experimental) [trunk revision 156492]"


What is wrong with this code:
1) The same value is loaded from memory twice (from [r0, #0])
2) Mask/shift/or operations are used where a simple shifted add would suffice
3) Write-back addressing is not used
4) The loop control counts up and compares instead of counting down

(Re. 4: The loop is clearly invertible. Does GCC fail to see this? Or is this optimization just not implemented?)
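Regarding point 2: because `a` occupies bit 0 and `b` occupies bits 1..31, the whole bit-field update collapses to a single word-level shifted add. A standalone sketch of why (not part of the original report; the union trick assumes the usual bit-field layout on ARM/x86, where the first-declared field sits in the low bits):

```c
#include <assert.h>
#include <stdint.h>

struct bf1_31 {
    unsigned a:1;
    unsigned b:31;
};

/* The update as written in the testcase, via the bit fields. */
static uint32_t via_bitfield(uint32_t word, unsigned a)
{
    union { struct bf1_31 s; uint32_t w; } u = { .w = word };
    if (u.s.a)
        u.s.b += a;
    return u.w;
}

/* The same update as one word-level shifted add: (a << 1) has bit 0
   clear, so bit 0 of the word is preserved and b is incremented
   modulo 2^31, matching the bit-field semantics exactly. */
static uint32_t via_shifted_add(uint32_t word, unsigned a)
{
    if (word & 1)
        word += (uint32_t)a << 1;
    return word;
}
```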


Hand-optimized assembly would look like this according to Hardwarebug:
func:
        ldr     r3,  [r0], #4
        tst     r3,  #1
        add     r3,  r3,  r2,  lsl #1
        strne   r3,  [r0, #-4]
        subs    r1,  r1,  #1
        bgt     func
        bx      lr
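Regarding point 4, the inversion the hand-written assembly uses is just as easy to express at the source level. A hypothetical C rewrite (not from the report): since i only feeds the address computation, the pointer can be bumped directly and n counted down to zero, which lets ARM fold the compare into the flag-setting decrement (subs):

```c
#include <assert.h>

struct bf1_31 {
    unsigned a:1;
    unsigned b:31;
};

/* Original loop shape: count i up, compare against n every iteration. */
void func(struct bf1_31 *p, int n, int a)
{
    int i = 0;
    do {
        if (p[i].a)
            p[i].b += a;
    } while (++i < n);
}

/* Inverted loop: bump the pointer, count n down to zero; the
   decrement itself sets the condition flags, so no separate
   compare instruction is needed. */
void func_inverted(struct bf1_31 *p, int n, int a)
{
    do {
        if (p->a)
            p->b += a;
        p++;
    } while (--n > 0);
}
```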

The best compiler available produces this:
func:
        mov     r3, #0
        push    {r4, lr}
loop:
        ldr     ip, [r0, r3, lsl #2]
        tst     ip, #1
        addne   ip, ip, r2, lsl #1
        strne   ip, [r0, r3, lsl #2]
        add     r3, r3, #1
        cmp     r3, r1
        blt     loop
        pop     {r4, pc}


Part of the problem seems to be that the redundant load is not exposed in GIMPLE (although the RTL optimizers should also be able to see it). The .139t.optimized dump looks like this:

     1  ;; Function func (func)
     2
     3  func (struct bf1_31 * p, int n, int a)
     4  {
     5    long unsigned int ivtmp.16;
     6    int i;
     7    <unnamed-unsigned:31> D.1973;
     8    unsigned int D.1972;
     9    int D.1971;
    10    int D.1970;
    11    <unnamed-unsigned:31> D.1969;
    12    unsigned char D.1966;
    13    unsigned char D.1965;
    14    struct bf1_31 * D.1964;
    15
    16  <bb 2>:
    17    ivtmp.16_30 = (long unsigned int) p_5(D);
    18
    19  <bb 3>:
    20    # i_1 = PHI <0(2), i_21(5)>
    21    # ivtmp.16_32 = PHI <ivtmp.16_30(2), ivtmp.16_31(5)>
    22    D.1964_38 = (struct bf1_31 *) ivtmp.16_32;
    23    D.1965_7 = BIT_FIELD_REF <*D.1964_38, 8, 0>;
    24    D.1966_8 = D.1965_7 & 1;
    25    if (D.1966_8 != 0)
    26      goto <bb 4>;
    27    else
    28      goto <bb 5>;
    29
    30  <bb 4>:
    31    D.1969_15 = D.1964_38->b;
    32    D.1970_16 = (int) D.1969_15;
    33    D.1971_18 = a_17(D) + D.1970_16;
    34    D.1972_19 = (unsigned int) D.1971_18;
    35    D.1973_20 = (<unnamed-unsigned:31>) D.1972_19;
    36    D.1964_38->b = D.1973_20;
    37
    38  <bb 5>:
    39    i_21 = i_1 + 1;
    40    ivtmp.16_31 = ivtmp.16_32 + 4;
    41    if (i_21 < n_22(D))
    42      goto <bb 3>;
    43    else
    44      goto <bb 6>;
    45
    46  <bb 6>:
    47    return;
    48
    49  }



There are two loads from *D.1964_38, at dump lines 23 and 31, but one is a BIT_FIELD_REF and the other is an INDIRECT_REF, so the redundant load is not noticed.


RTL PRE fails to kill the redundant load because the expressions for the loads are not the same:

Index 0 (hash value 10)
  (zero_extend:SI (mem/s:QI (reg:SI 158 [ ivtmp.16 ]) [0+0 S1 A32]))
Index 3 (hash value 7)
  (mem/s:SI (reg:SI 158 [ ivtmp.16 ]) [0+0 S4 A32])


CSE also does not see the redundant load although it could notice it:

* It sees the path:
;; Following path with 17 sets: 3 4

* It sees the first load:
(insn 23 22 24 3 t.c:10 (set (reg:SI 169)
        (zero_extend:SI (mem/s:QI (reg:SI 158 [ ivtmp.16 ]) [0+0 S1 A32]))) 148 {*arm_zero_extendqisi2} (nil))

* It sees the second load:
(insn 31 30 32 4 t.c:11 (set (reg:SI 174)
        (mem/s:SI (reg:SI 158 [ ivtmp.16 ]) [0+0 S4 A32])) 166 {*arm_movsi_insn} (nil))

The problem here is probably the same as for RTL PRE: the zero_extend hides the redundancy.
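The effect can be sketched with a toy expression hash (hypothetical; GCC's CSE and PRE hash RTL trees structurally via hash_rtx rather than strings, but the outcome is the same: the zero_extend wrapper puts the two loads in different hash buckets, so they are never compared):

```c
#include <assert.h>

/* djb2 string hash, standing in for a structural RTL hash: two
   textually different expression trees land in different buckets
   even when they read the same memory word. */
static unsigned long expr_hash(const char *s)
{
    unsigned long h = 5381;
    for (; *s; s++)
        h = h * 33 + (unsigned char)*s;
    return h;
}
```

The QImode load wrapped in zero_extend and the plain SImode load hash differently, so value numbering never asks whether they alias the same bytes.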
Comment 1 Steven Bosscher 2010-02-05 12:42:51 UTC
In the .expand dump there is already something funny about the code generated for the bit field expressions.

For example the code generated for this:
  D.1966_8 = D.1965_7 & 1;
  if (D.1966_8 != 0)

TER will perform the forward substitution:
D.1966_8 replace with --> D.1966_8 = D.1965_7 & 1;

The expanders generate the following code for this sequence:

;; if (D.1966_8 != 0)

(insn 23 22 24 t.c:10 (set (reg:SI 169)
        (zero_extend:SI (mem/s:QI (reg/f:SI 164 [ D.1964 ]) [0+0 S1 A32]))) -1 (nil))

(insn 24 23 25 t.c:10 (set (reg:QI 168)
        (subreg:QI (reg:SI 169) 0)) -1 (nil))

(insn 25 24 26 t.c:10 (set (reg:SI 170)
        (and:SI (subreg:SI (reg:QI 168) 0)
            (const_int 1 [0x1]))) -1 (nil))

(insn 26 25 27 t.c:10 (set (reg:QI 171)
        (subreg:QI (reg:SI 170) 0)) -1 (nil))

(insn 27 26 28 t.c:10 (set (reg:SI 172)
        (and:SI (subreg:SI (reg:QI 171) 0)
            (const_int 255 [0xff]))) -1 (nil))

(insn 28 27 29 t.c:10 (set (reg:CC 24 cc)
        (compare:CC (reg:SI 172)
            (const_int 0 [0x0]))) -1 (nil))

(jump_insn 29 28 0 t.c:10 (set (pc)
        (if_then_else (eq (reg:CC 24 cc)
                (const_int 0 [0x0]))
            (label_ref 0)
            (pc))) -1 (expr_list:REG_BR_PROB (const_int 5000 [0x1388])
        (nil)))

What are insn 26 and insn 27 for?



Then the update of p[i].b:

;; D.1964_38->b = D.1973_20;

(insn 31 30 32 t.c:11 (set (reg:SI 174)
        (mem/s:SI (reg/f:SI 164 [ D.1964 ]) [0+0 S4 A32])) -1 (nil))

(insn 32 31 33 t.c:11 (set (reg:SI 173)
        (lshiftrt:SI (reg:SI 174)
            (const_int 1 [0x1]))) -1 (nil))

(insn 33 32 34 t.c:11 (set (reg:SI 175)
        (plus:SI (reg/v:SI 167 [ a ])
            (reg:SI 173))) -1 (nil))

(insn 34 33 35 t.c:11 (set (reg:SI 176)
        (mem/s/j:SI (reg/f:SI 164 [ D.1964 ]) [0+0 S4 A32])) -1 (nil))

(insn 35 34 36 t.c:11 (set (reg:SI 177)
        (and:SI (reg:SI 175)
            (const_int 2147483647 [0x7fffffff]))) -1 (nil))

(insn 36 35 37 t.c:11 (set (reg:SI 178)
        (and:SI (reg:SI 176)
            (const_int 1 [0x1]))) -1 (nil))

(insn 37 36 38 t.c:11 (set (reg:SI 177)
        (ashift:SI (reg:SI 177)
            (const_int 1 [0x1]))) -1 (nil))

(insn 38 37 39 t.c:11 (set (reg:SI 176)
        (ior:SI (reg:SI 177)
            (reg:SI 178))) -1 (nil))

(insn 39 38 0 t.c:11 (set (mem/s/j:SI (reg/f:SI 164 [ D.1964 ]) [0+0 S4 A32])
        (reg:SI 176)) -1 (nil))


Insns 32, 33, and 37 should use a shifted add instead. The mask in insn 35 looks redundant (the bit is shifted out again in insn 37, if I understand correctly) but it is not removed before combine. Combine does generate a shifted add, except that it shifts p[i].b one bit to the right instead of shifting "a" one bit to the left, apparently not noticing that everything is shifted back left again in insn 38:

(insn 31 30 32 4 t.c:11 (set (reg:SI 174)
        (mem/s:SI (reg:SI 158 [ ivtmp.16 ]) [0+0 S4 A32])) 166 {*arm_movsi_insn} 

(insn 33 32 35 4 t.c:11 (set (reg:SI 175)
        (plus:SI (lshiftrt:SI (reg:SI 174)
                (const_int 1 [0x1]))
            (reg/v:SI 167 [ a ]))) 269 {*arith_shiftsi} (nil))

(insn 36 35 37 4 t.c:11 (set (reg:SI 178)
        (and:SI (reg:SI 174)
            (const_int 1 [0x1]))) 67 {*arm_andsi3_insn} (expr_list:REG_DEAD (reg:SI 174)
        (nil)))

(insn 38 37 39 4 t.c:11 (set (reg:SI 176)
        (ior:SI (ashift:SI (reg:SI 175)
                (const_int 1 [0x1]))
            (reg:SI 178))) 269 {*arith_shiftsi} (expr_list:REG_DEAD (reg:SI 175)
        (expr_list:REG_DEAD (reg:SI 178)
            (nil))))

(insn 39 38 40 4 t.c:11 (set (mem/s/j:SI (reg:SI 158 [ ivtmp.16 ]) [0+0 S4 A32])
        (reg:SI 176)) 166 {*arm_movsi_insn} (expr_list:REG_DEAD (reg:SI 176)
        (nil)))
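Whichever operand gets shifted, the two forms are algebraically the same; a standalone check of the identity combine misses (a sketch, not from the report):

```c
#include <assert.h>
#include <stdint.h>

/* The sequence combine produces (insns 33, 36, 38): shift the word
   right, add a, shift back left, OR the saved low bit back in. */
static uint32_t combine_form(uint32_t x, uint32_t a)
{
    uint32_t t  = (x >> 1) + a;    /* insn 33: add with lsr #1 */
    uint32_t lo = x & 1;           /* insn 36: save bit 0      */
    return (t << 1) | lo;          /* insn 38: add with lsl #1 */
}

/* The single shifted add from the hand-written version: (a << 1)
   has bit 0 clear, so bit 0 survives unchanged and bits 1..31 get
   a added modulo 2^31 -- identical to the sequence above. */
static uint32_t shifted_add_form(uint32_t x, uint32_t a)
{
    return x + (a << 1);
}
```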
Comment 2 Steven Bosscher 2010-02-05 12:45:53 UTC
This is both a tree-optimization and an rtl-optimization problem, so setting component to "middle-end" (there is no generic "optimization" component anymore).
Comment 3 Richard Biener 2010-02-05 13:17:23 UTC
On the (now retired) mem-ref branch, bit-field loads and stores were lowered very early to read-extract-modify-write operations, so the tree level would have optimized this.

But of course people complained that architectures that can do bitfield
stores would be pessimized if we do not retain the funny BIT_FIELD_REFs
of memory.

Basically without some form of lowering the tree level is lost.
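In C terms, that lowering amounts to something like the following (a sketch of the idea, not the branch's actual output):

```c
#include <assert.h>

/* Word-level read-extract-modify-write form of the original func:
   both bit-field accesses become uses of one unsigned load, so the
   redundancy is explicit and visible to the tree optimizers (FRE). */
void func_lowered(unsigned *p, int n, int a)
{
    int i = 0;
    do {
        unsigned w = p[i];            /* one SImode load          */
        if (w & 1) {                  /* BIT_FIELD_REF <w, 1, 0>  */
            unsigned b = w >> 1;      /* BIT_FIELD_REF <w, 31, 1> */
            b += (unsigned)a;         /* the b += a update        */
            p[i] = (b << 1) | 1u;     /* BIT_FIELD_EXPR re-insert;
                                         bit 0 is known set here  */
        }
    } while (++i < n);
}
```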

mem-ref branch lowered the fn to

  i = 0;
  <D.1562>:;
  D.1564 = (long unsigned int) i;
  D.1565 = D.1564 * 4;
  D.1566 = p + D.1565;
  MEML.0 = IMEM <unsigned int {2}, D.1566>;
  D.1567 = BIT_FIELD_REF <MEML.0, 1, 0>;
  if (D.1567 != 0) goto <D.1574>; else goto <D.1575>;
  <D.1574>:;
  D.1564 = (long unsigned int) i;
  D.1565 = D.1564 * 4;
  D.1566 = p + D.1565;
  D.1564 = (long unsigned int) i;
  D.1565 = D.1564 * 4;
  D.1566 = p + D.1565;
  MEML.1 = IMEM <unsigned int {2}, D.1566>;
  D.1568 = BIT_FIELD_REF <MEML.1, 31, 1>;
  D.1569 = (int) D.1568;
  D.1570 = D.1569 + a;
  D.1571 = (unsigned int) D.1570;
  D.1572 = (<unnamed-unsigned:31>) D.1571;
  MEML.2 = IMEM <unsigned int {2}, D.1566>;
  MEML.2 = BIT_FIELD_EXPR <MEML.2, D.1572, 31, 1>;
  IMEM <unsigned int {2}, D.1566> = MEML.2;
  <D.1575>:;
  i = i + 1;
  if (i < n) goto <D.1562>; else goto <D.1563>;
  <D.1563>:;
  return;

so FRE sees the redundant load and we expand from

<bb 2>:
  ivtmp.20_15 = (unsigned int *) p_5(D);

<bb 3>:
  # ivtmp.20_13 = PHI <ivtmp.20_15(2), ivtmp.20_22(5)>
  # i_1 = PHI <0(2), i_24(5)>
  MEML.0_7 = IMEM <unsigned int {2}, ivtmp.20_13>;
  D.1567_8 = BIT_FIELD_REF <MEML.0_7, 1, 0>;
  if (D.1567_8 != 0)
    goto <bb 4>;
  else
    goto <bb 5>;

<bb 4>:
  D.1568_16 = BIT_FIELD_REF <MEML.0_7, 31, 1>;
  D.1569_17 = (int) D.1568_16;
  D.1570_19 = D.1569_17 + a_18(D);
  D.1571_20 = (unsigned int) D.1570_19;
  D.1572_21 = (<unnamed-unsigned:31>) D.1571_20;
  MEML.2_23 = BIT_FIELD_EXPR <MEML.0_7, D.1572_21, 31, 1>;
  IMEM <unsigned int {2}, ivtmp.20_13> = MEML.2_23;

<bb 5>:
  i_24 = i_1 + 1;
  ivtmp.20_22 = ivtmp.20_13 + 4;
  if (i_24 < n_25(D))
    goto <bb 3>;
  else
    goto <bb 6>;

<bb 6>:
  return;


on x86_64 the generated code was

func:
.LFB2:
        movl    %edx, %r8d
        xorl    %ecx, %ecx
        .p2align 4,,10
        .p2align 3
.L3:
        movl    (%rdi), %eax
        testb   $1, %al
        je      .L2
        movl    %eax, %edx
        shrl    %eax
        addl    %r8d, %eax
        andl    $1, %edx
        addl    %eax, %eax
        orl     %eax, %edx
        movl    %edx, (%rdi)
.L2:
        addl    $1, %ecx
        addq    $4, %rdi
        cmpl    %esi, %ecx
        jl      .L3
        rep
        ret
Comment 4 Richard Biener 2010-02-05 13:25:45 UTC
(In reply to comment #1)
> In the .expand dump there is already something funny about the code generated
> for the bit field expressions.
> 
> For example the code generated for this:
>   D.1966_8 = D.1965_7 & 1;
>   if (D.1966_8 != 0)
> 
> TER will perform the forward substitution:
> D.1966_8 replace with --> D.1966_8 = D.1965_7 & 1;

Note that this dump is misleading.  The expression is only _marked_ for
possible forwarding.  It is the responsibility of the individual expanders
to do the expression lookup and in-place expansion.
And of course forwarded statements are still regularly expanded anyway
(well, to dead code) - that's probably what you see.
Comment 5 Michael Matz 2010-02-05 15:36:10 UTC
The code for the if() looks sane on x86-64:
-----------------------------------------
;; if (D.2729_8 != 0)

(insn 16 15 17 pr42972.c:10 (set (reg:QI 87)
        (mem/s:QI (reg/f:DI 82 [ D.2727 ]) [0 S1 A32])) -1 (nil))

(insn 17 16 18 pr42972.c:10 (parallel [
            (set (reg:QI 86)
                (and:QI (reg:QI 87)
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) -1 (expr_list:REG_EQUAL (and:QI (mem/s:QI (reg/f:DI 82 [ D.2727 ])
                                            [0 S1 A32])
            (const_int 1 [0x1]))
        (nil)))

(insn 18 17 19 pr42972.c:10 (set (reg:CCZ 17 flags)
        (compare:CCZ (reg:QI 86)
            (const_int 0 [0x0]))) -1 (nil))

(jump_insn 19 18 0 pr42972.c:10 (set (pc)
        (if_then_else (eq (reg:CCZ 17 flags)
                (const_int 0 [0x0]))
            (label_ref 0)
            (pc))) -1 (expr_list:REG_BR_PROB (const_int 5000 [0x1388])
        (nil)))
---------------------------------

Btw, instructions marked for forwarding (by TER) are not expanded to useless
code, but not expanded at all (or rather, only at the point of use).

The back-and-forth between QImode and SImode on arm seems to stem from
register promotion trying really hard to work on only SImode regs.
Comment 6 Steven Bosscher 2012-03-18 23:07:45 UTC
Trunk r185508 produces the same code as quoted in comment #0.