[Bug tree-optimization/81611] [8 Regression] gcc un-learned loop / post-increment optimization

Thu Jan 25 11:39:00 GMT 2018

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81611

--- Comment #19 from Georg-Johann Lay <gjl at gcc dot gnu.org> ---
Hi, thanks for all the work and effort.

I tried that patch for the following small test:

extern void foo (void);

extern char volatile vv;

void func2 (const int *p)
{
    while (1)
    {
        int var = *p++;
        if (var == 10)
            return foo();
        if (var == 0)
            break;
    }
}

void func3 (const int *p, const __flash char *f)
{
    while (1)
    {
        int var = *p++;
        if (var == 10)
            return foo();
        vv = *f++;
        if (!vv)
            break;
    }
}

$ avr-gcc -Os -mmcu=avr5 inc.c -S -dp

Unfortunately, the code is still quite sub-optimal, in particular due to
reg-reg moves all over the place, apart from missing post-inc opportunities.

For example, func3 compiles as follows:

func3:
.L7:
        movw r20,r24     ;  37  [c=4 l=1]  *movhi/0
        subi r20,-2      ;  9   [c=4 l=2]  addhi3_clobber/1
        sbci r21,-1
        movw r30,r24     ;  38  [c=4 l=1]  *movhi/0
        ld r24,Z         ;  10  [c=8 l=2]  *movhi/2
        ldd r25,Z+1
        sbiw r24,10      ;  11  [c=12 l=1]  cmphi3/5
        brne .L6                 ;  12  [c=16 l=1]  branch
        jmp foo  ;  14  [c=0 l=2]  call_insn/3
.L8:
        movw r22,r26     ;  5   [c=4 l=1]  *movhi/0
        rjmp .L7                 ;  46  [c=4 l=1]  jump
.L6:
        movw r26,r22     ;  39  [c=4 l=1]  *movhi/0
        adiw r26,1       ;  18  [c=4 l=1]  addhi3_clobber/0
        movw r30,r22     ;  40  [c=4 l=1]  *movhi/0
        lpm r24,Z        ;  19  [c=4 l=1]  movqi_insn/3
        sts vv,r24       ;  20  [c=4 l=2]  movqi_insn/2
        lds r18,vv       ;  21  [c=4 l=2]  movqi_insn/3
        movw r24,r20     ;  22  [c=4 l=1]  *movhi/0
        cpse r18,__zero_reg__    ;  24  [c=0 l=1]  enable_interrupt-3
        rjmp .L8
        ret              ;  43  [c=0 l=1]  return

In particular, moving values back and forth and bad register selection are a
common and well-known annoyance (insns 37, 38, 5, 39, 40, 22).

Just to give an impression, optimal code would read something like:

func3:
    ;; Use Z=r30/31 for F.  LPM can only use indirect and
    ;; post-inc with Z.
    movw r30, r22
    ;; Use X=r26/27 for P.  X register can only use indirect and
    ;; post-inc addressing, which is fine for that purpose.
    movw r26, r24
.L7:
    ;; var = *p++
        ld r24,X+
        ld r25,X+
    ;;  var == 10 ?
        sbiw r24,10
        brne .L6
        jmp foo
.L6:
    ;; vv = *f++
        lpm r24,Z+
        sts vv,r24
    ;; if (!vv) break
        lds r24,vv
        cpse r24,__zero_reg__
        rjmp .L7
        ret

It uses 12 instructions instead of 21, operates faster (though the focus is
usually on code size) and has a register footprint of 6 whereas gcc needs 12.
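
In case someone wants to check what func2 is supposed to do before comparing
assembly, here is a minimal host-side sketch. The counting stub for foo, the
foo_calls counter and the count_foo_calls helper are my own assumptions and not
part of the test case; "return foo();" is spelled as a call plus return so the
file also builds as strict ISO C. func3 is left out because __flash is only
available with avr-gcc.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side stub: counts calls instead of doing real work. */
static int foo_calls;

void foo (void)
{
    foo_calls++;
}

/* func2 from the test case; "return foo();" written as call + return. */
void func2 (const int *p)
{
    while (1)
    {
        int var = *p++;     /* the post-increment under discussion */
        if (var == 10)
        {
            foo ();
            return;
        }
        if (var == 0)
            break;
    }
}

/* Hypothetical helper: run func2 on a buffer, report how often foo ran.
   The buffer must contain a 10 or a 0, otherwise func2 reads past the end. */
int count_foo_calls (const int *p)
{
    assert (p != NULL);
    foo_calls = 0;
    func2 (p);
    return foo_calls;
}
```

Calling count_foo_calls on a buffer like { 1, 2, 10, 0 } should report one call
to foo, and on { 1, 2, 3, 0 } none, since the loop leaves on the terminating 0.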