Help needed: Optimization of bytecode interpreter for ARM platform

Fri Dec 8 18:12:00 GMT 2006

On Fri, 2006-12-08 at 17:21 +0000, Andrew Haley wrote:
> de Brebisson, Cyrille (Calculator Division) writes:
> 
>  > [snip] trying to re-code, using inline assembly goto *jump[*progc++]
>  > I used inline assembly to do:
>  > ldrh instr, [progc], #2       // note that in most cases, there is an
>  >                               // extra instruction here that allows to
>  >                               // cancel the waitstate caused by the use
>  >                               // of register instr on the next instruction
>  > ldr pc, [jump, instr, asl #2]
>  > 
>  > because the compiler generates the highly unoptimized (and too large for
>  > the memory in my device)
>  > 	ldrh	r1, [r4], #2
>  > 	ldr	r8, .L2691+4
>  > 	ldr	fp, [r8, r1, asl #2]
>  > 	mov	pc, fp	@ indirect register jump
>  > [/snip]
>  > 
>  > >This is the crucial mistake: you can't jump out of an inline asm.
>  > 
>  > So, how can I optimize my code? Is there a way to force the compiler to
>  > 1: put a variable in a register? As the asm ("register"); constraint
>  > does not seem to do a lot of forcing
> 
> Definitely: if declaring a global register variable doesn't work,
> that's a bug.  What exactly did you try?
> 
>  > 2: get the compiler to condense the last 2 instructions in 1?
> 
> I'm not sure why gcc generates that sequence.  Forwarding to Richard
> Earnshaw for comment.

First of all, you don't mention which version of the compiler you are
using, so it's hard to know precisely why you get the code you do.
GCC-4.1 is used in my example below.

Trying to second-guess the compiler is rarely profitable, but it's not
clear to me why the address of the jump table is not being hoisted out
of the loop.  There is a hack that will effectively force this in this
instance.  By adding an offset loaded from a global variable (or passed
in as an additional parameter that is always zero) to the table address,
we force the address calculation into a local variable that the compiler
can't (easily) optimize away.  For the following test-case:

extern void foo(void), bar(void), wibble(void), wombat(void);

int offset = 0;

void runprog(unsigned short *prog, int count)
{
    __label__ code0, code1, code2, code3;
    static const void* const jump[4] = 
	{
	    &&code0, &&code1, &&code2, &&code3
	};
    const void* const* interp = jump+offset;

    while (count--)
	{
	    goto *interp[*prog++];
    code0:
	    foo();
	    continue;
    code1:
	    bar();
	    continue;
    code2:
	    wibble();
	    continue;
    code3:
	    wombat();
	    break;
	}
}

The critical part of the loop then compiles to:

        ldrh    r3, [r5], #2
        ldr     pc, [r6, r3, asl #2]    @ indirect memory jump

which looks fine to me.  Note, however, that if your 'switch' statement
is large, then you'll quite probably get spilling of variables.  The
value of interp is highly likely to be a candidate here because it's used
exactly once per iteration, so you'll then be back to where you started.
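
For completeness, here is a minimal sketch of the variant mentioned
above, where the offset is passed in as an extra parameter that is
always zero rather than loaded from a global.  The names runprog_param
and check_prog, the calls[] counters and the stub handler bodies are
assumptions made purely for illustration; real handlers would do the
interpreter's work.  Whether the table address then stays in a register
still depends on the compiler version and on register pressure.

```c
/* GNU C sketch: relies on the labels-as-values extension. */
#include <assert.h>

/* Hypothetical stand-ins for the handlers in the test-case above:
   each one just counts how often it was dispatched. */
int calls[4];

static void foo(void)    { calls[0]++; }
static void bar(void)    { calls[1]++; }
static void wibble(void) { calls[2]++; }
static void wombat(void) { calls[3]++; }

/* Hypothetical helper: validate opcodes once, up front, so that the
   dispatch loop itself needs no bounds check (an out-of-range opcode
   would otherwise index past the end of the jump table). */
void check_prog(const unsigned short *prog, int count)
{
    while (count--)
        assert(*prog++ < 4);
}

/* Same loop as runprog(), but the always-zero offset arrives as a
   parameter instead of being read from a global variable. */
void runprog_param(const unsigned short *prog, int count, int offset)
{
    static const void *const jump[4] = {
        &&code0, &&code1, &&code2, &&code3
    };
    /* Table base computed once, before the loop is entered. */
    const void *const *interp = jump + offset;

    while (count--) {
        goto *interp[*prog++];
    code0:
        foo();
        continue;
    code1:
        bar();
        continue;
    code2:
        wibble();
        continue;
    code3:
        wombat();
        break;          /* opcode 3 ends the run */
    }
}
```

This would be called as runprog_param(prog, count, 0).  If the compiler
can see the literal zero at the call site (for example after inlining),
it may well fold the offset away again, so the zero probably has to come
from somewhere the optimizer cannot see through.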

I'm somewhat confused as to why you haven't just used a switch table for
this, though.  The equivalent code:

void runprog(unsigned short *prog, int count)
{
    while (count--)
	{
	    switch(*prog++)
		{
		case 0:
		    foo();
		    continue;
		case 1:
		    bar();
		    continue;
		case 2:
		    wibble();
		    continue;
		case 3:
		    wombat();
		    goto done;
		}
	}
 done:
    ;
}

is much easier to understand and much more amenable to the standard
optimizer framework.

R.