This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Inefficient bitfield code generation
- To: gcc at gcc dot gnu dot org
- Subject: Inefficient bitfield code generation
- From: Zoltan Hidvegi <hzoli at austin dot ibm dot com>
- Date: Fri, 26 Oct 2001 18:21:52 -0500 (CDT)
The assembly I will use in this mail is 32-bit big-endian PowerPC, but
similar things happen on x86. The main problem is that an optimization
for setting byte-aligned 8-bit fields with a single store byte overrides
another optimization that could combine the setting of all fields in a
bitfield into a single store word. Consider the following code:
struct foo {
    unsigned a : 24;
    unsigned b : 8;
};

void setfoo(struct foo *p, unsigned a, unsigned b)
{
    p->a = a;
    p->b = b;
}
The optimal assembly is
rlwimi 5,4,8,0,23
stw 5,0(3)
Instead of that, gcc generates
lwz 0,0(3)
rlwimi 0,4,8,0,23
stw 0,0(3)
stb 5,3(3)
That is, it loads *p, modifies the upper 24 bits, stores the word back,
then uses a store byte for the lower 8 bits. The stb only works for
8-bit data, so let's modify foo:
struct foo {
    unsigned a : 23;
    unsigned b : 9;
};
Now the assembly is:
lwz 0,0(3)
rlwimi 0,4,9,0,22
rlwimi 0,5,0,23,31
stw 0,0(3)
This is better: only one store, but the load is still there. Here the
stb optimization does not apply, which results in better code. Perhaps
we can get rid of the load entirely by overwriting all of *p:
void setfoo(struct foo *p, unsigned a, unsigned b)
{
    struct foo t;
    t.a = a;
    t.b = b;
    *p = t;
}
Now the assembly for that is
rlwimi 0,4,9,0,22
rlwimi 0,5,0,23,31
stw 0,0(3)
Almost perfect. So maybe this would work in the 24/8 case, but
unfortunately we are not so lucky:
lwz 0,-8(1)
rlwimi 0,4,8,0,23
stw 0,-8(1)
stb 5,-5(1)
lwz 0,-8(1)
stw 0,0(3)
Now it loads the local instance of t from the stack. In the 23/9 case
gcc was able to eliminate the stack reference, but not here, so we end
up with two loads and three stores instead of a single store.
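One more variant worth sketching: compose the word by hand and memcpy it into place, so the compiler sees a single full-word store. This assumes the big-endian packing above, and I have not verified whether gcc actually drops the stack temporary for it:

```c
#include <string.h>

struct foo {
    unsigned a : 24;
    unsigned b : 8;
};

/* Hypothetical variant: build the word explicitly, then copy it
 * over the struct in one go.  Assumes big-endian field packing;
 * the code gcc generates for this was not checked. */
void setfoo_word(struct foo *p, unsigned a, unsigned b)
{
    unsigned w = (a << 8) | (b & 0xffu);
    memcpy(p, &w, sizeof w);
}
```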
Maybe we can use int shift/mask operations in C:
void setfoo8(unsigned int *p, unsigned a, unsigned b)
{
    *p = (*p & 255) | (a << 8);
    *p = (*p & ~255) | (b & 255);
}
This results in
rlwinm 5,5,0,24,31
slwi 4,4,8
or 4,4,5
stw 4,0(3)
Not bad: no load and only one store, but still three fixed-point
instructions instead of one. The same code is generated for
*p = (b & 255) | (a << 8). But with
void setfoo4(unsigned int *p, unsigned a, unsigned b)
{
    b = (b & 255) | (a << 8);
    *p = b;
}
the assembly is
slwi 4,4,8
rlwimi 5,4,0,0,23
stw 5,0(3)
That's better: the same code as the bitfield case without the stb
optimization.
Zoli