This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
Inefficient bitfield code generation
- To: gcc at gcc dot gnu dot org
- Subject: Inefficient bitfield code generation
- From: Zoltan Hidvegi <hzoli at austin dot ibm dot com>
- Date: Fri, 26 Oct 2001 18:21:52 -0500 (CDT)
The assembly I will use in this mail is 32-bit big-endian PowerPC, but
similar things happen on x86. The main problem is that an optimization
for setting byte-aligned 8-bit fields with a single store byte overrides
another optimization that could combine the setting of all fields in a
bitfield into a single store word. Consider the following code:
struct foo {
    unsigned a : 24;
    unsigned b : 8;
};

void setfoo(struct foo *p, unsigned a, unsigned b)
{
    p->a = a;
    p->b = b;
}
The optimal assembly is
rlwimi 5,4,8,0,23
stw 5,0(3)
Instead of that, gcc generates
lwz 0,0(3)
rlwimi 0,4,8,0,23
stw 0,0(3)
stb 5,3(3)
That is, it loads *p, modifies the upper 24 bits, stores the word back,
then uses a store byte for the lower 8 bits. The stb only works for
8-bit data, so let's modify foo:
struct foo {
    unsigned a : 23;
    unsigned b : 9;
};
Now the assembly is:
lwz 0,0(3)
rlwimi 0,4,9,0,22
rlwimi 0,5,0,23,31
stw 0,0(3)
This is better: only one store, but the load is still there. Here the
stb optimization does not apply, which results in better code. Perhaps
we can get rid of the load entirely by overwriting all of *p:
void setfoo(struct foo *p, unsigned a, unsigned b)
{
    struct foo t;
    t.a = a;
    t.b = b;
    *p = t;
}
Now the assembly for that is
rlwimi 0,4,9,0,22
rlwimi 0,5,0,23,31
stw 0,0(3)
Almost perfect. So maybe this would work in the 24/8 case, but
unfortunately we are not so lucky:
lwz 0,-8(1)
rlwimi 0,4,8,0,23
stw 0,-8(1)
stb 5,-5(1)
lwz 0,-8(1)
stw 0,0(3)
Now it loads the local instance of t from the stack. In the 23/9 case
gcc was able to eliminate the stack reference, but not here, so we end
up with two loads and three stores instead of a single store.
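One more variant worth sketching: compose the word by hand and memcpy it into place, so the compiler sees a single full-word store. This assumes the big-endian packing above, and I have not verified whether gcc actually drops the stack temporary for it:

```c
#include <string.h>

struct foo {
    unsigned a : 24;
    unsigned b : 8;
};

/* Hypothetical variant: build the word explicitly, then copy it
 * over the struct in one go.  Assumes big-endian field packing;
 * the code gcc generates for this was not checked. */
void setfoo_word(struct foo *p, unsigned a, unsigned b)
{
    unsigned w = (a << 8) | (b & 0xffu);
    memcpy(p, &w, sizeof w);
}
```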
Maybe we can use int shift/mask operations in C:
void setfoo8(unsigned int *p, unsigned a, unsigned b)
{
    *p = (*p & 255) | (a << 8);
    *p = (*p & ~255) | (b & 255);
}
This results in
rlwinm 5,5,0,24,31
slwi 4,4,8
or 4,4,5
stw 4,0(3)
Not bad: no load and only one store, but still three fixed-point
instructions instead of one. The same code is generated for
*p = (b & 255) | (a << 8). But with
void setfoo4(unsigned int *p, unsigned a, unsigned b)
{
    b = (b & 255) | (a << 8);
    *p = b;
}
the assembly is
slwi 4,4,8
rlwimi 5,4,0,0,23
stw 5,0(3)
That's better: the same code as the bitfield case without the stb
optimization.
Zoli