[Bug middle-end/36041] Speed up builtin_popcountll
jakub at gcc dot gnu.org
gcc-bugzilla@gcc.gnu.org
Thu Jun 27 07:13:00 GMT 2013
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36041
--- Comment #21 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Created attachment 30382
--> http://gcc.gnu.org/bugzilla/attachment.cgi?id=30382&action=edit
gcc49-pr36041.patch
Untested libgcc2.c implementation (no hw support). HW support is IMHO best
dealt on the compiler side doing something, already a PLT call is fairly
expensive, but it depends if __builtin_popcount* is used in a hot loop or in
cold code (in the latter case it really doesn't matter).
I've looked at code generated for:
int
foo (unsigned long long i)
{
i = i - ((i >> 1) & 0x5555555555555555);
i = (i & 0x3333333333333333) + ((i >> 2) & 0x3333333333333333);
i = (i + (i >> 4)) & 0xF0F0F0F0F0F0F0F;
return (i * 0x101010101010101) >> 56;
}
int
bar (unsigned long long i)
{
unsigned int i1 = i, i2 = i >> 32;
i1 = i1 - ((i1 >> 1) & 0x55555555);
i2 = i2 - ((i2 >> 1) & 0x55555555);
i1 = (i1 & 0x33333333) + ((i1 >> 2) & 0x33333333);
i2 = (i2 & 0x33333333) + ((i2 >> 2) & 0x33333333);
i1 = (i1 + (i1 >> 4)) & 0xF0F0F0F;
i2 = (i2 + (i2 >> 4)) & 0xF0F0F0F;
return ((i1 + i2) * 0x1010101) >> 24;
}
int
baz (unsigned long long i)
{
i = i - ((i >> 1) & 0x5555555555555555);
i = (i & 0x3333333333333333) + ((i >> 2) & 0x3333333333333333);
i = (i + (i >> 4)) & 0xF0F0F0F0F0F0F0F;
return ((((unsigned int) i) + (unsigned int) (i >> 32)) * 0x1010101) >> 24;
}
on gcc -O2 -m32 and picked the second variant as shortest for the UDWtype
implementation, I guess that is likely the case on most targets. Note that the
patch still doesn't attempt to figure out if UWtype multiplication is expensive
or not, perhaps a useful test would be whether we emit __umul<mode>3 for that
mode inside of libgcc - we'd need to define some macro and if multiplication is
expensive use the shifts + additions alternative instead.
More information about the Gcc-bugs
mailing list