Bug 36041 - Speed up builtin_popcountll
Summary: Speed up builtin_popcountll
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: middle-end (show other bugs)
Version: 4.2.0
: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2008-04-25 00:34 UTC by Joseph Zbiciak
Modified: 2021-11-28 00:18 UTC (History)
7 users (show)

See Also:
Host: x86_64-unknown-linux-gnu
Target: x86_64-unknown-linux-gnu
Build: x86_64-unknown-linux-gnu
Known to work:
Known to fail:
Last reconfirmed: 2021-08-16 00:00:00


Attachments
Popcount sample implementations (612 bytes, text/plain)
2008-04-25 00:36 UTC, Joseph Zbiciak
Details
IFUNC proof of concept patch (914 bytes, patch)
2013-06-26 23:28 UTC, Marc Glisse
Details | Diff
gcc49-pr36041.patch (770 bytes, patch)
2013-06-27 07:13 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Joseph Zbiciak 2008-04-25 00:34:18 UTC
The current __builtin_popcountll (and likely __builtin_popcount) are fairly slow as compared to a simple, short C version derived from what can be found in Knuth's recent publications.  The following short function is about 3x as fast as the __builtin version, which runs counter to the idea that __builtin_XXX provides access to implementations that are exemplars for a given platform.

unsigned int popcount64(unsigned long long x)
{
    x = (x & 0x5555555555555555ULL) + ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x & 0x0F0F0F0F0F0F0F0FULL) + ((x >> 4) & 0x0F0F0F0F0F0F0F0FULL);
    return (x * 0x0101010101010101ULL) >> 56;
}

This version has the additional benefit that it omits the lookup table that the current "builtin" version uses.

I measured the above function vs. __builtin_popcountll with a loop like the following:

    t1 = clock();
    for (j = 0; j < 1000000; j++)
        for (i = 0; i < 1024; i++)
            pt = popcount64(data[i]);
    t2 = clock();

    printf("popcount64 = %d clocks\n", t2 - t1);

...where data[] is a u64 that's preinitialized.

I'll attach the exact source I used, which also includes two other possible implementations of popcountll.
Comment 1 Joseph Zbiciak 2008-04-25 00:36:26 UTC
Created attachment 15529 [details]
Popcount sample implementations

These implementations are derived from insights in Henry S. Warren's Hacker's Delight and Knuth's recent fascicle regarding boolean tips and tricks.
Comment 2 Joseph Zbiciak 2008-04-25 00:39:22 UTC
When run on my Opteron 280 system, the four separate implementations give the following run times:

popcount64_1 = 13130000 clocks
popcount64_2 = 6480000 clocks
popcount64_3 = 3740000 clocks
popcount64_4 = 5490000 clocks

As one can see, the popcount64_3 implementation is over 3.5x the speed of the __builtin_popcountll implementation.
Comment 3 Richard Biener 2008-04-25 08:44:47 UTC
The implementation is written so it is also reasonable on targets like the AVR
which only has 8bit registers...
Comment 4 Joseph Zbiciak 2008-04-25 12:29:10 UTC
Is there a mechanism to provide different implementations based on the target (or in this case, class of target)?  The byte count approach certainly is more appropriate for 8-bit targets, sure, but what about the rest of us?  How are targets handled that might have this as an instruction?

FWIW, I'd be happy to write a 32-bit version to complement the 64-bit version I provided with my report if there's a way to build a different implementation based on the class of target being compiled for.  That way, embedded 8/16 bit, 32 bit and 64 bit targets can each have a version that's appropriate for that class of target.
Comment 5 Richard Biener 2008-04-25 14:52:08 UTC
It should be possible to have an alternate implementation in libgcc2.c by means
of just selecting on a proper architecture define or the size of the argument
mode.
Comment 6 Joseph Zbiciak 2008-04-29 03:42:01 UTC
(In reply to comment #5)
> It should be possible to have an alternate implementation in libgcc2.c by means
> of just selecting on a proper architecture define or the size of the argument
> mode.
> 

I see where it would go in libgcc2.c, but I don't know the appropriate architecture defines to key off of, since I really do know nothing about GCC's internals.

Since the method I used above is likely to be a strict improvement over the table driven method on 32-bit and 64-bit targets, is it enough to, say, key off of "#if W_TYPE_SIZE == 32" and "#if W_TYPE_SIZE == 64"?  Is there some documentation I can read to know how best to propose a patch? 

(I'm just a motivated user, not a compiler developer, in case you couldn't tell.)
Comment 7 Manuel López-Ibáñez 2010-02-21 01:34:11 UTC
Given Richard's comment, I am confirming this.

Joseph,

bugzilla is too busy to keep track of conversations. If you have questions about gcc development, go to gcc@gcc.gnu.org. See also http://gcc.gnu.org/contribute.html

If you send a patch to gcc-patches@gcc.gnu.org, you may get more specific feedback.
Comment 8 José Salavert Torres 2012-09-05 10:39:45 UTC
Hello, there has been any advance in in this issue, Knuth's publication approach would be great for 8 bit registers also.

Also, allowing different behaviour for each architecture would be better.

In the forums the implementation described here is now like this, seems to use less operations:

inline unsigned int bitcount32(uint32_t i) {

  //Parallel binary bit add                                                                                                                 
  i = i - ((i >> 1) & 0x55555555);
  i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
  return (((i + (i >> 4)) & 0xF0F0F0F) * 0x1010101) >> 24;

}
                                                                                                     
  //Parallel binary bit add                                                                                                                 
  i = i - ((i >> 1) & 0x5555555555555555);                                                                                                  
  i = (i & 0x3333333333333333) + ((i >> 2) & 0x3333333333333333);                                                                           
  return (((i + (i >> 4)) & 0xF0F0F0F0F0F0F0F) * 0x101010101010101) >> 56;                                                                  

}
Comment 9 Paolo Carlini 2012-09-05 15:21:12 UTC
Maybe Marc is interested.
Comment 10 Gunther Piez 2012-10-26 15:51:24 UTC
Just noted the exceptional slowness of the provided __builtin_popcountll() even on ARMv5.

I already used the above parallel bit count algorithm in the case that a native bit count instruction (like the SSE popcnt or NEON vcnt) is not present, but native 64 bit registers are available. 

But on a 32 bit architecture like ARM I figured it made sense to just use the __builtin_popcountll() because the many 64 bit instructions in the algorithm may be very slow without NEON or similar support on a pure 32 bit architecture.

But "optimizing" my code with some macro magic to make it use the library popcount made the whole program 25% slower, although only a minor part of it actually does use the popcount instruction.
Comment 11 Cristian Rodríguez 2013-06-26 18:52:23 UTC
Not to be annoying, but compiling the test case attached to this bug report with clang 3.3 produces code in where 

inline u32 popcount64_1(u64 x) { return __builtin_popcountll(x); }


is over 3 times faster than GCC 4.8.1 in x86_64.

I think GCC could "just" generate IFUNCS for generic targets , in x86_64 one function with attribute target popcnt and the other a call to libgcc that at least matches the clang performance.
Comment 12 Marc Glisse 2013-06-26 23:28:23 UTC
Created attachment 30381 [details]
IFUNC proof of concept patch

Sadly, libgcc is compiled with gcc and not g++ so we can't use the recent multiversioning support with the target attribute and we have to manually set up ifunc. Note that the si/di difference is not a typo, just a wart in the way libgcc is configured.

This is just a proof of concept, we'd want to replace also __popcountti2 at least. And most importantly we need to restrict the inclusion of t-ifunc to platforms where ifunc is supported (move it elsewhere in config.host, maybe even include the content of t-ifunc in an existing t-*).

There are probably better ways to organize this, putting the generic implementation in libgcc2.c protected by suitable macros (which ones?) so it benefits also darwin/cygwin (no ifunc) and non-x86 platforms.

I didn't check the generic code, I just pasted it from one of the comments.

If you want this to happen, please work out a patch and post it to gcc-patches (you can start from this one or not), don't wait for others to write one, I won't have more time to spend on this. Don't be too afraid to test the wrong macro, the reviewer will tell you if that is the case.
Comment 13 Andrew Pinski 2013-06-26 23:31:51 UTC
(In reply to Marc Glisse from comment #12)
> Created attachment 30381 [details]
> IFUNC proof of concept patch

I think it is a bad idea to use ifunc for such a function as most of the time it is link against statically in most cases.  Why can't you compile your code with -march=native for the places where you know you are going to compile and run directly on the same machine?
Comment 14 Cristian Rodríguez 2013-06-26 23:37:59 UTC
(In reply to Andrew Pinski from comment #13)
> (In reply to Marc Glisse from comment #12)
 Why can't you compile
> your code with -march=native for the places where you know you are going to
> compile and run directly on the same machine?

Because it will be useless to general purpose distributions of course.
Comment 15 Andrew Pinski 2013-06-26 23:41:31 UTC
(In reply to Cristian Rodríguez from comment #14)
> Because it will be useless to general purpose distributions of course.

Then ifunc for this short of a function is not useful either.  Then maybe we should move over to use the non table version for GCC in general.
Comment 16 Marc Glisse 2013-06-26 23:43:50 UTC
(In reply to Andrew Pinski from comment #13)
> I think it is a bad idea to use ifunc for such a function as most of the
> time it is link against statically in most cases.

g++ links to it dynamically by default.
Maybe we only want ifunc for libgcc_s and not libgcc, I haven't thought about it.

> Why can't you compile your code with -march=native

That's what I do, but Cristian is probably compiling generic packages for a distribution

> for the places where you know you are going to
> compile and run directly on the same machine?

so he is not in this situation.
Comment 17 Marc Glisse 2013-06-26 23:49:29 UTC
(In reply to Andrew Pinski from comment #15)
> (In reply to Cristian Rodríguez from comment #14)
> > Because it will be useless to general purpose distributions of course.
> 
> Then ifunc for this short of a function is not useful either.  Then maybe we
> should move over to use the non table version for GCC in general.

Moving from the table to the non-table version speeds things by a factor 2. The ifunc version gains another factor 2. I wouldn't call that useless. (obviously, it remains 4 or 5 times slower than an inlined version)
Comment 18 Jakub Jelinek 2013-06-27 05:33:58 UTC
I think it is a bad idea to introduce the IFUNC into libgcc_s, because then while you speed up the few users of this builtin, you slow down all users of libgcc_s (pretty much all C++ programs and lots of C programs), because they will need to resolve the ifunc.  For a very heavily used builtin perhaps, but for a rarely used one it just isn't a good idea.  User's can just use multi-versioning themselves and use __builtin_popcount* in the multi-versioned function.
Comment 19 Cristian Rodríguez 2013-06-27 06:13:59 UTC
(In reply to Jakub Jelinek from comment #18)
> I think it is a bad idea to introduce the IFUNC into libgcc_s, because then
> while you speed up the few users of this builtin, you slow down all users of
> libgcc_s (pretty much all C++ programs and lots of C programs), because they
> will need to resolve the ifunc.  For a very heavily used builtin perhaps,
> but for a rarely used one it just isn't a good idea.  User's can just use
> multi-versioning themselves and use __builtin_popcount* in the
> multi-versioned function.

Hold on..Apparently I used ambiguous language in my comment.. adding ifuncs to libgcc* was not my real suggestion, but to EMIT such IFUNC s in the resulting final user code when the target environment allows it. One generic, one hardware/arch specific.
Comment 20 Marc Glisse 2013-06-27 06:43:33 UTC
(In reply to Jakub Jelinek from comment #18)
> I think it is a bad idea to introduce the IFUNC into libgcc_s, because then
> while you speed up the few users of this builtin, you slow down all users of
> libgcc_s (pretty much all C++ programs and lots of C programs), because they
> will need to resolve the ifunc.

I assume it is only those that use the builtin at least once, no? At least LD_DEBUG seems to say so. I have no idea how heavy the ifunc resolution is, so ok. We are back to only considering the non-table version... (By the way, shouldn't these builtins act like C99 inline functions, so we can sometimes inline them at -O3 (it could also enable vectorization)? Or maybe they already do and it's just that I didn't test hard enough)


(In reply to Cristian Rodríguez from comment #19)
> Hold on..Apparently I used ambiguous language in my comment.. adding ifuncs
> to libgcc* was not my real suggestion, but to EMIT such IFUNC s in the
> resulting final user code when the target environment allows it. One
> generic, one hardware/arch specific.

Not sure if that's much better. Ideally we'd clone the hot loop that uses it and propagate the versioning to that, not just the instruction, but I don't think we have any code for that. Although if gcc saw the full code: if(__builtin_cpu_supports("popcnt"))_mm_popcnt_u64(x);else{call lib}, it might already manage to clone the loop.
Comment 21 Jakub Jelinek 2013-06-27 07:13:04 UTC
Created attachment 30382 [details]
gcc49-pr36041.patch

Untested libgcc2.c implementation (no hw support).  HW support is IMHO best dealt on the compiler side doing something, already a PLT call is fairly expensive, but it depends if __builtin_popcount* is used in a hot loop or in cold code (in the latter case it really doesn't matter).

I've looked at code generated for:
int
foo (unsigned long long i)
{
  i = i - ((i >> 1) & 0x5555555555555555);
  i = (i & 0x3333333333333333) + ((i >> 2) & 0x3333333333333333);
  i = (i + (i >> 4)) & 0xF0F0F0F0F0F0F0F;
  return (i * 0x101010101010101) >> 56;
}

int
bar (unsigned long long i)
{
  unsigned int i1 = i, i2 = i >> 32;
  i1 = i1 - ((i1 >> 1) & 0x55555555);
  i2 = i2 - ((i2 >> 1) & 0x55555555);
  i1 = (i1 & 0x33333333) + ((i1 >> 2) & 0x33333333);
  i2 = (i2 & 0x33333333) + ((i2 >> 2) & 0x33333333);
  i1 = (i1 + (i1 >> 4)) & 0xF0F0F0F;
  i2 = (i2 + (i2 >> 4)) & 0xF0F0F0F;
  return ((i1 + i2) * 0x1010101) >> 24;
}

int
baz (unsigned long long i)
{
  i = i - ((i >> 1) & 0x5555555555555555);
  i = (i & 0x3333333333333333) + ((i >> 2) & 0x3333333333333333);
  i = (i + (i >> 4)) & 0xF0F0F0F0F0F0F0F;
  return ((((unsigned int) i) + (unsigned int) (i >> 32)) * 0x1010101) >> 24;
}
on gcc -O2 -m32 and picked the second variant as shortest for the UDWtype implementation, I guess that is likely the case on most targets.  Note that the patch still doesn't attempt to figure out if UWtype multiplication is expensive or not, perhaps a useful test would be whether we emit __umul<mode>3 for that mode inside of libgcc - we'd need to define some macro and if multiplication is expensive use the shifts + additions alternative instead.
Comment 22 Jakub Jelinek 2013-06-28 09:29:23 UTC
Author: jakub
Date: Fri Jun 28 09:28:40 2013
New Revision: 200506

URL: http://gcc.gnu.org/viewcvs?rev=200506&root=gcc&view=rev
Log:
	PR middle-end/36041
	* libgcc2.c (POPCOUNTCST2, POPCOUNTCST4, POPCOUNTCST8, POPCOUNTCST):
	Define.
	(__popcountSI2): For __SIZEOF_INT__ > 2 targets use arithmetics
	instead of table lookups.
	(__popcountDI2): Likewise.

Modified:
    trunk/libgcc/ChangeLog
    trunk/libgcc/libgcc2.c
Comment 23 Marc Glisse 2013-06-28 12:50:53 UTC
(In reply to Jakub Jelinek from comment #18)
> I think it is a bad idea to introduce the IFUNC into libgcc_s, because then
> while you speed up the few users of this builtin, you slow down all users of
> libgcc_s (pretty much all C++ programs and lots of C programs), because they
> will need to resolve the ifunc.

Do you have a pointer to some text that explains this cost? I tried to read a bit, but it seems to me that unless you use LD_BIND_NOW, the ifunc won't be resolved if it isn't called. And if it is called, the main cost compared to a normal relocation should be the feature test, which is indeed a bit long but not that bad, especially since calling popcount once increases a lot the probability that it will be called again.

Not that it matters that much, interested distributions could always ship an sse4 libgcc_s where they replace the implementation of popcount, and best performance requires changing things before there is even a call to libgcc, but I'd like to understand when ifunc should be used or avoided (quite a few glibc functions seem to use ifunc, though they are much more used than popcount and most have a non-constant complexity).
Comment 24 Jakub Jelinek 2013-06-28 13:00:55 UTC
(In reply to Marc Glisse from comment #23)
> (In reply to Jakub Jelinek from comment #18)
> > I think it is a bad idea to introduce the IFUNC into libgcc_s, because then
> > while you speed up the few users of this builtin, you slow down all users of
> > libgcc_s (pretty much all C++ programs and lots of C programs), because they
> > will need to resolve the ifunc.
> 
> Do you have a pointer to some text that explains this cost? I tried to read
> a bit, but it seems to me that unless you use LD_BIND_NOW, the ifunc won't
> be resolved if it isn't called. And if it is called, the main cost compared
> to a normal relocation should be the feature test, which is indeed a bit
> long but not that bad, especially since calling popcount once increases a
> lot the probability that it will be called again.
> 
> Not that it matters that much, interested distributions could always ship an
> sse4 libgcc_s where they replace the implementation of popcount, and best
> performance requires changing things before there is even a call to libgcc,
> but I'd like to understand when ifunc should be used or avoided (quite a few
> glibc functions seem to use ifunc, though they are much more used than
> popcount and most have a non-constant complexity).

If you use prelink, then everything successfully prelinked is LD_BIND_NOW too (for most relocations without runtime cost, but for ifunc not), and
for every IFUNC that requires a .gnu.conflict ifunc relocation that needs to be resolved right away.
Comment 25 Andrew Pinski 2021-08-16 23:28:23 UTC
I still wonder if we should inline popcount if the target does not support it.  Same with the vectorized version of it.