This is the mail archive of the gcc-bugs@gcc.gnu.org mailing list for the GCC project.
[Bug middle-end/36041] Speed up builtin_popcountll
- From: "gpiez at web dot de" <gcc-bugzilla at gcc dot gnu dot org>
- To: gcc-bugs at gcc dot gnu dot org
- Date: Fri, 26 Oct 2012 15:51:24 +0000
- Subject: [Bug middle-end/36041] Speed up builtin_popcountll
- Auto-submitted: auto-generated
- References: <bug-36041-4@http.gcc.gnu.org/bugzilla/>
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36041
Gunther Piez <gpiez at web dot de> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |gpiez at web dot de
--- Comment #10 from Gunther Piez <gpiez at web dot de> 2012-10-26 15:51:24 UTC ---
I just noticed that the provided __builtin_popcountll() is exceptionally slow,
even on ARMv5.
I was already using the parallel bit count algorithm given above for the case
where no native bit count instruction (like the SSE popcnt or NEON vcnt) is
present but native 64-bit registers are available.
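
For readers without the earlier comments at hand, here is a minimal sketch of
a parallel (SWAR) bit count on a 64-bit word. It is a commonly used form and
not necessarily the exact variant posted above; the function name
popcount64_parallel is made up for illustration.

```c
#include <stdint.h>

/* Parallel (SWAR) bit count: sums bits in 2-, 4- and 8-bit fields,
 * then adds the eight byte counts with one multiplication.
 * Hypothetical name; a sketch, not the code from the bug report. */
static inline unsigned popcount64_parallel(uint64_t x)
{
    /* Each 2-bit field now holds the count of its two bits. */
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    /* Each 4-bit field now holds the count of its four bits. */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    /* Each byte now holds the count of its eight bits (0..8). */
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    /* The multiply accumulates all byte counts into the top byte. */
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}
```

Every step here is a 64-bit shift, mask, add or multiply, which is why it is
cheap with native 64-bit registers and potentially costly when each operation
has to be split across 32-bit register pairs.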
But on a 32-bit architecture like ARM, I figured it made sense to just use
__builtin_popcountll(), because the many 64-bit operations in the algorithm
may be very slow on a pure 32-bit architecture without NEON or similar support.
But "optimizing" my code with some macro magic so that it uses the library
popcount made the whole program 25% slower, although only a minor part of it
actually uses popcount.
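
The kind of selection described here might look roughly like the sketch
below. This is a guess at the shape of the "macro magic", not the actual code:
the names POPCOUNT64 and popcount64_parallel are hypothetical, and the
UINTPTR_MAX test is only an approximation of "native 64-bit registers are
available". Whether __builtin_popcountll() really expands to a native
instruction also depends on the compiler version and target flags.

```c
#include <stdint.h>

/* Portable parallel bit count (hypothetical helper, common SWAR form). */
static inline unsigned popcount64_parallel(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}

#if defined(__POPCNT__) || defined(__ARM_NEON)
/* A native bit count instruction is assumed available: use the builtin. */
#  define POPCOUNT64(x) ((unsigned)__builtin_popcountll(x))
#elif UINTPTR_MAX > 0xffffffffu
/* No native instruction, but (assumed) 64-bit registers: parallel count. */
#  define POPCOUNT64(x) popcount64_parallel(x)
#else
/* Pure 32-bit target: falls back to the library popcount, which is the
 * choice reported above as making the program about 25% slower. */
#  define POPCOUNT64(x) ((unsigned)__builtin_popcountll(x))
#endif
```

Pointing the last branch at popcount64_parallel() instead would restore the
behaviour from before the change on such targets.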