This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[RS6000 PATCH] Improved popcount/parity intrinsics for Power6


Whilst developing some popcount/parity inspired optimizations for the
middle-end, I recently discovered that the rs0000 backend doesn't exactly
have a "popcount" instruction, but a clever "popcntb" instruction that
counts the bits in each byte.  The rs6000 backend then combines this
instruction with a clever multiply and shift to implement
__builtin_popcount on Power5 and Power6.

However, looking at rs6000.md's define_expand for this instruction, it
currently always generates a hardware "mullw" instruction, which now has a
significant latency on power6.  The initial part of the patch below is to
reorganize the way popcount?i2 is expanded to reuse the middle-end's
expand_mult, which will then use shifts and additions if appropriate on
the target.

Hence for the simple test-case:

int foo(int x)
{
  return __builtin_popcount(x);
}

Previously with -O2 -mcpu=power6 we'd generate:

.foo:   lis 0,0x101
        popcntb 3,3
        ori 0,0,257
        mullw 3,3,0
        srwi 3,3,24
        blr

where as with this patch we now generate:

.foo:   popcntb 0,3
        slwi 3,0,8
        add 3,3,0
        slwi 9,3,16
        add 3,3,9
        srwi 3,3,24
        blr

[If the timings in rs6000.c are to be believed, this is significantly
faster].

Following on from this, when the backend doesn't define a parity
instruction but does provide popcount, the middle-end RTL expansion
routines by default expand parity(x) as "popcount(x) & 1".  It turns
out that with powerpc's popcntb instruction we can be slightly cleverer
and determine popcount as the xor of the lowest bit from each byte.
Hence the patch below also provides a new define_expand for parity?i2 on
TARGET_POPCNTB targets.  With this we now generate:

.foo:   popcntb 3,3
        srwi 0,3,16
        xor 3,3,0
        srwi 9,3,8
        xor 3,3,9
        rlwinm 3,3,0,31,31
        blr

When optimizing for size with -Os we continue to generate the original
sequences, with the explicit multiplication.  Interestingly, expand_mult
considers the shift-and-add approach to also be a win (no worse) on power5
for these co-efficients, which is reasonable when the cost of constructing
the immediate constant is included in the latency of the multiplcation.

The following patch has been tested on powerpc-ibm-aix5.2.0.0 with a full
"make bootstrap", all default languages, and regression tested with a
top-level "make -k check" with no new failures.

Ok for mainline?  Unfortunately, I don't actually have access to a power6
to confirm the performance improvement, this is all based upon specs, but
if someone could give it a test, that would be appreciated.


2007-02-06  Roger Sayle  <roger@eyesopen.com>

        * config/rs6000/rs6000.md (popcount<mode>2): Rewrite.
        (parity<mode>2): New define_expand using rs6000_emit_parity.
        * config/rs6000/rs6000.c (rs6000_emit_popcount,
        rs6000_emit_parity): New functions.
        * config/rs6000/rs6000-protos.h (rs6000_emit_popcount,
        rs6000_emit_parity): Prototype here.

        * gcc.target/powerpc/popcount-1.c: New test case.
        * gcc.target/powerpc/parity-1.c: Likewise.

Roger
--

Attachment: patchp.txt
Description: Text document

Attachment: popcount-1.c
Description: Text document

Attachment: parity-1.c
Description: Text document


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]