This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.
Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|
Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |
Other format: | [Raw text] |
Whilst developing some popcount/parity inspired optimizations for the middle-end, I recently discovered that the rs0000 backend doesn't exactly have a "popcount" instruction, but a clever "popcntb" instruction that counts the bits in each byte. The rs6000 backend then combines this instruction with a clever multiply and shift to implement __builtin_popcount on Power5 and Power6. However, looking at rs6000.md's define_expand for this instruction, it currently always generates a hardware "mullw" instruction, which now has a significant latency on power6. The initial part of the patch below is to reorganize the way popcount?i2 is expanded to reuse the middle-end's expand_mult, which will then use shifts and additions if appropriate on the target. Hence for the simple test-case: int foo(int x) { return __builtin_popcount(x); } Previously with -O2 -mcpu=power6 we'd generate: .foo: lis 0,0x101 popcntb 3,3 ori 0,0,257 mullw 3,3,0 srwi 3,3,24 blr where as with this patch we now generate: .foo: popcntb 0,3 slwi 3,0,8 add 3,3,0 slwi 9,3,16 add 3,3,9 srwi 3,3,24 blr [If the timings in rs6000.c are to be believed, this is significantly faster]. Following on from this, when the backend doesn't define a parity instruction but does provide popcount, the middle-end RTL expansion routines by default expand parity(x) as "popcount(x) & 1". It turns out that with powerpc's popcntb instruction we can be slightly cleverer and determine popcount as the xor of the lowest bit from each byte. Hence the patch below also provides a new define_expand for parity?i2 on TARGET_POPCNTB targets. With this we now generate: .foo: popcntb 3,3 srwi 0,3,16 xor 3,3,0 srwi 9,3,8 xor 3,3,9 rlwinm 3,3,0,31,31 blr When optimizing for size with -Os we continue to generate the original sequences, with the explicit multiplication. Interestingly, expand_mult considers the shift-and-add approach to also be a win (no worse) on power5 for these co-efficients, which is reasonable when the cost of constructing the immediate constant is included in the latency of the multiplcation. The following patch has been tested on powerpc-ibm-aix5.2.0.0 with a full "make bootstrap", all default languages, and regression tested with a top-level "make -k check" with no new failures. Ok for mainline? Unfortunately, I don't actually have access to a power6 to confirm the performance improvement, this is all based upon specs, but if someone could give it a test, that would be appreciated. 2007-02-06 Roger Sayle <roger@eyesopen.com> * config/rs6000/rs6000.md (popcount<mode>2): Rewrite. (parity<mode>2): New define_expand using rs6000_emit_parity. * config/rs6000/rs6000.c (rs6000_emit_popcount, rs6000_emit_parity): New functions. * config/rs6000/rs6000-protos.h (rs6000_emit_popcount, rs6000_emit_parity): Prototype here. * gcc.target/powerpc/popcount-1.c: New test case. * gcc.target/powerpc/parity-1.c: Likewise. Roger --
Attachment:
patchp.txt
Description: Text document
Attachment:
popcount-1.c
Description: Text document
Attachment:
parity-1.c
Description: Text document
Index Nav: | [Date Index] [Subject Index] [Author Index] [Thread Index] | |
---|---|---|
Message Nav: | [Date Prev] [Date Next] | [Thread Prev] [Thread Next] |