Input:

    unsigned int rotr32(unsigned int v, unsigned int r)
    {
      return (v>>r)|(v<<(32-r));
    }

    unsigned long long rotr64(unsigned long long v, unsigned long long r)
    {
      return (v>>r)|(v<<(64-r));
    }

Command line:

    gcc -O2 -save-temps rotr.C

Output:

    _Z6rotr32jj:
    .LFB0:
            .cfi_startproc
            subfic 4,4,32
            rotlw 3,3,4
            blr
            .long 0
            .byte 0,9,0,0,0,0,0,0
            .cfi_endproc
    _Z6rotr64yy:
    .LFB1:
            .cfi_startproc
            subfic 4,4,64
            rotld 3,3,4
            blr
            .long 0
            .byte 0,9,0,0,0,0,0,0
            .cfi_endproc

subfic is a 2-cycle instruction, but it can be replaced by the 1-cycle instruction neg:

    rotr32(v,r) = rotl32(v,32-r) = rotl32(v,(32-r)%32) = rotl32(v,(-r)%32) = rotl32(v,-r)

as long as you have a modulo rotate like rotlw/rlwnm. The same holds for 64-bit.
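The identity can be checked exhaustively over the rotate count with a small C sketch. The helper names (rotl32_mod, rotr32_ref, rotate_identity_holds) are made up for this illustration, and rotl32_mod only models a rotate that uses the low 5 bits of the count, which is the assumption the argument relies on; it is not what GCC emits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a modulo rotate-left: only the low 5 bits of the
   count are used, as assumed for rotlw/rlwnm in the text above. */
static uint32_t rotl32_mod(uint32_t v, uint32_t n)
{
    n &= 31;
    return (v << n) | (v >> ((32 - n) & 31));
}

/* Reference rotate-right, written so that no shift count reaches 32. */
static uint32_t rotr32_ref(uint32_t v, uint32_t r)
{
    r &= 31;
    return (v >> r) | (v << ((32 - r) & 31));
}

/* Returns 1 if rotr(v,r) == rotl(v,32-r) == rotl(v,-r) for every count
   0..31 on a few sample values, 0 otherwise. */
static int rotate_identity_holds(void)
{
    static const uint32_t samples[] = {
        0x00000000u, 0x00000001u, 0x80000000u, 0xdeadbeefu, 0xffffffffu
    };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        for (uint32_t r = 0; r < 32; r++) {
            uint32_t want = rotr32_ref(samples[i], r);
            if (rotl32_mod(samples[i], 32u - r) != want)
                return 0;
            /* 0u - r is the unsigned negation, i.e. what neg computes. */
            if (rotl32_mod(samples[i], 0u - r) != want)
                return 0;
        }
    }
    return 1;
}
```

Calling rotate_identity_holds() from a main() and checking that it returns 1 is enough to exercise it; the 64-bit case would look the same with a 6-bit mask.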
On what CPU do subfic and neg execute at different speeds? (neg is better anyway, of course, since it doesn't write CA.) GCC does not know that rotates work for any masking of the amount (any mask with 1's in the low 5 (resp. 6) bits); the rs6000 target code does not know about any masking at all. The SHIFT_COUNT_TRUNCATED macro cannot be used here, but we could add more patterns, and then combine can do this in many cases.
POWER8 Processor User’s Manual for the Single-Chip Module:

    addi addis add add. subf subf. addic subfic adde addme subfme addze. subfze neg neg. nego
        1 - 2 cycles (GPR)
        2 cycles (XER)
        5 cycles (CR)
        6/cycle, 2/cycle (with XER or CR updates)

CA is part of XER. 1-2 cycles versus 2 cycles.
Both subfic and neg are 1-2 cycles if run on the integer units. neg can run on more units, but on those it is always 2 cycles! (And the conditions where you *can* have 1 cycle are not very often satisfied, anyway.)
Setting CA in XER increases the issue-to-issue latency by 1 on Power8; see Table 10-14, "Issue-to-Issue Latencies". In addition, setting CA restricts instruction reordering.
Please try it out on hardware (or on a cycle-accurate simulator) if you don't believe me ;-)