94135 – PPC: subfic instead of neg used for rotate right

Status:	NEW

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target
Version:	8.3.0

Importance:	P3 normal
Target Milestone:	---
Assignee:	Not yet assigned to anyone

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2020-03-11 07:06 UTC by Jens Seifert
Modified:	2022-03-08 16:20 UTC
CC List:	1 user

See Also:
Host:
Target:	powerpc---*
Build:
Known to work:
Known to fail:
Last reconfirmed:	2020-03-11 00:00:00

Description Jens Seifert 2020-03-11 07:06:50 UTC

Input:

unsigned int rotr32(unsigned int v, unsigned int r)
{
   return (v>>r)|(v<<(32-r));
}

unsigned long long rotr64(unsigned long long v, unsigned long long r)
{
   return (v>>r)|(v<<(64-r));
}

Command line:
gcc -O2 -save-temps rotr.C

Output:
_Z6rotr32jj:
.LFB0:
        .cfi_startproc
        subfic 4,4,32
        rotlw 3,3,4
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0
        .cfi_endproc

_Z6rotr64yy:
.LFB1:
        .cfi_startproc
        subfic 4,4,64
        rotld 3,3,4
        blr
        .long 0
        .byte 0,9,0,0,0,0,0,0
        .cfi_endproc

subfic is a 2-cycle instruction, but it can be replaced by the 1-cycle instruction neg.
rotr32(v,r) = rotl32(v,32-r) = rotl32(v,(32-r)%32) = rotl32(v,(-r)%32) = rotl32(v,-r), as long as you have a modulo rotate like rotlw/rlwnm.

Same for 64-bit.
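
The identity above can be checked in portable C. This is a minimal sketch, not part of the original report: `rotl32_mod` is a hypothetical helper that models a modulo rotate (like rotlw) by keeping only the low 5 bits of the count, `rotr32_ref` is a reference rotate right written to avoid a shift by 32, and `check_neg_identity` is an illustrative name.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a modulo rotate left (like rotlw):
   only the low 5 bits of the count are used. */
static uint32_t rotl32_mod(uint32_t v, uint32_t n)
{
    n &= 31;
    return n ? (v << n) | (v >> (32 - n)) : v;
}

/* Reference rotate right; the count is reduced to 0..31 and the
   r == 0 case is handled separately to avoid a shift by 32. */
static uint32_t rotr32_ref(uint32_t v, uint32_t r)
{
    r &= 31;
    return r ? (v >> r) | (v << (32 - r)) : v;
}

/* Returns 1 if, for every r in 0..31, rotating right by r equals a
   modulo rotate left by both (32 - r) and (-r); returns 0 otherwise. */
static int check_neg_identity(uint32_t v)
{
    for (uint32_t r = 0; r < 32; r++) {
        uint32_t expected = rotr32_ref(v, r);
        if (rotl32_mod(v, 32u - r) != expected)
            return 0;
        if (rotl32_mod(v, 0u - r) != expected)  /* unsigned negate wraps */
            return 0;
    }
    return 1;
}
```

Calling `check_neg_identity` with any 32-bit value exercises all 32 rotate counts. The 64-bit case would follow the same pattern with a 6-bit mask.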

Comment 1 Segher Boessenkool 2020-03-11 20:44:26 UTC

On what CPU do subfic and neg execute at different speed?

(neg is better, of course, it doesn't write CA).

GCC does not know rotates work for any masking of the amount (with 1's in
the low 5 (resp. 6) bits); the rs6000 target code does not know about any
masking (the SHIFT_COUNT_TRUNCATED macro cannot be used, but we could have
more patterns, and then combine can do this in many cases).
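
To illustrate what "any masking of the amount" means for the 32-bit case, here is a small sketch in C. It is a software model only, not GCC or rs6000 code: `rotl32_model` and `mask_is_redundant` are hypothetical names, and the model assumes the rotate uses just the low 5 bits of the count, as rotlw does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software model of a 32-bit modulo rotate left:
   the count is taken modulo 32 (low 5 bits only). */
static uint32_t rotl32_model(uint32_t v, uint32_t n)
{
    n &= 31;
    return n ? (v << n) | (v >> (32 - n)) : v;
}

/* Returns 1 if applying `mask` to the count never changes the result
   for counts 0..255, i.e. the masking is redundant for this rotate. */
static int mask_is_redundant(uint32_t v, uint32_t mask)
{
    for (uint32_t n = 0; n < 256; n++) {
        if (rotl32_model(v, n & mask) != rotl32_model(v, n))
            return 0;
    }
    return 1;
}
```

In this model, masks such as 31, 63, or 0xFF (all with 1's in the low 5 bits) are redundant, while a mask like 15 is not, because it clears a bit the rotate actually uses.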

Comment 2 Jens Seifert 2020-03-12 06:36:45 UTC

POWER8 Processor User’s Manual for the Single-Chip Module:

Instructions: addi addis add add. subf subf. addic subfic adde addme subfme addze. subfze neg neg. nego

Latency: 1 - 2 cycles (GPR), 2 cycles (XER), 5 cycles (CR)

Throughput: 6/cycle, 2/cycle (with XER or CR updates)

CA is part of XER.

1-2 cycles versus 2 cycles.

Comment 3 Segher Boessenkool 2020-03-12 17:01:08 UTC

Both subfic and neg are 1-2 if run on the integer units.  neg can run on
more units, but it is always 2 cycles then!  (And the conditions where you
*can* have 1 cycle are not very often satisfied, anyway).

Comment 4 Jens Seifert 2020-03-16 07:54:12 UTC

Setting CA in XER increases the issue-to-issue latency by 1 on Power8.
See:
Table 10-14. Issue-to-Issue Latencies

In addition, setting the CA restricts instruction reordering.

Comment 5 Segher Boessenkool 2020-03-16 17:59:30 UTC

Please try it out on hardware (or on a cycle-accurate simulator) if you don't
believe me ;-)