Bug 18154 - Inefficient max/min code for PowerPC
Summary: Inefficient max/min code for PowerPC
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 4.0.0
: P2 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2004-10-26 04:17 UTC by David Edelsohn
Modified: 2022-03-08 16:20 UTC (History)
4 users (show)

See Also:
Host:
Target: powerpc*-*-*
Build:
Known to work:
Known to fail: 4.9.3, 5.3.0, 6.0
Last reconfirmed: 2016-01-27 00:00:00


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description David Edelsohn 2004-10-26 04:17:54 UTC
int min(int a, int b) {
  if (a < b)
    return a;
  else
    return b;
}

should produce

        rlwinm  r2,r3,1,31,31
        subfc   r0,r4,r3
        rlwinm  r0,r4,1,31,31
        subfe   r0,r2,r0
        andc    r2,r4,r0
        and     r0,r3,r0
        or      r3,r0,r2
        blr

but instead produces a branch sequence

         mr r0,r3
         mr r3,r4
         cmpw cr7,r0,r4
         bgelr- cr7
         mr r3,r0
         blr

with unnecessary dependencies.
Comment 1 Andrew Pinski 2004-10-26 04:25:17 UTC
Confirmed.
Comment 2 David Edelsohn 2004-10-26 20:06:20 UTC
XLC chooses the straight-line code sequence versus compare and branch based on 
a cost model.  This should not be a uniform change in behavior for PowerPC.
Comment 3 David Edelsohn 2004-10-26 21:25:24 UTC
Also, do not enable when optimizing for size.
Comment 4 Geoff Keating 2004-10-27 23:45:41 UTC
I'm not sure that subfc/subfe is going to be cheaper than a compare and a branch, even if the branch is 
mispredicted half the time.  Do you have timing results?
Comment 5 Andrew Pinski 2004-10-27 23:52:41 UTC
I should note when I was doing SPEC work, using subfc/subfe did not help SPEC at all (I tried to change 
the source and also rs6000.md).
Comment 6 Andrew Pinski 2005-04-24 14:31:38 UTC
On the mainline, we now produce:
        cmpw cr7,r3,r4
        blelr- cr7
        mr r3,r4
        blr

Comment 7 Martin Sebor 2016-01-27 23:08:02 UTC
Current trunk as well as all supported GCC versions before it still emits the same code (see below).  XLC 12 on gcc111.fsffrance.org also emits a branch (see below).  Ditto for Clang.

David, in light of this and in light of comments #4 and #5, do you still believe that GCC should change as you suggested in the Description?

.min:                                   # 0x00000000 (H.10.NO_SYMBOL)
        cmp        0,r3,r4
        bc         BO_IF,CR0_LT,__L10
        oril       r3,r4,0x0000
        bcr        BO_ALWAYS,CR0_LT
__L10:                                  # 0x00000010 (H.10.NO_SYMBOL+0x10)
        bcr        BO_ALWAYS,CR0_LT


$ cat ~/tmp/t.c && /build/gcc-trunk/gcc/xgcc -B /build/gcc-trunk/gcc -O2 -S -Wall -Wextra -Wpedantic -o/dev/stdout ~/tmp/t.c
int min(int a, int b) {
  if (a < b)
    return a;
  else
    return b;
}
	.file	"t.c"
	.machine power8
	.abiversion 2
	.section	".toc","aw"
	.section	".text"
	.align 2
	.p2align 4,,15
	.globl min
	.type	min, @function
min:
	cmpw 7,3,4
	ble 7,.L2
	mr 3,4
.L2:
	extsw 3,3
	blr
	.long 0
	.byte 0,0,0,0,0,0,0,0
	.size	min,.-min
	.ident	"GCC: (GNU) 6.0.0 20160125 (experimental)"
	.section	.note.GNU-stack,"",@progbits
Comment 8 David Edelsohn 2016-01-28 06:28:23 UTC
Branchless code generally is better.
Comment 9 Martin Sebor 2016-01-30 18:57:38 UTC
I noticed while looking at an unrelated bug that when targeting power7 or power8 Clang makes use of the isel instruction and emits the following:

min:                                    # @min
	cmpw	 3, 4
	isel 3, 3, 4, 0
	blr

Gcc also has the capability of using isel but it's disabled by default even when targeting power8 and must be explicitly enabled via -misel.  With it, GCC emits the following branchless code:

min:
	cmpw 7,3,4
	isel 3,3,4,28
	extsw 3,3
	blr

Since the instruction exists for just this purpose (eliminating branches), would it make sense to enable it by default?  (I suppose one concern with it might be that it's not being very extensively tested.)
Comment 10 David Edelsohn 2016-01-31 01:04:13 UTC
isel is not generally performance win for Power using GCC.  It is enabled for LLVM because LLVM has a simplistic basic block scheduler and isel allows LLVM to form larger basic blocks to provide the scheduler with more freedom of movement.
Comment 11 Segher Boessenkool 2016-08-23 21:52:55 UTC
The signed version can be done in four insns:

1:      subfc   r5,r3,r4
        subfe   r6,r6,r6
        and     r7,r6,r5
        addc    r8,r7,r3

(superopt finds 16 versions, all similar).

The unsigned version can be done in six:

33:     subfc   r5,r3,r4
        srwi    r6,r4,31
        srwi    r7,r3,31
        subfe   r8,r6,r7
        and     r9,r8,r5
        addc    r10,r9,r3

(superopt finds 240 versions, many with one or two xoris ,,0x8000
which doesn't work for 64 bit, and many with srawi as well, which
can be more expensive than srwi; all remaining are similar).

For 32-bit min/max on a 64-bit cpu, we can use only "cheap", non-carry
instructions:

  extsw r3,r3
  extsw r4,r4
  subf r5,r4,r3
  srdi r6,r5,32
  and r7,r6,r5
  add r8,r7,r4

(and unsigned exts for unsigned).  Those extends often disappear into
surrounding insns, or because the ABI requires the regs to be extended
already, etc.
Comment 12 Segher Boessenkool 2016-08-23 21:56:13 UTC
(Never mind those last "addc" insn, they can just as well be plain
"add", I pasted the wrong ones).
Comment 13 Segher Boessenkool 2017-11-21 19:25:32 UTC
Trunk now generates isel for power9.