92549 – Use x86 xchg instruction more

Bug 92549 - Use x86 xchg instruction more

Summary: Use x86 xchg instruction more

Status:	RESOLVED FIXED

Alias:	None

Product:	gcc
Classification:	Unclassified
Component:	target (show other bugs)
Version:	9.2.1

Importance:	P3 normal
Target Milestone:	10.0
Assignee:	Jakub Jelinek

URL:
Keywords:	missed-optimization

Depends on:
Blocks:

Reported:	2019-11-17 16:59 UTC by Ulrich Drepper
Modified:	2021-12-21 11:36 UTC (History)
CC List:	2 users (show)

See Also:
Host:
Target:	x86_64--, i?86--
Build:
Known to work:
Known to fail:
Last reconfirmed:	2019-11-18 00:00:00

Attachments
gcc10-pr92549.patch (694 bytes, patch) 2019-11-18 17:07 UTC, Jakub Jelinek	Details \| Diff
View All Add an attachment (proposed patch, testcase, etc.)

Note You need to log in before you can comment on or make changes to this bug.

Description Ulrich Drepper 2019-11-17 16:59:33 UTC

Take this code

__attribute__((noinline))
int f(int a, int b)
{
  return b - a + 5;
}
int foo(int a, int b)
{
  return 1 + f(b, a);
}
int main()
{
  return foo(39, 3);
}

gcc 9.2.1 generates for foo on x86-64 this code:

        movl    %edi, %r8d
        movl    %esi, %edi
        movl    %r8d, %esi
        call    f
        addl    $1, %eax
        ret

This could be better:

        xchgl   %edi, %esi
        call    f
        addl    $1, %eax
        ret

Switching parameter location is not a uncommon pattern.

If the regparm is used on x86-32 the same likely applies there.

Comment 1 Richard Biener 2019-11-18 08:05:56 UTC

Since there's no way encode this in RTL this must be done in some peephole2?
IIRC

(parallel
  (set (reg:SI 1) (reg:SI 2))
  (set (reg:SI 2) (reg:SI 1)))

doesn't work like PHI nodes (all "reads" happen first, then the "writes"),
even though it would be nice to eventually represent it that way?

Comment 2 Richard Biener 2019-11-18 08:07:38 UTC

But it looks like x86 has exactly patterns like this - but in this case
I guess combine won't ever try this because hardregs are invovled
(not sure if it ever tries to "simplify" the three-set into the parallel
two-set variant)

Comment 3 Jakub Jelinek 2019-11-18 17:07:41 UTC

Created attachment 47292 [details]
gcc10-pr92549.patch

Yeah, there are *swap<mode> patterns, but they are unlikely to trigger, because before RA usually there is no swap between pseudos, but simply different pseudos, and only during RA we get to a need of a swap.
This patch handles it in peephole2.  The big question is if it should be done always (as in the patch), or only at -Os or on selected modern CPUs + maybe generic tuning, where xchg with register operands just uses normal register renaming and is 0.5 or worst case 1 cycles.

Comment 4 Jakub Jelinek 2019-11-18 17:17:53 UTC

E.g. Agner Fog has in the tables for Atom mov r,r 1uops, latency 1, rec. throughput 1/2, while for xchg r,r 3uops, latency 6, rec. throughput 6.
It doesn't look beneficial speed wise then.
Though, even say on Skylake-X the tables say mov r32,r32 is 1uops, latency 0-1, rec. thr. 0.25, while xchg r,r is 3uops, latency 2, rec. thr. 1.

Comment 5 Jakub Jelinek 2019-11-18 17:26:56 UTC

Further info on the topic:
https://stackoverflow.com/questions/45766444/why-is-xchg-reg-reg-a-3-micro-op-instruction-on-modern-intel-architectures

Comment 6 Segher Boessenkool 2019-11-18 17:52:45 UTC

(In reply to Richard Biener from comment #2)
> But it looks like x86 has exactly patterns like this - but in this case
> I guess combine won't ever try this because hardregs are invovled
> (not sure if it ever tries to "simplify" the three-set into the parallel
> two-set variant)

combine will use hard registers just fine, although sometimes
targetm.class_likely_spilled_p gets in the way, for x86; but you already
found out everything still is pseudos here (and xchg doesn't seem like
a good thing to do usually, even).

Comment 7 Jakub Jelinek 2019-11-19 09:32:32 UTC

Author: jakub
Date: Tue Nov 19 09:31:59 2019
New Revision: 278439

URL: https://gcc.gnu.org/viewcvs?rev=278439&root=gcc&view=rev
Log:
	PR target/92549
	* config/i386/i386.md (peephole2 for *swap<mode>): New peephole2.

	* gcc.target/i386/pr92549.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr92549.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.md
    trunk/gcc/testsuite/ChangeLog

Comment 8 Andrew Pinski 2021-12-21 11:36:01 UTC

Fixed for -Os in GCC 10.