Bug 92549 - Use x86 xchg instruction more
Summary: Use x86 xchg instruction more
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 9.2.1
: P3 normal
Target Milestone: 10.0
Assignee: Jakub Jelinek
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2019-11-17 16:59 UTC by Ulrich Drepper
Modified: 2021-12-21 11:36 UTC (History)
2 users (show)

See Also:
Host:
Target: x86_64-*-*, i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2019-11-18 00:00:00


Attachments
gcc10-pr92549.patch (694 bytes, patch)
2019-11-18 17:07 UTC, Jakub Jelinek
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Ulrich Drepper 2019-11-17 16:59:33 UTC
Take this code

__attribute__((noinline))
int f(int a, int b)
{
  return b - a + 5;
}
int foo(int a, int b)
{
  return 1 + f(b, a);
}
int main()
{
  return foo(39, 3);
}

gcc 9.2.1 generates for foo on x86-64 this code:

        movl    %edi, %r8d
        movl    %esi, %edi
        movl    %r8d, %esi
        call    f
        addl    $1, %eax
        ret

This could be better:

        xchgl   %edi, %esi
        call    f
        addl    $1, %eax
        ret

Switching parameter location is not a uncommon pattern.

If the regparm is used on x86-32 the same likely applies there.
Comment 1 Richard Biener 2019-11-18 08:05:56 UTC
Since there's no way encode this in RTL this must be done in some peephole2?
IIRC

(parallel
  (set (reg:SI 1) (reg:SI 2))
  (set (reg:SI 2) (reg:SI 1)))

doesn't work like PHI nodes (all "reads" happen first, then the "writes"),
even though it would be nice to eventually represent it that way?
Comment 2 Richard Biener 2019-11-18 08:07:38 UTC
But it looks like x86 has exactly patterns like this - but in this case
I guess combine won't ever try this because hardregs are invovled
(not sure if it ever tries to "simplify" the three-set into the parallel
two-set variant)
Comment 3 Jakub Jelinek 2019-11-18 17:07:41 UTC
Created attachment 47292 [details]
gcc10-pr92549.patch

Yeah, there are *swap<mode> patterns, but they are unlikely to trigger, because before RA usually there is no swap between pseudos, but simply different pseudos, and only during RA we get to a need of a swap.
This patch handles it in peephole2.  The big question is if it should be done always (as in the patch), or only at -Os or on selected modern CPUs + maybe generic tuning, where xchg with register operands just uses normal register renaming and is 0.5 or worst case 1 cycles.
Comment 4 Jakub Jelinek 2019-11-18 17:17:53 UTC
E.g. Agner Fog has in the tables for Atom mov r,r 1uops, latency 1, rec. throughput 1/2, while for xchg r,r 3uops, latency 6, rec. throughput 6.
It doesn't look beneficial speed wise then.
Though, even say on Skylake-X the tables say mov r32,r32 is 1uops, latency 0-1, rec. thr. 0.25, while xchg r,r is 3uops, latency 2, rec. thr. 1.
Comment 6 Segher Boessenkool 2019-11-18 17:52:45 UTC
(In reply to Richard Biener from comment #2)
> But it looks like x86 has exactly patterns like this - but in this case
> I guess combine won't ever try this because hardregs are invovled
> (not sure if it ever tries to "simplify" the three-set into the parallel
> two-set variant)

combine will use hard registers just fine, although sometimes
targetm.class_likely_spilled_p gets in the way, for x86; but you already
found out everything still is pseudos here (and xchg doesn't seem like
a good thing to do usually, even).
Comment 7 Jakub Jelinek 2019-11-19 09:32:32 UTC
Author: jakub
Date: Tue Nov 19 09:31:59 2019
New Revision: 278439

URL: https://gcc.gnu.org/viewcvs?rev=278439&root=gcc&view=rev
Log:
	PR target/92549
	* config/i386/i386.md (peephole2 for *swap<mode>): New peephole2.

	* gcc.target/i386/pr92549.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr92549.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.md
    trunk/gcc/testsuite/ChangeLog
Comment 8 Andrew Pinski 2021-12-21 11:36:01 UTC
Fixed for -Os in GCC 10.