Take this code

    __attribute__((noinline))
    int f(int a, int b)
    {
      return b - a + 5;
    }

    int foo(int a, int b)
    {
      return 1 + f(b, a);
    }

    int main()
    {
      return foo(39, 3);
    }

gcc 9.2.1 generates for foo on x86-64 this code:

    movl %edi, %r8d
    movl %esi, %edi
    movl %r8d, %esi
    call f
    addl $1, %eax
    ret

This could be better:

    xchgl %edi, %esi
    call f
    addl $1, %eax
    ret

Switching parameter locations is not an uncommon pattern. If regparm is used on x86-32, the same likely applies there.
Since there's no way to encode this in RTL, this must be done in some peephole2?

IIRC

    (parallel (set (reg:SI 1) (reg:SI 2))
              (set (reg:SI 2) (reg:SI 1)))

doesn't work like PHI nodes (all "reads" happen first, then the "writes"), even though it would be nice to eventually represent it that way?
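To make the question about "reads first, then writes" concrete, here is a minimal sketch in C of the two possible interpretations of a pair of register-to-register sets. It is a toy model, not GCC code: the `struct set` type, the register array, and both function names are hypothetical. For what it's worth, the GCC internals manual describes `parallel` as computing all used values first and then performing the side effects, which corresponds to `run_parallel` here.

```c
#include <assert.h>

/* Hypothetical toy model: a register file is an int array and a "set"
   copies one register to another.  None of this is GCC's RTL. */
enum { NREGS = 4, MAX_SETS = 8 };

struct set { int dst, src; };   /* meaning: regs[dst] := regs[src] */

/* Sequential semantics: each set sees the result of the previous one. */
void run_sequential(int regs[NREGS], const struct set *s, int n)
{
    for (int i = 0; i < n; i++)
        regs[s[i].dst] = regs[s[i].src];
}

/* Parallel semantics: read every source first, then perform all writes. */
void run_parallel(int regs[NREGS], const struct set *s, int n)
{
    int tmp[MAX_SETS];

    assert(n >= 0 && n <= MAX_SETS);
    for (int i = 0; i < n; i++)
        tmp[i] = regs[s[i].src];
    for (int i = 0; i < n; i++)
        regs[s[i].dst] = tmp[i];
}
```

With the two sets `{1, 2}` and `{2, 1}` and initial values r1 = 39, r2 = 3, the sequential version leaves both registers at 3, while the parallel version swaps them. Only the second reading makes the two-set parallel a valid description of xchg.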
But it looks like x86 has exactly patterns like this - but in this case I guess combine won't ever try this because hardregs are involved (not sure if it ever tries to "simplify" the three-set into the parallel two-set variant).
Created attachment 47292
gcc10-pr92549.patch

Yeah, there are *swap<mode> patterns, but they are unlikely to trigger: before RA there is usually no swap between pseudos, just different pseudos, and the need for a swap only arises during RA.

This patch handles it in peephole2. The big question is whether it should be done always (as in the patch), or only at -Os or on selected modern CPUs + maybe generic tuning, where xchg with register operands just uses normal register renaming and takes 0.5 or, in the worst case, 1 cycle.
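The idea of such a peephole can be sketched on a toy instruction list. This is only an illustration of the transformation, not the attached patch and not GCC's machine-description syntax: the `struct insn` layout, the `src_dies` flag (standing in for a real liveness query), and the function names are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy IR, not GCC's RTL. */
enum opcode { OP_MOV, OP_XCHG, OP_OTHER };

struct insn {
    enum opcode op;
    int src, dst;    /* register numbers; for OP_XCHG the two swapped regs */
    bool src_dies;   /* assumed liveness info: src is unused after this insn */
};

/* Match  mov a->t; mov b->a; mov t->b  where t is dead afterwards. */
static bool is_swap(const struct insn *p)
{
    int a = p[0].src, t = p[0].dst, b = p[1].src;

    return p[0].op == OP_MOV && p[1].op == OP_MOV && p[2].op == OP_MOV
        && p[1].dst == a
        && p[2].src == t && p[2].dst == b
        && p[2].src_dies               /* the scratch must not be live later */
        && a != b && a != t && b != t;
}

/* Rewrite every matched three-move swap into one xchg, in place.
   Returns the new instruction count. */
size_t peephole_swap(struct insn *code, size_t n)
{
    size_t out = 0, i = 0;

    while (i < n) {
        if (i + 2 < n && is_swap(&code[i])) {
            /* Build the replacement before overwriting, since out <= i. */
            struct insn x = { OP_XCHG, code[i].src, code[i + 1].src, false };
            code[out++] = x;
            i += 3;
        } else {
            code[out++] = code[i++];
        }
    }
    assert(out <= n);
    return out;
}
```

The dead-scratch check matters: if the temporary register were still read later, replacing the three moves with xchg would lose its value. Whether the rewrite is also a speed win is a separate, target-dependent question.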
E.g. Agner Fog's tables list for Atom mov r,r as 1 uop, latency 1, rec. throughput 1/2, while xchg r,r is 3 uops, latency 6, rec. throughput 6. It doesn't look beneficial speed-wise then. Though, even on, say, Skylake-X the tables say mov r32,r32 is 1 uop, latency 0-1, rec. thr. 0.25, while xchg r,r is 3 uops, latency 2, rec. thr. 1.
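The comparison can be written out as arithmetic. The sketch below uses only the uop counts and reciprocal throughputs quoted above; the `struct cost` type, the constant names, and the `repeat` helper are hypothetical. Summing reciprocal throughputs over three moves is a deliberately crude estimate that ignores dependencies between the moves and any overlap with surrounding code.

```c
#include <assert.h>

/* Hypothetical helper type: per-instruction cost as quoted in the thread. */
struct cost { int uops; double recip_throughput; };

static const struct cost ATOM_MOV  = { 1, 0.5  };
static const struct cost ATOM_XCHG = { 3, 6.0  };
static const struct cost SKX_MOV   = { 1, 0.25 };
static const struct cost SKX_XCHG  = { 3, 1.0  };

/* Crude cost of n back-to-back copies of one instruction: a plain sum. */
struct cost repeat(struct cost c, int n)
{
    assert(n >= 0);
    struct cost r = { c.uops * n, c.recip_throughput * n };
    return r;
}
```

Under that estimate, `repeat(ATOM_MOV, 3)` gives 3 uops and 1.5 cycles against 3 uops and 6 cycles for one xchg, and `repeat(SKX_MOV, 3)` gives 3 uops and 0.75 cycles against 3 uops and 1 cycle. The uop count is the same in both cases, so on these numbers xchg saves code size but not execution resources.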
Further info on the topic: https://stackoverflow.com/questions/45766444/why-is-xchg-reg-reg-a-3-micro-op-instruction-on-modern-intel-architectures
(In reply to Richard Biener from comment #2)
> But it looks like x86 has exactly patterns like this - but in this case
> I guess combine won't ever try this because hardregs are involved
> (not sure if it ever tries to "simplify" the three-set into the parallel
> two-set variant)

combine will use hard registers just fine, although sometimes targetm.class_likely_spilled_p gets in the way, for x86; but you already found out everything still is pseudos here (and xchg doesn't seem like a good thing to do usually, even).
Author: jakub
Date: Tue Nov 19 09:31:59 2019
New Revision: 278439

URL: https://gcc.gnu.org/viewcvs?rev=278439&root=gcc&view=rev
Log:
	PR target/92549
	* config/i386/i386.md (peephole2 for *swap<mode>): New peephole2.

	* gcc.target/i386/pr92549.c: New test.

Added:
    trunk/gcc/testsuite/gcc.target/i386/pr92549.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/i386/i386.md
    trunk/gcc/testsuite/ChangeLog
Fixed for -Os in GCC 10.