Bug 105930 - [12/13 Regression] Excessive stack spill generation on 32-bit x86
Summary: [12/13 Regression] Excessive stack spill generation on 32-bit x86
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 12.1.1
Importance: P3 normal
Target Milestone: 12.2
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2022-06-11 20:51 UTC by Linus Torvalds
Modified: 2022-08-19 18:29 UTC
CC: 9 users

See Also:
Host:
Target: i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2022-06-12 00:00:00


Attachments
Test-case extracted from the generic blake2b kernel code (1.19 KB, text/plain)
2022-06-11 20:51 UTC, Linus Torvalds
Mindless revert that fixes things for me (1.14 KB, text/plain)
2022-06-12 00:20 UTC, Linus Torvalds

Description Linus Torvalds 2022-06-11 20:51:44 UTC
Created attachment 53121 [details]
Test-case extracted from the generic blake2b kernel code

Gcc-12 seems to generate a huge number of stack spills on this blake2b test-case, to the point where it overflows the allowable kernel stack on 32-bit x86.

This crypto thing has two 128-byte buffers, so a stack frame a bit larger than 256 bytes is expected when the dataset doesn't fit in the register set.
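
For readers without the attachment, here is a minimal sketch of its shape (loosely following the kernel's generic blake2b code, with made-up names; illustrative only, not the attached file itself):

    #include <stdint.h>
    #include <string.h>

    /* Illustrative sketch only: the two 128-byte buffers are the 16-word
       working state v[] and the 16-word message block m[].  Twelve rounds
       of 64-bit add/xor/rotate mixing keep most of both live at once, and
       with -m32 every 64-bit value is a DImode quantity that needs a pair
       of 32-bit registers or a stack slot. */
    void blake2b_compress_sketch(uint64_t h[8], const uint64_t block[16])
    {
            uint64_t v[16];                 /* 128 bytes of working state */
            uint64_t m[16];                 /* 128 bytes of message words */

            memcpy(m, block, sizeof(m));
            memcpy(v, h, 8 * sizeof(uint64_t));
            memset(v + 8, 0, 8 * sizeof(uint64_t)); /* really IV/counters */

            /* ... 12 rounds, each doing eight G-style mixing steps built
               from 64-bit adds, xors and rotates by 32, 24, 16 and 63 ... */

            for (int i = 0; i < 8; i++)     /* fold working state into h[] */
                    h[i] ^= v[i] ^ v[i + 8];
    }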

Just as an example, on this code, clang-14.0.0 generates a stack frame that is 296 bytes.

In contrast, gcc-12.1.1 generates a stack frame that is almost an order of magnitude(!) larger, at 2620 bytes.

The trivial Makefile I used for this test-case is

    # The kernel cannot just randomly use FP/MMX/AVX
    CFLAGS := -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx
    CFLAGS += -m32
    CFLAGS += -O2

    test:
        gcc $(CFLAGS) -Wall -S blake2b.c
        grep "sub.*%[er]sp" blake2b.s

to easily test different flags and the end result, but as can be seen from above, it really doesn't need any special flags except the ones that disable MMX/AVX code generation.

And the generated code looks perfectly regular, except for the fact that it uses almost 3kB of stack space.

Note that "-m32" is required to trigger this - the 64-bit case does much better, presumably because it has more registers and this needs fewer spills. It gets worse with some added debug flags we use in the kernel, but not that kind of "order of magnitude" worse.

Using -O1 or -Os makes no real difference.

This is presumably due to some newly triggered optimization in gcc-12, but I can't even begin to guess at what we'd need to disable (or enable) to avoid this horrendous stack growth. Some very aggressive instruction scheduling thing that spreads out all the calculations and always wants to spill-and-reload the subexpressions that it CSE'd? I dunno.

Please advise. The excessive stack literally causes build failures due to us using -Werror=frame-larger-than= to make sure stack use remains sanely bounded. The kernel stack is a rather limited resource.
Comment 1 Linus Torvalds 2022-06-11 23:20:57 UTC
Side note: it might be best to clarify that this is a regression specific to gcc-12.

Gcc 11.3 doesn't have the problem, and generates code for this same test-case with a stack frame of only 428 bytes. That's still bigger than clang, but not "crazy bigger".
Comment 2 Andrew Pinski 2022-06-11 23:59:28 UTC
thumb1 (which has 16 registers but really only 8 are GPRs) does not have this issue in GCC 12, so I suspect a target specific change caused this.
Comment 3 Linus Torvalds 2022-06-12 00:20:29 UTC
Created attachment 53123 [details]
Mindless revert that fixes things for me
Comment 4 Linus Torvalds 2022-06-12 00:20:55 UTC
So hey, since you guys use git now, I thought I might as well just bisect this.

Now, I have no idea what the best and most efficient way is to generate only "cc1", so my bisection run was this unholy mess of "just run configure, and then run 'make -j128' for long enough that 'host-x86_64-pc-linux-gnu/gcc/cc1' gets generated, and test that".

I'm  not proud of that hacky thing, but since gcc documentation is written in sanskrit, and mere mortals can't figure it out, it's the best I could do.

And the problem bisects down to

  8ea4a34bd0b0a46277b5e077c89cbd86dfb09c48 is the first bad commit
  commit 8ea4a34bd0b0a46277b5e077c89cbd86dfb09c48
  Author: Roger Sayle <roger@nextmovesoftware.com>
  Date:   Sat Mar 5 08:50:45 2022 +0000

      PR 104732: Simplify/fix DI mode logic expansion/splitting on -m32.

so yes, this seems to be very much specific to the i386 target.

And yes, I also verified that reverting that commit on the current master branch solves it for me.

Again: this was just a completely mindless bisection, with a "revert to verify" on top of the current trunk, which for me happened to be commit cd02f15f1ae ("xtensa: Improve constant synthesis for both integer and floating-point").

I'm attaching the revert patch I used just so that you can see exactly what I did. I probably shouldn't have actually removed the testsuite entry, but again: ENTIRELY MINDLESS BISECTION RESULT.
Comment 5 Linus Torvalds 2022-06-12 00:27:48 UTC
(In reply to Linus Torvalds from comment #4)
> 
> I'm  not proud of that hacky thing, but since gcc documentation is written
> in sanskrit, and mere mortals can't figure it out, it's the best I could do.

And by 'sanskrit' I mean 'texi'.

I understand that it's how GNU projects are supposed to work, but markdown (or rst) is just *so* much more legible and you really can read it like text.

Anyway, that's my excuse for not knowing how to "just generate cc1" for a saner git bisect run. What I did worked, but was just incredibly ugly. There must be some better way gcc developers have when they want to bisect cc1 behavior.
Comment 6 Samuel Neves 2022-06-12 02:26:43 UTC
Based on that bisect commit, it is also possible to repro this issue in earlier GCCs (11, 10, seems fine on <= 9) purely by taking away the -mno-sseX, which triggers the same splitting as now on gcc-12: https://godbolt.org/z/KEcWGT9Yc
Comment 7 Roger Sayle 2022-06-12 06:28:59 UTC
Investigating.  Adding -mno-stv the stack size reduces from 2612 to 428 (and on godbolt the number of assembler lines reduces from 6952 to 6203).
Comment 8 Linus Torvalds 2022-06-12 16:18:03 UTC
(In reply to Roger Sayle from comment #7)
> Investigating.  Adding -mno-stv the stack size reduces from 2612 to 428 (and
> on godbolt the number of assembler lines reduces from 6952 to 6203).

Thanks. Using -mno-stv may well be a good workaround for the kernel for now, except I don't have a clue what it means. I will google it to figure out whether it's appropriate for our use.
Comment 9 Linus Torvalds 2022-06-12 16:35:17 UTC
Looks like STV is "scalar to vector" and it should have been disabled automatically by the -mno-avx flag anyway.

And the excessive stack usage was perhaps due to GCC preparing all those stack slots for integer -> vector movement that then never happens exactly because the kernel uses -mno-avx?

So if I understand this right, the kernel can indeed just add -mno-stv to work around this.

?
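
For context, a minimal illustration of the kind of DImode operation the STV pass rewrites (an assumed example, not taken from the test-case):

    /* Illustration only: with -m32, a 64-bit scalar operation like this xor
       is a DImode operation.  The STV ("scalar to vector") pass can carry it
       out in SSE registers (a movq/pxor sequence on %xmm) instead of a pair
       of 32-bit xorl instructions -- an option the kernel's
       -mno-sse/-mno-sse2 flags take away. */
    unsigned long long stv_example(unsigned long long a, unsigned long long b)
    {
        return a ^ b;
    }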
Comment 10 Linus Torvalds 2022-06-12 17:30:50 UTC
(In reply to Roger Sayle from comment #7)
> Investigating.  Adding -mno-stv the stack size reduces from 2612 to 428 (and
> on godbolt the number of assembler lines reduces from 6952 to 6203).

So now that I actually tried that, '-mno-stv' does nothing for me. I still see a frame size of 2620.

I wonder what the difference in our setups is. I tested with plain gcc-12.1.1 from Fedora 36, maybe you tested with current trunk or something?

Anyway, it seems -mno-stv isn't the solution at least for the released version ;(
Comment 11 Jakub Jelinek 2022-06-13 09:29:07 UTC
Confirmed it is r12-7502-g8ea4a34bd0b0a462 on our bisect seed.
Perhaps for !TARGET_STV || !TARGET_SSE2 we could keep the optabs, but split right away during expansion?
Anyway, I think we need to understand what makes it spill that much more. Unfortunately the testcase is too large to find that out easily, so I think we should investigate commenting out some rounds.
Comment 12 Linus Torvalds 2022-06-13 17:39:49 UTC
(In reply to Jakub Jelinek from comment #11)
> Anyway, I think we need to understand what makes it spill that much more,
> and unfortunately the testcase is too large to find that out easily, I think
> we should investigate commenting out some rounds.

I think you can comment out all the rounds except for one, and still see the effects of this.

At least when I try it on godbolt with just ROUND(0) in place and rounds 1-12 deleted, for gcc-11.3 I get a stack frame of 388.

And with gcc-12.1 I get a stack frame of 516. Obviously that's not nearly as huge, but it's still a fairly significant expansion and is presumably due to the same basic issue.

I used godbolt just because I no longer have 11.3 installed to compare against, so it was the easiest way to verify that the stack expansion is still there.

Just to clarify: that's still with

   -O2 -m32 -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx

I'm sure it depends on the flags.

The code is still not exactly trivial with just one round in place. It's a cryptographic hash, after all. But when compiling for 64-bit mode it all looks fairly straightforward, it really is that DImode stuff that makes it really ungainly when using -m32.
Comment 13 Samuel Neves 2022-06-13 18:23:40 UTC
Something simple like this -- https://godbolt.org/z/61orYdjK7 -- already exhibits the effect. 

Furthermore, and this also applies to the full BLAKE2b compression function, if you replace all the xors in the rounds by adds, the stack size goes back to normal. 

On the other hand, replacing the adds by xors, making the whole round function bitwise ops, does not have any effect on stack size.
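
Since godbolt links can expire (as noted in the next comment), here is a sketch in the same spirit, a cut-down mixing step reconstructed rather than copied from the link, compiled with -O2 -m32 -mno-sse -mno-mmx -mno-sse2 -mno-avx:

    #include <stdint.h>

    static inline uint64_t ror64(uint64_t x, unsigned n)
    {
        return (x >> n) | (x << (64 - n));
    }

    /* Reduced BLAKE2b-style mixing step (a reconstruction, not necessarily
       the exact code behind the godbolt link).  The 64-bit xors feeding the
       rotates are what blow up the stack; per the observation above, turning
       the xors into adds brings the frame back down, while turning the adds
       into xors changes nothing. */
    void mix(uint64_t v[16], const uint64_t m[16])
    {
        v[0]  = v[0] + v[4] + m[0];
        v[12] = ror64(v[12] ^ v[0], 32);
        v[8]  = v[8] + v[12];
        v[4]  = ror64(v[4] ^ v[8], 24);
        v[0]  = v[0] + v[4] + m[1];
        v[12] = ror64(v[12] ^ v[0], 16);
        v[8]  = v[8] + v[12];
        v[4]  = ror64(v[4] ^ v[8], 63);
    }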
Comment 14 Linus Torvalds 2022-06-13 18:31:22 UTC
(In reply to Samuel Neves from comment #13)
> Something simple like this -- https://godbolt.org/z/61orYdjK7 -- already
> exhibits the effect. 

Yup.

That's a much better test-case. I think you should attach it to the gcc bugzilla, I don't know how long godbolt remembers those links.

But godbolt here is good for another thing: select clang-14 as the compiler, and you realize how even gcc-11 is actually doing a really really bad job.

And I'm not saying that to shame anybody: I think it's more a sign that maybe the issue is actually much deeper, and that the problem already exists in gcc-11, but some small perturbation then just makes an already existing problem much worse in gcc-12.

So that commit I bisected to is probably almost entirely unrelated, and it is just the thing that turns up the bad behavior to 11.
Comment 15 Richard Biener 2022-06-14 07:57:48 UTC
So we feed DImode rotates into RA which constrains register allocation
enough to require spills (all 4 DImode vals are live across the kernel,
not even -fschedule-insns can do anything here).  I wonder if it ever makes
sense to not split wide ops before reload.
Comment 16 Jakub Jelinek 2022-06-14 08:50:17 UTC
Though, the ix86_rot{l,r}di3_doubleword define_insn_and_split patterns were split only after reload both before and after Roger's change, so somehow whether we emit the xor as SImode from the beginning or only split it before reload affects the RA decisions.
unsigned long long foo (unsigned long long x, int y, unsigned long long z) { x ^= z; return (x << 24) | (x >> (-24 & 63)); }
is too simplified; the difference with that is just that we used to emit setting of the DImode pseudo to 0 before setting its halves with xor, while now we don't, so it must be something else.

I believe the doubleword rotates were introduced as post-reload splitters already back in PR17886.
Rewriting them from post-reload into pre-reload splitters would certainly be possible, and I will try that; the question is whether it would cure this and what effects it would have on other code.
Comment 17 Jakub Jelinek 2022-06-14 09:42:42 UTC
So, I've tried:
--- gcc/config/i386/i386.md.jj	2022-06-13 10:53:26.739290704 +0200
+++ gcc/config/i386/i386.md	2022-06-14 11:09:24.467024047 +0200
@@ -13734,14 +13734,13 @@
 ;; shift instructions and a scratch register.
 
 (define_insn_and_split "ix86_rotl<dwi>3_doubleword"
- [(set (match_operand:<DWI> 0 "register_operand" "=r")
-       (rotate:<DWI> (match_operand:<DWI> 1 "register_operand" "0")
-		     (match_operand:QI 2 "<shift_immediate_operand>" "<S>")))
-  (clobber (reg:CC FLAGS_REG))
-  (clobber (match_scratch:DWIH 3 "=&r"))]
- ""
+ [(set (match_operand:<DWI> 0 "register_operand")
+       (rotate:<DWI> (match_operand:<DWI> 1 "register_operand")
+		     (match_operand:QI 2 "<shift_immediate_operand>")))
+  (clobber (reg:CC FLAGS_REG))]
+ "ix86_pre_reload_split ()"
  "#"
- "reload_completed"
+ "&& 1"
  [(set (match_dup 3) (match_dup 4))
   (parallel
    [(set (match_dup 4)
@@ -13764,6 +13763,7 @@
 						       (match_dup 6)))) 0)))
     (clobber (reg:CC FLAGS_REG))])]
 {
+  operands[3] = gen_reg_rtx (<MODE>mode);
   operands[6] = GEN_INT (GET_MODE_BITSIZE (<MODE>mode) - 1);
   operands[7] = GEN_INT (GET_MODE_BITSIZE (<MODE>mode));
 
@@ -13771,14 +13771,13 @@
 })
 
 (define_insn_and_split "ix86_rotr<dwi>3_doubleword"
- [(set (match_operand:<DWI> 0 "register_operand" "=r")
-       (rotatert:<DWI> (match_operand:<DWI> 1 "register_operand" "0")
-		       (match_operand:QI 2 "<shift_immediate_operand>" "<S>")))
-  (clobber (reg:CC FLAGS_REG))
-  (clobber (match_scratch:DWIH 3 "=&r"))]
- ""
+ [(set (match_operand:<DWI> 0 "register_operand")
+       (rotatert:<DWI> (match_operand:<DWI> 1 "register_operand")
+		       (match_operand:QI 2 "<shift_immediate_operand>")))
+  (clobber (reg:CC FLAGS_REG))]
+ "ix86_pre_reload_split ()"
  "#"
- "reload_completed"
+ "&& 1"
  [(set (match_dup 3) (match_dup 4))
   (parallel
    [(set (match_dup 4)
@@ -13801,6 +13800,7 @@
 						     (match_dup 6)))) 0)))
     (clobber (reg:CC FLAGS_REG))])]
 {
+  operands[3] = gen_reg_rtx (<MODE>mode);
   operands[6] = GEN_INT (GET_MODE_BITSIZE (<MODE>mode) - 1);
   operands[7] = GEN_INT (GET_MODE_BITSIZE (<MODE>mode));
 

On the #c0 test with -O2 -m32 -mno-mmx -mno-sse it makes some difference, but not as much as one would hope for:
Numbers from gcc 11.3.1 20220614, 11.3.1 20220614 with the patch, 13.0.0 20220610, 13.0.0 20220614 with the patch:
              11.3.1   11.3.1+patch   13.0 trunk   13.0+patch
sub on %esp      428           2556         2620         2556
fn size in B   21657          23186        28413        23534
.s lines        6199           3942         7260         4198
So, trunk patched with the above patch results in significantly fewer instructions, but larger code (more of them use 32-bit immediates, mostly in the form of a whatever(%esp) memory source operand).
And the stack usage is high.

I think the patch is still a good idea, since it gives the RA more options, but we should investigate why it consumes so much more stack and results in larger code.
Comment 18 Jakub Jelinek 2022-06-14 09:50:57 UTC
Of course, size comparisons of -O2 code aren't the most important, for -O2 it is more important how fast the code is.
When comparing -Os -m32 -mno-mmx -mno-sse, the numbers are
              11.3.1   11.3.1+patch   13.0 trunk   13.0+patch
sub on %esp      412           2564         2620         2564
fn size in B   27535          20508        35036        20416
.s lines        5816           3590         7251         3544
So in the -Os case, the patched functions are both smaller and use fewer instructions (significantly so), but stack usage is still significantly higher compared to gcc 11.
Comment 19 Arnd Bergmann 2022-06-16 10:06:43 UTC
I checked the other target architectures that are supported by the kernel to see if anything else is affected. Apparently only sparc32 has a similar issue with a frame size of 2280 bytes using gcc-10 or higher, compared to 600 to 800 bytes for gcc-4 through gcc-9.
Comment 20 GCC Commits 2022-06-24 06:17:54 UTC
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:

https://gcc.gnu.org/g:3b8794302b52a819ca3ea78238e9b5025d1c56dd

commit r13-1239-g3b8794302b52a819ca3ea78238e9b5025d1c56dd
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Fri Jun 24 07:15:08 2022 +0100

    PR target/105930: Split *xordi3_doubleword after reload on x86.
    
    This patch addresses PR target/105930 which is an ia32 stack frame size
    regression in high-register pressure XOR-rich cryptography functions
    reported by Linus Torvalds.  The underlying problem is once the limited
    number of registers on the x86 are exhausted, the register allocator
    has to decide which to spill, where some eviction choices lead to much
    poorer code, but these consequences are difficult to predict in advance.
    
    The patch below, which splits xordi3_doubleword and iordi3_doubleword
    after reload (instead of before), significantly reduces the amount of
    spill code and stack frame size, in what might appear to be an arbitrary
    choice.
    
    My explanation of this behaviour is that the mixing of pre-reload split
    SImode instructions and post-reload split DImode instructions is
    confusing some of the heuristics used by reload.  One might think
    that splitting early gives the register allocator more freedom to
    use available registers, but in practice the constraint that double
    word values occupy consecutive registers (when ultimately used as a
    DImode value) is the greater constraint.  Instead, I believe in this
    case, the pseudo registers used in memory addressing, appear to be
    double counted for split SImode instructions compared to unsplit
    DImode instructions.  For the reduced test case in comment #13, this
    leads to %eax being used to hold the long-lived argument pointer "v",
    blocking the use of the ax:dx pair for processing double word values.
    The important lines are at the very top of the assembly output:
    
    GCC 11  [use %ecx to address memory, require a 24-byte stack frame]
            sub     esp, 24
            mov     ecx, DWORD PTR [esp+40]
    
    GCC 12 [use %eax to address memory, require a 44-byte stack frame]
            sub     esp, 44
            mov     eax, DWORD PTR [esp+64]
    
    2022-06-24  Roger Sayle  <roger@nextmovesoftware.com>
                Uroš Bizjak  <ubizjak@gmail.com>
    
    gcc/ChangeLog
            PR target/105930
            * config/i386/i386.md (*<any_or>di3_doubleword): Split after
            reload.  Use rtx_equal_p to avoid creating memory-to-memory moves,
            and emit NOTE_INSN_DELETED if operand[2] is zero (i.e. with -O0).
Comment 21 Linus Torvalds 2022-06-24 16:45:24 UTC
(In reply to CVS Commits from comment #20)
>
>         One might think
>     that splitting early gives the register allocator more freedom to
>     use available registers, but in practice the constraint that double
>     word values occupy consecutive registers (when ultimately used as a
>     DImode value) is the greater constraint. 

Whee.

Why does gcc have that constraint, btw? I tried to look at the clang code generation once more, and I don't *think* clang has the same constraint, and maybe that is why it does so much better?

Yes, x86 itself inherently has a couple of forced register pairings (notably %edx:%eax for 64-bit multiplication and division), and obviously the whole calling convention requires well-defined pairings, but in the general case it seems to be a mistake to keep DImode values as DImode values and force them to be consecutive registers when used.
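
For contrast, a small example of a pairing the hardware itself mandates (illustrative, not from the test-case):

    /* With -m32 this compiles to a single mull, whose 64-bit product is
       defined by the ISA to land in %edx:%eax.  The blake2b code has no
       such instructions; its pairings only arise from GCC keeping each
       64-bit value as a single DImode register, which is the constraint
       discussed above. */
    unsigned long long widening_mul(unsigned int a, unsigned int b)
    {
        return (unsigned long long)a * b;
    }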

Maybe I misunderstand. But now that this comes up I have this dim memory of actually having had a discussion like this before on bugzilla, where gcc generated horrible DImode code.

>     GCC 11  [use %ecx to address memory, require a 24-byte stack frame]
>             sub     esp, 24
>             mov     ecx, DWORD PTR [esp+40]
>     
>     GCC 12 [use %eax to address memory, require a 44-byte stack frame]
>             sub     esp, 44
>             mov     eax, DWORD PTR [esp+64]

I just checked the current git -tip, and this does seem to fix the original case too, with the old horrid 2620 bytes of stack frame now being a *much* improved 404 bytes!

So your patch - or other changes - does fix it for me, unless I did something wrong in my testing (which is possible).

Thanks. I'm not sure what the gcc policy on closing the bug is (and I don't even know if I am allowed), so I'm not marking this closed, but it seems to be fixed as far as I am concerned, and I hope it gets released as a dot-release for the gcc-12 series.
Comment 22 Jakub Jelinek 2022-06-24 16:55:38 UTC
(In reply to Linus Torvalds from comment #21)
> Whee.
> 
> Why does gcc have that constraint, btw? I tried to look at the clang code
> generation once more, and I don't *think* clang has the same constraint, and
> maybe that is why it does so much better?

Registers in RTL have just a single register number and mode (ok, it has some extra info, but not a set of registers).  When it is a pseudo register, that doesn't constrain anything, it is just
(reg:DI 175).
But when it is a hard register, it still has just a single register number, so there is no way to express a non-consecutive set of registers that way;
(reg:DI 4)
needs to be the di:si pair, etc.
If the wider registers are narrowed before register allocation, it is just
a pair like (reg:SI 123) (reg:SI 256) and it can be allowed anywhere.
If we wanted RA to allocate non-consecutive registers, we'd need to represent that differently (say as concatenation of SImode registers), but then it wouldn't be accepted by constraints and predicates of most of the define_insn_and_split patterns.
Comment 23 Linus Torvalds 2022-06-24 17:06:17 UTC
(In reply to Jakub Jelinek from comment #22)
> 
> If the wider registers are narrowed before register allocation, it is just
> a pair like (reg:SI 123) (reg:SI 256) and it can be allowed anywhere.

That was more what I was thinking - why is the DImode information being kept so long?

I realize that you want to do a lot of the early CSE etc operations at that higher level, but by the time you are actually allocating registers and thinking about spilling them, why is it still a DImode thing?

And this now brings back my memory of the earlier similar discussion - it wasn't about DImode code generation, it was about bitfield code generation being horrendous, where gcc was keeping the whole "this is a bitfield" information around for a long time and as a result generating truly horrendous code, when it looked like it should instead just have turned it into a load and shift early and then done all the sane optimizations at that level (ie rewriting simple bitfield code to just do loads and shifts generated *much* better code than using bitfields).

But this is just my personal curiosity at this point - it looks like Roger Sayle's patch has fixed the immediate problem, so the big issue is solved. And maybe the fact that clang is doing so much better is due to something else entirely - it just _looks_ like it might be this artificial constraint by gcc that makes it do bad register and spill choices.
Comment 24 Linus Torvalds 2022-06-24 17:19:09 UTC
(In reply to Linus Torvalds from comment #23)
> 
> And this now brings back my memory of the earlier similar discussion - it
> wasn't about DImode code generation, it was about bitfield code generation
> being horrendous,

Searching around, it's this one from 2011:

  https://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696

not really related to this issue, apart from the superficially similar issue with oddly bad code generation.
Comment 25 Jakub Jelinek 2022-06-24 17:33:54 UTC
(In reply to Linus Torvalds from comment #23)
> (In reply to Jakub Jelinek from comment #22)
> > 
> > If the wider registers are narrowed before register allocation, it is just
> > a pair like (reg:SI 123) (reg:SI 256) and it can be allowed anywhere.
> 
> That was more what I was thinking - why is the DImode information being kept
> so long?

This is what is being discussed here.
One possibility is to lower these multi-word operations during expansion from GIMPLE to RTL (after all, the generic code usually does that without anything special needed on the backend side unless the backend declares it can do better); one counter-argument to that is the x86 STV pass, which uses vector operations for 2-word operations when possible and won't really work when the operation is lowered during expansion.

Another is splitting those before register allocation, which is what some patterns did and what other patterns didn't.

Or it can be split after register allocation.

My understanding was that Roger tried to change some patterns from splitting after RA to before RA and it didn't improve this testcase, so in the end changed some other patterns from splitting before RA to after RA.
Comment 26 GCC Commits 2022-06-30 10:03:05 UTC
The master branch has been updated by Roger Sayle <sayle@gcc.gnu.org>:

https://gcc.gnu.org/g:00193676a5a3e7e50e1fa6646bb5abb5a7b2acbb

commit r13-1362-g00193676a5a3e7e50e1fa6646bb5abb5a7b2acbb
Author: Roger Sayle <roger@nextmovesoftware.com>
Date:   Thu Jun 30 11:00:03 2022 +0100

    Use xchg for DImode double word rotate by 32 bits with -m32 on x86.
    
    This patch was motivated by the investigation of Linus Torvalds' spill
    heavy cryptography kernels in PR 105930.  The <any_rotate>di3 expander
    handles all rotations by an immediate constant for 1..63 bits with the
    exception of 32 bits, which FAILs and is then split by the middle-end.
    This patch makes these 32-bit doubleword rotations consistent with the
    other DImode rotations during reload, which results in reduced register
    pressure, fewer instructions and the use of x86's xchg instruction
    when appropriate.  In theory, xchg can be handled by register renaming,
    but even on micro-architectures where it's implemented by 3 uops (no
    worse than a three instruction shuffle), avoiding nominating a
    "temporary" register, reduces user-visible register pressure (and
    has obvious code size benefits).
    
    The effects are best shown with the new testcase:
    
    unsigned long long bar();
    unsigned long long foo()
    {
      unsigned long long x = bar();
      return (x>>32) | (x<<32);
    }
    
    for which GCC with -m32 -O2 currently generates:
    
            subl    $12, %esp
            call    bar
            addl    $12, %esp
            movl    %eax, %ecx
            movl    %edx, %eax
            movl    %ecx, %edx
            ret
    
    but with this patch now generates:
    
            subl    $12, %esp
            call    bar
            addl    $12, %esp
            xchgl   %edx, %eax
            ret
    
    With this patch, the number of lines of assembly language generated
    for the blake2b kernel (from the attachment to PR105930) decreases
    from 5626 to 5404. Although there's an impressive reduction in
    instruction count, there's no change/reduction in stack frame size.
    
    2022-06-30  Roger Sayle  <roger@nextmovesoftware.com>
                Uroš Bizjak  <ubizjak@gmail.com>
    
    gcc/ChangeLog
            * config/i386/i386.md (swap<mode>): Rename from *swap<mode> to
            provide gen_swapsi.
            (<any_rotate>di3): Handle !TARGET_64BIT rotations by 32 bits
            via new gen_<insn>32di2_doubleword below.
            (<any_rotate>32di2_doubleword): New define_insn_and_split
            that splits after reload as either a pair of move instructions
            or an xchgl (using gen_swapsi).
    
    gcc/testsuite/ChangeLog
            * gcc.target/i386/xchg-3.c: New test case.
Comment 27 Roger Sayle 2022-07-10 08:01:29 UTC
This should now be fixed on both mainline and the GCC 12 release branch.
Comment 28 Linus Torvalds 2022-07-10 16:41:38 UTC
(In reply to Roger Sayle from comment #27)
> This should now be fixed on both mainline and the GCC 12 release branch.

Thanks everybody.

Looks like the xchg optimization isn't in the gcc-12 release branch, but the stack use looks reasonable from my (very limited) testing.