Bug 113779 - Very inefficient m68k code generated for simple copy loop
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: target
Version: 13.2.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2024-02-05 21:07 UTC by Miro Kropacek
Modified: 2024-02-17 00:39 UTC

See Also:
Host:
Target: m68k-elf
Build:
Known to work:
Known to fail:
Last reconfirmed: 2024-02-06 00:00:00


Description Miro Kropacek 2024-02-05 21:07:56 UTC
Even a loop as simple as this:

void f(const long* src, long* dst, int count) {
        for (int i = 0; i < count; i++) {
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
                *dst++ = *src++;
        }
}

is compiled to:

#NO_APP
        .file   "test.c"
        .text
        .align  2
        .globl  f
        .type   f, @function
f:
        move.l 4(%sp),%a0
        move.l 8(%sp),%a1
        move.l 12(%sp),%d1
        jle .L1
        clr.l %d0
.L3:
        move.l (%a0),(%a1)
        move.l 4(%a0),4(%a1)
        move.l 8(%a0),8(%a1)
        move.l 12(%a0),12(%a1)
        move.l 16(%a0),16(%a1)
        move.l 20(%a0),20(%a1)
        move.l 24(%a0),24(%a1)
        move.l 28(%a0),28(%a1)
        move.l 32(%a0),32(%a1)
        move.l 36(%a0),36(%a1)
        move.l 40(%a0),40(%a1)
        move.l 44(%a0),44(%a1)
        move.l 48(%a0),48(%a1)
        move.l 52(%a0),52(%a1)
        move.l 56(%a0),56(%a1)
        add.w #64,%a0
        add.w #64,%a1
        move.l -4(%a0),-4(%a1)
        addq.l #1,%d0
        cmp.l %d1,%d0
        jne .L3
.L1:
        rts
        .size   f, .-f
        .ident  "GCC: (GNU) 13.2.0"

It has been like this for ages: gcc 4.6.4, gcc 7.2.0 and most recently gcc 13.2.0 all behave this way. The last versions reported to generate move.l (a0)+,(a1)+ were gcc 2.95 and gcc 3.x.

So what's the catch here? Why does gcc hate move.l (ax)+,(ay)+ so much? Tested with m68k-elf-gcc -O2 -fomit-frame-pointer -m68020-60.
Comment 1 Andrew Pinski 2024-02-05 21:20:21 UTC
> So what's the catch here? Why does gcc hate move.l (ax)+,(ay)+ so much?

At one point (before GCC 9 or 8 or so, I think), GCC's IV-OPTs optimization did not take post/pre-increment into account, but now it does. However, if the target cost model does not take those into account, IV-OPTs could decide not to use them.
Now, m68k is a target that not many GCC developers work on, so it is up to someone to look into why post-increment is no longer being used.
Comment 2 Richard Biener 2024-02-06 07:58:00 UTC
I don't think IVOPTs would use postinc for the intermediate increments.  It's constant propagation/forwarding that accumulates the increments into a constant
offset, which removes the dependences between the instructions and thus would allow the
loads/stores to be executed in parallel (well, not that m68k uarchs likely can do any of that ...).

I wonder if the code we emit is measurably slower though?  It's possibly
a little bit larger due to the two IV increments.
Comment 3 Miro Kropacek 2024-02-06 08:16:56 UTC
> I wonder if the code we emit is measurably slower though?  It's possibly
> a little bit larger due to the two IV increments.

It's definitely slower, as each offset next to an An register adds a separate extension word to the instruction. So instead of the 2-byte instruction "move.l (a0)+,(a1)+" we have the 6-byte instruction "move.l off(a0),off(a1)", and that hurts a lot even on the 68060, not to mention the poor 68000.
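
To put rough numbers on the size difference, here is a small C sketch that just tallies encoded bytes for the 16 copies in the loop body from the description. The byte counts assume the usual 68000-family encodings (a 2-byte opcode word plus one 2-byte extension word per 16-bit displacement or immediate). The enum and function names are made up for illustration, and the loop-counter instructions are left out since both variants would need them.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed encoded sizes in bytes: 2-byte opcode word, plus one 2-byte
   extension word for each 16-bit displacement or immediate operand. */
enum {
    MOVE_POSTINC   = 2, /* move.l (a0)+,(a1)+     : opcode word only           */
    MOVE_INDIRECT  = 2, /* move.l (a0),(a1)       : opcode word only           */
    MOVE_DISP_DISP = 6, /* move.l off(a0),off(a1) : opcode + two displacements */
    ADD_W_IMM      = 4  /* add.w #64,%a0          : opcode + 16-bit immediate  */
};

/* Copy code per iteration as in the listing above: one plain indirect
   move, (copies - 1) displacement moves, and two pointer adjustments. */
static size_t bytes_offset_form(size_t copies)
{
    return MOVE_INDIRECT + (copies - 1) * MOVE_DISP_DISP + 2 * ADD_W_IMM;
}

/* Hypothetical post-increment variant: one 2-byte move per copy, with
   the pointer updates folded into the addressing mode. */
static size_t bytes_postinc_form(size_t copies)
{
    return copies * MOVE_POSTINC;
}

/* Quick self-check for the 16-copy loop body from the description. */
static void check_sizes(void)
{
    assert(bytes_offset_form(16) == 100);
    assert(bytes_postinc_form(16) == 32);
}
```

For the 16 copies above this gives 100 bytes for the offset form against 32 bytes for the post-increment form. It says nothing about cycle counts, only about how many instruction bytes have to be fetched.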
Comment 4 Mikael Pettersson 2024-02-06 12:47:25 UTC
I'm not sure this is an m68k bug. I tried several targets that have auto-increment addressing modes (m68k, pdp11, msp430, vax, aarch64) and none of them would use auto-increment for this test case.
Comment 5 Miro Kropacek 2024-02-06 12:58:30 UTC
I have been told that one of the reasons why post-incrementing modes are not supported / preferred these days is that they stall the CPU pipeline (of course, totally not applicable on m68k). With offsets the moves can be parallelized, whereas with post-increment each instruction has to wait for the previous one to finish updating the address register.

So I can understand why this was changed, but it definitely shouldn't be a change affecting all possible CPUs.
Comment 6 Richard Biener 2024-02-06 13:14:29 UTC
It's already visible with a simple

void f(const long* src, long* dst)
{
  *dst++ = *src++;
  *dst = *src;
}

where we expand to RTL from

  _1 = *src_3(D);
  *dst_4(D) = _1;
  _2 = MEM[(const long int *)src_3(D) + 4B];
  MEM[(long int *)dst_4(D) + 4B] = _2;

there's nothing on GIMPLE that would split the add, and RTL's auto-inc-dec
pass doesn't do anything either.  We'd need a form of "strength-reduction",
or maybe targets preferring auto-inc/dec should not legitimize constant
offsets before reload ...

Note with one more copy you then see

  _1 = *src_4(D);
  *dst_5(D) = _1;
  _2 = MEM[(const long int *)src_4(D) + 4B];
  MEM[(long int *)dst_5(D) + 4B] = _2;
  _3 = MEM[(const long int *)src_4(D) + 8B];
  MEM[(long int *)dst_5(D) + 8B] = _3;

and naively splitting gives you

  src_6 = src_4(D) + 4;
  src_7 = src_4(D) + 8;

that said, it's really something for RTL since it's going to be highly target
dependent which form is more efficient.  The auto-inc pass is well
structured, so it should be possible to extend it.
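
For illustration only, the three shapes discussed above can be written out in plain C. This is a sketch at source level, not what any GCC pass actually produces; the function names and the dst_6/dst_7 temporaries are hypothetical, and "+ 1" on a long pointer stands in for the "+ 4B" of the GIMPLE dump, assuming a 4-byte long as on m68k.

```c
#include <assert.h>

/* Shape handed to RTL expansion: every access is base + constant offset. */
static void copy3_offsets(const long *src, long *dst)
{
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
}

/* Naive split: each temporary is still computed from the original base,
   so no pointer is ever stepped from the previous one and there is no
   increment that could fold into the memory access. */
static void copy3_naive_split(const long *src, long *dst)
{
    const long *src_6 = src + 1;   /* src_4(D) + 4 in the dump */
    const long *src_7 = src + 2;   /* src_4(D) + 8 in the dump */
    long *dst_6 = dst + 1;         /* hypothetical counterpart for dst */
    long *dst_7 = dst + 2;         /* hypothetical counterpart for dst */

    *dst   = *src;
    *dst_6 = *src_6;
    *dst_7 = *src_7;
}

/* Chained form: each address is the previous address plus the access
   size, which is the shape a post-increment addressing mode could
   absorb on a target that prefers it. */
static void copy3_chained(const long *src, long *dst)
{
    *dst = *src;  src += 1;  dst += 1;
    *dst = *src;  src += 1;  dst += 1;
    *dst = *src;
}
```

All three functions copy the same three elements. The difference is only in how the addresses are derived, which is why the choice between them is a target cost question rather than a correctness one.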
Comment 7 Hans-Peter Nilsson 2024-02-17 00:39:41 UTC
(In reply to Richard Biener from comment #6)
> The auto-inc pass is well
> structured, so it should be possible to extend it.
Or just replace it, as it doesn't look far enough to be able to handle all incdec-opportunities.