Even a loop as simple as this:

void f(const long* src, long* dst, int count)
{
    for (int i = 0; i < count; i++)
    {
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
        *dst++ = *src++;
    }
}

is compiled to:

#NO_APP
        .file   "test.c"
        .text
        .align  2
        .globl  f
        .type   f, @function
f:
        move.l 4(%sp),%a0
        move.l 8(%sp),%a1
        move.l 12(%sp),%d1
        jle .L1
        clr.l %d0
.L3:
        move.l (%a0),(%a1)
        move.l 4(%a0),4(%a1)
        move.l 8(%a0),8(%a1)
        move.l 12(%a0),12(%a1)
        move.l 16(%a0),16(%a1)
        move.l 20(%a0),20(%a1)
        move.l 24(%a0),24(%a1)
        move.l 28(%a0),28(%a1)
        move.l 32(%a0),32(%a1)
        move.l 36(%a0),36(%a1)
        move.l 40(%a0),40(%a1)
        move.l 44(%a0),44(%a1)
        move.l 48(%a0),48(%a1)
        move.l 52(%a0),52(%a1)
        move.l 56(%a0),56(%a1)
        add.w #64,%a0
        add.w #64,%a1
        move.l -4(%a0),-4(%a1)
        addq.l #1,%d0
        cmp.l %d1,%d0
        jne .L3
.L1:
        rts
        .size   f, .-f
        .ident  "GCC: (GNU) 13.2.0"

This has been like this for ages: gcc 4.6.4, gcc 7.2.0 and lately gcc 13.2.0 ... the last gcc where it was reported to transform into

        move.l (a0)+,(a1)+

was gcc 2.95 and gcc 3.x.

So what's the catch here? Why does gcc hate move.l (ax)+,(ay)+ so much?

Tested with m68k-elf-gcc -O2 -fomit-frame-pointer -m68020-60.
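For comparison, this is roughly the inner loop I would expect to see -- a hand-written sketch of the post-increment form the old compilers reportedly produced, not actual output from gcc 2.95 or any other version:

.L3:
        move.l (%a0)+,(%a1)+      | 2 bytes each, pointers advance
        move.l (%a0)+,(%a1)+      | as a side effect of the move
        move.l (%a0)+,(%a1)+
        move.l (%a0)+,(%a1)+      | ... twelve more identical moves ...
        addq.l #1,%d0             | loop control unchanged
        cmp.l %d1,%d0
        jne .L3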
> So what's the catch here? Why does gcc hate move.l (ax)+,(ay)+ so much?

At one point in time (before GCC 8 or 9 or so, I think), GCC's IV-OPTs optimization did not take post/pre increment into account, but now it does. BUT if the target cost model does not take those into account, then IV-OPTs could decide not to use them. Now m68k is a target which not many GCC developers look at fixing, so it is up to someone to look into why the post increment is no longer being used.
I don't think IVOPTs would use postinc for the intermediate increments. It's constant propagation/forwarding that accumulates the increments into a constant offset, which removes dependences between the instructions and thus would allow the loads/stores to be executed in parallel (well, not that m68k uarchs can likely do any of that ...). I wonder if the code we emit is measurably slower though? It's possibly a little bit larger due to the two IV increments.
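To illustrate the shape of that transformation, here is a hand-written C sketch of what the forwarding effectively produces (shortened to four copies per iteration for brevity; the real loop has sixteen): every access becomes base plus constant offset, and only one increment per pointer survives.

/* Sketch of the forwarded form: the copies are independent of each
   other, and the per-copy pointer bumps collapse into one adjustment
   per pointer per iteration. */
void f_forwarded(const long *src, long *dst, int count)
{
    for (int i = 0; i < count; i++) {
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        src += 4;   /* single IV increment for src */
        dst += 4;   /* single IV increment for dst */
    }
}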
> I wonder if the code we emit is measurably slower though? It's possibly a
> little bit larger due to the two IV increments.

It's definitely slower, as both offsets next to the An registers each generate a separate extension word. So instead of the 2-byte instruction "move.l (a0)+,(a1)+" we get a 6-byte instruction "move.l off(a0),off(a1)", and that hurts a lot even on the 68060, not to mention the poor 68000.
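A rough back-of-the-envelope tally for the loop body above -- my own arithmetic, assuming the standard 68000 encodings (2 bytes for move.l (An)+,(An)+, one extra 16-bit extension word per d16(An) operand, 4 bytes for the adda.w #imm,An that "add.w #64,%aX" assembles to):

  offset form:    1 x move.l (a0),(a1)        =   2 bytes
                 14 x move.l d16(a0),d16(a1)  =  84 bytes
                  2 x add.w #64,An            =   8 bytes
                  1 x move.l -4(a0),-4(a1)    =   6 bytes
                                          total: 100 bytes
  post-inc form: 16 x move.l (a0)+,(a1)+      =  32 bytes, no separate adds

plus the same loop-control instructions in both cases.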
I'm not sure this is an m68k bug. I tried several targets that have auto-increment addressing modes (m68k, pdp11, msp430, vax, aarch64) and none of them would use auto-increment for this test case.
I have been told that one of the reasons why post-incrementing modes are not supported / preferred these days is that they stall the CPU pipeline (of course, totally not applicable on m68k). So with the offsets you can execute the moves in parallel, while with post-increment each use of a1 has to wait for the previous instruction to finish updating it. So I can understand why this was changed, but it definitely shouldn't be a change that affects all possible CPUs.
It's already visible with a simple

void f(const long* src, long* dst)
{
    *dst++ = *src++;
    *dst = *src;
}

where we expand to RTL from

  _1 = *src_3(D);
  *dst_4(D) = _1;
  _2 = MEM[(const long int *)src_3(D) + 4B];
  MEM[(long int *)dst_4(D) + 4B] = _2;

There's nothing on GIMPLE that would split the add, and RTL's auto-inc-dec pass doesn't do anything either. We'd need a form of "strength-reduction", or maybe targets preferring auto-inc/dec should not legitimize constant offsets before reload ...

Note that with one more copy you then see

  _1 = *src_4(D);
  *dst_5(D) = _1;
  _2 = MEM[(const long int *)src_4(D) + 4B];
  MEM[(long int *)dst_5(D) + 4B] = _2;
  _3 = MEM[(const long int *)src_4(D) + 8B];
  MEM[(long int *)dst_5(D) + 8B] = _3;

and naively splitting gives you

  src_6 = src_4(D) + 4;
  src_7 = src_4(D) + 8;

That said, it's really something for RTL since it's going to be highly target dependent which form is more efficient. The auto-inc pass is well structured, so it should be possible to extend it.
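To make the splitting point concrete, here is a hand-written sketch (made-up SSA names, not actual dump output) of the two shapes the pointer arithmetic can take on the load side; the dst side behaves the same way. The auto-inc machinery wants the chained form, where each address is the previous address plus the access size; naive splitting instead produces independent base-plus-constant computations:

  ;; chained form -- each increment feeds the next access, so the RTL
  ;; auto-inc-dec pass could combine it into (An)+:
  _1 = *src_4(D);
  src_6 = src_4(D) + 4;
  _2 = *src_6;
  src_7 = src_6 + 4;
  _3 = *src_7;

  ;; what naive splitting would produce -- both new pointers are
  ;; computed from the common base rather than from each other, so
  ;; there is no simple chain of increments to turn into (An)+:
  _1 = *src_4(D);
  src_6 = src_4(D) + 4;
  src_7 = src_4(D) + 8;
  _2 = *src_6;
  _3 = *src_7;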
(In reply to Richard Biener from comment #6)
> The auto-inc pass is well structured, so it should be possible to extend it.

Or just replace it, as it doesn't look far enough to be able to handle all inc/dec opportunities.