Post-increment addressing is generated only for the first memory access.  Any
subsequent memory access does not use post-increment addressing.  The following
two functions are reduced examples and result in the same code being generated.
The problem exists with any number of memory accesses > 1 and at any
optimization level.

int test_0 (char* p, int c)
{
  int r = 0;
  r += *p++;
  r += *p++;
  r += *p++;
  return r;
}

int test_1 (char* p, int c)
{
  int r = 0;
  r += p[0];
  r += p[1];
  r += p[2];
  return r;
}

compiled with -fno-ivopts -Os -m4-single -ml:

...
        mov     r4,r1
        mov.b   @r1+,r0
        add     #2,r4
        mov.b   @r1,r1
        add     r1,r0
        mov.b   @r4,r1
        rts
        add     r1,r0

This could be done better:

...
        mov.b   @r4+,r0
        mov.b   @r4+,r1
        add     r1,r0
        mov.b   @r4+,r1
        rts
        add     r1,r0

Another example with a loop:

int test_func_1 (char* p, int c)
{
  int r = 0;
  do
    {
      r += *p++;
      r += *p++;
    } while (--c);
  return r;
}

compiled with -fno-ivopts -Os -m4-single -ml:

...
        mov     #0,r0
.L5:
        mov     r4,r1
        mov.b   @r1+,r2
        dt      r5
        mov.b   @r1,r1
        add     r2,r0
        add     #2,r4
        bf/s    .L5
        add     r1,r0
        rts
        nop

would be better as:

        mov     #0,r0
.L5:
        mov.b   @r4+,r1
        dt      r5
        mov.b   @r4+,r2
        add     r1,r0
        bf/s    .L5
        add     r2,r0
        rts
        nop

Using built-in specs.
COLLECT_GCC=sh-elf-gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/sh-elf/4.7.0/lto-wrapper
Target: sh-elf
Configured with: ../gcc-trunk/configure --target=sh-elf --prefix=/usr/local
--enable-languages=c,c++ --enable-multilib --disable-libssp --disable-nls
--disable-werror --enable-lto --with-newlib --with-gnu-as --with-gnu-ld
--with-system-zlib
Thread model: single
gcc version 4.7.0 20111016 (experimental) (GCC)
GCC turns ordinary memory accesses into post_inc/pre_dec ones in the
auto_inc_dec pass.  I guess that the auto_inc_dec pass can't find the post_inc
insns in this case because other tree/rtl optimizers have already tweaked the
code.  If this is the case, the problem would not be target specific.
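For reference, a rough hand-written sketch (not actual dump output) of the
transformation that pass performs: a plain load followed by a separate
increment of the address register, e.g.

  (set (reg:QI 100) (mem:QI (reg:SI 101)))
  (set (reg:SI 101) (plus:SI (reg:SI 101) (const_int 1)))

is supposed to be combined into a single post-increment access:

  (set (reg:QI 100) (mem:QI (post_inc:SI (reg:SI 101))))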
*** Bug 50750 has been marked as a duplicate of this bug. ***
Kaz, do you happen to know why the following is defined in sh.h?

#define USE_LOAD_POST_INCREMENT(mode)  ((mode == SImode || mode == DImode) \
                                        ? 0 : TARGET_SH1)
#define USE_STORE_PRE_DECREMENT(mode)  ((mode == SImode || mode == DImode) \
                                        ? 0 : TARGET_SH1)

Is there any (historical) reason to disable post-inc / pre-dec addressing for
SImode / DImode?

Thanks,
Oleg
(In reply to comment #3)

USE_LOAD_POST_INCREMENT and USE_STORE_PRE_DECREMENT are used only in
move_by_pieces, which handles some block operations when MOVE_BY_PIECES_P says
OK.  They don't disable post_inc/pre_dec addressing for SI/DImode in general,
I think.  It seems that they are 0 for SI/DImode because we have displacement
addressing for a limited size of memory chunk in these modes, though I may be
wrong about that.  I'm a bit curious to see what happens if they are changed
to non-zero for SI/DImode.
(In reply to comment #4)
> I'm a bit curious to see what happens if they are changed
> to non-zero for SI/DImode.

...not much, actually.  I've tried defining both as TARGET_SH1 and compared
before/after CSiBE results.  No code size changes at all.  Also the micro test
cases do not seem to be affected by that.  Of course I might have missed some
border case...
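For reference, the experiment mentioned above amounted to something like this
in sh.h (a sketch of what was tried, not a proposed patch):

#define USE_LOAD_POST_INCREMENT(mode)   TARGET_SH1
#define USE_STORE_PRE_DECREMENT(mode)   TARGET_SH1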
(In reply to comment #1)
> GCC turns ordinary memory accesses into post_inc/pre_dec ones in the
> auto_inc_dec pass.  I guess that the auto_inc_dec pass can't find the
> post_inc insns in this case because other tree/rtl optimizers have
> already tweaked the code.  If this is the case, the problem would not
> be target specific.

Looks like it.  For the simple test case...

int test (char* p, int c)
{
  int r = 0;
  r += *p++;
  r += *p++;
  r += *p++;
  return r;
}

...the load insns are initially expanded as:

(insn 2 5 3 2 (set (reg/v/f:SI 169 [ p ])
        (reg:SI 4 r4 [ p ])) (nil))
(insn 7 6 8 3 (set (reg:SI 171)
        (sign_extend:SI (mem:QI (reg/v/f:SI 169 [ p ]) [0 *p_2(D)+0 S1 A8]))) (nil))
(insn 8 7 9 3 (set (reg:SI 172)
        (reg/v/f:SI 169 [ p ])) (nil))
(insn 9 8 10 3 (set (reg/f:SI 173)
        (plus:SI (reg/v/f:SI 169 [ p ])
            (const_int 1 [0x1]))) (nil))
(insn 10 9 11 3 (set (reg:SI 174)
        (sign_extend:SI (mem:QI (reg/f:SI 173) [0 MEM[(char *)p_2(D) + 1B]+0 S1 A8]))) (nil))
(insn 13 12 14 3 (set (reg/f:SI 177)
        (plus:SI (reg/v/f:SI 169 [ p ])
            (const_int 2 [0x2]))) (nil))
(insn 14 13 15 3 (set (reg:SI 178)
        (sign_extend:SI (mem:QI (reg/f:SI 177) [0 MEM[(char *)p_2(D) + 2B]+0 S1 A8]))) (nil))

The auto_inc_dec pass then fails to realize that
(reg:SI 177) = (reg:SI 169) + 2 = (reg:SI 173) + 1.

I wonder whether there might be something in the target code that leads the
early optimizers to do that?  I've tried playing with the TARGET_ADDRESS_COST
hook, but it didn't have any effect in this case.
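For comparison, a hand-written sketch (not actual dump output) of the chained
post-increment form that would correspond to the optimal sequence in the
description:

(insn 7 ... (set (reg:SI 171)
        (sign_extend:SI (mem:QI (post_inc:SI (reg/v/f:SI 169))))))
(insn 10 ... (set (reg:SI 174)
        (sign_extend:SI (mem:QI (post_inc:SI (reg/v/f:SI 169))))))
(insn 14 ... (set (reg:SI 178)
        (sign_extend:SI (mem:QI (post_inc:SI (reg/v/f:SI 169))))))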
(In reply to comment #6)
> I wonder whether there might be something in the target code that leads
> the early optimizers to do that?  I've tried playing with the
> TARGET_ADDRESS_COST hook, but it didn't have any effect in this case.

-fdump-tree-all shows that forward propagation on ssa trees turns those memory
accesses into simple array accesses.  You can try -fno-tree-forwprop and see
the effect of that option.  It seems that there are no special knobs to
control forwprop from the target side.

The problem is that the SH target can't do those simple array accesses well in
QI/HImode because of the lack of displacement addressing for those modes.
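For anyone trying to reproduce this, the dumps can be compared with something
like the following (flags as in the original report; 'test.c' is assumed to
contain the reduced example):

sh-elf-gcc -Os -m4-single -ml -fdump-tree-all -c test.c
sh-elf-gcc -Os -m4-single -ml -fno-tree-forwprop -c test.c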
(In reply to comment #7)
> The problem is that the SH target can't do those simple array accesses
> well in QI/HImode because of the lack of displacement addressing for
> those modes.

In these particular cases, this is true.  Even if there were working QI/HImode
displacement addressing, it would most likely result in worse code compared to
the post-inc / pre-dec addressing opportunities, because of register pressure
on R0.  Apart from that, there is no displacement addressing for FPU
loads/stores, which also misses some opportunities:

float test_func_5 (float* p, int c)
{
  float r = 0;
  do
    {
      r += *p++;
      r += *p++;
      r += *p++;
    } while (--c);
  return r;
}

Compiled with -Os -m4-single:

        fldi0   fr0
.L11:
        mov     r4,r1
        fmov.s  @r1+,fr1
        dt      r5
        fadd    fr1,fr0
        fmov.s  @r1,fr1
        mov     r4,r1
        add     #8,r1
        add     #12,r4
        fadd    fr1,fr0
        fmov.s  @r1,fr1
        bf/s    .L11
        fadd    fr1,fr0
        rts
        nop

...could be:

        fldi0   fr0
.L11:
        fmov.s  @r4+,fr1
        dt      r5
        fadd    fr1,fr0
        fmov.s  @r4+,fr1
        fadd    fr1,fr0
        fmov.s  @r4+,fr1
        bf/s    .L11
        fadd    fr1,fr0
        rts
        nop

Specifying -fno-tree-forwprop doesn't seem to have any effect on these cases.
(In reply to comment #8)
> Specifying -fno-tree-forwprop doesn't seem to have any effect on these
> cases.

For that function, -fdump-tree-all shows that the tree loop ivopts
optimization does it.  Try -fno-tree-forwprop -fno-ivopts.
(In reply to comment #9)
> (In reply to comment #8)
> > Specifying -fno-tree-forwprop doesn't seem to have any effect on these
> > cases.
>
> For that function, -fdump-tree-all shows that the tree loop ivopts
> optimization does it.  Try -fno-tree-forwprop -fno-ivopts.

This makes the code even worse.  Anyway, I think this issue should be
addressed by the auto_inc_dec pass.  If OK, I'd like to change it from a
target PR to a middle-end PR.
(In reply to comment #10)
> If OK, I'd like to change it from a target PR to a middle-end PR.

Sure.
Author: olegendo
Date: Tue Jun 12 07:09:52 2012
New Revision: 188426

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=188426
Log:
	PR target/50749
	* gcc.target/sh/pr50749-sf-postinc-2.c: New.
	* gcc.target/sh/pr50749-sf-postinc-4.c: New.
	* gcc.target/sh/pr50749-qihisi-postinc-2.c: New.
	* gcc.target/sh/pr50749-qihisi-postinc-4.c: New.
	* gcc.target/sh/pr50749-sf-predec-2.c: New.
	* gcc.target/sh/pr50749-sf-predec-4.c: New.
	* gcc.target/sh/pr50749-qihisi-predec-1.c: New.
	* gcc.target/sh/pr50749-qihisi-predec-3.c: New.
	* gcc.target/sh/pr50749-sf-postinc-1.c: New.
	* gcc.target/sh/pr50749-sf-postinc-3.c: New.
	* gcc.target/sh/pr50749-qihisi-postinc-1.c: New.
	* gcc.target/sh/pr50749-qihisi-postinc-3.c: New.
	* gcc.target/sh/pr50749-sf-predec-1.c: New.
	* gcc.target/sh/pr50749-sf-predec-3.c: New.
	* gcc.target/sh/pr50749-qihisi-predec-2.c: New.
	* gcc.target/sh/pr50749-qihisi-predec-4.c: New.

Added:
    trunk/gcc/testsuite/gcc.target/sh/pr50749-qihisi-postinc-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-qihisi-postinc-2.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-qihisi-postinc-3.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-qihisi-postinc-4.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-qihisi-predec-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-qihisi-predec-2.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-qihisi-predec-3.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-qihisi-predec-4.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-sf-postinc-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-sf-postinc-2.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-sf-postinc-3.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-sf-postinc-4.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-sf-predec-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-sf-predec-2.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-sf-predec-3.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-sf-predec-4.c
Modified:
    trunk/gcc/testsuite/ChangeLog
PR 42612 describes a similar problem on ARM.
Just wanted to clarify the reason why the examples in the description have
'-fno-ivopts', as it caused some confusion on the mailing list:

int test_0 (char* p, int c)
{
  int r = 0;
  r += *p++;
  r += *p++;
  r += *p++;
  return r;
}

compiled with -m4 -O2:

        mov.b   @(1,r4),r0
        mov.b   @r4,r1
        add     r0,r1
        mov.b   @(2,r4),r0
        rts
        add     r1,r0

compiled with -m4 -Os:

        mov.b   @(1,r4),r0
        mov.b   @r4,r2
        add     r0,r2
        mov.b   @(2,r4),r0
        add     r0,r2
        rts
        mov     r2,r0

As seen above, the memory accesses are turned into reg+disp addressing,
although on SH post-inc addressing would be cheaper (as also reported by
'sh_address_cost').  Turning off ivopts leaves the post-inc addressing in
place and exposes the original problem of this PR.  I believe that the proper
fix for this would be having PR 56590 fixed.
There must be another scenario for the example; in this case the example:

int test_0 (char* p, int c)
{
  int r = 0;
  r += *p++;
  r += *p++;
  r += *p++;
  return r;
}

should be translated into something like:

//...
ldrb [rx]
ldrb [rx+1]
ldrb [rx+2]
add rx, rx, #3
//...

This way all loads are independent and can be issued on a super scalar
machine.  Actually, for targets like arm which support a post-increment
constant (other than the size of the memory access), it can be further changed
into:

//...
ldrb [rx], #3
ldrb [rx-2]
ldrb [rx-1]
//...

For now the auto-increment pass can't do this optimization.  I once had a
patch for this, but benchmarks showed the case is not common.  The case is
common especially after loop unrolling, and rtl passes deliberately break down
the long dependence chain on RX, which I think is right.
(In reply to bin.cheng from comment #15)
> There must be another scenario for the example; in this case the example:
>
> int test_0 (char* p, int c)
> {
>   int r = 0;
>   r += *p++;
>   r += *p++;
>   r += *p++;
>   return r;
> }
>
> should be translated into something like:
> //...
> ldrb [rx]
> ldrb [rx+1]
> ldrb [rx+2]
> add rx, rx, #3
> //...

As mentioned above, on SH this is the case (the displacement addressing mode
is selected).  However, the order of the memory accesses is not the same as in
the original source code (which is OK for non-volatile mems).

> This way all loads are independent and can be issued on a super scalar
> machine.  Actually, for targets like arm which support a post-increment
> constant (other than the size of the memory access), it can be further
> changed into:
> //...
> ldrb [rx], #3
> ldrb [rx-2]
> ldrb [rx-1]
> //...

Whether this transformation is beneficial or not depends on the target
architecture, of course.  E.g. SH2A and SH4* are 2-way super scalar, but they
have only one memory load/store unit.  Thus the loads would not be done in
parallel anyway, and the latency of the post-incremented address register can
be neglected.

There is a similar case on SH regarding floating point loads.  SH ISAs (other
than SH2A) don't have a displacement addressing mode for floating point
loads/stores.  When loading adjacent memory locations it's best to use
post-inc addressing modes, and when storing adjacent memory locations it's
best to use pre-dec stores.  I.e. things like

float* x = ...;
*x++ = a;
*x++ = b;
*x++ = c;

should become:

        add     #8,r1
        fmov.s  fr0,@r1    // store c
        fmov.s  fr1,@-r1   // store b
        fmov.s  fr2,@-r1   // store a

> For now the auto-increment pass can't do this optimization.  I once had
> a patch for this, but benchmarks showed the case is not common.

To be honest, I think such optimizations as mentioned above are out of scope
for the auto-increment pass.  Of course we can try to wallpaper over its
shortcomings and get some improvements here and there, but as soon as ivopts
or another tree pass is changed and outputs different sequences it will break
again.  Thus I suggested replacing the auto-inc-dec pass with a more generic
addressing mode selection RTL pass (PR 56590).  Unfortunately, I don't have
anything useful yet.  But I could share some notes and thoughts regarding AMS
optimization, if you're interested.
*** Bug 59078 has been marked as a duplicate of this bug. ***
*** Bug 19078 has been marked as a duplicate of this bug. ***
There is a GSoC 2015 project which will try to address the AMS problem. https://www.google-melange.com/gsoc/project/details/google/gsoc2015/erikvarga/5693417237512192 It will be initially for SH. If it works out, it can be generalized so that other targets can benefit from it, too.
I'll add a note here that this seems to also affect the AVR target when it is
under register pressure and can't use the Z or Y registers (which can do
Z+n / Y+n displacement addressing); it ends up doing really stupid things with
the X register (aka r26:r27), which could post-inc:

adiw r26, 0x01
ld   r24, X
sbiw r26, 0x01

(followed by adiw r26, 2 ...)

This was my test case for showing this behaviour on AVR:

unsigned char test_0 (unsigned char* p1, unsigned char *p2, unsigned char *p3)
{
  unsigned char r = 0;
  r += *p1++;
  r += *p2++;
  r += *p3++;
  r += *p1++;
  r += *p2++;
  r += *p3++;
  r += *p1++;
  r += *p2++;
  r += *p3++;
  return r;
}

This note is added for anyone later trying to google for this after seeing
that X reg stupidity on AVR; sorry for the maybe-a-bit-noise...
(In reply to Urja Rannikko from comment #20)
> I'll add a note here that this seems to also affect the AVR target when
> it is under register pressure and can't use the Z or Y registers (which
> can do Z+n / Y+n displacement addressing); it ends up doing really
> stupid things with the X register
> ...

I don't know AVR, but if it has only two address regs which can do post-inc
loads, in your example a loop distribution is required.  In other words, you
need to use two loops: a first one with p1, p2 and a second one with p3.
Because the calculation in the loop is simple, it would make sense; if the
calculations in the loop were more complex, maybe not.  It's similar to the
problems in PR 53949.  A minimal sketch of this follows below.
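Here is a minimal hand-written sketch of that distribution idea, expressing
the straight-line accesses from the test case as two separate loops (an
illustration only, not tested on AVR; the result is the same because unsigned
char addition is commutative):

unsigned char test_0 (unsigned char* p1, unsigned char* p2, unsigned char* p3)
{
  unsigned char r = 0;

  /* First loop: only p1 and p2 are live, so both can sit in
     post-inc capable address regs.  */
  for (unsigned char i = 0; i < 3; i++)
    {
      r += *p1++;
      r += *p2++;
    }

  /* Second loop: p3 alone.  */
  for (unsigned char i = 0; i < 3; i++)
    r += *p3++;

  return r;
}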
All of the 3 pointer regs (X, Y, Z) can do post-inc operations, but only Y and
Z can do displacement (which is what gcc tries to use, even when the increment
operation would be better all around, since we usually want the +4 pointer).
The given code wasn't meant to be anywhere near the real code, just code that
shows that the behaviour exists.  The real code is just a function with lots
going on involving loading 32-bit values from pointers.  This code is WIP and
I was just looking at how it'll end up compiled, but for your curiosity,
here's a fragment (not enough to compile), with a comment added on the part
where I found the ld, adiw, ld, sbiw, adiw, ld, sbiw, adiw, ... sequence, but
AFAIK it could be triggered at any usage of the inlined buf2u32:

static uint32_t buf2u32(uint8_t *buf) {
	return *(uint32_t*)buf;
}

static uint8_t do_exec_opbuf_extwrite(uint8_t *ewsq, uint32_t addr,
                                      uint8_t *buf, uint16_t dlen) {
	uint8_t* end = ewsq + ewsq[0] + 1;
	for (uint16_t i=0;i<dlen;i++,addr++) {
		uint8_t c = *buf++;
		uint8_t *cmd;
		for (cmd=ewsq+1;cmd<end;) {
			uint8_t op = *cmd++;
			uint8_t rop = op&0x3F;
			if (rop==S_CMD_O_WRITEB) {
				uint32_t ra;
				uint8_t rc;
				if (op & 0x80) {
					ra = addr;
				} else {
					ra = buf2u32(cmd);
					cmd += 4;
				}
				if (op & 0x40) {
					rc = c;
				} else {
					rc = *cmd++;
				}
				flash_write(ra,rc);
				continue;
			}
			if (rop==S_CMD_O_DELAY) {
				udelay(buf2u32(cmd));
				cmd += 4;
				continue;
			}
			if ((rop==S_CMD_O_POLL)||(rop==S_CMD_O_POLL_DLY)) {
				uint8_t details = *cmd++;
				uint8_t mask = pgm_read_byte( &(bit_mask_lut[details&7]) );
				uint32_t ra = addr;
				uint32_t usecs = 0;
				if ((op & 0x40)&&(!(details & 0x10))) {
					if (c & mask) {
						details |= 0x20;
					} else {
						details &= ~0x20;
					}
				}
				if (!(op & 0x80)) {
					ra = buf2u32(cmd); // Problem here
					cmd += 4;
				}
				if (rop == S_CMD_O_POLL_DLY) {
					usecs = buf2u32(cmd);
					cmd += 4;
				}
				do_exec_poll(details,mask,ra,usecs);
				continue;
			}
			return 1;
		}
		if (cmd>end) return 1;
	}
	return 0;
}

If I end up liking this code enough, it'll end up in a branch at
https://github.com/urjaman/libfrser
Thanks for the interesting test/use case.