Bug 50749 - Auto-inc-dec does not find subsequent contiguous mem accesses
Summary: Auto-inc-dec does not find subsequent contiguous mem accesses
Status: UNCONFIRMED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization
Version: 4.7.0
Importance: P3 enhancement
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Duplicates: 19078 50750 59078
Depends on: 56590
Blocks: 24815
Reported: 2011-10-16 20:28 UTC by Oleg Endo
Modified: 2023-11-30 00:46 UTC (History)
6 users

See Also:
Host:
Target: sh*-*-* arm*-*-* cris-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed:


Description Oleg Endo 2011-10-16 20:28:50 UTC
Post-increment addressing is generated only for the first
memory access. Any subsequent memory access does not use 
post-increment addressing.
The following two functions are reduced examples and result
in the same code being generated.
The problem exists with any number of memory accesses > 1
and at any optimization level.


int test_0 (char* p, int c)
{
  int r = 0;
  r += *p++;
  r += *p++;
  r += *p++;
  return r;
}

int test_1 (char* p, int c)
{
  int r = 0;
  r += p[0];
  r += p[1];
  r += p[2];
  return r;
}

compiled with -fno-ivopts -Os -m4-single -ml ... 

	mov	r4,r1
	mov.b	@r1+,r0
	add	#2,r4
	mov.b	@r1,r1
	add	r1,r0
	mov.b	@r4,r1
	rts	
	add	r1,r0

This could be done better ...

	mov.b	@r4+,r0
	mov.b	@r4+,r1
	add	r1,r0
	mov.b	@r4+,r1
	rts	
	add	r1,r0



Another example with a loop:

int test_func_1 (char* p, int c)
{
  int r = 0;
  do
  {
    r += *p++;
    r += *p++;
  } while (--c);
  return r;
}

compiled with -fno-ivopts -Os -m4-single -ml ... 

	mov	#0,r0
.L5:
	mov	r4,r1
	mov.b	@r1+,r2
	dt	r5
	mov.b	@r1,r1
	add	r2,r0
	add	#2,r4
	bf/s	.L5
	add	r1,r0
	rts	
	nop

would be better as:

	mov	#0, r0
.L5:
	mov.b	@r4+, r1
	dt	r5
	mov.b	@r4+, r2
	add	r1, r0
	bf/s	.L5
	add	r2, r0

	rts
	nop




Using built-in specs.
COLLECT_GCC=sh-elf-gcc
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/sh-elf/4.7.0/lto-wrapper
Target: sh-elf
Configured with: ../gcc-trunk/configure --target=sh-elf --prefix=/usr/local --enable-languages=c,c++ --enable-multilib --disable-libssp --disable-nls --disable-werror --enable-lto --with-newlib --with-gnu-as --with-gnu-ld --with-system-zlib
Thread model: single
gcc version 4.7.0 20111016 (experimental) (GCC)
Comment 1 Kazumoto Kojima 2011-10-16 23:33:40 UTC
GCC turns ordinary mem accesses into those with post_inc/pre_dec in the
auto_inc_dec pass.  I guess that the auto_inc_dec pass can't find the
post_inc insns in that case because other tree/rtl optimizers have
already tweaked the code.  If this is the case, the problem would not
be target specific.
Comment 2 Kazumoto Kojima 2011-10-17 00:32:39 UTC
*** Bug 50750 has been marked as a duplicate of this bug. ***
Comment 3 Oleg Endo 2011-10-19 00:00:01 UTC
Kaz,

do you happen to know why the following is defined in sh.h?

#define USE_LOAD_POST_INCREMENT(mode)    ((mode == SImode || mode == DImode) \
                                           ? 0 : TARGET_SH1)

#define USE_STORE_PRE_DECREMENT(mode)    ((mode == SImode || mode == DImode) \
                                           ? 0 : TARGET_SH1)

Is there any (historical) reason to disable post-inc / pre-dec addressing
for SImode / DImode?

Thanks,
Oleg
Comment 4 Kazumoto Kojima 2011-10-19 21:36:56 UTC
(In reply to comment #3)

USE_LOAD_POST_INCREMENT and USE_STORE_PRE_DECREMENT are used only
in move_by_pieces, which handles some block operations when
MOVE_BY_PIECES_P says OK.  They don't disable post_inc/pre_dec
addressing for SI/DImode in general, I think.  It seems that they
are 0 for SI/DImode because we have displacement addressing for
a limited size of memory chunk in these modes, though I may be wrong
about it.  I'm a bit curious to see what happens if they are changed
to non-zero for SI/DImode.
Comment 5 Oleg Endo 2011-10-30 12:36:51 UTC
(In reply to comment #4)
> 
> I'm a bit curious to see what happens if they are changed
> to non-zero for SI/DImode.

...not much, actually.  I've tried defining both as TARGET_SH1 and compared
before/after CSiBE results.  No code size changes at all.  The micro test cases also do not seem to be affected by that.  Of course I might have missed some corner case...
Comment 6 Oleg Endo 2011-10-30 13:53:51 UTC
(In reply to comment #1)
> GCC turns ordinary mem accesses into those with post_inc/pre_dec in the
> auto_inc_dec pass.  I guess that the auto_inc_dec pass can't find the
> post_inc insns in that case because other tree/rtl optimizers have
> already tweaked the code.  If this is the case, the problem would not
> be target specific.

Looks like so.  For the simple test case...

int test (char* p, int c)
{
  int r = 0;
  r += *p++;
  r += *p++;
  r += *p++;		
  return r;
}

...the load insns are initially expanded as:

(insn 2 5 3 2 (set (reg/v/f:SI 169 [ p ])
        (reg:SI 4 r4 [ p ]))
     (nil))


(insn 7 6 8 3 (set (reg:SI 171)
        (sign_extend:SI (mem:QI (reg/v/f:SI 169 [ p ]) [0 *p_2(D)+0 S1 A8]))) 
     (nil))


(insn 8 7 9 3 (set (reg:SI 172)
        (reg/v/f:SI 169 [ p ]))
     (nil))


(insn 9 8 10 3 (set (reg/f:SI 173)
        (plus:SI (reg/v/f:SI 169 [ p ])
            (const_int 1 [0x1])))
     (nil))


(insn 10 9 11 3 (set (reg:SI 174)
        (sign_extend:SI (mem:QI (reg/f:SI 173) [0 MEM[(char *)p_2(D) + 1B]+0 S1 A8])))
     (nil))

(insn 13 12 14 3 (set (reg/f:SI 177)
        (plus:SI (reg/v/f:SI 169 [ p ])
            (const_int 2 [0x2])))
     (nil))

(insn 14 13 15 3 (set (reg:SI 178)
        (sign_extend:SI (mem:QI (reg/f:SI 177) [0 MEM[(char *)p_2(D) + 2B]+0 S1 A8])))
     (nil))


The auto_inc_dec pass then fails to realize that (reg:SI 177) = (reg:SI 169) + 2 = (reg:SI 173) + 1.

I wonder whether there might be something in the target code that makes the early optimizers do that?  I've tried playing with the TARGET_ADDRESS_COST hook, but it didn't have any effect in this case.
Comment 7 Kazumoto Kojima 2011-10-30 23:36:27 UTC
(In reply to comment #6)
> I wonder whether there might be something in the target code that makes the
> early optimizers do that?  I've tried playing with the TARGET_ADDRESS_COST
> hook, but it didn't have any effect in this case.

-fdump-tree-all shows that forward propagation on ssa trees turns
those memory accesses into simple array accesses.  You can try
-fno-tree-forwprop and see the effect of that option.
It seems that there are no special knobs to control forwprop from
the target side.
The problem is that the SH target can't do those simple array accesses
well in QI/HImode because of the lack of displacement addressing
for those modes.
Comment 8 Oleg Endo 2011-11-28 22:31:44 UTC
(In reply to comment #7)
> The problem is that the SH target can't do those simple array accesses
> well in QI/HImode because of the lack of displacement addressing
> for those modes.

In these particular cases, this is true.  Even if there were working QI/HImode displacement addressing, it would most likely result in worse code compared to post-inc / pre-dec addressing opportunities, because of register pressure on R0.  Apart from that, there is no displacement addressing for FPU loads/stores, which also misses some opportunities:

float test_func_5 (float* p, int c)
{
  float r = 0;
  do
  {
    r += *p++;
    r += *p++;
    r += *p++;
  } while (--c);
  return r;
}

Compiled with -Os -m4-single:

	fldi0	fr0
.L11:
	mov	r4,r1
	fmov.s	@r1+,fr1
	dt	r5
	fadd	fr1,fr0
	fmov.s	@r1,fr1
	mov	r4,r1
	add	#8,r1
	add	#12,r4
	fadd	fr1,fr0
	fmov.s	@r1,fr1
	bf/s	.L11
	fadd	fr1,fr0
	rts	
	nop


..could be:
	fldi0	fr0
.L11:
	fmov.s	@r4+,fr1
	dt	r5
	fadd	fr1,fr0
	fmov.s	@r4+,fr1
	fadd	fr1,fr0
	fmov.s	@r4+,fr1
	bf/s	.L11
	fadd	fr1,fr0
	rts	
	nop


Specifying -fno-tree-forwprop doesn't seem to have any effect on these cases.
Comment 9 Kazumoto Kojima 2011-11-28 23:29:57 UTC
(In reply to comment #8)
> Specifying -fno-tree-forwprop doesn't seem to have any effect on these cases.

For that function, -fdump-tree-all shows that the tree loop ivopts
optimization does it.  Try -fno-tree-forwprop -fno-ivopts.
Comment 10 Oleg Endo 2011-12-30 02:14:00 UTC
(In reply to comment #9)
> (In reply to comment #8)
> > Specifying -fno-tree-forwprop doesn't seem to have any effect on these cases.
> 
> For that function, -fdump-tree-all shows that the tree loop ivopts
> optimization does it.  Try -fno-tree-forwprop -fno-ivopts.

This makes the code even worse.
Anyway, I think this issue should be addressed by the auto_inc_dec pass.

If OK, I'd like to change it from target PR to middle-end PR.
Comment 11 Kazumoto Kojima 2011-12-30 03:24:01 UTC
(In reply to comment #10)
> If OK, I'd like to change it from target PR to middle-end PR.

Sure.
Comment 12 Oleg Endo 2012-06-12 07:09:58 UTC
Author: olegendo
Date: Tue Jun 12 07:09:52 2012
New Revision: 188426

URL: http://gcc.gnu.org/viewcvs?root=gcc&view=rev&rev=188426
Log:
	PR target/50749
	* gcc.target/sh/pr50749-sf-postinc-2.c: New.
	* gcc.target/sh/pr50749-sf-postinc-4.c: New.
	* gcc.target/sh/pr50749-qihisi-postinc-2.c: New.
	* gcc.target/sh/pr50749-qihisi-postinc-4.c: New.
	* gcc.target/sh/pr50749-sf-predec-2.c: New.
	* gcc.target/sh/pr50749-sf-predec-4.c: New.
	* gcc.target/sh/pr50749-qihisi-predec-1.c: New.
	* gcc.target/sh/pr50749-qihisi-predec-3.c: New.
	* gcc.target/sh/pr50749-sf-postinc-1.c: New.
	* gcc.target/sh/pr50749-sf-postinc-3.c: New.
	* gcc.target/sh/pr50749-qihisi-postinc-1.c: New.
	* gcc.target/sh/pr50749-qihisi-postinc-3.c: New.
	* gcc.target/sh/pr50749-sf-predec-1.c: New.
	* gcc.target/sh/pr50749-sf-predec-3.c: New.
	* gcc.target/sh/pr50749-qihisi-predec-2.c: New.
	* gcc.target/sh/pr50749-qihisi-predec-4.c: New.


Added:
    trunk/gcc/testsuite/gcc.target/sh/pr50749-qihisi-postinc-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-qihisi-postinc-2.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-qihisi-postinc-3.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-qihisi-postinc-4.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-qihisi-predec-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-qihisi-predec-2.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-qihisi-predec-3.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-qihisi-predec-4.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-sf-postinc-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-sf-postinc-2.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-sf-postinc-3.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-sf-postinc-4.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-sf-predec-1.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-sf-predec-2.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-sf-predec-3.c
    trunk/gcc/testsuite/gcc.target/sh/pr50749-sf-predec-4.c
Modified:
    trunk/gcc/testsuite/ChangeLog
Comment 13 Oleg Endo 2012-08-23 14:17:29 UTC
PR 42612 describes a similar problem on ARM.
Comment 14 Oleg Endo 2013-06-22 12:22:28 UTC
Just wanted to clarify the reason why the examples in the description have '-fno-ivopts', as it caused some confusion on the mailing list:

int test_0 (char* p, int c)
{
  int r = 0;
  r += *p++;
  r += *p++;
  r += *p++;
  return r;
}

compiled with -m4 -O2:

        mov.b   @(1,r4),r0
        mov.b   @r4,r1
        add     r0,r1
        mov.b   @(2,r4),r0
        rts
        add     r1,r0

compiled with -m4 -Os:

        mov.b   @(1,r4),r0
        mov.b   @r4,r2
        add     r0,r2
        mov.b   @(2,r4),r0
        add     r0,r2
        rts
        mov     r2,r0

As seen above, the memory accesses are turned into reg+disp addressing, although on SH using post-inc addressing would be cheaper (as is also reported by 'sh_address_cost').
Turning off ivopts leaves the post-inc addressing and exposes the original problem of this PR.  I believe that the proper fix for this would be having PR 56590 fixed.
Comment 15 bin.cheng 2013-09-30 09:06:52 UTC
There is another scenario for the example; in that case the example:

int test_0 (char* p, int c)
{
  int r = 0;
  r += *p++;
  r += *p++;
  r += *p++;
  return r;
}

should be translated into something like:
  //...
  ldrb [rx]
  ldrb [rx+1]
  ldrb [rx+2]
  add rx, rx, #3
  //...
This way all loads are independent and can be issued on a superscalar machine.  Actually, for targets like ARM which support a post-increment constant (other than the size of the memory access), it can be further changed into:
  //...
  ldrb [rx], #3
  ldrb [rx-2]
  ldrb [rx-1]
  //...
For now the auto-increment pass can't do this optimization.  I once had a patch for this, but benchmarks showed the case is not common.

This case is common especially after loop unrolling, and rtl passes deliberately break down the long dependence on RX, which I think is right.
Comment 16 Oleg Endo 2013-10-03 10:47:17 UTC
(In reply to bin.cheng from comment #15)
> There is another scenario for the example; in that case the example:
> 
> int test_0 (char* p, int c)
> {
>   int r = 0;
>   r += *p++;
>   r += *p++;
>   r += *p++;
>   return r;
> }
> 
> should be translated into something like:
>   //...
>   ldrb [rx]
>   ldrb [rx+1]
>   ldrb [rx+2]
>   add rx, rx, #3
>   //...

As mentioned above, on SH this is the case (displacement addressing mode is selected).  However, the order of the memory accesses is not the same as in the original source code (which is OK for non-volatile mems).

> This way all loads are independent and can be issued on a superscalar
> machine.  Actually, for targets like ARM which support a post-increment
> constant (other than the size of the memory access), it can be further changed into:
>   //...
>   ldrb [rx], #3
>   ldrb [rx-2]
>   ldrb [rx-1]
>   //...

Whether this transformation is beneficial or not depends on the target architecture, of course.  E.g. SH2A and SH4* are 2-way superscalar, but they have only one memory load/store unit.  Thus the loads would not be done in parallel anyway, and the latency of the post-incremented address register can be neglected.
There is a similar case on SH regarding floating point loads.  SH ISAs (other than SH2A) don't have a displacement addressing mode for floating point loads/stores.  When loading adjacent memory locations it's best to use post-inc addressing, and when storing adjacent memory locations it's best to use pre-dec stores.  I.e. things like

float* x = ...;
*x++ = a;
*x++ = b;
*x++ = c;

should become:

add      #8,r1
fmov.s   fr0,@r1  // store c
fmov.s   fr1,@-r1 // store b
fmov.s   fr2,@-r1 // store a


> For now the auto-increment pass can't do this optimization.  I once had a patch
> for this, but benchmarks showed the case is not common.

To be honest, I think such optimizations as mentioned above are out of the scope of the auto-increment pass.  Of course we can try to wallpaper over its shortcomings and get some improvements here and there, but as soon as ivopts or another tree pass is changed and outputs different sequences, it will break again.
Thus I suggested replacing the auto-inc-dec pass with a more generic addressing mode selection (AMS) RTL pass (PR 56590).  Unfortunately, I don't have anything useful yet.  But I could share some notes and thoughts regarding AMS optimization, if you're interested.
Comment 17 Oleg Endo 2013-12-06 20:28:13 UTC
*** Bug 59078 has been marked as a duplicate of this bug. ***
Comment 18 Steven Bosscher 2013-12-24 22:55:41 UTC
*** Bug 19078 has been marked as a duplicate of this bug. ***
Comment 19 Oleg Endo 2015-05-22 13:57:31 UTC
There is a GSoC 2015 project which will try to address the AMS problem.
https://www.google-melange.com/gsoc/project/details/google/gsoc2015/erikvarga/5693417237512192

It will be initially for SH.  If it works out, it can be generalized so that other targets can benefit from it, too.
Comment 20 Urja Rannikko 2015-09-17 08:59:53 UTC
I'll add a note here that this also seems to affect the AVR target: when it is under register pressure and can't use the Z or Y registers, which can do Z+n / Y+n addressing, it ends up doing really stupid things with the X register (aka r26:r27), which could post-inc:
adiw	r26, 0x01
ld	r24, X
sbiw	r26, 0x01
(followed by adiw r26, 2...)

This was my test case for showing this behaviour on AVR:
unsigned char test_0 (unsigned char* p1, unsigned char *p2, unsigned char *p3)
{
  unsigned char r = 0;
  r += *p1++;
  r += *p2++;
  r += *p3++;

  r += *p1++;
  r += *p2++;
  r += *p3++;

  r += *p1++;
  r += *p2++;
  r += *p3++;
  
  return r;
}

This note is added for anyone later trying to google for this after seeing that X-reg stupidity on AVR; sorry for the maybe-a-bit-of-noise ...
Comment 21 Oleg Endo 2015-09-17 14:26:33 UTC
(In reply to Urja Rannikko from comment #20)
> I'll add a note here that this seems to also affect the AVR target when it
> is under register pressure and cant use Z or Y registers which can do Z+n
> /Y+n addressing, it ends up doing really stupid things with X register 
> ...

I don't know AVR, but if it has only two addr regs which can do post-inc loads, your example would require loop distribution.  In other words, you would need to use two loops, the first with p1, p2 and the second with p3.  Because the calculation in the loop is simple, it would make sense.  If the calculations in the loop were more complex, maybe not.  It's similar to the problems in PR 53949.
Comment 22 Urja Rannikko 2015-09-17 19:45:00 UTC
All of the 3 pointer regs (X,Y,Z) can do post-inc operations, but only Y and Z can do displacement (which is what gcc tries to use, even when the increment operation would be better all around since we usually want the +4 pointer).

The given code wasn't meant to be anywhere near the real code, just code that shows the behaviour exists.  The real code is just a function with lots going on, involving loading 32-bit values from pointers.

This code is WIP and I was just looking at how it ends up compiled, but for your curiosity, here's a fragment (not enough to compile), with a comment added on the part where I found the ld, adiw, ld, sbiw, adiw, ld, sbiw, adiw, .. sequence, but AFAIK it could be triggered at any usage of the inlined buf2u32:

static uint32_t buf2u32(uint8_t *buf) {
        return *(uint32_t*)buf;
}

static uint8_t do_exec_opbuf_extwrite(uint8_t *ewsq, uint32_t addr, uint8_t *buf, uint16_t dlen) {
        uint8_t* end = ewsq + ewsq[0] + 1;
        for (uint16_t i=0;i<dlen;i++,addr++) {
                uint8_t c = *buf++;
                uint8_t *cmd;
                for (cmd=ewsq+1;cmd<end;) {
                        uint8_t op = *cmd++;
                        uint8_t rop = op&0x3F;
                        if (rop==S_CMD_O_WRITEB) {
                                uint32_t ra;
                                uint8_t rc;
                                if (op & 0x80) {
                                        ra = addr;
                                } else {
                                        ra = buf2u32(cmd);
                                        cmd += 4;
                                }
                                if (op & 0x40) {
                                        rc = c;
                                } else {
                                        rc = *cmd++;
                                }
                                flash_write(ra,rc);
                                continue;
                        }
                        if (rop==S_CMD_O_DELAY) {
                                udelay(buf2u32(cmd));
                                cmd += 4;
                                continue;
                        }
                        if ((rop==S_CMD_O_POLL)||(rop==S_CMD_O_POLL_DLY)) {
                                uint8_t details = *cmd++;
                                uint8_t mask = pgm_read_byte( &(bit_mask_lut[details&7]) );
                                uint32_t ra = addr;
                                uint32_t usecs = 0;
                                if ((op & 0x40)&&(!(details & 0x10))) { 
                                        if (c & mask) {
                                                details |= 0x20;
                                        } else {
                                                details &= ~0x20;
                                        }
                                }
                                if (!(op & 0x80)) {
                                        ra = buf2u32(cmd); // Problem Here
                                        cmd += 4;
                                }
                                if (rop == S_CMD_O_POLL_DLY) {
                                        usecs = buf2u32(cmd);
                                        cmd += 4;
                                }
                                do_exec_poll(details,mask,ra,usecs);
                                continue;
                        }
                        return 1;
                }
                if (cmd>end) return 1;
        }
        return 0;
}
If I end up liking this code enough, it'll end up in a branch at https://github.com/urjaman/libfrser
Comment 23 Oleg Endo 2015-09-18 14:02:26 UTC
Thanks for the interesting test/use case.