Bug 81611 - [8 Regression] gcc un-learned loop / post-increment optimization
Summary: [8 Regression] gcc un-learned loop / post-increment optimization
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 8.0
Importance: P3 normal
Target Milestone: 8.0
Assignee: Alexandre Oliva
URL:
Keywords: missed-optimization
Depends on:
Blocks:
 
Reported: 2017-07-29 12:16 UTC by Georg-Johann Lay
Modified: 2020-06-04 07:32 UTC
CC: 6 users

See Also:
Host:
Target: avr
Build:
Known to work: 4.7.3
Known to fail: 8.0
Last reconfirmed: 2017-07-30 00:00:00


Attachments
partial candidate patch (626 bytes, patch)
2017-12-26 23:42 UTC, Alexandre Oliva
complementary candidate patch (1.50 KB, patch)
2017-12-27 05:06 UTC, Alexandre Oliva
alternative (?) complementary candidate patch (252 bytes, patch)
2017-12-27 06:43 UTC, Alexandre Oliva
updated complementary candidate patch (1.52 KB, patch)
2017-12-27 07:11 UTC, Alexandre Oliva
2xupdated complementary candidate patch (1.54 KB, patch)
2017-12-27 07:58 UTC, Alexandre Oliva
another complement to the initial partial patch, this one improving auto-inc-dec (972 bytes, patch)
2018-01-03 23:02 UTC, Alexandre Oliva

Description Georg-Johann Lay 2017-07-29 12:16:01 UTC
C test case:

void func1 (unsigned char x, char *str)
{
    do {
        *str++ = '0' + (x & 1);
        x = x / 2;
    } while (x);
    *str = 0;
}

$ avr-gcc-8 foo.c -S -mmcu=atmega8 -O2

foo.s:

func1:
    movw r30,r22    ; ok, need the address in some
                    ;     address register (here Z=r30)
    rjmp .L2        ; ???
.L3:
    movw r30,r18     ; what the heck? Moving address back...
.L2:
    movw r18,r30     ; ...and forth to some fresh register, just to increment
    subi r18,-1      ; the address from the *PREVIOUS* loop
    sbci r19,-1      ; slow, bloat, increases reg pressure
    mov r25,r24      ; ok
    andi r25,lo8(1)  ; ok
    subi r25,lo8(-(48)) ; ok
    st Z,r25         ; Why not just "st Z+, r25" ???
    lsr r24          ; ok
    brne .L3
    std Z+1,__zero_reg__
    ret


Just for reference the code from 4.7:

* Uses 5 fewer instructions
* Occupies 2 fewer registers
* Loop consumes 4 fewer cycles (8 instead of 12).

func1:
    movw r30,r22
.L2:
    mov r25,r24
    andi r25,lo8(1)
    subi r25,lo8(-(48))
    st Z+,r25
    lsr r24
    brne .L2
    st Z,__zero_reg__
    ret

That's code I expect from a 3rd millennium compiler!! ;-)
Comment 1 Richard Biener 2017-07-31 07:20:56 UTC
Looks like a register allocation / SSA coalescing issue to me.  The IV is either not coalesced on the backedge or RA splits the live range.  Maybe IVOPTs related as well.

I suspect unless you do some more digging (bisection?) nothing will happen here ;)
Comment 2 Jeffrey A. Law 2017-12-20 17:34:33 UTC
So we get good code just prior to this:

Author: rguenth <rguenth@138bc75d-0d04-0410-961f-82ee72b054a4>
Date:   Tue Apr 21 12:52:43 2015 +0000

    2015-04-21  Richard Biener  <rguenther@suse.de>
    
            PR tree-optimization/65650
            * tree-ssa-ccp.c (valid_lattice_transition): Allow lattice
            transitions involving copies.
            (set_lattice_value): Adjust for copy lattice state.
[ ... ]

That change results in propagation of a copy (no surprise there).  This results in IVopts making some different choices.  Prior to the change the loop looks like this in the .optimized dump:

 # x_1 = PHI <x_4(D)(2), x_13(3)>
  # ivtmp.7_16 = PHI <ivtmp.7_7(2), ivtmp.7_15(3)>
  str_2 = (char *) ivtmp.7_16;
  _9 = x_1 & 1;
  _10 = _9 + 48;
  _11 = (char) _10;
  MEM[base: str_2, offset: 0B] = _11;
  ivtmp.7_15 = ivtmp.7_16 + 1;
  x_13 = x_1 >> 1;
  if (x_13 != 0)
    goto <bb 3>;
  else
    goto <bb 4>;

Intuitively we can see the relationship between STR and IVTMP and the likely post-inc opportunity at the memory reference and subsequent increment of IVTMP.

If we look at the loop after the referenced change we have:

  # x_1 = PHI <x_4(D)(2), x_13(3)>
  # str_2 = PHI <str_5(D)(2), str_8(3)>
  str_8 = str_2 + 1;
  _9 = x_1 & 1;
  _10 = _9 + 48;
  _11 = (char) _10;
  _16 = str_8 + 65535;
  MEM[base: _16, offset: 0B] = _11;
  x_13 = x_1 >> 1;
  if (x_13 != 0)
    goto <bb 3>;
  else
    goto <bb 4>;

So we no longer have the IV, just STR, and it's a lot harder to recover the auto-inc opportunity at the memory reference.  Anyway, that's the point where it looks to me like things start to go off the rails.


If we walk forward to the trunk today and look at the .expand dump we have:

  # x_5 = PHI <x_8(D)(2), x_15(3)>
  # str_6 = PHI <str_9(D)(2), str_17(3)>
  _1 = x_5 & 1;
  _2 = _1 + 48;
  str_11 = str_6 + 1;
  _4 = (char) _2;
  _16 = str_6;
  MEM[base: _16, offset: 0B] = _4;
  x_13 = x_5 >> 1;
  x_15 = x_13;
  str_17 = str_11;
  if (x_5 > 1)
    goto <bb 3>; [85.00%]
  else
    goto <bb 4>; [15.00%]

;;   basic block 4, loop depth 0
;;    pred:       3
  MEM[(char *)str_6 + 1B] = 0;
  return;


I think part of the problem is that we need str_6 and str_11 -- they have different values and conflict.  The two MEMs could potentially be rewritten in terms of str_11.  With the obvious copy-props we'd have something like this:

  # x_5 = PHI <x_8(D)(2), x_13(3)>
  # str_6 = PHI <str_9(D)(2), str_11(3)>
  _1 = x_5 & 1;
  _2 = _1 + 48;
  str_11 = str_6 + 1;
  _4 = (char) _2;
  MEM[base: str_11, offset: -1B] = _4;
  x_13 = x_5 >> 1;
  if (x_5 > 1)
    goto <bb 3>; [85.00%]
  else
    goto <bb 4>; [15.00%]

;;   basic block 4, loop depth 0
;;    pred:       3
  MEM[base: str_11, offset: 0B] = 0;
  return;


That ought to allow str_6 and str_11 to coalesce.  The question then becomes can we recover the auto-inc -- I'm not sure the auto-inc code is good enough to see it in that form.

Most importantly, while this BZ is filed against the AVR target, it seems to me to be clearly a generic issue.
Comment 3 Alexandre Oliva 2017-12-26 22:05:05 UTC
I see the same problem on x86_64.  Compiling with -fno-tree-forwprop stops str_6 from being propagated out of the loop, so it dies at the str_11 def and everything optimizes nicely.
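
For reference, a sketch of the comparison (standard flags; the dump file name may vary across versions):

$ gcc -O2 -S -fdump-tree-optimized foo.c                     # str_6 propagated out of the loop
$ gcc -O2 -S -fdump-tree-optimized -fno-tree-forwprop foo.c  # str_6 dies at the str_11 def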
Comment 4 Alexandre Oliva 2017-12-26 23:30:42 UTC
Mine; patch will be attached momentarily.
Comment 5 Alexandre Oliva 2017-12-26 23:42:51 UTC
Created attachment 42968 [details]
partial candidate patch

Alas, although it restores good code for x86_64 and arm, it doesn't go as far as enabling avr to use auto_inc (though auto_inc use is restored on arm).  Still looking...
Comment 6 Alexandre Oliva 2017-12-27 05:06:33 UTC
Created attachment 42969 [details]
complementary candidate patch

This patch complements the earlier one.  

On AVR, unlike other ports, we had the increment computed into a separate pseudo, then the memory store using the original pseudo, then the copy of the separate pseudo to the original one.  auto_inc_dec can't deal with such a sequence.  It could deal with it if the increment was after the memory access, which would have already eliminated the temporary and the copy.
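
In RTL terms, the problematic sequence looks roughly like this (a sketch; pseudo numbers are illustrative):

(set (reg 101) (plus (reg 100) (const_int 1)))  ;; increment computed into a separate pseudo
(set (mem (reg 100)) (reg 102))                 ;; memory store using the original pseudo
(set (reg 100) (reg 101))                       ;; copy of the separate pseudo to the original one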

So, instead of extending auto_inc_dec to deal with a more complex scenario involving 3 insns, I decided to try and move the increment down, past the memory access, so as to benefit even ports without auto_inc addressing modes by avoiding SSA conflicts and reducing the use of temporaries.

It moves IV increment stmts that are not used within their own blocks closer to the end of the block, so as to reduce the likelihood of a SSA conflict with the PHI node.  This also makes it more likely that the increment will appear after memory uses with which it can be combined into a post_inc addressing mode.
Comment 7 Alexandre Oliva 2017-12-27 05:32:42 UTC
Hmm, what the complementary patch won't do is improve the odds of auto_inc, or even of saving a temp, in spaghetti code rather than in loops.  Maybe that's important too?  I wonder if we should add the post-increment to the gimple stmt seq after the use of the old value, like IIRC we used to.
Comment 8 Alexandre Oliva 2017-12-27 06:43:01 UTC
Created attachment 42970 [details]
alternative (?) complementary candidate patch

This addresses the concern of post-increment in non-loops.  It solves the problem along with the initial partial patch, but I'm still undecided as to the second.  I already know it is broken (there's a thinko in the case that handles all uses in a single stmt), and once the post-inc auto_inc addressing modes are handled by this third patch, I'm not sure yet whether the SSA conflict reduction the second one affords is worth the trouble anymore.  I'll try to actually measure the effects, once I verify that it (or rather a fixed version thereof) works.
Comment 9 Alexandre Oliva 2017-12-27 07:11:33 UTC
Created attachment 42971 [details]
updated complementary candidate patch

This updated patch fixes a bug in the sinking patch, in the path that handled all uses in a single stmt.
Comment 10 Alexandre Oliva 2017-12-27 07:58:32 UTC
Created attachment 42972 [details]
2xupdated complementary candidate patch

It looked like the build was going so well, but...  No such luck.  Watch out for CPU-eating zombies with the previous patch; it may loop endlessly moving a stmt just before its successor over and over and over.  I suppose the old code really didn't mean to move stuff within a single BB, otherwise it would have checked this case.  Anyhow...  One more bug fixed, bootstrap well into stage3...
Comment 11 Alexandre Oliva 2017-12-27 22:54:41 UTC
Testing has revealed that the alternative complementary candidate patch introduces a number of regressions, in tests intended specifically to detect this kind of problem.  I don't see an easy way to delay the post-increment at gimplification time, so I'm dropping this approach altogether.  Back to exploring other possibilities, perhaps even extending auto_inc_dec, or...  something else, I don't know.
Comment 12 Jeffrey A. Law 2018-01-02 23:47:08 UTC
You know, I wonder if we're missing something bigger here.

ISTM we're potentially missing CSEs in memory addresses as well as forward propagation opportunities in MEM_REF expressions.

I strongly suspect DOM doesn't look inside a MEM_REF to see if the computed address is lying around in the hash table.

Similarly I wouldn't be surprised if forward propagation misses the opportunity to eliminate instructions that compute addresses by propagating the constant offset part into a MEM_REF.
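
A contrived illustration of both shapes (hypothetical example, not from this PR; bar is a placeholder):

extern void bar (char *);

void
foo (char *p)
{
  char *q = p + 4;   /* forwprop could fold this constant-offset address
                        computation into the MEM_REF below */
  *q = 0;
  bar (p + 4);       /* and DOM could notice that the address computed
                        inside that MEM_REF is already in its tables */
}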
Comment 13 Alexandre Oliva 2018-01-03 02:44:25 UTC
We do have such constant propagation on such ports as x86* and arm, but not on avr.  Presumably (I haven't checked) it has to do with the available addressing modes, and with gimple avoiding, even in MEM_REFs, address expressions that don't fit its more stringent requirements.
Comment 14 Jeffrey A. Law 2018-01-03 08:32:13 UTC
MEM_REF (as opposed to TARGET_MEM_REF) should be able to handle any simple SSA_NAME + CONSTANT_OFFSET, which is all we're really concerned with here.  The target's addressing modes don't really enter the picture (that's what TARGET_MEM_REF is for).

I'd look for an issue in the types preventing forwprop from doing its job here.
Comment 15 Alexandre Oliva 2018-01-03 20:29:19 UTC
When we create_mem_ref within ivopts, create_mem_ref_raw requires a valid_mem_ref_p, which through memory_address_addr_space_p calls targetm.addr_space.legitimate_address_p; that's avr_addr_space_legitimate_address_p, which calls avr_legitimate_address_p, which rejects negative offsets to a REG.  That's where avr loses, compared with x86 and arm.
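
The rejected form is, roughly (a sketch; the mode annotations are illustrative):

(mem:QI (plus:HI (reg:HI str) (const_int -1)))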

Because of this rejected address offset, we end up creating _16 = str_11 + 65535.  That eventually gets simplified to _16 = str_6 (in forwprop, no less), but the str_11 = str_6 + 1 stmt remains because of the use in the phi node, and it appears before the MEM_REF.

Now, the reason we validate the address is that create_mem_ref calls create_mem_ref_raw with verify enabled.  Even though the latter might end up creating a machine-independent MEM_REF, we end up constraining it according to machine-dependent addressing limitations anyway.
Comment 16 Alexandre Oliva 2018-01-03 20:58:12 UTC
Even if create_mem_ref_raw created a MEM_REF, we'd still allocate a new pseudo for the reg - 1 at cfgexpand, and that ends up preventing the post_inc addressing mode from being selected.

The more I think about it, the more I conclude we have to bite the bullet and support post_inc addressing modes in auto_inc_dec, even with increments saved on another pseudo before the memory uses.  Not just for iv uses, but also for linear code, mainly because the IR we generate for post-inc, all the way from the beginning, makes it more likely that the increment will be logically before the use of the old value as an address.  I was trying to avoid that, but at this point anything else feels like papering over the problem.

A smarter addressing-mode-aware middle end could tentatively generate off-by-1 base addresses so as to enable otherwise invalid addresses and subsequent post_inc transformations, but...  if the transformation doesn't take place, we may end up worse off.
Comment 17 Alexandre Oliva 2018-01-03 23:02:23 UTC
Created attachment 43025 [details]
another complement to the initial partial patch, this one improving auto-inc-dec

We already had code to turn the following into a post_inc:

(set (reg A) (plus (reg B) (const_int I)))

 ... (mem (plus (reg A) (const_int -I))) ...

so I've adjusted auto-inc-dec to also look for a (mem (reg B)) and deal with it in the same way.  It worked beautifully for this one testcase.  Let's see how a full bootstrap goes, on machines with auto-inc addressing modes...
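
For reference, the intended result, in the same informal notation (it matches the commit message in comment 24):

(set (reg A) (reg B))

 ... (mem (post_add (reg A) (const_int I))) ...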
Comment 18 Alexandre Oliva 2018-01-24 03:51:33 UTC
Vacations over, patches formatted and posted.
https://gcc.gnu.org/ml/gcc-patches/2018-01/msg01994.html
Comment 19 Georg-Johann Lay 2018-01-25 11:39:02 UTC
Hi, thanks for all that work and effort.

I tried that patch for the following small test:

extern void foo (void);

extern char volatile vv;

void func2 (const int *p)
{
    while (1)
    {
        int var = *p++;
        if (var == 10)
            return foo();
        if (var == 0)
            break;
    }
}

void func3 (const int *p, const __flash char *f)
{
    while (1)
    {
        int var = *p++;
        if (var == 10)
            return foo();
        vv = *f++;
        if (!vv)
            break;
    }
}


$ avr-gcc -Os -mmcu=avr5 inc.c -S -dp

Unfortunately, the code is still quite sub-optimal, in particular due to reg-reg moves all over the place, apart from missing post-inc opportunities.

For example, func3 compiles as follows:

func3:
.L7:
	movw r20,r24	 ;  37	[c=4 l=1]  *movhi/0
	subi r20,-2	 ;  9	[c=4 l=2]  addhi3_clobber/1
	sbci r21,-1
	movw r30,r24	 ;  38	[c=4 l=1]  *movhi/0
	ld r24,Z	 ;  10	[c=8 l=2]  *movhi/2
	ldd r25,Z+1
	sbiw r24,10	 ;  11	[c=12 l=1]  cmphi3/5
	brne .L6		 ;  12	[c=16 l=1]  branch
	jmp foo	 ;  14	[c=0 l=2]  call_insn/3
.L8:
	movw r22,r26	 ;  5	[c=4 l=1]  *movhi/0
	rjmp .L7		 ;  46	[c=4 l=1]  jump
.L6:
	movw r26,r22	 ;  39	[c=4 l=1]  *movhi/0
	adiw r26,1	 ;  18	[c=4 l=1]  addhi3_clobber/0
	movw r30,r22	 ;  40	[c=4 l=1]  *movhi/0
	lpm r24,Z	 ;  19	[c=4 l=1]  movqi_insn/3
	sts vv,r24	 ;  20	[c=4 l=2]  movqi_insn/2
	lds r18,vv	 ;  21	[c=4 l=2]  movqi_insn/3
	movw r24,r20	 ;  22	[c=4 l=1]  *movhi/0
	cpse r18,__zero_reg__	 ;  24	[c=0 l=1]  enable_interrupt-3
	rjmp .L8
	ret		 ;  43	[c=0 l=1]  return

In particular, moving values back and forth and bad register selection are a common and well-known annoyance (insns 37, 38, 5, 39, 40, 22).

Just to give an impression of optimal code, which would read something like:

func3:
    ;; Use Z=r30/31 for F.  LPM can only use indirect and
    ;; post-inc with Z.
    movw r30, r22
    ;; Use X=r26/27 for P.  X register can only use indirect and
    ;; post-inc addressing, which is fine for that purpose.
    movw r26, r24
.L7:
    ;; var = *p++
	ld r24,X+
	ld r25,X+
    ;;  var == 10 ?
	sbiw r24,10
	brne .L6
	jmp foo
.L6:
    ;; vv = *f++
	lpm r24,Z+
	sts vv,r24
    ;; if (!vv) break
	lds r24,vv
	cpse r24,__zero_reg__
	rjmp .L7
	ret


It uses 13 instructions instead of 21, is faster (even though the usual focus is on code size), and has a register footprint of 6 whereas gcc needs 12.
Comment 20 Georg-Johann Lay 2018-01-25 11:52:38 UTC
A bit of the bloat reported in PR81625 is also due to missed post-inc addressing, so it might be worth a look if you are after more test cases.  (Current 8.0.1 performs better than the 8.0.0 I used back then: only +22% bloat compared to 3.4.6, instead of +26%.)
Comment 21 Alexandre Oliva 2018-01-30 17:41:22 UTC
Author: aoliva
Date: Tue Jan 30 17:40:50 2018
New Revision: 257194

URL: https://gcc.gnu.org/viewcvs?rev=257194&root=gcc&view=rev
Log:
[PR81611] accept copies in simple_iv_increment_p

If there are copies between the GIMPLE_PHI at the loop body and the
increment that reaches it (presumably through a back edge), still
regard it as a simple_iv_increment, so that we won't consider the
value in the back edge eligible for forwprop.  Doing so would risk
making the phi node and the incremented conflicting value live
within the loop, and the phi node to be preserved for propagated
uses after the loop.

for  gcc/ChangeLog

	PR tree-optimization/81611
	* tree-ssa-dom.c (simple_iv_increment_p): Skip intervening
	copies.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-ssa-dom.c
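
Concretely, the shape now recognized is the one visible in the .expand dump in comment 2 (sketch):

  # str_6 = PHI <str_9(D)(2), str_17(3)>
  str_11 = str_6 + 1;
  ...
  str_17 = str_11;   <-- intervening copy on the back edge, now skipped by simple_iv_increment_p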
Comment 22 Eric Gallager 2018-02-15 21:29:41 UTC
(In reply to Alexandre Oliva from comment #21)
> Author: aoliva
> Date: Tue Jan 30 17:40:50 2018
> New Revision: 257194
> 
> URL: https://gcc.gnu.org/viewcvs?rev=257194&root=gcc&view=rev
> Log:
> [PR81611] accept copies in simple_iv_increment_p
> 
> If there are copies between the GIMPLE_PHI at the loop body and the
> increment that reaches it (presumably through a back edge), still
> regard it as a simple_iv_increment, so that we won't consider the
> value in the back edge eligible for forwprop.  Doing so would risk
> making the phi node and the incremented conflicting value live
> within the loop, and the phi node to be preserved for propagated
> uses after the loop.
> 
> for  gcc/ChangeLog
> 
> 	PR tree-optimization/81611
> 	* tree-ssa-dom.c (simple_iv_increment_p): Skip intervening
> 	copies.
> 
> Modified:
>     trunk/gcc/ChangeLog
>     trunk/gcc/tree-ssa-dom.c

Did this fix it?
Comment 23 Jeffrey A. Law 2018-02-15 21:56:35 UTC
No, the DOM change is only a partial fix.  I've largely approved the auto-inc change.  I recommend the additional tests in c#19 be pulled into a distinct BZ and the gcc8 regression marker removed once the auto-inc changes are committed.
Comment 24 Alexandre Oliva 2018-02-28 05:26:06 UTC
Author: aoliva
Date: Wed Feb 28 05:25:34 2018
New Revision: 258053

URL: https://gcc.gnu.org/viewcvs?rev=258053&root=gcc&view=rev
Log:
[PR81611] turn inc-and-use-of-dead-orig into auto-inc

When the addressing modes available on the machine don't allow offsets
in addresses, odds are that post-increments will be represented in
trees and RTL as:

  y <= x + 1
  ... *(x) ...
  x <= y

so deal with it by turning such RTL as:

  (set y (plus x n))
  ... (mem x) ...

without intervening uses of y into

  (set y x)
  ... (mem (post_add y n)) ...

so as to create auto-inc addresses that we'd otherwise miss.


for  gcc/ChangeLog

	PR rtl-optimization/81611
	* auto-inc-dec.c (attempt_change): Move dead note from
	mem_insn if it's the next use of regno
	(find_address): Take address use of reg holding
	non-incremented value.  Add parm to limit search to the named
	reg only.
	(merge_in_block): Attempt to use a mem insn that is the next
	use of the original regno.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/auto-inc-dec.c
Comment 25 Alexandre Oliva 2018-02-28 05:30:24 UTC
Fixed