Bug 69176 - [6 Regression] ICE in in final_scan_insn, at final.c:2981 on aarch64-linux-gnu
Summary: [6 Regression] ICE in in final_scan_insn, at final.c:2981 on aarch64-linux-gnu
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: target (show other bugs)
Version: 6.0
: P1 normal
Target Milestone: 6.0
Assignee: Richard Henderson
URL:
Keywords: ice-on-valid-code
Depends on:
Blocks:
 
Reported: 2016-01-06 19:40 UTC by Matthias Klose
Modified: 2016-01-20 17:39 UTC (History)
3 users (show)

See Also:
Host:
Target: aarch64-linux-gnu
Build:
Known to work: 5.3.1
Known to fail: 6.0
Last reconfirmed: 2016-01-07 00:00:00


Attachments
preprocessed source (68.84 KB, application/gzip)
2016-01-06 19:40 UTC, Matthias Klose
Details
proposed patch (1.55 KB, patch)
2016-01-08 09:13 UTC, Richard Henderson
Details | Diff
second patch (1.97 KB, patch)
2016-01-08 18:31 UTC, Richard Henderson
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Matthias Klose 2016-01-06 19:40:50 UTC
Created attachment 37243 [details]
preprocessed source

trunk r232077, aarch64-linux-gnu:

g++ src/faust-dsp-standalone.cpp -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fPIC -DVERSI
ON=1.3.0 -I/usr/lib/faust -Isrc/ -Iinclude/ -DPREFIX=/usr -Wall -c -o src/faust-dsp-standalone.o
In file included from src/faust-dsp-standalone.cpp:31:0:
src/../gen/yc20-dsp-standalone.cpp: In member function 'virtual void mydsp::compute(int, float**, float**)':
src/../gen/yc20-dsp-standalone.cpp:1659:15: note: variable tracking size limit exceeded with -fvar-tracking-ass
ignments, retrying without
  virtual void compute (int fullcount, FAUSTFLOAT** input, FAUSTFLOAT** output) {
               ^~~~~~~

In file included from src/faust-dsp-standalone.cpp:31:0:
src/../gen/yc20-dsp-standalone.cpp:19015:2: error: could not split insn
  }
  ^

(insn 152068 64217 64221 (set (reg:DI 1 x1 [88028])
        (plus:DI (reg/f:DI 29 x29)
            (const_int 129024 [0x1f800]))) src/../gen/yc20-dsp-standalone.cpp:10746 95 {*adddi3_pluslong}
     (nil))
src/../gen/yc20-dsp-standalone.cpp:19015:2: internal compiler error: in final_scan_insn, at final.c:2981
0xa0e9a7 _fatal_insn(char const*, rtx_def const*, char const*, int, char const*)
        ../../src/gcc/rtl-error.c:108
0x83f67b final_scan_insn(rtx_insn*, _IO_FILE*, int, int, int*)
        ../../src/gcc/final.c:2981
0x83f927 final(rtx_insn*, _IO_FILE*, int)
        ../../src/gcc/final.c:2044
0x83fe8f rest_of_handle_final
        ../../src/gcc/final.c:4440
0x83fe8f execute
        ../../src/gcc/final.c:4515
Please submit a full bug report,
with preprocessed source if appropriate.

currently reducing, build takes 10min ...
Comment 1 Andrew Pinski 2016-01-06 20:00:27 UTC
Happens with r231970 also.
Comment 2 Andrew Pinski 2016-01-06 22:05:17 UTC
Note -g is not needed to reproduce the bug and speeds up the compiling a lot.
Comment 3 ktkachov 2016-01-07 09:30:32 UTC
Confirmed.
Comment 4 Andrew Pinski 2016-01-07 09:40:43 UTC
I think the problem is the constraints on *add<mode>3_pluslong allows all immediates.

Basically *add<mode>3_pluslong should not be a define_insn_and_split. it should really only a define_split and *addsi3_aarch64 should have another constraint which matches All others (the !aarch64_plus_operand (operands[2], VOIDmode)
   && !aarch64_move_imm (INTVAL (operands[2]), <MODE>mode) case).


You will never reduce a good testcase either since it is about LRA and constraints rather than the predicates.

Having two different define_insn is an interesting case for add and a special case too.
Comment 5 Andrew Pinski 2016-01-07 09:44:59 UTC
This was introduced by:
2015-11-24  Wilco Dijkstra  <wdijkstr@arm.com>

        * config/aarch64/aarch64.md (add<mode>3):
        Block early expansion into 2 add instructions.
        (add<mode>3_pluslong): New pattern to combine complex
        immediates into 2 additions.


2015-12-04  James Greenhalgh  <james.greenhalgh@arm.com>

        * config/aarch64/aarch64.md (add<mode>3_pluslong): Add register
        constraints.
Comment 6 Wilco 2016-01-07 12:47:12 UTC
The failing instruction is:

(insn 121584 47646 47647 (set (reg:DI 1 x1 [88028])
        (plus:DI (reg/f:DI 29 x29)
            (const_int 129024 [0x1f800]))) src/../gen/yc20-dsp-standalone.cpp:10746 95 {*adddi3_pluslong}

This won't split as 0x1f800 is a valid move immediate (a mask). This means the instruction does not match and should never have been created. I guess it gets created correctly but then updated without rechecking the full constraints.

> I think the problem is the constraints on *add<mode>3_pluslong allows all immediates.

I'm not sure what you mean here - there are 4 constraints that should all be true before the instruction is matched: GPI, aarch64_pluslong_immediate, 'i' and "!aarch64_plus_operand (operands[2], VOIDmode) && !aarch64_move_imm (INTVAL (operands[2]), <MODE>mode)".
Comment 7 Wilco 2016-01-07 18:17:03 UTC
> > I think the problem is the constraints on *add<mode>3_pluslong allows all immediates.
> 
> I'm not sure what you mean here - there are 4 constraints that should all be
> true before the instruction is matched: GPI, aarch64_pluslong_immediate, 'i'
> and "!aarch64_plus_operand (operands[2], VOIDmode) && !aarch64_move_imm
> (INTVAL (operands[2]), <MODE>mode)".

It appears reload creates the instruction without doing the last check - this is incorrect as it might be an instruction that is disabled (eg. not supported by the selected architecture)...

However a trivial workaround is to always expand the pattern by changing the "&& true" into "true". I'll post a patch.
Comment 8 Andrew Pinski 2016-01-07 20:25:07 UTC
(In reply to Wilco from comment #7)
> > > I think the problem is the constraints on *add<mode>3_pluslong allows all immediates.
> > 
> > I'm not sure what you mean here - there are 4 constraints that should all be
> > true before the instruction is matched: GPI, aarch64_pluslong_immediate, 'i'
> > and "!aarch64_plus_operand (operands[2], VOIDmode) && !aarch64_move_imm
> > (INTVAL (operands[2]), <MODE>mode)".
> 
> It appears reload creates the instruction without doing the last check -
> this is incorrect as it might be an instruction that is disabled (eg. not
> supported by the selected architecture)...
> 
> However a trivial workaround is to always expand the pattern by changing the
> "&& true" into "true". I'll post a patch.

plus patterns are special to reload (this is documented IIRC). So I think there should be only one plus pattern for DI mode.
Comment 9 Wilco 2016-01-07 21:54:55 UTC
(In reply to Andrew Pinski from comment #8)
> (In reply to Wilco from comment #7)
> > > > I think the problem is the constraints on *add<mode>3_pluslong allows all immediates.
> > > 
> > > I'm not sure what you mean here - there are 4 constraints that should all be
> > > true before the instruction is matched: GPI, aarch64_pluslong_immediate, 'i'
> > > and "!aarch64_plus_operand (operands[2], VOIDmode) && !aarch64_move_imm
> > > (INTVAL (operands[2]), <MODE>mode)".
> > 
> > It appears reload creates the instruction without doing the last check -
> > this is incorrect as it might be an instruction that is disabled (eg. not
> > supported by the selected architecture)...
> > 
> > However a trivial workaround is to always expand the pattern by changing the
> > "&& true" into "true". I'll post a patch.
> 
> plus patterns are special to reload (this is documented IIRC). So I think
> there should be only one plus pattern for DI mode.

I see, that would explain why it doesn't check the condition. Merging into a single instruction should be possible. That still wouldn't check the condition of course, so we'd need to split anything that isn't a single add (given this odd reload case is so rare, it's not an issue).

There is also a secondary issue in that the large number of reloads in the example generate quite inefficient code. With a few tweaks I got a 5% reduction in instruction count.
Comment 10 Richard Henderson 2016-01-08 09:13:38 UTC
Created attachment 37267 [details]
proposed patch

Andrew is exactly right re plus being special.

The pluslong hoops that are being jumped through are really the
domain for post-reload splitters and peepholes.
Comment 11 Wilco 2016-01-08 15:45:42 UTC
(In reply to Richard Henderson from comment #10)
> Created attachment 37267 [details]
> proposed patch
> 
> Andrew is exactly right re plus being special.
> 
> The pluslong hoops that are being jumped through are really the
> domain for post-reload splitters and peepholes.

Thanks for the patch, it fixes the reported issue. However it regresses the examples that motivated the original patch:

int g3(int x, int y, int z)
{
  x += 0x15555;
  y += 0x15555;
  z += 0x15555;
  return x * y ^ z;
}

int g4(int x, int y, int z)
{
  x += 0x15555;
  return x;
}

These now produce:

g3:
        mov     w3, 21845
        mov     w4, 21845
        movk    w3, 0x1, lsl 16
        movk    w4, 0x1, lsl 16
        add     w0, w0, w3
        add     w1, w1, w4
        mov     w5, 21845
        movk    w5, 0x1, lsl 16
        mul     w0, w0, w1
        add     w2, w2, w5
        eor     w0, w0, w2
        ret

g4:
        mov     w1, 21845
        movk    w1, 0x1, lsl 16
        add     w0, w0, w1
        ret

The first case should have CSEd the complex immediate. The 2nd should have split into 2 adds.

The idea of the original patch was to expand single-instruction adds, so any complex immediates are explicit and get CSEd/lifted. Then allow combine to find single-use single-block complex immediates that can be merged with an add and split into a 2-instruction add expansion. This only works if you have a define_insn_and_split for the complex immediate, with just a split combine doesn't do it.

With your patch expand always emits add instructions with complex immediates which then can't be optimized.
Comment 12 Wilco 2016-01-08 17:12:29 UTC
(In reply to Wilco from comment #11)

> With your patch expand always emits add instructions with complex immediates
> which then can't be optimized.

OK, so I can change your patch do the right thing with 2 minor changes:

* expand add<mode>3 should use aarch64_plus_immediate
* the peepholes should have aarch64_move_imm (INTVAL (operands[2]), <MODE>mode) as the condition (we want to expand into add+add rather than mov+movk+add...).

With this the codesize of the example reduces by an extra 8% :-)

The only remaining question I had whether it would be possible to use peephole expansions rather than the late splits. If they are evaluated in order then if the first doesn't match or if there is no temporary, the 2nd one will split into 2 adds.
Comment 13 Richard Henderson 2016-01-08 18:31:49 UTC
Created attachment 37281 [details]
second patch

Ok, so the patch didn't survive overnight testing.

The primary issue being that stack pointer adjustments need to
be split immediately, since try_split will refuse to operate on
an instruction that modifies the stack pointer.

Thanks for the test cases in #c11, I'd forgotten about exposing
this to cse.  This adjusted patch handles them as expected.

The lack of condition on the peepholes was just a think-o.
Testing for single-insn move was always intended, just forgotten.

Full testing for this is still a couple hours off completion.
Comment 14 Richard Henderson 2016-01-08 18:36:42 UTC
(In reply to Wilco from comment #12)
> The only remaining question I had whether it would be possible to use
> peephole expansions rather than the late splits. If they are evaluated in
> order then if the first doesn't match or if there is no temporary, the 2nd
> one will split into 2 adds.

Why would you want to do this as peepholes?

I don't know what you're thinking of, but I'm pretty sure
it's already set up like you want: peep2 pass changes the
insns to use temporaries when possible, then splitting to
2 adds happens for any remaining long constants.
Comment 15 Wilco 2016-01-08 18:52:22 UTC
(In reply to Richard Henderson from comment #14)
> (In reply to Wilco from comment #12)
> > The only remaining question I had whether it would be possible to use
> > peephole expansions rather than the late splits. If they are evaluated in
> > order then if the first doesn't match or if there is no temporary, the 2nd
> > one will split into 2 adds.
> 
> Why would you want to do this as peepholes?
> 
> I don't know what you're thinking of, but I'm pretty sure
> it's already set up like you want: peep2 pass changes the
> insns to use temporaries when possible, then splitting to
> 2 adds happens for any remaining long constants.

The final split happens a few phases later, so I wondered whether it would be feasible to do all the splitting during peep2. There is likely no real CQ gain in doing so, it just might make understanding the md file easier.
Comment 16 Richard Henderson 2016-01-08 20:05:34 UTC
(In reply to Wilco from comment #15)
> The final split happens a few phases later, so I wondered whether it would
> be feasible to do all the splitting during peep2. There is likely no real CQ
> gain in doing so, it just might make understanding the md file easier.

No, it's not, since peep2 doesn't always run.  Whereas the splits
must be available during final for -O0.
Comment 17 Richard Henderson 2016-01-18 20:56:45 UTC
Author: rth
Date: Mon Jan 18 20:56:13 2016
New Revision: 232540

URL: https://gcc.gnu.org/viewcvs?rev=232540&root=gcc&view=rev
Log:
PR target/69176

  * config/aarch64/aarch64.md (add<GPI>3): Move long immediate
  operands to pseudo only if CSE is expected.  Split long immediate
  operands only after reload, and for the stack pointer.
  (*add<GPI>3_pluslong): Remove.
  (*addsi3_aarch64, *adddi3_aarch64): Merge into...
  (*add<GPI>3_aarch64): ... here.  Add r/rk/Upl alternative.
  (*addsi3_aarch64_uxtw): Add r/rk/Upl alternative.
  (*add<GPI>3 peepholes): New.
  (*add<GPI>3 splitters): New.
  * config/aarch64/constraints.md (Upl): New.
  * config/aarch64/predicates.md (aarch64_pluslong_strict_immedate): New.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/config/aarch64/aarch64.md
    trunk/gcc/config/aarch64/constraints.md
    trunk/gcc/config/aarch64/predicates.md
Comment 18 Richard Henderson 2016-01-18 20:57:55 UTC
Fixed.
Comment 19 Matthias Klose 2016-01-20 17:39:16 UTC
here is a test case from an ICE I saw when backporting the patch to the gcc-5 Linaro branch. Fails with -O2, works with -O1

typedef int Nlm_Int4, ValNodePtr;
Nlm_Int4 b, e;
char c, d;
void fn1();
typedef struct { double patternProbability; } patternSearchItems;
char fn2();
void fn3();
patternSearchItems a;
void fn4() {
  Nlm_Int4 f[20000];
  char g[1];
  a = *(patternSearchItems *)fn3;
  while (fn2(c, d & e, b))
    if (a.patternProbability) {
      fn1(g);
      fn1(f);
    }
}