Created attachment 37243 [details] preprocessed source trunk r232077, aarch64-linux-gnu: g++ src/faust-dsp-standalone.cpp -g -O2 -fstack-protector-strong -Wformat -Werror=format-security -fPIC -DVERSI ON=1.3.0 -I/usr/lib/faust -Isrc/ -Iinclude/ -DPREFIX=/usr -Wall -c -o src/faust-dsp-standalone.o In file included from src/faust-dsp-standalone.cpp:31:0: src/../gen/yc20-dsp-standalone.cpp: In member function 'virtual void mydsp::compute(int, float**, float**)': src/../gen/yc20-dsp-standalone.cpp:1659:15: note: variable tracking size limit exceeded with -fvar-tracking-ass ignments, retrying without virtual void compute (int fullcount, FAUSTFLOAT** input, FAUSTFLOAT** output) { ^~~~~~~ In file included from src/faust-dsp-standalone.cpp:31:0: src/../gen/yc20-dsp-standalone.cpp:19015:2: error: could not split insn } ^ (insn 152068 64217 64221 (set (reg:DI 1 x1 [88028]) (plus:DI (reg/f:DI 29 x29) (const_int 129024 [0x1f800]))) src/../gen/yc20-dsp-standalone.cpp:10746 95 {*adddi3_pluslong} (nil)) src/../gen/yc20-dsp-standalone.cpp:19015:2: internal compiler error: in final_scan_insn, at final.c:2981 0xa0e9a7 _fatal_insn(char const*, rtx_def const*, char const*, int, char const*) ../../src/gcc/rtl-error.c:108 0x83f67b final_scan_insn(rtx_insn*, _IO_FILE*, int, int, int*) ../../src/gcc/final.c:2981 0x83f927 final(rtx_insn*, _IO_FILE*, int) ../../src/gcc/final.c:2044 0x83fe8f rest_of_handle_final ../../src/gcc/final.c:4440 0x83fe8f execute ../../src/gcc/final.c:4515 Please submit a full bug report, with preprocessed source if appropriate. currently reducing, build takes 10min ...
Happens with r231970 also.
Note -g is not needed to reproduce the bug and speeds up the compiling a lot.
Confirmed.
I think the problem is the constraints on *add<mode>3_pluslong allows all immediates. Basically *add<mode>3_pluslong should not be a define_insn_and_split. it should really only a define_split and *addsi3_aarch64 should have another constraint which matches All others (the !aarch64_plus_operand (operands[2], VOIDmode) && !aarch64_move_imm (INTVAL (operands[2]), <MODE>mode) case). You will never reduce a good testcase either since it is about LRA and constraints rather than the predicates. Having two different define_insn is an interesting case for add and a special case too.
This was introduced by: 2015-11-24 Wilco Dijkstra <wdijkstr@arm.com> * config/aarch64/aarch64.md (add<mode>3): Block early expansion into 2 add instructions. (add<mode>3_pluslong): New pattern to combine complex immediates into 2 additions. 2015-12-04 James Greenhalgh <james.greenhalgh@arm.com> * config/aarch64/aarch64.md (add<mode>3_pluslong): Add register constraints.
The failing instruction is: (insn 121584 47646 47647 (set (reg:DI 1 x1 [88028]) (plus:DI (reg/f:DI 29 x29) (const_int 129024 [0x1f800]))) src/../gen/yc20-dsp-standalone.cpp:10746 95 {*adddi3_pluslong} This won't split as 0x1f800 is a valid move immediate (a mask). This means the instruction does not match and should never have been created. I guess it gets created correctly but then updated without rechecking the full constraints. > I think the problem is the constraints on *add<mode>3_pluslong allows all immediates. I'm not sure what you mean here - there are 4 constraints that should all be true before the instruction is matched: GPI, aarch64_pluslong_immediate, 'i' and "!aarch64_plus_operand (operands[2], VOIDmode) && !aarch64_move_imm (INTVAL (operands[2]), <MODE>mode)".
> > I think the problem is the constraints on *add<mode>3_pluslong allows all immediates. > > I'm not sure what you mean here - there are 4 constraints that should all be > true before the instruction is matched: GPI, aarch64_pluslong_immediate, 'i' > and "!aarch64_plus_operand (operands[2], VOIDmode) && !aarch64_move_imm > (INTVAL (operands[2]), <MODE>mode)". It appears reload creates the instruction without doing the last check - this is incorrect as it might be an instruction that is disabled (eg. not supported by the selected architecture)... However a trivial workaround is to always expand the pattern by changing the "&& true" into "true". I'll post a patch.
(In reply to Wilco from comment #7) > > > I think the problem is the constraints on *add<mode>3_pluslong allows all immediates. > > > > I'm not sure what you mean here - there are 4 constraints that should all be > > true before the instruction is matched: GPI, aarch64_pluslong_immediate, 'i' > > and "!aarch64_plus_operand (operands[2], VOIDmode) && !aarch64_move_imm > > (INTVAL (operands[2]), <MODE>mode)". > > It appears reload creates the instruction without doing the last check - > this is incorrect as it might be an instruction that is disabled (eg. not > supported by the selected architecture)... > > However a trivial workaround is to always expand the pattern by changing the > "&& true" into "true". I'll post a patch. plus patterns are special to reload (this is documented IIRC). So I think there should be only one plus pattern for DI mode.
(In reply to Andrew Pinski from comment #8) > (In reply to Wilco from comment #7) > > > > I think the problem is the constraints on *add<mode>3_pluslong allows all immediates. > > > > > > I'm not sure what you mean here - there are 4 constraints that should all be > > > true before the instruction is matched: GPI, aarch64_pluslong_immediate, 'i' > > > and "!aarch64_plus_operand (operands[2], VOIDmode) && !aarch64_move_imm > > > (INTVAL (operands[2]), <MODE>mode)". > > > > It appears reload creates the instruction without doing the last check - > > this is incorrect as it might be an instruction that is disabled (eg. not > > supported by the selected architecture)... > > > > However a trivial workaround is to always expand the pattern by changing the > > "&& true" into "true". I'll post a patch. > > plus patterns are special to reload (this is documented IIRC). So I think > there should be only one plus pattern for DI mode. I see, that would explain why it doesn't check the condition. Merging into a single instruction should be possible. That still wouldn't check the condition of course, so we'd need to split anything that isn't a single add (given this odd reload case is so rare, it's not an issue). There is also a secondary issue in that the large number of reloads in the example generate quite inefficient code. With a few tweaks I got a 5% reduction in instruction count.
Created attachment 37267 [details] proposed patch Andrew is exactly right re plus being special. The pluslong hoops that are being jumped through are really the domain for post-reload splitters and peepholes.
(In reply to Richard Henderson from comment #10) > Created attachment 37267 [details] > proposed patch > > Andrew is exactly right re plus being special. > > The pluslong hoops that are being jumped through are really the > domain for post-reload splitters and peepholes. Thanks for the patch, it fixes the reported issue. However it regresses the examples that motivated the original patch: int g3(int x, int y, int z) { x += 0x15555; y += 0x15555; z += 0x15555; return x * y ^ z; } int g4(int x, int y, int z) { x += 0x15555; return x; } These now produce: g3: mov w3, 21845 mov w4, 21845 movk w3, 0x1, lsl 16 movk w4, 0x1, lsl 16 add w0, w0, w3 add w1, w1, w4 mov w5, 21845 movk w5, 0x1, lsl 16 mul w0, w0, w1 add w2, w2, w5 eor w0, w0, w2 ret g4: mov w1, 21845 movk w1, 0x1, lsl 16 add w0, w0, w1 ret The first case should have CSEd the complex immediate. The 2nd should have split into 2 adds. The idea of the original patch was to expand single-instruction adds, so any complex immediates are explicit and get CSEd/lifted. Then allow combine to find single-use single-block complex immediates that can be merged with an add and split into a 2-instruction add expansion. This only works if you have a define_insn_and_split for the complex immediate, with just a split combine doesn't do it. With your patch expand always emits add instructions with complex immediates which then can't be optimized.
(In reply to Wilco from comment #11) > With your patch expand always emits add instructions with complex immediates > which then can't be optimized. OK, so I can change your patch do the right thing with 2 minor changes: * expand add<mode>3 should use aarch64_plus_immediate * the peepholes should have aarch64_move_imm (INTVAL (operands[2]), <MODE>mode) as the condition (we want to expand into add+add rather than mov+movk+add...). With this the codesize of the example reduces by an extra 8% :-) The only remaining question I had whether it would be possible to use peephole expansions rather than the late splits. If they are evaluated in order then if the first doesn't match or if there is no temporary, the 2nd one will split into 2 adds.
Created attachment 37281 [details] second patch Ok, so the patch didn't survive overnight testing. The primary issue being that stack pointer adjustments need to be split immediately, since try_split will refuse to operate on an instruction that modifies the stack pointer. Thanks for the test cases in #c11, I'd forgotten about exposing this to cse. This adjusted patch handles them as expected. The lack of condition on the peepholes was just a think-o. Testing for single-insn move was always intended, just forgotten. Full testing for this is still a couple hours off completion.
(In reply to Wilco from comment #12) > The only remaining question I had whether it would be possible to use > peephole expansions rather than the late splits. If they are evaluated in > order then if the first doesn't match or if there is no temporary, the 2nd > one will split into 2 adds. Why would you want to do this as peepholes? I don't know what you're thinking of, but I'm pretty sure it's already set up like you want: peep2 pass changes the insns to use temporaries when possible, then splitting to 2 adds happens for any remaining long constants.
(In reply to Richard Henderson from comment #14) > (In reply to Wilco from comment #12) > > The only remaining question I had whether it would be possible to use > > peephole expansions rather than the late splits. If they are evaluated in > > order then if the first doesn't match or if there is no temporary, the 2nd > > one will split into 2 adds. > > Why would you want to do this as peepholes? > > I don't know what you're thinking of, but I'm pretty sure > it's already set up like you want: peep2 pass changes the > insns to use temporaries when possible, then splitting to > 2 adds happens for any remaining long constants. The final split happens a few phases later, so I wondered whether it would be feasible to do all the splitting during peep2. There is likely no real CQ gain in doing so, it just might make understanding the md file easier.
(In reply to Wilco from comment #15) > The final split happens a few phases later, so I wondered whether it would > be feasible to do all the splitting during peep2. There is likely no real CQ > gain in doing so, it just might make understanding the md file easier. No, it's not, since peep2 doesn't always run. Whereas the splits must be available during final for -O0.
Author: rth Date: Mon Jan 18 20:56:13 2016 New Revision: 232540 URL: https://gcc.gnu.org/viewcvs?rev=232540&root=gcc&view=rev Log: PR target/69176 * config/aarch64/aarch64.md (add<GPI>3): Move long immediate operands to pseudo only if CSE is expected. Split long immediate operands only after reload, and for the stack pointer. (*add<GPI>3_pluslong): Remove. (*addsi3_aarch64, *adddi3_aarch64): Merge into... (*add<GPI>3_aarch64): ... here. Add r/rk/Upl alternative. (*addsi3_aarch64_uxtw): Add r/rk/Upl alternative. (*add<GPI>3 peepholes): New. (*add<GPI>3 splitters): New. * config/aarch64/constraints.md (Upl): New. * config/aarch64/predicates.md (aarch64_pluslong_strict_immedate): New. Modified: trunk/gcc/ChangeLog trunk/gcc/config/aarch64/aarch64.md trunk/gcc/config/aarch64/constraints.md trunk/gcc/config/aarch64/predicates.md
Fixed.
here is a test case from an ICE I saw when backporting the patch to the gcc-5 Linaro branch. Fails with -O2, works with -O1 typedef int Nlm_Int4, ValNodePtr; Nlm_Int4 b, e; char c, d; void fn1(); typedef struct { double patternProbability; } patternSearchItems; char fn2(); void fn3(); patternSearchItems a; void fn4() { Nlm_Int4 f[20000]; char g[1]; a = *(patternSearchItems *)fn3; while (fn2(c, d & e, b)) if (a.patternProbability) { fn1(g); fn1(f); } }