erg@frogger:~/Factor/Factor$ gcc -O -fschedule-insns native/string.c native/string.c: In function ‘from_c_string’: native/string.c:103: error: unable to find a register to spill in class ‘DIREG’ native/string.c:103: error: this is the insn: (insn 13 39 14 0 (parallel [ (set (reg:DI 2 cx [63]) (unspec:DI [ (mem:BLK (reg/f:DI 1 dx [orig:65 c_string ] [65]) [0 A8]) (reg:QI 0 ax [67]) (const_int 1 [0x1]) (reg:DI 2 cx [66]) ] 20)) (use (reg:SI 19 dirflag)) (clobber (reg/f:DI 1 dx [orig:65 c_string ] [65])) (clobber (reg:CC 17 flags)) ]) 649 {*strlenqi_rex_1} (insn_list:REG_DEP_TRUE 11 (insn_list:REG_DEP_TRUE 9 (insn_list:REG_DEP_TRUE 10 (insn_list:REG_DEP_TRUE 12 (nil))))) (expr_list:REG_DEAD (reg:SI 19 dirflag) (expr_list:REG_DEAD (reg:DI 2 cx [66]) (expr_list:REG_DEAD (reg:QI 0 ax [67]) (expr_list:REG_DEAD (reg/f:DI 1 dx [orig:65 c_string ] [65]) (expr_list:REG_UNUSED (reg:CC 17 flags) (expr_list:REG_UNUSED (reg/f:DI 1 dx [orig:65 c_string ] [65]) (nil)))))))) native/string.c:103: confused by earlier errors, bailing out
Created attachment 9972 [details] string functions for Factor
We have another report of this before. Basicially the ra in gcc needs a lot of help to fix this.
// reduced testcase int bar (char *s) { return foo (strlen(s)); }
Confirmed, broken by this patch: http://gcc.gnu.org/ml/gcc-patches/2003-01/msg02147.html
(In reply to comment #4) > Confirmed, broken by this patch: > http://gcc.gnu.org/ml/gcc-patches/2003-01/msg02147.html This patch just exposes a latent bug in the RA, not be able to handle stuf like this.
The problem is caused by the combination of (1) x86_64 parameter passing convention, (2) x86 instructions that _require_ parameters in specific registers and (3) sched1 scheduling pass. ad 1) x86_64 passes function parameters in registers in the order, defined in x86_64_int_parameter_registers[] array. 5 /*RDI*/, 4 /*RSI*/, 1 /*RDX*/, 2 /*RCX*/, FIRST_REX_INT_REG /*R8 */, FIRST_REX_INT_REG + 1 /*R9 */ Additionally, RAX is used as a hidden argument register. In original example, call sequence to "memory_to_string" is constructed as: (insn 17 15 18 0 (set (reg:DI 4 si) (reg:DI 61)) 81 {*movdi_1_rex64} (insn_list:REG_DEP_TRUE 15 (nil)) (expr_list:REG_DEAD (reg:DI 61) (nil))) (insn 18 17 19 0 (set (reg:DI 5 di [ c_string ]) (reg/v/f:DI 60 [ c_string ])) 81 {*movdi_1_rex64} (nil) (expr_list:REG_DEAD (reg/v/f:DI 60 [ c_string ]) (nil))) (call_insn 19 18 20 0 (set (reg:DI 0 ax) (call (mem:QI (symbol_ref:DI ("memory_to_string") [flags 0x3] <function_decl 0x4044f080 memory_to_string>) [0 S1 A8]) (const_int 0 [0x0]))) 732 {*call_value_0_rex64} (insn_list:REG_DEP_TRUE 17 (insn_list:REG_DEP_TRUE 18 (nil))) (expr_list:REG_DEAD (reg:DI 4 si) (expr_list:REG_DEAD (reg:DI 5 di [ c_string ]) (expr_list:REG_EH_REGION (const_int 0 [0x0]) (nil)))) (expr_list:REG_DEP_TRUE (use (reg:DI 5 di [ c_string ])) (expr_list:REG_DEP_TRUE (use (reg:DI 4 si)) (nil)))) ad 2) Please note, that this sequence can be found just after *strlenqi_rex_1 mega-pattern. This pattern requires parameters to be put in excactly defined registers: (define_insn "*strlenqi_rex_1" [(set (match_operand:DI 0 "register_operand" "=&c") (unspec:DI [(mem:BLK (match_operand:DI 5 "register_operand" "1")) (match_operand:QI 2 "register_operand" "a") (match_operand:DI 3 "immediate_operand" "i") (match_operand:DI 4 "register_operand" "0")] UNSPEC_SCAS)) (use (reg:SI DIRFLAG_REG)) (clobber (match_operand:DI 1 "register_operand" "=D")) (clobber (reg:CC FLAGS_REG))] However, at the time of sched1 pass (before reload) hard registers are not known yet. We have following RTL pattern just above "memory_to_string" call sequence (reg_notes are not shown for clarity): (insn 13 12 14 0 (parallel [ (set (reg:DI 63) (unspec:DI [ (mem:BLK (reg/f:DI 65 [ c_string ]) [0 A8]) (reg:QI 67) (const_int 1 [0x1]) (reg:DI 66) ] 20)) (use (reg:SI 19 dirflag)) (clobber (reg/f:DI 65 [ c_string ])) (clobber (reg:CC 17 flags)) ]) 511 {*strlenqi_rex_1} ad 3) Sched1 pass is free to move (insn 17) and (insn 18) before (insn 13) as it doesn't recognize register allocating conflicts between these instructions. Following that move, reload has no registers to spill and ICEs. The testcase from comment #3 ICEs with: error: unable to find a register to spill in class âAREGâ Here, the same problem could be observed. As "foo" is missing a prototype, hidden RAX register gets allocated in addition to RDI: (insn 20 18 21 0 (set (reg:DI 5 di) (reg:DI 61)) 81 {*movdi_1_rex64} (insn_list:REG_DEP_TRUE 18 (nil)) (expr_list:REG_DEAD (reg:DI 61) (nil))) (insn 21 20 22 0 (set (reg:QI 0 ax) (const_int 0 [0x0])) 55 {*movqi_1} (nil) (nil)) (call_insn 22 21 23 0 (set (reg:SI 0 ax) (call (mem:QI (symbol_ref:DI ("foo") [flags 0x41] <function_decl 0x402cbd80 foo>) [0 S1 A8]) (const_int 0 [0x0]))) 732 {*call_value_0_rex64} (insn_list:REG_DEP_TRUE 20 (insn_list:REG_DEP_TRUE 21 (nil))) (expr_list:REG_DEAD (reg:DI 5 di) (nil)) (expr_list:REG_DEP_TRUE (use (reg:QI 0 ax)) (expr_list:REG_DEP_TRUE (use (reg:DI 5 di)) (nil)))) This AX register is then moved before "strlenqi_rex_1" pattern and this blocks the AX register. (BTW: If prototype of "foo" is added, this particular testcase compiles OK.) One possible fix to this problem would be not to schedule instructions that have assigned hard registers (move insns in above case). Considering the number of x86 instructions, that require fixed registers I would suggest bugmasters to raise the priority of this bug. The x86 backend should not have these problems, but using -mregparm=X I think it could also be tricked to this sort of ICEs. (BTW: I have added Jim Wilson to CC of this bug as he is current maintaine of insn scheduling pass code. Perhaps he has some ideas on how to solve this problem.)
Subject: Re: [3.4/4.0/4.1 regression] amd64 register spill error with -fschedule-insns On Wed, 2005-11-09 at 07:27, uros at kss-loka dot si wrote: > (BTW: I have added Jim Wilson to CC of this bug as he is current maintaine of > insn scheduling pass code. Perhaps he has some ideas on how to solve this > problem.) Vlad's name really should be first in the list, as he is the only one of the 4 listed people that is actively working on the scheduler. This problem only shows up when the -fschedule-insns option is used. The x86 port deliberately disables this by default for good reasons. One possible solution is to add hooks to the x86 backend that emit an error when a user specifies the -fschedule-insns option, since it isn't expected to work, and if it did work, it would more likely than not result in worse code. I think this is the best short term option. Another possible solution is to modify the strlenqi_rex_1 pattern to use explicit hard registers instead of constraints that specify one register. There is little point in trying to register allocate a pattern when there is only one possible register that can be used here. Using a hard register would expose the dependencies to the scheduler, preventing the scheduler from performing bad optimizations. The downside here is that use of hard registers may confuse some optimization passes, or expose limitations in some optimization passes, resulting in some performance loss. Also, you will have to find and fix all existing patterns which have this problem. Doing this will be error prone, and may cause additional problems, and/or may result in an incomplete fix if some patterns are missed. I doubt that this is worth the effort. As for a scheduler fix, we could perhaps test SMALL_REGISTER_CLASSES, and if defined, then make all insns that use a hard register into a scheduling barrier (e.g. a call to flush_pending_lists). Most all targets that define SMALL_REGISTER_CLASSES already disables the first scheduling pass, so this probably won't break anything. This should be verified. This is going to severely limit the optimizations that sched1 can perform, which may make it pointless, but at least then the -fschedule-insns option should work for x86. Since it was changes to SCHED_GROUP handling that exposed this problem, perhaps the SCHED_GROUP handling code can be conditional on SMALL_REGISTER_CLASSES. For instance, maybe we can put the insns loading parameters into registers into the SCHED_GROUP for SMALL_REGISTER_CLASSES targets. This would prevent moving such insns during sched1. This would mean adding code like the existing in_post_call_group_p stuff, except for insns before a call instead of insns after a call. I think this is the best long term solution. As for a register allocator fix, we could extend the RA to be able to spill and reload hard registers, but the resulting code in this case would be so bad that I see no point in even trying. Since end users will gain little benefit from being able to run the sched1 pass on x86 code, I don't think this is a serious problem.
Will not be fixed in 4.1.1; adjust target milestone to 4.1.2.
(In reply to comment #3) > int bar (char *s) > { > return foo (strlen(s)); > } The testcase above fails with AREG spill failure. The testcase below fails with DIREG spill failure: int bar (int x, long l); int foo (char *s) { return bar (1, strlen(s)); }
Closing 4.1 branch.
Reconfirmed for test cases of comment #9 (same spill failures).
I started my work on register pressure sensitive insn scheduling recently. This bug will be fixed as byproduct of this work. I hope the code will be available for gcc4.5.
Closing 4.2 branch.
GCC 4.3.4 is being released, adjusting target milestone.
*** Bug 41164 has been marked as a duplicate of this bug. ***
Re: Comment 7: Since end users will gain little benefit from being able to run the sched1 pass on x86 code, I don't think this is a serious problem. PR33928 (comments 108 and 111) give an example where -fschedule-insns on x64-64 gives a 14% speedup on some direct and inverse FFT codes, certainly not a trivial difference.
Patch at http://gcc.gnu.org/ml/gcc-patches/2009-09/msg00003.html
Vlad: The patch works great in my tests so far, thanks. After installing your patch on today's trunk so that -fschedule-insns actually works, I find it is quite expensive on large files. For example, with today's trunk with your patches applied, for the file http://www.math.purdue.edu/~lucier/bugzilla/8/_num.i.gz and the options /pkgs/gcc-mainline-schedule/bin/gcc -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -ftime-report -c _num.i total CPU time on my x86-64 box is TOTAL : 29.60 0.92 30.54 176587 kB while with -fschedule-insns it is scheduling : 23.03 (42%) usr 0.02 ( 2%) sys 23.07 (41%) wall 2125 kB ( 1%) ggc TOTAL : 55.47 1.03 56.57 180793 kB I don't know whether you can make it go faster now, or whether that's unreasonable and I should just wait and file another PR. Brad
As I wrote, implementing register pressure-sensitive insn scheduling needs to look at all insns (ready or not) with resolved dependencies. In an extreme cases, such insns could b 10-100 more than the ready ones. Scheduling algorithm in gcc has at best Nlog(N) complexity where N is #insn with resolved dependencies but probably it is even worse. I could try to make some constraints on # of considered insns. I guess you should submit a new bug when the patch will be in the trunk.
Vlad: Thank you for your reply. The times I reported are for "-fschedule-insns" without "-fpressure-sched". The times with the addition of "-fpressure-sched" are not much greater than with "-fschedule-insns" by itself: With -fschedule-insns scheduling : 22.89 (41%) usr 0.02 ( 2%) sys 22.93 (40%) wall 2125 kB ( 1%) ggc integrated RA : 9.15 (16%) usr 0.06 ( 6%) sys 9.21 (16%) wall 5488 kB ( 3%) ggc scheduling 2 : 0.60 ( 1%) usr 0.00 ( 0%) sys 0.62 ( 1%) wall 422 kB ( 0%) ggc TOTAL : 55.67 0.93 56.66 180793 kB with -fschedule-insns -fsched-pressure scheduling : 23.31 (42%) usr 0.02 ( 2%) sys 23.36 (41%) wall 2125 kB ( 1%) ggc integrated RA : 9.18 (16%) usr 0.04 ( 4%) sys 9.22 (16%) wall 5517 kB ( 3%) ggc scheduling 2 : 0.58 ( 1%) usr 0.01 ( 1%) sys 0.58 ( 1%) wall 251 kB ( 0%) ggc TOTAL : 55.77 1.00 56.89 179606 kB and with neither -fschedule-insns nor -fsched-pressure: integrated RA : 6.40 (21%) usr 0.05 ( 5%) sys 6.41 (21%) wall 5087 kB ( 3%) ggc scheduling 2 : 0.58 ( 2%) usr 0.01 ( 1%) sys 0.60 ( 2%) wall 244 kB ( 0%) ggc TOTAL : 29.84 0.98 30.83 176587 kB So pre--register allocation instruction scheduling even without the new register pressure--aware algorithm takes quite a bit of time. I'll try to build a profiled gcc, and then if I find something I'll put it in a new PR. Brad
I see. I though you compared '-fschedule-insns' and '-fschedule-insns -fsched-pressure'. Your numbers shows the same as I reported for SPEC2000. The -fsched-pressure adds upto 3% compiler time (for power6) on x86 and x86_64 it is practically the same time (although -fsched-pressure does more job, it is compensated by smaller # of insns after RA). I did not analyzed your code but I think your code is quite specific and probably has a lot of parallelism which creates long ready and queue list. You could play with param max-pending-list-length to see will it help.
The output of gprof on this example is at http://www.math.purdue.edu/~lucier/bugzilla/11/gprof.out.gz Everything that takes more than a second is Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 10.73 4.45 4.45 15565 0.00 0.00 pop_scope 7.28 7.47 3.02 314259938 0.00 0.00 free_list 7.04 10.39 2.92 5575 0.00 0.00 dfs_enumerate_from 5.62 12.72 2.33 314988148 0.00 0.00 alloc_INSN_LIST 5.28 14.91 2.19 5292 0.00 0.00 get_loop_exit_edges 5.14 17.04 2.13 331244515 0.00 0.00 bitmap_set_bit 3.28 18.40 1.36 135329 0.00 0.00 sched_analyze_insn 3.09 19.68 1.28 29650 0.00 0.00 free_deps 2.75 20.82 1.14 21773210 0.00 0.00 bitmap_bit_p 2.35 21.80 0.98 14093247 0.00 0.00 dominated_by_p 1.99 22.62 0.83 5357385 0.00 0.00 bitmap_ior_into 1.88 23.40 0.78 199 0.00 0.00 inverted_post_order_compute 1.57 24.05 0.65 342 0.00 0.01 df_worklist_dataflow 1.37 24.62 0.57 51278357 0.00 0.00 decl_jump_unsafe 1.35 25.18 0.56 26181017 0.00 0.00 flow_bb_inside_loop_p 1.13 25.65 0.47 201 0.00 0.00 post_order_compute Nothing immediate jumps out at me. Brad
The gprof output on the _num.i example, with and without -fschedule-insns is at http://www.math.purdue.edu/~lucier/bugzilla/11/gprof.out-fschedule-insns.gz http://www.math.purdue.edu/~lucier/bugzilla/11/gprof.out-fnoschedule-insns.gz
*** Bug 41531 has been marked as a duplicate of this bug. ***
http://gcc.gnu.org/ml/gcc-patches/2009-09/msg00003.html was committed to trunk http://gcc.gnu.org/viewcvs?view=revision&revision=151348
Re. comment #25: So does that mean this is not a 4.5 regression anymore? If so, please adjust the summary also.
Right. On trunk both the reduced testcase and the original testcase pass without ICE. I forgot to take 4.5 out of the summary, thanks for reminding me.
GCC 4.3.5 is being released, adjusting target milestone.
4.3 branch is being closed, moving to 4.4.7 target.
Fixed in 4.5+, 4.4 is no longer supported.