Bug 24319 - [4.4 regression] amd64 register spill error with -fschedule-insns
Summary: [4.4 regression] amd64 register spill error with -fschedule-insns
Status: RESOLVED FIXED
Alias: None
Product: gcc
Classification: Unclassified
Component: rtl-optimization (show other bugs)
Version: 4.0.2
: P5 normal
Target Milestone: 4.5.0
Assignee: Not yet assigned to anyone
URL: http://gcc.gnu.org/ml/gcc-patches/200...
Keywords: ice-on-valid-code, ra
: 41164 (view as bug list)
Depends on:
Blocks: 32647
  Show dependency treegraph
 
Reported: 2005-10-11 20:48 UTC by Doug Coleman
Modified: 2012-03-13 12:59 UTC (History)
8 users (show)

See Also:
Host:
Target: x86_64-*-linux-gnu
Build:
Known to work: 3.3.3, 4.5.0
Known to fail: 3.4.0, 4.0.0, 4.0.4, 4.1.0
Last reconfirmed: 2009-03-20 19:36:44


Attachments
string functions for Factor (24.43 KB, text/plain)
2005-10-11 20:49 UTC, Doug Coleman
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Doug Coleman 2005-10-11 20:48:30 UTC
erg@frogger:~/Factor/Factor$ gcc -O -fschedule-insns native/string.c native/string.c: In function ‘from_c_string’:
native/string.c:103: error: unable to find a register to spill in class ‘DIREG’
native/string.c:103: error: this is the insn:
(insn 13 39 14 0 (parallel [
            (set (reg:DI 2 cx [63])
                (unspec:DI [
                        (mem:BLK (reg/f:DI 1 dx [orig:65 c_string ] [65]) [0 A8])
                        (reg:QI 0 ax [67])
                        (const_int 1 [0x1])
                        (reg:DI 2 cx [66])
                    ] 20))
            (use (reg:SI 19 dirflag))
            (clobber (reg/f:DI 1 dx [orig:65 c_string ] [65]))
            (clobber (reg:CC 17 flags))
        ]) 649 {*strlenqi_rex_1} (insn_list:REG_DEP_TRUE 11 (insn_list:REG_DEP_TRUE 9 (insn_list:REG_DEP_TRUE 10 (insn_list:REG_DEP_TRUE 12 (nil)))))
    (expr_list:REG_DEAD (reg:SI 19 dirflag)
        (expr_list:REG_DEAD (reg:DI 2 cx [66])
            (expr_list:REG_DEAD (reg:QI 0 ax [67])
                (expr_list:REG_DEAD (reg/f:DI 1 dx [orig:65 c_string ] [65])
                    (expr_list:REG_UNUSED (reg:CC 17 flags)
                        (expr_list:REG_UNUSED (reg/f:DI 1 dx [orig:65 c_string ] [65])                            (nil))))))))
native/string.c:103: confused by earlier errors, bailing out
Comment 1 Doug Coleman 2005-10-11 20:49:20 UTC
Created attachment 9972 [details]
string functions for Factor
Comment 2 Andrew Pinski 2005-10-11 20:51:40 UTC
We have another report of this before.  Basicially the ra in gcc needs a lot of help to fix this.
Comment 3 Serge Belyshev 2005-10-11 21:07:39 UTC
// reduced testcase

int bar (char *s)
{
  return foo (strlen(s));
}
Comment 4 Serge Belyshev 2005-10-11 23:18:36 UTC
Confirmed, broken by this patch: http://gcc.gnu.org/ml/gcc-patches/2003-01/msg02147.html
Comment 5 Andrew Pinski 2005-10-11 23:43:19 UTC
(In reply to comment #4)
> Confirmed, broken by this patch:
> http://gcc.gnu.org/ml/gcc-patches/2003-01/msg02147.html

This patch just exposes a latent bug in the RA, not be able to handle stuf like this.
Comment 6 Uroš Bizjak 2005-11-09 15:27:23 UTC
The problem is caused by the combination of (1) x86_64 parameter passing convention, (2) x86 instructions that _require_ parameters in specific registers and (3) sched1 scheduling pass.

ad 1)

x86_64 passes function parameters in registers in the order, defined in x86_64_int_parameter_registers[] array.

  5 /*RDI*/, 4 /*RSI*/, 1 /*RDX*/, 2 /*RCX*/,
  FIRST_REX_INT_REG /*R8 */, FIRST_REX_INT_REG + 1 /*R9 */

Additionally, RAX is used as a hidden argument register.

In original example, call sequence to "memory_to_string" is constructed as:

(insn 17 15 18 0 (set (reg:DI 4 si)
        (reg:DI 61)) 81 {*movdi_1_rex64} (insn_list:REG_DEP_TRUE 15 (nil))
    (expr_list:REG_DEAD (reg:DI 61)
        (nil)))

(insn 18 17 19 0 (set (reg:DI 5 di [ c_string ])
        (reg/v/f:DI 60 [ c_string ])) 81 {*movdi_1_rex64} (nil)
    (expr_list:REG_DEAD (reg/v/f:DI 60 [ c_string ])
        (nil)))

(call_insn 19 18 20 0 (set (reg:DI 0 ax)
        (call (mem:QI (symbol_ref:DI ("memory_to_string") [flags 0x3] <function_decl 0x4044f080 memory_to_string>) [0 S1 A8])
            (const_int 0 [0x0]))) 732 {*call_value_0_rex64} (insn_list:REG_DEP_TRUE 17 (insn_list:REG_DEP_TRUE 18 (nil)))
    (expr_list:REG_DEAD (reg:DI 4 si)
        (expr_list:REG_DEAD (reg:DI 5 di [ c_string ])
            (expr_list:REG_EH_REGION (const_int 0 [0x0])
                (nil))))
    (expr_list:REG_DEP_TRUE (use (reg:DI 5 di [ c_string ]))
        (expr_list:REG_DEP_TRUE (use (reg:DI 4 si))
            (nil))))


ad 2)

Please note, that this sequence can be found just after *strlenqi_rex_1 mega-pattern. This pattern requires parameters to be put in excactly defined registers:

(define_insn "*strlenqi_rex_1"
  [(set (match_operand:DI 0 "register_operand" "=&c")
	(unspec:DI [(mem:BLK (match_operand:DI 5 "register_operand" "1"))
		    (match_operand:QI 2 "register_operand" "a")
		    (match_operand:DI 3 "immediate_operand" "i")
		    (match_operand:DI 4 "register_operand" "0")] UNSPEC_SCAS))
   (use (reg:SI DIRFLAG_REG))
   (clobber (match_operand:DI 1 "register_operand" "=D"))
   (clobber (reg:CC FLAGS_REG))]

However, at the time of sched1 pass (before reload) hard registers are not known yet. We have following RTL pattern just above "memory_to_string" call sequence (reg_notes are not shown for clarity):

(insn 13 12 14 0 (parallel [
            (set (reg:DI 63)
                (unspec:DI [
                        (mem:BLK (reg/f:DI 65 [ c_string ]) [0 A8])
                        (reg:QI 67)
                        (const_int 1 [0x1])
                        (reg:DI 66)
                    ] 20))
            (use (reg:SI 19 dirflag))
            (clobber (reg/f:DI 65 [ c_string ]))
            (clobber (reg:CC 17 flags))
        ]) 511 {*strlenqi_rex_1}


ad 3)

Sched1 pass is free to move (insn 17) and (insn 18) before (insn 13) as it doesn't recognize register allocating conflicts between these instructions. Following that move, reload has no registers to spill and ICEs.

The testcase from comment #3 ICEs with:
error: unable to find a register to spill in class âAREGâ

Here, the same problem could be observed. As "foo" is missing a prototype, hidden RAX register gets allocated in addition to RDI:

(insn 20 18 21 0 (set (reg:DI 5 di)
        (reg:DI 61)) 81 {*movdi_1_rex64} (insn_list:REG_DEP_TRUE 18 (nil))
    (expr_list:REG_DEAD (reg:DI 61)
        (nil)))

(insn 21 20 22 0 (set (reg:QI 0 ax)
        (const_int 0 [0x0])) 55 {*movqi_1} (nil)
    (nil))

(call_insn 22 21 23 0 (set (reg:SI 0 ax)
        (call (mem:QI (symbol_ref:DI ("foo") [flags 0x41] <function_decl 0x402cbd80 foo>) [0 S1 A8])
            (const_int 0 [0x0]))) 732 {*call_value_0_rex64} (insn_list:REG_DEP_TRUE 20 (insn_list:REG_DEP_TRUE 21 (nil)))
    (expr_list:REG_DEAD (reg:DI 5 di)
        (nil))
    (expr_list:REG_DEP_TRUE (use (reg:QI 0 ax))
        (expr_list:REG_DEP_TRUE (use (reg:DI 5 di))
            (nil))))

This AX register is then moved before "strlenqi_rex_1" pattern and this blocks the AX register. (BTW: If prototype of "foo" is added, this particular testcase compiles OK.)

One possible fix to this problem would be not to schedule instructions that have assigned hard registers (move insns in above case). Considering the number of x86 instructions, that require fixed registers I would suggest bugmasters to raise the priority of this bug.

The x86 backend should not have these problems, but using -mregparm=X I think it could also be tricked to this sort of ICEs.

(BTW: I have added Jim Wilson to CC of this bug as he is current maintaine of insn scheduling pass code. Perhaps he has some ideas on how to solve this problem.)
Comment 7 Jim Wilson 2005-11-09 21:07:09 UTC
Subject: Re:  [3.4/4.0/4.1 regression] amd64
	register spill error with -fschedule-insns

On Wed, 2005-11-09 at 07:27, uros at kss-loka dot si wrote:
> (BTW: I have added Jim Wilson to CC of this bug as he is current maintaine of
> insn scheduling pass code. Perhaps he has some ideas on how to solve this
> problem.)

Vlad's name really should be first in the list, as he is the only one of
the 4 listed people that is actively working on the scheduler.

This problem only shows up when the -fschedule-insns option is used.
The x86 port deliberately disables this by default for good reasons.

One possible solution is to add hooks to the x86 backend that emit an
error when a user specifies the -fschedule-insns option, since it isn't
expected to work, and if it did work, it would more likely than not
result in worse code.  I think this is the best short term option.

Another possible solution is to modify the strlenqi_rex_1 pattern to use
explicit hard registers instead of constraints that specify one
register.  There is little point in trying to register allocate a
pattern when there is only one possible register that can be used here.
Using a hard register would expose the dependencies to the scheduler,
preventing the scheduler from performing bad optimizations.  The
downside here is that use of hard registers may confuse some
optimization passes, or expose limitations in some optimization passes,
resulting in some performance loss.  Also, you will have to find and fix
all existing patterns which have this problem.  Doing this will be error
prone, and may cause additional problems, and/or may result in an
incomplete fix if some patterns are missed.  I doubt that this is worth
the effort.

As for a scheduler fix, we could perhaps test SMALL_REGISTER_CLASSES,
and if defined, then make all insns that use a hard register into a
scheduling barrier (e.g. a call to flush_pending_lists).  Most all
targets that define SMALL_REGISTER_CLASSES already disables the first
scheduling pass, so this probably won't break anything.  This should be
verified.  This is going to severely limit the optimizations that sched1
can perform, which may make it pointless, but at least then the
-fschedule-insns option should work for x86.

Since it was changes to SCHED_GROUP handling that exposed this problem,
perhaps the SCHED_GROUP handling code can be conditional on
SMALL_REGISTER_CLASSES.  For instance, maybe we can put the insns
loading parameters into registers into the SCHED_GROUP for
SMALL_REGISTER_CLASSES targets.  This would prevent moving such insns
during sched1.  This would mean adding code like the existing
in_post_call_group_p stuff, except for insns before a call instead of
insns after a call.  I think this is the best long term solution.

As for a register allocator fix, we could extend the RA to be able to
spill and reload hard registers, but the resulting code in this case
would be so bad that I see no point in even trying.

Since end users will gain little benefit from being able to run the
sched1 pass on x86 code, I don't think this is a serious problem.


Comment 8 Mark Mitchell 2006-05-25 02:36:23 UTC
Will not be fixed in 4.1.1; adjust target milestone to 4.1.2.
Comment 9 Uroš Bizjak 2007-10-11 10:32:38 UTC
(In reply to comment #3)

> int bar (char *s)
> {
>   return foo (strlen(s));
> }

The testcase above fails with AREG spill failure. The testcase below fails with DIREG spill failure:

int bar (int x, long l);

int foo (char *s)
{
  return bar (1, strlen(s));
}
Comment 10 Joseph S. Myers 2008-07-04 20:05:19 UTC
Closing 4.1 branch.
Comment 11 Steven Bosscher 2009-03-20 19:36:44 UTC
Reconfirmed for test cases of comment #9 (same spill failures).
Comment 12 Vladimir Makarov 2009-03-23 17:04:30 UTC
I started my work on register pressure sensitive insn scheduling recently.  This bug will be fixed as byproduct of this work.  I hope the code will be available for gcc4.5. 
Comment 13 Joseph S. Myers 2009-03-31 18:58:23 UTC
Closing 4.2 branch.
Comment 14 Richard Biener 2009-08-04 12:27:04 UTC
GCC 4.3.4 is being released, adjusting target milestone.
Comment 15 Uroš Bizjak 2009-08-27 17:33:26 UTC
*** Bug 41164 has been marked as a duplicate of this bug. ***
Comment 16 lucier 2009-08-28 16:54:14 UTC
Re: Comment 7:

Since end users will gain little benefit from being able to run the sched1 pass on x86 code, I don't think this is a serious problem.

PR33928 (comments 108 and 111) give an example where -fschedule-insns on x64-64 gives a 14% speedup on some direct and inverse FFT codes, certainly not a trivial difference.
Comment 17 Uroš Bizjak 2009-09-01 12:08:16 UTC
Patch at http://gcc.gnu.org/ml/gcc-patches/2009-09/msg00003.html
Comment 18 lucier 2009-09-02 02:54:35 UTC
Vlad:

The patch works great in my tests so far, thanks.

After installing your patch on today's trunk so that -fschedule-insns actually works, I find it is quite expensive on large files.

For example, with today's trunk with your patches applied, for the file 

http://www.math.purdue.edu/~lucier/bugzilla/8/_num.i.gz

and the options

/pkgs/gcc-mainline-schedule/bin/gcc -Wno-unused -O1 -fno-math-errno -fschedule-insns2 -fno-trapping-math -fno-strict-aliasing -fwrapv -fomit-frame-pointer -fPIC -fno-common -mieee-fp -ftime-report -c _num.i

total CPU time on my x86-64 box is

 TOTAL                 :  29.60             0.92            30.54             176587 kB

while with -fschedule-insns it is

 scheduling            :  23.03 (42%) usr   0.02 ( 2%) sys  23.07 (41%) wall    2125 kB ( 1%) ggc
 TOTAL                 :  55.47             1.03            56.57             180793 kB

I don't know whether you can make it go faster now, or whether that's unreasonable and I should just wait and file another PR.

Brad
Comment 19 Vladimir Makarov 2009-09-02 16:14:47 UTC
  As I wrote, implementing register pressure-sensitive insn scheduling needs to look at all insns (ready or not) with resolved dependencies.  In an extreme cases, such insns could b 10-100 more than the ready ones.  Scheduling algorithm in gcc has at best Nlog(N) complexity where N is #insn with resolved dependencies but probably it is even worse.

  I could try to make some constraints on # of considered insns.

  I guess you should submit a new bug when the patch will be in the trunk.
Comment 20 lucier 2009-09-02 16:52:16 UTC
Vlad:

Thank you for your reply.

The times I reported are for "-fschedule-insns" without "-fpressure-sched".

The times with the addition of "-fpressure-sched" are not much greater than with "-fschedule-insns" by itself:

With -fschedule-insns

 scheduling            :  22.89 (41%) usr   0.02 ( 2%) sys  22.93 (40%) wall    2125 kB ( 1%) ggc
 integrated RA         :   9.15 (16%) usr   0.06 ( 6%) sys   9.21 (16%) wall    5488 kB ( 3%) ggc
 scheduling 2          :   0.60 ( 1%) usr   0.00 ( 0%) sys   0.62 ( 1%) wall     422 kB ( 0%) ggc
 TOTAL                 :  55.67             0.93            56.66             180793 kB

with -fschedule-insns -fsched-pressure

 scheduling            :  23.31 (42%) usr   0.02 ( 2%) sys  23.36 (41%) wall    2125 kB ( 1%) ggc
 integrated RA         :   9.18 (16%) usr   0.04 ( 4%) sys   9.22 (16%) wall    5517 kB ( 3%) ggc
 scheduling 2          :   0.58 ( 1%) usr   0.01 ( 1%) sys   0.58 ( 1%) wall     251 kB ( 0%) ggc
 TOTAL                 :  55.77             1.00            56.89             179606 kB

and with neither -fschedule-insns nor -fsched-pressure:

 integrated RA         :   6.40 (21%) usr   0.05 ( 5%) sys   6.41 (21%) wall    5087 kB ( 3%) ggc
 scheduling 2          :   0.58 ( 2%) usr   0.01 ( 1%) sys   0.60 ( 2%) wall     244 kB ( 0%) ggc
 TOTAL                 :  29.84             0.98            30.83             176587 kB

So pre--register allocation instruction scheduling even without the new register pressure--aware algorithm takes quite a bit of time.

I'll try to build a profiled gcc, and then if I find something I'll put it in a new PR.

Brad
Comment 21 Vladimir Makarov 2009-09-02 17:11:24 UTC
I see.  I though you compared '-fschedule-insns' and '-fschedule-insns -fsched-pressure'.

Your numbers shows the same as I reported for SPEC2000.  The -fsched-pressure adds upto 3% compiler time (for power6) on x86 and x86_64 it is practically the same time (although -fsched-pressure does more job, it is compensated by smaller # of insns after RA).

I did not analyzed your code but I think your code is quite specific and probably has a lot of parallelism which creates long ready and queue list.  You could play with param max-pending-list-length to see will it help.
Comment 22 lucier 2009-09-02 17:24:09 UTC
The output of gprof on this example is at

http://www.math.purdue.edu/~lucier/bugzilla/11/gprof.out.gz

Everything that takes more than a second is

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 10.73      4.45     4.45    15565     0.00     0.00  pop_scope
  7.28      7.47     3.02 314259938     0.00     0.00  free_list
  7.04     10.39     2.92     5575     0.00     0.00  dfs_enumerate_from
  5.62     12.72     2.33 314988148     0.00     0.00  alloc_INSN_LIST
  5.28     14.91     2.19     5292     0.00     0.00  get_loop_exit_edges
  5.14     17.04     2.13 331244515     0.00     0.00  bitmap_set_bit
  3.28     18.40     1.36   135329     0.00     0.00  sched_analyze_insn
  3.09     19.68     1.28    29650     0.00     0.00  free_deps
  2.75     20.82     1.14 21773210     0.00     0.00  bitmap_bit_p
  2.35     21.80     0.98 14093247     0.00     0.00  dominated_by_p
  1.99     22.62     0.83  5357385     0.00     0.00  bitmap_ior_into
  1.88     23.40     0.78      199     0.00     0.00  inverted_post_order_compute
  1.57     24.05     0.65      342     0.00     0.01  df_worklist_dataflow
  1.37     24.62     0.57 51278357     0.00     0.00  decl_jump_unsafe
  1.35     25.18     0.56 26181017     0.00     0.00  flow_bb_inside_loop_p
  1.13     25.65     0.47      201     0.00     0.00  post_order_compute

Nothing immediate jumps out at me.

Brad
Comment 23 lucier 2009-09-03 18:04:51 UTC
The gprof output on the _num.i example, with and without -fschedule-insns is at

http://www.math.purdue.edu/~lucier/bugzilla/11/gprof.out-fschedule-insns.gz
http://www.math.purdue.edu/~lucier/bugzilla/11/gprof.out-fnoschedule-insns.gz
Comment 24 Andrew Pinski 2009-10-01 06:55:45 UTC
*** Bug 41531 has been marked as a duplicate of this bug. ***
Comment 26 Steven Bosscher 2010-02-02 18:06:47 UTC
Re. comment #25:
So does that mean this is not a 4.5 regression anymore? If so, please adjust the summary also.
Comment 27 Sebastian Pop 2010-02-03 06:20:35 UTC
Right.  On trunk both the reduced testcase and the original testcase pass
without ICE.  I forgot to take 4.5 out of the summary, thanks for reminding me.
Comment 28 Richard Biener 2010-05-22 18:10:43 UTC
GCC 4.3.5 is being released, adjusting target milestone.
Comment 29 Richard Biener 2011-06-27 12:14:24 UTC
4.3 branch is being closed, moving to 4.4.7 target.
Comment 30 Jakub Jelinek 2012-03-13 12:59:48 UTC
Fixed in 4.5+, 4.4 is no longer supported.