This is the mail archive of the
gcc@gcc.gnu.org
mailing list for the GCC project.
RE: DFA Scheduler - unable to pipeline loads
- From: "Ye, Joey" <joey dot ye at intel dot com>
- To: "Matt Lee" <reachmatt dot lee at gmail dot com>, <gcc at gcc dot gnu dot org>
- Date: Mon, 3 Sep 2007 17:35:43 +0800
- Subject: RE: DFA Scheduler - unable to pipeline loads
- References: <e94302560708311457l59598381tae5a2588fff63c43@mail.gmail.com>
Matt,
I just started working on pipeline description and I'm confused one thing in your description.
For "integer", your cpu have a 1-cycle latency, but with 3 units stages "issue,iu,wb". What does that mean? My understanding is that the number of units seperated by "," should be equal to latency. Am I right?
Thanks - Joey
-----Original Message-----
From: gcc-owner@gcc.gnu.org [mailto:gcc-owner@gcc.gnu.org] On Behalf Of Matt Lee
Sent: 2007年9月1日 5:58
To: gcc@gcc.gnu.org
Subject: DFA Scheduler - unable to pipeline loads
Hi,
I am working with GCC-4.1.1 on a simple 5-pipe stage simple scalar
RISC processors with the following description for loads and stores,
(define_insn_reservation "integer" 1
(eq_attr "type" "branch,jump,call,arith,darith,icmp,nop")
"issue,iu,wb")
(define_insn_reservation "load" 3
(eq_attr "type" "load")
"issue,iu,wb")
(define_insn_reservation "store" 1
(eq_attr "type" "store")
"issue,iu,wb")
I am seeing poor scheduling in Dhrystone where a memcpy call is
expanded inline.
memcpy (&dst, &src, 16) ==>
load 1, rA + 4
store 1, rB + 4
load 2, rA + 8
store 2, rB + 8
...
Basically, instead of pipelining the loads, the current schedule
stalls the processor for two cycles on every dependent store. Here is
a dump from the .35.sched1 file.
;; ======================================================
;; -- basic block 0 from 6 to 36 -- before reload
;; ======================================================
;; 0--> 6 r84=r5 :issue,iu,wb
;; 1--> 13 r86=[`Ptr_Glob'] :issue,iu,wb
;; 2--> 25 r92=0x5 :issue,iu,wb
;; 3--> 12 r85=[r84] :issue,iu,wb
;; 4--> 14 r87=[r86] :issue,iu,wb
;; 7--> 15 [r85]=r87 :issue,iu,wb
;; 8--> 16 r88=[r86+0x4] :issue,iu,wb
;; 11--> 17 [r85+0x4]=r88 :issue,iu,wb
;; 12--> 18 r89=[r86+0x8] :issue,iu,wb
;; 15--> 19 [r85+0x8]=r89 :issue,iu,wb
;; 16--> 20 r90=[r86+0xc] :issue,iu,wb
;; 19--> 21 [r85+0xc]=r90 :issue,iu,wb
;; 20--> 22 r91=[r86+0x10] :issue,iu,wb
;; 23--> 23 [r85+0x10]=r91 :issue,iu,wb
;; 24--> 26 [r84+0xc]=r92 :issue,iu,wb
;; 25--> 31 clobber r3 :nothing
;; 25--> 36 use r3 :nothing
;; Ready list (final):
;; total time = 25
;; new head = 7
;; new tail = 36
There is an obvious better schedule to be obtained. Here is one such
(hand-modified) schedule which just pipelines two of the loads to
obtain a shorter critical path length to the whole function (function
has only bb 0)
;; 0--> 6 r84=r5 :issue,iu,wb
;; 1--> 13 r86=[`Ptr_Glob'] :issue,iu,wb
;; 2--> 25 r92=0x5 :issue,iu,wb
;; 3--> 12 r85=[r84] :issue,iu,wb
;; 4--> 14 r87=[r86] :issue,iu,wb
;; 7--> 15 [r85]=r87 :issue,iu,wb
;; 8--> 16 r88=[r86+0x4] :issue,iu,wb
;; 9--> 18 r89=[r86+0x8] :issue,iu,wb
;; 10--> 20 r90=[r86+0xc] :issue,iu,wb
;; 11--> 17 [r85+0x4]=r88 :issue,iu,wb
;; 12--> 19 [r85+0x8]=r89 :issue,iu,wb
;; 13--> 21 [r85+0xc]=r90 :issue,iu,wb
;; 14--> 22 r91=[r86+0x10] :issue,iu,wb
;; 17--> 23 [r85+0x10]=r91 :issue,iu,wb
;; 18--> 26 [r84+0xc]=r92 :issue,iu,mb_wb
;; 19--> 31 clobber r3 :nothing
;; 20--> 36 use r3 :nothing
;; Ready list (final):
;; total time = 20
;; new head = 7
;; new tail = 36
This schedule is 5 cycles faster.
I have read and re-read the material surrounding the DFA scheduler. I
understand that the heuristics optimize critical path length and not
stalls or other metrics. But in this case it is precisely the critical
path length that is shortened by the better schedule. I have been
examining various hooks available and for a while it seemed like
TARGET_SCHED_FIRST_CYCLE_MULTIPASS_DFA_LOOKAHEAD must be set to a
larger window to look for better candidates to schedule into the ready
queue. For instance, this discussion seems to say so.
http://gcc.gnu.org/ml/gcc/2002-05/msg01132.html
But a post that follows soon after seems to imply otherwise.
http://gcc.gnu.org/ml/gcc/2002-05/msg01388.html
Both posts are from Vladimir. In any case the final conclusion seems
to be that the lookahead is useful only for multi-issue processors.
So what is wrong in my description? How do I get GCC to produce the
optimal schedule?
I appreciate any help as I have run into a similiar scheduling
situation with other architectures in the past (GCC seems to choose
not to pipeline things as much as it can) and this time I would like
to understand it a bit more.
Compile flags used are basically, -O3.
--
thanks,
Matt