Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips

Fri Oct 25 08:52:00 GMT 2013

> OK, so it is about 2%.  Did you check whether you need lookahead even in the early pass (before reload)?  My guess is that you do, but if not, it could cut the cost in half.  For -Ofast/-O3 it looks reasonable to me, but we will need to announce it on the ML.  For other settings I think we need to work on more improvements or cut the expenses.

Yes, lookahead is needed before reload as well.

I have another idea to consider. Can we enable lookahead with a value of 4 (including the pre-reload pass) by default? This should cut the build-time cost sharply.
I measured the build times of the benchmarks below with a lookahead value of 4. The roughly 2% build-time increase seen with a value of 8 is now almost gone.

Build time         dfa4       no_lookahead

 perlbench       - 191s          193s
 bzip2           - 19s           19s
 gcc             - 429s          429s
 mcf             - 3s            3s
 gobmk           - 116s          115s
 hmmer           - 60s           60s
 sjeng           - 18s           17s
 libquantum      - 6s            6s
 h264ref         - 107s          107s
 omnetpp         - 128s          128s
 astar           - 7s            7s
 bwaves          - 5s            5s
 gamess          - 1964s         1957s
 milc            - 18s           18s
 GemsFDTD        - 273s          272s

A lookahead value of 4 also works well because the modified decoder model in bdver3.md is only two cycles deep (although the hardware decoder is actually four cycles deep), so the scheduler can look two levels further ahead to find a better schedule.
GemsFDTD still retains the performance boost of around 6-7% with value 4.

Let me know your thoughts.

Regards
Ganesh

-----Original Message-----
From: Jan Hubicka [mailto:hubicka@ucw.cz] 
Sent: Thursday, October 24, 2013 6:48 PM
To: Gopalasubramanian, Ganesh
Cc: Jan Hubicka; gcc-patches@gcc.gnu.org; Uros Bizjak (ubizjak@gmail.com); H.J. Lu (hjl.tools@gmail.com)
Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips

> Hi,
> 
> > Is this with -fschedule-insns? Or only with default settings?  Did you test the compile-time implications of increasing the lookahead? (A value of 8 is very large; we may consider enabling it only for -Ofast, limiting it to the post-reload pass, or something similar.)
> 
> The improvement is seen with the options "-fschedule-insns -fschedule-insns2 -fsched-pressure"
> 
> Below are the build times of some of the SPEC benchmarks
> 
> Build time        dfa8       no_lookahead
> 
> perlbench       - 196s          193s
> bzip2           - 19s           19s
> gcc             - 439s          429s
> mcf             - 3s            3s
> gobmk           - 119s          115s
> hmmer           - 62s           60s
> sjeng           - 18s           17s
> libquantum      - 6s            6s
> h264ref         - 110s          107s
> omnetpp         - 132s          128s
> astar           - 7s            7s
> bwaves          - 4s            5s
> gamess          - 1996s         1957s
> milc            - 18s           18s
> GemsFDTD        - 276s          272s
> 
> I think we can enable it by default rather than only for -Ofast.
> Please let me know your thoughts.

OK, so it is about 2%.  Did you check whether you need lookahead even in the early pass (before reload)?  My guess is that you do, but if not, it could cut the cost in half.  For -Ofast/-O3 it looks reasonable to me, but we will need to announce it on the ML.  For other settings I think we need to work on more improvements or cut the expenses.

Honza
> 
> Regards
> Ganesh
> 
> -----Original Message-----
> From: Jan Hubicka [mailto:hubicka@ucw.cz]
> Sent: Thursday, October 24, 2013 2:54 PM
> To: Gopalasubramanian, Ganesh
> Cc: gcc-patches@gcc.gnu.org; Uros Bizjak (ubizjak@gmail.com); 
> hubicka@ucw.cz; H.J. Lu (hjl.tools@gmail.com)
> Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for 
> modern x86 chips
> 
> > Attached is the patch which does the following scheduler related changes.
> > * re-models bdver3 decoder.
> > * It enables lookahead with a value of 8 for all BD architectures. The patch does not yet consider whether reload has completed (an area that needs more work).
> > * The issue rate for BD architectures is set to 4.
> > 
> > I see the following performance improvements on bdver3 machine.
> > * GemsFDTD improves by 6-7% with the lookahead value changed to 8.
> > * Hmmer improves by 9% when the issue rate is set to 4.
> 
> Is this with -fschedule-insns? Or only with default settings?  Did you test the compile-time implications of increasing the lookahead? (A value of 8 is very large; we may consider enabling it only for -Ofast, limiting it to the post-reload pass, or something similar.)
> 
> > 
> > I have considered the following hardware details for the model.
> > * There are four decoders inside a hardware decoder block.
> > * These four independent decoders can execute in parallel.  (They can take 8B from each of four different instructions and decode them.)
> > * These four decoders are pipelined 4 cycles deep and are non-stalling.
> > * Each decoder takes 8B of instruction data every cycle and tries to decode it.
> > * Issue rate is 4.
> What is the overall limit on the number of bytes the instructions can occupy?
> I think they need to fit into two 16-byte windows, right?
> In that case we may want to tweak the existing corei7 scheduling code to take care of this.  Keeping the scheduler from being overly optimistic about parallelism is good, since it reduces register pressure during the first pass.
> > 
> > Is it OK for upstream?
> 
> Otherwise the patch seems OK, but I would like to know the compile time effect first.
> 
> Honza
> > 
> > Changelog
> > ========
> > 2013-10-24  Ganesh Gopalasubramanian  <Ganesh.Gopalasubramanian@amd.com>
> > 
> > 	* config/i386/bdver3.md: Added two additional decoder units
> > 	to support an issue rate of 4 and remodeled the vector unit.
> > 
> > 	* config/i386/i386.c (ix86_issue_rate): Issue rate for BD
> > 	architectures is set to 4.
> > 
> > 	* config/i386/i386.c (ia32_multipass_dfa_lookahead): DFA
> > 	lookahead is set to 8 for BD architectures.
> > 
> > Regards
> > Ganesh
> > 
> 
> 
>