This is the mail archive of the gcc-patches@gcc.gnu.org mailing list for the GCC project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

RE: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips


Hi Honza, 

> Yep, I think we need to merge only those autmatas tha are same for both:
> (define_automaton "bdver3,bdver3_ieu,bdver3_load,bdver3_fp,bdver3_agu")
> probably can become
> (define_automaton "bdver3,bdver3_fp")
> with the corresponding reservations using bdver3_ieu,bdver3_load,bdver3_agu changed to bdver1
> automaton.  I think it should result in smaller binary - the fact that all conditionals are
> physically duplicated in bdver1/bdev3.md should be optimized away by genautomata.

Before merging the insn reservations, I need to compare the latency values for bdver1 and bdver3. I know that they are different for some of the instructions. 
In that case, the merging should prop up another subset of latency differences. I would like to keep these insn reservations in two .md files (one for bdver1 and one for bdver3) even after the merger.

> Your version has problem that it does not model the thing that the two decoders works sequentially.

The two stage modeling is required so that the decode unit reservations are screened from other unit reservations.
But this sort of goes away in bdver3 because of the decode cycle.
In bdver3, the decode units scan two of these windows every "two" cycles decoding a maximum of eight instructions.
The hardware scan is done every two cycles in bdver3 whereas it is done every single cycle in bdver1/bdver2. (But we have two separate hardware decoders which guarantees higher throughput)
This means that the two stage modeling is not required in the scheduler descriptions since the hardware sort of guarantees that with its scanning mechanism. 
Our job is to make sure that 8 direct instructions get scheduled in two cycles or 4 double instructions get scheduled in two cycles.
So, I have modeled the bdver3 decoders such that with in a cycle they guarantee to issue 4 direct instructions or 2 double instructions. 
This eliminates the sequencing problem in modeling decoders and also ensures that the issue rate can be numbered for a single cycle rather than two cycles.
This is one of the reasons why I remodeled only bdver3. Let me know your comments on this.

> We can also experiment with defining TARGET_SCHED_VARIABLE_ISSUE to get more realistic estimates on what still can be issued - the value of 6 is unrealistically high.
This would get more complicated if we go by decoder capacity in bdver3. As we have two hardware decoders in steamroller (bdver3), they have a capacity to decode eight instructions per clock cycle, providing up to twice the decode and dispatch bandwidth compared to bdver1.
If we model this in GCC we need to change the issue rate to 8. If 6 is high, then 8 would add more joy and excitement.

TARGET_SCHED_VARIABLE_ISSUE is a nice suggestion to schedule instructions in different way. 

> We also should enable ia32_multipass_dfa_lookahead - with that scheduler should be able to put double decoded and vector decoded insns on the proper places.
Yes. Whenever we have this scheduler analysis in place we discuss about this but unfortunately is left as it is.
I will look into this after I do the enablement for bdver4.

> I will work on replacing most of the CPU cases into tuning flags + costs.
I am planning to get bdver4 enablement in place once scheduler descriptions for bdver3 is done with.
I will have cycles to look into the cost models. Please delegate some tasks if you can and I am willing to take them up.

Regards
Ganesh

-----Original Message-----
From: Jan Hubicka [mailto:hubicka@ucw.cz] 
Sent: Tuesday, October 08, 2013 3:20 PM
To: Gopalasubramanian, Ganesh
Cc: Jan Hubicka; gcc-patches@gcc.gnu.org; hjl.tools@gmail.com
Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips

> Hi Honza,
> 
> I am planning to update the scheduler descriptions for bdver3 first.
> Attached is the patch. Please let me know your comments if any.
> 
> Though I agree on merging bdver1/2 and bdver3 on most parts, the FP lines and decoding schemes are different. So, let me know how can I approach merging these.

Yep, I think we need to merge only those autmatas tha are same for both:
(define_automaton "bdver3,bdver3_ieu,bdver3_load,bdver3_fp,bdver3_agu")
probably can become
(define_automaton "bdver3,bdver3_fp")
with the corresponding reservations using bdver3_ieu,bdver3_load,bdver3_agu changed to bdver1
automaton.  I think it should result in smaller binary - the fact that all conditionals are
physically duplicated in bdver1/bdev3.md should be optimized away by genautomata.

I also played a bit with the decoders and I am attaching my version - that seems SPEC neutral though.
Your version has problem that it does not model the thing that the two decoders works sequentially.

I removed the bdver1-decodev unit and instead i simply reserve all thre decoders + I added 
presence set requring second decoder to be taken only after first one changed presence set requring
decoder 2 to be taken only after decoder 1+2 to final presence set, so decoderv resetvation has
chance to pass.
Finally I added use-decodera that consumes all of the first decoder as soon as we start to allocate
second decoder - we can not really allocate them in interchanging order.

I also noticed that push/pop instructions are modeled as being vector, while manual says it 
generates one micro op unless memory operand is used.

I did not have much time to play further with this except for manual inspection of schedules that
seems better now and in rare cases handle 4-5 instructions per cycle.

We also should enable ia32_multipass_dfa_lookahead - with that scheduler should be able to put double decoded
and vector decoded insns on the proper places.

We can also experiment with defining TARGET_SCHED_VARIABLE_ISSUE to get more realistic
estimates on what still can be issued - the value of 6 is unrealistically high.

Seems like with addition of Atom the scheduler macros became very twisty maze of passages.
I will work on replacing most of the CPU cases into tuning flags + costs.

What do you think?
Honza


Index: bdver1.md
===================================================================
--- bdver1.md	(revision 203204)
+++ bdver1.md	(working copy)
@@ -41,7 +41,9 @@
 (define_cpu_unit "bdver1-decode0" "bdver1")
 (define_cpu_unit "bdver1-decode1" "bdver1")
 (define_cpu_unit "bdver1-decode2" "bdver1")
-(define_cpu_unit "bdver1-decodev" "bdver1")
+(define_cpu_unit "bdver1-decode0b" "bdver1")
+(define_cpu_unit "bdver1-decode1b" "bdver1")
+(define_cpu_unit "bdver1-decode2b" "bdver1")
 
 ;; Model the fact that double decoded instruction may take 2 cycles
 ;; to decode when decoder2 and decoder0 in next cycle
@@ -57,18 +59,26 @@
 ;; too.  Vector decoded instructions then can't be issued when modeled
 ;; as consuming decoder0+decoder1+decoder2.
 ;; We solve that by specialized vector decoder unit and exclusion set.
-(presence_set "bdver1-decode2" "bdver1-decode0")
-(exclusion_set "bdver1-decodev" "bdver1-decode0,bdver1-decode1,bdver1-decode2")
-
-(define_reservation "bdver1-vector" "nothing,bdver1-decodev")
-(define_reservation "bdver1-direct1" "nothing,bdver1-decode1")
+(final_presence_set "bdver1-decode2" "bdver1-decode0,bdver1-decode1")
+(presence_set "bdver1-decode0b,bdver1-decode1b,bdver1-decode2b" "bdver1-decode0,bdver1-decode1")
+(final_presence_set "bdver1-decode2b" "bdver1-decode0b,bdver1-decode1b")
+
+(define_reservation "use-decodera" "((bdver1-decode0 | nothing)
+				     + (bdver1-decode1 | nothing)
+				     + (bdver1-decode2 | nothing))")
+(define_reservation "bdver1-vector" "nothing,((bdver1-decode0+bdver1-decode1+bdver1-decode2)
+					      |(use-decodera+bdver1-decode0b+bdver1-decode1b+bdver1-decode2b))")
+(define_reservation "bdver1-direct1" "nothing,(bdver1-decode1|(use-decodera+bdver1-decode1b))")
 (define_reservation "bdver1-direct" "nothing,
 				     (bdver1-decode0 | bdver1-decode1
-				     | bdver1-decode2)")
+				     | bdver1-decode2 | (use-decodera+bdver1-decode0b)
+				     | (use-decodera+bdver1-decode1b) | (use-decodera+bdver1-decode2b))")
 ;; Double instructions behaves like two direct instructions.
 (define_reservation "bdver1-double" "((bdver1-decode2,bdver1-decode0)
 				     | (nothing,(bdver1-decode0 + bdver1-decode1))
-				     | (nothing,(bdver1-decode1 + bdver1-decode2)))")
+				     | (nothing,(bdver1-decode1 + bdver1-decode2))
+				     | (nothing,(use-decodera + bdver1-decode0b + bdver1-decode1b))
+				     | (nothing,(use-decodera + bdver1-decode1b + bdver1-decode2b)))")
 
 
 (define_cpu_unit "bdver1-ieu0" "bdver1_ieu")
@@ -131,17 +141,28 @@
 			      (eq_attr "type" "call,callv"))
 			 "bdver1-double,bdver1-agu")
 ;; PUSH mem is double path.
+(define_insn_reservation "bdver1_pushmem" 1
+			 (and (eq_attr "cpu" "bdver1,bdver2")
+			      (and (eq_attr "type" "push")
+				   (eq_attr "memory" "both")))
+			 "bdver1-direct,bdver1-load,bdver1-store")
 (define_insn_reservation "bdver1_push" 1
 			 (and (eq_attr "cpu" "bdver1,bdver2")
 			      (eq_attr "type" "push"))
-			 "bdver1-direct,bdver1-agu,bdver1-store")
+			 "bdver1-direct,bdver1-store")
 ;; POP r16/mem are double path.
+;; 16bit pops are not really used by GCC.
+(define_insn_reservation "bdver1_popmem" 1
+			 (and (eq_attr "cpu" "bdver1,bdver2")
+			      (and (eq_attr "type" "pop")
+				   (eq_attr "memory" "both")))
+			 "bdver1-direct,bdver1-load,bdver1-store")
 (define_insn_reservation "bdver1_pop" 1
 			 (and (eq_attr "cpu" "bdver1,bdver2")
 			      (eq_attr "type" "pop"))
-			 "bdver1-direct,bdver1-ivector")
-;; LEAVE no latency info so far, assume same with amdfam10.
-(define_insn_reservation "bdver1_leave" 3
+			 "bdver1-direct,bdver1-load")
+;; By Agner Fog, latency is 4.
+(define_insn_reservation "bdver1_leave" 4
 			 (and (eq_attr "cpu" "bdver1,bdver2")
 			      (eq_attr "type" "leave"))
 			 "bdver1-vector,bdver1-ivector")
Index: i386.c
===================================================================
--- i386.c	(revision 203204)
+++ i386.c	(working copy)
@@ -24427,11 +25019,13 @@ ix86_issue_rate (void)
     case PROCESSOR_K8:
     case PROCESSOR_AMDFAM10:
     case PROCESSOR_GENERIC:
-    case PROCESSOR_BDVER1:
-    case PROCESSOR_BDVER2:
-    case PROCESSOR_BDVER3:
     case PROCESSOR_BTVER1:
       return 3;
+    case PROCESSOR_BDVER3:
+      return 4;
+    case PROCESSOR_BDVER1:
+    case PROCESSOR_BDVER2:
+      return 6;
 
     case PROCESSOR_CORE2:
     case PROCESSOR_COREI7:
@@ -24697,11 +25291,27 @@ ix86_adjust_cost (rtx insn, rtx link, rt
     case PROCESSOR_GENERIC:
       memory = get_attr_memory (insn);
 
-      /* Stack engine allows to execute push&pop instructions in parall.  */
+      /* Stack engine allows to execute push&pop instructions in parallel. 
+	 ??? There seems to be no detailed documentation of AMDFAM10, perhaps
+	 it is actually equivalent to the stronger notion of engine bellow. */
       if (((insn_type == TYPE_PUSH || insn_type == TYPE_POP)
 	   && (dep_insn_type == TYPE_PUSH || dep_insn_type == TYPE_POP))
-	  && (ix86_tune != PROCESSOR_ATHLON && ix86_tune != PROCESSOR_K8))
+	  && (ix86_tune == PROCESSOR_AMDFAM10))
+	return 0;
+
+      /* For buldozer up, the stack engine makes value of stack pointer available
+	 immediately, no mater about the use.   (i.e. when ESP is used as pointer
+	 or for arithmetic, the cost is bypassed, too.)  */
+      if (ix86_tune >= PROCESSOR_BDVER1
+	  && dep_insn_type == TYPE_PUSH)
 	return 0;
+      if (ix86_tune >= PROCESSOR_BDVER1
+	  && dep_insn_type == TYPE_POP)
+	{
+	  rtx dest = SET_DEST (PATTERN (dep_insn));
+	  if (REG_P (dest) && !reg_referenced_p (dest, PATTERN (insn)))
+	    return 0;
+	}
 
       /* Show ability of reorder buffer to hide latency of load by executing
 	 in parallel with previous instruction in case



Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]