
Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips

Message ID EB4625145972F94C9680D8CADD6516155C21BC9E@satlexdag03.amd.com
State New

Commit Message

Gopalasubramanian, Ganesh Oct. 11, 2013, 5:27 p.m. UTC
Hi Honza!

> I will give it a try.
I understand how the proposed patch would look. That is OK with me.

> OK, my understanding always was that the decoders work sort of sequentially.
>
> One decoder of bdver1 can do
> vector,
> double, single,
> single, single, single
>
> where decoder 1 is somehow special but the hardware is able to swap the first and second single.  Now if two decoders run, I would expect
>
> vector, vector
> single, single, single
> double, single, double, single
> 
> to be decoded in one cycle

My understanding of the decode unit is described below. Please correct me if I am wrong.

There is no sequential allotment of decoders for bdver1.
Intel Sandybridge/Core2 have four decoders, and the first decoder is special on Intel processors.
For example, in Sandybridge, instructions that consist of only one µop can be decoded by any of the four decoders,
while instructions that have up to four µops are decoded by the first decoder only.

Bdver1 also has four decoders, but none of them is special, unlike Intel processors.
For bdver1, microcoded instructions are single issue: all four decoders are engaged while a microcoded instruction is decoded.

The decode unit of bdver3 has the following specifications:
* Four independent decoders which can execute in parallel.
* Microcoded instructions are single issue (all four decoders are engaged), which means that only one vectorpath instruction gets issued per cycle.
* The additional hardware instruction decoder increases the instruction decode capacity to eight instructions per clock cycle.
* The decoders are pipelined 4 cycles deep and are non-stalling.

So modeling vectorpath instructions is straightforward: no other instructions are issued along with a vector instruction.
We need to model only fastpath single and fastpath double instructions.
There are four decoders and they can execute in parallel, so they can take either two double or four single instructions.
We also don't need to model them in two stages, as there is no sequencing involved.
So the modeling can be done such that in one cycle we schedule 2 singles + 1 double, or 4 singles, or 2 doubles.
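In genautomata terms, that per-cycle capacity could be written roughly like this (a condensed sketch with made-up unit and reservation names, just to illustrate the constraint; the actual proposed reservations are in the patch at the end of this mail):

(define_automaton "sketch")
(define_cpu_unit "dec0,dec1,dec2,dec3" "sketch")
;; A fastpath single occupies any one of the four decode slots.
(define_reservation "sketch-single" "(dec0|dec1|dec2|dec3)")
;; A fastpath double occupies two slots, so 4 singles, 2 doubles, or
;; 2 singles + 1 double fit into one cycle, as described above.
(define_reservation "sketch-double" "((dec0+dec1)|(dec2+dec3))")
;; A vectorpath instruction occupies all four slots and thus issues alone.
(define_reservation "sketch-vector" "(dec0+dec1+dec2+dec3)")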

I have tried to model this for bdver3 (code changes are mentioned below). Please let me know your opinion.

Regards
Ganesh

-----Original Message-----
From: Jan Hubicka [mailto:hubicka@ucw.cz]
Sent: Wednesday, October 09, 2013 7:18 PM
To: Gopalasubramanian, Ganesh
Cc: Jan Hubicka; gcc-patches@gcc.gnu.org; hjl.tools@gmail.com
Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips

> Before merging the insn reservations, I need to compare the latency values for bdver1 and bdver3. I know that they are different for some of the instructions. 
> In that case, the merging would surface another subset of latency differences. I would like to keep these insn reservations in two .md files (one for bdver1 and one for bdver3) even after the merge.

I am not really insisting on merging (define_insn_reservation "bdver3*") with (define_insn_reservation "bdver1*").  What I have in mind is merging the actual automatons in cases where it makes sense.  Latencies are not really encoded in those.

Bdver1/2 has:
(define_automaton "bdver1,bdver1_ieu,bdver1_load,bdver1_fp,bdver1_agu")
while bdver3 has:
(define_automaton "bdver3,bdver3_ieu,bdver3_load,bdver3_fp,bdver3_agu")

The automatons bdver1 and bdver3 are very different, because one handles up to 3 instructions while the other handles only 2.  I am still a bit confused by this every-second-cycle logic, so let's discuss it incrementally.

I would propose to have
(define_automaton "bdver3")
or perhaps
(define_automaton "bdver3,bdver3_fp")

Now if you look at the use of bdver1_ieu and bdver3_ieu we have:
jan@linux-9ure:~/trunk/gcc/config/i386> grep bdver1-ieu *.md
bdver1.md:(define_cpu_unit "bdver1-ieu0" "bdver1_ieu")
bdver1.md:(define_cpu_unit "bdver1-ieu1" "bdver1_ieu")
bdver1.md:(define_reservation "bdver1-ieu" "(bdver1-ieu0 | bdver1-ieu1)")
bdver1.md:(define_reservation "bdver1-ivector" "bdver1-ieu0+bdver1-ieu1+
bdver1.md:                       "bdver1-direct1,bdver1-ieu1")
bdver1.md:                       "bdver1-direct1,bdver1-ieu1")
bdver1.md:                         "bdver1-direct1,bdver1-load,bdver1-ieu1")
bdver1.md:                       "bdver1-direct1,bdver1-load,bdver1-ieu1")
bdver1.md:                       "bdver1-vector,(bdver1-ieu0*6+(bdver1-fpsched,bdver1-fvector))")
bdver1.md:                       "bdver1-vector,((bdver1-load,bdver1-ieu0*6)+(bdver1-fpsched,bdver1-fvector))")
bdver1.md:                       "bdver1-vector,bdver1-load,bdver1-ieu0*6")
bdver1.md:                       "bdver1-direct,bdver1-ieu")
bdver1.md:                       "bdver1-vector,bdver1-ieu,bdver1-ieu")
bdver1.md:                       "bdver1-direct,bdver1-load,bdver1-ieu")
bdver1.md:                       "bdver1-vector,bdver1-load,bdver1-ieu,bdver1-ieu")
bdver1.md:                        bdver1-ieu,bdver1-store,
bdver1.md:                        bdver1-ieu,
bdver1.md:                        bdver1-ieu,
bdver1.md:                       "bdver1-direct,(bdver1-ieu+bdver1-agu),
bdver1.md:                       "bdver1-vector,(bdver1-ieu+bdver1-agu),bdver1-ieu,
jan@linux-9ure:~/trunk/gcc/config/i386> grep bdver3-ieu *.md
bdver3.md:(define_cpu_unit "bdver3-ieu0" "bdver3_ieu")
bdver3.md:(define_cpu_unit "bdver3-ieu1" "bdver3_ieu")
bdver3.md:(define_reservation "bdver3-ieu" "(bdver3-ieu0|bdver3-ieu1)")
bdver3.md:(define_reservation "bdver3-ivector" "bdver3-ieu0+bdver3-ieu1+
bdver3.md:                       "bdver3-double,(bdver3-agu | bdver3-ieu),nothing")
bdver3.md:                       "bdver3-direct,bdver3-ieu,bdver3-store")
bdver3.md:                       "bdver3-direct,bdver3-ieu")
bdver3.md:                       "bdver3-direct,bdver3-ieu1")
bdver3.md:                       "bdver3-direct,bdver3-ieu1")
bdver3.md:                       "bdver3-direct,bdver3-load,bdver3-ieu1")
bdver3.md:                       "bdver3-direct,bdver3-load,bdver3-ieu1")
bdver3.md:                       "bdver3-direct,(bdver3-ieu|bdver3-agu)")
bdver3.md:                       "bdver3-direct,bdver3-load,bdver3-ieu")
bdver3.md:                       "bdver3-direct,bdver3-ieu,bdver3-store")
bdver3.md:                        bdver3-ieu,bdver3-store,
bdver3.md:                       "bdver3-direct,(bdver3-ieu+bdver3-agu),

While they are not always used the same way, it seems that bdver3 instructions have similar characteristics to bdver1 instructions, so unifying the automatons would save binary size.
This only means replacing
bdver3.md:(define_cpu_unit "bdver3-ieu0" "bdver1_ieu")
bdver3.md:(define_cpu_unit "bdver3-ieu1" "bdver1_ieu")
and removing bdver3_ieu from the list of automatons.

The automaton minimization should take care of the rest, and the resulting schedules should not change.
I will give it a try.
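Concretely, the suggested merge amounts to something like this in bdver3.md (a sketch only, assuming bdver1.md and bdver3.md are processed by the same genautomata run, which is the case for i386):

;; Drop bdver3_ieu (and similarly bdver3_load, bdver3_agu) from the automaton list:
(define_automaton "bdver3,bdver3_fp")
;; ...and place the bdver3 integer units into bdver1's automaton instead:
(define_cpu_unit "bdver3-ieu0" "bdver1_ieu")
(define_cpu_unit "bdver3-ieu1" "bdver1_ieu")
;; The insn reservations that reference bdver3-ieu0/bdver3-ieu1 stay unchanged.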
> 
> > Your version has the problem that it does not model the fact that the two decoders work sequentially.
> 
> The two-stage modeling is required so that the decode unit reservations are screened from other unit reservations.
> But this sort of goes away in bdver3 because of the decode cycle.
> In bdver3, the decode units scan two of these windows every "two" cycles, decoding a maximum of eight instructions.
> The hardware scan is done every two cycles in bdver3, whereas it is done every single cycle in bdver1/bdver2.
> (But we have two separate hardware decoders, which guarantees higher throughput.) This means that the two-stage modeling is not required in the scheduler descriptions, since the hardware sort of guarantees that with its scanning mechanism.

OK, my understanding always was that the decoders work sort of sequentially.
One decoder of bdver1 can do

vector,
double, single,
single, single, single

where decoder 1 is somehow special but the hardware is able to swap the first and second single.  Now if two decoders run, I would expect

vector, vector
single, single, single
double, single, double, single

to be decoded in one cycle, but not

single, vector, single
or double,double,single,single

A similar situation seems to exist for bdver3 too.  As I understand it, it does

single, single
double

So for example

a single,double,single,double,single,double,single,double pattern will take longer than a single,single,double,double,single,single,double,double pattern.

If decoders are allocated sequentially and the instruction order matters, modeling this precisely and enabling DFA_LOOKAHEAD should make GCC place things correctly into groups (plus or minus a lot of luck because of dispatch windows and size limitations; there is already some code for Atom to take care of this, perhaps it can be generalized).

Or is the pre-scan smart enough to allocate single, single into the first decoder and vector into the second?
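For reference, enabling the lookahead for bdver would mean adding cases to ia32_multipass_dfa_lookahead in i386.c, roughly as sketched below (the existing cases for other processors are omitted; the returned value is a guess and would need benchmarking, and this is not part of either patch in this thread):

/* Sketch only: let the first-cycle multipass scheduler look ahead about
   one decode group on the bdver targets.  */
static int
ia32_multipass_dfa_lookahead (void)
{
  switch (ix86_tune)
    {
    /* ... existing cases for other processors ... */

    case PROCESSOR_BDVER1:
    case PROCESSOR_BDVER2:
    case PROCESSOR_BDVER3:
      /* Look ahead as far as the issue rate.  */
      return ix86_issue_rate ();

    default:
      return 0;
    }
}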

> Our job is to make sure that 8 direct instructions get scheduled in two cycles or 4 double instructions get scheduled in two cycles.
> So, I have modeled the bdver3 decoders such that within a cycle they are guaranteed to issue 4 direct instructions or 2 double instructions.
> This eliminates the sequencing problem in modeling decoders and also ensures that the issue rate can be expressed for a single cycle rather than two cycles.
> This is one of the reasons why I remodeled only bdver3. Let me know your comments on this.
> 
> > We can also experiment with defining TARGET_SCHED_VARIABLE_ISSUE to get more realistic estimates on what still can be issued - the value of 6 is unrealistically high.
> This would get more complicated if we go by decoder capacity in bdver3. As we have two hardware decoders in Steamroller (bdver3), they have the capacity to decode eight instructions per clock cycle, providing up to twice the decode and dispatch bandwidth compared to bdver1.
> If we model this in GCC we need to change the issue rate to 8. If 6 is high, then 8 would add more joy and excitement.
> 
> TARGET_SCHED_VARIABLE_ISSUE is a nice suggestion to schedule instructions in a different way.
> 
> > We also should enable ia32_multipass_dfa_lookahead - with that scheduler should be able to put double decoded and vector decoded insns on the proper places.
> Yes. Whenever we have this scheduler analysis in place we discuss this, but unfortunately it is left as it is.
> I will look into this after I do the enablement for bdver4.

I benchmarked this with my bdver1/2 changes (I don't have bdver3 for benchmarking) and it does not seem to make much of a difference at this stage.  That may, however, simply be because the model is broken.  Also, SPEC is not the best benchmark for scheduling, since it has many other bounds.

> 
> > I will work on replacing most of the CPU cases into tuning flags + costs.
> I am planning to get bdver4 enablement in place once the scheduler descriptions for bdver3 are done.
> I will then have cycles to look into the cost models. Please delegate some tasks if you can; I am willing to take them up.

Thanks.  What I am working on now is deciding what the scheduling of generic should be.  For that I need to be sure that both Bulldozer and Core scheduling work well, so I can compare their benefits on both CPUs.  So getting through these issues is very useful.


Honza

> 
> Regards
> Ganesh
> 
> -----Original Message-----
> From: Jan Hubicka [mailto:hubicka@ucw.cz]
> Sent: Tuesday, October 08, 2013 3:20 PM
> To: Gopalasubramanian, Ganesh
> Cc: Jan Hubicka; gcc-patches@gcc.gnu.org; hjl.tools@gmail.com
> Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for 
> modern x86 chips
> 
> > Hi Honza,
> > 
> > I am planning to update the scheduler descriptions for bdver3 first.
> > Attached is the patch. Please let me know your comments if any.
> > 
> > Though I agree on merging bdver1/2 and bdver3 for most parts, the FP lines and decoding schemes are different. So let me know how I can approach merging these.
> 
> Yep, I think we need to merge only those automata that are the same for both:
> (define_automaton "bdver3,bdver3_ieu,bdver3_load,bdver3_fp,bdver3_agu")
> probably can become
> (define_automaton "bdver3,bdver3_fp")
> with the corresponding reservations using bdver3_ieu,bdver3_load,bdver3_agu
> changed to the bdver1 automaton.  I think it should result in a smaller binary -
> the fact that all conditionals are physically duplicated in bdver1/bdver3.md
> should be optimized away by genautomata.
> 
> I also played a bit with the decoders and I am attaching my version - that seems SPEC neutral though.
> Your version has the problem that it does not model the fact that the two decoders work sequentially.
> 
> I removed the bdver1-decodev unit and instead I simply reserve all three
> decoders.  I added a presence set requiring the second decoder to be taken
> only after the first one, and changed the presence set requiring decoder 2
> to be taken only after decoder 1+2 into a final presence set, so the decodev
> reservation has a chance to pass.
> Finally I added use-decodera, which consumes all of the first decoder as soon
> as we start to allocate the second decoder - we cannot really allocate them
> in interchanging order.
> 
> I also noticed that push/pop instructions are modeled as being vector,
> while the manual says they generate one micro-op unless a memory operand is used.
> 
> I did not have much time to play further with this except for manual
> inspection of the schedules, which seem better now and in rare cases handle 4-5 instructions per cycle.
> 
> We also should enable ia32_multipass_dfa_lookahead - with that the
> scheduler should be able to put double-decoded and vector-decoded insns in the proper places.
> 
> We can also experiment with defining TARGET_SCHED_VARIABLE_ISSUE to
> get more realistic estimates of what can still be issued - the value of 6 is unrealistically high.
> 
> Seems like with the addition of Atom the scheduler macros became a very twisty maze of passages.
> I will work on replacing most of the CPU cases into tuning flags + costs.
> 
> What do you think?
> Honza
> 
> 
> Index: bdver1.md
> ===================================================================
> --- bdver1.md	(revision 203204)
> +++ bdver1.md	(working copy)
> @@ -41,7 +41,9 @@
>  (define_cpu_unit "bdver1-decode0" "bdver1")
>  (define_cpu_unit "bdver1-decode1" "bdver1")
>  (define_cpu_unit "bdver1-decode2" "bdver1")
> -(define_cpu_unit "bdver1-decodev" "bdver1")
> +(define_cpu_unit "bdver1-decode0b" "bdver1")
> +(define_cpu_unit "bdver1-decode1b" "bdver1")
> +(define_cpu_unit "bdver1-decode2b" "bdver1")
>  
>  ;; Model the fact that double decoded instruction may take 2 cycles
>  ;; to decode when decoder2 and decoder0 in next cycle
> @@ -57,18 +59,26 @@
>  ;; too.  Vector decoded instructions then can't be issued when modeled
>  ;; as consuming decoder0+decoder1+decoder2.
>  ;; We solve that by specialized vector decoder unit and exclusion set.
> -(presence_set "bdver1-decode2" "bdver1-decode0")
> -(exclusion_set "bdver1-decodev" "bdver1-decode0,bdver1-decode1,bdver1-decode2")
> -
> -(define_reservation "bdver1-vector" "nothing,bdver1-decodev")
> -(define_reservation "bdver1-direct1" "nothing,bdver1-decode1")
> +(final_presence_set "bdver1-decode2" "bdver1-decode0,bdver1-decode1")
> +(presence_set "bdver1-decode0b,bdver1-decode1b,bdver1-decode2b" "bdver1-decode0,bdver1-decode1")
> +(final_presence_set "bdver1-decode2b" "bdver1-decode0b,bdver1-decode1b")
> +
> +(define_reservation "use-decodera" "((bdver1-decode0 | nothing)
> +				     + (bdver1-decode1 | nothing)
> +				     + (bdver1-decode2 | nothing))")
> +(define_reservation "bdver1-vector" "nothing,((bdver1-decode0+bdver1-decode1+bdver1-decode2)
> +					      |(use-decodera+bdver1-decode0b+bdver1-decode1b+bdver1-decode2b))")
> +(define_reservation "bdver1-direct1" "nothing,(bdver1-decode1|(use-decodera+bdver1-decode1b))")
>  (define_reservation "bdver1-direct" "nothing,
>  				     (bdver1-decode0 | bdver1-decode1
> -				     | bdver1-decode2)")
> +				     | bdver1-decode2 | (use-decodera+bdver1-decode0b)
> +				     | (use-decodera+bdver1-decode1b) | (use-decodera+bdver1-decode2b))")
>  ;; Double instructions behaves like two direct instructions.
>  (define_reservation "bdver1-double" "((bdver1-decode2,bdver1-decode0)
>  				     | (nothing,(bdver1-decode0 + bdver1-decode1))
> -				     | (nothing,(bdver1-decode1 + bdver1-decode2)))")
> +				     | (nothing,(bdver1-decode1 + bdver1-decode2))
> +				     | (nothing,(use-decodera + bdver1-decode0b + bdver1-decode1b))
> +				     | (nothing,(use-decodera + bdver1-decode1b + bdver1-decode2b)))")
>  
>  
>  (define_cpu_unit "bdver1-ieu0" "bdver1_ieu") @@ -131,17 +141,28 @@
>  			      (eq_attr "type" "call,callv"))
>  			 "bdver1-double,bdver1-agu")
>  ;; PUSH mem is double path.
> +(define_insn_reservation "bdver1_pushmem" 1
> +			 (and (eq_attr "cpu" "bdver1,bdver2")
> +			      (and (eq_attr "type" "push")
> +				   (eq_attr "memory" "both")))
> +			 "bdver1-direct,bdver1-load,bdver1-store")
>  (define_insn_reservation "bdver1_push" 1
>  			 (and (eq_attr "cpu" "bdver1,bdver2")
>  			      (eq_attr "type" "push"))
> -			 "bdver1-direct,bdver1-agu,bdver1-store")
> +			 "bdver1-direct,bdver1-store")
>  ;; POP r16/mem are double path.
> +;; 16bit pops are not really used by GCC.
> +(define_insn_reservation "bdver1_popmem" 1
> +			 (and (eq_attr "cpu" "bdver1,bdver2")
> +			      (and (eq_attr "type" "pop")
> +				   (eq_attr "memory" "both")))
> +			 "bdver1-direct,bdver1-load,bdver1-store")
>  (define_insn_reservation "bdver1_pop" 1
>  			 (and (eq_attr "cpu" "bdver1,bdver2")
>  			      (eq_attr "type" "pop"))
> -			 "bdver1-direct,bdver1-ivector")
> -;; LEAVE no latency info so far, assume same with amdfam10.
> -(define_insn_reservation "bdver1_leave" 3
> +			 "bdver1-direct,bdver1-load")
> +;; By Agner Fog, latency is 4.
> +(define_insn_reservation "bdver1_leave" 4
>  			 (and (eq_attr "cpu" "bdver1,bdver2")
>  			      (eq_attr "type" "leave"))
>  			 "bdver1-vector,bdver1-ivector")
> Index: i386.c
> ===================================================================
> --- i386.c	(revision 203204)
> +++ i386.c	(working copy)
> @@ -24427,11 +25019,13 @@ ix86_issue_rate (void)
>      case PROCESSOR_K8:
>      case PROCESSOR_AMDFAM10:
>      case PROCESSOR_GENERIC:
> -    case PROCESSOR_BDVER1:
> -    case PROCESSOR_BDVER2:
> -    case PROCESSOR_BDVER3:
>      case PROCESSOR_BTVER1:
>        return 3;
> +    case PROCESSOR_BDVER3:
> +      return 4;
> +    case PROCESSOR_BDVER1:
> +    case PROCESSOR_BDVER2:
> +      return 6;
>  
>      case PROCESSOR_CORE2:
>      case PROCESSOR_COREI7:
> @@ -24697,11 +25291,27 @@ ix86_adjust_cost (rtx insn, rtx link, rt
>      case PROCESSOR_GENERIC:
>        memory = get_attr_memory (insn);
>  
> -      /* Stack engine allows to execute push&pop instructions in parall.  */
> +      /* Stack engine allows to execute push&pop instructions in parallel.
> +	 ??? There seems to be no detailed documentation of AMDFAM10, perhaps
> +	 it is actually equivalent to the stronger notion of engine below.  */
>        if (((insn_type == TYPE_PUSH || insn_type == TYPE_POP)
>  	   && (dep_insn_type == TYPE_PUSH || dep_insn_type == TYPE_POP))
> -	  && (ix86_tune != PROCESSOR_ATHLON && ix86_tune != PROCESSOR_K8))
> +	  && (ix86_tune == PROCESSOR_AMDFAM10))
> +	return 0;
> +
> +      /* For Bulldozer and up, the stack engine makes the value of the stack
> +	 pointer available immediately, no matter what the use is.  (I.e. when ESP
> +	 is used as a pointer or for arithmetic, the cost is bypassed, too.)  */
> +      if (ix86_tune >= PROCESSOR_BDVER1
> +	  && dep_insn_type == TYPE_PUSH)
>  	return 0;
> +      if (ix86_tune >= PROCESSOR_BDVER1
> +	  && dep_insn_type == TYPE_POP)
> +	{
> +	  rtx dest = SET_DEST (PATTERN (dep_insn));
> +	  if (REG_P (dest) && !reg_referenced_p (dest, PATTERN (insn)))
> +	    return 0;
> +	}
>  
>        /* Show ability of reorder buffer to hide latency of load by executing
>  	 in parallel with previous instruction in case
>
Patch

diff --git a/gcc/config/i386/bdver3.md b/gcc/config/i386/bdver3.md
index 52418b5..9e59395 100644
--- a/gcc/config/i386/bdver3.md
+++ b/gcc/config/i386/bdver3.md
@@ -34,6 +34,8 @@ 

 (define_cpu_unit "bdver3-decode0" "bdver3")
 (define_cpu_unit "bdver3-decode1" "bdver3")
+(define_cpu_unit "bdver3-decode2" "bdver3")
+(define_cpu_unit "bdver3-decode3" "bdver3")
 (define_cpu_unit "bdver3-decodev" "bdver3")

 ;; Double decoded instructions take two cycles whereas
@@ -42,12 +44,15 @@ 
 ;; two decoders in two cycles.
 ;; Vectorpath instructions are single issue instructions.
 ;; So, we have separate unit for vector instructions.
-(exclusion_set "bdver3-decodev" "bdver3-decode0,bdver3-decode1")
+(exclusion_set "bdver3-decodev" "bdver3-decode0,bdver3-decode1,bdver3-decode2,bdver3-decode3")

 (define_reservation "bdver3-vector" "bdver3-decodev")
-(define_reservation "bdver3-direct" "(bdver3-decode0|bdver3-decode1)")
+(define_reservation "bdver3-direct" "(bdver3-decode0|bdver3-decode1|bdver3-decoder2|bdver3-decoder3)")

-(define_reservation "bdver3-double" "(bdver3-decode0|bdver3-decode1)*2")
+(define_reservation "bdver3-double" "(bdver3-decode0+bdver3-decode1)|
+               (bdver3-decode1+bdver3-decode2)|(bdver3-decode2+bdver3-decode3)|
+               (bdver3-decode0+bdver3-decode2)|(bdver3-decode1+bdver3-decode3)|
+               (bdver3-decode0+bdver3-decode3)") 

