From patchwork Tue Oct  8 09:49:48 2013
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jan Hubicka <hubicka@ucw.cz>
X-Patchwork-Id: 281384
Return-Path: 
 <gcc-patches-return-350623-incoming=patchwork.ozlabs.org@gcc.gnu.org>
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Received: from sourceware.org (server1.sourceware.org [209.132.180.131])
	(using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
	(Client did not present a certificate)
	by ozlabs.org (Postfix) with ESMTPS id 4898F2C00B6
	for <incoming@patchwork.ozlabs.org>;
	Tue,  8 Oct 2013 20:49:59 +1100 (EST)
DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender:date
	:from:to:cc:subject:message-id:references:mime-version
	:content-type:in-reply-to; q=dns; s=default; b=Cjkj2L2F5zidmlTTN
	BXEJJ1sSIMEl/DjMTQz/KoMHfuqNgvjSp0/udhSi/djiKBk97s0aOMjEUuWy3gTr
	kOUS99K2bKYFcNbc0UdpEAGemCBWVkTxIeWJm6cIaxNik1J/b0oZTF7kNe4Xvgpt
	lQQ5iYS+xIZn+zQQtbGjEPHXZg=
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id
	:list-unsubscribe:list-archive:list-post:list-help:sender:date
	:from:to:cc:subject:message-id:references:mime-version
	:content-type:in-reply-to; s=default; bh=fCquBEusPaDpKrkIIF4I+Xk
	puR8=; b=jfCij3wpeR9Zy5hD++Zzdp8rTe2PecSmHHpmXgyc9AUH/MZge7Dchvw
	dm97R5GLP3JpA+wrCBkBBmmwAUYhR73iiioFa4PP6caLwKTPbPiZEm9iHLYsRxPX
	hQXXR6M6FBJL44hQ/AYunsPo9W8FaZyp7EODyvxXFy1MpleZV/xE=
Received: (qmail 6317 invoked by alias); 8 Oct 2013 09:49:53 -0000
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id: <gcc-patches.gcc.gnu.org>
List-Unsubscribe: <mailto:gcc-patches-unsubscribe-##L=##H@gcc.gnu.org>
List-Archive: <http://gcc.gnu.org/ml/gcc-patches/>
List-Post: <mailto:gcc-patches@gcc.gnu.org>
List-Help: <mailto:gcc-patches-help@gcc.gnu.org>
Sender: gcc-patches-owner@gcc.gnu.org
Delivered-To: mailing list gcc-patches@gcc.gnu.org
Received: (qmail 6308 invoked by uid 89); 8 Oct 2013 09:49:52 -0000
Authentication-Results: sourceware.org; auth=none
X-Spam-SWARE-Status: No, score=-2.2 required=5.0 tests=AWL, BAYES_00,
	RCVD_IN_DNSWL_NONE, RP_MATCHES_RCVD,
	T_FRT_BELOW2 autolearn=ham version=3.3.2
X-HELO: atrey.karlin.mff.cuni.cz
Received: from atrey.karlin.mff.cuni.cz (HELO atrey.karlin.mff.cuni.cz)
	(195.113.26.193) by sourceware.org
	(qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP;
	Tue, 08 Oct 2013 09:49:51 +0000
Received: by atrey.karlin.mff.cuni.cz (Postfix, from userid 4018)	id
	BE94A815B7; Tue,  8 Oct 2013 11:49:48 +0200 (CEST)
Date: Tue, 8 Oct 2013 11:49:48 +0200
From: Jan Hubicka <hubicka@ucw.cz>
To: "Gopalasubramanian, Ganesh" <Ganesh.Gopalasubramanian@amd.com>
Cc: Jan Hubicka <hubicka@ucw.cz>,
	"gcc-patches@gcc.gnu.org" <gcc-patches@gcc.gnu.org>,
	"hjl.tools@gmail.com" <hjl.tools@gmail.com>
Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern
	x86 chips
Message-ID: <20131008094948.GD16394@atrey.karlin.mff.cuni.cz>
References: <20130930111712.GA25208@kam.mff.cuni.cz>
	<EB4625145972F94C9680D8CADD65161546700157@SATLEXDAG02.amd.com>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <EB4625145972F94C9680D8CADD65161546700157@SATLEXDAG02.amd.com>
User-Agent: Mutt/1.5.20 (2009-06-14)
X-IsSubscribed: yes

> Hi Honza,
> 
> I am planning to update the scheduler descriptions for bdver3 first.
> Attached is the patch. Please let me know your comments if any.
> 
> Though I agree on merging bdver1/2 and bdver3 on most parts, the FP lines and decoding schemes are different. So, let me know how can I approach merging these.

Yep, I think we need to merge only those autmatas tha are same for both:
(define_automaton "bdver3,bdver3_ieu,bdver3_load,bdver3_fp,bdver3_agu")
probably can become
(define_automaton "bdver3,bdver3_fp")
with the corresponding reservations using bdver3_ieu,bdver3_load,bdver3_agu changed to bdver1
automaton.  I think it should result in smaller binary - the fact that all conditionals are
physically duplicated in bdver1/bdev3.md should be optimized away by genautomata.

I also played a bit with the decoders and I am attaching my version - that seems SPEC neutral though.
Your version has problem that it does not model the thing that the two decoders works sequentially.

I removed the bdver1-decodev unit and instead i simply reserve all thre decoders + I added 
presence set requring second decoder to be taken only after first one changed presence set requring
decoder 2 to be taken only after decoder 1+2 to final presence set, so decoderv resetvation has
chance to pass.
Finally I added use-decodera that consumes all of the first decoder as soon as we start to allocate
second decoder - we can not really allocate them in interchanging order.

I also noticed that push/pop instructions are modeled as being vector, while manual says it 
generates one micro op unless memory operand is used.

I did not have much time to play further with this except for manual inspection of schedules that
seems better now and in rare cases handle 4-5 instructions per cycle.

We also should enable ia32_multipass_dfa_lookahead - with that scheduler should be able to put double decoded
and vector decoded insns on the proper places.

We can also experiment with defining TARGET_SCHED_VARIABLE_ISSUE to get more realistic
estimates on what still can be issued - the value of 6 is unrealistically high.

Seems like with addition of Atom the scheduler macros became very twisty maze of passages.
I will work on replacing most of the CPU cases into tuning flags + costs.

What do you think?
Honza

Index: bdver1.md
===================================================================
--- bdver1.md	(revision 203204)
+++ bdver1.md	(working copy)
@@ -41,7 +41,9 @@
 (define_cpu_unit "bdver1-decode0" "bdver1")
 (define_cpu_unit "bdver1-decode1" "bdver1")
 (define_cpu_unit "bdver1-decode2" "bdver1")
-(define_cpu_unit "bdver1-decodev" "bdver1")
+(define_cpu_unit "bdver1-decode0b" "bdver1")
+(define_cpu_unit "bdver1-decode1b" "bdver1")
+(define_cpu_unit "bdver1-decode2b" "bdver1")
 
 ;; Model the fact that double decoded instruction may take 2 cycles
 ;; to decode when decoder2 and decoder0 in next cycle
@@ -57,18 +59,26 @@
 ;; too.  Vector decoded instructions then can't be issued when modeled
 ;; as consuming decoder0+decoder1+decoder2.
 ;; We solve that by specialized vector decoder unit and exclusion set.
-(presence_set "bdver1-decode2" "bdver1-decode0")
-(exclusion_set "bdver1-decodev" "bdver1-decode0,bdver1-decode1,bdver1-decode2")
-
-(define_reservation "bdver1-vector" "nothing,bdver1-decodev")
-(define_reservation "bdver1-direct1" "nothing,bdver1-decode1")
+(final_presence_set "bdver1-decode2" "bdver1-decode0,bdver1-decode1")
+(presence_set "bdver1-decode0b,bdver1-decode1b,bdver1-decode2b" "bdver1-decode0,bdver1-decode1")
+(final_presence_set "bdver1-decode2b" "bdver1-decode0b,bdver1-decode1b")
+
+(define_reservation "use-decodera" "((bdver1-decode0 | nothing)
+				     + (bdver1-decode1 | nothing)
+				     + (bdver1-decode2 | nothing))")
+(define_reservation "bdver1-vector" "nothing,((bdver1-decode0+bdver1-decode1+bdver1-decode2)
+					      |(use-decodera+bdver1-decode0b+bdver1-decode1b+bdver1-decode2b))")
+(define_reservation "bdver1-direct1" "nothing,(bdver1-decode1|(use-decodera+bdver1-decode1b))")
 (define_reservation "bdver1-direct" "nothing,
 				     (bdver1-decode0 | bdver1-decode1
-				     | bdver1-decode2)")
+				     | bdver1-decode2 | (use-decodera+bdver1-decode0b)
+				     | (use-decodera+bdver1-decode1b) | (use-decodera+bdver1-decode2b))")
 ;; Double instructions behaves like two direct instructions.
 (define_reservation "bdver1-double" "((bdver1-decode2,bdver1-decode0)
 				     | (nothing,(bdver1-decode0 + bdver1-decode1))
-				     | (nothing,(bdver1-decode1 + bdver1-decode2)))")
+				     | (nothing,(bdver1-decode1 + bdver1-decode2))
+				     | (nothing,(use-decodera + bdver1-decode0b + bdver1-decode1b))
+				     | (nothing,(use-decodera + bdver1-decode1b + bdver1-decode2b)))")
 
 
 (define_cpu_unit "bdver1-ieu0" "bdver1_ieu")
@@ -131,17 +141,28 @@
 			      (eq_attr "type" "call,callv"))
 			 "bdver1-double,bdver1-agu")
 ;; PUSH mem is double path.
+(define_insn_reservation "bdver1_pushmem" 1
+			 (and (eq_attr "cpu" "bdver1,bdver2")
+			      (and (eq_attr "type" "push")
+				   (eq_attr "memory" "both")))
+			 "bdver1-direct,bdver1-load,bdver1-store")
 (define_insn_reservation "bdver1_push" 1
 			 (and (eq_attr "cpu" "bdver1,bdver2")
 			      (eq_attr "type" "push"))
-			 "bdver1-direct,bdver1-agu,bdver1-store")
+			 "bdver1-direct,bdver1-store")
 ;; POP r16/mem are double path.
+;; 16bit pops are not really used by GCC.
+(define_insn_reservation "bdver1_popmem" 1
+			 (and (eq_attr "cpu" "bdver1,bdver2")
+			      (and (eq_attr "type" "pop")
+				   (eq_attr "memory" "both")))
+			 "bdver1-direct,bdver1-load,bdver1-store")
 (define_insn_reservation "bdver1_pop" 1
 			 (and (eq_attr "cpu" "bdver1,bdver2")
 			      (eq_attr "type" "pop"))
-			 "bdver1-direct,bdver1-ivector")
-;; LEAVE no latency info so far, assume same with amdfam10.
-(define_insn_reservation "bdver1_leave" 3
+			 "bdver1-direct,bdver1-load")
+;; By Agner Fog, latency is 4.
+(define_insn_reservation "bdver1_leave" 4
 			 (and (eq_attr "cpu" "bdver1,bdver2")
 			      (eq_attr "type" "leave"))
 			 "bdver1-vector,bdver1-ivector")
Index: i386.c
===================================================================
--- i386.c	(revision 203204)
+++ i386.c	(working copy)
@@ -24427,11 +25019,13 @@ ix86_issue_rate (void)
     case PROCESSOR_K8:
     case PROCESSOR_AMDFAM10:
     case PROCESSOR_GENERIC:
-    case PROCESSOR_BDVER1:
-    case PROCESSOR_BDVER2:
-    case PROCESSOR_BDVER3:
     case PROCESSOR_BTVER1:
       return 3;
+    case PROCESSOR_BDVER3:
+      return 4;
+    case PROCESSOR_BDVER1:
+    case PROCESSOR_BDVER2:
+      return 6;
 
     case PROCESSOR_CORE2:
     case PROCESSOR_COREI7:
@@ -24697,11 +25291,27 @@ ix86_adjust_cost (rtx insn, rtx link, rt
     case PROCESSOR_GENERIC:
       memory = get_attr_memory (insn);
 
-      /* Stack engine allows to execute push&pop instructions in parall.  */
+      /* Stack engine allows to execute push&pop instructions in parallel. 
+	 ??? There seems to be no detailed documentation of AMDFAM10, perhaps
+	 it is actually equivalent to the stronger notion of engine bellow. */
       if (((insn_type == TYPE_PUSH || insn_type == TYPE_POP)
 	   && (dep_insn_type == TYPE_PUSH || dep_insn_type == TYPE_POP))
-	  && (ix86_tune != PROCESSOR_ATHLON && ix86_tune != PROCESSOR_K8))
+	  && (ix86_tune == PROCESSOR_AMDFAM10))
+	return 0;
+
+      /* For buldozer up, the stack engine makes value of stack pointer available
+	 immediately, no mater about the use.   (i.e. when ESP is used as pointer
+	 or for arithmetic, the cost is bypassed, too.)  */
+      if (ix86_tune >= PROCESSOR_BDVER1
+	  && dep_insn_type == TYPE_PUSH)
 	return 0;
+      if (ix86_tune >= PROCESSOR_BDVER1
+	  && dep_insn_type == TYPE_POP)
+	{
+	  rtx dest = SET_DEST (PATTERN (dep_insn));
+	  if (REG_P (dest) && !reg_referenced_p (dest, PATTERN (insn)))
+	    return 0;
+	}
 
       /* Show ability of reorder buffer to hide latency of load by executing
 	 in parallel with previous instruction in case