From patchwork Tue Oct 8 09:49:48 2013 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Jan Hubicka X-Patchwork-Id: 281384 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Received: from sourceware.org (server1.sourceware.org [209.132.180.131]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by ozlabs.org (Postfix) with ESMTPS id 4898F2C00B6 for ; Tue, 8 Oct 2013 20:49:59 +1100 (EST) DomainKey-Signature: a=rsa-sha1; c=nofws; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:date :from:to:cc:subject:message-id:references:mime-version :content-type:in-reply-to; q=dns; s=default; b=Cjkj2L2F5zidmlTTN BXEJJ1sSIMEl/DjMTQz/KoMHfuqNgvjSp0/udhSi/djiKBk97s0aOMjEUuWy3gTr kOUS99K2bKYFcNbc0UdpEAGemCBWVkTxIeWJm6cIaxNik1J/b0oZTF7kNe4Xvgpt lQQ5iYS+xIZn+zQQtbGjEPHXZg= DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=gcc.gnu.org; h=list-id :list-unsubscribe:list-archive:list-post:list-help:sender:date :from:to:cc:subject:message-id:references:mime-version :content-type:in-reply-to; s=default; bh=fCquBEusPaDpKrkIIF4I+Xk puR8=; b=jfCij3wpeR9Zy5hD++Zzdp8rTe2PecSmHHpmXgyc9AUH/MZge7Dchvw dm97R5GLP3JpA+wrCBkBBmmwAUYhR73iiioFa4PP6caLwKTPbPiZEm9iHLYsRxPX hQXXR6M6FBJL44hQ/AYunsPo9W8FaZyp7EODyvxXFy1MpleZV/xE= Received: (qmail 6317 invoked by alias); 8 Oct 2013 09:49:53 -0000 Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm Precedence: bulk List-Id: List-Unsubscribe: List-Archive: List-Post: List-Help: Sender: gcc-patches-owner@gcc.gnu.org Delivered-To: mailing list gcc-patches@gcc.gnu.org Received: (qmail 6308 invoked by uid 89); 8 Oct 2013 09:49:52 -0000 Authentication-Results: sourceware.org; auth=none X-Spam-SWARE-Status: No, score=-2.2 required=5.0 tests=AWL, BAYES_00, RCVD_IN_DNSWL_NONE, RP_MATCHES_RCVD, T_FRT_BELOW2 autolearn=ham version=3.3.2 X-HELO: atrey.karlin.mff.cuni.cz Received: from atrey.karlin.mff.cuni.cz (HELO atrey.karlin.mff.cuni.cz) (195.113.26.193) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 08 Oct 2013 09:49:51 +0000 Received: by atrey.karlin.mff.cuni.cz (Postfix, from userid 4018) id BE94A815B7; Tue, 8 Oct 2013 11:49:48 +0200 (CEST) Date: Tue, 8 Oct 2013 11:49:48 +0200 From: Jan Hubicka To: "Gopalasubramanian, Ganesh" Cc: Jan Hubicka , "gcc-patches@gcc.gnu.org" , "hjl.tools@gmail.com" Subject: Re: Fix scheduler ix86_issue_rate and ix86_adjust_cost for modern x86 chips Message-ID: <20131008094948.GD16394@atrey.karlin.mff.cuni.cz> References: <20130930111712.GA25208@kam.mff.cuni.cz> MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.20 (2009-06-14) X-IsSubscribed: yes > Hi Honza, > > I am planning to update the scheduler descriptions for bdver3 first. > Attached is the patch. Please let me know your comments if any. > > Though I agree on merging bdver1/2 and bdver3 on most parts, the FP lines and decoding schemes are different. So, let me know how can I approach merging these. Yep, I think we need to merge only those autmatas tha are same for both: (define_automaton "bdver3,bdver3_ieu,bdver3_load,bdver3_fp,bdver3_agu") probably can become (define_automaton "bdver3,bdver3_fp") with the corresponding reservations using bdver3_ieu,bdver3_load,bdver3_agu changed to bdver1 automaton. I think it should result in smaller binary - the fact that all conditionals are physically duplicated in bdver1/bdev3.md should be optimized away by genautomata. I also played a bit with the decoders and I am attaching my version - that seems SPEC neutral though. Your version has problem that it does not model the thing that the two decoders works sequentially. I removed the bdver1-decodev unit and instead i simply reserve all thre decoders + I added presence set requring second decoder to be taken only after first one changed presence set requring decoder 2 to be taken only after decoder 1+2 to final presence set, so decoderv resetvation has chance to pass. Finally I added use-decodera that consumes all of the first decoder as soon as we start to allocate second decoder - we can not really allocate them in interchanging order. I also noticed that push/pop instructions are modeled as being vector, while manual says it generates one micro op unless memory operand is used. I did not have much time to play further with this except for manual inspection of schedules that seems better now and in rare cases handle 4-5 instructions per cycle. We also should enable ia32_multipass_dfa_lookahead - with that scheduler should be able to put double decoded and vector decoded insns on the proper places. We can also experiment with defining TARGET_SCHED_VARIABLE_ISSUE to get more realistic estimates on what still can be issued - the value of 6 is unrealistically high. Seems like with addition of Atom the scheduler macros became very twisty maze of passages. I will work on replacing most of the CPU cases into tuning flags + costs. What do you think? Honza Index: bdver1.md =================================================================== --- bdver1.md (revision 203204) +++ bdver1.md (working copy) @@ -41,7 +41,9 @@ (define_cpu_unit "bdver1-decode0" "bdver1") (define_cpu_unit "bdver1-decode1" "bdver1") (define_cpu_unit "bdver1-decode2" "bdver1") -(define_cpu_unit "bdver1-decodev" "bdver1") +(define_cpu_unit "bdver1-decode0b" "bdver1") +(define_cpu_unit "bdver1-decode1b" "bdver1") +(define_cpu_unit "bdver1-decode2b" "bdver1") ;; Model the fact that double decoded instruction may take 2 cycles ;; to decode when decoder2 and decoder0 in next cycle @@ -57,18 +59,26 @@ ;; too. Vector decoded instructions then can't be issued when modeled ;; as consuming decoder0+decoder1+decoder2. ;; We solve that by specialized vector decoder unit and exclusion set. -(presence_set "bdver1-decode2" "bdver1-decode0") -(exclusion_set "bdver1-decodev" "bdver1-decode0,bdver1-decode1,bdver1-decode2") - -(define_reservation "bdver1-vector" "nothing,bdver1-decodev") -(define_reservation "bdver1-direct1" "nothing,bdver1-decode1") +(final_presence_set "bdver1-decode2" "bdver1-decode0,bdver1-decode1") +(presence_set "bdver1-decode0b,bdver1-decode1b,bdver1-decode2b" "bdver1-decode0,bdver1-decode1") +(final_presence_set "bdver1-decode2b" "bdver1-decode0b,bdver1-decode1b") + +(define_reservation "use-decodera" "((bdver1-decode0 | nothing) + + (bdver1-decode1 | nothing) + + (bdver1-decode2 | nothing))") +(define_reservation "bdver1-vector" "nothing,((bdver1-decode0+bdver1-decode1+bdver1-decode2) + |(use-decodera+bdver1-decode0b+bdver1-decode1b+bdver1-decode2b))") +(define_reservation "bdver1-direct1" "nothing,(bdver1-decode1|(use-decodera+bdver1-decode1b))") (define_reservation "bdver1-direct" "nothing, (bdver1-decode0 | bdver1-decode1 - | bdver1-decode2)") + | bdver1-decode2 | (use-decodera+bdver1-decode0b) + | (use-decodera+bdver1-decode1b) | (use-decodera+bdver1-decode2b))") ;; Double instructions behaves like two direct instructions. (define_reservation "bdver1-double" "((bdver1-decode2,bdver1-decode0) | (nothing,(bdver1-decode0 + bdver1-decode1)) - | (nothing,(bdver1-decode1 + bdver1-decode2)))") + | (nothing,(bdver1-decode1 + bdver1-decode2)) + | (nothing,(use-decodera + bdver1-decode0b + bdver1-decode1b)) + | (nothing,(use-decodera + bdver1-decode1b + bdver1-decode2b)))") (define_cpu_unit "bdver1-ieu0" "bdver1_ieu") @@ -131,17 +141,28 @@ (eq_attr "type" "call,callv")) "bdver1-double,bdver1-agu") ;; PUSH mem is double path. +(define_insn_reservation "bdver1_pushmem" 1 + (and (eq_attr "cpu" "bdver1,bdver2") + (and (eq_attr "type" "push") + (eq_attr "memory" "both"))) + "bdver1-direct,bdver1-load,bdver1-store") (define_insn_reservation "bdver1_push" 1 (and (eq_attr "cpu" "bdver1,bdver2") (eq_attr "type" "push")) - "bdver1-direct,bdver1-agu,bdver1-store") + "bdver1-direct,bdver1-store") ;; POP r16/mem are double path. +;; 16bit pops are not really used by GCC. +(define_insn_reservation "bdver1_popmem" 1 + (and (eq_attr "cpu" "bdver1,bdver2") + (and (eq_attr "type" "pop") + (eq_attr "memory" "both"))) + "bdver1-direct,bdver1-load,bdver1-store") (define_insn_reservation "bdver1_pop" 1 (and (eq_attr "cpu" "bdver1,bdver2") (eq_attr "type" "pop")) - "bdver1-direct,bdver1-ivector") -;; LEAVE no latency info so far, assume same with amdfam10. -(define_insn_reservation "bdver1_leave" 3 + "bdver1-direct,bdver1-load") +;; By Agner Fog, latency is 4. +(define_insn_reservation "bdver1_leave" 4 (and (eq_attr "cpu" "bdver1,bdver2") (eq_attr "type" "leave")) "bdver1-vector,bdver1-ivector") Index: i386.c =================================================================== --- i386.c (revision 203204) +++ i386.c (working copy) @@ -24427,11 +25019,13 @@ ix86_issue_rate (void) case PROCESSOR_K8: case PROCESSOR_AMDFAM10: case PROCESSOR_GENERIC: - case PROCESSOR_BDVER1: - case PROCESSOR_BDVER2: - case PROCESSOR_BDVER3: case PROCESSOR_BTVER1: return 3; + case PROCESSOR_BDVER3: + return 4; + case PROCESSOR_BDVER1: + case PROCESSOR_BDVER2: + return 6; case PROCESSOR_CORE2: case PROCESSOR_COREI7: @@ -24697,11 +25291,27 @@ ix86_adjust_cost (rtx insn, rtx link, rt case PROCESSOR_GENERIC: memory = get_attr_memory (insn); - /* Stack engine allows to execute push&pop instructions in parall. */ + /* Stack engine allows to execute push&pop instructions in parallel. + ??? There seems to be no detailed documentation of AMDFAM10, perhaps + it is actually equivalent to the stronger notion of engine bellow. */ if (((insn_type == TYPE_PUSH || insn_type == TYPE_POP) && (dep_insn_type == TYPE_PUSH || dep_insn_type == TYPE_POP)) - && (ix86_tune != PROCESSOR_ATHLON && ix86_tune != PROCESSOR_K8)) + && (ix86_tune == PROCESSOR_AMDFAM10)) + return 0; + + /* For buldozer up, the stack engine makes value of stack pointer available + immediately, no mater about the use. (i.e. when ESP is used as pointer + or for arithmetic, the cost is bypassed, too.) */ + if (ix86_tune >= PROCESSOR_BDVER1 + && dep_insn_type == TYPE_PUSH) return 0; + if (ix86_tune >= PROCESSOR_BDVER1 + && dep_insn_type == TYPE_POP) + { + rtx dest = SET_DEST (PATTERN (dep_insn)); + if (REG_P (dest) && !reg_referenced_p (dest, PATTERN (insn))) + return 0; + } /* Show ability of reorder buffer to hide latency of load by executing in parallel with previous instruction in case