From: Bernd Schmidt
Date: Fri, 20 Aug 2010 22:07:14 +0200
To: GCC Patches, "H.J. Lu", Maxim Kuvyrkov, Paul Brook
Subject: Core 2 and Core i7 tuning
Message-ID: <4C6EE072.4070802@codesourcery.com>
X-Patchwork-Id: 62314

Here's something I've been working on for a while. This adds a corei7
processor type, a Core 2/Core i7 scheduling description, and twiddles a
few of the x86 tuning flags. I'm not terribly happy with it yet due to
the relatively small performance improvement, but I'd promised some
folks I'd post it this week, so...

The scheduling description is heavily based on ppro.md. There seems to
be no publicly available, detailed information from Intel about the
Core 2 pipeline, so this work is based on Agner Fog's manuals. It should
be correct in the essentials, at least as well as ppro.md (we aren't
really able to do a good job with the execution ports since we have no
concept of the out-of-order core). I have not tried to implement
latencies or port reservations for every last MMX or SSE instruction,
since who knows whether the information is totally accurate anyway.

The i386 port has a lot of tuning flags, and I've mostly been running
SPEC2000 benchmarks for the last few weeks, trying to find a set of them
that works well on these processors. This is slightly tricky since
there's some inherent noise in the results. Not using the LEAVE
instruction seemed to make a difference on my Penryn laptop in 64 bit
mode, but that's probably moot now that -fomit-frame-pointer is the
default. I've changed a few others, but mostly these attempts resulted
in lower or unchanged performance, for example:
 * using push/pop insns more often (there are about six of these tuning
   flags). I would have expected this to be a win.
 * reusing the PentiumPro code in ix86_adjust_cost for Core 2 and i7
 * upping the branch cost to 5; initial results looked good for Core i7
   but in a full SPEC2000 run it seemed to be a slight loss, and a large
   loss on Core 2
 * using different string algorithms (from tune_generic)
 * enabling SPLIT_LONG_MOVES
 * enabling the flags related to partial reg stalls
 * reducing code alignments (based on a comment in Agner's manual that
   they aren't important anymore)

I've implemented a new tuning flag, X86_TUNE_PROMOTE_HI_CONSTANTS, based
on the recommendation in Agner's manual not to use operand size prefixes
when they change the length of the instruction (i.e. if there's an
immediate operand).
That happens in the second of the following four instructions, and is
said to cause a decoder stall:

  $ as
  orl $32768,%eax
  orw $32768,%ax
  orl $8,%eax
  orw $8,%ax

   0:   0d 00 80 00 00          or     $0x8000,%eax
   5:   66 0d 00 80             or     $0x8000,%ax
   9:   83 c8 08                or     $0x8,%eax
   c:   66 83 c8 08             or     $0x8,%ax

This didn't seem to have a large impact either, however.

On my last test run, I had

  SPECfp2000:
    -mtune=generic  3023
    -mtune=core2    3036
  SPECint2000:
    -mtune=generic  2774
    -mtune=core2    2794

This is a Westmere Xeon, i.e. essentially a Core i7, in 32 bit mode.
SPEC was locked to core 0 with schedtool, core 0 set to 3.2GHz manually
with cpufreq-set (1 step below maximum, which seems to avoid turbo mode
effectively). Compile flags were -O3 -mpc64 -frename-registers. The tree
is a few weeks old so it doesn't have -fomit-frame-pointer by default. I
also had -mtune=corei7 numbers, but they were a little lower since I was
using that run for an experiment with higher branch costs.

These numbers pretty much match the differences I was seeing on the
Core 2 laptop during development. I'd welcome it if other people would
also run benchmarks.

Comments? Is this OK?


Bernd

	* doc/invoke.texi (i386 and x86-64 Options): Document corei7 cpu type.
	* config/i386/i386.h (TARGET_COREI7): New macro.
	(enum ix86_tune_indices): Add X86_TUNE_PROMOTE_HI_CONSTANTS.
	(enum target_cpu_default): Add TARGET_CPU_DEFAULT_corei7.
	(enum processor_type): Add PROCESSOR_COREI7.
	* config/i386/i386.md: Include "core2.md".
	(attr "cpu"): Add "corei7".
	(mul_operands): New attribute.
	(mul3_1, mulsi3_1_zext, mulhi3_1, mulqi3_1, mul3_1, mulqihi3_1,
	muldi3_highpart_1, mulsi3_highpart_1, mulsi3_highpart_zext): Set it.
	* config/i386/core2.md: New file.
	* config/i386/i386-c.c (ix86_target_macros_internal): Handle
	PROCESSOR_COREI7.
	* config/i386/i386.c (corei7_cost): New static variable.
	(m_COREI7, m_CORE2I7): New macros.
	(initial_ix86_tune_features): Use them.
	Disable X86_TUNE_USE_LEAVE, X86_TUNE_PAD_RETURNS and X86_TUNE_USE_INCDEC,
	and enable X86_TUNE_PROMOTE_HI_REGS and X86_TUNE_PROMOTE_HI_CONSTANTS
	for Core 2 and Core i7.
	(x86_accumulate_outgoing_args, x86_arch_always_fancy_math_387): Use
	m_CORE2I7 instead of m_CORE2.
	(processor_target_table): Add entry for corei7_cost.
	(cpu_names): Add "corei7" entry.
	(override_options): Add entry for Core i7.
	(ix86_fixup_binary_operands, ix86_binary_operator_ok): Handle
	TARGET_PROMOTE_HI_CONSTANTS.
	(ix86_issue_rate): 4 for Core i7.
	(ix86_adjust_cost): Try to do something sensible about domains for
	PROCESSOR_COREI7.

Index: doc/invoke.texi
===================================================================
--- doc/invoke.texi	(revision 162821)
+++ doc/invoke.texi	(working copy)
@@ -11937,6 +11937,9 @@ SSE2 and SSE3 instruction set support.
 @item core2
 Intel Core2 CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3 and SSSE3
 instruction set support.
+@item corei7
+Intel Core i7 CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1
+and SSE4.2 instruction set support.
 @item atom
 Intel Atom CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3 and SSSE3
 instruction set support.
Index: config/i386/i386.h
===================================================================
--- config/i386/i386.h	(revision 162821)
+++ config/i386/i386.h	(working copy)
@@ -239,6 +239,7 @@ extern const struct processor_costs ix86
 #define TARGET_ATHLON_K8 (TARGET_K8 || TARGET_ATHLON)
 #define TARGET_NOCONA (ix86_tune == PROCESSOR_NOCONA)
 #define TARGET_CORE2 (ix86_tune == PROCESSOR_CORE2)
+#define TARGET_COREI7 (ix86_tune == PROCESSOR_COREI7)
 #define TARGET_GENERIC32 (ix86_tune == PROCESSOR_GENERIC32)
 #define TARGET_GENERIC64 (ix86_tune == PROCESSOR_GENERIC64)
 #define TARGET_GENERIC (TARGET_GENERIC32 || TARGET_GENERIC64)
@@ -274,6 +275,7 @@ enum ix86_tune_indices {
   X86_TUNE_HIMODE_MATH,
   X86_TUNE_PROMOTE_QI_REGS,
   X86_TUNE_PROMOTE_HI_REGS,
+  X86_TUNE_PROMOTE_HI_CONSTANTS,
   X86_TUNE_ADD_ESP_4,
   X86_TUNE_ADD_ESP_8,
   X86_TUNE_SUB_ESP_4,
@@ -348,6 +350,8 @@ extern unsigned char ix86_tune_features[
 #define TARGET_HIMODE_MATH ix86_tune_features[X86_TUNE_HIMODE_MATH]
 #define TARGET_PROMOTE_QI_REGS ix86_tune_features[X86_TUNE_PROMOTE_QI_REGS]
 #define TARGET_PROMOTE_HI_REGS ix86_tune_features[X86_TUNE_PROMOTE_HI_REGS]
+#define TARGET_PROMOTE_HI_CONSTANTS \
+  ix86_tune_features[X86_TUNE_PROMOTE_HI_CONSTANTS]
 #define TARGET_ADD_ESP_4 ix86_tune_features[X86_TUNE_ADD_ESP_4]
 #define TARGET_ADD_ESP_8 ix86_tune_features[X86_TUNE_ADD_ESP_8]
 #define TARGET_SUB_ESP_4 ix86_tune_features[X86_TUNE_SUB_ESP_4]
@@ -597,6 +601,7 @@ enum target_cpu_default
   TARGET_CPU_DEFAULT_prescott,
   TARGET_CPU_DEFAULT_nocona,
   TARGET_CPU_DEFAULT_core2,
+  TARGET_CPU_DEFAULT_corei7,
   TARGET_CPU_DEFAULT_atom,
 
   TARGET_CPU_DEFAULT_geode,
@@ -2139,6 +2144,7 @@ enum processor_type
   PROCESSOR_K8,
   PROCESSOR_NOCONA,
   PROCESSOR_CORE2,
+  PROCESSOR_COREI7,
   PROCESSOR_GENERIC32,
   PROCESSOR_GENERIC64,
   PROCESSOR_AMDFAM10,
Index: config/i386/i386.md
===================================================================
--- config/i386/i386.md	(revision 162821)
+++ config/i386/i386.md	(working copy)
@@ -349,8 +349,8 @@ (define_constants
 
 
 ;; Processor type.
-(define_attr "cpu" "none,pentium,pentiumpro,geode,k6,athlon,k8,core2,atom,
-                    generic64,amdfam10,bdver1"
+(define_attr "cpu" "none,pentium,pentiumpro,geode,k6,athlon,k8,core2,corei7,
+                    atom,generic64,amdfam10,bdver1"
   (const (symbol_ref "ix86_schedule")))
 
 ;; A basic instruction type. Refinements due to arguments to be
@@ -388,6 +388,10 @@ (define_attr "unit" "integer,i387,sse,mm
            (const_string "unknown")]
          (const_string "integer")))
 
+;; For integer multiply insns, the number of operands.
+(define_attr "mul_operands" ""
+  (const_int 2))
+
 ;; The (bounding maximum) length of an instruction immediate.
 (define_attr "length_immediate" ""
   (cond [(eq_attr "type" "incdec,setcc,icmov,str,lea,other,multi,idiv,leave,
@@ -919,6 +923,7 @@ (define_mode_iterator P [(SI "Pmode == S
 (include "athlon.md")
 (include "geode.md")
 (include "atom.md")
+(include "core2.md")
 
 ;; Operand and operator predicates and constraints
 
@@ -7010,6 +7015,7 @@ (define_insn "*mul3_1"
    imul{}\t{%2, %1, %0|%0, %1, %2}
    imul{}\t{%2, %0|%0, %2}"
   [(set_attr "type" "imul")
+   (set_attr "mul_operands" "3,2,2")
    (set_attr "prefix_0f" "0,0,1")
    (set (attr "athlon_decode")
         (cond [(eq_attr "cpu" "athlon")
@@ -7040,6 +7046,7 @@ (define_insn "*mulsi3_1_zext"
    imul{l}\t{%2, %1, %k0|%k0, %1, %2}
    imul{l}\t{%2, %k0|%k0, %2}"
   [(set_attr "type" "imul")
+   (set_attr "mul_operands" "3,3,2")
    (set_attr "prefix_0f" "0,0,1")
    (set (attr "athlon_decode")
         (cond [(eq_attr "cpu" "athlon")
@@ -7077,6 +7084,7 @@ (define_insn "*mulhi3_1"
    imul{w}\t{%2, %1, %0|%0, %1, %2}
    imul{w}\t{%2, %0|%0, %2}"
   [(set_attr "type" "imul")
+   (set_attr "mul_operands" "3,3,2")
    (set_attr "prefix_0f" "0,0,1")
    (set (attr "athlon_decode")
         (cond [(eq_attr "cpu" "athlon")
@@ -7103,6 +7111,7 @@ (define_insn "*mulqi3_1"
    && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
   "mul{b}\t%2"
   [(set_attr "type" "imul")
+   (set_attr "mul_operands" "1")
    (set_attr "length_immediate" "0")
    (set (attr "athlon_decode")
      (if_then_else (eq_attr "cpu" "athlon")
@@ -7144,6 +7153,7 @@ (define_insn "*mul3_1"
   "!(MEM_P (operands[1]) && MEM_P (operands[2]))"
   "mul{}\t%2"
   [(set_attr "type" "imul")
+   (set_attr "mul_operands" "1")
    (set_attr "length_immediate" "0")
    (set (attr "athlon_decode")
      (if_then_else (eq_attr "cpu" "athlon")
@@ -7164,6 +7174,7 @@ (define_insn "*mulqihi3_1"
    && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
   "mul{b}\t%2"
   [(set_attr "type" "imul")
+   (set_attr "mul_operands" "1")
    (set_attr "length_immediate" "0")
    (set (attr "athlon_decode")
      (if_then_else (eq_attr "cpu" "athlon")
@@ -7203,6 +7214,7 @@ (define_insn "*muldi3_highpart_1"
    && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
   "mul{q}\t%2"
   [(set_attr "type" "imul")
+   (set_attr "mul_operands" "1")
    (set_attr "length_immediate" "0")
    (set (attr "athlon_decode")
      (if_then_else (eq_attr "cpu" "athlon")
@@ -7226,6 +7238,7 @@ (define_insn "*mulsi3_highpart_1"
   "!(MEM_P (operands[1]) && MEM_P (operands[2]))"
   "mul{l}\t%2"
   [(set_attr "type" "imul")
+   (set_attr "mul_operands" "1")
    (set_attr "length_immediate" "0")
    (set (attr "athlon_decode")
      (if_then_else (eq_attr "cpu" "athlon")
@@ -7249,6 +7262,7 @@ (define_insn "*mulsi3_highpart_zext"
    && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
   "mul{l}\t%2"
   [(set_attr "type" "imul")
+   (set_attr "mul_operands" "1")
    (set_attr "length_immediate" "0")
    (set (attr "athlon_decode")
      (if_then_else (eq_attr "cpu" "athlon")
Index: config/i386/core2.md
===================================================================
--- config/i386/core2.md	(revision 0)
+++ config/i386/core2.md	(revision 0)
@@ -0,0 +1,744 @@
+;; Scheduling for Core 2 and derived processors.
+;; Copyright (C) 2004, 2005, 2007, 2008, 2010 Free Software Foundation, Inc.
+;;
+;; This file is part of GCC.
+;;
+;; GCC is free software; you can redistribute it and/or modify
+;; it under the terms of the GNU General Public License as published by
+;; the Free Software Foundation; either version 3, or (at your option)
+;; any later version.
+;;
+;; GCC is distributed in the hope that it will be useful,
+;; but WITHOUT ANY WARRANTY; without even the implied warranty of
+;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+;; GNU General Public License for more details.
+;;
+;; You should have received a copy of the GNU General Public License
+;; along with GCC; see the file COPYING3. If not see
+;; <http://www.gnu.org/licenses/>. */
+
+;; The scheduling description in this file is based on the one in ppro.md,
+;; with additional information obtained from
+;;
+;; "How to optimize for the Pentium family of microprocessors",
+;; by Agner Fog, PhD.
+;;
+;; The major difference from the P6 pipeline is one extra decoder, and
+;; one extra execute unit. Due to micro-op fusion, many insns no longer
+;; need to be decoded in decoder 0, but can be handled by all of them.
+
+;; The core2_idiv, core2_fdiv and core2_ssediv automata are used to
+;; model issue latencies of idiv, fdiv and ssediv type insns.
+(define_automaton "core2_decoder,core2_core,core2_idiv,core2_fdiv,core2_ssediv,core2_load,core2_store")
+
+;; The CPU domain, used for Core i7 bypass latencies
+(define_attr "i7_domain" "int,float,simd"
+  (cond [(eq_attr "type" "fmov,fop,fsgn,fmul,fdiv,fpspc,fcmov,fcmp,fxch,fistp,fisttp,frndint")
+           (const_string "float")
+         (eq_attr "type" "sselog,sselog1,sseiadd,sseiadd1,sseishft,sseishft1,sseimul,
+                          sse,ssemov,sseadd,ssemul,ssecmp,ssecomi,ssecvt,
+                          ssecvt1,sseicvt,ssediv,sseins,ssemuladd,sse4arg")
+           (cond [(eq_attr "mode" "V4DF,V8SF,V2DF,V4SF,SF,DF")
+                    (const_string "float")
+                  (eq_attr "mode" "SI")
+                    (const_string "int")]
+                 (const_string "simd"))
+         (eq_attr "type" "mmx,mmxmov,mmxadd,mmxmul,mmxcmp,mmxcvt,mmxshft")
+           (const_string "simd")]
+        (const_string "int")))
+
+;; As for the Pentium Pro,
+;; - an instruction with 1 uop can be decoded by any of the three
+;;   decoders in one cycle.
+;; - an instruction with 1 to 4 uops can be decoded only by decoder 0
+;;   but still in only one cycle.
+;; - a complex (microcode) instruction can also only be decoded by
+;;   decoder 0, and this takes an unspecified number of cycles.
+;;
+;; The goal is to schedule such that we have a few-one-one uops sequence
+;; in each cycle, to decode as many instructions per cycle as possible.
+(define_cpu_unit "c2_decoder0" "core2_decoder")
+(define_cpu_unit "c2_decoder1" "core2_decoder")
+(define_cpu_unit "c2_decoder2" "core2_decoder")
+(define_cpu_unit "c2_decoder3" "core2_decoder")
+
+;; We first wish to find an instruction for c2_decoder0, so exclude
+;; c2_decoder1, c2_decoder2 and c2_decoder3 from being reserved until
+;; c2_decoder0 is reserved.
+(presence_set "c2_decoder1" "c2_decoder0")
+(presence_set "c2_decoder2" "c2_decoder0")
+(presence_set "c2_decoder3" "c2_decoder0")
+
+;; Most instructions can be decoded on any of the four decoders.
+(define_reservation "c2_decodern" "(c2_decoder0|c2_decoder1|c2_decoder2|c2_decoder3)")
+
+;; The out-of-order core has six pipelines. These are similar to the
+;; Pentium Pro's five pipelines. Port 2 is responsible for memory loads,
+;; port 3 for store address calculations, port 4 for memory stores, and
+;; ports 0, 1 and 5 for everything else.
+
+(define_cpu_unit "c2_p0,c2_p1,c2_p5" "core2_core")
+(define_cpu_unit "c2_p2" "core2_load")
+(define_cpu_unit "c2_p3,c2_p4" "core2_store")
+(define_cpu_unit "c2_idiv" "core2_idiv")
+(define_cpu_unit "c2_fdiv" "core2_fdiv")
+(define_cpu_unit "c2_ssediv" "core2_ssediv")
+
+;; Only the irregular instructions have to be modeled here. A load
+;; increases the latency by 2 or 3, or by nothing if the manual gives
+;; a latency already. Store latencies are not accounted for.
+;;
+;; The simple instructions follow a very regular pattern of 1 uop per
+;; reg-reg operation, 1 uop per load on port 2, and 2 uops per store
+;; on port 4 and port 3. These instructions are modelled at the bottom
+;; of this file.
+;;
+;; For microcoded instructions we don't know how many uops are produced.
+;; These instructions are the "complex" ones in the Intel manuals. All
+;; we _do_ know is that they typically produce four or more uops, so
+;; they can only be decoded on c2_decoder0. Modelling their latencies
+;; doesn't make sense because we don't know how these instructions are
+;; executed in the core. So we just model that they can only be decoded
+;; on decoder 0, and say that it takes a little while before the result
+;; is available.
+(define_insn_reservation "c2_complex_insn" 6
+  (and (eq_attr "cpu" "core2,corei7")
+       (eq_attr "type" "other,multi,str"))
+  "c2_decoder0")
+
+(define_insn_reservation "c2_call" 1
+  (and (eq_attr "cpu" "core2,corei7")
+       (eq_attr "type" "call,callv"))
+  "c2_decoder0")
+
+;; imov with memory operands does not use the integer units.
+;; imovx always decodes to one uop, and also doesn't use the integer
+;; units if it has memory operands.
+(define_insn_reservation "c2_imov" 1
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (eq_attr "type" "imov,imovx")))
+  "c2_decodern,(c2_p0|c2_p1|c2_p5)")
+
+(define_insn_reservation "c2_imov_load" 4
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (eq_attr "type" "imov,imovx")))
+  "c2_decodern,c2_p2")
+
+(define_insn_reservation "c2_imov_store" 1
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "store")
+            (eq_attr "type" "imov")))
+  "c2_decodern,c2_p4+c2_p3")
+
+(define_insn_reservation "c2_icmov" 2
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (eq_attr "type" "icmov")))
+  "c2_decoder0,(c2_p0|c2_p1|c2_p5)*2")
+
+(define_insn_reservation "c2_icmov_load" 2
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (eq_attr "type" "icmov")))
+  "c2_decoder0,c2_p2,(c2_p0|c2_p1|c2_p5)*2")
+
+(define_insn_reservation "c2_push_reg" 1
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "store")
+            (eq_attr "type" "push")))
+  "c2_decodern,c2_p4+c2_p3")
+
+(define_insn_reservation "c2_push_mem" 1
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "both")
+            (eq_attr "type" "push")))
+  "c2_decoder0,c2_p2,c2_p4+c2_p3")
+
+;; lea executes on port 0 with latency one and throughput 1.
+(define_insn_reservation "c2_lea" 1
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (eq_attr "type" "lea")))
+  "c2_decodern,c2_p0")
+
+;; Shift and rotate decode as two uops which can go to port 0 or 5.
+;; The load and store units need to be reserved when memory operands
+;; are involved.
+(define_insn_reservation "c2_shift_rotate" 1
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (eq_attr "type" "ishift,ishift1,rotate,rotate1")))
+  "c2_decodern,(c2_p0|c2_p5)")
+
+(define_insn_reservation "c2_shift_rotate_mem" 4
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "!none")
+            (eq_attr "type" "ishift,ishift1,rotate,rotate1")))
+  "c2_decoder0,c2_p2,(c2_p0|c2_p5),c2_p4+c2_p3")
+
+;; See comments in ppro.md for the corresponding reservation.
+(define_insn_reservation "c2_branch" 1
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (eq_attr "type" "ibr")))
+  "c2_decodern,c2_p5")
+
+;; ??? Indirect branches probably have worse latency than this.
+(define_insn_reservation "c2_indirect_branch" 6
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "!none")
+            (eq_attr "type" "ibr")))
+  "c2_decoder0,c2_p2+c2_p5")
+
+(define_insn_reservation "c2_leave" 4
+  (and (eq_attr "cpu" "core2,corei7")
+       (eq_attr "type" "leave"))
+  "c2_decoder0,c2_p2+(c2_p0|c2_p1),(c2_p0|c2_p1)")
+
+;; mul and imul with two/three operands only execute on port 1 for HImode
+;; and SImode, port 0 for DImode.
+(define_insn_reservation "c2_imul_hisi" 3
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (and (eq_attr "mode" "HI,SI")
+                 (and (eq_attr "type" "imul")
+                      (eq_attr "mul_operands" "2,3")))))
+  "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_imul_hisi_mem" 3
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "!none")
+            (and (eq_attr "mode" "HI,SI")
+                 (and (eq_attr "type" "imul")
+                      (eq_attr "mul_operands" "2,3")))))
+  "c2_decoder0,c2_p2+c2_p1")
+
+(define_insn_reservation "c2_imul_di" 5
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (and (eq_attr "mode" "DI")
+                 (and (eq_attr "type" "imul")
+                      (eq_attr "mul_operands" "2,3")))))
+  "c2_decodern,c2_p0")
+
+(define_insn_reservation "c2_imul_di_mem" 5
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "!none")
+            (and (eq_attr "mode" "DI")
+                 (and (eq_attr "type" "imul")
+                      (eq_attr "mul_operands" "2,3")))))
+  "c2_decoder0,c2_p2+c2_p0")
+
+(define_insn_reservation "c2_imul_qi1" 3
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (and (eq_attr "mode" "QI")
+                 (and (eq_attr "type" "imul")
+                      (eq_attr "mul_operands" "1")))))
+  "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_imul_qi1_mem" 3
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "!none")
+            (and (eq_attr "mode" "QI")
+                 (and (eq_attr "type" "imul")
+                      (eq_attr "mul_operands" "1")))))
+  "c2_decoder0,c2_p2+c2_p1")
+
+(define_insn_reservation "c2_imul_hisi1" 5
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (and (eq_attr "mode" "HI,SI")
+                 (and (eq_attr "type" "imul")
+                      (eq_attr "mul_operands" "1")))))
+  "c2_decoder0,c2_p1")
+
+(define_insn_reservation "c2_imul_hisi1_mem" 5
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "!none")
+            (and (eq_attr "mode" "HI,SI")
+                 (and (eq_attr "type" "imul")
+                      (eq_attr "mul_operands" "1")))))
+  "c2_decoder0,c2_p2+c2_p1")
+
+(define_insn_reservation "c2_imul_di1" 7
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (and (eq_attr "mode" "DI")
+                 (and (eq_attr "type" "imul")
+                      (eq_attr "mul_operands" "1")))))
+  "c2_decoder0,c2_p0")
+
+(define_insn_reservation "c2_imul_di1_mem" 7
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "!none")
+            (and (eq_attr "mode" "DI")
+                 (and (eq_attr "type" "imul")
+                      (eq_attr "mul_operands" "1")))))
+  "c2_decoder0,c2_p2+c2_p0")
+
+;; div and idiv are very similar, so we model them the same.
+;; QI, HI, and SI have issue latency 12, 21, and 37, respectively.
+;; These issue latencies are modelled via the core2_idiv automaton.
+(define_insn_reservation "c2_idiv_QI" 19
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (and (eq_attr "mode" "QI")
+                 (eq_attr "type" "idiv"))))
+  "c2_decoder0,(c2_p0+c2_idiv)*2,(c2_p0|c2_p1)+c2_idiv,c2_idiv*9")
+
+(define_insn_reservation "c2_idiv_QI_load" 19
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (and (eq_attr "mode" "QI")
+                 (eq_attr "type" "idiv"))))
+  "c2_decoder0,c2_p2+c2_p0+c2_idiv,c2_p0+c2_idiv,(c2_p0|c2_p1)+c2_idiv,c2_idiv*9")
+
+(define_insn_reservation "c2_idiv_HI" 23
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (and (eq_attr "mode" "HI")
+                 (eq_attr "type" "idiv"))))
+  "c2_decoder0,(c2_p0+c2_idiv)*3,(c2_p0|c2_p1)+c2_idiv,c2_idiv*17")
+
+(define_insn_reservation "c2_idiv_HI_load" 23
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (and (eq_attr "mode" "HI")
+                 (eq_attr "type" "idiv"))))
+  "c2_decoder0,c2_p2+c2_p0+c2_idiv,c2_p0+c2_idiv,(c2_p0|c2_p1)+c2_idiv,c2_idiv*18")
+
+(define_insn_reservation "c2_idiv_SI" 39
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (and (eq_attr "mode" "SI")
+                 (eq_attr "type" "idiv"))))
+  "c2_decoder0,(c2_p0+c2_idiv)*3,(c2_p0|c2_p1)+c2_idiv,c2_idiv*33")
+
+(define_insn_reservation "c2_idiv_SI_load" 39
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (and (eq_attr "mode" "SI")
+                 (eq_attr "type" "idiv"))))
+  "c2_decoder0,c2_p2+c2_p0+c2_idiv,c2_p0+c2_idiv,(c2_p0|c2_p1)+c2_idiv,c2_idiv*34")
+
+;; x87 floating point operations.
+
+(define_insn_reservation "c2_fxch" 0
+  (and (eq_attr "cpu" "core2,corei7")
+       (eq_attr "type" "fxch"))
+  "c2_decodern")
+
+(define_insn_reservation "c2_fop" 3
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none,unknown")
+            (eq_attr "type" "fop")))
+  "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_fop_load" 5
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (eq_attr "type" "fop")))
+  "c2_decoder0,c2_p2+c2_p1,c2_p1")
+
+(define_insn_reservation "c2_fop_store" 3
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "store")
+            (eq_attr "type" "fop")))
+  "c2_decoder0,c2_p0,c2_p0,c2_p0+c2_p4+c2_p3")
+
+(define_insn_reservation "c2_fop_both" 5
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "both")
+            (eq_attr "type" "fop")))
+  "c2_decoder0,c2_p2+c2_p0,c2_p0+c2_p4+c2_p3")
+
+(define_insn_reservation "c2_fsgn" 1
+  (and (eq_attr "cpu" "core2,corei7")
+       (eq_attr "type" "fsgn"))
+  "c2_decodern,c2_p0")
+
+(define_insn_reservation "c2_fistp" 5
+  (and (eq_attr "cpu" "core2,corei7")
+       (eq_attr "type" "fistp"))
+  "c2_decoder0,c2_p0*2,c2_p4+c2_p3")
+
+(define_insn_reservation "c2_fcmov" 2
+  (and (eq_attr "cpu" "core2,corei7")
+       (eq_attr "type" "fcmov"))
+  "c2_decoder0,c2_p0*2")
+
+(define_insn_reservation "c2_fcmp" 1
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (eq_attr "type" "fcmp")))
+  "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_fcmp_load" 4
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (eq_attr "type" "fcmp")))
+  "c2_decoder0,c2_p2+c2_p1")
+
+(define_insn_reservation "c2_fmov" 1
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (eq_attr "type" "fmov")))
+  "c2_decodern,c2_p0")
+
+(define_insn_reservation "c2_fmov_load" 1
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (and (eq_attr "mode" "!XF")
+                 (eq_attr "type" "fmov"))))
+  "c2_decodern,c2_p2")
+
+(define_insn_reservation "c2_fmov_XF_load" 3
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (and (eq_attr "mode" "XF")
+                 (eq_attr "type" "fmov"))))
+  "c2_decoder0,(c2_p2+c2_p0)*2")
+
+(define_insn_reservation "c2_fmov_store" 1
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "store")
+            (and (eq_attr "mode" "!XF")
+                 (eq_attr "type" "fmov"))))
+  "c2_decodern,c2_p3+c2_p4")
+
+(define_insn_reservation "c2_fmov_XF_store" 3
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "store")
+            (and (eq_attr "mode" "XF")
+                 (eq_attr "type" "fmov"))))
+  "c2_decoder0,(c2_p3+c2_p4),(c2_p3+c2_p4)")
+
+;; fmul executes on port 0 with latency 5. It has issue latency 2,
+;; but we don't model this.
+(define_insn_reservation "c2_fmul" 5
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (eq_attr "type" "fmul")))
+  "c2_decoder0,c2_p0*2")
+
+(define_insn_reservation "c2_fmul_load" 6
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (eq_attr "type" "fmul")))
+  "c2_decoder0,c2_p2+c2_p0,c2_p0")
+
+;; fdiv latencies depend on the mode of the operands. XFmode gives
+;; a latency of 38 cycles, DFmode gives 32, and SFmode gives latency 18.
+;; Division by a power of 2 takes only 9 cycles, but we cannot model
+;; that. Throughput is equal to latency - 1, which we model using the
+;; core2_fdiv automaton.
+(define_insn_reservation "c2_fdiv_SF" 18
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (and (eq_attr "mode" "SF")
+                 (eq_attr "type" "fdiv,fpspc"))))
+  "c2_decodern,c2_p0+c2_fdiv,c2_fdiv*16")
+
+(define_insn_reservation "c2_fdiv_SF_load" 19
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (and (eq_attr "mode" "SF")
+                 (eq_attr "type" "fdiv,fpspc"))))
+  "c2_decoder0,c2_p2+c2_p0+c2_fdiv,c2_fdiv*16")
+
+(define_insn_reservation "c2_fdiv_DF" 32
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (and (eq_attr "mode" "DF")
+                 (eq_attr "type" "fdiv,fpspc"))))
+  "c2_decodern,c2_p0+c2_fdiv,c2_fdiv*30")
+
+(define_insn_reservation "c2_fdiv_DF_load" 33
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (and (eq_attr "mode" "DF")
+                 (eq_attr "type" "fdiv,fpspc"))))
+  "c2_decoder0,c2_p2+c2_p0+c2_fdiv,c2_fdiv*30")
+
+(define_insn_reservation "c2_fdiv_XF" 38
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (and (eq_attr "mode" "XF")
+                 (eq_attr "type" "fdiv,fpspc"))))
+  "c2_decodern,c2_p0+c2_fdiv,c2_fdiv*36")
+
+(define_insn_reservation "c2_fdiv_XF_load" 39
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (and (eq_attr "mode" "XF")
+                 (eq_attr "type" "fdiv,fpspc"))))
+  "c2_decoder0,c2_p2+c2_p0+c2_fdiv,c2_fdiv*36")
+
+;; MMX instructions.
+
+(define_insn_reservation "c2_mmx_add" 1
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (eq_attr "type" "mmxadd,sseiadd")))
+  "c2_decodern,c2_p0|c2_p5")
+
+(define_insn_reservation "c2_mmx_add_load" 2
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (eq_attr "type" "mmxadd,sseiadd")))
+  "c2_decodern,c2_p2+c2_p0|c2_p5")
+
+(define_insn_reservation "c2_mmx_shft" 1
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (eq_attr "type" "mmxshft")))
+  "c2_decodern,c2_p0|c2_p5")
+
+(define_insn_reservation "c2_mmx_shft_load" 2
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (eq_attr "type" "mmxshft")))
+  "c2_decoder0,c2_p2+c2_p1")
+
+(define_insn_reservation "c2_mmx_sse_shft" 1
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (and (eq_attr "type" "sseishft")
+                 (eq_attr "length_immediate" "!0"))))
+  "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_mmx_sse_shft_load" 2
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (and (eq_attr "type" "sseishft")
+                 (eq_attr "length_immediate" "!0"))))
+  "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_mmx_sse_shft1" 2
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (and (eq_attr "type" "sseishft")
+                 (eq_attr "length_immediate" "0"))))
+  "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_mmx_sse_shft1_load" 3
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (and (eq_attr "type" "sseishft")
+                 (eq_attr "length_immediate" "0"))))
+  "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_mmx_mul" 3
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (eq_attr "type" "mmxmul,sseimul")))
+  "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_mmx_mul_load" 3
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (eq_attr "type" "mmxmul,sseimul")))
+  "c2_decoder0,c2_p2+c2_p1")
+
+(define_insn_reservation "c2_sse_mmxcvt" 4
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "mode" "DI")
+            (eq_attr "type" "mmxcvt")))
+  "c2_decodern,c2_p1")
+
+;; FIXME: These are Pentium III only, but we cannot tell here if
+;; we're generating code for PentiumPro/Pentium II or Pentium III
+;; (define_insn_reservation "c2_sse_mmxshft" 2
+;;   (and (eq_attr "cpu" "core2,corei7")
+;;        (and (eq_attr "mode" "TI")
+;;             (eq_attr "type" "mmxshft")))
+;;   "c2_decodern,c2_p0")
+
+;; The sfence instruction.
+(define_insn_reservation "c2_sse_sfence" 3
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "unknown")
+            (eq_attr "type" "sse")))
+  "c2_decoder0,c2_p4+c2_p3")
+
+;; FIXME: This reservation is all wrong when we're scheduling sqrtss.
+(define_insn_reservation "c2_sse_SFDF" 3
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "mode" "SF,DF")
+            (eq_attr "type" "sse")))
+  "c2_decodern,c2_p0")
+
+(define_insn_reservation "c2_sse_V4SF" 4
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "mode" "V4SF")
+            (eq_attr "type" "sse")))
+  "c2_decoder0,c2_p1*2")
+
+(define_insn_reservation "c2_sse_addcmp" 3
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (eq_attr "type" "sseadd,ssecmp,ssecomi")))
+  "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_sse_addcmp_load" 3
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (eq_attr "type" "sseadd,ssecmp,ssecomi")))
+  "c2_decodern,c2_p2+c2_p1")
+
+(define_insn_reservation "c2_sse_mul_SF" 4
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (and (eq_attr "mode" "SF,V4SF")
+                 (eq_attr "type" "ssemul"))))
+  "c2_decodern,c2_p0")
+
+(define_insn_reservation "c2_sse_mul_SF_load" 4
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (and (eq_attr "mode" "SF,V4SF")
+                 (eq_attr "type" "ssemul"))))
+  "c2_decodern,c2_p2+c2_p0")
+
+(define_insn_reservation "c2_sse_mul_DF" 5
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (and (eq_attr "mode" "DF,V2DF")
+                 (eq_attr "type" "ssemul"))))
+  "c2_decodern,c2_p0")
+
+(define_insn_reservation "c2_sse_mul_DF_load" 5
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (and (eq_attr "mode" "DF,V2DF")
+                 (eq_attr "type" "ssemul"))))
+  "c2_decodern,c2_p2+c2_p0")
+
+(define_insn_reservation "c2_sse_div_SF" 18
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (and (eq_attr "mode" "SF,V4SF")
+                 (eq_attr "type" "ssediv"))))
+  "c2_decodern,c2_p0,c2_ssediv*17")
+
+(define_insn_reservation "c2_sse_div_SF_load" 18
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (and (eq_attr "mode" "SF,V4SF")
+                 (eq_attr "type" "ssediv"))))
+  "c2_decodern,(c2_p2+c2_p0),c2_ssediv*17")
+
+(define_insn_reservation "c2_sse_div_DF" 32
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (and (eq_attr "mode" "DF,V2DF")
+                 (eq_attr "type" "ssediv"))))
+  "c2_decodern,c2_p0,c2_ssediv*31")
+
+(define_insn_reservation "c2_sse_div_DF_load" 32
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (and (eq_attr "mode" "DF,V2DF")
+                 (eq_attr "type" "ssediv"))))
+  "c2_decodern,(c2_p2+c2_p0),c2_ssediv*31")
+
+;; FIXME: these have limited throughput
+(define_insn_reservation "c2_sse_icvt_SF" 4
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (and (eq_attr "mode" "SF")
+                 (eq_attr "type" "sseicvt"))))
+  "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_sse_icvt_SF_load" 4
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "!none")
+            (and (eq_attr "mode" "SF")
+                 (eq_attr "type" "sseicvt"))))
+  "c2_decodern,c2_p2+c2_p1")
+
+(define_insn_reservation "c2_sse_icvt_DF" 4
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (and (eq_attr "mode" "DF")
+                 (eq_attr "type" "sseicvt"))))
+  "c2_decoder0,c2_p0+c2_p1")
+
+(define_insn_reservation "c2_sse_icvt_DF_load" 4
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "!none")
+            (and (eq_attr "mode" "DF")
+                 (eq_attr "type" "sseicvt"))))
+  "c2_decoder0,(c2_p2+c2_p1)")
+
+(define_insn_reservation "c2_sse_icvt_SI" 3
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (and (eq_attr "mode" "SI")
+                 (eq_attr "type" "sseicvt"))))
+  "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_sse_icvt_SI_load" 3
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "!none")
+            (and (eq_attr "mode" "SI")
+                 (eq_attr "type" "sseicvt"))))
+  "c2_decodern,(c2_p2+c2_p1)")
+
+(define_insn_reservation "c2_sse_mov" 1
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none")
+            (eq_attr "type" "ssemov")))
+  "c2_decodern,(c2_p0|c2_p1|c2_p5)")
+
+(define_insn_reservation "c2_sse_mov_load" 2
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (eq_attr "type" "ssemov")))
+  "c2_decodern,c2_p2")
+
+(define_insn_reservation "c2_sse_mov_store" 1
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "store")
+            (eq_attr "type" "ssemov")))
+  "c2_decodern,c2_p4+c2_p3")
+
+;; All other instructions are modelled as simple instructions.
+;; We have already modelled all i387 floating point instructions, so all
+;; other instructions execute on either port 0, 1 or 5. This includes
+;; the ALU units, and the MMX units.
+;;
+;; reg-reg instructions produce 1 uop so they can be decoded on any of
+;; the four decoders. Loads benefit from micro-op fusion and can be
+;; treated in the same way.
+(define_insn_reservation "c2_insn" 1
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "none,unknown")
+            (eq_attr "type" "alu,alu1,negnot,incdec,icmp,test,setcc,sseishft1,mmx,mmxcmp")))
+  "c2_decodern,(c2_p0|c2_p1|c2_p5)")
+
+(define_insn_reservation "c2_insn_load" 4
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "load")
+            (eq_attr "type" "alu,alu1,negnot,incdec,icmp,test,setcc,pop,sseishft1,mmx,mmxcmp")))
+  "c2_decodern,c2_p2,(c2_p0|c2_p1|c2_p5)")
+
+;; register-memory instructions have three uops, so they have to be
+;; decoded on c2_decoder0.
+(define_insn_reservation "c2_insn_store" 1
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "store")
+            (eq_attr "type" "alu,alu1,negnot,incdec,icmp,test,setcc,sseishft1,mmx,mmxcmp")))
+  "c2_decoder0,(c2_p0|c2_p1|c2_p5),c2_p4+c2_p3")
+
+;; read-modify-store instructions produce 4 uops so they have to be
+;; decoded on c2_decoder0 as well.
+(define_insn_reservation "c2_insn_both" 4
+  (and (eq_attr "cpu" "core2,corei7")
+       (and (eq_attr "memory" "both")
+            (eq_attr "type" "alu,alu1,negnot,incdec,icmp,test,setcc,pop,sseishft1,mmx,mmxcmp")))
+  "c2_decoder0,c2_p2,(c2_p0|c2_p1|c2_p5),c2_p4+c2_p3")
+
Index: config/i386/i386-c.c
===================================================================
--- config/i386/i386-c.c (revision 162821)
+++ config/i386/i386-c.c (working copy)
@@ -122,6 +122,10 @@ ix86_target_macros_internal (int isa_fla
       def_or_undef (parse_in, "__core2");
       def_or_undef (parse_in, "__core2__");
       break;
+    case PROCESSOR_COREI7:
+      def_or_undef (parse_in, "__corei7");
+      def_or_undef (parse_in, "__corei7__");
+      break;
     case PROCESSOR_ATOM:
       def_or_undef (parse_in, "__atom");
       def_or_undef (parse_in, "__atom__");
@@ -197,6 +201,9 @@ ix86_target_macros_internal (int isa_fla
     case PROCESSOR_CORE2:
       def_or_undef (parse_in, "__tune_core2__");
       break;
+    case PROCESSOR_COREI7:
+      def_or_undef (parse_in, "__tune_corei7__");
+      break;
     case PROCESSOR_ATOM:
       def_or_undef (parse_in, "__tune_atom__");
       break;
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c (revision 162821)
+++ config/i386/i386.c (working copy)
@@ -1124,6 +1124,79 @@ struct processor_costs core2_cost = {
 };
 
 static const
+struct processor_costs corei7_cost = {
+  COSTS_N_INSNS (1),                    /* cost of an add instruction */
+  COSTS_N_INSNS (1) + 1,                /* cost of a lea instruction */
+  COSTS_N_INSNS (1),                    /* variable shift costs */
+  COSTS_N_INSNS (1),                    /* constant shift costs */
+  {COSTS_N_INSNS (3),                   /* cost of starting multiply for QI */
+   COSTS_N_INSNS (3),                   /* HI */
+   COSTS_N_INSNS (3),                   /* SI */
+   COSTS_N_INSNS (3),                   /* DI */
+   COSTS_N_INSNS (3)},                  /* other */
+  0,                                    /* cost of multiply per each bit set */
+  {COSTS_N_INSNS (22),                  /* cost of a divide/mod for QI */
+   COSTS_N_INSNS (22),                  /* HI */
+   COSTS_N_INSNS (22),                  /* SI */
+   COSTS_N_INSNS (22),                  /* DI */
+   COSTS_N_INSNS (22)},                 /* other */
+  COSTS_N_INSNS (1),                    /* cost of movsx */
+  COSTS_N_INSNS (1),                    /* cost of movzx */
+  8,                                    /* "large" insn */
+  16,                                   /* MOVE_RATIO */
+  2,                                    /* cost for loading QImode using movzbl */
+  {6, 6, 6},                            /* cost of loading integer registers
+                                           in QImode, HImode and SImode.
+                                           Relative to reg-reg move (2). */
+  {4, 4, 4},                            /* cost of storing integer registers */
+  2,                                    /* cost of reg,reg fld/fst */
+  {6, 6, 6},                            /* cost of loading fp registers
+                                           in SFmode, DFmode and XFmode */
+  {4, 4, 4},                            /* cost of storing fp registers
+                                           in SFmode, DFmode and XFmode */
+  2,                                    /* cost of moving MMX register */
+  {6, 6},                               /* cost of loading MMX registers
+                                           in SImode and DImode */
+  {4, 4},                               /* cost of storing MMX registers
+                                           in SImode and DImode */
+  2,                                    /* cost of moving SSE register */
+  {6, 6, 6},                            /* cost of loading SSE registers
+                                           in SImode, DImode and TImode */
+  {4, 4, 4},                            /* cost of storing SSE registers
+                                           in SImode, DImode and TImode */
+  2,                                    /* MMX or SSE register to integer */
+  32,                                   /* size of l1 cache. */
+  256,                                  /* size of l2 cache. */
+  128,                                  /* size of prefetch block */
+  8,                                    /* number of parallel prefetches */
+  3,                                    /* Branch cost */
+  COSTS_N_INSNS (3),                    /* cost of FADD and FSUB insns. */
+  COSTS_N_INSNS (5),                    /* cost of FMUL instruction. */
+  COSTS_N_INSNS (32),                   /* cost of FDIV instruction. */
+  COSTS_N_INSNS (1),                    /* cost of FABS instruction. */
+  COSTS_N_INSNS (1),                    /* cost of FCHS instruction. */
+  COSTS_N_INSNS (58),                   /* cost of FSQRT instruction. */
+  {{libcall, {{11, loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{32, loop}, {64, rep_prefix_4_byte},
+              {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+  {{libcall, {{8, loop}, {15, unrolled_loop},
+              {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{24, loop}, {32, unrolled_loop},
+              {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+  1,                                    /* scalar_stmt_cost. */
+  1,                                    /* scalar load_cost. */
+  1,                                    /* scalar_store_cost. */
+  1,                                    /* vec_stmt_cost. */
+  1,                                    /* vec_to_scalar_cost. */
+  1,                                    /* scalar_to_vec_cost. */
+  1,                                    /* vec_align_load_cost. */
+  2,                                    /* vec_unalign_load_cost. */
+  1,                                    /* vec_store_cost. */
+  3,                                    /* cond_taken_branch_cost. */
+  1,                                    /* cond_not_taken_branch_cost. */
+};
+
+static const
 struct processor_costs atom_cost = {
   COSTS_N_INSNS (1),                    /* cost of an add instruction */
   COSTS_N_INSNS (1) + 1,                /* cost of a lea instruction */
@@ -1355,6 +1428,8 @@ const struct processor_costs *ix86_cost
 #define m_PENT4 (1< 127)
+      && (code != AND
+          || (INTVAL (src2) != 255 && INTVAL (src2) != -65281)))
+    src2 = gen_lowpart (HImode, force_reg (SImode, src2));
+
   operands[1] = src1;
   operands[2] = src2;
   return dst;
@@ -14377,6 +14466,12 @@ ix86_binary_operator_ok (enum rtx_code c
   if (MEM_P (src1) && !rtx_equal_p (dst, src1))
     return 0;
 
+  if (TARGET_PROMOTE_HI_CONSTANTS && mode == HImode && CONSTANT_P (src2)
+      && (INTVAL (src2) < -128 || INTVAL (src2) > 127)
+      && (code != AND
+          || (INTVAL (src2) != 255 && INTVAL (src2) != -65281)))
+    return 0;
+
   return 1;
 }
 
@@ -20495,6 +20590,7 @@ ix86_issue_rate (void)
       return 3;
 
     case PROCESSOR_CORE2:
+    case PROCESSOR_COREI7:
       return 4;
 
     default:
@@ -20569,6 +20665,7 @@ ix86_adjust_cost (rtx insn, rtx link, rt
 {
   enum attr_type insn_type, dep_insn_type;
   enum attr_memory memory;
+  enum attr_i7_domain domain1, domain2;
   rtx set, set2;
   int dep_insn_code_number;
 
@@ -20711,6 +20808,19 @@ ix86_adjust_cost (rtx insn, rtx link, rt
           else
             cost = 0;
         }
+      break;
+
+    case PROCESSOR_COREI7:
+      memory = get_attr_memory (insn);
+
+      domain1 = get_attr_i7_domain (insn);
+      domain2 = get_attr_i7_domain (dep_insn);
+      if (domain1 != domain2
+          && !ix86_agi_dependent (dep_insn, insn))
+        cost += ((domain1 == I7_DOMAIN_SIMD && domain2 == I7_DOMAIN_INT)
+                 || (domain1 == I7_DOMAIN_INT && domain2 == I7_DOMAIN_SIMD)
+                 ? 1 : 2);
+      break;
     default:
       break;