From patchwork Fri Oct 15 10:08:57 2010
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Maxim Kuvyrkov
X-Patchwork-Id: 67921
Return-Path:
X-Original-To: incoming@patchwork.ozlabs.org
Delivered-To: patchwork-incoming@bilbo.ozlabs.org
Received: from sourceware.org (server1.sourceware.org [209.132.180.131])
 by ozlabs.org (Postfix) with SMTP id 0286CB70DA
 for ; Fri, 15 Oct 2010 21:09:45 +1100 (EST)
Received: (qmail 1535 invoked by alias); 15 Oct 2010 10:09:41 -0000
Received: (qmail 1482 invoked by uid 22791); 15 Oct 2010 10:09:29 -0000
X-SWARE-Spam-Status: No, hits=-0.3 required=5.0 tests=AWL, BAYES_50, TW_CP,
 TW_HG, TW_OV, TW_VZ, TW_ZB, T_RP_MATCHES_RCVD
X-Spam-Check-By: sourceware.org
Received: from mail.codesourcery.com (HELO mail.codesourcery.com)
 (38.113.113.100) by sourceware.org (qpsmtpd/0.43rc1) with ESMTP;
 Fri, 15 Oct 2010 10:09:05 +0000
Received: (qmail 18222 invoked from network); 15 Oct 2010 10:08:59 -0000
Received: from unknown (HELO ?172.16.1.24?) (maxim@127.0.0.2)
 by mail.codesourcery.com with ESMTPA; 15 Oct 2010 10:08:59 -0000
Message-ID: <4CB82839.5050609@codesourcery.com>
Date: Fri, 15 Oct 2010 14:08:57 +0400
From: Maxim Kuvyrkov
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US;
 rv:1.9.2.9) Gecko/20100915 Thunderbird/3.1.4
MIME-Version: 1.0
To: gcc-patches
CC: "H.J. Lu", Bernd Schmidt
Subject: Core 2/i7 tuning results and analysis
X-IsSubscribed: yes
Mailing-List: contact gcc-patches-help@gcc.gnu.org; run by ezmlm
Precedence: bulk
List-Id:
List-Unsubscribe:
List-Archive:
List-Post:
List-Help:
Sender: gcc-patches-owner@gcc.gnu.org
Delivered-To: mailing list gcc-patches@gcc.gnu.org

[Resending without the printed-out version of the spreadsheet to fit
within the gcc-patches@ size requirements.]

I've been investigating performance regressions for Core 2 and Core i7
processors.
The impact of certain small tuning changes on x86 performance may be
interesting to a wider audience, so here are my results and analysis.

Attached is a tar of the patch set I tested.  Most of these patches are
dissected from Bernd's earlier work for Core 2/i7.

+ 0001-Basic-support-for-Core-i7.patch
+ 0002-Enable-Core-i7-architectural-features.patch
? 0003-Extend-Core-2-tune-features-to-Core-i7.patch
? 0004-Tweak-tuning-for-Core-i7.patch
+ 0005-Add-PROMOTE_HI_CONSTANTS-tuning.patch
+ 0006-Define-Core-i7-costs.patch
+ 0007-Use-64-bit-alignment-for-Core-i7-32-bit-mode.patch
+ 0008-Configure-bits-for-Core-i7.patch
+ 0009-Core-i7-DFA-model.patch
? 0010-Define-issue_rate-for-Core-i7.patch
+ 0011-Model-Core-i7-pipeline-domains.patch
- 0012-Update-Core-2-tuning.patch
+ 0013-Use-Core-2-DFA-model-for-Core-2.patch
+ 0014-Update-PentiumPro-tuning.patch
- 0015-Handle-privileged-insns.patch
+ 0016-Model-Core2-i7-decoder-bottleneck.patch

Some of these patches (marked with '+') tend to improve average
performance, while others (marked with '-') tend to regress it.  We will
be posting the '+' patches for review once I get benchmark numbers
without the regressing patches.

Attached is an Excel spreadsheet with results for SPEC CPU2000.  The
interesting part is the graphs visualizing the performance impact of each
of the patches.  The "line" graph shows the performance change in percent
relative to *baseline*, i.e., current -mtune=core2 for Core 2 and
-mtune=generic[64] for Core i7.  The "column" graph shows the performance
change in percent relative to the *previous* patch.  I find the "column"
graph more interesting, as it shows the impact of individual changes on
performance.  SPECint and SPECfp results are highlighted in purple and
red, respectively, on the column graph.

Tuning flags:
-O2 -ffast-math -msse2 -mfpmath=sse -mtune={core2, corei7} {-m32/-m64}

Patches that are no-ops from a performance point of view for a particular
CPU are not included in the data.
I did confirm in one of the test runs that these patches indeed do not
affect performance.

Now, analysis of the patches:

+ 0001-Basic-support-for-Core-i7.patch
Baseline.
The patch makes GCC recognize "corei7" for the -mtune= and -march=
options.  The patch sets tuning for Core i7 to that of -mtune=generic or
-mtune=generic64, depending on the {-m32/-m64} option.  The generic CPU
is special in the sense that it has different tuning for 32-bit and
64-bit modes.  The patch gives Core i7 the same capability to use
different tuning for different ABIs.

+ 0002-Enable-Core-i7-architectural-features.patch
Nearly noise from a performance point of view.
Enable supported ISA extensions for Core i7.

? 0003-Extend-Core-2-tune-features-to-Core-i7.patch
Improves SPECfp in 32-bit mode, but degrades SPECint in 64-bit mode.
Set tuning for Core i7 to be the same as for Core 2.

? 0004-Tweak-tuning-for-Core-i7.patch
Regresses SPECint and SPECfp in 32-bit mode, but improves SPECint in
64-bit mode.
Adjust tuning for Core i7.

+ 0005-Add-PROMOTE_HI_CONSTANTS-tuning.patch
Improves SPECint.
Add a new tuning option to promote HI constants.

+ 0006-Define-Core-i7-costs.patch
Slightly regresses SPECint, but improves SPECfp.
Define rtx costs for Core i7.  The biggest regression is 164.gzip.  We
don't know why.

+ 0007-Use-64-bit-alignment-for-Core-i7-32-bit-mode.patch
Significantly improves Core i7 performance in 32-bit mode.
Increase alignment for 32-bit mode for Core i7 to match 64-bit mode.

+ 0008-Configure-bits-for-Core-i7.patch
Performance no-op.
Add support for configure options --with-arch=, etc., for Core i7.

+ 0009-Core-i7-DFA-model.patch
Improves SPECfp.
DFA model for Core i7.

? 0010-Define-issue_rate-for-Core-i7.patch
Improves SPECint, regresses SPECfp.
Increase issue_rate to 4 for Core i7.  This one-line change makes
200.sixtrack regress from +1.75% to -2.0% for Core i7 in 32-bit mode.  I
spent a lot of time investigating and trying to fix this regression, but
didn't succeed.
The slowdown can be tracked down to a hot loop that fits on a screen,
but the slowdown seems to be evenly distributed all over the loop.  The
loop does floating-point computations with around 6 variables and
streams data from memory.  The instructions within the loop are all the
same before and after the patch; the only difference is in their order.
At first I thought that the loop hits the decoder bottleneck, i.e.,
instructions that can be decoded only by the D0 decoder get assigned to
secondary decoders.  I implemented modeling of the Core 2/i7 decoder to
make the scheduler aware of that
(Model-Core2-i7-decoder-bottleneck.patch).  That didn't fix the
regression, so now I suspect that the register ports may be responsible
for the slowdown.  I don't have proof, though.  Maybe it is worth trying
an issue rate of 3 for Core 2/i7?

+ 0011-Model-Core-i7-pipeline-domains.patch
Improves SPECfp.
Adjust scheduling costs for instructions that cross Core i7 pipeline
domains, i.e., an instruction that generates uops for both the integer
and floating-point domains, which need to pass data between each other.

- 0012-Update-Core-2-tuning.patch
No definitive result for 32-bit mode; SPECfp regresses in 64-bit mode.
Adjust tuning for Core 2.

+ 0013-Use-Core-2-DFA-model-for-Core-2.patch
Improves SPECfp for 64-bit mode; improves and regresses SPECint and
SPECfp in equal proportion for 32-bit mode.
Switch DFA model for Core 2.  187.facerec regresses by 7% on 32-bit
Core 2 with this change.

+ 0014-Update-PentiumPro-tuning.patch
No data, but should be an improvement.
Enable PROMOTE_HI_CONSTANTS tuning for PentiumPro and, hence,
-mtune=generic.

- 0015-Handle-privileged-insns.patch
Improves some tests but regresses others; no conclusive result.
Attempt to make the scheduler smarter about which instructions to
prioritize.  The theory was that the scheduler should not distinguish
between the *first* instruction in the ready list and subsequent
instructions that are essentially the same as the first.
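To make the theory concrete, here is a minimal toy sketch in C.  It is
not GCC code: the struct toy_insn and the functions toy_rank() and
toy_head_ties() are hypothetical names invented for illustration, and
"priority" stands in for whatever the real scheduler ranks on.  The
comparator orders the ready list by priority and uses a unique id purely
as a tie-breaker; the helper then counts how many leading entries are
equivalent to the head once the tie-breaker is ignored.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical toy instruction record; not GCC's data structure.  */
struct toy_insn
{
  int priority;  /* Higher means more urgent.  */
  int uid;       /* Unique id, used only to break ties.  */
};

/* qsort comparator: higher priority first, then lower uid first.
   The uid check exists only to make the resulting order deterministic.  */
static int
toy_rank (const void *a, const void *b)
{
  const struct toy_insn *x = a;
  const struct toy_insn *y = b;

  if (x->priority != y->priority)
    return (x->priority < y->priority) - (x->priority > y->priority);
  return (x->uid > y->uid) - (x->uid < y->uid);
}

/* Sort READY (N entries) and return how many leading entries have the
   same priority as the head, i.e., differ from it only by tie-breaker.  */
static size_t
toy_head_ties (struct toy_insn *ready, size_t n)
{
  size_t k;

  if (n == 0)
    return 0;
  qsort (ready, n, sizeof *ready, toy_rank);
  for (k = 1; k < n && ready[k].priority == ready[0].priority; k++)
    ;
  return k;
}
```

Under these assumptions, a ready list holding two priority-5 entries
sorts them to the front in uid order, and toy_head_ties() returns 2: a
chooser that treats only element 0 as special would ignore the second
entry even though the ranking considers the two equal apart from the uid.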
[rank_for_schedule() is used to sort the ready list, and it has several
tie-breaking checks to make the sort stable.  From the
choose_ready/max_issue perspective these tie-breaking checks decrease
the optimization space for no good reason.  Apparently, the theory does
not agree with experiment in this case.]

+ 0016-Model-Core2-i7-decoder-bottleneck.patch
Improves SPECint, though it was designed to fix the regression in
SPECfp's 200.sixtrack.
The patch makes the scheduler aware of decoder restrictions on Core
2/i7.  New hooks into multipass scheduling allow the backend to filter
out of the search space instructions that can no longer be issued on the
current cycle, e.g., because they would not fit into the rest of the
IFETCH block or could not be decoded by the secondary decoders.
Strictly speaking, this is theoretically possible to model in the DFA,
but it would require immensely more work and would not be nearly as
comprehensible as using target hooks.

Your comments [and patches fixing the regressions :)] are welcome.

Thank you,

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 33510a7..08837e1 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2852,7 +2852,8 @@ ix86_option_override_internal (bool main_args_p)
       PTA_LWP = 1 << 23,
       PTA_FSGSBASE = 1 << 24,
       PTA_RDRND = 1 << 25,
-      PTA_F16C = 1 << 26
+      PTA_F16C = 1 << 26,
+      PTA_TUNE32 = 1 << 27
     };
 
   static struct pta
@@ -2894,6 +2895,10 @@ ix86_option_override_internal (bool main_args_p)
       {"core2", PROCESSOR_CORE2, CPU_CORE2,
         PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
         | PTA_SSSE3 | PTA_CX16},
+      {"corei7", PROCESSOR_GENERIC32, CPU_PENTIUMPRO,
+       PTA_TUNE32},
+      {"", PROCESSOR_GENERIC64, CPU_GENERIC64,
+       PTA_64BIT},
       {"atom", PROCESSOR_ATOM, CPU_ATOM,
         PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
         | PTA_SSSE3 | PTA_CX16 | PTA_MOVBE},
@@ -3127,6 +3132,16 @@ ix86_option_override_internal (bool main_args_p)
   for (i = 0; i < pta_size; i++)
     if (! strcmp (ix86_arch_string, processor_alias_table[i].name))
       {
+        if (TARGET_64BIT && (processor_alias_table[i].flags & PTA_TUNE32))
+          /* Switch to the next entry which has tuning parameters for 64-bit
+             mode.  */
+          {
+            ++i;
+            gcc_assert (i < pta_size
+                        && processor_alias_table[i].name[0] == '\0'
+                        && !(processor_alias_table[i].flags & PTA_TUNE32));
+          }
+
         ix86_schedule = processor_alias_table[i].schedule;
         ix86_arch = processor_alias_table[i].processor;
         /* Default cpu tuning to the architecture.  */
@@ -3231,6 +3246,16 @@ ix86_option_override_internal (bool main_args_p)
   for (i = 0; i < pta_size; i++)
     if (! strcmp (ix86_tune_string, processor_alias_table[i].name))
       {
+        if (TARGET_64BIT && (processor_alias_table[i].flags & PTA_TUNE32))
+          /* Switch to the next entry which has tuning parameters for 64-bit
+             mode.  */
+          {
+            ++i;
+            gcc_assert (i < pta_size
+                        && processor_alias_table[i].name[0] == '\0'
+                        && !(processor_alias_table[i].flags & PTA_TUNE32));
+          }
+
         ix86_schedule = processor_alias_table[i].schedule;
         ix86_tune = processor_alias_table[i].processor;
         if (TARGET_64BIT && !(processor_alias_table[i].flags & PTA_64BIT))