Core 2/i7 tuning results and analysis

Message ID 4CB82839.5050609@codesourcery.com
State New

Commit Message

Maxim Kuvyrkov Oct. 15, 2010, 10:08 a.m. UTC
[Resending without printed out version of the spreadsheet to fit into 
gcc-patches@ size requirements.]

I've been investigating performance regressions for Core 2 and Core i7 
processors.  The impact of certain small tuning changes on x86 
performance may be interesting to a wider audience, so here are my 
results and analysis.

Attached is a tar of the patch set I tested.  Most of these patches are 
split out of Bernd's earlier work for Core 2/i7.

+ 0001-Basic-support-for-Core-i7.patch
+ 0002-Enable-Core-i7-architectural-features.patch
? 0003-Extend-Core-2-tune-features-to-Core-i7.patch
? 0004-Tweak-tuning-for-Core-i7.patch
+ 0005-Add-PROMOTE_HI_CONSTANTS-tuning.patch
+ 0006-Define-Core-i7-costs.patch
+ 0007-Use-64-bit-alignment-for-Core-i7-32-bit-mode.patch
+ 0008-Configure-bits-for-Core-i7.patch
+ 0009-Core-i7-DFA-model.patch
? 0010-Define-issue_rate-for-Core-i7.patch
+ 0011-Model-Core-i7-pipeline-domains.patch
- 0012-Update-Core-2-tuning.patch
+ 0013-Use-Core-2-DFA-model-for-Core-2.patch
+ 0014-Update-PentiumPro-tuning.patch
- 0015-Handle-privileged-insns.patch
+ 0016-Model-Core2-i7-decoder-bottleneck.patch

Some of these patches (marked with '+') tend to improve average 
performance, while others (marked with '-') tend to regress it; patches 
marked with '?' give mixed results.  We will be posting the '+' patches 
for review once I get benchmark numbers without the regressing patches.

Attached is an Excel spreadsheet with results for SPECCPU2000.  The 
interesting part is the graphs visualizing the performance impact of 
each of the patches.  The "line" graph shows performance change in 
percent relative to *baseline*, i.e., current -mtune=core2 for Core 2 
and -mtune=generic[64] for Core i7.  The "column" graph shows 
performance change in percent relative to the *previous* patch.  I find 
the "column" graph more interesting as it shows the impact of individual 
changes on performance.  SPECint and SPECfp results are highlighted in 
purple and red, respectively, on the column graph.
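
To make the difference between the two graphs concrete, here is a minimal 
C sketch of how the two series could be derived from a list of scores, 
one score per cumulative patch level.  The function names and the 
"higher score is better" convention are assumptions made for 
illustration; they are not taken from the spreadsheet.

```c
#include <assert.h>
#include <stddef.h>

/* Percent change of SCORE relative to REFERENCE, assuming that a
   higher score is better (as with SPEC ratios).  */
static double
percent_change (double score, double reference)
{
  assert (reference > 0.0);
  return (score / reference - 1.0) * 100.0;
}

/* "Line" graph: every patch level is compared against the baseline.  */
static void
line_series (const double *scores, size_t n, double baseline, double *out)
{
  for (size_t i = 0; i < n; i++)
    out[i] = percent_change (scores[i], baseline);
}

/* "Column" graph: every patch level is compared against the level
   just before it; the first level is compared against the baseline.  */
static void
column_series (const double *scores, size_t n, double baseline, double *out)
{
  double prev = baseline;

  for (size_t i = 0; i < n; i++)
    {
      out[i] = percent_change (scores[i], prev);
      prev = scores[i];
    }
}
```

With a made-up baseline of 100 and scores of 110 and 99, the line series 
reads +10% and -1%, while the column series reads +10% and -10%: the 
second patch looks almost neutral against the baseline but is a clear 
regression on its own.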

Tuning flags: -O2 -ffast-math -msse2 -mfpmath=sse -mtune={core2, corei7} 
{-m32/-m64}

Patches that are no-ops from a performance point of view for a 
particular CPU are not included in the data.  I did confirm in one of 
the test runs that these patches indeed do not affect performance.

Now, analysis of the patches:

+ 0001-Basic-support-for-Core-i7.patch

Baseline.

The patch makes GCC recognize "corei7" for -mtune= and -march= options. 
The patch sets tuning for Core i7 to that of -mtune=generic32 or 
-mtune=generic64 depending on the {-m32/-m64} option.  The generic CPU 
is special in the sense that it has different tuning for 32-bit and 
64-bit modes.  The patch adds the same capability to use different 
tuning for different ABIs for Core i7.
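
The mechanism can be sketched outside of GCC as follows.  This is a 
simplified, standalone model of the paired-entry lookup used in the 
attached patch: a named 32-bit entry is followed by an unnamed entry 
holding the 64-bit tuning.  The type, flag and function names below are 
hypothetical stand-ins, not the identifiers from i386.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum processor { PROC_GENERIC32, PROC_GENERIC64, PROC_CORE2 };

enum { FLAG_64BIT = 1 << 0, FLAG_TUNE32 = 1 << 1 };

struct alias
{
  const char *name;
  enum processor proc;
  unsigned flags;
};

/* An entry flagged FLAG_TUNE32 holds the 32-bit tuning; the unnamed
   entry right after it holds the 64-bit tuning for the same CPU.  */
static const struct alias alias_table[] = {
  { "core2",  PROC_CORE2,     FLAG_64BIT },
  { "corei7", PROC_GENERIC32, FLAG_TUNE32 },
  { "",       PROC_GENERIC64, FLAG_64BIT },
};

/* Return the entry to use for NAME, or NULL if NAME is unknown.  */
static const struct alias *
lookup_tuning (const char *name, int target_64bit)
{
  size_t n = sizeof alias_table / sizeof alias_table[0];

  if (name[0] == '\0')
    return NULL;		/* Unnamed entries are not selectable.  */

  for (size_t i = 0; i < n; i++)
    if (strcmp (name, alias_table[i].name) == 0)
      {
	if (target_64bit && (alias_table[i].flags & FLAG_TUNE32))
	  {
	    /* Switch to the companion 64-bit entry.  */
	    ++i;
	    assert (i < n
		    && alias_table[i].name[0] == '\0'
		    && !(alias_table[i].flags & FLAG_TUNE32));
	  }
	return &alias_table[i];
      }
  return NULL;
}
```

In this model, looking up "corei7" yields the generic32 entry for a 
32-bit target and the generic64 entry for a 64-bit target, while 
"core2" resolves to the same entry in both modes.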

+ 0002-Enable-Core-i7-architectural-features.patch

Nearly noise from a performance point of view.

Enable supported ISA extensions for Core i7.

? 0003-Extend-Core-2-tune-features-to-Core-i7.patch

Improves SPECfp in 32-bit mode, but degrades SPECint in 64-bit mode.

Set tuning for Core i7 to be the same as for Core 2.

? 0004-Tweak-tuning-for-Core-i7.patch

Regresses SPECint and SPECfp in 32-bit mode, but improves SPECint for 
64-bit mode.

Adjust tuning for Core i7.

+ 0005-Add-PROMOTE_HI_CONSTANTS-tuning.patch

Improves SPECint.  Add new tuning option to promote HI constants.

+ 0006-Define-Core-i7-costs.patch

Slightly regresses SPECint, but improves SPECfp.  Define rtx costs for 
Core i7.

The biggest regression is 164.gzip.  We don't know why.

+ 0007-Use-64-bit-alignment-for-Core-i7-32-bit-mode.patch

Significantly improves Core i7 performance in 32-bit mode.  Increase 
alignment for 32-bit mode for Core i7 to match 64-bit mode.

+ 0008-Configure-bits-for-Core-i7.patch

Performance no-op.  Add support for configure options --with-arch=, 
etc., for Core i7.

+ 0009-Core-i7-DFA-model.patch

Improves SPECfp.  DFA model for Core i7.

? 0010-Define-issue_rate-for-Core-i7.patch

Improves SPECint, regresses SPECfp.  Increase issue_rate to 4 for Core i7.

This one-line change makes 200.sixtrack regress from +1.75% to -2.0% for 
Core i7 32-bit mode.  I spent a lot of time investigating and trying to 
fix this regression, but didn't succeed.  The slowdown can be tracked 
down to a hot loop that fits on a screen, but the slowdown seems to be 
evenly distributed all over the loop.  The loop does floating-point 
computations with around 6 variables and streams data from memory. 
Instructions within the loop are all the same before and after the 
patch; the only difference is in their order.

First I thought that the loop hits the decoder bottleneck, i.e., 
instructions that can be decoded only by the D0 decoder get assigned to 
secondary decoders.  I implemented modeling of the Core2/i7 decoders to 
make the scheduler aware of that 
(0016-Model-Core2-i7-decoder-bottleneck.patch).  That didn't fix the 
regression, so now I suspect that the register ports may be responsible 
for the slowdown.  I don't have proof, though.

Maybe it is worth trying an issue rate of 3 for Core2/i7?

+ 0011-Model-Core-i7-pipeline-domains.patch

Improves SPECfp.  Adjust scheduling costs for instructions that cross 
Core i7 pipeline domains, i.e., an instruction generates uops for both 
integer and floating-point domains that need to pass data between each 
other.
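
As a rough illustration of the idea (not the code from the patch), the 
cost adjustment can be modeled as a penalty added to the latency of any 
instruction whose uops span both domains.  The structure, the names and 
the one-cycle penalty are all placeholders chosen for this sketch.

```c
#include <assert.h>

/* Pipeline domains an instruction's uops may execute in.  */
enum { DOMAIN_INT = 1 << 0, DOMAIN_FP = 1 << 1 };

/* Placeholder value; the real penalty depends on the instruction
   and on the microarchitecture.  */
#define CROSS_DOMAIN_PENALTY 1

struct insn_info
{
  int latency;		/* Latency without any domain crossing.  */
  unsigned domains;	/* Mask of DOMAIN_* bits used by the uops.  */
};

/* Scheduling cost of INSN: the base latency, plus a penalty if data
   has to be passed between the integer and floating-point domains.  */
static int
scheduling_cost (const struct insn_info *insn)
{
  const unsigned both = DOMAIN_INT | DOMAIN_FP;

  assert (insn->latency >= 0);
  if ((insn->domains & both) == both)
    return insn->latency + CROSS_DOMAIN_PENALTY;
  return insn->latency;
}
```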

- 0012-Update-Core-2-tuning.patch

No definitive result for 32-bit mode; SPECfp regresses in 64-bit mode. 
Adjust tuning for Core 2.

+ 0013-Use-Core-2-DFA-model-for-Core-2.patch

Improves SPECfp for 64-bit mode; improves and regresses SPECint and 
SPECfp in equal proportion for 32-bit mode.  Switch DFA model for Core 2.

187.facerec regresses by 7% on 32-bit Core2 with this change.

+ 0014-Update-PentiumPro-tuning.patch

No data, but should be an improvement.  Enable PROMOTE_HI_CONSTANTS 
tuning for PentiumPro and, hence, -mtune=generic.

- 0015-Handle-privileged-insns.patch

Improves some tests but regresses others; no conclusive result.

Attempt to make scheduler smarter about which instructions to 
prioritize.  The theory was that the scheduler should not distinguish 
between the *first* instruction in the ready list and subsequent 
instructions that are essentially the same as the first.

[rank_for_schedule() is used to sort the ready list and it has several 
tie-breaking checks to make the sort stable.  From the 
choose_ready/max_issue perspective these tie-breaking checks decrease 
the optimization space for no good reason.  Apparently, the theory does 
not agree with experiment in this case.]
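
The effect of tie-breaking can be seen in a small standalone model.  The 
sketch below is not GCC's rank_for_schedule(); it is a hypothetical 
comparator that orders instructions by priority and breaks ties by 
instruction UID, plus a helper that counts how many entries at the head 
of the list are equivalent as far as priority is concerned.

```c
#include <assert.h>
#include <stdlib.h>

struct ready_insn
{
  int uid;		/* Unique id, used only to break ties.  */
  int priority;		/* Higher means more urgent.  */
};

/* Order by descending priority; fall back to ascending UID so that
   the result does not depend on how qsort treats equal elements.  */
static int
rank_cmp (const void *a, const void *b)
{
  const struct ready_insn *x = a, *y = b;

  if (x->priority != y->priority)
    return (x->priority < y->priority) - (x->priority > y->priority);
  return (x->uid > y->uid) - (x->uid < y->uid);
}

static void
sort_ready (struct ready_insn *ready, size_t n)
{
  qsort (ready, n, sizeof *ready, rank_cmp);
}

/* Number of leading entries that share the head's priority, i.e. the
   candidates that only the tie-breaker told apart.  */
static size_t
equivalent_head_count (const struct ready_insn *ready, size_t n)
{
  size_t k = 0;

  while (k < n && ready[k].priority == ready[0].priority)
    k++;
  return k;
}
```

In this model the tie-breaker always puts one particular instruction 
first, even when equivalent_head_count() reports that several candidates 
are equally good; treating all of them alike is the kind of change the 
patch attempted.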

+ 0016-Model-Core2-i7-decoder-bottleneck.patch

Improves SPECint, though it was designed to fix the regression in 
SPECfp's 200.sixtrack.  The patch makes the scheduler aware of decoder 
restrictions on Core 2/i7.  New hooks to multipass scheduling allow the 
backend to remove from the search space instructions that can no longer 
be issued on the current cycle, e.g., because they would not fit into 
the rest of the IFETCH block or could not be decoded by secondary 
decoders.
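
A minimal standalone sketch of such a filter is shown below.  It is not 
the hook implementation from the patch: the 16-byte IFETCH block, the 
three secondary decoders, the greedy decoder assignment and all names 
are assumptions made for illustration, and the in-order nature of 
decoding is ignored.

```c
#include <assert.h>
#include <stdbool.h>

#define IFETCH_BLOCK_BYTES 16	/* Assumed fetch block size.  */
#define SECONDARY_DECODERS 3	/* Assumed number of simple decoders.  */

struct insn_desc
{
  int length;		/* Encoded length in bytes.  */
  bool needs_d0;	/* True if only decoder D0 can handle it.  */
};

/* Decoder resources consumed so far in the current cycle.  */
struct cycle_state
{
  int bytes_used;
  bool d0_used;
  int secondary_used;
};

/* Return true if INSN can still be decoded in the current cycle.  */
static bool
can_issue_this_cycle (const struct cycle_state *st,
		      const struct insn_desc *insn)
{
  if (st->bytes_used + insn->length > IFETCH_BLOCK_BYTES)
    return false;		/* Does not fit into the block.  */
  if (insn->needs_d0)
    return !st->d0_used;
  return !st->d0_used || st->secondary_used < SECONDARY_DECODERS;
}

/* Account for issuing INSN; simple instructions take a secondary
   decoder when one is free and fall back to D0 otherwise.  */
static void
record_issue (struct cycle_state *st, const struct insn_desc *insn)
{
  assert (can_issue_this_cycle (st, insn));
  st->bytes_used += insn->length;
  if (insn->needs_d0)
    st->d0_used = true;
  else if (st->secondary_used < SECONDARY_DECODERS)
    st->secondary_used++;
  else
    st->d0_used = true;
}
```

A scheduler could call can_issue_this_cycle() on each ready instruction 
and drop the ones that return false from the candidates considered for 
the current cycle, resetting the state when the cycle advances.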

Strictly speaking, it is theoretically possible to model this in the 
DFA, but it would require immensely more work and would not be nearly as 
comprehensible as using target hooks.


Your comments [and patches fixing the regressions :)] are welcome.

Thank you,

Comments

H.J. Lu Oct. 15, 2010, 10:58 p.m. UTC | #1
On Fri, Oct 15, 2010 at 3:08 AM, Maxim Kuvyrkov <maxim@codesourcery.com> wrote:

> + 0001-Basic-support-for-Core-i7.patch
>
> Baseline.
>
> The patch makes GCC recognize "corei7" for -mtune= and -march= options.
> The patch sets tuning for Core i7 to that of -mtune=generic or
> -mtune=generic64 depending on the {-m32/-m64} option.  The generic CPU is
> special in the sense that it has different tuning for 32-bit and 64-bit modes.
>  The patch adds the same capability to use different tuning for different ABIs
> for Core i7.
>

I don't think it is needed.  We added GENERIC32/GENERIC64 so that
we can tune for 32bit/64bit in i386.c and *.md. PTA_TUNE32 is only
used in ix86_option_override_internal where we know we are compiling
for 32bit or 64bit. We can use TARGET_64BIT instead of adding
PTA_TUNE32.
H.J. Lu Oct. 15, 2010, 11:04 p.m. UTC | #2
On Fri, Oct 15, 2010 at 3:08 AM, Maxim Kuvyrkov <maxim@codesourcery.com> wrote:
> [Resending without printed out version of the spreadsheet to fit into
> gcc-patches@ size requirements.]
>
> I've been investigating performance regressions for Core 2 and Core i7
> processors.  The impact of certain small tuning changes on x86 performance
> may be interesting to a wider audience, so here are my results and analysis.
>
> Attached is a tar of the patch set I tested.  Most of these patches are
> dissections from earlier Bernd's work for Core 2/i7.
>
> + 0001-Basic-support-for-Core-i7.patch
> + 0002-Enable-Core-i7-architectural-features.patch
> ? 0003-Extend-Core-2-tune-features-to-Core-i7.patch
> ? 0004-Tweak-tuning-for-Core-i7.patch
> + 0005-Add-PROMOTE_HI_CONSTANTS-tuning.patch
> + 0006-Define-Core-i7-costs.patch
> + 0007-Use-64-bit-alignment-for-Core-i7-32-bit-mode.patch
> + 0008-Configure-bits-for-Core-i7.patch
> + 0009-Core-i7-DFA-model.patch
> ? 0010-Define-issue_rate-for-Core-i7.patch
> + 0011-Model-Core-i7-pipeline-domains.patch
> - 0012-Update-Core-2-tuning.patch
> + 0013-Use-Core-2-DFA-model-for-Core-2.patch
> + 0014-Update-PentiumPro-tuning.patch
> - 0015-Handle-privileged-insns.patch
> + 0016-Model-Core2-i7-decoder-bottleneck.patch
>

Hi Maxim,

I will try your patches on Core 2 and Core i7 with SPEC CPU 2K/2006.
Unfortunately, gcc has been failing SPEC CPU 2K/2006 for several
weeks now:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45720
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=45865

My performance comparison may not be complete.
Maxim Kuvyrkov Oct. 16, 2010, 6:08 a.m. UTC | #3
On 10/16/10 3:04 AM, H.J. Lu wrote:
> On Fri, Oct 15, 2010 at 3:08 AM, Maxim Kuvyrkov<maxim@codesourcery.com>  wrote:
>> [Resending without printed out version of the spreadsheet to fit into
>> gcc-patches@ size requirements.]
>>
>> I've been investigating performance regressions for Core 2 and Core i7
>> processors.  The impact of certain small tuning changes on x86 performance
>> may be interesting to a wider audience, so here are my results and analysis.
>>
>> Attached is a tar of the patch set I tested.  Most of these patches are
>> dissections from earlier Bernd's work for Core 2/i7.
>>
>> + 0001-Basic-support-for-Core-i7.patch
>> + 0002-Enable-Core-i7-architectural-features.patch
>> ? 0003-Extend-Core-2-tune-features-to-Core-i7.patch
>> ? 0004-Tweak-tuning-for-Core-i7.patch
>> + 0005-Add-PROMOTE_HI_CONSTANTS-tuning.patch
>> + 0006-Define-Core-i7-costs.patch
>> + 0007-Use-64-bit-alignment-for-Core-i7-32-bit-mode.patch
>> + 0008-Configure-bits-for-Core-i7.patch
>> + 0009-Core-i7-DFA-model.patch
>> ? 0010-Define-issue_rate-for-Core-i7.patch
>> + 0011-Model-Core-i7-pipeline-domains.patch
>> - 0012-Update-Core-2-tuning.patch
>> + 0013-Use-Core-2-DFA-model-for-Core-2.patch
>> + 0014-Update-PentiumPro-tuning.patch
>> - 0015-Handle-privileged-insns.patch
>> + 0016-Model-Core2-i7-decoder-bottleneck.patch
>>
>
> Hi Maxim,
>
> I will try your patches on Core 2 and Core i7 with SPEC CPU 2K/2006.
> Unfortunately, gcc has been failing SPEC CPU 2K/2006 for several
> weeks now:

FYI, I did the benchmarking against rev. 165150 dated Oct. 8.  I must 
have been lucky to catch a window when GCC mainline was OK.

Thanks,
H.J. Lu Oct. 17, 2010, 12:29 a.m. UTC | #4
On Fri, Oct 15, 2010 at 3:58 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Fri, Oct 15, 2010 at 3:08 AM, Maxim Kuvyrkov <maxim@codesourcery.com> wrote:
>
>> + 0001-Basic-support-for-Core-i7.patch
>>
>> Baseline.
>>
>> The patch makes GCC recognize "corei7" for -mtune= and -march= options.
>> The patch sets tuning for Core i7 to that of -mtune=generic or
>> -mtune=generic64 depending on the {-m32/-m64} option.  The generic CPU is
>> special in the sense that it has different tuning for 32-bit and 64-bit modes.
>>  The patch adds the same capability to use different tuning for different ABIs
>> for Core i7.
>>
>
> I don't think it is needed.  We added GENERIC32/GENERIC64 so that
> we can tune for 32bit/64bit in i386.c and *.md. PTA_TUNE32 is only
> used in ix86_option_override_internal where we know we are compiling
> for 32bit or 64bit. We can use TARGET_64BIT instead of adding
> PTA_TUNE32.
>
>

This patch removes PTA_TUNE32:

http://git.kernel.org/?p=devel/gcc/hjl/x86.git;a=patch;h=39b88e72a06b52cebb2749433c874332da3a184d

Patch

diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 33510a7..08837e1 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -2852,7 +2852,8 @@  ix86_option_override_internal (bool main_args_p)
       PTA_LWP = 1 << 23,
       PTA_FSGSBASE = 1 << 24,
       PTA_RDRND = 1 << 25,
-      PTA_F16C = 1 << 26
+      PTA_F16C = 1 << 26,
+      PTA_TUNE32 = 1 << 27
     };
 
   static struct pta
@@ -2894,6 +2895,10 @@  ix86_option_override_internal (bool main_args_p)
       {"core2", PROCESSOR_CORE2, CPU_CORE2,
 	PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
 	| PTA_SSSE3 | PTA_CX16},
+      {"corei7", PROCESSOR_GENERIC32, CPU_PENTIUMPRO,
+	PTA_TUNE32},
+      {"", PROCESSOR_GENERIC64, CPU_GENERIC64,
+	PTA_64BIT},
       {"atom", PROCESSOR_ATOM, CPU_ATOM,
 	PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
 	| PTA_SSSE3 | PTA_CX16 | PTA_MOVBE},
@@ -3127,6 +3132,16 @@  ix86_option_override_internal (bool main_args_p)
   for (i = 0; i < pta_size; i++)
     if (! strcmp (ix86_arch_string, processor_alias_table[i].name))
       {
+	if (TARGET_64BIT && (processor_alias_table[i].flags & PTA_TUNE32))
+	  /* Switch to the next entry which has tuning parameters for 64-bit
+	     mode.  */
+	  {
+	    ++i;
+	    gcc_assert (i < pta_size
+			&& processor_alias_table[i].name[0] == '\0'
+			&& !(processor_alias_table[i].flags & PTA_TUNE32));
+	  }
+
 	ix86_schedule = processor_alias_table[i].schedule;
 	ix86_arch = processor_alias_table[i].processor;
 	/* Default cpu tuning to the architecture.  */
@@ -3231,6 +3246,16 @@  ix86_option_override_internal (bool main_args_p)
   for (i = 0; i < pta_size; i++)
     if (! strcmp (ix86_tune_string, processor_alias_table[i].name))
       {
+	if (TARGET_64BIT && (processor_alias_table[i].flags & PTA_TUNE32))
+	  /* Switch to the next entry which has tuning parameters for 64-bit
+	     mode.  */
+	  {
+	    ++i;
+	    gcc_assert (i < pta_size
+			&& processor_alias_table[i].name[0] == '\0'
+			&& !(processor_alias_table[i].flags & PTA_TUNE32));
+	  }
+
 	ix86_schedule = processor_alias_table[i].schedule;
 	ix86_tune = processor_alias_table[i].processor;
 	if (TARGET_64BIT && !(processor_alias_table[i].flags & PTA_64BIT))