Core 2 and Core i7 tuning

Submitter H.J. Lu
Date Aug. 20, 2010, 8:43 p.m.
Message ID <AANLkTinJX=DF7yvQgzRE1tNqcYvPXLUBLbaZJX_W4WTN@mail.gmail.com>
Permalink /patch/62316/
State New

Comments

H.J. Lu - Aug. 20, 2010, 8:43 p.m.
On Fri, Aug 20, 2010 at 1:07 PM, Bernd Schmidt <bernds@codesourcery.com> wrote:
> Here's something I've been working on for a while.  This adds a corei7
> processor type, a Core 2/Core i7 scheduling description, and twiddles a
> few of the x86 tuning flags.  I'm not terribly happy with it yet due to
> the relatively small performance improvement, but I'd promised some
> folks I'd post it this week, so...
>
> The scheduling description is heavily based on ppro.md.  There seems to
> be no publicly available, detailed information from Intel about the Core
> 2 pipeline, so this work is based on Agner Fog's manuals.  It should be
> correct in the essentials, at least as well as ppro.md (we aren't really
> able to do a good job with the execution ports since we have no concept
> of the out-of-order core).  I have not tried to implement latencies or
> port reservations for every last MMX or SSE instruction, since who knows
> whether the information is totally accurate anyway.
>
> The i386 port has a lot of tuning flags, and I've mostly been running
> SPEC2000 benchmarks for the last few weeks, trying to find a set of them
> that works well on these processors.  This is slightly tricky since
> there's some inherent noise in the results.
>
> Not using the LEAVE instruction seemed to make a difference on my Penryn
> laptop in 64 bit mode, but that's probably moot now that
> -fomit-frame-pointer is the default.  I've changed a few others, but
> mostly these attempts resulted in lower or unchanged performance, for
> example:
>
>  * using push/pop insns more often (there are about six of these tuning
>   flags).  I would have expected this to be a win.
>  * reusing the PentiumPro code in ix86_adjust_cost for Core 2 and i7
>  * upping the branch cost to 5; initial results looked good for Core i7
>   but in a full SPEC2000 run it seemed to be a slight loss, and a large
>   loss on Core 2
>  * using different string algorithms (from tune_generic)
>  * enabling SPLIT_LONG_MOVES
>  * enabling the flags related to partial reg stalls
>  * reducing code alignments (based on a comment in Agner's manual that
>   they aren't important anymore)
>
> I've implemented a new tuning flag, X86_TUNE_PROMOTE_HI_CONSTANTS, based
> on the recommendation in Agner's manual not to use operand size prefixes
> when they change the length of the instruction (i.e. if there's an
> immediate operand).  That happens in the second of the following four
> instructions, and is said to cause a decoder stall:
>
> $ as
> orl $32768,%eax
> orw $32768,%ax
> orl $8,%eax
> orw $8,%ax
>
>   0:   0d 00 80 00 00          or     $0x8000,%eax
>   5:   66 0d 00 80             or     $0x8000,%ax
>   9:   83 c8 08                or     $0x8,%eax
>   c:   66 83 c8 08             or     $0x8,%ax
>
> This didn't seem to have a large impact either, however.
>
> On my last test run, I had
> SPECfp2000:
>  -mtune=generic  3023
>  -mtune=core2    3036
> SPECint2000:
>  -mtune=generic  2774
>  -mtune=core2    2794
>
> This is a Westmere Xeon, i.e. essentially a Core i7, in 32 bit mode.
> SPEC was locked to core 0 with schedtool, core 0 set to 3.2GHz manually
> with cpufreq-set (1 step below maximum, which seems to avoid turbo mode
> effectively).
> Compile flags were -O3 -mpc64 -frename-registers.  The tree is a few
> weeks old so it doesn't have -fomit-frame-pointer by default.  I also
> had -mtune=corei7 numbers, but they were a little lower since I was
> using that run for an experiment with higher branch costs.
>
> These numbers pretty much match the differences I was seeing on the Core
> 2 laptop during development.  I'd welcome it if other people also ran
> benchmarks.
>
> Comments?  Is this OK?
>

Please also include this patch.

Thanks.
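
The operand-size-prefix rule behind X86_TUNE_PROMOTE_HI_CONSTANTS in the quoted message can be illustrated with a small check. This is only a sketch, not code from the patch: the helper names are hypothetical. It assumes the usual encoding choice visible in the quoted disassembly, where a 16-bit ALU instruction uses the short sign-extended imm8 form when the constant fits and otherwise falls back to an imm16 form, in which the 0x66 prefix changes the instruction length.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not part of the patch): true if a 16-bit
   constant can be encoded as a sign-extended 8-bit immediate,
   i.e. it lies in 0x0000..0x007f or 0xff80..0xffff.  */
static bool
fits_sign_extended_imm8 (uint16_t value)
{
  return value <= 0x007f || value >= 0xff80;
}

/* Hypothetical helper: true if a 16-bit instruction with this
   immediate needs the imm16 form, so that the 0x66 operand-size
   prefix changes the length of the instruction (the case the
   quoted message describes as causing a decoder stall).  */
static bool
hi_constant_changes_length (uint16_t value)
{
  return !fits_sign_extended_imm8 (value);
}
```

With the constants from the quoted listing, hi_constant_changes_length (32768) is true (the "66 0d 00 80" encoding) and hi_constant_changes_length (8) is false (the "66 83 c8 08" encoding).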

Patch

diff --git a/gcc/config/i386/driver-i386.c b/gcc/config/i386/driver-i386.c
index 8a76857..998214b 100644
--- a/gcc/config/i386/driver-i386.c
+++ b/gcc/config/i386/driver-i386.c
@@ -554,21 +554,21 @@  const char *host_detect_local_cpu (int argc, const char **argv)
 	case 0x1e:
 	case 0x1f:
 	case 0x2e:
-	  /* FIXME: Optimize for Nehalem.  */
-	  cpu = "core2";
+	  /* Nehalem.  */
+	  cpu = "corei7";
 	  break;
 	case 0x25:
 	case 0x2f:
-	  /* FIXME: Optimize for Westmere.  */
-	  cpu = "core2";
+	  /* Westmere.  */
+	  cpu = "corei7";
 	  break;
 	case 0x17:
 	case 0x1d:
-	  /* Penryn.  FIXME: -mtune=core2 is slower than -mtune=generic  */
+	  /* Penryn.  */
 	  cpu = "core2";
 	  break;
 	case 0x0f:
-	  /* Merom.  FIXME: -mtune=core2 is slower than -mtune=generic  */
+	  /* Merom.  */
 	  cpu = "core2";
 	  break;
 	default:
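
For readers unfamiliar with how the model numbers in this switch arise, here is a rough, self-contained sketch. It is not the actual driver-i386.c code: the struct and function names are hypothetical, and it assumes the standard CPUID leaf 1 layout for Intel parts (model in EAX bits 4-7, family in bits 8-11, extended model in bits 16-19, extended family in bits 20-27). The mapping covers only the case labels visible in the hunk above; the enclosing vendor and family checks are not shown in the hunk and are left out here.

```c
#include <stddef.h>

/* Hypothetical container for the decoded signature.  */
struct cpu_sig
{
  unsigned int family;
  unsigned int model;
};

/* Decode family and model from the EAX value returned by CPUID
   leaf 1, assuming the Intel layout described in the lead-in.  */
static struct cpu_sig
decode_intel_signature (unsigned int eax)
{
  struct cpu_sig sig;
  unsigned int extended_model = (eax >> 12) & 0xf0;
  unsigned int extended_family = (eax >> 20) & 0xff;

  sig.model = (eax >> 4) & 0x0f;
  sig.family = (eax >> 8) & 0x0f;
  if (sig.family == 0x0f)
    {
      sig.family += extended_family;
      sig.model += extended_model;
    }
  else if (sig.family == 0x06)
    sig.model += extended_model;
  return sig;
}

/* Map a model number to a -mtune name, mirroring only the case
   labels shown in the hunk.  Returns NULL for anything else.  */
static const char *
tune_name_for_model (unsigned int model)
{
  switch (model)
    {
    case 0x1e:
    case 0x1f:
    case 0x2e:
      return "corei7";	/* Nehalem.  */
    case 0x25:
    case 0x2f:
      return "corei7";	/* Westmere.  */
    case 0x17:
    case 0x1d:
      return "core2";	/* Penryn.  */
    case 0x0f:
      return "core2";	/* Merom.  */
    default:
      return NULL;
    }
}
```

As an example, an EAX value of 0x106e5 decodes to family 6, model 0x1e, which this sketch maps to "corei7"; 0x10676 decodes to model 0x17 and maps to "core2".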