Patchwork Core 2 and Core i7 tuning

Submitter Bernd Schmidt
Date Aug. 20, 2010, 8:07 p.m.
Message ID <4C6EE072.4070802@codesourcery.com>
Permalink /patch/62314/
State New

Comments

Bernd Schmidt - Aug. 20, 2010, 8:07 p.m.
Here's something I've been working on for a while.  This adds a corei7
processor type, a Core 2/Core i7 scheduling description, and twiddles a
few of the x86 tuning flags.  I'm not terribly happy with it yet due to
the relatively small performance improvement, but I'd promised some
folks I'd post it this week, so...

The scheduling description is heavily based on ppro.md.  There seems to
be no publicly available, detailed information from Intel about the Core
2 pipeline, so this work is based on Agner Fog's manuals.  It should be
correct in the essentials, at least as well as ppro.md (we aren't really
able to do a good job with the execution ports since we have no concept
of the out-of-order core).  I have not tried to implement latencies or
port reservations for every last MMX or SSE instruction, since who knows
whether the information is totally accurate anyway.

The i386 port has a lot of tuning flags, and I've mostly been running
SPEC2000 benchmarks for the last few weeks, trying to find a set of them
that works well on these processors.  This is slightly tricky since
there's some inherent noise in the results.

Not using the LEAVE instruction seemed to make a difference on my Penryn
laptop in 64 bit mode, but that's probably moot now that
-fomit-frame-pointer is the default.  I've changed a few others, but
mostly these attempts resulted in lower or unchanged performance, for
example:

 * using push/pop insns more often (there are about six of these tuning
   flags).  I would have expected this to be a win.
 * reusing the PentiumPro code in ix86_adjust_cost for Core 2 and i7
 * upping the branch cost to 5; initial results looked good for Core i7
   but in a full SPEC2000 run it seemed to be a slight loss, and a large
   loss on Core 2
 * using different string algorithms (from tune_generic)
 * enabling SPLIT_LONG_MOVES
 * enabling the flags related to partial reg stalls
 * reducing code alignments (based on a comment in Agner's manual that
   they aren't important anymore)

I've implemented a new tuning flag, X86_TUNE_PROMOTE_HI_CONSTANTS, based
on the recommendation in Agner's manual not to use operand size prefixes
when they change the length of the instruction (i.e. if there's an
immediate operand).  That happens in the second of the following four
instructions, and is said to cause a decoder stall:

$ as
orl $32768,%eax
orw $32768,%ax
orl $8,%eax
orw $8,%ax

   0:	0d 00 80 00 00       	or     $0x8000,%eax
   5:	66 0d 00 80          	or     $0x8000,%ax
   9:	83 c8 08             	or     $0x8,%eax
   c:	66 83 c8 08          	or     $0x8,%ax

This didn't seem to have a large impact either, however.
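
To make the rule behind X86_TUNE_PROMOTE_HI_CONSTANTS concrete, here is a minimal sketch of the test the patch applies to a 16-bit immediate. The enum and function name are hypothetical stand-ins, not GCC code; the numeric conditions are copied from the ix86_fixup_binary_operands hunk in the patch. The patch does not say why AND with 255 and -65281 is exempt, so the comment on that branch is only a guess.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the rtx codes that matter here.  */
enum op_code { OP_AND, OP_OR, OP_OTHER };

/* Return true when a 16-bit operation with immediate VALUE would be
   encoded with a full imm16 behind the 0x66 prefix, i.e. the case the
   patch moves into a 32-bit register instead.  */
static bool
hi_constant_needs_promotion (enum op_code code, long value)
{
  /* Immediates in [-128, 127] use the sign-extended imm8 encoding,
     so the prefix does not change the instruction length.  */
  if (value >= -128 && value <= 127)
    return false;

  /* The patch leaves these two AND masks alone (assumption: they can
     be handled without an imm16 encoding).  */
  if (code == OP_AND && (value == 255 || value == -65281))
    return false;

  return true;
}
```

With the four instructions above, only the `orw $32768,%ax` case would return true.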

On my last test run, I had
SPECfp2000:
 -mtune=generic  3023
 -mtune=core2    3036
SPECint2000:
 -mtune=generic  2774
 -mtune=core2    2794

This is a Westmere Xeon, i.e. essentially a Core i7, in 32 bit mode.
SPEC was locked to core 0 with schedtool, core 0 set to 3.2GHz manually
with cpufreq-set (1 step below maximum, which seems to avoid turbo mode
effectively).
Compile flags were -O3 -mpc64 -frename-registers.  The tree is a few
weeks old so it doesn't have -fomit-frame-pointer by default.  I also
had -mtune=corei7 numbers, but they were a little lower since I was
using that run for an experiment with higher branch costs.
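
For scale, the relative gains implied by the scores above can be computed with a small helper. This is only a sketch; the function name is made up for illustration.

```c
#include <assert.h>

/* Relative gain of TUNED over BASELINE, in percent.  */
static double
relative_gain_percent (double baseline, double tuned)
{
  assert (baseline > 0.0);
  return (tuned - baseline) / baseline * 100.0;
}
```

Applied to the numbers quoted above, this gives roughly 0.43% for SPECfp2000 (3023 to 3036) and roughly 0.72% for SPECint2000 (2774 to 2794).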

These numbers pretty much match the differences I was seeing on the Core
2 laptop during development.  I'd welcome it if other people also ran
benchmarks.

Comments?  Is this OK?


Bernd
* doc/invoke.texi (i386 and x86-64 Options): Document corei7 cpu type.
	* config/i386/i386.h (TARGET_COREI7): New macro.
	(enum ix86_tune_indices): Add X86_TUNE_PROMOTE_HI_CONSTANTS.
	(enum target_cpu_default): Add TARGET_CPU_DEFAULT_corei7.
	(enum processor_type): Add PROCESSOR_COREI7.
	* config/i386/i386.md: Include "core2.md".
	(attr "cpu"): Add "corei7".
	(mul_operands): New attribute.
	(mul<mode>3_1, mulsi3_1_zext, mulhi3_1, mulqi3_1, <u>mul<mode><dwi>3_1,
	<u>mulqihi3_1, <s>muldi3_highpart_1, <s>mulsi3_highpart_1,
	<s>mulsi3_highpart_zext): Set it.
	* config/i386/core2.md: New file.
	* config/i386/i386-c.c (ix86_target_macros_internal): Handle
	PROCESSOR_COREI7.
	* config/i386/i386.c (corei7_cost): New static variable.
	(m_COREI7, m_CORE2I7): New macros.
	(initial_ix86_tune_features): Use them.  Disable X86_TUNE_USE_LEAVE,
	X86_TUNE_PAD_RETURNS and X86_TUNE_USE_INCDEC, and enable
	X86_TUNE_PROMOTE_HI_REGS and X86_TUNE_PROMOTE_HI_CONSTANTS for Core 2
	and Core i7.
	(x86_accumulate_outgoing_args, x86_arch_always_fancy_math_387): Use
	m_CORE2I7 instead of m_CORE2.
	(processor_target_table): Add entry for corei7_cost.
	(cpu_names): Add "corei7" entry.
	(override_options): Add entry for Core i7.
	(ix86_fixup_binary_operands, ix86_binary_operator_ok): Handle
	TARGET_PROMOTE_HI_CONSTANTS.
	(ix86_issue_rate): 4 for Core i7.
	(ix86_adjust_cost): Try to do something sensible about domains for
	PROCESSOR_COREI7.
H.J. Lu - Aug. 20, 2010, 8:17 p.m.
I will run SPEC CPU 2K/2006. It will take a while.

Thanks.
Jan Hubicka - Aug. 21, 2010, 8:59 a.m.
> Here's something I've been working on for a while.  This adds a corei7
> processor type, a Core 2/Core i7 scheduling description, and twiddles a
> few of the x86 tuning flags.  I'm not terribly happy with it yet due to
> the relatively small performance improvement, but I'd promised some
> folks I'd post it this week, so...
> 
> The scheduling description is heavily based on ppro.md.  There seems to
> be no publicly available, detailed information from Intel about the Core
> 2 pipeline, so this work is based on Agner Fog's manuals.  It should be
> correct in the essentials, at least as well as ppro.md (we aren't really
> able to do a good job with the execution ports since we have no concept
> of the out-of-order core).  I have not tried to implement latencies or
> port reservations for every last MMX or SSE instruction, since who knows
> whether the information is totally accurate anyway.
> 
> The i386 port has a lot of tuning flags, and I've mostly been running
> SPEC2000 benchmarks for the last few weeks, trying to find a set of them
> that works well on these processors.  This is slightly tricky since
> there's some inherent noise in the results.

My experience with this kind of micro-tuning is that it is a lot easier to tune
on micro-benchmarks first and then verify with SPEC2k or similar.  SPEC2k
is too noisy to be useful for testing micro changes like these.
> 
> Not using the LEAVE instruction seemed to make a difference on my Penryn
> laptop in 64 bit mode, but that's probably moot now that
> -fomit-frame-pointer is the default.  I've changed a few others, but
> mostly these attempts resulted in lower or unchanged performance, for
> example:
> 
>  * using push/pop insns more often (there are about six of these tuning
>    flags).  I would have expected this to be a win.

This is something I plan to try for the generic model.  I think we use moves
in prologues/epilogues far too often now that practically all hardware
has an ESP predictor.
>  * reusing the PentiumPro code in ix86_adjust_cost for Core 2 and i7
>  * upping the branch cost to 5; initial results looked good for Core i7
>    but in a full SPEC2000 run it seemed to be a slight loss, and a large
>    loss on Core 2

Do you still have some numbers on this?  I was playing with branch cost a while
ago, and my experience was that bumping the branch cost up tends to cause a lot
of noise and that we would need a better heuristic to decide which branches
are poorly predictable.

>  * using different string algorithms (from tune_generic)
>  * enabling SPLIT_LONG_MOVES

I don't recall the manual mentioning moves with immediates being a problem
on Core 2?
>  * enabling the flags related to partial reg stalls
This seems wrong; Core i7 is not a partial reg stall core.
>  * reducing code alignments (based on a comment in Agner's manual that
>    they aren't important anymore)
> 
> I've implemented a new tuning flag, X86_TUNE_PROMOTE_HI_CONSTANTS, based
> on the recommendation in Agner's manual not to use operand size prefixes
> when they change the length of the instruction (i.e. if there's an
> immediate operand).  That happens in the second of the following four
> instructions, and is said to cause a decoder stall:
> 
> $ as
> orl $32768,%eax
> orw $32768,%ax
> orl $8,%eax
> orw $8,%ax
> 
>    0:	0d 00 80 00 00       	or     $0x8000,%eax
>    5:	66 0d 00 80          	or     $0x8000,%ax
>    9:	83 c8 08             	or     $0x8,%eax
>    c:	66 83 c8 08          	or     $0x8,%ax
> 
> This didn't seem to have a large impact either however.
> 
> On my last test run, I had
> SPECfp2000:
>  -mtune=generic  3023
>  -mtune=core2    3036
> SPECint2000:
>  -mtune=generic  2774
>  -mtune=core2    2794
> 
> This is a Westmere Xeon, i.e. essentially a Core i7, in 32 bit mode.
> SPEC was locked to core 0 with schedtool, core 0 set to 3.2GHz manually
> with cpufreq-set (1 step below maximum, which seems to avoid turbo mode
> effectively).
> Compile flags were -O3 -mpc64 -frename-registers.  The tree is a few
> weeks old so it doesn't have -fomit-frame-pointer by default.  I also
> had -mtune=corei7 numbers, but they were a little lower since I was
> using that run for an experiment with higher branch costs.
> 
> These numbers pretty much match the differences I was seeing on the Core
> 2 laptop during development.  I'd welcome if other people would also run
> benchmarks.
> 
> Comments?  Is this OK?

I will comment in detail on the patch later today.

Honza
Jan Hubicka - Aug. 21, 2010, 9:16 a.m.
> 	* doc/invoke.texi (i386 and x86-64 Options): Document corei7 cpu type.
> 	* config/i386/i386.h (TARGET_COREI7): New macro.
> 	(enum ix86_tune_indices): Add X86_TUNE_PROMOTE_HI_CONSTANTS.
> 	(enum target_cpu_default): Add TARGET_CPU_DEFAULT_corei7.
> 	(enum processor_type): Add PROCESSOR_COREI7.
> 	* config/i386/i386.md: Include "core2.md".
> 	(attr "cpu"): Add "corei7".
> 	(mul_operands): New attribute.
> 	(mul<mode>3_1, mulsi3_1_zext, mulhi3_1, mulqi3_1, <u>mul<mode><dwi>3_1,
> 	<u>mulqihi3_1, <s>muldi3_highpart_1, <s>mulsi3_highpart_1,
> 	<s>mulsi3_highpart_zext): Set it.
> 	* config/i386/core2.md: New file.
> 	* config/i386/i386-c.c (ix86_target_macros_internal): Handle
> 	PROCESSOR_COREI7.
> 	* config/i386/i386.c (corei7_cost): New static variable.
> 	(m_COREI7, m_CORE2I7): New macros.
> 	(initial_ix86_tune_features): Use them.  Disable X86_TUNE_USE_LEAVE,
> 	X86_TUNE_PAD_RETURNS and X86_TUNE_USE_INCDEC, and enable
> 	X86_TUNE_PROMOTE_HI_REGS and X86_TUNE_PROMOTE_HI_CONSTANTS for Core 2
> 	and Core i7.
> 	(x86_accumulate_outgoing_args, x86_arch_always_fancy_math_387): Use
> 	m_CORE2I7 instead of m_CORE2.
> 	(processor_target_table): Add entry for corei7_cost.
> 	(cpu_names): Add "corei7" entry.
> 	(override_options): Add entry for Core i7.
> 	(ix86_fixup_binary_operands, ix86_binary_operator_ok): Handle
> 	TARGET_PROMOTE_HI_CONSTANTS.
> 	(ix86_issue_rate): 4 for Core i7.
> 	(ix86_adjust_cost): Try to do something sensible about domains for
> 	PROCESSOR_COREI7.
> 
> Index: config/i386/core2.md
> ===================================================================
> --- config/i386/core2.md	(revision 0)
> +++ config/i386/core2.md	(revision 0)

What is the effect of your pipeline model on cc1 binary size?
I am asking because Core has a lot of parallelism, which tends to blow up the automaton
size a lot.
> @@ -2173,6 +2251,7 @@ static const struct ptt processor_target
>    {&k8_cost, 16, 7, 16, 7, 16},
>    {&nocona_cost, 0, 0, 0, 0, 0},
>    {&core2_cost, 16, 10, 16, 10, 16},
> +  {&corei7_cost, 16, 10, 16, 10, 16},

You mentioned reducing alignments, but they seem to be the same in the patch?
> @@ -14291,6 +14374,12 @@ ix86_fixup_binary_operands (enum rtx_cod
>    if (MEM_P (src1) && !rtx_equal_p (dst, src1))
>      src1 = force_reg (mode, src1);
>  
> +  if (TARGET_PROMOTE_HI_CONSTANTS && mode == HImode && CONSTANT_P (src2)
> +      && (INTVAL (src2) < -128 || INTVAL (src2) > 127)
> +      && (code != AND
> +	  || (INTVAL (src2) != 255 && INTVAL (src2) != -65281)))
> +    src2 = gen_lowpart (HImode, force_reg (SImode, src2));
> +

I am concerned about this, especially on 32-bit, since we force another register
to hold the constant.
An option would be a post-reload peep2 that offloads constants to registers, but then
we would miss PRE on those.  Perhaps we can break up the patch so we have
a chance to see how it works.

The pipeline model seems reasonable, as do the tuning flag changes, so perhaps they
should go in first.
>    operands[1] = src1;
>    operands[2] = src2;
>    return dst;
> @@ -14377,6 +14466,12 @@ ix86_binary_operator_ok (enum rtx_code c
>    if (MEM_P (src1) && !rtx_equal_p (dst, src1))
>      return 0;
>  
> +  if (TARGET_PROMOTE_HI_CONSTANTS && mode == HImode && CONSTANT_P (src2)
> +      && (INTVAL (src2) < -128 || INTVAL (src2) > 127)
> +      && (code != AND
> +	  || (INTVAL (src2) != 255 && INTVAL (src2) != -65281)))
> +    return 0;
> +
>    return 1;
>  }
>  
> @@ -20569,6 +20665,7 @@ ix86_adjust_cost (rtx insn, rtx link, rt
>  {
>    enum attr_type insn_type, dep_insn_type;
>    enum attr_memory memory;
> +  enum attr_i7_domain domain1, domain2;
>    rtx set, set2;
>    int dep_insn_code_number;
>  
> @@ -20711,6 +20808,19 @@ ix86_adjust_cost (rtx insn, rtx link, rt
>  	  else
>  	    cost = 0;
>  	}
> +      break;
> +
> +    case PROCESSOR_COREI7:
> +      memory = get_attr_memory (insn);
> +
> +      domain1 = get_attr_i7_domain (insn);
> +      domain2 = get_attr_i7_domain (dep_insn);
> +      if (domain1 != domain2
> +	  && !ix86_agi_dependent (dep_insn, insn))
> +	cost += ((domain1 == I7_DOMAIN_SIMD && domain2 == I7_DOMAIN_INT)
> +		 || (domain1 == I7_DOMAIN_INT && domain2 == I7_DOMAIN_SIMD)
> +		 ? 1 : 2);

This number is supposed to be the load latency; is it still 1/2 on Core when reading from cache?

Honza
> +      break;
>  
>      default:
>        break;
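
For readers following the PROCESSOR_COREI7 case quoted above, the domain-crossing adjustment can be sketched on its own. This is an illustration under assumptions, not the patch itself: the patch only shows I7_DOMAIN_INT and I7_DOMAIN_SIMD, so the third domain name below is a guess, and the ix86_agi_dependent check from the patch is left out.

```c
/* Hypothetical mirror of the i7_domain attribute; DOMAIN_FLOAT is an
   assumed name for the remaining domain.  */
enum i7_domain { DOMAIN_INT, DOMAIN_SIMD, DOMAIN_FLOAT };

/* Extra cycles added to a dependency when the producing and consuming
   instructions execute in different domains.  */
static int
domain_crossing_penalty (enum i7_domain consumer, enum i7_domain producer)
{
  if (consumer == producer)
    return 0;

  /* Crossing between the integer and SIMD domains costs one cycle.  */
  if ((consumer == DOMAIN_SIMD && producer == DOMAIN_INT)
      || (consumer == DOMAIN_INT && producer == DOMAIN_SIMD))
    return 1;

  /* Every other crossing costs two.  */
  return 2;
}
```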
Bernd Schmidt - Aug. 21, 2010, 1:38 p.m.
On 08/21/2010 10:59 AM, Jan Hubicka wrote:
> My experience with this kind of micro-tuning is that it is a lot easier to tune
> on micro-benchmarks first and then verify with SPEC2k or similar.  SPEC2k
> is too noisy to be useful for testing micro changes like these.

Do you have any suitable ones?  I occasionally tried using just e.g.
164.gzip, but you can end up going in the wrong direction with something
that helps one benchmark at the expense of the others.

>>  * upping the branch cost to 5; initial results looked good for Core i7
>>    but in a full SPEC2000 run it seemed to be a slight loss, and a large
>>    loss on Core 2
> 
> Do you still have some numbers on this?  I was playing with branch cost a while
> ago, and my experience was that bumping the branch cost up tends to cause a lot
> of noise and that we would need a better heuristic to decide which branches
> are poorly predictable.

For the last run, I'd used branch cost 5 with -mtune=corei7, and ended
up with fp 3014 and int 2798 (vs. the 3036/2794 result with -mtune=core2
and a lower branch cost).

On the Core 2 laptop, branch cost 3 was 2285/2172, branch cost 4
2257/2175, branch cost 5 2250/2168.

I agree that we don't seem to have a good concept of prediction likelihood.

>>  * enabling SPLIT_LONG_MOVES
> 
> I don't recall the manual mentioning moves with immediates being a problem
> on Core 2?

Just the general issue that there are size limits in the decoder.
Agner's optimization manual explicitly suggests that avoiding long
instructions is important.  I thought it was worth trying, at least.

>>  * enabling the flags related to partial reg stalls
> This seems wrong; Core i7 is not a partial reg stall core.

According to the manual, it has a mechanism that reduces the penalty but
doesn't entirely eliminate it.  Again, worth trying.


Bernd
Jan Hubicka - Aug. 21, 2010, 2:55 p.m.
> On 08/21/2010 10:59 AM, Jan Hubicka wrote:
> > My experience with this kind of micro-tuning is that it is a lot easier to tune
> > on micro-benchmarks first and then verify with SPEC2k or similar.  SPEC2k
> > is too noisy to be useful for testing micro changes like these.
> 
> Do you have any suitable ones?  I occasionally tried using just e.g.
> 164.gzip, but you can end up going in the wrong direction with something
> that helps one benchmark at the expense of the others.

When making the Athlon model I put together a few kernels that were rather useful.
They used to be in GCC CVS as a benchmark suite; I guess they should still
reside somewhere after the conversion to SVN.
> 
> >>  * upping the branch cost to 5; initial results looked good for Core i7
> >>    but in a full SPEC2000 run it seemed to be a slight loss, and a large
> >>    loss on Core 2
> > 
> > Do you still have some numbers on this?  I was playing with branch cost a while
> > ago, and my experience was that bumping the branch cost up tends to cause a lot
> > of noise and that we would need a better heuristic to decide which branches
> > are poorly predictable.
> 
> For the last run, I'd used branch cost 5 with -mtune=corei7, and ended
> up with fp 3014 and int 2798 (vs. the 3036/2794 result with -mtune=core2
> and a lower branch cost).
> 
> On the Core 2 laptop, branch cost 3 was 2285/2172, branch cost 4
> 2257/2175, branch cost 5 2250/2168.

Hmm, interesting.  Core i7 improved branch prediction while it did nothing
about the cost of the conditional move sequences we produce, so I would expect
the opposite trend.
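
To illustrate the trade-off under discussion: as I understand it, a higher branch cost makes the compiler more willing to replace a conditional jump with a branch-free, conditional-move style sequence. The two hypothetical functions below compute the same result in both styles; this is a sketch of the idea, and a compiler is free to emit the same code for both.

```c
#include <stdint.h>

/* Branchy form: cheap when the branch predicts well, expensive on a
   mispredict.  */
static int32_t
max_branchy (int32_t a, int32_t b)
{
  if (a > b)
    return a;
  return b;
}

/* Branch-free form: always does the extra work, but has no mispredict
   penalty.  */
static int32_t
max_branchless (int32_t a, int32_t b)
{
  /* MASK is all ones when a > b, and zero otherwise.  */
  int32_t mask = -(int32_t) (a > b);
  return (a & mask) | (b & ~mask);
}
```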
> 
> I agree that we don't seem to have a good concept of prediction likelihood.
> 
> >>  * enabling SPLIT_LONG_MOVES
> > 
> > I don't recall the manual mentioning moves with immediates being a problem
> > on Core 2?
> 
> Just the general issue that there are size limits in the decoder.

Yep, but Core i7 should not be so decoder-bound that this compensates for
the other costs of the transformation (on PPro such instructions made decoding
microcoded, while on Core I think it only reduces bandwidth when the
instructions occupy too much of the prefetch window, right?)

> Agner's optimization manual explicitly suggests that avoiding long
> instructions is important.  I thought it was worth trying, at least.
> 
> >>  * enabling the flags related to partial reg stalls
> > This seems wrong; Core i7 is not a partial reg stall core.
> 
> According to the manual, it has a mechanism that reduces the penalty but
> doesn't entirely eliminate it.  Again, worth trying.

The patch seems to enable partial_flag_reg_stall (i.e. replacement of
inc/dec by add), not partial_reg_stall, which prevents promotion of HImode
math to SImode.  This makes sense, unless Intel fixed the issue with inc/dec.
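
The reason inc/dec matter for flag stalls is that they update the arithmetic flags but leave the carry flag untouched, so a later flag reader depends on more than one earlier instruction. A toy model of that difference, with hypothetical names and only two of the flags:

```c
#include <stdbool.h>
#include <stdint.h>

/* Just the two flags needed for the illustration.  */
struct flags { bool cf; bool zf; };

/* ADD writes every arithmetic flag, including CF.  */
static uint32_t
model_add (uint32_t x, uint32_t imm, struct flags *f)
{
  uint32_t r = x + imm;
  f->cf = r < x;        /* carry out of bit 31 */
  f->zf = (r == 0);
  return r;
}

/* INC updates ZF but preserves whatever CF held before.  */
static uint32_t
model_inc (uint32_t x, struct flags *f)
{
  uint32_t r = x + 1;
  f->zf = (r == 0);     /* CF deliberately left alone */
  return r;
}
```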

I wonder whether you have some code size information too, and how your model
compares to generic?

Thanks,
Honza
> 
> 
> Bernd
H.J. Lu - Aug. 21, 2010, 3:20 p.m.
On Sat, Aug 21, 2010 at 7:55 AM, Jan Hubicka <hubicka@ucw.cz> wrote:
>
> The patch seems to enable partial_flag_reg_stall (i.e. replacement of
> inc/dec by add), not partial_reg_stall, which prevents promotion of HImode
> math to SImode.  This makes sense, unless Intel fixed the issue with inc/dec.

Isn't inc/dec optimization controlled by X86_TUNE_USE_INCDEC?
Bernd Schmidt - Aug. 22, 2010, 4:24 p.m.
On 08/21/2010 04:55 PM, Jan Hubicka wrote:

> Hmm, interesting.  Core i7 improved branch prediction while it did nothing
> about the cost of the conditional move sequences we produce, so I would expect
> the opposite trend.

The manual says the branch mispredict penalty went up with Core 2 (vs.
PPro/Pentium M) and again with Core i7.

>>>>  * enabling the flags related to partial reg stalls
>>> This seems wrong; Core i7 is not a partial reg stall core.
>>
>> According to the manual, it has a mechanism that reduces the penalty but
>> doesn't entirely eliminate it.  Again, worth trying.
> 
> The patch seems to enable partial_flag_reg_stall (i.e. replacement of
> inc/dec by add), not partial_reg_stall, which prevents promotion of HImode
> math to SImode.  This makes sense, unless Intel fixed the issue with inc/dec.

Yes.  As I said, I experimented with the other tuning flags, but left
most of them unchanged in the end.

> I wonder whether you have some code size information too, and how your model
> compares to generic?

Nothing systematic.  Code size seems to be a little higher, which I
think can be explained by higher alignments in processor_target_table
for core2/corei7 vs. generic32.  That's probably worth experimenting
with further.


Bernd
Andi Kleen - Aug. 23, 2010, 1:17 p.m.
Bernd Schmidt <bernds@codesourcery.com> writes:

Hi Bernd,

FWIW I have my own private Core i7 target, but it wasn't as fancy
as yours.

First I'm surprised that you wrote that the pipeline description
in the optimization manual wasn't good enough. Did you use
2.1 in http://www.intel.com/assets/pdf/manual/248966.pdf 
as a reference?

Also I think you forgot to update driver-i386.c

> Index: doc/invoke.texi
> ===================================================================
> --- doc/invoke.texi	(revision 162821)
> +++ doc/invoke.texi	(working copy)
> @@ -11937,6 +11937,9 @@ SSE2 and SSE3 instruction set support.
>  @item core2
>  Intel Core2 CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3 and SSSE3
>  instruction set support.
> +@item corei7
> +Intel Core i7 CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3,
> SSSE3, SSE4.1

As a general comment, Core i7 is not a good name to use here because
it's a marketing name used for different microarchitectures
(this is already the case). I made this mistake in another project
and am still suffering from it :-)

The Intel manual uses "Enhanced Core in 45nm" 

Also there are CPUs like xeon5500 or 7500 that use a similar core, but
have a different name. I also don't think anyone still cares about "with
64bit-extensions". In my version I added aliases for xeon5500 etc.

Also there are already 32nm variants, which are mostly the same,
but have different cache sizes and a few extensions.


> Index: config/i386/i386-c.c
> ===================================================================
> --- config/i386/i386-c.c	(revision 162821)
> +++ config/i386/i386-c.c	(working copy)
> @@ -122,6 +122,10 @@ ix86_target_macros_internal (int isa_fla
>        def_or_undef (parse_in, "__core2");
>        def_or_undef (parse_in, "__core2__");
>        break;
> +    case PROCESSOR_COREI7:
> +      def_or_undef (parse_in, "__corei7");
> +      def_or_undef (parse_in, "__corei7__");

Again the name is not good.
>  
>  static const

Comparing costs with my own model: 

> +  0,					/* cost of multiply per each bit set */
> +  {COSTS_N_INSNS (22),			/* cost of a divide/mod for QI */
> +   COSTS_N_INSNS (22),			/* HI */

AFAIK these costs are not accurate anymore for the new divider since
Penryn. The cost is variable based on bits, so fully expressing it would
need a few changes in the high level check.

> +					   in SFmode, DFmode and XFmode */
> +  2,					/* cost of moving MMX register */
> +  {6, 6},				/* cost of loading MMX registers
> +					   in SImode and DImode */
> +  {4, 4},				/* cost of storing MMX registers
> +					   in SImode and DImode */
> +  2,					/* cost of moving SSE register */

Too high?

> +  {6, 6, 6},				/* cost of loading SSE registers
> +					   in SImode, DImode and TImode */

And I suspect that's also too high.

> +  {4, 4, 4},				/* cost of storing SSE registers
> +					   in SImode, DImode and TImode */
> +  2,					/* MMX or SSE register to integer */

1 now. Inter unit moves got a lot cheaper.

> +  32,					/* size of l1 cache.  */
> +  256,					/* size of l2 cache.  */

I used the L3 here. Makes more sense?

BTW I was always wondering whether there should be a flag for multithreading;
the values should then be halved.

> +  128,					/* size of prefetch block */

I don't think that's true.

> +  8,					/* number of parallel prefetches */

I believe this number is too low.

> +  3,					/* Branch cost */
> +  COSTS_N_INSNS (3),			/* cost of FADD and FSUB insns.  */
> +  COSTS_N_INSNS (5),			/* cost of FMUL instruction.  */
> +  COSTS_N_INSNS (32),			/* cost of FDIV instruction.  */
> +  COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
> +  COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
> +  COSTS_N_INSNS (58),			/* cost of FSQRT instruction.  */

I suspect some of these costs are also outdated, but that needs measurements.

> +  {{libcall, {{11, loop}, {-1, rep_prefix_4_byte}}},
> +   {libcall, {{32, loop}, {64, rep_prefix_4_byte},
> +	      {8192, rep_prefix_8_byte}, {-1, libcall}}}},
> +  {{libcall, {{8, loop}, {15, unrolled_loop},
> +	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
> +   {libcall, {{24, loop}, {32, unrolled_loop},
> +	      {8192, rep_prefix_8_byte}, {-1, libcall}}}},

This is certainly not correct for Nehalem; see 2.2.6 in the optimization
manual.
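
For anyone not familiar with these tables: each row is a list of {max size, algorithm} pairs, and as far as I understand it the first pair whose bound covers the block size wins, with -1 meaning no upper bound. A minimal sketch of that lookup follows; the type and function names are hypothetical, not GCC's, and the sample data is the inner list of the first row quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the algorithms appearing in the table.  */
enum alg { ALG_LIBCALL, ALG_LOOP, ALG_UNROLLED_LOOP, ALG_REP4, ALG_REP8 };

/* One {max, algorithm} pair; max == -1 means "no upper bound".  */
struct size_rule { long max; enum alg alg; };

/* Return the algorithm of the first rule that covers SIZE.  */
static enum alg
pick_alg (const struct size_rule *rules, size_t n_rules, long size)
{
  assert (rules != NULL && n_rules > 0);
  for (size_t i = 0; i < n_rules; i++)
    if (rules[i].max == -1 || size <= rules[i].max)
      return rules[i].alg;
  return ALG_LIBCALL;   /* nothing matched: fall back to the library */
}

/* Inner list of the first quoted row: {{11, loop}, {-1, rep_prefix_4_byte}}.  */
static const struct size_rule first_row[] = {
  { 11, ALG_LOOP },
  { -1, ALG_REP4 },
};
```

With this row, blocks of up to 11 bytes use the simple loop and anything larger uses the 4-byte rep prefix, which is the kind of threshold that would need retuning for Nehalem.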

> +  1,					/* scalar_stmt_cost.  */
> +  1,					/* scalar load_cost.  */
> +  1,					/* scalar_store_cost.  */
> +  1,					/* vec_stmt_cost.  */
> +  1,					/* vec_to_scalar_cost.  */
> +  1,					/* scalar_to_vec_cost.  */
> +  1,					/* vec_align_load_cost.  */
> +  2,					/* vec_unalign_load_cost.  */

Should be actually the same as aligned. This gives a big improvement
because the vectorizer does not generate all the explicit alignment code.

The only problem I ran into is that it has to be redone for AVX again :/

>    /* X86_TUNE_PAD_RETURNS */
> -  m_AMD_MULTIPLE | m_CORE2 | m_GENERIC,
> +  m_AMD_MULTIPLE | m_GENERIC,

Not sure why?

The return padding can still help to avoid exceeding the max density
of the branch predictor. However, it would probably be better to have
a different pass for that.
  

-andi
Bernd Schmidt - Aug. 23, 2010, 1:33 p.m.
On 08/23/2010 03:17 PM, Andi Kleen wrote:
> 
> First I'm surprised that you wrote that the pipeline description
> in the optimization manual wasn't good enough. Did you use
> 2.1 in http://www.intel.com/assets/pdf/manual/248966.pdf 
> as a reference?

Not sure it's the same one, but I have an Intel optimization manual
which only seems to have general information about which instructions go
to which ports; the Agner Fog document has tables which at least try to
provide full information.  In the end, it may not be relevant since I
doubt there's much to be gained from trying to get this 100% accurate.

> As a general comment, Core i7 is not a good name to use here because
> it's a marketing name used for different microarchitectures
> (this is already the case). I made this mistake in another project
> and am still suffering from it :-)

Most of these points also apply to Core 2, which has two different
variants and a couple of Xeons with the same basic core.

> Comparing costs with my own model: 

The i7 table is just copied from the Core 2 table for the moment.  I've
only adjusted the L2 cache size.

>> +  2,					/* cost of moving SSE register */
> 
> Too high?

Likely.  I changed that in the pipeline description IIRC but this
probably needs changing as well.

> 1 now. Inter unit moves got a lot cheaper.

As far as I know there are still stalls?

>> +  32,					/* size of l1 cache.  */
>> +  256,					/* size of l2 cache.  */
> 
> I used the L3 here. Makes more sense?

No idea.

>> +  3,					/* Branch cost */
>> +  COSTS_N_INSNS (3),			/* cost of FADD and FSUB insns.  */
>> +  COSTS_N_INSNS (5),			/* cost of FMUL instruction.  */
>> +  COSTS_N_INSNS (32),			/* cost of FDIV instruction.  */
>> +  COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
>> +  COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
>> +  COSTS_N_INSNS (58),			/* cost of FSQRT instruction.  */
> 
> I suspect some of these costs are also outdated, but needs measurements.

FADD and FMUL are correct, I think, but Maxim pointed me at an earlier
patch from Vlad which got better results by changing them.

>>    /* X86_TUNE_PAD_RETURNS */
>> -  m_AMD_MULTIPLE | m_CORE2 | m_GENERIC,
>> +  m_AMD_MULTIPLE | m_GENERIC,
> 
> Not sure why?

Everything I looked at seemed to say this is an AMD-only thing.


Bernd
Andi Kleen - Aug. 23, 2010, 1:55 p.m.
On Mon, Aug 23, 2010 at 03:33:27PM +0200, Bernd Schmidt wrote:
> Not sure it's the same one, but I have an Intel optimization manual
> which only seems to have general information about which instructions go
> to which ports; the Agner Fog document has tables which at least try to
> provide full information.  In the end, it may not be relevant since I
> doubt there's much to be gained from trying to get this 100% accurate.

Maybe.

> 
> > As a general comment Core i7 is not a good name to use here because
> > it's a marketing name used for different micro architectures
> > (already the case). I made this mistake in another project
> > and still suffering from it :-)
> 
> Most of these points also apply to Core 2, which has two different
> variants and a couple of Xeons with the same basic core.

Yes, but that doesn't mean that the mistake has to be repeated.


> 
> > Comparing costs with my own model: 
> 
> The i7 table is just copied from the Core 2 table for the moment.  I've
> only adjusted the L2 cache size.

Well as a minimum change you should at least fix the vector alignment,
that's a big win (just need to make sure AVX is still using it)

But some of the other parameters can also be tweaked.
I believe especially the string tuning ops help quite a lot.

> > 1 now. Inter unit moves got a lot cheaper.
> 
> As far as I know there are still stalls?

I thought it was pretty cheap. The manual even recommends spilling to
XMM registers, because that's far faster than L1.

> 
> >> +  32,					/* size of l1 cache.  */
> >> +  256,					/* size of l2 cache.  */
> > 
> > I used the L3 here. Makes more sense?
> 
> No idea.

I think it does, ignoring the L3 completely for cache blocking
of loops would be a poor decision.

That said, there is still the problem of resource sharing with
multi threading, but afaik that's ignored everywhere in gcc currently.
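
[Editor's note: a minimal sketch of why the choice of cache level matters
for cache blocking of loops. This is generic blocking arithmetic, not a
description of how GCC consumes the cost-table entry. The function name
tile_edge and the 8192 KB L3 figure are assumptions made for illustration;
only the 32 KB and 256 KB values come from the patch.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: largest square tile edge (in elements) such that
   NARRAYS tiles of ELEM_SIZE-byte elements fit in CACHE_KB kilobytes.  */
size_t
tile_edge (size_t cache_kb, size_t elem_size, size_t narrays)
{
  assert (cache_kb > 0 && elem_size > 0 && narrays > 0);

  /* Number of elements each array may keep resident in the cache.  */
  size_t budget = cache_kb * 1024 / (elem_size * narrays);

  /* Integer square root by linear search; the values are small.  */
  size_t edge = 0;
  while ((edge + 1) * (edge + 1) <= budget)
    edge++;
  return edge;
}
```

For three arrays of 8-byte doubles, tile_edge (256, 8, 3) gives 104, while
an assumed 8192 KB last-level cache gives tile_edge (8192, 8, 3) = 591, so
a blocking decision based only on the 256 KB figure would pick much smaller
tiles.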

> >> +  COSTS_N_INSNS (58),			/* cost of FSQRT instruction.  */
> > 
> > I suspect some of these costs are also outdated, but needs measurements.
> 
> FADD and FMUL are correct, I think, but Maxim pointed me at an earlier
> patch from Vlad which got better results by changing them.
> 
> >>    /* X86_TUNE_PAD_RETURNS */
> >> -  m_AMD_MULTIPLE | m_CORE2 | m_GENERIC,
> >> +  m_AMD_MULTIPLE | m_GENERIC,
> > 
> > Not sure why?
> 
> Everything I looked at seemed to say this is an AMD-only thing.

The jump to ret is AMD only, but it still can help the Intel
branch predictor indirectly to avoid exceeding the maximum limit
per 16 byte window.

I thought that is why it was originally added for Core 2 too.

It would probably be better to use a special pass for this. iirc
there's already some code for it, but it is likely not fully correct.

-Andi
Jan Hubicka - Aug. 23, 2010, 2:10 p.m.
> >> +  2,					/* cost of moving SSE register */
> > 
> > Too high?
> 
> Likely.  I changed that in the pipeline description IIRC but this
> probably needs changing as well.

Those costs are not cycles; they are relative to a reg-reg move, which has
a cost of 2.  So setting it to 1 makes an SSE move cheaper than an integer
move.  I see that the geode cost table is wrong here.

Honza
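
[Editor's note: a minimal sketch of the two cost scales Jan describes. It
assumes the COSTS_N_INSNS (N) definition of (N) * 4 from GCC's rtl.h; the
names INT_REG_MOVE_COST, cheaper_than_int_move and insns_from_cost are
hypothetical and exist only for this illustration.]

```c
#include <assert.h>

/* Assumption: same definition as in GCC's rtl.h.  */
#define COSTS_N_INSNS(N) ((N) * 4)

/* Hypothetical name for the baseline described above: an integer
   reg-reg move has cost 2 on the register-move scale.  */
#define INT_REG_MOVE_COST 2

/* Return nonzero if a register-move table entry would make the move
   look cheaper than a plain integer reg-reg move.  */
int
cheaper_than_int_move (int entry)
{
  assert (entry >= 0);
  return entry < INT_REG_MOVE_COST;
}

/* Express a COSTS_N_INSNS value in units of one fast instruction, to
   show that those entries live on a different scale.  */
int
insns_from_cost (int cost)
{
  return cost / COSTS_N_INSNS (1);
}
```

With these definitions cheaper_than_int_move (1) is true and
cheaper_than_int_move (2) is false, so the value 2 in the patch already
means "same as an integer move" rather than "two cycles".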
> 
> > 1 now. Inter unit moves got a lot cheaper.
> 
> As far as I know there are still stalls?
> 
> >> +  32,					/* size of l1 cache.  */
> >> +  256,					/* size of l2 cache.  */
> > 
> > I used the L3 here. Makes more sense?
> 
> No idea.
> 
> >> +  3,					/* Branch cost */
> >> +  COSTS_N_INSNS (3),			/* cost of FADD and FSUB insns.  */
> >> +  COSTS_N_INSNS (5),			/* cost of FMUL instruction.  */
> >> +  COSTS_N_INSNS (32),			/* cost of FDIV instruction.  */
> >> +  COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
> >> +  COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
> >> +  COSTS_N_INSNS (58),			/* cost of FSQRT instruction.  */
> > 
> > I suspect some of these costs are also outdated, but needs measurements.
> 
> FADD and FMUL are correct, I think, but Maxim pointed me at an earlier
> patch from Vlad which got better results by changing them.
> 
> >>    /* X86_TUNE_PAD_RETURNS */
> >> -  m_AMD_MULTIPLE | m_CORE2 | m_GENERIC,
> >> +  m_AMD_MULTIPLE | m_GENERIC,
> > 
> > Not sure why?
> 
> Everything I looked at seemed to say this is an AMD-only thing.
> 
> 
> Bernd
H.J. Lu - Aug. 23, 2010, 2:59 p.m.
On Fri, Aug 20, 2010 at 1:07 PM, Bernd Schmidt <bernds@codesourcery.com> wrote:
> Here's something I've been working on for a while.  This adds a corei7
> processor type, a Core 2/Core i7 scheduling description, and twiddles a
> few of the x86 tuning flags.  I'm not terribly happy with it yet due to
> the relatively small performance improvement, but I'd promised some
> folks I'd post it this week, so...
>
> The scheduling description is heavily based on ppro.md.  There seems to
> be no publicly available, detailed information from Intel about the Core
> 2 pipeline, so this work is based on Agner Fog's manuals.  It should be
> correct in the essentials, at least as well as ppro.md (we aren't really
> able to do a good job with the execution ports since we have no concept
> of the out-of-order core).  I have not tried to implement latencies or
> port reservations for every last MMX or SSE instruction, since who knows
> whether the information is totally accurate anyway.
>
> The i386 port has a lot of tuning flags, and I've mostly been running
> SPEC2000 benchmarks for the last few weeks, trying to find a set of them
> that works well on these processors.  This is slightly tricky since
> there's some inherent noise in the results.
>
> Not using the LEAVE instruction seemed to make a difference on my Penryn
> laptop in 64 bit mode, but that's probably moot now that
> -fomit-frame-pointer is the default.  I've changed a few others, but
> mostly these attempts resulted in lower or unchanged performance, for
> example:
>
>  * using push/pop insns more often (there are about six of these tuning
>   flags).  I would have expected this to be a win.
>  * reusing the PentiumPro code in ix86_adjust_cost for Core 2 and i7
>  * upping the branch cost to 5; initial results looked good for Core i7
>   but in a full SPEC2000 run it seemed to be a slight loss, and a large
>   loss on Core 2
>  * using different string algorithms (from tune_generic)
>  * enabling SPLIT_LONG_MOVES
>  * enabling the flags related to partial reg stalls
>  * reducing code alignments (based on a comment in Agner's manual that
>   they aren't important anymore)
>
> I've implemented a new tuning flag, X86_TUNE_PROMOTE_HI_CONSTANTS, based
> on the recommendation in Agner's manual not to use operand size prefixes
> when they change the length of the instruction (i.e. if there's an
> immediate operand).  That happens in the second of the following four
> instructions, and is said to cause a decoder stall:
>
> $ as
> orl $32768,%eax
> orw $32768,%ax
> orl $8,%eax
> orw $8,%ax
>
>   0:   0d 00 80 00 00          or     $0x8000,%eax
>   5:   66 0d 00 80             or     $0x8000,%ax
>   9:   83 c8 08                or     $0x8,%eax
>   c:   66 83 c8 08             or     $0x8,%ax
>
> This didn't seem to have a large impact either however.
>
> On my last test run, I had
> SPECfp2000:
>  -mtune=generic  3023
>  -mtune=core2    3036
> SPECint2000:
>  -mtune=generic  2774
>  -mtune=core2    2794
>
> This is a Westmere Xeon, i.e. essentially a Core i7, in 32 bit mode.
> SPEC was locked to core 0 with schedtool, core 0 set to 3.2GHz manually
> with cpufreq-set (1 step below maximum, which seems to avoid turbo mode
> effectively).
> Compile flags were -O3 -mpc64 -frename-registers.  The tree is a few
> weeks old so it doesn't have -fomit-frame-pointer by default.  I also
> had -mtune=corei7 numbers, but they were a little lower since I was
> using that run for an experiment with higher branch costs.
>
> These numbers pretty much match the differences I was seeing on the Core
> 2 laptop during development.  I'd welcome if other people would also run
> benchmarks.
>

Here are my results on Core 2 and Core i7 running Fedora 13. There are
many regressions and a few improvements.
Bernd Schmidt - Aug. 23, 2010, 3:03 p.m.
On 08/23/2010 04:59 PM, H.J. Lu wrote:

> Here are my results on Core 2 and Core i7 running Fedora 13. There are
> many regressions and a few improvements.

What compilation flags were you using?


Bernd
H.J. Lu - Aug. 23, 2010, 3:39 p.m.
On Mon, Aug 23, 2010 at 8:03 AM, Bernd Schmidt <bernds@codesourcery.com> wrote:
> On 08/23/2010 04:59 PM, H.J. Lu wrote:
>
>> Here are my results on Core 2 and Core i7 running Fedora 13. There are
>> many regressions and a few improvements.
>
> What compilation flags were you using?
>
>

I use

Base: -ffast-math -mfpmath=sse -O2 -msse2
Peak: -ffast-math -mfpmath=sse -O3 -funroll-loops -msse2

and add -mtune=generic, -mtune=core2 or -mtune=corei7.

Patch

Index: doc/invoke.texi
===================================================================
--- doc/invoke.texi	(revision 162821)
+++ doc/invoke.texi	(working copy)
@@ -11937,6 +11937,9 @@  SSE2 and SSE3 instruction set support.
 @item core2
 Intel Core2 CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3 and SSSE3
 instruction set support.
+@item corei7
+Intel Core i7 CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1
+and SSE4.2 instruction set support.
 @item atom
 Intel Atom CPU with 64-bit extensions, MMX, SSE, SSE2, SSE3 and SSSE3
 instruction set support.
Index: config/i386/i386.h
===================================================================
--- config/i386/i386.h	(revision 162821)
+++ config/i386/i386.h	(working copy)
@@ -239,6 +239,7 @@  extern const struct processor_costs ix86
 #define TARGET_ATHLON_K8 (TARGET_K8 || TARGET_ATHLON)
 #define TARGET_NOCONA (ix86_tune == PROCESSOR_NOCONA)
 #define TARGET_CORE2 (ix86_tune == PROCESSOR_CORE2)
+#define TARGET_COREI7 (ix86_tune == PROCESSOR_COREI7)
 #define TARGET_GENERIC32 (ix86_tune == PROCESSOR_GENERIC32)
 #define TARGET_GENERIC64 (ix86_tune == PROCESSOR_GENERIC64)
 #define TARGET_GENERIC (TARGET_GENERIC32 || TARGET_GENERIC64)
@@ -274,6 +275,7 @@  enum ix86_tune_indices {
   X86_TUNE_HIMODE_MATH,
   X86_TUNE_PROMOTE_QI_REGS,
   X86_TUNE_PROMOTE_HI_REGS,
+  X86_TUNE_PROMOTE_HI_CONSTANTS,
   X86_TUNE_ADD_ESP_4,
   X86_TUNE_ADD_ESP_8,
   X86_TUNE_SUB_ESP_4,
@@ -348,6 +350,8 @@  extern unsigned char ix86_tune_features[
 #define TARGET_HIMODE_MATH	ix86_tune_features[X86_TUNE_HIMODE_MATH]
 #define TARGET_PROMOTE_QI_REGS	ix86_tune_features[X86_TUNE_PROMOTE_QI_REGS]
 #define TARGET_PROMOTE_HI_REGS	ix86_tune_features[X86_TUNE_PROMOTE_HI_REGS]
+#define TARGET_PROMOTE_HI_CONSTANTS \
+	ix86_tune_features[X86_TUNE_PROMOTE_HI_CONSTANTS]
 #define TARGET_ADD_ESP_4	ix86_tune_features[X86_TUNE_ADD_ESP_4]
 #define TARGET_ADD_ESP_8	ix86_tune_features[X86_TUNE_ADD_ESP_8]
 #define TARGET_SUB_ESP_4	ix86_tune_features[X86_TUNE_SUB_ESP_4]
@@ -597,6 +601,7 @@  enum target_cpu_default
   TARGET_CPU_DEFAULT_prescott,
   TARGET_CPU_DEFAULT_nocona,
   TARGET_CPU_DEFAULT_core2,
+  TARGET_CPU_DEFAULT_corei7,
   TARGET_CPU_DEFAULT_atom,
 
   TARGET_CPU_DEFAULT_geode,
@@ -2139,6 +2144,7 @@  enum processor_type
   PROCESSOR_K8,
   PROCESSOR_NOCONA,
   PROCESSOR_CORE2,
+  PROCESSOR_COREI7,
   PROCESSOR_GENERIC32,
   PROCESSOR_GENERIC64,
   PROCESSOR_AMDFAM10,
Index: config/i386/i386.md
===================================================================
--- config/i386/i386.md	(revision 162821)
+++ config/i386/i386.md	(working copy)
@@ -349,8 +349,8 @@  (define_constants
 
 
 ;; Processor type.
-(define_attr "cpu" "none,pentium,pentiumpro,geode,k6,athlon,k8,core2,atom,
-		    generic64,amdfam10,bdver1"
+(define_attr "cpu" "none,pentium,pentiumpro,geode,k6,athlon,k8,core2,corei7,
+		    atom,generic64,amdfam10,bdver1"
   (const (symbol_ref "ix86_schedule")))
 
 ;; A basic instruction type.  Refinements due to arguments to be
@@ -388,6 +388,10 @@  (define_attr "unit" "integer,i387,sse,mm
 	   (const_string "unknown")]
 	 (const_string "integer")))
 
+;; For integer multiply insns, the number of operands.
+(define_attr "mul_operands" ""
+  (const_int 2))
+
 ;; The (bounding maximum) length of an instruction immediate.
 (define_attr "length_immediate" ""
   (cond [(eq_attr "type" "incdec,setcc,icmov,str,lea,other,multi,idiv,leave,
@@ -919,6 +923,7 @@  (define_mode_iterator P [(SI "Pmode == S
 (include "athlon.md")
 (include "geode.md")
 (include "atom.md")
+(include "core2.md")
 
 
 ;; Operand and operator predicates and constraints
@@ -7010,6 +7015,7 @@  (define_insn "*mul<mode>3_1"
    imul{<imodesuffix>}\t{%2, %1, %0|%0, %1, %2}
    imul{<imodesuffix>}\t{%2, %0|%0, %2}"
   [(set_attr "type" "imul")
+   (set_attr "mul_operands" "3,3,2")
    (set_attr "prefix_0f" "0,0,1")
    (set (attr "athlon_decode")
 	(cond [(eq_attr "cpu" "athlon")
@@ -7040,6 +7046,7 @@  (define_insn "*mulsi3_1_zext"
    imul{l}\t{%2, %1, %k0|%k0, %1, %2}
    imul{l}\t{%2, %k0|%k0, %2}"
   [(set_attr "type" "imul")
+   (set_attr "mul_operands" "3,3,2")
    (set_attr "prefix_0f" "0,0,1")
    (set (attr "athlon_decode")
 	(cond [(eq_attr "cpu" "athlon")
@@ -7077,6 +7084,7 @@  (define_insn "*mulhi3_1"
    imul{w}\t{%2, %1, %0|%0, %1, %2}
    imul{w}\t{%2, %0|%0, %2}"
   [(set_attr "type" "imul")
+   (set_attr "mul_operands" "3,3,2")
    (set_attr "prefix_0f" "0,0,1")
    (set (attr "athlon_decode")
 	(cond [(eq_attr "cpu" "athlon")
@@ -7103,6 +7111,7 @@  (define_insn "*mulqi3_1"
    && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
   "mul{b}\t%2"
   [(set_attr "type" "imul")
+   (set_attr "mul_operands" "1")
    (set_attr "length_immediate" "0")
    (set (attr "athlon_decode")
      (if_then_else (eq_attr "cpu" "athlon")
@@ -7144,6 +7153,7 @@  (define_insn "*<u>mul<mode><dwi>3_1"
   "!(MEM_P (operands[1]) && MEM_P (operands[2]))"
   "<sgnprefix>mul{<imodesuffix>}\t%2"
   [(set_attr "type" "imul")
+   (set_attr "mul_operands" "1")
    (set_attr "length_immediate" "0")
    (set (attr "athlon_decode")
      (if_then_else (eq_attr "cpu" "athlon")
@@ -7164,6 +7174,7 @@  (define_insn "*<u>mulqihi3_1"
    && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
   "<sgnprefix>mul{b}\t%2"
   [(set_attr "type" "imul")
+   (set_attr "mul_operands" "1")
    (set_attr "length_immediate" "0")
    (set (attr "athlon_decode")
      (if_then_else (eq_attr "cpu" "athlon")
@@ -7203,6 +7214,7 @@  (define_insn "*<s>muldi3_highpart_1"
    && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
   "<sgnprefix>mul{q}\t%2"
   [(set_attr "type" "imul")
+   (set_attr "mul_operands" "1")
    (set_attr "length_immediate" "0")
    (set (attr "athlon_decode")
      (if_then_else (eq_attr "cpu" "athlon")
@@ -7226,6 +7238,7 @@  (define_insn "*<s>mulsi3_highpart_1"
   "!(MEM_P (operands[1]) && MEM_P (operands[2]))"
   "<sgnprefix>mul{l}\t%2"
   [(set_attr "type" "imul")
+   (set_attr "mul_operands" "1")
    (set_attr "length_immediate" "0")
    (set (attr "athlon_decode")
      (if_then_else (eq_attr "cpu" "athlon")
@@ -7249,6 +7262,7 @@  (define_insn "*<s>mulsi3_highpart_zext"
    && !(MEM_P (operands[1]) && MEM_P (operands[2]))"
   "<sgnprefix>mul{l}\t%2"
   [(set_attr "type" "imul")
+   (set_attr "mul_operands" "1")
    (set_attr "length_immediate" "0")
    (set (attr "athlon_decode")
      (if_then_else (eq_attr "cpu" "athlon")
Index: config/i386/core2.md
===================================================================
--- config/i386/core2.md	(revision 0)
+++ config/i386/core2.md	(revision 0)
@@ -0,0 +1,744 @@ 
+;; Scheduling for Core 2 and derived processors.
+;; Copyright (C) 2004, 2005, 2007, 2008, 2010 Free Software Foundation, Inc.
+;;
+;; This file is part of GCC.
+;;
+;; GCC is free software; you can redistribute it and/or modify
+;; it under the terms of the GNU General Public License as published by
+;; the Free Software Foundation; either version 3, or (at your option)
+;; any later version.
+;;
+;; GCC is distributed in the hope that it will be useful,
+;; but WITHOUT ANY WARRANTY; without even the implied warranty of
+;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+;; GNU General Public License for more details.
+;;
+;; You should have received a copy of the GNU General Public License
+;; along with GCC; see the file COPYING3.  If not see
+;; <http://www.gnu.org/licenses/>.
+
+;; The scheduling description in this file is based on the one in ppro.md,
+;; with additional information obtained from
+;;
+;;    "How to optimize for the Pentium family of microprocessors",
+;;    by Agner Fog, PhD.
+;;
+;; The major difference from the P6 pipeline is one extra decoder, and
+;; one extra execute unit.  Due to micro-op fusion, many insns no longer
+;; need to be decoded in decoder 0, but can be handled by all of them.
+
+;; The core2_idiv, core2_fdiv and core2_ssediv automata are used to
+;; model issue latencies of idiv, fdiv and ssediv type insns.
+(define_automaton "core2_decoder,core2_core,core2_idiv,core2_fdiv,core2_ssediv,core2_load,core2_store")
+
+;; The CPU domain, used for Core i7 bypass latencies
+(define_attr "i7_domain" "int,float,simd"
+  (cond [(eq_attr "type" "fmov,fop,fsgn,fmul,fdiv,fpspc,fcmov,fcmp,fxch,fistp,fisttp,frndint")
+	   (const_string "float")
+	 (eq_attr "type" "sselog,sselog1,sseiadd,sseiadd1,sseishft,sseishft1,sseimul,
+			  sse,ssemov,sseadd,ssemul,ssecmp,ssecomi,ssecvt,
+			  ssecvt1,sseicvt,ssediv,sseins,ssemuladd,sse4arg")
+	   (cond [(eq_attr "mode" "V4DF,V8SF,V2DF,V4SF,SF,DF")
+		    (const_string "float")
+		  (eq_attr "mode" "SI")
+		    (const_string "int")]
+		  (const_string "simd"))
+	 (eq_attr "type" "mmx,mmxmov,mmxadd,mmxmul,mmxcmp,mmxcvt,mmxshft")
+	   (const_string "simd")]
+	(const_string "int")))
+
+;; As for the Pentium Pro,
+;;  - an instruction with 1 uop can be decoded by any of the four
+;;    decoders in one cycle.
+;;  - an instruction with 1 to 4 uops can be decoded only by decoder 0
+;;    but still in only one cycle.
+;;  - a complex (microcode) instruction can also only be decoded by
+;;    decoder 0, and this takes an unspecified number of cycles.
+;;
+;; The goal is to schedule such that we have a few-one-one-one uops sequence
+;; in each cycle, to decode as many instructions per cycle as possible.
+(define_cpu_unit "c2_decoder0" "core2_decoder")
+(define_cpu_unit "c2_decoder1" "core2_decoder")
+(define_cpu_unit "c2_decoder2" "core2_decoder")
+(define_cpu_unit "c2_decoder3" "core2_decoder")
+
+;; We first wish to find an instruction for c2_decoder0, so exclude
+;; c2_decoder1, c2_decoder2 and c2_decoder3 from being reserved until
+;; c2_decoder0 is reserved.
+(presence_set "c2_decoder1" "c2_decoder0")
+(presence_set "c2_decoder2" "c2_decoder0")
+(presence_set "c2_decoder3" "c2_decoder0")
+
+;; Most instructions can be decoded on any of the four decoders.
+(define_reservation "c2_decodern" "(c2_decoder0|c2_decoder1|c2_decoder2|c2_decoder3)")
+
+;; The out-of-order core has six pipelines.  These are similar to the
+;; Pentium Pro's five pipelines.  Port 2 is responsible for memory loads,
+;; port 3 for store address calculations, port 4 for memory stores, and
+;; ports 0, 1 and 5 for everything else.
+
+(define_cpu_unit "c2_p0,c2_p1,c2_p5" "core2_core")
+(define_cpu_unit "c2_p2" "core2_load")
+(define_cpu_unit "c2_p3,c2_p4" "core2_store")
+(define_cpu_unit "c2_idiv" "core2_idiv")
+(define_cpu_unit "c2_fdiv" "core2_fdiv")
+(define_cpu_unit "c2_ssediv" "core2_ssediv")
+
+;; Only the irregular instructions have to be modeled here.  A load
+;; increases the latency by 2 or 3, or by nothing if the manual gives
+;; a latency already.  Store latencies are not accounted for.
+;;
+;; The simple instructions follow a very regular pattern of 1 uop per
+;; reg-reg operation, 1 uop per load on port 2, and 2 uops per store
+;; on port 4 and port 3.  These instructions are modelled at the bottom
+;; of this file.
+;;
+;; For microcoded instructions we don't know how many uops are produced.
+;; These instructions are the "complex" ones in the Intel manuals.  All
+;; we _do_ know is that they typically produce four or more uops, so
+;; they can only be decoded on c2_decoder0.  Modelling their latencies
+;; doesn't make sense because we don't know how these instructions are
+;; executed in the core.  So we just model that they can only be decoded
+;; on decoder 0, and say that it takes a little while before the result
+;; is available.
+(define_insn_reservation "c2_complex_insn" 6
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (eq_attr "type" "other,multi,str"))
+			 "c2_decoder0")
+
+(define_insn_reservation "c2_call" 1
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (eq_attr "type" "call,callv"))
+			 "c2_decoder0")
+
+;; imov with memory operands does not use the integer units.
+;; imovx always decodes to one uop, and also doesn't use the integer
+;; units if it has memory operands.
+(define_insn_reservation "c2_imov" 1
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (eq_attr "type" "imov,imovx")))
+			 "c2_decodern,(c2_p0|c2_p1|c2_p5)")
+
+(define_insn_reservation "c2_imov_load" 4
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "load")
+				   (eq_attr "type" "imov,imovx")))
+			 "c2_decodern,c2_p2")
+
+(define_insn_reservation "c2_imov_store" 1
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "store")
+				   (eq_attr "type" "imov")))
+			 "c2_decodern,c2_p4+c2_p3")
+
+(define_insn_reservation "c2_icmov" 2
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (eq_attr "type" "icmov")))
+			 "c2_decoder0,(c2_p0|c2_p1|c2_p5)*2")
+
+(define_insn_reservation "c2_icmov_load" 2
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "load")
+				   (eq_attr "type" "icmov")))
+			 "c2_decoder0,c2_p2,(c2_p0|c2_p1|c2_p5)*2")
+
+(define_insn_reservation "c2_push_reg" 1
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "store")
+				   (eq_attr "type" "push")))
+			 "c2_decodern,c2_p4+c2_p3")
+
+(define_insn_reservation "c2_push_mem" 1
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "both")
+				   (eq_attr "type" "push")))
+			 "c2_decoder0,c2_p2,c2_p4+c2_p3")
+
+;; lea executes on port 0 with latency one and throughput 1.
+(define_insn_reservation "c2_lea" 1
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (eq_attr "type" "lea")))
+			 "c2_decodern,c2_p0")
+
+;; Shift and rotate decode as two uops which can go to port 0 or 5.
+;; The load and store units need to be reserved when memory operands
+;; are involved.
+(define_insn_reservation "c2_shift_rotate" 1
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (eq_attr "type" "ishift,ishift1,rotate,rotate1")))
+			 "c2_decodern,(c2_p0|c2_p5)")
+
+(define_insn_reservation "c2_shift_rotate_mem" 4
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "!none")
+				   (eq_attr "type" "ishift,ishift1,rotate,rotate1")))
+			 "c2_decoder0,c2_p2,(c2_p0|c2_p5),c2_p4+c2_p3")
+
+;; See comments in ppro.md for the corresponding reservation.
+(define_insn_reservation "c2_branch" 1
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (eq_attr "type" "ibr")))
+			 "c2_decodern,c2_p5")
+
+;; ??? Indirect branches probably have worse latency than this.
+(define_insn_reservation "c2_indirect_branch" 6
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "!none")
+				   (eq_attr "type" "ibr")))
+			 "c2_decoder0,c2_p2+c2_p5")
+
+(define_insn_reservation "c2_leave" 4
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (eq_attr "type" "leave"))
+			 "c2_decoder0,c2_p2+(c2_p0|c2_p1),(c2_p0|c2_p1)")
+
+;; mul and imul with two/three operands only execute on port 1 for HImode
+;; and SImode, port 0 for DImode.
+(define_insn_reservation "c2_imul_hisi" 3
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (and (eq_attr "mode" "HI,SI")
+					(and (eq_attr "type" "imul")
+					     (eq_attr "mul_operands" "2,3")))))
+			 "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_imul_hisi_mem" 3
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "!none")
+				   (and (eq_attr "mode" "HI,SI")
+					(and (eq_attr "type" "imul")
+					     (eq_attr "mul_operands" "2,3")))))
+			 "c2_decoder0,c2_p2+c2_p1")
+
+(define_insn_reservation "c2_imul_di" 5
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (and (eq_attr "mode" "DI")
+					(and (eq_attr "type" "imul")
+					     (eq_attr "mul_operands" "2,3")))))
+			 "c2_decodern,c2_p0")
+
+(define_insn_reservation "c2_imul_di_mem" 5
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "!none")
+				   (and (eq_attr "mode" "DI")
+					(and (eq_attr "type" "imul")
+					     (eq_attr "mul_operands" "2,3")))))
+			 "c2_decoder0,c2_p2+c2_p0")
+
+(define_insn_reservation "c2_imul_qi1" 3
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (and (eq_attr "mode" "QI")
+					(and (eq_attr "type" "imul")
+					     (eq_attr "mul_operands" "1")))))
+			 "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_imul_qi1_mem" 3
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "!none")
+				   (and (eq_attr "mode" "QI")
+					(and (eq_attr "type" "imul")
+					     (eq_attr "mul_operands" "1")))))
+			 "c2_decoder0,c2_p2+c2_p1")
+
+(define_insn_reservation "c2_imul_hisi1" 5
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (and (eq_attr "mode" "HI,SI")
+					(and (eq_attr "type" "imul")
+					     (eq_attr "mul_operands" "1")))))
+			 "c2_decoder0,c2_p1")
+
+(define_insn_reservation "c2_imul_hisi1_mem" 5
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "!none")
+				   (and (eq_attr "mode" "HI,SI")
+					(and (eq_attr "type" "imul")
+					     (eq_attr "mul_operands" "1")))))
+			 "c2_decoder0,c2_p2+c2_p1")
+
+(define_insn_reservation "c2_imul_di1" 7
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (and (eq_attr "mode" "DI")
+					(and (eq_attr "type" "imul")
+					     (eq_attr "mul_operands" "1")))))
+			 "c2_decoder0,c2_p0")
+
+(define_insn_reservation "c2_imul_di1_mem" 7
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "!none")
+				   (and (eq_attr "mode" "DI")
+					(and (eq_attr "type" "imul")
+					     (eq_attr "mul_operands" "1")))))
+			 "c2_decoder0,c2_p2+c2_p0")
+
+;; div and idiv are very similar, so we model them the same.
+;; QI, HI, and SI have issue latency 12, 21, and 37, respectively.
+;; These issue latencies are modelled via the core2_idiv automaton.
+(define_insn_reservation "c2_idiv_QI" 19
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (and (eq_attr "mode" "QI")
+					(eq_attr "type" "idiv"))))
+			 "c2_decoder0,(c2_p0+c2_idiv)*2,(c2_p0|c2_p1)+c2_idiv,c2_idiv*9")
+
+(define_insn_reservation "c2_idiv_QI_load" 19
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "load")
+				   (and (eq_attr "mode" "QI")
+					(eq_attr "type" "idiv"))))
+			 "c2_decoder0,c2_p2+c2_p0+c2_idiv,c2_p0+c2_idiv,(c2_p0|c2_p1)+c2_idiv,c2_idiv*9")
+
+(define_insn_reservation "c2_idiv_HI" 23
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (and (eq_attr "mode" "HI")
+					(eq_attr "type" "idiv"))))
+			 "c2_decoder0,(c2_p0+c2_idiv)*3,(c2_p0|c2_p1)+c2_idiv,c2_idiv*17")
+
+(define_insn_reservation "c2_idiv_HI_load" 23
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "load")
+				   (and (eq_attr "mode" "HI")
+					(eq_attr "type" "idiv"))))
+			 "c2_decoder0,c2_p2+c2_p0+c2_idiv,c2_p0+c2_idiv,(c2_p0|c2_p1)+c2_idiv,c2_idiv*18")
+
+(define_insn_reservation "c2_idiv_SI" 39
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (and (eq_attr "mode" "SI")
+					(eq_attr "type" "idiv"))))
+			 "c2_decoder0,(c2_p0+c2_idiv)*3,(c2_p0|c2_p1)+c2_idiv,c2_idiv*33")
+
+(define_insn_reservation "c2_idiv_SI_load" 39
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "load")
+				   (and (eq_attr "mode" "SI")
+					(eq_attr "type" "idiv"))))
+			 "c2_decoder0,c2_p2+c2_p0+c2_idiv,c2_p0+c2_idiv,(c2_p0|c2_p1)+c2_idiv,c2_idiv*34")
+
+;; x87 floating point operations.
+
+(define_insn_reservation "c2_fxch" 0
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (eq_attr "type" "fxch"))
+			 "c2_decodern")
+
+(define_insn_reservation "c2_fop" 3
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none,unknown")
+				   (eq_attr "type" "fop")))
+			 "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_fop_load" 5
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "load")
+				   (eq_attr "type" "fop")))
+			 "c2_decoder0,c2_p2+c2_p1,c2_p1")
+
+(define_insn_reservation "c2_fop_store" 3
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "store")
+				   (eq_attr "type" "fop")))
+			 "c2_decoder0,c2_p0,c2_p0,c2_p0+c2_p4+c2_p3")
+
+(define_insn_reservation "c2_fop_both" 5
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "both")
+				   (eq_attr "type" "fop")))
+			 "c2_decoder0,c2_p2+c2_p0,c2_p0+c2_p4+c2_p3")
+
+(define_insn_reservation "c2_fsgn" 1
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (eq_attr "type" "fsgn"))
+			 "c2_decodern,c2_p0")
+
+(define_insn_reservation "c2_fistp" 5
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (eq_attr "type" "fistp"))
+			 "c2_decoder0,c2_p0*2,c2_p4+c2_p3")
+
+(define_insn_reservation "c2_fcmov" 2
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (eq_attr "type" "fcmov"))
+			 "c2_decoder0,c2_p0*2")
+
+(define_insn_reservation "c2_fcmp" 1
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (eq_attr "type" "fcmp")))
+			 "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_fcmp_load" 4
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "load")
+				   (eq_attr "type" "fcmp")))
+			 "c2_decoder0,c2_p2+c2_p1")
+
+(define_insn_reservation "c2_fmov" 1
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (eq_attr "type" "fmov")))
+			 "c2_decodern,c2_p0")
+
+(define_insn_reservation "c2_fmov_load" 1
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "load")
+				   (and (eq_attr "mode" "!XF")
+					(eq_attr "type" "fmov"))))
+			 "c2_decodern,c2_p2")
+
+(define_insn_reservation "c2_fmov_XF_load" 3
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "load")
+				   (and (eq_attr "mode" "XF")
+					(eq_attr "type" "fmov"))))
+			 "c2_decoder0,(c2_p2+c2_p0)*2")
+
+(define_insn_reservation "c2_fmov_store" 1
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "store")
+				   (and (eq_attr "mode" "!XF")
+					(eq_attr "type" "fmov"))))
+			 "c2_decodern,c2_p3+c2_p4")
+
+(define_insn_reservation "c2_fmov_XF_store" 3
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "store")
+				   (and (eq_attr "mode" "XF")
+					(eq_attr "type" "fmov"))))
+			 "c2_decoder0,(c2_p3+c2_p4),(c2_p3+c2_p4)")
+
+;; fmul executes on port 0 with latency 5.  It has issue latency 2,
+;; but we don't model this.
+(define_insn_reservation "c2_fmul" 5
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (eq_attr "type" "fmul")))
+			 "c2_decoder0,c2_p0*2")
+
+(define_insn_reservation "c2_fmul_load" 6
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "load")
+				   (eq_attr "type" "fmul")))
+			 "c2_decoder0,c2_p2+c2_p0,c2_p0")
+
+;; fdiv latencies depend on the mode of the operands.  XFmode gives
+;; a latency of 38 cycles, DFmode gives 32, and SFmode gives latency 18.
+;; Division by a power of 2 takes only 9 cycles, but we cannot model
+;; that.  Throughput is equal to latency - 1, which we model using the
+;; core2_fdiv automaton.
+(define_insn_reservation "c2_fdiv_SF" 18
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (and (eq_attr "mode" "SF")
+					(eq_attr "type" "fdiv,fpspc"))))
+			 "c2_decodern,c2_p0+c2_fdiv,c2_fdiv*16")
+
+(define_insn_reservation "c2_fdiv_SF_load" 19
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "load")
+				   (and (eq_attr "mode" "SF")
+					(eq_attr "type" "fdiv,fpspc"))))
+			 "c2_decoder0,c2_p2+c2_p0+c2_fdiv,c2_fdiv*16")
+
+(define_insn_reservation "c2_fdiv_DF" 32
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (and (eq_attr "mode" "DF")
+					(eq_attr "type" "fdiv,fpspc"))))
+			 "c2_decodern,c2_p0+c2_fdiv,c2_fdiv*30")
+
+(define_insn_reservation "c2_fdiv_DF_load" 33
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "load")
+				   (and (eq_attr "mode" "DF")
+					(eq_attr "type" "fdiv,fpspc"))))
+			 "c2_decoder0,c2_p2+c2_p0+c2_fdiv,c2_fdiv*30")
+
+(define_insn_reservation "c2_fdiv_XF" 38
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (and (eq_attr "mode" "XF")
+					(eq_attr "type" "fdiv,fpspc"))))
+			 "c2_decodern,c2_p0+c2_fdiv,c2_fdiv*36")
+
+(define_insn_reservation "c2_fdiv_XF_load" 39
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "load")
+				   (and (eq_attr "mode" "XF")
+					(eq_attr "type" "fdiv,fpspc"))))
+			 "c2_decoder0,c2_p2+c2_p0+c2_fdiv,c2_fdiv*36")
+
+;; MMX instructions.
+
+(define_insn_reservation "c2_mmx_add" 1
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (eq_attr "type" "mmxadd,sseiadd")))
+			 "c2_decodern,c2_p0|c2_p5")
+
+(define_insn_reservation "c2_mmx_add_load" 2
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "load")
+				   (eq_attr "type" "mmxadd,sseiadd")))
+			 "c2_decodern,c2_p2+c2_p0|c2_p5")
+
+(define_insn_reservation "c2_mmx_shft" 1
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (eq_attr "type" "mmxshft")))
+			 "c2_decodern,c2_p0|c2_p5")
+
+(define_insn_reservation "c2_mmx_shft_load" 2
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "load")
+				   (eq_attr "type" "mmxshft")))
+			 "c2_decoder0,c2_p2+c2_p1")
+
+(define_insn_reservation "c2_mmx_sse_shft" 1
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (and (eq_attr "type" "sseishft")
+					(eq_attr "length_immediate" "!0"))))
+			 "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_mmx_sse_shft_load" 2
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "load")
+				   (and (eq_attr "type" "sseishft")
+					(eq_attr "length_immediate" "!0"))))
+			 "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_mmx_sse_shft1" 2
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (and (eq_attr "type" "sseishft")
+					(eq_attr "length_immediate" "0"))))
+			 "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_mmx_sse_shft1_load" 3
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "load")
+				   (and (eq_attr "type" "sseishft")
+					(eq_attr "length_immediate" "0"))))
+			 "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_mmx_mul" 3
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (eq_attr "type" "mmxmul,sseimul")))
+			 "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_mmx_mul_load" 3
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "load")
+				   (eq_attr "type" "mmxmul,sseimul")))
+			 "c2_decoder0,c2_p2+c2_p1")
+
+(define_insn_reservation "c2_sse_mmxcvt" 4
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "mode" "DI")
+				   (eq_attr "type" "mmxcvt")))
+			 "c2_decodern,c2_p1")
+
+;; FIXME: Inherited from ppro.md, where this reservation applied to
+;; Pentium III only.  It is left disabled here.
+;; (define_insn_reservation "c2_sse_mmxshft" 2
+;;			 (and (eq_attr "cpu" "core2,corei7")
+;;			      (and (eq_attr "mode" "TI")
+;;				   (eq_attr "type" "mmxshft")))
+;;			 "c2_decodern,c2_p0")
+
+;; The sfence instruction.
+(define_insn_reservation "c2_sse_sfence" 3
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "unknown")
+				   (eq_attr "type" "sse")))
+			 "c2_decoder0,c2_p4+c2_p3")
+
+;; FIXME: This reservation is all wrong when we're scheduling sqrtss.
+(define_insn_reservation "c2_sse_SFDF" 3
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "mode" "SF,DF")
+				   (eq_attr "type" "sse")))
+			 "c2_decodern,c2_p0")
+
+(define_insn_reservation "c2_sse_V4SF" 4
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "mode" "V4SF")
+				   (eq_attr "type" "sse")))
+			 "c2_decoder0,c2_p1*2")
+
+(define_insn_reservation "c2_sse_addcmp" 3
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (eq_attr "type" "sseadd,ssecmp,ssecomi")))
+			 "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_sse_addcmp_load" 3
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "load")
+				   (eq_attr "type" "sseadd,ssecmp,ssecomi")))
+			 "c2_decodern,c2_p2+c2_p1")
+
+(define_insn_reservation "c2_sse_mul_SF" 4
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (and (eq_attr "mode" "SF,V4SF")
+					(eq_attr "type" "ssemul"))))
+			"c2_decodern,c2_p0")
+
+(define_insn_reservation "c2_sse_mul_SF_load" 4
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "load")
+				   (and (eq_attr "mode" "SF,V4SF")
+					(eq_attr "type" "ssemul"))))
+			"c2_decodern,c2_p2+c2_p0")
+
+(define_insn_reservation "c2_sse_mul_DF" 5
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (and (eq_attr "mode" "DF,V2DF")
+					(eq_attr "type" "ssemul"))))
+			"c2_decodern,c2_p0")
+
+(define_insn_reservation "c2_sse_mul_DF_load" 5
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "load")
+				   (and (eq_attr "mode" "DF,V2DF")
+					(eq_attr "type" "ssemul"))))
+			"c2_decodern,c2_p2+c2_p0")
+
+(define_insn_reservation "c2_sse_div_SF" 18
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (and (eq_attr "mode" "SF,V4SF")
+					(eq_attr "type" "ssediv"))))
+			 "c2_decodern,c2_p0,c2_ssediv*17")
+
+(define_insn_reservation "c2_sse_div_SF_load" 18
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "load")
+				   (and (eq_attr "mode" "SF,V4SF")
+					(eq_attr "type" "ssediv"))))
+			 "c2_decodern,(c2_p2+c2_p0),c2_ssediv*17")
+
+(define_insn_reservation "c2_sse_div_DF" 32
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (and (eq_attr "mode" "DF,V2DF")
+					(eq_attr "type" "ssediv"))))
+			 "c2_decodern,c2_p0,c2_ssediv*31")
+
+(define_insn_reservation "c2_sse_div_DF_load" 32
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "load")
+				   (and (eq_attr "mode" "DF,V2DF")
+					(eq_attr "type" "ssediv"))))
+			 "c2_decodern,(c2_p2+c2_p0),c2_ssediv*31")
+
+;; FIXME: these have limited throughput
+(define_insn_reservation "c2_sse_icvt_SF" 4
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (and (eq_attr "mode" "SF")
+					(eq_attr "type" "sseicvt"))))
+			 "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_sse_icvt_SF_load" 4
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "!none")
+				   (and (eq_attr "mode" "SF")
+					(eq_attr "type" "sseicvt"))))
+			 "c2_decodern,c2_p2+c2_p1")
+
+(define_insn_reservation "c2_sse_icvt_DF" 4
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (and (eq_attr "mode" "DF")
+					(eq_attr "type" "sseicvt"))))
+			 "c2_decoder0,c2_p0+c2_p1")
+
+(define_insn_reservation "c2_sse_icvt_DF_load" 4
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "!none")
+				   (and (eq_attr "mode" "DF")
+					(eq_attr "type" "sseicvt"))))
+			 "c2_decoder0,(c2_p2+c2_p1)")
+
+(define_insn_reservation "c2_sse_icvt_SI" 3
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (and (eq_attr "mode" "SI")
+					(eq_attr "type" "sseicvt"))))
+			 "c2_decodern,c2_p1")
+
+(define_insn_reservation "c2_sse_icvt_SI_load" 3
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "!none")
+				   (and (eq_attr "mode" "SI")
+					(eq_attr "type" "sseicvt"))))
+			 "c2_decodern,(c2_p2+c2_p1)")
+
+(define_insn_reservation "c2_sse_mov" 1
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none")
+				   (eq_attr "type" "ssemov")))
+			 "c2_decodern,(c2_p0|c2_p1|c2_p5)")
+
+(define_insn_reservation "c2_sse_mov_load" 2
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "load")
+				   (eq_attr "type" "ssemov")))
+			 "c2_decodern,c2_p2")
+
+(define_insn_reservation "c2_sse_mov_store" 1
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "store")
+				   (eq_attr "type" "ssemov")))
+			 "c2_decodern,c2_p4+c2_p3")
+
+;; All other instructions are modelled as simple instructions.
+;; We have already modelled all i387 floating point instructions, so all
+;; other instructions execute on either port 0, 1 or 5.  This includes
+;; the ALU units, and the MMX units.
+;;
+;; reg-reg instructions produce 1 uop so they can be decoded on any of
+;; the three decoders.  Loads benefit from micro-op fusion and can be
+;; treated in the same way.
+(define_insn_reservation "c2_insn" 1
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "none,unknown")
+				   (eq_attr "type" "alu,alu1,negnot,incdec,icmp,test,setcc,sseishft1,mmx,mmxcmp")))
+			 "c2_decodern,(c2_p0|c2_p1|c2_p5)")
+
+(define_insn_reservation "c2_insn_load" 4
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "load")
+				   (eq_attr "type" "alu,alu1,negnot,incdec,icmp,test,setcc,pop,sseishft1,mmx,mmxcmp")))
+			 "c2_decodern,c2_p2,(c2_p0|c2_p1|c2_p5)")
+
+;; register-memory instructions have three uops, so they have to be
+;; decoded on c2_decoder0.
+(define_insn_reservation "c2_insn_store" 1
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "store")
+				   (eq_attr "type" "alu,alu1,negnot,incdec,icmp,test,setcc,sseishft1,mmx,mmxcmp")))
+			 "c2_decoder0,(c2_p0|c2_p1|c2_p5),c2_p4+c2_p3")
+
+;; read-modify-store instructions produce 4 uops so they have to be
+;; decoded on c2_decoder0 as well.
+(define_insn_reservation "c2_insn_both" 4
+			 (and (eq_attr "cpu" "core2,corei7")
+			      (and (eq_attr "memory" "both")
+				   (eq_attr "type" "alu,alu1,negnot,incdec,icmp,test,setcc,pop,sseishft1,mmx,mmxcmp")))
+			 "c2_decoder0,c2_p2,(c2_p0|c2_p1|c2_p5),c2_p4+c2_p3")
+
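
Not part of the patch: a minimal C sketch of how the c2_fdiv reservations above limit divide throughput. It assumes a single, non-pipelined divider that is busy for 1 + N cycles per reservation of the form "c2_p0+c2_fdiv,c2_fdiv*N". The function names are made up for illustration and do not exist in GCC.

```c
#include <stdio.h>

/* Cycles the (hypothetical) divider stays busy for a reservation of the
   form "c2_p0+c2_fdiv,c2_fdiv*N": one cycle shared with port 0, then
   N further cycles on the divider alone.  */
int
divider_busy_cycles (int n)
{
  return 1 + n;
}

/* Earliest cycle at which the K-th (0-based) of a run of back-to-back
   divides can start, if each one waits for the divider to be free.  */
int
kth_divide_start (int k, int n)
{
  return k * divider_busy_cycles (n);
}

/* Print the occupancy for the SF, DF and XF reservations; the latency
   and N values are copied from the reservations above.  */
void
print_divider_table (void)
{
  static const int latency[] = { 18, 32, 38 };
  static const int n[] = { 16, 30, 36 };
  int i;

  for (i = 0; i < 3; i++)
    printf ("latency %d: divider busy for %d cycles\n",
	    latency[i], divider_busy_cycles (n[i]));
}
```

Under this reading the divider is occupied for latency - 1 cycles in each mode, which is what the comment above the c2_fdiv_SF reservation describes.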
Index: config/i386/i386-c.c
===================================================================
--- config/i386/i386-c.c	(revision 162821)
+++ config/i386/i386-c.c	(working copy)
@@ -122,6 +122,10 @@  ix86_target_macros_internal (int isa_fla
       def_or_undef (parse_in, "__core2");
       def_or_undef (parse_in, "__core2__");
       break;
+    case PROCESSOR_COREI7:
+      def_or_undef (parse_in, "__corei7");
+      def_or_undef (parse_in, "__corei7__");
+      break;
     case PROCESSOR_ATOM:
       def_or_undef (parse_in, "__atom");
       def_or_undef (parse_in, "__atom__");
@@ -197,6 +201,9 @@  ix86_target_macros_internal (int isa_fla
     case PROCESSOR_CORE2:
       def_or_undef (parse_in, "__tune_core2__");
       break;
+    case PROCESSOR_COREI7:
+      def_or_undef (parse_in, "__tune_corei7__");
+      break;
     case PROCESSOR_ATOM:
       def_or_undef (parse_in, "__tune_atom__");
       break;
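
Not part of the patch: a small C sketch of how user code could test the predefined macros added above. It assumes a compiler that includes this patch and that the first switch corresponds to -march= and the second to -mtune=, as for the existing core2 entries. The function names are hypothetical.

```c
#include <stdio.h>

/* Describe the architecture selection as seen through the predefined
   macros.  __corei7__ is only defined by a compiler that includes this
   patch; elsewhere the code falls through to the other branches.  */
const char *
arch_name (void)
{
#if defined (__corei7__)
  return "corei7";
#elif defined (__core2__)
  return "core2";
#else
  return "other";
#endif
}

/* Same idea for the tuning selection.  */
const char *
tune_name (void)
{
#if defined (__tune_corei7__)
  return "corei7";
#elif defined (__tune_core2__)
  return "core2";
#else
  return "other";
#endif
}

/* Print both selections on one line.  */
void
report_target (void)
{
  printf ("arch: %s, tune: %s\n", arch_name (), tune_name ());
}
```

Calling report_target () from main shows which of the two sets of macros the compiler defined for the current command line.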
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 162821)
+++ config/i386/i386.c	(working copy)
@@ -1124,6 +1124,79 @@  struct processor_costs core2_cost = {
 };
 
 static const
+struct processor_costs corei7_cost = {
+  COSTS_N_INSNS (1),			/* cost of an add instruction */
+  COSTS_N_INSNS (1) + 1,		/* cost of a lea instruction */
+  COSTS_N_INSNS (1),			/* variable shift costs */
+  COSTS_N_INSNS (1),			/* constant shift costs */
+  {COSTS_N_INSNS (3),			/* cost of starting multiply for QI */
+   COSTS_N_INSNS (3),			/*				 HI */
+   COSTS_N_INSNS (3),			/*				 SI */
+   COSTS_N_INSNS (3),			/*				 DI */
+   COSTS_N_INSNS (3)},			/*			      other */
+  0,					/* cost of multiply per each bit set */
+  {COSTS_N_INSNS (22),			/* cost of a divide/mod for QI */
+   COSTS_N_INSNS (22),			/*			    HI */
+   COSTS_N_INSNS (22),			/*			    SI */
+   COSTS_N_INSNS (22),			/*			    DI */
+   COSTS_N_INSNS (22)},			/*			    other */
+  COSTS_N_INSNS (1),			/* cost of movsx */
+  COSTS_N_INSNS (1),			/* cost of movzx */
+  8,					/* "large" insn */
+  16,					/* MOVE_RATIO */
+  2,				     /* cost for loading QImode using movzbl */
+  {6, 6, 6},				/* cost of loading integer registers
+					   in QImode, HImode and SImode.
+					   Relative to reg-reg move (2).  */
+  {4, 4, 4},				/* cost of storing integer registers */
+  2,					/* cost of reg,reg fld/fst */
+  {6, 6, 6},				/* cost of loading fp registers
+					   in SFmode, DFmode and XFmode */
+  {4, 4, 4},				/* cost of storing fp registers
+					   in SFmode, DFmode and XFmode */
+  2,					/* cost of moving MMX register */
+  {6, 6},				/* cost of loading MMX registers
+					   in SImode and DImode */
+  {4, 4},				/* cost of storing MMX registers
+					   in SImode and DImode */
+  2,					/* cost of moving SSE register */
+  {6, 6, 6},				/* cost of loading SSE registers
+					   in SImode, DImode and TImode */
+  {4, 4, 4},				/* cost of storing SSE registers
+					   in SImode, DImode and TImode */
+  2,					/* MMX or SSE register to integer */
+  32,					/* size of l1 cache.  */
+  256,					/* size of l2 cache.  */
+  128,					/* size of prefetch block */
+  8,					/* number of parallel prefetches */
+  3,					/* Branch cost */
+  COSTS_N_INSNS (3),			/* cost of FADD and FSUB insns.  */
+  COSTS_N_INSNS (5),			/* cost of FMUL instruction.  */
+  COSTS_N_INSNS (32),			/* cost of FDIV instruction.  */
+  COSTS_N_INSNS (1),			/* cost of FABS instruction.  */
+  COSTS_N_INSNS (1),			/* cost of FCHS instruction.  */
+  COSTS_N_INSNS (58),			/* cost of FSQRT instruction.  */
+  {{libcall, {{11, loop}, {-1, rep_prefix_4_byte}}},
+   {libcall, {{32, loop}, {64, rep_prefix_4_byte},
+	      {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+  {{libcall, {{8, loop}, {15, unrolled_loop},
+	      {2048, rep_prefix_4_byte}, {-1, libcall}}},
+   {libcall, {{24, loop}, {32, unrolled_loop},
+	      {8192, rep_prefix_8_byte}, {-1, libcall}}}},
+  1,					/* scalar_stmt_cost.  */
+  1,					/* scalar load_cost.  */
+  1,					/* scalar_store_cost.  */
+  1,					/* vec_stmt_cost.  */
+  1,					/* vec_to_scalar_cost.  */
+  1,					/* scalar_to_vec_cost.  */
+  1,					/* vec_align_load_cost.  */
+  2,					/* vec_unalign_load_cost.  */
+  1,					/* vec_store_cost.  */
+  3,					/* cond_taken_branch_cost.  */
+  1,					/* cond_not_taken_branch_cost.  */
+};
+
+static const
 struct processor_costs atom_cost = {
   COSTS_N_INSNS (1),			/* cost of an add instruction */
   COSTS_N_INSNS (1) + 1,		/* cost of a lea instruction */
@@ -1355,6 +1428,8 @@  const struct processor_costs *ix86_cost 
 #define m_PENT4  (1<<PROCESSOR_PENTIUM4)
 #define m_NOCONA  (1<<PROCESSOR_NOCONA)
 #define m_CORE2  (1<<PROCESSOR_CORE2)
+#define m_COREI7  (1<<PROCESSOR_COREI7)
+#define m_CORE2I7 (m_CORE2 | m_COREI7)
 #define m_ATOM  (1<<PROCESSOR_ATOM)
 
 #define m_GEODE  (1<<PROCESSOR_GEODE)
@@ -1384,18 +1459,18 @@  static unsigned int initial_ix86_tune_fe
      negatively, so enabling for Generic64 seems like good code size
      tradeoff.  We can't enable it for 32bit generic because it does not
      work well with PPro base chips.  */
-  m_386 | m_K6_GEODE | m_AMD_MULTIPLE | m_CORE2 | m_GENERIC64,
+  m_386 | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC64,
 
   /* X86_TUNE_PUSH_MEMORY */
   m_386 | m_K6_GEODE | m_AMD_MULTIPLE | m_PENT4
-  | m_NOCONA | m_CORE2 | m_GENERIC,
+  | m_NOCONA | m_CORE2I7 | m_GENERIC,
 
   /* X86_TUNE_ZERO_EXTEND_WITH_AND */
   m_486 | m_PENT,
 
   /* X86_TUNE_UNROLL_STRLEN */
   m_486 | m_PENT | m_ATOM | m_PPRO | m_AMD_MULTIPLE | m_K6
-  | m_CORE2 | m_GENERIC,
+  | m_CORE2I7 | m_GENERIC,
 
   /* X86_TUNE_DEEP_BRANCH_PREDICTION */
   m_ATOM | m_PPRO | m_K6_GEODE | m_AMD_MULTIPLE | m_PENT4 | m_GENERIC,
@@ -1411,12 +1486,12 @@  static unsigned int initial_ix86_tune_fe
 
   /* X86_TUNE_USE_SAHF */
   m_ATOM | m_PPRO | m_K6_GEODE | m_K8 | m_AMDFAM10 | m_BDVER1 | m_PENT4
-  | m_NOCONA | m_CORE2 | m_GENERIC,
+  | m_NOCONA | m_CORE2I7 | m_GENERIC,
 
   /* X86_TUNE_MOVX: Enable to zero extend integer registers to avoid
      partial dependencies.  */
   m_AMD_MULTIPLE | m_ATOM | m_PPRO | m_PENT4 | m_NOCONA
-  | m_CORE2 | m_GENERIC | m_GEODE /* m_386 | m_K6 */,
+  | m_CORE2I7 | m_GENERIC | m_GEODE /* m_386 | m_K6 */,
 
   /* X86_TUNE_PARTIAL_REG_STALL: We probably ought to watch for partial
      register stalls on Generic32 compilation setting as well.  However
@@ -1429,19 +1504,19 @@  static unsigned int initial_ix86_tune_fe
   m_PPRO,
 
   /* X86_TUNE_PARTIAL_FLAG_REG_STALL */
-  m_CORE2 | m_GENERIC,
+  m_CORE2I7 | m_GENERIC,
 
   /* X86_TUNE_USE_HIMODE_FIOP */
   m_386 | m_486 | m_K6_GEODE,
 
   /* X86_TUNE_USE_SIMODE_FIOP */
-  ~(m_PPRO | m_AMD_MULTIPLE | m_PENT | m_ATOM | m_CORE2 | m_GENERIC),
+  ~(m_PPRO | m_AMD_MULTIPLE | m_PENT | m_ATOM | m_CORE2I7 | m_GENERIC),
 
   /* X86_TUNE_USE_MOV0 */
   m_K6,
 
   /* X86_TUNE_USE_CLTD */
-  ~(m_PENT | m_ATOM | m_K6 | m_CORE2 | m_GENERIC),
+  ~(m_PENT | m_ATOM | m_K6 | m_CORE2I7 | m_GENERIC),
 
   /* X86_TUNE_USE_XCHGB: Use xchgb %rh,%rl instead of rolw/rorw $8,rx.  */
   m_PENT4,
@@ -1457,7 +1532,7 @@  static unsigned int initial_ix86_tune_fe
 
   /* X86_TUNE_PROMOTE_QIMODE */
   m_K6_GEODE | m_PENT | m_ATOM | m_386 | m_486 | m_AMD_MULTIPLE
-  | m_CORE2 | m_GENERIC /* | m_PENT4 ? */,
+  | m_CORE2I7 | m_GENERIC /* | m_PENT4 ? */,
 
   /* X86_TUNE_FAST_PREFIX */
   ~(m_PENT | m_486 | m_386),
@@ -1478,31 +1553,34 @@  static unsigned int initial_ix86_tune_fe
   0,
 
   /* X86_TUNE_PROMOTE_HI_REGS */
-  m_PPRO,
+  m_PPRO | m_CORE2I7,
+
+  /* X86_TUNE_PROMOTE_HI_CONSTANTS */
+  m_PPRO | m_CORE2I7,
 
   /* X86_TUNE_ADD_ESP_4: Enable if add/sub is preferred over 1/2 push/pop.  */
   m_ATOM | m_AMD_MULTIPLE | m_K6_GEODE | m_PENT4 | m_NOCONA
-  | m_CORE2 | m_GENERIC,
+  | m_CORE2I7 | m_GENERIC,
 
   /* X86_TUNE_ADD_ESP_8 */
   m_AMD_MULTIPLE | m_ATOM | m_PPRO | m_K6_GEODE | m_386
-  | m_486 | m_PENT4 | m_NOCONA | m_CORE2 | m_GENERIC,
+  | m_486 | m_PENT4 | m_NOCONA | m_CORE2I7 | m_GENERIC,
 
   /* X86_TUNE_SUB_ESP_4 */
-  m_AMD_MULTIPLE | m_ATOM | m_PPRO | m_PENT4 | m_NOCONA | m_CORE2
+  m_AMD_MULTIPLE | m_ATOM | m_PPRO | m_PENT4 | m_NOCONA | m_CORE2I7
   | m_GENERIC,
 
   /* X86_TUNE_SUB_ESP_8 */
   m_AMD_MULTIPLE | m_ATOM | m_PPRO | m_386 | m_486
-  | m_PENT4 | m_NOCONA | m_CORE2 | m_GENERIC,
+  | m_PENT4 | m_NOCONA | m_CORE2I7 | m_GENERIC,
 
   /* X86_TUNE_INTEGER_DFMODE_MOVES: Enable if integer moves are preferred
      for DFmode copies */
-  ~(m_AMD_MULTIPLE | m_ATOM | m_PENT4 | m_NOCONA | m_PPRO | m_CORE2
+  ~(m_AMD_MULTIPLE | m_ATOM | m_PENT4 | m_NOCONA | m_PPRO | m_CORE2I7
     | m_GENERIC | m_GEODE),
 
   /* X86_TUNE_PARTIAL_REG_DEPENDENCY */
-  m_AMD_MULTIPLE | m_ATOM | m_PENT4 | m_NOCONA | m_CORE2 | m_GENERIC,
+  m_AMD_MULTIPLE | m_ATOM | m_PENT4 | m_NOCONA | m_CORE2I7 | m_GENERIC,
 
   /* X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY: In the Generic model we have a
      conflict here in between PPro/Pentium4 based chips that thread 128bit
@@ -1513,7 +1591,7 @@  static unsigned int initial_ix86_tune_fe
      shows that disabling this option on P4 brings over 20% SPECfp regression,
      while enabling it on K8 brings roughly 2.4% regression that can be partly
      masked by careful scheduling of moves.  */
-  m_ATOM | m_PENT4 | m_NOCONA | m_PPRO | m_CORE2 | m_GENERIC
+  m_ATOM | m_PENT4 | m_NOCONA | m_PPRO | m_CORE2I7 | m_GENERIC
   | m_AMDFAM10 | m_BDVER1,
 
   /* X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL */
@@ -1538,13 +1616,13 @@  static unsigned int initial_ix86_tune_fe
   m_PPRO | m_PENT4 | m_NOCONA,
 
   /* X86_TUNE_MEMORY_MISMATCH_STALL */
-  m_AMD_MULTIPLE | m_ATOM | m_PENT4 | m_NOCONA | m_CORE2 | m_GENERIC,
+  m_AMD_MULTIPLE | m_ATOM | m_PENT4 | m_NOCONA | m_CORE2I7 | m_GENERIC,
 
   /* X86_TUNE_PROLOGUE_USING_MOVE */
-  m_ATHLON_K8 | m_ATOM | m_PPRO | m_CORE2 | m_GENERIC,
+  m_ATHLON_K8 | m_ATOM | m_PPRO | m_CORE2I7 | m_GENERIC,
 
   /* X86_TUNE_EPILOGUE_USING_MOVE */
-  m_ATHLON_K8 | m_ATOM | m_PPRO | m_CORE2 | m_GENERIC,
+  m_ATHLON_K8 | m_ATOM | m_PPRO | m_CORE2I7 | m_GENERIC,
 
   /* X86_TUNE_SHIFT1 */
   ~m_486,
@@ -1560,25 +1638,25 @@  static unsigned int initial_ix86_tune_fe
 
   /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
      than 4 branch instructions in the 16 byte window.  */
-  m_ATOM | m_PPRO | m_AMD_MULTIPLE | m_PENT4 | m_NOCONA | m_CORE2
+  m_ATOM | m_PPRO | m_AMD_MULTIPLE | m_PENT4 | m_NOCONA | m_CORE2I7
   | m_GENERIC,
 
   /* X86_TUNE_SCHEDULE */
-  m_PPRO | m_AMD_MULTIPLE | m_K6_GEODE | m_PENT | m_ATOM | m_CORE2
+  m_PPRO | m_AMD_MULTIPLE | m_K6_GEODE | m_PENT | m_ATOM | m_CORE2I7
   | m_GENERIC,
 
   /* X86_TUNE_USE_BT */
-  m_AMD_MULTIPLE | m_ATOM | m_CORE2 | m_GENERIC,
+  m_AMD_MULTIPLE | m_ATOM | m_CORE2I7 | m_GENERIC,
 
   /* X86_TUNE_USE_INCDEC */
-  ~(m_PENT4 | m_NOCONA | m_GENERIC | m_ATOM),
+  ~(m_PENT4 | m_NOCONA | m_GENERIC | m_CORE2I7 | m_ATOM),
 
   /* X86_TUNE_PAD_RETURNS */
-  m_AMD_MULTIPLE | m_CORE2 | m_GENERIC,
+  m_AMD_MULTIPLE | m_GENERIC,
 
   /* X86_TUNE_EXT_80387_CONSTANTS */
   m_K6_GEODE | m_ATHLON_K8 | m_ATOM | m_PENT4 | m_NOCONA | m_PPRO
-  | m_CORE2 | m_GENERIC,
+  | m_CORE2I7 | m_GENERIC,
 
   /* X86_TUNE_SHORTEN_X87_SSE */
   ~m_K8,
@@ -1622,7 +1700,7 @@  static unsigned int initial_ix86_tune_fe
   /* X86_TUNE_FUSE_CMP_AND_BRANCH: Fuse a compare or test instruction
      with a subsequent conditional jump instruction into a single
      compare-and-branch uop.  */
-  m_CORE2 | m_BDVER1,
+  m_CORE2I7 | m_BDVER1,
 
   /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
      will impact LEA instruction selection. */
@@ -1652,12 +1730,12 @@  static unsigned int initial_ix86_arch_fe
 };
 
 static const unsigned int x86_accumulate_outgoing_args
-  = m_AMD_MULTIPLE | m_ATOM | m_PENT4 | m_NOCONA | m_PPRO | m_CORE2
+  = m_AMD_MULTIPLE | m_ATOM | m_PENT4 | m_NOCONA | m_PPRO | m_CORE2I7
     | m_GENERIC;
 
 static const unsigned int x86_arch_always_fancy_math_387
   = m_PENT | m_ATOM | m_PPRO | m_AMD_MULTIPLE | m_PENT4
-    | m_NOCONA | m_CORE2 | m_GENERIC;
+    | m_NOCONA | m_CORE2I7 | m_GENERIC;
 
 static enum stringop_alg stringop_alg = no_stringop;
 
@@ -2173,6 +2251,7 @@  static const struct ptt processor_target
   {&k8_cost, 16, 7, 16, 7, 16},
   {&nocona_cost, 0, 0, 0, 0, 0},
   {&core2_cost, 16, 10, 16, 10, 16},
+  {&corei7_cost, 16, 10, 16, 10, 16},
   {&generic32_cost, 16, 7, 16, 7, 16},
   {&generic64_cost, 16, 10, 16, 10, 16},
   {&amdfam10_cost, 32, 24, 32, 7, 32},
@@ -2195,6 +2274,7 @@  static const char *const cpu_names[TARGE
   "prescott",
   "nocona",
   "core2",
+  "corei7",
   "atom",
   "geode",
   "k6",
@@ -2889,6 +2969,9 @@  override_options (bool main_args_p)
       {"core2", PROCESSOR_CORE2, CPU_CORE2,
 	PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
 	| PTA_SSSE3 | PTA_CX16},
+      {"corei7", PROCESSOR_COREI7, CPU_COREI7,
+	PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
+	| PTA_SSSE3 | PTA_SSE4_1 | PTA_SSE4_2 | PTA_CX16},
       {"atom", PROCESSOR_ATOM, CPU_ATOM,
 	PTA_64BIT | PTA_MMX | PTA_SSE | PTA_SSE2 | PTA_SSE3
 	| PTA_SSSE3 | PTA_CX16 | PTA_MOVBE},
@@ -14291,6 +14374,12 @@  ix86_fixup_binary_operands (enum rtx_cod
   if (MEM_P (src1) && !rtx_equal_p (dst, src1))
     src1 = force_reg (mode, src1);
 
+  if (TARGET_PROMOTE_HI_CONSTANTS && mode == HImode && CONST_INT_P (src2)
+      && (INTVAL (src2) < -128 || INTVAL (src2) > 127)
+      && (code != AND
+	  || (INTVAL (src2) != 255 && INTVAL (src2) != -65281)))
+    src2 = gen_lowpart (HImode, force_reg (SImode, src2));
+
   operands[1] = src1;
   operands[2] = src2;
   return dst;
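
Not part of the patch: a standalone C sketch of the constant test added to ix86_fixup_binary_operands above (the same condition appears again in ix86_binary_operator_ok). It only mirrors the range and AND checks; the function name and the plain integer arguments are stand-ins for the RTL operands, and the TARGET_PROMOTE_HI_CONSTANTS and HImode checks are assumed to have passed already.

```c
/* Return nonzero if an HImode binary operation with integer constant
   VALUE would have that constant moved into a register.  IS_AND is
   nonzero when the operation is AND.  Hypothetical helper, not a GCC
   function.  */
int
promote_hi_constant_p (int is_and, long value)
{
  /* Constants in the signed 8-bit range are left alone.  */
  if (value >= -128 && value <= 127)
    return 0;

  /* The patch exempts two AND masks; the values are copied verbatim.  */
  if (is_and && (value == 255 || value == -65281))
    return 0;

  return 1;
}
```

For example, promote_hi_constant_p (0, 1000) is nonzero, while promote_hi_constant_p (1, 255) and promote_hi_constant_p (0, 100) are zero.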
@@ -14377,6 +14466,12 @@  ix86_binary_operator_ok (enum rtx_code c
   if (MEM_P (src1) && !rtx_equal_p (dst, src1))
     return 0;
 
+  if (TARGET_PROMOTE_HI_CONSTANTS && mode == HImode && CONST_INT_P (src2)
+      && (INTVAL (src2) < -128 || INTVAL (src2) > 127)
+      && (code != AND
+	  || (INTVAL (src2) != 255 && INTVAL (src2) != -65281)))
+    return 0;
+
   return 1;
 }
 
@@ -20495,6 +20590,7 @@  ix86_issue_rate (void)
       return 3;
 
     case PROCESSOR_CORE2:
+    case PROCESSOR_COREI7:
       return 4;
 
     default:
@@ -20569,6 +20665,7 @@  ix86_adjust_cost (rtx insn, rtx link, rt
 {
   enum attr_type insn_type, dep_insn_type;
   enum attr_memory memory;
+  enum attr_i7_domain domain1, domain2;
   rtx set, set2;
   int dep_insn_code_number;
 
@@ -20711,6 +20808,19 @@  ix86_adjust_cost (rtx insn, rtx link, rt
 	  else
 	    cost = 0;
 	}
+      break;
+
+    case PROCESSOR_COREI7:
+      memory = get_attr_memory (insn);
+
+      domain1 = get_attr_i7_domain (insn);
+      domain2 = get_attr_i7_domain (dep_insn);
+      if (domain1 != domain2
+	  && !ix86_agi_dependent (dep_insn, insn))
+	cost += ((domain1 == I7_DOMAIN_SIMD && domain2 == I7_DOMAIN_INT)
+		 || (domain1 == I7_DOMAIN_INT && domain2 == I7_DOMAIN_SIMD)
+		 ? 1 : 2);
+      break;
 
     default:
       break;
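
Not part of the patch: a standalone C sketch of the domain-crossing adjustment in the PROCESSOR_COREI7 case above. The enum stands in for the generated i7_domain attribute; I7_DOMAIN_SIMD and I7_DOMAIN_INT appear in the patch, while I7_DOMAIN_FLOAT is an assumed name for the remaining value. The function name and the agi_dependent flag are hypothetical stand-ins for the RTL-based checks.

```c
/* Stand-in for the i7_domain insn attribute.  I7_DOMAIN_FLOAT is an
   assumed name; only the SIMD and INT values are visible in the hunk.  */
enum i7_domain
{
  I7_DOMAIN_SIMD,
  I7_DOMAIN_FLOAT,
  I7_DOMAIN_INT
};

/* Extra cycles added to the cost of a dependency whose consumer is in
   domain CONS and whose producer is in domain PROD.  AGI_DEPENDENT is
   nonzero when the consumer needs the result for address generation,
   in which case the cost is left unchanged.  */
int
domain_crossing_penalty (enum i7_domain cons, enum i7_domain prod,
			 int agi_dependent)
{
  if (cons == prod || agi_dependent)
    return 0;

  /* Crossing between the SIMD and integer domains costs one cycle.  */
  if ((cons == I7_DOMAIN_SIMD && prod == I7_DOMAIN_INT)
      || (cons == I7_DOMAIN_INT && prod == I7_DOMAIN_SIMD))
    return 1;

  /* Any other crossing costs two cycles.  */
  return 2;
}
```

In the hunk, domain1 is taken from insn (the consumer) and domain2 from dep_insn (the producer); the sketch keeps that order.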