Add X86_TUNE_AVOID_LEA_FOR_ADDR

Submitter H.J. Lu
Date Jan. 17, 2014, 2:19 p.m.
Message ID <20140117141937.GA1174@intel.com>
Permalink /patch/312084/
State New

Comments

H.J. Lu - Jan. 17, 2014, 2:19 p.m.
ix86_split_lea_for_addr transforms a single LEA instruction into a series
of MOV and ADD instructions.  For

lea 0x400(%eax, %ecx, 8), %edx

we get

mov %eax, %edx
add %ecx, %edx
add %ecx, %edx
add %ecx, %edx
add %ecx, %edx
add %ecx, %edx
add %ecx, %edx
add %ecx, %edx
add %ecx, %edx
add $0x400, %edx

For -mtune=intel, we want to turn on X86_TUNE_OPT_AGU, but avoid
ix86_split_lea_for_addr.  This patch adds X86_TUNE_AVOID_LEA_FOR_ADDR
and PROCESSOR_INTEL.  We keep PROCESSOR_INTEL the same as
PROCESSOR_SILVERMONT, except that X86_TUNE_AVOID_LEA_FOR_ADDR isn't
turned on for PROCESSOR_INTEL.  OK for trunk?

Thanks.


H.J.
---
 gcc/config/i386/i386-c.c     |  2 +
 gcc/config/i386/i386.c       | 93 +++++++++++++++++++++++++++++++++++++++++---
 gcc/config/i386/i386.h       |  4 ++
 gcc/config/i386/x86-tune.def | 68 +++++++++++++++++++-------------
 5 files changed, 162 insertions(+), 32 deletions(-)
 create mode 100644 ChangeLog.intel

gcc/

2014-01-17  H.J. Lu  <hongjiu.lu@intel.com>

	* config/i386/i386-c.c (ix86_target_macros_internal): Handle
	PROCESSOR_INTEL.  Treat like PROCESSOR_GENERIC.
	* config/i386/i386.c (intel_memcpy): New.  Duplicate slm_memcpy.
	(intel_memset): New.  Duplicate slm_memset.
	(intel_cost): New.  Duplicate slm_cost.
	(m_INTEL): New macro.
	(processor_target_table): Add "intel".
	(ix86_option_override_internal): Replace PROCESSOR_SILVERMONT
	with PROCESSOR_INTEL for "intel".
	(ix86_lea_outperforms): Support PROCESSOR_INTEL.  Duplicate
	PROCESSOR_SILVERMONT.
	(ix86_avoid_lea_for_addr): Check TARGET_AVOID_LEA_FOR_ADDR
	instead of TARGET_OPT_AGU.
	(ix86_issue_rate): Likewise.
	(ix86_adjust_cost): Likewise.
	(ia32_multipass_dfa_lookahead): Likewise.
	(swap_top_of_ready_list): Likewise.
	(ix86_sched_reorder): Likewise.
	* config/i386/i386.h (TARGET_INTEL): New.
	(TARGET_AVOID_LEA_FOR_ADDR): Likewise.
	(processor_type): Add PROCESSOR_INTEL.
	* config/i386/x86-tune.def: Support m_INTEL. Duplicate
	m_SILVERMONT.  Add X86_TUNE_AVOID_LEA_FOR_ADDR.
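
The split between the two tuning flags can be pictured with a small stand-alone sketch. This is not GCC code: the enum, the M() macro and the helper tune_enabled are simplified stand-ins invented for illustration. Only the contents of the two masks follow what the patch proposes for "opt_agu" and "avoid_lea_for_addr".

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for GCC's processor_type enum and m_* masks.  */
enum processor { PROC_BONNELL, PROC_SILVERMONT, PROC_INTEL, PROC_HASWELL };

#define M(p) (1u << (p))

/* Mask contents as proposed in the patch.  */
#define OPT_AGU_MASK \
  (M (PROC_BONNELL) | M (PROC_SILVERMONT) | M (PROC_INTEL))
#define AVOID_LEA_FOR_ADDR_MASK \
  (M (PROC_BONNELL) | M (PROC_SILVERMONT))

/* Hypothetical helper: is a tuning flag on for the selected processor?  */
bool
tune_enabled (unsigned mask, enum processor tune)
{
  return (mask & M (tune)) != 0;
}

/* With the patch, "intel" keeps opt_agu but no longer splits LEAs.  */
void
check_patch_masks (void)
{
  assert (tune_enabled (OPT_AGU_MASK, PROC_INTEL));
  assert (!tune_enabled (AVOID_LEA_FOR_ADDR_MASK, PROC_INTEL));
  assert (tune_enabled (AVOID_LEA_FOR_ADDR_MASK, PROC_SILVERMONT));
}
```

Calling check_patch_masks () from a test driver exercises the three cases; in GCC itself the equivalent lookup goes through ix86_tune_features[].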
Uros Bizjak - Jan. 17, 2014, 2:23 p.m.
On Fri, Jan 17, 2014 at 3:19 PM, H.J. Lu <hongjiu.lu@intel.com> wrote:
> ix86_split_lea_for_addr transforms a single LEA instruction into a series
> of MOV and ADD instructions.  For
>
> lea 0x400(%eax, %ecx, 8), %edx
>
> we get
>
> mov %eax, %edx
> add %ecx, %edx
> add %ecx, %edx
> add %ecx, %edx
> add %ecx, %edx
> add %ecx, %edx
> add %ecx, %edx
> add %ecx, %edx
> add %ecx, %edx
> add $0x400, %edx
>
> For -mtune=intel, we want to turn on X86_TUNE_OPT_AGU, but avoid
> ix86_split_lea_for_addr.  This patch adds X86_TUNE_AVOID_LEA_FOR_ADDR
> and PROCESSOR_INTEL.  We keep PROCESSOR_INTEL the same as
> PROCESSOR_SILVERMONT, except that X86_TUNE_AVOID_LEA_FOR_ADDR isn't
> turned on for PROCESSOR_INTEL.  OK for trunk?

As said earlier, m_INTEL is not a processor, but equals a REAL
processor, so the patch is not acceptable.

Uros.
Jakub Jelinek - Jan. 17, 2014, 2:30 p.m.
On Fri, Jan 17, 2014 at 06:19:37AM -0800, H.J. Lu wrote:
> ix86_split_lea_for_addr transforms a single LEA instruction into a series
> of MOV and ADD instructions.  For
> 
> lea 0x400(%eax, %ecx, 8), %edx
> 
> we get
> 
> mov %eax, %edx
> add %ecx, %edx
> add %ecx, %edx
> add %ecx, %edx
> add %ecx, %edx
> add %ecx, %edx
> add %ecx, %edx
> add %ecx, %edx
> add %ecx, %edx
> add $0x400, %edx

Ugh, is that really what you want for Silvermont, as opposed to (at least
if the output operand isn't equal to the base):
mov %ecx, %edx	! if base is equal to index this would go away
add %ecx, %edx
add %edx, %edx
add %edx, %edx
add %eax, %edx
add $0x400, %edx
?

	Jakub
H.J. Lu - Jan. 17, 2014, 2:46 p.m.
On Fri, Jan 17, 2014 at 6:23 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Fri, Jan 17, 2014 at 3:19 PM, H.J. Lu <hongjiu.lu@intel.com> wrote:
>> ix86_split_lea_for_addr transforms a single LEA instruction into a series
>> of MOV and ADD instructions.  For
>>
>> lea 0x400(%eax, %ecx, 8), %edx
>>
>> we get
>>
>> mov %eax, %edx
>> add %ecx, %edx
>> add %ecx, %edx
>> add %ecx, %edx
>> add %ecx, %edx
>> add %ecx, %edx
>> add %ecx, %edx
>> add %ecx, %edx
>> add %ecx, %edx
>> add $0x400, %edx
>>
>> For -mtune=intel, we want to turn on X86_TUNE_OPT_AGU, but avoid
>> ix86_split_lea_for_addr.  This patch adds X86_TUNE_AVOID_LEA_FOR_ADDR
>> and PROCESSOR_INTEL.  We keep PROCESSOR_INTEL the same as
>> PROCESSOR_SILVERMONT, except that X86_TUNE_AVOID_LEA_FOR_ADDR isn't
>> turned on for PROCESSOR_INTEL.  OK for trunk?
>
> As said earlier, m_INTEL is not a processor, but equals a REAL
> processor, so the patch is not acceptable.
>

-mtune=intel, similar to -mtune=generic,  isn't equal to a single processor.
From invoke.texi:

---
@item intel
Produce code optimized for the most current Intel processors, which are
Haswell and Silvermont for this version of GCC.
---

We don't want -mtune=intel to define __tune_silvermont__ and we
want to generate balanced code for Haswell and Silvermont.
-mtune=intel started as -mtune=silvermont.  I am working on incremental
changes like this to better tune for Haswell without significantly impacting
Silvermont.
H.J. Lu - Jan. 17, 2014, 2:50 p.m.
On Fri, Jan 17, 2014 at 6:30 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Fri, Jan 17, 2014 at 06:19:37AM -0800, H.J. Lu wrote:
>> ix86_split_lea_for_addr transforms a single LEA instruction into a series
>> of MOV and ADD instructions.  For
>>
>> lea 0x400(%eax, %ecx, 8), %edx
>>
>> we get
>>
>> mov %eax, %edx
>> add %ecx, %edx
>> add %ecx, %edx
>> add %ecx, %edx
>> add %ecx, %edx
>> add %ecx, %edx
>> add %ecx, %edx
>> add %ecx, %edx
>> add %ecx, %edx
>> add $0x400, %edx
>
> Ugh, is that really what you want for Silvermont, as opposed to (at least
> if the output operand isn't equal to the base):
> mov %ecx, %edx  ! if base is equal to index this would go away
> add %ecx, %edx
> add %edx, %edx
> add %edx, %edx
> add %eax, %edx
> add $0x400, %edx
> ?

Wrong example.  It should be

lea 0x400(%edx, %ecx, 8), %edx

we get

add %ecx, %edx
add %ecx, %edx
add %ecx, %edx
add %ecx, %edx
add %ecx, %edx
add %ecx, %edx
add %ecx, %edx
add %ecx, %edx
add $0x400, %edx

For

lea 0x400(%eax, %ecx, 8), %edx

we get

mov %ecx, %edx
shl $3, %edx
add %eax, %edx
add $0x400, %edx
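
Both split forms have to produce the same value as the single LEA. A rough C model of the two sequences above, with registers modelled as uint32_t variables, might look like the following. The function names are invented for illustration and nothing here is taken from i386.c.

```c
#include <assert.h>
#include <stdint.h>

/* What "lea 0x400(base, index, 8), dest" computes (32-bit wraparound).  */
uint32_t
lea_value (uint32_t base, uint32_t index)
{
  return base + index * 8u + 0x400u;
}

/* Destination is also the base: eight ADDs of the index, then the
   displacement.  edx starts out holding the base.  */
uint32_t
split_dest_is_base (uint32_t edx, uint32_t ecx)
{
  for (int i = 0; i < 8; i++)
    edx += ecx;                 /* add %ecx, %edx */
  edx += 0x400u;                /* add $0x400, %edx */
  return edx;
}

/* Destination differs from the base: MOV, SHL, ADD, ADD.  */
uint32_t
split_dest_differs (uint32_t eax, uint32_t ecx)
{
  uint32_t edx = ecx;           /* mov %ecx, %edx */
  edx <<= 3;                    /* shl $3, %edx */
  edx += eax;                   /* add %eax, %edx */
  edx += 0x400u;                /* add $0x400, %edx */
  return edx;
}

/* Both split forms must agree with the single LEA.  */
void
check_split (uint32_t base, uint32_t index)
{
  assert (split_dest_is_base (base, index) == lea_value (base, index));
  assert (split_dest_differs (base, index) == lea_value (base, index));
}
```

The model only checks the arithmetic; it says nothing about which form is faster on Silvermont or Haswell.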
Uros Bizjak - Jan. 17, 2014, 3:11 p.m.
On Fri, Jan 17, 2014 at 3:46 PM, H.J. Lu <hjl.tools@gmail.com> wrote:

>>> ix86_split_lea_for_addr transforms a single LEA instruction into a series
>>> of MOV and ADD instructions.  For
>>>
>>> lea 0x400(%eax, %ecx, 8), %edx
>>>
>>> we get
>>>
>>> mov %eax, %edx
>>> add %ecx, %edx
>>> add %ecx, %edx
>>> add %ecx, %edx
>>> add %ecx, %edx
>>> add %ecx, %edx
>>> add %ecx, %edx
>>> add %ecx, %edx
>>> add %ecx, %edx
>>> add $0x400, %edx
>>>
>>> For -mtune=intel, we want to turn on X86_TUNE_OPT_AGU, but avoid
>>> ix86_split_lea_for_addr.  This patch adds X86_TUNE_AVOID_LEA_FOR_ADDR
>>> and PROCESSOR_INTEL.  We keep PROCESSOR_INTEL the same as
>>> PROCESSOR_SILVERMONT, except that X86_TUNE_AVOID_LEA_FOR_ADDR isn't
>>> turned on for PROCESSOR_INTEL.  OK for trunk?
>>
>> As said earlier, m_INTEL is not a processor, but equals a REAL
>> processor, so the patch is not acceptable.
>>
>
> -mtune=intel, similar to -mtune=generic,  isn't equal to a single processor.
> From invoke.texi:
>
> ---
> @item intel
> Produce code optimized for the most current Intel processors, which are
> Haswell and Silvermont for this version of GCC.
> ---
>
> We don't want -mtune=intel to define __tune_silvermont__ and we
> want to generate balanced codes for Haswell and Silvermont.
> -mtune=intel started as -mtune=silvermont.  I am working on incremental
> changes like this to better tune for Haswell without significantly impacting
> Silvermont.

OK, this clarifies the situation.

So, -mtune=generic is too broad, and -mtune=intel is needed, as a
generic tuning for latest Intel processors (note the plural). We want
tuning options that cover Haswell and Silvermont for this version, but
not something that degrades runtime too much (or unnecessarily
increases code size too much).

If this is the case, I agree with the approach.

BTW: There are some ix86_tune == XXX conditions scattered throughout
LEA handling code. Can these be substituted with appropriate TARGET_*
defines?

Uros.
H.J. Lu - Jan. 17, 2014, 3:17 p.m.
On Fri, Jan 17, 2014 at 7:11 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Fri, Jan 17, 2014 at 3:46 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>
>>>> ix86_split_lea_for_addr transforms a single LEA instruction into a series
>>>> of MOV and ADD instructions.  For
>>>>
>>>> lea 0x400(%eax, %ecx, 8), %edx
>>>>
>>>> we get
>>>>
>>>> mov %eax, %edx
>>>> add %ecx, %edx
>>>> add %ecx, %edx
>>>> add %ecx, %edx
>>>> add %ecx, %edx
>>>> add %ecx, %edx
>>>> add %ecx, %edx
>>>> add %ecx, %edx
>>>> add %ecx, %edx
>>>> add $0x400, %edx
>>>>
>>>> For -mtune=intel, we want to turn on X86_TUNE_OPT_AGU, but avoid
>>>> ix86_split_lea_for_addr.  This patch adds X86_TUNE_AVOID_LEA_FOR_ADDR
>>>> and PROCESSOR_INTEL.  We keep PROCESSOR_INTEL the same as
>>>> PROCESSOR_SILVERMONT, except that X86_TUNE_AVOID_LEA_FOR_ADDR isn't
>>>> turned on for PROCESSOR_INTEL.  OK for trunk?
>>>
>>> As said earlier, m_INTEL is not a processor, but equals a REAL
>>> processor, so the patch is not acceptable.
>>>
>>
>> -mtune=intel, similar to -mtune=generic,  isn't equal to a single processor.
>> From invoke.texi:
>>
>> ---
>> @item intel
>> Produce code optimized for the most current Intel processors, which are
>> Haswell and Silvermont for this version of GCC.
>> ---
>>
>> We don't want -mtune=intel to define __tune_silvermont__ and we
>> want to generate balanced codes for Haswell and Silvermont.
>> -mtune=intel started as -mtune=silvermont.  I am working on incremental
>> changes like this to better tune for Haswell without significantly impacting
>> Silvermont.
>
> OK, this clarifies the situation.
>
> So, -mtune=generic is too broad, and -mtune=intel is needed, as a
> generic tuning for latest Intel processors (note the plural). We want
> tuning options that cover Haswell and Silvermont for this version, but
> not something that degrades runtime too much (or unnecessarily
> increases code size too much).

Yes, that is correct.

> If this is the case, I agree with the approach.

I will check it in.

> BTW: There are some ix86_tune == XXX conditions scattered throughout
> LEA handling code. Can these be substituted with appropriate TARGET_*
> defines?

I have been looking at them closely to check their impacts on
both Haswell and Silvermont.  I am planning to keep
the simple LEA -> ADD transformation, but avoid
the complex LEA -> ADD/MOV/SHL transformation.

Thanks.
Uros Bizjak - Jan. 17, 2014, 3:24 p.m.
On Fri, Jan 17, 2014 at 3:50 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>
> Wrong example.  It should be
>
> lea 0x400(%edx, %ecx, 8), %edx
>
> we get
>
> add %ecx, %edx
> add %ecx, %edx
> add %ecx, %edx
> add %ecx, %edx
> add %ecx, %edx
> add %ecx, %edx
> add %ecx, %edx
> add %ecx, %edx
> add $0x400, %edx

Even for this example, the code can be substantially improved:

shl $3, %ecx
add %ecx, %edx
add $0x400, %edx

Uros.
Jakub Jelinek - Jan. 17, 2014, 3:26 p.m.
On Fri, Jan 17, 2014 at 04:24:50PM +0100, Uros Bizjak wrote:
> On Fri, Jan 17, 2014 at 3:50 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
> >
> > Wrong example.  It should be
> >
> > lea 0x400(%edx, %ecx, 8), %edx
> >
> > we get
> >
> > add %ecx, %edx
> > add %ecx, %edx
> > add %ecx, %edx
> > add %ecx, %edx
> > add %ecx, %edx
> > add %ecx, %edx
> > add %ecx, %edx
> > add %ecx, %edx
> > add $0x400, %edx
> 
> Even for this example, the code can be substantially improved:
> 
> shl $3, %ecx
> add %ecx, %edx
> add $0x400, %edx

Only if ecx is dead after the statement.

	Jakub
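
The liveness concern can be made concrete with a small C model. It is not GCC code, and the struct and function names are invented. The shift-based sequence yields the right value in %edx but leaves a changed %ecx behind, so it is only a valid replacement when nothing reads %ecx afterwards.

```c
#include <assert.h>
#include <stdint.h>

/* Toy register file holding only the two registers involved.  */
struct regs { uint32_t ecx, edx; };

/* lea 0x400(%edx, %ecx, 8), %edx: writes %edx only.  */
struct regs
run_lea (struct regs r)
{
  r.edx = r.edx + r.ecx * 8u + 0x400u;
  return r;
}

/* shl $3, %ecx; add %ecx, %edx; add $0x400, %edx.  */
struct regs
run_shl_add (struct regs r)
{
  r.ecx <<= 3;
  r.edx += r.ecx;
  r.edx += 0x400u;
  return r;
}

/* Same %edx in both cases, but only the LEA preserves %ecx.  */
void
check_clobber (struct regs in)
{
  struct regs a = run_lea (in);
  struct regs b = run_shl_add (in);
  assert (a.edx == b.edx);
  assert (a.ecx == in.ecx);
  assert (b.ecx == (uint32_t) (in.ecx << 3));
}
```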
Uros Bizjak - Jan. 17, 2014, 3:33 p.m.
On Fri, Jan 17, 2014 at 4:26 PM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Fri, Jan 17, 2014 at 04:24:50PM +0100, Uros Bizjak wrote:
>> On Fri, Jan 17, 2014 at 3:50 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>> >
>> > Wrong example.  It should be
>> >
>> > lea 0x400(%edx, %ecx, 8), %edx
>> >
>> > we get
>> >
>> > add %ecx, %edx
>> > add %ecx, %edx
>> > add %ecx, %edx
>> > add %ecx, %edx
>> > add %ecx, %edx
>> > add %ecx, %edx
>> > add %ecx, %edx
>> > add %ecx, %edx
>> > add $0x400, %edx
>>
>> Even for this example, the code can be substantially improved:
>>
>> shl $3, %ecx
>> add %ecx, %edx
>> add $0x400, %edx
>
> Only if ecx is dead after the statement.

True. Do we have this information at the point where the transformation is performed?

Uros.
Uros Bizjak - Jan. 17, 2014, 3:36 p.m.
On Fri, Jan 17, 2014 at 4:17 PM, H.J. Lu <hjl.tools@gmail.com> wrote:

>> BTW: There are some ix86_tune == XXX conditions scattered throughout
>> LEA handling code. Can these be substituted with appropriate TARGET_*
>> defines?
>
> I have been looking at them closely to check their impacts on
> both Haswell and Silvermont.  I am planning to keep
> the simple LEA -> ADD transformation, but avoid
> the complex LEA -> ADD/MOV/SHL transformation.

No, I wasn't talking about a functional change, but about an equivalent
TARGET_* define that can be used instead of "(ix86_tune ==
PROCESSOR_SILVERMONT) || (ix86_tune == PROCESSOR_INTEL)".

Uros.
H.J. Lu - Jan. 17, 2014, 3:55 p.m.
On Fri, Jan 17, 2014 at 7:36 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Fri, Jan 17, 2014 at 4:17 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>
>>> BTW: There are some ix86_tune == XXX conditions scattered throughout
>>> LEA handling code. Can these be substituted with appropriate TARGET_*
>>> defines?
>>
>> I have been looking at them closely to check their impacts on
>> both Haswell and Silvermont.  I am planning to keep
>> the simple LEA -> ADD transformation, but avoid
>> the complex LEA -> ADD/MOV/SHL transformation.
>
> No, I didn't talk about functional change, but about equivalent
> TARGET_* define that can be used instead of "(ix86_tune ==
> PROCESSOR_SILVERMONT) || (ix86_tune == PROCESSOR_INTEL)".
>
> Uros.

Something like

#define TARGET_INTEL_SILVERMONT \
  (ix86_tune == PROCESSOR_SILVERMONT || ix86_tune == PROCESSOR_INTEL)
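
A sketch of how such a macro might replace the open-coded comparisons follows. The enum and the ix86_tune variable are cut-down stand-ins so the fragment compiles on its own, and reorder_applies is a hypothetical function that mirrors the early-exit test in swap_top_of_ready_list.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the GCC declarations, reduced to what the macro needs.  */
enum processor_type
{
  PROCESSOR_HASWELL,
  PROCESSOR_BONNELL,
  PROCESSOR_SILVERMONT,
  PROCESSOR_INTEL
};

enum processor_type ix86_tune = PROCESSOR_INTEL;

#define TARGET_INTEL_SILVERMONT \
  (ix86_tune == PROCESSOR_SILVERMONT || ix86_tune == PROCESSOR_INTEL)

/* Hypothetical caller.  The open-coded form of the negated test was
   "ix86_tune != PROCESSOR_SILVERMONT && ix86_tune != PROCESSOR_INTEL".  */
bool
reorder_applies (void)
{
  return TARGET_INTEL_SILVERMONT;
}

/* The macro is true for both tunings and false for anything else.  */
void
check_macro (void)
{
  ix86_tune = PROCESSOR_SILVERMONT;
  assert (reorder_applies ());
  ix86_tune = PROCESSOR_INTEL;
  assert (reorder_applies ());
  ix86_tune = PROCESSOR_HASWELL;
  assert (!reorder_applies ());
}
```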
H.J. Lu - Jan. 17, 2014, 4:55 p.m.
On Fri, Jan 17, 2014 at 7:55 AM, H.J. Lu <hjl.tools@gmail.com> wrote:
> On Fri, Jan 17, 2014 at 7:36 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
>> On Fri, Jan 17, 2014 at 4:17 PM, H.J. Lu <hjl.tools@gmail.com> wrote:
>>
>>>> BTW: There are some ix86_tune == XXX conditions scattered throughout
>>>> LEA handling code. Can these be substituted with appropriate TARGET_*
>>>> defines?
>>>
>>> I have been looking at them closely to check their impacts on
>>> both Haswell and Silvermont.  I am planning to keep
>>> the simple LEA -> ADD transformation, but avoid
>>> the complex LEA -> ADD/MOV/SHL transformation.
>>
>> No, I didn't talk about functional change, but about equivalent
>> TARGET_* define that can be used instead of "(ix86_tune ==
>> PROCESSOR_SILVERMONT) || (ix86_tune == PROCESSOR_INTEL)".
>>
>> Uros.
>
> Something like
>
> #define TARGET_INTEL_SILVERMONT \
>   (ix86_tune == PROCESSOR_SILVERMONT || ix86_tune == PROCESSOR_INTEL)
>
>

I see what you meant.  I will submit a patch.

Patch

diff --git a/gcc/config/i386/i386-c.c b/gcc/config/i386/i386-c.c
index 9686382..ce9ba95 100644
--- a/gcc/config/i386/i386-c.c
+++ b/gcc/config/i386/i386-c.c
@@ -174,6 +174,7 @@  ix86_target_macros_internal (HOST_WIDE_INT isa_flag,
     /* use PROCESSOR_max to not set/unset the arch macro.  */
     case PROCESSOR_max:
       break;
+    case PROCESSOR_INTEL:
     case PROCESSOR_GENERIC:
       gcc_unreachable ();
     }
@@ -276,6 +277,7 @@  ix86_target_macros_internal (HOST_WIDE_INT isa_flag,
       def_or_undef (parse_in, "__tune_slm__");
       def_or_undef (parse_in, "__tune_silvermont__");
       break;
+    case PROCESSOR_INTEL:
     case PROCESSOR_GENERIC:
       break;
     /* use PROCESSOR_max to not set/unset the tune macro.  */
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index df408ae..82753fd 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -1747,6 +1747,83 @@  struct processor_costs slm_cost = {
   1,					/* cond_not_taken_branch_cost.  */
 };
 
+static stringop_algs intel_memcpy[2] = {
+  {libcall, {{11, loop, false}, {-1, rep_prefix_4_byte, false}}},
+  {libcall, {{32, loop, false}, {64, rep_prefix_4_byte, false},
+             {8192, rep_prefix_8_byte, false}, {-1, libcall, false}}}};
+static stringop_algs intel_memset[2] = {
+  {libcall, {{8, loop, false}, {15, unrolled_loop, false},
+             {2048, rep_prefix_4_byte, false}, {-1, libcall, false}}},
+  {libcall, {{24, loop, false}, {32, unrolled_loop, false},
+             {8192, rep_prefix_8_byte, false}, {-1, libcall, false}}}};
+static const
+struct processor_costs intel_cost = {
+  COSTS_N_INSNS (1),			/* cost of an add instruction */
+  COSTS_N_INSNS (1) + 1,		/* cost of a lea instruction */
+  COSTS_N_INSNS (1),			/* variable shift costs */
+  COSTS_N_INSNS (1),			/* constant shift costs */
+  {COSTS_N_INSNS (3),			/* cost of starting multiply for QI */
+   COSTS_N_INSNS (3),			/*				 HI */
+   COSTS_N_INSNS (3),			/*				 SI */
+   COSTS_N_INSNS (4),			/*				 DI */
+   COSTS_N_INSNS (2)},			/*			      other */
+  0,					/* cost of multiply per each bit set */
+  {COSTS_N_INSNS (18),			/* cost of a divide/mod for QI */
+   COSTS_N_INSNS (26),			/*			    HI */
+   COSTS_N_INSNS (42),			/*			    SI */
+   COSTS_N_INSNS (74),			/*			    DI */
+   COSTS_N_INSNS (74)},			/*			    other */
+  COSTS_N_INSNS (1),			/* cost of movsx */
+  COSTS_N_INSNS (1),			/* cost of movzx */
+  8,					/* "large" insn */
+  17,					/* MOVE_RATIO */
+  4,					/* cost for loading QImode using movzbl */
+  {4, 4, 4},				/* cost of loading integer registers
+					   in QImode, HImode and SImode.
+					   Relative to reg-reg move (2).  */
+  {4, 4, 4},				/* cost of storing integer registers */
+  4,					/* cost of reg,reg fld/fst */
+  {12, 12, 12},				/* cost of loading fp registers
+					   in SFmode, DFmode and XFmode */
+  {6, 6, 8},				/* cost of storing fp registers
+					   in SFmode, DFmode and XFmode */
+  2,					/* cost of moving MMX register */
+  {8, 8},				/* cost of loading MMX registers
+					   in SImode and DImode */
+  {8, 8},				/* cost of storing MMX registers
+					   in SImode and DImode */
+  2,					/* cost of moving SSE register */
+  {8, 8, 8},				/* cost of loading SSE registers
+					   in SImode, DImode and TImode */
+  {8, 8, 8},				/* cost of storing SSE registers
+					   in SImode, DImode and TImode */
+  5,					/* MMX or SSE register to integer */
+  32,					/* size of l1 cache.  */
+  256,					/* size of l2 cache.  */
+  64,					/* size of prefetch block */
+  6,					/* number of parallel prefetches */
+  3,					/* Branch cost */
+  COSTS_N_INSNS (8),			/* cost of FADD and FSUB insns.  */
+  COSTS_N_INSNS (8),			/* cost of FMUL instruction.  */
+  COSTS_N_INSNS (20),			/* cost of FDIV instruction.  */
+  COSTS_N_INSNS (8),			/* cost of FABS instruction.  */
+  COSTS_N_INSNS (8),			/* cost of FCHS instruction.  */
+  COSTS_N_INSNS (40),			/* cost of FSQRT instruction.  */
+  intel_memcpy,
+  intel_memset,
+  1,					/* scalar_stmt_cost.  */
+  1,					/* scalar load_cost.  */
+  1,					/* scalar_store_cost.  */
+  1,					/* vec_stmt_cost.  */
+  1,					/* vec_to_scalar_cost.  */
+  1,					/* scalar_to_vec_cost.  */
+  1,					/* vec_align_load_cost.  */
+  2,					/* vec_unalign_load_cost.  */
+  1,					/* vec_store_cost.  */
+  3,					/* cond_taken_branch_cost.  */
+  1,					/* cond_not_taken_branch_cost.  */
+};
+
 /* Generic should produce code tuned for Core-i7 (and newer chips)
    and btver1 (and newer chips).  */
 
@@ -1942,6 +2019,7 @@  const struct processor_costs *ix86_cost = &pentium_cost;
 #define m_CORE_ALL (m_CORE2 | m_NEHALEM  | m_SANDYBRIDGE | m_HASWELL)
 #define m_BONNELL (1<<PROCESSOR_BONNELL)
 #define m_SILVERMONT (1<<PROCESSOR_SILVERMONT)
+#define m_INTEL (1<<PROCESSOR_INTEL)
 
 #define m_GEODE (1<<PROCESSOR_GEODE)
 #define m_K6 (1<<PROCESSOR_K6)
@@ -2401,6 +2479,7 @@  static const struct ptt processor_target_table[PROCESSOR_max] =
   {"haswell", &core_cost, 16, 10, 16, 10, 16},
   {"bonnell", &atom_cost, 16, 15, 16, 7, 16},
   {"silvermont", &slm_cost, 16, 15, 16, 7, 16},
+  {"intel", &intel_cost, 16, 15, 16, 7, 16},
   {"geode", &geode_cost, 0, 0, 0, 0, 0},
   {"k6", &k6_cost, 32, 7, 32, 7, 32},
   {"athlon", &athlon_cost, 16, 7, 16, 7, 16},
@@ -3112,7 +3191,7 @@  ix86_option_override_internal (bool main_args_p,
       {"atom", PROCESSOR_BONNELL, CPU_ATOM, PTA_BONNELL},
       {"silvermont", PROCESSOR_SILVERMONT, CPU_SLM, PTA_SILVERMONT},
       {"slm", PROCESSOR_SILVERMONT, CPU_SLM, PTA_SILVERMONT},
-      {"intel", PROCESSOR_SILVERMONT, CPU_SLM, PTA_NEHALEM},
+      {"intel", PROCESSOR_INTEL, CPU_SLM, PTA_NEHALEM},
       {"geode", PROCESSOR_GEODE, CPU_GEODE,
 	PTA_MMX | PTA_3DNOW | PTA_3DNOW_A | PTA_PREFETCH_SSE | PTA_PRFCHW},
       {"k6", PROCESSOR_K6, CPU_K6, PTA_MMX},
@@ -17941,7 +18020,7 @@  ix86_lea_outperforms (rtx insn, unsigned int regno0, unsigned int regno1,
   /* For Silvermont if using a 2-source or 3-source LEA for
      non-destructive destination purposes, or due to wanting
      ability to use SCALE, the use of LEA is justified.  */
-  if (ix86_tune == PROCESSOR_SILVERMONT)
+  if (ix86_tune == PROCESSOR_SILVERMONT || ix86_tune == PROCESSOR_INTEL)
     {
       if (has_scale)
 	return true;
@@ -18077,7 +18156,7 @@  ix86_avoid_lea_for_addr (rtx insn, rtx operands[])
   int ok;
 
   /* Check we need to optimize.  */
-  if (!TARGET_OPT_AGU || optimize_function_for_size_p (cfun))
+  if (!TARGET_AVOID_LEA_FOR_ADDR || optimize_function_for_size_p (cfun))
     return false;
 
   /* Check it is correct to split here.  */
@@ -25200,6 +25279,7 @@  ix86_issue_rate (void)
     case PROCESSOR_PENTIUM:
     case PROCESSOR_BONNELL:
     case PROCESSOR_SILVERMONT:
+    case PROCESSOR_INTEL:
     case PROCESSOR_K6:
     case PROCESSOR_BTVER2:
     case PROCESSOR_PENTIUM4:
@@ -25541,6 +25621,7 @@  ix86_adjust_cost (rtx insn, rtx link, rtx dep_insn, int cost)
       break;
 
     case PROCESSOR_SILVERMONT:
+    case PROCESSOR_INTEL:
       if (!reload_completed)
 	return cost;
 
@@ -25609,6 +25690,7 @@  ia32_multipass_dfa_lookahead (void)
     case PROCESSOR_HASWELL:
     case PROCESSOR_BONNELL:
     case PROCESSOR_SILVERMONT:
+    case PROCESSOR_INTEL:
       /* Generally, we want haifa-sched:max_issue() to look ahead as far
 	 as many instructions can be executed on a cycle, i.e.,
 	 issue_rate.  I wonder why tuning for many CPUs does not do this.  */
@@ -25830,7 +25912,7 @@  swap_top_of_ready_list (rtx *ready, int n_ready)
   int clock2 = -1;
   #define INSN_TICK(INSN) (HID (INSN)->tick)
 
-  if (ix86_tune != PROCESSOR_SILVERMONT)
+  if (ix86_tune != PROCESSOR_SILVERMONT && ix86_tune != PROCESSOR_INTEL)
     return false;
 
   if (!NONDEBUG_INSN_P (top))
@@ -25904,7 +25986,8 @@  ix86_sched_reorder (FILE *dump, int sched_verbose, rtx *ready, int *pn_ready,
 
   /* Do reodering for BONNELL/SILVERMONT only.  */
   if (ix86_tune != PROCESSOR_BONNELL
-      && ix86_tune != PROCESSOR_SILVERMONT)
+      && ix86_tune != PROCESSOR_SILVERMONT
+      && ix86_tune != PROCESSOR_INTEL)
     return issue_rate;
 
   /* Nothing to do if ready list contains only 1 instruction.  */
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 3199b41..580a319 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -308,6 +308,7 @@  extern const struct processor_costs ix86_size_cost;
 #define TARGET_HASWELL (ix86_tune == PROCESSOR_HASWELL)
 #define TARGET_BONNELL (ix86_tune == PROCESSOR_BONNELL)
 #define TARGET_SILVERMONT (ix86_tune == PROCESSOR_SILVERMONT)
+#define TARGET_INTEL (ix86_tune == PROCESSOR_INTEL)
 #define TARGET_GENERIC (ix86_tune == PROCESSOR_GENERIC)
 #define TARGET_AMDFAM10 (ix86_tune == PROCESSOR_AMDFAM10)
 #define TARGET_BDVER1 (ix86_tune == PROCESSOR_BDVER1)
@@ -429,6 +430,8 @@  extern unsigned char ix86_tune_features[X86_TUNE_LAST];
 #define TARGET_FUSE_ALU_AND_BRANCH \
 	ix86_tune_features[X86_TUNE_FUSE_ALU_AND_BRANCH]
 #define TARGET_OPT_AGU ix86_tune_features[X86_TUNE_OPT_AGU]
+#define TARGET_AVOID_LEA_FOR_ADDR \
+	ix86_tune_features[X86_TUNE_AVOID_LEA_FOR_ADDR]
 #define TARGET_VECTORIZE_DOUBLE \
 	ix86_tune_features[X86_TUNE_VECTORIZE_DOUBLE]
 #define TARGET_SOFTWARE_PREFETCHING_BENEFICIAL \
@@ -2184,6 +2187,7 @@  enum processor_type
   PROCESSOR_HASWELL,
   PROCESSOR_BONNELL,
   PROCESSOR_SILVERMONT,
+  PROCESSOR_INTEL,
   PROCESSOR_GEODE,
   PROCESSOR_K6,
   PROCESSOR_ATHLON,
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index ec96a4b..f5affe6 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -40,16 +40,16 @@  see the files COPYING3 and COPYING.RUNTIME respectively.  If not, see
 
 /* X86_TUNE_SCHEDULE: Enable scheduling.  */
 DEF_TUNE (X86_TUNE_SCHEDULE, "schedule",
-          m_PENT | m_PPRO | m_CORE_ALL | m_BONNELL | m_SILVERMONT | m_K6_GEODE
-          | m_AMD_MULTIPLE | m_GENERIC)
+          m_PENT | m_PPRO | m_CORE_ALL | m_BONNELL | m_SILVERMONT | m_INTEL 
+	  | m_K6_GEODE | m_AMD_MULTIPLE | m_GENERIC)
 
 /* X86_TUNE_PARTIAL_REG_DEPENDENCY: Enable more register renaming
    on modern chips.  Preffer stores affecting whole integer register
    over partial stores.  For example preffer MOVZBL or MOVQ to load 8bit
    value over movb.  */
 DEF_TUNE (X86_TUNE_PARTIAL_REG_DEPENDENCY, "partial_reg_dependency",
-          m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_SILVERMONT | m_AMD_MULTIPLE
-          | m_GENERIC)
+          m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_SILVERMONT | m_INTEL
+	  | m_AMD_MULTIPLE | m_GENERIC)
 
 /* X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY: This knob promotes all store
    destinations to be 128bit to allow register renaming on 128bit SSE units,
@@ -58,8 +58,8 @@  DEF_TUNE (X86_TUNE_PARTIAL_REG_DEPENDENCY, "partial_reg_dependency",
    SPECfp regression, while enabling it on K8 brings roughly 2.4% regression
    that can be partly masked by careful scheduling of moves.  */
 DEF_TUNE (X86_TUNE_SSE_PARTIAL_REG_DEPENDENCY, "sse_partial_reg_dependency",
-          m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_SILVERMONT | m_AMDFAM10
-          | m_BDVER | m_GENERIC)
+          m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_SILVERMONT
+	  | m_INTEL | m_AMDFAM10 | m_BDVER | m_GENERIC)
 
 /* X86_TUNE_SSE_SPLIT_REGS: Set for machines where the type and dependencies
    are resolved on SSE register parts instead of whole registers, so we may
@@ -84,13 +84,14 @@  DEF_TUNE (X86_TUNE_PARTIAL_FLAG_REG_STALL, "partial_flag_reg_stall",
 /* X86_TUNE_MOVX: Enable to zero extend integer registers to avoid
    partial dependencies.  */
 DEF_TUNE (X86_TUNE_MOVX, "movx",
-          m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_SILVERMONT | m_GEODE
-          | m_AMD_MULTIPLE  | m_GENERIC)
+          m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_SILVERMONT
+	  | m_INTEL | m_GEODE | m_AMD_MULTIPLE  | m_GENERIC)
 
 /* X86_TUNE_MEMORY_MISMATCH_STALL: Avoid partial stores that are followed by
    full sized loads.  */
 DEF_TUNE (X86_TUNE_MEMORY_MISMATCH_STALL, "memory_mismatch_stall",
-          m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_SILVERMONT | m_AMD_MULTIPLE | m_GENERIC)
+          m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_SILVERMONT | m_INTEL
+	  | m_AMD_MULTIPLE | m_GENERIC)
 
 /* X86_TUNE_FUSE_CMP_AND_BRANCH_32: Fuse compare with a subsequent
    conditional jump instruction for 32 bit TARGET.
@@ -124,7 +125,8 @@  DEF_TUNE (X86_TUNE_REASSOC_INT_TO_PARALLEL, "reassoc_int_to_parallel",
 /* X86_TUNE_REASSOC_FP_TO_PARALLEL: Try to produce parallel computations
    during reassociation of fp computation.  */
 DEF_TUNE (X86_TUNE_REASSOC_FP_TO_PARALLEL, "reassoc_fp_to_parallel",
-          m_BONNELL | m_SILVERMONT | m_HASWELL | m_BDVER1 | m_BDVER2 | m_GENERIC)
+          m_BONNELL | m_SILVERMONT | m_HASWELL | m_INTEL | m_BDVER1
+	  | m_BDVER2 | m_GENERIC)
 
 /*****************************************************************************/
 /* Function prologue, epilogue and function calling sequences.               */
@@ -143,7 +145,8 @@  DEF_TUNE (X86_TUNE_REASSOC_FP_TO_PARALLEL, "reassoc_fp_to_parallel",
    regression on mgrid due to IRA limitation leading to unecessary
    use of the frame pointer in 32bit mode.  */
 DEF_TUNE (X86_TUNE_ACCUMULATE_OUTGOING_ARGS, "accumulate_outgoing_args",
-	  m_PPRO | m_P4_NOCONA | m_BONNELL | m_SILVERMONT | m_AMD_MULTIPLE | m_GENERIC)
+	  m_PPRO | m_P4_NOCONA | m_BONNELL | m_SILVERMONT | m_INTEL
+	  | m_AMD_MULTIPLE | m_GENERIC)
 
 /* X86_TUNE_PROLOGUE_USING_MOVE: Do not use push/pop in prologues that are
    considered on critical path.  */
@@ -202,7 +205,8 @@  DEF_TUNE (X86_TUNE_PAD_RETURNS, "pad_returns",
 /* X86_TUNE_FOUR_JUMP_LIMIT: Some CPU cores are not able to predict more
    than 4 branch instructions in the 16 byte window.  */
 DEF_TUNE (X86_TUNE_FOUR_JUMP_LIMIT, "four_jump_limit",
-          m_PPRO | m_P4_NOCONA | m_BONNELL | m_SILVERMONT | m_ATHLON_K8 | m_AMDFAM10)
+          m_PPRO | m_P4_NOCONA | m_BONNELL | m_SILVERMONT | m_INTEL |
+	  m_ATHLON_K8 | m_AMDFAM10)
 
 /*****************************************************************************/
 /* Integer instruction selection tuning                                      */
@@ -224,17 +228,22 @@  DEF_TUNE (X86_TUNE_READ_MODIFY, "read_modify", ~(m_PENT | m_PPRO))
 
 /* X86_TUNE_USE_INCDEC: Enable use of inc/dec instructions.   */
 DEF_TUNE (X86_TUNE_USE_INCDEC, "use_incdec",
-          ~(m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_SILVERMONT | m_GENERIC))
+          ~(m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_SILVERMONT | m_INTEL
+	    | m_GENERIC))
 
 /* X86_TUNE_INTEGER_DFMODE_MOVES: Enable if integer moves are preferred
    for DFmode copies */
 DEF_TUNE (X86_TUNE_INTEGER_DFMODE_MOVES, "integer_dfmode_moves",
           ~(m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_SILVERMONT
-          | m_GEODE | m_AMD_MULTIPLE | m_GENERIC))
+	    | m_INTEL | m_GEODE | m_AMD_MULTIPLE | m_GENERIC))
 
 /* X86_TUNE_OPT_AGU: Optimize for Address Generation Unit. This flag
    will impact LEA instruction selection. */
-DEF_TUNE (X86_TUNE_OPT_AGU, "opt_agu", m_BONNELL | m_SILVERMONT)
+DEF_TUNE (X86_TUNE_OPT_AGU, "opt_agu", m_BONNELL | m_SILVERMONT | m_INTEL)
+
+/* X86_TUNE_AVOID_LEA_FOR_ADDR: Avoid lea for address computation.  */
+DEF_TUNE (X86_TUNE_AVOID_LEA_FOR_ADDR, "avoid_lea_for_addr",
+	  m_BONNELL | m_SILVERMONT)
 
 /* X86_TUNE_SLOW_IMUL_IMM32_MEM: Imul of 32-bit constant and memory is
    vector path on AMD machines.
@@ -251,7 +260,7 @@  DEF_TUNE (X86_TUNE_SLOW_IMUL_IMM8, "slow_imul_imm8",
 /* X86_TUNE_AVOID_MEM_OPND_FOR_CMOVE: Try to avoid memory operands for
    a conditional move.  */
 DEF_TUNE (X86_TUNE_AVOID_MEM_OPND_FOR_CMOVE, "avoid_mem_opnd_for_cmove",
-	  m_BONNELL | m_SILVERMONT)
+	  m_BONNELL | m_SILVERMONT | m_INTEL)
 
 /* X86_TUNE_SINGLE_STRINGOP: Enable use of single string operations, such
    as MOVS and STOS (without a REP prefix) to move/set sequences of bytes.  */
@@ -268,15 +277,18 @@  DEF_TUNE (X86_TUNE_MISALIGNED_MOVE_STRING_PRO_EPILOGUES,
 
 /* X86_TUNE_USE_SAHF: Controls use of SAHF.  */
 DEF_TUNE (X86_TUNE_USE_SAHF, "use_sahf",
-          m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_SILVERMONT | m_K6_GEODE
-          | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER | m_GENERIC)
+          m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_SILVERMONT
+	  | m_INTEL | m_K6_GEODE | m_K8 | m_AMDFAM10 | m_BDVER | m_BTVER
+	  | m_GENERIC)
 
 /* X86_TUNE_USE_CLTD: Controls use of CLTD and CTQO instructions.  */
-DEF_TUNE (X86_TUNE_USE_CLTD, "use_cltd", ~(m_PENT | m_BONNELL | m_SILVERMONT | m_K6))
+DEF_TUNE (X86_TUNE_USE_CLTD, "use_cltd",
+	  ~(m_PENT | m_BONNELL | m_SILVERMONT | m_INTEL  | m_K6))
 
 /* X86_TUNE_USE_BT: Enable use of BT (bit test) instructions.  */
 DEF_TUNE (X86_TUNE_USE_BT, "use_bt",
-          m_CORE_ALL | m_BONNELL | m_SILVERMONT | m_AMD_MULTIPLE | m_GENERIC)
+          m_CORE_ALL | m_BONNELL | m_SILVERMONT | m_INTEL | m_AMD_MULTIPLE
+	  | m_GENERIC)
 
 /*****************************************************************************/
 /* 387 instruction selection tuning                                          */
@@ -291,16 +303,16 @@  DEF_TUNE (X86_TUNE_USE_HIMODE_FIOP, "use_himode_fiop",
 /* X86_TUNE_USE_SIMODE_FIOP: Enables use of x87 instructions with 32bit
    integer operand.  */
 DEF_TUNE (X86_TUNE_USE_SIMODE_FIOP, "use_simode_fiop",
-          ~(m_PENT | m_PPRO | m_CORE_ALL | m_BONNELL
-            | m_SILVERMONT | m_AMD_MULTIPLE | m_GENERIC))
+          ~(m_PENT | m_PPRO | m_CORE_ALL | m_BONNELL | m_SILVERMONT
+	    | m_INTEL | m_AMD_MULTIPLE | m_GENERIC))
 
 /* X86_TUNE_USE_FFREEP: Use freep instruction instead of fstp.  */
 DEF_TUNE (X86_TUNE_USE_FFREEP, "use_ffreep", m_AMD_MULTIPLE)
 
 /* X86_TUNE_EXT_80387_CONSTANTS: Use fancy 80387 constants, such as PI.  */
 DEF_TUNE (X86_TUNE_EXT_80387_CONSTANTS, "ext_80387_constants",
-          m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_SILVERMONT | m_K6_GEODE
-          | m_ATHLON_K8 | m_GENERIC)
+          m_PPRO | m_P4_NOCONA | m_CORE_ALL | m_BONNELL | m_SILVERMONT
+	  | m_INTEL | m_K6_GEODE | m_ATHLON_K8 | m_GENERIC)
 
 /*****************************************************************************/
 /* SSE instruction selection tuning                                          */
@@ -318,12 +330,14 @@  DEF_TUNE (X86_TUNE_GENERAL_REGS_SSE_SPILL, "general_regs_sse_spill",
 /* X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL: Use movups for misaligned loads instead
    of a sequence loading registers by parts.  */
 DEF_TUNE (X86_TUNE_SSE_UNALIGNED_LOAD_OPTIMAL, "sse_unaligned_load_optimal",
-          m_NEHALEM | m_SANDYBRIDGE | m_HASWELL | m_AMDFAM10 | m_BDVER | m_BTVER | m_SILVERMONT | m_GENERIC)
+          m_NEHALEM | m_SANDYBRIDGE | m_HASWELL | m_AMDFAM10 | m_BDVER
+	  | m_BTVER | m_SILVERMONT | m_INTEL | m_GENERIC)
 
 /* X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL: Use movups for misaligned stores instead
    of a sequence loading registers by parts.  */
 DEF_TUNE (X86_TUNE_SSE_UNALIGNED_STORE_OPTIMAL, "sse_unaligned_store_optimal",
-          m_NEHALEM | m_SANDYBRIDGE | m_HASWELL | m_BDVER | m_SILVERMONT | m_GENERIC)
+          m_NEHALEM | m_SANDYBRIDGE | m_HASWELL | m_BDVER | m_SILVERMONT
+	  | m_INTEL | m_GENERIC)
 
 /* Use packed single precision instructions where posisble.  I.e. movups instead
    of movupd.  */
@@ -360,7 +374,7 @@  DEF_TUNE (X86_TUNE_INTER_UNIT_CONVERSIONS, "inter_unit_conversions",
 /* X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS: Try to split memory operand for
    fp converts to destination register.  */
 DEF_TUNE (X86_TUNE_SPLIT_MEM_OPND_FOR_FP_CONVERTS, "split_mem_opnd_for_fp_converts",
-          m_SILVERMONT)
+          m_SILVERMONT | m_INTEL)
 
 /* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
    from FP to FP.  This form of instructions avoids partial write to the