
[i386] Add prefixes avoidance tuning for silvermont target

Message ID 20140702103536.GA1827@msticlxl57.ims.intel.com
State New

Commit Message

Ilya Enkovich July 2, 2014, 10:43 a.m. UTC
Hi,

Silvermont processors have a penalty for instructions with 4+ bytes of prefixes (including escape bytes in the opcode).  This situation arises when a REX prefix is used in SSE4 instructions.  This patch tries to avoid it by preferring xmm0-xmm7 over xmm8-xmm15 in those instructions.  I achieved this by adding a new tuning flag and new alternatives affected by that tuning.

SSE4 instructions are not very widely used by GCC, but I see significant gains from this patch (tested on Avoton at -O3).

DENmark (EEMBC2.0) +2%
TCPmark (EEMBC2.0) +32%
SPEC2000 +0.5%
	  
Bootstrapped and tested on linux-x86_64.

Does it look OK for trunk?

Thanks,
Ilya
--
gcc/

2014-07-02  Ilya Enkovich  <ilya.enkovich@intel.com>

	* config/i386/constraints.md (Yr): New.
	* config/i386/i386.h (reg_class): Add NO_REX_SSE_REGS.
	(REG_CLASS_NAMES): Likewise.
	(REG_CLASS_CONTENTS): Likewise.
	* config/i386/sse.md (*vec_concatv2sf_sse4_1): Add alternatives
	which use only NO_REX_SSE_REGS.
	(vec_set<mode>_0): Likewise.
	(*vec_setv4sf_sse4_1): Likewise.
	(sse4_1_insertps): Likewise.
	(*sse4_1_extractps): Likewise.
	(*sse4_1_mulv2siv2di3): Likewise.
	(*<sse4_1_avx2>_mul<mode>3<mask_name>): Likewise.
	(*sse4_1_<code><mode>3): Likewise.
	(*sse4_1_<code><mode>3): Likewise.
	(*sse4_1_eqv2di3): Likewise.
	(sse4_2_gtv2di3): Likewise.
	(*vec_extractv4si): Likewise.
	(*vec_concatv2si_sse4_1): Likewise.
	(vec_concatv2di): Likewise.
	(<sse4_1>_blend<ssemodesuffix><avxsizesuffix>): Likewise.
	(<sse4_1>_blendv<ssemodesuffix><avxsizesuffix>): Likewise.
	(<sse4_1>_dp<ssemodesuffix><avxsizesuffix>): Likewise.
	(<sse4_1_avx2>_movntdqa): Likewise.
	(<sse4_1_avx2>_mpsadbw): Likewise.
	(sse4_1_packusdw): Likewise.
	(<sse4_1_avx2>_pblendvb): Likewise.
	(sse4_1_pblendw): Likewise.
	(sse4_1_phminposuw): Likewise.
	(sse4_1_<code>v8qiv8hi2): Likewise.
	(sse4_1_<code>v4qiv4si2): Likewise.
	(sse4_1_<code>v4hiv4si2): Likewise.
	(sse4_1_<code>v2qiv2di2): Likewise.
	(sse4_1_<code>v2hiv2di2): Likewise.
	(sse4_1_<code>v2siv2di2): Likewise.
	(sse4_1_ptest): Likewise.
	(<sse4_1>_round<ssemodesuffix><avxsizesuffix>): Likewise.
	(sse4_1_round<ssescalarmodesuffix>): Likewise.
	* config/i386/subst.md (mask_prefix4): New.
	* config/i386/x86-tune.def (X86_TUNE_AVOID_4BYTE_PREFIXES): New.

gcc/testsuite/

2014-07-02  Ilya Enkovich  <ilya.enkovich@intel.com>

	* gcc.target/i386/sse2-init-v2di-2.c: Adjust to changed
	vec_concatv2di template.

Comments

Andi Kleen July 2, 2014, 4:21 p.m. UTC | #1
Ilya Enkovich <enkovich.gnu@gmail.com> writes:

> Silvermont processors have penalty for instructions having 4+ bytes of
> prefixes (including escape bytes in opcode).  This situation happens
> when REX prefix is used in SSE4 instructions.  This patch tries to
> avoid such situation by preferring xmm0-xmm7 usage over xmm8-xmm15 in
> those instructions.  I achieved it by adding new tuning flag and new
> alternatives affected by tuning.

Why make it a tuning flag? Shouldn't this help unconditionally for code
size everywhere? Or is there some drawback? 

-Andi
Jakub Jelinek July 2, 2014, 4:27 p.m. UTC | #2
On Wed, Jul 02, 2014 at 09:21:25AM -0700, Andi Kleen wrote:
> Ilya Enkovich <enkovich.gnu@gmail.com> writes:
> 
> > Silvermont processors have penalty for instructions having 4+ bytes of
> > prefixes (including escape bytes in opcode).  This situation happens
> > when REX prefix is used in SSE4 instructions.  This patch tries to
> > avoid such situation by preferring xmm0-xmm7 usage over xmm8-xmm15 in
> > those instructions.  I achieved it by adding new tuning flag and new
> > alternatives affected by tuning.
> 
> Why make it a tuning flag? Shouldn't this help unconditionally for code
> size everywhere? Or is there some drawback? 

I don't think it will make code smaller: if you already have some value in
an xmm8..xmm15 register, then not allowing those registers directly in SSE4
insns just means reloading and larger code.

BTW, is that change needed also when emitting AVX insns instead of SSE4?

	Jakub
Mike Stump July 2, 2014, 5:08 p.m. UTC | #3
On Jul 2, 2014, at 9:27 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Wed, Jul 02, 2014 at 09:21:25AM -0700, Andi Kleen wrote:
>> Ilya Enkovich <enkovich.gnu@gmail.com> writes:
>> 
>>> Silvermont processors have penalty for instructions having 4+ bytes of
>>> prefixes (including escape bytes in opcode).  This situation happens
>>> when REX prefix is used in SSE4 instructions.  This patch tries to
>>> avoid such situation by preferring xmm0-xmm7 usage over xmm8-xmm15 in
>>> those instructions.  I achieved it by adding new tuning flag and new
>>> alternatives affected by tuning.
>> 
>> Why make it a tuning flag? Shouldn't this help unconditionally for code
>> size everywhere? Or is there some drawback? 
> 
> I don't think it will make code smaller, if you already have some value in
> xmm8..xmm15 register, then by not allowing those registers directly on SSE4
> insns just means it reloading and larger code.

I can’t help but think a better way to do this is to explain the costs of the REX registers as being more expensive, then let the register allocator prefer the cheaper registers.  You then leave them all as valid, which I think is better than disappearing 1/2 of the registers.  Is it really cheaper to spill and reload those than use a REX register?
Andi Kleen July 2, 2014, 6:30 p.m. UTC | #4
Mike Stump <mikestump@comcast.net> writes:

>>>> Silvermont processors have penalty for instructions having 4+ bytes of
>>>> prefixes (including escape bytes in opcode).  This situation happens
>>>> when REX prefix is used in SSE4 instructions.  This patch tries to
>>>> avoid such situation by preferring xmm0-xmm7 usage over xmm8-xmm15 in
>>>> those instructions.  I achieved it by adding new tuning flag and new
>>>> alternatives affected by tuning.
>>> 
>>> Why make it a tuning flag? Shouldn't this help unconditionally for code
>>> size everywhere? Or is there some drawback? 
>> 
>> I don't think it will make code smaller, if you already have some value in
>> xmm8..xmm15 register, then by not allowing those registers directly on SSE4
>> insns just means it reloading and larger code.
>
> I can’t help but think a better way to do this is to explain the costs
> of the REX registers as being more expensive, then let the register
> allocator prefer the cheaper registers.  You then leave them all as
> valid, which I think is better than disappearing 1/2 of the registers.

Yes that would sound like a much better strategy.

BTW I thought gcc already did that for the integer registers to
avoid unnecessary prefixes, but maybe I misremember. Copying Honza.

-Andi
Ilya Enkovich July 3, 2014, 10:49 a.m. UTC | #5
2014-07-02 20:21 GMT+04:00 Andi Kleen <andi@firstfloor.org>:
> Ilya Enkovich <enkovich.gnu@gmail.com> writes:
>
>> Silvermont processors have penalty for instructions having 4+ bytes of
>> prefixes (including escape bytes in opcode).  This situation happens
>> when REX prefix is used in SSE4 instructions.  This patch tries to
>> avoid such situation by preferring xmm0-xmm7 usage over xmm8-xmm15 in
>> those instructions.  I achieved it by adding new tuning flag and new
>> alternatives affected by tuning.
>
> Why make it a tuning flag? Shouldn't this help unconditionally for code
> size everywhere? Or is there some drawback?

There is already a higher priority for registers not requiring REX.
My patch affects cases when the compiler has to use xmm8-15, and it just
tries to tell LRA to assign them to non-SSE4 instructions.  I doubt it
would be of use for targets other than Silvermont.

Ilya

>
> -Andi
>
> --
> ak@linux.intel.com -- Speaking for myself only
Ilya Enkovich July 3, 2014, 10:54 a.m. UTC | #6
2014-07-02 20:27 GMT+04:00 Jakub Jelinek <jakub@redhat.com>:
> On Wed, Jul 02, 2014 at 09:21:25AM -0700, Andi Kleen wrote:
>> Ilya Enkovich <enkovich.gnu@gmail.com> writes:
>>
>> > Silvermont processors have penalty for instructions having 4+ bytes of
>> > prefixes (including escape bytes in opcode).  This situation happens
>> > when REX prefix is used in SSE4 instructions.  This patch tries to
>> > avoid such situation by preferring xmm0-xmm7 usage over xmm8-xmm15 in
>> > those instructions.  I achieved it by adding new tuning flag and new
>> > alternatives affected by tuning.
>>
>> Why make it a tuning flag? Shouldn't this help unconditionally for code
>> size everywhere? Or is there some drawback?
>
> I don't think it will make code smaller, if you already have some value in
> xmm8..xmm15 register, then by not allowing those registers directly on SSE4
> insns just means it reloading and larger code.
>
> BTW, is that change needed also when emitting AVX insns instead of SSE4?

This is for Silvermont only. It does not have AVX.

Ilya

>
>         Jakub
Jakub Jelinek July 3, 2014, 10:56 a.m. UTC | #7
On Thu, Jul 03, 2014 at 02:49:10PM +0400, Ilya Enkovich wrote:
> 2014-07-02 20:21 GMT+04:00 Andi Kleen <andi@firstfloor.org>:
> > Ilya Enkovich <enkovich.gnu@gmail.com> writes:
> >
> >> Silvermont processors have penalty for instructions having 4+ bytes of
> >> prefixes (including escape bytes in opcode).  This situation happens
> >> when REX prefix is used in SSE4 instructions.  This patch tries to
> >> avoid such situation by preferring xmm0-xmm7 usage over xmm8-xmm15 in
> >> those instructions.  I achieved it by adding new tuning flag and new
> >> alternatives affected by tuning.
> >
> > Why make it a tuning flag? Shouldn't this help unconditionally for code
> > size everywhere? Or is there some drawback?
> 
> There is already a higher priority for registers not requiring REX.
> My patch affects cases when compiler has to use xmm8-15 and it just
> tries to say LRA to assign them for non SSE4 instructions.  I doubt it
> would have some use for other targets than Silvermont.

When it is just a hint, shouldn't there be something like Ya,???x
or Ya,!x or similar in the SSE4 constraints?  I mean, xmm{8-15} can be used,
they are just costly.

	Jakub
Ilya Enkovich July 3, 2014, 11:07 a.m. UTC | #8
2014-07-03 14:56 GMT+04:00 Jakub Jelinek <jakub@redhat.com>:
> On Thu, Jul 03, 2014 at 02:49:10PM +0400, Ilya Enkovich wrote:
>> 2014-07-02 20:21 GMT+04:00 Andi Kleen <andi@firstfloor.org>:
>> > Ilya Enkovich <enkovich.gnu@gmail.com> writes:
>> >
>> >> Silvermont processors have penalty for instructions having 4+ bytes of
>> >> prefixes (including escape bytes in opcode).  This situation happens
>> >> when REX prefix is used in SSE4 instructions.  This patch tries to
>> >> avoid such situation by preferring xmm0-xmm7 usage over xmm8-xmm15 in
>> >> those instructions.  I achieved it by adding new tuning flag and new
>> >> alternatives affected by tuning.
>> >
>> > Why make it a tuning flag? Shouldn't this help unconditionally for code
>> > size everywhere? Or is there some drawback?
>>
>> There is already a higher priority for registers not requiring REX.
>> My patch affects cases when compiler has to use xmm8-15 and it just
>> tries to say LRA to assign them for non SSE4 instructions.  I doubt it
>> would have some use for other targets than Silvermont.
>
> When it is just a hint, shouldn't there be something like Ya,???x
> or Ya,!x or similar in the SSE4 constraints?  I mean, xmm{8-15} can be used,
> just is costly.
>

I made it Ya,?x.  I do not know how many '?'s would be reasonable here.
One was enough to get a gain on the tests where I expected it.

Ilya

>         Jakub
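As background for the constraint-modifier discussion above: '?' slightly disparages an alternative in the allocator's cost calculation without making it invalid.  The shape used throughout this patch looks roughly like the following sketch (not an exact pattern from sse.md; the insn name is made up):

```lisp
;; Sketch only.  "Yr" allows just xmm0-xmm7 when prefix avoidance is
;; enabled (and all SSE regs otherwise); "?x" keeps xmm8-xmm15 valid
;; but slightly disparaged, so LRA prefers the first alternative when
;; it is free to choose.
(define_insn "*sse4_1_example"
  [(set (match_operand:V4SI 0 "register_operand" "=Yr,?x,x")
        (mult:V4SI
          (match_operand:V4SI 1 "nonimmediate_operand" "%0,0,x")
          (match_operand:V4SI 2 "nonimmediate_operand" "Yrm,?xm,xm")))]
  "TARGET_SSE4_1"
  ...)
```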

Patch

diff --git a/gcc/config/i386/constraints.md b/gcc/config/i386/constraints.md
index 8e0a583..d6a0ea9 100644
--- a/gcc/config/i386/constraints.md
+++ b/gcc/config/i386/constraints.md
@@ -105,6 +105,8 @@ 
 ;;  d	Integer register when integer DFmode moves are enabled
 ;;  x	Integer register when integer XFmode moves are enabled
 ;;  f	x87 register when 80387 floating point arithmetic is enabled
+;;  r	SSE regs not requiring REX prefix when prefixes avoidance is enabled
+;;	and all SSE regs otherwise
 
 (define_register_constraint "Yz" "TARGET_SSE ? SSE_FIRST_REG : NO_REGS"
  "First SSE register (@code{%xmm0}).")
@@ -147,6 +149,10 @@ 
  "(ix86_fpmath & FPMATH_387) ? FLOAT_REGS : NO_REGS"
  "@internal Any x87 register when 80387 FP arithmetic is enabled.")
 
+(define_register_constraint "Yr"
+ "TARGET_SSE ? (X86_TUNE_AVOID_4BYTE_PREFIXES ? NO_REX_SSE_REGS : ALL_SSE_REGS) : NO_REGS"
+ "@internal Lower SSE register when avoiding REX prefix and all SSE registers otherwise.")
+
 ;; We use the B prefix to denote any number of internal operands:
 ;;  s  Sibcall memory operand, not valid for TARGET_X32
 ;;  w  Call memory operand, not valid for TARGET_X32
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 9e3ef94..54239e9 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -1282,6 +1282,7 @@  enum reg_class
   FP_TOP_REG, FP_SECOND_REG,	/* %st(0) %st(1) */
   FLOAT_REGS,
   SSE_FIRST_REG,
+  NO_REX_SSE_REGS,
   SSE_REGS,
   EVEX_SSE_REGS,
   ALL_SSE_REGS,
@@ -1339,6 +1340,7 @@  enum reg_class
    "FP_TOP_REG", "FP_SECOND_REG",	\
    "FLOAT_REGS",			\
    "SSE_FIRST_REG",			\
+   "NO_REX_SSE_REGS",			\
    "SSE_REGS",				\
    "EVEX_SSE_REGS",			\
    "ALL_SSE_REGS",			\
@@ -1378,6 +1380,7 @@  enum reg_class
     { 0x0200,       0x0,   0x0 },       /* FP_SECOND_REG */             \
     { 0xff00,       0x0,   0x0 },       /* FLOAT_REGS */                \
   { 0x200000,       0x0,   0x0 },       /* SSE_FIRST_REG */             \
+{ 0x1fe00000,  0x000000,   0x0 },       /* NO_REX_SSE_REGS */           \
 { 0x1fe00000,  0x1fe000,   0x0 },       /* SSE_REGS */                  \
        { 0x0,0xffe00000,  0x1f },       /* EVEX_SSE_REGS */             \
 { 0x1fe00000,0xffffe000,  0x1f },       /* ALL_SSE_REGS */              \
diff --git a/gcc/config/i386/sse.md b/gcc/config/i386/sse.md
index d907353..1e561c2 100644
--- a/gcc/config/i386/sse.md
+++ b/gcc/config/i386/sse.md
@@ -5585,26 +5585,28 @@ 
 ;; Although insertps takes register source, we prefer
 ;; unpcklps with register source since it is shorter.
 (define_insn "*vec_concatv2sf_sse4_1"
-  [(set (match_operand:V2SF 0 "register_operand"     "=x,x,x,x,x,*y ,*y")
+  [(set (match_operand:V2SF 0 "register_operand"     "=Yr,?x,x,Yr,?x,x,x,*y ,*y")
 	(vec_concat:V2SF
-	  (match_operand:SF 1 "nonimmediate_operand" " 0,x,0,x,m, 0 , m")
-	  (match_operand:SF 2 "vector_move_operand"  " x,x,m,m,C,*ym, C")))]
+	  (match_operand:SF 1 "nonimmediate_operand" "  0, 0,x, 0,0, x,m, 0 , m")
+	  (match_operand:SF 2 "vector_move_operand"  " Yr,?x,x, m,m, m,C,*ym, C")))]
   "TARGET_SSE4_1"
   "@
    unpcklps\t{%2, %0|%0, %2}
+   unpcklps\t{%2, %0|%0, %2}
    vunpcklps\t{%2, %1, %0|%0, %1, %2}
    insertps\t{$0x10, %2, %0|%0, %2, 0x10}
+   insertps\t{$0x10, %2, %0|%0, %2, 0x10}
    vinsertps\t{$0x10, %2, %1, %0|%0, %1, %2, 0x10}
    %vmovss\t{%1, %0|%0, %1}
    punpckldq\t{%2, %0|%0, %2}
    movd\t{%1, %0|%0, %1}"
-  [(set_attr "isa" "noavx,avx,noavx,avx,*,*,*")
-   (set_attr "type" "sselog,sselog,sselog,sselog,ssemov,mmxcvt,mmxmov")
-   (set_attr "prefix_data16" "*,*,1,*,*,*,*")
-   (set_attr "prefix_extra" "*,*,1,1,*,*,*")
-   (set_attr "length_immediate" "*,*,1,1,*,*,*")
-   (set_attr "prefix" "orig,vex,orig,vex,maybe_vex,orig,orig")
-   (set_attr "mode" "V4SF,V4SF,V4SF,V4SF,SF,DI,DI")])
+  [(set_attr "isa" "noavx,noavx,avx,noavx,noavx,avx,*,*,*")
+   (set_attr "type" "sselog,sselog,sselog,sselog,sselog,sselog,ssemov,mmxcvt,mmxmov")
+   (set_attr "prefix_data16" "*,*,*,1,1,*,*,*,*")
+   (set_attr "prefix_extra" "*,*,*,1,1,1,*,*,*")
+   (set_attr "length_immediate" "*,*,*,1,1,1,*,*,*")
+   (set_attr "prefix" "orig,orig,vex,orig,orig,vex,maybe_vex,orig,orig")
+   (set_attr "mode" "V4SF,V4SF,V4SF,V4SF,V4SF,V4SF,SF,DI,DI")])
 
 ;; ??? In theory we can match memory for the MMX alternative, but allowing
 ;; nonimmediate_operand for operand 2 and *not* allowing memory for the SSE
@@ -5652,49 +5654,51 @@ 
 ;; see comment above inline_secondary_memory_needed function in i386.c
 (define_insn "vec_set<mode>_0"
   [(set (match_operand:VI4F_128 0 "nonimmediate_operand"
-	  "=x,x,x ,x,x,x,x  ,x  ,m ,m   ,m")
+	  "=Yr,?x,x,x ,x,x,x, Yr, ?x ,x  ,m ,m   ,m")
 	(vec_merge:VI4F_128
 	  (vec_duplicate:VI4F_128
 	    (match_operand:<ssescalarmode> 2 "general_operand"
-	  " x,m,*r,m,x,x,*rm,*rm,!x,!*re,!*fF"))
+	  " Yr,?x,m,*r,m,x,x,*rm,*rm,*rm,!x,!*re,!*fF"))
 	  (match_operand:VI4F_128 1 "vector_move_operand"
-	  " C,C,C ,C,0,x,0  ,x  ,0 ,0   ,0")
+	  "  C,C, C,C ,C,0,x, 0, 0  ,x  ,0 ,0   ,0")
 	  (const_int 1)))]
   "TARGET_SSE"
   "@
    %vinsertps\t{$0xe, %d2, %0|%0, %d2, 0xe}
+   %vinsertps\t{$0xe, %d2, %0|%0, %d2, 0xe}
    %vmov<ssescalarmodesuffix>\t{%2, %0|%0, %2}
    %vmovd\t{%2, %0|%0, %2}
    movss\t{%2, %0|%0, %2}
    movss\t{%2, %0|%0, %2}
    vmovss\t{%2, %1, %0|%0, %1, %2}
    pinsrd\t{$0, %2, %0|%0, %2, 0}
+   pinsrd\t{$0, %2, %0|%0, %2, 0}
    vpinsrd\t{$0, %2, %1, %0|%0, %1, %2, 0}
    #
    #
    #"
-  [(set_attr "isa" "sse4,sse2,sse2,noavx,noavx,avx,sse4_noavx,avx,*,*,*")
+  [(set_attr "isa" "sse4,sse4,sse2,sse2,noavx,noavx,avx,sse4_noavx,sse4_noavx,avx,*,*,*")
    (set (attr "type")
-     (cond [(eq_attr "alternative" "0,6,7")
+     (cond [(eq_attr "alternative" "0,7,9")
 	      (const_string "sselog")
-	    (eq_attr "alternative" "9")
+	    (eq_attr "alternative" "11")
 	      (const_string "imov")
-	    (eq_attr "alternative" "10")
+	    (eq_attr "alternative" "12")
 	      (const_string "fmov")
 	   ]
 	   (const_string "ssemov")))
-   (set_attr "prefix_extra" "*,*,*,*,*,*,1,1,*,*,*")
-   (set_attr "length_immediate" "*,*,*,*,*,*,1,1,*,*,*")
-   (set_attr "prefix" "maybe_vex,maybe_vex,maybe_vex,orig,orig,vex,orig,vex,*,*,*")
-   (set_attr "mode" "SF,<ssescalarmode>,SI,SF,SF,SF,TI,TI,*,*,*")])
+   (set_attr "prefix_extra" "*,*,*,*,*,*,*,1,1,1,*,*,*")
+   (set_attr "length_immediate" "*,*,*,*,*,*,*,1,1,1,*,*,*")
+   (set_attr "prefix" "maybe_vex,maybe_vex,maybe_vex,maybe_vex,orig,orig,vex,orig,orig,vex,*,*,*")
+   (set_attr "mode" "SF,SF,<ssescalarmode>,SI,SF,SF,SF,TI,TI,TI,*,*,*")])
 
 ;; A subset is vec_setv4sf.
 (define_insn "*vec_setv4sf_sse4_1"
-  [(set (match_operand:V4SF 0 "register_operand" "=x,x")
+  [(set (match_operand:V4SF 0 "register_operand" "=Yr,?x,x")
 	(vec_merge:V4SF
 	  (vec_duplicate:V4SF
-	    (match_operand:SF 2 "nonimmediate_operand" "xm,xm"))
-	  (match_operand:V4SF 1 "register_operand" "0,x")
+	    (match_operand:SF 2 "nonimmediate_operand" "Yrm,?xm,xm"))
+	  (match_operand:V4SF 1 "register_operand" "0,0,x")
 	  (match_operand:SI 3 "const_int_operand")))]
   "TARGET_SSE4_1
    && ((unsigned) exact_log2 (INTVAL (operands[3]))
@@ -5704,26 +5708,27 @@ 
   switch (which_alternative)
     {
     case 0:
-      return "insertps\t{%3, %2, %0|%0, %2, %3}";
     case 1:
+      return "insertps\t{%3, %2, %0|%0, %2, %3}";
+    case 2:
       return "vinsertps\t{%3, %2, %1, %0|%0, %1, %2, %3}";
     default:
       gcc_unreachable ();
     }
 }
-  [(set_attr "isa" "noavx,avx")
+  [(set_attr "isa" "noavx,noavx,avx")
    (set_attr "type" "sselog")
-   (set_attr "prefix_data16" "1,*")
+   (set_attr "prefix_data16" "1,1,*")
    (set_attr "prefix_extra" "1")
    (set_attr "length_immediate" "1")
-   (set_attr "prefix" "orig,vex")
+   (set_attr "prefix" "orig,orig,vex")
    (set_attr "mode" "V4SF")])
 
 (define_insn "sse4_1_insertps"
-  [(set (match_operand:V4SF 0 "register_operand" "=x,x")
-	(unspec:V4SF [(match_operand:V4SF 2 "nonimmediate_operand" "xm,xm")
-		      (match_operand:V4SF 1 "register_operand" "0,x")
-		      (match_operand:SI 3 "const_0_to_255_operand" "n,n")]
+  [(set (match_operand:V4SF 0 "register_operand" "=Yr,?x,x")
+	(unspec:V4SF [(match_operand:V4SF 2 "nonimmediate_operand" "Yrm,?xm,xm")
+		      (match_operand:V4SF 1 "register_operand" "0,0,x")
+		      (match_operand:SI 3 "const_0_to_255_operand" "n,n,n")]
 		     UNSPEC_INSERTPS))]
   "TARGET_SSE4_1"
 {
@@ -5737,19 +5742,20 @@ 
   switch (which_alternative)
     {
     case 0:
-      return "insertps\t{%3, %2, %0|%0, %2, %3}";
     case 1:
+      return "insertps\t{%3, %2, %0|%0, %2, %3}";
+    case 2:
       return "vinsertps\t{%3, %2, %1, %0|%0, %1, %2, %3}";
     default:
       gcc_unreachable ();
     }
 }
-  [(set_attr "isa" "noavx,avx")
+  [(set_attr "isa" "noavx,noavx,avx")
    (set_attr "type" "sselog")
-   (set_attr "prefix_data16" "1,*")
+   (set_attr "prefix_data16" "1,1,*")
    (set_attr "prefix_extra" "1")
    (set_attr "length_immediate" "1")
-   (set_attr "prefix" "orig,vex")
+   (set_attr "prefix" "orig,orig,vex")
    (set_attr "mode" "V4SF")])
 
 (define_split
@@ -5791,13 +5797,14 @@ 
 })
 
 (define_insn_and_split "*sse4_1_extractps"
-  [(set (match_operand:SF 0 "nonimmediate_operand" "=rm,x,x")
+  [(set (match_operand:SF 0 "nonimmediate_operand" "=*rm,rm,x,x")
 	(vec_select:SF
-	  (match_operand:V4SF 1 "register_operand" "x,0,x")
-	  (parallel [(match_operand:SI 2 "const_0_to_3_operand" "n,n,n")])))]
+	  (match_operand:V4SF 1 "register_operand" "Yr,?x,0,x")
+	  (parallel [(match_operand:SI 2 "const_0_to_3_operand" "n,n,n,n")])))]
   "TARGET_SSE4_1"
   "@
    %vextractps\t{%2, %1, %0|%0, %1, %2}
+   %vextractps\t{%2, %1, %0|%0, %1, %2}
    #
    #"
   "&& reload_completed && SSE_REG_P (operands[0])"
@@ -5822,13 +5829,13 @@ 
     }
   DONE;
 }
-  [(set_attr "isa" "*,noavx,avx")
-   (set_attr "type" "sselog,*,*")
-   (set_attr "prefix_data16" "1,*,*")
-   (set_attr "prefix_extra" "1,*,*")
-   (set_attr "length_immediate" "1,*,*")
-   (set_attr "prefix" "maybe_vex,*,*")
-   (set_attr "mode" "V4SF,*,*")])
+  [(set_attr "isa" "*,*,noavx,avx")
+   (set_attr "type" "sselog,sselog,*,*")
+   (set_attr "prefix_data16" "1,1,*,*")
+   (set_attr "prefix_extra" "1,1,*,*")
+   (set_attr "length_immediate" "1,1,*,*")
+   (set_attr "prefix" "maybe_vex,maybe_vex,*,*")
+   (set_attr "mode" "V4SF,V4SF,*,*")])
 
 (define_insn_and_split "*vec_extractv4sf_mem"
   [(set (match_operand:SF 0 "register_operand" "=x,*r,f")
@@ -7980,25 +7987,26 @@ 
   "ix86_fixup_binary_operands_no_copy (MULT, V4SImode, operands);")
 
 (define_insn "*sse4_1_mulv2siv2di3"
-  [(set (match_operand:V2DI 0 "register_operand" "=x,x")
+  [(set (match_operand:V2DI 0 "register_operand" "=Yr,?x,x")
 	(mult:V2DI
 	  (sign_extend:V2DI
 	    (vec_select:V2SI
-	      (match_operand:V4SI 1 "nonimmediate_operand" "%0,x")
+	      (match_operand:V4SI 1 "nonimmediate_operand" "%0,0,x")
 	      (parallel [(const_int 0) (const_int 2)])))
 	  (sign_extend:V2DI
 	    (vec_select:V2SI
-	      (match_operand:V4SI 2 "nonimmediate_operand" "xm,xm")
+	      (match_operand:V4SI 2 "nonimmediate_operand" "Yrm,?xm,xm")
 	      (parallel [(const_int 0) (const_int 2)])))))]
   "TARGET_SSE4_1 && ix86_binary_operator_ok (MULT, V4SImode, operands)"
   "@
    pmuldq\t{%2, %0|%0, %2}
+   pmuldq\t{%2, %0|%0, %2}
    vpmuldq\t{%2, %1, %0|%0, %1, %2}"
-  [(set_attr "isa" "noavx,avx")
+  [(set_attr "isa" "noavx,noavx,avx")
    (set_attr "type" "sseimul")
-   (set_attr "prefix_data16" "1,*")
+   (set_attr "prefix_data16" "1,1,*")
    (set_attr "prefix_extra" "1")
-   (set_attr "prefix" "orig,vex")
+   (set_attr "prefix" "orig,orig,vex")
    (set_attr "mode" "TI")])
 
 (define_expand "avx2_pmaddwd"
@@ -8155,19 +8163,20 @@ 
 })
 
 (define_insn "*<sse4_1_avx2>_mul<mode>3<mask_name>"
-  [(set (match_operand:VI4_AVX512F 0 "register_operand" "=x,v")
+  [(set (match_operand:VI4_AVX512F 0 "register_operand" "=Yr,?x,v")
 	(mult:VI4_AVX512F
-	  (match_operand:VI4_AVX512F 1 "nonimmediate_operand" "%0,v")
-	  (match_operand:VI4_AVX512F 2 "nonimmediate_operand" "xm,vm")))]
+	  (match_operand:VI4_AVX512F 1 "nonimmediate_operand" "%0,0,v")
+	  (match_operand:VI4_AVX512F 2 "nonimmediate_operand" "Yrm,?xm,vm")))]
   "TARGET_SSE4_1 && ix86_binary_operator_ok (MULT, <MODE>mode, operands) && <mask_mode512bit_condition>"
   "@
    pmulld\t{%2, %0|%0, %2}
+   pmulld\t{%2, %0|%0, %2}
    vpmulld\t{%2, %1, %0<mask_operand3>|%0<mask_operand3>, %1, %2}"
-  [(set_attr "isa" "noavx,avx")
+  [(set_attr "isa" "noavx,noavx,avx")
    (set_attr "type" "sseimul")
    (set_attr "prefix_extra" "1")
-   (set_attr "prefix" "<mask_prefix3>")
-   (set_attr "btver2_decode" "vector,vector")
+   (set_attr "prefix" "<mask_prefix4>")
+   (set_attr "btver2_decode" "vector,vector,vector")
    (set_attr "mode" "<sseinsnmode>")])
 
 (define_expand "mul<mode>3"
@@ -8524,18 +8533,19 @@ 
 })
 
 (define_insn "*sse4_1_<code><mode>3"
-  [(set (match_operand:VI14_128 0 "register_operand" "=x,x")
+  [(set (match_operand:VI14_128 0 "register_operand" "=Yr,?x,x")
 	(smaxmin:VI14_128
-	  (match_operand:VI14_128 1 "nonimmediate_operand" "%0,x")
-	  (match_operand:VI14_128 2 "nonimmediate_operand" "xm,xm")))]
+	  (match_operand:VI14_128 1 "nonimmediate_operand" "%0,0,x")
+	  (match_operand:VI14_128 2 "nonimmediate_operand" "Yrm,?xm,xm")))]
   "TARGET_SSE4_1 && ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)"
   "@
    p<maxmin_int><ssemodesuffix>\t{%2, %0|%0, %2}
+   p<maxmin_int><ssemodesuffix>\t{%2, %0|%0, %2}
    vp<maxmin_int><ssemodesuffix>\t{%2, %1, %0|%0, %1, %2}"
-  [(set_attr "isa" "noavx,avx")
+  [(set_attr "isa" "noavx,noavx,avx")
    (set_attr "type" "sseiadd")
-   (set_attr "prefix_extra" "1,*")
-   (set_attr "prefix" "orig,vex")
+   (set_attr "prefix_extra" "1,1,*")
+   (set_attr "prefix" "orig,orig,vex")
    (set_attr "mode" "TI")])
 
 (define_insn "*<code>v8hi3"
@@ -8605,18 +8615,19 @@ 
 })
 
 (define_insn "*sse4_1_<code><mode>3"
-  [(set (match_operand:VI24_128 0 "register_operand" "=x,x")
+  [(set (match_operand:VI24_128 0 "register_operand" "=Yr,?x,x")
 	(umaxmin:VI24_128
-	  (match_operand:VI24_128 1 "nonimmediate_operand" "%0,x")
-	  (match_operand:VI24_128 2 "nonimmediate_operand" "xm,xm")))]
+	  (match_operand:VI24_128 1 "nonimmediate_operand" "%0,0,x")
+	  (match_operand:VI24_128 2 "nonimmediate_operand" "Yrm,?xm,xm")))]
   "TARGET_SSE4_1 && ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)"
   "@
    p<maxmin_int><ssemodesuffix>\t{%2, %0|%0, %2}
+   p<maxmin_int><ssemodesuffix>\t{%2, %0|%0, %2}
    vp<maxmin_int><ssemodesuffix>\t{%2, %1, %0|%0, %1, %2}"
-  [(set_attr "isa" "noavx,avx")
+  [(set_attr "isa" "noavx,noavx,avx")
    (set_attr "type" "sseiadd")
-   (set_attr "prefix_extra" "1,*")
-   (set_attr "prefix" "orig,vex")
+   (set_attr "prefix_extra" "1,1,*")
+   (set_attr "prefix" "orig,orig,vex")
    (set_attr "mode" "TI")])
 
 (define_insn "*<code>v16qi3"
@@ -8684,18 +8695,19 @@ 
    (set_attr "mode" "<sseinsnmode>")])
 
 (define_insn "*sse4_1_eqv2di3"
-  [(set (match_operand:V2DI 0 "register_operand" "=x,x")
+  [(set (match_operand:V2DI 0 "register_operand" "=Yr,?x,x")
 	(eq:V2DI
-	  (match_operand:V2DI 1 "nonimmediate_operand" "%0,x")
-	  (match_operand:V2DI 2 "nonimmediate_operand" "xm,xm")))]
+	  (match_operand:V2DI 1 "nonimmediate_operand" "%0,0,x")
+	  (match_operand:V2DI 2 "nonimmediate_operand" "Yrm,?xm,xm")))]
   "TARGET_SSE4_1 && ix86_binary_operator_ok (EQ, V2DImode, operands)"
   "@
    pcmpeqq\t{%2, %0|%0, %2}
+   pcmpeqq\t{%2, %0|%0, %2}
    vpcmpeqq\t{%2, %1, %0|%0, %1, %2}"
-  [(set_attr "isa" "noavx,avx")
+  [(set_attr "isa" "noavx,noavx,avx")
    (set_attr "type" "ssecmp")
    (set_attr "prefix_extra" "1")
-   (set_attr "prefix" "orig,vex")
+   (set_attr "prefix" "orig,orig,vex")
    (set_attr "mode" "TI")])
 
 (define_insn "*sse2_eq<mode>3"
@@ -8731,18 +8743,19 @@ 
   "ix86_fixup_binary_operands_no_copy (EQ, V2DImode, operands);")
 
 (define_insn "sse4_2_gtv2di3"
-  [(set (match_operand:V2DI 0 "register_operand" "=x,x")
+  [(set (match_operand:V2DI 0 "register_operand" "=Yr,?x,x")
 	(gt:V2DI
-	  (match_operand:V2DI 1 "register_operand" "0,x")
-	  (match_operand:V2DI 2 "nonimmediate_operand" "xm,xm")))]
+	  (match_operand:V2DI 1 "register_operand" "0,0,x")
+	  (match_operand:V2DI 2 "nonimmediate_operand" "Yrm,?xm,xm")))]
   "TARGET_SSE4_2"
   "@
    pcmpgtq\t{%2, %0|%0, %2}
+   pcmpgtq\t{%2, %0|%0, %2}
    vpcmpgtq\t{%2, %1, %0|%0, %1, %2}"
-  [(set_attr "isa" "noavx,avx")
+  [(set_attr "isa" "noavx,noavx,avx")
    (set_attr "type" "ssecmp")
    (set_attr "prefix_extra" "1")
-   (set_attr "prefix" "orig,vex")
+   (set_attr "prefix" "orig,orig,vex")
    (set_attr "mode" "TI")])
 
 (define_insn "avx2_gt<mode>3"
@@ -10432,9 +10445,9 @@ 
   "operands[1] = gen_rtx_REG (<MODE>mode, REGNO (operands[1]));")
 
 (define_insn "*vec_extractv4si"
-  [(set (match_operand:SI 0 "nonimmediate_operand" "=rm,x,x")
+  [(set (match_operand:SI 0 "nonimmediate_operand" "=rm,Yr,?x,x")
 	(vec_select:SI
-	  (match_operand:V4SI 1 "register_operand" "x,0,x")
+	  (match_operand:V4SI 1 "register_operand" "x,0,0,x")
 	  (parallel [(match_operand:SI 2 "const_0_to_3_operand")])))]
   "TARGET_SSE4_1"
 {
@@ -10444,10 +10457,11 @@ 
       return "%vpextrd\t{%2, %1, %0|%0, %1, %2}";
 
     case 1:
+    case 2:
       operands [2] = GEN_INT (INTVAL (operands[2]) * 4);
       return "psrldq\t{%2, %0|%0, %2}";
 
-    case 2:
+    case 3:
       operands [2] = GEN_INT (INTVAL (operands[2]) * 4);
       return "vpsrldq\t{%2, %1, %0|%0, %1, %2}";
 
@@ -10455,11 +10469,11 @@ 
       gcc_unreachable ();
     }
 }
-  [(set_attr "isa" "*,noavx,avx")
-   (set_attr "type" "sselog1,sseishft1,sseishft1")
-   (set_attr "prefix_extra" "1,*,*")
+  [(set_attr "isa" "*,noavx,noavx,avx")
+   (set_attr "type" "sselog1,sseishft1,sseishft1,sseishft1")
+   (set_attr "prefix_extra" "1,*,*,*")
    (set_attr "length_immediate" "1")
-   (set_attr "prefix" "maybe_vex,orig,vex")
+   (set_attr "prefix" "maybe_vex,orig,orig,vex")
    (set_attr "mode" "TI")])
 
 (define_insn "*vec_extractv4si_zext"
@@ -10566,25 +10580,27 @@ 
    (set_attr "mode" "TI,TI,DF,V4SF")])
 
 (define_insn "*vec_concatv2si_sse4_1"
-  [(set (match_operand:V2SI 0 "register_operand"     "=x, x,x,x, x, *y,*y")
+  [(set (match_operand:V2SI 0 "register_operand"     "=Yr,?x,x, Yr,?x,x, x, *y,*y")
 	(vec_concat:V2SI
-	  (match_operand:SI 1 "nonimmediate_operand" " 0, x,0,x,rm,  0,rm")
-	  (match_operand:SI 2 "vector_move_operand"  "rm,rm,x,x, C,*ym, C")))]
+	  (match_operand:SI 1 "nonimmediate_operand" "  0, 0,x,  0,0, x,rm,  0,rm")
+	  (match_operand:SI 2 "vector_move_operand"  " rm,rm,rm,Yr,?x,x, C,*ym, C")))]
   "TARGET_SSE4_1"
   "@
    pinsrd\t{$1, %2, %0|%0, %2, 1}
+   pinsrd\t{$1, %2, %0|%0, %2, 1}
    vpinsrd\t{$1, %2, %1, %0|%0, %1, %2, 1}
    punpckldq\t{%2, %0|%0, %2}
+   punpckldq\t{%2, %0|%0, %2}
    vpunpckldq\t{%2, %1, %0|%0, %1, %2}
    %vmovd\t{%1, %0|%0, %1}
    punpckldq\t{%2, %0|%0, %2}
    movd\t{%1, %0|%0, %1}"
-  [(set_attr "isa" "noavx,avx,noavx,avx,*,*,*")
-   (set_attr "type" "sselog,sselog,sselog,sselog,ssemov,mmxcvt,mmxmov")
-   (set_attr "prefix_extra" "1,1,*,*,*,*,*")
-   (set_attr "length_immediate" "1,1,*,*,*,*,*")
-   (set_attr "prefix" "orig,vex,orig,vex,maybe_vex,orig,orig")
-   (set_attr "mode" "TI,TI,TI,TI,TI,DI,DI")])
+  [(set_attr "isa" "noavx,noavx,avx,noavx,noavx,avx,*,*,*")
+   (set_attr "type" "sselog,sselog,sselog,sselog,sselog,sselog,ssemov,mmxcvt,mmxmov")
+   (set_attr "prefix_extra" "1,1,1,*,*,*,*,*,*")
+   (set_attr "length_immediate" "1,1,1,*,*,*,*,*,*")
+   (set_attr "prefix" "orig,orig,vex,orig,orig,vex,maybe_vex,orig,orig")
+   (set_attr "mode" "TI,TI,TI,TI,TI,TI,TI,DI,DI")])
 
 ;; ??? In theory we can match memory for the MMX alternative, but allowing
 ;; nonimmediate_operand for operand 2 and *not* allowing memory for the SSE
@@ -10627,15 +10643,16 @@ 
 ;; movd instead of movq is required to handle broken assemblers.
 (define_insn "vec_concatv2di"
   [(set (match_operand:V2DI 0 "register_operand"
-	  "=x,x ,Yi,x ,!x,x,x,x,x,x")
+	  "=Yr,?x,x ,Yi,x ,!x,x,x,x,x,x")
 	(vec_concat:V2DI
 	  (match_operand:DI 1 "nonimmediate_operand"
-	  " 0,x ,r ,xm,*y,0,x,0,0,x")
+	  "  0, 0,x ,r ,xm,*y,0,x,0,0,x")
 	  (match_operand:DI 2 "vector_move_operand"
-	  "rm,rm,C ,C ,C ,x,x,x,m,m")))]
+	  "*rm,rm,rm,C ,C ,C ,x,x,x,m,m")))]
   "TARGET_SSE"
   "@
    pinsrq\t{$1, %2, %0|%0, %2, 1}
+   pinsrq\t{$1, %2, %0|%0, %2, 1}
    vpinsrq\t{$1, %2, %1, %0|%0, %1, %2, 1}
    * return HAVE_AS_IX86_INTERUNIT_MOVQ ? \"%vmovq\t{%1, %0|%0, %1}\" : \"%vmovd\t{%1, %0|%0, %1}\";
    %vmovq\t{%1, %0|%0, %1}
@@ -10645,17 +10662,17 @@ 
    movlhps\t{%2, %0|%0, %2}
    movhps\t{%2, %0|%0, %2}
    vmovhps\t{%2, %1, %0|%0, %1, %2}"
-  [(set_attr "isa" "x64_sse4_noavx,x64_avx,x64,sse2,sse2,sse2_noavx,avx,noavx,noavx,avx")
+  [(set_attr "isa" "x64_sse4_noavx,x64_sse4_noavx,x64_avx,x64,sse2,sse2,sse2_noavx,avx,noavx,noavx,avx")
    (set (attr "type")
      (if_then_else
-       (eq_attr "alternative" "0,1,5,6")
+       (eq_attr "alternative" "0,1,2,6,7")
        (const_string "sselog")
        (const_string "ssemov")))
-   (set_attr "prefix_rex" "1,1,1,*,*,*,*,*,*,*")
-   (set_attr "prefix_extra" "1,1,*,*,*,*,*,*,*,*")
-   (set_attr "length_immediate" "1,1,*,*,*,*,*,*,*,*")
-   (set_attr "prefix" "orig,vex,maybe_vex,maybe_vex,orig,orig,vex,orig,orig,vex")
-   (set_attr "mode" "TI,TI,TI,TI,TI,TI,TI,V4SF,V2SF,V2SF")])
+   (set_attr "prefix_rex" "1,1,1,1,*,*,*,*,*,*,*")
+   (set_attr "prefix_extra" "1,1,1,*,*,*,*,*,*,*,*")
+   (set_attr "length_immediate" "1,1,1,*,*,*,*,*,*,*,*")
+   (set_attr "prefix" "orig,orig,vex,maybe_vex,maybe_vex,orig,orig,vex,orig,orig,vex")
+   (set_attr "mode" "TI,TI,TI,TI,TI,TI,TI,TI,V4SF,V2SF,V2SF")])
 
 (define_expand "vec_unpacks_lo_<mode>"
   [(match_operand:<sseunpackmode> 0 "register_operand")
@@ -11552,91 +11569,95 @@ 
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 
 (define_insn "<sse4_1>_blend<ssemodesuffix><avxsizesuffix>"
-  [(set (match_operand:VF_128_256 0 "register_operand" "=x,x")
+  [(set (match_operand:VF_128_256 0 "register_operand" "=Yr,?x,x")
 	(vec_merge:VF_128_256
-	  (match_operand:VF_128_256 2 "nonimmediate_operand" "xm,xm")
-	  (match_operand:VF_128_256 1 "register_operand" "0,x")
+	  (match_operand:VF_128_256 2 "nonimmediate_operand" "Yrm,?xm,xm")
+	  (match_operand:VF_128_256 1 "register_operand" "0,0,x")
 	  (match_operand:SI 3 "const_0_to_<blendbits>_operand")))]
   "TARGET_SSE4_1"
   "@
    blend<ssemodesuffix>\t{%3, %2, %0|%0, %2, %3}
+   blend<ssemodesuffix>\t{%3, %2, %0|%0, %2, %3}
    vblend<ssemodesuffix>\t{%3, %2, %1, %0|%0, %1, %2, %3}"
-  [(set_attr "isa" "noavx,avx")
+  [(set_attr "isa" "noavx,noavx,avx")
    (set_attr "type" "ssemov")
    (set_attr "length_immediate" "1")
-   (set_attr "prefix_data16" "1,*")
+   (set_attr "prefix_data16" "1,1,*")
    (set_attr "prefix_extra" "1")
-   (set_attr "prefix" "orig,vex")
+   (set_attr "prefix" "orig,orig,vex")
    (set_attr "mode" "<MODE>")])
 
 (define_insn "<sse4_1>_blendv<ssemodesuffix><avxsizesuffix>"
-  [(set (match_operand:VF_128_256 0 "register_operand" "=x,x")
+  [(set (match_operand:VF_128_256 0 "register_operand" "=Yr,?x,x")
 	(unspec:VF_128_256
-	  [(match_operand:VF_128_256 1 "register_operand" "0,x")
-	   (match_operand:VF_128_256 2 "nonimmediate_operand" "xm,xm")
-	   (match_operand:VF_128_256 3 "register_operand" "Yz,x")]
+	  [(match_operand:VF_128_256 1 "register_operand" "0,0,x")
+	   (match_operand:VF_128_256 2 "nonimmediate_operand" "Yrm,?xm,xm")
+	   (match_operand:VF_128_256 3 "register_operand" "Yz,Yz,x")]
 	  UNSPEC_BLENDV))]
   "TARGET_SSE4_1"
   "@
    blendv<ssemodesuffix>\t{%3, %2, %0|%0, %2, %3}
+   blendv<ssemodesuffix>\t{%3, %2, %0|%0, %2, %3}
    vblendv<ssemodesuffix>\t{%3, %2, %1, %0|%0, %1, %2, %3}"
-  [(set_attr "isa" "noavx,avx")
+  [(set_attr "isa" "noavx,noavx,avx")
    (set_attr "type" "ssemov")
    (set_attr "length_immediate" "1")
-   (set_attr "prefix_data16" "1,*")
+   (set_attr "prefix_data16" "1,1,*")
    (set_attr "prefix_extra" "1")
-   (set_attr "prefix" "orig,vex")
-   (set_attr "btver2_decode" "vector,vector") 
+   (set_attr "prefix" "orig,orig,vex")
+   (set_attr "btver2_decode" "vector,vector,vector") 
    (set_attr "mode" "<MODE>")])
 
 (define_insn "<sse4_1>_dp<ssemodesuffix><avxsizesuffix>"
-  [(set (match_operand:VF_128_256 0 "register_operand" "=x,x")
+  [(set (match_operand:VF_128_256 0 "register_operand" "=Yr,?x,x")
 	(unspec:VF_128_256
-	  [(match_operand:VF_128_256 1 "nonimmediate_operand" "%0,x")
-	   (match_operand:VF_128_256 2 "nonimmediate_operand" "xm,xm")
-	   (match_operand:SI 3 "const_0_to_255_operand" "n,n")]
+	  [(match_operand:VF_128_256 1 "nonimmediate_operand" "%0,0,x")
+	   (match_operand:VF_128_256 2 "nonimmediate_operand" "Yrm,?xm,xm")
+	   (match_operand:SI 3 "const_0_to_255_operand" "n,n,n")]
 	  UNSPEC_DP))]
   "TARGET_SSE4_1"
   "@
    dp<ssemodesuffix>\t{%3, %2, %0|%0, %2, %3}
+   dp<ssemodesuffix>\t{%3, %2, %0|%0, %2, %3}
    vdp<ssemodesuffix>\t{%3, %2, %1, %0|%0, %1, %2, %3}"
-  [(set_attr "isa" "noavx,avx")
+  [(set_attr "isa" "noavx,noavx,avx")
    (set_attr "type" "ssemul")
    (set_attr "length_immediate" "1")
-   (set_attr "prefix_data16" "1,*")
+   (set_attr "prefix_data16" "1,1,*")
    (set_attr "prefix_extra" "1")
-   (set_attr "prefix" "orig,vex")
-   (set_attr "btver2_decode" "vector,vector")
+   (set_attr "prefix" "orig,orig,vex")
+   (set_attr "btver2_decode" "vector,vector,vector")
    (set_attr "mode" "<MODE>")])
 
 (define_insn "<sse4_1_avx2>_movntdqa"
-  [(set (match_operand:VI8_AVX2_AVX512F 0 "register_operand" "=x, v")
-	(unspec:VI8_AVX2_AVX512F [(match_operand:VI8_AVX2_AVX512F 1 "memory_operand" "m, m")]
+  [(set (match_operand:VI8_AVX2_AVX512F 0 "register_operand" "=Yr,?x, v")
+	(unspec:VI8_AVX2_AVX512F [(match_operand:VI8_AVX2_AVX512F 1 "memory_operand" "m, m, m")]
 		     UNSPEC_MOVNTDQA))]
   "TARGET_SSE4_1"
   "%vmovntdqa\t{%1, %0|%0, %1}"
   [(set_attr "type" "ssemov")
-   (set_attr "prefix_extra" "1, *")
-   (set_attr "prefix" "maybe_vex, evex")
+   (set_attr "prefix_extra" "1,1,*")
+   (set_attr "prefix" "maybe_vex,maybe_vex,evex")
    (set_attr "mode" "<sseinsnmode>")])
 
 (define_insn "<sse4_1_avx2>_mpsadbw"
-  [(set (match_operand:VI1_AVX2 0 "register_operand" "=x,x")
+  [(set (match_operand:VI1_AVX2 0 "register_operand" "=Yr,?x,x")
 	(unspec:VI1_AVX2
-	  [(match_operand:VI1_AVX2 1 "register_operand" "0,x")
-	   (match_operand:VI1_AVX2 2 "nonimmediate_operand" "xm,xm")
-	   (match_operand:SI 3 "const_0_to_255_operand" "n,n")]
+	  [(match_operand:VI1_AVX2 1 "register_operand" "0,0,x")
+	   (match_operand:VI1_AVX2 2 "nonimmediate_operand" "Yrm,?xm,xm")
+	   (match_operand:SI 3 "const_0_to_255_operand" "n,n,n")]
 	  UNSPEC_MPSADBW))]
   "TARGET_SSE4_1"
   "@
    mpsadbw\t{%3, %2, %0|%0, %2, %3}
+   mpsadbw\t{%3, %2, %0|%0, %2, %3}
    vmpsadbw\t{%3, %2, %1, %0|%0, %1, %2, %3}"
-  [(set_attr "isa" "noavx,avx")
+  [(set_attr "isa" "noavx,noavx,avx")
    (set_attr "type" "sselog1")
    (set_attr "length_immediate" "1")
    (set_attr "prefix_extra" "1")
-   (set_attr "prefix" "orig,vex")
-   (set_attr "btver2_decode" "vector,vector")
+   (set_attr "prefix" "orig,orig,vex")
+   (set_attr "btver2_decode" "vector,vector,vector")
    (set_attr "mode" "<sseinsnmode>")])
 
 (define_insn "avx2_packusdw"
@@ -11654,56 +11675,59 @@ 
    (set_attr "mode" "OI")])
 
 (define_insn "sse4_1_packusdw"
-  [(set (match_operand:V8HI 0 "register_operand" "=x,x")
+  [(set (match_operand:V8HI 0 "register_operand" "=Yr,?x,x")
 	(vec_concat:V8HI
 	  (us_truncate:V4HI
-	    (match_operand:V4SI 1 "register_operand" "0,x"))
+	    (match_operand:V4SI 1 "register_operand" "0,0,x"))
 	  (us_truncate:V4HI
-	    (match_operand:V4SI 2 "nonimmediate_operand" "xm,xm"))))]
+	    (match_operand:V4SI 2 "nonimmediate_operand" "Yrm,?xm,xm"))))]
   "TARGET_SSE4_1"
   "@
    packusdw\t{%2, %0|%0, %2}
+   packusdw\t{%2, %0|%0, %2}
    vpackusdw\t{%2, %1, %0|%0, %1, %2}"
-  [(set_attr "isa" "noavx,avx")
+  [(set_attr "isa" "noavx,noavx,avx")
    (set_attr "type" "sselog")
    (set_attr "prefix_extra" "1")
-   (set_attr "prefix" "orig,vex")
+   (set_attr "prefix" "orig,orig,vex")
    (set_attr "mode" "TI")])
 
 (define_insn "<sse4_1_avx2>_pblendvb"
-  [(set (match_operand:VI1_AVX2 0 "register_operand" "=x,x")
+  [(set (match_operand:VI1_AVX2 0 "register_operand" "=Yr,?x,x")
 	(unspec:VI1_AVX2
-	  [(match_operand:VI1_AVX2 1 "register_operand"  "0,x")
-	   (match_operand:VI1_AVX2 2 "nonimmediate_operand" "xm,xm")
-	   (match_operand:VI1_AVX2 3 "register_operand" "Yz,x")]
+	  [(match_operand:VI1_AVX2 1 "register_operand"  "0,0,x")
+	   (match_operand:VI1_AVX2 2 "nonimmediate_operand" "Yrm,?xm,xm")
+	   (match_operand:VI1_AVX2 3 "register_operand" "Yz,Yz,x")]
 	  UNSPEC_BLENDV))]
   "TARGET_SSE4_1"
   "@
    pblendvb\t{%3, %2, %0|%0, %2, %3}
+   pblendvb\t{%3, %2, %0|%0, %2, %3}
    vpblendvb\t{%3, %2, %1, %0|%0, %1, %2, %3}"
-  [(set_attr "isa" "noavx,avx")
+  [(set_attr "isa" "noavx,noavx,avx")
    (set_attr "type" "ssemov")
    (set_attr "prefix_extra" "1")
-   (set_attr "length_immediate" "*,1")
-   (set_attr "prefix" "orig,vex")
-   (set_attr "btver2_decode" "vector,vector")
+   (set_attr "length_immediate" "*,*,1")
+   (set_attr "prefix" "orig,orig,vex")
+   (set_attr "btver2_decode" "vector,vector,vector")
    (set_attr "mode" "<sseinsnmode>")])
 
 (define_insn "sse4_1_pblendw"
-  [(set (match_operand:V8HI 0 "register_operand" "=x,x")
+  [(set (match_operand:V8HI 0 "register_operand" "=Yr,?x,x")
 	(vec_merge:V8HI
-	  (match_operand:V8HI 2 "nonimmediate_operand" "xm,xm")
-	  (match_operand:V8HI 1 "register_operand" "0,x")
-	  (match_operand:SI 3 "const_0_to_255_operand" "n,n")))]
+	  (match_operand:V8HI 2 "nonimmediate_operand" "Yrm,?xm,xm")
+	  (match_operand:V8HI 1 "register_operand" "0,0,x")
+	  (match_operand:SI 3 "const_0_to_255_operand" "n,n,n")))]
   "TARGET_SSE4_1"
   "@
    pblendw\t{%3, %2, %0|%0, %2, %3}
+   pblendw\t{%3, %2, %0|%0, %2, %3}
    vpblendw\t{%3, %2, %1, %0|%0, %1, %2, %3}"
-  [(set_attr "isa" "noavx,avx")
+  [(set_attr "isa" "noavx,noavx,avx")
    (set_attr "type" "ssemov")
    (set_attr "prefix_extra" "1")
    (set_attr "length_immediate" "1")
-   (set_attr "prefix" "orig,vex")
+   (set_attr "prefix" "orig,orig,vex")
    (set_attr "mode" "TI")])
 
 ;; The builtin uses an 8-bit immediate.  Expand that.
@@ -11751,8 +11775,8 @@ 
    (set_attr "mode" "<sseinsnmode>")])
 
 (define_insn "sse4_1_phminposuw"
-  [(set (match_operand:V8HI 0 "register_operand" "=x")
-	(unspec:V8HI [(match_operand:V8HI 1 "nonimmediate_operand" "xm")]
+  [(set (match_operand:V8HI 0 "register_operand" "=Yr,?x")
+	(unspec:V8HI [(match_operand:V8HI 1 "nonimmediate_operand" "Yrm,?xm")]
 		     UNSPEC_PHMINPOSUW))]
   "TARGET_SSE4_1"
   "%vphminposuw\t{%1, %0|%0, %1}"
@@ -11773,10 +11797,10 @@ 
    (set_attr "mode" "OI")])
 
 (define_insn "sse4_1_<code>v8qiv8hi2"
-  [(set (match_operand:V8HI 0 "register_operand" "=x")
+  [(set (match_operand:V8HI 0 "register_operand" "=Yr,?x")
 	(any_extend:V8HI
 	  (vec_select:V8QI
-	    (match_operand:V16QI 1 "nonimmediate_operand" "xm")
+	    (match_operand:V16QI 1 "nonimmediate_operand" "Yrm,?xm")
 	    (parallel [(const_int 0) (const_int 1)
 		       (const_int 2) (const_int 3)
 		       (const_int 4) (const_int 5)
@@ -11816,10 +11840,10 @@ 
    (set_attr "mode" "OI")])
 
 (define_insn "sse4_1_<code>v4qiv4si2"
-  [(set (match_operand:V4SI 0 "register_operand" "=x")
+  [(set (match_operand:V4SI 0 "register_operand" "=Yr,?x")
 	(any_extend:V4SI
 	  (vec_select:V4QI
-	    (match_operand:V16QI 1 "nonimmediate_operand" "xm")
+	    (match_operand:V16QI 1 "nonimmediate_operand" "Yrm,?xm")
 	    (parallel [(const_int 0) (const_int 1)
 		       (const_int 2) (const_int 3)]))))]
   "TARGET_SSE4_1"
@@ -11852,10 +11876,10 @@ 
    (set_attr "mode" "OI")])
 
 (define_insn "sse4_1_<code>v4hiv4si2"
-  [(set (match_operand:V4SI 0 "register_operand" "=x")
+  [(set (match_operand:V4SI 0 "register_operand" "=Yr,?x")
 	(any_extend:V4SI
 	  (vec_select:V4HI
-	    (match_operand:V8HI 1 "nonimmediate_operand" "xm")
+	    (match_operand:V8HI 1 "nonimmediate_operand" "Yrm,?xm")
 	    (parallel [(const_int 0) (const_int 1)
 		       (const_int 2) (const_int 3)]))))]
   "TARGET_SSE4_1"
@@ -11896,10 +11920,10 @@ 
    (set_attr "mode" "OI")])
 
 (define_insn "sse4_1_<code>v2qiv2di2"
-  [(set (match_operand:V2DI 0 "register_operand" "=x")
+  [(set (match_operand:V2DI 0 "register_operand" "=Yr,?x")
 	(any_extend:V2DI
 	  (vec_select:V2QI
-	    (match_operand:V16QI 1 "nonimmediate_operand" "xm")
+	    (match_operand:V16QI 1 "nonimmediate_operand" "Yrm,?xm")
 	    (parallel [(const_int 0) (const_int 1)]))))]
   "TARGET_SSE4_1"
   "%vpmov<extsuffix>bq\t{%1, %0|%0, %w1}"
@@ -11934,10 +11958,10 @@ 
    (set_attr "mode" "OI")])
 
 (define_insn "sse4_1_<code>v2hiv2di2"
-  [(set (match_operand:V2DI 0 "register_operand" "=x")
+  [(set (match_operand:V2DI 0 "register_operand" "=Yr,?x")
 	(any_extend:V2DI
 	  (vec_select:V2HI
-	    (match_operand:V8HI 1 "nonimmediate_operand" "xm")
+	    (match_operand:V8HI 1 "nonimmediate_operand" "Yrm,?xm")
 	    (parallel [(const_int 0) (const_int 1)]))))]
   "TARGET_SSE4_1"
   "%vpmov<extsuffix>wq\t{%1, %0|%0, %k1}"
@@ -11968,10 +11992,10 @@ 
    (set_attr "mode" "OI")])
 
 (define_insn "sse4_1_<code>v2siv2di2"
-  [(set (match_operand:V2DI 0 "register_operand" "=x")
+  [(set (match_operand:V2DI 0 "register_operand" "=Yr,?x")
 	(any_extend:V2DI
 	  (vec_select:V2SI
-	    (match_operand:V4SI 1 "nonimmediate_operand" "xm")
+	    (match_operand:V4SI 1 "nonimmediate_operand" "Yrm,?xm")
 	    (parallel [(const_int 0) (const_int 1)]))))]
   "TARGET_SSE4_1"
   "%vpmov<extsuffix>dq\t{%1, %0|%0, %q1}"
@@ -12012,8 +12036,8 @@ 
 
 (define_insn "sse4_1_ptest"
   [(set (reg:CC FLAGS_REG)
-	(unspec:CC [(match_operand:V2DI 0 "register_operand" "x")
-		    (match_operand:V2DI 1 "nonimmediate_operand" "xm")]
+	(unspec:CC [(match_operand:V2DI 0 "register_operand" "Yr,?x")
+		    (match_operand:V2DI 1 "nonimmediate_operand" "Yrm,?xm")]
 		   UNSPEC_PTEST))]
   "TARGET_SSE4_1"
   "%vptest\t{%1, %0|%0, %1}"
@@ -12023,10 +12047,10 @@ 
    (set_attr "mode" "TI")])
 
 (define_insn "<sse4_1>_round<ssemodesuffix><avxsizesuffix>"
-  [(set (match_operand:VF_128_256 0 "register_operand" "=x")
+  [(set (match_operand:VF_128_256 0 "register_operand" "=Yr,?x")
 	(unspec:VF_128_256
-	  [(match_operand:VF_128_256 1 "nonimmediate_operand" "xm")
-	   (match_operand:SI 2 "const_0_to_15_operand" "n")]
+	  [(match_operand:VF_128_256 1 "nonimmediate_operand" "Yrm,?xm")
+	   (match_operand:SI 2 "const_0_to_15_operand" "n,n")]
 	  UNSPEC_ROUND))]
   "TARGET_ROUND"
   "%vround<ssemodesuffix>\t{%2, %1, %0|%0, %1, %2}"
@@ -12106,24 +12130,25 @@ 
 })
 
 (define_insn "sse4_1_round<ssescalarmodesuffix>"
-  [(set (match_operand:VF_128 0 "register_operand" "=x,x")
+  [(set (match_operand:VF_128 0 "register_operand" "=Yr,?x,x")
 	(vec_merge:VF_128
 	  (unspec:VF_128
-	    [(match_operand:VF_128 2 "register_operand" "x,x")
-	     (match_operand:SI 3 "const_0_to_15_operand" "n,n")]
+	    [(match_operand:VF_128 2 "register_operand" "Yr,?x,x")
+	     (match_operand:SI 3 "const_0_to_15_operand" "n,n,n")]
 	    UNSPEC_ROUND)
-	  (match_operand:VF_128 1 "register_operand" "0,x")
+	  (match_operand:VF_128 1 "register_operand" "0,0,x")
 	  (const_int 1)))]
   "TARGET_ROUND"
   "@
    round<ssescalarmodesuffix>\t{%3, %2, %0|%0, %2, %3}
+   round<ssescalarmodesuffix>\t{%3, %2, %0|%0, %2, %3}
    vround<ssescalarmodesuffix>\t{%3, %2, %1, %0|%0, %1, %2, %3}"
-  [(set_attr "isa" "noavx,avx")
+  [(set_attr "isa" "noavx,noavx,avx")
    (set_attr "type" "ssecvt")
    (set_attr "length_immediate" "1")
-   (set_attr "prefix_data16" "1,*")
+   (set_attr "prefix_data16" "1,1,*")
    (set_attr "prefix_extra" "1")
-   (set_attr "prefix" "orig,vex")
+   (set_attr "prefix" "orig,orig,vex")
    (set_attr "mode" "<MODE>")])
 
 (define_expand "round<mode>2"
diff --git a/gcc/config/i386/subst.md b/gcc/config/i386/subst.md
index 1654cba..c6d7178 100644
--- a/gcc/config/i386/subst.md
+++ b/gcc/config/i386/subst.md
@@ -57,6 +57,7 @@ 
 (define_subst_attr "mask_prefix" "mask" "vex" "evex")
 (define_subst_attr "mask_prefix2" "mask" "maybe_vex" "evex")
 (define_subst_attr "mask_prefix3" "mask" "orig,vex" "evex")
+(define_subst_attr "mask_prefix4" "mask" "orig,orig,vex" "evex")
 
 (define_subst "mask"
   [(set (match_operand:SUBST_V 0)
diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
index cb44dc3..825ef14 100644
--- a/gcc/config/i386/x86-tune.def
+++ b/gcc/config/i386/x86-tune.def
@@ -395,6 +395,10 @@  DEF_TUNE (X86_TUNE_SLOW_PSHUFB, "slow_pshufb",
 DEF_TUNE (X86_TUNE_VECTOR_PARALLEL_EXECUTION, "vec_parallel",
           m_NEHALEM | m_SANDYBRIDGE | m_HASWELL)
 
+/* X86_TUNE_AVOID_4BYTE_PREFIXES: Avoid instructions requiring 4+ bytes of prefixes.  */
+DEF_TUNE (X86_TUNE_AVOID_4BYTE_PREFIXES, "avoid_4byte_prefixes",
+          m_SILVERMONT | m_INTEL)
+
 /*****************************************************************************/
 /* AVX instruction selection tuning (some of SSE flags affects AVX, too)     */
 /*****************************************************************************/
diff --git a/gcc/testsuite/gcc.target/i386/sse2-init-v2di-2.c b/gcc/testsuite/gcc.target/i386/sse2-init-v2di-2.c
index 6a50573..4739e4f 100644
--- a/gcc/testsuite/gcc.target/i386/sse2-init-v2di-2.c
+++ b/gcc/testsuite/gcc.target/i386/sse2-init-v2di-2.c
@@ -10,4 +10,4 @@  test (long long b)
   return _mm_cvtsi64_si128 (b); 
 }
 
-/* { dg-final { scan-assembler-times "vec_concatv2di/3" 1 } } */
+/* { dg-final { scan-assembler-times "vec_concatv2di/4" 1 } } */