Patchwork [i386] scalar ops that preserve the high part of a vector

Submitter Marc Glisse
Date Dec. 1, 2012, 5:27 p.m.
Message ID <alpine.DEB.2.02.1212011800400.19206@stedding.saclay.inria.fr>
Permalink /patch/203142/
State New

Comments

Marc Glisse - Dec. 1, 2012, 5:27 p.m.
Hello,

here is a patch. If it is accepted, I'll extend it to other vm patterns 
(mul, div, min, max are likely candidates, but I need to check the doc). 
It passed bootstrap+testsuite on x86_64-linux.


2012-12-01  Marc Glisse  <marc.glisse@inria.fr>

 	PR target/54855
gcc/
 	* config/i386/sse.md (<sse>_vm<plusminus_insn><mode>3): Rewrite
 	pattern.
 	* config/i386/i386-builtin-types.def: New function types.
 	* config/i386/i386.c (ix86_expand_args_builtin): Likewise.
 	(bdesc_args) <__builtin_ia32_addss, __builtin_ia32_subss,
 	__builtin_ia32_addsd, __builtin_ia32_subsd>: Change prototype.
 	* config/i386/xmmintrin.h: Adapt to new builtin prototype.
 	* config/i386/emmintrin.h: Likewise.
 	* doc/extend.texi (X86 Built-in Functions): Document changed prototype.

testsuite/
 	* gcc.target/i386/pr54855-1.c: New testcase.
 	* gcc.target/i386/pr54855-2.c: New testcase.
Uros Bizjak - Dec. 2, 2012, 10:51 a.m.
On Sat, Dec 1, 2012 at 6:27 PM, Marc Glisse <marc.glisse@inria.fr> wrote:

> here is a patch. If it is accepted, I'll extend it to other vm patterns
> (mul, div, min, max are likely candidates, but I need to check the doc). It
> passed bootstrap+testsuite on x86_64-linux.
>
>
> 2012-12-01  Marc Glisse  <marc.glisse@inria.fr>
>
>         PR target/54855
> gcc/
>         * config/i386/sse.md (<sse>_vm<plusminus_insn><mode>3): Rewrite
>         pattern.
>         * config/i386/i386-builtin-types.def: New function types.
>         * config/i386/i386.c (ix86_expand_args_builtin): Likewise.
>         (bdesc_args) <__builtin_ia32_addss, __builtin_ia32_subss,
>         __builtin_ia32_addsd, __builtin_ia32_subsd>: Change prototype.
>         * config/i386/xmmintrin.h: Adapt to new builtin prototype.
>         * config/i386/emmintrin.h: Likewise.
>         * doc/extend.texi (X86 Built-in Functions): Document changed
> prototype.
>
> testsuite/
>         * gcc.target/i386/pr54855-1.c: New testcase.
>         * gcc.target/i386/pr54855-2.c: New testcase.

Yes, the approach looks correct to me, but I wonder why we have
different representations for the v4sf and v2df cases. I'd say that we
should canonicalize patterns somewhere in the middle end (probably to
the vec_merge variant, as IMO vec_dup looks like a degenerate vec_merge
variant), otherwise we will have a pattern explosion.

However, the patch is too late for 4.8, but it is definitely a wanted
generalization and a fix of a (partially) wrong representation.

I have also CCed HJ for his opinion, since the patch touches published headers.

Thanks,
Uros.
Marc Glisse - Dec. 2, 2012, 12:30 p.m.
On Sun, 2 Dec 2012, Uros Bizjak wrote:

> On Sat, Dec 1, 2012 at 6:27 PM, Marc Glisse <marc.glisse@inria.fr> wrote:
>
>> here is a patch. If it is accepted, I'll extend it to other vm patterns
>> (mul, div, min, max are likely candidates, but I need to check the doc). It
>> passed bootstrap+testsuite on x86_64-linux.
>>
>>
>> 2012-12-01  Marc Glisse  <marc.glisse@inria.fr>
>>
>>         PR target/54855
>> gcc/
>>         * config/i386/sse.md (<sse>_vm<plusminus_insn><mode>3): Rewrite
>>         pattern.
>>         * config/i386/i386-builtin-types.def: New function types.
>>         * config/i386/i386.c (ix86_expand_args_builtin): Likewise.
>>         (bdesc_args) <__builtin_ia32_addss, __builtin_ia32_subss,
>>         __builtin_ia32_addsd, __builtin_ia32_subsd>: Change prototype.
>>         * config/i386/xmmintrin.h: Adapt to new builtin prototype.
>>         * config/i386/emmintrin.h: Likewise.
>>         * doc/extend.texi (X86 Built-in Functions): Document changed
>> prototype.
>>
>> testsuite/
>>         * gcc.target/i386/pr54855-1.c: New testcase.
>>         * gcc.target/i386/pr54855-2.c: New testcase.
>
> Yes, the approach looks correct to me, but I wonder why we have
> different representations for the v4sf and v2df cases. I'd say that we
> should canonicalize patterns somewhere in the middle end (probably to
> the vec_merge variant, as IMO vec_dup looks like a degenerate vec_merge
> variant), otherwise we will have a pattern explosion.

(I assume s/vec_dup/vec_concat/ above)

Note that this comes from ix86_expand_vector_set, which purposely uses 
VEC_CONCAT for V2DF and VEC_MERGE for V4SF. It is true that we could use 
the VEC_MERGE version more widely, but this code, which selects the most 
appropriate pattern depending on the mode, seems good to me. And I wouldn't 
call the few extra entries in sse.md an explosion quite yet...

(Also, using VEC_DUPLICATE is quite artificial; in the special case where 
we set the first element of the vector, a subreg should work as well.)
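For concreteness, the two forms under discussion, for writing a new scalar
Y into element 0 of a vector X, look roughly like this (a sketch; the exact
operands are in the patch below):

;; vec_merge form (what ix86_expand_vector_set emits for V4SF):
;; duplicate the scalar, then blend element 0 into X.
(vec_merge:V4SF (vec_duplicate:V4SF Y)
                X
                (const_int 1))

;; vec_concat form (what it emits for V2DF):
;; rebuild the vector from the new low element and the old high element.
(vec_concat:V2DF Y
                 (vec_select:DF X (parallel [(const_int 1)])))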


> However, the patch is too late for 4.8,

That's fine, I can hold it for 4.9. I'd like to finalize the patch now 
while it is fresh though (I would still redo a quick bootstrap+testsuite 
before commit when trunk re-opens).

Thanks,
Uros Bizjak - Dec. 3, 2012, 8:53 a.m.
On Sun, Dec 2, 2012 at 1:30 PM, Marc Glisse <marc.glisse@inria.fr> wrote:

>>> here is a patch. If it is accepted, I'll extend it to other vm patterns
>>> (mul, div, min, max are likely candidates, but I need to check the doc).
>>> It
>>> passed bootstrap+testsuite on x86_64-linux.
>>>
>>>
>>> 2012-12-01  Marc Glisse  <marc.glisse@inria.fr>
>>>
>>>         PR target/54855
>>> gcc/
>>>         * config/i386/sse.md (<sse>_vm<plusminus_insn><mode>3): Rewrite
>>>         pattern.
>>>         * config/i386/i386-builtin-types.def: New function types.
>>>         * config/i386/i386.c (ix86_expand_args_builtin): Likewise.
>>>         (bdesc_args) <__builtin_ia32_addss, __builtin_ia32_subss,
>>>         __builtin_ia32_addsd, __builtin_ia32_subsd>: Change prototype.
>>>         * config/i386/xmmintrin.h: Adapt to new builtin prototype.
>>>         * config/i386/emmintrin.h: Likewise.
>>>         * doc/extend.texi (X86 Built-in Functions): Document changed
>>> prototype.
>>>
>>> testsuite/
>>>         * gcc.target/i386/pr54855-1.c: New testcase.
>>>         * gcc.target/i386/pr54855-2.c: New testcase.
>>
>>
>> Yes, the approach looks correct to me, but I wonder why we have
>> different representations for the v4sf and v2df cases. I'd say that we
>> should canonicalize patterns somewhere in the middle end (probably to
>> the vec_merge variant, as IMO vec_dup looks like a degenerate vec_merge
>> variant), otherwise we will have a pattern explosion.
>
>
> (I assume s/vec_dup/vec_concat/ above)

Ah, yes.

However, looking a bit more into the use cases for these patterns, I
see they are only used through intrinsics with _m128 operands. While
your proposed patch makes these patterns more general (they can use
64-bit-aligned memory), this is not their usual usage, and for their
intended usage, your proposed improvement complicates these patterns
unnecessarily. Given these facts, I'd say that we leave these special
patterns (since they serve their purpose well) and rather introduce
new patterns for "other" uses.
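Concretely, the distinction is between something like the following (a
sketch; only the second function relies on the generalized 64-bit memory
operand):

#include <emmintrin.h>

/* The intended use of the existing pattern: both operands are __m128d.  */
__m128d f (__m128d a, __m128d b)
{
  return _mm_add_sd (a, b);
}

/* The more general use: with the new pattern, the load of *p can be
   folded into addsd as a 64-bit memory operand.  */
__m128d g (__m128d a, const double *p)
{
  return _mm_add_sd (a, _mm_set_sd (*p));
}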

Uros.
Marc Glisse - Dec. 3, 2012, 3:34 p.m.
On Mon, 3 Dec 2012, Uros Bizjak wrote:

> On Sun, Dec 2, 2012 at 1:30 PM, Marc Glisse <marc.glisse@inria.fr> wrote:
>
>>>> here is a patch. If it is accepted, I'll extend it to other vm patterns
>>>> (mul, div, min, max are likely candidates, but I need to check the doc).
>>>> It
>>>> passed bootstrap+testsuite on x86_64-linux.
>>>>
>>>>
>>>> 2012-12-01  Marc Glisse  <marc.glisse@inria.fr>
>>>>
>>>>         PR target/54855
>>>> gcc/
>>>>         * config/i386/sse.md (<sse>_vm<plusminus_insn><mode>3): Rewrite
>>>>         pattern.
>>>>         * config/i386/i386-builtin-types.def: New function types.
>>>>         * config/i386/i386.c (ix86_expand_args_builtin): Likewise.
>>>>         (bdesc_args) <__builtin_ia32_addss, __builtin_ia32_subss,
>>>>         __builtin_ia32_addsd, __builtin_ia32_subsd>: Change prototype.
>>>>         * config/i386/xmmintrin.h: Adapt to new builtin prototype.
>>>>         * config/i386/emmintrin.h: Likewise.
>>>>         * doc/extend.texi (X86 Built-in Functions): Document changed
>>>> prototype.
>>>>
>>>> testsuite/
>>>>         * gcc.target/i386/pr54855-1.c: New testcase.
>>>>         * gcc.target/i386/pr54855-2.c: New testcase.
>>>
>>>
>>> Yes, the approach looks correct to me, but I wonder why we have
>>> different representations for the v4sf and v2df cases. I'd say that we
>>> should canonicalize patterns somewhere in the middle end (probably to
>>> the vec_merge variant, as IMO vec_dup looks like a degenerate vec_merge
>>> variant), otherwise we will have a pattern explosion.
>>
>>
>> (I assume s/vec_dup/vec_concat/ above)
>
> Ah, yes.
>
> However, looking a bit more into the use cases for these patterns, I
> see they are only used through intrinsics with _m128 operands. While
> your proposed patch makes these patterns more general (they can use
> 64-bit-aligned memory), this is not their usual usage, and for their
> intended usage, your proposed improvement complicates these patterns
> unnecessarily. Given these facts, I'd say that we leave these special
> patterns (since they serve their purpose well) and rather introduce
> new patterns for "other" uses.

You mean like in the original patch?
http://gcc.gnu.org/ml/gcc-patches/2012-10/msg01279.html

(it only had the V2DF version, not the V4SF one)

Funny how we switched sides: now I am the one who would rather have a 
single pattern instead of having one for the builtin and one for recog. It 
seems that once we add the new pattern, keeping the old one is a waste of 
maintenance time, and the few extra rtx from the slightly longer pattern 
for these seldom-used builtins should be negligible.

But I don't mind; if that's the version you prefer, I'll update the patch.

Thanks,
Uros Bizjak - Dec. 3, 2012, 5:55 p.m.
On Mon, Dec 3, 2012 at 4:34 PM, Marc Glisse <marc.glisse@inria.fr> wrote:

>> However, looking a bit more into the use cases for these patterns, I
>> see they are only used through intrinsics with _m128 operands. While
>> your proposed patch makes these patterns more general (they can use
>> 64-bit-aligned memory), this is not their usual usage, and for their
>> intended usage, your proposed improvement complicates these patterns
>> unnecessarily. Given these facts, I'd say that we leave these special
>> patterns (since they serve their purpose well) and rather introduce
>> new patterns for "other" uses.
>
>
> You mean like in the original patch?
> http://gcc.gnu.org/ml/gcc-patches/2012-10/msg01279.html
>
> (it only had the V2DF version, not the V4SF one)
>
> Funny how we switched sides: now I am the one who would rather have a single
> pattern instead of having one for the builtin and one for recog. It seems
> that once we add the new pattern, keeping the old one is a waste of
> maintenance time, and the few extra rtx from the slightly longer pattern for
> these seldom-used builtins should be negligible.

Yes, I didn't notice at the time that the intention of the existing
patterns was to implement intrinsics that exclusively use _m128
operands.

> But I don't mind; if that's the version you prefer, I'll update the patch.

Actually, both approaches have their benefits and drawbacks.
Specialized vec_merge patterns can be efficiently macroized, and they
support builtins with _m128 operands in a simple and efficient way.
You are proposing patterns that do not macroize well (this is what was
learned from your last patch) and that require breaking up existing
macroized patterns.

So we are actually adding new functionality: operations on an array
of values. IMO, this warrants new patterns, but please find a way for
V2DF and V4SF to macroize in the same way.

Uros.
Marc Glisse - Dec. 4, 2012, 2:05 p.m.
On Mon, 3 Dec 2012, Uros Bizjak wrote:

> On Mon, Dec 3, 2012 at 4:34 PM, Marc Glisse <marc.glisse@inria.fr> wrote:
>
>>> However, looking a bit more into the use cases for these patterns, I
>>> see they are only used through intrinsics with _m128 operands. While
>>> your proposed patch makes these patterns more general (they can use
>>> 64-bit-aligned memory), this is not their usual usage, and for their
>>> intended usage, your proposed improvement complicates these patterns
>>> unnecessarily. Given these facts, I'd say that we leave these special
>>> patterns (since they serve their purpose well) and rather introduce
>>> new patterns for "other" uses.
>>
>>
>> You mean like in the original patch?
>> http://gcc.gnu.org/ml/gcc-patches/2012-10/msg01279.html
>>
>> (it only had the V2DF version, not the V4SF one)
>>
>> Funny how we switched sides: now I am the one who would rather have a single
>> pattern instead of having one for the builtin and one for recog. It seems
>> that once we add the new pattern, keeping the old one is a waste of
>> maintenance time, and the few extra rtx from the slightly longer pattern for
>> these seldom-used builtins should be negligible.
>
> Yes, I didn't notice at the time that the intention of the existing
> patterns was to implement intrinsics that exclusively use _m128
> operands.
>
>> But I don't mind; if that's the version you prefer, I'll update the patch.
>
> Actually, both approaches have their benefits and drawbacks.
> Specialized vec_merge patterns can be efficiently macroized, and they
> support builtins with _m128 operands in a simple and efficient way.
> You are proposing patterns that do not macroize well (this is what was
> learned from your last patch) and that require breaking up existing
> macroized patterns.
>
> So we are actually adding new functionality: operations on an array
> of values. IMO, this warrants new patterns, but please find a way for
> V2DF and V4SF to macroize in the same way.

I am still confused as to what is wanted. If the quantity to minimize is
the number of entries in sse.md, we should replace the existing
vec_merge pattern with this one: it macroizes just as well, it directly
matches for V4SF, and the piece of code needed in simplify-rtx for V2DF
isn't too absurd. (then we need to adjust the builtins as in one of the
previous patches)

[(set (match_operand:VF_128 0 "register_operand" "=x,x")
      (vec_merge:VF_128
        (vec_duplicate:VF_128
          (plusminus:<ssescalarmode>
            (vec_select:<ssescalarmode>
              (match_operand:VF_128 1 "register_operand" "0,x")
              (parallel [(const_int 0)]))
            (match_operand:<ssescalarmode> 2 "nonimmediate_operand" "xm,xm")))
        (match_dup 1)
        (const_int 1)))]

Then there is the question (i) of possibly introducing a specialized
version for V2DF (different pattern) instead of adding code to
simplify-rtx.

And finally there is the question (ii) of keeping the old define_insn in
addition to the new one(s), just for the builtins.

My preference is:
(i) specialized pattern for V2DF
(ii) remove

It seems like you might be ok with:
(i) simplify-rtx
(ii) remove

Do you agree?
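For concreteness, the simplify-rtx route in (i) would amount to rewriting
the V2DF-style form into the vec_merge form that the single pattern above
matches, roughly this equivalence (a sketch, with X the input vector and Y
the scalar operand):

(vec_concat:V2DF
  (plusminus:DF (vec_select:DF X (parallel [(const_int 0)])) Y)
  (vec_select:DF X (parallel [(const_int 1)])))
-->
(vec_merge:V2DF
  (vec_duplicate:V2DF
    (plusminus:DF (vec_select:DF X (parallel [(const_int 0)])) Y))
  X
  (const_int 1))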

Patch

Index: gcc/testsuite/gcc.target/i386/pr54855-2.c
===================================================================
--- gcc/testsuite/gcc.target/i386/pr54855-2.c	(revision 0)
+++ gcc/testsuite/gcc.target/i386/pr54855-2.c	(revision 0)
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O -msse" } */
+
+typedef float vec __attribute__((vector_size(16)));
+
+vec f (vec x)
+{
+  x[0] += 2;
+  return x;
+}
+
+vec g (vec x)
+{
+  x[0] -= 1;
+  return x;
+}
+
+/* { dg-final { scan-assembler-not "mov" } } */

Property changes on: gcc/testsuite/gcc.target/i386/pr54855-2.c
___________________________________________________________________
Added: svn:keywords
   + Author Date Id Revision URL
Added: svn:eol-style
   + native

Index: gcc/testsuite/gcc.target/i386/pr54855-1.c
===================================================================
--- gcc/testsuite/gcc.target/i386/pr54855-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/i386/pr54855-1.c	(revision 0)
@@ -0,0 +1,18 @@
+/* { dg-do compile } */
+/* { dg-options "-O -msse2" } */
+
+typedef double vec __attribute__((vector_size(16)));
+
+vec f (vec x)
+{
+  x[0] += 2;
+  return x;
+}
+
+vec g (vec x)
+{
+  x[0] -= 1;
+  return x;
+}
+
+/* { dg-final { scan-assembler-not "mov" } } */

Property changes on: gcc/testsuite/gcc.target/i386/pr54855-1.c
___________________________________________________________________
Added: svn:eol-style
   + native
Added: svn:keywords
   + Author Date Id Revision URL

Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c	(revision 194017)
+++ gcc/config/i386/i386.c	(working copy)
@@ -27059,22 +27059,22 @@  static const struct builtin_description
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_cvttps2pi, "__builtin_ia32_cvttps2pi", IX86_BUILTIN_CVTTPS2PI, UNKNOWN, (int) V2SI_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_cvttss2si, "__builtin_ia32_cvttss2si", IX86_BUILTIN_CVTTSS2SI, UNKNOWN, (int) INT_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE | OPTION_MASK_ISA_64BIT, CODE_FOR_sse_cvttss2siq, "__builtin_ia32_cvttss2si64", IX86_BUILTIN_CVTTSS2SI64, UNKNOWN, (int) INT64_FTYPE_V4SF },

   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_shufps, "__builtin_ia32_shufps", IX86_BUILTIN_SHUFPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF_INT },

   { OPTION_MASK_ISA_SSE, CODE_FOR_addv4sf3, "__builtin_ia32_addps", IX86_BUILTIN_ADDPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_subv4sf3, "__builtin_ia32_subps", IX86_BUILTIN_SUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_mulv4sf3, "__builtin_ia32_mulps", IX86_BUILTIN_MULPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_divv4sf3, "__builtin_ia32_divps", IX86_BUILTIN_DIVPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
-  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmaddv4sf3,  "__builtin_ia32_addss", IX86_BUILTIN_ADDSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
-  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmsubv4sf3,  "__builtin_ia32_subss", IX86_BUILTIN_SUBSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
+  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmaddv4sf3,  "__builtin_ia32_addss", IX86_BUILTIN_ADDSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_FLOAT },
+  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmsubv4sf3,  "__builtin_ia32_subss", IX86_BUILTIN_SUBSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_FLOAT },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmmulv4sf3,  "__builtin_ia32_mulss", IX86_BUILTIN_MULSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmdivv4sf3,  "__builtin_ia32_divss", IX86_BUILTIN_DIVSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },

   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpeqps", IX86_BUILTIN_CMPEQPS, EQ, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpltps", IX86_BUILTIN_CMPLTPS, LT, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpleps", IX86_BUILTIN_CMPLEPS, LE, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpgtps", IX86_BUILTIN_CMPGTPS, LT, (int) V4SF_FTYPE_V4SF_V4SF_SWAP },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpgeps", IX86_BUILTIN_CMPGEPS, LE, (int) V4SF_FTYPE_V4SF_V4SF_SWAP },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpunordps", IX86_BUILTIN_CMPUNORDPS, UNORDERED, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpneqps", IX86_BUILTIN_CMPNEQPS, NE, (int) V4SF_FTYPE_V4SF_V4SF },
@@ -27163,22 +27163,22 @@  static const struct builtin_description
   { OPTION_MASK_ISA_SSE2 | OPTION_MASK_ISA_64BIT, CODE_FOR_sse2_cvttsd2siq, "__builtin_ia32_cvttsd2si64", IX86_BUILTIN_CVTTSD2SI64, UNKNOWN, (int) INT64_FTYPE_V2DF },

   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_cvtps2dq, "__builtin_ia32_cvtps2dq", IX86_BUILTIN_CVTPS2DQ, UNKNOWN, (int) V4SI_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_cvtps2pd, "__builtin_ia32_cvtps2pd", IX86_BUILTIN_CVTPS2PD, UNKNOWN, (int) V2DF_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_fix_truncv4sfv4si2, "__builtin_ia32_cvttps2dq", IX86_BUILTIN_CVTTPS2DQ, UNKNOWN, (int) V4SI_FTYPE_V4SF },

   { OPTION_MASK_ISA_SSE2, CODE_FOR_addv2df3, "__builtin_ia32_addpd", IX86_BUILTIN_ADDPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_subv2df3, "__builtin_ia32_subpd", IX86_BUILTIN_SUBPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_mulv2df3, "__builtin_ia32_mulpd", IX86_BUILTIN_MULPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_divv2df3, "__builtin_ia32_divpd", IX86_BUILTIN_DIVPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
-  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmaddv2df3,  "__builtin_ia32_addsd", IX86_BUILTIN_ADDSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
-  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmsubv2df3,  "__builtin_ia32_subsd", IX86_BUILTIN_SUBSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
+  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmaddv2df3,  "__builtin_ia32_addsd", IX86_BUILTIN_ADDSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_DOUBLE },
+  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmsubv2df3,  "__builtin_ia32_subsd", IX86_BUILTIN_SUBSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_DOUBLE },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmmulv2df3,  "__builtin_ia32_mulsd", IX86_BUILTIN_MULSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmdivv2df3,  "__builtin_ia32_divsd", IX86_BUILTIN_DIVSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },

   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpeqpd", IX86_BUILTIN_CMPEQPD, EQ, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpltpd", IX86_BUILTIN_CMPLTPD, LT, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmplepd", IX86_BUILTIN_CMPLEPD, LE, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpgtpd", IX86_BUILTIN_CMPGTPD, LT, (int) V2DF_FTYPE_V2DF_V2DF_SWAP },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpgepd", IX86_BUILTIN_CMPGEPD, LE, (int) V2DF_FTYPE_V2DF_V2DF_SWAP},
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpunordpd", IX86_BUILTIN_CMPUNORDPD, UNORDERED, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpneqpd", IX86_BUILTIN_CMPNEQPD, NE, (int) V2DF_FTYPE_V2DF_V2DF },
@@ -30790,34 +30790,36 @@  ix86_expand_args_builtin (const struct b
     case V4HI_FTYPE_V8QI_V8QI:
     case V4HI_FTYPE_V2SI_V2SI:
     case V4DF_FTYPE_V4DF_V4DF:
     case V4DF_FTYPE_V4DF_V4DI:
     case V4SF_FTYPE_V4SF_V4SF:
     case V4SF_FTYPE_V4SF_V4SI:
     case V4SF_FTYPE_V4SF_V2SI:
     case V4SF_FTYPE_V4SF_V2DF:
     case V4SF_FTYPE_V4SF_DI:
     case V4SF_FTYPE_V4SF_SI:
+    case V4SF_FTYPE_V4SF_FLOAT:
     case V2DI_FTYPE_V2DI_V2DI:
     case V2DI_FTYPE_V16QI_V16QI:
     case V2DI_FTYPE_V4SI_V4SI:
     case V2UDI_FTYPE_V4USI_V4USI:
     case V2DI_FTYPE_V2DI_V16QI:
     case V2DI_FTYPE_V2DF_V2DF:
     case V2SI_FTYPE_V2SI_V2SI:
     case V2SI_FTYPE_V4HI_V4HI:
     case V2SI_FTYPE_V2SF_V2SF:
     case V2DF_FTYPE_V2DF_V2DF:
     case V2DF_FTYPE_V2DF_V4SF:
     case V2DF_FTYPE_V2DF_V2DI:
     case V2DF_FTYPE_V2DF_DI:
     case V2DF_FTYPE_V2DF_SI:
+    case V2DF_FTYPE_V2DF_DOUBLE:
     case V2SF_FTYPE_V2SF_V2SF:
     case V1DI_FTYPE_V1DI_V1DI:
     case V1DI_FTYPE_V8QI_V8QI:
     case V1DI_FTYPE_V2SI_V2SI:
     case V32QI_FTYPE_V16HI_V16HI:
     case V16HI_FTYPE_V8SI_V8SI:
     case V32QI_FTYPE_V32QI_V32QI:
     case V16HI_FTYPE_V32QI_V32QI:
     case V16HI_FTYPE_V16HI_V16HI:
     case V8SI_FTYPE_V4DF_V4DF:
Index: gcc/config/i386/xmmintrin.h
===================================================================
--- gcc/config/i386/xmmintrin.h	(revision 194017)
+++ gcc/config/i386/xmmintrin.h	(working copy)
@@ -92,27 +92,27 @@  _mm_setzero_ps (void)
   return __extension__ (__m128){ 0.0f, 0.0f, 0.0f, 0.0f };
 }

 /* Perform the respective operation on the lower SPFP (single-precision
    floating-point) values of A and B; the upper three SPFP values are
    passed through from A.  */

 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_ss (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_addss ((__v4sf)__A, (__v4sf)__B);
+  return (__m128) __builtin_ia32_addss ((__v4sf)__A, __B[0]);
 }

 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_ss (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_subss ((__v4sf)__A, (__v4sf)__B);
+  return (__m128) __builtin_ia32_subss ((__v4sf)__A, __B[0]);
 }

 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_ss (__m128 __A, __m128 __B)
 {
   return (__m128) __builtin_ia32_mulss ((__v4sf)__A, (__v4sf)__B);
 }

 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_div_ss (__m128 __A, __m128 __B)
Index: gcc/config/i386/emmintrin.h
===================================================================
--- gcc/config/i386/emmintrin.h	(revision 194017)
+++ gcc/config/i386/emmintrin.h	(working copy)
@@ -226,33 +226,33 @@  _mm_cvtsi128_si64x (__m128i __A)

 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_pd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_addpd ((__v2df)__A, (__v2df)__B);
 }

 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_sd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_addsd ((__v2df)__A, (__v2df)__B);
+  return (__m128d)__builtin_ia32_addsd ((__v2df)__A, __B[0]);
 }

 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_pd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_subpd ((__v2df)__A, (__v2df)__B);
 }

 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_sd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_subsd ((__v2df)__A, (__v2df)__B);
+  return (__m128d)__builtin_ia32_subsd ((__v2df)__A, __B[0]);
 }

 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_pd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_mulpd ((__v2df)__A, (__v2df)__B);
 }

 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_sd (__m128d __A, __m128d __B)
Index: gcc/config/i386/sse.md
===================================================================
--- gcc/config/i386/sse.md	(revision 194017)
+++ gcc/config/i386/sse.md	(working copy)
@@ -855,36 +855,57 @@
 	  (match_operand:VF 2 "nonimmediate_operand" "xm,xm")))]
   "TARGET_SSE && ix86_binary_operator_ok (<CODE>, <MODE>mode, operands)"
   "@
    <plusminus_mnemonic><ssemodesuffix>\t{%2, %0|%0, %2}
    v<plusminus_mnemonic><ssemodesuffix>\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "isa" "noavx,avx")
    (set_attr "type" "sseadd")
    (set_attr "prefix" "orig,vex")
    (set_attr "mode" "<MODE>")])

-(define_insn "<sse>_vm<plusminus_insn><mode>3"
-  [(set (match_operand:VF_128 0 "register_operand" "=x,x")
-	(vec_merge:VF_128
-	  (plusminus:VF_128
-	    (match_operand:VF_128 1 "register_operand" "0,x")
-	    (match_operand:VF_128 2 "nonimmediate_operand" "xm,xm"))
+(define_insn "sse_vm<plusminus_insn>v4sf3"
+  [(set (match_operand:V4SF 0 "register_operand" "=x,x")
+	(vec_merge:V4SF
+	  (vec_duplicate:V4SF
+	    (plusminus:SF
+	      (vec_select:SF
+		(match_operand:V4SF 1 "register_operand" "0,x")
+		(parallel [(const_int 0)]))
+	      (match_operand:SF 2 "nonimmediate_operand" "xm,xm")))
 	  (match_dup 1)
 	  (const_int 1)))]
   "TARGET_SSE"
   "@
-   <plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %0|%0, %2}
-   v<plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %1, %0|%0, %1, %2}"
+   <plusminus_mnemonic>ss\t{%2, %0|%0, %2}
+   v<plusminus_mnemonic>ss\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "isa" "noavx,avx")
    (set_attr "type" "sseadd")
    (set_attr "prefix" "orig,vex")
-   (set_attr "mode" "<ssescalarmode>")])
+   (set_attr "mode" "SF")])
+
+(define_insn "sse2_vm<plusminus_insn>v2df3"
+  [(set (match_operand:V2DF 0 "register_operand" "=x,x")
+	(vec_concat:V2DF
+	  (plusminus:DF
+	    (vec_select:DF
+	      (match_operand:V2DF 1 "register_operand" "0,x")
+	      (parallel [(const_int 0)]))
+	    (match_operand:DF 2 "nonimmediate_operand" "xm,xm"))
+	  (vec_select:DF (match_dup 1) (parallel [(const_int 1)]))))]
+  "TARGET_SSE2"
+  "@
+   <plusminus_mnemonic>sd\t{%2, %0|%0, %2}
+   v<plusminus_mnemonic>sd\t{%2, %1, %0|%0, %1, %2}"
+  [(set_attr "isa" "noavx,avx")
+   (set_attr "type" "sseadd")
+   (set_attr "prefix" "orig,vex")
+   (set_attr "mode" "DF")])

 (define_expand "mul<mode>3"
   [(set (match_operand:VF 0 "register_operand")
 	(mult:VF
 	  (match_operand:VF 1 "nonimmediate_operand")
 	  (match_operand:VF 2 "nonimmediate_operand")))]
   "TARGET_SSE"
   "ix86_fixup_binary_operands_no_copy (MULT, <MODE>mode, operands);")

 (define_insn "*mul<mode>3"
Index: gcc/config/i386/i386-builtin-types.def
===================================================================
--- gcc/config/i386/i386-builtin-types.def	(revision 194017)
+++ gcc/config/i386/i386-builtin-types.def	(working copy)
@@ -263,20 +263,21 @@  DEF_FUNCTION_TYPE (UINT64, UINT64, UINT6
 DEF_FUNCTION_TYPE (UINT8, UINT8, INT)
 DEF_FUNCTION_TYPE (V16QI, V16QI, SI)
 DEF_FUNCTION_TYPE (V16QI, V16QI, V16QI)
 DEF_FUNCTION_TYPE (V16QI, V8HI, V8HI)
 DEF_FUNCTION_TYPE (V1DI, V1DI, SI)
 DEF_FUNCTION_TYPE (V1DI, V1DI, V1DI)
 DEF_FUNCTION_TYPE (V1DI, V2SI, V2SI)
 DEF_FUNCTION_TYPE (V1DI, V8QI, V8QI)
 DEF_FUNCTION_TYPE (V2DF, PCV2DF, V2DI)
 DEF_FUNCTION_TYPE (V2DF, V2DF, DI)
+DEF_FUNCTION_TYPE (V2DF, V2DF, DOUBLE)
 DEF_FUNCTION_TYPE (V2DF, V2DF, INT)
 DEF_FUNCTION_TYPE (V2DF, V2DF, PCDOUBLE)
 DEF_FUNCTION_TYPE (V2DF, V2DF, SI)
 DEF_FUNCTION_TYPE (V2DF, V2DF, V2DF)
 DEF_FUNCTION_TYPE (V2DF, V2DF, V2DI)
 DEF_FUNCTION_TYPE (V2DF, V2DF, V4SF)
 DEF_FUNCTION_TYPE (V2DF, V4DF, INT)
 DEF_FUNCTION_TYPE (V2DI, V16QI, V16QI)
 DEF_FUNCTION_TYPE (V2DI, V2DF, V2DF)
 DEF_FUNCTION_TYPE (V2DI, V2DI, INT)
@@ -296,20 +297,21 @@  DEF_FUNCTION_TYPE (V4DF, PCV4DF, V4DI)
 DEF_FUNCTION_TYPE (V4DF, V4DF, INT)
 DEF_FUNCTION_TYPE (V4DF, V4DF, V4DF)
 DEF_FUNCTION_TYPE (V4DF, V4DF, V4DI)
 DEF_FUNCTION_TYPE (V4HI, V2SI, V2SI)
 DEF_FUNCTION_TYPE (V4HI, V4HI, INT)
 DEF_FUNCTION_TYPE (V4HI, V4HI, SI)
 DEF_FUNCTION_TYPE (V4HI, V4HI, V4HI)
 DEF_FUNCTION_TYPE (V4HI, V8QI, V8QI)
 DEF_FUNCTION_TYPE (V4SF, PCV4SF, V4SI)
 DEF_FUNCTION_TYPE (V4SF, V4SF, DI)
+DEF_FUNCTION_TYPE (V4SF, V4SF, FLOAT)
 DEF_FUNCTION_TYPE (V4SF, V4SF, INT)
 DEF_FUNCTION_TYPE (V4SF, V4SF, PCV2SF)
 DEF_FUNCTION_TYPE (V4SF, V4SF, SI)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V2DF)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V2SI)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V4SF)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V4SI)
 DEF_FUNCTION_TYPE (V4SF, V8SF, INT)
 DEF_FUNCTION_TYPE (V4SI, V2DF, V2DF)
 DEF_FUNCTION_TYPE (V4SI, V4SF, V4SF)
Index: gcc/doc/extend.texi
===================================================================
--- gcc/doc/extend.texi	(revision 194017)
+++ gcc/doc/extend.texi	(working copy)
@@ -9821,22 +9821,22 @@  int __builtin_ia32_comige (v4sf, v4sf)
 int __builtin_ia32_ucomieq (v4sf, v4sf)
 int __builtin_ia32_ucomineq (v4sf, v4sf)
 int __builtin_ia32_ucomilt (v4sf, v4sf)
 int __builtin_ia32_ucomile (v4sf, v4sf)
 int __builtin_ia32_ucomigt (v4sf, v4sf)
 int __builtin_ia32_ucomige (v4sf, v4sf)
 v4sf __builtin_ia32_addps (v4sf, v4sf)
 v4sf __builtin_ia32_subps (v4sf, v4sf)
 v4sf __builtin_ia32_mulps (v4sf, v4sf)
 v4sf __builtin_ia32_divps (v4sf, v4sf)
-v4sf __builtin_ia32_addss (v4sf, v4sf)
-v4sf __builtin_ia32_subss (v4sf, v4sf)
+v4sf __builtin_ia32_addss (v4sf, float)
+v4sf __builtin_ia32_subss (v4sf, float)
 v4sf __builtin_ia32_mulss (v4sf, v4sf)
 v4sf __builtin_ia32_divss (v4sf, v4sf)
 v4si __builtin_ia32_cmpeqps (v4sf, v4sf)
 v4si __builtin_ia32_cmpltps (v4sf, v4sf)
 v4si __builtin_ia32_cmpleps (v4sf, v4sf)
 v4si __builtin_ia32_cmpgtps (v4sf, v4sf)
 v4si __builtin_ia32_cmpgeps (v4sf, v4sf)
 v4si __builtin_ia32_cmpunordps (v4sf, v4sf)
 v4si __builtin_ia32_cmpneqps (v4sf, v4sf)
 v4si __builtin_ia32_cmpnltps (v4sf, v4sf)
@@ -9942,22 +9942,22 @@  v2df __builtin_ia32_cmpunordsd (v2df, v2
 v2df __builtin_ia32_cmpneqsd (v2df, v2df)
 v2df __builtin_ia32_cmpnltsd (v2df, v2df)
 v2df __builtin_ia32_cmpnlesd (v2df, v2df)
 v2df __builtin_ia32_cmpordsd (v2df, v2df)
 v2di __builtin_ia32_paddq (v2di, v2di)
 v2di __builtin_ia32_psubq (v2di, v2di)
 v2df __builtin_ia32_addpd (v2df, v2df)
 v2df __builtin_ia32_subpd (v2df, v2df)
 v2df __builtin_ia32_mulpd (v2df, v2df)
 v2df __builtin_ia32_divpd (v2df, v2df)
-v2df __builtin_ia32_addsd (v2df, v2df)
-v2df __builtin_ia32_subsd (v2df, v2df)
+v2df __builtin_ia32_addsd (v2df, double)
+v2df __builtin_ia32_subsd (v2df, double)
 v2df __builtin_ia32_mulsd (v2df, v2df)
 v2df __builtin_ia32_divsd (v2df, v2df)
 v2df __builtin_ia32_minpd (v2df, v2df)
 v2df __builtin_ia32_maxpd (v2df, v2df)
 v2df __builtin_ia32_minsd (v2df, v2df)
 v2df __builtin_ia32_maxsd (v2df, v2df)
 v2df __builtin_ia32_andpd (v2df, v2df)
 v2df __builtin_ia32_andnpd (v2df, v2df)
 v2df __builtin_ia32_orpd (v2df, v2df)
 v2df __builtin_ia32_xorpd (v2df, v2df)