[i386] scalar ops that preserve the high part of a vector

Message ID alpine.DEB.2.02.1212041715370.22183@stedding.saclay.inria.fr
State New

Commit Message

Marc Glisse Dec. 4, 2012, 4:28 p.m. UTC
On Tue, 4 Dec 2012, Marc Glisse wrote:

> Do you agree?

Like this ? (only tested on the new testcases, and then I'd need to ask Eric
his opinion)

2012-12-04  Marc Glisse  <marc.glisse@inria.fr>

 	PR target/54855
gcc/
 	* simplify-rtx.c (simplify_binary_operation_1) <VEC_CONCAT>: Replace
 	with VEC_MERGE.
 	* config/i386/sse.md (<sse>_vm<plusminus_insn><mode>3): Rewrite
 	pattern.
 	* config/i386/i386-builtin-types.def: New function types.
 	* config/i386/i386.c (ix86_expand_args_builtin): Likewise.
 	(bdesc_args) <__builtin_ia32_addss, __builtin_ia32_subss,
 	__builtin_ia32_addsd, __builtin_ia32_subsd>: Change prototype.
 	* config/i386/xmmintrin.h: Adapt to new builtin prototype.
 	* config/i386/emmintrin.h: Likewise.
 	* doc/extend.texi (X86 Built-in Functions): Document changed prototype.


testsuite/
 	* gcc.target/i386/pr54855-1.c: New testcase.
 	* gcc.target/i386/pr54855-2.c: New testcase.

Comments

Uros Bizjak Dec. 4, 2012, 6:06 p.m. UTC | #1
On Tue, Dec 4, 2012 at 5:28 PM, Marc Glisse <marc.glisse@inria.fr> wrote:
> On Tue, 4 Dec 2012, Marc Glisse wrote:
>
>> Do you agree?
>
>
> Like this ? (only tested on the new testcases, and then I'd need to ask Eric
> his opinion)
>
> 2012-12-04  Marc Glisse  <marc.glisse@inria.fr>
>
>         PR target/54855
> gcc/
>         * simplify-rtx.c (simplify_binary_operation_1) <VEC_CONCAT>: Replace
>         with VEC_MERGE.
>
>         * config/i386/sse.md (<sse>_vm<plusminus_insn><mode>3): Rewrite
>         pattern.
>         * config/i386/i386-builtin-types.def: New function types.
>         * config/i386/i386.c (ix86_expand_args_builtin): Likewise.
>         (bdesc_args) <__builtin_ia32_addss, __builtin_ia32_subss,
>         __builtin_ia32_addsd, __builtin_ia32_subsd>: Change prototype.
>         * config/i386/xmmintrin.h: Adapt to new builtin prototype.
>         * config/i386/emmintrin.h: Likewise.
>         * doc/extend.texi (X86 Built-in Functions): Document changed
> prototype.
>
>
> testsuite/
>         * gcc.target/i386/pr54855-1.c: New testcase.
>         * gcc.target/i386/pr54855-2.c: New testcase.

Yes, the approach taken in this patch looks really good to me. There
should be no code differences with your patch, but let's ask HJ for
his opinion on intrinsics header changes.

A little nit below:

> +           rtx newop0 = gen_rtx_fmt_e (VEC_DUPLICATE, mode, trueop0);
> +           rtx newop1 = XEXP (trueop1, 0);
> +           return simplify_gen_ternary (VEC_MERGE, mode, GET_MODE (newop0),
> +                                        newop0, newop1, GEN_INT (1));

You can use const1_rtx here.

Thanks,
Uros.
H.J. Lu Dec. 4, 2012, 6:12 p.m. UTC | #2
On Tue, Dec 4, 2012 at 10:06 AM, Uros Bizjak <ubizjak@gmail.com> wrote:
> On Tue, Dec 4, 2012 at 5:28 PM, Marc Glisse <marc.glisse@inria.fr> wrote:
>> On Tue, 4 Dec 2012, Marc Glisse wrote:
>>
>>> Do you agree?
>>
>>
>> Like this ? (only tested on the new testcases, and then I'd need to ask Eric
>> his opinion)
>>
>> 2012-12-04  Marc Glisse  <marc.glisse@inria.fr>
>>
>>         PR target/54855
>> gcc/
>>         * simplify-rtx.c (simplify_binary_operation_1) <VEC_CONCAT>: Replace
>>         with VEC_MERGE.
>>
>>         * config/i386/sse.md (<sse>_vm<plusminus_insn><mode>3): Rewrite
>>         pattern.
>>         * config/i386/i386-builtin-types.def: New function types.
>>         * config/i386/i386.c (ix86_expand_args_builtin): Likewise.
>>         (bdesc_args) <__builtin_ia32_addss, __builtin_ia32_subss,
>>         __builtin_ia32_addsd, __builtin_ia32_subsd>: Change prototype.
>>         * config/i386/xmmintrin.h: Adapt to new builtin prototype.
>>         * config/i386/emmintrin.h: Likewise.
>>         * doc/extend.texi (X86 Built-in Functions): Document changed
>> prototype.
>>
>>
>> testsuite/
>>         * gcc.target/i386/pr54855-1.c: New testcase.
>>         * gcc.target/i386/pr54855-2.c: New testcase.
>
> Yes, the approach taken in this patch looks really good to me. There
> should be no code differences with your patch, but let's ask HJ for
> his opinion on intrinsics header changes.

Hi Kirill,

Can you take  a look?  Thanks.

> A little nit below:
>
>> +           rtx newop0 = gen_rtx_fmt_e (VEC_DUPLICATE, mode, trueop0);
>> +           rtx newop1 = XEXP (trueop1, 0);
>> +           return simplify_gen_ternary (VEC_MERGE, mode, GET_MODE (newop0),
>> +                                        newop0, newop1, GEN_INT (1));
>
> You can use const1_rtx here.
>
> Thanks,
> Uros.
Kirill Yukhin Dec. 6, 2012, 1:42 p.m. UTC | #3
>> Yes, the approach taken in this patch looks really good to me. There
>> should be no code differences with your patch, but let's ask HJ for
>> his opinion on intrinsics header changes.
>
> Hi Kirill,
>
> Can you take  a look?  Thanks.

Hi guys, I like changes in intrinsics header.

Thanks, K
Michael Zolotukhin Dec. 7, 2012, 6:49 a.m. UTC | #4
Hi guys,
Could I ask several questions just to clarify the things up?

1) Does the root problem lie in the fact that even for scalar
additions we perform the addition on the whole vector and only then
drop the higher parts of the vector? I.e. to fix the test from the PR
we need to replace plus on vector mode with plus on scalar mode?

2) Is one of the main requirements having the same pattern for V4SF
and V2DF version?

3) I don't see vec_concat in patterns from your patches, is it
explicitly generated by some x86-expander?

Anyway, I really like the idea of having some uniformity in describing
patterns for scalar instructions, so thank you for the work!

On 6 December 2012 17:42, Kirill Yukhin <kirill.yukhin@gmail.com> wrote:
>>> Yes, the approach taken in this patch looks really good to me. There
>>> should be no code differences with your patch, but let's ask HJ for
>>> his opinion on intrinsics header changes.
>>
>> Hi Kirill,
>>
>> Can you take  a look?  Thanks.
>
> Hi guys, I like changes in intrinsics header.
>
> Thanks, K
Uros Bizjak Dec. 7, 2012, 8:45 a.m. UTC | #5
On Fri, Dec 7, 2012 at 7:49 AM, Michael Zolotukhin
<michael.v.zolotukhin@gmail.com> wrote:
> Hi guys,
> Could I ask several questions just to clarify the things up?
>
> 1) Does the root problem lie in the fact that even for scalar
> additions we perform the addition on the whole vector and only then
> drop the higher parts of the vector? I.e. to fix the test from the PR
> we need to replace plus on vector mode with plus on scalar mode?

Yes, the existing pattern is used to implement the intrinsics, and it is
modelled with a vector operand 2. But we in fact emit a scalar operation,
so we would like to model the pattern with a scalar operand 2. This way,
the same pattern can be used to emit the intrinsics _and_ to optimize the
code from the testcase at the same time. Also, please note that the
alignment requirements for vector and scalar operands are different.

> 2) Is one of the main requirements having the same pattern for V4SF
> and V2DF version?

It is not required, but having a macroized pattern avoids pattern
explosion and eases maintenance (it is easier to understand similar
functionality if it is described in a uniform way), and in some
cases macroization opportunities force the author to rethink the RTL
description, making the patterns more "universal".

Uros.
Marc Glisse Dec. 7, 2012, 8:49 a.m. UTC | #6
On Fri, 7 Dec 2012, Michael Zolotukhin wrote:

> 1) Does the root problem lie in the fact that even for scalar
> additions we perform the addition on the whole vector and only then
> drop the higher parts of the vector? I.e. to fix the test from the PR
> we need to replace plus on vector mode with plus on scalar mode?

The root problem is that we model the subs[sd] instructions as taking a 
128-bit second operand, when Intel's documentation says they take a 
32/64-bit operand, which is an important difference for memory operands 
(and constants). Writing a pattern that reconstructs the result from a 
scalar operation also seems more natural than pretending we are doing a 
parallel operation and dropping most of it (easier for recog and friends).
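
To make the memory-operand point concrete, here is a small illustrative
example in the spirit of the new testcases (not part of the patch; the asm
in the comment is only the hoped-for result):

   typedef double vec __attribute__((vector_size(16)));

   double d;        /* a lone 8-byte double, with no 16-byte alignment */

   vec f (vec x)
   {
     x[0] += d;     /* ideally a single  addsd d(%rip), %xmm0  */
     return x;
   }

With the second operand modelled as a full V2DF, the load of d cannot be
folded into the addsd; with a DFmode operand it can.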

(note: I think the insn was written to support the intrinsic, which does 
take a 128-bit argument, so it did a good job for that)

> 2) Is one of the main requirements having the same pattern for V4SF
> and V2DF version?

Uros seems to think that would be best.

> 3) I don't see vec_concat in patterns from your patches, is it
> explicitly generated by some x86-expander?

It is generated by ix86_expand_vector_set.

> Anyway, I really like the idea of having some uniformity in describing
> patterns for scalar instructions, so thank you for the work!

For 2-element vectors, vec_concat does seem more natural than vec_merge. 
If we chose vec_merge as the canonical representation, we should choose it 
for setting an element in a vector (ix86_expand_vector_set) everywhere, 
not just those scalarish operations.
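
For concreteness, the two candidate representations of setting element 0 of
a V2DF vector V to a scalar X look roughly like this (sketch only, not taken
from the patch):

   ;; vec_concat form (what ix86_expand_vector_set produces for V2DF today)
   (vec_concat:V2DF
     (reg:DF X)
     (vec_select:DF (reg:V2DF V) (parallel [(const_int 1)])))

   ;; vec_merge form (as in the rewritten <sse>_vm<plusminus_insn><mode>3)
   (vec_merge:V2DF
     (vec_duplicate:V2DF (reg:DF X))
     (reg:V2DF V)
     (const_int 1))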

So it would be good to have rth's opinion on this (svn blame seems to 
indicate he is the one who chose to use vec_concat specifically for V2DF 
instead of vec_merge).
Michael Zolotukhin Dec. 7, 2012, 10:52 a.m. UTC | #7
Thanks for the explanation!

By the way, if we decide to have one pattern for V4SF instructions and
another for V2DF, we could try to use the recently introduced define_subst
here. It won't reduce the number of actual patterns (I mean the number of
patterns after iterator and subst expansion), but it could help to
make sse.md more compact.

On 7 December 2012 12:49, Marc Glisse <marc.glisse@inria.fr> wrote:
> On Fri, 7 Dec 2012, Michael Zolotukhin wrote:
>
>> 1) Does the root problem lie in the fact that even for scalar
>> additions we perform the addition on the whole vector and only then
>> drop the higher parts of the vector? I.e. to fix the test from the PR
>> we need to replace plus on vector mode with plus on scalar mode?
>
>
> The root problem is that we model the subs[sd] instructions as taking a
> 128-bit second operand, when Intel's documentation says they take a
> 32/64-bit operand, which is an important difference for memory operands (and
> constants). Writing a pattern that reconstructs the result from a scalar
> operation also seems more natural than pretending we are doing a parallel
> operation and dropping most of it (easier for recog and friends).
>
> (note: I think the insn was written to support the intrinsic, which does
> take a 128-bit argument, so it did a good job for that)
>
>
>> 2) Is one of the main requirements having the same pattern for V4SF
>> and V2DF version?
>
>
> Uros seems to think that would be best.
>
>
>> 3) I don't see vec_concat in patterns from your patches, is it
>> explicitly generated by some x86-expander?
>
>
> It is generated by ix86_expand_vector_set.
>
>
>> Anyway, I really like the idea of having some uniformity in describing
>> patterns for scalar instructions, so thank you for the work!
>
>
> For 2-element vectors, vec_concat does seem more natural than vec_merge. If
> we chose vec_merge as the canonical representation, we should choose it for
> setting an element in a vector (ix86_expand_vector_set) everywhere, not just
> those scalarish operations.
>
> So it would be good to have rth's opinion on this (svn blame seems to
> indicate he is the one who chose to use vec_concat specifically for V2DF
> instead of vec_merge).
>
> --
> Marc Glisse
Richard Henderson Dec. 7, 2012, 2:43 p.m. UTC | #8
On 2012-12-07 02:49, Marc Glisse wrote:
> The root problem is that we model the subs[sd] instructions as taking
> a 128-bit second operand, when Intel's documentation says they take a
> 32/64-bit operand, which is an important difference for memory
> operands (and constants). Writing a pattern that reconstructs the
> result from a scalar operation also seems more natural than
> pretending we are doing a parallel operation and dropping most of it
> (easier for recog and friends).

I agree this is a problem with the current representation.

> For 2-element vectors, vec_concat does seem more natural than
> vec_merge. If we chose vec_merge as the canonical representation, we
> should chose it for setting an element in a vector
> (ix86_expand_vector_set) everywhere, not just those scalarish
> operations.

I'd hate to enshrine vec_merge over vec_concat for the benefit of x86,
and to the detriment of e.g. mips.  There are plenty of embedded simd
implementations that are V2xx only.

If we simply pull the various x86 patterns into one common form, set
and extract included, does that buy us most of what we'd get for
playing games in combine?


As for your xmmintrin.h changes, I'd like to see a test case that verifies
that _mm_add_ss(a, b) does not add extra insns to extract __B[0].
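
For instance (sketch only; the options and scan patterns are illustrative
and would need tuning):

   /* { dg-do compile } */
   /* { dg-options "-O -msse" } */

   #include <xmmintrin.h>

   __m128 f (__m128 a, __m128 b)
   {
     return _mm_add_ss (a, b);
   }

   /* The extraction of __B[0] should fold back into addss; no extra
      scalar moves or shuffles expected.  */
   /* { dg-final { scan-assembler-times "addss" 1 } } */
   /* { dg-final { scan-assembler-not "movss" } } */
   /* { dg-final { scan-assembler-not "shufps" } } */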

> +(define_insn "<sse>_vm<plusminus_insn><mode>3<vec_merge_or_concat>"
>    [(set (match_operand:VF_128 0 "register_operand" "=x,x")
>  	(vec_merge:VF_128
> -	  (plusminus:VF_128
> -	    (match_operand:VF_128 1 "register_operand" "0,x")
> -	    (match_operand:VF_128 2 "nonimmediate_operand" "xm,xm"))
> +	  (vec_duplicate:VF_128
> +	    (plusminus:<ssescalarmode>
> +	      (vec_select:<ssescalarmode>
> +		(match_operand:VF_128 1 "register_operand" "0,x")
> +		(parallel [(const_int 0)]))
> +	      (match_operand:<ssescalarmode> 2 "nonimmediate_operand" "xm,xm")))
>  	  (match_dup 1)
>  	  (const_int 1)))]
>    "TARGET_SSE"
>    "@
>     <plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %0|%0, %2}
>     v<plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %1, %0|%0, %1, %2}"
>    [(set_attr "isa" "noavx,avx")
>     (set_attr "type" "sseadd")
>     (set_attr "prefix" "orig,vex")
>     (set_attr "mode" "<ssescalarmode>")])

Did this really trigger as a substitution?  It's not supposed to have, since
you didn't add (set_attr "replace_vec_merge_with_vec_concat" "yes")...


r~
Jakub Jelinek Dec. 7, 2012, 2:47 p.m. UTC | #9
On Fri, Dec 07, 2012 at 08:43:05AM -0600, Richard Henderson wrote:
> > +(define_insn "<sse>_vm<plusminus_insn><mode>3<vec_merge_or_concat>"
> >    [(set (match_operand:VF_128 0 "register_operand" "=x,x")
> >  	(vec_merge:VF_128
> > -	  (plusminus:VF_128
> > -	    (match_operand:VF_128 1 "register_operand" "0,x")
> > -	    (match_operand:VF_128 2 "nonimmediate_operand" "xm,xm"))
> > +	  (vec_duplicate:VF_128
> > +	    (plusminus:<ssescalarmode>
> > +	      (vec_select:<ssescalarmode>
> > +		(match_operand:VF_128 1 "register_operand" "0,x")
> > +		(parallel [(const_int 0)]))
> > +	      (match_operand:<ssescalarmode> 2 "nonimmediate_operand" "xm,xm")))
> >  	  (match_dup 1)
> >  	  (const_int 1)))]
> >    "TARGET_SSE"
> >    "@
> >     <plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %0|%0, %2}
> >     v<plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %1, %0|%0, %1, %2}"
> >    [(set_attr "isa" "noavx,avx")
> >     (set_attr "type" "sseadd")
> >     (set_attr "prefix" "orig,vex")
> >     (set_attr "mode" "<ssescalarmode>")])
> 
> Did this really trigger as a substitution?  It's not supposed to have, since
> you didn't add (set_attr "replace_vec_merge_with_vec_concat" "yes")...

That was the older proposal, the current way to trigger it is using
the substitution attr somewhere, typically in pattern name
- <vec_merge_or_concat> in the above case.

	Jakub
Richard Henderson Dec. 7, 2012, 2:53 p.m. UTC | #10
On 2012-12-07 08:47, Jakub Jelinek wrote:
> That was the older proposal, the current way to trigger it is using
> the substitution attr somewhere, typically in pattern name
> - <vec_merge_or_concat> in the above case.

Ah right.

("My mind is going, Dave.  I can feel it.")


r~
Marc Glisse Dec. 7, 2012, 3 p.m. UTC | #11
On Fri, 7 Dec 2012, Richard Henderson wrote:

> On 2012-12-07 02:49, Marc Glisse wrote:
>> For 2-element vectors, vec_concat does seem more natural than
>> vec_merge. If we chose vec_merge as the canonical representation, we
>> should choose it for setting an element in a vector
>> (ix86_expand_vector_set) everywhere, not just those scalarish
>> operations.
>
> I'd hate to enshrine vec_merge over vec_concat for the benefit of x86,
> and to the detriment of e.g. mips.  There are plenty of embedded simd
> implementations that are V2xx only.
>
> If we simply pull the various x86 patterns into one common form, set
> and extract included, does that buy us most of what we'd get for
> playing games in combine?

I'm sorry, could you be more precise? I don't see clearly what you are 
suggesting.

> As for your xmmintrin.h changes, I'd like to see a test case that verifies
> that _mm_add_ss(a, b) does not add extra insns to extract __B[0].

Yes, good idea, thanks.
Richard Henderson Dec. 7, 2012, 3:05 p.m. UTC | #12
On 2012-12-07 09:00, Marc Glisse wrote:
> On Fri, 7 Dec 2012, Richard Henderson wrote:
> 
>> On 2012-12-07 02:49, Marc Glisse wrote:
>>> For 2-element vectors, vec_concat does seem more natural than
>>> vec_merge. If we chose vec_merge as the canonical representation, we
>>> should choose it for setting an element in a vector
>>> (ix86_expand_vector_set) everywhere, not just those scalarish
>>> operations.
>>
>> I'd hate to enshrine vec_merge over vec_concat for the benefit of x86,
>> and to the detriment of e.g. mips.  There are plenty of embedded simd
>> implementations that are V2xx only.
>>
>> If we simply pull the various x86 patterns into one common form, set
>> and extract included, does that buy us most of what we'd get for
>> playing games in combine?
> 
> I'm sorry, could you be more precise? I don't see clearly what you are suggesting.

Don't change combine?

Have I lost the plot somewhere here?


r~
Marc Glisse Dec. 7, 2012, 3:12 p.m. UTC | #13
On Fri, 7 Dec 2012, Richard Henderson wrote:

> On 2012-12-07 09:00, Marc Glisse wrote:
>> On Fri, 7 Dec 2012, Richard Henderson wrote:
>>
>>> On 2012-12-07 02:49, Marc Glisse wrote:
>>>> For 2-element vectors, vec_concat does seem more natural than
>>>> vec_merge. If we chose vec_merge as the canonical representation, we
>>>> should choose it for setting an element in a vector
>>>> (ix86_expand_vector_set) everywhere, not just those scalarish
>>>> operations.
>>>
>>> I'd hate to enshrine vec_merge over vec_concat for the benefit of x86,
>>> and to the detriment of e.g. mips.  There are plenty of embedded simd
>>> implementations that are V2xx only.
>>>
>>> If we simply pull the various x86 patterns into one common form, set
>>> and extract included, does that buy us most of what we'd get for
>>> playing games in combine?
>>
>> I'm sorry, could you be more precise? I don't see clearly what you are suggesting.
>
> Don't change combine?

but change ix86_expand_vector_set and others to generate vec_merge and 
have only the vec_merge define_insn in sse.md? I guess it would buy a 
large part of it. That's a pretty invasive change, I'll have to try...
Richard Henderson Dec. 7, 2012, 4:23 p.m. UTC | #14
On 2012-12-07 09:12, Marc Glisse wrote:
> but change ix86_expand_vector_set and others to generate vec_merge
> and have only the vec_merge define_insn in sse.md? I guess it would
> buy a large part of it. That's a pretty invasive change, I'll have to
> try...

Is it really that invasive?  Anyway, it's something worth trying for 4.9...


r~
Marc Glisse Dec. 7, 2012, 5:23 p.m. UTC | #15
On Fri, 7 Dec 2012, Richard Henderson wrote:

> On 2012-12-07 09:12, Marc Glisse wrote:
>> but change ix86_expand_vector_set and others to generate vec_merge
>> and have only the vec_merge define_insn in sse.md? I guess it would
>> buy a large part of it. That's a pretty invasive change, I'll have to
>> try...
>
> Is it really that invasive?

No, changing only V2DF, I seem to have the basic pieces in place by changing 
just 6 patterns in sse.md and a couple of functions in i386.c. Now I need to 
test it and see how much it affects the generated code...

> Anyway, it's something worth trying for 4.9...

Should I take it that if it ends up looking manageable, you prefer 
changing all the V2DF operations to vec_merge, rather than just adapting 
the addsd pattern? Won't it complicate things for generic optimizations if 
different platforms have different canonical patterns for the same 
operation on V2DF? It seems to me that it would be good if we agreed on a 
pattern common to all platforms so we could canonicalize as in the last 
part of:
http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00243.html
(or possibly the reverse transformation depending on the pattern we settle 
on).

(those 2 patches didn't touch the generic simplify-rtx code either, if I 
remember just the "don't touch combine" part of your suggestion ;-)
http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00028.html
http://gcc.gnu.org/ml/gcc-patches/2012-12/msg00492.html )

Patch

Index: doc/extend.texi
===================================================================
--- doc/extend.texi	(revision 194150)
+++ doc/extend.texi	(working copy)
@@ -9843,22 +9843,22 @@  int __builtin_ia32_comige (v4sf, v4sf)
 int __builtin_ia32_ucomieq (v4sf, v4sf)
 int __builtin_ia32_ucomineq (v4sf, v4sf)
 int __builtin_ia32_ucomilt (v4sf, v4sf)
 int __builtin_ia32_ucomile (v4sf, v4sf)
 int __builtin_ia32_ucomigt (v4sf, v4sf)
 int __builtin_ia32_ucomige (v4sf, v4sf)
 v4sf __builtin_ia32_addps (v4sf, v4sf)
 v4sf __builtin_ia32_subps (v4sf, v4sf)
 v4sf __builtin_ia32_mulps (v4sf, v4sf)
 v4sf __builtin_ia32_divps (v4sf, v4sf)
-v4sf __builtin_ia32_addss (v4sf, v4sf)
-v4sf __builtin_ia32_subss (v4sf, v4sf)
+v4sf __builtin_ia32_addss (v4sf, float)
+v4sf __builtin_ia32_subss (v4sf, float)
 v4sf __builtin_ia32_mulss (v4sf, v4sf)
 v4sf __builtin_ia32_divss (v4sf, v4sf)
 v4si __builtin_ia32_cmpeqps (v4sf, v4sf)
 v4si __builtin_ia32_cmpltps (v4sf, v4sf)
 v4si __builtin_ia32_cmpleps (v4sf, v4sf)
 v4si __builtin_ia32_cmpgtps (v4sf, v4sf)
 v4si __builtin_ia32_cmpgeps (v4sf, v4sf)
 v4si __builtin_ia32_cmpunordps (v4sf, v4sf)
 v4si __builtin_ia32_cmpneqps (v4sf, v4sf)
 v4si __builtin_ia32_cmpnltps (v4sf, v4sf)
@@ -9964,22 +9964,22 @@  v2df __builtin_ia32_cmpunordsd (v2df, v2
 v2df __builtin_ia32_cmpneqsd (v2df, v2df)
 v2df __builtin_ia32_cmpnltsd (v2df, v2df)
 v2df __builtin_ia32_cmpnlesd (v2df, v2df)
 v2df __builtin_ia32_cmpordsd (v2df, v2df)
 v2di __builtin_ia32_paddq (v2di, v2di)
 v2di __builtin_ia32_psubq (v2di, v2di)
 v2df __builtin_ia32_addpd (v2df, v2df)
 v2df __builtin_ia32_subpd (v2df, v2df)
 v2df __builtin_ia32_mulpd (v2df, v2df)
 v2df __builtin_ia32_divpd (v2df, v2df)
-v2df __builtin_ia32_addsd (v2df, v2df)
-v2df __builtin_ia32_subsd (v2df, v2df)
+v2df __builtin_ia32_addsd (v2df, double)
+v2df __builtin_ia32_subsd (v2df, double)
 v2df __builtin_ia32_mulsd (v2df, v2df)
 v2df __builtin_ia32_divsd (v2df, v2df)
 v2df __builtin_ia32_minpd (v2df, v2df)
 v2df __builtin_ia32_maxpd (v2df, v2df)
 v2df __builtin_ia32_minsd (v2df, v2df)
 v2df __builtin_ia32_maxsd (v2df, v2df)
 v2df __builtin_ia32_andpd (v2df, v2df)
 v2df __builtin_ia32_andnpd (v2df, v2df)
 v2df __builtin_ia32_orpd (v2df, v2df)
 v2df __builtin_ia32_xorpd (v2df, v2df)
Index: testsuite/gcc.target/i386/pr54855-2.c
===================================================================
--- testsuite/gcc.target/i386/pr54855-2.c	(revision 0)
+++ testsuite/gcc.target/i386/pr54855-2.c	(revision 0)
@@ -0,0 +1,18 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O -msse" } */
+
+typedef float vec __attribute__((vector_size(16)));
+
+vec f (vec x)
+{
+  x[0] += 2;
+  return x;
+}
+
+vec g (vec x)
+{
+  x[0] -= 1;
+  return x;
+}
+
+/* { dg-final { scan-assembler-not "mov" } } */

Property changes on: testsuite/gcc.target/i386/pr54855-2.c
___________________________________________________________________
Added: svn:keywords
   + Author Date Id Revision URL
Added: svn:eol-style
   + native

Index: testsuite/gcc.target/i386/pr54855-1.c
===================================================================
--- testsuite/gcc.target/i386/pr54855-1.c	(revision 0)
+++ testsuite/gcc.target/i386/pr54855-1.c	(revision 0)
@@ -0,0 +1,18 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O -msse2" } */
+
+typedef double vec __attribute__((vector_size(16)));
+
+vec f (vec x)
+{
+  x[0] += 2;
+  return x;
+}
+
+vec g (vec x)
+{
+  x[0] -= 1;
+  return x;
+}
+
+/* { dg-final { scan-assembler-not "mov" } } */

Property changes on: testsuite/gcc.target/i386/pr54855-1.c
___________________________________________________________________
Added: svn:eol-style
   + native
Added: svn:keywords
   + Author Date Id Revision URL

Index: config/i386/xmmintrin.h
===================================================================
--- config/i386/xmmintrin.h	(revision 194150)
+++ config/i386/xmmintrin.h	(working copy)
@@ -92,27 +92,27 @@  _mm_setzero_ps (void)
   return __extension__ (__m128){ 0.0f, 0.0f, 0.0f, 0.0f };
 }
 
 /* Perform the respective operation on the lower SPFP (single-precision
    floating-point) values of A and B; the upper three SPFP values are
    passed through from A.  */
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_ss (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_addss ((__v4sf)__A, (__v4sf)__B);
+  return (__m128) __builtin_ia32_addss ((__v4sf)__A, __B[0]);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_ss (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_subss ((__v4sf)__A, (__v4sf)__B);
+  return (__m128) __builtin_ia32_subss ((__v4sf)__A, __B[0]);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_ss (__m128 __A, __m128 __B)
 {
   return (__m128) __builtin_ia32_mulss ((__v4sf)__A, (__v4sf)__B);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_div_ss (__m128 __A, __m128 __B)
Index: config/i386/emmintrin.h
===================================================================
--- config/i386/emmintrin.h	(revision 194150)
+++ config/i386/emmintrin.h	(working copy)
@@ -226,33 +226,33 @@  _mm_cvtsi128_si64x (__m128i __A)
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_pd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_addpd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_sd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_addsd ((__v2df)__A, (__v2df)__B);
+  return (__m128d)__builtin_ia32_addsd ((__v2df)__A, __B[0]);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_pd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_subpd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_sd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_subsd ((__v2df)__A, (__v2df)__B);
+  return (__m128d)__builtin_ia32_subsd ((__v2df)__A, __B[0]);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_pd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_mulpd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_sd (__m128d __A, __m128d __B)
Index: config/i386/sse.md
===================================================================
--- config/i386/sse.md	(revision 194150)
+++ config/i386/sse.md	(working copy)
@@ -858,23 +858,26 @@ 
    <plusminus_mnemonic><ssemodesuffix>\t{%2, %0|%0, %2}
    v<plusminus_mnemonic><ssemodesuffix>\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "isa" "noavx,avx")
    (set_attr "type" "sseadd")
    (set_attr "prefix" "orig,vex")
    (set_attr "mode" "<MODE>")])
 
 (define_insn "<sse>_vm<plusminus_insn><mode>3"
   [(set (match_operand:VF_128 0 "register_operand" "=x,x")
 	(vec_merge:VF_128
-	  (plusminus:VF_128
-	    (match_operand:VF_128 1 "register_operand" "0,x")
-	    (match_operand:VF_128 2 "nonimmediate_operand" "xm,xm"))
+	  (vec_duplicate:VF_128
+	    (plusminus:<ssescalarmode>
+	      (vec_select:<ssescalarmode>
+		(match_operand:VF_128 1 "register_operand" "0,x")
+		(parallel [(const_int 0)]))
+	      (match_operand:<ssescalarmode> 2 "nonimmediate_operand" "xm,xm")))
 	  (match_dup 1)
 	  (const_int 1)))]
   "TARGET_SSE"
   "@
    <plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %0|%0, %2}
    v<plusminus_mnemonic><ssescalarmodesuffix>\t{%2, %1, %0|%0, %1, %2}"
   [(set_attr "isa" "noavx,avx")
    (set_attr "type" "sseadd")
    (set_attr "prefix" "orig,vex")
    (set_attr "mode" "<ssescalarmode>")])
Index: config/i386/i386-builtin-types.def
===================================================================
--- config/i386/i386-builtin-types.def	(revision 194150)
+++ config/i386/i386-builtin-types.def	(working copy)
@@ -263,20 +263,21 @@  DEF_FUNCTION_TYPE (UINT64, UINT64, UINT6
 DEF_FUNCTION_TYPE (UINT8, UINT8, INT)
 DEF_FUNCTION_TYPE (V16QI, V16QI, SI)
 DEF_FUNCTION_TYPE (V16QI, V16QI, V16QI)
 DEF_FUNCTION_TYPE (V16QI, V8HI, V8HI)
 DEF_FUNCTION_TYPE (V1DI, V1DI, SI)
 DEF_FUNCTION_TYPE (V1DI, V1DI, V1DI)
 DEF_FUNCTION_TYPE (V1DI, V2SI, V2SI)
 DEF_FUNCTION_TYPE (V1DI, V8QI, V8QI)
 DEF_FUNCTION_TYPE (V2DF, PCV2DF, V2DI)
 DEF_FUNCTION_TYPE (V2DF, V2DF, DI)
+DEF_FUNCTION_TYPE (V2DF, V2DF, DOUBLE)
 DEF_FUNCTION_TYPE (V2DF, V2DF, INT)
 DEF_FUNCTION_TYPE (V2DF, V2DF, PCDOUBLE)
 DEF_FUNCTION_TYPE (V2DF, V2DF, SI)
 DEF_FUNCTION_TYPE (V2DF, V2DF, V2DF)
 DEF_FUNCTION_TYPE (V2DF, V2DF, V2DI)
 DEF_FUNCTION_TYPE (V2DF, V2DF, V4SF)
 DEF_FUNCTION_TYPE (V2DF, V4DF, INT)
 DEF_FUNCTION_TYPE (V2DI, V16QI, V16QI)
 DEF_FUNCTION_TYPE (V2DI, V2DF, V2DF)
 DEF_FUNCTION_TYPE (V2DI, V2DI, INT)
@@ -296,20 +297,21 @@  DEF_FUNCTION_TYPE (V4DF, PCV4DF, V4DI)
 DEF_FUNCTION_TYPE (V4DF, V4DF, INT)
 DEF_FUNCTION_TYPE (V4DF, V4DF, V4DF)
 DEF_FUNCTION_TYPE (V4DF, V4DF, V4DI)
 DEF_FUNCTION_TYPE (V4HI, V2SI, V2SI)
 DEF_FUNCTION_TYPE (V4HI, V4HI, INT)
 DEF_FUNCTION_TYPE (V4HI, V4HI, SI)
 DEF_FUNCTION_TYPE (V4HI, V4HI, V4HI)
 DEF_FUNCTION_TYPE (V4HI, V8QI, V8QI)
 DEF_FUNCTION_TYPE (V4SF, PCV4SF, V4SI)
 DEF_FUNCTION_TYPE (V4SF, V4SF, DI)
+DEF_FUNCTION_TYPE (V4SF, V4SF, FLOAT)
 DEF_FUNCTION_TYPE (V4SF, V4SF, INT)
 DEF_FUNCTION_TYPE (V4SF, V4SF, PCV2SF)
 DEF_FUNCTION_TYPE (V4SF, V4SF, SI)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V2DF)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V2SI)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V4SF)
 DEF_FUNCTION_TYPE (V4SF, V4SF, V4SI)
 DEF_FUNCTION_TYPE (V4SF, V8SF, INT)
 DEF_FUNCTION_TYPE (V4SI, V2DF, V2DF)
 DEF_FUNCTION_TYPE (V4SI, V4SF, V4SF)
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 194150)
+++ config/i386/i386.c	(working copy)
@@ -27059,22 +27059,22 @@  static const struct builtin_description
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_cvttps2pi, "__builtin_ia32_cvttps2pi", IX86_BUILTIN_CVTTPS2PI, UNKNOWN, (int) V2SI_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_cvttss2si, "__builtin_ia32_cvttss2si", IX86_BUILTIN_CVTTSS2SI, UNKNOWN, (int) INT_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE | OPTION_MASK_ISA_64BIT, CODE_FOR_sse_cvttss2siq, "__builtin_ia32_cvttss2si64", IX86_BUILTIN_CVTTSS2SI64, UNKNOWN, (int) INT64_FTYPE_V4SF },
 
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_shufps, "__builtin_ia32_shufps", IX86_BUILTIN_SHUFPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF_INT },
 
   { OPTION_MASK_ISA_SSE, CODE_FOR_addv4sf3, "__builtin_ia32_addps", IX86_BUILTIN_ADDPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_subv4sf3, "__builtin_ia32_subps", IX86_BUILTIN_SUBPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_mulv4sf3, "__builtin_ia32_mulps", IX86_BUILTIN_MULPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_divv4sf3, "__builtin_ia32_divps", IX86_BUILTIN_DIVPS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
-  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmaddv4sf3,  "__builtin_ia32_addss", IX86_BUILTIN_ADDSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
-  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmsubv4sf3,  "__builtin_ia32_subss", IX86_BUILTIN_SUBSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
+  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmaddv4sf3,  "__builtin_ia32_addss", IX86_BUILTIN_ADDSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_FLOAT },
+  { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmsubv4sf3,  "__builtin_ia32_subss", IX86_BUILTIN_SUBSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_FLOAT },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmmulv4sf3,  "__builtin_ia32_mulss", IX86_BUILTIN_MULSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_vmdivv4sf3,  "__builtin_ia32_divss", IX86_BUILTIN_DIVSS, UNKNOWN, (int) V4SF_FTYPE_V4SF_V4SF },
 
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpeqps", IX86_BUILTIN_CMPEQPS, EQ, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpltps", IX86_BUILTIN_CMPLTPS, LT, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpleps", IX86_BUILTIN_CMPLEPS, LE, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpgtps", IX86_BUILTIN_CMPGTPS, LT, (int) V4SF_FTYPE_V4SF_V4SF_SWAP },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpgeps", IX86_BUILTIN_CMPGEPS, LE, (int) V4SF_FTYPE_V4SF_V4SF_SWAP },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpunordps", IX86_BUILTIN_CMPUNORDPS, UNORDERED, (int) V4SF_FTYPE_V4SF_V4SF },
   { OPTION_MASK_ISA_SSE, CODE_FOR_sse_maskcmpv4sf3, "__builtin_ia32_cmpneqps", IX86_BUILTIN_CMPNEQPS, NE, (int) V4SF_FTYPE_V4SF_V4SF },
@@ -27163,22 +27163,22 @@  static const struct builtin_description
   { OPTION_MASK_ISA_SSE2 | OPTION_MASK_ISA_64BIT, CODE_FOR_sse2_cvttsd2siq, "__builtin_ia32_cvttsd2si64", IX86_BUILTIN_CVTTSD2SI64, UNKNOWN, (int) INT64_FTYPE_V2DF },
 
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_cvtps2dq, "__builtin_ia32_cvtps2dq", IX86_BUILTIN_CVTPS2DQ, UNKNOWN, (int) V4SI_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_cvtps2pd, "__builtin_ia32_cvtps2pd", IX86_BUILTIN_CVTPS2PD, UNKNOWN, (int) V2DF_FTYPE_V4SF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_fix_truncv4sfv4si2, "__builtin_ia32_cvttps2dq", IX86_BUILTIN_CVTTPS2DQ, UNKNOWN, (int) V4SI_FTYPE_V4SF },
 
   { OPTION_MASK_ISA_SSE2, CODE_FOR_addv2df3, "__builtin_ia32_addpd", IX86_BUILTIN_ADDPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_subv2df3, "__builtin_ia32_subpd", IX86_BUILTIN_SUBPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_mulv2df3, "__builtin_ia32_mulpd", IX86_BUILTIN_MULPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_divv2df3, "__builtin_ia32_divpd", IX86_BUILTIN_DIVPD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
-  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmaddv2df3,  "__builtin_ia32_addsd", IX86_BUILTIN_ADDSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
-  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmsubv2df3,  "__builtin_ia32_subsd", IX86_BUILTIN_SUBSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
+  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmaddv2df3,  "__builtin_ia32_addsd", IX86_BUILTIN_ADDSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_DOUBLE },
+  { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmsubv2df3,  "__builtin_ia32_subsd", IX86_BUILTIN_SUBSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_DOUBLE },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmmulv2df3,  "__builtin_ia32_mulsd", IX86_BUILTIN_MULSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_vmdivv2df3,  "__builtin_ia32_divsd", IX86_BUILTIN_DIVSD, UNKNOWN, (int) V2DF_FTYPE_V2DF_V2DF },
 
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpeqpd", IX86_BUILTIN_CMPEQPD, EQ, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpltpd", IX86_BUILTIN_CMPLTPD, LT, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmplepd", IX86_BUILTIN_CMPLEPD, LE, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpgtpd", IX86_BUILTIN_CMPGTPD, LT, (int) V2DF_FTYPE_V2DF_V2DF_SWAP },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpgepd", IX86_BUILTIN_CMPGEPD, LE, (int) V2DF_FTYPE_V2DF_V2DF_SWAP},
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpunordpd", IX86_BUILTIN_CMPUNORDPD, UNORDERED, (int) V2DF_FTYPE_V2DF_V2DF },
   { OPTION_MASK_ISA_SSE2, CODE_FOR_sse2_maskcmpv2df3, "__builtin_ia32_cmpneqpd", IX86_BUILTIN_CMPNEQPD, NE, (int) V2DF_FTYPE_V2DF_V2DF },
@@ -30790,34 +30790,36 @@  ix86_expand_args_builtin (const struct b
     case V4HI_FTYPE_V8QI_V8QI:
     case V4HI_FTYPE_V2SI_V2SI:
     case V4DF_FTYPE_V4DF_V4DF:
     case V4DF_FTYPE_V4DF_V4DI:
     case V4SF_FTYPE_V4SF_V4SF:
     case V4SF_FTYPE_V4SF_V4SI:
     case V4SF_FTYPE_V4SF_V2SI:
     case V4SF_FTYPE_V4SF_V2DF:
     case V4SF_FTYPE_V4SF_DI:
     case V4SF_FTYPE_V4SF_SI:
+    case V4SF_FTYPE_V4SF_FLOAT:
     case V2DI_FTYPE_V2DI_V2DI:
     case V2DI_FTYPE_V16QI_V16QI:
     case V2DI_FTYPE_V4SI_V4SI:
     case V2UDI_FTYPE_V4USI_V4USI:
     case V2DI_FTYPE_V2DI_V16QI:
     case V2DI_FTYPE_V2DF_V2DF:
     case V2SI_FTYPE_V2SI_V2SI:
     case V2SI_FTYPE_V4HI_V4HI:
     case V2SI_FTYPE_V2SF_V2SF:
     case V2DF_FTYPE_V2DF_V2DF:
     case V2DF_FTYPE_V2DF_V4SF:
     case V2DF_FTYPE_V2DF_V2DI:
     case V2DF_FTYPE_V2DF_DI:
     case V2DF_FTYPE_V2DF_SI:
+    case V2DF_FTYPE_V2DF_DOUBLE:
     case V2SF_FTYPE_V2SF_V2SF:
     case V1DI_FTYPE_V1DI_V1DI:
     case V1DI_FTYPE_V8QI_V8QI:
     case V1DI_FTYPE_V2SI_V2SI:
     case V32QI_FTYPE_V16HI_V16HI:
     case V16HI_FTYPE_V8SI_V8SI:
     case V32QI_FTYPE_V32QI_V32QI:
     case V16HI_FTYPE_V32QI_V32QI:
     case V16HI_FTYPE_V16HI_V16HI:
     case V8SI_FTYPE_V4DF_V4DF:
Index: simplify-rtx.c
===================================================================
--- simplify-rtx.c	(revision 194150)
+++ simplify-rtx.c	(working copy)
@@ -3588,20 +3588,32 @@  simplify_binary_operation_1 (enum rtx_co
 	    int len0 = XVECLEN (par0, 0);
 	    int len1 = XVECLEN (par1, 0);
 	    rtvec vec = rtvec_alloc (len0 + len1);
 	    for (int i = 0; i < len0; i++)
 	      RTVEC_ELT (vec, i) = XVECEXP (par0, 0, i);
 	    for (int i = 0; i < len1; i++)
 	      RTVEC_ELT (vec, len0 + i) = XVECEXP (par1, 0, i);
 	    return simplify_gen_binary (VEC_SELECT, mode, XEXP (trueop0, 0),
 					gen_rtx_PARALLEL (VOIDmode, vec));
 	  }
+
+	/* Recognize a simple form of VEC_MERGE.  */
+	if (GET_CODE (trueop1) == VEC_SELECT
+	    && GET_MODE (XEXP (trueop1, 0)) == mode
+	    && XVECLEN (XEXP (trueop1, 1), 0) == 1
+	    && INTVAL (XVECEXP (XEXP (trueop1, 1), 0, 0)) == 1)
+	  {
+	    rtx newop0 = gen_rtx_fmt_e (VEC_DUPLICATE, mode, trueop0);
+	    rtx newop1 = XEXP (trueop1, 0);
+	    return simplify_gen_ternary (VEC_MERGE, mode, GET_MODE (newop0),
+					 newop0, newop1, GEN_INT (1));
+	  }
       }
       return 0;
 
     default:
       gcc_unreachable ();
     }
 
   return 0;
 }