
[i386] Replace builtins with vector extensions

Message ID: alpine.DEB.2.02.1404112137530.19663@stedding.saclay.inria.fr
State: New

Commit Message

Marc Glisse April 11, 2014, 8:09 p.m. UTC
Hello,

The previous discussion on the topic took place before we added all those
#pragma target directives in *mmintrin.h:

http://gcc.gnu.org/ml/gcc-patches/2013-04/msg00374.html

I believe that removes a large part of the arguments against it. Note that
I only did a few of the more obvious intrinsics; I am waiting to see whether
this patch is accepted before doing more.
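
For reference, the pattern in question looks roughly like this (a paraphrase
of the header, not a verbatim excerpt), with one intrinsic shown in the
proposed plain-operator form:

#ifndef __SSE__
#pragma GCC push_options
#pragma GCC target("sse")
#define __DISABLE_SSE__
#endif

extern __inline __m128
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_ps (__m128 __A, __m128 __B)
{
  return __A + __B;   /* instead of __builtin_ia32_addps */
}

#ifdef __DISABLE_SSE__
#undef __DISABLE_SSE__
#pragma GCC pop_options
#endif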

Bootstrap+testsuite on x86_64-linux-gnu.

2014-04-11  Marc Glisse  <marc.glisse@inria.fr>

 	* config/i386/xmmintrin.h (_mm_add_ps, _mm_sub_ps, _mm_mul_ps,
 	_mm_div_ps, _mm_store_ss, _mm_cvtss_f32): Use vector extensions
 	instead of builtins.
 	* config/i386/emmintrin.h (_mm_store_sd, _mm_cvtsd_f64, _mm_storeh_pd,
 	_mm_cvtsi128_si64, _mm_cvtsi128_si64x, _mm_add_pd, _mm_sub_pd,
 	_mm_mul_pd, _mm_div_pd, _mm_storel_epi64, _mm_movepi64_pi64,
 	_mm_loadh_pd, _mm_loadl_pd): Likewise.
 	(_mm_sqrt_sd): Fix comment.

Comments

Marc Glisse April 28, 2014, 11:34 a.m. UTC | #1
Ping
http://gcc.gnu.org/ml/gcc-patches/2014-04/msg00590.html

(note that ARM seems to be doing the same thing for their NEON
intrinsics; see Ramana's patch series posted today)

Marc Glisse May 17, 2014, 1:34 p.m. UTC | #2
Ping

Marc Glisse June 28, 2014, 10:42 a.m. UTC | #3
Ping,

nobody has an opinion on this? Or some explanation why I am mistaken to 
believe that #pragma target makes it safer now?

It would enable a number of optimizations, like constant propagation, FMA 
contraction, etc. It would also allow us to remove several builtins.
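
For illustration, two sketches of what this enables, assuming the intrinsics
are defined with plain operators as in the patch:

#include <emmintrin.h>

/* Constant propagation: with '__A + __B' the middle end can fold this
   whole function to a constant vector at compile time; the opaque
   __builtin_ia32_addpd call generally blocks that.  */
__m128d
add_constants (void)
{
  return _mm_add_pd (_mm_set1_pd (1.0), _mm_set1_pd (2.0));
}

/* FMA contraction: with plain operators, -mfma -ffp-contract=fast can
   turn this into a single vfmadd instead of separate mulpd + addpd.  */
__m128d
mul_add (__m128d a, __m128d b, __m128d c)
{
  return _mm_add_pd (_mm_mul_pd (a, b), c);
}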

Ulrich Drepper June 28, 2014, 1:37 p.m. UTC | #4
On Sat, Jun 28, 2014 at 6:42 AM, Marc Glisse <marc.glisse@inria.fr> wrote:
> Ping,
>
> nobody has an opinion on this? Or some explanation why I am mistaken to
> believe that #pragma target makes it safer now?
>
> It would enable a number of optimizations, like constant propagation, FMA
> contraction, etc. It would also allow us to remove several builtins.

I see no problem with using the array-type access to the registers.

As for replacing the builtins with arithmetic operators: I appreciate
the possibility for optimization.  But is there any chance the calls
could end up not being implemented with a vector instruction?  I think
that would be bad.  The intrinsics should be a way to guarantee that
the programmer can create vector instructions.  Otherwise we might
just not support them.
Marc Glisse June 28, 2014, 10:53 p.m. UTC | #5
On Sat, 28 Jun 2014, Ulrich Drepper wrote:

> On Sat, Jun 28, 2014 at 6:42 AM, Marc Glisse <marc.glisse@inria.fr> wrote:
>> Ping,
>>
>> nobody has an opinion on this? Or some explanation why I am mistaken to
>> believe that #pragma target makes it safer now?
>>
>> It would enable a number of optimizations, like constant propagation, FMA
>> contraction, etc. It would also allow us to remove several builtins.
>
> I see no problem with using the array-type access to the registers.
>
> As for replacing the builtins with arithmetic operators: I appreciate
> the possibility for optimization.  But is there any chance the calls
> could end up not being implemented with a vector instruction?  I think
> that would be bad.  The intrinsics should be a way to guarantee that
> the programmer can create vector instructions.  Otherwise we might
> just not support them.

There is always a risk, but then even with builtins I think there was a 
small risk that an RTL optimization would mess things up. It is indeed 
higher if we expose the operation to the optimizers earlier, but it would 
be a bug if an "optimization" replaced a vector operation by something 
worse. Also, I am only proposing to handle the most trivial operations 
this way, not more complicated ones (like v[0]+=s) where we would be 
likely to fail generating the right instruction. And the pragma should 
ensure that the function will always be compiled in a mode where the 
vector instruction is available.
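
To make the distinction concrete (illustrative code, not part of the patch):

#include <emmintrin.h>

/* The kind of operation the patch does convert: full-width arithmetic,
   which maps directly to addpd.  */
static __m128d
full_width (__m128d a, __m128d b)
{
  return a + b;
}

/* The kind of operation deliberately left alone: a single-lane update
   like this would have to be recognized back into one addsd, which is
   much less reliable.  */
static __m128d
single_lane (__m128d a, __m128d b)
{
  a[0] += b[0];
  return a;
}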

ARM did the same and I don't think I have seen a bug reporting a 
regression about it (I haven't really looked though).

Thanks,
Ulrich Drepper June 29, 2014, 9:40 a.m. UTC | #6
On Sat, Jun 28, 2014 at 6:53 PM, Marc Glisse <marc.glisse@inria.fr> wrote:
> There is always a risk, but then even with builtins I think there was a
> small risk that an RTL optimization would mess things up. It is indeed
> higher if we expose the operation to the optimizers earlier, but it would be
> a bug if an "optimization" replaced a vector operation by something worse.
> Also, I am only proposing to handle the most trivial operations this way,
> not more complicated ones (like v[0]+=s) where we would be likely to fail
> generating the right instruction. And the pragma should ensure that the
> function will always be compiled in a mode where the vector instruction is
> available.
>
> ARM did the same and I don't think I have seen a bug reporting a regression
> about it (I haven't really looked though).

I think the ARM definitions come from a different angle.  It's new;
there are no assumed semantics.  For the x86 intrinsics Intel defines
that _mm_xxx() generates one of a given set of opcodes if there is a match.
If I want to generate a specific code sequence I use the intrinsics.
Otherwise I could already today use the vector type semantics myself.

Don't get me wrong, I like the idea of having the optimization of the
intrinsics happen.  But perhaps not unconditionally, or at least not
without a way to prevent them.

I know this will look ugly, but how about a macro
__GCC_X86_HONOR_INTRINSICS to enable the current code and have by
default your proposed use of the vector arithmetic in place?  This
wouldn't allow removing support for the built-ins but it would also
open the door to some more risky optimizations to be enabled by
default.
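
For concreteness, each header entry could then take roughly this shape
(a hand-written sketch using the macro name suggested above, not an actual
patch):

extern __inline __m128d
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_pd (__m128d __A, __m128d __B)
{
#ifdef __GCC_X86_HONOR_INTRINSICS
  /* Conservative path: keep the opaque builtin, staying 1-1 with addpd.  */
  return (__m128d) __builtin_ia32_addpd ((__v2df) __A, (__v2df) __B);
#else
  /* Default path: expose the operation to the optimizers.  */
  return __A + __B;
#endif
}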
Marc Glisse June 29, 2014, 11:06 a.m. UTC | #7
On Sun, 29 Jun 2014, Ulrich Drepper wrote:

> I think the ARM definitions come from a different angle.  It's new;
> there are no assumed semantics.

Is it that new? I thought it was implemented based on a rather precise 
specification by ARM. Again, I don't really know ARM.

> For the x86 intrinsics Intel defines
> that _mm_xxx() generates one of a given set of opcodes if there is a match.
> If I want to generate a specific code sequence I use the intrinsics.

We already sometimes generate a different instruction than the name of the 
intrinsic suggests, or combine consecutive intrinsics into something 
else. I use inline asm when I want a specific code sequence.
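
For example, something like this minimal sketch when I really want one
particular instruction:

#include <emmintrin.h>

static __m128d
force_addpd (__m128d a, __m128d b)
{
  /* The asm guarantees exactly one addpd is emitted here.  */
  __asm__ ("addpd %1, %0" : "+x" (a) : "x" (b));
  return a;
}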

> Otherwise I could already today use the vector type semantics myself.

Well, the main reasons I use the intrinsics are:
1) so the code compiles with Visual Studio;
2) to use esoteric instructions (anything without a trivial mapping in C).

> Don't get me wrong, I like the idea of having the optimization of the
> intrinsics happen.  But perhaps not unconditionally, or at least not
> without a way to prevent them.
>
> I know this will look ugly, but how about a macro
> __GCC_X86_HONOR_INTRINSICS to enable the current code and have by
> default your proposed use of the vector arithmetic in place?  This
> wouldn't allow removing support for the built-ins but it would also
> open the door to some more risky optimizations to be enabled by
> default.

That's a pretty big drawback. Instead of simplifying the implementation, 
it makes it more complicated. We also have to document the macro, update 
the testsuite so it tests the intrinsics in both modes, etc.

I understand the concern, and I would probably implement 
__GCC_X86_HONOR_INTRINSICS (though the testsuite part scares me as I have 
so little understanding of how it works, so I may need help), but I'd like 
to make sure first that the simpler approach is not acceptable, possibly 
with strong constraints on which operations are ok (_mm_load[hl]_pd could 
be removed from the patch for instance).

As another comparison, clang's version of *intrin.h uses the vector 
extensions much more than I am proposing.
Kirill Yukhin July 3, 2014, 10:17 a.m. UTC | #8
Hello Marc,
On 28 Jun 12:42, Marc Glisse wrote:
> It would enable a number of optimizations, like constant
> propagation, FMA contraction, etc. It would also allow us to remove
> several builtins.
This should be the main motivation for replacing built-ins.
But IMHO this approach should only be used for `obvious' cases.
I mean: + - / * and friends.
I think this shouldn't apply to shuffles or broadcasts.
But we have to define the border between the `obvious' intrinsics and the rest.

On the other hand, an intrinsic updated in such a way
may actually generate a different instruction than intended (e.g. the FMA case).

For ICC it is generally OK to generate different instructions; only the
semantics have to be obeyed.

--
Thanks, K
Marc Glisse July 4, 2014, 7:11 p.m. UTC | #9
On Thu, 3 Jul 2014, Kirill Yukhin wrote:

> Hello Marc,
> On 28 Jun 12:42, Marc Glisse wrote:
>> It would enable a number of optimizations, like constant
>> propagation, FMA contraction, etc. It would also allow us to remove
>> several builtins.
> This should be the main motivation for replacing built-ins.
> But IMHO this approach should only be used for `obvious' cases.
> I mean: + - / * and friends.
> I think this shouldn't apply to shuffles or broadcasts.
> But we have to define the border between the `obvious' intrinsics and the rest.

We don't have a syntax in the front-end for broadcasts anyway, but are you 
sure about shuffles? __builtin_shuffle directly translates to 
VEC_PERM_EXPR, on which we are careful to avoid optimizations like 
combining 2 shuffles unless the result is the identity. And expanding 
shuffles that can be done in a single instruction works well.
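
For instance (illustrative):

typedef double v2df __attribute__ ((vector_size (16)));
typedef long long v2di __attribute__ ((vector_size (16)));

static v2df
swap_halves (v2df x)
{
  /* Becomes a VEC_PERM_EXPR in GIMPLE and expands to a single shuffle
     instruction (e.g. shufpd) on SSE2.  */
  return __builtin_shuffle (x, (v2di) { 1, 0 });
}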

But I am happy not doing them yet. To be very specific, could you list 
which intrinsics you would like to remove from the posted patch?

> On the other hand, an intrinsic updated in such a way
> may actually generate a different instruction than intended (e.g. the FMA case).

It is the same with scalars, we have -ffp-contract for that.
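
The scalar analogue, for illustration:

/* Whether this becomes one fma instruction or a separate mul + add is
   already controlled by -ffp-contract (and -mfma), not by how the
   source is spelled.  */
double
scalar_muladd (double a, double b, double c)
{
  return a * b + c;
}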

> For ICC it is generally OK to generate different instructions; only the
> semantics have to be obeyed.
Kirill Yukhin July 8, 2014, 11:14 a.m. UTC | #10
Hello Marc.
On 04 Jul 21:11, Marc Glisse wrote:
> On Thu, 3 Jul 2014, Kirill Yukhin wrote:
> like combining 2 shuffles unless the result is the identity. And
> expanding shuffles that can be done in a single instruction works
> well.
> 
> But I am happy not doing them yet. To be very specific, could you
> list which intrinsics you would like to remove from the posted
> patch?
I am not an x86 maintainer; however, since such replacements produce
correct semantics and probably enable optimizations, I support your patch.

Perhaps you could try your approach on AVX2 and AVX-512, whose intrinsics
are well covered by tests?

> >On the other hand, an intrinsic updated in such a way
> >may actually generate a different instruction than intended (e.g. the FMA case).
> 
> It is the same with scalars, we have -ffp-contract for that.
Agreed.

--
Thanks, K
Jakub Jelinek July 8, 2014, 11:17 a.m. UTC | #11
On Tue, Jul 08, 2014 at 03:14:04PM +0400, Kirill Yukhin wrote:
> > >On the other hand, an intrinsic updated in such a way
> > >may actually generate a different instruction than intended (e.g. the FMA case).
> > 
> > It is the same with scalars, we have -ffp-contract for that.
> Agreed.

I don't think we actually always guarantee using the particular instructions
for the intrinsics even when they are implemented using builtins, at least
if they don't use UNSPECs, e.g. if combiner or peephole2 manage to combine
something into some other insn, we'll happily do that.

	Jakub
Mike Stump July 8, 2014, 4:02 p.m. UTC | #12
On Jul 8, 2014, at 4:17 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Tue, Jul 08, 2014 at 03:14:04PM +0400, Kirill Yukhin wrote:
>>>> On the other hand, an intrinsic updated in such a way
>>>> may actually generate a different instruction than intended (e.g. the FMA case).
>>> 
>>> It is the same with scalars, we have -ffp-contract for that.
>> Agreed.
> 
> I don't think we actually always guarantee using the particular instructions
> for the intrinsics even when they are implemented using builtins, at least
> if they don't use UNSPECs, e.g. if combiner or peephole2 manage to combine
> something into some other insn, we'll happily do that.

In a testcase, one is free to hide the inputs and the output from the optimizer using standard tricks and take one step closer to having a 1-1 mapping.  Of course, whether or not the port even offers a 1-1 mapping for any particular builtin is completely dependent upon the port.
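
For instance, one common shape of such a test (a sketch, not an actual
testsuite file):

/* { dg-do compile } */
/* { dg-options "-O2 -msse2" } */
#include <emmintrin.h>

__m128d a, b, r;   /* globals, so the optimizer cannot see the values */

void
test (void)
{
  r = _mm_add_pd (a, b);
}

/* { dg-final { scan-assembler "addpd" } } */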

Patch

Index: gcc/config/i386/emmintrin.h
===================================================================
--- gcc/config/i386/emmintrin.h	(revision 209323)
+++ gcc/config/i386/emmintrin.h	(working copy)
@@ -161,40 +161,40 @@  _mm_store_pd (double *__P, __m128d __A)
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_storeu_pd (double *__P, __m128d __A)
 {
   __builtin_ia32_storeupd (__P, __A);
 }
 
 /* Stores the lower DPFP value.  */
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_store_sd (double *__P, __m128d __A)
 {
-  *__P = __builtin_ia32_vec_ext_v2df (__A, 0);
+  *__P = __A[0];
 }
 
 extern __inline double __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_cvtsd_f64 (__m128d __A)
 {
-  return __builtin_ia32_vec_ext_v2df (__A, 0);
+  return __A[0];
 }
 
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_storel_pd (double *__P, __m128d __A)
 {
   _mm_store_sd (__P, __A);
 }
 
 /* Stores the upper DPFP value.  */
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_storeh_pd (double *__P, __m128d __A)
 {
-  *__P = __builtin_ia32_vec_ext_v2df (__A, 1);
+  *__P = __A[1];
 }
 
 /* Store the lower DPFP value across two words.
    The address must be 16-byte aligned.  */
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_store1_pd (double *__P, __m128d __A)
 {
   _mm_store_pd (__P, __builtin_ia32_shufpd (__A, __A, _MM_SHUFFLE2 (0,0)));
 }
 
@@ -215,86 +215,86 @@  extern __inline int __attribute__((__gnu
 _mm_cvtsi128_si32 (__m128i __A)
 {
   return __builtin_ia32_vec_ext_v4si ((__v4si)__A, 0);
 }
 
 #ifdef __x86_64__
 /* Intel intrinsic.  */
 extern __inline long long __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_cvtsi128_si64 (__m128i __A)
 {
-  return __builtin_ia32_vec_ext_v2di ((__v2di)__A, 0);
+  return __A[0];
 }
 
 /* Microsoft intrinsic.  */
 extern __inline long long __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_cvtsi128_si64x (__m128i __A)
 {
-  return __builtin_ia32_vec_ext_v2di ((__v2di)__A, 0);
+  return __A[0];
 }
 #endif
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_pd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_addpd ((__v2df)__A, (__v2df)__B);
+  return __A + __B;
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_sd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_addsd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_pd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_subpd ((__v2df)__A, (__v2df)__B);
+  return __A - __B;
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_sd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_subsd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_pd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_mulpd ((__v2df)__A, (__v2df)__B);
+  return __A * __B;
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_sd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_mulsd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_div_pd (__m128d __A, __m128d __B)
 {
-  return (__m128d)__builtin_ia32_divpd ((__v2df)__A, (__v2df)__B);
+  return __A / __B;
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_div_sd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_divsd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sqrt_pd (__m128d __A)
 {
   return (__m128d)__builtin_ia32_sqrtpd ((__v2df)__A);
 }
 
-/* Return pair {sqrt (A[0), B[1]}.  */
+/* Return pair {sqrt (B[0]), A[1]}.  */
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sqrt_sd (__m128d __A, __m128d __B)
 {
   __v2df __tmp = __builtin_ia32_movsd ((__v2df)__A, (__v2df)__B);
   return (__m128d)__builtin_ia32_sqrtsd ((__v2df)__tmp);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_min_pd (__m128d __A, __m128d __B)
 {
@@ -708,27 +708,27 @@  _mm_store_si128 (__m128i *__P, __m128i _
 
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_storeu_si128 (__m128i *__P, __m128i __B)
 {
   __builtin_ia32_storedqu ((char *)__P, (__v16qi)__B);
 }
 
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_storel_epi64 (__m128i *__P, __m128i __B)
 {
-  *(long long *)__P = __builtin_ia32_vec_ext_v2di ((__v2di)__B, 0);
+  *(long long *)__P = __B[0];
 }
 
 extern __inline __m64 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_movepi64_pi64 (__m128i __B)
 {
-  return (__m64) __builtin_ia32_vec_ext_v2di ((__v2di)__B, 0);
+  return (__m64) __B[0];
 }
 
 extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_movpi64_epi64 (__m64 __A)
 {
   return _mm_set_epi64 ((__m64)0LL, __A);
 }
 
 extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_move_epi64 (__m128i __A)
@@ -915,27 +915,27 @@  _mm_unpackhi_pd (__m128d __A, __m128d __
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_unpacklo_pd (__m128d __A, __m128d __B)
 {
   return (__m128d)__builtin_ia32_unpcklpd ((__v2df)__A, (__v2df)__B);
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_loadh_pd (__m128d __A, double const *__B)
 {
-  return (__m128d)__builtin_ia32_loadhpd ((__v2df)__A, __B);
+  return __extension__ (__m128d){ __A[0], __B[0] };
 }
 
 extern __inline __m128d __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_loadl_pd (__m128d __A, double const *__B)
 {
-  return (__m128d)__builtin_ia32_loadlpd ((__v2df)__A, __B);
+  return __extension__ (__m128d){ __B[0], __A[1] };
 }
 
 extern __inline int __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_movemask_pd (__m128d __A)
 {
   return __builtin_ia32_movmskpd ((__v2df)__A);
 }
 
 extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_packs_epi16 (__m128i __A, __m128i __B)
Index: gcc/config/i386/xmmintrin.h
===================================================================
--- gcc/config/i386/xmmintrin.h	(revision 209323)
+++ gcc/config/i386/xmmintrin.h	(working copy)
@@ -173,39 +173,39 @@  extern __inline __m128 __attribute__((__
 _mm_max_ss (__m128 __A, __m128 __B)
 {
   return (__m128) __builtin_ia32_maxss ((__v4sf)__A, (__v4sf)__B);
 }
 
 /* Perform the respective operation on the four SPFP values in A and B.  */
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_add_ps (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_addps ((__v4sf)__A, (__v4sf)__B);
+  return __A + __B;
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sub_ps (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_subps ((__v4sf)__A, (__v4sf)__B);
+  return __A - __B;
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_mul_ps (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_mulps ((__v4sf)__A, (__v4sf)__B);
+  return __A * __B;
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_div_ps (__m128 __A, __m128 __B)
 {
-  return (__m128) __builtin_ia32_divps ((__v4sf)__A, (__v4sf)__B);
+  return __A / __B;
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_sqrt_ps (__m128 __A)
 {
   return (__m128) __builtin_ia32_sqrtps ((__v4sf)__A);
 }
 
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_rcp_ps (__m128 __A)
@@ -950,27 +950,27 @@  _mm_set_ps (const float __Z, const float
 extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_setr_ps (float __Z, float __Y, float __X, float __W)
 {
   return __extension__ (__m128)(__v4sf){ __Z, __Y, __X, __W };
 }
 
 /* Stores the lower SPFP value.  */
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_store_ss (float *__P, __m128 __A)
 {
-  *__P = __builtin_ia32_vec_ext_v4sf ((__v4sf)__A, 0);
+  *__P = __A[0];
 }
 
 extern __inline float __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_cvtss_f32 (__m128 __A)
 {
-  return __builtin_ia32_vec_ext_v4sf ((__v4sf)__A, 0);
+  return __A[0];
 }
 
 /* Store four SPFP values.  The address must be 16-byte aligned.  */
 extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
 _mm_store_ps (float *__P, __m128 __A)
 {
   *(__v4sf *)__P = (__v4sf)__A;
 }
 
 /* Store four SPFP values.  The address need not be 16-byte aligned.  */