diff mbox series

[rs6000] 2/2 Add x86 SSE3 <pmmintrin.h> intrinsics to GCC PPC64LE target

Message ID c4d4781d-0842-0ba6-7b11-df7631f010b0@us.ibm.com
State New

Commit Message

Paul A. Clarke Oct. 2, 2018, 2:12 p.m. UTC
This is part 2/2 for contributing PPC64LE support for X86 SSE3
intrinsics. This patch includes testsuite/gcc.target tests for the
intrinsics defined in pmmintrin.h. 

Tested on POWER8 ppc64le and ppc64 (-m64 and -m32, the latter only reporting
10 new unsupported tests.)

[gcc/testsuite]

2018-10-01  Paul A. Clarke  <pc@us.ibm.com>

	* sse3-check.h: New file.
	* sse3-addsubps.h: New file.
	* sse3-addsubpd.h: New file.
	* sse3-haddps.h: New file.
	* sse3-hsubps.h: New file.
	* sse3-haddpd.h: New file.
	* sse3-hsubpd.h: New file.
	* sse3-lddqu.h: New file.
	* sse3-movsldup.h: New file.
	* sse3-movshdup.h: New file.
	* sse3-movddup.h: New file.

Comments

Segher Boessenkool Oct. 5, 2018, 9:20 a.m. UTC | #1
Hi!

On Tue, Oct 02, 2018 at 09:12:07AM -0500, Paul Clarke wrote:
> This is part 2/2 for contributing PPC64LE support for X86 SSE3
> intrinsics. This patch includes testsuite/gcc.target tests for the
> intrinsics defined in pmmintrin.h. 
> 
> Tested on POWER8 ppc64le and ppc64 (-m64 and -m32, the latter only reporting
> 10 new unsupported tests.)
> 
> [gcc/testsuite]
> 
> 2018-10-01  Paul A. Clarke  <pc@us.ibm.com>
> 
> 	* sse3-check.h: New file.
> 	* sse3-addsubps.h: New file.
> 	* sse3-addsubpd.h: New file.
> 	* sse3-haddps.h: New file.
> 	* sse3-hsubps.h: New file.
> 	* sse3-haddpd.h: New file.
> 	* sse3-hsubpd.h: New file.
> 	* sse3-lddqu.h: New file.
> 	* sse3-movsldup.h: New file.
> 	* sse3-movshdup.h: New file.
> 	* sse3-movddup.h: New file.

All these entries should have gcc.target/powerpc/ in the file name.

> --- gcc/testsuite/gcc.target/powerpc/pr37191.c	(nonexistent)
> +++ gcc/testsuite/gcc.target/powerpc/pr37191.c	(working copy)

You need to mention this file in the changelog, too.

> @@ -0,0 +1,49 @@
> +/* { dg-do compile } */
> +/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
> +/* { dg-options "-O3 -mdirect-move" } */

-mdirect-move is deprecated and doesn't do anything.  You want -mcpu=power8
if you want to enable power8 instructions.  (Or -mpower8-vector also works,
for the time being anyway, but it is not preferred).

Have you tested this with -mcpu= an older cpu?  Did that work?  (It won't
_do_ much of course, but are there extra unexpected errors, etc.)

> +/* { dg-require-effective-target lp64 } */

Do these tests actually need this?  For what, then?


Segher
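
One way to see what the option choice discussed above changes is to look at the compiler's predefined macros. The following is only a minimal sketch: the helper name power8_macro_mask is hypothetical and not part of the patch, and it assumes that GCC's PowerPC port predefines _ARCH_PWR8 under -mcpu=power8 and __POWER8_VECTOR__ when the POWER8 vector support is enabled. On other targets it simply returns 0.

```c
/* Hypothetical helper, not part of the patch.  Reports which of two
   predefined macros the compiler set for this translation unit:
     bit 0: _ARCH_PWR8         (assumed to follow -mcpu=power8)
     bit 1: __POWER8_VECTOR__  (assumed to follow -mpower8-vector)  */
int
power8_macro_mask (void)
{
  int mask = 0;

#ifdef _ARCH_PWR8
  mask |= 1;
#endif
#ifdef __POWER8_VECTOR__
  mask |= 2;
#endif

  return mask;
}
```

Building this with each candidate option set and comparing the returned mask would show whether a given flag enables anything for that compiler version.
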
Paul A. Clarke Oct. 5, 2018, 3:54 p.m. UTC | #2
On 10/05/2018 04:20 AM, Segher Boessenkool wrote:
> On Tue, Oct 02, 2018 at 09:12:07AM -0500, Paul Clarke wrote:
>> This is part 2/2 for contributing PPC64LE support for X86 SSE3
>> intrinsics. This patch includes testsuite/gcc.target tests for the
>> intrinsics defined in pmmintrin.h. 
>>
>> Tested on POWER8 ppc64le and ppc64 (-m64 and -m32, the latter only reporting
>> 10 new unsupported tests.)
>>
>> [gcc/testsuite]
>>
>> 2018-10-01  Paul A. Clarke  <pc@us.ibm.com>
>>
>> 	* sse3-check.h: New file.
>> 	* sse3-addsubps.h: New file.
>> 	* sse3-addsubpd.h: New file.
>> 	* sse3-haddps.h: New file.
>> 	* sse3-hsubps.h: New file.
>> 	* sse3-haddpd.h: New file.
>> 	* sse3-hsubpd.h: New file.
>> 	* sse3-lddqu.h: New file.
>> 	* sse3-movsldup.h: New file.
>> 	* sse3-movshdup.h: New file.
>> 	* sse3-movddup.h: New file.
> 
> All these entries should have gcc.target/powerpc/ in the file name.

Ack.

>> --- gcc/testsuite/gcc.target/powerpc/pr37191.c	(nonexistent)
>> +++ gcc/testsuite/gcc.target/powerpc/pr37191.c	(working copy)
> 
> You need to mention this file in the changelog, too.

Ack.

>> @@ -0,0 +1,49 @@
>> +/* { dg-do compile } */
>> +/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
>> +/* { dg-options "-O3 -mdirect-move" } */
> 
> -mdirect-move is deprecated and doesn't do anything.  You want -mcpu=power8
> if you want to enable power8 instructions.  (Or -mpower8-vector also works,
> for the time being anyway, but it is not preferred).

All of the gcc/testsuite/gcc.target/powerpc/sse2*.c use "-mpower8-vector".  Shall I use that, or "-mcpu=power8"?

> Have you tested this with -mcpu= an older cpu?  Did that work?  (It won't
> _do_ much of course, but are there extra unexpected errors, etc.)

I just did, at your urging.  Seems OK.

>> +/* { dg-require-effective-target lp64 } */
> 
> Do these tests actually need this?  For what, then?

All of the gcc/testsuite/gcc.target/powerpc/sse2*.c use it.  I will profess my ignorance.  Should it be used?

PC
Segher Boessenkool Oct. 5, 2018, 4:54 p.m. UTC | #3
On Fri, Oct 05, 2018 at 10:54:18AM -0500, Paul Clarke wrote:
> On 10/05/2018 04:20 AM, Segher Boessenkool wrote:
> >> @@ -0,0 +1,49 @@
> >> +/* { dg-do compile } */
> >> +/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
> >> +/* { dg-options "-O3 -mdirect-move" } */
> > 
> > -mdirect-move is deprecated and doesn't do anything.  You want -mcpu=power8
> > if you want to enable power8 instructions.  (Or -mpower8-vector also works,
> > for the time being anyway, but it is not preferred).
> 
> All of the gcc/testsuite/gcc.target/powerpc/sse2*.c use "-mpower8-vector".  Shall I use that, or "-mcpu=power8"?

Ah right.  No, just keep it all the same, it is easiest.

> > Have you tested this with -mcpu= an older cpu?  Did that work?  (It won't
> > _do_ much of course, but are there extra unexpected errors, etc.)
> 
> I just did, at your urging.  Seems OK.

Nice, thanks.

> >> +/* { dg-require-effective-target lp64 } */
> > 
> > Do these tests actually need this?  For what, then?
> 
> All of the gcc/testsuite/gcc.target/powerpc/sse2*.c use it.  I will profess my ignorance.  Should it be used?

It means this test will only run on 64-bit compiles.  As long as we allow
the header to be used on 32-bit compiles (or on BE, etc.), preventing it
from being tested there is not so great.

But if all the existing things do this, it's fine to follow suit.


Segher
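
To illustrate the lp64 point above: the requirement restricts a test to targets using the LP64 data model, where long and pointers are both 64 bits wide. A minimal sketch of that property follows. The helper names are hypothetical and not part of the patch, and the use of __LP64__ assumes a GCC-compatible compiler, which predefines that macro on LP64 targets.

```c
#include <limits.h>

/* Hypothetical helper, not part of the patch: returns 1 when both
   long and pointers are 64 bits wide (the LP64 data model), else 0.  */
int
is_lp64 (void)
{
  return sizeof (long) * CHAR_BIT == 64
	 && sizeof (void *) * CHAR_BIT == 64;
}

/* Hypothetical helper: returns 1 when the compiler predefined
   __LP64__ for this translation unit, else 0.  */
int
lp64_macro_defined (void)
{
#ifdef __LP64__
  return 1;
#else
  return 0;
#endif
}
```

A -m32 build would make both helpers return 0, which is the configuration the lp64 requirement excludes from testing.
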

Patch

Index: gcc/testsuite/gcc.target/powerpc/pr37191.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/pr37191.c	(nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/pr37191.c	(working copy)
@@ -0,0 +1,49 @@ 
+/* { dg-do compile } */
+/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
+/* { dg-options "-O3 -mdirect-move" } */
+/* { dg-require-effective-target lp64 } */
+/* { dg-require-effective-target p8vector_hw } */
+
+#define NO_WARN_X86_INTRINSICS 1
+
+#include <mmintrin.h>
+#include <stddef.h>
+#include <stdint.h>
+
+//extern
+const uint64_t ff_bone;
+
+static inline void transpose4x4(uint8_t *dst, uint8_t *src, ptrdiff_t dst_stride, ptrdiff_t src_stride) {
+  __m64 row0 = _mm_cvtsi32_si64(*(unsigned*)(src + (0 * src_stride)));
+  __m64 row1 = _mm_cvtsi32_si64(*(unsigned*)(src + (1 * src_stride)));
+  __m64 row2 = _mm_cvtsi32_si64(*(unsigned*)(src + (2 * src_stride)));
+  __m64 row3 = _mm_cvtsi32_si64(*(unsigned*)(src + (3 * src_stride)));
+  __m64 tmp0 = _mm_unpacklo_pi8(row0, row1);
+  __m64 tmp1 = _mm_unpacklo_pi8(row2, row3);
+  __m64 row01 = _mm_unpacklo_pi16(tmp0, tmp1);
+  __m64 row23 = _mm_unpackhi_pi16(tmp0, tmp1);
+  *((unsigned*)(dst + (0 * dst_stride))) = _mm_cvtsi64_si32(row01);
+  *((unsigned*)(dst + (1 * dst_stride))) = _mm_cvtsi64_si32(_mm_unpackhi_pi32(row01, row01));
+  *((unsigned*)(dst + (2 * dst_stride))) = _mm_cvtsi64_si32(row23);
+  *((unsigned*)(dst + (3 * dst_stride))) = _mm_cvtsi64_si32(_mm_unpackhi_pi32(row23, row23));
+}
+#if 0
+static inline void h264_loop_filter_chroma_intra_mmx2(uint8_t *pix, int stride, int alpha1, int beta1)
+{
+    asm volatile(
+        ""
+        :: "r"(pix-2*stride), "r"(pix), "r"((long)stride),
+           "m"(alpha1), "m"(beta1), "m"(ff_bone)
+    );
+}
+
+#endif
+void h264_h_loop_filter_chroma_intra_mmx2(uint8_t *pix, int stride, int alpha, int beta)
+{
+  uint8_t trans[8*4] __attribute__ ((aligned (8)));
+  transpose4x4(trans, pix-2, 8, stride);
+  transpose4x4(trans+4, pix-2+4*stride, 8, stride);
+//    h264_loop_filter_chroma_intra_mmx2(trans+2*8, 8, alpha-1, beta-1);
+  transpose4x4(pix-2, trans, stride, 8);
+  transpose4x4(pix-2+4*stride, trans+4, stride, 8);
+}
Index: gcc/testsuite/gcc.target/powerpc/sse3-addsubpd.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/sse3-addsubpd.c	(nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/sse3-addsubpd.c	(working copy)
@@ -0,0 +1,102 @@ 
+/* { dg-do run } */
+/* { dg-options "-O3 -mpower8-vector -Wno-psabi" } */
+/* { dg-require-effective-target lp64 } */
+/* { dg-require-effective-target p8vector_hw } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse3-check.h"
+#endif
+
+#include CHECK_H
+
+#ifndef TEST
+#define TEST sse3_test_addsubpd_1
+#endif
+
+#define NO_WARN_X86_INTRINSICS 1
+#include <pmmintrin.h>
+
+static void
+sse3_test_addsubpd (double *i1, double *i2, double *r)
+{
+  __m128d t1 = _mm_loadu_pd (i1);
+  __m128d t2 = _mm_loadu_pd (i2);
+
+  t1 = _mm_addsub_pd (t1, t2);
+
+  _mm_storeu_pd (r, t1);
+}
+
+static void
+sse3_test_addsubpd_subsume (double *i1, double *i2, double *r)
+{
+  __m128d t1 = _mm_load_pd (i1);
+  __m128d t2 = _mm_load_pd (i2);
+
+  t1 = _mm_addsub_pd (t1, t2);
+
+  _mm_storeu_pd (r, t1);
+}
+
+static int
+chk_pd (double *v1, double *v2)
+{
+  int i;
+  int n_fails = 0;
+
+  for (i = 0; i < 2; i++)
+    if (v1[i] != v2[i])
+      n_fails += 1;
+
+  return n_fails;
+}
+
+static double p1[2] __attribute__ ((aligned(16)));
+static double p2[2] __attribute__ ((aligned(16)));
+static double p3[2];
+static double ck[2];
+
+double vals[] =
+  {
+    100.0,  200.0, 300.0, 400.0, 5.0, -1.0, .345, -21.5,
+    1100.0, 0.235, 321.3, 53.40, 0.3, 10.0, 42.0, 32.52,
+    32.6,   123.3, 1.234, 2.156, 0.1, 3.25, 4.75, 32.44,
+    12.16,  52.34, 64.12, 71.13, -.1, 2.30, 5.12, 3.785,
+    541.3,  321.4, 231.4, 531.4, 71., 321., 231., -531.,
+    23.45,  23.45, 23.45, 23.45, 23.45, 23.45, 23.45, 23.45,
+    23.45,  -1.43, -6.74, 6.345, -20.1, -20.1, -40.1, -40.1,
+    1.234,  2.345, 3.456, 4.567, 5.678, 6.789, 7.891, 8.912,
+    -9.32,  -8.41, -7.50, -6.59, -5.68, -4.77, -3.86, -2.95,
+    9.32,  8.41, 7.50, 6.59, -5.68, -4.77, -3.86, -2.95
+  };
+
+static
+void
+TEST (void)
+{
+  int i;
+  int fail = 0;
+
+  for (i = 0; i < sizeof (vals) / sizeof (vals[0]); i += 4)
+    {
+      p1[0] = vals[i+0];
+      p1[1] = vals[i+1];
+
+      p2[0] = vals[i+2];
+      p2[1] = vals[i+3];
+
+      ck[0] = p1[0] - p2[0];
+      ck[1] = p1[1] + p2[1];
+
+      sse3_test_addsubpd (p1, p2, p3);
+
+      fail += chk_pd (ck, p3);
+
+      sse3_test_addsubpd_subsume (p1, p2, p3);
+
+      fail += chk_pd (ck, p3);
+    }
+
+  if (fail != 0)
+    abort ();
+}
Index: gcc/testsuite/gcc.target/powerpc/sse3-addsubps.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/sse3-addsubps.c	(nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/sse3-addsubps.c	(working copy)
@@ -0,0 +1,108 @@ 
+/* { dg-do run } */
+/* { dg-options "-O3 -mpower8-vector -Wno-psabi" } */
+/* { dg-require-effective-target lp64 } */
+/* { dg-require-effective-target p8vector_hw } */
+
+#define NO_WARN_X86_INTRINSICS 1
+#ifndef CHECK_H
+#define CHECK_H "sse3-check.h"
+#endif
+
+#include CHECK_H
+
+#ifndef TEST
+#define TEST sse3_test_addsubps_1
+#endif
+
+#include <pmmintrin.h>
+
+static void
+sse3_test_addsubps (float *i1, float *i2, float *r)
+{
+  __m128 t1 = _mm_loadu_ps (i1);
+  __m128 t2 = _mm_loadu_ps (i2);
+
+  t1 = _mm_addsub_ps (t1, t2);
+
+  _mm_storeu_ps (r, t1);
+}
+
+static void
+sse3_test_addsubps_subsume (float *i1, float *i2, float *r)
+{
+  __m128 t1 = _mm_load_ps (i1);
+  __m128 t2 = _mm_load_ps (i2);
+
+  t1 = _mm_addsub_ps (t1, t2);
+
+  _mm_storeu_ps (r, t1);
+}
+
+static int
+chk_ps (float *v1, float *v2)
+{
+  int i;
+  int n_fails = 0;
+
+  for (i = 0; i < 4; i++)
+    if (v1[i] != v2[i])
+      n_fails += 1;
+
+  return n_fails;
+}
+
+static float p1[4] __attribute__ ((aligned(16)));
+static float p2[4] __attribute__ ((aligned(16)));
+static float p3[4];
+static float ck[4];
+
+static float vals[] =
+  {
+    100.0,  200.0, 300.0, 400.0, 5.0, -1.0, .345, -21.5,
+    1100.0, 0.235, 321.3, 53.40, 0.3, 10.0, 42.0, 32.52,
+    32.6,   123.3, 1.234, 2.156, 0.1, 3.25, 4.75, 32.44,
+    12.16,  52.34, 64.12, 71.13, -.1, 2.30, 5.12, 3.785,
+    541.3,  321.4, 231.4, 531.4, 71., 321., 231., -531.,
+    23.45,  23.45, 23.45, 23.45, 23.45, 23.45, 23.45, 23.45,
+    23.45,  -1.43, -6.74, 6.345, -20.1, -20.1, -40.1, -40.1,
+    1.234,  2.345, 3.456, 4.567, 5.678, 6.789, 7.891, 8.912,
+    -9.32,  -8.41, -7.50, -6.59, -5.68, -4.77, -3.86, -2.95,
+    9.32,  8.41, 7.50, 6.59, -5.68, -4.77, -3.86, -2.95
+  };
+
+//static
+void
+TEST (void)
+{
+  int i;
+  int fail = 0;
+
+  for (i = 0; i < sizeof (vals) / sizeof (vals[0]); i += 8)
+    {
+      p1[0] = vals[i+0];
+      p1[1] = vals[i+1];
+      p1[2] = vals[i+2];
+      p1[3] = vals[i+3];
+
+      p2[0] = vals[i+4];
+      p2[1] = vals[i+5];
+      p2[2] = vals[i+6];
+      p2[3] = vals[i+7];
+
+      ck[0] = p1[0] - p2[0];
+      ck[1] = p1[1] + p2[1];
+      ck[2] = p1[2] - p2[2];
+      ck[3] = p1[3] + p2[3];
+
+      sse3_test_addsubps (p1, p2, p3);
+
+      fail += chk_ps (ck, p3);
+
+      sse3_test_addsubps_subsume (p1, p2, p3);
+
+      fail += chk_ps (ck, p3);
+    }
+
+  if (fail != 0)
+    abort ();
+}
Index: gcc/testsuite/gcc.target/powerpc/sse3-check.h
===================================================================
--- gcc/testsuite/gcc.target/powerpc/sse3-check.h	(nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/sse3-check.h	(working copy)
@@ -0,0 +1,43 @@ 
+#include <stdio.h>
+#include <stdlib.h>
+
+#include "m128-check.h"
+
+/* Define DEBUG to replace abort with printf on error.  */
+//#define DEBUG 1
+
+#define TEST sse3_test
+
+static void sse3_test (void);
+
+static void
+__attribute__ ((noinline))
+do_test (void)
+{
+  sse3_test ();
+}
+
+int
+main ()
+{
+#ifdef __BUILTIN_CPU_SUPPORTS__
+  /* Most SSE intrinsic operations can be implemented via VMX
+     instructions, but some operations may be faster / simpler
+     using the POWER8 VSX instructions.  This is especially true
+     when we are transferring / converting to / from __m64 types.
+     The direct register transfer instructions from POWER8 are
+     especially important.  So we test for arch_2_07.  */
+  if (__builtin_cpu_supports ("arch_2_07"))
+    {
+      do_test ();
+#ifdef DEBUG
+      printf ("PASSED\n");
+#endif
+    }
+#ifdef DEBUG
+  else
+    printf ("SKIPPED\n");
+#endif
+#endif /* __BUILTIN_CPU_SUPPORTS__ */
+  return 0;
+}
Index: gcc/testsuite/gcc.target/powerpc/sse3-haddpd.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/sse3-haddpd.c	(nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/sse3-haddpd.c	(working copy)
@@ -0,0 +1,100 @@ 
+/* { dg-do run } */
+/* { dg-options "-O3 -mpower8-vector -Wno-psabi" } */
+/* { dg-require-effective-target lp64 } */
+/* { dg-require-effective-target p8vector_hw } */
+
+#define NO_WARN_X86_INTRINSICS 1
+#ifndef CHECK_H
+#define CHECK_H "sse3-check.h"
+#endif
+
+#include CHECK_H
+
+#ifndef TEST
+#define TEST sse3_test_haddpd_1
+#endif
+#include <pmmintrin.h>
+
+static void
+sse3_test_haddpd (double *i1, double *i2, double *r)
+{
+  __m128d t1 = _mm_loadu_pd (i1);
+  __m128d t2 = _mm_loadu_pd (i2);
+
+  t1 = _mm_hadd_pd (t1, t2);
+
+  _mm_storeu_pd (r, t1);
+}
+
+static void
+sse3_test_haddpd_subsume (double *i1, double *i2, double *r)
+{
+  __m128d t1 = _mm_load_pd (i1);
+  __m128d t2 = _mm_load_pd (i2);
+
+  t1 = _mm_hadd_pd (t1, t2);
+
+  _mm_storeu_pd (r, t1);
+}
+
+static int
+chk_pd (double *v1, double *v2)
+{
+  int i;
+  int n_fails = 0;
+
+  for (i = 0; i < 2; i++)
+    if (v1[i] != v2[i])
+      n_fails += 1;
+
+  return n_fails;
+}
+
+static double p1[2] __attribute__ ((aligned(16)));
+static double p2[2] __attribute__ ((aligned(16)));
+static double p3[2];
+static double ck[2];
+
+static double vals[] =
+  {
+    100.0,  200.0, 300.0, 400.0, 5.0, -1.0, .345, -21.5,
+    1100.0, 0.235, 321.3, 53.40, 0.3, 10.0, 42.0, 32.52,
+    32.6,   123.3, 1.234, 2.156, 0.1, 3.25, 4.75, 32.44,
+    12.16,  52.34, 64.12, 71.13, -.1, 2.30, 5.12, 3.785,
+    541.3,  321.4, 231.4, 531.4, 71., 321., 231., -531.,
+    23.45,  23.45, 23.45, 23.45, 23.45, 23.45, 23.45, 23.45,
+    23.45,  -1.43, -6.74, 6.345, -20.1, -20.1, -40.1, -40.1,
+    1.234,  2.345, 3.456, 4.567, 5.678, 6.789, 7.891, 8.912,
+    -9.32,  -8.41, -7.50, -6.59, -5.68, -4.77, -3.86, -2.95,
+    9.32,  8.41, 7.50, 6.59, -5.68, -4.77, -3.86, -2.95
+  };
+
+//static
+void TEST (void)
+{
+  int i;
+  int fail = 0;
+
+  for (i = 0; i < sizeof (vals) / sizeof (vals[0]); i += 4)
+    {
+      p1[0] = vals[i + 0];
+      p1[1] = vals[i + 1];
+
+      p2[0] = vals[i + 2];
+      p2[1] = vals[i + 3];
+
+      ck[0] = p1[0] + p1[1];
+      ck[1] = p2[0] + p2[1];
+
+      sse3_test_haddpd (p1, p2, p3);
+
+      fail += chk_pd (ck, p3);
+
+      sse3_test_haddpd_subsume (p1, p2, p3);
+
+      fail += chk_pd (ck, p3);
+    }
+
+  if (fail != 0)
+    abort ();
+}
Index: gcc/testsuite/gcc.target/powerpc/sse3-haddps.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/sse3-haddps.c	(nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/sse3-haddps.c	(working copy)
@@ -0,0 +1,108 @@ 
+/* { dg-do run } */
+/* { dg-options "-O3 -mpower8-vector -Wno-psabi" } */
+/* { dg-require-effective-target lp64 } */
+/* { dg-require-effective-target p8vector_hw } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse3-check.h"
+#endif
+
+#include CHECK_H
+
+#ifndef TEST
+#define TEST sse3_test_haddps_1
+#endif
+
+#define NO_WARN_X86_INTRINSICS 1
+#include <pmmintrin.h>
+
+static void
+sse3_test_haddps (float *i1, float *i2, float *r)
+{
+  __m128 t1 = _mm_loadu_ps (i1);
+  __m128 t2 = _mm_loadu_ps (i2);
+
+  t1 = _mm_hadd_ps (t1, t2);
+
+  _mm_storeu_ps (r, t1);
+}
+
+static void
+sse3_test_haddps_subsume (float *i1, float *i2, float *r)
+{
+  __m128 t1 = _mm_load_ps (i1);
+  __m128 t2 = _mm_load_ps (i2);
+
+  t1 = _mm_hadd_ps (t1, t2);
+
+  _mm_storeu_ps (r, t1);
+}
+
+static int
+chk_ps(float *v1, float *v2)
+{
+  int i;
+  int n_fails = 0;
+
+  for (i = 0; i < 4; i++)
+    if (v1[i] != v2[i])
+      n_fails += 1;
+
+  return n_fails;
+}
+
+static float p1[4] __attribute__ ((aligned(16)));
+static float p2[4] __attribute__ ((aligned(16)));
+static float p3[4];
+static float ck[4];
+
+static float vals[] =
+  {
+    100.0,  200.0, 300.0, 400.0, 5.0, -1.0, .345, -21.5,
+    1100.0, 0.235, 321.3, 53.40, 0.3, 10.0, 42.0, 32.52,
+    32.6,   123.3, 1.234, 2.156, 0.1, 3.25, 4.75, 32.44,
+    12.16,  52.34, 64.12, 71.13, -.1, 2.30, 5.12, 3.785,
+    541.3,  321.4, 231.4, 531.4, 71., 321., 231., -531.,
+    23.45,  23.45, 23.45, 23.45, 23.45, 23.45, 23.45, 23.45,
+    23.45,  -1.43, -6.74, 6.345, -20.1, -20.1, -40.1, -40.1,
+    1.234,  2.345, 3.456, 4.567, 5.678, 6.789, 7.891, 8.912,
+    -9.32,  -8.41, -7.50, -6.59, -5.68, -4.77, -3.86, -2.95,
+    9.32,  8.41, 7.50, 6.59, -5.68, -4.77, -3.86, -2.95
+  };
+
+//static
+void
+TEST ()
+{
+  int i;
+  int fail = 0;
+
+  for (i = 0; i < sizeof (vals) / sizeof (vals[0]); i += 8)
+    {
+      p1[0] = vals[i+0];
+      p1[1] = vals[i+1];
+      p1[2] = vals[i+2];
+      p1[3] = vals[i+3];
+
+      p2[0] = vals[i+4];
+      p2[1] = vals[i+5];
+      p2[2] = vals[i+6];
+      p2[3] = vals[i+7];
+
+      ck[0] = p1[0] + p1[1];
+      ck[1] = p1[2] + p1[3];
+      ck[2] = p2[0] + p2[1];
+      ck[3] = p2[2] + p2[3];
+
+      sse3_test_haddps (p1, p2, p3);
+
+      fail += chk_ps (ck, p3);
+
+      sse3_test_haddps_subsume (p1, p2, p3);
+
+      fail += chk_ps (ck, p3);
+    }
+
+  if (fail != 0)
+    abort ();
+}
Index: gcc/testsuite/gcc.target/powerpc/sse3-hsubpd.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/sse3-hsubpd.c	(nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/sse3-hsubpd.c	(working copy)
@@ -0,0 +1,101 @@ 
+/* { dg-do run } */
+/* { dg-options "-O3 -mpower8-vector -Wno-psabi" } */
+/* { dg-require-effective-target lp64 } */
+/* { dg-require-effective-target p8vector_hw } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse3-check.h"
+#endif
+
+#include CHECK_H
+
+#ifndef TEST
+#define TEST sse3_test_hsubpd_1
+#endif
+
+#define NO_WARN_X86_INTRINSICS 1
+#include <pmmintrin.h>
+
+static void
+sse3_test_hsubpd (double *i1, double *i2, double *r)
+{
+  __m128d t1 = _mm_loadu_pd (i1);
+  __m128d t2 = _mm_loadu_pd (i2);
+
+  t1 = _mm_hsub_pd (t1, t2);
+
+  _mm_storeu_pd (r, t1);
+}
+
+static void
+sse3_test_hsubpd_subsume (double *i1, double *i2, double *r)
+{
+  __m128d t1 = _mm_load_pd (i1);
+  __m128d t2 = _mm_load_pd (i2);
+
+  t1 = _mm_hsub_pd (t1, t2);
+
+  _mm_storeu_pd (r, t1);
+}
+
+static int
+chk_pd (double *v1, double *v2)
+{
+  int i;
+  int n_fails = 0;
+
+  for (i = 0; i < 2; i++)
+    if (v1[i] != v2[i])
+      n_fails += 1;
+
+  return n_fails;
+}
+
+static double p1[2] __attribute__ ((aligned(16)));
+static double p2[2] __attribute__ ((aligned(16)));
+static double p3[2];
+static double ck[2];
+
+static double vals[] =
+  {
+    100.0,  200.0, 300.0, 400.0, 5.0, -1.0, .345, -21.5,
+    1100.0, 0.235, 321.3, 53.40, 0.3, 10.0, 42.0, 32.52,
+    32.6,   123.3, 1.234, 2.156, 0.1, 3.25, 4.75, 32.44,
+    12.16,  52.34, 64.12, 71.13, -.1, 2.30, 5.12, 3.785,
+    541.3,  321.4, 231.4, 531.4, 71., 321., 231., -531.,
+    23.45,  23.45, 23.45, 23.45, 23.45, 23.45, 23.45, 23.45,
+    23.45,  -1.43, -6.74, 6.345, -20.1, -20.1, -40.1, -40.1,
+    1.234,  2.345, 3.456, 4.567, 5.678, 6.789, 7.891, 8.912,
+    -9.32,  -8.41, -7.50, -6.59, -5.68, -4.77, -3.86, -2.95,
+    9.32,  8.41, 7.50, 6.59, -5.68, -4.77, -3.86, -2.95
+  };
+
+//static
+void TEST (void)
+{
+  int i;
+  int fail = 0;
+
+  for (i = 0; i < sizeof (vals) / sizeof (vals[0]); i += 4)
+    {
+      p1[0] = vals[i + 0];
+      p1[1] = vals[i + 1];
+
+      p2[0] = vals[i + 2];
+      p2[1] = vals[i + 3];
+
+      ck[0] = p1[0] - p1[1];
+      ck[1] = p2[0] - p2[1];
+
+      sse3_test_hsubpd (p1, p2, p3);
+
+      fail += chk_pd (ck, p3);
+
+      sse3_test_hsubpd_subsume (p1, p2, p3);
+
+      fail += chk_pd (ck, p3);
+    }
+
+  if (fail != 0)
+    abort ();
+}
Index: gcc/testsuite/gcc.target/powerpc/sse3-hsubps.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/sse3-hsubps.c	(nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/sse3-hsubps.c	(working copy)
@@ -0,0 +1,108 @@ 
+/* { dg-do run } */
+/* { dg-options "-O3 -mpower8-vector -Wno-psabi" } */
+/* { dg-require-effective-target lp64 } */
+/* { dg-require-effective-target p8vector_hw } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse3-check.h"
+#endif
+
+#include CHECK_H
+
+#ifndef TEST
+#define TEST sse3_test_hsubps_1
+#endif
+
+#define NO_WARN_X86_INTRINSICS 1
+#include <pmmintrin.h>
+
+static void
+sse3_test_hsubps (float *i1, float *i2, float *r)
+{
+  __m128 t1 = _mm_loadu_ps (i1);
+  __m128 t2 = _mm_loadu_ps (i2);
+
+  t1 = _mm_hsub_ps (t1, t2);
+
+  _mm_storeu_ps (r, t1);
+}
+
+static void
+sse3_test_hsubps_subsume (float *i1, float *i2, float *r)
+{
+  __m128 t1 = _mm_load_ps (i1);
+  __m128 t2 = _mm_load_ps (i2);
+
+  t1 = _mm_hsub_ps (t1, t2);
+
+  _mm_storeu_ps (r, t1);
+}
+
+static int
+chk_ps(float *v1, float *v2)
+{
+  int i;
+  int n_fails = 0;
+
+  for (i = 0; i < 4; i++)
+    if (v1[i] != v2[i])
+      n_fails += 1;
+
+  return n_fails;
+}
+
+static float p1[4] __attribute__ ((aligned(16)));
+static float p2[4] __attribute__ ((aligned(16)));
+static float p3[4];
+static float ck[4];
+
+static float vals[] =
+  {
+    100.0,  200.0, 300.0, 400.0, 5.0, -1.0, .345, -21.5,
+    1100.0, 0.235, 321.3, 53.40, 0.3, 10.0, 42.0, 32.52,
+    32.6,   123.3, 1.234, 2.156, 0.1, 3.25, 4.75, 32.44,
+    12.16,  52.34, 64.12, 71.13, -.1, 2.30, 5.12, 3.785,
+    541.3,  321.4, 231.4, 531.4, 71., 321., 231., -531.,
+    23.45,  23.45, 23.45, 23.45, 23.45, 23.45, 23.45, 23.45,
+    23.45,  -1.43, -6.74, 6.345, -20.1, -20.1, -40.1, -40.1,
+    1.234,  2.345, 3.456, 4.567, 5.678, 6.789, 7.891, 8.912,
+    -9.32,  -8.41, -7.50, -6.59, -5.68, -4.77, -3.86, -2.95,
+    9.32,  8.41, 7.50, 6.59, -5.68, -4.77, -3.86, -2.95
+  };
+
+//static
+void
+TEST ()
+{
+  int i;
+  int fail = 0;
+
+  for (i = 0; i < sizeof (vals) / sizeof (vals[0]); i += 8)
+    {
+      p1[0] = vals[i+0];
+      p1[1] = vals[i+1];
+      p1[2] = vals[i+2];
+      p1[3] = vals[i+3];
+
+      p2[0] = vals[i+4];
+      p2[1] = vals[i+5];
+      p2[2] = vals[i+6];
+      p2[3] = vals[i+7];
+
+      ck[0] = p1[0] - p1[1];
+      ck[1] = p1[2] - p1[3];
+      ck[2] = p2[0] - p2[1];
+      ck[3] = p2[2] - p2[3];
+
+      sse3_test_hsubps (p1, p2, p3);
+
+      fail += chk_ps (ck, p3);
+
+      sse3_test_hsubps_subsume (p1, p2, p3);
+
+      fail += chk_ps (ck, p3);
+    }
+
+  if (fail != 0)
+    abort ();
+}
Index: gcc/testsuite/gcc.target/powerpc/sse3-lddqu.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/sse3-lddqu.c	(nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/sse3-lddqu.c	(working copy)
@@ -0,0 +1,80 @@ 
+/* { dg-do run } */
+/* { dg-options "-O3 -mpower8-vector -Wno-psabi" } */
+/* { dg-require-effective-target lp64 } */
+/* { dg-require-effective-target p8vector_hw } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse3-check.h"
+#endif
+
+#include CHECK_H
+
+#ifndef TEST
+#define TEST sse3_test_lddqu_1
+#endif
+
+#define NO_WARN_X86_INTRINSICS 1
+#include <pmmintrin.h>
+
+static void
+sse3_test_lddqu (double *i1, double *r)
+{
+  __m128i t1 = _mm_lddqu_si128 ((__m128i *) i1);
+
+  _mm_storeu_si128 ((__m128i *) r, t1);
+}
+
+static int
+chk_pd (double *v1, double *v2)
+{
+  int i;
+  int n_fails = 0;
+
+  for (i = 0; i < 2; i++)
+    if (v1[i] != v2[i])
+      n_fails += 1;
+
+  return n_fails;
+}
+
+static double p1[2];
+static double p2[2];
+static double ck[2];
+
+static double vals[] =
+  {
+    100.0,  200.0, 300.0, 400.0, 5.0, -1.0, .345, -21.5,
+    1100.0, 0.235, 321.3, 53.40, 0.3, 10.0, 42.0, 32.52,
+    32.6,   123.3, 1.234, 2.156, 0.1, 3.25, 4.75, 32.44,
+    12.16,  52.34, 64.12, 71.13, -.1, 2.30, 5.12, 3.785,
+    541.3,  321.4, 231.4, 531.4, 71., 321., 231., -531.,
+    23.45,  23.45, 23.45, 23.45, 23.45, 23.45, 23.45, 23.45,
+    23.45,  -1.43, -6.74, 6.345, -20.1, -20.1, -40.1, -40.1,
+    1.234,  2.345, 3.456, 4.567, 5.678, 6.789, 7.891, 8.912,
+    -9.32,  -8.41, -7.50, -6.59, -5.68, -4.77, -3.86, -2.95,
+    9.32,  8.41, 7.50, 6.59, -5.68, -4.77, -3.86, -2.95
+  };
+
+//static
+void
+TEST (void)
+{
+  int i;
+  int fail = 0;
+
+  for (i = 0; i < sizeof (vals) / sizeof (vals[0]); i += 2)
+    {
+      p1[0] = vals[i+0];
+      p1[1] = vals[i+1];
+
+      sse3_test_lddqu (p1, p2);
+
+      ck[0] = p1[0];
+      ck[1] = p1[1];
+
+      fail += chk_pd (ck, p2);
+    }
+
+  if (fail != 0)
+    abort ();
+}
Index: gcc/testsuite/gcc.target/powerpc/sse3-movddup.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/sse3-movddup.c	(nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/sse3-movddup.c	(working copy)
@@ -0,0 +1,135 @@ 
+/* { dg-do run } */
+/* { dg-options "-O3 -mpower8-vector -Wno-psabi" } */
+/* { dg-require-effective-target lp64 } */
+/* { dg-require-effective-target p8vector_hw } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse3-check.h"
+#endif
+#include CHECK_H
+
+#ifndef TEST
+#define TEST sse3_test_movddup_1
+#endif
+
+#define NO_WARN_X86_INTRINSICS 1
+#include <pmmintrin.h>
+
+static void
+sse3_test_movddup_mem (double *i1, double *r)
+{
+  __m128d t1 = _mm_loaddup_pd (i1);
+
+  _mm_storeu_pd (r, t1);
+}
+
+static double cnst1 [2] = {1.0, 1.0};
+
+static void
+sse3_test_movddup_reg (double *i1, double *r)
+{
+  __m128d t1 = _mm_loadu_pd (i1);
+  __m128d t2 = _mm_loadu_pd (&cnst1[0]);
+
+  t1  = _mm_mul_pd (t1, t2);
+  t2  = _mm_movedup_pd (t1);
+
+  _mm_storeu_pd (r, t2);
+}
+
+static void
+sse3_test_movddup_reg_subsume_unaligned (double *i1, double *r)
+{
+  __m128d t1 = _mm_loadu_pd (i1);
+  __m128d t2 = _mm_movedup_pd (t1);
+
+  _mm_storeu_pd (r, t2);
+}
+
+static void
+sse3_test_movddup_reg_subsume_ldsd (double *i1, double *r)
+{
+  __m128d t1 = _mm_load_sd (i1);
+  __m128d t2 = _mm_movedup_pd (t1);
+
+  _mm_storeu_pd (r, t2);
+}
+
+static void
+sse3_test_movddup_reg_subsume (double *i1, double *r)
+{
+  __m128d t1 = _mm_load_pd (i1);
+  __m128d t2 = _mm_movedup_pd (t1);
+
+  _mm_storeu_pd (r, t2);
+}
+
+static int
+chk_pd (double *v1, double *v2)
+{
+  int i;
+  int n_fails = 0;
+
+  for (i = 0; i < 2; i++)
+    if (v1[i] != v2[i])
+      n_fails += 1;
+
+  return n_fails;
+}
+
+static double p1[2] __attribute__ ((aligned(16)));
+static double p2[2];
+static double ck[2];
+
+static double vals[] =
+  {
+    100.0,  200.0, 300.0, 400.0, 5.0, -1.0, .345, -21.5,
+    1100.0, 0.235, 321.3, 53.40, 0.3, 10.0, 42.0, 32.52,
+    32.6,   123.3, 1.234, 2.156, 0.1, 3.25, 4.75, 32.44,
+    12.16,  52.34, 64.12, 71.13, -.1, 2.30, 5.12, 3.785,
+    541.3,  321.4, 231.4, 531.4, 71., 321., 231., -531.,
+    23.45,  23.45, 23.45, 23.45, 23.45, 23.45, 23.45, 23.45,
+    23.45,  -1.43, -6.74, 6.345, -20.1, -20.1, -40.1, -40.1,
+    1.234,  2.345, 3.456, 4.567, 5.678, 6.789, 7.891, 8.912,
+    -9.32,  -8.41, -7.50, -6.59, -5.68, -4.77, -3.86, -2.95,
+    9.32,  8.41, 7.50, 6.59, -5.68, -4.77, -3.86, -2.95
+  };
+
+//static
+void
+TEST (void)
+{
+  int i;
+  int fail = 0;
+
+  for (i = 0; i < sizeof (vals) / sizeof (vals[0]); i += 1)
+    {
+      p1[0] = vals[i+0];
+
+      ck[0] = p1[0];
+      ck[1] = p1[0];
+
+      sse3_test_movddup_mem (p1, p2);
+
+      fail += chk_pd (ck, p2);
+
+      sse3_test_movddup_reg (p1, p2);
+
+      fail += chk_pd (ck, p2);
+
+      sse3_test_movddup_reg_subsume (p1, p2);
+
+      fail += chk_pd (ck, p2);
+
+      sse3_test_movddup_reg_subsume_unaligned (p1, p2);
+
+      fail += chk_pd (ck, p2);
+
+      sse3_test_movddup_reg_subsume_ldsd (p1, p2);
+
+      fail += chk_pd (ck, p2);
+    }
+
+  if (fail != 0)
+    abort ();
+}
Index: gcc/testsuite/gcc.target/powerpc/sse3-movshdup.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/sse3-movshdup.c	(nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/sse3-movshdup.c	(working copy)
@@ -0,0 +1,98 @@ 
+/* { dg-do run } */
+/* { dg-options "-O3 -mpower8-vector -Wno-psabi" } */
+/* { dg-require-effective-target lp64 } */
+/* { dg-require-effective-target p8vector_hw } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse3-check.h"
+#endif
+
+#include CHECK_H
+
+#ifndef TEST
+#define TEST sse3_test_movshdup_1
+#endif
+
+#define NO_WARN_X86_INTRINSICS 1
+#include <pmmintrin.h>
+
+static void
+sse3_test_movshdup_reg (float *i1, float *r)
+{
+  __m128 t1 = _mm_loadu_ps (i1);
+  __m128 t2 = _mm_movehdup_ps (t1);
+
+  _mm_storeu_ps (r, t2);
+}
+
+static void
+sse3_test_movshdup_reg_subsume (float *i1, float *r)
+{
+  __m128 t1 = _mm_load_ps (i1);
+  __m128 t2 = _mm_movehdup_ps (t1);
+
+  _mm_storeu_ps (r, t2);
+}
+
+static int
+chk_ps (float *v1, float *v2)
+{
+  int i;
+  int n_fails = 0;
+
+  for (i = 0; i < 4; i++)
+    if (v1[i] != v2[i])
+      n_fails += 1;
+
+  return n_fails;
+}
+
+static float p1[4] __attribute__ ((aligned(16)));
+static float p2[4];
+static float ck[4];
+
+static float vals[] =
+  {
+    100.0,  200.0, 300.0, 400.0, 5.0, -1.0, .345, -21.5,
+    1100.0, 0.235, 321.3, 53.40, 0.3, 10.0, 42.0, 32.52,
+    32.6,   123.3, 1.234, 2.156, 0.1, 3.25, 4.75, 32.44,
+    12.16,  52.34, 64.12, 71.13, -.1, 2.30, 5.12, 3.785,
+    541.3,  321.4, 231.4, 531.4, 71., 321., 231., -531.,
+    23.45,  23.45, 23.45, 23.45, 23.45, 23.45, 23.45, 23.45,
+    23.45,  -1.43, -6.74, 6.345, -20.1, -20.1, -40.1, -40.1,
+    1.234,  2.345, 3.456, 4.567, 5.678, 6.789, 7.891, 8.912,
+    -9.32,  -8.41, -7.50, -6.59, -5.68, -4.77, -3.86, -2.95,
+    9.32,  8.41, 7.50, 6.59, -5.68, -4.77, -3.86, -2.95
+  };
+
+//static
+void
+TEST (void)
+{
+  int i;
+  int fail = 0;
+
+  for (i = 0; i < sizeof (vals) / sizeof (vals[0]); i += 2)
+    {
+      p1[0] = 0.0;
+      p1[1] = vals[i+0];
+      p1[2] = 1.0;
+      p1[3] = vals[i+1];
+
+      ck[0] = p1[1];
+      ck[1] = p1[1];
+      ck[2] = p1[3];
+      ck[3] = p1[3];
+
+      sse3_test_movshdup_reg (p1, p2);
+
+      fail += chk_ps (ck, p2);
+
+      sse3_test_movshdup_reg_subsume (p1, p2);
+
+      fail += chk_ps (ck, p2);
+    }
+
+  if (fail != 0)
+    abort ();
+}
Index: gcc/testsuite/gcc.target/powerpc/sse3-movsldup.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/sse3-movsldup.c	(nonexistent)
+++ gcc/testsuite/gcc.target/powerpc/sse3-movsldup.c	(working copy)
@@ -0,0 +1,98 @@ 
+/* { dg-do run } */
+/* { dg-options "-O3 -mpower8-vector -Wno-psabi" } */
+/* { dg-require-effective-target lp64 } */
+/* { dg-require-effective-target p8vector_hw } */
+
+#ifndef CHECK_H
+#define CHECK_H "sse3-check.h"
+#endif
+
+#include CHECK_H
+
+#ifndef TEST
+#define TEST sse3_test_movsldup_1
+#endif
+
+#define NO_WARN_X86_INTRINSICS 1
+#include <pmmintrin.h>
+
+static void
+sse3_test_movsldup_reg (float *i1, float *r)
+{
+  __m128 t1 = _mm_loadu_ps (i1);
+  __m128 t2 = _mm_moveldup_ps (t1);
+
+  _mm_storeu_ps (r, t2);
+}
+
+static void
+sse3_test_movsldup_reg_subsume (float *i1, float *r)
+{
+  __m128 t1 = _mm_load_ps (i1);
+  __m128 t2 = _mm_moveldup_ps (t1);
+
+  _mm_storeu_ps (r, t2);
+}
+
+static int
+chk_ps (float *v1, float *v2)
+{
+  int i;
+  int n_fails = 0;
+
+  for (i = 0; i < 4; i++)
+    if (v1[i] != v2[i])
+      n_fails += 1;
+
+  return n_fails;
+}
+
+static float p1[4] __attribute__ ((aligned(16)));
+static float p2[4];
+static float ck[4];
+
+static float vals[] =
+  {
+    100.0,  200.0, 300.0, 400.0, 5.0, -1.0, .345, -21.5,
+    1100.0, 0.235, 321.3, 53.40, 0.3, 10.0, 42.0, 32.52,
+    32.6,   123.3, 1.234, 2.156, 0.1, 3.25, 4.75, 32.44,
+    12.16,  52.34, 64.12, 71.13, -.1, 2.30, 5.12, 3.785,
+    541.3,  321.4, 231.4, 531.4, 71., 321., 231., -531.,
+    23.45,  23.45, 23.45, 23.45, 23.45, 23.45, 23.45, 23.45,
+    23.45,  -1.43, -6.74, 6.345, -20.1, -20.1, -40.1, -40.1,
+    1.234,  2.345, 3.456, 4.567, 5.678, 6.789, 7.891, 8.912,
+    -9.32,  -8.41, -7.50, -6.59, -5.68, -4.77, -3.86, -2.95,
+    9.32,  8.41, 7.50, 6.59, -5.68, -4.77, -3.86, -2.95
+  };
+
+//static
+void
+TEST (void)
+{
+  int i;
+  int fail = 0;
+
+  for (i = 0; i < sizeof (vals) / sizeof (vals[0]); i += 2)
+    {
+      p1[0] = vals[i+0];
+      p1[1] = 0.0;
+      p1[2] = vals[i+1];
+      p1[3] = 1.0;
+
+      ck[0] = p1[0];
+      ck[1] = p1[0];
+      ck[2] = p1[2];
+      ck[3] = p1[2];
+
+      sse3_test_movsldup_reg (p1, p2);
+
+      fail += chk_ps (ck, p2);
+
+      sse3_test_movsldup_reg_subsume (p1, p2);
+
+      fail += chk_ps (ck, p2);
+    }
+
+  if (fail != 0)
+    abort ();
+}
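
For readers comparing the expected values (the ck arrays) across the single-precision tests above, the lane arithmetic they check can be summarized with scalar reference models. This is only a sketch derived from the ck computations in the tests. The ref_* names are hypothetical and are not part of the patch or of pmmintrin.h, and the output array is assumed not to alias the inputs.

```c
#include <stddef.h>

/* Hypothetical scalar reference models; illustrative only.  */

/* addsub: subtract in even lanes, add in odd lanes.  */
void
ref_addsub_ps (const float a[4], const float b[4], float r[4])
{
  for (size_t i = 0; i < 4; i++)
    r[i] = (i % 2 == 0) ? a[i] - b[i] : a[i] + b[i];
}

/* hadd: pairwise sums of a fill the low half, those of b the high half.  */
void
ref_hadd_ps (const float a[4], const float b[4], float r[4])
{
  r[0] = a[0] + a[1];
  r[1] = a[2] + a[3];
  r[2] = b[0] + b[1];
  r[3] = b[2] + b[3];
}

/* hsub: pairwise differences, laid out the same way as hadd.  */
void
ref_hsub_ps (const float a[4], const float b[4], float r[4])
{
  r[0] = a[0] - a[1];
  r[1] = a[2] - a[3];
  r[2] = b[0] - b[1];
  r[3] = b[2] - b[3];
}

/* movehdup: duplicate the odd-indexed lanes.  */
void
ref_movehdup_ps (const float a[4], float r[4])
{
  r[0] = a[1];
  r[1] = a[1];
  r[2] = a[3];
  r[3] = a[3];
}

/* moveldup: duplicate the even-indexed lanes.  */
void
ref_moveldup_ps (const float a[4], float r[4])
{
  r[0] = a[0];
  r[1] = a[0];
  r[2] = a[2];
  r[3] = a[2];
}
```

The double-precision tests follow the same pattern with two lanes per vector instead of four.
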