Add support for vbpermq builtin; Improve vec_extract

Message ID 20140326195045.GA24185@ibm-tiger.the-meissners.org
State New

Commit Message

Michael Meissner March 26, 2014, 7:50 p.m. UTC
This patch adds a builtin to generate the vbpermq instruction on ISA 2.07.
This instruction takes two vectors in the Altivec register set, and returns a
64-bit value in the upper part of the register, and 0 in the lower part of the
register.

The output is explicitly a vector, since the documentation for the instruction
says that to do a permutation of all 8 bits, you need to do 2 vbpermq's, one
with the high bit in each byte within the vector set, and the other with the
high bit cleared.

	vbpermq v6,v1,v2        # select from high-order half of Q
	vxor    v0,v1,v4        # adjust index values
	vbpermq v5,v0,v3        # select from low-order half of Q
	vor     v6,v6,v5        # merge the two selections
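
For readers who want to check results by hand, here is a minimal scalar model of a single vbpermq in C. It is a sketch based on my reading of the ISA 2.07 description, not code from this patch: the name model_vbpermq is hypothetical, and the vectors are assumed to be 16 bytes in big endian order, with bit 0 being the most significant bit of byte 0.

```c
#include <stdint.h>

/* Scalar reference model of one vbpermq (a sketch, not GCC code).
   a[] and b[] hold the two input vectors in big endian byte order, so
   bit 0 of the vector is the most significant bit of a[0].  */
uint64_t model_vbpermq(const uint8_t a[16], const uint8_t b[16])
{
    uint64_t perm = 0;

    for (int i = 0; i < 16; i++) {
        unsigned index = b[i];      /* bit number selected by byte i */
        unsigned bit = 0;

        /* An index with the high bit set (>= 128) selects a 0 bit.  */
        if (index < 128)
            bit = (a[index / 8] >> (7 - index % 8)) & 1u;

        /* The bit selected by byte 0 ends up most significant.  */
        perm = (perm << 1) | bit;
    }

    /* The 16 result bits sit at the bottom of the upper doubleword;
       the rest of the register is zero.  */
    return perm;
}
```

The return value corresponds to what vec_extract would return for the upper 64-bit element of the result vector.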

In writing the tests, I noticed that the vec_extract code did not have
optimizations for getting 64-bit data out if the vector element happens to be
0 on big endian systems, or 1 on little endian systems.  So I added
optimizations for register/register moves, including using the mfvsrd
instruction to transfer the final result to a GPR.  While I was there, I added
vec_extract optimizations to do a 64-bit store, and I combined the big endian
and little endian vec_extract load optimizations.
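
The register-to-register selection described above can be sketched in plain C. This is a simplified illustration, not the machine description itself: the enum and function names are hypothetical, the direct_move flag stands in for TARGET_POWERPC64 && TARGET_DIRECT_MOVE, and a NULL return only means that this sketch has no single-instruction choice for the case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical register kinds used only by this sketch.  */
enum reg_kind { REG_GPR, REG_FPR, REG_ALTIVEC };

/* Mirrors VECTOR_ELEMENT_SCALAR_64BIT: the element that scalar
   instructions operate on is 0 on big endian and 1 on little endian.  */
int scalar_element_64bit(bool big_endian)
{
    return big_endian ? 0 : 1;
}

/* DM field for xxpermdi, computed the same way as in the patch.  */
int xxpermdi_dm(int element, bool big_endian)
{
    int dm = element << 1;
    return big_endian ? dm : 3 - dm;
}

/* Pick the mnemonic for extracting a 64-bit element from a vector
   register.  The source is assumed to be an FPR or Altivec register.  */
const char *extract_insn(int element, bool big_endian, bool direct_move,
                         enum reg_kind dest, enum reg_kind src, bool same_reg)
{
    assert(element == 0 || element == 1);

    if (dest == REG_GPR) {
        /* Only a direct move reaches a GPR in a single instruction.  */
        if (direct_move && element == scalar_element_64bit(big_endian))
            return "mfvsrd";
        return NULL;
    }

    if (element != scalar_element_64bit(big_endian))
        return "xxpermdi";      /* the general permute is still needed */

    if (same_reg)
        return "nop";           /* the value is already in place */

    if (dest == REG_FPR && src == REG_FPR)
        return "fmr";

    return "xxlor";
}
```

For example, extracting element 1 on a little endian system between two different floating point registers picks fmr, while extracting element 0 there still needs xxpermdi.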

I built a big endian Spec 2006 suite with this compiler, and compared it to the
trunk compiler without the changes.  Only 3 benchmarks (gamess, dealII, and
povray) generated vec_extracts that became moves instead of permutes.  I ran
the tests on a power7 system, and the differences in run time were in the noise
level.  None of the spec benchmarks generated vec_extract that was a load or a
store.

I did bootstraps on a big endian power7 system, a big endian power8 system, and
a little endian power8 system with no regressions in the test suite.  Are these
patches ok to install on the trunk?  I would also like to apply these patches
to the 4.8 branch once all of the ISA 2.07 changes are present there.  Can I
apply these patches?

[gcc]
2014-03-26  Michael Meissner  <meissner@linux.vnet.ibm.com>

	* config/rs6000/constraints.md (wD constraint): New constraint to
	match the constant integer to get the top DImode/DFmode out of a
	vector in a VSX register.

	* config/rs6000/predicates.md (vsx_scalar_64bit): New predicate to
	match the constant integer to get the top DImode/DFmode out of a
	vector in a VSX register.

	* config/rs6000/rs6000-builtin.def (VBPERMQ): Add vbpermq builtin
	for ISA 2.07.

	* config/rs6000/rs6000-c.c (altivec_overloaded_builtins): Add
	vbpermq builtins.

	* config/rs6000/rs6000.c (rs6000_debug_reg_global): If
	-mdebug=reg, print value of VECTOR_ELEMENT_SCALAR_64BIT.

	* config/rs6000/vsx.md (vsx_extract_<mode>, V2DI/V2DF modes):
	Optimize vec_extract of 64-bit values, where the value being
	extracted is in the top word, where we can use scalar
	instructions.  Add direct move and store support.  Combine the big
	endian/little endian vector select load support into a single
	insn.
	(vsx_extract_<mode>_internal1): Likewise.
	(vsx_extract_<mode>_internal2): Likewise.
	(vsx_extract_<mode>_load): Likewise.
	(vsx_extract_<mode>_store): Likewise.
	(vsx_extract_<mode>_zero): Delete, big and little endian insns are
	combined into vsx_extract_<mode>_load.
	(vsx_extract_<mode>_one_le): Likewise.

	* config/rs6000/rs6000.h (VECTOR_ELEMENT_SCALAR_64BIT): Macro to
	define the top 64-bit vector element.

	* doc/md.texi (PowerPC and IBM RS6000 constraints): Document wD
	constraint.

[gcc/testsuite]
2014-03-26  Michael Meissner  <meissner@linux.vnet.ibm.com>

	* gcc.target/powerpc/p8vector-vbpermq.c: New test to test the
	vbpermq builtin.

	* gcc.target/powerpc/vsx-extract-1.c: New test to test VSX
	vec_select optimizations.
	* gcc.target/powerpc/vsx-extract-2.c: Likewise.
	* gcc.target/powerpc/vsx-extract-3.c: Likewise.

Comments

David Edelsohn March 27, 2014, 12:30 a.m. UTC | #1

Okay.

Good to add the optimizations.

I notice that you emit nop with a comment after a "#" character. I
notice that you also added that to the POWER8 vector fusion peepholes.

Is it safe to assume that all assemblers for PowerPC will consider all
characters after a "#" to be a comment?

I would like to make sure there are no other problems with the patch
before backporting to 4.8. It wasn't included in the group of patches
for 4.8 that have been widely tested.

Thanks, David

Michael Meissner March 27, 2014, 3:53 p.m. UTC | #2
On Wed, Mar 26, 2014 at 08:30:39PM -0400, David Edelsohn wrote:
> Okay.
> 
> Good to add the optimizations.
> 
> I notice that you emit nop with a comment after a "#" character. I
> notice that you also added that to the POWER8 vector fusion peepholes.
> 
> Is it safe to assume that all assemblers for PowerPC will consider all
> characters after a "#" to be a comment?

Well, in this case we are considering only PowerPC assemblers that support VSX.

The fusion stuff uses ASM_COMMENT_START, so it should be safe.  I can delete
the comments on the nop, or delete the nop handling and just do fmr/xxlor to
the same register.  I put the comments on, so I could tell in processing the
asm files which flavor of vec_extract was used (and as I said, on spec 2006,
only the fmr/xxlor's were generated, and no nop's).

> I would like to make sure there are no other problems with the patch
> before backporting to 4.8. It wasn't included in the group of patches
> for 4.8 that have been widely tested.

I would at least like to add the part that adds vbpermq, even if we don't add
the vec_extract optimizations.

Patch

Index: gcc/config/rs6000/constraints.md
===================================================================
--- gcc/config/rs6000/constraints.md	(revision 208726)
+++ gcc/config/rs6000/constraints.md	(working copy)
@@ -106,6 +106,11 @@  (define_register_constraint "wy" "rs6000
 (define_register_constraint "wz" "rs6000_constraints[RS6000_CONSTRAINT_wz]"
   "Floating point register if the LFIWZX instruction is enabled or NO_REGS.")
 
+(define_constraint "wD"
+  "Int constant that is the element number of the 64-bit scalar in a vector."
+  (and (match_code "const_int")
+       (match_test "TARGET_VSX && (ival == VECTOR_ELEMENT_SCALAR_64BIT)")))
+
 ;; Lq/stq validates the address for load/store quad
 (define_memory_constraint "wQ"
   "Memory operand suitable for the load/store quad instructions"
Index: gcc/config/rs6000/predicates.md
===================================================================
--- gcc/config/rs6000/predicates.md	(revision 208726)
+++ gcc/config/rs6000/predicates.md	(working copy)
@@ -981,6 +981,14 @@  (define_predicate "zero_reg_mem_operand"
   (ior (match_operand 0 "zero_fp_constant")
        (match_operand 0 "reg_or_mem_operand")))
 
+;; Return 1 if the operand is a CONST_INT and it is the element for 64-bit
+;; data types inside of a vector that scalar instructions operate on
+(define_predicate "vsx_scalar_64bit"
+  (match_code "const_int")
+{
+  return (INTVAL (op) == VECTOR_ELEMENT_SCALAR_64BIT);
+})
+
 ;; Return 1 if the operand is a general register or memory operand without
 ;; pre_inc or pre_dec or pre_modify, which produces invalid form of PowerPC
 ;; lwa instruction.
Index: gcc/config/rs6000/rs6000-builtin.def
===================================================================
--- gcc/config/rs6000/rs6000-builtin.def	(revision 208726)
+++ gcc/config/rs6000/rs6000-builtin.def	(working copy)
@@ -1374,6 +1374,7 @@  BU_P8V_AV_2 (VMINUD,		"vminud",	CONST,	u
 BU_P8V_AV_2 (VMAXUD,		"vmaxud",	CONST,	umaxv2di3)
 BU_P8V_AV_2 (VMRGEW,		"vmrgew",	CONST,	p8_vmrgew)
 BU_P8V_AV_2 (VMRGOW,		"vmrgow",	CONST,	p8_vmrgow)
+BU_P8V_AV_2 (VBPERMQ,		"vbpermq",	CONST,	altivec_vbpermq)
 BU_P8V_AV_2 (VPKUDUM,		"vpkudum",	CONST,	altivec_vpkudum)
 BU_P8V_AV_2 (VPKSDSS,		"vpksdss",	CONST,	altivec_vpksdss)
 BU_P8V_AV_2 (VPKUDUS,		"vpkudus",	CONST,	altivec_vpkudus)
@@ -1448,6 +1449,7 @@  BU_P8V_OVERLOAD_2 (ORC,		"orc")
 BU_P8V_OVERLOAD_2 (VADDCUQ,	"vaddcuq")
 BU_P8V_OVERLOAD_2 (VADDUDM,	"vaddudm")
 BU_P8V_OVERLOAD_2 (VADDUQM,	"vadduqm")
+BU_P8V_OVERLOAD_2 (VBPERMQ,	"vbpermq")
 BU_P8V_OVERLOAD_2 (VMAXSD,	"vmaxsd")
 BU_P8V_OVERLOAD_2 (VMAXUD,	"vmaxud")
 BU_P8V_OVERLOAD_2 (VMINSD,	"vminsd")
Index: gcc/config/rs6000/rs6000-c.c
===================================================================
--- gcc/config/rs6000/rs6000-c.c	(revision 208726)
+++ gcc/config/rs6000/rs6000-c.c	(working copy)
@@ -3778,6 +3778,12 @@  const struct altivec_builtin_types altiv
     RS6000_BTI_unsigned_V1TI, RS6000_BTI_unsigned_V1TI,
     RS6000_BTI_unsigned_V1TI, 0 },
 
+  { P8V_BUILTIN_VEC_VBPERMQ, P8V_BUILTIN_VBPERMQ,
+    RS6000_BTI_V2DI, RS6000_BTI_V16QI, RS6000_BTI_V16QI, 0 },
+  { P8V_BUILTIN_VEC_VBPERMQ, P8V_BUILTIN_VBPERMQ,
+    RS6000_BTI_unsigned_V2DI, RS6000_BTI_unsigned_V16QI,
+    RS6000_BTI_unsigned_V16QI, 0 },
+
   { P8V_BUILTIN_VEC_VCLZ, P8V_BUILTIN_VCLZB,
     RS6000_BTI_V16QI, RS6000_BTI_V16QI, 0, 0 },
   { P8V_BUILTIN_VEC_VCLZ, P8V_BUILTIN_VCLZB,
Index: gcc/config/rs6000/rs6000.c
===================================================================
--- gcc/config/rs6000/rs6000.c	(revision 208726)
+++ gcc/config/rs6000/rs6000.c	(working copy)
@@ -2310,6 +2310,10 @@  rs6000_debug_reg_global (void)
 	   (int)END_BUILTINS);
   fprintf (stderr, DEBUG_FMT_D, "Number of rs6000 builtins",
 	   (int)RS6000_BUILTIN_COUNT);
+
+  if (TARGET_VSX)
+    fprintf (stderr, DEBUG_FMT_D, "VSX easy 64-bit scalar element",
+	     (int)VECTOR_ELEMENT_SCALAR_64BIT);
 }
 
 
Index: gcc/config/rs6000/vsx.md
===================================================================
--- gcc/config/rs6000/vsx.md	(revision 208726)
+++ gcc/config/rs6000/vsx.md	(working copy)
@@ -1531,52 +1531,129 @@  (define_insn "vsx_set_<mode>"
   [(set_attr "type" "vecperm")])
 
 ;; Extract a DF/DI element from V2DF/V2DI
-(define_insn "vsx_extract_<mode>"
-  [(set (match_operand:<VS_scalar> 0 "vsx_register_operand" "=ws,d,?wa")
-	(vec_select:<VS_scalar> (match_operand:VSX_D 1 "vsx_register_operand" "wd,wd,wa")
+(define_expand "vsx_extract_<mode>"
+  [(set (match_operand:<VS_scalar> 0 "register_operand" "")
+	(vec_select:<VS_scalar> (match_operand:VSX_D 1 "register_operand" "")
 		       (parallel
-			[(match_operand:QI 2 "u5bit_cint_operand" "i,i,i")])))]
+			[(match_operand:QI 2 "u5bit_cint_operand" "")])))]
   "VECTOR_MEM_VSX_P (<MODE>mode)"
+  "")
+
+;; Optimize cases where we can do a simple or direct move.
+;; Or see if we can avoid doing the move at all
+(define_insn "*vsx_extract_<mode>_internal1"
+  [(set (match_operand:<VS_scalar> 0 "register_operand" "=d,ws,?wa,r")
+	(vec_select:<VS_scalar>
+	 (match_operand:VSX_D 1 "register_operand" "d,wd,wa,wm")
+	 (parallel
+	  [(match_operand:QI 2 "vsx_scalar_64bit" "wD,wD,wD,wD")])))]
+  "VECTOR_MEM_VSX_P (<MODE>mode) && TARGET_POWERPC64 && TARGET_DIRECT_MOVE"
+{
+  int op0_regno = REGNO (operands[0]);
+  int op1_regno = REGNO (operands[1]);
+
+  if (op0_regno == op1_regno)
+    return "nop\t\t# vec_extract %x0,%x1,%2";
+
+  if (INT_REGNO_P (op0_regno))
+    return "mfvsrd %0,%x1";
+
+  if (FP_REGNO_P (op0_regno) && FP_REGNO_P (op1_regno))
+    return "fmr %0,%1";
+
+  return "xxlor %x0,%x1,%x1";
+}
+  [(set_attr "type" "fp,vecsimple,vecsimple,mftgpr")
+   (set_attr "length" "4")])
+
+(define_insn "*vsx_extract_<mode>_internal2"
+  [(set (match_operand:<VS_scalar> 0 "vsx_register_operand" "=d,ws,ws,?wa")
+	(vec_select:<VS_scalar>
+	 (match_operand:VSX_D 1 "vsx_register_operand" "d,wd,wd,wa")
+	 (parallel [(match_operand:QI 2 "u5bit_cint_operand" "wD,wD,i,i")])))]
+  "VECTOR_MEM_VSX_P (<MODE>mode)
+   && (!TARGET_POWERPC64 || !TARGET_DIRECT_MOVE
+       || INTVAL (operands[2]) != VECTOR_ELEMENT_SCALAR_64BIT)"
 {
   int fldDM;
   gcc_assert (UINTVAL (operands[2]) <= 1);
+
+  if (INTVAL (operands[2]) == VECTOR_ELEMENT_SCALAR_64BIT)
+    {
+      int op0_regno = REGNO (operands[0]);
+      int op1_regno = REGNO (operands[1]);
+
+      if (op0_regno == op1_regno)
+	return "nop\t\t# vec_extract %x0,%x1,%2";
+
+      if (FP_REGNO_P (op0_regno) && FP_REGNO_P (op1_regno))
+	return "fmr %0,%1";
+
+      return "xxlor %x0,%x1,%x1";
+    }
+
   fldDM = INTVAL (operands[2]) << 1;
   if (!BYTES_BIG_ENDIAN)
     fldDM = 3 - fldDM;
   operands[3] = GEN_INT (fldDM);
-  return \"xxpermdi %x0,%x1,%x1,%3\";
+  return "xxpermdi %x0,%x1,%x1,%3";
 }
-  [(set_attr "type" "vecperm")])
+  [(set_attr "type" "fp,vecsimple,vecperm,vecperm")
+   (set_attr "length" "4")])
 
-;; Optimize extracting element 0 from memory
-(define_insn "*vsx_extract_<mode>_zero"
-  [(set (match_operand:<VS_scalar> 0 "vsx_register_operand" "=ws,d,?wa")
+;; Optimize extracting a single scalar element from memory if the scalar is in
+;; the correct location to use a single load.
+(define_insn "*vsx_extract_<mode>_load"
+  [(set (match_operand:<VS_scalar> 0 "register_operand" "=d,wv,wr")
 	(vec_select:<VS_scalar>
-	 (match_operand:VSX_D 1 "indexed_or_indirect_operand" "Z,Z,Z")
-	 (parallel [(const_int 0)])))]
-  "VECTOR_MEM_VSX_P (<MODE>mode) && WORDS_BIG_ENDIAN"
-  "lxsd%U1x %x0,%y1"
-  [(set (attr "type")
-      (if_then_else
-	(match_test "update_indexed_address_mem (operands[1], VOIDmode)")
-	(const_string "fpload_ux")
-	(const_string "fpload")))
-   (set_attr "length" "4")])  
-
-;; Optimize extracting element 1 from memory for little endian
-(define_insn "*vsx_extract_<mode>_one_le"
-  [(set (match_operand:<VS_scalar> 0 "vsx_register_operand" "=ws,d,?wa")
+	 (match_operand:VSX_D 1 "memory_operand" "m,Z,m")
+	 (parallel [(match_operand:QI 2 "vsx_scalar_64bit" "wD,wD,wD")])))]
+  "VECTOR_MEM_VSX_P (<MODE>mode)"
+  "@
+   lfd%U1%X1 %0,%1
+   lxsd%U1x %x0,%y1
+   ld%U1%X1 %0,%1"
+  [(set_attr_alternative "type"
+      [(if_then_else
+	 (match_test "update_indexed_address_mem (operands[1], VOIDmode)")
+	 (const_string "fpload_ux")
+	 (if_then_else
+	   (match_test "update_address_mem (operands[1], VOIDmode)")
+	   (const_string "fpload_u")
+	   (const_string "fpload")))
+       (const_string "fpload")
+       (if_then_else
+	 (match_test "update_indexed_address_mem (operands[1], VOIDmode)")
+	 (const_string "load_ux")
+	 (if_then_else
+	   (match_test "update_address_mem (operands[1], VOIDmode)")
+	   (const_string "load_u")
+	   (const_string "load")))])
+   (set_attr "length" "4")])
+
+;; Optimize storing a single scalar element that is the right location to
+;; memory
+(define_insn "*vsx_extract_<mode>_store"
+  [(set (match_operand:<VS_scalar> 0 "memory_operand" "=m,Z,?Z")
 	(vec_select:<VS_scalar>
-	 (match_operand:VSX_D 1 "indexed_or_indirect_operand" "Z,Z,Z")
-	 (parallel [(const_int 1)])))]
-  "VECTOR_MEM_VSX_P (<MODE>mode) && !WORDS_BIG_ENDIAN"
-  "lxsd%U1x %x0,%y1"
-  [(set (attr "type")
-      (if_then_else
-	(match_test "update_indexed_address_mem (operands[1], VOIDmode)")
-	(const_string "fpload_ux")
-	(const_string "fpload")))
-   (set_attr "length" "4")])  
+	 (match_operand:VSX_D 1 "register_operand" "d,wd,wa")
+	 (parallel [(match_operand:QI 2 "vsx_scalar_64bit" "wD,wD,wD")])))]
+  "VECTOR_MEM_VSX_P (<MODE>mode)"
+  "@
+   stfd%U0%X0 %1,%0
+   stxsd%U0x %x1,%y0
+   stxsd%U0x %x1,%y0"
+  [(set_attr_alternative "type"
+      [(if_then_else
+	 (match_test "update_indexed_address_mem (operands[0], VOIDmode)")
+	 (const_string "fpstore_ux")
+	 (if_then_else
+	   (match_test "update_address_mem (operands[0], VOIDmode)")
+	   (const_string "fpstore_u")
+	   (const_string "fpstore")))
+       (const_string "fpstore")
+       (const_string "fpstore")])
+   (set_attr "length" "4")])
 
 ;; Extract a SF element from V4SF
 (define_insn_and_split "vsx_extract_v4sf"
Index: gcc/config/rs6000/rs6000.h
===================================================================
--- gcc/config/rs6000/rs6000.h	(revision 208726)
+++ gcc/config/rs6000/rs6000.h	(working copy)
@@ -477,6 +477,10 @@  extern int rs6000_vector_align[];
 #define VECTOR_ELT_ORDER_BIG                                  \
   (BYTES_BIG_ENDIAN || (rs6000_altivec_element_order == 2))
 
+/* Element number of the 64-bit value in a 128-bit vector that can be accessed
+   with scalar instructions.  */
+#define VECTOR_ELEMENT_SCALAR_64BIT	((BYTES_BIG_ENDIAN) ? 0 : 1)
+
 /* Alignment options for fields in structures for sub-targets following
    AIX-like ABI.
    ALIGN_POWER word-aligns FP doubles (default AIX ABI).
Index: gcc/config/rs6000/altivec.md
===================================================================
--- gcc/config/rs6000/altivec.md	(revision 208726)
+++ gcc/config/rs6000/altivec.md	(working copy)
@@ -142,6 +142,7 @@  (define_c_enum "unspec"
    UNSPEC_VSUBCUQ
    UNSPEC_VSUBEUQM
    UNSPEC_VSUBECUQ
+   UNSPEC_VBPERMQ
 ])
 
 (define_c_enum "unspecv"
@@ -3322,3 +3323,14 @@  (define_insn "altivec_vsubecuq"
   [(set_attr "length" "4")
    (set_attr "type" "vecsimple")])
 
+;; We use V2DI as the output type to simplify converting the permute
+;; bits into an integer
+(define_insn "altivec_vbpermq"
+  [(set (match_operand:V2DI 0 "register_operand" "=v")
+	(unspec:V2DI [(match_operand:V16QI 1 "register_operand" "v")
+		      (match_operand:V16QI 2 "register_operand" "v")]
+		     UNSPEC_VBPERMQ))]
+  "TARGET_P8_VECTOR"
+  "vbpermq %0,%1,%2"
+  [(set_attr "length" "4")
+   (set_attr "type" "vecsimple")])
Index: gcc/config/rs6000/altivec.h
===================================================================
--- gcc/config/rs6000/altivec.h	(revision 208726)
+++ gcc/config/rs6000/altivec.h	(working copy)
@@ -329,6 +329,7 @@ 
 #define vec_vaddcuq __builtin_vec_vaddcuq
 #define vec_vaddudm __builtin_vec_vaddudm
 #define vec_vadduqm __builtin_vec_vadduqm
+#define vec_vbpermq __builtin_vec_vbpermq
 #define vec_vclz __builtin_vec_vclz
 #define vec_vclzb __builtin_vec_vclzb
 #define vec_vclzd __builtin_vec_vclzd
Index: gcc/doc/md.texi
===================================================================
--- gcc/doc/md.texi	(revision 208726)
+++ gcc/doc/md.texi	(working copy)
@@ -2162,6 +2162,9 @@  VSX vector register to hold scalar float
 @item wz
 Floating point register if the LFIWZX instruction is enabled or NO_REGS.
 
+@item wD
+Int constant that is the element number of the 64-bit scalar in a vector.
+
 @item wQ
 A memory address that will work with the @code{lq} and @code{stq}
 instructions.
Index: gcc/testsuite/gcc.target/powerpc/p8vector-vbpermq.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/p8vector-vbpermq.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/p8vector-vbpermq.c	(revision 0)
@@ -0,0 +1,27 @@ 
+/* { dg-do compile { target { powerpc*-*-* && lp64 } } } */
+/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-options "-O3 -mcpu=power8" } */
+/* { dg-final { scan-assembler     "vbpermq" } } */
+/* { dg-final { scan-assembler     "mfvsrd"  } } */
+/* { dg-final { scan-assembler-not "stfd"    } } */
+/* { dg-final { scan-assembler-not "stxvd2x" } } */
+
+#include <altivec.h>
+
+#if __LITTLE_ENDIAN__
+#define OFFSET 1
+#else
+#define OFFSET 0
+#endif
+
+long foos (vector signed char a, vector signed char b)
+{
+  return vec_extract (vec_vbpermq (a, b), OFFSET);
+}
+
+long foou (vector unsigned char a, vector unsigned char b)
+{
+  return vec_extract (vec_vbpermq (a, b), OFFSET);
+}
+
Index: gcc/testsuite/gcc.target/powerpc/vsx-extract-1.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/vsx-extract-1.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/vsx-extract-1.c	(revision 0)
@@ -0,0 +1,16 @@ 
+/* { dg-do compile { target { powerpc*-*-* && lp64 } } } */
+/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O3 -mcpu=power7" } */
+/* { dg-final { scan-assembler     "lfd"    } } */
+/* { dg-final { scan-assembler-not "lxvd2x" } } */
+
+#include <altivec.h>
+
+#if __LITTLE_ENDIAN__
+#define OFFSET 1
+#else
+#define OFFSET 0
+#endif
+
+double get_value (vector double *p) { return vec_extract (*p, OFFSET); }
Index: gcc/testsuite/gcc.target/powerpc/vsx-extract-2.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/vsx-extract-2.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/vsx-extract-2.c	(revision 0)
@@ -0,0 +1,17 @@ 
+/* { dg-do compile { target { powerpc*-*-* && lp64 } } } */
+/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
+/* { dg-require-effective-target powerpc_vsx_ok } */
+/* { dg-options "-O3 -mcpu=power7" } */
+/* { dg-final { scan-assembler     "xxlor"  } } */
+/* { dg-final { scan-assembler-not "lfd"    } } */
+/* { dg-final { scan-assembler-not "lxvd2x" } } */
+
+#include <altivec.h>
+
+#if __LITTLE_ENDIAN__
+#define OFFSET 1
+#else
+#define OFFSET 0
+#endif
+
+double get_value (vector double v) { return vec_extract (v, OFFSET); }
Index: gcc/testsuite/gcc.target/powerpc/vsx-extract-3.c
===================================================================
--- gcc/testsuite/gcc.target/powerpc/vsx-extract-3.c	(revision 0)
+++ gcc/testsuite/gcc.target/powerpc/vsx-extract-3.c	(revision 0)
@@ -0,0 +1,17 @@ 
+/* { dg-do compile { target { powerpc*-*-* && lp64 } } } */
+/* { dg-skip-if "" { powerpc*-*-darwin* } { "*" } { "" } } */
+/* { dg-require-effective-target powerpc_p8vector_ok } */
+/* { dg-options "-O3 -mcpu=power8" } */
+/* { dg-final { scan-assembler     "mfvsrd"  } } */
+/* { dg-final { scan-assembler-not "stfd"    } } */
+/* { dg-final { scan-assembler-not "stxvd2x" } } */
+
+#include <altivec.h>
+
+#if __LITTLE_ENDIAN__
+#define OFFSET 1
+#else
+#define OFFSET 0
+#endif
+
+long get_value (vector long v) { return vec_extract (v, OFFSET); }