
Vector shuffling

Message ID CABYV9SU1KsFTvPfRjZpwH8joTPKJ6aFEWgwmy=_P9PrxFCKv=g@mail.gmail.com
State New

Commit Message

Artem Shinkarov Sept. 9, 2011, 3:51 p.m. UTC
Hi, sorry for the delay, I had a lot of other stuff to do.

Attached is a new patch that fixes all the issues pointed out by
Joseph and almost all the issues pointed out by Richard. The issues
that are not fixed are explained below.

Artem.

>+      if (TREE_CODE (vect) == VECTOR_CST)
>+        {
>+            unsigned i;
>+            tree vals = TREE_VECTOR_CST_ELTS (vect);
>+            for (i = 0; vals; vals = TREE_CHAIN (vals), ++i)
>+              if (i == index)
>
>operand_equal_p

Sorry, I didn't get this comment. It looks fine to me without any changes.

>+  if (need_asgn)
>+    {
>+      TREE_ADDRESSABLE (tmpvec) = 1;
>
>this should be at the point we call create_tmp_var.

This would mean writing the same line twice, a few lines further up.
I would rather leave it as is to avoid code duplication.

>Index: gcc/passes.c
>===================================================================
>--- gcc/passes.c        (revision 178354)
>+++ gcc/passes.c        (working copy)
>@@ -1354,7 +1354,6 @@ init_optimization_passes (void)
>         NEXT_PASS (pass_vectorize);
>           {
>             struct opt_pass **p = &pass_vectorize.pass.sub;
>-             NEXT_PASS (pass_lower_vector_ssa);
>             NEXT_PASS (pass_dce_loop);
>           }
>          NEXT_PASS (pass_predcom);
>@@ -1366,6 +1365,7 @@ init_optimization_passes (void)
>         NEXT_PASS (pass_lim);
>         NEXT_PASS (pass_tree_loop_done);
>       }
>+      NEXT_PASS (pass_lower_vector_ssa);
>
>This change should not be necessary.

Without this change the vector lowering pass with -Ox is not executed
at all. The overall idea is to make sure that with -Ox we perform
lowering only once, and as late as possible.

But maybe I am just missing something.



Thanks,
Artem.

P.S. X86 PEOPLE, WHERE ARE YOU? :)

Comments

Richard Biener Sept. 12, 2011, 6:06 a.m. UTC | #1
On Fri, Sep 9, 2011 at 5:51 PM, Artem Shinkarov
<artyom.shinkaroff@gmail.com> wrote:
> Hi, sorry for the delay, I had a lot of other stuff to do.
>
> Attached is a new patch that fixes all the issues pointed out by
> Joseph and almost all the issues pointed out by Richard. The issues
> that are not fixed are explained below.
>
> Artem.
>
>>+      if (TREE_CODE (vect) == VECTOR_CST)
>>+        {
>>+            unsigned i;
>>+            tree vals = TREE_VECTOR_CST_ELTS (vect);
>>+            for (i = 0; vals; vals = TREE_CHAIN (vals), ++i)
>>+              if (i == index)
>>
>>operand_equal_p
>
> Sorry, I didn't get this comment. It looks fine to me without any changes.

Error on my side, the code is ok.

>>+  if (need_asgn)
>>+    {
>>+      TREE_ADDRESSABLE (tmpvec) = 1;
>>
>>this should be at the point we call create_tmp_var.
>
> This would mean writing the same line twice, a few lines further up.
> I would rather leave it as is to avoid code duplication.

Well, it's just that they are supposed to occur in pairs only, so
duplicating would make the code more understandable.  So please do
the duplication.
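
For reference, a minimal sketch of what that would look like in
vector_element (my rearrangement of the patch's code, not tested):

  /* Set TREE_ADDRESSABLE immediately at each create_tmp_var call so
     the two always occur as a pair, at the cost of one duplicated
     line.  */
  if (!ptmpvec)
    {
      tmpvec = create_tmp_var (TREE_TYPE (vect), "vectmp");
      TREE_ADDRESSABLE (tmpvec) = 1;
    }
  else if (!*ptmpvec)
    {
      tmpvec = *ptmpvec = create_tmp_var (TREE_TYPE (vect), "vectmp");
      TREE_ADDRESSABLE (tmpvec) = 1;
    }
  else
    {
      tmpvec = *ptmpvec;
      need_asgn = false;
    }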

>>Index: gcc/passes.c
>>===================================================================
>>--- gcc/passes.c        (revision 178354)
>>+++ gcc/passes.c        (working copy)
>>@@ -1354,7 +1354,6 @@ init_optimization_passes (void)
>>         NEXT_PASS (pass_vectorize);
>>           {
>>             struct opt_pass **p = &pass_vectorize.pass.sub;
>>-             NEXT_PASS (pass_lower_vector_ssa);
>>             NEXT_PASS (pass_dce_loop);
>>           }
>>          NEXT_PASS (pass_predcom);
>>@@ -1366,6 +1365,7 @@ init_optimization_passes (void)
>>         NEXT_PASS (pass_lim);
>>         NEXT_PASS (pass_tree_loop_done);
>>       }
>>+      NEXT_PASS (pass_lower_vector_ssa);
>>
>>This change should not be necessary.
>
> Without this change the vector lowering pass with -Ox is not executed
> at all. The overall idea is to make sure that with -Ox we perform
> lowering only once, and as late as possible.
>
> But maybe I am just missing something.

Ah, I thought we had already converted the -O0 pass to work like the
complex lowering at -O0 (which uses a property).  So hm, I guess the
change is ok as well.

I'm on vacation for the next two weeks, conditional on the x86 approval
I'll take care of committing the patch after I return.

Richard.

>
>
> Thanks,
> Artem.
>
> P.S. X86 PEOPLE, WHERE ARE YOU? :)
>
Joseph Myers Sept. 13, 2011, 4:13 p.m. UTC | #2
On Fri, 9 Sep 2011, Artem Shinkarov wrote:

> Hi, sorry for the delay, I had a lot of other stuff to do.
> 
> Attached is a new patch that fixes all the issues pointed out by
> Joseph and almost all the issues pointed out by Richard. The issues
> that are not fixed are explained below.

The C front-end parts of this version of the patch are OK with the 
spurious whitespace change

> @@ -6120,7 +6213,7 @@ digest_init (location_t init_loc, tree t
>  	  tree value;
>  	  bool constant_p = true;
>  
> -	  /* Iterate through elements and check if all constructor
> +          /* Iterate through elements and check if all constructor
>  	     elements are *_CSTs.  */

removed.  The original version of this line was correctly indented with a 
TAB.
Richard Henderson Sept. 15, 2011, 7:05 p.m. UTC | #3
> +The elements of the input vectors are numbered from left to right across
> +one or both of the vectors. Each element in the mask specifies the number
> +of an element from the input vector(s). Consider the following example.

It would be preferable to talk about the memory ordering of the
elements rather than "left" and "right", which are ambiguous at best.
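
Concretely (an illustration of the numbering, not proposed text): with
GCC's generic vectors the elements are laid out in index order in
memory, so the numbering can be pinned down as

  typedef int v4si __attribute__ ((vector_size (16)));
  v4si a = {10, 20, 30, 40};
  /* Element i is ((int *) &a)[i]; element 0 is 10 regardless of
     target endianness, which "left" does not make clear.  */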

> +  if (TREE_CODE (mask) == VECTOR_CST)
> +    {
> +      tree m_type, call;
> +      tree fn = targetm.vectorize.builtin_vec_perm (TREE_TYPE (v0), &m_type);
> +      /*rtx t;*/

Leftover crap.

> +
> +      if (!fn)
> +	goto vshuffle;
> +
> +      if (m_type != TREE_TYPE (TREE_TYPE (mask)))
> +	{
> +	  int units = TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask));
> +	  tree cvt = build_vector_type (m_type, units);
> +	  mask = fold_convert (cvt, mask);
> +	}
> +
> +      call = fold_build1 (ADDR_EXPR, build_pointer_type (TREE_TYPE (fn)), fn);
> +      call = build_call_nary (type /* ? */, call, 3, v0, v1, mask);
> +
> +      return expand_expr_real_1 (call, target, VOIDmode, EXPAND_NORMAL, NULL);
> +    }
> +
> +vshuffle:
> +  gcc_assert (operand_equal_p (v0, v1, 0));

Why can't a non-constant shuffle have different V0 and V1?  That seems
like a direct violation of the documentation, and any sort of usefulness.

Also, while I'm ok with the use of builtin_vec_perm here in the short
term, I think that in the long term we should simply force the named
pattern to handle constants.  Then the vectorizer can simply use the
predicate and the tree code and we can drop the large redundancy of
builtins with different argument types.

Indeed, once this patch is applied, I think that ought to be the very
next task in this domain.

> +/* Vector shuffle expression.  A = VEC_SHUFFLE_EXPR<v0, v1, maks>

Typo in "mask".

> +   foreach i in length (mask):
> +     A = mask[i] < length (v0) ? v0[mask[i]] : v1[mask[i]]

Surely it's v1[mask[i] - length].
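
That is, as a scalar reference loop (a sketch of the intended
semantics, matching the checks in the new testcases):

  /* A = VEC_SHUFFLE_EXPR <v0, v1, mask>, with
     N = length (v0) = length (v1) = length (mask).  */
  for (i = 0; i < N; i++)
    A[i] = mask[i] < N ? v0[mask[i]] : v1[mask[i] - N];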

> +      if (TREE_CODE (vect) == VECTOR_CST)
> +        {
> +            unsigned i;

Indentation is off all through this function.

> +  mask = gen_rtx_AND (maskmode, mask, mm);
> +  
> +  /* Convert mask to vector of chars.  */
> +  mask = simplify_gen_subreg (V16QImode, mask, maskmode, 0);
> +  mask = force_reg (V16QImode, mask);

Why are you using force_reg to do all the dirty work?  Seems to
me this should be using expand_normal.  All throughout this 
function.  That would also avoid the need for all of the extra
force_reg stuff that ought not be there for -O0.

I also see that you're not even attempting to use xop_pperm.

Is ssse3_pshufb why you do the wrong thing in the expander for v0 != v1?
And give the vshuffle named pattern the wrong number of arguments?
It's certainly possible to handle it, though it takes a few more steps,
and might well be more efficient as a libgcc function rather than inline.


r~
Artem Shinkarov Sept. 28, 2011, 12:59 p.m. UTC | #4
On Thu, Sep 15, 2011 at 8:05 PM, Richard Henderson <rth@redhat.com> wrote:
>> +The elements of the input vectors are numbered from left to right across
>> +one or both of the vectors. Each element in the mask specifies the number
>> +of an element from the input vector(s). Consider the following example.
>
> It would be preferable to talk about the memory ordering of the
> elements rather than "left" and "right", which are ambiguous at best.
>
>> +  if (TREE_CODE (mask) == VECTOR_CST)
>> +    {
>> +      tree m_type, call;
>> +      tree fn = targetm.vectorize.builtin_vec_perm (TREE_TYPE (v0), &m_type);
>> +      /*rtx t;*/
>
> Leftover crap.

Fixed.

>> +
>> +      if (!fn)
>> +     goto vshuffle;
>> +
>> +      if (m_type != TREE_TYPE (TREE_TYPE (mask)))
>> +     {
>> +       int units = TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask));
>> +       tree cvt = build_vector_type (m_type, units);
>> +       mask = fold_convert (cvt, mask);
>> +     }
>> +
>> +      call = fold_build1 (ADDR_EXPR, build_pointer_type (TREE_TYPE (fn)), fn);
>> +      call = build_call_nary (type /* ? */, call, 3, v0, v1, mask);
>> +
>> +      return expand_expr_real_1 (call, target, VOIDmode, EXPAND_NORMAL, NULL);
>> +    }
>> +
>> +vshuffle:
>> +  gcc_assert (operand_equal_p (v0, v1, 0));
>
> Why can't a non-constant shuffle have different V0 and V1?  That seems
> like a direct violation of the documentation, and any sort of usefulness.

Ok, I agree. The reason this assert is here is that no one in the
middle-end generates code that violates it. In principle we
definitely want to support this in upcoming patches, but it would be
nice to start with a simple thing.

> Also, while I'm ok with the use of builtin_vec_perm here in the short
> term, I think that in the long term we should simply force the named
> pattern to handle constants.  Then the vectorizer can simply use the
> predicate and the tree code and we can drop the large redundancy of
> builtins with different argument types.
>
> Indeed, once this patch is applied, I think that ought to be the very
> next task in this domain.
>
>> +/* Vector shuffle expression.  A = VEC_SHUFFLE_EXPR<v0, v1, maks>
>
> Typo in "mask".
>
>> +   foreach i in length (mask):
>> +     A = mask[i] < length (v0) ? v0[mask[i]] : v1[mask[i]]
>
> Surely it's v1[mask[i] - length].
>
>> +      if (TREE_CODE (vect) == VECTOR_CST)
>> +        {
>> +            unsigned i;
>
> Indentation is off all through this function.

Fixed.

>> +  mask = gen_rtx_AND (maskmode, mask, mm);
>> +
>> +  /* Convert mask to vector of chars.  */
>> +  mask = simplify_gen_subreg (V16QImode, mask, maskmode, 0);
>> +  mask = force_reg (V16QImode, mask);
>
> Why are you using force_reg to do all the dirty work?  Seems to
> me this should be using expand_normal.  All throughout this
> function.  That would also avoid the need for all of the extra
> force_reg stuff that ought not be there for -O0.

I don't really understand this. As far as I know, expand_normal
"converts" a tree to rtx. All my computations happen at the level of
rtx, and force_reg is needed just to bring an rtx expression into a
register of the correct mode. If I am missing something, could you
give an example of how I can use expand_normal instead of force_reg
in this particular code?

> I also see that you're not even attempting to use xop_pperm.

As I said, I am happy to experiment with the case v0 != v1 in
upcoming patches. Let's just start with a simple thing and see what
kind of issues it brings.

> Is ssse3_pshufb why you do the wrong thing in the expander for v0 != v1?

My personal feeling is that in the case v0 != v1 it may be more
efficient to perform piecewise shuffling rather than bitwise dances
around the masks.

> And give the vshuffle named pattern the wrong number of arguments?

Ok, if I make vshuffle accept only two arguments -- vector and
mask -- would that be ok?

> It's certainly possible to handle it, though it takes a few more steps,
> and might well be more efficient as a libgcc function rather than inline.

I don't really understand why it would be more efficient. I thought
that inlining gives more opportunities to the final RTL optimisation.
>
>
> r~
>
>
>

Thanks,
Artem.
Richard Henderson Sept. 28, 2011, 2:46 p.m. UTC | #5
On 09/28/2011 05:59 AM, Artem Shinkarov wrote:
> I don't really understand this. As far as I know, expand_normal
> "converts" a tree to rtx. All my computations happen at the level of
> rtx, and force_reg is needed just to bring an rtx expression into a
> register of the correct mode. If I am missing something, could you
> give an example of how I can use expand_normal instead of force_reg
> in this particular code?

Sorry, I meant expand_(simple_)binop.
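
For instance, the AND in the expander could read (a sketch using the
generic optabs entry point, untested):

  /* Let expand_simple_binop emit the insn and allocate the result
     register, instead of building the rtx and calling force_reg by
     hand.  */
  mask = expand_simple_binop (maskmode, AND, mask, mm,
			      NULL_RTX, 0, OPTAB_DIRECT);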

>> Is ssse3_pshufb why you do the wrong thing in the expander for v0 != v1?
> 
> My personal feeling is that in the case v0 != v1 it may be more
> efficient to perform piecewise shuffling rather than bitwise dances
> around the masks.

Maybe for V2DI and V2DFmode, but probably not otherwise.

We can perform the double-word shuffle in 12 insns; 10 for SSE 4.1.
Example assembly attached.

>> It's certainly possible to handle it, though it takes a few more steps,
>> and might well be more efficient as a libgcc function rather than inline.
> 
> I don't really understand why it would be more efficient. I thought
> that inlining gives more opportunities to the final RTL optimisation.

We'll not be able to optimize this at the rtl level.  There are too many
UNSPEC instructions in the way.  In any case, even if that weren't so we'd
only be able to do useful optimization for a constant permutation.  And
we should have been able to prove that at the gimple level.


r~
.data
	.align 16
vec3:	.long	3,3,3,3
vec4:	.long	4,4,4,4
dup4:	.byte	0,0,0,0, 4,4,4,4, 8,8,8,8, 12,12,12,12
ofs4:	.byte	0,1,2,3, 0,1,2,3, 0,1,2,3, 0,1,2,3

	.text
shuffle2:

	// Convert the low bits of the mask to a shuffle
	movdqa	%xmm2, %xmm3
	pand	vec3, %xmm3
	pmulld	vec4, %xmm3
	pshufb	dup4, %xmm3
	paddb	ofs4, %xmm3

	// Shuffle both inputs
	pshufb	%xmm3, %xmm0
	pshufb	%xmm3, %xmm1

	// Select and merge the inputs
	// Use ix86_expand_int_vcond for use of pblendvb for SSE4_1.
	pand	vec4, %xmm2
	pcmpeqd	vec4, %xmm2
	pand	%xmm2, %xmm1
	pandn	%xmm2, %xmm0
	por	%xmm1, %xmm0

	ret

Patch

Index: gcc/doc/extend.texi
===================================================================
--- gcc/doc/extend.texi	(revision 178354)
+++ gcc/doc/extend.texi	(working copy)
@@ -6561,6 +6561,32 @@  invoke undefined behavior at runtime.  W
 accesses for vector subscription can be enabled with
 @option{-Warray-bounds}.
 
+Vector shuffling is available using the functions
+@code{__builtin_shuffle (vec, mask)} and
+@code{__builtin_shuffle (vec0, vec1, mask)}. Both functions construct
+a permutation of elements from one or two vectors and return a vector
+of the same type as the input vector(s). The mask is a vector of
+integer-typed elements. The size of each element of the mask must be
+the same as the size of each input vector element. The number of
+elements in the input vector(s) and the mask must be the same.
+
+The elements of the input vectors are numbered from left to right across
+one or both of the vectors. Each element in the mask specifies the number
+of an element from the input vector(s). Consider the following example.
+
+@smallexample
+typedef int v4si __attribute__ ((vector_size (16)));
+
+v4si a = @{1,2,3,4@};
+v4si b = @{5,6,7,8@};
+v4si mask1 = @{0,1,1,3@};
+v4si mask2 = @{0,4,2,5@};
+v4si res;
+
+res = __builtin_shuffle (a, mask1);       /* res is @{1,2,2,4@}  */
+res = __builtin_shuffle (a, b, mask2);    /* res is @{1,5,3,6@}  */
+@end smallexample
+
 You can declare variables and use them in function calls and returns, as
 well as in assignments and some casts.  You can specify a vector type as
 a return type for a function.  Vector types can also be used as function
Index: gcc/tree-pretty-print.c
===================================================================
--- gcc/tree-pretty-print.c	(revision 178354)
+++ gcc/tree-pretty-print.c	(working copy)
@@ -2067,6 +2067,16 @@  dump_generic_node (pretty_printer *buffe
       dump_generic_node (buffer, TREE_OPERAND (node, 2), spc, flags, false);
       pp_string (buffer, " > ");
       break;
+    
+    case VEC_SHUFFLE_EXPR:
+      pp_string (buffer, " VEC_SHUFFLE_EXPR < ");
+      dump_generic_node (buffer, TREE_OPERAND (node, 0), spc, flags, false);
+      pp_string (buffer, " , ");
+      dump_generic_node (buffer, TREE_OPERAND (node, 1), spc, flags, false);
+      pp_string (buffer, " , ");
+      dump_generic_node (buffer, TREE_OPERAND (node, 2), spc, flags, false);
+      pp_string (buffer, " > ");
+      break;
 
     case DOT_PROD_EXPR:
       pp_string (buffer, " DOT_PROD_EXPR < ");
Index: gcc/c-family/c-common.c
===================================================================
--- gcc/c-family/c-common.c	(revision 178354)
+++ gcc/c-family/c-common.c	(working copy)
@@ -425,6 +425,7 @@  const struct c_common_resword c_common_r
   { "__attribute__",	RID_ATTRIBUTE,	0 },
   { "__builtin_choose_expr", RID_CHOOSE_EXPR, D_CONLY },
   { "__builtin_complex", RID_BUILTIN_COMPLEX, D_CONLY },
+  { "__builtin_shuffle", RID_BUILTIN_SHUFFLE, D_CONLY },
   { "__builtin_offsetof", RID_OFFSETOF, 0 },
   { "__builtin_types_compatible_p", RID_TYPES_COMPATIBLE_P, D_CONLY },
   { "__builtin_va_arg",	RID_VA_ARG,	0 },
Index: gcc/c-family/c-common.h
===================================================================
--- gcc/c-family/c-common.h	(revision 178354)
+++ gcc/c-family/c-common.h	(working copy)
@@ -103,7 +103,7 @@  enum rid
   /* C extensions */
   RID_ASM,       RID_TYPEOF,   RID_ALIGNOF,  RID_ATTRIBUTE,  RID_VA_ARG,
   RID_EXTENSION, RID_IMAGPART, RID_REALPART, RID_LABEL,      RID_CHOOSE_EXPR,
-  RID_TYPES_COMPATIBLE_P,      RID_BUILTIN_COMPLEX,
+  RID_TYPES_COMPATIBLE_P,      RID_BUILTIN_COMPLEX,	     RID_BUILTIN_SHUFFLE,
   RID_DFLOAT32, RID_DFLOAT64, RID_DFLOAT128,
   RID_FRACT, RID_ACCUM,
 
Index: gcc/optabs.c
===================================================================
--- gcc/optabs.c	(revision 178354)
+++ gcc/optabs.c	(working copy)
@@ -6620,6 +6620,79 @@  vector_compare_rtx (tree cond, bool unsi
   return gen_rtx_fmt_ee (rcode, VOIDmode, ops[0].value, ops[1].value);
 }
 
+/* Return true if VEC_SHUFFLE_EXPR can be expanded using SIMD extensions
+   of the CPU.  */
+bool
+expand_vec_shuffle_expr_p (enum machine_mode mode, tree v0, tree v1, tree mask)
+{
+  int v0_mode_s = GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (TREE_TYPE (v0))));
+  int mask_mode_s = GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (TREE_TYPE (mask))));
+
+  if (TREE_CODE (mask) == VECTOR_CST
+      && targetm.vectorize.builtin_vec_perm_ok (TREE_TYPE (v0), mask))
+    return true;
+  
+  if (!operand_equal_p (v0, v1, 0) || v0_mode_s != mask_mode_s
+      || TYPE_VECTOR_SUBPARTS (TREE_TYPE (v0))
+	 != TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)))
+    return false;
+    
+  return direct_optab_handler (vshuffle_optab, mode) != CODE_FOR_nothing;
+}
+
+/* Generate instructions for VEC_COND_EXPR given its type and three
+   operands.  */
+rtx
+expand_vec_shuffle_expr (tree type, tree v0, tree v1, tree mask, rtx target)
+{
+  struct expand_operand ops[4];
+  enum insn_code icode;
+  enum machine_mode mode = TYPE_MODE (type);
+  rtx rtx_v0, rtx_mask;
+
+  gcc_assert (expand_vec_shuffle_expr_p (mode, v0, v1, mask));
+
+  if (TREE_CODE (mask) == VECTOR_CST)
+    {
+      tree m_type, call;
+      tree fn = targetm.vectorize.builtin_vec_perm (TREE_TYPE (v0), &m_type);
+      /*rtx t;*/
+
+      if (!fn)
+	goto vshuffle;
+
+      if (m_type != TREE_TYPE (TREE_TYPE (mask)))
+	{
+	  int units = TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask));
+	  tree cvt = build_vector_type (m_type, units);
+	  mask = fold_convert (cvt, mask);
+	}
+
+      call = fold_build1 (ADDR_EXPR, build_pointer_type (TREE_TYPE (fn)), fn);
+      call = build_call_nary (type /* ? */, call, 3, v0, v1, mask);
+
+      return expand_expr_real_1 (call, target, VOIDmode, EXPAND_NORMAL, NULL);
+    }
+
+vshuffle:
+  gcc_assert (operand_equal_p (v0, v1, 0));
+
+  icode = direct_optab_handler (vshuffle_optab, mode);
+
+  if (icode == CODE_FOR_nothing)
+    return 0;
+
+  rtx_v0 = expand_normal (v0);
+  rtx_mask = expand_normal (mask);
+
+  create_output_operand (&ops[0], target, mode);
+  create_input_operand (&ops[1], rtx_v0, mode);
+  create_input_operand (&ops[2], rtx_mask, mode);
+  expand_insn (icode, 3, ops);
+
+  return ops[0].value;
+}
+
 /* Return insn code for TYPE, the type of a VEC_COND_EXPR.  */
 
 static inline enum insn_code
Index: gcc/optabs.h
===================================================================
--- gcc/optabs.h	(revision 178354)
+++ gcc/optabs.h	(working copy)
@@ -636,6 +636,9 @@  enum direct_optab_index
   DOI_vcond,
   DOI_vcondu,
 
+  /* Vector shuffling.  */
+  DOI_vshuffle,
+
   /* Block move operation.  */
   DOI_movmem,
 
@@ -701,6 +704,7 @@  typedef struct direct_optab_d *direct_op
 #define reload_out_optab (&direct_optab_table[(int) DOI_reload_out])
 #define vcond_optab (&direct_optab_table[(int) DOI_vcond])
 #define vcondu_optab (&direct_optab_table[(int) DOI_vcondu])
+#define vshuffle_optab (&direct_optab_table[(int) DOI_vshuffle])
 #define movmem_optab (&direct_optab_table[(int) DOI_movmem])
 #define setmem_optab (&direct_optab_table[(int) DOI_setmem])
 #define cmpstr_optab (&direct_optab_table[(int) DOI_cmpstr])
@@ -879,8 +883,15 @@  extern rtx expand_widening_mult (enum ma
 /* Return tree if target supports vector operations for COND_EXPR.  */
 bool expand_vec_cond_expr_p (tree, enum machine_mode);
 
+/* Return tree if target supports vector operations for VEC_SHUFFLE_EXPR.  */
+bool expand_vec_shuffle_expr_p (enum machine_mode, tree, tree, tree);
+
 /* Generate code for VEC_COND_EXPR.  */
 extern rtx expand_vec_cond_expr (tree, tree, tree, tree, rtx);
+
+/* Generate code for VEC_SHUFFLE_EXPR.  */
+extern rtx expand_vec_shuffle_expr (tree, tree, tree, tree, rtx);
+
 /* Generate code for VEC_LSHIFT_EXPR and VEC_RSHIFT_EXPR.  */
 extern rtx expand_vec_shift_expr (sepops, rtx);
 
Index: gcc/genopinit.c
===================================================================
--- gcc/genopinit.c	(revision 178354)
+++ gcc/genopinit.c	(working copy)
@@ -255,6 +255,7 @@  static const char * const optabs[] =
   "set_optab_handler (vec_realign_load_optab, $A, CODE_FOR_$(vec_realign_load_$a$))",
   "set_direct_optab_handler (vcond_optab, $A, CODE_FOR_$(vcond$a$))",
   "set_direct_optab_handler (vcondu_optab, $A, CODE_FOR_$(vcondu$a$))",
+  "set_direct_optab_handler (vshuffle_optab, $A, CODE_FOR_$(vshuffle$a$))",
   "set_optab_handler (ssum_widen_optab, $A, CODE_FOR_$(widen_ssum$I$a3$))",
   "set_optab_handler (usum_widen_optab, $A, CODE_FOR_$(widen_usum$I$a3$))",
   "set_optab_handler (udot_prod_optab, $A, CODE_FOR_$(udot_prod$I$a$))",
Index: gcc/testsuite/gcc.c-torture/execute/vect-shuffle-2.c
===================================================================
--- gcc/testsuite/gcc.c-torture/execute/vect-shuffle-2.c	(revision 0)
+++ gcc/testsuite/gcc.c-torture/execute/vect-shuffle-2.c	(revision 0)
@@ -0,0 +1,44 @@ 
+#define vector(elcount, type)  \
+__attribute__((vector_size((elcount)*sizeof(type)))) type
+
+#define vidx(type, vec, idx) (*(((type *) &(vec)) + idx))
+
+#define shuf2compare(type, count, vres, v0, v1, mask) \
+do { \
+    int __i; \
+    for (__i = 0; __i < count; __i++) { \
+        if (vidx(type, vres, __i) != ((vidx(type, mask, __i) < count) ? \
+                          vidx(type, v0, vidx(type, mask, __i)) :  \
+                          vidx(type, v1, (vidx(type, mask, __i) - count)))) \
+            __builtin_abort (); \
+        } \
+} while (0)
+
+
+int main (int argc, char *argv[]) {
+    vector (8, short) v0 = {5, 5,5,5,5,5,argc,7};
+    vector (8, short) v1 = {argc, 1,8,8,4,9,argc,4};
+    vector (8, short) v2;
+
+    //vector (8, short) mask = {1,2,5,4,3,6,7};
+    
+    vector (8, short) mask0 = {0,2,3,1,4,5,6,7};
+    vector (8, short) mask1 = {0,12,3,4,3,0,10,9};
+    
+    vector (8, short) mask2 = {0,8,1,9,2,10,3,11};
+
+    v2 = __builtin_shuffle (v0, v1,  mask0);
+    shuf2compare (short, 8, v2, v0, v1, mask0);
+ 
+    v2 = __builtin_shuffle (v0, v1,  mask1);
+    shuf2compare (short, 8, v2, v0, v1, mask1);
+    
+    v2 = __builtin_shuffle (v0, v1,  mask2);
+    shuf2compare (short, 8, v2, v0, v1, mask2);
+
+    v2 = __builtin_shuffle (mask0, mask0,  v0);
+    shuf2compare (short, 8, v2, mask0, mask0, v0);
+
+    return 0; 
+}
+
Index: gcc/testsuite/gcc.c-torture/execute/vect-shuffle-4.c
===================================================================
--- gcc/testsuite/gcc.c-torture/execute/vect-shuffle-4.c	(revision 0)
+++ gcc/testsuite/gcc.c-torture/execute/vect-shuffle-4.c	(revision 0)
@@ -0,0 +1,50 @@ 
+#define vector(elcount, type)  \
+__attribute__((vector_size((elcount)*sizeof(type)))) type
+
+#define vidx(type, vec, idx) (*(((type *) &(vec)) + idx))
+
+#define shuf2compare(type, count, vres, v0, v1, mask) \
+do { \
+    int __i; \
+    for (__i = 0; __i < count; __i++) { \
+        if (vidx(type, vres, __i) != ((vidx(type, mask, __i) < count) ? \
+                          vidx(type, v0, vidx(type, mask, __i)) :  \
+                          vidx(type, v1, (vidx(type, mask, __i) - count)))) \
+            __builtin_abort (); \
+        } \
+} while (0)
+
+
+vector (8, short) __attribute__ ((noinline))
+f (vector (8, short) x, vector (8, short) y, vector (8, short) mask) {
+    return __builtin_shuffle (x, y, mask);
+}
+
+
+
+int main (int argc, char *argv[]) {
+    vector (8, short) v0 = {argc, 1,2,3,4,5,6,7};
+    vector (8, short) v1 = {argc, 1,argc,3,4,5,argc,7};
+    vector (8, short) v2;
+
+    //vector (8, short) mask = {1,2,5,4,3,6,7};
+    
+    vector (8, short) mask0 = {0,2,3,1,4,5,6,7};
+    vector (8, short) mask1 = {0,12,3,4,3,0,10,9};
+    vector (8, short) mask2 = {0,8,1,9,2,10,3,11};
+
+    v2 = f (v0, v1,  mask0);
+    shuf2compare (short, 8, v2, v0, v1, mask0);
+ 
+    v2 = f (v0, v1,  mask1);
+    shuf2compare (short, 8, v2, v0, v1, mask1);
+
+    v2 = f (v0, v1,  mask2);
+    shuf2compare (short, 8, v2, v0, v1, mask2);
+
+    v2 = f (mask0, mask0,  v0);
+    shuf2compare (short, 8, v2, mask0, mask0, v0);
+
+    return 0; 
+}
+
Index: gcc/testsuite/gcc.c-torture/execute/vect-shuffle-1.c
===================================================================
--- gcc/testsuite/gcc.c-torture/execute/vect-shuffle-1.c	(revision 0)
+++ gcc/testsuite/gcc.c-torture/execute/vect-shuffle-1.c	(revision 0)
@@ -0,0 +1,46 @@ 
+#define vector(elcount, type)  \
+__attribute__((vector_size((elcount)*sizeof(type)))) type
+
+#define vidx(type, vec, idx) (*(((type *) &(vec)) + idx))
+
+#define shufcompare(type, count, vres, v0, mask) \
+do { \
+    int __i; \
+    for (__i = 0; __i < count; __i++) { \
+        if (vidx(type, vres, __i) != vidx(type, v0, vidx(type, mask, __i))) \
+            __builtin_abort (); \
+    } \
+} while (0)
+
+
+int main (int argc, char *argv[]) {
+    /*vector (8, short) v0 = {argc, 1,2,3,4,5,6,7};
+    vector (8, short) v1 = {argc, 1,argc,3,4,5,argc,7};
+    vector (8, short) v2;
+   
+    vector (8, short) smask = {0,0,1,2,3,4,5,6};
+    
+    v2 = __builtin_shuffle (v0,  smask);
+    shufcompare (short, 8, v2, v0, smask);
+    v2 = __builtin_shuffle (v0, v1);
+    shufcompare (short, 8, v2, v0, v1);
+    v2 = __builtin_shuffle (smask, v0);
+    shufcompare (short, 8, v2, smask, v0);*/
+
+    vector (4, int) i0 = {argc, 1,2,3};
+    vector (4, int) i1 = {argc, 1, argc, 3};
+    vector (4, int) i2;
+
+    vector (4, int) imask = {0,3,2,1};
+
+    /*i2 = __builtin_shuffle (i0, imask);
+    shufcompare (int, 4, i2, i0, imask);*/
+    i2 = __builtin_shuffle (i0, i1);
+    shufcompare (int, 4, i2, i0, i1);
+    
+    i2 = __builtin_shuffle (imask, i0);
+    shufcompare (int, 4, i2, imask, i0);
+    
+    return 0;
+}
+
Index: gcc/testsuite/gcc.c-torture/execute/vect-shuffle-3.c
===================================================================
--- gcc/testsuite/gcc.c-torture/execute/vect-shuffle-3.c	(revision 0)
+++ gcc/testsuite/gcc.c-torture/execute/vect-shuffle-3.c	(revision 0)
@@ -0,0 +1,36 @@ 
+#define vector(elcount, type)  \
+__attribute__((vector_size((elcount)*sizeof(type)))) type
+
+#define vidx(type, vec, idx) (*(((type *) &(vec)) + idx))
+
+#define shufcompare(type, count, vres, v0, mask) \
+do { \
+    int __i; \
+    for (__i = 0; __i < count; __i++) { \
+        if (vidx(type, vres, __i) != vidx(type, v0, vidx(type, mask, __i))) \
+            __builtin_abort (); \
+    } \
+} while (0)
+
+vector (8, short) __attribute__ ((noinline))
+f (vector (8, short) x, vector (8, short) mask) {
+    return __builtin_shuffle (x, mask);
+}
+
+
+int main (int argc, char *argv[]) {
+    vector (8, short) v0 = {argc, 1,2,3,4,5,6,7};
+    vector (8, short) v1 = {argc, 1,argc,3,4,5,argc,7};
+    vector (8, short) v2;
+
+    vector (8, short) mask = {0,0,1,2,3,4,5,6};
+    
+    v2 = f (v0,  mask);
+    shufcompare (short, 8, v2, v0, mask);
+
+    v2 = f (v0, v1);
+    shufcompare (short, 8, v2, v0, v1);
+
+    return 0;
+}
+
Index: gcc/testsuite/gcc.c-torture/execute/vect-shuffle-5.c
===================================================================
--- gcc/testsuite/gcc.c-torture/execute/vect-shuffle-5.c	(revision 0)
+++ gcc/testsuite/gcc.c-torture/execute/vect-shuffle-5.c	(revision 0)
@@ -0,0 +1,64 @@ 
+/* Test that different type variants are compatible within
+   vector shuffling.  */
+
+#define vector(elcount, type)  \
+__attribute__((vector_size((elcount)*sizeof(type)))) type
+
+#define shufcompare(count, vres, v0, mask) \
+do { \
+    int __i; \
+    for (__i = 0; __i < count; __i++) { \
+        if (vres[__i] != v0[mask[__i]]) \
+            __builtin_abort (); \
+    } \
+} while (0)
+
+#define test_compat_mask(res, vec, mask) \
+  res = __builtin_shuffle (vec, mask); \
+  shufcompare(4, res, vec, mask); \
+  res = __builtin_shuffle (vec, c ## mask); \
+  shufcompare(4, res, vec, c ##  mask); \
+  res = __builtin_shuffle (vec, r ## mask); \
+  shufcompare(4, res, vec, r ##  mask); \
+  res = __builtin_shuffle (vec, d ## mask); \
+  shufcompare(4, res, vec, d ##  mask); \
+  res = __builtin_shuffle (vec, dc ## mask); \
+  shufcompare(4, res, vec, dc ##  mask); \
+
+#define test_compat_vec(res, vec, mask) \
+  test_compat_mask (res, vec, mask); \
+  test_compat_mask (res, c ## vec, mask); \
+  test_compat_mask (res, r ## vec, mask); \
+  test_compat_mask (res, d ## vec, mask); \
+  test_compat_mask (res, dc ## vec, mask); 
+
+#define test_compat(res, vec, mask) \
+  test_compat_vec (res, vec, mask); \
+  test_compat_vec (d ## res, vec, mask); \
+  test_compat_vec (r ## res, vec, mask);
+
+typedef vector (4, int) v4si;
+typedef const vector (4, int) v4sicst;
+
+int main (int argc, char *argv[]) {
+    vector (4, int) vec = {argc, 1,2,3};
+    const vector (4, int) cvec = {argc, 1,2,3};
+    register vector (4, int) rvec = {argc, 1,2,3};
+    v4si dvec = {argc, 1,2,3};
+    v4sicst dcvec = {argc, 1,2,3};
+    
+    vector (4, int) res; 
+    v4si dres;
+    register vector (4, int) rres;
+
+    vector (4, int) mask = {0,3,2,1};
+    const vector (4, int) cmask = {0,3,2,1};
+    register vector (4, int) rmask = {0,3,2,1};
+    v4si dmask = {0,3,2,1};
+    v4sicst dcmask = {0,3,2,1};
+
+    test_compat (res, vec, mask);
+
+    return 0;
+}
+
Index: gcc/testsuite/gcc.dg/builtin-complex-err-1.c
===================================================================
--- gcc/testsuite/gcc.dg/builtin-complex-err-1.c	(revision 178354)
+++ gcc/testsuite/gcc.dg/builtin-complex-err-1.c	(working copy)
@@ -19,8 +19,8 @@  _Complex float fc3 = __builtin_complex (
 void
 f (void)
 {
-  __builtin_complex (0.0); /* { dg-error "expected" } */
-  __builtin_complex (0.0, 0.0, 0.0); /* { dg-error "expected" } */
+  __builtin_complex (0.0); /* { dg-error "wrong number of arguments" } */
+  __builtin_complex (0.0, 0.0, 0.0); /* { dg-error "wrong number of arguments" } */
 }
 
-void (*p) (void) = __builtin_complex; /* { dg-error "expected" } */
+void (*p) (void) = __builtin_complex; /* { dg-error "cannot take address" } */
Index: gcc/c-tree.h
===================================================================
--- gcc/c-tree.h	(revision 178354)
+++ gcc/c-tree.h	(working copy)
@@ -579,6 +579,7 @@  extern tree c_begin_omp_task (void);
 extern tree c_finish_omp_task (location_t, tree, tree);
 extern tree c_finish_omp_clauses (tree);
 extern tree c_build_va_arg (location_t, tree, tree);
+extern tree c_build_vec_shuffle_expr (location_t, tree, tree, tree);
 
 /* Set to 0 at beginning of a function definition, set to 1 if
    a return statement that specifies a return value is seen.  */
Index: gcc/expr.c
===================================================================
--- gcc/expr.c	(revision 178354)
+++ gcc/expr.c	(working copy)
@@ -8605,6 +8605,10 @@  expand_expr_real_2 (sepops ops, rtx targ
     case VEC_PACK_FIX_TRUNC_EXPR:
       mode = TYPE_MODE (TREE_TYPE (treeop0));
       goto binop;
+    
+    case VEC_SHUFFLE_EXPR:
+      target = expand_vec_shuffle_expr (type, treeop0, treeop1, treeop2, target);
+      return target;
 
     case DOT_PROD_EXPR:
       {
Index: gcc/gimple-pretty-print.c
===================================================================
--- gcc/gimple-pretty-print.c	(revision 178354)
+++ gcc/gimple-pretty-print.c	(working copy)
@@ -417,6 +417,16 @@  dump_ternary_rhs (pretty_printer *buffer
       dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
       pp_string (buffer, ">");
       break;
+    
+    case VEC_SHUFFLE_EXPR:
+      pp_string (buffer, "VEC_SHUFFLE_EXPR <");
+      dump_generic_node (buffer, gimple_assign_rhs1 (gs), spc, flags, false);
+      pp_string (buffer, ", ");
+      dump_generic_node (buffer, gimple_assign_rhs2 (gs), spc, flags, false);
+      pp_string (buffer, ", ");
+      dump_generic_node (buffer, gimple_assign_rhs3 (gs), spc, flags, false);
+      pp_string (buffer, ">");
+      break;
 
     case REALIGN_LOAD_EXPR:
       pp_string (buffer, "REALIGN_LOAD <");
Index: gcc/c-typeck.c
===================================================================
--- gcc/c-typeck.c	(revision 178354)
+++ gcc/c-typeck.c	(working copy)
@@ -2845,6 +2845,99 @@  build_function_call_vec (location_t loc,
     }
   return require_complete_type (result);
 }
+
+/* Build a VEC_SHUFFLE_EXPR if V0, V1 and MASK are not error_mark_nodes
+   and have vector types, V0 has the same type as V1, and the number of
+   elements of V0, V1 and MASK is the same.
+
+   If V1 is NULL_TREE it is assumed that __builtin_shuffle was
+   called with two arguments.  In this case the implementation passes
+   the first argument twice in order to share the same tree code.  This
+   allows mask values up to twice the vector length; that is an
+   implementation accident and these semantics are not guaranteed to
+   the user.  */
+tree
+c_build_vec_shuffle_expr (location_t loc, tree v0, tree v1, tree mask)
+{
+  tree vec_shuffle, tmp;
+  bool wrap = true;
+  bool maybe_const = false;
+  bool two_arguments;
+  
+  if (v1 == NULL_TREE)
+    {
+      two_arguments = true;
+      v1 = v0;
+    }
+
+  if (v0 == error_mark_node || v1 == error_mark_node
+      || mask == error_mark_node)
+    return error_mark_node;
+
+  if (TREE_CODE (TREE_TYPE (mask)) != VECTOR_TYPE
+      || TREE_CODE (TREE_TYPE (TREE_TYPE (mask))) != INTEGER_TYPE)
+    {
+      error_at (loc, "__builtin_shuffle last argument must "
+		     "be an integer vector");
+      return error_mark_node;
+    }
+   
+  if (TREE_CODE (TREE_TYPE (v0)) != VECTOR_TYPE
+      || TREE_CODE (TREE_TYPE (v1)) != VECTOR_TYPE)
+    {
+      error_at (loc, "__builtin_shuffle arguments must be vectors");
+      return error_mark_node;
+    }
+
+  if (TYPE_MAIN_VARIANT (TREE_TYPE (v0)) != TYPE_MAIN_VARIANT (TREE_TYPE (v1)))
+    {
+      error_at (loc, "__builtin_shuffle argument vectors must be of "
+		     "the same type");
+      return error_mark_node;
+    }
+
+  if (TYPE_VECTOR_SUBPARTS (TREE_TYPE (v0))
+      != TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask))
+      && TYPE_VECTOR_SUBPARTS (TREE_TYPE (v1))
+	 != TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask)))
+    {
+      error_at (loc, "__builtin_shuffle number of elements of the "
+		     "argument vector(s) and the mask vector should "
+		     "be the same");
+      return error_mark_node;
+    }
+  
+  if (GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (TREE_TYPE (v0))))
+      != GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (TREE_TYPE (mask)))))
+    {
+      error_at (loc, "__builtin_shuffle argument vector(s) inner type "
+		     "must have the same size as inner type of the mask");
+      return error_mark_node;
+    }
+
+  /* Avoid C_MAYBE_CONST_EXPRs inside VEC_SHUFFLE_EXPR.  */
+  tmp = c_fully_fold (v0, false, &maybe_const);
+  v0 = save_expr (tmp);
+  wrap &= maybe_const;
+
+  if (!two_arguments)
+    {
+      v1 = c_fully_fold (v1, false, &maybe_const);
+      wrap &= maybe_const;
+    }
+  else
+    v1 = v0;
+  
+  mask = c_fully_fold (mask, false, &maybe_const);
+  wrap &= maybe_const;
+
+  vec_shuffle = build3 (VEC_SHUFFLE_EXPR, TREE_TYPE (v0), v0, v1, mask);
+
+  if (!wrap)
+    vec_shuffle = c_wrap_maybe_const (vec_shuffle, true);
+
+  return vec_shuffle;
+}
 
 /* Convert the argument expressions in the vector VALUES
    to the types in the list TYPELIST.
@@ -6120,7 +6213,7 @@  digest_init (location_t init_loc, tree t
 	  tree value;
 	  bool constant_p = true;
 
-	  /* Iterate through elements and check if all constructor
+          /* Iterate through elements and check if all constructor
 	     elements are *_CSTs.  */
 	  FOR_EACH_CONSTRUCTOR_VALUE (CONSTRUCTOR_ELTS (inside_init), ix, value)
 	    if (!CONSTANT_CLASS_P (value))
Index: gcc/gimplify.c
===================================================================
--- gcc/gimplify.c	(revision 178354)
+++ gcc/gimplify.c	(working copy)
@@ -7286,6 +7286,7 @@  gimplify_expr (tree *expr_p, gimple_seq
 	  }
 
 	case FMA_EXPR:
+	case VEC_SHUFFLE_EXPR:
 	  /* Classified as tcc_expression.  */
 	  goto expr_3;
 
Index: gcc/tree.def
===================================================================
--- gcc/tree.def	(revision 178354)
+++ gcc/tree.def	(working copy)
@@ -497,6 +497,19 @@  DEFTREECODE (COND_EXPR, "cond_expr", tcc
 */
 DEFTREECODE (VEC_COND_EXPR, "vec_cond_expr", tcc_expression, 3)
 
+/* Vector shuffle expression.  A = VEC_SHUFFLE_EXPR<v0, v1, maks>
+   means
+
+   foreach i in length (mask):
+     A = mask[i] < length (v0) ? v0[mask[i]] : v1[mask[i]]
+
+   V0 and V1 are vectors of the same type.  MASK is an integer-typed
+   vector.  The number of MASK elements must be the same as the
+   number of elements in V0 and V1.  The size of the inner type
+   of MASK and of V0 and V1 must be the same.
+*/
+DEFTREECODE (VEC_SHUFFLE_EXPR, "vec_shuffle_expr", tcc_expression, 3)
+
 /* Declare local variables, including making RTL and allocating space.
    BIND_EXPR_VARS is a chain of VAR_DECL nodes for the variables.
    BIND_EXPR_BODY is the body, the expression to be computed using
Index: gcc/tree-inline.c
===================================================================
--- gcc/tree-inline.c	(revision 178354)
+++ gcc/tree-inline.c	(working copy)
@@ -3285,6 +3285,7 @@  estimate_operator_cost (enum tree_code c
        ??? We may consider mapping RTL costs to this.  */
     case COND_EXPR:
     case VEC_COND_EXPR:
+    case VEC_SHUFFLE_EXPR:
 
     case PLUS_EXPR:
     case POINTER_PLUS_EXPR:
Index: gcc/tree-vect-generic.c
===================================================================
--- gcc/tree-vect-generic.c	(revision 178354)
+++ gcc/tree-vect-generic.c	(working copy)
@@ -30,6 +30,7 @@  along with GCC; see the file COPYING3.
 #include "tree-pass.h"
 #include "flags.h"
 #include "ggc.h"
+#include "diagnostic.h"
 
 /* Need to include rtl.h, expr.h, etc. for optabs.  */
 #include "expr.h"
@@ -432,6 +433,263 @@  type_for_widest_vector_mode (enum machin
     }
 }
 
+
+/* Build a reference to the element of the vector VECT.  The function
+   returns either the element itself, a BIT_FIELD_REF, or an
+   ARRAY_REF expression.
+
+   GSI is required to insert temporary variables while building a
+   reference to the element of the vector VECT.
+
+   PTMPVEC is a pointer to a temporary variable for caching
+   purposes.  If PTMPVEC is NULL, a new temporary variable
+   will be created.  */
+static tree
+vector_element (gimple_stmt_iterator *gsi, tree vect, tree idx, tree *ptmpvec)
+{
+  tree type;
+  gimple asgn;
+  tree tmpvec;
+  tree arraytype;
+  bool need_asgn = true;
+
+  gcc_assert (TREE_CODE (TREE_TYPE (vect)) == VECTOR_TYPE);
+
+  type = TREE_TYPE (vect);
+  if (TREE_CODE (idx) == INTEGER_CST)
+    {
+      unsigned HOST_WIDE_INT index;
+
+      if (!host_integerp (idx, 1)
+           || (index = tree_low_cst (idx, 1)) > TYPE_VECTOR_SUBPARTS (type)-1)
+        return error_mark_node;
+
+      if (TREE_CODE (vect) == VECTOR_CST)
+        {
+            unsigned i;
+            tree vals = TREE_VECTOR_CST_ELTS (vect);
+            for (i = 0; vals; vals = TREE_CHAIN (vals), ++i)
+              if (i == index)
+                 return TREE_VALUE (vals);
+            return error_mark_node;
+        }
+      else if (TREE_CODE (vect) == CONSTRUCTOR)
+        {
+          unsigned i;
+          VEC (constructor_elt, gc) *vals = CONSTRUCTOR_ELTS (vect);
+          constructor_elt *elt;
+
+          for (i = 0; VEC_iterate (constructor_elt, vals, i, elt); i++)
+            if (operand_equal_p (elt->index, idx, 0))
+              return elt->value;
+          return fold_convert (TREE_TYPE (type), integer_zero_node);
+        }
+      else if (TREE_CODE (vect) == SSA_NAME)
+        {
+	  tree size = TYPE_SIZE (TREE_TYPE (type));
+          tree pos = fold_build2 (MULT_EXPR, TREE_TYPE (idx), idx, size);
+          return fold_build3 (BIT_FIELD_REF, TREE_TYPE (type), vect, size, pos);
+        }
+      else
+        return error_mark_node;
+    }
+  
+  if (!ptmpvec)
+    tmpvec = create_tmp_var (TREE_TYPE (vect), "vectmp");
+  else if (!*ptmpvec)
+    tmpvec = *ptmpvec = create_tmp_var (TREE_TYPE (vect), "vectmp");
+  else
+    {
+      tmpvec = *ptmpvec;
+      need_asgn = false;
+    }
+  
+  if (need_asgn)
+    {
+      TREE_ADDRESSABLE (tmpvec) = 1;
+      asgn = gimple_build_assign (tmpvec, vect);
+      gsi_insert_before (gsi, asgn, GSI_SAME_STMT);
+    }
+
+  arraytype = build_array_type_nelts (TREE_TYPE (type),
+				      TYPE_VECTOR_SUBPARTS (TREE_TYPE (vect)));
+  
+  return build4 (ARRAY_REF, TREE_TYPE (type),
+                 build1 (VIEW_CONVERT_EXPR, arraytype, tmpvec),
+                 idx, NULL_TREE, NULL_TREE);
+}
+
+/* Check if VEC_SHUFFLE_EXPR within the given setting is supported
+   by hardware, or lower it piecewise.  The function returns false when
+   the expression must be replaced with TRAP_RETURN, true otherwise.
+
+   When VEC_SHUFFLE_EXPR has the same first and second operands:
+   VEC_SHUFFLE_EXPR <v0, v0, mask>, the lowered version would be
+   {v0[mask[0]], v0[mask[1]], ...}
+   MASK and V0 must have the same number of elements.
+
+   Otherwise VEC_SHUFFLE_EXPR <v0, v1, mask> is lowered to
+   {mask[0] < len(v0) ? v0[mask[0]] : v1[mask[0] - len(v0)], ...}
+   V0 and V1 must have the same type.  MASK, V0 and V1 must have the
+   same number of elements.  */
+static bool
+lower_vec_shuffle (gimple_stmt_iterator *gsi, location_t loc)
+{
+ 
+  gimple stmt = gsi_stmt (*gsi);
+  tree mask = gimple_assign_rhs3 (stmt);
+  tree vec0 = gimple_assign_rhs1 (stmt);
+  tree vec1 = gimple_assign_rhs2 (stmt);
+  unsigned els = TYPE_VECTOR_SUBPARTS (TREE_TYPE (mask));
+  tree type0 = TREE_TYPE (TREE_TYPE (vec0));
+  VEC(constructor_elt,gc) *v = NULL;
+  tree vectype, constr;
+  tree vec0tmp = NULL_TREE, masktmp = NULL_TREE;
+
+  if (expand_vec_shuffle_expr_p (TYPE_MODE (TREE_TYPE (vec0)), vec0, vec1, mask))
+    {
+      tree t;
+
+      t = gimplify_build3 (gsi, VEC_SHUFFLE_EXPR, TREE_TYPE (vec0),
+			   vec0, vec1, mask);
+      gimple_assign_set_rhs_from_tree (gsi, t);
+      /* Statement should be updated by callee.  */
+      return true;
+    }
+  
+  if (operand_equal_p (vec0, vec1, 0))
+    {
+      unsigned i;
+      tree vec0tmp = NULL_TREE;
+      
+      v = VEC_alloc (constructor_elt, gc, els);
+      for (i = 0; i < els; i++)
+        {
+          tree idxval, vecel, t;
+	   
+	  idxval = vector_element (gsi, mask, size_int (i), &masktmp);
+          if (idxval == error_mark_node)
+            {
+              if (warning_at (loc, 0, "Invalid shuffling mask index %i", i))
+		inform (loc, "if this code is reached the program will abort");
+	      return false;
+            }
+
+	  vecel = vector_element (gsi, vec0, idxval, &vec0tmp);
+          if (vecel == error_mark_node)
+            {
+              if (warning_at (loc, 0, "Invalid shuffling arguments"))
+		inform (loc, "if this code is reached the program will abort");
+	      return false;
+            }
+          
+          t = force_gimple_operand_gsi (gsi, vecel, true,
+					NULL_TREE, true, GSI_SAME_STMT);
+          CONSTRUCTOR_APPEND_ELT (v, size_int (i), t);
+        }
+    }
+  else
+    {
+      unsigned i;
+      tree var = create_tmp_var (type0, "vecel");
+      tree vec1tmp = NULL_TREE;
+
+      v = VEC_alloc (constructor_elt, gc, els);
+      for (i = 0; i < els; i++)
+        {
+          tree idxval, idx1val, cond, elval0, elval1, condexpr, t, ssatmp;
+          tree vec0el, vec1el;
+          gimple asgn;
+          
+          idxval = vector_element (gsi, mask, size_int (i), &masktmp);
+	  if (idxval == error_mark_node)
+            {
+              if (warning_at (loc, 0, "Invalid shuffling mask index %i", i))
+		inform (loc, "if this code is reached the program will abort");
+	      return false;
+            }
+          
+          if (TREE_CODE (idxval) == INTEGER_CST)
+            {
+              if (tree_int_cst_lt (idxval, size_int (els)))
+                {
+                  vec0el = vector_element (gsi, vec0, idxval, &vec0tmp);
+                  t = force_gimple_operand_gsi (gsi, vec0el,
+                                    true, NULL_TREE, true, GSI_SAME_STMT);
+                }
+              else if (tree_int_cst_lt (idxval, size_int (2*els)))
+                {
+                  idx1val = fold_build2 (MINUS_EXPR, TREE_TYPE (idxval),
+                        idxval, build_int_cst (TREE_TYPE (idxval), els));
+                  
+                  vec1el = vector_element (gsi, vec1, idx1val, &vec1tmp);
+                  t = force_gimple_operand_gsi (gsi, vec1el, true,
+						NULL_TREE, true, GSI_SAME_STMT);
+                }
+              else
+                {
+                  if (warning_at (loc, 0, "Invalid shuffling mask index %i", i))
+		    inform (loc, "if this code is reached the "
+				  "program will abort");
+		  return false;
+                }
+            }
+          else
+            {
+
+              idx1val = fold_build2 (MINUS_EXPR, TREE_TYPE (idxval),
+                            idxval, build_int_cst (TREE_TYPE (idxval), els));
+              idx1val = force_gimple_operand_gsi (gsi, idx1val,
+                                true, NULL_TREE, true, GSI_SAME_STMT);
+              cond = fold_build2 (GT_EXPR, boolean_type_node, \
+                             idxval, fold_convert (type0, size_int (els - 1)));
+              
+	      vec0el = vector_element (gsi, vec0, idxval, &vec0tmp);
+              if (vec0el == error_mark_node)
+                {
+                  if (warning_at (loc, 0, "Invalid shuffling arguments"))
+		    inform (loc, "if this code is reached the "
+				 "program will abort");
+		  return false;
+                }
+
+              elval0 = force_gimple_operand_gsi (gsi, vec0el,
+                                true, NULL_TREE, true, GSI_SAME_STMT);
+	      
+	      vec1el = vector_element (gsi, vec1, idx1val, &vec1tmp);
+              if (vec1el == error_mark_node)
+                {
+                  if (warning_at (loc, 0, "Invalid shuffling arguments"))
+		    inform (loc, "if this code is reached the "
+				 "program will abort");
+		  return false;
+                }
+
+              elval1 = force_gimple_operand_gsi (gsi, vec1el,
+                                true, NULL_TREE, true, GSI_SAME_STMT);
+
+              condexpr = fold_build3 (COND_EXPR, type0, cond, \
+                                      elval1, elval0);
+
+              t = force_gimple_operand_gsi (gsi, condexpr, true, \
+                                        NULL_TREE, true, GSI_SAME_STMT);
+            }
+          
+          asgn = gimple_build_assign (var, t);
+          ssatmp = make_ssa_name (var, asgn);
+          gimple_assign_set_lhs (asgn, ssatmp);
+          gsi_insert_before (gsi, asgn, GSI_SAME_STMT);
+          CONSTRUCTOR_APPEND_ELT (v, size_int (i), ssatmp);
+        }
+    }
+  
+  vectype = build_vector_type (type0, els);
+  constr = build_constructor (vectype, v);
+  gimple_assign_set_rhs_from_tree (gsi, constr);
+  /* Statement should be updated by callee.  */
+  return true;
+}
+
 /* Process one statement.  If we identify a vector operation, expand it.  */
 
 static void
@@ -451,6 +709,25 @@  expand_vector_operations_1 (gimple_stmt_
   code = gimple_assign_rhs_code (stmt);
   rhs_class = get_gimple_rhs_class (code);
 
+  if (code == VEC_SHUFFLE_EXPR)
+    {
+      if (!lower_vec_shuffle (gsi, gimple_location (stmt)))
+	{
+	  gimple new_stmt;
+	  tree vec0;
+	  
+	  vec0 = gimple_assign_rhs1 (stmt);
+	  new_stmt = gimple_build_call (built_in_decls[BUILT_IN_TRAP], 0);
+	  gsi_insert_before (gsi, new_stmt,  GSI_SAME_STMT);
+	  split_block (gimple_bb (new_stmt), new_stmt);
+	  new_stmt = gimple_build_assign (gimple_assign_lhs (stmt), vec0);
+	  gsi_replace (gsi, new_stmt, false);
+	}
+
+      gimple_set_modified (gsi_stmt (*gsi), true);
+      update_stmt (gsi_stmt (*gsi));
+    }
+
   if (rhs_class != GIMPLE_UNARY_RHS && rhs_class != GIMPLE_BINARY_RHS)
     return;
 
@@ -613,9 +890,9 @@  expand_vector_operations_1 (gimple_stmt_
    if it may need the bit-twiddling tricks implemented in this file.  */
 
 static bool
-gate_expand_vector_operations (void)
+gate_expand_vector_operations_ssa (void)
 {
-  return flag_tree_vectorize != 0;
+  return optimize == 0;
 }
 
 static unsigned int
@@ -648,7 +925,7 @@  struct gimple_opt_pass pass_lower_vector
  {
   GIMPLE_PASS,
   "veclower",				/* name */
-  0,					/* gate */
+  gate_expand_vector_operations_ssa,    /* gate */
   expand_vector_operations,		/* execute */
   NULL,					/* sub */
   NULL,					/* next */
@@ -661,6 +938,7 @@  struct gimple_opt_pass pass_lower_vector
   TODO_update_ssa	                /* todo_flags_finish */
     | TODO_verify_ssa
     | TODO_verify_stmts | TODO_verify_flow
+    | TODO_cleanup_cfg
  }
 };
 
@@ -669,7 +947,7 @@  struct gimple_opt_pass pass_lower_vector
  {
   GIMPLE_PASS,
   "veclower2",				/* name */
-  gate_expand_vector_operations,	/* gate */
+  0,	                                /* gate */
   expand_vector_operations,		/* execute */
   NULL,					/* sub */
   NULL,					/* next */
@@ -682,6 +960,7 @@  struct gimple_opt_pass pass_lower_vector
   TODO_update_ssa	                /* todo_flags_finish */
     | TODO_verify_ssa
     | TODO_verify_stmts | TODO_verify_flow
+    | TODO_cleanup_cfg
  }
 };
 
Index: gcc/Makefile.in
===================================================================
--- gcc/Makefile.in	(revision 178354)
+++ gcc/Makefile.in	(working copy)
@@ -3178,7 +3178,7 @@  tree-vect-generic.o : tree-vect-generic.
     $(TM_H) $(TREE_FLOW_H) $(GIMPLE_H) tree-iterator.h $(TREE_PASS_H) \
     $(FLAGS_H) $(OPTABS_H) $(MACHMODE_H) $(EXPR_H) \
     langhooks.h $(FLAGS_H) $(DIAGNOSTIC_H) gt-tree-vect-generic.h $(GGC_H) \
-    coretypes.h insn-codes.h
+    coretypes.h insn-codes.h $(DIAGNOSTIC_H)
 df-core.o : df-core.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) \
    insn-config.h $(RECOG_H) $(FUNCTION_H) $(REGS_H) alloc-pool.h \
    hard-reg-set.h $(BASIC_BLOCK_H) $(DF_H) $(BITMAP_H) sbitmap.h $(TIMEVAR_H) \
Index: gcc/gimple.c
===================================================================
--- gcc/gimple.c	(revision 178354)
+++ gcc/gimple.c	(working copy)
@@ -2615,6 +2615,7 @@  get_gimple_rhs_num_ops (enum tree_code c
       || (SYM) == WIDEN_MULT_MINUS_EXPR					    \
       || (SYM) == DOT_PROD_EXPR						    \
       || (SYM) == REALIGN_LOAD_EXPR					    \
+      || (SYM) == VEC_SHUFFLE_EXPR					    \
       || (SYM) == FMA_EXPR) ? GIMPLE_TERNARY_RHS			    \
    : ((SYM) == COND_EXPR						    \
       || (SYM) == CONSTRUCTOR						    \
Index: gcc/tree-cfg.c
===================================================================
--- gcc/tree-cfg.c	(revision 178354)
+++ gcc/tree-cfg.c	(working copy)
@@ -3711,6 +3711,59 @@  verify_gimple_assign_ternary (gimple stm
 	}
       break;
 
+    case VEC_SHUFFLE_EXPR:
+      if (!useless_type_conversion_p (lhs_type, rhs1_type)
+	  || !useless_type_conversion_p (lhs_type, rhs2_type))
+	{
+	  error ("type mismatch in vector shuffle expression");
+	  debug_generic_expr (lhs_type);
+	  debug_generic_expr (rhs1_type);
+	  debug_generic_expr (rhs2_type);
+	  debug_generic_expr (rhs3_type);
+	  return true;
+	}
+
+      if (TREE_CODE (rhs1_type) != VECTOR_TYPE
+	  || TREE_CODE (rhs2_type) != VECTOR_TYPE
+	  || TREE_CODE (rhs3_type) != VECTOR_TYPE)
+	{
+	  error ("vector types expected in vector shuffle expression");
+	  debug_generic_expr (lhs_type);
+	  debug_generic_expr (rhs1_type);
+	  debug_generic_expr (rhs2_type);
+	  debug_generic_expr (rhs3_type);
+	  return true;
+	}
+
+      if (TYPE_VECTOR_SUBPARTS (rhs1_type) != TYPE_VECTOR_SUBPARTS (rhs2_type)
+	  || TYPE_VECTOR_SUBPARTS (rhs2_type)
+	     != TYPE_VECTOR_SUBPARTS (rhs3_type)
+	  || TYPE_VECTOR_SUBPARTS (rhs3_type)
+	     != TYPE_VECTOR_SUBPARTS (lhs_type))
+	{
+	  error ("vectors with different element number found "
+		 "in vector shuffle expression");
+	  debug_generic_expr (lhs_type);
+	  debug_generic_expr (rhs1_type);
+	  debug_generic_expr (rhs2_type);
+	  debug_generic_expr (rhs3_type);
+	  return true;
+	}
+
+      if (TREE_CODE (TREE_TYPE (rhs3_type)) != INTEGER_TYPE
+	  || GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (rhs3_type)))
+	     != GET_MODE_BITSIZE (TYPE_MODE (TREE_TYPE (rhs1_type))))
+	{
+	  error ("invalid mask type in vector shuffle expression");
+	  debug_generic_expr (lhs_type);
+	  debug_generic_expr (rhs1_type);
+	  debug_generic_expr (rhs2_type);
+	  debug_generic_expr (rhs3_type);
+	  return true;
+	}
+
+      return false;
+
     case DOT_PROD_EXPR:
     case REALIGN_LOAD_EXPR:
       /* FIXME.  */
Index: gcc/passes.c
===================================================================
--- gcc/passes.c	(revision 178354)
+++ gcc/passes.c	(working copy)
@@ -1354,7 +1354,6 @@  init_optimization_passes (void)
 	  NEXT_PASS (pass_vectorize);
 	    {
 	      struct opt_pass **p = &pass_vectorize.pass.sub;
-	      NEXT_PASS (pass_lower_vector_ssa);
 	      NEXT_PASS (pass_dce_loop);
 	    }
           NEXT_PASS (pass_predcom);
@@ -1366,6 +1365,7 @@  init_optimization_passes (void)
 	  NEXT_PASS (pass_lim);
 	  NEXT_PASS (pass_tree_loop_done);
 	}
+      NEXT_PASS (pass_lower_vector_ssa);
       NEXT_PASS (pass_cse_reciprocals);
       NEXT_PASS (pass_reassoc);
       NEXT_PASS (pass_vrp);
Index: gcc/c-parser.c
===================================================================
--- gcc/c-parser.c	(revision 178354)
+++ gcc/c-parser.c	(working copy)
@@ -5989,6 +5989,46 @@  c_parser_alignof_expression (c_parser *p
     }
 }
 
+/* Helper function to read arguments of builtins which are interfaces
+   for the middle-end nodes like COMPLEX_EXPR, VEC_SHUFFLE_EXPR and
+   others.  The name of the builtin is passed using the BNAME
+   parameter.  The function returns true if there were no errors while
+   parsing and stores the arguments in EXPR_LIST.  The list of original
+   types can be obtained by passing a non-NULL value to ORIG_TYPES.  */
+static bool
+c_parser_get_builtin_args (c_parser *parser, const char *bname,
+			   VEC(tree,gc) **expr_list,
+			   VEC(tree,gc) **orig_types)
+{
+  location_t loc = c_parser_peek_token (parser)->location;
+  *expr_list = NULL;
+
+  if (c_parser_next_token_is_not (parser, CPP_OPEN_PAREN))
+    {
+      error_at (loc, "cannot take address of %qs", bname);
+      return false;
+    }
+
+  c_parser_consume_token (parser);
+
+  if (c_parser_next_token_is (parser, CPP_CLOSE_PAREN))
+    {
+      c_parser_consume_token (parser);
+      return true;
+    }
+    
+  if (orig_types)
+    *expr_list = c_parser_expr_list (parser, false, false, orig_types);
+  else
+    *expr_list = c_parser_expr_list (parser, false, false, NULL);
+
+  if (!c_parser_require (parser, CPP_CLOSE_PAREN, "expected %<)%>"))
+    return false;
+
+  return true;
+}
+
+
 /* Parse a postfix expression (C90 6.3.1-6.3.2, C99 6.5.1-6.5.2).
 
    postfix-expression:
@@ -6027,6 +6067,10 @@  c_parser_alignof_expression (c_parser *p
 			     assignment-expression )
      __builtin_types_compatible_p ( type-name , type-name )
      __builtin_complex ( assignment-expression , assignment-expression )
+     __builtin_shuffle ( assignment-expression , assignment-expression )
+     __builtin_shuffle ( assignment-expression ,
+			 assignment-expression ,
+			 assignment-expression )
 
    offsetof-member-designator:
      identifier
@@ -6047,7 +6091,7 @@  c_parser_alignof_expression (c_parser *p
 static struct c_expr
 c_parser_postfix_expression (c_parser *parser)
 {
-  struct c_expr expr, e1, e2, e3;
+  struct c_expr expr, e1;
   struct c_type_name *t1, *t2;
   location_t loc = c_parser_peek_token (parser)->location;;
   expr.original_code = ERROR_MARK;
@@ -6333,45 +6377,55 @@  c_parser_postfix_expression (c_parser *p
 	  }
 	  break;
 	case RID_CHOOSE_EXPR:
-	  c_parser_consume_token (parser);
-	  if (!c_parser_require (parser, CPP_OPEN_PAREN, "expected %<(%>"))
-	    {
-	      expr.value = error_mark_node;
-	      break;
-	    }
-	  loc = c_parser_peek_token (parser)->location;
-	  e1 = c_parser_expr_no_commas (parser, NULL);
-	  if (!c_parser_require (parser, CPP_COMMA, "expected %<,%>"))
-	    {
-	      c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
-	      expr.value = error_mark_node;
-	      break;
-	    }
-	  e2 = c_parser_expr_no_commas (parser, NULL);
-	  if (!c_parser_require (parser, CPP_COMMA, "expected %<,%>"))
-	    {
-	      c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
-	      expr.value = error_mark_node;
-	      break;
-	    }
-	  e3 = c_parser_expr_no_commas (parser, NULL);
-	  c_parser_skip_until_found (parser, CPP_CLOSE_PAREN,
-				     "expected %<)%>");
 	  {
-	    tree c;
+	    VEC(tree,gc) *expr_list;
+	    VEC(tree,gc) *orig_types;
+	    tree e1value, e2value, e3value, c;
 
-	    c = e1.value;
-	    mark_exp_read (e2.value);
-	    mark_exp_read (e3.value);
+	    c_parser_consume_token (parser);
+	    if (!c_parser_get_builtin_args (parser,
+					    "__builtin_choose_expr",
+					    &expr_list, &orig_types))
+	      {
+		expr.value = error_mark_node;
+		break;
+	      }
+
+	    if (VEC_length (tree, expr_list) != 3)
+	      {
+		error_at (loc, "wrong number of arguments to "
+			       "%<__builtin_choose_expr%>");
+		expr.value = error_mark_node;
+		break;
+	      }
+	    
+	    e1value = VEC_index (tree, expr_list, 0);
+	    e2value = VEC_index (tree, expr_list, 1);
+	    e3value = VEC_index (tree, expr_list, 2);
+
+	    c = e1value;
+	    mark_exp_read (e2value);
+	    mark_exp_read (e3value);
 	    if (TREE_CODE (c) != INTEGER_CST
 		|| !INTEGRAL_TYPE_P (TREE_TYPE (c)))
 	      error_at (loc,
 			"first argument to %<__builtin_choose_expr%> not"
 			" a constant");
 	    constant_expression_warning (c);
-	    expr = integer_zerop (c) ? e3 : e2;
+	    
+	    if (integer_zerop (c))
+	      {
+		expr.value = e3value;
+		expr.original_type = VEC_index (tree, orig_types, 2);
+	      }
+	    else
+	      {
+		expr.value = e2value;
+		expr.original_type = VEC_index (tree, orig_types, 1);
+	      }
+
+	    break;
 	  }
-	  break;
 	case RID_TYPES_COMPATIBLE_P:
 	  c_parser_consume_token (parser);
 	  if (!c_parser_require (parser, CPP_OPEN_PAREN, "expected %<(%>"))
@@ -6410,57 +6464,96 @@  c_parser_postfix_expression (c_parser *p
 	  }
 	  break;
 	case RID_BUILTIN_COMPLEX:
-	  c_parser_consume_token (parser);
-	  if (!c_parser_require (parser, CPP_OPEN_PAREN, "expected %<(%>"))
-	    {
-	      expr.value = error_mark_node;
-	      break;
-	    }
-	  loc = c_parser_peek_token (parser)->location;
-	  e1 = c_parser_expr_no_commas (parser, NULL);
-	  if (!c_parser_require (parser, CPP_COMMA, "expected %<,%>"))
-	    {
-	      c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
-	      expr.value = error_mark_node;
-	      break;
-	    }
-	  e2 = c_parser_expr_no_commas (parser, NULL);
-	  c_parser_skip_until_found (parser, CPP_CLOSE_PAREN,
-				     "expected %<)%>");
-	  mark_exp_read (e1.value);
-	  if (TREE_CODE (e1.value) == EXCESS_PRECISION_EXPR)
-	    e1.value = convert (TREE_TYPE (e1.value),
-				TREE_OPERAND (e1.value, 0));
-	  mark_exp_read (e2.value);
-	  if (TREE_CODE (e2.value) == EXCESS_PRECISION_EXPR)
-	    e2.value = convert (TREE_TYPE (e2.value),
-				TREE_OPERAND (e2.value, 0));
-	  if (!SCALAR_FLOAT_TYPE_P (TREE_TYPE (e1.value))
-	      || DECIMAL_FLOAT_TYPE_P (TREE_TYPE (e1.value))
-	      || !SCALAR_FLOAT_TYPE_P (TREE_TYPE (e2.value))
-	      || DECIMAL_FLOAT_TYPE_P (TREE_TYPE (e2.value)))
-	    {
-	      error_at (loc, "%<__builtin_complex%> operand "
-			"not of real binary floating-point type");
-	      expr.value = error_mark_node;
-	      break;
-	    }
-	  if (TYPE_MAIN_VARIANT (TREE_TYPE (e1.value))
-	      != TYPE_MAIN_VARIANT (TREE_TYPE (e2.value)))
-	    {
-	      error_at (loc,
-			"%<__builtin_complex%> operands of different types");
-	      expr.value = error_mark_node;
-	      break;
-	    }
-	  if (!flag_isoc99)
-	    pedwarn (loc, OPT_pedantic,
-		     "ISO C90 does not support complex types");
-	  expr.value = build2 (COMPLEX_EXPR,
-			       build_complex_type (TYPE_MAIN_VARIANT
-						   (TREE_TYPE (e1.value))),
-			       e1.value, e2.value);
-	  break;
+	  { 
+	    VEC(tree,gc) *expr_list;
+	    tree e1value, e2value;
+	    
+	    c_parser_consume_token (parser);
+	    if (!c_parser_get_builtin_args (parser,
+					    "__builtin_complex",
+					    &expr_list, NULL))
+	      {
+		expr.value = error_mark_node;
+		break;
+	      }
+
+	    if (VEC_length (tree, expr_list) != 2)
+	      {
+		error_at (loc, "wrong number of arguments to "
+			       "%<__builtin_complex%>");
+		expr.value = error_mark_node;
+		break;
+	      }
+	    
+	    e1value = VEC_index (tree, expr_list, 0);
+	    e2value = VEC_index (tree, expr_list, 1);
+
+	    mark_exp_read (e1value);
+	    if (TREE_CODE (e1value) == EXCESS_PRECISION_EXPR)
+	      e1value = convert (TREE_TYPE (e1value),
+				 TREE_OPERAND (e1value, 0));
+	    mark_exp_read (e2value);
+	    if (TREE_CODE (e2value) == EXCESS_PRECISION_EXPR)
+	      e2value = convert (TREE_TYPE (e2value),
+				 TREE_OPERAND (e2value, 0));
+	    if (!SCALAR_FLOAT_TYPE_P (TREE_TYPE (e1value))
+		|| DECIMAL_FLOAT_TYPE_P (TREE_TYPE (e1value))
+		|| !SCALAR_FLOAT_TYPE_P (TREE_TYPE (e2value))
+		|| DECIMAL_FLOAT_TYPE_P (TREE_TYPE (e2value)))
+	      {
+		error_at (loc, "%<__builtin_complex%> operand "
+			  "not of real binary floating-point type");
+		expr.value = error_mark_node;
+		break;
+	      }
+	    if (TYPE_MAIN_VARIANT (TREE_TYPE (e1value))
+		!= TYPE_MAIN_VARIANT (TREE_TYPE (e2value)))
+	      {
+		error_at (loc,
+			  "%<__builtin_complex%> operands of different types");
+		expr.value = error_mark_node;
+		break;
+	      }
+	    if (!flag_isoc99)
+	      pedwarn (loc, OPT_pedantic,
+		       "ISO C90 does not support complex types");
+	    expr.value = build2 (COMPLEX_EXPR,
+				 build_complex_type (TYPE_MAIN_VARIANT
+						     (TREE_TYPE (e1value))),
+				 e1value, e2value);
+	    break;
+	  }
+	case RID_BUILTIN_SHUFFLE:
+	  {
+	    VEC(tree,gc) *expr_list;
+	    
+	    c_parser_consume_token (parser);
+	    if (!c_parser_get_builtin_args (parser,
+					    "__builtin_shuffle",
+					    &expr_list, NULL))
+	      {
+		expr.value = error_mark_node;
+		break;
+	      }
+
+	    if (VEC_length (tree, expr_list) == 2)
+	      expr.value = c_build_vec_shuffle_expr
+				(loc, VEC_index (tree, expr_list, 0),
+				 NULL_TREE,
+				 VEC_index (tree, expr_list, 1));
+	    else if (VEC_length (tree, expr_list) == 3)
+	      expr.value = c_build_vec_shuffle_expr
+				(loc, VEC_index (tree, expr_list, 0),
+				 VEC_index (tree, expr_list, 1),
+				 VEC_index (tree, expr_list, 2));
+	    else
+	      {
+		error_at (loc, "wrong number of arguments to "
+			       "%<__builtin_shuffle%>");
+		expr.value = error_mark_node;
+	      }
+	    break;
+	  }
 	case RID_AT_SELECTOR:
 	  gcc_assert (c_dialect_objc ());
 	  c_parser_consume_token (parser);
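
For reference, the two argument shapes documented above are meant to
be used as follows (a hedged example: the result values assume the
usual permute semantics, where mask element i selects element mask[i],
taken modulo the element count, from the input vector, or from the
concatenation of both inputs in the three-operand form):

  typedef int v4si __attribute__ ((vector_size (16)));

  v4si
  permute_examples (void)
  {
    v4si a = { 0, 10, 20, 30 };
    v4si b = { 40, 50, 60, 70 };
    v4si m1 = { 3, 2, 1, 0 };
    v4si m2 = { 0, 4, 2, 6 };

    v4si r1 = __builtin_shuffle (a, m1);     /* { 30, 20, 10, 0 }  */
    v4si r2 = __builtin_shuffle (a, b, m2);  /* { 0, 40, 20, 60 }  */

    return r1 + r2;
  }
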
Index: gcc/config/i386/sse.md
===================================================================
--- gcc/config/i386/sse.md	(revision 178354)
+++ gcc/config/i386/sse.md	(working copy)
@@ -231,6 +231,12 @@  (define_mode_attr sseinsnmode
    (V4SF "V4SF") (V2DF "V2DF")
    (TI "TI") (V32QI "OI") (V16HI "OI") (V8SI "OI") (V4DI "OI")])
 
+;; Mapping of 128bit vector modes to the integer vector mode used for
+;; the shuffle mask: the element count is preserved, and float element
+;; modes map to integer elements of the same width
+(define_mode_attr sseshuffint
+  [(V16QI "V16QI") (V8HI "V8HI")
+   (V4SI "V4SI")  (V2DI "V2DI")
+   (V4SF "V4SI") (V2DF "V2DI")])
+
 ;; Mapping of vector float modes to an integer mode of the same size
 (define_mode_attr sseintvecmode
   [(V8SF "V8SI") (V4DF "V4DI")
@@ -6234,6 +6240,18 @@  (define_expand "vconduv2di"
   DONE;
 })
 
+(define_expand "vshuffle<mode>"
+  [(match_operand:V_128 0 "register_operand" "")
+   (match_operand:V_128 1 "general_operand" "")
+   (match_operand:<sseshuffint> 2 "general_operand" "")]
+  "TARGET_SSE3 || TARGET_AVX"
+{
+  bool ok = ix86_expand_vshuffle (operands);
+  gcc_assert (ok);
+  DONE;
+})
+
+
 ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
 ;;
 ;; Parallel bitwise logical operations
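
The practical effect of <sseshuffint> is that float vectors take a
shuffle mask with integer elements of the same width and count, e.g.
(my typedefs, assuming the expander is reached):

  typedef float v4sf __attribute__ ((vector_size (16)));
  typedef int v4si __attribute__ ((vector_size (16)));

  v4sf
  shuffle_float (v4sf x, v4si m)
  {
    /* Matches vshufflev4sf; the mask operand has mode V4SI.  */
    return __builtin_shuffle (x, m);
  }
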
Index: gcc/config/i386/i386-protos.h
===================================================================
--- gcc/config/i386/i386-protos.h	(revision 178354)
+++ gcc/config/i386/i386-protos.h	(working copy)
@@ -118,6 +118,7 @@  extern bool ix86_expand_int_movcc (rtx[]
 extern bool ix86_expand_fp_movcc (rtx[]);
 extern bool ix86_expand_fp_vcond (rtx[]);
 extern bool ix86_expand_int_vcond (rtx[]);
+extern bool ix86_expand_vshuffle (rtx[]);
 extern void ix86_expand_sse_unpack (rtx[], bool, bool);
 extern bool ix86_expand_int_addcc (rtx[]);
 extern rtx ix86_expand_call (rtx, rtx, rtx, rtx, rtx, bool);
Index: gcc/config/i386/i386.c
===================================================================
--- gcc/config/i386/i386.c	(revision 178354)
+++ gcc/config/i386/i386.c	(working copy)
@@ -18693,6 +18693,96 @@  ix86_expand_int_vcond (rtx operands[])
   return true;
 }
 
+bool
+ix86_expand_vshuffle (rtx operands[])
+{
+  rtx target = operands[0];
+  rtx op0 = operands[1];
+  rtx mask = operands[2];
+  rtx mm, vt, cv0, t1;
+  enum machine_mode mode = GET_MODE (op0);
+  enum machine_mode maskmode = GET_MODE (mask);
+  enum machine_mode maskinner = GET_MODE_INNER (maskmode);
+  rtx vec[16];
+  int w, i, j;
+
+  gcc_assert ((TARGET_SSSE3 || TARGET_AVX) && GET_MODE_BITSIZE (mode) == 128);
+
+  op0 = force_reg (mode, op0);
+  mask = force_reg (maskmode, mask);
+
+  /* Number of elements in the vector.  */
+  w = GET_MODE_BITSIZE (maskmode) / GET_MODE_BITSIZE (maskinner);
+ 
+  /* mask = mask & {w-1, w-1, w-1,...} */
+  for (i = 0; i < w; i++)
+    vec[i] = GEN_INT (w - 1);
+
+  mm = gen_rtx_CONST_VECTOR (maskmode, gen_rtvec_v (w, vec));
+  mm = force_reg (maskmode, mm);
+
+  mask = gen_rtx_AND (maskmode, mask, mm);
+  
+  /* Convert mask to vector of chars.  */
+  mask = simplify_gen_subreg (V16QImode, mask, maskmode, 0);
+  mask = force_reg (V16QImode, mask);
+
+
+  /* Build a helper mask which we will use in pshufb
+     (v4si) --> {0,0,0,0, 4,4,4,4, 8,8,8,8, 12,12,12,12}
+     (v8hi) --> {0,0, 2,2, 4,4, 6,6, ...}
+     ...  */
+  for (i = 0; i < w; i++)
+    for (j = 0; j < 16/w; j++)
+      vec[i*(16/w)+j] = GEN_INT (i*16/w);
+
+  vt = gen_rtx_CONST_VECTOR (V16QImode, gen_rtvec_v (16, vec));
+  vt = force_reg (V16QImode, vt);
+  
+  t1 = gen_reg_rtx (V16QImode);
+  emit_insn (gen_ssse3_pshufbv16qi3 (t1, mask, vt));
+  mm = t1;
+
+  /* MM now contains something like
+     mm = {m[0], .., m[0], m[k], .., m[k], ... }, where
+     m[i] is the index of the element in the vector we are
+     selecting from.
+
+     Convert it into byte positions by computing
+     mm = mm * {16/w, 16/w, ...}
+     mm = mm + {0,1,..,16/w-1, 0,1,..,16/w-1, ...}  */
+  for (i = 0; i < 16; i++)
+    vec[i] = GEN_INT (16/w);
+
+  cv0 = gen_rtx_CONST_VECTOR (V16QImode, gen_rtvec_v (16, vec));
+  cv0 = force_reg (V16QImode, cv0);
+  mm = gen_rtx_MULT (V16QImode, mm, cv0);
+
+  for (i = 0; i < w; i++)
+    for (j = 0; j < 16/w; j++)
+      vec[i*(16/w)+j] = GEN_INT (j);
+
+  cv0 = gen_rtx_CONST_VECTOR (V16QImode, gen_rtvec_v (16, vec));
+  cv0 = force_reg (V16QImode, cv0);
+  mm = gen_rtx_PLUS (V16QImode, mm, cv0);
+  mm = force_reg (V16QImode, mm);
+
+  t1 = gen_reg_rtx (V16QImode);
+  
+  /* Convert OP0 to vector of chars.  */
+  op0 = simplify_gen_subreg (V16QImode, op0, mode, 0);
+  op0 = force_reg (V16QImode, op0);
+  emit_insn (gen_ssse3_pshufbv16qi3 (t1, op0, mm));
+  
+  /* Convert it back from vector of chars to the original mode.  */
+  t1 = simplify_gen_subreg (mode, t1, V16QImode, 0);
+  
+  emit_insn (gen_rtx_SET (VOIDmode, target, t1));
+
+  return true;
+}
+
 /* Unpack OP[1] into the next wider integer vector type.  UNSIGNED_P is
    true if we should do zero extension, else sign extension.  HIGH_P is
    true if we want the N/2 high elements, else the low elements.  */
@@ -30911,6 +31001,9 @@  struct expand_vec_perm_d
 
 static bool expand_vec_perm_1 (struct expand_vec_perm_d *d);
 static bool expand_vec_perm_broadcast_1 (struct expand_vec_perm_d *d);
+static int extract_vec_perm_cst (struct expand_vec_perm_d *, tree);
+static bool ix86_vectorize_builtin_vec_perm_ok (tree vec_type, tree mask);
+
 
 /* Get a vector mode of the same size as the original but with elements
    twice as wide.  This is only guaranteed to apply to integral vectors.  */
@@ -34576,10 +34669,10 @@  ix86_vectorize_builtin_vec_perm_ok (tree
 
   vec_mask = extract_vec_perm_cst (&d, mask);
 
-  /* This hook is cannot be called in response to something that the
-     user does (unlike the builtin expander) so we shouldn't ever see
-     an error generated from the extract.  */
-  gcc_assert (vec_mask > 0 && vec_mask <= 3);
+  /* Check whether the mask can be applied to the vector type.  */
+  if (vec_mask < 0 || vec_mask > 3)
+    return false;
+  
   one_vec = (vec_mask != 3);
 
   /* Implementable with shufps or pshufd.  */
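
To make the byte-mask arithmetic in ix86_expand_vshuffle easier to
follow, here is a scalar C model of the same computation for a V4SI
shuffle (purely illustrative; the names are mine, and the real code
does all of this with vector constants and two pshufb instructions):

  void
  model_byte_mask (void)
  {
    int w = 4;                      /* V4SI: 4 elements of 16/w = 4 bytes.  */
    int mask[4] = { 2, 0, 3, 1 };   /* indices already reduced mod w
                                       by the AND step  */
    unsigned char byte_mask[16];
    int i;

    for (i = 0; i < 16; i++)
      {
        int e = mask[i / (16 / w)];     /* m[k] replicated 16/w times;
                                           the job of the first pshufb  */
        byte_mask[i] = e * (16 / w)     /* scale the index to a byte offset  */
                       + i % (16 / w);  /* plus the byte within the element  */
      }

    /* byte_mask == { 8,9,10,11, 0,1,2,3, 12,13,14,15, 4,5,6,7 },
       which the second pshufb applies to the data vector.  */
    (void) byte_mask;
  }
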
Index: gcc/tree-ssa-operands.c
===================================================================
--- gcc/tree-ssa-operands.c	(revision 178354)
+++ gcc/tree-ssa-operands.c	(working copy)
@@ -943,6 +943,7 @@  get_expr_operands (gimple stmt, tree *ex
 
     case COND_EXPR:
     case VEC_COND_EXPR:
+    case VEC_SHUFFLE_EXPR:
       get_expr_operands (stmt, &TREE_OPERAND (expr, 0), uflags);
       get_expr_operands (stmt, &TREE_OPERAND (expr, 1), uflags);
       get_expr_operands (stmt, &TREE_OPERAND (expr, 2), uflags);