Patchwork RFC: simd enabled functions (omp declare simd / elementals)

Submitter Aldy Hernandez
Date Nov. 1, 2013, 3:04 a.m.
Message ID <52731A4D.6020402@redhat.com>
Permalink /patch/287694/
State New

Comments

Aldy Hernandez - Nov. 1, 2013, 3:04 a.m.
Hello gentlemen.  I'm CCing all of you, because each of you can provide 
valuable feedback to various parts of the compiler which I touch.  I 
have sprinkled love notes with your names throughout the post :).

This is a patch against the gomp4 branch.  It provides initial support 
for simd-enabled functions which are "#pragma omp declare simd" in the 
OpenMP world and elementals in Cilk Plus nomenclature.  The parsing bits 
for OpenMP are already in trunk, but they are silently ignored.  This 
patch aims to remedy the situation.  The Cilk Plus parsing bits, OTOH, 
are not ready, but could trivially be adapted to use this infrastructure 
(see below).

I would like to at least get this into the gomp4 branch for now, because 
I am accumulating far too many changes locally.

The main idea is that for a simd annotated function, we can create one 
or more cloned vector variants of a scalar function that can later be 
used by the vectorizer.

For a simple example with multiple returns...

#pragma omp declare simd simdlen(4) notinbranch
int foo (int a, int b)
{
   if (a == b)
     return 555;
   else
     return 666;
}

...we would generate with this patch (unoptimized):

foo.simdclone.0 (vector(4) int simd.4, vector(4) int simd.5)
{
   unsigned int iter.6;
   int b.3[4];
   int a.2[4];
   int retval.1[4];
   int _3;
   int _5;
   int _6;
   vector(4) int _7;

   <bb 2>:
   a.2 = VIEW_CONVERT_EXPR<int[4]>(simd.4);
   b.3 = VIEW_CONVERT_EXPR<int[4]>(simd.5);
   iter.6_12 = 0;

   <bb 3>:
   # iter.6_9 = PHI <iter.6_12(2), iter.6_14(6)>
   _5 = a.2[iter.6_9];
   _6 = b.3[iter.6_9];
   if (_5 == _6)
     goto <bb 5>;
   else
     goto <bb 4>;

   <bb 4>:

   <bb 5>:
   # _3 = PHI <555(3), 666(4)>
   retval.1[iter.6_9] = _3;
   iter.6_14 = iter.6_9 + 1;
   if (iter.6_14 < 4)
     goto <bb 6>;
   else
     goto <bb 7>;

   <bb 6>:
   goto <bb 3>;

   <bb 7>:
   _7 = VIEW_CONVERT_EXPR<vector(4) int>(retval.1);
   return _7;

}
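
For readers less fluent in GIMPLE, the clone above is equivalent to the
following hand-written C sketch (illustrative only; the function name is
invented and plain arrays stand in for the vector(4) int parameters):

```c
/* Hand-written C equivalent of foo.simdclone.0 above: each lane runs
   the scalar body, and the per-lane results form the returned vector.  */
static void
foo_simdclone_sketch (const int a[4], const int b[4], int ret[4])
{
  for (unsigned int i = 0; i < 4; i++)
    ret[i] = (a[i] == b[i]) ? 555 : 666;
}
```

The GIMPLE dump is just this loop with the VIEW_CONVERT_EXPRs moving the
lanes between vector registers and the per-lane arrays.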

The new loop is properly created and annotated with 
loop->force_vect=true and loop->safelen set.

A possible use may be:

int array[1000];
void bar ()
{
   int i;
   for (i=0; i < 1000; ++i)
     array[i] = foo(i, 123);
}

In which case, we would use the simd clone if available:

bar ()
{
   vector(4) int vect_cst_.21;
   vector(4) int vect_i_6.20;
   vector(4) int * vectp_array.19;
   vector(4) int * vectp_array.18;
   vector(4) int vect_cst_.17;
   vector(4) int vect__4.16;
   vector(4) int vect_vec_iv_.15;
   vector(4) int vect_cst_.14;
   vector(4) int vect_cst_.13;
   int stmp_var_.12;
   int i;
   unsigned int ivtmp_1;
   int _4;
   unsigned int ivtmp_7;
   unsigned int ivtmp_20;
   unsigned int ivtmp_21;

   <bb 2>:
   vect_cst_.13_8 = { 0, 1, 2, 3 };
   vect_cst_.14_2 = { 4, 4, 4, 4 };
   vect_cst_.17_13 = { 123, 123, 123, 123 };
   vectp_array.19_15 = &array;
   vect_cst_.21_5 = { 1, 1, 1, 1 };
   goto <bb 4>;

   <bb 3>:

   <bb 4>:
   # i_9 = PHI <i_6(3), 0(2)>
   # ivtmp_1 = PHI <ivtmp_7(3), 1000(2)>
   # vect_vec_iv_.15_11 = PHI <vect_vec_iv_.15_12(3), vect_cst_.13_8(2)>
   # vectp_array.18_16 = PHI <vectp_array.18_17(3), vectp_array.19_15(2)>
   # ivtmp_20 = PHI <ivtmp_21(3), 0(2)>
   vect_vec_iv_.15_12 = vect_vec_iv_.15_11 + vect_cst_.14_2;
   vect__4.16_14 = foo.simdclone.0 (vect_vec_iv_.15_11, vect_cst_.17_13);
   _4 = 0;
   MEM[(int *)vectp_array.18_16] = vect__4.16_14;
   vect_i_6.20_19 = vect_vec_iv_.15_11 + vect_cst_.21_5;
   i_6 = i_9 + 1;
   ivtmp_7 = ivtmp_1 - 1;
   vectp_array.18_17 = vectp_array.18_16 + 16;
   ivtmp_21 = ivtmp_20 + 1;
   if (ivtmp_21 < 250)
     goto <bb 3>;
   else
     goto <bb 5>;

   <bb 5>:
   return;

}

That's the idea.

Some of the ABI issues still need to be resolved (mangling for avx-512, 
what to do with non x86 architectures, what (if any) default clones will 
be created when no vector length is specified, etc etc), but the main 
functionality can be seen above.

Uniform and linear parameters (which are passed as scalars) are still 
not handled.  Also, Jakub mentioned that with the current vectorizer we 
probably can't make good use of the inbranch/masked clones.  I have a 
laundry list of missing things prepended by // FIXME if anyone is curious.

I'd like some feedback from y'all in your respective areas, since this 
touches a few places besides OpenMP.  For instance...

[Honza] Where do you suggest I place a list of simd clones for a 
particular (scalar) function?  Right now I have added a simdclone_of 
field in cgraph_node and am (temporarily) serially scanning all 
functions in get_simd_clone().  This is obviously inefficient.  I didn't 
know whether to use the current next_sibling_clone/etc fields or create 
my own.  I tried using clone_of, and that caused some havoc so I'd like 
some feedback.

[Martin] I have adapted the ipa_parm_adjustment infrastructure to allow 
adding new arguments out of the blue like you mentioned was missing in 
ipa-prop.h.  I have also added support for creating vectors of 
arguments.  Could you take a look at my changes to ipa-prop.[ch]?

[Martin] I need to add new arguments in the case of inbranch clones, 
which add an additional vector with a mask as the last argument:  For 
the following:

#pragma omp declare simd simdlen(4) inbranch
int foo (int a)
{
   return a + 1234;
}

...we would generate a clone with:

vector(4) int
foo.simdclone.0 (vector(4) int simd.4, vector(4) int mask.5)

I thought it best to enhance ipa_modify_formal_parameters() and 
associated machinery than to add the new argument ad-hoc.  We already 
have enough ways of doing tree and cgraph versioning in the compiler ;-).
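
For illustration, the intended lane semantics of that mask argument can
be sketched in plain C (the exact masking convention here is an
assumption, not something the patch dump shows; arrays again stand in
for the vector types):

```c
/* Hedged sketch of the inbranch clone's semantics: the scalar body
   runs only in lanes whose mask element is nonzero; the remaining
   result lanes are left untouched.  */
static void
foo_inbranch_sketch (const int a[4], const int mask[4], int ret[4])
{
  for (unsigned int i = 0; i < 4; i++)
    if (mask[i])
      ret[i] = a[i] + 1234;
}
```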

[Richi] I would appreciate feedback on the vectorizer and the 
infrastructure as a whole.  Do keep in mind that this is a work in 
progress :).

[Balaji] This patch would provide the infrastructure that can be used by 
the Cilk Plus elementals.  When this is complete, all that would be 
missing is the parser.  You would have to tag the original function with 
"omp declare simd" and "cilk plus elemental" attributes.  See 
simd_clone_clauses_extract.

[Jakub/rth]: As usual, valuable feedback on OpenMP and everything else 
is greatly appreciated.

Oh yeah, there are many more changes that would ideally be needed in the 
vectorizer.

Fire away!
gcc/ChangeLog.elementals

	* Makefile.in (omp-low.o): Depend on PRETTY_PRINT_H and IPA_PROP_H.
	* tree-vect-stmts.c (vectorizable_call): Allow > 3 arguments when
	a SIMD clone may be available.
	(vectorizable_function): Use SIMD clone if available.
	* ipa-cp.c (determine_versionability): Nodes with SIMD clones are
	not versionable.
	* ggc.h (ggc_alloc_cleared_simd_clone_stat): New.
	* cgraph.h (enum linear_stride_type): New.
	(struct simd_clone_arg): New.
	(struct simd_clone): New.
	(struct cgraph_node): Add simdclone and simdclone_of fields.
	(get_simd_clone): Protoize.
	* cgraph.c (get_simd_clone): New.
	Add `has_simd_clones' field.
	* ipa-cp.c (determine_versionability): Disallow functions with
	simd clones.
	* ipa-prop.h (ipa_sra_modify_function_body): Protoize.
	(sra_ipa_modify_expr): Same.
	(struct ipa_parm_adjustment): Add new_arg_prefix and new_param
	fields.  Document their use.
	* ipa-prop.c (ipa_modify_formal_parameters): Handle creating brand
	new parameters and minor cleanups.
	* omp-low.c: Add new pass_omp_simd_clone support code.
	(make_pass_omp_simd_clone): New.
	(pass_data_omp_simd_clone): Declare.
	(class pass_omp_simd_clone): Declare.
	(vecsize_mangle): New.
	(ipa_omp_simd_clone): New.
	(simd_clone_clauses_extract): New.
	(simd_clone_compute_base_data_type): New.
	(simd_clone_compute_vecsize_and_simdlen): New.
	(simd_clone_create): New.
	(simd_clone_adjust_return_type): New.
	(simd_clone_adjust_return_types): New.
	(simd_clone_adjust): New.
	(simd_clone_init_simd_arrays): New.
	(ipa_simd_modify_function_body): New.
	(simd_clone_mangle): New.
	(simd_clone_struct_alloc): New.
	(simd_clone_struct_copy): New.
	(class argno_map): New.
	(argno_map::argno_map(tree)): New.
	(argno_map::~argno_map): New.
	(argno_map::operator []): New.
	(argno_map::length): New.
	(expand_simd_clones): New.
	(create_tmp_simd_array): New.
	* tree.h (OMP_CLAUSE_LINEAR_VARIABLE_STRIDE): New.
	* tree-core.h (OMP_CLAUSE_LINEAR_VARIABLE_STRIDE): Document.
	* tree-pass.h (make_pass_omp_simd_clone): New.
	* passes.def (pass_omp_simd_clone): New.
	* target.def: Define new hook prefix "TARGET_CILKPLUS_".
	(default_vecsize_mangle): New.
	(vecsize_for_mangle): New.
	* doc/tm.texi.in: Add placeholder for
	TARGET_CILKPLUS_DEFAULT_VECSIZE_MANGLE and
	TARGET_CILKPLUS_VECSIZE_FOR_MANGLE.
	* tree-sra.c (sra_ipa_modify_expr): Remove static modifier.
	(ipa_sra_modify_function_body): Same.
	* tree.h (OMP_CLAUSE_LINEAR_VARIABLE_STRIDE): Define.
	* doc/tm.texi: Regenerate.
	* config/i386/i386.c (ix86_cilkplus_default_vecsize_mangle): New.
	(ix86_cilkplus_vecsize_for_mangle): New.
	(TARGET_CILKPLUS_DEFAULT_VECSIZE_MANGLE): New.
	(TARGET_CILKPLUS_VECSIZE_FOR_MANGLE): New.

Jakub Jelinek - Nov. 1, 2013, 10:57 a.m.
Hi!

On Thu, Oct 31, 2013 at 10:04:45PM -0500, Aldy Hernandez wrote:
> Hello gentlemen.  I'm CCing all of you, because each of you can
> provide valuable feedback to various parts of the compiler which I
> touch.  I have sprinkled love notes with your names throughout the
> post :).

Thanks for working on this.

> 	* Makefile.in (omp-low.o): Depend on PRETTY_PRINT_H and IPA_PROP_H.

You aren't changing Makefile.in anymore ;).

> +/* Given a NODE, return a compatible SIMD clone returning `vectype'.
> +   If none found, NULL is returned.  */
> +
> +struct cgraph_node *
> +get_simd_clone (struct cgraph_node *node, tree vectype)
> +{
> +  if (!node->has_simd_clones)
> +    return NULL;
> +
> +  /* FIXME: What to do with linear/uniform arguments.  */
> +
> +  /* FIXME: Nasty kludge until we figure out where to put the clone
> +     list-- perhaps, next_sibling_clone/prev_sibling_clone in
> +     cgraph_node ??.  */
> +  struct cgraph_node *t;
> +  FOR_EACH_FUNCTION (t)
> +    if (t->simdclone_of == node
> +	/* No inbranch vectorization for now.  */
> +	&& !t->simdclone->inbranch
> +	&& types_compatible_p (TREE_TYPE (TREE_TYPE (t->symbol.decl)),
> +			       vectype))
> +      break;
> +  return t;
> +}

You definitely need some quick way to find the simd clones, and you really
can't do this here anyway, because you have to check all arguments, return
type might be missing etc., so it needs to be done by vectorizable_call
itself.

> +  /* If this is a SIMD clone, this points to the SIMD specific
> +     information for it.  */
> +  struct simd_clone *simdclone;
> +
> +  /* If this is a SIMD clone, this points to the original scalar
> +     function.  */
> +  struct cgraph_node *simdclone_of;

Can't you put this into the simd_clone structure, in order not to waste
memory for functions which don't have simd clones?  So, you'd use
t->simdclone && t->simdclone->clone_of == node or similar (if you need it at
all, I guess better is to add a struct cgraph_node *simd_clones;
and put the prev/next pointers in struct simd_clone).
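
A minimal sketch of that suggested layout (all struct and field names
here are hypothetical stand-ins, not the real cgraph fields):

```c
#include <stddef.h>

/* Hypothetical layout: the scalar function's node keeps the head of a
   linked list of its simd clones, so lookup walks only that short list
   instead of FOR_EACH_FUNCTION over the whole program.  */
struct simd_clone_stub
{
  int simdlen;                  /* e.g. 4 or 8 */
  int inbranch;                 /* masked variant?  */
  struct simd_clone_stub *next; /* next clone of the same scalar fn */
};

struct scalar_node_stub
{
  struct simd_clone_stub *simd_clones;  /* head of the clone list */
};

/* Return the first notinbranch clone with the requested simdlen,
   or NULL if none exists.  */
static struct simd_clone_stub *
find_simd_clone_stub (struct scalar_node_stub *node, int simdlen)
{
  for (struct simd_clone_stub *c = node->simd_clones; c; c = c->next)
    if (!c->inbranch && c->simdlen == simdlen)
      return c;
  return NULL;
}
```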

Let me start with two testcases:

test1.c:
int array[1000];

#pragma omp declare simd simdlen(4) notinbranch
#pragma omp declare simd simdlen(4) notinbranch uniform(b)
#pragma omp declare simd simdlen(8) notinbranch
#pragma omp declare simd simdlen(8) notinbranch uniform(b)
__attribute__((noinline)) int
foo (int a, int b)
{
  if (a == b)
    return 5;
  else
    return 6;
}

void
bar ()
{
  int i;
  for (i = 0; i < 1000; ++i)
    array[i] = foo (i, 123);
  for (i = 0; i < 1000; ++i)
    array[i] = foo (i, array[i]);
}

test2.c:
int array[1000];

#pragma omp declare simd simdlen(4) notinbranch aligned(a:16) uniform(a) linear(b)
#pragma omp declare simd simdlen(4) notinbranch aligned(a:32) uniform(a) linear(b)
#pragma omp declare simd simdlen(8) notinbranch aligned(a:16) uniform(a) linear(b)
#pragma omp declare simd simdlen(8) notinbranch aligned(a:32) uniform(a) linear(b)
__attribute__((noinline)) void
foo (int *a, int b, int c)
{
  a[b] = c;
}

void
bar ()
{
  int i;
  for (i = 0; i < 1000; ++i)
    foo (array, i, i * array[i]);
}

On test1.c -O3 -fopenmp {,-mavx,-mavx2}, you can see:
test1.c: In function ‘foo.simdclone.0’:
test1.c:8:1: note: The ABI for passing parameters with 32-byte alignment has changed in GCC 4.6
 foo (int a, int b)
 ^
test1.c:8:1: warning: AVX vector argument without AVX enabled changes the ABI [enabled by default]
and the manglings are without -mavx{,2}
_ZGVxN8vu_foo
_ZGVxN8vv_foo
_ZGVxN4vu_foo
_ZGVxN4vv_foo
while with it _ZGVy* (surprisingly not Y).  As discussed earlier, we don't
want to decide which clones to create based on compiler options, we probably
want to create (unless told by Cilk+ processor clauses otherwise) entry
points for all the ABIs, just try to create the ones not matching compiler
options as small as possible, and use target attribute for those too
and say for _ZGVxN8v?_foo we need to pass the vector arguments in two
vector(4) int parameters rather than one vector(8) as it is done now (that
is why the above warnings and notes are printed).  But you know this
already... ;).
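
For concreteness, those manglings decompose mechanically.  A hedged C
sketch of a decoder follows; the letter meanings ('N' for notinbranch,
'v' for a vector argument, 'u' for a uniform one) are read off this
thread's examples only, not quoted from the vector ABI document:

```c
#include <string.h>

/* Decoder for manglings of the shape seen above:
   _ZGV <isa letter> <mask letter> <simdlen> <one letter per arg> _ <name>  */
struct vfn_mangling
{
  char isa;          /* 'x' above; 'y' with -mavx per this thread */
  char mask;         /* 'N' in all the notinbranch clones above */
  int simdlen;       /* 4 or 8 above */
  const char *args;  /* one letter per argument, up to the '_' */
  const char *name;  /* the scalar function's name */
};

static int
decode_vfn (const char *s, struct vfn_mangling *out)
{
  if (strncmp (s, "_ZGV", 4) != 0)
    return 0;
  s += 4;
  out->isa = *s++;
  out->mask = *s++;
  out->simdlen = 0;
  while (*s >= '0' && *s <= '9')
    out->simdlen = out->simdlen * 10 + (*s++ - '0');
  out->args = s;
  const char *sep = strchr (s, '_');
  if (sep == NULL)
    return 0;
  out->name = sep + 1;
  return 1;
}
```

So _ZGVxN4vu_foo reads: ISA class 'x', notinbranch, simdlen 4, a vector
first argument and a uniform second one, for scalar function foo.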

The second testcase currently ICEs I guess during simd cloning, just wanted
to make it clear that while simd clones without any arguments probably don't
make any sense (other than const, but those really should be hoisted out of
the loop much earlier), simd clones with no return value make sense.

> --- a/gcc/tree-vect-stmts.c
> +++ b/gcc/tree-vect-stmts.c
> @@ -1688,6 +1688,16 @@ tree
>  vectorizable_function (gimple call, tree vectype_out, tree vectype_in)
>  {
>    tree fndecl = gimple_call_fndecl (call);
> +  struct cgraph_node *node = cgraph_get_node (fndecl);
> +
> +  if (node->has_simd_clones)
> +    {
> +      struct cgraph_node *clone = get_simd_clone (node, vectype_out);
> +      if (clone)
> +	return clone->symbol.decl;
> +      /* Fall through in case we ever add support for
> +	 non-built-ins.  */
> +    }

I think it is a bad idea to do this in vectorizable_function, as I said
earlier keying this on the result type won't work for functions returning
void, and more importantly, you really need access to detailed info about
all the arguments for finding out if you have a suitable clone, and
as test1.c shows, also for selection of the best of the clones if more than
one is suitable.  In test1.c, in the first loop the uniform variants are
better than the ones without a uniform second argument, though if the uniform
ones would be missing, then you could use even the ones with vv arguments,
because you can just pass a vector constant (or broadcast scalar element
into the vector).  Similarly, in test2.c, you want to check
get_pointer_alignment of the pointer, and if it is >= 32, you can use
the clones with aligned(:32) (and with (:16), but (:32) is supposedly
better), if it is >= 16, you can only use the clones with (:16), if it is <
16, you can't use anything.

> @@ -1758,10 +1768,12 @@ vectorizable_call (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
>    vectype_in = NULL_TREE;
>    nargs = gimple_call_num_args (stmt);
>  
> -  /* Bail out if the function has more than three arguments, we do not have
> -     interesting builtin functions to vectorize with more than two arguments
> -     except for fma.  No arguments is also not good.  */
> -  if (nargs == 0 || nargs > 3)
> +  /* Bail out if the function has more than three arguments.  We do
> +     not have interesting builtin functions to vectorize with more
> +     than two arguments except for fma (unless we have SIMD clones).
> +     No arguments is also not good.  */
> +  struct cgraph_node *node = cgraph_get_node (gimple_call_fndecl (stmt));
> +  if (nargs == 0 || (!node->has_simd_clones && nargs > 3))
>      return false;

In the end, I think for the vectorization of elemental function calls
it might be best to write a new function, vectorizable_simd_clone_call
or similar, because if we add all the support for into vectorizable_call,
it might be unmaintainable.  Normal vectorizable_calls rely on all the input
arguments being of the same type, the return type must not be void,
but the return type doesn't have to be the same as argument types (which
means NARROW, NONE or WIDEN kind of expansion).
The simd clones can have arbitrary argument types, so the concept of
narrowing/widening etc. doesn't work well in that case.  So I'd probably go
for starting with copy of vectorizable_call, call it at the same spots as
vectorizable_call (after it), then remove the same argument restrictions and
start updating it to do the analysis of suitable clones and some priority
mechanism on which simd clones are best (give bonus points for uniform
arguments if the argument is uniform, linear if it is linear, for highest
alignment, notinbranch/inbranch, simdlen, etc.).

So, e.g.
      /* We can only handle calls with arguments of the same type.  */
      if (rhs_type
          && !types_compatible_p (rhs_type, TREE_TYPE (op)))
        {
          if (dump_enabled_p ())
            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                             "argument types differ.\n");
          return false;
        }
shouldn't be done for the simd clones, vectype_in should go, vectype_out
should be renamed to vectype and set to first argument's? type if result is
void or unused.

      if (!vect_is_simple_use_1 (op, stmt, loop_vinfo, bb_vinfo,
                                 &def_stmt, &def, &dt[i], &opvectype))
        {
          if (dump_enabled_p ())
            dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                             "use not simple.\n");
          return false;
        }

(note, right now in the patch this is a buffer overflow for nargs > 3; for
simd clones you want to probably dynamically allocate the dt array,
supposedly also remember opvectype for each argument).
If opvectype above is NULL, then that argument can be passed to
uniform arguments (or of course broadcasted into vector to vector
arguments).  To find out if for opvectype != NULL you can pass it to
linear argument, supposedly you could call simple_iv, and compare the
iv.step computed by it if it returned true with the linear step.
To check alignment, supposedly if it is uniform (opvectype == NULL),
you'd just call get_pointer_alignment if it is a pointer, otherwise
maybe give up for now?  I mean, if it is e.g. linear, it would be harder,
you'd need to know if peeling for alignment will be needed and only if
known not to be needed you could simple_iv it and see if base has right
get_pointer_alignment and step multiplied by vectorization factor
keeps the alignment right.  Though, if you need to insert more than one
call of the simdclone, you'd need to verify that it is right for all the
calls.  Right now we have no way to express conditional calls in the
ifconverted IL, so right now we'll punt on all conditional calls, but
at least the vectorizer can fall back (with much lower priority) to
inbranch elementals if notinbranch doesn't exist (just pass all ones mask).

	Jakub
Richard Guenther - Nov. 4, 2013, 10:37 a.m.
On Fri, Nov 1, 2013 at 4:04 AM, Aldy Hernandez <aldyh@redhat.com> wrote:
> Hello gentlemen.  I'm CCing all of you, because each of you can provide
> valuable feedback to various parts of the compiler which I touch.  I have
> sprinkled love notes with your names throughout the post :).
>
> This is a patch against the gomp4 branch.  It provides initial support for
> simd-enabled functions which are "#pragma omp declare simd" in the OpenMP
> world and elementals in Cilk Plus nomenclature.  The parsing bits for OpenMP
> are already in trunk, but they are silently ignored.  This patch aims to
> remedy the situation.  The Cilk Plus parsing bits, OTOH, are not ready, but
> could trivially be adapted to use this infrastructure (see below).
>
> I would like to at least get this into the gomp4 branch for now, because I
> am accumulating far too many changes locally.
>
> The main idea is that for a simd annotated function, we can create one or
> more cloned vector variants of a scalar function that can later be used by
> the vectorizer.
>
> For a simple example with multiple returns...
>
> #pragma omp declare simd simdlen(4) notinbranch
> int foo (int a, int b)
> {
>   if (a == b)
>     return 555;
>   else
>     return 666;
> }
>
> ...we would generate with this patch (unoptimized):

Just a quick question, queueing the thread for later review (aww, queue
back at >100 threads again - I can't get any work done anymore :().

What does #pragma omp declare simd guarantee about memory
side-effects and memory use in the function?  That is, unless the
function can be safely annotated with the 'const' attribute the
whole thing is useless for the vectorizer.

Thanks,
Richard.

Jakub Jelinek - Nov. 4, 2013, 10:57 a.m.
Hi!

On Mon, Nov 04, 2013 at 11:37:19AM +0100, Richard Biener wrote:
> Just a quick question, queueing the thread for later review (aww, queue
> back at >100 threads again - I can't get any work done anymore :().
> 
> What does #pragma omp declare simd guarantee about memory
> side-effects and memory use in the function?  That is, unless the
> function can be safely annotated with the 'const' attribute the
> whole thing is useless for the vectorizer.

The main restriction is:

"The execution of the function or subroutine cannot have any side effects that would
alter its execution for concurrent iterations of a SIMD chunk."

There are other restrictions, omp declare simd functions can't call
setjmp/longjmp or throw, etc.

The functions certainly can't be in the general case annotated with the
const attribute, they can read and write memory, but the user is responsible
for making sure that the effects of running the function sequentially as
part of non-vectorized loop are the same as running the simd clone of that
as part of a vectorized loop with the given simdlen vectorization factor.
So it is certainly meant to be used by the vectorizer, after all, that is
its sole purpose.

	Jakub
Martin Jambor - Nov. 7, 2013, 4:09 p.m.
Hi,

On Thu, Oct 31, 2013 at 10:04:45PM -0500, Aldy Hernandez wrote:
> Hello gentlemen.  I'm CCing all of you, because each of you can
> provide valuable feedback to various parts of the compiler which I
> touch.  I have sprinkled love notes with your names throughout the
> post :).

sorry it took me so long; for various reasons out of my control I've
accumulated quite a backlog of email and tasks last week, and it took
me a lot of time to chew through it all.

...

> [Martin] I have adapted the ipa_parm_adjustment infrastructure to
> allow adding new arguments out of the blue like you mentioned was
> missing in ipa-prop.h.  I have also added support for creating
> vectors of arguments.  Could you take a look at my changes to
> ipa-prop.[ch]?

Sure, though I have only looked at ipa-* and tree-sra.c stuff.  I do
not have any real objections but would suggest a few amendments.  

I am glad this is becoming a useful infrastructure rather than just a
part of IPA-SRA.  Note that while ipa_combine_adjustments is not used
from anywhere and thus probably buggy anyway, it should in theory be
able to process new_param adjustments too.  Can you please at least
put a "not implemented" assert there?  (The reason is that the plan
still is to replace args_to_skip bitmaps in cgraphclones.c by
adjustments one day and we do need to combine clones.)

> 
> [Martin] I need to add new arguments in the case of inbranch clones,
> which add an additional vector with a mask as the last argument:
> For the following:
> 
> #pragma omp declare simd simdlen(4) inbranch
> int foo (int a)
> {
>   return a + 1234;
> }
> 
> ...we would generate a clone with:
> 
> vector(4) int
> foo.simdclone.0 (vector(4) int simd.4, vector(4) int mask.5)
> 
> I thought it best to enhance ipa_modify_formal_parameters() and
> associated machinery than to add the new argument ad-hoc.  We
> already have enough ways of doing tree and cgraph versioning in the
> compiler ;-).
> 

...

> gcc/ChangeLog.elementals
> 
> 	* Makefile.in (omp-low.o): Depend on PRETTY_PRINT_H and IPA_PROP_H.
> 	* tree-vect-stmts.c (vectorizable_call): Allow > 3 arguments when
> 	a SIMD clone may be available.
> 	(vectorizable_function): Use SIMD clone if available.
> 	* ipa-cp.c (determine_versionability): Nodes with SIMD clones are
> 	not versionable.
> 	* ggc.h (ggc_alloc_cleared_simd_clone_stat): New.
> 	* cgraph.h (enum linear_stride_type): New.
> 	(struct simd_clone_arg): New.
> 	(struct simd_clone): New.
> 	(struct cgraph_node): Add simdclone and simdclone_of fields.
> 	(get_simd_clone): Protoize.
> 	* cgraph.c (get_simd_clone): New.
> 	Add `has_simd_clones' field.
> 	* ipa-cp.c (determine_versionability): Disallow functions with
> 	simd clones.

(This looks like a repeated entry.)

...

> diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
> index c38ba82..faae080 100644
> --- a/gcc/ipa-cp.c
> +++ b/gcc/ipa-cp.c
> @@ -446,6 +446,13 @@ determine_versionability (struct cgraph_node *node)
>      reason = "not a tree_versionable_function";
>    else if (cgraph_function_body_availability (node) <= AVAIL_OVERWRITABLE)
>      reason = "insufficient body availability";
> +  else if (node->has_simd_clones)
> +    {
> +      /* Ideally we should clone the SIMD clones themselves and create
> +	 vector copies of them, so IPA-cp and SIMD clones can happily
> +	 coexist, but that may not be worth the effort.  */
> +      reason = "function has SIMD clones";
> +    }

Let's hope we will eventually fix this in some followup :-)


>  
>    if (reason && dump_file && !node->symbol.alias && !node->thunk.thunk_p)
>      fprintf (dump_file, "Function %s/%i is not versionable, reason: %s.\n",
> diff --git a/gcc/ipa-prop.c b/gcc/ipa-prop.c
> index 2fbc9d4..0c20dc6 100644
> --- a/gcc/ipa-prop.c
> +++ b/gcc/ipa-prop.c
> @@ -3361,24 +3361,18 @@ void
>  ipa_modify_formal_parameters (tree fndecl, ipa_parm_adjustment_vec adjustments,
>  			      const char *synth_parm_prefix)
>  {
> -  vec<tree> oparms, otypes;
> -  tree orig_type, new_type = NULL;
> -  tree old_arg_types, t, new_arg_types = NULL;
> -  tree parm, *link = &DECL_ARGUMENTS (fndecl);
> -  int i, len = adjustments.length ();
> -  tree new_reversed = NULL;
> -  bool care_for_types, last_parm_void;
> -
>    if (!synth_parm_prefix)
>      synth_parm_prefix = "SYNTH";
>  
> -  oparms = ipa_get_vector_of_formal_parms (fndecl);
> -  orig_type = TREE_TYPE (fndecl);
> -  old_arg_types = TYPE_ARG_TYPES (orig_type);
> +  vec<tree> oparms = ipa_get_vector_of_formal_parms (fndecl);
> +  tree orig_type = TREE_TYPE (fndecl);
> +  tree old_arg_types = TYPE_ARG_TYPES (orig_type);
>  
>    /* The following test is an ugly hack, some functions simply don't have any
>       arguments in their type.  This is probably a bug but well... */
> -  care_for_types = (old_arg_types != NULL_TREE);
> +  bool care_for_types = (old_arg_types != NULL_TREE);
> +  bool last_parm_void;
> +  vec<tree> otypes;
>    if (care_for_types)
>      {
>        last_parm_void = (TREE_VALUE (tree_last (old_arg_types))
> @@ -3395,13 +3389,20 @@ ipa_modify_formal_parameters (tree fndecl, ipa_parm_adjustment_vec adjustments,
>        otypes.create (0);
>      }
>  
> -  for (i = 0; i < len; i++)
> +  int len = adjustments.length ();
> +  tree *link = &DECL_ARGUMENTS (fndecl);
> +  tree new_arg_types = NULL;
> +  for (int i = 0; i < len; i++)
>      {
>        struct ipa_parm_adjustment *adj;
>        gcc_assert (link);
>  
>        adj = &adjustments[i];
> -      parm = oparms[adj->base_index];
> +      tree parm;
> +      if (adj->new_param)

I don't know what I was thinking when I invented copy_param and
remove_param as multiple flags rather than a single enum; I probably
wasn't thinking at all.  I can change it myself as a followup if you
have more pressing tasks now.  Meanwhile, can you gcc_checking_assert
that at most one flag is set at the appropriate places?

> +	parm = NULL;
> +      else
> +	parm = oparms[adj->base_index];
>        adj->base = parm;

I do not think it makes sense for new parameters to have a base which
is basically the old decl.  Do you have any reasons for not setting it
to NULL?

>  
>        if (adj->copy_param)
> @@ -3417,8 +3418,18 @@ ipa_modify_formal_parameters (tree fndecl, ipa_parm_adjustment_vec adjustments,
>  	  tree new_parm;
>  	  tree ptype;
>  

> -	  if (adj->by_ref)
> -	    ptype = build_pointer_type (adj->type);

Please add gcc_checking_assert (!adj->by_ref || adj->simdlen == 0)
here...

> +	  if (adj->simdlen)
> +	    {
> +	      /* If we have a nonzero simdlen but by_ref is true, we
> +		 want a vector of pointers.  Build the vector of
> +		 pointers here, not a pointer to a vector in the
> +		 adj->by_ref case below.  */
> +	      ptype = build_vector_type (adj->type, adj->simdlen);
> +	    }
> +	  else if (adj->by_ref)

...or remove this else and be able to build a pointer to the vector
if by_ref is true.

> +	    {
> +	      ptype = build_pointer_type (adj->type);
> +	    }
>  	  else
>  	    ptype = adj->type;
>  
> @@ -3427,8 +3438,9 @@ ipa_modify_formal_parameters (tree fndecl, ipa_parm_adjustment_vec adjustments,
>  
>  	  new_parm = build_decl (UNKNOWN_LOCATION, PARM_DECL, NULL_TREE,
>  				 ptype);
> -	  DECL_NAME (new_parm) = create_tmp_var_name (synth_parm_prefix);
> -
> +	  const char *prefix
> +	    = adj->new_param ? adj->new_arg_prefix : synth_parm_prefix;

Can we perhaps get rid of synth_parm_prefix then and just have
adj->new_arg_prefix?  It's not particularly important but this is
weird.


> +	  DECL_NAME (new_parm) = create_tmp_var_name (prefix);
>  	  DECL_ARTIFICIAL (new_parm) = 1;
>  	  DECL_ARG_TYPE (new_parm) = ptype;
>  	  DECL_CONTEXT (new_parm) = fndecl;
> @@ -3436,17 +3448,20 @@ ipa_modify_formal_parameters (tree fndecl, ipa_parm_adjustment_vec adjustments,
>  	  DECL_IGNORED_P (new_parm) = 1;
>  	  layout_decl (new_parm, 0);
>  
> -	  adj->base = parm;
> +	  if (adj->new_param)
> +	    adj->base = new_parm;

Again, shouldn't this be NULL?

> +	  else
> +	    adj->base = parm;
>  	  adj->reduction = new_parm;
>  
>  	  *link = new_parm;
> -
>  	  link = &DECL_CHAIN (new_parm);
>  	}
>      }
>  
>    *link = NULL_TREE;
>  
> +  tree new_reversed = NULL;
>    if (care_for_types)
>      {
>        new_reversed = nreverse (new_arg_types);
> @@ -3464,6 +3479,7 @@ ipa_modify_formal_parameters (tree fndecl, ipa_parm_adjustment_vec adjustments,
>       Exception is METHOD_TYPEs must have THIS argument.
>       When we are asked to remove it, we need to build new FUNCTION_TYPE
>       instead.  */
> +  tree new_type = NULL;
>    if (TREE_CODE (orig_type) != METHOD_TYPE
>         || (adjustments[0].copy_param
>  	  && adjustments[0].base_index == 0))
> @@ -3489,7 +3505,7 @@ ipa_modify_formal_parameters (tree fndecl, ipa_parm_adjustment_vec adjustments,
>  
>    /* This is a new type, not a copy of an old type.  Need to reassociate
>       variants.  We can handle everything except the main variant lazily.  */
> -  t = TYPE_MAIN_VARIANT (orig_type);
> +  tree t = TYPE_MAIN_VARIANT (orig_type);
>    if (orig_type != t)
>      {
>        TYPE_MAIN_VARIANT (new_type) = t;
> diff --git a/gcc/ipa-prop.h b/gcc/ipa-prop.h
> index 48634d2..8d7d9b9 100644
> --- a/gcc/ipa-prop.h
> +++ b/gcc/ipa-prop.h
> @@ -634,9 +634,10 @@ struct ipa_parm_adjustment
>       arguments.  */
>    tree alias_ptr_type;
>  
> -  /* The new declaration when creating/replacing a parameter.  Created by
> -     ipa_modify_formal_parameters, useful for functions modifying the body
> -     accordingly. */
> +  /* The new declaration when creating/replacing a parameter.  Created
> +     by ipa_modify_formal_parameters, useful for functions modifying
> +     the body accordingly.  For brand new arguments, this is the newly
> +     created argument.  */
>    tree reduction;

We should eventually rename this to new_decl or something, given that
this is not an SRA thing any more.  But that can be done later.

>  
>    /* New declaration of a substitute variable that we may use to replace all
> @@ -647,15 +648,36 @@ struct ipa_parm_adjustment
>       is NULL), this is going to be its nonlocalized vars value.  */
>    tree nonlocal_value;
>  
> +  /* If this is a brand new argument, this holds the prefix to be used
> +     for the DECL_NAME.  */
> +  const char *new_arg_prefix;
> +
>    /* Offset into the original parameter (for the cases when the new parameter
>       is a component of an original one).  */
>    HOST_WIDE_INT offset;
>  
> -  /* Zero based index of the original parameter this one is based on.  (ATM
> -     there is no way to insert a new parameter out of the blue because there is
> -     no need but if it arises the code can be easily exteded to do so.)  */
> +  /* Zero based index of the original parameter this one is based on.  */
>    int base_index;
>  
> +  /* If nonzero, the parameter is a vector of `type' with this many
> +     elements.  */
> +  int simdlen;
> +
> +  /* This is a brand new parameter.
> +
> +     For new parameters, base_index must be >= the number of
> +     DECL_ARGUMENTS in the function.  That is, new arguments will be
> +     the last arguments in the adjusted function.
> +
> +     ?? Perhaps we could redesign ipa_modify_formal_parameters() to
> +     reorganize argument position, thus allowing inserting of brand
> +     new arguments anywhere, but there is no use for this now.

Where does this requirement come from?  At least at the moment I
cannot see why ipa_modify_formal_parameters wouldn't be able to
reorder parameters as it is.  What breaks if the base_index of an
adjustment for a new parameter is zero or some nonsensical value?

> +
> +     Also, `type' should be set to the new type, `new_arg_prefix'
> +     should be set to the string prefix for the new DECL_NAME, and
> +     `reduction' will ultimately hold the newly created argument.  */
> +  unsigned new_param : 1;
> +
>    /* This new parameter is an unmodified parameter at index base_index. */
>    unsigned copy_param : 1;
>  
> @@ -697,5 +719,7 @@ void ipa_dump_param (FILE *, struct ipa_node_params *info, int i);
>  /* From tree-sra.c:  */
>  tree build_ref_for_offset (location_t, tree, HOST_WIDE_INT, tree,
>  			   gimple_stmt_iterator *, bool);
> +bool ipa_sra_modify_function_body (ipa_parm_adjustment_vec);
> +bool sra_ipa_modify_expr (tree *, bool, ipa_parm_adjustment_vec);
>  

Hm, if you can directly use these, I really think you should rename
them somehow so that their names do not contain SRA and move them to
ipa-prop.c.

Thanks for reviving this slightly moribund infrastructure and sorry
again for the delay,

Martin

Patch

diff --git a/gcc/cgraph.c b/gcc/cgraph.c
index 124ee0a..561527f 100644
--- a/gcc/cgraph.c
+++ b/gcc/cgraph.c
@@ -2998,4 +2998,29 @@  cgraph_get_body (struct cgraph_node *node)
   return true;
 }
 
+/* Given a NODE, return a compatible SIMD clone returning `vectype'.
+   If none found, NULL is returned.  */
+
+struct cgraph_node *
+get_simd_clone (struct cgraph_node *node, tree vectype)
+{
+  if (!node->has_simd_clones)
+    return NULL;
+
+  /* FIXME: What to do with linear/uniform arguments.  */
+
+  /* FIXME: Nasty kludge until we figure out where to put the clone
+     list-- perhaps, next_sibling_clone/prev_sibling_clone in
+     cgraph_node ??.  */
+  struct cgraph_node *t;
+  FOR_EACH_FUNCTION (t)
+    if (t->simdclone_of == node
+	/* No inbranch vectorization for now.  */
+	&& !t->simdclone->inbranch
+	&& types_compatible_p (TREE_TYPE (TREE_TYPE (t->symbol.decl)),
+			       vectype))
+      break;
+  return t;
+}
+
 #include "gt-cgraph.h"
diff --git a/gcc/cgraph.h b/gcc/cgraph.h
index afdeaba..c8d1830 100644
--- a/gcc/cgraph.h
+++ b/gcc/cgraph.h
@@ -248,6 +248,91 @@  struct GTY(()) cgraph_clone_info
   bitmap combined_args_to_skip;
 };
 
+enum linear_stride_type {
+  LINEAR_STRIDE_NO,
+  LINEAR_STRIDE_YES_CONSTANT,
+  LINEAR_STRIDE_YES_VARIABLE
+};
+
+/* Function arguments in the original function of a SIMD clone.
+   Supplementary data for `struct simd_clone'.  */
+
+struct GTY(()) simd_clone_arg {
+  /* Original function argument as it originally existed in
+     DECL_ARGUMENTS.  */
+  tree orig_arg;
+
+  /* If argument is a vector, this holds the vector version of
+     orig_arg that after adjusting the argument types will live in
+     DECL_ARGUMENTS.  Otherwise, this is NULL.
+
+     This basically holds:
+       vector(simdlen) __typeof__(orig_arg) new_arg.  */
+  tree vector_arg;
+
+  /* If argument is a vector, this holds the array where the simd
+     argument is held while executing the simd clone function.  This
+     is a local variable in the cloned function.  Its content is
+     copied from vector_arg upon entry to the clone.
+
+     This basically holds:
+       __typeof__(orig_arg) simd_array[simdlen].  */
+  tree simd_array;
+
+  /* A SIMD clone's argument can be either linear (constant or
+     variable), uniform, or vector.  If the argument is neither linear
+     or uniform, the default is vector.  */
+
+  /* If the linear stride is a constant, `linear_stride' is
+     LINEAR_STRIDE_YES_CONSTANT, and `linear_stride_num' holds
+     the numeric stride.
+
+     If the linear stride is variable, `linear_stride' is
+     LINEAR_STRIDE_YES_VARIABLE, and `linear_stride_num' contains
+     the function argument containing the stride (as an index into the
+     function arguments starting at 0).
+
+     Otherwise, `linear_stride' is LINEAR_STRIDE_NO and
+     `linear_stride_num' is unused.  */
+  enum linear_stride_type linear_stride;
+  unsigned HOST_WIDE_INT linear_stride_num;
+
+  /* Variable alignment if available, otherwise 0.  */
+  unsigned int alignment;
+
+  /* True if variable is uniform.  */
+  unsigned int uniform : 1;
+};
+
+/* Specific data for a SIMD function clone.  */
+
+struct GTY(()) simd_clone {
+  /* Number of words in the SIMD lane associated with this clone.  */
+  unsigned int simdlen;
+
+  /* Number of annotated function arguments in `args'.  This is
+     usually the number of named arguments in FNDECL.  */
+  unsigned int nargs;
+
+  /* Max hardware vector size in bits.  */
+  unsigned int hw_vector_size;
+
+  /* The mangling character for a given vector size.  This is used
+     to determine the ISA mangling bit as specified in the Intel
+     Vector ABI.  */
+  unsigned char vecsize_mangle;
+
+  /* True if this is the masked, in-branch version of the clone,
+     otherwise false.  */
+  unsigned int inbranch : 1;
+
+  /* True if this is a Cilk Plus variant.  */
+  unsigned int cilk_elemental : 1;
+
+  /* Annotated function arguments for the original function.  */
+  struct simd_clone_arg GTY((length ("%h.nargs"))) args[1];
+};
+
 
 /* The cgraph data structure.
    Each function decl has assigned cgraph_node listing callees and callers.  */
@@ -282,6 +367,14 @@  struct GTY(()) cgraph_node {
   /* Declaration node used to be clone of. */
   tree former_clone_of;
 
+  /* If this is a SIMD clone, this points to the SIMD specific
+     information for it.  */
+  struct simd_clone *simdclone;
+
+  /* If this is a SIMD clone, this points to the original scalar
+     function.  */
+  struct cgraph_node *simdclone_of;
+
   /* Interprocedural passes scheduled to have their transform functions
      applied next time we execute local pass on them.  We maintain it
      per-function in order to allow IPA passes to introduce new functions.  */
@@ -323,6 +416,8 @@  struct GTY(()) cgraph_node {
   /* ?? We should be able to remove this.  We have enough bits in
      cgraph to calculate it.  */
   unsigned tm_clone : 1;
+  /* True if this function has SIMD clones.  */
+  unsigned has_simd_clones : 1;
   /* True if this decl is a dispatcher for function versions.  */
   unsigned dispatcher_function : 1;
 };
@@ -742,6 +837,7 @@  void cgraph_speculative_call_info (struct cgraph_edge *,
 				   struct cgraph_edge *&,
 				   struct cgraph_edge *&,
 				   struct ipa_ref *&);
+struct cgraph_node *get_simd_clone (struct cgraph_node *, tree);
 
 /* In cgraphunit.c  */
 struct asm_node *add_asm_node (tree);
diff --git a/gcc/config/i386/i386.c b/gcc/config/i386/i386.c
index 168a2ac..73140f9 100644
--- a/gcc/config/i386/i386.c
+++ b/gcc/config/i386/i386.c
@@ -42875,6 +42875,42 @@  ix86_memmodel_check (unsigned HOST_WIDE_INT val)
   return val;
 }
 
+/* Return the default mangling character when no vector size can be
+   determined from the `processor' clause.  */
+
+static char
+ix86_cilkplus_default_vecsize_mangle (struct cgraph_node *clone
+				      ATTRIBUTE_UNUSED)
+{
+  return 'x';
+}
+
+/* Return the hardware vector size (in bits) for a mangling
+   character.  */
+
+static unsigned int
+ix86_cilkplus_vecsize_for_mangle (char mangle)
+{
+  /* ?? Intel currently has no ISA encoding character for AVX-512.  */
+  switch (mangle)
+    {
+    case 'x':
+      /* xmm (SSE2).  */
+      return 128;
+    case 'y':
+      /* ymm1 (AVX1).  */
+    case 'Y':
+      /* ymm2 (AVX2).  */
+      return 256;
+    case 'z':
+      /* zmm (MIC).  */
+      return 512;
+    default:
+      gcc_unreachable ();
+      return 0;
+    }
+}
+
 /* Initialize the GCC target structure.  */
 #undef TARGET_RETURN_IN_MEMORY
 #define TARGET_RETURN_IN_MEMORY ix86_return_in_memory
@@ -43247,6 +43283,14 @@  ix86_memmodel_check (unsigned HOST_WIDE_INT val)
 #undef TARGET_SPILL_CLASS
 #define TARGET_SPILL_CLASS ix86_spill_class
 
+#undef TARGET_CILKPLUS_DEFAULT_VECSIZE_MANGLE
+#define TARGET_CILKPLUS_DEFAULT_VECSIZE_MANGLE \
+  ix86_cilkplus_default_vecsize_mangle
+
+#undef TARGET_CILKPLUS_VECSIZE_FOR_MANGLE
+#define TARGET_CILKPLUS_VECSIZE_FOR_MANGLE \
+  ix86_cilkplus_vecsize_for_mangle
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-i386.h"
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 8d220f3..8bb9d1e 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -5787,6 +5787,26 @@  The default is @code{NULL_TREE} which means to not vectorize gather
 loads.
 @end deftypefn
 
+@deftypefn {Target Hook} char TARGET_CILKPLUS_DEFAULT_VECSIZE_MANGLE (struct cgraph_node *@var{})
+This hook should return the default mangling character when no vector
+size can be determined by examining the Cilk Plus @code{processor} clause.
+This is as specified in the Intel Vector ABI document.
+
+This hook, as well as @code{max_vector_size_for_isa} below, must be set
+to support the Cilk Plus @code{processor} clause.
+
+The only argument is a @var{cgraph_node} containing the clone.
+@end deftypefn
+
+@deftypefn {Target Hook} {unsigned int} TARGET_CILKPLUS_VECSIZE_FOR_MANGLE (char)
+This hook returns the maximum hardware vector size in bits for a given
+mangling character.  The character is as described in Intel's
+Vector ABI (see @var{ISA} character in the section on mangling).
+
+This hook must be defined in order to support the Cilk Plus @code{processor}
+clause.
+@end deftypefn
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index 863e843a..db25787 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -4414,6 +4414,10 @@  address;  but often a machine-dependent strategy can generate better code.
 
 @hook TARGET_VECTORIZE_BUILTIN_GATHER
 
+@hook TARGET_CILKPLUS_DEFAULT_VECSIZE_MANGLE
+
+@hook TARGET_CILKPLUS_VECSIZE_FOR_MANGLE
+
 @node Anchored Addresses
 @section Anchored Addresses
 @cindex anchored addresses
diff --git a/gcc/ggc.h b/gcc/ggc.h
index b31bc80..eee90c6 100644
--- a/gcc/ggc.h
+++ b/gcc/ggc.h
@@ -276,4 +276,11 @@  ggc_alloc_cleared_gimple_statement_d_stat (size_t s MEM_STAT_DECL)
     ggc_internal_cleared_alloc_stat (s PASS_MEM_STAT);
 }
 
+static inline struct simd_clone *
+ggc_alloc_cleared_simd_clone_stat (size_t s MEM_STAT_DECL)
+{
+  return (struct simd_clone *)
+    ggc_internal_cleared_alloc_stat (s PASS_MEM_STAT);
+}
+
 #endif
diff --git a/gcc/ipa-cp.c b/gcc/ipa-cp.c
index c38ba82..faae080 100644
--- a/gcc/ipa-cp.c
+++ b/gcc/ipa-cp.c
@@ -446,6 +446,13 @@  determine_versionability (struct cgraph_node *node)
     reason = "not a tree_versionable_function";
   else if (cgraph_function_body_availability (node) <= AVAIL_OVERWRITABLE)
     reason = "insufficient body availability";
+  else if (node->has_simd_clones)
+    {
+      /* Ideally we should clone the SIMD clones themselves and create
+	 vector copies of them, so IPA-cp and SIMD clones can happily
+	 coexist, but that may not be worth the effort.  */
+      reason = "function has SIMD clones";
+    }
 
   if (reason && dump_file && !node->symbol.alias && !node->thunk.thunk_p)
     fprintf (dump_file, "Function %s/%i is not versionable, reason: %s.\n",
diff --git a/gcc/ipa-prop.c b/gcc/ipa-prop.c
index 2fbc9d4..0c20dc6 100644
--- a/gcc/ipa-prop.c
+++ b/gcc/ipa-prop.c
@@ -3361,24 +3361,18 @@  void
 ipa_modify_formal_parameters (tree fndecl, ipa_parm_adjustment_vec adjustments,
 			      const char *synth_parm_prefix)
 {
-  vec<tree> oparms, otypes;
-  tree orig_type, new_type = NULL;
-  tree old_arg_types, t, new_arg_types = NULL;
-  tree parm, *link = &DECL_ARGUMENTS (fndecl);
-  int i, len = adjustments.length ();
-  tree new_reversed = NULL;
-  bool care_for_types, last_parm_void;
-
   if (!synth_parm_prefix)
     synth_parm_prefix = "SYNTH";
 
-  oparms = ipa_get_vector_of_formal_parms (fndecl);
-  orig_type = TREE_TYPE (fndecl);
-  old_arg_types = TYPE_ARG_TYPES (orig_type);
+  vec<tree> oparms = ipa_get_vector_of_formal_parms (fndecl);
+  tree orig_type = TREE_TYPE (fndecl);
+  tree old_arg_types = TYPE_ARG_TYPES (orig_type);
 
   /* The following test is an ugly hack, some functions simply don't have any
      arguments in their type.  This is probably a bug but well... */
-  care_for_types = (old_arg_types != NULL_TREE);
+  bool care_for_types = (old_arg_types != NULL_TREE);
+  bool last_parm_void;
+  vec<tree> otypes;
   if (care_for_types)
     {
       last_parm_void = (TREE_VALUE (tree_last (old_arg_types))
@@ -3395,13 +3389,20 @@  ipa_modify_formal_parameters (tree fndecl, ipa_parm_adjustment_vec adjustments,
       otypes.create (0);
     }
 
-  for (i = 0; i < len; i++)
+  int len = adjustments.length ();
+  tree *link = &DECL_ARGUMENTS (fndecl);
+  tree new_arg_types = NULL;
+  for (int i = 0; i < len; i++)
     {
       struct ipa_parm_adjustment *adj;
       gcc_assert (link);
 
       adj = &adjustments[i];
-      parm = oparms[adj->base_index];
+      tree parm;
+      if (adj->new_param)
+	parm = NULL;
+      else
+	parm = oparms[adj->base_index];
       adj->base = parm;
 
       if (adj->copy_param)
@@ -3417,8 +3418,18 @@  ipa_modify_formal_parameters (tree fndecl, ipa_parm_adjustment_vec adjustments,
 	  tree new_parm;
 	  tree ptype;
 
-	  if (adj->by_ref)
-	    ptype = build_pointer_type (adj->type);
+	  if (adj->simdlen)
+	    {
+	      /* If we have a nonzero simdlen but by_ref is true, we
+		 want a vector of pointers.  Build the vector of
+		 pointers here, not a pointer to a vector in the
+		 adj->by_ref case below.  */
+	      ptype = build_vector_type (adj->type, adj->simdlen);
+	    }
+	  else if (adj->by_ref)
+	    {
+	      ptype = build_pointer_type (adj->type);
+	    }
 	  else
 	    ptype = adj->type;
 
@@ -3427,8 +3438,9 @@  ipa_modify_formal_parameters (tree fndecl, ipa_parm_adjustment_vec adjustments,
 
 	  new_parm = build_decl (UNKNOWN_LOCATION, PARM_DECL, NULL_TREE,
 				 ptype);
-	  DECL_NAME (new_parm) = create_tmp_var_name (synth_parm_prefix);
-
+	  const char *prefix
+	    = adj->new_param ? adj->new_arg_prefix : synth_parm_prefix;
+	  DECL_NAME (new_parm) = create_tmp_var_name (prefix);
 	  DECL_ARTIFICIAL (new_parm) = 1;
 	  DECL_ARG_TYPE (new_parm) = ptype;
 	  DECL_CONTEXT (new_parm) = fndecl;
@@ -3436,17 +3448,20 @@  ipa_modify_formal_parameters (tree fndecl, ipa_parm_adjustment_vec adjustments,
 	  DECL_IGNORED_P (new_parm) = 1;
 	  layout_decl (new_parm, 0);
 
-	  adj->base = parm;
+	  if (adj->new_param)
+	    adj->base = new_parm;
+	  else
+	    adj->base = parm;
 	  adj->reduction = new_parm;
 
 	  *link = new_parm;
-
 	  link = &DECL_CHAIN (new_parm);
 	}
     }
 
   *link = NULL_TREE;
 
+  tree new_reversed = NULL;
   if (care_for_types)
     {
       new_reversed = nreverse (new_arg_types);
@@ -3464,6 +3479,7 @@  ipa_modify_formal_parameters (tree fndecl, ipa_parm_adjustment_vec adjustments,
      Exception is METHOD_TYPEs must have THIS argument.
      When we are asked to remove it, we need to build new FUNCTION_TYPE
      instead.  */
+  tree new_type = NULL;
   if (TREE_CODE (orig_type) != METHOD_TYPE
        || (adjustments[0].copy_param
 	  && adjustments[0].base_index == 0))
@@ -3489,7 +3505,7 @@  ipa_modify_formal_parameters (tree fndecl, ipa_parm_adjustment_vec adjustments,
 
   /* This is a new type, not a copy of an old type.  Need to reassociate
      variants.  We can handle everything except the main variant lazily.  */
-  t = TYPE_MAIN_VARIANT (orig_type);
+  tree t = TYPE_MAIN_VARIANT (orig_type);
   if (orig_type != t)
     {
       TYPE_MAIN_VARIANT (new_type) = t;
diff --git a/gcc/ipa-prop.h b/gcc/ipa-prop.h
index 48634d2..8d7d9b9 100644
--- a/gcc/ipa-prop.h
+++ b/gcc/ipa-prop.h
@@ -634,9 +634,10 @@  struct ipa_parm_adjustment
      arguments.  */
   tree alias_ptr_type;
 
-  /* The new declaration when creating/replacing a parameter.  Created by
-     ipa_modify_formal_parameters, useful for functions modifying the body
-     accordingly. */
+  /* The new declaration when creating/replacing a parameter.  Created
+     by ipa_modify_formal_parameters, useful for functions modifying
+     the body accordingly.  For brand new arguments, this is the newly
+     created argument.  */
   tree reduction;
 
   /* New declaration of a substitute variable that we may use to replace all
@@ -647,15 +648,36 @@  struct ipa_parm_adjustment
      is NULL), this is going to be its nonlocalized vars value.  */
   tree nonlocal_value;
 
+  /* If this is a brand new argument, this holds the prefix to be used
+     for the DECL_NAME.  */
+  const char *new_arg_prefix;
+
   /* Offset into the original parameter (for the cases when the new parameter
      is a component of an original one).  */
   HOST_WIDE_INT offset;
 
-  /* Zero based index of the original parameter this one is based on.  (ATM
-     there is no way to insert a new parameter out of the blue because there is
-     no need but if it arises the code can be easily exteded to do so.)  */
+  /* Zero based index of the original parameter this one is based on.  */
   int base_index;
 
+  /* If nonzero, the parameter is a vector of `type' with this many
+     elements.  */
+  int simdlen;
+
+  /* This is a brand new parameter.
+
+     For new parameters, base_index must be >= the number of
+     DECL_ARGUMENTS in the function.  That is, new arguments will be
+     the last arguments in the adjusted function.
+
+     ?? Perhaps we could redesign ipa_modify_formal_parameters() to
+     reorganize argument position, thus allowing inserting of brand
+     new arguments anywhere, but there is no use for this now.
+
+     Also, `type' should be set to the new type, `new_arg_prefix'
+     should be set to the string prefix for the new DECL_NAME, and
+     `reduction' will ultimately hold the newly created argument.  */
+  unsigned new_param : 1;
+
   /* This new parameter is an unmodified parameter at index base_index. */
   unsigned copy_param : 1;
 
@@ -697,5 +719,7 @@  void ipa_dump_param (FILE *, struct ipa_node_params *info, int i);
 /* From tree-sra.c:  */
 tree build_ref_for_offset (location_t, tree, HOST_WIDE_INT, tree,
 			   gimple_stmt_iterator *, bool);
+bool ipa_sra_modify_function_body (ipa_parm_adjustment_vec);
+bool sra_ipa_modify_expr (tree *, bool, ipa_parm_adjustment_vec);
 
 #endif /* IPA_PROP_H */
diff --git a/gcc/omp-low.c b/gcc/omp-low.c
index 26f0c35..afca595 100644
--- a/gcc/omp-low.c
+++ b/gcc/omp-low.c
@@ -43,6 +43,8 @@  along with GCC; see the file COPYING3.  If not see
 #include "optabs.h"
 #include "cfgloop.h"
 #include "target.h"
+#include "pretty-print.h"
+#include "ipa-prop.h"
 
 
 /* Lowering of OpenMP parallel and workshare constructs proceeds in two
@@ -10380,5 +10382,885 @@  make_pass_diagnose_omp_blocks (gcc::context *ctxt)
 {
   return new pass_diagnose_omp_blocks (ctxt);
 }
+
+/* SIMD clone supporting code.  */
+
+/* A map for function arguments.  This will map a zero-based integer
+   to the corresponding index into DECL_ARGUMENTS.  */
+class argno_map
+{
+  vec<tree> tree_args;
+ public:
+  /* Default constructor declared but not implemented by design.  The
+     only valid constructor is the TREE version below.  */
+  argno_map ();
+  argno_map (tree fndecl);
+
+  ~argno_map () { tree_args.release (); }
+  unsigned int length () { return tree_args.length (); }
+  tree operator[] (unsigned n) { return tree_args[n]; }
+};
+
+/* FNDECL is the function containing the arguments.  */
+
+argno_map::argno_map (tree fndecl)
+{
+  tree_args.create (5);
+  for (tree t = DECL_ARGUMENTS (fndecl); t; t = DECL_CHAIN (t))
+    tree_args.safe_push (t);
+}
+
+/* Allocate a fresh `simd_clone' and return it.  NARGS is the number
+   of arguments to reserve space for.  */
+
+static struct simd_clone *
+simd_clone_struct_alloc (int nargs)
+{
+  struct simd_clone *clone_info;
+  int len = sizeof (struct simd_clone)
+    + nargs * sizeof (struct simd_clone_arg);
+  clone_info = ggc_alloc_cleared_simd_clone_stat (len PASS_MEM_STAT);
+  return clone_info;
+}
+
+/* Make a copy of the `struct simd_clone' in FROM to TO.  */
+
+static inline void
+simd_clone_struct_copy (struct simd_clone *to, struct simd_clone *from)
+{
+  memcpy (to, from, sizeof (struct simd_clone)
+	  + from->nargs * sizeof (struct simd_clone_arg));
+}
+
+/* Given a simd clone in NEW_NODE, extract the simd specific
+   information from the OMP clauses passed in CLAUSES, and set the
+   relevant bits in the cgraph node.  *INBRANCH_SPECIFIED is set to
+   TRUE if the `inbranch' or `notinbranch' clause was specified, otherwise
+   set to FALSE.  */
+
+static void
+simd_clone_clauses_extract (struct cgraph_node *new_node, tree clauses,
+			    bool *inbranch_specified)
+{
+  tree t;
+  int n = 0;
+  *inbranch_specified = false;
+  for (t = DECL_ARGUMENTS (new_node->symbol.decl); t; t = DECL_CHAIN (t))
+    ++n;
+
+  /* To distinguish from an OpenMP simd clone, Cilk Plus functions to
+     be cloned have a distinctive artificial label in addition to "omp
+     declare simd".  */
+  bool cilk_clone
+    = (flag_enable_cilkplus
+       && lookup_attribute ("cilk plus elemental",
+			    DECL_ATTRIBUTES (new_node->symbol.decl)));
+
+  /* Allocate one more than needed just in case this is an in-branch
+     clone which will require a mask argument.  */
+  struct simd_clone *clone_info = simd_clone_struct_alloc (n + 1);
+  clone_info->nargs = n;
+  clone_info->cilk_elemental = cilk_clone;
+  gcc_assert (!new_node->simdclone);
+  new_node->simdclone = clone_info;
+
+  if (!clauses)
+    return;
+  clauses = TREE_VALUE (clauses);
+  if (!clauses || TREE_CODE (clauses) != OMP_CLAUSE)
+    return;
+
+  for (t = clauses; t; t = OMP_CLAUSE_CHAIN (t))
+    {
+      switch (OMP_CLAUSE_CODE (t))
+	{
+	case OMP_CLAUSE_INBRANCH:
+	  clone_info->inbranch = 1;
+	  *inbranch_specified = true;
+	  break;
+	case OMP_CLAUSE_NOTINBRANCH:
+	  clone_info->inbranch = 0;
+	  *inbranch_specified = true;
+	  break;
+	case OMP_CLAUSE_SIMDLEN:
+	  clone_info->simdlen
+	    = TREE_INT_CST_LOW (OMP_CLAUSE_SIMDLEN_EXPR (t));
+	  break;
+	case OMP_CLAUSE_LINEAR:
+	  {
+	    tree decl = OMP_CLAUSE_DECL (t);
+	    tree step = OMP_CLAUSE_LINEAR_STEP (t);
+	    int argno = TREE_INT_CST_LOW (decl);
+	    if (OMP_CLAUSE_LINEAR_VARIABLE_STRIDE (t))
+	      {
+		clone_info->args[argno].linear_stride
+		  = LINEAR_STRIDE_YES_VARIABLE;
+		clone_info->args[argno].linear_stride_num
+		  = TREE_INT_CST_LOW (step);
+		gcc_assert (!TREE_INT_CST_HIGH (step));
+	      }
+	    else
+	      {
+		if (TREE_INT_CST_HIGH (step))
+		  {
+		    /* It looks like this can't really happen, since the
+		       front-ends generally issue:
+
+		       warning: integer constant is too large for its type.
+
+		       But let's assume somehow we got past all that.  */
+		    warning_at (DECL_SOURCE_LOCATION (decl), 0,
+				"ignoring large linear step");
+		  }
+		else
+		  {
+		    clone_info->args[argno].linear_stride
+		      = LINEAR_STRIDE_YES_CONSTANT;
+		    clone_info->args[argno].linear_stride_num
+		      = TREE_INT_CST_LOW (step);
+		  }
+	      }
+	    break;
+	  }
+	case OMP_CLAUSE_UNIFORM:
+	  {
+	    tree decl = OMP_CLAUSE_DECL (t);
+	    int argno = tree_low_cst (decl, 1);
+	    clone_info->args[argno].uniform = 1;
+	    break;
+	  }
+	case OMP_CLAUSE_ALIGNED:
+	  {
+	    tree decl = OMP_CLAUSE_DECL (t);
+	    int argno = tree_low_cst (decl, 1);
+	    clone_info->args[argno].alignment
+	      = TREE_INT_CST_LOW (OMP_CLAUSE_ALIGNED_ALIGNMENT (t));
+	    break;
+	  }
+	default:
+	  break;
+	}
+    }
+}
+
+/* Helper function for mangling.  Given a hardware vector size in
+   bits, return the corresponding ISA mangling character.  */
+
+static char
+vecsize_mangle (unsigned int vecsize)
+{
+  switch (vecsize)
+    {
+      /* The Intel Vector ABI does not provide a mangling character
+	 for a 64-bit ISA, but this feels like it is in keeping with
+	 the design.  */
+    case 64: return 'w';
+
+    case 128: return 'x';
+    case 256: return 'y';
+    case 512: return 'z';
+    default:
+      /* FIXME: We must come up with a default mangling character.  */
+      return 'x';
+    }
+}
+
+/* Given a SIMD clone in NEW_NODE, calculate the characteristic data
+   type and return the corresponding type.  The characteristic data
+   type is computed as described in the Intel Vector ABI.  */
+
+static tree
+simd_clone_compute_base_data_type (struct cgraph_node *new_node)
+{
+  tree type = integer_type_node;
+  tree fndecl = new_node->symbol.decl;
+
+  /* a) For non-void function, the characteristic data type is the
+        return type.  */
+  if (TREE_CODE (TREE_TYPE (TREE_TYPE (fndecl))) != VOID_TYPE)
+    type = TREE_TYPE (TREE_TYPE (fndecl));
+
+  /* b) If the function has any non-uniform, non-linear parameters,
+        then the characteristic data type is the type of the first
+        such parameter.  */
+  else
+    {
+      argno_map map (fndecl);
+      for (unsigned int i = 0; i < new_node->simdclone->nargs; ++i)
+	{
+	  struct simd_clone_arg arg = new_node->simdclone->args[i];
+	  if (!arg.uniform && arg.linear_stride == LINEAR_STRIDE_NO)
+	    {
+	      type = TREE_TYPE (map[i]);
+	      break;
+	    }
+	}
+    }
+
+  /* c) If the characteristic data type determined by a) or b) above
+        is a struct, union, or class type which is passed by value
+        (except for types that map to the built-in complex data type),
+        the characteristic data type is int.  */
+  if (RECORD_OR_UNION_TYPE_P (type)
+      && !aggregate_value_p (type, NULL)
+      && TREE_CODE (type) != COMPLEX_TYPE)
+    return integer_type_node;
+
+  /* d) If none of the above three classes is applicable, the
+        characteristic data type is int.  */
+
+  return type;
+
+  /* e) For Intel Xeon Phi native and offload compilation, if the
+        resulting characteristic data type is 8-bit or 16-bit integer
+        data type, the characteristic data type is int.  */
+  /* Well, we don't handle Xeon Phi yet.  */
+}
+
+/* Given a SIMD clone in NEW_NODE, compute simdlen and vector size,
+   and store them in NEW_NODE->simdclone.  */
+
+static void
+simd_clone_compute_vecsize_and_simdlen (struct cgraph_node *new_node)
+{
+  char vmangle = new_node->simdclone->vecsize_mangle;
+  /* Vector size for this clone.  */
+  unsigned int vecsize = 0;
+  /* Base vector type, based on function arguments.  */
+  tree base_type = simd_clone_compute_base_data_type (new_node);
+  unsigned int base_type_size = GET_MODE_BITSIZE (TYPE_MODE (base_type));
+
+  /* Calculate everything for Cilk Plus clones with appropriate target
+     support.  This is as specified in the Intel Vector ABI.
+
+     Note: Any target which supports the Cilk Plus processor clause
+     must also provide appropriate target hooks for calculating
+     default ISA/processor (default_vecsize_mangle), and for
+     calculating hardware vector size based on ISA/processor
+     (vecsize_for_mangle).  */
+  if (new_node->simdclone->cilk_elemental
+      && targetm.cilkplus.default_vecsize_mangle)
+    {
+      if (!vmangle)
+	vmangle = targetm.cilkplus.default_vecsize_mangle (new_node);
+      vecsize = targetm.cilkplus.vecsize_for_mangle (vmangle);
+      if (!new_node->simdclone->simdlen)
+	new_node->simdclone->simdlen = vecsize / base_type_size;
+    }
+  /* Calculate everything else generically.  */
+  else
+    {
+      vecsize = GET_MODE_BITSIZE (targetm.vectorize.preferred_simd_mode
+				  (TYPE_MODE (base_type)));
+      vmangle = vecsize_mangle (vecsize);
+      if (!new_node->simdclone->simdlen)
+	new_node->simdclone->simdlen = vecsize / base_type_size;
+    }
+  new_node->simdclone->vecsize_mangle = vmangle;
+  new_node->simdclone->hw_vector_size = vecsize;
+}
+
+/* Compute the assembler name for the SIMD clone NEW_NODE of OLD_NODE
+   as specified by the Intel Vector ABI, and install it on NEW_NODE's
+   decl.  */
+
+static void
+simd_clone_mangle (struct cgraph_node *old_node, struct cgraph_node *new_node)
+{
+  char vecsize_mangle = new_node->simdclone->vecsize_mangle;
+  char mask = new_node->simdclone->inbranch ? 'M' : 'N';
+  unsigned int simdlen = new_node->simdclone->simdlen;
+  unsigned int n;
+  pretty_printer pp;
+
+  gcc_assert (vecsize_mangle && simdlen);
+
+  pp_string (&pp, "_ZGV");
+  pp_character (&pp, vecsize_mangle);
+  pp_character (&pp, mask);
+  pp_decimal_int (&pp, simdlen);
+
+  for (n = 0; n < new_node->simdclone->nargs; ++n)
+    {
+      struct simd_clone_arg arg = new_node->simdclone->args[n];
+
+      if (arg.uniform)
+	pp_character (&pp, 'u');
+      else if (arg.linear_stride == LINEAR_STRIDE_YES_CONSTANT)
+	{
+	  gcc_assert (arg.linear_stride_num != 0);
+	  pp_character (&pp, 'l');
+	  if (arg.linear_stride_num > 1)
+	    pp_unsigned_wide_integer (&pp,
+				      arg.linear_stride_num);
+	}
+      else if (arg.linear_stride == LINEAR_STRIDE_YES_VARIABLE)
+	{
+	  pp_character (&pp, 's');
+	  pp_unsigned_wide_integer (&pp, arg.linear_stride_num);
+	}
+      else
+	pp_character (&pp, 'v');
+      if (arg.alignment)
+	{
+	  pp_character (&pp, 'a');
+	  pp_decimal_int (&pp, arg.alignment);
+	}
+    }
+
+  pp_underscore (&pp);
+  pp_string (&pp,
+	     IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (old_node->symbol.decl)));
+  const char *str = pp_formatted_text (&pp);
+  change_decl_assembler_name (new_node->symbol.decl,
+			      get_identifier (str));
+}
+
+/* Create a simd clone of OLD_NODE and return it.  */
+
+static struct cgraph_node *
+simd_clone_create (struct cgraph_node *old_node)
+{
+  struct cgraph_node *new_node;
+  new_node = cgraph_function_versioning (old_node, vNULL, NULL, NULL, false,
+					 NULL, NULL, "simdclone");
+
+  new_node->simdclone_of = old_node;
+
+  /* Keep cgraph friends from removing the clone.  */
+  TREE_PUBLIC (new_node->symbol.decl) = TREE_PUBLIC (old_node->symbol.decl);
+  old_node->has_simd_clones = true;
+
+  /* The function cgraph_function_versioning() will force the new
+     symbol local.  Undo this, and inherit external visibility from
+     the old node.  */
+  new_node->local.local = old_node->local.local;
+  new_node->symbol.externally_visible = old_node->symbol.externally_visible;
+
+  return new_node;
+}
+
+/* Adjust the return type of the given function to its appropriate
+   vector counterpart.  Returns a simd array to be used throughout the
+   function as a return value.  */
+
+static tree
+simd_clone_adjust_return_type (struct cgraph_node *node)
+{
+  tree fndecl = node->symbol.decl;
+  tree orig_rettype = TREE_TYPE (TREE_TYPE (fndecl));
+
+  tree t = DECL_RESULT (fndecl);
+  /* Adjust the DECL_RESULT.  */
+  if (TREE_TYPE (t) != void_type_node)
+    {
+      TREE_TYPE (t)
+	= build_vector_type (TREE_TYPE (t), node->simdclone->simdlen);
+      DECL_MODE (t) = TYPE_MODE (TREE_TYPE (t));
+    }
+  /* Adjust the function return type.  */
+  if (TREE_TYPE (TREE_TYPE (fndecl)) != void_type_node)
+    {
+      TREE_TYPE (fndecl)
+	= copy_node (TREE_TYPE (fndecl));
+      TREE_TYPE (TREE_TYPE (fndecl))
+	= copy_node (TREE_TYPE (TREE_TYPE (fndecl)));
+      TREE_TYPE (TREE_TYPE (fndecl))
+	= build_vector_type (TREE_TYPE (TREE_TYPE (fndecl)),
+			     node->simdclone->simdlen);
+    }
+
+  /* Set up a SIMD array to use as the return value.  */
+  tree retval;
+  if (orig_rettype != void_type_node)
+    {
+      retval
+	= create_tmp_var_raw (build_array_type_nelts (orig_rettype,
+						      node->simdclone->simdlen),
+			      "retval");
+      gimple_add_tmp_var (retval);
+    }
+  else
+    retval = NULL;
+  return retval;
+}
+
+/* Each vector argument has a corresponding array to be used locally
+   as part of the eventual loop.  Create such a temporary array and
+   return it.
+
+   PREFIX is the prefix to be used for the temporary.
+
+   TYPE is the inner element type.
+
+   SIMDLEN is the number of elements.  */
+
+static tree
+create_tmp_simd_array (const char *prefix, tree type, int simdlen)
+{
+  tree atype = build_array_type_nelts (type, simdlen);
+  tree avar = create_tmp_var_raw (atype, prefix);
+  gimple_add_tmp_var (avar);
+  return avar;
+}
+
+/* Modify the function argument types to their corresponding vector
+   counterparts if appropriate.  Also create, for each SIMD argument,
+   a local array to be used when accessing the function arguments
+   from within the generated loop.
+
+   NODE is the function whose arguments are to be adjusted.
+
+   Returns an adjustment vector describing how the argument types
+   were adjusted.  */
+
+static ipa_parm_adjustment_vec
+simd_clone_adjust_argument_types (struct cgraph_node *node)
+{
+  argno_map args (node->symbol.decl);
+  ipa_parm_adjustment_vec adjustments;
+
+  adjustments.create (args.length ());
+  unsigned i;
+  for (i = 0; i < node->simdclone->nargs; ++i)
+    {
+      struct ipa_parm_adjustment adj;
+
+      memset (&adj, 0, sizeof (adj));
+      tree parm = args[i];
+      adj.base_index = i;
+      adj.base = parm;
+
+      node->simdclone->args[i].orig_arg = parm;
+
+      if (node->simdclone->args[i].uniform
+	  || node->simdclone->args[i].linear_stride != LINEAR_STRIDE_NO)
+	{
+	  /* No adjustment necessary for scalar arguments.  */
+	  adj.copy_param = 1;
+	}
+      else
+	{
+	  adj.simdlen = node->simdclone->simdlen;
+	  if (POINTER_TYPE_P (TREE_TYPE (parm)))
+	    adj.by_ref = 1;
+	  adj.type = TREE_TYPE (parm);
+
+	  node->simdclone->args[i].simd_array
+	    = create_tmp_simd_array (IDENTIFIER_POINTER (DECL_NAME (parm)),
+				     TREE_TYPE (parm),
+				     node->simdclone->simdlen);
+	}
+      adjustments.quick_push (adj);
+    }
+
+  if (node->simdclone->inbranch)
+    {
+      struct ipa_parm_adjustment adj;
+
+      memset (&adj, 0, sizeof (adj));
+      adj.new_param = 1;
+      adj.new_arg_prefix = "mask";
+      adj.base_index = i;
+      adj.type
+	= build_vector_type (integer_type_node, node->simdclone->simdlen);
+      adjustments.safe_push (adj);
+
+      /* We have previously allocated one extra entry for the mask.  Use
+	 it and fill it.  */
+      struct simd_clone *sc = node->simdclone;
+      sc->nargs++;
+      sc->args[i].orig_arg = build_decl (UNKNOWN_LOCATION, PARM_DECL, NULL,
+					 integer_type_node);
+      sc->args[i].simd_array
+	= create_tmp_simd_array ("mask", integer_type_node, sc->simdlen);
+    }
+
+  ipa_modify_formal_parameters (node->symbol.decl, adjustments, "simd");
+  return adjustments;
+}
+
+/* Initialize and copy the function arguments in NODE to their
+   corresponding local simd arrays.  Returns a fresh gimple_seq with
+   the instruction sequence generated.  */
+
+static gimple_seq
+simd_clone_init_simd_arrays (struct cgraph_node *node,
+			     ipa_parm_adjustment_vec adjustments)
+{
+  gimple_seq seq = NULL;
+  unsigned i = 0;
+
+  for (tree arg = DECL_ARGUMENTS (node->symbol.decl);
+       arg;
+       arg = DECL_CHAIN (arg), i++)
+    {
+      if (adjustments[i].copy_param)
+	continue;
+
+      node->simdclone->args[i].vector_arg = arg;
+
+      tree array = node->simdclone->args[i].simd_array;
+      tree t = build1 (VIEW_CONVERT_EXPR, TREE_TYPE (array), arg);
+      t = build2 (MODIFY_EXPR, TREE_TYPE (array), array, t);
+      gimplify_and_add (t, &seq);
+    }
+  return seq;
+}
+
+/* Traverse the function body and perform all modifications as
+   described in ADJUSTMENTS.  On return, ADJUSTMENTS will have been
+   modified such that each replacement/reduction value is an element
+   of the corresponding simd_array.
+
+   This function will replace all function argument uses with their
+   corresponding simd array elements, and adjust the return values
+   accordingly.  */
+
+static void
+ipa_simd_modify_function_body (struct cgraph_node *node,
+			       ipa_parm_adjustment_vec adjustments,
+			       tree retval_array, tree iter)
+{
+  basic_block bb;
+
+  /* Re-use the adjustments array, but this time use it to replace
+     every use of a function argument with a reference into the
+     corresponding simd_array.  */
+  for (unsigned i = 0; i < node->simdclone->nargs; ++i)
+    {
+      if (!node->simdclone->args[i].vector_arg)
+	continue;
+
+      tree basetype = TREE_TYPE (node->simdclone->args[i].orig_arg);
+      adjustments[i].reduction
+	= build4 (ARRAY_REF,
+		  basetype,
+		  node->simdclone->args[i].simd_array,
+		  iter,
+		  NULL_TREE, NULL_TREE);
+    }
+
+  FOR_EACH_BB_FN (bb, DECL_STRUCT_FUNCTION (node->symbol.decl))
+    {
+      gimple_stmt_iterator gsi;
+
+      gsi = gsi_start_bb (bb);
+      while (!gsi_end_p (gsi))
+	{
+	  gimple stmt = gsi_stmt (gsi);
+	  bool modified = false;
+	  tree *t;
+	  unsigned i;
+
+	  switch (gimple_code (stmt))
+	    {
+	    case GIMPLE_RETURN:
+	      {
+		/* Replace `return foo' by `retval_array[iter] = foo'.  */
+		tree old_retval = gimple_return_retval (stmt);
+		if (!old_retval)
+		  break;
+		stmt = gimple_build_assign (build4 (ARRAY_REF,
+						    TREE_TYPE (old_retval),
+						    retval_array, iter,
+						    NULL, NULL),
+					    old_retval);
+		gsi_replace (&gsi, stmt, true);
+		modified = true;
+		break;
+	      }
+
+	    case GIMPLE_ASSIGN:
+	      t = gimple_assign_lhs_ptr (stmt);
+	      modified |= sra_ipa_modify_expr (t, false, adjustments);
+	      for (i = 0; i < gimple_num_ops (stmt); ++i)
+		{
+		  t = gimple_op_ptr (stmt, i);
+		  modified |= sra_ipa_modify_expr (t, false, adjustments);
+		}
+	      break;
+
+	    case GIMPLE_CALL:
+	      /* Operands must be processed before the lhs.  */
+	      for (i = 0; i < gimple_call_num_args (stmt); i++)
+		{
+		  t = gimple_call_arg_ptr (stmt, i);
+		  modified |= sra_ipa_modify_expr (t, true, adjustments);
+		}
+
+	      if (gimple_call_lhs (stmt))
+		{
+		  t = gimple_call_lhs_ptr (stmt);
+		  modified |= sra_ipa_modify_expr (t, false, adjustments);
+		}
+	      break;
+
+	    case GIMPLE_ASM:
+	      for (i = 0; i < gimple_asm_ninputs (stmt); i++)
+		{
+		  t = &TREE_VALUE (gimple_asm_input_op (stmt, i));
+		  modified |= sra_ipa_modify_expr (t, true, adjustments);
+		}
+	      for (i = 0; i < gimple_asm_noutputs (stmt); i++)
+		{
+		  t = &TREE_VALUE (gimple_asm_output_op (stmt, i));
+		  modified |= sra_ipa_modify_expr (t, false, adjustments);
+		}
+	      break;
+
+	    default:
+	      for (i = 0; i < gimple_num_ops (stmt); ++i)
+		{
+		  t = gimple_op_ptr (stmt, i);
+		  if (*t)
+		    modified |= sra_ipa_modify_expr (t, true, adjustments);
+		}
+	      break;
+	    }
+
+	  if (modified)
+	    {
+	      gimple_regimplify_operands (stmt, &gsi);
+	      update_stmt (stmt);
+	      if (maybe_clean_eh_stmt (stmt))
+		gimple_purge_dead_eh_edges (gimple_bb (stmt));
+	    }
+	  gsi_next (&gsi);
+	}
+    }
+}
+
+/* Transform the body of NODE into a loop iterating over the SIMD
+   lanes, adjusting the return type and the argument types to their
+   vector counterparts along the way.  */
+
+static void
+simd_clone_adjust (struct cgraph_node *node)
+{
+  // FIXME: -------ABI STUFF--------
+  //   0. Create clones for externs.
+  //   1. Arguments split across multiple args.
+  //   2. Which registers to pass in.
+  //   3. Get mangling correct for x86*
+  //   4. Agree on what default clones to generate when simdlen() missing.
+
+  // FIXME: ------- VECTORIZER CHANGES -------
+  //   1. At least the easy, notinbranch cases.
+  //   2. Handle linear/uniform arguments in get_simd_clone/etc.
+  //   3. Bail on non-SLP vectorizer mode.
+
+  // FIXME:  __attribute__((target (something))) if needed
+
+  // FIXME: get_simd_clone() needs optimization.
+
+  push_cfun (DECL_STRUCT_FUNCTION (node->symbol.decl));
+
+  tree retval = simd_clone_adjust_return_type (node);
+  ipa_parm_adjustment_vec adjustments = simd_clone_adjust_argument_types (node);
+
+  struct gimplify_ctx gctx;
+  push_gimplify_context (&gctx);
+
+  gimple_seq seq = simd_clone_init_simd_arrays (node, adjustments);
+
+  /* Adjust all uses of vector arguments accordingly.  Adjust all
+     return values accordingly.  */
+  tree iter = create_tmp_var (unsigned_type_node, "iter");
+  ipa_simd_modify_function_body (node, adjustments, retval, iter);
+
+  /* Initialize the iteration variable.  */
+  gimple g
+    = gimple_build_assign_with_ops (INTEGER_CST,
+				    iter,
+				    build_int_cst (unsigned_type_node, 0),
+				    NULL_TREE);
+  gimple_seq_add_stmt (&seq, g);
+
+  basic_block entry_bb = single_succ (ENTRY_BLOCK_PTR);
+  basic_block body_bb = split_block_after_labels (entry_bb)->dest;
+  gimple_stmt_iterator gsi = gsi_after_labels (entry_bb);
+  /* Insert the SIMD array and iv initialization at function
+     entry.  */
+  gsi_insert_seq_before (&gsi, seq, GSI_NEW_STMT);
+
+  pop_gimplify_context (NULL);
+
+  /* Create a new BB right before the original exit BB, to hold the
+     iteration increment and the condition/branch.  */
+  basic_block orig_exit = EDGE_PRED (EXIT_BLOCK_PTR, 0)->src;
+  basic_block incr_bb = create_empty_bb (orig_exit);
+  /* The single successor of orig_exit was EXIT_BLOCK_PTR, with no
+     edge flags set.  Make it an explicit fallthru edge now.  */
+  gcc_assert (EDGE_COUNT (orig_exit->succs) == 1);
+  EDGE_SUCC (orig_exit, 0)->flags |= EDGE_FALLTHRU;
+  /* redirect_edge_succ removes the redirected edge from the exit
+     block's predecessor vector, so repeatedly take the first
+     remaining edge rather than iterating by index.  */
+  while (EDGE_COUNT (EXIT_BLOCK_PTR->preds) > 0)
+    redirect_edge_succ (EDGE_PRED (EXIT_BLOCK_PTR, 0), incr_bb);
+  edge e = make_edge (incr_bb, EXIT_BLOCK_PTR, 0);
+  e->probability = REG_BR_PROB_BASE;
+  gsi = gsi_last_bb (incr_bb);
+  g = gimple_build_assign_with_ops (PLUS_EXPR, iter, iter,
+				    build_int_cst (unsigned_type_node, 1));
+  gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING);
+
+  /* Mostly annotate the loop for the vectorizer (the rest is done below).  */
+  struct loop *loop = alloc_loop ();
+  cfun->has_force_vect_loops = true;
+  loop->safelen = node->simdclone->simdlen;
+  loop->force_vect = true;
+  loop->header = body_bb;
+  add_bb_to_loop (incr_bb, loop);
+
+  /* Branch around the body if the mask applies.  */
+  if (node->simdclone->inbranch)
+    {
+      gimple_stmt_iterator gsi = gsi_last_bb (loop->header);
+      tree mask_array
+	= node->simdclone->args[node->simdclone->nargs - 1].simd_array;
+      tree mask = create_tmp_var (integer_type_node, NULL);
+      tree aref = build4 (ARRAY_REF,
+			  integer_type_node,
+			  mask_array, iter,
+			  NULL, NULL);
+      g = gimple_build_assign (mask, aref);
+      gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING);
+
+      g = gimple_build_cond (EQ_EXPR, mask, integer_zero_node,
+			     NULL, NULL);
+      gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING);
+      make_edge (loop->header, incr_bb, EDGE_TRUE_VALUE);
+      FALLTHRU_EDGE (loop->header)->flags = EDGE_FALSE_VALUE;
+    }
+
+  /* Generate the condition.  */
+  g = gimple_build_cond (LT_EXPR,
+			 iter,
+			 build_int_cst (unsigned_type_node,
+					node->simdclone->simdlen),
+			 NULL, NULL);
+  gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING);
+  e = split_block (incr_bb, gsi_stmt (gsi));
+  basic_block latch_bb = e->dest;
+  basic_block new_exit_bb = split_block (latch_bb, NULL)->dest;
+  loop->latch = latch_bb;
+
+  redirect_edge_succ (FALLTHRU_EDGE (latch_bb), body_bb);
+
+  make_edge (incr_bb, new_exit_bb, EDGE_FALSE_VALUE);
+  /* The successor of incr_bb is already pointing to latch_bb; just
+     change the flags.
+     make_edge (incr_bb, latch_bb, EDGE_TRUE_VALUE);  */
+  FALLTHRU_EDGE (incr_bb)->flags = EDGE_TRUE_VALUE;
+
+  /* Generate the new return.  */
+  gsi = gsi_last_bb (new_exit_bb);
+  if (retval)
+    {
+      retval = build1 (VIEW_CONVERT_EXPR,
+		       TREE_TYPE (TREE_TYPE (node->symbol.decl)),
+		       retval);
+      retval = force_gimple_operand_gsi (&gsi, retval, true, NULL,
+					 false, GSI_CONTINUE_LINKING);
+    }
+  g = gimple_build_return (retval);
+  gsi_insert_after (&gsi, g, GSI_CONTINUE_LINKING);
+
+  calculate_dominance_info (CDI_DOMINATORS);
+  add_loop (loop, loop->header->loop_father);
+
+  pop_cfun ();
+}
+
+/* If the function in NODE is tagged as an elemental SIMD function,
+   create the appropriate SIMD clones.  */
+
+static void
+expand_simd_clones (struct cgraph_node *node)
+{
+  if (cgraph_function_body_availability (node) < AVAIL_OVERWRITABLE)
+    return;
+
+  tree attr = lookup_attribute ("omp declare simd",
+				DECL_ATTRIBUTES (node->symbol.decl));
+  if (!attr)
+    return;
+  do
+    {
+      struct cgraph_node *new_node = simd_clone_create (node);
+
+      bool inbranch_clause_specified;
+      simd_clone_clauses_extract (new_node, TREE_VALUE (attr),
+				  &inbranch_clause_specified);
+      simd_clone_compute_vecsize_and_simdlen (new_node);
+      simd_clone_mangle (node, new_node);
+      simd_clone_adjust (new_node);
+
+      /* If no inbranch clause was specified, we need both variants.
+	 We have already created the not-in-branch version above, by
+	 virtue of .inbranch being clear.  Create the masked in-branch
+	 version.  */
+      if (!inbranch_clause_specified)
+	{
+	  struct cgraph_node *n = simd_clone_create (node);
+	  struct simd_clone *clone
+	    = simd_clone_struct_alloc (new_node->simdclone->nargs);
+	  simd_clone_struct_copy (clone, new_node->simdclone);
+	  clone->inbranch = 1;
+	  n->simdclone = clone;
+	  simd_clone_mangle (node, n);
+	  simd_clone_adjust (n);
+	}
+    }
+  while ((attr = lookup_attribute ("omp declare simd", TREE_CHAIN (attr))));
+}
+
+/* Entry point for IPA simd clone creation pass.  */
+
+static unsigned int
+ipa_omp_simd_clone (void)
+{
+  struct cgraph_node *node;
+  FOR_EACH_DEFINED_FUNCTION (node)
+    expand_simd_clones (node);
+  return 0;
+}
+
+namespace {
+
+const pass_data pass_data_omp_simd_clone =
+{
+  SIMPLE_IPA_PASS,		/* type */
+  "simdclone",			/* name */
+  OPTGROUP_NONE,		/* optinfo_flags */
+  true,				/* has_gate */
+  true,				/* has_execute */
+  TV_NONE,			/* tv_id */
+  ( PROP_ssa | PROP_cfg ),	/* properties_required */
+  0,				/* properties_provided */
+  0,				/* properties_destroyed */
+  0,				/* todo_flags_start */
+  (TODO_update_ssa | TODO_verify_all | TODO_cleanup_cfg), /* todo_flags_finish */
+};
+
+class pass_omp_simd_clone : public simple_ipa_opt_pass
+{
+public:
+  pass_omp_simd_clone(gcc::context *ctxt)
+    : simple_ipa_opt_pass(pass_data_omp_simd_clone, ctxt)
+  {}
+
+  /* opt_pass methods: */
+  bool gate () { return flag_openmp || flag_enable_cilkplus; }
+  unsigned int execute () { return ipa_omp_simd_clone (); }
+};
+
+} // anon namespace
+
+simple_ipa_opt_pass *
+make_pass_omp_simd_clone (gcc::context *ctxt)
+{
+  return new pass_omp_simd_clone (ctxt);
+}
 
 #include "gt-omp-low.h"
diff --git a/gcc/passes.def b/gcc/passes.def
index 84eb3f3..6803399 100644
--- a/gcc/passes.def
+++ b/gcc/passes.def
@@ -97,6 +97,7 @@  along with GCC; see the file COPYING3.  If not see
       NEXT_PASS (pass_feedback_split_functions);
   POP_INSERT_PASSES ()
   NEXT_PASS (pass_ipa_increase_alignment);
+  NEXT_PASS (pass_omp_simd_clone);
   NEXT_PASS (pass_ipa_tm);
   NEXT_PASS (pass_ipa_lower_emutls);
   TERMINATE_PASS_LIST ()
diff --git a/gcc/target.def b/gcc/target.def
index 6de513f..92cbd73 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -1508,6 +1508,35 @@  hook_int_uint_mode_1)
 
 HOOK_VECTOR_END (sched)
 
+/* Functions relating to Cilk Plus.  */
+#undef HOOK_PREFIX
+#define HOOK_PREFIX "TARGET_CILKPLUS_"
+HOOK_VECTOR (TARGET_CILKPLUS, cilkplus)
+
+DEFHOOK
+(default_vecsize_mangle,
+"This hook should return the default mangling character when no vector\n\
+size can be determined by examining the Cilk Plus @code{processor} clause.\n\
+This is as specified in the Intel Vector ABI document.\n\
+\n\
+This hook, as well as @code{vecsize_for_mangle} below, must be set\n\
+to support the Cilk Plus @code{processor} clause.\n\
+\n\
+The only argument is a @var{cgraph_node} containing the clone.",
+char, (struct cgraph_node *), NULL)
+
+DEFHOOK
+(vecsize_for_mangle,
+"This hook returns the maximum hardware vector size in bits for a given\n\
+mangling character.  The character is as described in Intel's\n\
+Vector ABI (see @var{ISA} character in the section on mangling).\n\
+\n\
+This hook must be defined in order to support the Cilk Plus @code{processor}\n\
+clause.",
+unsigned int, (char), NULL)
+
+HOOK_VECTOR_END (cilkplus)
+
 /* Functions relating to vectorization.  */
 #undef HOOK_PREFIX
 #define HOOK_PREFIX "TARGET_VECTORIZE_"
diff --git a/gcc/testsuite/gcc.dg/gomp/simd-clones-1.c b/gcc/testsuite/gcc.dg/gomp/simd-clones-1.c
new file mode 100644
index 0000000..486b67a
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/gomp/simd-clones-1.c
@@ -0,0 +1,33 @@ 
+/* { dg-do compile } */
+/* { dg-options "-fopenmp -fdump-tree-optimized -O3" } */
+
+/* Test that functions that have SIMD clone counterparts are not
+   cloned by IPA-CP.  For example, special_add() below has SIMD clones
+   created for it.  However, if IPA-CP later decides to clone a
+   specialization of special_add(x, 666) when analyzing fillit(), we
+   will forever keep the vectorizer from using the SIMD versions of
+   special_add in a loop.
+
+   If IPA-CP is taught how to adjust the SIMD clones as well, this
+   test can be removed.  */
+
+#pragma omp declare simd simdlen(4)
+static int  __attribute__ ((noinline))
+special_add (int x, int y)
+{
+  if (y == 666)
+    return x + y + 123;
+  else
+    return x + y;
+}
+
+void fillit(int *tot)
+{
+  int i;
+
+  for (i=0; i < 10000; ++i)
+    tot[i] = special_add (i, 666);
+}
+
+/* { dg-final { scan-tree-dump-not "special_add.constprop" "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/gomp/simd-clones-2.c b/gcc/testsuite/gcc.dg/gomp/simd-clones-2.c
new file mode 100644
index 0000000..8ab3131
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/gomp/simd-clones-2.c
@@ -0,0 +1,21 @@ 
+/* { dg-do compile { target i?86-*-* x86_64-*-* } } */
+/* { dg-options "-fopenmp -fdump-tree-optimized -O -msse2" } */
+
+#pragma omp declare simd inbranch uniform(c) linear(b:66)   // addit.simdclone.2
+#pragma omp declare simd notinbranch aligned(c:32) // addit.simdclone.1
+int addit(int a, int b, int c)
+{
+  return a + b;
+}
+
+#pragma omp declare simd uniform(a) aligned(a:32) linear(k:1) notinbranch
+float setArray(float *a, float x, int k)
+{
+  a[k] = a[k] + x;
+  return a[k];
+}
+
+/* { dg-final { scan-tree-dump "clone.0 \\(_ZGVxN4ua32vl_setArray" "optimized" } } */
+/* { dg-final { scan-tree-dump "clone.1 \\(_ZGVxN4vvva32_addit" "optimized" } } */
+/* { dg-final { scan-tree-dump "clone.2 \\(_ZGVxM4vl66u_addit" "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/gomp/simd-clones-3.c b/gcc/testsuite/gcc.dg/gomp/simd-clones-3.c
new file mode 100644
index 0000000..a7fc2a5
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/gomp/simd-clones-3.c
@@ -0,0 +1,15 @@ 
+/* { dg-do compile { target i?86-*-* x86_64-*-* } } */
+/* { dg-options "-fopenmp -fdump-tree-optimized -O2 -msse2" } */
+
+/* Test that if neither inbranch nor notinbranch is specified, both
+   the masked and the unmasked versions are created.  */
+
+#pragma omp declare simd
+int addit(int a, int b, int c)
+{
+  return a + b;
+}
+
+/* { dg-final { scan-tree-dump "clone.* \\(_ZGVxN4vvv_addit" "optimized" } } */
+/* { dg-final { scan-tree-dump "clone.* \\(_ZGVxM4vvv_addit" "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
diff --git a/gcc/testsuite/gcc.dg/gomp/simd-clones-4.c b/gcc/testsuite/gcc.dg/gomp/simd-clones-4.c
new file mode 100644
index 0000000..893f44e
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/gomp/simd-clones-4.c
@@ -0,0 +1,11 @@ 
+/* { dg-do compile { target i?86-*-* x86_64-*-* } } */
+/* { dg-options "-fopenmp" } */
+
+#pragma omp declare simd simdlen(4) notinbranch
+int f2 (int a, int b)
+{
+  if (a > 5)
+    return a + b;
+  else
+    return a - b;
+}
diff --git a/gcc/tree-core.h b/gcc/tree-core.h
index a14c7e0..c6b0c72 100644
--- a/gcc/tree-core.h
+++ b/gcc/tree-core.h
@@ -886,6 +886,9 @@  struct GTY(()) tree_base {
        CALL_ALLOCA_FOR_VAR_P in
            CALL_EXPR
 
+       OMP_CLAUSE_LINEAR_VARIABLE_STRIDE in
+	   OMP_CLAUSE_LINEAR
+
    side_effects_flag:
 
        TREE_SIDE_EFFECTS in
diff --git a/gcc/tree-pass.h b/gcc/tree-pass.h
index e72fe9a..41e8794 100644
--- a/gcc/tree-pass.h
+++ b/gcc/tree-pass.h
@@ -473,6 +473,7 @@  extern ipa_opt_pass_d *make_pass_ipa_pure_const (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_ipa_pta (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_lto_finish_out (gcc::context *ctxt);
 extern simple_ipa_opt_pass *make_pass_ipa_tm (gcc::context *ctxt);
+extern simple_ipa_opt_pass *make_pass_omp_simd_clone (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_profile (gcc::context *ctxt);
 extern ipa_opt_pass_d *make_pass_ipa_cdtor_merge (gcc::context *ctxt);
 
diff --git a/gcc/tree-sra.c b/gcc/tree-sra.c
index 82520ba..8d61c35 100644
--- a/gcc/tree-sra.c
+++ b/gcc/tree-sra.c
@@ -4486,7 +4486,7 @@  replace_removed_params_ssa_names (gimple stmt,
    incompatibility issues to the caller.  Return true iff the expression
    was modified. */
 
-static bool
+bool
 sra_ipa_modify_expr (tree *expr, bool convert,
 		     ipa_parm_adjustment_vec adjustments)
 {
@@ -4624,7 +4624,7 @@  sra_ipa_modify_assign (gimple *stmt_ptr, gimple_stmt_iterator *gsi,
 /* Traverse the function body and all modifications as described in
    ADJUSTMENTS.  Return true iff the CFG has been changed.  */
 
-static bool
+bool
 ipa_sra_modify_function_body (ipa_parm_adjustment_vec adjustments)
 {
   bool cfg_changed = false;
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index 7d9c9ed..f50a5b1 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -1688,6 +1688,16 @@  tree
 vectorizable_function (gimple call, tree vectype_out, tree vectype_in)
 {
   tree fndecl = gimple_call_fndecl (call);
+  struct cgraph_node *node = cgraph_get_node (fndecl);
+
+  if (node->has_simd_clones)
+    {
+      struct cgraph_node *clone = get_simd_clone (node, vectype_out);
+      if (clone)
+	return clone->symbol.decl;
+      /* Fall through in case we ever add support for
+	 non-built-ins.  */
+    }
 
   /* We only handle functions that do not read or clobber memory -- i.e.
      const or novops ones.  */
@@ -1758,10 +1768,12 @@  vectorizable_call (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
   vectype_in = NULL_TREE;
   nargs = gimple_call_num_args (stmt);
 
-  /* Bail out if the function has more than three arguments, we do not have
-     interesting builtin functions to vectorize with more than two arguments
-     except for fma.  No arguments is also not good.  */
-  if (nargs == 0 || nargs > 3)
+  /* Bail out if the function has more than three arguments.  We do
+     not have interesting builtin functions to vectorize with more
+     than two arguments except for fma (unless we have SIMD clones).
+     No arguments is also not good.  */
+  struct cgraph_node *node = cgraph_get_node (gimple_call_fndecl (stmt));
+  if (nargs == 0 || (!node->has_simd_clones && nargs > 3))
     return false;
 
   /* Ignore the argument of IFN_GOMP_SIMD_LANE, it is magic.  */
diff --git a/gcc/tree.h b/gcc/tree.h
index 8200c2e..aacb22b 100644
--- a/gcc/tree.h
+++ b/gcc/tree.h
@@ -1318,6 +1318,10 @@  extern void protected_set_expr_location (tree, location_t);
 #define OMP_CLAUSE_LINEAR_NO_COPYOUT(NODE) \
   TREE_PRIVATE (OMP_CLAUSE_SUBCODE_CHECK (NODE, OMP_CLAUSE_LINEAR))
 
+/* True if a LINEAR clause has a stride that is variable.  */
+#define OMP_CLAUSE_LINEAR_VARIABLE_STRIDE(NODE) \
+  TREE_PROTECTED (OMP_CLAUSE_SUBCODE_CHECK (NODE, OMP_CLAUSE_LINEAR))
+
 #define OMP_CLAUSE_LINEAR_STEP(NODE) \
   OMP_CLAUSE_OPERAND (OMP_CLAUSE_SUBCODE_CHECK (NODE, OMP_CLAUSE_LINEAR), 1)