Patchwork Merging Cilk Plus into Trunk (Patch 1 of approximately 22)

Submitter Iyer, Balaji V
Date Sept. 5, 2012, 10:09 p.m.
Message ID <BF230D13CA30DD48930C31D40993300016C62236@FMSMSX102.amr.corp.intel.com>
Permalink /patch/181974/
State New

Comments

Iyer, Balaji V - Sept. 5, 2012, 10:09 p.m.
Hello Everyone,
	Attached, please find the 1st of ~22 patches that implement Cilk Plus. This patch implements elemental functions in the C compiler.  Please check it in to the trunk if it looks OK.

	Below, I will give you a small example of what an elemental function is and how it can be useful. Details about elemental functions can be found at the following link (http://software.intel.com/en-us/articles/elemental-functions-writing-data-parallel-code-in-cc-using-intel-cilk-plus)

Let's say we have two for loops like this:

int my_func (int x, int y);

for (ii = 0; ii < 10000; ii++)
	X[ii] = my_func (Y[ii], Z[ii]);

for (jj = 1; jj < 10000; jj++)
	A[jj] = my_func (B[jj], A[jj-1]) + A[jj-1];


Assume that my_func's body is not visible during this compilation (e.g., it is in some library).

If a vectorized version of my_func is available, then the first for loop can be vectorized. However, even if such a version of my_func is available, the 2nd for loop cannot be vectorized, because each iteration reads A[jj-1], which the previous iteration has just written. It is therefore beneficial to have both a vectorized version and a scalar version of my_func. This is where an elemental function comes into play. If we annotate *both* the function declaration and the function definition with the following attribute, the compiler will create a vector and a scalar version of the function.

__attribute__((vector)) int my_func (int x, int y);

__attribute__((vector)) int my_func (int x, int y)
    {
      ... /* Body of the function.  */
    }
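
To make the difference between the two loops concrete, here is a minimal sketch in plain C using the GCC/Clang vector extension. It is not the code the patch generates: the body of my_func is a stand-in (the thread never shows one), and my_func_v4, first_loop and second_loop are hypothetical names chosen for illustration.

```c
#include <stddef.h>
#include <string.h>

/* GCC/Clang vector extension: four 32-bit ints in one 16-byte vector.  */
typedef int v4si __attribute__ ((vector_size (16)));

/* Scalar version.  The body is a stand-in for illustration only.  */
int my_func (int x, int y)
{
  return x + x + y;
}

/* Hypothetical vector variant: the same computation on four lanes.  */
v4si my_func_v4 (v4si x, v4si y)
{
  return x + x + y;
}

/* First loop: iterations are independent, so four elements can be
   handed to the vector variant at a time.  Leftover elements fall
   back to the scalar version.  */
void first_loop (int *restrict X, const int *restrict Y,
                 const int *restrict Z, size_t n)
{
  size_t i = 0;
  for (; i + 4 <= n; i += 4)
    {
      v4si y, z, r;
      memcpy (&y, &Y[i], sizeof y);
      memcpy (&z, &Z[i], sizeof z);
      r = my_func_v4 (y, z);
      memcpy (&X[i], &r, sizeof r);
    }
  for (; i < n; i++)
    X[i] = my_func (Y[i], Z[i]);
}

/* Second loop: A[j] needs A[j-1] from the previous iteration, so the
   calls must run one after another with the scalar version.  */
void second_loop (int *A, const int *B, size_t n)
{
  for (size_t j = 1; j < n; j++)
    A[j] = my_func (B[j], A[j - 1]) + A[j - 1];
}
```

The first loop can use the four-lane variant because no iteration reads what another iteration writes; the second cannot, which is why the scalar version has to exist alongside the vector one.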

If the architecture allows, the vector version will accept the parameters through vector registers and return the result through a vector register. There are several clauses that can be used to convey information about function parameters. The vector version of the function follows a mangling format that is based on the clauses provided and the vectorlength. All these changes are transparent to the user. A great application for elemental functions is writing libraries and reusable functions. Detailed explanations and syntax about these can be found in the Cilk Plus language specification document: http://software.intel.com/sites/default/files/m/6/3/1/cilk_plus_language_specification.pdf

Here are the ChangeLog entries:

============================================================================================
gcc/ChangeLog
2012-09-05  Balaji V. Iyer  <balaji.v.iyer@intel.com>

        * attribs.c (is_elem_fn_attribute_p): New function.
        (decl_attributes): Added a check for Elemental function attribute when
        Cilk Plus is enabled.
        * cgraphunit.c (cgraph_decide_is_function_needed): Added a check for
        cloned elemental function when Cilk Plus is enabled.
        (cgraph_add_new_function): When Cilk Plus is enabled we call
        cgraph_get_create_node.
        (cgraph_analyze_functions): Added a check if the function call is a
        cloned elemental function when Cilk Plus is enabled.
        * elem-function-common.c (find_processor_code): New function.
        (find_vlength_code): Likewise.
        (rename_elem_fn): Likewise.
        (find_suffix): Likewise.
        (find_elem_fn_parm_type_1): Likewise.
        (find_elem_fn_parm_type): Likewise.
        (find_elem_fn_name): Likewise.
        (extract_elem_fn_values): Likewise.
        (is_elem_fn): Likewise.
        * expr.c (expand_expr_real_1): Added a check if Cilk Plus is enabled.
        * function.h (struct function): Added elem_fn_already_cloned field.
        * gimplify.c (gimplify_function_tree): Added a check if Cilk Plus is
        enabled and if the function is an elemental function.  If so, then call
        the function to clone elemental function.
        * langhooks.c (lhd_elem_fn_create_fn): New function.
        * langhooks-def.h (LANG_HOOKS_CILKPLUS): New define.
        (LANG_HOOK_DECLS): Added LANG_HOOKS_CILKPLUS field.
        * langhooks.h (struct lang_hooks_for_cilkplus): New struct.
        (struct lang_hooks): Added a field called cilkplus.
        * tree.h (tree_function_decl): Added a new field called
        elem_fn_already_cloned.
        (DECL_ELEM_FN_ALREADY_CLONED): New define.
        * tree-data-ref.c (find_data_references_in_stmt): Added a check for
        an elemental function call when Cilk Plus is enabled.
        * tree-inline.c (elem_fn_copy_arguments_for_versioning): New function.
        (initialize_elem_fn_cfun): Likewise.
        (tree_elem_fn_versioning): Likewise.
        * tree-vect-stmts.c (vect_get_vec_def_for_operand): Check parm type for
        an elemental function when Cilk Plus is enabled and set data definition
        accordingly.
        (elem_fn_vect_get_vec_def_for_operand): New function.
        (vect_finish_stmt_generation): Added a check for elemental function.
        (vectorizable_function): Check if the function call is a Cilk Plus
        elemental function.  If so, then insert the appropriate mangled name.
        (vectorizable_call): Eliminate the argument requirement when Cilk Plus
        is enabled for vectorization.  Also, set the appropriate data def. for
        an elemental function call.
        (elem_fn_linear_init_vector): New function.
        * tree.c (build_elem_fn_linear_vector): Likewise.

gcc/c-family/ChangeLog
2012-09-05  Balaji V. Iyer  <balaji.v.iyer@intel.com>

        * c-common.c (struct c_common_attribute_table): Added vector
        attribute for Cilk Plus elemental function.
        (handle_vector_attribute): New function.
        * c-cpp-elem-function.c (create_processor_attribute): Likewise.
        (create_optimize_attribute): Likewise.
        (replace_return_with_new_var): Likewise.
        (elem_fn_build_array): Likewise.
        (replace_array_ref_for_vec): Likewise.
        (fix_elem_fn_return_value): Likewise.
        (add_elem_fn_loop): Likewise.
        (add_elem_fn_mask): Likewise.
        (call_graph_add_fn): Likewise.
        (elem_fn_create_fn): Likewise.
        * c.opt (-fcilkplus): Added new flag.

gcc/c/ChangeLog
2012-09-05  Balaji V. Iyer  <balaji.v.iyer@intel.com>

        * c-decl.c (bind): Added a check for non NULL scope.
        * c-parser.c (c_parser_declaration_or_fndef): Added a check if Cilk
        Plus defined.  If so, then we save the arguments for a function
        declaration.
        (c_parser_attributes): Added a check if Cilk Plus is enabled and if
        elemental function vector attribute is given.  If so, then call the
        function c_parser_elem_fn_expr_list ().
        (c_parser_elem_fn_processor_clause): New function.
        (c_parser_elem_fn_uniform_clause): Likewise.
        (c_parser_elem_fn_linear_clause): Likewise.
        (c_parser_elem_fn_vlength_clause): Likewise.
        (c_parser_elem_fn_expr_list): Likewise.

===========================================================================================

Thanking You,

Yours Sincerely,

Balaji V. Iyer.
Joseph S. Myers - Sept. 6, 2012, 12:07 a.m.
On Wed, 5 Sep 2012, Iyer, Balaji V wrote:

> 	Attached, please find the 1st of ~22 patches that implements Cilk 
> Plus. This patch will implement Elemental Functions into the C compiler.  
> Please check it in to the trunk if it looks OK.
> 
> 	Below, I will give you a small example about what elemental 
> function is and how it can be useful. Details about elemental function 
> can be found in the following link 
> (http://software.intel.com/en-us/articles/elemental-functions-writing-data-parallel-code-in-cc-using-intel-cilk-plus)

That page says "To continue reading the article, click on the link below." 
but I don't see such a link below.

>         * c-cpp-elem-function.c (create_processor_attribute): Likewise.

I don't see a ChangeLog entry for the addition of this file at all.  When 
a new file is added, "New file." is enough entry; you don't describe 
particular things within the file.

This file includes tm.h and tm_p.h.  Inclusion of these headers from 
front-end code is deprecated.  If they are really needed, please put 
comments on the includes about exactly what target macros are being used 
in this front-end code.  Similarly, use of hard-reg-set.h in front-end 
code is doubtful.  Generally, please check all #includes in all new source 
files and make sure that each include is actually needed because some 
functionality from the relevant header is used in the source file; do not 
just copy the headers included by some existing source file.

create_processor_attribute contains hardcoded references to x86-specific 
functionality.  This is not OK; all such target dependencies need to be 
kept within the back ends, and handled from the rest of the compiler via 
target hooks (in most cases, new target dependencies must use target hooks 
not target macros).

Please make sure every new function has a comment explicitly describing 
the semantics of every parameter and the return value as well as anything 
else the function does.

Where there are alternative versions of functions/macros with/without 
explicit locations, please use the forms with explicit locations (e.g. 
build2_loc instead of build2), and try to link the locations to particular 
source code tokens and pass those locations down explicitly to each 
function as needed.

There may be more issues; I'll await a revised patch before doing further 
review.
Gabriel Dos Reis - Sept. 6, 2012, 12:12 a.m.
On Wed, Sep 5, 2012 at 5:09 PM, Iyer, Balaji V <balaji.v.iyer@intel.com> wrote:
> Hello Everyone,
>         Attached, please find the 1st of ~22 patches that implements Cilk Plus. This patch will implement Elemental Functions into the C compiler.  Please check it in to the trunk if it looks OK.
>
>         Below, I will give you a small example about what elemental function is and how it can be useful. Details about elemental function can be found in the following link (http://software.intel.com/en-us/articles/elemental-functions-writing-data-parallel-code-in-cc-using-intel-cilk-plus)
>
> Let's say we have two for loops like this:
>
> int my_func (int x, int y);
>
> For (ii = 0; ii < 10000; ii++)
>         X[ii] = my_func (Y[ii], Z[ii]);
>
> For (jj = 1; jj < 10000; jj++) {
>         A[jj] =  my_func (B[ii], A[jj-1]) + A[jj-1];
>
>
> Assume that my_func's body is not visible during this compilation (e.g it is in some library).
>
> If a vectorized version for my_func is available, then the first for loop can be vectorized. However, even if such a version of my_func is available, the 2nd for loop cannot be vectorized. It would be beneficial if there is a vectorized version and a scalar version of my_func. This is where an elemental function comes to play. If we annotate *both* the function declaration and the function with the following attribute, the compiler will create a vector and scalar version of the function.
>
> __attribute__((vector)) my_func (int x, int y);
>
> __attribute__((vector)) my_func (int x, int y)
>     {
>       ... /* Body of the function.  */
>     }

1.  You should consider a different name for the attribute.

2. Considering this example, won't you get the same behaviour
     if my_func was declared with "pure" attribute?  If not, why?

-- Gaby
Marc Glisse - Sept. 6, 2012, 6:06 a.m.
On Wed, 5 Sep 2012, Gabriel Dos Reis wrote:

> On Wed, Sep 5, 2012 at 5:09 PM, Iyer, Balaji V <balaji.v.iyer@intel.com> wrote:
>> Let's say we have two for loops like this:
>>
>> int my_func (int x, int y);
>>
>> For (ii = 0; ii < 10000; ii++)
>>         X[ii] = my_func (Y[ii], Z[ii]);

I assume X, Y and Z are __restrict pointers (or something the compiler can 
detect doesn't alias).

> 2. Considering this example, won't you get the same behaviour
>     if my_func was declared with "pure" attribute?  If not, why?

AFAIU, my_func is defined in a separate library and because of the 
attribute on the definition, it will actually export overloads:
int myfunc(int,int);
v2si myfunc(v2si,v2si);
v4si myfunc(v4si,v4si);
etc (where does it stop? seems problematic if the library is compiled for 
sse4 and I then compile and link an avx program)

(hopefully with implementations more clever than breaking the vectors into 
pieces and calling the basic myfunc on each)

The attribute on the declaration then lets gcc's vectorizer know it can 
call those overloads.

With suitable pure/const attribute you could unroll the loop a bit and 
reorder the calls to myfunc, but without myfunc's body, you couldn't do as 
much.
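
As a minimal sketch of what "pure" alone buys, assuming a stand-in body for my_func and a hypothetical caller named fill: the attribute tells the compiler that the call has no side effects, so repeated identical calls may be merged and calls may be reordered, but every remaining call still processes a single element.

```c
/* Declared pure: the result depends only on the arguments and on
   memory the function reads, and the call has no side effects.  */
__attribute__ ((pure)) int my_func (int x, int y);

/* Hypothetical caller.  With "pure" the compiler is allowed to
   evaluate my_func once per iteration instead of twice, and to
   reorder calls across iterations.  Nothing here gives it a
   multi-lane entry point to call.  */
void fill (int *restrict X, const int *restrict Y,
           const int *restrict Z, int n)
{
  for (int i = 0; i < n; i++)
    X[i] = my_func (Y[i], Z[i]) + my_func (Y[i], Z[i]);
}

/* Stand-in definition; in the scenario discussed here it would live
   in a separate library.  */
int my_func (int x, int y)
{
  return x - y;
}
```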

Note that this is my guess from reading the example and completely 
ignoring the patch, it could be miles from the truth, and it needs better 
explanation (the doc patch is coming later in the series IIRC).
Richard Guenther - Sept. 6, 2012, 9:37 a.m.
On Thu, Sep 6, 2012 at 8:06 AM, Marc Glisse <marc.glisse@inria.fr> wrote:
> On Wed, 5 Sep 2012, Gabriel Dos Reis wrote:
>
>> On Wed, Sep 5, 2012 at 5:09 PM, Iyer, Balaji V <balaji.v.iyer@intel.com>
>> wrote:
>>>
>>> Let's say we have two for loops like this:
>>>
>>> int my_func (int x, int y);
>>>
>>> For (ii = 0; ii < 10000; ii++)
>>>         X[ii] = my_func (Y[ii], Z[ii]);
>
>
> I assume X, Y and Z are __restrict pointers (or something the compiler can
> detect doesn't alias).
>
>
>> 2. Considering this example, won't you get the same behaviour
>>     if my_func was declared with "pure" attribute?  If not, why?
>
>
> AFAIU, my_func is defined in a separate library and because of the attribute
> on the definition, it will actually export overloads:
> int myfunc(int,int);
> v2si myfunc(v2si,v2si);
> v4si myfunc(v4si,v4si);
> etc (where does it stop? seems problematic if the library is compiled for
> sse4 and I then compile and link an avx program)
>
> (hopefully with implementations more clever than breaking the vectors into
> pieces and calling the basic myfunc on each)
>
> The attribute on the declaration then lets gcc's vectorizer know it can call
> those overloads.

And as the overloads definitions are not guaranteed to be generated by GCC
you need to specify the ABI and mangling of those overloads.

+static tree
+handle_vector_attribute (tree *node, tree name ATTRIBUTE_UNUSED,
+			 tree args ATTRIBUTE_UNUSED,
+			 int ARG_UNUSED (flags), bool *no_add_attrs)
+{
+  tree opt_list;
+  VEC(tree,gc) *opt_vec = NULL;
+  opt_vec = make_tree_vector ();
+  VEC_safe_push (tree, gc, opt_vec, build_string (2, "O3"));
+  opt_list = build_tree_list_vec (opt_vec);
+  release_tree_vector (opt_vec);
+  handle_optimize_attribute (node, get_identifier ("optimize"), opt_list,
+			     flags, no_add_attrs);

Please no - do not use "optimize" attributes from inside the implementation.
What happens if the user also specifies an optimize attribute?
The above also doesn't make sense to me, so please elaborate on why you
want to enable -O3 for a function marked with the vector attribute.

This all sounds awfully like a worse way to do the multi-versioning stuff
that is still pending review.

+  if (flag_enable_cilkplus
+      && gimple_code (stmt) == GIMPLE_CALL
+      && is_elem_fn (gimple_call_fndecl (stmt)))
+    {
+      parm_type = find_elem_fn_parm_type (stmt, op, &step_size);
+      if (parm_type == TYPE_UNIFORM || parm_type == TYPE_LINEAR)
+	dt = vect_external_def;

The middle-end should not care if CILK+ is enabled or not.  Otherwise
this will not work with LTO.  Please use generic infrastructure for the
implementation or enhance generic infrastructure.

If the vectorizer should be able to vectorize non-inlined functions then
there should be an IPA pass analyzing functions for whether they can
be "elemental" (propagating this alongside the callgraph).  Then
you either decide up-front whether to "clone" those functions for
various vector sizes, or, IMHO better, make sure to ship the function
bodies to all LTRANS units that make use of them (much similar to
how we handle inlines) and make the vectorizer emit the clones.

In all this seems unrelated to CILK+ work (even if you make use of this
from within CILK+).

Richard.

> With suitable pure/const attribute you could unroll the loop a bit and
> reorder the calls to myfunc, but without myfunc's body, you couldn't do as
> much.
>
> Note that this is my guess from reading the example and completely ignoring
> the patch, it could be miles from the truth, and it needs better explanation
> (the doc patch is coming later in the series IIRC).
>
> --
> Marc Glisse
Gabriel Dos Reis - Sept. 6, 2012, 9:54 a.m.
On Thu, Sep 6, 2012 at 1:06 AM, Marc Glisse <marc.glisse@inria.fr> wrote:
> On Wed, 5 Sep 2012, Gabriel Dos Reis wrote:
>
>> On Wed, Sep 5, 2012 at 5:09 PM, Iyer, Balaji V <balaji.v.iyer@intel.com>
>> wrote:
>>>
>>> Let's say we have two for loops like this:
>>>
>>> int my_func (int x, int y);
>>>
>>> For (ii = 0; ii < 10000; ii++)
>>>         X[ii] = my_func (Y[ii], Z[ii]);
>
>
> I assume X, Y and Z are __restrict pointers (or something the compiler can
> detect doesn't alias).

I assumed this much.


>
>
>> 2. Considering this example, won't you get the same behaviour
>>     if my_func was declared with "pure" attribute?  If not, why?
>
>
> AFAIU, my_func is defined in a separate library and because of the attribute
> on the definition, it will actually export overloads:
> int myfunc(int,int);
> v2si myfunc(v2si,v2si);
> v4si myfunc(v4si,v4si);
> etc (where does it stop? seems problematic if the library is compiled for
> sse4 and I then compile and link an avx program)

Thanks but I was not talking of anything this complicated.
The "pure" attribute has nothing to do with overloading?


>
> (hopefully with implementations more clever than breaking the vectors into
> pieces and calling the basic myfunc on each)
>
> The attribute on the declaration then lets gcc's vectorizer know it can call
> those overloads.

My question was why the same conclusion won't be reached on the
example given if the function was declared with the "pure" attribute.

>
> With suitable pure/const attribute you could unroll the loop a bit and
> reorder the calls to myfunc, but without myfunc's body, you couldn't do as
> much.
>
> Note that this is my guess from reading the example and completely ignoring
> the patch, it could be miles from the truth, and it needs better explanation
> (the doc patch is coming later in the series IIRC).

Note that the example given was a function taking two ints and returning
an int. How would a function with "pure" attribute fool the vectorizer?

-- Gaby
Marc Glisse - Sept. 6, 2012, 11:03 a.m.
On Thu, 6 Sep 2012, Marc Glisse wrote:

> AFAIU, my_func is defined in a separate library and because of the attribute 
> on the definition, it will actually export overloads:
> int myfunc(int,int);
> v2si myfunc(v2si,v2si);
> v4si myfunc(v4si,v4si);
> etc (where does it stop? seems problematic if the library is compiled for 
> sse4 and I then compile and link an avx program)

According to the doc, it only generates one of these vector versions (even 
more risk of mismatch).

Does it actually create the extra declaration in the front-end, i.e. can I 
explicitly call myfunc on a v4si that I created myself, or is the 
middle-end the only user?
Richard Henderson - Sept. 6, 2012, 3:51 p.m.
On 09/06/2012 02:37 AM, Richard Guenther wrote:
> In all this seems unrelated to CILK+ work (even if you make use of this
> from within CILK+).

While true, we also asked him to split up the work.  And this piece,
done correctly, seems useful even if the rest of cilk is ignored.


r~
Iyer, Balaji V - Sept. 6, 2012, 4:03 p.m.
Hello Joseph,
	Thanks for reviewing my patch. Please see my responses below:

>-----Original Message-----
>From: Joseph Myers [mailto:joseph@codesourcery.com]
>Sent: Wednesday, September 05, 2012 8:07 PM
>To: Iyer, Balaji V
>Cc: gcc-patches@gcc.gnu.org; Aldy Hernandez (aldyh@redhat.com); Jeff Law;
>rth@redhat.com
>Subject: Re: [PATCH] Merging Cilk Plus into Trunk (Patch 1 of approximately 22)
>
>On Wed, 5 Sep 2012, Iyer, Balaji V wrote:
>
>> 	Attached, please find the 1st of ~22 patches that implements Cilk
>> Plus. This patch will implement Elemental Functions into the C compiler.
>> Please check it in to the trunk if it looks OK.
>>
>> 	Below, I will give you a small example about what elemental function
>> is and how it can be useful. Details about elemental function can be
>> found in the following link
>> (http://software.intel.com/en-us/articles/elemental-functions-writing-
>> data-parallel-code-in-cc-using-intel-cilk-plus)
>
>That page says "To continue reading the article, click on the link below."
>but I don't see such a link below.

Sorry about that. We recently updated the website, and I think the page may have been broken in the process. I have contacted the appropriate people to fix it.  I will let you know as soon as the link is fixed.

>
>>         * c-cpp-elem-function.c (create_processor_attribute): Likewise.
>
>I don't see a ChangeLog entry for the addition of this file at all.  When a new file
>is added, "New file." is enough entry; you don't describe particular things within
>the file.

Ok, I was mistaken there. I thought we had to add a changelog entry for every function and not every file. I will fix it in the updated patch I send soon.

>
>This file includes tm.h and tm_p.h.  Inclusion of these headers from front-end
>code is deprecated.  If they are really needed, please put comments on the
>includes about exactly what target macros are being used in this front-end code.
>Similarly, use of hard-reg-set.h in front-end code is doubtful.  Generally, please
>check all #includes in all new source files and make sure that each include is
>actually needed because some functionality from the relevant header is used in
>the source file; do not just copy the headers included by some existing source file.

OK, I will look into this. I don't believe I am using tm_p.h or tm.h. I put them in just in case.

>
>create_processor_attribute contains hardcoded references to x86-specific
>functionality.  This is not OK; all such target dependencies need to be kept within
>the back ends, and handled from the rest of the compiler via target hooks (in
>most cases, new target dependencies must use target hooks not target macros).

The only thing I am doing in that function is adding the appropriate attribute. For elemental functions, there is a processor clause that allows users to set the type of processor they want the function compiled for. All I am doing is mapping that information to the appropriate "arch" attribute. I didn't think it had any back-end peculiarity.

>
>Please make sure every new function has a comment explicitly describing the
>semantics of every parameter and the return value as well as anything else the
>function does.

I was trying to follow suit with other functions nearby. Most functions had one- or two-line header comments that give a quick description of the function.

>
>Where there are alternative versions of functions/macros with/without explicit
>locations, please use the forms with explicit locations (e.g.
>build2_loc instead of build2), and try to link the locations to particular source
>code tokens and pass those locations down explicitly to each function as needed.

I have tried to preserve the location wherever appropriate. In many cases I used UNKNOWN_LOCATION or omitted the location information because the code is internally generated and therefore has no line number.

>
>There may be more issues; I'll await a revised patch before doing further review.

I will send another one ASAP.

>
>--
>Joseph S. Myers
>joseph@codesourcery.com

Yours Sincerely,

Balaji V. Iyer.
Richard Henderson - Sept. 6, 2012, 4:08 p.m.
On 09/05/2012 03:09 PM, Iyer, Balaji V wrote:
> If we annotate *both* the function declaration and the function with the following attribute, the compiler will create a vector and scalar version of the function. 
> 
> __attribute__((vector)) my_func (int x, int y);
> 
> __attribute__((vector)) my_func (int x, int y) 
>     {
>       ... /* Body of the function.  */
>     }

I know Marc Glisse has already brought this up down-thread, but I'll
re-iterate for emphasis: You cannot possibly form a stable, exportable
ABI with this alone.

At minimum I would say that the vectorlength parameter would have to
be mandatory on declarations to be useful.  One could reasonably leave
them to default on definitions, or have explicit vectorlength(default)
for declarations internal to a project (i.e. assert that the file that
contains the elemental function is compiled with the same compile flags
and so will make the same choices for default).
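
The mismatch can be modelled with a small sketch. Everything in it is hypothetical: the table stands for the symbols a library built for a 4-lane target would export, and variant_available stands for what a caller needs to resolve at link time. It does not reflect the mangling used by the patch.

```c
#include <stddef.h>
#include <string.h>

/* One exported version of an elemental function (hypothetical model).  */
struct variant
{
  const char *base;   /* Name of the scalar function.  */
  int vlen;           /* Number of lanes this version was built for.  */
};

/* What a library compiled for a 4-lane target would export in this
   model: the scalar version plus exactly one vector version.  */
static const struct variant exported[] = {
  { "my_func", 1 },
  { "my_func", 4 },
};

/* Return 1 if a version with exactly VLEN lanes was exported for BASE,
   0 otherwise.  */
int variant_available (const char *base, int vlen)
{
  for (size_t i = 0; i < sizeof exported / sizeof exported[0]; i++)
    if (strcmp (exported[i].base, base) == 0 && exported[i].vlen == vlen)
      return 1;
  return 0;
}
```

A caller compiled with a default of 8 lanes would ask for ("my_func", 8) and find nothing, which is the argument for pinning vectorlength on the declaration.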

I see that Intel does not even begin to address this within their own
documentation.  Is the problem of ABIs vs target cpus really being ignored?


r~
Iyer, Balaji V - Sept. 6, 2012, 4:11 p.m.
Hello Marc,
	Please see my response below. 

Thanks for looking at my patch!

Sincerely,

Balaji V. Iyer.

>-----Original Message-----
>From: Marc Glisse [mailto:marc.glisse@inria.fr]
>Sent: Thursday, September 06, 2012 2:06 AM
>To: Gabriel Dos Reis
>Cc: Iyer, Balaji V; gcc-patches@gcc.gnu.org; Aldy Hernandez
>(aldyh@redhat.com); Jeff Law; rth@redhat.com
>Subject: Re: [PATCH] Merging Cilk Plus into Trunk (Patch 1 of approximately 22)
>
>On Wed, 5 Sep 2012, Gabriel Dos Reis wrote:
>
>> On Wed, Sep 5, 2012 at 5:09 PM, Iyer, Balaji V <balaji.v.iyer@intel.com> wrote:
>>> Let's say we have two for loops like this:
>>>
>>> int my_func (int x, int y);
>>>
>>> For (ii = 0; ii < 10000; ii++)
>>>         X[ii] = my_func (Y[ii], Z[ii]);
>
>I assume X, Y and Z are __restrict pointers (or something the compiler can detect
>doesn't alias).

Yes, the compiler must detect that.

>
>> 2. Considering this example, won't you get the same behaviour
>>     if my_func was declared with "pure" attribute?  If not, why?
>
>AFAIU, my_func is defined in a separate library and because of the attribute on
>the definition, it will actually export overloads:
>int myfunc(int,int);
>v2si myfunc(v2si,v2si);
>v4si myfunc(v4si,v4si);
>etc (where does it stop? seems problematic if the library is compiled for
>sse4 and I then compile and link an avx program)

The user can provide at most 1 vector length, and the compiler will map it to the appropriate vector value. If the user omits the vectorlength clause, then the compiler picks a vectorlength based on the architecture's vector units and the data width. So, it will stop at 2 (1 scalar and 1 vector) :-).

>
>(hopefully with implementations more clever than breaking the vectors into
>pieces and calling the basic myfunc on each)
>
>The attribute on the declaration then lets gcc's vectorizer know it can call those
>overloads.
>
>With suitable pure/const attribute you could unroll the loop a bit and reorder the
>calls to myfunc, but without myfunc's body, you couldn't do as much.
>
>Note that this is my guess from reading the example and completely ignoring the
>patch, it could be miles from the truth, and it needs better explanation (the doc
>patch is coming later in the series IIRC).
>
>--
>Marc Glisse
Gabriel Dos Reis - Sept. 6, 2012, 4:16 p.m.
On Thu, Sep 6, 2012 at 10:51 AM, Richard Henderson <rth@redhat.com> wrote:
> On 09/06/2012 02:37 AM, Richard Guenther wrote:
>> In all this seems unrelated to CILK+ work (even if you make use of this
>> from within CILK+).
>
> While true, we also asked him to split up the work.  And this piece,
> done correctly, seems useful even if the rest of cilk is ignored.

Fully agreed.

The language/front-end modifications need more discussion and
explanation though.

-- Gaby
Joseph S. Myers - Sept. 6, 2012, 4:17 p.m.
On Thu, 6 Sep 2012, Iyer, Balaji V wrote:

> Ok, I was mistaken there. I thought we had to add a changelog entry for 
> every function and not every file. I will fix it in the updated patch I 
> send soon.

For functions in existing files you do need to mention each function - but 
not for new files.

> >create_processor_attribute contains hardcoded references to x86-specific
> >functionality.  This is not OK; all such target dependencies need to be kept within
> >the back ends, and handled from the rest of the compiler via target hooks (in
> >most cases, new target dependencies must use target hooks not target macros).
> 
> The only thing I am doing in that function is to add appropriate 
> attribute. In elemental function, there is a processor clause that will 
> allow users to set the type of processor they want the function compiled 
> for. All I am doing is to map that information to the appropriate "arch" 
> attribute. I didn't think it had any back-end peculiarity.

Concepts such as "pentium_4" are architecture-specific and have no place 
in front-end files.  This whole mapping from one sort of string to another 
belongs within the back end.
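
A minimal sketch of that separation, with every name invented for illustration (struct target_hooks, processor_clause_to_arch, targetm_sketch and front_end_arch_for_clause are not GCC's actual hooks). The processor string comes from the comment above; the "pentium4" result is only an example value.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical hook table.  The idea is a struct of function pointers
   that each back end fills in.  */
struct target_hooks
{
  /* Map a processor-clause name to an "arch" string, or return NULL
     if the target does not recognise the name.  */
  const char *(*processor_clause_to_arch) (const char *clause);
};

/* Back-end side: the only place where processor names appear.  */
static const char *x86_processor_clause_to_arch (const char *clause)
{
  if (strcmp (clause, "pentium_4") == 0)
    return "pentium4";
  return NULL;
}

static const struct target_hooks targetm_sketch = {
  x86_processor_clause_to_arch
};

/* Front-end side: forwards the clause through the hook and never
   mentions an architecture itself.  */
const char *front_end_arch_for_clause (const char *clause)
{
  return targetm_sketch.processor_clause_to_arch (clause);
}
```

With this split, supporting another architecture means providing another hook implementation; the front-end function stays unchanged.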
Gabriel Dos Reis - Sept. 6, 2012, 4:25 p.m.
On Thu, Sep 6, 2012 at 11:11 AM, Iyer, Balaji V <balaji.v.iyer@intel.com> wrote:

>>> On Wed, Sep 5, 2012 at 5:09 PM, Iyer, Balaji V <balaji.v.iyer@intel.com> wrote:
>>>> Let's say we have two for loops like this:
>>>>
>>>> int my_func (int x, int y);
>>>>
>>>> For (ii = 0; ii < 10000; ii++)
>>>>         X[ii] = my_func (Y[ii], Z[ii]);
>>
>>I assume X, Y and Z are __restrict pointers (or something the compiler can detect
>>doesn't alias).
>
> Yes, the compiler must detect that.

Exactly what do you mean by detect?
That the user has supplied somewhere a declaration saying the pointers
are restricted? (Note that C++11 does not have restrict).  Or that the
compiler must perform an alias analysis?

>>> 2. Considering this example, won't you get the same behaviour
>>>     if my_func was declared with "pure" attribute?  If not, why?
>>
>>AFAIU, my_func is defined in a separate library and because of the attribute on
>>the definition, it will actually export overloads:
>>int myfunc(int,int);
>>v2si myfunc(v2si,v2si);
>>v4si myfunc(v4si,v4si);
>>etc (where does it stop? seems problematic if the library is compiled for
>>sse4 and I then compile and link an avx program)
>
> The user can provide at most 1 vector length and the compiler will map it to appropriate vector value. If the user omits the vectorlength clause then the compiler picks a vectorlength based on the architecture's vector units and the data width. So, it will stop at 2 (1 scalar and 1 vector) :-).

Is that part of the ABI and the function declaration?
Note that in C++11 (and I suppose in C++1y), attributes are supposed to be
semantics-neutral, in the sense that if a program compiles with attributes, then
ignoring those attributes should also lead to a well-formed program with the
same observable behaviour.

This brings us to the question: do you expect your proposal to the C++ committee
to be adopted as is, or do you anticipate or expect changes based on committee
feedback?  If you expect changes, what policy do you propose for changes that would
reflect any feedback you would get from WG21?  The reason why this is important
is because WG21 has its own schedule, independent of GCC, and GCC has to
deal with forward/backward compatibility.

-- Gaby
Iyer, Balaji V - Sept. 6, 2012, 4:41 p.m.
Sorry, I didn't see this message. Please see my responses below:

>-----Original Message-----
>From: Marc Glisse [mailto:marc.glisse@inria.fr]
>Sent: Thursday, September 06, 2012 7:04 AM
>To: gcc-patches@gcc.gnu.org
>Cc: Gabriel Dos Reis; Iyer, Balaji V; Aldy Hernandez (aldyh@redhat.com); Jeff Law;
>rth@redhat.com
>Subject: Re: [PATCH] Merging Cilk Plus into Trunk (Patch 1 of approximately 22)
>
>On Thu, 6 Sep 2012, Marc Glisse wrote:
>
>> AFAIU, my_func is defined in a separate library and because of the
>> attribute on the definition, it will actually export overloads:
>> int myfunc(int,int);
>> v2si myfunc(v2si,v2si);
>> v4si myfunc(v4si,v4si);
>> etc (where does it stop? seems problematic if the library is compiled
>> for
>> sse4 and I then compile and link an avx program)
>
>According to the doc, it only generates one of these vector versions (even more
>risk of mismatch).
>
>Does it actually create the extra declaration in the front-end, i.e. can I explicitly
>call myfunc on a v4si that I created myself, or is the middle-end the only user?


Yes you can call the function yourself.

>
>--
>Marc Glisse
Iyer, Balaji V - Sept. 6, 2012, 6:25 p.m.
Hello Richard,
	I forgot to answer one of your questions. Please see it below:

Thanks,

Balaji V. Iyer.


>+static tree
>+handle_vector_attribute (tree *node, tree name ATTRIBUTE_UNUSED,
>+			 tree args ATTRIBUTE_UNUSED,
>+			 int ARG_UNUSED (flags), bool *no_add_attrs) {
>+  tree opt_list;
>+  VEC(tree,gc) *opt_vec = NULL;
>+  opt_vec = make_tree_vector ();
>+  VEC_safe_push (tree, gc, opt_vec, build_string (2, "O3"));
>+  opt_list = build_tree_list_vec (opt_vec);
>+  release_tree_vector (opt_vec);
>+  handle_optimize_attribute (node, get_identifier ("optimize"), opt_list,
>+			     flags, no_add_attrs);
>
>Please no - do not use "optimize" attributes from inside the implementation.
>What happens if the user also specifies an optimize attribute?
>The above also doesn't make sense to me, so please elaborate on why you want
>to enable -O3 for a function marked with the vector attribute.

The reason why I used optimize is because I would like to turn on the vectorizer. As far as I can tell, the only way to do that is to have -O3. Please advise if there is a better way to do so.
Iyer, Balaji V - Sept. 6, 2012, 6:35 p.m.
>-----Original Message-----
>From: Joseph Myers [mailto:joseph@codesourcery.com]
>Sent: Thursday, September 06, 2012 12:18 PM
>To: Iyer, Balaji V
>Cc: gcc-patches@gcc.gnu.org; Aldy Hernandez (aldyh@redhat.com); Jeff Law;
>rth@redhat.com
>Subject: RE: [PATCH] Merging Cilk Plus into Trunk (Patch 1 of approximately 22)
>
>On Thu, 6 Sep 2012, Iyer, Balaji V wrote:
>
>> Ok, I was mistaken there. I thought we had to add a changelog entry
>> for every function and not every file. I will fix it in the updated
>> patch I send soon.
>
>For functions in existing files you do need to mention each function - but not for
>new files.
>
>> >create_processor_attribute contains hardcoded references to
>> >x86-specific functionality.  This is not OK; all such target
>> >dependencies need to be kept within the back ends, and handled from
>> >the rest of the compiler via target hooks (in most cases, new target
>dependencies must use target hooks not target macros).
>>
>> The only thing I am doing in that function is to add appropriate
>> attribute. In elemental function, there is a processor clause that
>> will allow users to set the type of processor they want the function
>> compiled for. All I am doing is to map that information to the appropriate
>"arch"
>> attribute. I didn't think it had any back end peculiarity.
>
>Concepts such as "pentium_4" are architecture-specific and have no place in
>front-end files.  This whole mapping from one sort of string to another belongs
>within the back end.

Please excuse me if I am "beating this horse to death." I am asking this to make sure I understand correctly before I start re-implementing things. It is not clear to me whether the problem is the function's location or the place where it is called. Can you please clarify? Things like pentium_4 are part of the language (please see the processor clause on page 34 of the spec), and all I was doing was parsing that clause and substituting one string for another. All the processing and instruction selection is done by the existing back end.

Thanks,

Balaji V. Iyer.


>
>--
>Joseph S. Myers
>joseph@codesourcery.com
Joseph S. Myers - Sept. 6, 2012, 7:05 p.m.
On Thu, 6 Sep 2012, Iyer, Balaji V wrote:

> >Concepts such as "pentium_4" are architecture-specific and have no place in
> >front-end files.  This whole mapping from one sort of string to another belongs
> >within the back end.
> 
> Please excuse me if I am "beating this horse to death." I am asking this 
> to make sure I am understanding this correctly before I start 
> re-implementing things. I am not very clear about whether the problem is 
> the function's location or the place where it is called? Can you please 
> clarify? Things like pentium_4 are part of the language (please see 
> processor clause in the pg 34 of the spec) and all I was doing was to 
> parse that and was doing a string matching and substituting one string 
> for the next. All the processing and picking of instructions are done by 
> the existing backend.

That indicates a significant deficiency in the structure of the 
specification.  It would best be reworked to separate the 
architecture-independent specification from architecture-specific annexes.  
Every reference to something architecture-specific should instead say how 
things are defined by the architecture annex (for example, that the 
architecture annex specifies the tokens accepted for the processor clause 
and the default vector length).  There should be a defined way for 
architecture annexes to be added or updated for new architectures.

In addition, descriptions such as "Calls to functions other than other 
elemental functions and the intrinsic short vector math libraries provided 
with the Intel compilers" are clearly unsuitable for a specification of a 
language extension.

Until the specification is cleaned up to follow normal good practice for 
such specifications, the documentation included with GCC will need, along 
with a pointer to the specification, to detail how it is amended to be 
properly architecture-independent and what are considered to be the 
architecture annexes followed by GCC.  And those parts of the 
specification that clearly would naturally vary from architecture to 
architecture will need to have the architecture-specific parts implemented 
through target hooks, not directly in the architecture-independent 
compiler.  And make it as easy as possible for people to add test coverage 
for new architectures.

Having looked at bits of the specification now, I may as well point out 
another, unrelated, issue that needs fixing in the specification: 
"Elemental functions cannot be virtual, and can only be called directly, 
not through a function pointer".  As I noted on the WG14 reflector when a 
similar issue appeared in a draft of the IEEE 754-2008 bindings, all C 
function calls are through function pointers - that's how the C standard 
defines them - so you need to say (in the specification, not here) what's 
meant in actual C standard terms, and make sure the implementation follows 
that, and make sure there are testcases verifying that this constraint is 
diagnosed.
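
To make the function-pointer point concrete, here is a minimal sketch in plain C. The function my_func is only a scalar stand-in for an elemental function, and call_three_ways is a hypothetical helper written for this illustration. In each call below the expression denoting the called function has pointer-to-function type, because a function designator is converted to a pointer before the call. A rule such as "only called directly" therefore has to be worded in terms of the form of the called expression, for example an identifier declared with the attribute.

```c
/* Scalar stand-in for an elemental function (illustrative only).  */
int
my_func (int x, int y)
{
  return x + y;
}

/* Hypothetical helper: calls my_func three ways and returns 1 if all
   three calls agree.  */
int
call_three_ways (int x, int y)
{
  int (*fp) (int, int) = my_func;   /* explicit function pointer */

  int a = my_func (x, y);           /* designator converted to a pointer */
  int b = (*my_func) (x, y);        /* dereference, then converted again */
  int c = fp (x, y);                /* call through a pointer variable */

  return a == b && b == c;
}
```

All three forms are equivalent as far as the C standard is concerned, which is why the specification needs to say which of them it intends to forbid.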
Richard Guenther - Sept. 7, 2012, 9:06 a.m.
On Thu, Sep 6, 2012 at 8:25 PM, Iyer, Balaji V <balaji.v.iyer@intel.com> wrote:
> Hello Richard,
>         I forgot to answer one of your questions. Please see it below:
>
> Thanks,
>
> Balaji V. Iyer.
>
>
>>+static tree
>>+handle_vector_attribute (tree *node, tree name ATTRIBUTE_UNUSED,
>>+                       tree args ATTRIBUTE_UNUSED,
>>+                       int ARG_UNUSED (flags), bool *no_add_attrs) {
>>+  tree opt_list;
>>+  VEC(tree,gc) *opt_vec = NULL;
>>+  opt_vec = make_tree_vector ();
>>+  VEC_safe_push (tree, gc, opt_vec, build_string (2, "O3"));
>>+  opt_list = build_tree_list_vec (opt_vec);
>>+  release_tree_vector (opt_vec);
>>+  handle_optimize_attribute (node, get_identifier ("optimize"), opt_list,
>>+                           flags, no_add_attrs);
>>
>>Please no - do not use "optimize" attributes from inside the implementation.
>>What happens if the user also specifies an optimize attribute?
>>The above also doesn't make sense to me, so please elaborate on why you want
>>to enable -O3 for a function marked with the vector attribute.
>
> The reason why I used optimize is because I would like to turn on the vectorizer. As far as I can tell, the only way to do that is to have -O3. Please advise if there is a better way to do so.

The answer is that you should not enable the vectorizer.

Richard.
Richard Guenther - Sept. 7, 2012, 9:14 a.m.
On Thu, Sep 6, 2012 at 5:51 PM, Richard Henderson <rth@redhat.com> wrote:
> On 09/06/2012 02:37 AM, Richard Guenther wrote:
>> In all this seems unrelated to CILK+ work (even if you make use of this
>> from within CILK+).
>
> While true, we also asked him to split up the work.  And this piece,
> done correctly, seems useful even if the rest of cilk is ignored.

Sure.  When viewed independent of cilk then interpreting libm routines
as elemental and appropriately matching that with the -mveclibabi support
we have would be nice.  -mveclibabi would specify how to mangle an
assembler name to yield the vector elemental function (and thus be target
dependent and eventually library ABI dependent) and the vectorizer would
for each elemental function be able to get at a vectorized decl by means of
the existing target hook (of which parts of it could then be lifted to
generic code).

Making GCC auto-generate vector variants should be done only when the
definition is visible (for example through LTO).  That would basically mean
to implement inter-procedural vectorization support.  That side-steps all
ABI issues as the functions will be internal only.  Not sure if that is enough
for Cilk+ requirements though.

Richard.

>
> r~
Iyer, Balaji V - Sept. 7, 2012, 2:09 p.m.
>-----Original Message-----
>From: Richard Guenther [mailto:richard.guenther@gmail.com]
>Sent: Friday, September 07, 2012 5:07 AM
>To: Iyer, Balaji V
>Cc: gcc-patches@gcc.gnu.org; Gabriel Dos Reis; Aldy Hernandez
>(aldyh@redhat.com); Jeff Law; rth@redhat.com
>Subject: Re: [PATCH] Merging Cilk Plus into Trunk (Patch 1 of approximately 22)
>
>On Thu, Sep 6, 2012 at 8:25 PM, Iyer, Balaji V <balaji.v.iyer@intel.com> wrote:
>> Hello Richard,
>>         I forgot to answer one of your questions. Please see it below:
>>
>> Thanks,
>>
>> Balaji V. Iyer.
>>
>>
>>>+static tree
>>>+handle_vector_attribute (tree *node, tree name ATTRIBUTE_UNUSED,
>>>+                       tree args ATTRIBUTE_UNUSED,
>>>+                       int ARG_UNUSED (flags), bool *no_add_attrs) {
>>>+  tree opt_list;
>>>+  VEC(tree,gc) *opt_vec = NULL;
>>>+  opt_vec = make_tree_vector ();
>>>+  VEC_safe_push (tree, gc, opt_vec, build_string (2, "O3"));
>>>+  opt_list = build_tree_list_vec (opt_vec);
>>>+  release_tree_vector (opt_vec);
>>>+  handle_optimize_attribute (node, get_identifier ("optimize"), opt_list,
>>>+                           flags, no_add_attrs);
>>>
>>>Please no - do not use "optimize" attributes from inside the implementation.
>>>What happens if the user also specifies an optimize attribute?
>>>The above also doesn't make sense to me, so please elaborate on why
>>>you want to enable -O3 for a function marked with the vector attribute.
>>
>> The reason why I used optimize is because I would like to turn on the vectorizer.
>As far as I can tell, the only way to do that is to have -O3. Please advise if there is
>a better way to do so.
>
>The answer is that you should not enable the vectorizer.

OK. I will fix that.

>
>Richard.
Andi Kleen - Sept. 7, 2012, 5:04 p.m.
"Iyer, Balaji V" <balaji.v.iyer@intel.com> writes:
>>
>>The answer is that you should not enable the vectorizer.
>
> OK. I will fix that.

It still seems like useful functionality. Otherwise you have to compile
the whole program with -O3, just to vectorize a few marked functions or
add additional annotations for all of them. How about a separate command
line flag to auto enable vectorization (or -O3) for functions with Cilk
vector annotations?

-Andi
Iyer, Balaji V - Sept. 7, 2012, 6:28 p.m.
>-----Original Message-----
>From: Andi Kleen [mailto:andi@firstfloor.org]
>Sent: Friday, September 07, 2012 1:05 PM
>To: Iyer, Balaji V
>Cc: Richard Guenther; gcc-patches@gcc.gnu.org; Gabriel Dos Reis; Aldy
>Hernandez (aldyh@redhat.com); Jeff Law; rth@redhat.com
>Subject: Re: [PATCH] Merging Cilk Plus into Trunk (Patch 1 of approximately 22)
>
>"Iyer, Balaji V" <balaji.v.iyer@intel.com> writes:
>>>
>>>The answer is that you should not enable the vectorizer.
>>
>> OK. I will fix that.
>
>It still seems like useful functionality. Otherwise you have to compile the whole
>program with -O3, just to vectorize a few marked functions or add additional
>annotations for all of them. How about a separate command line flag to auto
>enable vectorization (or -O3) for functions with Cilk vector annotations?

Yes, I really like this idea and that is kind of what I want. But, how do I turn on vectorization on a function by function basis? I tried to set flag_tree_vectorize=1 but that doesn't seem to do the trick.

>
>-Andi
>
>--
>ak@linux.intel.com -- Speaking for myself only
Iyer, Balaji V - Sept. 7, 2012, 7:31 p.m.
Hello Richard,
	Please see my response below:

>-----Original Message-----
>From: Richard Guenther [mailto:richard.guenther@gmail.com]
>Sent: Friday, September 07, 2012 5:15 AM
>To: Richard Henderson
>Cc: gcc-patches@gcc.gnu.org; Gabriel Dos Reis; Iyer, Balaji V; Aldy Hernandez
>(aldyh@redhat.com); Jeff Law
>Subject: Re: [PATCH] Merging Cilk Plus into Trunk (Patch 1 of approximately 22)
>
>On Thu, Sep 6, 2012 at 5:51 PM, Richard Henderson <rth@redhat.com> wrote:
>> On 09/06/2012 02:37 AM, Richard Guenther wrote:
>>> In all this seems unrelated to CILK+ work (even if you make use of
>>> this from within CILK+).
>>
>> While true, we also asked him to split up the work.  And this piece,
>> done correctly, seems useful even if the rest of cilk is ignored.
>
>Sure.  When viewed independent of cilk then interpreting libm routines as
>elemental and appropriately matching that with the -mveclibabi support we have
>would be nice.  -mveclibabi would specify how to mangle an assembler name to
>yield the vector elemental function (and thus be target dependent and eventually
>library ABI dependent) and the vectorizer would for each elemental function be
>able to get at a vectorized decl by means of the existing target hook (of which
>parts of it could then be lifted to generic code).
>
>Making GCC auto-generate vector variants should be done only when the
>definition is visible (for example through LTO).  That would basically mean to
>implement inter-procedural vectorization support.  That side-steps all ABI issues
>as the functions will be internal only.  Not sure if that is enough for Cilk+
>requirements though.

I hope I have not misunderstood your question, but to clarify: the elemental function's definition and body are visible to all passes after the invocation of gimplify_function_tree (). They are also visible to the LTO optimization.


>
>Richard.
>
>>
>> r~
Andi Kleen - Sept. 7, 2012, 7:59 p.m.
"Iyer, Balaji V" <balaji.v.iyer@intel.com> writes:
>
> Yes, I really like this idea and that is kind of what I want. But, how do I turn on vectorization on a function by function basis? I tried to set flag_tree_vectorize=1 but that doesn't seem to do the trick.

AFAIK vectorization needs a range of passes to work, so it probably needs full
-O3 or something similar like you did. Apparently the optimize attribute has some
known problems, but I don't know if it would affect this case.

The explicit command line flag would just avoid the problem Richard
pointed out that you overwrite an explicit user choice.

-Andi
Jakub Jelinek - Sept. 7, 2012, 8:07 p.m.
On Fri, Sep 07, 2012 at 12:59:26PM -0700, Andi Kleen wrote:
> "Iyer, Balaji V" <balaji.v.iyer@intel.com> writes:
> >
> > Yes, I really like this idea and that is kind of what I want. But, how do I turn on vectorization on a function by function basis? I tried to set flag_tree_vectorize=1 but that doesn't seem to do the trick.
> 
> AFAIK vectorization needs a range of passes to work, so it probably needs full
> -O3 or something similar like you did. Apparently the optimize attribute has some
> known problems, but I don't know if it would affect this case.

Nope, -O2 -ftree-vectorize works just fine.  Vectorization only needs
if-conversion, but that is enabled by default with -ftree-vectorize
(unless explicitly disabled).
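
As a user-level illustration of the above, here is a minimal sketch of a loop that can be built with -O2 -ftree-vectorize. The function add_arrays and the macro VECTORIZE_HINT are names made up for this example. The per-function opt-in uses GCC's existing optimize function attribute, which this thread notes has known problems and should not be applied from inside the compiler; it is shown here only as something a user could write by hand.

```c
#include <stddef.h>

/* GCC accepts optimize ("tree-vectorize") as a per-function request;
   on other compilers the macro expands to nothing.  */
#if defined (__GNUC__) && !defined (__clang__)
# define VECTORIZE_HINT __attribute__ ((optimize ("tree-vectorize")))
#else
# define VECTORIZE_HINT
#endif

/* No iteration depends on another one, so the loop is a candidate for
   vectorization.  The restrict qualifiers promise the arrays do not
   overlap.  */
VECTORIZE_HINT void
add_arrays (int *restrict x, const int *restrict y,
	    const int *restrict z, size_t n)
{
  for (size_t i = 0; i < n; i++)
    x[i] = y[i] + z[i];
}
```

Compiling the whole file with gcc -O2 -ftree-vectorize asks for the same thing on every function, without pulling in the rest of -O3.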

	Jakub
Andi Kleen - Sept. 7, 2012, 8:18 p.m.
Jakub Jelinek <jakub@redhat.com> writes:
>
> Nope, -O2 -ftree-vectorize works just fine.  Vectorization only needs
> if-conversion, but that is enabled by default if -ftree-vectorize
> (unless explicitly disabled).

How about the tree unrolling? I remember that being enabled for the
vectorizer (and then annoying me in my own O3 builds because it often
generates weird code)

-Andi
Iyer, Balaji V - Sept. 7, 2012, 9 p.m.
>-----Original Message-----
>From: Jakub Jelinek [mailto:jakub@redhat.com]
>Sent: Friday, September 07, 2012 4:07 PM
>To: Andi Kleen
>Cc: Iyer, Balaji V; Richard Guenther; gcc-patches@gcc.gnu.org; Gabriel Dos Reis;
>Aldy Hernandez (aldyh@redhat.com); Jeff Law; rth@redhat.com
>Subject: Re: [PATCH] Merging Cilk Plus into Trunk (Patch 1 of approximately 22)
>
>On Fri, Sep 07, 2012 at 12:59:26PM -0700, Andi Kleen wrote:
>> "Iyer, Balaji V" <balaji.v.iyer@intel.com> writes:
>> >
>> > Yes, I really like this idea and that is kind of what I want. But, how do I turn on
>vectorization on a function by function basis? I tried to set flag_tree_vectorize=1
>but that doesn't seem to do the trick.
>>
>> AFAIK vectorization needs a range of passes to work, so it probably
>> needs full
>> -O3 or something similar like you did. Apparently the optimize
>> attribute has some known problems, but I don't know if it would affect this
>case.
>
>Nope, -O2 -ftree-vectorize works just fine.  Vectorization only needs if-
>conversion, but that is enabled by default if -ftree-vectorize (unless explicitly
>disabled).

So, if I am understanding this correctly, there is no way to have the vectorization turned on/off on a function by function basis? I don't mind if it is turned off for -O0, but would like it to be turned on/off for anything > -O1.

>
>	Jakub
Richard Henderson - Sept. 10, 2012, 4 p.m.
On 09/07/2012 02:00 PM, Iyer, Balaji V wrote:
> So, if I am understanding this correctly, there is no way to have the
> vectorization turned on/off on a function by function basis? I don't
> mind if it is turned off for -O0, but would like it to be turned on/off
> for anything > -O1.

There's probably no reason that we can't enable vectorization on a loop
by loop basis.  Given that we're keeping a bit attached to the loop itself
for #pragma simd anyway.

This ought not be terribly difficult to arrange...


r~
Richard Henderson - Sept. 10, 2012, 4:02 p.m.
On 09/07/2012 12:31 PM, Iyer, Balaji V wrote:
> I hope I have not mistaken your question, but to clarify the
> elemental function's definition and body is visible to all passes
> after the invocation of gimplify_function_tree (). It is also visible
> for the LTO optimization.

If that's the case, what's the point in defining an external ABI and
defining what __attribute__((vector)) placed on a function declaration
means?


r~
Iyer, Balaji V - Sept. 10, 2012, 4:09 p.m.
>-----Original Message-----
>From: Richard Henderson [mailto:rth@redhat.com]
>Sent: Monday, September 10, 2012 12:03 PM
>To: Iyer, Balaji V
>Cc: Richard Guenther; gcc-patches@gcc.gnu.org; Gabriel Dos Reis; Aldy
>Hernandez (aldyh@redhat.com); Jeff Law
>Subject: Re: [PATCH] Merging Cilk Plus into Trunk (Patch 1 of approximately 22)
>
>On 09/07/2012 12:31 PM, Iyer, Balaji V wrote:
>> I hope I have not mistaken your question, but to clarify the elemental
>> function's definition and body is visible to all passes after the
>> invocation of gimplify_function_tree (). It is also visible for the
>> LTO optimization.
>
>If that's the case, what's the point in defining an external ABI and defining what
>__attribute__((vector)) placed on a function declaration means?

When you have __attribute__((vector)) you are asking the compiler to create a vector AND a scalar version of the function. The advantage is that if the function is used, for example, in 2 loops where 1 can be vectorized and another cannot, the vectorizable loop won't suffer (i.e. suffer from being not-vectorized).
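
A minimal sketch of the two-loop situation described above, in plain C. The function bodies and the names independent_loop and dependent_loop are assumptions made for this illustration; my_func here is an ordinary scalar function, not one generated by the patch.

```c
#include <stddef.h>

/* Scalar stand-in for the elemental function.  */
int
my_func (int x, int y)
{
  return x + y;
}

/* Each iteration is independent, so a vector version of my_func could
   process several elements per call.  */
void
independent_loop (int *x, const int *y, const int *z, size_t n)
{
  for (size_t i = 0; i < n; i++)
    x[i] = my_func (y[i], z[i]);
}

/* Iteration j reads a[j - 1], which iteration j - 1 wrote.  This
   loop-carried dependence means the scalar version is needed here.  */
void
dependent_loop (int *a, const int *b, size_t n)
{
  for (size_t j = 1; j < n; j++)
    a[j] = my_func (b[j], a[j - 1]) + a[j - 1];
}
```

With both a scalar and a vector version available, the first loop can use the vector one while the second keeps calling the scalar one.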

Thanks,

Balaji V. Iyer.

>
>
>r~
Iyer, Balaji V - Sept. 10, 2012, 4:11 p.m.
>-----Original Message-----
>From: Richard Henderson [mailto:rth@redhat.com]
>Sent: Monday, September 10, 2012 12:01 PM
>To: Iyer, Balaji V
>Cc: Jakub Jelinek; Andi Kleen; Richard Guenther; gcc-patches@gcc.gnu.org;
>Gabriel Dos Reis; Aldy Hernandez (aldyh@redhat.com); Jeff Law
>Subject: Re: [PATCH] Merging Cilk Plus into Trunk (Patch 1 of approximately 22)
>
>On 09/07/2012 02:00 PM, Iyer, Balaji V wrote:
>> So, if I am understanding this correctly, there is no way to have the
>> vectorization turned on/off on a function by function basis? I don't
>> mind if it is turned off for -O0, but would like it to be turned on/off
>> for anything > -O1.
>
>There's probably no reason that we can't enable vectorization on a loop by loop
>basis.  Given that we're keeping a bit attached to the loop itself for #pragma simd
>anyway.
>
>This ought not be terribly difficult to arrange...

Can you please help me get a start on how this can be done? From what I understand (please correct me if I am wrong), this requires rearranging and duplicating a lot of passes and can potentially open the door to a lot of bugs.

>
>
>r~
Richard Henderson - Sept. 10, 2012, 4:30 p.m.
On 09/10/2012 09:11 AM, Iyer, Balaji V wrote:
> Can you please help me get a start on how this can be done? From
> what I understand (please correct me if I am wrong), this requires
> rearranging and duplicating a lot of passes and can potentially open
> up to a lot of bugs.

Certainly not duplicating passes.  And probably not even rearranging them.

The important parts are:

  (1) Having a bit in "struct loop" that indicates the special semantics
      you have for #pragma simd.  I don't know if maybe all loops inside an
      elemental function are so automatically marked?

  (2) Have bits in "struct function" that summarize the contents of the
      bit from "struct loop", for all loops in the function.  Note that
      this bit would need to be updated during inlining.

  (3) Change the "gate" predicates for the relevant function to also check
      the bit from "struct function".  In some cases the pass might need
      to run globally (perhaps if-conversion?) and in some cases the pass
      might be able to restrict work to specific loops (e.g. the vectorizer),
      skipping loops for which the optimization is not enabled.
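
The three steps above can be sketched with a toy model in plain C. Every type, field, and function name below is hypothetical and chosen only for illustration; none of it is GCC's actual "struct loop", "struct function", or pass gate code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Step (1): a per-loop bit, e.g. set for loops under #pragma simd.  */
struct toy_loop
{
  bool force_vectorize;
};

/* Step (2): a per-function summary of the per-loop bits.  */
struct toy_function
{
  struct toy_loop *loops;
  size_t num_loops;
  bool has_force_vectorize_loops;
};

/* Recompute the summary.  It would have to be rerun whenever loops
   are added to the function, for instance after inlining.  */
void
toy_update_summary (struct toy_function *fn)
{
  fn->has_force_vectorize_loops = false;
  for (size_t i = 0; i < fn->num_loops; i++)
    if (fn->loops[i].force_vectorize)
      fn->has_force_vectorize_loops = true;
}

/* Step (3): the pass gate runs the pass if the global flag is set or
   the function contains at least one marked loop.  */
bool
toy_gate_vectorize (const struct toy_function *fn, bool flag_vectorize)
{
  return flag_vectorize || fn->has_force_vectorize_loops;
}

/* Inside the pass: skip loops for which vectorization is not enabled.  */
bool
toy_should_vectorize_loop (const struct toy_loop *loop, bool flag_vectorize)
{
  return flag_vectorize || loop->force_vectorize;
}
```

In this model a function with one marked loop passes the gate even when the global flag is off, but only the marked loop is then transformed.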


r~
Richard Henderson - Sept. 10, 2012, 4:37 p.m.
On 09/10/2012 09:09 AM, Iyer, Balaji V wrote:
>> >If that's the case, what's the point in defining an external ABI and defining what
>> >__attribute__((vector)) placed on a function declaration means?

> When you have __attribute__((vector)) you are asking the compiler to
> create a vector AND a scalar version of the function. The advantage
> is that if the function is used, for example, in 2 loops where 1 can
> be vectorized and another cannot, the vectorizable loop won't suffer
> (i.e. suffer from being not-vectorized).

You've totally mis-understood my point.

Whether or not the compiler creates a clone COULD BE totally up to the
compiler, based on whether or not vectorization is enabled, whether the
loop has been analyzed such that vectorization may proceed, or indeed
the phase of the moon.

But in order for that to happen, the clone must be totally private to
the module for which we are generating code (in the LTO sense, this is
the entire program or dll; without LTO, this is just the object file).
It means that we never attempt to generate clones for functions for
which the body of the function is not visible.

On the other hand, if you insist on assuming a clone exists merely
because a declaration bears an attribute, then you must address ALL
of the problems with respect to defining a stable ABI in the face of
different cpu revisions, different ISAs, and different vector lengths.

I've not seen you address ANY of these problems, despite having the
problem pointed out multiple times.


r~
Andi Kleen - Sept. 10, 2012, 5:22 p.m.
On Mon, Sep 10, 2012 at 09:30:15AM -0700, Richard Henderson wrote:
> On 09/10/2012 09:11 AM, Iyer, Balaji V wrote:
> > Can you please help me get a start on how this can be done? From
> > what I understand (please correct me if I am wrong), this requires
> > rearranging and duplicating a lot of passes and can potentially open
> > up to a lot of bugs.
> 
> Certainly not duplicating passes.  And probably not even rearranging them.

It would be great if unrolling was also done loop by loop in a similar
way. I often wanted that (only enable it for some loops, not the whole file).
And a lot of other compilers have pragmas for this, just not gcc.
As I understand it, vectorization needs some unrolling anyway?

-Andi
Richard Guenther - Sept. 11, 2012, 8:38 a.m.
On Mon, Sep 10, 2012 at 6:30 PM, Richard Henderson <rth@redhat.com> wrote:
> On 09/10/2012 09:11 AM, Iyer, Balaji V wrote:
>> Can you please help me get a start on how this can be done? From
>> what I understand (please correct me if I am wrong), this requires
>> rearranging and duplicating a lot of passes and can potentially open
>> up to a lot of bugs.
>
> Certainly not duplicating passes.  And probably not even rearranging them.
>
> The important parts are:
>
>   (1) Having a bit in "struct loop" that indicates the special semantics
>       you have for #pragma simd.  I don't know if maybe all loops inside an
>       elemental function are so automatically marked?
>
>   (2) Have bits in "struct function" that summarize the contents of the
>       bit from "struct loop", for all loops in the function.  Note that
>       this bit would need to be updated during inlining.
>
>   (3) Change the "gate" predicates for the relevant function to also check
>       the bit from "struct function".  In some cases the pass might need
>       to run globally (perhaps if-conversion?) and in some cases the pass
>       might be able to restrict work to specific loops (e.g. the vectorizer),
>       skipping loops for which the optimization is not enabled.

Note that we do not preserve the loop tree before the gimple loop optimizer
passes.  Nor do we have a convenient way (currently) to transfer per-loop
information from GENERIC to the point where we can first create the loop
tree (after the CFG is built).  The former is because I didn't want to think
about the inlining case (I'm still chasing bugs for preserving the loop tree
from the start of gimple loop optimizer passes ...), the latter could be done
in a similar way we handle predications or OMP annotations - have
special instructions in the IL.

Richard.

>
> r~
>
Richard Guenther - Sept. 11, 2012, 8:41 a.m.
On Mon, Sep 10, 2012 at 6:37 PM, Richard Henderson <rth@redhat.com> wrote:
> On 09/10/2012 09:09 AM, Iyer, Balaji V wrote:
>>> >If that's the case, what's the point in defining an external ABI and defining what
>>> >__attribute__((vector)) placed on a function declaration means?
>
>> When you have __attribute__((vector)) you are asking the compiler to
>> create a vector AND a scalar version of the function. The advantage
>> is that if the function is used, for example, in 2 loops where 1 can
>> be vectorized and another cannot, the vectorizable loop won't suffer
>> (i.e. suffer from being not-vectorized).
>
> You've totally mis-understood my point.
>
> Whether or not the compiler creates a clone COULD BE totally up to the
> compiler, based on whether or not vectorization is enabled, whether the
> loop has been analyzed such that vectorization may proceed, or indeed
> the phase of the moon.
>
> But in order for that to happen, the clone must be totally private to
> the module for which we are generating code (in the LTO sense, this is
> the entire program or dll; without LTO, this is just the object file).
> It means that we never attempt to generate clones for functions for
> which the body of the function is not visible.
>
> On the other hand, if you insist on assuming a clone exists merely
> because a declaration bears an attribute, then you must address ALL
> of the problems with respect to defining a stable ABI in the face of
> different cpu revisions, different ISAs, and different vector lengths.
>
> I've not seen you address ANY of these problems, despite having the
> problem pointed out multiple times.

Indeed, if the definition of an elemental function is always visible to the
vectorizer the vectorizer itself can instruct the creation of the clone
if it does not already exist (just make those clones managed by the
callgraph).  Then the clones are visible to the current TU only and no
ABI issues exist (though you could say that the vectorizer or the inliner
could as well force inlining of elemental functions into places it wants to
vectorize - one complication even with local clones is that the x86 ABI
has no callee-saved XMM registers which makes function calls inside
loops especially expensive).

Richard.

>
> r~
Richard Guenther - Sept. 11, 2012, 8:42 a.m.
On Tue, Sep 11, 2012 at 10:41 AM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Mon, Sep 10, 2012 at 6:37 PM, Richard Henderson <rth@redhat.com> wrote:
>> On 09/10/2012 09:09 AM, Iyer, Balaji V wrote:
>>>> >If that's the case, what's the point in defining an external ABI and defining what
>>>> >__attribute__((vector)) placed on a function declaration means?
>>
>>> When you have __attribute__((vector)) you are asking the compiler to
>>> create a vector AND a scalar version of the function. The advantage
>>> is that if the function is used, for example, in 2 loops where 1 can
>>> be vectorized and another cannot, the vectorizable loop won't suffer
>>> (i.e. suffer from being not-vectorized).
>>
>> You've totally mis-understood my point.
>>
>> Whether or not the compiler creates a clone COULD BE totally up to the
>> compiler, based on whether or not vectorization is enabled, whether the
>> loop has been analyzed such that vectorization may proceed, or indeed
>> the phase of the moon.
>>
>> But in order for that to happen, the clone must be totally private to
>> the module for which we are generating code (in the LTO sense, this is
>> the entire program or dll; without LTO, this is just the object file).
>> It means that we never attempt to generate clones for functions for
>> which the body of the function is not visible.
>>
>> On the other hand, if you insist on assuming a clone exists merely
>> because a declaration bears an attribute, then you must address ALL
>> of the problems with respect to defining a stable ABI in the face of
>> different cpu revisions, different ISAs, and different vector lengths.
>>
>> I've not seen you address ANY of these problems, despite having the
>> problem pointed out multiple times.
>
> Indeed, if the definition of an elemental function is always visible to the
> vectorizer the vectorizer itself can instruct the creation of the clone
> if it does not already exist (just make those clones managed by the
> callgraph).  Then the clones are visible to the current TU only and no
> ABI issues exist (though you could say that the vectorizer or the inliner
> could as well force inlining of elemental functions into places it wants to
> vectorize - one complication even with local clones is that the x86 ABI
> has no callee-saved XMM registers which makes function calls inside
> loops especially expensive).

Btw, this then happily fits into my suggestion that the "elementalness"
can be autodetected by the compiler simply by means of a proper IPA
pass and thus be fully LTO / whole-program aware.  No need for an
attribute (where you'd need to handle the case that the attribute was placed
there by error).

Richard.

> Richard.
>
>>
>> r~
Gabriel Dos Reis - Sept. 11, 2012, 8:57 a.m.
On Tue, Sep 11, 2012 at 3:42 AM, Richard Guenther
<richard.guenther@gmail.com> wrote:

> Btw, this then happily fits into my suggestion that the "elementalness"
> can be autodetected by the compiler simply by means of a proper IPA
> pass and thus be fully LTO / whole-program aware.  No need for an
> attribute (where you'd need to handle the case that the attribute was placed
> there by error).

We are in violent agreement.

-- Gaby
Jakub Jelinek - Sept. 11, 2012, 9:06 a.m.
On Tue, Sep 11, 2012 at 03:57:44AM -0500, Gabriel Dos Reis wrote:
> On Tue, Sep 11, 2012 at 3:42 AM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
> 
> > Btw, this then happily fits into my suggestion that the "elementalness"
> > can be autodetected by the compiler simply by means of a proper IPA
> > pass and thus be fully LTO / whole-program aware.  No need for an
> > attribute (where you'd need to handle the case that the attribute was placed
> > there by error).
> 
> We are in violent agreement.

For locally defined functions sure, the question is if we want the attribute
to be something for external functions.  Something that would have ABI
implications (the external symbol would need to be provided in two forms (or
more?), one scalar with normal mangling, one vector with some other kind
of mangling/suffix/whatever), when compiling the definition of function with
such an attribute the compiler could verify its properties (i.e. autodetect
and if it is not autodetected elemental, complain?), and when using extern
function just rely on it being provided twice.  Even with LTO, the function
can be defined in some other shared library etc.

Nothing says the implementation of the vector version of the elemental
function necessarily has to be vectorized, just that the arguments would need
to be passed in the expected vector registers, similarly for return value.
Say if the elemental function is compiled with -O0, then there could just be
a loop executing the scalar body several times and creating vectors.

	Jakub
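
As a rough illustration of the two points above (one symbol per form, and a vector form that need not itself be vectorized), the sketch below pairs a scalar function with a vector entry point that simply loops over the lanes. The name `my_func_v4`, the `v4si` typedef and the body of `my_func` are assumptions made for this illustration only; the mangling and calling convention actually used by the patch are not shown. It relies on the GCC/Clang `vector_size` extension.

```c
/* Generic 4 x int vector type (GCC/Clang vector extension).  */
typedef int v4si __attribute__ ((vector_size (16)));

/* Scalar form: the symbol with the normal name.  The body is a
   placeholder chosen for this sketch.  */
int
my_func (int x, int y)
{
  return x * y + 1;
}

/* Hypothetical vector form.  Arguments and result travel in vector
   types, but the body just runs the scalar code once per lane, as an
   unoptimized build might.  */
v4si
my_func_v4 (v4si x, v4si y)
{
  v4si r = { 0, 0, 0, 0 };
  int i;

  for (i = 0; i < 4; i++)
    r[i] = my_func (x[i], y[i]);
  return r;
}
```

A caller that only sees the declarations would have to trust that both `my_func` and the vector symbol are exported by whatever library defines them, which is where the ABI question comes from.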
Richard Guenther - Sept. 11, 2012, 9:36 a.m.
On Tue, Sep 11, 2012 at 11:06 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Tue, Sep 11, 2012 at 03:57:44AM -0500, Gabriel Dos Reis wrote:
>> On Tue, Sep 11, 2012 at 3:42 AM, Richard Guenther
>> <richard.guenther@gmail.com> wrote:
>>
>> > Btw, this then happily fits into my suggestion that the "elementalness"
>> > can be autodetected by the compiler simply by means of a proper IPA
>> > pass and thus be fully LTO / whole-program aware.  No need for an
>> > attribute (where you'd need to handle the case that the attribute was placed
>> > there by error).
>>
>> We are in violent agreement.
>
> For locally defined functions sure, the question is if we want the attribute
> to be something for external functions.  Something that would have ABI
> implications (the external symbol would need to be provided in two forms (or
> more?), one scalar with normal mangling, one vector with some other kind
> of mangling/suffix/whatever), when compiling the definition of function with
> such an attribute the compiler could verify its properties (i.e. autodetect
> and if it is not autodetected elemental, complain?), and when using extern
> function just rely on it being provided twice.  Even with LTO, the function
> can be defined in some other shared library etc.
>
> Nothing says the implementation of the vector version of the elemental
> function necessarily has to be vectorized, just that the arguments would need
> to be passed in the expected vector registers, similarly for return value.
> Say if the elemental function is compiled with -O0, then there could just be
> a loop executing the scalar body several times and creating vectors.

Sure.  And the "versioning" can happen from the C frontend then.  Of course
this one has the requirement of documenting the ABI.

Richard.

>         Jakub
Gabriel Dos Reis - Sept. 11, 2012, 9:39 a.m.
On Tue, Sep 11, 2012 at 4:06 AM, Jakub Jelinek <jakub@redhat.com> wrote:
> On Tue, Sep 11, 2012 at 03:57:44AM -0500, Gabriel Dos Reis wrote:
>> On Tue, Sep 11, 2012 at 3:42 AM, Richard Guenther
>> <richard.guenther@gmail.com> wrote:
>>
>> > Btw, this then happily fits into my suggestion that the "elementalness"
>> > can be autodetected by the compiler simply by means of a proper IPA
>> > pass and thus be fully LTO / whole-program aware.  No need for an
>> > attribute (where you'd need to handle the case that the attribute was placed
>> > there by error).
>>
>> We are in violent agreement.
>
> For locally defined functions sure, the question is if we want the attribute
> to be something for external functions.  Something that would have ABI
> implications (the external symbol would need to be provided in two forms (or
> more?), one scalar with normal mangling, one vector with some other kind
> of mangling/suffix/whatever), when compiling the definition of function with
> such an attribute the compiler could verify its properties (i.e. autodetect
> and if it is not autodetected elemental, complain?), and when using extern
> function just rely on it being provided twice.  Even with LTO, the function
> can be defined in some other shared library etc.
>
> Nothing says the implementation of the vector version of the elemental
> function necessarily has to be vectorized, just that the arguments would need
> to be passed in the expected vector registers, similarly for return value.
> Say if the elemental function is compiled with -O0, then there could just be
> a loop executing the scalar body several times and creating vectors.
>

As it was pointed out earlier (by Marc?), there is also an issue of overload
resolution if these automatically synthetized functions have to be something
visible, which of course entails the whole ABI issues.  This is really
a language design issue, not just compiler implementation.  If the
synthetized functions do not need to have the same status as real
functions (hence no need for attributes), then these issues evaporate.

-- Gaby
Marc Glisse - Sept. 11, 2012, 10:29 a.m.
On Tue, 11 Sep 2012, Richard Guenther wrote:

> On Tue, Sep 11, 2012 at 10:41 AM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
>> On Mon, Sep 10, 2012 at 6:37 PM, Richard Henderson <rth@redhat.com> wrote:
>>> Whether or not the compiler creates a clone COULD BE totally up to the
>>> compiler, based on whether or not vectorization is enabled, whether the
>>> loop has been analyzed such that vectorization may proceed, or indeed
>>> the phase of the moon.
>>>
>>> But in order for that to happen, the clone must be totally private to
>>> the module for which we are generating code (in the LTO sense, this is
>>> the entire program or dll; without LTO, this is just the object file).
>>> It means that we never attempt to generate clones for functions for
>>> which the body of the function is not visible.
>>>
>>> On the other hand, if you insist on assuming a clone exists merely
>>> because a declaration bears an attribute, then you must address ALL
>>> of the problems with respect to defining a stable ABI in the face of
>>> different cpu revisions, different ISAs, and different vector lengths.
>>>
>>> I've not seen you address ANY of these problems, despite having the
>>> problem pointed out multiple times.
>>
>> Indeed, if the definition of an elemental function is always visible to the
>> vectorizer the vectorizer itself can instruct the creation of the clone
>> if it does not already exist (just make those clones managed by the
>> callgraph).  Then the clones are visible to the current TU only and no
>> ABI issues exist (though you could say that the vectorizer or the inliner
>> could as well force inlining of elemental functions into places it wants to
>> vectorize - one complication even with local clones is that the x86 ABI
>> has no callee-saved XMM registers which makes function calls inside
>> loops especially expensive).

I thought gcc wouldn't use the x86 ABI for those private calls. I guess 
what I remember were vague discussions and not a description of the 
current status...

> Btw, this then happily fits into my suggestion that the "elementalness"
> can be autodetected by the compiler simply by means of a proper IPA
> pass and thus be fully LTO / whole-program aware.  No need for an
> attribute (where you'd need to handle the case that the attribute was placed
> there by error).

Note that, apart from preventing external calls, it removes this use case:

__attribute__((vector(4))) double mysqrt(double x){return sqrt(x);}

__m256d var;
mysqrt(var);

I am not sure it is the best way to achieve this, but it is one way. I am 
also planning a patch to turn {sqrt(a),sqrt(b)} into sqrt({a,b}) when the 
target likes it. And there is a PR asking for a __builtin_math_sqrt.
Jakub Jelinek - Sept. 11, 2012, 10:41 a.m.
On Tue, Sep 11, 2012 at 12:29:10PM +0200, Marc Glisse wrote:
> >Btw, this then happily fits into my suggestion that the "elementalness"
> >can be autodetected by the compiler simply by means of a proper IPA
> >pass and thus be fully LTO / whole-program aware.  No need for an
> >attribute (where you'd need to handle the case that the attribute was placed
> >there by error).
> 
> Note that, apart from preventing external calls, it removes this use case:
> 
> __attribute__((vector(4))) double mysqrt(double x){return sqrt(x);}
> 
> __m256d var;
> mysqrt(var);

I don't think those functions should be available for C++ overloading.
For one, it would be only for C++, not for C, and how would you handle
the case where the user already provides __m256d mysqrt(__m256d); overload
in addition to the one with vector attribute?
I'd say the compiler should when beneficial synthetize calls to those
in SLP or normal vectorizer instead, so you'd write:
  (__m256d){mysqrt(var[0]),mysqrt(var[1]),mysqrt(var[2]),mysqrt(var[3])};
instead of mysqrt(var); and the compiler would turn that into
  mysqrt.elem.V4DF(var)
(or whatever the mangling of the elemental functions would be).

	Jakub
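
The per-lane form suggested above can be written today with the GCC/Clang vector extensions. The following is only a sketch: `v2df`, `halve_plus_one` and `apply_per_lane` are hypothetical names standing in for `__m256d` and `mysqrt`, and a two-lane 16-byte vector is used so the example does not depend on AVX being enabled. Whether a compiler would then replace the per-lane calls with a single call to an elemental clone is exactly the open question in this thread.

```c
/* Generic 2 x double vector type (GCC/Clang vector extension).  */
typedef double v2df __attribute__ ((vector_size (16)));

/* Hypothetical scalar function standing in for mysqrt.  */
static double
halve_plus_one (double x)
{
  return 0.5 * x + 1.0;
}

/* The user spells out one scalar call per lane and packs the results
   back into a vector; no overload on the vector type is needed.  */
v2df
apply_per_lane (v2df var)
{
  return (v2df) { halve_plus_one (var[0]), halve_plus_one (var[1]) };
}
```

With this spelling the scalar function keeps a single, ordinary signature, so the question of how a user-provided vector overload would interact with a synthesized one does not arise.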
Marc Glisse - Sept. 11, 2012, 11:04 a.m.
On Tue, 11 Sep 2012, Jakub Jelinek wrote:

> On Tue, Sep 11, 2012 at 12:29:10PM +0200, Marc Glisse wrote:
>> Note that, apart from preventing external calls, it removes this use case:
>>
>> __attribute__((vector(4))) double mysqrt(double x){return sqrt(x);}
>>
>> __m256d var;
>> mysqrt(var);
>
> I don't think those functions should be available for C++ overloading.

The current patch does make them available, according to their author.

> For one, it would be only for C++, not for C, and how would you handle
> the case where the user already provides __m256d mysqrt(__m256d); overload
> in addition to the one with vector attribute?

The same way you handle it when the user provides 2 identical overloads.

> I'd say the compiler should when beneficial synthetize calls to those
> in SLP or normal vectorizer instead, so you'd write:
>  (__m256d){mysqrt(var[0]),mysqrt(var[1]),mysqrt(var[2]),mysqrt(var[3])};
> instead of mysqrt(var); and the compiler would turn that into
>  mysqrt.elem.V4DF(var)
> (or whatever the mangling of the elemental functions would be).

Ok.
Iyer, Balaji V - Sept. 11, 2012, 5:14 p.m.
Please see my answers below

>-----Original Message-----
>From: Richard Henderson [mailto:rth@redhat.com]
>Sent: Monday, September 10, 2012 12:38 PM
>To: Iyer, Balaji V
>Cc: Richard Guenther; gcc-patches@gcc.gnu.org; Gabriel Dos Reis; Aldy
>Hernandez (aldyh@redhat.com); Jeff Law
>Subject: Re: [PATCH] Merging Cilk Plus into Trunk (Patch 1 of approximately 22)
>
>On 09/10/2012 09:09 AM, Iyer, Balaji V wrote:
>>> >If that's the case, what's the point in defining an external ABI and
>>> >defining what
>>> >__attribute__((vector)) placed on a function declaration means?
>
>> When you have __attribute__((vector)) you are asking the compiler to
>> create a vector AND a scalar version of the function. The advantage is
>> that if the function is used, for example, in 2 loops where 1 can be
>> vectorized and another cannot, the vectorizable loop won't suffer
>> (i.e. suffer from being not-vectorized).
>
>
>On the other hand, if you insist on assuming a clone exists merely because a
>declaration bears an attribute, then you must address ALL of the problems with
>respect to defining a stable ABI in the face of different cpu revisions, different
>ISAs, and different vector lengths.

The function mangling handles several of the version inconsistencies you have mentioned. If the CPU revision or the vector length differs between the function declaration and the function definition, then the mangled name of the function will be different and the linker should complain.


>
>I've not seen you address ANY of these problems, despite having the problem
>pointed out multiple times.
>
>
>r~
Richard Henderson - Sept. 11, 2012, 7:11 p.m.
On 09/11/2012 10:14 AM, Iyer, Balaji V wrote:
> The function mangling handles several of the version inconsistencies
> you have mentioned. If the CPU revisions, vector lengths are not the
> same between the function declaration and the function, then the name
> of the function will be different and the linker should complain.

Sure.  I get that.  And that works for code within a single project.

But that means that if you build a shared library containing one of
these elemental functions, its external ABI changes depending on what
compiler flags you build it with.

Can you not understand how totally unacceptable this is?


r~
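
To make the objection concrete, here is a small sketch of a mangling scheme in which the ISA and the vector length are baked into the symbol name. The format `<name>.elem.<isa>.v<vlen>` and the helpers `mangle_elem_fn` and `same_symbol` are invented for this illustration and are not the scheme used by the patch.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mangling: <name>.elem.<isa>.v<vlen>.  Returns the
   value of snprintf, i.e. the length the full name would need.  */
int
mangle_elem_fn (char *buf, size_t size, const char *name,
                const char *isa, int vlen)
{
  return snprintf (buf, size, "%s.elem.%s.v%d", name, isa, vlen);
}

/* Nonzero when a reference emitted by a caller would resolve against
   the symbol exported by a library.  */
int
same_symbol (const char *wanted, const char *exported)
{
  return strcmp (wanted, exported) == 0;
}
```

Under this assumed scheme, a library built for SSE2 with four lanes exports `my_func.elem.sse2.v4`, while a caller built for AVX with eight lanes asks for `my_func.elem.avx.v8`. The link fails cleanly, which is the behaviour described in the previous message, but the exported interface of the shared library now depends on the flags used to build it.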
Iyer, Balaji V - Sept. 19, 2012, 8:58 p.m.
>-----Original Message-----
>From: gcc-patches-owner@gcc.gnu.org [mailto:gcc-patches-
>owner@gcc.gnu.org] On Behalf Of Richard Henderson
>Sent: Tuesday, September 11, 2012 3:12 PM
>To: Iyer, Balaji V
>Cc: Richard Guenther; gcc-patches@gcc.gnu.org; Gabriel Dos Reis; Aldy
>Hernandez (aldyh@redhat.com); Jeff Law
>Subject: Re: [PATCH] Merging Cilk Plus into Trunk (Patch 1 of approximately 22)
>
>On 09/11/2012 10:14 AM, Iyer, Balaji V wrote:
>> The function mangling handles several of the version inconsistencies
>> you have mentioned. If the CPU revisions, vector lengths are not the
>> same between the function declaration and the function, then the name
>> of the function will be different and the linker should complain.
>
>Sure.  I get that.  And that works for code within a single project.
>
>But that means that if you build a shared library containing one of these
>elemental functions, its external ABI changes depending on what compiler flags
>you build it with.
>
>Can you not understand how totally unacceptable this is?

Hello Richard,
Thank you very much for pointing this out to us. We do see the problem with the default case of the elemental function attribute (when the processor clause is not specified by the user). We have also found a solution. Since this has to do with the calling convention for elemental functions, it requires a fix to both the Intel compiler and to GCC. It will take us a couple of weeks to validate this. I will re-implement this and send out another patch as soon as possible. In the meantime, I will work on the array notation patches, so we can keep making forward progress.

Thanks again for pointing this out

Yours Sincerely,

Balaji V. Iyer.

>
>
>r~
>
>
>

Patch

diff --git gcc/Makefile.in gcc/Makefile.in
index 9886b6c..e5ef1e4 100644
--- gcc/Makefile.in
+++ gcc/Makefile.in
@@ -1114,7 +1114,8 @@  C_COMMON_OBJS = c-family/c-common.o c-family/c-cppbuiltin.o c-family/c-dump.o \
   c-family/c-format.o c-family/c-gimplify.o c-family/c-lex.o \
   c-family/c-omp.o c-family/c-opts.o c-family/c-pch.o \
   c-family/c-ppoutput.o c-family/c-pragma.o c-family/c-pretty-print.o \
-  c-family/c-semantics.o c-family/c-ada-spec.o tree-mudflap.o
+  c-family/c-semantics.o c-family/c-ada-spec.o tree-mudflap.o \
+  c-family/c-cpp-elem-function.o
 
 # Language-independent object files.
 # We put the insn-*.o files first so that a parallel make will build
@@ -1190,6 +1191,7 @@  OBJS = \
 	dwarf2cfi.o \
 	dwarf2out.o \
 	ebitmap.o \
+	elem-function-common.o \
 	emit-rtl.o \
 	et-forest.o \
 	except.o \
@@ -1975,7 +1977,7 @@  default-c.o: config/default-c.c $(CONFIG_H) $(SYSTEM_H) coretypes.h \
 
 attribs.o : attribs.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(TREE_H) \
 	$(FLAGS_H) $(DIAGNOSTIC_CORE_H) $(GGC_H) $(TM_P_H) \
-	$(TARGET_H) langhooks.h $(CPPLIB_H) $(PLUGIN_H)
+	$(TARGET_H) langhooks.h $(CPPLIB_H) $(PLUGIN_H) cilkplus.h
 
 incpath.o: incpath.c incpath.h $(CONFIG_H) $(SYSTEM_H) $(CPPLIB_H) \
 		intl.h prefix.h coretypes.h $(TM_H) cppdefault.h $(TARGET_H) \
@@ -2502,7 +2504,7 @@  tree-scalar-evolution.o : tree-scalar-evolution.c $(CONFIG_H) $(SYSTEM_H) \
    $(PARAMS_H) gt-tree-scalar-evolution.h
 tree-data-ref.o : tree-data-ref.c $(CONFIG_H) $(SYSTEM_H) coretypes.h dumpfile.h \
    $(GIMPLE_PRETTY_PRINT_H) $(TREE_FLOW_H) $(CFGLOOP_H) $(TREE_DATA_REF_H) \
-   langhooks.h tree-affine.h $(PARAMS_H)
+   langhooks.h tree-affine.h $(PARAMS_H) cilkplus.h
 sese.o : sese.c sese.h $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TREE_PRETTY_PRINT_H) \
    $(TREE_FLOW_H) $(CFGLOOP_H) $(TREE_DATA_REF_H) $(TREE_PASS_H) value-prof.h
 graphite.o : graphite.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(DIAGNOSTIC_CORE_H) \
@@ -2559,7 +2561,7 @@  tree-vect-stmts.o: tree-vect-stmts.c $(CONFIG_H) $(SYSTEM_H) \
    coretypes.h dumpfile.h $(TM_H) $(GGC_H) $(TREE_H) $(TARGET_H) \
    $(BASIC_BLOCK_H) $(TREE_FLOW_H) $(CFGLOOP_H) \
    $(EXPR_H) $(RECOG_H) $(OPTABS_H) $(TREE_VECTORIZER_H) \
-   langhooks.h $(GIMPLE_PRETTY_PRINT_H)
+   langhooks.h $(GIMPLE_PRETTY_PRINT_H) cilkplus.h
 tree-vect-data-refs.o: tree-vect-data-refs.c $(CONFIG_H) $(SYSTEM_H) \
    coretypes.h dumpfile.h $(TM_H) $(GGC_H) $(TREE_H) $(TARGET_H) $(BASIC_BLOCK_H) \
    $(TREE_FLOW_H) $(CFGLOOP_H) \
@@ -3332,6 +3334,8 @@  lower-subreg.o : lower-subreg.c $(CONFIG_H) $(SYSTEM_H) coretypes.h \
    insn-config.h $(BASIC_BLOCK_H) $(RECOG_H) $(OBSTACK_H) $(BITMAP_H) \
    $(EXPR_H) $(EXCEPT_H) $(REGS_H) $(TREE_PASS_H) $(DF_H) dce.h \
    lower-subreg.h
+elem-function.o: elem-function.c $(CONFIG_H) $(SYSTEM_H) $(TREE_H) $(GIMPLE_H) \
+   $(OPTABS_H) $(RECOG_H)
 target-globals.o : target-globals.c $(CONFIG_H) $(SYSTEM_H) coretypes.h \
    $(TM_H) insn-config.h $(MACHMODE_H) $(GGC_H) toplev.h target-globals.h \
    $(FLAGS_H) $(REGS_H) $(RTL_H) reload.h expmed.h $(EXPR_H) $(OPTABS_H) \
diff --git gcc/attribs.c gcc/attribs.c
index d3af414..7c8e037 100644
--- gcc/attribs.c
+++ gcc/attribs.c
@@ -33,6 +33,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "langhooks.h"
 #include "hashtab.h"
 #include "plugin.h"
+#include "cilkplus.h"
 
 /* Table of the tables of attributes (common, language, format, machine)
    searched.  */
@@ -102,6 +103,22 @@  eq_attr (const void *p, const void *q)
   return (!strncmp (spec->name, str->str, str->length) && !spec->name[str->length]);
 }
 
+/* This will return true when name matches an elemental function mask.  */
+
+bool
+is_elem_fn_attribute_p (tree name)
+{
+  if (!flag_enable_cilkplus)
+    return false;
+  return is_attribute_p ("mask", name)
+    || is_attribute_p ("unmask", name)
+    || is_attribute_p ("vectorlength", name)
+    || is_attribute_p ("vector", name)
+    || is_attribute_p ("linear", name)
+    || is_attribute_p ("uniform", name);
+}
+
+
 /* Initialize attribute tables, and make some sanity checks
    if --enable-checking.  */
 
@@ -312,6 +329,12 @@  decl_attributes (tree *node, tree attributes, int flags)
 
       if (spec == NULL)
 	{
+	  if (flag_enable_cilkplus && is_elem_fn_attribute_p (name))
+	    {
+	      returned_attrs = tree_cons (name, args, returned_attrs);
+	      DECL_ATTRIBUTES (*anode) = tree_cons (name, args,
+						    DECL_ATTRIBUTES (*anode));
+	    }
 	  if (!(flags & (int) ATTR_FLAG_BUILT_IN))
 	    warning (OPT_Wattributes, "%qE attribute directive ignored",
 		     name);
diff --git gcc/c-family/c-common.c gcc/c-family/c-common.c
index 502613a..a4c597c 100644
--- gcc/c-family/c-common.c
+++ gcc/c-family/c-common.c
@@ -45,6 +45,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "opts.h"
 #include "cgraph.h"
 #include "target-def.h"
+#include "cilkplus.h"
 
 cpp_reader *parse_in;		/* Declared in c-pragma.h.  */
 
@@ -372,6 +373,7 @@  static tree handle_type_generic_attribute (tree *, tree, tree, int, bool *);
 static tree handle_alloc_size_attribute (tree *, tree, tree, int, bool *);
 static tree handle_target_attribute (tree *, tree, tree, int, bool *);
 static tree handle_optimize_attribute (tree *, tree, tree, int, bool *);
+static tree handle_vector_attribute (tree *, tree, tree, int, bool *);
 static tree ignore_attribute (tree *, tree, tree, int, bool *);
 static tree handle_no_split_stack_attribute (tree *, tree, tree, int, bool *);
 static tree handle_fnspec_attribute (tree *, tree, tree, int, bool *);
@@ -726,6 +728,8 @@  const struct attribute_spec c_common_attribute_table[] =
 			      handle_target_attribute, false },
   { "optimize",               1, -1, true, false, false,
 			      handle_optimize_attribute, false },
+  { "vector",                 0, -1, true, false, false,
+                                  handle_vector_attribute, false },
   /* For internal use only.  The leading '*' both prevents its usage in
      source code and signals that it may be overridden by machine tables.  */
   { "*tm regparm",            0, 0, false, true, true,
@@ -8638,6 +8642,27 @@  parse_optimize_options (tree args, bool attr_p)
   return ret;
 }
 
+/* For handling the vector attribute that is used to indicate elemental
+   functions in Cilk Plus.  */
+
+static tree
+handle_vector_attribute (tree *node, tree name ATTRIBUTE_UNUSED,
+			 tree args ATTRIBUTE_UNUSED,
+			 int ARG_UNUSED (flags), bool *no_add_attrs)
+{
+  tree opt_list;
+  VEC(tree,gc) *opt_vec = NULL;
+  opt_vec = make_tree_vector ();
+  VEC_safe_push (tree, gc, opt_vec, build_string (2, "O3"));
+  opt_list = build_tree_list_vec (opt_vec);
+  release_tree_vector (opt_vec);
+  handle_optimize_attribute (node, get_identifier ("optimize"), opt_list,
+			     flags, no_add_attrs);
+  return NULL_TREE;
+}
+
+
+
 /* For handling "optimize" attribute. arguments as in
    struct attribute_spec.handler.  */
 
diff --git gcc/c-family/c-cpp-elem-function.c gcc/c-family/c-cpp-elem-function.c
new file mode 100644
index 0000000..4a87a9f
--- /dev/null
+++ gcc/c-family/c-cpp-elem-function.c
@@ -0,0 +1,563 @@ 
+/* This file is part of the Intel(R) Cilk(TM) Plus support
+   This file contains C/C++ specific functions for elemental
+   functions.
+   
+   Copyright (C) 2012  Free Software Foundation, Inc.
+   Written by Balaji V. Iyer <balaji.v.iyer@intel.com>,
+              Intel Corporation
+
+   Many Thanks to Karthik Kumar for advice on the basic technique
+   about cloning functions.
+   
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   GCC is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "tree.h"
+#include "langhooks.h"
+#include "cilkplus.h"
+#include "tm_p.h"
+#include "hard-reg-set.h"
+#include "basic-block.h"
+#include "output.h"
+#include "c-family/c-common.h"
+#include "diagnostic.h"
+#include "tree-flow.h"
+#include "tree-dump.h"
+#include "tree-pass.h"
+#include "timevar.h"
+#include "flags.h"
+#include "c/c-tree.h"
+#include "tree-inline.h"
+#include "cgraph.h"
+#include "ipa-prop.h"
+#include "opts.h"
+#include "tree-iterator.h"
+#include "toplev.h"
+#include "options.h"
+#include "intl.h"
+#include "vec.h"
+
+
+static tree create_optimize_attribute (int);
+static tree create_processor_attribute (elem_fn_info *, tree *);
+static tree elem_fn_build_array (tree base_var, tree index);
+
+
+/* This function will create the appropriate __target__ attribute for the 
+   processor.  */
+
+static tree
+create_processor_attribute (elem_fn_info *elem_fn_values, tree *opposite_attr)
+{
+  /* You will need the opposite attribute for the scalar code part.  */
+  tree proc_attr, opp_proc_attr;
+  VEC(tree,gc) *proc_vec_list = VEC_alloc (tree, gc, 4);
+  VEC(tree,gc) *opp_proc_vec_list = VEC_alloc (tree, gc, 4);
+  
+  if (!elem_fn_values || !elem_fn_values->proc_type)
+    return NULL_TREE;
+
+  if (!strcmp (elem_fn_values->proc_type, "pentium_4"))
+    {
+      VEC_safe_push (tree, gc, proc_vec_list,
+		     build_string (strlen ("arch=pentium4"), "arch=pentium4"));
+      VEC_safe_push (tree, gc, proc_vec_list,
+		     build_string (strlen ("mmx"), "mmx"));
+      if (opposite_attr)
+	{
+	  VEC_safe_push (tree, gc, opp_proc_vec_list,
+			 build_string (strlen ("no-mmx"), "no-mmx"));
+	  VEC_safe_push (tree, gc, opp_proc_vec_list,
+			 build_string (strlen ("arch=pentium4"),
+				       "arch=pentium4"));
+	}
+    }
+  else if (!strcmp (elem_fn_values->proc_type, "pentium_4_sse3"))
+    {
+      VEC_safe_push (tree, gc, proc_vec_list,
+		     build_string (strlen ("arch=pentium4"), "arch=pentium4"));
+      VEC_safe_push (tree, gc, proc_vec_list,
+		     build_string (strlen ("sse3"), "sse3"));
+      if (opposite_attr)
+	{
+	  VEC_safe_push (tree, gc, opp_proc_vec_list,
+			 build_string (strlen ("arch=pentium4"),
+				       "arch=pentium4"));
+	  VEC_safe_push (tree, gc, opp_proc_vec_list,
+			 build_string (strlen ("no-sse3"), "no-sse3"));
+	}
+    }
+  else if (!strcmp (elem_fn_values->proc_type, "core2_duo_sse3"))
+    {
+      VEC_safe_push (tree, gc, proc_vec_list,
+		     build_string (strlen ("arch=core2"), "arch=core2"));
+      VEC_safe_push (tree, gc, proc_vec_list,
+		     build_string (strlen ("sse3"), "sse3"));
+      if (opposite_attr)
+	{
+	  VEC_safe_push (tree, gc, opp_proc_vec_list,
+			 build_string (strlen ("arch=core2"), "arch=core2"));
+	  VEC_safe_push (tree, gc, opp_proc_vec_list,
+			 build_string (strlen ("no-sse3"), "no-sse3"));
+	}
+    }
+  else if (!strcmp (elem_fn_values->proc_type, "core_2_duo_sse_4_1"))
+    {
+      VEC_safe_push (tree, gc, proc_vec_list,
+		     build_string (strlen ("arch=core2"), "arch=core2"));
+      VEC_safe_push (tree, gc, proc_vec_list,
+		     build_string (strlen ("sse4.1"), "sse4.1"));
+      if (opposite_attr)
+	{
+	  VEC_safe_push (tree, gc, opp_proc_vec_list,
+			 build_string (strlen ("arch=core2"), "arch=core2"));
+	  VEC_safe_push (tree, gc, opp_proc_vec_list,
+			 build_string (strlen ("no-sse4.1"), "no-sse4.1"));
+	}
+    }
+  else if (!strcmp (elem_fn_values->proc_type, "core_i7_sse4_2"))
+    {
+      VEC_safe_push (tree, gc, proc_vec_list,
+		     build_string (strlen ("arch=corei7"), "arch=corei7"));
+      VEC_safe_push (tree, gc, proc_vec_list,
+		     build_string (strlen ("sse4.2"), "sse4.2"));
+      VEC_safe_push (tree, gc, proc_vec_list,
+		     build_string (strlen ("avx"), "avx"));
+      if (opposite_attr)
+	{
+	  VEC_safe_push (tree, gc, opp_proc_vec_list,
+			 build_string (strlen ("arch=corei7"), "arch=corei7"));
+	  VEC_safe_push (tree, gc, opp_proc_vec_list,
+			 build_string (strlen ("no-sse4.2"), "no-sse4.2"));
+	}
+    }
+  else
+    sorry ("Processor type not supported.");
+
+  proc_attr = build_tree_list_vec (proc_vec_list);
+  VEC_truncate (tree, proc_vec_list, 0);
+  proc_attr = build_tree_list (get_identifier ("__target__"), proc_attr);
+
+  if (opposite_attr)
+    {
+      opp_proc_attr = build_tree_list_vec (opp_proc_vec_list);
+      VEC_truncate (tree, opp_proc_vec_list, 0);
+      opp_proc_attr = build_tree_list (get_identifier ("__target__"),
+				       opp_proc_attr);
+      *opposite_attr = opp_proc_attr;
+    }
+  return proc_attr;
+}
+
+/* This will create an optimize attribute for the vector function, to make sure
+   the vectorizer is turned on and has its full capabilities.  */
+
+static tree
+create_optimize_attribute (int option)
+{
+  tree opt_attr;
+  VEC(tree,gc) *opt_vec = VEC_alloc (tree,gc, 4);
+  char optimization[2];
+  optimization[0] = 'O';
+  
+  if (option == 3)
+    optimization[1] = '3';
+  else if (option == 2)
+    optimization[1] = '2';
+  else if (option == 1)
+    optimization[1] = '1';
+  else if (option == 0)
+    optimization[1] = '0';
+  
+  VEC_safe_push (tree, gc, opt_vec, build_string (2, optimization));
+  opt_attr = build_tree_list_vec (opt_vec);
+  VEC_truncate (tree, opt_vec, 0);
+  opt_attr = build_tree_list (get_identifier ("optimize"), opt_attr);
+  return opt_attr;
+}
+
+
+/* This function will store return expression to a temporary var.  */
+
+static tree
+replace_return_with_new_var (tree *tp, int *walk_subtrees, void *data)
+{
+  tree mod_expr = NULL_TREE, return_var = NULL_TREE, ret_expr = NULL_TREE;
+  
+  if (!*tp)
+    return NULL_TREE;
+
+  if (TREE_CODE (*tp) == RETURN_EXPR)
+    {
+      return_var = (tree) data;
+      ret_expr = TREE_OPERAND (TREE_OPERAND (*tp, 0), 1);
+      mod_expr = build2 (MODIFY_EXPR, TREE_TYPE (return_var), return_var,
+			 ret_expr);
+      *tp = mod_expr;
+      *walk_subtrees = 0;
+    }
+  return NULL_TREE;
+}
+
+
+/* This function will create a vector access as an array access.  */
+
+static tree
+elem_fn_build_array (tree base_var, tree index)
+{
+  return build_array_ref (UNKNOWN_LOCATION, base_var, index);
+}
+
+/* This function will replace all vector references with array references.  */
+
+static tree
+replace_array_ref_for_vec (tree *tp, int *walk_subtrees, void *data)
+{
+  tree ii_var;
+  fn_vect_elements *func_data;
+  if (!*tp)
+    return NULL_TREE;
+
+  if (TREE_CODE (*tp) == VAR_DECL || TREE_CODE (*tp) == PARM_DECL)
+    {
+      func_data = (fn_vect_elements *) data;
+      gcc_assert (func_data->induction_var);
+      for (ii_var = func_data->arguments; ii_var; ii_var = DECL_CHAIN (ii_var))
+	{
+	  if (DECL_NAME (ii_var) == DECL_NAME (*tp))
+	    {
+	      *tp =  elem_fn_build_array (*tp, func_data->induction_var);
+	      *walk_subtrees = 0;
+	      return NULL_TREE;
+	    }
+	}
+      if (func_data->return_var 
+	  && (DECL_NAME (*tp) == DECL_NAME (func_data->return_var)))
+	{
+	  *tp = elem_fn_build_array (*tp, func_data->induction_var);
+	  *walk_subtrees = 0;
+	}
+    }
+  return NULL_TREE;
+}
+
+/* This function will move return values to the end of the function.  */
+
+static void
+fix_elem_fn_return_value (tree fndecl, tree induction_var)
+{
+  fn_vect_elements data;
+  tree old_fndecl;
+  tree new_var, new_var_init,  new_body = NULL_TREE;
+  tree ret_expr, ret_stmt = NULL_TREE;
+  if (!fndecl || !DECL_SAVED_TREE (fndecl))
+    return;
+
+  if (TREE_TYPE (DECL_RESULT (fndecl)) == void_type_node)
+    return;
+
+  old_fndecl = current_function_decl;
+  push_cfun (DECL_STRUCT_FUNCTION (fndecl));
+  current_function_decl = fndecl;
+  
+  new_var = create_tmp_var (TREE_TYPE (DECL_RESULT (fndecl)), "elem_fn_ret");
+  new_var_init =
+    build_vector_from_val
+    (TREE_TYPE (DECL_RESULT (fndecl)),
+     build_zero_cst (TREE_TYPE (TREE_TYPE (DECL_RESULT (fndecl)))));
+  DECL_INITIAL (new_var) = new_var_init;
+  walk_tree (&DECL_SAVED_TREE (fndecl), replace_return_with_new_var,
+	     (void *) new_var, NULL);
+  data.return_var = new_var;
+  data.arguments = DECL_ARGUMENTS (fndecl);
+  data.induction_var = induction_var;
+
+  walk_tree (&DECL_SAVED_TREE (fndecl), replace_array_ref_for_vec,
+	     (void *) &data, NULL);
+  ret_expr = build2 (MODIFY_EXPR, TREE_TYPE (new_var),
+		     DECL_RESULT (fndecl), new_var);
+  
+  ret_stmt = build1 (RETURN_EXPR, TREE_TYPE (ret_expr), ret_expr);
+  if (TREE_CODE (DECL_SAVED_TREE (fndecl)) == BIND_EXPR)
+    {
+      if (!BIND_EXPR_BODY (DECL_SAVED_TREE (fndecl)))
+        ;
+      else if (TREE_CODE (BIND_EXPR_BODY (DECL_SAVED_TREE (fndecl))) !=
+	       TREE_LIST)
+	{
+	  append_to_statement_list_force
+	    (BIND_EXPR_BODY (DECL_SAVED_TREE (fndecl)), &new_body);
+	  append_to_statement_list_force (ret_stmt, &new_body);
+	}
+      else
+	{
+	  new_body = BIND_EXPR_BODY (DECL_SAVED_TREE (fndecl));
+	  append_to_statement_list_force (ret_stmt, &new_body);
+	}
+      BIND_EXPR_BODY (DECL_SAVED_TREE (fndecl)) = new_body;
+    }
+
+  pop_cfun ();
+  current_function_decl = old_fndecl;
+  return;
+}
+
+/* This function wraps the body of FNDECL in a loop of VLENGTH iterations.  */
+
+static tree
+add_elem_fn_loop (tree fndecl, int vlength)
+{
+  tree exit_label = NULL_TREE, if_label = NULL_TREE, body_label = NULL_TREE;
+  tree fn_body, loop = NULL_TREE, loop_var, mod_var, incr_expr, cond_expr;
+  tree cmp_expr, old_fndecl;
+  
+  if (!fndecl)
+    return NULL_TREE; 
+
+  if (!DECL_SAVED_TREE (fndecl))
+    return NULL_TREE;
+
+  old_fndecl = current_function_decl;
+  push_cfun (DECL_STRUCT_FUNCTION (fndecl));
+  current_function_decl = fndecl;
+  
+  if (TREE_CODE (DECL_SAVED_TREE (fndecl)) == BIND_EXPR)
+    fn_body = BIND_EXPR_BODY (DECL_SAVED_TREE (fndecl));
+  else
+    fn_body = DECL_SAVED_TREE (fndecl);
+
+  loop = alloc_stmt_list ();
+  
+  loop_var = create_tmp_var (integer_type_node, "ii_elem_fn_vec_val");
+  mod_var = build2 (MODIFY_EXPR, void_type_node, loop_var,
+		    build_int_cst (integer_type_node, 0));
+  append_to_statement_list_force (mod_var, &loop);
+  
+  if_label = build_decl (UNKNOWN_LOCATION, LABEL_DECL,
+			 get_identifier ("if_lab"), void_type_node);
+  DECL_CONTEXT (if_label) = fndecl;
+  DECL_ARTIFICIAL (if_label) = 0;
+  DECL_IGNORED_P (if_label) = 1;
+
+  exit_label = build_decl (UNKNOWN_LOCATION, LABEL_DECL,
+			   get_identifier ("exit_label"), void_type_node);
+  DECL_CONTEXT (exit_label) = fndecl;
+  DECL_ARTIFICIAL (exit_label) = 0;
+  DECL_IGNORED_P (exit_label) = 1;
+
+  body_label = build_decl (UNKNOWN_LOCATION, LABEL_DECL,
+			   get_identifier ("body_label"), void_type_node);
+  DECL_CONTEXT (body_label) = fndecl;
+  DECL_ARTIFICIAL (body_label) = 0;
+  DECL_IGNORED_P (body_label) = 1;
+  append_to_statement_list_force (build1 (LABEL_EXPR, void_type_node,
+					  if_label), &loop);
+  cmp_expr = build2 (LT_EXPR, boolean_type_node, loop_var,
+		     build_int_cst (integer_type_node, vlength));
+  cond_expr = build3 (COND_EXPR, void_type_node, cmp_expr,
+		      build1 (GOTO_EXPR, void_type_node, body_label),
+		      build1 (GOTO_EXPR, void_type_node, exit_label));
+
+  append_to_statement_list_force (cond_expr, &loop);
+  append_to_statement_list_force (build1 (LABEL_EXPR, void_type_node,
+					  body_label), &loop);
+  append_to_statement_list_force (fn_body, &loop);
+
+  incr_expr = build2 (MODIFY_EXPR, void_type_node, loop_var,
+		      build2 (PLUS_EXPR, TREE_TYPE (loop_var), loop_var,
+			      build_int_cst (integer_type_node, 1)));
+
+  append_to_statement_list_force (incr_expr, &loop);
+  append_to_statement_list_force (build1 (GOTO_EXPR, void_type_node, if_label),
+				  &loop);
+  append_to_statement_list_force (build1 (LABEL_EXPR, void_type_node,
+					  exit_label), &loop);
+  
+  if (TREE_CODE (DECL_SAVED_TREE (fndecl)) == BIND_EXPR)
+    BIND_EXPR_BODY (DECL_SAVED_TREE (fndecl)) = loop;
+  else
+    DECL_SAVED_TREE (fndecl) = loop;
+
+  pop_cfun ();
+  current_function_decl = old_fndecl;
+  
+  return loop_var;
+}
+
+/* This function will add the mask if statement for masked clone.  */
+
+static void
+add_elem_fn_mask (tree fndecl)
+{
+  tree ii_arg;
+  tree cond_expr, cmp_expr, old_fndecl;
+  tree fn_body = NULL_TREE;
+
+  if (!DECL_SAVED_TREE (fndecl))
+    return;
+
+  old_fndecl = current_function_decl;
+  push_cfun (DECL_STRUCT_FUNCTION (fndecl));
+  current_function_decl = fndecl;
+  
+  for (ii_arg = DECL_ARGUMENTS (fndecl); DECL_CHAIN (ii_arg);
+       ii_arg = DECL_CHAIN (ii_arg))
+    {
+      ;
+    }
+  if (TREE_CODE (DECL_SAVED_TREE (fndecl)) == BIND_EXPR)
+    fn_body = BIND_EXPR_BODY (DECL_SAVED_TREE (fndecl));
+  else
+    fn_body = DECL_SAVED_TREE (fndecl); /* Not sure if we ever get here.  */
+
+  gcc_assert (DECL_NAME (ii_arg) == get_identifier ("__elem_fn_mask"));
+
+  cmp_expr = fold_build2 (NE_EXPR, TREE_TYPE (ii_arg), ii_arg,
+			  build_int_cst (TREE_TYPE (TREE_TYPE (ii_arg)), 0));
+  cond_expr = fold_build3 (COND_EXPR, void_type_node, cmp_expr, fn_body,
+			   build_empty_stmt (UNKNOWN_LOCATION));
+
+  if (TREE_CODE (DECL_SAVED_TREE (fndecl)) == BIND_EXPR)
+    BIND_EXPR_BODY (DECL_SAVED_TREE (fndecl)) = cond_expr;
+  else
+    DECL_SAVED_TREE (fndecl) = cond_expr;
+
+  pop_cfun ();
+  current_function_decl = old_fndecl;
+  
+  return;
+ 
+}
+
+/* This function registers the cloned function FNDECL with the call graph.  */
+
+static void
+call_graph_add_fn (tree fndecl)
+{
+  const tree outer = current_function_decl;
+  struct function *f = DECL_STRUCT_FUNCTION (fndecl);
+
+  if (cfun)
+    f->curr_properties = cfun->curr_properties;
+  push_cfun (f);
+  current_function_decl = fndecl;
+  
+  cgraph_add_new_function (fndecl, false);
+  cgraph_finalize_function (fndecl, true);
+
+  pop_cfun ();
+  current_function_decl = outer;
+
+  return;
+}
+
+/* Function to create clones for function marked with vector attribute.  */
+
+void
+elem_fn_create_fn (tree fndecl)
+{
+  tree new_masked_fn = NULL_TREE, new_unmasked_fn = NULL_TREE;
+  tree induction_var = NULL_TREE;
+  elem_fn_info *elem_fn_values = NULL;
+  char *masked_suffix = NULL, *unmasked_suffix = NULL;
+  tree proc_attr = NULL_TREE, opp_proc_attr = NULL_TREE, opt_attr = NULL_TREE;
+
+  if (!fndecl)
+    return;
+
+  elem_fn_values = extract_elem_fn_values (fndecl);
+  if (!elem_fn_values)
+    return;
+
+  if (elem_fn_values->mask == USE_MASK)
+    masked_suffix = find_suffix (elem_fn_values, true);
+  else if (elem_fn_values->mask == USE_NOMASK)
+    unmasked_suffix = find_suffix (elem_fn_values, false);
+  else
+    {
+      masked_suffix   = find_suffix (elem_fn_values, true);
+      unmasked_suffix = find_suffix (elem_fn_values, false);
+    }
+  if (masked_suffix)
+    {
+      new_masked_fn = copy_node (fndecl);
+      new_masked_fn = rename_elem_fn (new_masked_fn, masked_suffix);
+      SET_DECL_RTL (new_masked_fn, NULL);
+      TREE_SYMBOL_REFERENCED (DECL_NAME (new_masked_fn)) = 1;
+      tree_elem_fn_versioning (fndecl, new_masked_fn, NULL, false, NULL, false,
+			       NULL, NULL, elem_fn_values->vectorlength[0],
+			       true);
+      proc_attr = create_processor_attribute (elem_fn_values, &opp_proc_attr);
+      if (proc_attr)
+	decl_attributes (&new_masked_fn, proc_attr, 0);
+      if (opp_proc_attr)
+	decl_attributes (&fndecl, opp_proc_attr, 0);
+      
+      opt_attr = create_optimize_attribute (3); /* Turn vectorizer on.  */
+      if (opt_attr)
+	decl_attributes (&new_masked_fn, opt_attr, 0);
+
+      DECL_ATTRIBUTES (new_masked_fn) =
+	remove_attribute ("vector", DECL_ATTRIBUTES (new_masked_fn));
+	
+      add_elem_fn_mask (new_masked_fn);
+      induction_var = add_elem_fn_loop (new_masked_fn,
+					elem_fn_values->vectorlength[0]);
+      fix_elem_fn_return_value (new_masked_fn, induction_var);
+      call_graph_add_fn (new_masked_fn);
+      SET_DECL_ASSEMBLER_NAME (new_masked_fn, DECL_NAME (new_masked_fn));
+      DECL_ELEM_FN_ALREADY_CLONED (new_masked_fn) = true;
+      if (DECL_STRUCT_FUNCTION (new_masked_fn))
+	DECL_STRUCT_FUNCTION (new_masked_fn)->elem_fn_already_cloned = true;
+    }
+  if (unmasked_suffix)
+    {
+      new_unmasked_fn = copy_node (fndecl);
+      new_unmasked_fn = rename_elem_fn (new_unmasked_fn, unmasked_suffix);
+      SET_DECL_RTL (new_unmasked_fn, NULL);
+      TREE_SYMBOL_REFERENCED (DECL_NAME (new_unmasked_fn)) = 1;
+      tree_elem_fn_versioning (fndecl, new_unmasked_fn, NULL, false, NULL,
+			       false, NULL, NULL,
+			       elem_fn_values->vectorlength[0], false);
+      proc_attr = create_processor_attribute (elem_fn_values, &opp_proc_attr);
+      if (proc_attr)
+	decl_attributes (&new_unmasked_fn, proc_attr, 0);
+      if (opp_proc_attr)
+	decl_attributes (&fndecl, opp_proc_attr, 0);
+      
+      opt_attr = create_optimize_attribute (3); /* Turn vectorizer on.  */
+      if (opt_attr)
+	decl_attributes (&new_unmasked_fn, opt_attr, 0);
+
+      DECL_ATTRIBUTES (new_unmasked_fn) =
+	remove_attribute ("vector", DECL_ATTRIBUTES (new_unmasked_fn));
+      induction_var = add_elem_fn_loop (new_unmasked_fn,
+					elem_fn_values->vectorlength[0]);
+      fix_elem_fn_return_value (new_unmasked_fn, induction_var);
+      call_graph_add_fn (new_unmasked_fn);
+      SET_DECL_ASSEMBLER_NAME (new_unmasked_fn, DECL_NAME (new_unmasked_fn));
+      DECL_ELEM_FN_ALREADY_CLONED (new_unmasked_fn) = true;
+      if (DECL_STRUCT_FUNCTION (new_unmasked_fn))
+	DECL_STRUCT_FUNCTION (new_unmasked_fn)->elem_fn_already_cloned = true;
+    }
+
+  free (elem_fn_values);
+  return;
+}
diff --git gcc/c-family/c.opt gcc/c-family/c.opt
index 914d110..0282e75 100644
--- gcc/c-family/c.opt
+++ gcc/c-family/c.opt
@@ -755,6 +755,10 @@  Recognize built-in functions
 fbuiltin-
 C ObjC C++ ObjC++ Joined
 
+fcilkplus
+C ObjC C++ ObjC++ LTO Report Var(flag_enable_cilkplus) Init(0)
+Enable Cilk Plus
+
 fcheck-new
 C++ ObjC++ Var(flag_check_new)
 Check the return value of new
diff --git gcc/c/c-decl.c gcc/c/c-decl.c
index e5d17b7..0b44f70 100644
--- gcc/c/c-decl.c
+++ gcc/c/c-decl.c
@@ -639,7 +639,8 @@  bind (tree name, tree decl, struct c_scope *scope, bool invisible,
   b->shadowed = 0;
   b->decl = decl;
   b->id = name;
-  b->depth = scope->depth;
+  if (scope) 
+    b->depth = scope->depth;
   b->invisible = invisible;
   b->nested = nested;
   b->inner_comp = 0;
@@ -648,10 +649,13 @@  bind (tree name, tree decl, struct c_scope *scope, bool invisible,
 
   b->u.type = NULL;
 
-  b->prev = scope->bindings;
-  scope->bindings = b;
+  if (scope) 
+    { 
+      b->prev = scope->bindings; 
+      scope->bindings = b;
+    }
 
-  if (decl_jump_unsafe (decl))
+  if (scope && decl_jump_unsafe (decl))
     scope->has_jump_unsafe_decl = 1;
 
   if (!name)
@@ -677,7 +681,7 @@  bind (tree name, tree decl, struct c_scope *scope, bool invisible,
   /* Locate the appropriate place in the chain of shadowed decls
      to insert this binding.  Normally, scope == current_scope and
      this does nothing.  */
-  while (*here && (*here)->depth > scope->depth)
+  while (scope && *here && (*here)->depth > scope->depth)
     here = &(*here)->shadowed;
 
   b->shadowed = *here;
diff --git gcc/c/c-lang.c gcc/c/c-lang.c
index ae1b081..c49e09d 100644
--- gcc/c/c-lang.c
+++ gcc/c/c-lang.c
@@ -33,7 +33,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "diagnostic-core.h"
 #include "c-objc-common.h"
 #include "c-family/c-pragma.h"
-
+#include "cilkplus.h"
 enum c_language_kind c_language = clk_c;
 
 /* Lang hooks common to C and ObjC are declared in c-objc-common.h;
diff --git gcc/c/c-lang.h gcc/c/c-lang.h
index 33271dc..78d9e60 100644
--- gcc/c/c-lang.h
+++ gcc/c/c-lang.h
@@ -23,6 +23,7 @@  along with GCC; see the file COPYING3.  If not see
 
 #include "c-family/c-common.h"
 #include "ggc.h"
+#include "cilkplus.h"
 
 struct GTY((variable_size)) lang_type {
   /* In a RECORD_TYPE, a sorted array of the fields of the type.  */
diff --git gcc/c/c-objc-common.c gcc/c/c-objc-common.c
index 9351cd5..4c284c1 100644
--- gcc/c/c-objc-common.c
+++ gcc/c/c-objc-common.c
@@ -30,6 +30,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "tree-pretty-print.h"
 #include "langhooks.h"
 #include "c-objc-common.h"
+#include "cilkplus.h"
 
 static bool c_tree_printer (pretty_printer *, text_info *, const char *,
 			    int, bool, bool, bool);
diff --git gcc/c/c-objc-common.h gcc/c/c-objc-common.h
index dbbd50a..786e922 100644
--- gcc/c/c-objc-common.h
+++ gcc/c/c-objc-common.h
@@ -106,4 +106,7 @@  along with GCC; see the file COPYING3.  If not see
 #undef LANG_HOOKS_TREE_INLINING_VAR_MOD_TYPE_P
 #define LANG_HOOKS_TREE_INLINING_VAR_MOD_TYPE_P c_vla_unspec_p
 
+#undef LANG_HOOKS_ELEM_FN_CREATE_FN
+#define LANG_HOOKS_ELEM_FN_CREATE_FN elem_fn_create_fn
+
 #endif /* GCC_C_OBJC_COMMON */
diff --git gcc/c/c-parser.c gcc/c/c-parser.c
index bea9791..d4a7553 100644
--- gcc/c/c-parser.c
+++ gcc/c/c-parser.c
@@ -208,6 +208,7 @@  typedef struct GTY(()) c_parser {
 
 static GTY (()) c_parser *the_parser;
 
+static VEC(tree,gc) *c_parser_elem_fn_expr_list (c_parser *parser);
 /* Read in and lex a single token, storing it in *TOKEN.  */
 
 static void
@@ -1668,6 +1669,14 @@  c_parser_declaration_or_fndef (c_parser *parser, bool fndef_ok,
 	      tree d = start_decl (declarator, specs, false,
 				   chainon (postfix_attrs,
 					    all_prefix_attrs));
+	      /* In Cilk Plus, for an elemental function we must know the
+		 names of the parameter variables in order to resolve the
+		 dependency information specified in the clauses.  */
+	      if (flag_enable_cilkplus && d && TREE_CODE (d) == FUNCTION_DECL
+		  && declarator && declarator->kind == cdk_function
+		  && lookup_attribute ("vector", all_prefix_attrs)
+		  && declarator->u.arg_info)
+		DECL_ARGUMENTS (d) = declarator->u.arg_info->parms;
 	      if (d)
 		finish_decl (d, UNKNOWN_LOCATION, NULL_TREE,
 			     NULL_TREE, asm_name);
@@ -3585,8 +3594,17 @@  c_parser_attributes (c_parser *parser)
 		{
 		  tree tree_list;
 		  c_parser_consume_token (parser);
-		  expr_list = c_parser_expr_list (parser, false, true,
-						  NULL, NULL);
+
+		  /* For Cilk Plus, "vector" attribute indicates that we have
+		     an elemental function.  */
+		  if (flag_enable_cilkplus
+		      && TREE_CODE (attr_name) == IDENTIFIER_NODE
+		      && simple_cst_equal (attr_name,
+					   get_identifier ("vector")) == 1)
+		    expr_list = c_parser_elem_fn_expr_list (parser);
+		  else
+		    expr_list = c_parser_expr_list (parser, false, true,
+						    NULL, NULL);
 		  tree_list = build_tree_list_vec (expr_list);
 		  attr_args = tree_cons (NULL_TREE, arg1, tree_list);
 		  release_tree_vector (expr_list);
@@ -3598,8 +3616,16 @@  c_parser_attributes (c_parser *parser)
 		attr_args = NULL_TREE;
 	      else
 		{
-		  expr_list = c_parser_expr_list (parser, false, true,
-						  NULL, NULL);
+		  /* For Cilk Plus, vector attribute implies we have an
+		     elemental function.  */
+		  if (flag_enable_cilkplus
+		      && TREE_CODE (attr_name) == IDENTIFIER_NODE
+		      && simple_cst_equal (attr_name,
+					   get_identifier ("vector")) == 1)
+		    expr_list = c_parser_elem_fn_expr_list (parser);
+		  else
+		    expr_list = c_parser_expr_list (parser, false, true,
+						    NULL, NULL);
 		  attr_args = build_tree_list_vec (expr_list);
 		  release_tree_vector (expr_list);
 		}
@@ -10889,4 +10915,427 @@  c_parse_file (void)
   the_parser = NULL;
 }
 
+/* This function parses Cilk Plus elemental function processor clauses.  */
+
+static tree
+c_parser_elem_fn_processor_clause (c_parser *parser)
+{
+  c_token *token;
+  tree proc_tree_list = NULL_TREE;
+  VEC(tree,gc) *proc_vec_list = NULL;
+
+  token = c_parser_peek_token (parser);
+  if (!c_parser_next_token_is (parser, CPP_OPEN_PAREN))
+    {
+      c_parser_error (parser, "expected %<(%>");
+      c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+      return NULL_TREE;
+    }
+  else
+    c_parser_consume_token (parser);
+
+  proc_vec_list = make_tree_vector ();
+  
+  if (!c_parser_next_token_is (parser, CPP_CLOSE_PAREN))
+    {
+      token = c_parser_peek_token (parser);
+      if (token->value && TREE_CODE (token->value) == IDENTIFIER_NODE
+	  && simple_cst_equal (token->value, get_identifier ("pentium_4")) == 1)
+	{
+	  c_parser_consume_token (parser);
+	  VEC_safe_push (tree, gc, proc_vec_list,
+			 build_string (strlen ("pentium_4"), "pentium_4"));
+	}
+      else if (token->value && TREE_CODE (token->value) == IDENTIFIER_NODE
+	       && simple_cst_equal (token->value,
+				    get_identifier ("pentium_4_sse3")) == 1)
+	{
+	  c_parser_consume_token (parser);
+	  VEC_safe_push (tree, gc, proc_vec_list,
+			 build_string (strlen ("pentium_4_sse3"),
+				       "pentium_4_sse3"));
+	}
+      else if (token->value && TREE_CODE (token->value) == IDENTIFIER_NODE
+	       && simple_cst_equal (token->value,
+				    get_identifier ("core2_duo_sse3")) == 1)
+	{
+	  c_parser_consume_token (parser);
+	  VEC_safe_push (tree, gc, proc_vec_list,
+			 build_string (strlen ("core2_duo_sse3"),
+				       "core2_duo_sse3"));
+	}
+      else if (token->value && TREE_CODE (token->value) == IDENTIFIER_NODE
+	       && simple_cst_equal (token->value,
+				    get_identifier ("core_2_duo_sse_4_1")) == 1)
+	{
+	  c_parser_consume_token (parser);
+	  VEC_safe_push (tree, gc, proc_vec_list,
+			 build_string (strlen ("core_2_duo_sse_4_1"),
+				       "core_2_duo_sse_4_1"));
+	}
+      else if (token->value && TREE_CODE (token->value) == IDENTIFIER_NODE
+	       && simple_cst_equal (token->value,
+				    get_identifier ("core_i7_sse4_2")) == 1)
+	{
+	  c_parser_consume_token (parser);
+	  VEC_safe_push (tree, gc, proc_vec_list,
+			 build_string (strlen ("core_i7_sse4_2"),
+				       "core_i7_sse4_2"));
+	}
+      else
+	sorry ("processor type not supported");
+	
+      if (c_parser_next_token_is (parser, CPP_CLOSE_PAREN))
+	c_parser_consume_token (parser);
+      else
+	c_parser_error (parser, "expected %<)%>");
+    }
+  else
+    c_parser_error (parser, "expected processor name before %<)%>");
+
+  proc_tree_list = build_tree_list_vec (proc_vec_list);
+  release_tree_vector (proc_vec_list);
+  proc_tree_list = build_tree_list (get_identifier ("processor"),
+				    proc_tree_list);
+  return proc_tree_list;
+}
+
+/* This function parses the uniform clause of Cilk Plus elemental functions.  */
+
+static tree
+c_parser_elem_fn_uniform_clause (c_parser *parser)
+{
+  c_token *token;
+  tree uniform_tree;
+  tree str_token = NULL_TREE;
+  VEC(tree,gc) *uniform_vec = NULL;
+
+  if (!c_parser_next_token_is (parser, CPP_OPEN_PAREN))
+    {
+      c_parser_error (parser, "expected %<(%>");
+      c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+      return NULL_TREE;
+    }
+  else
+    c_parser_consume_token (parser);
+  
+  uniform_vec =  make_tree_vector ();
+  while (!c_parser_next_token_is (parser, CPP_CLOSE_PAREN))
+    {
+      token = c_parser_peek_token (parser);
+      if (token->value && token->type == CPP_NAME)
+	{
+	  /* Convert the variable to a string.  */
+	  str_token = build_string (strlen (IDENTIFIER_POINTER (token->value)),
+				    IDENTIFIER_POINTER (token->value));
+	  VEC_safe_push (tree, gc, uniform_vec, str_token);
+	  c_parser_consume_token (parser);
+	  if (c_parser_next_token_is (parser, CPP_COMMA))
+	    {
+	      c_parser_consume_token (parser);
+	      if (c_parser_next_token_is_not (parser, CPP_NAME))
+		{
+		  c_parser_error (parser, "expected identifier after %<,%>");
+		  c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+		  return NULL_TREE;
+		}
+	    }
+	  else if (c_parser_next_token_is_not (parser, CPP_CLOSE_PAREN))
+	    {
+	      c_parser_error (parser,
+			      "expected %<,%> or %<)%> after identifier");
+	      c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+	      return NULL_TREE;
+	    }
+	}
+      else
+	{
+	  c_parser_error (parser, "expected identifier or comma");
+	  c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+	  return NULL_TREE;
+	}
+    }
+  c_parser_consume_token (parser);     
+  uniform_tree = build_tree_list_vec (uniform_vec);
+  release_tree_vector (uniform_vec);
+  uniform_tree = build_tree_list (get_identifier ("uniform"), uniform_tree);
+  return uniform_tree;
+}
+
+/* This function parses the linear clause of Cilk Plus Elemental functions.  */
+
+static tree
+c_parser_elem_fn_linear_clause (c_parser *parser)
+{
+  c_token *token;
+  VEC(tree,gc) *linear_vec = NULL;
+  tree linear_tree = NULL_TREE;
+  tree var_str, step_size;
+
+  if (!c_parser_next_token_is (parser, CPP_OPEN_PAREN))
+    {
+      c_parser_error (parser, "expected %<(%>");
+      c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+      return NULL_TREE;
+    }
+  else
+    c_parser_consume_token (parser);
+  linear_vec = make_tree_vector ();
+
+  while (!c_parser_next_token_is (parser, CPP_CLOSE_PAREN))
+    {
+      token = c_parser_peek_token (parser);
+      if (token->value && token->type == CPP_NAME)
+	{
+	  var_str = build_string (strlen (IDENTIFIER_POINTER (token->value)),
+				  IDENTIFIER_POINTER (token->value));
+	  c_parser_consume_token (parser);
+	  if (c_parser_next_token_is (parser, CPP_COLON))
+	    {
+	      c_parser_consume_token (parser);
+	      token = c_parser_peek_token (parser);
+	      if (token->value && token->type == CPP_NUMBER)
+		step_size = token->value;
+	      else
+		{
+		  c_parser_error (parser, "expected step-size");
+		  return NULL_TREE;
+		}
+	      c_parser_consume_token (parser);
+	    }
+	  else
+	    step_size = integer_one_node;
+	  VEC_safe_push (tree, gc, linear_vec, var_str);
+	  VEC_safe_push (tree, gc, linear_vec, step_size);
+	  if (c_parser_next_token_is (parser, CPP_COMMA))
+	    {
+	      c_parser_consume_token (parser);
+	      if (c_parser_next_token_is_not (parser, CPP_NAME))
+		{
+		  c_parser_error (parser,
+				  "expected variable after %<,%>");
+		  c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+		  return NULL_TREE;
+		}
+	    }
+	  else if (c_parser_next_token_is_not (parser, CPP_CLOSE_PAREN))
+	    {
+	      c_parser_error (parser,
+			      "expected %<,%> or %<)%> after variable/step");
+	      c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+	      return NULL_TREE;
+	    }
+	}
+      else
+	{
+	  c_parser_error (parser, "expected variable name or comma");
+	  c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+	  return NULL_TREE;
+	}
+    }
+  c_parser_consume_token (parser); /* Consume the ')'.  */
+  linear_tree = build_tree_list_vec (linear_vec);
+  release_tree_vector (linear_vec);
+  linear_tree = build_tree_list (get_identifier ("linear"), linear_tree);
+  return linear_tree;
+}
+
+/* This function parses the vectorlength clause of Elemental function.  */
+
+static tree
+c_parser_elem_fn_vlength_clause (c_parser *parser)
+{
+  c_token *token;
+  VEC(tree,gc) *vlength_vec = NULL;
+  tree vlength_tree = NULL_TREE;
+
+  if (!c_parser_next_token_is (parser, CPP_OPEN_PAREN))
+    {
+      c_parser_error (parser, "expected %<(%>");
+      c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+      return NULL_TREE;
+    }
+  else
+    c_parser_consume_token (parser);
+
+  vlength_vec = make_tree_vector ();
+  while (!c_parser_next_token_is (parser, CPP_CLOSE_PAREN))
+    {
+      token = c_parser_peek_token (parser);
+      if (token->value && token->type == CPP_NUMBER)
+	{
+	  VEC_safe_push (tree, gc, vlength_vec, token->value);
+	  c_parser_consume_token (parser);
+	  if (c_parser_next_token_is (parser, CPP_COMMA))
+	    {
+	      c_parser_consume_token (parser);
+	      if (c_parser_next_token_is_not (parser, CPP_NUMBER))
+		{
+		  c_parser_error (parser, "expected vectorlength after %<,%>");
+		  c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+		  return NULL_TREE;
+		}
+	    }
+	  else if (c_parser_next_token_is_not (parser, CPP_CLOSE_PAREN))
+	    {
+	      c_parser_error (parser,
+			      "expected %<,%> or %<)%> after vectorlength");
+	      c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+	      return NULL_TREE;
+	    }
+	}
+      else
+	{
+	  c_parser_error (parser, "expected number or comma");
+	  c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+	  return NULL_TREE;
+	}
+    }
+  c_parser_consume_token (parser);
+  vlength_tree = build_tree_list_vec (vlength_vec);
+  release_tree_vector (vlength_vec);
+  vlength_tree = build_tree_list (get_identifier ("vectorlength"),
+				  vlength_tree);
+  return vlength_tree;
+}
+
+/* This function parses the Elemental function attribute list.  */
+
+static VEC(tree,gc) *
+c_parser_elem_fn_expr_list (c_parser *parser)
+{
+  c_token *token;
+  VEC(tree,gc) *expr_list = NULL;
+  tree proc_list = NULL_TREE, mask_list = NULL_TREE, uniform_list = NULL_TREE;
+  tree vlength_list = NULL_TREE, linear_list = NULL_TREE;
+
+  expr_list = make_tree_vector ();
+
+  while (!c_parser_next_token_is (parser, CPP_CLOSE_PAREN))
+    {
+      token = c_parser_peek_token (parser);
+      if (token->value && TREE_CODE (token->value) == IDENTIFIER_NODE
+	  && simple_cst_equal (token->value,
+			       get_identifier ("processor")) == 1)
+	{
+	  c_parser_consume_token (parser);
+	  gcc_assert (proc_list == NULL_TREE);
+	  proc_list = c_parser_elem_fn_processor_clause (parser);
+	  if (c_parser_next_token_is (parser, CPP_COMMA))
+	    {
+	      c_parser_consume_token (parser);
+	      if (c_parser_next_token_is (parser, CPP_CLOSE_PAREN))
+		{
+		  c_parser_error (parser, "expected identifier after %<,%>");
+		  c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+		  return expr_list;
+		}
+	    }
+	}
+      else if (token->value && TREE_CODE (token->value) == IDENTIFIER_NODE
+	       && simple_cst_equal (token->value,
+				    get_identifier ("mask")) == 1)
+	{
+	  c_parser_consume_token (parser);
+	  gcc_assert (mask_list == NULL_TREE);
+	  mask_list = get_identifier ("mask");
+	  if (c_parser_next_token_is (parser, CPP_COMMA))
+	    {
+	      c_parser_consume_token (parser);
+	      if (c_parser_next_token_is (parser, CPP_CLOSE_PAREN))
+		{
+		  c_parser_error (parser, "expected identifier after %<,%>");
+		  c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+		  return expr_list;
+		}
+	    }	 
+	}
+      else if (token->value && TREE_CODE (token->value) == IDENTIFIER_NODE
+	       && simple_cst_equal (token->value,
+				    get_identifier ("nomask")) == 1)
+	{
+	  c_parser_consume_token (parser);
+	  gcc_assert (mask_list == NULL_TREE);
+	  mask_list = get_identifier ("nomask");
+	  if (c_parser_next_token_is (parser, CPP_COMMA))
+	    {
+	      c_parser_consume_token (parser);
+	      if (c_parser_next_token_is (parser, CPP_CLOSE_PAREN))
+		{
+		  c_parser_error (parser, "expected identifier after %<,%>");
+		  c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+		  return expr_list;
+		}
+	    }	  
+	}
+      else if (token->value && TREE_CODE (token->value) == IDENTIFIER_NODE
+	       && simple_cst_equal (token->value,
+				    get_identifier ("vectorlength")) == 1)
+	{
+	  c_parser_consume_token (parser);
+	  gcc_assert (vlength_list == NULL_TREE);
+	  vlength_list = c_parser_elem_fn_vlength_clause (parser);
+	  if (c_parser_next_token_is (parser, CPP_COMMA))
+	    {
+	      c_parser_consume_token (parser);
+	      if (c_parser_next_token_is (parser, CPP_CLOSE_PAREN))
+		{
+		  c_parser_error (parser, "expected identifier after %<,%>");
+		  c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+		  return expr_list;
+		}
+	    }	  
+	}
+      else if (token->value && TREE_CODE (token->value) == IDENTIFIER_NODE
+	       && simple_cst_equal (token->value,
+				    get_identifier ("uniform")) == 1)
+	{
+	  c_parser_consume_token (parser);
+	  gcc_assert (uniform_list == NULL_TREE);
+	  uniform_list = c_parser_elem_fn_uniform_clause (parser);
+	  if (c_parser_next_token_is (parser, CPP_COMMA))
+	    {
+	      c_parser_consume_token (parser);
+	      if (c_parser_next_token_is (parser, CPP_CLOSE_PAREN))
+		{
+		  c_parser_error (parser, "expected identifier after %<,%>");
+		  c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+		  return expr_list;
+		}
+	    }
+	}
+      else if (token->value && TREE_CODE (token->value) == IDENTIFIER_NODE
+	       && simple_cst_equal (token->value,
+				    get_identifier ("linear")) == 1)
+	{
+	  c_parser_consume_token (parser);
+	  gcc_assert (linear_list == NULL_TREE);
+	  linear_list = c_parser_elem_fn_linear_clause (parser);
+	  if (c_parser_next_token_is (parser, CPP_COMMA))
+	    {
+	      c_parser_consume_token (parser);
+	      if (c_parser_next_token_is (parser, CPP_CLOSE_PAREN))
+		{
+		  c_parser_error (parser, "expected identifier after %<,%>");
+		  c_parser_skip_until_found (parser, CPP_CLOSE_PAREN, NULL);
+		  return expr_list;
+		}
+	    }
+	}
+    }
+
+  if (proc_list)
+    VEC_safe_push (tree, gc, expr_list, proc_list);
+  if (vlength_list)
+    VEC_safe_push (tree, gc, expr_list, vlength_list);
+  if (uniform_list)
+    VEC_safe_push (tree, gc, expr_list, uniform_list);
+  if (mask_list)
+    VEC_safe_push (tree, gc, expr_list, mask_list);
+  if (linear_list)
+    VEC_safe_push (tree, gc, expr_list, linear_list);
+  
+  return expr_list;
+}
+
 #include "gt-c-c-parser.h"
diff --git gcc/cgraph.h gcc/cgraph.h
index 0d2ad41..d6efc5c 100644
--- gcc/cgraph.h
+++ gcc/cgraph.h
@@ -641,6 +641,8 @@  struct cgraph_node *cgraph_function_versioning (struct cgraph_node *,
 						basic_block, const char *);
 void tree_function_versioning (tree, tree, VEC (ipa_replace_map_p,gc)*,
 			       bool, bitmap, bool, bitmap, basic_block);
+void tree_elem_fn_versioning (tree, tree, VEC (ipa_replace_map_p,gc)*, bool, 
+			      bitmap, bool, bitmap, basic_block, int, bool);
 
 /* In cgraphbuild.c  */
 unsigned int rebuild_cgraph_edges (void);
diff --git gcc/cgraphunit.c gcc/cgraphunit.c
index 2dd0871..015c285 100644
--- gcc/cgraphunit.c
+++ gcc/cgraphunit.c
@@ -225,15 +225,24 @@  static GTY (()) tree vtable_entry_type;
 static bool
 cgraph_decide_is_function_needed (struct cgraph_node *node, tree decl)
 {
+  bool is_cloned_elem_func = false;
+  
   /* If the user told us it is used, then it must be so.  */
   if (node->symbol.force_output)
     return true;
 
+  /* When an elemental function is cloned, elem_fn_already_cloned is set
+     to true; for all other functions it is initialized to zero.  So, if
+     this is a cloned elemental function, we output it unconditionally.  */
+  if (flag_enable_cilkplus && DECL_STRUCT_FUNCTION (decl))
+    is_cloned_elem_func = DECL_STRUCT_FUNCTION (decl)->elem_fn_already_cloned;
+  
   /* Double check that no one output the function into assembly file
      early.  */
   gcc_checking_assert (!DECL_ASSEMBLER_NAME_SET_P (decl)
 	               || (node->thunk.thunk_p || node->same_body_alias)
-	               ||  !TREE_SYMBOL_REFERENCED (DECL_ASSEMBLER_NAME (decl)));
+	               ||  !TREE_SYMBOL_REFERENCED (DECL_ASSEMBLER_NAME (decl))
+		       || is_cloned_elem_func);
 
 
   /* Keep constructors, destructors and virtual functions.  */
@@ -475,7 +484,12 @@  cgraph_add_new_function (tree fndecl, bool lowered)
 	break;
       case CGRAPH_STATE_CONSTRUCTION:
 	/* Just enqueue function to be processed at nearest occurrence.  */
-	node = cgraph_create_node (fndecl);
+	/* For Cilk Plus we need to use cgraph_get_create_node instead of
+	   cgraph_create_node.  */
+	if (flag_enable_cilkplus)
+	  node = cgraph_get_create_node (fndecl);
+	else
+	  node = cgraph_create_node (fndecl);
 	if (lowered)
 	  node->lowered = true;
 	if (!cgraph_new_nodes)
@@ -879,13 +893,18 @@  cgraph_analyze_functions (void)
 	   node != (symtab_node)first_analyzed
 	   && node != (symtab_node)first_analyzed_var; node = node->symbol.next)
 	{
-	  if ((symtab_function_p (node)
-	       && cgraph (node)->local.finalized
-	       && cgraph_decide_is_function_needed (cgraph (node), node->symbol.decl))
-	      || (symtab_variable_p (node)
-		  && varpool (node)->finalized
-		  && !DECL_EXTERNAL (node->symbol.decl)
-		  && decide_is_variable_needed (varpool (node), node->symbol.decl)))
+	  if ((flag_enable_cilkplus
+	       && TREE_CODE (node->symbol.decl) == FUNCTION_DECL
+	       && DECL_ELEM_FN_ALREADY_CLONED (node->symbol.decl))
+	      || ((symtab_function_p (node)
+		   && cgraph (node)->local.finalized
+		   && cgraph_decide_is_function_needed (cgraph (node),
+							node->symbol.decl))
+		  || (symtab_variable_p (node)
+		      && varpool (node)->finalized
+		      && !DECL_EXTERNAL (node->symbol.decl)
+		      && decide_is_variable_needed (varpool (node),
+						    node->symbol.decl))))
 	    {
 	      enqueue_node (node);
 	      if (!changed && cgraph_dump_file)
@@ -989,10 +1008,18 @@  cgraph_analyze_functions (void)
       next = node->symbol.next;
       if (!node->symbol.aux && !referred_to_p (node))
 	{
-	  if (cgraph_dump_file)
-	    fprintf (cgraph_dump_file, " %s", symtab_node_name (node));
-	  symtab_remove_node (node);
-	  continue;
+	  if (flag_enable_cilkplus
+	      && TREE_CODE (node->symbol.decl) == FUNCTION_DECL
+	      && DECL_ELEM_FN_ALREADY_CLONED (node->symbol.decl))
+	    /* We do not remove a cloned elemental function.  */
+	    ;
+	  else
+	    {
+	      if (cgraph_dump_file)
+		fprintf (cgraph_dump_file, " %s", symtab_node_name (node));
+	      symtab_remove_node (node);
+	      continue;
+	    }
 	}
       if (symtab_function_p (node))
 	{
diff --git gcc/cilkplus.h gcc/cilkplus.h
new file mode 100644
index 0000000..7a20efc
--- /dev/null
+++ gcc/cilkplus.h
@@ -0,0 +1,87 @@ 
+/* This file is part of the Intel(R) Cilk(TM) Plus support.
+   This file contains declarations shared by the Cilk Plus support code.
+   Copyright (C) 2011, 2012  Free Software Foundation, Inc.
+   Contributed by Balaji V. Iyer <balaji.v.iyer@intel.com>,
+   Intel Corporation
+
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   GCC is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#ifndef GCC_CILKPLUS_H
+#define GCC_CILKPLUS_H
+
+#include "tree.h"
+
+/* This is the maximum number of entries in each elem-function array.  */
+#define MAX_VARS 50
+
+/* These are the different mask options.  They start at 12345 so that we can
+   differentiate the values during debugging.  */
+enum mask_options {
+  USE_MASK = 12345,
+  USE_NOMASK,
+  USE_BOTH
+};
+
+/* This data structure will hold all the data from the vector attribute.  */
+typedef struct
+{
+  char *proc_type;
+  enum mask_options mask;
+  int vectorlength[MAX_VARS];
+  int no_vlengths;
+  char *uniform_vars[MAX_VARS];
+  int no_uvars;
+  int uniform_location[MAX_VARS]; /* Their location in parm list.  */
+  char *linear_vars[MAX_VARS];
+  int linear_steps[MAX_VARS];
+  int linear_location[MAX_VARS]; /* Their location in parm list.  */
+  int no_lvars;
+  int private_location[MAX_VARS]; /* Parm not in uniform or linear list. */
+  int no_pvars;
+  char *func_prefix;
+  int total_no_args;
+} elem_fn_info;
+
+/* This data structure will hold all the arguments in the function.  */
+typedef struct
+{
+  tree induction_var;
+  tree arguments;
+  tree return_var;
+} fn_vect_elements;
+
+enum elem_fn_parm_type
+{ 
+  TYPE_NONE = 0, 
+  TYPE_UNIFORM = 1, 
+  TYPE_LINEAR = 2
+};
+
+
+bool is_elem_fn (tree);
+tree find_elem_fn_name (tree, tree, tree);
+void elem_fn_create_fn (tree);
+char *find_processor_code (elem_fn_info *);
+char *find_vlength_code (elem_fn_info *);
+tree rename_elem_fn (tree, const char *);
+char *find_suffix (elem_fn_info *, bool);
+enum elem_fn_parm_type find_elem_fn_parm_type (gimple, tree, tree *);
+tree find_elem_fn_name (tree, tree, tree);
+elem_fn_info *extract_elem_fn_values (tree);
+void elem_fn_create_fn (tree);
+
+#endif
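The uniform, linear and private location arrays in elem_fn_info record which clause each parameter falls under. A minimal stand-alone sketch of that classification follows; it compares plain strings instead of GCC trees, and the names parm_kind, name_in_list and classify_parm are hypothetical, not part of the patch. A parameter listed in both clauses is reported with -1 here, where the patch calls fatal_error.

```c
#include <string.h>

/* Hypothetical stand-in for enum elem_fn_parm_type; KIND_PRIVATE plays the
   role of a parameter that is in neither the uniform nor the linear list.  */
enum parm_kind { KIND_PRIVATE = 0, KIND_UNIFORM = 1, KIND_LINEAR = 2 };

/* Return nonzero if NAME occurs in LIST, which holds N strings.  */
static int
name_in_list (const char *name, const char *const *list, int n)
{
  int i;

  for (i = 0; i < n; i++)
    if (strcmp (name, list[i]) == 0)
      return 1;
  return 0;
}

/* Classify the parameter NAME.  Returns -1 if NAME appears in both the
   uniform and the linear list (an error in the patch), otherwise one of
   the parm_kind values.  */
static int
classify_parm (const char *name,
               const char *const *uniform, int n_uniform,
               const char *const *linear, int n_linear)
{
  int is_uniform = name_in_list (name, uniform, n_uniform);
  int is_linear = name_in_list (name, linear, n_linear);

  if (is_uniform && is_linear)
    return -1;
  if (is_uniform)
    return KIND_UNIFORM;
  if (is_linear)
    return KIND_LINEAR;
  return KIND_PRIVATE;
}
```

Anything that matches neither list is treated as private, which is the same fallback the patch applies when it fills private_location.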
diff --git gcc/elem-function-common.c gcc/elem-function-common.c
new file mode 100644
index 0000000..1b1f832
--- /dev/null
+++ gcc/elem-function-common.c
@@ -0,0 +1,491 @@ 
+/* This file is part of the Intel(R) Cilk(TM) Plus support
+   This file contains the language independent functions for
+   Elemental functions.
+   
+   Copyright (C) 2012  Free Software Foundation, Inc.
+   Written by Balaji V. Iyer <balaji.v.iyer@intel.com>,
+              Intel Corporation
+
+   Many Thanks to Karthik Kumar for advice on the basic technique
+   about cloning functions.
+   
+   This file is part of GCC.
+
+   GCC is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   GCC is distributed in the hope that it will be useful, but
+   WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with GCC; see the file COPYING3.  If not see
+   <http://www.gnu.org/licenses/>.  */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "tree.h"
+#include "langhooks.h"
+#include "tm_p.h"
+#include "hard-reg-set.h"
+#include "basic-block.h"
+#include "output.h"
+#include "c-family/c-common.h"
+#include "diagnostic.h"
+#include "tree-flow.h"
+#include "tree-dump.h"
+#include "tree-pass.h"
+#include "timevar.h"
+#include "flags.h"
+#include "c/c-tree.h"
+#include "tree-inline.h"
+#include "cgraph.h"
+#include "ipa-prop.h"
+#include "opts.h"
+#include "tree-iterator.h"
+#include "toplev.h"
+#include "options.h"
+#include "intl.h"
+#include "vec.h"
+#include "cilkplus.h"
+
+#define MAX_VARS 50
+
+enum elem_fn_parm_type find_elem_fn_parm_type (gimple, tree, tree *);
+bool is_elem_fn (tree);
+tree find_elem_fn_name (tree old_fndecl, tree vectype_out, tree vectype_in);
+elem_fn_info *extract_elem_fn_values (tree decl);
+
+/* This function will find the appropriate processor code used in the mangled
+   name of the vector function.  */
+
+char *
+find_processor_code (elem_fn_info *elem_fn_values)
+{
+  if (!elem_fn_values || !elem_fn_values->proc_type)
+    return xstrdup ("B");
+
+  if (!strcmp (elem_fn_values->proc_type, "pentium_4"))
+    return xstrdup ("B");
+  else if (!strcmp (elem_fn_values->proc_type, "pentium_4_sse3"))
+    return xstrdup ("D");
+  else if (!strcmp (elem_fn_values->proc_type, "core2_duo_sse3"))
+    return xstrdup ("E");
+  else if (!strcmp (elem_fn_values->proc_type, "core_2_duo_sse_4_1"))
+    return xstrdup ("F");
+  else if (!strcmp (elem_fn_values->proc_type, "core_i7_sse4_2"))
+    return xstrdup ("H");
+  else
+    gcc_unreachable ();
+
+  return NULL; /* Should never get here.  */
+}
+
+/* This function will return vectorlength, if specified, in string format -OR-
+   it will give the default vector length for the specified architecture.  */
+
+char *
+find_vlength_code (elem_fn_info *elem_fn_values)
+{
+  char *vlength_code = (char *) xmalloc (sizeof (char) * 10);
+  if (!elem_fn_values)
+    { 
+      sprintf (vlength_code, "4");
+      return vlength_code;
+    }
+
+  memset (vlength_code, 0, 10);
+  
+  if (elem_fn_values->no_vlengths != 0)
+    sprintf (vlength_code,"%d", elem_fn_values->vectorlength[0]);
+  else
+    {
+      if (!elem_fn_values->proc_type)
+	sprintf (vlength_code, "4");
+      else if (!strcmp (elem_fn_values->proc_type, "pentium_4"))
+	sprintf (vlength_code, "4");
+      else if (!strcmp (elem_fn_values->proc_type, "pentium_4_sse3"))
+	sprintf (vlength_code, "4");
+      else if (!strcmp (elem_fn_values->proc_type, "core2_duo_sse3"))
+	sprintf (vlength_code, "4");
+      else if (!strcmp (elem_fn_values->proc_type, "core_2_duo_sse_4_1"))
+	sprintf (vlength_code, "4");
+      else if (!strcmp (elem_fn_values->proc_type, "core_i7_sse4_2"))
+	sprintf (vlength_code, "4");
+      else
+	gcc_unreachable ();
+    }
+  return vlength_code;
+}
+
+
+/* This function will concatenate the suffix to the existing function name.  */
+
+tree
+rename_elem_fn (tree decl, const char *suffix)
+{
+  int length = 0;
+  const char *fn_name = IDENTIFIER_POINTER (DECL_NAME (decl));
+  char *new_fn_name;
+  tree new_decl = NULL_TREE;
+  
+  if (!suffix || !fn_name)
+    return decl;
+  else
+    new_decl = decl;
+
+  length = strlen (fn_name) + strlen (suffix) + 1;
+  new_fn_name = (char *)xmalloc (length);
+  strcpy (new_fn_name, fn_name);
+  strcat (new_fn_name, suffix);
+
+  DECL_NAME (new_decl) = get_identifier (new_fn_name);
+  return new_decl;
+}
+
+
+/* This function will find the appropriate mangling suffix for the vector
+   function.  */
+
+char *
+find_suffix (elem_fn_info *elem_fn_values, bool masked)
+{
+  char *suffix = (char*)xmalloc (100);
+  char tmp_str[10];
+  int arg_number, ii_pvar, ii_uvar, ii_lvar;
+  strcpy (suffix, "._simdsimd_");
+  strcat (suffix, find_processor_code (elem_fn_values));
+  strcat (suffix, find_vlength_code (elem_fn_values));
+
+  if (masked)
+    strcat (suffix, "m");
+  else
+    strcat (suffix, "n");
+
+  for (arg_number = 0; arg_number <= elem_fn_values->total_no_args;
+       arg_number++)
+    {
+      for (ii_lvar = 0; ii_lvar < elem_fn_values->no_lvars; ii_lvar++)
+	{
+	  if (elem_fn_values->linear_location[ii_lvar] == arg_number)
+	    {
+	      strcat (suffix, "_l");
+	      sprintf (tmp_str, "%d", elem_fn_values->linear_steps[ii_lvar]);
+	      strcat (suffix, tmp_str);
+	    }
+	}
+      for (ii_uvar = 0; ii_uvar < elem_fn_values->no_uvars; ii_uvar++) 
+	if (elem_fn_values->uniform_location[ii_uvar] == arg_number) 
+	  strcat (suffix, "_s1");
+      for (ii_pvar = 0; ii_pvar < elem_fn_values->no_pvars; ii_pvar++) 
+	if (elem_fn_values->private_location[ii_pvar] == arg_number) 
+	  strcat (suffix, "_v1");
+    } 
+  return suffix;
+}
+
+
+/* This is a helper function for find_elem_fn_parm_type.  */
+
+static enum elem_fn_parm_type
+find_elem_fn_parm_type_1 (tree fndecl, int parm_no, tree *step_size)
+{
+  int ii = 0;
+  elem_fn_info *elem_fn_values;
+
+  elem_fn_values = extract_elem_fn_values (fndecl);
+  if (!elem_fn_values)
+    return TYPE_NONE;
+
+  for (ii = 0; ii < elem_fn_values->no_lvars; ii++)
+    if (elem_fn_values->linear_location[ii] == parm_no)
+      {
+	if (step_size != NULL)
+	  *step_size = build_int_cst (integer_type_node,
+				      elem_fn_values->linear_steps[ii]);
+	return TYPE_LINEAR;
+      }
+    
+  for (ii = 0; ii < elem_fn_values->no_uvars; ii++)
+    if (elem_fn_values->uniform_location[ii] == parm_no)
+      return TYPE_UNIFORM;
+    
+  return TYPE_NONE;
+}
+  
+  
+/* This function will return the type of a parameter in an elemental function.
+   The choices are TYPE_UNIFORM, TYPE_LINEAR or TYPE_NONE.  */
+
+enum elem_fn_parm_type
+find_elem_fn_parm_type (gimple stmt, tree op, tree *step_size)
+{
+  tree fndecl, parm = NULL_TREE;
+  int ii, nargs;
+  enum elem_fn_parm_type return_type = TYPE_NONE;
+  
+  if (gimple_code (stmt) != GIMPLE_CALL)
+    return TYPE_NONE;
+
+  fndecl = gimple_call_fndecl (stmt);
+  gcc_assert (fndecl);
+
+  nargs = gimple_call_num_args (stmt);
+
+  for (ii = 0; ii < nargs; ii++)
+    {
+      parm = gimple_call_arg (stmt, ii);
+      if (op == parm)
+	{
+	  return_type = find_elem_fn_parm_type_1 (fndecl, ii, step_size);
+	  return return_type;
+	}
+    }
+  return return_type;
+}
+
+/* This function will return the appropriate cloned name for the function.  */
+
+tree
+find_elem_fn_name (tree old_fndecl, tree vectype_out, 
+		   tree vectype_in ATTRIBUTE_UNUSED)
+{
+  elem_fn_info *elem_fn_values = NULL;
+  tree new_fndecl = NULL_TREE, arg_type = NULL_TREE;
+  char *suffix = NULL;
+  
+  elem_fn_values = extract_elem_fn_values (old_fndecl);
+ 
+  if (elem_fn_values)
+    {
+      if (elem_fn_values->no_vlengths > 0)
+	{
+	  if (elem_fn_values->vectorlength[0] ==
+	      (int)TYPE_VECTOR_SUBPARTS (vectype_out))
+	    suffix = find_suffix (elem_fn_values, false);
+	  else
+	    return NULL_TREE;
+	}
+      else
+	return NULL_TREE;
+    }
+  else
+    return NULL_TREE;
+
+  new_fndecl = copy_node (rename_elem_fn (old_fndecl, suffix));
+  TREE_TYPE (new_fndecl) = copy_node (TREE_TYPE (old_fndecl));
+
+  TYPE_ARG_TYPES (TREE_TYPE (new_fndecl)) =
+    copy_list (TYPE_ARG_TYPES (TREE_TYPE (new_fndecl)));
+  
+  for (arg_type = TYPE_ARG_TYPES (TREE_TYPE (new_fndecl));
+       arg_type && arg_type != void_type_node;
+       arg_type = TREE_CHAIN (arg_type))
+    TREE_VALUE (arg_type) = vectype_out;
+  
+  if (TREE_TYPE (TREE_TYPE (new_fndecl)) != void_type_node)
+    {
+      TREE_TYPE (TREE_TYPE (new_fndecl)) =
+	copy_node (TREE_TYPE (TREE_TYPE (new_fndecl)));
+      TREE_TYPE (TREE_TYPE (new_fndecl)) = vectype_out;
+      DECL_MODE (new_fndecl) = TYPE_MODE (vectype_out);
+    }
+  
+  return new_fndecl;
+}
+
+/* This function will extract the elemental function values from the vector
+   attribute, store them in a data structure and return it.  */
+
+elem_fn_info *
+extract_elem_fn_values (tree decl)
+{
+  elem_fn_info *elem_fn_values = NULL;
+  int x = 0; /* this is a dummy variable */
+  int arg_number = 0, ii = 0;
+  tree ii_tree, jj_tree, kk_tree;
+  tree decl_attr = DECL_ATTRIBUTES (decl);
+  tree decl_ret_type;
+  if (!decl_attr)
+    return NULL;
+
+  elem_fn_values = (elem_fn_info *)xmalloc (sizeof (elem_fn_info));
+  gcc_assert (elem_fn_values);
+
+  decl_ret_type = TREE_TYPE (decl);
+  if (decl_ret_type)
+    decl_ret_type = TREE_TYPE (decl_ret_type);
+  
+  elem_fn_values->proc_type = NULL;
+  elem_fn_values->mask = USE_BOTH;
+  elem_fn_values->no_vlengths = 0;
+  elem_fn_values->no_uvars = 0;
+  elem_fn_values->no_lvars = 0;
+  elem_fn_values->no_pvars = 0;
+  if (decl_ret_type && COMPLETE_TYPE_P (decl_ret_type)
+      && !VOID_TYPE_P (decl_ret_type))
+    switch (compare_tree_int (TYPE_SIZE (decl_ret_type), 64))
+      {
+      case 0: /* This means they are equal.  */
+	elem_fn_values->vectorlength[0] = 2;
+	break;
+      case -1: /* This means it is less than 64.  */
+	elem_fn_values->vectorlength[0] = 4;
+	break;
+      default:
+	elem_fn_values->vectorlength[0] = 1;
+      }
+  
+
+  for (ii_tree = decl_attr; ii_tree; ii_tree = TREE_CHAIN (ii_tree))
+    {
+      tree ii_purpose = TREE_PURPOSE (ii_tree);
+      tree ii_value = TREE_VALUE (ii_tree);
+      if (TREE_CODE (ii_purpose) == IDENTIFIER_NODE
+	  && !strcmp (IDENTIFIER_POINTER (ii_purpose), "vector"))
+	{
+	  for (jj_tree = ii_value; jj_tree;
+	       jj_tree = TREE_CHAIN (jj_tree))
+	    {
+	      tree jj_purpose = NULL_TREE, jj_value = TREE_VALUE (jj_tree);
+
+	      /* This means we have a mask/nomask.  */
+	      if (TREE_CODE (jj_value) == IDENTIFIER_NODE)
+		{ 
+		  if (!strcmp (IDENTIFIER_POINTER (jj_value), "mask"))
+		    elem_fn_values->mask = USE_MASK;		    
+		  else if (!strcmp (IDENTIFIER_POINTER (jj_value), "nomask"))
+		    elem_fn_values->mask = USE_NOMASK;
+		  continue;
+		}
+	      else
+		jj_purpose = TREE_PURPOSE (jj_value);
+	      
+	      if (TREE_CODE (jj_purpose) == IDENTIFIER_NODE
+		  && !strcmp (IDENTIFIER_POINTER (jj_purpose), "processor"))
+		{
+		  for (kk_tree = TREE_VALUE (jj_value); kk_tree;
+		       kk_tree = TREE_CHAIN (kk_tree))
+		    {
+		      tree kk_value = TREE_VALUE (kk_tree);
+		      if (TREE_CODE (kk_value) == STRING_CST)
+			elem_fn_values->proc_type =
+			  xstrdup (TREE_STRING_POINTER (kk_value));
+		    }
+		}
+	      else if (TREE_CODE (jj_purpose) == IDENTIFIER_NODE
+		       && !strcmp (IDENTIFIER_POINTER (jj_purpose),
+				  "vectorlength"))
+		{
+		  for (kk_tree = TREE_VALUE (jj_value); kk_tree;
+		       kk_tree = TREE_CHAIN (kk_tree))
+		    {
+		      tree kk_value = TREE_VALUE (kk_tree);
+		      if (TREE_CODE (kk_value) == INTEGER_CST)
+			{
+			  x = elem_fn_values->no_vlengths;
+			  elem_fn_values->vectorlength[x] =
+			    (int) TREE_INT_CST_LOW (kk_value);
+			  elem_fn_values->no_vlengths++;
+			}
+		    }
+		}
+	      else if (TREE_CODE (jj_purpose) == IDENTIFIER_NODE
+		       && !strcmp (IDENTIFIER_POINTER (jj_purpose), "uniform"))
+		{
+		  for (kk_tree = TREE_VALUE (jj_value); kk_tree;
+		       kk_tree = TREE_CHAIN (kk_tree))
+		    {
+		      tree kk_value = TREE_VALUE (kk_tree);
+		      elem_fn_values->uniform_vars[elem_fn_values->no_uvars] =
+			xstrdup (TREE_STRING_POINTER (kk_value));
+		      elem_fn_values->no_uvars++;
+		    }
+		}
+	      else if (TREE_CODE (jj_purpose) == IDENTIFIER_NODE
+		       && !strcmp (IDENTIFIER_POINTER (jj_purpose), "linear"))
+		{
+		  for (kk_tree = TREE_VALUE (jj_value); kk_tree;
+		       kk_tree = TREE_CHAIN (kk_tree))
+		    {
+		      tree kk_value = TREE_VALUE (kk_tree);
+		      elem_fn_values->linear_vars[elem_fn_values->no_lvars] =
+			xstrdup (TREE_STRING_POINTER (kk_value));
+		      kk_tree = TREE_CHAIN (kk_tree);
+		      kk_value = TREE_VALUE (kk_tree);
+		      elem_fn_values->linear_steps[elem_fn_values->no_lvars] =
+			TREE_INT_CST_LOW (kk_value);
+		      elem_fn_values->no_lvars++;
+		    }
+		}
+	    }
+	}
+    }
+
+  for (ii_tree = DECL_ARGUMENTS (decl); ii_tree; ii_tree = DECL_CHAIN (ii_tree))
+    {
+      bool already_found = false;
+      for (ii = 0; ii < elem_fn_values->no_uvars; ii++)
+	{
+	  if (DECL_NAME (ii_tree)
+	      && !strcmp (IDENTIFIER_POINTER (DECL_NAME (ii_tree)),
+			  elem_fn_values->uniform_vars[ii]))
+	    {
+	      already_found = true;
+	      elem_fn_values->uniform_location[ii] = arg_number;
+	    }
+	}
+      for (ii = 0; ii < elem_fn_values->no_lvars; ii++)
+	{
+	  if (DECL_NAME (ii_tree)
+	      && !strcmp (IDENTIFIER_POINTER (DECL_NAME (ii_tree)),
+			  elem_fn_values->linear_vars[ii]))
+	    {
+	      if (already_found)
+		  fatal_error
+		    ("variable %s defined in both uniform and linear clause",
+		     elem_fn_values->linear_vars[ii]);
+	      else
+		{
+		  already_found = true;
+		  elem_fn_values->linear_location[ii] = arg_number;
+		}
+	    }
+	}
+      if (!already_found) /* This means this variable is a private.  */
+	elem_fn_values->private_location[elem_fn_values->no_pvars++] =
+	  arg_number;
+      arg_number++;
+    }
+
+  elem_fn_values->total_no_args = arg_number;
+  if (elem_fn_values->no_vlengths == 0)
+    /* We have a default value if none is given.  */
+    elem_fn_values->no_vlengths = 1; 
+  return elem_fn_values;
+}
+
+/* This function will check whether FNDECL is a function that needs to be
+   converted to its vector equivalent.  */
+
+bool
+is_elem_fn (tree fndecl)
+{
+  tree ii_tree;
+
+  for (ii_tree = DECL_ATTRIBUTES (fndecl); ii_tree;
+       ii_tree = TREE_CHAIN (ii_tree))
+    {
+      tree ii_value = TREE_PURPOSE (ii_tree);
+      if (TREE_CODE (ii_value) == IDENTIFIER_NODE
+	  && !strcmp (IDENTIFIER_POINTER (ii_value), "vector"))
+	return true;
+    }
+  /* If we are here, then we didn't find a vector keyword, so it is false.  */
+  return false;
+}
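The suffix that find_suffix builds has the shape "._simdsimd_", then the processor code, the vector length, "m" or "n" for masked or unmasked, and one tag per parameter in order ("_l<step>" for linear, "_s1" for uniform, "_v1" for private). The following is a minimal stand-alone sketch of that layout, not the patch's code: it uses snprintf with an explicit buffer size instead of strcat into a fixed buffer, and the names parm_desc and build_suffix are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical parameter kinds, mirroring the uniform / linear / private
   split that find_suffix walks over.  */
enum parm_kind { PARM_PRIVATE, PARM_UNIFORM, PARM_LINEAR };

struct parm_desc
{
  enum parm_kind kind;
  int step;                     /* Only meaningful for PARM_LINEAR.  */
};

/* Write the mangling suffix into BUF, which holds SIZE bytes.
   Returns 0 on success and -1 if BUF is too small.  */
static int
build_suffix (char *buf, size_t size, char proc_code, int vlength,
              int masked, const struct parm_desc *parms, int n_parms)
{
  size_t used;
  int i, n;

  n = snprintf (buf, size, "._simdsimd_%c%d%c", proc_code, vlength,
                masked ? 'm' : 'n');
  if (n < 0 || (size_t) n >= size)
    return -1;
  used = (size_t) n;

  for (i = 0; i < n_parms; i++)
    {
      switch (parms[i].kind)
        {
        case PARM_LINEAR:
          n = snprintf (buf + used, size - used, "_l%d", parms[i].step);
          break;
        case PARM_UNIFORM:
          n = snprintf (buf + used, size - used, "_s1");
          break;
        default:
          n = snprintf (buf + used, size - used, "_v1");
          break;
        }
      if (n < 0 || (size_t) n >= size - used)
        return -1;
      used += (size_t) n;
    }
  return 0;
}
```

Under these assumptions, a function with processor code 'B', vector length 4, no mask, and parameters (private, uniform, linear with step 2) would get the suffix "._simdsimd_B4n_v1_s1_l2". Passing the buffer size explicitly means an unusually long parameter list produces an error return rather than a buffer overrun.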
diff --git gcc/expr.c gcc/expr.c
index 4e7eb5f..1efd198 100644
--- gcc/expr.c
+++ gcc/expr.c
@@ -9290,7 +9290,10 @@  expand_expr_real_1 (tree exp, rtx target, enum machine_mode tmode,
 	    }
 	  else
 	    pmode = promote_decl_mode (exp, &unsignedp);
-	  gcc_assert (GET_MODE (decl_rtl) == pmode);
+
+	  /* With Cilk Plus, the mode of decl_rtl may differ from pmode.  */
+	  if (!flag_enable_cilkplus)
+	    gcc_assert (GET_MODE (decl_rtl) == pmode);
 
 	  temp = gen_lowpart_SUBREG (mode, decl_rtl);
 	  SUBREG_PROMOTED_VAR_P (temp) = 1;
diff --git gcc/function.h gcc/function.h
index 684bbce..8448fe2 100644
--- gcc/function.h
+++ gcc/function.h
@@ -556,6 +556,10 @@  struct GTY(()) function {
   /* Vector of function local variables, functions, types and constants.  */
   VEC(tree,gc) *local_decls;
 
+  /* In an elemental function, if this is true, it means the function is
+     already cloned and should not be cloned again.  */
+  bool elem_fn_already_cloned;
+
   /* For md files.  */
 
   /* tm.h can use this to store whatever it likes.  */
diff --git gcc/gengtype.c gcc/gengtype.c
index 2ae4372..77dd888 100644
--- gcc/gengtype.c
+++ gcc/gengtype.c
@@ -1646,7 +1646,7 @@  open_base_files (void)
       "tree-flow.h", "reload.h", "cpp-id-data.h", "tree-chrec.h",
       "except.h", "output.h", "gimple.h", "cfgloop.h",
       "target.h", "ipa-prop.h", "lto-streamer.h", "target-globals.h",
-      "ipa-inline.h", "dwarf2out.h", NULL
+      "ipa-inline.h", "cilkplus.h", "dwarf2out.h", NULL
     };
     const char *const *ifp;
     outf_p gtype_desc_c;
diff --git gcc/gimplify.c gcc/gimplify.c
index 0397353..91fbfb9 100644
--- gcc/gimplify.c
+++ gcc/gimplify.c
@@ -47,6 +47,7 @@  along with GCC; see the file COPYING3.  If not see
 
 #include "langhooks-def.h"	/* FIXME: for lhd_set_decl_assembler_name */
 #include "tree-pass.h"		/* FIXME: only for PROP_gimple_any */
+#include "cilkplus.h"
 
 enum gimplify_omp_var_data
 {
@@ -8263,6 +8264,12 @@  gimplify_function_tree (tree fndecl)
 
   oldfn = current_function_decl;
   current_function_decl = fndecl;
+
+  /* Here we check to see if we have a function with the attribute "vector".
+     If so, then we must clone it to masked/unmasked when appropriate.  */
+  if (flag_enable_cilkplus && is_elem_fn (fndecl))
+    lang_hooks.cilkplus.elem_fn_create_fn (fndecl);
+  
   if (DECL_STRUCT_FUNCTION (fndecl))
     push_cfun (DECL_STRUCT_FUNCTION (fndecl));
   else
diff --git gcc/langhooks-def.h gcc/langhooks-def.h
index d8f479f..a27cba9 100644
--- gcc/langhooks-def.h
+++ gcc/langhooks-def.h
@@ -211,6 +211,13 @@  extern tree lhd_make_node (enum tree_code);
 #define LANG_HOOKS_OMP_CLAUSE_DTOR hook_tree_tree_tree_null
 #define LANG_HOOKS_OMP_FINISH_CLAUSE hook_void_tree
 
+void lhd_elem_fn_create_fn (tree);
+#define LANG_HOOKS_ELEM_FN_CREATE_FN lhd_elem_fn_create_fn
+
+#define LANG_HOOKS_CILKPLUS { \
+  LANG_HOOKS_ELEM_FN_CREATE_FN \
+}
+
 #define LANG_HOOKS_DECLS { \
   LANG_HOOKS_GLOBAL_BINDINGS_P, \
   LANG_HOOKS_PUSHDECL, \
@@ -288,6 +295,7 @@  extern void lhd_end_section (void);
   LANG_HOOKS_TREE_DUMP_INITIALIZER, \
   LANG_HOOKS_DECLS, \
   LANG_HOOKS_FOR_TYPES_INITIALIZER, \
+  LANG_HOOKS_CILKPLUS, \
   LANG_HOOKS_LTO, \
   LANG_HOOKS_GET_INNERMOST_GENERIC_PARMS, \
   LANG_HOOKS_GET_INNERMOST_GENERIC_ARGS, \
diff --git gcc/langhooks.c gcc/langhooks.c
index a9e60f9..551197f 100644
--- gcc/langhooks.c
+++ gcc/langhooks.c
@@ -667,3 +667,12 @@  lhd_end_section (void)
       saved_section = NULL;
     }
 }
+
+/* Default elem_fn_create_fn hook: an empty function that just returns.  */
+
+void
+lhd_elem_fn_create_fn (tree x ATTRIBUTE_UNUSED)
+{
+    return;
+}
+
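The hook added here follows the usual language-hook pattern: the middle end calls through a function pointer, the default implementation does nothing, and a front end that supports elemental functions substitutes its own callback. A minimal stand-alone sketch of that pattern is below. It uses an opaque void pointer in place of a GCC tree, and cilkplus_hooks, c_elem_fn_create_fn, clones_created and gimplify_like_caller are hypothetical names for illustration only.

```c
#include <stddef.h>

/* Hypothetical miniature of struct lang_hooks_for_cilkplus: one callback
   taking an opaque declaration pointer instead of a GCC tree.  */
struct cilkplus_hooks
{
  void (*elem_fn_create_fn) (void *decl);
};

/* Default hook: deliberately does nothing, like lhd_elem_fn_create_fn.  */
static void
default_elem_fn_create_fn (void *decl)
{
  (void) decl;
}

/* Counter standing in for the real cloning work of a front end.  */
static int clones_created;

/* Hypothetical front-end override that "clones" the function.  */
static void
c_elem_fn_create_fn (void *decl)
{
  if (decl != NULL)
    clones_created++;
}

static struct cilkplus_hooks default_hooks = { default_elem_fn_create_fn };
static struct cilkplus_hooks c_hooks = { c_elem_fn_create_fn };

/* Caller shaped like the check in gimplify_function_tree: the hook is only
   invoked for declarations that carry the vector attribute.  */
static void
gimplify_like_caller (const struct cilkplus_hooks *hooks, void *decl,
                      int is_elem_fn)
{
  if (is_elem_fn)
    hooks->elem_fn_create_fn (decl);
}
```

With the default hook installed, the call is a harmless no-op, so front ends that do not implement Cilk Plus need no changes.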
diff --git gcc/langhooks.h gcc/langhooks.h
index a919067..ae8764f 100644
--- gcc/langhooks.h
+++ gcc/langhooks.h
@@ -225,6 +225,14 @@  struct lang_hooks_for_decls
   void (*omp_finish_clause) (tree clause);
 };
 
+/* Language hooks related to Cilk Plus, mainly Cilk keywords handling.  */
+
+struct lang_hooks_for_cilkplus
+{ 
+  void (*elem_fn_create_fn) (tree);
+};
+
+
 /* Language hooks related to LTO serialization.  */
 
 struct lang_hooks_for_lto
@@ -406,6 +414,8 @@  struct lang_hooks
 
   struct lang_hooks_for_types types;
 
+  struct lang_hooks_for_cilkplus cilkplus;
+
   struct lang_hooks_for_lto lto;
 
   /* Returns the generic parameters of an instantiation of
diff --git gcc/tree-data-ref.c gcc/tree-data-ref.c
index 38327b0..a3a101a 100644
--- gcc/tree-data-ref.c
+++ gcc/tree-data-ref.c
@@ -86,6 +86,7 @@  along with GCC; see the file COPYING3.  If not see
 #include "langhooks.h"
 #include "tree-affine.h"
 #include "params.h"
+#include "cilkplus.h"
 
 static struct datadep_stats
 {
@@ -4383,8 +4384,18 @@  find_data_references_in_stmt (struct loop *nest, gimple stmt,
 
   if (get_references_in_stmt (stmt, &references))
     {
-      VEC_free (data_ref_loc, heap, references);
-      return false;
+      /* If we have an elemental function, then don't worry about its
+	 references; they are probably available somewhere.  */
+      if (flag_enable_cilkplus
+	  && gimple_code (stmt) == GIMPLE_CALL
+	  && gimple_call_fndecl (stmt)
+	  && is_elem_fn (gimple_call_fndecl (stmt)))
+	;
+      else
+	{
+	  VEC_free (data_ref_loc, heap, references);
+	  return false;
+	}
     }
 
   FOR_EACH_VEC_ELT (data_ref_loc, references, i, ref)
diff --git gcc/tree-inline.c gcc/tree-inline.c
index 20d3317..b9cdc8d 100644
--- gcc/tree-inline.c
+++ gcc/tree-inline.c
@@ -4866,6 +4866,68 @@  copy_arguments_for_versioning (tree orig_parm, copy_body_data * id,
   return new_parm;
 }
 
+/* Return a copy of the function's argument tree.  */
+
+static tree
+elem_fn_copy_arguments_for_versioning (tree orig_parm, copy_body_data * id,
+				       bitmap args_to_skip, tree *vars,
+				       int vlength, bool masked)
+{
+  tree arg, *parg;
+  tree new_parm = NULL;
+  int i = 0;
+  tree masked_parm = NULL_TREE;
+  parg = &new_parm;
+
+  if (masked)
+    {
+      masked_parm = build_decl (UNKNOWN_LOCATION, PARM_DECL,
+				get_identifier ("__elem_fn_mask"),
+				build_vector_type (integer_type_node, vlength));
+      DECL_ARG_TYPE (masked_parm) = build_vector_type (integer_type_node,
+						       vlength);
+      DECL_ARTIFICIAL (masked_parm) = 1;
+      lang_hooks.dup_lang_specific_decl (masked_parm);
+    }
+  for (arg = orig_parm; arg; arg = DECL_CHAIN (arg), i++)
+    if (!args_to_skip || !bitmap_bit_p (args_to_skip, i))
+      {
+        tree new_tree = remap_decl (arg, id);
+	if (TREE_CODE (new_tree) != PARM_DECL)
+	  new_tree = id->copy_decl (arg, id);
+	TREE_TYPE (new_tree) = copy_node (TREE_TYPE (new_tree));
+	TREE_TYPE (new_tree) = build_vector_type (TREE_TYPE (new_tree),
+						  vlength);
+	DECL_ARG_TYPE (new_tree) = build_vector_type (DECL_ARG_TYPE (new_tree),
+						      vlength);
+        lang_hooks.dup_lang_specific_decl (new_tree);
+        *parg = new_tree;
+	parg = &DECL_CHAIN (new_tree);
+      }
+    else if (!pointer_map_contains (id->decl_map, arg))
+      {
+	/* Make an equivalent VAR_DECL.  If the argument was used
+	   as temporary variable later in function, the uses will be
+	   replaced by local variable.  */
+	tree var = copy_decl_to_var (arg, id);
+	insert_decl_map (id, arg, var);
+        /* Declare this new variable.  */
+        DECL_CHAIN (var) = *vars;
+        *vars = var;
+      }
+  if (masked && masked_parm)
+    {
+      for (arg = new_parm; DECL_CHAIN (arg); arg = DECL_CHAIN (arg))
+	;
+      
+      DECL_CONTEXT (masked_parm) = DECL_CONTEXT (arg);
+      DECL_CHAIN (arg) = masked_parm;
+    }
+  return new_parm;
+}
+
+
+
 /* Return a copy of the function's static chain.  */
 static tree
 copy_static_chain (tree static_chain, copy_body_data * id)
@@ -5266,6 +5328,250 @@  tree_function_versioning (tree old_decl, tree new_decl,
   return;
 }
 
+/* This function initializes the cfun struct for elemental functions.  */
+
+static void
+initialize_elem_fn_cfun (tree new_fndecl, tree callee_fndecl)
+{
+  struct function *src_cfun = DECL_STRUCT_FUNCTION (callee_fndecl);
+
+  /* Get clean struct function.  */
+  push_struct_function (new_fndecl);
+
+  /* We will rebuild these, so just sanity check that they are empty.  */
+  gcc_assert (VALUE_HISTOGRAMS (cfun) == NULL);
+  gcc_assert (cfun->local_decls == NULL);
+  gcc_assert (cfun->cfg == NULL);
+  gcc_assert (cfun->decl == new_fndecl);
+
+  /* Copy items we preserve during cloning.  */
+  cfun->static_chain_decl = src_cfun->static_chain_decl;
+  cfun->nonlocal_goto_save_area = src_cfun->nonlocal_goto_save_area;
+  cfun->function_end_locus = src_cfun->function_end_locus;
+  cfun->curr_properties = src_cfun->curr_properties & ~PROP_loops;
+  cfun->last_verified = src_cfun->last_verified;
+  cfun->va_list_gpr_size = src_cfun->va_list_gpr_size;
+  cfun->va_list_fpr_size = src_cfun->va_list_fpr_size;
+  cfun->has_nonlocal_label = src_cfun->has_nonlocal_label;
+  cfun->stdarg = src_cfun->stdarg;
+  cfun->after_inlining = src_cfun->after_inlining;
+  cfun->can_throw_non_call_exceptions
+    = src_cfun->can_throw_non_call_exceptions;
+  cfun->returns_struct = src_cfun->returns_struct;
+  cfun->returns_pcc_struct = src_cfun->returns_pcc_struct;
+  
+  if (src_cfun->eh)
+    init_eh_for_function ();
+
+  if (src_cfun->gimple_df)
+    {
+      init_tree_ssa (cfun);
+      cfun->gimple_df->in_ssa_p = true;
+      init_ssa_operands (cfun);
+    }
+  pop_cfun ();
+}
+
+/* Elemental function's version of tree_function_versioning.  */
+
+void
+tree_elem_fn_versioning (tree old_decl, tree new_decl,
+			 VEC(ipa_replace_map_p,gc)* tree_map,
+			 bool update_clones, bitmap args_to_skip,
+			 bool skip_return,
+			 bitmap blocks_to_copy ATTRIBUTE_UNUSED,
+			 basic_block new_entry ATTRIBUTE_UNUSED, int vlength,
+			 bool masked)
+{
+  copy_body_data id;
+  tree p;
+  unsigned i;
+  struct ipa_replace_map *replace_info;
+  VEC (gimple, heap) *init_stmts = VEC_alloc (gimple, heap, 10);
+
+  tree old_current_function_decl = current_function_decl;
+  tree vars = NULL_TREE;
+
+  gcc_assert (TREE_CODE (old_decl) == FUNCTION_DECL
+	      && TREE_CODE (new_decl) == FUNCTION_DECL);
+  DECL_POSSIBLY_INLINED (old_decl) = 1;
+  
+  /* Copy over debug args.  */
+  if (DECL_HAS_DEBUG_ARGS_P (old_decl))
+    {
+      VEC(tree, gc) **new_debug_args, **old_debug_args;
+      gcc_checking_assert (decl_debug_args_lookup (new_decl) == NULL);
+      DECL_HAS_DEBUG_ARGS_P (new_decl) = 0;
+      old_debug_args = decl_debug_args_lookup (old_decl);
+      if (old_debug_args)
+	{
+	  new_debug_args = decl_debug_args_insert (new_decl);
+	  *new_debug_args = VEC_copy (tree, gc, *old_debug_args);
+	}
+    }
+
+  /* Output the inlining info for this abstract function, since it has been
+     inlined.  If we don't do this now, we can lose the information about the
+     variables in the function when the blocks get blown away as soon as we
+     remove the cgraph node.  */
+  (*debug_hooks->outlining_inline_function) (old_decl);
+
+  DECL_ARTIFICIAL (new_decl) = 1;
+  /* Prepare the data structures for the tree copy.  */
+  memset (&id, 0, sizeof (id));
+
+  /* Generate a new name for the new version. */
+  id.statements_to_fold = pointer_set_create ();
+
+  id.decl_map = pointer_map_create ();
+  id.debug_map = NULL;
+  id.src_fn = old_decl;
+  id.dst_fn = new_decl;
+  id.src_node = NULL;
+  id.dst_node = NULL;
+  id.src_cfun = DECL_STRUCT_FUNCTION (old_decl);
+
+  id.copy_decl = copy_decl_no_change;
+  id.transform_call_graph_edges
+    = update_clones ? CB_CGE_MOVE_CLONES : CB_CGE_MOVE;
+  id.transform_new_cfg = true;
+  id.transform_return_to_modify = false;
+  id.transform_lang_insert_block = NULL;
+
+  current_function_decl = new_decl;
+  
+  initialize_elem_fn_cfun (new_decl, old_decl);
+  push_cfun (DECL_STRUCT_FUNCTION (new_decl));
+
+  /* Copy the function's static chain.  */
+  p = DECL_STRUCT_FUNCTION (old_decl)->static_chain_decl;
+  if (p)
+    DECL_STRUCT_FUNCTION (new_decl)->static_chain_decl =
+      copy_static_chain (DECL_STRUCT_FUNCTION (old_decl)->static_chain_decl,
+			 &id);
+
+  /* If there's a tree_map, prepare for substitution.  */
+  if (tree_map)
+    for (i = 0; i < VEC_length (ipa_replace_map_p, tree_map); i++)
+      {
+	gimple init;
+	replace_info = VEC_index (ipa_replace_map_p, tree_map, i);
+	if (replace_info->replace_p)
+	  {
+	    tree op = replace_info->new_tree;
+	    if (!replace_info->old_tree)
+	      {
+		int i = replace_info->parm_num;
+		tree parm;
+		for (parm = DECL_ARGUMENTS (old_decl); i;
+		     parm = DECL_CHAIN (parm))
+		  i --;
+		replace_info->old_tree = parm;
+	      }
+		
+
+	    STRIP_NOPS (op);
+
+	    if (TREE_CODE (op) == VIEW_CONVERT_EXPR)
+	      op = TREE_OPERAND (op, 0);
+      
+	    gcc_assert (TREE_CODE (replace_info->old_tree) == PARM_DECL);
+	    init = setup_one_parameter (&id, replace_info->old_tree,
+	    			        replace_info->new_tree, id.src_fn,
+				        NULL,
+				        &vars);
+	    if (init)
+	      VEC_safe_push (gimple, heap, init_stmts, init);
+	  }
+      }
+  /* Copy the function's arguments.  */
+  if (DECL_ARGUMENTS (old_decl) != NULL_TREE)
+    DECL_ARGUMENTS (new_decl) =
+      elem_fn_copy_arguments_for_versioning (DECL_ARGUMENTS (old_decl), &id,
+					     args_to_skip, &vars,
+					     vlength, masked);
+
+  DECL_INITIAL (new_decl) = remap_blocks (DECL_INITIAL (id.src_fn), &id);
+  BLOCK_SUPERCONTEXT (DECL_INITIAL (new_decl)) = new_decl;
+
+  declare_inline_vars (DECL_INITIAL (new_decl), vars);
+
+  if (!VEC_empty (tree, DECL_STRUCT_FUNCTION (old_decl)->local_decls))
+    /* Add local vars.  */
+    add_local_variables (DECL_STRUCT_FUNCTION (old_decl), cfun, &id);
+
+  if (DECL_RESULT (old_decl) == NULL_TREE)
+    ;
+  else if (skip_return && !VOID_TYPE_P (TREE_TYPE (DECL_RESULT (old_decl))))
+    {
+      DECL_RESULT (new_decl)
+	= build_decl (DECL_SOURCE_LOCATION (DECL_RESULT (old_decl)),
+		      RESULT_DECL, NULL_TREE, void_type_node);
+      DECL_CONTEXT (DECL_RESULT (new_decl)) = new_decl;
+      cfun->returns_struct = 0;
+      cfun->returns_pcc_struct = 0;
+    }
+  else
+    {
+      tree old_name;
+      DECL_RESULT (new_decl) = remap_decl (DECL_RESULT (old_decl), &id);
+      if (TREE_TYPE (DECL_RESULT (new_decl)) != void_type_node)
+	{
+	  TREE_TYPE (DECL_RESULT (new_decl)) =
+	    build_vector_type (copy_node (TREE_TYPE (DECL_RESULT (new_decl))),
+			       vlength);
+	  DECL_MODE (DECL_RESULT (new_decl)) =
+	    TYPE_MODE (TREE_TYPE (DECL_RESULT (new_decl)));
+	}
+      if (TREE_TYPE (TREE_TYPE (old_decl)) != void_type_node)
+	{
+	  TREE_TYPE (new_decl) = copy_node (TREE_TYPE (old_decl));
+	  TREE_TYPE (TREE_TYPE (new_decl)) =
+	    copy_node (TREE_TYPE (TREE_TYPE (old_decl)));
+	  TREE_TYPE (TREE_TYPE (new_decl)) =
+	    build_vector_type (TREE_TYPE (TREE_TYPE (new_decl)), vlength);
+	}
+      lang_hooks.dup_lang_specific_decl (DECL_RESULT (new_decl));
+      if (gimple_in_ssa_p (id.src_cfun)
+	  && DECL_BY_REFERENCE (DECL_RESULT (old_decl))
+	  && (old_name = ssa_default_def (id.src_cfun, DECL_RESULT (old_decl))))
+	{
+	  tree new_name = make_ssa_name (DECL_RESULT (new_decl), NULL);
+	  insert_decl_map (&id, old_name, new_name);
+	  SSA_NAME_DEF_STMT (new_name) = gimple_build_nop ();
+	  set_ssa_default_def (cfun, DECL_RESULT (new_decl), new_name);
+	}
+    }
+  walk_tree (&DECL_SAVED_TREE (new_decl), copy_tree_body_r, &id, NULL);
+  /* Renumber the lexical scoping (non-code) blocks consecutively.  */
+  number_blocks (new_decl);
+
+
+  /* Remap the nonlocal_goto_save_area, if any.  */
+  if (cfun->nonlocal_goto_save_area)
+    {
+      struct walk_stmt_info wi;
+
+      memset (&wi, 0, sizeof (wi));
+      wi.info = &id;
+      walk_tree (&cfun->nonlocal_goto_save_area, remap_gimple_op_r, &wi, NULL);
+    }
+
+  /* Clean up.  */
+  pointer_map_destroy (id.decl_map);
+  if (id.debug_map)
+    pointer_map_destroy (id.debug_map);
+
+  gcc_assert (!id.debug_stmts);
+  VEC_free (gimple, heap, init_stmts);
+  pop_cfun ();
+  current_function_decl = old_current_function_decl;
+  gcc_assert (!current_function_decl
+	      || DECL_STRUCT_FUNCTION (current_function_decl) == cfun);
+  return;
+}
+
+
 /* EXP is CALL_EXPR present in a GENERIC expression tree.  Try to integrate
    the callee and return the inlined body on success.  */
 
diff --git gcc/tree-vect-stmts.c gcc/tree-vect-stmts.c
index ab4a26c..93c359e 100644
--- gcc/tree-vect-stmts.c
+++ gcc/tree-vect-stmts.c
@@ -40,9 +40,13 @@  along with GCC; see the file COPYING3.  If not see
 
 /* For lang_hooks.types.type_for_mode.  */
 #include "langhooks.h"
+#include "cilkplus.h"
 
-/* Return the vectorized type for the given statement.  */
+tree elem_fn_linear_init_vector (gimple stmt, tree val, tree type,
+				 tree step_size, gimple_stmt_iterator *gsi);
+extern enum elem_fn_parm_type find_elem_fn_parm_type (gimple, tree, tree *);
 
+/* Return the vectorized type for the given statement.  */
 tree
 stmt_vectype (struct _stmt_vec_info *stmt_info)
 {
@@ -1316,6 +1320,8 @@  vect_get_vec_def_for_operand (tree op, gimple stmt, tree *scalar_def)
   enum vect_def_type dt;
   bool is_simple_use;
   tree vector_type;
+  enum elem_fn_parm_type parm_type;
+  tree step_size = NULL_TREE;
 
   if (vect_print_dump_info (REPORT_DETAILS))
     {
@@ -1340,13 +1346,24 @@  vect_get_vec_def_for_operand (tree op, gimple stmt, tree *scalar_def)
         }
     }
 
+  if (flag_enable_cilkplus
+      && gimple_code (stmt) == GIMPLE_CALL
+      && is_elem_fn (gimple_call_fndecl (stmt)))
+    {
+      parm_type = find_elem_fn_parm_type (stmt, op, &step_size);
+      if (parm_type == TYPE_UNIFORM || parm_type == TYPE_LINEAR)
+	dt = vect_external_def;
+    }
+  else
+    parm_type = TYPE_NONE;
+
   switch (dt)
     {
     /* Case 1: operand is a constant.  */
     case vect_constant_def:
       {
 	vector_type = get_vectype_for_scalar_type (TREE_TYPE (op));
 	gcc_assert (vector_type);
 	nunits = TYPE_VECTOR_SUBPARTS (vector_type);
 
 	if (scalar_def)
@@ -1355,8 +1372,13 @@  vect_get_vec_def_for_operand (tree op, gimple stmt, tree *scalar_def)
         /* Create 'vect_cst_ = {cst,cst,...,cst}'  */
         if (vect_print_dump_info (REPORT_DETAILS))
           fprintf (vect_dump, "Create vector_cst. nunits = %d", nunits);
-
-        return vect_init_vector (stmt, op, vector_type, NULL);
+	/* A linear elemental-function parameter gets a per-lane step;
+	   everything else is broadcast as before.  */
+	if (flag_enable_cilkplus && parm_type == TYPE_LINEAR)
+	  return elem_fn_linear_init_vector (stmt, op, vector_type,
+					     step_size, NULL);
+
+	return vect_init_vector (stmt, op, vector_type, NULL);
       }
 
     /* Case 2: operand is defined outside the loop - loop invariant.  */
@@ -1371,8 +1393,185 @@  vect_get_vec_def_for_operand (tree op, gimple stmt, tree *scalar_def)
         /* Create 'vec_inv = {inv,inv,..,inv}'  */
         if (vect_print_dump_info (REPORT_DETAILS))
           fprintf (vect_dump, "Create vector_inv.");
+	/* A linear elemental-function parameter gets a per-lane step.  */
+	if (flag_enable_cilkplus && parm_type == TYPE_LINEAR)
+	  return elem_fn_linear_init_vector (stmt, op, vector_type,
+					     step_size, NULL);
+
+	return vect_init_vector (stmt, op, vector_type, NULL);
+      }
 
-        return vect_init_vector (stmt, def, vector_type, NULL);
+    /* Case 3: operand is defined inside the loop.  */
+    case vect_internal_def:
+      {
+	if (scalar_def)
+	  *scalar_def = NULL/* FIXME tuples: def_stmt*/;
+
+        /* Get the def from the vectorized stmt.  */
+        def_stmt_info = vinfo_for_stmt (def_stmt);
+
+        vec_stmt = STMT_VINFO_VEC_STMT (def_stmt_info);
+        /* Get vectorized pattern statement.  */
+        if (!vec_stmt
+            && STMT_VINFO_IN_PATTERN_P (def_stmt_info)
+            && !STMT_VINFO_RELEVANT (def_stmt_info))
+          vec_stmt = STMT_VINFO_VEC_STMT (vinfo_for_stmt (
+                       STMT_VINFO_RELATED_STMT (def_stmt_info)));
+        gcc_assert (vec_stmt);
+	if (gimple_code (vec_stmt) == GIMPLE_PHI)
+	  vec_oprnd = PHI_RESULT (vec_stmt);
+	else if (is_gimple_call (vec_stmt))
+	  vec_oprnd = gimple_call_lhs (vec_stmt);
+	else
+	  vec_oprnd = gimple_assign_lhs (vec_stmt);
+        return vec_oprnd;
+      }
+
+    /* Case 4: operand is defined by a loop header phi - reduction  */
+    case vect_reduction_def:
+    case vect_double_reduction_def:
+    case vect_nested_cycle:
+      {
+	struct loop *loop;
+
+	gcc_assert (gimple_code (def_stmt) == GIMPLE_PHI);
+	loop = (gimple_bb (def_stmt))->loop_father;
+
+        /* Get the def before the loop  */
+        op = PHI_ARG_DEF_FROM_EDGE (def_stmt, loop_preheader_edge (loop));
+        return get_initial_def_for_reduction (stmt, op, scalar_def);
+     }
+
+    /* Case 5: operand is defined by loop-header phi - induction.  */
+    case vect_induction_def:
+      {
+	gcc_assert (gimple_code (def_stmt) == GIMPLE_PHI);
+
+        /* Get the def from the vectorized stmt.  */
+        def_stmt_info = vinfo_for_stmt (def_stmt);
+        vec_stmt = STMT_VINFO_VEC_STMT (def_stmt_info);
+	if (gimple_code (vec_stmt) == GIMPLE_PHI)
+	  vec_oprnd = PHI_RESULT (vec_stmt);
+	else
+	  vec_oprnd = gimple_get_lhs (vec_stmt);
+        return vec_oprnd;
+      }
+
+    default:
+      gcc_unreachable ();
+    }
+}
+
+
+/* Function elem_fn_vect_get_vec_def_for_operand.
+
+   OP is an operand in STMT.  This function returns a (vector) def that will be
+   used in the vectorized stmt for STMT.
+
+   If OP is an SSA_NAME defined in the loop, STMT_VINFO_VEC_STMT of the
+   defining stmt holds the relevant def.
+
+   If OP is an invariant, a constant, or a uniform or linear elemental
+   function parameter, a new stmt creating the vector def is inserted at GSI.  */
+
+tree
+elem_fn_vect_get_vec_def_for_operand (tree op, gimple stmt, tree *scalar_def,
+				      gimple_stmt_iterator *gsi)
+{
+  tree vec_oprnd;
+  gimple vec_stmt;
+  gimple def_stmt;
+  stmt_vec_info def_stmt_info = NULL;
+  stmt_vec_info stmt_vinfo = vinfo_for_stmt (stmt);
+  unsigned int nunits;
+  loop_vec_info loop_vinfo = STMT_VINFO_LOOP_VINFO (stmt_vinfo);
+  tree def;
+  enum vect_def_type dt;
+  bool is_simple_use;
+  tree vector_type;
+  enum elem_fn_parm_type parm_type;
+  tree step_size = NULL_TREE;
+
+  if (vect_print_dump_info (REPORT_DETAILS))
+    {
+      fprintf (vect_dump, "elem_fn_vect_get_vec_def_for_operand: ");
+      print_generic_expr (vect_dump, op, TDF_SLIM);
+    }
+
+  is_simple_use = vect_is_simple_use (op, stmt, loop_vinfo, NULL,
+				      &def_stmt, &def, &dt);
+  gcc_assert (is_simple_use);
+  if (vect_print_dump_info (REPORT_DETAILS))
+    {
+      if (def)
+        {
+          fprintf (vect_dump, "def =  ");
+          print_generic_expr (vect_dump, def, TDF_SLIM);
+        }
+      if (def_stmt)
+        {
+          fprintf (vect_dump, "  def_stmt =  ");
+	  print_gimple_stmt (vect_dump, def_stmt, 0, TDF_SLIM);
+        }
+    }
+
+  if (flag_enable_cilkplus
+      && gimple_code (stmt) == GIMPLE_CALL
+      && is_elem_fn (gimple_call_fndecl (stmt)))
+    {
+      parm_type = find_elem_fn_parm_type (stmt, op, &step_size);
+      if (parm_type == TYPE_UNIFORM || parm_type == TYPE_LINEAR)
+	dt = vect_external_def;
+    }
+  else
+    parm_type = TYPE_NONE;
+
+  switch (dt)
+    {
+    /* Case 1: operand is a constant.  */
+    case vect_constant_def:
+      {
+	vector_type = get_vectype_for_scalar_type (TREE_TYPE (op));
+	gcc_assert (vector_type);
+	nunits = TYPE_VECTOR_SUBPARTS (vector_type);
+
+	if (scalar_def)
+	  *scalar_def = op;
+
+        /* Create 'vect_cst_ = {cst,cst,...,cst}'  */
+        if (vect_print_dump_info (REPORT_DETAILS))
+          fprintf (vect_dump, "Create vector_cst. nunits = %d", nunits);
+	/* Without Cilk Plus, use the default insertion point; otherwise
+	   insert at GSI.  Linear parameters get a per-lane step.  */
+	if (!flag_enable_cilkplus)
+	  return vect_init_vector (stmt, op, vector_type, NULL);
+	if (parm_type == TYPE_LINEAR)
+	  return elem_fn_linear_init_vector (stmt, op, vector_type,
+					     step_size, gsi);
+
+	return vect_init_vector (stmt, op, vector_type, gsi);
+      }
+
+    /* Case 2: operand is defined outside the loop - loop invariant.  */
+    case vect_external_def:
+      {
+	vector_type = get_vectype_for_scalar_type (TREE_TYPE (def));
+	gcc_assert (vector_type);
+
+	if (scalar_def)
+	  *scalar_def = def;
+
+        /* Create 'vec_inv = {inv,inv,..,inv}'  */
+        if (vect_print_dump_info (REPORT_DETAILS))
+          fprintf (vect_dump, "Create vector_inv.");
+	/* Same dispatch as for constants above.  */
+	if (!flag_enable_cilkplus)
+	  return vect_init_vector (stmt, op, vector_type, NULL);
+	if (parm_type == TYPE_LINEAR)
+	  return elem_fn_linear_init_vector (stmt, op, vector_type, step_size,
+					     gsi);
+
+	return vect_init_vector (stmt, op, vector_type, gsi);
       }
 
     /* Case 3: operand is defined inside the loop.  */
@@ -1620,9 +1819,17 @@  vect_finish_stmt_generation (gimple stmt, gimple vec_stmt,
 		      && !(gimple_call_flags (vec_stmt)
 			   & (ECF_CONST|ECF_PURE|ECF_NOVOPS)))))
 	    {
-	      tree new_vdef = copy_ssa_name (vuse, vec_stmt);
-	      gimple_set_vdef (vec_stmt, new_vdef);
-	      SET_USE (gimple_vuse_op (at_stmt), new_vdef);
+	      /* Elemental functions are removed and inserted separately.  So,
+		 we do not add any use information for them at this point.  */
+	      if (flag_enable_cilkplus && is_gimple_call (stmt)
+		  && is_elem_fn (gimple_call_fndecl (stmt)))
+		;
+	      else
+		{
+		  tree new_vdef = copy_ssa_name (vuse, vec_stmt);
+		  gimple_set_vdef (vec_stmt, new_vdef);
+		  SET_USE (gimple_vuse_op (at_stmt), new_vdef);
+		}
 	    }
 	}
     }
@@ -1649,6 +1856,19 @@  vectorizable_function (gimple call, tree vectype_out, tree vectype_in)
 {
   tree fndecl = gimple_call_fndecl (call);
 
+  if (flag_enable_cilkplus && is_elem_fn (fndecl))
+    {
+      if (DECL_ELEM_FN_ALREADY_CLONED (fndecl))
+	return fndecl;
+      else
+	{
+	  tree new_fndecl = find_elem_fn_name (copy_node (fndecl),
+					       vectype_out, vectype_in);
+	  if (new_fndecl)
+	    DECL_ELEM_FN_ALREADY_CLONED (new_fndecl) = 1;
+	  return new_fndecl;
+	}
+    }
   /* We only handle functions that do not read or clobber memory -- i.e.
      const or novops ones.  */
   if (!(gimple_call_flags (call) & (ECF_CONST | ECF_NOVOPS)))
@@ -1694,6 +1914,7 @@  vectorizable_call (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
   enum { NARROW, NONE, WIDEN } modifier;
   size_t i, nargs;
   tree lhs;
+  tree step_size = NULL_TREE;
 
   if (!STMT_VINFO_RELEVANT_P (stmt_info) && !bb_vinfo)
     return false;
@@ -1721,7 +1942,7 @@  vectorizable_call (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
   /* Bail out if the function has more than three arguments, we do not have
      interesting builtin functions to vectorize with more than two arguments
      except for fma.  No arguments is also not good.  */
-  if (nargs == 0 || nargs > 3)
+  if (!flag_enable_cilkplus && (nargs == 0 || nargs > 3))
     return false;
 
   for (i = 0; i < nargs; i++)
@@ -1801,8 +2022,6 @@  vectorizable_call (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
       return false;
     }
 
-  gcc_assert (!gimple_vuse (stmt));
-
   if (slp_node || PURE_SLP_STMT (stmt_info))
     ncopies = 1;
   else if (modifier == NARROW)
@@ -1891,11 +2110,28 @@  vectorizable_call (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
 	    {
 	      op = gimple_call_arg (stmt, i);
 	      if (j == 0)
-		vec_oprnd0
-		  = vect_get_vec_def_for_operand (op, stmt, NULL);
+		{
+		  if (flag_enable_cilkplus)
+		    vec_oprnd0
+		      = elem_fn_vect_get_vec_def_for_operand (op, stmt, NULL,
+							      gsi);
+		  else
+		    vec_oprnd0
+		      = vect_get_vec_def_for_operand (op, stmt, NULL);
+		}
 	      else
 		{
 		  vec_oprnd0 = gimple_call_arg (new_stmt, i);
+		  if (flag_enable_cilkplus
+		      && gimple_code (new_stmt) == GIMPLE_CALL
+		      && is_elem_fn (gimple_call_fndecl (new_stmt)))
+		    {
+		      enum elem_fn_parm_type parm_type
+			= find_elem_fn_parm_type (stmt, op, &step_size);
+		      if (parm_type == TYPE_UNIFORM
+			  || parm_type == TYPE_LINEAR)
+			dt[i] = vect_constant_def;
+		    }
 		  vec_oprnd0
                     = vect_get_vec_def_for_stmt_copy (dt[i], vec_oprnd0);
 		}
@@ -1906,6 +2142,8 @@  vectorizable_call (gimple stmt, gimple_stmt_iterator *gsi, gimple *vec_stmt,
 	  new_stmt = gimple_build_call_vec (fndecl, vargs);
 	  new_temp = make_ssa_name (vec_dest, new_stmt);
 	  gimple_call_set_lhs (new_stmt, new_temp);
+	  if (flag_enable_cilkplus && is_elem_fn (fndecl))
+	    gimple_call_set_fntype (new_stmt, TREE_TYPE (fndecl));
 	  vect_finish_stmt_generation (stmt, new_stmt, gsi);
 
 	  if (j == 0)
@@ -6592,3 +6830,45 @@  supportable_narrowing_operation (enum tree_code code,
   VEC_free (tree, heap, *interm_types);
   return false;
 }
+
+/* Like vect_init_vector, but lane I of the result is VAL + I * STEP_SIZE.  */
+
+tree
+elem_fn_linear_init_vector (gimple stmt, tree val, tree type, tree step_size,
+			    gimple_stmt_iterator *gsi)
+{
+  tree new_var;
+  gimple init_stmt;
+  tree vec_oprnd;
+  tree new_temp;
+
+  if (TREE_CODE (type) == VECTOR_TYPE
+      && TREE_CODE (TREE_TYPE (val)) != VECTOR_TYPE)
+    {
+      if (!types_compatible_p (TREE_TYPE (type), TREE_TYPE (val)))
+	{
+	  if (CONSTANT_CLASS_P (val))
+	    val = fold_unary (VIEW_CONVERT_EXPR, TREE_TYPE (type), val);
+	  else
+	    {
+	      new_var = create_tmp_reg (TREE_TYPE (type), NULL);
+	      init_stmt = gimple_build_assign_with_ops (NOP_EXPR,
+							new_var, val,
+							NULL_TREE);
+	      new_temp = make_ssa_name (new_var, init_stmt);
+	      gimple_assign_set_lhs (init_stmt, new_temp);
+	      vect_init_vector_1 (stmt, init_stmt, gsi);
+	      val = new_temp;
+	    }
+	}
+      val = build_elem_fn_linear_vector_from_val (type, val, step_size);
+    }
+
+  new_var = vect_get_new_vect_var (type, vect_simple_var, "cst_");
+  init_stmt = gimple_build_assign (new_var, val);
+  new_temp = make_ssa_name (new_var, init_stmt);
+  gimple_assign_set_lhs (init_stmt, new_temp);
+  vect_init_vector_1 (stmt, init_stmt, gsi);
+  vec_oprnd = gimple_assign_lhs (init_stmt);
+  return vec_oprnd;
+}
diff --git gcc/tree.c gcc/tree.c
index 68d5ad0..e63d629 100644
--- gcc/tree.c
+++ gcc/tree.c
@@ -1405,6 +1405,54 @@  build_vector_from_val (tree vectype, tree sc)
     }
 }
 
+/* Build a vector of type VECTYPE whose Ith element is SC + I * STEP_SIZE.
+   Used for linear parameters of elemental functions.  */
+
+tree
+build_elem_fn_linear_vector_from_val (tree vectype, tree sc, tree step_size)
+{
+  int i, nunits = TYPE_VECTOR_SUBPARTS (vectype);
+
+  if (sc == error_mark_node)
+    return sc;
+
+  /* Verify that the vector type is suitable for SC.  Note that there
+     is some inconsistency in the type-system with respect to restrict
+     qualifications of pointers.  Vector types always have a main-variant
+     element type and the qualification is applied to the vector-type.
+     So TREE_TYPE (vector-type) does not return a properly qualified
+     vector element-type.  */
+  gcc_checking_assert (types_compatible_p (TYPE_MAIN_VARIANT (TREE_TYPE (sc)),
+					   TREE_TYPE (vectype)));
+
+  if (CONSTANT_CLASS_P (sc))
+    {
+      tree *v = XALLOCAVEC (tree, nunits);
+      for (i = 0; i < nunits; ++i)
+	v[i] = fold_build2 (PLUS_EXPR, TREE_TYPE (sc), sc,
+			    fold_build2 (MULT_EXPR, TREE_TYPE (step_size),
+					 step_size,
+					 build_int_cst (integer_type_node, i)));
+      return build_vector (vectype, v);
+    }
+  else
+    {
+      VEC(constructor_elt, gc) *v = VEC_alloc (constructor_elt, gc, nunits);
+      for (i = 0; i < nunits; ++i)
+	{
+	  /* Lane I holds SC + I * STEP_SIZE.  */
+	  tree offset = fold_build2 (MULT_EXPR, TREE_TYPE (step_size),
+				     step_size,
+				     build_int_cst (integer_type_node, i));
+	  tree elt = build2 (PLUS_EXPR, TREE_TYPE (sc), sc, offset);
+
+	  CONSTRUCTOR_APPEND_ELT (v, NULL_TREE, elt);
+	}
+      return build_constructor (vectype, v);
+    }
+}
+
+
 /* Return a new CONSTRUCTOR node whose type is TYPE and whose values
    are in the VEC pointed to by VALS.  */
 tree
diff --git gcc/tree.h gcc/tree.h
index bca0576..00bdaf7 100644
--- gcc/tree.h
+++ gcc/tree.h
@@ -3467,6 +3467,12 @@  extern VEC(tree, gc) **decl_debug_args_insert (tree);
 #define DECL_FUNCTION_SPECIFIC_OPTIMIZATION(NODE) \
    (FUNCTION_DECL_CHECK (NODE)->function_decl.function_specific_optimization)
 
+/* In FUNCTION_DECL, this bit is set when the elemental function has already
+   been cloned, so there is no need to clone it again.  This mainly prevents
+   infinite cloning.  */
+#define DECL_ELEM_FN_ALREADY_CLONED(NODE) \
+   (FUNCTION_DECL_CHECK (NODE)->function_decl.elem_fn_already_cloned)
+
 /* FUNCTION_DECL inherits from DECL_NON_COMMON because of the use of the
    arguments/result/saved_tree fields by front ends.   It was either inherit
    FUNCTION_DECL from non_common, or inherit non_common from FUNCTION_DECL,
@@ -3511,8 +3517,8 @@  struct GTY(()) tree_function_decl {
   unsigned looping_const_or_pure_flag : 1;
   unsigned has_debug_args_flag : 1;
   unsigned tm_clone_flag : 1;
+  unsigned elem_fn_already_cloned : 1;
 
-  /* 1 bit left */
 };
 
 /* The source language of the translation-unit.  */
@@ -4737,6 +4743,7 @@  extern tree build_vector_stat (tree, tree * MEM_STAT_DECL);
 #define build_vector(t,v) build_vector_stat (t, v MEM_STAT_INFO)
 extern tree build_vector_from_ctor (tree, VEC(constructor_elt,gc) *);
 extern tree build_vector_from_val (tree, tree);
+extern tree build_elem_fn_linear_vector_from_val (tree, tree, tree);
 extern tree build_constructor (tree, VEC(constructor_elt,gc) *);
 extern tree build_constructor_single (tree, tree, tree);
 extern tree build_constructor_from_list (tree, tree);
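
For readers unfamiliar with the uniform/linear distinction used throughout the vectorizer changes above, here is a rough standalone sketch of the lane values involved. It is plain C with no GCC internals; splat_lanes and linear_lanes are hypothetical names made up for illustration and are not part of the patch. The first mimics the broadcast that vect_init_vector produces, the second the SC + I * STEP_SIZE layout that build_elem_fn_linear_vector_from_val is meant to build.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the patch: every lane of OUT holds the
   same value, as for a uniform parameter or a loop invariant.  */
static void
splat_lanes (long *out, size_t nunits, long val)
{
  size_t i;

  assert (out != NULL);
  for (i = 0; i < nunits; i++)
    out[i] = val;
}

/* Hypothetical helper, not part of the patch: lane I of OUT holds
   BASE + I * STEP, as for a linear parameter.  */
static void
linear_lanes (long *out, size_t nunits, long base, long step)
{
  size_t i;

  assert (out != NULL);
  for (i = 0; i < nunits; i++)
    out[i] = base + (long) i * step;
}
```

With four lanes, a base of 10 and a step of 3, linear_lanes fills the lanes with {10, 13, 16, 19}, whereas splat_lanes with the value 10 gives {10, 10, 10, 10}.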