
[google] Patch to support calling multi-versioned functions via new GCC builtin. (issue4440078)

Message ID 20110429025248.90D61B21AB@azwildcat.mtv.corp.google.com
State New

Commit Message

Sriraman Tallam April 29, 2011, 2:52 a.m. UTC
I want this patch to be considered for google/main now. This is intended to be submitted to trunk for review soon.
This patch has been tested with crosstool bootstrap using buildit and by running all tests.


Patch Description :

--
This patch is available for review at http://codereview.appspot.com/4440078

Comments

Richard Biener April 29, 2011, 8:56 a.m. UTC | #1
On Fri, Apr 29, 2011 at 4:52 AM, Sriraman Tallam <tmsriram@google.com> wrote:
> I want this patch to be considered for google/main now. This is intended to be submitted to trunk for review soon.
> This patch has been tested with crosstool bootstrap using buildit and by running all tests.
>
>
> Patch Description :
> ==================
>
> Frequently executed functions in programs are sometimes compiled
> into different versions to take advantage of specific advanced
> features present on some of the platforms on which the program is
> executed. For instance, SSE4 versus no-SSE4. Existing techniques
> for multi-versioning, for example using indirect calls, block
> compiler optimizations, mainly inlining.

The general idea of supporting versioning is good, thanks for working on it.
Comments below.

> This patch adds a new GCC pass to support multi-versioned calls
> through a new builtin, __builtin_dispatch.  The __builtin_dispatch
> call is converted into a simple if-then construct to call the
> appropriate version at run-time. There are no indirect calls so
> inlining is not affected.

I am not sure that combining the function choice and its invocation
into a single builtin is good.  First, you use variadic arguments for
the actual function arguments, which will cause vararg promotions
to apply and thus make it impossible to dispatch to functions
taking single-precision float values (and, depending on ABI details,
many more similar issues).  Second, you restrict yourself to
only two versions of a function - that looks quite limited and the
restriction is not easy to lift with your proposed interface.

Thus, I would have kept regular (but indirect) calls in the source
program and only exposed the dispatch via a builtin, like ...

> This patch also supports a function unswitching optimization via
> cloning of functions, only along hot paths, that call multi-versioned
> functions so that the hot multi-versioned paths can be independently
> optimized for performance. The cloning optimization is switched on
> via a flag, -fclone-hot-version-paths.
>
> Only two versions are supported in this patch. Example use :
>
>  int popcnt_sse4(unsigned int x) __attribute__((__target__("popcnt")));
>  int popcnt_sse4(unsigned int x)
>  {
>    int count = __builtin_popcount(x);
>    return count;
>  }
>
>  int popcnt(unsigned int x) __attribute__((__target__("no-popcnt")));
>  int popcnt(unsigned int x)
>  {
>    int count = __builtin_popcount(x);
>    return count;
>  }
>
>  int testsse() __attribute__((version_selector));
>  int main ()
>  {
>    ...
>    /* The multi-versioned call. */
>    ret = __builtin_dispatch (testsse,  (void*)popcnt_sse4, (void*)popcnt, 25);
>    ...
>  }

int main()
{
  int (*ppcnt)(unsigned int) = __builtin_dispatch (testsse,
popcnt_sse4, popcnt);
  ret = (*ppcnt) (25);
}

which would allow the testsse function to return the argument number of
the function to select.
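(Purely as an illustration of that generalization -- the three-candidate
form and the select_isa/popcnt_avx names are hypothetical, not part of
the patch:)

  /* Hypothetical N-way use: the selector returns the zero-based index
     of the candidate to pick and __builtin_dispatch returns the
     corresponding function pointer.  */
  extern int select_isa (void);            /* hypothetical selector */
  extern int popcnt (unsigned int);
  extern int popcnt_sse4 (unsigned int);
  extern int popcnt_avx (unsigned int);    /* hypothetical third version */

  int (*ppcnt) (unsigned int)
    = __builtin_dispatch (select_isa, popcnt, popcnt_sse4, popcnt_avx);
  ret = (*ppcnt) (25);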

[snip]

>  When to use the "version_selector" attribute ?
>  -----------------------------------------------
>
>  Functions are marked with attribute "version_selector" only if
>  they are run-time constants.  Example of such functions would
>  be those that test if a specific feature is available on a
>  particular architecture.  Such functions must return a positive
>  integer. For two-way functions, those that test if a feature
>  is present or not must return 1 or 0 respectively.
>
>  This patch will make constructors that call these function once
>  and assign the outcome to a global variable. From then on, only
>  the value of the global will be used to decide which version to
>  execute.

That's a nice idea, but why not simply process all functions with
a const attribute and no arguments this way?  IMHO

int testsse (void) __attribute__((const));

is as good and avoids the new attribute completely.  The pass would
also benefit other uses of this idiom (giving a way to have global
dynamic initializers in C).  The functions may not be strictly 'const'
in a way the compiler could auto-detect, but the attribute presents
exactly the promises to the compiler that allow this optimization.

Thus, please split this piece out into a separate patch and use
the const attribute.
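(To make the intended transformation concrete, here is a rough C-level
sketch of what such a pass could emit -- the __testsse_result and
__testsse_init names are made up for illustration:)

  int testsse (void) __attribute__((const));

  /* Cache for the runtime-constant result of testsse ().  */
  static int __testsse_result;

  /* Evaluate the const, zero-argument function once at startup ...  */
  __attribute__((constructor))
  static void __testsse_init (void)
  {
    __testsse_result = testsse ();
  }

  /* ... and replace later calls to testsse () with a load of
     __testsse_result.  */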

>  What happens with cloning, -fclone-hot-version-paths ?
>  -----------------------------------------------------

Now, here you lost me somewhat, because I didn't look into the
patch details and I am missing an example of how the lowered
IL looks before that cloning.  So for now I suppose this
-fclone-hot-version-paths
is to expose direct calls to the IL.  If you would lower __builtin_dispatch
to a style like

  int sel = selector ();
  switch (sel)
     {
     case 0:
       fn = popcnt;
      break;
     case 1:
       fn = popcnt_sse4;
       break;
     }
   res = (*fn) (25);

then rather than a new pass specialized for __builtin_dispatch existing
optimization passes that are able to do tracing like VRP and DOM
via jump-threading or the tracer pass should be able to do this
optimization for you.  If they do not use FDO in a good way it is better
to improve them for this case instead of writing a new pass.

Note that we already have special FDO support for indirect to direct
call promotion, so that might work already.

Thus, with all this, __builtin_dispatch would be more like syntactic
sugar to avoid writing above switch statement yourself.  If you restrict
that sugar to a constant number of candidates it can be replaced by
a macro (or several ones, specialized for the number of candidates).
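(For instance, a minimal sketch of such a macro for the two-candidate
case -- illustrative only, not something the patch provides:)

/* Evaluate the selector and yield the chosen function pointer;
   candidate 0 is returned unless the selector returns 1.  */
#define DISPATCH2(selector, fn0, fn1) \
  ((selector) () == 1 ? (fn1) : (fn0))

/* Usage, matching the switch above:
     int (*fn) (unsigned int) = DISPATCH2 (testsse, popcnt, popcnt_sse4);
     res = (*fn) (25);  */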

Richard.

>  For the call graph that looks like this before cloning :
>
> func_cold ----> func_hot ----> findOnes ----> IsPopCntSupported ----> popcnt
>                                                |
>                                                -----> no_popcnt
>
>  where popcnt and no_popcnt are the multi-versioned functions, then after cloning :
>
> func_cold ---->IsPopCntSupported ----> func_hot_clone0 ----> findOnes_clone0 ----> popcnt
>                              |
>                               ----> func_hot_clone1 ----> findOnes_clone1 ----> no_popcnt
>
>
>
>
>  Flags : -fclone-hot-version-paths does function unswitching via cloning.
>          --param=num-mversn-clones=<num> allows to specify the number of
>          functions that should be cloned.
>          --param=mversn-clone-depth=<num> allows to specify the length of
>          the call graph path that should be cloned.  num = 0 implies only
>          leaf node that contains the __builtin_dispatch statement must be
>          cloned.
> ============================================================================================
>
> Google ref 45947, 45986
>
> 2011-04-28  Sriraman Tallam  <tmsriram@google.com>
>
>        * c-family/c-common.c   (revision 173122) (handle_version_selector_attribute): New function.
>        (c_common_attribute_table): New attribute "version_selector".
>        * tree-pass.h   (revision 173122) (pass_tree_convert_builtin_dispatch): New pass.
>        (pass_ipa_multiversion_dispatch): New pass.
>        * testsuite/gcc.dg/mversn7.c    (revision 0): New test case.
>        * testsuite/gcc.dg/mversn4.c    (revision 0): New test case.
>        * testsuite/gcc.dg/mversn4.h    (revision 0): New test case.
>        * testsuite/gcc.dg/mversn4a.c   (revision 0): New test case.
>        * testsuite/gcc.dg/torture/mversn1.c    (revision 0): New test case.
>        * testsuite/gcc.dg/mversn2.c    (revision 0): New test case.
>        * testsuite/gcc.dg/mversn6.c    (revision 0): New test case.
>        * testsuite/gcc.dg/mversn3.c    (revision 0): New test case.
>        * testsuite/g++.dg/mversn8.C    (revision 0): New test case.
>        * testsuite/g++.dg/mversn10a.C  (revision 0): New test case.
>        * testsuite/g++.dg/mversn14a.C  (revision 0): New test case.
>        * testsuite/g++.dg/tree-prof/mversn13.C (revision 0): New test case.
>        * testsuite/g++.dg/tree-prof/mversn15.C (revision 0): New test case.
>        * testsuite/g++.dg/tree-prof/mversn15a.C        (revision 0): New test case.
>        * testsuite/g++.dg/mversn9.C    (revision 0): New test case.
>        * testsuite/g++.dg/mversn10.C   (revision 0): New test case.
>        * testsuite/g++.dg/mversn12.C   (revision 0): New test case.
>        * testsuite/g++.dg/mversn14.C   (revision 0): New test case.
>        * testsuite/g++.dg/mversn16.C   (revision 0): New test case.
>        * testsuite/g++.dg/torture/mversn11.C   (revision 0): New test case.
>        * testsuite/g++.dg/torture/mversn5.C    (revision 0): New test case.
>        * testsuite/g++.dg/torture/mversn5.h    (revision 0): New test case.
>        * testsuite/g++.dg/torture/mversn5a.C   (revision 0): New test case.
>        * builtin-types.def     (revision 173122) (BT_PTR_FN_INT): New pointer type.
>        (BT_FN_INT_PTR_FN_INT_PTR_PTR_VAR): New function type for __builtin_dispatch.
>        * builtins.def  (revision 173122) (BUILT_IN_DISPATCH): New builtin to
>        support multi-version calls.
>        * mversn-dispatch.c     (revision 0): New file.
>        * timevar.def   (revision 173122) (TV_MVERSN_DISPATCH): New time var.
>        * common.opt    (revision 173122) (fclone-hot-version-paths): New flag.
>        * Makefile.in   (revision 173122) (mversn-dispatch.o): New rule.
>        * passes.c      (revision 173122) (init_optimization_passes): Add the new
>        multi-version and dispatch passes to the pass list.
>        * params.def    (revision 173122) (PARAM_NUMBER_OF_MVERSN_CLONES): Define.
>        (PARAM_MVERSN_CLONE_CGRAPH_DEPTH): Define.
>        * doc/invoke.texi       (revision 173122) (mversn-clone-depth): Document.
>        (num-mversn-clones): Document.
>        (fclone-hot-version-paths): Document.
Xinliang David Li April 29, 2011, 4:23 p.m. UTC | #2
Here is the background for this feature:

1) People rely on function multi-versioning to exploit hw features and
squeeze out performance, but there is no standard way of doing so:
either a) using indirect function calls with function pointers set at
program initialization; b) using manual dispatch at each callsite; or
c) using features like IFUNC.  The dispatch mechanism needs to be
promoted to the language level and become a first-class citizen;
2) Most importantly, the above non-standard approaches block
interprocedural optimizations such as inlining. With the introduction
of __builtin_dispatch and its clear semantics, the compiler can do more
aggressive optimization.

Multi-way dispatch would be good, but most of the use cases we have
seen are 2-way -- a fast path + a default path.

Who are the targeted consumers of the feature?

1) Developers who want to multi-version functions manually;
2) User-directed function cloning

   e.g.,
    int foo (...) __attribute__((clone ("sse")));

3) Automatic function cloning determined by the compiler.


>
> I am not sure that combining the function choice and its invocation
> to a single builtin is good.  First, you use variadic arguments for
> the actual function arguments which will cause vararg promotions
> to apply and thus it will be impossible to dispatch to functions
> taking single precision float values (and dependent on ABI details
> many more similar issues).  Second, you restrict yourself to
> only two versions of a function - that looks quite limited and this
> restriction is not easy to lift with your proposed interface.

See above. Multi-way dispatch can be added similarly.


>
> That's a nice idea, but why not simply process all functions with
> a const attribute and no arguments this way?  IMHO
>
> int testsse (void) __attribute__((const));
>
> is as good and avoids the new attribute completely.  The pass would
> also benefit other uses of this idiom (giving a way to have global
> dynamic initializers in C).  The functions may not be strictly 'const'
> in a way the compiler autodetects this attribute but it presents
> exactly the promises to the compiler that allow this optimization.
>
> Thus, please split this piece out into a separate patch and use
> the const attribute.


Sounds good.

>
>>  What happens with cloning, -fclone-hot-version-paths ?
>>  -----------------------------------------------------
>
> Now, here you lost me somewhat, because I didn't look into the
> patch details and I am missing an example on how the lowered
> IL looks before that cloning.  So for now I suppose this
> -fclone-hot-version-paths
> is to expose direct calls to the IL.  If you would lower __builtin_dispatch
> to a style like
>
>  int sel = selector ();
>  switch (sel)
>     {
>     case 0:
>       fn = popcnt;
>      break;
>     case 1:
>       fn = popcnt_sse4;
>       break;
>     }
>   res = (*fn) (25);
>
> then rather than a new pass specialized for __builtin_dispatch existing
> optimization passes that are able to do tracing like VRP and DOM
> via jump-threading or the tracer pass should be able to do this
> optimization for you.  If they do not use FDO in a good way it is better
> to improve them for this case instead of writing a new pass.

What you describe may not work:

1) the function selection may happen in a different function;
2) the compiler will need to hoist the selection into the program
initializer to avoid overhead.

As an example of why dispatch hoisting and call-chain cloning are needed:

void foo();
void bar();

void doit_v1();
void doit_v2();
bool check_v () __attribute__((const));

int test();


void bar()
{
    ....
    for (.....)
     {
         foo();
         ....
     }
}

void foo()
{
   ...
   for (...)
   {
      __builtin_dispatch(check_v, doit_v1, doit_v2,...);
     ...
   }
}


int test ()
{
    ..
   bar ();
}


The feature test and dispatch are embedded in a 2-deep loop nest
crossing function boundaries. The call path test --> bar --> foo
needs to be cloned. This is done by hoisting the dispatch up the call
chain -- it ends up as:


void bar_v1()
{
   ....
   for (..)
    {
      foo_v1 ();
    }
 ..
}

void bar_v2 ()
{
    ...
    for (..)
    {
      foo_v2();
    }
   ..
}

void foo_v1 ()
{
   ..
   for ()
    {
      doit_v1()
    }
}

void foo_v2 ()
{
    ..
    for ()
    {
       doit_v2()
    }
}

int test ()
{
     __builtin_dispatch(check_v, bar_v1, bar_v2);
    ..

}



Thanks,

David



>
> Note that we already have special FDO support for indirect to direct
> call promotion, so that might work already.
>
> Thus, with all this, __builtin_dispatch would be more like syntactic
> sugar to avoid writing above switch statement yourself.  If you restrict
> that sugar to a constant number of candidates it can be replaced by
> a macro (or several ones, specialized for the number of candidates).
>
> Richard.
>
>>  For the call graph that looks like this before cloning :
>>
>> func_cold ----> func_hot ----> findOnes ----> IsPopCntSupported ----> popcnt
>>                                                |
>>                                                -----> no_popcnt
>>
>>  where popcnt and no_popcnt are the multi-versioned functions, then after cloning :
>>
>> func_cold ---->IsPopCntSupported ----> func_hot_clone0 ----> findOnes_clone0 ----> popcnt
>>                              |
>>                               ----> func_hot_clone1 ----> findOnes_clone1 ----> no_popcnt
>>
>>
>>
>>
>>  Flags : -fclone-hot-version-paths does function unswitching via cloning.
>>          --param=num-mversn-clones=<num> allows to specify the number of
>>          functions that should be cloned.
>>          --param=mversn-clone-depth=<num> allows to specify the length of
>>          the call graph path that should be cloned.  num = 0 implies only
>>          leaf node that contains the __builtin_dispatch statement must be
>>          cloned.
>> [ChangeLog snipped]
>
Sriraman Tallam April 29, 2011, 5:53 p.m. UTC | #3
Hi Richard,

  Thanks for the comments. Please find inline responses.

On Fri, Apr 29, 2011 at 1:56 AM, Richard Guenther
<richard.guenther@gmail.com> wrote:
>
> On Fri, Apr 29, 2011 at 4:52 AM, Sriraman Tallam <tmsriram@google.com> wrote:
> > I want this patch to be considered for google/main now. This is intended to be submitted to trunk for review soon.
> > This patch has been tested with crosstool bootstrap using buildit and by running all tests.
> >
> >
> > Patch Description :
> > ==================
> >
> > Frequently executed functions in programs are some times compiled
> > into different versions to take advantage of some specific
> > advanced features present in some of the types of platforms on
> > which the program is executed. For instance, SSE4 versus no-SSE4.
> > Existing techniques to do multi-versioning, for example using
> > indirect calls, block compiler optimizations, mainly inlining.
>
> The general idea of supporting versioning is good, thanks for working on it.
> Comments below.
>
> > This patch adds a new GCC pass to support multi-versioned calls
> > through a new builtin, __builtin_dispatch.  The __builtin_dispatch
> > call is converted into a simple if-then construct to call the
> > appropriate version at run-time. There are no indirect calls so
> > inlining is not affected.
>
> I am not sure that combining the function choice and its invocation
> to a single builtin is good.  First, you use variadic arguments for
> the actual function arguments which will cause vararg promotions
> to apply and thus it will be impossible to dispatch to functions
> taking single precision float values (and dependent on ABI details
> many more similar issues).  Second, you restrict yourself to
> only two versions of a function - that looks quite limited and this
> restriction is not easy to lift with your proposed interface.
>
> Thus, I would have kept regular (but indirect) calls in the source
> program and only exposed the dispatch via a builtin, like ...
>
> > This patch also supports a function unswitching optimization via
> > cloning of functions, only along hot paths, that call multi-versioned
> > functions so that the hot multi-versioned paths can be independently
> > optimized for performance. The cloning optimization is switched on
> > via a flag, -fclone-hot-version-paths.
> >
> > Only two versions are supported in this patch. Example use :
> >
> >  int popcnt_sse4(unsigned int x) __attribute__((__target__("popcnt")));
> >  int popcnt_sse4(unsigned int x)
> >  {
> >    int count = __builtin_popcount(x);
> >    return count;
> >  }
> >
> >  int popcnt(unsigned int x) __attribute__((__target__("no-popcnt")));
> >  int popcnt(unsigned int x)
> >  {
> >    int count = __builtin_popcount(x);
> >    return count;
> >  }
> >
> >  int testsse() __attribute__((version_selector));
> >  int main ()
> >  {
> >    ...
> >    /* The multi-versioned call. */
> >    ret = __builtin_dispatch (testsse,  (void*)popcnt_sse4, (void*)popcnt, 25);
> >    ...
> >  }
>
> int main()
> {
>  int (*ppcnt)(unsigned int) = __builtin_dispatch (testsse,
> popcnt_sse4, popcnt);
>  ret = (*ppcnt) (25);
> }
>
> which would allow the testsse function to return the argument number of
> the function to select.
>
> [snip]
>
> >  When to use the "version_selector" attribute ?
> >  -----------------------------------------------
> >
> >  Functions are marked with attribute "version_selector" only if
> >  they are run-time constants.  Example of such functions would
> >  be those that test if a specific feature is available on a
> >  particular architecture.  Such functions must return a positive
> >  integer. For two-way functions, those that test if a feature
> >  is present or not must return 1 or 0 respectively.
> >
> >  This patch will make constructors that call these function once
> >  and assign the outcome to a global variable. From then on, only
> >  the value of the global will be used to decide which version to
> >  execute.
>
> That's a nice idea, but why not simply process all functions with
> a const attribute and no arguments this way?  IMHO
>
> int testsse (void) __attribute__((const));
>
> is as good and avoids the new attribute completely.  The pass would
> also benefit other uses of this idiom (giving a way to have global
> dynamic initializers in C).  The functions may not be strictly 'const'
> in a way the compiler autodetects this attribute but it presents
> exactly the promises to the compiler that allow this optimization.

Since const functions cannot, according to the definition at least,
read files or memory, I was thinking along the lines of making it a
"pure" function (for example, the version_selector function could be
reading "/proc/cpuinfo" and looking for certain features). The reason
I invented a new attribute was that I assumed it is not legal to
hoist the call site of a pure function into a constructor and
substitute the computed value, simply because it may not be valid to
call the pure function ahead of the intended point in the execution.

Basically, the semantics of "version_selector" are that the value
returned is a run-time constant and that the function can be called at
any point during the execution. However, it could be reading files and
memory. Chatting with Ian, it looks like I can just tag this function
as const and get away with it.

Also, I do not need to do this for every const function in general
because *inlining* will take care of it. Moreover, evaluating it
ahead of time may be unnecessary and could hurt in general if the
function is unlikely to be executed. So I could mark only the feature
test functions as const and have the multi-versioning patch hoist
these functions alone, because they could potentially be reading
memory and even with inlining there could be an overhead.

>
> Thus, please split this piece out into a separate patch and use
> the const attribute.
>
> >  What happens with cloning, -fclone-hot-version-paths ?
> >  -----------------------------------------------------
>
> Now, here you lost me somewhat, because I didn't look into the
> patch details and I am missing an example on how the lowered
> IL looks before that cloning.  So for now I suppose this
> -fclone-hot-version-paths
> is to expose direct calls to the IL.  If you would lower __builtin_dispatch
> to a style like
>
>  int sel = selector ();
>  switch (sel)
>     {
>     case 0:
>       fn = popcnt;
>      break;
>     case 1:
>       fn = popcnt_sse4;
>       break;
>     }
>   res = (*fn) (25);
>
> then rather than a new pass specialized for __builtin_dispatch existing
> optimization passes that are able to do tracing like VRP and DOM
> via jump-threading or the tracer pass should be able to do this
> optimization for you.  If they do not use FDO in a good way it is better
> to improve them for this case instead of writing a new pass.
>
> Note that we already have special FDO support for indirect to direct
> call promotion, so that might work already.
>
> Thus, with all this, __builtin_dispatch would be more like syntactic
> sugar to avoid writing above switch statement yourself.  If you restrict
> that sugar to a constant number of candidates it can be replaced by
> a macro (or several ones, specialized for the number of candidates).


I have included dumps for a sample toy program below to better
illustrate what happens with multi-versioning.

foo.cc
=====

extern int __attribute__ ((version_selector))
featureTest ();

int foo (int i)
{
  return i;
}

int bar (int i)
{
  return (11 - i);
}

/* This calculates the sum of numbers from 1 to 10 in 2 different ways. */
int __attribute__ ((hot))
problem ()
{
  int ret = 0;
  int j = 1;
  for (j = 1; j<=10; j++)
    ret += __builtin_dispatch (featureTest, (void *)foo, (void *)bar, j);
  return ret;
}

int __attribute__ ((hot))
main ()
{
  return problem() - 55;
}

Here, irrespective of the outcome of featureTest, problem computes the
sum of the first 10 numbers, and hence the value returned is 55.

Now, without cloning, the IR for function "problem" after converting
the __builtin_dispatch is:

===================================================================
 pretmp.8_2 = _Z11featureTestv_version_selector_global;
<bb 3>:
  if (pretmp.8_2 > 0)
    goto <bb 5>;
  else
    goto <bb 4>;

<bb 4>:
   D.2144_1 = (unsigned int) j_24;
  D.2145_21 = 11 - D.2144_1;
  j_22 = (int) D.2145_21;

<bb 5>:
  ret_9 = ret_23 + j_19;

  j_10 = j_24 + 1;
  if (j_10 != 11)
    goto <bb 3>;
  else
    goto <bb 6>;

<bb 6>:
   return ret_25;
=================================================

The loop stays and will be executed.

Whereas with cloning, problem is split into two clones, one for each version:

int _Z7problemv_clone_0() ()
{
  bool D.2120;
  int D.2119;
  int ret;
  int j;

<bb 6>:

<bb 2>:
    goto <bb 4>;

<bb 3>:
  D.2119_2 = foo (j_1);   /* Only calls foo */
  ret_4 = D.2119_2 + ret_3;
   j_5 = j_1 + 1;
<bb 4>:
  if (j_1 <= 10)
    goto <bb 3>;
  else
    goto <bb 5>;

<bb 5>:
  return ret_3;
}

and

int _Z7problemv_clone_1() ()
{
  bool D.2127;
  int D.2126;
  int ret;
  int j;

<bb 6>:

<bb 2>:
   goto <bb 4>;

<bb 3>:
  D.2126_2 = bar (j_1); /* Only calls bar */
  ret_4 = D.2126_2 + ret_3;
   j_5 = j_1 + 1;
<bb 4>:
  if (j_1 <= 10)
    goto <bb 3>;
  else
    goto <bb 5>;

<bb 5>:
  return ret_3;
}

 and this gets optimized into :

int _Z7problemv_clone_0() ()
{
  return 55;
}

int _Z7problemv_clone_1() ()
{
  return 55;
}

Now, function problem is converted into :

int problem ()
{
  D.2129_10 = __builtin_dispatch (featureTest, _Z7problemv_clone_0,
_Z7problemv_clone_1);
}

that is, it calls the right clone based on the outcome. Please note the
re-use of __builtin_dispatch here to do this.

After conversion and optimization this becomes:

int problem ()
{
  return 55;
}

This gets inlined into main, and main becomes:

int main ()
{
  return 0;
}

This has been added as a test case.

Thanks,
-Sri.



>
> Richard.
>
> >  For the call graph that looks like this before cloning :
> >
> > func_cold ----> func_hot ----> findOnes ----> IsPopCntSupported ----> popcnt
> >                                                |
> >                                                -----> no_popcnt
> >
> >  where popcnt and no_popcnt are the multi-versioned functions, then after cloning :
> >
> > func_cold ---->IsPopCntSupported ----> func_hot_clone0 ----> findOnes_clone0 ----> popcnt
> >                              |
> >                               ----> func_hot_clone1 ----> findOnes_clone1 ----> no_popcnt
> >
> >
> >
> >
> >  Flags : -fclone-hot-version-paths does function unswitching via cloning.
> >          --param=num-mversn-clones=<num> allows to specify the number of
> >          functions that should be cloned.
> >          --param=mversn-clone-depth=<num> allows to specify the length of
> >          the call graph path that should be cloned.  num = 0 implies only
> >          leaf node that contains the __builtin_dispatch statement must be
> >          cloned.
> > [ChangeLog snipped]
Richard Biener May 2, 2011, 9:11 a.m. UTC | #4
On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li <davidxl@google.com> wrote:
> Here is the background for this feature:
>
> 1) People relies on function multi-version to explore hw features and
> squeeze performance, but there is no standard ways of doing so, either
> a) using indirect function calls with function pointers set at program
> initialization; b) using manual dispatch at each callsite; b) using
> features like IFUNC.  The dispatch mechanism needs to be promoted to
> the language level and becomes the first class citizen;

You are not doing that, you are inventing a new (crude) GCC extension.

> 2) Most importantly, the above non standard approaches block
> interprocedural optimizations such as inlining. With the introduction
> of buitlin_dispatch and the clear semantics, compiler can do more
> aggressive optimization.

I don't think so, but the previous mail lacked detail on why the new
scheme would be better.

> Multi-way dispatch will be good, but most of the use cases we have
> seen is 2-way -- a fast path + a default path.
>
> Who are the targeted consumer of the feature?
>
> 1) For developers who like to MV function manually;
> 2) For user directed function cloning
>
>   e.g,
>    int foo (...) __attribute__((clone ("sse")):
>
> 3) Automatic function cloning determined by compiler.
>
>
>>
>> I am not sure that combining the function choice and its invocation
>> to a single builtin is good.  First, you use variadic arguments for
>> the actual function arguments which will cause vararg promotions
>> to apply and thus it will be impossible to dispatch to functions
>> taking single precision float values (and dependent on ABI details
>> many more similar issues).  Second, you restrict yourself to
>> only two versions of a function - that looks quite limited and this
>> restriction is not easy to lift with your proposed interface.
>
> See above. Multi-way dispatch can be added similarly.

Not with the specified syntax.  So you'd end up with _two_
language extensions.  That's bad (and unacceptable, IMHO).

>
>>
>> That's a nice idea, but why not simply process all functions with
>> a const attribute and no arguments this way?  IMHO
>>
>> int testsse (void) __attribute__((const));
>>
>> is as good and avoids the new attribute completely.  The pass would
>> also benefit other uses of this idiom (giving a way to have global
>> dynamic initializers in C).  The functions may not be strictly 'const'
>> in a way the compiler autodetects this attribute but it presents
>> exactly the promises to the compiler that allow this optimization.
>>
>> Thus, please split this piece out into a separate patch and use
>> the const attribute.
>
>
> Sounds good.
>
>>
>>>  What happens with cloning, -fclone-hot-version-paths ?
>>>  -----------------------------------------------------
>>
>> Now, here you lost me somewhat, because I didn't look into the
>> patch details and I am missing an example on how the lowered
>> IL looks before that cloning.  So for now I suppose this
>> -fclone-hot-version-paths
>> is to expose direct calls to the IL.  If you would lower __builtin_dispatch
>> to a style like
>>
>>  int sel = selector ();
>>  switch (sel)
>>     {
>>     case 0:
>>       fn = popcnt;
>>      break;
>>     case 1:
>>       fn = popcnt_sse4;
>>       break;
>>     }
>>   res = (*fn) (25);
>>
>> then rather than a new pass specialized for __builtin_dispatch existing
>> optimization passes that are able to do tracing like VRP and DOM
>> via jump-threading or the tracer pass should be able to do this
>> optimization for you.  If they do not use FDO in a good way it is better
>> to improve them for this case instead of writing a new pass.
>
> What you describe may not work
>
> 1) the function selection may happen in a different function;

Well, sure.  I propose to have the above as the lowered form.  If people
deliberately obfuscate code it's their fault.  Your scheme simply
makes that obfuscation impossible (or at least less likely).  So
1) is not a valid argument.

> 2) Compiler will need to hoist the selection into the program
> initializer to avoid overhead

?  Isn't that the point of the "const" function call optimization which
I said was a good thing anyway?  So, after that it would be

 int sel = some_global_static;
 ...

> As an example of why dispatch hoisting and call chain cloning is needed:
>
> void foo();
> void bar();
>
> void doit_v1();
> void doit_v2();
> bool check_v () __attribute__((const));
>
> int test();
>
>
> void bar()
> {
>    ....
>    for (.....)
>     {
>         foo();
>         ....
>     }
> }
>
> void foo()
> {
>   ...
>   for (...)
>   {
>      __builtin_dispatch(check_v, doit_v1, doit_v2,...);
>     ...
>   }
> }
>
>
> int test ()
> {
>    ..
>   bar ();
> }
>
>
> The feature testing and dispatch is embedded in a 2-deep loop nest
> crossing function boundaries. The call paths test --> bar --> foo
> needs to be cloned. This is done by hoisting dispatch up the call
> chain -- it ends up :
>
>
> void bar_v1()
> {
>   ....
>   for (..)
>    {
>      foo_v1 ();
>    }
>  ..
> }
>
> void bar_v2 ()
> {
>    ...
>    for (..)
>    {
>      foo_v2();
>    }
>   ..
> }
>
> void foo_v1 ()
> {
>   ..
>   for ()
>    {
>      doit_v1()
>    }
> }
>
> void foo_v2 ()
> {
>    ..
>    for ()
>    {
>       doit_v2()
>    }
> }
>
> int test ()
> {
>     __builtin_dispatch(check_v, bar_v1, bar_v2);
>    ..
>
> }

So - and why is this not a good optimization when instead of
__builtin_dispatch there is

   int sel = some_global_const_var;
   switch (sel)
      {
       case 1: fn = bar_v1; break;
       case 2: fn = bar_v2; break;
      }
    (*fn)();

?

My point is that such optimization is completely independent of
that dispatch thing.  The above could as well be a selection to
use different input arrays, one causing alias analysis issues
and one not.

Thus, a __builtin_dispatch centric optimization pass is the wrong
way to go.

Now, with FDO I'd expect that foo is inlined into bar (because foo
is deemed hot); then you only need to deal with loop unswitching,
which should be easy to drive from FDO.
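(Roughly, on the lowered form, unswitching would do the following -- an
illustrative sketch reusing the names from David's example:)

  /* Before unswitching: the selection is loop-invariant.  */
  int sel = some_global_const_var;
  for (i = 0; i < n; i++)
    {
      if (sel == 1)
        doit_v1 ();
      else
        doit_v2 ();
    }

  /* After unswitching: the invariant test is hoisted and the loop body
     is duplicated, leaving only direct calls in each copy.  */
  if (sel == 1)
    for (i = 0; i < n; i++)
      doit_v1 ();
  else
    for (i = 0; i < n; i++)
      doit_v2 ();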

Richard.

>
>
> Thanks,
>
> David
>
>
>
>>
>> Note that we already have special FDO support for indirect to direct
>> call promotion, so that might work already.
>>
>> Thus, with all this, __builtin_dispatch would be more like syntactic
>> sugar to avoid writing above switch statement yourself.  If you restrict
>> that sugar to a constant number of candidates it can be replaced by
>> a macro (or several ones, specialized for the number of candidates).
>>
>> Richard.
>>
>>>  For the call graph that looks like this before cloning :
>>>
>>> func_cold ----> func_hot ----> findOnes ----> IsPopCntSupported ----> popcnt
>>>                                                |
>>>                                                -----> no_popcnt
>>>
>>>  where popcnt and no_popcnt are the multi-versioned functions, then after cloning :
>>>
>>> func_cold ---->IsPopCntSupported ----> func_hot_clone0 ----> findOnes_clone0 ----> popcnt
>>>                              |
>>>                               ----> func_hot_clone1 ----> findOnes_clone1 ----> no_popcnt
>>>
>>>
>>>
>>>
>>>  Flags : -fclone-hot-version-paths does function unswitching via cloning.
>>>          --param=num-mversn-clones=<num> allows to specify the number of
>>>          functions that should be cloned.
>>>          --param=mversn-clone-depth=<num> allows to specify the length of
>>>          the call graph path that should be cloned.  num = 0 implies only
>>>          leaf node that contains the __builtin_dispatch statement must be
>>>          cloned.
>>> [ChangeLog snipped]
>>
>
Richard Biener May 2, 2011, 9:24 a.m. UTC | #5
On Fri, Apr 29, 2011 at 7:50 PM, Sriraman Tallam <tmsriram@google.com> wrote:
>
> On Fri, Apr 29, 2011 at 1:56 AM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
>>
>> On Fri, Apr 29, 2011 at 4:52 AM, Sriraman Tallam <tmsriram@google.com>
>> wrote:
>> > I want this patch to be considered for google/main now. This is intended
>> > to be submitted to trunk for review soon.
>> > This patch has been tested with crosstool bootstrap using buildit and by
>> > running all tests.
>> >
>> >
>> > Patch Description :
>> > ==================
>> >
>> > Frequently executed functions in programs are some times compiled
>> > into different versions to take advantage of some specific
>> > advanced features present in some of the types of platforms on
>> > which the program is executed. For instance, SSE4 versus no-SSE4.
>> > Existing techniques to do multi-versioning, for example using
>> > indirect calls, block compiler optimizations, mainly inlining.
>>
>> The general idea of supporting versioning is good, thanks for working on
>> it.
>> Comments below.
>>
>> > This patch adds a new GCC pass to support multi-versioned calls
>> > through a new builtin, __builtin_dispatch.  The __builtin_dispatch
>> > call is converted into a simple if-then construct to call the
>> > appropriate version at run-time. There are no indirect calls so
>> > inlining is not affected.
>>
>> I am not sure that combining the function choice and its invocation
>> to a single builtin is good.  First, you use variadic arguments for
>> the actual function arguments which will cause vararg promotions
>> to apply and thus it will be impossible to dispatch to functions
>> taking single precision float values (and dependent on ABI details
>> many more similar issues).  Second, you restrict yourself to
>> only two versions of a function - that looks quite limited and this
>> restriction is not easy to lift with your proposed interface.
>>
>> Thus, I would have kept regular (but indirect) calls in the source
>> program and only exposed the dispatch via a builtin, like ...
>>
>> > This patch also supports a function unswitching optimization via
>> > cloning of functions, only along hot paths, that call multi-versioned
>> > functions so that the hot multi-versioned paths can be independently
>> > optimized for performance. The cloning optimization is switched on
>> > via a flag, -fclone-hot-version-paths.
>> >
>> > Only two versions are supported in this patch. Example use :
>> >
>> >  int popcnt_sse4(unsigned int x) __attribute__((__target__("popcnt")));
>> >  int popcnt_sse4(unsigned int x)
>> >  {
>> >    int count = __builtin_popcount(x);
>> >    return count;
>> >  }
>> >
>> >  int popcnt(unsigned int x) __attribute__((__target__("no-popcnt")));
>> >  int popcnt(unsigned int x)
>> >  {
>> >    int count = __builtin_popcount(x);
>> >    return count;
>> >  }
>> >
>> >  int testsse() __attribute__((version_selector));
>> >  int main ()
>> >  {
>> >    ...
>> >    /* The multi-versioned call. */
>> >    ret = __builtin_dispatch (testsse,  (void*)popcnt_sse4,
>> > (void*)popcnt, 25);
>> >    ...
>> >  }
>>
>> int main()
>> {
>>  int (*ppcnt)(unsigned int) = __builtin_dispatch (testsse,
>> popcnt_sse4, popcnt);
>>  ret = (*ppcnt) (25);
>> }
>>
>> which would allow the testsse function to return the argument number of
>> the function to select.
>>
>> [snip]
>>
>> >  When to use the "version_selector" attribute ?
>> >  -----------------------------------------------
>> >
>> >  Functions are marked with attribute "version_selector" only if
>> >  they are run-time constants.  Example of such functions would
>> >  be those that test if a specific feature is available on a
>> >  particular architecture.  Such functions must return a positive
>> >  integer. For two-way functions, those that test if a feature
>> >  is present or not must return 1 or 0 respectively.
>> >
>> >  This patch will make constructors that call these function once
>> >  and assign the outcome to a global variable. From then on, only
>> >  the value of the global will be used to decide which version to
>> >  execute.
>>
>> That's a nice idea, but why not simply process all functions with
>> a const attribute and no arguments this way?  IMHO
>>
>> int testsse (void) __attribute__((const));
>>
>> is as good and avoids the new attribute completely.  The pass would
>> also benefit other uses of this idiom (giving a way to have global
>> dynamic initializers in C).  The functions may not be strictly 'const'
>> in a way the compiler autodetects this attribute but it presents
>> exactly the promises to the compiler that allow this optimization.
>
>
> Since, const functions cannot, not according to the definition atleast ,
>  read files or memory, I was thinking along the lines of making it a  "pure"
> function (For example, The version_selector function could be reading
> "/proc/cpuinfo" and looking for certain features).

Well, const functions "can" read files or memory, it's just that their
result may not vary when that memory or those files change (or rather
the compiler assumes so if you give the "const" promise).  With just a
"pure" function the compiler cannot possibly elide all calls in favor
of a read from a return cache initialized at program start.

So you really need "const".

> The reason I invented a
> new attribute was because I assumed it is not legal to hoist the call site
> of a pure function into a constructor and substitute the value computed
> simply because it may not be valid to call the pure function ahead of the
> intended time in the execution.

That's true, which is why you need to use "const" which gives the
compiler that freedom.

> Basically, the semantics of the "version_selector" is that the value
> returned is a run-time constant and that it can be called at any point of
> time during the execution. However, it could be reading files and memory.
> Chatting with Ian, it looks like I can just tag this function as const and
> get away with it.

Yes.

> Also, I do not need to do this for every const function in general because
> *inlining* will take care of it. Also, evaluating it ahead-of-time may be
> unnecessary and could hurt in general if the function will not be likely
> executed. So, I could only mark feature test functions as const and have the
> multi-versioning patch hoist these functions alone because they could
> potentially be reading memory and even with inlining there could be  a
> overhead.

Well, the only uses of const functions without arguments are for such
functions you want to promote to a static initializer and maybe for
wrapping fancy constants (I didn't see use of the latter), because
a const function without arguments is just a "name" for a constant
(in your case a runtime constant).  If inlining can take care of it,
it would be a compile-time constant, which you can easily check
before doing the transformation (just check whether the function body
simply returns a constant).

>
>>
>> Thus, please split this piece out into a separate patch and use
>> the const attribute.
>>
>> >  What happens with cloning, -fclone-hot-version-paths ?
>> >  -----------------------------------------------------
>>
>> Now, here you lost me somewhat, because I didn't look into the
>> patch details and I am missing an example on how the lowered
>> IL looks before that cloning.  So for now I suppose this
>> -fclone-hot-version-paths
>> is to expose direct calls to the IL.  If you would lower
>> __builtin_dispatch
>> to a style like
>>
>>  int sel = selector ();
>>  switch (sel)
>>     {
>>     case 0:
>>       fn = popcnt;
>>      break;
>>     case 1:
>>       fn = popcnt_sse4;
>>       break;
>>     }
>>   res = (*fn) (25);
>>
>> then rather than a new pass specialized for __builtin_dispatch existing
>> optimization passes that are able to do tracing like VRP and DOM
>> via jump-threading or the tracer pass should be able to do this
>> optimization for you.  If they do not use FDO in a good way it is better
>> to improve them for this case instead of writing a new pass.
>
> I have shown dumps  for a sample program to better illustrate what happens
> with multi-versioning. It is a toy program used for illustration.
> foo.cc
> =====
> extern int __attribute__ ((version_selector))
> featureTest ();
> int foo (int i)
> {
>   return i;
> }
> int bar (int i)
> {
>   return (11 - i);
> }
> /* This calculates the sum of numbers from 1 to 10 in 2 different ways. */
> int __attribute__ ((hot))
> problem ()
> {
>   int ret = 0;
>   int j = 1;
>   for (j = 1; j<=10; j++)
>     ret += __builtin_dispatch (featureTest, (void *)foo, (void *)bar, j);
>   return ret;
> }
> int __attribute__ ((hot))
> main ()
> {
>   return problem() - 55;
> }
> Here, irrespective of the outcome of featureTest, problem computes the sum
> of the 1st 10 numbers, and hence the value returned is 55.
> Now, without cloning, the IR for  function "problem" after converting the
> __builtin_dispatch is :
> ===================================================================
>  pretmp.8_2 = _Z11featureTestv_version_selector_global;
> <bb 3>:
>   if (pretmp.8_2 > 0)
>     goto <bb 5>;
>   else
>     goto <bb 4>;
> <bb 4>:
>    D.2144_1 = (unsigned int) j_24;
>   D.2145_21 = 11 - D.2144_1;
>   j_22 = (int) D.2145_21;
> <bb 5>:
>   ret_9 = ret_23 + j_19;
>
>   j_10 = j_24 + 1;
>   if (j_10 != 11)
>     goto <bb 3>;
>   else
>     goto <bb 6>;
> <bb 6>:
>    return ret_25;

Ok, so you lower not to a single indirect call but to several direct calls.
We have no pass that currently transforms the former into the latter
but tree-ssa-phiprop.c could be trivially extended to do that (it
hoists indirect loads of a PHI result up to the defs of the PHI args
if that results in direct loads).

> =================================================
> The loop stays and will be executed.
> whereas with cloning, problem is made into 2 clones for each version :
> int _Z7problemv_clone_0() ()
> {
>   bool D.2120;
>   int D.2119;
>   int ret;
>   int j;
> <bb 6>:
> <bb 2>:
>     goto <bb 4>;
> <bb 3>:
>   D.2119_2 = foo (j_1);   /* Only calls foo */
>   ret_4 = D.2119_2 + ret_3;
>    j_5 = j_1 + 1;
> <bb 4>:
>   if (j_1 <= 10)
>     goto <bb 3>;
>   else
>     goto <bb 5>;
> <bb 5>:
>   return ret_3;
> }
> and
> int _Z7problemv_clone_1() ()
> {
>   bool D.2127;
>   int D.2126;
>   int ret;
>   int j;
> <bb 6>:
> <bb 2>:
>    goto <bb 4>;
> <bb 3>:
>   D.2126_2 = bar (j_1); /* Only calls bar */
>   ret_4 = D.2126_2 + ret_3;
>    j_5 = j_1 + 1;
> <bb 4>:
>   if (j_1 <= 10)
>     goto <bb 3>;
>   else
>     goto <bb 5>;
> <bb 5>:
>   return ret_3;
> }

When do you apply this cloning?  It must be before inlining, right?

>  and this gets optimized into :
> int _Z7problemv_clone_0() ()
> {
>   return 55;
> }
> int _Z7problemv_clone_1() ()
> {
>   return 55;
> }

I would have expected that loop unswitching is able to do the same
optimization.

> Now, function problem is converted into :
> int problem ()
> {
>   D.2129_10 = __builtin_dispatch (featureTest, _Z7problemv_clone_0,
> _Z7problemv_clone_1);
> }
> that is, call the right clone based on the outcome. Please note the re-use
> of the __builtin_dispatch here to do this.
> and after conversion and optimization becomes :
> int problem ()
> {
>   return 55;
> }

Where did you detect that the two clones are equal?  Or is the above
__builtin_dispatch never in the IL and lowered to two direct fncalls
directly?

> This gets inlined to main, and main becomes :
> int main ()
> {
>   return 0;
> }
> This has been added as a test case.

I think you are wrong in re-inventing what existing passes should be
able to do just fine (inlining and unswitching).  I realize that doing
things earlier is always appealing, but it feels wrong to add a lot of
code for a new GCC extension that existing code doesn't use, instead
of trying to handle the patterns that exist in the wild in a better way.

Richard.

> Thanks,
> -Sri.
>>
>> Note that we already have special FDO support for indirect to direct
>> call promotion, so that might work already.
>>
>> Thus, with all this, __builtin_dispatch would be more like syntactic
>> sugar to avoid writing above switch statement yourself.  If you restrict
>> that sugar to a constant number of candidates it can be replaced by
>> a macro (or several ones, specialized for the number of candidates).
>>
>> Richard.
>>
>> >  For the call graph that looks like this before cloning :
>> >
>> > func_cold ----> func_hot ----> findOnes ----> IsPopCntSupported ---->
>> > popcnt
>> >                                                |
>> >                                                -----> no_popcnt
>> >
>> >  where popcnt and no_popcnt are the multi-versioned functions, then
>> > after cloning :
>> >
>> > func_cold ---->IsPopCntSupported ----> func_hot_clone0 ---->
>> > findOnes_clone0 ----> popcnt
>> >                              |
>> >                               ----> func_hot_clone1 ---->
>> > findOnes_clone1 ----> no_popcnt
>> >
>> >
>> >
>> >
>> >  Flags : -fclone-hot-version-paths does function unswitching via
>> > cloning.
>> >          --param=num-mversn-clones=<num> allows to specify the number of
>> >          functions that should be cloned.
>> >          --param=mversn-clone-depth=<num> allows to specify the length
>> > of
>> >          the call graph path that should be cloned.  num = 0 implies
>> > only
>> >          leaf node that contains the __builtin_dispatch statement must
>> > be
>> >          cloned.
>> >
>> > ============================================================================================
>> >
>> > Google ref 45947, 45986
>> >
>> > 2011-04-28  Sriraman Tallam  <tmsriram@google.com>
>> >
>> >        * c-family/c-common.c   (revision 173122)
>> > (handle_version_selector_attribute): New function.
>> >        (c_common_attribute_table): New attribute "version_selector".
>> >        * tree-pass.h   (revision 173122)
>> > (pass_tree_convert_builtin_dispatch): New pass.
>> >        (pass_ipa_multiversion_dispatch): New pass.
>> >        * testsuite/gcc.dg/mversn7.c    (revision 0): New test case.
>> >        * testsuite/gcc.dg/mversn4.c    (revision 0): New test case.
>> >        * testsuite/gcc.dg/mversn4.h    (revision 0): New test case.
>> >        * testsuite/gcc.dg/mversn4a.c   (revision 0): New test case.
>> >        * testsuite/gcc.dg/torture/mversn1.c    (revision 0): New test
>> > case.
>> >        * testsuite/gcc.dg/mversn2.c    (revision 0): New test case.
>> >        * testsuite/gcc.dg/mversn6.c    (revision 0): New test case.
>> >        * testsuite/gcc.dg/mversn3.c    (revision 0): New test case.
>> >        * testsuite/g++.dg/mversn8.C    (revision 0): New test case.
>> >        * testsuite/g++.dg/mversn10a.C  (revision 0): New test case.
>> >        * testsuite/g++.dg/mversn14a.C  (revision 0): New test case.
>> >        * testsuite/g++.dg/tree-prof/mversn13.C (revision 0): New test
>> > case.
>> >        * testsuite/g++.dg/tree-prof/mversn15.C (revision 0): New test
>> > case.
>> >        * testsuite/g++.dg/tree-prof/mversn15a.C        (revision 0): New
>> > test case.
>> >        * testsuite/g++.dg/mversn9.C    (revision 0): New test case.
>> >        * testsuite/g++.dg/mversn10.C   (revision 0): New test case.
>> >        * testsuite/g++.dg/mversn12.C   (revision 0): New test case.
>> >        * testsuite/g++.dg/mversn14.C   (revision 0): New test case.
>> >        * testsuite/g++.dg/mversn16.C   (revision 0): New test case.
>> >        * testsuite/g++.dg/torture/mversn11.C   (revision 0): New test
>> > case.
>> >        * testsuite/g++.dg/torture/mversn5.C    (revision 0): New test
>> > case.
>> >        * testsuite/g++.dg/torture/mversn5.h    (revision 0): New test
>> > case.
>> >        * testsuite/g++.dg/torture/mversn5a.C   (revision 0): New test
>> > case.
>> >        * builtin-types.def     (revision 173122) (BT_PTR_FN_INT): New
>> > pointer type.
>> >        (BT_FN_INT_PTR_FN_INT_PTR_PTR_VAR): New function type for
>> > __builtin_dispatch.
>> >        * builtins.def  (revision 173122) (BUILT_IN_DISPATCH): New
>> > builtin to
>> >        support multi-version calls.
>> >        * mversn-dispatch.c     (revision 0): New file.
>> >        * timevar.def   (revision 173122) (TV_MVERSN_DISPATCH): New time
>> > var.
>> >        * common.opt    (revision 173122) (fclone-hot-version-paths): New
>> > flag.
>> >        * Makefile.in   (revision 173122) (mversn-dispatch.o): New rule.
>> >        * passes.c      (revision 173122) (init_optimization_passes): Add
>> > the new
>> >        multi-version and dispatch passes to the pass list.
>> >        * params.def    (revision 173122)
>> > (PARAM_NUMBER_OF_MVERSN_CLONES): Define.
>> >        (PARAM_MVERSN_CLONE_CGRAPH_DEPTH): Define.
>> >        * doc/invoke.texi       (revision 173122) (mversn-clone-depth):
>> > Document.
>> >        (num-mversn-clones): Document.
>> >        (fclone-hot-version-paths): Document.
>
>
Xinliang David Li May 2, 2011, 4:41 p.m. UTC | #6
On Mon, May 2, 2011 at 2:11 AM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li <davidxl@google.com> wrote:
>> Here is the background for this feature:
>>
>> 1) People relies on function multi-version to explore hw features and
>> squeeze performance, but there is no standard ways of doing so, either
>> a) using indirect function calls with function pointers set at program
>> initialization; b) using manual dispatch at each callsite; b) using
>> features like IFUNC.  The dispatch mechanism needs to be promoted to
>> the language level and becomes the first class citizen;
>
> You are not doing that, you are inventing a new (crude) GCC extension.

To capture the high-level semantics and prevent users from lowering the
dispatch calls into forms the compiler cannot recognize, a language
extension is the way to go.
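
As a hedged illustration (the names below are assumed for the example,
not taken from the patch): once the user lowers the dispatch by hand
through a function pointer, the link between the selector and the
candidate versions is lost and the optimizer only sees an ordinary
indirect call.

int popcnt_sse4 (unsigned int x);
int popcnt (unsigned int x);
int testsse (void);

static int (*popcnt_impl) (unsigned int);

static void init_popcnt (void) __attribute__ ((constructor));

/* Selection done once at startup; nothing tells the compiler that
   popcnt_impl can only ever point to popcnt_sse4 or popcnt.  */
static void
init_popcnt (void)
{
  popcnt_impl = testsse () ? popcnt_sse4 : popcnt;
}

int
count_bits (unsigned int x)
{
  return popcnt_impl (x);  /* stays indirect, blocks inlining */
}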

>

>> See above. Multi-way dispatch can be added similarly.
>
> Not with the specified syntax.  So you'd end up with _two_
> language extensions.  That's bad (and unacceptable, IMHO).

This part can be improved.

>
>>
>>>
>>> That's a nice idea, but why not simply process all functions with
>>> a const attribute and no arguments this way?  IMHO
>>>
>>> int testsse (void) __attribute__((const));
>>>
>>> is as good and avoids the new attribute completely.  The pass would
>>> also benefit other uses of this idiom (giving a way to have global
>>> dynamic initializers in C).  The functions may not be strictly 'const'
>>> in a way the compiler autodetects this attribute but it presents
>>> exactly the promises to the compiler that allow this optimization.
>>>
>>> Thus, please split this piece out into a separate patch and use
>>> the const attribute.
>>
>>
>> Sounds good.
>>

>>
>> 1) the function selection may happen in a different function;
>
> Well, sure.  I propose to have the above as lowered form.  If people
> deliberately obfuscate code it's their fault.  Your scheme simply
> makes that obfuscation impossible (or less likely possible).  So
> 1) is not a valid argument.

Lowering too early may defeat the purpose because

1) the desired optimization may not happen, since it becomes subject to
many compiler heuristic changes;
2) it has other side effects, such as a wrong estimate of function size,
which may impact inlining;
3) it limits the lowering to one form, which may not be ideal -- with
builtin_dispatch, after the hoisting optimization, the lowering can
use the more efficient IFUNC scheme, for instance (sketched below).
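
A hedged sketch of such an IFUNC-based lowering (using the GNU ifunc
attribute; the names reuse the popcnt example from earlier in the
thread -- this is an illustration, not what the pass currently emits):

int popcnt_sse4 (unsigned int x);
int popcnt (unsigned int x);
int testsse (void);

/* Resolver: runs once at load time and returns the version to bind.  */
static int (*resolve_popcnt (void)) (unsigned int)
{
  return testsse () ? popcnt_sse4 : popcnt;
}

/* Every call to popcnt_dispatch goes to the resolved version; no
   per-call selector test remains.  */
int popcnt_dispatch (unsigned int x)
  __attribute__ ((ifunc ("resolve_popcnt")));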

>
>
> My point is that such optimization is completely independent of
> that dispatch thing.  The above could as well be a selection to
> use different input arrays, one causing alias analysis issues
> and one not.
>
> Thus, a __builtin_dispatch centric optimization pass is the wrong
> way to go.

I agree that many things can be implemented in different ways, but a
high-level standard builtin_dispatch mechanism that does the hoisting
interprocedurally is cleaner, simpler, and targeted at function
multi-versioning -- it does not depend on or rely on later phases'
heuristic tunings to make the right things happen. Function MV
deserves this level of treatment as it will become more and more
important for some users (e.g., Google).

>
> Now, with FDO I'd expect the foo is inlined into bar (because foo
> is deemed hot),

That is a myth -- the truth is that there are other heuristics which
can prevent this from happening.

> then you only need to deal with loop unswitching,
> which should be easy to drive from FDO.

Same here -- the loop body may not be well formed or recognized, the loop
nests may not be perfectly nested, or other unswitching heuristics may
block it from happening.  This is a common problem for many other
things that get lowered too early. It is cleaner to make the high-level
transformation first in IPA, and let unswitching deal with the
intra-procedural optimization.

Thanks,

David

>
> Richard.
>
>>
>>
>> Thanks,
>>
>> David
>>
>>
>>
>>>
>>> Note that we already have special FDO support for indirect to direct
>>> call promotion, so that might work already.
>>>
>>> Thus, with all this, __builtin_dispatch would be more like syntactic
>>> sugar to avoid writing above switch statement yourself.  If you restrict
>>> that sugar to a constant number of candidates it can be replaced by
>>> a macro (or several ones, specialized for the number of candidates).
>>>
>>> Richard.
>>>
>>>>  For the call graph that looks like this before cloning :
>>>>
>>>> func_cold ----> func_hot ----> findOnes ----> IsPopCntSupported ----> popcnt
>>>>                                                |
>>>>                                                -----> no_popcnt
>>>>
>>>>  where popcnt and no_popcnt are the multi-versioned functions, then after cloning :
>>>>
>>>> func_cold ---->IsPopCntSupported ----> func_hot_clone0 ----> findOnes_clone0 ----> popcnt
>>>>                              |
>>>>                               ----> func_hot_clone1 ----> findOnes_clone1 ----> no_popcnt
>>>>
>>>>
>>>>
>>>>
>>>>  Flags : -fclone-hot-version-paths does function unswitching via cloning.
>>>>          --param=num-mversn-clones=<num> allows to specify the number of
>>>>          functions that should be cloned.
>>>>          --param=mversn-clone-depth=<num> allows to specify the length of
>>>>          the call graph path that should be cloned.  num = 0 implies only
>>>>          leaf node that contains the __builtin_dispatch statement must be
>>>>          cloned.
>>>> ============================================================================================
>>>>
>>>> Google ref 45947, 45986
>>>>
>>>> 2011-04-28  Sriraman Tallam  <tmsriram@google.com>
>>>>
>>>>        * c-family/c-common.c   (revision 173122) (handle_version_selector_attribute): New function.
>>>>        (c_common_attribute_table): New attribute "version_selector".
>>>>        * tree-pass.h   (revision 173122) (pass_tree_convert_builtin_dispatch): New pass.
>>>>        (pass_ipa_multiversion_dispatch): New pass.
>>>>        * testsuite/gcc.dg/mversn7.c    (revision 0): New test case.
>>>>        * testsuite/gcc.dg/mversn4.c    (revision 0): New test case.
>>>>        * testsuite/gcc.dg/mversn4.h    (revision 0): New test case.
>>>>        * testsuite/gcc.dg/mversn4a.c   (revision 0): New test case.
>>>>        * testsuite/gcc.dg/torture/mversn1.c    (revision 0): New test case.
>>>>        * testsuite/gcc.dg/mversn2.c    (revision 0): New test case.
>>>>        * testsuite/gcc.dg/mversn6.c    (revision 0): New test case.
>>>>        * testsuite/gcc.dg/mversn3.c    (revision 0): New test case.
>>>>        * testsuite/g++.dg/mversn8.C    (revision 0): New test case.
>>>>        * testsuite/g++.dg/mversn10a.C  (revision 0): New test case.
>>>>        * testsuite/g++.dg/mversn14a.C  (revision 0): New test case.
>>>>        * testsuite/g++.dg/tree-prof/mversn13.C (revision 0): New test case.
>>>>        * testsuite/g++.dg/tree-prof/mversn15.C (revision 0): New test case.
>>>>        * testsuite/g++.dg/tree-prof/mversn15a.C        (revision 0): New test case.
>>>>        * testsuite/g++.dg/mversn9.C    (revision 0): New test case.
>>>>        * testsuite/g++.dg/mversn10.C   (revision 0): New test case.
>>>>        * testsuite/g++.dg/mversn12.C   (revision 0): New test case.
>>>>        * testsuite/g++.dg/mversn14.C   (revision 0): New test case.
>>>>        * testsuite/g++.dg/mversn16.C   (revision 0): New test case.
>>>>        * testsuite/g++.dg/torture/mversn11.C   (revision 0): New test case.
>>>>        * testsuite/g++.dg/torture/mversn5.C    (revision 0): New test case.
>>>>        * testsuite/g++.dg/torture/mversn5.h    (revision 0): New test case.
>>>>        * testsuite/g++.dg/torture/mversn5a.C   (revision 0): New test case.
>>>>        * builtin-types.def     (revision 173122) (BT_PTR_FN_INT): New pointer type.
>>>>        (BT_FN_INT_PTR_FN_INT_PTR_PTR_VAR): New function type for __builtin_dispatch.
>>>>        * builtins.def  (revision 173122) (BUILT_IN_DISPATCH): New builtin to
>>>>        support multi-version calls.
>>>>        * mversn-dispatch.c     (revision 0): New file.
>>>>        * timevar.def   (revision 173122) (TV_MVERSN_DISPATCH): New time var.
>>>>        * common.opt    (revision 173122) (fclone-hot-version-paths): New flag.
>>>>        * Makefile.in   (revision 173122) (mversn-dispatch.o): New rule.
>>>>        * passes.c      (revision 173122) (init_optimization_passes): Add the new
>>>>        multi-version and dispatch passes to the pass list.
>>>>        * params.def    (revision 173122) (PARAM_NUMBER_OF_MVERSN_CLONES): Define.
>>>>        (PARAM_MVERSN_CLONE_CGRAPH_DEPTH): Define.
>>>>        * doc/invoke.texi       (revision 173122) (mversn-clone-depth): Document.
>>>>        (num-mversn-clones): Document.
>>>>        (fclone-hot-version-paths): Document.
>>>
>>
>
Sriraman Tallam May 2, 2011, 7:11 p.m. UTC | #7
Hi,

  I want to submit this patch to google/main to make sure it is
available for our internal use at Google in order to materialize some
optimization opportunities. Let us continue this discussion as I make
changes and submit this for review for trunk.

Thanks,
-Sri.


On Mon, May 2, 2011 at 9:41 AM, Xinliang David Li <davidxl@google.com> wrote:
> On Mon, May 2, 2011 at 2:11 AM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
>> On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li <davidxl@google.com> wrote:
>>> Here is the background for this feature:
>>>
>>> 1) People relies on function multi-version to explore hw features and
>>> squeeze performance, but there is no standard ways of doing so, either
>>> a) using indirect function calls with function pointers set at program
>>> initialization; b) using manual dispatch at each callsite; b) using
>>> features like IFUNC.  The dispatch mechanism needs to be promoted to
>>> the language level and becomes the first class citizen;
>>
>> You are not doing that, you are inventing a new (crude) GCC extension.
>
> To capture the high level semantics and prevent user from lowering the
> dispatch calls into forms compiler can not recognize, language
> extension is the way to go.
>
>>
>
>>> See above. Multi-way dispatch can be added similarly.
>>
>> Not with the specified syntax.  So you'd end up with _two_
>> language extensions.  That's bad (and unacceptable, IMHO).
>
> This part can be improved.
>
>>
>>>
>>>>
>>>> That's a nice idea, but why not simply process all functions with
>>>> a const attribute and no arguments this way?  IMHO
>>>>
>>>> int testsse (void) __attribute__((const));
>>>>
>>>> is as good and avoids the new attribute completely.  The pass would
>>>> also benefit other uses of this idiom (giving a way to have global
>>>> dynamic initializers in C).  The functions may not be strictly 'const'
>>>> in a way the compiler autodetects this attribute but it presents
>>>> exactly the promises to the compiler that allow this optimization.
>>>>
>>>> Thus, please split this piece out into a separate patch and use
>>>> the const attribute.
>>>
>>>
>>> Sounds good.
>>>
>
>>>
>>> 1) the function selection may happen in a different function;
>>
>> Well, sure.  I propose to have the above as lowered form.  If people
>> deliberately obfuscate code it's their fault.  Your scheme simply
>> makes that obfuscation impossible (or less likely possible).  So
>> 1) is not a valid argument.
>
> Lowering too early may defeat the purpose because
>
> 1) the desired optimization may not happen subject to many compiler
> heuristic changes;
> 2) it has other side effects such as wrong estimation of function size
> which may impact inlining
> 3) it limits the lowering into one form which may not be ideal  --
> with builtin_dispatch, after hoisting optimization, the lowering can
> use more efficient IFUNC scheme, for instance.
>
>>
>>
>> My point is that such optimization is completely independent of
>> that dispatch thing.  The above could as well be a selection to
>> use different input arrays, one causing alias analysis issues
>> and one not.
>>
>> Thus, a __builtin_dispatch centric optimization pass is the wrong
>> way to go.
>
> I agree that many things can implemented in different ways, but a high
> level standard builtin_dispatch mechanism doing hoisting
> interprocedcurally is cleaner and simpler and targeted for function
> multi-versioning -- it does not depend on/rely on later phase's
> heuristic tunings to make the right things to happen. Function MV
> deserves this level of treatment as it will become more and more
> important for some users (e.g., Google).
>
>>
>> Now, with FDO I'd expect the foo is inlined into bar (because foo
>> is deemed hot),
>
> That is a myth -- the truth is that there are other heuristics which
> can prevent this from happening.
>
>> then you only need to deal with loop unswitching,
>> which should be easy to drive from FDO.
>
> Same here -- the loop body may not be well formed/recognized. The loop
> nests may not be perfectly nested, or other unswitching heuristics may
> block it from happening.  This is the common problem form many other
> things that get lowered too early. It is cleaner to make the high
> level transformation first in IPA, and let unswitching dealing with
> intra-procedural optimization.
>
> Thanks,
>
> David
>
>>
>> Richard.
>>
>>>
>>>
>>> Thanks,
>>>
>>> David
>>>
>>>
>>>
>>>>
>>>> Note that we already have special FDO support for indirect to direct
>>>> call promotion, so that might work already.
>>>>
>>>> Thus, with all this, __builtin_dispatch would be more like syntactic
>>>> sugar to avoid writing above switch statement yourself.  If you restrict
>>>> that sugar to a constant number of candidates it can be replaced by
>>>> a macro (or several ones, specialized for the number of candidates).
>>>>
>>>> Richard.
>>>>
>>>>>  For the call graph that looks like this before cloning :
>>>>>
>>>>> func_cold ----> func_hot ----> findOnes ----> IsPopCntSupported ----> popcnt
>>>>>                                                |
>>>>>                                                -----> no_popcnt
>>>>>
>>>>>  where popcnt and no_popcnt are the multi-versioned functions, then after cloning :
>>>>>
>>>>> func_cold ---->IsPopCntSupported ----> func_hot_clone0 ----> findOnes_clone0 ----> popcnt
>>>>>                              |
>>>>>                               ----> func_hot_clone1 ----> findOnes_clone1 ----> no_popcnt
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>  Flags : -fclone-hot-version-paths does function unswitching via cloning.
>>>>>          --param=num-mversn-clones=<num> allows to specify the number of
>>>>>          functions that should be cloned.
>>>>>          --param=mversn-clone-depth=<num> allows to specify the length of
>>>>>          the call graph path that should be cloned.  num = 0 implies only
>>>>>          leaf node that contains the __builtin_dispatch statement must be
>>>>>          cloned.
>>>>> ============================================================================================
>>>>>
>>>>> Google ref 45947, 45986
>>>>>
>>>>> 2011-04-28  Sriraman Tallam  <tmsriram@google.com>
>>>>>
>>>>>        * c-family/c-common.c   (revision 173122) (handle_version_selector_attribute): New function.
>>>>>        (c_common_attribute_table): New attribute "version_selector".
>>>>>        * tree-pass.h   (revision 173122) (pass_tree_convert_builtin_dispatch): New pass.
>>>>>        (pass_ipa_multiversion_dispatch): New pass.
>>>>>        * testsuite/gcc.dg/mversn7.c    (revision 0): New test case.
>>>>>        * testsuite/gcc.dg/mversn4.c    (revision 0): New test case.
>>>>>        * testsuite/gcc.dg/mversn4.h    (revision 0): New test case.
>>>>>        * testsuite/gcc.dg/mversn4a.c   (revision 0): New test case.
>>>>>        * testsuite/gcc.dg/torture/mversn1.c    (revision 0): New test case.
>>>>>        * testsuite/gcc.dg/mversn2.c    (revision 0): New test case.
>>>>>        * testsuite/gcc.dg/mversn6.c    (revision 0): New test case.
>>>>>        * testsuite/gcc.dg/mversn3.c    (revision 0): New test case.
>>>>>        * testsuite/g++.dg/mversn8.C    (revision 0): New test case.
>>>>>        * testsuite/g++.dg/mversn10a.C  (revision 0): New test case.
>>>>>        * testsuite/g++.dg/mversn14a.C  (revision 0): New test case.
>>>>>        * testsuite/g++.dg/tree-prof/mversn13.C (revision 0): New test case.
>>>>>        * testsuite/g++.dg/tree-prof/mversn15.C (revision 0): New test case.
>>>>>        * testsuite/g++.dg/tree-prof/mversn15a.C        (revision 0): New test case.
>>>>>        * testsuite/g++.dg/mversn9.C    (revision 0): New test case.
>>>>>        * testsuite/g++.dg/mversn10.C   (revision 0): New test case.
>>>>>        * testsuite/g++.dg/mversn12.C   (revision 0): New test case.
>>>>>        * testsuite/g++.dg/mversn14.C   (revision 0): New test case.
>>>>>        * testsuite/g++.dg/mversn16.C   (revision 0): New test case.
>>>>>        * testsuite/g++.dg/torture/mversn11.C   (revision 0): New test case.
>>>>>        * testsuite/g++.dg/torture/mversn5.C    (revision 0): New test case.
>>>>>        * testsuite/g++.dg/torture/mversn5.h    (revision 0): New test case.
>>>>>        * testsuite/g++.dg/torture/mversn5a.C   (revision 0): New test case.
>>>>>        * builtin-types.def     (revision 173122) (BT_PTR_FN_INT): New pointer type.
>>>>>        (BT_FN_INT_PTR_FN_INT_PTR_PTR_VAR): New function type for __builtin_dispatch.
>>>>>        * builtins.def  (revision 173122) (BUILT_IN_DISPATCH): New builtin to
>>>>>        support multi-version calls.
>>>>>        * mversn-dispatch.c     (revision 0): New file.
>>>>>        * timevar.def   (revision 173122) (TV_MVERSN_DISPATCH): New time var.
>>>>>        * common.opt    (revision 173122) (fclone-hot-version-paths): New flag.
>>>>>        * Makefile.in   (revision 173122) (mversn-dispatch.o): New rule.
>>>>>        * passes.c      (revision 173122) (init_optimization_passes): Add the new
>>>>>        multi-version and dispatch passes to the pass list.
>>>>>        * params.def    (revision 173122) (PARAM_NUMBER_OF_MVERSN_CLONES): Define.
>>>>>        (PARAM_MVERSN_CLONE_CGRAPH_DEPTH): Define.
>>>>>        * doc/invoke.texi       (revision 173122) (mversn-clone-depth): Document.
>>>>>        (num-mversn-clones): Document.
>>>>>        (fclone-hot-version-paths): Document.
>>>>
>>>
>>
>
Xinliang David Li May 2, 2011, 7:32 p.m. UTC | #8
Ok. There may be more correctness-related comments -- but those can be
addressed when available. For trunk, you need to address issues such
as multi-way dispatch.

Thanks,

David

On Mon, May 2, 2011 at 12:11 PM, Sriraman Tallam <tmsriram@google.com> wrote:
> Hi,
>
>  I want to submit this patch to google/main to make sure it is
> available for our internal use at Google in order to materialize some
> optimization opportunities. Let us continue this discussion as I make
> changes and submit this for review for trunk.
>
> Thanks,
> -Sri.
>
>
> On Mon, May 2, 2011 at 9:41 AM, Xinliang David Li <davidxl@google.com> wrote:
>> On Mon, May 2, 2011 at 2:11 AM, Richard Guenther
>> <richard.guenther@gmail.com> wrote:
>>> On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li <davidxl@google.com> wrote:
>>>> Here is the background for this feature:
>>>>
>>>> 1) People relies on function multi-version to explore hw features and
>>>> squeeze performance, but there is no standard ways of doing so, either
>>>> a) using indirect function calls with function pointers set at program
>>>> initialization; b) using manual dispatch at each callsite; b) using
>>>> features like IFUNC.  The dispatch mechanism needs to be promoted to
>>>> the language level and becomes the first class citizen;
>>>
>>> You are not doing that, you are inventing a new (crude) GCC extension.
>>
>> To capture the high level semantics and prevent user from lowering the
>> dispatch calls into forms compiler can not recognize, language
>> extension is the way to go.
>>
>>>
>>
>>>> See above. Multi-way dispatch can be added similarly.
>>>
>>> Not with the specified syntax.  So you'd end up with _two_
>>> language extensions.  That's bad (and unacceptable, IMHO).
>>
>> This part can be improved.
>>
>>>
>>>>
>>>>>
>>>>> That's a nice idea, but why not simply process all functions with
>>>>> a const attribute and no arguments this way?  IMHO
>>>>>
>>>>> int testsse (void) __attribute__((const));
>>>>>
>>>>> is as good and avoids the new attribute completely.  The pass would
>>>>> also benefit other uses of this idiom (giving a way to have global
>>>>> dynamic initializers in C).  The functions may not be strictly 'const'
>>>>> in a way the compiler autodetects this attribute but it presents
>>>>> exactly the promises to the compiler that allow this optimization.
>>>>>
>>>>> Thus, please split this piece out into a separate patch and use
>>>>> the const attribute.
>>>>
>>>>
>>>> Sounds good.
>>>>
>>
>>>>
>>>> 1) the function selection may happen in a different function;
>>>
>>> Well, sure.  I propose to have the above as lowered form.  If people
>>> deliberately obfuscate code it's their fault.  Your scheme simply
>>> makes that obfuscation impossible (or less likely possible).  So
>>> 1) is not a valid argument.
>>
>> Lowering too early may defeat the purpose because
>>
>> 1) the desired optimization may not happen subject to many compiler
>> heuristic changes;
>> 2) it has other side effects such as wrong estimation of function size
>> which may impact inlining
>> 3) it limits the lowering into one form which may not be ideal  --
>> with builtin_dispatch, after hoisting optimization, the lowering can
>> use more efficient IFUNC scheme, for instance.
>>
>>>
>>>
>>> My point is that such optimization is completely independent of
>>> that dispatch thing.  The above could as well be a selection to
>>> use different input arrays, one causing alias analysis issues
>>> and one not.
>>>
>>> Thus, a __builtin_dispatch centric optimization pass is the wrong
>>> way to go.
>>
>> I agree that many things can implemented in different ways, but a high
>> level standard builtin_dispatch mechanism doing hoisting
>> interprocedcurally is cleaner and simpler and targeted for function
>> multi-versioning -- it does not depend on/rely on later phase's
>> heuristic tunings to make the right things to happen. Function MV
>> deserves this level of treatment as it will become more and more
>> important for some users (e.g., Google).
>>
>>>
>>> Now, with FDO I'd expect the foo is inlined into bar (because foo
>>> is deemed hot),
>>
>> That is a myth -- the truth is that there are other heuristics which
>> can prevent this from happening.
>>
>>> then you only need to deal with loop unswitching,
>>> which should be easy to drive from FDO.
>>
>> Same here -- the loop body may not be well formed/recognized. The loop
>> nests may not be perfectly nested, or other unswitching heuristics may
>> block it from happening.  This is the common problem form many other
>> things that get lowered too early. It is cleaner to make the high
>> level transformation first in IPA, and let unswitching dealing with
>> intra-procedural optimization.
>>
>> Thanks,
>>
>> David
>>
>>>
>>> Richard.
>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> David
>>>>
>>>>
>>>>
>>>>>
>>>>> Note that we already have special FDO support for indirect to direct
>>>>> call promotion, so that might work already.
>>>>>
>>>>> Thus, with all this, __builtin_dispatch would be more like syntactic
>>>>> sugar to avoid writing above switch statement yourself.  If you restrict
>>>>> that sugar to a constant number of candidates it can be replaced by
>>>>> a macro (or several ones, specialized for the number of candidates).
>>>>>
>>>>> Richard.
>>>>>
>>>>>>  For the call graph that looks like this before cloning :
>>>>>>
>>>>>> func_cold ----> func_hot ----> findOnes ----> IsPopCntSupported ----> popcnt
>>>>>>                                                |
>>>>>>                                                -----> no_popcnt
>>>>>>
>>>>>>  where popcnt and no_popcnt are the multi-versioned functions, then after cloning :
>>>>>>
>>>>>> func_cold ---->IsPopCntSupported ----> func_hot_clone0 ----> findOnes_clone0 ----> popcnt
>>>>>>                              |
>>>>>>                               ----> func_hot_clone1 ----> findOnes_clone1 ----> no_popcnt
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>  Flags : -fclone-hot-version-paths does function unswitching via cloning.
>>>>>>          --param=num-mversn-clones=<num> allows to specify the number of
>>>>>>          functions that should be cloned.
>>>>>>          --param=mversn-clone-depth=<num> allows to specify the length of
>>>>>>          the call graph path that should be cloned.  num = 0 implies only
>>>>>>          leaf node that contains the __builtin_dispatch statement must be
>>>>>>          cloned.
>>>>>> ============================================================================================
>>>>>>
>>>>>> Google ref 45947, 45986
>>>>>>
>>>>>> 2011-04-28  Sriraman Tallam  <tmsriram@google.com>
>>>>>>
>>>>>>        * c-family/c-common.c   (revision 173122) (handle_version_selector_attribute): New function.
>>>>>>        (c_common_attribute_table): New attribute "version_selector".
>>>>>>        * tree-pass.h   (revision 173122) (pass_tree_convert_builtin_dispatch): New pass.
>>>>>>        (pass_ipa_multiversion_dispatch): New pass.
>>>>>>        * testsuite/gcc.dg/mversn7.c    (revision 0): New test case.
>>>>>>        * testsuite/gcc.dg/mversn4.c    (revision 0): New test case.
>>>>>>        * testsuite/gcc.dg/mversn4.h    (revision 0): New test case.
>>>>>>        * testsuite/gcc.dg/mversn4a.c   (revision 0): New test case.
>>>>>>        * testsuite/gcc.dg/torture/mversn1.c    (revision 0): New test case.
>>>>>>        * testsuite/gcc.dg/mversn2.c    (revision 0): New test case.
>>>>>>        * testsuite/gcc.dg/mversn6.c    (revision 0): New test case.
>>>>>>        * testsuite/gcc.dg/mversn3.c    (revision 0): New test case.
>>>>>>        * testsuite/g++.dg/mversn8.C    (revision 0): New test case.
>>>>>>        * testsuite/g++.dg/mversn10a.C  (revision 0): New test case.
>>>>>>        * testsuite/g++.dg/mversn14a.C  (revision 0): New test case.
>>>>>>        * testsuite/g++.dg/tree-prof/mversn13.C (revision 0): New test case.
>>>>>>        * testsuite/g++.dg/tree-prof/mversn15.C (revision 0): New test case.
>>>>>>        * testsuite/g++.dg/tree-prof/mversn15a.C        (revision 0): New test case.
>>>>>>        * testsuite/g++.dg/mversn9.C    (revision 0): New test case.
>>>>>>        * testsuite/g++.dg/mversn10.C   (revision 0): New test case.
>>>>>>        * testsuite/g++.dg/mversn12.C   (revision 0): New test case.
>>>>>>        * testsuite/g++.dg/mversn14.C   (revision 0): New test case.
>>>>>>        * testsuite/g++.dg/mversn16.C   (revision 0): New test case.
>>>>>>        * testsuite/g++.dg/torture/mversn11.C   (revision 0): New test case.
>>>>>>        * testsuite/g++.dg/torture/mversn5.C    (revision 0): New test case.
>>>>>>        * testsuite/g++.dg/torture/mversn5.h    (revision 0): New test case.
>>>>>>        * testsuite/g++.dg/torture/mversn5a.C   (revision 0): New test case.
>>>>>>        * builtin-types.def     (revision 173122) (BT_PTR_FN_INT): New pointer type.
>>>>>>        (BT_FN_INT_PTR_FN_INT_PTR_PTR_VAR): New function type for __builtin_dispatch.
>>>>>>        * builtins.def  (revision 173122) (BUILT_IN_DISPATCH): New builtin to
>>>>>>        support multi-version calls.
>>>>>>        * mversn-dispatch.c     (revision 0): New file.
>>>>>>        * timevar.def   (revision 173122) (TV_MVERSN_DISPATCH): New time var.
>>>>>>        * common.opt    (revision 173122) (fclone-hot-version-paths): New flag.
>>>>>>        * Makefile.in   (revision 173122) (mversn-dispatch.o): New rule.
>>>>>>        * passes.c      (revision 173122) (init_optimization_passes): Add the new
>>>>>>        multi-version and dispatch passes to the pass list.
>>>>>>        * params.def    (revision 173122) (PARAM_NUMBER_OF_MVERSN_CLONES): Define.
>>>>>>        (PARAM_MVERSN_CLONE_CGRAPH_DEPTH): Define.
>>>>>>        * doc/invoke.texi       (revision 173122) (mversn-clone-depth): Document.
>>>>>>        (num-mversn-clones): Document.
>>>>>>        (fclone-hot-version-paths): Document.
>>>>>
>>>>
>>>
>>
>
Sriraman Tallam May 2, 2011, 8:38 p.m. UTC | #9
Thanks. I committed this patch.

-Sri.

svn commit
Sending        gcc/ChangeLog.google-main
Sending        gcc/Makefile.in
Sending        gcc/builtin-types.def
Sending        gcc/builtins.def
Sending        gcc/c-family/c-common.c
Sending        gcc/common.opt
Sending        gcc/doc/invoke.texi
Adding         gcc/mversn-dispatch.c
Sending        gcc/params.def
Sending        gcc/passes.c
Adding         gcc/testsuite/g++.dg/mversn10.C
Adding         gcc/testsuite/g++.dg/mversn10a.C
Adding         gcc/testsuite/g++.dg/mversn12.C
Adding         gcc/testsuite/g++.dg/mversn14.C
Adding         gcc/testsuite/g++.dg/mversn14a.C
Adding         gcc/testsuite/g++.dg/mversn16.C
Adding         gcc/testsuite/g++.dg/mversn8.C
Adding         gcc/testsuite/g++.dg/mversn9.C
Adding         gcc/testsuite/g++.dg/torture/mversn11.C
Adding         gcc/testsuite/g++.dg/torture/mversn5.C
Adding         gcc/testsuite/g++.dg/torture/mversn5.h
Adding         gcc/testsuite/g++.dg/torture/mversn5a.C
Adding         gcc/testsuite/g++.dg/tree-prof/mversn13.C
Adding         gcc/testsuite/g++.dg/tree-prof/mversn15.C
Adding         gcc/testsuite/g++.dg/tree-prof/mversn15a.C
Adding         gcc/testsuite/gcc.dg/mversn2.c
Adding         gcc/testsuite/gcc.dg/mversn3.c
Adding         gcc/testsuite/gcc.dg/mversn4.c
Adding         gcc/testsuite/gcc.dg/mversn4.h
Adding         gcc/testsuite/gcc.dg/mversn4a.c
Adding         gcc/testsuite/gcc.dg/mversn6.c
Adding         gcc/testsuite/gcc.dg/mversn7.c
Adding         gcc/testsuite/gcc.dg/torture/mversn1.c
Sending        gcc/timevar.def
Sending        gcc/tree-pass.h
Transmitting file data ...................................
Committed revision 173271.


On Mon, May 2, 2011 at 12:32 PM, Xinliang David Li <davidxl@google.com> wrote:
> Ok. There may be more correctness-related comments -- but those can be
> addressed when available. For trunk, you need to address issues such
> as multi-way dispatch.
>
> Thanks,
>
> David
>
> On Mon, May 2, 2011 at 12:11 PM, Sriraman Tallam <tmsriram@google.com> wrote:
>> Hi,
>>
>>  I want to submit this patch to google/main to make sure it is
>> available for our internal use at Google in order to materialize some
>> optimization opportunities. Let us continue this discussion as I make
>> changes and submit this for review for trunk.
>>
>> Thanks,
>> -Sri.
>>
>>
>> On Mon, May 2, 2011 at 9:41 AM, Xinliang David Li <davidxl@google.com> wrote:
>>> On Mon, May 2, 2011 at 2:11 AM, Richard Guenther
>>> <richard.guenther@gmail.com> wrote:
>>>> On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li <davidxl@google.com> wrote:
>>>>> Here is the background for this feature:
>>>>>
>>>>> 1) People relies on function multi-version to explore hw features and
>>>>> squeeze performance, but there is no standard ways of doing so, either
>>>>> a) using indirect function calls with function pointers set at program
>>>>> initialization; b) using manual dispatch at each callsite; b) using
>>>>> features like IFUNC.  The dispatch mechanism needs to be promoted to
>>>>> the language level and becomes the first class citizen;
>>>>
>>>> You are not doing that, you are inventing a new (crude) GCC extension.
>>>
>>> To capture the high level semantics and prevent user from lowering the
>>> dispatch calls into forms compiler can not recognize, language
>>> extension is the way to go.
>>>
>>>>
>>>
>>>>> See above. Multi-way dispatch can be added similarly.
>>>>
>>>> Not with the specified syntax.  So you'd end up with _two_
>>>> language extensions.  That's bad (and unacceptable, IMHO).
>>>
>>> This part can be improved.
>>>
>>>>
>>>>>
>>>>>>
>>>>>> That's a nice idea, but why not simply process all functions with
>>>>>> a const attribute and no arguments this way?  IMHO
>>>>>>
>>>>>> int testsse (void) __attribute__((const));
>>>>>>
>>>>>> is as good and avoids the new attribute completely.  The pass would
>>>>>> also benefit other uses of this idiom (giving a way to have global
>>>>>> dynamic initializers in C).  The functions may not be strictly 'const'
>>>>>> in a way the compiler autodetects this attribute but it presents
>>>>>> exactly the promises to the compiler that allow this optimization.
>>>>>>
>>>>>> Thus, please split this piece out into a separate patch and use
>>>>>> the const attribute.
>>>>>
>>>>>
>>>>> Sounds good.
>>>>>
>>>
>>>>>
>>>>> 1) the function selection may happen in a different function;
>>>>
>>>> Well, sure.  I propose to have the above as lowered form.  If people
>>>> deliberately obfuscate code it's their fault.  Your scheme simply
>>>> makes that obfuscation impossible (or less likely possible).  So
>>>> 1) is not a valid argument.
>>>
>>> Lowering too early may defeat the purpose because
>>>
>>> 1) the desired optimization may not happen subject to many compiler
>>> heuristic changes;
>>> 2) it has other side effects such as wrong estimation of function size
>>> which may impact inlining
>>> 3) it limits the lowering into one form which may not be ideal  --
>>> with builtin_dispatch, after hoisting optimization, the lowering can
>>> use more efficient IFUNC scheme, for instance.
>>>
>>>>
>>>>
>>>> My point is that such optimization is completely independent of
>>>> that dispatch thing.  The above could as well be a selection to
>>>> use different input arrays, one causing alias analysis issues
>>>> and one not.
>>>>
>>>> Thus, a __builtin_dispatch centric optimization pass is the wrong
>>>> way to go.
>>>
>>> I agree that many things can implemented in different ways, but a high
>>> level standard builtin_dispatch mechanism doing hoisting
>>> interprocedcurally is cleaner and simpler and targeted for function
>>> multi-versioning -- it does not depend on/rely on later phase's
>>> heuristic tunings to make the right things to happen. Function MV
>>> deserves this level of treatment as it will become more and more
>>> important for some users (e.g., Google).
>>>
>>>>
>>>> Now, with FDO I'd expect the foo is inlined into bar (because foo
>>>> is deemed hot),
>>>
>>> That is a myth -- the truth is that there are other heuristics which
>>> can prevent this from happening.
>>>
>>>> then you only need to deal with loop unswitching,
>>>> which should be easy to drive from FDO.
>>>
>>> Same here -- the loop body may not be well formed/recognized. The loop
>>> nests may not be perfectly nested, or other unswitching heuristics may
>>> block it from happening.  This is the common problem form many other
>>> things that get lowered too early. It is cleaner to make the high
>>> level transformation first in IPA, and let unswitching dealing with
>>> intra-procedural optimization.
>>>
>>> Thanks,
>>>
>>> David
>>>
>>>>
>>>> Richard.
>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> David
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> Note that we already have special FDO support for indirect to direct
>>>>>> call promotion, so that might work already.
>>>>>>
>>>>>> Thus, with all this, __builtin_dispatch would be more like syntactic
>>>>>> sugar to avoid writing above switch statement yourself.  If you restrict
>>>>>> that sugar to a constant number of candidates it can be replaced by
>>>>>> a macro (or several ones, specialized for the number of candidates).
>>>>>>
>>>>>> Richard.
>>>>>>
>>>>>>>  For the call graph that looks like this before cloning :
>>>>>>>
>>>>>>> func_cold ----> func_hot ----> findOnes ----> IsPopCntSupported ----> popcnt
>>>>>>>                                                |
>>>>>>>                                                -----> no_popcnt
>>>>>>>
>>>>>>>  where popcnt and no_popcnt are the multi-versioned functions, then after cloning :
>>>>>>>
>>>>>>> func_cold ---->IsPopCntSupported ----> func_hot_clone0 ----> findOnes_clone0 ----> popcnt
>>>>>>>                              |
>>>>>>>                               ----> func_hot_clone1 ----> findOnes_clone1 ----> no_popcnt
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>  Flags : -fclone-hot-version-paths does function unswitching via cloning.
>>>>>>>          --param=num-mversn-clones=<num> allows to specify the number of
>>>>>>>          functions that should be cloned.
>>>>>>>          --param=mversn-clone-depth=<num> allows to specify the length of
>>>>>>>          the call graph path that should be cloned.  num = 0 implies only
>>>>>>>          leaf node that contains the __builtin_dispatch statement must be
>>>>>>>          cloned.
>>>>>>> ============================================================================================
>>>>>>>
>>>>>>> Google ref 45947, 45986
>>>>>>>
>>>>>>> 2011-04-28  Sriraman Tallam  <tmsriram@google.com>
>>>>>>>
>>>>>>>        * c-family/c-common.c   (revision 173122) (handle_version_selector_attribute): New function.
>>>>>>>        (c_common_attribute_table): New attribute "version_selector".
>>>>>>>        * tree-pass.h   (revision 173122) (pass_tree_convert_builtin_dispatch): New pass.
>>>>>>>        (pass_ipa_multiversion_dispatch): New pass.
>>>>>>>        * testsuite/gcc.dg/mversn7.c    (revision 0): New test case.
>>>>>>>        * testsuite/gcc.dg/mversn4.c    (revision 0): New test case.
>>>>>>>        * testsuite/gcc.dg/mversn4.h    (revision 0): New test case.
>>>>>>>        * testsuite/gcc.dg/mversn4a.c   (revision 0): New test case.
>>>>>>>        * testsuite/gcc.dg/torture/mversn1.c    (revision 0): New test case.
>>>>>>>        * testsuite/gcc.dg/mversn2.c    (revision 0): New test case.
>>>>>>>        * testsuite/gcc.dg/mversn6.c    (revision 0): New test case.
>>>>>>>        * testsuite/gcc.dg/mversn3.c    (revision 0): New test case.
>>>>>>>        * testsuite/g++.dg/mversn8.C    (revision 0): New test case.
>>>>>>>        * testsuite/g++.dg/mversn10a.C  (revision 0): New test case.
>>>>>>>        * testsuite/g++.dg/mversn14a.C  (revision 0): New test case.
>>>>>>>        * testsuite/g++.dg/tree-prof/mversn13.C (revision 0): New test case.
>>>>>>>        * testsuite/g++.dg/tree-prof/mversn15.C (revision 0): New test case.
>>>>>>>        * testsuite/g++.dg/tree-prof/mversn15a.C        (revision 0): New test case.
>>>>>>>        * testsuite/g++.dg/mversn9.C    (revision 0): New test case.
>>>>>>>        * testsuite/g++.dg/mversn10.C   (revision 0): New test case.
>>>>>>>        * testsuite/g++.dg/mversn12.C   (revision 0): New test case.
>>>>>>>        * testsuite/g++.dg/mversn14.C   (revision 0): New test case.
>>>>>>>        * testsuite/g++.dg/mversn16.C   (revision 0): New test case.
>>>>>>>        * testsuite/g++.dg/torture/mversn11.C   (revision 0): New test case.
>>>>>>>        * testsuite/g++.dg/torture/mversn5.C    (revision 0): New test case.
>>>>>>>        * testsuite/g++.dg/torture/mversn5.h    (revision 0): New test case.
>>>>>>>        * testsuite/g++.dg/torture/mversn5a.C   (revision 0): New test case.
>>>>>>>        * builtin-types.def     (revision 173122) (BT_PTR_FN_INT): New pointer type.
>>>>>>>        (BT_FN_INT_PTR_FN_INT_PTR_PTR_VAR): New function type for __builtin_dispatch.
>>>>>>>        * builtins.def  (revision 173122) (BUILT_IN_DISPATCH): New builtin to
>>>>>>>        support multi-version calls.
>>>>>>>        * mversn-dispatch.c     (revision 0): New file.
>>>>>>>        * timevar.def   (revision 173122) (TV_MVERSN_DISPATCH): New time var.
>>>>>>>        * common.opt    (revision 173122) (fclone-hot-version-paths): New flag.
>>>>>>>        * Makefile.in   (revision 173122) (mversn-dispatch.o): New rule.
>>>>>>>        * passes.c      (revision 173122) (init_optimization_passes): Add the new
>>>>>>>        multi-version and dispatch passes to the pass list.
>>>>>>>        * params.def    (revision 173122) (PARAM_NUMBER_OF_MVERSN_CLONES): Define.
>>>>>>>        (PARAM_MVERSN_CLONE_CGRAPH_DEPTH): Define.
>>>>>>>        * doc/invoke.texi       (revision 173122) (mversn-clone-depth): Document.
>>>>>>>        (num-mversn-clones): Document.
>>>>>>>        (fclone-hot-version-paths): Document.
>>>>>>
>>>>>
>>>>
>>>
>>
>
Eric Botcazou May 2, 2011, 8:44 p.m. UTC | #10
> Thanks. I committed this patch.
>
> -Sri.
>
> svn commit
> Sending        gcc/ChangeLog.google-main
> Sending        gcc/Makefile.in
> Sending        gcc/builtin-types.def
> Sending        gcc/builtins.def
> Sending        gcc/c-family/c-common.c
> Sending        gcc/common.opt
> Sending        gcc/doc/invoke.texi
> Adding         gcc/mversn-dispatch.c
> Sending        gcc/params.def
> Sending        gcc/passes.c
> Adding         gcc/testsuite/g++.dg/mversn10.C
> Adding         gcc/testsuite/g++.dg/mversn10a.C
> Adding         gcc/testsuite/g++.dg/mversn12.C
> Adding         gcc/testsuite/g++.dg/mversn14.C
> Adding         gcc/testsuite/g++.dg/mversn14a.C
> Adding         gcc/testsuite/g++.dg/mversn16.C
> Adding         gcc/testsuite/g++.dg/mversn8.C
> Adding         gcc/testsuite/g++.dg/mversn9.C
> Adding         gcc/testsuite/g++.dg/torture/mversn11.C
> Adding         gcc/testsuite/g++.dg/torture/mversn5.C
> Adding         gcc/testsuite/g++.dg/torture/mversn5.h
> Adding         gcc/testsuite/g++.dg/torture/mversn5a.C
> Adding         gcc/testsuite/g++.dg/tree-prof/mversn13.C
> Adding         gcc/testsuite/g++.dg/tree-prof/mversn15.C
> Adding         gcc/testsuite/g++.dg/tree-prof/mversn15a.C
> Adding         gcc/testsuite/gcc.dg/mversn2.c
> Adding         gcc/testsuite/gcc.dg/mversn3.c
> Adding         gcc/testsuite/gcc.dg/mversn4.c
> Adding         gcc/testsuite/gcc.dg/mversn4.h
> Adding         gcc/testsuite/gcc.dg/mversn4a.c
> Adding         gcc/testsuite/gcc.dg/mversn6.c
> Adding         gcc/testsuite/gcc.dg/mversn7.c
> Adding         gcc/testsuite/gcc.dg/torture/mversn1.c
> Sending        gcc/timevar.def
> Sending        gcc/tree-pass.h
> Transmitting file data ...................................
> Committed revision 173271.

Please avoid posting useless messages like this on gcc-patches@.  The same 
information is duplicated on gcc-cvs@.  See http://gcc.gnu.org/svnwrite.html
Sriraman Tallam May 2, 2011, 8:50 p.m. UTC | #11
On Mon, May 2, 2011 at 1:44 PM, Eric Botcazou <ebotcazou@adacore.com> wrote:
>> Thanks. I committed this patch.
>>
>> -Sri.
>>
>> svn commit
>> Sending        gcc/ChangeLog.google-main
>> Sending        gcc/Makefile.in
>> Sending        gcc/builtin-types.def
>> Sending        gcc/builtins.def
>> Sending        gcc/c-family/c-common.c
>> Sending        gcc/common.opt
>> Sending        gcc/doc/invoke.texi
>> Adding         gcc/mversn-dispatch.c
>> Sending        gcc/params.def
>> Sending        gcc/passes.c
>> Adding         gcc/testsuite/g++.dg/mversn10.C
>> Adding         gcc/testsuite/g++.dg/mversn10a.C
>> Adding         gcc/testsuite/g++.dg/mversn12.C
>> Adding         gcc/testsuite/g++.dg/mversn14.C
>> Adding         gcc/testsuite/g++.dg/mversn14a.C
>> Adding         gcc/testsuite/g++.dg/mversn16.C
>> Adding         gcc/testsuite/g++.dg/mversn8.C
>> Adding         gcc/testsuite/g++.dg/mversn9.C
>> Adding         gcc/testsuite/g++.dg/torture/mversn11.C
>> Adding         gcc/testsuite/g++.dg/torture/mversn5.C
>> Adding         gcc/testsuite/g++.dg/torture/mversn5.h
>> Adding         gcc/testsuite/g++.dg/torture/mversn5a.C
>> Adding         gcc/testsuite/g++.dg/tree-prof/mversn13.C
>> Adding         gcc/testsuite/g++.dg/tree-prof/mversn15.C
>> Adding         gcc/testsuite/g++.dg/tree-prof/mversn15a.C
>> Adding         gcc/testsuite/gcc.dg/mversn2.c
>> Adding         gcc/testsuite/gcc.dg/mversn3.c
>> Adding         gcc/testsuite/gcc.dg/mversn4.c
>> Adding         gcc/testsuite/gcc.dg/mversn4.h
>> Adding         gcc/testsuite/gcc.dg/mversn4a.c
>> Adding         gcc/testsuite/gcc.dg/mversn6.c
>> Adding         gcc/testsuite/gcc.dg/mversn7.c
>> Adding         gcc/testsuite/gcc.dg/torture/mversn1.c
>> Sending        gcc/timevar.def
>> Sending        gcc/tree-pass.h
>> Transmitting file data ...................................
>> Committed revision 173271.
>
> Please avoid posting useless messages like this on gcc-patches@.  The same
> information is duplicated on gcc-cvs@.  See http://gcc.gnu.org/svnwrite.html

Alright & sorry.

-Sri.

>
> --
> Eric Botcazou
>
Richard Biener May 2, 2011, 9:33 p.m. UTC | #12
On Mon, May 2, 2011 at 6:41 PM, Xinliang David Li <davidxl@google.com> wrote:
> On Mon, May 2, 2011 at 2:11 AM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
>> On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li <davidxl@google.com> wrote:
>>> Here is the background for this feature:
>>>
>>> 1) People relies on function multi-version to explore hw features and
>>> squeeze performance, but there is no standard ways of doing so, either
>>> a) using indirect function calls with function pointers set at program
>>> initialization; b) using manual dispatch at each callsite; b) using
>>> features like IFUNC.  The dispatch mechanism needs to be promoted to
>>> the language level and becomes the first class citizen;
>>
>> You are not doing that, you are inventing a new (crude) GCC extension.
>
> To capture the high level semantics and prevent user from lowering the
> dispatch calls into forms compiler can not recognize, language
> extension is the way to go.

I don't think so.  With your patch only two passes understand the new
high-level form; the rest of the gimple passes are just confused.

>>
>
>>> See above. Multi-way dispatch can be added similarly.
>>
>> Not with the specified syntax.  So you'd end up with _two_
>> language extensions.  That's bad (and unacceptable, IMHO).
>
> This part can be improved.
>
>>
>>>
>>>>
>>>> That's a nice idea, but why not simply process all functions with
>>>> a const attribute and no arguments this way?  IMHO
>>>>
>>>> int testsse (void) __attribute__((const));
>>>>
>>>> is as good and avoids the new attribute completely.  The pass would
>>>> also benefit other uses of this idiom (giving a way to have global
>>>> dynamic initializers in C).  The functions may not be strictly 'const'
>>>> in a way the compiler autodetects this attribute but it presents
>>>> exactly the promises to the compiler that allow this optimization.
>>>>
>>>> Thus, please split this piece out into a separate patch and use
>>>> the const attribute.
>>>
>>>
>>> Sounds good.
>>>
>
>>>
>>> 1) the function selection may happen in a different function;
>>
>> Well, sure.  I propose to have the above as lowered form.  If people
>> deliberately obfuscate code it's their fault.  Your scheme simply
>> makes that obfuscation impossible (or less likely possible).  So
>> 1) is not a valid argument.
>
> Lowering too early may defeat the purpose because
>
> 1) the desired optimization may not happen subject to many compiler
> heuristic changes;
> 2) it has other side effects such as wrong estimation of function size
> which may impact inlining

May, may ... so you say all this can't happen under any circumstance
with your special code and passes?  Which nobody will see benefit
from unless they rewrite their code?  Well, I say if we can improve
_some_ of the existing usages that's better than never doing wrong
on a new language extension.  One that I'm not convinced is the way
to go (you didn't address at all the inability to use float arguments
and the ABI issues with using variadic arguments - after all you
did a poor-mans language extension by using GCC builtins instead
of inventing a true one).

> 3) it limits the lowering into one form which may not be ideal  --
> with builtin_dispatch, after hoisting optimization, the lowering can
> use more efficient IFUNC scheme, for instance.

I see no reason why we cannot transform a switch-indirect-call
pattern into an IFUNC call.

>> My point is that such optimization is completely independent of
>> that dispatch thing.  The above could as well be a selection to
>> use different input arrays, one causing alias analysis issues
>> and one not.
>>
>> Thus, a __builtin_dispatch centric optimization pass is the wrong
>> way to go.
>
> I agree that many things can implemented in different ways, but a high
> level standard builtin_dispatch mechanism doing hoisting
> interprocedcurally is cleaner and simpler and targeted for function
> multi-versioning -- it does not depend on/rely on later phase's
> heuristic tunings to make the right things to happen. Function MV
> deserves this level of treatment as it will become more and more
> important for some users (e.g., Google).

But inventing a new language extension to benefit from whatever
improvements we implement isn't obviously the best choice.

>> Now, with FDO I'd expect the foo is inlined into bar (because foo
>> is deemed hot),
>
> That is a myth -- the truth is that there are other heuristics which
> can prevent this from happening.

Huh, sure.  That doesn't make my expectation a myth.

>> then you only need to deal with loop unswitching,
>> which should be easy to drive from FDO.
>
> Same here -- the loop body may not be well formed/recognized. The loop
> nests may not be perfectly nested, or other unswitching heuristics may
> block it from happening.  This is the common problem form many other
> things that get lowered too early. It is cleaner to make the high
> level transformation first in IPA, and let unswitching dealing with
> intra-procedural optimization.

If it's not well-formed, inlining the call does not make it well-formed,
and thus it won't be optimized well anyway.  Btw, for the usual cases I
have seen, inlining isn't a transform that is worthwhile - transforming
to IFUNC would have been.

I'm not convinced at all.

Richard.
Xinliang David Li May 2, 2011, 11:07 p.m. UTC | #13
On Mon, May 2, 2011 at 2:33 PM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Mon, May 2, 2011 at 6:41 PM, Xinliang David Li <davidxl@google.com> wrote:
>> On Mon, May 2, 2011 at 2:11 AM, Richard Guenther
>> <richard.guenther@gmail.com> wrote:
>>> On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li <davidxl@google.com> wrote:
>>>> Here is the background for this feature:
>>>>
>>>> 1) People relies on function multi-version to explore hw features and
>>>> squeeze performance, but there is no standard ways of doing so, either
>>>> a) using indirect function calls with function pointers set at program
>>>> initialization; b) using manual dispatch at each callsite; b) using
>>>> features like IFUNC.  The dispatch mechanism needs to be promoted to
>>>> the language level and becomes the first class citizen;
>>>
>>> You are not doing that, you are inventing a new (crude) GCC extension.
>>
>> To capture the high level semantics and prevent user from lowering the
>> dispatch calls into forms compiler can not recognize, language
>> extension is the way to go.
>
> I don't think so.  With your patch only two passes understand the new
> high-level form, the rest of the gimple passes are just confused.

There is no need for other passes to understand it -- they can just
treat it as an opaque call. That is a good thing, otherwise other passes
would need to be modified. The same is true (only some passes understand
it) for things like __builtin_expect.
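
For reference, a small example of the __builtin_expect analogy: only the
passes that care about branch probabilities interpret the hint; every
other pass just sees an opaque builtin call whose value is its first
argument (the allocation wrapper here is made up for the example):

#include <stdlib.h>

void *
xmalloc (size_t n)
{
  void *p = malloc (n);
  /* Tell the compiler the failure path is unlikely.  */
  if (__builtin_expect (p == NULL, 0))
    abort ();
  return p;
}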


>>
>> 1) the desired optimization may not happen subject to many compiler
>> heuristic changes;
>> 2) it has other side effects such as wrong estimation of function size
>> which may impact inlining
>
> May, may ... so you say all this can't happen under any circumstance
> with your special code and passes?

No, that is not my argument. What I tried to say is that it will be
harder to achieve without the high-level semantics -- it requires more
handshaking between compiler passes.

> Which nobody will see benefit
> from unless they rewrite their code?

The target users for the builtin include the compiler itself -- it can
synthesize dispatch calls.

> Well, I say if we can improve
> _some_ of the existing usages that's better than never doing wrong
> on a new language extension.

This is independent.

> One that I'm not convinced is the way
> to go (you didn't address at all the inability to use float arguments
> and the ABI issues with using variadic arguments - after all you
> did a poor-mans language extension by using GCC builtins instead
> of inventing a true one).

This is an independent issue that either needs to be addressed or
marked as a limitation. The key question in this debate is whether
source/IR annotation using constructs with high-level semantics helps
the optimizer. In fact this is common practice. Would it make any
difference (in terms of acceptance) if the builtin were only used
internally by the compiler and not exposed to the user?

>
>> 3) it limits the lowering into one form which may not be ideal  --
>> with builtin_dispatch, after hoisting optimization, the lowering can
>> use more efficient IFUNC scheme, for instance.
>
> I see no reason why we cannot transform a switch-indirect-call
> pattern into an IFUNC call.
>

It is possible -- but it is like asking the user to lower the dispatch
and then telling the compiler to raise it again.

>>> My point is that such optimization is completely independent of
>>> that dispatch thing.  The above could as well be a selection to
>>> use different input arrays, one causing alias analysis issues
>>> and one not.
>>>
>>> Thus, a __builtin_dispatch centric optimization pass is the wrong
>>> way to go.
>>
>> I agree that many things can implemented in different ways, but a high
>> level standard builtin_dispatch mechanism doing hoisting
>> interprocedcurally is cleaner and simpler and targeted for function
>> multi-versioning -- it does not depend on/rely on later phase's
>> heuristic tunings to make the right things to happen. Function MV
>> deserves this level of treatment as it will become more and more
>> important for some users (e.g., Google).
>
> But inventing a new language extension to benefit from whatever
> improvements we implement isn't the obviously best choice.

It is not about just any improvement. I mentioned the potential of function
MV and want to have compiler infrastructure to deal with it.


>
>>> Now, with FDO I'd expect the foo is inlined into bar (because foo
>>> is deemed hot),
>>
>> That is a myth -- the truth is that there are other heuristics which
>> can prevent this from happening.
>
> Huh, sure.  That doesn't make my expectation a myth.
>
>>> then you only need to deal with loop unswitching,
>>> which should be easy to drive from FDO.
>>
>> Same here -- the loop body may not be well formed/recognized. The loop
>> nests may not be perfectly nested, or other unswitching heuristics may
>> block it from happening.  This is the common problem form many other
>> things that get lowered too early. It is cleaner to make the high
>> level transformation first in IPA, and let unswitching dealing with
>> intra-procedural optimization.
>
> If it's not well-formed inlining the call does not make it well-formed and
> thus it won't be optimized well anyway.  Btw, for the usual cases I
> have seen inlining isn't a transform that is worthwhile - transforming
> to IFUNC would have been.

I am not sure I understand the comment here. The proposed approach can
do interprocedural hoisting of the dispatch, and this is done pretty early
in the pipeline so that hot functions can be optimized as much as
possible. Lowering it early in the hot functions and relying on later
phases to deal with it has the following limitations:

0) the lowered code can get in the way of effective optimization
1) the lowered code can be transformed such that it cannot be
unswitched or cannot be raised properly
2) all functions in the related call chain need to be inlined for the
unswitching to happen interprocedurally
3) it relies on the unswitching heuristic to kick in

thanks,

David

>
> I'm not convinced at all.
>
> Richard.
>
Richard Biener May 3, 2011, 10 a.m. UTC | #14
On Tue, May 3, 2011 at 1:07 AM, Xinliang David Li <davidxl@google.com> wrote:
> On Mon, May 2, 2011 at 2:33 PM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
>> On Mon, May 2, 2011 at 6:41 PM, Xinliang David Li <davidxl@google.com> wrote:
>>> On Mon, May 2, 2011 at 2:11 AM, Richard Guenther
>>> <richard.guenther@gmail.com> wrote:
>>>> On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li <davidxl@google.com> wrote:
>>>>> Here is the background for this feature:
>>>>>
>>>>> 1) People relies on function multi-version to explore hw features and
>>>>> squeeze performance, but there is no standard ways of doing so, either
>>>>> a) using indirect function calls with function pointers set at program
>>>>> initialization; b) using manual dispatch at each callsite; b) using
>>>>> features like IFUNC.  The dispatch mechanism needs to be promoted to
>>>>> the language level and becomes the first class citizen;
>>>>
>>>> You are not doing that, you are inventing a new (crude) GCC extension.
>>>
>>> To capture the high level semantics and prevent user from lowering the
>>> dispatch calls into forms compiler can not recognize, language
>>> extension is the way to go.
>>
>> I don't think so.  With your patch only two passes understand the new
>> high-level form, the rest of the gimple passes are just confused.
>
> There is no need for other passes to understand it -- just treat it as
> opaque calls. This is goodness otherwise other passes need to be
> modified. This is true (only some passes understand it) for things
> like __builtin_expect.

Certainly __builtin_dispatch has to be understood by alias analysis and
all other passes that care about calls (like all IPA passes).  You can
of course treat it conservatively (may call any function, even those
which have their address not taken, clobber and read all memory, even
that which doesn't escape the TU).

Why obfuscate things when it is not necessary?
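
To make the cost of that conservative treatment concrete, a minimal sketch
(opaque is a made-up external function standing in for an un-analyzed
dispatch-like call):

/* Because the compiler must assume the unknown callee may read or write
   any global, g cannot be kept in a register across the call.  */
int g;

extern void opaque (void);

int f (void)
{
  g = 1;
  opaque ();      /* conservatively clobbers and reads all memory */
  return g;       /* g must be reloaded after the call */
}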

>>>
>>> 1) the desired optimization may not happen subject to many compiler
>>> heuristic changes;
>>> 2) it has other side effects such as wrong estimation of function size
>>> which may impact inlining
>>
>> May, may ... so you say all this can't happen under any circumstance
>> with your special code and passes?
>
> No that is not my argument. What I tried to say is it will be harder
> to achieve without high level semantics -- it requires more
> handshaking between compiler passes.

Sure - that's life.

>> Which nobody will see benefit
>> from unless they rewrite their code?
>
> The target users for the builtin include compiler itself -- it can
> synthesize dispatch calls.

Hum.  I'm not at all sure the dispatch calls are the best representation
for the IL.

>> Well, I say if we can improve
>> _some_ of the existing usages that's better than never doing wrong
>> on a new language extension.
>
> This is independent.

It is not.

>> One that I'm not convinced is the way
>> to go (you didn't address at all the inability to use float arguments
>> and the ABI issues with using variadic arguments - after all you
>> did a poor-mans language extension by using GCC builtins instead
>> of inventing a true one).
>
> This is an independent issue that either needs to be addressed or
> marked as limitation. The key of the debate is whether source/IR
> annotation using construct with high level semantics helps optimizer.
> In fact this is common. Would it make any difference (in terms of
> acceptance) if the builtin is only used internally by the compiler and
> not exposed to the user?

No.  I don't see at all why having everything in a single stmt is so much
more convenient.  And I don't see why existing IL features cannot be
used to make things a little more convenient.

>>> 3) it limits the lowering into one form which may not be ideal  --
>>> with builtin_dispatch, after hoisting optimization, the lowering can
>>> use more efficient IFUNC scheme, for instance.
>>
>> I see no reason why we cannot transform a switch-indirect-call
>> pattern into an IFUNC call.
>>
>
> It is possible -- but it is like asking user to lower the dispatch and
> tell compiler to raise it again ..

There is no possibility for a high-level dispatch at the source level.
And if I'd have to design one I would use function overloading, like

float compute_sth (float) __attribute__((version("sse4")))
{
  ... sse4 code ...
}

float compute_sth (float)
{
  ... fallback ...
}

float foo (float f)
{
  return compute_sth (f);
}

and if you not only want to dispatch for target features you could
specify a selector function and value in the attribute.  You might
notice that the above eventually matches the target attribute
directly, just the frontends need to be taught to emit dispatch
code whenever overload resolution results in ambiguities involving
target attribute differences.
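
As a rough sketch (an assumption about what such a frontend could emit, not
anything the patch does today), the ambiguous call to compute_sth might be
expanded along these lines; __select_sse4, compute_sth_sse4 and
compute_sth_default are hypothetical compiler-generated names:

extern int __select_sse4 (void);          /* hypothetical selector          */
extern float compute_sth_sse4 (float);    /* hypothetical mangled sse4 clone */
extern float compute_sth_default (float); /* hypothetical mangled fallback   */

float foo (float f)
{
  /* Dispatch emitted by the frontend at the ambiguous overload.  */
  return __select_sse4 () ? compute_sth_sse4 (f) : compute_sth_default (f);
}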

Now, a language extension to support multi-versioning should be
completely independent of any IL representation - by using
a builtin you are tying them together with the only convenient
mechanism we have - a mechanism that isn't optimal for either
side IMNSHO.

>>>> My point is that such optimization is completely independent of
>>>> that dispatch thing.  The above could as well be a selection to
>>>> use different input arrays, one causing alias analysis issues
>>>> and one not.
>>>>
>>>> Thus, a __builtin_dispatch centric optimization pass is the wrong
>>>> way to go.
>>>
>>> I agree that many things can implemented in different ways, but a high
>>> level standard builtin_dispatch mechanism doing hoisting
>>> interprocedcurally is cleaner and simpler and targeted for function
>>> multi-versioning -- it does not depend on/rely on later phase's
>>> heuristic tunings to make the right things to happen. Function MV
>>> deserves this level of treatment as it will become more and more
>>> important for some users (e.g., Google).
>>
>> But inventing a new language extension to benefit from whatever
>> improvements we implement isn't the obviously best choice.
>
> It is not for any improvement. I mentioned the potential for function
> MV and want to have a compiler infrastructure to deal with it.
>
>
>>
>>>> Now, with FDO I'd expect the foo is inlined into bar (because foo
>>>> is deemed hot),
>>>
>>> That is a myth -- the truth is that there are other heuristics which
>>> can prevent this from happening.
>>
>> Huh, sure.  That doesn't make my expectation a myth.
>>
>>>> then you only need to deal with loop unswitching,
>>>> which should be easy to drive from FDO.
>>>
>>> Same here -- the loop body may not be well formed/recognized. The loop
>>> nests may not be perfectly nested, or other unswitching heuristics may
>>> block it from happening.  This is the common problem form many other
>>> things that get lowered too early. It is cleaner to make the high
>>> level transformation first in IPA, and let unswitching dealing with
>>> intra-procedural optimization.
>>
>> If it's not well-formed inlining the call does not make it well-formed and
>> thus it won't be optimized well anyway.  Btw, for the usual cases I
>> have seen inlining isn't a transform that is worthwhile - transforming
>> to IFUNC would have been.
>
> I am not sure I understand the comment here. The proposed approach can
> do interprocedural hoisting of the dispatch and this done pretty early
> in the pipeline so that hot functions can be optimized as much as
> possible. Lowering it early in the hot functions rely later phases to
> deal with it has the following limitations:
>
> 0) lowered code can get in the way of effective optimization
> 1) the lowered code can be transformed such that it can not be
> unswitched or can not be raised properly
> 2) all functions in the related call chain to be inlined for the
> unswitching to happen interprocedurally
> 3) relies on unswitching heuristic to kick in

Sure, we have such issues everywhere in the compiler.  I don't see why
MV is special.  In fact there are surely cases that can be constructed
where we rely on earlier optimizations to make MV transforms possible.

Pass ordering issues are independent of the design of a language extension
and independent of what canonical IL representation we want to have
to help MV-related optimizations.

Thanks,
Richard.
Richard Biener May 3, 2011, 10:07 a.m. UTC | #15
On Tue, May 3, 2011 at 12:00 PM, Richard Guenther
<richard.guenther@gmail.com> wrote:
>>>> 3) it limits the lowering into one form which may not be ideal  --
>>>> with builtin_dispatch, after hoisting optimization, the lowering can
>>>> use more efficient IFUNC scheme, for instance.
>>>
>>> I see no reason why we cannot transform a switch-indirect-call
>>> pattern into an IFUNC call.
>>>
>>
>> It is possible -- but it is like asking user to lower the dispatch and
>> tell compiler to raise it again ..
>
> There is no possibility for a high-level dispatch at the source level.
> And if I'd have to design one I would use function overloading, like
>
> float compute_sth (float) __attribute__((version("sse4")))
> {
>  ... sse4 code ...
> }
>
> float compute_sth (float)
> {
>  ... fallback ...
> }
>
> float foo (float f)
> {
>  return compute_sth (f);
> }
>
> and if you not only want to dispatch for target features you could
> specify a selector function and value in the attribute.  You might
> notice that the above eventually matches the target attribute
> directly, just the frontends need to be taught to emit dispatch
> code whenever overload resolution results in ambiguities involving
> target attribute differences.

Which also would allow the frontend to directly call the sse4 variant
if you compile with -msse4, avoiding the need for any fancy processing.

Richard.
Mike Stump May 3, 2011, 3:22 p.m. UTC | #16
On May 3, 2011, at 3:07 AM, Richard Guenther <richard.guenther@gmail.com> wrote:
>> 
>> There is no possibility for a high-level dispatch at the source level.
>> And if I'd have to design one I would use function overloading, like
>> 
>> float compute_sth (float) __attribute__((version("sse4")))
>> {
>>  ... sse4 code ...
>> }
>> 
>> float compute_sth (float)
>> {
>>  ... fallback ...
>> }
>> 
>> float foo (float f)
>> {
>>  return compute_sth (f);
>> }
>> 
>> and if you not only want to dispatch for target features you could
>> specify a selector function and value in the attribute.  You might
>> notice that the above eventually matches the target attribute
>> directly, just the frontends need to be taught to emit dispatch
>> code whenever overload resolution results in ambiguities involving
>> target attribute differences.
> 
> Which also would allow the frontend to directly call the sse4 variant
> if you compile with -msse4, avoiding the need for any fancy processing.

And to go one step further, if we had this, we could use it to define all data manipulation machine built-ins as generic functions, available to all compilations as normal C code, so portable code could use them everywhere, and platforms that have instructions for any of them could define them as normal built-ins.
Joseph Myers May 3, 2011, 3:34 p.m. UTC | #17
On Tue, 3 May 2011, Mike Stump wrote:

> And to go one step further, if we had this, we could use this to define 
> all data manipulation machine built-ins as generic functions, available 
> to all compiles as normal c code, so portable code could use them 
> everywhere, and on platforms that had instructions for any of them, they 
> could define them as normal built-ins.

I don't think the forms in which some of the machine-specific built-in 
functions exist are particularly good for being available everywhere - 
they are typically defined to correspond exactly to the defined semantics 
of a particular instruction, complete with e.g. the peculiarities of how 
overflow works for that instruction.  But:

* I think it does make sense to make a range of vector operations, 
saturating operations (see the discussion in PR 48580) etc. available as 
generic GNU C on all platforms, with generically defined semantics 
(consistent with how C generally handles things such as overflow), where 
instruction sets commonly have relevant instructions but it isn't readily 
possible to represent the operations with generic C and pattern-match 
them.  (Just as we provide operations such as __builtin_popcount and 
__builtin_clz, for example - and note that __builtin_clz has undefined 
value at 0 rather than being defined to match a machine instruction; see 
the small example after this list.)

* Once the operations have generic GNU C, GIMPLE and RTL descriptions, 
intrinsic headers should best use the GNU C representations if possible 
instead of calling built-in functions, and any built-in functions should 
expand to the generic GIMPLE and RTL instead of using UNSPECs.

* It may also make sense for a target architecture to define its 
intrinsics in such a way that they are available on that architecture even 
when the relevant instructions aren't available in a particular 
subarchitecture.  This may be defining the intrinsics in a header if those 
are the public interface, instead of the built-in functions.  See 
a thread starting at <http://gcc.gnu.org/ml/gcc/2010-11/msg00546.html> for 
some discussion of doing this for SSE, for example.
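
A small example for the parenthetical in the first point above: the builtins
are generic GNU C on every target, but __builtin_clz (0) is undefined, so
portable code supplies its own semantics at that point (the sketch assumes a
32-bit unsigned int):

/* floor(log2(x)), with an explicitly defined result for x == 0.  */
unsigned int floor_log2 (unsigned int x)
{
  return x ? 31u - (unsigned int) __builtin_clz (x) : 0u;
}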
Mike Stump May 3, 2011, 4:50 p.m. UTC | #18
On May 3, 2011, at 8:34 AM, "Joseph S. Myers" <joseph@codesourcery.com> wrote:
> I don't think the forms in which some of the machine-specific built-in 
> functions exist are particularly good for being available everywhere

I take it you've never just wanted something to compile, and hit one of these yet.  Let me know if you change your opinion after you do.  I have.  For anyone that wants it to just compile, trust me, they just want it to compile.
Xinliang David Li May 3, 2011, 9:57 p.m. UTC | #19
On Tue, May 3, 2011 at 3:00 AM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Tue, May 3, 2011 at 1:07 AM, Xinliang David Li <davidxl@google.com> wrote:
>> On Mon, May 2, 2011 at 2:33 PM, Richard Guenther
>> <richard.guenther@gmail.com> wrote:
>>> On Mon, May 2, 2011 at 6:41 PM, Xinliang David Li <davidxl@google.com> wrote:
>>>> On Mon, May 2, 2011 at 2:11 AM, Richard Guenther
>>>> <richard.guenther@gmail.com> wrote:
>>>>> On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li <davidxl@google.com> wrote:
>>>>>> Here is the background for this feature:
>>>>>>
>>>>>> 1) People relies on function multi-version to explore hw features and
>>>>>> squeeze performance, but there is no standard ways of doing so, either
>>>>>> a) using indirect function calls with function pointers set at program
>>>>>> initialization; b) using manual dispatch at each callsite; b) using
>>>>>> features like IFUNC.  The dispatch mechanism needs to be promoted to
>>>>>> the language level and becomes the first class citizen;
>>>>>
>>>>> You are not doing that, you are inventing a new (crude) GCC extension.
>>>>
>>>> To capture the high level semantics and prevent user from lowering the
>>>> dispatch calls into forms compiler can not recognize, language
>>>> extension is the way to go.
>>>
>>> I don't think so.  With your patch only two passes understand the new
>>> high-level form, the rest of the gimple passes are just confused.
>>
>> There is no need for other passes to understand it -- just treat it as
>> opaque calls. This is goodness otherwise other passes need to be
>> modified. This is true (only some passes understand it) for things
>> like __builtin_expect.
>
> Certainly __builtin_dispatch has to be understood by alias analysis and
> all other passes that care about calls (like all IPA passes).  You can
> of course treat it conservatively (may call any function, even those
> which have their address not taken, clobber and read all memory, even
> that which doesn't escape the TU).
>
> Why obfuscate things when it is not necessary?

MVed functions are usually non-trivial, so I doubt anything will be
lost due to the obfuscation. It won't be too difficult to teach the
alias analyzer to 'merge' the attributes from the target functions either.


>> No that is not my argument. What I tried to say is it will be harder
>> to achieve without high level semantics -- it requires more
>> handshaking between compiler passes.
>
> Sure - that's life.
>

We are looking at improving that life ...

>>> Which nobody will see benefit
>>> from unless they rewrite their code?
>>
>> The target users for the builtin include compiler itself -- it can
>> synthesize dispatch calls.
>
> Hum.  I'm not at all sure the dispatch calls are the best representation
> for the IL.
>

The intention is to provide an interface at both the C level (for
programmers) and the IL level.  It does not have to be a builtin (either
internally or externally) -- but it needs to map to some language
construct.


>>> Well, I say if we can improve
>>> _some_ of the existing usages that's better than never doing wrong
>>> on a new language extension.
>>
>> This is independent.
>
> It is not.
>
>>> One that I'm not convinced is the way
>>> to go (you didn't address at all the inability to use float arguments
>>> and the ABI issues with using variadic arguments - after all you
>>> did a poor-mans language extension by using GCC builtins instead
>>> of inventing a true one).
>>
>> This is an independent issue that either needs to be addressed or
>> marked as limitation. The key of the debate is whether source/IR
>> annotation using construct with high level semantics helps optimizer.
>> In fact this is common. Would it make any difference (in terms of
>> acceptance) if the builtin is only used internally by the compiler and
>> not exposed to the user?
>
> No.  I don't see at all why having everything in a single stmt is so much
> more convenient.  And I don't see why existing IL features cannot be
> used to make things a little more convenient.

Why not? The high-level construct is simpler to deal with. It is all
about doing the right optimization at the right level of abstraction.
Setting aside the question of whether using a builtin for MV dispatch is
the right high-level construct, looking at gcc we can find that gcc's IR
is pretty low level, resulting in missed optimizations.

For instance, there is no high-level doloop representation -- a Fortran
do-loop needs to be lowered and raised back again -- and the consequence
is that you may not be able to raise the loop nest back into the way it was
originally written -- a perfectly nested loop becomes a non-perfect loop
nest -- blocking certain loop transformations.  Not only that, I am not
sure it is even possible to record any loop-level information anywhere --
is it possible to have a per-loop attribute such as an unroll factor?

Assuming gcc could do full math function inlining (for common math
routines) -- in that case, would we still want to do the sin/cos
optimization, or rely on the scalar optimizer to optimize the inlined
copies of sin and cos?
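
For instance (a sketch of the sin/cos case, under the usual optimization and
math-function assumptions): while the calls are still visible as the
standard functions, GCC can combine them into a single sincos/cexpi
computation; once the math routines are inlined, only the generic scalar
optimizer is left to rediscover that relationship.

#include <math.h>

double polar_sum (double phi)
{
  /* Both calls use the same argument, so they are a candidate for the
     sincos/cexpi combining pass while they remain recognizable calls.  */
  return sin (phi) + cos (phi);
}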

I am not sure about gcc, but I remember that dead temporary variable
removal can be very hard to do if some intrinsic gets lowered too early,
introducing allocator and deallocator calls etc.

>
>>>> 3) it limits the lowering into one form which may not be ideal  --
>>>> with builtin_dispatch, after hoisting optimization, the lowering can
>>>> use more efficient IFUNC scheme, for instance.
>>>
>>> I see no reason why we cannot transform a switch-indirect-call
>>> pattern into an IFUNC call.
>>>
>>
>> It is possible -- but it is like asking user to lower the dispatch and
>> tell compiler to raise it again ..
>
> There is no possibility for a high-level dispatch at the source level.
> And if I'd have to design one I would use function overloading, like
>
> float compute_sth (float) __attribute__((version("sse4")))
> {
>  ... sse4 code ...
> }
>
> float compute_sth (float)
> {
>  ... fallback ...
> }
>
> float foo (float f)
> {
>  return compute_sth (f);
> }
>
> and if you not only want to dispatch for target features you could
> specify a selector function and value in the attribute.  You might
> notice that the above eventually matches the target attribute
> directly, just the frontends need to be taught to emit dispatch
> code whenever overload resolution results in ambiguities involving
> target attribute differences.

Now we are talking.  Allowing a selector function is a must -- target
features alone are just too weak. If we had this, it would be really
nice for users.  The implicit dispatch lowering can be done after the
dispatch hoisting is done -- most of the ipa-clone work by Sri can be
retained.  I am not sure how hard the FE part of the work is, though.

>
> Now, a language extension to support multi-versioning should be
> completely independent on any IL representation - with using
> a builtin you are tying them together with the only convenient
> mechanism we have - a mechanism that isn't optimal for either
> side IMNSHO.
>

Yes -- they don't have to be tied -- they just happen to suit the
needs of both ends -- but I see the value of the latest proposal
(overloading) above.

Thanks for your feedback.

David


>>>>> My point is that such optimization is completely independent of
>>>>> that dispatch thing.  The above could as well be a selection to
>>>>> use different input arrays, one causing alias analysis issues
>>>>> and one not.
>>>>>
>>>>> Thus, a __builtin_dispatch centric optimization pass is the wrong
>>>>> way to go.
>>>>
>>>> I agree that many things can implemented in different ways, but a high
>>>> level standard builtin_dispatch mechanism doing hoisting
>>>> interprocedcurally is cleaner and simpler and targeted for function
>>>> multi-versioning -- it does not depend on/rely on later phase's
>>>> heuristic tunings to make the right things to happen. Function MV
>>>> deserves this level of treatment as it will become more and more
>>>> important for some users (e.g., Google).
>>>
>>> But inventing a new language extension to benefit from whatever
>>> improvements we implement isn't the obviously best choice.
>>
>> It is not for any improvement. I mentioned the potential for function
>> MV and want to have a compiler infrastructure to deal with it.
>>
>>
>>>
>>>>> Now, with FDO I'd expect the foo is inlined into bar (because foo
>>>>> is deemed hot),
>>>>
>>>> That is a myth -- the truth is that there are other heuristics which
>>>> can prevent this from happening.
>>>
>>> Huh, sure.  That doesn't make my expectation a myth.
>>>
>>>>> then you only need to deal with loop unswitching,
>>>>> which should be easy to drive from FDO.
>>>>
>>>> Same here -- the loop body may not be well formed/recognized. The loop
>>>> nests may not be perfectly nested, or other unswitching heuristics may
>>>> block it from happening.  This is the common problem form many other
>>>> things that get lowered too early. It is cleaner to make the high
>>>> level transformation first in IPA, and let unswitching dealing with
>>>> intra-procedural optimization.
>>>
>>> If it's not well-formed inlining the call does not make it well-formed and
>>> thus it won't be optimized well anyway.  Btw, for the usual cases I
>>> have seen inlining isn't a transform that is worthwhile - transforming
>>> to IFUNC would have been.
>>
>> I am not sure I understand the comment here. The proposed approach can
>> do interprocedural hoisting of the dispatch and this done pretty early
>> in the pipeline so that hot functions can be optimized as much as
>> possible. Lowering it early in the hot functions rely later phases to
>> deal with it has the following limitations:
>>
>> 0) lowered code can get in the way of effective optimization
>> 1) the lowered code can be transformed such that it can not be
>> unswitched or can not be raised properly
>> 2) all functions in the related call chain to be inlined for the
>> unswitching to happen interprocedurally
>> 3) relies on unswitching heuristic to kick in
>
> Sure, we have such issues everywhere in the compiler.  I don't see why
> MV is special.  In fact there are surely cases that can be constructed
> where we rely on earlier optimizations to make MV transforms possible.
>
> Pass ordering issues are independent on design of a language extension
> and independent on what canonical IL representation we want to have
> to help MV related optimizations.
>
> Thanks,
> Richard.
>
Richard Biener May 4, 2011, 9:27 a.m. UTC | #20
On Tue, May 3, 2011 at 11:57 PM, Xinliang David Li <davidxl@google.com> wrote:
> On Tue, May 3, 2011 at 3:00 AM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
>> On Tue, May 3, 2011 at 1:07 AM, Xinliang David Li <davidxl@google.com> wrote:
>>> On Mon, May 2, 2011 at 2:33 PM, Richard Guenther
>>> <richard.guenther@gmail.com> wrote:
>>>> On Mon, May 2, 2011 at 6:41 PM, Xinliang David Li <davidxl@google.com> wrote:
>>>>> On Mon, May 2, 2011 at 2:11 AM, Richard Guenther
>>>>> <richard.guenther@gmail.com> wrote:
>>>>>> On Fri, Apr 29, 2011 at 6:23 PM, Xinliang David Li <davidxl@google.com> wrote:
>>>>>>> Here is the background for this feature:
>>>>>>>
>>>>>>> 1) People relies on function multi-version to explore hw features and
>>>>>>> squeeze performance, but there is no standard ways of doing so, either
>>>>>>> a) using indirect function calls with function pointers set at program
>>>>>>> initialization; b) using manual dispatch at each callsite; b) using
>>>>>>> features like IFUNC.  The dispatch mechanism needs to be promoted to
>>>>>>> the language level and becomes the first class citizen;
>>>>>>
>>>>>> You are not doing that, you are inventing a new (crude) GCC extension.
>>>>>
>>>>> To capture the high level semantics and prevent user from lowering the
>>>>> dispatch calls into forms compiler can not recognize, language
>>>>> extension is the way to go.
>>>>
>>>> I don't think so.  With your patch only two passes understand the new
>>>> high-level form, the rest of the gimple passes are just confused.
>>>
>>> There is no need for other passes to understand it -- just treat it as
>>> opaque calls. This is goodness otherwise other passes need to be
>>> modified. This is true (only some passes understand it) for things
>>> like __builtin_expect.
>>
>> Certainly __builtin_dispatch has to be understood by alias analysis and
>> all other passes that care about calls (like all IPA passes).  You can
>> of course treat it conservatively (may call any function, even those
>> which have their address not taken, clobber and read all memory, even
>> that which doesn't escape the TU).
>>
>> Why obfuscate things when it is not necessary?
>
> MVed functions are usually non-trivial, so I doubt anything will be
> lost due to the obfuscation. It won't be too difficult to teach
> aliaser to 'merge' the attributes from target functions either.
>
>
>>> No that is not my argument. What I tried to say is it will be harder
>>> to achieve without high level semantics -- it requires more
>>> handshaking between compiler passes.
>>
>> Sure - that's life.
>>
>
> We are looking at improving the life ..
>
>>>> Which nobody will see benefit
>>>> from unless they rewrite their code?
>>>
>>> The target users for the builtin include compiler itself -- it can
>>> synthesize dispatch calls.
>>
>> Hum.  I'm not at all sure the dispatch calls are the best representation
>> for the IL.
>>
>
> The intension is to provide an interface at both C level (for
> programmers) and IL level.  It does not have to be a builtin (both
> internally and externally)  -- but it needs to map to some language
> construct.
>
>
>>>> Well, I say if we can improve
>>>> _some_ of the existing usages that's better than never doing wrong
>>>> on a new language extension.
>>>
>>> This is independent.
>>
>> It is not.
>>
>>>> One that I'm not convinced is the way
>>>> to go (you didn't address at all the inability to use float arguments
>>>> and the ABI issues with using variadic arguments - after all you
>>>> did a poor-mans language extension by using GCC builtins instead
>>>> of inventing a true one).
>>>
>>> This is an independent issue that either needs to be addressed or
>>> marked as limitation. The key of the debate is whether source/IR
>>> annotation using construct with high level semantics helps optimizer.
>>> In fact this is common. Would it make any difference (in terms of
>>> acceptance) if the builtin is only used internally by the compiler and
>>> not exposed to the user?
>>
>> No.  I don't see at all why having everything in a single stmt is so much
>> more convenient.  And I don't see why existing IL features cannot be
>> used to make things a little more convenient.
>
> Why not? The high level construct is simpler to deal with. It is all
> about doing the right optimization at the right level of abstraction.
> Set aside the question whether using builtin for MV dispatch is the
> right high level construct, looking at gcc, we can find that gcc's IR
> is pretty low level resulting in missing optimizations.
>
> For instance, there is no high level doloop representation -- Fortran
> do-loop needs to be lowered and raised back again -- the consequence
> is that you may not raise the loop nest into the way it was originally
> written -- perfect nested loop become non-perfect loop nest --
> blocking certain loop transformations.  Not only that, I am not sure
> it is even possible to record any loop level information anywhere --
> is it possible to have per loop attribute such as unroll factor?
>
> Assuming gcc can do full math function inlining (for common math
> routines) -- in this case, do we still want to do sin/cos optimization
> or rely on the scalar optimizer to optimize the inlined copies of sin
> and cos?
>
> Not sure about gcc, I remember that dead temporary variable removal
> can be very hard to do if some intrinsic gets lowered too early
> introducing allocator and deallocator calls etc.

Sure, there is always a trade-off between lowering early and lowering late.
Both can have advantages.  For all the examples above we already
have both, a high-level and a low-level form - for dispatch we currently
only have a low-level form for which I think it is not difficult to improve
optimization a tad bit.  And of course I don't like the initial
__builtin_dispatch () proposal for a high-level form.

I can think of some more-or-less obvious high-level forms; one would,
for example, simply stick a new DISPATCH tree into gimple_call_fn
(similar to how we can have OBJ_TYPE_REF there).  The DISPATCH
tree would be of variable length, with the first operand the selector
function and further operands the function addresses.  That would keep the
actual call visible (instead of a fake __builtin_dispatch call), something
I'd really like to see.

Lowering that would then simply "gimplify" that DISPATCH tree.

That doesn't map to a source construct, but as I said it doesn't have to ;)

>>>>> 3) it limits the lowering into one form which may not be ideal  --
>>>>> with builtin_dispatch, after hoisting optimization, the lowering can
>>>>> use more efficient IFUNC scheme, for instance.
>>>>
>>>> I see no reason why we cannot transform a switch-indirect-call
>>>> pattern into an IFUNC call.
>>>>
>>>
>>> It is possible -- but it is like asking user to lower the dispatch and
>>> tell compiler to raise it again ..
>>
>> There is no possibility for a high-level dispatch at the source level.
>> And if I'd have to design one I would use function overloading, like
>>
>> float compute_sth (float) __attribute__((version("sse4")))
>> {
>>  ... sse4 code ...
>> }
>>
>> float compute_sth (float)
>> {
>>  ... fallback ...
>> }
>>
>> float foo (float f)
>> {
>>  return compute_sth (f);
>> }
>>
>> and if you not only want to dispatch for target features you could
>> specify a selector function and value in the attribute.  You might
>> notice that the above eventually matches the target attribute
>> directly, just the frontends need to be taught to emit dispatch
>> code whenever overload resolution results in ambiguities involving
>> target attribute differences.
>
> Now we are talking.   Allowing selector function is a must -- as
> target features are just too weak. If we have this, it would be a
> really nice for users.

Restricting ourselves to use the existing target attribute at the
beginning (with a single, compiler-generated selector function)
is probably good enough to get a prototype up and running.
Extending it to arbitrary selector-function, value pairs using a
new attribute is then probably easy (I don't see the exact use-case
for that yet, but I suppose it exists if you say so).

For the overloading to work we probably have to force that the
functions are local (so we can mangle them arbitrarily) and that
if the function should be visible externally people add an
externally visible dispatcher (foo in the above example would be one).

> The implicit dispatch lowering can be done
> after the dispatch hoisting is done -- most of the ipa-clone work by
> Sri can be retained.   Not sure how hard the FE part of the work is
> though.

The FE parts would also be language specific I guess, eventually
easier in the C++ frontend.

>> Now, a language extension to support multi-versioning should be
>> completely independent on any IL representation - with using
>> a builtin you are tying them together with the only convenient
>> mechanism we have - a mechanism that isn't optimal for either
>> side IMNSHO.
>>
>
> Yes -- they don't have to be tied -- they just happen to suite the
> needs of both ends -- but I see the value of the latest proposal
> (overloading) above.

I did realize that using builtins was convenient (been there and done
the same for some experiments ...).

Richard.
Xinliang David Li May 4, 2011, 10:19 p.m. UTC | #21
>
> I can think of some more-or-less obvious high-level forms, one would
> for example simply stick a new DISPATCH tree into gimple_call_fn
> (similar to how we can have OBJ_TYPE_REF there), the DISPATCH
> tree would be of variable length, first operand the selector function
> and further operands function addresses.  That would keep the
> actual call visible (instead of a fake __builtin_dispatch call), something
> I'd really like to see.

This sounds like a good long term solution.

>

>
> Restricting ourselves to use the existing target attribute at the
> beginning (with a single, compiler-generated selector function)
> is probably good enough to get a prototype up and running.
> Extending it to arbitrary selector-function, value pairs using a
> new attribute is then probably easy (I don't see the exact use-case
> for that yet, but I suppose it exists if you say so).

For the use cases, the CPU model will be looked at instead of just the
core architecture -- this will give us more information about the
number of cores, size of caches, etc. Intel's runtime library does
this checking at start-up time so that the multi-versioned code can
look at those and make the appropriate decisions.

It will be even more complicated for ARM processors -- which can have
the same processor cores but be configured differently w.r.t. VFP, NEON,
etc.
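
A minimal sketch of that kind of start-up check for x86, using GCC's
<cpuid.h>; the constructor and the cpu_has_popcnt/select_fast_version names
are illustrative, not part of any existing runtime:

#include <cpuid.h>

static int cpu_has_popcnt;

static void __attribute__ ((constructor))
init_cpu_features (void)
{
  unsigned int eax, ebx, ecx, edx;
  /* Leaf 1, ECX bit 23 advertises the POPCNT instruction.  */
  if (__get_cpuid (1, &eax, &ebx, &ecx, &edx))
    cpu_has_popcnt = (ecx & bit_POPCNT) != 0;
}

/* A selector usable by __builtin_dispatch or an overload-based scheme.  */
int select_fast_version (void)
{
  return cpu_has_popcnt;
}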

>
> For the overloading to work we probably have to force that the
> functions are local (so we can mangle them arbitrarily) and that
> if the function should be visible externally people add an
> externally visible dispatcher (foo in the above example would be one).
>

In most cases, probably only the primary/default version needs
to be publicly visible.

Thanks,

David

>
>>> Now, a language extension to support multi-versioning should be
>>> completely independent on any IL representation - with using
>>> a builtin you are tying them together with the only convenient
>>> mechanism we have - a mechanism that isn't optimal for either
>>> side IMNSHO.
>>>
>>
>> Yes -- they don't have to be tied -- they just happen to suite the
>> needs of both ends -- but I see the value of the latest proposal
>> (overloading) above.
>
> I did realize that using builtins was convenient (been there and done
> the same for some experiments ...).
>
> Richard.
>
Richard Biener May 5, 2011, 9:16 a.m. UTC | #22
On Thu, May 5, 2011 at 12:19 AM, Xinliang David Li <davidxl@google.com> wrote:
>>
>> I can think of some more-or-less obvious high-level forms, one would
>> for example simply stick a new DISPATCH tree into gimple_call_fn
>> (similar to how we can have OBJ_TYPE_REF there), the DISPATCH
>> tree would be of variable length, first operand the selector function
>> and further operands function addresses.  That would keep the
>> actual call visible (instead of a fake __builtin_dispatch call), something
>> I'd really like to see.
>
> This sounds like a good long term solution.

Thinking about it again, maybe, similar to OBJ_TYPE_REF, we could have the
selection itself lowered and only keep the set of functions as
additional info.  Thus, instead of having the selector function as the
first operand, have a pointer to the selected function there (that also
avoids too much knowledge about the return value of the selector).
Thus,

  sel = selector ();
  switch (sel)
   {
   case A: fn = &bar;
   case B: fn = &foo;
   }
  val = (*DISPATCH (fn, bar, foo)) (...);

that way regular optimizations can apply to the selection, eventually
discard the dispatch if fn becomes a known direct function (similar
to devirtualization).  At expansion time the call address is simply
taken from the first operand and an indirect call is assembled.

Does the above still provide enough knowledge for the IPA path isolation?

>> Restricting ourselves to use the existing target attribute at the
>> beginning (with a single, compiler-generated selector function)
>> is probably good enough to get a prototype up and running.
>> Extending it to arbitrary selector-function, value pairs using a
>> new attribute is then probably easy (I don't see the exact use-case
>> for that yet, but I suppose it exists if you say so).
>
> For the use cases, CPU model will be looked at instead of just the
> core architecture -- this will give use more information about the
> numbrer of cores, size of caches etc. Intel's runtime library does
> this checkiing at start up time so that the multi-versioned code can
> look at those and make the appropriate decisions.
>
> It will be even more complicated for arm processors -- which can have
> the same processor cores but configured differently w.r.t VFP, NEON
> etc.

Ah, indeed.  I hadn't thought about the tuning for different variants
as opposed to enabling HW features.  So the interface for overloading
would be sth like

enum X { Foo = 0, Bar = 5 };

enum X select () { return Bar; }

void foo (void) __attribute__((dispatch(select, Bar)));

which either means having pairs of function / select return value in
the DISPATCH operands or having it partly lowered as I outlined
above.

>> For the overloading to work we probably have to force that the
>> functions are local (so we can mangle them arbitrarily) and that
>> if the function should be visible externally people add an
>> externally visible dispatcher (foo in the above example would be one).
>>
>
> For most of the cases, probably only the primary/default version needs
> to be publicly visible ..

Yeah.  And that one we eventually can auto-transform to use IFUNC
relocations.
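
For reference, a hedged sketch of what that IFUNC form could look like with
GCC's existing ifunc attribute (ELF targets); do_work_sse4, do_work_generic
and have_sse4 are hypothetical names:

extern int do_work_sse4 (unsigned int);     /* hypothetical sse4 version */
extern int do_work_generic (unsigned int);  /* hypothetical fallback     */
extern int have_sse4 (void);                /* hypothetical selector     */

typedef int (*do_work_fn) (unsigned int);

/* The resolver runs once, at relocation time, and picks the version.  */
static do_work_fn
resolve_do_work (void)
{
  return have_sse4 () ? do_work_sse4 : do_work_generic;
}

/* The publicly visible symbol binds to whichever version was selected.  */
int do_work (unsigned int) __attribute__ ((ifunc ("resolve_do_work")));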

Richard.
Xinliang David Li May 5, 2011, 5:02 p.m. UTC | #23
On Thu, May 5, 2011 at 2:16 AM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Thu, May 5, 2011 at 12:19 AM, Xinliang David Li <davidxl@google.com> wrote:
>>>
>>> I can think of some more-or-less obvious high-level forms, one would
>>> for example simply stick a new DISPATCH tree into gimple_call_fn
>>> (similar to how we can have OBJ_TYPE_REF there), the DISPATCH
>>> tree would be of variable length, first operand the selector function
>>> and further operands function addresses.  That would keep the
>>> actual call visible (instead of a fake __builtin_dispatch call), something
>>> I'd really like to see.
>>
>> This sounds like a good long term solution.
>
> Thinking about it again maybe, similar to OBJ_TYPE_REF, have the
> selection itself lowered and only keep the set of functions as
> additional info.  Thus instead of having the selector function as
> first operand have a pointer to the selected function there (that also
> avoids too much knowledge about the return value of the selector).
> Thus,
>
>  sel = selector ();
>  switch (sel)
>   {
>   case A: fn = &bar;
>   case B: fn = &foo;
>   }
>  val = (*DISPATCH (fn, bar, foo)) (...);
>
> that way regular optimizations can apply to the selection, eventually
> discard the dispatch if fn becomes a known direct function (similar
> to devirtualization).  At expansion time the call address is simply
> taken from the first operand and an indirect call is assembled.
>
> Does the above still provide enough knowledge for the IPA path isolation?
>

I like your original proposal (extending the call) better because the
related information is tied together and is easier to hoist and clean up.

I want to propose a more general solution.

1) Generic annotation support for the gcc IR -- it is used to attach
application/optimization-specific annotations to gimple statements, and
the annotations can be passed around across passes. In gcc, I only see
the HISTOGRAM annotation for value profiling, which is not general enough.
2) Support for a CallInfo structure at each callsite. This is an
annotation, but more standardized. The CallInfo can be used to record
information such as call attributes, call side effects, mod-ref
information, etc. --- the current gimple_call_flags can be folded into
this info structure.

Similarly (not related to this discussion), a LoopInfo structure can be
introduced to annotate loop back-edge jumps to allow the FE to pass useful
information at the loop level. For floating point operations, things
like precision constraints, sensitivity to the floating environment, etc.
can be recorded in an FPInfo.
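
A purely illustrative sketch of the CallInfo/LoopInfo idea -- none of these
structures exist in GCC, and the fields are made up; the point is a
standardized side structure per call (or loop) rather than ad-hoc flags:

struct call_info
{
  int flags;                  /* would subsume gimple_call_flags ()      */
  void *mod_ref_summary;      /* side effects / mod-ref information      */
  void *dispatch_targets;     /* e.g. the MV candidates of this callsite */
};

struct loop_info
{
  int unroll_factor;          /* per-loop attribute passed down by the FE */
  int is_perfect_nest;        /* preserved loop-nest shape information    */
};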

T


>>> Restricting ourselves to use the existing target attribute at the
>>> beginning (with a single, compiler-generated selector function)
>>> is probably good enough to get a prototype up and running.
>>> Extending it to arbitrary selector-function, value pairs using a
>>> new attribute is then probably easy (I don't see the exact use-case
>>> for that yet, but I suppose it exists if you say so).
>>
>> For the use cases, CPU model will be looked at instead of just the
>> core architecture -- this will give use more information about the
>> numbrer of cores, size of caches etc. Intel's runtime library does
>> this checkiing at start up time so that the multi-versioned code can
>> look at those and make the appropriate decisions.
>>
>> It will be even more complicated for arm processors -- which can have
>> the same processor cores but configured differently w.r.t VFP, NEON
>> etc.
>
> Ah, indeed.  I hadn't thought about the tuning for different variants
> as opposed to enabling HW features.  So the interface for overloading
> would be sth like
>
> enum X { Foo = 0, Bar = 5 };
>
> enum X select () { return Bar; }
>
> void foo (void) __attribute__((dispatch(select, Bar)));
>

Yes, for overloading -- something like this looks good.

Thanks,

David
Richard Biener May 6, 2011, 8:55 a.m. UTC | #24
On Thu, May 5, 2011 at 7:02 PM, Xinliang David Li <davidxl@google.com> wrote:
> On Thu, May 5, 2011 at 2:16 AM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
>> On Thu, May 5, 2011 at 12:19 AM, Xinliang David Li <davidxl@google.com> wrote:
>>>>
>>>> I can think of some more-or-less obvious high-level forms, one would
>>>> for example simply stick a new DISPATCH tree into gimple_call_fn
>>>> (similar to how we can have OBJ_TYPE_REF there), the DISPATCH
>>>> tree would be of variable length, first operand the selector function
>>>> and further operands function addresses.  That would keep the
>>>> actual call visible (instead of a fake __builtin_dispatch call), something
>>>> I'd really like to see.
>>>
>>> This sounds like a good long term solution.
>>
>> Thinking about it again maybe, similar to OBJ_TYPE_REF, have the
>> selection itself lowered and only keep the set of functions as
>> additional info.  Thus instead of having the selector function as
>> first operand have a pointer to the selected function there (that also
>> avoids too much knowledge about the return value of the selector).
>> Thus,
>>
>>  sel = selector ();
>>  switch (sel)
>>   {
>>   case A: fn = &bar;
>>   case B: fn = &foo;
>>   }
>>  val = (*DISPATCH (fn, bar, foo)) (...);
>>
>> that way regular optimizations can apply to the selection, eventually
>> discard the dispatch if fn becomes a known direct function (similar
>> to devirtualization).  At expansion time the call address is simply
>> taken from the first operand and an indirect call is assembled.
>>
>> Does the above still provide enough knowledge for the IPA path isolation?
>>
>
> I like your original proposal (extending call) better because related
> information are tied together and is easier to hoist and clean up.
>
> I want propose a more general solution.
>
> 1) Generic Annotation Support for gcc IR -- it is used attach to
> application/optimization specific annotation to gimple statements and
> annotations can be passed around across passes. In gcc, I only see
> HISTOGRAM annotation for value profiling, which is not general enough
> 2) Support of CallInfo for each callsite. This is an annotation, but
> more standardized. The callinfo can be used to record information such
> as call attributes, call side effects, mod-ref information etc ---
> current gimple_call_flags can be folded into this Info structure.

I don't like generic annotation facilities.  What should passes do with
annotated stmts that are a) transformed, b) removed?  See RTL notes
and all the interesting issues they cause.

> Similarly (not related to this discussion), LoopInfo structure can be
> introduced to annotate loop back edge jumps to allow FE to pass useful
> information at loop level. For floating pointer operations, things
> like the precision constraint, sensitivity to floating environment etc
> can be recorded in FPInfo.

Yes, the idea is to keep the loop structures live throughout the whole
compilation.  Just somebody needs to do the last 1% of work.

Richard.

> T
>
>
>>>> Restricting ourselves to use the existing target attribute at the
>>>> beginning (with a single, compiler-generated selector function)
>>>> is probably good enough to get a prototype up and running.
>>>> Extending it to arbitrary selector-function, value pairs using a
>>>> new attribute is then probably easy (I don't see the exact use-case
>>>> for that yet, but I suppose it exists if you say so).
>>>
>>> For the use cases, CPU model will be looked at instead of just the
>>> core architecture -- this will give use more information about the
>>> numbrer of cores, size of caches etc. Intel's runtime library does
>>> this checkiing at start up time so that the multi-versioned code can
>>> look at those and make the appropriate decisions.
>>>
>>> It will be even more complicated for arm processors -- which can have
>>> the same processor cores but configured differently w.r.t VFP, NEON
>>> etc.
>>
>> Ah, indeed.  I hadn't thought about the tuning for different variants
>> as opposed to enabling HW features.  So the interface for overloading
>> would be sth like
>>
>> enum X { Foo = 0, Bar = 5 };
>>
>> enum X select () { return Bar; }
>>
>> void foo (void) __attribute__((dispatch(select, Bar)));
>>
>
> Yes, for overloading -- something like this looks good.
>
> Thanks,
>
> David
>
Diego Novillo May 6, 2011, 1:24 p.m. UTC | #25
On Fri, May 6, 2011 at 04:55, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Thu, May 5, 2011 at 7:02 PM, Xinliang David Li <davidxl@google.com> wrote:
>>
>> 2) Support of CallInfo for each callsite. This is an annotation, but
>> more standardized. The callinfo can be used to record information such
>> as call attributes, call side effects, mod-ref information etc ---
>> current gimple_call_flags can be folded into this Info structure.
>
> I don't like generic annotation facilities.  What should passes to with
> annotated stmts that are a) transformed, b) removed?  See RTL notes
> and all the interesting issues they cause.

Likewise.  We kind of tried having them in the early days of gimple
and tree-ssa, but quickly removed them.  Anything that is not a
first-class IL member makes life difficult.  We have some examples in
PHI nodes and EH regions.  They're a bit to the side, and require
extra code to manage.


Diego.
Xinliang David Li May 6, 2011, 5:57 p.m. UTC | #26
>> I want propose a more general solution.
>>
>> 1) Generic Annotation Support for gcc IR -- it is used attach to
>> application/optimization specific annotation to gimple statements and
>> annotations can be passed around across passes. In gcc, I only see
>> HISTOGRAM annotation for value profiling, which is not general enough
>> 2) Support of CallInfo for each callsite. This is an annotation, but
>> more standardized. The callinfo can be used to record information such
>> as call attributes, call side effects, mod-ref information etc ---
>> current gimple_call_flags can be folded into this Info structure.
>
> I don't like generic annotation facilities.  What should passes to with
> annotated stmts that are a) transformed, b) removed?  See RTL notes
> and all the interesting issues they cause.


Then how do you store information that needs to be passed across
optimization passes -- you cannot possibly dump all of it into the
core IR. In fact, anything that is derived via analysis but is not
part of the core IR needs to worry about update and maintenance. In
current GIMPLE, we can find many such instances -- DU chains, memory
SSA, control flow information, as well as flags like visited,
no_warning, PLF (?), etc. Having a unified way of representing them is a
good thing, so as to 1) make the IR lean and mean and 2) avoid too many
different side data structures.  The important thing is to have a good
verifier to catch insanity and inconsistency in the annotations after
each pass.

Thanks,

David


>
>> Similarly (not related to this discussion), LoopInfo structure can be
>> introduced to annotate loop back edge jumps to allow FE to pass useful
>> information at loop level. For floating pointer operations, things
>> like the precision constraint, sensitivity to floating environment etc
>> can be recorded in FPInfo.
>
> Yes, the idea is to keep the loop structures live throughout the whole
> compilation.  Just somebody needs to do the last 1% of work.
>
> Richard.
>
>> T
>>
>>
>>>>> Restricting ourselves to use the existing target attribute at the
>>>>> beginning (with a single, compiler-generated selector function)
>>>>> is probably good enough to get a prototype up and running.
>>>>> Extending it to arbitrary selector-function, value pairs using a
>>>>> new attribute is then probably easy (I don't see the exact use-case
>>>>> for that yet, but I suppose it exists if you say so).
>>>>
>>>> For the use cases, CPU model will be looked at instead of just the
>>>> core architecture -- this will give use more information about the
>>>> numbrer of cores, size of caches etc. Intel's runtime library does
>>>> this checkiing at start up time so that the multi-versioned code can
>>>> look at those and make the appropriate decisions.
>>>>
>>>> It will be even more complicated for arm processors -- which can have
>>>> the same processor cores but configured differently w.r.t VFP, NEON
>>>> etc.
>>>
>>> Ah, indeed.  I hadn't thought about the tuning for different variants
>>> as opposed to enabling HW features.  So the interface for overloading
>>> would be sth like
>>>
>>> enum X { Foo = 0, Bar = 5 };
>>>
>>> enum X select () { return Bar; }
>>>
>>> void foo (void) __attribute__((dispatch(select, Bar)));
>>>
>>
>> Yes, for overloading -- something like this looks good.
>>
>> Thanks,
>>
>> David
>>
>
Richard Biener May 7, 2011, 12:46 p.m. UTC | #27
On Fri, May 6, 2011 at 7:57 PM, Xinliang David Li <davidxl@google.com> wrote:
>>> I want propose a more general solution.
>>>
>>> 1) Generic Annotation Support for gcc IR -- it is used attach to
>>> application/optimization specific annotation to gimple statements and
>>> annotations can be passed around across passes. In gcc, I only see
>>> HISTOGRAM annotation for value profiling, which is not general enough
>>> 2) Support of CallInfo for each callsite. This is an annotation, but
>>> more standardized. The callinfo can be used to record information such
>>> as call attributes, call side effects, mod-ref information etc ---
>>> current gimple_call_flags can be folded into this Info structure.
>>
>> I don't like generic annotation facilities.  What should passes to with
>> annotated stmts that are a) transformed, b) removed?  See RTL notes
>> and all the interesting issues they cause.
>
>
> Then how do you store information that needs to be passed across
> optimization passes -- you can not possibly dump all of them into the
> core IR. In fact, anything that is derived from (via analysis) but not
> part of the core IR need to worry about update and maintenance. In
> current GIMPLE, we can find many such instances -- DU chains, Memory
> SSA, control flow information, as well as flags like visited,
> no_warning, PLF (?), etc. Have a unified way of representing them is a
> good thing so that 1) make the IR lean and mean; 2) avoid too many
> different side data structures.  The important thing is to have a good
> verifier to catch insanity and inconsistency of the annotation after
> each pass.

Well, as soon as you have a verifier it's no longer generic but _is_
part of the "core IL".  Similar to how EH info is part of the "core IL",
or alias info, or loop information.

Richard.
Xinliang David Li May 7, 2011, 5:07 p.m. UTC | #28
On Sat, May 7, 2011 at 5:46 AM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Fri, May 6, 2011 at 7:57 PM, Xinliang David Li <davidxl@google.com> wrote:
>>>> I want propose a more general solution.
>>>>
>>>> 1) Generic Annotation Support for gcc IR -- it is used attach to
>>>> application/optimization specific annotation to gimple statements and
>>>> annotations can be passed around across passes. In gcc, I only see
>>>> HISTOGRAM annotation for value profiling, which is not general enough
>>>> 2) Support of CallInfo for each callsite. This is an annotation, but
>>>> more standardized. The callinfo can be used to record information such
>>>> as call attributes, call side effects, mod-ref information etc ---
>>>> current gimple_call_flags can be folded into this Info structure.
>>>
>>> I don't like generic annotation facilities.  What should passes to with
>>> annotated stmts that are a) transformed, b) removed?  See RTL notes
>>> and all the interesting issues they cause.
>>
>>
>> Then how do you store information that needs to be passed across
>> optimization passes -- you can not possibly dump all of them into the
>> core IR. In fact, anything that is derived from (via analysis) but not
>> part of the core IR need to worry about update and maintenance. In
>> current GIMPLE, we can find many such instances -- DU chains, Memory
>> SSA, control flow information, as well as flags like visited,
>> no_warning, PLF (?), etc. Have a unified way of representing them is a
>> good thing so that 1) make the IR lean and mean; 2) avoid too many
>> different side data structures.  The important thing is to have a good
>> verifier to catch insanity and inconsistency of the annotation after
>> each pass.
>
> Well, as soon as you have a verifier it's no longer generic but _is_
> part of the "core IL".  Similar to how EH info is part of the "core IL",
> or alias info, or loop information.

The difference is that non-core IL can be thrown away at any time and
recomputed if needed.

David


>
> Richard.
>
Richard Biener May 7, 2011, 6:02 p.m. UTC | #29
On Sat, May 7, 2011 at 7:07 PM, Xinliang David Li <davidxl@google.com> wrote:
> On Sat, May 7, 2011 at 5:46 AM, Richard Guenther
> <richard.guenther@gmail.com> wrote:
>> On Fri, May 6, 2011 at 7:57 PM, Xinliang David Li <davidxl@google.com> wrote:
>>>>> I want propose a more general solution.
>>>>>
>>>>> 1) Generic Annotation Support for gcc IR -- it is used attach to
>>>>> application/optimization specific annotation to gimple statements and
>>>>> annotations can be passed around across passes. In gcc, I only see
>>>>> HISTOGRAM annotation for value profiling, which is not general enough
>>>>> 2) Support of CallInfo for each callsite. This is an annotation, but
>>>>> more standardized. The callinfo can be used to record information such
>>>>> as call attributes, call side effects, mod-ref information etc ---
>>>>> current gimple_call_flags can be folded into this Info structure.
>>>>
>>>> I don't like generic annotation facilities.  What should passes to with
>>>> annotated stmts that are a) transformed, b) removed?  See RTL notes
>>>> and all the interesting issues they cause.
>>>
>>>
>>> Then how do you store information that needs to be passed across
>>> optimization passes -- you can not possibly dump all of them into the
>>> core IR. In fact, anything that is derived from (via analysis) but not
>>> part of the core IR need to worry about update and maintenance. In
>>> current GIMPLE, we can find many such instances -- DU chains, Memory
>>> SSA, control flow information, as well as flags like visited,
>>> no_warning, PLF (?), etc. Have a unified way of representing them is a
>>> good thing so that 1) make the IR lean and mean; 2) avoid too many
>>> different side data structures.  The important thing is to have a good
>>> verifier to catch insanity and inconsistency of the annotation after
>>> each pass.
>>
>> Well, as soon as you have a verifier it's no longer generic but _is_
>> part of the "core IL".  Similar to how EH info is part of the "core IL",
>> or alias info, or loop information.
>
> The difference is that non-core IL can be thrown away at any time and
> recomputed if needed.

The same is true for all of the "core IL" pieces I listed above apart
from maybe EH info.

Richard.

> David
>
>
>>
>> Richard.
>>
>
diff mbox

Patch

==================

Frequently executed functions in programs are sometimes compiled
into different versions to take advantage of some specific
advanced features present in some of the types of platforms on
which the program is executed. For instance, SSE4 versus no-SSE4.
Existing techniques to do multi-versioning, for example using
indirect calls, block compiler optimizations, mainly inlining.

This patch adds a new GCC pass to support multi-versioned calls
through a new builtin, __builtin_dispatch.  The __builtin_dispatch
call is converted into a simple if-then construct to call the
appropriate version at run-time. There are no indirect calls so
inlining is not affected.
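
For illustration, a call of the form below (check, v1, v2 and x are
hypothetical names, not part of the patch)

  ret = __builtin_dispatch (check, (void *) v1, (void *) v2, x);

is conceptually turned into an if-then-else along these lines; this is a
sketch of the resulting control flow, not the exact GIMPLE the pass emits:

  if (check ())   /* or the cached version_selector result, see below */
    ret = v1 (x);
  else
    ret = v2 (x);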

This patch also supports a function unswitching optimization via
cloning of functions, only along hot paths, that call multi-versioned
functions so that the hot multi-versioned paths can be independently
optimized for performance. The cloning optimization is switched on
via a flag, -fclone-hot-version-paths. 

Only two versions are supported in this patch. Example use :

  int popcnt_sse4(unsigned int x) __attribute__((__target__("popcnt")));
  int popcnt_sse4(unsigned int x)
  {
    int count = __builtin_popcount(x);
    return count;
  }

  int popcnt(unsigned int x) __attribute__((__target__("no-popcnt")));
  int popcnt(unsigned int x)
  {
    int count = __builtin_popcount(x);
    return count;
  }

  int testsse() __attribute__((version_selector));
  int main ()
  {
    ...
    /* The multi-versioned call. */
    ret = __builtin_dispatch (testsse,  (void*)popcnt_sse4, (void*)popcnt, 25);
    ...
  }

  There are two passes that are run to achieve multi-versioning.
  "pass_ipa_multiversion_dispatch" is an ipa pass that decides which functions
  have to be cloned and hoists the feature-test calls appropriately.  This
  pass can be enabled with the flag "-fclone-hot-version-paths" and disabled
  with "-fno-clone-hot-version-paths".

  "pass_tree_convert_builtin_dispatch" does the lowering.  It is a
  function-level pass.  Functions marked with attribute "version_selector" are
  also handled by this pass.  This pass is always on.

  How to use __builtin_dispatch ?
  -----------------------------

  __builtin_dispatch takes 3 mandatory arguments :

  __builtin_dispatch (arg1, arg2, arg3, <arg4>, <arg5>, ...);

  arg1 is the pointer to the feature-test function; the outcome of this
  function decides which version is executed.
  arg2 is the (void *) cast pointer to the versioned function that is
  executed when the feature test returns 1.
  arg3 is the (void *) cast pointer to the versioned function that is
  executed when the feature test returns 0.
  arg4, arg5, ... are optional. They are the arguments to the versioned
  functions.  Both versions must accept the same number of arguments.
  The __builtin_dispatch function returns the value returned by the
  versioned function that gets executed.  The versioned function arg2
  is executed when the feature_test function arg1 returns 1 and arg3
  is executed when the feature_test function arg1 returns 0. arg1
  could be marked as a "version_selector" function if it is a pure
  function with no side-effects, returns a constant at run-time and
  can be evaluated at any point in the execution.
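
  As a small illustrative sketch (the function names below are made up and
  not part of the patch), the optional arguments are forwarded unchanged to
  whichever version is selected, and both versions accept the same number
  of arguments:

    int sum_fast (int a, int b);  /* executed when check () returns 1 */
    int sum_slow (int a, int b);  /* executed when check () returns 0 */
    int check (void) __attribute__ ((version_selector));

    int sum (int a, int b)
    {
      /* Returns whatever the selected version returns.  */
      return __builtin_dispatch (check, (void *) sum_fast,
                                 (void *) sum_slow, a, b);
    }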

  When to use the "version_selector" attribute ?
  -----------------------------------------------

  Functions are marked with attribute "version_selector" only if
  they are run-time constants.  Examples of such functions would
  be those that test if a specific feature is available on a
  particular architecture.  Such functions must return a positive
  integer. For two-way functions, those that test if a feature
  is present or not must return 1 or 0 respectively.

  This patch will emit constructors that call these functions once
  and assign the outcome to a global variable.  From then on, only
  the value of the global is used to decide which version to
  execute.
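
  For a feature-test function named isSSE4, the generated code looks
  roughly like this (a sketch only; the names are formed by appending the
  suffixes "version_selector_global" and "version_selector_constructor"):

    int isSSE4_version_selector_global;

    /* Emitted as a static constructor, so it runs once at startup.  */
    void isSSE4_version_selector_constructor (void)
    {
      isSSE4_version_selector_global = isSSE4 ();
    }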

  What happens with cloning, -fclone-hot-version-paths ?
  -----------------------------------------------------

  For the call graph that looks like this before cloning :

func_cold ----> func_hot ----> findOnes ----> IsPopCntSupported ----> popcnt
                                                      |
                                                      ----> no_popcnt

  where popcnt and no_popcnt are the multi-versioned functions, then after cloning :

func_cold ----> IsPopCntSupported ----> func_hot_clone0 ----> findOnes_clone0 ----> popcnt
                        |
                        ----> func_hot_clone1 ----> findOnes_clone1 ----> no_popcnt

     

 
  Flags : -fclone-hot-version-paths does function unswitching via cloning.
          --param=num-mversn-clones=<num> specifies the number of
          functions that should be cloned.
          --param=mversn-clone-depth=<num> specifies the length of
          the call graph path that should be cloned.  num = 0 implies that
          only the leaf function containing the __builtin_dispatch statement
          is cloned.
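
  As a reference, a typical invocation exercising these options could look
  like the following (illustrative only; test.c is a placeholder):

    gcc -O2 -fclone-hot-version-paths --param=num-mversn-clones=10 \
        --param=mversn-clone-depth=2 test.c
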
============================================================================================

Google ref 45947, 45986

2011-04-28  Sriraman Tallam  <tmsriram@google.com>

	* c-family/c-common.c (handle_version_selector_attribute): New function.
	(c_common_attribute_table): New attribute "version_selector".
	* tree-pass.h (pass_tree_convert_builtin_dispatch): New pass.
	(pass_ipa_multiversion_dispatch): New pass.
	* testsuite/gcc.dg/mversn7.c: New test case.
	* testsuite/gcc.dg/mversn4.c: New test case.
	* testsuite/gcc.dg/mversn4.h: New test case.
	* testsuite/gcc.dg/mversn4a.c: New test case.
	* testsuite/gcc.dg/torture/mversn1.c: New test case.
	* testsuite/gcc.dg/mversn2.c: New test case.
	* testsuite/gcc.dg/mversn6.c: New test case.
	* testsuite/gcc.dg/mversn3.c: New test case.
	* testsuite/g++.dg/mversn8.C: New test case.
	* testsuite/g++.dg/mversn10a.C: New test case.
	* testsuite/g++.dg/mversn14a.C: New test case.
	* testsuite/g++.dg/tree-prof/mversn13.C: New test case.
	* testsuite/g++.dg/tree-prof/mversn15.C: New test case.
	* testsuite/g++.dg/tree-prof/mversn15a.C: New test case.
	* testsuite/g++.dg/mversn9.C: New test case.
	* testsuite/g++.dg/mversn10.C: New test case.
	* testsuite/g++.dg/mversn12.C: New test case.
	* testsuite/g++.dg/mversn14.C: New test case.
	* testsuite/g++.dg/mversn16.C: New test case.
	* testsuite/g++.dg/torture/mversn11.C: New test case.
	* testsuite/g++.dg/torture/mversn5.C: New test case.
	* testsuite/g++.dg/torture/mversn5.h: New test case.
	* testsuite/g++.dg/torture/mversn5a.C: New test case.
	* builtin-types.def (BT_PTR_FN_INT): New pointer type.
	(BT_FN_INT_PTR_FN_INT_PTR_PTR_VAR): New function type for
	__builtin_dispatch.
	* builtins.def (BUILT_IN_DISPATCH): New builtin to support
	multi-version calls.
	* mversn-dispatch.c: New file.
	* timevar.def (TV_MVERSN_DISPATCH): New time var.
	* common.opt (fclone-hot-version-paths): New flag.
	* Makefile.in (mversn-dispatch.o): New rule.
	* passes.c (init_optimization_passes): Add the new multi-version
	and dispatch passes to the pass list.
	* params.def (PARAM_NUMBER_OF_MVERSN_CLONES): Define.
	(PARAM_MVERSN_CLONE_CGRAPH_DEPTH): Define.
	* doc/invoke.texi (mversn-clone-depth): Document.
	(num-mversn-clones): Document.
	(fclone-hot-version-paths): Document.

Index: doc/invoke.texi
===================================================================
--- doc/invoke.texi	(revision 173122)
+++ doc/invoke.texi	(working copy)
@@ -342,7 +342,8 @@ 
 -falign-labels[=@var{n}] -falign-loops[=@var{n}] -fassociative-math @gol
 -fauto-inc-dec -fbranch-probabilities -fbranch-target-load-optimize @gol
 -fbranch-target-load-optimize2 -fbtr-bb-exclusive -fcaller-saves @gol
--fcheck-data-deps -fcombine-stack-adjustments -fconserve-stack @gol
+-fcheck-data-deps -fclone-hot-version-paths @gol
+-fcombine-stack-adjustments -fconserve-stack @gol
 -fcompare-elim -fcprop-registers -fcrossjumping @gol
 -fcse-follow-jumps -fcse-skip-blocks -fcx-fortran-rules @gol
 -fcx-limited-range @gol
@@ -8229,6 +8230,10 @@ 
 branch is most likely to take, the @samp{REG_BR_PROB} values are used to
 exactly determine which path is taken more often.
 
+@item -fclone-hot-version-paths
+When multi-version calls are made using @samp{__builtin_dispatch}, this flag
+enables cloning and hoisting of hot multi-versioned paths.
+
 @item -fprofile-values
 @opindex fprofile-values
 If combined with @option{-fprofile-arcs}, it adds code so that some
@@ -8490,6 +8495,16 @@ 
 be applied.
 The default value is 40.
 
+@item mversn-clone-depth
+When using @option{-fclone-hot-version-paths}, hot function paths are
+multi-versioned via cloning.  This parameter specifies the maximum length
+of the call graph path that can be cloned.  The default value is 2.
+
+@item num-mversn-clones
+When using @option{-fclone-hot-version-paths}, hot function paths are
+multi-versioned via cloning.  This parameter specifies the maximum number
+of functions that can be cloned.  The default value is 10.
+
 @item large-function-insns
 The limit specifying really large functions.  For functions larger than this
 limit after inlining, inlining is constrained by
Index: c-family/c-common.c
===================================================================
--- c-family/c-common.c	(revision 173122)
+++ c-family/c-common.c	(working copy)
@@ -315,6 +315,7 @@  static tree check_case_value (tree);
 static bool check_case_bounds (tree, tree, tree *, tree *);
 
 static tree handle_packed_attribute (tree *, tree, tree, int, bool *);
+static tree handle_version_selector_attribute (tree *, tree, tree, int, bool *);
 static tree handle_nocommon_attribute (tree *, tree, tree, int, bool *);
 static tree handle_common_attribute (tree *, tree, tree, int, bool *);
 static tree handle_noreturn_attribute (tree *, tree, tree, int, bool *);
@@ -603,6 +604,8 @@  const unsigned int num_c_common_reswords =
 const struct attribute_spec c_common_attribute_table[] =
 {
   /* { name, min_len, max_len, decl_req, type_req, fn_type_req, handler } */
+  { "version_selector",	      0, 0, true, false, false,
+			      handle_version_selector_attribute },
   { "packed",                 0, 0, false, false, false,
 			      handle_packed_attribute },
   { "nocommon",               0, 0, true,  false, false,
@@ -5782,6 +5785,37 @@  handle_packed_attribute (tree *node, tree name, tr
   return NULL_TREE;
 }
 
+/* Handle a "version_selector" attribute.
+   Functions are marked with attribute "version_selector" only if
+   they are run-time constants.  Examples of such functions would
+   be those that test if a particular feature is available on a
+   particular architecture.  Such functions must return a positive
+   integer.  For two-way functions, those that test if a feature
+   is present or not must return 1 or 0 respectively. */
+
+static tree
+handle_version_selector_attribute (tree *node, tree name,
+ 				   tree ARG_UNUSED (args),
+ 				   int ARG_UNUSED (flags),
+				   bool *no_add_attrs)
+{
+  if (TREE_CODE (*node) == FUNCTION_DECL)
+    {
+      /* Check that the return type is integer. */
+      gcc_assert (TREE_CODE (TREE_TYPE (TREE_TYPE (*node)))
+                  == INTEGER_TYPE);
+      if (dump_file)
+        fprintf (dump_file, "%s is a version_selector function\n",
+   	         IDENTIFIER_POINTER (DECL_NAME (*node)));
+    }
+  else
+    {
+      warning (OPT_Wattributes, "%qE attribute ignored", name);
+      *no_add_attrs = true;
+    }
+  return NULL_TREE;
+}
+
 /* Handle a "nocommon" attribute; arguments as in
    struct attribute_spec.handler.  */
 
Index: tree-pass.h
===================================================================
--- tree-pass.h	(revision 173122)
+++ tree-pass.h	(working copy)
@@ -448,6 +448,7 @@  extern struct gimple_opt_pass pass_warn_unused_res
 extern struct gimple_opt_pass pass_split_functions;
 extern struct gimple_opt_pass pass_feedback_split_functions;
 extern struct gimple_opt_pass pass_threadsafe_analyze;
+extern struct gimple_opt_pass pass_tree_convert_builtin_dispatch;
 
 /* IPA Passes */
 extern struct simple_ipa_opt_pass pass_ipa_lower_emutls;
@@ -472,6 +473,7 @@  extern struct ipa_opt_pass_d pass_ipa_lto_wpa_fixu
 extern struct ipa_opt_pass_d pass_ipa_lto_finish_out;
 extern struct ipa_opt_pass_d pass_ipa_profile;
 extern struct ipa_opt_pass_d pass_ipa_cdtor_merge;
+extern struct simple_ipa_opt_pass pass_ipa_multiversion_dispatch;
 
 extern struct gimple_opt_pass pass_all_optimizations;
 extern struct gimple_opt_pass pass_cleanup_cfg_post_optimizing;
Index: testsuite/gcc.dg/mversn7.c
===================================================================
--- testsuite/gcc.dg/mversn7.c	(revision 0)
+++ testsuite/gcc.dg/mversn7.c	(revision 0)
@@ -0,0 +1,47 @@ 
+/* This test checks if cloning and dispatching work correctly with
+   a motivating example.  The function "problem" calculates the sum of
+   numbers from 1 to 10 in two different ways.  Hence, after cloning,
+   both clones should return 55, which means main returns 0 if function
+   "problem" gets inlined.
+   This example also shows the benefit of function unswitching.
+   Without cloning, the dispatch test stays inside the loop. */
+
+/* { dg-do run } */
+/* { dg-options "-O2 -fclone-hot-version-paths -fdump-tree-optimized" } */
+
+int __attribute__ ((version_selector))
+featureTest ()
+{
+  return 1;
+}
+
+int foo (int i)
+{
+  return i;
+}
+
+int bar (int i)
+{
+  return (11 - i);  
+}
+
+/* This calculates the sum of numbers from 1 to 10 in 2 different ways. */
+int __attribute__ ((hot))
+problem ()
+{
+  int ret = 0;
+  int j = 1;
+  for (j = 1; j<=10; j++)
+    ret += __builtin_dispatch (featureTest, (void *)foo, (void *)bar, j);
+  return ret;
+}
+
+int __attribute__ ((hot))
+main ()
+{
+  return problem() - 55;
+}
+
+
+/* { dg-final { scan-tree-dump "return 55" "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
Index: testsuite/gcc.dg/mversn4.c
===================================================================
--- testsuite/gcc.dg/mversn4.c	(revision 0)
+++ testsuite/gcc.dg/mversn4.c	(revision 0)
@@ -0,0 +1,29 @@ 
+/* Check that static feature test functions are correctly handled. This
+   test case also needs mversn4a.c.  extern_func calls foo and returns 0.*/
+   
+/* { dg-do run } */
+/* { dg-additional-sources "mversn4a.c" } */
+/* { dg-options "-O2" } */
+
+#include "mversn4.h"
+
+extern int extern_func ();
+
+int foo ()
+{
+  return 0;
+}
+
+int bar ()
+{
+  return 1;
+}
+
+int __attribute__ ((cold))
+main ()
+{
+  int a = 1, b = 1;
+  a  = extern_func ();
+  b  = __builtin_dispatch (featureTest, (void *)bar, (void *) foo);
+  return a * b;
+}
Index: testsuite/gcc.dg/mversn4.h
===================================================================
--- testsuite/gcc.dg/mversn4.h	(revision 0)
+++ testsuite/gcc.dg/mversn4.h	(revision 0)
@@ -0,0 +1,5 @@ 
+static int __attribute__ ((version_selector))
+featureTest ()
+{
+  return 1;
+}
Index: testsuite/gcc.dg/mversn4a.c
===================================================================
--- testsuite/gcc.dg/mversn4a.c	(revision 0)
+++ testsuite/gcc.dg/mversn4a.c	(revision 0)
@@ -0,0 +1,13 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+#include "mversn4.h"
+
+extern int foo ();
+extern int bar ();
+
+
+int extern_func ()
+{
+  return __builtin_dispatch (featureTest, (void *)foo, (void *) bar);
+}
Index: testsuite/gcc.dg/torture/mversn1.c
===================================================================
--- testsuite/gcc.dg/torture/mversn1.c	(revision 0)
+++ testsuite/gcc.dg/torture/mversn1.c	(revision 0)
@@ -0,0 +1,31 @@ 
+/* Check that __builtin_dispatch gets converted and the executable runs fine.
+   Since featureTest () returns 1, foo should be called and return 0.  Cloning
+   and hoisting are not turned on here. */
+/* { dg-do run } */
+
+int __attribute__ ((version_selector))
+featureTest ()
+{
+  return 1;
+}
+
+int __attribute__ ((cold))
+foo ()
+{
+  return 0;
+}
+
+int __attribute__ ((cold))
+bar ()
+{
+  return 1;
+}
+
+int __attribute__ ((cold))
+main ()
+{
+  int a, b;
+  a = __builtin_dispatch (featureTest, (void *)foo, (void *) bar);
+  b = __builtin_dispatch (featureTest, (void *)bar, (void *) foo);
+  return a * b;
+}
Index: testsuite/gcc.dg/mversn2.c
===================================================================
--- testsuite/gcc.dg/mversn2.c	(revision 0)
+++ testsuite/gcc.dg/mversn2.c	(revision 0)
@@ -0,0 +1,47 @@ 
+/* This checks if cloning works correctly.  Since dispatch and fn1 are hot, they
+   should be cloned.  main should not be cloned.*/
+
+/* { dg-do run } */
+/* { dg-options "-O2 -fclone-hot-version-paths -fdump-tree-optimized" } */
+
+int __attribute__ ((version_selector))
+featureTest ()
+{
+  return 1;
+}
+
+int __attribute__ ((cold))
+foo ()
+{
+  return 0;
+}
+
+int __attribute__ ((cold))
+bar ()
+{
+  return 1;
+}
+
+int __attribute__ ((hot))
+dispatch ()
+{
+  return __builtin_dispatch (featureTest, (void *)foo, (void *) bar);
+}
+int __attribute__ ((hot))
+fn1 ()
+{
+  return dispatch ();
+}
+
+int __attribute__ ((cold))
+main ()
+{
+  return fn1 ();
+}
+
+/* { dg-final { scan-tree-dump "fn1_clone_1" "optimized" } } */
+/* { dg-final { scan-tree-dump "dispatch_clone_0" "optimized" } } */
+/* { dg-final { scan-tree-dump "dispatch_clone_1" "optimized" } } */
+/* { dg-final { scan-tree-dump-not "main_clone_0" "optimized" } } */
+/* { dg-final { scan-tree-dump-not "main_clone_1" "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
Index: testsuite/gcc.dg/mversn6.c
===================================================================
--- testsuite/gcc.dg/mversn6.c	(revision 0)
+++ testsuite/gcc.dg/mversn6.c	(revision 0)
@@ -0,0 +1,29 @@ 
+/* Check that __builtin_dispatch gets converted and the executable runs fine
+   when featureTest () is not marked with "version_selector".  foo should be
+   called and return 0.  Cloning and hoisting are also done. */
+/* { dg-do run } */
+/* { dg-options "-O2 -fclone-hot-version-paths" } */
+
+int
+featureTest ()
+{
+  return 1;
+}
+
+int __attribute__ ((cold))
+foo ()
+{
+  return 0;
+}
+
+int __attribute__ ((cold))
+bar ()
+{
+  return 1;
+}
+
+int __attribute__ ((cold))
+main ()
+{
+  return __builtin_dispatch (featureTest, (void *)foo, (void *) bar);
+}
Index: testsuite/gcc.dg/mversn3.c
===================================================================
--- testsuite/gcc.dg/mversn3.c	(revision 0)
+++ testsuite/gcc.dg/mversn3.c	(revision 0)
@@ -0,0 +1,30 @@ 
+/* Simple check if parameters are passed correctly. */
+
+/* { dg-do run } */
+/* { dg-options "-O2 -fclone-hot-version-paths" } */
+
+int __attribute__ ((version_selector))
+featureTest ()
+{
+  return 1;
+}
+
+int __attribute__ ((cold))
+foo (int a)
+{
+  if (a == 1729)
+    return 0;
+  return 1;
+}
+
+int __attribute__ ((cold))
+bar (int a)
+{
+  return 1;
+}
+
+int __attribute__ ((cold))
+main ()
+{
+  return __builtin_dispatch (featureTest, (void *)foo, (void *) bar, 1729);
+}
Index: testsuite/g++.dg/mversn8.C
===================================================================
--- testsuite/g++.dg/mversn8.C	(revision 0)
+++ testsuite/g++.dg/mversn8.C	(revision 0)
@@ -0,0 +1,45 @@ 
+/* Call the caller of __builtin_dispatch indirectly.  Specify the
+   feature test function as a function pointer.  Make sure cloning
+   still happens. */
+
+/* { dg-do run } */
+/* { dg-options "-O2 -fclone-hot-version-paths -fdump-tree-optimized" } */
+
+int __attribute__ ((version_selector))
+featureTest ()
+{
+  return 1;
+}
+
+int
+foo ()
+{
+  return 0;
+}
+
+int
+bar ()
+{
+  return 1;
+}
+
+int __attribute__ ((hot))
+dispatch ()
+{
+  int (*funcp)() = featureTest;
+  int ret = __builtin_dispatch (funcp, (void *)foo, (void *)bar);
+  return ret;
+}
+
+int __attribute__ ((hot))
+main (int argc, char **argv)
+{
+  int (*ptr)() = dispatch;
+  return (*ptr)();
+}
+
+/* { dg-final { scan-tree-dump "dispatchv_clone_0" "optimized" } } */
+/* { dg-final { scan-tree-dump "dispatchv_clone_1" "optimized" } } */
+/* { dg-final { scan-tree-dump "main_clone_0" "optimized" } } */
+/* { dg-final { scan-tree-dump "main_clone_1" "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
Index: testsuite/g++.dg/mversn10a.C
===================================================================
--- testsuite/g++.dg/mversn10a.C	(revision 0)
+++ testsuite/g++.dg/mversn10a.C	(revision 0)
@@ -0,0 +1,37 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fclone-hot-version-paths -fdump-tree-optimized" } */
+
+static int __attribute__ ((version_selector))
+featureTest ()
+{
+  return 1;
+}
+
+static int foo ()
+{
+  return 0;
+}
+
+static int bar ()
+{
+  return 1;
+}
+
+static int __attribute__ ((noinline))
+dispatch ()
+{
+  __builtin_dispatch (featureTest, (void *)foo, (void *)bar);
+  return 0;
+}
+
+int
+fn2 ()
+{
+  for (int i = 0; i < 1000; i++)
+    dispatch ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "dispatchv_clone_0" "optimized" } } */
+/* { dg-final { scan-tree-dump "dispatchv_clone_1" "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
Index: testsuite/g++.dg/mversn14a.C
===================================================================
--- testsuite/g++.dg/mversn14a.C	(revision 0)
+++ testsuite/g++.dg/mversn14a.C	(revision 0)
@@ -0,0 +1,8 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2" } */
+
+int __attribute__ ((version_selector))
+featureTest ()
+{
+  return 1;
+}
Index: testsuite/g++.dg/tree-prof/mversn13.C
===================================================================
--- testsuite/g++.dg/tree-prof/mversn13.C	(revision 0)
+++ testsuite/g++.dg/tree-prof/mversn13.C	(revision 0)
@@ -0,0 +1,37 @@ 
+/* Make sure -fprofile-generate and -fprofile-use work fine. */
+
+/* { dg-options "-O2 -fclone-hot-version-paths" } */
+
+static int glob = 0;
+int __attribute__ ((version_selector))
+featureTest ()
+{
+  return glob;
+}
+
+int bar (int i)
+{
+  if (i > 500)
+    return 2 * i;
+  return 3 * i;
+}
+
+int foo (int i)
+{
+  return bar (i);
+}
+
+int
+dispatch ()
+{
+  int ret = 0;
+  for (int i = 0; i < 1000; i++)
+    ret += __builtin_dispatch (featureTest, (void *)foo, (void *)bar,  i);
+  return ret;
+}
+
+int main ()
+{
+  int val = dispatch ();
+  return val > 10000000;
+}
Index: testsuite/g++.dg/tree-prof/mversn15.C
===================================================================
--- testsuite/g++.dg/tree-prof/mversn15.C	(revision 0)
+++ testsuite/g++.dg/tree-prof/mversn15.C	(revision 0)
@@ -0,0 +1,23 @@ 
+/* Make sure LIPO works correctly.  dispatch is defined in mversn15a.C.  It
+   either calls foo or bar and both return 1.  So, the value of ret is always
+   1000.  After cross-module inlining, main must return 0. */
+
+/* { dg-additional-sources "mversn15a.C" } */
+/* { dg-options "-O2 -fclone-hot-version-paths -fripa -fdump-tree-optimized" } */
+
+extern int foo ();
+extern int bar ();
+extern int dispatch ();
+
+int
+main ()
+{
+  int ret = 0;
+  for (int i = 1; i <= 1000; i++)
+   ret += dispatch ();
+  return ret - 1000;
+}
+
+/* { dg-final-use { scan-tree-dump "main_clone" "optimized" } } */
+/* { dg-final-use { scan-tree-dump "return 0" "optimized" } } */
+/* { dg-final-use { cleanup-tree-dump "optimized" } } */
Index: testsuite/g++.dg/tree-prof/mversn15a.C
===================================================================
--- testsuite/g++.dg/tree-prof/mversn15a.C	(revision 0)
+++ testsuite/g++.dg/tree-prof/mversn15a.C	(revision 0)
@@ -0,0 +1,26 @@ 
+/* { dg-options "-O2 -fclone-hot-version-paths -fripa" } */
+/* { dg-additional-sources "mversn15.C" } */
+
+#include <stdio.h>
+
+inline int
+ __attribute__ ((version_selector))
+featureTest()
+{
+  return 1;
+}
+
+int foo ()
+{
+  return 1;
+}
+
+int bar ()
+{
+  return 1;
+}
+
+int dispatch ()
+{
+  return __builtin_dispatch (featureTest, (void *)foo, (void *)bar);
+}
Index: testsuite/g++.dg/mversn9.C
===================================================================
--- testsuite/g++.dg/mversn9.C	(revision 0)
+++ testsuite/g++.dg/mversn9.C	(revision 0)
@@ -0,0 +1,30 @@ 
+/* Two __builtin_dispatch calls in different basic blocks. */
+
+/* { dg-do run } */
+/* { dg-options "-O2 -fclone-hot-version-paths" } */
+
+int __attribute__ ((version_selector))
+featureTest ()
+{
+  return 1;
+}
+
+int
+foo ()
+{
+  return 0;
+}
+
+int
+bar ()
+{
+  return 1;
+}
+
+int main (int argc, char **argv)
+{
+  if (argc)
+   return __builtin_dispatch (featureTest, (void *)foo, (void *)bar);
+  else
+   return (__builtin_dispatch (featureTest, (void *)bar, (void *)foo) - 1);
+}
Index: testsuite/g++.dg/mversn10.C
===================================================================
--- testsuite/g++.dg/mversn10.C	(revision 0)
+++ testsuite/g++.dg/mversn10.C	(revision 0)
@@ -0,0 +1,54 @@ 
+/* The purpose of this test is to check if the attributes of the clone
+   correctly shadow the parent function.  In this case, function "dispatch"
+   is cloned but is a static function.  The other file, "mversn10a.C" also
+   has a dispatch function that is cloned.  So, if the attributes of the
+   clone are not correct the linker will complain. */
+
+/* { dg-do run } */
+/* { dg-additional-sources "mversn10a.C" } */
+/* { dg-options "-O2 -fclone-hot-version-paths -fdump-tree-optimized" } */
+
+static int __attribute__ ((version_selector))
+featureTest ()
+{
+  return 1;
+}
+
+static int foo ()
+{
+  return 0;
+}
+
+static int bar ()
+{
+  return 1;
+}
+
+static int __attribute__ ((noinline))
+dispatch ()
+{
+  __builtin_dispatch (featureTest, (void *)foo, (void *)bar);
+  return 0;
+}
+
+int
+fn1 ()
+{
+  for (int i = 0; i < 1000; i++)
+    dispatch ();
+  return 0;
+}
+
+extern int fn2 ();
+
+int __attribute__ ((hot))
+main ()
+{
+  fn1 ();
+  fn2 ();
+  return 0;
+}
+
+/* { dg-final { scan-tree-dump "dispatchv_clone_0" "optimized" } } */
+/* { dg-final { scan-tree-dump "dispatchv_clone_1" "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
Index: testsuite/g++.dg/mversn12.C
===================================================================
--- testsuite/g++.dg/mversn12.C	(revision 0)
+++ testsuite/g++.dg/mversn12.C	(revision 0)
@@ -0,0 +1,43 @@ 
+/* Check if everything is fine if the versioned functions are static
+   member functions. */
+
+/* { dg-do run } */
+/* { dg-options "-O2 -fclone-hot-version-paths" } */
+
+int  __attribute__ ((version_selector))
+featureTest()
+{
+  return 1;
+}
+
+class TestClass
+{
+ public:
+  static int foo ()
+  {
+    return 0;
+  }
+
+  static int bar ()
+  {
+    return 1;
+  }
+
+  int dispatch ()
+  {
+    int a = __builtin_dispatch (featureTest,
+                               (void*)(&TestClass::foo),
+			       (void*)(&TestClass::bar));
+    int b = __builtin_dispatch (featureTest,
+                               (void*)(&TestClass::bar),
+			       (void*)(&TestClass::foo));
+    return a * b;
+  }
+};
+
+int
+main ()
+{
+  TestClass c1;
+  return c1.dispatch ();
+}
Index: testsuite/g++.dg/mversn14.C
===================================================================
--- testsuite/g++.dg/mversn14.C	(revision 0)
+++ testsuite/g++.dg/mversn14.C	(revision 0)
@@ -0,0 +1,26 @@ 
+/* Check if everything is fine when the feature test body is in a different
+   module that does not have a __builtin_dispatch.  Requires mversn14a.C. */
+
+/* { dg-do run } */
+/* { dg-additional-sources "mversn14a.C" } */
+/* { dg-options "-O2 -fclone-hot-version-paths" } */
+
+int __attribute__ ((version_selector))
+featureTest ();
+
+int
+foo ()
+{
+  return 0;
+}
+
+int
+bar ()
+{
+  return 1;
+}
+
+int main ()
+{
+   return __builtin_dispatch (featureTest, (void *)foo, (void *)bar);
+}
Index: testsuite/g++.dg/mversn16.C
===================================================================
--- testsuite/g++.dg/mversn16.C	(revision 0)
+++ testsuite/g++.dg/mversn16.C	(revision 0)
@@ -0,0 +1,39 @@ 
+/* dispatch is cloned. Make sure the double is returned to main correctly. */
+
+/* { dg-do run } */
+/* { dg-options "-O2 -fclone-hot-version-paths -fdump-tree-optimized" } */
+
+#include "assert.h"
+
+int __attribute__ ((version_selector))
+featureTest ()
+{
+  return 1;
+}
+
+static int foo ()
+{
+  return 0;
+}
+
+static int bar ()
+{
+  return 1;
+}
+
+double
+dispatch ()
+{
+  __builtin_dispatch (featureTest, (void *)foo, (void *)bar);
+  return 2.536;
+}
+
+int main ()
+{
+  double d = dispatch ();
+  assert (d == 2.536);
+  return 0; 
+}
+
+/* { dg-final { scan-tree-dump "dispatchv_clone" "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */
Index: testsuite/g++.dg/torture/mversn11.C
===================================================================
--- testsuite/g++.dg/torture/mversn11.C	(revision 0)
+++ testsuite/g++.dg/torture/mversn11.C	(revision 0)
@@ -0,0 +1,40 @@ 
+/* Check if parameters are passed correctly to the versioned function. */
+
+/* { dg-do run } */
+
+#include <stdio.h>
+#include <assert.h>
+
+int __attribute__ ((version_selector))
+featureTest ()
+{
+  return 1;
+}
+
+int __attribute__ ((noinline))
+foo (int n1, double n2, char n3)
+{
+  assert (n1 == 10);
+  assert (n2 == 20.0);
+  assert (n3 == 'c');
+  return n1;
+}
+
+
+int __attribute__ ((noinline))
+bar (int n1, double n2, char n3)
+{
+  assert (n1 == 10);
+  assert (n2 == 20.0);
+  assert (n3 == 'c');
+  return (n1 - 10);
+}
+
+int main ()
+{
+  int a = __builtin_dispatch (featureTest, (void *)foo, (void *)bar,
+                              10, 20.0, 'c');
+  int b = __builtin_dispatch (featureTest, (void *)bar, (void *)foo,
+                              10, 20.0, 'c');
+  return a * b;
+}
Index: testsuite/g++.dg/torture/mversn5.C
===================================================================
--- testsuite/g++.dg/torture/mversn5.C	(revision 0)
+++ testsuite/g++.dg/torture/mversn5.C	(revision 0)
@@ -0,0 +1,28 @@ 
+/* Check that global inline feature test functions are correctly handled.  This
+   test case also needs mversn5a.C.  extern_func calls foo and returns 0. */
+
+/* { dg-do run } */
+/* { dg-additional-sources "mversn5a.C" } */
+
+#include "mversn5.h"
+
+extern int extern_func ();
+
+int foo ()
+{
+  return 0;
+}
+
+int bar ()
+{
+  return 1;
+}
+
+int __attribute__ ((cold))
+main ()
+{
+  int a = 1, b = 1;
+  a  = extern_func ();
+  b  = __builtin_dispatch (featureTest, (void *)bar, (void *) foo);
+  return a * b;
+}
Index: testsuite/g++.dg/torture/mversn5.h
===================================================================
--- testsuite/g++.dg/torture/mversn5.h	(revision 0)
+++ testsuite/g++.dg/torture/mversn5.h	(revision 0)
@@ -0,0 +1,5 @@ 
+inline int __attribute__ ((version_selector))
+featureTest ()
+{
+  return 1;
+}
Index: testsuite/g++.dg/torture/mversn5a.C
===================================================================
--- testsuite/g++.dg/torture/mversn5a.C	(revision 0)
+++ testsuite/g++.dg/torture/mversn5a.C	(revision 0)
@@ -0,0 +1,12 @@ 
+/* { dg-do compile } */
+
+#include "mversn5.h"
+
+extern int foo ();
+extern int bar ();
+
+
+int extern_func ()
+{
+  return __builtin_dispatch (featureTest, (void *)foo, (void *) bar);
+}
Index: builtin-types.def
===================================================================
--- builtin-types.def	(revision 173122)
+++ builtin-types.def	(working copy)
@@ -475,3 +475,7 @@  DEF_FUNCTION_TYPE_VAR_5 (BT_FN_INT_INT_INT_INT_INT
 DEF_POINTER_TYPE (BT_PTR_FN_VOID_VAR, BT_FN_VOID_VAR)
 DEF_FUNCTION_TYPE_3 (BT_FN_PTR_PTR_FN_VOID_VAR_PTR_SIZE,
 		     BT_PTR, BT_PTR_FN_VOID_VAR, BT_PTR, BT_SIZE)
+
+DEF_POINTER_TYPE (BT_PTR_FN_INT, BT_FN_INT)
+DEF_FUNCTION_TYPE_VAR_3 (BT_FN_INT_PTR_FN_INT_PTR_PTR_VAR,
+			 BT_INT, BT_PTR_FN_INT, BT_PTR, BT_PTR)
Index: builtins.def
===================================================================
--- builtins.def	(revision 173122)
+++ builtins.def	(working copy)
@@ -760,6 +760,9 @@  DEF_BUILTIN (BUILT_IN_EMUTLS_REGISTER_COMMON,
 	     true, true, true, ATTR_NOTHROW_LEAF_LIST, false,
 	     !targetm.have_tls)
 
+/* Multiversioning builtin dispatch hook. */
+DEF_GCC_BUILTIN (BUILT_IN_DISPATCH, "dispatch", BT_FN_INT_PTR_FN_INT_PTR_PTR_VAR, ATTR_NULL)
+
 /* Exception support.  */
 DEF_BUILTIN_STUB (BUILT_IN_UNWIND_RESUME, "__builtin_unwind_resume")
 DEF_BUILTIN_STUB (BUILT_IN_CXA_END_CLEANUP, "__builtin_cxa_end_cleanup")
Index: mversn-dispatch.c
===================================================================
--- mversn-dispatch.c	(revision 0)
+++ mversn-dispatch.c	(revision 0)
@@ -0,0 +1,1760 @@ 
+/* Multiversion Dispatch Pass.
+   Copyright (C) 2011 Free Software Foundation, Inc.
+   Contributed by Sriraman Tallam (tmsriram@google.com)
+
+This file is part of GCC.
+
+GCC is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 3, or (at your option) any later
+version.
+
+GCC is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or
+FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+for more details.
+
+You should have received a copy of the GNU General Public License
+along with GCC; see the file COPYING3.  If not see
+<http://www.gnu.org/licenses/>. */
+
+
+/* This pass processes __builtin_dispatch calls to call multi-versioned
+   functions. Only two versions are supported now. Example use :
+
+  int popcnt_sse4(unsigned int x) __attribute__((__target__("popcnt")));
+  int popcnt_sse4(unsigned int x)
+  {
+    int count = __builtin_popcount(x);
+    return count;
+  }
+
+  int popcnt(unsigned int x) __attribute__((__target__("no-popcnt")));
+  int popcnt(unsigned int x)
+  {
+    int count = __builtin_popcount(x);
+    return count;
+  }
+
+  int testsse() __attribute__((version_selector));
+  int main ()
+  {
+    ...
+    ret = __builtin_dispatch (testsse,  (void*)popcnt_sse4, (void*)popcnt, 25);
+    ...
+  }
+
+  There are two passes that are run to achieve multi-versioning.
+  "pass_ipa_multiversion_dispatch" is an ipa pass that decides which functions
+  have to be cloned and hoists the feature-test calls appropriately.  This
+  pass can be enabled with the flag "-fclone-hot-version-paths" and disabled
+  with "-fno-clone-hot-version-paths".
+
+  "pass_tree_convert_builtin_dispatch" does the lowering.  It is a
+  function-level pass.  Functions marked with attribute "version_selector" are
+  also handled by this pass.  This pass is always on.
+
+  How to use __builtin_dispatch ?
+  -----------------------------
+
+  __builtin_dispatch takes 3 mandatory arguments :
+
+  __builtin_dispatch (arg1, arg2, arg3, <arg4>, <arg5>, ...);
+
+  arg1 is the pointer to the feature-test function.
+  arg2 is the ( void *) cast pointer to the versioned function that is
+  executed when the feature test returns 1.
+  arg3 is the ( void *) cast pointer to the versioned function that is
+  executed when the feature test returns 0.
+  arg4, arg5, ... are optional. They are the arguments to the versioned
+  functions.  Both versions must accept the same number of arguments.
+  The __builtin_dispatch function returns the value returned by the
+  versioned function that gets executed.  The versioned function arg2
+  is executed when the feature_test function arg1 returns 1 and arg3
+  is executed when the feature_test function arg1 returns 0. arg1
+  could be marked as a "version_selector" function if it is a pure
+  function with no side-effects, returns a constant at run-time and
+  can be evaluated at any point in the execution.
+
+  When to use the "version_selector" attribute ?
+  -----------------------------------------------
+
+  Functions are marked with attribute "version_selector" only if
+  they are run-time constants.  Example of such functions would
+  be those that test if a specific feature is available on a
+  particular architecture.  Such functions must return a positive
+  integer. For two-way functions, those that test if a feature
+  is present or not must return 1 or 0 respectively.
+
+
+  The code is organized into five parts.  The first part has the functionality
+  to detect and handle functions marked with attribute "version_selector".  The
+  second part is the analysis phase where we find calls to __builtin_dispatch
+  and mark all functions that are hot and have a call-graph path to a
+  __builtin_dispatch call.  The third part decides which functions
+  to clone.  This is based on the number of clones that have to be created for
+  the functions marked in the analysis phase. Only two clones are allowed for
+  a function currently.  The fourth part is where the actual cloning happens.
+  The fifth part contains the implementation to lower the __builtin_dispatch
+  calls.
+
+  Flags : -fclone-hot-version-paths does function unswitching via cloning.
+          --param=num-mversn-clones=<num> specifies the number of
+          functions that should be cloned.
+          --param=mversn-clone-depth=<num> specifies the length of
+          the call graph path that should be cloned.  num = 0 implies that
+          only the leaf function containing the __builtin_dispatch statement
+          is cloned. */
+
+#include "config.h"
+#include "system.h"
+#include "coretypes.h"
+#include "tm.h"
+#include "tree.h"
+#include "tree-inline.h"
+#include "langhooks.h"
+#include "flags.h"
+#include "cgraph.h"
+#include "diagnostic.h"
+#include "toplev.h"
+#include "timevar.h"
+#include "params.h"
+#include "fibheap.h"
+#include "intl.h"
+#include "tree-pass.h"
+#include "hashtab.h"
+#include "coverage.h"
+#include "ggc.h"
+#include "tree-flow.h"
+#include "rtl.h"
+#include "ipa-prop.h"
+#include "basic-block.h"
+#include "toplev.h"
+#include "dbgcnt.h"
+#include "tree-dump.h"
+#include "output.h"
+#include "vecprim.h"
+#include "gimple-pretty-print.h"
+
+typedef struct cgraph_node* NODEPTR;
+DEF_VEC_P (NODEPTR);
+DEF_VEC_ALLOC_P (NODEPTR, heap);
+
+/* Store the decl of __builtin_dispatch */
+static tree builtin_function_decl = NULL;
+
+/* Hash to map name to a decl.  Used for variables and functions. */
+static htab_t name_decl_htab = NULL;
+
+/* Hashtable helpers for name_decl_htab. */
+
+static hashval_t
+name_decl_htab_hash_descriptor (const void *p)
+{
+  const_tree t = (const_tree) p;
+  const char *name
+    = (IDENTIFIER_POINTER (DECL_NAME (t)));
+  return htab_hash_string(name);
+}
+
+/* Hashtable helper for name_decl_htab. */
+
+static int
+name_decl_htab_eq_descriptor (const void *p1, const void *p2)
+{
+  const_tree t1 = (const_tree) p1;
+  const char *c1 = IDENTIFIER_POINTER (DECL_NAME (t1));
+  const char *c2 = (const char *)p2;
+
+  return (strcmp (c1, c2) == 0);
+}
+
+/* Return true if NODE is a hot function.  It is a hot function
+   if its execution frequency is determined to be hot or
+   if any of its incoming or outgoing call-graph edges is hot.  */
+
+static bool
+hot_function_p (struct cgraph_node *node)
+{
+  struct cgraph_edge *edge;
+
+  if (node->frequency == NODE_FREQUENCY_HOT)
+    return true;
+
+  for (edge = node->callees; edge; edge = edge->next_callee)
+    if (cgraph_maybe_hot_edge_p (edge))
+      return true;
+
+  for (edge = node->callers; edge; edge = edge->next_caller)
+    if (cgraph_maybe_hot_edge_p (edge))
+      return true;
+
+  return false;
+}
+
+/* Return the number of arguments that a function has.  */
+
+static int
+function_args_count (tree fntype)
+{
+  function_args_iterator args_iter;
+  tree t;
+  int num = 0;
+
+  if (fntype)
+    {
+      FOREACH_FUNCTION_ARGS(fntype, t, args_iter)
+	{
+	  num++;
+	}
+    }
+
+  return num;
+}
+
+/* Return the variable name (global/constructor) to use for the
+   version_selector function with name of DECL by appending SUFFIX. */
+
+static char *
+make_name (tree decl, const char *suffix)
+{
+  char *global_var_name;
+  int name_len;
+  const char *name;
+
+  name = IDENTIFIER_POINTER (DECL_ASSEMBLER_NAME (decl));
+
+  name_len = strlen (name) + strlen (suffix) + 2;
+  global_var_name = (char *) xmalloc (name_len);
+  snprintf (global_var_name, name_len, "%s_%s", name, suffix);
+  return global_var_name;
+}
+
+/* Code for handling version_selector attribute functions.  Such functions are
+   run-time constants and need to be executed only once.  They are hoisted
+   to a static constructor and their result is stored in a global.
+ */
+
+
+/* This function returns the global variable / constructor name created
+   for feature-test functions marked with attribute "version_selector".
+   The name returned is the DECL name appended with
+   "version_selector_global" for the variable and
+   "version_selector_constructor" for the constructor. */
+
+static char*
+make_feature_test_global_name (tree decl, bool is_constructor)
+{
+  if (is_constructor)
+    return make_name (decl, "version_selector_constructor");
+
+  return make_name (decl, "version_selector_global");
+}
+
+/* This function creates a new VAR_DECL with attributes set
+   using the parameters.  PUBLIK corresponds to TREE_PUBLIC,
+   EXTERNAL corresponds to DECL_EXTERNAL and comdat is
+   for DECL_ONE_ONLY.  The global variable will have the
+   same status as the version_selector function.*/
+
+static tree
+allocate_new_var (const char *name, int publik,
+                  int external, int comdat)
+{
+  tree new_global_var;
+  struct varpool_node *vnode;
+
+  new_global_var = build_decl (UNKNOWN_LOCATION,
+			       VAR_DECL,
+	  	               get_identifier (name),
+		               integer_type_node);
+
+  DECL_EXTERNAL (new_global_var) = external;
+  TREE_STATIC (new_global_var) = 1;
+  TREE_PUBLIC (new_global_var) = publik;
+  DECL_INITIAL (new_global_var) = 0;
+  DECL_ARTIFICIAL (new_global_var) = 1;
+  DECL_PRESERVE_P (new_global_var) = 1;
+
+  if (comdat)
+    make_decl_one_only (new_global_var, DECL_ASSEMBLER_NAME (new_global_var));
+  assemble_variable (new_global_var, 0, 0, 0);
+
+  vnode = varpool_node (new_global_var);
+  gcc_assert (vnode != NULL);
+  /* Set finalized to 1, otherwise it asserts in function "write_symbol" in
+     lto-streamer-out.c. */
+  vnode->finalized = 1;
+  
+  return new_global_var;
+}
+
+/* Make a new constructor function here to call a feature-test function
+   and set its body to CONSTRUCTOR_BODY.  Its public and comdat
+   attributes are set from the parameters, PUBLIK, and COMDAT.
+   VERSION_SELECTOR_VAR is the global decl that saves the result of the
+   feature-test function in the constructor. */
+
+static tree
+make_constructor_function (char *name, gimple constructor_body, int publik,
+			   int comdat, tree version_selector_var)
+{
+  tree decl, type, t;
+  gimple_seq seq;
+  basic_block new_bb;
+  tree old_current_function_decl;
+
+  type = build_function_type_list (void_type_node, NULL_TREE);
+
+  if (dump_file)
+    fprintf (dump_file, "Name of new constructor function = %s\n", name);
+
+  decl = build_fn_decl (name, type);
+
+  DECL_NAME (decl) = get_identifier (name);
+  SET_DECL_ASSEMBLER_NAME (decl, DECL_NAME (decl));
+  gcc_assert (cgraph_node (decl) != NULL);
+
+  TREE_USED (decl) = 1;
+  DECL_ARTIFICIAL (decl) = 1;
+  DECL_IGNORED_P (decl) = 0;
+  TREE_PUBLIC (decl) = publik;
+  DECL_UNINLINABLE (decl) = 1;
+  DECL_EXTERNAL (decl) = 0;
+  DECL_CONTEXT (decl) = NULL_TREE;
+  DECL_INITIAL (decl) = make_node (BLOCK);
+  DECL_STATIC_CONSTRUCTOR (decl) = 1;
+  TREE_READONLY (decl) = 0;
+  DECL_PURE_P (decl) = 0;
+
+  if (comdat)
+    make_decl_one_only (decl, DECL_ASSEMBLER_NAME (decl));
+
+  /* Build result decl and add to function_decl. */
+  t = build_decl (UNKNOWN_LOCATION, RESULT_DECL, NULL_TREE, void_type_node);
+  DECL_ARTIFICIAL (t) = 1;
+  DECL_IGNORED_P (t) = 1;
+  DECL_RESULT (decl) = t;
+
+  gimplify_function_tree (decl);
+
+  /* Build CFG for this function. */
+
+  old_current_function_decl = current_function_decl;
+  push_cfun (DECL_STRUCT_FUNCTION (decl));
+  current_function_decl = decl;
+  init_empty_tree_cfg_for_function (DECL_STRUCT_FUNCTION (decl));
+  new_bb = create_empty_bb (ENTRY_BLOCK_PTR);
+  make_edge (ENTRY_BLOCK_PTR, new_bb, EDGE_FALLTHRU);
+
+  /* XXX: Not sure if the edge commented below is necessary.  If I add this
+     edge, it fails in gimple_verify_flow_info in tree-cfg.c in condition :
+     " if (e->flags & EDGE_FALLTHRU)"
+     during -fprofile-generate.
+     Otherwise, it is fine.  Deleting this edge does not break anything.
+     Commenting this so that it is clear I am intentionally not doing this.*/
+  /* make_edge (new_bb, EXIT_BLOCK_PTR, EDGE_FALLTHRU); */
+
+  seq = gimple_seq_alloc_with_stmt (constructor_body);
+
+  set_bb_seq (new_bb, seq);
+  gimple_set_bb (constructor_body, new_bb);
+
+  /* Set the lexical block of the constructor body.  Fails the inliner
+     otherwise. */
+  gimple_set_block (constructor_body, DECL_INITIAL (decl));
+
+  /* This call is very important if this pass runs when the IR is in
+     SSA form.  It breaks things in strange ways otherwise. */
+  init_tree_ssa (DECL_STRUCT_FUNCTION (decl));
+  add_referenced_var (version_selector_var);
+
+  cgraph_add_new_function (decl, true);
+  cgraph_call_function_insertion_hooks (cgraph_node (decl));
+  cgraph_mark_needed_node (cgraph_node (decl));
+
+  if (dump_file)
+    dump_function_to_file (decl, dump_file, TDF_BLOCKS);
+
+  pop_cfun ();
+  current_function_decl = old_current_function_decl;
+  return decl;
+}
+
+/* If  the current function is marked with attribute
+   "version_selector" then it is the predicate (feature-test) function
+   for multi-versioning.  Call this function in a constructor and assign
+   the return value to a global variable.
+   The constructor's name is the decl name suffixed
+   "version_selector_constructor" and the global variable's name is the
+   decl name suffixed with "version_selector_global"
+
+   For example, feature-test function isSSE4 marked with attribute
+   version_selector is converted to
+
+   void isSSE4_version_selector_constructor ()
+   {
+     isSSE4_version_selector_global = isSSE4 ();
+   }
+
+   This function returns the decl of the global variable.
+
+   THIS_DECL is the function decl of the "version_selector" function.
+   */
+
+static tree
+handle_version_selector_attr_function (tree this_decl)
+{
+  char *global_var_name;
+  tree version_selector_var = NULL;
+  void **slot;
+
+  gcc_assert (!flag_lto);
+
+  if (dump_file)
+    fprintf (dump_file, "Creating constructor/global for function %s\n",
+	     IDENTIFIER_POINTER (DECL_NAME (this_decl)));
+
+  global_var_name = make_feature_test_global_name (this_decl,
+                                                   false);
+
+  slot = htab_find_slot_with_hash (name_decl_htab, global_var_name,
+                                   htab_hash_string (global_var_name),
+                                   INSERT);
+  if (*slot == NULL)
+    {
+      if (dump_file)
+        fprintf (dump_file, "Creating global variable %s\n",
+	         global_var_name);
+      *slot = allocate_new_var (global_var_name,
+                                TREE_PUBLIC (this_decl),
+                                DECL_EXTERNAL (this_decl),
+                                DECL_ONE_ONLY (this_decl));
+    }
+  else
+    {
+      free (global_var_name);
+      return (tree) *slot;
+    }
+
+  version_selector_var = (tree) *slot;
+
+  /* If the feature-test function is not external, create a constructor and
+     call this function in the constructor. */
+
+  if (!DECL_EXTERNAL (this_decl))
+    {
+      char *constructor_name;
+      gimple constructor_body;
+      tree constructor_decl;
+
+      constructor_name
+        = make_feature_test_global_name (this_decl, true);
+
+      constructor_body = gimple_build_call (this_decl, 0);
+
+      gimple_call_set_lhs (constructor_body, version_selector_var);
+
+      if (dump_file)
+        print_gimple_stmt (dump_file, constructor_body, 0, TDF_VOPS);
+
+      constructor_decl =
+        make_constructor_function (constructor_name, constructor_body,
+				   TREE_PUBLIC (this_decl),
+				   DECL_ONE_ONLY (this_decl),
+			           version_selector_var);
+
+      gcc_assert (constructor_decl != NULL_TREE);
+      free (constructor_name);
+    }
+
+  free (global_var_name);
+  return version_selector_var;
+}
+
+/* Start Analysis phase.  Mark all functions that are hot and have a call-graph
+   path to a __builtin_dispatch call. */
+
+/* This function returns the address of the feature test function.
+   If the address of the function is saved to a temporary,
+   this function traverses the gimple statements before BUILTIN_STMT
+   and finds an assignment whose rhs is the feature test function.
+   If the feature test function is specified as a function pointer
+   whose function value is unknown, this function returns NULL. */
+
+static tree
+find_version_selector_func_addr (gimple builtin_stmt)
+{
+  tree cond_func_addr = NULL;
+  gimple def_stmt = NULL;
+
+  cond_func_addr = gimple_call_arg (builtin_stmt, 0);
+
+  gcc_assert (TREE_CODE (cond_func_addr) == ADDR_EXPR
+	      || TREE_CODE (cond_func_addr) == SSA_NAME);
+
+  if (TREE_CODE (cond_func_addr) == ADDR_EXPR)
+    return cond_func_addr;
+
+  /* TREE_CODE (cond_func_addr) ==  SSA_NAME
+     This means a new function pointer variable is created and assigned the
+     address of the feature-test function. Traverse the statements backwards
+     and find the assignment to get the RHS. */
+
+  def_stmt = SSA_NAME_DEF_STMT (cond_func_addr);
+
+  gcc_assert (def_stmt
+              && gimple_assign_lhs (def_stmt) == cond_func_addr);
+
+  cond_func_addr = gimple_assign_rhs1 (def_stmt);
+
+  /* If the cond_func_addr is still not an ADDR_EXPR, it means that the
+     feature-test function is specified as a pointer.  In this case, we
+     return NULL, since the feature-test function decl is not known. */
+
+  if (cond_func_addr == NULL
+      || TREE_CODE (cond_func_addr) != ADDR_EXPR)
+    return NULL;
+
+  /* If the operand of the ADDR_EXPR is not a function_decl, return NULL
+     as this still means the feature-test function is specified as a
+     function pointer. */
+
+  if (TREE_CODE (TREE_OPERAND (cond_func_addr, 0)) != FUNCTION_DECL)
+    return NULL;
+
+  return cond_func_addr;
+}
+
+/* Finds the gimple calls  to __builtin_dispatch in function pointed
+   to by the call graph NODE and populates the vector VEC.  Returns
+   true if at least one statement was found where the feature test
+   function is marked as "version_selector".  Otherwise, there is no
+   question of hoisting it. */
+
+static bool
+is_builtin_dispatch_stmt_present (struct cgraph_node *node,
+			          VEC (tree,heap) **vec)
+{
+  struct cgraph_edge *edge;
+  bool present = false;
+
+  gcc_assert (!flag_lto);
+
+  for (edge = node->callees; edge; edge = edge->next_callee)
+    {
+      if (edge->callee->decl == builtin_function_decl)
+        {
+	  tree cond_func_decl;
+	  tree cond_func_addr;
+	  gcc_assert (*vec != NULL);
+	  cond_func_addr = find_version_selector_func_addr (edge->call_stmt);
+
+	  if (cond_func_addr == NULL)
+            continue;
+
+	  cond_func_decl = TREE_OPERAND (cond_func_addr, 0);
+
+	  /* Do not consider for hoisting if "version_selector" attribute is
+	     not set. */
+	  if (lookup_attribute ("version_selector",
+			        DECL_ATTRIBUTES (cond_func_decl)) == NULL)
+            {
+              if (dump_file)
+                {
+                  fprintf (dump_file, "Not hoisting builtin_dispatch as the "
+                           "feature-test function is not version_selector:\n");
+                  print_gimple_stmt (dump_file, edge->call_stmt, 0, TDF_VOPS);
+		}
+              continue;
+            }
+
+	  present = true;
+	  VEC_safe_push (tree, heap, *vec, cond_func_decl);
+        }
+    }
+  return present;
+}
+
+/* Updates the list of feature-test function decls reaching the cgraph
+   function NODE. */
+
+static void
+update_reachable_decls_list (struct cgraph_node *node,
+                             VEC (tree, heap) *predicate_decls)
+{
+  VEC (tree, heap) **decl_list = NULL;
+  tree cond_func_decl;
+  int ix;
+
+  if (node->aux == NULL)
+    {
+      decl_list = (VEC (tree, heap) **) xmalloc (sizeof (VEC (tree, heap) *));
+      *decl_list = VEC_alloc (tree, heap, 1);
+      node->aux = decl_list;
+    }
+  else
+    decl_list = (VEC (tree, heap) **) node->aux;
+
+  for (ix = 0; VEC_iterate (tree, predicate_decls, ix, cond_func_decl); ++ix)
+    VEC_safe_push (tree, heap, *decl_list, cond_func_decl);
+}
+
+/* Propagate the __builtin_dispatch stmt(s) called from THIS_NODE to its
+   callers.  PREDICATE_DECLS is the list of predicate function decls. */
+
+static unsigned int
+mark_reachable_functions (struct cgraph_node *this_node,
+			  VEC (tree, heap) *predicate_decls)
+{
+  VEC (NODEPTR, heap) *work_list;
+  VEC (int, heap) *depth_list;
+  struct cgraph_edge *e;
+  htab_t node_htab = NULL;
+  void **slot = NULL;
+
+  /* Use a work-list style algorithm to mark functions in any call-graph
+     path to the current function. */
+
+  work_list = VEC_alloc (NODEPTR, heap, 8);
+  depth_list = VEC_alloc (int, heap, 8);
+
+  VEC_safe_push (NODEPTR, heap, work_list, this_node);
+  VEC_safe_push (int, heap, depth_list, 0);
+
+  node_htab = htab_create (10, htab_hash_pointer,
+  			   htab_eq_pointer, NULL);
+
+  slot = htab_find_slot (node_htab, this_node, INSERT);
+
+  gcc_assert (*slot == NULL);
+  *slot = this_node;
+
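+  /* Walk the callers with a work list, bounded by
+     PARAM_MVERSN_CLONE_CGRAPH_DEPTH.  NODE_HTAB ensures that each caller
+     is visited only once.  */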
+  while (!VEC_empty (NODEPTR, work_list))
+    {
+      struct cgraph_node *node = VEC_pop (NODEPTR, work_list);
+      int depth = VEC_pop (int, depth_list);
+
+      if (dump_file)
+        fprintf (dump_file, "%s has a depth = %d callgraph path to %s\n",
+                 cgraph_node_name (node), depth,
+                 cgraph_node_name (this_node));
+
+      update_reachable_decls_list (node, predicate_decls);
+
+      gcc_assert (node->aux != NULL);
+
+      if (depth >= PARAM_VALUE (PARAM_MVERSN_CLONE_CGRAPH_DEPTH))
+        {
+          if (dump_file)
+            fprintf (dump_file, "Not propagating __builtin_dispatch... "
+                     "maximum cloning depth = %d reached\n",
+		     PARAM_VALUE (PARAM_MVERSN_CLONE_CGRAPH_DEPTH));
+          continue;
+        }
+
+      for (e = node->callers; e; e = e->next_caller)
+        {
+          slot = htab_find_slot (node_htab, e->caller, INSERT);
+	  if (*slot != NULL)
+	    continue;
+          *slot = e->caller;
+          if (!hot_function_p (e->caller))
+            continue;
+
+          VEC_safe_push (NODEPTR, heap, work_list, e->caller);
+          VEC_safe_push (int, heap, depth_list, (depth + 1));
+        }
+    }
+
+  htab_delete (node_htab);
+  VEC_free (NODEPTR, heap, work_list);
+  VEC_free (int, heap, depth_list);
+  return 0;
+}
+
+/* Scan the call graph and detect hot functions that have __builtin_dispatch
+   calls.  Then, propagate this information to their callers.  Returns true
+   if a suitable __builtin_dispatch was found. */
+
+static bool
+perform_analysis_phase (void)
+{
+  struct cgraph_node *node;
+  VEC(tree, heap) *builtin_predicates_vec = NULL;
+  bool flag = false;
+
+  builtin_predicates_vec = VEC_alloc (tree, heap, 1);
+
+  for (node = cgraph_nodes; node; node = node->next)
+    {
+      /* If the body of this decl comes from outside this unit, do nothing. */
+      if (DECL_EXTERNAL (node->decl))
+        continue;
+
+      if (!hot_function_p (node))
+        continue;
+
+      if (!is_builtin_dispatch_stmt_present (node, &builtin_predicates_vec))
+        continue;
+
+      if (dump_file)
+        {
+          fprintf (dump_file, "%s calls __builtin_dispatch atleast once.\n",
+                   cgraph_node_name (node));
+
+          fprintf (dump_file, "%s is a hot function, consider cloning ...\n",
+        	   cgraph_node_name (node));
+        }
+
+      flag = true;
+      mark_reachable_functions (node, builtin_predicates_vec);
+      VEC_truncate (tree, builtin_predicates_vec, 0);
+    }
+
+  VEC_free (tree, heap, builtin_predicates_vec);
+  return flag;
+}
+
+/* End Analysis phase. */
+
+/* Decide Cloning Phase.
+
+   In this phase, we go through each function and decide if it should be
+   cloned or not. */
+
+/* This function counts the number of unique decls in DECL_LIST.  */
+
+static int
+count_predicate_functions (VEC (tree,heap) *decl_list)
+{
+  int ix;
+  int count = 0;
+  tree cond_func_decl = NULL;
+  htab_t dup_decl_htab = NULL;
+
+  if (VEC_length (tree, decl_list) == 1)
+    return 1;
+
+  dup_decl_htab = htab_create (2, htab_hash_pointer, htab_eq_pointer, NULL);
+
+  for (ix = 0; VEC_iterate (tree, decl_list, ix, cond_func_decl); ++ix)
+    {
+      void **slot = NULL;
+      slot = htab_find_slot (dup_decl_htab, cond_func_decl, INSERT);
+
+      if (*slot != NULL)
+        continue;
+      count++;
+      *slot = cond_func_decl;
+    }
+
+  htab_delete (dup_decl_htab);
+  return count;
+}
+
+/* This function decides which functions to clone based on the number of
+   feature-test decls reaching them.  Currently, only one feature-test decl
+   is allowed. */
+
+static bool
+decide_cloning_phase (void)
+{
+  struct cgraph_node *node;
+  int count;
+  bool run_cloning_phase = false;
+  int num_funcs_cloned = 0;
+
+  for (node = cgraph_nodes; node; node = node->next)
+    {
+      tree cond_func_decl = NULL;
+      VEC (tree, heap) *vec;
+      if (node->aux == NULL)
+        continue;
+
+      if (num_funcs_cloned >= PARAM_VALUE (PARAM_NUMBER_OF_MVERSN_CLONES))
+        {
+          if (dump_file)
+            fprintf (dump_file, "Reached cloning limit specified "
+                     "by \"num-mversn-clones\" for %s\n",
+	             cgraph_node_name (node));
+
+          free (node->aux);
+	  node->aux = NULL;
+	  continue;
+        }
+
+      vec = *(VEC (tree,heap) **) node->aux;
+      count = count_predicate_functions (vec);
+      gcc_assert (count >= 1);
+      cond_func_decl = VEC_index (tree, vec, 0);
+      gcc_assert (cond_func_decl != NULL);
+      VEC_free (tree, heap, vec);
+      free (node->aux);
+      node->aux = NULL;
+
+      if (count > 1)
+        {
+          if (dump_file)
+            fprintf (dump_file, "%s has %d predicate decls; not cloning for > 1\n",
+	             cgraph_node_name (node), count);
+	  continue;
+        }
+      /* Set the node's aux value to be that of the predicate decl. */
+      node->aux = cond_func_decl;
+      run_cloning_phase = true;
+      num_funcs_cloned++;
+    }
+  return run_cloning_phase;
+}
+
+/* End Decide Cloning Phase. */
+
+/* Cloning Phase. */
+
+/* Deletes all basic blocks and leaves the function with:
+   ENTRY_BLOCK ---> (new empty basic block) ---> EXIT_BLOCK.  */
+
+static basic_block
+empty_function_body (tree fndecl)
+{
+  basic_block bb, new_bb;
+  edge e;
+  tree old_current_function_decl;
+
+  old_current_function_decl = current_function_decl;
+  push_cfun (DECL_STRUCT_FUNCTION (fndecl));
+  current_function_decl = fndecl;
+
+  clear_edges ();
+  for (bb = ENTRY_BLOCK_PTR; bb != NULL;)
+    {
+      basic_block bb_next;
+      bb_next = bb->next_bb;
+      if (bb != EXIT_BLOCK_PTR
+          && bb != ENTRY_BLOCK_PTR)
+        {
+          if (bb_seq (bb) != NULL)
+            {
+              gimple_stmt_iterator i;
+              for (i = gsi_start_bb (bb); !gsi_end_p (i);)
+	        {
+		  gimple stmt = gsi_stmt (i);
+		  unlink_stmt_vdef (stmt);
+                  gsi_remove (&i, true);
+		  release_defs (stmt);
+		}
+            }
+          bb->il.gimple = NULL;
+          bb->prev_bb = NULL;
+          bb->next_bb = NULL;
+          SET_BASIC_BLOCK (bb->index, NULL);
+          n_basic_blocks--;
+        }
+      bb = bb_next;
+    }
+  ENTRY_BLOCK_PTR->next_bb = EXIT_BLOCK_PTR;
+  new_bb = create_empty_bb (ENTRY_BLOCK_PTR);
+  e = make_edge (ENTRY_BLOCK_PTR, new_bb, EDGE_FALLTHRU);
+  gcc_assert (e != NULL);
+  /* XXX: Is this edge necessary?  */
+  e = make_edge (new_bb, EXIT_BLOCK_PTR, EDGE_FALLTHRU);
+  gcc_assert (e != NULL);
+
+  current_function_decl = old_current_function_decl;
+  pop_cfun ();
+  return new_bb;
+}
+
+/* Takes function with decl ORIG_FNDECL and clones it.  The
+   name of the clone is the original name suffixed with
+   NAME_SUFFIX.  Code is adapted from cgraph_function_versioning
+   in cgraphunit.c */
+
+static tree
+clone_function (tree orig_fndecl, const char *name_suffix)
+{
+  tree new_decl;
+  char *new_name;
+  struct cgraph_node *new_version;
+  struct cgraph_node *old_version;
+  void **slot;
+  tree old_current_function_decl;
+
+  new_name = make_name (orig_fndecl, name_suffix);
+  new_decl = copy_node (orig_fndecl);
+
+
+  slot = htab_find_slot_with_hash (name_decl_htab, new_name,
+                                   htab_hash_string (new_name), INSERT);
+
+  gcc_assert (*slot == NULL);
+  *slot = new_decl;
+
+  /* Code adapted from cgraph_function_versioning in cgraphunit.c.  */
+
+  new_version = cgraph_node (new_decl);
+  old_version = cgraph_node (orig_fndecl);
+
+  new_version->local = old_version->local;
+  new_version->global = old_version->global;
+  new_version->rtl = old_version->rtl;
+  new_version->reachable = true;
+  new_version->count = old_version->count;
+
+  /* Set the name of the new function. */
+  DECL_NAME (new_decl) = get_identifier (new_name);
+  SET_DECL_ASSEMBLER_NAME (new_decl, DECL_NAME (new_decl));
+  SET_DECL_RTL (new_decl, NULL);
+
+  tree_function_versioning (orig_fndecl, new_decl, NULL /*tree_map*/,
+                            false, NULL /*args_to_skip*/,
+			    NULL /* blocks_to_copy */ ,
+                            NULL /* new_entry */);
+
+
+  old_current_function_decl = current_function_decl;
+  push_cfun (DECL_STRUCT_FUNCTION (new_decl));
+  current_function_decl = new_decl;
+
+  TREE_READONLY (new_decl) = TREE_READONLY (orig_fndecl);
+  TREE_STATIC (new_decl) = TREE_STATIC (orig_fndecl);
+  TREE_USED (new_decl) = TREE_USED (orig_fndecl);
+  DECL_ARTIFICIAL (new_decl) = 1;
+  DECL_IGNORED_P (new_decl) = 0;
+  TREE_PUBLIC (new_decl) = TREE_PUBLIC (orig_fndecl);
+  DECL_CONTEXT (new_decl) = DECL_CONTEXT (orig_fndecl);
+
+  DECL_EXTERNAL (new_decl) = DECL_EXTERNAL (orig_fndecl);
+  DECL_COMDAT (new_decl) = DECL_COMDAT (orig_fndecl);
+  DECL_COMDAT_GROUP (new_decl) = DECL_COMDAT_GROUP (orig_fndecl);
+  DECL_VIRTUAL_P (new_decl) = DECL_VIRTUAL_P (orig_fndecl);
+  DECL_WEAK (new_decl) = DECL_WEAK (orig_fndecl);
+
+  /* Always inline the clones. Why are we cloning otherwise? */
+  DECL_DECLARED_INLINE_P (new_decl) = 1;
+  DECL_UNINLINABLE (new_decl) = 0;
+  new_version->local.externally_visible
+  = old_version->local.externally_visible;
+  new_version->local.local
+  = old_version->local.local;
+
+  new_version->analyzed = true;
+  new_version->lowered = true;
+
+  if (dump_file)
+    dump_function_to_file (new_decl, dump_file, TDF_BLOCKS);
+
+  cgraph_add_new_function (new_decl, true);
+
+  cgraph_call_function_insertion_hooks (new_version);
+  cgraph_mark_needed_node (new_version);
+
+  pop_cfun ();
+  current_function_decl = old_current_function_decl;
+
+  return new_decl;
+}
+
+/* This function populates the vector *VEC with the args in the gimple
+   call statement STMT.  SKIP_ARGS is the number of leading args to skip.  */
+
+static void
+get_function_args (gimple stmt, int num_args, VEC (tree, heap) **vec,
+                   int skip_args)
+{
+  int i;
+
+  if (num_args == 0) return;
+
+  *vec = VEC_alloc (tree, heap, num_args);
+  /* function_args_count returns 1 more than the actual number of args,
+     so copy NUM_ARGS - 1 arguments.  For a __builtin_dispatch call, the
+     3 reserved leading args are skipped via SKIP_ARGS == 3. */
+  for (i = 0; i <= num_args - 2; ++i)
+    VEC_quick_push (tree, *vec, gimple_call_arg (stmt, (skip_args + i)));
+}
+
+/* Given ret = __builtin_dispatch (pred, fn1, fn2, arg1, ....)
+   build ret = fn1 (arg1, ...) or ret = fn2 (arg1, ....)
+   depending on whether SIDE is 0 or 1. */
+
+static gimple
+make_specialized_call_from_builtin (gimple builtin_stmt, int side)
+{
+  tree func_addr;
+  int num_func_args = 0;
+  VEC (tree, heap) *nargs = NULL;
+  tree lhs_stmt;
+  gimple specialized_call_stmt;
+
+  if (side == 0)
+    func_addr = gimple_call_arg (builtin_stmt, 1);
+  else
+    func_addr = gimple_call_arg (builtin_stmt, 2);
+
+  num_func_args
+    =  function_args_count (TREE_TYPE (TREE_OPERAND (func_addr, 0)));
+
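+  /* Skip the 3 leading __builtin_dispatch operands (the predicate and the
+     two version addresses); the remaining operands are the arguments for
+     the selected version.  */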
+  get_function_args (builtin_stmt, num_func_args, &nargs, 3);
+
+  specialized_call_stmt = gimple_build_call_vec (func_addr, nargs);
+
+  lhs_stmt = gimple_call_lhs (builtin_stmt);
+
+  if (lhs_stmt != NULL_TREE)
+    gimple_call_set_lhs (specialized_call_stmt, lhs_stmt);
+
+  if (nargs != NULL)
+    VEC_free (tree, heap, nargs);
+
+  return specialized_call_stmt;
+}
+
+/* Given a call (GENERIC_STMT) to a function that is cloned, substitute
+   with a call to the correct clone. */
+
+static gimple
+make_specialized_call_to_clone (gimple generic_stmt, int side)
+{
+  tree new_decl;
+  char *new_name;
+  tree generic_fndecl;
+  gimple specialized_call_stmt;
+  void **slot;
+  int num_func_args;
+  tree lhs_stmt;
+  VEC (tree, heap) *nargs = NULL;
+
+  generic_fndecl = gimple_call_fndecl (generic_stmt);
+  gcc_assert (generic_fndecl != NULL);
+
+  if (side == 0)
+    new_name = make_name (generic_fndecl, "clone_0");
+  else
+    new_name = make_name (generic_fndecl, "clone_1");
+
+  slot = htab_find_slot_with_hash (name_decl_htab, new_name,
+                                   htab_hash_string (new_name), NO_INSERT);
+  gcc_assert (slot != NULL);
+  new_decl = (tree) *slot;
+  gcc_assert (new_decl);
+
+  num_func_args = function_args_count (TREE_TYPE (generic_fndecl));
+  get_function_args (generic_stmt, num_func_args, &nargs, 0);
+  specialized_call_stmt = gimple_build_call_vec (new_decl, nargs);
+
+  lhs_stmt = gimple_call_lhs (generic_stmt);
+
+  if (lhs_stmt != NULL_TREE)
+    gimple_call_set_lhs (specialized_call_stmt, lhs_stmt);
+
+  if (nargs != NULL)
+    VEC_free (tree, heap, nargs);
+
+  return specialized_call_stmt;
+}
+
+/* Returns true if STMT is a call to __builtin_dispatch and its
+   predicate feature-test function is marked with attribute
+   "version_selector". */
+
+static bool
+is_builtin_with_predicate_version_selector (gimple stmt)
+{
+  tree cond_func_addr, cond_func_decl;
+
+  gcc_assert (!flag_lto);
+
+  if (gimple_call_fndecl (stmt) != builtin_function_decl)
+    return false;
+
+  cond_func_addr = find_version_selector_func_addr (stmt);
+
+  if (cond_func_addr == NULL)
+    return false;
+
+  cond_func_decl = TREE_OPERAND (cond_func_addr, 0);
+  if (lookup_attribute ("version_selector",
+			DECL_ATTRIBUTES (cond_func_decl)) != NULL)
+    return true;
+
+  return false;
+}
+
+/* Find calls to __builtin_dispatch or to functions that are versioned
+   in CLONE_DECL and substitute the call with the correct version based
+   on the value of SIDE. */
+
+static void
+specialize_call (tree clone_decl, int side)
+{
+  basic_block bb;
+  tree old_current_function_decl;
+
+  old_current_function_decl = current_function_decl;
+  push_cfun (DECL_STRUCT_FUNCTION (clone_decl));
+  current_function_decl = clone_decl;
+
+  /* Iterate over the statements and find calls to __builtin_dispatch
+     or to a cloned function.  We cannot iterate over call graph edges
+     as there are no edges for the clones yet. */
+
+  FOR_EACH_BB_FN (bb, DECL_STRUCT_FUNCTION (clone_decl))
+    {
+      gimple_stmt_iterator gsi;
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+        {
+	  tree callee_decl;
+	  struct cgraph_node *callee_node;
+	  gimple specialized_call_stmt;
+	  gimple stmt = gsi_stmt (gsi);
+
+	  if (!is_gimple_call (stmt))
+            continue;
+
+	  callee_decl = gimple_call_fndecl (stmt);
+
+	  if (callee_decl == NULL)
+            continue;
+
+	  callee_node = cgraph_node (callee_decl);
+
+	  /* For a __builtin_dispatch stmt, only specialize if the
+	     version_selector attribute is set.  Otherwise, it was not
+	     hoisted, so there is nothing to specialize. */
+
+	  if (is_builtin_with_predicate_version_selector (stmt))
+	    {
+	      specialized_call_stmt =
+	        make_specialized_call_from_builtin (stmt, side);
+	    }
+	  else if  (callee_node->aux != NULL)
+            {
+	      specialized_call_stmt =
+	        make_specialized_call_to_clone (stmt, side);
+            }
+	  else
+	    continue;
+
+          if (dump_file)
+            {
+	      fprintf (dump_file, "Specialize stmt : \n");
+              print_gimple_stmt (dump_file, stmt, 0, TDF_VOPS);
+	      fprintf (dump_file, "Specialized stmt : \n");
+              print_gimple_stmt (dump_file, specialized_call_stmt,
+			         0, TDF_VOPS);
+            }
+
+	  gimple_set_block (specialized_call_stmt, gimple_block (stmt));
+	  gsi_insert_before_without_update (&gsi, specialized_call_stmt,
+                                            GSI_SAME_STMT);
+
+          unlink_stmt_vdef (stmt);
+          gsi_remove (&gsi, true);
+	  mark_symbols_for_renaming (specialized_call_stmt);
+
+	  /* After removing the statement, reset GSI so that no statement
+	     is skipped. */
+          gsi = gsi_for_stmt (specialized_call_stmt);
+	}
+    }
+  current_function_decl = old_current_function_decl;
+  pop_cfun ();
+}
+
+/* When a function is version cloned, its body is replaced with a call to
+   one of the versions, with the feature-test function acting as the
+   predicate.  This is done with __builtin_dispatch, which is expanded
+   later. */
+
+static gimple
+make_builtin_call_to_clones (tree orig_fndecl, tree clone_0_addr,
+                             tree clone_1_addr, tree cond_func_addr)
+{
+  gimple new_builtin_call;
+  VEC(tree, heap) *vargs = VEC_alloc (tree, heap, 4);
+  tree arg;
+
+  VEC_quick_push (tree, vargs, cond_func_addr);
+  VEC_quick_push (tree, vargs, clone_0_addr);
+  VEC_quick_push (tree, vargs, clone_1_addr);
+
+  for (arg = DECL_ARGUMENTS (orig_fndecl); arg; arg = TREE_CHAIN (arg))
+    {
+      VEC_safe_push (tree, heap, vargs, arg);
+      /* Again, this add_referenced_var is very important.  Omitting it
+         broke a build where a cloned function's arguments were never
+         referenced.  Missing this statement asserts at tree-dfa.c:589,
+         in function referenced_var_lookup at
+         "gcc_assert (h || uid == 0);", and is very difficult to triage. */
+      add_referenced_var (arg);
+    }
+
+  new_builtin_call = gimple_build_call_vec (builtin_function_decl, vargs);
+  mark_symbols_for_renaming (new_builtin_call);
+
+
+  if (dump_file)
+    print_gimple_stmt (dump_file, new_builtin_call, 0, TDF_VOPS);
+
+  VEC_free (tree, heap, vargs);
+
+  return new_builtin_call;
+}
+
+/* This clones a function whose callee-graph path contains a function
+   that calls __builtin_dispatch.  Two clones are made and the original
+   function is rewritten to branch to the right clone. */
+
+static int
+clone_and_dispatch_function (struct cgraph_node *orig_node, tree *clone_0,
+			     tree *clone_1)
+{
+  tree clone_0_decl, clone_1_decl;
+  gimple new_builtin_call = NULL;
+  gimple new_return_stmt = NULL;
+  gimple_seq seq = NULL;
+  basic_block new_bb;
+  tree orig_fndecl;
+  tree return_var = NULL;
+  tree return_type;
+  tree old_current_function_decl;
+
+  old_current_function_decl = current_function_decl;
+  orig_fndecl = orig_node->decl;
+  push_cfun (DECL_STRUCT_FUNCTION (orig_fndecl));
+  current_function_decl = orig_fndecl;
+
+  /* Make two clones, one for the true path and one for the false path. */
+  clone_0_decl = clone_function (orig_fndecl, "clone_0");
+  clone_1_decl = clone_function (orig_fndecl, "clone_1");
+  *clone_0 = clone_0_decl;
+  *clone_1 = clone_1_decl;
+
+  new_bb = empty_function_body (orig_fndecl);
+
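+  /* ORIG_NODE->aux holds the predicate (feature-test) decl recorded by
+     decide_cloning_phase.  */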
+  new_builtin_call = make_builtin_call_to_clones (
+		       orig_fndecl,
+                       build_fold_addr_expr (clone_0_decl),
+		       build_fold_addr_expr (clone_1_decl),
+		       build_fold_addr_expr ((tree)orig_node->aux));
+
+  return_type = TREE_TYPE (TREE_TYPE (orig_fndecl));
+
+  if (!TREE_ADDRESSABLE (return_type) && COMPLETE_TYPE_P (return_type))
+    {
+      tree tmp_var;
+      tmp_var = create_tmp_var (return_type, NULL);
+      add_referenced_var (tmp_var);
+      return_var = make_ssa_name (tmp_var, new_builtin_call);
+      gimple_call_set_lhs (new_builtin_call, return_var);
+    }
+
+  mark_symbols_for_renaming (new_builtin_call);
+  new_return_stmt = gimple_build_return (return_var);
+  mark_symbols_for_renaming (new_return_stmt);
+  gimple_seq_add_stmt (&seq, new_builtin_call);
+  gimple_seq_add_stmt (&seq, new_return_stmt);
+  set_bb_seq (new_bb, seq);
+  gimple_set_bb (new_builtin_call, new_bb);
+  gimple_set_bb (new_return_stmt, new_bb);
+
+  gimple_set_block (new_builtin_call, DECL_INITIAL (orig_fndecl));
+  gimple_set_block (new_return_stmt, DECL_INITIAL (orig_fndecl));
+
+  if (dump_file)
+    dump_function_to_file (orig_fndecl, dump_file, TDF_BLOCKS);
+
+  /* This update_ssa is necessary here for the following reason.  SSA uses
+     a global syms_to_rename bitmap that stores the syms that must be
+     renamed.  If we accumulate the syms from one function in IPA but move
+     to a different function without updating SSA, we could end up
+     accumulating syms from many functions.  That would assert in
+     referenced_var_lookup because the hashtab storing the syms is
+     function-local.  gcc-4.6 makes this bitmap a global. */
+  update_ssa (TODO_update_ssa);
+
+  compute_inline_parameters (cgraph_node (orig_fndecl));
+  DECL_DECLARED_INLINE_P (orig_fndecl) = 1;
+  DECL_UNINLINABLE (orig_fndecl) = 0;
+  current_function_decl = old_current_function_decl;
+  pop_cfun ();
+  return 0;
+}
+
+/* Clone all functions marked for cloning by the earlier phase. */
+
+static void
+perform_cloning_phase (void)
+{
+  struct cgraph_node *node;
+  int ix;
+  VEC (tree, heap) *cloned_decl_list = NULL;
+  tree cloned_decl = NULL;
+
+  cloned_decl_list = VEC_alloc (tree, heap, 2);
+
+  /* First clone, then specialize the clones. */
+  for (node = cgraph_nodes; node; node = node->next)
+    {
+      tree clone_0_decl, clone_1_decl;
+      if (node->aux == NULL)
+        continue;
+      if (dump_file)
+      {
+        fprintf (dump_file, "%s will be cloned\n", cgraph_node_name (node));
+        dump_function_to_file (node->decl, dump_file, TDF_BLOCKS);
+      }
+      clone_and_dispatch_function (node, &clone_0_decl, &clone_1_decl);
+      VEC_safe_push (tree, heap, cloned_decl_list, clone_0_decl);
+      VEC_safe_push (tree, heap, cloned_decl_list, clone_1_decl);
+      continue;
+    }
+
+  /* Specialize the clones now. */
+  for (ix = 0; VEC_iterate (tree, cloned_decl_list, ix, cloned_decl); ++ix)
+    {
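+      /* Clones were pushed in pairs (clone_0, then clone_1), so even
+         indices are specialized for side 0 and odd indices for side 1.  */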
+      int which_clone = ix % 2;
+      specialize_call (cloned_decl, which_clone);
+    }
+
+  VEC_free (tree, heap, cloned_decl_list);
+}
+
+/* End Cloning phase. */
+
+/* Checks if there is at least one call to __builtin_dispatch. */
+
+static bool
+find_builtin_decl (void)
+{
+  struct cgraph_node *node;
+  for (node = cgraph_nodes; node; node = node->next)
+    {
+      if (strstr (cgraph_node_name (node), "__builtin_dispatch") != NULL)
+        {
+          builtin_function_decl = node->decl;
+          return true;
+        }
+    }
+  return false;
+}
+
+/* Set the aux fields of all nodes and edges in the call graph to be NULL. */
+
+static void
+cleanup_aux_field (void)
+{
+  struct cgraph_node *node;
+  struct cgraph_edge *edge;
+
+  for (node = cgraph_nodes; node; node = node->next)
+    {
+      node->aux = NULL;
+      for (edge = node->callees; edge; edge = edge->next_callee)
+        edge->aux = NULL;
+    }
+}
+
+/* Main driver function. It scans the __builtin_dispatch calls and
+   figures out which functions to clone.  It then clones the functions. */
+
+static unsigned int
+builtin_dispatch_ipa_clone (void)
+{
+  cleanup_aux_field ();
+
+  /* Allocate hashtab mapping name to decl. */
+  name_decl_htab = htab_create (10, name_decl_htab_hash_descriptor,
+			        name_decl_htab_eq_descriptor, NULL);
+
+  /* Turn it on for -O1 and above.  At -O0, there is an SSA alias bug
+     with create_tmp_var, and cloning and hoisting are not necessary at
+     -O0 anyway.  Also, guard it with the flag "-fclone-hot-version-paths".
+     Disabled for LTO as it needs more work. */
+  if (optimize == 0
+      || profile_arc_flag
+      || !flag_clone_hot_version_paths
+      || flag_lto)
+    return 0;
+
+  if (!find_builtin_decl ())
+    return 0;
+
+  gcc_assert (builtin_function_decl != NULL);
+
+  if (!perform_analysis_phase ())
+    {
+      cleanup_aux_field ();
+      return 0;
+    }
+
+  if (decide_cloning_phase ())
+    perform_cloning_phase ();
+
+  cleanup_aux_field ();
+
+  return 0;
+}
+
+static bool
+gate_handle_builtin_dispatch (void)
+{
+  return true;
+}
+
+struct simple_ipa_opt_pass pass_ipa_multiversion_dispatch =
+{
+ {
+  SIMPLE_IPA_PASS,
+  "multiversion_dispatch",		/* name */
+  gate_handle_builtin_dispatch,		/* gate */
+  builtin_dispatch_ipa_clone,           /* execute */
+  NULL,					/* sub */
+  NULL,					/* next */
+  0,					/* static_pass_number */
+  TV_MVERSN_DISPATCH,			/* tv_id */
+  0,	                                /* properties_required */
+  PROP_cfg,				/* properties_provided */
+  0,					/* properties_destroyed */
+  0,					/* todo_flags_start */
+  TODO_dump_func |			/* todo_flags_finish */
+  TODO_update_ssa
+ }
+};
+
+/* Lowering of the __builtin_dispatch calls. */
+
+
+/* This function converts STMT which is a __builtin_dispatch
+   call of the form :
+   ret = __builtin_dispatch (predicate, foo, bar, arg1, ...)
+   into :
+   var_1 = predicate
+   if  (var_1)
+     var_2 = foo (arg1, ...);
+   else
+     var_3 = bar (arg1, ...);
+   var_4 = phi (var_2, var_3)
+   ret = var_4
+
+   The var_<n> are SSA names of compiler-generated temporaries.
+*/
+
+static unsigned int
+convert_builtin_dispatch (gimple stmt)
+{
+  tree cond_func_addr, if_func_addr, else_func_addr;
+  tree cond_func_decl = NULL;
+  gimple if_part, else_part, if_else_stmt;
+  basic_block bb1, bb2, bb3, bb4;
+  gimple bb1end, bb2end, bb3end;
+  edge e12, e13, e23, e24, e34;
+  VEC(tree, heap) *nargs = NULL;
+  int num_func_args = 0, i;
+  tree version_selector_var;
+  tree lhs_result;
+  gimple_stmt_iterator gsi;
+  basic_block bb;
+  gimple feature_test_call = NULL;
+  tree tmp_var = NULL;
+  gimple init_stmt = NULL;
+  tree ssa_if_name, ssa_else_name;
+  gimple phinode = NULL;
+  tree tmp_result_var, ssa_result_var;
+
+  gsi = gsi_for_stmt (stmt);
+  bb = gsi_bb (gsi);
+
+  cond_func_addr = find_version_selector_func_addr (stmt);
+  if (cond_func_addr != NULL)
+    {
+      cond_func_decl = TREE_OPERAND (cond_func_addr, 0);
+      gcc_assert (cond_func_decl);
+    }
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "Before Converting __builtin_dispatch :\n");
+      dump_function_to_file (current_function_decl, dump_file, TDF_BLOCKS);
+    }
+
+  if_func_addr = gimple_call_arg (stmt, 1);
+  else_func_addr = gimple_call_arg (stmt, 2);
+
+  tmp_result_var = create_tmp_var (integer_type_node, NULL);
+  add_referenced_var (tmp_result_var);
+
+  if (flag_lto
+      || cond_func_decl == NULL
+      || lookup_attribute ("version_selector",
+                           DECL_ATTRIBUTES (cond_func_decl)) == NULL)
+    {
+      tree arg = gimple_call_arg (stmt, 0);
+      /* The feature-test function is not marked with attribute
+         version_selector, or it is a function pointer, or we are in LTO
+         mode.  So, call it explicitly. */
+      feature_test_call = gimple_build_call (arg, 0);
+      ssa_result_var = make_ssa_name (tmp_result_var, feature_test_call);
+      gimple_assign_set_lhs (feature_test_call, ssa_result_var);
+      mark_symbols_for_renaming (feature_test_call);
+      version_selector_var = ssa_result_var;
+    }
+  else
+    {
+      /* Get the global corresponding to the "version_selector" function. */
+      version_selector_var
+        = handle_version_selector_attr_function (cond_func_decl);
+      gcc_assert (version_selector_var);
+      add_referenced_var (version_selector_var);
+      feature_test_call = gimple_build_assign (tmp_result_var,
+                                               version_selector_var);
+      ssa_result_var = make_ssa_name (tmp_result_var, feature_test_call);
+      gimple_assign_set_lhs (feature_test_call, ssa_result_var);
+      mark_symbols_for_renaming (feature_test_call);
+      version_selector_var = ssa_result_var;
+    }
+
+  if_else_stmt = gimple_build_cond (GT_EXPR,
+                                    version_selector_var,
+                                    integer_zero_node,
+                                    NULL_TREE, NULL_TREE);
+
+  mark_symbols_for_renaming (if_else_stmt);
+
+  num_func_args =  function_args_count (
+    TREE_TYPE (TREE_OPERAND (if_func_addr, 0)));
+
+  nargs = VEC_alloc (tree, heap, num_func_args);
+
+  /* The arguments to pass to the selected function version start from the
+     4th argument of __builtin_dispatch.  The first 3 arguments are
+     mandatory. */
+
+  for (i = 0; i <= num_func_args - 2; ++i)
+    VEC_quick_push (tree, nargs,
+                    gimple_call_arg (stmt, (3 + i)));
+
+  if_part = gimple_build_call_vec (if_func_addr, nargs);
+  else_part = gimple_build_call_vec (else_func_addr, nargs);
+
+  lhs_result = gimple_call_lhs (stmt);
+
+  if (lhs_result != NULL_TREE)
+    {
+      tree ssa_var;
+      tree return_type;
+      return_type = TREE_TYPE (lhs_result);
+      tmp_var = create_tmp_var (return_type, NULL);
+      add_referenced_var (tmp_var);
+
+      init_stmt = gimple_build_assign (tmp_var, integer_zero_node);
+      ssa_var = make_ssa_name (tmp_var, init_stmt);
+      gimple_assign_set_lhs (init_stmt, ssa_var);
+      mark_symbols_for_renaming (init_stmt);
+
+      ssa_if_name = make_ssa_name (tmp_var, init_stmt);
+      ssa_else_name = make_ssa_name (tmp_var, init_stmt);
+      gimple_call_set_lhs (if_part, ssa_if_name);
+      gimple_call_set_lhs (else_part, ssa_else_name);
+    }
+  mark_symbols_for_renaming (if_part);
+  mark_symbols_for_renaming (else_part);
+
+  /* Set the lexical block to be the same as the dispatch call. */
+  gcc_assert (feature_test_call);
+  gimple_set_block (feature_test_call, gimple_block (stmt));
+
+  if (init_stmt)
+    gimple_set_block (init_stmt, gimple_block (stmt));
+
+  gimple_set_block (if_else_stmt, gimple_block (stmt));
+  gimple_set_block (if_part, gimple_block (stmt));
+  gimple_set_block (else_part, gimple_block (stmt));
+
+  gsi_insert_before_without_update (&gsi, feature_test_call, GSI_SAME_STMT);
+  gimple_set_bb (feature_test_call, bb);
+
+  if (init_stmt)
+    {
+      gsi_insert_before_without_update (&gsi, init_stmt,
+                                        GSI_SAME_STMT);
+      gimple_set_bb (init_stmt, bb);
+    }
+
+  gsi_insert_before_without_update (&gsi, if_else_stmt, GSI_SAME_STMT);
+  gsi_insert_before_without_update (&gsi, if_part, GSI_SAME_STMT);
+  gsi_insert_before_without_update (&gsi, else_part, GSI_SAME_STMT);
+
+  /* Remove the builtin_dispatch call after the expansion. */
+  unlink_stmt_vdef (stmt);
+  gsi_remove (&gsi, true);
+
+  bb1end = if_else_stmt;
+  bb2end = if_part;
+  bb3end = else_part;
+  bb1 = bb;
+  e12 = split_block (bb1, bb1end);
+  bb2 = e12->dest;
+  e23 = split_block (bb2, bb2end);
+  bb3 = e23->dest;
+  e34 = split_block (bb3, bb3end);
+  bb4 = e34->dest;
+
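+  /* Rewire the blocks into a diamond: bb1 ends with the condition and
+     branches to bb2 (the "if" call) or bb3 (the "else" call); both then
+     reach bb4, where the result, if any, is merged by a PHI node.  */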
+  e12->flags &= ~EDGE_FALLTHRU;
+  e12->flags |= EDGE_TRUE_VALUE;
+  e13 = make_edge (bb1, bb3, EDGE_FALSE_VALUE);
+  gcc_assert (e13);
+  e24 = make_edge (bb2, bb4, EDGE_FALLTHRU);
+  gcc_assert (e24);
+  remove_edge (e23);
+
+  if (tmp_var)
+    {
+      gimple assign_stmt;
+      phinode = create_phi_node (tmp_var, bb4);
+      add_phi_arg (phinode, ssa_if_name, e24, UNKNOWN_LOCATION);
+      add_phi_arg (phinode, ssa_else_name, e34, UNKNOWN_LOCATION);
+      mark_symbols_for_renaming (phinode);
+      gcc_assert (lhs_result);
+      assign_stmt
+        = gimple_build_assign (lhs_result, gimple_phi_result (phinode));
+      mark_symbols_for_renaming (assign_stmt);
+      gsi = gsi_start_bb (bb4);
+      gsi_insert_before_without_update (&gsi, assign_stmt, GSI_SAME_STMT);
+      gimple_set_bb (assign_stmt, bb4);
+    }
+
+  if (dump_file)
+    {
+      fprintf (dump_file, "Converted __builtin_dispatch :\n");
+      dump_function_to_file (current_function_decl, dump_file, TDF_BLOCKS);
+    }
+
+  return 0;
+}
+
+/* This function does two things.
+
+   1) For a feature-test function marked with attribute "version_selector",
+      it creates a constructor that calls the feature-test function and a
+      global that holds the result.  This global is used to lower any
+      __builtin_dispatch statement that refers to this feature-test
+      function.  The __builtin_dispatch statement and the feature-test
+      function can be in different modules.
+
+   2) It lowers __builtin_dispatch statements. */
+
+static unsigned int
+do_convert_builtin_dispatch (void)
+{
+  basic_block bb;
+  gimple_stmt_iterator gsi;
+  VEC (gimple, heap) *builtin_stmt_list = NULL;
+  int ix;
+  gimple builtin_stmt;
+
+  /* Allocate hashtab mapping name to decl. */
+  if (name_decl_htab == NULL)
+    name_decl_htab = htab_create (10, name_decl_htab_hash_descriptor,
+		                  name_decl_htab_eq_descriptor, NULL);
+
+  /* Look for functions with attribute "version_selector" and make a
+     constructor which calls the function and saves the result in a
+     global.  Disabled for LTO as it needs more work. */
+
+  if (!flag_lto
+      && lookup_attribute ("version_selector",
+                           DECL_ATTRIBUTES (current_function_decl)) != NULL)
+    {
+      if (dump_file)
+        fprintf (dump_file, "Function with version_selector attribute found :"
+                 " %s.  Making constructor for it.\n",
+		 current_function_name ());
+
+      handle_version_selector_attr_function (current_function_decl);
+      /* Assume feature-test functions contain no __builtin_dispatch calls,
+         so it is safe to return here. */
+      return 0;
+    }
+
+  /* Find and lower __builtin_dispatch calls. */
+
+  FOR_EACH_BB_FN (bb, DECL_STRUCT_FUNCTION (current_function_decl))
+    {
+      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+        {
+          gimple stmt = gsi_stmt (gsi);
+          tree call_decl;
+
+          if (!is_gimple_call (stmt))
+            continue;
+
+          call_decl = gimple_call_fndecl (stmt);
+
+	  if (call_decl == NULL)
+            continue;
+
+          if (strstr (cgraph_node_name (cgraph_node (call_decl)),
+                      "__builtin_dispatch") == NULL)
+            continue;
+
+	  if (dump_file)
+            {
+	      fprintf (dump_file, "Converting __builtin_dispatch stmt in:%s\n",
+		       current_function_name ());
+              print_gimple_stmt (dump_file, stmt, 0, TDF_VOPS);
+            }
+
+	  if (builtin_stmt_list == NULL)
+              builtin_stmt_list = VEC_alloc (gimple, heap, 2);
+
+          gcc_assert (builtin_stmt_list != NULL);
+	  VEC_safe_push (gimple, heap, builtin_stmt_list, stmt);
+        }
+    }
+
+  if (!builtin_stmt_list)
+    return 0;
+
+  for (ix = 0; VEC_iterate (gimple, builtin_stmt_list, ix, builtin_stmt);
+       ++ix)
+    convert_builtin_dispatch (builtin_stmt);
+
+  compute_inline_parameters (cgraph_node (current_function_decl));
+
+  VEC_free (gimple, heap, builtin_stmt_list);
+
+  return 0;
+}
+
+static bool
+gate_convert_builtin_dispatch (void)
+{
+  return true;
+}
+
+struct gimple_opt_pass pass_tree_convert_builtin_dispatch =
+{
+ {
+  GIMPLE_PASS,
+  "convert_builtin_dispatch",	        /* name */
+  gate_convert_builtin_dispatch,	/* gate */
+  do_convert_builtin_dispatch,		/* execute */
+  NULL,					/* sub */
+  NULL,					/* next */
+  0,					/* static_pass_number */
+  TV_MVERSN_DISPATCH,			/* tv_id */
+  PROP_cfg,				/* properties_required */
+  PROP_cfg,				/* properties_provided */
+  0,					/* properties_destroyed */
+  0,					/* todo_flags_start */
+  TODO_verify_stmts | TODO_dump_func |	/* todo_flags_finish */
+  TODO_cleanup_cfg | TODO_dump_cgraph |
+  TODO_update_ssa | TODO_verify_ssa
+ }
+};
Index: timevar.def
===================================================================
--- timevar.def	(revision 173122)
+++ timevar.def	(working copy)
@@ -106,6 +106,7 @@  DEFTIMEVAR (TV_LEX		     , "lexical analysis")
 DEFTIMEVAR (TV_PARSE                 , "parser")
 DEFTIMEVAR (TV_NAME_LOOKUP           , "name lookup")
 DEFTIMEVAR (TV_INLINE_HEURISTICS     , "inline heuristics")
+DEFTIMEVAR (TV_MVERSN_DISPATCH       , "multiversion dispatch")
 DEFTIMEVAR (TV_INTEGRATION           , "integration")
 DEFTIMEVAR (TV_TREE_GIMPLIFY	     , "tree gimplify")
 DEFTIMEVAR (TV_TREE_EH		     , "tree eh")
Index: common.opt
===================================================================
--- common.opt	(revision 173122)
+++ common.opt	(working copy)
@@ -2191,6 +2191,10 @@  ftree-builtin-call-dce
 Common Report Var(flag_tree_builtin_call_dce) Init(0) Optimization
 Enable conditional dead code elimination for builtin calls
 
+fclone-hot-version-paths
+Common Report Var(flag_clone_hot_version_paths) Init(0)
+Enable cloning and hoisting of hot multiversioned paths
+
 fwhole-program
 Common Report Var(flag_whole_program) Init(0) Optimization
 Perform whole program optimizations
Index: Makefile.in
===================================================================
--- Makefile.in	(revision 173122)
+++ Makefile.in	(working copy)
@@ -1304,6 +1304,7 @@  OBJS-common = \
 	mcf.o \
 	mode-switching.o \
 	modulo-sched.o \
+	mversn-dispatch.o \
 	omega.o \
 	omp-low.o \
 	optabs.o \
@@ -3097,6 +3098,11 @@  implicit-zee.o : implicit-zee.c $(CONFIG_H) $(SYST
    $(DF_H) $(TIMEVAR_H) tree-pass.h $(RECOG_H) $(EXPR_H) \
    $(REGS_H) $(TREE_H) $(TM_P_H) insn-config.h $(INSN_ATTR_H) $(DIAGNOSTIC_CORE_H) \
    $(TARGET_H) $(OPTABS_H) insn-codes.h rtlhooks-def.h $(PARAMS_H) $(CGRAPH_H)
+mversn-dispatch.o : mversn-dispatch.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) \
+   $(TREE_H) langhooks.h $(TREE_INLINE_H) $(FLAGS_H) $(CGRAPH_H) intl.h \
+   $(DIAGNOSTIC_H) $(FIBHEAP_H) $(PARAMS_H) $(TIMEVAR_H) tree-pass.h \
+   $(HASHTAB_H) $(COVERAGE_H) $(GGC_H) $(TREE_FLOW_H) $(RTL_H) $(IPA_PROP_H) \
+   $(BASIC_BLOCK_H) $(TOPLEV_H) $(TREE_DUMP_H)
 gcse.o : gcse.c $(CONFIG_H) $(SYSTEM_H) coretypes.h $(TM_H) $(RTL_H) \
    $(REGS_H) hard-reg-set.h $(FLAGS_H) insn-config.h $(GGC_H) \
    $(RECOG_H) $(EXPR_H) $(BASIC_BLOCK_H) $(FUNCTION_H) output.h toplev.h $(DIAGNOSTIC_CORE_H) \
Index: passes.c
===================================================================
--- passes.c	(revision 173122)
+++ passes.c	(working copy)
@@ -796,6 +796,15 @@  init_optimization_passes (void)
     }
   NEXT_PASS (pass_ipa_increase_alignment);
   NEXT_PASS (pass_ipa_matrix_reorg);
+  NEXT_PASS (pass_ipa_multiversion_dispatch);
+    {
+      struct opt_pass **p = &pass_ipa_multiversion_dispatch.pass.sub;
+      NEXT_PASS (pass_tree_convert_builtin_dispatch);
+      /* Rebuilding cgraph edges is necessary as the above passes change
+         the call graph.  Otherwise, later optimizations would use the old
+         call graph and could make wrong decisions.  */
+      NEXT_PASS (pass_rebuild_cgraph_edges);
+    }
   NEXT_PASS (pass_ipa_lower_emutls);
   *p = NULL;
 
Index: params.def
===================================================================
--- params.def	(revision 173122)
+++ params.def	(working copy)
@@ -929,6 +929,17 @@  DEFPARAM (CXX_MAX_NAMESPACES_FOR_DIAGNOSTIC_HELP,
 	  "name lookup fails",
 	  1000, 0, 0)
 
+DEFPARAM (PARAM_NUMBER_OF_MVERSN_CLONES,
+          "num-mversn-clones",
+	  "maximum number of functions to be cloned while doing "
+          "multiversioning",
+	  10, 0, 1000)
+
+DEFPARAM (PARAM_MVERSN_CLONE_CGRAPH_DEPTH,
+          "mversn-clone-depth",
+	  "maximum length of the call graph path to be cloned "
+          "while doing multiversioning",
+	  2, 0, 5)
 /*
 Local variables:
 mode:c