
[RFC] replace malloc with a decl on the stack

Message ID alpine.DEB.2.10.1311100140300.4153@laptop-mg.saclay.inria.fr
State New

Commit Message

Marc Glisse Nov. 10, 2013, 3:27 p.m. UTC
Hello,

I am posting this patch to get some feedback on the approach. The goal is 
to replace malloc+free with a stack allocation (a decl actually) when the 
size is a small constant.

For testing, I hijacked the "leaf" attribute, but it isn't right, I'll 
remove it from the list (not sure what I'll do for the testcases then). 
What I'd want instead is a "returns" attribute that means the function 
will return (either by "return" or an exception), as opposed to having an 
infinite loop, calling exit or longjmp, etc (sched-deps.c has a related 
call_may_noreturn_p). The situation I am trying to avoid is:
p=malloc(12);
f(p)
free(p)

where f contains somewhere:
free(p); exit(42);
(or longjmp or something else that takes the regular call to free out of 
the execution flow).


It passes most of the testsuite, but breaks a couple __builtin_object_size 
tests:

struct A
{
   char a[10];
   int b;
   char c[10];
};
int main(){
   struct A *p = malloc (2 * sizeof (struct A));
   assert (__builtin_object_size (&p->a, 1) == sizeof (p->a));
   free (p);
}
__builtin_object_size now returns 56 instead of 10. I am not sure what to 
do about that.


The size above which the malloc->stack transformation is not applied 
should depend on a parameter, I don't know if it should get its own or 
depend on an existing one. In any case, I'd like to avoid ending up with a 
ridiculously low threshold (my programs use GMP, which internally uses 
alloca up to 65536 bytes (even in recursive functions that have a dozen 
allocations), so I don't want gcc to tell me that 50 bytes are too much).

A program with a double-free may, with this patch, end up crashing on the 
first free instead of the second, but that's an invalid program anyway.


I don't know if tree-ssa-forwprop is a good place for it, but it was 
convenient for a prototype. I like the idea of running it several times: 
replacing malloc with a decl early can help other optimizations, and other 
optimizations can make this one possible later.

The walk could be a bit expensive, but we only do it if we detected a 
malloc of a small constant and at least one matching free. I guess I 
should mark the malloc somehow to avoid performing the walk twice if there 
are several (but not enough) matching free.


stack_vec is nice, it would be convenient if bitmaps also had a version 
with a destructor so we don't need to explicitly deallocate them.

Comments

Ondřej Bílka Nov. 11, 2013, 10:08 a.m. UTC | #1
On Sun, Nov 10, 2013 at 04:27:00PM +0100, Marc Glisse wrote:
> Hello,
> 
> I am posting this patch to get some feedback on the approach. The
> goal is to replace malloc+free with a stack allocation (a decl
> actually) when the size is a small constant.
>
Why constrain yourself to small sizes? The benefits of stack allocation are
speed and lower memory consumption due to the lack of fragmentation.

A possible way is to have thread-local bounds on the stack size and call a
function with custom logic when a request falls outside of those bounds.

Below is a simple implementation which creates a separate stack for that
purpose (for simplicity, and because it does not need to find the bounds of
the thread stack.)

With a bit more work it could do allocations in a similar way to split
stacks.

 
> For testing, I highjacked the "leaf" attribute, but it isn't right,
> I'll remove it from the list (not sure what I'll do for the
> testcases then). What I'd want instead is a "returns" attribute that
> means the function will return (either by "return" or an exception),
> as opposed to having an infinite loop, calling exit or longjmp, etc
> (sched-deps.c has a related call_may_noreturn_p). The situation I am
> trying to avoid is:
> p=malloc(12);
> f(p)
> free(p)
> 
> where f contains somewhere:
> free(p); exit(42);
> (or longjmp or something else that takes the regular call to free
> out of the execution flow).
> 
One of the plans to extend malloc is to add a custom free handler; the
interface would be something like dalloc(amount, destructor), which would
invoke the destructor on free. The main motivation is memory pools that can
be returned by free.

With that extension it would be possible to mark a pointer so that its free
would be a nop.
> 
> 
> The size above which the malloc->stack transformation is not applied
> should depend on a parameter, I don't know if it should get its own
> or depend on an existing one. In any case, I'd like to avoid ending
> up with a ridiculously low threshold (my programs use GMP, which
> internally uses alloca up to 65536 bytes (even in recursive
> functions that have a dozen allocations), so I don't want gcc to
> tell me that 50 bytes are too much).
> 
> A program with a double-free may, with this patch, end up crashing
> on the first free instead of the second, but that's an invalid
> program anyway.
> 
> 

#include <pthread.h>
#include <stdlib.h>

__thread void *__stack_from;
__thread void *__stack_cur;
__thread void *__stack_to;

void *__stack_alloc (size_t size);
void __stack_free (void *p);

/* Bump-allocate SIZE bytes from the per-thread stack; fall back to
   __stack_alloc when the stack is missing or the request would
   overflow (or wrap around).  */
#define STACK_ALLOC(size) ({ \
  void *__res; \
  void *__stack_new = __stack_cur + (size); \
  if (__stack_new < __stack_cur || __stack_new > __stack_to) \
    __res = __stack_alloc (size); \
  else \
    { \
      __res = __stack_cur; \
      __stack_cur = __stack_new; \
    } \
  __res; \
})

/* Release P: pointers outside the per-thread stack go to __stack_free,
   pointers inside it just pop the bump pointer.  */
#define STACK_FREE(p) ({ \
  void *__p = (p); \
  if (__p < __stack_from || __p >= __stack_to) \
    __stack_free (__p); \
  else \
    __stack_cur = __p; \
})

static pthread_key_t key;

void
__stack_destroy (void *x)
{
  (void) x;
  free (__stack_from);
}

void *
__stack_alloc (size_t size)
{
  if (!__stack_from)
    {
      /* Lazily create this thread's stack; a real version would
         create KEY once via pthread_once.  */
      __stack_from = malloc (1 << 18);
      __stack_to = __stack_from + (1 << 18);
      __stack_cur = __stack_from;
      pthread_key_create (&key, __stack_destroy);
      pthread_setspecific (key, &key);
    }
  return malloc (size);
}

void
__stack_free (void *p)
{
  free (p);
}
Jakub Jelinek Nov. 11, 2013, 10:19 a.m. UTC | #2
On Mon, Nov 11, 2013 at 11:08:14AM +0100, Ondřej Bílka wrote:
> On Sun, Nov 10, 2013 at 04:27:00PM +0100, Marc Glisse wrote:
> > I am posting this patch to get some feedback on the approach. The
> > goal is to replace malloc+free with a stack allocation (a decl
> > actually) when the size is a small constant.
> >
> Why constraint yourself to small sizes. Stack allocation benefits is
> speed and less memory comsumption due lack of fragmentation.

Because you can hardly predict what the program will have as stack
requirements in functions you call?  In leaf functions sure, you only care
not to create too large allocations that would go over the stack size,
but if you call other functions, you usually can't know (at least without
sufficient IPA analysis, but that is really hard because it is too early)
how much stack it will really need (both fixed requirements for non-VLA
vars on the stack, spill space, function call arguments and other overhead
and VLAs and also these malloc turned into stack allocation).
So, if you say have a malloc with corresponding free shortly afterwards
but some call in between and decide that it is fine to change it into stack
allocation when it is half the size of the remaining stack space, but then
two frames down there will be some non-VLA var that needs 3/4 of the old
remaining stack space, you've turned a correct program into a broken one.

	Jakub
Ondřej Bílka Nov. 11, 2013, 10:48 a.m. UTC | #3
On Mon, Nov 11, 2013 at 11:19:05AM +0100, Jakub Jelinek wrote:
> On Mon, Nov 11, 2013 at 11:08:14AM +0100, Ondřej Bílka wrote:
> > On Sun, Nov 10, 2013 at 04:27:00PM +0100, Marc Glisse wrote:
> > > I am posting this patch to get some feedback on the approach. The
> > > goal is to replace malloc+free with a stack allocation (a decl
> > > actually) when the size is a small constant.
> > >
> > Why constraint yourself to small sizes. Stack allocation benefits is
> > speed and less memory comsumption due lack of fragmentation.
> 
> Because you can hardly predict what the program will have as stack
> requirements in functions you call?  In leaf functions sure, you only care
> not to create too large allocations that would go over the stack size,
> but if you call other functions, you usually can't know (at least without
> sufficient IPA analysis, but that is really hard because it is too early)

Which is completely irrelevant. Checking it at run time is easy, as in my
prototype, which uses an alternative stack so the added stack space does not
count.

> how much stack will it really need (both fixed requirements for non-VLA
> vars on the stack, spill space, function call arguments and other overhead
> and VLAs and also these malloc turned into stack allocation).

This is an argument for turning alloca and VLAs into versions that check
whether there is enough stack and get space by other means. This can use the
same logic as described above.

> So, if you say have a malloc with corresponding free shortly afterwards
> but some call in between and decide that it is fine to change it into stack
> allocation when it is half the size of the remaining stack space, but then

Of the total stack space; also, if we use the main stack, the user should
enlarge stacks accordingly.

> two frames down there will be some non-VLA var that needs 3/4 of the old
> remaining stack space, you've turned a correct program into a broken one.
> 
> 	Jakub
Richard Biener Nov. 11, 2013, 12:55 p.m. UTC | #4
On Sun, Nov 10, 2013 at 4:27 PM, Marc Glisse <marc.glisse@inria.fr> wrote:
> Hello,
>
> I am posting this patch to get some feedback on the approach. The goal is to
> replace malloc+free with a stack allocation (a decl actually) when the size
> is a small constant.
>
> For testing, I highjacked the "leaf" attribute, but it isn't right, I'll
> remove it from the list (not sure what I'll do for the testcases then). What
> I'd want instead is a "returns" attribute that means the function will
> return (either by "return" or an exception), as opposed to having an
> infinite loop, calling exit or longjmp, etc (sched-deps.c has a related
> call_may_noreturn_p). The situation I am trying to avoid is:
> p=malloc(12);
> f(p)
> free(p)
>
> where f contains somewhere:
> free(p); exit(42);
> (or longjmp or something else that takes the regular call to free out of the
> execution flow).

Instead of a new attribute, IPA should be used to compute this call-graph
property.  An infinite loop wouldn't be an issue; the real issue is that
the callee shouldn't free the object, which would only be valid if the free
after the call isn't reached.

IPA pure-const is used to compute similar properties (may loop,
may throw).

You also have to make sure to properly align the replacement.

>
> It passes most of the testsuite, but breaks a couple __builtin_object_size
> tests:
>
> struct A
> {
>   char a[10];
>   int b;
>   char c[10];
> };
> int main(){
>   struct A *p = malloc (2 * sizeof (struct A));
>   assert (__builtin_object_size (&p->a, 1) == sizeof (p->a));
>   free (p);
> }
> __builtin_object_size now returns 56 instead of 10. I am not sure what to do
> about that.
>
>
> The size above which the malloc->stack transformation is not applied should
> depend on a parameter, I don't know if it should get its own or depend on an
> existing one. In any case, I'd like to avoid ending up with a ridiculously
> low threshold (my programs use GMP, which internally uses alloca up to 65536
> bytes (even in recursive functions that have a dozen allocations), so I
> don't want gcc to tell me that 50 bytes are too much).
>
> A program with a double-free may, with this patch, end up crashing on the
> first free instead of the second, but that's an invalid program anyway.
>
>
> I don't know if tree-ssa-forwprop is a good place for it, but it was
> convenient for a prototype. I like the idea of running it several times:
> replacing malloc with a decl early can help other optimizations, and other
> optimizations can make this one possible later.

We have a similar transform in CCP (fold_builtin_alloca_with_align) which
is there because CCP is a good place where size arguments may
become constant (the case you handle - you don't seem to handle
variable malloc -> alloca transform which would be possible if for example
VRP figures out an acceptable upper bound for the size).

Richard.

> The walk could be a bit expensive, but we only do it if we detected a malloc
> of a small constant and at least one matching free. I guess I should mark
> the malloc somehow to avoid performing the walk twice if there are several
> (but not enough) matching free.
>
>
> stack_vec is nice, it would be convenient if bitmaps also had a version with
> a destructor so we don't need to explicitly deallocate them.
>
> --
> Marc Glisse
> Index: gcc/testsuite/g++.dg/tree-ssa/heapstack-1.C
> ===================================================================
> --- gcc/testsuite/g++.dg/tree-ssa/heapstack-1.C (revision 0)
> +++ gcc/testsuite/g++.dg/tree-ssa/heapstack-1.C (working copy)
> @@ -0,0 +1,24 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +
> +struct A {
> +  void*p;
> +  A(void*q) : p(q) {}
> +  ~A(){ __builtin_free(p); }
> +};
> +void f(void*)__attribute__((__leaf__));
> +void h(void*)__attribute__((__leaf__,__nothrow__));
> +void g(){
> +  void*p=__builtin_malloc(12);
> +  A a(p);
> +  f(p);
> +}
> +
> +void i(){
> +  void*p=__builtin_malloc(12);
> +  h(p);
> +  __builtin_free(p);
> +}
> +
> +/* { dg-final { scan-tree-dump-not "malloc" "optimized" } } */
> +/* { dg-final { cleanup-tree-dump "optimized" } } */
>
> Property changes on: gcc/testsuite/g++.dg/tree-ssa/heapstack-1.C
> ___________________________________________________________________
> Added: svn:keywords
> ## -0,0 +1 ##
> +Author Date Id Revision URL
> \ No newline at end of property
> Added: svn:eol-style
> ## -0,0 +1 ##
> +native
> \ No newline at end of property
> Index: gcc/testsuite/g++.dg/tree-ssa/heapstack-2.C
> ===================================================================
> --- gcc/testsuite/g++.dg/tree-ssa/heapstack-2.C (revision 0)
> +++ gcc/testsuite/g++.dg/tree-ssa/heapstack-2.C (working copy)
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +
> +void f(void*)__attribute__((__leaf__));
> +void g(){
> +  void*p=__builtin_malloc(12);
> +  f(p);
> +  __builtin_free(p);
> +}
> +
> +/* { dg-final { scan-tree-dump-times "malloc" 1 "optimized" } } */
> +/* { dg-final { cleanup-tree-dump "optimized" } } */
>
> Property changes on: gcc/testsuite/g++.dg/tree-ssa/heapstack-2.C
> ___________________________________________________________________
> Added: svn:eol-style
> ## -0,0 +1 ##
> +native
> \ No newline at end of property
> Added: svn:keywords
> ## -0,0 +1 ##
> +Author Date Id Revision URL
> \ No newline at end of property
> Index: gcc/testsuite/gcc.dg/tree-ssa/heapstack-1.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/tree-ssa/heapstack-1.c (revision 0)
> +++ gcc/testsuite/gcc.dg/tree-ssa/heapstack-1.c (working copy)
> @@ -0,0 +1,22 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +
> +void f(void*)__attribute__((__leaf__));
> +void g(int m,int n){
> +  int i;
> +  void*p=__builtin_malloc(12);
> +  switch(n){
> +    case 1:
> +      for (i=0; i<m; ++i)
> +       f(p);
> +      break;
> +    case 2:
> +      __builtin_free(p);
> +      __builtin_exit(42);
> +    default:;
> +  }
> +  __builtin_free(p);
> +}
> +
> +/* { dg-final { scan-tree-dump-not "malloc" "optimized" } } */
> +/* { dg-final { cleanup-tree-dump "optimized" } } */
>
> Property changes on: gcc/testsuite/gcc.dg/tree-ssa/heapstack-1.c
> ___________________________________________________________________
> Added: svn:keywords
> ## -0,0 +1 ##
> +Author Date Id Revision URL
> \ No newline at end of property
> Added: svn:eol-style
> ## -0,0 +1 ##
> +native
> \ No newline at end of property
> Index: gcc/testsuite/gcc.dg/tree-ssa/heapstack-2.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/tree-ssa/heapstack-2.c (revision 0)
> +++ gcc/testsuite/gcc.dg/tree-ssa/heapstack-2.c (working copy)
> @@ -0,0 +1,12 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -fdump-tree-optimized" } */
> +
> +void f(void*);
> +void g(){
> +  void*p=__builtin_malloc(12);
> +  f(p);
> +  __builtin_free(p);
> +}
> +
> +/* { dg-final { scan-tree-dump-times "malloc" 1 "optimized" } } */
> +/* { dg-final { cleanup-tree-dump "optimized" } } */
>
> Property changes on: gcc/testsuite/gcc.dg/tree-ssa/heapstack-2.c
> ___________________________________________________________________
> Added: svn:keywords
> ## -0,0 +1 ##
> +Author Date Id Revision URL
> \ No newline at end of property
> Added: svn:eol-style
> ## -0,0 +1 ##
> +native
> \ No newline at end of property
> Index: gcc/testsuite/gcc.dg/tree-ssa/pr19831-3.c
> ===================================================================
> --- gcc/testsuite/gcc.dg/tree-ssa/pr19831-3.c   (revision 204620)
> +++ gcc/testsuite/gcc.dg/tree-ssa/pr19831-3.c   (working copy)
> @@ -1,31 +1,31 @@
>  /* { dg-do compile } */
>  /* { dg-options "-O2 -fdump-tree-optimized" } */
>
> -void test2(void)
> +void test2(int n)
>  {
> -  int *p = __builtin_malloc (sizeof (int) * 4);
> +  int *p = __builtin_malloc (sizeof (int) * n);
>    if (p == (void *)0)
>      __builtin_abort ();
>    __builtin_free (p);
>  }
>
> -void test5(int b)
> +void test5(int n)
>  {
> -  int *p = __builtin_malloc (sizeof (int) * 4);
> +  int *p = __builtin_malloc (sizeof (int) * n);
>    if (p)
>      __builtin_free (p);
>  }
>
> -void test6(void)
> +void test6(int n)
>  {
> -  int *p = __builtin_malloc (sizeof (int) * 4);
> +  int *p = __builtin_malloc (sizeof (int) * n);
>    if (p == (void *)0)
>      __builtin_abort ();
>    if (p)
>      __builtin_free (p);
>  }
>
>  /* We should be able to remove all malloc/free pairs with CDDCE.
>     Assume p was non-NULL for test2.
>     For test5, it doesn't matter if p is NULL or non-NULL.  */
>
> Index: gcc/tree-ssa-forwprop.c
> ===================================================================
> --- gcc/tree-ssa-forwprop.c     (revision 204620)
> +++ gcc/tree-ssa-forwprop.c     (working copy)
> @@ -33,20 +33,21 @@ along with GCC; see the file COPYING3.
>  #include "tree-ssanames.h"
>  #include "tree-dfa.h"
>  #include "tree-pass.h"
>  #include "langhooks.h"
>  #include "flags.h"
>  #include "expr.h"
>  #include "cfgloop.h"
>  #include "optabs.h"
>  #include "tree-ssa-propagate.h"
>  #include "tree-ssa-dom.h"
> +#include "params.h"
>
>  /* This pass propagates the RHS of assignment statements into use
>     sites of the LHS of the assignment.  It's basically a specialized
>     form of tree combination.   It is hoped all of this can disappear
>     when we have a generalized tree combiner.
>
>     One class of common cases we handle is forward propagating a single use
>     variable into a COND_EXPR.
>
>       bb0:
> @@ -1478,20 +1479,113 @@ constant_pointer_difference (tree p1, tr
>      }
>
>    for (i = 0; i < cnt[0]; i++)
>      for (j = 0; j < cnt[1]; j++)
>        if (exps[0][i] == exps[1][j])
>         return size_binop (MINUS_EXPR, offs[0][i], offs[1][j]);
>
>    return NULL_TREE;
>  }
>
> +/* We want to be sure that free (VAR) can't be called in another function, and
> +   the easiest way to ensure that is to prove that it is called in this
> +   function.  The hardest part is avoiding some call to a function that may not
> +   return because of exit, an infinite loop, setjmp, etc, where it might call
> +   free on VAR.  */
> +static bool
> +malloc_safe_on_stack (gimple stmt, vec<gimple, va_heap> *list_of_frees)
> +{
> +  tree var = gimple_call_lhs (stmt);
> +  basic_block bb = gimple_bb (stmt);
> +  stack_vec<basic_block, 20> bb_to_visit;
> +  bitmap bb_visited = BITMAP_ALLOC (NULL);
> +  bitmap_set_bit (bb_visited, bb->index);
> +  gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
> +
> +next_stmt:
> +  gsi_next_nondebug (&gsi);
> +
> +handle_stmt:
> +  if (gsi_end_p (gsi))
> +    {
> +      edge e;
> +      edge_iterator ei;
> +      FOR_EACH_EDGE (e, ei, bb->succs)
> +       {
> +         bb_to_visit.safe_push (e->dest);
> +       }
> +      goto next_bb;
> +    }
> +  stmt = gsi_stmt (gsi);
> +  if (stmt_can_throw_external (stmt))
> +    /* We could give up only if VAR has escaped (same for return), but that
> +       means there is a memory leak, so don't bother.  */
> +    goto unsafe;
> +
> +  switch (gimple_code (stmt))
> +  // TODO: GIMPLE_ASM, EH related gimples?
> +    {
> +       /* We leave the function without calling free.  */
> +      case GIMPLE_RETURN:
> +       goto unsafe;
> +
> +       /* Statements that are irrelevant.  */
> +      case GIMPLE_ASSIGN:
> +      case GIMPLE_LABEL:
> +      case GIMPLE_NOP:
> +       /* Last stmt of the bb, handled by looking at the outgoing edges.  */
> +      case GIMPLE_GOTO:
> +      case GIMPLE_COND:
> +       // TODO: special case:  if (VAR == 0) ...
> +      case GIMPLE_SWITCH:
> +       goto next_stmt;
> +
> +      case GIMPLE_CALL:
> +       {
> +         // TODO: __builtin_(abort|trap|exit|unreachable)
> +         // should be safe as well.
> +         if (gimple_call_builtin_p (stmt, BUILT_IN_FREE)
> +             && gimple_call_arg (stmt, 0) == var)
> +           {
> +             /* Nothing could throw an exception earlier in the block,
> +                so we don't need to check the EH edges.  */
> +             list_of_frees->safe_push (stmt);
> +             goto next_bb;
> +           }
> +         int flags = gimple_call_flags (stmt);
> +         // FIXME: leaf is actually not safe, we need some new ECF_* flags.
> +         if (flags & (ECF_CONST|ECF_PURE|ECF_LEAF|ECF_NOVOPS))
> +           goto next_stmt;
> +         goto unsafe;
> +       }
> +      default:
> +       goto unsafe;
> +    }
> +
> +next_bb:
> +  if (bb_to_visit.is_empty ())
> +    goto safe;
> +  bb = bb_to_visit.last ();
> +  bb_to_visit.pop ();
> +  if (!bitmap_set_bit (bb_visited, bb->index))
> +    goto next_bb;
> +  gsi = gsi_start_bb (bb);
> +  goto handle_stmt;
> +
> +unsafe:
> +  BITMAP_FREE (bb_visited);
> +  return false;
> +safe:
> +  BITMAP_FREE (bb_visited);
> +  return true;
> +}
> +
>  /* *GSI_P is a GIMPLE_CALL to a builtin function.
>     Optimize
>     memcpy (p, "abcd", 4);
>     memset (p + 4, ' ', 3);
>     into
>     memcpy (p, "abcd   ", 7);
>     call if the latter can be stored by pieces during expansion.  */
>
>  static bool
>  simplify_builtin_call (gimple_stmt_iterator *gsi_p, tree callee2)
> @@ -1681,20 +1775,64 @@ simplify_builtin_call (gimple_stmt_itera
>               gimple_call_set_arg (stmt2, 2,
>                                    build_int_cst (TREE_TYPE (len2),
> src_len));
>               unlink_stmt_vdef (stmt1);
>               gsi_remove (&gsi, true);
>               release_defs (stmt1);
>               update_stmt (stmt2);
>               return false;
>             }
>         }
>        break;
> +
> +    case BUILT_IN_FREE:
> +      {
> +       tree size;
> +       tree ptr = gimple_call_arg (stmt2, 0);
> +       if (TREE_CODE (ptr) != SSA_NAME)
> +         return false;
> +       stmt1 = SSA_NAME_DEF_STMT (ptr);
> +       /* TODO: handle calloc.  */
> +       if (gimple_call_builtin_p (stmt1, BUILT_IN_MALLOC)
> +           && host_integerp ((size = gimple_call_arg (stmt1, 0)), 1)
> +           /* param_value(...) / 10? Some new parameter?  */
> +           && TREE_INT_CST_LOW (size)
> +              <= (unsigned HOST_WIDE_INT) PARAM_VALUE (PARAM_LARGE_STACK_FRAME))
> +         {
> +           gcc_checking_assert (gimple_call_lhs (stmt1) == ptr);
> +           stack_vec<gimple, 20> list_of_frees;
> +           if (!malloc_safe_on_stack (stmt1, &list_of_frees))
> +             return false;
> +           /* Declare array.  */
> +           tree elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1);
> +           tree array_type = build_array_type_nelts (elem_type,
> +                                                     TREE_INT_CST_LOW (size));
> +           tree var = create_tmp_var (array_type, NULL);
> +           tree p = fold_convert (TREE_TYPE (ptr), build_fold_addr_expr (var));
> +           /* Replace free with a clobber.  */
> +           int i;
> +           gimple free_stmt;
> +           FOR_EACH_VEC_ELT (list_of_frees, i, free_stmt)
> +             {
> +               tree clobber = build_constructor (array_type, NULL);
> +               TREE_THIS_VOLATILE (clobber) = 1;
> +               gimple clobber_stmt = gimple_build_assign (var, clobber);
> +               gimple_stmt_iterator gsi_free = gsi_for_stmt (free_stmt);
> +               gsi_replace (&gsi_free, clobber_stmt, false);
> +             }
> +           /* Replace malloc with the address of the variable.  */
> +           gimple_stmt_iterator gsi1 = gsi_for_stmt (stmt1);
> +           update_call_from_tree (&gsi1, p);
> +           update_stmt (gsi_stmt (gsi1));
> +           return true;
> +         }
> +
> +      }
>      default:
>        break;
>      }
>    return false;
>  }
>
>  /* Checks if expression has type of one-bit precision, or is a known
>     truth-valued expression.  */
>  static bool
>  truth_valued_ssa_name (tree name)
>
Marc Glisse Nov. 12, 2013, 12:16 a.m. UTC | #5
On Mon, 11 Nov 2013, Ondřej Bílka wrote:

> On Sun, Nov 10, 2013 at 04:27:00PM +0100, Marc Glisse wrote:
>> Hello,
>>
>> I am posting this patch to get some feedback on the approach. The
>> goal is to replace malloc+free with a stack allocation (a decl
>> actually) when the size is a small constant.
>
> Why constraint yourself to small sizes.

I am trying to get something to actually work and be accepted in gcc. That 
may mean being conservative.

> Below is a simple implementation which creates a separate stack for
> that (for simplicity and because it does not need to find bounds on
> thread stack.)

One difficulty with using a stack is that lifetimes cannot partially 
overlap, whereas with malloc+free they can. Using the main stack has the 
advantage that I don't have to think of deallocation, it happens 
automatically. And using a decl instead of alloca means that even if 
malloc+free was in a loop, not deallocating won't make me grow the stack 
use linearly in the number of iterations of the loop.

> With that extension it would be possible to mark pointer so its free
> would be a nop.

That would help indeed, but it does require a libc that I haven't seen 
yet, and it may cause trouble if people try to LD_PRELOAD a different 
malloc implementation.
Ondřej Bílka Nov. 12, 2013, 10:49 a.m. UTC | #6
On Tue, Nov 12, 2013 at 01:16:14AM +0100, Marc Glisse wrote:
> On Mon, 11 Nov 2013, Ondřej Bílka wrote:
> 
> >On Sun, Nov 10, 2013 at 04:27:00PM +0100, Marc Glisse wrote:
> >>Hello,
> >>
> >>I am posting this patch to get some feedback on the approach. The
> >>goal is to replace malloc+free with a stack allocation (a decl
> >>actually) when the size is a small constant.
> >
> >Why constraint yourself to small sizes.
> 
> I am trying to get something to actually work and be accepted in
> gcc. That may mean being conservative.
>
That also may mean that you will cover only cases where it is not needed.

A malloc implementation will have a small per-thread cache for small
requests that does not need any locking. The performance difference will be
quite small, and there may be a define that causes constant-size mallocs to
be inlined.

Sizes from 256 bytes up are the interesting case.

> >Below is a simple implementation which creates a separate stack for
> >that (for simplicity and because it does not need to find bounds on
> >thread stack.)
> 
> One difficulty with using a stack is that lifetimes cannot partially
> overlap, whereas with malloc+free they can.

Which you need to solve anyway if you want to do the conversion. You need to
pick a properly overlapping malloc+free area that best fits the given
criteria (like the one with the maximal expected memory usage.)

> Using the main stack has
> the advantage that I don't have to think of deallocation, it happens
> automatically.

And what is the logic of limiting sizes? Note that a leaf function has
higher priority for the stack than non-leaf ones, as doing a stack
allocation early may kill a lot of leaf allocations because of stack
concerns.

> And using a decl instead of alloca means that even if
> malloc+free was in a loop, not deallocating won't make me grow the
> stack use linearly in the number of iterations of the loop.
> 
Actually this looks like an orthogonal optimization of memory reuse;
you can transform

x = malloc(42);
free(x);
y = malloc(42);

to

x = malloc(42);
y = x;

It might be feasible to teach PRE to transform loops with repeated
allocation into an initial allocation and reuse.

> >With that extension it would be possible to mark pointer so its free
> >would be a nop.
> 
> That would help indeed, but it does require a libc that I haven't
> seen yet, and it may cause trouble if people try to LD_PRELOAD a
> different malloc implementation.
> 
This depends on whether we could get the information that malloc did a
stack allocation (or a constant-size allocation which could be transformed
into a different call.)

As far as a custom free is concerned, there are more usage cases. If it is
worth the complications, there is the possibility of prepending LD_PRELOAD
with custom free logic that works regardless of the allocator used.
Marc Glisse Nov. 12, 2013, 12:41 p.m. UTC | #7
On Tue, 12 Nov 2013, Ondřej Bílka wrote:

>> I am trying to get something to actually work and be accepted in
>> gcc. That may mean being conservative.
>
> That also may mean that you will cover only cases where it is not needed.
>
> A malloc will have a small per-thread cache for small requests that does
> not need any locking. A performance difference will be quite small and
> there may be a define which causes inlining constant size mallocs.
>
> Sizes from 256 bytes are interesting case.

I have to disagree here. When the allocated size is large enough, the cost 
of malloc+free often becomes small compared to whatever work you are doing 
in that array. It is when the size is very small that speeding up 
malloc+free is essential. And you are underestimating the cost of those 
small allocations.

I started on this because of an application that spends more than half of 
its time in malloc+free and where (almost) no allocation is larger than 
100 bytes. Changing the code to not use malloc/free but other allocation 
strategies is very complicated because it would break abstraction layers. 
I used various workarounds that proved rather effective, but I would have 
loved for that to be unnecessary.

>> One difficulty with using a stack is that lifetimes cannot partially
>> overlap, whereas with malloc+free they can.
>
> Which you need to solve anyway if you want to do conversion. You need to
> pick a properly overlapping malloc+free area that is best to given
> criteria (like that it has maximal expected memory usage.)

Here I am extending all lifetimes to the whole function (and implicitly 
reusing storage in loops), simplistic I know.

>> Using the main stack has the advantage that I don't have to think of 
>> deallocation, it happens automatically.
>
> And what is logic of limiting sizes?

Main stack is rather small by default. And it is nice if other variables 
on the stack are not spread over too many memory pages.

> Note that a leaf function have higher priority for stack that nonleaf 
> ones as when you do stack allocation early it may kill lot of leaf 
> allocations because of stack concerns.

Sorry, I have trouble understanding your sentence. Do you just mean that 
if there is a threshold it should be higher in leaf functions?

>> And using a decl instead of alloca means that even if
>> malloc+free was in a loop, not deallocating won't make me grow the
>> stack use linearly in the number of iterations of the loop.
>>
> Actually this looks like a orthogonal optimalization of memory reuse,

Replacing malloc+free with alloca (without save/restore of the stack) 
would be a pessimization.

> you can transform
>
> x = malloc(42);
> free(x);
> y = malloc(42);
>
> to
>
> x = malloc(42);
> y = x;
>
> It might be feasible to teach PRE to transform loops with repeated
> allocation to a initial allocation and reuse.

There are already several PRs in bugzilla. For instance PR 38318 is about 
transforming:

for(...){
   malloc
   ...
   free
}

into:

malloc
for(...){
   ...
}
free

I don't remember if your example of reusing a previous allocation is 
listed anywhere, you may want to add it if not. Those would all be very 
useful, please try to implement one of them. You may also be interested in 
optimizing the C++ dynarray class once a basic implementation is 
committed.


I understand the topic of optimizing allocations is very large, and 
powerful techniques could be developed for it. However, those bugs have 
been open for years and nobody has bothered going further than removing 
free(malloc(n)). So I think we should go with the small case that has 
known use cases. And when you come up with a more general framework that 
makes this small case irrelevant, I will gladly welcome it and we can 
easily remove the small case.
Ondřej Bílka Nov. 12, 2013, 2:50 p.m. UTC | #8
On Tue, Nov 12, 2013 at 01:41:24PM +0100, Marc Glisse wrote:
> On Tue, 12 Nov 2013, Ondřej Bílka wrote:
> 
> >>I am trying to get something to actually work and be accepted in
> >>gcc. That may mean being conservative.
> >
> >That also may mean that you will cover only cases where it is not needed.
> >
> >A malloc will have a small per-thread cache for small requests that does
> >not need any locking. A performance difference will be quite small and
> >there may be a define which causes inlining constant size mallocs.
> >
> >Sizes from 256 bytes are interesting case.
> 
> I have to disagree here. When the allocated size is large enough,
> the cost of malloc+free often becomes small compared to whatever
> work you are doing in that array. It is when the size is very small
> that speeding up malloc+free is essential. And you are
> underestimating the cost of those small allocations.
>
No, just aware that these are important and there will be optimizations
that convert these. For example:

#define malloc (s) ({ \
  static pool p; \
  if (__builtin_constant_p (s) { \
    alloc_from_pool(&p); \
  else \
    malloc (s); \
})

How will you find small constant allocations with this in place?

Also your malloc has a tradeoff between time and memory usage, and
larger allocations needlessly cause bigger fragmentation.
 
> I started on this because of an application that spends more than
> half of its time in malloc+free and where (almost) no allocation is
> larger than 100 bytes. Changing the code to not use malloc/free but
> other allocation strategies is very complicated because it would
> break abstraction layers. I used various workarounds that proved
> rather effective, but I would have loved for that to be unnecessary.
> 
See my memory pool that uses custom free functionality, where you need
only change malloc; free is handled automatically.

> >
> >And what is logic of limiting sizes?
> 
> Main stack is rather small by default. And it is nice if other
> variables on the stack are not spread over too many memory pages.

No, the question was what logic you will use to decide whether an
allocation goes on the stack or the heap?

Also, if we take the not-spreading-the-stack argument to its logical
conclusion, the best course of action is adding a separate stack and
placing arrays/allocas over 1024 bytes there.

This is theory; I tried a few benchmarks of how a large alloca on a
different stack affects performance, but I did not measure a difference.

> 
> >Note that a leaf function have higher priority for stack that
> >nonleaf ones as when you do stack allocation early it may kill lot
> >of leaf allocations because of stack concerns.
> 
> Sorry, I have trouble understanding your sentence. Do you just mean
> that if there is a threshold it should be higher in leaf functions?
> 
Yes, functions closer to a leaf function should have higher thresholds.

Generally in leaf functions you can safely use 65536 bytes, as most libc
functions assume that much stack, so a caller would hit the same problems
by calling those functions.

> >>And using a decl instead of alloca means that even if
> >>malloc+free was in a loop, not deallocating won't make me grow the
> >>stack use linearly in the number of iterations of the loop.
> >>
> >Actually this looks like an orthogonal optimization of memory reuse,
> 
> Replacing malloc+free with alloca (without save/restore of the
> stack) would be a pessimization.
> 
I did not mention alloca.

> I understand the topic of optimizing allocations is very large, and
> powerful techniques could be developed for it. However, those bugs
> have been open for years and nobody has bothered going further than
> removing free(malloc(n)). So I think we should go with the small
> case that has known use cases. And when you come up with a more
> general framework that makes this small case irrelevant, I will
> gladly welcome it and we can easily remove the small case.
> 
As this is hard to do in gcc, doing it without gcc should be easier.

In several cases (constant size) we do not need gcc involvement, as we
could get the information ourselves. In other cases the best behaviour
can be determined only by running a custom benchmarking program which
will tell which variant should go to which malloc.

Then there are parts where coordination is necessary; one is determining
if stack allocation is possible. A possible way would be to first turn
eligible malloc calls into

malloc_stack(size, color)

as a hint to the allocator. I added a color parameter to handle partial
overlap: if you build a graph with an edge wherever allocations partially
overlap and color it, then you can assign each color class its own stack
and proceed as normal.
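A toy model of the color idea (hypothetical names; no such API exists in glibc or gcc today):

```c
#include <stddef.h>

/* One bump-allocated region per color class.  Allocations whose
   lifetimes partially overlap are given different colors, so each
   region individually still sees strict LIFO alloc/free order. */
enum { NCOLORS = 2, REGION_SZ = 1024 };
static char region[NCOLORS][REGION_SZ];
static size_t used[NCOLORS];

static void *malloc_stack (size_t size, int color)
{
  void *p = region[color] + used[color];
  used[color] += size;
  return p;
}

static void free_stack (size_t size, int color)
{
  used[color] -= size;
}

int demo (void)
{
  void *a = malloc_stack (32, 0);   /* a starts */
  void *b = malloc_stack (48, 1);   /* b starts while a is live */
  free_stack (32, 0);               /* a ends first */
  free_stack (48, 1);               /* b ends last; each region stayed LIFO */
  return a != b && used[0] == 0 && used[1] == 0;
}
```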
Marc Glisse Nov. 12, 2013, 4:01 p.m. UTC | #9
On Tue, 12 Nov 2013, Ondřej Bílka wrote:

> On Tue, Nov 12, 2013 at 01:41:24PM +0100, Marc Glisse wrote:
>> On Tue, 12 Nov 2013, Ondřej Bílka wrote:
>>
>>>> I am trying to get something to actually work and be accepted in
>>>> gcc. That may mean being conservative.
>>>
>>> That also may mean that you will cover only cases where it is not needed.
>>>
>>> A malloc will have a small per-thread cache for small requests that does
>>> not need any locking. A performance difference will be quite small and
>>> there may be a define which causes inlining constant size mallocs.
>>>
>>> Sizes from 256 bytes are interesting case.
>>
>> I have to disagree here. When the allocated size is large enough,
>> the cost of malloc+free often becomes small compared to whatever
>> work you are doing in that array. It is when the size is very small
>> that speeding up malloc+free is essential. And you are
>> underestimating the cost of those small allocations.
>>
> No, just aware that these are important and there will be optimizations
> that convert these. For example:
>
> #define malloc (s) ({ \
>  static pool p; \
>  if (__builtin_constant_p (s) { \
>    alloc_from_pool(&p); \
>  else \
>    malloc (s); \
> })

Seems to be missing some bits.

> How will you find small constant allocations with this in place?

I won't. If your code is already optimized, the compiler has nothing left 
to do, that's fine. (not that I am convinced your optimization works that 
well)

>> I started on this because of an application that spends more than
>> half of its time in malloc+free and where (almost) no allocation is
>> larger than 100 bytes. Changing the code to not use malloc/free but
>> other allocation strategies is very complicated because it would
>> break abstraction layers. I used various workarounds that proved
>> rather effective, but I would have loved for that to be unnecessary.
>
> See my memory pool that uses custom free functionality where you need
> only change malloc; free is handled automatically.

Do you mean the incomplete macro above, or your STACK_ALLOC macro from the 
other post? (don't know how that one works either, "size" appears out of 
nowhere in STACK_FREE)

As I already said, I know how to write efficient code, but that's hard on 
the abstraction layers (before inlining, you have to go at least 20 layers 
up in the CFG to find a common ancestor for malloc and free), and I'd be 
happy if the compiler could help a bit in easy cases.

>>> And what is logic of limiting sizes?
>>
>> Main stack is rather small by default. And it is nice if other
>> variables on the stack are not spread over too many memory pages.
>
> No, the question was what logic you will use to decide whether an
> allocation goes on the stack or the heap?
>
> Also, if we take the not-spreading-the-stack argument to its logical
> conclusion, the best course of action is adding a separate stack and
> placing arrays/allocas over 1024 bytes there.

Then you also need to handle deallocations in there instead of relying on 
the automatic ones in the main stack. But yes, that's possible.

> As this is hard to do in gcc, doing it without gcc should be easier.

Yeah, we shouldn't bother writing an optimizing compiler, what were we 
thinking ;-)

> In several cases (constant size) we do not need gcc involvement, as we
> could get the information ourselves.

In several cases, we don't need to bother manually optimizing our code, 
gcc could notice that things are simple and optimize them itself.

> Then there are parts where coordination is necessary; one is determining
> if stack allocation is possible. A possible way would be to first turn
> eligible malloc calls into
>
> malloc_stack(size, color)
>
> as a hint to the allocator. I added a color parameter to handle partial
> overlap: if you build a graph with an edge wherever allocations partially
> overlap and color it, then you can assign each color class its own stack
> and proceed as normal.

That would be great, yes. I'll be looking forward to your patches.

(note that the limits of alias analysis mean that gcc often has no idea 
which free goes with which malloc).
Ondřej Bílka Nov. 12, 2013, 5:10 p.m. UTC | #10
On Tue, Nov 12, 2013 at 05:01:31PM +0100, Marc Glisse wrote:
> On Tue, 12 Nov 2013, Ondřej Bílka wrote:
> 
> >On Tue, Nov 12, 2013 at 01:41:24PM +0100, Marc Glisse wrote:
> >>On Tue, 12 Nov 2013, Ondřej Bílka wrote:
> >>
> >>>>I am trying to get something to actually work and be accepted in
> >>>>gcc. That may mean being conservative.
> >>>
> >>>That also may mean that you will cover only cases where it is not needed.
> >>>
> >>>A malloc will have a small per-thread cache for small requests that does
> >>>not need any locking. A performance difference will be quite small and
> >>>there may be a define which causes inlining constant size mallocs.
> >>>
> >>>Sizes from 256 bytes are interesting case.
> >>
> >>I have to disagree here. When the allocated size is large enough,
> >>the cost of malloc+free often becomes small compared to whatever
> >>work you are doing in that array. It is when the size is very small
> >>that speeding up malloc+free is essential. And you are
> >>underestimating the cost of those small allocations.
> >>
> >No, just aware that these are important and there will be optimizations
> >that convert these. For example:
> >
> >#define malloc (s) ({ \
> > static pool p; \
> > if (__builtin_constant_p (s) { \
> >   alloc_from_pool(&p); \
> > else \
> >   malloc (s); \
> >})
> 
> Seems to be missing some bits.
> 
An example; its purpose is to show an idea, not to be complete.
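For concreteness, a syntactically complete variant could look like this (GNU C statement expressions; pool and alloc_from_pool are hypothetical stand-ins for a per-thread small-object cache, and the macro is renamed my_malloc so it does not recurse into itself):

```c
#include <stdlib.h>

/* Hypothetical bump pool for small constant-size requests. */
typedef struct { char buf[4096]; size_t used; } pool;

static void *alloc_from_pool (pool *p, size_t s)
{
  void *r = p->buf + p->used;   /* no locking, no free needed */
  p->used += s;
  return r;
}

/* Constant sizes are served from a per-site static pool; everything
   else falls back to the real malloc. */
#define my_malloc(s) ({				\
  static pool p;				\
  __builtin_constant_p (s)			\
    ? alloc_from_pool (&p, (s))			\
    : malloc (s); })

int demo (void)
{
  char *a = my_malloc (16);   /* constant: served from a pool */
  char *b = my_malloc (16);   /* another expansion, another pool */
  return a != NULL && b != NULL && a != b;
}
```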

> >How will you find small constant allocations with this in place?
> 
> I won't. If your code is already optimized, the compiler has nothing
> left to do, that's fine. (not that I am convinced your optimization
> works that well)
> 
What if it decreases the running time of all constant allocations by 6%?
Converting to stack allocation would eliminate the overhead, but only at
sites that contributed 5% of the runtime.

> >>I started on this because of an application that spends more than
> >>half of its time in malloc+free and where (almost) no allocation is
> >>larger than 100 bytes. Changing the code to not use malloc/free but
> >>other allocation strategies is very complicated because it would
> >>break abstraction layers. I used various workarounds that proved
> >>rather effective, but I would have loved for that to be unnecessary.
> >
> >See my memory pool that uses custom free functionality where you need
> >only change malloc; free is handled automatically.
> 
> Do you mean the incomplete macro above, or your STACK_ALLOC macro
> from the other post? (don't know how that one works either, "size"
> appears out of nowhere in STACK_FREE)
> 
Also an example where the actual logic could be supplied later; it should
be __stack_new instead of size.

I am not talking about stack conversion but about a memory pool; a
proof-of-concept is here:
https://www.sourceware.org/ml/libc-alpha/2013-11/msg00258.html

> As I already said, I know how to write efficient code, but that's
> hard on the abstraction layers (before inlining, you have to go at
> least 20 layers up in the CFG to find a common ancestor for malloc
> and free), and I'd be happy if the compiler could help a bit in easy
> cases.
> 
This is more about using allocation libraries that are flexible enough.

> >Then there are parts where coordination is necessary; one is determining
> >if stack allocation is possible. A possible way would be to first turn
> >eligible malloc calls into
> >
> >malloc_stack(size, color)
> >
> >as a hint to the allocator. I added a color parameter to handle partial
> >overlap: if you build a graph with an edge wherever allocations partially
> >overlap and color it, then you can assign each color class its own stack
> >and proceed as normal.
> 
> That would be great, yes. I'll be looking forward to your patches.
> 
> (note that the limits of alias analysis mean that gcc often has no
> idea which free goes with which malloc).
> 
Wait, you have a free with the same SSA_NAME as the malloc and alias
analysis cannot tell which malloc corresponds to that free?


> -- 
> Marc Glisse
Marc Glisse Nov. 12, 2013, 5:36 p.m. UTC | #11
On Tue, 12 Nov 2013, Ondřej Bílka wrote:

>> Seems to be missing some bits.
>
> An example; its purpose is to show an idea, not to be complete.

I agree, but when too many bits are missing or wrong I fail to get the 
idea :-(

>>> How will you find small constant allocations with this in place?
>>
>> I won't. If your code is already optimized, the compiler has nothing
>> left to do, that's fine. (not that I am convinced your optimization
>> works that well)
>>
> What if it decreases the running time of all constant allocations by 6%?
> Converting to stack allocation would eliminate the overhead, but only at
> sites that contributed 5% of the runtime.

Then you may keep using your code, the optimization will be useless on it 
but it won't break it.

>>>> I started on this because of an application that spends more than
>>>> half of its time in malloc+free and where (almost) no allocation is
>>>> larger than 100 bytes. Changing the code to not use malloc/free but
>>>> other allocation strategies is very complicated because it would
>>>> break abstraction layers. I used various workarounds that proved
>>>> rather effective, but I would have loved for that to be unnecessary.
>>>
>>> See my memory pool that uses custom free functionality where you need
>>> only change malloc, free is handled automaticaly.
>>
>> Do you mean the incomplete macro above, or your STACK_ALLOC macro
>> from the other post? (don't know how that one works either, "size"
>> appears out of nowhere in STACK_FREE)
>>
> Also an example where the actual logic could be supplied later; it
> should be __stack_new instead of size.

Ok, then that relies on the user calling STACK_ALLOC and STACK_FREE in a 
suitably nested order. I clearly don't know enough from a C++ constructor 
to use it.

> This is more about using for allocation libraries that are flexible
> enough.

We have different targets. I am concerned with very small allocations. The 
overhead of 2 function calls is already noticeable. I don't want to pay for 
flexibility.

I believe the 2 approaches are really complementary and shouldn't hinder 
each other.

>> (note that the limits of alias analysis mean that gcc often has no
>> idea which free goes with which malloc).
>
> Wait, you have a free with the same SSA_NAME as the malloc and alias
> analysis cannot tell which malloc corresponds to that free?

Here it does, but I wanted to warn you that it isn't such a common case. 
In addition to this patch, I'll also need some alias analysis changes for 
the code I started with to be optimized.
Jeff Law Nov. 12, 2013, 6:10 p.m. UTC | #12
On 11/12/13 05:41, Marc Glisse wrote:
> On Tue, 12 Nov 2013, Ondřej Bílka wrote:
>
>>> I am trying to get something to actually work and be accepted in
>>> gcc. That may mean being conservative.
>>
>> That also may mean that you will cover only cases where it is not needed.
>>
>> A malloc will have a small per-thread cache for small requests that does
>> not need any locking. A performance difference will be quite small and
>> there may be a define which causes inlining constant size mallocs.
>>
>> Sizes from 256 bytes are interesting case.
>
> I have to disagree here. When the allocated size is large enough, the
> cost of malloc+free often becomes small compared to whatever work you
> are doing in that array. It is when the size is very small that speeding
> up malloc+free is essential. And you are underestimating the cost of
> those small allocations.
I have to agree with Marc here.  Start with something small and we can 
always look to extend it to handle more cases over time.

The whole point is to avoid going into the allocator which can be damn 
expensive.  Allocating on the stack is orders of magnitude cheaper.  If 
you can collapse down to a _DECL, then, well, that's the holy grail.

[soapbox]
And doing this kind of thing intelligently is something I strongly would 
encourage.  I can't express my disdain for folks directly using alloca. 
Programmers have consistently shown they are unable to do so in a safe 
manner.  The amount of time I have personally had to dedicate to dealing 
with the fallout from such practices is absurd.

Banning alloca in favor of malloc is something every project should 
seriously consider.  Having the ability for the compiler to 
intelligently and safely take a malloc/free pair and turn that into 
alloca or DECL reference removes one more reason for programmers to 
directly use alloca.
[/soapbox]

Jeff
diff mbox

Patch

Index: gcc/testsuite/g++.dg/tree-ssa/heapstack-1.C
===================================================================
--- gcc/testsuite/g++.dg/tree-ssa/heapstack-1.C	(revision 0)
+++ gcc/testsuite/g++.dg/tree-ssa/heapstack-1.C	(working copy)
@@ -0,0 +1,24 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+struct A {
+  void*p;
+  A(void*q) : p(q) {}
+  ~A(){ __builtin_free(p); }
+};
+void f(void*)__attribute__((__leaf__));
+void h(void*)__attribute__((__leaf__,__nothrow__));
+void g(){
+  void*p=__builtin_malloc(12);
+  A a(p);
+  f(p);
+}
+
+void i(){
+  void*p=__builtin_malloc(12);
+  h(p);
+  __builtin_free(p);
+}
+
+/* { dg-final { scan-tree-dump-not "malloc" "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */

Property changes on: gcc/testsuite/g++.dg/tree-ssa/heapstack-1.C
___________________________________________________________________
Added: svn:keywords
## -0,0 +1 ##
+Author Date Id Revision URL
\ No newline at end of property
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Index: gcc/testsuite/g++.dg/tree-ssa/heapstack-2.C
===================================================================
--- gcc/testsuite/g++.dg/tree-ssa/heapstack-2.C	(revision 0)
+++ gcc/testsuite/g++.dg/tree-ssa/heapstack-2.C	(working copy)
@@ -0,0 +1,12 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+void f(void*)__attribute__((__leaf__));
+void g(){
+  void*p=__builtin_malloc(12);
+  f(p);
+  __builtin_free(p);
+}
+
+/* { dg-final { scan-tree-dump-times "malloc" 1 "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */

Property changes on: gcc/testsuite/g++.dg/tree-ssa/heapstack-2.C
___________________________________________________________________
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Added: svn:keywords
## -0,0 +1 ##
+Author Date Id Revision URL
\ No newline at end of property
Index: gcc/testsuite/gcc.dg/tree-ssa/heapstack-1.c
===================================================================
--- gcc/testsuite/gcc.dg/tree-ssa/heapstack-1.c	(revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/heapstack-1.c	(working copy)
@@ -0,0 +1,22 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+void f(void*)__attribute__((__leaf__));
+void g(int m,int n){
+  int i;
+  void*p=__builtin_malloc(12);
+  switch(n){
+    case 1:
+      for (i=0; i<m; ++i)
+	f(p);
+      break;
+    case 2:
+      __builtin_free(p);
+      __builtin_exit(42);
+    default:;
+  }
+  __builtin_free(p);
+}
+
+/* { dg-final { scan-tree-dump-not "malloc" "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */

Property changes on: gcc/testsuite/gcc.dg/tree-ssa/heapstack-1.c
___________________________________________________________________
Added: svn:keywords
## -0,0 +1 ##
+Author Date Id Revision URL
\ No newline at end of property
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Index: gcc/testsuite/gcc.dg/tree-ssa/heapstack-2.c
===================================================================
--- gcc/testsuite/gcc.dg/tree-ssa/heapstack-2.c	(revision 0)
+++ gcc/testsuite/gcc.dg/tree-ssa/heapstack-2.c	(working copy)
@@ -0,0 +1,12 @@ 
+/* { dg-do compile } */
+/* { dg-options "-O2 -fdump-tree-optimized" } */
+
+void f(void*);
+void g(){
+  void*p=__builtin_malloc(12);
+  f(p);
+  __builtin_free(p);
+}
+
+/* { dg-final { scan-tree-dump-times "malloc" 1 "optimized" } } */
+/* { dg-final { cleanup-tree-dump "optimized" } } */

Property changes on: gcc/testsuite/gcc.dg/tree-ssa/heapstack-2.c
___________________________________________________________________
Added: svn:keywords
## -0,0 +1 ##
+Author Date Id Revision URL
\ No newline at end of property
Added: svn:eol-style
## -0,0 +1 ##
+native
\ No newline at end of property
Index: gcc/testsuite/gcc.dg/tree-ssa/pr19831-3.c
===================================================================
--- gcc/testsuite/gcc.dg/tree-ssa/pr19831-3.c	(revision 204620)
+++ gcc/testsuite/gcc.dg/tree-ssa/pr19831-3.c	(working copy)
@@ -1,31 +1,31 @@ 
 /* { dg-do compile } */
 /* { dg-options "-O2 -fdump-tree-optimized" } */
 
-void test2(void)
+void test2(int n)
 {
-  int *p = __builtin_malloc (sizeof (int) * 4);
+  int *p = __builtin_malloc (sizeof (int) * n);
   if (p == (void *)0)
     __builtin_abort ();
   __builtin_free (p);
 }
 
-void test5(int b)
+void test5(int n)
 {
-  int *p = __builtin_malloc (sizeof (int) * 4);
+  int *p = __builtin_malloc (sizeof (int) * n);
   if (p)
     __builtin_free (p);
 }
 
-void test6(void)
+void test6(int n)
 {
-  int *p = __builtin_malloc (sizeof (int) * 4);
+  int *p = __builtin_malloc (sizeof (int) * n);
   if (p == (void *)0)
     __builtin_abort ();
   if (p)
     __builtin_free (p);
 }
 
 /* We should be able to remove all malloc/free pairs with CDDCE.
    Assume p was non-NULL for test2.
    For test5, it doesn't matter if p is NULL or non-NULL.  */
 
Index: gcc/tree-ssa-forwprop.c
===================================================================
--- gcc/tree-ssa-forwprop.c	(revision 204620)
+++ gcc/tree-ssa-forwprop.c	(working copy)
@@ -33,20 +33,21 @@  along with GCC; see the file COPYING3.
 #include "tree-ssanames.h"
 #include "tree-dfa.h"
 #include "tree-pass.h"
 #include "langhooks.h"
 #include "flags.h"
 #include "expr.h"
 #include "cfgloop.h"
 #include "optabs.h"
 #include "tree-ssa-propagate.h"
 #include "tree-ssa-dom.h"
+#include "params.h"
 
 /* This pass propagates the RHS of assignment statements into use
    sites of the LHS of the assignment.  It's basically a specialized
    form of tree combination.   It is hoped all of this can disappear
    when we have a generalized tree combiner.
 
    One class of common cases we handle is forward propagating a single use
    variable into a COND_EXPR.
 
      bb0:
@@ -1478,20 +1479,113 @@  constant_pointer_difference (tree p1, tr
     }
 
   for (i = 0; i < cnt[0]; i++)
     for (j = 0; j < cnt[1]; j++)
       if (exps[0][i] == exps[1][j])
 	return size_binop (MINUS_EXPR, offs[0][i], offs[1][j]);
 
   return NULL_TREE;
 }
 
+/* We want to be sure that free (VAR) can't be called in another function, and
+   the easiest way to ensure that is to prove that it is called in this
+   function.  The hardest part is avoiding some call to a function that may not
+   return because of exit, an infinite loop, setjmp, etc, where it might call
+   free on VAR.  */
+static bool
+malloc_safe_on_stack (gimple stmt, vec<gimple, va_heap> *list_of_frees)
+{
+  tree var = gimple_call_lhs (stmt);
+  basic_block bb = gimple_bb (stmt);
+  stack_vec<basic_block, 20> bb_to_visit;
+  bitmap bb_visited = BITMAP_ALLOC (NULL);
+  bitmap_set_bit (bb_visited, bb->index);
+  gimple_stmt_iterator gsi = gsi_for_stmt (stmt);
+
+next_stmt:
+  gsi_next_nondebug (&gsi);
+
+handle_stmt:
+  if (gsi_end_p (gsi))
+    {
+      edge e;
+      edge_iterator ei;
+      FOR_EACH_EDGE (e, ei, bb->succs)
+	{
+	  bb_to_visit.safe_push (e->dest);
+	}
+      goto next_bb;
+    }
+  stmt = gsi_stmt (gsi);
+  if (stmt_can_throw_external (stmt))
+    /* We could give up only if VAR has escaped (same for return), but that
+       means there is a memory leak, so don't bother.  */
+    goto unsafe;
+
+  switch (gimple_code (stmt))
+  // TODO: GIMPLE_ASM, EH related gimples?
+    {
+	/* We leave the function without calling free.  */
+      case GIMPLE_RETURN:
+	goto unsafe;
+
+	/* Statements that are irrelevant.  */
+      case GIMPLE_ASSIGN:
+      case GIMPLE_LABEL:
+      case GIMPLE_NOP:
+	/* Last stmt of the bb, handled by looking at the outgoing edges.  */
+      case GIMPLE_GOTO:
+      case GIMPLE_COND:
+	// TODO: special case:  if (VAR == 0) ...
+      case GIMPLE_SWITCH:
+	goto next_stmt;
+
+      case GIMPLE_CALL:
+	{
+	  // TODO: __builtin_(abort|trap|exit|unreachable)
+	  // should be safe as well.
+	  if (gimple_call_builtin_p (stmt, BUILT_IN_FREE)
+	      && gimple_call_arg (stmt, 0) == var)
+	    {
+	      /* Nothing could throw an exception earlier in the block,
+		 so we don't need to check the EH edges.  */
+	      list_of_frees->safe_push (stmt);
+	      goto next_bb;
+	    }
+	  int flags = gimple_call_flags (stmt);
+	  // FIXME: leaf is actually not safe, we need some new ECF_* flags.
+	  if (flags & (ECF_CONST|ECF_PURE|ECF_LEAF|ECF_NOVOPS))
+	    goto next_stmt;
+	  goto unsafe;
+	}
+      default:
+	goto unsafe;
+    }
+
+next_bb:
+  if (bb_to_visit.is_empty ())
+    goto safe;
+  bb = bb_to_visit.last ();
+  bb_to_visit.pop ();
+  if (!bitmap_set_bit (bb_visited, bb->index))
+    goto next_bb;
+  gsi = gsi_start_bb (bb);
+  goto handle_stmt;
+
+unsafe:
+  BITMAP_FREE (bb_visited);
+  return false;
+safe:
+  BITMAP_FREE (bb_visited);
+  return true;
+}
+
 /* *GSI_P is a GIMPLE_CALL to a builtin function.
    Optimize
    memcpy (p, "abcd", 4);
    memset (p + 4, ' ', 3);
    into
    memcpy (p, "abcd   ", 7);
    call if the latter can be stored by pieces during expansion.  */
 
 static bool
 simplify_builtin_call (gimple_stmt_iterator *gsi_p, tree callee2)
@@ -1681,20 +1775,64 @@  simplify_builtin_call (gimple_stmt_itera
 	      gimple_call_set_arg (stmt2, 2,
 				   build_int_cst (TREE_TYPE (len2), src_len));
 	      unlink_stmt_vdef (stmt1);
 	      gsi_remove (&gsi, true);
 	      release_defs (stmt1);
 	      update_stmt (stmt2);
 	      return false;
 	    }
 	}
       break;
+
+    case BUILT_IN_FREE:
+      {
+	tree size;
+	tree ptr = gimple_call_arg (stmt2, 0);
+	if (TREE_CODE (ptr) != SSA_NAME)
+	  return false;
+	stmt1 = SSA_NAME_DEF_STMT (ptr);
+	/* TODO: handle calloc.  */
+	if (gimple_call_builtin_p (stmt1, BUILT_IN_MALLOC)
+	    && host_integerp ((size = gimple_call_arg (stmt1, 0)), 1)
+	    /* param_value(...) / 10? Some new parameter?  */
+	    && TREE_INT_CST_LOW (size)
+	       <= (unsigned HOST_WIDE_INT)PARAM_VALUE (PARAM_LARGE_STACK_FRAME))
+	  {
+	    gcc_checking_assert (gimple_call_lhs (stmt1) == ptr);
+	    stack_vec<gimple, 20> list_of_frees;
+	    if (!malloc_safe_on_stack (stmt1, &list_of_frees))
+	      return false;
+	    /* Declare array.  */
+	    tree elem_type = build_nonstandard_integer_type (BITS_PER_UNIT, 1);
+	    tree array_type = build_array_type_nelts (elem_type,
+						      TREE_INT_CST_LOW (size));
+	    tree var = create_tmp_var (array_type, NULL);
+	    tree p = fold_convert (TREE_TYPE (ptr), build_fold_addr_expr (var));
+	    /* Replace free with a clobber.  */
+	    int i;
+	    gimple free_stmt;
+	    FOR_EACH_VEC_ELT (list_of_frees, i, free_stmt)
+	      {
+		tree clobber = build_constructor (array_type, NULL);
+		TREE_THIS_VOLATILE (clobber) = 1;
+		gimple clobber_stmt = gimple_build_assign (var, clobber);
+		gimple_stmt_iterator gsi_free = gsi_for_stmt (free_stmt);
+		gsi_replace (&gsi_free, clobber_stmt, false);
+	      }
+	    /* Replace malloc with the address of the variable.  */
+	    gimple_stmt_iterator gsi1 = gsi_for_stmt (stmt1);
+	    update_call_from_tree (&gsi1, p);
+	    update_stmt (gsi_stmt (gsi1));
+	    return true;
+	  }
+
+      }
     default:
       break;
     }
   return false;
 }
 
 /* Checks if expression has type of one-bit precision, or is a known
    truth-valued expression.  */
 static bool
 truth_valued_ssa_name (tree name)