diff mbox

Implement -fstack-usage option

Message ID 201008051742.05446.ebotcazou@adacore.com
State New
Headers show

Commit Message

Eric Botcazou Aug. 5, 2010, 3:42 p.m. UTC
Hi,

this is again something that has been for a while in our tree.  The command 
line option -fstack-usage is added and makes the compiler generate a second 
output file (xxx.su) for each compilation unit, which contains one line per 
function emitted in the assembly file and describing the stack usage of the 
function.  The exact format is documented in the manual.  The primary goal is 
for the data to be conservatively correct, in particular to give an upper 
bound of the actual stack usage at run time.

Implementation-wise, there is nothing sophisticated, but various quirks of 
expand_call and allocate_dynamic_stack_space need to be properly handled.
It's worth noting that this is implemented only for 7 architectures (Alpha, 
x86/x86-64, IA-64, MIPS, PA, PowerPC and SPARC) but extending it to others 
should be a trivial task for the respective maintainers.

Tested on x86-64-suse-linux, OK for mainline?  If so, what form(s) of tests 
should be written?


2010-08-05  Eric Botcazou  <ebotcazou@adacore.com>

	Stack usage support
	* common.opt (-fstack-usage): New option.
	* doc/invoke.texi (Debugging options): Document it.
	* builtins.c (expand_builtin_apply): Pass TRUE as 4th argument to
	allocate_dynamic_stack_space.
	(expand_builtin_alloca): Add 4th bool parameter CANNOT_ACCUMULATE
	and propagate it to allocate_dynamic_stack_space.
	(expand_builtin) <BUILT_IN_ALLOCA>: Adjust for above change.
	* calls.c (initialize_argument_information): Pass TRUE as 4th
	argument to allocate_dynamic_stack_space.
	(expand_call): Set current_function_has_unbounded_dynamic_stack_size
	to 1 when pushing a variable-sized argument onto the stack.  Pass
	TRUE as 4th argument to allocate_dynamic_stack_space.
	Update current_function_pushed_stack_size.
	(emit_library_call_value_1): Likewise.
	* explow.c (allocate_dynamic_stack_space): Add 4th bool parameter
	CANNOT_ACCUMULATE.  If flag_stack_usage, look into the size and
	attempt to find an upper bound.  Remove redundant code for the
	SETJMP_VIA_SAVE_AREA case.
	* expr.h (allocate_dynamic_stack_space): Add 4th bool parameter.
	* function.h (struct stack_usage): New structure.
	(current_function_static_stack_size): New macro.
	(current_function_dynamic_stack_size): Likewise.
	(current_function_pushed_stack_size): Likewise.
	(current_function_dynamic_alloc_count): Likewise.
	(current_function_has_unbounded_dynamic_stack_size): Likewise.
	(current_function_allocates_dynamic_stack_space): Likewise.
	(struct function): Add new field 'su'.
	* function.c (instantiate_virtual_regs): If SETJMP_VIA_SAVE_AREA,
	add the value of the dynamic offset to the dynamic stack usage.
	(prepare_function_start): Allocate cfun->su if flag_stack_usage.
	(rest_of_handle_thread_prologue_and_epilogue): Call output_stack_usage.
	* gimplify.c (gimplify_decl_expr): Set ALLOCA_FOR_VAR_P on the call
	to BUILT_IN_ALLOCA for variable-sized objects.
	* output.h (output_stack_usage): Declare.
	* toplev.c (stack_usage_file): New file pointer.
	(output_stack_usage): New function.
	(open_auxiliary_file): Likewise.
	(lang_dependent_init): Open file if flag_stack_usage is set.
	(finalize): Close file if stack_usage_file is not null.
	* tree.h (ALLOCA_FOR_VAR_P): New macro.
	* config/alpha/alpha.c (compute_frame_size): New function.
	(alpha_expand_prologue): Use it.
	(alpha_start_function): Likewise.
	(alpha_expand_epilogue): Likewise.
	Set current_function_static_stack_size.
	* config/i386/i386.c (ix86_expand_prologue): Likewise.
	* config/ia64/ia64.c (ia64_expand_prologue): Likewise.
	* config/mips/mips.c (mips_expand_prologue): Likewise.
	* config/pa/pa.c (hppa_expand_prologue): Likewise.
	* config/rs6000/rs6000.c (rs6000_emit_prologue): Likewise.
	* config/sparc/sparc.c (sparc_expand_prologue): Likewise.

Comments

Richard Henderson Aug. 5, 2010, 7:12 p.m. UTC | #1
On 08/05/2010 08:42 AM, Eric Botcazou wrote:
> Hi,
> 
> this is again something that has been for a while in our tree.  The command 
> line option -fstack-usage is added and makes the compiler generate a second 
> output file (xxx.su) for each compilation unit, which contains one line per 
> function emitted in the assembly file and describing the stack usage of the 
> function.  The exact format is documented in the manual.  The primary goal is 
> for the data to be conservatively correct, in particular to give an upper 
> bound of the actual stack usage at run time.
> 
> Implementation-wise, there is nothing sophisticated, but various quirks of 
> expand_call and allocate_dynamic_stack_space need to be properly handled.
> It's worth noting that this is implemented only for 7 architectures (Alpha, 
> x86/x86-64, IA-64, MIPS, PA, PowerPC and SPARC) but extending it to others 
> should be a trivial task for the respective maintainers.

Most of this looks fairly sane.

I do wonder if stack re-alignment should fall under the heading of
dynamic_stack_size though.  I.e.

> Index: config/i386/i386.c
> ===================================================================
> --- config/i386/i386.c	(revision 162897)
> +++ config/i386/i386.c	(working copy)
> @@ -9588,6 +9588,27 @@ ix86_expand_prologue (void)
>      }
>    allocate = frame.stack_pointer_offset - m->fs.sp_offset;
>  
> +  if (flag_stack_usage)
> +    {
> +      /* We start to count from ARG_POINTER.  */
> +      HOST_WIDE_INT stack_size = frame.stack_pointer_offset;
> +
> +      /* If it was realigned, take into account the fake frame.  */
> +      if (stack_realign_drap)
> +	{
> +	  if (ix86_static_chain_on_stack)
> +	    stack_size += UNITS_PER_WORD;
> +
> +	  if (!call_used_regs[REGNO (crtl->drap_reg)])
> +	    stack_size += UNITS_PER_WORD;
> +
> +	  /* This takes into account the new return address slot.  */
> +	  stack_size += crtl->stack_alignment_needed / BITS_PER_UNIT;

This last line ought to add to the dynamic stack size instead.
(And you're over-estimating by 1 minimal-stack-alignment-unit.  ;-)


r~
diff mbox

Patch

Index: doc/invoke.texi
===================================================================
--- doc/invoke.texi	(revision 162897)
+++ doc/invoke.texi	(working copy)
@@ -313,7 +313,7 @@  Objective-C and Objective-C++ Dialects}.
 -fmem-report -fpre-ipa-mem-report -fpost-ipa-mem-report -fprofile-arcs @gol
 -frandom-seed=@var{string} -fsched-verbose=@var{n} @gol
 -fsel-sched-verbose -fsel-sched-dump-cfg -fsel-sched-pipelining-verbose @gol
--ftest-coverage  -ftime-report -fvar-tracking @gol
+-fstack-usage  -ftest-coverage  -ftime-report -fvar-tracking @gol
 -fvar-tracking-assignments  -fvar-tracking-assignments-toggle @gol
 -g  -g@var{level}  -gtoggle  -gcoff  -gdwarf-@var{version} @gol
 -ggdb  -gstabs  -gstabs+  -gstrict-dwarf  -gno-strict-dwarf @gol
@@ -4839,6 +4839,39 @@  allocation when it finishes.
 Makes the compiler print some statistics about permanent memory
 allocation before or after interprocedural optimization.
 
+@item -fstack-usage
+@opindex fstack-usage
+Makes the compiler output stack usage information for the program, on a
+per-function basis.  The filename for the dump is made by appending
+@file{.su} to the AUXNAME.  AUXNAME is generated from the name of
+the output file, if explicitly specified and it is not an executable,
+otherwise it is the basename of the source file.  An entry is made up
+of three fields:
+
+@itemize
+@item
+The name of the function.
+@item
+A number of bytes.
+@item
+One or more qualifiers: @code{static}, @code{dynamic}, @code{bounded}.
+@end itemize
+
+The qualifier @code{static} means that the function manipulates the stack
+statically: a fixed number of bytes are allocated for the frame on function
+entry and released on function exit; no stack adjustments are otherwise made
+in the function.  The second field is this fixed number of bytes.
+
+The qualifier @code{dynamic} means that the function manipulates the stack
+dynamically: in addition to the static allocation described above, stack
+adjustments are made in the body of the function, for example to push/pop
+arguments around function calls.  If the qualifier @code{bounded} is also
+present, the amount of these adjustments is bounded at compile-time and
+the second field is an upper bound of the total amount of stack used by
+the function.  If it is not present, the amount of these adjustments is
+not bounded at compile-time and the second field only represents the
+bounded part.
+
 @item -fprofile-arcs
 @opindex fprofile-arcs
 Add code so that program flow @dfn{arcs} are instrumented.  During
Index: tree.h
===================================================================
--- tree.h	(revision 162897)
+++ tree.h	(working copy)
@@ -511,7 +511,8 @@  struct GTY(()) tree_common {
            BLOCK
            all decls
 
-       CALL_FROM_THUNK_P in
+       CALL_FROM_THUNK_P and
+       ALLOCA_FOR_VAR_P in
            CALL_EXPR
 
    side_effects_flag:
@@ -1318,6 +1319,10 @@  extern void omp_clause_range_check_faile
    thunked-to function.  */
 #define CALL_FROM_THUNK_P(NODE) (CALL_EXPR_CHECK (NODE)->base.protected_flag)
 
+/* In a CALL_EXPR, if the function being called is BUILT_IN_ALLOCA, means that
+   it has been built for the declaration of a variable-sized object.  */
+#define ALLOCA_FOR_VAR_P(NODE) (CALL_EXPR_CHECK (NODE)->base.protected_flag)
+
 /* In a type, nonzero means that all objects of the type are guaranteed by the
    language or front-end to be properly aligned, so we can indicate that a MEM
    of this type is aligned at least to the alignment of the type, even if it
Index: builtins.c
===================================================================
--- builtins.c	(revision 162897)
+++ builtins.c	(working copy)
@@ -132,7 +132,7 @@  static rtx expand_builtin_memset (tree,
 static rtx expand_builtin_memset_args (tree, tree, tree, rtx, enum machine_mode, tree);
 static rtx expand_builtin_bzero (tree);
 static rtx expand_builtin_strlen (tree, rtx, enum machine_mode);
-static rtx expand_builtin_alloca (tree, rtx);
+static rtx expand_builtin_alloca (tree, rtx, bool);
 static rtx expand_builtin_unop (enum machine_mode, tree, rtx, rtx, optab);
 static rtx expand_builtin_frame_address (tree, tree);
 static tree stabilize_va_list_loc (location_t, tree, int);
@@ -1520,8 +1520,10 @@  expand_builtin_apply (rtx function, rtx
     emit_stack_save (SAVE_BLOCK, &old_stack_level, NULL_RTX);
 
   /* Allocate a block of memory onto the stack and copy the memory
-     arguments to the outgoing arguments address.  */
-  allocate_dynamic_stack_space (argsize, 0, BITS_PER_UNIT);
+     arguments to the outgoing arguments address.  We can pass TRUE
+     as the 4th argument because we just saved the stack pointer
+     and will restore it right after the call.  */
+  allocate_dynamic_stack_space (argsize, 0, BITS_PER_UNIT, TRUE);
 
   /* Set DRAP flag to true, even though allocate_dynamic_stack_space
      may have already set current_function_calls_alloca to true.
@@ -4881,12 +4883,13 @@  expand_builtin_frame_address (tree fndec
     }
 }
 
-/* Expand EXP, a call to the alloca builtin.  Return NULL_RTX if
-   we failed and the caller should emit a normal call, otherwise try to get
-   the result in TARGET, if convenient.  */
+/* Expand EXP, a call to the alloca builtin.  Return NULL_RTX if we
+   failed and the caller should emit a normal call, otherwise try to
+   get the result in TARGET, if convenient.  CANNOT_ACCUMULATE is the
+   same as for allocate_dynamic_stack_space. */
 
 static rtx
-expand_builtin_alloca (tree exp, rtx target)
+expand_builtin_alloca (tree exp, rtx target, bool cannot_accumulate)
 {
   rtx op0;
   rtx result;
@@ -4902,7 +4905,8 @@  expand_builtin_alloca (tree exp, rtx tar
   op0 = expand_normal (CALL_EXPR_ARG (exp, 0));
 
   /* Allocate the desired space.  */
-  result = allocate_dynamic_stack_space (op0, target, BITS_PER_UNIT);
+  result = allocate_dynamic_stack_space (op0, target, BITS_PER_UNIT,
+					 cannot_accumulate);
   result = convert_memory_address (ptr_mode, result);
 
   return result;
@@ -5937,7 +5941,9 @@  expand_builtin (tree exp, rtx target, rt
 	return XEXP (DECL_RTL (DECL_RESULT (current_function_decl)), 0);
 
     case BUILT_IN_ALLOCA:
-      target = expand_builtin_alloca (exp, target);
+      /* If the allocation stems from the declaration of a variable-sized
+	 object, it cannot accumulate.  */
+      target = expand_builtin_alloca (exp, target, ALLOCA_FOR_VAR_P (exp));
       if (target)
 	return target;
       break;
Index: toplev.c
===================================================================
--- toplev.c	(revision 162897)
+++ toplev.c	(working copy)
@@ -350,6 +350,7 @@  static const param_info lang_independent
 
 FILE *asm_out_file;
 FILE *aux_info_file;
+FILE *stack_usage_file = NULL;
 FILE *dump_file = NULL;
 const char *dump_file_name;
 
@@ -1558,6 +1559,88 @@  alloc_for_identifier_to_locale (size_t l
   return ggc_alloc_atomic (len);
 }
 
+/* Output stack usage information.  */
+void
+output_stack_usage (void)
+{
+  static bool warning_issued = false;
+  enum stack_usage_kind_type { STATIC = 0, DYNAMIC, DYNAMIC_BOUNDED };
+  const char *stack_usage_kind_str[] = {
+    "static",
+    "dynamic",
+    "dynamic,bounded"
+  };
+  HOST_WIDE_INT stack_usage = current_function_static_stack_size;
+  enum stack_usage_kind_type stack_usage_kind;
+  expanded_location loc;
+  const char *raw_id, *id;
+
+  if (stack_usage < 0)
+    {
+      if (!warning_issued)
+	{
+	  warning (0, "-fstack-usage not supported for this target");
+	  warning_issued = true;
+	}
+      return;
+    }
+
+  stack_usage_kind = STATIC;
+
+  /* Add the maximum amount of space pushed onto the stack.  */
+  if (current_function_pushed_stack_size > 0)
+    {
+      stack_usage += current_function_pushed_stack_size;
+      stack_usage_kind = DYNAMIC_BOUNDED;
+    }
+
+  /* Now on to the tricky part: dynamic stack allocation.  */
+  if (current_function_allocates_dynamic_stack_space)
+    {
+      if (current_function_has_unbounded_dynamic_stack_size)
+	stack_usage_kind = DYNAMIC;
+      else
+	stack_usage_kind = DYNAMIC_BOUNDED;
+
+      /* Add the size even in the unbounded case, this can't hurt.  */
+      stack_usage += current_function_dynamic_stack_size;
+    }
+
+  loc = expand_location (DECL_SOURCE_LOCATION (current_function_decl));
+
+  /* Strip the scope prefix if any.  */
+  raw_id = lang_hooks.decl_printable_name (current_function_decl, 2);
+  id = strrchr (raw_id, '.');
+  if (id)
+    id++;
+  else
+    id = raw_id;
+
+  fprintf (stack_usage_file,
+	   "%s:%d:%d:%s\t"HOST_WIDE_INT_PRINT_DEC"\t%s\n",
+	   basename (loc.file),
+	   loc.line,
+	   loc.column,
+	   id,
+	   stack_usage,
+	   stack_usage_kind_str[stack_usage_kind]);
+}
+
+/* Open an auxiliary output file.  */
+static FILE *
+open_auxiliary_file (const char *ext)
+{
+  char *filename;
+  FILE *file;
+
+  filename = concat (aux_base_name, ".", ext, NULL);
+  file = fopen (filename, "w");
+  if (!file)
+    fatal_error ("can't open %s for writing: %m", filename);
+  free (filename);
+  return file;
+}
+
 /* Initialization of the front end environment, before command line
    options are parsed.  Signal handlers, internationalization etc.
    ARGV0 is main's argv[0].  */
@@ -2173,6 +2256,10 @@  lang_dependent_init (const char *name)
 
   init_asm_output (name);
 
+  /* If stack usage information is desired, open the output file.  */
+  if (flag_stack_usage)
+    stack_usage_file = open_auxiliary_file ("su");
+
   /* This creates various _DECL nodes, so needs to be called after the
      front end is initialized.  */
   init_eh ();
@@ -2254,6 +2341,9 @@  finalize (void)
 	unlink_if_ordinary (asm_file_name);
     }
 
+  if (stack_usage_file)
+    fclose (stack_usage_file);
+
   statistics_fini ();
   finish_optimization_passes ();
 
Index: expr.h
===================================================================
--- expr.h	(revision 162897)
+++ expr.h	(working copy)
@@ -641,9 +641,8 @@  extern void emit_stack_restore (enum sav
 /* Invoke emit_stack_save for the nonlocal_goto_save_area.  */
 extern void update_nonlocal_goto_save_area (void);
 
-/* Allocate some space on the stack dynamically and return its address.  An rtx
-   says how many bytes.  */
-extern rtx allocate_dynamic_stack_space (rtx, rtx, int);
+/* Allocate some space on the stack dynamically and return its address.  */
+extern rtx allocate_dynamic_stack_space (rtx, rtx, int, bool);
 
 /* Emit one stack probe at ADDRESS, an address within the stack.  */
 extern void emit_stack_probe (rtx);
Index: function.c
===================================================================
--- function.c	(revision 162897)
+++ function.c	(working copy)
@@ -1899,6 +1899,18 @@  instantiate_virtual_regs (void)
   /* Indicate that, from now on, assign_stack_local should use
      frame_pointer_rtx.  */
   virtuals_instantiated = 1;
+
+  /* See allocate_dynamic_stack_space for the rationale.  */
+#ifdef SETJMP_VIA_SAVE_AREA
+  if (flag_stack_usage && cfun->calls_setjmp)
+    {
+      int align = PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT;
+      dynamic_offset = (dynamic_offset + align - 1) / align * align;
+      current_function_dynamic_stack_size
+	+= current_function_dynamic_alloc_count * dynamic_offset;
+    }
+#endif
+
   return 0;
 }
 
@@ -4334,6 +4346,12 @@  prepare_function_start (void)
   init_expr ();
   default_rtl_profile ();
 
+  if (flag_stack_usage)
+    {
+      cfun->su = ggc_alloc_cleared_stack_usage ();
+      cfun->su->static_stack_size = -1;
+    }
+
   cse_not_expected = ! optimize;
 
   /* Caller save not needed yet.  */
@@ -5722,12 +5740,17 @@  rest_of_handle_thread_prologue_and_epilo
 {
   if (optimize)
     cleanup_cfg (CLEANUP_EXPENSIVE);
+
   /* On some machines, the prologue and epilogue code, or parts thereof,
      can be represented as RTL.  Doing so lets us schedule insns between
      it and the rest of the code and also allows delayed branch
      scheduling to operate in the epilogue.  */
-
   thread_prologue_and_epilogue_insns ();
+
+  /* The stack usage info is finalized during prologue expansion.  */
+  if (flag_stack_usage)
+    output_stack_usage ();
+
   return 0;
 }
 
Index: function.h
===================================================================
--- function.h	(revision 162897)
+++ function.h	(working copy)
@@ -468,6 +468,37 @@  extern GTY(()) struct rtl_data x_rtl;
    want to do differently.  */
 #define crtl (&x_rtl)
 
+struct GTY(()) stack_usage
+{
+  /* # of bytes of static stack space allocated by the function.  */
+  HOST_WIDE_INT static_stack_size;
+
+  /* # of bytes of dynamic stack space allocated by the function.  This is
+     meaningful only if has_unbounded_dynamic_stack_size is zero.  */
+  HOST_WIDE_INT dynamic_stack_size;
+
+  /* # of bytes of space pushed onto the stack after the prologue.  If
+     !ACCUMULATE_OUTGOING_ARGS, it contains the outgoing arguments.  */
+  int pushed_stack_size;
+
+  /* # of dynamic allocations in the function.  */
+  unsigned int dynamic_alloc_count : 31;
+
+  /* Nonzero if the amount of stack space allocated dynamically cannot
+     be bounded at compile-time.  */
+  unsigned int has_unbounded_dynamic_stack_size : 1;
+};
+
+#define current_function_static_stack_size (cfun->su->static_stack_size)
+#define current_function_dynamic_stack_size (cfun->su->dynamic_stack_size)
+#define current_function_pushed_stack_size (cfun->su->pushed_stack_size)
+#define current_function_dynamic_alloc_count (cfun->su->dynamic_alloc_count)
+#define current_function_has_unbounded_dynamic_stack_size \
+  (cfun->su->has_unbounded_dynamic_stack_size)
+#define current_function_allocates_dynamic_stack_space    \
+  (current_function_dynamic_stack_size != 0               \
+   || current_function_has_unbounded_dynamic_stack_size)
+
 /* This structure can save all the important global and static variables
    describing the status of the current function.  */
 
@@ -486,6 +517,9 @@  struct GTY(()) function {
   /* The loops in this function.  */
   struct loops *x_current_loops;
 
+  /* The stack usage of this function.  */
+  struct stack_usage *su;
+
   /* Value histograms attached to particular statements.  */
   htab_t GTY((skip)) value_histograms;
 
Index: gimplify.c
===================================================================
--- gimplify.c	(revision 162897)
+++ gimplify.c	(working copy)
@@ -1329,6 +1329,8 @@  gimplify_vla_decl (tree decl, gimple_seq
 
   t = built_in_decls[BUILT_IN_ALLOCA];
   t = build_call_expr (t, 1, DECL_SIZE_UNIT (decl));
+  /* The call has been built for a variable-sized object.  */
+  ALLOCA_FOR_VAR_P (t) = 1;
   t = fold_convert (ptr_type, t);
   t = build2 (MODIFY_EXPR, TREE_TYPE (addr), addr, t);
 
Index: calls.c
===================================================================
--- calls.c	(revision 162897)
+++ calls.c	(working copy)
@@ -1091,9 +1091,13 @@  initialize_argument_information (int num
 		      pending_stack_adjust = 0;
 		    }
 
+		  /* We can pass TRUE as the 4th argument because we just
+		     saved the stack pointer and will restore it right after
+		     the call.  */
 		  copy = gen_rtx_MEM (BLKmode,
 				      allocate_dynamic_stack_space
-				      (size_rtx, NULL_RTX, TYPE_ALIGN (type)));
+				      (size_rtx, NULL_RTX,
+				       TYPE_ALIGN (type), TRUE));
 		  set_mem_attributes (copy, type, 1);
 		}
 	      else
@@ -2488,6 +2492,8 @@  expand_call (tree exp, rtx target, int i
 	      stack_arg_under_construction = 0;
 	    }
 	  argblock = push_block (ARGS_SIZE_RTX (adjusted_args_size), 0, 0);
+	  if (flag_stack_usage)
+	    current_function_has_unbounded_dynamic_stack_size = 1;
 	}
       else
 	{
@@ -2649,8 +2655,11 @@  expand_call (tree exp, rtx target, int i
 		  stack_usage_map = stack_usage_map_buf;
 		  highest_outgoing_arg_in_use = 0;
 		}
+	      /* We can pass TRUE as the 4th argument because we just
+		 saved the stack pointer and will restore it right after
+		 the call.  */
 	      allocate_dynamic_stack_space (push_size, NULL_RTX,
-					    BITS_PER_UNIT);
+					    BITS_PER_UNIT, TRUE);
 	    }
 
 	  /* If argument evaluation might modify the stack pointer,
@@ -2690,6 +2699,19 @@  expand_call (tree exp, rtx target, int i
 	 be deferred during the evaluation of the arguments.  */
       NO_DEFER_POP;
 
+      if (flag_stack_usage
+	  && !ACCUMULATE_OUTGOING_ARGS && pass && adjusted_args_size.var == 0)
+	{
+	  int pushed = adjusted_args_size.constant + pending_stack_adjust;
+
+	  /* Record the maximum pushed stack space size.  We need to
+	     delay it this far to take into account the optimization
+	     done by combine_pending_stack_adjustment_and_call.  */
+
+	  if (pushed > current_function_pushed_stack_size)
+	    current_function_pushed_stack_size = pushed;
+	}
+
       funexp = rtx_for_function_call (fndecl, addr);
 
       /* Figure out the register where the value, if any, will come back.  */
@@ -3547,6 +3569,13 @@  emit_library_call_value_1 (int retval, r
   if (args_size.constant > crtl->outgoing_args_size)
     crtl->outgoing_args_size = args_size.constant;
 
+  if (flag_stack_usage && !ACCUMULATE_OUTGOING_ARGS)
+    {
+      int pushed = args_size.constant + pending_stack_adjust;
+      if (pushed > current_function_pushed_stack_size)
+	current_function_pushed_stack_size = pushed;
+    }
+
   if (ACCUMULATE_OUTGOING_ARGS)
     {
       /* Since the stack pointer will never be pushed, it is possible for
Index: explow.c
===================================================================
--- explow.c	(revision 162897)
+++ explow.c	(working copy)
@@ -1110,11 +1110,23 @@  update_nonlocal_goto_save_area (void)
    SIZE is an rtx representing the size of the area.
    TARGET is a place in which the address can be placed.
 
-   KNOWN_ALIGN is the alignment (in bits) that we know SIZE has.  */
+   KNOWN_ALIGN is the alignment (in bits) that we know SIZE has.
+
+   If CANNOT_ACCUMULATE is set to TRUE, the caller guarantees that the
+   stack space allocated by the generated code cannot be added with itself
+   in the course of the execution of the function.  It is always safe to
+   pass FALSE here and the following criterion is sufficient in order to
+   pass TRUE: every path in the CFG that starts at the allocation point and
+   loops to it executes the associated deallocation code (that always exists
+   if the function does not use the depressed stack pointer mechanism).  */
 
 rtx
-allocate_dynamic_stack_space (rtx size, rtx target, int known_align)
+allocate_dynamic_stack_space (rtx size, rtx target, int known_align,
+			      bool cannot_accumulate)
 {
+  HOST_WIDE_INT stack_usage_size = -1;
+  bool known_align_valid = true;
+
   /* If we're asking for zero bytes, it doesn't matter what we point
      to since we can't dereference it.  But return a reasonable
      address anyway.  */
@@ -1124,6 +1136,37 @@  allocate_dynamic_stack_space (rtx size,
   /* Otherwise, show we're calling alloca or equivalent.  */
   cfun->calls_alloca = 1;
 
+  /* If stack usage info is requested, look into the size we are passed.
+     We need to do so this early to avoid the obfuscation that may be
+     introduced later by the various alignment operations.  */
+  if (flag_stack_usage)
+    {
+      if (GET_CODE (size) == CONST_INT)
+	stack_usage_size = INTVAL (size);
+      else if (GET_CODE (size) == REG)
+        {
+	  /* Look into the last emitted insn and see if we can deduce
+	     something for the register.  */
+	  rtx insn, set, note;
+	  insn = get_last_insn ();
+	  if ((set = single_set (insn)) && rtx_equal_p (SET_DEST (set), size))
+	    {
+	      if (GET_CODE (SET_SRC (set)) == CONST_INT)
+		stack_usage_size = INTVAL (SET_SRC (set));
+	      else if ((note = find_reg_equal_equiv_note (insn))
+		       && GET_CODE (XEXP (note, 0)) == CONST_INT)
+		stack_usage_size = INTVAL (XEXP (note, 0));
+	    }
+	}
+
+      /* If the size is not constant, we can't say anything.  */
+      if (stack_usage_size == -1)
+	{
+	  current_function_has_unbounded_dynamic_stack_size = 1;
+	  stack_usage_size = 0;
+	}
+    }
+
   /* Ensure the size is in the proper mode.  */
   if (GET_MODE (size) != VOIDmode && GET_MODE (size) != Pmode)
     size = convert_to_mode (Pmode, size, 1);
@@ -1153,10 +1196,17 @@  allocate_dynamic_stack_space (rtx size,
 #endif
 
   if (MUST_ALIGN)
-    size
-      = force_operand (plus_constant (size,
-				      BIGGEST_ALIGNMENT / BITS_PER_UNIT - 1),
-		       NULL_RTX);
+    {
+      size
+        = force_operand (plus_constant (size,
+					BIGGEST_ALIGNMENT / BITS_PER_UNIT - 1),
+			 NULL_RTX);
+
+      if (flag_stack_usage)
+	stack_usage_size += BIGGEST_ALIGNMENT / BITS_PER_UNIT - 1;
+
+      known_align_valid = false;
+    }
 
 #ifdef SETJMP_VIA_SAVE_AREA
   /* If setjmp restores regs from a save area in the stack frame,
@@ -1170,32 +1220,7 @@  allocate_dynamic_stack_space (rtx size,
      would use reg notes to store the "optimized" size and fix things
      up later.  These days we know this information before we ever
      start building RTL so the reg notes are unnecessary.  */
-  if (!cfun->calls_setjmp)
-    {
-      int align = PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT;
-
-      /* ??? Code below assumes that the save area needs maximal
-	 alignment.  This constraint may be too strong.  */
-      gcc_assert (PREFERRED_STACK_BOUNDARY == BIGGEST_ALIGNMENT);
-
-      if (CONST_INT_P (size))
-	{
-	  HOST_WIDE_INT new_size = INTVAL (size) / align * align;
-
-	  if (INTVAL (size) != new_size)
-	    size = GEN_INT (new_size);
-	}
-      else
-	{
-	  /* Since we know overflow is not possible, we avoid using
-	     CEIL_DIV_EXPR and use TRUNC_DIV_EXPR instead.  */
-	  size = expand_divmod (0, TRUNC_DIV_EXPR, Pmode, size,
-				GEN_INT (align), NULL_RTX, 1);
-	  size = expand_mult (Pmode, size,
-			      GEN_INT (align), NULL_RTX, 1);
-	}
-    }
-  else
+  if (cfun->calls_setjmp)
     {
       rtx dynamic_offset
 	= expand_binop (Pmode, sub_optab, virtual_stack_dynamic_rtx,
@@ -1203,6 +1228,14 @@  allocate_dynamic_stack_space (rtx size,
 
       size = expand_binop (Pmode, add_optab, size, dynamic_offset,
 			   NULL_RTX, 1, OPTAB_LIB_WIDEN);
+
+      /* The above dynamic offset cannot be computed statically at this
+	 point, but it will be possible to do so after RTL expansion is
+	 done.  Record how many times we will need to add it.  */
+      if (flag_stack_usage)
+	current_function_dynamic_alloc_count++;
+
+      known_align_valid = false;
     }
 #endif /* SETJMP_VIA_SAVE_AREA */
 
@@ -1219,13 +1252,28 @@  allocate_dynamic_stack_space (rtx size,
      insns.  Since this is an extremely rare event, we have no reliable
      way of knowing which systems have this problem.  So we avoid even
      momentarily mis-aligning the stack.  */
+  if (!known_align_valid || known_align % PREFERRED_STACK_BOUNDARY != 0)
+    {
+      size = round_push (size);
 
-  /* If we added a variable amount to SIZE,
-     we can no longer assume it is aligned.  */
-#if !defined (SETJMP_VIA_SAVE_AREA)
-  if (MUST_ALIGN || known_align % PREFERRED_STACK_BOUNDARY != 0)
-#endif
-    size = round_push (size);
+      if (flag_stack_usage)
+	{
+	  int align = PREFERRED_STACK_BOUNDARY / BITS_PER_UNIT;
+	  stack_usage_size = (stack_usage_size + align - 1) / align * align;
+	}
+    }
+
+  /* The size is supposed to be fully adjusted at this point so record it
+     if stack usage info is requested.  */
+  if (flag_stack_usage)
+    {
+      current_function_dynamic_stack_size += stack_usage_size;
+
+      /* ??? This is gross but the only safe stance in the absence
+	 of stack usage oriented flow analysis.  */
+      if (!cannot_accumulate)
+	current_function_has_unbounded_dynamic_stack_size = 1;
+    }
 
   do_pending_stack_adjust ();
 
Index: common.opt
===================================================================
--- common.opt	(revision 162897)
+++ common.opt	(working copy)
@@ -1222,6 +1222,10 @@  fstack-protector-all
 Common Report RejectNegative Var(flag_stack_protect, 2) VarExists
 Use a stack protection method for every function
 
+fstack-usage
+Common RejectNegative Var(flag_stack_usage)
+Output stack usage information on a per-function basis
+
 fstrength-reduce
 Common
 Does nothing.  Preserved for backward compatibility.
Index: output.h
===================================================================
--- output.h	(revision 162897)
+++ output.h	(working copy)
@@ -639,6 +639,9 @@  extern int maybe_assemble_visibility (tr
 
 extern int default_address_cost (rtx, bool);
 
+/* Output stack usage information.  */
+extern void output_stack_usage (void);
+
 /* dbxout helper functions */
 #if defined DBX_DEBUGGING_INFO || defined XCOFF_DEBUGGING_INFO
 
Index: config/alpha/alpha.c
===================================================================
--- config/alpha/alpha.c	(revision 162897)
+++ config/alpha/alpha.c	(working copy)
@@ -7763,6 +7763,30 @@  emit_frame_store (unsigned int regno, rt
   emit_frame_store_1 (reg, base_reg, frame_bias, base_ofs, reg);
 }
 
+/* Compute the frame size.  SIZE is the size of the "naked" frame
+   and SA_SIZE is the size of the register save area.  */
+
+static HOST_WIDE_INT
+compute_frame_size (HOST_WIDE_INT size, HOST_WIDE_INT sa_size)
+{
+  if (TARGET_ABI_OPEN_VMS)
+    return ALPHA_ROUND (sa_size 
+			+ (alpha_procedure_type == PT_STACK ? 8 : 0)
+			+ size
+			+ crtl->args.pretend_args_size);
+  else if (TARGET_ABI_UNICOSMK)
+    /* We have to allocate space for the DSIB if we generate a frame.  */
+    return ALPHA_ROUND (sa_size
+			+ (alpha_procedure_type == PT_STACK ? 48 : 0))
+	   + ALPHA_ROUND (size
+			  + crtl->outgoing_args_size);
+  else
+    return ALPHA_ROUND (crtl->outgoing_args_size)
+	   + sa_size
+	   + ALPHA_ROUND (size
+			  + crtl->args.pretend_args_size);
+}
+
 /* Write function prologue.  */
 
 /* On vms we have two kinds of functions:
@@ -7796,24 +7820,10 @@  alpha_expand_prologue (void)
   int i;
 
   sa_size = alpha_sa_size ();
+  frame_size = compute_frame_size (get_frame_size (), sa_size);
 
-  frame_size = get_frame_size ();
-  if (TARGET_ABI_OPEN_VMS)
-    frame_size = ALPHA_ROUND (sa_size
-			      + (alpha_procedure_type == PT_STACK ? 8 : 0)
-			      + frame_size
-			      + crtl->args.pretend_args_size);
-  else if (TARGET_ABI_UNICOSMK)
-    /* We have to allocate space for the DSIB if we generate a frame.  */
-    frame_size = ALPHA_ROUND (sa_size
-			      + (alpha_procedure_type == PT_STACK ? 48 : 0))
-		 + ALPHA_ROUND (frame_size
-				+ crtl->outgoing_args_size);
-  else
-    frame_size = (ALPHA_ROUND (crtl->outgoing_args_size)
-		  + sa_size
-		  + ALPHA_ROUND (frame_size
-				 + crtl->args.pretend_args_size));
+  if (flag_stack_usage)
+    current_function_static_stack_size = frame_size;
 
   if (TARGET_ABI_OPEN_VMS)
     reg_offset = 8 + 8 * cfun->machine->uses_condition_handler;
@@ -8135,23 +8145,7 @@  alpha_start_function (FILE *file, const
 
   alpha_fnname = fnname;
   sa_size = alpha_sa_size ();
-
-  frame_size = get_frame_size ();
-  if (TARGET_ABI_OPEN_VMS)
-    frame_size = ALPHA_ROUND (sa_size
-			      + (alpha_procedure_type == PT_STACK ? 8 : 0)
-			      + frame_size
-			      + crtl->args.pretend_args_size);
-  else if (TARGET_ABI_UNICOSMK)
-    frame_size = ALPHA_ROUND (sa_size
-			      + (alpha_procedure_type == PT_STACK ? 48 : 0))
-		 + ALPHA_ROUND (frame_size
-			      + crtl->outgoing_args_size);
-  else
-    frame_size = (ALPHA_ROUND (crtl->outgoing_args_size)
-		  + sa_size
-		  + ALPHA_ROUND (frame_size
-				 + crtl->args.pretend_args_size));
+  frame_size = compute_frame_size (get_frame_size (), sa_size);
 
   if (TARGET_ABI_OPEN_VMS)
     reg_offset = 8 + 8 * cfun->machine->uses_condition_handler;
@@ -8353,23 +8347,7 @@  alpha_expand_epilogue (void)
   int i;
 
   sa_size = alpha_sa_size ();
-
-  frame_size = get_frame_size ();
-  if (TARGET_ABI_OPEN_VMS)
-    frame_size = ALPHA_ROUND (sa_size
-			      + (alpha_procedure_type == PT_STACK ? 8 : 0)
-			      + frame_size
-			      + crtl->args.pretend_args_size);
-  else if (TARGET_ABI_UNICOSMK)
-    frame_size = ALPHA_ROUND (sa_size
-			      + (alpha_procedure_type == PT_STACK ? 48 : 0))
-		 + ALPHA_ROUND (frame_size
-			      + crtl->outgoing_args_size);
-  else
-    frame_size = (ALPHA_ROUND (crtl->outgoing_args_size)
-		  + sa_size
-		  + ALPHA_ROUND (frame_size
-				 + crtl->args.pretend_args_size));
+  frame_size = compute_frame_size (get_frame_size (), sa_size);
 
   if (TARGET_ABI_OPEN_VMS)
     {
Index: config/sparc/sparc.c
===================================================================
--- config/sparc/sparc.c	(revision 162897)
+++ config/sparc/sparc.c	(working copy)
@@ -4402,6 +4402,9 @@  sparc_expand_prologue (void)
   /* Advertise that the data calculated just above are now valid.  */
   sparc_prologue_data_valid_p = true;
 
+  if (flag_stack_usage)
+    current_function_static_stack_size = actual_fsize;
+
   if (flag_stack_check == STATIC_BUILTIN_STACK_CHECK && actual_fsize)
     sparc_emit_probe_stack_range (STACK_CHECK_PROTECT, actual_fsize);
 
Index: config/i386/i386.c
===================================================================
--- config/i386/i386.c	(revision 162897)
+++ config/i386/i386.c	(working copy)
@@ -9588,6 +9588,27 @@  ix86_expand_prologue (void)
     }
   allocate = frame.stack_pointer_offset - m->fs.sp_offset;
 
+  if (flag_stack_usage)
+    {
+      /* We start to count from ARG_POINTER.  */
+      HOST_WIDE_INT stack_size = frame.stack_pointer_offset;
+
+      /* If it was realigned, take into account the fake frame.  */
+      if (stack_realign_drap)
+	{
+	  if (ix86_static_chain_on_stack)
+	    stack_size += UNITS_PER_WORD;
+
+	  if (!call_used_regs[REGNO (crtl->drap_reg)])
+	    stack_size += UNITS_PER_WORD;
+
+	  /* This takes into account the new return address slot.  */
+	  stack_size += crtl->stack_alignment_needed / BITS_PER_UNIT;
+	}
+
+      current_function_static_stack_size = stack_size;
+    }
+
   /* The stack has already been decremented by the instruction calling us
      so we need to probe unconditionally to preserve the protection area.  */
   if (flag_stack_check == STATIC_BUILTIN_STACK_CHECK)
Index: config/ia64/ia64.c
===================================================================
--- config/ia64/ia64.c	(revision 162897)
+++ config/ia64/ia64.c	(working copy)
@@ -3097,6 +3097,9 @@  ia64_expand_prologue (void)
   ia64_compute_frame_size (get_frame_size ());
   last_scratch_gr_reg = 15;
 
+  if (flag_stack_usage)
+    current_function_static_stack_size = current_frame_info.total_size;
+
   if (dump_file) 
     {
       fprintf (dump_file, "ia64 frame related registers "
Index: config/rs6000/rs6000.c
===================================================================
--- config/rs6000/rs6000.c	(revision 162897)
+++ config/rs6000/rs6000.c	(working copy)
@@ -19403,6 +19403,9 @@  rs6000_emit_prologue (void)
 			      && call_used_regs[STATIC_CHAIN_REGNUM]);
   HOST_WIDE_INT sp_offset = 0;
 
+  if (flag_stack_usage)
+    current_function_static_stack_size = info->total_size;
+
   if (TARGET_FIX_AND_CONTINUE)
     {
       /* gdb on darwin arranges to forward a function from the old
Index: config/pa/pa.c
===================================================================
--- config/pa/pa.c	(revision 162897)
+++ config/pa/pa.c	(working copy)
@@ -3688,6 +3688,8 @@  hppa_expand_prologue (void)
     local_fsize += STARTING_FRAME_OFFSET;
 
   actual_fsize = compute_frame_size (size, &save_fregs);
+  if (flag_stack_usage)
+    current_function_static_stack_size = actual_fsize;
 
   /* Compute a few things we will use often.  */
   tmpreg = gen_rtx_REG (word_mode, 1);
Index: config/mips/mips.c
===================================================================
--- config/mips/mips.c	(revision 162897)
+++ config/mips/mips.c	(working copy)
@@ -10061,6 +10061,9 @@  mips_expand_prologue (void)
   frame = &cfun->machine->frame;
   size = frame->total_size;
 
+  if (flag_stack_usage)
+    current_function_static_stack_size = size;
+
   /* Save the registers.  Allocate up to MIPS_MAX_FIRST_STACK_STEP
      bytes beforehand; this is enough to cover the register save area
      without going out of range.  */