diff mbox

split FRAME variables back into pieces

Message ID 8076419.zVPaO6T1UT@polaris
State New
Headers show

Commit Message

Eric Botcazou Sept. 19, 2012, 10:58 a.m. UTC
Hi,

this transformation has been in our tree for a couple of years and was 
originally developed for SPARKSkein (http://www.skein-hash.info/node/48), 
which is the implementation in SPARK (subset of Ada) of the Skein algorithm.

When nested functions access local variables of their parent, the compiler 
creates a special FRAME local variable in the parent, which represents the 
non-local frame, and puts into it all the variables accessed non-locally.

If these nested functions are later inlined into their parent, these FRAME 
variables generally remain unmodified and this has various drawbacks:
 1) the frame of the parent is unnecessarily large,
 2) scalarization of aggregates put into the FRAME variables is hindered,
 3) debug info for scalars put into the FRAME variables is poor since VTA only 
works on GIMPLE registers.

The attached patch makes it so that the compiler splits FRAME variables back 
into pieces when all the nested functions have been inlined.  The natural 
place to implement the transformation would probably be the SRA pass, but this 
would require a special path to work around all the heuristics and the pass is 
already complicated enough (sorry Martin ;-)  The transformation is therefore 
implemented as a sub-pass of execute_update_addresses_taken for technical 
reasons exposed in the patch.

Tested on x86-64/Linux, OK for the mainline?


2012-09-19  Eric Botcazou  <ebotcazou@adacore.com>

	* tree.h (DECL_NONLOCAL_FRAME): New macro.
	* gimple.c (gimple_ior_addresses_taken_1): Handle non-local frame
	structures specially.
	* tree-nested.c (get_frame_type): Set DECL_NONLOCAL_FRAME.
	* tree-ssa.c (lookup_decl_for_field): New static function.
	(split_nonlocal_frames_op): Likewise.
	(execute_update_addresses_taken): Break up non-local frame structures
	into variables when possible.
	* tree-streamer-in.c (unpack_ts_decl_common_value_fields): Stream in
	DECL_NONLOCAL_FRAME flag.
	* tree-streamer-out.c (pack_ts_decl_common_value_fields): Stream out
	DECL_NONLOCAL_FRAME flag.


2012-09-19  Eric Botcazou  <ebotcazou@adacore.com>

	* gcc.dg/nested-func-9.c: New test.

Comments

Richard Biener Sept. 19, 2012, 11:36 a.m. UTC | #1
On Wed, Sep 19, 2012 at 12:58 PM, Eric Botcazou <ebotcazou@adacore.com> wrote:
> Hi,
>
> this transformation has been in our tree for a couple of years and was
> originally developed for SPARKSkein (http://www.skein-hash.info/node/48),
> which is the implementation in SPARK (subset of Ada) of the Skein algorithm.
>
> When nested functions access local variables of their parent, the compiler
> creates a special FRAME local variable in the parent, which represents the
> non-local frame, and puts into it all the variables accessed non-locally.
>
> If these nested functions are later inlined into their parent, these FRAME
> variables generally remain unmodified and this has various drawbacks:
>  1) the frame of the parent is unnecessarily large,
>  2) scalarization of aggregates put into the FRAME variables is hindered,
>  3) debug info for scalars put into the FRAME variables is poor since VTA only
> works on GIMPLE registers.
>
> The attached patch makes it so that the compiler splits FRAME variables back
> into pieces when all the nested functions have been inlined.  The natural
> place to implement the transformation would probably be the SRA pass, but this
> would require a special path to work around all the heuristics and the pass is
> already complicated enough (sorry Martin ;-)  The transformation is therefore
> implemented as a sub-pass of execute_update_addresses_taken for technical
> reasons exposed in the patch.
>
> Tested on x86-64/Linux, OK for the mainline?

I really don't like this to be done outside of SRA (and it is written in a
non-MEM_REF way).  For the testcase in question we scalarize back
'i' in SRA (other scalars are optimized away already, but as SRA runs
before DSE it still gets to see stores to FRAME.i).  Now I wonder
why we generate reasonable debug info even without inlining,
thus there has to be a association to the original decls with
the frame FIELD_DECLs.  That is, lookup_decl_for_field should not be necessary
and what we use for debug info generation should be used by SRA
to assign a name to scalarized fields.

That alone would not solve your issue because of the 'arr' field in
the structure which cannot be scalarized (moved to a stand-alone
decl) by SRA.  That's one missed feature of SRA though, and generally
useful.

So no, I don't think this patch is the right approach.

Thanks,
Richard.



>
> 2012-09-19  Eric Botcazou  <ebotcazou@adacore.com>
>
>         * tree.h (DECL_NONLOCAL_FRAME): New macro.
>         * gimple.c (gimple_ior_addresses_taken_1): Handle non-local frame
>         structures specially.
>         * tree-nested.c (get_frame_type): Set DECL_NONLOCAL_FRAME.
>         * tree-ssa.c (lookup_decl_for_field): New static function.
>         (split_nonlocal_frames_op): Likewise.
>         (execute_update_addresses_taken): Break up non-local frame structures
>         into variables when possible.
>         * tree-streamer-in.c (unpack_ts_decl_common_value_fields): Stream in
>         DECL_NONLOCAL_FRAME flag.
>         * tree-streamer-out.c (pack_ts_decl_common_value_fields): Stream out
>         DECL_NONLOCAL_FRAME flag.
>
>
> 2012-09-19  Eric Botcazou  <ebotcazou@adacore.com>
>
>         * gcc.dg/nested-func-9.c: New test.
>
>
> --
> Eric Botcazou
Jakub Jelinek Sept. 19, 2012, 11:50 a.m. UTC | #2
On Wed, Sep 19, 2012 at 01:36:50PM +0200, Richard Guenther wrote:
> On Wed, Sep 19, 2012 at 12:58 PM, Eric Botcazou <ebotcazou@adacore.com> wrote:
> I really don't like this to be done outside of SRA (and it is written in a
> non-MEM_REF way).  For the testcase in question we scalarize back
> 'i' in SRA (other scalars are optimized away already, but as SRA runs
> before DSE it still gets to see stores to FRAME.i).  Now I wonder
> why we generate reasonable debug info even without inlining,
> thus there has to be a association to the original decls with
> the frame FIELD_DECLs.  That is, lookup_decl_for_field should not be necessary
> and what we use for debug info generation should be used by SRA
> to assign a name to scalarized fields.

For debug_info, the nested function has VAR_DECLs with DECL_VALUE_EXPR as
FIELD_DECLs of the chain.

> That alone would not solve your issue because of the 'arr' field in
> the structure which cannot be scalarized (moved to a stand-alone
> decl) by SRA.  That's one missed feature of SRA though, and generally
> useful.

I agree that SRA is the right approach for this, perhaps with the
DECL_NONLOCAL_FRAME bit used by SRA to forcefully scalarize it into
individual pieces (that bit should basically tell the scalarizer that
valid code can't use pointer arithmetics to go from one outermost field to
another outermost field, i.e. those can be safely split appart even if the
whole thing is address taken).

	Jakub
Eric Botcazou Sept. 20, 2012, 10:56 a.m. UTC | #3
> I really don't like this to be done outside of SRA (and it is written in a
> non-MEM_REF way).

Could you elaborate on the latter point?  If it can be improved, even in its 
current form...

> For the testcase in question we scalarize back
> 'i' in SRA (other scalars are optimized away already, but as SRA runs
> before DSE it still gets to see stores to FRAME.i).  Now I wonder
> why we generate reasonable debug info even without inlining,
> thus there has to be a association to the original decls with
> the frame FIELD_DECLs.  That is, lookup_decl_for_field should not be
> necessary and what we use for debug info generation should be used by SRA
> to assign a name to scalarized fields.

The testcase is a toy example of course.

> That alone would not solve your issue because of the 'arr' field in
> the structure which cannot be scalarized (moved to a stand-alone
> decl) by SRA.  That's one missed feature of SRA though, and generally
> useful.

The improved scalarization of aggregates is the main point and what yielded 
the performance boost for SPARKSkein.

> So no, I don't think this patch is the right approach.

OK, but I came to the opposite conclusion when I first tried to do it in SRA 
and I don't think I will change my mind in the near future.  Never mind then.
Richard Biener Sept. 20, 2012, 1:24 p.m. UTC | #4
On Thu, Sep 20, 2012 at 12:56 PM, Eric Botcazou <ebotcazou@adacore.com> wrote:
>> I really don't like this to be done outside of SRA (and it is written in a
>> non-MEM_REF way).
>
> Could you elaborate on the latter point?  If it can be improved, even in its
> current form...

You rely on being able to see all FRAME accesses as component refs,
thus nothing transforms them into just MEM[&FRAME, offset].  That's of
course something that can be easily "broken" by means of doing
some pointer arithmetic like (untested, but you get the idea)

foo()
{
  int c[32];
  int j;
  bar()
  {
    int *p = &c[4];
    p = p + 1;
    j = *p;
  }
  c[4] = 0;
  bar();
  return j;
}

this should get you j = MEM[<CHAIN>, 4]; in bar and thus a missing
component-ref when inlining.

I dont' think it's easily possible to recover from this in your scheme,
but it would be straight-forward for SRA (you basically look for the
base variable FRAME and special-case that completely for
replacement construction, also constraining accesses).

>> For the testcase in question we scalarize back
>> 'i' in SRA (other scalars are optimized away already, but as SRA runs
>> before DSE it still gets to see stores to FRAME.i).  Now I wonder
>> why we generate reasonable debug info even without inlining,
>> thus there has to be a association to the original decls with
>> the frame FIELD_DECLs.  That is, lookup_decl_for_field should not be
>> necessary and what we use for debug info generation should be used by SRA
>> to assign a name to scalarized fields.
>
> The testcase is a toy example of course.

Yes, I realize that.

>> That alone would not solve your issue because of the 'arr' field in
>> the structure which cannot be scalarized (moved to a stand-alone
>> decl) by SRA.  That's one missed feature of SRA though, and generally
>> useful.
>
> The improved scalarization of aggregates is the main point and what yielded
> the performance boost for SPARKSkein.
>
>> So no, I don't think this patch is the right approach.
>
> OK, but I came to the opposite conclusion when I first tried to do it in SRA
> and I don't think I will change my mind in the near future.  Never mind then.

Marking the FRAME VAR_DECL looks useful, maybe you can split that out
of your patch?

As of doing it in SRA what I'd do there is special-case FRAME for both
candidate consideration (so you get around the addressable issue)
and replacement generation.

Maybe you can open an enhancement bugreport for this and link
your patch / testcase to it?

Thanks,
Richard.

> --
> Eric Botcazou
Eric Botcazou Sept. 21, 2012, 10:48 a.m. UTC | #5
> You rely on being able to see all FRAME accesses as component refs,
> thus nothing transforms them into just MEM[&FRAME, offset].  That's of
> course something that can be easily "broken" by means of doing
> some pointer arithmetic like (untested, but you get the idea)
> 
> foo()
> {
>   int c[32];
>   int j;
>   bar()
>   {
>     int *p = &c[4];
>     p = p + 1;
>     j = *p;
>   }
>   c[4] = 0;
>   bar();
>   return j;
> }
> 
> this should get you j = MEM[<CHAIN>, 4]; in bar and thus a missing
> component-ref when inlining.

The patch compiles hundreds of thousands of lines of Ada everyday at AdaCore, 
how could such a blatant hole have survived that?

> I dont' think it's easily possible to recover from this in your scheme,
> but it would be straight-forward for SRA (you basically look for the
> base variable FRAME and special-case that completely for
> replacement construction, also constraining accesses).

Well, it's implemented in the 30-line block of code under the comment:

+  /* Deal with remaining MEM_REFs, i.e. those for which the field reference
+     has been replaced with the offset.  */

> Marking the FRAME VAR_DECL looks useful, maybe you can split that out
> of your patch?

Sure.

> As of doing it in SRA what I'd do there is special-case FRAME for both
> candidate consideration (so you get around the addressable issue)
> and replacement generation.

OK, but you need to be able to split the FRAME structure without necessarily 
splitting its aggregate fields.  Is that (easily) doable with current SRA?

> Maybe you can open an enhancement bugreport for this and link
> your patch / testcase to it?

Will do.
Richard Biener Sept. 21, 2012, 11:32 a.m. UTC | #6
On Fri, Sep 21, 2012 at 12:48 PM, Eric Botcazou <ebotcazou@adacore.com> wrote:
>> You rely on being able to see all FRAME accesses as component refs,
>> thus nothing transforms them into just MEM[&FRAME, offset].  That's of
>> course something that can be easily "broken" by means of doing
>> some pointer arithmetic like (untested, but you get the idea)
>>
>> foo()
>> {
>>   int c[32];
>>   int j;
>>   bar()
>>   {
>>     int *p = &c[4];
>>     p = p + 1;
>>     j = *p;
>>   }
>>   c[4] = 0;
>>   bar();
>>   return j;
>> }
>>
>> this should get you j = MEM[<CHAIN>, 4]; in bar and thus a missing
>> component-ref when inlining.
>
> The patch compiles hundreds of thousands of lines of Ada everyday at AdaCore,
> how could such a blatant hole have survived that?
>
>> I dont' think it's easily possible to recover from this in your scheme,
>> but it would be straight-forward for SRA (you basically look for the
>> base variable FRAME and special-case that completely for
>> replacement construction, also constraining accesses).
>
> Well, it's implemented in the 30-line block of code under the comment:
>
> +  /* Deal with remaining MEM_REFs, i.e. those for which the field reference
> +     has been replaced with the offset.  */
>
>> Marking the FRAME VAR_DECL looks useful, maybe you can split that out
>> of your patch?
>
> Sure.
>
>> As of doing it in SRA what I'd do there is special-case FRAME for both
>> candidate consideration (so you get around the addressable issue)
>> and replacement generation.
>
> OK, but you need to be able to split the FRAME structure without necessarily
> splitting its aggregate fields.  Is that (easily) doable with current SRA?

In theory yes.  Though I'd start with trying to improve general SRA here
(because I think it doesn't do that yet), like for

struct { int i; struct { int j0; int j1; .... int j128; } s; } x;

while it can scalarize x.i it won't pull out x.s (similar for arrays).  Usually
addressability prevents that from being a valid transform, but excessive
size of sub-structures can prevent SRA from working on them as well.

The replacement machinery at least should handle it fine (not the
replacement object geneation - that only generates SSA names currently).

Likewise SRA can be extended to handle

struct { int i; int a[32]; } s;

with variable index accesses to a by that means (the accesses should
already be properly constrained - what is missing is a pass over all
structure fields matching them with accesses to see if aggregate
replacements are possible).

Yes, most of the SRA heuristic games make it complicated and ugly,
especially as it is isn't clearly separate analysis / decision / transform
phases.  TLC welcome ;)

>> Maybe you can open an enhancement bugreport for this and link
>> your patch / testcase to it?
>
> Will do.

Thanks,
Richard.

> --
> Eric Botcazou
Martin Jambor Sept. 21, 2012, 12:08 p.m. UTC | #7
Hi,

On Fri, Sep 21, 2012 at 12:48:16PM +0200, Eric Botcazou wrote:
>

...

> > As of doing it in SRA what I'd do there is special-case FRAME for both
> > candidate consideration (so you get around the addressable issue)
> > and replacement generation.
> 
> OK, but you need to be able to split the FRAME structure without necessarily 
> splitting its aggregate fields.  Is that (easily) doable with current SRA?
> 

Well, the current code assumes that the replacements are gimple
register types and, perhaps more importantly, it assumes there are no
to-be-replaced pieces within to-be-replaced pieces.  If we were to put
a structure outside of a frame structure and scalarize some field
within it at the same time... a lot of places would probably need to
be slightly re-thought.  OTOH, I assume the frame structure is never
being assigned to or read as a whole so that would simplify a lot of
things.

> > Maybe you can open an enhancement bugreport for this and link
> > your patch / testcase to it?
> 
> Will do.

Please CC me in the bug.  I probably won't be able to try anything
myself for a few weeks but I'm interested in helping.

Thanks,

Martin
Martin Jambor Sept. 21, 2012, 12:27 p.m. UTC | #8
On Fri, Sep 21, 2012 at 01:32:25PM +0200, Richard Guenther wrote:

...

> Yes, most of the SRA heuristic games make it complicated and ugly,
> especially as it is isn't clearly separate analysis / decision / transform
> phases.  TLC welcome ;)
> 

While I agree that the heuristics of SRA is surprisingly complex and a
bit ugly (but that's mostly because people complain when it does too
little or too much), the pass it is clearly divided into analysis,
decision, transform phases.  Specifically, unless a variable is
discarded because of things like volatile accesses or complex
overlapping accesses (usually through unions), all SRA decisions about
it are made in analyze_access_subtree.  All code executed before that
just gathers data for that function (yeah, including a gazillion
flags), every code executed afterwards does transformation as it was
decided and specifically cannot change ay decisiond (even when it
bumps into unions, placement new tricks etc., that's the main reason
why the transformation can be quite ugly too).

What I'd try is to create a replacement for (nearly) all roots of
access trees for a single variable (unless of course there is only one
corresponding to the whole variable) - i.e. those that
analyze_access_trees loops over.  A lot of code in the transformation
phase will still need changes but the heuristics should not actually
be a problem.

Martin
Eric Botcazou Sept. 22, 2012, 8:20 p.m. UTC | #9
> Well, the current code assumes that the replacements are gimple
> register types and, perhaps more importantly, it assumes there are no
> to-be-replaced pieces within to-be-replaced pieces.  If we were to put
> a structure outside of a frame structure and scalarize some field
> within it at the same time... a lot of places would probably need to
> be slightly re-thought.

Yes, IIRC I came to the conclusion that the splitting of the FRAME variables 
and the splitting (or not) of its aggregate fields couldn't be (easily) done 
in one pass with SRA.  After that, it's easy to understand why I wrote the 
special sub-pass instead of patching SRA.

> OTOH, I assume the frame structure is never being assigned to or read as a
> whole so that would simplify a lot of things.

Yes, that's one of the special properties of the FRAME variables, in addition 
to the addressability thing.

> Please CC me in the bug.  I probably won't be able to try anything
> myself for a few weeks but I'm interested in helping.

Sure, but, as I said, I don't plan to work on it myself either.  The current 
patch is small, localized, straightforward to understand and we haven't had a 
single bug reported for it since its inception 2 years ago, so...
diff mbox

Patch

Index: tree.h
===================================================================
--- tree.h	(revision 191365)
+++ tree.h	(working copy)
@@ -712,6 +712,9 @@  struct GTY(()) tree_base {
 
        SSA_NAME_IS_DEFAULT_DEF in
            SSA_NAME
+
+       DECL_NONLOCAL_FRAME in
+	   VAR_DECL
 */
 
 struct GTY(()) tree_typed {
@@ -3268,9 +3271,14 @@  extern void decl_fini_priority_insert (t
    libraries.  */
 #define MAX_RESERVED_INIT_PRIORITY 100
 
+/* In a VAR_DECL, nonzero if this is a global variable for VOPs.  */
 #define VAR_DECL_IS_VIRTUAL_OPERAND(NODE) \
   (VAR_DECL_CHECK (NODE)->base.u.bits.saturating_flag)
 
+/* In a VAR_DECL, nonzero if this is a non-local frame structure.  */
+#define DECL_NONLOCAL_FRAME(NODE)  \
+  (VAR_DECL_CHECK (NODE)->base.default_def_flag)
+
 struct GTY(()) tree_var_decl {
   struct tree_decl_with_vis common;
 };
Index: tree-streamer-out.c
===================================================================
--- tree-streamer-out.c	(revision 191365)
+++ tree-streamer-out.c	(working copy)
@@ -181,6 +181,9 @@  pack_ts_decl_common_value_fields (struct
       bp_pack_value (bp, expr->decl_common.off_align, 8);
     }
 
+  if (TREE_CODE (expr) == VAR_DECL)
+    bp_pack_value (bp, DECL_NONLOCAL_FRAME (expr), 1);
+
   if (TREE_CODE (expr) == RESULT_DECL
       || TREE_CODE (expr) == PARM_DECL
       || TREE_CODE (expr) == VAR_DECL)
Index: tree-nested.c
===================================================================
--- tree-nested.c	(revision 191365)
+++ tree-nested.c	(working copy)
@@ -235,6 +235,7 @@  get_frame_type (struct nesting_info *inf
 
       info->frame_type = type;
       info->frame_decl = create_tmp_var_for (info, type, "FRAME");
+      DECL_NONLOCAL_FRAME (info->frame_decl) = 1;
 
       /* ??? Always make it addressable for now, since it is meant to
 	 be pointed to by the static chain pointer.  This pessimizes
Index: tree-ssa.c
===================================================================
--- tree-ssa.c	(revision 191365)
+++ tree-ssa.c	(working copy)
@@ -1930,6 +1930,152 @@  maybe_optimize_var (tree var, bitmap add
     }
 }
 
+
+struct walk_info_data
+{
+  /* Map of fields in non-local frame structures to variables.  */
+  struct pointer_map_t *field_map;
+
+  /* Bitmap of variables whose address is taken.  */
+  bitmap addresses_taken;
+
+  /* Bitmap of variables to be renamed.  */
+  bitmap suitable_for_renaming;
+};
+
+/* Given FIELD, a field in a non-local frame structure, find or create a
+   variable in the current function and register it with MAP.  This is
+   the reverse function of tree-nested.c:lookup_field_for_decl.  */
+
+static tree
+lookup_decl_for_field (tree field, struct pointer_map_t *map,
+		       bitmap suitable_for_renaming)
+{
+  void **slot = pointer_map_insert (map, field);
+  if (!*slot)
+    {
+      tree decl =  build_decl (DECL_SOURCE_LOCATION (field), VAR_DECL,
+			       DECL_NAME (field), TREE_TYPE (field));
+      /* Some targets limit alignment of fields to a lower boundary than
+         that of variables unless it was overridden by attribute aligned.  */
+      if (DECL_USER_ALIGN (field))
+	{
+	  DECL_ALIGN (decl) = DECL_ALIGN (field);
+	  DECL_USER_ALIGN (decl) = 1;
+	}
+      else
+	DECL_ALIGN (decl) = TYPE_ALIGN (TREE_TYPE (field));
+      TREE_ADDRESSABLE (decl) = !DECL_NONADDRESSABLE_P (field);
+      TREE_THIS_VOLATILE (decl) = TREE_THIS_VOLATILE (field);
+      TREE_USED (decl) = 1;
+      gimple_add_tmp_var (decl);
+      if (!TREE_ADDRESSABLE (decl) && is_gimple_reg (decl))
+	bitmap_set_bit (suitable_for_renaming, DECL_UID (decl));
+      *slot = decl;
+    }
+
+  return (tree) *slot;
+}
+
+/* Callback for walk_gimple_op.  Replace component and memory references to
+   non-local frame structures pointed to by TP with individual variables.  */
+
+static tree
+split_nonlocal_frames_op (tree *tp, int *walk_subtrees, void *info)
+{
+  struct walk_info_data *data
+    = (struct walk_info_data *) ((struct walk_stmt_info *)info)->info;
+  bool address_taken;
+
+  if (TREE_CODE (*tp) == ADDR_EXPR)
+    {
+      tp = &TREE_OPERAND (*tp, 0);
+      address_taken = true;
+    }
+  else if (TREE_CODE (*tp) == TARGET_MEM_REF
+	   && TREE_CODE (TMR_BASE (*tp)) == ADDR_EXPR)
+    {
+      tp = &TREE_OPERAND (TMR_BASE (*tp), 0);
+      address_taken = true;
+    }
+  else if (TREE_CODE (*tp) == OBJ_TYPE_REF
+	   && TREE_CODE (OBJ_TYPE_REF_OBJECT (*tp)) == ADDR_EXPR)
+    {
+      tp = &TREE_OPERAND (OBJ_TYPE_REF_OBJECT (*tp), 0);
+      address_taken = true;
+    }
+  else
+    address_taken = false;
+
+  for (; handled_component_p (*tp); tp = &TREE_OPERAND (*tp, 0))
+    if (TREE_CODE (*tp) == COMPONENT_REF)
+      {
+	tree base = TREE_OPERAND (*tp, 0);
+
+	/* Deal with transparent MEM_REFs around the base.  */
+	if (TREE_CODE (base) == MEM_REF
+	    && TREE_CODE (TREE_OPERAND (base, 0)) == ADDR_EXPR
+	    && DECL_CONTEXT (TREE_OPERAND (*tp, 1))
+	       == TYPE_MAIN_VARIANT
+		  (TREE_TYPE (TREE_OPERAND (TREE_OPERAND (base, 0), 0))))
+	  {
+	    gcc_assert (integer_zerop (TREE_OPERAND (base, 1)));
+	    base = TREE_OPERAND (TREE_OPERAND (base, 0), 0);
+	  }
+
+	if (TREE_CODE (base) == VAR_DECL
+	    && DECL_NONLOCAL_FRAME (base)
+	    && !bitmap_bit_p (data->addresses_taken, DECL_UID (base)))
+	  {
+	    tree decl
+	      = lookup_decl_for_field (TREE_OPERAND (*tp, 1), data->field_map,
+				       data->suitable_for_renaming);
+	    if (address_taken)
+	      bitmap_set_bit (data->addresses_taken, DECL_UID (decl));
+	    *tp = decl;
+	    break;
+	  }
+      }
+
+  /* Deal with remaining MEM_REFs, i.e. those for which the field reference
+     has been replaced with the offset.  */
+  if (TREE_CODE (*tp) == MEM_REF
+      && TREE_CODE (TREE_OPERAND (*tp, 0)) == ADDR_EXPR)
+    {
+      tree base = TREE_OPERAND (TREE_OPERAND (*tp, 0), 0);
+
+      if (TREE_CODE (base) == VAR_DECL
+	  && DECL_NONLOCAL_FRAME (base)
+	  && !bitmap_bit_p (data->addresses_taken, DECL_UID (base)))
+	{
+	  tree frame_type = TREE_TYPE (base);
+	  tree offset = TREE_OPERAND (*tp, 1);
+	  tree field, next, decl, addr;
+
+	  for (field = TYPE_FIELDS (frame_type), next = DECL_CHAIN (field);
+	       next;
+	       field = next, next = DECL_CHAIN (next))
+	    if (tree_int_cst_lt (offset, byte_position (next)))
+	      break;
+
+	  decl = lookup_decl_for_field (field, data->field_map,
+					data->suitable_for_renaming);
+	  if (address_taken)
+	    bitmap_set_bit (data->addresses_taken, DECL_UID (decl));
+
+	  addr = build_fold_addr_expr (decl);
+	  if (TREE_TYPE (offset) == TREE_TYPE (TREE_OPERAND (*tp, 0)))
+	    offset = fold_convert (TREE_TYPE (addr), offset);
+	  *tp = fold_build2 (MEM_REF, TREE_TYPE (*tp), addr,
+			     int_const_binop (MINUS_EXPR, offset,
+					      byte_position (field)));
+	}
+    }
+
+  *walk_subtrees = 0;
+  return NULL_TREE;
+}
+
 /* Compute TREE_ADDRESSABLE and DECL_GIMPLE_REG_P for local variables.  */
 
 void
@@ -1940,6 +2086,7 @@  execute_update_addresses_taken (void)
   bitmap addresses_taken = BITMAP_ALLOC (NULL);
   bitmap not_reg_needs = BITMAP_ALLOC (NULL);
   bitmap suitable_for_renaming = BITMAP_ALLOC (NULL);
+  bool split_nonlocal_frames = false;
   tree var;
   unsigned i;
 
@@ -2034,6 +2181,38 @@  execute_update_addresses_taken (void)
 	}
     }
 
+  /* If non-local frame structures don't have their address taken as a whole,
+     they can be broken up back into variables since they are never accessed
+     directly as a whole.  */
+  FOR_EACH_VEC_ELT (tree, cfun->local_decls, i, var)
+    if (TREE_CODE (var) == VAR_DECL
+	&& DECL_NONLOCAL_FRAME (var)
+	&& !bitmap_bit_p (addresses_taken, DECL_UID (var)))
+      {
+	split_nonlocal_frames = true;
+	break;
+      }
+
+  /* Break up non-local frame structures when possible.  This will in turn
+     expose more scalarization opportunities for subsequent SRA passes.  */
+  if (split_nonlocal_frames)
+    {
+      struct walk_stmt_info wi;
+      struct walk_info_data data;
+
+      data.field_map = pointer_map_create ();
+      data.addresses_taken = addresses_taken;
+      data.suitable_for_renaming = suitable_for_renaming;
+      memset (&wi, 0, sizeof (wi));
+      wi.info = &data;
+
+      FOR_EACH_BB (bb)
+	for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
+	  walk_gimple_op (gsi_stmt (gsi), split_nonlocal_frames_op, &wi);
+
+      pointer_map_destroy (data.field_map);
+    }
+
   /* We cannot iterate over all referenced vars because that can contain
      unused vars from BLOCK trees, which causes code generation differences
      for -g vs. -g0.  */
Index: tree-streamer-in.c
===================================================================
--- tree-streamer-in.c	(revision 191365)
+++ tree-streamer-in.c	(working copy)
@@ -217,6 +217,9 @@  unpack_ts_decl_common_value_fields (stru
       expr->decl_common.off_align = bp_unpack_value (bp, 8);
     }
 
+  if (TREE_CODE (expr) == VAR_DECL)
+    DECL_NONLOCAL_FRAME (expr) = (unsigned) bp_unpack_value (bp, 1);
+
   if (TREE_CODE (expr) == RESULT_DECL
       || TREE_CODE (expr) == PARM_DECL
       || TREE_CODE (expr) == VAR_DECL)
Index: gimple.c
===================================================================
--- gimple.c	(revision 191365)
+++ gimple.c	(working copy)
@@ -4071,11 +4071,16 @@  gimple_ior_addresses_taken_1 (gimple stm
 			      tree addr, void *data)
 {
   bitmap addresses_taken = (bitmap)data;
-  addr = get_base_address (addr);
-  if (addr
-      && DECL_P (addr))
+  tree base = get_base_address (addr);
+  if (base && DECL_P (base))
     {
-      bitmap_set_bit (addresses_taken, DECL_UID (addr));
+      /* Require the address of the whole object to be taken for non-local
+	 frame structures.  */
+      if (TREE_CODE (base) == VAR_DECL
+	  && DECL_NONLOCAL_FRAME (base)
+	  && base != addr)
+	return false;
+      bitmap_set_bit (addresses_taken, DECL_UID (base));
       return true;
     }
   return false;