Patchwork [RFC] Implement load sinking in loops

Submitter Eric Botcazou
Date Oct. 8, 2012, 10:38 a.m.
Message ID <5318138.62Wp5QHEft@polaris>
Permalink /patch/189986/
State New

Comments

Eric Botcazou - Oct. 8, 2012, 10:38 a.m.
Hi,

we recently noticed that, even at -O3, the compiler doesn't figure out that 
the following loop is dumb:

#define SIZE 64

int foo (int v[])
{
  int i, r;

  for (i = 0; i < SIZE; i++)
    r = v[i];

  return r;
}

which was a bit of a surprise.  On second thoughts, this isn't entirely 
unexpected, as it probably matters only for (slightly) pathological cases.
The attached patch nevertheless implements a form of load sinking in loops so 
as to optimize these cases.  It's combined with invariant motion to optimize:

int foo (int v[], int a)
{
  int r, i;

  for (i = 0; i < SIZE; i++)
    r = v[i] + a;

  return r;
}

and with store sinking to optimize:

int foo (int v1[], int v2[])
{
  int r[SIZE];
  int i, j;

  for (j = 0; j < SIZE; j++)
    for (i = 0; i < SIZE; i++)
      r[j] = v1[j] + v2[i];

  return r[SIZE - 1];
}
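To make the intent concrete, here is a hand-written sketch (not the patch's actual output) of what load sinking aims to produce for the first example: the load disappears from the loop body and a single load is emitted on the exit edge.

```c
#define SIZE 64

/* Original form: loads v[i] on every iteration, but only the last
   loaded value survives the loop.  */
int foo_orig (int v[])
{
  int i, r;

  for (i = 0; i < SIZE; i++)
    r = v[i];

  return r;
}

/* Sketch of the sunk form: the loop body becomes empty (and can then
   be deleted entirely by later passes); the one load that matters is
   performed on the loop exit.  */
int foo_sunk (int v[])
{
  return v[SIZE - 1];
}
```

Both functions compute the same value for any readable v of at least SIZE elements; the point of the patch is that GCC should derive the second form from the first.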

The optimization is enabled at -O2 in the patch for measurement purposes but, 
given how rarely it triggers (e.g. exactly 10 occurrences in a GCC bootstrap, 
compiler-only, all languages except Go), it's probably best suited to -O3.
Or perhaps we don't care and it should simply be dropped...  Thoughts?

Tested on x86_64-suse-linux.


2012-10-08  Eric Botcazou  <ebotcazou@adacore.com>

	* gimple.h (gsi_insert_seq_on_edge_before): Declare.
	* gimple-iterator.c (gsi_insert_seq_on_edge_before): New function.
	* tree-ssa-loop-im.c (struct mem_ref_loc): Add LHS field.
	(mem_ref_in_stmt): Remove gcc_assert.
	(copy_load_and_single_use_chain): New function.
	(execute_lm): Likewise.
	(hoist_memory_references): Hoist the loads after the stores.
	(ref_always_accessed_p): Rename into...
	(ref_always_stored_p): ...this.  Remove STORE_P and add ONCE_P.
	(can_lsm_ref_p): New function extracted from...
	(can_sm_ref_p): ...here.  Call it.
	(follow_invariant_single_use_chain): New function.
	(can_lm_ref_p): Likewise.
	(find_refs_for_sm): Rename into...
	(find_refs_for_lsm): ...this.  Find load hoisting opportunities.
	(loop_suitable_for_sm): Rename into...
	(loop_suitable_for_lsm): ...this.
	(store_motion_loop): Rename into...
	(load_store_motion_loop): ...this.  Adjust calls to above functions.
	(tree_ssa_lim): Likewise.


2012-10-08  Eric Botcazou  <ebotcazou@adacore.com>

	* gcc.dg/tree-ssa/loadmotion-1.c: New test.
	* gcc.dg/tree-ssa/loadmotion-2.c: New test.
	* gcc.dg/tree-ssa/loadmotion-3.c: New test.
Richard Guenther - Oct. 8, 2012, 12:32 p.m.
On Mon, Oct 8, 2012 at 12:38 PM, Eric Botcazou <ebotcazou@adacore.com> wrote:
> Hi,
>
> we recently noticed that, even at -O3, the compiler doesn't figure out that
> the following loop is dumb:
>
> #define SIZE 64
>
> int foo (int v[])
> {
>   int i, r;
>
>   for (i = 0; i < SIZE; i++)
>     r = v[i];
>
>   return r;
> }
>
> which was a bit of a surprise.  On second thoughts, this isn't entirely
> unexpected, as it probably matters only for (slightly) pathological cases.
> The attached patch nevertheless implements a form of load sinking in loops so
> as to optimize these cases.  It's combined with invariant motion to optimize:
>
> int foo (int v[], int a)
> {
>   int r, i;
>
>   for (i = 0; i < SIZE; i++)
>     r = v[i] + a;
>
>   return r;
> }
>
> and with store sinking to optimize:
>
> int foo (int v1[], int v2[])
> {
>   int r[SIZE];
>   int i, j;
>
>   for (j = 0; j < SIZE; j++)
>     for (i = 0; i < SIZE; i++)
>       r[j] = v1[j] + v2[i];
>
>   return r[SIZE - 1];
> }
>
> The optimization is enabled at -O2 in the patch for measurement purposes but,
> given how rarely it triggers (e.g. exactly 10 occurrences in a GCC bootstrap,
> compiler-only, all languages except Go), it's probably best suited to -O3.
> Or perhaps we don't care and it should simply be dropped...  Thoughts?

Incidentally we have scev-const-prop to deal with the similar case of
scalar computations.  But I realize this doesn't work for expressions that
are dependent on a loop-variant load.
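For reference, the scalar case that scev-const-prop (final value replacement) already handles looks like this; the final value of r has a closed form in the iteration count, so the loop can be deleted (illustrative example, not taken from the patch):

```c
#define SIZE 64

/* scev-const-prop can rewrite this to "return 2 * (SIZE - 1);" since
   r follows a simple linear evolution of the induction variable.
   Once r instead depends on a load such as v[i], no closed form
   exists and the pass gives up.  */
int scalar_final_value (void)
{
  int i, r = 0;

  for (i = 0; i < SIZE; i++)
    r = 2 * i;

  return r;
}
```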

@@ -103,6 +103,7 @@ typedef struct mem_ref_loc
 {
   tree *ref;                   /* The reference itself.  */
   gimple stmt;                 /* The statement in that it occurs.  */
+  tree lhs;                    /* The (ultimate) LHS for a load.  */
 } *mem_ref_loc_p;

isn't that the lhs of stmt?

+static gimple_seq
+copy_load_and_single_use_chain (mem_ref_loc_p loc, tree *new_lhs)
+{
+  tree mem = *loc->ref;
+  tree lhs, tmp_var, ssa_name;
+  gimple_seq seq = NULL;
+  gimple stmt;
+  unsigned n = 0;
+
+  /* First copy the load and create the new LHS for it.  */
+  lhs = gimple_assign_lhs (loc->stmt);
+  tmp_var = create_tmp_reg (TREE_TYPE (lhs), get_lsm_tmp_name (mem, n++));

use make_temp_ssa_name or simply copy_ssa_name (not sure you need
fancy names here).

+      if (gimple_assign_rhs1 (use_stmt) == lhs)
+       {
+         op1 = ssa_name;
+         op2 = gimple_assign_rhs2 (use_stmt);
+       }
+      else
+       {
+         op1 = gimple_assign_rhs1 (use_stmt);
+         op2 = ssa_name;
+       }

this may enlarge the lifetime of the other operand?  And it looks like it
would break with unary stmts (accessing out-of-bounds op2).  Also, for an
is_gimple_min_invariant other operand, which may be for example &a.b,
you need to unshare_expr it.

+      lhs = gimple_assign_lhs (use_stmt);
+      tmp_var = create_tmp_reg (TREE_TYPE (lhs), get_lsm_tmp_name (mem, n++));
+      stmt = gimple_build_assign_with_ops (rhs_code, tmp_var, op1, op2);
+      ssa_name = make_ssa_name (tmp_var, stmt);
+      gimple_assign_set_lhs (stmt, ssa_name);

see above.  This can now be simplified to

   lhs = gimple_assign_lhs (use_stmt);
   ssa_name = copy_ssa_name (lhs, NULL);
   stmt = gimple_build_assign_with_ops (rhs_code, ssa_name, op1, op2);

Btw - isn't this all a bit backward (I mean the analysis in execute_lm)?
What you want is to apply this transform to as many of the _DEF_s of
the loop-closed PHI nodes as possible - only values used outside of the
loop are interesting.  That's (sort-of) what SCEV const-prop does (well,
it also uses SCEV to compute the overall effect of the iterations).  So
what you want to know is whether, when walking the DEF chain of the
loop-closed PHI, you end up at definitions before the loop or at
definitions that are not otherwise used inside the loop.

Which means it is really expression sinking.  Does tree-ssa-sink manage
to sink anything out of a loop?  Even scalar computation parts I mean?  For

 for (..)
   {
     a = x[i];
     y[i] = a;
     b = a * 2;
   }
  ... = b;

it should be able to sink b = a*2.

So I think the more natural place to implement this is either SCEV cprop
or tree-ssa-sink.c.  And process things from the loop-closed PHI uses,
walking the DEFs (first process all of them, marking interesting things,
to also catch exprs commonly used by two PHI uses).

Again you might simply want to open a bug report for this unless you
want to implement it yourself.

Thanks,
Richard.

Richard Guenther - Oct. 8, 2012, 12:37 p.m.
On Mon, Oct 8, 2012 at 2:32 PM, Richard Guenther
<richard.guenther@gmail.com> wrote:
> Which means it is really expression sinking.  Does tree-ssa-sink manage
> to sink anything out of a loop?  Even scalar computation parts I mean?  For
>
>  for (..)
>    {
>      a = x[i];
>      y[i] = a;
>      b = a * 2;
>    }
>   ... = b;
>
> it should be able to sink b = a*2.
>
> So I think the more natural place to implement this is either SCEV cprop
> or tree-ssa-sink.c.  And process things from the loop-closed PHI use
> walking the DEFs (first process all, marking interesting things to also
> catch commonly used exprs for two PHI uses).
>
> Again you might simply want to open a bugreport for this unless you
> want to implement it yourself.

We indeed sink 2*tem but not a[i] here, because tree-ssa-sink.c doesn't
sink loads at all (IIRC), though I've seen patches to fix that.

int a[256];
int foo (int x)
{
  int i, k = 0;
  for (i = 0; i < x; ++i)
    {
      int tem = a[i];
      k = 2*tem;
    }
  return k;
}
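A hand-written sketch of what that sinking produces for the loop above: the multiply moves past the loop to the single point where k is used, while the load itself stays in the body (function name is illustrative):

```c
int a[256];

/* Sketch of the result after tree-ssa-sink: only the scalar part
   k = 2*tem is sunk out of the loop; the load a[i] still executes
   on every iteration.  */
int foo_after_sink (int x)
{
  int i, tem = 0;

  for (i = 0; i < x; ++i)
    tem = a[i];       /* load left in place */

  return 2 * tem;     /* multiply sunk to the use point after the loop */
}
```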

Richard.


Patch

Index: gimple.h
===================================================================
--- gimple.h	(revision 192137)
+++ gimple.h	(working copy)
@@ -5196,6 +5196,7 @@  void gsi_move_before (gimple_stmt_iterat
 void gsi_move_to_bb_end (gimple_stmt_iterator *, basic_block);
 void gsi_insert_on_edge (edge, gimple);
 void gsi_insert_seq_on_edge (edge, gimple_seq);
+void gsi_insert_seq_on_edge_before (edge, gimple_seq);
 basic_block gsi_insert_on_edge_immediate (edge, gimple);
 basic_block gsi_insert_seq_on_edge_immediate (edge, gimple_seq);
 void gsi_commit_one_edge_insert (edge, basic_block *);
Index: gimple-iterator.c
===================================================================
--- gimple-iterator.c	(revision 192137)
+++ gimple-iterator.c	(working copy)
@@ -677,6 +677,16 @@  gsi_insert_seq_on_edge (edge e, gimple_s
   gimple_seq_add_seq (&PENDING_STMT (e), seq);
 }
 
+/* Likewise, but prepend SEQ instead of appending it.  */
+
+void
+gsi_insert_seq_on_edge_before (edge e, gimple_seq seq)
+{
+  gimple_seq pending = NULL;
+  gimple_seq_add_seq (&pending, seq);
+  gimple_seq_add_seq (&pending, PENDING_STMT (e));
+  PENDING_STMT (e) = pending;
+}
 
 /* Insert the statement pointed-to by GSI into edge E.  Every attempt
    is made to place the statement in an existing basic block, but
Index: tree-ssa-loop-im.c
===================================================================
--- tree-ssa-loop-im.c	(revision 192137)
+++ tree-ssa-loop-im.c	(working copy)
@@ -103,6 +103,7 @@  typedef struct mem_ref_loc
 {
   tree *ref;			/* The reference itself.  */
   gimple stmt;			/* The statement in that it occurs.  */
+  tree lhs;			/* The (ultimate) LHS for a load.  */
 } *mem_ref_loc_p;
 
 DEF_VEC_P(mem_ref_loc_p);
@@ -674,7 +675,6 @@  mem_ref_in_stmt (gimple stmt)
 
   if (!mem)
     return NULL;
-  gcc_assert (!store);
 
   hash = iterative_hash_expr (*mem, 0);
   ref = (mem_ref_p) htab_find_with_hash (memory_accesses.refs, *mem, hash);
@@ -2192,6 +2192,140 @@  execute_sm (struct loop *loop, VEC (edge
       execute_sm_if_changed (ex, ref->mem, tmp_var, store_flag);
 }
 
+/* Copy the load and the chain of single uses described by LOC and return the
+   sequence of new statements.  Also set NEW_LHS to the copy of LOC->LHS.  */
+
+static gimple_seq
+copy_load_and_single_use_chain (mem_ref_loc_p loc, tree *new_lhs)
+{
+  tree mem = *loc->ref;
+  tree lhs, tmp_var, ssa_name;
+  gimple_seq seq = NULL;
+  gimple stmt;
+  unsigned n = 0;
+
+  /* First copy the load and create the new LHS for it.  */
+  lhs = gimple_assign_lhs (loc->stmt);
+  tmp_var = create_tmp_reg (TREE_TYPE (lhs), get_lsm_tmp_name (mem, n++));
+  stmt = gimple_build_assign (tmp_var, unshare_expr (mem));
+  ssa_name = make_ssa_name (tmp_var, stmt);	
+  gimple_assign_set_lhs (stmt, ssa_name);
+  gimple_seq_add_stmt (&seq, stmt);
+
+  /* Then follow the single-use chain and repeat the operation.  */
+  while (lhs != loc->lhs)
+    {
+      tree op1, op2;
+      use_operand_p use;
+      gimple use_stmt;
+      bool ret = single_imm_use (lhs, &use, &use_stmt);
+      enum tree_code rhs_code = gimple_assign_rhs_code (use_stmt);
+      gcc_assert (ret);
+      
+      if (gimple_assign_rhs1 (use_stmt) == lhs)
+	{
+	  op1 = ssa_name;
+	  op2 = gimple_assign_rhs2 (use_stmt);
+	}
+      else
+	{
+	  op1 = gimple_assign_rhs1 (use_stmt);
+	  op2 = ssa_name;
+	}
+
+      lhs = gimple_assign_lhs (use_stmt);
+      tmp_var = create_tmp_reg (TREE_TYPE (lhs), get_lsm_tmp_name (mem, n++));
+      stmt = gimple_build_assign_with_ops (rhs_code, tmp_var, op1, op2);
+      ssa_name = make_ssa_name (tmp_var, stmt);
+      gimple_assign_set_lhs (stmt, ssa_name);
+      gimple_seq_add_stmt (&seq, stmt);
+    }
+
+  *new_lhs = ssa_name;
+  return seq;
+}
+
+/* Execute load motion of memory reference REF from LOOP.
+   Exits from the LOOP are stored in EXITS.  The original loads are turned
+   into null assignments and the new loads are emitted on the exit edges.  */
+
+static void
+execute_lm (struct loop *loop, VEC (edge, heap) *exits, mem_ref_p ref)
+{
+  VEC (mem_ref_loc_p, heap) *locs = NULL;
+  mem_ref_loc_p loc;
+  unsigned i;
+
+  if (dump_file && (dump_flags & TDF_DETAILS))
+    {
+      fprintf (dump_file, "Executing load motion of ");
+      print_generic_expr (dump_file, ref->mem, 0);
+      fprintf (dump_file, " from loop %d\n", loop->num);
+    }
+
+  get_all_locs_in_loop (loop, ref, &locs);
+
+  /* We have two cases: either the LHS was assigned to a reference that has
+     been hoisted out of the loop or it wasn't used within the loop.  */
+  FOR_EACH_VEC_ELT (mem_ref_loc_p, locs, i, loc)
+    {
+      imm_use_iterator iter;
+      gimple use_stmt;
+      tree new_lhs;
+
+      FOR_EACH_IMM_USE_STMT (use_stmt, iter, loc->lhs)
+	if (gimple_debug_bind_p (use_stmt))
+	  {
+	    gimple_debug_bind_reset_value (use_stmt);
+	    update_stmt (use_stmt);
+	  }
+	else if (gimple_assign_single_p (use_stmt))
+	  {
+	    tree var = gimple_assign_lhs (use_stmt);
+	    unsigned j;
+	    edge ex;
+
+	    /* This must be a variable replacing a former store.  */
+	    gcc_assert (TREE_CODE (var) == VAR_DECL);
+
+	    /* Insert the load and the single-use chain on every exit edge,
+	       before the store that has been previously inserted there.  */
+	    FOR_EACH_VEC_ELT (edge, exits, j, ex)
+	      {
+		gimple_seq s = copy_load_and_single_use_chain (loc, &new_lhs);
+		gimple stmt = gimple_build_assign (var, new_lhs);
+		gimple_seq_add_stmt (&s, stmt);
+		gsi_insert_seq_on_edge_before (ex, s);
+	      }
+	  }
+	else
+	  {
+	    use_operand_p use;
+
+	    /* We are supposed to be in loop-closed SSA form.  */
+	    gcc_assert (gimple_code (use_stmt) == GIMPLE_PHI);
+
+	    /* Insert the load and the single-use chain on every relevant edge
+	       of the PHI node and set the PHI argument to the new LHS.  */
+	    FOR_EACH_IMM_USE_ON_STMT (use, iter)
+	      {
+		edge e = gimple_phi_arg_edge (use_stmt,
+					      PHI_ARG_INDEX_FROM_USE (use));
+		gimple_seq s = copy_load_and_single_use_chain (loc, &new_lhs);
+		gsi_insert_seq_on_edge (e, s);
+		SET_USE (use, new_lhs);
+	      }
+	  }
+
+      /* Finally kill the original load.  */
+      gimple_assign_set_rhs1 (loc->stmt,
+			      build_zero_cst (TREE_TYPE (ref->mem)));
+      update_stmt (loc->stmt);
+    }
+
+  VEC_free (mem_ref_loc_p, heap, locs);
+}
+
 /* Hoists memory references MEM_REFS out of LOOP.  EXITS is the list of exit
    edges of the LOOP.  */
 
@@ -2203,67 +2337,77 @@  hoist_memory_references (struct loop *lo
   unsigned  i;
   bitmap_iterator bi;
 
+  /* First hoist the stores.  */
+  EXECUTE_IF_SET_IN_BITMAP (mem_refs, 0, i, bi)
+    {
+      ref = VEC_index (mem_ref_p, memory_accesses.refs_list, i);
+      if (bitmap_bit_p (ref->stored, loop->num))
+	execute_sm (loop, exits, ref);
+    }
+
+  /* Then hoist the loads.  */
   EXECUTE_IF_SET_IN_BITMAP (mem_refs, 0, i, bi)
     {
       ref = VEC_index (mem_ref_p, memory_accesses.refs_list, i);
-      execute_sm (loop, exits, ref);
+      if (!bitmap_bit_p (ref->stored, loop->num))
+	execute_lm (loop, exits, ref);
     }
 }
 
-/* Returns true if REF is always accessed in LOOP.  If STORED_P is true
-   make sure REF is always stored to in LOOP.  */
+/* Return true if REF is always stored to in LOOP.  If ONCE_P is true, make
+   sure REF is stored to exactly once in LOOP.  */
 
 static bool
-ref_always_accessed_p (struct loop *loop, mem_ref_p ref, bool stored_p)
+ref_always_stored_p (struct loop *loop, mem_ref_p ref, bool once_p)
 {
   VEC (mem_ref_loc_p, heap) *locs = NULL;
   unsigned i;
   mem_ref_loc_p loc;
   bool ret = false;
   struct loop *must_exec;
-  tree base;
+  tree base, lhs;
 
   base = get_base_address (ref->mem);
-  if (INDIRECT_REF_P (base)
-      || TREE_CODE (base) == MEM_REF)
+  if (INDIRECT_REF_P (base) || TREE_CODE (base) == MEM_REF)
     base = TREE_OPERAND (base, 0);
 
   get_all_locs_in_loop (loop, ref, &locs);
+  if (once_p && VEC_length (mem_ref_loc_p, locs) != 1)
+    {
+      VEC_free (mem_ref_loc_p, heap, locs);
+      return false;
+    }
+
   FOR_EACH_VEC_ELT (mem_ref_loc_p, locs, i, loc)
     {
       if (!get_lim_data (loc->stmt))
 	continue;
 
-      /* If we require an always executed store make sure the statement
-         stores to the reference.  */
-      if (stored_p)
-	{
-	  tree lhs;
-	  if (!gimple_get_lhs (loc->stmt))
-	    continue;
-	  lhs = get_base_address (gimple_get_lhs (loc->stmt));
-	  if (!lhs)
-	    continue;
-	  if (INDIRECT_REF_P (lhs)
-	      || TREE_CODE (lhs) == MEM_REF)
-	    lhs = TREE_OPERAND (lhs, 0);
-	  if (lhs != base)
-	    continue;
-	}
+      if (!gimple_get_lhs (loc->stmt))
+	continue;
+
+      lhs = get_base_address (gimple_get_lhs (loc->stmt));
+      if (!lhs)
+	continue;
+
+      if (INDIRECT_REF_P (lhs) || TREE_CODE (lhs) == MEM_REF)
+	lhs = TREE_OPERAND (lhs, 0);
+
+      if (lhs != base)
+	continue;
 
       must_exec = get_lim_data (loc->stmt)->always_executed_in;
       if (!must_exec)
 	continue;
 
-      if (must_exec == loop
-	  || flow_loop_nested_p (must_exec, loop))
+      if (must_exec == loop || flow_loop_nested_p (must_exec, loop))
 	{
 	  ret = true;
 	  break;
 	}
     }
-  VEC_free (mem_ref_loc_p, heap, locs);
 
+  VEC_free (mem_ref_loc_p, heap, locs);
   return ret;
 }
 
@@ -2375,77 +2519,234 @@  ref_indep_loop_p (struct loop *loop, mem
   return ret;
 }
 
-/* Returns true if we can perform store motion of REF from LOOP.  */
+/* Return true if we can perform load or store motion of REF from LOOP.
+   FOR_SM is true if this is for store motion or false for load motion.  */
 
 static bool
-can_sm_ref_p (struct loop *loop, mem_ref_p ref)
+can_lsm_ref_p (struct loop *loop, mem_ref_p ref, bool for_sm)
 {
-  tree base;
+  bitmap refs;
+  bool stored;
 
   /* Can't hoist unanalyzable refs.  */
   if (!MEM_ANALYZABLE (ref))
     return false;
 
-  /* Unless the reference is stored in the loop, there is nothing to do.  */
-  if (!bitmap_bit_p (ref->stored, loop->num))
+  /* Unless the reference is accessed in the loop, there is nothing to do.  */
+  refs = VEC_index (bitmap, memory_accesses.all_refs_in_loop, loop->num);
+  if (!bitmap_bit_p (refs, ref->id))
+    return false;
+
+  /* The reference must be stored to for store motion and may not be stored
+     to for load motion.  */
+  stored = bitmap_bit_p (ref->stored, loop->num);
+  if (stored != for_sm)
     return false;
 
   /* It should be movable.  */
   if (!is_gimple_reg_type (TREE_TYPE (ref->mem))
-      || TREE_THIS_VOLATILE (ref->mem)
-      || !for_each_index (&ref->mem, may_move_till, loop))
+      || TREE_THIS_VOLATILE (ref->mem))
     return false;
 
   /* If it can throw fail, we do not properly update EH info.  */
   if (tree_could_throw_p (ref->mem))
     return false;
 
-  /* If it can trap, it must be always executed in LOOP.
-     Readonly memory locations may trap when storing to them, but
+  return true;
+}
+
+/* Return true if we can perform store motion of REF from LOOP.  */
+
+static bool
+can_sm_ref_p (struct loop *loop, mem_ref_p ref)
+{
+  tree base;
+
+  if (!can_lsm_ref_p (loop, ref, true))
+    return false;
+
+  /* It must be possible to hoist the index out of the loop.  */
+  if (!for_each_index (&ref->mem, may_move_till, loop))
+    return false;
+
+  /* If it can trap, a store must be always executed in LOOP.
+     Read-only memory locations may trap when storing to them, but
      tree_could_trap_p is a predicate for rvalues, so check that
      explicitly.  */
   base = get_base_address (ref->mem);
   if ((tree_could_trap_p (ref->mem)
        || (DECL_P (base) && TREE_READONLY (base)))
-      && !ref_always_accessed_p (loop, ref, true))
+      && !ref_always_stored_p (loop, ref, false))
     return false;
 
-  /* And it must be independent on all other memory references
-     in LOOP.  */
+  /* And the store must be independent of other memory references in LOOP.  */
   if (!ref_indep_loop_p (loop, ref))
     return false;
 
   return true;
 }
 
-/* Marks the references in LOOP for that store motion should be performed
-   in REFS_TO_SM.  SM_EXECUTED is the set of references for that store
+/* Follow a chain of single uses in LOOP starting from SSA_NAME and containing
+   only invariants of LOOP.  Return the end of the chain.  */
+
+static tree
+follow_invariant_single_use_chain (struct loop *loop, tree ssa_name)
+{
+  use_operand_p use;
+  gimple use_stmt;
+ 
+  while (single_imm_use (ssa_name, &use, &use_stmt)
+	 && flow_bb_inside_loop_p (loop, gimple_bb (use_stmt)))
+    {
+      tree other_op;
+
+      if (!is_gimple_assign (use_stmt)
+	  || gimple_assign_rhs_class (use_stmt) != GIMPLE_BINARY_RHS)
+	break;
+
+      if (gimple_assign_rhs1 (use_stmt) == ssa_name)
+	other_op = gimple_assign_rhs2 (use_stmt);
+      else
+	other_op = gimple_assign_rhs1 (use_stmt);
+
+      if (outermost_invariant_loop (other_op, loop) == NULL)
+	break;
+
+      ssa_name = gimple_assign_lhs (use_stmt);
+    }
+
+  return ssa_name;
+}
+
+/* Return true if we can perform load motion of REF from LOOP.
+   REFS_TO_SM is the set of references for which store motion will be
+   performed in LOOP.  */
+
+static bool
+can_lm_ref_p (struct loop *loop, mem_ref_p ref, bitmap refs_to_sm)
+{
+  VEC (mem_ref_loc_p, heap) *locs = NULL;
+  mem_ref_loc_p loc;
+  unsigned i;
+
+  if (!can_lsm_ref_p (loop, ref, false))
+    return false;
+
+  get_all_locs_in_loop (loop, ref, &locs);
+
+  /* The LHS may not be used within the loop or the LHS must be the head of
+     an invariant single-use chain whose end has this property.  */
+  FOR_EACH_VEC_ELT (mem_ref_loc_p, locs, i, loc)
+    {
+      struct lim_aux_data *lim_data;
+      struct loop *must_exec;
+      imm_use_iterator iter;
+      gimple use_stmt;
+      bool fail = false;
+      tree lhs;
+
+      /* Don't look at loads that will be hoisted as invariant.  */
+      lim_data = get_lim_data (loc->stmt);
+      if (!lim_data || lim_data->max_loop != NULL)
+	goto failure;
+
+      /* The loads must always be executed in LOOP.  */
+      must_exec = lim_data->always_executed_in;
+      if (!must_exec
+	  || !(must_exec == loop || flow_loop_nested_p (must_exec, loop)))
+	goto failure;
+
+      lhs = gimple_assign_lhs (loc->stmt);
+
+      /* Follow an invariant single-use chain and retrieve its end.  */
+      lhs = follow_invariant_single_use_chain (loop, lhs);
+
+      FOR_EACH_IMM_USE_STMT (use_stmt, iter, lhs)
+        if (gimple_debug_bind_p (use_stmt))
+	  ;
+	else if (flow_bb_inside_loop_p (loop, gimple_bb (use_stmt)))
+	  {
+	    mem_ref_p use_ref;
+
+	    /* If the LHS is assigned to references that will be hoisted out
+	       of the loop, then the condition will be fulfilled after store
+	       motion.  But we need to make sure the assignments are always
+	       executed once since we cannot insert PHI nodes on edges.  */
+	    if (!gimple_has_volatile_ops (use_stmt)
+		&& gimple_assign_single_p (use_stmt)
+		&& gimple_assign_rhs1 (use_stmt) == lhs
+		&& gimple_vdef (use_stmt)
+		&& (use_ref = mem_ref_in_stmt (use_stmt)) != NULL
+		&& bitmap_bit_p (refs_to_sm, use_ref->id)
+		&& ref_always_stored_p (loop, use_ref, true))
+	      continue;
+
+	    fail = true;
+	    BREAK_FROM_IMM_USE_STMT (iter);
+	  }
+
+      if (fail)
+	goto failure;
+
+      /* Record the end of the chain.  */
+      loc->lhs = lhs;
+    }
+
+  /* And the load must be independent of other memory references in LOOP.  */
+  if (!ref_indep_loop_p (loop, ref))
+    goto failure;
+
+  VEC_free (mem_ref_loc_p, heap, locs);
+  return true;
+
+failure:
+  VEC_free (mem_ref_loc_p, heap, locs);
+  return false;
+}
+
+/* Mark references in LOOP for which load-store motion should be performed
+   in REFS_TO_LSM.  LSM_EXECUTED is the set of references for which load-store
    motion was performed in one of the outer loops.  */
 
 static void
-find_refs_for_sm (struct loop *loop, bitmap sm_executed, bitmap refs_to_sm)
+find_refs_for_lsm (struct loop *loop, bitmap lsm_executed, bitmap refs_to_lsm)
 {
+  bitmap refs_to_sm = BITMAP_ALLOC (NULL);
   bitmap refs = VEC_index (bitmap, memory_accesses.all_refs_in_loop,
 			   loop->num);
   unsigned i;
   bitmap_iterator bi;
   mem_ref_p ref;
 
-  EXECUTE_IF_AND_COMPL_IN_BITMAP (refs, sm_executed, 0, i, bi)
+  /* First mark references for which store motion should be performed, as
+     this will enable load motion for further references.  */
+  EXECUTE_IF_AND_COMPL_IN_BITMAP (refs, lsm_executed, 0, i, bi)
     {
       ref = VEC_index (mem_ref_p, memory_accesses.refs_list, i);
       if (can_sm_ref_p (loop, ref))
 	bitmap_set_bit (refs_to_sm, i);
     }
+
+  /* Then mark references for which load motion should be performed, taking
+     into account the result of the above analysis.  */
+  EXECUTE_IF_AND_COMPL_IN_BITMAP (refs, lsm_executed, 0, i, bi)
+    {
+      ref = VEC_index (mem_ref_p, memory_accesses.refs_list, i);
+      if (can_lm_ref_p (loop, ref, refs_to_sm))
+	bitmap_set_bit (refs_to_lsm, i);
+    }
+
+  bitmap_ior_into (refs_to_lsm, refs_to_sm);
+  BITMAP_FREE (refs_to_sm);
 }
 
-/* Checks whether LOOP (with exits stored in EXITS array) is suitable
-   for a store motion optimization (i.e. whether we can insert statement
+/* Check whether LOOP (with exits stored in EXITS array) is suitable for
+   a load-store motion optimization (i.e. whether we can insert statements
    on its exits).  */
 
 static bool
-loop_suitable_for_sm (struct loop *loop ATTRIBUTE_UNUSED,
-		      VEC (edge, heap) *exits)
+loop_suitable_for_lsm (struct loop *loop ATTRIBUTE_UNUSED,
+		       VEC (edge, heap) *exits)
 {
   unsigned i;
   edge ex;
@@ -2457,44 +2758,44 @@  loop_suitable_for_sm (struct loop *loop
   return true;
 }
 
-/* Try to perform store motion for all memory references modified inside
-   LOOP.  SM_EXECUTED is the bitmap of the memory references for that
+/* Try to perform load-store motion for all memory references modified inside
+   LOOP.  LSM_EXECUTED is the bitmap of the memory references for which load-
    store motion was executed in one of the outer loops.  */
 
 static void
-store_motion_loop (struct loop *loop, bitmap sm_executed)
+load_store_motion_loop (struct loop *loop, bitmap lsm_executed)
 {
   VEC (edge, heap) *exits = get_loop_exit_edges (loop);
   struct loop *subloop;
-  bitmap sm_in_loop = BITMAP_ALLOC (NULL);
+  bitmap lsm_in_loop = BITMAP_ALLOC (NULL);
 
-  if (loop_suitable_for_sm (loop, exits))
+  if (loop_suitable_for_lsm (loop, exits))
     {
-      find_refs_for_sm (loop, sm_executed, sm_in_loop);
-      hoist_memory_references (loop, sm_in_loop, exits);
+      find_refs_for_lsm (loop, lsm_executed, lsm_in_loop);
+      hoist_memory_references (loop, lsm_in_loop, exits);
     }
   VEC_free (edge, heap, exits);
 
-  bitmap_ior_into (sm_executed, sm_in_loop);
+  bitmap_ior_into (lsm_executed, lsm_in_loop);
   for (subloop = loop->inner; subloop != NULL; subloop = subloop->next)
-    store_motion_loop (subloop, sm_executed);
-  bitmap_and_compl_into (sm_executed, sm_in_loop);
-  BITMAP_FREE (sm_in_loop);
+    load_store_motion_loop (subloop, lsm_executed);
+  bitmap_and_compl_into (lsm_executed, lsm_in_loop);
+  BITMAP_FREE (lsm_in_loop);
 }
 
-/* Try to perform store motion for all memory references modified inside
+/* Try to perform load-store motion for all memory references modified inside
    loops.  */
 
 static void
-store_motion (void)
+load_store_motion (void)
 {
   struct loop *loop;
-  bitmap sm_executed = BITMAP_ALLOC (NULL);
+  bitmap lsm_executed = BITMAP_ALLOC (NULL);
 
   for (loop = current_loops->tree_root->inner; loop != NULL; loop = loop->next)
-    store_motion_loop (loop, sm_executed);
+    load_store_motion_loop (loop, lsm_executed);
 
-  BITMAP_FREE (sm_executed);
+  BITMAP_FREE (lsm_executed);
   gsi_commit_edge_inserts ();
 }
 
@@ -2652,9 +2953,9 @@  tree_ssa_lim (void)
      invariant and cost for computing the invariant.  */
   determine_invariantness ();
 
-  /* Execute store motion.  Force the necessary invariants to be moved
+  /* Execute load-store motion.  Force the necessary invariants to be moved
      out of the loops as well.  */
-  store_motion ();
+  load_store_motion ();
 
   /* Move the expressions that are expensive enough.  */
   todo = move_computations ();