From patchwork Tue Apr 24 17:43:17 2012
X-Patchwork-Submitter: Aldy Hernandez
X-Patchwork-Id: 154731
Message-ID: <4F96E635.2040808@redhat.com>
Date: Tue, 24 Apr 2012 12:43:17 -0500
From: Aldy Hernandez
To: Richard Guenther
CC: Richard Guenther, Andrew MacLeod, "Boehm, Hans", gcc-patches, Torvald Riegel
Subject: Re: [PR tree-optimization/52558]: RFC: questions on store data race
References: <4F875324.7060908@redhat.com>
On 04/13/12 03:46, Richard Guenther wrote:
> On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez wrote:

Richard.  Thanks so much for reviewing and providing an alternative
approach, which AFAICT provides superior results.

> A similar effect could be obtained by keeping a flag whether we entered the
> path that stored the value before the transform.  Thus, do
>
>   lsm = g2;  // technically not needed, if you also handle loads
>   flag = false;
>   for (...)
>     {
>       if (g1)
>         {
>           if (flag)
>             g2 = lsm;
>         }
>       else
>         {
>           lsm = 0;
>           flag = true;
>         }
>     }
>   if (flag)
>     g2 = lsm;

I have implemented this in the attached patch, and it seems to be
generating better code than my other if-changed approach.  It also
avoids generating unnecessary loads that may trap.

For the original testcase:

int g_1 = 1;
int g_2 = 0;

int func_1(void)
{
  int l;
  for (l = 0; l < 1234; l++)
    {
      if (g_1)
        return l;
      else
        g_2 = 0;
    }
  return 999;
}

After all optimizations are done, we are now generating the following
for the flags approach:

new:
func_1:
        movl    g_1(%rip), %edx
        xorl    %eax, %eax
        testl   %edx, %edx
        je      .L5
        rep
        ret
.L5:
        movl    $0, g_2(%rip)
        movw    $999, %ax
        ret

Which is significantly better than the unmodified trunk of:

func_1:
        movl    g_1(%rip), %edx
        movl    g_2(%rip), %eax
        testl   %edx, %edx
        je      .L5
        movl    %eax, g_2(%rip)
        xorl    %eax, %eax
        ret
.L5:
        movl    $0, g_2(%rip)
        movl    $999, %eax
        ret

And in other less contrived cases, we generate correct code that avoids
writes on code paths that did not have it.
For example, in Hans register promotion example:

int count;

struct obj {
    int data;
    struct obj *next;
} *q;

void func()
{
  struct obj *p;
  for (p = q; p; p = p->next)
    if (p->data > 0)
      count++;
}

Under the new memory model we should avoid promoting "count" to a
register and unilaterally writing to it upon exiting the loop.

With the attached patch, we now generate:

func:
.LFB0:
        movq    q(%rip), %rax
        xorl    %ecx, %ecx
        movl    count(%rip), %edx
        testq   %rax, %rax
        je      .L1
.L9:
        movl    (%rax), %esi
        testl   %esi, %esi
        jle     .L3
        addl    $1, %edx
        movl    $1, %ecx
.L3:
        movq    8(%rax), %rax
        testq   %rax, %rax
        jne     .L9
        testb   %cl, %cl
        jne     .L12
.L1:
        rep
        ret
.L12:
        movl    %edx, count(%rip)
        ret

Which is as expected.

I don't understand your suggestion of:

> lsm = g2;  // technically not needed, if you also handle loads
>
> that would allow to extend this to cover the cases where the access may
> eventually trap (if you omit the load before the loop).

Can't I just remove the lsm=g2?  What's this about also handling loads?

> Not sure which is more efficient (you can eventually combine flag variables
> for different store locations, combining the if-changed compare is not so easy
> I guess).

So far, I see better code generated with the flags approach, so...

>> 1. No pass can figure out the equality (or inequality) of g_2_lsm and g_2.
>> In the PR, Richard mentions that both FRE/PRE can figure this out, but they
>> are not run after store motion.  DOM should also be able to, but apparently
>> does not :(.  Tips?
>
> Well.  Schedule FRE after loop opts ...
>
> DOM is not prepared to handle loads/stores with differing VOPs - it was never
> updated really after the alias-improvements merge.
>
>> 2. The GIMPLE_CONDs I create in this patch are currently causing problems
>> with complex floats/doubles.  do_jump is complaining that it can't compare
>> them, so I assume it is too late (in tree-ssa-loop-im.c) to compare complex
>> values since complex lowering has already happened (??).
>> Is there a more canonical way of creating a GIMPLE_COND that may
>> contain complex floats at this stage?
>
> As you are handling redundant stores you want a bitwise comparison anyway,
> but yes, complex compares need to be lowered.  But as said, you want a
> bitwise comparison, not a value-comparison.  You can try using

I can ignore all this.  I have implemented both alternatives, with a
target hook as suggested, but I'm thinking of removing it altogether and
just leaving the flags approach.

> Few comments on the patch itself.
>
> +  new_bb = split_edge (ex);
> +  then_bb = create_empty_bb (new_bb);
> +  if (current_loops && new_bb->loop_father)
> +    add_bb_to_loop (then_bb, new_bb->loop_father);
>
> +  gsi = gsi_start_bb (new_bb);
> +  t1 = make_rename_temp (TREE_TYPE (ref->mem), NULL);
>
> Hmm, why do you need to re-load the value?  Don't you have both
> the value as it was loaded before the loop and the new value (the lsm tmp reg)
> already?  Why not simply remember the SSA name used?  Ah, because
> store motion also uses make_rename_temp.  If you don't want to change
> that (which would be good but inconvenient because of the required PHI
> insertions), I'd change the code that performs the load in the pre-header to do
>
>   SSA_NAME = ref->mem;
>   rename-lsm-tmp = SSA_NAME;
>
> so you can remember SSA_NAME and re-use it here.  That probably solves
> your optimization issue, too.

Fixed.

> +      for (gsi = gsi_start_phis (old_dest); !gsi_end_p (gsi); gsi_next (&gsi))
> +        {
> +          gimple phi = gsi_stmt (gsi);
> +          unsigned i;
> +
> +          for (i = 0; i < gimple_phi_num_args (phi); i++)
> +            if (gimple_phi_arg_edge (phi, i)->src == new_bb)
> +              {
> +                tree arg = gimple_phi_arg_def (phi, i);
> +                add_phi_arg (phi, arg, then_old_edge, UNKNOWN_LOCATION);
> +                update_stmt (phi);
> +              }
> +        }
>
> Hmm.  This is of course doing unnecessary work in case there are multiple
> moved stores.  In fact splitting the exit edge is only necessary if there are
> more than one predecessor.
> Otherwise you can simply split the exit block
> after inserting the new conditional after its labels.

Is this still the case with the current patch?  If so, I'm a bit
confused as to what you want here.

> +  if (opts->x_flag_tm)
> +    {
> +      if (opts->x_flag_non_call_exceptions)
> +        sorry ("transactional memory is not supported with non-call exceptions");
> +
> +      set_param_value ("allow-load-data-races", 0,
> +                       opts->x_param_values, opts_set->x_param_values);
> +      set_param_value ("allow-store-data-races", 0,
> +                       opts->x_param_values, opts_set->x_param_values);
>
> Unless the user explicitely set those params?  Which means using params
> wasn't a very good idea ... ideally you'd diagnose an error when the user
> would mix -fgnu-tm and -fallow-store-data-races.  So - can we remove
> allow-load-data-races (we are never going to implement that) and make
> allow-store-data-races a proper -f flag please?

Sorry, that was not meant for public consumption.  I have rewritten this
to either take the param value as is, and in the case of flag_tm, only
restrict the optimization if the loop is inside an actual transaction
(see the changes to trans-mem.c, cause I know you'll hate
bb_in_transaction() :-)).

> +    }
>
> And that should be a separate patch

I can certainly reimplement --param=allow-{load,store}-data-races as
proper -f flags.  I don't care either way.  But I remember Ian Taylor
having a different opinion.  If y'all agree, I can implement whatever.

> It would be nice if you could think about LIM/SM for possibly trapping
> loads/stores while you are doing this work.

We are no longer adding additional trapping loads (as with the
if-changed approach).  Is this something else you'd like me to
concentrate on?

Please take a look at the attached patch.  I tested the flags
alternative on a full bootstrap, and there's only one regression which I
will look at, but I'd like to know if the current WIP is aligned with
what you had in mind.

Thanks again.
Aldy

	* trans-mem.c (bb_in_transaction_1): New.
	(bb_in_transaction): New.
	* gimple.h (bb_in_transaction): Protoize.
	* Makefile.in (tree-ssa-loop-im.o): Depend on TARGET_H.
	* doc/tm.texi: Regenerate.
	* tree-ssa-loop-im.c (execute_sm_if_changed): New.
	(execute_sm_if_changed_flag): New.
	(execute_sm_if_changed_flag_set): New.
	(execute_sm): Do not generate data races unless requested.
	* targhooks.c (default_can_compare_bitwise_p): New.
	* targhooks.h (default_can_compare_bitwise_p): Protoize.
	* target.def (can_compare_bitwise_p): New hook.
	* tree-ssa-loop-im.c (execute_sm): Guard store motion with a
	conditional.
	* opts.c (finish_options): Do not allow store or load data races
	with -fgnu-tm.

Index: trans-mem.c
===================================================================
--- trans-mem.c	(revision 186108)
+++ trans-mem.c	(working copy)
@@ -135,6 +135,47 @@
 
 */
 
+/* Helper function for bb_in_transaction.  Returns true if the first
+   statement in BB is in a transaction.  If the BB does not have any
+   statements, return -1.  */
+
+static int
+bb_in_transaction_1 (basic_block bb)
+{
+  gimple t;
+
+  t = gimple_seq_first_stmt (bb_seq (bb));
+  if (t)
+    return gimple_in_transaction (t);
+  return -1;
+}
+
+/* Return true if BB is in a transaction.  This can only be called
+   after transaction bits have been calculated with
+   compute_transaction_bits().  */
+
+bool
+bb_in_transaction (basic_block bb)
+{
+  edge e;
+  edge_iterator ei;
+  int res;
+
+  res = bb_in_transaction_1 (bb);
+  if (res != -1)
+    return (bool) res;
+
+  /* ?? Perhaps we should cache this somewhere in the BB, or are
+     multiple levels of empty BB's common.  ?? */
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    {
+      int res = bb_in_transaction_1 (e->src);
+      if (res != -1)
+	return (bool) res;
+    }
+  return false;
+}
+
 /* Return the attributes we want to examine for X, or NULL if it's not
    something we examine.  We look at function types, but allow pointers
    to function types and function decls and peek through.
 */

Index: tree-ssa-loop-im.c
===================================================================
--- tree-ssa-loop-im.c	(revision 186108)
+++ tree-ssa-loop-im.c	(working copy)
@@ -40,6 +40,7 @@ along with GCC; see the file COPYING3.
 #include "tree-affine.h"
 #include "pointer-set.h"
 #include "tree-ssa-propagate.h"
+#include "target.h"
 
 /* TODO:  Support for predicated code motion.  I.e.
 
@@ -52,7 +53,7 @@ along with GCC; see the file COPYING3.
 	  }
       }
 
-   Where COND and INV are is invariants, but evaluating INV may trap or be
+   Where COND and INV are invariants, but evaluating INV may trap or be
    invalid from some other reason if !COND.  This may be transformed to
 
    if (cond)
@@ -1956,6 +1957,161 @@ get_lsm_tmp_name (tree ref, unsigned n)
   return lsm_tmp_name;
 }
 
+/* Helper function for execute_sm.  Emit code to store TMP_VAR into
+   MEM along edge EX.
+
+   The store is only done if MEM has changed.  We do this so no
+   changes to MEM occur on code paths that did not originally store
+   into it.
+
+   The common case for execute_sm will transform:
+
+     for (...) {
+       if (foo)
+         stuff;
+       else
+         MEM = TMP_VAR;
+     }
+
+   into:
+
+     lsm = MEM;
+     for (...) {
+       if (foo)
+         stuff;
+       else
+         lsm = TMP_VAR;
+     }
+     MEM = lsm;
+
+   This function will either generate when FLAG is present:
+
+     lsm = MEM;
+     flag = false;
+     ...
+     if (foo)
+       stuff;
+     else {
+       lsm = TMP_VAR;
+       flag = true;
+     }
+     if (flag)      <--
+       MEM = lsm;   <--
+
+   or the following when FLAG is not set:
+
+     lsm = MEM;
+     for (...)
+       {
+         if (foo)
+           stuff;
+         else
+           lsm = TMP_VAR;
+       }
+     if (lsm != MEM)   <--
+       MEM = lsm;      <--
+*/
+
+static void
+execute_sm_if_changed (edge ex, tree mem, tree orig_mem_ssa, tree tmp_var,
+		       tree flag)
+{
+  basic_block new_bb, then_bb, old_dest = ex->dest;
+  bool single_exit_p = single_pred_p (ex->dest);
+  edge then_old_edge;
+  gimple_stmt_iterator gsi;
+  gimple stmt;
+
+  if (single_exit_p)
+    ex = split_block_after_labels (old_dest);
+
+  old_dest = ex->dest;
+  new_bb = split_edge (ex);
+  then_bb = create_empty_bb (new_bb);
+  if (current_loops && new_bb->loop_father)
+    add_bb_to_loop (then_bb, new_bb->loop_father);
+
+  gsi = gsi_start_bb (new_bb);
+  if (flag)
+    stmt = gimple_build_cond (NE_EXPR, flag, boolean_false_node,
+			      NULL_TREE, NULL_TREE);
+  else
+    stmt = gimple_build_cond (NE_EXPR, orig_mem_ssa, tmp_var,
+			      NULL_TREE, NULL_TREE);
+  gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+  gsi = gsi_start_bb (then_bb);
+  /* Insert actual store.  */
+  stmt = gimple_build_assign (unshare_expr (mem), tmp_var);
+  gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+  make_edge (new_bb, then_bb, EDGE_TRUE_VALUE);
+  make_edge (new_bb, old_dest, EDGE_FALSE_VALUE);
+  then_old_edge = make_edge (then_bb, old_dest, EDGE_FALLTHRU);
+  set_immediate_dominator (CDI_DOMINATORS, then_bb, new_bb);
+
+  if (!single_exit_p)
+    for (gsi = gsi_start_phis (old_dest); !gsi_end_p (gsi); gsi_next (&gsi))
+      {
+	gimple phi = gsi_stmt (gsi);
+	unsigned i;
+
+	for (i = 0; i < gimple_phi_num_args (phi); i++)
+	  if (gimple_phi_arg_edge (phi, i)->src == new_bb)
+	    {
+	      tree arg = gimple_phi_arg_def (phi, i);
+	      add_phi_arg (phi, arg, then_old_edge, UNKNOWN_LOCATION);
+	      update_stmt (phi);
+	    }
+      }
+
+  /* Remove the original fall through edge.  This was the
+     single_succ_edge (new_bb).  */
+  EDGE_SUCC (new_bb, 0)->flags &= ~EDGE_FALLTHRU;
+}
+
+/* Helper function for execute_sm.  On every path that sets REF, set
+   an appropriate flag indicating the store.
+*/
+
+static tree
+execute_sm_if_changed_flag_set (struct loop *loop, mem_ref_p ref)
+{
+  unsigned i;
+  mem_ref_loc_p loc;
+  tree flag;
+  VEC (mem_ref_loc_p, heap) *locs = NULL;
+  char *str = get_lsm_tmp_name (ref->mem, ~0);
+
+  lsm_tmp_name_add ("_flag");
+  flag = make_rename_temp (boolean_type_node, str);
+  get_all_locs_in_loop (loop, ref, &locs);
+  FOR_EACH_VEC_ELT (mem_ref_loc_p, locs, i, loc)
+    {
+      gimple_stmt_iterator gsi;
+      gimple stmt;
+
+      gsi = gsi_start_bb (gimple_bb (loc->stmt));
+      for (; !gsi_end_p (gsi); gsi_next (&gsi))
+	if (gsi_stmt (gsi) == loc->stmt)
+	  {
+	    stmt = gimple_build_assign (flag, boolean_true_node);
+	    gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+	    break;
+	  }
+    }
+  VEC_free (mem_ref_loc_p, heap, locs);
+  return flag;
+}
+
+/* Store motion types for execute_sm below.  */
+enum store_type
+  {
+    /* Traditional single-thread.  */
+    STORE_SINGLE_THREAD,
+    /* Store only if value changed.  */
+    STORE_IF_CHANGED,
+    /* Store only if flag changed.  */
+    STORE_IF_CHANGED_FLAG
+  };
+
 /* Executes store motion of memory reference REF from LOOP.
    Exits from the LOOP are stored in EXITS.
    The initialization of the temporary variable is put to the preheader
    of the loop, and assignments
@@ -1964,12 +2120,13 @@ get_lsm_tmp_name (tree ref, unsigned n)
 static void
 execute_sm (struct loop *loop, VEC (edge, heap) *exits, mem_ref_p ref)
 {
-  tree tmp_var;
+  tree tmp_var, mem_ssa, store_flag;
   unsigned i;
-  gimple load, store;
+  gimple load;
   struct fmt_data fmt_data;
-  edge ex;
+  edge ex, latch_edge;
   struct lim_aux_data *lim_data;
+  enum store_type store;
 
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
@@ -1978,6 +2135,8 @@ execute_sm (struct loop *loop, VEC (edge
       fprintf (dump_file, " from loop %d\n", loop->num);
     }
 
+  mem_ssa = create_tmp_reg (TREE_TYPE (unshare_expr (ref->mem)), NULL);
+  mem_ssa = make_ssa_name (mem_ssa, NULL);
   tmp_var = make_rename_temp (TREE_TYPE (ref->mem),
			      get_lsm_tmp_name (ref->mem, ~0));
 
@@ -1985,22 +2144,66 @@ execute_sm (struct loop *loop, VEC (edge
   fmt_data.orig_loop = loop;
   for_each_index (&ref->mem, force_move_till, &fmt_data);
 
+  if ((flag_tm && bb_in_transaction (loop_preheader_edge (loop)->src))
+      || !PARAM_VALUE (PARAM_ALLOW_STORE_DATA_RACES))
+    {
+      if (targetm.can_compare_bitwise_p (TYPE_MODE (TREE_TYPE (ref->mem))))
+	store = STORE_IF_CHANGED;
+      else
+	store = STORE_IF_CHANGED_FLAG;
+    }
+  else
+    store = STORE_SINGLE_THREAD;
+
+#if 1
+  /* ?? testing ?? */
+  store = STORE_IF_CHANGED_FLAG;
+#endif
+
+  if (store == STORE_IF_CHANGED_FLAG)
+    store_flag = execute_sm_if_changed_flag_set (loop, ref);
+
   rewrite_mem_refs (loop, ref, tmp_var);
 
-  /* Emit the load & stores.  */
-  load = gimple_build_assign (tmp_var, unshare_expr (ref->mem));
+  /* Emit the load code into the latch, so that we are sure it will
+     be processed after all dependencies.
+  */
+  latch_edge = loop_latch_edge (loop);
+  load = gimple_build_assign (mem_ssa, unshare_expr (ref->mem));
   lim_data = init_lim_data (load);
   lim_data->max_loop = loop;
   lim_data->tgt_loop = loop;
+  gsi_insert_on_edge (latch_edge, load);
+
+  load = gimple_build_assign (tmp_var, mem_ssa);
+  lim_data = init_lim_data (load);
+  lim_data->max_loop = loop;
+  lim_data->tgt_loop = loop;
+  gsi_insert_on_edge (latch_edge, load);
+
+  if (store == STORE_IF_CHANGED_FLAG)
+    {
+      load = gimple_build_assign (store_flag, boolean_false_node);
+      lim_data = init_lim_data (load);
+      lim_data->max_loop = loop;
+      lim_data->tgt_loop = loop;
+      gsi_insert_on_edge (latch_edge, load);
+    }
 
-  /* Put this into the latch, so that we are sure it will be processed after
-     all dependencies.  */
-  gsi_insert_on_edge (loop_latch_edge (loop), load);
-
+  /* Sink the store to every exit from the loop.  */
   FOR_EACH_VEC_ELT (edge, exits, i, ex)
     {
-      store = gimple_build_assign (unshare_expr (ref->mem), tmp_var);
-      gsi_insert_on_edge (ex, store);
+      if (store == STORE_SINGLE_THREAD)
+	{
+	  gimple store;
+	  store = gimple_build_assign (unshare_expr (ref->mem), tmp_var);
+	  gsi_insert_on_edge (ex, store);
+	}
+      else if (store == STORE_IF_CHANGED)
+	execute_sm_if_changed (ex, ref->mem, mem_ssa, tmp_var, NULL);
+      else if (store == STORE_IF_CHANGED_FLAG)
+	execute_sm_if_changed (ex, ref->mem, mem_ssa, tmp_var,
+			       store_flag);
+      else
+	gcc_unreachable ();
     }
 }

Index: target.def
===================================================================
--- target.def	(revision 186108)
+++ target.def	(working copy)
@@ -2667,6 +2667,13 @@ DEFHOOK
 enum unwind_info_type, (void),
 default_debug_unwind_info)
 
+DEFHOOK
+(can_compare_bitwise_p,
+ "This hook should return true if the target can efficiently compare two\
+ registers in the given mode for bit equality.",
+ bool, (enum machine_mode),
+ default_can_compare_bitwise_p)
+
 DEFHOOKPOD
 (atomic_test_and_set_trueval,
 "This value should be set if the result written by\

Index: gimple.h
===================================================================
--- gimple.h	(revision 186108)
+++ gimple.h	(working copy)
@@ -1122,6 +1122,7 @@ extern tree omp_reduction_init (tree, tr
 /* In trans-mem.c.  */
 extern void diagnose_tm_safe_errors (tree);
 extern void compute_transaction_bits (void);
+extern bool bb_in_transaction (basic_block);
 
 /* In tree-nested.c.  */
 extern void lower_nested_functions (tree);

Index: Makefile.in
===================================================================
--- Makefile.in	(revision 186108)
+++ Makefile.in	(working copy)
@@ -2532,7 +2532,7 @@ tree-ssa-loop-im.o : tree-ssa-loop-im.c
    $(PARAMS_H) output.h $(DIAGNOSTIC_H) $(TIMEVAR_H) $(TM_H) coretypes.h \
    $(TREE_DUMP_H) $(TREE_PASS_H) $(FLAGS_H) $(BASIC_BLOCK_H) \
    pointer-set.h tree-affine.h tree-ssa-propagate.h gimple-pretty-print.h \
-   tree-pretty-print.h
+   tree-pretty-print.h $(TARGET_H)
 tree-ssa-math-opts.o : tree-ssa-math-opts.c $(CONFIG_H) $(SYSTEM_H) coretypes.h \
   $(TM_H) $(FLAGS_H) $(TREE_H) $(TREE_FLOW_H) $(TIMEVAR_H) \
   $(TREE_PASS_H) alloc-pool.h $(BASIC_BLOCK_H) $(TARGET_H) \

Index: doc/tm.texi
===================================================================
--- doc/tm.texi	(revision 186108)
+++ doc/tm.texi	(working copy)
@@ -9471,6 +9471,10 @@ how you define @code{DWARF2_FRAME_INFO}.
 @end defmac
 
 @deftypefn {Target Hook} {enum unwind_info_type} TARGET_DEBUG_UNWIND_INFO (void)
+
+@deftypefn {Target Hook} bool TARGET_CAN_COMPARE_BITWISE_P (enum @var{machine_mode})
+This hook should return true if the target can efficiently compare two registers in the given mode for bit equality.
+@end deftypefn
 This hook defines the mechanism that will be used for describing frame
 unwind information to the debugger.
 Normally the hook will return
 @code{UI_DWARF2} if DWARF 2 debug information is enabled, and

Index: targhooks.c
===================================================================
--- targhooks.c	(revision 186108)
+++ targhooks.c	(working copy)
@@ -1331,6 +1331,15 @@ default_debug_unwind_info (void)
   return UI_NONE;
 }
 
+/* This hook returns true if the target can efficiently compare
+   two registers in the given mode for bit equality.  */
+
+bool
+default_can_compare_bitwise_p (enum machine_mode mode)
+{
+  return SCALAR_INT_MODE_P (mode);
+}
+
 /* To be used by targets where reg_raw_mode doesn't return the right
    mode for registers used in apply_builtin_return and apply_builtin_arg.  */

Index: targhooks.h
===================================================================
--- targhooks.h	(revision 186108)
+++ targhooks.h	(working copy)
@@ -168,6 +168,8 @@ extern unsigned char default_class_max_n
 
 extern enum unwind_info_type default_debug_unwind_info (void);
 
+extern bool default_can_compare_bitwise_p (enum machine_mode);
+
 extern int default_label_align_after_barrier_max_skip (rtx);
 extern int default_loop_align_max_skip (rtx);
 extern int default_label_align_max_skip (rtx);