[PR tree-optimization/52558]: RFC: questions on store data race

Message ID 4F96E635.2040808@redhat.com
State New

Commit Message

Aldy Hernandez April 24, 2012, 5:43 p.m. UTC
On 04/13/12 03:46, Richard Guenther wrote:
> On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez<aldyh@redhat.com>  wrote:

Richard.  Thanks so much for reviewing and providing an alternative 
approach, which AFAICT provides superior results.

> A similar effect could be obtained by keeping a flag whether we entered the
> path that stored the value before the transform.  Thus, do
>
>    lsm = g2;  // technically not needed, if you also handle loads
>    flag = false;
>    for (...)
>      {
>         if (g1)
>           {
>              if (flag)
>                g2 = lsm;
>           }
>         else
>           {
>              lsm = 0;
>              flag = true;
>           }
>      }
>    if (flag)
>      g2 = lsm;

I have implemented this in the attached patch, and it seems to be 
generating better code than my other if-changed approach.  It also 
avoids generating unnecessary loads that may trap.

For the original testcase:

int g_1 = 1;
int g_2 = 0;

int func_1(void)
{
  int l;
  for (l = 0; l < 1234; l++)
  {
    if (g_1)
      return l;
    else
      g_2 = 0;
  }
  return 999;
}

After all optimizations are done, we are now generating the following 
for the flags approach:

new:
func_1:
         movl    g_1(%rip), %edx
         xorl    %eax, %eax
         testl   %edx, %edx
         je      .L5
         rep
         ret
.L5:
         movl    $0, g_2(%rip)
         movw    $999, %ax
         ret

Which is significantly better than the unmodified trunk of:

func_1:
         movl    g_1(%rip), %edx
         movl    g_2(%rip), %eax
         testl   %edx, %edx
         je      .L5
         movl    %eax, g_2(%rip)
         xorl    %eax, %eax
         ret
.L5:
         movl    $0, g_2(%rip)
         movl    $999, %eax
         ret

And in other, less contrived cases, we generate correct code that avoids 
writes on code paths that did not originally have them.  For example, in 
Hans Boehm's register promotion example:

int count;

struct obj {
     int data;
     struct obj *next;
} *q;

void func()
{
   struct obj *p;
   for (p = q; p; p = p->next)
     	if (p->data > 0)
		count++;
}

Under the new memory model we should avoid promoting "count" to a 
register and unilaterally writing to it upon exiting the loop.
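
At the source level, the flag-based transform corresponds roughly to the 
following (a sketch only; the names count_lsm and count_flag are 
illustrative and do not come from the patch):

void func (void)
{
   struct obj *p;
   int count_lsm = count;     /* load hoisted before the loop */
   _Bool count_flag = 0;      /* no store has happened yet */

   for (p = q; p; p = p->next)
     if (p->data > 0)
       {
         count_lsm++;         /* operate on the promoted copy */
         count_flag = 1;      /* remember that a store happened */
       }

   if (count_flag)            /* write back only on paths that stored */
     count = count_lsm;
}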

With the attached patch, we now generate:

func:
.LFB0:
         movq    q(%rip), %rax
         xorl    %ecx, %ecx
         movl    count(%rip), %edx
         testq   %rax, %rax
         je      .L1
.L9:
         movl    (%rax), %esi
         testl   %esi, %esi
         jle     .L3
         addl    $1, %edx
         movl    $1, %ecx
.L3:
         movq    8(%rax), %rax
         testq   %rax, %rax
         jne     .L9
         testb   %cl, %cl
         jne     .L12
.L1:
         rep
         ret
.L12:
         movl    %edx, count(%rip)
         ret

Which is as expected.

I don't understand your suggestion of:

 >    lsm = g2;  // technically not needed, if you also handle loads
> that would allow to extend this to cover the cases where the access may
> eventually trap (if you omit the load before the loop).

Can't I just remove the lsm=g2?  What's this about also handling loads?

>
> Not sure which is more efficient (you can eventually combine flag variables
> for different store locations, combining the if-changed compare is not so easy
> I guess).

So far, I see better code generated with the flags approach, so...
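
To illustrate the point about combining flag variables: if two promoted 
locations are only ever stored on the same paths, a single flag can guard 
both write-backs.  A purely illustrative sketch, not something the patch 
does (g_3 and the *_lsm names are made up):

int g_1, g_2, g_3;

void sketch (void)
{
  int g_2_lsm = g_2, g_3_lsm = g_3;
  _Bool flag = 0;
  int l;

  for (l = 0; l < 1234; l++)
    {
      if (g_1)
        break;
      g_2_lsm = 0;        /* promoted store to g_2 */
      g_3_lsm = 1;        /* promoted store to g_3, same path */
      flag = 1;           /* one flag covers both stores */
    }
  if (flag)
    {
      g_2 = g_2_lsm;
      g_3 = g_3_lsm;
    }
}

With the if-changed approach, each location would instead need its own 
compare against its own pre-loop value, which is harder to combine.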

>> 1. No pass can figure out the equality (or inequality) of g_2_lsm and g_2.
>>   In the PR, Richard mentions that both FRE/PRE can figure this out, but they
>> are not run after store motion.  DOM should also be able to, but apparently
>> does not :(.  Tips?
>
> Well.  Schedule FRE after loop opts ...
>
> DOM is not prepared to handle loads/stores with differing VOPs - it was never
> updated really after the alias-improvements merge.
>
>> 2. The GIMPLE_CONDs I create in this patch are currently causing problems
>> with complex floats/doubles.  do_jump is complaining that it can't compare
>> them, so I assume it is too late (in tree-ssa-loop-im.c) to compare complex
>> values since complex lowering has already happened (??). Is there a more
>> canonical way of creating a GIMPLE_COND that may contain complex floats at
>> this stage?
>
> As you are handling redundant stores you want a bitwise comparison anyway,
> but yes, complex compares need to be lowered.  But as said, you want a
> bitwise comparison, not a value-comparison.  You can try using

I can ignore all this.  I have implemented both alternatives, with a 
target hook as suggested, but I'm thinking of removing it altogether and 
just leaving the flags approach.

> Few comments on the patch itself.
>
> +         new_bb = split_edge (ex);
> +         then_bb = create_empty_bb (new_bb);
> +         if (current_loops&&  new_bb->loop_father)
> +           add_bb_to_loop (then_bb, new_bb->loop_father);
>
> +         gsi = gsi_start_bb (new_bb);
> +         t1 = make_rename_temp (TREE_TYPE (ref->mem), NULL);
>
> Hmm, why do you need to re-load the value?  Don't you have both
> the value as it was loaded before the loop and the new value (the lsm tmp reg)
> already?  Why not simply remember the SSA name used?  Ah, because
> store motion also uses make_rename_temp.  If you don't want to change
> that (which would be good but inconvenient because of the required PHI
> insertions), I'd change the code that performs the load in the pre-header to do
>
>    SSA_NAME = ref->mem;
>    rename-lsm-tmp = SSA_NAME;
>
> so you can remember SSA_NAME and re-use it here.  That probably solves
> your optimization issue, too.

Fixed.

> +         for (gsi = gsi_start_phis (old_dest); !gsi_end_p (gsi);
> gsi_next (&gsi))
> +           {
> +             gimple phi = gsi_stmt (gsi);
> +             unsigned i;
> +
> +             for (i = 0; i<  gimple_phi_num_args (phi); i++)
> +               if (gimple_phi_arg_edge (phi, i)->src == new_bb)
> +                 {
> +                   tree arg = gimple_phi_arg_def (phi, i);
> +                   add_phi_arg (phi, arg, then_old_edge, UNKNOWN_LOCATION);
> +                   update_stmt (phi);
> +                 }
> +           }
>
> Hmm.  This is of course doing unnecessary work in case there are multiple
> moved stores.  In fact splitting the exit edge is only necessary if there are
> more than one predecessor.  Otherwise you can simply split the exit block
> after inserting the new conditional after its labels.

Is this still the case with the current patch?  If so, I'm a bit 
confused as to what you want here.

>
> +  if (opts->x_flag_tm)
> +    {
> +      if (opts->x_flag_non_call_exceptions)
> +       sorry ("transactional memory is not supported with non-call
> exceptions");
> +
> +      set_param_value ("allow-load-data-races", 0,
> +                      opts->x_param_values, opts_set->x_param_values);
> +      set_param_value ("allow-store-data-races", 0,
> +                      opts->x_param_values, opts_set->x_param_values);
>
> Unless the user explicitely set those params?  Which means using params
> wasn't a very good idea ... ideally you'd diagnose an error when the user
> would mix -fgnu-tm and -fallow-store-data-races.  So - can we remove
> allow-load-data-races (we are never going to implement that) and make
> allow-store-data-races a proper -f flag please?

Sorry, that was not meant for public consumption.  I have rewritten this 
to take the param value as is and, in the case of flag_tm, to only 
restrict the optimization if the loop is inside an actual transaction 
(see the changes to trans-mem.c, because I know you'll hate 
bb_in_transaction() :-)).

>
> +    }
>
> And that should be a separate patch

I can certainly reimplement --param=allow-{load,store}-data-races as 
proper -f flags.  I don't care either way.  But I remember Ian Taylor 
having a different opinion.  If y'all agree, I can implement whatever.
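
(For reference, a proper -f flag would boil down to roughly the following 
common.opt entry - just a sketch of what such a conversion could look like, 
with illustrative option/variable names, not something in the attached patch:

fallow-store-data-races
Common Report Var(flag_allow_store_data_races) Init(0) Optimization
Allow the optimizers to introduce new data races on stores.

and the PARAM_VALUE (PARAM_ALLOW_STORE_DATA_RACES) check in 
tree-ssa-loop-im.c would then test the flag variable instead.)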

> It would be nice if you could think about LIM/SM for possibly trapping
> loads/stores
> while you are doing this work.

We are no longer adding additional trapping loads (as with the 
if-changed approach).  Is this something else you'd like me to 
concentrate on?

Please take a look at the attached patch.  I tested the flags 
alternative on a full bootstrap, and there's only one regression, which 
I will look at, but I'd like to know whether the current WIP is aligned 
with what you had in mind.

Thanks again.
Aldy

	* trans-mem.c (bb_in_transaction_1): New.
	(bb_in_transaction): New.
	* gimple.h (bb_in_transaction): Protoize.
	* Makefile.in (tree-ssa-loop-im.o): Depend on TARGET_H.
	* doc/tm.texi: Regenerate.
	* tree-ssa-loop-im.c (execute_sm_if_changed): New.
	(execute_sm_if_changed_flag): New.
	(execute_sm_if_changed_flag_set): New.
	(execute_sm): Do not generate data races unless requested.
	* targhooks.c (default_can_compare_bitwise_p): New.
	* targhooks.h (default_can_compare_bitwise_p): Protoize.
	* target.def (can_compare_bitwise_p): New hook.
	* tree-ssa-loop-im.c (execute_sm): Guard store motion with a
	conditional.
	* opts.c (finish_options): Do not allow store or load data races
	with -fgnu-tm.

Comments

Richard Biener April 25, 2012, 11:45 a.m. UTC | #1
On Tue, Apr 24, 2012 at 7:43 PM, Aldy Hernandez <aldyh@redhat.com> wrote:
> On 04/13/12 03:46, Richard Guenther wrote:
>>
>> On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez<aldyh@redhat.com>  wrote:
>
>
> Richard.  Thanks so much for reviewing and providing an alternative
> approach, which AFAICT provides superior results.
>
>
>> A similar effect could be obtained by keeping a flag whether we entered
>> the
>> path that stored the value before the transform.  Thus, do
>>
>>   lsm = g2;  // technically not needed, if you also handle loads
>>   flag = false;
>>   for (...)
>>     {
>>        if (g1)
>>          {
>>             if (flag)
>>               g2 = lsm;
>>          }
>>        else
>>          {
>>             lsm = 0;
>>             flag = true;
>>          }
>>     }
>>   if (flag)
>>     g2 = lsm;
>
>
> I have implemented this in the attached patch, and it seems to be generating
> better code than my other if-changed approach.  It also avoids generating
> unnecessary loads that may trap.
>
> For the original testcase:
>
>
> int g_1 = 1;
> int g_2 = 0;
>
> int func_1(void)
> {
>  int l;
>  for (l = 0; l < 1234; l++)
>  {
>   if (g_1)
>     return l;
>   else
>     g_2 = 0;
>  }
>  return 999;
> }
>
> After all optimizations are done, we are now generating the following for
> the flags approach:
>
> new:
> func_1:
>        movl    g_1(%rip), %edx
>        xorl    %eax, %eax
>        testl   %edx, %edx
>        je      .L5
>        rep
>        ret
> .L5:
>        movl    $0, g_2(%rip)
>        movw    $999, %ax
>        ret
>
> Which is significantly better than the unmodified trunk of:
>
> func_1:
>        movl    g_1(%rip), %edx
>        movl    g_2(%rip), %eax
>        testl   %edx, %edx
>        je      .L5
>        movl    %eax, g_2(%rip)
>        xorl    %eax, %eax
>        ret
> .L5:
>        movl    $0, g_2(%rip)
>        movl    $999, %eax
>        ret
>
> And in other less contrived cases, we generate correct code that avoids
> writes on code paths that did not have it.  For example, in Hans register
> promotion example:
>
> int count;
>
> struct obj {
>    int data;
>    struct obj *next;
> } *q;
>
> void func()
> {
>  struct obj *p;
>  for (p = q; p; p = p->next)
>        if (p->data > 0)
>                count++;
> }
>
> Under the new memory model we should avoid promoting "count" to a register
> and unilaterally writing to it upon exiting the loop.
>
> With the attached patch, we now generate:
>
> func:
> .LFB0:
>        movq    q(%rip), %rax
>        xorl    %ecx, %ecx
>        movl    count(%rip), %edx
>        testq   %rax, %rax
>        je      .L1
> .L9:
>        movl    (%rax), %esi
>        testl   %esi, %esi
>        jle     .L3
>        addl    $1, %edx
>        movl    $1, %ecx
> .L3:
>        movq    8(%rax), %rax
>        testq   %rax, %rax
>        jne     .L9
>        testb   %cl, %cl
>        jne     .L12
> .L1:
>        rep
>        ret
> .L12:
>        movl    %edx, count(%rip)
>        ret
>
> Which is as expected.
>
> I don't understand your suggestion of:
>
>
>>    lsm = g2;  // technically not needed, if you also handle loads
>>
>> that would allow to extend this to cover the cases where the access may
>>
>> eventually trap (if you omit the load before the loop).
>
>
> Can't I just remove the lsm=g2?  What's this about also handling loads?
>
>
>>
>> Not sure which is more efficient (you can eventually combine flag
>> variables
>> for different store locations, combining the if-changed compare is not so
>> easy
>> I guess).
>
>
> So far, I see better code generated with the flags approach, so...
>
>
>>> 1. No pass can figure out the equality (or inequality) of g_2_lsm and
>>> g_2.
>>>  In the PR, Richard mentions that both FRE/PRE can figure this out, but
>>> they
>>> are not run after store motion.  DOM should also be able to, but
>>> apparently
>>> does not :(.  Tips?
>>
>>
>> Well.  Schedule FRE after loop opts ...
>>
>> DOM is not prepared to handle loads/stores with differing VOPs - it was
>> never
>> updated really after the alias-improvements merge.
>>
>>> 2. The GIMPLE_CONDs I create in this patch are currently causing problems
>>> with complex floats/doubles.  do_jump is complaining that it can't
>>> compare
>>> them, so I assume it is too late (in tree-ssa-loop-im.c) to compare
>>> complex
>>> values since complex lowering has already happened (??). Is there a more
>>> canonical way of creating a GIMPLE_COND that may contain complex floats
>>> at
>>> this stage?
>>
>>
>> As you are handling redundant stores you want a bitwise comparison anyway,
>> but yes, complex compares need to be lowered.  But as said, you want a
>> bitwise comparison, not a value-comparison.  You can try using
>
>
> I can ignore all this.  I have implemented both alternatives, with a target
> hook as suggested, but I'm thinking of removing it altogether and just
> leaving the flags approach.
>
>> Few comments on the patch itself.
>>
>> +         new_bb = split_edge (ex);
>> +         then_bb = create_empty_bb (new_bb);
>> +         if (current_loops&&  new_bb->loop_father)
>>
>> +           add_bb_to_loop (then_bb, new_bb->loop_father);
>>
>> +         gsi = gsi_start_bb (new_bb);
>> +         t1 = make_rename_temp (TREE_TYPE (ref->mem), NULL);
>>
>> Hmm, why do you need to re-load the value?  Don't you have both
>> the value as it was loaded before the loop and the new value (the lsm tmp
>> reg)
>> already?  Why not simply remember the SSA name used?  Ah, because
>> store motion also uses make_rename_temp.  If you don't want to change
>> that (which would be good but inconvenient because of the required PHI
>> insertions), I'd change the code that performs the load in the pre-header
>> to do
>>
>>   SSA_NAME = ref->mem;
>>   rename-lsm-tmp = SSA_NAME;
>>
>> so you can remember SSA_NAME and re-use it here.  That probably solves
>> your optimization issue, too.
>
>
> Fixed.
>
>
>> +         for (gsi = gsi_start_phis (old_dest); !gsi_end_p (gsi);
>> gsi_next (&gsi))
>> +           {
>> +             gimple phi = gsi_stmt (gsi);
>> +             unsigned i;
>> +
>> +             for (i = 0; i<  gimple_phi_num_args (phi); i++)
>> +               if (gimple_phi_arg_edge (phi, i)->src == new_bb)
>> +                 {
>> +                   tree arg = gimple_phi_arg_def (phi, i);
>> +                   add_phi_arg (phi, arg, then_old_edge,
>> UNKNOWN_LOCATION);
>> +                   update_stmt (phi);
>> +                 }
>> +           }
>>
>> Hmm.  This is of course doing unnecessary work in case there are multiple
>> moved stores.  In fact splitting the exit edge is only necessary if there
>> are
>> more than one predecessor.  Otherwise you can simply split the exit block
>> after inserting the new conditional after its labels.
>
>
> Is this still the case with the current patch?  If so, I'm a bit confused as
> to what you want here.
>
>
>>
>> +  if (opts->x_flag_tm)
>> +    {
>> +      if (opts->x_flag_non_call_exceptions)
>> +       sorry ("transactional memory is not supported with non-call
>> exceptions");
>> +
>> +      set_param_value ("allow-load-data-races", 0,
>> +                      opts->x_param_values, opts_set->x_param_values);
>> +      set_param_value ("allow-store-data-races", 0,
>> +                      opts->x_param_values, opts_set->x_param_values);
>>
>> Unless the user explicitely set those params?  Which means using params
>> wasn't a very good idea ... ideally you'd diagnose an error when the user
>> would mix -fgnu-tm and -fallow-store-data-races.  So - can we remove
>> allow-load-data-races (we are never going to implement that) and make
>> allow-store-data-races a proper -f flag please?
>
>
> Sorry, that was not meant for public consumption.  I have rewritten this to
> either take the param value as is, and in the case of flag_tm, only restrict
> the optimization if the loop is inside an actual transaction (see the
> changes to trans-mem.c, cause I know you'll hate bb_in_transaction() :-)).
>
>
>>
>> +    }
>>
>> And that should be a separate patch
>
>
> I can certainly reimplement --param=allow-{load,store}-data-races as proper
> -f flags.  I don't care either way.  But I remember Ian Taylor having a
> different opinion.  If y'all agree, I can implement whatever.
>
>
>> It would be nice if you could think about LIM/SM for possibly trapping
>> loads/stores
>> while you are doing this work.
>
>
> We are no longer adding additional trapping loads (as with the if-changed
> approach).  Is this something else you'd like me to concentrate on?
>
> Please take a look at the attached patch.  I tested the flags alternative on
> a full bootstrap, and there's only one regression which I will look at, but
> I'd like to know if the current WIP is aligned with what you had in mind.

+  /* ?? Perhaps we should cache this somewhere in the BB, or are
+     multiple levels of empty BB's common. ??  */
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    {
+      int res = bb_in_transaction_1 (e->src);
+      if (res != -1)
+       return (bool) res;
+    }
+  return false;

that's weird code - if the predecessors do not all agree on whether they are
in a transaction, you return a more or less random value.  So it seems this
is somewhat fragile.

If the BB status can be derived from looking at a single stmt, why is the
transaction-ness not recorded as a basic-block flag then?  Thus,
why do we have a bit in gimple stmts?

Why can nobody merge a transaction and a non-transaction basic block,
making your predicate above wrong because only the 2nd stmt is
in a transaction?

+  bool single_exit_p = single_pred_p (ex->dest);

that's a strange variable name ;)

+/* Helper function for execute_sm.  On every path that sets REF, set
+   an appropriate flag indicating the store.  */
+
+static tree
+execute_sm_if_changed_flag_set (struct loop *loop, mem_ref_p ref)
+{
...
+  FOR_EACH_VEC_ELT (mem_ref_loc_p, locs, i, loc)
+    {
+      gimple_stmt_iterator gsi;
+      gimple stmt;
+
+      gsi = gsi_start_bb (gimple_bb (loc->stmt));
+      for (; !gsi_end_p (gsi); gsi_next (&gsi))
+       if (gsi_stmt (gsi) == loc->stmt)

This does not seem to do it on every path but on every REF setter.  And
instead of the innermost loop, just use gsi_for_stmt (loc->stmt).
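
Presumably the body of that FOR_EACH_VEC_ELT loop would then shrink to
something like this (an untested sketch of the suggestion):

      gimple_stmt_iterator gsi = gsi_for_stmt (loc->stmt);
      gimple stmt = gimple_build_assign (flag, boolean_true_node);

      /* Insert flag = true right after the store itself.  */
      gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);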

+  /* Emit the load code into the latch, so that we are sure it will
+     be processed after all dependencies.  */
+  latch_edge = loop_latch_edge (loop);
+  load = gimple_build_assign (mem_ssa, unshare_expr (ref->mem));
   lim_data = init_lim_data (load);
   lim_data->max_loop = loop;
   lim_data->tgt_loop = loop;
+  gsi_insert_on_edge (latch_edge, load);
+  load = gimple_build_assign (tmp_var, mem_ssa);
+  lim_data = init_lim_data (load);
+  lim_data->max_loop = loop;
+  lim_data->tgt_loop = loop;
+  gsi_insert_on_edge (latch_edge, load);

the 2nd assign is no longer a "load", so I suppose you don't need any lim_data
for it?

+      else if (store == STORE_IF_CHANGED_FLAG)
+       execute_sm_if_changed (ex, ref->mem, mem_ssa, tmp_var,
+                              store_flag);

and this can pass NULL instead of mem_ssa?

Overall this looks reasonable.  With also handling trapping things I meant
that we should be able to apply store-motion to

int a[256];
int flag;    /* some loop-invariant condition */

void foo (int *p)
{
  int i;
  for (i = 0; i < 256; i++)
    if (flag)
      *p = a[i];
}

but we cannot, because the transform

  lsm = *p;
  for (i = 0; i < 256; ++i)
    if (flag)
      lsm = a[i];
  *p = lsm;

even though the store is properly conditional, the load is not (and you do not
change that).  The unconditional load may trap here.  What I meant to say
is that we do not need the load before the loop at all unless there is also a
load involved inside the loop - if we use the flag-based approach.  If there
is also a load inside the loop we'd need to perform the load there (or not
handle this case).
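
In other words, the flag-based variant without the pre-loop load would look
roughly like this (a sketch, reusing a and flag from the example above;
foo_sm, lsm and stored are illustrative names, and this is not what the
patch currently generates):

void foo_sm (int *p)
{
  int lsm, i;
  _Bool stored = 0;

  for (i = 0; i < 256; i++)
    if (flag)
      {
        lsm = a[i];
        stored = 1;
      }
  if (stored)
    *p = lsm;   /* still the only access to *p, still conditional */
}

No load of *p is emitted before the loop, so nothing can trap that could not
already trap in the original.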

But overall this looks nice!

Thanks,
Richard.

> Thanks again.
> Aldy

Patch

Index: trans-mem.c
===================================================================
--- trans-mem.c	(revision 186108)
+++ trans-mem.c	(working copy)
@@ -135,6 +135,47 @@ 
 */
 
 
+/* Helper function for bb_in_transaction.  Returns true if the first
+   statement in BB is in a transaction.  If the BB does not have any
+   statements, return -1.  */
+
+static int
+bb_in_transaction_1 (basic_block bb)
+{
+  gimple t;
+
+  t = gimple_seq_first_stmt (bb_seq (bb));
+  if (t)
+    return gimple_in_transaction (t);
+  return -1;
+}
+
+/* Return true if BB is in a transaction.  This can only be called
+   after transaction bits have been calculated with
+   compute_transaction_bits().  */
+
+bool
+bb_in_transaction (basic_block bb)
+{
+  edge e;
+  edge_iterator ei;
+  int res;
+
+  res = bb_in_transaction_1 (bb);
+  if (res != -1)
+    return (bool) res;
+
+  /* ?? Perhaps we should cache this somewhere in the BB, or are
+     multiple levels of empty BB's common. ??  */
+  FOR_EACH_EDGE (e, ei, bb->preds)
+    {
+      int res = bb_in_transaction_1 (e->src);
+      if (res != -1)
+	return (bool) res;
+    }
+  return false;
+}
+
 /* Return the attributes we want to examine for X, or NULL if it's not
    something we examine.  We look at function types, but allow pointers
    to function types and function decls and peek through.  */
Index: tree-ssa-loop-im.c
===================================================================
--- tree-ssa-loop-im.c	(revision 186108)
+++ tree-ssa-loop-im.c	(working copy)
@@ -40,6 +40,7 @@  along with GCC; see the file COPYING3.
 #include "tree-affine.h"
 #include "pointer-set.h"
 #include "tree-ssa-propagate.h"
+#include "target.h"
 
 /* TODO:  Support for predicated code motion.  I.e.
 
@@ -52,7 +53,7 @@  along with GCC; see the file COPYING3.
 	 }
      }
 
-   Where COND and INV are is invariants, but evaluating INV may trap or be
+   Where COND and INV are invariants, but evaluating INV may trap or be
    invalid from some other reason if !COND.  This may be transformed to
 
    if (cond)
@@ -1956,6 +1957,161 @@  get_lsm_tmp_name (tree ref, unsigned n)
   return lsm_tmp_name;
 }
 
+/* Helper function for execute_sm.  Emit code to store TMP_VAR into
+   MEM along edge EX.
+
+   The store is only done if MEM has changed.  We do this so no
+   changes to MEM occur on code paths that did not originally store
+   into it.
+
+   The common case for execute_sm will transform:
+
+     for (...) {
+       if (foo)
+         stuff;
+       else
+         MEM = TMP_VAR;
+     }
+
+   into:
+
+     lsm = MEM;
+     for (...) {
+       if (foo)
+         stuff;
+       else
+         lsm = TMP_VAR;
+     }
+     MEM = lsm;
+
+  This function will either generate when FLAG is present:
+
+     lsm = MEM;
+     flag = false;
+     ...
+     if (foo)
+       stuff;
+     else {
+       lsm = TMP_VAR;
+       flag = true;
+     }
+     if (flag)		<--
+       MEM = lsm;	<--
+
+  or the following when FLAG is not set:
+
+     lsm = MEM;
+     for (...) {
+       if (foo)
+         stuff;
+       else
+         lsm = TMP_VAR;
+     }
+     if (lsm != MEM)	<--
+       MEM = lsm;	<--
+*/
+
+static void
+execute_sm_if_changed (edge ex, tree mem, tree orig_mem_ssa, tree tmp_var,
+		       tree flag)
+{
+  basic_block new_bb, then_bb, old_dest = ex->dest;
+  bool single_exit_p = single_pred_p (ex->dest);
+  edge then_old_edge;
+  gimple_stmt_iterator gsi;
+  gimple stmt;
+
+  if (single_exit_p)
+    ex = split_block_after_labels (old_dest);
+
+  old_dest = ex->dest;
+  new_bb = split_edge (ex);
+  then_bb = create_empty_bb (new_bb);
+  if (current_loops && new_bb->loop_father)
+    add_bb_to_loop (then_bb, new_bb->loop_father);
+
+  gsi = gsi_start_bb (new_bb);
+  if (flag)
+    stmt = gimple_build_cond (NE_EXPR, flag, boolean_false_node,
+			      NULL_TREE, NULL_TREE);
+  else
+    stmt = gimple_build_cond (NE_EXPR, orig_mem_ssa, tmp_var,
+			      NULL_TREE, NULL_TREE);
+  gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+  gsi = gsi_start_bb (then_bb);
+  /* Insert actual store.  */
+  stmt = gimple_build_assign (unshare_expr (mem), tmp_var);
+  gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+  make_edge (new_bb, then_bb, EDGE_TRUE_VALUE);
+  make_edge (new_bb, old_dest, EDGE_FALSE_VALUE);
+  then_old_edge = make_edge (then_bb, old_dest, EDGE_FALLTHRU);
+  set_immediate_dominator (CDI_DOMINATORS, then_bb, new_bb);
+
+  if (!single_exit_p)
+    for (gsi = gsi_start_phis (old_dest); !gsi_end_p (gsi); gsi_next (&gsi))
+      {
+	gimple phi = gsi_stmt (gsi);
+	unsigned i;
+
+	for (i = 0; i < gimple_phi_num_args (phi); i++)
+	  if (gimple_phi_arg_edge (phi, i)->src == new_bb)
+	    {
+	      tree arg = gimple_phi_arg_def (phi, i);
+	      add_phi_arg (phi, arg, then_old_edge, UNKNOWN_LOCATION);
+	      update_stmt (phi);
+	    }
+      }
+  /* Remove the original fall through edge.  This was the
+     single_succ_edge (new_bb).  */
+  EDGE_SUCC (new_bb, 0)->flags &= ~EDGE_FALLTHRU;
+}
+
+/* Helper function for execute_sm.  On every path that sets REF, set
+   an appropriate flag indicating the store.  */
+
+static tree
+execute_sm_if_changed_flag_set (struct loop *loop, mem_ref_p ref)
+{
+  unsigned i;
+  mem_ref_loc_p loc;
+  tree flag;
+  VEC (mem_ref_loc_p, heap) *locs = NULL;
+  char *str = get_lsm_tmp_name (ref->mem, ~0);
+
+  lsm_tmp_name_add ("_flag");
+  flag = make_rename_temp (boolean_type_node, str);
+  get_all_locs_in_loop (loop, ref, &locs);
+  FOR_EACH_VEC_ELT (mem_ref_loc_p, locs, i, loc)
+    {
+      gimple_stmt_iterator gsi;
+      gimple stmt;
+
+      gsi = gsi_start_bb (gimple_bb (loc->stmt));
+      for (; !gsi_end_p (gsi); gsi_next (&gsi))
+	if (gsi_stmt (gsi) == loc->stmt)
+	  {
+	    stmt = gimple_build_assign (flag, boolean_true_node);
+	    gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+	    break;
+	  }
+    }
+  VEC_free (mem_ref_loc_p, heap, locs);
+  return flag;
+}
+
+/* Store motion types for execute_sm below.  */
+enum store_type
+  {
+    /* Traditional single-thread.  */
+    STORE_SINGLE_THREAD,
+    /* Store only if value changed.  */
+    STORE_IF_CHANGED,
+    /* Store only if flag changed.  */
+    STORE_IF_CHANGED_FLAG
+  };
+
 /* Executes store motion of memory reference REF from LOOP.
    Exits from the LOOP are stored in EXITS.  The initialization of the
    temporary variable is put to the preheader of the loop, and assignments
@@ -1964,12 +2120,13 @@  get_lsm_tmp_name (tree ref, unsigned n)
 static void
 execute_sm (struct loop *loop, VEC (edge, heap) *exits, mem_ref_p ref)
 {
-  tree tmp_var;
+  tree tmp_var, mem_ssa, store_flag;
   unsigned i;
-  gimple load, store;
+  gimple load;
   struct fmt_data fmt_data;
-  edge ex;
+  edge ex, latch_edge;
   struct lim_aux_data *lim_data;
+  enum store_type store;
 
   if (dump_file && (dump_flags & TDF_DETAILS))
     {
@@ -1978,6 +2135,8 @@  execute_sm (struct loop *loop, VEC (edge
       fprintf (dump_file, " from loop %d\n", loop->num);
     }
 
+  mem_ssa = create_tmp_reg (TREE_TYPE (unshare_expr (ref->mem)), NULL);
+  mem_ssa = make_ssa_name (mem_ssa, NULL);
   tmp_var = make_rename_temp (TREE_TYPE (ref->mem),
 			      get_lsm_tmp_name (ref->mem, ~0));
 
@@ -1985,22 +2144,66 @@  execute_sm (struct loop *loop, VEC (edge
   fmt_data.orig_loop = loop;
   for_each_index (&ref->mem, force_move_till, &fmt_data);
 
+  if ((flag_tm && bb_in_transaction (loop_preheader_edge (loop)->src))
+      || !PARAM_VALUE (PARAM_ALLOW_STORE_DATA_RACES))
+    {
+      if (targetm.can_compare_bitwise_p (TYPE_MODE (TREE_TYPE (ref->mem))))
+	store = STORE_IF_CHANGED;
+      else
+	store = STORE_IF_CHANGED_FLAG;
+    }
+  else
+    store = STORE_SINGLE_THREAD;
+
+#if 1
+  /* ?? testing ?? */
+  store = STORE_IF_CHANGED_FLAG;
+#endif
+
+  if (store == STORE_IF_CHANGED_FLAG)
+    store_flag = execute_sm_if_changed_flag_set (loop, ref);
+
   rewrite_mem_refs (loop, ref, tmp_var);
 
-  /* Emit the load & stores.  */
-  load = gimple_build_assign (tmp_var, unshare_expr (ref->mem));
+
+  /* Emit the load code into the latch, so that we are sure it will
+     be processed after all dependencies.  */
+  latch_edge = loop_latch_edge (loop);
+  load = gimple_build_assign (mem_ssa, unshare_expr (ref->mem));
   lim_data = init_lim_data (load);
   lim_data->max_loop = loop;
   lim_data->tgt_loop = loop;
+  gsi_insert_on_edge (latch_edge, load);
+  load = gimple_build_assign (tmp_var, mem_ssa);
+  lim_data = init_lim_data (load);
+  lim_data->max_loop = loop;
+  lim_data->tgt_loop = loop;
+  gsi_insert_on_edge (latch_edge, load);
+  if (store == STORE_IF_CHANGED_FLAG)
+    {
+      load = gimple_build_assign (store_flag, boolean_false_node);
+      lim_data = init_lim_data (load);
+      lim_data->max_loop = loop;
+      lim_data->tgt_loop = loop;
+      gsi_insert_on_edge (latch_edge, load);
+    }
 
-  /* Put this into the latch, so that we are sure it will be processed after
-     all dependencies.  */
-  gsi_insert_on_edge (loop_latch_edge (loop), load);
-
+  /* Sink the store to every exit from the loop.  */
   FOR_EACH_VEC_ELT (edge, exits, i, ex)
     {
-      store = gimple_build_assign (unshare_expr (ref->mem), tmp_var);
-      gsi_insert_on_edge (ex, store);
+      if (store == STORE_SINGLE_THREAD)
+	{
+	  gimple store;
+	  store = gimple_build_assign (unshare_expr (ref->mem), tmp_var);
+	  gsi_insert_on_edge (ex, store);
+	}
+      else if (store == STORE_IF_CHANGED)
+	execute_sm_if_changed (ex, ref->mem, mem_ssa, tmp_var, NULL);
+      else if (store == STORE_IF_CHANGED_FLAG)
+	execute_sm_if_changed (ex, ref->mem, mem_ssa, tmp_var,
+			       store_flag);
+      else
+	gcc_unreachable ();
     }
 }
 
Index: target.def
===================================================================
--- target.def	(revision 186108)
+++ target.def	(working copy)
@@ -2667,6 +2667,13 @@  DEFHOOK
  enum unwind_info_type, (void),
  default_debug_unwind_info)
 
+DEFHOOK
+(can_compare_bitwise_p,
+ "This hook should return true if the target can efficiently compare two\
+ registers in the given mode for bit equality.",
+ bool, (enum machine_mode),
+ default_can_compare_bitwise_p)
+
 DEFHOOKPOD
 (atomic_test_and_set_trueval,
  "This value should be set if the result written by\
Index: gimple.h
===================================================================
--- gimple.h	(revision 186108)
+++ gimple.h	(working copy)
@@ -1122,6 +1122,7 @@  extern tree omp_reduction_init (tree, tr
 /* In trans-mem.c.  */
 extern void diagnose_tm_safe_errors (tree);
 extern void compute_transaction_bits (void);
+extern bool bb_in_transaction (basic_block);
 
 /* In tree-nested.c.  */
 extern void lower_nested_functions (tree);
Index: Makefile.in
===================================================================
--- Makefile.in	(revision 186108)
+++ Makefile.in	(working copy)
@@ -2532,7 +2532,7 @@  tree-ssa-loop-im.o : tree-ssa-loop-im.c
    $(PARAMS_H) output.h $(DIAGNOSTIC_H) $(TIMEVAR_H) $(TM_H) coretypes.h \
    $(TREE_DUMP_H) $(TREE_PASS_H) $(FLAGS_H) $(BASIC_BLOCK_H) \
    pointer-set.h tree-affine.h tree-ssa-propagate.h gimple-pretty-print.h \
-   tree-pretty-print.h
+   tree-pretty-print.h $(TARGET_H)
 tree-ssa-math-opts.o : tree-ssa-math-opts.c $(CONFIG_H) $(SYSTEM_H) coretypes.h \
    $(TM_H) $(FLAGS_H) $(TREE_H) $(TREE_FLOW_H) $(TIMEVAR_H) \
    $(TREE_PASS_H) alloc-pool.h $(BASIC_BLOCK_H) $(TARGET_H) \
Index: doc/tm.texi
===================================================================
--- doc/tm.texi	(revision 186108)
+++ doc/tm.texi	(working copy)
@@ -9471,6 +9471,10 @@  how you define @code{DWARF2_FRAME_INFO}.
 @end defmac
 
 @deftypefn {Target Hook} {enum unwind_info_type} TARGET_DEBUG_UNWIND_INFO (void)
+
+@deftypefn {Target Hook} bool TARGET_CAN_COMPARE_BITWISE_P (enum @var{machine_mode})
+This hook should return true if the target can efficiently compare two registers in the given mode for bit equality.
+@end deftypefn
 This hook defines the mechanism that will be used for describing frame
 unwind information to the debugger.  Normally the hook will return
 @code{UI_DWARF2} if DWARF 2 debug information is enabled, and
Index: targhooks.c
===================================================================
--- targhooks.c	(revision 186108)
+++ targhooks.c	(working copy)
@@ -1331,6 +1331,15 @@  default_debug_unwind_info (void)
   return UI_NONE;
 }
 
+/* This hook returns true if the target can efficiently compare
+   two registers in the given mode for bit equality.  */
+
+bool
+default_can_compare_bitwise_p (enum machine_mode mode)
+{
+  return SCALAR_INT_MODE_P (mode);
+}
+
 /* To be used by targets where reg_raw_mode doesn't return the right
    mode for registers used in apply_builtin_return and apply_builtin_arg.  */
 
Index: targhooks.h
===================================================================
--- targhooks.h	(revision 186108)
+++ targhooks.h	(working copy)
@@ -168,6 +168,8 @@  extern unsigned char default_class_max_n
 
 extern enum unwind_info_type default_debug_unwind_info (void);
 
+extern bool default_can_compare_bitwise_p (enum machine_mode);
+
 extern int default_label_align_after_barrier_max_skip (rtx);
 extern int default_loop_align_max_skip (rtx);
 extern int default_label_align_max_skip (rtx);