Patchwork [PR,tree-optimization/52558] : RFC: questions on store data race

login
register
mail settings
Submitter Aldy Hernandez
Date April 12, 2012, 10:11 p.m.
Message ID <4F875324.7060908@redhat.com>
Download mbox | patch
Permalink /patch/152195/
State New
Headers show

Comments

Aldy Hernandez - April 12, 2012, 10:11 p.m.
Here we have a testcase that affects both the C++ memory model and 
transactional memory.

[Hans, this is caused by the same problem that is causing the 
speculative register promotion issue you and Torvald pointed me at].

In the following testcase (adapted from the PR), the loop invariant 
motion pass caches a pre-existing value for g_2, and then performs a 
store to g_2 on every path, causing a store data race:

int g_1 = 1;
int g_2 = 0;

int func_1(void)
{
   int l;
   for (l = 0; l < 1234; l++)
   {
     if (g_1)
       return l;
     else
       g_2 = 0;	<-- Store to g_2 should only happen if !g_1
   }
   return 999;
}

This gets transformed into something like:

	g_2_lsm = g_2;
	if (g_1) {
		g_2 = g_2_lsm;	// boo! hiss!
		return 0;
	} else {
		g_2_lsm = 0;
		g_2 = g_2_lsm;
	}

The spurious write to g_2 could cause a data race.

Andrew has suggested verifying that the store to g_2 was actually 
different than on entry, and letting PHI copy propagation optimize the 
redundant comparisons.  Like this:

	g_2_lsm = g_2;
	if (g_1) {
		if (g_2_lsm != g_2)	// hopefully optimized away
			g_2 = g_2_lsm;	// hopefully optimized away
		return 0;
	} else {
		g_2_lsm = 0;
		if (g_2_lsm != g_2)	// hopefully optimized away
			g_2 = g_2_lsm;
	}

...which PHI copy propagation and/or friends should be able to optimize.

For that matter, regardless of the memory model, the proposed solution 
would improve the existing code by removing the extra store (in this 
contrived test case anyhow).

Anyone see a problem with this approach (see attached patch)?

Assuming the unlikely scenario that everyone likes this :), I have two 
problems that I would like answered.  But feel free to ignore if the 
approach is a no go.

1. No pass can figure out the equality (or inequality) of g_2_lsm and 
g_2.  In the PR, Richard mentions that both FRE/PRE can figure this out, 
but they are not run after store motion.  DOM should also be able to, 
but apparently does not :(.  Tips?

2. The GIMPLE_CONDs I create in this patch are currently causing 
problems with complex floats/doubles.  do_jump is complaining that it 
can't compare them, so I assume it is too late (in tree-ssa-loop-im.c) 
to compare complex values since complex lowering has already happened 
(??). Is there a more canonical way of creating a GIMPLE_COND that may 
contain complex floats at this stage?

Thanks folks.
* tree-ssa-loop-im.c (execute_sm): Guard store motion with a
	conditional.
	* opts.c (finish_options): Do not allow store or load data races
	with -fgnu-tm.
Richard Guenther - April 13, 2012, 8:46 a.m.
On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez <aldyh@redhat.com> wrote:
> Here we have a testcase that affects both the C++ memory model and
> transactional memory.
>
> [Hans, this is caused by the same problem that is causing the speculative
> register promotion issue you and Torvald pointed me at].
>
> In the following testcase (adapted from the PR), the loop invariant motion
> pass caches a pre-existing value for g_2, and then performs a store to g_2
> on every path, causing a store data race:
>
> int g_1 = 1;
> int g_2 = 0;
>
> int func_1(void)
> {
>  int l;
>  for (l = 0; l < 1234; l++)
>  {
>    if (g_1)
>      return l;
>    else
>      g_2 = 0;  <-- Store to g_2 should only happen if !g_1
>  }
>  return 999;
> }
>
> This gets transformed into something like:
>
>        g_2_lsm = g_2;
>        if (g_1) {
>                g_2 = g_2_lsm;  // boo! hiss!
>                return 0;
>        } else {
>                g_2_lsm = 0;
>                g_2 = g_2_lsm;
>        }
>
> The spurious write to g_2 could cause a data race.
>
> Andrew has suggested verifying that the store to g_2 was actually different
> than on entry, and letting PHI copy propagation optimize the redundant
> comparisons.  Like this:
>
>        g_2_lsm = g_2;
>        if (g_1) {
>                if (g_2_lsm != g_2)     // hopefully optimized away
>                        g_2 = g_2_lsm;  // hopefully optimized away
>                return 0;
>        } else {
>                g_2_lsm = 0;
>                if (g_2_lsm != g_2)     // hopefully optimized away
>                        g_2 = g_2_lsm;
>        }
>
> ...which PHI copy propagation and/or friends should be able to optimize.
>
> For that matter, regardless of the memory model, the proposed solution would
> improve the existing code by removing the extra store (in this contrived
> test case anyhow).
>
> Anyone see a problem with this approach (see attached patch)?

Can we _remove_ a store we percieve as redundant (with a single-threaded
view) with the memory model?  If so, then this sounds plausible (minus
the optimization that if the variable is written to unconditionally we avoid
this comparison -- see the places where we check whether we can apply
store motion to possibly trapping locations).

A similar effect could be obtained by keeping a flag whether we entered the
path that stored the value before the transform.  Thus, do

  lsm = g2;  // technically not needed, if you also handle loads
  flag = false;
  for (...)
    {
       if (g1)
         {
            if (flag)
              g2 = lsm;
         }
       else
         {
            lsm = 0;
            flag = true;
         }
    }
  if (flag)
    g2 = lsm;

that would allow to extend this to cover the cases where the access may
eventually trap (if you omit the load before the loop).

Not sure which is more efficient (you can eventually combine flag variables
for different store locations, combining the if-changed compare is not so easy
I guess).

> Assuming the unlikely scenario that everyone likes this :), I have two
> problems that I would like answered.  But feel free to ignore if the
> approach is a no go.
>
> 1. No pass can figure out the equality (or inequality) of g_2_lsm and g_2.
>  In the PR, Richard mentions that both FRE/PRE can figure this out, but they
> are not run after store motion.  DOM should also be able to, but apparently
> does not :(.  Tips?

Well.  Schedule FRE after loop opts ...

DOM is not prepared to handle loads/stores with differing VOPs - it was never
updated really after the alias-improvements merge.

> 2. The GIMPLE_CONDs I create in this patch are currently causing problems
> with complex floats/doubles.  do_jump is complaining that it can't compare
> them, so I assume it is too late (in tree-ssa-loop-im.c) to compare complex
> values since complex lowering has already happened (??). Is there a more
> canonical way of creating a GIMPLE_COND that may contain complex floats at
> this stage?

As you are handling redundant stores you want a bitwise comparison anyway,
but yes, complex compares need to be lowered.  But as said, you want a
bitwise comparison, not a value-comparison.  You can try using

  if (VIEW_CONVERT_EXPR <int_type_for_mode (mode_for_size (...)) >
(lsm_tmp_reg) != VIEW_CONVERT_EXPR < .... > (orig_value))
    ...

that can of course result in really awful codegen (think of fp - gp register
moves, or even stack spills).  Which maybe hints at that the flag approach
would be better in some cases?  (you need to cache the original value
in a register anyway, the only thing you save is updating it when you would
update the flag value)

Maybe you can combine both, eventually dependent on a target hook
can_compare_bitwise (mode) that would tell you if the target can efficiently
compare two registers in mode for bit equality.

Few comments on the patch itself.

+         new_bb = split_edge (ex);
+         then_bb = create_empty_bb (new_bb);
+         if (current_loops && new_bb->loop_father)
+           add_bb_to_loop (then_bb, new_bb->loop_father);

+         gsi = gsi_start_bb (new_bb);
+         t1 = make_rename_temp (TREE_TYPE (ref->mem), NULL);

Hmm, why do you need to re-load the value?  Don't you have both
the value as it was loaded before the loop and the new value (the lsm tmp reg)
already?  Why not simply remember the SSA name used?  Ah, because
store motion also uses make_rename_temp.  If you don't want to change
that (which would be good but inconvenient because of the required PHI
insertions), I'd change the code that performs the load in the pre-header to do

  SSA_NAME = ref->mem;
  rename-lsm-tmp = SSA_NAME;

so you can remember SSA_NAME and re-use it here.  That probably solves
your optimization issue, too.

+         stmt = gimple_build_assign (t1, unshare_expr (ref->mem));
+         gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+         stmt = gimple_build_cond (NE_EXPR, t1, tmp_var,
+                                   NULL_TREE, NULL_TREE);

As I mentioned above you need to do a bitwise comparison (for example
we do not want to miss a store with -0.0 when the original value was 0.0,
but -0.0 == 0.0).  Thus you need to create two new temporaries
here to do the conversion (if it isn't already an integer mode).

+         for (gsi = gsi_start_phis (old_dest); !gsi_end_p (gsi);
gsi_next (&gsi))
+           {
+             gimple phi = gsi_stmt (gsi);
+             unsigned i;
+
+             for (i = 0; i < gimple_phi_num_args (phi); i++)
+               if (gimple_phi_arg_edge (phi, i)->src == new_bb)
+                 {
+                   tree arg = gimple_phi_arg_def (phi, i);
+                   add_phi_arg (phi, arg, then_old_edge, UNKNOWN_LOCATION);
+                   update_stmt (phi);
+                 }
+           }

Hmm.  This is of course doing unnecessary work in case there are multiple
moved stores.  In fact splitting the exit edge is only necessary if there are
more than one predecessor.  Otherwise you can simply split the exit block
after inserting the new conditional after its labels.

+  if (opts->x_flag_tm)
+    {
+      if (opts->x_flag_non_call_exceptions)
+       sorry ("transactional memory is not supported with non-call
exceptions");
+
+      set_param_value ("allow-load-data-races", 0,
+                      opts->x_param_values, opts_set->x_param_values);
+      set_param_value ("allow-store-data-races", 0,
+                      opts->x_param_values, opts_set->x_param_values);

Unless the user explicitely set those params?  Which means using params
wasn't a very good idea ... ideally you'd diagnose an error when the user
would mix -fgnu-tm and -fallow-store-data-races.  So - can we remove
allow-load-data-races (we are never going to implement that) and make
allow-store-data-races a proper -f flag please?

+    }

And that should be a separate patch

It would be nice if you could think about LIM/SM for possibly trapping
loads/stores
while you are doing this work.

Thanks,
Richard.

>
> Thanks folks.
Boehm, Hans - April 13, 2012, 11:22 p.m.
> -----Original Message-----
> From: Aldy Hernandez [mailto:aldyh@redhat.com]
> Sent: Thursday, April 12, 2012 3:12 PM
> To: Richard Guenther
> Cc: Andrew MacLeod; Boehm, Hans; gcc-patches; Torvald Riegel
> Subject: [PR tree-optimization/52558]: RFC: questions on store data
> race
> 
> Here we have a testcase that affects both the C++ memory model and
> transactional memory.
> 
> [Hans, this is caused by the same problem that is causing the
> speculative register promotion issue you and Torvald pointed me at].
> 
> In the following testcase (adapted from the PR), the loop invariant
> motion pass caches a pre-existing value for g_2, and then performs a
> store to g_2 on every path, causing a store data race:
> 
> int g_1 = 1;
> int g_2 = 0;
> 
> int func_1(void)
> {
>    int l;
>    for (l = 0; l < 1234; l++)
>    {
>      if (g_1)
>        return l;
>      else
>        g_2 = 0;	<-- Store to g_2 should only happen if !g_1
>    }
>    return 999;
> }
> 
> This gets transformed into something like:
> 
> 	g_2_lsm = g_2;
> 	if (g_1) {
> 		g_2 = g_2_lsm;	// boo! hiss!
> 		return 0;
> 	} else {
> 		g_2_lsm = 0;
> 		g_2 = g_2_lsm;
> 	}
> 
> The spurious write to g_2 could cause a data race.
Presumably the g_2_lsm = g_2 is actually outside the loop?

Why does the second g_2 = g_2_lsm; get introduced?  I would have expected it before the return.  Was the example just over-abbreviated?

Other than that, this sounds right to me.  So does Richard's flag-based version, which is the approach I would have originally expected.  But that clearly costs you a register.  It would be interesting to see how they compare.

Hans

> 
> Andrew has suggested verifying that the store to g_2 was actually
> different than on entry, and letting PHI copy propagation optimize the
> redundant comparisons.  Like this:
> 
> 	g_2_lsm = g_2;
> 	if (g_1) {
> 		if (g_2_lsm != g_2)	// hopefully optimized away
> 			g_2 = g_2_lsm;	// hopefully optimized away
> 		return 0;
> 	} else {
> 		g_2_lsm = 0;
> 		if (g_2_lsm != g_2)	// hopefully optimized away
> 			g_2 = g_2_lsm;
> 	}
> 
> ...which PHI copy propagation and/or friends should be able to
> optimize.
> 
> For that matter, regardless of the memory model, the proposed solution
> would improve the existing code by removing the extra store (in this
> contrived test case anyhow).
> 
> Anyone see a problem with this approach (see attached patch)?
> 
> Assuming the unlikely scenario that everyone likes this :), I have two
> problems that I would like answered.  But feel free to ignore if the
> approach is a no go.
> 
> 1. No pass can figure out the equality (or inequality) of g_2_lsm and
> g_2.  In the PR, Richard mentions that both FRE/PRE can figure this
> out, but they are not run after store motion.  DOM should also be able
> to, but apparently does not :(.  Tips?
> 
> 2. The GIMPLE_CONDs I create in this patch are currently causing
> problems with complex floats/doubles.  do_jump is complaining that it
> can't compare them, so I assume it is too late (in tree-ssa-loop-im.c)
> to compare complex values since complex lowering has already happened
> (??). Is there a more canonical way of creating a GIMPLE_COND that may
> contain complex floats at this stage?
> 
> Thanks folks.
Boehm, Hans - April 13, 2012, 11:30 p.m.
> From: Richard Guenther [mailto:richard.guenther@gmail.com]
> Can we _remove_ a store we percieve as redundant (with a single-threaded
> view) with the memory model?
Generally yes, so long as synchronization operations either conservatively treated as completely opaque, or are treated correctly in the "single-threaded view".  If there is no synchronization between the original store, and the redundant one, then the redundant one changes things only if another thread writes to the same variable in-between.  That constitutes a data race, which invokes undefined behavior.  The general rule is that any sequentially correct transformation is OK between synchronization operations, so long as you don't store to anything you weren't supposed to modify, and the state at the next synchronization point is what would have been expected.  You can sometimes treat synchronizations more aggressively, but that should be safe.

Hans
Aldy Hernandez - April 17, 2012, 3:03 p.m.
On 04/13/12 18:22, Boehm, Hans wrote:
>
>
>> -----Original Message-----
>> From: Aldy Hernandez [mailto:aldyh@redhat.com]
>> Sent: Thursday, April 12, 2012 3:12 PM
>> To: Richard Guenther
>> Cc: Andrew MacLeod; Boehm, Hans; gcc-patches; Torvald Riegel
>> Subject: [PR tree-optimization/52558]: RFC: questions on store data
>> race
>>
>> Here we have a testcase that affects both the C++ memory model and
>> transactional memory.
>>
>> [Hans, this is caused by the same problem that is causing the
>> speculative register promotion issue you and Torvald pointed me at].
>>

>> In the following testcase (adapted from the PR), the loop invariant
>> motion pass caches a pre-existing value for g_2, and then performs a
>> store to g_2 on every path, causing a store data race:
>>
>> int g_1 = 1;
>> int g_2 = 0;
>>
>> int func_1(void)
>> {
>>     int l;
>>     for (l = 0; l<  1234; l++)
>>     {
>>       if (g_1)
>>         return l;
>>       else
>>         g_2 = 0;	<-- Store to g_2 should only happen if !g_1
>>     }
>>     return 999;
>> }
>>
>> This gets transformed into something like:
>>
>> 	g_2_lsm = g_2;
>> 	if (g_1) {
>> 		g_2 = g_2_lsm;	// boo! hiss!
>> 		return 0;
>> 	} else {
>> 		g_2_lsm = 0;
>> 		g_2 = g_2_lsm;
>> 	}
>>
>> The spurious write to g_2 could cause a data race.
> Presumably the g_2_lsm = g_2 is actually outside the loop?
>
> Why does the second g_2 = g_2_lsm; get introduced?  I would have expected it before the return.  Was the example just over-abbreviated?

There is some abbreviation going on :).  To be exact, this is what -O2 
currently produces for the lim1 pass.

<bb 2>:
   pretmp.4_1 = g_1;
   g_2_lsm.6_12 = g_2;

<bb 3>:
   # l_13 = PHI <l_6(5), 0(2)>
   # g_2_lsm.6_10 = PHI <g_2_lsm.6_11(5), g_2_lsm.6_12(2)>
   g_1.0_4 = pretmp.4_1;
   if (g_1.0_4 != 0)
     goto <bb 7>;
   else
     goto <bb 4>;

<bb 4>:
   g_2_lsm.6_11 = 0;
   l_6 = l_13 + 1;
   if (l_6 != 1234)
     goto <bb 5>;
   else
     goto <bb 8>;

<bb 8>:
   # g_2_lsm.6_18 = PHI <g_2_lsm.6_11(4)>
   g_2 = g_2_lsm.6_18;
   goto <bb 6>;

<bb 5>:
   goto <bb 3>;

<bb 7>:
   # g_2_lsm.6_17 = PHI <g_2_lsm.6_10(3)>
   # l_19 = PHI <l_13(3)>
   g_2 = g_2_lsm.6_17;

<bb 6>:
   # l_2 = PHI <l_19(7), 999(8)>
   return l_2;

So yes, there seems to be another write to g_2 inside the else, but 
probably because we have merged some basic blocks along the way.

>
> Other than that, this sounds right to me.  So does Richard's flag-based version, which is the approach I would have originally expected.  But that clearly costs you a register.  It would be interesting to see how they compare.

I am working on the flag based approach.

Thanks to both of you.

Patch

Index: tree-ssa-loop-im.c
===================================================================
--- tree-ssa-loop-im.c	(revision 186108)
+++ tree-ssa-loop-im.c	(working copy)
@@ -1999,8 +1999,59 @@  execute_sm (struct loop *loop, VEC (edge
 
   FOR_EACH_VEC_ELT (edge, exits, i, ex)
     {
-      store = gimple_build_assign (unshare_expr (ref->mem), tmp_var);
-      gsi_insert_on_edge (ex, store);
+      basic_block new_bb, then_bb, old_dest;
+      edge then_old_edge;
+      gimple_stmt_iterator gsi;
+      gimple stmt;
+      tree t1;
+
+      if (PARAM_VALUE (PARAM_ALLOW_STORE_DATA_RACES))
+	{
+	  store = gimple_build_assign (unshare_expr (ref->mem), tmp_var);
+	  gsi_insert_on_edge (ex, store);
+	}
+      else
+	{
+	  old_dest = ex->dest;
+	  new_bb = split_edge (ex);
+	  then_bb = create_empty_bb (new_bb);
+	  if (current_loops && new_bb->loop_father)
+	    add_bb_to_loop (then_bb, new_bb->loop_father);
+
+	  gsi = gsi_start_bb (new_bb);
+	  t1 = make_rename_temp (TREE_TYPE (ref->mem), NULL);
+	  stmt = gimple_build_assign (t1, unshare_expr (ref->mem));
+	  gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+	  stmt = gimple_build_cond (NE_EXPR, t1, tmp_var,
+				    NULL_TREE, NULL_TREE);
+	  gsi_insert_after (&gsi, stmt, GSI_CONTINUE_LINKING);
+
+	  gsi = gsi_start_bb (then_bb);
+	  store = gimple_build_assign (unshare_expr (ref->mem), tmp_var);
+	  gsi_insert_after (&gsi, store, GSI_CONTINUE_LINKING);
+
+	  make_edge (new_bb, then_bb, EDGE_TRUE_VALUE);
+	  make_edge (new_bb, old_dest, EDGE_FALSE_VALUE);
+	  then_old_edge = make_edge (then_bb, old_dest, EDGE_FALLTHRU);
+	  set_immediate_dominator (CDI_DOMINATORS, then_bb, new_bb);
+
+	  for (gsi = gsi_start_phis (old_dest); !gsi_end_p (gsi); gsi_next (&gsi))
+	    {
+	      gimple phi = gsi_stmt (gsi);
+	      unsigned i;
+
+	      for (i = 0; i < gimple_phi_num_args (phi); i++)
+		if (gimple_phi_arg_edge (phi, i)->src == new_bb)
+		  {
+		    tree arg = gimple_phi_arg_def (phi, i);
+		    add_phi_arg (phi, arg, then_old_edge, UNKNOWN_LOCATION);
+		    update_stmt (phi);
+		  }
+	    }
+	  /* Remove the original fall through edge.  This was the
+	     single_succ_edge (new_bb).  */
+	  EDGE_SUCC (new_bb, 0)->flags &= ~EDGE_FALLTHRU;
+	}
     }
 }
 
Index: opts.c
===================================================================
--- opts.c	(revision 186108)
+++ opts.c	(working copy)
@@ -663,8 +663,16 @@  finish_options (struct gcc_options *opts
       opts->x_flag_toplevel_reorder = 0;
     }
 
-  if (opts->x_flag_tm && opts->x_flag_non_call_exceptions)
-    sorry ("transactional memory is not supported with non-call exceptions");
+  if (opts->x_flag_tm)
+    {
+      if (opts->x_flag_non_call_exceptions)
+	sorry ("transactional memory is not supported with non-call exceptions");
+
+      set_param_value ("allow-load-data-races", 0,
+		       opts->x_param_values, opts_set->x_param_values);
+      set_param_value ("allow-store-data-races", 0,
+		       opts->x_param_values, opts_set->x_param_values);
+    }
 
   /* -Wmissing-noreturn is alias for -Wsuggest-attribute=noreturn.  */
   if (opts->x_warn_missing_noreturn)