[PR,tree-optimization/52558] : RFC: questions on store data race

On 04/13/12 03:46, Richard Guenther wrote:
> On Fri, Apr 13, 2012 at 12:11 AM, Aldy Hernandez<aldyh@redhat.com>  wrote:

Richard.  Thanks so much for reviewing and providing an alternative 
approach, which AFAICT provides superior results.

> A similar effect could be obtained by keeping a flag whether we entered the
> path that stored the value before the transform.  Thus, do
>
>    lsm = g2;  // technically not needed, if you also handle loads
>    flag = false;
>    for (...)
>      {
>         if (g1)
>           {
>              if (flag)
>                g2 = lsm;
>           }
>         else
>           {
>              lsm = 0;
>              flag = true;
>           }
>      }
>    if (flag)
>      g2 = lsm;

I have implemented this in the attached patch, and it seems to be 
generating better code than my other if-changed approach.  It also 
avoids generating unnecessary loads that may trap.

For the original testcase:

int g_1 = 1;
int g_2 = 0;

int func_1(void)
{
  int l;
  for (l = 0; l < 1234; l++)
  {
    if (g_1)
      return l;
    else
      g_2 = 0;
  }
  return 999;
}

After all optimizations are done, we are now generating the following 
for the flags approach:

new:
func_1:
         movl    g_1(%rip), %edx
         xorl    %eax, %eax
         testl   %edx, %edx
         je      .L5
         rep
         ret
.L5:
         movl    $0, g_2(%rip)
         movw    $999, %ax
         ret

Which is significantly better than the unmodified trunk of:

func_1:
         movl    g_1(%rip), %edx
         movl    g_2(%rip), %eax
         testl   %edx, %edx
         je      .L5
         movl    %eax, g_2(%rip)
         xorl    %eax, %eax
         ret
.L5:
         movl    $0, g_2(%rip)
         movl    $999, %eax
         ret

And in other less contrived cases, we generate correct code that avoids 
writes on code paths that did not have it.  For example, in Hans 
register promotion example:

int count;

struct obj {
     int data;
     struct obj *next;
} *q;

void func()
{
   struct obj *p;
   for (p = q; p; p = p->next)
     	if (p->data > 0)
		count++;
}

Under the new memory model we should avoid promoting "count" to a 
register and unilaterally writing to it upon exiting the loop.

With the attached patch, we now generate:

func:
.LFB0:
         movq    q(%rip), %rax
         xorl    %ecx, %ecx
         movl    count(%rip), %edx
         testq   %rax, %rax
         je      .L1
.L9:
         movl    (%rax), %esi
         testl   %esi, %esi
         jle     .L3
         addl    $1, %edx
         movl    $1, %ecx
.L3:
         movq    8(%rax), %rax
         testq   %rax, %rax
         jne     .L9
         testb   %cl, %cl
         jne     .L12
.L1:
         rep
         ret
.L12:
         movl    %edx, count(%rip)
         ret

Which is as expected.

I don't understand your suggestion of:

 >    lsm = g2;  // technically not needed, if you also handle loads
> that would allow to extend this to cover the cases where the access may
> eventually trap (if you omit the load before the loop).

Can't I just remove the lsm=g2?  What's this about also handling loads?

>
> Not sure which is more efficient (you can eventually combine flag variables
> for different store locations, combining the if-changed compare is not so easy
> I guess).

So far, I see better code generated with the flags approach, so...

>> 1. No pass can figure out the equality (or inequality) of g_2_lsm and g_2.
>>   In the PR, Richard mentions that both FRE/PRE can figure this out, but they
>> are not run after store motion.  DOM should also be able to, but apparently
>> does not :(.  Tips?
>
> Well.  Schedule FRE after loop opts ...
>
> DOM is not prepared to handle loads/stores with differing VOPs - it was never
> updated really after the alias-improvements merge.
>
>> 2. The GIMPLE_CONDs I create in this patch are currently causing problems
>> with complex floats/doubles.  do_jump is complaining that it can't compare
>> them, so I assume it is too late (in tree-ssa-loop-im.c) to compare complex
>> values since complex lowering has already happened (??). Is there a more
>> canonical way of creating a GIMPLE_COND that may contain complex floats at
>> this stage?
>
> As you are handling redundant stores you want a bitwise comparison anyway,
> but yes, complex compares need to be lowered.  But as said, you want a
> bitwise comparison, not a value-comparison.  You can try using

I can ignore all this.  I have implemented both alternatives, with a 
target hook as suggested, but I'm thinking of removing it altogether and 
just leaving the flags approach.

> Few comments on the patch itself.
>
> +         new_bb = split_edge (ex);
> +         then_bb = create_empty_bb (new_bb);
> +         if (current_loops&&  new_bb->loop_father)
> +           add_bb_to_loop (then_bb, new_bb->loop_father);
>
> +         gsi = gsi_start_bb (new_bb);
> +         t1 = make_rename_temp (TREE_TYPE (ref->mem), NULL);
>
> Hmm, why do you need to re-load the value?  Don't you have both
> the value as it was loaded before the loop and the new value (the lsm tmp reg)
> already?  Why not simply remember the SSA name used?  Ah, because
> store motion also uses make_rename_temp.  If you don't want to change
> that (which would be good but inconvenient because of the required PHI
> insertions), I'd change the code that performs the load in the pre-header to do
>
>    SSA_NAME = ref->mem;
>    rename-lsm-tmp = SSA_NAME;
>
> so you can remember SSA_NAME and re-use it here.  That probably solves
> your optimization issue, too.

Fixed.

> +         for (gsi = gsi_start_phis (old_dest); !gsi_end_p (gsi);
> gsi_next (&gsi))
> +           {
> +             gimple phi = gsi_stmt (gsi);
> +             unsigned i;
> +
> +             for (i = 0; i<  gimple_phi_num_args (phi); i++)
> +               if (gimple_phi_arg_edge (phi, i)->src == new_bb)
> +                 {
> +                   tree arg = gimple_phi_arg_def (phi, i);
> +                   add_phi_arg (phi, arg, then_old_edge, UNKNOWN_LOCATION);
> +                   update_stmt (phi);
> +                 }
> +           }
>
> Hmm.  This is of course doing unnecessary work in case there are multiple
> moved stores.  In fact splitting the exit edge is only necessary if there are
> more than one predecessor.  Otherwise you can simply split the exit block
> after inserting the new conditional after its labels.

Is this still the case with the current patch?  If so, I'm a bit 
confused as to what you want here.

>
> +  if (opts->x_flag_tm)
> +    {
> +      if (opts->x_flag_non_call_exceptions)
> +       sorry ("transactional memory is not supported with non-call
> exceptions");
> +
> +      set_param_value ("allow-load-data-races", 0,
> +                      opts->x_param_values, opts_set->x_param_values);
> +      set_param_value ("allow-store-data-races", 0,
> +                      opts->x_param_values, opts_set->x_param_values);
>
> Unless the user explicitely set those params?  Which means using params
> wasn't a very good idea ... ideally you'd diagnose an error when the user
> would mix -fgnu-tm and -fallow-store-data-races.  So - can we remove
> allow-load-data-races (we are never going to implement that) and make
> allow-store-data-races a proper -f flag please?

Sorry, that was not meant for public consumption.  I have rewritten this 
to either take the param value as is, and in the case of flag_tm, only 
restrict the optimization if the loop is inside an actual transaction 
(see the changes to trans-mem.c, cause I know you'll hate 
bb_in_transaction() :-)).

>
> +    }
>
> And that should be a separate patch

I can certainly reimplement --param=allow-{load,store}-data-races as 
proper -f flags.  I don't care either way.  But I remember Ian Taylor 
having a different opinion.  If y'all agree, I can implement whatever.

> It would be nice if you could think about LIM/SM for possibly trapping
> loads/stores
> while you are doing this work.

We are no longer adding additional trapping loads (as with the 
if-changed approach).  Is this something else you'd like me to 
concentrate on?

Please take a look at the attached patch.  I tested the flags 
alternative on a full bootstrap, and there's only one regression which I 
will look at, but I'd like to know if the current WIP is aligned with 
what you had in mind.

Thanks again.
Aldy
* trans-mem.c (bb_in_transaction_1): New.
	(bb_in_transaction): New.
	* gimple.h (bb_in_transaction): Protoize.
	* Makefile.in (tree-ssa-loop-im.o): Depend on TARGET_H.
	* doc/tm.texi: Regenerate.
	* tree-ssa-loop-im.c (execute_sm_if_changed): New.
	(execute_sm_if_changed_flag): New.
	(execute_sm_if_changed_flag_set): New.
	(execute_sm): Do not generate data races unless requested.
	* targhooks.c (default_can_compare_bitwise_p): New.
	* targhooks.h (default_can_compare_bitwise_p): Protoize.
	* target.def (can_compare_bitwise_p): New hook.
	* tree-ssa-loop-im.c (execute_sm): Guard store motion with a
	conditional.
	* opts.c (finish_options): Do not allow store or load data races
	with -fgnu-tm.

[PR,tree-optimization/52558] : RFC: questions on store data race

Commit Message

Comments

Patch