diff mbox

loop distribution bug fix

Message ID CABXYE2VY59NeXPSbxUZ4sj0+G+u4LcxNg0GgoxH_-=gnQZWZiQ@mail.gmail.com
State New
Headers show

Commit Message

Jim Wilson Nov. 10, 2016, 5:25 a.m. UTC
This fixes a bug in code adding edges to the dependence graph. Values
for this_dir can be -1 (backward edge), 0 (no edge), 1 (forward edge),
and 2 (both backward and forward edges). There can be multiple
dependencies checked, creating multiple edges that have to be merged
together. this_dir contains the current edge. dir contains the
previous edges. The code fails to handle the case where this_dir is 2,
in which case we can return immediately. This is a minor optimization
to improve compile time. The code handles the case where dir is
non-zero and this_dir is zero by returning 2, which is incorrect. dir
should be unmodified in this case. We can return 2 only if both dir
and this_dir are non-zero and unequal, i.e. one is -1 and the other is
1. This problem creates extra unnecessary edges, which can prevent
loops from being distributed. The patch fixes both problems.

This passed an x86_64 gcc bootstrap with -ftree-loop-distribution
added to BOOT_CFLAGS and a testsuite regression check.  Curiously, I
see that I get different results for the C/C++ ubsan tests every time
I run them, but this has nothing to do with my patch, as it happens
with or without my patch.  I haven't tried debugging this yet.  Might
be related to my recent upgrade to Ubuntu 16.04 LTS.  Otherwise, there
are no regressions.

On SPEC CPU2006, on aarch64, I see 5879 loops distributed without the
patch, and 5906 loops distributed with the patch. So 27 extra loops
are distributed which is about 0.5% more loop distributions. There is
no measurable performance gain from the bug fix on the CPU2006 run
time though I plan to spend some more time looking at this code to see
if I can find other improvements.

OK?

Jim

Comments

Richard Biener Nov. 10, 2016, 10:53 a.m. UTC | #1
On Thu, Nov 10, 2016 at 6:25 AM, Jim Wilson <jim.wilson@linaro.org> wrote:
> This fixes a bug in code adding edges to the dependence graph. Values
> for this_dir can be -1 (backward edge), 0 (no edge), 1 (forward edge),
> and 2 (both backward and forward edges). There can be multiple
> dependencies checked, creating multiple edges that have to be merged
> together. this_dir contains the current edge. dir contains the
> previous edges. The code fails to handle the case where this_dir is 2,
> in which case we can return immediately. This is a minor optimization
> to improve compile time. The code handles the case where dir is
> non-zero and this_dir is zero by returning 2, which is incorrect. dir
> should be unmodified in this case. We can return 2 only if both dir
> and this_dir are non-zero and unequal, i.e. one is -1 and the other is
> 1. This problem creates extra unnecessary edges, which can prevent
> loops from being distributed. The patch fixes both problems.
>
> This passed an x86_64 gcc bootstrap with -ftree-loop-distribution
> added to BOOT_CFLAGS and a testsuite regression check.  Curiously, I
> see that I get different results for the C/C++ ubsan tests every time
> I run them, but this has nothing to do with my patch, as it happens
> with or without my patch.  I haven't tried debugging this yet.  Might
> be related to my recent upgrade to Ubuntu 16.04 LTS.  Otherwise, there
> are no regressions.
>
> On SPEC CPU2006, on aarch64, I see 5879 loops distributed without the
> patch, and 5906 loops distributed with the patch. So 27 extra loops
> are distributed which is about 0.5% more loop distributions. There is
> no measurable performance gain from the bug fix on the CPU2006 run
> time though I plan to spend some more time looking at this code to see
> if I can find other improvements.

The biggest "lack" of loop distribution is the ability to undo CSE so for

 for (;;)
  {
     a[i] = a[i] + 1;
     b[i] = a[i];
  }

CSE makes us see

  for (;;)
    {
       tem = a[i];
       tem2 = tem + 1;
       a[i] = tem;
       b[i] = tem;
    }

and loop distribution cannot re-materialize tem3 from a[i] thus most of the
time it ends up pulling redundant computations into each partition (sometimes
that can reduce memory bandwith if one less stream is used but sometimes
not, like in the above case).

Then of course the cost model is purely modeled for STREAM (reduce the number
of memory streams).  So loop distribution is expected to pessimize code for
the CSE case in case you are not memory bound and improve things if you
are memory bound.

> OK?

Ok.

Thanks for the improvement!
Richard.

> Jim
Jim Wilson Nov. 11, 2016, 6:55 p.m. UTC | #2
On Thu, Nov 10, 2016 at 2:53 AM, Richard Biener
<richard.guenther@gmail.com> wrote:
> The biggest "lack" of loop distribution is the ability to undo CSE so for

I hadn't noticed this problem yet.  I will have to take a look.

> Then of course the cost model is purely modeled for STREAM (reduce the number
> of memory streams).  So loop distribution is expected to pessimize code for
> the CSE case in case you are not memory bound and improve things if you
> are memory bound.

I noticed this problem.  I think loop distribution should be callable
from inside the vectorizer or vice versa.  if a loop can't be
vectorized, but distributing the loop allows the sub loops to be
vectorized, then we should go ahead and ditsribute, even if that
increases the number of memory streams slightly, as the gain from
vectorizing should be greater than the loss from the additional memory
streams.  We could have a cost model that tries to compute the
gain/loss here and make a better decision of when to distribute to
increase vectorization at the expense of the number of memory streams.
This looks like a major project though, and may be more work than I
have time for.

Jim
Richard Biener Nov. 14, 2016, 10:03 a.m. UTC | #3
On Fri, Nov 11, 2016 at 7:55 PM, Jim Wilson <jim.wilson@linaro.org> wrote:
> On Thu, Nov 10, 2016 at 2:53 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> The biggest "lack" of loop distribution is the ability to undo CSE so for
>
> I hadn't noticed this problem yet.  I will have to take a look.
>
>> Then of course the cost model is purely modeled for STREAM (reduce the number
>> of memory streams).  So loop distribution is expected to pessimize code for
>> the CSE case in case you are not memory bound and improve things if you
>> are memory bound.
>
> I noticed this problem.  I think loop distribution should be callable
> from inside the vectorizer or vice versa.  if a loop can't be
> vectorized, but distributing the loop allows the sub loops to be
> vectorized, then we should go ahead and ditsribute, even if that
> increases the number of memory streams slightly, as the gain from
> vectorizing should be greater than the loss from the additional memory
> streams.  We could have a cost model that tries to compute the
> gain/loss here and make a better decision of when to distribute to
> increase vectorization at the expense of the number of memory streams.
> This looks like a major project though, and may be more work than I
> have time for.

Yes.  That's true for most enabling transforms (an easier one that comes to
my mind is final value replacement, which, when required from the vectorizer
could use a different cost model).

Richard.

> Jim
diff mbox

Patch

2016-11-09  Jim Wilson  <jim.wilson@linaro.org>

	* tree-loop-distribution.c (pg_add_dependence_edges): Return 2 if
	this_dir is 2.  Check for this_dir non-zero before dir != this_dir
	check.

Index: gcc/tree-loop-distribution.c
===================================================================
--- gcc/tree-loop-distribution.c	(revision 242025)
+++ gcc/tree-loop-distribution.c	(working copy)
@@ -1408,9 +1408,11 @@  pg_add_dependence_edges (struct graph *rdg, vec<lo
 	else
 	  this_dir = 0;
 	free_dependence_relation (ddr);
-	if (dir == 0)
+	if (this_dir == 2)
+	  return 2;
+	else if (dir == 0)
 	  dir = this_dir;
-	else if (dir != this_dir)
+	else if (this_dir != 0 && dir != this_dir)
 	  return 2;
 	/* Shuffle "back" dr1.  */
 	dr1 = saved_dr1;