[PR 44576]: Reduce the computation cost in compute_miss_rate for prefetching loop arrays

Message ID D4C76825A6780047854A11E93CDE84D02F7759@SAUSEXMBP01.amd.com
State New

Commit Message

Fang, Changpeng June 30, 2010, 12:06 a.m. UTC
Hi,

Attached is the patch that partially fixes bug 44576: testsuite/gfortran.dg/zero_sized_1.f90 shows a huge compile
time with prefetching + peeling.

The cost of compute_miss_rate in tree-ssa-loop-prefetch.c turns out to be very high, and it leads to an exploding
compilation time when there are thousands of memory references in the loop.

compute_miss_rate computes the probability (miss_rate) that two memory references fall on different cache
lines. We insert prefetches for both references if the computed miss rate is greater than the experimentally
determined ACCEPTABLE_MISS_RATE.
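
To make this concrete, here is a minimal standalone sketch of the pre-patch
computation (types simplified to int; the real function in
tree-ssa-loop-prefetch.c uses HOST_WIDE_INT and appears in the patch below):

static int
compute_miss_rate_sketch (unsigned cache_line_size, int step, int delta,
                          unsigned distinct_iters, int align_unit)
{
  unsigned align, iter;
  int total_positions = 0, miss_positions = 0;

  /* It always misses if delta is at least the cache line size.  */
  if (delta >= (int) cache_line_size)
    return 1000;

  /* Try every alignment of the first reference within its cache line,
     for every distinct iteration of the access pattern.  */
  for (align = 0; align < cache_line_size; align += align_unit)
    for (iter = 0; iter < distinct_iters; iter++)
      {
        int address1 = align + step * iter;
        int address2 = address1 + delta;
        total_positions++;
        if (address1 / (int) cache_line_size
            != address2 / (int) cache_line_size)
          miss_positions++;
      }
  return 1000 * miss_positions / total_positions;  /* scaled by 1000 */
}

For example, with 64-byte cache lines, an 8-byte alignment unit, one distinct
iteration and delta = 32, half of the 8 possible starting positions straddle a
cache line boundary, so the sketch returns a miss rate of 500 (50%), well
above the acceptable threshold.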

In this patch, compute_miss_rate is renamed to is_miss_rate_acceptable and returns early as soon as the computed
miss rate exceeds ACCEPTABLE_MISS_RATE, which significantly reduces the cost in that case. Note that the
worst-case cost is not affected by this patch.

This patch reduces the compile time of the test case from 1m20'' to 1m03'' on an amd-linux64 system.
Note that without -fprefetch-loop-arrays, the compile time on the same system is 0m30''. The extra 33'' is mostly
spent in "compute_all_dependences (datarefs, &dependences, vloops, true)". I am not sure whether the cost of
dependence analysis for a loop should be that high (with several thousand memory references in the loop) or
whether there is a bug in the dependence computation.

The patch passed bootstrapping and regression tests.

Is this patch OK to commit?

Thanks,

Changpeng

Comments

Richard Biener June 30, 2010, 8:52 a.m. UTC | #1
On Wed, Jun 30, 2010 at 2:06 AM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
> Hi,
>
> Attached is the patch that partially fixes bug 44576: testsuite/gfortran.dg/zero_sized_1.f90 shows a huge compile
> time with prefetching + peeling.
>
> The cost of compute_miss_rate in tree-ssa-loop-prefetch.c turns out to be very high, and it leads to an exploding
> compilation time when there are thousands of memory references in the loop.
>
> compute_miss_rate computes the probability (miss_rate) that two memory references fall on different cache
> lines. We insert prefetches for both references if the computed miss rate is greater than the experimentally
> determined ACCEPTABLE_MISS_RATE.
>
> In this patch, compute_miss_rate is renamed to is_miss_rate_acceptable and returns early as soon as the computed
> miss rate exceeds ACCEPTABLE_MISS_RATE, which significantly reduces the cost in that case. Note that the
> worst-case cost is not affected by this patch.
>
> This patch reduces the compile time of the test case from 1m20'' to 1m03'' on an amd-linux64 system.
> Note that without -fprefetch-loop-arrays, the compile time on the same system is 0m30''. The extra 33'' is mostly
> spent in "compute_all_dependences (datarefs, &dependences, vloops, true)". I am not sure whether the cost of
> dependence analysis for a loop should be that high (with several thousand memory references in the loop) or
> whether there is a bug in the dependence computation.

Well, it seems that you compute the same information over-and-over
by doing

unsigned int
tree_ssa_prefetch_arrays (void)
{
...
  FOR_EACH_LOOP (li, loop, LI_FROM_INNERMOST)
    {
      if (dump_file && (dump_flags & TDF_DETAILS))
        fprintf (dump_file, "Processing loop %d:\n", loop->num);

      unrolled |= loop_prefetch_arrays (loop);

static bool
loop_prefetch_arrays (struct loop *loop)
{
...
 determine_loop_nest_reuse (loop, refs, no_other_refs);

static void
determine_loop_nest_reuse (struct loop *loop, struct mem_ref_group *refs,
                           bool no_other_refs)
{
...
  if (loop->inner)
    return;
...
  /* Find the outermost loop of the loop nest of loop (we require that
     there are no sibling loops inside the nest).  */
  nest = loop;
  while (1)
    {
      aloop = loop_outer (nest);

      if (aloop == current_loops->tree_root
          || aloop->inner->next)
        break;

      nest = aloop;
    }
...
  compute_all_dependences (datarefs, &dependences, vloops, true);


In fact you should iterate _only_ over innermost loops like

  FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)

does that make a difference?

>
> The patch passed bootstrapping and regression tests.
>
> Is this patch OK to commit?

Ok.

Thanks,
Richard.

> Thanks,
>
> Changpeng
Fang, Changpeng June 30, 2010, 5:34 p.m. UTC | #2
> FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)

>does that make a difference?

This doesn't help, because "compute_all_dependences" was called the same number of times
as before.

(BTW, should we limit prefetching to the innermost loops only?)

In this test case, there are 6 large loops, where each loop has 729 memory references.
It takes 4~5 seconds to run "compute_all_dependences" for one such loop.


> The patch passed bootstrapping and regression tests.
>
> Is this patch OK to commit?

>Ok.

 Thank you.

Changpeng
Richard Biener June 30, 2010, 5:44 p.m. UTC | #3
On Wed, Jun 30, 2010 at 7:34 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
>> FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
>
>>does that make a difference?
>
> This doesn't help, because "compute_all_dependences" was called the same number of times
> as before.
>
> (BTW, should we limit prefetching to the innermost loops only?)
>
> In this test case, there are 6 large loops, where each loop has 729 memory references.
> It takes 4~5 seconds to run "compute_all_dependences" for one such loop.

It shouldn't take that long.  Can you gather a more detailed profile?

>
>> The patch passed bootstrapping and regression tests.
>>
>> Is this patch OK to commit?
>
>>Ok.
>
>  Thank you.
>
> Changpeng
>
Zdenek Dvorak June 30, 2010, 6:01 p.m. UTC | #4
Hi,

> On Wed, Jun 30, 2010 at 7:34 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
> >> FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
> >
> >>does that make a difference?
> >
> > This doesn't help, because "compute_all_dependences" was called the same number of times
> > as before.
> >
> > (BTW, should we limit prefetching to the innermost loops only?)
> >
> > In this test case, there are 6 large loops, where each loop has 729 memory references.
> > It takes 4~5 seconds to run "compute_all_dependences" for one such loop.
> 
> It shouldn't take that long.  Can you gather a more detailed profile?

actually, compute_all_dependences is quadratic in the number of memory
references, and in more complicated cases it can perform rather complex
computations, so 5 seconds on 729 references does seem like a realistic time.
Of course, we need to either speed up the analysis or add a cut-off to avoid
it on loops with too many memory references (or both),

Zdenek
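
Such a cut-off might be sketched in determine_loop_nest_reuse as follows; the
threshold name and value are hypothetical, not something in the tree, and
"..." elides unchanged code in the style of the excerpts above:

/* Hypothetical cap on the number of memory references for which we are
   willing to run the quadratic dependence analysis.  */
#define PREFETCH_MAX_MEM_REFS_FOR_DEPS 1000

static void
determine_loop_nest_reuse (struct loop *loop, struct mem_ref_group *refs,
                           bool no_other_refs)
{
...
  /* Bail out before the quadratic pair walk when there are too many
     memory references.  Skipping the analysis only means no reuse is
     discovered, so prefetches that reuse would have pruned are kept:
     a code-quality tradeoff, not a correctness issue.  */
  if (VEC_length (data_reference_p, datarefs)
      > PREFETCH_MAX_MEM_REFS_FOR_DEPS)
    return;

  compute_all_dependences (datarefs, &dependences, vloops, true);
...
}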
Richard Biener June 30, 2010, 6:05 p.m. UTC | #5
On Wed, Jun 30, 2010 at 8:01 PM, Zdenek Dvorak <rakdver@kam.mff.cuni.cz> wrote:
> Hi,
>
>> On Wed, Jun 30, 2010 at 7:34 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
>> >> FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
>> >
>> >>does that make a difference?
>> >
>> > This doesn't help, because "compute_all_dependences" was called the same number of times
>> > as before.
>> >
>> > (BTW, should we limit prefetching to the innermost loops only?)
>> >
>> > In this test case, there are 6 large loops, where each loop has 729 memory references.
>> > It takes 4~5 seconds to run "compute_all_dependences" for one such loop.
>>
>> It shouldn't take that long.  Can you gather a more detailed profile?
>
> actually, compute_all_dependences is quadratic in the number of memory
> references, and in more complicated cases it can perform rather complex
> computations, so 5 seconds on 729 references does seem like a realistic time.
> Of course, we need to either speed up the analysis or add a cut-off to avoid
> it on loops with too many memory references (or both),

Well, but at -O3 the vectorizer computes dependences as well, and it
doesn't take that much time.  So there must be something obvious
going wrong.

Richard.

> Zdenek
>
Sebastian Pop June 30, 2010, 6:10 p.m. UTC | #6
On Wed, Jun 30, 2010 at 13:05, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Wed, Jun 30, 2010 at 8:01 PM, Zdenek Dvorak <rakdver@kam.mff.cuni.cz> wrote:
>> Hi,
>>
>>> On Wed, Jun 30, 2010 at 7:34 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
>>> >> FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
>>> >
>>> >>does that make a difference?
>>> >
>>> > This doesn't help, because "compute_all_dependences" was called the same number of times
>>> > as before.
>>> >
>>> > (BTW, should we limit prefetching to the innermost loops only?)
>>> >
>>> > In this test case, there are 6 large loops, where each loop has 729 memory references.
>>> > It takes 4~5 seconds to run "compute_all_dependences" for one such loop.
>>>
>>> It shouldn't take that long.  Can you gather a more detailed profile?
>>
>> actually, compute_all_dependences is quadratic in the number of memory
>> references, and in more complicated cases it can perform rather complex
>> computations, so 5 seconds on 729 references does seem like a realistic time.
>> Of course, we need to either speed up the analysis or add a cut-off to avoid
>> it on loops with too many memory references (or both),
>
> Well, but at -O3 the vectorizer computes dependences as well, and it
> doesn't take that much time.  So there must be something obvious
> going wrong.
>

The dependence analysis in the vectorizer is done only in the innermost
loop that is vectorized, whereas prefetch does the analysis of data deps
for every loop.

Sebastian
Richard Biener June 30, 2010, 6:13 p.m. UTC | #7
On Wed, Jun 30, 2010 at 8:10 PM, Sebastian Pop <sebpop@gmail.com> wrote:
> On Wed, Jun 30, 2010 at 13:05, Richard Guenther
> <richard.guenther@gmail.com> wrote:
>> On Wed, Jun 30, 2010 at 8:01 PM, Zdenek Dvorak <rakdver@kam.mff.cuni.cz> wrote:
>>> Hi,
>>>
>>>> On Wed, Jun 30, 2010 at 7:34 PM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
>>>> >> FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
>>>> >
>>>> >>does that make a difference?
>>>> >
>>>> > This doesn't help, because "compute_all_dependences" was called the same number of times
>>>> > as before.
>>>> >
>>>> > (BTW, should we limit prefetching to the innermost loops only?)
>>>> >
>>>> > In this test case, there are 6 large loops, where each loop has 729 memory references.
>>>> > It takes 4~5 seconds to run "compute_all_dependences" for one such loop.
>>>>
>>>> It shouldn't take that long.  Can you gather a more detailed profile?
>>>
>>> actually, compute_all_dependences is quadratic in the number of memory
>>> references, and in more complicated cases it can perform rather complex
>>> computations, so 5 seconds on 729 references does seem like a realistic time.
>>> Of course, we need to either speed up the analysis or add a cut-off to avoid
>>> it on loops with too many memory references (or both),
>>
>> Well, but at -O3 the vectorizer computes dependences as well, and it
>> doesn't take that much time.  So there must be something obvious
>> going wrong.
>>
>
> The dependence analysis in the vectorizer is done only in the innermost
> loop that is vectorized, whereas prefetch does the analysis of data deps
> for every loop.

The vectorizer also does dependence analysis for outer-loop vectorization.
I guess prefetch analysis should restrict itself to data dependences up to a
maximum loop nest depth (of say 2 or 3).

Still dependence analysis looks overly costly here.

Richard.

> Sebastian
>
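
A depth restriction like this could be sketched on top of the nest walk quoted
earlier, with depth a new unsigned local; the limit of 3 is illustrative only:

/* Bound the depth of the loop nest considered for reuse analysis so
   the dependence computation stays affordable.  */
#define PREFETCH_MAX_NEST_DEPTH 3

  nest = loop;
  for (depth = 1; depth < PREFETCH_MAX_NEST_DEPTH; depth++)
    {
      aloop = loop_outer (nest);

      if (aloop == current_loops->tree_root
          || aloop->inner->next)
        break;

      nest = aloop;
    }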
Fang, Changpeng June 30, 2010, 6:41 p.m. UTC | #8
>>> Well, but at -O3 the vectorizer computes dependences as well, and it
>>> doesn't take that much time.  So there must be something obvious
>>> going wrong.
>>>
>>
>> The dependence analysis in the vectorizer is done only in the innermost
>> loop that is vectorized, whereas prefetch does the analysis of data deps
>> for every loop.

>The vectorizer also does dependence analysis for outer-loop vectorization.
>I guess prefetch analysis should restrict itself to data dependences up to a
>maximum loop nest depth (of say 2 or 3).

For the vectorizer (and tree-loop-linear), we are just lucky that the loop could
not be vectorized, so dependence analysis was not invoked at all for
this loop. For this test case, prefetching is the only pass that invokes
dependence analysis.

>Still dependence analysis looks overly costly here.
Yes, we need to understand and reduce the dependence analysis cost.

Thanks,

Changpeng
Fang, Changpeng June 30, 2010, 9 p.m. UTC | #9
>actually, compute_all_dependences is quadratic in the number of memory
>references, and in more complicated cases it can perform rather complex
>computations, so 5 seconds on 729 references does seem like a realistic time.
>Of course, we need to either speed up the analysis or add a cut-off to avoid
>it on loops with too many memory references (or both),

Actually, these 4~5 seconds are evenly distributed, and I do think the time is due to the
large number of memory references.

Let's take a closer look at this:

determine_loop_nest_reuse (struct loop *loop, struct mem_ref_group *refs,
			   bool no_other_refs)
{
    ...
    compute_all_dependences (datarefs, &dependences, vloops, true);  /* 3+ seconds */

   for (i = 0; VEC_iterate (ddr_p, dependences, i, dep); i++) /* 1+ second */
    {
      ...
    }
}

The 4~5 seconds are spent in compute_all_dependences and in the loop after it that collects
the dependence info. The loop iterates 265356 times (once per pair of references:
729 * 728 / 2 = 265356) and takes 1+ second. The loop is simple and reasonable. Now, let's
look at compute_all_dependences:

compute_all_dependences (VEC (data_reference_p, heap) *datarefs,
			 VEC (ddr_p, heap) **dependence_relations,
			 VEC (loop_p, heap) *loop_nest,
			 bool compute_self_and_rr)
{
  ...
  for (i = 0; VEC_iterate (data_reference_p, datarefs, i, a); i++)
    for (j = i + 1; VEC_iterate (data_reference_p, datarefs, j, b); j++)
      if (!DR_IS_READ (a) || !DR_IS_READ (b) || compute_self_and_rr)
	{
	  ddr = initialize_data_dependence_relation (a, b, loop_nest); /* 1+ second */
	  VEC_safe_push (ddr_p, heap, *dependence_relations, ddr);
          if (loop_nest)
   	    compute_affine_dependence (ddr, VEC_index (loop_p, loop_nest, 0)); /* 1+ second */
	}
  ...
}

Each of the functions initialize_data_dependence_relation and compute_affine_dependence is
called 265356 times, and each of them costs 1+ second (for the 265356 iterations). Again,
initialize_data_dependence_relation is also simple.

As a result, I conclude that the large number of memory references is the reason for the
slowdown. So it is better to add a cut-off (at least for prefetching).

Thanks,

Changpeng
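
As a standalone sanity check of the arithmetic above: 729 references give
729 * 728 / 2 = 265356 pairs, so 1+ second per function works out to only a
few microseconds per call, which is plausible for these helpers.

#include <stdio.h>

int
main (void)
{
  unsigned n = 729;                  /* memory references in one loop */
  unsigned pairs = n * (n - 1) / 2;  /* pairs walked by the quadratic loop */

  /* Prints: 265356 pairs, ~3.8 microseconds per call at 1 second total.  */
  printf ("%u pairs, ~%.1f microseconds per call at 1 second total\n",
          pairs, 1e6 / pairs);
  return 0;
}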
Sebastian Pop July 2, 2010, 4:41 p.m. UTC | #10
On Wed, Jun 30, 2010 at 03:52, Richard Guenther
<richard.guenther@gmail.com> wrote:
> On Wed, Jun 30, 2010 at 2:06 AM, Fang, Changpeng <Changpeng.Fang@amd.com> wrote:
>> Hi,
>>
>> Attached is the patch that partially fixes bug 44576: testsuite/gfortran.dg/zero_sized_1.f90 shows a huge compile
>> time with prefetching + peeling.
>>
>> The cost of compute_miss_rate in tree-ssa-loop-prefetch.c turns out to be very high, and it leads to an exploding
>> compilation time when there are thousands of memory references in the loop.
>>
>> compute_miss_rate computes the probability (miss_rate) that two memory references fall on different cache
>> lines. We insert prefetches for both references if the computed miss rate is greater than the experimentally
>> determined ACCEPTABLE_MISS_RATE.
>>
>> In this patch, compute_miss_rate is renamed to is_miss_rate_acceptable and returns early as soon as the computed
>> miss rate exceeds ACCEPTABLE_MISS_RATE, which significantly reduces the cost in that case. Note that the
>> worst-case cost is not affected by this patch.
>>
>> This patch reduces the compile time of the test case from 1m20'' to 1m03'' on an amd-linux64 system.
>> Note that without -fprefetch-loop-arrays, the compile time on the same system is 0m30''. The extra 33'' is mostly
>> spent in "compute_all_dependences (datarefs, &dependences, vloops, true)". I am not sure whether the cost of
>> dependence analysis for a loop should be that high (with several thousand memory references in the loop) or
>> whether there is a bug in the dependence computation.
>
> Well, it seems that you compute the same information over-and-over
> by doing
>
> unsigned int
> tree_ssa_prefetch_arrays (void)
> {
> ...
>  FOR_EACH_LOOP (li, loop, LI_FROM_INNERMOST)
>    {
>      if (dump_file && (dump_flags & TDF_DETAILS))
>        fprintf (dump_file, "Processing loop %d:\n", loop->num);
>
>      unrolled |= loop_prefetch_arrays (loop);
>
> static bool
> loop_prefetch_arrays (struct loop *loop)
> {
> ...
>  determine_loop_nest_reuse (loop, refs, no_other_refs);
>
> static void
> determine_loop_nest_reuse (struct loop *loop, struct mem_ref_group *refs,
>                           bool no_other_refs)
> {
> ...
>  if (loop->inner)
>    return;
> ...
>  /* Find the outermost loop of the loop nest of loop (we require that
>     there are no sibling loops inside the nest).  */
>  nest = loop;
>  while (1)
>    {
>      aloop = loop_outer (nest);
>
>      if (aloop == current_loops->tree_root
>          || aloop->inner->next)
>        break;
>
>      nest = aloop;
>    }
> ...
>  compute_all_dependences (datarefs, &dependences, vloops, true);
>
>
> In fact you should iterate _only_ over innermost loops like
>
>  FOR_EACH_LOOP (li, loop, LI_ONLY_INNERMOST)
>
> does that make a difference?
>
>>
>> The patch passed bootstrapping and regression tests.
>>
>> Is this patch OK to commit?
>
> Ok.
>

Committed r161728

Patch

From 3dbd1a84428979b5e73b2532c6389d31377b929a Mon Sep 17 00:00:00 2001
From: Changpeng Fang <chfang@pathscale.(none)>
Date: Mon, 28 Jun 2010 17:20:10 -0700
Subject: [PATCH 5/5] Reduce the cost in miss rate computation

	* tree-ssa-loop-prefetch.c (compute_miss_rate): Rename to
	is_miss_rate_acceptable. Pull total_positions computation
	out of the loops.  Early return if miss_positions exceeds
	the acceptable threshold.
	* tree-ssa-loop-prefetch.c (prune_ref_by_group_reuse): Call
	is_miss_rate_acceptable after renaming of compute_miss_rate.
---
 gcc/tree-ssa-loop-prefetch.c |   41 +++++++++++++++++++++--------------------
 1 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/gcc/tree-ssa-loop-prefetch.c b/gcc/tree-ssa-loop-prefetch.c
index 27e2b42..f2c95ac 100644
--- a/gcc/tree-ssa-loop-prefetch.c
+++ b/gcc/tree-ssa-loop-prefetch.c
@@ -640,27 +640,29 @@  ddown (HOST_WIDE_INT x, unsigned HOST_WIDE_INT by)
 /* Given a CACHE_LINE_SIZE and two inductive memory references
    with a common STEP greater than CACHE_LINE_SIZE and an address
    difference DELTA, compute the probability that they will fall
-   in different cache lines.  DISTINCT_ITERS is the number of
-   distinct iterations after which the pattern repeats itself.
+   in different cache lines. Return true if the computed miss rate
+   is not greater than the ACCEPTABLE_MISS_RATE.  DISTINCT_ITERS is the
+   number of distinct iterations after which the pattern repeats itself.
    ALIGN_UNIT is the unit of alignment in bytes.  */
 
-static int
-compute_miss_rate (unsigned HOST_WIDE_INT cache_line_size,
+static bool
+is_miss_rate_acceptable (unsigned HOST_WIDE_INT cache_line_size,
 		   HOST_WIDE_INT step, HOST_WIDE_INT delta,
 		   unsigned HOST_WIDE_INT distinct_iters,
 		   int align_unit)
 {
   unsigned align, iter;
-  int total_positions, miss_positions, miss_rate;
+  int total_positions, miss_positions, max_allowed_miss_positions;
   int address1, address2, cache_line1, cache_line2;
 
   /* It always misses if delta is greater than or equal to the cache
-     line size.  */ 
-  if (delta >= cache_line_size)
-    return 1000;
+     line size.  */
+  if (delta >= (HOST_WIDE_INT) cache_line_size)
+    return false;
 
-  total_positions = 0;
   miss_positions = 0;
+  total_positions = (cache_line_size / align_unit) * distinct_iters;
+  max_allowed_miss_positions = (ACCEPTABLE_MISS_RATE * total_positions) / 1000;
 
   /* Iterate through all possible alignments of the first
      memory reference within its cache line.  */
@@ -673,12 +675,14 @@  compute_miss_rate (unsigned HOST_WIDE_INT cache_line_size,
 	address2 = address1 + delta;
 	cache_line1 = address1 / cache_line_size;
 	cache_line2 = address2 / cache_line_size;
-	total_positions += 1;
 	if (cache_line1 != cache_line2)
-	  miss_positions += 1;
+	  {
+	    miss_positions += 1;
+            if (miss_positions > max_allowed_miss_positions)
+	      return false;
+          }
       }
-  miss_rate = 1000 * miss_positions / total_positions;
-  return miss_rate;
+  return true;
 }
 
 /* Prune the prefetch candidate REF using the reuse with BY.
@@ -694,7 +698,6 @@  prune_ref_by_group_reuse (struct mem_ref *ref, struct mem_ref *by,
   HOST_WIDE_INT delta = delta_b - delta_r;
   HOST_WIDE_INT hit_from;
   unsigned HOST_WIDE_INT prefetch_before, prefetch_block;
-  int miss_rate;
   HOST_WIDE_INT reduced_step;
   unsigned HOST_WIDE_INT reduced_prefetch_block;
   tree ref_type;
@@ -793,9 +796,8 @@  prune_ref_by_group_reuse (struct mem_ref *ref, struct mem_ref *by,
   delta %= step;
   ref_type = TREE_TYPE (ref->mem);
   align_unit = TYPE_ALIGN (ref_type) / 8;
-  miss_rate = compute_miss_rate(prefetch_block, step, delta,
-				reduced_prefetch_block, align_unit);
-  if (miss_rate <= ACCEPTABLE_MISS_RATE)
+  if (is_miss_rate_acceptable (prefetch_block, step, delta,
+			       reduced_prefetch_block, align_unit))
     {
       /* Do not reduce prefetch_before if we meet beyond cache size.  */
       if (prefetch_before > L2_CACHE_SIZE_BYTES / PREFETCH_BLOCK)
@@ -809,9 +811,8 @@  prune_ref_by_group_reuse (struct mem_ref *ref, struct mem_ref *by,
   /* Try also the following iteration.  */
   prefetch_before++;
   delta = step - delta;
-  miss_rate = compute_miss_rate(prefetch_block, step, delta,
-				reduced_prefetch_block, align_unit);
-  if (miss_rate <= ACCEPTABLE_MISS_RATE)
+  if (is_miss_rate_acceptable (prefetch_block, step, delta,
+			       reduced_prefetch_block, align_unit))
     {
       if (prefetch_before < ref->prefetch_before)
 	ref->prefetch_before = prefetch_before;
-- 
1.6.3.3