From patchwork Wed Jun 30 00:06:45 2010 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Subject: [PR, 44576] : Reduce the computation cost in compute_miss_rate for prefetching loop arrays From: "Fang, Changpeng" X-Patchwork-Id: 57340 Message-Id: To: Zdenek Dvorak Cc: Richard Guenther , Christian Borntraeger , "gcc-patches@gcc.gnu.org" , "uweigand@de.ibm.com" , "sebpop@gmail.com" Date: Tue, 29 Jun 2010 19:06:45 -0500 Hi, Attached is the patch that partially fixes bug 44576: testsuite/gfortran.dg/zero_sized_1.f90 with huge compile time on prefetching + peeling. The cost of compute_miss_rate in tree-ssa-loop-prefetch.c is found to be very high, and it results in an exploding compilation time when there are thousands of memory references in the loop. compute_miss_rate is to compute the possibility (miss_rate) that two memory references fall on difference cache lines. We will insert prefetches for both references if the computed miss rate is greater than the experimentally determined ACCEPTABLE_MISS_RATE. In this patch, compute_miss_rate is renamed as is_miss_rate_acceptable, and does an early return if the computed miss rate is already greater than the ACCEPTABLE_MISS_RATE, and thus significantly reduces the cost for such case. Note that the worst case cost is not affected by this patch. This patch reduces the compile time of the test case from 1m20'' to 1m03'' on an amd-linux64 system. Note that without -fprefetching-loop-arrays, the compile time on the same system is 0m30''. The extra 33'' is mostly spent in "compute_all_dependences (datarefs, &dependences, vloops, true)". I am not sure whether the cost in dependence analysis for loop should be that high (with several thousands memory references in loop) or there is a bug in dependence computation. The patch passed Bootstrapping and regression tests. Is this patch OK to commit? Thanks, Changpeng >From 3dbd1a84428979b5e73b2532c6389d31377b929a Mon Sep 17 00:00:00 2001 From: Changpeng Fang Date: Mon, 28 Jun 2010 17:20:10 -0700 Subject: [PATCH 5/5] Reduce the cost in miss rate computation * tree-ssa-loop-prefetch.c (compute_miss_rate): Rename to is_miss_rate_acceptable. Pull total_positions computation out of the loops. Early return if miss_positions exceeds the acceptable threshold. * tree-ssa-loop-prefetch.c (prune_ref_by_group_reuse): Call is_miss_rate_acceptable after renaming of compute_miss_rate. --- gcc/tree-ssa-loop-prefetch.c | 41 +++++++++++++++++++++-------------------- 1 files changed, 21 insertions(+), 20 deletions(-) diff --git a/gcc/tree-ssa-loop-prefetch.c b/gcc/tree-ssa-loop-prefetch.c index 27e2b42..f2c95ac 100644 --- a/gcc/tree-ssa-loop-prefetch.c +++ b/gcc/tree-ssa-loop-prefetch.c @@ -640,27 +640,29 @@ ddown (HOST_WIDE_INT x, unsigned HOST_WIDE_INT by) /* Given a CACHE_LINE_SIZE and two inductive memory references with a common STEP greater than CACHE_LINE_SIZE and an address difference DELTA, compute the probability that they will fall - in different cache lines. DISTINCT_ITERS is the number of - distinct iterations after which the pattern repeats itself. + in different cache lines. Return true if the computed miss rate + is not greater than the ACCEPTABLE_MISS_RATE. DISTINCT_ITERS is the + number of distinct iterations after which the pattern repeats itself. ALIGN_UNIT is the unit of alignment in bytes. */ -static int -compute_miss_rate (unsigned HOST_WIDE_INT cache_line_size, +static bool +is_miss_rate_acceptable (unsigned HOST_WIDE_INT cache_line_size, HOST_WIDE_INT step, HOST_WIDE_INT delta, unsigned HOST_WIDE_INT distinct_iters, int align_unit) { unsigned align, iter; - int total_positions, miss_positions, miss_rate; + int total_positions, miss_positions, max_allowed_miss_positions; int address1, address2, cache_line1, cache_line2; /* It always misses if delta is greater than or equal to the cache - line size. */ - if (delta >= cache_line_size) - return 1000; + line size. */ + if (delta >= (HOST_WIDE_INT) cache_line_size) + return false; - total_positions = 0; miss_positions = 0; + total_positions = (cache_line_size / align_unit) * distinct_iters; + max_allowed_miss_positions = (ACCEPTABLE_MISS_RATE * total_positions) / 1000; /* Iterate through all possible alignments of the first memory reference within its cache line. */ @@ -673,12 +675,14 @@ compute_miss_rate (unsigned HOST_WIDE_INT cache_line_size, address2 = address1 + delta; cache_line1 = address1 / cache_line_size; cache_line2 = address2 / cache_line_size; - total_positions += 1; if (cache_line1 != cache_line2) - miss_positions += 1; + { + miss_positions += 1; + if (miss_positions > max_allowed_miss_positions) + return false; + } } - miss_rate = 1000 * miss_positions / total_positions; - return miss_rate; + return true; } /* Prune the prefetch candidate REF using the reuse with BY. @@ -694,7 +698,6 @@ prune_ref_by_group_reuse (struct mem_ref *ref, struct mem_ref *by, HOST_WIDE_INT delta = delta_b - delta_r; HOST_WIDE_INT hit_from; unsigned HOST_WIDE_INT prefetch_before, prefetch_block; - int miss_rate; HOST_WIDE_INT reduced_step; unsigned HOST_WIDE_INT reduced_prefetch_block; tree ref_type; @@ -793,9 +796,8 @@ prune_ref_by_group_reuse (struct mem_ref *ref, struct mem_ref *by, delta %= step; ref_type = TREE_TYPE (ref->mem); align_unit = TYPE_ALIGN (ref_type) / 8; - miss_rate = compute_miss_rate(prefetch_block, step, delta, - reduced_prefetch_block, align_unit); - if (miss_rate <= ACCEPTABLE_MISS_RATE) + if (is_miss_rate_acceptable (prefetch_block, step, delta, + reduced_prefetch_block, align_unit)) { /* Do not reduce prefetch_before if we meet beyond cache size. */ if (prefetch_before > L2_CACHE_SIZE_BYTES / PREFETCH_BLOCK) @@ -809,9 +811,8 @@ prune_ref_by_group_reuse (struct mem_ref *ref, struct mem_ref *by, /* Try also the following iteration. */ prefetch_before++; delta = step - delta; - miss_rate = compute_miss_rate(prefetch_block, step, delta, - reduced_prefetch_block, align_unit); - if (miss_rate <= ACCEPTABLE_MISS_RATE) + if (is_miss_rate_acceptable (prefetch_block, step, delta, + reduced_prefetch_block, align_unit)) { if (prefetch_before < ref->prefetch_before) ref->prefetch_before = prefetch_before; -- 1.6.3.3