diff mbox

[12/13] Workaround reduction statements for distribution

Message ID CAHFci29i50MFmcUCvW_ZeKDL3pWJohoevFH1vUgWa3GJRx2o-A@mail.gmail.com
State New
Headers show

Commit Message

Bin.Cheng June 20, 2017, 9:20 a.m. UTC
On Fri, Jun 16, 2017 at 6:15 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Fri, Jun 16, 2017 at 11:21 AM, Richard Biener
> <richard.guenther@gmail.com> wrote:
>> On Mon, Jun 12, 2017 at 7:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>> Hi,
>>> For now, loop distribution handles variables used outside of loop as reduction.
>>> This is inaccurate because all partitions contain statement defining induction
>>> vars.
>>
>> But final induction values are usually not used outside of the loop...
> This is in actuality for induction variable which is used outside of the loop.
>>
>> What is missing is loop distribution trying to change partition order.  In fact
>> we somehow assume we can move a reduction across a detected builtin
>> (I don't remember if we ever check for validity of that...).
> Hmm, I am not sure when we can't.  If there is any dependence between
> builtin/reduction partitions, it should be captured by RDG or PG,
> otherwise the partitions are independent and can be freely ordered as
> long as reduction partition is scheduled last?
>>
>>> Ideally we should factor out scev-propagation as a standalone interface
>>> which can be called when necessary.  Before that, this patch simply workarounds
>>> reduction issue by checking if the statement belongs to all partitions.  If yes,
>>> the reduction must be computed in the last partition no matter how the loop is
>>> distributed.
>>> Bootstrap and test on x86_64 and AArch64.  Is it OK?
>>
>> stmt_in_all_partitions is not kept up-to-date during partition merging and if
>> merging makes the reduction partition(s) pass the stmt_in_all_partitions
>> test your simple workaround doesn't work ...
> I think it doesn't matter because:
>   A) it's really workaround for induction variables.  In general,
> induction variables are included by all partition.
>   B) After classify partition, we immediately fuses all reduction
> partitions.  More stmt_in_all_partitions means we are fusing
> non-reduction partition with reduction partition, so the newly
> generated (stmt_in_all_partitions) are actually not reduction
> statements.  The workaround won't work anyway even the bitmap is
> maintained.
>>
>> As written it's a valid optimization but can you please note it's limitation in
>> some comment please?
> Yeah, I will add comment explaining it.
Comment added in new version patch.  It also computes bitmap outside
now, is it OK?

Thanks,
bin
2017-06-07  Bin Cheng  <bin.cheng@arm.com>

    * tree-loop-distribution.c (classify_partition): New parameter and
    better handle reduction statement.
    (rdg_build_partitions): Revise comment.
    (distribute_loop): Compute statements in all partitions and pass it
    to classify_partition.

Comments

Richard Biener June 20, 2017, 11:36 a.m. UTC | #1
On Tue, Jun 20, 2017 at 11:20 AM, Bin.Cheng <amker.cheng@gmail.com> wrote:
> On Fri, Jun 16, 2017 at 6:15 PM, Bin.Cheng <amker.cheng@gmail.com> wrote:
>> On Fri, Jun 16, 2017 at 11:21 AM, Richard Biener
>> <richard.guenther@gmail.com> wrote:
>>> On Mon, Jun 12, 2017 at 7:03 PM, Bin Cheng <Bin.Cheng@arm.com> wrote:
>>>> Hi,
>>>> For now, loop distribution handles variables used outside of loop as reduction.
>>>> This is inaccurate because all partitions contain statement defining induction
>>>> vars.
>>>
>>> But final induction values are usually not used outside of the loop...
>> This is in actuality for induction variable which is used outside of the loop.
>>>
>>> What is missing is loop distribution trying to change partition order.  In fact
>>> we somehow assume we can move a reduction across a detected builtin
>>> (I don't remember if we ever check for validity of that...).
>> Hmm, I am not sure when we can't.  If there is any dependence between
>> builtin/reduction partitions, it should be captured by RDG or PG,
>> otherwise the partitions are independent and can be freely ordered as
>> long as reduction partition is scheduled last?
>>>
>>>> Ideally we should factor out scev-propagation as a standalone interface
>>>> which can be called when necessary.  Before that, this patch simply workarounds
>>>> reduction issue by checking if the statement belongs to all partitions.  If yes,
>>>> the reduction must be computed in the last partition no matter how the loop is
>>>> distributed.
>>>> Bootstrap and test on x86_64 and AArch64.  Is it OK?
>>>
>>> stmt_in_all_partitions is not kept up-to-date during partition merging and if
>>> merging makes the reduction partition(s) pass the stmt_in_all_partitions
>>> test your simple workaround doesn't work ...
>> I think it doesn't matter because:
>>   A) it's really workaround for induction variables.  In general,
>> induction variables are included by all partition.
>>   B) After classify partition, we immediately fuses all reduction
>> partitions.  More stmt_in_all_partitions means we are fusing
>> non-reduction partition with reduction partition, so the newly
>> generated (stmt_in_all_partitions) are actually not reduction
>> statements.  The workaround won't work anyway even the bitmap is
>> maintained.
>>>
>>> As written it's a valid optimization but can you please note it's limitation in
>>> some comment please?
>> Yeah, I will add comment explaining it.
> Comment added in new version patch.  It also computes bitmap outside
> now, is it OK?

Ok.  Can you add a testcase for this as well please?  I think the
series up to this
is now fully reviewed, I defered 1/n (the new IFN) to the last one
containing the
runtime versioning.  Can you re-post that (you can merge with the IFN patch)
to apply after the series has been applied up to this?

Thanks,
Richard.

> Thanks,
> bin
> 2017-06-07  Bin Cheng  <bin.cheng@arm.com>
>
>     * tree-loop-distribution.c (classify_partition): New parameter and
>     better handle reduction statement.
>     (rdg_build_partitions): Revise comment.
>     (distribute_loop): Compute statements in all partitions and pass it
>     to classify_partition.
diff mbox

Patch

From b3502d713309da08d93cd53e5fe8fbfdccf3557b Mon Sep 17 00:00:00 2001
From: Bin Cheng <binche01@e108451-lin.cambridge.arm.com>
Date: Fri, 9 Jun 2017 13:21:07 +0100
Subject: [PATCH 11/13] reduction-workaround-20170607.txt

---
 gcc/tree-loop-distribution.c | 43 ++++++++++++++++++++++++++++++++-----------
 1 file changed, 32 insertions(+), 11 deletions(-)

diff --git a/gcc/tree-loop-distribution.c b/gcc/tree-loop-distribution.c
index d741e9e..1b4e238 100644
--- a/gcc/tree-loop-distribution.c
+++ b/gcc/tree-loop-distribution.c
@@ -1226,17 +1226,18 @@  build_rdg_partition_for_vertex (struct graph *rdg, int v)
 }
 
 /* Classifies the builtin kind we can generate for PARTITION of RDG and LOOP.
-   For the moment we detect only the memset zero pattern.  */
+   For the moment we detect memset, memcpy and memmove patterns.  Bitmap
+   STMT_IN_ALL_PARTITIONS contains statements belonging to all partitions.  */
 
 static void
-classify_partition (loop_p loop, struct graph *rdg, partition *partition)
+classify_partition (loop_p loop, struct graph *rdg, partition *partition,
+		    bitmap stmt_in_all_partitions)
 {
   bitmap_iterator bi;
   unsigned i;
   tree nb_iter;
   data_reference_p single_load, single_store;
-  bool volatiles_p = false;
-  bool plus_one = false;
+  bool volatiles_p = false, plus_one = false, has_reduction = false;
 
   partition->kind = PKIND_NORMAL;
   partition->main_dr = NULL;
@@ -1251,16 +1252,31 @@  classify_partition (loop_p loop, struct graph *rdg, partition *partition)
       if (gimple_has_volatile_ops (stmt))
 	volatiles_p = true;
 
-      /* If the stmt has uses outside of the loop mark it as reduction.  */
+      /* If the stmt is not included by all partitions and there is uses
+	 outside of the loop, then mark the partition as reduction.  */
       if (stmt_has_scalar_dependences_outside_loop (loop, stmt))
 	{
-	  partition->reduction_p = true;
-	  return;
+	  /* Due to limitation in the transform phase we have to fuse all
+	     reduction partitions.  As a result, this could cancel valid
+	     loop distribution especially for loop that induction variable
+	     is used outside of loop.  To workaround this issue, we skip
+	     marking partition as reudction if the reduction stmt belongs
+	     to all partitions.  In such case, reduction will be computed
+	     correctly no matter how partitions are fused/distributed.  */
+	  if (!bitmap_bit_p (stmt_in_all_partitions, i))
+	    {
+	      partition->reduction_p = true;
+	      return;
+	    }
+	  has_reduction = true;
 	}
     }
 
   /* Perform general partition disqualification for builtins.  */
   if (volatiles_p
+      /* Simple workaround to prevent classifying the partition as builtin
+	 if it contains any use outside of loop.  */
+      || has_reduction
       || !flag_tree_loop_distribute_patterns)
     return;
 
@@ -1435,9 +1451,9 @@  share_memory_accesses (struct graph *rdg,
   return false;
 }
 
-/* Aggregate several components into a useful partition that is
-   registered in the PARTITIONS vector.  Partitions will be
-   distributed in different loops.  */
+/* For each seed statement in STARTING_STMTS, this function builds
+   partition for it by adding depended statements according to RDG.
+   All partitions are recorded in PARTITIONS.  */
 
 static void
 rdg_build_partitions (struct graph *rdg,
@@ -1705,10 +1721,15 @@  distribute_loop (struct loop *loop, vec<gimple *> stmts,
   auto_vec<struct partition *, 3> partitions;
   rdg_build_partitions (rdg, stmts, &partitions);
 
+  auto_bitmap stmt_in_all_partitions;
+  bitmap_copy (stmt_in_all_partitions, partitions[0]->stmts);
+  for (i = 1; partitions.iterate (i, &partition); ++i)
+    bitmap_and_into (stmt_in_all_partitions, partitions[i]->stmts);
+
   any_builtin = false;
   FOR_EACH_VEC_ELT (partitions, i, partition)
     {
-      classify_partition (loop, rdg, partition);
+      classify_partition (loop, rdg, partition, stmt_in_all_partitions);
       any_builtin |= partition_builtin_p (partition);
     }
 
-- 
1.9.1