diff mbox series

[v3,3/3] PR80791 Consider doloop cmp use in ivopts

Message ID 2d897dc2-a01c-5005-6973-aad0c5930aa8@linux.ibm.com
State New
Headers show
Series None | expand

Commit Message

Kewen.Lin June 19, 2019, 11:47 a.m. UTC
Hi all,

This is the following patch after https://gcc.gnu.org/ml/gcc-patches/2019-06/msg00910.html

Main steps:
  1) Identify the doloop cmp type iv use and record its bind_cand (explain it later).
  2) Set zero cost for pairs between this use and any iv cand.
  3) IV cand set selecting algorithm runs as usual.
  4) Fix up the selected iv cand for doloop use if need.

It only focuses on the targets like Power which has specific count register.
target hook have_count_reg_decr_p is proposed for it.

Some notes:

*) Why we need zero cost?  How about just decrease the cost for the pair
   between doloop use and its original iv cand?  How about just decrease
   the cost for the pair between doloop use and one selected iv cand?

   Since some target supports hardware count register for decrement and
   branch, it doesn't need the general instruction sequence for decr, cmp and
   branch in general registers.  The cost of moving count register to GPR
   is generally high, so it's standalone and can't be shared with other iv 
   uses.  It means IVOPTs can take doloop use as invisible (zero cost).

   Let's take a look at PR80791 for example.

                            original biv (cand 4)  use derived iv (cand 6)
     generic use:                   4                  0
     comp use (doloop use):         0                 infinite
    
    For iv cost, original biv has cost 4 while use derived iv has cost 5.
    When IVOPTs considers doloop use, the optimal cost is 8 (original biv 
    iv cost 4 + use cost 4).  Unfortunately it's not actually optimal, since
    later doloop transformation updates loop closing by count register,
    original biv (and its update) won't be needed in loop closing any more.
    The generic use become the only use for original biv.  That means, if we 
    know the doloop will perform later, we shouldn't consider the doloop use
    when determining IV set.  If we don't consider it, the algorithm will 
    choose iv cand 6 with total cost 5 (iv cost 5 + use cost 0).

    From the above, we can see that to decrease the cost for the pair between 
    doloop use and original biv doesn't work.  Meanwhile it's hard to predict
    one good iv cand in final optimal set here and pre-update the cost
    between it and doloop use.  The analysis would be heavy and imperfect.
   
*) Why we need bind_cand?

    As above, we assign zero cost for pairs between doloop use and each iv 
    cand.  It's possible that doloop use gets assigned one iv cand which is
    invalid to be used during later rewrite.  Then we have to fix it up with iv
    cand originally used for it.  It's fine whatever this iv cand exists in
    final iv cand set or not, even if it's not in the set, it will be 
    eliminated in doloop transformation.

By the way, I was thinking whether we can replace the hook have_count_reg_decr_p
with flag_branch_on_count_reg.  As the description of the "no-" option, "Disable
the optimization pass that scans for opportunities to use 'decrement and branch' 
instructions on a count register instead of instruction sequences that decrement 
a register, compare it against zero, and then branch based upon the result.", it
implicitly says it has count register support.  But I noticed that the gate of 
doloop_optimize checks this flag, as what I got from the previous discussions, some
targets which can perform doloop_optimize don't have specific count register, so it
sounds we can't make use of the flag, is it correct?

Bootstrapped on powerpcle, also ran regression testing on powerpcle, got one failure
which is exposed by this patch and the root cause is duplicate of PR62147.
case is gcc.target/powerpc/20050830-1.c

Is it OK for trunk?  

--------------

gcc/ChangeLog

2019-06-19  Kewen Lin  <linkw@gcc.gnu.org>

	PR middle-end/80791
	* target.def (have_count_reg_decr_p): New hook.
	* doc/tm.texi.in (TARGET_HAVE_COUNT_REG_DECR_P): New hook.
	* doc/tm.texi: Regenerate.
	* config/rs6000/rs6000.c (rs6000_have_count_reg_decr_p): New function.
	(TARGET_HAVE_COUNT_REG_DECR_P): New macro.
	* tree-ssa-loop-ivopts.c (adjust_group_iv_cost_for_doloop): New function.
	(fixup_doloop_groups): Likewise.
	(find_doloop_use_and_its_bind): Likewise.
	(record_group): Init bind_cand.
	(determine_group_iv_cost): Call adjust_group_iv_cost_for_doloop.
	(find_optimal_iv_set): Call fixup_doloop_groups.
	(tree_ssa_iv_optimize_loop): Call function have_count_reg_decr_p, 
	generic_predict_doloop_p and find_doloop_use_and_its_bind.
	(generic_predict_doloop_p): Update attribute.

gcc/testsuite/ChangeLog

2019-06-19  Kewen Lin  <linkw@gcc.gnu.org>

	PR middle-end/80791
	* gcc.dg/tree-ssa/ivopts-lt.c: Adjust.

Comments

Segher Boessenkool June 20, 2019, 9:08 a.m. UTC | #1
Hi Kewen,

On Wed, Jun 19, 2019 at 07:47:34PM +0800, Kewen.Lin wrote:
> +/* Return true if count register for branch is supported.  */
> +
> +static bool
> +rs6000_have_count_reg_decr_p ()
> +{
> +  return flag_branch_on_count_reg;
> +}

rs6000 unconditionally supports these instructions, not just when that
flag is set.  If you need to look at the flag, the *caller* of this new
hook should, not every implementation of the hook.  So just "return true"
here?

>  DEFHOOK
> +(have_count_reg_decr_p,
> + "Return true if the target supports hardware count register for decrement\n\
> +and branch.\n\
> +The default version of this hook returns false.",
> + bool, (void),
> + hook_bool_void_false)

Is it important here that you cannot use that register as a GPR, that any
use of it is expensive because it has to be moved to/from a GPR?  The doc
should say something like that; a little more context, what the hook is
meant to be used for.

> +/* For doloop use, if the algothrim selects some candidate which invalid for
> +   later rewrite, fix it up with bind_cand.  */

"algorithm", "which is invalid".

> +/* Find doloop comparison use and set its related bind_cand.  We adjust the
> +   doloop use group cost against various IV cands, it's possible to assign
> +   some cost like zero rather than original inifite cost.  The point is to

"infinite"

Looks good :-)


Segher
Kewen.Lin June 20, 2019, 12:08 p.m. UTC | #2
Hi Segher,

> On Wed, Jun 19, 2019 at 07:47:34PM +0800, Kewen.Lin wrote:
>> +/* Return true if count register for branch is supported.  */
>> +
>> +static bool
>> +rs6000_have_count_reg_decr_p ()
>> +{
>> +  return flag_branch_on_count_reg;
>> +}
> 
> rs6000 unconditionally supports these instructions, not just when that
> flag is set.  If you need to look at the flag, the *caller* of this new
> hook should, not every implementation of the hook.  So just "return true"
> here?

Good point!  Updated it as hookpod.

>> +/* For doloop use, if the algothrim selects some candidate which invalid for
> 
> "algorithm", "which is invalid".

>> +   some cost like zero rather than original inifite cost.  The point is to
> 
> "infinite"
> 

Thanks for catching!  I should run spelling check next time.  :)

New version attached with comments addressed.


Thanks,
Kewen
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 12f1dfd..e98aba9 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1913,7 +1913,7 @@ static const struct attribute_spec rs6000_attribute_table[] =
 #define TARGET_PREDICT_DOLOOP_P rs6000_predict_doloop_p
 
 #undef TARGET_HAVE_COUNT_REG_DECR_P
-#define TARGET_HAVE_COUNT_REG_DECR_P rs6000_have_count_reg_decr_p
+#define TARGET_HAVE_COUNT_REG_DECR_P true
 
 #undef TARGET_ATOMIC_ASSIGN_EXPAND_FENV
 #define TARGET_ATOMIC_ASSIGN_EXPAND_FENV rs6000_atomic_assign_expand_fenv
@@ -39440,14 +39440,6 @@ rs6000_predict_doloop_p (struct loop *loop)
   return true;
 }
 
-/* Return true if count register for branch is supported.  */
-
-static bool
-rs6000_have_count_reg_decr_p ()
-{
-  return flag_branch_on_count_reg;
-}
-
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-rs6000.h"
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index 46e488f..5477294 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11618,11 +11618,13 @@ loops, and will help ivopts to make some decisions.
 The default version of this hook returns false.
 @end deftypefn
 
-@deftypefn {Target Hook} bool TARGET_HAVE_COUNT_REG_DECR_P (void)
+@deftypevr {Target Hook} bool TARGET_HAVE_COUNT_REG_DECR_P
 Return true if the target supports hardware count register for decrement
-and branch.
-The default version of this hook returns false.
-@end deftypefn
+and branch.  This count register can't be used as general register since
+moving to/from a general register from/to it is very expensive.
+For the targets with this support, ivopts can take doloop use as zero cost.
+The default value is false.
+@end deftypevr
 
 @deftypefn {Target Hook} bool TARGET_CAN_USE_DOLOOP_P (const widest_int @var{&iterations}, const widest_int @var{&iterations_max}, unsigned int @var{loop_depth}, bool @var{entered_at_top})
 Return true if it is possible to use low-overhead loops (@code{doloop_end}
diff --git a/gcc/target.def b/gcc/target.def
index ec15a6d..8a64e5b 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -4246,13 +4246,15 @@ The default version of this hook returns false.",
  bool, (struct loop *loop),
  default_predict_doloop_p)
 
-DEFHOOK
+DEFHOOKPOD
 (have_count_reg_decr_p,
  "Return true if the target supports hardware count register for decrement\n\
-and branch.\n\
-The default version of this hook returns false.",
- bool, (void),
- hook_bool_void_false)
+and branch.  This count register can't be used as general register since\n\
+moving to/from a general register from/to it is very expensive.\n\
+For the targets with this support, ivopts can take doloop use as zero cost.\n\
+The default value is false.",
+ bool, false)
+
 
 DEFHOOK
 (can_use_doloop_p,
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c
index 7d5859b..71d7f67 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c
@@ -17,6 +17,7 @@ f1 (char *p, uintptr_t i, uintptr_t n)
   while (i < n);
 }
 
-/* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts" } } */
-/* { dg-final { scan-tree-dump-times "PHI <p_" 1 "ivopts"} } */
-/* { dg-final { scan-tree-dump-times "p_\[0-9\]* <" 1 "ivopts" } } */
+/* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts" { target { ! powerpc*-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "PHI" 2 "ivopts" { target { powerpc*-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "PHI <p_" 1 "ivopts" { target { ! powerpc*-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "p_\[0-9\]* <" 1 "ivopts" { target { ! powerpc*-*-* } } } } */
diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index b1138ea..742d3fa 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -3730,7 +3730,7 @@ prepare_decl_rtl (tree *expr_p, int *ws, void *data)
    Some RTL specific checks seems unable to be checked in gimple, if any new
    checks or easy checks _are_ missing here, please add them.  */
 
-static bool ATTRIBUTE_UNUSED
+static bool
 generic_predict_doloop_p (struct ivopts_data *data)
 {
   struct loop *loop = data->current_loop;
@@ -6749,7 +6749,7 @@ find_optimal_iv_set_1 (struct ivopts_data *data, bool originalp)
   return set;
 }
 
-/* For doloop use, if the algothrim selects some candidate which invalid for
+/* For doloop use, if the algorithm selects some candidate which is invalid for
    later rewrite, fix it up with bind_cand.  */
 
 static void
@@ -7622,7 +7622,7 @@ determine_scaling_factor (struct ivopts_data *data, basic_block *body)
 
 /* Find doloop comparison use and set its related bind_cand.  We adjust the
    doloop use group cost against various IV cands, it's possible to assign
-   some cost like zero rather than original inifite cost.  The point is to
+   some cost like zero rather than original infinite cost.  The point is to
    give more chances to consider other IV cands instead of BIV.  The cost
    originally given on doloop use can affect optimal decision because it can
    become dead and get eliminated but considered too much here.
@@ -7744,7 +7744,8 @@ tree_ssa_iv_optimize_loop (struct ivopts_data *data, struct loop *loop,
   /* Finds candidates for the induction variables (item 2).  */
   find_iv_candidates (data);
 
-  if (targetm.have_count_reg_decr_p() && generic_predict_doloop_p (data))
+  if (flag_branch_on_count_reg && targetm.have_count_reg_decr_p
+      && generic_predict_doloop_p (data))
     {
       data->doloop_use_p = find_doloop_use_and_its_bind (data);
       if (data->doloop_use_p && dump_file && (dump_flags & TDF_DETAILS))
Kewen.Lin June 20, 2019, 12:16 p.m. UTC | #3
Hi,

Sorry, the previous patch is incomplete.
New one attached.  Sorry for inconvenience.

on 2019/6/20 下午8:08, Kewen.Lin wrote:
> Hi Segher,
> 
>> On Wed, Jun 19, 2019 at 07:47:34PM +0800, Kewen.Lin wrote:
>>> +/* Return true if count register for branch is supported.  */
>>> +
>>> +static bool
>>> +rs6000_have_count_reg_decr_p ()
>>> +{
>>> +  return flag_branch_on_count_reg;
>>> +}
>>
>> rs6000 unconditionally supports these instructions, not just when that
>> flag is set.  If you need to look at the flag, the *caller* of this new
>> hook should, not every implementation of the hook.  So just "return true"
>> here?
> 
> Good point!  Updated it as hookpod.
> 
>>> +/* For doloop use, if the algothrim selects some candidate which invalid for
>>
>> "algorithm", "which is invalid".
> 
>>> +   some cost like zero rather than original inifite cost.  The point is to
>>
>> "infinite"
>>
> 
> Thanks for catching!  I should run spelling check next time.  :)
> 
> New version attached with comments addressed.
> 
> 
> Thanks,
> Kewen
>
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 6667cd0..e98aba9 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1912,6 +1912,9 @@ static const struct attribute_spec rs6000_attribute_table[] =
 #undef TARGET_PREDICT_DOLOOP_P
 #define TARGET_PREDICT_DOLOOP_P rs6000_predict_doloop_p
 
+#undef TARGET_HAVE_COUNT_REG_DECR_P
+#define TARGET_HAVE_COUNT_REG_DECR_P true
+
 #undef TARGET_ATOMIC_ASSIGN_EXPAND_FENV
 #define TARGET_ATOMIC_ASSIGN_EXPAND_FENV rs6000_atomic_assign_expand_fenv
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index c2aa4d0..5477294 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11618,6 +11618,14 @@ loops, and will help ivopts to make some decisions.
 The default version of this hook returns false.
 @end deftypefn
 
+@deftypevr {Target Hook} bool TARGET_HAVE_COUNT_REG_DECR_P
+Return true if the target supports hardware count register for decrement
+and branch.  This count register can't be used as general register since
+moving to/from a general register from/to it is very expensive.
+For the targets with this support, ivopts can take doloop use as zero cost.
+The default value is false.
+@end deftypevr
+
 @deftypefn {Target Hook} bool TARGET_CAN_USE_DOLOOP_P (const widest_int @var{&iterations}, const widest_int @var{&iterations_max}, unsigned int @var{loop_depth}, bool @var{entered_at_top})
 Return true if it is possible to use low-overhead loops (@code{doloop_end}
 and @code{doloop_begin}) for a particular loop.  @var{iterations} gives the
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index b4d57b8..5f43b27 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -7946,6 +7946,8 @@ to by @var{ce_info}.
 
 @hook TARGET_PREDICT_DOLOOP_P
 
+@hook TARGET_HAVE_COUNT_REG_DECR_P
+
 @hook TARGET_CAN_USE_DOLOOP_P
 
 @hook TARGET_INVALID_WITHIN_DOLOOP
diff --git a/gcc/target.def b/gcc/target.def
index 71b6972..8a64e5b 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -4246,6 +4246,16 @@ The default version of this hook returns false.",
  bool, (struct loop *loop),
  default_predict_doloop_p)
 
+DEFHOOKPOD
+(have_count_reg_decr_p,
+ "Return true if the target supports hardware count register for decrement\n\
+and branch.  This count register can't be used as general register since\n\
+moving to/from a general register from/to it is very expensive.\n\
+For the targets with this support, ivopts can take doloop use as zero cost.\n\
+The default value is false.",
+ bool, false)
+
+
 DEFHOOK
 (can_use_doloop_p,
  "Return true if it is possible to use low-overhead loops (@code{doloop_end}\n\
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c
index 7d5859b..71d7f67 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c
@@ -17,6 +17,7 @@ f1 (char *p, uintptr_t i, uintptr_t n)
   while (i < n);
 }
 
-/* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts" } } */
-/* { dg-final { scan-tree-dump-times "PHI <p_" 1 "ivopts"} } */
-/* { dg-final { scan-tree-dump-times "p_\[0-9\]* <" 1 "ivopts" } } */
+/* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts" { target { ! powerpc*-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "PHI" 2 "ivopts" { target { powerpc*-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "PHI <p_" 1 "ivopts" { target { ! powerpc*-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "p_\[0-9\]* <" 1 "ivopts" { target { ! powerpc*-*-* } } } } */
diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 530ea4a..742d3fa 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -399,6 +399,8 @@ struct iv_group
   struct cost_pair *cost_map;
   /* The selected candidate for the group.  */
   struct iv_cand *selected;
+  /* The bind candidate for this group, for doloop use group only.  */
+  struct iv_cand *bind_cand;
   /* Uses in the group.  */
   vec<struct iv_use *> vuses;
 };
@@ -612,6 +614,9 @@ struct ivopts_data
 
   /* Whether the loop body can only be exited via single exit.  */
   bool loop_single_exit_p;
+
+  /* Whether the loop has doloop comparison use.  */
+  bool doloop_use_p;
 };
 
 /* An assignment of iv candidates to uses.  */
@@ -1528,6 +1533,7 @@ record_group (struct ivopts_data *data, enum use_type type)
   group->type = type;
   group->related_cands = BITMAP_ALLOC (NULL);
   group->vuses.create (1);
+  group->bind_cand = NULL;
 
   data->vgroups.safe_push (group);
   return group;
@@ -3724,7 +3730,7 @@ prepare_decl_rtl (tree *expr_p, int *ws, void *data)
    Some RTL specific checks seems unable to be checked in gimple, if any new
    checks or easy checks _are_ missing here, please add them.  */
 
-static bool ATTRIBUTE_UNUSED
+static bool
 generic_predict_doloop_p (struct ivopts_data *data)
 {
   struct loop *loop = data->current_loop;
@@ -5291,6 +5297,21 @@ determine_group_iv_cost_cond (struct ivopts_data *data,
   return !cost.infinite_cost_p ();
 }
 
+/* Set no cost for pair between doloop iv use GROUP and iv cand CAND.  */
+
+static bool
+adjust_group_iv_cost_for_doloop (struct ivopts_data *data,
+				 struct iv_group *group, struct iv_cand *cand)
+{
+  struct cost_pair *cp = get_group_iv_cost (data, group, cand);
+  if (!cp)
+    set_group_iv_cost (data, group, cand, no_cost, NULL, NULL_TREE, ERROR_MARK,
+		       NULL);
+  else
+    cp->cost = no_cost;
+  return true;
+}
+
 /* Determines cost of computing uses in GROUP with CAND.  Returns false
    if USE cannot be represented with CAND.  */
 
@@ -5308,7 +5329,12 @@ determine_group_iv_cost (struct ivopts_data *data,
       return determine_group_iv_cost_address (data, group, cand);
 
     case USE_COMPARE:
-      return determine_group_iv_cost_cond (data, group, cand);
+      {
+	bool finite_cost_p = determine_group_iv_cost_cond (data, group, cand);
+	if (data->doloop_use_p && group->bind_cand)
+	  finite_cost_p = adjust_group_iv_cost_for_doloop (data, group, cand);
+	return finite_cost_p;
+      }
 
     default:
       gcc_unreachable ();
@@ -6723,6 +6749,29 @@ find_optimal_iv_set_1 (struct ivopts_data *data, bool originalp)
   return set;
 }
 
+/* For doloop use, if the algorithm selects some candidate which is invalid for
+   later rewrite, fix it up with bind_cand.  */
+
+static void
+fixup_doloop_groups (struct ivopts_data *data, struct iv_ca *set)
+{
+  for (unsigned i = 0; i < data->vgroups.length (); i++)
+    {
+      struct iv_group *group = data->vgroups[i];
+      if (group->bind_cand)
+	{
+	  struct cost_pair *cp = iv_ca_cand_for_group (set, group);
+	  gcc_assert (cp);
+	  if (cp->cand != group->bind_cand && cp->value == NULL_TREE)
+	    {
+	      struct cost_pair *bind_cp
+		= get_group_iv_cost (data, group, group->bind_cand);
+	      iv_ca_set_cp (data, set, group, bind_cp);
+	    }
+	}
+    }
+}
+
 static struct iv_ca *
 find_optimal_iv_set (struct ivopts_data *data)
 {
@@ -6760,6 +6809,9 @@ find_optimal_iv_set (struct ivopts_data *data)
   else if (origset)
     iv_ca_free (&origset);
 
+  if (data->doloop_use_p)
+    fixup_doloop_groups (data, set);
+
   for (i = 0; i < data->vgroups.length (); i++)
     {
       struct iv_group *group = data->vgroups[i];
@@ -7568,6 +7620,72 @@ determine_scaling_factor (struct ivopts_data *data, basic_block *body)
     }
 }
 
+/* Find doloop comparison use and set its related bind_cand.  We adjust the
+   doloop use group cost against various IV cands, it's possible to assign
+   some cost like zero rather than original infinite cost.  The point is to
+   give more chances to consider other IV cands instead of BIV.  The cost
+   originally given on doloop use can affect optimal decision because it can
+   become dead and get eliminated but considered too much here.
+
+   So it's possible that doloop use is assigned one invalid IV cand to rewrite.
+   In this case, we need bind_cand to fix up.  Even if the bind_cand doesn't
+   exist in final iv_ca set, it won't affect optimal decision since it gets
+   eliminated along with doloop use.  */
+
+static bool
+find_doloop_use_and_its_bind (struct ivopts_data *data)
+{
+  struct loop *loop = data->current_loop;
+
+  for (unsigned i = 0; i < data->vgroups.length (); i++)
+    {
+      struct iv_group *group = data->vgroups[i];
+      if (group->type == USE_COMPARE)
+	{
+	  gcc_assert (group->vuses.length () == 1);
+	  struct iv_use *use = group->vuses[0];
+	  gimple *stmt = use->stmt;
+	  if (gimple_code (stmt) == GIMPLE_COND)
+	    {
+	      basic_block bb = gimple_bb (stmt);
+	      edge true_edge, false_edge;
+	      extract_true_false_edges_from_block (bb, &true_edge, &false_edge);
+	      /* This comparison is used for loop latch.  Require latch is empty
+		 for now.  */
+	      if ((loop->latch == true_edge->dest
+		   || loop->latch == false_edge->dest)
+		  && empty_block_p (loop->latch))
+		{
+		  for (unsigned j = 0; j < data->vcands.length (); j++)
+		    {
+		      if (bitmap_bit_p (group->related_cands, j))
+			{
+			  struct iv_cand *cand = data->vcands[j];
+			  tree op = use->iv->ssa_name;
+			  if (op == cand->var_before || op == cand->var_after)
+			    {
+			      group->bind_cand = cand;
+			      if (dump_file && (dump_flags & TDF_DETAILS))
+				{
+				  fprintf (dump_file, "Doloop cmp iv use: ");
+				  print_gimple_stmt (dump_file, stmt,
+						     TDF_DETAILS);
+				  dump_cand (dump_file, cand);
+				}
+			      break;
+			    }
+			}
+		    }
+		  if (group->bind_cand)
+		    return true;
+		}
+	    }
+	}
+    }
+
+  return false;
+}
+
 /* Optimizes the LOOP.  Returns true if anything changed.  */
 
 static bool
@@ -7580,6 +7698,7 @@ tree_ssa_iv_optimize_loop (struct ivopts_data *data, struct loop *loop,
   basic_block *body;
 
   gcc_assert (!data->niters);
+  data->doloop_use_p = false;
   data->current_loop = loop;
   data->loop_loc = find_loop_location (loop).get_location_t ();
   data->speed = optimize_loop_for_speed_p (loop);
@@ -7625,6 +7744,19 @@ tree_ssa_iv_optimize_loop (struct ivopts_data *data, struct loop *loop,
   /* Finds candidates for the induction variables (item 2).  */
   find_iv_candidates (data);
 
+  if (flag_branch_on_count_reg && targetm.have_count_reg_decr_p
+      && generic_predict_doloop_p (data))
+    {
+      data->doloop_use_p = find_doloop_use_and_its_bind (data);
+      if (data->doloop_use_p && dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file,
+		   "Predict loop %d can perform doloop optimization later.\n",
+		   loop->num);
+	  flow_loop_dump (loop, dump_file, NULL, 1);
+	}
+    }
+
   /* Calculates the costs (item 3, part 1).  */
   determine_iv_costs (data);
   determine_group_iv_costs (data);
Kewen.Lin July 10, 2019, 2:15 a.m. UTC | #4
Hi all,

I'd like to gentle ping the below patch:
https://gcc.gnu.org/ml/gcc-patches/2019-06/msg01225.html

The previous version for more context/background:
https://gcc.gnu.org/ml/gcc-patches/2019-06/msg01126.html

Thanks a lot in advance!


on 2019/6/20 下午8:16, Kewen.Lin wrote:
> Hi,
> 
> Sorry, the previous patch is incomplete.
> New one attached.  Sorry for inconvenience.
> 
> on 2019/6/20 下午8:08, Kewen.Lin wrote:
>> Hi Segher,
>>
>>> On Wed, Jun 19, 2019 at 07:47:34PM +0800, Kewen.Lin wrote:
>>>> +/* Return true if count register for branch is supported.  */
>>>> +
>>>> +static bool
>>>> +rs6000_have_count_reg_decr_p ()
>>>> +{
>>>> +  return flag_branch_on_count_reg;
>>>> +}
>>>
>>> rs6000 unconditionally supports these instructions, not just when that
>>> flag is set.  If you need to look at the flag, the *caller* of this new
>>> hook should, not every implementation of the hook.  So just "return true"
>>> here?
>>
>> Good point!  Updated it as hookpod.
>>
>>>> +/* For doloop use, if the algothrim selects some candidate which invalid for
>>>
>>> "algorithm", "which is invalid".
>>
>>>> +   some cost like zero rather than original inifite cost.  The point is to
>>>
>>> "infinite"
>>>
>>
>> Thanks for catching!  I should run spelling check next time.  :)
>>
>> New version attached with comments addressed.
>>
>>
>> Thanks,
>> Kewen
>>
Richard Biener July 12, 2019, 12:11 p.m. UTC | #5
On Wed, 10 Jul 2019, Kewen.Lin wrote:

> Hi all,
> 
> I'd like to gentle ping the below patch:
> https://gcc.gnu.org/ml/gcc-patches/2019-06/msg01225.html
> 
> The previous version for more context/background:
> https://gcc.gnu.org/ml/gcc-patches/2019-06/msg01126.html
> 
> Thanks a lot in advance!

Again I would have hoped Bin to chime in here.

Am I correct that doloop HW implementations are constrainted
by a decrement of one?  I see no code in the patch to constrain
things this way.  I'm not familiar with the group code at all
but I would have expected the patch to only affect costing
of IVs of the appropriate form (decrement one and possibly
no uses besides the one in the compare/decrement).  Since
ivcanon already adds a canonical counter IV it's not
necessary to generate an artificial candidate IV of the
wanted style (that's something I might have expected as well).

The rest should be just magic from the IVOPTs side?

There might be the need to only consider at most one counter IV
in the costing code.

Richard.

> 
> on 2019/6/20 下午8:16, Kewen.Lin wrote:
> > Hi,
> > 
> > Sorry, the previous patch is incomplete.
> > New one attached.  Sorry for inconvenience.
> > 
> > on 2019/6/20 下午8:08, Kewen.Lin wrote:
> >> Hi Segher,
> >>
> >>> On Wed, Jun 19, 2019 at 07:47:34PM +0800, Kewen.Lin wrote:
> >>>> +/* Return true if count register for branch is supported.  */
> >>>> +
> >>>> +static bool
> >>>> +rs6000_have_count_reg_decr_p ()
> >>>> +{
> >>>> +  return flag_branch_on_count_reg;
> >>>> +}
> >>>
> >>> rs6000 unconditionally supports these instructions, not just when that
> >>> flag is set.  If you need to look at the flag, the *caller* of this new
> >>> hook should, not every implementation of the hook.  So just "return true"
> >>> here?
> >>
> >> Good point!  Updated it as hookpod.
> >>
> >>>> +/* For doloop use, if the algothrim selects some candidate which invalid for
> >>>
> >>> "algorithm", "which is invalid".
> >>
> >>>> +   some cost like zero rather than original inifite cost.  The point is to
> >>>
> >>> "infinite"
> >>>
> >>
> >> Thanks for catching!  I should run spelling check next time.  :)
> >>
> >> New version attached with comments addressed.
> >>
> >>
> >> Thanks,
> >> Kewen
> >>
> 
>
Segher Boessenkool July 12, 2019, 1:28 p.m. UTC | #6
On Fri, Jul 12, 2019 at 02:11:16PM +0200, Richard Biener wrote:
> Am I correct that doloop HW implementations are constrainted
> by a decrement of one?

GCC's doloop patterns are.  Not all hardware is.


Segher
Kewen.Lin July 15, 2019, 2:54 a.m. UTC | #7
Hi Richard,

on 2019/7/12 下午8:11, Richard Biener wrote:
> On Wed, 10 Jul 2019, Kewen.Lin wrote:
> 
>> Hi all,
>>
>> I'd like to gentle ping the below patch:
>> https://gcc.gnu.org/ml/gcc-patches/2019-06/msg01225.html
>>
>> The previous version for more context/background:
>> https://gcc.gnu.org/ml/gcc-patches/2019-06/msg01126.html
>>
>> Thanks a lot in advance!
> 
> Again I would have hoped Bin to chime in here.
> 
> Am I correct that doloop HW implementations are constrainted
> by a decrement of one?  I see no code in the patch to constrain
> things this way.  

If my understanding is correct, under have_count_reg_decr_p
I don't think we should check the decrement one pattern, doloop
can transform the loop closing to decrement by 1 since it knows
the iteration total count.  Since it uses special hardware register
like Power count register, we don't expect it to be shared with
other uses.  Btw, it also doesn't require the compare should be the 
comp/decrement pattern, so this patch more focuses on this comp
is needed or not (should be considered in selection or not).

> I'm not familiar with the group code at all
> but I would have expected the patch to only affect costing
> of IVs of the appropriate form (decrement one and possibly
> no uses besides the one in the compare/decrement).  

But since we select IV cand for every IV uses, we never knows
this IV cand will have the only use till the whole selection
done.

> Since
> ivcanon already adds a canonical counter IV it's not
> necessary to generate an artificial candidate IV of the
> wanted style (that's something I might have expected as well).

This patch is only for the case guarded in have_count_reg_decr_p.
It doesn't requires to have the artificial candidate IV as well
as decrement-compare-jump code sequence.  The code on power looks
like:  mtctr Rx   // move Rx (which holding total_counter) 
                  // to ctr register
      L:                   
       loop body...
       bnze L     // decrease ctr register and jump to L if 
                  // ctr nonzero    

> 
> The rest should be just magic from the IVOPTs side?
> 
> There might be the need to only consider at most one counter IV
> in the costing code.

The current patch doesn't introduce any IV cands but focus on 
zeroing the cost of comp IV use since we know it will be eliminated.
Still to leverage the existing candidate selection algorithm to decide
the final optimal IV set.  Bring back the canonical counter IV only
if it's not selected by any IV uses to keep the doloop comp use
rewriting correct, but it shouldn't affect anything since the use will
be eliminated and is the only use, the IV and its related will be
removed as well.


Thanks,
Kewen
Bin.Cheng July 15, 2019, 6:40 a.m. UTC | #8
On Fri, Jul 12, 2019 at 8:11 PM Richard Biener <rguenther@suse.de> wrote:
>
> On Wed, 10 Jul 2019, Kewen.Lin wrote:
>
> > Hi all,
> >
> > I'd like to gentle ping the below patch:
> > https://gcc.gnu.org/ml/gcc-patches/2019-06/msg01225.html
> >
> > The previous version for more context/background:
> > https://gcc.gnu.org/ml/gcc-patches/2019-06/msg01126.html
> >
> > Thanks a lot in advance!
>
> Again I would have hoped Bin to chime in here.
Sorry for missing this one, will get to the patch this week.  Sorry
again for the inconvenience.

Thanks,
bin
>
> Am I correct that doloop HW implementations are constrainted
> by a decrement of one?  I see no code in the patch to constrain
> things this way.  I'm not familiar with the group code at all
> but I would have expected the patch to only affect costing
> of IVs of the appropriate form (decrement one and possibly
> no uses besides the one in the compare/decrement).  Since
> ivcanon already adds a canonical counter IV it's not
> necessary to generate an artificial candidate IV of the
> wanted style (that's something I might have expected as well).
>
> The rest should be just magic from the IVOPTs side?
>
> There might be the need to only consider at most one counter IV
> in the costing code.
>
> Richard.
>
> >
> > on 2019/6/20 下午8:16, Kewen.Lin wrote:
> > > Hi,
> > >
> > > Sorry, the previous patch is incomplete.
> > > New one attached.  Sorry for inconvenience.
> > >
> > > on 2019/6/20 下午8:08, Kewen.Lin wrote:
> > >> Hi Segher,
> > >>
> > >>> On Wed, Jun 19, 2019 at 07:47:34PM +0800, Kewen.Lin wrote:
> > >>>> +/* Return true if count register for branch is supported.  */
> > >>>> +
> > >>>> +static bool
> > >>>> +rs6000_have_count_reg_decr_p ()
> > >>>> +{
> > >>>> +  return flag_branch_on_count_reg;
> > >>>> +}
> > >>>
> > >>> rs6000 unconditionally supports these instructions, not just when that
> > >>> flag is set.  If you need to look at the flag, the *caller* of this new
> > >>> hook should, not every implementation of the hook.  So just "return true"
> > >>> here?
> > >>
> > >> Good point!  Updated it as hookpod.
> > >>
> > >>>> +/* For doloop use, if the algothrim selects some candidate which invalid for
> > >>>
> > >>> "algorithm", "which is invalid".
> > >>
> > >>>> +   some cost like zero rather than original inifite cost.  The point is to
> > >>>
> > >>> "infinite"
> > >>>
> > >>
> > >> Thanks for catching!  I should run spelling check next time.  :)
> > >>
> > >> New version attached with comments addressed.
> > >>
> > >>
> > >> Thanks,
> > >> Kewen
> > >>
> >
> >
>
> --
> Richard Biener <rguenther@suse.de>
> SUSE Linux GmbH, Maxfeldstrasse 5, 90409 Nuernberg, Germany;
> GF: Felix Imendörffer, Mary Higgins, Sri Rasiah; HRB 21284 (AG Nürnberg)
Bin.Cheng July 21, 2019, 3:07 a.m. UTC | #9
On Wed, Jun 19, 2019 at 7:47 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi all,
>
> This is the following patch after https://gcc.gnu.org/ml/gcc-patches/2019-06/msg00910.html
>
> Main steps:
>   1) Identify the doloop cmp type iv use and record its bind_cand (explain it later).
>   2) Set zero cost for pairs between this use and any iv cand.
>   3) IV cand set selecting algorithm runs as usual.
>   4) Fix up the selected iv cand for doloop use if need.
>
> It only focuses on the targets like Power which has specific count register.
> target hook have_count_reg_decr_p is proposed for it.
>
> Some notes:
>
> *) Why we need zero cost?  How about just decrease the cost for the pair
>    between doloop use and its original iv cand?  How about just decrease
>    the cost for the pair between doloop use and one selected iv cand?
>
>    Since some target supports hardware count register for decrement and
>    branch, it doesn't need the general instruction sequence for decr, cmp and
>    branch in general registers.  The cost of moving count register to GPR
>    is generally high, so it's standalone and can't be shared with other iv
>    uses.  It means IVOPTs can take doloop use as invisible (zero cost).
>
>    Let's take a look at PR80791 for example.
>
>                             original biv (cand 4)  use derived iv (cand 6)
>      generic use:                   4                  0
>      comp use (doloop use):         0                 infinite
>
>     For iv cost, original biv has cost 4 while use derived iv has cost 5.
>     When IVOPTs considers doloop use, the optimal cost is 8 (original biv
>     iv cost 4 + use cost 4).  Unfortunately it's not actually optimal, since
>     later doloop transformation updates loop closing by count register,
>     original biv (and its update) won't be needed in loop closing any more.
>     The generic use become the only use for original biv.  That means, if we
>     know the doloop will perform later, we shouldn't consider the doloop use
>     when determining IV set.  If we don't consider it, the algorithm will
>     choose iv cand 6 with total cost 5 (iv cost 5 + use cost 0).
>
>     From the above, we can see that to decrease the cost for the pair between
>     doloop use and original biv doesn't work.  Meanwhile it's hard to predict
>     one good iv cand in final optimal set here and pre-update the cost
>     between it and doloop use.  The analysis would be heavy and imperfect.
>
> *) Why we need bind_cand?
>
>     As above, we assign zero cost for pairs between doloop use and each iv
>     cand.  It's possible that doloop use gets assigned one iv cand which is
>     invalid to be used during later rewrite.  Then we have to fix it up with iv
>     cand originally used for it.  It's fine whatever this iv cand exists in
>     final iv cand set or not, even if it's not in the set, it will be
>     eliminated in doloop transformation.
>
> By the way, I was thinking whether we can replace the hook have_count_reg_decr_p
> with flag_branch_on_count_reg.  As the description of the "no-" option, "Disable
> the optimization pass that scans for opportunities to use 'decrement and branch'
> instructions on a count register instead of instruction sequences that decrement
> a register, compare it against zero, and then branch based upon the result.", it
> implicitly says it has count register support.  But I noticed that the gate of
> doloop_optimize checks this flag, as what I got from the previous discussions, some
> targets which can perform doloop_optimize don't have specific count register, so it
> sounds we can't make use of the flag, is it correct?
>
> Bootstrapped on powerpcle, also ran regression testing on powerpcle, got one failure
> which is exposed by this patch and the root cause is duplicate of PR62147.
> case is gcc.target/powerpc/20050830-1.c
>
> Is it OK for trunk?
Sorry for the delaying.

I am not in favor of the approach very much.  When rewriting the pass
last time, we tried to reuse as much code as possible between cost
computation and iv_use rewriting.  we also followed guideline when
finite cost computed for cand/use pair, the use should be rewritten
using the cand successfully.  However, the patch adjust infinite cost
to zero cost causing cand can't be used to rewrite iv_use selected,
this is a backward step IMHO.

I am not sure if this is only useful for doloop cases, or for general cases?

Comment mentioned the point is to give more chances to consider other
IV cands instead of BIV.  If current algorithm relies on zeroing cost
of impossible cand/use pair to select optimal result, I suspect it's a
bug which should be fixed in candidate selection algorithm.  Do you
have a test case showing the issue? We should fix it as a standalone
problem, while the approach is covering the problem and not that
sound.

However, I think the patch can be changed that only finite cost should
be adjusted to zero.  Thus guarantee any cand selected is valid to
rewrite iv_use.

Thanks,
bin
>
> --------------
>
> gcc/ChangeLog
>
> 2019-06-19  Kewen Lin  <linkw@gcc.gnu.org>
>
>         PR middle-end/80791
>         * target.def (have_count_reg_decr_p): New hook.
>         * doc/tm.texi.in (TARGET_HAVE_COUNT_REG_DECR_P): New hook.
>         * doc/tm.texi: Regenerate.
>         * config/rs6000/rs6000.c (rs6000_have_count_reg_decr_p): New function.
>         (TARGET_HAVE_COUNT_REG_DECR_P): New macro.
>         * tree-ssa-loop-ivopts.c (adjust_group_iv_cost_for_doloop): New function.
>         (fixup_doloop_groups): Likewise.
>         (find_doloop_use_and_its_bind): Likewise.
>         (record_group): Init bind_cand.
>         (determine_group_iv_cost): Call adjust_group_iv_cost_for_doloop.
>         (find_optimal_iv_set): Call fixup_doloop_groups.
>         (tree_ssa_iv_optimize_loop): Call function have_count_reg_decr_p,
>         generic_predict_doloop_p and find_doloop_use_and_its_bind.
>         (generic_predict_doloop_p): Update attribute.
>
> gcc/testsuite/ChangeLog
>
> 2019-06-19  Kewen Lin  <linkw@gcc.gnu.org>
>
>         PR middle-end/80791
>         * gcc.dg/tree-ssa/ivopts-lt.c: Adjust.
>
>
Kewen.Lin July 22, 2019, 3:54 a.m. UTC | #10
Hi Bin,

on 2019/7/21 上午11:07, Bin.Cheng wrote:
> On Wed, Jun 19, 2019 at 7:47 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi all,
>>
>> This is the following patch after https://gcc.gnu.org/ml/gcc-patches/2019-06/msg00910.html
>>
>> Main steps:
>>   1) Identify the doloop cmp type iv use and record its bind_cand (explain it later).
>>   2) Set zero cost for pairs between this use and any iv cand.
>>   3) IV cand set selecting algorithm runs as usual.
>>   4) Fix up the selected iv cand for doloop use if need.
>>
>> It only focuses on the targets like Power which has specific count register.
>> target hook have_count_reg_decr_p is proposed for it.
>>
>> Some notes:
>>
>> *) Why we need zero cost?  How about just decrease the cost for the pair
>>    between doloop use and its original iv cand?  How about just decrease
>>    the cost for the pair between doloop use and one selected iv cand?
>>
>>    Since some target supports hardware count register for decrement and
>>    branch, it doesn't need the general instruction sequence for decr, cmp and
>>    branch in general registers.  The cost of moving count register to GPR
>>    is generally high, so it's standalone and can't be shared with other iv
>>    uses.  It means IVOPTs can take doloop use as invisible (zero cost).
>>
>>    Let's take a look at PR80791 for example.
>>
>>                             original biv (cand 4)  use derived iv (cand 6)
>>      generic use:                   4                  0
>>      comp use (doloop use):         0                 infinite
>>
>>     For iv cost, original biv has cost 4 while use derived iv has cost 5.
>>     When IVOPTs considers doloop use, the optimal cost is 8 (original biv
>>     iv cost 4 + use cost 4).  Unfortunately it's not actually optimal, since
>>     later doloop transformation updates loop closing by count register,
>>     original biv (and its update) won't be needed in loop closing any more.
>>     The generic use become the only use for original biv.  That means, if we
>>     know the doloop will perform later, we shouldn't consider the doloop use
>>     when determining IV set.  If we don't consider it, the algorithm will
>>     choose iv cand 6 with total cost 5 (iv cost 5 + use cost 0).
>>
>>     From the above, we can see that to decrease the cost for the pair between
>>     doloop use and original biv doesn't work.  Meanwhile it's hard to predict
>>     one good iv cand in final optimal set here and pre-update the cost
>>     between it and doloop use.  The analysis would be heavy and imperfect.
>>
>> *) Why we need bind_cand?
>>
>>     As above, we assign zero cost for pairs between doloop use and each iv
>>     cand.  It's possible that doloop use gets assigned one iv cand which is
>>     invalid to be used during later rewrite.  Then we have to fix it up with iv
>>     cand originally used for it.  It's fine whatever this iv cand exists in
>>     final iv cand set or not, even if it's not in the set, it will be
>>     eliminated in doloop transformation.
>>
>> By the way, I was thinking whether we can replace the hook have_count_reg_decr_p
>> with flag_branch_on_count_reg.  As the description of the "no-" option, "Disable
>> the optimization pass that scans for opportunities to use 'decrement and branch'
>> instructions on a count register instead of instruction sequences that decrement
>> a register, compare it against zero, and then branch based upon the result.", it
>> implicitly says it has count register support.  But I noticed that the gate of
>> doloop_optimize checks this flag, as what I got from the previous discussions, some
>> targets which can perform doloop_optimize don't have specific count register, so it
>> sounds we can't make use of the flag, is it correct?
>>
>> Bootstrapped on powerpcle, also ran regression testing on powerpcle, got one failure
>> which is exposed by this patch and the root cause is duplicate of PR62147.
>> case is gcc.target/powerpc/20050830-1.c
>>
>> Is it OK for trunk?
> Sorry for the delaying.
> 
> I am not in favor of the approach very much.  When rewriting the pass
> last time, we tried to reuse as much code as possible between cost
> computation and iv_use rewriting.  we also followed guideline when
> finite cost computed for cand/use pair, the use should be rewritten
> using the cand successfully.  However, the patch adjust infinite cost
> to zero cost causing cand can't be used to rewrite iv_use selected,
> this is a backward step IMHO.

Thanks a lot for your time and comments.

V2: https://gcc.gnu.org/ml/gcc-patches/2019-05/msg00655.html

The previous version 2 (above link) used the way to teach selection 
algorithm to be aware of the group with bind_cand, it didn't zeroing 
the cost of doloop IV use, but both of them are equivalent to ignore
this doloop IV use in selection. 

Then I was thinking as granted that it changed many places to take care 
of this bind_cand group, worsen the readability and seems invasive to
the existing algorithm too much.  For example, affected functions were: 
set_group_iv_cost/iv_ca_dump/try_add_cand_for/iv_ca_extend/iv_ca_narrow
/iv_ca_replace.  

At that time, I thought version 3 doesn't need to teach the existing algorithm
anything, leaves it to go as before and only need some fixups when it needs.
It has better readability and well fit in current handlings.
> 
> I am not sure if this is only useful for doloop cases, or for general cases?
> 

Not sure either.

> Comment mentioned the point is to give more chances to consider other
> IV cands instead of BIV.  If current algorithm relies on zeroing cost
> of impossible cand/use pair to select optimal result, I suspect it's a
> bug which should be fixed in candidate selection algorithm.  Do you
> have a test case showing the issue? We should fix it as a standalone
> problem, while the approach is covering the problem and not that
> sound.

The best case is the one which caused PR80791.  It has two IV uses, one 
is generic (use 0) and the other is compare (doloop use, use 1).  
The best optimal set is to assign cand 6 to use 0 and assign whatever 
but excepting for infinite cost cand to use 1.  The actual selection set
is to assign BIV to both uses.

The wording "to give more chances to consider other IV cands instead of
BIV" is for the use 0.  If we don't make anything special for use 1, cand
4 is always considered in the optimal set, then use 0 can't have any chances
to use cand 6 (extra iv cost).  Zeroing the cost to make selection not
consider use 1 at all.

You may have the question why not just zeroing those finite cost ones,
then in this case, only cand 1->5 can be selected for use 1 then BIV still
wins since it have best iv cost, it's selected for use 0 instead of cand 6.
If we consider infinite cost cand, bring cand 6 in and get optimal set cand 6.
btw, under TARGET_HAVE_COUNT_REG_DECR_P, the doloop use will be eliminated.
Zeroing it also matches this.

The original cost mapping without my patch:

<Group-candidate Costs>:
Group 0:
  cand>-cost>---compl.>-inv.expr.>------inv.vars
  0>----8>------0>------1;>-----NIL;
  1>----8>------0>------1;>-----NIL;
  2>----4>------0>------NIL;>---NIL;
  3>----8>------0>------1;>-----NIL;
  4>----4>------0>------NIL;>---NIL;
  5>----8>------0>------NIL;>---NIL;
  6>----0>------0>------NIL;>---NIL;

Group 1:
  cand>-cost>---compl.>-inv.expr.>------inv.vars
  0>----8>------0>------NIL;>---1
  1>----8>------0>------NIL;>---1
  2>----0>------0>------NIL;>---NIL;
  3>----4>------0>------NIL;>---1
  4>----0>------0>------NIL;>---NIL;
  5>----4>------0>------NIL;>---NIL;

I'm not sure it can be taken as a bug or not, better to say doloop use
only?  For doloop use, we can teach the algorithm not to take iv cost
into account if that iv cand is only for doloop use.  Since the doloop
use will be eliminated, it's fine.  But for the generic one, we need to
have the iv cost there.

I expect it doesn't need to teach many places. I'll give it a try.  :)

> 
> However, I think the patch can be changed that only finite cost should
> be adjusted to zero.  Thus guarantee any cand selected is valid to
> rewrite iv_use.
> 

Agreed if the above is taught.

Thanks again!


Kewen
Segher Boessenkool July 22, 2019, 6:26 a.m. UTC | #11
Hi!

(Maybe I am missing half of the discussion -- sorry if so).

I think we should have a new iv for just the doloop (which can have the
same starting value and step and type as another iv).

Has this been considered?


Segher
Kewen.Lin July 22, 2019, 6:53 a.m. UTC | #12
Hi Segher,

on 2019/7/22 下午2:26, Segher Boessenkool wrote:
> Hi!
> 
> (Maybe I am missing half of the discussion -- sorry if so).
> 
> I think we should have a new iv for just the doloop (which can have the
> same starting value and step and type as another iv).
> 
> Has this been considered?
> 
> 

I don't have any patches to introduce it.  I guess you mean one pre-bind
candidate is dedicated to doloop use only?  Version 2 introduced pre-bind,
but I dropped it as it's invasive to the current selection algorithm.

The current implementation is to zeroing cost for doloop use with any 
candidates and let selection algorithm pick up whatever for it.  I think
it's fine since doloop_optimize can transform anythings to expected only
if it knows the iteration count.

Thanks,
Kewen
Richard Biener July 22, 2019, 7:18 a.m. UTC | #13
On Mon, 22 Jul 2019, Segher Boessenkool wrote:

> Hi!
> 
> (Maybe I am missing half of the discussion -- sorry if so).
> 
> I think we should have a new iv for just the doloop (which can have the
> same starting value and step and type as another iv).
> 
> Has this been considered?

I was also suggesting this (maybe with too many words ;)).  If
it's a doloop target add such IV (candidate!) which has zero
use-cost for the increment and compare but a (target configurable)
penalty for other uses.  Invasiveness of this approach is probably
that you need to distinguish this candidate by making it a new
kind (or maybe we can just have a specia candidate number...).

Richard.
Segher Boessenkool July 22, 2019, 9:43 p.m. UTC | #14
On Mon, Jul 22, 2019 at 09:18:10AM +0200, Richard Biener wrote:
> On Mon, 22 Jul 2019, Segher Boessenkool wrote:
> 
> > Hi!
> > 
> > (Maybe I am missing half of the discussion -- sorry if so).
> > 
> > I think we should have a new iv for just the doloop (which can have the
> > same starting value and step and type as another iv).
> > 
> > Has this been considered?
> 
> I was also suggesting this (maybe with too many words ;)).  If
> it's a doloop target add such IV (candidate!) which has zero
> use-cost for the increment and compare but a (target configurable)
> penalty for other uses.  Invasiveness of this approach is probably
> that you need to distinguish this candidate by making it a new
> kind (or maybe we can just have a specia candidate number...).

Or just set some (boolean) flag in the candidate.

I think it should simply not be allowed for any use except the doloop
uses at all?  You can have multiple ivs for the same loop just fine,
right?  And costs will make everything work out, if the costs are set
correctly?


Segher
Kewen.Lin July 23, 2019, 5:52 a.m. UTC | #15
on 2019/7/22 下午3:18, Richard Biener wrote:
> On Mon, 22 Jul 2019, Segher Boessenkool wrote:
> 
>> Hi!
>>
>> (Maybe I am missing half of the discussion -- sorry if so).
>>
>> I think we should have a new iv for just the doloop (which can have the
>> same starting value and step and type as another iv).
>>
>> Has this been considered?
> 
> I was also suggesting this (maybe with too many words ;)).  If
> it's a doloop target add such IV (candidate!) which has zero
> use-cost for the increment and compare but a (target configurable)
> penalty for other uses.  Invasiveness of this approach is probably
> that you need to distinguish this candidate by making it a new
> kind (or maybe we can just have a specia candidate number...).
> 

Hi Richard,

Really appreciate your comments on this, very sorry not to go with this.
Since this patch is for TARGET_HAVE_COUNT_REG_DECR_P, I was thinking
it's fairly enough to reuse the existing IV cands and just zeroing doloop
use cost with them.  I'm very happy to unify it.  If you/Segher/Bin don't
have any concerns, I'd like to make it as one follow up item.

One thing to double check is this dedicated IV will follow decrement
instead of increment align with doloop optimize?  Then it looks to shape
the loop closing to doloop pattern, at least it's decrement.


Thanks,
Kewen
Kewen.Lin July 23, 2019, 6:09 a.m. UTC | #16
Hi Segher,

on 2019/7/23 上午5:43, Segher Boessenkool wrote:
> On Mon, Jul 22, 2019 at 09:18:10AM +0200, Richard Biener wrote:
>> On Mon, 22 Jul 2019, Segher Boessenkool wrote:
>>
>>> Hi!
>>>
>>> (Maybe I am missing half of the discussion -- sorry if so).
>>>
>>> I think we should have a new iv for just the doloop (which can have the
>>> same starting value and step and type as another iv).
>>>
>>> Has this been considered?
>>
>> I was also suggesting this (maybe with too many words ;)).  If
>> it's a doloop target add such IV (candidate!) which has zero
>> use-cost for the increment and compare but a (target configurable)
>> penalty for other uses.  Invasiveness of this approach is probably
>> that you need to distinguish this candidate by making it a new
>> kind (or maybe we can just have a specia candidate number...).
> 
> Or just set some (boolean) flag in the candidate.
> 
> I think it should simply not be allowed for any use except the doloop
> uses at all?  

For the targets where the iteration count doesn't sit in its hardware count
register, we may need to allow the IV to be used for other suitable uses?

> You can have multiple ivs for the same loop just fine,
> right?  

Yes.

> And costs will make everything work out, if the costs are set
> correctly?

There are some cases requiring to do IV elimination, it might require some
cost adjustment/tuning to keep this.  I met this when I did pre-bind the
BIV for it, if the dedicated IV has the best cost and is associated to
doloop use, it probably stops the others to merge.

If my understanding is correct, this is more like to transform the loop
into doloop pattern earlier, the penalty of mis-predication of doloop can
be more? Pros is the setup code sequence for iteration count happens in
middle-end, can be optimized better (RTL misses some range info).
Richard Biener July 23, 2019, 7:31 a.m. UTC | #17
On Mon, 22 Jul 2019, Segher Boessenkool wrote:

> On Mon, Jul 22, 2019 at 09:18:10AM +0200, Richard Biener wrote:
> > On Mon, 22 Jul 2019, Segher Boessenkool wrote:
> > 
> > > Hi!
> > > 
> > > (Maybe I am missing half of the discussion -- sorry if so).
> > > 
> > > I think we should have a new iv for just the doloop (which can have the
> > > same starting value and step and type as another iv).
> > > 
> > > Has this been considered?
> > 
> > I was also suggesting this (maybe with too many words ;)).  If
> > it's a doloop target add such IV (candidate!) which has zero
> > use-cost for the increment and compare but a (target configurable)
> > penalty for other uses.  Invasiveness of this approach is probably
> > that you need to distinguish this candidate by making it a new
> > kind (or maybe we can just have a specia candidate number...).
> 
> Or just set some (boolean) flag in the candidate.
> 
> I think it should simply not be allowed for any use except the doloop
> uses at all?  You can have multiple ivs for the same loop just fine,
> right?  And costs will make everything work out, if the costs are set
> correctly?

Sure.  Upthread it was mentioned some targets can easily use the
counter IV in other IV uses so it's really a matter of costs.  That is,
IVOPTs generated "fake" RTL should, for doloop IVs, choose an
appropriate register so the target can do costing.

Richard.
Richard Biener July 23, 2019, 7:38 a.m. UTC | #18
On Tue, 23 Jul 2019, Kewen.Lin wrote:

> on 2019/7/22 下午3:18, Richard Biener wrote:
> > On Mon, 22 Jul 2019, Segher Boessenkool wrote:
> > 
> >> Hi!
> >>
> >> (Maybe I am missing half of the discussion -- sorry if so).
> >>
> >> I think we should have a new iv for just the doloop (which can have the
> >> same starting value and step and type as another iv).
> >>
> >> Has this been considered?
> > 
> > I was also suggesting this (maybe with too many words ;)).  If
> > it's a doloop target add such IV (candidate!) which has zero
> > use-cost for the increment and compare but a (target configurable)
> > penalty for other uses.  Invasiveness of this approach is probably
> > that you need to distinguish this candidate by making it a new
> > kind (or maybe we can just have a specia candidate number...).
> > 
> 
> Hi Richard,
> 
> Really appreciate your comments on this, very sorry not to go with this.
> Since this patch is for TARGET_HAVE_COUNT_REG_DECR_P, I was thinking
> it's fairly enough to reuse the existing IV cands and just zeroing doloop
> use cost with them.  I'm very happy to unify it.  If you/Segher/Bin don't
> have any concerns, I'd like to make it as one follow up item.
> 
> One thing to double check is this dedicated IV will follow decrement
> instead of increment align with doloop optimize?  Then it looks to shape
> the loop closing to doloop pattern, at least it's decrement.

I think doloop support should be as "simple" as always adding a
candidate starting from niter (-1?), step -1 marked as DOLOOP_IV
(which is then used in costing, making uses in the IV update and
the compare zero cost and uses in other places according to the
target by using an appropriate hardreg for the fake RTL we create).

IV costing and elimination should then choose the doloop IV if that's
profitable.

Richard.
diff mbox series

Patch

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 6667cd0..12f1dfd 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1912,6 +1912,9 @@  static const struct attribute_spec rs6000_attribute_table[] =
 #undef TARGET_PREDICT_DOLOOP_P
 #define TARGET_PREDICT_DOLOOP_P rs6000_predict_doloop_p
 
+#undef TARGET_HAVE_COUNT_REG_DECR_P
+#define TARGET_HAVE_COUNT_REG_DECR_P rs6000_have_count_reg_decr_p
+
 #undef TARGET_ATOMIC_ASSIGN_EXPAND_FENV
 #define TARGET_ATOMIC_ASSIGN_EXPAND_FENV rs6000_atomic_assign_expand_fenv
 
@@ -39437,6 +39440,14 @@  rs6000_predict_doloop_p (struct loop *loop)
   return true;
 }
 
+/* Return true if count register for branch is supported.  */
+
+static bool
+rs6000_have_count_reg_decr_p ()
+{
+  return flag_branch_on_count_reg;
+}
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-rs6000.h"
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index c2aa4d0..46e488f 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11618,6 +11618,12 @@  loops, and will help ivopts to make some decisions.
 The default version of this hook returns false.
 @end deftypefn
 
+@deftypefn {Target Hook} bool TARGET_HAVE_COUNT_REG_DECR_P (void)
+Return true if the target supports hardware count register for decrement
+and branch.
+The default version of this hook returns false.
+@end deftypefn
+
 @deftypefn {Target Hook} bool TARGET_CAN_USE_DOLOOP_P (const widest_int @var{&iterations}, const widest_int @var{&iterations_max}, unsigned int @var{loop_depth}, bool @var{entered_at_top})
 Return true if it is possible to use low-overhead loops (@code{doloop_end}
 and @code{doloop_begin}) for a particular loop.  @var{iterations} gives the
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index b4d57b8..5f43b27 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -7946,6 +7946,8 @@  to by @var{ce_info}.
 
 @hook TARGET_PREDICT_DOLOOP_P
 
+@hook TARGET_HAVE_COUNT_REG_DECR_P
+
 @hook TARGET_CAN_USE_DOLOOP_P
 
 @hook TARGET_INVALID_WITHIN_DOLOOP
diff --git a/gcc/target.def b/gcc/target.def
index 71b6972..ec15a6d 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -4247,6 +4247,14 @@  The default version of this hook returns false.",
  default_predict_doloop_p)
 
 DEFHOOK
+(have_count_reg_decr_p,
+ "Return true if the target supports hardware count register for decrement\n\
+and branch.\n\
+The default version of this hook returns false.",
+ bool, (void),
+ hook_bool_void_false)
+
+DEFHOOK
 (can_use_doloop_p,
  "Return true if it is possible to use low-overhead loops (@code{doloop_end}\n\
 and @code{doloop_begin}) for a particular loop.  @var{iterations} gives the\n\
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c
index 7d5859b..71d7f67 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c
@@ -17,6 +17,7 @@  f1 (char *p, uintptr_t i, uintptr_t n)
   while (i < n);
 }
 
-/* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts" } } */
-/* { dg-final { scan-tree-dump-times "PHI <p_" 1 "ivopts"} } */
-/* { dg-final { scan-tree-dump-times "p_\[0-9\]* <" 1 "ivopts" } } */
+/* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts" { target { ! powerpc*-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "PHI" 2 "ivopts" { target { powerpc*-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "PHI <p_" 1 "ivopts" { target { ! powerpc*-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "p_\[0-9\]* <" 1 "ivopts" { target { ! powerpc*-*-* } } } } */
diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 530ea4a..9a14ba8 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -399,6 +399,8 @@  struct iv_group
   struct cost_pair *cost_map;
   /* The selected candidate for the group.  */
   struct iv_cand *selected;
+  /* The bind candidate for this group, for doloop use group only.  */
+  struct iv_cand *bind_cand;
   /* Uses in the group.  */
   vec<struct iv_use *> vuses;
 };
@@ -612,6 +614,9 @@  struct ivopts_data
 
   /* Whether the loop body can only be exited via single exit.  */
   bool loop_single_exit_p;
+
+  /* Whether the loop has doloop comparison use.  */
+  bool doloop_use_p;
 };
 
 /* An assignment of iv candidates to uses.  */
@@ -1528,6 +1533,7 @@  record_group (struct ivopts_data *data, enum use_type type)
   group->type = type;
   group->related_cands = BITMAP_ALLOC (NULL);
   group->vuses.create (1);
+  group->bind_cand = NULL;
 
   data->vgroups.safe_push (group);
   return group;
@@ -3724,7 +3730,7 @@  prepare_decl_rtl (tree *expr_p, int *ws, void *data)
    Some RTL specific checks seems unable to be checked in gimple, if any new
    checks or easy checks _are_ missing here, please add them.  */
 
-static bool ATTRIBUTE_UNUSED
+static bool
 generic_predict_doloop_p (struct ivopts_data *data)
 {
   struct loop *loop = data->current_loop;
@@ -5291,6 +5297,21 @@  determine_group_iv_cost_cond (struct ivopts_data *data,
   return !cost.infinite_cost_p ();
 }
 
+/* Set no cost for pair between doloop iv use GROUP and iv cand CAND.  */
+
+static bool
+adjust_group_iv_cost_for_doloop (struct ivopts_data *data,
+				 struct iv_group *group, struct iv_cand *cand)
+{
+  struct cost_pair *cp = get_group_iv_cost (data, group, cand);
+  if (!cp)
+    set_group_iv_cost (data, group, cand, no_cost, NULL, NULL_TREE, ERROR_MARK,
+		       NULL);
+  else
+    cp->cost = no_cost;
+  return true;
+}
+
 /* Determines cost of computing uses in GROUP with CAND.  Returns false
    if USE cannot be represented with CAND.  */
 
@@ -5308,7 +5329,12 @@  determine_group_iv_cost (struct ivopts_data *data,
       return determine_group_iv_cost_address (data, group, cand);
 
     case USE_COMPARE:
-      return determine_group_iv_cost_cond (data, group, cand);
+      {
+	bool finite_cost_p = determine_group_iv_cost_cond (data, group, cand);
+	if (data->doloop_use_p && group->bind_cand)
+	  finite_cost_p = adjust_group_iv_cost_for_doloop (data, group, cand);
+	return finite_cost_p;
+      }
 
     default:
       gcc_unreachable ();
@@ -6723,6 +6749,29 @@  find_optimal_iv_set_1 (struct ivopts_data *data, bool originalp)
   return set;
 }
 
+/* For doloop use, if the algothrim selects some candidate which invalid for
+   later rewrite, fix it up with bind_cand.  */
+
+static void
+fixup_doloop_groups (struct ivopts_data *data, struct iv_ca *set)
+{
+  for (unsigned i = 0; i < data->vgroups.length (); i++)
+    {
+      struct iv_group *group = data->vgroups[i];
+      if (group->bind_cand)
+	{
+	  struct cost_pair *cp = iv_ca_cand_for_group (set, group);
+	  gcc_assert (cp);
+	  if (cp->cand != group->bind_cand && cp->value == NULL_TREE)
+	    {
+	      struct cost_pair *bind_cp
+		= get_group_iv_cost (data, group, group->bind_cand);
+	      iv_ca_set_cp (data, set, group, bind_cp);
+	    }
+	}
+    }
+}
+
 static struct iv_ca *
 find_optimal_iv_set (struct ivopts_data *data)
 {
@@ -6760,6 +6809,9 @@  find_optimal_iv_set (struct ivopts_data *data)
   else if (origset)
     iv_ca_free (&origset);
 
+  if (data->doloop_use_p)
+    fixup_doloop_groups (data, set);
+
   for (i = 0; i < data->vgroups.length (); i++)
     {
       struct iv_group *group = data->vgroups[i];
@@ -7568,6 +7620,72 @@  determine_scaling_factor (struct ivopts_data *data, basic_block *body)
     }
 }
 
+/* Find doloop comparison use and set its related bind_cand.  We adjust the
+   doloop use group cost against various IV cands, it's possible to assign
+   some cost like zero rather than original inifite cost.  The point is to
+   give more chances to consider other IV cands instead of BIV.  The cost
+   originally given on doloop use can affect optimal decision because it can
+   become dead and get eliminated but considered too much here.
+
+   So it's possible that doloop use is assigned one invalid IV cand to rewrite.
+   In this case, we need bind_cand to fix up.  Even if the bind_cand doesn't
+   exist in final iv_ca set, it won't affect optimal decision since it gets
+   eliminated along with doloop use.  */
+
+static bool
+find_doloop_use_and_its_bind (struct ivopts_data *data)
+{
+  struct loop *loop = data->current_loop;
+
+  for (unsigned i = 0; i < data->vgroups.length (); i++)
+    {
+      struct iv_group *group = data->vgroups[i];
+      if (group->type == USE_COMPARE)
+	{
+	  gcc_assert (group->vuses.length () == 1);
+	  struct iv_use *use = group->vuses[0];
+	  gimple *stmt = use->stmt;
+	  if (gimple_code (stmt) == GIMPLE_COND)
+	    {
+	      basic_block bb = gimple_bb (stmt);
+	      edge true_edge, false_edge;
+	      extract_true_false_edges_from_block (bb, &true_edge, &false_edge);
+	      /* This comparison is used for loop latch.  Require latch is empty
+		 for now.  */
+	      if ((loop->latch == true_edge->dest
+		   || loop->latch == false_edge->dest)
+		  && empty_block_p (loop->latch))
+		{
+		  for (unsigned j = 0; j < data->vcands.length (); j++)
+		    {
+		      if (bitmap_bit_p (group->related_cands, j))
+			{
+			  struct iv_cand *cand = data->vcands[j];
+			  tree op = use->iv->ssa_name;
+			  if (op == cand->var_before || op == cand->var_after)
+			    {
+			      group->bind_cand = cand;
+			      if (dump_file && (dump_flags & TDF_DETAILS))
+				{
+				  fprintf (dump_file, "Doloop cmp iv use: ");
+				  print_gimple_stmt (dump_file, stmt,
+						     TDF_DETAILS);
+				  dump_cand (dump_file, cand);
+				}
+			      break;
+			    }
+			}
+		    }
+		  if (group->bind_cand)
+		    return true;
+		}
+	    }
+	}
+    }
+
+  return false;
+}
+
 /* Optimizes the LOOP.  Returns true if anything changed.  */
 
 static bool
@@ -7580,6 +7698,7 @@  tree_ssa_iv_optimize_loop (struct ivopts_data *data, struct loop *loop,
   basic_block *body;
 
   gcc_assert (!data->niters);
+  data->doloop_use_p = false;
   data->current_loop = loop;
   data->loop_loc = find_loop_location (loop).get_location_t ();
   data->speed = optimize_loop_for_speed_p (loop);
@@ -7625,6 +7744,18 @@  tree_ssa_iv_optimize_loop (struct ivopts_data *data, struct loop *loop,
   /* Finds candidates for the induction variables (item 2).  */
   find_iv_candidates (data);
 
+  if (targetm.have_count_reg_decr_p () && generic_predict_doloop_p (data))
+    {
+      data->doloop_use_p = find_doloop_use_and_its_bind (data);
+      if (data->doloop_use_p && dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  fprintf (dump_file,
+		   "Predict loop %d can perform doloop optimization later.\n",
+		   loop->num);
+	  flow_loop_dump (loop, dump_file, NULL, 1);
+	}
+    }
+
   /* Calculates the costs (item 3, part 1).  */
   determine_iv_costs (data);
   determine_group_iv_costs (data);