[v6,3/3] PR80791 Consider doloop cmp use in ivopts
diff mbox series

Message ID 6d0b4b11-1c33-fbd5-604d-293c01b1c285@linux.ibm.com
State New
Headers show
Series
  • Untitled series #124997
Related show

Commit Message

Kewen.Lin Aug. 14, 2019, 7:23 a.m. UTC
Hi!

Comparing to the previous versions of implementation mainly based on the 
existing IV cands but zeroing the related group/use cost, this new one is based
on Richard and Segher's suggestion introducing one doloop dedicated IV cand.  

Some key points are listed below:
  1) New field doloop_p in struct iv_cand to indicate doloop dedicated IV cand.
  2) Special name "doloop" assigned.
  3) Doloop IV cand with form (niter+1, +, -1)
  4) For doloop IV cand, no extra one cost like BIV, assign zero cost for step.
  5) Support may_be_zero (regressed PR is in this case), the base of doloop IV
     can be COND_EXPR, add handlings in cand_value_at and may_eliminate_iv.
  6) Add more expr support in force_expr_to_var_cost for reasonable cost
     calculation on the IV base with may_be_zero (like COND_EXPR).
  7) Set zero cost when using doloop IV cand for doloop use.
  8) Add three hooks (should we merge _generic and _address?).
    *) have_count_reg_decr_p, is to indicate the target has special hardware
       count register, we shouldn't consider the impact of doloop IV when
       calculating register pressures.
    *) doloop_cost_for_generic, is the extra cost when using doloop IV cand for
       generic type IV use.
    *) doloop_cost_for_address, is the extra cost when using doloop IV cand for
       address type IV use.

Bootstrapped on powerpc64le-linux-gnu and regression testing passed excepting
for one failure on gcc/testsuite/gcc.dg/guality/loop-1.c at -O3 which is tracked
by PR89983.

Any comments and suggestions are highly appreciated.  Thanks!

Kewen

---------

gcc/ChangeLog

2019-08-14  Kewen Lin  <linkw@gcc.gnu.org>

	PR middle-end/80791
	* config/rs6000/rs6000.c (TARGET_HAVE_COUNT_REG_DECR_P): New macro.
	(TARGET_DOLOOP_COST_FOR_GENERIC): Likewise.
	(TARGET_DOLOOP_COST_FOR_ADDRESS): Likewise.
	* target.def (have_count_reg_decr_p): New hook.
	(doloop_cost_for_generic): Likewise.
	(doloop_cost_for_address): Likewise.
	* doc/tm.texi.in (TARGET_HAVE_COUNT_REG_DECR_P): Likewise.
	(TARGET_DOLOOP_COST_FOR_GENERIC): Likewise.
	(TARGET_DOLOOP_COST_FOR_ADDRESS): Likewise.
	* doc/tm.texi: Regenerate.
	* tree-ssa-loop-ivopts.c (comp_cost::operator+=): Consider infinite cost
	addend.
	(record_group): Init doloop_p.
	(add_candidate_1): Add optional argument doloop, change the handlings
	accordingly.
	(add_candidate): Likewise.
	(add_iv_candidate_for_biv): Update the call to add_candidate.
	(generic_predict_doloop_p): Update attribute.
	(force_expr_to_var_cost): Add costing for expressions COND_EXPR/LT_EXPR/
	LE_EXPR/GT_EXPR/GE_EXPR/EQ_EXPR/NE_EXPR/UNORDERED_EXPR/ORDERED_EXPR/
	UNLT_EXPR/UNLE_EXPR/UNGT_EXPR/UNGE_EXPR/UNEQ_EXPR/LTGT_EXPR/MAX_EXPR/
	MIN_EXPR.
	(determine_group_iv_cost_generic): Update for doloop IV cand.
	(determine_group_iv_cost_address): Likewise.
	(determine_group_iv_cost_cond): Likewise.
	(determine_iv_cost): Likewise.
	(ivopts_estimate_reg_pressure): Likewise.
	(cand_value_at): Update argument niter type to struct tree_niter_desc*,
	consider doloop IV cand and may_be_zero.
	(may_eliminate_iv): Update the call to cand_value_at, consider doloop
	IV cand and may_be_zero.
	(add_iv_candidate_for_doloop): New function.
	(find_iv_candidates): Call function add_iv_candidate_for_doloop.
	(determine_set_costs): Update the call to ivopts_estimate_reg_pressure.
	(iv_ca_recount_cost): Likewise.
	(iv_ca_new): Init n_doloop_cands.
	(iv_ca_set_no_cp): Update n_doloop_cands.
	(iv_ca_set_cp): Likewise.
	(iv_ca_dump): Dump register cost.
	(find_doloop_use): Likewise.
	(tree_ssa_iv_optimize_loop): Call function generic_predict_doloop_p and
	find_doloop_use.

gcc/testsuite/ChangeLog

2019-08-14  Kewen Lin  <linkw@gcc.gnu.org>

	PR middle-end/80791
	* gcc.dg/tree-ssa/ivopts-3.c: Adjust for doloop change.
	* gcc.dg/tree-ssa/ivopts-lt.c: Likewise.
	* gcc.dg/tree-ssa/pr32044.c: Likewise.

Comments

Bin.Cheng Aug. 21, 2019, 12:32 p.m. UTC | #1
On Wed, Aug 14, 2019 at 3:23 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi!
>
> Comparing to the previous versions of implementation mainly based on the
> existing IV cands but zeroing the related group/use cost, this new one is based
> on Richard and Segher's suggestion introducing one doloop dedicated IV cand.
>
> Some key points are listed below:
>   1) New field doloop_p in struct iv_cand to indicate doloop dedicated IV cand.
>   2) Special name "doloop" assigned.
>   3) Doloop IV cand with form (niter+1, +, -1)
>   4) For doloop IV cand, no extra one cost like BIV, assign zero cost for step.
>   5) Support may_be_zero (regressed PR is in this case), the base of doloop IV
>      can be COND_EXPR, add handlings in cand_value_at and may_eliminate_iv.
>   6) Add more expr support in force_expr_to_var_cost for reasonable cost
>      calculation on the IV base with may_be_zero (like COND_EXPR).
>   7) Set zero cost when using doloop IV cand for doloop use.
>   8) Add three hooks (should we merge _generic and _address?).
>     *) have_count_reg_decr_p, is to indicate the target has special hardware
>        count register, we shouldn't consider the impact of doloop IV when
>        calculating register pressures.
>     *) doloop_cost_for_generic, is the extra cost when using doloop IV cand for
>        generic type IV use.
>     *) doloop_cost_for_address, is the extra cost when using doloop IV cand for
>        address type IV use.
What will happen if doloop IV cand be used for generic/address type iv
use?  Can RTL doloop can still perform doloop optimization in this
case?

>
> Bootstrapped on powerpc64le-linux-gnu and regression testing passed excepting
> for one failure on gcc/testsuite/gcc.dg/guality/loop-1.c at -O3 which is tracked
> by PR89983.
>
> Any comments and suggestions are highly appreciated.  Thanks!
Not sure if I understand the patch correctly, some comments embedded.

+  /* The number of doloop candidate in the set.  */
+  unsigned n_doloop_cands;
+
This is unnecessary.  See below comments.

-    add_candidate_1 (data, base, step, important,
-                    IP_NORMAL, use, NULL, orig_iv);
+    add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL, doloop,
+                    orig_iv);
   if (ip_end_pos (data->current_loop)
       && allow_ip_end_pos_p (data->current_loop))
-    add_candidate_1 (data, base, step, important, IP_END, use, NULL, orig_iv);
+    add_candidate_1 (data, base, step, important, IP_END, use, NULL, doloop,
+                    orig_iv);
Do we need to skip ip_end_pos case for doloop candidate?  Because the
candidate increment will be inserted in latch, i.e, increment position
is after exit condition.

-  tree_to_aff_combination (iv->base, type, val);
+  tree base = iv->base;
+  /* See add_iv_candidate_for_doloop, if may_be_zero is set, we want to extract
+     the value under !may_be_zero to get the compact bound which also well fits
+     for may_be_zero since we ensure the value for it is const one.  */
+  if (cand->doloop_p && desc->may_be_zero && !integer_zerop
(desc->may_be_zero))
+    base = fold_build2 (PLUS_EXPR, TREE_TYPE (niter),
+                       unshare_expr (rewrite_to_non_trapping_overflow (niter)),
+                       build_int_cst (TREE_TYPE (niter), 1));
+  tree_to_aff_combination (base, type, val);
I don't quite follow here.  The iv->base is computed from niter, I
suppose compact bound is for cheaper candidate initialization?  Why
it's possible to extract !may_be_zero niter for may_be_zero here?  The
niter under !may_be_zero has no indication about the real niter under
may_be_zero.

-  cand_value_at (loop, cand, use->stmt, desc->niter, &bnd);
+  cand_value_at (loop, cand, use->stmt, desc, &bnd);
If I understand correctly, doloop use/cand will only be
identified/added for single exit loop, and there will be only one
cond(doloop) iv_use and only one doloop cand for doloop loop.  So the
cand_value at niter at use position would be 0.  If that's the case,
we can skip calling cand_value_at here for doloop cand.  The change to
cand_value_at would be unnecessary neither.

-          expensive.  */
-  if (!integer_zerop (desc->may_be_zero))
+          expensive.
+
+     For doloop candidate, we have considered MAY_BE_ZERO for IV base, need to
+     support MAY_BE_ZERO ? 0 : NITER, so simply bypass this check.  */
+  if (!integer_zerop (desc->may_be_zero) && !cand->doloop_p)
     return iv_elimination_compare_lt (data, cand, comp, desc);
And we can early return before this?

+  if (may_be_zero)
+    {
+      if (COMPARISON_CLASS_P (may_be_zero))
+       {
+         niter = fold_build3 (COND_EXPR, ntype, may_be_zero,
+                              build_int_cst (ntype, 0),
+                              rewrite_to_non_trapping_overflow (niter));
+       }
+      /* Don't try to obtain the iteration count expression when may_be_zero is
+        integer_nonzerop (actually iteration count is one) or else.  */
+      else
+       return;
+    }
+
+  tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
+                          build_int_cst (ntype, 1));
niter is the number of latch executions, so niter + 1 could wrap here,
but guess it's not a problem the similar issue is not handled in
vectorizer neither.

+  unsigned n_old = data->regs_used, n_spr_for_doloop = 0;
+  /* If target supports count register for doloop, it doesn't take GPR.  */
+  if (targetm.have_count_reg_decr_p)
+    n_spr_for_doloop = n_doloop_cands;
+  unsigned n_new = n_invs + n_cands - n_spr_for_doloop;
Not necessary.  See below.

-  cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands);
+  cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands,
+                                       ivs->n_doloop_cands);
Also.

       ivs->n_cands--;
+      if (cp->cand->doloop_p)
+       ivs->n_doloop_cands--;

          ivs->n_cands++;
+         if (cp->cand->doloop_p)
+           ivs->n_doloop_cands++;
You can just book n_cands under condition !cp->cand->doloop_p.

+  if (flag_branch_on_count_reg && generic_predict_doloop_p (data))
+    {
+      if (find_doloop_use (data))
+       {
+         data->doloop_use_p = true;
+         if (dump_file && (dump_flags & TDF_DETAILS))
+           {
+             fprintf (dump_file,
+                      "Predict loop %d can perform"
+                      " doloop optimization later.\n",
+                      loop->num);
+             flow_loop_dump (loop, dump_file, NULL, 1);
+           }
+       }
+    }
+
Please factor this into a function to keep caller short.

Thanks,
bin

>
> Kewen
>
> ---------
>
> gcc/ChangeLog
>
> 2019-08-14  Kewen Lin  <linkw@gcc.gnu.org>
>
>         PR middle-end/80791
>         * config/rs6000/rs6000.c (TARGET_HAVE_COUNT_REG_DECR_P): New macro.
>         (TARGET_DOLOOP_COST_FOR_GENERIC): Likewise.
>         (TARGET_DOLOOP_COST_FOR_ADDRESS): Likewise.
>         * target.def (have_count_reg_decr_p): New hook.
>         (doloop_cost_for_generic): Likewise.
>         (doloop_cost_for_address): Likewise.
>         * doc/tm.texi.in (TARGET_HAVE_COUNT_REG_DECR_P): Likewise.
>         (TARGET_DOLOOP_COST_FOR_GENERIC): Likewise.
>         (TARGET_DOLOOP_COST_FOR_ADDRESS): Likewise.
>         * doc/tm.texi: Regenerate.
>         * tree-ssa-loop-ivopts.c (comp_cost::operator+=): Consider infinite cost
>         addend.
>         (record_group): Init doloop_p.
>         (add_candidate_1): Add optional argument doloop, change the handlings
>         accordingly.
>         (add_candidate): Likewise.
>         (add_iv_candidate_for_biv): Update the call to add_candidate.
>         (generic_predict_doloop_p): Update attribute.
>         (force_expr_to_var_cost): Add costing for expressions COND_EXPR/LT_EXPR/
>         LE_EXPR/GT_EXPR/GE_EXPR/EQ_EXPR/NE_EXPR/UNORDERED_EXPR/ORDERED_EXPR/
>         UNLT_EXPR/UNLE_EXPR/UNGT_EXPR/UNGE_EXPR/UNEQ_EXPR/LTGT_EXPR/MAX_EXPR/
>         MIN_EXPR.
>         (determine_group_iv_cost_generic): Update for doloop IV cand.
>         (determine_group_iv_cost_address): Likewise.
>         (determine_group_iv_cost_cond): Likewise.
>         (determine_iv_cost): Likewise.
>         (ivopts_estimate_reg_pressure): Likewise.
>         (cand_value_at): Update argument niter type to struct tree_niter_desc*,
>         consider doloop IV cand and may_be_zero.
>         (may_eliminate_iv): Update the call to cand_value_at, consider doloop
>         IV cand and may_be_zero.
>         (add_iv_candidate_for_doloop): New function.
>         (find_iv_candidates): Call function add_iv_candidate_for_doloop.
>         (determine_set_costs): Update the call to ivopts_estimate_reg_pressure.
>         (iv_ca_recount_cost): Likewise.
>         (iv_ca_new): Init n_doloop_cands.
>         (iv_ca_set_no_cp): Update n_doloop_cands.
>         (iv_ca_set_cp): Likewise.
>         (iv_ca_dump): Dump register cost.
>         (find_doloop_use): Likewise.
>         (tree_ssa_iv_optimize_loop): Call function generic_predict_doloop_p and
>         find_doloop_use.
>
> gcc/testsuite/ChangeLog
>
> 2019-08-14  Kewen Lin  <linkw@gcc.gnu.org>
>
>         PR middle-end/80791
>         * gcc.dg/tree-ssa/ivopts-3.c: Adjust for doloop change.
>         * gcc.dg/tree-ssa/ivopts-lt.c: Likewise.
>         * gcc.dg/tree-ssa/pr32044.c: Likewise.
>
Kewen.Lin Aug. 22, 2019, 3:18 a.m. UTC | #2
Hi Bin,

Thanks for your time!

on 2019/8/21 下午8:32, Bin.Cheng wrote:
> On Wed, Aug 14, 2019 at 3:23 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi!
>>
>> Comparing to the previous versions of implementation mainly based on the
>> existing IV cands but zeroing the related group/use cost, this new one is based
>> on Richard and Segher's suggestion introducing one doloop dedicated IV cand.
>>
>> Some key points are listed below:
>>   1) New field doloop_p in struct iv_cand to indicate doloop dedicated IV cand.
>>   2) Special name "doloop" assigned.
>>   3) Doloop IV cand with form (niter+1, +, -1)
>>   4) For doloop IV cand, no extra one cost like BIV, assign zero cost for step.
>>   5) Support may_be_zero (regressed PR is in this case), the base of doloop IV
>>      can be COND_EXPR, add handlings in cand_value_at and may_eliminate_iv.
>>   6) Add more expr support in force_expr_to_var_cost for reasonable cost
>>      calculation on the IV base with may_be_zero (like COND_EXPR).
>>   7) Set zero cost when using doloop IV cand for doloop use.
>>   8) Add three hooks (should we merge _generic and _address?).
>>     *) have_count_reg_decr_p, is to indicate the target has special hardware
>>        count register, we shouldn't consider the impact of doloop IV when
>>        calculating register pressures.
>>     *) doloop_cost_for_generic, is the extra cost when using doloop IV cand for
>>        generic type IV use.
>>     *) doloop_cost_for_address, is the extra cost when using doloop IV cand for
>>        address type IV use.
> What will happen if doloop IV cand be used for generic/address type iv
> use?  Can RTL doloop can still perform doloop optimization in this
> case?
> 

On Power, we put the iteration count into hardware count register, it takes very
high cost to move the count to GPR, so the cost is set as INF to make it impossible
to use it for generic/address type iv use.  But as some discussion before, on some
targets using GPR instead of hardware count register, they probably want to use this
doloop iv used for other uses if profitable.  These two hooks offer the possibility.
In that case, I think RTL doloop can still perform since it can still get the 
pattern and transform.  The generic/address uses can still use it.
>>
>> Bootstrapped on powerpc64le-linux-gnu and regression testing passed excepting
>> for one failure on gcc/testsuite/gcc.dg/guality/loop-1.c at -O3 which is tracked
>> by PR89983.
>>
>> Any comments and suggestions are highly appreciated.  Thanks!
> Not sure if I understand the patch correctly, some comments embedded.
> 
> +  /* The number of doloop candidate in the set.  */
> +  unsigned n_doloop_cands;
> +
> This is unnecessary.  See below comments.
> 
> -    add_candidate_1 (data, base, step, important,
> -                    IP_NORMAL, use, NULL, orig_iv);
> +    add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL, doloop,
> +                    orig_iv);
>    if (ip_end_pos (data->current_loop)
>        && allow_ip_end_pos_p (data->current_loop))
> -    add_candidate_1 (data, base, step, important, IP_END, use, NULL, orig_iv);
> +    add_candidate_1 (data, base, step, important, IP_END, use, NULL, doloop,
> +                    orig_iv);
> Do we need to skip ip_end_pos case for doloop candidate?  Because the
> candidate increment will be inserted in latch, i.e, increment position
> is after exit condition.
> 

Yes, we should skip it.  Currently function find_doloop_use has the check on an
empty latch and gimple_cond to latch, partially excluding it.  But it's still good
to guard it directly here.

> -  tree_to_aff_combination (iv->base, type, val);
> +  tree base = iv->base;
> +  /* See add_iv_candidate_for_doloop, if may_be_zero is set, we want to extract
> +     the value under !may_be_zero to get the compact bound which also well fits
> +     for may_be_zero since we ensure the value for it is const one.  */
> +  if (cand->doloop_p && desc->may_be_zero && !integer_zerop
> (desc->may_be_zero))
> +    base = fold_build2 (PLUS_EXPR, TREE_TYPE (niter),
> +                       unshare_expr (rewrite_to_non_trapping_overflow (niter)),
> +                       build_int_cst (TREE_TYPE (niter), 1));
> +  tree_to_aff_combination (base, type, val);
> I don't quite follow here.  The iv->base is computed from niter, I
> suppose compact bound is for cheaper candidate initialization?  Why
> it's possible to extract !may_be_zero niter for may_be_zero here?  The
> niter under !may_be_zero has no indication about the real niter under
> may_be_zero.
> 

As you note below, the cand_value for doloop would be zero, but for the case
may_be_zero set, the current calculation would take care of the whole niter
expression including the cond_expr introduced by may_be_zero check, it's 
unexpected.  The purpose is to use the value under condition !may_be_zero
for the calculation, and yes, to get expected zero finally.

> -  cand_value_at (loop, cand, use->stmt, desc->niter, &bnd);
> +  cand_value_at (loop, cand, use->stmt, desc, &bnd);
> If I understand correctly, doloop use/cand will only be
> identified/added for single exit loop, and there will be only one
> cond(doloop) iv_use and only one doloop cand for doloop loop.  So the
> cand_value at niter at use position would be 0.  If that's the case,
> we can skip calling cand_value_at here for doloop cand.  The change to
> cand_value_at would be unnecessary neither.
> 

Exactly, I'll add the early return with zero bound for doloop.

> -          expensive.  */
> -  if (!integer_zerop (desc->may_be_zero))
> +          expensive.
> +
> +     For doloop candidate, we have considered MAY_BE_ZERO for IV base, need to
> +     support MAY_BE_ZERO ? 0 : NITER, so simply bypass this check.  */
> +  if (!integer_zerop (desc->may_be_zero) && !cand->doloop_p)
>      return iv_elimination_compare_lt (data, cand, comp, desc);
> And we can early return before this?
> 

OK.

> +  if (may_be_zero)
> +    {
> +      if (COMPARISON_CLASS_P (may_be_zero))
> +       {
> +         niter = fold_build3 (COND_EXPR, ntype, may_be_zero,
> +                              build_int_cst (ntype, 0),
> +                              rewrite_to_non_trapping_overflow (niter));
> +       }
> +      /* Don't try to obtain the iteration count expression when may_be_zero is
> +        integer_nonzerop (actually iteration count is one) or else.  */
> +      else
> +       return;
> +    }
> +
> +  tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
> +                          build_int_cst (ntype, 1));
> niter is the number of latch executions, so niter + 1 could wrap here,
> but guess it's not a problem the similar issue is not handled in
> vectorizer neither.
> 

OK.

> +  unsigned n_old = data->regs_used, n_spr_for_doloop = 0;
> +  /* If target supports count register for doloop, it doesn't take GPR.  */
> +  if (targetm.have_count_reg_decr_p)
> +    n_spr_for_doloop = n_doloop_cands;
> +  unsigned n_new = n_invs + n_cands - n_spr_for_doloop;
> Not necessary.  See below.

> -  cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands);
> +  cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands,
> +                                       ivs->n_doloop_cands);
> Also.
> 
>        ivs->n_cands--;
> +      if (cp->cand->doloop_p)
> +       ivs->n_doloop_cands--;
> 
>           ivs->n_cands++;
> +         if (cp->cand->doloop_p)
> +           ivs->n_doloop_cands++;
> You can just book n_cands under condition !cp->cand->doloop_p.

If my understanding is correct, you are suggesting the code like:

if (!cp->cand->doloop_p)
  ivs->n_cands++;

But I'm afraid that it can NOT satisfy the need in function
ivopts_estimate_reg_pressure.  As the comments, "if target supports
count register for doloop it doesn't take GPR.".  If we make doloop
cand invisible in n_cands, it's fine for target with count register,
but we may miss to count them on targets without count register.

> 
> +  if (flag_branch_on_count_reg && generic_predict_doloop_p (data))
> +    {
> +      if (find_doloop_use (data))
> +       {
> +         data->doloop_use_p = true;
> +         if (dump_file && (dump_flags & TDF_DETAILS))
> +           {
> +             fprintf (dump_file,
> +                      "Predict loop %d can perform"
> +                      " doloop optimization later.\n",
> +                      loop->num);
> +             flow_loop_dump (loop, dump_file, NULL, 1);
> +           }
> +       }
> +    }
> +
> Please factor this into a function to keep caller short.
> 

OK.


Thanks!
Kewen
Bin.Cheng Aug. 22, 2019, 5:46 a.m. UTC | #3
On Thu, Aug 22, 2019 at 11:18 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Bin,
>
> Thanks for your time!
>
> on 2019/8/21 下午8:32, Bin.Cheng wrote:
> > On Wed, Aug 14, 2019 at 3:23 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>
> >> Hi!
> >>
> >> Comparing to the previous versions of implementation mainly based on the
> >> existing IV cands but zeroing the related group/use cost, this new one is based
> >> on Richard and Segher's suggestion introducing one doloop dedicated IV cand.
> >>
> >> Some key points are listed below:
> >>   1) New field doloop_p in struct iv_cand to indicate doloop dedicated IV cand.
> >>   2) Special name "doloop" assigned.
> >>   3) Doloop IV cand with form (niter+1, +, -1)
> >>   4) For doloop IV cand, no extra one cost like BIV, assign zero cost for step.
> >>   5) Support may_be_zero (regressed PR is in this case), the base of doloop IV
> >>      can be COND_EXPR, add handlings in cand_value_at and may_eliminate_iv.
> >>   6) Add more expr support in force_expr_to_var_cost for reasonable cost
> >>      calculation on the IV base with may_be_zero (like COND_EXPR).
> >>   7) Set zero cost when using doloop IV cand for doloop use.
> >>   8) Add three hooks (should we merge _generic and _address?).
> >>     *) have_count_reg_decr_p, is to indicate the target has special hardware
> >>        count register, we shouldn't consider the impact of doloop IV when
> >>        calculating register pressures.
> >>     *) doloop_cost_for_generic, is the extra cost when using doloop IV cand for
> >>        generic type IV use.
> >>     *) doloop_cost_for_address, is the extra cost when using doloop IV cand for
> >>        address type IV use.
> > What will happen if doloop IV cand be used for generic/address type iv
> > use?  Can RTL doloop can still perform doloop optimization in this
> > case?
> >
>
> On Power, we put the iteration count into hardware count register, it takes very
> high cost to move the count to GPR, so the cost is set as INF to make it impossible
> to use it for generic/address type iv use.  But as some discussion before, on some
> targets using GPR instead of hardware count register, they probably want to use this
> doloop iv used for other uses if profitable.  These two hooks offer the possibility.
> In that case, I think RTL doloop can still perform since it can still get the
> pattern and transform.  The generic/address uses can still use it.
> >>
> >> Bootstrapped on powerpc64le-linux-gnu and regression testing passed excepting
> >> for one failure on gcc/testsuite/gcc.dg/guality/loop-1.c at -O3 which is tracked
> >> by PR89983.
> >>
> >> Any comments and suggestions are highly appreciated.  Thanks!
> > Not sure if I understand the patch correctly, some comments embedded.
> >
> > +  /* The number of doloop candidate in the set.  */
> > +  unsigned n_doloop_cands;
> > +
> > This is unnecessary.  See below comments.
> >
> > -    add_candidate_1 (data, base, step, important,
> > -                    IP_NORMAL, use, NULL, orig_iv);
> > +    add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL, doloop,
> > +                    orig_iv);
> >    if (ip_end_pos (data->current_loop)
> >        && allow_ip_end_pos_p (data->current_loop))
> > -    add_candidate_1 (data, base, step, important, IP_END, use, NULL, orig_iv);
> > +    add_candidate_1 (data, base, step, important, IP_END, use, NULL, doloop,
> > +                    orig_iv);
> > Do we need to skip ip_end_pos case for doloop candidate?  Because the
> > candidate increment will be inserted in latch, i.e, increment position
> > is after exit condition.
> >
>
> Yes, we should skip it.  Currently function find_doloop_use has the check on an
> empty latch and gimple_cond to latch, partially excluding it.  But it's still good
> to guard it directly here.
>
> > -  tree_to_aff_combination (iv->base, type, val);
> > +  tree base = iv->base;
> > +  /* See add_iv_candidate_for_doloop, if may_be_zero is set, we want to extract
> > +     the value under !may_be_zero to get the compact bound which also well fits
> > +     for may_be_zero since we ensure the value for it is const one.  */
> > +  if (cand->doloop_p && desc->may_be_zero && !integer_zerop
> > (desc->may_be_zero))
> > +    base = fold_build2 (PLUS_EXPR, TREE_TYPE (niter),
> > +                       unshare_expr (rewrite_to_non_trapping_overflow (niter)),
> > +                       build_int_cst (TREE_TYPE (niter), 1));
> > +  tree_to_aff_combination (base, type, val);
> > I don't quite follow here.  The iv->base is computed from niter, I
> > suppose compact bound is for cheaper candidate initialization?  Why
> > it's possible to extract !may_be_zero niter for may_be_zero here?  The
> > niter under !may_be_zero has no indication about the real niter under
> > may_be_zero.
> >
>
> As you note below, the cand_value for doloop would be zero, but for the case
> may_be_zero set, the current calculation would take care of the whole niter
> expression including the cond_expr introduced by may_be_zero check, it's
> unexpected.  The purpose is to use the value under condition !may_be_zero
> for the calculation, and yes, to get expected zero finally.
>
> > -  cand_value_at (loop, cand, use->stmt, desc->niter, &bnd);
> > +  cand_value_at (loop, cand, use->stmt, desc, &bnd);
> > If I understand correctly, doloop use/cand will only be
> > identified/added for single exit loop, and there will be only one
> > cond(doloop) iv_use and only one doloop cand for doloop loop.  So the
> > cand_value at niter at use position would be 0.  If that's the case,
> > we can skip calling cand_value_at here for doloop cand.  The change to
> > cand_value_at would be unnecessary neither.
> >
>
> Exactly, I'll add the early return with zero bound for doloop.
>
> > -          expensive.  */
> > -  if (!integer_zerop (desc->may_be_zero))
> > +          expensive.
> > +
> > +     For doloop candidate, we have considered MAY_BE_ZERO for IV base, need to
> > +     support MAY_BE_ZERO ? 0 : NITER, so simply bypass this check.  */
> > +  if (!integer_zerop (desc->may_be_zero) && !cand->doloop_p)
> >      return iv_elimination_compare_lt (data, cand, comp, desc);
> > And we can early return before this?
> >
>
> OK.
>
> > +  if (may_be_zero)
> > +    {
> > +      if (COMPARISON_CLASS_P (may_be_zero))
> > +       {
> > +         niter = fold_build3 (COND_EXPR, ntype, may_be_zero,
> > +                              build_int_cst (ntype, 0),
> > +                              rewrite_to_non_trapping_overflow (niter));
> > +       }
> > +      /* Don't try to obtain the iteration count expression when may_be_zero is
> > +        integer_nonzerop (actually iteration count is one) or else.  */
> > +      else
> > +       return;
> > +    }
> > +
> > +  tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
> > +                          build_int_cst (ntype, 1));
> > niter is the number of latch executions, so niter + 1 could wrap here,
> > but guess it's not a problem the similar issue is not handled in
> > vectorizer neither.
> >
>
> OK.
>
> > +  unsigned n_old = data->regs_used, n_spr_for_doloop = 0;
> > +  /* If target supports count register for doloop, it doesn't take GPR.  */
> > +  if (targetm.have_count_reg_decr_p)
> > +    n_spr_for_doloop = n_doloop_cands;
> > +  unsigned n_new = n_invs + n_cands - n_spr_for_doloop;
> > Not necessary.  See below.
>
> > -  cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands);
> > +  cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands,
> > +                                       ivs->n_doloop_cands);
> > Also.
> >
> >        ivs->n_cands--;
> > +      if (cp->cand->doloop_p)
> > +       ivs->n_doloop_cands--;
> >
> >           ivs->n_cands++;
> > +         if (cp->cand->doloop_p)
> > +           ivs->n_doloop_cands++;
> > You can just book n_cands under condition !cp->cand->doloop_p.
>
> If my understanding is correct, you are suggesting the code like:
>
> if (!cp->cand->doloop_p)
>   ivs->n_cands++;
>
> But I'm afraid that it can NOT satisfy the need in function
> ivopts_estimate_reg_pressure.  As the comments, "if target supports
> count register for doloop it doesn't take GPR.".  If we make doloop
> cand invisible in n_cands, it's fine for target with count register,
> but we may miss to count them on targets without count register.
Why not one more step do checks:
if (!cp->cand->doloop_p || !targetm.have_count_reg_decr_p)
  ivs->n_cands++;

Thanks,
bin
>
> >
> > +  if (flag_branch_on_count_reg && generic_predict_doloop_p (data))
> > +    {
> > +      if (find_doloop_use (data))
> > +       {
> > +         data->doloop_use_p = true;
> > +         if (dump_file && (dump_flags & TDF_DETAILS))
> > +           {
> > +             fprintf (dump_file,
> > +                      "Predict loop %d can perform"
> > +                      " doloop optimization later.\n",
> > +                      loop->num);
> > +             flow_loop_dump (loop, dump_file, NULL, 1);
> > +           }
> > +       }
> > +    }
> > +
> > Please factor this into a function to keep caller short.
> >
>
> OK.
>
>
> Thanks!
> Kewen
>
Kewen.Lin Aug. 22, 2019, 7:09 a.m. UTC | #4
Hi Bin,

on 2019/8/22 下午1:46, Bin.Cheng wrote:
> On Thu, Aug 22, 2019 at 11:18 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi Bin,
>>
>> Thanks for your time!
>>
>> on 2019/8/21 下午8:32, Bin.Cheng wrote:
>>> On Wed, Aug 14, 2019 at 3:23 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>>
>>>> Hi!
>>>>
>>>> Comparing to the previous versions of implementation mainly based on the
>>>> existing IV cands but zeroing the related group/use cost, this new one is based
>>>> on Richard and Segher's suggestion introducing one doloop dedicated IV cand.
>>>>
>>>> Some key points are listed below:
>>>>   1) New field doloop_p in struct iv_cand to indicate doloop dedicated IV cand.
>>>>   2) Special name "doloop" assigned.
>>>>   3) Doloop IV cand with form (niter+1, +, -1)
>>>>   4) For doloop IV cand, no extra one cost like BIV, assign zero cost for step.
>>>>   5) Support may_be_zero (regressed PR is in this case), the base of doloop IV
>>>>      can be COND_EXPR, add handlings in cand_value_at and may_eliminate_iv.
>>>>   6) Add more expr support in force_expr_to_var_cost for reasonable cost
>>>>      calculation on the IV base with may_be_zero (like COND_EXPR).
>>>>   7) Set zero cost when using doloop IV cand for doloop use.
>>>>   8) Add three hooks (should we merge _generic and _address?).
>>>>     *) have_count_reg_decr_p, is to indicate the target has special hardware
>>>>        count register, we shouldn't consider the impact of doloop IV when
>>>>        calculating register pressures.
>>>>     *) doloop_cost_for_generic, is the extra cost when using doloop IV cand for
>>>>        generic type IV use.
>>>>     *) doloop_cost_for_address, is the extra cost when using doloop IV cand for
>>>>        address type IV use.
>>> What will happen if doloop IV cand be used for generic/address type iv
>>> use?  Can RTL doloop can still perform doloop optimization in this
>>> case?
>>>
>>
>> On Power, we put the iteration count into hardware count register, it takes very
>> high cost to move the count to GPR, so the cost is set as INF to make it impossible
>> to use it for generic/address type iv use.  But as some discussion before, on some
>> targets using GPR instead of hardware count register, they probably want to use this
>> doloop iv used for other uses if profitable.  These two hooks offer the possibility.
>> In that case, I think RTL doloop can still perform since it can still get the
>> pattern and transform.  The generic/address uses can still use it.
>>>>
>>>> Bootstrapped on powerpc64le-linux-gnu and regression testing passed excepting
>>>> for one failure on gcc/testsuite/gcc.dg/guality/loop-1.c at -O3 which is tracked
>>>> by PR89983.
>>>>
>>>> Any comments and suggestions are highly appreciated.  Thanks!
>>> Not sure if I understand the patch correctly, some comments embedded.
>>>
>>> +  /* The number of doloop candidate in the set.  */
>>> +  unsigned n_doloop_cands;
>>> +
>>> This is unnecessary.  See below comments.
>>>
>>> -    add_candidate_1 (data, base, step, important,
>>> -                    IP_NORMAL, use, NULL, orig_iv);
>>> +    add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL, doloop,
>>> +                    orig_iv);
>>>    if (ip_end_pos (data->current_loop)
>>>        && allow_ip_end_pos_p (data->current_loop))
>>> -    add_candidate_1 (data, base, step, important, IP_END, use, NULL, orig_iv);
>>> +    add_candidate_1 (data, base, step, important, IP_END, use, NULL, doloop,
>>> +                    orig_iv);
>>> Do we need to skip ip_end_pos case for doloop candidate?  Because the
>>> candidate increment will be inserted in latch, i.e, increment position
>>> is after exit condition.
>>>
>>
>> Yes, we should skip it.  Currently function find_doloop_use has the check on an
>> empty latch and gimple_cond to latch, partially excluding it.  But it's still good
>> to guard it directly here.
>>
>>> -  tree_to_aff_combination (iv->base, type, val);
>>> +  tree base = iv->base;
>>> +  /* See add_iv_candidate_for_doloop, if may_be_zero is set, we want to extract
>>> +     the value under !may_be_zero to get the compact bound which also well fits
>>> +     for may_be_zero since we ensure the value for it is const one.  */
>>> +  if (cand->doloop_p && desc->may_be_zero && !integer_zerop
>>> (desc->may_be_zero))
>>> +    base = fold_build2 (PLUS_EXPR, TREE_TYPE (niter),
>>> +                       unshare_expr (rewrite_to_non_trapping_overflow (niter)),
>>> +                       build_int_cst (TREE_TYPE (niter), 1));
>>> +  tree_to_aff_combination (base, type, val);
>>> I don't quite follow here.  The iv->base is computed from niter, I
>>> suppose compact bound is for cheaper candidate initialization?  Why
>>> it's possible to extract !may_be_zero niter for may_be_zero here?  The
>>> niter under !may_be_zero has no indication about the real niter under
>>> may_be_zero.
>>>
>>
>> As you note below, the cand_value for doloop would be zero, but for the case
>> may_be_zero set, the current calculation would take care of the whole niter
>> expression including the cond_expr introduced by may_be_zero check, it's
>> unexpected.  The purpose is to use the value under condition !may_be_zero
>> for the calculation, and yes, to get expected zero finally.
>>
>>> -  cand_value_at (loop, cand, use->stmt, desc->niter, &bnd);
>>> +  cand_value_at (loop, cand, use->stmt, desc, &bnd);
>>> If I understand correctly, doloop use/cand will only be
>>> identified/added for single exit loop, and there will be only one
>>> cond(doloop) iv_use and only one doloop cand for doloop loop.  So the
>>> cand_value at niter at use position would be 0.  If that's the case,
>>> we can skip calling cand_value_at here for doloop cand.  The change to
>>> cand_value_at would be unnecessary neither.
>>>
>>
>> Exactly, I'll add the early return with zero bound for doloop.
>>
>>> -          expensive.  */
>>> -  if (!integer_zerop (desc->may_be_zero))
>>> +          expensive.
>>> +
>>> +     For doloop candidate, we have considered MAY_BE_ZERO for IV base, need to
>>> +     support MAY_BE_ZERO ? 0 : NITER, so simply bypass this check.  */
>>> +  if (!integer_zerop (desc->may_be_zero) && !cand->doloop_p)
>>>      return iv_elimination_compare_lt (data, cand, comp, desc);
>>> And we can early return before this?
>>>
>>
>> OK.
>>
>>> +  if (may_be_zero)
>>> +    {
>>> +      if (COMPARISON_CLASS_P (may_be_zero))
>>> +       {
>>> +         niter = fold_build3 (COND_EXPR, ntype, may_be_zero,
>>> +                              build_int_cst (ntype, 0),
>>> +                              rewrite_to_non_trapping_overflow (niter));
>>> +       }
>>> +      /* Don't try to obtain the iteration count expression when may_be_zero is
>>> +        integer_nonzerop (actually iteration count is one) or else.  */
>>> +      else
>>> +       return;
>>> +    }
>>> +
>>> +  tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
>>> +                          build_int_cst (ntype, 1));
>>> niter is the number of latch executions, so niter + 1 could wrap here,
>>> but guess it's not a problem the similar issue is not handled in
>>> vectorizer neither.
>>>
>>
>> OK.
>>
>>> +  unsigned n_old = data->regs_used, n_spr_for_doloop = 0;
>>> +  /* If target supports count register for doloop, it doesn't take GPR.  */
>>> +  if (targetm.have_count_reg_decr_p)
>>> +    n_spr_for_doloop = n_doloop_cands;
>>> +  unsigned n_new = n_invs + n_cands - n_spr_for_doloop;
>>> Not necessary.  See below.
>>
>>> -  cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands);
>>> +  cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands,
>>> +                                       ivs->n_doloop_cands);
>>> Also.
>>>
>>>        ivs->n_cands--;
>>> +      if (cp->cand->doloop_p)
>>> +       ivs->n_doloop_cands--;
>>>
>>>           ivs->n_cands++;
>>> +         if (cp->cand->doloop_p)
>>> +           ivs->n_doloop_cands++;
>>> You can just book n_cands under condition !cp->cand->doloop_p.
>>
>> If my understanding is correct, you are suggesting the code like:
>>
>> if (!cp->cand->doloop_p)
>>   ivs->n_cands++;
>>
>> But I'm afraid that it can NOT satisfy the need in function
>> ivopts_estimate_reg_pressure.  As the comments, "if target supports
>> count register for doloop it doesn't take GPR.".  If we make doloop
>> cand invisible in n_cands, it's fine for target with count register,
>> but we may miss to count them on targets without count register.
> Why not one more step do checks:
> if (!cp->cand->doloop_p || !targetm.have_count_reg_decr_p)
>   ivs->n_cands++;
> 

Yes, it works.  Thanks!

The new patch addressing the comments is attached.  
Could you please have a look again?  Thanks in advance!


Kewen

---------

gcc/ChangeLog

2019-08-22  Kewen Lin  <linkw@gcc.gnu.org>

	PR middle-end/80791
	* config/rs6000/rs6000.c (TARGET_HAVE_COUNT_REG_DECR_P): New macro.
	(TARGET_DOLOOP_COST_FOR_GENERIC): Likewise.
	(TARGET_DOLOOP_COST_FOR_ADDRESS): Likewise.
	* target.def (have_count_reg_decr_p): New hook.
	(doloop_cost_for_generic): Likewise.
	(doloop_cost_for_address): Likewise.
	* doc/tm.texi.in (TARGET_HAVE_COUNT_REG_DECR_P): Likewise.
	(TARGET_DOLOOP_COST_FOR_GENERIC): Likewise.
	(TARGET_DOLOOP_COST_FOR_ADDRESS): Likewise.
	* doc/tm.texi: Regenerate.
	* tree-ssa-loop-ivopts.c (comp_cost::operator+=): Consider infinite cost
	addend.
	(record_group): Init doloop_p.
	(add_candidate_1): Add optional argument doloop, change the handlings
	accordingly.
	(add_candidate): Likewise.
	(add_iv_candidate_for_biv): Update the call to add_candidate.
	(generic_predict_doloop_p): Update attribute.
	(force_expr_to_var_cost): Add costing for expressions COND_EXPR/LT_EXPR/
	LE_EXPR/GT_EXPR/GE_EXPR/EQ_EXPR/NE_EXPR/UNORDERED_EXPR/ORDERED_EXPR/
	UNLT_EXPR/UNLE_EXPR/UNGT_EXPR/UNGE_EXPR/UNEQ_EXPR/LTGT_EXPR/MAX_EXPR/
	MIN_EXPR.
	(determine_group_iv_cost_generic): Update for doloop IV cand.
	(determine_group_iv_cost_address): Likewise.
	(determine_group_iv_cost_cond): Likewise.
	(determine_iv_cost): Likewise.
	(ivopts_estimate_reg_pressure): Likewise.
	(may_eliminate_iv): Likewise.
	(add_iv_candidate_for_doloop): New function.
	(find_iv_candidates): Call function add_iv_candidate_for_doloop.
	(iv_ca_set_no_cp): Update for doloop IV cand.
	(iv_ca_set_cp): Likewise.
	(iv_ca_dump): Dump register cost.
	(find_doloop_use): New function.
	(predict_and_process_doloop): Likewise.
	(tree_ssa_iv_optimize_loop): Call function predict_and_process_doloop.

gcc/testsuite/ChangeLog

2019-08-22  Kewen Lin  <linkw@gcc.gnu.org>

	PR middle-end/80791
	* gcc.dg/tree-ssa/ivopts-3.c: Adjust for doloop change.
	* gcc.dg/tree-ssa/ivopts-lt.c: Likewise.
	* gcc.dg/tree-ssa/pr32044.c: Likewise.
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 6667cd0..5eccbdc 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1912,6 +1912,16 @@ static const struct attribute_spec rs6000_attribute_table[] =
 #undef TARGET_PREDICT_DOLOOP_P
 #define TARGET_PREDICT_DOLOOP_P rs6000_predict_doloop_p
 
+#undef TARGET_HAVE_COUNT_REG_DECR_P
+#define TARGET_HAVE_COUNT_REG_DECR_P true
+
+/* 1000000000 is infinite cost in IVOPTs.  */
+#undef TARGET_DOLOOP_COST_FOR_GENERIC
+#define TARGET_DOLOOP_COST_FOR_GENERIC 1000000000
+
+#undef TARGET_DOLOOP_COST_FOR_ADDRESS
+#define TARGET_DOLOOP_COST_FOR_ADDRESS 1000000000
+
 #undef TARGET_ATOMIC_ASSIGN_EXPAND_FENV
 #define TARGET_ATOMIC_ASSIGN_EXPAND_FENV rs6000_atomic_assign_expand_fenv
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index c2aa4d0..9f3a08a 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11618,6 +11618,29 @@ loops, and will help ivopts to make some decisions.
 The default version of this hook returns false.
 @end deftypefn
 
+@deftypevr {Target Hook} bool TARGET_HAVE_COUNT_REG_DECR_P
+Return true if the target supports hardware count register for decrement
+and branch.  This count register can't be used as general register since
+moving to/from a general register from/to it is very expensive.
+The default value is false.
+@end deftypevr
+
+@deftypevr {Target Hook} int64_t TARGET_DOLOOP_COST_FOR_GENERIC
+IVOPTs introduces one doloop dedicated IV candidate, this hook offers
+ target owner a way to adjust cost when selecting doloop IV candidate for a
+ generic IV use.  At calcuation, this value will be added on normal cost
+ already calculated by current implementation.
+The default value is zero.
+@end deftypevr
+
+@deftypevr {Target Hook} int64_t TARGET_DOLOOP_COST_FOR_ADDRESS
+IVOPTs introduces one doloop dedicated IV candidate, this hook offers
+ target owner a way to adjust cost when selecting doloop IV candidate for an
+ address IV use.  At calcuation, this value will be added on normal cost
+ already calculated by current implementation.
+The default value is zero.
+@end deftypevr
+
 @deftypefn {Target Hook} bool TARGET_CAN_USE_DOLOOP_P (const widest_int @var{&iterations}, const widest_int @var{&iterations_max}, unsigned int @var{loop_depth}, bool @var{entered_at_top})
 Return true if it is possible to use low-overhead loops (@code{doloop_end}
 and @code{doloop_begin}) for a particular loop.  @var{iterations} gives the
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index b4d57b8..4346773 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -7946,6 +7946,12 @@ to by @var{ce_info}.
 
 @hook TARGET_PREDICT_DOLOOP_P
 
+@hook TARGET_HAVE_COUNT_REG_DECR_P
+
+@hook TARGET_DOLOOP_COST_FOR_GENERIC
+
+@hook TARGET_DOLOOP_COST_FOR_ADDRESS
+
 @hook TARGET_CAN_USE_DOLOOP_P
 
 @hook TARGET_INVALID_WITHIN_DOLOOP
diff --git a/gcc/target.def b/gcc/target.def
index 71b6972..69e2844 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -4246,6 +4246,32 @@ The default version of this hook returns false.",
  bool, (struct loop *loop),
  default_predict_doloop_p)
 
+DEFHOOKPOD
+(have_count_reg_decr_p,
+ "Return true if the target supports hardware count register for decrement\n\
+and branch.  This count register can't be used as general register since\n\
+moving to/from a general register from/to it is very expensive.\n\
+The default value is false.",
+ bool, false)
+
+DEFHOOKPOD
+(doloop_cost_for_generic,
+ "IVOPTs introduces one doloop dedicated IV candidate, this hook offers\n\
+ target owner a way to adjust cost when selecting doloop IV candidate for a\n\
+ generic IV use.  At calcuation, this value will be added on normal cost\n\
+ already calculated by current implementation.\n\
+The default value is zero.",
+ int64_t, 0)
+
+DEFHOOKPOD
+(doloop_cost_for_address,
+ "IVOPTs introduces one doloop dedicated IV candidate, this hook offers\n\
+ target owner a way to adjust cost when selecting doloop IV candidate for an\n\
+ address IV use.  At calcuation, this value will be added on normal cost\n\
+ already calculated by current implementation.\n\
+The default value is zero.",
+ int64_t, 0)
+
 DEFHOOK
 (can_use_doloop_p,
  "Return true if it is possible to use low-overhead loops (@code{doloop_end}\n\
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-3.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-3.c
index 214e6a7..ce4b1d0 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-3.c
@@ -10,4 +10,6 @@ int main (void)
     f2 ();
 }
 
-/* { dg-final { scan-tree-dump-times "!= 0" 5 "ivopts" } }  */
+/* { dg-final { scan-tree-dump-times "!= 0" 5 "ivopts" { target { ! powerpc*-*-* } } } } */
+/* More debug information emitted for doloop on powerpc.  */
+/* { dg-final { scan-tree-dump-times "!= 0" 6 "ivopts" { target { powerpc*-*-* } } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c
index 7d5859b..71d7f67 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c
@@ -17,6 +17,7 @@ f1 (char *p, uintptr_t i, uintptr_t n)
   while (i < n);
 }
 
-/* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts" } } */
-/* { dg-final { scan-tree-dump-times "PHI <p_" 1 "ivopts"} } */
-/* { dg-final { scan-tree-dump-times "p_\[0-9\]* <" 1 "ivopts" } } */
+/* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts" { target { ! powerpc*-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "PHI" 2 "ivopts" { target { powerpc*-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "PHI <p_" 1 "ivopts" { target { ! powerpc*-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "p_\[0-9\]* <" 1 "ivopts" { target { ! powerpc*-*-* } } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr32044.c b/gcc/testsuite/gcc.dg/tree-ssa/pr32044.c
index 8a8977a..06c27b0 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr32044.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr32044.c
@@ -1,6 +1,10 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -fdump-tree-optimized" } */
 
+/* For powerpc, disable doloop IV cand generation in IVOPTs to avoid unexpected
+   division operation for its base setup.  */
+/* { dg-additional-options "-fno-branch-count-reg" { target { powerpc*-*-* } } } */
+
 int foo (int n)
 {
   while (n >= 45)
diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 530ea4a..be3b0b5 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -275,6 +275,9 @@ comp_cost::operator+= (comp_cost cost)
 comp_cost
 comp_cost::operator+= (HOST_WIDE_INT c)
 {
+  if (c >= INFTY)
+    this->cost = INFTY;
+
   if (infinite_cost_p ())
     return *this;
 
@@ -399,6 +402,8 @@ struct iv_group
   struct cost_pair *cost_map;
   /* The selected candidate for the group.  */
   struct iv_cand *selected;
+  /* To indicate this is a doloop use group.  */
+  bool doloop_p;
   /* Uses in the group.  */
   vec<struct iv_use *> vuses;
 };
@@ -439,6 +444,7 @@ struct iv_cand
 			   be hoisted out of loop.  */
   struct iv *orig_iv;	/* The original iv if this cand is added from biv with
 			   smaller type.  */
+  bool doloop_p;	/* Whether this is a doloop candidate.  */
 };
 
 /* Hashtable entry for common candidate derived from iv uses.  */
@@ -612,6 +618,9 @@ struct ivopts_data
 
   /* Whether the loop body can only be exited via single exit.  */
   bool loop_single_exit_p;
+
+  /* Whether the loop has doloop comparison use.  */
+  bool doloop_use_p;
 };
 
 /* An assignment of iv candidates to uses.  */
@@ -1528,6 +1537,7 @@ record_group (struct ivopts_data *data, enum use_type type)
   group->type = type;
   group->related_cands = BITMAP_ALLOC (NULL);
   group->vuses.create (1);
+  group->doloop_p = false;
 
   data->vgroups.safe_push (group);
   return group;
@@ -3017,9 +3027,9 @@ get_loop_invariant_expr (struct ivopts_data *data, tree inv_expr)
    replacement of the final value of the iv by a direct computation.  */
 
 static struct iv_cand *
-add_candidate_1 (struct ivopts_data *data,
-		 tree base, tree step, bool important, enum iv_position pos,
-		 struct iv_use *use, gimple *incremented_at,
+add_candidate_1 (struct ivopts_data *data, tree base, tree step, bool important,
+		 enum iv_position pos, struct iv_use *use,
+		 gimple *incremented_at, bool doloop = false,
 		 struct iv *orig_iv = NULL)
 {
   unsigned i;
@@ -3079,11 +3089,15 @@ add_candidate_1 (struct ivopts_data *data,
       cand->pos = pos;
       if (pos != IP_ORIGINAL)
 	{
-	  cand->var_before = create_tmp_var_raw (TREE_TYPE (base), "ivtmp");
+	  if (doloop)
+	    cand->var_before = create_tmp_var_raw (TREE_TYPE (base), "doloop");
+	  else
+	    cand->var_before = create_tmp_var_raw (TREE_TYPE (base), "ivtmp");
 	  cand->var_after = cand->var_before;
 	}
       cand->important = important;
       cand->incremented_at = incremented_at;
+      cand->doloop_p = doloop;
       data->vcands.safe_push (cand);
 
       if (!poly_int_tree_p (step))
@@ -3116,6 +3130,7 @@ add_candidate_1 (struct ivopts_data *data,
     }
 
   cand->important |= important;
+  cand->doloop_p |= doloop;
 
   /* Relate candidate to the group for which it is added.  */
   if (use)
@@ -3209,16 +3224,17 @@ add_autoinc_candidates (struct ivopts_data *data, tree base, tree step,
    the end of loop.  */
 
 static void
-add_candidate (struct ivopts_data *data,
-	       tree base, tree step, bool important, struct iv_use *use,
+add_candidate (struct ivopts_data *data, tree base, tree step, bool important,
+	       struct iv_use *use, bool doloop = false,
 	       struct iv *orig_iv = NULL)
 {
   if (ip_normal_pos (data->current_loop))
-    add_candidate_1 (data, base, step, important,
-		     IP_NORMAL, use, NULL, orig_iv);
-  if (ip_end_pos (data->current_loop)
+    add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL, doloop,
+		     orig_iv);
+  if (!doloop && ip_end_pos (data->current_loop)
       && allow_ip_end_pos_p (data->current_loop))
-    add_candidate_1 (data, base, step, important, IP_END, use, NULL, orig_iv);
+    add_candidate_1 (data, base, step, important, IP_END, use, NULL, doloop,
+		     orig_iv);
 }
 
 /* Adds standard iv candidates.  */
@@ -3262,7 +3278,7 @@ add_iv_candidate_for_biv (struct ivopts_data *data, struct iv *iv)
       tree step = fold_convert (sizetype, iv->step);
 
       /* Add iv cand of same precision as index part in TARGET_MEM_REF.  */
-      add_candidate (data, base, step, true, NULL, iv);
+      add_candidate (data, base, step, true, NULL, false, iv);
       /* Add iv cand of the original type only if it has nonlinear use.  */
       if (iv->nonlin_use)
 	add_candidate (data, iv->base, iv->step, true, NULL);
@@ -3724,7 +3740,7 @@ prepare_decl_rtl (tree *expr_p, int *ws, void *data)
    Some RTL specific checks seems unable to be checked in gimple, if any new
    checks or easy checks _are_ missing here, please add them.  */
 
-static bool ATTRIBUTE_UNUSED
+static bool
 generic_predict_doloop_p (struct ivopts_data *data)
 {
   struct loop *loop = data->current_loop;
@@ -4177,6 +4193,36 @@ force_expr_to_var_cost (tree expr, bool speed)
       STRIP_NOPS (op0);
       op1 = NULL_TREE;
       break;
+    /* See add_iv_candidate_for_doloop, for doloop may_be_zero case, we
+       introduce COND_EXPR for IV base, need to support better cost estimation
+       for this COND_EXPR and tcc_comparison.  */
+    case COND_EXPR:
+      op0 = TREE_OPERAND (expr, 1);
+      STRIP_NOPS (op0);
+      op1 = TREE_OPERAND (expr, 2);
+      STRIP_NOPS (op1);
+      break;
+    case LT_EXPR:
+    case LE_EXPR:
+    case GT_EXPR:
+    case GE_EXPR:
+    case EQ_EXPR:
+    case NE_EXPR:
+    case UNORDERED_EXPR:
+    case ORDERED_EXPR:
+    case UNLT_EXPR:
+    case UNLE_EXPR:
+    case UNGT_EXPR:
+    case UNGE_EXPR:
+    case UNEQ_EXPR:
+    case LTGT_EXPR:
+    case MAX_EXPR:
+    case MIN_EXPR:
+      op0 = TREE_OPERAND (expr, 0);
+      STRIP_NOPS (op0);
+      op1 = TREE_OPERAND (expr, 1);
+      STRIP_NOPS (op1);
+      break;
 
     default:
       /* Just an arbitrary value, FIXME.  */
@@ -4258,6 +4304,35 @@ force_expr_to_var_cost (tree expr, bool speed)
     case RSHIFT_EXPR:
       cost = comp_cost (add_cost (speed, mode), 0);
       break;
+    case COND_EXPR:
+      op0 = TREE_OPERAND (expr, 0);
+      STRIP_NOPS (op0);
+      if (op0 == NULL_TREE || TREE_CODE (op0) == SSA_NAME
+	  || CONSTANT_CLASS_P (op0))
+	cost = no_cost;
+      else
+	cost = force_expr_to_var_cost (op0, speed);
+      break;
+    case LT_EXPR:
+    case LE_EXPR:
+    case GT_EXPR:
+    case GE_EXPR:
+    case EQ_EXPR:
+    case NE_EXPR:
+    case UNORDERED_EXPR:
+    case ORDERED_EXPR:
+    case UNLT_EXPR:
+    case UNLE_EXPR:
+    case UNGT_EXPR:
+    case UNGE_EXPR:
+    case UNEQ_EXPR:
+    case LTGT_EXPR:
+    case MAX_EXPR:
+    case MIN_EXPR:
+      /* Simply use 1.5 * add cost for now, FIXME if there is some more accurate
+	 cost evaluation way.  */
+      cost = comp_cost (1.5 * add_cost (speed, mode), 0);
+      break;
 
     default:
       gcc_unreachable ();
@@ -4706,8 +4781,12 @@ determine_group_iv_cost_generic (struct ivopts_data *data,
   if (cand->pos == IP_ORIGINAL && cand->incremented_at == use->stmt)
     cost = no_cost;
   else
-    cost = get_computation_cost (data, use, cand, false,
-				 &inv_vars, NULL, &inv_expr);
+    {
+      cost = get_computation_cost (data, use, cand, false, &inv_vars, NULL,
+				   &inv_expr);
+      if (cand->doloop_p)
+	cost += targetm.doloop_cost_for_generic;
+    }
 
   if (inv_expr)
     {
@@ -4735,6 +4814,9 @@ determine_group_iv_cost_address (struct ivopts_data *data,
   cost = get_computation_cost (data, use, cand, true,
 			       &inv_vars, &can_autoinc, &inv_expr);
 
+  if (cand->doloop_p)
+    cost += targetm.doloop_cost_for_address;
+
   if (inv_expr)
     {
       inv_exprs = BITMAP_ALLOC (NULL);
@@ -5142,6 +5224,15 @@ may_eliminate_iv (struct ivopts_data *data,
 	}
     }
 
+  /* For doloop IV cand, the bound would be zero.  It's safe whether
+     may_be_zero set or not.  */
+  if (cand->doloop_p)
+    {
+      *bound = build_int_cst (TREE_TYPE (cand->iv->base), 0);
+      *comp = iv_elimination_compare (data, use);
+      return true;
+    }
+
   cand_value_at (loop, cand, use->stmt, desc->niter, &bnd);
 
   *bound = fold_convert (TREE_TYPE (cand->iv->base),
@@ -5264,6 +5355,9 @@ determine_group_iv_cost_cond (struct ivopts_data *data,
       inv_vars = inv_vars_elim;
       inv_vars_elim = NULL;
       inv_expr = inv_expr_elim;
+      /* For doloop candidate/use pair, adjust to zero cost.  */
+      if (group->doloop_p && cand->doloop_p)
+	cost = no_cost;
     }
   else
     {
@@ -5390,6 +5484,42 @@ relate_compare_use_with_all_cands (struct ivopts_data *data)
     }
 }
 
+/* Add one doloop dedicated IV candidate:
+     - Base is (may_be_zero ? 1 : (niter + 1)).
+     - Step is -1.  */
+
+static void
+add_iv_candidate_for_doloop (struct ivopts_data *data)
+{
+  tree_niter_desc *niter_desc = niter_for_single_dom_exit (data);
+  gcc_assert (niter_desc && niter_desc->assumptions);
+
+  tree niter = niter_desc->niter;
+  tree ntype = TREE_TYPE (niter);
+  gcc_assert (TREE_CODE (ntype) == INTEGER_TYPE);
+
+  tree may_be_zero = niter_desc->may_be_zero;
+  if (may_be_zero && integer_zerop (may_be_zero))
+    may_be_zero = NULL_TREE;
+  if (may_be_zero)
+    {
+      if (COMPARISON_CLASS_P (may_be_zero))
+	{
+	  niter = fold_build3 (COND_EXPR, ntype, may_be_zero,
+			       build_int_cst (ntype, 0),
+			       rewrite_to_non_trapping_overflow (niter));
+	}
+      /* Don't try to obtain the iteration count expression when may_be_zero is
+	 integer_nonzerop (actually iteration count is one) or else.  */
+      else
+	return;
+    }
+
+  tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
+			   build_int_cst (ntype, 1));
+  add_candidate (data, base, build_int_cst (ntype, -1), true, NULL, true);
+}
+
 /* Finds the candidates for the induction variables.  */
 
 static void
@@ -5398,6 +5528,10 @@ find_iv_candidates (struct ivopts_data *data)
   /* Add commonly used ivs.  */
   add_standard_iv_candidates (data);
 
+  /* Add doloop dedicate ivs.  */
+  if (data->doloop_use_p)
+    add_iv_candidate_for_doloop (data);
+
   /* Add old induction variables.  */
   add_iv_candidate_for_bivs (data);
 
@@ -5578,16 +5712,21 @@ determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand)
      or a const set.  */
   if (cost_base.cost == 0)
     cost_base.cost = COSTS_N_INSNS (1);
-  cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
-
+  /* Doloop decrement should be considered as zero cost.  */
+  if (cand->doloop_p)
+    cost_step = 0;
+  else
+    cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
 
   /* Prefer the original ivs unless we may gain something by replacing it.
      The reason is to make debugging simpler; so this is not relevant for
      artificial ivs created by other optimization passes.  */
-  if (cand->pos != IP_ORIGINAL
-      || !SSA_NAME_VAR (cand->var_before)
-      || DECL_ARTIFICIAL (SSA_NAME_VAR (cand->var_before)))
+  if ((cand->pos != IP_ORIGINAL
+       || !SSA_NAME_VAR (cand->var_before)
+       || DECL_ARTIFICIAL (SSA_NAME_VAR (cand->var_before)))
+      /* Prefer doloop as well.  */
+      && !cand->doloop_p)
     cost++;
 
   /* Prefer not to insert statements into latch unless there are some
@@ -5832,7 +5971,8 @@ iv_ca_set_no_cp (struct ivopts_data *data, struct iv_ca *ivs,
   if (ivs->n_cand_uses[cid] == 0)
     {
       bitmap_clear_bit (ivs->cands, cid);
-      ivs->n_cands--;
+      if (!cp->cand->doloop_p || !targetm.have_count_reg_decr_p)
+	ivs->n_cands--;
       ivs->cand_cost -= cp->cand->cost;
       iv_ca_set_remove_invs (ivs, cp->cand->inv_vars, ivs->n_inv_var_uses);
       iv_ca_set_remove_invs (ivs, cp->cand->inv_exprs, ivs->n_inv_expr_uses);
@@ -5889,7 +6029,8 @@ iv_ca_set_cp (struct ivopts_data *data, struct iv_ca *ivs,
       if (ivs->n_cand_uses[cid] == 1)
 	{
 	  bitmap_set_bit (ivs->cands, cid);
-	  ivs->n_cands++;
+	  if (!cp->cand->doloop_p || !targetm.have_count_reg_decr_p)
+	    ivs->n_cands++;
 	  ivs->cand_cost += cp->cand->cost;
 	  iv_ca_set_add_invs (ivs, cp->cand->inv_vars, ivs->n_inv_var_uses);
 	  iv_ca_set_add_invs (ivs, cp->cand->inv_exprs, ivs->n_inv_expr_uses);
@@ -6134,6 +6275,8 @@ iv_ca_dump (struct ivopts_data *data, FILE *file, struct iv_ca *ivs)
 
   fprintf (file, "  cost: %" PRId64 " (complexity %d)\n", cost.cost,
 	   cost.complexity);
+  fprintf (file, "  reg_cost: %d\n",
+	   ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands));
   fprintf (file, "  cand_cost: %" PRId64 "\n  cand_group_cost: "
 	   "%" PRId64 " (complexity %d)\n", ivs->cand_cost,
 	   ivs->cand_use_cost.cost, ivs->cand_use_cost.complexity);
@@ -7568,6 +7711,75 @@ determine_scaling_factor (struct ivopts_data *data, basic_block *body)
     }
 }
 
+/* Find doloop comparison use and set its doloop_p on if found.  */
+
+static bool
+find_doloop_use (struct ivopts_data *data)
+{
+  struct loop *loop = data->current_loop;
+
+  for (unsigned i = 0; i < data->vgroups.length (); i++)
+    {
+      struct iv_group *group = data->vgroups[i];
+      if (group->type == USE_COMPARE)
+	{
+	  gcc_assert (group->vuses.length () == 1);
+	  struct iv_use *use = group->vuses[0];
+	  gimple *stmt = use->stmt;
+	  if (gimple_code (stmt) == GIMPLE_COND)
+	    {
+	      basic_block bb = gimple_bb (stmt);
+	      edge true_edge, false_edge;
+	      extract_true_false_edges_from_block (bb, &true_edge, &false_edge);
+	      /* This comparison is used for loop latch.  Require latch is empty
+		 for now.  */
+	      if ((loop->latch == true_edge->dest
+		   || loop->latch == false_edge->dest)
+		  && empty_block_p (loop->latch))
+		{
+		  group->doloop_p = true;
+		  if (dump_file && (dump_flags & TDF_DETAILS))
+		    {
+		      fprintf (dump_file, "Doloop cmp iv use: ");
+		      print_gimple_stmt (dump_file, stmt, TDF_DETAILS);
+		    }
+		  return true;
+		}
+	    }
+	}
+    }
+
+  return false;
+}
+
+/* For the targets which support doloop, to predict whether later RTL doloop
+   transformation will perform on this loop, further detect the doloop use and
+   mark the flag doloop_use_p if predicted.  */
+
+void
+predict_and_process_doloop (struct ivopts_data *data)
+{
+  if (!flag_branch_on_count_reg)
+    return;
+
+  if (!generic_predict_doloop_p (data))
+    return;
+
+  if (find_doloop_use (data))
+    {
+      data->doloop_use_p = true;
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  struct loop *loop = data->current_loop;
+	  fprintf (dump_file,
+		   "Predict loop %d can perform"
+		   " doloop optimization later.\n",
+		   loop->num);
+	  flow_loop_dump (loop, dump_file, NULL, 1);
+	}
+    }
+}
+
 /* Optimizes the LOOP.  Returns true if anything changed.  */
 
 static bool
@@ -7580,6 +7792,7 @@ tree_ssa_iv_optimize_loop (struct ivopts_data *data, struct loop *loop,
   basic_block *body;
 
   gcc_assert (!data->niters);
+  data->doloop_use_p = false;
   data->current_loop = loop;
   data->loop_loc = find_loop_location (loop).get_location_t ();
   data->speed = optimize_loop_for_speed_p (loop);
@@ -7622,6 +7835,9 @@ tree_ssa_iv_optimize_loop (struct ivopts_data *data, struct loop *loop,
   /* Determine cost scaling factor for basic blocks in loop.  */
   determine_scaling_factor (data, body);
 
+  /* Predict doloop and find the doloop use if predicted.  */
+  predict_and_process_doloop (data);
+
   /* Finds candidates for the induction variables (item 2).  */
   find_iv_candidates (data);
Bin.Cheng Aug. 23, 2019, 2:19 a.m. UTC | #5
On Thu, Aug 22, 2019 at 3:09 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Bin,
>
> on 2019/8/22 下午1:46, Bin.Cheng wrote:
> > On Thu, Aug 22, 2019 at 11:18 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>
> >> Hi Bin,
> >>
> >> Thanks for your time!
> >>
> >> on 2019/8/21 下午8:32, Bin.Cheng wrote:
> >>> On Wed, Aug 14, 2019 at 3:23 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>>>
> >>>> Hi!
> >>>>
> >>>> Comparing to the previous versions of implementation mainly based on the
> >>>> existing IV cands but zeroing the related group/use cost, this new one is based
> >>>> on Richard and Segher's suggestion introducing one doloop dedicated IV cand.
> >>>>
> >>>> Some key points are listed below:
> >>>>   1) New field doloop_p in struct iv_cand to indicate doloop dedicated IV cand.
> >>>>   2) Special name "doloop" assigned.
> >>>>   3) Doloop IV cand with form (niter+1, +, -1)
> >>>>   4) For doloop IV cand, no extra one cost like BIV, assign zero cost for step.
> >>>>   5) Support may_be_zero (regressed PR is in this case), the base of doloop IV
> >>>>      can be COND_EXPR, add handlings in cand_value_at and may_eliminate_iv.
> >>>>   6) Add more expr support in force_expr_to_var_cost for reasonable cost
> >>>>      calculation on the IV base with may_be_zero (like COND_EXPR).
> >>>>   7) Set zero cost when using doloop IV cand for doloop use.
> >>>>   8) Add three hooks (should we merge _generic and _address?).
> >>>>     *) have_count_reg_decr_p, is to indicate the target has special hardware
> >>>>        count register, we shouldn't consider the impact of doloop IV when
> >>>>        calculating register pressures.
> >>>>     *) doloop_cost_for_generic, is the extra cost when using doloop IV cand for
> >>>>        generic type IV use.
> >>>>     *) doloop_cost_for_address, is the extra cost when using doloop IV cand for
> >>>>        address type IV use.
> >>> What will happen if doloop IV cand be used for generic/address type iv
> >>> use?  Can RTL doloop can still perform doloop optimization in this
> >>> case?
> >>>
> >>
> >> On Power, we put the iteration count into hardware count register, it takes very
> >> high cost to move the count to GPR, so the cost is set as INF to make it impossible
> >> to use it for generic/address type iv use.  But as some discussion before, on some
> >> targets using GPR instead of hardware count register, they probably want to use this
> >> doloop iv used for other uses if profitable.  These two hooks offer the possibility.
> >> In that case, I think RTL doloop can still perform since it can still get the
> >> pattern and transform.  The generic/address uses can still use it.
> >>>>
> >>>> Bootstrapped on powerpc64le-linux-gnu and regression testing passed excepting
> >>>> for one failure on gcc/testsuite/gcc.dg/guality/loop-1.c at -O3 which is tracked
> >>>> by PR89983.
> >>>>
> >>>> Any comments and suggestions are highly appreciated.  Thanks!
> >>> Not sure if I understand the patch correctly, some comments embedded.
> >>>
> >>> +  /* The number of doloop candidate in the set.  */
> >>> +  unsigned n_doloop_cands;
> >>> +
> >>> This is unnecessary.  See below comments.
> >>>
> >>> -    add_candidate_1 (data, base, step, important,
> >>> -                    IP_NORMAL, use, NULL, orig_iv);
> >>> +    add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL, doloop,
> >>> +                    orig_iv);
> >>>    if (ip_end_pos (data->current_loop)
> >>>        && allow_ip_end_pos_p (data->current_loop))
> >>> -    add_candidate_1 (data, base, step, important, IP_END, use, NULL, orig_iv);
> >>> +    add_candidate_1 (data, base, step, important, IP_END, use, NULL, doloop,
> >>> +                    orig_iv);
> >>> Do we need to skip ip_end_pos case for doloop candidate?  Because the
> >>> candidate increment will be inserted in latch, i.e, increment position
> >>> is after exit condition.
> >>>
> >>
> >> Yes, we should skip it.  Currently function find_doloop_use has the check on an
> >> empty latch and gimple_cond to latch, partially excluding it.  But it's still good
> >> to guard it directly here.
> >>
> >>> -  tree_to_aff_combination (iv->base, type, val);
> >>> +  tree base = iv->base;
> >>> +  /* See add_iv_candidate_for_doloop, if may_be_zero is set, we want to extract
> >>> +     the value under !may_be_zero to get the compact bound which also well fits
> >>> +     for may_be_zero since we ensure the value for it is const one.  */
> >>> +  if (cand->doloop_p && desc->may_be_zero && !integer_zerop
> >>> (desc->may_be_zero))
> >>> +    base = fold_build2 (PLUS_EXPR, TREE_TYPE (niter),
> >>> +                       unshare_expr (rewrite_to_non_trapping_overflow (niter)),
> >>> +                       build_int_cst (TREE_TYPE (niter), 1));
> >>> +  tree_to_aff_combination (base, type, val);
> >>> I don't quite follow here.  The iv->base is computed from niter, I
> >>> suppose compact bound is for cheaper candidate initialization?  Why
> >>> it's possible to extract !may_be_zero niter for may_be_zero here?  The
> >>> niter under !may_be_zero has no indication about the real niter under
> >>> may_be_zero.
> >>>
> >>
> >> As you note below, the cand_value for doloop would be zero, but for the case
> >> may_be_zero set, the current calculation would take care of the whole niter
> >> expression including the cond_expr introduced by may_be_zero check, it's
> >> unexpected.  The purpose is to use the value under condition !may_be_zero
> >> for the calculation, and yes, to get expected zero finally.
> >>
> >>> -  cand_value_at (loop, cand, use->stmt, desc->niter, &bnd);
> >>> +  cand_value_at (loop, cand, use->stmt, desc, &bnd);
> >>> If I understand correctly, doloop use/cand will only be
> >>> identified/added for single exit loop, and there will be only one
> >>> cond(doloop) iv_use and only one doloop cand for doloop loop.  So the
> >>> cand_value at niter at use position would be 0.  If that's the case,
> >>> we can skip calling cand_value_at here for doloop cand.  The change to
> >>> cand_value_at would be unnecessary neither.
> >>>
> >>
> >> Exactly, I'll add the early return with zero bound for doloop.
> >>
> >>> -          expensive.  */
> >>> -  if (!integer_zerop (desc->may_be_zero))
> >>> +          expensive.
> >>> +
> >>> +     For doloop candidate, we have considered MAY_BE_ZERO for IV base, need to
> >>> +     support MAY_BE_ZERO ? 0 : NITER, so simply bypass this check.  */
> >>> +  if (!integer_zerop (desc->may_be_zero) && !cand->doloop_p)
> >>>      return iv_elimination_compare_lt (data, cand, comp, desc);
> >>> And we can early return before this?
> >>>
> >>
> >> OK.
> >>
> >>> +  if (may_be_zero)
> >>> +    {
> >>> +      if (COMPARISON_CLASS_P (may_be_zero))
> >>> +       {
> >>> +         niter = fold_build3 (COND_EXPR, ntype, may_be_zero,
> >>> +                              build_int_cst (ntype, 0),
> >>> +                              rewrite_to_non_trapping_overflow (niter));
> >>> +       }
> >>> +      /* Don't try to obtain the iteration count expression when may_be_zero is
> >>> +        integer_nonzerop (actually iteration count is one) or else.  */
> >>> +      else
> >>> +       return;
> >>> +    }
> >>> +
> >>> +  tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
> >>> +                          build_int_cst (ntype, 1));
> >>> niter is the number of latch executions, so niter + 1 could wrap here,
> >>> but guess it's not a problem the similar issue is not handled in
> >>> vectorizer neither.
> >>>
> >>
> >> OK.
> >>
> >>> +  unsigned n_old = data->regs_used, n_spr_for_doloop = 0;
> >>> +  /* If target supports count register for doloop, it doesn't take GPR.  */
> >>> +  if (targetm.have_count_reg_decr_p)
> >>> +    n_spr_for_doloop = n_doloop_cands;
> >>> +  unsigned n_new = n_invs + n_cands - n_spr_for_doloop;
> >>> Not necessary.  See below.
> >>
> >>> -  cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands);
> >>> +  cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands,
> >>> +                                       ivs->n_doloop_cands);
> >>> Also.
> >>>
> >>>        ivs->n_cands--;
> >>> +      if (cp->cand->doloop_p)
> >>> +       ivs->n_doloop_cands--;
> >>>
> >>>           ivs->n_cands++;
> >>> +         if (cp->cand->doloop_p)
> >>> +           ivs->n_doloop_cands++;
> >>> You can just book n_cands under condition !cp->cand->doloop_p.
> >>
> >> If my understanding is correct, you are suggesting the code like:
> >>
> >> if (!cp->cand->doloop_p)
> >>   ivs->n_cands++;
> >>
> >> But I'm afraid that it can NOT satisfy the need in function
> >> ivopts_estimate_reg_pressure.  As the comments, "if target supports
> >> count register for doloop it doesn't take GPR.".  If we make doloop
> >> cand invisible in n_cands, it's fine for target with count register,
> >> but we may miss to count them on targets without count register.
> > Why not one more step do checks:
> > if (!cp->cand->doloop_p || !targetm.have_count_reg_decr_p)
> >   ivs->n_cands++;
> >
>
> Yes, it works.  Thanks!
>
> The new patch addressing the comments is attached.
> Could you please have a look again?  Thanks in advance!
Thanks for working on this.  A bit more nit-pickings.

-    add_candidate_1 (data, base, step, important,
-                    IP_NORMAL, use, NULL, orig_iv);
-  if (ip_end_pos (data->current_loop)
+    add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL, doloop,
+                    orig_iv);
+  if (!doloop && ip_end_pos (data->current_loop)
Could you add some comments elaborating why ip_end_pos candidate
shouldn't be added for doloop case?  Because the increment position is
wrong.

Also if you make doloop the last default parameter of add_candidate_1,
you can save more unnecessary changes to calls to add_candidate?

-    cost = get_computation_cost (data, use, cand, false,
-                                &inv_vars, NULL, &inv_expr);
+    {
+      cost = get_computation_cost (data, use, cand, false, &inv_vars, NULL,
+                                  &inv_expr);
+      if (cand->doloop_p)
+       cost += targetm.doloop_cost_for_generic;
+    }
This adjustment

   cost = get_computation_cost (data, use, cand, true,
                               &inv_vars, &can_autoinc, &inv_expr);

+  if (cand->doloop_p)
+    cost += targetm.doloop_cost_for_address;
+
and this adjustment can be moved into get_computation_cost where all
cost adjustments are done.

+      /* For doloop candidate/use pair, adjust to zero cost.  */
+      if (group->doloop_p && cand->doloop_p)
+       cost = no_cost;
Note above code handles comparing against zero case and decreases the
cost by one (which prefers the same kind candidate as doloop one),
it's very possible to have -1 cost for doloop cand here.  how about
just set to no_cost if it's positive?  Your call.

+/* For the targets which support doloop, to predict whether later RTL doloop
+   transformation will perform on this loop, further detect the doloop use and
+   mark the flag doloop_use_p if predicted.  */
+
+void
+predict_and_process_doloop (struct ivopts_data *data)
A better name here? Sorry I don't have another candidate in mind...

+  data->doloop_use_p = false;
This can be moved to the beginning of above
'predict_and_process_doloop' function.

Lastly, could you please add some brief description/comment about
doloop handling as a subsection in the file head comment?

Otherwise, the ivopt changes look good to me.

Thanks,
bin

>
>
> Kewen
>
> ---------
>
> gcc/ChangeLog
>
> 2019-08-22  Kewen Lin  <linkw@gcc.gnu.org>
>
>         PR middle-end/80791
>         * config/rs6000/rs6000.c (TARGET_HAVE_COUNT_REG_DECR_P): New macro.
>         (TARGET_DOLOOP_COST_FOR_GENERIC): Likewise.
>         (TARGET_DOLOOP_COST_FOR_ADDRESS): Likewise.
>         * target.def (have_count_reg_decr_p): New hook.
>         (doloop_cost_for_generic): Likewise.
>         (doloop_cost_for_address): Likewise.
>         * doc/tm.texi.in (TARGET_HAVE_COUNT_REG_DECR_P): Likewise.
>         (TARGET_DOLOOP_COST_FOR_GENERIC): Likewise.
>         (TARGET_DOLOOP_COST_FOR_ADDRESS): Likewise.
>         * doc/tm.texi: Regenerate.
>         * tree-ssa-loop-ivopts.c (comp_cost::operator+=): Consider infinite cost
>         addend.
>         (record_group): Init doloop_p.
>         (add_candidate_1): Add optional argument doloop, change the handlings
>         accordingly.
>         (add_candidate): Likewise.
>         (add_iv_candidate_for_biv): Update the call to add_candidate.
>         (generic_predict_doloop_p): Update attribute.
>         (force_expr_to_var_cost): Add costing for expressions COND_EXPR/LT_EXPR/
>         LE_EXPR/GT_EXPR/GE_EXPR/EQ_EXPR/NE_EXPR/UNORDERED_EXPR/ORDERED_EXPR/
>         UNLT_EXPR/UNLE_EXPR/UNGT_EXPR/UNGE_EXPR/UNEQ_EXPR/LTGT_EXPR/MAX_EXPR/
>         MIN_EXPR.
>         (determine_group_iv_cost_generic): Update for doloop IV cand.
>         (determine_group_iv_cost_address): Likewise.
>         (determine_group_iv_cost_cond): Likewise.
>         (determine_iv_cost): Likewise.
>         (ivopts_estimate_reg_pressure): Likewise.
>         (may_eliminate_iv): Likewise.
>         (add_iv_candidate_for_doloop): New function.
>         (find_iv_candidates): Call function add_iv_candidate_for_doloop.
>         (iv_ca_set_no_cp): Update for doloop IV cand.
>         (iv_ca_set_cp): Likewise.
>         (iv_ca_dump): Dump register cost.
>         (find_doloop_use): New function.
>         (predict_and_process_doloop): Likewise.
>         (tree_ssa_iv_optimize_loop): Call function predict_and_process_doloop.
>
> gcc/testsuite/ChangeLog
>
> 2019-08-22  Kewen Lin  <linkw@gcc.gnu.org>
>
>         PR middle-end/80791
>         * gcc.dg/tree-ssa/ivopts-3.c: Adjust for doloop change.
>         * gcc.dg/tree-ssa/ivopts-lt.c: Likewise.
>         * gcc.dg/tree-ssa/pr32044.c: Likewise.
>
Kewen.Lin Aug. 23, 2019, 8:27 a.m. UTC | #6
Hi Bin

on 2019/8/23 上午10:19, Bin.Cheng wrote:
> On Thu, Aug 22, 2019 at 3:09 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>
>> Hi Bin,
>>
>> on 2019/8/22 下午1:46, Bin.Cheng wrote:
>>> On Thu, Aug 22, 2019 at 11:18 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>>
>>>> Hi Bin,
>>>>
>>>> Thanks for your time!
>>>>
>>>> on 2019/8/21 下午8:32, Bin.Cheng wrote:
>>>>> On Wed, Aug 14, 2019 at 3:23 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>>>>>>
>>>>>> Hi!
>>>>>>
>>>>>> Comparing to the previous versions of implementation mainly based on the
>>>>>> existing IV cands but zeroing the related group/use cost, this new one is based
>>>>>> on Richard and Segher's suggestion introducing one doloop dedicated IV cand.
>>>>>>
>>>>>> Some key points are listed below:
>>>>>>   1) New field doloop_p in struct iv_cand to indicate doloop dedicated IV cand.
>>>>>>   2) Special name "doloop" assigned.
>>>>>>   3) Doloop IV cand with form (niter+1, +, -1)
>>>>>>   4) For doloop IV cand, no extra one cost like BIV, assign zero cost for step.
>>>>>>   5) Support may_be_zero (regressed PR is in this case), the base of doloop IV
>>>>>>      can be COND_EXPR, add handlings in cand_value_at and may_eliminate_iv.
>>>>>>   6) Add more expr support in force_expr_to_var_cost for reasonable cost
>>>>>>      calculation on the IV base with may_be_zero (like COND_EXPR).
>>>>>>   7) Set zero cost when using doloop IV cand for doloop use.
>>>>>>   8) Add three hooks (should we merge _generic and _address?).
>>>>>>     *) have_count_reg_decr_p, is to indicate the target has special hardware
>>>>>>        count register, we shouldn't consider the impact of doloop IV when
>>>>>>        calculating register pressures.
>>>>>>     *) doloop_cost_for_generic, is the extra cost when using doloop IV cand for
>>>>>>        generic type IV use.
>>>>>>     *) doloop_cost_for_address, is the extra cost when using doloop IV cand for
>>>>>>        address type IV use.

>> The new patch addressing the comments is attached.
>> Could you please have a look again?  Thanks in advance!
> Thanks for working on this.  A bit more nit-pickings.
> 
> -    add_candidate_1 (data, base, step, important,
> -                    IP_NORMAL, use, NULL, orig_iv);
> -  if (ip_end_pos (data->current_loop)
> +    add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL, doloop,
> +                    orig_iv);
> +  if (!doloop && ip_end_pos (data->current_loop)
> Could you add some comments elaborating why ip_end_pos candidate
> shouldn't be added for doloop case?  Because the increment position is
> wrong.
> 
> Also if you make doloop the last default parameter of add_candidate_1,
> you can save more unnecessary changes to calls to add_candidate?
> 
> -    cost = get_computation_cost (data, use, cand, false,
> -                                &inv_vars, NULL, &inv_expr);
> +    {
> +      cost = get_computation_cost (data, use, cand, false, &inv_vars, NULL,
> +                                  &inv_expr);
> +      if (cand->doloop_p)
> +       cost += targetm.doloop_cost_for_generic;
> +    }
> This adjustment
> 
>    cost = get_computation_cost (data, use, cand, true,
>                                &inv_vars, &can_autoinc, &inv_expr);
> 
> +  if (cand->doloop_p)
> +    cost += targetm.doloop_cost_for_address;
> +
> and this adjustment can be moved into get_computation_cost where all
> cost adjustments are done.
> 
> +      /* For doloop candidate/use pair, adjust to zero cost.  */
> +      if (group->doloop_p && cand->doloop_p)
> +       cost = no_cost;
> Note above code handles comparing against zero case and decreases the
> cost by one (which prefers the same kind candidate as doloop one),
> it's very possible to have -1 cost for doloop cand here.  how about
> just set to no_cost if it's positive?  Your call.
> 
> +/* For the targets which support doloop, to predict whether later RTL doloop
> +   transformation will perform on this loop, further detect the doloop use and
> +   mark the flag doloop_use_p if predicted.  */
> +
> +void
> +predict_and_process_doloop (struct ivopts_data *data)
> A better name here? Sorry I don't have another candidate in mind...
> 
> +  data->doloop_use_p = false;
> This can be moved to the beginning of above
> 'predict_and_process_doloop' function.
> 
> Lastly, could you please add some brief description/comment about
> doloop handling as a subsection in the file head comment?
> 
> Otherwise, the ivopt changes look good to me.
> 
> Thanks,
> bin
> 

Thanks for your prompt reply!  I've updated the code as your comments,
the updated version is attached.  Looking forward to your review again.


Thanks,
Kewen

-----

gcc/ChangeLog

2019-08-23  Kewen Lin  <linkw@gcc.gnu.org>

	PR middle-end/80791
	* config/rs6000/rs6000.c (TARGET_HAVE_COUNT_REG_DECR_P): New macro.
	(TARGET_DOLOOP_COST_FOR_GENERIC): Likewise.
	(TARGET_DOLOOP_COST_FOR_ADDRESS): Likewise.
	* target.def (have_count_reg_decr_p): New hook.
	(doloop_cost_for_generic): Likewise.
	(doloop_cost_for_address): Likewise.
	* doc/tm.texi.in (TARGET_HAVE_COUNT_REG_DECR_P): Likewise.
	(TARGET_DOLOOP_COST_FOR_GENERIC): Likewise.
	(TARGET_DOLOOP_COST_FOR_ADDRESS): Likewise.
	* doc/tm.texi: Regenerate.
	* tree-ssa-loop-ivopts.c (comp_cost::operator+=): Consider infinite cost
	addend.
	(record_group): Init doloop_p.
	(add_candidate_1): Add optional argument doloop, change the handlings
	accordingly.
	(add_candidate): Likewise.
	(generic_predict_doloop_p): Update attribute.
	(force_expr_to_var_cost): Add costing for expressions COND_EXPR/LT_EXPR/
	LE_EXPR/GT_EXPR/GE_EXPR/EQ_EXPR/NE_EXPR/UNORDERED_EXPR/ORDERED_EXPR/
	UNLT_EXPR/UNLE_EXPR/UNGT_EXPR/UNGE_EXPR/UNEQ_EXPR/LTGT_EXPR/MAX_EXPR/
	MIN_EXPR.
	(get_computation_cost): Update for doloop IV cand extra cost.	
	(determine_group_iv_cost_cond): Update for doloop IV cand.
	(determine_iv_cost): Likewise.
	(ivopts_estimate_reg_pressure): Likewise.
	(may_eliminate_iv): Update handlings for doloop IV cand.
	(add_iv_candidate_for_doloop): New function.
	(find_iv_candidates): Call function add_iv_candidate_for_doloop.
	(iv_ca_set_no_cp): Update for doloop IV cand.
	(iv_ca_set_cp): Likewise.
	(iv_ca_dump): Dump register cost.
	(find_doloop_use): New function.
	(analyze_and_mark_doloop_use): Likewise.
	(tree_ssa_iv_optimize_loop): Call function analyze_and_mark_doloop_use.

gcc/testsuite/ChangeLog

2019-08-23  Kewen Lin  <linkw@gcc.gnu.org>

	PR middle-end/80791
	* gcc.dg/tree-ssa/ivopts-3.c: Adjust for doloop change.
	* gcc.dg/tree-ssa/ivopts-lt.c: Likewise.
	* gcc.dg/tree-ssa/pr32044.c: Likewise.
diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 6667cd0..5eccbdc 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1912,6 +1912,16 @@ static const struct attribute_spec rs6000_attribute_table[] =
 #undef TARGET_PREDICT_DOLOOP_P
 #define TARGET_PREDICT_DOLOOP_P rs6000_predict_doloop_p
 
+#undef TARGET_HAVE_COUNT_REG_DECR_P
+#define TARGET_HAVE_COUNT_REG_DECR_P true
+
+/* 1000000000 is infinite cost in IVOPTs.  */
+#undef TARGET_DOLOOP_COST_FOR_GENERIC
+#define TARGET_DOLOOP_COST_FOR_GENERIC 1000000000
+
+#undef TARGET_DOLOOP_COST_FOR_ADDRESS
+#define TARGET_DOLOOP_COST_FOR_ADDRESS 1000000000
+
 #undef TARGET_ATOMIC_ASSIGN_EXPAND_FENV
 #define TARGET_ATOMIC_ASSIGN_EXPAND_FENV rs6000_atomic_assign_expand_fenv
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index c2aa4d0..9f3a08a 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11618,6 +11618,29 @@ loops, and will help ivopts to make some decisions.
 The default version of this hook returns false.
 @end deftypefn
 
+@deftypevr {Target Hook} bool TARGET_HAVE_COUNT_REG_DECR_P
+Return true if the target supports hardware count register for decrement
+and branch.  This count register can't be used as general register since
+moving to/from a general register from/to it is very expensive.
+The default value is false.
+@end deftypevr
+
+@deftypevr {Target Hook} int64_t TARGET_DOLOOP_COST_FOR_GENERIC
+IVOPTs introduces one doloop dedicated IV candidate, this hook offers
+ target owner a way to adjust cost when selecting doloop IV candidate for a
+ generic IV use.  At calcuation, this value will be added on normal cost
+ already calculated by current implementation.
+The default value is zero.
+@end deftypevr
+
+@deftypevr {Target Hook} int64_t TARGET_DOLOOP_COST_FOR_ADDRESS
+IVOPTs introduces one doloop dedicated IV candidate, this hook offers
+ target owner a way to adjust cost when selecting doloop IV candidate for an
+ address IV use.  At calcuation, this value will be added on normal cost
+ already calculated by current implementation.
+The default value is zero.
+@end deftypevr
+
 @deftypefn {Target Hook} bool TARGET_CAN_USE_DOLOOP_P (const widest_int @var{&iterations}, const widest_int @var{&iterations_max}, unsigned int @var{loop_depth}, bool @var{entered_at_top})
 Return true if it is possible to use low-overhead loops (@code{doloop_end}
 and @code{doloop_begin}) for a particular loop.  @var{iterations} gives the
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index b4d57b8..4346773 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -7946,6 +7946,12 @@ to by @var{ce_info}.
 
 @hook TARGET_PREDICT_DOLOOP_P
 
+@hook TARGET_HAVE_COUNT_REG_DECR_P
+
+@hook TARGET_DOLOOP_COST_FOR_GENERIC
+
+@hook TARGET_DOLOOP_COST_FOR_ADDRESS
+
 @hook TARGET_CAN_USE_DOLOOP_P
 
 @hook TARGET_INVALID_WITHIN_DOLOOP
diff --git a/gcc/target.def b/gcc/target.def
index 71b6972..69e2844 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -4246,6 +4246,32 @@ The default version of this hook returns false.",
  bool, (struct loop *loop),
  default_predict_doloop_p)
 
+DEFHOOKPOD
+(have_count_reg_decr_p,
+ "Return true if the target supports hardware count register for decrement\n\
+and branch.  This count register can't be used as general register since\n\
+moving to/from a general register from/to it is very expensive.\n\
+The default value is false.",
+ bool, false)
+
+DEFHOOKPOD
+(doloop_cost_for_generic,
+ "IVOPTs introduces one doloop dedicated IV candidate, this hook offers\n\
+ target owner a way to adjust cost when selecting doloop IV candidate for a\n\
+ generic IV use.  At calcuation, this value will be added on normal cost\n\
+ already calculated by current implementation.\n\
+The default value is zero.",
+ int64_t, 0)
+
+DEFHOOKPOD
+(doloop_cost_for_address,
+ "IVOPTs introduces one doloop dedicated IV candidate, this hook offers\n\
+ target owner a way to adjust cost when selecting doloop IV candidate for an\n\
+ address IV use.  At calcuation, this value will be added on normal cost\n\
+ already calculated by current implementation.\n\
+The default value is zero.",
+ int64_t, 0)
+
 DEFHOOK
 (can_use_doloop_p,
  "Return true if it is possible to use low-overhead loops (@code{doloop_end}\n\
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-3.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-3.c
index 214e6a7..ce4b1d0 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-3.c
@@ -10,4 +10,6 @@ int main (void)
     f2 ();
 }
 
-/* { dg-final { scan-tree-dump-times "!= 0" 5 "ivopts" } }  */
+/* { dg-final { scan-tree-dump-times "!= 0" 5 "ivopts" { target { ! powerpc*-*-* } } } } */
+/* More debug information emitted for doloop on powerpc.  */
+/* { dg-final { scan-tree-dump-times "!= 0" 6 "ivopts" { target { powerpc*-*-* } } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c
index 7d5859b..71d7f67 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c
@@ -17,6 +17,7 @@ f1 (char *p, uintptr_t i, uintptr_t n)
   while (i < n);
 }
 
-/* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts" } } */
-/* { dg-final { scan-tree-dump-times "PHI <p_" 1 "ivopts"} } */
-/* { dg-final { scan-tree-dump-times "p_\[0-9\]* <" 1 "ivopts" } } */
+/* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts" { target { ! powerpc*-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "PHI" 2 "ivopts" { target { powerpc*-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "PHI <p_" 1 "ivopts" { target { ! powerpc*-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "p_\[0-9\]* <" 1 "ivopts" { target { ! powerpc*-*-* } } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr32044.c b/gcc/testsuite/gcc.dg/tree-ssa/pr32044.c
index 8a8977a..06c27b0 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr32044.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr32044.c
@@ -1,6 +1,10 @@
 /* { dg-do compile } */
 /* { dg-options "-O2 -fdump-tree-optimized" } */
 
+/* For powerpc, disable doloop IV cand generation in IVOPTs to avoid unexpected
+   division operation for its base setup.  */
+/* { dg-additional-options "-fno-branch-count-reg" { target { powerpc*-*-* } } } */
+
 int foo (int n)
 {
   while (n >= 45)
diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 530ea4a..88e7890 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -64,7 +64,30 @@ along with GCC; see the file COPYING3.  If not see
    All of this is done loop by loop.  Doing it globally is theoretically
    possible, it might give a better performance and it might enable us
    to decide costs more precisely, but getting all the interactions right
-   would be complicated.  */
+   would be complicated.
+
+   For the targets supporting low-overhead loops, IVOPTs has to take care of
+   the loops which will probably be transformed in RTL doloop optimization,
+   to try to make selected IV candidate set optimal.  The process of doloop
+   support includes:
+
+   1) Analyze the current loop will be transformed to doloop or not, find and
+      mark its compare type IV use as doloop use (iv_group field doloop_p), and
+      set flag doloop_use_p of ivopts_data to notify subsequent processings on
+      doloop.  See analyze_and_mark_doloop_use and its callees for the details.
+      The target hook predict_doloop_p can be used for target specific checks.
+
+   2) Add one doloop dedicated IV cand {(may_be_zero ? 1 : (niter + 1)), +, -1},
+      set flag doloop_p of iv_cand, step cost is set as zero and no extra cost
+      like biv.  For cost determination between doloop IV cand and IV use, the
+      target hooks doloop_cost_for_generic and doloop_cost_for_address are
+      provided to add on extra costs for generic type and address type IV use.
+      Zero cost is assigned to the pair between doloop IV cand and doloop IV
+      use, and bound zero is set for IV elimination.
+
+   3) With the cost setting in step 2), the current cost model based IV
+      selection algorithm will process as usual, pick up doloop dedicated IV if
+      profitable.  */
 
 #include "config.h"
 #include "system.h"
@@ -275,6 +298,9 @@ comp_cost::operator+= (comp_cost cost)
 comp_cost
 comp_cost::operator+= (HOST_WIDE_INT c)
 {
+  if (c >= INFTY)
+    this->cost = INFTY;
+
   if (infinite_cost_p ())
     return *this;
 
@@ -399,6 +425,8 @@ struct iv_group
   struct cost_pair *cost_map;
   /* The selected candidate for the group.  */
   struct iv_cand *selected;
+  /* To indicate this is a doloop use group.  */
+  bool doloop_p;
   /* Uses in the group.  */
   vec<struct iv_use *> vuses;
 };
@@ -439,6 +467,7 @@ struct iv_cand
 			   be hoisted out of loop.  */
   struct iv *orig_iv;	/* The original iv if this cand is added from biv with
 			   smaller type.  */
+  bool doloop_p;	/* Whether this is a doloop candidate.  */
 };
 
 /* Hashtable entry for common candidate derived from iv uses.  */
@@ -612,6 +641,9 @@ struct ivopts_data
 
   /* Whether the loop body can only be exited via single exit.  */
   bool loop_single_exit_p;
+
+  /* Whether the loop has doloop comparison use.  */
+  bool doloop_use_p;
 };
 
 /* An assignment of iv candidates to uses.  */
@@ -1528,6 +1560,7 @@ record_group (struct ivopts_data *data, enum use_type type)
   group->type = type;
   group->related_cands = BITMAP_ALLOC (NULL);
   group->vuses.create (1);
+  group->doloop_p = false;
 
   data->vgroups.safe_push (group);
   return group;
@@ -3017,10 +3050,10 @@ get_loop_invariant_expr (struct ivopts_data *data, tree inv_expr)
    replacement of the final value of the iv by a direct computation.  */
 
 static struct iv_cand *
-add_candidate_1 (struct ivopts_data *data,
-		 tree base, tree step, bool important, enum iv_position pos,
-		 struct iv_use *use, gimple *incremented_at,
-		 struct iv *orig_iv = NULL)
+add_candidate_1 (struct ivopts_data *data, tree base, tree step, bool important,
+		 enum iv_position pos, struct iv_use *use,
+		 gimple *incremented_at, struct iv *orig_iv = NULL,
+		 bool doloop = false)
 {
   unsigned i;
   struct iv_cand *cand = NULL;
@@ -3079,11 +3112,15 @@ add_candidate_1 (struct ivopts_data *data,
       cand->pos = pos;
       if (pos != IP_ORIGINAL)
 	{
-	  cand->var_before = create_tmp_var_raw (TREE_TYPE (base), "ivtmp");
+	  if (doloop)
+	    cand->var_before = create_tmp_var_raw (TREE_TYPE (base), "doloop");
+	  else
+	    cand->var_before = create_tmp_var_raw (TREE_TYPE (base), "ivtmp");
 	  cand->var_after = cand->var_before;
 	}
       cand->important = important;
       cand->incremented_at = incremented_at;
+      cand->doloop_p = doloop;
       data->vcands.safe_push (cand);
 
       if (!poly_int_tree_p (step))
@@ -3116,6 +3153,7 @@ add_candidate_1 (struct ivopts_data *data,
     }
 
   cand->important |= important;
+  cand->doloop_p |= doloop;
 
   /* Relate candidate to the group for which it is added.  */
   if (use)
@@ -3209,14 +3247,16 @@ add_autoinc_candidates (struct ivopts_data *data, tree base, tree step,
    the end of loop.  */
 
 static void
-add_candidate (struct ivopts_data *data,
-	       tree base, tree step, bool important, struct iv_use *use,
-	       struct iv *orig_iv = NULL)
+add_candidate (struct ivopts_data *data, tree base, tree step, bool important,
+	       struct iv_use *use, struct iv *orig_iv = NULL,
+	       bool doloop = false)
 {
   if (ip_normal_pos (data->current_loop))
-    add_candidate_1 (data, base, step, important,
-		     IP_NORMAL, use, NULL, orig_iv);
-  if (ip_end_pos (data->current_loop)
+    add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL, orig_iv,
+		     doloop);
+  /* Exclude doloop candidate here since it requires decrement then comparison
+     and jump, the IP_END position doesn't match.  */
+  if (!doloop && ip_end_pos (data->current_loop)
       && allow_ip_end_pos_p (data->current_loop))
     add_candidate_1 (data, base, step, important, IP_END, use, NULL, orig_iv);
 }
@@ -3724,7 +3764,7 @@ prepare_decl_rtl (tree *expr_p, int *ws, void *data)
    Some RTL specific checks seems unable to be checked in gimple, if any new
    checks or easy checks _are_ missing here, please add them.  */
 
-static bool ATTRIBUTE_UNUSED
+static bool
 generic_predict_doloop_p (struct ivopts_data *data)
 {
   struct loop *loop = data->current_loop;
@@ -4177,6 +4217,36 @@ force_expr_to_var_cost (tree expr, bool speed)
       STRIP_NOPS (op0);
       op1 = NULL_TREE;
       break;
+    /* See add_iv_candidate_for_doloop, for doloop may_be_zero case, we
+       introduce COND_EXPR for IV base, need to support better cost estimation
+       for this COND_EXPR and tcc_comparison.  */
+    case COND_EXPR:
+      op0 = TREE_OPERAND (expr, 1);
+      STRIP_NOPS (op0);
+      op1 = TREE_OPERAND (expr, 2);
+      STRIP_NOPS (op1);
+      break;
+    case LT_EXPR:
+    case LE_EXPR:
+    case GT_EXPR:
+    case GE_EXPR:
+    case EQ_EXPR:
+    case NE_EXPR:
+    case UNORDERED_EXPR:
+    case ORDERED_EXPR:
+    case UNLT_EXPR:
+    case UNLE_EXPR:
+    case UNGT_EXPR:
+    case UNGE_EXPR:
+    case UNEQ_EXPR:
+    case LTGT_EXPR:
+    case MAX_EXPR:
+    case MIN_EXPR:
+      op0 = TREE_OPERAND (expr, 0);
+      STRIP_NOPS (op0);
+      op1 = TREE_OPERAND (expr, 1);
+      STRIP_NOPS (op1);
+      break;
 
     default:
       /* Just an arbitrary value, FIXME.  */
@@ -4258,6 +4328,35 @@ force_expr_to_var_cost (tree expr, bool speed)
     case RSHIFT_EXPR:
       cost = comp_cost (add_cost (speed, mode), 0);
       break;
+    case COND_EXPR:
+      op0 = TREE_OPERAND (expr, 0);
+      STRIP_NOPS (op0);
+      if (op0 == NULL_TREE || TREE_CODE (op0) == SSA_NAME
+	  || CONSTANT_CLASS_P (op0))
+	cost = no_cost;
+      else
+	cost = force_expr_to_var_cost (op0, speed);
+      break;
+    case LT_EXPR:
+    case LE_EXPR:
+    case GT_EXPR:
+    case GE_EXPR:
+    case EQ_EXPR:
+    case NE_EXPR:
+    case UNORDERED_EXPR:
+    case ORDERED_EXPR:
+    case UNLT_EXPR:
+    case UNLE_EXPR:
+    case UNGT_EXPR:
+    case UNGE_EXPR:
+    case UNEQ_EXPR:
+    case LTGT_EXPR:
+    case MAX_EXPR:
+    case MIN_EXPR:
+      /* Simply use 1.5 * add cost for now, FIXME if there is some more accurate
+	 cost evaluation way.  */
+      cost = comp_cost (1.5 * add_cost (speed, mode), 0);
+      break;
 
     default:
       gcc_unreachable ();
@@ -4634,7 +4733,10 @@ get_computation_cost (struct ivopts_data *data, struct iv_use *use,
     {
       cost = get_address_cost (data, use, cand, &aff_inv, &aff_var, ratio,
 			       inv_vars, inv_expr, can_autoinc, speed);
-      return get_scaled_computation_cost_at (data, at, cost);
+      cost = get_scaled_computation_cost_at (data, at, cost);
+      /* For doloop IV cand, add on the extra cost.  */
+      cost += cand->doloop_p ? targetm.doloop_cost_for_address : 0;
+      return cost;
     }
 
   bool simple_inv = (aff_combination_const_p (&aff_inv)
@@ -4684,6 +4786,10 @@ get_computation_cost (struct ivopts_data *data, struct iv_use *use,
   if (comp_inv && !integer_zerop (comp_inv))
     cost += add_cost (speed, TYPE_MODE (utype));
 
+  /* For doloop IV cand, add on the extra cost.  */
+  if (cand->doloop_p && use->type == USE_NONLINEAR_EXPR)
+    cost += targetm.doloop_cost_for_generic;
+
   return get_scaled_computation_cost_at (data, at, cost);
 }
 
@@ -5142,6 +5248,15 @@ may_eliminate_iv (struct ivopts_data *data,
 	}
     }
 
+  /* For doloop IV cand, the bound would be zero.  It's safe whether
+     may_be_zero set or not.  */
+  if (cand->doloop_p)
+    {
+      *bound = build_int_cst (TREE_TYPE (cand->iv->base), 0);
+      *comp = iv_elimination_compare (data, use);
+      return true;
+    }
+
   cand_value_at (loop, cand, use->stmt, desc->niter, &bnd);
 
   *bound = fold_convert (TREE_TYPE (cand->iv->base),
@@ -5264,6 +5379,9 @@ determine_group_iv_cost_cond (struct ivopts_data *data,
       inv_vars = inv_vars_elim;
       inv_vars_elim = NULL;
       inv_expr = inv_expr_elim;
+      /* For doloop candidate/use pair, adjust to zero cost.  */
+      if (group->doloop_p && cand->doloop_p && elim_cost.cost > no_cost.cost)
+	cost = no_cost;
     }
   else
     {
@@ -5390,6 +5508,42 @@ relate_compare_use_with_all_cands (struct ivopts_data *data)
     }
 }
 
+/* Add one doloop dedicated IV candidate:
+     - Base is (may_be_zero ? 1 : (niter + 1)).
+     - Step is -1.  */
+
+static void
+add_iv_candidate_for_doloop (struct ivopts_data *data)
+{
+  tree_niter_desc *niter_desc = niter_for_single_dom_exit (data);
+  gcc_assert (niter_desc && niter_desc->assumptions);
+
+  tree niter = niter_desc->niter;
+  tree ntype = TREE_TYPE (niter);
+  gcc_assert (TREE_CODE (ntype) == INTEGER_TYPE);
+
+  tree may_be_zero = niter_desc->may_be_zero;
+  if (may_be_zero && integer_zerop (may_be_zero))
+    may_be_zero = NULL_TREE;
+  if (may_be_zero)
+    {
+      if (COMPARISON_CLASS_P (may_be_zero))
+	{
+	  niter = fold_build3 (COND_EXPR, ntype, may_be_zero,
+			       build_int_cst (ntype, 0),
+			       rewrite_to_non_trapping_overflow (niter));
+	}
+      /* Don't try to obtain the iteration count expression when may_be_zero is
+	 integer_nonzerop (actually iteration count is one) or else.  */
+      else
+	return;
+    }
+
+  tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
+			   build_int_cst (ntype, 1));
+  add_candidate (data, base, build_int_cst (ntype, -1), true, NULL, NULL, true);
+}
+
 /* Finds the candidates for the induction variables.  */
 
 static void
@@ -5398,6 +5552,10 @@ find_iv_candidates (struct ivopts_data *data)
   /* Add commonly used ivs.  */
   add_standard_iv_candidates (data);
 
+  /* Add doloop dedicate ivs.  */
+  if (data->doloop_use_p)
+    add_iv_candidate_for_doloop (data);
+
   /* Add old induction variables.  */
   add_iv_candidate_for_bivs (data);
 
@@ -5578,16 +5736,21 @@ determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand)
      or a const set.  */
   if (cost_base.cost == 0)
     cost_base.cost = COSTS_N_INSNS (1);
-  cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
-
+  /* Doloop decrement should be considered as zero cost.  */
+  if (cand->doloop_p)
+    cost_step = 0;
+  else
+    cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
 
   /* Prefer the original ivs unless we may gain something by replacing it.
      The reason is to make debugging simpler; so this is not relevant for
      artificial ivs created by other optimization passes.  */
-  if (cand->pos != IP_ORIGINAL
-      || !SSA_NAME_VAR (cand->var_before)
-      || DECL_ARTIFICIAL (SSA_NAME_VAR (cand->var_before)))
+  if ((cand->pos != IP_ORIGINAL
+       || !SSA_NAME_VAR (cand->var_before)
+       || DECL_ARTIFICIAL (SSA_NAME_VAR (cand->var_before)))
+      /* Prefer doloop as well.  */
+      && !cand->doloop_p)
     cost++;
 
   /* Prefer not to insert statements into latch unless there are some
@@ -5832,7 +5995,8 @@ iv_ca_set_no_cp (struct ivopts_data *data, struct iv_ca *ivs,
   if (ivs->n_cand_uses[cid] == 0)
     {
       bitmap_clear_bit (ivs->cands, cid);
-      ivs->n_cands--;
+      if (!cp->cand->doloop_p || !targetm.have_count_reg_decr_p)
+	ivs->n_cands--;
       ivs->cand_cost -= cp->cand->cost;
       iv_ca_set_remove_invs (ivs, cp->cand->inv_vars, ivs->n_inv_var_uses);
       iv_ca_set_remove_invs (ivs, cp->cand->inv_exprs, ivs->n_inv_expr_uses);
@@ -5889,7 +6053,8 @@ iv_ca_set_cp (struct ivopts_data *data, struct iv_ca *ivs,
       if (ivs->n_cand_uses[cid] == 1)
 	{
 	  bitmap_set_bit (ivs->cands, cid);
-	  ivs->n_cands++;
+	  if (!cp->cand->doloop_p || !targetm.have_count_reg_decr_p)
+	    ivs->n_cands++;
 	  ivs->cand_cost += cp->cand->cost;
 	  iv_ca_set_add_invs (ivs, cp->cand->inv_vars, ivs->n_inv_var_uses);
 	  iv_ca_set_add_invs (ivs, cp->cand->inv_exprs, ivs->n_inv_expr_uses);
@@ -6134,6 +6299,8 @@ iv_ca_dump (struct ivopts_data *data, FILE *file, struct iv_ca *ivs)
 
   fprintf (file, "  cost: %" PRId64 " (complexity %d)\n", cost.cost,
 	   cost.complexity);
+  fprintf (file, "  reg_cost: %d\n",
+	   ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands));
   fprintf (file, "  cand_cost: %" PRId64 "\n  cand_group_cost: "
 	   "%" PRId64 " (complexity %d)\n", ivs->cand_cost,
 	   ivs->cand_use_cost.cost, ivs->cand_use_cost.complexity);
@@ -7568,6 +7735,77 @@ determine_scaling_factor (struct ivopts_data *data, basic_block *body)
     }
 }
 
+/* Find doloop comparison use and set its doloop_p on if found.  */
+
+static bool
+find_doloop_use (struct ivopts_data *data)
+{
+  struct loop *loop = data->current_loop;
+
+  for (unsigned i = 0; i < data->vgroups.length (); i++)
+    {
+      struct iv_group *group = data->vgroups[i];
+      if (group->type == USE_COMPARE)
+	{
+	  gcc_assert (group->vuses.length () == 1);
+	  struct iv_use *use = group->vuses[0];
+	  gimple *stmt = use->stmt;
+	  if (gimple_code (stmt) == GIMPLE_COND)
+	    {
+	      basic_block bb = gimple_bb (stmt);
+	      edge true_edge, false_edge;
+	      extract_true_false_edges_from_block (bb, &true_edge, &false_edge);
+	      /* This comparison is used for loop latch.  Require latch is empty
+		 for now.  */
+	      if ((loop->latch == true_edge->dest
+		   || loop->latch == false_edge->dest)
+		  && empty_block_p (loop->latch))
+		{
+		  group->doloop_p = true;
+		  if (dump_file && (dump_flags & TDF_DETAILS))
+		    {
+		      fprintf (dump_file, "Doloop cmp iv use: ");
+		      print_gimple_stmt (dump_file, stmt, TDF_DETAILS);
+		    }
+		  return true;
+		}
+	    }
+	}
+    }
+
+  return false;
+}
+
+/* For the targets which support doloop, to predict whether later RTL doloop
+   transformation will perform on this loop, further detect the doloop use and
+   mark the flag doloop_use_p if predicted.  */
+
+void
+analyze_and_mark_doloop_use (struct ivopts_data *data)
+{
+  data->doloop_use_p = false;
+
+  if (!flag_branch_on_count_reg)
+    return;
+
+  if (!generic_predict_doloop_p (data))
+    return;
+
+  if (find_doloop_use (data))
+    {
+      data->doloop_use_p = true;
+      if (dump_file && (dump_flags & TDF_DETAILS))
+	{
+	  struct loop *loop = data->current_loop;
+	  fprintf (dump_file,
+		   "Predict loop %d can perform"
+		   " doloop optimization later.\n",
+		   loop->num);
+	  flow_loop_dump (loop, dump_file, NULL, 1);
+	}
+    }
+}
+
 /* Optimizes the LOOP.  Returns true if anything changed.  */
 
 static bool
@@ -7622,6 +7860,9 @@ tree_ssa_iv_optimize_loop (struct ivopts_data *data, struct loop *loop,
   /* Determine cost scaling factor for basic blocks in loop.  */
   determine_scaling_factor (data, body);
 
+  /* Analyze doloop possibility and mark the doloop use if predicted.  */
+  analyze_and_mark_doloop_use (data);
+
   /* Finds candidates for the induction variables (item 2).  */
   find_iv_candidates (data);
Bin.Cheng Aug. 23, 2019, 9:43 a.m. UTC | #7
On Fri, Aug 23, 2019 at 4:27 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
>
> Hi Bin
>
> on 2019/8/23 上午10:19, Bin.Cheng wrote:
> > On Thu, Aug 22, 2019 at 3:09 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>
> >> Hi Bin,
> >>
> >> on 2019/8/22 下午1:46, Bin.Cheng wrote:
> >>> On Thu, Aug 22, 2019 at 11:18 AM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>>>
> >>>> Hi Bin,
> >>>>
> >>>> Thanks for your time!
> >>>>
> >>>> on 2019/8/21 下午8:32, Bin.Cheng wrote:
> >>>>> On Wed, Aug 14, 2019 at 3:23 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
> >>>>>>
> >>>>>> Hi!
> >>>>>>
> >>>>>> Comparing to the previous versions of implementation mainly based on the
> >>>>>> existing IV cands but zeroing the related group/use cost, this new one is based
> >>>>>> on Richard and Segher's suggestion introducing one doloop dedicated IV cand.
> >>>>>>
> >>>>>> Some key points are listed below:
> >>>>>>   1) New field doloop_p in struct iv_cand to indicate doloop dedicated IV cand.
> >>>>>>   2) Special name "doloop" assigned.
> >>>>>>   3) Doloop IV cand with form (niter+1, +, -1)
> >>>>>>   4) For doloop IV cand, no extra one cost like BIV, assign zero cost for step.
> >>>>>>   5) Support may_be_zero (regressed PR is in this case), the base of doloop IV
> >>>>>>      can be COND_EXPR, add handlings in cand_value_at and may_eliminate_iv.
> >>>>>>   6) Add more expr support in force_expr_to_var_cost for reasonable cost
> >>>>>>      calculation on the IV base with may_be_zero (like COND_EXPR).
> >>>>>>   7) Set zero cost when using doloop IV cand for doloop use.
> >>>>>>   8) Add three hooks (should we merge _generic and _address?).
> >>>>>>     *) have_count_reg_decr_p, is to indicate the target has special hardware
> >>>>>>        count register, we shouldn't consider the impact of doloop IV when
> >>>>>>        calculating register pressures.
> >>>>>>     *) doloop_cost_for_generic, is the extra cost when using doloop IV cand for
> >>>>>>        generic type IV use.
> >>>>>>     *) doloop_cost_for_address, is the extra cost when using doloop IV cand for
> >>>>>>        address type IV use.
>
> >> The new patch addressing the comments is attached.
> >> Could you please have a look again?  Thanks in advance!
> > Thanks for working on this.  A bit more nit-pickings.
> >
> > -    add_candidate_1 (data, base, step, important,
> > -                    IP_NORMAL, use, NULL, orig_iv);
> > -  if (ip_end_pos (data->current_loop)
> > +    add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL, doloop,
> > +                    orig_iv);
> > +  if (!doloop && ip_end_pos (data->current_loop)
> > Could you add some comments elaborating why ip_end_pos candidate
> > shouldn't be added for doloop case?  Because the increment position is
> > wrong.
> >
> > Also if you make doloop the last default parameter of add_candidate_1,
> > you can save more unnecessary changes to calls to add_candidate?
> >
> > -    cost = get_computation_cost (data, use, cand, false,
> > -                                &inv_vars, NULL, &inv_expr);
> > +    {
> > +      cost = get_computation_cost (data, use, cand, false, &inv_vars, NULL,
> > +                                  &inv_expr);
> > +      if (cand->doloop_p)
> > +       cost += targetm.doloop_cost_for_generic;
> > +    }
> > This adjustment
> >
> >    cost = get_computation_cost (data, use, cand, true,
> >                                &inv_vars, &can_autoinc, &inv_expr);
> >
> > +  if (cand->doloop_p)
> > +    cost += targetm.doloop_cost_for_address;
> > +
> > and this adjustment can be moved into get_computation_cost where all
> > cost adjustments are done.
> >
> > +      /* For doloop candidate/use pair, adjust to zero cost.  */
> > +      if (group->doloop_p && cand->doloop_p)
> > +       cost = no_cost;
> > Note above code handles comparing against zero case and decreases the
> > cost by one (which prefers the same kind candidate as doloop one),
> > it's very possible to have -1 cost for doloop cand here.  how about
> > just set to no_cost if it's positive?  Your call.
> >
> > +/* For the targets which support doloop, to predict whether later RTL doloop
> > +   transformation will perform on this loop, further detect the doloop use and
> > +   mark the flag doloop_use_p if predicted.  */
> > +
> > +void
> > +predict_and_process_doloop (struct ivopts_data *data)
> > A better name here? Sorry I don't have another candidate in mind...
> >
> > +  data->doloop_use_p = false;
> > This can be moved to the beginning of above
> > 'predict_and_process_doloop' function.
> >
> > Lastly, could you please add some brief description/comment about
> > doloop handling as a subsection in the file head comment?
> >
> > Otherwise, the ivopt changes look good to me.
> >
> > Thanks,
> > bin
> >
>
> Thanks for your prompt reply!  I've updated the code as your comments,
> the updated version is attached.  Looking forward to your review again.

Sorry to bother.

-      return get_scaled_computation_cost_at (data, at, cost);
+      cost = get_scaled_computation_cost_at (data, at, cost);
+      /* For doloop IV cand, add on the extra cost.  */
+      cost += cand->doloop_p ? targetm.doloop_cost_for_address : 0;
+      return cost;
Here the cost is adjusted after scaling, while:

+  /* For doloop IV cand, add on the extra cost.  */
+  if (cand->doloop_p && use->type == USE_NONLINEAR_EXPR)
+    cost += targetm.doloop_cost_for_generic;
+
   return get_scaled_computation_cost_at (data, at, cost);
is adjusted before scaling.  Please work consistently.

+      /* Simply use 1.5 * add cost for now, FIXME if there is some
more accurate
+        cost evaluation way.  */
+      cost = comp_cost (1.5 * add_cost (speed, mode), 0);
+      break;
Is 1.5 important for some test cases?  Can we simply use 1 instead?
Or at least use xxx * 2 / 3 in order to avoid floating number.

Not sure if non-ivopts parts are already approved?  If so, the patch
is okay with above issues addressed.

Thanks very much for your time!

Thanks,
bin
>
>
> Thanks,
> Kewen
>
> -----
>
> gcc/ChangeLog
>
> 2019-08-23  Kewen Lin  <linkw@gcc.gnu.org>
>
>         PR middle-end/80791
>         * config/rs6000/rs6000.c (TARGET_HAVE_COUNT_REG_DECR_P): New macro.
>         (TARGET_DOLOOP_COST_FOR_GENERIC): Likewise.
>         (TARGET_DOLOOP_COST_FOR_ADDRESS): Likewise.
>         * target.def (have_count_reg_decr_p): New hook.
>         (doloop_cost_for_generic): Likewise.
>         (doloop_cost_for_address): Likewise.
>         * doc/tm.texi.in (TARGET_HAVE_COUNT_REG_DECR_P): Likewise.
>         (TARGET_DOLOOP_COST_FOR_GENERIC): Likewise.
>         (TARGET_DOLOOP_COST_FOR_ADDRESS): Likewise.
>         * doc/tm.texi: Regenerate.
>         * tree-ssa-loop-ivopts.c (comp_cost::operator+=): Consider infinite cost
>         addend.
>         (record_group): Init doloop_p.
>         (add_candidate_1): Add optional argument doloop, change the handlings
>         accordingly.
>         (add_candidate): Likewise.
>         (generic_predict_doloop_p): Update attribute.
>         (force_expr_to_var_cost): Add costing for expressions COND_EXPR/LT_EXPR/
>         LE_EXPR/GT_EXPR/GE_EXPR/EQ_EXPR/NE_EXPR/UNORDERED_EXPR/ORDERED_EXPR/
>         UNLT_EXPR/UNLE_EXPR/UNGT_EXPR/UNGE_EXPR/UNEQ_EXPR/LTGT_EXPR/MAX_EXPR/
>         MIN_EXPR.
>         (get_computation_cost): Update for doloop IV cand extra cost.
>         (determine_group_iv_cost_cond): Update for doloop IV cand.
>         (determine_iv_cost): Likewise.
>         (ivopts_estimate_reg_pressure): Likewise.
>         (may_eliminate_iv): Update handlings for doloop IV cand.
>         (add_iv_candidate_for_doloop): New function.
>         (find_iv_candidates): Call function add_iv_candidate_for_doloop.
>         (iv_ca_set_no_cp): Update for doloop IV cand.
>         (iv_ca_set_cp): Likewise.
>         (iv_ca_dump): Dump register cost.
>         (find_doloop_use): New function.
>         (analyze_and_mark_doloop_use): Likewise.
>         (tree_ssa_iv_optimize_loop): Call function analyze_and_mark_doloop_use.
>
> gcc/testsuite/ChangeLog
>
> 2019-08-23  Kewen Lin  <linkw@gcc.gnu.org>
>
>         PR middle-end/80791
>         * gcc.dg/tree-ssa/ivopts-3.c: Adjust for doloop change.
>         * gcc.dg/tree-ssa/ivopts-lt.c: Likewise.
>         * gcc.dg/tree-ssa/pr32044.c: Likewise.
>
>
Segher Boessenkool Aug. 23, 2019, 10:18 a.m. UTC | #8
Hi!

On Fri, Aug 23, 2019 at 05:43:32PM +0800, Bin.Cheng wrote:
> On Fri, Aug 23, 2019 at 4:27 PM Kewen.Lin <linkw@linux.ibm.com> wrote:
> Not sure if non-ivopts parts are already approved?  If so, the patch
> is okay with above issues addressed.

The rs6000 part is fine.  The target.def entries need some spell check
and copy-editing, but are obvious and trivial otherwise, and/or you can
approve it as ivopts maintainer.

> Thanks very much for your time!

And thank you as well Bin :-)


Segher

Patch
diff mbox series

diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
index 6667cd0..5eccbdc 100644
--- a/gcc/config/rs6000/rs6000.c
+++ b/gcc/config/rs6000/rs6000.c
@@ -1912,6 +1912,16 @@  static const struct attribute_spec rs6000_attribute_table[] =
 #undef TARGET_PREDICT_DOLOOP_P
 #define TARGET_PREDICT_DOLOOP_P rs6000_predict_doloop_p
 
+#undef TARGET_HAVE_COUNT_REG_DECR_P
+#define TARGET_HAVE_COUNT_REG_DECR_P true
+
+/* 1000000000 is infinite cost in IVOPTs.  */
+#undef TARGET_DOLOOP_COST_FOR_GENERIC
+#define TARGET_DOLOOP_COST_FOR_GENERIC 1000000000
+
+#undef TARGET_DOLOOP_COST_FOR_ADDRESS
+#define TARGET_DOLOOP_COST_FOR_ADDRESS 1000000000
+
 #undef TARGET_ATOMIC_ASSIGN_EXPAND_FENV
 #define TARGET_ATOMIC_ASSIGN_EXPAND_FENV rs6000_atomic_assign_expand_fenv
 
diff --git a/gcc/doc/tm.texi b/gcc/doc/tm.texi
index c2aa4d0..9f3a08a 100644
--- a/gcc/doc/tm.texi
+++ b/gcc/doc/tm.texi
@@ -11618,6 +11618,29 @@  loops, and will help ivopts to make some decisions.
 The default version of this hook returns false.
 @end deftypefn
 
+@deftypevr {Target Hook} bool TARGET_HAVE_COUNT_REG_DECR_P
+Return true if the target supports hardware count register for decrement
+and branch.  This count register can't be used as general register since
+moving to/from a general register from/to it is very expensive.
+The default value is false.
+@end deftypevr
+
+@deftypevr {Target Hook} int64_t TARGET_DOLOOP_COST_FOR_GENERIC
+IVOPTs introduces one doloop dedicated IV candidate, this hook offers
+ target owner a way to adjust cost when selecting doloop IV candidate for a
+ generic IV use.  At calcuation, this value will be added on normal cost
+ already calculated by current implementation.
+The default value is zero.
+@end deftypevr
+
+@deftypevr {Target Hook} int64_t TARGET_DOLOOP_COST_FOR_ADDRESS
+IVOPTs introduces one doloop dedicated IV candidate, this hook offers
+ target owner a way to adjust cost when selecting doloop IV candidate for an
+ address IV use.  At calcuation, this value will be added on normal cost
+ already calculated by current implementation.
+The default value is zero.
+@end deftypevr
+
 @deftypefn {Target Hook} bool TARGET_CAN_USE_DOLOOP_P (const widest_int @var{&iterations}, const widest_int @var{&iterations_max}, unsigned int @var{loop_depth}, bool @var{entered_at_top})
 Return true if it is possible to use low-overhead loops (@code{doloop_end}
 and @code{doloop_begin}) for a particular loop.  @var{iterations} gives the
diff --git a/gcc/doc/tm.texi.in b/gcc/doc/tm.texi.in
index b4d57b8..4346773 100644
--- a/gcc/doc/tm.texi.in
+++ b/gcc/doc/tm.texi.in
@@ -7946,6 +7946,12 @@  to by @var{ce_info}.
 
 @hook TARGET_PREDICT_DOLOOP_P
 
+@hook TARGET_HAVE_COUNT_REG_DECR_P
+
+@hook TARGET_DOLOOP_COST_FOR_GENERIC
+
+@hook TARGET_DOLOOP_COST_FOR_ADDRESS
+
 @hook TARGET_CAN_USE_DOLOOP_P
 
 @hook TARGET_INVALID_WITHIN_DOLOOP
diff --git a/gcc/target.def b/gcc/target.def
index 71b6972..69e2844 100644
--- a/gcc/target.def
+++ b/gcc/target.def
@@ -4246,6 +4246,32 @@  The default version of this hook returns false.",
  bool, (struct loop *loop),
  default_predict_doloop_p)
 
+DEFHOOKPOD
+(have_count_reg_decr_p,
+ "Return true if the target supports hardware count register for decrement\n\
+and branch.  This count register can't be used as general register since\n\
+moving to/from a general register from/to it is very expensive.\n\
+The default value is false.",
+ bool, false)
+
+DEFHOOKPOD
+(doloop_cost_for_generic,
+ "IVOPTs introduces one doloop dedicated IV candidate, this hook offers\n\
+ target owner a way to adjust cost when selecting doloop IV candidate for a\n\
+ generic IV use.  At calcuation, this value will be added on normal cost\n\
+ already calculated by current implementation.\n\
+The default value is zero.",
+ int64_t, 0)
+
+DEFHOOKPOD
+(doloop_cost_for_address,
+ "IVOPTs introduces one doloop dedicated IV candidate, this hook offers\n\
+ target owner a way to adjust cost when selecting doloop IV candidate for an\n\
+ address IV use.  At calcuation, this value will be added on normal cost\n\
+ already calculated by current implementation.\n\
+The default value is zero.",
+ int64_t, 0)
+
 DEFHOOK
 (can_use_doloop_p,
  "Return true if it is possible to use low-overhead loops (@code{doloop_end}\n\
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-3.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-3.c
index 214e6a7..ce4b1d0 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-3.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-3.c
@@ -10,4 +10,6 @@  int main (void)
     f2 ();
 }
 
-/* { dg-final { scan-tree-dump-times "!= 0" 5 "ivopts" } }  */
+/* { dg-final { scan-tree-dump-times "!= 0" 5 "ivopts" { target { ! powerpc*-*-* } } } } */
+/* More debug information emitted for doloop on powerpc.  */
+/* { dg-final { scan-tree-dump-times "!= 0" 6 "ivopts" { target { powerpc*-*-* } } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c
index 7d5859b..71d7f67 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/ivopts-lt.c
@@ -17,6 +17,7 @@  f1 (char *p, uintptr_t i, uintptr_t n)
   while (i < n);
 }
 
-/* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts" } } */
-/* { dg-final { scan-tree-dump-times "PHI <p_" 1 "ivopts"} } */
-/* { dg-final { scan-tree-dump-times "p_\[0-9\]* <" 1 "ivopts" } } */
+/* { dg-final { scan-tree-dump-times "PHI" 1 "ivopts" { target { ! powerpc*-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "PHI" 2 "ivopts" { target { powerpc*-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "PHI <p_" 1 "ivopts" { target { ! powerpc*-*-* } } } } */
+/* { dg-final { scan-tree-dump-times "p_\[0-9\]* <" 1 "ivopts" { target { ! powerpc*-*-* } } } } */
diff --git a/gcc/testsuite/gcc.dg/tree-ssa/pr32044.c b/gcc/testsuite/gcc.dg/tree-ssa/pr32044.c
index 8a8977a..06c27b0 100644
--- a/gcc/testsuite/gcc.dg/tree-ssa/pr32044.c
+++ b/gcc/testsuite/gcc.dg/tree-ssa/pr32044.c
@@ -1,6 +1,10 @@ 
 /* { dg-do compile } */
 /* { dg-options "-O2 -fdump-tree-optimized" } */
 
+/* For powerpc, disable doloop IV cand generation in IVOPTs to avoid unexpected
+   division operation for its base setup.  */
+/* { dg-additional-options "-fno-branch-count-reg" { target { powerpc*-*-* } } } */
+
 int foo (int n)
 {
   while (n >= 45)
diff --git a/gcc/tree-ssa-loop-ivopts.c b/gcc/tree-ssa-loop-ivopts.c
index 530ea4a..11852af 100644
--- a/gcc/tree-ssa-loop-ivopts.c
+++ b/gcc/tree-ssa-loop-ivopts.c
@@ -275,6 +275,9 @@  comp_cost::operator+= (comp_cost cost)
 comp_cost
 comp_cost::operator+= (HOST_WIDE_INT c)
 {
+  if (c >= INFTY)
+    this->cost = INFTY;
+
   if (infinite_cost_p ())
     return *this;
 
@@ -399,6 +402,8 @@  struct iv_group
   struct cost_pair *cost_map;
   /* The selected candidate for the group.  */
   struct iv_cand *selected;
+  /* To indicate this is a doloop use group.  */
+  bool doloop_p;
   /* Uses in the group.  */
   vec<struct iv_use *> vuses;
 };
@@ -439,6 +444,7 @@  struct iv_cand
 			   be hoisted out of loop.  */
   struct iv *orig_iv;	/* The original iv if this cand is added from biv with
 			   smaller type.  */
+  bool doloop_p;	/* Whether this is a doloop candidate.  */
 };
 
 /* Hashtable entry for common candidate derived from iv uses.  */
@@ -612,6 +618,9 @@  struct ivopts_data
 
   /* Whether the loop body can only be exited via single exit.  */
   bool loop_single_exit_p;
+
+  /* Whether the loop has doloop comparison use.  */
+  bool doloop_use_p;
 };
 
 /* An assignment of iv candidates to uses.  */
@@ -636,6 +645,9 @@  struct iv_ca
   /* The number of candidates in the set.  */
   unsigned n_cands;
 
+  /* The number of doloop candidate in the set.  */
+  unsigned n_doloop_cands;
+
   /* The number of invariants needed, including both invariant variants and
      invariant expressions.  */
   unsigned n_invs;
@@ -1528,6 +1540,7 @@  record_group (struct ivopts_data *data, enum use_type type)
   group->type = type;
   group->related_cands = BITMAP_ALLOC (NULL);
   group->vuses.create (1);
+  group->doloop_p = false;
 
   data->vgroups.safe_push (group);
   return group;
@@ -3017,9 +3030,9 @@  get_loop_invariant_expr (struct ivopts_data *data, tree inv_expr)
    replacement of the final value of the iv by a direct computation.  */
 
 static struct iv_cand *
-add_candidate_1 (struct ivopts_data *data,
-		 tree base, tree step, bool important, enum iv_position pos,
-		 struct iv_use *use, gimple *incremented_at,
+add_candidate_1 (struct ivopts_data *data, tree base, tree step, bool important,
+		 enum iv_position pos, struct iv_use *use,
+		 gimple *incremented_at, bool doloop = false,
 		 struct iv *orig_iv = NULL)
 {
   unsigned i;
@@ -3079,11 +3092,15 @@  add_candidate_1 (struct ivopts_data *data,
       cand->pos = pos;
       if (pos != IP_ORIGINAL)
 	{
-	  cand->var_before = create_tmp_var_raw (TREE_TYPE (base), "ivtmp");
+	  if (doloop)
+	    cand->var_before = create_tmp_var_raw (TREE_TYPE (base), "doloop");
+	  else
+	    cand->var_before = create_tmp_var_raw (TREE_TYPE (base), "ivtmp");
 	  cand->var_after = cand->var_before;
 	}
       cand->important = important;
       cand->incremented_at = incremented_at;
+      cand->doloop_p = doloop;
       data->vcands.safe_push (cand);
 
       if (!poly_int_tree_p (step))
@@ -3116,6 +3133,7 @@  add_candidate_1 (struct ivopts_data *data,
     }
 
   cand->important |= important;
+  cand->doloop_p |= doloop;
 
   /* Relate candidate to the group for which it is added.  */
   if (use)
@@ -3209,16 +3227,17 @@  add_autoinc_candidates (struct ivopts_data *data, tree base, tree step,
    the end of loop.  */
 
 static void
-add_candidate (struct ivopts_data *data,
-	       tree base, tree step, bool important, struct iv_use *use,
+add_candidate (struct ivopts_data *data, tree base, tree step, bool important,
+	       struct iv_use *use, bool doloop = false,
 	       struct iv *orig_iv = NULL)
 {
   if (ip_normal_pos (data->current_loop))
-    add_candidate_1 (data, base, step, important,
-		     IP_NORMAL, use, NULL, orig_iv);
+    add_candidate_1 (data, base, step, important, IP_NORMAL, use, NULL, doloop,
+		     orig_iv);
   if (ip_end_pos (data->current_loop)
       && allow_ip_end_pos_p (data->current_loop))
-    add_candidate_1 (data, base, step, important, IP_END, use, NULL, orig_iv);
+    add_candidate_1 (data, base, step, important, IP_END, use, NULL, doloop,
+		     orig_iv);
 }
 
 /* Adds standard iv candidates.  */
@@ -3262,7 +3281,7 @@  add_iv_candidate_for_biv (struct ivopts_data *data, struct iv *iv)
       tree step = fold_convert (sizetype, iv->step);
 
       /* Add iv cand of same precision as index part in TARGET_MEM_REF.  */
-      add_candidate (data, base, step, true, NULL, iv);
+      add_candidate (data, base, step, true, NULL, false, iv);
       /* Add iv cand of the original type only if it has nonlinear use.  */
       if (iv->nonlin_use)
 	add_candidate (data, iv->base, iv->step, true, NULL);
@@ -3724,7 +3743,7 @@  prepare_decl_rtl (tree *expr_p, int *ws, void *data)
    Some RTL specific checks seems unable to be checked in gimple, if any new
    checks or easy checks _are_ missing here, please add them.  */
 
-static bool ATTRIBUTE_UNUSED
+static bool
 generic_predict_doloop_p (struct ivopts_data *data)
 {
   struct loop *loop = data->current_loop;
@@ -4177,6 +4196,36 @@  force_expr_to_var_cost (tree expr, bool speed)
       STRIP_NOPS (op0);
       op1 = NULL_TREE;
       break;
+    /* See add_iv_candidate_for_doloop, for doloop may_be_zero case, we
+       introduce COND_EXPR for IV base, need to support better cost estimation
+       for this COND_EXPR and tcc_comparison.  */
+    case COND_EXPR:
+      op0 = TREE_OPERAND (expr, 1);
+      STRIP_NOPS (op0);
+      op1 = TREE_OPERAND (expr, 2);
+      STRIP_NOPS (op1);
+      break;
+    case LT_EXPR:
+    case LE_EXPR:
+    case GT_EXPR:
+    case GE_EXPR:
+    case EQ_EXPR:
+    case NE_EXPR:
+    case UNORDERED_EXPR:
+    case ORDERED_EXPR:
+    case UNLT_EXPR:
+    case UNLE_EXPR:
+    case UNGT_EXPR:
+    case UNGE_EXPR:
+    case UNEQ_EXPR:
+    case LTGT_EXPR:
+    case MAX_EXPR:
+    case MIN_EXPR:
+      op0 = TREE_OPERAND (expr, 0);
+      STRIP_NOPS (op0);
+      op1 = TREE_OPERAND (expr, 1);
+      STRIP_NOPS (op1);
+      break;
 
     default:
       /* Just an arbitrary value, FIXME.  */
@@ -4258,6 +4307,35 @@  force_expr_to_var_cost (tree expr, bool speed)
     case RSHIFT_EXPR:
       cost = comp_cost (add_cost (speed, mode), 0);
       break;
+    case COND_EXPR:
+      op0 = TREE_OPERAND (expr, 0);
+      STRIP_NOPS (op0);
+      if (op0 == NULL_TREE || TREE_CODE (op0) == SSA_NAME
+	  || CONSTANT_CLASS_P (op0))
+	cost = no_cost;
+      else
+	cost = force_expr_to_var_cost (op0, speed);
+      break;
+    case LT_EXPR:
+    case LE_EXPR:
+    case GT_EXPR:
+    case GE_EXPR:
+    case EQ_EXPR:
+    case NE_EXPR:
+    case UNORDERED_EXPR:
+    case ORDERED_EXPR:
+    case UNLT_EXPR:
+    case UNLE_EXPR:
+    case UNGT_EXPR:
+    case UNGE_EXPR:
+    case UNEQ_EXPR:
+    case LTGT_EXPR:
+    case MAX_EXPR:
+    case MIN_EXPR:
+      /* Simply use 1.5 * add cost for now, FIXME if there is some more accurate
+	 cost evaluation way.  */
+      cost = comp_cost (1.5 * add_cost (speed, mode), 0);
+      break;
 
     default:
       gcc_unreachable ();
@@ -4706,8 +4784,12 @@  determine_group_iv_cost_generic (struct ivopts_data *data,
   if (cand->pos == IP_ORIGINAL && cand->incremented_at == use->stmt)
     cost = no_cost;
   else
-    cost = get_computation_cost (data, use, cand, false,
-				 &inv_vars, NULL, &inv_expr);
+    {
+      cost = get_computation_cost (data, use, cand, false, &inv_vars, NULL,
+				   &inv_expr);
+      if (cand->doloop_p)
+	cost += targetm.doloop_cost_for_generic;
+    }
 
   if (inv_expr)
     {
@@ -4735,6 +4817,9 @@  determine_group_iv_cost_address (struct ivopts_data *data,
   cost = get_computation_cost (data, use, cand, true,
 			       &inv_vars, &can_autoinc, &inv_expr);
 
+  if (cand->doloop_p)
+    cost += targetm.doloop_cost_for_address;
+
   if (inv_expr)
     {
       inv_exprs = BITMAP_ALLOC (NULL);
@@ -4783,11 +4868,12 @@  determine_group_iv_cost_address (struct ivopts_data *data,
    stores it to VAL.  */
 
 static void
-cand_value_at (struct loop *loop, struct iv_cand *cand, gimple *at, tree niter,
-	       aff_tree *val)
+cand_value_at (struct loop *loop, struct iv_cand *cand, gimple *at,
+	       struct tree_niter_desc *desc, aff_tree *val)
 {
   aff_tree step, delta, nit;
   struct iv *iv = cand->iv;
+  tree niter = desc->niter;
   tree type = TREE_TYPE (iv->base);
   tree steptype;
   if (POINTER_TYPE_P (type))
@@ -4803,7 +4889,15 @@  cand_value_at (struct loop *loop, struct iv_cand *cand, gimple *at, tree niter,
   if (stmt_after_increment (loop, cand, at))
     aff_combination_add (&delta, &step);
 
-  tree_to_aff_combination (iv->base, type, val);
+  tree base = iv->base;
+  /* See add_iv_candidate_for_doloop, if may_be_zero is set, we want to extract
+     the value under !may_be_zero to get the compact bound which also well fits
+     for may_be_zero since we ensure the value for it is const one.  */
+  if (cand->doloop_p && desc->may_be_zero && !integer_zerop (desc->may_be_zero))
+    base = fold_build2 (PLUS_EXPR, TREE_TYPE (niter),
+			unshare_expr (rewrite_to_non_trapping_overflow (niter)),
+			build_int_cst (TREE_TYPE (niter), 1));
+  tree_to_aff_combination (base, type, val);
   if (!POINTER_TYPE_P (type))
     aff_combination_convert (val, steptype);
   aff_combination_add (val, &delta);
@@ -5142,7 +5236,7 @@  may_eliminate_iv (struct ivopts_data *data,
 	}
     }
 
-  cand_value_at (loop, cand, use->stmt, desc->niter, &bnd);
+  cand_value_at (loop, cand, use->stmt, desc, &bnd);
 
   *bound = fold_convert (TREE_TYPE (cand->iv->base),
 			 aff_combination_to_tree (&bnd));
@@ -5159,8 +5253,11 @@  may_eliminate_iv (struct ivopts_data *data,
 
      TODO: we could also calculate the value MAY_BE_ZERO ? 0 : NITER and
 	   base the exit condition on it.  However, that is often too
-	   expensive.  */
-  if (!integer_zerop (desc->may_be_zero))
+	   expensive.
+
+     For doloop candidate, we have considered MAY_BE_ZERO for IV base, need to
+     support MAY_BE_ZERO ? 0 : NITER, so simply bypass this check.  */
+  if (!integer_zerop (desc->may_be_zero) && !cand->doloop_p)
     return iv_elimination_compare_lt (data, cand, comp, desc);
 
   return true;
@@ -5264,6 +5361,9 @@  determine_group_iv_cost_cond (struct ivopts_data *data,
       inv_vars = inv_vars_elim;
       inv_vars_elim = NULL;
       inv_expr = inv_expr_elim;
+      /* For doloop candidate/use pair, adjust to zero cost.  */
+      if (group->doloop_p && cand->doloop_p)
+	cost = no_cost;
     }
   else
     {
@@ -5390,6 +5490,42 @@  relate_compare_use_with_all_cands (struct ivopts_data *data)
     }
 }
 
+/* Add one doloop dedicated IV candidate:
+     - Base is (may_be_zero ? 1 : (niter + 1)).
+     - Step is -1.  */
+
+static void
+add_iv_candidate_for_doloop (struct ivopts_data *data)
+{
+  tree_niter_desc *niter_desc = niter_for_single_dom_exit (data);
+  gcc_assert (niter_desc && niter_desc->assumptions);
+
+  tree niter = niter_desc->niter;
+  tree ntype = TREE_TYPE (niter);
+  gcc_assert (TREE_CODE (ntype) == INTEGER_TYPE);
+
+  tree may_be_zero = niter_desc->may_be_zero;
+  if (may_be_zero && integer_zerop (may_be_zero))
+    may_be_zero = NULL_TREE;
+  if (may_be_zero)
+    {
+      if (COMPARISON_CLASS_P (may_be_zero))
+	{
+	  niter = fold_build3 (COND_EXPR, ntype, may_be_zero,
+			       build_int_cst (ntype, 0),
+			       rewrite_to_non_trapping_overflow (niter));
+	}
+      /* Don't try to obtain the iteration count expression when may_be_zero is
+	 integer_nonzerop (actually iteration count is one) or else.  */
+      else
+	return;
+    }
+
+  tree base = fold_build2 (PLUS_EXPR, ntype, unshare_expr (niter),
+			   build_int_cst (ntype, 1));
+  add_candidate (data, base, build_int_cst (ntype, -1), true, NULL, true);
+}
+
 /* Finds the candidates for the induction variables.  */
 
 static void
@@ -5398,6 +5534,10 @@  find_iv_candidates (struct ivopts_data *data)
   /* Add commonly used ivs.  */
   add_standard_iv_candidates (data);
 
+  /* Add doloop dedicate ivs.  */
+  if (data->doloop_use_p)
+    add_iv_candidate_for_doloop (data);
+
   /* Add old induction variables.  */
   add_iv_candidate_for_bivs (data);
 
@@ -5578,16 +5718,21 @@  determine_iv_cost (struct ivopts_data *data, struct iv_cand *cand)
      or a const set.  */
   if (cost_base.cost == 0)
     cost_base.cost = COSTS_N_INSNS (1);
-  cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
-
+  /* Doloop decrement should be considered as zero cost.  */
+  if (cand->doloop_p)
+    cost_step = 0;
+  else
+    cost_step = add_cost (data->speed, TYPE_MODE (TREE_TYPE (base)));
   cost = cost_step + adjust_setup_cost (data, cost_base.cost);
 
   /* Prefer the original ivs unless we may gain something by replacing it.
      The reason is to make debugging simpler; so this is not relevant for
      artificial ivs created by other optimization passes.  */
-  if (cand->pos != IP_ORIGINAL
-      || !SSA_NAME_VAR (cand->var_before)
-      || DECL_ARTIFICIAL (SSA_NAME_VAR (cand->var_before)))
+  if ((cand->pos != IP_ORIGINAL
+       || !SSA_NAME_VAR (cand->var_before)
+       || DECL_ARTIFICIAL (SSA_NAME_VAR (cand->var_before)))
+      /* Prefer doloop as well.  */
+      && !cand->doloop_p)
     cost++;
 
   /* Prefer not to insert statements into latch unless there are some
@@ -5633,10 +5778,14 @@  determine_iv_costs (struct ivopts_data *data)
 
 static unsigned
 ivopts_estimate_reg_pressure (struct ivopts_data *data, unsigned n_invs,
-			      unsigned n_cands)
+			      unsigned n_cands, unsigned n_doloop_cands)
 {
   unsigned cost;
-  unsigned n_old = data->regs_used, n_new = n_invs + n_cands;
+  unsigned n_old = data->regs_used, n_spr_for_doloop = 0;
+  /* If target supports count register for doloop, it doesn't take GPR.  */
+  if (targetm.have_count_reg_decr_p)
+    n_spr_for_doloop = n_doloop_cands;
+  unsigned n_new = n_invs + n_cands - n_spr_for_doloop;
   unsigned regs_needed = n_new + n_old, available_regs = target_avail_regs;
   bool speed = data->speed;
 
@@ -5666,7 +5815,7 @@  ivopts_estimate_reg_pressure (struct ivopts_data *data, unsigned n_invs,
 
   /* Finally, add the number of candidates, so that we prefer eliminating
      induction variables if possible.  */
-  return cost + n_cands;
+  return cost + n_cands - n_spr_for_doloop;
 }
 
 /* For each size of the induction variable set determine the penalty.  */
@@ -5727,7 +5876,7 @@  determine_set_costs (struct ivopts_data *data)
       fprintf (dump_file, "  ivs\tcost\n");
       for (j = 0; j <= 2 * target_avail_regs; j++)
 	fprintf (dump_file, "  %d\t%d\n", j,
-		 ivopts_estimate_reg_pressure (data, 0, j));
+		 ivopts_estimate_reg_pressure (data, 0, j, 0));
       fprintf (dump_file, "\n");
     }
 }
@@ -5786,7 +5935,8 @@  iv_ca_recount_cost (struct ivopts_data *data, struct iv_ca *ivs)
   comp_cost cost = ivs->cand_use_cost;
 
   cost += ivs->cand_cost;
-  cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands);
+  cost += ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands,
+					ivs->n_doloop_cands);
   ivs->cost = cost;
 }
 
@@ -5833,6 +5983,8 @@  iv_ca_set_no_cp (struct ivopts_data *data, struct iv_ca *ivs,
     {
       bitmap_clear_bit (ivs->cands, cid);
       ivs->n_cands--;
+      if (cp->cand->doloop_p)
+	ivs->n_doloop_cands--;
       ivs->cand_cost -= cp->cand->cost;
       iv_ca_set_remove_invs (ivs, cp->cand->inv_vars, ivs->n_inv_var_uses);
       iv_ca_set_remove_invs (ivs, cp->cand->inv_exprs, ivs->n_inv_expr_uses);
@@ -5890,6 +6042,8 @@  iv_ca_set_cp (struct ivopts_data *data, struct iv_ca *ivs,
 	{
 	  bitmap_set_bit (ivs->cands, cid);
 	  ivs->n_cands++;
+	  if (cp->cand->doloop_p)
+	    ivs->n_doloop_cands++;
 	  ivs->cand_cost += cp->cand->cost;
 	  iv_ca_set_add_invs (ivs, cp->cand->inv_vars, ivs->n_inv_var_uses);
 	  iv_ca_set_add_invs (ivs, cp->cand->inv_exprs, ivs->n_inv_expr_uses);
@@ -6100,6 +6254,7 @@  iv_ca_new (struct ivopts_data *data)
   nw->n_cand_uses = XCNEWVEC (unsigned, data->vcands.length ());
   nw->cands = BITMAP_ALLOC (NULL);
   nw->n_cands = 0;
+  nw->n_doloop_cands = 0;
   nw->n_invs = 0;
   nw->cand_use_cost = no_cost;
   nw->cand_cost = 0;
@@ -6134,6 +6289,9 @@  iv_ca_dump (struct ivopts_data *data, FILE *file, struct iv_ca *ivs)
 
   fprintf (file, "  cost: %" PRId64 " (complexity %d)\n", cost.cost,
 	   cost.complexity);
+  fprintf (file, "  reg_cost: %d\n",
+	   ivopts_estimate_reg_pressure (data, ivs->n_invs, ivs->n_cands,
+					 ivs->n_doloop_cands));
   fprintf (file, "  cand_cost: %" PRId64 "\n  cand_group_cost: "
 	   "%" PRId64 " (complexity %d)\n", ivs->cand_cost,
 	   ivs->cand_use_cost.cost, ivs->cand_use_cost.complexity);
@@ -7568,6 +7726,47 @@  determine_scaling_factor (struct ivopts_data *data, basic_block *body)
     }
 }
 
+/* Find doloop comparison use and set its doloop_p on if found.  */
+
+static bool
+find_doloop_use (struct ivopts_data *data)
+{
+  struct loop *loop = data->current_loop;
+
+  for (unsigned i = 0; i < data->vgroups.length (); i++)
+    {
+      struct iv_group *group = data->vgroups[i];
+      if (group->type == USE_COMPARE)
+	{
+	  gcc_assert (group->vuses.length () == 1);
+	  struct iv_use *use = group->vuses[0];
+	  gimple *stmt = use->stmt;
+	  if (gimple_code (stmt) == GIMPLE_COND)
+	    {
+	      basic_block bb = gimple_bb (stmt);
+	      edge true_edge, false_edge;
+	      extract_true_false_edges_from_block (bb, &true_edge, &false_edge);
+	      /* This comparison is used for loop latch.  Require latch is empty
+		 for now.  */
+	      if ((loop->latch == true_edge->dest
+		   || loop->latch == false_edge->dest)
+		  && empty_block_p (loop->latch))
+		{
+		  group->doloop_p = true;
+		  if (dump_file && (dump_flags & TDF_DETAILS))
+		    {
+		      fprintf (dump_file, "Doloop cmp iv use: ");
+		      print_gimple_stmt (dump_file, stmt, TDF_DETAILS);
+		    }
+		  return true;
+		}
+	    }
+	}
+    }
+
+  return false;
+}
+
 /* Optimizes the LOOP.  Returns true if anything changed.  */
 
 static bool
@@ -7580,6 +7779,7 @@  tree_ssa_iv_optimize_loop (struct ivopts_data *data, struct loop *loop,
   basic_block *body;
 
   gcc_assert (!data->niters);
+  data->doloop_use_p = false;
   data->current_loop = loop;
   data->loop_loc = find_loop_location (loop).get_location_t ();
   data->speed = optimize_loop_for_speed_p (loop);
@@ -7622,6 +7822,22 @@  tree_ssa_iv_optimize_loop (struct ivopts_data *data, struct loop *loop,
   /* Determine cost scaling factor for basic blocks in loop.  */
   determine_scaling_factor (data, body);
 
+  if (flag_branch_on_count_reg && generic_predict_doloop_p (data))
+    {
+      if (find_doloop_use (data))
+	{
+	  data->doloop_use_p = true;
+	  if (dump_file && (dump_flags & TDF_DETAILS))
+	    {
+	      fprintf (dump_file,
+		       "Predict loop %d can perform"
+		       " doloop optimization later.\n",
+		       loop->num);
+	      flow_loop_dump (loop, dump_file, NULL, 1);
+	    }
+	}
+    }
+
   /* Finds candidates for the induction variables (item 2).  */
   find_iv_candidates (data);