diff mbox series

AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]

Message ID SJ2PR01MB8635742E07E2076FA2BE0560E139A@SJ2PR01MB8635.prod.exchangelabs.com
State New
Headers show
Series AArch64: Do not increase the vect reduction latency by multiplying count [PR110625] | expand

Commit Message

Hao Liu OS July 19, 2023, 4:33 a.m. UTC
This only affects the new cost model in the aarch64 backend.  Currently, the
reduction latency of the vector body is too large, as it is multiplied by the
stmt count.  Since the scalar reduction latency is small, the new cost model
may conclude that "scalar code would issue more quickly" and increase the
vector body cost a lot, which misses vectorization opportunities.

Tested by bootstrapping on aarch64-linux-gnu.

gcc/ChangeLog:

	PR target/110625
	* config/aarch64/aarch64.cc (count_ops): Remove the '* count'
	for reduction_latency.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/pr110625.c: New testcase.
---
 gcc/config/aarch64/aarch64.cc               |  5 +--
 gcc/testsuite/gcc.target/aarch64/pr110625.c | 46 +++++++++++++++++++++
 2 files changed, 47 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625.c

Comments

Hao Liu OS July 24, 2023, 1:58 a.m. UTC | #1
Hi Richard,

Gentle ping.  Is it ok for trunk?

Or do you plan to post a patch covering such a fix yourself?

Thanks,
-Hao
Richard Sandiford July 24, 2023, 11:10 a.m. UTC | #2
Hao Liu OS <hliu@os.amperecomputing.com> writes:
> This only affects the new costs in aarch64 backend.  Currently, the reduction
> latency of vector body is too large as it is multiplied by stmt count.  As the
> scalar reduction latency is small, the new costs model may think "scalar code
> would issue more quickly" and increase the vector body cost a lot, which will
> miss vectorization opportunities.
>
> Tested by bootstrapping on aarch64-linux-gnu.
>
> gcc/ChangeLog:
>
> 	PR target/110625
> 	* config/aarch64/aarch64.cc (count_ops): Remove the '* count'
> 	for reduction_latency.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/aarch64/pr110625.c: New testcase.
> ---
>  gcc/config/aarch64/aarch64.cc               |  5 +--
>  gcc/testsuite/gcc.target/aarch64/pr110625.c | 46 +++++++++++++++++++++
>  2 files changed, 47 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625.c
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 560e5431636..27afa64b7d5 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -16788,10 +16788,7 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
>      {
>        unsigned int base
>  	= aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
> -
> -      /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
> -	 that's not yet the case.  */
> -      ops->reduction_latency = MAX (ops->reduction_latency, base * count);
> +      ops->reduction_latency = MAX (ops->reduction_latency, base);

The ??? comment refers to the single_defuse_cycle code in
vectorizable_reduction.  E.g. consider:

long
f (long res, short *ptr1, short *ptr2, int n) {
  for (int i = 0; i < n; ++i)
    res += (long) ptr1[i] << ptr2[i];
  return res;
}

compiled at -O3.  The main loop is:

        movi    v25.4s, 0			// init accumulator
        lsl     x5, x5, 4
        .p2align 3,,7
.L4:
        ldr     q31, [x1, x4]
        ldr     q29, [x2, x4]
        add     x4, x4, 16
        sxtl    v30.4s, v31.4h
        sxtl2   v31.4s, v31.8h
        sxtl    v28.4s, v29.4h
        sxtl2   v29.4s, v29.8h
        sxtl    v27.2d, v30.2s
        sxtl2   v30.2d, v30.4s
        sxtl    v23.2d, v28.2s
        sxtl2   v26.2d, v28.4s
        sxtl    v24.2d, v29.2s
        sxtl    v28.2d, v31.2s
        sshl    v27.2d, v27.2d, v23.2d
        sshl    v30.2d, v30.2d, v26.2d
        sxtl2   v31.2d, v31.4s
        sshl    v28.2d, v28.2d, v24.2d
        add     v27.2d, v27.2d, v25.2d		// v25 -> v27
        sxtl2   v29.2d, v29.4s
        add     v30.2d, v30.2d, v27.2d		// v27 -> v30
        sshl    v31.2d, v31.2d, v29.2d
        add     v30.2d, v28.2d, v30.2d		// v30 -> v30
        add     v25.2d, v31.2d, v30.2d		// v30 -> v25
        cmp     x4, x5
        bne     .L4

Here count is 4 and the latency is 4 additions (8 cycles).  So as a
worst case, the current cost is correct.

To remove the count in all cases, we would need to

(1) avoid single def-use cycles or

(2) reassociate the reduction as a tree (4->2, 2->1, 1+acc->acc).
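As a hedged illustration of the difference (plain C, not vectorizer code; the
function names are invented for this sketch): with count = 4 partial sums per
iteration, the current single def-use cycle chains every add through the
accumulator, while a tree reassociation keeps only one add on the
loop-carried dependency.

```c
/* Serial chain, as in the .L4 loop above: each add depends on the
   previous one, so the loop-carried latency is 4 adds.  */
long serial_reduce (long acc, long p0, long p1, long p2, long p3)
{
  acc += p0;
  acc += p1;
  acc += p2;
  acc += p3;
  return acc;
}

/* Tree reassociation (4->2, 2->1, 1+acc->acc): t0 and t1 are
   independent and can issue in parallel; only the final add into
   acc is on the loop-carried dependency chain.  */
long tree_reduce (long acc, long p0, long p1, long p2, long p3)
{
  long t0 = p0 + p1;
  long t1 = p2 + p3;
  return acc + (t0 + t1);
}
```

Both forms compute the same sum (for integer types, where reassociation is
exact); they differ only in which adds sit on the critical path.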

But looking again, it seems we do have the information to distinguish
the cases.  We can do something like:

      stmt_vec_info reduc_info = info_for_reduction (m_vinfo, stmt_info);
      if (STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info))
	/* ??? Ideally we'd use a tree to reduce the copies down to 1 vector,
	   and then accumulate that, but at the moment the loop-carried
	   dependency includes all copies.  */
        ops->reduction_latency = MAX (ops->reduction_latency, base * count);
      else
        ops->reduction_latency = MAX (ops->reduction_latency, base);

(completely untested).

Thanks,
Richard
Hao Liu OS July 25, 2023, 9:10 a.m. UTC | #3
Hi,

Thanks for the suggestion.  I tested it and found a gcc_assert failure:
    gcc.target/aarch64/sve/cost_model_13.c (internal compiler error: in info_for_reduction, at tree-vect-loop.cc:5473)

It is caused by an empty STMT_VINFO_REDUC_DEF.  So I added an extra check before the single_defuse_cycle check.  The updated patch is below.  Is it OK for trunk?

---

The new cost model should multiply the reduction latency by count only for
single_defuse_cycle reductions.  In other situations, doing so increases the
reduction latency a lot and misses vectorization opportunities.

Tested on aarch64-linux-gnu.

gcc/ChangeLog:

	PR target/110625
	* config/aarch64/aarch64.cc (count_ops): Only '* count' for
	single_defuse_cycle while counting reduction_latency.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/pr110625_1.c: New testcase.
	* gcc.target/aarch64/pr110625_2.c: New testcase.
---
 gcc/config/aarch64/aarch64.cc                 | 13 ++++--
 gcc/testsuite/gcc.target/aarch64/pr110625_1.c | 46 +++++++++++++++++++
 gcc/testsuite/gcc.target/aarch64/pr110625_2.c | 14 ++++++
 3 files changed, 69 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_2.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 560e5431636..478a4e00110 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16788,10 +16788,15 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
     {
       unsigned int base
 	= aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
-
-      /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
-	 that's not yet the case.  */
-      ops->reduction_latency = MAX (ops->reduction_latency, base * count);
+      if (STMT_VINFO_REDUC_DEF (stmt_info)
+	  && STMT_VINFO_FORCE_SINGLE_CYCLE (
+	    info_for_reduction (m_vinfo, stmt_info)))
+	/* ??? Ideally we'd use a tree to reduce the copies down to 1 vector,
+	   and then accumulate that, but at the moment the loop-carried
+	   dependency includes all copies.  */
+	ops->reduction_latency = MAX (ops->reduction_latency, base * count);
+      else
+	ops->reduction_latency = MAX (ops->reduction_latency, base);
     }
 
   /* Assume that multiply-adds will become a single operation.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_1.c b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
new file mode 100644
index 00000000000..0965cac33a0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
@@ -0,0 +1,46 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details -fno-tree-slp-vectorize" } */
+/* { dg-final { scan-tree-dump-not "reduction latency = 8" "vect" } } */
+
+/* Do not increase the vector body cost due to the incorrect reduction latency
+    Original vector body cost = 51
+    Scalar issue estimate:
+      ...
+      reduction latency = 2
+      estimated min cycles per iteration = 2.000000
+      estimated cycles per vector iteration (for VF 2) = 4.000000
+    Vector issue estimate:
+      ...
+      reduction latency = 8      <-- Too large
+      estimated min cycles per iteration = 8.000000
+    Increasing body cost to 102 because scalar code would issue more quickly
+      ...
+    missed:  cost model: the vector iteration cost = 102 divided by the scalar iteration cost = 44 is greater or equal to the vectorization factor = 2.
+    missed:  not vectorized: vectorization not profitable.  */
+
+typedef struct
+{
+  unsigned short m1, m2, m3, m4;
+} the_struct_t;
+typedef struct
+{
+  double m1, m2, m3, m4, m5;
+} the_struct2_t;
+
+double
+bar (the_struct2_t *);
+
+double
+foo (double *k, unsigned int n, the_struct_t *the_struct)
+{
+  unsigned int u;
+  the_struct2_t result;
+  for (u = 0; u < n; u++, k--)
+    {
+      result.m1 += (*k) * the_struct[u].m1;
+      result.m2 += (*k) * the_struct[u].m2;
+      result.m3 += (*k) * the_struct[u].m3;
+      result.m4 += (*k) * the_struct[u].m4;
+    }
+  return bar (&result);
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_2.c b/gcc/testsuite/gcc.target/aarch64/pr110625_2.c
new file mode 100644
index 00000000000..7a84aa8355e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr110625_2.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details -fno-tree-slp-vectorize" } */
+/* { dg-final { scan-tree-dump "reduction latency = 8" "vect" } } */
+
+/* The reduction latency should be multiplied by the count for
+   single_defuse_cycle.  */
+
+long
+f (long res, short *ptr1, short *ptr2, int n)
+{
+  for (int i = 0; i < n; ++i)
+    res += (long) ptr1[i] << ptr2[i];
+  return res;
+}
Richard Sandiford July 25, 2023, 9:44 a.m. UTC | #4
Hao Liu OS <hliu@os.amperecomputing.com> writes:
> Hi,
>
> Thanks for the suggestion.  I tested it and found a gcc_assert failure:
>     gcc.target/aarch64/sve/cost_model_13.c (internal compiler error: in info_for_reduction, at tree-vect-loop.cc:5473)
>
> It is caused by empty STMT_VINFO_REDUC_DEF.

When was STMT_VINFO_REDUC_DEF empty?  I just want to make sure that
we're not papering over an issue elsewhere.

Thanks,
Richard

>   So, I added an extra check before checking single_defuse_cycle. The updated patch is below.  Is it OK for trunk?
>
> ---
>
> The new costs should only count reduction latency by multiplying count for
> single_defuse_cycle.  For other situations, this will increase the reduction
> latency a lot and miss vectorization opportunities.
>
> Tested on aarch64-linux-gnu.
>
> gcc/ChangeLog:
>
> 	PR target/110625
> 	* config/aarch64/aarch64.cc (count_ops): Only '* count' for
> 	single_defuse_cycle while counting reduction_latency.
>
> gcc/testsuite/ChangeLog:
>
> 	* gcc.target/aarch64/pr110625_1.c: New testcase.
> 	* gcc.target/aarch64/pr110625_2.c: New testcase.
> ---
>  gcc/config/aarch64/aarch64.cc                 | 13 ++++--
>  gcc/testsuite/gcc.target/aarch64/pr110625_1.c | 46 +++++++++++++++++++
>  gcc/testsuite/gcc.target/aarch64/pr110625_2.c | 14 ++++++
>  3 files changed, 69 insertions(+), 4 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_1.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_2.c
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index 560e5431636..478a4e00110 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -16788,10 +16788,15 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
>      {
>        unsigned int base
>  	= aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
> -
> -      /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
> -	 that's not yet the case.  */
> -      ops->reduction_latency = MAX (ops->reduction_latency, base * count);
> +      if (STMT_VINFO_REDUC_DEF (stmt_info)
> +	  && STMT_VINFO_FORCE_SINGLE_CYCLE (
> +	    info_for_reduction (m_vinfo, stmt_info)))
> +	/* ??? Ideally we'd use a tree to reduce the copies down to 1 vector,
> +	   and then accumulate that, but at the moment the loop-carried
> +	   dependency includes all copies.  */
> +	ops->reduction_latency = MAX (ops->reduction_latency, base * count);
> +      else
> +	ops->reduction_latency = MAX (ops->reduction_latency, base);
>      }
>  
>    /* Assume that multiply-adds will become a single operation.  */
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_1.c b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
> new file mode 100644
> index 00000000000..0965cac33a0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
> @@ -0,0 +1,46 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details -fno-tree-slp-vectorize" } */
> +/* { dg-final { scan-tree-dump-not "reduction latency = 8" "vect" } } */
> +
> +/* Do not increase the vector body cost due to the incorrect reduction latency
> +    Original vector body cost = 51
> +    Scalar issue estimate:
> +      ...
> +      reduction latency = 2
> +      estimated min cycles per iteration = 2.000000
> +      estimated cycles per vector iteration (for VF 2) = 4.000000
> +    Vector issue estimate:
> +      ...
> +      reduction latency = 8      <-- Too large
> +      estimated min cycles per iteration = 8.000000
> +    Increasing body cost to 102 because scalar code would issue more quickly
> +      ...
> +    missed:  cost model: the vector iteration cost = 102 divided by the scalar iteration cost = 44 is greater or equal to the vectorization factor = 2.
> +    missed:  not vectorized: vectorization not profitable.  */
> +
> +typedef struct
> +{
> +  unsigned short m1, m2, m3, m4;
> +} the_struct_t;
> +typedef struct
> +{
> +  double m1, m2, m3, m4, m5;
> +} the_struct2_t;
> +
> +double
> +bar (the_struct2_t *);
> +
> +double
> +foo (double *k, unsigned int n, the_struct_t *the_struct)
> +{
> +  unsigned int u;
> +  the_struct2_t result;
> +  for (u = 0; u < n; u++, k--)
> +    {
> +      result.m1 += (*k) * the_struct[u].m1;
> +      result.m2 += (*k) * the_struct[u].m2;
> +      result.m3 += (*k) * the_struct[u].m3;
> +      result.m4 += (*k) * the_struct[u].m4;
> +    }
> +  return bar (&result);
> +}
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_2.c b/gcc/testsuite/gcc.target/aarch64/pr110625_2.c
> new file mode 100644
> index 00000000000..7a84aa8355e
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr110625_2.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details -fno-tree-slp-vectorize" } */
> +/* { dg-final { scan-tree-dump "reduction latency = 8" "vect" } } */
> +
> +/* The reduction latency should be multiplied by the count for
> +   single_defuse_cycle.  */
> +
> +long
> +f (long res, short *ptr1, short *ptr2, int n)
> +{
> +  for (int i = 0; i < n; ++i)
> +    res += (long) ptr1[i] << ptr2[i];
> +  return res;
> +}
Hao Liu OS July 26, 2023, 2:01 a.m. UTC | #5
> When was STMT_VINFO_REDUC_DEF empty?  I just want to make sure that we're not papering over an issue elsewhere.

Yes, I also wonder if this is an issue in vectorizable_reduction.  Below is the GIMPLE of "gcc.target/aarch64/sve/cost_model_13.c":

  <bb 3>:
  # res_18 = PHI <res_15(7), 0(6)>
  # i_20 = PHI <i_16(7), 0(6)>
  _1 = (long unsigned int) i_20;
  _2 = _1 * 2;
  _3 = x_14(D) + _2;
  _4 = *_3;
  _5 = (unsigned short) _4;
  res.0_6 = (unsigned short) res_18;
  _7 = _5 + res.0_6;                             <-- The current stmt_info
  res_15 = (short int) _7;
  i_16 = i_20 + 1;
  if (n_11(D) > i_16)
    goto <bb 7>;
  else
    goto <bb 4>;

  <bb 7>:
  goto <bb 3>;

It looks like STMT_VINFO_REDUC_DEF should be "res_18 = PHI <res_15(7), 0(6)>"?
The status here is:
  STMT_VINFO_REDUC_IDX (stmt_info): 1
  STMT_VINFO_REDUC_TYPE (stmt_info): TREE_CODE_REDUCTION
  STMT_VINFO_REDUC_VECTYPE (stmt_info): 0x0

Thanks,
Hao
Richard Biener July 26, 2023, 8:47 a.m. UTC | #6
On Wed, Jul 26, 2023 at 4:02 AM Hao Liu OS via Gcc-patches
<gcc-patches@gcc.gnu.org> wrote:
>
> > When was STMT_VINFO_REDUC_DEF empty?  I just want to make sure that we're not papering over an issue elsewhere.
>
> Yes, I also wonder if this is an issue in vectorizable_reduction.  Below is the the gimple of "gcc.target/aarch64/sve/cost_model_13.c":
>
>   <bb 3>:
>   # res_18 = PHI <res_15(7), 0(6)>
>   # i_20 = PHI <i_16(7), 0(6)>
>   _1 = (long unsigned int) i_20;
>   _2 = _1 * 2;
>   _3 = x_14(D) + _2;
>   _4 = *_3;
>   _5 = (unsigned short) _4;
>   res.0_6 = (unsigned short) res_18;
>   _7 = _5 + res.0_6;                             <-- The current stmt_info
>   res_15 = (short int) _7;
>   i_16 = i_20 + 1;
>   if (n_11(D) > i_16)
>     goto <bb 7>;
>   else
>     goto <bb 4>;
>
>   <bb 7>:
>   goto <bb 3>;
>
> It looks like that STMT_VINFO_REDUC_DEF should be "res_18 = PHI <res_15(7), 0(6)>"?
> The status here is:
>   STMT_VINFO_REDUC_IDX (stmt_info): 1
>   STMT_VINFO_REDUC_TYPE (stmt_info): TREE_CODE_REDUCTION
>   STMT_VINFO_REDUC_VECTYPE (stmt_info): 0x0

Not all stmts in the SSA cycle forming the reduction have
STMT_VINFO_REDUC_DEF set; only the last (latch def) and live stmts have it
at the moment.

Richard.

> Thanks,
> Hao
>
> ________________________________________
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Tuesday, July 25, 2023 17:44
> To: Hao Liu OS
> Cc: GCC-patches@gcc.gnu.org
> Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]
>
> Hao Liu OS <hliu@os.amperecomputing.com> writes:
> > Hi,
> >
> > Thanks for the suggestion.  I tested it and found a gcc_assert failure:
> >     gcc.target/aarch64/sve/cost_model_13.c (internal compiler error: in info_for_reduction, at tree-vect-loop.cc:5473)
> >
> > It is caused by empty STMT_VINFO_REDUC_DEF.
>
> When was STMT_VINFO_REDUC_DEF empty?  I just want to make sure that
> we're not papering over an issue elsewhere.
>
> Thanks,
> Richard
>
>   So, I added an extra check before checking single_defuse_cycle. The updated patch is below.  Is it OK for trunk?
> >
> > ---
> >
> > The new costs should only count reduction latency by multiplying count for
> > single_defuse_cycle.  For other situations, this will increase the reduction
> > latency a lot and miss vectorization opportunities.
> >
> > Tested on aarch64-linux-gnu.
> >
> > gcc/ChangeLog:
> >
> >       PR target/110625
> >       * config/aarch64/aarch64.cc (count_ops): Only '* count' for
> >       single_defuse_cycle while counting reduction_latency.
> >
> > gcc/testsuite/ChangeLog:
> >
> >       * gcc.target/aarch64/pr110625_1.c: New testcase.
> >       * gcc.target/aarch64/pr110625_2.c: New testcase.
> > ---
> >  gcc/config/aarch64/aarch64.cc                 | 13 ++++--
> >  gcc/testsuite/gcc.target/aarch64/pr110625_1.c | 46 +++++++++++++++++++
> >  gcc/testsuite/gcc.target/aarch64/pr110625_2.c | 14 ++++++
> >  3 files changed, 69 insertions(+), 4 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_1.c
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_2.c
> >
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index 560e5431636..478a4e00110 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -16788,10 +16788,15 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
> >      {
> >        unsigned int base
> >       = aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
> > -
> > -      /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
> > -      that's not yet the case.  */
> > -      ops->reduction_latency = MAX (ops->reduction_latency, base * count);
> > +      if (STMT_VINFO_REDUC_DEF (stmt_info)
> > +       && STMT_VINFO_FORCE_SINGLE_CYCLE (
> > +         info_for_reduction (m_vinfo, stmt_info)))
> > +     /* ??? Ideally we'd use a tree to reduce the copies down to 1 vector,
> > +        and then accumulate that, but at the moment the loop-carried
> > +        dependency includes all copies.  */
> > +     ops->reduction_latency = MAX (ops->reduction_latency, base * count);
> > +      else
> > +     ops->reduction_latency = MAX (ops->reduction_latency, base);
> >      }
> >
> >    /* Assume that multiply-adds will become a single operation.  */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_1.c b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
> > new file mode 100644
> > index 00000000000..0965cac33a0
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
> > @@ -0,0 +1,46 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details -fno-tree-slp-vectorize" } */
> > +/* { dg-final { scan-tree-dump-not "reduction latency = 8" "vect" } } */
> > +
> > +/* Do not increase the vector body cost due to the incorrect reduction latency
> > +    Original vector body cost = 51
> > +    Scalar issue estimate:
> > +      ...
> > +      reduction latency = 2
> > +      estimated min cycles per iteration = 2.000000
> > +      estimated cycles per vector iteration (for VF 2) = 4.000000
> > +    Vector issue estimate:
> > +      ...
> > +      reduction latency = 8      <-- Too large
> > +      estimated min cycles per iteration = 8.000000
> > +    Increasing body cost to 102 because scalar code would issue more quickly
> > +      ...
> > +    missed:  cost model: the vector iteration cost = 102 divided by the scalar iteration cost = 44 is greater or equal to the vectorization factor = 2.
> > +    missed:  not vectorized: vectorization not profitable.  */
> > +
> > +typedef struct
> > +{
> > +  unsigned short m1, m2, m3, m4;
> > +} the_struct_t;
> > +typedef struct
> > +{
> > +  double m1, m2, m3, m4, m5;
> > +} the_struct2_t;
> > +
> > +double
> > +bar (the_struct2_t *);
> > +
> > +double
> > +foo (double *k, unsigned int n, the_struct_t *the_struct)
> > +{
> > +  unsigned int u;
> > +  the_struct2_t result;
> > +  for (u = 0; u < n; u++, k--)
> > +    {
> > +      result.m1 += (*k) * the_struct[u].m1;
> > +      result.m2 += (*k) * the_struct[u].m2;
> > +      result.m3 += (*k) * the_struct[u].m3;
> > +      result.m4 += (*k) * the_struct[u].m4;
> > +    }
> > +  return bar (&result);
> > +}
> > diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_2.c b/gcc/testsuite/gcc.target/aarch64/pr110625_2.c
> > new file mode 100644
> > index 00000000000..7a84aa8355e
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/pr110625_2.c
> > @@ -0,0 +1,14 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details -fno-tree-slp-vectorize" } */
> > +/* { dg-final { scan-tree-dump "reduction latency = 8" "vect" } } */
> > +
> > +/* The reduction latency should be multiplied by the count for
> > +   single_defuse_cycle.  */
> > +
> > +long
> > +f (long res, short *ptr1, short *ptr2, int n)
> > +{
> > +  for (int i = 0; i < n; ++i)
> > +    res += (long) ptr1[i] << ptr2[i];
> > +  return res;
> > +}
Richard Sandiford July 26, 2023, 9:14 a.m. UTC | #7
Richard Biener <richard.guenther@gmail.com> writes:
> On Wed, Jul 26, 2023 at 4:02 AM Hao Liu OS via Gcc-patches
> <gcc-patches@gcc.gnu.org> wrote:
>>
>> > When was STMT_VINFO_REDUC_DEF empty?  I just want to make sure that we're not papering over an issue elsewhere.
>>
>> Yes, I also wonder if this is an issue in vectorizable_reduction.  Below is the the gimple of "gcc.target/aarch64/sve/cost_model_13.c":
>>
>>   <bb 3>:
>>   # res_18 = PHI <res_15(7), 0(6)>
>>   # i_20 = PHI <i_16(7), 0(6)>
>>   _1 = (long unsigned int) i_20;
>>   _2 = _1 * 2;
>>   _3 = x_14(D) + _2;
>>   _4 = *_3;
>>   _5 = (unsigned short) _4;
>>   res.0_6 = (unsigned short) res_18;
>>   _7 = _5 + res.0_6;                             <-- The current stmt_info
>>   res_15 = (short int) _7;
>>   i_16 = i_20 + 1;
>>   if (n_11(D) > i_16)
>>     goto <bb 7>;
>>   else
>>     goto <bb 4>;
>>
>>   <bb 7>:
>>   goto <bb 3>;
>>
>> It looks like that STMT_VINFO_REDUC_DEF should be "res_18 = PHI <res_15(7), 0(6)>"?
>> The status here is:
>>   STMT_VINFO_REDUC_IDX (stmt_info): 1
>>   STMT_VINFO_REDUC_TYPE (stmt_info): TREE_CODE_REDUCTION
>>   STMT_VINFO_REDUC_VECTYPE (stmt_info): 0x0
>
> Not all stmts in the SSA cycle forming the reduction have
> STMT_VINFO_REDUC_DEF set,
> only the last (latch def) and live stmts have at the moment.

Ah, thanks.  In that case, Hao, I think we can avoid the ICE by changing:

  if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
      && vect_is_reduction (stmt_info))

to:

  if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
      && STMT_VINFO_LIVE_P (stmt_info)
      && vect_is_reduction (stmt_info))

instead of using a null check.

I see that vectorizable_reduction calculates a reduc_chain_length.
Would it be OK to store that in the stmt_vec_info?  I suppose the
AArch64 code should be multiplying by that as well.  (It would be a
separate patch from this one though.)

Richard


>
> Richard.
>
>> Thanks,
>> Hao
>>
>> ________________________________________
>> From: Richard Sandiford <richard.sandiford@arm.com>
>> Sent: Tuesday, July 25, 2023 17:44
>> To: Hao Liu OS
>> Cc: GCC-patches@gcc.gnu.org
>> Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]
>>
>> Hao Liu OS <hliu@os.amperecomputing.com> writes:
>> > Hi,
>> >
>> > Thanks for the suggestion.  I tested it and found a gcc_assert failure:
>> >     gcc.target/aarch64/sve/cost_model_13.c (internal compiler error: in info_for_reduction, at tree-vect-loop.cc:5473)
>> >
>> > It is caused by empty STMT_VINFO_REDUC_DEF.
>>
>> When was STMT_VINFO_REDUC_DEF empty?  I just want to make sure that
>> we're not papering over an issue elsewhere.
>>
>> Thanks,
>> Richard
>>
>>   So, I added an extra check before checking single_defuse_cycle. The updated patch is below.  Is it OK for trunk?
>> >
>> > ---
>> >
>> > The new costs should only count reduction latency by multiplying count for
>> > single_defuse_cycle.  For other situations, this will increase the reduction
>> > latency a lot and miss vectorization opportunities.
>> >
>> > Tested on aarch64-linux-gnu.
>> >
>> > gcc/ChangeLog:
>> >
>> >       PR target/110625
>> >       * config/aarch64/aarch64.cc (count_ops): Only '* count' for
>> >       single_defuse_cycle while counting reduction_latency.
>> >
>> > gcc/testsuite/ChangeLog:
>> >
>> >       * gcc.target/aarch64/pr110625_1.c: New testcase.
>> >       * gcc.target/aarch64/pr110625_2.c: New testcase.
>> > ---
>> >  gcc/config/aarch64/aarch64.cc                 | 13 ++++--
>> >  gcc/testsuite/gcc.target/aarch64/pr110625_1.c | 46 +++++++++++++++++++
>> >  gcc/testsuite/gcc.target/aarch64/pr110625_2.c | 14 ++++++
>> >  3 files changed, 69 insertions(+), 4 deletions(-)
>> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_1.c
>> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_2.c
>> >
>> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>> > index 560e5431636..478a4e00110 100644
>> > --- a/gcc/config/aarch64/aarch64.cc
>> > +++ b/gcc/config/aarch64/aarch64.cc
>> > @@ -16788,10 +16788,15 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
>> >      {
>> >        unsigned int base
>> >       = aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
>> > -
>> > -      /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
>> > -      that's not yet the case.  */
>> > -      ops->reduction_latency = MAX (ops->reduction_latency, base * count);
>> > +      if (STMT_VINFO_REDUC_DEF (stmt_info)
>> > +       && STMT_VINFO_FORCE_SINGLE_CYCLE (
>> > +         info_for_reduction (m_vinfo, stmt_info)))
>> > +     /* ??? Ideally we'd use a tree to reduce the copies down to 1 vector,
>> > +        and then accumulate that, but at the moment the loop-carried
>> > +        dependency includes all copies.  */
>> > +     ops->reduction_latency = MAX (ops->reduction_latency, base * count);
>> > +      else
>> > +     ops->reduction_latency = MAX (ops->reduction_latency, base);
>> >      }
>> >
>> >    /* Assume that multiply-adds will become a single operation.  */
>> > diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_1.c b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
>> > new file mode 100644
>> > index 00000000000..0965cac33a0
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
>> > @@ -0,0 +1,46 @@
>> > +/* { dg-do compile } */
>> > +/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details -fno-tree-slp-vectorize" } */
>> > +/* { dg-final { scan-tree-dump-not "reduction latency = 8" "vect" } } */
>> > +
>> > +/* Do not increase the vector body cost due to the incorrect reduction latency
>> > +    Original vector body cost = 51
>> > +    Scalar issue estimate:
>> > +      ...
>> > +      reduction latency = 2
>> > +      estimated min cycles per iteration = 2.000000
>> > +      estimated cycles per vector iteration (for VF 2) = 4.000000
>> > +    Vector issue estimate:
>> > +      ...
>> > +      reduction latency = 8      <-- Too large
>> > +      estimated min cycles per iteration = 8.000000
>> > +    Increasing body cost to 102 because scalar code would issue more quickly
>> > +      ...
>> > +    missed:  cost model: the vector iteration cost = 102 divided by the scalar iteration cost = 44 is greater or equal to the vectorization factor = 2.
>> > +    missed:  not vectorized: vectorization not profitable.  */
>> > +
>> > +typedef struct
>> > +{
>> > +  unsigned short m1, m2, m3, m4;
>> > +} the_struct_t;
>> > +typedef struct
>> > +{
>> > +  double m1, m2, m3, m4, m5;
>> > +} the_struct2_t;
>> > +
>> > +double
>> > +bar (the_struct2_t *);
>> > +
>> > +double
>> > +foo (double *k, unsigned int n, the_struct_t *the_struct)
>> > +{
>> > +  unsigned int u;
>> > +  the_struct2_t result;
>> > +  for (u = 0; u < n; u++, k--)
>> > +    {
>> > +      result.m1 += (*k) * the_struct[u].m1;
>> > +      result.m2 += (*k) * the_struct[u].m2;
>> > +      result.m3 += (*k) * the_struct[u].m3;
>> > +      result.m4 += (*k) * the_struct[u].m4;
>> > +    }
>> > +  return bar (&result);
>> > +}
>> > diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_2.c b/gcc/testsuite/gcc.target/aarch64/pr110625_2.c
>> > new file mode 100644
>> > index 00000000000..7a84aa8355e
>> > --- /dev/null
>> > +++ b/gcc/testsuite/gcc.target/aarch64/pr110625_2.c
>> > @@ -0,0 +1,14 @@
>> > +/* { dg-do compile } */
>> > +/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details -fno-tree-slp-vectorize" } */
>> > +/* { dg-final { scan-tree-dump "reduction latency = 8" "vect" } } */
>> > +
>> > +/* The reduction latency should be multiplied by the count for
>> > +   single_defuse_cycle.  */
>> > +
>> > +long
>> > +f (long res, short *ptr1, short *ptr2, int n)
>> > +{
>> > +  for (int i = 0; i < n; ++i)
>> > +    res += (long) ptr1[i] << ptr2[i];
>> > +  return res;
>> > +}
Richard Biener July 26, 2023, 10:02 a.m. UTC | #8
On Wed, Jul 26, 2023 at 11:14 AM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Wed, Jul 26, 2023 at 4:02 AM Hao Liu OS via Gcc-patches
> > <gcc-patches@gcc.gnu.org> wrote:
> >>
> >> > When was STMT_VINFO_REDUC_DEF empty?  I just want to make sure that we're not papering over an issue elsewhere.
> >>
> >> Yes, I also wonder if this is an issue in vectorizable_reduction.  Below is the the gimple of "gcc.target/aarch64/sve/cost_model_13.c":
> >>
> >>   <bb 3>:
> >>   # res_18 = PHI <res_15(7), 0(6)>
> >>   # i_20 = PHI <i_16(7), 0(6)>
> >>   _1 = (long unsigned int) i_20;
> >>   _2 = _1 * 2;
> >>   _3 = x_14(D) + _2;
> >>   _4 = *_3;
> >>   _5 = (unsigned short) _4;
> >>   res.0_6 = (unsigned short) res_18;
> >>   _7 = _5 + res.0_6;                             <-- The current stmt_info
> >>   res_15 = (short int) _7;
> >>   i_16 = i_20 + 1;
> >>   if (n_11(D) > i_16)
> >>     goto <bb 7>;
> >>   else
> >>     goto <bb 4>;
> >>
> >>   <bb 7>:
> >>   goto <bb 3>;
> >>
>> >> It looks like STMT_VINFO_REDUC_DEF should be "res_18 = PHI <res_15(7), 0(6)>"?
> >> The status here is:
> >>   STMT_VINFO_REDUC_IDX (stmt_info): 1
> >>   STMT_VINFO_REDUC_TYPE (stmt_info): TREE_CODE_REDUCTION
> >>   STMT_VINFO_REDUC_VECTYPE (stmt_info): 0x0
> >
> > Not all stmts in the SSA cycle forming the reduction have
> > STMT_VINFO_REDUC_DEF set,
> > only the last (latch def) and live stmts have at the moment.
>
> Ah, thanks.  In that case, Hao, I think we can avoid the ICE by changing:
>
>   if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
>       && vect_is_reduction (stmt_info))
>
> to:
>
>   if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
>       && STMT_VINFO_LIVE_P (stmt_info)
>       && vect_is_reduction (stmt_info))
>
> instead of using a null check.

But as seen, you will then miss stmts that are part of the reduction?
In principle we could set STMT_VINFO_REDUC_DEF on other stmts
as well.  See vectorizable_reduction in the

  while (reduc_def != PHI_RESULT (reduc_def_phi))

loop.

> I see that vectorizable_reduction calculates a reduc_chain_length.
> Would it be OK to store that in the stmt_vec_info?  I suppose the
> AArch64 code should be multiplying by that as well.  (It would be a
> separate patch from this one though.)

I don't think that's too relevant here (it also counts noop conversions).

Richard.

>
> Richard
>
>
> >
> > Richard.
> >
> >> Thanks,
> >> Hao
> >>
> >> ________________________________________
> >> From: Richard Sandiford <richard.sandiford@arm.com>
> >> Sent: Tuesday, July 25, 2023 17:44
> >> To: Hao Liu OS
> >> Cc: GCC-patches@gcc.gnu.org
> >> Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]
> >>
> >> Hao Liu OS <hliu@os.amperecomputing.com> writes:
> >> > Hi,
> >> >
> >> > Thanks for the suggestion.  I tested it and found a gcc_assert failure:
> >> >     gcc.target/aarch64/sve/cost_model_13.c (internal compiler error: in info_for_reduction, at tree-vect-loop.cc:5473)
> >> >
> >> > It is caused by empty STMT_VINFO_REDUC_DEF.
> >>
> >> When was STMT_VINFO_REDUC_DEF empty?  I just want to make sure that
> >> we're not papering over an issue elsewhere.
> >>
> >> Thanks,
> >> Richard
> >>
> >>   So, I added an extra check before checking single_defuse_cycle. The updated patch is below.  Is it OK for trunk?
> >> >
> >> > ---
> >> >
> >> > The new costs should only count reduction latency by multiplying count for
> >> > single_defuse_cycle.  For other situations, this will increase the reduction
> >> > latency a lot and miss vectorization opportunities.
> >> >
> >> > Tested on aarch64-linux-gnu.
> >> >
> >> > gcc/ChangeLog:
> >> >
> >> >       PR target/110625
> >> >       * config/aarch64/aarch64.cc (count_ops): Only '* count' for
> >> >       single_defuse_cycle while counting reduction_latency.
> >> >
> >> > gcc/testsuite/ChangeLog:
> >> >
> >> >       * gcc.target/aarch64/pr110625_1.c: New testcase.
> >> >       * gcc.target/aarch64/pr110625_2.c: New testcase.
> >> > ---
> >> >  gcc/config/aarch64/aarch64.cc                 | 13 ++++--
> >> >  gcc/testsuite/gcc.target/aarch64/pr110625_1.c | 46 +++++++++++++++++++
> >> >  gcc/testsuite/gcc.target/aarch64/pr110625_2.c | 14 ++++++
> >> >  3 files changed, 69 insertions(+), 4 deletions(-)
> >> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_1.c
> >> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_2.c
> >> >
> >> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> >> > index 560e5431636..478a4e00110 100644
> >> > --- a/gcc/config/aarch64/aarch64.cc
> >> > +++ b/gcc/config/aarch64/aarch64.cc
> >> > @@ -16788,10 +16788,15 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
> >> >      {
> >> >        unsigned int base
> >> >       = aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
> >> > -
> >> > -      /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
> >> > -      that's not yet the case.  */
> >> > -      ops->reduction_latency = MAX (ops->reduction_latency, base * count);
> >> > +      if (STMT_VINFO_REDUC_DEF (stmt_info)
> >> > +       && STMT_VINFO_FORCE_SINGLE_CYCLE (
> >> > +         info_for_reduction (m_vinfo, stmt_info)))
> >> > +     /* ??? Ideally we'd use a tree to reduce the copies down to 1 vector,
> >> > +        and then accumulate that, but at the moment the loop-carried
> >> > +        dependency includes all copies.  */
> >> > +     ops->reduction_latency = MAX (ops->reduction_latency, base * count);
> >> > +      else
> >> > +     ops->reduction_latency = MAX (ops->reduction_latency, base);
> >> >      }
> >> >
> >> >    /* Assume that multiply-adds will become a single operation.  */
> >> > diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_1.c b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
> >> > new file mode 100644
> >> > index 00000000000..0965cac33a0
> >> > --- /dev/null
> >> > +++ b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
> >> > @@ -0,0 +1,46 @@
> >> > +/* { dg-do compile } */
> >> > +/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details -fno-tree-slp-vectorize" } */
> >> > +/* { dg-final { scan-tree-dump-not "reduction latency = 8" "vect" } } */
> >> > +
> >> > +/* Do not increase the vector body cost due to the incorrect reduction latency
> >> > +    Original vector body cost = 51
> >> > +    Scalar issue estimate:
> >> > +      ...
> >> > +      reduction latency = 2
> >> > +      estimated min cycles per iteration = 2.000000
> >> > +      estimated cycles per vector iteration (for VF 2) = 4.000000
> >> > +    Vector issue estimate:
> >> > +      ...
> >> > +      reduction latency = 8      <-- Too large
> >> > +      estimated min cycles per iteration = 8.000000
> >> > +    Increasing body cost to 102 because scalar code would issue more quickly
> >> > +      ...
> >> > +    missed:  cost model: the vector iteration cost = 102 divided by the scalar iteration cost = 44 is greater or equal to the vectorization factor = 2.
> >> > +    missed:  not vectorized: vectorization not profitable.  */
> >> > +
> >> > +typedef struct
> >> > +{
> >> > +  unsigned short m1, m2, m3, m4;
> >> > +} the_struct_t;
> >> > +typedef struct
> >> > +{
> >> > +  double m1, m2, m3, m4, m5;
> >> > +} the_struct2_t;
> >> > +
> >> > +double
> >> > +bar (the_struct2_t *);
> >> > +
> >> > +double
> >> > +foo (double *k, unsigned int n, the_struct_t *the_struct)
> >> > +{
> >> > +  unsigned int u;
> >> > +  the_struct2_t result;
> >> > +  for (u = 0; u < n; u++, k--)
> >> > +    {
> >> > +      result.m1 += (*k) * the_struct[u].m1;
> >> > +      result.m2 += (*k) * the_struct[u].m2;
> >> > +      result.m3 += (*k) * the_struct[u].m3;
> >> > +      result.m4 += (*k) * the_struct[u].m4;
> >> > +    }
> >> > +  return bar (&result);
> >> > +}
> >> > diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_2.c b/gcc/testsuite/gcc.target/aarch64/pr110625_2.c
> >> > new file mode 100644
> >> > index 00000000000..7a84aa8355e
> >> > --- /dev/null
> >> > +++ b/gcc/testsuite/gcc.target/aarch64/pr110625_2.c
> >> > @@ -0,0 +1,14 @@
> >> > +/* { dg-do compile } */
> >> > +/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details -fno-tree-slp-vectorize" } */
> >> > +/* { dg-final { scan-tree-dump "reduction latency = 8" "vect" } } */
> >> > +
> >> > +/* The reduction latency should be multiplied by the count for
> >> > +   single_defuse_cycle.  */
> >> > +
> >> > +long
> >> > +f (long res, short *ptr1, short *ptr2, int n)
> >> > +{
> >> > +  for (int i = 0; i < n; ++i)
> >> > +    res += (long) ptr1[i] << ptr2[i];
> >> > +  return res;
> >> > +}
Richard Sandiford July 26, 2023, 10:12 a.m. UTC | #9
Richard Biener <richard.guenther@gmail.com> writes:
> On Wed, Jul 26, 2023 at 11:14 AM Richard Sandiford
> <richard.sandiford@arm.com> wrote:
>>
>> Richard Biener <richard.guenther@gmail.com> writes:
>> > On Wed, Jul 26, 2023 at 4:02 AM Hao Liu OS via Gcc-patches
>> > <gcc-patches@gcc.gnu.org> wrote:
>> >>
>> >> > When was STMT_VINFO_REDUC_DEF empty?  I just want to make sure that we're not papering over an issue elsewhere.
>> >>
>> >> Yes, I also wonder if this is an issue in vectorizable_reduction.  Below is the gimple of "gcc.target/aarch64/sve/cost_model_13.c":
>> >>
>> >>   <bb 3>:
>> >>   # res_18 = PHI <res_15(7), 0(6)>
>> >>   # i_20 = PHI <i_16(7), 0(6)>
>> >>   _1 = (long unsigned int) i_20;
>> >>   _2 = _1 * 2;
>> >>   _3 = x_14(D) + _2;
>> >>   _4 = *_3;
>> >>   _5 = (unsigned short) _4;
>> >>   res.0_6 = (unsigned short) res_18;
>> >>   _7 = _5 + res.0_6;                             <-- The current stmt_info
>> >>   res_15 = (short int) _7;
>> >>   i_16 = i_20 + 1;
>> >>   if (n_11(D) > i_16)
>> >>     goto <bb 7>;
>> >>   else
>> >>     goto <bb 4>;
>> >>
>> >>   <bb 7>:
>> >>   goto <bb 3>;
>> >>
>> >> It looks like STMT_VINFO_REDUC_DEF should be "res_18 = PHI <res_15(7), 0(6)>"?
>> >> The status here is:
>> >>   STMT_VINFO_REDUC_IDX (stmt_info): 1
>> >>   STMT_VINFO_REDUC_TYPE (stmt_info): TREE_CODE_REDUCTION
>> >>   STMT_VINFO_REDUC_VECTYPE (stmt_info): 0x0
>> >
>> > Not all stmts in the SSA cycle forming the reduction have
>> > STMT_VINFO_REDUC_DEF set,
>> > only the last (latch def) and live stmts have at the moment.
>>
>> Ah, thanks.  In that case, Hao, I think we can avoid the ICE by changing:
>>
>>   if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
>>       && vect_is_reduction (stmt_info))
>>
>> to:
>>
>>   if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
>>       && STMT_VINFO_LIVE_P (stmt_info)
>>       && vect_is_reduction (stmt_info))
>>
>> instead of using a null check.
>
> But as seen you will miss stmts that are part of the reduction then?

Yeah, but the code is doing a maximum of all the reductions in the loop:

      /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
	 that's not yet the case.  */
      ops->reduction_latency = MAX (ops->reduction_latency, base * count);

So as it stands, we only need to see each reduction (as opposed to each
reduction statement) once.

But we do want to know the length of each reduction...

> In principle we could put STMT_VINFO_REDUC_DEF to other stmts
> as well.  See vectorizable_reduction in the
>
>   while (reduc_def != PHI_RESULT (reduc_def_phi))
>
> loop.

Nod.  That's where I'd got the STMT_VINFO_LIVE_P thing from.

>> I see that vectorizable_reduction calculates a reduc_chain_length.
>> Would it be OK to store that in the stmt_vec_info?  I suppose the
>> AArch64 code should be multiplying by that as well.  (It would be a
>> separate patch from this one though.)
>
> I don't think that's too relevant here (it also counts noop conversions).

Bah.  I'm loath to copy that loop and just pick out the relevant
statements though.

I suppose if every statement had a STMT_VINFO_REDUC_DEF, aarch64 could
maintain a hash map from STMT_VINFO_REDUC_DEF to total latencies, and
then take the maximum of those total latencies.  It sounds pretty
complex though...

Thanks,
Richard
Richard Biener July 26, 2023, noon UTC | #10
On Wed, Jul 26, 2023 at 12:12 PM Richard Sandiford
<richard.sandiford@arm.com> wrote:
>
> Richard Biener <richard.guenther@gmail.com> writes:
> > On Wed, Jul 26, 2023 at 11:14 AM Richard Sandiford
> > <richard.sandiford@arm.com> wrote:
> >>
> >> Richard Biener <richard.guenther@gmail.com> writes:
> >> > On Wed, Jul 26, 2023 at 4:02 AM Hao Liu OS via Gcc-patches
> >> > <gcc-patches@gcc.gnu.org> wrote:
> >> >>
> >> >> > When was STMT_VINFO_REDUC_DEF empty?  I just want to make sure that we're not papering over an issue elsewhere.
> >> >>
> >> >> Yes, I also wonder if this is an issue in vectorizable_reduction.  Below is the gimple of "gcc.target/aarch64/sve/cost_model_13.c":
> >> >>
> >> >>   <bb 3>:
> >> >>   # res_18 = PHI <res_15(7), 0(6)>
> >> >>   # i_20 = PHI <i_16(7), 0(6)>
> >> >>   _1 = (long unsigned int) i_20;
> >> >>   _2 = _1 * 2;
> >> >>   _3 = x_14(D) + _2;
> >> >>   _4 = *_3;
> >> >>   _5 = (unsigned short) _4;
> >> >>   res.0_6 = (unsigned short) res_18;
> >> >>   _7 = _5 + res.0_6;                             <-- The current stmt_info
> >> >>   res_15 = (short int) _7;
> >> >>   i_16 = i_20 + 1;
> >> >>   if (n_11(D) > i_16)
> >> >>     goto <bb 7>;
> >> >>   else
> >> >>     goto <bb 4>;
> >> >>
> >> >>   <bb 7>:
> >> >>   goto <bb 3>;
> >> >>
> >> >> It looks like STMT_VINFO_REDUC_DEF should be "res_18 = PHI <res_15(7), 0(6)>"?
> >> >> The status here is:
> >> >>   STMT_VINFO_REDUC_IDX (stmt_info): 1
> >> >>   STMT_VINFO_REDUC_TYPE (stmt_info): TREE_CODE_REDUCTION
> >> >>   STMT_VINFO_REDUC_VECTYPE (stmt_info): 0x0
> >> >
> >> > Not all stmts in the SSA cycle forming the reduction have
> >> > STMT_VINFO_REDUC_DEF set,
> >> > only the last (latch def) and live stmts have at the moment.
> >>
> >> Ah, thanks.  In that case, Hao, I think we can avoid the ICE by changing:
> >>
> >>   if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
> >>       && vect_is_reduction (stmt_info))
> >>
> >> to:
> >>
> >>   if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
> >>       && STMT_VINFO_LIVE_P (stmt_info)
> >>       && vect_is_reduction (stmt_info))
> >>
> >> instead of using a null check.
> >
> > But as seen you will miss stmts that are part of the reduction then?
>
> Yeah, but the code is doing a maximum of all the reductions in the loop:
>
>       /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
>          that's not yet the case.  */
>       ops->reduction_latency = MAX (ops->reduction_latency, base * count);
>
> So as it stands, we only need to see each reduction (as opposed to each
> reduction statement) once.
>
> But we do want to know the length of each reduction...

Yeah, it really shows that the current costing infrastructure is far from
perfect here ...

We could pass you the stmt_info for the reduction PHI (aka the
info_for_reduction)
once with a special kind, vect_reduction, so you could walk relevant stmts
from that.  You get a number of add_stmts for each reduction, which could
be a hint of the length as well ...  (but I think we always pass the 'main'
stmt_info here)

> > In principle we could put STMT_VINFO_REDUC_DEF to other stmts
> > as well.  See vectorizable_reduction in the
> >
> >   while (reduc_def != PHI_RESULT (reduc_def_phi))
> >
> > loop.
>
> Nod.  That's where I'd got the STMT_VINFO_LIVE_P thing from.
>
> >> I see that vectorizable_reduction calculates a reduc_chain_length.
> >> Would it be OK to store that in the stmt_vec_info?  I suppose the
> >> AArch64 code should be multiplying by that as well.  (It would be a
> >> separate patch from this one though.)
> >
> > I don't think that's too relevant here (it also counts noop conversions).
>
> Bah.  I'm loath to copy that loop and just pick out the relevant
> statements though.
>
> I suppose if every statement had a STMT_VINFO_REDUC_DEF, aarch64 could
> maintain a hash map from STMT_VINFO_REDUC_DEF to total latencies, and
> then take the maximum of those total latencies.  It sounds pretty
> complex though...

Ick.  Yeah.  I think most of the costing is still nearly GIGO; I hardly see any
loop vectorization disqualified because of cost (happens for BB SLP though).

Richard.

> Thanks,
> Richard
>
Hao Liu OS July 26, 2023, 12:54 p.m. UTC | #11
> Ah, thanks.  In that case, Hao, I think we can avoid the ICE by changing:
>
>   if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
>       && vect_is_reduction (stmt_info))
>
> to:
>
>   if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
>       && STMT_VINFO_LIVE_P (stmt_info)
>       && vect_is_reduction (stmt_info))

I tried this and it can indeed avoid the ICE.  But it seems the reduction_latency calculation is also skipped; after this modification, the reduction_latency is 0 for this case.  Previously, it was 1 and 2 for scalar and vector respectively.

IMHO, to keep it consistent with the previous result, should we move the STMT_VINFO_LIVE_P check below, inside the if?  Such as:

  /* Calculate the minimum cycles per iteration imposed by a reduction
     operation.  */
  if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
      && vect_is_reduction (stmt_info))
    {
      unsigned int base
        = aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
      if (STMT_VINFO_LIVE_P (stmt_info) && STMT_VINFO_FORCE_SINGLE_CYCLE (
            info_for_reduction (m_vinfo, stmt_info)))
        /* ??? Ideally we'd use a tree to reduce the copies down to 1 vector,
           and then accumulate that, but at the moment the loop-carried
           dependency includes all copies.  */
        ops->reduction_latency = MAX (ops->reduction_latency, base * count);
      else
        ops->reduction_latency = MAX (ops->reduction_latency, base);

Thanks,
Hao
Hao Liu OS July 28, 2023, 10:06 a.m. UTC | #12
Hi Richard,

I've updated the patch and tested on aarch64.  Is it OK?

---

The new costs should only count reduction latency by multiplying count for
single_defuse_cycle.  For other situations, this will increase the reduction
latency a lot and miss vectorization opportunities.

Tested on aarch64-linux-gnu.

gcc/ChangeLog:

        PR target/110625
        * config/aarch64/aarch64.cc (count_ops): Only '* count' for
        single_defuse_cycle while counting reduction_latency.

gcc/testsuite/ChangeLog:

        * gcc.target/aarch64/pr110625_1.c: New testcase.
        * gcc.target/aarch64/pr110625_2.c: New testcase.
---
 gcc/config/aarch64/aarch64.cc                 | 13 ++++--
 gcc/testsuite/gcc.target/aarch64/pr110625_1.c | 46 +++++++++++++++++++
 gcc/testsuite/gcc.target/aarch64/pr110625_2.c | 14 ++++++
 3 files changed, 69 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_2.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 560e5431636..10e7663cc42 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16788,10 +16788,15 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
     {
       unsigned int base
        = aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
-
-      /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
-        that's not yet the case.  */
-      ops->reduction_latency = MAX (ops->reduction_latency, base * count);
+      if (STMT_VINFO_LIVE_P (stmt_info)
+         && STMT_VINFO_FORCE_SINGLE_CYCLE (
+           info_for_reduction (m_vinfo, stmt_info)))
+       /* ??? Ideally we'd use a tree to reduce the copies down to 1 vector,
+          and then accumulate that, but at the moment the loop-carried
+          dependency includes all copies.  */
+       ops->reduction_latency = MAX (ops->reduction_latency, base * count);
+      else
+       ops->reduction_latency = MAX (ops->reduction_latency, base);
     }

   /* Assume that multiply-adds will become a single operation.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_1.c b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
new file mode 100644
index 00000000000..0965cac33a0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
@@ -0,0 +1,46 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details -fno-tree-slp-vectorize" } */
+/* { dg-final { scan-tree-dump-not "reduction latency = 8" "vect" } } */
+
+/* Do not increase the vector body cost due to the incorrect reduction latency
+    Original vector body cost = 51
+    Scalar issue estimate:
+      ...
+      reduction latency = 2
+      estimated min cycles per iteration = 2.000000
+      estimated cycles per vector iteration (for VF 2) = 4.000000
+    Vector issue estimate:
+      ...
+      reduction latency = 8      <-- Too large
+      estimated min cycles per iteration = 8.000000
+    Increasing body cost to 102 because scalar code would issue more quickly
+      ...
+    missed:  cost model: the vector iteration cost = 102 divided by the scalar iteration cost = 44 is greater or equal to the vectorization factor = 2.
+    missed:  not vectorized: vectorization not profitable.  */
+
+typedef struct
+{
+  unsigned short m1, m2, m3, m4;
+} the_struct_t;
+typedef struct
+{
+  double m1, m2, m3, m4, m5;
+} the_struct2_t;
+
+double
+bar (the_struct2_t *);
+
+double
+foo (double *k, unsigned int n, the_struct_t *the_struct)
+{
+  unsigned int u;
+  the_struct2_t result;
+  for (u = 0; u < n; u++, k--)
+    {
+      result.m1 += (*k) * the_struct[u].m1;
+      result.m2 += (*k) * the_struct[u].m2;
+      result.m3 += (*k) * the_struct[u].m3;
+      result.m4 += (*k) * the_struct[u].m4;
+    }
+  return bar (&result);
+}
diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_2.c b/gcc/testsuite/gcc.target/aarch64/pr110625_2.c
new file mode 100644
index 00000000000..7a84aa8355e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr110625_2.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details -fno-tree-slp-vectorize" } */
+/* { dg-final { scan-tree-dump "reduction latency = 8" "vect" } } */
+
+/* The reduction latency should be multiplied by the count for
+   single_defuse_cycle.  */
+
+long
+f (long res, short *ptr1, short *ptr2, int n)
+{
+  for (int i = 0; i < n; ++i)
+    res += (long) ptr1[i] << ptr2[i];
+  return res;
+}
--
2.34.1
Richard Sandiford July 28, 2023, 5:35 p.m. UTC | #13
Sorry for the slow response.

Hao Liu OS <hliu@os.amperecomputing.com> writes:
>> Ah, thanks.  In that case, Hao, I think we can avoid the ICE by changing:
>>
>>   if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
>>       && vect_is_reduction (stmt_info))
>>
>> to:
>>
>>   if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
>>       && STMT_VINFO_LIVE_P (stmt_info)
>>       && vect_is_reduction (stmt_info))
>
> I tried this and it can indeed avoid the ICE.  But it seems the reduction_latency calculation is also skipped; after this modification, the reduction_latency is 0 for this case.  Previously, it was 1 and 2 for scalar and vector respectively.

Which test case do you see this for?  The two tests in the patch still
seem to report correct latencies for me if I make the change above.

Thanks,
Richard

> IMHO, to keep it consistent with the previous result, should we move the STMT_VINFO_LIVE_P check below, inside the if?  Such as:
>
>   /* Calculate the minimum cycles per iteration imposed by a reduction
>      operation.  */
>   if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
>       && vect_is_reduction (stmt_info))
>     {
>       unsigned int base
>         = aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
>       if (STMT_VINFO_LIVE_P (stmt_info) && STMT_VINFO_FORCE_SINGLE_CYCLE (
>             info_for_reduction (m_vinfo, stmt_info)))
>         /* ??? Ideally we'd use a tree to reduce the copies down to 1 vector,
>            and then accumulate that, but at the moment the loop-carried
>            dependency includes all copies.  */
>         ops->reduction_latency = MAX (ops->reduction_latency, base * count);
>       else
>         ops->reduction_latency = MAX (ops->reduction_latency, base);
>
> Thanks,
> Hao
>
> ________________________________________
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Wednesday, July 26, 2023 17:14
> To: Richard Biener
> Cc: Hao Liu OS; GCC-patches@gcc.gnu.org
> Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]
>
> Richard Biener <richard.guenther@gmail.com> writes:
>> On Wed, Jul 26, 2023 at 4:02 AM Hao Liu OS via Gcc-patches
>> <gcc-patches@gcc.gnu.org> wrote:
>>>
>>> > When was STMT_VINFO_REDUC_DEF empty?  I just want to make sure that we're not papering over an issue elsewhere.
>>>
>>> Yes, I also wonder if this is an issue in vectorizable_reduction.  Below is the gimple of "gcc.target/aarch64/sve/cost_model_13.c":
>>>
>>>   <bb 3>:
>>>   # res_18 = PHI <res_15(7), 0(6)>
>>>   # i_20 = PHI <i_16(7), 0(6)>
>>>   _1 = (long unsigned int) i_20;
>>>   _2 = _1 * 2;
>>>   _3 = x_14(D) + _2;
>>>   _4 = *_3;
>>>   _5 = (unsigned short) _4;
>>>   res.0_6 = (unsigned short) res_18;
>>>   _7 = _5 + res.0_6;                             <-- The current stmt_info
>>>   res_15 = (short int) _7;
>>>   i_16 = i_20 + 1;
>>>   if (n_11(D) > i_16)
>>>     goto <bb 7>;
>>>   else
>>>     goto <bb 4>;
>>>
>>>   <bb 7>:
>>>   goto <bb 3>;
>>>
>>> It looks like STMT_VINFO_REDUC_DEF should be "res_18 = PHI <res_15(7), 0(6)>"?
>>> The status here is:
>>>   STMT_VINFO_REDUC_IDX (stmt_info): 1
>>>   STMT_VINFO_REDUC_TYPE (stmt_info): TREE_CODE_REDUCTION
>>>   STMT_VINFO_REDUC_VECTYPE (stmt_info): 0x0
>>
>> Not all stmts in the SSA cycle forming the reduction have
>> STMT_VINFO_REDUC_DEF set,
>> only the last (latch def) and live stmts have at the moment.
>
> Ah, thanks.  In that case, Hao, I think we can avoid the ICE by changing:
>
>   if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
>       && vect_is_reduction (stmt_info))
>
> to:
>
>   if ((kind == scalar_stmt || kind == vector_stmt || kind == vec_to_scalar)
>       && STMT_VINFO_LIVE_P (stmt_info)
>       && vect_is_reduction (stmt_info))
>
> instead of using a null check.
>
> I see that vectorizable_reduction calculates a reduc_chain_length.
> Would it be OK to store that in the stmt_vec_info?  I suppose the
> AArch64 code should be multiplying by that as well.  (It would be a
> separate patch from this one though.)
>
> Richard
>
>
>>
>> Richard.
>>
>>> Thanks,
>>> Hao
>>>
>>> ________________________________________
>>> From: Richard Sandiford <richard.sandiford@arm.com>
>>> Sent: Tuesday, July 25, 2023 17:44
>>> To: Hao Liu OS
>>> Cc: GCC-patches@gcc.gnu.org
>>> Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]
>>>
>>> Hao Liu OS <hliu@os.amperecomputing.com> writes:
>>> > Hi,
>>> >
>>> > Thanks for the suggestion.  I tested it and found a gcc_assert failure:
>>> >     gcc.target/aarch64/sve/cost_model_13.c (internal compiler error: in info_for_reduction, at tree-vect-loop.cc:5473)
>>> >
>>> > It is caused by empty STMT_VINFO_REDUC_DEF.
>>>
>>> When was STMT_VINFO_REDUC_DEF empty?  I just want to make sure that
>>> we're not papering over an issue elsewhere.
>>>
>>> Thanks,
>>> Richard
>>>
>>>   So, I added an extra check before checking single_defuse_cycle. The updated patch is below.  Is it OK for trunk?
>>> >
>>> > ---
>>> >
>>> > The new costs should only count reduction latency by multiplying count for
>>> > single_defuse_cycle.  For other situations, this will increase the reduction
>>> > latency a lot and miss vectorization opportunities.
>>> >
>>> > Tested on aarch64-linux-gnu.
>>> >
>>> > gcc/ChangeLog:
>>> >
>>> >       PR target/110625
>>> >       * config/aarch64/aarch64.cc (count_ops): Only '* count' for
>>> >       single_defuse_cycle while counting reduction_latency.
>>> >
>>> > gcc/testsuite/ChangeLog:
>>> >
>>> >       * gcc.target/aarch64/pr110625_1.c: New testcase.
>>> >       * gcc.target/aarch64/pr110625_2.c: New testcase.
>>> > ---
>>> >  gcc/config/aarch64/aarch64.cc                 | 13 ++++--
>>> >  gcc/testsuite/gcc.target/aarch64/pr110625_1.c | 46 +++++++++++++++++++
>>> >  gcc/testsuite/gcc.target/aarch64/pr110625_2.c | 14 ++++++
>>> >  3 files changed, 69 insertions(+), 4 deletions(-)
>>> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_1.c
>>> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_2.c
>>> >
>>> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
>>> > index 560e5431636..478a4e00110 100644
>>> > --- a/gcc/config/aarch64/aarch64.cc
>>> > +++ b/gcc/config/aarch64/aarch64.cc
>>> > @@ -16788,10 +16788,15 @@ aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
>>> >      {
>>> >        unsigned int base
>>> >       = aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
>>> > -
>>> > -      /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
>>> > -      that's not yet the case.  */
>>> > -      ops->reduction_latency = MAX (ops->reduction_latency, base * count);
>>> > +      if (STMT_VINFO_REDUC_DEF (stmt_info)
>>> > +       && STMT_VINFO_FORCE_SINGLE_CYCLE (
>>> > +         info_for_reduction (m_vinfo, stmt_info)))
>>> > +     /* ??? Ideally we'd use a tree to reduce the copies down to 1 vector,
>>> > +        and then accumulate that, but at the moment the loop-carried
>>> > +        dependency includes all copies.  */
>>> > +     ops->reduction_latency = MAX (ops->reduction_latency, base * count);
>>> > +      else
>>> > +     ops->reduction_latency = MAX (ops->reduction_latency, base);
>>> >      }
>>> >
>>> >    /* Assume that multiply-adds will become a single operation.  */
>>> > diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_1.c b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
>>> > new file mode 100644
>>> > index 00000000000..0965cac33a0
>>> > --- /dev/null
>>> > +++ b/gcc/testsuite/gcc.target/aarch64/pr110625_1.c
>>> > @@ -0,0 +1,46 @@
>>> > +/* { dg-do compile } */
>>> > +/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details -fno-tree-slp-vectorize" } */
>>> > +/* { dg-final { scan-tree-dump-not "reduction latency = 8" "vect" } } */
>>> > +
>>> > +/* Do not increase the vector body cost due to the incorrect reduction latency
>>> > +    Original vector body cost = 51
>>> > +    Scalar issue estimate:
>>> > +      ...
>>> > +      reduction latency = 2
>>> > +      estimated min cycles per iteration = 2.000000
>>> > +      estimated cycles per vector iteration (for VF 2) = 4.000000
>>> > +    Vector issue estimate:
>>> > +      ...
>>> > +      reduction latency = 8      <-- Too large
>>> > +      estimated min cycles per iteration = 8.000000
>>> > +    Increasing body cost to 102 because scalar code would issue more quickly
>>> > +      ...
>>> > +    missed:  cost model: the vector iteration cost = 102 divided by the scalar iteration cost = 44 is greater or equal to the vectorization factor = 2.
>>> > +    missed:  not vectorized: vectorization not profitable.  */
>>> > +
>>> > +typedef struct
>>> > +{
>>> > +  unsigned short m1, m2, m3, m4;
>>> > +} the_struct_t;
>>> > +typedef struct
>>> > +{
>>> > +  double m1, m2, m3, m4, m5;
>>> > +} the_struct2_t;
>>> > +
>>> > +double
>>> > +bar (the_struct2_t *);
>>> > +
>>> > +double
>>> > +foo (double *k, unsigned int n, the_struct_t *the_struct)
>>> > +{
>>> > +  unsigned int u;
>>> > +  the_struct2_t result;
>>> > +  for (u = 0; u < n; u++, k--)
>>> > +    {
>>> > +      result.m1 += (*k) * the_struct[u].m1;
>>> > +      result.m2 += (*k) * the_struct[u].m2;
>>> > +      result.m3 += (*k) * the_struct[u].m3;
>>> > +      result.m4 += (*k) * the_struct[u].m4;
>>> > +    }
>>> > +  return bar (&result);
>>> > +}
>>> > diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_2.c b/gcc/testsuite/gcc.target/aarch64/pr110625_2.c
>>> > new file mode 100644
>>> > index 00000000000..7a84aa8355e
>>> > --- /dev/null
>>> > +++ b/gcc/testsuite/gcc.target/aarch64/pr110625_2.c
>>> > @@ -0,0 +1,14 @@
>>> > +/* { dg-do compile } */
>>> > +/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details -fno-tree-slp-vectorize" } */
>>> > +/* { dg-final { scan-tree-dump "reduction latency = 8" "vect" } } */
>>> > +
>>> > +/* The reduction latency should be multiplied by the count for
>>> > +   single_defuse_cycle.  */
>>> > +
>>> > +long
>>> > +f (long res, short *ptr1, short *ptr2, int n)
>>> > +{
>>> > +  for (int i = 0; i < n; ++i)
>>> > +    res += (long) ptr1[i] << ptr2[i];
>>> > +  return res;
>>> > +}
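For reference, the single_defuse_cycle reduction in pr110625_2.c above can be exercised directly.  A minimal executable form (the driver values below are illustrative, not part of the testsuite):

```c
#include <assert.h>

/* Same reduction as pr110625_2.c: each iteration both reads and writes
   the single accumulator RES, so the vectorizer's single_defuse_cycle
   path applies and the reduction latency must be scaled by COUNT.  */
long
f (long res, short *ptr1, short *ptr2, int n)
{
  for (int i = 0; i < n; ++i)
    res += (long) ptr1[i] << ptr2[i];
  return res;
}
```

E.g. with ptr1 = {1, 2} and ptr2 = {3, 1}, f (0, ptr1, ptr2, 2) computes (1 << 3) + (2 << 1) = 12.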
Hao Liu OS July 31, 2023, 2:39 a.m. UTC | #14
> Which test case do you see this for?  The two tests in the patch still
> seem to report correct latencies for me if I make the change above.

Not the newly added tests.  It is still the existing case that caused the previous ICE (i.e. the assertion problem): gcc.target/aarch64/sve/cost_model_13.c.

It's not that the test case itself fails, but the vect dump message says the "reduction latency" is 0:

Before the change:
cost_model_13.c:7:21: note:  Original vector body cost = 6
cost_model_13.c:7:21: note:  Scalar issue estimate:
cost_model_13.c:7:21: note:    load operations = 1
cost_model_13.c:7:21: note:    store operations = 0
cost_model_13.c:7:21: note:    general operations = 1
cost_model_13.c:7:21: note:    reduction latency = 1
cost_model_13.c:7:21: note:    estimated min cycles per iteration = 1.000000
cost_model_13.c:7:21: note:    estimated cycles per vector iteration (for VF 8) = 8.000000
cost_model_13.c:7:21: note:  Vector issue estimate:
cost_model_13.c:7:21: note:    load operations = 1
cost_model_13.c:7:21: note:    store operations = 0
cost_model_13.c:7:21: note:    general operations = 1
cost_model_13.c:7:21: note:    reduction latency = 2
cost_model_13.c:7:21: note:    estimated min cycles per iteration = 2.000000

After the change:
cost_model_13.c:7:21: note:  Original vector body cost = 6
cost_model_13.c:7:21: note:  Scalar issue estimate:
cost_model_13.c:7:21: note:    load operations = 1
cost_model_13.c:7:21: note:    store operations = 0
cost_model_13.c:7:21: note:    general operations = 1
cost_model_13.c:7:21: note:    reduction latency = 0         <--- seems not consistent with above result
cost_model_13.c:7:21: note:    estimated min cycles per iteration = 1.000000
cost_model_13.c:7:21: note:    estimated cycles per vector iteration (for VF 8) = 8.000000
cost_model_13.c:7:21: note:  Vector issue estimate:
cost_model_13.c:7:21: note:    load operations = 1
cost_model_13.c:7:21: note:    store operations = 0
cost_model_13.c:7:21: note:    general operations = 1
cost_model_13.c:7:21: note:    reduction latency = 0         <--- seems not consistent with above result
cost_model_13.c:7:21: note:    estimated min cycles per iteration = 1.000000         <--- seems not consistent with above result

BTW, this should be caused by the reduction stmt not being live; STMT_VINFO_LIVE_P indicates whether a stmt is part of a computation whose result is used outside the loop (tree-vectorizer.h:1204):
  <bb 3>:
  # res_18 = PHI <res_15(7), 0(6)>
  # i_20 = PHI <i_16(7), 0(6)>
  _1 = (long unsigned int) i_20;
  _2 = _1 * 2;
  _3 = x_14(D) + _2;
  _4 = *_3;
  _5 = (unsigned short) _4;
  res.0_6 = (unsigned short) res_18;
  _7 = _5 + res.0_6;                             <-- This is not live, may be caused by the below type cast stmt.
  res_15 = (short int) _7;
  i_16 = i_20 + 1;
  if (n_11(D) > i_16)
    goto <bb 7>;
  else
    goto <bb 4>;

  <bb 7>:
  goto <bb 3>;

Thanks,
-Hao
Richard Sandiford July 31, 2023, 9:11 a.m. UTC | #15
Hao Liu OS <hliu@os.amperecomputing.com> writes:
>> Which test case do you see this for?  The two tests in the patch still
>> seem to report correct latencies for me if I make the change above.
>
> Not the newly added tests.  It is still the existing case causing the previous ICE (i.e. assertion problem): gcc.target/aarch64/sve/cost_model_13.c.
>
> It's not the test case itself failed, but the dump message of vect says the "reduction latency" is 0:
>
> Before the change:
> cost_model_13.c:7:21: note:  Original vector body cost = 6
> cost_model_13.c:7:21: note:  Scalar issue estimate:
> cost_model_13.c:7:21: note:    load operations = 1
> cost_model_13.c:7:21: note:    store operations = 0
> cost_model_13.c:7:21: note:    general operations = 1
> cost_model_13.c:7:21: note:    reduction latency = 1
> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 1.000000
> cost_model_13.c:7:21: note:    estimated cycles per vector iteration (for VF 8) = 8.000000
> cost_model_13.c:7:21: note:  Vector issue estimate:
> cost_model_13.c:7:21: note:    load operations = 1
> cost_model_13.c:7:21: note:    store operations = 0
> cost_model_13.c:7:21: note:    general operations = 1
> cost_model_13.c:7:21: note:    reduction latency = 2
> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 2.000000
>
> After the change:
> cost_model_13.c:7:21: note:  Original vector body cost = 6
> cost_model_13.c:7:21: note:  Scalar issue estimate:
> cost_model_13.c:7:21: note:    load operations = 1
> cost_model_13.c:7:21: note:    store operations = 0
> cost_model_13.c:7:21: note:    general operations = 1
> cost_model_13.c:7:21: note:    reduction latency = 0         <--- seems not consistent with above result
> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 1.000000
> cost_model_13.c:7:21: note:    estimated cycles per vector iteration (for VF 8) = 8.000000
> cost_model_13.c:7:21: note:  Vector issue estimate:
> cost_model_13.c:7:21: note:    load operations = 1
> cost_model_13.c:7:21: note:    store operations = 0
> cost_model_13.c:7:21: note:    general operations = 1
> cost_model_13.c:7:21: note:    reduction latency = 0         <--- seems not consistent with above result
> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 1.000000         <--- seems not consistent with above result
>
> BTW. this should be caused by the reduction stmt is not live, which indicates whether this stmts is part of a computation whose result is used outside the loop (tree-vectorized.h:1204):
>   <bb 3>:
>   # res_18 = PHI <res_15(7), 0(6)>
>   # i_20 = PHI <i_16(7), 0(6)>
>   _1 = (long unsigned int) i_20;
>   _2 = _1 * 2;
>   _3 = x_14(D) + _2;
>   _4 = *_3;
>   _5 = (unsigned short) _4;
>   res.0_6 = (unsigned short) res_18;
>   _7 = _5 + res.0_6;                             <-- This is not live, may be caused by the below type cast stmt.
>   res_15 = (short int) _7;
>   i_16 = i_20 + 1;
>   if (n_11(D) > i_16)
>     goto <bb 7>;
>   else
>     goto <bb 4>;
>
>   <bb 7>:
>   goto <bb 3>;

Ah, I see, thanks.  My concern was: if requiring !STMT_VINFO_LIVE_P stmts
can cause "normal" reductions to have a latency of 0, could the same thing
happen for single-cycle reductions?  But I suppose the answer is "no".
Introducing a cast like the above would cause reduc_chain_length > 1,
and so:

  if (ncopies > 1
      && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live)
      && reduc_chain_length == 1
      && loop_vinfo->suggested_unroll_factor == 1)
    single_defuse_cycle = true;

wouldn't trigger.  Which makes the single-cycle thing a bit hit-and-miss...

So yeah, I agree the patch is safe after all.

Please split the check out into a helper though, to avoid the awkward
formatting:

/* Return true if STMT_INFO is part of a reduction that has the form:

      r = r op ...;
      r = r op ...;

   with the single accumulator being read and written multiple times.  */
static bool
aarch64_force_single_cycle (vec_info *vinfo, stmt_vec_info stmt_info)
{
  if (!STMT_VINFO_LIVE_P (stmt_info))
    return false;

  auto reduc_info = info_for_reduction (vinfo, stmt_info);
  return STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info);
}

OK with that change, thanks.

Richard
Hao Liu OS July 31, 2023, 9:25 a.m. UTC | #16
Sure, the helper makes the code simpler.  I'll test the new patch and push it if there are no other issues.

Thanks,
Hao
Hao Liu OS Aug. 1, 2023, 9:43 a.m. UTC | #17
Hi Richard,

This is a quick fix for the several reported ICEs.  It seems that even when STMT_VINFO_LIVE_P is true, some reduction stmts still don't have a REDUC_DEF.  So I changed the check to STMT_VINFO_REDUC_DEF.

Is it OK for trunk?

---
Fix the ICEs on an empty reduction definition.  Even when STMT_VINFO_LIVE_P is true,
some reduction stmts still don't have a definition.

gcc/ChangeLog:

        PR target/110625
        * config/aarch64/aarch64.cc (aarch64_force_single_cycle): Check
        STMT_VINFO_REDUC_DEF to avoid failures in info_for_reduction.
---
 gcc/config/aarch64/aarch64.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index d4d76025545..5b8d8fa8e2d 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16776,7 +16776,7 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, stmt_vec_info stmt_info,
 static bool
 aarch64_force_single_cycle (vec_info *vinfo, stmt_vec_info stmt_info)
 {
-  if (!STMT_VINFO_LIVE_P (stmt_info))
+  if (!STMT_VINFO_REDUC_DEF (stmt_info))
     return false;

   auto reduc_info = info_for_reduction (vinfo, stmt_info);
--
2.40.0
Hao Liu OS Aug. 2, 2023, 3:45 a.m. UTC | #18
Hi Richard,

Updated the patch with a simple case (see the case and comments below).  It shows that a live stmt may not have a reduction def, which introduces the ICE.

Is it OK for trunk?

----
Fix the assertion failure on an empty reduction definition in info_for_reduction.
Even when a stmt is live, it may still have an empty reduction definition.  Check
the reduction definition instead of the live info before calling info_for_reduction.

gcc/ChangeLog:

        PR target/110625
        * config/aarch64/aarch64.cc (aarch64_force_single_cycle): Check
        STMT_VINFO_REDUC_DEF to avoid failures in info_for_reduction.

gcc/testsuite/ChangeLog:

        * gcc.target/aarch64/pr110625_3.c: New testcase.
---
 gcc/config/aarch64/aarch64.cc                 |  2 +-
 gcc/testsuite/gcc.target/aarch64/pr110625_3.c | 34 +++++++++++++++++++
 2 files changed, 35 insertions(+), 1 deletion(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_3.c

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index d4d76025545..5b8d8fa8e2d 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16776,7 +16776,7 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, stmt_vec_info stmt_info,
 static bool
 aarch64_force_single_cycle (vec_info *vinfo, stmt_vec_info stmt_info)
 {
-  if (!STMT_VINFO_LIVE_P (stmt_info))
+  if (!STMT_VINFO_REDUC_DEF (stmt_info))
     return false;

   auto reduc_info = info_for_reduction (vinfo, stmt_info);
diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_3.c b/gcc/testsuite/gcc.target/aarch64/pr110625_3.c
new file mode 100644
index 00000000000..35a50290cb0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr110625_3.c
@@ -0,0 +1,34 @@
+/* { dg-do compile } */
+/* { dg-options "-O3 -mcpu=neoverse-n2" } */
+
+/* Avoid ICE on empty reduction def in single_defuse_cycle.
+
+   E.g.
+     <bb 3> [local count: 858993456]:
+     # sum_18 = PHI <sum_15(5), 0(2)>
+     sum.0_5 = (unsigned int) sum_18;
+     _6 = _4 + sum.0_5;     <-- it is "live" but doesn't have reduction def
+     sum_15 = (int) _6;
+     ...
+     if (ivtmp_29 != 0)
+       goto <bb 5>; [75.00%]
+     else
+       goto <bb 4>; [25.00%]
+
+     <bb 5> [local count: 644245086]:
+     goto <bb 3>; [100.00%]
+
+     <bb 4> [local count: 214748368]:
+     # _31 = PHI <_6(3)>
+     _8 = _31 >> 1;
+*/
+
+int
+f (unsigned int *tmp)
+{
+  int sum = 0;
+  for (int i = 0; i < 4; i++)
+    sum += tmp[i];
+
+  return (unsigned int) sum >> 1;
+}
--
2.34.1
Hao Liu OS Aug. 3, 2023, 9:33 a.m. UTC | #19
Gentle ping. Is it OK for master?

I'm afraid the ICE may cause trouble and hope it can be fixed ASAP.

Thanks,
Hao
Richard Sandiford Aug. 3, 2023, 10:10 a.m. UTC | #20
Hao Liu OS <hliu@os.amperecomputing.com> writes:
> Hi Richard,
>
> Update the patch with a simple case (see below case and comments).  It shows a live stmt may not have reduction def, which introduce the ICE.
>
> Is it OK for trunk?

OK, thanks.

Richard

> ----
> Fix the assertion failure on empty reduction define in info_for_reduction.
> Even a stmt is live, it may still have empty reduction define.  Check the
> reduction definition instead of live info before calling info_for_reduction.
>
> gcc/ChangeLog:
>
>         PR target/110625
>         * config/aarch64/aarch64.cc (aarch64_force_single_cycle): check
>         STMT_VINFO_REDUC_DEF to avoid failures in info_for_reduction.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/aarch64/pr110625_3.c: New testcase.
> ---
>  gcc/config/aarch64/aarch64.cc                 |  2 +-
>  gcc/testsuite/gcc.target/aarch64/pr110625_3.c | 34 +++++++++++++++++++
>  2 files changed, 35 insertions(+), 1 deletion(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/pr110625_3.c
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index d4d76025545..5b8d8fa8e2d 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -16776,7 +16776,7 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, stmt_vec_info stmt_info,
>  static bool
>  aarch64_force_single_cycle (vec_info *vinfo, stmt_vec_info stmt_info)
>  {
> -  if (!STMT_VINFO_LIVE_P (stmt_info))
> +  if (!STMT_VINFO_REDUC_DEF (stmt_info))
>      return false;
>
>    auto reduc_info = info_for_reduction (vinfo, stmt_info);
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625_3.c b/gcc/testsuite/gcc.target/aarch64/pr110625_3.c
> new file mode 100644
> index 00000000000..35a50290cb0
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr110625_3.c
> @@ -0,0 +1,34 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O3 -mcpu=neoverse-n2" } */
> +
> +/* Avoid ICE on empty reduction def in single_defuse_cycle.
> +
> +   E.g.
> +     <bb 3> [local count: 858993456]:
> +     # sum_18 = PHI <sum_15(5), 0(2)>
> +     sum.0_5 = (unsigned int) sum_18;
> +     _6 = _4 + sum.0_5;     <-- it is "live" but doesn't have reduction def
> +     sum_15 = (int) _6;
> +     ...
> +     if (ivtmp_29 != 0)
> +       goto <bb 5>; [75.00%]
> +     else
> +       goto <bb 4>; [25.00%]
> +
> +     <bb 5> [local count: 644245086]:
> +     goto <bb 3>; [100.00%]
> +
> +     <bb 4> [local count: 214748368]:
> +     # _31 = PHI <_6(3)>
> +     _8 = _31 >> 1;
> +*/
> +
> +int
> +f (unsigned int *tmp)
> +{
> +  int sum = 0;
> +  for (int i = 0; i < 4; i++)
> +    sum += tmp[i];
> +
> +  return (unsigned int) sum >> 1;
> +}
> --
> 2.34.1
>
> ________________________________________
> From: Hao Liu OS <hliu@os.amperecomputing.com>
> Sent: Tuesday, August 1, 2023 17:43
> To: Richard Sandiford
> Cc: Richard Biener; GCC-patches@gcc.gnu.org
> Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]
>
> Hi Richard,
>
> This is a quick fix to the several ICEs.  It seems even STMT_VINFO_LIVE_P is true, some reduct stmts still don't have REDUC_DEF.  So I change the check to STMT_VINFO_REDUC_DEF.
>
> Is it OK for trunk?
>
> ---
> Fix the ICEs on empty reduction define.  Even STMT_VINFO_LIVE_P is true, some reduct stmts
> still don't have definition.
>
> gcc/ChangeLog:
>
>         PR target/110625
>         * config/aarch64/aarch64.cc (aarch64_force_single_cycle): check
>         STMT_VINFO_REDUC_DEF to avoid failures in info_for_reduction
> ---
>  gcc/config/aarch64/aarch64.cc | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index d4d76025545..5b8d8fa8e2d 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -16776,7 +16776,7 @@ aarch64_adjust_stmt_cost (vect_cost_for_stmt kind, stmt_vec_info stmt_info,
>  static bool
>  aarch64_force_single_cycle (vec_info *vinfo, stmt_vec_info stmt_info)
>  {
> -  if (!STMT_VINFO_LIVE_P (stmt_info))
> +  if (!STMT_VINFO_REDUC_DEF (stmt_info))
>      return false;
>
>    auto reduc_info = info_for_reduction (vinfo, stmt_info);
> --
> 2.40.0
>
>
> ________________________________________
> From: Richard Sandiford <richard.sandiford@arm.com>
> Sent: Monday, July 31, 2023 17:11
> To: Hao Liu OS
> Cc: Richard Biener; GCC-patches@gcc.gnu.org
> Subject: Re: [PATCH] AArch64: Do not increase the vect reduction latency by multiplying count [PR110625]
>
> Hao Liu OS <hliu@os.amperecomputing.com> writes:
>>> Which test case do you see this for?  The two tests in the patch still
>>> seem to report correct latencies for me if I make the change above.
>>
>> Not the newly added tests.  It is still the existing case causing the previous ICE (i.e. assertion problem): gcc.target/aarch64/sve/cost_model_13.c.
>>
>> It's not the test case itself failed, but the dump message of vect says the "reduction latency" is 0:
>>
>> Before the change:
>> cost_model_13.c:7:21: note:  Original vector body cost = 6
>> cost_model_13.c:7:21: note:  Scalar issue estimate:
>> cost_model_13.c:7:21: note:    load operations = 1
>> cost_model_13.c:7:21: note:    store operations = 0
>> cost_model_13.c:7:21: note:    general operations = 1
>> cost_model_13.c:7:21: note:    reduction latency = 1
>> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 1.000000
>> cost_model_13.c:7:21: note:    estimated cycles per vector iteration (for VF 8) = 8.000000
>> cost_model_13.c:7:21: note:  Vector issue estimate:
>> cost_model_13.c:7:21: note:    load operations = 1
>> cost_model_13.c:7:21: note:    store operations = 0
>> cost_model_13.c:7:21: note:    general operations = 1
>> cost_model_13.c:7:21: note:    reduction latency = 2
>> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 2.000000
>>
>> After the change:
>> cost_model_13.c:7:21: note:  Original vector body cost = 6
>> cost_model_13.c:7:21: note:  Scalar issue estimate:
>> cost_model_13.c:7:21: note:    load operations = 1
>> cost_model_13.c:7:21: note:    store operations = 0
>> cost_model_13.c:7:21: note:    general operations = 1
>> cost_model_13.c:7:21: note:    reduction latency = 0         <--- seems not consistent with above result
>> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 1.000000
>> cost_model_13.c:7:21: note:    estimated cycles per vector iteration (for VF 8) = 8.000000
>> cost_model_13.c:7:21: note:  Vector issue estimate:
>> cost_model_13.c:7:21: note:    load operations = 1
>> cost_model_13.c:7:21: note:    store operations = 0
>> cost_model_13.c:7:21: note:    general operations = 1
>> cost_model_13.c:7:21: note:    reduction latency = 0         <--- seems not consistent with above result
>> cost_model_13.c:7:21: note:    estimated min cycles per iteration = 1.000000         <--- seems not consistent with above result
>>
>> BTW. this should be caused by the reduction stmt is not live, which indicates whether this stmts is part of a computation whose result is used outside the loop (tree-vectorized.h:1204):
>>   <bb 3>:
>>   # res_18 = PHI <res_15(7), 0(6)>
>>   # i_20 = PHI <i_16(7), 0(6)>
>>   _1 = (long unsigned int) i_20;
>>   _2 = _1 * 2;
>>   _3 = x_14(D) + _2;
>>   _4 = *_3;
>>   _5 = (unsigned short) _4;
>>   res.0_6 = (unsigned short) res_18;
>>   _7 = _5 + res.0_6;                             <-- This is not live, may be caused by the below type cast stmt.
>>   res_15 = (short int) _7;
>>   i_16 = i_20 + 1;
>>   if (n_11(D) > i_16)
>>     goto <bb 7>;
>>   else
>>     goto <bb 4>;
>>
>>   <bb 7>:
>>   goto <bb 3>;
>
> Ah, I see, thanks.  My concern was: if requiring !STMT_VINFO_LIVE_P stmts
> can cause "normal" reductions to have a latency of 0, could the same thing
> happen for single-cycle reductions?  But I suppose the answer is "no".
> Introducing a cast like the above would cause reduc_chain_length > 1,
> and so:
>
>   if (ncopies > 1
>       && (STMT_VINFO_RELEVANT (stmt_info) <= vect_used_only_live)
>       && reduc_chain_length == 1
>       && loop_vinfo->suggested_unroll_factor == 1)
>     single_defuse_cycle = true;
>
> wouldn't trigger.  Which makes the single-cycle thing a bit hit-and-miss...
>
> So yeah, I agree the patch is safe after all.
>
> Please split the check out into a helper though, to avoid the awkward
> formatting:
>
> /* Return true if STMT_INFO is part of a reduction that has the form:
>
>       r = r op ...;
>       r = r op ...;
>
>    with the single accumulator being read and written multiple times.  */
> static bool
> aarch64_force_single_cycle (vec_info *vinfo, stmt_vec_info stmt_info)
> {
>   if (!STMT_VINFO_LIVE_P (stmt_info))
>     return false;
>
>   auto reduc_info = info_for_reduction (vinfo, stmt_info);
>   return STMT_VINFO_FORCE_SINGLE_CYCLE (reduc_info);
> }
>
> OK with that change, thanks.
>
> Richard

Patch

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 560e5431636..27afa64b7d5 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -16788,10 +16788,7 @@  aarch64_vector_costs::count_ops (unsigned int count, vect_cost_for_stmt kind,
     {
       unsigned int base
 	= aarch64_in_loop_reduction_latency (m_vinfo, stmt_info, m_vec_flags);
-
-      /* ??? Ideally we'd do COUNT reductions in parallel, but unfortunately
-	 that's not yet the case.  */
-      ops->reduction_latency = MAX (ops->reduction_latency, base * count);
+      ops->reduction_latency = MAX (ops->reduction_latency, base);
     }
 
   /* Assume that multiply-adds will become a single operation.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/pr110625.c b/gcc/testsuite/gcc.target/aarch64/pr110625.c
new file mode 100644
index 00000000000..0965cac33a0
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr110625.c
@@ -0,0 +1,46 @@ 
+/* { dg-do compile } */
+/* { dg-options "-Ofast -mcpu=neoverse-n2 -fdump-tree-vect-details -fno-tree-slp-vectorize" } */
+/* { dg-final { scan-tree-dump-not "reduction latency = 8" "vect" } } */
+
+/* Do not increase the vector body cost due to the incorrect reduction latency
+    Original vector body cost = 51
+    Scalar issue estimate:
+      ...
+      reduction latency = 2
+      estimated min cycles per iteration = 2.000000
+      estimated cycles per vector iteration (for VF 2) = 4.000000
+    Vector issue estimate:
+      ...
+      reduction latency = 8      <-- Too large
+      estimated min cycles per iteration = 8.000000
+    Increasing body cost to 102 because scalar code would issue more quickly
+      ...
+    missed:  cost model: the vector iteration cost = 102 divided by the scalar iteration cost = 44 is greater or equal to the vectorization factor = 2.
+    missed:  not vectorized: vectorization not profitable.  */
+
+typedef struct
+{
+  unsigned short m1, m2, m3, m4;
+} the_struct_t;
+typedef struct
+{
+  double m1, m2, m3, m4, m5;
+} the_struct2_t;
+
+double
+bar (the_struct2_t *);
+
+double
+foo (double *k, unsigned int n, the_struct_t *the_struct)
+{
+  unsigned int u;
+  the_struct2_t result;
+  for (u = 0; u < n; u++, k--)
+    {
+      result.m1 += (*k) * the_struct[u].m1;
+      result.m2 += (*k) * the_struct[u].m2;
+      result.m3 += (*k) * the_struct[u].m3;
+      result.m4 += (*k) * the_struct[u].m4;
+    }
+  return bar (&result);
+}
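For reference, the kernel in pr110625.c has four independent accumulators, which is why charging base * count as the reduction latency overestimates the true cost.  An executable variant (with result zero-initialized, and the external bar replaced by a plain field sum purely so the sketch is self-contained):

```c
#include <assert.h>

typedef struct { unsigned short m1, m2, m3, m4; } the_struct_t;
typedef struct { double m1, m2, m3, m4, m5; } the_struct2_t;

/* Four parallel reductions: each field of RESULT accumulates
   independently, so their latencies can overlap across iterations.  */
double
foo (double *k, unsigned int n, the_struct_t *the_struct)
{
  the_struct2_t result = { 0.0, 0.0, 0.0, 0.0, 0.0 };
  for (unsigned int u = 0; u < n; u++, k--)
    {
      result.m1 += (*k) * the_struct[u].m1;
      result.m2 += (*k) * the_struct[u].m2;
      result.m3 += (*k) * the_struct[u].m3;
      result.m4 += (*k) * the_struct[u].m4;
    }
  return result.m1 + result.m2 + result.m3 + result.m4;
}
```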