[V15] VECT: Add decrement IV iteration loop control by variable amount support

Message ID 20230525025813.1232463-1-juzhe.zhong@rivai.ai
State New
Series [V15] VECT: Add decrement IV iteration loop control by variable amount support

Commit Message

juzhe.zhong@rivai.ai May 25, 2023, 2:58 a.m. UTC
From: Ju-Zhe Zhong <juzhe.zhong@rivai.ai>

This patch adds support for a decrementing IV, following the flow designed by Richard:

(1) In vect_set_loop_condition_partial_vectors, for the first iteration of
    the loop over the rgroup controls, call vect_set_loop_controls_directly.

(2) vect_set_loop_controls_directly calculates "step" as in your patch.
If rgc has 1 control, this step is the SSA name created for that control.
Otherwise the step is a fresh SSA name, as in your patch.

(3) vect_set_loop_controls_directly stores this step somewhere for later
use, probably in LOOP_VINFO.  Let's use "S" to refer to this stored step.

(4) After the vect_set_loop_controls_directly call above, and outside
the "if" statement that now contains vect_set_loop_controls_directly,
check whether rgc->controls.length () > 1.  If so, use
vect_adjust_loop_lens_control to set the controls based on S.

Then the only caller of vect_adjust_loop_lens_control is
vect_set_loop_condition_partial_vectors.  And the starting
step for vect_adjust_loop_lens_control is always S.
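
As a rough scalar model of steps (1) to (3) for a single rgroup, the sketch
below counts down from the total number of items and uses
MIN (remaining, VF) as the step of each iteration.  It is only an
illustration: decrementing_iv_steps and its vf parameter are hypothetical
names, a fixed integer stands in for the possibly variable-length VF, and the
code is plain C++ rather than GIMPLE or vectorizer code.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Scalar model of the decrementing-IV loop control for a single rgroup.
// Hypothetical helper, not part of GCC: `vf` stands in for the number of
// items that one vector iteration can process.
std::vector<std::uint64_t>
decrementing_iv_steps (std::uint64_t nitems_total, std::uint64_t vf)
{
  std::vector<std::uint64_t> steps;
  // Assumption: the real loop is only entered when there is work to do.
  if (nitems_total == 0 || vf == 0)
    return steps;

  std::uint64_t iv = nitems_total;          // IV starts at the total count
  do
    {
      std::uint64_t step = std::min (iv, vf); // step = MIN (IV, VF)
      steps.push_back (step);               // length used by this iteration
      iv -= step;                           // IV counts down by the step
    }
  while (iv != 0);                          // exit once nothing remains
  return steps;
}
```

For 10 items and a VF of 4, this model produces the lengths 4, 4 and 2, so
the final iteration simply runs with a shorter length.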

This patch has been well tested for single-rgroup and multiple-rgroup (SLP)
loops and passes all testcases in the RISC-V port.

It also passes the tests for multiple-rgroup (non-SLP) loops, tested with vec_pack_trunc.

This version fixes the following failures of the V14 patch:
FAIL: gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c execution test
FAIL: gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c execution test
FAIL: gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c execution test
FAIL: gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c execution test
FAIL: gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c execution test
FAIL: gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c execution test
FAIL: gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c execution test
FAIL: gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c execution test

With this patch, all of the testcases listed above pass.

gcc/ChangeLog:

        * tree-vect-loop-manip.cc (vect_set_loop_controls_directly): Add decrement IV support.
        (vect_adjust_loop_lens_control): Ditto.
        (vect_set_loop_condition_partial_vectors): Ditto.
        * tree-vect-loop.cc (_loop_vec_info::_loop_vec_info): New variables.
        * tree-vectorizer.h (LOOP_VINFO_USING_DECREMENTING_IV_P): New macro.
        (LOOP_VINFO_DECREMENTING_IV_STEP): Ditto.

gcc/testsuite/ChangeLog:

        * gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-3.c: New test.
        * gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-4.c: New test.
        * gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c: New test.
        * gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-4.c: New test.

---
 .../rvv/autovec/partial/multiple_rgroup-3.c   | 288 ++++++++++++++++++
 .../rvv/autovec/partial/multiple_rgroup-4.c   |  75 +++++
 .../autovec/partial/multiple_rgroup_run-3.c   |  36 +++
 .../autovec/partial/multiple_rgroup_run-4.c   |  15 +
 gcc/tree-vect-loop-manip.cc                   | 153 ++++++++++
 gcc/tree-vect-loop.cc                         |  13 +
 gcc/tree-vectorizer.h                         |  12 +
 7 files changed, 592 insertions(+)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-3.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-4.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-4.c

Comments

juzhe.zhong@rivai.ai May 25, 2023, 7:43 a.m. UTC | #1
Bootstrap and regression testing on x86 passed.

OK for trunk?


juzhe.zhong@rivai.ai
 
From: juzhe.zhong
Date: 2023-05-25 10:58
To: gcc-patches
CC: richard.sandiford; rguenther; Ju-Zhe Zhong
Subject: [PATCH V15] VECT: Add decrement IV iteration loop control by variable amount support
 
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-3.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-3.c
new file mode 100644
index 00000000000..9579749c285
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-3.c
@@ -0,0 +1,288 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-march=rv32gcv -mabi=ilp32d --param riscv-autovec-preference=fixed-vlmax" } */
+
+#include <stdint-gcc.h>
+
+void __attribute__ ((noinline, noclone))
+f0 (int8_t *__restrict x, int16_t *__restrict y, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 4, j += 8)
+    {
+      x[i + 0] += 1;
+      x[i + 1] += 2;
+      x[i + 2] += 3;
+      x[i + 3] += 4;
+      y[j + 0] += 1;
+      y[j + 1] += 2;
+      y[j + 2] += 3;
+      y[j + 3] += 4;
+      y[j + 4] += 5;
+      y[j + 5] += 6;
+      y[j + 6] += 7;
+      y[j + 7] += 8;
+    }
+}
+
+void __attribute__ ((optimize (0)))
+f0_init (int8_t *__restrict x, int8_t *__restrict x2, int16_t *__restrict y,
+ int16_t *__restrict y2, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 4, j += 8)
+    {
+      x[i + 0] = i % 120;
+      x[i + 1] = i % 78;
+      x[i + 2] = i % 55;
+      x[i + 3] = i % 27;
+      y[j + 0] = j % 33;
+      y[j + 1] = j % 44;
+      y[j + 2] = j % 66;
+      y[j + 3] = j % 88;
+      y[j + 4] = j % 99;
+      y[j + 5] = j % 39;
+      y[j + 6] = j % 49;
+      y[j + 7] = j % 101;
+
+      x2[i + 0] = i % 120;
+      x2[i + 1] = i % 78;
+      x2[i + 2] = i % 55;
+      x2[i + 3] = i % 27;
+      y2[j + 0] = j % 33;
+      y2[j + 1] = j % 44;
+      y2[j + 2] = j % 66;
+      y2[j + 3] = j % 88;
+      y2[j + 4] = j % 99;
+      y2[j + 5] = j % 39;
+      y2[j + 6] = j % 49;
+      y2[j + 7] = j % 101;
+    }
+}
+
+void __attribute__ ((optimize (0)))
+f0_golden (int8_t *__restrict x, int16_t *__restrict y, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 4, j += 8)
+    {
+      x[i + 0] += 1;
+      x[i + 1] += 2;
+      x[i + 2] += 3;
+      x[i + 3] += 4;
+      y[j + 0] += 1;
+      y[j + 1] += 2;
+      y[j + 2] += 3;
+      y[j + 3] += 4;
+      y[j + 4] += 5;
+      y[j + 5] += 6;
+      y[j + 6] += 7;
+      y[j + 7] += 8;
+    }
+}
+
+void __attribute__ ((optimize (0)))
+f0_check (int8_t *__restrict x, int8_t *__restrict x2, int16_t *__restrict y,
+   int16_t *__restrict y2, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 4, j += 8)
+    {
+      if (x[i + 0] != x2[i + 0])
+ __builtin_abort ();
+      if (x[i + 1] != x2[i + 1])
+ __builtin_abort ();
+      if (x[i + 2] != x2[i + 2])
+ __builtin_abort ();
+      if (x[i + 3] != x2[i + 3])
+ __builtin_abort ();
+      if (y[j + 0] != y2[j + 0])
+ __builtin_abort ();
+      if (y[j + 1] != y2[j + 1])
+ __builtin_abort ();
+      if (y[j + 2] != y2[j + 2])
+ __builtin_abort ();
+      if (y[j + 3] != y2[j + 3])
+ __builtin_abort ();
+      if (y[j + 4] != y2[j + 4])
+ __builtin_abort ();
+      if (y[j + 5] != y2[j + 5])
+ __builtin_abort ();
+      if (y[j + 6] != y2[j + 6])
+ __builtin_abort ();
+      if (y[j + 7] != y2[j + 7])
+ __builtin_abort ();
+    }
+}
+
+void __attribute__ ((noinline, noclone))
+f1 (int16_t *__restrict x, int32_t *__restrict y, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 2, j += 4)
+    {
+      x[i + 0] += 1;
+      x[i + 1] += 2;
+      y[j + 0] += 1;
+      y[j + 1] += 2;
+      y[j + 2] += 3;
+      y[j + 3] += 4;
+    }
+}
+
+void __attribute__ ((optimize (0)))
+f1_init (int16_t *__restrict x, int16_t *__restrict x2, int32_t *__restrict y,
+ int32_t *__restrict y2, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 2, j += 4)
+    {
+      x[i + 0] = i % 67;
+      x[i + 1] = i % 76;
+      y[j + 0] = j % 111;
+      y[j + 1] = j % 63;
+      y[j + 2] = j % 39;
+      y[j + 3] = j % 8;
+
+      x2[i + 0] = i % 67;
+      x2[i + 1] = i % 76;
+      y2[j + 0] = j % 111;
+      y2[j + 1] = j % 63;
+      y2[j + 2] = j % 39;
+      y2[j + 3] = j % 8;
+    }
+}
+
+void __attribute__ ((optimize (0)))
+f1_golden (int16_t *__restrict x, int32_t *__restrict y, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 2, j += 4)
+    {
+      x[i + 0] += 1;
+      x[i + 1] += 2;
+      y[j + 0] += 1;
+      y[j + 1] += 2;
+      y[j + 2] += 3;
+      y[j + 3] += 4;
+    }
+}
+
+void __attribute__ ((optimize (0)))
+f1_check (int16_t *__restrict x, int16_t *__restrict x2, int32_t *__restrict y,
+   int32_t *__restrict y2, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 2, j += 4)
+    {
+      if (x[i + 0] != x2[i + 0])
+ __builtin_abort ();
+      if (x[i + 1] != x2[i + 1])
+ __builtin_abort ();
+      if (y[j + 0] != y2[j + 0])
+ __builtin_abort ();
+      if (y[j + 1] != y2[j + 1])
+ __builtin_abort ();
+      if (y[j + 2] != y2[j + 2])
+ __builtin_abort ();
+      if (y[j + 3] != y2[j + 3])
+ __builtin_abort ();
+    }
+}
+
+void __attribute__ ((noinline, noclone))
+f2 (int32_t *__restrict x, int64_t *__restrict y, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 1, j += 2)
+    {
+      x[i + 0] += 1;
+      y[j + 0] += 1;
+      y[j + 1] += 2;
+    }
+}
+
+void __attribute__ ((optimize (0)))
+f2_init (int32_t *__restrict x, int32_t *__restrict x2, int64_t *__restrict y,
+ int64_t *__restrict y2, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 1, j += 2)
+    {
+      x[i + 0] = i % 79;
+      y[j + 0] = j % 83;
+      y[j + 1] = j % 100;
+
+      x2[i + 0] = i % 79;
+      y2[j + 0] = j % 83;
+      y2[j + 1] = j % 100;
+    }
+}
+
+void __attribute__ ((optimize (0)))
+f2_golden (int32_t *__restrict x, int64_t *__restrict y, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 1, j += 2)
+    {
+      x[i + 0] += 1;
+      y[j + 0] += 1;
+      y[j + 1] += 2;
+    }
+}
+
+void __attribute__ ((noinline, noclone))
+f2_check (int32_t *__restrict x, int32_t *__restrict x2, int64_t *__restrict y,
+   int64_t *__restrict y2, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 1, j += 2)
+    {
+      if (x[i + 0] != x2[i + 0])
+ __builtin_abort ();
+      if (y[j + 0] != y2[j + 0])
+ __builtin_abort ();
+      if (y[j + 1] != y2[j + 1])
+ __builtin_abort ();
+    }
+}
+
+void __attribute__ ((noinline, noclone))
+f3 (int8_t *__restrict x, int64_t *__restrict y, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 1, j += 2)
+    {
+      x[i + 0] += 1;
+      y[j + 0] += 1;
+      y[j + 1] += 2;
+    }
+}
+
+void __attribute__ ((noinline, noclone))
+f3_init (int8_t *__restrict x, int8_t *__restrict x2, int64_t *__restrict y,
+    int64_t *__restrict y2, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 1, j += 2)
+    {
+      x[i + 0] = i % 22;
+      y[j + 0] = i % 12;
+      y[j + 1] = i % 21;
+
+      x2[i + 0] = i % 22;
+      y2[j + 0] = i % 12;
+      y2[j + 1] = i % 21;
+    }
+}
+
+void __attribute__ ((optimize (0)))
+f3_golden (int8_t *__restrict x, int64_t *__restrict y, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 1, j += 2)
+    {
+      x[i + 0] += 1;
+      y[j + 0] += 1;
+      y[j + 1] += 2;
+    }
+}
+
+void __attribute__ ((noinline, noclone))
+f3_check (int8_t *__restrict x, int8_t *__restrict x2, int64_t *__restrict y,
+   int64_t *__restrict y2, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 1, j += 2)
+    {
+      if (x[i + 0] != x2[i + 0])
+ __builtin_abort ();
+      if (y[j + 0] != y2[j + 0])
+ __builtin_abort ();
+      if (y[j + 1] != y2[j + 1])
+ __builtin_abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-4.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-4.c
new file mode 100644
index 00000000000..e87961e49ac
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-4.c
@@ -0,0 +1,75 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-march=rv32gcv -mabi=ilp32d --param riscv-autovec-preference=fixed-vlmax" } */
+
+#include <stdint-gcc.h>
+
+void __attribute__ ((noinline, noclone))
+f (uint64_t *__restrict x, uint16_t *__restrict y, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 2, j += 4)
+    {
+      x[i + 0] += 1;
+      x[i + 1] += 2;
+      y[j + 0] += 1;
+      y[j + 1] += 2;
+      y[j + 2] += 3;
+      y[j + 3] += 4;
+    }
+}
+
+void __attribute__ ((optimize (0)))
+f_init (uint64_t *__restrict x, uint64_t *__restrict x2, uint16_t *__restrict y,
+ uint16_t *__restrict y2, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 2, j += 4)
+    {
+      x[i + 0] = i * 897 + 189;
+      x[i + 1] = i * 79 + 55963;
+      y[j + 0] = j * 18 + 78642;
+      y[j + 1] = j * 9 + 8634;
+      y[j + 2] = j * 78 + 2588;
+      y[j + 3] = j * 38 + 8932;
+  
+      x2[i + 0] = i * 897 + 189;
+      x2[i + 1] = i * 79 + 55963;
+      y2[j + 0] = j * 18 + 78642;
+      y2[j + 1] = j * 9 + 8634;
+      y2[j + 2] = j * 78 + 2588;
+      y2[j + 3] = j * 38 + 8932;
+    }
+}
+
+void __attribute__ ((optimize (0)))
+f_golden (uint64_t *__restrict x, uint16_t *__restrict y, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 2, j += 4)
+    {
+      x[i + 0] += 1;
+      x[i + 1] += 2;
+      y[j + 0] += 1;
+      y[j + 1] += 2;
+      y[j + 2] += 3;
+      y[j + 3] += 4;
+    }
+}
+
+void __attribute__ ((optimize (0)))
+f_check (uint64_t *__restrict x, uint64_t *__restrict x2,
+ uint16_t *__restrict y, uint16_t *__restrict y2, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 2, j += 4)
+    {
+      if (x[i + 0] != x2[i + 0])
+ __builtin_abort ();
+      if (x[i + 1] != x2[i + 1])
+ __builtin_abort ();
+      if (y[j + 0] != y2[j + 0])
+ __builtin_abort ();
+      if (y[j + 1] != y2[j + 1])
+ __builtin_abort ();
+      if (y[j + 2] != y2[j + 2])
+ __builtin_abort ();
+      if (y[j + 3] != y2[j + 3])
+ __builtin_abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c
new file mode 100644
index 00000000000..b786738ce99
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c
@@ -0,0 +1,36 @@
+/* { dg-do run { target { riscv_vector } } } */
+/* { dg-additional-options "--param riscv-autovec-preference=fixed-vlmax" } */
+
+#include "multiple_rgroup-3.c"
+
+int __attribute__ ((optimize (0))) main (void)
+{
+  int8_t f0_x[3108], f0_x2[3108];
+  int16_t f0_y[6216], f0_y2[6216];
+  f0_init (f0_x, f0_x2, f0_y, f0_y2, 3108);
+  f0 (f0_x, f0_y, 3108);
+  f0_golden (f0_x2, f0_y2, 3108);
+  f0_check (f0_x, f0_x2, f0_y, f0_y2, 3108);
+
+  int16_t f1_x[1998], f1_x2[1998];
+  int32_t f1_y[3996], f1_y2[3996];
+  f1_init (f1_x, f1_x2, f1_y, f1_y2, 1998);
+  f1 (f1_x, f1_y, 1998);
+  f1_golden (f1_x2, f1_y2, 1998);
+  f1_check (f1_x, f1_x2, f1_y, f1_y2, 1998);
+
+  int32_t f2_x[2023], f2_x2[2023];
+  int64_t f2_y[4046], f2_y2[4046];
+  f2_init (f2_x, f2_x2, f2_y, f2_y2, 2023);
+  f2 (f2_x, f2_y, 2023);
+  f2_golden (f2_x2, f2_y2, 2023);
+  f2_check (f2_x, f2_x2, f2_y, f2_y2, 2023);
+
+  int8_t f3_x[3203], f3_x2[3203];
+  int64_t f3_y[6406], f3_y2[6406];
+  f3_init (f3_x, f3_x2, f3_y, f3_y2, 3203);
+  f3 (f3_x, f3_y, 3203);
+  f3_golden (f3_x2, f3_y2, 3203);
+  f3_check (f3_x, f3_x2, f3_y, f3_y2, 3203);
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-4.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-4.c
new file mode 100644
index 00000000000..7751384183e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-4.c
@@ -0,0 +1,15 @@
+/* { dg-do run { target { riscv_vector } } } */
+/* { dg-additional-options "--param riscv-autovec-preference=fixed-vlmax" } */
+
+#include "multiple_rgroup-4.c"
+
+int __attribute__ ((optimize (0))) main (void)
+{
+  uint64_t f_x[3108], f_x2[3108];
+  uint16_t f_y[6216], f_y2[6216];
+  f_init (f_x, f_x2, f_y, f_y2, 3108);
+  f (f_x, f_y, 3108);
+  f_golden (f_x2, f_y2, 3108);
+  f_check (f_x, f_x2, f_y, f_y2, 3108);
+  return 0;
+}
diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index ff6159e08d5..f9d92ced982 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -468,6 +468,38 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
   gimple_stmt_iterator incr_gsi;
   bool insert_after;
   standard_iv_increment_position (loop, &incr_gsi, &insert_after);
+  if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
+    {
+      /* single rgroup:
+         ...
+         _10 = (unsigned long) count_12(D);
+         ...
+         # ivtmp_9 = PHI <ivtmp_35(6), _10(5)>
+         _36 = MIN_EXPR <ivtmp_9, POLY_INT_CST [4, 4]>;
+         ...
+         vect__4.8_28 = .LEN_LOAD (_17, 32B, _36, 0);
+         ...
+         ivtmp_35 = ivtmp_9 - _36;
+         ...
+         if (ivtmp_35 != 0)
+           goto <bb 4>; [83.33%]
+         else
+           goto <bb 5>; [16.67%]
+      */
+      nitems_total = gimple_convert (preheader_seq, iv_type, nitems_total);
+      tree step = rgc->controls.length () == 1 ? rgc->controls[0]
+                                               : make_ssa_name (iv_type);
+      /* Create decrement IV.  */
+      create_iv (nitems_total, MINUS_EXPR, step, NULL_TREE, loop, &incr_gsi,
+                 insert_after, &index_before_incr, &index_after_incr);
+      gimple_seq_add_stmt (header_seq, gimple_build_assign (step, MIN_EXPR,
+                                                            index_before_incr,
+                                                            nitems_step));
+      LOOP_VINFO_DECREMENTING_IV_STEP (loop_vinfo) = step;
+      return index_after_incr;
+    }
+
+  /* Create increment IV.  */
   create_iv (build_int_cst (iv_type, 0), PLUS_EXPR, nitems_step, NULL_TREE,
     loop, &incr_gsi, insert_after, &index_before_incr,
     &index_after_incr);
@@ -683,6 +715,63 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
   return next_ctrl;
}
+/* Try to use adjust loop lens for multiple-rgroups.
+
+     _36 = MIN_EXPR <ivtmp_34, VF>;
+
+     First length (MIN (X, VF/N)):
+       loop_len_15 = MIN_EXPR <_36, VF/N>;
+
+     Second length:
+       tmp = _36 - loop_len_15;
+       loop_len_16 = MIN (tmp, VF/N);
+
+     Third length:
+       tmp2 = tmp - loop_len_16;
+       loop_len_17 = MIN (tmp2, VF/N);
+
+     Last length:
+       loop_len_18 = tmp2 - loop_len_17;
+*/
+
+static void
+vect_adjust_loop_lens_control (tree iv_type, gimple_seq *seq,
+                               rgroup_controls *dest_rgm, tree step)
+{
+  tree ctrl_type = dest_rgm->type;
+  poly_uint64 nitems_per_ctrl
+    = TYPE_VECTOR_SUBPARTS (ctrl_type) * dest_rgm->factor;
+  tree length_limit = build_int_cst (iv_type, nitems_per_ctrl);
+
+  for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
+    {
+      tree ctrl = dest_rgm->controls[i];
+      if (i == 0)
+        {
+          /* First iteration: MIN (X, VF/N) capped to the range [0, VF/N].  */
+          gassign *assign
+            = gimple_build_assign (ctrl, MIN_EXPR, step, length_limit);
+          gimple_seq_add_stmt (seq, assign);
+        }
+      else if (i == dest_rgm->controls.length () - 1)
+        {
+          /* Last iteration: Remain capped to the range [0, VF/N].  */
+          gassign *assign = gimple_build_assign (ctrl, MINUS_EXPR, step,
+                                                 dest_rgm->controls[i - 1]);
+          gimple_seq_add_stmt (seq, assign);
+        }
+      else
+        {
+          /* (MIN (remain, VF*I/N)) capped to the range [0, VF/N].  */
+          step = gimple_build (seq, MINUS_EXPR, iv_type, step,
+                               dest_rgm->controls[i - 1]);
+          gassign *assign
+            = gimple_build_assign (ctrl, MIN_EXPR, step, length_limit);
+          gimple_seq_add_stmt (seq, assign);
+        }
+    }
+}
+
/* Set up the iteration condition and rgroup controls for LOOP, given
    that LOOP_VINFO_USING_PARTIAL_VECTORS_P is true for the vectorized
    loop.  LOOP_VINFO describes the vectorization of LOOP.  NITERS is
@@ -764,6 +853,70 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
     loop_cond_gsi, rgc,
     niters, niters_skip,
     might_wrap_p);
+
+ /* Decrement IV only run vect_set_loop_controls_directly once.  */
+ if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo)
+     && rgc->controls.length () > 1)
+   {
+     /*
+        - Multiple rgroup (SLP):
+ ...
+ _38 = (unsigned long) bnd.7_29;
+ _39 = _38 * 2;
+ ...
+ # ivtmp_41 = PHI <ivtmp_42(6), _39(5)>
+ ...
+ _43 = MIN_EXPR <ivtmp_41, 32>;
+ loop_len_26 = MIN_EXPR <_43, 16>;
+ loop_len_25 = _43 - loop_len_26;
+ ...
+ .LEN_STORE (_6, 8B, loop_len_26, ...);
+ ...
+ .LEN_STORE (_25, 8B, loop_len_25, ...);
+ _33 = loop_len_26 / 2;
+ ...
+ .LEN_STORE (_8, 16B, _33, ...);
+ _36 = loop_len_25 / 2;
+ ...
+ .LEN_STORE (_15, 16B, _36, ...);
+ ivtmp_42 = ivtmp_41 - _43;
+ ...
+
+        - Multiple rgroup (non-SLP):
+ ...
+ _38 = (unsigned long) n_12(D);
+ ...
+ # ivtmp_38 = PHI <ivtmp_39(3), 100(2)>
+ ...
+ loop_len_38 = MIN_EXPR <ivtmp_41, POLY_INT_CST [8, 8]>;
+ _43 = MIN_EXPR <ivtmp_44, POLY_INT_CST [8, 8]>;
+ loop_len_24 = MIN_EXPR <_43, POLY_INT_CST [2, 2]>;
+ _46 = _43 - loop_len_24;
+ loop_len_23 = MIN_EXPR <_46, POLY_INT_CST [2, 2]>;
+ _47 = _46 - loop_len_23;
+ loop_len_22 = MIN_EXPR <_47, POLY_INT_CST [2, 2]>;
+ loop_len_21 = _47 - loop_len_22;
+ ...
+ vect__4.8_17 = .LEN_LOAD (_6, 64B, loop_len_24, 0);
+ ...
+ vect__4.9_9 = .LEN_LOAD (_49, 64B, loop_len_23, 0);
+ ...
+ vect__4.10_30 = .LEN_LOAD (_52, 64B, loop_len_22, 0);
+ ...
+ vect__4.11_32 = .LEN_LOAD (_55, 64B, loop_len_21, 0);
+ vect__7.13_31 = VEC_PACK_TRUNC_EXPR <...>,
+ vect__7.13_32 = VEC_PACK_TRUNC_EXPR <...>;
+ vect__7.12_33 = VEC_PACK_TRUNC_EXPR <...>;
+ ...
+ .LEN_STORE (_14, 16B, _40, vect__7.12_33, 0);
+ ivtmp_39 = ivtmp_38 - _40;
+ ...
+     */
+     tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
+     tree step = LOOP_VINFO_DECREMENTING_IV_STEP (loop_vinfo);
+     gcc_assert (step);
+     vect_adjust_loop_lens_control (iv_type, &header_seq, rgc, step);
+   }
       }
   /* Emit all accumulated statements.  */
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index cf10132b0bf..456f50fa7cc 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -973,6 +973,8 @@ _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     vectorizable (false),
     can_use_partial_vectors_p (param_vect_partial_vector_usage != 0),
     using_partial_vectors_p (false),
+    using_decrementing_iv_p (false),
+    decrementing_iv_step (NULL_TREE),
     epil_using_partial_vectors_p (false),
     partial_load_store_bias (0),
     peeling_for_gaps (false),
@@ -2725,6 +2727,17 @@ start_over:
       && !vect_verify_loop_lens (loop_vinfo))
     LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
+  /* If we're vectorizing a loop that uses length "controls" and
+     can iterate more than once, we apply the decrementing IV approach
+     in loop control.  */
+  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+      && !LOOP_VINFO_LENS (loop_vinfo).is_empty ()
+      && LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo) == 0
+      && !(LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+    && known_le (LOOP_VINFO_INT_NITERS (loop_vinfo),
+ LOOP_VINFO_VECT_FACTOR (loop_vinfo))))
+    LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo) = true;
+
   /* If we're vectorizing an epilogue loop, the vectorized loop either needs
      to be able to handle fewer than VF scalars, or needs to have a lower VF
      than the main loop.  */
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 02d2ad6fba1..7ed079f543a 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -818,6 +818,16 @@ public:
      the vector loop can handle fewer than VF scalars.  */
   bool using_partial_vectors_p;
+  /* True if we've decided to use a decrementing loop control IV that counts
+     scalars. This can be done for any loop that:
+
+ (a) uses length "controls"; and
+ (b) can iterate more than once.  */
+  bool using_decrementing_iv_p;
+
+  /* The variable amount step for decrement IV.  */
+  tree decrementing_iv_step;
+
   /* True if we've decided to use partially-populated vectors for the
      epilogue of loop.  */
   bool epil_using_partial_vectors_p;
@@ -890,6 +900,8 @@ public:
#define LOOP_VINFO_VECTORIZABLE_P(L)       (L)->vectorizable
#define LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P(L) (L)->can_use_partial_vectors_p
#define LOOP_VINFO_USING_PARTIAL_VECTORS_P(L) (L)->using_partial_vectors_p
+#define LOOP_VINFO_USING_DECREMENTING_IV_P(L) (L)->using_decrementing_iv_p
+#define LOOP_VINFO_DECREMENTING_IV_STEP(L) (L)->decrementing_iv_step
#define LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P(L)                             \
   (L)->epil_using_partial_vectors_p
#define LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS(L) (L)->partial_load_store_bias
Richard Sandiford May 25, 2023, 9:02 a.m. UTC | #2
Thanks, this looks functionally correct to me.  And I agree it handles
the cases that previously needed multiplication.

But I think it regresses code quality when no multiplication was needed.
We can now generate duplicate IVs.  Perhaps ivopts would remove the
duplicates, but it might be hard, because of the variable steps.

For example, we would generate duplicate IVs for non-SLP code that
operates on multiple vector sizes.  (Can't remember what the status
of unpack/truncate patterns is on RVV.)  But it also shows up for SLP.
E.g., I would expect duplicate IVs for:

uint16_t x[100];
uint32_t y[200];

void f() {
  for (int i = 0; i < 100; i += 2) {
    x[i + 0] += 1;
    x[i + 1] += 2;
    y[i + 0] += 1;
    y[i + 1] += 2;
  }
}

So I think the call to vect_set_loop_controls_directly does still
need to be inside an "if".  But the "if" condition should be based
on whether the IV step is different.  As discussed yesterday, the
IV step is different if nitems_per_iter, aka:

  max_nscalars_per_iter * factor

is different.

Because of that, I think I was wrong to suggest storing the IV in
loop_vinfo.  It should probably be stored in rgroup_controls instead.

Then we could have a structure like this:

  rgroup_controls *rgc;
  rgroup_controls *iv_rgc = nullptr;
  ...
  FOR_EACH_VEC_ELT (*controls, i, rgc)
    if (!rgc->controls.is_empty ())
      {
        ...
	if (!LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo)
	    || !iv_rgc
	    || (iv_rgc->max_nscalars_per_iter * iv_rgc->factor
		!= rgc->max_nscalars_per_iter * rgc->factor))
          {
            /* See whether zero-based IV would ever generate all-false masks
               or zero length before wrapping around.  */
            bool might_wrap_p = vect_rgroup_iv_might_wrap_p (loop_vinfo, rgc);

            /* Set up all controls for this group.  */
            test_ctrl = vect_set_loop_controls_directly (loop, loop_vinfo,
                                                         &preheader_seq,
                                                         &header_seq,
                                                         loop_cond_gsi, rgc,
                                                         niters, niters_skip,
                                                         might_wrap_p);

	    iv_rgc = rgc;
	  }

	if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo)
	    && rgc->controls.length () > 1)
	  {
            ...your code, using the iv in iv_rgc...;
	  }
      }
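
To make the condition in that structure concrete, here is a small standalone
model of it.  It is only a sketch: RgroupModel and count_direct_setups are
made-up names, and only the max_nscalars_per_iter and factor fields and the
comparison itself are taken from the structure above.  The function counts
how many times vect_set_loop_controls_directly would be called, i.e. how many
IVs would be created, for a given list of rgroups.

```cpp
#include <vector>

// Hypothetical stand-in for rgroup_controls; only the fields that the
// condition above looks at are modelled.
struct RgroupModel
{
  unsigned max_nscalars_per_iter;
  unsigned factor;
  unsigned ncontrols;   // models rgc->controls.length ()
};

// Count the calls to vect_set_loop_controls_directly that the suggested
// structure would make.  With a decrementing IV, a new IV is only set up
// when the number of items per iteration differs from the IV's rgroup.
unsigned
count_direct_setups (const std::vector<RgroupModel> &rgroups,
                     bool using_decrementing_iv)
{
  const RgroupModel *iv_rgc = nullptr;
  unsigned setups = 0;
  for (const RgroupModel &rgc : rgroups)
    {
      if (rgc.ncontrols == 0)   // models rgc->controls.is_empty ()
        continue;
      if (!using_decrementing_iv
          || !iv_rgc
          || (iv_rgc->max_nscalars_per_iter * iv_rgc->factor
              != rgc.max_nscalars_per_iter * rgc.factor))
        {
          ++setups;
          iv_rgc = &rgc;
        }
    }
  return setups;
}
```

In this model, two rgroups whose max_nscalars_per_iter * factor products
match share a single decrementing IV, while rgroups with different products
each get their own.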

Some other comments:

> diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
> index ff6159e08d5..f9d92ced982 100644
> --- a/gcc/tree-vect-loop-manip.cc
> +++ b/gcc/tree-vect-loop-manip.cc
> @@ -468,6 +468,38 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>    gimple_stmt_iterator incr_gsi;
>    bool insert_after;
>    standard_iv_increment_position (loop, &incr_gsi, &insert_after);
> +  if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
> +    {
> +      /* single rgroup:

Instead of "single rgroup", how about:

      /* Create an IV that counts down from niters_total and whose step
	 is the (variable) amount processed in the current iteration:
	
But please keep the example below as well.

> +	 ...
> +	 _10 = (unsigned long) count_12(D);
> +	 ...
> +	 # ivtmp_9 = PHI <ivtmp_35(6), _10(5)>
> +	 _36 = MIN_EXPR <ivtmp_9, POLY_INT_CST [4, 4]>;
> +	 ...
> +	 vect__4.8_28 = .LEN_LOAD (_17, 32B, _36, 0);
> +	 ...
> +	 ivtmp_35 = ivtmp_9 - _36;
> +	 ...
> +	 if (ivtmp_35 != 0)
> +	   goto <bb 4>; [83.33%]
> +	 else
> +	   goto <bb 5>; [16.67%]
> +      */
> +      nitems_total = gimple_convert (preheader_seq, iv_type, nitems_total);
> +      tree step = rgc->controls.length () == 1 ? rgc->controls[0]
> +					       : make_ssa_name (iv_type);
> +      /* Create decrement IV.  */
> +      create_iv (nitems_total, MINUS_EXPR, step, NULL_TREE, loop, &incr_gsi,
> +		 insert_after, &index_before_incr, &index_after_incr);
> +      gimple_seq_add_stmt (header_seq, gimple_build_assign (step, MIN_EXPR,
> +							    index_before_incr,
> +							    nitems_step));
> +      LOOP_VINFO_DECREMENTING_IV_STEP (loop_vinfo) = step;
> +      return index_after_incr;
> +    }
> +
> +  /* Create increment IV.  */
>    create_iv (build_int_cst (iv_type, 0), PLUS_EXPR, nitems_step, NULL_TREE,
>  	     loop, &incr_gsi, insert_after, &index_before_incr,
>  	     &index_after_incr);
> @@ -683,6 +715,63 @@ vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
>    return next_ctrl;
>  }
>  
> +/* Try to use adjust loop lens for multiple-rgroups.

This is no longer "try", since the function always does something.

How about:

/* Populate DEST_RGM->controls, given that they should add up to STEP.

> +
> +     _36 = MIN_EXPR <ivtmp_34, VF>;

Suggest using "STEP" instead of "_36" here, since this corresponds
to the function's "step" parameter.

> +
> +     First length (MIN (X, VF/N)):
> +       loop_len_15 = MIN_EXPR <_36, VF/N>;

Likewise here.

> +
> +     Second length:
> +       tmp = _36 - loop_len_15;

And here.

> +       loop_len_16 = MIN (tmp, VF/N);
> +
> +     Third length:
> +       tmp2 = tmp - loop_len_16;
> +       loop_len_17 = MIN (tmp2, VF/N);
> +
> +     Last length:
> +       loop_len_18 = tmp2 - loop_len_17;
> +*/
> +
> +static void
> +vect_adjust_loop_lens_control (tree iv_type, gimple_seq *seq,
> +			       rgroup_controls *dest_rgm, tree step)
> +{
> +  tree ctrl_type = dest_rgm->type;
> +  poly_uint64 nitems_per_ctrl
> +    = TYPE_VECTOR_SUBPARTS (ctrl_type) * dest_rgm->factor;
> +  tree length_limit = build_int_cst (iv_type, nitems_per_ctrl);
> +
> +  for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
> +    {
> +      tree ctrl = dest_rgm->controls[i];
> +      if (i == 0)
> +	{
> +	  /* First iteration: MIN (X, VF/N) capped to the range [0, VF/N].  */
> +	  gassign *assign
> +	    = gimple_build_assign (ctrl, MIN_EXPR, step, length_limit);
> +	  gimple_seq_add_stmt (seq, assign);
> +	}
> +      else if (i == dest_rgm->controls.length () - 1)
> +	{
> +	  /* Last iteration: Remain capped to the range [0, VF/N].  */
> +	  gassign *assign = gimple_build_assign (ctrl, MINUS_EXPR, step,
> +						 dest_rgm->controls[i - 1]);
> +	  gimple_seq_add_stmt (seq, assign);
> +	}
> +      else
> +	{
> +	  /* (MIN (remain, VF*I/N)) capped to the range [0, VF/N].  */
> +	  step = gimple_build (seq, MINUS_EXPR, iv_type, step,
> +			       dest_rgm->controls[i - 1]);
> +	  gassign *assign
> +	    = gimple_build_assign (ctrl, MIN_EXPR, step, length_limit);
> +	  gimple_seq_add_stmt (seq, assign);
> +	}
> +    }
> +}
> +
>  /* Set up the iteration condition and rgroup controls for LOOP, given
>     that LOOP_VINFO_USING_PARTIAL_VECTORS_P is true for the vectorized
>     loop.  LOOP_VINFO describes the vectorization of LOOP.  NITERS is
> @@ -764,6 +853,70 @@ vect_set_loop_condition_partial_vectors (class loop *loop,
>  						     loop_cond_gsi, rgc,
>  						     niters, niters_skip,
>  						     might_wrap_p);
> +
> +	/* Decrement IV only run vect_set_loop_controls_directly once.  */
> +	if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo)
> +	    && rgc->controls.length () > 1)
> +	  {
> +	    /*
> +	       - Multiple rgroup (SLP):
> +		 ...
> +		 _38 = (unsigned long) bnd.7_29;
> +		 _39 = _38 * 2;
> +		 ...
> +		 # ivtmp_41 = PHI <ivtmp_42(6), _39(5)>
> +		 ...
> +		 _43 = MIN_EXPR <ivtmp_41, 32>;
> +		 loop_len_26 = MIN_EXPR <_43, 16>;
> +		 loop_len_25 = _43 - loop_len_26;
> +		 ...
> +		 .LEN_STORE (_6, 8B, loop_len_26, ...);
> +		 ...
> +		 .LEN_STORE (_25, 8B, loop_len_25, ...);
> +		 _33 = loop_len_26 / 2;
> +		 ...
> +		 .LEN_STORE (_8, 16B, _33, ...);
> +		 _36 = loop_len_25 / 2;
> +		 ...
> +		 .LEN_STORE (_15, 16B, _36, ...);
> +		 ivtmp_42 = ivtmp_41 - _43;
> +		 ...
> +
> +	       - Multiple rgroup (non-SLP):
> +		 ...
> +		 _38 = (unsigned long) n_12(D);
> +		 ...
> +		 # ivtmp_38 = PHI <ivtmp_39(3), 100(2)>
> +		 ...
> +		 loop_len_38 = MIN_EXPR <ivtmp_41, POLY_INT_CST [8, 8]>;
> +		 _43 = MIN_EXPR <ivtmp_44, POLY_INT_CST [8, 8]>;
> +		 loop_len_24 = MIN_EXPR <_43, POLY_INT_CST [2, 2]>;
> +		 _46 = _43 - loop_len_24;
> +		 loop_len_23 = MIN_EXPR <_46, POLY_INT_CST [2, 2]>;
> +		 _47 = _46 - loop_len_23;
> +		 loop_len_22 = MIN_EXPR <_47, POLY_INT_CST [2, 2]>;
> +		 loop_len_21 = _47 - loop_len_22;
> +		 ...
> +		 vect__4.8_17 = .LEN_LOAD (_6, 64B, loop_len_24, 0);
> +		 ...
> +		 vect__4.9_9 = .LEN_LOAD (_49, 64B, loop_len_23, 0);
> +		 ...
> +		 vect__4.10_30 = .LEN_LOAD (_52, 64B, loop_len_22, 0);
> +		 ...
> +		 vect__4.11_32 = .LEN_LOAD (_55, 64B, loop_len_21, 0);
> +		 vect__7.13_31 = VEC_PACK_TRUNC_EXPR <...>,
> +		 vect__7.13_32 = VEC_PACK_TRUNC_EXPR <...>;
> +		 vect__7.12_33 = VEC_PACK_TRUNC_EXPR <...>;
> +		 ...
> +		 .LEN_STORE (_14, 16B, _40, vect__7.12_33, 0);
> +		 ivtmp_39 = ivtmp_38 - _40;
> +		 ...
> +	    */

I don't think this comment is really necessary, and might be a bit
misleading, since it includes things that are not done directly
by this code.

I think the comment above vect_adjust_loop_lens_control accurately
describes what's going on, so we can just say:

	    /* vect_set_loop_controls_directly creates an IV whose step
	       is equal to the expected sum of RGC->controls.  Use that
	       information to populate RGC->controls.  */

Thanks,
Richard
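
The splitting scheme discussed in this review (populate the controls so that they add up to STEP, each capped at VF/N) can be modelled in plain C. This is only a sketch under stated assumptions: `min_ul`, `split_step` and `run_loop` are hypothetical names, not GCC functions, and fixed integers stand in for the POLY_INT_CST values that appear in the quoted IR.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a GCC function.  */
unsigned long
min_ul (unsigned long a, unsigned long b)
{
  return a < b ? a : b;
}

/* Model of the splitting scheme: write N lengths to LENS such that each
   is at most LIMIT and together they add up to STEP.  Requires
   STEP <= N * LIMIT, which the MIN_EXPR on the IV guarantees.  */
void
split_step (unsigned long step, unsigned long limit,
	    unsigned long *lens, size_t n)
{
  assert (n > 1 && step <= n * limit);
  unsigned long remain = step;
  for (size_t i = 0; i < n; ++i)
    {
      /* The last length is simply whatever is left over.  */
      lens[i] = (i == n - 1) ? remain : min_ul (remain, limit);
      remain -= lens[i];
    }
}

/* Model of the decrementing IV: count down from TOTAL, where each
   iteration processes MIN (iv, VF) items split over N controls.
   Returns the number of loop iterations; TOTAL must be nonzero.  */
unsigned long
run_loop (unsigned long total, unsigned long vf, size_t n)
{
  unsigned long lens[8];
  assert (total != 0 && n <= 8 && vf % n == 0);
  unsigned long iv = total, iterations = 0;
  do
    {
      unsigned long step = min_ul (iv, vf);
      split_step (step, vf / n, lens, n);
      iv -= step;
      ++iterations;
    }
  while (iv != 0);
  return iterations;
}
```

For instance, with the stand-in values VF = 8 and N = 4, a step of 7 is split into the lengths 2, 2, 2 and 1, and a step of 3 into 2, 1, 0 and 0.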
juzhe.zhong@rivai.ai May 25, 2023, 9:52 a.m. UTC | #3
Hi, Richard. Thanks for the comments.

>> if (!LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo)
>>     || !iv_rgc
>>     || (iv_rgc->max_nscalars_per_iter * iv_rgc->factor
>>         != rgc->max_nscalars_per_iter * rgc->factor))
>>   {
>>     /* See whether zero-based IV would ever generate all-false masks
>>        or zero length before wrapping around.  */
>>     bool might_wrap_p = vect_rgroup_iv_might_wrap_p (loop_vinfo, rgc);
>>
>>     /* Set up all controls for this group.  */
>>     test_ctrl = vect_set_loop_controls_directly (loop, loop_vinfo,
>>                                                  &preheader_seq,
>>                                                  &header_seq,
>>                                                  loop_cond_gsi, rgc,
>>                                                  niters, niters_skip,
>>                                                  might_wrap_p);
>>
>>     iv_rgc = rgc;
>>   }


Could you tell me why you added:
>> (iv_rgc->max_nscalars_per_iter * iv_rgc->factor
>>  != rgc->max_nscalars_per_iter * rgc->factor) ?

With this in the condition, I get an ICE because the generated IR is invalid:
loop_len_76 = MIN_EXPR <ivtmp_98, 8>;
  loop_len_66 = MIN_EXPR <ivtmp_101, 16>;
  loop_len_66 = MIN_EXPR <loop_len_66, 4>;
  loop_len_65 = MIN_EXPR <0, 4>;

  _103 = -loop_len_65;

  loop_len_64 = MIN_EXPR <_103, 4>;
  loop_len_63 = _103 - loop_len_64;

When I remove it, it works.

Should I remove it?

Thanks.


juzhe.zhong@rivai.ai
 
Richard Sandiford May 25, 2023, 10:19 a.m. UTC | #4
"juzhe.zhong@rivai.ai" <juzhe.zhong@rivai.ai> writes:
> Hi, Richard. Thanks for the comments.
>
>>> if (!LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo)
>>>     || !iv_rgc
>>>     || (iv_rgc->max_nscalars_per_iter * iv_rgc->factor
>>>         != rgc->max_nscalars_per_iter * rgc->factor))
>>>   {
>>>     /* See whether zero-based IV would ever generate all-false masks
>>>        or zero length before wrapping around.  */
>>>     bool might_wrap_p = vect_rgroup_iv_might_wrap_p (loop_vinfo, rgc);
>
>>>     /* Set up all controls for this group.  */
>>>     test_ctrl = vect_set_loop_controls_directly (loop, loop_vinfo,
>>>                                                  &preheader_seq,
>>>                                                  &header_seq,
>>>                                                  loop_cond_gsi, rgc,
>>>                                                  niters, niters_skip,
>>>                                                  might_wrap_p);
>
>>>     iv_rgc = rgc;
>>>   }
>
>
> Could you tell me why you add:
> (iv_rgc->max_nscalars_per_iter * iv_rgc->factor
>>> != rgc->max_nscalars_per_iter * rgc->factor) ?

The patch creates IVs with the following step:

      gimple_seq_add_stmt (header_seq, gimple_build_assign (step, MIN_EXPR,
							    index_before_incr,
							    nitems_step));

If nitems_step is the same for two IVs, those IVs will always be equal.

So having multiple IVs with the same nitems_step is redundant.

nitems_step is calculated as follows:

  unsigned int nitems_per_iter = rgc->max_nscalars_per_iter * rgc->factor;
  ...
  poly_uint64 vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
  ...

  if (nitems_per_iter != 1)
    {
      ...
      tree iv_factor = build_int_cst (iv_type, nitems_per_iter);
      ...
      nitems_step = gimple_build (preheader_seq, MULT_EXPR, iv_type,
				  nitems_step, iv_factor);
      ...
    }

so nitems_step is equal to:

  rgc->max_nscalars_per_iter * rgc->factor * VF

VF is fixed for a loop, so nitems_step is equal for two different
rgroup_controls if:

  rgc->max_nscalars_per_iter * rgc->factor

is the same for those rgroup_controls.

Please try the example I posted earlier today. I think you'll see that,
without the:

  (iv_rgc->max_nscalars_per_iter * iv_rgc->factor
   != rgc->max_nscalars_per_iter * rgc->factor)

you'll have two IVs with the same step (because their MIN_EXPRs have
the same bound).

Thanks,
Richard
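
The reuse rule explained in this message can be sketched as a small C model. It is an illustration only: `struct rgroup_model`, `nitems_per_iter`, `needs_new_iv` and `count_ivs` are hypothetical names, the struct keeps just the two fields the condition reads, and the model covers only the decrementing-IV case (it omits the LOOP_VINFO_USING_DECREMENTING_IV_P test and the skipping of rgroups with no controls).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the two rgroup_controls fields that the
   condition reads; the real structure has more members.  */
struct rgroup_model
{
  unsigned int max_nscalars_per_iter;
  unsigned int factor;
};

/* The per-iteration item count that determines the IV step
   (the step bound is this value times VF).  */
unsigned int
nitems_per_iter (const struct rgroup_model *rgc)
{
  return rgc->max_nscalars_per_iter * rgc->factor;
}

/* Return nonzero if a new decrementing IV is needed for RGC, given that
   IV_RGC (possibly null) is the rgroup whose IV was created last.  */
int
needs_new_iv (const struct rgroup_model *iv_rgc,
	      const struct rgroup_model *rgc)
{
  return !iv_rgc || nitems_per_iter (iv_rgc) != nitems_per_iter (rgc);
}

/* Walk RGCS in order and count the IVs that the condition creates.  */
size_t
count_ivs (const struct rgroup_model *rgcs, size_t n)
{
  const struct rgroup_model *iv_rgc = NULL;
  size_t count = 0;
  for (size_t i = 0; i < n; ++i)
    if (needs_new_iv (iv_rgc, &rgcs[i]))
      {
	iv_rgc = &rgcs[i];
	++count;
      }
  return count;
}
```

With made-up rgroups whose (max_nscalars_per_iter, factor) pairs are (2, 1), (1, 2) and (4, 1), the second rgroup reuses the first IV because both products are 2, so only two IVs are created. Without the product comparison every rgroup would get its own IV, including redundant ones.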
juzhe.zhong@rivai.ai May 25, 2023, 10:40 a.m. UTC | #5
Yeah, I see. Removing it causes the run testcase to fail.
I have now found the issue: it comes from storing the step in the iv_rgroup, as you suggested.

After I tried that, the IR looks correct but it causes an ICE:
0x18c8d41 process_bb
        ../../../riscv-gcc/gcc/tree-ssa-sccvn.cc:7933
0x18cb6d9 do_rpo_vn_1
        ../../../riscv-gcc/gcc/tree-ssa-sccvn.cc:8544
0x18cbd35 do_rpo_vn(function*, edge_def*, bitmap_head*, bool, bool, vn_lookup_ki
        ../../../riscv-gcc/gcc/tree-ssa-sccvn.cc:8646
0x19d42d2 execute
        ../../../riscv-gcc/gcc/tree-vectorizer.cc:1385

This is the IR:

loop_len_76 = MIN_EXPR <ivtmp_98, 8>;

  loop_len_66 = MIN_EXPR <ivtmp_101, 16>;   -> store the step in rgroup instead of LOOP_VINFO
  _103 = loop_len_66;  ---->reuse the MIN VALUE

  loop_len_66 = MIN_EXPR <loop_len_66, 4>;

  _104 = _103 - loop_len_66;  ->use MIN - loop_len_66

  loop_len_65 = MIN_EXPR <_104, 4>;
  _105 = _104 - loop_len_65;
  loop_len_64 = MIN_EXPR <_105, 4>;
  loop_len_63 = _105 - loop_len_64;

Previously I stored the "MIN_EXPR <ivtmp_101, 16>;" in the LOOP_VINFO, not the rgroup.

So the previous IR was correct and did not ICE:

  loop_len_76 = MIN_EXPR <ivtmp_98, 8>;

 _103 = MIN_EXPR <ivtmp_101, 16>;    -> Step store in the LOOP_VINFO (S)

  loop_len_66 = MIN_EXPR <_103, 4>; 

  _104 = _103 - loop_len_66;  ->  use MIN - loop_len_66

  loop_len_65 = MIN_EXPR <_104, 4>;
  _105 = _104 - loop_len_65;
  loop_len_64 = MIN_EXPR <_105, 4>;
  loop_len_63 = _105 - loop_len_64;

Could you help me with this?
Thanks.


juzhe.zhong@rivai.ai
 
Richard Biener May 25, 2023, 11:04 a.m. UTC | #6
On Thu, 25 May 2023, juzhe.zhong@rivai.ai wrote:

> Yeah. I see. Removing it will cause testcase run fail.
> Now I found the issue, since you want to store the step in the iv_rgroup.
> 
> After I tried, the IR looks correct but create ICE:
> 0x18c8d41 process_bb
>         ../../../riscv-gcc/gcc/tree-ssa-sccvn.cc:7933
> 0x18cb6d9 do_rpo_vn_1
>         ../../../riscv-gcc/gcc/tree-ssa-sccvn.cc:8544
> 0x18cbd35 do_rpo_vn(function*, edge_def*, bitmap_head*, bool, bool, vn_lookup_ki
>         ../../../riscv-gcc/gcc/tree-ssa-sccvn.cc:8646
> 0x19d42d2 execute
>         ../../../riscv-gcc/gcc/tree-vectorizer.cc:1385
> 
> This is the IR:
> 
> loop_len_76 = MIN_EXPR <ivtmp_98, 8>;
> 
>   loop_len_66 = MIN_EXPR <ivtmp_101, 16>;   -> store the step in rgroup instead of LOOP_VINFO
>   _103 = loop_len_66;  ---->reuse the MIN VALUE
> 
>   loop_len_66 = MIN_EXPR <loop_len_66, 4>;

you have two defs of loop_len_66, that's not allowed
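
The rule behind this remark is that in SSA form each name has exactly one defining statement. A toy check, which is not GCC code, makes that concrete: `first_redefinition` is a hypothetical helper that scans the left-hand sides of a statement list and reports the first name that is defined a second time.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the index of the first statement that redefines a name already
   defined by an earlier statement, or -1 if every name is defined once.
   DEFS[i] is the left-hand side of statement i.  */
int
first_redefinition (const char *const *defs, size_t n)
{
  for (size_t i = 1; i < n; ++i)
    for (size_t j = 0; j < i; ++j)
      if (strcmp (defs[i], defs[j]) == 0)
	return (int) i;
  return -1;
}
```

Fed the left-hand sides of the failing IR quoted here (loop_len_76, loop_len_66, _103, loop_len_66, ...), the check flags the fourth statement. The earlier listing, where the MIN_EXPR result goes to the separate name _103, passes.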

> 
>   _104 = _103 - loop_len_66;  ->use MIN - loop_len_66
> 
>   loop_len_65 = MIN_EXPR <_104, 4>;
>   _105 = _104 - loop_len_65;
>   loop_len_64 = MIN_EXPR <_105, 4>;
>   loop_len_63 = _105 - loop_len_64;
> 
> Since previously I store the "MIN_EXPR <ivtmp_101, 16>;" in the LOOP_VINFO, not the rgroup.
> 
> So previously is correct and no ICE:
> 
>   loop_len_76 = MIN_EXPR <ivtmp_98, 8>;
> 
>  _103 = MIN_EXPR <ivtmp_101, 16>;    -> Step store in the LOOP_VINFO (S)
> 
>   loop_len_66 = MIN_EXPR <_103, 4>; 
> 
>   _104 = _103 - loop_len_66;  ->  use MIN - loop_len_66
> 
>   loop_len_65 = MIN_EXPR <_104, 4>;
>   _105 = _104 - loop_len_65;
>   loop_len_64 = MIN_EXPR <_105, 4>;
>   loop_len_63 = _105 - loop_len_64;
> 
> Could you help me with this ?
> Thanks.
> 
> 
> juzhe.zhong@rivai.ai
>  
juzhe.zhong@rivai.ai May 25, 2023, 12:08 p.m. UTC | #7
Thank you so much for your patience.
Could you take a look at the V16 patch:
https://gcc.gnu.org/pipermail/gcc-patches/2023-May/619652.html
and let me know whether it is OK for trunk?

Thanks.


juzhe.zhong@rivai.ai
 

Patch

diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-3.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-3.c
new file mode 100644
index 00000000000..9579749c285
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-3.c
@@ -0,0 +1,288 @@ 
+/* { dg-do compile } */
+/* { dg-additional-options "-march=rv32gcv -mabi=ilp32d --param riscv-autovec-preference=fixed-vlmax" } */
+
+#include <stdint-gcc.h>
+
+void __attribute__ ((noinline, noclone))
+f0 (int8_t *__restrict x, int16_t *__restrict y, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 4, j += 8)
+    {
+      x[i + 0] += 1;
+      x[i + 1] += 2;
+      x[i + 2] += 3;
+      x[i + 3] += 4;
+      y[j + 0] += 1;
+      y[j + 1] += 2;
+      y[j + 2] += 3;
+      y[j + 3] += 4;
+      y[j + 4] += 5;
+      y[j + 5] += 6;
+      y[j + 6] += 7;
+      y[j + 7] += 8;
+    }
+}
+
+void __attribute__ ((optimize (0)))
+f0_init (int8_t *__restrict x, int8_t *__restrict x2, int16_t *__restrict y,
+	 int16_t *__restrict y2, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 4, j += 8)
+    {
+      x[i + 0] = i % 120;
+      x[i + 1] = i % 78;
+      x[i + 2] = i % 55;
+      x[i + 3] = i % 27;
+      y[j + 0] = j % 33;
+      y[j + 1] = j % 44;
+      y[j + 2] = j % 66;
+      y[j + 3] = j % 88;
+      y[j + 4] = j % 99;
+      y[j + 5] = j % 39;
+      y[j + 6] = j % 49;
+      y[j + 7] = j % 101;
+
+      x2[i + 0] = i % 120;
+      x2[i + 1] = i % 78;
+      x2[i + 2] = i % 55;
+      x2[i + 3] = i % 27;
+      y2[j + 0] = j % 33;
+      y2[j + 1] = j % 44;
+      y2[j + 2] = j % 66;
+      y2[j + 3] = j % 88;
+      y2[j + 4] = j % 99;
+      y2[j + 5] = j % 39;
+      y2[j + 6] = j % 49;
+      y2[j + 7] = j % 101;
+    }
+}
+
+void __attribute__ ((optimize (0)))
+f0_golden (int8_t *__restrict x, int16_t *__restrict y, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 4, j += 8)
+    {
+      x[i + 0] += 1;
+      x[i + 1] += 2;
+      x[i + 2] += 3;
+      x[i + 3] += 4;
+      y[j + 0] += 1;
+      y[j + 1] += 2;
+      y[j + 2] += 3;
+      y[j + 3] += 4;
+      y[j + 4] += 5;
+      y[j + 5] += 6;
+      y[j + 6] += 7;
+      y[j + 7] += 8;
+    }
+}
+
+void __attribute__ ((optimize (0)))
+f0_check (int8_t *__restrict x, int8_t *__restrict x2, int16_t *__restrict y,
+	  int16_t *__restrict y2, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 4, j += 8)
+    {
+      if (x[i + 0] != x2[i + 0])
+	__builtin_abort ();
+      if (x[i + 1] != x2[i + 1])
+	__builtin_abort ();
+      if (x[i + 2] != x2[i + 2])
+	__builtin_abort ();
+      if (x[i + 3] != x2[i + 3])
+	__builtin_abort ();
+      if (y[j + 0] != y2[j + 0])
+	__builtin_abort ();
+      if (y[j + 1] != y2[j + 1])
+	__builtin_abort ();
+      if (y[j + 2] != y2[j + 2])
+	__builtin_abort ();
+      if (y[j + 3] != y2[j + 3])
+	__builtin_abort ();
+      if (y[j + 4] != y2[j + 4])
+	__builtin_abort ();
+      if (y[j + 5] != y2[j + 5])
+	__builtin_abort ();
+      if (y[j + 6] != y2[j + 6])
+	__builtin_abort ();
+      if (y[j + 7] != y2[j + 7])
+	__builtin_abort ();
+    }
+}
+
+void __attribute__ ((noinline, noclone))
+f1 (int16_t *__restrict x, int32_t *__restrict y, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 2, j += 4)
+    {
+      x[i + 0] += 1;
+      x[i + 1] += 2;
+      y[j + 0] += 1;
+      y[j + 1] += 2;
+      y[j + 2] += 3;
+      y[j + 3] += 4;
+    }
+}
+
+void __attribute__ ((optimize (0)))
+f1_init (int16_t *__restrict x, int16_t *__restrict x2, int32_t *__restrict y,
+	 int32_t *__restrict y2, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 2, j += 4)
+    {
+      x[i + 0] = i % 67;
+      x[i + 1] = i % 76;
+      y[j + 0] = j % 111;
+      y[j + 1] = j % 63;
+      y[j + 2] = j % 39;
+      y[j + 3] = j % 8;
+
+      x2[i + 0] = i % 67;
+      x2[i + 1] = i % 76;
+      y2[j + 0] = j % 111;
+      y2[j + 1] = j % 63;
+      y2[j + 2] = j % 39;
+      y2[j + 3] = j % 8;
+    }
+}
+
+void __attribute__ ((optimize (0)))
+f1_golden (int16_t *__restrict x, int32_t *__restrict y, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 2, j += 4)
+    {
+      x[i + 0] += 1;
+      x[i + 1] += 2;
+      y[j + 0] += 1;
+      y[j + 1] += 2;
+      y[j + 2] += 3;
+      y[j + 3] += 4;
+    }
+}
+
+void __attribute__ ((optimize (0)))
+f1_check (int16_t *__restrict x, int16_t *__restrict x2, int32_t *__restrict y,
+	  int32_t *__restrict y2, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 2, j += 4)
+    {
+      if (x[i + 0] != x2[i + 0])
+	__builtin_abort ();
+      if (x[i + 1] != x2[i + 1])
+	__builtin_abort ();
+      if (y[j + 0] != y2[j + 0])
+	__builtin_abort ();
+      if (y[j + 1] != y2[j + 1])
+	__builtin_abort ();
+      if (y[j + 2] != y2[j + 2])
+	__builtin_abort ();
+      if (y[j + 3] != y2[j + 3])
+	__builtin_abort ();
+    }
+}
+
+void __attribute__ ((noinline, noclone))
+f2 (int32_t *__restrict x, int64_t *__restrict y, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 1, j += 2)
+    {
+      x[i + 0] += 1;
+      y[j + 0] += 1;
+      y[j + 1] += 2;
+    }
+}
+
+void __attribute__ ((optimize (0)))
+f2_init (int32_t *__restrict x, int32_t *__restrict x2, int64_t *__restrict y,
+	 int64_t *__restrict y2, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 1, j += 2)
+    {
+      x[i + 0] = i % 79;
+      y[j + 0] = j % 83;
+      y[j + 1] = j % 100;
+
+      x2[i + 0] = i % 79;
+      y2[j + 0] = j % 83;
+      y2[j + 1] = j % 100;
+    }
+}
+
+void __attribute__ ((optimize (0)))
+f2_golden (int32_t *__restrict x, int64_t *__restrict y, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 1, j += 2)
+    {
+      x[i + 0] += 1;
+      y[j + 0] += 1;
+      y[j + 1] += 2;
+    }
+}
+
+void __attribute__ ((noinline, noclone))
+f2_check (int32_t *__restrict x, int32_t *__restrict x2, int64_t *__restrict y,
+	  int64_t *__restrict y2, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 1, j += 2)
+    {
+      if (x[i + 0] != x2[i + 0])
+	__builtin_abort ();
+      if (y[j + 0] != y2[j + 0])
+	__builtin_abort ();
+      if (y[j + 1] != y2[j + 1])
+	__builtin_abort ();
+    }
+}
+
+void __attribute__ ((noinline, noclone))
+f3 (int8_t *__restrict x, int64_t *__restrict y, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 1, j += 2)
+    {
+      x[i + 0] += 1;
+      y[j + 0] += 1;
+      y[j + 1] += 2;
+    }
+}
+
+void __attribute__ ((noinline, noclone))
+f3_init (int8_t *__restrict x, int8_t *__restrict x2, int64_t *__restrict y,
+    int64_t *__restrict y2, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 1, j += 2)
+    {
+      x[i + 0] = i % 22;
+      y[j + 0] = i % 12;
+      y[j + 1] = i % 21;
+
+      x2[i + 0] = i % 22;
+      y2[j + 0] = i % 12;
+      y2[j + 1] = i % 21;
+    }
+}
+
+void __attribute__ ((optimize (0)))
+f3_golden (int8_t *__restrict x, int64_t *__restrict y, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 1, j += 2)
+    {
+      x[i + 0] += 1;
+      y[j + 0] += 1;
+      y[j + 1] += 2;
+    }
+}
+
+void __attribute__ ((noinline, noclone))
+f3_check (int8_t *__restrict x, int8_t *__restrict x2, int64_t *__restrict y,
+	  int64_t *__restrict y2, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 1, j += 2)
+    {
+      if (x[i + 0] != x2[i + 0])
+	__builtin_abort ();
+      if (y[j + 0] != y2[j + 0])
+	__builtin_abort ();
+      if (y[j + 1] != y2[j + 1])
+	__builtin_abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-4.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-4.c
new file mode 100644
index 00000000000..e87961e49ac
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup-4.c
@@ -0,0 +1,75 @@ 
+/* { dg-do compile } */
+/* { dg-additional-options "-march=rv32gcv -mabi=ilp32d --param riscv-autovec-preference=fixed-vlmax" } */
+
+#include <stdint-gcc.h>
+
+void __attribute__ ((noinline, noclone))
+f (uint64_t *__restrict x, uint16_t *__restrict y, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 2, j += 4)
+    {
+      x[i + 0] += 1;
+      x[i + 1] += 2;
+      y[j + 0] += 1;
+      y[j + 1] += 2;
+      y[j + 2] += 3;
+      y[j + 3] += 4;
+    }
+}
+
+void __attribute__ ((optimize (0)))
+f_init (uint64_t *__restrict x, uint64_t *__restrict x2, uint16_t *__restrict y,
+	uint16_t *__restrict y2, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 2, j += 4)
+    {
+      x[i + 0] = i * 897 + 189;
+      x[i + 1] = i * 79 + 55963;
+      y[j + 0] = j * 18 + 78642;
+      y[j + 1] = j * 9 + 8634;
+      y[j + 2] = j * 78 + 2588;
+      y[j + 3] = j * 38 + 8932;
+  
+      x2[i + 0] = i * 897 + 189;
+      x2[i + 1] = i * 79 + 55963;
+      y2[j + 0] = j * 18 + 78642;
+      y2[j + 1] = j * 9 + 8634;
+      y2[j + 2] = j * 78 + 2588;
+      y2[j + 3] = j * 38 + 8932;
+    }
+}
+
+void __attribute__ ((optimize (0)))
+f_golden (uint64_t *__restrict x, uint16_t *__restrict y, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 2, j += 4)
+    {
+      x[i + 0] += 1;
+      x[i + 1] += 2;
+      y[j + 0] += 1;
+      y[j + 1] += 2;
+      y[j + 2] += 3;
+      y[j + 3] += 4;
+    }
+}
+
+void __attribute__ ((optimize (0)))
+f_check (uint64_t *__restrict x, uint64_t *__restrict x2,
+	 uint16_t *__restrict y, uint16_t *__restrict y2, int n)
+{
+  for (int i = 0, j = 0; i < n; i += 2, j += 4)
+    {
+      if (x[i + 0] != x2[i + 0])
+	__builtin_abort ();
+      if (x[i + 1] != x2[i + 1])
+	__builtin_abort ();
+      if (y[j + 0] != y2[j + 0])
+	__builtin_abort ();
+      if (y[j + 1] != y2[j + 1])
+	__builtin_abort ();
+      if (y[j + 2] != y2[j + 2])
+	__builtin_abort ();
+      if (y[j + 3] != y2[j + 3])
+	__builtin_abort ();
+    }
+}
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c
new file mode 100644
index 00000000000..b786738ce99
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-3.c
@@ -0,0 +1,36 @@ 
+/* { dg-do run { target { riscv_vector } } } */
+/* { dg-additional-options "--param riscv-autovec-preference=fixed-vlmax" } */
+
+#include "multiple_rgroup-3.c"
+
+int __attribute__ ((optimize (0))) main (void)
+{
+  int8_t f0_x[3108], f0_x2[3108];
+  int16_t f0_y[6216], f0_y2[6216];
+  f0_init (f0_x, f0_x2, f0_y, f0_y2, 3108);
+  f0 (f0_x, f0_y, 3108);
+  f0_golden (f0_x2, f0_y2, 3108);
+  f0_check (f0_x, f0_x2, f0_y, f0_y2, 3108);
+
+  int16_t f1_x[1998], f1_x2[1998];
+  int32_t f1_y[3996], f1_y2[3996];
+  f1_init (f1_x, f1_x2, f1_y, f1_y2, 1998);
+  f1 (f1_x, f1_y, 1998);
+  f1_golden (f1_x2, f1_y2, 1998);
+  f1_check (f1_x, f1_x2, f1_y, f1_y2, 1998);
+
+  int32_t f2_x[2023], f2_x2[2023];
+  int64_t f2_y[4046], f2_y2[4046];
+  f2_init (f2_x, f2_x2, f2_y, f2_y2, 2023);
+  f2 (f2_x, f2_y, 2023);
+  f2_golden (f2_x2, f2_y2, 2023);
+  f2_check (f2_x, f2_x2, f2_y, f2_y2, 2023);
+
+  int8_t f3_x[3203], f3_x2[3203];
+  int64_t f3_y[6406], f3_y2[6406];
+  f3_init (f3_x, f3_x2, f3_y, f3_y2, 3203);
+  f3 (f3_x, f3_y, 3203);
+  f3_golden (f3_x2, f3_y2, 3203);
+  f3_check (f3_x, f3_x2, f3_y, f3_y2, 3203);
+  return 0;
+}
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-4.c b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-4.c
new file mode 100644
index 00000000000..7751384183e
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/autovec/partial/multiple_rgroup_run-4.c
@@ -0,0 +1,15 @@ 
+/* { dg-do run { target { riscv_vector } } } */
+/* { dg-additional-options "--param riscv-autovec-preference=fixed-vlmax" } */
+
+#include "multiple_rgroup-4.c"
+
+int __attribute__ ((optimize (0))) main (void)
+{
+  uint64_t f_x[3108], f_x2[3108];
+  uint16_t f_y[6216], f_y2[6216];
+  f_init (f_x, f_x2, f_y, f_y2, 3108);
+  f (f_x, f_y, 3108);
+  f_golden (f_x2, f_y2, 3108);
+  f_check (f_x, f_x2, f_y, f_y2, 3108);
+  return 0;
+}
diff --git a/gcc/tree-vect-loop-manip.cc b/gcc/tree-vect-loop-manip.cc
index ff6159e08d5..f9d92ced982 100644
--- a/gcc/tree-vect-loop-manip.cc
+++ b/gcc/tree-vect-loop-manip.cc
@@ -468,6 +468,38 @@  vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
   gimple_stmt_iterator incr_gsi;
   bool insert_after;
   standard_iv_increment_position (loop, &incr_gsi, &insert_after);
+  if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo))
+    {
+      /* single rgroup:
+	 ...
+	 _10 = (unsigned long) count_12(D);
+	 ...
+	 # ivtmp_9 = PHI <ivtmp_35(6), _10(5)>
+	 _36 = MIN_EXPR <ivtmp_9, POLY_INT_CST [4, 4]>;
+	 ...
+	 vect__4.8_28 = .LEN_LOAD (_17, 32B, _36, 0);
+	 ...
+	 ivtmp_35 = ivtmp_9 - _36;
+	 ...
+	 if (ivtmp_35 != 0)
+	   goto <bb 4>; [83.33%]
+	 else
+	   goto <bb 5>; [16.67%]
+      */
+      nitems_total = gimple_convert (preheader_seq, iv_type, nitems_total);
+      tree step = rgc->controls.length () == 1 ? rgc->controls[0]
+					       : make_ssa_name (iv_type);
+      /* Create decrement IV.  */
+      create_iv (nitems_total, MINUS_EXPR, step, NULL_TREE, loop, &incr_gsi,
+		 insert_after, &index_before_incr, &index_after_incr);
+      gimple_seq_add_stmt (header_seq, gimple_build_assign (step, MIN_EXPR,
+							    index_before_incr,
+							    nitems_step));
+      LOOP_VINFO_DECREMENTING_IV_STEP (loop_vinfo) = step;
+      return index_after_incr;
+    }
+
+  /* Create increment IV.  */
   create_iv (build_int_cst (iv_type, 0), PLUS_EXPR, nitems_step, NULL_TREE,
 	     loop, &incr_gsi, insert_after, &index_before_incr,
 	     &index_after_incr);
@@ -683,6 +715,63 @@  vect_set_loop_controls_directly (class loop *loop, loop_vec_info loop_vinfo,
   return next_ctrl;
 }
 
+/* Adjust the loop lens for multiple rgroups.
+
+     _36 = MIN_EXPR <ivtmp_34, VF>;
+
+     First length (MIN (X, VF/N)):
+       loop_len_15 = MIN_EXPR <_36, VF/N>;
+
+     Second length:
+       tmp = _36 - loop_len_15;
+       loop_len_16 = MIN (tmp, VF/N);
+
+     Third length:
+       tmp2 = tmp - loop_len_16;
+       loop_len_17 = MIN (tmp2, VF/N);
+
+     Last length:
+       loop_len_18 = tmp2 - loop_len_17;
+*/
+
+static void
+vect_adjust_loop_lens_control (tree iv_type, gimple_seq *seq,
+			       rgroup_controls *dest_rgm, tree step)
+{
+  tree ctrl_type = dest_rgm->type;
+  poly_uint64 nitems_per_ctrl
+    = TYPE_VECTOR_SUBPARTS (ctrl_type) * dest_rgm->factor;
+  tree length_limit = build_int_cst (iv_type, nitems_per_ctrl);
+
+  for (unsigned int i = 0; i < dest_rgm->controls.length (); ++i)
+    {
+      tree ctrl = dest_rgm->controls[i];
+      if (i == 0)
+	{
+	  /* First iteration: MIN (X, VF/N) capped to the range [0, VF/N].  */
+	  gassign *assign
+	    = gimple_build_assign (ctrl, MIN_EXPR, step, length_limit);
+	  gimple_seq_add_stmt (seq, assign);
+	}
+      else if (i == dest_rgm->controls.length () - 1)
+	{
+	  /* Last iteration: the remainder, which is in the range [0, VF/N].  */
+	  gassign *assign = gimple_build_assign (ctrl, MINUS_EXPR, step,
+						 dest_rgm->controls[i - 1]);
+	  gimple_seq_add_stmt (seq, assign);
+	}
+      else
+	{
+	  /* MIN (remain, VF/N), which is in the range [0, VF/N].  */
+	  step = gimple_build (seq, MINUS_EXPR, iv_type, step,
+			       dest_rgm->controls[i - 1]);
+	  gassign *assign
+	    = gimple_build_assign (ctrl, MIN_EXPR, step, length_limit);
+	  gimple_seq_add_stmt (seq, assign);
+	}
+    }
+}
+
 /* Set up the iteration condition and rgroup controls for LOOP, given
    that LOOP_VINFO_USING_PARTIAL_VECTORS_P is true for the vectorized
    loop.  LOOP_VINFO describes the vectorization of LOOP.  NITERS is
@@ -764,6 +853,70 @@  vect_set_loop_condition_partial_vectors (class loop *loop,
 						     loop_cond_gsi, rgc,
 						     niters, niters_skip,
 						     might_wrap_p);
+
+	/* A decrementing IV only runs vect_set_loop_controls_directly once.  */
+	if (LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo)
+	    && rgc->controls.length () > 1)
+	  {
+	    /*
+	       - Multiple rgroup (SLP):
+		 ...
+		 _38 = (unsigned long) bnd.7_29;
+		 _39 = _38 * 2;
+		 ...
+		 # ivtmp_41 = PHI <ivtmp_42(6), _39(5)>
+		 ...
+		 _43 = MIN_EXPR <ivtmp_41, 32>;
+		 loop_len_26 = MIN_EXPR <_43, 16>;
+		 loop_len_25 = _43 - loop_len_26;
+		 ...
+		 .LEN_STORE (_6, 8B, loop_len_26, ...);
+		 ...
+		 .LEN_STORE (_25, 8B, loop_len_25, ...);
+		 _33 = loop_len_26 / 2;
+		 ...
+		 .LEN_STORE (_8, 16B, _33, ...);
+		 _36 = loop_len_25 / 2;
+		 ...
+		 .LEN_STORE (_15, 16B, _36, ...);
+		 ivtmp_42 = ivtmp_41 - _43;
+		 ...
+
+	       - Multiple rgroup (non-SLP):
+		 ...
+		 _38 = (unsigned long) n_12(D);
+		 ...
+		 # ivtmp_38 = PHI <ivtmp_39(3), 100(2)>
+		 ...
+		 loop_len_38 = MIN_EXPR <ivtmp_41, POLY_INT_CST [8, 8]>;
+		 _43 = MIN_EXPR <ivtmp_44, POLY_INT_CST [8, 8]>;
+		 loop_len_24 = MIN_EXPR <_43, POLY_INT_CST [2, 2]>;
+		 _46 = _43 - loop_len_24;
+		 loop_len_23 = MIN_EXPR <_46, POLY_INT_CST [2, 2]>;
+		 _47 = _46 - loop_len_23;
+		 loop_len_22 = MIN_EXPR <_47, POLY_INT_CST [2, 2]>;
+		 loop_len_21 = _47 - loop_len_22;
+		 ...
+		 vect__4.8_17 = .LEN_LOAD (_6, 64B, loop_len_24, 0);
+		 ...
+		 vect__4.9_9 = .LEN_LOAD (_49, 64B, loop_len_23, 0);
+		 ...
+		 vect__4.10_30 = .LEN_LOAD (_52, 64B, loop_len_22, 0);
+		 ...
+		 vect__4.11_32 = .LEN_LOAD (_55, 64B, loop_len_21, 0);
+		 vect__7.13_31 = VEC_PACK_TRUNC_EXPR <...>;
+		 vect__7.13_32 = VEC_PACK_TRUNC_EXPR <...>;
+		 vect__7.12_33 = VEC_PACK_TRUNC_EXPR <...>;
+		 ...
+		 .LEN_STORE (_14, 16B, _40, vect__7.12_33, 0);
+		 ivtmp_39 = ivtmp_38 - _40;
+		 ...
+	    */
+	    tree iv_type = LOOP_VINFO_RGROUP_IV_TYPE (loop_vinfo);
+	    tree step = LOOP_VINFO_DECREMENTING_IV_STEP (loop_vinfo);
+	    gcc_assert (step);
+	    vect_adjust_loop_lens_control (iv_type, &header_seq, rgc, step);
+	  }
       }
 
   /* Emit all accumulated statements.  */
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index cf10132b0bf..456f50fa7cc 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -973,6 +973,8 @@  _loop_vec_info::_loop_vec_info (class loop *loop_in, vec_info_shared *shared)
     vectorizable (false),
     can_use_partial_vectors_p (param_vect_partial_vector_usage != 0),
     using_partial_vectors_p (false),
+    using_decrementing_iv_p (false),
+    decrementing_iv_step (NULL_TREE),
     epil_using_partial_vectors_p (false),
     partial_load_store_bias (0),
     peeling_for_gaps (false),
@@ -2725,6 +2727,17 @@  start_over:
       && !vect_verify_loop_lens (loop_vinfo))
     LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo) = false;
 
+  /* If we're vectorizing a loop that uses length "controls" and
+     can iterate more than once, we apply the decrementing IV approach
+     in loop control.  */
+  if (LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo)
+      && !LOOP_VINFO_LENS (loop_vinfo).is_empty ()
+      && LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS (loop_vinfo) == 0
+      && !(LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
+	   && known_le (LOOP_VINFO_INT_NITERS (loop_vinfo),
+			LOOP_VINFO_VECT_FACTOR (loop_vinfo))))
+    LOOP_VINFO_USING_DECREMENTING_IV_P (loop_vinfo) = true;
+
   /* If we're vectorizing an epilogue loop, the vectorized loop either needs
      to be able to handle fewer than VF scalars, or needs to have a lower VF
      than the main loop.  */
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index 02d2ad6fba1..7ed079f543a 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -818,6 +818,16 @@  public:
      the vector loop can handle fewer than VF scalars.  */
   bool using_partial_vectors_p;
 
+  /* True if we've decided to use a decrementing loop control IV that counts
+     scalars. This can be done for any loop that:
+
+	(a) uses length "controls"; and
+	(b) can iterate more than once.  */
+  bool using_decrementing_iv_p;
+
+  /* The variable-amount step for the decrementing IV.  */
+  tree decrementing_iv_step;
+
   /* True if we've decided to use partially-populated vectors for the
      epilogue of loop.  */
   bool epil_using_partial_vectors_p;
@@ -890,6 +900,8 @@  public:
 #define LOOP_VINFO_VECTORIZABLE_P(L)       (L)->vectorizable
 #define LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P(L) (L)->can_use_partial_vectors_p
 #define LOOP_VINFO_USING_PARTIAL_VECTORS_P(L) (L)->using_partial_vectors_p
+#define LOOP_VINFO_USING_DECREMENTING_IV_P(L) (L)->using_decrementing_iv_p
+#define LOOP_VINFO_DECREMENTING_IV_STEP(L) (L)->decrementing_iv_step
 #define LOOP_VINFO_EPIL_USING_PARTIAL_VECTORS_P(L)                             \
   (L)->epil_using_partial_vectors_p
 #define LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS(L) (L)->partial_load_store_bias
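
As a rough illustration of the arithmetic that the statements emitted by vect_adjust_loop_lens_control perform at run time, the following standalone C sketch splits one iteration's step across several controls and then drives a decrementing IV with it. It is not GCC code: `min_ul`, `split_step` and `run_decrementing_iv` are hypothetical names, VF and the per-control limit are plain integers instead of poly_ints, and the limit is assumed to be VF divided by the number of controls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a GCC function.  */
static unsigned long
min_ul (unsigned long a, unsigned long b)
{
  return a < b ? a : b;
}

/* Split STEP (the scalar items handled by one vector iteration) into
   NCTRLS lengths, each at most LIMIT, in the same order as the patch:
   first = MIN (step, limit); middle = MIN (remain, limit);
   last = remain - previous length.  */
static void
split_step (unsigned long step, unsigned long limit,
	    unsigned long *lens, size_t nctrls)
{
  assert (nctrls > 1 && step <= nctrls * limit);
  unsigned long remain = step;
  for (size_t i = 0; i < nctrls; ++i)
    {
      if (i == 0)
	lens[i] = min_ul (remain, limit);
      else if (i == nctrls - 1)
	lens[i] = remain - lens[i - 1];
      else
	{
	  remain -= lens[i - 1];
	  lens[i] = min_ul (remain, limit);
	}
    }
}

/* Count down from NITEMS by MIN (iv, VF) per iteration.  Returns the
   number of vector iterations and stores the sum of all lengths in
   *COVERED.  */
static unsigned long
run_decrementing_iv (unsigned long nitems, unsigned long vf,
		     size_t nctrls, unsigned long *covered)
{
  unsigned long lens[8];
  assert (nitems > 0 && nctrls > 1 && nctrls <= 8 && vf % nctrls == 0);
  unsigned long limit = vf / nctrls;
  unsigned long iv = nitems, iters = 0;
  *covered = 0;
  do
    {
      unsigned long step = min_ul (iv, vf);
      split_step (step, limit, lens, nctrls);
      for (size_t i = 0; i < nctrls; ++i)
	*covered += lens[i];
      iv -= step;
      ++iters;
    }
  while (iv != 0);
  return iters;
}
```

For a step of 5 with four controls and a limit of 2, `split_step` produces the lengths 2, 2, 1, 0. Running `run_decrementing_iv` with 100 items, VF = 8 and four controls takes 13 iterations, and the lengths sum to 100, so every scalar item is covered exactly once.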