diff mbox series

[V2] RISC-V: Rework Phase 5 && Phase 6 of VSETVL PASS

Message ID 20230609104105.9100-1-juzhe.zhong@rivai.ai
State New
Headers show
Series [V2] RISC-V: Rework Phase 5 && Phase 6 of VSETVL PASS | expand

Commit Message

juzhe.zhong@rivai.ai June 9, 2023, 10:41 a.m. UTC
From: Juzhe-Zhong <juzhe.zhong@rivai.ai>

This patch reworks Phase 5 && Phase 6 of the VSETVL PASS, since both phases
are quite messy and cause bugs discovered by my downstream auto-vectorization
test generator.

Before this patch:

Phase 5 (cleanup_insns) removes the AVL operand dependency from each RVV instruction.
E.g. vadd.vv (use a5) becomes, after Phase 5, vadd.vv (use const_int 0). Since "a5" is used by the "vsetvl" instructions,
once the correct "vsetvl" instructions are inserted, each RVV instruction no longer needs the AVL operand "a5". Removing
this operand dependency then helps the following scheduling pass.

Phase 6 (propagate_avl) does the following 2 things:
1. Local && global user vsetvl instruction optimization.
   E.g.
      vsetvli a2, a2, e8, mf8   ======> Change it into vsetvli a2, a2, e32, mf2
      vsetvli zero,a2, e32, mf2  ======> eliminate
2. Optimize a user vsetvl from "vsetvl a2,a2" into "vsetvl zero,a2" if "a2" is not used by any instruction.
Since Phase 1 ~ Phase 4 insert "vsetvli" instructions based on LCM, which changes the CFG, I re-initialize a new
RTL_SSA framework (which is more expensive than just using DF) for Phase 6 and optimize user vsetvlis based on the new RTL_SSA.

There are 2 issues in Phase 5 && Phase 6:
1. local_eliminate_vsetvl_insn, introduced by @kito, does local user vsetvl optimizations better than
   Phase 6 does, and its approach doesn't need to re-initialize the RTL_SSA framework. So the local user vsetvli optimization
   in Phase 6 is redundant and should be removed.
2. A bug discovered by my downstream auto-vectorization test generator (I can't put the test in this patch since we are missing the autovec
   patterns for it, so upstream GCC can't directly reproduce the issue, but I will remember to put it back after I support the
   necessary autovec patterns). The bug is caused by the re-initialized RTL_SSA framework. The issue is this:
   
Before Phase 6:
   ...
   insn1: vsetvli a3, 17 <========== generated by SELECT_VL auto-vec pattern.
   slli a4,a3,3
   ...
   insn2: vsetvli zero, a3, ... 
   load (use const_int 0, before Phase 5, it's using a3, but the use of "a3" is removed in Phase 5)
   ...

In Phase 6, we iterate to insn2, then get the def of "a3", which is insn1.
insn2 is the vsetvli instruction inserted in Phase 4, which is not included in the RTL_SSA framework
even though we renew it (I didn't take a close look at why, and I don't think we need to now).
In this situation, the def_info of insn2 reports "set->single_nondebug_insn_use ()"
as true. Obviously, this information is not correct, since insn1 has at least 2 uses:
1) slli a4,a3,3; 2) insn2: vsetvli zero, a3, ... As a result, the execution test generated by my
downstream test generator failed.

Conclusion about the RTL_SSA framework:
Before this patch, we initialize RTL_SSA twice. The first, at the beginning of the VSETVL PASS, is absolutely correct; the second,
re-initialized after Phase 4 (LCM), has incorrect information that causes bugs.

Besides, initializing RTL_SSA a second time seems wasteful, since we only need to do a little optimization.

Based on all the circumstances described above, I rework and reorganize Phase 5 && Phase 6 as follows:
1. Phase 5 is called ssa_post_optimization and does its optimization based on the RTL_SSA information (the RTL_SSA is initialized
   at the beginning of the VSETVL PASS; no need to re-initialize it). This phase includes 3 optimizations:
   1). local_eliminate_vsetvl_insn, which we already have (no change).
   2). global_eliminate_vsetvl_insn ---> a new optimization split out of the original Phase 6, but with a more powerful and reliable implementation.
      E.g. 
      void f(int8_t *base, int8_t *out, size_t vl, size_t m, size_t k) {
        size_t avl;
        if (m > 100)
          avl = __riscv_vsetvl_e16mf4(vl << 4);
        else
          avl = __riscv_vsetvl_e32mf2(vl >> 8);
        for (size_t i = 0; i < m; i++) {
          vint8mf8_t v0 = __riscv_vle8_v_i8mf8(base + i, avl);
          v0 = __riscv_vadd_vv_i8mf8 (v0, v0, avl);
          __riscv_vse8_v_i8mf8(out + i, v0, avl);
        }
      }

      Before this patch, this example failed the global user vsetvl optimization:
      f:
              li      a5,100
              bleu    a3,a5,.L2
              slli    a2,a2,4
              vsetvli a4,a2,e16,mf4,ta,mu
      .L3:
              li      a5,0
              vsetvli zero,a4,e8,mf8,ta,ma
      .L5:
              add     a6,a0,a5
              add     a2,a1,a5
              vle8.v  v1,0(a6)
              addi    a5,a5,1
              vadd.vv v1,v1,v1
              vse8.v  v1,0(a2)
              bgtu    a3,a5,.L5
      .L10:
              ret
      .L2:
              beq     a3,zero,.L10
              srli    a2,a2,8
              vsetvli a4,a2,e32,mf2,ta,mu
              j       .L3
      With this patch:
      f:
              li      a5,100
              bleu    a3,a5,.L2
              slli    a2,a2,4
              vsetvli zero,a2,e8,mf8,ta,ma
      .L3:
              li      a5,0
      .L5:
              add     a6,a0,a5
              add     a2,a1,a5
              vle8.v  v1,0(a6)
              addi    a5,a5,1
              vadd.vv v1,v1,v1
              vse8.v  v1,0(a2)
              bgtu    a3,a5,.L5
      .L10:
              ret
      .L2:
              beq     a3,zero,.L10
              srli    a2,a2,8
              vsetvli zero,a2,e8,mf8,ta,ma
              j       .L3

   3). Remove the AVL operand dependency of each RVV instruction.

2. Phase 6 is called df_post_optimization: optimize "vsetvl a3,a2,..." into "vsetvl zero,a2,..." based on
   dataflow analysis of the new CFG (the new CFG is created by LCM). The reason we need the new CFG and must run after Phase 5:
   ...
   vsetvl a3, a2...
   vadd.vv (use a3)
   If we didn't have Phase 5, which removes the "a3" use in vadd.vv, we would fail to optimize vsetvl a3,a2 into vsetvl zero,a2.
   
   This patch passed all tests in rvv.exp with ONLY performance && codegen improved (no performance decline and no bugs, including in my
   downstream tests).

gcc/ChangeLog:

        * config/riscv/riscv-vsetvl.cc (available_occurrence_p): Enhance user vsetvl optimization.
        (vector_insn_info::parse_insn): Add rtx_insn parse.
        (pass_vsetvl::local_eliminate_vsetvl_insn): Enhance user vsetvl optimization.
        (get_first_vsetvl): New function.
        (pass_vsetvl::global_eliminate_vsetvl_insn): Ditto.
        (pass_vsetvl::cleanup_insns): Remove it.
        (pass_vsetvl::ssa_post_optimization): New function.
        (has_no_uses): Ditto.
        (pass_vsetvl::propagate_avl): Remove it.
        (pass_vsetvl::df_post_optimization): New function.
        (pass_vsetvl::lazy_vsetvl): Rework Phase 5 && Phase 6.
        * config/riscv/riscv-vsetvl.h: Adapt declaration.

gcc/testsuite/ChangeLog:

        * gcc.target/riscv/rvv/vsetvl/vsetvl-16.c: Adapt test.
        * gcc.target/riscv/rvv/vsetvl/vsetvl-2.c: Ditto.
        * gcc.target/riscv/rvv/vsetvl/vsetvl-3.c: Ditto.
        * gcc.target/riscv/rvv/vsetvl/vsetvl-21.c: New test.
        * gcc.target/riscv/rvv/vsetvl/vsetvl-22.c: New test.
        * gcc.target/riscv/rvv/vsetvl/vsetvl-23.c: New test.

---
 gcc/config/riscv/riscv-vsetvl.cc              | 400 +++++++++++-------
 gcc/config/riscv/riscv-vsetvl.h               |  34 +-
 .../gcc.target/riscv/rvv/vsetvl/vsetvl-16.c   |   2 +-
 .../gcc.target/riscv/rvv/vsetvl/vsetvl-2.c    |   2 +-
 .../gcc.target/riscv/rvv/vsetvl/vsetvl-21.c   |  21 +
 .../gcc.target/riscv/rvv/vsetvl/vsetvl-22.c   |  21 +
 .../gcc.target/riscv/rvv/vsetvl/vsetvl-23.c   |  37 ++
 .../gcc.target/riscv/rvv/vsetvl/vsetvl-3.c    |   2 +-
 8 files changed, 366 insertions(+), 153 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-21.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-22.c
 create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-23.c

Comments

Kito Cheng June 9, 2023, 10:45 a.m. UTC | #1
Thanks for sending this before the weekend; I can run the fuzz testing
over the weekend :P

On Fri, Jun 9, 2023 at 6:41 PM <juzhe.zhong@rivai.ai> wrote:
>
> [... commit message quoted above trimmed ...]
>
> diff --git a/gcc/config/riscv/riscv-vsetvl.cc b/gcc/config/riscv/riscv-vsetvl.cc
> index fe55f4ccd30..924a94adf9c 100644
> --- a/gcc/config/riscv/riscv-vsetvl.cc
> +++ b/gcc/config/riscv/riscv-vsetvl.cc
> @@ -395,10 +395,15 @@ available_occurrence_p (const bb_info *bb, const vector_insn_info dem)
>        if (!vlmax_avl_p (dem.get_avl ()))
>         {
>           rtx dest = NULL_RTX;
> +         insn_info *i = insn;
>           if (vsetvl_insn_p (insn->rtl ()))
> -           dest = get_vl (insn->rtl ());
> -         for (const insn_info *i = insn; real_insn_and_same_bb_p (i, bb);
> -              i = i->next_nondebug_insn ())
> +           {
> +             dest = get_vl (insn->rtl ());
> +             /* For user vsetvl a2, a2 instruction, we consider it as
> +                available even though it modifies "a2".  */
> +             i = i->next_nondebug_insn ();
> +           }
> +         for (; real_insn_and_same_bb_p (i, bb); i = i->next_nondebug_insn ())
>             {
>               if (read_vl_insn_p (i->rtl ()))
>                 continue;
> @@ -1893,11 +1898,13 @@ vector_insn_info::parse_insn (rtx_insn *rinsn)
>    *this = vector_insn_info ();
>    if (!NONDEBUG_INSN_P (rinsn))
>      return;
> -  if (!has_vtype_op (rinsn))
> +  if (optimize == 0 && !has_vtype_op (rinsn))
> +    return;
> +  if (optimize > 0 && !vsetvl_insn_p (rinsn))
>      return;
>    m_state = VALID;
>    extract_insn_cached (rinsn);
> -  const rtx avl = recog_data.operand[get_attr_vl_op_idx (rinsn)];
> +  rtx avl = ::get_avl (rinsn);
>    m_avl = avl_info (avl, nullptr);
>    m_sew = ::get_sew (rinsn);
>    m_vlmul = ::get_vlmul (rinsn);
> @@ -2730,10 +2737,11 @@ private:
>    /* Phase 5.  */
>    rtx_insn *get_vsetvl_at_end (const bb_info *, vector_insn_info *) const;
>    void local_eliminate_vsetvl_insn (const bb_info *) const;
> -  void cleanup_insns (void) const;
> +  bool global_eliminate_vsetvl_insn (const bb_info *) const;
> +  void ssa_post_optimization (void) const;
>
>    /* Phase 6.  */
> -  void propagate_avl (void) const;
> +  void df_post_optimization (void) const;
>
>    void init (void);
>    void done (void);
> @@ -4246,7 +4254,7 @@ pass_vsetvl::local_eliminate_vsetvl_insn (const bb_info *bb) const
>
>        /* Local AVL compatibility checking is simpler than global, we only
>          need to check the REGNO is same.  */
> -      if (prev_dem.valid_p () && prev_dem.skip_avl_compatible_p (curr_dem)
> +      if (prev_dem.valid_or_dirty_p () && prev_dem.skip_avl_compatible_p (curr_dem)
>           && local_avl_compatible_p (prev_avl, curr_avl))
>         {
>           /* curr_dem and prev_dem is compatible!  */
> @@ -4277,27 +4285,187 @@ pass_vsetvl::local_eliminate_vsetvl_insn (const bb_info *bb) const
>      }
>  }
>
> -/* Before VSETVL PASS, RVV instructions pattern is depending on AVL operand
> -   implicitly. Since we will emit VSETVL instruction and make RVV instructions
> -   depending on VL/VTYPE global status registers, we remove the such AVL operand
> -   in the RVV instructions pattern here in order to remove AVL dependencies when
> -   AVL operand is a register operand.
> -
> -   Before the VSETVL PASS:
> -     li a5,32
> -     ...
> -     vadd.vv (..., a5)
> -   After the VSETVL PASS:
> -     li a5,32
> -     vsetvli zero, a5, ...
> -     ...
> -     vadd.vv (..., const_int 0).  */
> +/* Get the first vsetvl instructions of the block.  */
> +static rtx_insn *
> +get_first_vsetvl (basic_block cfg_bb)
> +{
> +  rtx_insn *rinsn;
> +  FOR_BB_INSNS (cfg_bb, rinsn)
> +    {
> +      if (!NONDEBUG_INSN_P (rinsn))
> +       continue;
> +      /* If we don't find any inserted vsetvli before user RVV instructions,
> +        we don't need to optimize the vsetvls in this block.  */
> +      if (has_vtype_op (rinsn) || vsetvl_insn_p (rinsn))
> +       return nullptr;
> +
> +      if (vsetvl_discard_result_insn_p (rinsn))
> +       return rinsn;
> +    }
> +  return nullptr;
> +}
> +
> +/* Global user vsetvl optimizaiton:
> +
> +     Case 1:
> +     bb 1:
> +       vsetvl a5,a4,e8,mf8
> +       ...
> +     bb 2:
> +       ...
> +       vsetvl zero,a5,e8,mf8 --> Eliminate directly.
> +
> +     Case 2:
> +      bb 1:
> +       vsetvl a5,a4,e8,mf8    --> vsetvl a5,a4,e32,mf2
> +       ...
> +      bb 2:
> +       ...
> +       vsetvl zero,a5,e32,mf2 --> Eliminate directly.
> +
> +     Case 3:
> +      bb 1:
> +       vsetvl a5,a4,e8,mf8    --> vsetvl a5,a4,e32,mf2
> +       ...
> +      bb 2:
> +       ...
> +       vsetvl a5,a4,e8,mf8    --> vsetvl a5,a4,e32,mf2
> +       goto bb 3
> +      bb 3:
> +       ...
> +       vsetvl zero,a5,e32,mf2 --> Eliminate directly.
> +*/
> +bool
> +pass_vsetvl::global_eliminate_vsetvl_insn (const bb_info *bb) const
> +{
> +  rtx_insn *vsetvl_rinsn;
> +  vector_insn_info dem = vector_insn_info ();
> +  const auto &block_info = get_block_info (bb);
> +  basic_block cfg_bb = bb->cfg_bb ();
> +
> +  if (block_info.local_dem.valid_or_dirty_p ())
> +    {
> +      /* Optimize the local vsetvl.  */
> +      dem = block_info.local_dem;
> +      vsetvl_rinsn = get_first_vsetvl (cfg_bb);
> +    }
> +  if (!vsetvl_rinsn)
> +    /* Optimize the global vsetvl inserted by LCM.  */
> +    vsetvl_rinsn = get_vsetvl_at_end (bb, &dem);
> +
> +  /* No need to optimize if block doesn't have vsetvl instructions.  */
> +  if (!dem.valid_or_dirty_p () || !vsetvl_rinsn || !dem.get_avl_source ()
> +      || !dem.has_avl_reg ())
> +    return false;
> +
> +  /* If all preds has VL/VTYPE status setted by user vsetvls, and these
> +     user vsetvls are all skip_avl_compatible_p with the vsetvl in this
> +     block, we can eliminate this vsetvl instruction.  */
> +  sbitmap avin = m_vector_manager->vector_avin[cfg_bb->index];
> +
> +  unsigned int bb_index;
> +  sbitmap_iterator sbi;
> +  rtx avl = get_avl (dem.get_insn ()->rtl ());
> +  hash_set<set_info *> sets
> +    = get_all_sets (dem.get_avl_source (), true, false, false);
> +  /* Condition 1: All VL/VTYPE available in are all compatible.  */
> +  EXECUTE_IF_SET_IN_BITMAP (avin, 0, bb_index, sbi)
> +    {
> +      const auto &expr = m_vector_manager->vector_exprs[bb_index];
> +      const auto &insn = expr->get_insn ();
> +      def_info *def = find_access (insn->defs (), REGNO (avl));
> +      set_info *set = safe_dyn_cast<set_info *> (def);
> +      if (!vsetvl_insn_p (insn->rtl ()) || insn->bb () == bb
> +         || !sets.contains (set))
> +       return false;
> +    }
> +
> +  /* Condition 2: Check it has preds.  */
> +  if (EDGE_COUNT (cfg_bb->preds) == 0)
> +    return false;
> +
> +  /* Condition 3: We don't do the global optimization for the block
> +     has a pred is entry block or exit block.  */
> +  /* Condition 4: All preds have available VL/VTYPE out.  */
> +  edge e;
> +  edge_iterator ei;
> +  FOR_EACH_EDGE (e, ei, cfg_bb->preds)
> +    {
> +      sbitmap avout = m_vector_manager->vector_avout[e->src->index];
> +      if (e->src == ENTRY_BLOCK_PTR_FOR_FN (cfun)
> +         || e->src == EXIT_BLOCK_PTR_FOR_FN (cfun) || bitmap_empty_p (avout))
> +       return false;
> +
> +      EXECUTE_IF_SET_IN_BITMAP (avout, 0, bb_index, sbi)
> +       {
> +         const auto &expr = m_vector_manager->vector_exprs[bb_index];
> +         const auto &insn = expr->get_insn ();
> +         def_info *def = find_access (insn->defs (), REGNO (avl));
> +         set_info *set = safe_dyn_cast<set_info *> (def);
> +         if (!vsetvl_insn_p (insn->rtl ()) || insn->bb () == bb
> +             || !sets.contains (set) || !expr->skip_avl_compatible_p (dem))
> +           return false;
> +       }
> +    }
> +
> +  /* Step1: Reshape the VL/VTYPE status to make sure everything compatible.  */
> +  hash_set<basic_block> pred_cfg_bbs = get_all_predecessors (cfg_bb);
> +  FOR_EACH_EDGE (e, ei, cfg_bb->preds)
> +    {
> +      sbitmap avout = m_vector_manager->vector_avout[e->src->index];
> +      EXECUTE_IF_SET_IN_BITMAP (avout, 0, bb_index, sbi)
> +       {
> +         vector_insn_info prev_dem = *m_vector_manager->vector_exprs[bb_index];
> +         vector_insn_info curr_dem = dem;
> +         insn_info *insn = prev_dem.get_insn ();
> +         if (!pred_cfg_bbs.contains (insn->bb ()->cfg_bb ()))
> +           continue;
> +         /* Update avl info since we need to make sure they are fully
> +            compatible before merge.  */
> +         curr_dem.set_avl_info (prev_dem.get_avl_info ());
> +         /* Merge both and update into curr_vsetvl.  */
> +         prev_dem = curr_dem.merge (prev_dem, LOCAL_MERGE);
> +         change_vsetvl_insn (insn, prev_dem);
> +       }
> +    }
> +
> +  /* Step2: eliminate the vsetvl instruction.  */
> +  eliminate_insn (vsetvl_rinsn);
> +  return true;
> +}
> +
> +/* This function does the following post optimization base on RTL_SSA:
> +
> +   1. Local user vsetvl optimizations.
> +   2. Global user vsetvl optimizations.
> +   3. AVL dependencies removal:
> +      Before VSETVL PASS, RVV instructions pattern is depending on AVL operand
> +      implicitly. Since we will emit VSETVL instruction and make RVV
> +      instructions depending on VL/VTYPE global status registers, we remove the
> +      such AVL operand in the RVV instructions pattern here in order to remove
> +      AVL dependencies when AVL operand is a register operand.
> +
> +      Before the VSETVL PASS:
> +       li a5,32
> +       ...
> +       vadd.vv (..., a5)
> +      After the VSETVL PASS:
> +       li a5,32
> +       vsetvli zero, a5, ...
> +       ...
> +       vadd.vv (..., const_int 0).  */
>  void
> -pass_vsetvl::cleanup_insns (void) const
> +pass_vsetvl::ssa_post_optimization (void) const
>  {
>    for (const bb_info *bb : crtl->ssa->bbs ())
>      {
>        local_eliminate_vsetvl_insn (bb);
> +      bool changed_p = true;
> +      while (changed_p)
> +       {
> +         changed_p = false;
> +         changed_p |= global_eliminate_vsetvl_insn (bb);
> +       }
>        for (insn_info *insn : bb->real_nondebug_insns ())
>         {
>           rtx_insn *rinsn = insn->rtl ();
> @@ -4342,135 +4510,81 @@ pass_vsetvl::cleanup_insns (void) const
>      }
>  }
>
> +/* Return true if the SET result is not used by any instructions.  */
> +static bool
> +has_no_uses (basic_block cfg_bb, rtx_insn *rinsn, int regno)
> +{
> +  /* Handle the following case that can not be detected in RTL_SSA.  */
> +  /* E.g.
> +         li a5, 100
> +         vsetvli a6, a5...
> +         ...
> +         vadd (use a6)
> +
> +       The use of "a6" is removed from "vadd" but the information is
> +       not updated in RTL_SSA framework. We don't want to re-new
> +       a new RTL_SSA which is expensive, instead, we use data-flow
> +       analysis to check whether "a6" has no uses.  */
> +  if (bitmap_bit_p (df_get_live_out (cfg_bb), regno))
> +    return false;
> +
> +  rtx_insn *iter;
> +  for (iter = NEXT_INSN (rinsn); iter && iter != NEXT_INSN (BB_END (cfg_bb));
> +       iter = NEXT_INSN (iter))
> +    if (df_find_use (iter, regno_reg_rtx[regno]))
> +      return false;
> +
> +  return true;
> +}
> +
> +/* This function does the following post optimization base on dataflow
> +   analysis:
> +
> +   1. Change vsetvl rd, rs1 --> vsevl zero, rs1, if rd is not used by any
> +   nondebug instructions. Even though this PASS runs after RA and it doesn't
> +   help for reduce register pressure, it can help instructions scheduling since
> +   we remove the dependencies.
> +
> +   2. Remove redundant user vsetvls base on outcome of Phase 4 (LCM) && Phase 5
> +   (AVL dependencies removal).  */
>  void
> -pass_vsetvl::propagate_avl (void) const
> -{
> -  /* Rebuild the RTL_SSA according to the new CFG generated by LCM.  */
> -  /* Finalization of RTL_SSA.  */
> -  free_dominance_info (CDI_DOMINATORS);
> -  if (crtl->ssa->perform_pending_updates ())
> -    cleanup_cfg (0);
> -  delete crtl->ssa;
> -  crtl->ssa = nullptr;
> -  /* Initialization of RTL_SSA.  */
> -  calculate_dominance_info (CDI_DOMINATORS);
> +pass_vsetvl::df_post_optimization (void) const
> +{
>    df_analyze ();
> -  crtl->ssa = new function_info (cfun);
> -
>    hash_set<rtx_insn *> to_delete;
> -  for (const bb_info *bb : crtl->ssa->bbs ())
> +  basic_block cfg_bb;
> +  rtx_insn *rinsn;
> +  FOR_ALL_BB_FN (cfg_bb, cfun)
>      {
> -      for (insn_info *insn : bb->real_nondebug_insns ())
> +      FOR_BB_INSNS (cfg_bb, rinsn)
>         {
> -         if (vsetvl_discard_result_insn_p (insn->rtl ()))
> +         if (NONDEBUG_INSN_P (rinsn) && vsetvl_insn_p (rinsn))
>             {
> -             rtx avl = get_avl (insn->rtl ());
> -             if (!REG_P (avl))
> -               continue;
> -
> -             set_info *set = find_access (insn->uses (), REGNO (avl))->def ();
> -             insn_info *def_insn = extract_single_source (set);
> -             if (!def_insn)
> -               continue;
> -
> -             /* Handle this case:
> -                vsetvli        a6,zero,e32,m1,ta,mu
> -                li     a5,4096
> -                add    a7,a0,a5
> -                addi   a7,a7,-96
> -                vsetvli        t1,zero,e8,mf8,ta,ma
> -                vle8.v v24,0(a7)
> -                add    a5,a3,a5
> -                addi   a5,a5,-96
> -                vse8.v v24,0(a5)
> -                vsetvli        zero,a6,e32,m1,tu,ma
> -             */
> -             if (vsetvl_insn_p (def_insn->rtl ()))
> -               {
> -                 vl_vtype_info def_info = get_vl_vtype_info (def_insn);
> -                 vl_vtype_info info = get_vl_vtype_info (insn);
> -                 rtx avl = get_avl (def_insn->rtl ());
> -                 rtx vl = get_vl (def_insn->rtl ());
> -                 if (def_info.get_ratio () == info.get_ratio ())
> -                   {
> -                     if (vlmax_avl_p (def_info.get_avl ()))
> -                       {
> -                         info.set_avl_info (
> -                           avl_info (def_info.get_avl (), nullptr));
> -                         rtx new_pat
> -                           = gen_vsetvl_pat (VSETVL_NORMAL, info, vl);
> -                         validate_change (insn->rtl (),
> -                                          &PATTERN (insn->rtl ()), new_pat,
> -                                          false);
> -                         continue;
> -                       }
> -                     if (def_info.has_avl_imm () || rtx_equal_p (avl, vl))
> -                       {
> -                         info.set_avl_info (avl_info (avl, nullptr));
> -                         emit_vsetvl_insn (VSETVL_DISCARD_RESULT, EMIT_AFTER,
> -                                           info, NULL_RTX, insn->rtl ());
> -                         if (set->single_nondebug_insn_use ())
> -                           {
> -                             to_delete.add (insn->rtl ());
> -                             to_delete.add (def_insn->rtl ());
> -                           }
> -                         continue;
> -                       }
> -                   }
> -               }
> -           }
> -
> -         /* Change vsetvl rd, rs1 --> vsevl zero, rs1,
> -            if rd is not used by any nondebug instructions.
> -            Even though this PASS runs after RA and it doesn't help for
> -            reduce register pressure, it can help instructions scheduling
> -            since we remove the dependencies.  */
> -         if (vsetvl_insn_p (insn->rtl ()))
> -           {
> -             rtx vl = get_vl (insn->rtl ());
> -             rtx avl = get_avl (insn->rtl ());
> -             def_info *def = find_access (insn->defs (), REGNO (vl));
> -             set_info *set = safe_dyn_cast<set_info *> (def);
> +             rtx vl = get_vl (rinsn);
>               vector_insn_info info;
> -             info.parse_insn (insn);
> -             gcc_assert (set);
> -             if (m_vector_manager->to_delete_vsetvls.contains (insn->rtl ()))
> -               {
> -                 m_vector_manager->to_delete_vsetvls.remove (insn->rtl ());
> -                 if (m_vector_manager->to_refine_vsetvls.contains (
> -                       insn->rtl ()))
> -                   m_vector_manager->to_refine_vsetvls.remove (insn->rtl ());
> -                 if (!set->has_nondebug_insn_uses ())
> -                   {
> -                     to_delete.add (insn->rtl ());
> -                     continue;
> -                   }
> -               }
> -             if (m_vector_manager->to_refine_vsetvls.contains (insn->rtl ()))
> +             info.parse_insn (rinsn);
> +             bool to_delete_p = m_vector_manager->to_delete_p (rinsn);
> +             bool to_refine_p = m_vector_manager->to_refine_p (rinsn);
> +             if (has_no_uses (cfg_bb, rinsn, REGNO (vl)))
>                 {
> -                 m_vector_manager->to_refine_vsetvls.remove (insn->rtl ());
> -                 if (!set->has_nondebug_insn_uses ())
> +                 if (to_delete_p)
> +                   to_delete.add (rinsn);
> +                 else if (to_refine_p)
>                     {
>                       rtx new_pat = gen_vsetvl_pat (VSETVL_VTYPE_CHANGE_ONLY,
>                                                     info, NULL_RTX);
> -                     change_insn (insn->rtl (), new_pat);
> -                     continue;
> +                     validate_change (rinsn, &PATTERN (rinsn), new_pat, false);
> +                   }
> +                 else if (!vlmax_avl_p (info.get_avl ()))
> +                   {
> +                     rtx new_pat = gen_vsetvl_pat (VSETVL_DISCARD_RESULT, info,
> +                                                   NULL_RTX);
> +                     validate_change (rinsn, &PATTERN (rinsn), new_pat, false);
>                     }
> -               }
> -             if (vlmax_avl_p (avl))
> -               continue;
> -             rtx new_pat
> -               = gen_vsetvl_pat (VSETVL_DISCARD_RESULT, info, NULL_RTX);
> -             if (!set->has_nondebug_insn_uses ())
> -               {
> -                 validate_change (insn->rtl (), &PATTERN (insn->rtl ()),
> -                                  new_pat, false);
> -                 continue;
>                 }
>             }
>         }
>      }
> -
>    for (rtx_insn *rinsn : to_delete)
>      eliminate_insn (rinsn);
>  }
> @@ -4593,16 +4707,16 @@ pass_vsetvl::lazy_vsetvl (void)
>      fprintf (dump_file, "\nPhase 4: PRE vsetvl by Lazy code motion (LCM)\n");
>    pre_vsetvl ();
>
> -  /* Phase 5 - Cleanup AVL && VL operand of RVV instruction.  */
> +  /* Phase 5 - Post optimization based on RTL_SSA.  */
>    if (dump_file)
> -    fprintf (dump_file, "\nPhase 5: Cleanup AVL and VL operands\n");
> -  cleanup_insns ();
> +    fprintf (dump_file, "\nPhase 5: Post optimization based on RTL_SSA\n");
> +  ssa_post_optimization ();
>
> -  /* Phase 6 - Rebuild RTL_SSA to propagate AVL between vsetvls.  */
> +  /* Phase 6 - Post optimization based on data-flow analysis.  */
>    if (dump_file)
>      fprintf (dump_file,
> -            "\nPhase 6: Rebuild RTL_SSA to propagate AVL between vsetvls\n");
> -  propagate_avl ();
> +            "\nPhase 6: Post optimization based on data-flow analysis\n");
> +  df_post_optimization ();
>  }
>
>  /* Main entry point for this pass.  */
> diff --git a/gcc/config/riscv/riscv-vsetvl.h b/gcc/config/riscv/riscv-vsetvl.h
> index d7a6c14e931..4257451bb74 100644
> --- a/gcc/config/riscv/riscv-vsetvl.h
> +++ b/gcc/config/riscv/riscv-vsetvl.h
> @@ -290,13 +290,6 @@ private:
>       definition of AVL.  */
>    rtl_ssa::insn_info *m_insn;
>
> -  /* Parse the instruction to get VL/VTYPE information and demanding
> -   * information.  */
> -  /* This is only called by simple_vsetvl subroutine when optimize == 0.
> -     Since RTL_SSA can not be enabled when optimize == 0, we don't initialize
> -     the m_insn.  */
> -  void parse_insn (rtx_insn *);
> -
>    friend class vector_infos_manager;
>
>  public:
> @@ -305,6 +298,12 @@ public:
>        m_insn (nullptr)
>    {}
>
> +  /* Parse the instruction to get VL/VTYPE information and demanding
> +   * information.  */
> +  /* This is only called by simple_vsetvl subroutine when optimize == 0.
> +     Since RTL_SSA can not be enabled when optimize == 0, we don't initialize
> +     the m_insn.  */
> +  void parse_insn (rtx_insn *);
>    /* This is only called by lazy_vsetvl subroutine when optimize > 0.
>       We use RTL_SSA framework to initialize the insn_info.  */
>    void parse_insn (rtl_ssa::insn_info *);
> @@ -454,6 +453,27 @@ public:
>    bool all_empty_predecessor_p (const basic_block) const;
>    bool all_avail_in_compatible_p (const basic_block) const;
>
> +  bool to_delete_p (rtx_insn *rinsn)
> +  {
> +    if (to_delete_vsetvls.contains (rinsn))
> +      {
> +       to_delete_vsetvls.remove (rinsn);
> +       if (to_refine_vsetvls.contains (rinsn))
> +         to_refine_vsetvls.remove (rinsn);
> +       return true;
> +      }
> +    return false;
> +  }
> +  bool to_refine_p (rtx_insn *rinsn)
> +  {
> +    if (to_refine_vsetvls.contains (rinsn))
> +      {
> +       to_refine_vsetvls.remove (rinsn);
> +       return true;
> +      }
> +    return false;
> +  }
> +
>    void release (void);
>    void create_bitmap_vectors (void);
>    void free_bitmap_vectors (void);
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-16.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-16.c
> index e0c6588b1db..29e05c4982b 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-16.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-16.c
> @@ -16,5 +16,5 @@ void f(int8_t *base, int8_t *out, size_t vl, size_t m) {
>    }
>  }
>
> -/* { dg-final { scan-assembler-times {vsetvli} 3 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {vsetvli} 2 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
>  /* { dg-final { scan-assembler-times {slli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*10} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-2.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-2.c
> index 0c5da5e640c..ff0171b3ff6 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-2.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-2.c
> @@ -17,4 +17,4 @@ void f(int8_t *base, int8_t *out, size_t vl, size_t m) {
>  }
>
>  /* { dg-final { scan-assembler-times {slli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*10} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
> -/* { dg-final { scan-assembler-times {vsetvli} 3 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {vsetvli} 2 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-21.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-21.c
> new file mode 100644
> index 00000000000..551920c6a72
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-21.c
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv32gcv -mabi=ilp32 -fno-schedule-insns -fno-schedule-insns2" } */
> +
> +#include "riscv_vector.h"
> +
> +void f(int8_t *base, int8_t *out, size_t vl, size_t m, size_t k) {
> +  size_t avl;
> +  if (m > 100)
> +    avl = __riscv_vsetvl_e16mf4(vl << 4);
> +  else{
> +    if (k)
> +      avl = __riscv_vsetvl_e8mf8(vl);
> +  }
> +  for (size_t i = 0; i < m; i++) {
> +    vint8mf8_t v0 = __riscv_vle8_v_i8mf8(base + i, avl);
> +    __riscv_vse8_v_i8mf8(out + i, v0, avl);
> +  }
> +}
> +
> +/* { dg-final { scan-assembler-times {slli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*4} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {vsetvli} 3 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-22.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-22.c
> new file mode 100644
> index 00000000000..103f4238c76
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-22.c
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv32gcv -mabi=ilp32 -fno-schedule-insns -fno-schedule-insns2" } */
> +
> +#include "riscv_vector.h"
> +
> +void f(int8_t *base, int8_t *out, size_t vl, size_t m, size_t k) {
> +  size_t avl;
> +  if (m > 100)
> +    avl = __riscv_vsetvl_e16mf4(vl << 4);
> +  else
> +    avl = __riscv_vsetvl_e32mf2(vl >> 8);
> +  for (size_t i = 0; i < m; i++) {
> +    vint8mf8_t v0 = __riscv_vle8_v_i8mf8(base + i, avl);
> +    v0 = __riscv_vadd_vv_i8mf8 (v0, v0, avl);
> +    __riscv_vse8_v_i8mf8(out + i, v0, avl);
> +  }
> +}
> +
> +/* { dg-final { scan-assembler-times {slli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*4} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {vsetvli} 2 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {vsetvli\s+zero,\s*[a-x0-9]+,\s*e8,\s*mf8,\s*t[au],\s*m[au]} 2 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-23.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-23.c
> new file mode 100644
> index 00000000000..66c90ac10e7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-23.c
> @@ -0,0 +1,37 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv32gcv -mabi=ilp32 -fno-schedule-insns -fno-schedule-insns2" } */
> +
> +#include "riscv_vector.h"
> +
> +void f(int8_t *base, int8_t *out, size_t vl, size_t m, size_t k) {
> +  size_t avl;
> +  switch (m)
> +  {
> +  case 50:
> +    avl = __riscv_vsetvl_e16mf4(vl << 4);
> +    break;
> +  case 1:
> +    avl = __riscv_vsetvl_e32mf2(k);
> +    break;
> +  case 2:
> +    avl = __riscv_vsetvl_e64m1(vl);
> +    break;
> +  case 3:
> +    avl = __riscv_vsetvl_e32mf2(k >> 8);
> +    break;
> +  default:
> +    avl = __riscv_vsetvl_e32mf2(k + vl);
> +    break;
> +  }
> +  for (size_t i = 0; i < m; i++) {
> +    vint8mf8_t v0 = __riscv_vle8_v_i8mf8(base + i, avl);
> +    v0 = __riscv_vadd_vv_i8mf8 (v0, v0, avl);
> +    v0 = __riscv_vadd_vv_i8mf8_tu (v0, v0, v0, avl);
> +    __riscv_vse8_v_i8mf8(out + i, v0, avl);
> +  }
> +}
> +
> +/* { dg-final { scan-assembler-times {slli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*4} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {srli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*8} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {vsetvli} 5 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {vsetvli\s+zero,\s*[a-x0-9]+,\s*e8,\s*mf8,\s*tu,\s*m[au]} 5 { target { no-opts "-O0" no-opts "-Os" no-opts "-g" no-opts "-funroll-loops" } } } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-3.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-3.c
> index f995e04aacc..13d09fc3fd1 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-3.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-3.c
> @@ -18,4 +18,4 @@ void f(int8_t *base, int8_t *out, size_t vl, size_t m) {
>  }
>
>  /* { dg-final { scan-assembler-times {slli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*10} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
> -/* { dg-final { scan-assembler-times {vsetvli} 3 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {vsetvli} 2 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
> --
> 2.36.1
>
juzhe.zhong@rivai.ai June 9, 2023, 10:49 a.m. UTC | #2
This patch removes the second initialization of RTL_SSA, which is the approach we both dislike.
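To illustrate the idea (a toy model only, not GCC internals — the `Insn` struct and helper names below are hypothetical): once Phase 5 has stripped the AVL register uses from the RVV instructions, Phase 6 can rewrite `vsetvl rd,rs1` into `vsetvl zero,rs1` with a plain forward scan over the block, so no second RTL_SSA build is needed:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy IR (hypothetical): an instruction writes `dest` and reads `uses`.
struct Insn
{
  std::string op;
  std::string dest;
  std::vector<std::string> uses;
};

// Forward scan within the block: true if no later insn reads REG.
// (The real pass additionally checks the DF live-out set of the block.)
static bool
has_no_uses_after (const std::vector<Insn> &bb, size_t i,
                   const std::string &reg)
{
  for (size_t j = i + 1; j < bb.size (); ++j)
    for (const std::string &u : bb[j].uses)
      if (u == reg)
        return false;
  return true;
}

// Phase 6 rewrite: vsetvl rd,rs1 --> vsetvl zero,rs1 when rd is dead.
// This only fires if Phase 5 already removed the AVL register uses
// from the RVV instructions.
static void
discard_dead_vsetvl_results (std::vector<Insn> &bb)
{
  for (size_t i = 0; i < bb.size (); ++i)
    if (bb[i].op == "vsetvl" && !bb[i].dest.empty ()
        && has_no_uses_after (bb, i, bb[i].dest))
      bb[i].dest = "zero";
}
```

The ordering dependency is visible in this model: while the AVL register still has a reader (e.g. `slli a4,a3,3` or a not-yet-cleaned `vadd.vv` use), the result register of the vsetvl cannot be discarded.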



juzhe.zhong@rivai.ai
 
From: Kito Cheng
Date: 2023-06-09 18:45
To: juzhe.zhong
CC: gcc-patches; kito.cheng; palmer; palmer; jeffreyalaw; rdapp.gcc; pan2.li
Subject: Re: [PATCH V2] RISC-V: Rework Phase 5 && Phase 6 of VSETVL PASS
Thankfully you sent this before the weekend, so I can run the fuzz testing
over the weekend :P
 
On Fri, Jun 9, 2023 at 6:41 PM <juzhe.zhong@rivai.ai> wrote:
>
> From: Juzhe-Zhong <juzhe.zhong@rivai.ai>
>
> This patch reworks Phase 5 && Phase 6 of the VSETVL PASS since Phase 5 && Phase 6
> are quite messy and cause some bugs discovered by my downstream auto-vectorization
> test-generator.
>
> Before this patch.
>
> Phase 5 is cleanup_insns, the function that removes the AVL operand dependency from each RVV instruction.
> E.g. vadd.vv (use a5), after Phase 5, ====> vadd.vv (use const_int 0). Since "a5" is used in "vsetvl" instructions and
> after the correct "vsetvl" instructions are inserted, each RVV instruction doesn't need the AVL operand "a5" anymore.
> Removing this operand dependency helps the following scheduling pass.
>
> Phase 6 is propagate_avl, which does the following 2 things:
> 1. Local && Global user vsetvl instructions optimization.
>    E.g.
>       vsetvli a2, a2, e8, mf8   ======> Change it into vsetvli a2, a2, e32, mf2
>       vsetvli zero,a2, e32, mf2  ======> eliminate
> 2. Optimize user vsetvl from "vsetvl a2,a2" into "vsetvl zero,a2" if "a2" is not used by any instructions.
> Since Phase 1 ~ Phase 4 insert "vsetvli" instructions based on LCM, which changes the CFG, I rebuild a new
> RTL_SSA framework (which is more expensive than just using DF) for Phase 6 and optimize user vsetvli based on the new RTL_SSA.
>
> There are 2 issues in Phase 5 && Phase 6:
> 1. local_eliminate_vsetvl_insn was introduced by @kito and can do local user vsetvl optimizations better than
>    Phase 6 does; such an approach doesn't need to rebuild the RTL_SSA framework. So the local user vsetvli
>    instruction optimization in Phase 6 is redundant and should be removed.
> 2. A bug discovered by my downstream auto-vectorization test-generator (I can't put the test in this patch since we are missing autovec
>    patterns for it, so we can't reproduce the issue directly with upstream GCC, but I will remember to put it back after I support the
>    necessary autovec patterns). Such a bug is caused by rebuilding the RTL_SSA framework. The issue description is this:
>
> Before Phase 6:
>    ...
>    insn1: vsetvli a3, 17 <========== generated by SELECT_VL auto-vec pattern.
>    slli a4,a3,3
>    ...
>    insn2: vsetvli zero, a3, ...
>    load (use const_int 0, before Phase 5, it's using a3, but the use of "a3" is removed in Phase 5)
>    ...
>
> In Phase 6, we iterate to insn2, then get the def of "a3", which is insn1.
> insn2 is the vsetvli instruction inserted in Phase 4, which is not included in the RTL_SSA framework
> even though we renew it (I didn't take a look at it and I don't think we need to now).
> In this situation, the def_info of insn2 reports "set->single_nondebug_insn_use ()"
> as true. Obviously, this information is not correct, since insn1 has at least 2 uses:
> 1). slli a4,a3,3 2). insn2: vsetvli zero, a3, ... Then, the execution test generated by my downstream
> test-generator failed.
>
> Conclusion of RTL_SSA framework:
> Before this patch, we initialized RTL_SSA 2 times. One is at the beginning of the VSETVL PASS, which is absolutely correct; the other
> is the rebuild after Phase 4 (LCM), which has incorrect information that causes bugs.
>
> Besides, we don't like to initialize RTL_SSA a second time; it seems to be a waste since we just need to do a little optimization.
>
> Based on all the circumstances I described above, I rework and reorganize Phase 5 && Phase 6 as follows:
> 1. Phase 5 is called ssa_post_optimization, which does the optimization based on the RTL_SSA information (RTL_SSA is initialized
>    at the beginning of the VSETVL PASS; no need to rebuild it). This phase includes 3 optimizations:
>    1). local_eliminate_vsetvl_insn we already have (no change).
>    2). global_eliminate_vsetvl_insn ---> new optimization split from the original Phase 6, with a more powerful and reliable implementation.
>       E.g.
>       void f(int8_t *base, int8_t *out, size_t vl, size_t m, size_t k) {
>         size_t avl;
>         if (m > 100)
>           avl = __riscv_vsetvl_e16mf4(vl << 4);
>         else
>           avl = __riscv_vsetvl_e32mf2(vl >> 8);
>         for (size_t i = 0; i < m; i++) {
>           vint8mf8_t v0 = __riscv_vle8_v_i8mf8(base + i, avl);
>           v0 = __riscv_vadd_vv_i8mf8 (v0, v0, avl);
>           __riscv_vse8_v_i8mf8(out + i, v0, avl);
>         }
>       }
>
>       This example failed the global user vsetvl optimization before this patch:
>       f:
>               li      a5,100
>               bleu    a3,a5,.L2
>               slli    a2,a2,4
>               vsetvli a4,a2,e16,mf4,ta,mu
>       .L3:
>               li      a5,0
>               vsetvli zero,a4,e8,mf8,ta,ma
>       .L5:
>               add     a6,a0,a5
>               add     a2,a1,a5
>               vle8.v  v1,0(a6)
>               addi    a5,a5,1
>               vadd.vv v1,v1,v1
>               vse8.v  v1,0(a2)
>               bgtu    a3,a5,.L5
>       .L10:
>               ret
>       .L2:
>               beq     a3,zero,.L10
>               srli    a2,a2,8
>               vsetvli a4,a2,e32,mf2,ta,mu
>               j       .L3
>       With this patch:
>       f:
>               li      a5,100
>               bleu    a3,a5,.L2
>               slli    a2,a2,4
>               vsetvli zero,a2,e8,mf8,ta,ma
>       .L3:
>               li      a5,0
>       .L5:
>               add     a6,a0,a5
>               add     a2,a1,a5
>               vle8.v  v1,0(a6)
>               addi    a5,a5,1
>               vadd.vv v1,v1,v1
>               vse8.v  v1,0(a2)
>               bgtu    a3,a5,.L5
>       .L10:
>               ret
>       .L2:
>               beq     a3,zero,.L10
>               srli    a2,a2,8
>               vsetvli zero,a2,e8,mf8,ta,ma
>               j       .L3
>
>    3). Remove AVL operand dependency of each RVV instructions.
>
> 2. Phase 6 is called df_post_optimization: optimize "vsetvl a3,a2...." into "vsetvl zero,a2...." based on
>    dataflow analysis of the new CFG (the new CFG is created by LCM). The reason we need to use the new CFG and run after Phase 5:
>    ...
>    vsetvl a3, a2...
>    vadd.vv (use a3)
>    If we don't have Phase 5, which removes the "a3" use in vadd.vv, we will fail to optimize vsetvl a3,a2 into vsetvl zero,a2.
>
>    This patch passed all tests in rvv.exp with ONLY performance && codegen improved (no performance decline and no bugs, including in my
>    downstream tests).
>
> gcc/ChangeLog:
>
>         * config/riscv/riscv-vsetvl.cc (available_occurrence_p): Enhance user vsetvl optimization.
>         (vector_insn_info::parse_insn): Add rtx_insn parse.
>         (pass_vsetvl::local_eliminate_vsetvl_insn): Enhance user vsetvl optimization.
>         (get_first_vsetvl): New function.
>         (pass_vsetvl::global_eliminate_vsetvl_insn): Ditto.
>         (pass_vsetvl::cleanup_insns): Remove it.
>         (pass_vsetvl::ssa_post_optimization): New function.
>         (has_no_uses): Ditto.
>         (pass_vsetvl::propagate_avl): Remove it.
>         (pass_vsetvl::df_post_optimization): New function.
>         (pass_vsetvl::lazy_vsetvl): Rework Phase 5 && Phase 6.
>         * config/riscv/riscv-vsetvl.h: Adapt declaration.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/riscv/rvv/vsetvl/vsetvl-16.c: Adapt test.
>         * gcc.target/riscv/rvv/vsetvl/vsetvl-2.c: Ditto.
>         * gcc.target/riscv/rvv/vsetvl/vsetvl-3.c: Ditto.
>         * gcc.target/riscv/rvv/vsetvl/vsetvl-21.c: New test.
>         * gcc.target/riscv/rvv/vsetvl/vsetvl-22.c: New test.
>         * gcc.target/riscv/rvv/vsetvl/vsetvl-23.c: New test.
>
> ---
>  gcc/config/riscv/riscv-vsetvl.cc              | 400 +++++++++++-------
>  gcc/config/riscv/riscv-vsetvl.h               |  34 +-
>  .../gcc.target/riscv/rvv/vsetvl/vsetvl-16.c   |   2 +-
>  .../gcc.target/riscv/rvv/vsetvl/vsetvl-2.c    |   2 +-
>  .../gcc.target/riscv/rvv/vsetvl/vsetvl-21.c   |  21 +
>  .../gcc.target/riscv/rvv/vsetvl/vsetvl-22.c   |  21 +
>  .../gcc.target/riscv/rvv/vsetvl/vsetvl-23.c   |  37 ++
>  .../gcc.target/riscv/rvv/vsetvl/vsetvl-3.c    |   2 +-
>  8 files changed, 366 insertions(+), 153 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-21.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-22.c
>  create mode 100644 gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-23.c
>
> diff --git a/gcc/config/riscv/riscv-vsetvl.cc b/gcc/config/riscv/riscv-vsetvl.cc
> index fe55f4ccd30..924a94adf9c 100644
> --- a/gcc/config/riscv/riscv-vsetvl.cc
> +++ b/gcc/config/riscv/riscv-vsetvl.cc
> @@ -395,10 +395,15 @@ available_occurrence_p (const bb_info *bb, const vector_insn_info dem)
>        if (!vlmax_avl_p (dem.get_avl ()))
>         {
>           rtx dest = NULL_RTX;
> +         insn_info *i = insn;
>           if (vsetvl_insn_p (insn->rtl ()))
> -           dest = get_vl (insn->rtl ());
> -         for (const insn_info *i = insn; real_insn_and_same_bb_p (i, bb);
> -              i = i->next_nondebug_insn ())
> +           {
> +             dest = get_vl (insn->rtl ());
> +             /* For user vsetvl a2, a2 instruction, we consider it as
> +                available even though it modifies "a2".  */
> +             i = i->next_nondebug_insn ();
> +           }
> +         for (; real_insn_and_same_bb_p (i, bb); i = i->next_nondebug_insn ())
>             {
>               if (read_vl_insn_p (i->rtl ()))
>                 continue;
> @@ -1893,11 +1898,13 @@ vector_insn_info::parse_insn (rtx_insn *rinsn)
>    *this = vector_insn_info ();
>    if (!NONDEBUG_INSN_P (rinsn))
>      return;
> -  if (!has_vtype_op (rinsn))
> +  if (optimize == 0 && !has_vtype_op (rinsn))
> +    return;
> +  if (optimize > 0 && !vsetvl_insn_p (rinsn))
>      return;
>    m_state = VALID;
>    extract_insn_cached (rinsn);
> -  const rtx avl = recog_data.operand[get_attr_vl_op_idx (rinsn)];
> +  rtx avl = ::get_avl (rinsn);
>    m_avl = avl_info (avl, nullptr);
>    m_sew = ::get_sew (rinsn);
>    m_vlmul = ::get_vlmul (rinsn);
> @@ -2730,10 +2737,11 @@ private:
>    /* Phase 5.  */
>    rtx_insn *get_vsetvl_at_end (const bb_info *, vector_insn_info *) const;
>    void local_eliminate_vsetvl_insn (const bb_info *) const;
> -  void cleanup_insns (void) const;
> +  bool global_eliminate_vsetvl_insn (const bb_info *) const;
> +  void ssa_post_optimization (void) const;
>
>    /* Phase 6.  */
> -  void propagate_avl (void) const;
> +  void df_post_optimization (void) const;
>
>    void init (void);
>    void done (void);
> @@ -4246,7 +4254,7 @@ pass_vsetvl::local_eliminate_vsetvl_insn (const bb_info *bb) const
>
>        /* Local AVL compatibility checking is simpler than global, we only
>          need to check the REGNO is same.  */
> -      if (prev_dem.valid_p () && prev_dem.skip_avl_compatible_p (curr_dem)
> +      if (prev_dem.valid_or_dirty_p () && prev_dem.skip_avl_compatible_p (curr_dem)
>           && local_avl_compatible_p (prev_avl, curr_avl))
>         {
>           /* curr_dem and prev_dem is compatible!  */
> @@ -4277,27 +4285,187 @@ pass_vsetvl::local_eliminate_vsetvl_insn (const bb_info *bb) const
>      }
>  }
>
> -/* Before VSETVL PASS, RVV instructions pattern is depending on AVL operand
> -   implicitly. Since we will emit VSETVL instruction and make RVV instructions
> -   depending on VL/VTYPE global status registers, we remove the such AVL operand
> -   in the RVV instructions pattern here in order to remove AVL dependencies when
> -   AVL operand is a register operand.
> -
> -   Before the VSETVL PASS:
> -     li a5,32
> -     ...
> -     vadd.vv (..., a5)
> -   After the VSETVL PASS:
> -     li a5,32
> -     vsetvli zero, a5, ...
> -     ...
> -     vadd.vv (..., const_int 0).  */
> +/* Get the first vsetvl instruction of the block.  */
> +static rtx_insn *
> +get_first_vsetvl (basic_block cfg_bb)
> +{
> +  rtx_insn *rinsn;
> +  FOR_BB_INSNS (cfg_bb, rinsn)
> +    {
> +      if (!NONDEBUG_INSN_P (rinsn))
> +       continue;
> +      /* If we don't find any inserted vsetvli before user RVV instructions,
> +        we don't need to optimize the vsetvls in this block.  */
> +      if (has_vtype_op (rinsn) || vsetvl_insn_p (rinsn))
> +       return nullptr;
> +
> +      if (vsetvl_discard_result_insn_p (rinsn))
> +       return rinsn;
> +    }
> +  return nullptr;
> +}
> +
> +/* Global user vsetvl optimization:
> +
> +     Case 1:
> +     bb 1:
> +       vsetvl a5,a4,e8,mf8
> +       ...
> +     bb 2:
> +       ...
> +       vsetvl zero,a5,e8,mf8 --> Eliminate directly.
> +
> +     Case 2:
> +      bb 1:
> +       vsetvl a5,a4,e8,mf8    --> vsetvl a5,a4,e32,mf2
> +       ...
> +      bb 2:
> +       ...
> +       vsetvl zero,a5,e32,mf2 --> Eliminate directly.
> +
> +     Case 3:
> +      bb 1:
> +       vsetvl a5,a4,e8,mf8    --> vsetvl a5,a4,e32,mf2
> +       ...
> +      bb 2:
> +       ...
> +       vsetvl a5,a4,e8,mf8    --> vsetvl a5,a4,e32,mf2
> +       goto bb 3
> +      bb 3:
> +       ...
> +       vsetvl zero,a5,e32,mf2 --> Eliminate directly.
> +*/
> +bool
> +pass_vsetvl::global_eliminate_vsetvl_insn (const bb_info *bb) const
> +{
> +  rtx_insn *vsetvl_rinsn;
> +  vector_insn_info dem = vector_insn_info ();
> +  const auto &block_info = get_block_info (bb);
> +  basic_block cfg_bb = bb->cfg_bb ();
> +
> +  if (block_info.local_dem.valid_or_dirty_p ())
> +    {
> +      /* Optimize the local vsetvl.  */
> +      dem = block_info.local_dem;
> +      vsetvl_rinsn = get_first_vsetvl (cfg_bb);
> +    }
> +  if (!vsetvl_rinsn)
> +    /* Optimize the global vsetvl inserted by LCM.  */
> +    vsetvl_rinsn = get_vsetvl_at_end (bb, &dem);
> +
> +  /* No need to optimize if block doesn't have vsetvl instructions.  */
> +  if (!dem.valid_or_dirty_p () || !vsetvl_rinsn || !dem.get_avl_source ()
> +      || !dem.has_avl_reg ())
> +    return false;
> +
> +  /* If all preds have VL/VTYPE status set by user vsetvls, and these
> +     user vsetvls are all skip_avl_compatible_p with the vsetvl in this
> +     block, we can eliminate this vsetvl instruction.  */
> +  sbitmap avin = m_vector_manager->vector_avin[cfg_bb->index];
> +
> +  unsigned int bb_index;
> +  sbitmap_iterator sbi;
> +  rtx avl = get_avl (dem.get_insn ()->rtl ());
> +  hash_set<set_info *> sets
> +    = get_all_sets (dem.get_avl_source (), true, false, false);
> +  /* Condition 1: All available-in VL/VTYPE are compatible.  */
> +  EXECUTE_IF_SET_IN_BITMAP (avin, 0, bb_index, sbi)
> +    {
> +      const auto &expr = m_vector_manager->vector_exprs[bb_index];
> +      const auto &insn = expr->get_insn ();
> +      def_info *def = find_access (insn->defs (), REGNO (avl));
> +      set_info *set = safe_dyn_cast<set_info *> (def);
> +      if (!vsetvl_insn_p (insn->rtl ()) || insn->bb () == bb
> +         || !sets.contains (set))
> +       return false;
> +    }
> +
> +  /* Condition 2: Check it has preds.  */
> +  if (EDGE_COUNT (cfg_bb->preds) == 0)
> +    return false;
> +
> +  /* Condition 3: We don't do the global optimization for a block
> +     that has a pred which is the entry or exit block.  */
> +  /* Condition 4: All preds have available VL/VTYPE out.  */
> +  edge e;
> +  edge_iterator ei;
> +  FOR_EACH_EDGE (e, ei, cfg_bb->preds)
> +    {
> +      sbitmap avout = m_vector_manager->vector_avout[e->src->index];
> +      if (e->src == ENTRY_BLOCK_PTR_FOR_FN (cfun)
> +         || e->src == EXIT_BLOCK_PTR_FOR_FN (cfun) || bitmap_empty_p (avout))
> +       return false;
> +
> +      EXECUTE_IF_SET_IN_BITMAP (avout, 0, bb_index, sbi)
> +       {
> +         const auto &expr = m_vector_manager->vector_exprs[bb_index];
> +         const auto &insn = expr->get_insn ();
> +         def_info *def = find_access (insn->defs (), REGNO (avl));
> +         set_info *set = safe_dyn_cast<set_info *> (def);
> +         if (!vsetvl_insn_p (insn->rtl ()) || insn->bb () == bb
> +             || !sets.contains (set) || !expr->skip_avl_compatible_p (dem))
> +           return false;
> +       }
> +    }
> +
> +  /* Step1: Reshape the VL/VTYPE status to make sure everything compatible.  */
> +  hash_set<basic_block> pred_cfg_bbs = get_all_predecessors (cfg_bb);
> +  FOR_EACH_EDGE (e, ei, cfg_bb->preds)
> +    {
> +      sbitmap avout = m_vector_manager->vector_avout[e->src->index];
> +      EXECUTE_IF_SET_IN_BITMAP (avout, 0, bb_index, sbi)
> +       {
> +         vector_insn_info prev_dem = *m_vector_manager->vector_exprs[bb_index];
> +         vector_insn_info curr_dem = dem;
> +         insn_info *insn = prev_dem.get_insn ();
> +         if (!pred_cfg_bbs.contains (insn->bb ()->cfg_bb ()))
> +           continue;
> +         /* Update avl info since we need to make sure they are fully
> +            compatible before merge.  */
> +         curr_dem.set_avl_info (prev_dem.get_avl_info ());
> +         /* Merge both and update into curr_vsetvl.  */
> +         prev_dem = curr_dem.merge (prev_dem, LOCAL_MERGE);
> +         change_vsetvl_insn (insn, prev_dem);
> +       }
> +    }
> +
> +  /* Step2: eliminate the vsetvl instruction.  */
> +  eliminate_insn (vsetvl_rinsn);
> +  return true;
> +}
> +
> +/* This function does the following post optimization based on RTL_SSA:
> +
> +   1. Local user vsetvl optimizations.
> +   2. Global user vsetvl optimizations.
> +   3. AVL dependencies removal:
> +      Before the VSETVL PASS, the RVV instruction pattern depends on the AVL
> +      operand implicitly. Since we will emit VSETVL instructions and make RVV
> +      instructions depend on the VL/VTYPE global status registers, we remove
> +      such AVL operands from the RVV instruction patterns here in order to
> +      remove AVL dependencies when the AVL operand is a register operand.
> +
> +      Before the VSETVL PASS:
> +       li a5,32
> +       ...
> +       vadd.vv (..., a5)
> +      After the VSETVL PASS:
> +       li a5,32
> +       vsetvli zero, a5, ...
> +       ...
> +       vadd.vv (..., const_int 0).  */
>  void
> -pass_vsetvl::cleanup_insns (void) const
> +pass_vsetvl::ssa_post_optimization (void) const
>  {
>    for (const bb_info *bb : crtl->ssa->bbs ())
>      {
>        local_eliminate_vsetvl_insn (bb);
> +      bool changed_p = true;
> +      while (changed_p)
> +       {
> +         changed_p = false;
> +         changed_p |= global_eliminate_vsetvl_insn (bb);
> +       }
>        for (insn_info *insn : bb->real_nondebug_insns ())
>         {
>           rtx_insn *rinsn = insn->rtl ();
> @@ -4342,135 +4510,81 @@ pass_vsetvl::cleanup_insns (void) const
>      }
>  }
>
> +/* Return true if the SET result is not used by any instructions.  */
> +static bool
> +has_no_uses (basic_block cfg_bb, rtx_insn *rinsn, int regno)
> +{
> +  /* Handle the following case that can not be detected in RTL_SSA.  */
> +  /* E.g.
> +         li a5, 100
> +         vsetvli a6, a5...
> +         ...
> +         vadd (use a6)
> +
> +       The use of "a6" is removed from "vadd" but the information is
> +       not updated in RTL_SSA framework. We don't want to re-new
> +       a new RTL_SSA which is expensive, instead, we use data-flow
> +       analysis to check whether "a6" has no uses.  */
> +  if (bitmap_bit_p (df_get_live_out (cfg_bb), regno))
> +    return false;
> +
> +  rtx_insn *iter;
> +  for (iter = NEXT_INSN (rinsn); iter && iter != NEXT_INSN (BB_END (cfg_bb));
> +       iter = NEXT_INSN (iter))
> +    if (df_find_use (iter, regno_reg_rtx[regno]))
> +      return false;
> +
> +  return true;
> +}
> +
> +/* This function does the following post optimization base on dataflow
> +   analysis:
> +
> +   1. Change vsetvl rd, rs1 --> vsevl zero, rs1, if rd is not used by any
> +   nondebug instructions. Even though this PASS runs after RA and it doesn't
> +   help for reduce register pressure, it can help instructions scheduling since
> +   we remove the dependencies.
> +
> +   2. Remove redundant user vsetvls base on outcome of Phase 4 (LCM) && Phase 5
> +   (AVL dependencies removal).  */
>  void
> -pass_vsetvl::propagate_avl (void) const
> -{
> -  /* Rebuild the RTL_SSA according to the new CFG generated by LCM.  */
> -  /* Finalization of RTL_SSA.  */
> -  free_dominance_info (CDI_DOMINATORS);
> -  if (crtl->ssa->perform_pending_updates ())
> -    cleanup_cfg (0);
> -  delete crtl->ssa;
> -  crtl->ssa = nullptr;
> -  /* Initialization of RTL_SSA.  */
> -  calculate_dominance_info (CDI_DOMINATORS);
> +pass_vsetvl::df_post_optimization (void) const
> +{
>    df_analyze ();
> -  crtl->ssa = new function_info (cfun);
> -
>    hash_set<rtx_insn *> to_delete;
> -  for (const bb_info *bb : crtl->ssa->bbs ())
> +  basic_block cfg_bb;
> +  rtx_insn *rinsn;
> +  FOR_ALL_BB_FN (cfg_bb, cfun)
>      {
> -      for (insn_info *insn : bb->real_nondebug_insns ())
> +      FOR_BB_INSNS (cfg_bb, rinsn)
>         {
> -         if (vsetvl_discard_result_insn_p (insn->rtl ()))
> +         if (NONDEBUG_INSN_P (rinsn) && vsetvl_insn_p (rinsn))
>             {
> -             rtx avl = get_avl (insn->rtl ());
> -             if (!REG_P (avl))
> -               continue;
> -
> -             set_info *set = find_access (insn->uses (), REGNO (avl))->def ();
> -             insn_info *def_insn = extract_single_source (set);
> -             if (!def_insn)
> -               continue;
> -
> -             /* Handle this case:
> -                vsetvli        a6,zero,e32,m1,ta,mu
> -                li     a5,4096
> -                add    a7,a0,a5
> -                addi   a7,a7,-96
> -                vsetvli        t1,zero,e8,mf8,ta,ma
> -                vle8.v v24,0(a7)
> -                add    a5,a3,a5
> -                addi   a5,a5,-96
> -                vse8.v v24,0(a5)
> -                vsetvli        zero,a6,e32,m1,tu,ma
> -             */
> -             if (vsetvl_insn_p (def_insn->rtl ()))
> -               {
> -                 vl_vtype_info def_info = get_vl_vtype_info (def_insn);
> -                 vl_vtype_info info = get_vl_vtype_info (insn);
> -                 rtx avl = get_avl (def_insn->rtl ());
> -                 rtx vl = get_vl (def_insn->rtl ());
> -                 if (def_info.get_ratio () == info.get_ratio ())
> -                   {
> -                     if (vlmax_avl_p (def_info.get_avl ()))
> -                       {
> -                         info.set_avl_info (
> -                           avl_info (def_info.get_avl (), nullptr));
> -                         rtx new_pat
> -                           = gen_vsetvl_pat (VSETVL_NORMAL, info, vl);
> -                         validate_change (insn->rtl (),
> -                                          &PATTERN (insn->rtl ()), new_pat,
> -                                          false);
> -                         continue;
> -                       }
> -                     if (def_info.has_avl_imm () || rtx_equal_p (avl, vl))
> -                       {
> -                         info.set_avl_info (avl_info (avl, nullptr));
> -                         emit_vsetvl_insn (VSETVL_DISCARD_RESULT, EMIT_AFTER,
> -                                           info, NULL_RTX, insn->rtl ());
> -                         if (set->single_nondebug_insn_use ())
> -                           {
> -                             to_delete.add (insn->rtl ());
> -                             to_delete.add (def_insn->rtl ());
> -                           }
> -                         continue;
> -                       }
> -                   }
> -               }
> -           }
> -
> -         /* Change vsetvl rd, rs1 --> vsevl zero, rs1,
> -            if rd is not used by any nondebug instructions.
> -            Even though this PASS runs after RA and it doesn't help for
> -            reduce register pressure, it can help instructions scheduling
> -            since we remove the dependencies.  */
> -         if (vsetvl_insn_p (insn->rtl ()))
> -           {
> -             rtx vl = get_vl (insn->rtl ());
> -             rtx avl = get_avl (insn->rtl ());
> -             def_info *def = find_access (insn->defs (), REGNO (vl));
> -             set_info *set = safe_dyn_cast<set_info *> (def);
> +             rtx vl = get_vl (rinsn);
>               vector_insn_info info;
> -             info.parse_insn (insn);
> -             gcc_assert (set);
> -             if (m_vector_manager->to_delete_vsetvls.contains (insn->rtl ()))
> -               {
> -                 m_vector_manager->to_delete_vsetvls.remove (insn->rtl ());
> -                 if (m_vector_manager->to_refine_vsetvls.contains (
> -                       insn->rtl ()))
> -                   m_vector_manager->to_refine_vsetvls.remove (insn->rtl ());
> -                 if (!set->has_nondebug_insn_uses ())
> -                   {
> -                     to_delete.add (insn->rtl ());
> -                     continue;
> -                   }
> -               }
> -             if (m_vector_manager->to_refine_vsetvls.contains (insn->rtl ()))
> +             info.parse_insn (rinsn);
> +             bool to_delete_p = m_vector_manager->to_delete_p (rinsn);
> +             bool to_refine_p = m_vector_manager->to_refine_p (rinsn);
> +             if (has_no_uses (cfg_bb, rinsn, REGNO (vl)))
>                 {
> -                 m_vector_manager->to_refine_vsetvls.remove (insn->rtl ());
> -                 if (!set->has_nondebug_insn_uses ())
> +                 if (to_delete_p)
> +                   to_delete.add (rinsn);
> +                 else if (to_refine_p)
>                     {
>                       rtx new_pat = gen_vsetvl_pat (VSETVL_VTYPE_CHANGE_ONLY,
>                                                     info, NULL_RTX);
> -                     change_insn (insn->rtl (), new_pat);
> -                     continue;
> +                     validate_change (rinsn, &PATTERN (rinsn), new_pat, false);
> +                   }
> +                 else if (!vlmax_avl_p (info.get_avl ()))
> +                   {
> +                     rtx new_pat = gen_vsetvl_pat (VSETVL_DISCARD_RESULT, info,
> +                                                   NULL_RTX);
> +                     validate_change (rinsn, &PATTERN (rinsn), new_pat, false);
>                     }
> -               }
> -             if (vlmax_avl_p (avl))
> -               continue;
> -             rtx new_pat
> -               = gen_vsetvl_pat (VSETVL_DISCARD_RESULT, info, NULL_RTX);
> -             if (!set->has_nondebug_insn_uses ())
> -               {
> -                 validate_change (insn->rtl (), &PATTERN (insn->rtl ()),
> -                                  new_pat, false);
> -                 continue;
>                 }
>             }
>         }
>      }
> -
>    for (rtx_insn *rinsn : to_delete)
>      eliminate_insn (rinsn);
>  }
> @@ -4593,16 +4707,16 @@ pass_vsetvl::lazy_vsetvl (void)
>      fprintf (dump_file, "\nPhase 4: PRE vsetvl by Lazy code motion (LCM)\n");
>    pre_vsetvl ();
>
> -  /* Phase 5 - Cleanup AVL && VL operand of RVV instruction.  */
> +  /* Phase 5 - Post optimization base on RTL_SSA.  */
>    if (dump_file)
> -    fprintf (dump_file, "\nPhase 5: Cleanup AVL and VL operands\n");
> -  cleanup_insns ();
> +    fprintf (dump_file, "\nPhase 5: Post optimization base on RTL_SSA\n");
> +  ssa_post_optimization ();
>
> -  /* Phase 6 - Rebuild RTL_SSA to propagate AVL between vsetvls.  */
> +  /* Phase 6 - Post optimization base on data-flow analysis.  */
>    if (dump_file)
>      fprintf (dump_file,
> -            "\nPhase 6: Rebuild RTL_SSA to propagate AVL between vsetvls\n");
> -  propagate_avl ();
> +            "\nPhase 6: Post optimization base on data-flow analysis\n");
> +  df_post_optimization ();
>  }
>
>  /* Main entry point for this pass.  */
> diff --git a/gcc/config/riscv/riscv-vsetvl.h b/gcc/config/riscv/riscv-vsetvl.h
> index d7a6c14e931..4257451bb74 100644
> --- a/gcc/config/riscv/riscv-vsetvl.h
> +++ b/gcc/config/riscv/riscv-vsetvl.h
> @@ -290,13 +290,6 @@ private:
>       definition of AVL.  */
>    rtl_ssa::insn_info *m_insn;
>
> -  /* Parse the instruction to get VL/VTYPE information and demanding
> -   * information.  */
> -  /* This is only called by simple_vsetvl subroutine when optimize == 0.
> -     Since RTL_SSA can not be enabled when optimize == 0, we don't initialize
> -     the m_insn.  */
> -  void parse_insn (rtx_insn *);
> -
>    friend class vector_infos_manager;
>
>  public:
> @@ -305,6 +298,12 @@ public:
>        m_insn (nullptr)
>    {}
>
> +  /* Parse the instruction to get VL/VTYPE information and demanding
> +   * information.  */
> +  /* This is only called by simple_vsetvl subroutine when optimize == 0.
> +     Since RTL_SSA can not be enabled when optimize == 0, we don't initialize
> +     the m_insn.  */
> +  void parse_insn (rtx_insn *);
>    /* This is only called by lazy_vsetvl subroutine when optimize > 0.
>       We use RTL_SSA framework to initialize the insn_info.  */
>    void parse_insn (rtl_ssa::insn_info *);
> @@ -454,6 +453,27 @@ public:
>    bool all_empty_predecessor_p (const basic_block) const;
>    bool all_avail_in_compatible_p (const basic_block) const;
>
> +  bool to_delete_p (rtx_insn *rinsn)
> +  {
> +    if (to_delete_vsetvls.contains (rinsn))
> +      {
> +       to_delete_vsetvls.remove (rinsn);
> +       if (to_refine_vsetvls.contains (rinsn))
> +         to_refine_vsetvls.remove (rinsn);
> +       return true;
> +      }
> +    return false;
> +  }
> +  bool to_refine_p (rtx_insn *rinsn)
> +  {
> +    if (to_refine_vsetvls.contains (rinsn))
> +      {
> +       to_refine_vsetvls.remove (rinsn);
> +       return true;
> +      }
> +    return false;
> +  }
> +
>    void release (void);
>    void create_bitmap_vectors (void);
>    void free_bitmap_vectors (void);
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-16.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-16.c
> index e0c6588b1db..29e05c4982b 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-16.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-16.c
> @@ -16,5 +16,5 @@ void f(int8_t *base, int8_t *out, size_t vl, size_t m) {
>    }
>  }
>
> -/* { dg-final { scan-assembler-times {vsetvli} 3 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {vsetvli} 2 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
>  /* { dg-final { scan-assembler-times {slli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*10} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-2.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-2.c
> index 0c5da5e640c..ff0171b3ff6 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-2.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-2.c
> @@ -17,4 +17,4 @@ void f(int8_t *base, int8_t *out, size_t vl, size_t m) {
>  }
>
>  /* { dg-final { scan-assembler-times {slli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*10} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
> -/* { dg-final { scan-assembler-times {vsetvli} 3 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {vsetvli} 2 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-21.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-21.c
> new file mode 100644
> index 00000000000..551920c6a72
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-21.c
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv32gcv -mabi=ilp32 -fno-schedule-insns -fno-schedule-insns2" } */
> +
> +#include "riscv_vector.h"
> +
> +void f(int8_t *base, int8_t *out, size_t vl, size_t m, size_t k) {
> +  size_t avl;
> +  if (m > 100)
> +    avl = __riscv_vsetvl_e16mf4(vl << 4);
> +  else{
> +    if (k)
> +      avl = __riscv_vsetvl_e8mf8(vl);
> +  }
> +  for (size_t i = 0; i < m; i++) {
> +    vint8mf8_t v0 = __riscv_vle8_v_i8mf8(base + i, avl);
> +    __riscv_vse8_v_i8mf8(out + i, v0, avl);
> +  }
> +}
> +
> +/* { dg-final { scan-assembler-times {slli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*4} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {vsetvli} 3 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-22.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-22.c
> new file mode 100644
> index 00000000000..103f4238c76
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-22.c
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv32gcv -mabi=ilp32 -fno-schedule-insns -fno-schedule-insns2" } */
> +
> +#include "riscv_vector.h"
> +
> +void f(int8_t *base, int8_t *out, size_t vl, size_t m, size_t k) {
> +  size_t avl;
> +  if (m > 100)
> +    avl = __riscv_vsetvl_e16mf4(vl << 4);
> +  else
> +    avl = __riscv_vsetvl_e32mf2(vl >> 8);
> +  for (size_t i = 0; i < m; i++) {
> +    vint8mf8_t v0 = __riscv_vle8_v_i8mf8(base + i, avl);
> +    v0 = __riscv_vadd_vv_i8mf8 (v0, v0, avl);
> +    __riscv_vse8_v_i8mf8(out + i, v0, avl);
> +  }
> +}
> +
> +/* { dg-final { scan-assembler-times {slli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*4} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {vsetvli} 2 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {vsetvli\s+zero,\s*[a-x0-9]+,\s*e8,\s*mf8,\s*t[au],\s*m[au]} 2 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-23.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-23.c
> new file mode 100644
> index 00000000000..66c90ac10e7
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-23.c
> @@ -0,0 +1,37 @@
> +/* { dg-do compile } */
> +/* { dg-options "-march=rv32gcv -mabi=ilp32 -fno-schedule-insns -fno-schedule-insns2" } */
> +
> +#include "riscv_vector.h"
> +
> +void f(int8_t *base, int8_t *out, size_t vl, size_t m, size_t k) {
> +  size_t avl;
> +  switch (m)
> +  {
> +  case 50:
> +    avl = __riscv_vsetvl_e16mf4(vl << 4);
> +    break;
> +  case 1:
> +    avl = __riscv_vsetvl_e32mf2(k);
> +    break;
> +  case 2:
> +    avl = __riscv_vsetvl_e64m1(vl);
> +    break;
> +  case 3:
> +    avl = __riscv_vsetvl_e32mf2(k >> 8);
> +    break;
> +  default:
> +    avl = __riscv_vsetvl_e32mf2(k + vl);
> +    break;
> +  }
> +  for (size_t i = 0; i < m; i++) {
> +    vint8mf8_t v0 = __riscv_vle8_v_i8mf8(base + i, avl);
> +    v0 = __riscv_vadd_vv_i8mf8 (v0, v0, avl);
> +    v0 = __riscv_vadd_vv_i8mf8_tu (v0, v0, v0, avl);
> +    __riscv_vse8_v_i8mf8(out + i, v0, avl);
> +  }
> +}
> +
> +/* { dg-final { scan-assembler-times {slli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*4} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {srli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*8} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {vsetvli} 5 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {vsetvli\s+zero,\s*[a-x0-9]+,\s*e8,\s*mf8,\s*tu,\s*m[au]} 5 { target { no-opts "-O0" no-opts "-Os" no-opts "-g" no-opts "-funroll-loops" } } } } */
> diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-3.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-3.c
> index f995e04aacc..13d09fc3fd1 100644
> --- a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-3.c
> +++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-3.c
> @@ -18,4 +18,4 @@ void f(int8_t *base, int8_t *out, size_t vl, size_t m) {
>  }
>
>  /* { dg-final { scan-assembler-times {slli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*10} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
> -/* { dg-final { scan-assembler-times {vsetvli} 3 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
> +/* { dg-final { scan-assembler-times {vsetvli} 2 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
> --
> 2.36.1
>
Jeff Law June 9, 2023, 2:33 p.m. UTC | #3
On 6/9/23 04:41, juzhe.zhong@rivai.ai wrote:
> From: Juzhe-Zhong <juzhe.zhong@rivai.ai>
> 
> This patch is to rework Phase 5 && Phase 6 of VSETVL PASS since Phase 5 && Phase 6
> are quite messy and cause some bugs discovered by my downstream auto-vectorization
> test-generator.
> 
> Before this patch.
> 
> Phase 5 is cleanup_insns is the function remove AVL operand dependency from each RVV instruction.
> E.g. vadd.vv (use a5), after Phase 5, ====> vadd.vv (use const_int 0). Since "a5" is used in "vsetvl" instructions and
> after the correct "vsetvl" instructions are inserted, each RVV instruction doesn't need AVL operand "a5" anymore. Then,
> we remove this operand dependency helps for the following scheduling PASS.
Right.  Removal of the unused operand gives the scheduler more freedom. 
It's not clear yet how much gain there is for scheduling vector on RV, 
but there's no good reason to handcuff it with unnecessary dependencies.


> 
> Phase 6 is propagate_avl do the following 2 things:
> 1. Local && Global user vsetvl instructions optimization.
>     E.g.
>        vsetvli a2, a2, e8, mf8   ======> Change it into vsetvli a2, a2, e32, mf2
>        vsetvli zero,a2, e32, mf2  ======> eliminate
Always good to eliminate more instructions.   So while vsetvl is 
designed to be minimal overhead and it's fully expected that we'll see a 
lot of them, there's no good reason to have unnecessary ones in the stream.


> 2. Optimize user vsetvl from "vsetvl a2,a2" into "vsetvl zero,a2" if "a2" is not used by any instructions.
> Since from Phase 1 ~ Phase 4 which inserts "vsetvli" instructions base on LCM which change the CFG, I re-new a new
> RTL_SSA framework (which is more expensive than just using DF) for Phase 6 and optmize user vsetvli base on the new RTL_SSA.
This one isn't as clear cut, but I still think it's the right thing to 
do.  The first form explicitly kills the value in a2 while the second 
does not.  Though if the value is dead it's going to be discoverable by 
DF and we should end up with a REG_DEAD note as well.   It does have 
the advantage that it does not open a new live range.
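The liveness reasoning behind rewriting "vsetvl rd, rs1" into "vsetvl zero, rs1" can be sketched as a toy model. This is plain Python standing in for GCC's DF machinery, not the actual API; the instruction encoding and register names are invented for illustration:

```python
# Toy model of the "vsetvl rd, rs1 -> vsetvl zero, rs1" legality check.
# Instructions are (dest, uses) tuples; GCC's real check walks the DF
# live-out bitmap and df_find_use instead.

def result_is_dead(block, index, live_out):
    """True if the register defined at block[index] has no later use
    in the block and is not live out of the block."""
    dest, _ = block[index]
    for later_dest, later_uses in block[index + 1:]:
        if dest in later_uses:
            return False          # a later reader: result is live
        if later_dest == dest:
            return True           # redefined before any use
    return dest not in live_out

block = [
    ("a3", ["a2"]),   # vsetvli a3, a2, ...
    ("v1", ["a0"]),   # vle8.v v1, 0(a0)  (AVL use of a3 already removed)
]
print(result_is_dead(block, 0, live_out={"a0", "a1"}))  # True: a3 is dead
```

If the block instead still read a3 (say the AVL operand had not been stripped in Phase 5), the scan finds the use and the rewrite is rejected, which is exactly why Phase 6 has to run after Phase 5.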

> 
> There are 2 issues in Phase 5 && Phase 6:
> 1. local_eliminate_vsetvl_insn was introduced by @kito which can do better local user vsetvl optimizations better than
>     Phase 6 do, such approach doesn't need to re-new the RTL_SSA framework. So the local user vsetvli instructions optimizaiton
>     in Phase 6 is redundant and should be removed.
> 2. A bug discovered by my downstream auto-vectorization test-generator (I can't put the test in this patch since we are missing autovec
>     patterns for it so we can't use the upstream GCC directly reproduce such issue but I will remember put it back after I support the
>     necessary autovec patterns). Such bug is causing by using RTL_SSA re-new framework. The issue description is this:
Note that you could potentially go ahead and submit that test and just 
xfail it.  Not a requirement, but a possibility that I sometimes use if 
I know I've got a fix coming shortly.
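As a concrete illustration of that suggestion: the scan directives already used in this series accept an xfail selector, so a not-yet-passing expectation could be committed along these lines (the pattern and count here are placeholders, since the actual test isn't in this patch):

```c
/* Hypothetical expectation, expected to fail until the autovec
   patterns land; drop the xfail together with the fix.  */
/* { dg-final { scan-assembler-times {vsetvli} 2 { xfail *-*-* } } } */
```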


>     
> Before Phase 6:
>     ...
>     insn1: vsetlvi a3, 17 <========== generated by SELECT_VL auto-vec pattern.
>     slli a4,a3,3
>     ...
>     insn2: vsetvli zero, a3, ...
>     load (use const_int 0, before Phase 5, it's using a3, but the use of "a3" is removed in Phase 5)
>     ...
> 
> In Phase 6, we iterate to insn2, then get the def of "a3" which is the insn1.
> insn2 is the vsetvli instruction inserted in Phase 4 which is not included in the RLT_SSA framework
> even though we renew it (I didn't take a look at it and I don't think we need to now).
> Base on this situation, the def_info of insn2 has the information "set->single_nondebug_insn_use ()"
> which return true. Obviously, this information is not correct, since insn1 has aleast 2 uses:
> 1). slli a4,a3,3 2).insn2: vsetvli zero, a3, ... Then, the test generated by my downstream test-generator
> execution test failed.
Understood.

> 
> Conclusion of RTL_SSA framework:
> Before this patch, we initialize RTL_SSA 2 times. One is at the beginning of the VSETVL PASS which is absolutely correct, the other
> is re-new after Phase 4 (LCM) has incorrect information that causes bugs.
> 
> Besides, we don't like to initialize RTL_SSA second time it seems to be a waste since we just need to do a little optimization.
> 
> Base on all circumstances I described above, I rework and reorganize Phase 5 && Phase 6 as follows:
> 1. Phase 5 is called ssa_post_optimization which is doing the optimization base on the RTL_SSA information (The RTL_SSA is initialized
>     at the beginning of the VSETVL PASS, no need to re-new it again). This phase includes 3 optimizaitons:
>     1). local_eliminate_vsetvl_insn we already have (no change).
>     2). global_eliminate_vsetvl_insn ---> new optimizaiton splitted from orignal Phase 6 but with more powerful and reliable implementation.
>        E.g.
>        void f(int8_t *base, int8_t *out, size_t vl, size_t m, size_t k) {
>          size_t avl;
>          if (m > 100)
>            avl = __riscv_vsetvl_e16mf4(vl << 4);
>          else
>            avl = __riscv_vsetvl_e32mf2(vl >> 8);
>          for (size_t i = 0; i < m; i++) {
>            vint8mf8_t v0 = __riscv_vle8_v_i8mf8(base + i, avl);
>            v0 = __riscv_vadd_vv_i8mf8 (v0, v0, avl);
>            __riscv_vse8_v_i8mf8(out + i, v0, avl);
>          }
>        }
> 
>        This example failed to global user vsetvl optimize before this patch:
>        f:
>                li      a5,100
>                bleu    a3,a5,.L2
>                slli    a2,a2,4
>                vsetvli a4,a2,e16,mf4,ta,mu
>        .L3:
>                li      a5,0
>                vsetvli zero,a4,e8,mf8,ta,ma
>        .L5:
>                add     a6,a0,a5
>                add     a2,a1,a5
>                vle8.v  v1,0(a6)
>                addi    a5,a5,1
>                vadd.vv v1,v1,v1
>                vse8.v  v1,0(a2)
>                bgtu    a3,a5,.L5
>        .L10:
>                ret
>        .L2:
>                beq     a3,zero,.L10
>                srli    a2,a2,8
>                vsetvli a4,a2,e32,mf2,ta,mu
>                j       .L3
>        With this patch:
>        f:
>                li      a5,100
>                bleu    a3,a5,.L2
>                slli    a2,a2,4
>                vsetvli zero,a2,e8,mf8,ta,ma
>        .L3:
>                li      a5,0
>        .L5:
>                add     a6,a0,a5
>                add     a2,a1,a5
>                vle8.v  v1,0(a6)
>                addi    a5,a5,1
>                vadd.vv v1,v1,v1
>                vse8.v  v1,0(a2)
>                bgtu    a3,a5,.L5
>        .L10:
>                ret
>        .L2:
>                beq     a3,zero,.L10
>                srli    a2,a2,8
>                vsetvli zero,a2,e8,mf8,ta,ma
>                j       .L3
> 
>     3). Remove AVL operand dependency of each RVV instructions.
> 
> 2. Phase 6 is called df_post_optimization: Optimize "vsetvl a3,a2...." into Optimize "vsetvl zero,a2...." base on
>     dataflow analysis of new CFG (new CFG is created by LCM). The reason we need to do use new CFG and after Phase 5:
>     ...
>     vsetvl a3, a2...
>     vadd.vv (use a3)
>     If we don't have Phase 5 which removes the "a3" use in vadd.vv, we will fail to optimize vsetvl a3,a2 into vsetvl zero,a2.
>     
>     This patch passed all tests in rvv.exp with ONLY peformance && codegen improved (no performance decline and no bugs including my
>     downstream tests).
> 
> gcc/ChangeLog:
> 
>          * config/riscv/riscv-vsetvl.cc (available_occurrence_p): Ehance user vsetvl optimization.
>          (vector_insn_info::parse_insn): Add rtx_insn parse.
>          (pass_vsetvl::local_eliminate_vsetvl_insn): Ehance user vsetvl optimization.
>          (get_first_vsetvl): New function.
>          (pass_vsetvl::global_eliminate_vsetvl_insn): Ditto.
>          (pass_vsetvl::cleanup_insns): Remove it.
>          (pass_vsetvl::ssa_post_optimization): New function.
>          (has_no_uses): Ditto.
>          (pass_vsetvl::propagate_avl): Remove it.
>          (pass_vsetvl::df_post_optimization): New function.
>          (pass_vsetvl::lazy_vsetvl): Rework Phase 5 && Phase 6.
>          * config/riscv/riscv-vsetvl.h: Adapt declaration.
> 
> gcc/testsuite/ChangeLog:
> 
>          * gcc.target/riscv/rvv/vsetvl/vsetvl-16.c: Adapt test.
>          * gcc.target/riscv/rvv/vsetvl/vsetvl-2.c: Ditto.
>          * gcc.target/riscv/rvv/vsetvl/vsetvl-3.c: Ditto.
>          * gcc.target/riscv/rvv/vsetvl/vsetvl-21.c: New test.
>          * gcc.target/riscv/rvv/vsetvl/vsetvl-22.c: New test.
>          * gcc.target/riscv/rvv/vsetvl/vsetvl-23.c: New test.
s/Ehance/Enhance/

I would probably have suggested this get broken down into smaller 
chunks.  I think you've got multiple things going on in this patch.  I 
realize there may be some interdependencies, but they can often be dealt 
with.



> @@ -4277,27 +4285,187 @@ pass_vsetvl::local_eliminate_vsetvl_insn (const bb_info *bb) const
>       }
>   }
>   
> -/* Before VSETVL PASS, RVV instructions pattern is depending on AVL operand
> -   implicitly. Since we will emit VSETVL instruction and make RVV instructions
> -   depending on VL/VTYPE global status registers, we remove the such AVL operand
> -   in the RVV instructions pattern here in order to remove AVL dependencies when
> -   AVL operand is a register operand.
> -
> -   Before the VSETVL PASS:
> -     li a5,32
> -     ...
> -     vadd.vv (..., a5)
> -   After the VSETVL PASS:
> -     li a5,32
> -     vsetvli zero, a5, ...
> -     ...
> -     vadd.vv (..., const_int 0).  */
> +/* Get the first vsetvl instructions of the block.  */
I'd adjust the comment a bit, perhaps something like this:

/* Return the first vsetvl instruction in CFG_BB or NULL if
    none exists or if a user RVV instruction is encountered
    prior to any vsetvl.  */

> +static rtx_insn *
> +get_first_vsetvl (basic_block cfg_bb)
I'd probably adjust the name as well.  There's an important exception to 
returning the first vsetvl -- you stop the search if you encounter a 
user RVV instruction.
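The search-with-early-exit shape being described here can be modeled outside GCC. Python stands in for the rtx walk, and the string kinds are invented; the real code tests vsetvl_insn_p and the VL/VTYPE-consuming property of each insn:

```python
def first_vsetvl_before_rvv_use(insns):
    """Return the index of the first vsetvl in the block, or None if
    none exists or a user RVV instruction is encountered first."""
    for i, kind in enumerate(insns):
        if kind == "vsetvl":
            return i
        if kind == "rvv":
            return None           # user RVV insn seen first: give up
    return None

print(first_vsetvl_before_rvv_use(["scalar", "vsetvl", "rvv"]))  # 1
print(first_vsetvl_before_rvv_use(["scalar", "rvv", "vsetvl"]))  # None
```

The second case is the one the name should hint at: there is a vsetvl in the block, but it is not returned because a user RVV instruction precedes it.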



> +bool
> +pass_vsetvl::global_eliminate_vsetvl_insn (const bb_info *bb) const
> +{
[ ... ]
> +
> +  /* No need to optimize if block doesn't have vsetvl instructions.  */
> +  if (!dem.valid_or_dirty_p () || !vsetvl_rinsn || !dem.get_avl_source ()
> +      || !dem.has_avl_reg ())
> +    return false;
It is considered best practice to test the cheapest conditional first 
(within the constraints of correctness).  So I probably would have 
checked !vsetvl_rinsn first.   Resulting in

   if (!vsetvl_rinsn || !dem.valid_or_dirty_p ()
       || !dem.get_avl_source () || !dem.has_avl_reg ())

Or

   if (!vsetvl_rinsn
       || !dem.valid_or_dirty_p ()
       || !dem.get_avl_source ()
       || !dem.has_avl_reg ())


The formatting in this case is more a personal preference.  So don't 
consider changing the formatting to be a requirement to move forward.


> +
> +  /* If all preds has VL/VTYPE status setted by user vsetvls, and these
> +     user vsetvls are all skip_avl_compatible_p with the vsetvl in this
> +     block, we can eliminate this vsetvl instruction.  */
> +  sbitmap avin = m_vector_manager->vector_avin[cfg_bb->index];
> +
> +  unsigned int bb_index;
> +  sbitmap_iterator sbi;
> +  rtx avl = get_avl (dem.get_insn ()->rtl ());
> +  hash_set<set_info *> sets
> +    = get_all_sets (dem.get_avl_source (), true, false, false);
> +  /* Condition 1: All VL/VTYPE available in are all compatible.  */
> +  EXECUTE_IF_SET_IN_BITMAP (avin, 0, bb_index, sbi)
> +    {
> +      const auto &expr = m_vector_manager->vector_exprs[bb_index];
> +      const auto &insn = expr->get_insn ();
> +      def_info *def = find_access (insn->defs (), REGNO (avl));
> +      set_info *set = safe_dyn_cast<set_info *> (def);
> +      if (!vsetvl_insn_p (insn->rtl ()) || insn->bb () == bb
> +	  || !sets.contains (set))
> +	return false;
> +    }
> +
> +  /* Condition 2: Check it has preds.  */
> +  if (EDGE_COUNT (cfg_bb->preds) == 0)
> +    return false;
Not a big deal, but under what circumstances are we running into blocks 
with no predecessors?  The only block that should have that property is 
the entry block.   Similarly if you have no preds, then ISTM that avin 
will always be empty.  So if we can validly have a block with no preds, 
then shouldn't this check go before walking AVIN just from a 
compile-time standpoint?




> @@ -4342,135 +4510,81 @@ pass_vsetvl::cleanup_insns (void) const
>       }
>   }
>   
> +/* Return true if the SET result is not used by any instructions.  */
> +static bool
> +has_no_uses (basic_block cfg_bb, rtx_insn *rinsn, int regno)
> +{
> +  /* Handle the following case that can not be detected in RTL_SSA.  */
> +  /* E.g.
> +	  li a5, 100
> +	  vsetvli a6, a5...
> +	  ...
> +	  vadd (use a6)
> +
> +	The use of "a6" is removed from "vadd" but the information is
> +	not updated in RTL_SSA framework. We don't want to re-new
> +	a new RTL_SSA which is expensive, instead, we use data-flow
> +	analysis to check whether "a6" has no uses.  */
I'm a bit surprised there wasn't a reasonable way to update the RTL SSA 
framework for this case.  If we were to remove the entire vadd, then we 
would have to update the uses of a6.  If we have that capability, then I 
would expect we could refactor the updating code so that we had an API 
to remove an operand from an instruction.

In fact, if we have a constant propagator in the RTL SSA framework, 
wouldn't it have to have this capability?

I'm not objecting to what you've done at this time, but it seems like a 
better way might be possible.  So the ask is to review the RTL SSA code 
to see if there's reasonable building blocks to do what you want.
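For what it's worth, the bookkeeping such an incremental update would have to do can be modeled with a toy def-use chain. This is deliberately not the rtl-ssa API, just an illustration of detaching one use in place so that use counts stay accurate instead of going stale:

```python
# Toy def-use chain: a definition records which instructions read it.
# Removing the AVL operand from an RVV insn should detach that use,
# keeping queries like single_nondebug_insn_use() truthful.

class Def:
    def __init__(self, reg):
        self.reg = reg
        self.uses = []            # ids of instructions reading this def

    def remove_use(self, insn_id):
        self.uses.remove(insn_id)

    def single_nondebug_use(self):
        return len(self.uses) == 1

a6 = Def("a6")
a6.uses = ["slli", "vadd_avl"]    # a6 set by vsetvli, read twice
print(a6.single_nondebug_use())   # False: two readers
a6.remove_use("vadd_avl")         # Phase 5 drops the AVL operand
print(a6.single_nondebug_use())   # True, and the chain still agrees
```

The bug described in the commit message is exactly the absence of the remove_use step: the operand disappeared from the insn but the framework's use list still counted it, so queries against the chain gave wrong answers.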


Overall it looks pretty good.  The biggest concern is the change to use 
DF use information rather than the RTL SSA framework.  That may 
ultimately be a reasonable thing to do, but I'd like you to confirm that 
we don't have the right building blocks in the RTL SSA framework to do 
the incremental update you seem to need.

Thanks,
jeff
juzhe.zhong@rivai.ai June 9, 2023, 2:46 p.m. UTC | #4
Thanks Jeff.
Actually, the RTL_SSA framework is a very useful tool, quite similar to LLVM's SDNode framework,
which is the framework I am familiar with. I just realized that the second build of RTL_SSA causes bugs;
that's why I changed it to data-flow analysis.

I will address all comments and send V3 soon.

Thanks.


juzhe.zhong@rivai.ai
 
From: Jeff Law
Date: 2023-06-09 22:33
To: juzhe.zhong; gcc-patches
CC: kito.cheng; kito.cheng; palmer; palmer; rdapp.gcc; pan2.li
Subject: Re: [PATCH V2] RISC-V: Rework Phase 5 && Phase 6 of VSETVL PASS
 
 
On 6/9/23 04:41, juzhe.zhong@rivai.ai wrote:
> From: Juzhe-Zhong <juzhe.zhong@rivai.ai>
> 
> This patch is to rework Phase 5 && Phase 6 of VSETVL PASS since Phase 5 && Phase 6
> are quite messy and cause some bugs discovered by my downstream auto-vectorization
> test-generator.
> 
> Before this patch.
> 
> Phase 5 is cleanup_insns is the function remove AVL operand dependency from each RVV instruction.
> E.g. vadd.vv (use a5), after Phase 5, ====> vadd.vv (use const_int 0). Since "a5" is used in "vsetvl" instructions and
> after the correct "vsetvl" instructions are inserted, each RVV instruction doesn't need AVL operand "a5" anymore. Then,
> we remove this operand dependency helps for the following scheduling PASS.
Right.  Removal of the unused operand gives the scheduler more freedom. 
It's not clear yet how much gain there is for scheduling vector on RV, 
but there's no good reason to handcuff it with unnecessary dependencies.
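The scheduling point can be shown with a toy dependence test (an invented representation; a real scheduler consults full def/use sets and memory dependencies): once the AVL register operand is rewritten to const_int 0, the RVV instruction no longer carries a read-after-write dependence on the instruction that set the AVL register, so the two become reorderable.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy instruction: one defined register, one used register,
   -1 meaning "no register".  Illustrative only.  */
struct sched_insn {
  int def;
  int use;
};

/* Two adjacent instructions may be swapped when there is no RAW,
   WAR, or WAW dependence between them.  */
static bool
can_swap (struct sched_insn a, struct sched_insn b)
{
  return b.use != a.def      /* no true (RAW) dependence */
         && a.use != b.def   /* no anti (WAR) dependence */
         && a.def != b.def;  /* no output (WAW) dependence */
}
```

With register 5 standing in for a5 and register 33 for the vector destination: "vadd.vv (..., a5)" cannot move above "li a5,32", but "vadd.vv (..., const_int 0)" can.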
 
 
> 
> Phase 6 is propagate_avl do the following 2 things:
> 1. Local && Global user vsetvl instructions optimization.
>     E.g.
>        vsetvli a2, a2, e8, mf8   ======> Change it into vsetvli a2, a2, e32, mf2
>        vsetvli zero,a2, e32, mf2  ======> eliminate
Always good to eliminate more instructions.   So while vsetvl is 
designed to be minimal overhead and it's fully expected that we'll see a 
lot of them, there's no good reason to have unnecessary ones in the stream.
 
 
> 2. Optimize user vsetvl from "vsetvl a2,a2" into "vsetvl zero,a2" if "a2" is not used by any instructions.
> Since from Phase 1 ~ Phase 4 which inserts "vsetvli" instructions base on LCM which change the CFG, I re-new a new
> RTL_SSA framework (which is more expensive than just using DF) for Phase 6 and optmize user vsetvli base on the new RTL_SSA.
This one isn't as clear cut, but I still think it's the right thing to 
do.  The first form explicitly kills the value in a2 while the second 
does not.  Though if the value is dead it's going to be discoverable by 
DF and we should also end up with REG_DEAD note as well.   It does have 
the advantage that it does not open a new live range.
 
> 
> There are 2 issues in Phase 5 && Phase 6:
> 1. local_eliminate_vsetvl_insn was introduced by @kito which can do better local user vsetvl optimizations better than
>     Phase 6 do, such approach doesn't need to re-new the RTL_SSA framework. So the local user vsetvli instructions optimizaiton
>     in Phase 6 is redundant and should be removed.
> 2. A bug discovered by my downstream auto-vectorization test-generator (I can't put the test in this patch since we are missing autovec
>     patterns for it so we can't use the upstream GCC directly reproduce such issue but I will remember put it back after I support the
>     necessary autovec patterns). Such bug is causing by using RTL_SSA re-new framework. The issue description is this:
Note that you could potentially go ahead and submit that test and just 
xfail it.  Not a requirement, but a possibility that I sometimes use if 
I know I've got a fix coming shortly.
 
 
>     
> Before Phase 6:
>     ...
>     insn1: vsetlvi a3, 17 <========== generated by SELECT_VL auto-vec pattern.
>     slli a4,a3,3
>     ...
>     insn2: vsetvli zero, a3, ...
>     load (use const_int 0, before Phase 5, it's using a3, but the use of "a3" is removed in Phase 5)
>     ...
> 
> In Phase 6, we iterate to insn2, then get the def of "a3" which is the insn1.
> insn2 is the vsetvli instruction inserted in Phase 4 which is not included in the RLT_SSA framework
> even though we renew it (I didn't take a look at it and I don't think we need to now).
> Base on this situation, the def_info of insn2 has the information "set->single_nondebug_insn_use ()"
> which return true. Obviously, this information is not correct, since insn1 has aleast 2 uses:
> 1). slli a4,a3,3 2).insn2: vsetvli zero, a3, ... Then, the test generated by my downstream test-generator
> execution test failed.
Understood.
 
> 
> Conclusion of RTL_SSA framework:
> Before this patch, we initialize RTL_SSA 2 times. One is at the beginning of the VSETVL PASS which is absolutely correct, the other
> is re-new after Phase 4 (LCM) has incorrect information that causes bugs.
> 
> Besides, we don't like to initialize RTL_SSA second time it seems to be a waste since we just need to do a little optimization.
> 
> Base on all circumstances I described above, I rework and reorganize Phase 5 && Phase 6 as follows:
> 1. Phase 5 is called ssa_post_optimization which is doing the optimization base on the RTL_SSA information (The RTL_SSA is initialized
>     at the beginning of the VSETVL PASS, no need to re-new it again). This phase includes 3 optimizaitons:
>     1). local_eliminate_vsetvl_insn we already have (no change).
>     2). global_eliminate_vsetvl_insn ---> new optimizaiton splitted from orignal Phase 6 but with more powerful and reliable implementation.
>        E.g.
>        void f(int8_t *base, int8_t *out, size_t vl, size_t m, size_t k) {
>          size_t avl;
>          if (m > 100)
>            avl = __riscv_vsetvl_e16mf4(vl << 4);
>          else
>            avl = __riscv_vsetvl_e32mf2(vl >> 8);
>          for (size_t i = 0; i < m; i++) {
>            vint8mf8_t v0 = __riscv_vle8_v_i8mf8(base + i, avl);
>            v0 = __riscv_vadd_vv_i8mf8 (v0, v0, avl);
>            __riscv_vse8_v_i8mf8(out + i, v0, avl);
>          }
>        }
> 
>        This example failed to global user vsetvl optimize before this patch:
>        f:
>                li      a5,100
>                bleu    a3,a5,.L2
>                slli    a2,a2,4
>                vsetvli a4,a2,e16,mf4,ta,mu
>        .L3:
>                li      a5,0
>                vsetvli zero,a4,e8,mf8,ta,ma
>        .L5:
>                add     a6,a0,a5
>                add     a2,a1,a5
>                vle8.v  v1,0(a6)
>                addi    a5,a5,1
>                vadd.vv v1,v1,v1
>                vse8.v  v1,0(a2)
>                bgtu    a3,a5,.L5
>        .L10:
>                ret
>        .L2:
>                beq     a3,zero,.L10
>                srli    a2,a2,8
>                vsetvli a4,a2,e32,mf2,ta,mu
>                j       .L3
>        With this patch:
>        f:
>                li      a5,100
>                bleu    a3,a5,.L2
>                slli    a2,a2,4
>                vsetvli zero,a2,e8,mf8,ta,ma
>        .L3:
>                li      a5,0
>        .L5:
>                add     a6,a0,a5
>                add     a2,a1,a5
>                vle8.v  v1,0(a6)
>                addi    a5,a5,1
>                vadd.vv v1,v1,v1
>                vse8.v  v1,0(a2)
>                bgtu    a3,a5,.L5
>        .L10:
>                ret
>        .L2:
>                beq     a3,zero,.L10
>                srli    a2,a2,8
>                vsetvli zero,a2,e8,mf8,ta,ma
>                j       .L3
> 
>     3). Remove AVL operand dependency of each RVV instructions.
> 
> 2. Phase 6 is called df_post_optimization: Optimize "vsetvl a3,a2...." into Optimize "vsetvl zero,a2...." base on
>     dataflow analysis of new CFG (new CFG is created by LCM). The reason we need to do use new CFG and after Phase 5:
>     ...
>     vsetvl a3, a2...
>     vadd.vv (use a3)
>     If we don't have Phase 5 which removes the "a3" use in vadd.vv, we will fail to optimize vsetvl a3,a2 into vsetvl zero,a2.
>     
>     This patch passed all tests in rvv.exp with ONLY peformance && codegen improved (no performance decline and no bugs including my
>     downstream tests).
> 
> gcc/ChangeLog:
> 
>          * config/riscv/riscv-vsetvl.cc (available_occurrence_p): Ehance user vsetvl optimization.
>          (vector_insn_info::parse_insn): Add rtx_insn parse.
>          (pass_vsetvl::local_eliminate_vsetvl_insn): Ehance user vsetvl optimization.
>          (get_first_vsetvl): New function.
>          (pass_vsetvl::global_eliminate_vsetvl_insn): Ditto.
>          (pass_vsetvl::cleanup_insns): Remove it.
>          (pass_vsetvl::ssa_post_optimization): New function.
>          (has_no_uses): Ditto.
>          (pass_vsetvl::propagate_avl): Remove it.
>          (pass_vsetvl::df_post_optimization): New function.
>          (pass_vsetvl::lazy_vsetvl): Rework Phase 5 && Phase 6.
>          * config/riscv/riscv-vsetvl.h: Adapt declaration.
> 
> gcc/testsuite/ChangeLog:
> 
>          * gcc.target/riscv/rvv/vsetvl/vsetvl-16.c: Adapt test.
>          * gcc.target/riscv/rvv/vsetvl/vsetvl-2.c: Ditto.
>          * gcc.target/riscv/rvv/vsetvl/vsetvl-3.c: Ditto.
>          * gcc.target/riscv/rvv/vsetvl/vsetvl-21.c: New test.
>          * gcc.target/riscv/rvv/vsetvl/vsetvl-22.c: New test.
>          * gcc.target/riscv/rvv/vsetvl/vsetvl-23.c: New test.
s/Ehance/Enhance/
 
I would probably have suggested this get broken down into smaller 
chunks.  I think you've got multiple things going on in this patch.  I 
realize there may be some interdependencies, but they can often be dealt 
with.
 
 
 
> @@ -4277,27 +4285,187 @@ pass_vsetvl::local_eliminate_vsetvl_insn (const bb_info *bb) const
>       }
>   }
>   
> -/* Before VSETVL PASS, RVV instructions pattern is depending on AVL operand
> -   implicitly. Since we will emit VSETVL instruction and make RVV instructions
> -   depending on VL/VTYPE global status registers, we remove the such AVL operand
> -   in the RVV instructions pattern here in order to remove AVL dependencies when
> -   AVL operand is a register operand.
> -
> -   Before the VSETVL PASS:
> -     li a5,32
> -     ...
> -     vadd.vv (..., a5)
> -   After the VSETVL PASS:
> -     li a5,32
> -     vsetvli zero, a5, ...
> -     ...
> -     vadd.vv (..., const_int 0).  */
> +/* Get the first vsetvl instructions of the block.  */
I'd adjust the comment a bit, perhaps something like this:
 
/* Return the first vsetvl instruction in CFG_BB or NULL if
    none exists or if a user RVV instruction is encountered
    prior to any vsetvl.  */
 
> +static rtx_insn *
> +get_first_vsetvl (basic_block cfg_bb)
I'd probably adjust the name as well.  There's an important exception to 
returning the first vsetvl -- you stop the search if you encounter a 
user RVV instruction.
 
 
 
> +bool
> +pass_vsetvl::global_eliminate_vsetvl_insn (const bb_info *bb) const
> +{
[ ... ]
> +
> +  /* No need to optimize if block doesn't have vsetvl instructions.  */
> +  if (!dem.valid_or_dirty_p () || !vsetvl_rinsn || !dem.get_avl_source ()
> +      || !dem.has_avl_reg ())
> +    return false;
It is considered best practice to test the cheapest conditional first 
(within the constraints of correctness).  So I probably would have 
checked !vsetvl_rinsn first.   Resulting in
 
   if (!vsetvl_rinsn || !dem.valid_or_dirty_p ()
       || !dem.get_avl_source () || !dem.has_avl_reg ())
 
Or
 
   if (!vsetvl_rinsn
       || !dem.valid_or_dirty_p ()
       || !dem.get_avl_source ()
       || !dem.has_avl_reg ())
 
 
The formatting in this case is more a personal preference.  So don't 
consider changing the formatting to be a requirement to move forward.
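The short-circuit argument is easy to demonstrate in isolation (the function names here are invented stand-ins, not the pass's real predicates): when the cheap test fails first, `&&` never evaluates the expensive one.

```c
#include <assert.h>
#include <stdbool.h>

static int expensive_calls;    /* counts how often the costly test runs */

/* Stand-in for a comparatively costly query such as walking def-use
   information; the name is invented for this sketch.  */
static bool
expensive_check (void)
{
  expensive_calls++;
  return true;
}

/* Cheapest condition first: a plain pointer test.  If it fails,
   short-circuit evaluation skips the expensive check entirely.  */
static bool
should_optimize (const void *vsetvl_rinsn)
{
  return vsetvl_rinsn != 0 && expensive_check ();
}
```

The same reasoning applies to Jeff's suggested reordering: `!vsetvl_rinsn` is a single pointer test, so putting it first lets the common "no vsetvl here" case bail out without touching `dem` at all.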
 
 
> +
> +  /* If all preds has VL/VTYPE status setted by user vsetvls, and these
> +     user vsetvls are all skip_avl_compatible_p with the vsetvl in this
> +     block, we can eliminate this vsetvl instruction.  */
> +  sbitmap avin = m_vector_manager->vector_avin[cfg_bb->index];
> +
> +  unsigned int bb_index;
> +  sbitmap_iterator sbi;
> +  rtx avl = get_avl (dem.get_insn ()->rtl ());
> +  hash_set<set_info *> sets
> +    = get_all_sets (dem.get_avl_source (), true, false, false);
> +  /* Condition 1: All VL/VTYPE available in are all compatible.  */
> +  EXECUTE_IF_SET_IN_BITMAP (avin, 0, bb_index, sbi)
> +    {
> +      const auto &expr = m_vector_manager->vector_exprs[bb_index];
> +      const auto &insn = expr->get_insn ();
> +      def_info *def = find_access (insn->defs (), REGNO (avl));
> +      set_info *set = safe_dyn_cast<set_info *> (def);
> +      if (!vsetvl_insn_p (insn->rtl ()) || insn->bb () == bb
> +   || !sets.contains (set))
> + return false;
> +    }
> +
> +  /* Condition 2: Check it has preds.  */
> +  if (EDGE_COUNT (cfg_bb->preds) == 0)
> +    return false;
Not a big deal, but under what circumstances are we running into blocks 
with no predecessors?  The only block that should have that property is 
the entry block.   Similarly if you have no preds, then ISTM that avin 
will always be empty.  So if we can validly have a block with no preds, 
then shouldn't this check go before walking AVIN just from a 
compile-time standpoint?
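The empty-AVIN point can be sketched with a toy bitmask model: under an intersection-based availability problem, a block with no predecessors trivially gets an empty AVIN, so the predecessor check is essentially free and could run before the bitmap walk. The representation below is invented; GCC uses sbitmaps indexed by expression number.

```c
#include <assert.h>
#include <stdint.h>

/* Toy availability propagation: AVIN(b) is the intersection of
   AVOUT(p) over all predecessors p of b, modelled with one uint32_t
   bitmask per block.  */
static uint32_t
compute_avin (const uint32_t *avout, const int *preds, int n_preds)
{
  if (n_preds == 0)
    return 0;                  /* no predecessors: nothing available */
  uint32_t avin = avout[preds[0]];
  for (int i = 1; i < n_preds; i++)
    avin &= avout[preds[i]];   /* keep only what ALL preds provide */
  return avin;
}
```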
 
 
 
 
> @@ -4342,135 +4510,81 @@ pass_vsetvl::cleanup_insns (void) const
>       }
>   }
>   
> +/* Return true if the SET result is not used by any instructions.  */
> +static bool
> +has_no_uses (basic_block cfg_bb, rtx_insn *rinsn, int regno)
> +{
> +  /* Handle the following case that can not be detected in RTL_SSA.  */
> +  /* E.g.
> +   li a5, 100
> +   vsetvli a6, a5...
> +   ...
> +   vadd (use a6)
> +
> + The use of "a6" is removed from "vadd" but the information is
> + not updated in RTL_SSA framework. We don't want to re-new
> + a new RTL_SSA which is expensive, instead, we use data-flow
> + analysis to check whether "a6" has no uses.  */
I'm a bit surprised there wasn't a reasonable way to update the RTL SSA 
framework for this case.  If we were to remove the entire vadd, then we 
would have to update the uses of a6.  If we have that capability, then I 
would expect we could refactor the updating code so that we had an API 
to remove an operand from an instruction.
 
In fact, if we have a constant propagator in the RTL SSA framework, 
wouldn't it have to have this capability?
 
I'm not objecting to what you've done at this time, but it seems like a 
better way might be possible.  So the ask is to review the RTL SSA code 
to see if there's reasonable building blocks to do what you want.
 
 
Overall it looks pretty good.  The biggest concern is the change to use 
DF use information rather than the RTL SSA framework.  That may 
ultimately be a reasonable thing to do, but I'd like you to confirm that 
we don't have the right building blocks in the RTL SSA framework to do 
the incremental update you seem to need.
 
Thanks,
jeff
juzhe.zhong@rivai.ai June 9, 2023, 2:58 p.m. UTC | #5
>> I'd probably adjust the name as well.  There's an important exception to 
>> returning the first vsetvl -- you stop the search if you encounter a 
>> user RVV instruction.
Could you suggest a function name for this?
Something like:
get_first_vsetvl_prior_all_rvv_insns
Is that OK? But I think the name is too long.


juzhe.zhong@rivai.ai
 
Jeff Law June 9, 2023, 3:09 p.m. UTC | #6
On 6/9/23 08:58, 钟居哲 wrote:
>>> I'd probably adjust the name as well.  There's an important exception to 
>>> returning the first vsetvl -- you stop the search if you encounter a
>>> user RVV instruction.
> 
> Could you give me a function name of this?
> like:
> get_first_vsetvl_prior_all_rvv_insns
> is it ok? But I think the name is too long.
get_first_vsetvl_before_rvv_insns?  It's a bit smaller and I think 
captures the key exception -- does that work for you?

Jeff
juzhe.zhong@rivai.ai June 9, 2023, 10:52 p.m. UTC | #7
Ok. Thanks.



juzhe.zhong@rivai.ai
 
Richard Sandiford June 12, 2023, 7:02 p.m. UTC | #8
Jeff Law via Gcc-patches <gcc-patches@gcc.gnu.org> writes:
> On 6/9/23 04:41, juzhe.zhong@rivai.ai wrote:
>> @@ -4342,135 +4510,81 @@ pass_vsetvl::cleanup_insns (void) const
>>       }
>>   }
>>   
>> +/* Return true if the SET result is not used by any instructions.  */
>> +static bool
>> +has_no_uses (basic_block cfg_bb, rtx_insn *rinsn, int regno)
>> +{
>> +  /* Handle the following case that can not be detected in RTL_SSA.  */
>> +  /* E.g.
>> +	  li a5, 100
>> +	  vsetvli a6, a5...
>> +	  ...
>> +	  vadd (use a6)
>> +
>> +	The use of "a6" is removed from "vadd" but the information is
>> +	not updated in RTL_SSA framework. We don't want to re-new
>> +	a new RTL_SSA which is expensive, instead, we use data-flow
>> +	analysis to check whether "a6" has no uses.  */
> I'm a bit surprised there wasn't a reasonable way to update the RTL SSA 
> framework for this case.  If we were to remove the entire vadd, then we 
> would have to update the uses of a6.  If we have that capability, then I 
> would expect we could refactor the updating code so that we had an API 
> to remove an operand from an instruction.
>
> In fact, if we have a constant propagator in the RTL SSA framework, 
> wouldn't it have to have this capability?

RTL-SSA isn't supposed to be feature-complete in its current state.
So yeah, if something is missing, it's better to add it to RTL-SSA
rather than work around it in consumers.

(Responding without fully understanding the context though, sorry.)

Thanks,
Richard
Andreas Schwab June 16, 2023, 10:55 a.m. UTC | #9
Why didn't you test that??

../../gcc/config/riscv/riscv-vsetvl.cc: In member function 'bool pass_vsetvl::global_eliminate_vsetvl_insn(const rtl_ssa::bb_info*) const':
../../gcc/config/riscv/riscv-vsetvl.cc:4354:3: error: 'vsetvl_rinsn' may be used uninitialized [-Werror=maybe-uninitialized]
 4354 |   if (!vsetvl_rinsn)
      |   ^~
../../gcc/config/riscv/riscv-vsetvl.cc:4343:13: note: 'vsetvl_rinsn' was declared here
 4343 |   rtx_insn *vsetvl_rinsn;
      |             ^~~~~~~~~~~~
cc1plus: all warnings being treated as errors
make[3]: *** [../../gcc/config/riscv/t-riscv:66: riscv-vsetvl.o] Error 1
Li, Pan2 via Gcc-patches June 16, 2023, 11:39 a.m. UTC | #10
Sorry for the inconvenience; I filed a patch for this, linked below.

https://gcc.gnu.org/pipermail/gcc-patches/2023-June/621980.html

Pan

Patch

diff --git a/gcc/config/riscv/riscv-vsetvl.cc b/gcc/config/riscv/riscv-vsetvl.cc
index fe55f4ccd30..924a94adf9c 100644
--- a/gcc/config/riscv/riscv-vsetvl.cc
+++ b/gcc/config/riscv/riscv-vsetvl.cc
@@ -395,10 +395,15 @@  available_occurrence_p (const bb_info *bb, const vector_insn_info dem)
       if (!vlmax_avl_p (dem.get_avl ()))
 	{
 	  rtx dest = NULL_RTX;
+	  insn_info *i = insn;
 	  if (vsetvl_insn_p (insn->rtl ()))
-	    dest = get_vl (insn->rtl ());
-	  for (const insn_info *i = insn; real_insn_and_same_bb_p (i, bb);
-	       i = i->next_nondebug_insn ())
+	    {
+	      dest = get_vl (insn->rtl ());
+	      /* For a user "vsetvl a2, a2" instruction, we consider it
+		 available even though it modifies "a2".  */
+	      i = i->next_nondebug_insn ();
+	    }
+	  for (; real_insn_and_same_bb_p (i, bb); i = i->next_nondebug_insn ())
 	    {
 	      if (read_vl_insn_p (i->rtl ()))
 		continue;
@@ -1893,11 +1898,13 @@  vector_insn_info::parse_insn (rtx_insn *rinsn)
   *this = vector_insn_info ();
   if (!NONDEBUG_INSN_P (rinsn))
     return;
-  if (!has_vtype_op (rinsn))
+  if (optimize == 0 && !has_vtype_op (rinsn))
+    return;
+  if (optimize > 0 && !vsetvl_insn_p (rinsn))
     return;
   m_state = VALID;
   extract_insn_cached (rinsn);
-  const rtx avl = recog_data.operand[get_attr_vl_op_idx (rinsn)];
+  rtx avl = ::get_avl (rinsn);
   m_avl = avl_info (avl, nullptr);
   m_sew = ::get_sew (rinsn);
   m_vlmul = ::get_vlmul (rinsn);
@@ -2730,10 +2737,11 @@  private:
   /* Phase 5.  */
   rtx_insn *get_vsetvl_at_end (const bb_info *, vector_insn_info *) const;
   void local_eliminate_vsetvl_insn (const bb_info *) const;
-  void cleanup_insns (void) const;
+  bool global_eliminate_vsetvl_insn (const bb_info *) const;
+  void ssa_post_optimization (void) const;
 
   /* Phase 6.  */
-  void propagate_avl (void) const;
+  void df_post_optimization (void) const;
 
   void init (void);
   void done (void);
@@ -4246,7 +4254,7 @@  pass_vsetvl::local_eliminate_vsetvl_insn (const bb_info *bb) const
 
       /* Local AVL compatibility checking is simpler than global, we only
 	 need to check the REGNO is same.  */
-      if (prev_dem.valid_p () && prev_dem.skip_avl_compatible_p (curr_dem)
+      if (prev_dem.valid_or_dirty_p () && prev_dem.skip_avl_compatible_p (curr_dem)
 	  && local_avl_compatible_p (prev_avl, curr_avl))
 	{
 	  /* curr_dem and prev_dem is compatible!  */
@@ -4277,27 +4285,187 @@  pass_vsetvl::local_eliminate_vsetvl_insn (const bb_info *bb) const
     }
 }
 
-/* Before VSETVL PASS, RVV instructions pattern is depending on AVL operand
-   implicitly. Since we will emit VSETVL instruction and make RVV instructions
-   depending on VL/VTYPE global status registers, we remove the such AVL operand
-   in the RVV instructions pattern here in order to remove AVL dependencies when
-   AVL operand is a register operand.
-
-   Before the VSETVL PASS:
-     li a5,32
-     ...
-     vadd.vv (..., a5)
-   After the VSETVL PASS:
-     li a5,32
-     vsetvli zero, a5, ...
-     ...
-     vadd.vv (..., const_int 0).  */
+/* Get the first vsetvl instruction of the block.  */
+static rtx_insn *
+get_first_vsetvl (basic_block cfg_bb)
+{
+  rtx_insn *rinsn;
+  FOR_BB_INSNS (cfg_bb, rinsn)
+    {
+      if (!NONDEBUG_INSN_P (rinsn))
+	continue;
+      /* If we don't find any inserted vsetvli before user RVV instructions,
+	 we don't need to optimize the vsetvls in this block.  */
+      if (has_vtype_op (rinsn) || vsetvl_insn_p (rinsn))
+	return nullptr;
+
+      if (vsetvl_discard_result_insn_p (rinsn))
+	return rinsn;
+    }
+  return nullptr;
+}
+
+/* Global user vsetvl optimization:
+
+     Case 1:
+     bb 1:
+       vsetvl a5,a4,e8,mf8
+       ...
+     bb 2:
+       ...
+       vsetvl zero,a5,e8,mf8 --> Eliminate directly.
+
+     Case 2:
+      bb 1:
+       vsetvl a5,a4,e8,mf8    --> vsetvl a5,a4,e32,mf2
+       ...
+      bb 2:
+       ...
+       vsetvl zero,a5,e32,mf2 --> Eliminate directly.
+
+     Case 3:
+      bb 1:
+       vsetvl a5,a4,e8,mf8    --> vsetvl a5,a4,e32,mf2
+       ...
+      bb 2:
+       ...
+       vsetvl a5,a4,e8,mf8    --> vsetvl a5,a4,e32,mf2
+       goto bb 3
+      bb 3:
+       ...
+       vsetvl zero,a5,e32,mf2 --> Eliminate directly.
+*/
+bool
+pass_vsetvl::global_eliminate_vsetvl_insn (const bb_info *bb) const
+{
+  rtx_insn *vsetvl_rinsn;
+  vector_insn_info dem = vector_insn_info ();
+  const auto &block_info = get_block_info (bb);
+  basic_block cfg_bb = bb->cfg_bb ();
+
+  if (block_info.local_dem.valid_or_dirty_p ())
+    {
+      /* Optimize the local vsetvl.  */
+      dem = block_info.local_dem;
+      vsetvl_rinsn = get_first_vsetvl (cfg_bb);
+    }
+  if (!vsetvl_rinsn)
+    /* Optimize the global vsetvl inserted by LCM.  */
+    vsetvl_rinsn = get_vsetvl_at_end (bb, &dem);
+
+  /* No need to optimize if block doesn't have vsetvl instructions.  */
+  if (!dem.valid_or_dirty_p () || !vsetvl_rinsn || !dem.get_avl_source ()
+      || !dem.has_avl_reg ())
+    return false;
+
+  /* If all preds have their VL/VTYPE status set by user vsetvls, and these
+     user vsetvls are all skip_avl_compatible_p with the vsetvl in this
+     block, we can eliminate this vsetvl instruction.  */
+  sbitmap avin = m_vector_manager->vector_avin[cfg_bb->index];
+
+  unsigned int bb_index;
+  sbitmap_iterator sbi;
+  rtx avl = get_avl (dem.get_insn ()->rtl ());
+  hash_set<set_info *> sets
+    = get_all_sets (dem.get_avl_source (), true, false, false);
+  /* Condition 1: All VL/VTYPE in the available-in set are compatible.  */
+  EXECUTE_IF_SET_IN_BITMAP (avin, 0, bb_index, sbi)
+    {
+      const auto &expr = m_vector_manager->vector_exprs[bb_index];
+      const auto &insn = expr->get_insn ();
+      def_info *def = find_access (insn->defs (), REGNO (avl));
+      set_info *set = safe_dyn_cast<set_info *> (def);
+      if (!vsetvl_insn_p (insn->rtl ()) || insn->bb () == bb
+	  || !sets.contains (set))
+	return false;
+    }
+
+  /* Condition 2: Check it has preds.  */
+  if (EDGE_COUNT (cfg_bb->preds) == 0)
+    return false;
+
+  /* Condition 3: We don't do the global optimization for a block
+     that has the entry or exit block as a predecessor.  */
+  /* Condition 4: All preds have available VL/VTYPE out.  */
+  edge e;
+  edge_iterator ei;
+  FOR_EACH_EDGE (e, ei, cfg_bb->preds)
+    {
+      sbitmap avout = m_vector_manager->vector_avout[e->src->index];
+      if (e->src == ENTRY_BLOCK_PTR_FOR_FN (cfun)
+	  || e->src == EXIT_BLOCK_PTR_FOR_FN (cfun) || bitmap_empty_p (avout))
+	return false;
+
+      EXECUTE_IF_SET_IN_BITMAP (avout, 0, bb_index, sbi)
+	{
+	  const auto &expr = m_vector_manager->vector_exprs[bb_index];
+	  const auto &insn = expr->get_insn ();
+	  def_info *def = find_access (insn->defs (), REGNO (avl));
+	  set_info *set = safe_dyn_cast<set_info *> (def);
+	  if (!vsetvl_insn_p (insn->rtl ()) || insn->bb () == bb
+	      || !sets.contains (set) || !expr->skip_avl_compatible_p (dem))
+	    return false;
+	}
+    }
+
+  /* Step1: Reshape the VL/VTYPE status to make sure everything compatible.  */
+  hash_set<basic_block> pred_cfg_bbs = get_all_predecessors (cfg_bb);
+  FOR_EACH_EDGE (e, ei, cfg_bb->preds)
+    {
+      sbitmap avout = m_vector_manager->vector_avout[e->src->index];
+      EXECUTE_IF_SET_IN_BITMAP (avout, 0, bb_index, sbi)
+	{
+	  vector_insn_info prev_dem = *m_vector_manager->vector_exprs[bb_index];
+	  vector_insn_info curr_dem = dem;
+	  insn_info *insn = prev_dem.get_insn ();
+	  if (!pred_cfg_bbs.contains (insn->bb ()->cfg_bb ()))
+	    continue;
+	  /* Update avl info since we need to make sure they are fully
+	     compatible before merge.  */
+	  curr_dem.set_avl_info (prev_dem.get_avl_info ());
+	  /* Merge both and update into curr_vsetvl.  */
+	  prev_dem = curr_dem.merge (prev_dem, LOCAL_MERGE);
+	  change_vsetvl_insn (insn, prev_dem);
+	}
+    }
+
+  /* Step2: eliminate the vsetvl instruction.  */
+  eliminate_insn (vsetvl_rinsn);
+  return true;
+}
+
+/* This function does the following post optimizations based on RTL_SSA:
+
+   1. Local user vsetvl optimizations.
+   2. Global user vsetvl optimizations.
+   3. AVL dependency removal:
+      Before the VSETVL PASS, RVV instruction patterns depend on the AVL
+      operand implicitly. Since we emit VSETVL instructions and make RVV
+      instructions depend on the VL/VTYPE global status registers, we remove
+      such AVL operands from the RVV instruction patterns here in order to
+      remove AVL dependencies when the AVL operand is a register operand.
+
+      Before the VSETVL PASS:
+	li a5,32
+	...
+	vadd.vv (..., a5)
+      After the VSETVL PASS:
+	li a5,32
+	vsetvli zero, a5, ...
+	...
+	vadd.vv (..., const_int 0).  */
 void
-pass_vsetvl::cleanup_insns (void) const
+pass_vsetvl::ssa_post_optimization (void) const
 {
   for (const bb_info *bb : crtl->ssa->bbs ())
     {
       local_eliminate_vsetvl_insn (bb);
+      bool changed_p = true;
+      while (changed_p)
+	{
+	  changed_p = false;
+	  changed_p |= global_eliminate_vsetvl_insn (bb);
+	}
       for (insn_info *insn : bb->real_nondebug_insns ())
 	{
 	  rtx_insn *rinsn = insn->rtl ();
@@ -4342,135 +4510,81 @@  pass_vsetvl::cleanup_insns (void) const
     }
 }
 
+/* Return true if the SET result is not used by any instructions.  */
+static bool
+has_no_uses (basic_block cfg_bb, rtx_insn *rinsn, int regno)
+{
+  /* Handle the following case that cannot be detected in RTL_SSA.  */
+  /* E.g.
+	  li a5, 100
+	  vsetvli a6, a5...
+	  ...
+	  vadd (use a6)
+
+	The use of "a6" is removed from "vadd" but the information is
+	not updated in the RTL_SSA framework.  Rebuilding RTL_SSA is
+	expensive, so instead we use data-flow analysis to check
+	whether "a6" has no uses.  */
+  if (bitmap_bit_p (df_get_live_out (cfg_bb), regno))
+    return false;
+
+  rtx_insn *iter;
+  for (iter = NEXT_INSN (rinsn); iter && iter != NEXT_INSN (BB_END (cfg_bb));
+       iter = NEXT_INSN (iter))
+    if (df_find_use (iter, regno_reg_rtx[regno]))
+      return false;
+
+  return true;
+}
+
+/* This function does the following post optimizations based on data-flow
+   analysis:
+
+   1. Change vsetvl rd, rs1 --> vsetvl zero, rs1, if rd is not used by any
+   nondebug instructions.  Even though this PASS runs after RA and doesn't
+   help reduce register pressure, it can help instruction scheduling since
+   we remove the dependencies.
+
+   2. Remove redundant user vsetvls based on the outcome of Phase 4 (LCM) &&
+   Phase 5 (AVL dependency removal).  */
 void
-pass_vsetvl::propagate_avl (void) const
-{
-  /* Rebuild the RTL_SSA according to the new CFG generated by LCM.  */
-  /* Finalization of RTL_SSA.  */
-  free_dominance_info (CDI_DOMINATORS);
-  if (crtl->ssa->perform_pending_updates ())
-    cleanup_cfg (0);
-  delete crtl->ssa;
-  crtl->ssa = nullptr;
-  /* Initialization of RTL_SSA.  */
-  calculate_dominance_info (CDI_DOMINATORS);
+pass_vsetvl::df_post_optimization (void) const
+{
   df_analyze ();
-  crtl->ssa = new function_info (cfun);
-
   hash_set<rtx_insn *> to_delete;
-  for (const bb_info *bb : crtl->ssa->bbs ())
+  basic_block cfg_bb;
+  rtx_insn *rinsn;
+  FOR_ALL_BB_FN (cfg_bb, cfun)
     {
-      for (insn_info *insn : bb->real_nondebug_insns ())
+      FOR_BB_INSNS (cfg_bb, rinsn)
 	{
-	  if (vsetvl_discard_result_insn_p (insn->rtl ()))
+	  if (NONDEBUG_INSN_P (rinsn) && vsetvl_insn_p (rinsn))
 	    {
-	      rtx avl = get_avl (insn->rtl ());
-	      if (!REG_P (avl))
-		continue;
-
-	      set_info *set = find_access (insn->uses (), REGNO (avl))->def ();
-	      insn_info *def_insn = extract_single_source (set);
-	      if (!def_insn)
-		continue;
-
-	      /* Handle this case:
-		 vsetvli	a6,zero,e32,m1,ta,mu
-		 li	a5,4096
-		 add	a7,a0,a5
-		 addi	a7,a7,-96
-		 vsetvli	t1,zero,e8,mf8,ta,ma
-		 vle8.v	v24,0(a7)
-		 add	a5,a3,a5
-		 addi	a5,a5,-96
-		 vse8.v	v24,0(a5)
-		 vsetvli	zero,a6,e32,m1,tu,ma
-	      */
-	      if (vsetvl_insn_p (def_insn->rtl ()))
-		{
-		  vl_vtype_info def_info = get_vl_vtype_info (def_insn);
-		  vl_vtype_info info = get_vl_vtype_info (insn);
-		  rtx avl = get_avl (def_insn->rtl ());
-		  rtx vl = get_vl (def_insn->rtl ());
-		  if (def_info.get_ratio () == info.get_ratio ())
-		    {
-		      if (vlmax_avl_p (def_info.get_avl ()))
-			{
-			  info.set_avl_info (
-			    avl_info (def_info.get_avl (), nullptr));
-			  rtx new_pat
-			    = gen_vsetvl_pat (VSETVL_NORMAL, info, vl);
-			  validate_change (insn->rtl (),
-					   &PATTERN (insn->rtl ()), new_pat,
-					   false);
-			  continue;
-			}
-		      if (def_info.has_avl_imm () || rtx_equal_p (avl, vl))
-			{
-			  info.set_avl_info (avl_info (avl, nullptr));
-			  emit_vsetvl_insn (VSETVL_DISCARD_RESULT, EMIT_AFTER,
-					    info, NULL_RTX, insn->rtl ());
-			  if (set->single_nondebug_insn_use ())
-			    {
-			      to_delete.add (insn->rtl ());
-			      to_delete.add (def_insn->rtl ());
-			    }
-			  continue;
-			}
-		    }
-		}
-	    }
-
-	  /* Change vsetvl rd, rs1 --> vsevl zero, rs1,
-	     if rd is not used by any nondebug instructions.
-	     Even though this PASS runs after RA and it doesn't help for
-	     reduce register pressure, it can help instructions scheduling
-	     since we remove the dependencies.  */
-	  if (vsetvl_insn_p (insn->rtl ()))
-	    {
-	      rtx vl = get_vl (insn->rtl ());
-	      rtx avl = get_avl (insn->rtl ());
-	      def_info *def = find_access (insn->defs (), REGNO (vl));
-	      set_info *set = safe_dyn_cast<set_info *> (def);
+	      rtx vl = get_vl (rinsn);
 	      vector_insn_info info;
-	      info.parse_insn (insn);
-	      gcc_assert (set);
-	      if (m_vector_manager->to_delete_vsetvls.contains (insn->rtl ()))
-		{
-		  m_vector_manager->to_delete_vsetvls.remove (insn->rtl ());
-		  if (m_vector_manager->to_refine_vsetvls.contains (
-			insn->rtl ()))
-		    m_vector_manager->to_refine_vsetvls.remove (insn->rtl ());
-		  if (!set->has_nondebug_insn_uses ())
-		    {
-		      to_delete.add (insn->rtl ());
-		      continue;
-		    }
-		}
-	      if (m_vector_manager->to_refine_vsetvls.contains (insn->rtl ()))
+	      info.parse_insn (rinsn);
+	      bool to_delete_p = m_vector_manager->to_delete_p (rinsn);
+	      bool to_refine_p = m_vector_manager->to_refine_p (rinsn);
+	      if (has_no_uses (cfg_bb, rinsn, REGNO (vl)))
 		{
-		  m_vector_manager->to_refine_vsetvls.remove (insn->rtl ());
-		  if (!set->has_nondebug_insn_uses ())
+		  if (to_delete_p)
+		    to_delete.add (rinsn);
+		  else if (to_refine_p)
 		    {
 		      rtx new_pat = gen_vsetvl_pat (VSETVL_VTYPE_CHANGE_ONLY,
 						    info, NULL_RTX);
-		      change_insn (insn->rtl (), new_pat);
-		      continue;
+		      validate_change (rinsn, &PATTERN (rinsn), new_pat, false);
+		    }
+		  else if (!vlmax_avl_p (info.get_avl ()))
+		    {
+		      rtx new_pat = gen_vsetvl_pat (VSETVL_DISCARD_RESULT, info,
+						    NULL_RTX);
+		      validate_change (rinsn, &PATTERN (rinsn), new_pat, false);
 		    }
-		}
-	      if (vlmax_avl_p (avl))
-		continue;
-	      rtx new_pat
-		= gen_vsetvl_pat (VSETVL_DISCARD_RESULT, info, NULL_RTX);
-	      if (!set->has_nondebug_insn_uses ())
-		{
-		  validate_change (insn->rtl (), &PATTERN (insn->rtl ()),
-				   new_pat, false);
-		  continue;
 		}
 	    }
 	}
     }
-
   for (rtx_insn *rinsn : to_delete)
     eliminate_insn (rinsn);
 }
@@ -4593,16 +4707,16 @@  pass_vsetvl::lazy_vsetvl (void)
     fprintf (dump_file, "\nPhase 4: PRE vsetvl by Lazy code motion (LCM)\n");
   pre_vsetvl ();
 
-  /* Phase 5 - Cleanup AVL && VL operand of RVV instruction.  */
+  /* Phase 5 - Post optimization based on RTL_SSA.  */
   if (dump_file)
-    fprintf (dump_file, "\nPhase 5: Cleanup AVL and VL operands\n");
-  cleanup_insns ();
+    fprintf (dump_file, "\nPhase 5: Post optimization based on RTL_SSA\n");
+  ssa_post_optimization ();
 
-  /* Phase 6 - Rebuild RTL_SSA to propagate AVL between vsetvls.  */
+  /* Phase 6 - Post optimization based on data-flow analysis.  */
   if (dump_file)
     fprintf (dump_file,
-	     "\nPhase 6: Rebuild RTL_SSA to propagate AVL between vsetvls\n");
-  propagate_avl ();
+	     "\nPhase 6: Post optimization base on data-flow analysis\n");
+  df_post_optimization ();
 }
 
 /* Main entry point for this pass.  */
diff --git a/gcc/config/riscv/riscv-vsetvl.h b/gcc/config/riscv/riscv-vsetvl.h
index d7a6c14e931..4257451bb74 100644
--- a/gcc/config/riscv/riscv-vsetvl.h
+++ b/gcc/config/riscv/riscv-vsetvl.h
@@ -290,13 +290,6 @@  private:
      definition of AVL.  */
   rtl_ssa::insn_info *m_insn;
 
-  /* Parse the instruction to get VL/VTYPE information and demanding
-   * information.  */
-  /* This is only called by simple_vsetvl subroutine when optimize == 0.
-     Since RTL_SSA can not be enabled when optimize == 0, we don't initialize
-     the m_insn.  */
-  void parse_insn (rtx_insn *);
-
   friend class vector_infos_manager;
 
 public:
@@ -305,6 +298,12 @@  public:
       m_insn (nullptr)
   {}
 
+  /* Parse the instruction to get VL/VTYPE information and demanding
+   * information.  */
+  /* This is only called by simple_vsetvl subroutine when optimize == 0.
+     Since RTL_SSA can not be enabled when optimize == 0, we don't initialize
+     the m_insn.  */
+  void parse_insn (rtx_insn *);
   /* This is only called by lazy_vsetvl subroutine when optimize > 0.
      We use RTL_SSA framework to initialize the insn_info.  */
   void parse_insn (rtl_ssa::insn_info *);
@@ -454,6 +453,27 @@  public:
   bool all_empty_predecessor_p (const basic_block) const;
   bool all_avail_in_compatible_p (const basic_block) const;
 
+  bool to_delete_p (rtx_insn *rinsn)
+  {
+    if (to_delete_vsetvls.contains (rinsn))
+      {
+	to_delete_vsetvls.remove (rinsn);
+	if (to_refine_vsetvls.contains (rinsn))
+	  to_refine_vsetvls.remove (rinsn);
+	return true;
+      }
+    return false;
+  }
+  bool to_refine_p (rtx_insn *rinsn)
+  {
+    if (to_refine_vsetvls.contains (rinsn))
+      {
+	to_refine_vsetvls.remove (rinsn);
+	return true;
+      }
+    return false;
+  }
+
   void release (void);
   void create_bitmap_vectors (void);
   void free_bitmap_vectors (void);
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-16.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-16.c
index e0c6588b1db..29e05c4982b 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-16.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-16.c
@@ -16,5 +16,5 @@  void f(int8_t *base, int8_t *out, size_t vl, size_t m) {
   }
 }
 
-/* { dg-final { scan-assembler-times {vsetvli} 3 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
+/* { dg-final { scan-assembler-times {vsetvli} 2 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
 /* { dg-final { scan-assembler-times {slli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*10} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-2.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-2.c
index 0c5da5e640c..ff0171b3ff6 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-2.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-2.c
@@ -17,4 +17,4 @@  void f(int8_t *base, int8_t *out, size_t vl, size_t m) {
 }
 
 /* { dg-final { scan-assembler-times {slli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*10} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
-/* { dg-final { scan-assembler-times {vsetvli} 3 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
+/* { dg-final { scan-assembler-times {vsetvli} 2 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-21.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-21.c
new file mode 100644
index 00000000000..551920c6a72
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-21.c
@@ -0,0 +1,21 @@ 
+/* { dg-do compile } */
+/* { dg-options "-march=rv32gcv -mabi=ilp32 -fno-schedule-insns -fno-schedule-insns2" } */
+
+#include "riscv_vector.h"
+
+void f(int8_t *base, int8_t *out, size_t vl, size_t m, size_t k) {
+  size_t avl;
+  if (m > 100)
+    avl = __riscv_vsetvl_e16mf4(vl << 4);
+  else {
+    if (k)
+      avl = __riscv_vsetvl_e8mf8(vl);
+  }
+  for (size_t i = 0; i < m; i++) {
+    vint8mf8_t v0 = __riscv_vle8_v_i8mf8(base + i, avl);
+    __riscv_vse8_v_i8mf8(out + i, v0, avl);
+  }
+}
+
+/* { dg-final { scan-assembler-times {slli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*4} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
+/* { dg-final { scan-assembler-times {vsetvli} 3 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-22.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-22.c
new file mode 100644
index 00000000000..103f4238c76
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-22.c
@@ -0,0 +1,21 @@ 
+/* { dg-do compile } */
+/* { dg-options "-march=rv32gcv -mabi=ilp32 -fno-schedule-insns -fno-schedule-insns2" } */
+
+#include "riscv_vector.h"
+
+void f(int8_t *base, int8_t *out, size_t vl, size_t m, size_t k) {
+  size_t avl;
+  if (m > 100)
+    avl = __riscv_vsetvl_e16mf4(vl << 4);
+  else
+    avl = __riscv_vsetvl_e32mf2(vl >> 8);
+  for (size_t i = 0; i < m; i++) {
+    vint8mf8_t v0 = __riscv_vle8_v_i8mf8(base + i, avl);
+    v0 = __riscv_vadd_vv_i8mf8 (v0, v0, avl);
+    __riscv_vse8_v_i8mf8(out + i, v0, avl);
+  }
+}
+
+/* { dg-final { scan-assembler-times {slli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*4} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
+/* { dg-final { scan-assembler-times {vsetvli} 2 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
+/* { dg-final { scan-assembler-times {vsetvli\s+zero,\s*[a-x0-9]+,\s*e8,\s*mf8,\s*t[au],\s*m[au]} 2 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-23.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-23.c
new file mode 100644
index 00000000000..66c90ac10e7
--- /dev/null
+++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-23.c
@@ -0,0 +1,37 @@ 
+/* { dg-do compile } */
+/* { dg-options "-march=rv32gcv -mabi=ilp32 -fno-schedule-insns -fno-schedule-insns2" } */
+
+#include "riscv_vector.h"
+
+void f(int8_t *base, int8_t *out, size_t vl, size_t m, size_t k) {
+  size_t avl;
+  switch (m)
+  {
+  case 50:
+    avl = __riscv_vsetvl_e16mf4(vl << 4);
+    break;
+  case 1:
+    avl = __riscv_vsetvl_e32mf2(k);
+    break;
+  case 2:
+    avl = __riscv_vsetvl_e64m1(vl);
+    break;
+  case 3:
+    avl = __riscv_vsetvl_e32mf2(k >> 8);
+    break;
+  default:
+    avl = __riscv_vsetvl_e32mf2(k + vl);
+    break;
+  }
+  for (size_t i = 0; i < m; i++) {
+    vint8mf8_t v0 = __riscv_vle8_v_i8mf8(base + i, avl);
+    v0 = __riscv_vadd_vv_i8mf8 (v0, v0, avl);
+    v0 = __riscv_vadd_vv_i8mf8_tu (v0, v0, v0, avl);
+    __riscv_vse8_v_i8mf8(out + i, v0, avl);
+  }
+}
+
+/* { dg-final { scan-assembler-times {slli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*4} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
+/* { dg-final { scan-assembler-times {srli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*8} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
+/* { dg-final { scan-assembler-times {vsetvli} 5 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
+/* { dg-final { scan-assembler-times {vsetvli\s+zero,\s*[a-x0-9]+,\s*e8,\s*mf8,\s*tu,\s*m[au]} 5 { target { no-opts "-O0" no-opts "-Os" no-opts "-g" no-opts "-funroll-loops" } } } } */
diff --git a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-3.c b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-3.c
index f995e04aacc..13d09fc3fd1 100644
--- a/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-3.c
+++ b/gcc/testsuite/gcc.target/riscv/rvv/vsetvl/vsetvl-3.c
@@ -18,4 +18,4 @@  void f(int8_t *base, int8_t *out, size_t vl, size_t m) {
 }
 
 /* { dg-final { scan-assembler-times {slli\s+[a-x0-9]+,\s*[a-x0-9]+,\s*10} 1 { target { no-opts "-O0" no-opts "-g" no-opts "-funroll-loops" } } } } */
-/* { dg-final { scan-assembler-times {vsetvli} 3 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */
+/* { dg-final { scan-assembler-times {vsetvli} 2 { target { no-opts "-O0" no-opts "-Os" no-opts "-Oz" no-opts "-g" no-opts "-funroll-loops" } } } } */