diff mbox series

[v3,11/11] Hexagon (target/hexagon) Use direct block chaining for tight loops

Message ID 20221104192631.29434-12-tsimpson@quicinc.com
State New
Headers show
Series Hexagon (target/hexagon) performance and bug fixes | expand

Commit Message

Taylor Simpson Nov. 4, 2022, 7:26 p.m. UTC
Direct block chaining is documented here
https://qemu.readthedocs.io/en/latest/devel/tcg.html#direct-block-chaining

Hexagon inner loops end with the endloop0 instruction
To go back to the beginning of the loop, this instructions writes to PC
from register SA0 (start address 0).  To use direct block chaining, we
have to assign PC with a constant value.  So, we specialize the code
generation when the start of the translation block is equal to SA0.

When this is the case, we defer the compare/branch from endloop0 to
gen_end_tb.  When this is done, we can assign the start address of the TB
to PC.

Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
---
 target/hexagon/cpu.h       | 17 ++++++++----
 target/hexagon/gen_tcg.h   |  3 ++
 target/hexagon/translate.h |  1 +
 target/hexagon/genptr.c    | 57 ++++++++++++++++++++++++++++++++++++++
 target/hexagon/translate.c | 34 +++++++++++++++++++++++
 5 files changed, 107 insertions(+), 5 deletions(-)

Comments

Richard Henderson Nov. 5, 2022, 1:44 a.m. UTC | #1
On 11/5/22 06:26, Taylor Simpson wrote:
> Direct block chaining is documented here
> https://qemu.readthedocs.io/en/latest/devel/tcg.html#direct-block-chaining
> 
> Hexagon inner loops end with the endloop0 instruction
> To go back to the beginning of the loop, this instructions writes to PC
> from register SA0 (start address 0).  To use direct block chaining, we
> have to assign PC with a constant value.  So, we specialize the code
> generation when the start of the translation block is equal to SA0.
> 
> When this is the case, we defer the compare/branch from endloop0 to
> gen_end_tb.  When this is done, we can assign the start address of the TB
> to PC.
> 
> Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
> ---
>   target/hexagon/cpu.h       | 17 ++++++++----
>   target/hexagon/gen_tcg.h   |  3 ++
>   target/hexagon/translate.h |  1 +
>   target/hexagon/genptr.c    | 57 ++++++++++++++++++++++++++++++++++++++
>   target/hexagon/translate.c | 34 +++++++++++++++++++++++
>   5 files changed, 107 insertions(+), 5 deletions(-)
> 
> diff --git a/target/hexagon/cpu.h b/target/hexagon/cpu.h
> index ff8c26272d..5260e0f127 100644
> --- a/target/hexagon/cpu.h
> +++ b/target/hexagon/cpu.h
> @@ -152,16 +152,23 @@ struct ArchCPU {
>   
>   #include "cpu_bits.h"
>   
> +typedef union {
> +    uint32_t i;
> +    struct {
> +        bool is_tight_loop:1;
> +    };
> +} HexStateFlags;

I don't see this as an improvement on manual flags handling, as it makes the flags value 
be dependent on host bit-field ordering.  This makes it more difficult to compare traces 
across hosts.

Otherwise,
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~
Taylor Simpson Nov. 6, 2022, 9:52 p.m. UTC | #2
> -----Original Message-----
> From: Richard Henderson <richard.henderson@linaro.org>
> Sent: Friday, November 4, 2022 8:44 PM
> To: Taylor Simpson <tsimpson@quicinc.com>; qemu-devel@nongnu.org
> Cc: philmd@linaro.org; ale@rev.ng; anjo@rev.ng; Brian Cain
> <bcain@quicinc.com>; Matheus Bernardino (QUIC)
> <quic_mathbern@quicinc.com>
> Subject: Re: [PATCH v3 11/11] Hexagon (target/hexagon) Use direct block
> chaining for tight loops
> 
> On 11/5/22 06:26, Taylor Simpson wrote:
> > Direct block chaining is documented here
> > https://qemu.readthedocs.io/en/latest/devel/tcg.html#direct-block-chai
> > ning
> >
> > Hexagon inner loops end with the endloop0 instruction To go back to
> > the beginning of the loop, this instructions writes to PC from
> > register SA0 (start address 0).  To use direct block chaining, we have
> > to assign PC with a constant value.  So, we specialize the code
> > generation when the start of the translation block is equal to SA0.
> >
> > When this is the case, we defer the compare/branch from endloop0 to
> > gen_end_tb.  When this is done, we can assign the start address of the
> > TB to PC.
> >
> > Signed-off-by: Taylor Simpson <tsimpson@quicinc.com>
> > ---
> >   target/hexagon/cpu.h       | 17 ++++++++----
> >   target/hexagon/gen_tcg.h   |  3 ++
> >   target/hexagon/translate.h |  1 +
> >   target/hexagon/genptr.c    | 57
> ++++++++++++++++++++++++++++++++++++++
> >   target/hexagon/translate.c | 34 +++++++++++++++++++++++
> >   5 files changed, 107 insertions(+), 5 deletions(-)
> >
> > diff --git a/target/hexagon/cpu.h b/target/hexagon/cpu.h index
> > ff8c26272d..5260e0f127 100644
> > --- a/target/hexagon/cpu.h
> > +++ b/target/hexagon/cpu.h
> > @@ -152,16 +152,23 @@ struct ArchCPU {
> >
> >   #include "cpu_bits.h"
> >
> > +typedef union {
> > +    uint32_t i;
> > +    struct {
> > +        bool is_tight_loop:1;
> > +    };
> > +} HexStateFlags;
> 
> I don't see this as an improvement on manual flags handling, as it makes the
> flags value be dependent on host bit-field ordering.  This makes it more
> difficult to compare traces across hosts.

I coded this originally with manual handling but decided this would be easier to read/understand/maintain - especially as we add more flags and some have more than 1 bit.

I haven't noticed the flags in any of the logs.  Where are they printed?


> 
> Otherwise,
> Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
> 
> 
> r~
Richard Henderson Nov. 6, 2022, 11:12 p.m. UTC | #3
On 11/7/22 08:52, Taylor Simpson wrote:
> I coded this originally with manual handling but decided this would be easier to read/understand/maintain - especially as we add more flags and some have more than 1 bit.
> 
> I haven't noticed the flags in any of the logs.  Where are they printed?

Third field of -d exec, e.g.

Trace 0: 0x7f09a40004c0 [0000000000000000/fffffc0000000044/01000001/ff000000] __start
                                                            ^^^^^^^^


r~
diff mbox series

Patch

diff --git a/target/hexagon/cpu.h b/target/hexagon/cpu.h
index ff8c26272d..5260e0f127 100644
--- a/target/hexagon/cpu.h
+++ b/target/hexagon/cpu.h
@@ -152,16 +152,23 @@  struct ArchCPU {
 
 #include "cpu_bits.h"
 
+typedef union {
+    uint32_t i;
+    struct {
+        bool is_tight_loop:1;
+    };
+} HexStateFlags;
+
 static inline void cpu_get_tb_cpu_state(CPUHexagonState *env, target_ulong *pc,
                                         target_ulong *cs_base, uint32_t *flags)
 {
+    HexStateFlags hex_flags = { 0 };
     *pc = env->gpr[HEX_REG_PC];
     *cs_base = 0;
-#ifdef CONFIG_USER_ONLY
-    *flags = 0;
-#else
-#error System mode not supported on Hexagon yet
-#endif
+    if (*pc == env->gpr[HEX_REG_SA0]) {
+        hex_flags.is_tight_loop = true;
+    }
+    *flags = hex_flags.i;
 }
 
 static inline int cpu_mmu_index(CPUHexagonState *env, bool ifetch)
diff --git a/target/hexagon/gen_tcg.h b/target/hexagon/gen_tcg.h
index 93afa01156..0f0c027523 100644
--- a/target/hexagon/gen_tcg.h
+++ b/target/hexagon/gen_tcg.h
@@ -620,6 +620,9 @@ 
 #define fGEN_TCG_J2_callf(SHORTCODE) \
     gen_cond_call(ctx, PuV, false, riV)
 
+#define fGEN_TCG_J2_endloop0(SHORTCODE) \
+    gen_endloop0(ctx)
+
 /*
  * Compound compare and jump instructions
  * Here is a primer to understand the tag names
diff --git a/target/hexagon/translate.h b/target/hexagon/translate.h
index 0841e8418e..7fc066c19a 100644
--- a/target/hexagon/translate.h
+++ b/target/hexagon/translate.h
@@ -60,6 +60,7 @@  typedef struct DisasContext {
     bool has_single_direct_branch;
     TCGv branch_cond;
     target_ulong branch_dest;
+    bool is_tight_loop;
 } DisasContext;
 
 static inline void ctx_log_reg_write(DisasContext *ctx, int rnum)
diff --git a/target/hexagon/genptr.c b/target/hexagon/genptr.c
index 235ea9d210..74b7b3ac5d 100644
--- a/target/hexagon/genptr.c
+++ b/target/hexagon/genptr.c
@@ -635,6 +635,63 @@  static void gen_cond_call(DisasContext *ctx, TCGv pred, bool sense, int pc_off)
     gen_set_label(skip);
 }
 
+static void gen_endloop0(DisasContext *ctx)
+{
+    TCGv lpcfg = tcg_temp_local_new();
+
+    GET_USR_FIELD(USR_LPCFG, lpcfg);
+
+    /*
+     *    if (lpcfg == 1) {
+     *        hex_new_pred_value[3] = 0xff;
+     *        hex_pred_written |= 1 << 3;
+     *    }
+     */
+    TCGLabel *label1 = gen_new_label();
+    tcg_gen_brcondi_tl(TCG_COND_NE, lpcfg, 1, label1);
+    {
+        tcg_gen_movi_tl(hex_new_pred_value[3], 0xff);
+        tcg_gen_ori_tl(hex_pred_written, hex_pred_written, 1 << 3);
+    }
+    gen_set_label(label1);
+
+    /*
+     *    if (lpcfg) {
+     *        SET_USR_FIELD(USR_LPCFG, lpcfg - 1);
+     *    }
+     */
+    TCGLabel *label2 = gen_new_label();
+    tcg_gen_brcondi_tl(TCG_COND_EQ, lpcfg, 0, label2);
+    {
+        tcg_gen_subi_tl(lpcfg, lpcfg, 1);
+        SET_USR_FIELD(USR_LPCFG, lpcfg);
+    }
+    gen_set_label(label2);
+
+    /*
+     * If we're in a tight loop, we'll do this at the end of the TB to take
+     * advantage of direct block chaining.
+     */
+    if (!ctx->is_tight_loop) {
+        /*
+         *    if (hex_gpr[HEX_REG_LC0] > 1) {
+         *        PC = hex_gpr[HEX_REG_SA0];
+         *        hex_new_value[HEX_REG_LC0] = hex_gpr[HEX_REG_LC0] - 1;
+         *    }
+         */
+        TCGLabel *label3 = gen_new_label();
+        tcg_gen_brcondi_tl(TCG_COND_LEU, hex_gpr[HEX_REG_LC0], 1, label3);
+        {
+            gen_jumpr(ctx, hex_gpr[HEX_REG_SA0]);
+            tcg_gen_subi_tl(hex_new_value[HEX_REG_LC0],
+                            hex_gpr[HEX_REG_LC0], 1);
+        }
+        gen_set_label(label3);
+    }
+
+    tcg_temp_free(lpcfg);
+}
+
 static void gen_cmp_jumpnv(DisasContext *ctx,
                            TCGCond cond, TCGv val, TCGv src, int pc_off)
 {
diff --git a/target/hexagon/translate.c b/target/hexagon/translate.c
index 8c007c6f07..731ec85707 100644
--- a/target/hexagon/translate.c
+++ b/target/hexagon/translate.c
@@ -135,6 +135,8 @@  static void gen_goto_tb(DisasContext *ctx, int idx, target_ulong dest)
 
 static void gen_end_tb(DisasContext *ctx)
 {
+    Packet *pkt = ctx->pkt;
+
     gen_exec_counters(ctx);
 
     if (ctx->has_single_direct_branch) {
@@ -149,6 +151,18 @@  static void gen_end_tb(DisasContext *ctx)
         } else {
             gen_goto_tb(ctx, 0, ctx->branch_dest);
         }
+    } else if (ctx->is_tight_loop &&
+        pkt->insn[pkt->num_insns - 1].opcode == J2_endloop0) {
+        /*
+         * When we're in a tight loop, we defer the endloop0 processing
+         * to take advantage of direct block chaining
+         */
+        TCGLabel *skip = gen_new_label();
+        tcg_gen_brcondi_tl(TCG_COND_LEU, hex_gpr[HEX_REG_LC0], 1, skip);
+        tcg_gen_subi_tl(hex_gpr[HEX_REG_LC0], hex_gpr[HEX_REG_LC0], 1);
+        gen_goto_tb(ctx, 0, ctx->base.tb->pc);
+        gen_set_label(skip);
+        gen_goto_tb(ctx, 1, ctx->next_PC);
     } else {
         tcg_gen_lookup_and_goto_ptr();
     }
@@ -340,6 +354,15 @@  static void mark_implicit_reg_write(DisasContext *ctx, int attrib, int rnum)
          */
         bool is_predicated = GET_ATTRIB(opcode, A_CONDEXEC) ||
                              rnum == HEX_REG_USR;
+
+        /* LC0/LC1 is conditionally written by endloop instructions */
+        if ((rnum == HEX_REG_LC0 || rnum == HEX_REG_LC1) &&
+            (opcode == J2_endloop0 ||
+             opcode == J2_endloop1 ||
+             opcode == J2_endloop01)) {
+            is_predicated = true;
+        }
+
         if (is_predicated && !is_preloaded(ctx, rnum)) {
             tcg_gen_mov_tl(hex_new_value[rnum], hex_gpr[rnum]);
         }
@@ -423,6 +446,14 @@  static void gen_reg_writes(DisasContext *ctx)
         int reg_num = ctx->reg_log[i];
 
         tcg_gen_mov_tl(hex_gpr[reg_num], hex_new_value[reg_num]);
+
+        /*
+         * ctx->is_tight_loop is set when SA0 points to the beginning of the TB.
+         * If we write to SA0, we have to turn off tight loop handling.
+         */
+        if (reg_num == HEX_REG_SA0) {
+            ctx->is_tight_loop = false;
+        }
     }
 }
 
@@ -846,8 +877,11 @@  static void hexagon_tr_init_disas_context(DisasContextBase *dcbase,
 static void hexagon_tr_tb_start(DisasContextBase *db, CPUState *cpu)
 {
     DisasContext *ctx = container_of(db, DisasContext, base);
+    HexStateFlags hex_flags = { db->tb->flags };
+
     ctx->has_single_direct_branch = false;
     ctx->branch_cond = NULL;
+    ctx->is_tight_loop = hex_flags.is_tight_loop;
 }
 
 static void hexagon_tr_insn_start(DisasContextBase *dcbase, CPUState *cpu)