Message ID | 1383073495-5332-12-git-send-email-sebastian@macke.de |
---|---|
State | New |
Headers | show |
On 10/29/2013 12:04 PM, Sebastian Macke wrote: > { > int lab = gen_new_label(); > - dc->btaken = tcg_temp_local_new(); > - tcg_gen_movi_tl(jmp_pc, dc->pc+8); > - tcg_gen_movi_tl(dc->btaken, 0); > + tcg_gen_movi_tl(jmp_pc, 0); > tcg_gen_brcondi_i32(op0 == 0x03 ? TCG_COND_NE : TCG_COND_EQ, > cpu_srf, 0, lab); > - tcg_gen_movi_tl(dc->btaken, 1); > tcg_gen_movi_tl(jmp_pc, tmp_pc); > gen_set_label(lab); You can now use movcond here: tcg_gen_movi_i32(jmp_pc, tmp_pc); tcg_gen_movcond_i32(jmp_pc, cpu_srf, zero, jmp_pc, zero, op0 == 0x03 ? TCG_COND_NE : TCG_COND_EQ); Although I'd wonder about just using setcond instead, since I think the value stored in jmp_pc here is also stored in dc->j_target, leading to the right behaviour... > case JUMP_BRANCH: > { > int l1 = gen_new_label(); > - tcg_gen_brcondi_tl(TCG_COND_NE, dc->btaken, 0, l1); > + tcg_gen_brcondi_tl(TCG_COND_NE, jmp_pc, 0, l1); > gen_goto_tb(dc, 1, dc->pc); > gen_set_label(l1); > - tcg_temp_free(dc->btaken); > gen_goto_tb(dc, 0, dc->j_target); > break; ... here. > + case JUMP_BRANCH_DELAYED: > + { > + int l1 = gen_new_label(); > + tcg_gen_brcondi_tl(TCG_COND_NE, jmp_pc, 0, l1); > + gen_goto_tb(dc, 1, dc->pc); > + gen_set_label(l1); > + tcg_gen_mov_tl(cpu_pc, jmp_pc); > + tcg_gen_exit_tb(0); > + break; ... > + > + dc->delayed_branch = !!(dc->tb_flags & D_FLAG); > + if ((dc->delayed_branch) && (dc->tb_flags&B_FLAG)) { > + dc->j_state = JUMP_BRANCH_DELAYED; > + } > + And thus I can't see how these additions are actually useful? If I've missed something, and the last hunk needs to be retained, then please fix the coding style: if (dc->delayed_branch && (dc->tb_flags & B_FLAG)) { r~
On 30/10/2013 11:33 AM, Richard Henderson wrote: > On 10/29/2013 12:04 PM, Sebastian Macke wrote: >> { >> int lab = gen_new_label(); >> - dc->btaken = tcg_temp_local_new(); >> - tcg_gen_movi_tl(jmp_pc, dc->pc+8); >> - tcg_gen_movi_tl(dc->btaken, 0); >> + tcg_gen_movi_tl(jmp_pc, 0); >> tcg_gen_brcondi_i32(op0 == 0x03 ? TCG_COND_NE : TCG_COND_EQ, >> cpu_srf, 0, lab); >> - tcg_gen_movi_tl(dc->btaken, 1); >> tcg_gen_movi_tl(jmp_pc, tmp_pc); >> gen_set_label(lab); > You can now use movcond here: > > tcg_gen_movi_i32(jmp_pc, tmp_pc); > tcg_gen_movcond_i32(jmp_pc, cpu_srf, zero, jmp_pc, zero, > op0 == 0x03 ? TCG_COND_NE : TCG_COND_EQ); > > Although I'd wonder about just using setcond instead, since I think > the value stored in jmp_pc here is also stored in dc->j_target, leading > to the right behaviour... Correct. The movcond function can be used here. I will change that. But I didn't know that it exists because it is not written in the document http://wiki.qemu.org/Documentation/TCG/frontend-ops > >> case JUMP_BRANCH: >> { >> int l1 = gen_new_label(); >> - tcg_gen_brcondi_tl(TCG_COND_NE, dc->btaken, 0, l1); >> + tcg_gen_brcondi_tl(TCG_COND_NE, jmp_pc, 0, l1); >> gen_goto_tb(dc, 1, dc->pc); >> gen_set_label(l1); >> - tcg_temp_free(dc->btaken); >> gen_goto_tb(dc, 0, dc->j_target); >> break; > ... here. But j_target is not known when the delayed slot is translated separately. (E.g. if the delayed slot is at a page boundary.) > >> + case JUMP_BRANCH_DELAYED: >> + { >> + int l1 = gen_new_label(); >> + tcg_gen_brcondi_tl(TCG_COND_NE, jmp_pc, 0, l1); >> + gen_goto_tb(dc, 1, dc->pc); >> + gen_set_label(l1); >> + tcg_gen_mov_tl(cpu_pc, jmp_pc); >> + tcg_gen_exit_tb(0); >> + break; > ... >> + >> + dc->delayed_branch = !!(dc->tb_flags & D_FLAG); >> + if ((dc->delayed_branch) && (dc->tb_flags&B_FLAG)) { >> + dc->j_state = JUMP_BRANCH_DELAYED; >> + } >> + > > And thus I can't see how these additions are actually useful? > > If I've missed something, and the last hunk needs to be retained, > then please fix the coding style: > > if (dc->delayed_branch && (dc->tb_flags & B_FLAG)) { The problem is the delayed slot. The decision whether a jump is taken or not depends on the previous branch instruction l.bf and l.bnf instruction. If only the delayed slot is translated then we have to know what was the previous instruction. In this case we have to differ additionally if the delayed slot is a branch or a normal jump. Because for a branch the jmp_pc field is also a flag. We could simply jump to jmp_pc like it was done before but then we miss the block chaining feature. And I have not seen another way implementing it. > > r~
On 10/30/2013 12:07 PM, Sebastian Macke wrote: > > Correct. The movcond function can be used here. I will change that. > But I didn't know that it exists because it is not written in the document > http://wiki.qemu.org/Documentation/TCG/frontend-ops I had no idea that existed. The canonical documentation is in tcg/README. r~
On 10/30/2013 12:07 PM, Sebastian Macke wrote: >>> case JUMP_BRANCH: >>> { >>> int l1 = gen_new_label(); >>> - tcg_gen_brcondi_tl(TCG_COND_NE, dc->btaken, 0, l1); >>> + tcg_gen_brcondi_tl(TCG_COND_NE, jmp_pc, 0, l1); >>> gen_goto_tb(dc, 1, dc->pc); >>> gen_set_label(l1); >>> - tcg_temp_free(dc->btaken); >>> gen_goto_tb(dc, 0, dc->j_target); >>> break; >> ... here. > > But j_target is not known when the delayed slot is translated separately. (E.g. > if the delayed slot is at a page boundary.) Hmm. This was just guesswork on my part since I don't have a tree with your previous patch set applied; j_target of course doesn't exist in master. Do you have a publicly accessible tree with all your patches applied? I'd like to re-read the logic in the proper context. r~
On 30/10/2013 12:47 PM, Richard Henderson wrote: > On 10/30/2013 12:07 PM, Sebastian Macke wrote: >>>> case JUMP_BRANCH: >>>> { >>>> int l1 = gen_new_label(); >>>> - tcg_gen_brcondi_tl(TCG_COND_NE, dc->btaken, 0, l1); >>>> + tcg_gen_brcondi_tl(TCG_COND_NE, jmp_pc, 0, l1); >>>> gen_goto_tb(dc, 1, dc->pc); >>>> gen_set_label(l1); >>>> - tcg_temp_free(dc->btaken); >>>> gen_goto_tb(dc, 0, dc->j_target); >>>> break; >>> ... here. >> But j_target is not known when the delayed slot is translated separately. (E.g. >> if the delayed slot is at a page boundary.) > Hmm. This was just guesswork on my part since I don't have a tree > with your previous patch set applied; j_target of course doesn't > exist in master. > > Do you have a publicly accessible tree with all your patches applied? > I'd like to re-read the logic in the proper context. After you are the second who demanded it: https://github.com/s-macke/qemu/tree/or32-optimize > > r~
On 10/30/2013 02:08 PM, Sebastian Macke wrote: >> Do you have a publicly accessible tree with all your patches applied? >> I'd like to re-read the logic in the proper context. > After you are the second who demanded it: > https://github.com/s-macke/qemu/tree/or32-optimize Ok, the logic as written is correct as far as I can see. It's a little convoluted though, and I think there may be a way to streamline it. But I'll have to think about that some more. r~
On 30/10/2013 3:02 PM, Richard Henderson wrote: > On 10/30/2013 02:08 PM, Sebastian Macke wrote: >>> Do you have a publicly accessible tree with all your patches applied? >>> I'd like to re-read the logic in the proper context. >> After you are the second who demanded it: >> https://github.com/s-macke/qemu/tree/or32-optimize > Ok, the logic as written is correct as far as I can see. > > It's a little convoluted though, and I think there may be a way to > streamline it. But I'll have to think about that some more. > > > r~ Yeah it is. Because of the delayed slot and the way QEMU is doing its translation it is hard to see and distinguish all different code paths. When is which information available. At the moment the biggest time eater and most complex code part is the branching part. 1. l.sf..... <- set flag if condition is fullfiled (setcond instruction) 2. l.bf <- branch if flag (a brcond instruction or with the last suggestion you gave a movcond instruction) 3. delayed slot instruction which could fail. 4. Actual jump. We need a branch here for the two different slots for chaining. And we need all information about point 2) So at the moment we put three branches/conditions in the translated code. One setcond, one movcond and one brcond. In principle point 1 and 2 could be fused with some coding effort. But I am not sure if one brcond is faster then one setcond+movcond. And maybe the branching in No 4. could be avoided by some internal code change in the way QEMU does its block chaining.
diff --git a/target-openrisc/cpu.h b/target-openrisc/cpu.h index 2423fad..ea007c7 100644 --- a/target-openrisc/cpu.h +++ b/target-openrisc/cpu.h @@ -84,8 +84,9 @@ enum { /* Version Register */ #define SPR_VR 0xFFFF003F -/* Internal flags, delay slot flag */ -#define D_FLAG 2 +/* Internal flags */ +#define D_FLAG 2 /* delay slot flag */ +#define B_FLAG 4 /* branch flag */ /* Interrupt */ #define NR_IRQS 32 @@ -413,7 +414,7 @@ static inline void cpu_get_tb_cpu_state(CPUOpenRISCState *env, *pc = env->pc; *cs_base = 0; /* D_FLAG -- branch instruction exception */ - *flags = (env->flags & D_FLAG) | + *flags = (env->flags & (D_FLAG | B_FLAG)) | (env->sr & (SR_SM | SR_DME | SR_IME | SR_OVE)); } diff --git a/target-openrisc/translate.c b/target-openrisc/translate.c index d926995..274d38e 100644 --- a/target-openrisc/translate.c +++ b/target-openrisc/translate.c @@ -42,7 +42,11 @@ enum { JUMP_DYNAMIC, /* The address is not known during compile time*/ JUMP_STATIC, /* The address is known during compile time */ - JUMP_BRANCH /* The two possible destinations are known */ + JUMP_BRANCH, /* The two possible destinations are known */ + JUMP_BRANCH_DELAYED /* Indicates that the first instruction in + the tb is a delayed slot of a previous + branch instruction. */ + }; @@ -54,7 +58,6 @@ typedef struct DisasContext { uint32_t is_jmp; uint32_t j_state; /* specifies the jump type*/ target_ulong j_target; /* target address for jump */ - TCGv btaken; /* Temporary variable */ uint32_t mem_idx; int singlestep_enabled; uint32_t delayed_branch; @@ -222,15 +225,13 @@ static void gen_jump(DisasContext *dc, uint32_t imm, uint32_t reg, uint32_t op0) case 0x04: /* l.bf */ { int lab = gen_new_label(); - dc->btaken = tcg_temp_local_new(); - tcg_gen_movi_tl(jmp_pc, dc->pc+8); - tcg_gen_movi_tl(dc->btaken, 0); + tcg_gen_movi_tl(jmp_pc, 0); tcg_gen_brcondi_i32(op0 == 0x03 ? TCG_COND_NE : TCG_COND_EQ, cpu_srf, 0, lab); - tcg_gen_movi_tl(dc->btaken, 1); tcg_gen_movi_tl(jmp_pc, tmp_pc); gen_set_label(lab); dc->j_state = JUMP_BRANCH; + dc->tb_flags |= B_FLAG; } break; case 0x11: /* l.jr */ @@ -1708,19 +1709,29 @@ static void check_breakpoint(OpenRISCCPU *cpu, DisasContext *dc) static void handle_delay_slot(DisasContext *dc) { dc->tb_flags &= ~D_FLAG; + dc->tb_flags &= ~B_FLAG; gen_sync_flags(dc); switch (dc->j_state) { case JUMP_BRANCH: { int l1 = gen_new_label(); - tcg_gen_brcondi_tl(TCG_COND_NE, dc->btaken, 0, l1); + tcg_gen_brcondi_tl(TCG_COND_NE, jmp_pc, 0, l1); gen_goto_tb(dc, 1, dc->pc); gen_set_label(l1); - tcg_temp_free(dc->btaken); gen_goto_tb(dc, 0, dc->j_target); break; } + case JUMP_BRANCH_DELAYED: + { + int l1 = gen_new_label(); + tcg_gen_brcondi_tl(TCG_COND_NE, jmp_pc, 0, l1); + gen_goto_tb(dc, 1, dc->pc); + gen_set_label(l1); + tcg_gen_mov_tl(cpu_pc, jmp_pc); + tcg_gen_exit_tb(0); + break; + } case JUMP_STATIC: gen_goto_tb(dc, 0, dc->j_target); break; @@ -1759,8 +1770,13 @@ static inline void gen_intermediate_code_internal(OpenRISCCPU *cpu, dc->cpufeature = cpu->feature; dc->mem_idx = cpu_mmu_index(&cpu->env); dc->synced_flags = dc->tb_flags = tb->flags; - dc->delayed_branch = !!(dc->tb_flags & D_FLAG); dc->singlestep_enabled = cs->singlestep_enabled; + + dc->delayed_branch = !!(dc->tb_flags & D_FLAG); + if ((dc->delayed_branch) && (dc->tb_flags&B_FLAG)) { + dc->j_state = JUMP_BRANCH_DELAYED; + } + if (qemu_loglevel_mask(CPU_LOG_TB_IN_ASM)) { qemu_log("-----------------------------------------\n"); log_cpu_state(CPU(cpu), 0);
At the moment a branch l.bf and l.bnf requires an additional flag variable "btaken" for the jump to decide after the delayed slot whether the jump should be taken or not. With this patch the jmp_pc variable is used as a flag. If jmp_pc is zero the branch is not taken. This requires that together with the delayed slot flag an additional branch flag is saved to differ between slot types. In average the speed increases by around 5%. Signed-off-by: Sebastian Macke <sebastian@macke.de> --- target-openrisc/cpu.h | 7 ++++--- target-openrisc/translate.c | 34 +++++++++++++++++++++++++--------- 2 files changed, 29 insertions(+), 12 deletions(-)