| Message ID | 1437206294-22407-1-git-send-email-rth@twiddle.net |
|---|---|
| State | New |
On 2015-07-18 08:58, Richard Henderson wrote:
> Enforce the invariant that 32-bit quantities are zero extended
> in the register. This avoids having to re-zero-extend at memory
> accesses for 32-bit guests.
>
> Signed-off-by: Richard Henderson <rth@twiddle.net>
> ---
> Here's an alternative to the other things we've been considering.
> We could even make this conditional on USER_ONLY if you like.
>
> This does in fact fix the mips test case. Consider the fact that
> memory operations are probably more common than truncations, and
> it would seem that we have a net size win by forcing the truncate
> over adding a byte for the ADDR32 (or 2 bytes for a zero-extend).

I think we should go with your previous patch for 2.4, and think calmly
about how to do that better for 2.5. It slightly increases the generated
code, but only in bytes, not in number of instructions, so I don't think
the performance impact is huge.

> Indeed, for 2.5, we could look at dropping the existing zero-extend
> from the softmmu path. Also for 2.5, split trunc_shr into two parts,

From a quick look, we need to move the address to new registers anyway,
so not zero-extending will mean adding the REXW prefix.

> trunc_lo and trunc_hi. No one really uses any other value for the
> shift, and x86 could really use different constraints for the two.

That sounds like a good idea.
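For reference, here is a minimal sketch of what the proposed trunc_lo/trunc_hi split would mean semantically, assuming (as the message implies) that the only shift counts in use are 0 and 32. The C below is purely illustrative, not an existing TCG interface:

```c
#include <stdint.h>

/* Illustrative sketch only: the two shift counts actually used by
 * trunc_shr_i32, given their own names as proposed above. */
static inline uint32_t trunc_lo(uint64_t x)
{
    return (uint32_t)x;           /* shift by 0: keep the low half   */
}

static inline uint32_t trunc_hi(uint64_t x)
{
    return (uint32_t)(x >> 32);   /* shift by 32: keep the high half */
}
```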
On 2015-07-18 23:18, Aurelien Jarno wrote:
> On 2015-07-18 08:58, Richard Henderson wrote:
> > Enforce the invariant that 32-bit quantities are zero extended
> > in the register. This avoids having to re-zero-extend at memory
> > accesses for 32-bit guests.
> >
> > Signed-off-by: Richard Henderson <rth@twiddle.net>
> > ---
> > Here's an alternative to the other things we've been considering.
> > We could even make this conditional on USER_ONLY if you like.
> >
> > This does in fact fix the mips test case. Consider the fact that
> > memory operations are probably more common than truncations, and
> > it would seem that we have a net size win by forcing the truncate
> > over adding a byte for the ADDR32 (or 2 bytes for a zero-extend).
>
> I think we should go with your previous patch for 2.4, and think calmly
> about how to do that better for 2.5. It slightly increases the generated
> code, but only in bytes, not in number of instructions, so I don't think
> the performance impact is huge.
>
> > Indeed, for 2.5, we could look at dropping the existing zero-extend
> > from the softmmu path. Also for 2.5, split trunc_shr into two parts,
>
> From a quick look, we need to move the address to new registers anyway,
> so not zero-extending will mean adding the REXW prefix.

Well, looking at it in more detail, we can move one instruction from the
fast path to the slow path. Here is a typical TLB code sequence for a
store:

fast-path:

    mov    %rbp,%rdi
    mov    %rbp,%rsi
    shr    $0x7,%rdi
    and    $0xfffffffffffff003,%rsi
    and    $0x1fe0,%edi
    lea    0x4e68(%r14,%rdi,1),%rdi
    cmp    (%rdi),%rsi
    mov    %rbp,%rsi
    jne    0x7f45b8bcc800
    add    0x10(%rdi),%rsi
    mov    %ebx,(%rsi)

slow-path:

    mov    %r14,%rdi
    mov    %ebx,%edx
    mov    $0x22,%ecx
    lea    -0x156(%rip),%r8
    push   %r8
    jmpq   0x7f45cb337010

If we know that %rbp is properly zero-extended when needed, we can
change the end of the fast path into:

    cmp    (%rdi),%rsi
    jne    0x7f45b8bcc800
    mov    0x10(%rdi),%rsi
    mov    %ebx,(%rsi,%rbp,1)

However, that means %rsi is no longer loaded with the address, so we
have to load it in the slow path. In the end this moves one instruction
from the fast path to the slow path.

Now I have no idea what would really improve performance: a smaller
fast path, so there are fewer instructions to execute, or smaller code
in general, so that the caches are better used?
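For readers who prefer C to generated code, here is roughly what the store fast path above does, as a simplified sketch. The structure names, masks and shift amounts below are illustrative only (the real tag comparison also folds in an alignment check, which the sketch omits); they are not QEMU's actual definitions:

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified sketch of the softmmu store fast path shown above;
 * names and constants are illustrative, not QEMU's. */
typedef struct {
    uint64_t tag;     /* masked guest address this entry maps            */
    uint64_t addend;  /* offset turning a guest address into a host one  */
} TLBEntrySketch;

static bool tlb_store32_fast(TLBEntrySketch *tlb, unsigned index_mask,
                             uint64_t addr, uint32_t val)
{
    /* shr/and/lea: pick a TLB entry from bits of the guest address */
    TLBEntrySketch *e = &tlb[(addr >> 12) & index_mask];

    /* cmp/jne: on a tag mismatch, fall back to the slow-path helper */
    if (e->tag != (addr & ~(uint64_t)0xfff)) {
        return false;
    }

    /* add/mov: hit, store through the translated host address */
    *(uint32_t *)(uintptr_t)(addr + e->addend) = val;
    return true;
}
```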
```diff
diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index ff4d9cf..9f01369 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -2040,6 +2040,14 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
     case INDEX_op_ext32s_i64:
         tcg_out_ext32s(s, args[0], args[1]);
         break;
+    case INDEX_op_trunc_shr_i32:
+        if (args[2] > 0) {
+            tcg_out_shifti(s, SHIFT_SHR + P_REXW, args[0], args[2]);
+        }
+        if (args[2] < 32) {
+            tcg_out_ext32u(s, args[0], args[0]);
+        }
+        break;
 #endif
 
     OP_32_64(deposit):
@@ -2170,6 +2178,7 @@ static const TCGTargetOpDef x86_op_defs[] = {
     { INDEX_op_ext8u_i64, { "r", "r" } },
     { INDEX_op_ext16u_i64, { "r", "r" } },
     { INDEX_op_ext32u_i64, { "r", "r" } },
+    { INDEX_op_trunc_shr_i32, { "r", "0" } },
 
     { INDEX_op_deposit_i64, { "Q", "0", "Q" } },
     { INDEX_op_movcond_i64, { "r", "r", "re", "r", "0" } },
diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index 25b5133..8b87050 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -102,7 +102,7 @@ extern bool have_bmi1;
 #define TCG_TARGET_HAS_mulsh_i32        0
 
 #if TCG_TARGET_REG_BITS == 64
-#define TCG_TARGET_HAS_trunc_shr_i32    0
+#define TCG_TARGET_HAS_trunc_shr_i32    1
 #define TCG_TARGET_HAS_div2_i64         1
 #define TCG_TARGET_HAS_rot_i64          1
 #define TCG_TARGET_HAS_ext8s_i64        1
```
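Semantically, the new INDEX_op_trunc_shr_i32 case amounts to the following C sketch of the operation's effect (not the emitted x86 code); the `{ "r", "0" }` constraint pair ties the input to the output register, so both steps operate in place:

```c
#include <stdint.h>

/* Sketch of the effect of the emitted code: shift right, then keep the
 * low 32 bits.  The shift is skipped for a count of 0, and the explicit
 * zero-extension is skipped for a count of 32, since shifting by 32
 * already leaves the upper half clear. */
static inline uint64_t trunc_shr_i32_effect(uint64_t src, unsigned shift)
{
    if (shift > 0) {
        src >>= shift;            /* tcg_out_shifti(SHIFT_SHR + P_REXW)  */
    }
    if (shift < 32) {
        src = (uint32_t)src;      /* tcg_out_ext32u: clear the high half */
    }
    return src;                   /* result is always zero-extended      */
}
```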
Enforce the invariant that 32-bit quantities are zero extended
in the register. This avoids having to re-zero-extend at memory
accesses for 32-bit guests.

Signed-off-by: Richard Henderson <rth@twiddle.net>
---
Here's an alternative to the other things we've been considering.
We could even make this conditional on USER_ONLY if you like.

This does in fact fix the mips test case. Consider the fact that
memory operations are probably more common than truncations, and
it would seem that we have a net size win by forcing the truncate
over adding a byte for the ADDR32 (or 2 bytes for a zero-extend).

Indeed, for 2.5, we could look at dropping the existing zero-extend
from the softmmu path. Also for 2.5, split trunc_shr into two parts,
trunc_lo and trunc_hi. No one really uses any other value for the
shift, and x86 could really use different constraints for the two.

Thoughts?

r~
---
 tcg/i386/tcg-target.c | 9 +++++++++
 tcg/i386/tcg-target.h | 2 +-
 2 files changed, 10 insertions(+), 1 deletion(-)
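As an aside, a minimal C illustration of the invariant the cover letter describes (the function and variable names are mine, not QEMU's): every 32-bit quantity is written back with its upper 32 bits clear, so a later guest memory access can use the full 64-bit register directly, with no ADDR32 prefix and no extra zero-extension at the access site.

```c
#include <stdint.h>

/* Illustrative sketch of the invariant, not QEMU code. */
static inline uint64_t write_back_u32(uint32_t guest_val)
{
    return (uint64_t)guest_val;        /* upper 32 bits guaranteed zero */
}

static inline uint32_t guest_load_u32(const uint8_t *guest_base, uint64_t reg)
{
    /* The 64-bit register already holds a zero-extended guest address,
     * so it can be used as-is to form the host address. */
    return *(const uint32_t *)(guest_base + reg);
}
```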