
[for-2.4] tcg/i386: Implement trunc_shr_i32

Message ID 1437206294-22407-1-git-send-email-rth@twiddle.net
State New

Commit Message

Richard Henderson July 18, 2015, 7:58 a.m. UTC
Enforce the invariant that 32-bit quantities are zero extended
in the register.  This avoids having to re-zero-extend at memory
accesses for 32-bit guests.
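
To illustrate the invariant (a toy model, not part of the patch): any
write to a 32-bit register on x86-64 clears bits 63:32 of the
destination, which is what makes the invariant cheap to maintain.

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t reg = UINT64_C(0xdeadbeef00400000);  /* stale high bits */

        /* A 32-bit register write (e.g. "mov %ebx,%ebx") keeps the low
         * half and clears bits 63:32; the uint32_t cast models the same
         * truncate-and-zero-extend in one step.  */
        reg = (uint32_t)reg;

        assert(reg == UINT64_C(0x00400000));
        return 0;
    }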

Signed-off-by: Richard Henderson <rth@twiddle.net>
---
Here's an alternative to the other things we've been considering.
We could even make this conditional on USER_ONLY if you like.

This does in fact fix the mips test case.  Consider the fact that
memory operations are probably more common than truncations, and
it would seem that we have a net size win by forcing the truncate
over adding a byte for the ADDR32 (or 2 bytes for a zero-extend).

Indeed, for 2.5, we could look at dropping the existing zero-extend
from the softmmu path.  Also for 2.5, split trunc_shr into two parts,
trunc_lo and trunc_hi.  No one really uses any other value for the
shift, and x86 could really use different constraints for the two.
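
As a rough sketch (the trunc_lo/trunc_hi opcode names are hypothetical;
the helpers are the ones used by the patch below), the split could look
like this in tcg_out_op:

    /* Hypothetical sketch only; these opcodes do not exist yet.  */
    case INDEX_op_trunc_lo_i32:
        /* With the zero-extension invariant, the low half needs at
           most a zero-extending register move, so a plain "r", "r"
           constraint pair would do.  */
        if (args[0] != args[1]) {
            tcg_out_ext32u(s, args[0], args[1]);
        }
        break;
    case INDEX_op_trunc_hi_i32:
        /* A 64-bit shift right by 32 already clears bits 63:32, so no
           explicit zero-extend is needed; this one wants "r", "0".  */
        tcg_out_shifti(s, SHIFT_SHR + P_REXW, args[0], 32);
        break;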

Thoughts?

r~
---
 tcg/i386/tcg-target.c | 9 +++++++++
 tcg/i386/tcg-target.h | 2 +-
 2 files changed, 10 insertions(+), 1 deletion(-)

Comments

Aurelien Jarno July 18, 2015, 9:18 p.m. UTC | #1
On 2015-07-18 08:58, Richard Henderson wrote:
> Enforce the invariant that 32-bit quantities are zero extended
> in the register.  This avoids having to re-zero-extend at memory
> accesses for 32-bit guests.
> 
> Signed-off-by: Richard Henderson <rth@twiddle.net>
> ---
> Here's an alternative to the other things we've been considering.
> We could even make this conditional on USER_ONLY if you like.
> 
> This does in fact fix the mips test case.  Consider the fact that
> memory operations are probably more common than truncations, and
> it would seem that we have a net size win by forcing the truncate
> over adding a byte for the ADDR32 (or 2 bytes for a zero-extend).

I think we should go with your previous patch for 2.4, and think calmly
about how to do that better for 2.5. It slightly increases the generated
code, but only in bytes, not in number of instructions, so I don't think
the performance impact is huge.

> Indeed, for 2.5, we could look at dropping the existing zero-extend
> from the softmmu path.  Also for 2.5, split trunc_shr into two parts,

From a quick look, we need to move the address to new registers anyway,
so not zero-extending will mean adding the REXW prefix.

> trunc_lo and trunc_hi.  No one really uses any other value for the
> shift, and x86 could really use different constraints for the two.

That sounds like a good idea.

Aurelien Jarno July 19, 2015, 11:26 a.m. UTC | #2
On 2015-07-18 23:18, Aurelien Jarno wrote:
> On 2015-07-18 08:58, Richard Henderson wrote:
> > Enforce the invariant that 32-bit quantities are zero extended
> > in the register.  This avoids having to re-zero-extend at memory
> > accesses for 32-bit guests.
> > 
> > Signed-off-by: Richard Henderson <rth@twiddle.net>
> > ---
> > Here's an alternative to the other things we've been considering.
> > We could even make this conditional on USER_ONLY if you like.
> > 
> > This does in fact fix the mips test case.  Consider the fact that
> > memory operations are probably more common than truncations, and
> > it would seem that we have a net size win by forcing the truncate
> > over adding a byte for the ADDR32 (or 2 bytes for a zero-extend).
> 
> I think we should go with your previous patch for 2.4, and think calmly
> about how to do that better for 2.5. It slightly increases the generated
> code, but only in bytes, not in number of instructions, so I don't think
> the performance impact is huge.
> 
> > Indeed, for 2.5, we could look at dropping the existing zero-extend
> > from the softmmu path.  Also for 2.5, split trunc_shr into two parts,
> 
> From a quick look, we need to move the address to new registers anyway,
> so not zero-extending will mean adding the REXW prefix.

Well, looking at it in more detail, we can move one instruction from
the fast path to the slow path. Here is typical TLB code for a store:

fast-path:
      mov    %rbp,%rdi                  # %rbp holds the guest address
      mov    %rbp,%rsi
      shr    $0x7,%rdi                  # addr >> (TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS)
      and    $0xfffffffffffff003,%rsi   # page tag; low bits kept to catch misalignment
      and    $0x1fe0,%edi               # mask down to the TLB index
      lea    0x4e68(%r14,%rdi,1),%rdi   # &env->tlb_table[mmu_idx][index] (env is %r14)
      cmp    (%rdi),%rsi                # compare against the cached tag
      mov    %rbp,%rsi                  # reload the address; the slow path reuses it
      jne    0x7f45b8bcc800             # TLB miss: branch to the slow path
      add    0x10(%rdi),%rsi            # host address = guest address + addend
      mov    %ebx,(%rsi)                # the actual 32-bit store

slow-path:
      mov    %r14,%rdi                  # arg 0: env (%rsi already holds the address)
      mov    %ebx,%edx                  # arg 2: value to store
      mov    $0x22,%ecx                 # arg 3: TCGMemOpIdx (memop + mmu_idx)
      lea    -0x156(%rip),%r8           # return address back into the TB
      push   %r8                        # becomes the helper's return address
      jmpq   0x7f45cb337010             # tail-call the store helper

If we know that %rbp is properly zero-extended when needed, we can
change the end of the fast path into:

      cmp    (%rdi),%rsi                # compare against the cached tag
      jne    0x7f45b8bcc800             # TLB miss: branch to the slow path
      mov    0x10(%rdi),%rsi            # load only the addend
      mov    %ebx,(%rsi,%rbp,1)         # store at addend + zero-extended address

However, that means %rsi is no longer loaded with the guest address, so
we have to load it in the slow path instead. In the end it means moving
one instruction from the fast path to the slow path.
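
Concretely (a sketch only, with addrlo standing for whatever register
holds the guest address in the surrounding code), the slow path would
gain something like this in tcg_out_qemu_st_slow_path:

    /* The optimized fast path no longer leaves the guest address in
       %rsi, so load the second helper argument here.  A 32-bit mov is
       enough because the address is kept zero-extended.  */
    tcg_out_mov(s, TCG_TYPE_I32, tcg_target_call_iarg_regs[1], addrlo);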

Now, I have no idea which would really improve performance: a smaller
fast path, so there are fewer instructions to execute, or smaller code
in general, so that the caches are better used?

Patch

diff --git a/tcg/i386/tcg-target.c b/tcg/i386/tcg-target.c
index ff4d9cf..9f01369 100644
--- a/tcg/i386/tcg-target.c
+++ b/tcg/i386/tcg-target.c
@@ -2040,6 +2040,14 @@  static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
     case INDEX_op_ext32s_i64:
         tcg_out_ext32s(s, args[0], args[1]);
         break;
+    case INDEX_op_trunc_shr_i32:
+        if (args[2] > 0) {
+            tcg_out_shifti(s, SHIFT_SHR + P_REXW, args[0], args[2]);
+        }
+        if (args[2] < 32) {
+            tcg_out_ext32u(s, args[0], args[0]);
+        }
+        break;
 #endif
 
     OP_32_64(deposit):
@@ -2170,6 +2178,7 @@  static const TCGTargetOpDef x86_op_defs[] = {
     { INDEX_op_ext8u_i64, { "r", "r" } },
     { INDEX_op_ext16u_i64, { "r", "r" } },
     { INDEX_op_ext32u_i64, { "r", "r" } },
+    { INDEX_op_trunc_shr_i32,  { "r", "0" } },
 
     { INDEX_op_deposit_i64, { "Q", "0", "Q" } },
     { INDEX_op_movcond_i64, { "r", "r", "re", "r", "0" } },
diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index 25b5133..8b87050 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -102,7 +102,7 @@  extern bool have_bmi1;
 #define TCG_TARGET_HAS_mulsh_i32        0
 
 #if TCG_TARGET_REG_BITS == 64
-#define TCG_TARGET_HAS_trunc_shr_i32    0
+#define TCG_TARGET_HAS_trunc_shr_i32    1
 #define TCG_TARGET_HAS_div2_i64         1
 #define TCG_TARGET_HAS_rot_i64          1
 #define TCG_TARGET_HAS_ext8s_i64        1