Patchwork [PULL,5/7] tcg: Add mmu helpers that take a return address argument

Submitter Richard Henderson
Date Aug. 26, 2013, 9 p.m.
Message ID <1377550812-908-6-git-send-email-rth@twiddle.net>
Permalink /patch/269981/
State New

Comments

Richard Henderson - Aug. 26, 2013, 9 p.m.
Allow the code that tcg generates to be less obtuse, passing in
the return address directly instead of computing it in the helper.

Maintain the old entrance point unchanged as an alternate entry point.

Delete the helper_st*_cmmu prototypes; the implementations did not exist.

Signed-off-by: Richard Henderson <rth@twiddle.net>
---
 include/exec/softmmu_defs.h     | 46 ++++++++++++++++++++++++++---------------
 include/exec/softmmu_template.h | 42 +++++++++++++++++++++++--------------
 2 files changed, 55 insertions(+), 33 deletions(-)
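The shape of the change described above can be sketched in isolation. This is a minimal illustration, not the QEMU code: `demo_ret_ldb`, `demo_ldb` and `last_retaddr` are hypothetical names, guest memory is a plain byte array, and the GCC/Clang builtin `__builtin_return_address(0)` stands in for `GETPC_EXT()`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: records the return address the helper was handed, so the
 * two entry points can be compared. Not part of the patch. */
static uintptr_t last_retaddr;

/* New-style entry point: the caller passes the return address explicitly,
 * so the helper never has to compute it. */
static uint8_t demo_ret_ldb(const uint8_t *mem, uint32_t addr,
                            uintptr_t retaddr)
{
    last_retaddr = retaddr;
    return mem[addr];
}

/* Old-style entry point, kept as a thin wrapper: it derives the return
 * address itself and forwards to the new entry point. The builtin is an
 * assumption standing in for GETPC_EXT(). */
__attribute__((noinline))
static uint8_t demo_ldb(const uint8_t *mem, uint32_t addr)
{
    return demo_ret_ldb(mem, addr,
                        (uintptr_t)__builtin_return_address(0));
}
```

Existing callers keep using the three-argument form, while generated code that already knows its own return address can call the `_ret_` form directly.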
Paolo Bonzini - Aug. 26, 2013, 10:26 p.m.
On 26/08/2013 23:00, Richard Henderson wrote:
> Allow the code that tcg generates to be less obtuse, passing in
> the return address directly instead of computing it in the helper.
> 
> Maintain the old entrance point unchanged as an alternate entry point.
> 
> Delete the helper_st*_cmmu prototypes; the implementations did not exist.
> 
> Signed-off-by: Richard Henderson <rth@twiddle.net>

Something that can be done on top of this patch: what about moving the
"-1" to helper_ret_*?  It is common to pretty much all the targets
(except ARM has -2), and it would allow some simplifications.  For
example I played with return address helpers on 32-bit PPC, and you
could use a

    li   rN, retaddr
    mtlr rN
    b    st_trampoline[i]

sequence instead of one of

    li   rN, retaddr
    mtlr rN
    bl   st_trampoline[i]
    b    retaddr

or

    li   rN, retaddr
    mtlr rN
    addi rN, rN, -1
    b    st_trampoline[i]

Paolo
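A rough sketch of what moving the "-1" into the helper could look like. The thread does not spell out why the adjustment exists; the sketch assumes it is there so that the address falls inside the calling instruction rather than just past it. `CodeRange`, `range_contains` and `demo_ret_lookup` are hypothetical names, not QEMU APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical half-open range [start, end) of one block of generated code. */
typedef struct {
    uintptr_t start;
    uintptr_t end;
} CodeRange;

static bool range_contains(const CodeRange *r, uintptr_t pc)
{
    return pc >= r->start && pc < r->end;
}

/* Helper-side adjustment: the caller passes the raw return address and the
 * helper steps back one byte before using it. With the -1 here, the
 * generated code can hand over the unmodified return address. */
static bool demo_ret_lookup(const CodeRange *r, uintptr_t raw_retaddr)
{
    return range_contains(r, raw_retaddr - 1);
}
```

When the call is the last instruction of a block, the raw return address equals `end` and would miss the range; after the helper-side adjustment it resolves to the block containing the call.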
Richard Henderson - Aug. 26, 2013, 10:34 p.m.
On 08/26/2013 03:26 PM, Paolo Bonzini wrote:
> Something that can be done on top of this patch: what about moving the
> "-1" to helper_ret_*?  It is common to pretty much all the targets
> (except ARM has -2), and it would allow some simplifications.

I suppose so, yes.

>     li   rN, retaddr
>     mtlr rN
>     b    st_trampoline[i]
> 
> sequence instead of one of
> 
>     li   rN, retaddr
>     mtlr rN
>     bl   st_trampoline[i]
>     b    retaddr

This sort of thing is very difficult to evaluate, because of the
cpu's return address prediction stack.  I have so far avoided it.

The only cpus that I believe can make good use of tail calls into
the memory helpers are those with predicated stores and calls, i.e.
arm and ia64.


r~
Peter Maydell - Aug. 26, 2013, 11:26 p.m.
On 26 August 2013 22:00, Richard Henderson <rth@twiddle.net> wrote:
> Allow the code that tcg generates to be less obtuse, passing in
> the return address directly instead of computing it in the helper.

> +uint8_t helper_ret_ldb_mmu(CPUArchState *env, target_ulong addr,
> +                           int mmu_idx, uintptr_t retaddr);

>  uint8_t helper_ldb_mmu(CPUArchState *env, target_ulong addr, int mmu_idx);

I thought the reason we did it this way round was to avoid having
so many arguments to helpers that we overflowed registers and
into the stack on some calling conventions? Or does this not make
much difference in practice?

thanks
-- PMM
Aurelien Jarno - Aug. 27, 2013, 10:46 a.m.
On Mon, Aug 26, 2013 at 03:34:15PM -0700, Richard Henderson wrote:
> On 08/26/2013 03:26 PM, Paolo Bonzini wrote:
> > Something that can be done on top of this patch: what about moving the
> > "-1" to helper_ret_*?  It is common to pretty much all the targets
> > (except ARM has -2), and it would allow some simplifications.
> 
> I suppose so, yes.
> 
> >     li   rN, retaddr
> >     mtlr rN
> >     b    st_trampoline[i]
> > 
> > sequence instead of one of
> > 
> >     li   rN, retaddr
> >     mtlr rN
> >     bl   st_trampoline[i]
> >     b    retaddr
> 
> This sort of thing is very difficult to evaluate, because of the
> cpu's return address prediction stack.  I have so far avoided it.
> 
> The only cpus that I believe can make good use of tail calls into
> the memory helpers are those with predicated stores and calls, i.e.
> arm and ia64.
> 

On the other hand, calling the helper is the exception rather than the
rule (that's why the calls have been moved to the end of the TB), so we
should not focus too much on generating fast code, but rather on small
code, in order to use the caches (both TB and CPU caches) more efficiently.

Therefore even on x86, if we move the -1 to the helper level, it should
be possible to use a tail call for the stores, something like:

    mov    %r14,%rdi
    mov    %ebx,%edx
    xor    %ecx,%ecx
    lea    -0x10f(%rip),%r8        # 0x7f2541a6f69a
    pushq  %r8
    jmpq   0x7f25526757a0

Instead of:

    mov    %r14,%rdi
    mov    %ebx,%edx
    xor    %ecx,%ecx
    lea    -0x10f(%rip),%r8        # 0x7f2541a6f69a
    callq  0x7f25526757a0
    jmpq   0x7f2541a6f69b
Aurelien Jarno - Aug. 27, 2013, 10:47 a.m.
On Tue, Aug 27, 2013 at 12:26:00AM +0100, Peter Maydell wrote:
> On 26 August 2013 22:00, Richard Henderson <rth@twiddle.net> wrote:
> > Allow the code that tcg generates to be less obtuse, passing in
> > the return address directly instead of computing it in the helper.
> 
> > +uint8_t helper_ret_ldb_mmu(CPUArchState *env, target_ulong addr,
> > +                           int mmu_idx, uintptr_t retaddr);
> 
> >  uint8_t helper_ldb_mmu(CPUArchState *env, target_ulong addr, int mmu_idx);
> 
> I thought the reason we did it this way round was to avoid having
> so many arguments to helpers that we overflowed registers and
> into the stack on some calling conventions? Or does this not make
> much difference in practice?
> 

That was indeed the idea, but I think it's a good idea to provide this
alternative for architectures supporting enough arguments in registers.
Richard Henderson - Aug. 27, 2013, 2:53 p.m.
On 08/27/2013 03:46 AM, Aurelien Jarno wrote:
> On the other hand, calling the helper is the exception rather than the
> rule (that's why the calls have been moved to the end of the TB), so we
> should not focus too much on generating fast code, but rather on small
> code, in order to use the caches (both TB and CPU caches) more efficiently.
> 
> Therefore even on x86, if we move the -1 to the helper level, it should
> be possible to use a tail call for the stores, something like:
> 
>     mov    %r14,%rdi
>     mov    %ebx,%edx
>     xor    %ecx,%ecx
>     lea    -0x10f(%rip),%r8        # 0x7f2541a6f69a
>     pushq  %r8
>     jmpq   0x7f25526757a0
> 
> Instead of:
> 
>     mov    %r14,%rdi
>     mov    %ebx,%edx
>     xor    %ecx,%ecx
>     lea    -0x10f(%rip),%r8        # 0x7f2541a6f69a
>     callq  0x7f25526757a0
>     jmpq   0x7f2541a6f69b

Fair enough.  I'll have a go at some follow-ups then.


r~
Aurelien Jarno - Aug. 27, 2013, 3:43 p.m.
On Tue, Aug 27, 2013 at 07:53:56AM -0700, Richard Henderson wrote:
> On 08/27/2013 03:46 AM, Aurelien Jarno wrote:
> > On the other hand, calling the helper is the exception rather than the
> > rule (that's why the calls have been moved to the end of the TB), so we
> > should not focus too much on generating fast code, but rather on small
> > code, in order to use the caches (both TB and CPU caches) more efficiently.
> > 
> > Therefore even on x86, if we move the -1 to the helper level, it should
> > be possible to use a tail call for the stores, something like:
> > 
> >     mov    %r14,%rdi
> >     mov    %ebx,%edx
> >     xor    %ecx,%ecx
> >     lea    -0x10f(%rip),%r8        # 0x7f2541a6f69a
> >     pushq  %r8
> >     jmpq   0x7f25526757a0
> > 
> > Instead of:
> > 
> >     mov    %r14,%rdi
> >     mov    %ebx,%edx
> >     xor    %ecx,%ecx
> >     lea    -0x10f(%rip),%r8        # 0x7f2541a6f69a
> >     callq  0x7f25526757a0
> >     jmpq   0x7f2541a6f69b
> 
> Fair enough.  I'll have a go at some follow-ups then.
> 

I think this can also be done in a second step. Do you want to create a
version 3, or should I just process the current pull request and you
will provide additional patches later?
Richard Henderson - Aug. 27, 2013, 3:53 p.m.
On 08/27/2013 08:43 AM, Aurelien Jarno wrote:
> I think this can also be done in a second step. Do you want to create a
> version 3, or should I just process the current pull request and you
> will provide additional patches later?

I think it might be better to provide follow-up patches.


r~

Patch

diff --git a/include/exec/softmmu_defs.h b/include/exec/softmmu_defs.h
index 1f25e33..e55e717 100644
--- a/include/exec/softmmu_defs.h
+++ b/include/exec/softmmu_defs.h
@@ -9,29 +9,41 @@ 
 #ifndef SOFTMMU_DEFS_H
 #define SOFTMMU_DEFS_H
 
+uint8_t helper_ret_ldb_mmu(CPUArchState *env, target_ulong addr,
+                           int mmu_idx, uintptr_t retaddr);
+uint16_t helper_ret_ldw_mmu(CPUArchState *env, target_ulong addr,
+                            int mmu_idx, uintptr_t retaddr);
+uint32_t helper_ret_ldl_mmu(CPUArchState *env, target_ulong addr,
+                            int mmu_idx, uintptr_t retaddr);
+uint64_t helper_ret_ldq_mmu(CPUArchState *env, target_ulong addr,
+                            int mmu_idx, uintptr_t retaddr);
+
+void helper_ret_stb_mmu(CPUArchState *env, target_ulong addr, uint8_t val,
+                        int mmu_idx, uintptr_t retaddr);
+void helper_ret_stw_mmu(CPUArchState *env, target_ulong addr, uint16_t val,
+                        int mmu_idx, uintptr_t retaddr);
+void helper_ret_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
+                        int mmu_idx, uintptr_t retaddr);
+void helper_ret_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
+                        int mmu_idx, uintptr_t retaddr);
+
 uint8_t helper_ldb_mmu(CPUArchState *env, target_ulong addr, int mmu_idx);
-void helper_stb_mmu(CPUArchState *env, target_ulong addr, uint8_t val,
-                    int mmu_idx);
 uint16_t helper_ldw_mmu(CPUArchState *env, target_ulong addr, int mmu_idx);
-void helper_stw_mmu(CPUArchState *env, target_ulong addr, uint16_t val,
-                    int mmu_idx);
 uint32_t helper_ldl_mmu(CPUArchState *env, target_ulong addr, int mmu_idx);
-void helper_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
-                    int mmu_idx);
 uint64_t helper_ldq_mmu(CPUArchState *env, target_ulong addr, int mmu_idx);
-void helper_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
-                    int mmu_idx);
+
+void helper_stb_mmu(CPUArchState *env, target_ulong addr,
+                    uint8_t val, int mmu_idx);
+void helper_stw_mmu(CPUArchState *env, target_ulong addr,
+                    uint16_t val, int mmu_idx);
+void helper_stl_mmu(CPUArchState *env, target_ulong addr,
+                    uint32_t val, int mmu_idx);
+void helper_stq_mmu(CPUArchState *env, target_ulong addr,
+                    uint64_t val, int mmu_idx);
 
 uint8_t helper_ldb_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx);
-void helper_stb_cmmu(CPUArchState *env, target_ulong addr, uint8_t val,
-int mmu_idx);
 uint16_t helper_ldw_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx);
-void helper_stw_cmmu(CPUArchState *env, target_ulong addr, uint16_t val,
-                     int mmu_idx);
 uint32_t helper_ldl_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx);
-void helper_stl_cmmu(CPUArchState *env, target_ulong addr, uint32_t val,
-                     int mmu_idx);
 uint64_t helper_ldq_cmmu(CPUArchState *env, target_ulong addr, int mmu_idx);
-void helper_stq_cmmu(CPUArchState *env, target_ulong addr, uint64_t val,
-                     int mmu_idx);
-#endif
+
+#endif /* SOFTMMU_DEFS_H */
diff --git a/include/exec/softmmu_template.h b/include/exec/softmmu_template.h
index 8584902..7d8bcb5 100644
--- a/include/exec/softmmu_template.h
+++ b/include/exec/softmmu_template.h
@@ -78,15 +78,18 @@  static inline DATA_TYPE glue(io_read, SUFFIX)(CPUArchState *env,
 }
 
 /* handle all cases except unaligned access which span two pages */
+#ifdef SOFTMMU_CODE_ACCESS
+static
+#endif
 DATA_TYPE
-glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
-                                         int mmu_idx)
+glue(glue(helper_ret_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env,
+                                             target_ulong addr, int mmu_idx,
+                                             uintptr_t retaddr)
 {
     DATA_TYPE res;
     int index;
     target_ulong tlb_addr;
     hwaddr ioaddr;
-    uintptr_t retaddr;
 
     /* test if there is match for unaligned or IO access */
     /* XXX: could done more in memory macro in a non portable way */
@@ -98,13 +101,11 @@  glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
             /* IO access */
             if ((addr & (DATA_SIZE - 1)) != 0)
                 goto do_unaligned_access;
-            retaddr = GETPC_EXT();
             ioaddr = env->iotlb[mmu_idx][index];
             res = glue(io_read, SUFFIX)(env, ioaddr, addr, retaddr);
         } else if (((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1) >= TARGET_PAGE_SIZE) {
             /* slow unaligned access (it spans two pages or IO) */
         do_unaligned_access:
-            retaddr = GETPC_EXT();
 #ifdef ALIGNED_ONLY
             do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
 #endif
@@ -115,7 +116,6 @@  glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
             uintptr_t addend;
 #ifdef ALIGNED_ONLY
             if ((addr & (DATA_SIZE - 1)) != 0) {
-                retaddr = GETPC_EXT();
                 do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
             }
 #endif
@@ -124,8 +124,6 @@  glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
                                                 (addr + addend));
         }
     } else {
-        /* the page is not in the TLB : fill it */
-        retaddr = GETPC_EXT();
 #ifdef ALIGNED_ONLY
         if ((addr & (DATA_SIZE - 1)) != 0)
             do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
@@ -136,6 +134,14 @@  glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
     return res;
 }
 
+DATA_TYPE
+glue(glue(helper_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
+                                         int mmu_idx)
+{
+    return glue(glue(helper_ret_ld, SUFFIX), MMUSUFFIX)(env, addr, mmu_idx,
+                                                        GETPC_EXT());
+}
+
 /* handle all unaligned cases */
 static DATA_TYPE
 glue(glue(slow_ld, SUFFIX), MMUSUFFIX)(CPUArchState *env,
@@ -214,13 +220,13 @@  static inline void glue(io_write, SUFFIX)(CPUArchState *env,
     io_mem_write(mr, physaddr, val, 1 << SHIFT);
 }
 
-void glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
-                                              target_ulong addr, DATA_TYPE val,
-                                              int mmu_idx)
+void
+glue(glue(helper_ret_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
+                                             target_ulong addr, DATA_TYPE val,
+                                             int mmu_idx, uintptr_t retaddr)
 {
     hwaddr ioaddr;
     target_ulong tlb_addr;
-    uintptr_t retaddr;
     int index;
 
     index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
@@ -231,12 +237,10 @@  void glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
             /* IO access */
             if ((addr & (DATA_SIZE - 1)) != 0)
                 goto do_unaligned_access;
-            retaddr = GETPC_EXT();
             ioaddr = env->iotlb[mmu_idx][index];
             glue(io_write, SUFFIX)(env, ioaddr, val, addr, retaddr);
         } else if (((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1) >= TARGET_PAGE_SIZE) {
         do_unaligned_access:
-            retaddr = GETPC_EXT();
 #ifdef ALIGNED_ONLY
             do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
 #endif
@@ -247,7 +251,6 @@  void glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
             uintptr_t addend;
 #ifdef ALIGNED_ONLY
             if ((addr & (DATA_SIZE - 1)) != 0) {
-                retaddr = GETPC_EXT();
                 do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
             }
 #endif
@@ -257,7 +260,6 @@  void glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
         }
     } else {
         /* the page is not in the TLB : fill it */
-        retaddr = GETPC_EXT();
 #ifdef ALIGNED_ONLY
         if ((addr & (DATA_SIZE - 1)) != 0)
             do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
@@ -267,6 +269,14 @@  void glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
     }
 }
 
+void
+glue(glue(helper_st, SUFFIX), MMUSUFFIX)(CPUArchState *env, target_ulong addr,
+                                         DATA_TYPE val, int mmu_idx)
+{
+    glue(glue(helper_ret_st, SUFFIX), MMUSUFFIX)(env, addr, val, mmu_idx,
+                                                 GETPC_EXT());
+}
+
 /* handles all unaligned cases */
 static void glue(glue(slow_st, SUFFIX), MMUSUFFIX)(CPUArchState *env,
                                                    target_ulong addr,