[v4] implementing victim TLB for QEMU system emulated TLB

Message ID 1391440721-25777-1-git-send-email-trent.tong@gmail.com
State New

Commit Message

Xin Tong Feb. 3, 2014, 3:18 p.m. UTC
This patch adds a victim TLB to the QEMU system mode TLB.

Changes from PATCH v3.
1. updated measurements - performance improvement is now calculated over 3 runs.
2. fix code as discussed on the qemu mailing list.

QEMU system mode page table walks are expensive. Measurements taken by
running qemu-system-x86_64 under Intel PIN show that a TLB miss followed
by a 4-level page table walk in the guest Linux OS takes ~450 x86
instructions on average.

The QEMU system mode TLB is implemented as a direct-mapped hash table,
a structure that suffers from conflict misses. Increasing the
associativity of the TLB itself is not necessarily the answer, because
all of the ways may have to be searched serially on every access.
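
For illustration, a minimal standalone sketch (not QEMU code) of how
the direct-mapped slot is chosen and why two hot pages can keep
evicting each other; TARGET_PAGE_BITS = 12 is assumed here for a 4 KiB
guest page, CPU_TLB_BITS = 8 as in cpu-defs.h:

    #include <stdint.h>
    #include <stdio.h>

    #define TARGET_PAGE_BITS 12                 /* assumed 4 KiB pages */
    #define CPU_TLB_BITS     8
    #define CPU_TLB_SIZE     (1 << CPU_TLB_BITS)

    /* Direct-mapped TLB slot for a guest virtual address. */
    static unsigned tlb_index(uint64_t vaddr)
    {
        return (vaddr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
    }

    int main(void)
    {
        /* Two pages exactly CPU_TLB_SIZE pages (1 MiB) apart hash to
         * the same slot, so touching them alternately evicts each
         * other: a conflict miss even with the TLB mostly empty. */
        uint64_t a = 0x00400000;
        uint64_t b = a + ((uint64_t)CPU_TLB_SIZE << TARGET_PAGE_BITS);
        printf("index(a)=%u index(b)=%u\n", tlb_index(a), tlb_index(b));
        return 0;
    }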

A victim TLB holds translations evicted from the primary TLB on
replacement. It sits between the main TLB and its refill path, and it
has higher associativity than the main TLB (fully associative in this
patch). Looking up the victim TLB takes longer than the inline lookup,
but it is likely still much cheaper than a full page table walk. The
memory translation path changes as follows (a sketch of the new lookup
path follows the two lists below):

Before Victim TLB:
1. Inline TLB lookup
2. Exit code cache on TLB miss.
3. Check for unaligned, IO accesses
4. TLB refill.
5. Do the memory access.
6. Return to code cache.

After Victim TLB:
1. Inline TLB lookup
2. Exit code cache on TLB miss.
3. Check for unaligned, IO accesses
4. Victim TLB lookup.
5. If victim TLB misses, TLB refill
6. Do the memory access.
7. Return to code cache
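
On a primary-TLB miss, the slow path now probes the small fully
associative victim table and, on a hit, swaps the victim entry with
the occupant of the direct-mapped slot (the posted patch swaps the
corresponding iotlb entries as well). Below is a minimal standalone
sketch of this order, reusing the constants from the sketch above; the
tlb_entry type and walk_page_table_and_refill() are placeholders, not
the real QEMU structures or helpers:

    #include <stdint.h>

    #define CPU_VTLB_SIZE 8

    typedef struct {
        uint64_t addr;    /* tag: page-aligned guest virtual address */
        uint64_t addend;  /* offset used to complete the host access */
    } tlb_entry;

    static tlb_entry tlb_table[CPU_TLB_SIZE];      /* direct-mapped TLB */
    static tlb_entry tlb_v_table[CPU_VTLB_SIZE];   /* victim TLB        */

    extern void walk_page_table_and_refill(uint64_t vaddr);

    static tlb_entry *lookup(uint64_t vaddr)
    {
        unsigned index = (vaddr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
        uint64_t tag = vaddr & ~((1ULL << TARGET_PAGE_BITS) - 1);

        if (tlb_table[index].addr == tag) {
            return &tlb_table[index];              /* 1. inline hit */
        }
        for (int v = 0; v < CPU_VTLB_SIZE; v++) {  /* 4. victim lookup */
            if (tlb_v_table[v].addr == tag) {
                /* Promote the victim entry, demote the old primary one. */
                tlb_entry tmp = tlb_table[index];
                tlb_table[index] = tlb_v_table[v];
                tlb_v_table[v] = tmp;
                return &tlb_table[index];
            }
        }
        walk_page_table_and_refill(vaddr);         /* 5. slow refill */
        return &tlb_table[index];
    }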

The advantage is that the victim TLB adds associativity to the
direct-mapped TLB, and thus potentially avoids page table walks, while
keeping the time taken to flush it within reasonable limits. The cost
is that placing the victim TLB before the refill path lengthens the
refill path itself, since the victim TLB is consulted before every
refill. The performance results demonstrate that the pros outweigh the
cons.

Performance results taken on the SPECINT2006 train datasets, a kernel
boot, and the qemu configure script, measured on an Intel(R) Xeon(R)
CPU E5620 @ 2.40GHz Linux machine, are shown in the Google Doc linked
below.

In summary, the victim TLB improves the performance of
qemu-system-x86_64 by 11.9% on average across SPECINT2006, the kernel
boot and the qemu configure script, with the highest improvement of
26.2% in 456.hmmer. The victim TLB does not degrade performance in any
of the measured benchmarks. Furthermore, the implementation is
architecture independent and is expected to benefit other
architectures in QEMU as well.

https://docs.google.com/spreadsheet/ccc?key=0AiZRCc8IxzMRdGp0VUdkWmxmSnlyeFdUU3ZYNVhSNlE&usp=sharing

Although there are measurement fluctuations, the performance
improvement is significant and well outside the range of measurement
noise.

Signed-off-by: Xin Tong <trent.tong@gmail.com>

---
 cputlb.c                        | 31 ++++++++++++++++++++++++++++++-
 include/exec/cpu-defs.h         |  9 +++++++--
 include/exec/softmmu_template.h | 41 +++++++++++++++++++++++++++++++++++++----
 3 files changed, 74 insertions(+), 7 deletions(-)

Comments

Eric Blake Feb. 3, 2014, 3:50 p.m. UTC | #1
On 02/03/2014 08:18 AM, Xin Tong wrote:
> This patch adds a victim TLB to the QEMU system mode TLB.
> 
> Changes from PATCH v3.
> 1. updated measurements - performance improvement is now calculated over 3 runs.
> 2. fix code as discussed on the qemu mailing list.

The paragraph above should be moved...

> 
> QEMU system mode page table walks are expensive. Measurements taken by
> running qemu-system-x86_64 under Intel PIN show that a TLB miss followed
> by a 4-level page table walk in the guest Linux OS takes ~450 x86
> instructions on average.
> 
...
> 
> Although there are measurement fluctuations, the performance
> improvement is significant and well outside the range of measurement
> noise.
> 
> Signed-off-by: Xin Tong <trent.tong@gmail.com>
> 
> ---

...here, after the ---.  It is nice information for the reviewers, but
has no bearing on the commit itself once accepted into git (a year from
now, we won't care how many versions it took to get to the one version
that was committed).
Xin Tong Feb. 8, 2014, 3:01 a.m. UTC | #2
Can anyone provide further review of this patch, please?

Thank you,
Xin

On Mon, Feb 3, 2014 at 9:50 AM, Eric Blake <eblake@redhat.com> wrote:
> On 02/03/2014 08:18 AM, Xin Tong wrote:
>> This patch adds a victim TLB to the QEMU system mode TLB.
>>
>> Changes from PATCH v3.
>> 1. updated measurements - performance improvement is now calculated over 3 runs.
>> 2. fix code as discussed on the qemu mailing list.
>
> The paragraph above should be moved...
>
>>
>> QEMU system mode page table walks are expensive. Measurements taken by
>> running qemu-system-x86_64 under Intel PIN show that a TLB miss followed
>> by a 4-level page table walk in the guest Linux OS takes ~450 x86
>> instructions on average.
>>
> ...
>>
>> Although there are measurement fluctuations, the performance
>> improvement is significant and well outside the range of measurement
>> noise.
>>
>> Signed-off-by: Xin Tong <trent.tong@gmail.com>
>>
>> ---
>
> ...here, after the ---.  It is nice information for the reviewers, but
> has no bearing on the commit itself once accepted into git (a year from
> now, we won't care how many versions it took to get to the one version
> that was committed).
>
> --
> Eric Blake   eblake redhat com    +1-919-301-3266
> Libvirt virtualization library http://libvirt.org
>
Peter Maydell Feb. 12, 2014, 10:43 p.m. UTC | #3
On 3 February 2014 15:18, Xin Tong <trent.tong@gmail.com> wrote:
> This patch adds a victim TLB to the QEMU system mode TLB.

> @@ -264,8 +286,15 @@ void tlb_set_page(CPUArchState *env, target_ulong vaddr,
>                                              prot, &address);
>
>      index = (vaddr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
> -    env->iotlb[mmu_idx][index] = iotlb - vaddr;
>      te = &env->tlb_table[mmu_idx][index];
> +
> +    /* do not discard the translation in te, evict it into a victim tlb */
> +    unsigned vidx = env->vtlb_index++ % CPU_VTLB_SIZE;

Variable declarations should go at the start of a block.

> +/* macro to check the victim tlb */
> +#define VICTIM_TLB_HIT(ty)                                                    \
> +({                                                                            \
> +    /* we are about to do a page table walk. our last hope is the             \
> +     * victim tlb. try to refill from the victim tlb before walking the       \
> +     * page table. */                                                         \
> +    int vidx;                                                                 \
> +    hwaddr tmpiotlb;                                                          \
> +    CPUTLBEntry tmptlb;                                                       \
> +    for (vidx = CPU_VTLB_SIZE; vidx >= 0; --vidx) {                           \
> +        if (env->tlb_v_table[mmu_idx][vidx].ty == (addr & TARGET_PAGE_MASK)) {\
> +            /* found entry in victim tlb, swap tlb and iotlb */               \
> +            tmptlb = env->tlb_table[mmu_idx][index];                          \
> +            env->tlb_table[mmu_idx][index] = env->tlb_v_table[mmu_idx][vidx]; \
> +            env->tlb_v_table[mmu_idx][vidx] = tmptlb;                         \
> +            tmpiotlb = env->iotlb[mmu_idx][index];                            \
> +            env->iotlb[mmu_idx][index] = env->iotlb_v[mmu_idx][vidx];         \
> +            env->iotlb_v[mmu_idx][vidx] = tmpiotlb;                           \
> +            break;                                                            \
> +        }                                                                     \
> +    }                                                                         \

This loop indexes off the end of the [vidx] arrays on its
first iteration, doesn't it?

thanks
-- PMM
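
For context on the indexing issue raised above: with CPU_VTLB_SIZE
entries, the valid indices are 0 .. CPU_VTLB_SIZE - 1, so a downward
scan has to start at CPU_VTLB_SIZE - 1. A minimal sketch of a corrected
scan (an illustration only, not part of the posted patch):

    #include <stdint.h>

    #define CPU_VTLB_SIZE 8

    /* Scan a CPU_VTLB_SIZE-entry victim table for a page tag.  The
     * downward scan starts at CPU_VTLB_SIZE - 1 so the first probe
     * stays inside the array.  Returns the index on a hit (the caller
     * would swap that entry into the main TLB), or -1 on a miss (the
     * caller falls back to tlb_fill()). */
    static int victim_scan(const uint64_t tags[CPU_VTLB_SIZE], uint64_t tag)
    {
        for (int vidx = CPU_VTLB_SIZE - 1; vidx >= 0; --vidx) {
            if (tags[vidx] == tag) {
                return vidx;
            }
        }
        return -1;
    }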

Patch

diff --git a/cputlb.c b/cputlb.c
index b533f3f..f1418c2 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -58,8 +58,10 @@  void tlb_flush(CPUArchState *env, int flush_global)
     cpu->current_tb = NULL;
 
     memset(env->tlb_table, -1, sizeof(env->tlb_table));
+    memset(env->tlb_v_table, -1, sizeof(env->tlb_v_table));
     memset(env->tb_jmp_cache, 0, sizeof(env->tb_jmp_cache));
 
+    env->vtlb_index = 0;
     env->tlb_flush_addr = -1;
     env->tlb_flush_mask = 0;
     tlb_flush_count++;
@@ -106,6 +108,14 @@  void tlb_flush_page(CPUArchState *env, target_ulong addr)
         tlb_flush_entry(&env->tlb_table[mmu_idx][i], addr);
     }
 
+    /* check whether there are entries that need to be flushed in the vtlb */
+    for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
+        int k;
+        for (k = 0; k < CPU_VTLB_SIZE; k++) {
+            tlb_flush_entry(&env->tlb_v_table[mmu_idx][k], addr);
+        }
+    }
+
     tb_flush_jmp_cache(env, addr);
 }
 
@@ -170,6 +180,11 @@  void cpu_tlb_reset_dirty_all(ram_addr_t start1, ram_addr_t length)
                 tlb_reset_dirty_range(&env->tlb_table[mmu_idx][i],
                                       start1, length);
             }
+
+            for (i = 0; i < CPU_VTLB_SIZE; i++) {
+                tlb_reset_dirty_range(&env->tlb_v_table[mmu_idx][i],
+                                      start1, length);
+            }
         }
     }
 }
@@ -193,6 +208,13 @@  void tlb_set_dirty(CPUArchState *env, target_ulong vaddr)
     for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
         tlb_set_dirty1(&env->tlb_table[mmu_idx][i], vaddr);
     }
+
+    for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
+        int k;
+        for (k = 0; k < CPU_VTLB_SIZE; k++) {
+            tlb_set_dirty1(&env->tlb_v_table[mmu_idx][k], vaddr);
+        }
+    }
 }
 
 /* Our TLB does not support large pages, so remember the area covered by
@@ -264,8 +286,15 @@  void tlb_set_page(CPUArchState *env, target_ulong vaddr,
                                             prot, &address);
 
     index = (vaddr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
-    env->iotlb[mmu_idx][index] = iotlb - vaddr;
     te = &env->tlb_table[mmu_idx][index];
+
+    /* do not discard the translation in te, evict it into a victim tlb */
+    unsigned vidx = env->vtlb_index++ % CPU_VTLB_SIZE;
+    env->tlb_v_table[mmu_idx][vidx] = *te;
+    env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index];
+
+    /* refill the tlb */
+    env->iotlb[mmu_idx][index] = iotlb - vaddr;
     te->addend = addend - vaddr;
     if (prot & PAGE_READ) {
         te->addr_read = address;
diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 01cd8c7..c2ea3fa 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -74,6 +74,8 @@  typedef uint64_t target_ulong;
 #if !defined(CONFIG_USER_ONLY)
 #define CPU_TLB_BITS 8
 #define CPU_TLB_SIZE (1 << CPU_TLB_BITS)
+/* use a fully associative victim tlb of 8 entries */
+#define CPU_VTLB_SIZE 8
 
 #if HOST_LONG_BITS == 32 && TARGET_LONG_BITS == 32
 #define CPU_TLB_ENTRY_BITS 4
@@ -106,9 +108,12 @@  QEMU_BUILD_BUG_ON(sizeof(CPUTLBEntry) != (1 << CPU_TLB_ENTRY_BITS));
 #define CPU_COMMON_TLB \
     /* The meaning of the MMU modes is defined in the target code. */   \
     CPUTLBEntry tlb_table[NB_MMU_MODES][CPU_TLB_SIZE];                  \
-    hwaddr iotlb[NB_MMU_MODES][CPU_TLB_SIZE];               \
+    CPUTLBEntry tlb_v_table[NB_MMU_MODES][CPU_VTLB_SIZE];               \
+    hwaddr iotlb[NB_MMU_MODES][CPU_TLB_SIZE];                           \
+    hwaddr iotlb_v[NB_MMU_MODES][CPU_VTLB_SIZE];                        \
     target_ulong tlb_flush_addr;                                        \
-    target_ulong tlb_flush_mask;
+    target_ulong tlb_flush_mask;                                        \
+    target_ulong vtlb_index;                                            \
 
 #else
 
diff --git a/include/exec/softmmu_template.h b/include/exec/softmmu_template.h
index c6a5440..d09485d 100644
--- a/include/exec/softmmu_template.h
+++ b/include/exec/softmmu_template.h
@@ -112,6 +112,31 @@ 
 # define helper_te_st_name  helper_le_st_name
 #endif
 
+/* macro to check the victim tlb */
+#define VICTIM_TLB_HIT(ty)                                                    \
+({                                                                            \
+    /* we are about to do a page table walk. our last hope is the             \
+     * victim tlb. try to refill from the victim tlb before walking the       \
+     * page table. */                                                         \
+    int vidx;                                                                 \
+    hwaddr tmpiotlb;                                                          \
+    CPUTLBEntry tmptlb;                                                       \
+    for (vidx = CPU_VTLB_SIZE; vidx >= 0; --vidx) {                           \
+        if (env->tlb_v_table[mmu_idx][vidx].ty == (addr & TARGET_PAGE_MASK)) {\
+            /* found entry in victim tlb, swap tlb and iotlb */               \
+            tmptlb = env->tlb_table[mmu_idx][index];                          \
+            env->tlb_table[mmu_idx][index] = env->tlb_v_table[mmu_idx][vidx]; \
+            env->tlb_v_table[mmu_idx][vidx] = tmptlb;                         \
+            tmpiotlb = env->iotlb[mmu_idx][index];                            \
+            env->iotlb[mmu_idx][index] = env->iotlb_v[mmu_idx][vidx];         \
+            env->iotlb_v[mmu_idx][vidx] = tmpiotlb;                           \
+            break;                                                            \
+        }                                                                     \
+    }                                                                         \
+    /* return true when there is a vtlb hit, i.e. vidx >=0 */                 \
+    vidx >= 0;                                                                \
+})
+
 static inline DATA_TYPE glue(io_read, SUFFIX)(CPUArchState *env,
                                               hwaddr physaddr,
                                               target_ulong addr,
@@ -153,7 +178,9 @@  WORD_TYPE helper_le_ld_name(CPUArchState *env, target_ulong addr, int mmu_idx,
             do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
         }
 #endif
-        tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
+        if (!VICTIM_TLB_HIT(ADDR_READ)) {
+            tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
+        }
         tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
     }
 
@@ -235,7 +262,9 @@  WORD_TYPE helper_be_ld_name(CPUArchState *env, target_ulong addr, int mmu_idx,
             do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
         }
 #endif
-        tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
+        if (!VICTIM_TLB_HIT(ADDR_READ)) {
+            tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
+        }
         tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
     }
 
@@ -354,7 +383,9 @@  void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
             do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
         }
 #endif
-        tlb_fill(env, addr, 1, mmu_idx, retaddr);
+        if (!VICTIM_TLB_HIT(addr_write)) {
+            tlb_fill(env, addr, 1, mmu_idx, retaddr);
+        }
         tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
     }
 
@@ -430,7 +461,9 @@  void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
             do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
         }
 #endif
-        tlb_fill(env, addr, 1, mmu_idx, retaddr);
+        if (!VICTIM_TLB_HIT(addr_write)) {
+            tlb_fill(env, addr, 1, mmu_idx, retaddr);
+        }
         tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
     }