
[v3] implementing victim TLB for QEMU system emulated TLB

Message ID 1390930309-21210-1-git-send-email-trent.tong@gmail.com
State New

Commit Message

Xin Tong Jan. 28, 2014, 5:31 p.m. UTC
This patch adds a victim TLB to the QEMU system mode TLB.

QEMU system mode page table walks are expensive. Measurements taken by
running qemu-system-x86_64 under Intel PIN show that a TLB miss, followed
by a walk of the guest Linux OS's 4-level page tables, takes ~450 x86
instructions on average.

The QEMU system mode TLB is implemented as a direct-mapped hashtable,
a structure that suffers from conflict misses. Simply increasing the
associativity of the TLB may not solve the conflict misses, since all
of the ways may have to be probed serially.

A victim TLB holds translations evicted from the primary TLB upon
replacement. It lies between the main TLB and its refill path, and it
has greater associativity (fully associative in this patch). Looking up
the victim TLB takes longer, but it is likely still cheaper than a full
page table walk. The memory translation path is changed as follows:

Before Victim TLB:
1. Inline TLB lookup.
2. Exit code cache on TLB miss.
3. Check for unaligned and I/O accesses.
4. TLB refill.
5. Do the memory access.
6. Return to code cache.

After Victim TLB:
1. Inline TLB lookup.
2. Exit code cache on TLB miss.
3. Check for unaligned and I/O accesses.
4. Victim TLB lookup.
5. If the victim TLB misses, TLB refill.
6. Do the memory access.
7. Return to code cache.

The advantage is that a victim TLB adds associativity to the
direct-mapped TLB, and thus potentially reduces the number of page table
walks, while still keeping the time taken to flush within reasonable
limits. However, placing the victim TLB before the refill path lengthens
the refill path itself, since the victim TLB is consulted before every
TLB refill. The performance results demonstrate that the pros outweigh
the cons.

Some performance results, taken on the SPECINT2006 train datasets, a
kernel boot, and the qemu configure script on an Intel(R) Xeon(R) CPU
E5620 @ 2.40GHz Linux machine, are shown in the Google Doc link below.

In summary, the victim TLB improves the performance of qemu-system-x86_64
by 10.7% on average on SPECINT2006, with the highest improvement of 25.4%
in 464.h264ref, and it does not cause a performance degradation in any of
the measured benchmarks. Furthermore, the implemented victim TLB is
architecture independent and is expected to benefit other architectures
in QEMU as well.

https://docs.google.com/spreadsheet/ccc?key=0AiZRCc8IxzMRdGV5ZFRrM2F2OU9sTnR2Y3JFdjNveUE&usp=sharing

Although there are measurement fluctuations, the performance
improvement is significant and well outside the range of noise.

Signed-off-by: Xin Tong <trent.tong@gmail.com>

---
 cputlb.c                        | 50 ++++++++++++++++++++++++++++++++++++-
 include/exec/cpu-defs.h         | 12 ++++++---
 include/exec/exec-all.h         |  2 ++
 include/exec/softmmu_template.h | 55 ++++++++++++++++++++++++++++++++++++++---
 4 files changed, 111 insertions(+), 8 deletions(-)

Comments

Xin Tong Jan. 29, 2014, 4:03 p.m. UTC | #1
Can someone please review this patch?

Thank you,
Xin

On Tue, Jan 28, 2014 at 11:31 AM, Xin Tong <trent.tong@gmail.com> wrote:
> [...]
> diff --git a/cputlb.c b/cputlb.c
> index b533f3f..caee78e 100644
> --- a/cputlb.c
> +++ b/cputlb.c
> @@ -34,6 +34,22 @@
>  /* statistics */
>  int tlb_flush_count;
>
> +/* swap the 2 given TLB entries as well as their corresponding IOTLB */
> +inline void swap_tlb(CPUTLBEntry *te, CPUTLBEntry *se, hwaddr *iote,
> +                     hwaddr *iose)
> +{
> +    hwaddr iotmp;
> +    CPUTLBEntry tmp;
> +    /* swap tlb */
> +    tmp = *te;
> +    *te = *se;
> +    *se = tmp;
> +    /* swap iotlb */
> +    iotmp = *iote;
> +    *iote = *iose;
> +    *iose = iotmp;
> +}
> +
>  /* NOTE:
>   * If flush_global is true (the usual case), flush all tlb entries.
>   * If flush_global is false, flush (at least) all tlb entries not
> @@ -58,8 +74,10 @@ void tlb_flush(CPUArchState *env, int flush_global)
>      cpu->current_tb = NULL;
>
>      memset(env->tlb_table, -1, sizeof(env->tlb_table));
> +    memset(env->tlb_v_table, -1, sizeof(env->tlb_v_table));
>      memset(env->tb_jmp_cache, 0, sizeof(env->tb_jmp_cache));
>
> +    env->vtlb_index = 0;
>      env->tlb_flush_addr = -1;
>      env->tlb_flush_mask = 0;
>      tlb_flush_count++;
> @@ -106,6 +124,14 @@ void tlb_flush_page(CPUArchState *env, target_ulong addr)
>          tlb_flush_entry(&env->tlb_table[mmu_idx][i], addr);
>      }
>
> +    /* check whether there are entries that need to be flushed in the vtlb */
> +    for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
> +        unsigned int k;
> +        for (k = 0; k < CPU_VTLB_SIZE; k++) {
> +            tlb_flush_entry(&env->tlb_v_table[mmu_idx][k], addr);
> +        }
> +    }
> +
>      tb_flush_jmp_cache(env, addr);
>  }
>
> @@ -170,6 +196,11 @@ void cpu_tlb_reset_dirty_all(ram_addr_t start1, ram_addr_t length)
>                  tlb_reset_dirty_range(&env->tlb_table[mmu_idx][i],
>                                        start1, length);
>              }
> +
> +            for (i = 0; i < CPU_VTLB_SIZE; i++) {
> +                tlb_reset_dirty_range(&env->tlb_v_table[mmu_idx][i],
> +                                      start1, length);
> +            }
>          }
>      }
>  }
> @@ -193,6 +224,13 @@ void tlb_set_dirty(CPUArchState *env, target_ulong vaddr)
>      for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
>          tlb_set_dirty1(&env->tlb_table[mmu_idx][i], vaddr);
>      }
> +
> +    for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
> +        unsigned int k;
> +        for (k = 0; k < CPU_VTLB_SIZE; k++) {
> +            tlb_set_dirty1(&env->tlb_v_table[mmu_idx][k], vaddr);
> +        }
> +    }
>  }
>
>  /* Our TLB does not support large pages, so remember the area covered by
> @@ -264,8 +302,18 @@ void tlb_set_page(CPUArchState *env, target_ulong vaddr,
>                                              prot, &address);
>
>      index = (vaddr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
> -    env->iotlb[mmu_idx][index] = iotlb - vaddr;
>      te = &env->tlb_table[mmu_idx][index];
> +
> +    /* do not discard the translation in te, evict it into a victim tlb */
> +    unsigned vidx = env->vtlb_index++ % CPU_VTLB_SIZE;
> +    env->tlb_v_table[mmu_idx][vidx].addr_read  = te->addr_read;
> +    env->tlb_v_table[mmu_idx][vidx].addr_write = te->addr_write;
> +    env->tlb_v_table[mmu_idx][vidx].addr_code  = te->addr_code;
> +    env->tlb_v_table[mmu_idx][vidx].addend     = te->addend;
> +    env->iotlb_v[mmu_idx][vidx]                = env->iotlb[mmu_idx][index];
> +
> +    /* refill the tlb */
> +    env->iotlb[mmu_idx][index] = iotlb - vaddr;
>      te->addend = addend - vaddr;
>      if (prot & PAGE_READ) {
>          te->addr_read = address;
> diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
> index 01cd8c7..2631d6b 100644
> --- a/include/exec/cpu-defs.h
> +++ b/include/exec/cpu-defs.h
> @@ -74,6 +74,8 @@ typedef uint64_t target_ulong;
>  #if !defined(CONFIG_USER_ONLY)
>  #define CPU_TLB_BITS 8
>  #define CPU_TLB_SIZE (1 << CPU_TLB_BITS)
> +/* use a fully associative victim tlb */
> +#define CPU_VTLB_SIZE 8
>
>  #if HOST_LONG_BITS == 32 && TARGET_LONG_BITS == 32
>  #define CPU_TLB_ENTRY_BITS 4
> @@ -103,12 +105,16 @@ typedef struct CPUTLBEntry {
>
>  QEMU_BUILD_BUG_ON(sizeof(CPUTLBEntry) != (1 << CPU_TLB_ENTRY_BITS));
>
> +/* The meaning of the MMU modes is defined in the target code. */
>  #define CPU_COMMON_TLB \
>      /* The meaning of the MMU modes is defined in the target code. */   \
> -    CPUTLBEntry tlb_table[NB_MMU_MODES][CPU_TLB_SIZE];                  \
> -    hwaddr iotlb[NB_MMU_MODES][CPU_TLB_SIZE];               \
> +    CPUTLBEntry  tlb_table[NB_MMU_MODES][CPU_TLB_SIZE];                 \
> +    CPUTLBEntry  tlb_v_table[NB_MMU_MODES][CPU_VTLB_SIZE];              \
> +    hwaddr       iotlb[NB_MMU_MODES][CPU_TLB_SIZE];                     \
> +    hwaddr       iotlb_v[NB_MMU_MODES][CPU_VTLB_SIZE];                  \
>      target_ulong tlb_flush_addr;                                        \
> -    target_ulong tlb_flush_mask;
> +    target_ulong tlb_flush_mask;                                        \
> +    target_ulong vtlb_index;                                            \
>
>  #else
>
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index ea90b64..7e88b08 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -102,6 +102,8 @@ void tlb_set_page(CPUArchState *env, target_ulong vaddr,
>                    hwaddr paddr, int prot,
>                    int mmu_idx, target_ulong size);
>  void tb_invalidate_phys_addr(hwaddr addr);
> +/* swap the 2 given tlb entries as well as their iotlb */
> +void swap_tlb(CPUTLBEntry *te, CPUTLBEntry *se, hwaddr *iote, hwaddr *iose);
>  #else
>  static inline void tlb_flush_page(CPUArchState *env, target_ulong addr)
>  {
> diff --git a/include/exec/softmmu_template.h b/include/exec/softmmu_template.h
> index c6a5440..38d18de 100644
> --- a/include/exec/softmmu_template.h
> +++ b/include/exec/softmmu_template.h
> @@ -112,6 +112,21 @@
>  # define helper_te_st_name  helper_le_st_name
>  #endif
>
> +/* macro to check the victim tlb */
> +#define HELPER_CHECK_VICTIM_TLB(ACCESS_TYPE)                       \
> +do {                                                               \
> +    for (vtlb_idx = CPU_VTLB_SIZE - 1; vtlb_idx >= 0; --vtlb_idx) { \
> +        if (ACCESS_TYPE == (addr & TARGET_PAGE_MASK)) {            \
> +            /* found entry in victim tlb */                        \
> +            swap_tlb(&env->tlb_table[mmu_idx][index],              \
> +                     &env->tlb_v_table[mmu_idx][vtlb_idx],         \
> +                     &env->iotlb[mmu_idx][index],                  \
> +                     &env->iotlb_v[mmu_idx][vtlb_idx]);            \
> +            break;                                                 \
> +        }                                                          \
> +    }                                                              \
> +} while (0)
> +
>  static inline DATA_TYPE glue(io_read, SUFFIX)(CPUArchState *env,
>                                                hwaddr physaddr,
>                                                target_ulong addr,
> @@ -141,6 +156,7 @@ WORD_TYPE helper_le_ld_name(CPUArchState *env, target_ulong addr, int mmu_idx,
>      target_ulong tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
>      uintptr_t haddr;
>      DATA_TYPE res;
> +    int vtlb_idx;
>
>      /* Adjust the given return address.  */
>      retaddr -= GETPC_ADJ;
> @@ -153,7 +169,14 @@ WORD_TYPE helper_le_ld_name(CPUArchState *env, target_ulong addr, int mmu_idx,
>              do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
>          }
>  #endif
> -        tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
> +        /* we are about to do a page table walk. our last hope is the
> +         * victim tlb. try to refill from the victim tlb before walking the
> +         * page table. */
> +        HELPER_CHECK_VICTIM_TLB(env->tlb_v_table[mmu_idx][vtlb_idx].ADDR_READ);
> +        /* miss victim tlb */
> +        if (vtlb_idx < 0) {
> +            tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
> +        }
>          tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
>      }
>
> @@ -223,6 +246,7 @@ WORD_TYPE helper_be_ld_name(CPUArchState *env, target_ulong addr, int mmu_idx,
>      target_ulong tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
>      uintptr_t haddr;
>      DATA_TYPE res;
> +    int vtlb_idx;
>
>      /* Adjust the given return address.  */
>      retaddr -= GETPC_ADJ;
> @@ -235,7 +259,14 @@ WORD_TYPE helper_be_ld_name(CPUArchState *env, target_ulong addr, int mmu_idx,
>              do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
>          }
>  #endif
> -        tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
> +        /* we are about to do a page table walk. our last hope is the
> +         * victim tlb. try to refill from the victim tlb before walking the
> +         * page table. */
> +        HELPER_CHECK_VICTIM_TLB(env->tlb_v_table[mmu_idx][vtlb_idx].ADDR_READ);
> +        /* miss victim tlb */
> +        if (vtlb_idx < 0) {
> +            tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
> +        }
>          tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
>      }
>
> @@ -342,6 +373,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>      int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>      target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
>      uintptr_t haddr;
> +    int vtlb_idx;
>
>      /* Adjust the given return address.  */
>      retaddr -= GETPC_ADJ;
> @@ -354,7 +386,14 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>              do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
>          }
>  #endif
> -        tlb_fill(env, addr, 1, mmu_idx, retaddr);
> +        /* we are about to do a page table walk. our last hope is the
> +         * victim tlb. try to refill from the victim tlb before walking the
> +         * page table. */
> +        HELPER_CHECK_VICTIM_TLB(env->tlb_v_table[mmu_idx][vtlb_idx].addr_write);
> +        /* miss victim tlb */
> +        if (vtlb_idx < 0) {
> +            tlb_fill(env, addr, 1, mmu_idx, retaddr);
> +        }
>          tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
>      }
>
> @@ -418,6 +457,7 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>      int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>      target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
>      uintptr_t haddr;
> +    int vtlb_idx;
>
>      /* Adjust the given return address.  */
>      retaddr -= GETPC_ADJ;
> @@ -430,7 +470,14 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>              do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
>          }
>  #endif
> -        tlb_fill(env, addr, 1, mmu_idx, retaddr);
> +        /* we are about to do a page table walk. our last hope is the
> +         * victim tlb. try to refill from the victim tlb before walking the
> +         * page table. */
> +        HELPER_CHECK_VICTIM_TLB(env->tlb_v_table[mmu_idx][vtlb_idx].addr_write);
> +        /* miss victim tlb */
> +        if (vtlb_idx < 0) {
> +            tlb_fill(env, addr, 1, mmu_idx, retaddr);
> +        }
>          tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
>      }
>
> --
> 1.8.3.2
>
Xin Tong Feb. 1, 2014, 4:35 p.m. UTC | #2
Hi QEMU Community

This patch provides a significant performance improvement (10.76% on
average) for QEMU system emulation, so I urge someone in the QEMU
community to review it so that it has a chance of making it into
mainline. I understand that I have made mistakes in patch submission
before, but I have learned from them and will try to follow the
submission guidelines for future patches.

Best Regards,
Xin

On Wed, Jan 29, 2014 at 10:03 AM, Xin Tong <trent.tong@gmail.com> wrote:
> [...]
Paolo Bonzini Feb. 1, 2014, 9:29 p.m. UTC | #3
On 01/02/2014 17:35, Xin Tong wrote:
> Hi QEMU Community
>
> This patch provides a significant performance improvement (10.76% on
> average) for QEMU system emulation, so I urge someone in the QEMU
> community to review this patch so that it has a hope of making it into
> mainline. I understand that I have made mistakes in patch submission
> before, but I've learned from them and will try to follow the
> guidelines for future submissions.

Don't worry, the patch is a very nice optimization and I'm sure you'll 
get a review soon.  I guess people are just busy.

Paolo
Peter Maydell Feb. 1, 2014, 10:14 p.m. UTC | #4
On 28 January 2014 17:31, Xin Tong <trent.tong@gmail.com> wrote:
> This patch adds a victim TLB to the QEMU system mode TLB.
>
> QEMU system mode page table walks are expensive. Measured by running
> qemu-system-x86_64 under Intel PIN, a TLB miss plus a walk of the
> 4-level page tables in the guest Linux OS takes ~450 x86 instructions
> on average.

My review below is largely limited to style issues; I'm assuming
rth will do the substantive review.

> Signed-off-by: Xin Tong <trent.tong@gmail.com>
>
> ---
>  cputlb.c                        | 50 ++++++++++++++++++++++++++++++++++++-
>  include/exec/cpu-defs.h         | 12 ++++++---
>  include/exec/exec-all.h         |  2 ++
>  include/exec/softmmu_template.h | 55 ++++++++++++++++++++++++++++++++++++++---
>  4 files changed, 111 insertions(+), 8 deletions(-)
>
> diff --git a/cputlb.c b/cputlb.c
> index b533f3f..caee78e 100644
> --- a/cputlb.c
> +++ b/cputlb.c
> @@ -34,6 +34,22 @@
>  /* statistics */
>  int tlb_flush_count;
>
> +/* swap the 2 given TLB entries as well as their corresponding IOTLB */
> +inline void swap_tlb(CPUTLBEntry *te, CPUTLBEntry *se, hwaddr *iote,
> +                     hwaddr *iose)
> +{
> +   hwaddr iotmp;
> +   CPUTLBEntry tmp;
> +   /* swap tlb */
> +   tmp = *te;
> +   *te = *se;
> +   *se = tmp;
> +   /* swap iotlb */
> +   iotmp = *iote;
> +   *iote = *iose;
> +   *iose = iotmp;
> +}
> +
>  /* NOTE:
>   * If flush_global is true (the usual case), flush all tlb entries.
>   * If flush_global is false, flush (at least) all tlb entries not
> @@ -58,8 +74,10 @@ void tlb_flush(CPUArchState *env, int flush_global)
>      cpu->current_tb = NULL;
>
>      memset(env->tlb_table, -1, sizeof(env->tlb_table));
> +    memset(env->tlb_v_table, -1, sizeof(env->tlb_v_table));
>      memset(env->tb_jmp_cache, 0, sizeof(env->tb_jmp_cache));
>
> +    env->vtlb_index = 0;
>      env->tlb_flush_addr = -1;
>      env->tlb_flush_mask = 0;
>      tlb_flush_count++;
> @@ -106,6 +124,14 @@ void tlb_flush_page(CPUArchState *env, target_ulong addr)
>          tlb_flush_entry(&env->tlb_table[mmu_idx][i], addr);
>      }
>
> +    /* check whether there are entries that need to be flushed in the vtlb */
> +    for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
> +        unsigned int k;

Just plain "int" is fine.

> +        for (k = 0; k < CPU_VTLB_SIZE; k++) {
> +            tlb_flush_entry(&env->tlb_v_table[mmu_idx][k], addr);
> +        }
> +    }
> +
>      tb_flush_jmp_cache(env, addr);
>  }
>
> @@ -170,6 +196,11 @@ void cpu_tlb_reset_dirty_all(ram_addr_t start1, ram_addr_t length)
>                  tlb_reset_dirty_range(&env->tlb_table[mmu_idx][i],
>                                        start1, length);
>              }
> +
> +            for (i = 0; i < CPU_VTLB_SIZE; i++) {
> +                tlb_reset_dirty_range(&env->tlb_v_table[mmu_idx][i],
> +                                      start1, length);
> +            }
>          }
>      }
>  }
> @@ -193,6 +224,13 @@ void tlb_set_dirty(CPUArchState *env, target_ulong vaddr)
>      for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
>          tlb_set_dirty1(&env->tlb_table[mmu_idx][i], vaddr);
>      }
> +
> +    for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
> +        unsigned int k;
> +        for (k = 0; k < CPU_VTLB_SIZE; k++) {
> +            tlb_set_dirty1(&env->tlb_v_table[mmu_idx][k], vaddr);
> +        }
> +    }
>  }
>
>  /* Our TLB does not support large pages, so remember the area covered by
> @@ -264,8 +302,18 @@ void tlb_set_page(CPUArchState *env, target_ulong vaddr,
>                                              prot, &address);
>
>      index = (vaddr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
> -    env->iotlb[mmu_idx][index] = iotlb - vaddr;
>      te = &env->tlb_table[mmu_idx][index];
> +
> +    /* do not discard the translation in te, evict it into a victim tlb */
> +    unsigned vidx = env->vtlb_index++ % CPU_VTLB_SIZE;
> +    env->tlb_v_table[mmu_idx][vidx].addr_read  = te->addr_read;
> +    env->tlb_v_table[mmu_idx][vidx].addr_write = te->addr_write;
> +    env->tlb_v_table[mmu_idx][vidx].addr_code  = te->addr_code;
> +    env->tlb_v_table[mmu_idx][vidx].addend     = te->addend;

You're still writing structure assignments out longhand. These four
lines should all be replaced with
    env->tlb_v_table[mmu_idx][vidx] = *te;

> +    env->iotlb_v[mmu_idx][vidx]                = env->iotlb[mmu_idx][index];
> +
> +    /* refill the tlb */
> +    env->iotlb[mmu_idx][index] = iotlb - vaddr;
>      te->addend = addend - vaddr;
>      if (prot & PAGE_READ) {
>          te->addr_read = address;
> diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
> index 01cd8c7..2631d6b 100644
> --- a/include/exec/cpu-defs.h
> +++ b/include/exec/cpu-defs.h
> @@ -74,6 +74,8 @@ typedef uint64_t target_ulong;
>  #if !defined(CONFIG_USER_ONLY)
>  #define CPU_TLB_BITS 8
>  #define CPU_TLB_SIZE (1 << CPU_TLB_BITS)
> +/* use a fully associative victim tlb */
> +#define CPU_VTLB_SIZE 8
>
>  #if HOST_LONG_BITS == 32 && TARGET_LONG_BITS == 32
>  #define CPU_TLB_ENTRY_BITS 4
> @@ -103,12 +105,16 @@ typedef struct CPUTLBEntry {
>
>  QEMU_BUILD_BUG_ON(sizeof(CPUTLBEntry) != (1 << CPU_TLB_ENTRY_BITS));
>
> +/* The meaning of the MMU modes is defined in the target code. */

Why this addition? It's duplicating a comment that already exists below.

>  #define CPU_COMMON_TLB \
>      /* The meaning of the MMU modes is defined in the target code. */   \
> -    CPUTLBEntry tlb_table[NB_MMU_MODES][CPU_TLB_SIZE];                  \
> -    hwaddr iotlb[NB_MMU_MODES][CPU_TLB_SIZE];               \
> +    CPUTLBEntry  tlb_table[NB_MMU_MODES][CPU_TLB_SIZE];                 \
> +    CPUTLBEntry  tlb_v_table[NB_MMU_MODES][CPU_VTLB_SIZE];              \
> +    hwaddr       iotlb[NB_MMU_MODES][CPU_TLB_SIZE];                     \
> +    hwaddr       iotlb_v[NB_MMU_MODES][CPU_VTLB_SIZE];                  \

Don't try to line up the field names like this -- it just means more
code churn, because next time somebody has to add a line here
with a typename longer than "CPUTLBEntry" they need to change
every line in the macro. Single space is fine. (Lining up the '\' on
the right is OK.)

>      target_ulong tlb_flush_addr;                                        \
> -    target_ulong tlb_flush_mask;
> +    target_ulong tlb_flush_mask;                                        \
> +    target_ulong vtlb_index;                                            \
>
>  #else
>
> diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
> index ea90b64..7e88b08 100644
> --- a/include/exec/exec-all.h
> +++ b/include/exec/exec-all.h
> @@ -102,6 +102,8 @@ void tlb_set_page(CPUArchState *env, target_ulong vaddr,
>                    hwaddr paddr, int prot,
>                    int mmu_idx, target_ulong size);
>  void tb_invalidate_phys_addr(hwaddr addr);
> +/* swap the 2 given tlb entries as well as their iotlb */
> +void swap_tlb(CPUTLBEntry *te, CPUTLBEntry *se, hwaddr *iote, hwaddr *iose);
>  #else
>  static inline void tlb_flush_page(CPUArchState *env, target_ulong addr)
>  {
> diff --git a/include/exec/softmmu_template.h b/include/exec/softmmu_template.h
> index c6a5440..38d18de 100644
> --- a/include/exec/softmmu_template.h
> +++ b/include/exec/softmmu_template.h
> @@ -112,6 +112,21 @@
>  # define helper_te_st_name  helper_le_st_name
>  #endif
>
> +/* macro to check the victim tlb */
> +#define HELPER_CHECK_VICTIM_TLB(ACCESS_TYPE)                       \
> +do {                                                               \
> +    for (vtlb_idx = CPU_VTLB_SIZE; vtlb_idx >= 0; --vtlb_idx) {    \
> +        if (ACCESS_TYPE == (addr & TARGET_PAGE_MASK)) {            \
> +            /* found entry in victim tlb */                        \
> +            swap_tlb(&env->tlb_table[mmu_idx][index],              \
> +                     &env->tlb_v_table[mmu_idx][vtlb_idx],         \
> +                     &env->iotlb[mmu_idx][index],                  \
> +                     &env->iotlb_v[mmu_idx][vtlb_idx]);            \

This is the only place swap_tlb gets used, so probably better to
ditch that as a separate function and just inline it here. (Then
everywhere that uses softmmu_template.h gets the inlined
version, rather than cputlb.c getting to inline and the rest not.)

> +            break;                                                 \
> +        }                                                          \
> +    }                                                              \
> +} while (0);

I think we could make this macro use a bit less ugly.
(1) just call it VICTIM_TLB_HIT; 'helper' has a particular
meaning in QEMU (function called from generated code), and
besides it makes the calling code read a bit less naturally.
(2) rather than having it take as an argument
"env->tlb_v_table[mmu_idx][vtlb_idx].ADDR_READ", just
take the "ADDR_READ" part. This makes the callers simpler
and avoids giving the impression that the macro is going to
simply evaluate its argument once (ie that it is function-like).
(3) Make it a gcc statement-expression, so it can return a
value. Then you can (a) make the scope of vtlb_idx be only
inside the macro, and avoid forcing the caller to define it and
(b) have the usage pattern be
    if (!VICTIM_TLB_HIT(ADDR_READ)) {
        tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
    }

(which means you don't need that "miss victim tlb" comment
any more, because it's obvious from the macro name what
is happening.)

> +
>  static inline DATA_TYPE glue(io_read, SUFFIX)(CPUArchState *env,
>                                                hwaddr physaddr,
>                                                target_ulong addr,
> @@ -141,6 +156,7 @@ WORD_TYPE helper_le_ld_name(CPUArchState *env, target_ulong addr, int mmu_idx,
>      target_ulong tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
>      uintptr_t haddr;
>      DATA_TYPE res;
> +    int vtlb_idx;
>
>      /* Adjust the given return address.  */
>      retaddr -= GETPC_ADJ;
> @@ -153,7 +169,14 @@ WORD_TYPE helper_le_ld_name(CPUArchState *env, target_ulong addr, int mmu_idx,
>              do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
>          }
>  #endif
> -        tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
> +        /* we are about to do a page table walk. our last hope is the
> +         * victim tlb. try to refill from the victim tlb before walking the
> +         * page table. */

You might as well put this comment in your helper macro; there's
no need to repeat it in every callsite.

> +        HELPER_CHECK_VICTIM_TLB(env->tlb_v_table[mmu_idx][vtlb_idx].ADDR_READ);
> +        /* miss victim tlb */
> +        if (vtlb_idx < 0) {
> +            tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
> +        }
>          tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
>      }

thanks
-- PMM
Xin Tong Feb. 2, 2014, 3:15 p.m. UTC | #5
Hi Peter

Thank you for your reviews; I have two questions.

On Sat, Feb 1, 2014 at 4:14 PM, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 28 January 2014 17:31, Xin Tong <trent.tong@gmail.com> wrote:
>> This patch adds a victim TLB to the QEMU system mode TLB.
>>
>> QEMU system mode page table walks are expensive. Taken by running QEMU
>> qemu-system-x86_64 system mode on Intel PIN , a TLB miss and walking a
>> 4-level page tables in guest Linux OS takes ~450 X86 instructions on
>> average.
>
> My review below is largely limited to style issues; I'm assuming
> rth will do the substantive review.
>
>> +/* macro to check the victim tlb */
>> +#define HELPER_CHECK_VICTIM_TLB(ACCESS_TYPE)                       \
>> +do {                                                               \
>> +    for (vtlb_idx = CPU_VTLB_SIZE; vtlb_idx >= 0; --vtlb_idx) {    \
>> +        if (ACCESS_TYPE == (addr & TARGET_PAGE_MASK)) {            \
>> +            /* found entry in victim tlb */                        \
>> +            swap_tlb(&env->tlb_table[mmu_idx][index],              \
>> +                     &env->tlb_v_table[mmu_idx][vtlb_idx],         \
>> +                     &env->iotlb[mmu_idx][index],                  \
>> +                     &env->iotlb_v[mmu_idx][vtlb_idx]);            \
>
> This is the only place swap_tlb gets used, so probably better to
> ditch that as a separate function and just inline it here. (Then
> everywhere that uses softmmu_template.h gets the inlined
> version, rather than cputlb.c getting to inline and the rest not.)
>
>> +            break;                                                 \
>> +        }                                                          \
>> +    }                                                              \
>> +} while (0);
>
> I think we could make this macro use a bit less ugly.
> (1) just call it VICTIM_TLB_HIT; 'helper' has a particular
> meaning in QEMU (function called from generated code), and
> besides it makes the calling code read a bit less naturally.
> (2) rather than having it take as an argument
> "env->tlb_v_table[mmu_idx][vtlb_idx].ADDR_READ", just
> take the "ADDR_READ" part. This makes the callers simpler
> and avoids giving the impression that the macro is going to
> simply evaluate its argument once (ie that it is function-like).

I probably should use the tlb_addr defined at the very top of the
helper_le_ld_name macro?

> (3) Make it a gcc statement-expression, so it can return a
> value. Then you can (a) make the scope of vtlb_idx be only
> inside the macro, and avoid forcing the caller to define it and
> (b) have the usage pattern be
>     if (!VICTIM_TLB_HIT(ADDR_READ)) {
>         tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
>     }

Would GCC statement expressions be supported by other compilers? I do
not want to make the code less portable. I do not see any use of
statement expressions in the QEMU code I've read.

Peter Maydell Feb. 2, 2014, 4:19 p.m. UTC | #6
On 2 February 2014 15:15, Xin Tong <trent.tong@gmail.com> wrote:
> Hi Peter
>
> Thank you for your reviews , i have 2 questions.
>
> On Sat, Feb 1, 2014 at 4:14 PM, Peter Maydell <peter.maydell@linaro.org> wrote:
>>
>> I think we could make this macro use a bit less ugly.
>> (1) just call it VICTIM_TLB_HIT; 'helper' has a particular
>> meaning in QEMU (function called from generated code), and
>> besides it makes the calling code read a bit less naturally.
>> (2) rather than having it take as an argument
>> "env->tlb_v_table[mmu_idx][vtlb_idx].ADDR_READ", just
>> take the "ADDR_READ" part. This makes the callers simpler
>> and avoids giving the impression that the macro is going to
>> simply evaluate its argument once (ie that it is function-like).
>
> i probably should use tlb_addr defined on the very top of the
> helper_le_ld_name macro ?

Not sure what you mean here. tlb_addr is a variable.

>> (3) Make it a gcc statement-expression, so it can return a
>> value. Then you can (a) make the scope of vtlb_idx be only
>> inside the macro, and avoid forcing the caller to define it and
>> (b) have the usage pattern be
>>     if (!VICTIM_TLB_HIT(ADDR_READ)) {
>>         tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
>>     }
>
> would GCC statement expression be supported by other compilers ? i do
> not want to make the code less portable. I do not see any use of
> statement expression in the code ive read from QEMU code base.

No, we use it already. You can see a bunch of uses if you do
  git grep '({'
(as well as some false positive matches, of course).

Statement expressions are supported by both gcc and clang,
which is the set of compilers we care about.

thanks
-- PMM
Xin Tong Feb. 2, 2014, 6:27 p.m. UTC | #7
On Sun, Feb 2, 2014 at 10:19 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 2 February 2014 15:15, Xin Tong <trent.tong@gmail.com> wrote:
>> Hi Peter
>>
>> Thank you for your reviews , i have 2 questions.
>>
>>>
>>> I think we could make this macro use a bit less ugly.
>>> (1) just call it VICTIM_TLB_HIT; 'helper' has a particular
>>> meaning in QEMU (function called from generated code), and
>>> besides it makes the calling code read a bit less naturally.
>>> (2) rather than having it take as an argument
>>> "env->tlb_v_table[mmu_idx][vtlb_idx].ADDR_READ", just
>>> take the "ADDR_READ" part. This makes the callers simpler
>>> and avoids giving the impression that the macro is going to
>>> simply evaluate its argument once (ie that it is function-like).
>>
>> i probably should use tlb_addr defined on the very top of the
>> helper_le_ld_name macro ?
>
> Not sure what you mean here. tlb_addr is a variable.
>
>>> (3) Make it a gcc statement-expression, so it can return a
>>> value. Then you can (a) make the scope of vtlb_idx be only
>>> inside the macro, and avoid forcing the caller to define it and
>>> (b) have the usage pattern be
>>>     if (!VICTIM_TLB_HIT(ADDR_READ)) {
>>>         tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
>>>     }
>>
>> would GCC statement expression be supported by other compilers ? i do
>> not want to make the code less portable. I do not see any use of
>> statement expression in the code ive read from QEMU code base.
>
> No, we use it already. You can see a bunch of uses if you do
>   git grep '({'
> (as well as some false positive matches, of course).
>
> Statement expressions are supported by both gcc and clang,
> which is the set of compilers we care about.

OK, got it. Would moving vtlb_idx inside the macro break the C89 rule
of "no variable declarations in the middle of a block"?

>
> thanks
> -- PMM
Peter Maydell Feb. 2, 2014, 6:33 p.m. UTC | #8
On 2 February 2014 18:27, Xin Tong <trent.tong@gmail.com> wrote:
> On Sun, Feb 2, 2014 at 10:19 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
>> Statement expressions are supported by both gcc and clang,
>> which is the set of compilers we care about.
>
> Ok got it. would moving vtlb_idx inside the macro break the C89 rule
> of "No Variable declaration in middle of block"

No, because the macro has a block inside it. "block" means
"a chunk of code inside {}", not "an entire function". You can
declare variables at the start of functions, or at the start
of the block in a while loop, or at the start of the blocks in
an if () {...} else {...}, or (as in this case) at the start of the
block inside a C statement expr.

If you want to restrict the scope of a variable and you don't
have a block conveniently to hand, it's always valid just to
create one: C syntax allows a bare "{ /* code here */ }"
anywhere a statement is valid. It's almost always the case
that there's some handy control-flow-mandated block to
use, though.

thanks
-- PMM
Xin Tong Feb. 2, 2014, 7:01 p.m. UTC | #9
I am getting some compilation errors while stringifying the
ADDR_READ/addr_write macro arguments.

function helper_be_ldq_cmmu

 if (!VICTIM_TLB_HIT(ADDR_READ)) {


macro

#define VICTIM_TLB_HIT(ACCESS_TYPE)                                           \
({                                                                            \
    /* we are about to do a page table walk. our last hope is the             \
     * victim tlb. try to refill from the victim tlb before walking the       \
     * page table. */                                                         \
    int vidx;                                                                 \
    hwaddr tmpiotlb;                                                          \
    CPUTLBEntry tmptlb;                                                       \
    for (vidx = CPU_VTLB_SIZE; vidx >= 0; --vidx) {                           \
        if (env->tlb_v_table[mmu_idx][vidx].#ACCESS_TYPE                      \
            == (addr & TARGET_PAGE_MASK)) {                                   \
            /* found entry in victim tlb, swap tlb and iotlb */               \
            tmptlb = env->tlb_table[mmu_idx][index];                          \
            env->tlb_table[mmu_idx][index] = env->tlb_v_table[mmu_idx][vidx]; \
            env->tlb_v_table[mmu_idx][vidx] = tmptlb;                         \
            tmpiotlb = env->iotlb[mmu_idx][index];                            \


compilation error

/home/xtong/qemu/include/exec/softmmu_template.h: In function
'helper_be_ldq_cmmu':
/home/xtong/qemu/include/exec/softmmu_template.h:266: error: expected
identifier before string constant

Xin

On Sun, Feb 2, 2014 at 12:33 PM, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 2 February 2014 18:27, Xin Tong <trent.tong@gmail.com> wrote:
>> On Sun, Feb 2, 2014 at 10:19 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
>>> Statement expressions are supported by both gcc and clang,
>>> which is the set of compilers we care about.
>>
>> Ok, got it. Would moving vtlb_idx inside the macro break the C89 rule
>> of "no variable declarations in the middle of a block"?
>
> No, because the macro has a block inside it. "block" means
> "a chunk of code inside {}", not "an entire function". You can
> declare variables at the start of functions, or at the start
> of the block in a while loop, or at the start of the blocks in
> an if () {...} else {...}, or (as in this case) at the start of the
> block inside a C statement expr.
>
> If you want to restrict the scope of a variable and you don't
> have a block conveniently to hand, it's always valid just to
> create one: C syntax allows a bare "{ /* code here */ }"
> anywhere a statement is valid. It's almost always the case
> that there's some handy control-flow-mandated block to
> use, though.
>
> thanks
> -- PMM
Peter Maydell Feb. 2, 2014, 7:29 p.m. UTC | #10
On 2 February 2014 19:01, Xin Tong <trent.tong@gmail.com> wrote:
> I am getting some compilation errors while stringifying the
> ADDR_READ/addr_write macro arguments.
>
> function helper_be_ldq_cmmu
>
>  if (!VICTIM_TLB_HIT(ADDR_READ)) {
>
>
> macro
>
> #define VICTIM_TLB_HIT(ACCESS_TYPE)                                           \
> ({                                                                            \
>     /* we are about to do a page table walk. our last hope is the             \
>      * victim tlb. try to refill from the victim tlb before walking the       \
>      * page table. */                                                         \
>     int vidx;                                                                 \
>     hwaddr tmpiotlb;                                                          \
>     CPUTLBEntry tmptlb;                                                       \
>     for (vidx = CPU_VTLB_SIZE; vidx >= 0; --vidx) {                           \
>         if (env->tlb_v_table[mmu_idx][vidx].#ACCESS_TYPE                      \
>             == (addr & TARGET_PAGE_MASK)) {                                   \
>             /* found entry in victim tlb, swap tlb and iotlb */               \
>             tmptlb = env->tlb_table[mmu_idx][index];                          \
>             env->tlb_table[mmu_idx][index] = env->tlb_v_table[mmu_idx][vidx]; \
>             env->tlb_v_table[mmu_idx][vidx] = tmptlb;                         \
>             tmpiotlb = env->iotlb[mmu_idx][index];                            \
>
>
> compilation error
>
> /home/xtong/qemu/include/exec/softmmu_template.h: In function
> 'helper_be_ldq_cmmu':
> /home/xtong/qemu/include/exec/softmmu_template.h:266: error: expected
> identifier before string constant

That's because you're trying to stringify the macro argument with
that '#'. Don't do that -- you don't want a string.

thanks
-- PMM
Xin Tong Feb. 2, 2014, 7:39 p.m. UTC | #11
What should I do here? I want to pass ADDR_READ/addr_write into
the macro, but I do not know how to convert it into
env->tlb_v_table[mmu_idx][vidx].ADDR_READ / env->tlb_v_table[mmu_idx][vidx].addr_write.

Thanks a lot,
Xin

On Sun, Feb 2, 2014 at 1:29 PM, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 2 February 2014 19:01, Xin Tong <trent.tong@gmail.com> wrote:
>> I am getting some compilation errors while stringifying the
>> ADDR_READ/addr_write macro arguments.
>>
>> function helper_be_ldq_cmmu
>>
>>  if (!VICTIM_TLB_HIT(ADDR_READ)) {
>>
>>
>> macro
>>
>> #define VICTIM_TLB_HIT(ACCESS_TYPE)                                           \
>> ({                                                                            \
>>     /* we are about to do a page table walk. our last hope is the             \
>>      * victim tlb. try to refill from the victim tlb before walking the       \
>>      * page table. */                                                         \
>>     int vidx;                                                                 \
>>     hwaddr tmpiotlb;                                                          \
>>     CPUTLBEntry tmptlb;                                                       \
>>     for (vidx = CPU_VTLB_SIZE; vidx >= 0; --vidx) {                           \
>>         if (env->tlb_v_table[mmu_idx][vidx].#ACCESS_TYPE                      \
>>             == (addr & TARGET_PAGE_MASK)) {                                   \
>>             /* found entry in victim tlb, swap tlb and iotlb */               \
>>             tmptlb = env->tlb_table[mmu_idx][index];                          \
>>             env->tlb_table[mmu_idx][index] = env->tlb_v_table[mmu_idx][vidx]; \
>>             env->tlb_v_table[mmu_idx][vidx] = tmptlb;                         \
>>             tmpiotlb = env->iotlb[mmu_idx][index];                            \
>>
>>
>> compilation error
>>
>> /home/xtong/qemu/include/exec/softmmu_template.h: In function
>> 'helper_be_ldq_cmmu':
>> /home/xtong/qemu/include/exec/softmmu_template.h:266: error: expected
>> identifier before string constant
>
> That's because you're trying to stringify the macro argument with
> that '#'. Don't do that -- you don't want a string.
>
> thanks
> -- PMM
Peter Maydell Feb. 2, 2014, 7:58 p.m. UTC | #12
On 2 February 2014 19:39, Xin Tong <trent.tong@gmail.com> wrote:
> What should I do here? I want to pass ADDR_READ/addr_write into
> the macro, but I do not know how to convert it into
> env->tlb_v_table[mmu_idx][vidx].ADDR_READ / env->tlb_v_table[mmu_idx][vidx].addr_write.

I told you what you were doing wrong. Just say
  env->tlb_v_table[mmu_idx][vidx].ACCESS_TYPE

thanks
-- PMM
diff mbox

Patch

diff --git a/cputlb.c b/cputlb.c
index b533f3f..caee78e 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -34,6 +34,22 @@ 
 /* statistics */
 int tlb_flush_count;
 
+/* swap the 2 given TLB entries as well as their corresponding IOTLB */
+inline void swap_tlb(CPUTLBEntry *te, CPUTLBEntry *se, hwaddr *iote,
+                     hwaddr *iose)
+{
+   hwaddr iotmp;
+   CPUTLBEntry tmp;
+   /* swap tlb */
+   tmp = *te;
+   *te = *se;
+   *se = tmp;
+   /* swap iotlb */
+   iotmp = *iote;
+   *iote = *iose;
+   *iose = iotmp;
+}
+
 /* NOTE:
  * If flush_global is true (the usual case), flush all tlb entries.
  * If flush_global is false, flush (at least) all tlb entries not
@@ -58,8 +74,10 @@  void tlb_flush(CPUArchState *env, int flush_global)
     cpu->current_tb = NULL;
 
     memset(env->tlb_table, -1, sizeof(env->tlb_table));
+    memset(env->tlb_v_table, -1, sizeof(env->tlb_v_table));
     memset(env->tb_jmp_cache, 0, sizeof(env->tb_jmp_cache));
 
+    env->vtlb_index = 0;
     env->tlb_flush_addr = -1;
     env->tlb_flush_mask = 0;
     tlb_flush_count++;
@@ -106,6 +124,14 @@  void tlb_flush_page(CPUArchState *env, target_ulong addr)
         tlb_flush_entry(&env->tlb_table[mmu_idx][i], addr);
     }
 
+    /* check whether there are entries that need to be flushed in the vtlb */
+    for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
+        unsigned int k;
+        for (k = 0; k < CPU_VTLB_SIZE; k++) {
+            tlb_flush_entry(&env->tlb_v_table[mmu_idx][k], addr);
+        }
+    }
+
     tb_flush_jmp_cache(env, addr);
 }
 
@@ -170,6 +196,11 @@  void cpu_tlb_reset_dirty_all(ram_addr_t start1, ram_addr_t length)
                 tlb_reset_dirty_range(&env->tlb_table[mmu_idx][i],
                                       start1, length);
             }
+
+            for (i = 0; i < CPU_VTLB_SIZE; i++) {
+                tlb_reset_dirty_range(&env->tlb_v_table[mmu_idx][i],
+                                      start1, length);
+            }
         }
     }
 }
@@ -193,6 +224,13 @@  void tlb_set_dirty(CPUArchState *env, target_ulong vaddr)
     for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
         tlb_set_dirty1(&env->tlb_table[mmu_idx][i], vaddr);
     }
+
+    for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
+        unsigned int k;
+        for (k = 0; k < CPU_VTLB_SIZE; k++) {
+            tlb_set_dirty1(&env->tlb_v_table[mmu_idx][k], vaddr);
+        }
+    }
 }
 
 /* Our TLB does not support large pages, so remember the area covered by
@@ -264,8 +302,18 @@  void tlb_set_page(CPUArchState *env, target_ulong vaddr,
                                             prot, &address);
 
     index = (vaddr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
-    env->iotlb[mmu_idx][index] = iotlb - vaddr;
     te = &env->tlb_table[mmu_idx][index];
+
+    /* do not discard the translation in te, evict it into a victim tlb */
+    unsigned vidx = env->vtlb_index++ % CPU_VTLB_SIZE;
+    env->tlb_v_table[mmu_idx][vidx].addr_read  = te->addr_read;
+    env->tlb_v_table[mmu_idx][vidx].addr_write = te->addr_write;
+    env->tlb_v_table[mmu_idx][vidx].addr_code  = te->addr_code;
+    env->tlb_v_table[mmu_idx][vidx].addend     = te->addend;
+    env->iotlb_v[mmu_idx][vidx]                = env->iotlb[mmu_idx][index];
+
+    /* refill the tlb */
+    env->iotlb[mmu_idx][index] = iotlb - vaddr;
     te->addend = addend - vaddr;
     if (prot & PAGE_READ) {
         te->addr_read = address;
diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 01cd8c7..2631d6b 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -74,6 +74,8 @@  typedef uint64_t target_ulong;
 #if !defined(CONFIG_USER_ONLY)
 #define CPU_TLB_BITS 8
 #define CPU_TLB_SIZE (1 << CPU_TLB_BITS)
+/* use a fully associative victim tlb */
+#define CPU_VTLB_SIZE 8
 
 #if HOST_LONG_BITS == 32 && TARGET_LONG_BITS == 32
 #define CPU_TLB_ENTRY_BITS 4
@@ -103,12 +105,16 @@  typedef struct CPUTLBEntry {
 
 QEMU_BUILD_BUG_ON(sizeof(CPUTLBEntry) != (1 << CPU_TLB_ENTRY_BITS));
 
+/* The meaning of the MMU modes is defined in the target code. */
 #define CPU_COMMON_TLB \
     /* The meaning of the MMU modes is defined in the target code. */   \
-    CPUTLBEntry tlb_table[NB_MMU_MODES][CPU_TLB_SIZE];                  \
-    hwaddr iotlb[NB_MMU_MODES][CPU_TLB_SIZE];               \
+    CPUTLBEntry  tlb_table[NB_MMU_MODES][CPU_TLB_SIZE];                 \
+    CPUTLBEntry  tlb_v_table[NB_MMU_MODES][CPU_VTLB_SIZE];              \
+    hwaddr       iotlb[NB_MMU_MODES][CPU_TLB_SIZE];                     \
+    hwaddr       iotlb_v[NB_MMU_MODES][CPU_VTLB_SIZE];                  \
     target_ulong tlb_flush_addr;                                        \
-    target_ulong tlb_flush_mask;
+    target_ulong tlb_flush_mask;                                        \
+    target_ulong vtlb_index;                                            \
 
 #else
 
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index ea90b64..7e88b08 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -102,6 +102,8 @@  void tlb_set_page(CPUArchState *env, target_ulong vaddr,
                   hwaddr paddr, int prot,
                   int mmu_idx, target_ulong size);
 void tb_invalidate_phys_addr(hwaddr addr);
+/* swap the 2 given tlb entries as well as their iotlb */
+void swap_tlb(CPUTLBEntry *te, CPUTLBEntry *se, hwaddr *iote, hwaddr *iose);
 #else
 static inline void tlb_flush_page(CPUArchState *env, target_ulong addr)
 {
diff --git a/include/exec/softmmu_template.h b/include/exec/softmmu_template.h
index c6a5440..38d18de 100644
--- a/include/exec/softmmu_template.h
+++ b/include/exec/softmmu_template.h
@@ -112,6 +112,21 @@ 
 # define helper_te_st_name  helper_le_st_name
 #endif
 
+/* macro to check the victim tlb */
+#define HELPER_CHECK_VICTIM_TLB(ACCESS_TYPE)                       \
+do {                                                               \
+    for (vtlb_idx = CPU_VTLB_SIZE - 1; vtlb_idx >= 0; --vtlb_idx) {\
+        if (ACCESS_TYPE == (addr & TARGET_PAGE_MASK)) {            \
+            /* found entry in victim tlb */                        \
+            swap_tlb(&env->tlb_table[mmu_idx][index],              \
+                     &env->tlb_v_table[mmu_idx][vtlb_idx],         \
+                     &env->iotlb[mmu_idx][index],                  \
+                     &env->iotlb_v[mmu_idx][vtlb_idx]);            \
+            break;                                                 \
+        }                                                          \
+    }                                                              \
+} while (0)
+
 static inline DATA_TYPE glue(io_read, SUFFIX)(CPUArchState *env,
                                               hwaddr physaddr,
                                               target_ulong addr,
@@ -141,6 +156,7 @@  WORD_TYPE helper_le_ld_name(CPUArchState *env, target_ulong addr, int mmu_idx,
     target_ulong tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
     uintptr_t haddr;
     DATA_TYPE res;
+    int vtlb_idx;
 
     /* Adjust the given return address.  */
     retaddr -= GETPC_ADJ;
@@ -153,7 +169,14 @@  WORD_TYPE helper_le_ld_name(CPUArchState *env, target_ulong addr, int mmu_idx,
             do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
         }
 #endif
-        tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
+        /* we are about to do a page table walk. our last hope is the
+         * victim tlb. try to refill from the victim tlb before walking the
+         * page table. */
+        HELPER_CHECK_VICTIM_TLB(env->tlb_v_table[mmu_idx][vtlb_idx].ADDR_READ);
+        /* miss victim tlb */
+        if (vtlb_idx < 0) {
+            tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
+        }
         tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
     }
 
@@ -223,6 +246,7 @@  WORD_TYPE helper_be_ld_name(CPUArchState *env, target_ulong addr, int mmu_idx,
     target_ulong tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
     uintptr_t haddr;
     DATA_TYPE res;
+    int vtlb_idx;
 
     /* Adjust the given return address.  */
     retaddr -= GETPC_ADJ;
@@ -235,7 +259,14 @@  WORD_TYPE helper_be_ld_name(CPUArchState *env, target_ulong addr, int mmu_idx,
             do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
         }
 #endif
-        tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
+        /* we are about to do a page table walk. our last hope is the
+         * victim tlb. try to refill from the victim tlb before walking the
+         * page table. */
+        HELPER_CHECK_VICTIM_TLB(env->tlb_v_table[mmu_idx][vtlb_idx].ADDR_READ);
+        /* miss victim tlb */
+        if (vtlb_idx < 0) {
+            tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
+        }
         tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
     }
 
@@ -342,6 +373,7 @@  void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
     int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
     target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
     uintptr_t haddr;
+    int vtlb_idx;
 
     /* Adjust the given return address.  */
     retaddr -= GETPC_ADJ;
@@ -354,7 +386,14 @@  void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
             do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
         }
 #endif
-        tlb_fill(env, addr, 1, mmu_idx, retaddr);
+        /* we are about to do a page table walk. our last hope is the
+         * victim tlb. try to refill from the victim tlb before walking the
+         * page table. */
+        HELPER_CHECK_VICTIM_TLB(env->tlb_v_table[mmu_idx][vtlb_idx].addr_write);
+        /* miss victim tlb */
+        if (vtlb_idx < 0) {
+            tlb_fill(env, addr, 1, mmu_idx, retaddr);
+        }
         tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
     }
 
@@ -418,6 +457,7 @@  void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
     int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
     target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
     uintptr_t haddr;
+    int vtlb_idx;
 
     /* Adjust the given return address.  */
     retaddr -= GETPC_ADJ;
@@ -430,7 +470,14 @@  void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
             do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
         }
 #endif
-        tlb_fill(env, addr, 1, mmu_idx, retaddr);
+        /* we are about to do a page table walk. our last hope is the
+         * victim tlb. try to refill from the victim tlb before walking the
+         * page table. */
+        HELPER_CHECK_VICTIM_TLB(env->tlb_v_table[mmu_idx][vtlb_idx].addr_write);
+        /* miss victim tlb */
+        if (vtlb_idx < 0) {
+            tlb_fill(env, addr, 1, mmu_idx, retaddr);
+        }
         tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
     }