From patchwork Thu Jan 23 19:49:25 2014
X-Patchwork-Submitter: Xin Tong
X-Patchwork-Id: 313693
From: Xin Tong
To: qemu-devel@nongnu.org
Cc: Xin Tong, afaerber@suse.de, rth@twiddle.net
Date: Thu, 23 Jan 2014 13:49:25 -0600
Message-Id: <1390506565-8880-1-git-send-email-trent.tong@gmail.com>
Subject: [Qemu-devel] [PATCH v2] cpu: implementing victim TLB for QEMU system emulated TLB

This patch adds a victim TLB to the QEMU system mode TLB. QEMU system mode
page table walks are expensive. Measured by running qemu-system-x86_64 under
Intel PIN, a TLB miss followed by a walk of the 4-level page table in the
guest Linux OS takes ~450 x86 instructions on average. The QEMU system mode
TLB is implemented as a directly-mapped hash table, a structure that suffers
from conflict misses.
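For illustration only, here is a minimal sketch of a direct-mapped software
TLB lookup (simplified types and constants, not the actual QEMU code): every
guest page hashes to exactly one slot, so two hot pages whose virtual page
numbers share the low CPU_TLB_BITS keep evicting each other.

/* Illustrative sketch only -- simplified types, not the QEMU implementation. */
#include <stdbool.h>
#include <stdint.h>

#define TARGET_PAGE_BITS 12
#define CPU_TLB_BITS     8
#define CPU_TLB_SIZE     (1 << CPU_TLB_BITS)
#define TARGET_PAGE_MASK (~(uint64_t)((1 << TARGET_PAGE_BITS) - 1))

typedef struct {
    uint64_t addr_read;   /* tag: guest page address valid for reads */
    uintptr_t addend;     /* host address minus guest address for this page */
} TLBEntrySketch;

static TLBEntrySketch tlb[CPU_TLB_SIZE];

/* Direct-mapped lookup: each guest page maps to exactly one slot, so pages
 * whose page numbers share the low CPU_TLB_BITS bits conflict with each
 * other, which is the source of the conflict misses described above. */
static bool tlb_lookup_hit(uint64_t vaddr)
{
    unsigned index = (vaddr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
    return tlb[index].addr_read == (vaddr & TARGET_PAGE_MASK);
}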
Increasing the associativity of the TLB may not solve the problem of conflict
misses, as all the ways may have to be searched serially. A victim TLB is a
TLB used to hold translations evicted from the primary TLB upon replacement.
It lies between the main TLB and its refill path, and it has greater
associativity (fully associative in this patch). Looking up the victim TLB
takes longer than the primary lookup, but it is likely still far cheaper than
a full page table walk.

The memory translation path is changed as follows:

Before Victim TLB:
1. Inline TLB lookup
2. Exit code cache on TLB miss.
3. Check for unaligned, IO accesses
4. TLB refill.
5. Do the memory access.
6. Return to code cache.

After Victim TLB:
1. Inline TLB lookup
2. Exit code cache on TLB miss.
3. Check for unaligned, IO accesses
4. Victim TLB lookup.
5. If the victim TLB misses, TLB refill.
6. Do the memory access.
7. Return to code cache.

The advantage is that a victim TLB adds associativity to a directly mapped
TLB, and thus potentially fewer page table walks, while still keeping the
time taken to flush within reasonable limits. However, placing the victim TLB
in front of the refill path lengthens that path, since the victim TLB is
consulted before the TLB refill. The performance results demonstrate that the
pros outweigh the cons.

Attached are some performance results taken on SPECINT2006 train datasets, a
kernel boot, and the qemu configure script on an Intel(R) Xeon(R) CPU E5620 @
2.40GHz Linux machine. In summary, the victim TLB improves the performance of
qemu-system-x86_64 by 10.7% on average on SPECINT2006, with the highest
improvement of 25.4% in 464.h264ref, and it does not cause a performance
degradation in any of the measured benchmarks. Furthermore, the implemented
victim TLB is architecture independent and is expected to benefit the other
architectures emulated by QEMU as well. Although there are measurement
fluctuations, the improvement is significant and by no means within the range
of measurement noise.

Reviewed-by: Richard Henderson
Signed-off-by: Xin Tong
---
 cputlb.c                        | 50 +++++++++++++++++++++++++-
 include/exec/cpu-defs.h         | 16 ++++++---
 include/exec/exec-all.h         |  2 ++
 include/exec/softmmu_template.h | 80 ++++++++++++++++++++++++++++++++++++++---
 4 files changed, 138 insertions(+), 10 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index b533f3f..03a048a 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -34,6 +34,22 @@
 /* statistics */
 int tlb_flush_count;
 
+/* swap the 2 given TLB entries as well as their corresponding IOTLB */
+inline void swap_tlb(CPUTLBEntry *te, CPUTLBEntry *se, hwaddr *iote,
+                     hwaddr *iose)
+{
+    hwaddr iotmp;
+    CPUTLBEntry t;
+    /* swap iotlb */
+    iotmp = *iote;
+    *iote = *iose;
+    *iose = iotmp;
+    /* swap tlb */
+    memcpy(&t, te, sizeof(CPUTLBEntry));
+    memcpy(te, se, sizeof(CPUTLBEntry));
+    memcpy(se, &t, sizeof(CPUTLBEntry));
+}
+
 /* NOTE:
  * If flush_global is true (the usual case), flush all tlb entries.
  * If flush_global is false, flush (at least) all tlb entries not
@@ -58,8 +74,10 @@ void tlb_flush(CPUArchState *env, int flush_global)
     cpu->current_tb = NULL;
 
     memset(env->tlb_table, -1, sizeof(env->tlb_table));
+    memset(env->tlb_v_table, -1, sizeof(env->tlb_v_table));
     memset(env->tb_jmp_cache, 0, sizeof(env->tb_jmp_cache));
 
+    env->vtlb_index = 0;
     env->tlb_flush_addr = -1;
     env->tlb_flush_mask = 0;
     tlb_flush_count++;
@@ -106,6 +124,14 @@ void tlb_flush_page(CPUArchState *env, target_ulong addr)
         tlb_flush_entry(&env->tlb_table[mmu_idx][i], addr);
     }
 
+    /* check whether there are entries that need to be flushed in the vtlb */
+    for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
+        unsigned int k;
+        for (k = 0; k < CPU_VTLB_SIZE; k++) {
+            tlb_flush_entry(&env->tlb_v_table[mmu_idx][k], addr);
+        }
+    }
+
     tb_flush_jmp_cache(env, addr);
 }
 
@@ -170,6 +196,11 @@ void cpu_tlb_reset_dirty_all(ram_addr_t start1, ram_addr_t length)
                 tlb_reset_dirty_range(&env->tlb_table[mmu_idx][i],
                                       start1, length);
             }
+
+            for (i = 0; i < CPU_VTLB_SIZE; i++) {
+                tlb_reset_dirty_range(&env->tlb_v_table[mmu_idx][i],
+                                      start1, length);
+            }
         }
     }
 }
@@ -193,6 +224,13 @@ void tlb_set_dirty(CPUArchState *env, target_ulong vaddr)
     for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
         tlb_set_dirty1(&env->tlb_table[mmu_idx][i], vaddr);
     }
+
+    for (mmu_idx = 0; mmu_idx < NB_MMU_MODES; mmu_idx++) {
+        unsigned int k;
+        for (k = 0; k < CPU_VTLB_SIZE; k++) {
+            tlb_set_dirty1(&env->tlb_v_table[mmu_idx][k], vaddr);
+        }
+    }
 }
 
 /* Our TLB does not support large pages, so remember the area covered by
@@ -264,8 +302,18 @@ void tlb_set_page(CPUArchState *env, target_ulong vaddr,
                                             prot, &address);
 
     index = (vaddr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
-    env->iotlb[mmu_idx][index] = iotlb - vaddr;
     te = &env->tlb_table[mmu_idx][index];
+
+    /* do not discard the translation in te, evict it into a victim tlb */
+    unsigned vidx = env->vtlb_index++ % CPU_VTLB_SIZE;
+    env->tlb_v_table[mmu_idx][vidx].addr_read = te->addr_read;
+    env->tlb_v_table[mmu_idx][vidx].addr_write = te->addr_write;
+    env->tlb_v_table[mmu_idx][vidx].addr_code = te->addr_code;
+    env->tlb_v_table[mmu_idx][vidx].addend = te->addend;
+    env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index];
+
+    /* refill the tlb */
+    env->iotlb[mmu_idx][index] = iotlb - vaddr;
     te->addend = addend - vaddr;
     if (prot & PAGE_READ) {
         te->addr_read = address;
diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 01cd8c7..18d5f0d 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -72,8 +72,10 @@ typedef uint64_t target_ulong;
 #define TB_JMP_PAGE_MASK  (TB_JMP_CACHE_SIZE - TB_JMP_PAGE_SIZE)
 
 #if !defined(CONFIG_USER_ONLY)
-#define CPU_TLB_BITS 8
-#define CPU_TLB_SIZE (1 << CPU_TLB_BITS)
+#define CPU_TLB_BITS  8
+#define CPU_TLB_SIZE  (1 << CPU_TLB_BITS)
+/* use a fully associative victim tlb */
+#define CPU_VTLB_SIZE 8
 
 #if HOST_LONG_BITS == 32 && TARGET_LONG_BITS == 32
 #define CPU_TLB_ENTRY_BITS 4
@@ -103,12 +105,16 @@ typedef struct CPUTLBEntry {
 
 QEMU_BUILD_BUG_ON(sizeof(CPUTLBEntry) != (1 << CPU_TLB_ENTRY_BITS));
 
+/* The meaning of the MMU modes is defined in the target code. */
 #define CPU_COMMON_TLB \
     /* The meaning of the MMU modes is defined in the target code. */  \
-    CPUTLBEntry tlb_table[NB_MMU_MODES][CPU_TLB_SIZE];                  \
-    hwaddr iotlb[NB_MMU_MODES][CPU_TLB_SIZE];                           \
+    CPUTLBEntry tlb_table[NB_MMU_MODES][CPU_TLB_SIZE];                  \
+    CPUTLBEntry tlb_v_table[NB_MMU_MODES][CPU_VTLB_SIZE];               \
+    hwaddr iotlb[NB_MMU_MODES][CPU_TLB_SIZE];                           \
+    hwaddr iotlb_v[NB_MMU_MODES][CPU_VTLB_SIZE];                        \
     target_ulong tlb_flush_addr;                                        \
-    target_ulong tlb_flush_mask;
+    target_ulong tlb_flush_mask;                                        \
+    target_ulong vtlb_index;                                            \
 
 #else
 
diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index ea90b64..7e88b08 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -102,6 +102,8 @@ void tlb_set_page(CPUArchState *env, target_ulong vaddr,
                   hwaddr paddr, int prot,
                   int mmu_idx, target_ulong size);
 void tb_invalidate_phys_addr(hwaddr addr);
+/* swap the 2 given tlb entries as well as their iotlb */
+void swap_tlb(CPUTLBEntry *te, CPUTLBEntry *se, hwaddr *iote, hwaddr *iose);
 #else
 static inline void tlb_flush_page(CPUArchState *env, target_ulong addr)
 {
diff --git a/include/exec/softmmu_template.h b/include/exec/softmmu_template.h
index c6a5440..fe11343 100644
--- a/include/exec/softmmu_template.h
+++ b/include/exec/softmmu_template.h
@@ -141,6 +141,7 @@ WORD_TYPE helper_le_ld_name(CPUArchState *env, target_ulong addr, int mmu_idx,
     target_ulong tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
     uintptr_t haddr;
     DATA_TYPE res;
+    int vtlb_idx;
 
     /* Adjust the given return address.  */
     retaddr -= GETPC_ADJ;
@@ -153,7 +154,24 @@ WORD_TYPE helper_le_ld_name(CPUArchState *env, target_ulong addr, int mmu_idx,
             do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
         }
 #endif
-        tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
+        /* we are about to do a page table walk. our last hope is the
+         * victim tlb. try to refill from the victim tlb before walking the
+         * page table. */
+        for (vtlb_idx = CPU_VTLB_SIZE - 1; vtlb_idx >= 0; --vtlb_idx) {
+            if (env->tlb_v_table[mmu_idx][vtlb_idx].ADDR_READ
+                == (addr & TARGET_PAGE_MASK)) {
+                /* found entry in victim tlb */
+                swap_tlb(&env->tlb_table[mmu_idx][index],
+                         &env->tlb_v_table[mmu_idx][vtlb_idx],
+                         &env->iotlb[mmu_idx][index],
+                         &env->iotlb_v[mmu_idx][vtlb_idx]);
+                break;
+            }
+        }
+        /* miss victim tlb */
+        if (vtlb_idx < 0) {
+            tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
+        }
         tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
     }
 
@@ -223,6 +241,7 @@ WORD_TYPE helper_be_ld_name(CPUArchState *env, target_ulong addr, int mmu_idx,
     target_ulong tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
     uintptr_t haddr;
     DATA_TYPE res;
+    int vtlb_idx;
 
     /* Adjust the given return address.  */
     retaddr -= GETPC_ADJ;
@@ -235,7 +254,24 @@ WORD_TYPE helper_be_ld_name(CPUArchState *env, target_ulong addr, int mmu_idx,
             do_unaligned_access(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
         }
 #endif
-        tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
+        /* we are about to do a page table walk. our last hope is the
+         * victim tlb. try to refill from the victim tlb before walking the
+         * page table. */
+        for (vtlb_idx = CPU_VTLB_SIZE - 1; vtlb_idx >= 0; --vtlb_idx) {
+            if (env->tlb_v_table[mmu_idx][vtlb_idx].ADDR_READ
+                == (addr & TARGET_PAGE_MASK)) {
+                /* found entry in victim tlb */
+                swap_tlb(&env->tlb_table[mmu_idx][index],
+                         &env->tlb_v_table[mmu_idx][vtlb_idx],
+                         &env->iotlb[mmu_idx][index],
+                         &env->iotlb_v[mmu_idx][vtlb_idx]);
+                break;
+            }
+        }
+        /* miss victim tlb */
+        if (vtlb_idx < 0) {
+            tlb_fill(env, addr, READ_ACCESS_TYPE, mmu_idx, retaddr);
+        }
         tlb_addr = env->tlb_table[mmu_idx][index].ADDR_READ;
     }
 
@@ -342,6 +378,7 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
     int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
     target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
     uintptr_t haddr;
+    int vtlb_idx;
 
     /* Adjust the given return address.  */
     retaddr -= GETPC_ADJ;
@@ -354,7 +391,24 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
             do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
         }
 #endif
-        tlb_fill(env, addr, 1, mmu_idx, retaddr);
+        /* we are about to do a page table walk. our last hope is the
+         * victim tlb. try to refill from the victim tlb before walking the
+         * page table. */
+        for (vtlb_idx = CPU_VTLB_SIZE - 1; vtlb_idx >= 0; --vtlb_idx) {
+            if (env->tlb_v_table[mmu_idx][vtlb_idx].addr_write
+                == (addr & TARGET_PAGE_MASK)) {
+                /* found entry in victim tlb */
+                swap_tlb(&env->tlb_table[mmu_idx][index],
+                         &env->tlb_v_table[mmu_idx][vtlb_idx],
+                         &env->iotlb[mmu_idx][index],
+                         &env->iotlb_v[mmu_idx][vtlb_idx]);
+                break;
+            }
+        }
+        /* miss victim tlb */
+        if (vtlb_idx < 0) {
+            tlb_fill(env, addr, 1, mmu_idx, retaddr);
+        }
         tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
     }
 
@@ -418,6 +472,7 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
     int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
     target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
     uintptr_t haddr;
+    int vtlb_idx;
 
     /* Adjust the given return address.  */
     retaddr -= GETPC_ADJ;
@@ -430,7 +485,24 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
             do_unaligned_access(env, addr, 1, mmu_idx, retaddr);
         }
 #endif
-        tlb_fill(env, addr, 1, mmu_idx, retaddr);
+        /* we are about to do a page table walk. our last hope is the
+         * victim tlb. try to refill from the victim tlb before walking the
+         * page table. */
+        for (vtlb_idx = CPU_VTLB_SIZE - 1; vtlb_idx >= 0; --vtlb_idx) {
+            if (env->tlb_v_table[mmu_idx][vtlb_idx].addr_write
+                == (addr & TARGET_PAGE_MASK)) {
+                /* found entry in victim tlb */
+                swap_tlb(&env->tlb_table[mmu_idx][index],
+                         &env->tlb_v_table[mmu_idx][vtlb_idx],
+                         &env->iotlb[mmu_idx][index],
+                         &env->iotlb_v[mmu_idx][vtlb_idx]);
+                break;
+            }
+        }
+        /* miss victim tlb */
+        if (vtlb_idx < 0) {
+            tlb_fill(env, addr, 1, mmu_idx, retaddr);
+        }
         tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
     }
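
For readers following the diff, the same victim-TLB probe is repeated in all
four load/store helpers above. The standalone sketch below (simplified types;
VTLBEntry and probe_victim_tlb are illustrative names, not part of this
patch) shows the pattern in isolation: scan the small fully associative
victim array, swap a hit back into the primary slot, and fall back to the
page table walk only on a miss.

/* Hypothetical, simplified sketch of the victim-TLB probe used above. */
#include <stdbool.h>
#include <stdint.h>

#define VTLB_SIZE 8
#define PAGE_MASK (~(uint64_t)0xfff)

typedef struct {
    uint64_t tag;      /* guest page address, or -1 if the entry is empty */
    uintptr_t addend;  /* host address minus guest address for this page */
} VTLBEntry;

/* On a primary-TLB miss, scan the fully associative victim TLB. If the page
 * is found, swap it back into the primary slot and report a hit; otherwise
 * the caller falls back to the page table walk (tlb_fill in the patch). */
static bool probe_victim_tlb(VTLBEntry *primary_slot, VTLBEntry *victim,
                             uint64_t addr)
{
    int i;
    for (i = VTLB_SIZE - 1; i >= 0; --i) {
        if (victim[i].tag == (addr & PAGE_MASK)) {
            VTLBEntry tmp = *primary_slot;   /* victim entry swaps places   */
            *primary_slot = victim[i];       /* with the evicted primary    */
            victim[i] = tmp;                 /* entry, as swap_tlb() does   */
            return true;                     /* hit: no page table walk     */
        }
    }
    return false;                            /* miss: caller refills the TLB */
}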