[RFC,v2,5/5] cputlb: dynamically resize TLBs based on use rate

Message ID 20181008232756.30704-6-cota@braap.org
State New
Series Dynamic TLB sizing

Commit Message

Emilio Cota Oct. 8, 2018, 11:27 p.m. UTC
Perform the resizing only on flushes; otherwise we'd
have to take a perf hit by either rehashing the array
or unnecessarily flushing it.

We grow the array aggressively, and reduce the size more
slowly. This accommodates mixed workloads, where some
processes might be memory-heavy while others are not.
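
For reference, here is a condensed sketch of the policy; it mirrors
tlb_mmu_resize_locked() in the patch below (MIN/MAX and the *_CPU_TLB_BITS
constants are as defined in this series):

    /* Flush-time resize policy: grow aggressively on a high use rate,
     * shrink only after a sustained run of low-use flushes. */
    static size_t tlb_new_size(size_t old_size, size_t n_used,
                               size_t *n_flushes_low_rate)
    {
        size_t rate = n_used * 100 / old_size;

        if (rate == 100) {              /* fully used: quadruple */
            return MIN(old_size << 2, 1 << MAX_CPU_TLB_BITS);
        } else if (rate > 70) {         /* nearly full: double */
            return MIN(old_size << 1, 1 << MAX_CPU_TLB_BITS);
        } else if (rate < 30 && ++*n_flushes_low_rate == 100) {
            *n_flushes_low_rate = 0;    /* 100 low-use flushes: halve */
            return MAX(old_size >> 1, 1 << MIN_CPU_TLB_BITS);
        }
        return old_size;
    }

With CPU_TLB_ENTRY_BITS == 5 (32-byte entries) this ranges from 2^6 = 64
entries (2 KiB) up to 2^22 entries (128 MiB) per MMU mode, starting from
a 256-entry default.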

As the following experiments show, this is a net perf gain,
particularly for memory-heavy workloads. Experiments
are run on an Intel i7-6700K CPU @ 4.00GHz.

1. System boot + shutdown, debian aarch64:

- Before (tlb-lock-v3):
 Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):

       7469.363393      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.07% )
    31,507,707,190      cycles                    #    4.218 GHz                      ( +-  0.07% )
    57,101,577,452      instructions              #    1.81  insns per cycle          ( +-  0.08% )
    10,265,531,804      branches                  # 1374.352 M/sec                    ( +-  0.07% )
       173,020,681      branch-misses             #    1.69% of all branches          ( +-  0.10% )

       7.483359063 seconds time elapsed                                          ( +-  0.08% )

- After:
 Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):

       7185.036730      task-clock (msec)         #    0.999 CPUs utilized            ( +-  0.11% )
    30,303,501,143      cycles                    #    4.218 GHz                      ( +-  0.11% )
    54,198,386,487      instructions              #    1.79  insns per cycle          ( +-  0.08% )
     9,726,518,945      branches                  # 1353.719 M/sec                    ( +-  0.08% )
       167,082,307      branch-misses             #    1.72% of all branches          ( +-  0.08% )

       7.195597842 seconds time elapsed                                          ( +-  0.11% )

That is, a 3.8% improvement.

2. System boot + shutdown, ubuntu 18.04 x86_64:

- Before (tlb-lock-v3):
Performance counter stats for 'taskset -c 0 ../img/x86_64/ubuntu-die.sh -nographic' (2 runs):

      49971.036482      task-clock (msec)         #    0.999 CPUs utilized            ( +-  1.62% )
   210,766,077,140      cycles                    #    4.218 GHz                      ( +-  1.63% )
   428,829,830,790      instructions              #    2.03  insns per cycle          ( +-  0.75% )
    77,313,384,038      branches                  # 1547.164 M/sec                    ( +-  0.54% )
       835,610,706      branch-misses             #    1.08% of all branches          ( +-  2.97% )

      50.003855102 seconds time elapsed                                          ( +-  1.61% )

- After:
 Performance counter stats for 'taskset -c 0 ../img/x86_64/ubuntu-die.sh -nographic' (2 runs):

      50118.124477      task-clock (msec)         #    0.999 CPUs utilized            ( +-  4.30% )
           132,396      context-switches          #    0.003 M/sec                    ( +-  1.20% )
                 0      cpu-migrations            #    0.000 K/sec                    ( +-100.00% )
           167,754      page-faults               #    0.003 M/sec                    ( +-  0.06% )
   211,414,701,601      cycles                    #    4.218 GHz                      ( +-  4.30% )
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
   431,618,818,597      instructions              #    2.04  insns per cycle          ( +-  6.40% )
    80,197,256,524      branches                  # 1600.165 M/sec                    ( +-  8.59% )
       794,830,352      branch-misses             #    0.99% of all branches          ( +-  2.05% )

      50.177077175 seconds time elapsed                                          ( +-  4.23% )

No improvement (within noise range).

3. x86_64 SPEC06int:
                              SPEC06int (test set)
                         [ Y axis: speedup over master ]
  8 +-+--+----+----+-----+----+----+----+----+----+----+-----+----+----+--+-+
    |                                                                       |
    |                                                   tlb-lock-v3         |
  7 +-+..................$$$...........................+indirection       +-+
    |                    $ $                              +resizing         |
    |                    $ $                                                |
  6 +-+..................$.$..............................................+-+
    |                    $ $                                                |
    |                    $ $                                                |
  5 +-+..................$.$..............................................+-+
    |                    $ $                                                |
    |                    $ $                                                |
  4 +-+..................$.$..............................................+-+
    |                    $ $                                                |
    |          +++       $ $                                                |
  3 +-+........$$+.......$.$..............................................+-+
    |          $$        $ $                                                |
    |          $$        $ $                                 $$$            |
  2 +-+........$$........$.$.................................$.$..........+-+
    |          $$        $ $                                 $ $       +$$  |
    |          $$   $$+  $ $  $$$       +$$                  $ $  $$$   $$  |
  1 +-+***#$***#$+**#$+**#+$**#+$**##$**##$***#$***#$+**#$+**#+$**#+$**##$+-+
    |  * *#$* *#$ **#$ **# $**# $** #$** #$* *#$* *#$ **#$ **# $**# $** #$  |
    |  * *#$* *#$ **#$ **# $**# $** #$** #$* *#$* *#$ **#$ **# $**# $** #$  |
  0 +-+***#$***#$-**#$-**#$$**#$$**##$**##$***#$***#$-**#$-**#$$**#$$**##$+-+
      401.bzip2  403.gcc  429.mcf  445.gobmk  456.hmmer  462.libquantum
      464.h264ref  471.omnetpp  473.astar  483.xalancbmk  geomean
png: https://imgur.com/a/b1wn3wc

That is, a 1.53x average speedup over master, with a max speedup of 7.13x.

Note that "indirection" (i.e. the first patch in this series) incurs
no overhead on average.

To conclude, here is a different look at the SPEC06int results, using
linux-user as the baseline and comparing master and this series ("tlb-dyn"):

            Softmmu slowdown vs. linux-user for SPEC06int (test set)
                    [ Y axis: slowdown over linux-user ]
  14 +-+--+----+----+----+----+----+-----+----+----+----+----+----+----+--+-+
     |                                                                      |
     |                                                       master         |
  12 +-+...............+**..................................tlb-dyn.......+-+
     |                  **                                                  |
     |                  **                                                  |
     |                  **                                                  |
  10 +-+................**................................................+-+
     |                  **                                                  |
     |                  **                                                  |
   8 +-+................**................................................+-+
     |                  **                                                  |
     |                  **                                                  |
     |                  **                                                  |
   6 +-+................**................................................+-+
     |       ***        **                                                  |
     |       * *        **                                                  |
   4 +-+.....*.*........**.................................***............+-+
     |       * *        **                                 * *              |
     |       * *  +++   **             ***            ***  * *  ***  ***    |
     |       * *  +**++ **   **##      *+*#      ***  * *#+* *  * *##* *    |
   2 +-+.....*.*##.**##.**##.**.#.**##.*+*#.***#.*+*#.*.*#.*.*#+*.*.#*.*##+-+
     |++***##*+*+#+**+#+**+#+**+#+**+#+*+*#+*+*#+*+*#+*+*#+*+*#+*+*+#*+*+#++|
     |  * * #* * # ** # ** # ** # ** # * *# * *# * *# * *# * *# * * #* * #  |
   0 +-+***##***##-**##-**##-**##-**##-***#-***#-***#-***#-***#-***##***##+-+
       401.bzip2  403.gcc  429.mcf  445.gobmk  456.hmmer  462.libquantum
       464.h264ref  471.omnetpp  473.astar  483.xalancbmk  geomean

png: https://imgur.com/a/eXkjMCE

After this series, we bring down the average softmmu overhead
from 2.77x to 1.80x, with a maximum slowdown of 2.48x (omnetpp).

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/cpu-defs.h | 39 +++++++++------------------------------
 accel/tcg/cputlb.c      | 39 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 47 insertions(+), 31 deletions(-)

Comments

Alex Bennée Oct. 9, 2018, 2:54 p.m. UTC | #1
Emilio G. Cota <cota@braap.org> writes:

> [commit message, benchmark results and diffstat snipped; see above]
>
> diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
> index 56f1887c7f..d4af0b2a2d 100644
> --- a/include/exec/cpu-defs.h
> +++ b/include/exec/cpu-defs.h
> @@ -67,37 +67,15 @@ typedef uint64_t target_ulong;
>  #define CPU_TLB_ENTRY_BITS 5
>  #endif
>
> -/* TCG_TARGET_TLB_DISPLACEMENT_BITS is used in CPU_TLB_BITS to ensure that
> - * the TLB is not unnecessarily small, but still small enough for the
> - * TLB lookup instruction sequence used by the TCG target.
> - *
> - * TCG will have to generate an operand as large as the distance between
> - * env and the tlb_table[NB_MMU_MODES - 1][0].addend.  For simplicity,
> - * the TCG targets just round everything up to the next power of two, and
> - * count bits.  This works because: 1) the size of each TLB is a largish
> - * power of two, 2) and because the limit of the displacement is really close
> - * to a power of two, 3) the offset of tlb_table[0][0] inside env is smaller
> - * than the size of a TLB.
> - *
> - * For example, the maximum displacement 0xFFF0 on PPC and MIPS, but TCG
> - * just says "the displacement is 16 bits".  TCG_TARGET_TLB_DISPLACEMENT_BITS
> - * then ensures that tlb_table at least 0x8000 bytes large ("not unnecessarily
> - * small": 2^15).  The operand then will come up smaller than 0xFFF0 without
> - * any particular care, because the TLB for a single MMU mode is larger than
> - * 0x10000-0xFFF0=16 bytes.  In the end, the maximum value of the operand
> - * could be something like 0xC000 (the offset of the last TLB table) plus
> - * 0x18 (the offset of the addend field in each TLB entry) plus the offset
> - * of tlb_table inside env (which is non-trivial but not huge).
> +#define MIN_CPU_TLB_BITS 6
> +#define DEFAULT_CPU_TLB_BITS 8
> +/*
> + * Assuming TARGET_PAGE_BITS==12, with 2**22 entries we can cover 2**(22+12) ==
> + * 2**34 == 16G of address space. This is roughly what one would expect a
> + * TLB to cover in a modern (as of 2018) x86_64 CPU. For instance, Intel
> + * Skylake's Level-2 STLB has 16 1G entries.
>   */
> -#define CPU_TLB_BITS                                             \
> -    MIN(8,                                                       \
> -        TCG_TARGET_TLB_DISPLACEMENT_BITS - CPU_TLB_ENTRY_BITS -  \
> -        (NB_MMU_MODES <= 1 ? 0 :                                 \
> -         NB_MMU_MODES <= 2 ? 1 :                                 \
> -         NB_MMU_MODES <= 4 ? 2 :                                 \
> -         NB_MMU_MODES <= 8 ? 3 : 4))
> -
> -#define CPU_TLB_SIZE (1 << CPU_TLB_BITS)
> +#define MAX_CPU_TLB_BITS 22
>
>  typedef struct CPUTLBEntry {
>      /* bit TARGET_LONG_BITS to TARGET_PAGE_BITS : virtual address
> @@ -143,6 +121,7 @@ typedef struct CPUIOTLBEntry {
>
>  typedef struct CPUTLBDesc {
>      size_t n_used_entries;
> +    size_t n_flushes_low_rate;
>  } CPUTLBDesc;
>
>  #define CPU_COMMON_TLB  \
> diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
> index 11d6060eb0..5ebfa4fbb5 100644
> --- a/accel/tcg/cputlb.c
> +++ b/accel/tcg/cputlb.c
> @@ -80,9 +80,10 @@ void tlb_init(CPUState *cpu)
>
>      qemu_spin_init(&env->tlb_lock);
>      for (i = 0; i < NB_MMU_MODES; i++) {
> -        size_t n_entries = CPU_TLB_SIZE;
> +        size_t n_entries = 1 << DEFAULT_CPU_TLB_BITS;
>
>          env->tlb_desc[i].n_used_entries = 0;
> +        env->tlb_desc[i].n_flushes_low_rate = 0;
>          env->tlb_mask[i] = (n_entries - 1) << CPU_TLB_ENTRY_BITS;
>          env->tlb_table[i] = g_new(CPUTLBEntry, n_entries);
>          env->iotlb[i] = g_new0(CPUIOTLBEntry, n_entries);
> @@ -121,6 +122,40 @@ size_t tlb_flush_count(void)
>      return count;
>  }
>
> +/* Call with tlb_lock held */
> +static void tlb_mmu_resize_locked(CPUArchState *env, int mmu_idx)
> +{
> +    CPUTLBDesc *desc = &env->tlb_desc[mmu_idx];
> +    size_t old_size = tlb_n_entries(env, mmu_idx);
> +    size_t rate = desc->n_used_entries * 100 / old_size;
> +    size_t new_size = old_size;
> +
> +    if (rate == 100) {
> +        new_size = MIN(old_size << 2, 1 << MAX_CPU_TLB_BITS);
> +    } else if (rate > 70) {
> +        new_size = MIN(old_size << 1, 1 << MAX_CPU_TLB_BITS);
> +    } else if (rate < 30) {
> +        desc->n_flushes_low_rate++;
> +        if (desc->n_flushes_low_rate == 100) {
> +            new_size = MAX(old_size >> 1, 1 << MIN_CPU_TLB_BITS);
> +            desc->n_flushes_low_rate = 0;
> +        }
> +    }
> +
> +    if (new_size == old_size) {
> +        return;
> +    }
> +
> +    g_free(env->tlb_table[mmu_idx]);
> +    g_free(env->iotlb[mmu_idx]);
> +
> +    /* desc->n_used_entries is cleared by the caller */
> +    desc->n_flushes_low_rate = 0;
> +    env->tlb_mask[mmu_idx] = (new_size - 1) << CPU_TLB_ENTRY_BITS;
> +    env->tlb_table[mmu_idx] = g_new(CPUTLBEntry, new_size);
> +    env->iotlb[mmu_idx] = g_new0(CPUIOTLBEntry, new_size);

I guess the allocation is a big enough stall that there is no point in
either pre-allocating or using RCU to clean up the old data?

Given this is a new behaviour it would be nice to expose the occupancy
of the TLBs in "info jit" much like we do for TBs.

Nevertheless:

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>


> +}
> +
>  /* This is OK because CPU architectures generally permit an
>   * implementation to drop entries from the TLB at any time, so
>   * flushing more entries than required is only an efficiency issue,
> @@ -150,6 +185,7 @@ static void tlb_flush_nocheck(CPUState *cpu)
>       */
>      qemu_spin_lock(&env->tlb_lock);
>      for (i = 0; i < NB_MMU_MODES; i++) {
> +        tlb_mmu_resize_locked(env, i);
>          memset(env->tlb_table[i], -1, sizeof_tlb(env, i));
>          env->tlb_desc[i].n_used_entries = 0;
>      }
> @@ -213,6 +249,7 @@ static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, run_on_cpu_data data)
>          if (test_bit(mmu_idx, &mmu_idx_bitmask)) {
>              tlb_debug("%d\n", mmu_idx);
>
> +            tlb_mmu_resize_locked(env, mmu_idx);
>              memset(env->tlb_table[mmu_idx], -1, sizeof_tlb(env, mmu_idx));
>              memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0]));
>              env->tlb_desc[mmu_idx].n_used_entries = 0;


--
Alex Bennée
Emilio Cota Oct. 9, 2018, 4:03 p.m. UTC | #2
On Tue, Oct 09, 2018 at 15:54:21 +0100, Alex Bennée wrote:
> Emilio G. Cota <cota@braap.org> writes:
> > +    if (new_size == old_size) {
> > +        return;
> > +    }
> > +
> > +    g_free(env->tlb_table[mmu_idx]);
> > +    g_free(env->iotlb[mmu_idx]);
> > +
> > +    /* desc->n_used_entries is cleared by the caller */
> > +    desc->n_flushes_low_rate = 0;
> > +    env->tlb_mask[mmu_idx] = (new_size - 1) << CPU_TLB_ENTRY_BITS;
> > +    env->tlb_table[mmu_idx] = g_new(CPUTLBEntry, new_size);
> > +    env->iotlb[mmu_idx] = g_new0(CPUIOTLBEntry, new_size);

For the iotlb we can use g_new, right?

iotlb[foo][bar] is only checked after having checked tlb_table[foo][bar].
Otherwise tlb_flush would also flush the iotlb.
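
A self-contained toy (illustrative names only, not QEMU code) of the
ordering being relied on:

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct { unsigned long addr_read; } TLBEntry;
    typedef struct { unsigned long phys_addr; } IOTLBEntry;

    /* The io entry for a slot is read only after the tlb entry for that
     * same slot has matched; a flush invalidates every tlb comparator,
     * so io entries left uninitialized by g_new() are never observed. */
    bool toy_lookup(const TLBEntry *tlb, const IOTLBEntry *io, size_t idx,
                    unsigned long addr, unsigned long *phys_out)
    {
        if (tlb[idx].addr_read != addr) {
            return false;               /* miss: io[idx] is never read */
        }
        *phys_out = io[idx].phys_addr;  /* hit implies a prior fill */
        return true;
    }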

> I guess the allocation is a big enough stall that there is no point in
> either pre-allocating or using RCU to clean up the old data?

I tried this. Turns out not to make a difference, because (1) we only
resize on flushes, which do not happen that often, and (2) we
size up aggressively, but the shrink rate is more conservative. So
in the end, it's a drop in the ocean. For instance, bootup+shutdown
requires 100 calls to g_new+g_free -- at ~300 cycles each, that's
about 30us out of ~8s of execution time.

> Given this is a new behaviour it would be nice to expose the occupancy
> of the TLBs in "info jit" much like we do for TBs.

The occupancy changes *very* quickly, so by the time the report is out,
the info is stale. So I'm not sure that's very useful.

The TLB size changes less often, but reporting on it is not obvious,
since we have NB_MMU_MODES sizes per CPU. Say we have 20 CPUs, what should
we report? A table with 20 * NB_MMU_MODES cells? I dunno.

> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

Thanks!

		Emilio
Alex Bennée Oct. 9, 2018, 4:34 p.m. UTC | #3
Emilio G. Cota <cota@braap.org> writes:

> On Tue, Oct 09, 2018 at 15:54:21 +0100, Alex Bennée wrote:
>> Emilio G. Cota <cota@braap.org> writes:
>> > +    if (new_size == old_size) {
>> > +        return;
>> > +    }
>> > +
>> > +    g_free(env->tlb_table[mmu_idx]);
>> > +    g_free(env->iotlb[mmu_idx]);
>> > +
>> > +    /* desc->n_used_entries is cleared by the caller */
>> > +    desc->n_flushes_low_rate = 0;
>> > +    env->tlb_mask[mmu_idx] = (new_size - 1) << CPU_TLB_ENTRY_BITS;
>> > +    env->tlb_table[mmu_idx] = g_new(CPUTLBEntry, new_size);
>> > +    env->iotlb[mmu_idx] = g_new0(CPUIOTLBEntry, new_size);
>
> For the iotlb we can use g_new, right?
>
> iotlb[foo][bar] is only checked after having checked tlb_table[foo][bar].
> Otherwise tlb_flush would also flush the iotlb.
>
>> I guess the allocation is a big enough stall that there is no point in
>> either pre-allocating or using RCU to clean up the old data?
>
> I tried this. Turns out not to make a difference, because (1) we only
> resize on flushes, which do not happen that often, and (2) we
> size up aggressively, but the shrink rate is more conservative. So
> in the end, it's a drop in the ocean. For instance, bootup+shutdown
> requires 100 calls to g_new+g_free -- at ~300 cycles each, that's
> about 30us out of ~8s of execution time.
>
>> Given this is a new behaviour it would be nice to expose the occupancy
>> of the TLBs in "info jit" much like we do for TBs.
>
> The occupancy changes *very* quickly, so by the time the report is out,
> the info is stale. So I'm not sure that's very useful.

Hmm, do I mean occupancy or utilisation? I guess I want to get an idea of
how much of the TLB has been used and how much is empty, never-to-be-used
space. In theory, as the TLB tends towards guest page size, our TLB
turnover should be of the order of the guest's TLB re-fill rate?

> The TLB size changes less often, but reporting on it is not obvious,
> since we have NB_MMU_MODES sizes per CPU. Say we have 20 CPUs, what should
> we report? A table with 20 * NB_MMU_MODES cells? I dunno.

I guess not. Although I suspect some MMU_MODES are more interesting than
others. I'm hoping the usage of EL3-related modes is negligible if we
haven't booted with secure firmware, for example.

>
>> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
>
> Thanks!
>
> 		Emilio


--
Alex Bennée

Patch

diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 56f1887c7f..d4af0b2a2d 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -67,37 +67,15 @@  typedef uint64_t target_ulong;
 #define CPU_TLB_ENTRY_BITS 5
 #endif
 
-/* TCG_TARGET_TLB_DISPLACEMENT_BITS is used in CPU_TLB_BITS to ensure that
- * the TLB is not unnecessarily small, but still small enough for the
- * TLB lookup instruction sequence used by the TCG target.
- *
- * TCG will have to generate an operand as large as the distance between
- * env and the tlb_table[NB_MMU_MODES - 1][0].addend.  For simplicity,
- * the TCG targets just round everything up to the next power of two, and
- * count bits.  This works because: 1) the size of each TLB is a largish
- * power of two, 2) and because the limit of the displacement is really close
- * to a power of two, 3) the offset of tlb_table[0][0] inside env is smaller
- * than the size of a TLB.
- *
- * For example, the maximum displacement 0xFFF0 on PPC and MIPS, but TCG
- * just says "the displacement is 16 bits".  TCG_TARGET_TLB_DISPLACEMENT_BITS
- * then ensures that tlb_table at least 0x8000 bytes large ("not unnecessarily
- * small": 2^15).  The operand then will come up smaller than 0xFFF0 without
- * any particular care, because the TLB for a single MMU mode is larger than
- * 0x10000-0xFFF0=16 bytes.  In the end, the maximum value of the operand
- * could be something like 0xC000 (the offset of the last TLB table) plus
- * 0x18 (the offset of the addend field in each TLB entry) plus the offset
- * of tlb_table inside env (which is non-trivial but not huge).
+#define MIN_CPU_TLB_BITS 6
+#define DEFAULT_CPU_TLB_BITS 8
+/*
+ * Assuming TARGET_PAGE_BITS==12, with 2**22 entries we can cover 2**(22+12) ==
+ * 2**34 == 16G of address space. This is roughly what one would expect a
+ * TLB to cover in a modern (as of 2018) x86_64 CPU. For instance, Intel
+ * Skylake's Level-2 STLB has 16 1G entries.
  */
-#define CPU_TLB_BITS                                             \
-    MIN(8,                                                       \
-        TCG_TARGET_TLB_DISPLACEMENT_BITS - CPU_TLB_ENTRY_BITS -  \
-        (NB_MMU_MODES <= 1 ? 0 :                                 \
-         NB_MMU_MODES <= 2 ? 1 :                                 \
-         NB_MMU_MODES <= 4 ? 2 :                                 \
-         NB_MMU_MODES <= 8 ? 3 : 4))
-
-#define CPU_TLB_SIZE (1 << CPU_TLB_BITS)
+#define MAX_CPU_TLB_BITS 22
 
 typedef struct CPUTLBEntry {
     /* bit TARGET_LONG_BITS to TARGET_PAGE_BITS : virtual address
@@ -143,6 +121,7 @@  typedef struct CPUIOTLBEntry {
 
 typedef struct CPUTLBDesc {
     size_t n_used_entries;
+    size_t n_flushes_low_rate;
 } CPUTLBDesc;
 
 #define CPU_COMMON_TLB  \
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 11d6060eb0..5ebfa4fbb5 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -80,9 +80,10 @@  void tlb_init(CPUState *cpu)
 
     qemu_spin_init(&env->tlb_lock);
     for (i = 0; i < NB_MMU_MODES; i++) {
-        size_t n_entries = CPU_TLB_SIZE;
+        size_t n_entries = 1 << DEFAULT_CPU_TLB_BITS;
 
         env->tlb_desc[i].n_used_entries = 0;
+        env->tlb_desc[i].n_flushes_low_rate = 0;
         env->tlb_mask[i] = (n_entries - 1) << CPU_TLB_ENTRY_BITS;
         env->tlb_table[i] = g_new(CPUTLBEntry, n_entries);
         env->iotlb[i] = g_new0(CPUIOTLBEntry, n_entries);
@@ -121,6 +122,40 @@  size_t tlb_flush_count(void)
     return count;
 }
 
+/* Call with tlb_lock held */
+static void tlb_mmu_resize_locked(CPUArchState *env, int mmu_idx)
+{
+    CPUTLBDesc *desc = &env->tlb_desc[mmu_idx];
+    size_t old_size = tlb_n_entries(env, mmu_idx);
+    size_t rate = desc->n_used_entries * 100 / old_size;
+    size_t new_size = old_size;
+
+    if (rate == 100) {
+        new_size = MIN(old_size << 2, 1 << MAX_CPU_TLB_BITS);
+    } else if (rate > 70) {
+        new_size = MIN(old_size << 1, 1 << MAX_CPU_TLB_BITS);
+    } else if (rate < 30) {
+        desc->n_flushes_low_rate++;
+        if (desc->n_flushes_low_rate == 100) {
+            new_size = MAX(old_size >> 1, 1 << MIN_CPU_TLB_BITS);
+            desc->n_flushes_low_rate = 0;
+        }
+    }
+
+    if (new_size == old_size) {
+        return;
+    }
+
+    g_free(env->tlb_table[mmu_idx]);
+    g_free(env->iotlb[mmu_idx]);
+
+    /* desc->n_used_entries is cleared by the caller */
+    desc->n_flushes_low_rate = 0;
+    env->tlb_mask[mmu_idx] = (new_size - 1) << CPU_TLB_ENTRY_BITS;
+    env->tlb_table[mmu_idx] = g_new(CPUTLBEntry, new_size);
+    env->iotlb[mmu_idx] = g_new0(CPUIOTLBEntry, new_size);
+}
+
 /* This is OK because CPU architectures generally permit an
  * implementation to drop entries from the TLB at any time, so
  * flushing more entries than required is only an efficiency issue,
@@ -150,6 +185,7 @@  static void tlb_flush_nocheck(CPUState *cpu)
      */
     qemu_spin_lock(&env->tlb_lock);
     for (i = 0; i < NB_MMU_MODES; i++) {
+        tlb_mmu_resize_locked(env, i);
         memset(env->tlb_table[i], -1, sizeof_tlb(env, i));
         env->tlb_desc[i].n_used_entries = 0;
     }
@@ -213,6 +249,7 @@  static void tlb_flush_by_mmuidx_async_work(CPUState *cpu, run_on_cpu_data data)
         if (test_bit(mmu_idx, &mmu_idx_bitmask)) {
             tlb_debug("%d\n", mmu_idx);
 
+            tlb_mmu_resize_locked(env, mmu_idx);
             memset(env->tlb_table[mmu_idx], -1, sizeof_tlb(env, mmu_idx));
             memset(env->tlb_v_table[mmu_idx], -1, sizeof(env->tlb_v_table[0]));
             env->tlb_desc[mmu_idx].n_used_entries = 0;