[v6,73/73] cputlb: queue async flush jobs without the BQL

Message ID

20190130004811.27372-74-cota@braap.org

State

New

Headers

From: "Emilio G. Cota" <cota@braap.org>
To: qemu-devel@nongnu.org
Date: Tue, 29 Jan 2019 19:48:11 -0500
Message-Id: <20190130004811.27372-74-cota@braap.org>
In-Reply-To: <20190130004811.27372-1-cota@braap.org>
References: <20190130004811.27372-1-cota@braap.org>
Subject: [Qemu-devel] [PATCH v6 73/73] cputlb: queue async flush jobs
	without the BQL
Precedence: list
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Richard Henderson <richard.henderson@linaro.org>
Errors-To: qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org
Sender: "Qemu-devel"
	<qemu-devel-bounces+incoming=patchwork.ozlabs.org@nongnu.org>

Series

per-CPU locks

Commit Message

Emilio Cota Jan. 30, 2019, 12:48 a.m. UTC

This yields sizable scalability improvements, as the results below show.

Host: Two Intel E5-2683 v3 14-core CPUs at 2.00 GHz (Haswell)

Workload: Ubuntu 18.04 ppc64 compiling the Linux kernel with
"make -j N", where N is the number of cores in the guest.

                      Speedup vs a single thread (higher is better):

         14 +---------------------------------------------------------------+
            |       +    +       +      +       +      +      $$$$$$  +     |
            |                                            $$$$$              |
            |                                      $$$$$$                   |
         12 |-+                                $A$$                       +-|
            |                                $$                             |
            |                             $$$                               |
         10 |-+                         $$    ##D#####################D   +-|
            |                        $$$ #####**B****************           |
            |                      $$####*****                   *****      |
            |                    A$#*****                             B     |
          8 |-+                $$B**                                      +-|
            |                $$**                                           |
            |               $**                                             |
          6 |-+           $$*                                             +-|
            |            A**                                                |
            |           $B                                                  |
            |           $                                                   |
          4 |-+        $*                                                 +-|
            |          $                                                    |
            |         $                                                     |
          2 |-+      $                                                    +-|
            |        $                                 +cputlb-no-bql $$A$$ |
            |       A                                   +per-cpu-lock ##D## |
            |       +    +       +      +       +      +     baseline **B** |
          0 +---------------------------------------------------------------+
                    1    4       8      12      16     20      24     28
                                       Guest vCPUs
  png: https://imgur.com/zZRvS7q

Some notes:
- baseline corresponds to the commit before this series.

- per-cpu-lock is the commit that converts the CPU loop to per-cpu locks.

- cputlb-no-bql is this commit.

- I'm using taskset to assign cores to threads, favouring locality whenever
  possible but not using SMT. When N=1, I'm using a single host core, which
  leads to superlinear speedups (since with more cores the I/O thread can execute
  while vCPU threads sleep). In the future I might use N+1 host cores for N
  guest cores to avoid this, or perhaps pin guest threads to cores one-by-one.

Single-threaded performance is only slightly affected. Results
below are for a Debian aarch64 bootup+test, for the entire series,
on an Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz host:

- Before:

 Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):

       7269.033478      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.06% )
    30,659,870,302      cycles                    #    4.218 GHz                      ( +-  0.06% )
    54,790,540,051      instructions              #    1.79  insns per cycle          ( +-  0.05% )
     9,796,441,380      branches                  # 1347.695 M/sec                    ( +-  0.05% )
       165,132,201      branch-misses             #    1.69% of all branches          ( +-  0.12% )

       7.287011656 seconds time elapsed                                          ( +-  0.10% )

- After:

       7375.924053      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.13% )
    31,107,548,846      cycles                    #    4.217 GHz                      ( +-  0.12% )
    55,355,668,947      instructions              #    1.78  insns per cycle          ( +-  0.05% )
     9,929,917,664      branches                  # 1346.261 M/sec                    ( +-  0.04% )
       166,547,442      branch-misses             #    1.68% of all branches          ( +-  0.09% )

       7.389068145 seconds time elapsed                                          ( +-  0.13% )

That is, a 1.37% slowdown.

Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 accel/tcg/cputlb.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Comments

Alex Bennée Feb. 8, 2019, 3:58 p.m. UTC | #1

Emilio G. Cota <cota@braap.org> writes:

> This yields sizable scalability improvements, as the results below show.
>
> Host: Two Intel E5-2683 v3 14-core CPUs at 2.00 GHz (Haswell)
>
> Workload: Ubuntu 18.04 ppc64 compiling the Linux kernel with
> "make -j N", where N is the number of cores in the guest.

I can verify my pigz benchmark starts levelling out at 12-14 guest vCPUs
on the 36-core host box I'm testing on. It's not a tightly controlled
environment, but it certainly shows how far MTTCG has come since it was
first introduced. Good stuff.

> diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
> index dad9b7796c..8491d36bcf 100644
> --- a/accel/tcg/cputlb.c
> +++ b/accel/tcg/cputlb.c
> @@ -260,7 +260,7 @@ static void flush_all_helper(CPUState *src, run_on_cpu_func fn,
>
>      CPU_FOREACH(cpu) {
>          if (cpu != src) {
> -            async_run_on_cpu(cpu, fn, d);
> +            async_run_on_cpu_no_bql(cpu, fn, d);
>          }
>      }
>  }
> @@ -336,8 +336,8 @@ void tlb_flush_by_mmuidx(CPUState *cpu, uint16_t idxmap)
>      tlb_debug("mmu_idx: 0x%" PRIx16 "\n", idxmap);
>
>      if (cpu->created && !qemu_cpu_is_self(cpu)) {
> -        async_run_on_cpu(cpu, tlb_flush_by_mmuidx_async_work,
> -                         RUN_ON_CPU_HOST_INT(idxmap));
> +        async_run_on_cpu_no_bql(cpu, tlb_flush_by_mmuidx_async_work,
> +                                RUN_ON_CPU_HOST_INT(idxmap));
>      } else {
>          tlb_flush_by_mmuidx_async_work(cpu, RUN_ON_CPU_HOST_INT(idxmap));
>      }
> @@ -481,8 +481,8 @@ void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong addr, uint16_t idxmap)
>      addr_and_mmu_idx |= idxmap;
>
>      if (!qemu_cpu_is_self(cpu)) {
> -        async_run_on_cpu(cpu, tlb_flush_page_by_mmuidx_async_work,
> -                         RUN_ON_CPU_TARGET_PTR(addr_and_mmu_idx));
> +        async_run_on_cpu_no_bql(cpu, tlb_flush_page_by_mmuidx_async_work,
> +                                RUN_ON_CPU_TARGET_PTR(addr_and_mmu_idx));
>      } else {
>          tlb_flush_page_by_mmuidx_async_work(
>              cpu, RUN_ON_CPU_TARGET_PTR(addr_and_mmu_idx));


Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Tested-by: Alex Bennée <alex.bennee@linaro.org>

I think that brings my run through this patch series to a conclusion.
Looking good all round.

--
Alex Bennée

Richard Henderson Feb. 20, 2019, 5:18 p.m. UTC | #2

On 1/29/19 4:48 PM, Emilio G. Cota wrote:
> This yields sizable scalability improvements, as the results below show.
> 
> Host: Two Intel E5-2683 v3 14-core CPUs at 2.00 GHz (Haswell)
> 
> Workload: Ubuntu 18.04 ppc64 compiling the Linux kernel with
> "make -j N", where N is the number of cores in the guest.
> 
>                       Speedup vs a single thread (higher is better):
> 
>          14 +---------------------------------------------------------------+
>             |       +    +       +      +       +      +      $$$$$$  +     |
>             |                                            $$$$$              |
>             |                                      $$$$$$                   |
>          12 |-+                                $A$$                       +-|
>             |                                $$                             |
>             |                             $$$                               |
>          10 |-+                         $$    ##D#####################D   +-|
>             |                        $$$ #####**B****************           |
>             |                      $$####*****                   *****      |
>             |                    A$#*****                             B     |
>           8 |-+                $$B**                                      +-|
>             |                $$**                                           |
>             |               $**                                             |
>           6 |-+           $$*                                             +-|
>             |            A**                                                |
>             |           $B                                                  |
>             |           $                                                   |
>           4 |-+        $*                                                 +-|
>             |          $                                                    |
>             |         $                                                     |
>           2 |-+      $                                                    +-|
>             |        $                                 +cputlb-no-bql $$A$$ |
>             |       A                                   +per-cpu-lock ##D## |
>             |       +    +       +      +       +      +     baseline **B** |
>           0 +---------------------------------------------------------------+
>                     1    4       8      12      16     20      24     28
>                                        Guest vCPUs
>   png: https://imgur.com/zZRvS7q
> 
> Some notes:
> - baseline corresponds to the commit before this series.
> 
> - per-cpu-lock is the commit that converts the CPU loop to per-cpu locks.
> 
> - cputlb-no-bql is this commit.
> 
> - I'm using taskset to assign cores to threads, favouring locality whenever
>   possible but not using SMT. When N=1, I'm using a single host core, which
>   leads to superlinear speedups (since with more cores the I/O thread can execute
>   while vCPU threads sleep). In the future I might use N+1 host cores for N
>   guest cores to avoid this, or perhaps pin guest threads to cores one-by-one.
> 
> Single-threaded performance is only slightly affected. Results
> below are for a Debian aarch64 bootup+test, for the entire series,
> on an Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz host:
> 
> - Before:
> 
>  Performance counter stats for 'taskset -c 0 ../img/aarch64/die.sh' (10 runs):
> 
>        7269.033478      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.06% )
>     30,659,870,302      cycles                    #    4.218 GHz                      ( +-  0.06% )
>     54,790,540,051      instructions              #    1.79  insns per cycle          ( +-  0.05% )
>      9,796,441,380      branches                  # 1347.695 M/sec                    ( +-  0.05% )
>        165,132,201      branch-misses             #    1.69% of all branches          ( +-  0.12% )
> 
>        7.287011656 seconds time elapsed                                          ( +-  0.10% )
> 
> - After:
> 
>        7375.924053      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.13% )
>     31,107,548,846      cycles                    #    4.217 GHz                      ( +-  0.12% )
>     55,355,668,947      instructions              #    1.78  insns per cycle          ( +-  0.05% )
>      9,929,917,664      branches                  # 1346.261 M/sec                    ( +-  0.04% )
>        166,547,442      branch-misses             #    1.68% of all branches          ( +-  0.09% )
> 
>        7.389068145 seconds time elapsed                                          ( +-  0.13% )
> 
> That is, a 1.37% slowdown.
> 
> Signed-off-by: Emilio G. Cota <cota@braap.org>
> ---
>  accel/tcg/cputlb.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>


r~

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index dad9b7796c..8491d36bcf 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -260,7 +260,7 @@  static void flush_all_helper(CPUState *src, run_on_cpu_func fn,
 
     CPU_FOREACH(cpu) {
         if (cpu != src) {
-            async_run_on_cpu(cpu, fn, d);
+            async_run_on_cpu_no_bql(cpu, fn, d);
         }
     }
 }
@@ -336,8 +336,8 @@  void tlb_flush_by_mmuidx(CPUState *cpu, uint16_t idxmap)
     tlb_debug("mmu_idx: 0x%" PRIx16 "\n", idxmap);
 
     if (cpu->created && !qemu_cpu_is_self(cpu)) {
-        async_run_on_cpu(cpu, tlb_flush_by_mmuidx_async_work,
-                         RUN_ON_CPU_HOST_INT(idxmap));
+        async_run_on_cpu_no_bql(cpu, tlb_flush_by_mmuidx_async_work,
+                                RUN_ON_CPU_HOST_INT(idxmap));
     } else {
         tlb_flush_by_mmuidx_async_work(cpu, RUN_ON_CPU_HOST_INT(idxmap));
     }
@@ -481,8 +481,8 @@  void tlb_flush_page_by_mmuidx(CPUState *cpu, target_ulong addr, uint16_t idxmap)
     addr_and_mmu_idx |= idxmap;
 
     if (!qemu_cpu_is_self(cpu)) {
-        async_run_on_cpu(cpu, tlb_flush_page_by_mmuidx_async_work,
-                         RUN_ON_CPU_TARGET_PTR(addr_and_mmu_idx));
+        async_run_on_cpu_no_bql(cpu, tlb_flush_page_by_mmuidx_async_work,
+                                RUN_ON_CPU_TARGET_PTR(addr_and_mmu_idx));
     } else {
         tlb_flush_page_by_mmuidx_async_work(
             cpu, RUN_ON_CPU_TARGET_PTR(addr_and_mmu_idx));
