[RFC,v3,00/11] KVM: Dirty ring support (QEMU part)

Message ID 20200523232035.1029349-1-peterx@redhat.com

Peter Xu May 23, 2020, 11:20 p.m. UTC
I kept the dirty sync in kvm_set_phys_mem() for kvmslot removals, and left a
comment on the known issue about strict dirty sync, so we can fix it in the
future together with dirty log and dirty ring.

v3:
- added "KVM: Use a big lock to replace per-kml slots_lock"
  this is preparing for the last patch where we'll reap kvm dirty ring when
  removing kvmslots.
- added "KVM: Simplify dirty log sync in kvm_set_phys_mem"
  it's kind of a fix, but also a preparation for the last patch, so it'll be
  very easy to add the dirty ring sync there
- the last patch is changed to correctly handle the dirty sync on kvmslot
  removal, with a comment there about the known issues
- reordered the patches a bit
- NOTE: since we kept the sync in memslot removal, this version does not depend
  on any other QEMU series - it is based on QEMU master

v2:
- add r-bs for Dave
- change dirty-ring-size parameter from int64 to uint64_t [Dave]
- remove an assertion for KVM_GET_DIRTY_LOG [Dave]
- document update: "per vcpu" dirty ring [Dave]
- rename KVMReaperState to KVMDirtyRingReaperState [Dave]
- dump errno when kvm_init_vcpu fails with dirty ring [Dave]
- switch to use dirty-ring-gfns as parameter [Dave]
- comment MAP_SHARED [Dave]
- dump more info when enable dirty ring failed [Dave]
- add kvm_dirty_ring_enabled flag to show whether dirty ring enabled
- rewrote many of the last patch to reduce LOC, now we do dirty ring reap only
  with BQL to simplify things, allowing the main or vcpu thread to directly
  call kvm_dirty_ring_reap() to collect dirty pages, so that we can drop a lot
  of synchronization variables like sems or eventfds.

For anyone who wants to try (we need to upgrade the kernel too):

KVM branch:
  https://github.com/xzpeter/linux/tree/kvm-dirty-ring

QEMU branch for testing:
  https://github.com/xzpeter/qemu/tree/kvm-dirty-ring

Overview
========

KVM dirty ring is a new interface to pass dirty bits from the kernel
to userspace.  Instead of using a bitmap for each memory region,
the dirty ring contains an array of dirtied GPAs to fetch, one ring
per vcpu.

There are a few major changes compared to how the old dirty logging
interface works:

- Granularity of dirty bits

  The KVM dirty ring interface does not offer memory-region-level
  granularity for collecting dirty bits (i.e., per KVM memory
  slot).  Instead, dirty bits are collected globally for all the
  vcpus at once.  The major effect is on the VGA part, because VGA
  dirty tracking is enabled for as long as the device exists, and it
  used to work at memory region granularity.  Now that operation is
  amplified to a whole-VM sync.  Maybe there's a smarter way to do
  the same thing in VGA with the new interface, but so far I don't
  see it affecting much, at least on regular VMs.

- Collection of dirty bits

  The old dirty logging interface collects KVM dirty bits at
  synchronization time.  The KVM dirty ring interface instead uses a
  standalone thread to do that.  So when another thread (e.g., the
  migration thread) wants to synchronize the dirty bits, it simply
  kicks that thread and waits until it flushes all the dirty bits to
  the ramblock dirty bitmap.

A new parameter "dirty-ring-size" is added to "-accel kvm".  By
default, the dirty ring is still disabled (size==0).  To enable it,
use:

  -accel kvm,dirty-ring-size=65536

This establishes a 64KB dirty ring buffer per vcpu.  Migration will
then use the dirty ring instead of dirty logging.

I gave it a shot with a 24G guest, 8 vcpus, using a 10G NIC as the
migration channel.  When idle or under a small dirty workload, I
don't observe a major difference in total migration time.  With a
higher random dirty workload (800MB/s dirty rate over 20G of memory),
the kvm dirty ring fares worse.  Total migration times (ping-pong
migrated 6 times, in seconds):

|-------------------------+---------------|
| dirty ring (4k entries) | dirty logging |
|-------------------------+---------------|
|                      70 |            58 |
|                      78 |            70 |
|                      72 |            48 |
|                      74 |            52 |
|                      83 |            49 |
|                      65 |            54 |
|-------------------------+---------------|

Summary:

dirty ring average:    73s
dirty logging average: 55s

The KVM dirty ring is slower in the case above.  The numbers suggest
that dirty logging is still preferable as the default, because
small/medium VMs are still the major use case, and high dirty
workloads happen frequently too.  That is what this series does:
dirty logging remains the default.

Please refer to the code and comments themselves for more information.

Thanks,

Peter Xu (11):
  linux-headers: Update
  memory: Introduce log_sync_global() to memory listener
  KVM: Fixup kvm_log_clear_one_slot() ioctl return check
  KVM: Use a big lock to replace per-kml slots_lock
  KVM: Create the KVMSlot dirty bitmap on flag changes
  KVM: Provide helper to get kvm dirty log
  KVM: Provide helper to sync dirty bitmap from slot to ramblock
  KVM: Simplify dirty log sync in kvm_set_phys_mem
  KVM: Cache kvm slot dirty bitmap size
  KVM: Add dirty-gfn-count property
  KVM: Dirty ring support

 accel/kvm/kvm-all.c         | 540 +++++++++++++++++++++++++++++++-----
 accel/kvm/trace-events      |   7 +
 include/exec/memory.h       |  12 +
 include/hw/core/cpu.h       |   8 +
 include/sysemu/kvm_int.h    |   7 +-
 linux-headers/asm-x86/kvm.h |   1 +
 linux-headers/linux/kvm.h   |  53 ++++
 memory.c                    |  33 ++-
 qemu-options.hx             |   5 +
 9 files changed, 581 insertions(+), 85 deletions(-)

Comments

Peter Xu May 24, 2020, 1:06 p.m. UTC | #1
On Sat, May 23, 2020 at 07:20:24PM -0400, Peter Xu wrote:
> I kept the dirty sync in kvm_set_phys_mem() for kvmslot removals, and left a
> comment on the known issue about strict dirty sync, so we can fix it in the
> future together with dirty log and dirty ring.

Side note: patches 3 and 5-8 should not be RFC material at all - they either
fix existing issues or clean up code.  Please consider reviewing/merging them
first, even before the rest of the patches.  Thanks,
Peter Xu May 26, 2020, 2:17 p.m. UTC | #2
On Sat, May 23, 2020 at 07:20:24PM -0400, Peter Xu wrote:
> I gave it a shot with a 24G guest, 8 vcpus, using a 10G NIC as the
> migration channel.  When idle or under a small dirty workload, I
> don't observe a major difference in total migration time.  With a
> higher random dirty workload (800MB/s dirty rate over 20G of memory),
> the kvm dirty ring fares worse.  Total migration times (ping-pong
> migrated 6 times, in seconds):
> 
> |-------------------------+---------------|
> | dirty ring (4k entries) | dirty logging |
> |-------------------------+---------------|
> |                      70 |            58 |
> |                      78 |            70 |
> |                      72 |            48 |
> |                      74 |            52 |
> |                      83 |            49 |
> |                      65 |            54 |
> |-------------------------+---------------|
> 
> Summary:
> 
> dirty ring average:    73s
> dirty logging average: 55s
> 
> The KVM dirty ring is slower in the case above.  The numbers suggest
> that dirty logging is still preferable as the default, because
> small/medium VMs are still the major use case, and high dirty
> workloads happen frequently too.  That is what this series does:
> dirty logging remains the default.

Two more TODOs that can potentially be worked upon:

- Consider dropping the BQL dependency when reaping dirty rings: then we can
  run the reaper thread in parallel with the main thread.  Needs some thought
  on the race conditions with the main thread to see whether it's doable.

- Consider dropping the kvmslot bitmap: logically this can be dropped with kvm
  dirty ring - not only for space savings, but also because it's slower, and
  it's yet another layer linear in guest memory size, which goes against the
  whole idea of kvm dirty ring.  This should make the above numbers (for kvm
  dirty ring) even smaller, though probably still not as good as dirty logging
  when the workload is high.  I'm not sure whether it's possible to drop the
  whole ramblock dirty bitmap too when kvm is enabled, so that we remove all
  the bitmap caches (I guess VGA would still need a very small one if it
  wants, or it could just do the refresh unconditionally).  That's a much
  bigger surgery and a wild idea, but logically it should be even more
  efficient with the ring structure: precopy would then work similarly to
  postcopy, in that there'd be a queue of dirty pages (probably except for
  the first round of precopy).

I'll append them into the cover letter of next version too.