[v0,0/7] Background snapshots

Message ID 20180629080320.320144-1-dplotnikov@virtuozzo.com
Series: Background snapshots

Message

Denis Plotnikov June 29, 2018, 8:03 a.m. UTC
The patch set adds the ability to make external snapshots while the VM is running.

The workflow to make a snapshot is the following:
1. Pause the VM
2. Make a snapshot of the block devices using the scheme of your choice
3. Turn on the background-snapshot migration capability
4. Start the migration using the destination (migration stream) of your choice.
   The migration will resume VM execution by itself
   once it has saved the devices' states and is ready to start writing RAM
   to the migration stream.
5. Listen for the migration finish event

The feature relies on an as-yet-unapplied KVM ability to report the faulting address.
Please find the KVM patch snippet needed to make the patch set work below:

+++ b/arch/x86/kvm/vmx.c
@@ -XXXX,X +XXXX,XX @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 
        vcpu->arch.exit_qualification = exit_qualification;
 
-       return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
+       r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
+       if (r == -EFAULT) {
+               unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gpa >> PAGE_SHIFT);
+
+               vcpu->run->exit_reason = KVM_EXIT_FAIL_MEM_ACCESS;
+               vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_VIOLATION;
+               vcpu->run->fail_mem_access.hva = hva | (gpa & (PAGE_SIZE-1));
+               r = 0;
+       }
+       return r;

The patch to KVM can be sent if the patch set is approved.
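
For reference, the snippet assumes a new userspace exit reason and a new
kvm_run payload carrying the faulting hva. A rough sketch of that uapi piece
is given below; the constant value and the struct layout are assumptions for
illustration, not the actual kernel patch:

/* Hypothetical sketch only: the uapi piece implied by the snippet above.
 * The constant value and the field layout are placeholders. */
#include <linux/types.h>

#define KVM_EXIT_FAIL_MEM_ACCESS 28            /* placeholder value */

/* Payload reported in struct kvm_run for KVM_EXIT_FAIL_MEM_ACCESS. */
struct kvm_fail_mem_access {
        __u64 hva;   /* host virtual address of the write-protected page */
};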

Denis Plotnikov (7):
  migration: add background snapshot capability
  bitops: add some atomic versions of bitmap operations
  threads: add infrastructure to process sigsegv
  migration: add background snapshot infrastructure
  kvm: add failed memory access exit reason
  kvm: add vCPU failed memory access processing
  migration: add background snapshotting

 include/exec/ram_addr.h   |   7 +
 include/exec/ramlist.h    |   4 +-
 include/qemu/bitops.h     |  24 +++
 include/qemu/thread.h     |   5 +
 linux-headers/linux/kvm.h |   5 +
 migration/migration.c     | 141 +++++++++++++++-
 migration/migration.h     |   1 +
 migration/ram.c           | 333 ++++++++++++++++++++++++++++++++++++--
 migration/ram.h           |  11 +-
 migration/savevm.c        |  91 ++++++-----
 migration/savevm.h        |   2 +
 qapi/migration.json       |   6 +-
 target/i386/kvm.c         |  18 +++
 util/qemu-thread-posix.c  |  50 ++++++
 14 files changed, 635 insertions(+), 63 deletions(-)

Comments

Dr. David Alan Gilbert June 29, 2018, 11:53 a.m. UTC | #1
* Denis Plotnikov (dplotnikov@virtuozzo.com) wrote:
> The patch set adds the ability to make external snapshots while VM is running.

cc'ing in Andrea since this uses sigsegv's to avoid userfault-wp that
isn't there yet.

Hi Denis,
  How robust are you finding this SEGV based trick; for example what
about things like the kernel walking vhost queues or similar kernel
nasties?

Dave

> The workflow to make a snapshot is the following:
> 1. Pause the vm
> 2. Make a snapshot of block devices using the scheme of your choice
> 3. Turn on background-snapshot migration capability
> 4. Start the migration using the destination (migration stream) of your choice.
>    The migration will resume the vm execution by itself
>    when it has the devices' states saved  and is ready to start ram writing
>    to the migration stream.
> 5. Listen to the migration finish event
> 
> The feature relies on KVM unapplied ability to report the faulting address.
> Please find the KVM patch snippet to make the patchset work below:
> 
> +++ b/arch/x86/kvm/vmx.c
> @@ -XXXX,X +XXXX,XX @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>  
>         vcpu->arch.exit_qualification = exit_qualification;
>  
> -       return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> +       r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> +        if (r == -EFAULT) {
> +               unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gpa >> PAGE_SHIFT);
> +
> +               vcpu->run->exit_reason = KVM_EXIT_FAIL_MEM_ACCESS;
> +               vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_VIOLATION;
> +               vcpu->run->fail_mem_access.hva = hva | (gpa & (PAGE_SIZE-1));
> +               r = 0;
> +
> +       }
> +       return r;
> 
> The patch to KVM can be sent if the patch set approved
> 
> Denis Plotnikov (7):
>   migration: add background snapshot capability
>   bitops: add some atomic versions of bitmap operations
>   threads: add infrastructure to process sigsegv
>   migration: add background snapshot infrastructure
>   kvm: add failed memory access exit reason
>   kvm: add vCPU failed memory access processing
>   migration: add background snapshotting
> 
>  include/exec/ram_addr.h   |   7 +
>  include/exec/ramlist.h    |   4 +-
>  include/qemu/bitops.h     |  24 +++
>  include/qemu/thread.h     |   5 +
>  linux-headers/linux/kvm.h |   5 +
>  migration/migration.c     | 141 +++++++++++++++-
>  migration/migration.h     |   1 +
>  migration/ram.c           | 333 ++++++++++++++++++++++++++++++++++++--
>  migration/ram.h           |  11 +-
>  migration/savevm.c        |  91 ++++++-----
>  migration/savevm.h        |   2 +
>  qapi/migration.json       |   6 +-
>  target/i386/kvm.c         |  18 +++
>  util/qemu-thread-posix.c  |  50 ++++++
>  14 files changed, 635 insertions(+), 63 deletions(-)
> 
> -- 
> 2.17.0
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Peter Xu July 2, 2018, 11:23 a.m. UTC | #2
On Fri, Jun 29, 2018 at 11:03:13AM +0300, Denis Plotnikov wrote:
> The patch set adds the ability to make external snapshots while VM is running.

Hi, Denis,

This work is interesting, though I have a few questions to ask in
general below.

> 
> The workflow to make a snapshot is the following:
> 1. Pause the vm
> 2. Make a snapshot of block devices using the scheme of your choice

Here you explicitly took the snapshot for the block device, then...

> 3. Turn on background-snapshot migration capability
> 4. Start the migration using the destination (migration stream) of your choice.

... here you started the VM snapshot.  How did you make sure that the
VM snapshot (e.g., the RAM data) and the block snapshot will be
aligned?

For example, in current save_snapshot() we'll quiesce disk IOs before
migrating the last pieces of RAM data to make sure they are aligned.
I didn't figure out myself on how that's done in this work.

>    The migration will resume the vm execution by itself
>    when it has the devices' states saved  and is ready to start ram writing
>    to the migration stream.
> 5. Listen to the migration finish event
> 
> The feature relies on KVM unapplied ability to report the faulting address.
> Please find the KVM patch snippet to make the patchset work below:
> 
> +++ b/arch/x86/kvm/vmx.c
> @@ -XXXX,X +XXXX,XX @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>  
>         vcpu->arch.exit_qualification = exit_qualification;
>  
> -       return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> +       r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> +        if (r == -EFAULT) {
> +               unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gpa >> PAGE_SHIFT);
> +
> +               vcpu->run->exit_reason = KVM_EXIT_FAIL_MEM_ACCESS;
> +               vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_VIOLATION;
> +               vcpu->run->fail_mem_access.hva = hva | (gpa & (PAGE_SIZE-1));
> +               r = 0;
> +
> +       }
> +       return r;

Just to make sure I fully understand here: so this is some extra KVM
work just to make sure the mprotect() trick will work even for KVM
vcpu threads, am I right?

Meanwhile, I see that you only modified EPT violation code, then how
about the legacy hardwares and softmmu case?

Thanks,

> 
> The patch to KVM can be sent if the patch set approved
> 
> Denis Plotnikov (7):
>   migration: add background snapshot capability
>   bitops: add some atomic versions of bitmap operations
>   threads: add infrastructure to process sigsegv
>   migration: add background snapshot infrastructure
>   kvm: add failed memory access exit reason
>   kvm: add vCPU failed memory access processing
>   migration: add background snapshotting
> 
>  include/exec/ram_addr.h   |   7 +
>  include/exec/ramlist.h    |   4 +-
>  include/qemu/bitops.h     |  24 +++
>  include/qemu/thread.h     |   5 +
>  linux-headers/linux/kvm.h |   5 +
>  migration/migration.c     | 141 +++++++++++++++-
>  migration/migration.h     |   1 +
>  migration/ram.c           | 333 ++++++++++++++++++++++++++++++++++++--
>  migration/ram.h           |  11 +-
>  migration/savevm.c        |  91 ++++++-----
>  migration/savevm.h        |   2 +
>  qapi/migration.json       |   6 +-
>  target/i386/kvm.c         |  18 +++
>  util/qemu-thread-posix.c  |  50 ++++++
>  14 files changed, 635 insertions(+), 63 deletions(-)
> 
> -- 
> 2.17.0
> 
>
Denis Plotnikov July 2, 2018, 12:40 p.m. UTC | #3
On 02.07.2018 14:23, Peter Xu wrote:
> On Fri, Jun 29, 2018 at 11:03:13AM +0300, Denis Plotnikov wrote:
>> The patch set adds the ability to make external snapshots while VM is running.
> 
> Hi, Denis,
> 
> This work is interesting, though I have a few questions to ask in
> general below.
> 
>>
>> The workflow to make a snapshot is the following:
>> 1. Pause the vm
>> 2. Make a snapshot of block devices using the scheme of your choice
> 
> Here you explicitly took the snapshot for the block device, then...
> 
>> 3. Turn on background-snapshot migration capability
>> 4. Start the migration using the destination (migration stream) of your choice.
> 
> ... here you started the VM snapshot.  How did you make sure that the
> VM snapshot (e.g., the RAM data) and the block snapshot will be
> aligned?
As the VM has been paused before making the image (disk) snapshot, no
requests should go to the original image from that point on. All the
later requests go to the disk snapshot.

At that point we have a disk image and its snapshot.
The image holds a kind of checkpointed state which won't (shouldn't)
change, because all write requests go to the image snapshot.

Then we start the background snapshot, which marks all the memory as
read-only and writes the device part of the VM state to the VM snapshot file.
By making the memory read-only we effectively freeze the state of the RAM.

At that point we have the original image and the VM memory content, which
correspond to each other because the VM isn't running.

Then the background snapshot thread resumes VM execution with the
memory marked read-only while that memory is being written to the external
VM snapshot file. All write accesses to the memory are intercepted, and
the pages being accessed are written to the VM snapshot (VM state) file
with priority. Since a page is marked read-write again right after it has
been written out, it isn't tracked for later accesses.

This is how we guarantee that the VM snapshot (state) file has the
memory content corresponding to the moment when the disk snapshot was
created.

When the writing ends, we have a VM snapshot (VM state) file whose
memory content was saved as of the moment the image snapshot was created.

So, to restore the VM from "the snapshot", we use the original
disk image (not the disk snapshot) and the VM snapshot (VM state with
saved memory) file.
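
To illustrate the interception for plain userspace (non-vCPU) accesses, here
is a minimal sketch of the SIGSEGV path; save_page_to_snapshot() is a
hypothetical stand-in for the series' page-save helper, and the real code
also has to care about signal safety and races with the snapshot thread:

/* Minimal sketch (not the posted code): save the still-unmodified page
 * first, then re-open it for writes so the faulting access is retried
 * when the handler returns. Installed with sigaction() and SA_SIGINFO. */
#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

extern void save_page_to_snapshot(void *page, long len);   /* hypothetical */

static void wp_sigsegv_handler(int sig, siginfo_t *info, void *ucontext)
{
    long psize = sysconf(_SC_PAGESIZE);
    void *page = (void *)((uintptr_t)info->si_addr & ~(uintptr_t)(psize - 1));

    save_page_to_snapshot(page, psize);
    mprotect(page, psize, PROT_READ | PROT_WRITE);
}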

> 
> For example, in current save_snapshot() we'll quiesce disk IOs before
> migrating the last pieces of RAM data to make sure they are aligned.
> I didn't figure out myself on how that's done in this work.
> 
>>     The migration will resume the vm execution by itself
>>     when it has the devices' states saved  and is ready to start ram writing
>>     to the migration stream.
>> 5. Listen to the migration finish event
>>
>> The feature relies on KVM unapplied ability to report the faulting address.
>> Please find the KVM patch snippet to make the patchset work below:
>>
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -XXXX,X +XXXX,XX @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>>   
>>          vcpu->arch.exit_qualification = exit_qualification;
>>   
>> -       return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
>> +       r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
>> +        if (r == -EFAULT) {
>> +               unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gpa >> PAGE_SHIFT);
>> +
>> +               vcpu->run->exit_reason = KVM_EXIT_FAIL_MEM_ACCESS;
>> +               vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_VIOLATION;
>> +               vcpu->run->fail_mem_access.hva = hva | (gpa & (PAGE_SIZE-1));
>> +               r = 0;
>> +
>> +       }
>> +       return r;
> 
> Just to make sure I fully understand here: so this is some extra KVM
> work just to make sure the mprotect() trick will work even for KVM
> vcpu threads, am I right?

That's correct!
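
For readers following along: the vCPU thread then consumes this exit in
userspace instead of taking a SIGSEGV. A minimal sketch of that handling,
assuming a helper such as ram_process_page_fault() exported by the snapshot
infrastructure (the helper name is an assumption, not the exact code):

/* Sketch only: how the vCPU loop in target/i386/kvm.c could consume the new
 * exit reason; ram_process_page_fault() stands in for the series' priority
 * page-save routine, which also restores write access to the page. */
static int handle_fail_mem_access(struct kvm_run *run)
{
    void *hva = (void *)(uintptr_t)run->fail_mem_access.hva;

    return ram_process_page_fault(hva);
}
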
> 
> Meanwhile, I see that you only modified EPT violation code, then how
> about the legacy hardwares and softmmu case?

I didn't check it thoroughly, but the scheme works in TCG mode.


Thanks,
Denis

> 
> Thanks,
> 
>>
>> The patch to KVM can be sent if the patch set approved
>>
>> Denis Plotnikov (7):
>>    migration: add background snapshot capability
>>    bitops: add some atomic versions of bitmap operations
>>    threads: add infrastructure to process sigsegv
>>    migration: add background snapshot infrastructure
>>    kvm: add failed memory access exit reason
>>    kvm: add vCPU failed memory access processing
>>    migration: add background snapshotting
>>
>>   include/exec/ram_addr.h   |   7 +
>>   include/exec/ramlist.h    |   4 +-
>>   include/qemu/bitops.h     |  24 +++
>>   include/qemu/thread.h     |   5 +
>>   linux-headers/linux/kvm.h |   5 +
>>   migration/migration.c     | 141 +++++++++++++++-
>>   migration/migration.h     |   1 +
>>   migration/ram.c           | 333 ++++++++++++++++++++++++++++++++++++--
>>   migration/ram.h           |  11 +-
>>   migration/savevm.c        |  91 ++++++-----
>>   migration/savevm.h        |   2 +
>>   qapi/migration.json       |   6 +-
>>   target/i386/kvm.c         |  18 +++
>>   util/qemu-thread-posix.c  |  50 ++++++
>>   14 files changed, 635 insertions(+), 63 deletions(-)
>>
>> -- 
>> 2.17.0
>>
>>
>
Peter Xu July 3, 2018, 5:54 a.m. UTC | #4
On Mon, Jul 02, 2018 at 03:40:31PM +0300, Denis Plotnikov wrote:
> 
> 
> On 02.07.2018 14:23, Peter Xu wrote:
> > On Fri, Jun 29, 2018 at 11:03:13AM +0300, Denis Plotnikov wrote:
> > > The patch set adds the ability to make external snapshots while VM is running.
> > 
> > Hi, Denis,
> > 
> > This work is interesting, though I have a few questions to ask in
> > general below.
> > 
> > > 
> > > The workflow to make a snapshot is the following:
> > > 1. Pause the vm
> > > 2. Make a snapshot of block devices using the scheme of your choice
> > 
> > Here you explicitly took the snapshot for the block device, then...
> > 
> > > 3. Turn on background-snapshot migration capability
> > > 4. Start the migration using the destination (migration stream) of your choice.
> > 
> > ... here you started the VM snapshot.  How did you make sure that the
> > VM snapshot (e.g., the RAM data) and the block snapshot will be
> > aligned?
> As the VM has been paused before making an image(disk) snapshot, there
> should be no requests to the original image done ever since. All the later
> request's goes to the disk snapshot.
> 
> At the point we have a disk image and its snapshot.
> In the image we have kind of checkpoint-ed state which won't (shouldn't) be
> changed because all the writing requests should go to the image snapshot.
> 
> Then we start the background snapshot which marks all the memory as
> read-only and writing the part of VM state to the VM snapshot file.
> By making the memory read-only we kind of freeze the state of the RAM.
> 
> At that point we have an original image and the VM memory content which
> corresponds to each other because the VM isn't running.
> 
> Then, the background snapshot thread continues VM execution with the
> read-only-marked memory which is being written to the external VM snapshot
> file. All the write accesses to the memory are intercepted and the memory
> pages being accessed are written to the VM snapshot (VM state) file in
> priority. Having marked as read-write right after the writing, the memory
> pages aren't tracked for later accesses.
> 
> This is how we guarantee that the VM snapshot (state) file has the memory
> content corresponding to the moment when the disk snapshot is created.
> 
> When the writing ends, we have the VM snapshot (VM state) file which has the
> memory content saved by the moment of the image snapshot creating.
> 
> So, to restore the VM from "the snapshot" we need to use the original image
> disk (not the disk snapshot) and the VM snapshot (VM state with saved
> memory) file.

My bad for not noticing the implication of vm_stop() as the
first step.  Your explanation is clear.  Thank you!

> 
> > 
> > For example, in current save_snapshot() we'll quiesce disk IOs before
> > migrating the last pieces of RAM data to make sure they are aligned.
> > I didn't figure out myself on how that's done in this work.
> > 
> > >     The migration will resume the vm execution by itself
> > >     when it has the devices' states saved  and is ready to start ram writing
> > >     to the migration stream.
> > > 5. Listen to the migration finish event
> > > 
> > > The feature relies on KVM unapplied ability to report the faulting address.
> > > Please find the KVM patch snippet to make the patchset work below:
> > > 
> > > +++ b/arch/x86/kvm/vmx.c
> > > @@ -XXXX,X +XXXX,XX @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> > >          vcpu->arch.exit_qualification = exit_qualification;
> > > -       return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> > > +       r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> > > +        if (r == -EFAULT) {
> > > +               unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gpa >> PAGE_SHIFT);
> > > +
> > > +               vcpu->run->exit_reason = KVM_EXIT_FAIL_MEM_ACCESS;
> > > +               vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_VIOLATION;
> > > +               vcpu->run->fail_mem_access.hva = hva | (gpa & (PAGE_SIZE-1));
> > > +               r = 0;
> > > +
> > > +       }
> > > +       return r;
> > 
> > Just to make sure I fully understand here: so this is some extra KVM
> > work just to make sure the mprotect() trick will work even for KVM
> > vcpu threads, am I right?
> 
> That's correct!
> > 
> > Meanwhile, I see that you only modified EPT violation code, then how
> > about the legacy hardwares and softmmu case?
> 
> Didn't check thoroughly but the scheme works in TCG mode.

Yeah I guess TCG will work since the SIGSEGV handler will work with
that.  I meant the shadow MMU implementation in KVM when
kvm_intel.ept=0 is set on the host.  But of course that's not a big
deal for now since that can be discussed in the kvm counterpart of the
work.  Meanwhile, considering that this series seems to provide a
general framework for live snapshot, this work is meaningful no matter
what backend magic is used (either mprotect, or userfaultfd in the
future).

Thanks,
Peter Xu July 13, 2018, 5:20 a.m. UTC | #5
On Fri, Jun 29, 2018 at 11:03:13AM +0300, Denis Plotnikov wrote:
> The patch set adds the ability to make external snapshots while VM is running.
> 
> The workflow to make a snapshot is the following:
> 1. Pause the vm
> 2. Make a snapshot of block devices using the scheme of your choice
> 3. Turn on background-snapshot migration capability
> 4. Start the migration using the destination (migration stream) of your choice.
>    The migration will resume the vm execution by itself
>    when it has the devices' states saved  and is ready to start ram writing
>    to the migration stream.
> 5. Listen to the migration finish event
> 
> The feature relies on KVM unapplied ability to report the faulting address.
> Please find the KVM patch snippet to make the patchset work below:
> 
> +++ b/arch/x86/kvm/vmx.c
> @@ -XXXX,X +XXXX,XX @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>  
>         vcpu->arch.exit_qualification = exit_qualification;
>  
> -       return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> +       r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> +        if (r == -EFAULT) {
> +               unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gpa >> PAGE_SHIFT);
> +
> +               vcpu->run->exit_reason = KVM_EXIT_FAIL_MEM_ACCESS;
> +               vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_VIOLATION;
> +               vcpu->run->fail_mem_access.hva = hva | (gpa & (PAGE_SIZE-1));
> +               r = 0;
> +
> +       }
> +       return r;
> 
> The patch to KVM can be sent if the patch set approved

Hi, Denis,

If the work will definitely require KVM to cooperate, AFAIU the thing
we normally do is first propose the kernel counterpart on the kvm
list; then it'll be easier to review the QEMU counterpart (or propose
both kvm/qemu changes at the same time; the QEMU changes can always be
an RFC, as a reference to prove the kvm change is valid and useful).  Not
sure whether you should do this as well for this live snapshot work.

Since we might have two backends in the future, my major question for
that counterpart series would be whether we need to support both of
them (mprotect and userfaultfd), and what the differences between
the two methods are from the kernel's point of view.  I would vaguely
guess that we could first have mprotect work, then userfaultfd, and then
automatically choose the backend when both are provided, but I
guess that discussion might still better happen on the kvm list.  I
would also guess that in that work you'd better consider the no-EPT case
as well for Intel, and even AMD.  But I'm not sure; perhaps we can at
least start with an RFC for the simplest scenario and prove its validity.

Regards,
Denis Plotnikov July 16, 2018, 3 p.m. UTC | #6
On 13.07.2018 08:20, Peter Xu wrote:
> On Fri, Jun 29, 2018 at 11:03:13AM +0300, Denis Plotnikov wrote:
>> The patch set adds the ability to make external snapshots while VM is running.
>>
>> The workflow to make a snapshot is the following:
>> 1. Pause the vm
>> 2. Make a snapshot of block devices using the scheme of your choice
>> 3. Turn on background-snapshot migration capability
>> 4. Start the migration using the destination (migration stream) of your choice.
>>     The migration will resume the vm execution by itself
>>     when it has the devices' states saved  and is ready to start ram writing
>>     to the migration stream.
>> 5. Listen to the migration finish event
>>
>> The feature relies on KVM unapplied ability to report the faulting address.
>> Please find the KVM patch snippet to make the patchset work below:
>>
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -XXXX,X +XXXX,XX @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>>   
>>          vcpu->arch.exit_qualification = exit_qualification;
>>   
>> -       return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
>> +       r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
>> +        if (r == -EFAULT) {
>> +               unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gpa >> PAGE_SHIFT);
>> +
>> +               vcpu->run->exit_reason = KVM_EXIT_FAIL_MEM_ACCESS;
>> +               vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_VIOLATION;
>> +               vcpu->run->fail_mem_access.hva = hva | (gpa & (PAGE_SIZE-1));
>> +               r = 0;
>> +
>> +       }
>> +       return r;
>>
>> The patch to KVM can be sent if the patch set approved
> 
> Hi, Denis,
> 
> If the work will definitely require KVM to cooperate, AFAIU the thing
> we normally do is that we first propose the kernel counterpart on kvm
> list, then it'll be easier to review the QEMU counterpart (or, propose
> both kvm/qemu changes at the same time, always the QEMU changes can be
> RFC, as a reference to prove the kvm change is valid and useful).  Not
> sure whether you should do this as well for this live snapshot work.
> 
> Since we might have two backends in the future, my major question for
> that counterpart series would be whether we need to support both in
> the future (mprotect, and userfaultfd), and the differences between
> the two methods from kernel's point of view.  I would vaguely guess
> that we can at least firstly have mprotect work then userfaultfd then
> we can automatically choose the backend when both are provided, but I
> guess that discussion might still better happen on the kvm list.  Also
> I would also guess that in that work you'd better consider no-ept case
> as well for Intel, even for AMD.  But not sure we can at least start a
> RFC with the simplest scenario and prove its validity.
> 
> Regards,
> 
Hi, Peter,
I think it is a good idea to go through the KVM path first.
When the discussion comes to some conclusion, further steps may become
clearer.
I'll send the patch there shortly to start the discussion.

Thanks!

Best, Denis
Peter Xu July 25, 2018, 10:18 a.m. UTC | #7
On Fri, Jun 29, 2018 at 12:53:59PM +0100, Dr. David Alan Gilbert wrote:
> * Denis Plotnikov (dplotnikov@virtuozzo.com) wrote:
> > The patch set adds the ability to make external snapshots while VM is running.
> 
> cc'ing in Andrea since this uses sigsegv's to avoid userfault-wp that
> isn't there yet.
> 
> Hi Denis,
>   How robust are you finding this SEGV based trick; for example what
> about things like the kernel walking vhost queues or similar kernel
> nasties?

(I'm commenting on this old series to keep the discussion together)

If we want to make this series really work for people, we probably
need to know whether it can work with vhost (otherwise we
might need to go back to userfaultfd write-protection).

I dug a bit into the vhost-net IO; it should be using two ways to
write to guest memory:

- copy_to_user(): this should possibly still be able to be captured by
  mprotect() (after some confirmation from Paolo, but still we'd
  better try it out)

- kmap_atomic(): this is used to log dirty pages; it seems to be
  incompatible with mprotect(), but I think we can just make sure dirty
  page tracking is disabled when live snapshot is enabled (please
  refer to the code clip below as a reference)

So IMHO this series should work with vhost with logging off, but we
need to confirm...  Denis, do you want to do some more tests with vhost
when you do your next post (possibly with changes like the ones below)?

=====================

diff --git a/migration/ram.c b/migration/ram.c
index 24dea2730c..a3fa256143 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1605,6 +1605,10 @@ static void migration_bitmap_sync(RAMState *rs)
     int64_t end_time;
     uint64_t bytes_xfer_now;
 
+    if (live_snapshot) {
+        return;
+    }
+
     ram_counters.dirty_sync_count++;
 
     if (!rs->time_last_bitmap_sync) {
@@ -2952,7 +2956,10 @@ static void ram_init_bitmaps(RAMState *rs)
     rcu_read_lock();
 
     ram_list_init_bitmaps();
-    memory_global_dirty_log_start();
+
+    if (!live_snapshot) {
+        memory_global_dirty_log_start();
+    }
     migration_bitmap_sync(rs);
 
     rcu_read_unlock();


> 
> Dave
> 
> > The workflow to make a snapshot is the following:
> > 1. Pause the vm
> > 2. Make a snapshot of block devices using the scheme of your choice
> > 3. Turn on background-snapshot migration capability
> > 4. Start the migration using the destination (migration stream) of your choice.
> >    The migration will resume the vm execution by itself
> >    when it has the devices' states saved  and is ready to start ram writing
> >    to the migration stream.
> > 5. Listen to the migration finish event
> > 
> > The feature relies on KVM unapplied ability to report the faulting address.
> > Please find the KVM patch snippet to make the patchset work below:
> > 
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -XXXX,X +XXXX,XX @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> >  
> >         vcpu->arch.exit_qualification = exit_qualification;
> >  
> > -       return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> > +       r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> > +        if (r == -EFAULT) {
> > +               unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gpa >> PAGE_SHIFT);
> > +
> > +               vcpu->run->exit_reason = KVM_EXIT_FAIL_MEM_ACCESS;
> > +               vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_VIOLATION;
> > +               vcpu->run->fail_mem_access.hva = hva | (gpa & (PAGE_SIZE-1));
> > +               r = 0;
> > +
> > +       }
> > +       return r;
> > 
> > The patch to KVM can be sent if the patch set approved
> > 
> > Denis Plotnikov (7):
> >   migration: add background snapshot capability
> >   bitops: add some atomic versions of bitmap operations
> >   threads: add infrastructure to process sigsegv
> >   migration: add background snapshot infrastructure
> >   kvm: add failed memory access exit reason
> >   kvm: add vCPU failed memory access processing
> >   migration: add background snapshotting
> > 
> >  include/exec/ram_addr.h   |   7 +
> >  include/exec/ramlist.h    |   4 +-
> >  include/qemu/bitops.h     |  24 +++
> >  include/qemu/thread.h     |   5 +
> >  linux-headers/linux/kvm.h |   5 +
> >  migration/migration.c     | 141 +++++++++++++++-
> >  migration/migration.h     |   1 +
> >  migration/ram.c           | 333 ++++++++++++++++++++++++++++++++++++--
> >  migration/ram.h           |  11 +-
> >  migration/savevm.c        |  91 ++++++-----
> >  migration/savevm.h        |   2 +
> >  qapi/migration.json       |   6 +-
> >  target/i386/kvm.c         |  18 +++
> >  util/qemu-thread-posix.c  |  50 ++++++
> >  14 files changed, 635 insertions(+), 63 deletions(-)
> > 
> > -- 
> > 2.17.0
> > 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> 

Regards,
Dr. David Alan Gilbert July 25, 2018, 7:17 p.m. UTC | #8
* Peter Xu (peterx@redhat.com) wrote:
> On Fri, Jun 29, 2018 at 12:53:59PM +0100, Dr. David Alan Gilbert wrote:
> > * Denis Plotnikov (dplotnikov@virtuozzo.com) wrote:
> > > The patch set adds the ability to make external snapshots while VM is running.
> > 
> > cc'ing in Andrea since this uses sigsegv's to avoid userfault-wp that
> > isn't there yet.
> > 
> > Hi Denis,
> >   How robust are you finding this SEGV based trick; for example what
> > about things like the kernel walking vhost queues or similar kernel
> > nasties?
> 
> (I'm commenting on this old series to keep the discussion together)
> 
> If we want to make this series really work for people, we should
> possibly need to know whether it could work with vhost (otherwise we
> might need to go back to userfaultfd write-protection).
> 
> I digged a bit on the vhost-net IO, it should be using two ways to
> write to guest memory:
> 
> - copy_to_user(): this should possibly still be able to be captured by
>   mprotect() (after some confirmation from Paolo, but still we'd
>   better try it out)

What confuses me here is who is going to get the signal from this and
how we recover from the signal - or does it come back as an error
on the vhost fd somehow?

Dave

> - kmap_atomic(): this is used to log dirty pages, this seems to be
>   incompatible with mprotect() but I think we can just make sure dirty
>   page tracking is disabled when live snapshot is enabled (please
>   refer to the code clip below as a referece)
> 
> So IMHO this series should work with vhost if with logging off, but we
> need to confirm...  Denis, do you want to do some more test with vhost
> when you do your next post (possibly with changes like below)?
> 
> =====================
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 24dea2730c..a3fa256143 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1605,6 +1605,10 @@ static void migration_bitmap_sync(RAMState *rs)
>      int64_t end_time;
>      uint64_t bytes_xfer_now;
>  
> +    if (live_snapshot) {
> +        return;
> +    }
> +
>      ram_counters.dirty_sync_count++;
>  
>      if (!rs->time_last_bitmap_sync) {
> @@ -2952,7 +2956,10 @@ static void ram_init_bitmaps(RAMState *rs)
>      rcu_read_lock();
>  
>      ram_list_init_bitmaps();
> -    memory_global_dirty_log_start();
> +
> +    if (!live_snapshot) {
> +        memory_global_dirty_log_start();
> +    }
>      migration_bitmap_sync(rs);
>  
>      rcu_read_unlock();
> 
> 
> > 
> > Dave
> > 
> > > The workflow to make a snapshot is the following:
> > > 1. Pause the vm
> > > 2. Make a snapshot of block devices using the scheme of your choice
> > > 3. Turn on background-snapshot migration capability
> > > 4. Start the migration using the destination (migration stream) of your choice.
> > >    The migration will resume the vm execution by itself
> > >    when it has the devices' states saved  and is ready to start ram writing
> > >    to the migration stream.
> > > 5. Listen to the migration finish event
> > > 
> > > The feature relies on KVM unapplied ability to report the faulting address.
> > > Please find the KVM patch snippet to make the patchset work below:
> > > 
> > > +++ b/arch/x86/kvm/vmx.c
> > > @@ -XXXX,X +XXXX,XX @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
> > >  
> > >         vcpu->arch.exit_qualification = exit_qualification;
> > >  
> > > -       return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> > > +       r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> > > +        if (r == -EFAULT) {
> > > +               unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gpa >> PAGE_SHIFT);
> > > +
> > > +               vcpu->run->exit_reason = KVM_EXIT_FAIL_MEM_ACCESS;
> > > +               vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_VIOLATION;
> > > +               vcpu->run->fail_mem_access.hva = hva | (gpa & (PAGE_SIZE-1));
> > > +               r = 0;
> > > +
> > > +       }
> > > +       return r;
> > > 
> > > The patch to KVM can be sent if the patch set approved
> > > 
> > > Denis Plotnikov (7):
> > >   migration: add background snapshot capability
> > >   bitops: add some atomic versions of bitmap operations
> > >   threads: add infrastructure to process sigsegv
> > >   migration: add background snapshot infrastructure
> > >   kvm: add failed memory access exit reason
> > >   kvm: add vCPU failed memory access processing
> > >   migration: add background snapshotting
> > > 
> > >  include/exec/ram_addr.h   |   7 +
> > >  include/exec/ramlist.h    |   4 +-
> > >  include/qemu/bitops.h     |  24 +++
> > >  include/qemu/thread.h     |   5 +
> > >  linux-headers/linux/kvm.h |   5 +
> > >  migration/migration.c     | 141 +++++++++++++++-
> > >  migration/migration.h     |   1 +
> > >  migration/ram.c           | 333 ++++++++++++++++++++++++++++++++++++--
> > >  migration/ram.h           |  11 +-
> > >  migration/savevm.c        |  91 ++++++-----
> > >  migration/savevm.h        |   2 +
> > >  qapi/migration.json       |   6 +-
> > >  target/i386/kvm.c         |  18 +++
> > >  util/qemu-thread-posix.c  |  50 ++++++
> > >  14 files changed, 635 insertions(+), 63 deletions(-)
> > > 
> > > -- 
> > > 2.17.0
> > > 
> > --
> > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
> > 
> 
> Regards,
> 
> -- 
> Peter Xu
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Andrea Arcangeli July 25, 2018, 8:04 p.m. UTC | #9
On Wed, Jul 25, 2018 at 08:17:37PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > On Fri, Jun 29, 2018 at 12:53:59PM +0100, Dr. David Alan Gilbert wrote:
> > > * Denis Plotnikov (dplotnikov@virtuozzo.com) wrote:
> > > > The patch set adds the ability to make external snapshots while VM is running.
> > > 
> > > cc'ing in Andrea since this uses sigsegv's to avoid userfault-wp that
> > > isn't there yet.
> > > 
> > > Hi Denis,
> > >   How robust are you finding this SEGV based trick; for example what
> > > about things like the kernel walking vhost queues or similar kernel
> > > nasties?
> > 
> > (I'm commenting on this old series to keep the discussion together)
> > 
> > If we want to make this series really work for people, we should
> > possibly need to know whether it could work with vhost (otherwise we
> > might need to go back to userfaultfd write-protection).
> > 
> > I digged a bit on the vhost-net IO, it should be using two ways to
> > write to guest memory:
> > 
> > - copy_to_user(): this should possibly still be able to be captured by
> >   mprotect() (after some confirmation from Paolo, but still we'd
> >   better try it out)
> 
> What confuses me here is who is going to get the signal from this and
> how we recover from the signal - or does it come back as an error
> on the vhost fd somehow?

The problem is having to start handling all sigsegvs manually in
vhost-net by trapping copy_to_user returning less than the full buffer
size or put_user returning -EFAULT.

Those errors would need to be forwarded by vhost-net to qemu userland
so it can call mprotect after copying the data.

That's not conceptually different from having uffd-wp send the
message, except that uffd-wp requires zero changes to vhost-net and
every other piece of kernel code that may have to write to the
write-protected memory.
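
To make that concrete, under the sigsegv/mprotect model every such path would
need something along these lines (purely illustrative; vhost_report_wp_fault()
is a made-up hook, not an existing or proposed interface):

/* Illustration only of the kind of per-driver change being argued against:
 * detect the short copy and forward the faulting address to qemu so it can
 * save the page, unprotect it and let the copy be retried. */
static int vhost_copy_to_guest(struct vhost_virtqueue *vq,
                               void __user *dst, const void *src, size_t len)
{
        unsigned long left = copy_to_user(dst, src, len);

        if (left) {
                vhost_report_wp_fault(vq, (unsigned long)dst + (len - left));
                return -EFAULT;
        }
        return 0;
}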

It may look like the uffd-wp model is a wish-feature similar to an
optimization, but without the uffd-wp model, when the WP fault is
triggered by kernel code the sigsegv model falls apart and requires
all kinds of ad-hoc changes just for this single feature. Plus uffd-wp
has other benefits: it makes it all reliable in terms of not
increasing the number of vmas in use during the snapshot. Finally it
makes it faster too, with no mmap_sem for reading and no sigsegv
signals.

The non-cooperative features got merged first because there was much
activity on the kernel side on that front, but this is an ideal
time to nail down the remaining issues in uffd-wp, I think. That, I
believe, is time better spent than trying to emulate it with sigsegv
and changing all drivers to send new sigsegv-specific events down to
qemu. We considered this before doing uffd for
postcopy too, but overall it's unreliable and more work (not a single
change was needed to the KVM code with uffd to handle postcopy, and
here it should be the same).

Thanks,
Andrea
Paolo Bonzini July 26, 2018, 8:51 a.m. UTC | #10
On 25/07/2018 22:04, Andrea Arcangeli wrote:
> 
> It may look like the uffd-wp model is wish-feature similar to an
> optimization, but without the uffd-wp model when the WP fault is
> triggered by kernel code, the sigsegv model falls apart and requires
> all kind of ad-hoc changes just for this single feature. Plus uffd-wp
> has other benefits: it makes it all reliable in terms of not
> increasing the number of vmas in use during the snapshot. Finally it
> makes it faster too with no mmap_sem for reading and no sigsegv
> signals.
> 
> The non cooperative features got merged first because there was much
> activity on the kernel side on that front, but this is just an ideal
> time to nail down the remaining issues in uffd-wp I think. That I
> believe is time better spent than trying to emulate it with sigsegv
> and changing all drivers to send new events down to qemu specific to
> the sigsegv handling. We considered this before doing uffd for
> postcopy too but overall it's unreliable and more work (no single
> change was then needed to KVM code with uffd to handle postcopy and
> here it should be the same).

I totally agree.  The hard part in userfaultfd was the changes to the
kernel get_user_pages API, but the payback was huge because _all_ kernel
uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd.  Going
back to mprotect would be a huge mistake.

Paolo
Peter Xu July 26, 2018, 9:23 a.m. UTC | #11
On Thu, Jul 26, 2018 at 10:51:33AM +0200, Paolo Bonzini wrote:
> On 25/07/2018 22:04, Andrea Arcangeli wrote:
> > 
> > It may look like the uffd-wp model is wish-feature similar to an
> > optimization, but without the uffd-wp model when the WP fault is
> > triggered by kernel code, the sigsegv model falls apart and requires
> > all kind of ad-hoc changes just for this single feature. Plus uffd-wp
> > has other benefits: it makes it all reliable in terms of not
> > increasing the number of vmas in use during the snapshot. Finally it
> > makes it faster too with no mmap_sem for reading and no sigsegv
> > signals.
> > 
> > The non cooperative features got merged first because there was much
> > activity on the kernel side on that front, but this is just an ideal
> > time to nail down the remaining issues in uffd-wp I think. That I
> > believe is time better spent than trying to emulate it with sigsegv
> > and changing all drivers to send new events down to qemu specific to
> > the sigsegv handling. We considered this before doing uffd for
> > postcopy too but overall it's unreliable and more work (no single
> > change was then needed to KVM code with uffd to handle postcopy and
> > here it should be the same).
> 
> I totally agree.  The hard part in userfaultfd was the changes to the
> kernel get_user_pages API, but the payback was huge because _all_ kernel
> uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd.  Going
> back to mprotect would be a huge mistake.

Thanks for explaining the bits.  I'd say I wasn't aware of the
difference before I started the investigation (and only now have I
noticed that major difference between mprotect and userfaultfd).  I'm
really glad that it's much clearer (at least for me) which way we
should choose.

Now I'm thinking whether we can move the userfault write protect work
forward.  The latest discussion I saw so far is in 2016, when someone
from Huawei tried to use the write protect feature for that old
version of live snapshot but reported an issue:

  https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg01127.html

Is that the latest status for userfaultfd wr-protect?

If so, I'm thinking whether I can try to re-verify the work (I tried
his QEMU repository but failed to compile it somehow, so I plan to
write some even simpler code to try) to see whether I can get the same
KVM error he encountered.

Thoughts?

Regards,
Dr. David Alan Gilbert July 26, 2018, 3:13 p.m. UTC | #12
* Paolo Bonzini (pbonzini@redhat.com) wrote:
> On 25/07/2018 22:04, Andrea Arcangeli wrote:
> > 
> > It may look like the uffd-wp model is wish-feature similar to an
> > optimization, but without the uffd-wp model when the WP fault is
> > triggered by kernel code, the sigsegv model falls apart and requires
> > all kind of ad-hoc changes just for this single feature. Plus uffd-wp
> > has other benefits: it makes it all reliable in terms of not
> > increasing the number of vmas in use during the snapshot. Finally it
> > makes it faster too with no mmap_sem for reading and no sigsegv
> > signals.
> > 
> > The non cooperative features got merged first because there was much
> > activity on the kernel side on that front, but this is just an ideal
> > time to nail down the remaining issues in uffd-wp I think. That I
> > believe is time better spent than trying to emulate it with sigsegv
> > and changing all drivers to send new events down to qemu specific to
> > the sigsegv handling. We considered this before doing uffd for
> > postcopy too but overall it's unreliable and more work (no single
> > change was then needed to KVM code with uffd to handle postcopy and
> > here it should be the same).
> 
> I totally agree.  The hard part in userfaultfd was the changes to the
> kernel get_user_pages API, but the payback was huge because _all_ kernel
> uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd.  Going
> back to mprotect would be a huge mistake.

TBF I think Denis just wanted to get things moving, knowing that we've
not got userfault-wp yet;  what I'm curious about though is how Denis
has anything stable given that the faults can land in syscalls and
vhost etc.

Dave

> Paolo
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Denis Plotnikov Aug. 13, 2018, 12:55 p.m. UTC | #13
On 26.07.2018 12:23, Peter Xu wrote:
> On Thu, Jul 26, 2018 at 10:51:33AM +0200, Paolo Bonzini wrote:
>> On 25/07/2018 22:04, Andrea Arcangeli wrote:
>>>
>>> It may look like the uffd-wp model is wish-feature similar to an
>>> optimization, but without the uffd-wp model when the WP fault is
>>> triggered by kernel code, the sigsegv model falls apart and requires
>>> all kind of ad-hoc changes just for this single feature. Plus uffd-wp
>>> has other benefits: it makes it all reliable in terms of not
>>> increasing the number of vmas in use during the snapshot. Finally it
>>> makes it faster too with no mmap_sem for reading and no sigsegv
>>> signals.
>>>
>>> The non cooperative features got merged first because there was much
>>> activity on the kernel side on that front, but this is just an ideal
>>> time to nail down the remaining issues in uffd-wp I think. That I
>>> believe is time better spent than trying to emulate it with sigsegv
>>> and changing all drivers to send new events down to qemu specific to
>>> the sigsegv handling. We considered this before doing uffd for
>>> postcopy too but overall it's unreliable and more work (no single
>>> change was then needed to KVM code with uffd to handle postcopy and
>>> here it should be the same).
>>
>> I totally agree.  The hard part in userfaultfd was the changes to the
>> kernel get_user_pages API, but the payback was huge because _all_ kernel
>> uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd.  Going
>> back to mprotect would be a huge mistake.
> 
> Thanks for explaining the bits.  I'd say I wasn't aware of the
> difference before I started the investigation (and only until now I
> noticed that major difference between mprotect and userfaultfd).  I'm
> really glad that it's much clear (at least for me) on which way we
> should choose.
> 
> Now I'm thinking whether we can move the userfault write protect work
> forward.  The latest discussion I saw so far is in 2016, when someone
> from Huawei tried to use the write protect feature for that old
> version of live snapshot but reported issue:
> 
>    https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg01127.html
> 
> Is that the latest status for userfaultfd wr-protect?
> 
> If so, I'm thinking whether I can try to re-verify the work (I tried
> his QEMU repository but I failed to compile somehow, so I plan to
> write some even simpler code to try) to see whether I can get the same
> KVM error he encountered.
> 
> Thoughts?

Just to sum up all being said before.

Using mprotect is a bad idea because the VM's memory can be accessed from
a number of places (KVM, vhost, ...), each of which would need its own
special handling to track memory accesses and notify QEMU, which makes
using mprotect unacceptable.

Tracking accesses to protected memory can be done via userfaultfd's WP mode,
which isn't available right now.

So, the reasonable conclusion is to wait until the WP mode is available
and build the background snapshot on top of userfaultfd-wp.
But work on adding the WP mode has been pending for quite a long time
already.

Is there any way to estimate when it could be available?
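
For reference, once the WP mode lands, setting up the tracking would look
roughly like the sketch below. It follows the interface as it was being
proposed (UFFDIO_REGISTER_MODE_WP plus a UFFDIO_WRITEPROTECT ioctl), so the
exact names and flags should be treated as assumptions until it is merged:

/* Sketch of write-protect tracking via userfaultfd-wp; names follow the
 * proposed (unmerged) interface and are assumptions. Write faults arrive as
 * read()-able events on the returned fd; after saving a page, clearing the
 * protection on its range (mode = 0) also wakes the faulting thread. */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static int wp_protect_region(void *addr, unsigned long len)
{
    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    struct uffdio_api api = { .api = UFFD_API };
    struct uffdio_register reg = {
        .range = { .start = (unsigned long)addr, .len = len },
        .mode  = UFFDIO_REGISTER_MODE_WP,
    };
    struct uffdio_writeprotect wp = {
        .range = { .start = (unsigned long)addr, .len = len },
        .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
    };

    if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api) ||
        ioctl(uffd, UFFDIO_REGISTER, &reg) ||
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp)) {
        return -1;
    }
    return uffd;
}
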
> 
> Regards,
>
Dr. David Alan Gilbert Aug. 13, 2018, 7 p.m. UTC | #14
cc'ing in Mike*2
* Denis Plotnikov (dplotnikov@virtuozzo.com) wrote:
> 
> 
> On 26.07.2018 12:23, Peter Xu wrote:
> > On Thu, Jul 26, 2018 at 10:51:33AM +0200, Paolo Bonzini wrote:
> > > On 25/07/2018 22:04, Andrea Arcangeli wrote:
> > > > 
> > > > It may look like the uffd-wp model is wish-feature similar to an
> > > > optimization, but without the uffd-wp model when the WP fault is
> > > > triggered by kernel code, the sigsegv model falls apart and requires
> > > > all kind of ad-hoc changes just for this single feature. Plus uffd-wp
> > > > has other benefits: it makes it all reliable in terms of not
> > > > increasing the number of vmas in use during the snapshot. Finally it
> > > > makes it faster too with no mmap_sem for reading and no sigsegv
> > > > signals.
> > > > 
> > > > The non cooperative features got merged first because there was much
> > > > activity on the kernel side on that front, but this is just an ideal
> > > > time to nail down the remaining issues in uffd-wp I think. That I
> > > > believe is time better spent than trying to emulate it with sigsegv
> > > > and changing all drivers to send new events down to qemu specific to
> > > > the sigsegv handling. We considered this before doing uffd for
> > > > postcopy too but overall it's unreliable and more work (no single
> > > > change was then needed to KVM code with uffd to handle postcopy and
> > > > here it should be the same).
> > > 
> > > I totally agree.  The hard part in userfaultfd was the changes to the
> > > kernel get_user_pages API, but the payback was huge because _all_ kernel
> > > uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd.  Going
> > > back to mprotect would be a huge mistake.
> > 
> > Thanks for explaining the bits.  I'd say I wasn't aware of the
> > difference before I started the investigation (and only until now I
> > noticed that major difference between mprotect and userfaultfd).  I'm
> > really glad that it's much clear (at least for me) on which way we
> > should choose.
> > 
> > Now I'm thinking whether we can move the userfault write protect work
> > forward.  The latest discussion I saw so far is in 2016, when someone
> > from Huawei tried to use the write protect feature for that old
> > version of live snapshot but reported issue:
> > 
> >    https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg01127.html
> > 
> > Is that the latest status for userfaultfd wr-protect?
> > 
> > If so, I'm thinking whether I can try to re-verify the work (I tried
> > his QEMU repository but I failed to compile somehow, so I plan to
> > write some even simpler code to try) to see whether I can get the same
> > KVM error he encountered.
> > 
> > Thoughts?
> 
> Just to sum up all being said before.
> 
> Using mprotect is a bad idea because VM's memory can be accessed from the
> number of places (KVM, vhost, ...) which need their own special care
> of tracking memory accesses and notifying QEMU which makes the mprotect
> using unacceptable.
> 
> Protected memory accesses tracking can be done via userfaultfd's WP mode
> which isn't available right now.
> 
> So, the reasonable conclusion is to wait until the WP mode is available and
> build the background snapshot on top of userfaultfd-wp.
> But, works on adding the WP-mode is pending for a quite a long time already.
> 
> Is there any way to estimate when it could be available?

I think a question is whether anyone is actively working on it; I
suspect really it's on a TODO list rather than moving at the moment.

What I don't really understand is what stage the last version got up to.

Dave

> > 
> > Regards,
> > 
> 
> -- 
> Best,
> Denis
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Peter Xu Aug. 14, 2018, 5:45 a.m. UTC | #15
On Mon, Aug 13, 2018 at 08:00:19PM +0100, Dr. David Alan Gilbert wrote:
> cc'ing in Mike*2
> * Denis Plotnikov (dplotnikov@virtuozzo.com) wrote:
> > 
> > 
> > On 26.07.2018 12:23, Peter Xu wrote:
> > > On Thu, Jul 26, 2018 at 10:51:33AM +0200, Paolo Bonzini wrote:
> > > > On 25/07/2018 22:04, Andrea Arcangeli wrote:
> > > > > 
> > > > > It may look like the uffd-wp model is wish-feature similar to an
> > > > > optimization, but without the uffd-wp model when the WP fault is
> > > > > triggered by kernel code, the sigsegv model falls apart and requires
> > > > > all kind of ad-hoc changes just for this single feature. Plus uffd-wp
> > > > > has other benefits: it makes it all reliable in terms of not
> > > > > increasing the number of vmas in use during the snapshot. Finally it
> > > > > makes it faster too with no mmap_sem for reading and no sigsegv
> > > > > signals.
> > > > > 
> > > > > The non cooperative features got merged first because there was much
> > > > > activity on the kernel side on that front, but this is just an ideal
> > > > > time to nail down the remaining issues in uffd-wp I think. That I
> > > > > believe is time better spent than trying to emulate it with sigsegv
> > > > > and changing all drivers to send new events down to qemu specific to
> > > > > the sigsegv handling. We considered this before doing uffd for
> > > > > postcopy too but overall it's unreliable and more work (no single
> > > > > change was then needed to KVM code with uffd to handle postcopy and
> > > > > here it should be the same).
> > > > 
> > > > I totally agree.  The hard part in userfaultfd was the changes to the
> > > > kernel get_user_pages API, but the payback was huge because _all_ kernel
> > > > uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd.  Going
> > > > back to mprotect would be a huge mistake.
> > > 
> > > Thanks for explaining the bits.  I'd say I wasn't aware of the
> > > difference before I started the investigation (and only until now I
> > > noticed that major difference between mprotect and userfaultfd).  I'm
> > > really glad that it's much clear (at least for me) on which way we
> > > should choose.
> > > 
> > > Now I'm thinking whether we can move the userfault write protect work
> > > forward.  The latest discussion I saw so far is in 2016, when someone
> > > from Huawei tried to use the write protect feature for that old
> > > version of live snapshot but reported issue:
> > > 
> > >    https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg01127.html
> > > 
> > > Is that the latest status for userfaultfd wr-protect?
> > > 
> > > If so, I'm thinking whether I can try to re-verify the work (I tried
> > > his QEMU repository but I failed to compile somehow, so I plan to
> > > write some even simpler code to try) to see whether I can get the same
> > > KVM error he encountered.
> > > 
> > > Thoughts?
> > 
> > Just to sum up all being said before.
> > 
> > Using mprotect is a bad idea because VM's memory can be accessed from the
> > number of places (KVM, vhost, ...) which need their own special care
> > of tracking memory accesses and notifying QEMU which makes the mprotect
> > using unacceptable.
> > 
> > Protected memory accesses tracking can be done via userfaultfd's WP mode
> > which isn't available right now.
> > 
> > So, the reasonable conclusion is to wait until the WP mode is available and
> > build the background snapshot on top of userfaultfd-wp.
> > But, works on adding the WP-mode is pending for a quite a long time already.
> > 
> > Is there any way to estimate when it could be available?
> 
> I think a question is whether anyone is actively working on it; I
> suspect really it's on a TODO list rather than moving at the moment.
> 
> What I don't really understand is what stage the last version got upto.

I'm still testing that tree (though for some reason I haven't yet
continued... but I will; currently WP is still not working in my test).
My plan is that I'll try to dig into the WP work if it does not work
as expected (after I'm sure there's nothing wrong with the userspace),
though that of course won't be a trivial task.  I'll see how far I can
go with it.  Anyone who would like to help with that would be greatly
welcome too.

Regards,
Mike Rapoport Aug. 14, 2018, 6:13 a.m. UTC | #16
On Mon, Aug 13, 2018 at 08:00:19PM +0100, Dr. David Alan Gilbert wrote:
> cc'ing in Mike*2
> * Denis Plotnikov (dplotnikov@virtuozzo.com) wrote:
> > 
> > 
> > On 26.07.2018 12:23, Peter Xu wrote:
> > > On Thu, Jul 26, 2018 at 10:51:33AM +0200, Paolo Bonzini wrote:
> > > > On 25/07/2018 22:04, Andrea Arcangeli wrote:
> > > > > 
> > > > > It may look like the uffd-wp model is wish-feature similar to an
> > > > > optimization, but without the uffd-wp model when the WP fault is
> > > > > triggered by kernel code, the sigsegv model falls apart and requires
> > > > > all kind of ad-hoc changes just for this single feature. Plus uffd-wp
> > > > > has other benefits: it makes it all reliable in terms of not
> > > > > increasing the number of vmas in use during the snapshot. Finally it
> > > > > makes it faster too with no mmap_sem for reading and no sigsegv
> > > > > signals.
> > > > > 
> > > > > The non cooperative features got merged first because there was much
> > > > > activity on the kernel side on that front, but this is just an ideal
> > > > > time to nail down the remaining issues in uffd-wp I think. That I
> > > > > believe is time better spent than trying to emulate it with sigsegv
> > > > > and changing all drivers to send new events down to qemu specific to
> > > > > the sigsegv handling. We considered this before doing uffd for
> > > > > postcopy too but overall it's unreliable and more work (no single
> > > > > change was then needed to KVM code with uffd to handle postcopy and
> > > > > here it should be the same).
> > > > 
> > > > I totally agree.  The hard part in userfaultfd was the changes to the
> > > > kernel get_user_pages API, but the payback was huge because _all_ kernel
> > > > uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd.  Going
> > > > back to mprotect would be a huge mistake.
> > > 
> > > Thanks for explaining the bits.  I'd say I wasn't aware of the
> > > difference before I started the investigation (and only until now I
> > > noticed that major difference between mprotect and userfaultfd).  I'm
> > > really glad that it's much clear (at least for me) on which way we
> > > should choose.
> > > 
> > > Now I'm thinking whether we can move the userfault write protect work
> > > forward.  The latest discussion I saw so far is in 2016, when someone
> > > from Huawei tried to use the write protect feature for that old
> > > version of live snapshot but reported issue:
> > > 
> > >    https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg01127.html
> > > 
> > > Is that the latest status for userfaultfd wr-protect?
> > > 
> > > If so, I'm thinking whether I can try to re-verify the work (I tried
> > > his QEMU repository but I failed to compile somehow, so I plan to
> > > write some even simpler code to try) to see whether I can get the same
> > > KVM error he encountered.
> > > 
> > > Thoughts?
> > 
> > Just to sum up all being said before.
> > 
> > Using mprotect is a bad idea because VM's memory can be accessed from the
> > number of places (KVM, vhost, ...) which need their own special care
> > of tracking memory accesses and notifying QEMU which makes the mprotect
> > using unacceptable.
> > 
> > Protected memory accesses tracking can be done via userfaultfd's WP mode
> > which isn't available right now.
> > 
> > So, the reasonable conclusion is to wait until the WP mode is available and
> > build the background snapshot on top of userfaultfd-wp.
> > But, works on adding the WP-mode is pending for a quite a long time already.
> > 
> > Is there any way to estimate when it could be available?
> 
> I think a question is whether anyone is actively working on it; I
> suspect really it's on a TODO list rather than moving at the moment.

I thought Andrea was working on it :)
 
> What I don't really understand is what stage the last version got upto.
> 
> Dave
> 
> > > 
> > > Regards,
> > > 
> > 
> > -- 
> > Best,
> > Denis
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
Mike Kravetz Aug. 14, 2018, 11:16 p.m. UTC | #17
On 08/13/2018 12:00 PM, Dr. David Alan Gilbert wrote:
> cc'ing in Mike*2
> * Denis Plotnikov (dplotnikov@virtuozzo.com) wrote:
>>
>>
>> On 26.07.2018 12:23, Peter Xu wrote:
>>> On Thu, Jul 26, 2018 at 10:51:33AM +0200, Paolo Bonzini wrote:
>>>> On 25/07/2018 22:04, Andrea Arcangeli wrote:
>>>>>
>>>>> It may look like the uffd-wp model is wish-feature similar to an
>>>>> optimization, but without the uffd-wp model when the WP fault is
>>>>> triggered by kernel code, the sigsegv model falls apart and requires
>>>>> all kind of ad-hoc changes just for this single feature. Plus uffd-wp
>>>>> has other benefits: it makes it all reliable in terms of not
>>>>> increasing the number of vmas in use during the snapshot. Finally it
>>>>> makes it faster too with no mmap_sem for reading and no sigsegv
>>>>> signals.
>>>>>
>>>>> The non cooperative features got merged first because there was much
>>>>> activity on the kernel side on that front, but this is just an ideal
>>>>> time to nail down the remaining issues in uffd-wp I think. That I
>>>>> believe is time better spent than trying to emulate it with sigsegv
>>>>> and changing all drivers to send new events down to qemu specific to
>>>>> the sigsegv handling. We considered this before doing uffd for
>>>>> postcopy too but overall it's unreliable and more work (no single
>>>>> change was then needed to KVM code with uffd to handle postcopy and
>>>>> here it should be the same).
>>>>
>>>> I totally agree.  The hard part in userfaultfd was the changes to the
>>>> kernel get_user_pages API, but the payback was huge because _all_ kernel
>>>> uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd.  Going
>>>> back to mprotect would be a huge mistake.
>>>
>>> Thanks for explaining the bits.  I'd say I wasn't aware of the
>>> difference before I started the investigation (and only until now I
>>> noticed that major difference between mprotect and userfaultfd).  I'm
>>> really glad that it's much clear (at least for me) on which way we
>>> should choose.
>>>
>>> Now I'm thinking whether we can move the userfault write protect work
>>> forward.  The latest discussion I saw so far is in 2016, when someone
>>> from Huawei tried to use the write protect feature for that old
>>> version of live snapshot but reported issue:
>>>
>>>    https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg01127.html
>>>
>>> Is that the latest status for userfaultfd wr-protect?
>>>
>>> If so, I'm thinking whether I can try to re-verify the work (I tried
>>> his QEMU repository but I failed to compile somehow, so I plan to
>>> write some even simpler code to try) to see whether I can get the same
>>> KVM error he encountered.
>>>
>>> Thoughts?
>>
>> Just to sum up all being said before.
>>
>> Using mprotect is a bad idea because VM's memory can be accessed from the
>> number of places (KVM, vhost, ...) which need their own special care
>> of tracking memory accesses and notifying QEMU which makes the mprotect
>> using unacceptable.
>>
>> Protected memory accesses tracking can be done via userfaultfd's WP mode
>> which isn't available right now.
>>
>> So, the reasonable conclusion is to wait until the WP mode is available and
>> build the background snapshot on top of userfaultfd-wp.
>> But, works on adding the WP-mode is pending for a quite a long time already.
>>
>> Is there any way to estimate when it could be available?
> 
> I think a question is whether anyone is actively working on it; I
> suspect really it's on a TODO list rather than moving at the moment.
> 

I am not working on it, and it is not on my TODO list.

However, if someone starts making progress I will jump in and work on
hugetlbfs support.  My intention would be to not let hugetlbfs support
'fall behind' general uffd support.