Message ID: 20180629080320.320144-1-dplotnikov@virtuozzo.com
Series: Background snapshots
* Denis Plotnikov (dplotnikov@virtuozzo.com) wrote:
> The patch set adds the ability to make external snapshots while the VM is running.

cc'ing in Andrea since this uses SIGSEGVs to avoid userfault-wp, which isn't
there yet.

Hi Denis,
  How robust are you finding this SEGV-based trick; for example, what about
things like the kernel walking vhost queues or similar kernel nasties?

Dave

> The workflow to make a snapshot is the following:
> 1. Pause the vm
> 2. Make a snapshot of block devices using the scheme of your choice
> 3. Turn on background-snapshot migration capability
> 4. Start the migration using the destination (migration stream) of your choice.
>    The migration will resume the vm execution by itself
>    when it has the devices' states saved and is ready to start ram writing
>    to the migration stream.
> 5. Listen to the migration finish event
>
> The feature relies on an as-yet-unapplied KVM ability to report the faulting
> address. Please find the KVM patch snippet to make the patchset work below:
>
> +++ b/arch/x86/kvm/vmx.c
> @@ -XXXX,X +XXXX,XX @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>
>  	vcpu->arch.exit_qualification = exit_qualification;
>
> -	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> +	r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> +	if (r == -EFAULT) {
> +		unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gpa >> PAGE_SHIFT);
> +
> +		vcpu->run->exit_reason = KVM_EXIT_FAIL_MEM_ACCESS;
> +		vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_VIOLATION;
> +		vcpu->run->fail_mem_access.hva = hva | (gpa & (PAGE_SIZE-1));
> +		r = 0;
> +	}
> +	return r;
>
> The patch to KVM can be sent if the patch set is approved
>
> Denis Plotnikov (7):
>   migration: add background snapshot capability
>   bitops: add some atomic versions of bitmap operations
>   threads: add infrastructure to process sigsegv
>   migration: add background snapshot infrastructure
>   kvm: add failed memory access exit reason
>   kvm: add vCPU failed memory access processing
>   migration: add background snapshotting
>
>  include/exec/ram_addr.h   |   7 +
>  include/exec/ramlist.h    |   4 +-
>  include/qemu/bitops.h     |  24 +++
>  include/qemu/thread.h     |   5 +
>  linux-headers/linux/kvm.h |   5 +
>  migration/migration.c     | 141 +++++++++++++++-
>  migration/migration.h     |   1 +
>  migration/ram.c           | 333 ++++++++++++++++++++++++++++++++++++--
>  migration/ram.h           |  11 +-
>  migration/savevm.c        |  91 ++++++-----
>  migration/savevm.h        |   2 +
>  qapi/migration.json       |   6 +-
>  target/i386/kvm.c         |  18 +++
>  util/qemu-thread-posix.c  |  50 ++++++
>  14 files changed, 635 insertions(+), 63 deletions(-)
>
> --
> 2.17.0

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Fri, Jun 29, 2018 at 11:03:13AM +0300, Denis Plotnikov wrote:
> The patch set adds the ability to make external snapshots while the VM is running.

Hi, Denis,

This work is interesting, though I have a few general questions below.

> The workflow to make a snapshot is the following:
> 1. Pause the vm
> 2. Make a snapshot of block devices using the scheme of your choice

Here you explicitly took the snapshot of the block device, then...

> 3. Turn on background-snapshot migration capability
> 4. Start the migration using the destination (migration stream) of your choice.

... here you started the VM snapshot. How did you make sure that the VM
snapshot (e.g., the RAM data) and the block snapshot will be aligned?

For example, in the current save_snapshot() we quiesce disk IO before
migrating the last pieces of RAM data to make sure they are aligned. I
couldn't figure out how that's done in this work.

>    The migration will resume the vm execution by itself
>    when it has the devices' states saved and is ready to start ram writing
>    to the migration stream.
> 5. Listen to the migration finish event
>
> The feature relies on an as-yet-unapplied KVM ability to report the faulting
> address. Please find the KVM patch snippet to make the patchset work below:
>
> +++ b/arch/x86/kvm/vmx.c
> @@ -XXXX,X +XXXX,XX @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>
>  	vcpu->arch.exit_qualification = exit_qualification;
>
> -	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> +	r = kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> +	if (r == -EFAULT) {
> +		unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gpa >> PAGE_SHIFT);
> +
> +		vcpu->run->exit_reason = KVM_EXIT_FAIL_MEM_ACCESS;
> +		vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_VIOLATION;
> +		vcpu->run->fail_mem_access.hva = hva | (gpa & (PAGE_SIZE-1));
> +		r = 0;
> +	}
> +	return r;

Just to make sure I fully understand: is this some extra KVM work just to
make sure the mprotect() trick will work even for KVM vCPU threads?

Meanwhile, I see that you only modified the EPT violation code; what about
legacy hardware and the softmmu case?

Thanks,

> [...]
On 02.07.2018 14:23, Peter Xu wrote:
> On Fri, Jun 29, 2018 at 11:03:13AM +0300, Denis Plotnikov wrote:
>> The patch set adds the ability to make external snapshots while the VM is running.
>
> Hi, Denis,
>
> This work is interesting, though I have a few general questions below.
>
>> The workflow to make a snapshot is the following:
>> 1. Pause the vm
>> 2. Make a snapshot of block devices using the scheme of your choice
>
> Here you explicitly took the snapshot of the block device, then...
>
>> 3. Turn on background-snapshot migration capability
>> 4. Start the migration using the destination (migration stream) of your choice.
>
> ... here you started the VM snapshot. How did you make sure that the VM
> snapshot (e.g., the RAM data) and the block snapshot will be aligned?

As the VM has been paused before the image (disk) snapshot is made, no
requests go to the original image from then on; all later requests go to the
disk snapshot.

At that point we have a disk image and its snapshot. The image holds a
checkpointed state which won't (shouldn't) change, because all write requests
go to the image snapshot.

Then we start the background snapshot, which marks all the memory as
read-only and writes the devices' part of the VM state to the VM snapshot
file. By making the memory read-only we effectively freeze the state of the
RAM. At that point we have the original image and the VM memory content
corresponding to each other, because the VM isn't running.

Then the background snapshot thread resumes VM execution with the
read-only-marked memory, which is being written to the external VM snapshot
file. All write accesses to the memory are intercepted, and the accessed
pages are written to the VM snapshot (VM state) file with priority. Once a
page has been written out, it is marked read-write again and is no longer
tracked.

This is how we guarantee that the VM snapshot (state) file has the memory
content corresponding to the moment when the disk snapshot was created.

When the writing ends, we have a VM snapshot (VM state) file holding the
memory content as of the moment of the image snapshot's creation.

So, to restore the VM from "the snapshot" we need to use the original disk
image (not the disk snapshot) and the VM snapshot (VM state with saved
memory) file.

> For example, in the current save_snapshot() we quiesce disk IO before
> migrating the last pieces of RAM data to make sure they are aligned. I
> couldn't figure out how that's done in this work.
>
> [...]
>
> Just to make sure I fully understand: is this some extra KVM work just to
> make sure the mprotect() trick will work even for KVM vCPU threads?

That's correct!

> Meanwhile, I see that you only modified the EPT violation code; what about
> legacy hardware and the softmmu case?

I didn't check thoroughly, but the scheme works in TCG mode.

Thanks,

Denis
On Mon, Jul 02, 2018 at 03:40:31PM +0300, Denis Plotnikov wrote:
> On 02.07.2018 14:23, Peter Xu wrote:
>> ... here you started the VM snapshot. How did you make sure that the VM
>> snapshot (e.g., the RAM data) and the block snapshot will be aligned?
>
> As the VM has been paused before the image (disk) snapshot is made, no
> requests go to the original image from then on; all later requests go to
> the disk snapshot.
>
> [...]
>
> So, to restore the VM from "the snapshot" we need to use the original disk
> image (not the disk snapshot) and the VM snapshot (VM state with saved
> memory) file.

My bad for not noticing the implication of vm_stop() being the first step.
Your explanation is clear. Thank you!

>> Just to make sure I fully understand: is this some extra KVM work just to
>> make sure the mprotect() trick will work even for KVM vCPU threads?
>
> That's correct!
>
>> Meanwhile, I see that you only modified the EPT violation code; what about
>> legacy hardware and the softmmu case?
>
> I didn't check thoroughly, but the scheme works in TCG mode.

Yeah, I guess TCG will work, since the SIGSEGV handler covers that case. I
meant the shadow MMU implementation in KVM when kvm_intel.ept=0 is set on
the host. But of course that's not a big deal for now, since it can be
discussed in the KVM counterpart of the work.

Meanwhile, considering that this series seems to provide a general framework
for live snapshots, this work is meaningful no matter what backend magic is
used (either mprotect, or userfaultfd in the future).

Thanks,
On Fri, Jun 29, 2018 at 11:03:13AM +0300, Denis Plotnikov wrote:
> The patch set adds the ability to make external snapshots while the VM is running.
>
> [... workflow and KVM snippet quoted ...]
>
> The patch to KVM can be sent if the patch set is approved

Hi, Denis,

If the work definitely requires KVM to cooperate, AFAIU what we normally do
is first propose the kernel counterpart on the kvm list; then it'll be easier
to review the QEMU counterpart (or propose both kvm/qemu changes at the same
time; the QEMU changes can always be an RFC, as a reference to prove the kvm
change is valid and useful). Not sure whether you should do this as well for
this live snapshot work.

Since we might have two backends in the future, my major question for that
counterpart series would be whether we need to support both (mprotect and
userfaultfd), and what the differences between the two methods are from the
kernel's point of view. I would vaguely guess that we can first make mprotect
work, then userfaultfd, and then automatically choose the backend when both
are provided, but that discussion might better happen on the kvm list. I
would also guess that in that work you'd better consider the no-EPT case as
well for Intel, and even AMD. But we can at least start with an RFC for the
simplest scenario and prove its validity.

Regards,
On 13.07.2018 08:20, Peter Xu wrote:
> On Fri, Jun 29, 2018 at 11:03:13AM +0300, Denis Plotnikov wrote:
>> The patch set adds the ability to make external snapshots while the VM is running.
>>
>> [...]
>
> If the work definitely requires KVM to cooperate, AFAIU what we normally do
> is first propose the kernel counterpart on the kvm list; then it'll be
> easier to review the QEMU counterpart. Not sure whether you should do this
> as well for this live snapshot work.
>
> Since we might have two backends in the future, my major question for that
> counterpart series would be whether we need to support both (mprotect and
> userfaultfd), and what the differences between the two methods are from the
> kernel's point of view.
>
> Regards,

Hi, Peter,

I think it is a good idea to go through the KVM path first. When the
discussion comes to some conclusion, the further steps may become clearer.
I'll send the patch there shortly to start the discussion.

Thanks!

Best,
Denis
On Fri, Jun 29, 2018 at 12:53:59PM +0100, Dr. David Alan Gilbert wrote:
> * Denis Plotnikov (dplotnikov@virtuozzo.com) wrote:
>> The patch set adds the ability to make external snapshots while the VM is running.
>
> cc'ing in Andrea since this uses SIGSEGVs to avoid userfault-wp, which
> isn't there yet.
>
> Hi Denis,
>   How robust are you finding this SEGV-based trick; for example, what about
> things like the kernel walking vhost queues or similar kernel nasties?

(I'm commenting on this old series to keep the discussion together.)

If we want to make this series really work for people, we need to know
whether it can work with vhost (otherwise we might need to go back to
userfaultfd write-protection).

I dug a bit into the vhost-net IO; it should be using two ways to write to
guest memory:

- copy_to_user(): this should possibly still be capturable by mprotect()
  (pending confirmation from Paolo, but we'd better try it out)

- kmap_atomic(): this is used to log dirty pages; it seems to be
  incompatible with mprotect(), but I think we can just make sure dirty page
  tracking is disabled when live snapshot is enabled (please refer to the
  code clip below as a reference)

So IMHO this series should work with vhost with logging off, but we need to
confirm... Denis, do you want to do some more testing with vhost for your
next post (possibly with changes like below)?

=====================

diff --git a/migration/ram.c b/migration/ram.c
index 24dea2730c..a3fa256143 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1605,6 +1605,10 @@ static void migration_bitmap_sync(RAMState *rs)
     int64_t end_time;
     uint64_t bytes_xfer_now;

+    if (live_snapshot) {
+        return;
+    }
+
     ram_counters.dirty_sync_count++;

     if (!rs->time_last_bitmap_sync) {
@@ -2952,7 +2956,10 @@ static void ram_init_bitmaps(RAMState *rs)
     rcu_read_lock();

     ram_list_init_bitmaps();
-    memory_global_dirty_log_start();
+
+    if (!live_snapshot) {
+        memory_global_dirty_log_start();
+    }
     migration_bitmap_sync(rs);

     rcu_read_unlock();

> [...]

Regards,
* Peter Xu (peterx@redhat.com) wrote:
> On Fri, Jun 29, 2018 at 12:53:59PM +0100, Dr. David Alan Gilbert wrote:
>> Hi Denis,
>>   How robust are you finding this SEGV-based trick; for example, what
>> about things like the kernel walking vhost queues or similar kernel
>> nasties?
>
> (I'm commenting on this old series to keep the discussion together.)
>
> If we want to make this series really work for people, we need to know
> whether it can work with vhost (otherwise we might need to go back to
> userfaultfd write-protection).
>
> I dug a bit into the vhost-net IO; it should be using two ways to write to
> guest memory:
>
> - copy_to_user(): this should possibly still be capturable by mprotect()
>   (pending confirmation from Paolo, but we'd better try it out)

What confuses me here is who is going to get the signal from this, and how
we recover from the signal; or does it come back as an error on the vhost fd
somehow?

Dave

> - kmap_atomic(): this is used to log dirty pages; it seems to be
>   incompatible with mprotect(), but I think we can just make sure dirty
>   page tracking is disabled when live snapshot is enabled
>
> So IMHO this series should work with vhost with logging off, but we need
> to confirm...
>
> [...]
>
> Regards,

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Wed, Jul 25, 2018 at 08:17:37PM +0100, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
>> I dug a bit into the vhost-net IO; it should be using two ways to write
>> to guest memory:
>>
>> - copy_to_user(): this should possibly still be capturable by mprotect()
>>   (pending confirmation from Paolo, but we'd better try it out)
>
> What confuses me here is who is going to get the signal from this, and how
> we recover from the signal; or does it come back as an error on the vhost
> fd somehow?

The problem is having to start handling every SIGSEGV manually in vhost-net,
by trapping copy_to_user() returning less than the full buffer size or
put_user() returning -EFAULT. Those errors would need to be forwarded by
vhost-net to QEMU userland, which would then call mprotect() after copying
the data. That's not conceptually different from having uffd-wp send the
message, except that uffd-wp requires zero changes to vhost-net and to every
other piece of kernel code that may have to write to the write-protected
memory.

It may look like the uffd-wp model is a wish-feature similar to an
optimization, but without it, when the WP fault is triggered by kernel code,
the SIGSEGV model falls apart and requires all kinds of ad-hoc changes just
for this single feature. Plus uffd-wp has other benefits: it makes
everything reliable in terms of not increasing the number of VMAs in use
during the snapshot. Finally, it is faster too, with no mmap_sem taken for
reading and no SIGSEGV signals.

The non-cooperative features got merged first because there was much
activity on the kernel side on that front, but this is an ideal time to nail
down the remaining issues in uffd-wp, I think. That, I believe, is time
better spent than trying to emulate it with SIGSEGV and changing all drivers
to send new events down to QEMU specific to the SIGSEGV handling. We
considered this before doing uffd for postcopy too, but overall it's
unreliable and more work (not a single change was needed to KVM code with
uffd to handle postcopy, and here it should be the same).

Thanks,
Andrea
On 25/07/2018 22:04, Andrea Arcangeli wrote: > > It may look like the uffd-wp model is wish-feature similar to an > optimization, but without the uffd-wp model when the WP fault is > triggered by kernel code, the sigsegv model falls apart and requires > all kind of ad-hoc changes just for this single feature. Plus uffd-wp > has other benefits: it makes it all reliable in terms of not > increasing the number of vmas in use during the snapshot. Finally it > makes it faster too with no mmap_sem for reading and no sigsegv > signals. > > The non cooperative features got merged first because there was much > activity on the kernel side on that front, but this is just an ideal > time to nail down the remaining issues in uffd-wp I think. That I > believe is time better spent than trying to emulate it with sigsegv > and changing all drivers to send new events down to qemu specific to > the sigsegv handling. We considered this before doing uffd for > postcopy too but overall it's unreliable and more work (no single > change was then needed to KVM code with uffd to handle postcopy and > here it should be the same). I totally agree. The hard part in userfaultfd was the changes to the kernel get_user_pages API, but the payback was huge because _all_ kernel uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd. Going back to mprotect would be a huge mistake. Paolo
On Thu, Jul 26, 2018 at 10:51:33AM +0200, Paolo Bonzini wrote: > On 25/07/2018 22:04, Andrea Arcangeli wrote: > > > > It may look like the uffd-wp model is wish-feature similar to an > > optimization, but without the uffd-wp model when the WP fault is > > triggered by kernel code, the sigsegv model falls apart and requires > > all kind of ad-hoc changes just for this single feature. Plus uffd-wp > > has other benefits: it makes it all reliable in terms of not > > increasing the number of vmas in use during the snapshot. Finally it > > makes it faster too with no mmap_sem for reading and no sigsegv > > signals. > > > > The non cooperative features got merged first because there was much > > activity on the kernel side on that front, but this is just an ideal > > time to nail down the remaining issues in uffd-wp I think. That I > > believe is time better spent than trying to emulate it with sigsegv > > and changing all drivers to send new events down to qemu specific to > > the sigsegv handling. We considered this before doing uffd for > > postcopy too but overall it's unreliable and more work (no single > > change was then needed to KVM code with uffd to handle postcopy and > > here it should be the same). > > I totally agree. The hard part in userfaultfd was the changes to the > kernel get_user_pages API, but the payback was huge because _all_ kernel > uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd. Going > back to mprotect would be a huge mistake. Thanks for explaining the bits. I'd say I wasn't aware of the difference before I started the investigation (only now have I noticed that major difference between mprotect and userfaultfd). I'm really glad that it's much clearer (at least for me) which way we should choose. Now I'm thinking whether we can move the userfault write protect work forward. 
The latest discussion I've seen so far is from 2016, when someone from Huawei tried to use the write protect feature for that old version of live snapshot but reported an issue: https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg01127.html Is that the latest status for userfaultfd wr-protect? If so, I'm thinking whether I can try to re-verify the work (I tried his QEMU repository but I failed to compile it somehow, so I plan to write some even simpler code to try) to see whether I can get the same KVM error he encountered. Thoughts? Regards,
* Paolo Bonzini (pbonzini@redhat.com) wrote: > On 25/07/2018 22:04, Andrea Arcangeli wrote: > > > > It may look like the uffd-wp model is wish-feature similar to an > > optimization, but without the uffd-wp model when the WP fault is > > triggered by kernel code, the sigsegv model falls apart and requires > > all kind of ad-hoc changes just for this single feature. Plus uffd-wp > > has other benefits: it makes it all reliable in terms of not > > increasing the number of vmas in use during the snapshot. Finally it > > makes it faster too with no mmap_sem for reading and no sigsegv > > signals. > > > > The non cooperative features got merged first because there was much > > activity on the kernel side on that front, but this is just an ideal > > time to nail down the remaining issues in uffd-wp I think. That I > > believe is time better spent than trying to emulate it with sigsegv > > and changing all drivers to send new events down to qemu specific to > > the sigsegv handling. We considered this before doing uffd for > > postcopy too but overall it's unreliable and more work (no single > > change was then needed to KVM code with uffd to handle postcopy and > > here it should be the same). > > I totally agree. The hard part in userfaultfd was the changes to the > kernel get_user_pages API, but the payback was huge because _all_ kernel > uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd. Going > back to mprotect would be a huge mistake. TBF I think Denis just wanted to get things moving, knowing that we've not got userfault-wp yet; what I'm curious about though is how Denis has anything stable given that the faults can land in syscalls and vhost etc. Dave > Paolo -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On 26.07.2018 12:23, Peter Xu wrote: > On Thu, Jul 26, 2018 at 10:51:33AM +0200, Paolo Bonzini wrote: >> On 25/07/2018 22:04, Andrea Arcangeli wrote: >>> >>> It may look like the uffd-wp model is wish-feature similar to an >>> optimization, but without the uffd-wp model when the WP fault is >>> triggered by kernel code, the sigsegv model falls apart and requires >>> all kind of ad-hoc changes just for this single feature. Plus uffd-wp >>> has other benefits: it makes it all reliable in terms of not >>> increasing the number of vmas in use during the snapshot. Finally it >>> makes it faster too with no mmap_sem for reading and no sigsegv >>> signals. >>> >>> The non cooperative features got merged first because there was much >>> activity on the kernel side on that front, but this is just an ideal >>> time to nail down the remaining issues in uffd-wp I think. That I >>> believe is time better spent than trying to emulate it with sigsegv >>> and changing all drivers to send new events down to qemu specific to >>> the sigsegv handling. We considered this before doing uffd for >>> postcopy too but overall it's unreliable and more work (no single >>> change was then needed to KVM code with uffd to handle postcopy and >>> here it should be the same). >> >> I totally agree. The hard part in userfaultfd was the changes to the >> kernel get_user_pages API, but the payback was huge because _all_ kernel >> uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd. Going >> back to mprotect would be a huge mistake. > > Thanks for explaining the bits. I'd say I wasn't aware of the > difference before I started the investigation (and only until now I > noticed that major difference between mprotect and userfaultfd). I'm > really glad that it's much clear (at least for me) on which way we > should choose. > > Now I'm thinking whether we can move the userfault write protect work > forward. 
The latest discussion I saw so far is in 2016, when someone > from Huawei tried to use the write protect feature for that old > version of live snapshot but reported issue: > > https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg01127.html > > Is that the latest status for userfaultfd wr-protect? > > If so, I'm thinking whether I can try to re-verify the work (I tried > his QEMU repository but I failed to compile somehow, so I plan to > write some even simpler code to try) to see whether I can get the same > KVM error he encountered. > > Thoughts? Just to sum up all that has been said before. Using mprotect is a bad idea because a VM's memory can be accessed from a number of places (KVM, vhost, ...), each of which would need its own special handling to track memory accesses and notify QEMU, which makes using mprotect unacceptable. Tracking accesses to write-protected memory can be done via userfaultfd's WP mode, which isn't available right now. So the reasonable conclusion is to wait until the WP mode is available and build the background snapshot on top of userfaultfd-wp. But work on adding the WP mode has been pending for quite a long time already. Is there any way to estimate when it could be available? > > Regards, >
cc'ing in Mike*2 * Denis Plotnikov (dplotnikov@virtuozzo.com) wrote: > > > On 26.07.2018 12:23, Peter Xu wrote: > > On Thu, Jul 26, 2018 at 10:51:33AM +0200, Paolo Bonzini wrote: > > > On 25/07/2018 22:04, Andrea Arcangeli wrote: > > > > > > > > It may look like the uffd-wp model is wish-feature similar to an > > > > optimization, but without the uffd-wp model when the WP fault is > > > > triggered by kernel code, the sigsegv model falls apart and requires > > > > all kind of ad-hoc changes just for this single feature. Plus uffd-wp > > > > has other benefits: it makes it all reliable in terms of not > > > > increasing the number of vmas in use during the snapshot. Finally it > > > > makes it faster too with no mmap_sem for reading and no sigsegv > > > > signals. > > > > > > > > The non cooperative features got merged first because there was much > > > > activity on the kernel side on that front, but this is just an ideal > > > > time to nail down the remaining issues in uffd-wp I think. That I > > > > believe is time better spent than trying to emulate it with sigsegv > > > > and changing all drivers to send new events down to qemu specific to > > > > the sigsegv handling. We considered this before doing uffd for > > > > postcopy too but overall it's unreliable and more work (no single > > > > change was then needed to KVM code with uffd to handle postcopy and > > > > here it should be the same). > > > > > > I totally agree. The hard part in userfaultfd was the changes to the > > > kernel get_user_pages API, but the payback was huge because _all_ kernel > > > uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd. Going > > > back to mprotect would be a huge mistake. > > > > Thanks for explaining the bits. I'd say I wasn't aware of the > > difference before I started the investigation (and only until now I > > noticed that major difference between mprotect and userfaultfd). 
I'm > > really glad that it's much clear (at least for me) on which way we > > should choose. > > > > Now I'm thinking whether we can move the userfault write protect work > > forward. The latest discussion I saw so far is in 2016, when someone > > from Huawei tried to use the write protect feature for that old > > version of live snapshot but reported issue: > > > > https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg01127.html > > > > Is that the latest status for userfaultfd wr-protect? > > > > If so, I'm thinking whether I can try to re-verify the work (I tried > > his QEMU repository but I failed to compile somehow, so I plan to > > write some even simpler code to try) to see whether I can get the same > > KVM error he encountered. > > > > Thoughts? > > Just to sum up all being said before. > > Using mprotect is a bad idea because VM's memory can be accessed from the > number of places (KVM, vhost, ...) which need their own special care > of tracking memory accesses and notifying QEMU which makes the mprotect > using unacceptable. > > Protected memory accesses tracking can be done via userfaultfd's WP mode > which isn't available right now. > > So, the reasonable conclusion is to wait until the WP mode is available and > build the background snapshot on top of userfaultfd-wp. > But, works on adding the WP-mode is pending for a quite a long time already. > > Is there any way to estimate when it could be available? I think a question is whether anyone is actively working on it; I suspect really it's on a TODO list rather than moving at the moment. What I don't really understand is what stage the last version got upto. Dave > > > > Regards, > > > > -- > Best, > Denis -- Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
On Mon, Aug 13, 2018 at 08:00:19PM +0100, Dr. David Alan Gilbert wrote: > cc'ing in Mike*2 > * Denis Plotnikov (dplotnikov@virtuozzo.com) wrote: > > > > > > On 26.07.2018 12:23, Peter Xu wrote: > > > On Thu, Jul 26, 2018 at 10:51:33AM +0200, Paolo Bonzini wrote: > > > > On 25/07/2018 22:04, Andrea Arcangeli wrote: > > > > > > > > > > It may look like the uffd-wp model is wish-feature similar to an > > > > > optimization, but without the uffd-wp model when the WP fault is > > > > > triggered by kernel code, the sigsegv model falls apart and requires > > > > > all kind of ad-hoc changes just for this single feature. Plus uffd-wp > > > > > has other benefits: it makes it all reliable in terms of not > > > > > increasing the number of vmas in use during the snapshot. Finally it > > > > > makes it faster too with no mmap_sem for reading and no sigsegv > > > > > signals. > > > > > > > > > > The non cooperative features got merged first because there was much > > > > > activity on the kernel side on that front, but this is just an ideal > > > > > time to nail down the remaining issues in uffd-wp I think. That I > > > > > believe is time better spent than trying to emulate it with sigsegv > > > > > and changing all drivers to send new events down to qemu specific to > > > > > the sigsegv handling. We considered this before doing uffd for > > > > > postcopy too but overall it's unreliable and more work (no single > > > > > change was then needed to KVM code with uffd to handle postcopy and > > > > > here it should be the same). > > > > > > > > I totally agree. The hard part in userfaultfd was the changes to the > > > > kernel get_user_pages API, but the payback was huge because _all_ kernel > > > > uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd. Going > > > > back to mprotect would be a huge mistake. > > > > > > Thanks for explaining the bits. 
I'd say I wasn't aware of the > > > difference before I started the investigation (and only until now I > > > noticed that major difference between mprotect and userfaultfd). I'm > > > really glad that it's much clear (at least for me) on which way we > > > should choose. > > > > > > Now I'm thinking whether we can move the userfault write protect work > > > forward. The latest discussion I saw so far is in 2016, when someone > > > from Huawei tried to use the write protect feature for that old > > > version of live snapshot but reported issue: > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg01127.html > > > > > > Is that the latest status for userfaultfd wr-protect? > > > > > > If so, I'm thinking whether I can try to re-verify the work (I tried > > > his QEMU repository but I failed to compile somehow, so I plan to > > > write some even simpler code to try) to see whether I can get the same > > > KVM error he encountered. > > > > > > Thoughts? > > > > Just to sum up all being said before. > > > > Using mprotect is a bad idea because VM's memory can be accessed from the > > number of places (KVM, vhost, ...) which need their own special care > > of tracking memory accesses and notifying QEMU which makes the mprotect > > using unacceptable. > > > > Protected memory accesses tracking can be done via userfaultfd's WP mode > > which isn't available right now. > > > > So, the reasonable conclusion is to wait until the WP mode is available and > > build the background snapshot on top of userfaultfd-wp. > > But, works on adding the WP-mode is pending for a quite a long time already. > > > > Is there any way to estimate when it could be available? > > I think a question is whether anyone is actively working on it; I > suspect really it's on a TODO list rather than moving at the moment. > > What I don't really understand is what stage the last version got upto. I'm still testing that tree (though due to some reason I haven't yet continued... 
but I will; currently WP is still not working in my test). My plan is to dig into the WP work if it does not work as expected (after I'm sure there's nothing wrong on the userspace side), though that of course won't be a trivial task. I'll see how far I can go with it. Anyone who would like to help with that would be greatly welcome too. Regards,
On Mon, Aug 13, 2018 at 08:00:19PM +0100, Dr. David Alan Gilbert wrote: > cc'ing in Mike*2 > * Denis Plotnikov (dplotnikov@virtuozzo.com) wrote: > > > > > > On 26.07.2018 12:23, Peter Xu wrote: > > > On Thu, Jul 26, 2018 at 10:51:33AM +0200, Paolo Bonzini wrote: > > > > On 25/07/2018 22:04, Andrea Arcangeli wrote: > > > > > > > > > > It may look like the uffd-wp model is wish-feature similar to an > > > > > optimization, but without the uffd-wp model when the WP fault is > > > > > triggered by kernel code, the sigsegv model falls apart and requires > > > > > all kind of ad-hoc changes just for this single feature. Plus uffd-wp > > > > > has other benefits: it makes it all reliable in terms of not > > > > > increasing the number of vmas in use during the snapshot. Finally it > > > > > makes it faster too with no mmap_sem for reading and no sigsegv > > > > > signals. > > > > > > > > > > The non cooperative features got merged first because there was much > > > > > activity on the kernel side on that front, but this is just an ideal > > > > > time to nail down the remaining issues in uffd-wp I think. That I > > > > > believe is time better spent than trying to emulate it with sigsegv > > > > > and changing all drivers to send new events down to qemu specific to > > > > > the sigsegv handling. We considered this before doing uffd for > > > > > postcopy too but overall it's unreliable and more work (no single > > > > > change was then needed to KVM code with uffd to handle postcopy and > > > > > here it should be the same). > > > > > > > > I totally agree. The hard part in userfaultfd was the changes to the > > > > kernel get_user_pages API, but the payback was huge because _all_ kernel > > > > uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd. Going > > > > back to mprotect would be a huge mistake. > > > > > > Thanks for explaining the bits. 
I'd say I wasn't aware of the > > > difference before I started the investigation (and only until now I > > > noticed that major difference between mprotect and userfaultfd). I'm > > > really glad that it's much clear (at least for me) on which way we > > > should choose. > > > > > > Now I'm thinking whether we can move the userfault write protect work > > > forward. The latest discussion I saw so far is in 2016, when someone > > > from Huawei tried to use the write protect feature for that old > > > version of live snapshot but reported issue: > > > > > > https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg01127.html > > > > > > Is that the latest status for userfaultfd wr-protect? > > > > > > If so, I'm thinking whether I can try to re-verify the work (I tried > > > his QEMU repository but I failed to compile somehow, so I plan to > > > write some even simpler code to try) to see whether I can get the same > > > KVM error he encountered. > > > > > > Thoughts? > > > > Just to sum up all being said before. > > > > Using mprotect is a bad idea because VM's memory can be accessed from the > > number of places (KVM, vhost, ...) which need their own special care > > of tracking memory accesses and notifying QEMU which makes the mprotect > > using unacceptable. > > > > Protected memory accesses tracking can be done via userfaultfd's WP mode > > which isn't available right now. > > > > So, the reasonable conclusion is to wait until the WP mode is available and > > build the background snapshot on top of userfaultfd-wp. > > But, works on adding the WP-mode is pending for a quite a long time already. > > > > Is there any way to estimate when it could be available? > > I think a question is whether anyone is actively working on it; I > suspect really it's on a TODO list rather than moving at the moment. I thought Andrea was working on it :) > What I don't really understand is what stage the last version got upto. 
> > Dave > > > > > > > Regards, > > > > > > > -- > > Best, > > Denis > -- > Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK >
On 08/13/2018 12:00 PM, Dr. David Alan Gilbert wrote: > cc'ing in Mike*2 > * Denis Plotnikov (dplotnikov@virtuozzo.com) wrote: >> >> >> On 26.07.2018 12:23, Peter Xu wrote: >>> On Thu, Jul 26, 2018 at 10:51:33AM +0200, Paolo Bonzini wrote: >>>> On 25/07/2018 22:04, Andrea Arcangeli wrote: >>>>> >>>>> It may look like the uffd-wp model is wish-feature similar to an >>>>> optimization, but without the uffd-wp model when the WP fault is >>>>> triggered by kernel code, the sigsegv model falls apart and requires >>>>> all kind of ad-hoc changes just for this single feature. Plus uffd-wp >>>>> has other benefits: it makes it all reliable in terms of not >>>>> increasing the number of vmas in use during the snapshot. Finally it >>>>> makes it faster too with no mmap_sem for reading and no sigsegv >>>>> signals. >>>>> >>>>> The non cooperative features got merged first because there was much >>>>> activity on the kernel side on that front, but this is just an ideal >>>>> time to nail down the remaining issues in uffd-wp I think. That I >>>>> believe is time better spent than trying to emulate it with sigsegv >>>>> and changing all drivers to send new events down to qemu specific to >>>>> the sigsegv handling. We considered this before doing uffd for >>>>> postcopy too but overall it's unreliable and more work (no single >>>>> change was then needed to KVM code with uffd to handle postcopy and >>>>> here it should be the same). >>>> >>>> I totally agree. The hard part in userfaultfd was the changes to the >>>> kernel get_user_pages API, but the payback was huge because _all_ kernel >>>> uses (KVM, vhost-net, syscalls, etc.) just work with userfaultfd. Going >>>> back to mprotect would be a huge mistake. >>> >>> Thanks for explaining the bits. I'd say I wasn't aware of the >>> difference before I started the investigation (and only until now I >>> noticed that major difference between mprotect and userfaultfd). 
I'm >>> really glad that it's much clear (at least for me) on which way we >>> should choose. >>> >>> Now I'm thinking whether we can move the userfault write protect work >>> forward. The latest discussion I saw so far is in 2016, when someone >>> from Huawei tried to use the write protect feature for that old >>> version of live snapshot but reported issue: >>> >>> https://lists.gnu.org/archive/html/qemu-devel/2016-12/msg01127.html >>> >>> Is that the latest status for userfaultfd wr-protect? >>> >>> If so, I'm thinking whether I can try to re-verify the work (I tried >>> his QEMU repository but I failed to compile somehow, so I plan to >>> write some even simpler code to try) to see whether I can get the same >>> KVM error he encountered. >>> >>> Thoughts? >> >> Just to sum up all being said before. >> >> Using mprotect is a bad idea because VM's memory can be accessed from the >> number of places (KVM, vhost, ...) which need their own special care >> of tracking memory accesses and notifying QEMU which makes the mprotect >> using unacceptable. >> >> Protected memory accesses tracking can be done via userfaultfd's WP mode >> which isn't available right now. >> >> So, the reasonable conclusion is to wait until the WP mode is available and >> build the background snapshot on top of userfaultfd-wp. >> But, works on adding the WP-mode is pending for a quite a long time already. >> >> Is there any way to estimate when it could be available? > > I think a question is whether anyone is actively working on it; I > suspect really it's on a TODO list rather than moving at the moment. > I am not working on it, and it is not on my TODO list. However, if someone starts making progress I will jump in and work on hugetlbfs support. My intention would be to not let hugetlbfs support 'fall behind' general uffd support.