Patchwork [RFC,00/20] Kemari for KVM v0.1

Submitter Yoshiaki Tamura
Date April 21, 2010, 5:57 a.m.
Message ID <1271829445-5328-1-git-send-email-tamura.yoshiaki@lab.ntt.co.jp>
Permalink /patch/50627/
State New

Comments

Yoshiaki Tamura - April 21, 2010, 5:57 a.m.
Hi all,

We have been implementing a prototype of Kemari for KVM, and we're sending
this message to share what we have now along with our TODO lists.  We hope to
get early feedback to keep us moving in the right direction.  Although the
advanced approaches in the TODO lists are fascinating, we would like to run
this project step by step while absorbing comments from the community.  The
current code is based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.

For those who are new to Kemari for KVM, please take a look at the
following RFC which we posted last year.

http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html

The transmission/transaction protocol and most of the control logic are
implemented in QEMU.  However, we needed a hack in KVM to prevent the rip from
advancing before the VMs are synchronized.  Some plumbing may also be needed on
the kernel side to guarantee replayability of certain events and instructions,
to integrate the RAS capabilities of newer x86 hardware with the HA stack, and
for optimization purposes.

Before going into details, we would like to show what Kemari looks like.  We
have prepared a demonstration video at the location below; even those who are
not interested in the code may want to take a look.
The demonstration scenario is:

1. Play with a guest VM that has virtio-blk and virtio-net.
# The guest image should be on NFS/SAN.
2. Start Kemari to synchronize the VM by running the following command in QEMU.
Just add the "-k" option to the usual migrate command.
migrate -d -k tcp:192.168.0.20:4444
3. Check the status by running "info migrate".
4. Go back to the VM and play the chess animation.
5. Kill the VM. (The VNC client also disappears.)
6. Press "c" to continue the VM on the other host.
7. Bring up the VNC client. (Sorry, it pops up outside of the video capture.)
8. Confirm that the chess animation ends and the browser works fine, then shut down.

http://www.osrg.net/kemari/download/kemari-kvm-fc11.mov

The repository contains all the patches we're sending with this message.  For
those who want to try it, pull the following repository.  When running
configure, please pass --enable-ft-mode.  You also need to apply the patch
attached at the end of this message to your KVM.

git://kemari.git.sourceforge.net/gitroot/kemari/kemari

In addition to the usual migration environment and command, add "-k" to run.

The patch set consists of the following components:

- bit-based dirty bitmap (I have posted v4 for upstream QEMU on April 20)
- writev() support for QEMUFile and FdMigrationState
- FT transaction sender/receiver
- event tap that triggers FT transactions
- virtio-blk and virtio-net support for the event tap

 Makefile.objs    |    1 +
 buffered_file.c  |    2 +-
 configure        |    8 +
 cpu-all.h        |  134 ++++++++++++++++-
 cutils.c         |   12 ++
 exec.c           |  127 +++++++++++++----
 ft_transaction.c |  423 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 ft_transaction.h |   57 ++++++++
 hw/hw.h          |   25 ++++
 hw/virtio-blk.c  |    2 +
 hw/virtio-net.c  |    2 +
 migration-exec.c |    2 +-
 migration-fd.c   |    2 +-
 migration-tcp.c  |   58 +++++++-
 migration-unix.c |    2 +-
 migration.c      |  146 ++++++++++++++++++-
 migration.h      |    8 +
 osdep.c          |   13 ++
 qemu-char.c      |   25 +++-
 qemu-common.h    |   21 +++
 qemu-kvm.c       |   26 ++--
 qemu-monitor.hx  |    7 +-
 qemu_socket.h    |    4 +
 savevm.c         |  264 ++++++++++++++++++++++++++++++----
 sysemu.h         |    3 +-
 vl.c             |  221 +++++++++++++++++++++++++---
 26 files changed, 1474 insertions(+), 121 deletions(-)
 create mode 100644 ft_transaction.c
 create mode 100644 ft_transaction.h

The rest of this message describes TODO lists grouped by each topic.
Dor Laor - April 22, 2010, 8:58 a.m.
On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
> Hi all,
>
> We have been implementing the prototype of Kemari for KVM, and we're sending
> this message to share what we have now and TODO lists.  Hopefully, we would like
> to get early feedback to keep us in the right direction.  Although advanced
> approaches in the TODO lists are fascinating, we would like to run this project
> step by step while absorbing comments from the community.  The current code is
> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>
> For those who are new to Kemari for KVM, please take a look at the
> following RFC which we posted last year.
>
> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>
> The transmission/transaction protocol, and most of the control logic is
> implemented in QEMU.  However, we needed a hack in KVM to prevent rip from
> proceeding before synchronizing VMs.  It may also need some plumbing in the
> kernel side to guarantee replayability of certain events and instructions,
> integrate the RAS capabilities of newer x86 hardware with the HA stack, as well
> as for optimization purposes, for example.

[snip]

>
> The rest of this message describes TODO lists grouped by each topic.
>
> === event tapping ===
>
> Event tapping is the core component of Kemari, and it decides on which event the
> primary should synchronize with the secondary.  The basic assumption here is
> that outgoing I/O operations are idempotent, which is usually true for disk I/O
> and reliable network protocols such as TCP.

IMO any type of network event should be stalled too. What if the VM runs a
non-TCP protocol, a packet the master node sent reached some remote client,
and the master failed before the sync to the slave?

[snip]


> === clock ===
>
> Since synchronizing the virtual machines every time the TSC is accessed would be
> prohibitive, the transmission of the TSC will be done lazily, which means
> delaying it until there is a non-TSC synchronization point arrives.

Why do you specifically care about TSC sync? When you sync the whole
IO model on a snapshot, it also synchronizes the TSC.

In general, can you please explain the 'algorithm' for continuous
snapshots (is that what you would like to do?):
A trivial one would be to:
  - do X online snapshots/sec
  - Stall all IO (disk/block) from the guest to the outside world
    until the previous snapshot reaches the slave.
  - Snapshots are made of
    - diff of dirty pages from the last snapshot
    - QEMU device model (+kvm's) diff from the last.
You can do 'light' snapshots in between to send dirty pages to reduce
snapshot time.

I wrote the above to serve as a reference for your comments so it will map
into my mind. Thanks, dor

>
> TODO:
>   - Synchronization of clock sources (need to intercept TSC reads, etc).
>
> === usability ===
>
> These are items that defines how users interact with Kemari.
>
> TODO:
>   - Kemarid daemon that takes care of the cluster management/monitoring
>     side of things.
>   - Some device emulators might need minor modifications to work well
>     with Kemari.  Use white(black)-listing to take the burden of
>     choosing the right device model off the users.
>
> === optimizations ===
>
> Although the big picture can be realized by completing the TODO list above, we
> need some optimizations/enhancements to make Kemari useful in real world, and
> these are items what needs to be done for that.
>
> TODO:
>   - SMP (for the sake of performance might need to implement a
>     synchronization protocol that can maintain two or more
>     synchronization points active at any given moment)
>   - VGA (leverage VNC's subtilting mechanism to identify fb pages that
>     are really dirty).
>
>
> Any comments/suggestions would be greatly appreciated.
>
> Thanks,
>
> Yoshi
>
> --
>
> Kemari starts synchronizing VMs when QEMU handles I/O requests.
> Without this patch VCPU state is already proceeded before
> synchronization, and after failover to the VM on the receiver, it
> hangs because of this.
>
> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
> ---
>   arch/x86/include/asm/kvm_host.h |    1 +
>   arch/x86/kvm/svm.c              |   11 ++++++++---
>   arch/x86/kvm/vmx.c              |   11 ++++++++---
>   arch/x86/kvm/x86.c              |    4 ++++
>   4 files changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 26c629a..7b8f514 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -227,6 +227,7 @@ struct kvm_pio_request {
>   	int in;
>   	int port;
>   	int size;
> +	bool lazy_skip;
>   };
>
>   /*
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index d04c7ad..e373245 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -1495,7 +1495,7 @@ static int io_interception(struct vcpu_svm *svm)
>   {
>   	struct kvm_vcpu *vcpu =&svm->vcpu;
>   	u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
> -	int size, in, string;
> +	int size, in, string, ret;
>   	unsigned port;
>
>   	++svm->vcpu.stat.io_exits;
> @@ -1507,9 +1507,14 @@ static int io_interception(struct vcpu_svm *svm)
>   	port = io_info>>  16;
>   	size = (io_info&  SVM_IOIO_SIZE_MASK)>>  SVM_IOIO_SIZE_SHIFT;
>   	svm->next_rip = svm->vmcb->control.exit_info_2;
> -	skip_emulated_instruction(&svm->vcpu);
>
> -	return kvm_fast_pio_out(vcpu, size, port);
> +	ret = kvm_fast_pio_out(vcpu, size, port);
> +	if (ret)
> +		skip_emulated_instruction(&svm->vcpu);
> +	else
> +		vcpu->arch.pio.lazy_skip = true;
> +
> +	return ret;
>   }
>
>   static int nmi_interception(struct vcpu_svm *svm)
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 41e63bb..09052d6 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -2975,7 +2975,7 @@ static int handle_triple_fault(struct kvm_vcpu *vcpu)
>   static int handle_io(struct kvm_vcpu *vcpu)
>   {
>   	unsigned long exit_qualification;
> -	int size, in, string;
> +	int size, in, string, ret;
>   	unsigned port;
>
>   	exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
> @@ -2989,9 +2989,14 @@ static int handle_io(struct kvm_vcpu *vcpu)
>
>   	port = exit_qualification>>  16;
>   	size = (exit_qualification&  7) + 1;
> -	skip_emulated_instruction(vcpu);
>
> -	return kvm_fast_pio_out(vcpu, size, port);
> +	ret = kvm_fast_pio_out(vcpu, size, port);
> +	if (ret)
> +		skip_emulated_instruction(vcpu);
> +	else
> +		vcpu->arch.pio.lazy_skip = true;
> +
> +	return ret;
>   }
>
>   static void
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index fd5c3d3..cc308d2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4544,6 +4544,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
>   	if (!irqchip_in_kernel(vcpu->kvm))
>   		kvm_set_cr8(vcpu, kvm_run->cr8);
>
> +	if (vcpu->arch.pio.lazy_skip)
> +		kvm_x86_ops->skip_emulated_instruction(vcpu);
> +	vcpu->arch.pio.lazy_skip = false;
> +
>   	if (vcpu->arch.pio.count || vcpu->mmio_needed ||
>   	    vcpu->arch.emulate_ctxt.restart) {
>   		if (vcpu->mmio_needed) {
Yoshiaki Tamura - April 22, 2010, 10:35 a.m.
Dor Laor wrote:
> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>> Hi all,
>>
>> We have been implementing the prototype of Kemari for KVM, and we're
>> sending
>> this message to share what we have now and TODO lists. Hopefully, we
>> would like
>> to get early feedback to keep us in the right direction. Although
>> advanced
>> approaches in the TODO lists are fascinating, we would like to run
>> this project
>> step by step while absorbing comments from the community. The current
>> code is
>> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>>
>> For those who are new to Kemari for KVM, please take a look at the
>> following RFC which we posted last year.
>>
>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>
>> The transmission/transaction protocol, and most of the control logic is
>> implemented in QEMU. However, we needed a hack in KVM to prevent rip from
>> proceeding before synchronizing VMs. It may also need some plumbing in
>> the
>> kernel side to guarantee replayability of certain events and
>> instructions,
>> integrate the RAS capabilities of newer x86 hardware with the HA
>> stack, as well
>> as for optimization purposes, for example.
>
> [ snap]
>
>>
>> The rest of this message describes TODO lists grouped by each topic.
>>
>> === event tapping ===
>>
>> Event tapping is the core component of Kemari, and it decides on which
>> event the
>> primary should synchronize with the secondary. The basic assumption
>> here is
>> that outgoing I/O operations are idempotent, which is usually true for
>> disk I/O
>> and reliable network protocols such as TCP.
>
> IMO any type of network even should be stalled too. What if the VM runs
> non tcp protocol and the packet that the master node sent reached some
> remote client and before the sync to the slave the master failed?

In the current implementation, it actually stalls any type of network traffic
that goes through virtio-net.

However, if the application is using an unreliable protocol, it should have
its own recovery mechanism, or it should be completely stateless.

> [snap]
>
>
>> === clock ===
>>
>> Since synchronizing the virtual machines every time the TSC is
>> accessed would be
>> prohibitive, the transmission of the TSC will be done lazily, which means
>> delaying it until there is a non-TSC synchronization point arrives.
>
> Why do you specifically care about the tsc sync? When you sync all the
> IO model on snapshot it also synchronizes the tsc.
>
> In general, can you please explain the 'algorithm' for continuous
> snapshots (is that what you like to do?):

Yes, of course.
Sorry for not being more informative.

> A trivial one would we to :
> - do X online snapshots/sec

I don't have good numbers that I can share right now.
Snapshots/sec depends on what kind of workload is running: if the guest is
almost idle, there will be no snapshots in 5 sec, while if the guest is
running I/O-intensive workloads (netperf or iozone, for example), there will
be about 50 snapshots/sec.

> - Stall all IO (disk/block) from the guest to the outside world
> until the previous snapshot reaches the slave.

Yes, it does.

> - Snapshots are made of

The full device model plus a diff of dirty pages from the last snapshot.

> - diff of dirty pages from last snapshot

This also depends on the workload.
For I/O-intensive workloads, dirty pages are usually fewer than 100.

> - Qemu device model (+kvm's) diff from last.

We're currently sending a full copy because we're completely reusing this part
of the existing live migration framework.

Last time we measured, it was about 13KB, but it varies with the QEMU version
used.

> You can do 'light' snapshots in between to send dirty pages to reduce
> snapshot time.

I agree.  That's one of the advanced topics we would like to try too.

> I wrote the above to serve a reference for your comments so it will map
> into my mind. Thanks, dor

Thank you for the guidance.
I hope this answers your question.

At the same time, I would also be happy if we could discuss how to implement
it.  In fact, we needed a hack to prevent the rip from proceeding in KVM,
which turned out not to be the best workaround.

Thanks,

Yoshi

>
>>
>> TODO:
>> - Synchronization of clock sources (need to intercept TSC reads, etc).
>>
>> === usability ===
>>
>> These are items that defines how users interact with Kemari.
>>
>> TODO:
>> - Kemarid daemon that takes care of the cluster management/monitoring
>> side of things.
>> - Some device emulators might need minor modifications to work well
>> with Kemari. Use white(black)-listing to take the burden of
>> choosing the right device model off the users.
>>
>> === optimizations ===
>>
>> Although the big picture can be realized by completing the TODO list
>> above, we
>> need some optimizations/enhancements to make Kemari useful in real
>> world, and
>> these are items what needs to be done for that.
>>
>> TODO:
>> - SMP (for the sake of performance might need to implement a
>> synchronization protocol that can maintain two or more
>> synchronization points active at any given moment)
>> - VGA (leverage VNC's subtilting mechanism to identify fb pages that
>> are really dirty).
>>
>>
>> Any comments/suggestions would be greatly appreciated.
>>
>> Thanks,
>>
>> Yoshi
>>
>> --
>>
>> Kemari starts synchronizing VMs when QEMU handles I/O requests.
>> Without this patch VCPU state is already proceeded before
>> synchronization, and after failover to the VM on the receiver, it
>> hangs because of this.
>>
>> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>> ---
>> arch/x86/include/asm/kvm_host.h | 1 +
>> arch/x86/kvm/svm.c | 11 ++++++++---
>> arch/x86/kvm/vmx.c | 11 ++++++++---
>> arch/x86/kvm/x86.c | 4 ++++
>> 4 files changed, 21 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h
>> b/arch/x86/include/asm/kvm_host.h
>> index 26c629a..7b8f514 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -227,6 +227,7 @@ struct kvm_pio_request {
>> int in;
>> int port;
>> int size;
>> + bool lazy_skip;
>> };
>>
>> /*
>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> index d04c7ad..e373245 100644
>> --- a/arch/x86/kvm/svm.c
>> +++ b/arch/x86/kvm/svm.c
>> @@ -1495,7 +1495,7 @@ static int io_interception(struct vcpu_svm *svm)
>> {
>> struct kvm_vcpu *vcpu =&svm->vcpu;
>> u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
>> - int size, in, string;
>> + int size, in, string, ret;
>> unsigned port;
>>
>> ++svm->vcpu.stat.io_exits;
>> @@ -1507,9 +1507,14 @@ static int io_interception(struct vcpu_svm *svm)
>> port = io_info>> 16;
>> size = (io_info& SVM_IOIO_SIZE_MASK)>> SVM_IOIO_SIZE_SHIFT;
>> svm->next_rip = svm->vmcb->control.exit_info_2;
>> - skip_emulated_instruction(&svm->vcpu);
>>
>> - return kvm_fast_pio_out(vcpu, size, port);
>> + ret = kvm_fast_pio_out(vcpu, size, port);
>> + if (ret)
>> + skip_emulated_instruction(&svm->vcpu);
>> + else
>> + vcpu->arch.pio.lazy_skip = true;
>> +
>> + return ret;
>> }
>>
>> static int nmi_interception(struct vcpu_svm *svm)
>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>> index 41e63bb..09052d6 100644
>> --- a/arch/x86/kvm/vmx.c
>> +++ b/arch/x86/kvm/vmx.c
>> @@ -2975,7 +2975,7 @@ static int handle_triple_fault(struct kvm_vcpu
>> *vcpu)
>> static int handle_io(struct kvm_vcpu *vcpu)
>> {
>> unsigned long exit_qualification;
>> - int size, in, string;
>> + int size, in, string, ret;
>> unsigned port;
>>
>> exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
>> @@ -2989,9 +2989,14 @@ static int handle_io(struct kvm_vcpu *vcpu)
>>
>> port = exit_qualification>> 16;
>> size = (exit_qualification& 7) + 1;
>> - skip_emulated_instruction(vcpu);
>>
>> - return kvm_fast_pio_out(vcpu, size, port);
>> + ret = kvm_fast_pio_out(vcpu, size, port);
>> + if (ret)
>> + skip_emulated_instruction(vcpu);
>> + else
>> + vcpu->arch.pio.lazy_skip = true;
>> +
>> + return ret;
>> }
>>
>> static void
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index fd5c3d3..cc308d2 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -4544,6 +4544,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu
>> *vcpu, struct kvm_run *kvm_run)
>> if (!irqchip_in_kernel(vcpu->kvm))
>> kvm_set_cr8(vcpu, kvm_run->cr8);
>>
>> + if (vcpu->arch.pio.lazy_skip)
>> + kvm_x86_ops->skip_emulated_instruction(vcpu);
>> + vcpu->arch.pio.lazy_skip = false;
>> +
>> if (vcpu->arch.pio.count || vcpu->mmio_needed ||
>> vcpu->arch.emulate_ctxt.restart) {
>> if (vcpu->mmio_needed) {
>
>
>
>
Takuya Yoshikawa - April 22, 2010, 11:36 a.m.
(2010/04/22 19:35), Yoshiaki Tamura wrote:

>
>> A trivial one would we to :
>> - do X online snapshots/sec
>
> I currently don't have good numbers that I can share right now.
> Snapshots/sec depends on what kind of workload is running, and if the
> guest was almost idle, there will be no snapshots in 5sec. On the other
> hand, if the guest was running I/O intensive workloads (netperf, iozone
> for example), there will be about 50 snapshots/sec.
>

50 is too small: this depends on the synchronization speed and does not
show how many snapshots we need, right?
Dor Laor - April 22, 2010, 12:19 p.m.
On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
> Dor Laor wrote:
>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>> Hi all,
>>>
>>> We have been implementing the prototype of Kemari for KVM, and we're
>>> sending
>>> this message to share what we have now and TODO lists. Hopefully, we
>>> would like
>>> to get early feedback to keep us in the right direction. Although
>>> advanced
>>> approaches in the TODO lists are fascinating, we would like to run
>>> this project
>>> step by step while absorbing comments from the community. The current
>>> code is
>>> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>>>
>>> For those who are new to Kemari for KVM, please take a look at the
>>> following RFC which we posted last year.
>>>
>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>>
>>> The transmission/transaction protocol, and most of the control logic is
>>> implemented in QEMU. However, we needed a hack in KVM to prevent rip
>>> from
>>> proceeding before synchronizing VMs. It may also need some plumbing in
>>> the
>>> kernel side to guarantee replayability of certain events and
>>> instructions,
>>> integrate the RAS capabilities of newer x86 hardware with the HA
>>> stack, as well
>>> as for optimization purposes, for example.
>>
>> [ snap]
>>
>>>
>>> The rest of this message describes TODO lists grouped by each topic.
>>>
>>> === event tapping ===
>>>
>>> Event tapping is the core component of Kemari, and it decides on which
>>> event the
>>> primary should synchronize with the secondary. The basic assumption
>>> here is
>>> that outgoing I/O operations are idempotent, which is usually true for
>>> disk I/O
>>> and reliable network protocols such as TCP.
>>
>> IMO any type of network even should be stalled too. What if the VM runs
>> non tcp protocol and the packet that the master node sent reached some
>> remote client and before the sync to the slave the master failed?
>
> In current implementation, it is actually stalling any type of network
> that goes through virtio-net.
>
> However, if the application was using unreliable protocols, it should
> have its own recovering mechanism, or it should be completely stateless.

Why do you treat TCP differently? You can damage the entire VM this way -
think of a DHCP request that was dropped at the moment you switched
between the master and the slave.


>
>> [snap]
>>
>>
>>> === clock ===
>>>
>>> Since synchronizing the virtual machines every time the TSC is
>>> accessed would be
>>> prohibitive, the transmission of the TSC will be done lazily, which
>>> means
>>> delaying it until there is a non-TSC synchronization point arrives.
>>
>> Why do you specifically care about the tsc sync? When you sync all the
>> IO model on snapshot it also synchronizes the tsc.

So, do you agree that an extra clock synchronization is not needed since 
it is done anyway as part of the live migration state sync?

>>
>> In general, can you please explain the 'algorithm' for continuous
>> snapshots (is that what you like to do?):
>
> Yes, of course.
> Sorry for being less informative.
>
>> A trivial one would we to :
>> - do X online snapshots/sec
>
> I currently don't have good numbers that I can share right now.
> Snapshots/sec depends on what kind of workload is running, and if the
> guest was almost idle, there will be no snapshots in 5sec. On the other
> hand, if the guest was running I/O intensive workloads (netperf, iozone
> for example), there will be about 50 snapshots/sec.
>
>> - Stall all IO (disk/block) from the guest to the outside world
>> until the previous snapshot reaches the slave.
>
> Yes, it does.
>
>> - Snapshots are made of
>
> Full device model + diff of dirty pages from the last snapshot.
>
>> - diff of dirty pages from last snapshot
>
> This also depends on the workload.
> In case of I/O intensive workloads, dirty pages are usually less than 100.

The hardest would be memory-intensive loads.
So 100 snapshots/sec means a latency of 10 msec, right?
(Not that that's a problem; with faster hw and IB you'll be able to get
much more.)

>
>> - Qemu device model (+kvm's) diff from last.
>
> We're currently sending full copy because we're completely reusing this
> part of existing live migration framework.
>
> Last time we measured, it was about 13KB.
> But it varies by which QEMU version is used.
>
>> You can do 'light' snapshots in between to send dirty pages to reduce
>> snapshot time.
>
> I agree. That's one of the advanced topic we would like to try too.
>
>> I wrote the above to serve a reference for your comments so it will map
>> into my mind. Thanks, dor
>
> Thank your for the guidance.
> I hope this answers to your question.
>
> At the same time, I would also be happy it we could discuss how to
> implement too. In fact, we needed a hack to prevent rip from proceeding
> in KVM, which turned out that it was not the best workaround.

There are brute-force solutions like:
- stop the guest until you send all of the snapshot to the remote (like
   standard live migration)
- stop + fork + cont the parent

Or mark the recent dirty pages that were not sent to the remote as
write-protected and copy them if touched.


>
> Thanks,
>
> Yoshi
>
>>
>>>
>>> TODO:
>>> - Synchronization of clock sources (need to intercept TSC reads, etc).
>>>
>>> === usability ===
>>>
>>> These are items that defines how users interact with Kemari.
>>>
>>> TODO:
>>> - Kemarid daemon that takes care of the cluster management/monitoring
>>> side of things.
>>> - Some device emulators might need minor modifications to work well
>>> with Kemari. Use white(black)-listing to take the burden of
>>> choosing the right device model off the users.
>>>
>>> === optimizations ===
>>>
>>> Although the big picture can be realized by completing the TODO list
>>> above, we
>>> need some optimizations/enhancements to make Kemari useful in real
>>> world, and
>>> these are items what needs to be done for that.
>>>
>>> TODO:
>>> - SMP (for the sake of performance might need to implement a
>>> synchronization protocol that can maintain two or more
>>> synchronization points active at any given moment)
>>> - VGA (leverage VNC's subtilting mechanism to identify fb pages that
>>> are really dirty).
>>>
>>>
>>> Any comments/suggestions would be greatly appreciated.
>>>
>>> Thanks,
>>>
>>> Yoshi
>>>
>>> --
>>>
>>> Kemari starts synchronizing VMs when QEMU handles I/O requests.
>>> Without this patch VCPU state is already proceeded before
>>> synchronization, and after failover to the VM on the receiver, it
>>> hangs because of this.
>>>
>>> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>>> ---
>>> arch/x86/include/asm/kvm_host.h | 1 +
>>> arch/x86/kvm/svm.c | 11 ++++++++---
>>> arch/x86/kvm/vmx.c | 11 ++++++++---
>>> arch/x86/kvm/x86.c | 4 ++++
>>> 4 files changed, 21 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/kvm_host.h
>>> b/arch/x86/include/asm/kvm_host.h
>>> index 26c629a..7b8f514 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>> @@ -227,6 +227,7 @@ struct kvm_pio_request {
>>> int in;
>>> int port;
>>> int size;
>>> + bool lazy_skip;
>>> };
>>>
>>> /*
>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>> index d04c7ad..e373245 100644
>>> --- a/arch/x86/kvm/svm.c
>>> +++ b/arch/x86/kvm/svm.c
>>> @@ -1495,7 +1495,7 @@ static int io_interception(struct vcpu_svm *svm)
>>> {
>>> struct kvm_vcpu *vcpu =&svm->vcpu;
>>> u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
>>> - int size, in, string;
>>> + int size, in, string, ret;
>>> unsigned port;
>>>
>>> ++svm->vcpu.stat.io_exits;
>>> @@ -1507,9 +1507,14 @@ static int io_interception(struct vcpu_svm *svm)
>>> port = io_info>> 16;
>>> size = (io_info& SVM_IOIO_SIZE_MASK)>> SVM_IOIO_SIZE_SHIFT;
>>> svm->next_rip = svm->vmcb->control.exit_info_2;
>>> - skip_emulated_instruction(&svm->vcpu);
>>>
>>> - return kvm_fast_pio_out(vcpu, size, port);
>>> + ret = kvm_fast_pio_out(vcpu, size, port);
>>> + if (ret)
>>> + skip_emulated_instruction(&svm->vcpu);
>>> + else
>>> + vcpu->arch.pio.lazy_skip = true;
>>> +
>>> + return ret;
>>> }
>>>
>>> static int nmi_interception(struct vcpu_svm *svm)
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index 41e63bb..09052d6 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -2975,7 +2975,7 @@ static int handle_triple_fault(struct kvm_vcpu
>>> *vcpu)
>>> static int handle_io(struct kvm_vcpu *vcpu)
>>> {
>>> unsigned long exit_qualification;
>>> - int size, in, string;
>>> + int size, in, string, ret;
>>> unsigned port;
>>>
>>> exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
>>> @@ -2989,9 +2989,14 @@ static int handle_io(struct kvm_vcpu *vcpu)
>>>
>>> port = exit_qualification>> 16;
>>> size = (exit_qualification& 7) + 1;
>>> - skip_emulated_instruction(vcpu);
>>>
>>> - return kvm_fast_pio_out(vcpu, size, port);
>>> + ret = kvm_fast_pio_out(vcpu, size, port);
>>> + if (ret)
>>> + skip_emulated_instruction(vcpu);
>>> + else
>>> + vcpu->arch.pio.lazy_skip = true;
>>> +
>>> + return ret;
>>> }
>>>
>>> static void
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index fd5c3d3..cc308d2 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -4544,6 +4544,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu
>>> *vcpu, struct kvm_run *kvm_run)
>>> if (!irqchip_in_kernel(vcpu->kvm))
>>> kvm_set_cr8(vcpu, kvm_run->cr8);
>>>
>>> + if (vcpu->arch.pio.lazy_skip)
>>> + kvm_x86_ops->skip_emulated_instruction(vcpu);
>>> + vcpu->arch.pio.lazy_skip = false;
>>> +
>>> if (vcpu->arch.pio.count || vcpu->mmio_needed ||
>>> vcpu->arch.emulate_ctxt.restart) {
>>> if (vcpu->mmio_needed) {
>>
>>
>>
>>
>
>
>
Yoshiaki Tamura - April 22, 2010, 12:35 p.m.
2010/4/22 Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>:
> (2010/04/22 19:35), Yoshiaki Tamura wrote:
>
>>
>>> A trivial one would we to :
>>> - do X online snapshots/sec
>>
>> I currently don't have good numbers that I can share right now.
>> Snapshots/sec depends on what kind of workload is running, and if the
>> guest was almost idle, there will be no snapshots in 5sec. On the other
>> hand, if the guest was running I/O intensive workloads (netperf, iozone
>> for example), there will be about 50 snapshots/sec.
>>
>
> 50 is too small: this depends on the synchronization speed and does not
> show how many snapshots we need, right?

No, it doesn't.
It's just example data that I measured before.
Yoshiaki Tamura - April 22, 2010, 1:16 p.m.
2010/4/22 Dor Laor <dlaor@redhat.com>:
> On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
>>
>> Dor Laor wrote:
>>>
>>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>>>
>>>> Hi all,
>>>>
>>>> We have been implementing the prototype of Kemari for KVM, and we're
>>>> sending
>>>> this message to share what we have now and TODO lists. Hopefully, we
>>>> would like
>>>> to get early feedback to keep us in the right direction. Although
>>>> advanced
>>>> approaches in the TODO lists are fascinating, we would like to run
>>>> this project
>>>> step by step while absorbing comments from the community. The current
>>>> code is
>>>> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>>>>
>>>> For those who are new to Kemari for KVM, please take a look at the
>>>> following RFC which we posted last year.
>>>>
>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>>>
>>>> The transmission/transaction protocol, and most of the control logic is
>>>> implemented in QEMU. However, we needed a hack in KVM to prevent rip
>>>> from
>>>> proceeding before synchronizing VMs. It may also need some plumbing in
>>>> the
>>>> kernel side to guarantee replayability of certain events and
>>>> instructions,
>>>> integrate the RAS capabilities of newer x86 hardware with the HA
>>>> stack, as well
>>>> as for optimization purposes, for example.
>>>
>>> [ snap]
>>>
>>>>
>>>> The rest of this message describes TODO lists grouped by each topic.
>>>>
>>>> === event tapping ===
>>>>
>>>> Event tapping is the core component of Kemari, and it decides on which
>>>> event the
>>>> primary should synchronize with the secondary. The basic assumption
>>>> here is
>>>> that outgoing I/O operations are idempotent, which is usually true for
>>>> disk I/O
>>>> and reliable network protocols such as TCP.
>>>
>>> IMO any type of network event should be stalled too. What if the VM runs
>>> a non-TCP protocol, and the packet that the master node sent reached some
>>> remote client, and before the sync to the slave the master failed?
>>
>> In current implementation, it is actually stalling any type of network
>> that goes through virtio-net.
>>
>> However, if the application was using unreliable protocols, it should
>> have its own recovering mechanism, or it should be completely stateless.
>
> Why do you treat tcp differently? You can damage the entire VM this way -
> think of a dhcp request that was dropped at the moment you switched between
> the master and the slave?

I'm not trying to say that we should treat TCP differently; it's just
more severe.
In the case of a DHCP request, the client would have a chance to retry
after failover, correct?
BTW, in the current implementation, it synchronizes before the DHCP ack is sent.
But in the case of TCP, once you send the ack to the client before the sync,
there is no way to recover.

>>> [snap]
>>>
>>>
>>>> === clock ===
>>>>
>>>> Since synchronizing the virtual machines every time the TSC is
>>>> accessed would be
>>>> prohibitive, the transmission of the TSC will be done lazily, which
>>>> means
>>>> delaying it until there is a non-TSC synchronization point arrives.
>>>
>>> Why do you specifically care about the tsc sync? When you sync all the
>>> IO model on snapshot it also synchronizes the tsc.
>
> So, do you agree that an extra clock synchronization is not needed since it
> is done anyway as part of the live migration state sync?

I agree that it's sent as part of the live migration.
What I wanted to say here is that this is not something for real-time
applications.
I usually get questions about whether this can guarantee fault tolerance
for real-time applications.

>>> In general, can you please explain the 'algorithm' for continuous
>>> snapshots (is that what you like to do?):
>>
>> Yes, of course.
>> Sorry for being less informative.
>>
>>> A trivial one would be to:
>>> - do X online snapshots/sec
>>
>> I currently don't have good numbers that I can share right now.
>> Snapshots/sec depends on what kind of workload is running, and if the
>> guest was almost idle, there will be no snapshots in 5sec. On the other
>> hand, if the guest was running I/O intensive workloads (netperf, iozone
>> for example), there will be about 50 snapshots/sec.
>>
>>> - Stall all IO (disk/block) from the guest to the outside world
>>> until the previous snapshot reaches the slave.
>>
>> Yes, it does.
>>
>>> - Snapshots are made of
>>
>> Full device model + diff of dirty pages from the last snapshot.
>>
>>> - diff of dirty pages from last snapshot
>>
>> This also depends on the workload.
>> In case of I/O intensive workloads, dirty pages are usually less than 100.
>
> The hardest would be memory intensive loads.
> So 100 snap/sec means latency of 10msec right?
> (not that it's not ok, with faster hw and IB you'll be able to get much
> more)

Doesn't 100 snap/sec mean the snapshot interval is 10 msec?
IIUC, to get the latency, you need: time to transfer the VM + time to
get a response from the receiver.

It's hard to say which load is the hardest.
A memory-intensive load, which doesn't generate I/O often, will suffer
from a long sync time at that moment, but will have chances to continue
its processing until the sync.
An I/O-intensive load, which doesn't dirty many pages, will suffer from
getting the VCPU stopped often, but its sync time is relatively shorter.
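A quick back-of-the-envelope sketch of the interval-vs-latency distinction above (the numbers are purely illustrative):

```c
#include <assert.h>

/* Illustrative arithmetic only: 100 snapshots/sec fixes the snapshot
 * *interval* at 10 ms, while the per-transaction *latency* is the VM
 * transfer time plus the round trip for the receiver's response. */

static int snapshot_interval_ms(int snaps_per_sec)
{
    return 1000 / snaps_per_sec;
}

static int transaction_latency_ms(int transfer_ms, int ack_rtt_ms)
{
    return transfer_ms + ack_rtt_ms;
}
```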

>>> - Qemu device model (+kvm's) diff from last.
>>
>> We're currently sending full copy because we're completely reusing this
>> part of existing live migration framework.
>>
>> Last time we measured, it was about 13KB.
>> But it varies by which QEMU version is used.
>>
>>> You can do 'light' snapshots in between to send dirty pages to reduce
>>> snapshot time.
>>
>> I agree. That's one of the advanced topics we would like to try too.
>>
>>> I wrote the above to serve a reference for your comments so it will map
>>> into my mind. Thanks, dor
>>
>> Thank you for the guidance.
>> I hope this answers your question.
>>
>> At the same time, I would also be happy if we could discuss how to
>> implement it too. In fact, we needed a hack to prevent rip from proceeding
>> in KVM, which turned out not to be the best workaround.
>
> There are brute force solutions like
> - stop the guest until you send all of the snapshot to the remote (like
>  standard live migration)

We've implemented it this way so far.

> - Stop + fork + cont the father
>
> Or mark the recent dirty pages that were not sent to the remote as write
> protected and copy them if touched.

I think I had that suggestion from Avi before.
And yes, it's very fascinating.

Meanwhile, if you look at the diffstat, it needed to touch many parts of QEMU.
Before going into further implementation, I wanted to check that I'm
on the right track with this project.


>> Thanks,
>>
>> Yoshi
>>
>>>
>>>>
>>>> TODO:
>>>> - Synchronization of clock sources (need to intercept TSC reads, etc).
>>>>
>>>> === usability ===
>>>>
>>>> These are items that define how users interact with Kemari.
>>>>
>>>> TODO:
>>>> - Kemarid daemon that takes care of the cluster management/monitoring
>>>> side of things.
>>>> - Some device emulators might need minor modifications to work well
>>>> with Kemari. Use white(black)-listing to take the burden of
>>>> choosing the right device model off the users.
>>>>
>>>> === optimizations ===
>>>>
>>>> Although the big picture can be realized by completing the TODO list
>>>> above, we
>>>> need some optimizations/enhancements to make Kemari useful in the real
>>>> world, and
>>>> these are the items that need to be done for that.
>>>>
>>>> TODO:
>>>> - SMP (for the sake of performance might need to implement a
>>>> synchronization protocol that can maintain two or more
>>>> synchronization points active at any given moment)
>>>> - VGA (leverage VNC's sub-tiling mechanism to identify fb pages that
>>>> are really dirty).
>>>>
>>>>
>>>> Any comments/suggestions would be greatly appreciated.
>>>>
>>>> Thanks,
>>>>
>>>> Yoshi
>>>>
>>>> --
>>>>
>>>> Kemari starts synchronizing VMs when QEMU handles I/O requests.
>>>> Without this patch VCPU state is already proceeded before
>>>> synchronization, and after failover to the VM on the receiver, it
>>>> hangs because of this.
>>>>
>>>> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>>>> ---
>>>> arch/x86/include/asm/kvm_host.h | 1 +
>>>> arch/x86/kvm/svm.c | 11 ++++++++---
>>>> arch/x86/kvm/vmx.c | 11 ++++++++---
>>>> arch/x86/kvm/x86.c | 4 ++++
>>>> 4 files changed, 21 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/arch/x86/include/asm/kvm_host.h
>>>> b/arch/x86/include/asm/kvm_host.h
>>>> index 26c629a..7b8f514 100644
>>>> --- a/arch/x86/include/asm/kvm_host.h
>>>> +++ b/arch/x86/include/asm/kvm_host.h
>>>> @@ -227,6 +227,7 @@ struct kvm_pio_request {
>>>> int in;
>>>> int port;
>>>> int size;
>>>> + bool lazy_skip;
>>>> };
>>>>
>>>> /*
>>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>>> index d04c7ad..e373245 100644
>>>> --- a/arch/x86/kvm/svm.c
>>>> +++ b/arch/x86/kvm/svm.c
>>>> @@ -1495,7 +1495,7 @@ static int io_interception(struct vcpu_svm *svm)
>>>> {
>>>> struct kvm_vcpu *vcpu =&svm->vcpu;
>>>> u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
>>>> - int size, in, string;
>>>> + int size, in, string, ret;
>>>> unsigned port;
>>>>
>>>> ++svm->vcpu.stat.io_exits;
>>>> @@ -1507,9 +1507,14 @@ static int io_interception(struct vcpu_svm *svm)
>>>> port = io_info>> 16;
>>>> size = (io_info& SVM_IOIO_SIZE_MASK)>> SVM_IOIO_SIZE_SHIFT;
>>>> svm->next_rip = svm->vmcb->control.exit_info_2;
>>>> - skip_emulated_instruction(&svm->vcpu);
>>>>
>>>> - return kvm_fast_pio_out(vcpu, size, port);
>>>> + ret = kvm_fast_pio_out(vcpu, size, port);
>>>> + if (ret)
>>>> + skip_emulated_instruction(&svm->vcpu);
>>>> + else
>>>> + vcpu->arch.pio.lazy_skip = true;
>>>> +
>>>> + return ret;
>>>> }
>>>>
>>>> static int nmi_interception(struct vcpu_svm *svm)
>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>>> index 41e63bb..09052d6 100644
>>>> --- a/arch/x86/kvm/vmx.c
>>>> +++ b/arch/x86/kvm/vmx.c
>>>> @@ -2975,7 +2975,7 @@ static int handle_triple_fault(struct kvm_vcpu
>>>> *vcpu)
>>>> static int handle_io(struct kvm_vcpu *vcpu)
>>>> {
>>>> unsigned long exit_qualification;
>>>> - int size, in, string;
>>>> + int size, in, string, ret;
>>>> unsigned port;
>>>>
>>>> exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
>>>> @@ -2989,9 +2989,14 @@ static int handle_io(struct kvm_vcpu *vcpu)
>>>>
>>>> port = exit_qualification>> 16;
>>>> size = (exit_qualification& 7) + 1;
>>>> - skip_emulated_instruction(vcpu);
>>>>
>>>> - return kvm_fast_pio_out(vcpu, size, port);
>>>> + ret = kvm_fast_pio_out(vcpu, size, port);
>>>> + if (ret)
>>>> + skip_emulated_instruction(vcpu);
>>>> + else
>>>> + vcpu->arch.pio.lazy_skip = true;
>>>> +
>>>> + return ret;
>>>> }
>>>>
>>>> static void
>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>> index fd5c3d3..cc308d2 100644
>>>> --- a/arch/x86/kvm/x86.c
>>>> +++ b/arch/x86/kvm/x86.c
>>>> @@ -4544,6 +4544,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu
>>>> *vcpu, struct kvm_run *kvm_run)
>>>> if (!irqchip_in_kernel(vcpu->kvm))
>>>> kvm_set_cr8(vcpu, kvm_run->cr8);
>>>>
>>>> + if (vcpu->arch.pio.lazy_skip)
>>>> + kvm_x86_ops->skip_emulated_instruction(vcpu);
>>>> + vcpu->arch.pio.lazy_skip = false;
>>>> +
>>>> if (vcpu->arch.pio.count || vcpu->mmio_needed ||
>>>> vcpu->arch.emulate_ctxt.restart) {
>>>> if (vcpu->mmio_needed) {
>>>
>>>
>>>
>>>
>>
>>
>>
>
>
>
>
Jamie Lokier - April 22, 2010, 4:15 p.m.
Yoshiaki Tamura wrote:
> Dor Laor wrote:
> >On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
> >>Event tapping is the core component of Kemari, and it decides on which
> >>event the
> >>primary should synchronize with the secondary. The basic assumption
> >>here is
> >>that outgoing I/O operations are idempotent, which is usually true for
> >>disk I/O
> >>and reliable network protocols such as TCP.
> >
> >IMO any type of network event should be stalled too. What if the VM runs
> >a non-TCP protocol, and the packet that the master node sent reached some
> >remote client, and before the sync to the slave the master failed?
> 
> In current implementation, it is actually stalling any type of network 
> that goes through virtio-net.
> 
> However, if the application was using unreliable protocols, it should have 
> its own recovering mechanism, or it should be completely stateless.

Even with unreliable protocols, if slave takeover causes the receiver
to have received a packet that the sender _does not think it has ever
sent_, expect some protocols to break.

If the slave replaying master's behaviour since the last sync means it
will definitely get into the same state of having sent the packet,
that works out.

But you still have to be careful that the other end's responses to
that packet are not seen by the slave too early during that replay.
Otherwise, for example, the slave may observe a TCP ACK to a packet
that it hasn't yet sent, which is an error.

About IP idempotency:

In general, IP packets are allowed to be lost or duplicated in the
network.  All IP protocols should be prepared for that; it is a basic
property.

However there is one respect in which they're not idempotent:

The TTL field should be decreased if packets are delayed.  Packets
should not appear to live in the network for longer than TTL seconds.
If they do, some protocols (like TCP) can react to the delayed ones
differently, such as sending a RST packet and breaking a connection.

It is acceptable to reduce TTL faster than the minimum.  After all, it
is reduced by 1 on every forwarding hop, in addition to time delays.
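The TTL rule above can be sketched as a small helper: a packet held for d seconds in a replication buffer should lose at least d from its TTL and be dropped when that reaches zero. This is a hypothetical sketch, not real kernel code:

```c
#include <assert.h>

/* Hypothetical helper for the TTL argument above: a packet that sat for
 * `held_secs` in a replication buffer must lose at least that much TTL
 * (reducing faster than the minimum is allowed); 0 means "drop it". */
static unsigned adjust_ttl(unsigned ttl, unsigned held_secs)
{
    if (ttl <= held_secs)
        return 0;               /* exceeded its lifetime: drop */
    return ttl - held_secs;
}
```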

> I currently don't have good numbers that I can share right now.
> Snapshots/sec depends on what kind of workload is running, and if the 
> guest was almost idle, there will be no snapshots in 5sec.  On the other 
> hand, if the guest was running I/O intensive workloads (netperf, iozone 
> for example), there will be about 50 snapshots/sec.

That is a really satisfying number, thank you :-)

Without this work I wouldn't have imagined that synchronised machines
could work with such a low transaction rate.

-- Jamie
Anthony Liguori - April 22, 2010, 7:42 p.m.
On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:
> Hi all,
>
> We have been implementing the prototype of Kemari for KVM, and we're sending
> this message to share what we have now and TODO lists.  Hopefully, we would like
> to get early feedback to keep us in the right direction.  Although advanced
> approaches in the TODO lists are fascinating, we would like to run this project
> step by step while absorbing comments from the community.  The current code is
> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>
> For those who are new to Kemari for KVM, please take a look at the
> following RFC which we posted last year.
>
> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>
> The transmission/transaction protocol, and most of the control logic is
> implemented in QEMU.  However, we needed a hack in KVM to prevent rip from
> proceeding before synchronizing VMs.  It may also need some plumbing in the
> kernel side to guarantee replayability of certain events and instructions,
> integrate the RAS capabilities of newer x86 hardware with the HA stack, as well
> as for optimization purposes, for example.
>
> Before going into details, we would like to show how Kemari looks.  We prepared
> a demonstration video at the following location.  For those who are not
> interested in the code, please take a look.
> The demonstration scenario is,
>
> 1. Play with a guest VM that has virtio-blk and virtio-net.
> # The guest image should be a NFS/SAN.
> 2. Start Kemari to synchronize the VM by running the following command in QEMU.
> Just add "-k" option to usual migrate command.
> migrate -d -k tcp:192.168.0.20:4444
> 3. Check the status by calling info migrate.
> 4. Go back to the VM to play chess animation.
> 5. Kill the VM. (VNC client also disappears)
> 6. Press "c" to continue the VM on the other host.
> 7. Bring up the VNC client (Sorry, it pops outside of video capture.)
> 8. Confirm that the chess animation ends, browser works fine, then shutdown.
>
> http://www.osrg.net/kemari/download/kemari-kvm-fc11.mov
>
> The repository contains all patches we're sending with this message.  For those
> who want to try, pull the following repository.  At running configure, please
> put --enable-ft-mode.  Also you need to apply a patch attached at the end of
> this message to your KVM.
>
> git://kemari.git.sourceforge.net/gitroot/kemari/kemari
>
> In addition to usual migrate environment and command, add "-k" to run.
>
> The patch set consists of following components.
>
> - bit-based dirty bitmap. (I have posted v4 for upstream QEMU on April 20)
> - writev() support to QEMUFile and FdMigrationState.
> - FT transaction sender/receiver
> - event tap that triggers FT transaction.
> - virtio-blk, virtio-net support for event tap.
>    

This series looks quite nice!

I think it would make sense to separate out the things that are actually 
optimizations (like the dirty bitmap changes and the writev/readv 
changes) and to attempt to justify them with actual performance data.

I'd prefer not to modify the live migration protocol ABI and it doesn't 
seem to be necessary if we're willing to add options to the -incoming 
flag.  We also want to be a bit more generic with respect to IO.  
Otherwise, the series looks very close to being mergable.

Regards,

Anthony Liguori
Anthony Liguori - April 22, 2010, 8:33 p.m.
On 04/22/2010 08:16 AM, Yoshiaki Tamura wrote:
> 2010/4/22 Dor Laor<dlaor@redhat.com>:
>    
>> On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
>>      
>>> Dor Laor wrote:
>>>        
>>>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>>>          
>>>>> Hi all,
>>>>>
>>>>> We have been implementing the prototype of Kemari for KVM, and we're
>>>>> sending
>>>>> this message to share what we have now and TODO lists. Hopefully, we
>>>>> would like
>>>>> to get early feedback to keep us in the right direction. Although
>>>>> advanced
>>>>> approaches in the TODO lists are fascinating, we would like to run
>>>>> this project
>>>>> step by step while absorbing comments from the community. The current
>>>>> code is
>>>>> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>>>>>
>>>>> For those who are new to Kemari for KVM, please take a look at the
>>>>> following RFC which we posted last year.
>>>>>
>>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>>>>
>>>>> The transmission/transaction protocol, and most of the control logic is
>>>>> implemented in QEMU. However, we needed a hack in KVM to prevent rip
>>>>> from
>>>>> proceeding before synchronizing VMs. It may also need some plumbing in
>>>>> the
>>>>> kernel side to guarantee replayability of certain events and
>>>>> instructions,
>>>>> integrate the RAS capabilities of newer x86 hardware with the HA
>>>>> stack, as well
>>>>> as for optimization purposes, for example.
>>>>>            
>>>> [ snap]
>>>>
>>>>          
>>>>> The rest of this message describes TODO lists grouped by each topic.
>>>>>
>>>>> === event tapping ===
>>>>>
>>>>> Event tapping is the core component of Kemari, and it decides on which
>>>>> event the
>>>>> primary should synchronize with the secondary. The basic assumption
>>>>> here is
>>>>> that outgoing I/O operations are idempotent, which is usually true for
>>>>> disk I/O
>>>>> and reliable network protocols such as TCP.
>>>>>            
>>>> IMO any type of network event should be stalled too. What if the VM runs
>>>> a non-TCP protocol, and the packet that the master node sent reached some
>>>> remote client, and before the sync to the slave the master failed?
>>>>          
>>> In current implementation, it is actually stalling any type of network
>>> that goes through virtio-net.
>>>
>>> However, if the application was using unreliable protocols, it should
>>> have its own recovering mechanism, or it should be completely stateless.
>>>        
>> Why do you treat tcp differently? You can damage the entire VM this way -
>> think of a dhcp request that was dropped at the moment you switched between
>> the master and the slave?
>>      
> I'm not trying to say that we should treat TCP differently; it's just
> more severe.
> In the case of a DHCP request, the client would have a chance to retry
> after failover, correct?
> BTW, in current implementation,
>    

I'm slightly confused about the current implementation vs. my 
recollection of the original paper with Xen.  I had thought that all 
disk and network I/O was buffered in such a way that at each checkpoint, 
the I/O operations would be released in a burst.  Otherwise, you would 
have to synchronize after every I/O operation which is what it seems the 
current implementation does.  I'm not sure how that is accomplished 
atomically though since you could have a completed I/O operation 
duplicated on the slave node provided it didn't notify completion prior 
to failure.

Is there another kemari component that somehow handles buffering I/O 
that is not obvious from these patches?
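For comparison, the Remus-style buffering Anthony describes could be sketched roughly like this: outbound operations are queued on the primary and released in a burst only once the covering checkpoint is acked. All names are hypothetical; Kemari as posted instead synchronizes at each I/O event:

```c
#include <assert.h>

/* Hypothetical sketch of checkpoint-time output buffering (Remus style):
 * completed outbound operations are held back and only released in a
 * burst when the secondary acks the checkpoint that covers them. */

#define QMAX 16

typedef struct {
    int queued[QMAX];
    int n;          /* operations held since the last checkpoint   */
    int released;   /* operations made visible to the outside world */
} out_buf;

static void buffer_io(out_buf *b, int op)
{
    if (b->n < QMAX)
        b->queued[b->n++] = op;   /* hold, don't emit yet */
}

/* Secondary acked the checkpoint: release the pending burst. */
static void checkpoint_acked(out_buf *b)
{
    b->released += b->n;    /* ... actually emit packets/blocks here ... */
    b->n = 0;
}
```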

Regards,

Anthony Liguori
Dor Laor - April 22, 2010, 8:38 p.m.
On 04/22/2010 04:16 PM, Yoshiaki Tamura wrote:
> 2010/4/22 Dor Laor<dlaor@redhat.com>:
>> On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
>>>
>>> Dor Laor wrote:
>>>>
>>>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> We have been implementing the prototype of Kemari for KVM, and we're
>>>>> sending
>>>>> this message to share what we have now and TODO lists. Hopefully, we
>>>>> would like
>>>>> to get early feedback to keep us in the right direction. Although
>>>>> advanced
>>>>> approaches in the TODO lists are fascinating, we would like to run
>>>>> this project
>>>>> step by step while absorbing comments from the community. The current
>>>>> code is
>>>>> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>>>>>
>>>>> For those who are new to Kemari for KVM, please take a look at the
>>>>> following RFC which we posted last year.
>>>>>
>>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>>>>
>>>>> The transmission/transaction protocol, and most of the control logic is
>>>>> implemented in QEMU. However, we needed a hack in KVM to prevent rip
>>>>> from
>>>>> proceeding before synchronizing VMs. It may also need some plumbing in
>>>>> the
>>>>> kernel side to guarantee replayability of certain events and
>>>>> instructions,
>>>>> integrate the RAS capabilities of newer x86 hardware with the HA
>>>>> stack, as well
>>>>> as for optimization purposes, for example.
>>>>
>>>> [ snap]
>>>>
>>>>>
>>>>> The rest of this message describes TODO lists grouped by each topic.
>>>>>
>>>>> === event tapping ===
>>>>>
>>>>> Event tapping is the core component of Kemari, and it decides on which
>>>>> event the
>>>>> primary should synchronize with the secondary. The basic assumption
>>>>> here is
>>>>> that outgoing I/O operations are idempotent, which is usually true for
>>>>> disk I/O
>>>>> and reliable network protocols such as TCP.
>>>>
>>>> IMO any type of network event should be stalled too. What if the VM runs
>>>> a non-TCP protocol, and the packet that the master node sent reached some
>>>> remote client, and before the sync to the slave the master failed?
>>>
>>> In current implementation, it is actually stalling any type of network
>>> that goes through virtio-net.
>>>
>>> However, if the application was using unreliable protocols, it should
>>> have its own recovering mechanism, or it should be completely stateless.
>>
>> Why do you treat tcp differently? You can damage the entire VM this way -
>> think of a dhcp request that was dropped at the moment you switched between
>> the master and the slave?
>
> I'm not trying to say that we should treat TCP differently; it's just
> more severe.
> In the case of a DHCP request, the client would have a chance to retry
> after failover, correct?

But until it times out, it won't have networking.

> BTW, in the current implementation, it synchronizes before the DHCP ack is sent.
> But in the case of TCP, once you send the ack to the client before the sync,
> there is no way to recover.

What if the guest is running a DHCP server? If we provide an IP to a
client and then fail over to the secondary, it will run without knowing
the master allocated this IP.

>
>>>> [snap]
>>>>
>>>>
>>>>> === clock ===
>>>>>
>>>>> Since synchronizing the virtual machines every time the TSC is
>>>>> accessed would be
>>>>> prohibitive, the transmission of the TSC will be done lazily, which
>>>>> means
>>>>> delaying it until there is a non-TSC synchronization point arrives.
>>>>
>>>> Why do you specifically care about the tsc sync? When you sync all the
>>>> IO model on snapshot it also synchronizes the tsc.
>>
>> So, do you agree that an extra clock synchronization is not needed since it
>> is done anyway as part of the live migration state sync?
>
> I agree that it's sent as part of the live migration.
> What I wanted to say here is that this is not something for real-time
> applications.
> I usually get questions about whether this can guarantee fault tolerance
> for real-time applications.

First, the huge cost of snapshots won't suit any real-time app.
Second, even if that weren't the case, the TSC delta and kvmclock are
synchronized as part of the VM state, so there is no point in trapping it
in the middle.

>
>>>> In general, can you please explain the 'algorithm' for continuous
>>>> snapshots (is that what you like to do?):
>>>
>>> Yes, of course.
>>> Sorry for being less informative.
>>>
>>>> A trivial one would be to:
>>>> - do X online snapshots/sec
>>>
>>> I currently don't have good numbers that I can share right now.
>>> Snapshots/sec depends on what kind of workload is running, and if the
>>> guest was almost idle, there will be no snapshots in 5sec. On the other
>>> hand, if the guest was running I/O intensive workloads (netperf, iozone
>>> for example), there will be about 50 snapshots/sec.
>>>
>>>> - Stall all IO (disk/block) from the guest to the outside world
>>>> until the previous snapshot reaches the slave.
>>>
>>> Yes, it does.
>>>
>>>> - Snapshots are made of
>>>
>>> Full device model + diff of dirty pages from the last snapshot.
>>>
>>>> - diff of dirty pages from last snapshot
>>>
>>> This also depends on the workload.
>>> In case of I/O intensive workloads, dirty pages are usually less than 100.
>>
>> The hardest would be memory intensive loads.
>> So 100 snap/sec means latency of 10msec right?
>> (not that it's not ok, with faster hw and IB you'll be able to get much
>> more)
>
> Doesn't 100 snap/sec mean the snapshot interval is 10 msec?
> IIUC, to get the latency, you need: time to transfer the VM + time to
> get a response from the receiver.
>
> It's hard to say which load is the hardest.
> A memory-intensive load, which doesn't generate I/O often, will suffer
> from a long sync time at that moment, but will have chances to continue
> its processing until the sync.
> An I/O-intensive load, which doesn't dirty many pages, will suffer from
> getting the VCPU stopped often, but its sync time is relatively shorter.
>
>>>> - Qemu device model (+kvm's) diff from last.
>>>
>>> We're currently sending full copy because we're completely reusing this
>>> part of existing live migration framework.
>>>
>>> Last time we measured, it was about 13KB.
>>> But it varies by which QEMU version is used.
>>>
>>>> You can do 'light' snapshots in between to send dirty pages to reduce
>>>> snapshot time.
>>>
>>> I agree. That's one of the advanced topics we would like to try too.
>>>
>>>> I wrote the above to serve a reference for your comments so it will map
>>>> into my mind. Thanks, dor
>>>
>>> Thank you for the guidance.
>>> I hope this answers your question.
>>>
>>> At the same time, I would also be happy if we could discuss how to
>>> implement it too. In fact, we needed a hack to prevent rip from proceeding
>>> in KVM, which turned out not to be the best workaround.
>>
>> There are brute force solutions like
>> - stop the guest until you send all of the snapshot to the remote (like
>>   standard live migration)
>
> We've implemented it this way so far.
>
>> - Stop + fork + cont the father
>>
>> Or mark the recent dirty pages that were not sent to the remote as write
>> protected and copy them if touched.
>
> I think I had that suggestion from Avi before.
> And yes, it's very fascinating.
>
> Meanwhile, if you look at the diffstat, it needed to touch many parts of QEMU.
> Before going into further implementation, I wanted to check that I'm
> on the right track with this project.
>
>
>>> Thanks,
>>>
>>> Yoshi
>>>
>>>>
>>>>>
>>>>> TODO:
>>>>> - Synchronization of clock sources (need to intercept TSC reads, etc).
>>>>>
>>>>> === usability ===
>>>>>
>>>>> These are items that define how users interact with Kemari.
>>>>>
>>>>> TODO:
>>>>> - Kemarid daemon that takes care of the cluster management/monitoring
>>>>> side of things.
>>>>> - Some device emulators might need minor modifications to work well
>>>>> with Kemari. Use white(black)-listing to take the burden of
>>>>> choosing the right device model off the users.
>>>>>
>>>>> === optimizations ===
>>>>>
>>>>> Although the big picture can be realized by completing the TODO list
>>>>> above, we
>>>>> need some optimizations/enhancements to make Kemari useful in the real
>>>>> world, and
>>>>> these are the items that need to be done for that.
>>>>>
>>>>> TODO:
>>>>> - SMP (for the sake of performance might need to implement a
>>>>> synchronization protocol that can maintain two or more
>>>>> synchronization points active at any given moment)
>>>>> - VGA (leverage VNC's sub-tiling mechanism to identify fb pages that
>>>>> are really dirty).
>>>>>
>>>>>
>>>>> Any comments/suggestions would be greatly appreciated.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Yoshi
>>>>>
>>>>> --
>>>>>
>>>>> Kemari starts synchronizing VMs when QEMU handles I/O requests.
>>>>> Without this patch VCPU state is already proceeded before
>>>>> synchronization, and after failover to the VM on the receiver, it
>>>>> hangs because of this.
>>>>>
>>>>> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>>>>> ---
>>>>> arch/x86/include/asm/kvm_host.h | 1 +
>>>>> arch/x86/kvm/svm.c | 11 ++++++++---
>>>>> arch/x86/kvm/vmx.c | 11 ++++++++---
>>>>> arch/x86/kvm/x86.c | 4 ++++
>>>>> 4 files changed, 21 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/arch/x86/include/asm/kvm_host.h
>>>>> b/arch/x86/include/asm/kvm_host.h
>>>>> index 26c629a..7b8f514 100644
>>>>> --- a/arch/x86/include/asm/kvm_host.h
>>>>> +++ b/arch/x86/include/asm/kvm_host.h
>>>>> @@ -227,6 +227,7 @@ struct kvm_pio_request {
>>>>> int in;
>>>>> int port;
>>>>> int size;
>>>>> + bool lazy_skip;
>>>>> };
>>>>>
>>>>> /*
>>>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>>>> index d04c7ad..e373245 100644
>>>>> --- a/arch/x86/kvm/svm.c
>>>>> +++ b/arch/x86/kvm/svm.c
>>>>> @@ -1495,7 +1495,7 @@ static int io_interception(struct vcpu_svm *svm)
>>>>> {
>>>>> struct kvm_vcpu *vcpu =&svm->vcpu;
>>>>> u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
>>>>> - int size, in, string;
>>>>> + int size, in, string, ret;
>>>>> unsigned port;
>>>>>
>>>>> ++svm->vcpu.stat.io_exits;
>>>>> @@ -1507,9 +1507,14 @@ static int io_interception(struct vcpu_svm *svm)
>>>>> port = io_info>>  16;
>>>>> size = (io_info&  SVM_IOIO_SIZE_MASK)>>  SVM_IOIO_SIZE_SHIFT;
>>>>> svm->next_rip = svm->vmcb->control.exit_info_2;
>>>>> - skip_emulated_instruction(&svm->vcpu);
>>>>>
>>>>> - return kvm_fast_pio_out(vcpu, size, port);
>>>>> + ret = kvm_fast_pio_out(vcpu, size, port);
>>>>> + if (ret)
>>>>> + skip_emulated_instruction(&svm->vcpu);
>>>>> + else
>>>>> + vcpu->arch.pio.lazy_skip = true;
>>>>> +
>>>>> + return ret;
>>>>> }
>>>>>
>>>>> static int nmi_interception(struct vcpu_svm *svm)
>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>>>> index 41e63bb..09052d6 100644
>>>>> --- a/arch/x86/kvm/vmx.c
>>>>> +++ b/arch/x86/kvm/vmx.c
>>>>> @@ -2975,7 +2975,7 @@ static int handle_triple_fault(struct kvm_vcpu
>>>>> *vcpu)
>>>>> static int handle_io(struct kvm_vcpu *vcpu)
>>>>> {
>>>>> unsigned long exit_qualification;
>>>>> - int size, in, string;
>>>>> + int size, in, string, ret;
>>>>> unsigned port;
>>>>>
>>>>> exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
>>>>> @@ -2989,9 +2989,14 @@ static int handle_io(struct kvm_vcpu *vcpu)
>>>>>
>>>>> port = exit_qualification>>  16;
>>>>> size = (exit_qualification&  7) + 1;
>>>>> - skip_emulated_instruction(vcpu);
>>>>>
>>>>> - return kvm_fast_pio_out(vcpu, size, port);
>>>>> + ret = kvm_fast_pio_out(vcpu, size, port);
>>>>> + if (ret)
>>>>> + skip_emulated_instruction(vcpu);
>>>>> + else
>>>>> + vcpu->arch.pio.lazy_skip = true;
>>>>> +
>>>>> + return ret;
>>>>> }
>>>>>
>>>>> static void
>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>>> index fd5c3d3..cc308d2 100644
>>>>> --- a/arch/x86/kvm/x86.c
>>>>> +++ b/arch/x86/kvm/x86.c
>>>>> @@ -4544,6 +4544,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu
>>>>> *vcpu, struct kvm_run *kvm_run)
>>>>> if (!irqchip_in_kernel(vcpu->kvm))
>>>>> kvm_set_cr8(vcpu, kvm_run->cr8);
>>>>>
>>>>> + if (vcpu->arch.pio.lazy_skip)
>>>>> + kvm_x86_ops->skip_emulated_instruction(vcpu);
>>>>> + vcpu->arch.pio.lazy_skip = false;
>>>>> +
>>>>> if (vcpu->arch.pio.count || vcpu->mmio_needed ||
>>>>> vcpu->arch.emulate_ctxt.restart) {
>>>>> if (vcpu->mmio_needed) {
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>>
>
>
Yoshiaki Tamura - April 23, 2010, 12:20 a.m.
Jamie Lokier wrote:
> Yoshiaki Tamura wrote:
>> Dor Laor wrote:
>>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>>> Event tapping is the core component of Kemari, and it decides on which
>>>> event the
>>>> primary should synchronize with the secondary. The basic assumption
>>>> here is
>>>> that outgoing I/O operations are idempotent, which is usually true for
>>>> disk I/O
>>>> and reliable network protocols such as TCP.
>>>
>>> IMO any type of network even should be stalled too. What if the VM runs
>>> non tcp protocol and the packet that the master node sent reached some
>>> remote client and before the sync to the slave the master failed?
>>
>> In current implementation, it is actually stalling any type of network
>> that goes through virtio-net.
>>
>> However, if the application was using unreliable protocols, it should have
>> its own recovering mechanism, or it should be completely stateless.
>
> Even with unreliable protocols, if slave takeover causes the receiver
> to have received a packet that the sender _does not think it has ever
> sent_, expect some protocols to break.
>
> If the slave replaying master's behaviour since the last sync means it
> will definitely get into the same state of having sent the packet,
> that works out.

That's something we're expecting now.

> But you still have to be careful that the other end's responses to
> that packet are not seen by the slave too early during that replay.
> Otherwise, for example, the slave may observe a TCP ACK to a packet
> that it hasn't yet sent, which is an error.

Even though the current implementation syncs just before network output, what you 
pointed out could happen.  In this case, would the connection be lost, or would 
the client/server recover from it?  If the latter, it would be fine; otherwise I 
wonder how people doing similar things handle this situation.

> About IP idempotency:
>
> In general, IP packets are allowed to be lost or duplicated in the
> network.  All IP protocols should be prepared for that; it is a basic
> property.
>
> However there is one respect in which they're not idempotent:
>
> The TTL field should be decreased if packets are delayed.  Packets
> should not appear to live in the network for longer than TTL seconds.
> If they do, some protocols (like TCP) can react to the delayed ones
> differently, such as sending a RST packet and breaking a connection.
>
> It is acceptable to reduce TTL faster than the minimum.  After all, it
> is reduced by 1 on every forwarding hop, in addition to time delays.

So the problem is that when the slave takes over, it sends a packet with the same 
TTL as one the client may already have received.
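The TTL adjustment Jamie describes can be sketched as follows. This is illustrative only; the function name and signature are ours, not part of Kemari or the kernel. The idea is simply that a replayed packet's TTL is reduced by the wall-clock seconds elapsed since the checkpoint, which is always acceptable since reducing TTL faster than the minimum is allowed.

```c
#include <stdint.h>

/* Hypothetical helper (not Kemari code): when the slave takes over and
 * replays a packet, reduce the IP TTL by the seconds elapsed since the
 * checkpoint, so the packet never appears to live in the network longer
 * than TTL seconds. */
static inline uint8_t replay_ttl(uint8_t orig_ttl, uint32_t elapsed_secs)
{
    if (elapsed_secs >= orig_ttl)
        return 0;               /* packet should be dropped, not resent */
    return orig_ttl - (uint8_t)elapsed_secs;
}
```

A real implementation would also have to recompute the IP header checksum after rewriting the TTL field.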

>> I currently don't have good numbers that I can share right now.
>> Snapshots/sec depends on what kind of workload is running, and if the
>> guest was almost idle, there will be no snapshots in 5sec.  On the other
>> hand, if the guest was running I/O intensive workloads (netperf, iozone
>> for example), there will be about 50 snapshots/sec.
>
> That is a really satisfying number, thank you :-)
>
> Without this work I wouldn't have imagined that synchronised machines
> could work with such a low transaction rate.

Thank you for your comments.

Although I haven't prepared good data yet, I personally prefer to base the 
discussion on an actual implementation and experimental data.
Yoshiaki Tamura - April 23, 2010, 12:45 a.m.
Anthony Liguori wrote:
> On 04/21/2010 12:57 AM, Yoshiaki Tamura wrote:
>> Hi all,
>>
>> We have been implementing the prototype of Kemari for KVM, and we're
>> sending
>> this message to share what we have now and TODO lists. Hopefully, we
>> would like
>> to get early feedback to keep us in the right direction. Although
>> advanced
>> approaches in the TODO lists are fascinating, we would like to run
>> this project
>> step by step while absorbing comments from the community. The current
>> code is
>> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>>
>> For those who are new to Kemari for KVM, please take a look at the
>> following RFC which we posted last year.
>>
>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>
>> The transmission/transaction protocol, and most of the control logic is
>> implemented in QEMU. However, we needed a hack in KVM to prevent rip from
>> proceeding before synchronizing VMs. It may also need some plumbing in
>> the
>> kernel side to guarantee replayability of certain events and
>> instructions,
>> integrate the RAS capabilities of newer x86 hardware with the HA
>> stack, as well
>> as for optimization purposes, for example.
>>
>> Before going into details, we would like to show how Kemari looks. We
>> prepared
>> a demonstration video at the following location. For those who are not
>> interested in the code, please take a look.
>> The demonstration scenario is,
>>
>> 1. Play with a guest VM that has virtio-blk and virtio-net.
>> # The guest image should be a NFS/SAN.
>> 2. Start Kemari to synchronize the VM by running the following command
>> in QEMU.
>> Just add "-k" option to usual migrate command.
>> migrate -d -k tcp:192.168.0.20:4444
>> 3. Check the status by calling info migrate.
>> 4. Go back to the VM to play chess animation.
>> 5. Kill the the VM. (VNC client also disappears)
>> 6. Press "c" to continue the VM on the other host.
>> 7. Bring up the VNC client (Sorry, it pops outside of video capture.)
>> 8. Confirm that the chess animation ends, browser works fine, then
>> shutdown.
>>
>> http://www.osrg.net/kemari/download/kemari-kvm-fc11.mov
>>
>> The repository contains all patches we're sending with this message.
>> For those
>> who want to try, pull the following repository. At running configure,
>> please
>> put --enable-ft-mode. Also you need to apply a patch attached at the
>> end of
>> this message to your KVM.
>>
>> git://kemari.git.sourceforge.net/gitroot/kemari/kemari
>>
>> In addition to usual migrate environment and command, add "-k" to run.
>>
>> The patch set consists of following components.
>>
>> - bit-based dirty bitmap. (I have posted v4 for upstream QEMU on April
>> 2o)
>> - writev() support to QEMUFile and FdMigrationState.
>> - FT transaction sender/receiver
>> - event tap that triggers FT transaction.
>> - virtio-blk, virtio-net support for event tap.
>
> This series looks quite nice!

Thanks for your kind words!

> I think it would make sense to separate out the things that are actually
> optimizations (like the dirty bitmap changes and the writev/readv
> changes) and to attempt to justify them with actual performance data.

I agree with the separation plan.

For the dirty bitmap change, Avi and I discussed a patchset for upstream QEMU while 
you were offline (sorry if I was wrong).  Could you also take a look?

http://lists.gnu.org/archive/html/qemu-devel/2010-04/msg01396.html

Regarding writev, I agree that it should be backed by actual data; otherwise 
it should be removed.  We attempted to do everything that might reduce the overhead 
of the transaction.

> I'd prefer not to modify the live migration protocol ABI and it doesn't
> seem to be necessary if we're willing to add options to the -incoming
> flag. We also want to be a bit more generic with respect to IO.

I totally agree with your approach of not changing the protocol ABI.  Can we add 
an option to -incoming?  Like -incoming ft_mode, for example.
Regarding the IO, let me reply to the next message.

> Otherwise, the series looks very close to being mergable.

Thank you for your comment on each patch.

To be honest, I wasn't that confident because I'm a newbie to KVM/QEMU and 
struggled with how to implement it in an acceptable way.

Thanks,

Yoshi

>
> Regards,
>
> Anthony Liguori
>
>
>
Yoshiaki Tamura - April 23, 2010, 1:53 a.m.
Anthony Liguori wrote:
> On 04/22/2010 08:16 AM, Yoshiaki Tamura wrote:
>> 2010/4/22 Dor Laor<dlaor@redhat.com>:
>>> On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
>>>> Dor Laor wrote:
>>>>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> We have been implementing the prototype of Kemari for KVM, and we're
>>>>>> sending
>>>>>> this message to share what we have now and TODO lists. Hopefully, we
>>>>>> would like
>>>>>> to get early feedback to keep us in the right direction. Although
>>>>>> advanced
>>>>>> approaches in the TODO lists are fascinating, we would like to run
>>>>>> this project
>>>>>> step by step while absorbing comments from the community. The current
>>>>>> code is
>>>>>> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>>>>>>
>>>>>> For those who are new to Kemari for KVM, please take a look at the
>>>>>> following RFC which we posted last year.
>>>>>>
>>>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>>>>>
>>>>>> The transmission/transaction protocol, and most of the control
>>>>>> logic is
>>>>>> implemented in QEMU. However, we needed a hack in KVM to prevent rip
>>>>>> from
>>>>>> proceeding before synchronizing VMs. It may also need some
>>>>>> plumbing in
>>>>>> the
>>>>>> kernel side to guarantee replayability of certain events and
>>>>>> instructions,
>>>>>> integrate the RAS capabilities of newer x86 hardware with the HA
>>>>>> stack, as well
>>>>>> as for optimization purposes, for example.
>>>>> [ snap]
>>>>>
>>>>>> The rest of this message describes TODO lists grouped by each topic.
>>>>>>
>>>>>> === event tapping ===
>>>>>>
>>>>>> Event tapping is the core component of Kemari, and it decides on
>>>>>> which
>>>>>> event the
>>>>>> primary should synchronize with the secondary. The basic assumption
>>>>>> here is
>>>>>> that outgoing I/O operations are idempotent, which is usually true
>>>>>> for
>>>>>> disk I/O
>>>>>> and reliable network protocols such as TCP.
>>>>> IMO any type of network even should be stalled too. What if the VM
>>>>> runs
>>>>> non tcp protocol and the packet that the master node sent reached some
>>>>> remote client and before the sync to the slave the master failed?
>>>> In current implementation, it is actually stalling any type of network
>>>> that goes through virtio-net.
>>>>
>>>> However, if the application was using unreliable protocols, it should
>>>> have its own recovering mechanism, or it should be completely
>>>> stateless.
>>> Why do you treat tcp differently? You can damage the entire VM this
>>> way -
>>> think of dhcp request that was dropped on the moment you switched
>>> between
>>> the master and the slave?
>> I'm not trying to say that we should treat tcp differently, but just
>> it's severe.
>> In case of dhcp request, the client would have a chance to retry after
>> failover, correct?
>> BTW, in current implementation,
>
> I'm slightly confused about the current implementation vs. my
> recollection of the original paper with Xen. I had thought that all disk
> and network I/O was buffered in such a way that at each checkpoint, the
> I/O operations would be released in a burst. Otherwise, you would have
> to synchronize after every I/O operation which is what it seems the
> current implementation does.

Yes, you're almost right.
It's synchronizing before QEMU starts emulating I/O at each device model.
It was originally designed that way to avoid the complexity of introducing a buffering 
mechanism and the additional I/O latency buffering would add.

> I'm not sure how that is accomplished
> atomically though since you could have a completed I/O operation
> duplicated on the slave node provided it didn't notify completion prior
> to failure.

That's exactly the point I wanted to discuss.
Currently, we're calling vm_stop(0), qemu_aio_flush() and bdrv_flush_all() 
before qemu_save_state_all() in ft_tranx_ready(), to ensure outstanding I/O is 
complete.  I mimicked what existing live migration does.
Isn't that enough?
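The ordering described here can be sketched with stubs. This is not the real QEMU code: the actual vm_stop()/qemu_aio_flush()/bdrv_flush_all()/qemu_save_state_all() have different signatures and side effects, and the stubs below only record the call order that ft_tranx_ready() is said to follow.

```c
#include <stdio.h>
#include <string.h>

/* Stubbed-out QEMU calls (illustrative only) that log their invocation
 * order into a buffer. */
static char order[128];

static void vm_stop(int reason)       { (void)reason; strcat(order, "stop,"); }
static void qemu_aio_flush(void)      { strcat(order, "aio_flush,"); }
static void bdrv_flush_all(void)      { strcat(order, "bdrv_flush,"); }
static void qemu_save_state_all(void) { strcat(order, "save_state,"); }

/* Stop the VCPU, drain outstanding AIO, flush block devices, then
 * snapshot -- so no I/O is in flight when state is sent to the
 * secondary, mirroring what live migration does. */
static void ft_tranx_ready_sketch(void)
{
    vm_stop(0);
    qemu_aio_flush();
    bdrv_flush_all();
    qemu_save_state_all();
}
```

The point of the ordering is that the flushes happen after the VCPU is stopped, so no new I/O can be issued between the flush and the state save.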

> Is there another kemari component that somehow handles buffering I/O
> that is not obvious from these patches?

No, I'm not hiding anything, and I'm happy to share any information regarding Kemari 
to develop it in this community :-)

Thanks,

Yoshi

>
> Regards,
>
> Anthony Liguori
>
>
>
Yoshiaki Tamura - April 23, 2010, 5:17 a.m.
Dor Laor wrote:
> On 04/22/2010 04:16 PM, Yoshiaki Tamura wrote:
>> 2010/4/22 Dor Laor<dlaor@redhat.com>:
>>> On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
>>>>
>>>> Dor Laor wrote:
>>>>>
>>>>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> We have been implementing the prototype of Kemari for KVM, and we're
>>>>>> sending
>>>>>> this message to share what we have now and TODO lists. Hopefully, we
>>>>>> would like
>>>>>> to get early feedback to keep us in the right direction. Although
>>>>>> advanced
>>>>>> approaches in the TODO lists are fascinating, we would like to run
>>>>>> this project
>>>>>> step by step while absorbing comments from the community. The current
>>>>>> code is
>>>>>> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>>>>>>
>>>>>> For those who are new to Kemari for KVM, please take a look at the
>>>>>> following RFC which we posted last year.
>>>>>>
>>>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>>>>>
>>>>>> The transmission/transaction protocol, and most of the control
>>>>>> logic is
>>>>>> implemented in QEMU. However, we needed a hack in KVM to prevent rip
>>>>>> from
>>>>>> proceeding before synchronizing VMs. It may also need some
>>>>>> plumbing in
>>>>>> the
>>>>>> kernel side to guarantee replayability of certain events and
>>>>>> instructions,
>>>>>> integrate the RAS capabilities of newer x86 hardware with the HA
>>>>>> stack, as well
>>>>>> as for optimization purposes, for example.
>>>>>
>>>>> [ snap]
>>>>>
>>>>>>
>>>>>> The rest of this message describes TODO lists grouped by each topic.
>>>>>>
>>>>>> === event tapping ===
>>>>>>
>>>>>> Event tapping is the core component of Kemari, and it decides on
>>>>>> which
>>>>>> event the
>>>>>> primary should synchronize with the secondary. The basic assumption
>>>>>> here is
>>>>>> that outgoing I/O operations are idempotent, which is usually true
>>>>>> for
>>>>>> disk I/O
>>>>>> and reliable network protocols such as TCP.
>>>>>
>>>>> IMO any type of network even should be stalled too. What if the VM
>>>>> runs
>>>>> non tcp protocol and the packet that the master node sent reached some
>>>>> remote client and before the sync to the slave the master failed?
>>>>
>>>> In current implementation, it is actually stalling any type of network
>>>> that goes through virtio-net.
>>>>
>>>> However, if the application was using unreliable protocols, it should
>>>> have its own recovering mechanism, or it should be completely
>>>> stateless.
>>>
>>> Why do you treat tcp differently? You can damage the entire VM this
>>> way -
>>> think of dhcp request that was dropped on the moment you switched
>>> between
>>> the master and the slave?
>>
>> I'm not trying to say that we should treat tcp differently, but just
>> it's severe.
>> In case of dhcp request, the client would have a chance to retry after
>> failover, correct?
>
> But until it timeouts it won't have networking.
>
>> BTW, in current implementation, it's synchronizing before dhcp ack is
>> sent.
>> But in case of tcp, once you send ack to the client before sync, there
>> is no way to recover.
>
> What if the guest is running dhcp server? It we provide an IP to a
> client and then fail to the secondary that will run without knowing the
> master allocated this IP

That's problematic.  So it needs to sync when the dhcp ack is sent.

I should apologize for my misunderstanding and poor explanation.  I agree that we 
should stall every type of network output.

>
>>
>>>>> [snap]
>>>>>
>>>>>
>>>>>> === clock ===
>>>>>>
>>>>>> Since synchronizing the virtual machines every time the TSC is
>>>>>> accessed would be
>>>>>> prohibitive, the transmission of the TSC will be done lazily, which
>>>>>> means
>>>>>> delaying it until there is a non-TSC synchronization point arrives.
>>>>>
>>>>> Why do you specifically care about the tsc sync? When you sync all the
>>>>> IO model on snapshot it also synchronizes the tsc.
>>>
>>> So, do you agree that an extra clock synchronization is not needed
>>> since it
>>> is done anyway as part of the live migration state sync?
>>
>> I agree that its sent as part of the live migration.
>> What I wanted to say here is that this is not something for real time
>> applications.
>> I usually get questions like can this guarantee fault tolerance for
>> real time applications.
>
> First the huge cost of snapshots won't match to any real time app.

I see.

> Second, even if it wasn't the case, the tsc delta and kvmclock are
> synchronized as part of the VM state so there is no use of trapping it
> in the middle.

I should study the clock in KVM, but won't the TSC get updated by the HW after 
migration?
I was wondering about the following case, for example:

1. The application on the guest calls rdtsc on host A.
2. The application uses rdtsc value for something.
3. Failover to host B.
4. The application on the guest replays the rdtsc call on host B.
5. If the rdtsc value is different between A and B, the application may get into 
trouble because of it.

If I'm wrong, my apologies.
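For the scenario in steps 1-5 above, the usual answer is TSC offsetting: KVM exposes the guest TSC as host TSC plus a per-VCPU offset, so on failover the new host can pick an offset that keeps the guest-visible value monotonic. The helper names below are ours, a sketch rather than the actual KVM code.

```c
#include <stdint.h>

/* Illustrative sketch: the guest-visible TSC is host_tsc + tsc_offset.
 * On failover, derive host B's offset from the last guest-visible TSC
 * captured at the checkpoint, so the application never sees the TSC
 * jump backwards.  (Unsigned wraparound makes this work even when
 * host B's raw TSC is ahead of the guest's.) */
static inline int64_t failover_tsc_offset(uint64_t last_guest_tsc,
                                          uint64_t host_b_tsc)
{
    return (int64_t)(last_guest_tsc - host_b_tsc);
}

static inline uint64_t guest_tsc(uint64_t host_tsc, int64_t offset)
{
    return host_tsc + (uint64_t)offset;
}
```

This handles monotonicity only; it does not help if the two hosts' TSCs tick at different frequencies, which is the separate problem raised below in the thread.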

>
>>
>>>>> In general, can you please explain the 'algorithm' for continuous
>>>>> snapshots (is that what you like to do?):
>>>>
>>>> Yes, of course.
>>>> Sorry for being less informative.
>>>>
>>>>> A trivial one would we to :
>>>>> - do X online snapshots/sec
>>>>
>>>> I currently don't have good numbers that I can share right now.
>>>> Snapshots/sec depends on what kind of workload is running, and if the
>>>> guest was almost idle, there will be no snapshots in 5sec. On the other
>>>> hand, if the guest was running I/O intensive workloads (netperf, iozone
>>>> for example), there will be about 50 snapshots/sec.
>>>>
>>>>> - Stall all IO (disk/block) from the guest to the outside world
>>>>> until the previous snapshot reaches the slave.
>>>>
>>>> Yes, it does.
>>>>
>>>>> - Snapshots are made of
>>>>
>>>> Full device model + diff of dirty pages from the last snapshot.
>>>>
>>>>> - diff of dirty pages from last snapshot
>>>>
>>>> This also depends on the workload.
>>>> In case of I/O intensive workloads, dirty pages are usually less
>>>> than 100.
>>>
>>> The hardest would be memory intensive loads.
>>> So 100 snap/sec means latency of 10msec right?
>>> (not that it's not ok, with faster hw and IB you'll be able to get much
>>> more)
>>
>> Doesn't 100 snap/sec mean the interval of snap is 10msec?
>> IIUC, to get the latency, you need to get, Time to transfer VM + Time
>> to get response from the receiver.
>>
>> It's hard to say which load is the hardest.
>> Memory intensive load, who don't generate I/O often, will suffer from
>> long sync time for that moment, but would have chances to continue its
>> process until sync.
>> I/O intensive load, who don't dirty much pages, will suffer from
>> getting VPU stopped often, but its sync time is relatively shorter.
>>
>>>>> - Qemu device model (+kvm's) diff from last.
>>>>
>>>> We're currently sending full copy because we're completely reusing this
>>>> part of existing live migration framework.
>>>>
>>>> Last time we measured, it was about 13KB.
>>>> But it varies by which QEMU version is used.
>>>>
>>>>> You can do 'light' snapshots in between to send dirty pages to reduce
>>>>> snapshot time.
>>>>
>>>> I agree. That's one of the advanced topic we would like to try too.
>>>>
>>>>> I wrote the above to serve a reference for your comments so it will
>>>>> map
>>>>> into my mind. Thanks, dor
>>>>
>>>> Thank your for the guidance.
>>>> I hope this answers to your question.
>>>>
>>>> At the same time, I would also be happy it we could discuss how to
>>>> implement too. In fact, we needed a hack to prevent rip from proceeding
>>>> in KVM, which turned out that it was not the best workaround.
>>>
>>> There are brute force solutions like
>>> - stop the guest until you send all of the snapshot to the remote (like
>>> standard live migration)
>>
>> We've implemented this way so far.
>>
>>> - Stop + fork + cont the father
>>>
>>> Or mark the recent dirty pages that were not sent to the remote as write
>>> protected and copy them if touched.
>>
>> I think I had that suggestion from Avi before.
>> And yes, it's very fascinating.
>>
>> Meanwhile, if you look at the diffstat, it needed to touch many parts
>> of QEMU.
>> Before going into further implementation, I wanted to check that I'm
>> in the right track for doing this project.
>>
>>
>>>> Thanks,
>>>>
>>>> Yoshi
>>>>
>>>>>
>>>>>>
>>>>>> TODO:
>>>>>> - Synchronization of clock sources (need to intercept TSC reads,
>>>>>> etc).
>>>>>>
>>>>>> === usability ===
>>>>>>
>>>>>> These are items that defines how users interact with Kemari.
>>>>>>
>>>>>> TODO:
>>>>>> - Kemarid daemon that takes care of the cluster management/monitoring
>>>>>> side of things.
>>>>>> - Some device emulators might need minor modifications to work well
>>>>>> with Kemari. Use white(black)-listing to take the burden of
>>>>>> choosing the right device model off the users.
>>>>>>
>>>>>> === optimizations ===
>>>>>>
>>>>>> Although the big picture can be realized by completing the TODO list
>>>>>> above, we
>>>>>> need some optimizations/enhancements to make Kemari useful in real
>>>>>> world, and
>>>>>> these are items what needs to be done for that.
>>>>>>
>>>>>> TODO:
>>>>>> - SMP (for the sake of performance might need to implement a
>>>>>> synchronization protocol that can maintain two or more
>>>>>> synchronization points active at any given moment)
>>>>>> - VGA (leverage VNC's subtilting mechanism to identify fb pages that
>>>>>> are really dirty).
>>>>>>
>>>>>>
>>>>>> Any comments/suggestions would be greatly appreciated.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Yoshi
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Kemari starts synchronizing VMs when QEMU handles I/O requests.
>>>>>> Without this patch VCPU state is already proceeded before
>>>>>> synchronization, and after failover to the VM on the receiver, it
>>>>>> hangs because of this.
>>>>>>
>>>>>> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>>>>>> ---
>>>>>> arch/x86/include/asm/kvm_host.h | 1 +
>>>>>> arch/x86/kvm/svm.c | 11 ++++++++---
>>>>>> arch/x86/kvm/vmx.c | 11 ++++++++---
>>>>>> arch/x86/kvm/x86.c | 4 ++++
>>>>>> 4 files changed, 21 insertions(+), 6 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/x86/include/asm/kvm_host.h
>>>>>> b/arch/x86/include/asm/kvm_host.h
>>>>>> index 26c629a..7b8f514 100644
>>>>>> --- a/arch/x86/include/asm/kvm_host.h
>>>>>> +++ b/arch/x86/include/asm/kvm_host.h
>>>>>> @@ -227,6 +227,7 @@ struct kvm_pio_request {
>>>>>> int in;
>>>>>> int port;
>>>>>> int size;
>>>>>> + bool lazy_skip;
>>>>>> };
>>>>>>
>>>>>> /*
>>>>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>>>>> index d04c7ad..e373245 100644
>>>>>> --- a/arch/x86/kvm/svm.c
>>>>>> +++ b/arch/x86/kvm/svm.c
>>>>>> @@ -1495,7 +1495,7 @@ static int io_interception(struct vcpu_svm
>>>>>> *svm)
>>>>>> {
>>>>>> struct kvm_vcpu *vcpu =&svm->vcpu;
>>>>>> u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
>>>>>> - int size, in, string;
>>>>>> + int size, in, string, ret;
>>>>>> unsigned port;
>>>>>>
>>>>>> ++svm->vcpu.stat.io_exits;
>>>>>> @@ -1507,9 +1507,14 @@ static int io_interception(struct vcpu_svm
>>>>>> *svm)
>>>>>> port = io_info>> 16;
>>>>>> size = (io_info& SVM_IOIO_SIZE_MASK)>> SVM_IOIO_SIZE_SHIFT;
>>>>>> svm->next_rip = svm->vmcb->control.exit_info_2;
>>>>>> - skip_emulated_instruction(&svm->vcpu);
>>>>>>
>>>>>> - return kvm_fast_pio_out(vcpu, size, port);
>>>>>> + ret = kvm_fast_pio_out(vcpu, size, port);
>>>>>> + if (ret)
>>>>>> + skip_emulated_instruction(&svm->vcpu);
>>>>>> + else
>>>>>> + vcpu->arch.pio.lazy_skip = true;
>>>>>> +
>>>>>> + return ret;
>>>>>> }
>>>>>>
>>>>>> static int nmi_interception(struct vcpu_svm *svm)
>>>>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>>>>> index 41e63bb..09052d6 100644
>>>>>> --- a/arch/x86/kvm/vmx.c
>>>>>> +++ b/arch/x86/kvm/vmx.c
>>>>>> @@ -2975,7 +2975,7 @@ static int handle_triple_fault(struct kvm_vcpu
>>>>>> *vcpu)
>>>>>> static int handle_io(struct kvm_vcpu *vcpu)
>>>>>> {
>>>>>> unsigned long exit_qualification;
>>>>>> - int size, in, string;
>>>>>> + int size, in, string, ret;
>>>>>> unsigned port;
>>>>>>
>>>>>> exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
>>>>>> @@ -2989,9 +2989,14 @@ static int handle_io(struct kvm_vcpu *vcpu)
>>>>>>
>>>>>> port = exit_qualification>> 16;
>>>>>> size = (exit_qualification& 7) + 1;
>>>>>> - skip_emulated_instruction(vcpu);
>>>>>>
>>>>>> - return kvm_fast_pio_out(vcpu, size, port);
>>>>>> + ret = kvm_fast_pio_out(vcpu, size, port);
>>>>>> + if (ret)
>>>>>> + skip_emulated_instruction(vcpu);
>>>>>> + else
>>>>>> + vcpu->arch.pio.lazy_skip = true;
>>>>>> +
>>>>>> + return ret;
>>>>>> }
>>>>>>
>>>>>> static void
>>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>>>> index fd5c3d3..cc308d2 100644
>>>>>> --- a/arch/x86/kvm/x86.c
>>>>>> +++ b/arch/x86/kvm/x86.c
>>>>>> @@ -4544,6 +4544,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu
>>>>>> *vcpu, struct kvm_run *kvm_run)
>>>>>> if (!irqchip_in_kernel(vcpu->kvm))
>>>>>> kvm_set_cr8(vcpu, kvm_run->cr8);
>>>>>>
>>>>>> + if (vcpu->arch.pio.lazy_skip)
>>>>>> + kvm_x86_ops->skip_emulated_instruction(vcpu);
>>>>>> + vcpu->arch.pio.lazy_skip = false;
>>>>>> +
>>>>>> if (vcpu->arch.pio.count || vcpu->mmio_needed ||
>>>>>> vcpu->arch.emulate_ctxt.restart) {
>>>>>> if (vcpu->mmio_needed) {
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>
>
>
>
Fernando Luis Vázquez Cao - April 23, 2010, 7:36 a.m.
On 04/23/2010 02:17 PM, Yoshiaki Tamura wrote:
> Dor Laor wrote:
[...]
>> Second, even if it wasn't the case, the tsc delta and kvmclock are
>> synchronized as part of the VM state so there is no use of trapping it
>> in the middle.
> 
> I should study the clock in KVM, but won't tsc get updated by the HW
> after migration?
> I was wondering the following case for example:
> 
> 1. The application on the guest calls rdtsc on host A.
> 2. The application uses rdtsc value for something.
> 3. Failover to host B.
> 4. The application on the guest replays the rdtsc call on host B.
> 5. If the rdtsc value is different between A and B, the application may
> get into trouble because of it.

Regarding the TSC, we need to guarantee that the guest sees a monotonic
TSC after migration, which can be achieved by adjusting the TSC offset properly.
Besides, we also need to trap TSC reads, so that we can tackle the case where the
primary node and the standby node have different TSC frequencies.
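The frequency case Fernando mentions could look roughly like this: with RDTSC trapped, the handler rescales the cycles elapsed on the new host so the guest keeps observing a TSC ticking at the original host's rate. This is a sketch under our own naming, not KVM code, and a real implementation would need 128-bit intermediate math to avoid overflow on large deltas.

```c
#include <stdint.h>

/* Hypothetical trapped-RDTSC handler: host A ran the guest TSC at
 * freq_a Hz, host B's TSC ticks at freq_b Hz.  base_host_tsc and
 * base_guest_tsc are the host/guest TSC values captured at failover.
 * Rescale B-cycles to A-cycles before adding them to the guest base. */
static inline uint64_t trapped_rdtsc(uint64_t host_tsc,
                                     uint64_t base_host_tsc,
                                     uint64_t base_guest_tsc,
                                     uint64_t freq_a, uint64_t freq_b)
{
    uint64_t elapsed_b = host_tsc - base_host_tsc;
    /* Overflows for very large elapsed_b * freq_a; illustrative only. */
    return base_guest_tsc + elapsed_b * freq_a / freq_b;
}
```

The cost, of course, is a VM exit on every RDTSC, which is why the cover letter proposes doing TSC synchronization lazily.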
Anthony Liguori - April 23, 2010, 1:10 p.m.
On 04/22/2010 07:45 PM, Yoshiaki Tamura wrote:
> Anthony Liguori wrote:
>
>> I think it would make sense to separate out the things that are actually
>> optimizations (like the dirty bitmap changes and the writev/readv
>> changes) and to attempt to justify them with actual performance data.
>
> I agree with the separation plan.
>
> For dirty bitmap change, Avi and I discussed on patchset for upsream 
> QEMU while you were offline (Sorry, if I was wrong).  Could you also 
> take a look?

Yes, I've seen it and I don't disagree.  That said, there ought to be 
perf data in the commit log so that down the road, the justification is 
understood.

> http://lists.gnu.org/archive/html/qemu-devel/2010-04/msg01396.html
>
> Regarding writev, I agree that it should be backed with actual data, 
> otherwise it should be removed.  We attemped to do everything that may 
> reduce the overhead of the transaction.
>
>> I'd prefer not to modify the live migration protocol ABI and it doesn't
>> seem to be necessary if we're willing to add options to the -incoming
>> flag. We also want to be a bit more generic with respect to IO.
>
> I totally agree with your approach not to change the protocol ABI.  
> Can we add an option to -incoming?  Like, -incoming ft_mode, for example
> Regarding the IO, let me reply to the next message.
>
>> Otherwise, the series looks very close to being mergable.
>
> Thank you for your comment on each patch.
>
> To be honest, I wasn't that confident because I'm a newbie to KVM/QEMU 
> and struggled for how to implement in an acceptable way.

The series looks very good.  I'm eager to see this functionality merged.

Regards,

Anthony Liguori

> Thanks,
>
> Yoshi
>
>>
>> Regards,
>>
>> Anthony Liguori
>>
>>
>>
>
Anthony Liguori - April 23, 2010, 1:20 p.m.
On 04/22/2010 08:53 PM, Yoshiaki Tamura wrote:
> Anthony Liguori wrote:
>> On 04/22/2010 08:16 AM, Yoshiaki Tamura wrote:
>>> 2010/4/22 Dor Laor<dlaor@redhat.com>:
>>>> On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:
>>>>> Dor Laor wrote:
>>>>>> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> We have been implementing the prototype of Kemari for KVM, and 
>>>>>>> we're
>>>>>>> sending
>>>>>>> this message to share what we have now and TODO lists. 
>>>>>>> Hopefully, we
>>>>>>> would like
>>>>>>> to get early feedback to keep us in the right direction. Although
>>>>>>> advanced
>>>>>>> approaches in the TODO lists are fascinating, we would like to run
>>>>>>> this project
>>>>>>> step by step while absorbing comments from the community. The 
>>>>>>> current
>>>>>>> code is
>>>>>>> based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.
>>>>>>>
>>>>>>> For those who are new to Kemari for KVM, please take a look at the
>>>>>>> following RFC which we posted last year.
>>>>>>>
>>>>>>> http://www.mail-archive.com/kvm@vger.kernel.org/msg25022.html
>>>>>>>
>>>>>>> The transmission/transaction protocol, and most of the control
>>>>>>> logic is
>>>>>>> implemented in QEMU. However, we needed a hack in KVM to prevent 
>>>>>>> rip
>>>>>>> from
>>>>>>> proceeding before synchronizing VMs. It may also need some
>>>>>>> plumbing in
>>>>>>> the
>>>>>>> kernel side to guarantee replayability of certain events and
>>>>>>> instructions,
>>>>>>> integrate the RAS capabilities of newer x86 hardware with the HA
>>>>>>> stack, as well
>>>>>>> as for optimization purposes, for example.
>>>>>> [ snap]
>>>>>>
>>>>>>> The rest of this message describes TODO lists grouped by each 
>>>>>>> topic.
>>>>>>>
>>>>>>> === event tapping ===
>>>>>>>
>>>>>>> Event tapping is the core component of Kemari, and it decides on
>>>>>>> which
>>>>>>> event the
>>>>>>> primary should synchronize with the secondary. The basic assumption
>>>>>>> here is
>>>>>>> that outgoing I/O operations are idempotent, which is usually true
>>>>>>> for
>>>>>>> disk I/O
>>>>>>> and reliable network protocols such as TCP.
>>>>>> IMO any type of network even should be stalled too. What if the VM
>>>>>> runs
>>>>>> non tcp protocol and the packet that the master node sent reached 
>>>>>> some
>>>>>> remote client and before the sync to the slave the master failed?
>>>>> In current implementation, it is actually stalling any type of 
>>>>> network
>>>>> that goes through virtio-net.
>>>>>
>>>>> However, if the application was using unreliable protocols, it should
>>>>> have its own recovering mechanism, or it should be completely
>>>>> stateless.
>>>> Why do you treat tcp differently? You can damage the entire VM this
>>>> way -
>>>> think of dhcp request that was dropped on the moment you switched
>>>> between
>>>> the master and the slave?
>>> I'm not trying to say that we should treat tcp differently, but just
>>> it's severe.
>>> In case of dhcp request, the client would have a chance to retry after
>>> failover, correct?
>>> BTW, in current implementation,
>>
>> I'm slightly confused about the current implementation vs. my
>> recollection of the original paper with Xen. I had thought that all disk
>> and network I/O was buffered in such a way that at each checkpoint, the
>> I/O operations would be released in a burst. Otherwise, you would have
>> to synchronize after every I/O operation which is what it seems the
>> current implementation does.
>
> Yes, you're almost right.
> It's synchronizing before QEMU starts emulating I/O at each device model.

If NodeA is the master and NodeB is the slave, if NodeA sends a network 
packet, you'll checkpoint before the packet is actually sent, and then 
if a failure occurs before the next checkpoint, won't that result in 
both NodeA and NodeB sending out a duplicate version of the packet?

Regards,

Anthony Liguori
Avi Kivity - April 23, 2010, 1:24 p.m.
On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
> Kemari starts synchronizing VMs when QEMU handles I/O requests.
> Without this patch VCPU state is already proceeded before
> synchronization, and after failover to the VM on the receiver, it
> hangs because of this.
>    

We discussed moving the barrier to the actual output device, instead of 
the I/O port.  This allows you to complete the I/O transaction before 
starting synchronization.

Does it not work for some reason?
Jamie Lokier - April 23, 2010, 3:07 p.m.
Yoshiaki Tamura wrote:
> Jamie Lokier wrote:
> >Yoshiaki Tamura wrote:
> >>Dor Laor wrote:
> >>>On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
> >>>>Event tapping is the core component of Kemari, and it decides on which
> >>>>event the
> >>>>primary should synchronize with the secondary. The basic assumption
> >>>>here is
> >>>>that outgoing I/O operations are idempotent, which is usually true for
> >>>>disk I/O
> >>>>and reliable network protocols such as TCP.
> >>>
> >>>IMO any type of network even should be stalled too. What if the VM runs
> >>>non tcp protocol and the packet that the master node sent reached some
> >>>remote client and before the sync to the slave the master failed?
> >>
> >>In current implementation, it is actually stalling any type of network
> >>that goes through virtio-net.
> >>
> >>However, if the application was using unreliable protocols, it should have
> >>its own recovering mechanism, or it should be completely stateless.
> >
> >Even with unreliable protocols, if slave takeover causes the receiver
> >to have received a packet that the sender _does not think it has ever
> >sent_, expect some protocols to break.
> >
> >If the slave replaying master's behaviour since the last sync means it
> >will definitely get into the same state of having sent the packet,
> >that works out.
> 
> That's something we're expecting now.
> 
> >But you still have to be careful that the other end's responses to
> >that packet are not seen by the slave too early during that replay.
> >Otherwise, for example, the slave may observe a TCP ACK to a packet
> >that it hasn't yet sent, which is an error.
> 
> Even though the current implementation syncs just before network output, what
> you pointed out could happen.  In this case, would the connection be lost, or
> would the client/server recover from it?  If the latter, it would be fine;
> otherwise I wonder how people doing similar things handle this situation.

In the case of TCP in a "synchronised state", I think it will recover
according to the rules in RFC793.  In an "unsynchronised state"
(during connection), I'm not sure if it recovers or if it looks like a
"Connection reset" error.  I suspect it does recover but I'm not certain.

But that's TCP.  Other protocols, such as over UDP, may behave
differently, because this is not an anticipated behaviour of a
network.

> >However there is one respect in which they're not idempotent:
> >
> >The TTL field should be decreased if packets are delayed.  Packets
> >should not appear to live in the network for longer than TTL seconds.
> >If they do, some protocols (like TCP) can react to the delayed ones
> >differently, such as sending a RST packet and breaking a connection.
> >
> >It is acceptable to reduce TTL faster than the minimum.  After all, it
> >is reduced by 1 on every forwarding hop, in addition to time delays.
> 
> So the problem is that when the slave takes over, it sends a packet with the
> same TTL as one the client may already have received.

Yes.  I guess this is a general problem with time-based protocols and
virtual machines getting stopped for 1 minute (say), without knowing
that real time has moved on for the other nodes.

Some application transaction, caching and locking protocols will give
wrong results when their time assumptions are discontinuous to such a
large degree.  It's a bit nasty to impose that on them after they
worked so hard on their reliability :-)

However, I think such implementations _could_ be made safe if those
programs can arrange to definitely be interrupted with a signal when
the discontinuity happens.  Of course, only if they're aware they may
be running on a Kemari system...

I have an intuitive idea that there is a solution to that, but each
time I try to write the next paragraph explaining it, some little
complication crops up and it needs more thought.  Something about
concurrent, asynchronous transactions to keep the master running while
recording the minimum states that replay needs to be safe, while
slewing the replaying slave's virtual clock back to real time quickly
during recovery mode.

-- Jamie
Dor Laor - April 25, 2010, 9:52 p.m.
On 04/23/2010 10:36 AM, Fernando Luis Vázquez Cao wrote:
> On 04/23/2010 02:17 PM, Yoshiaki Tamura wrote:
>> Dor Laor wrote:
> [...]
>>> Second, even if it wasn't the case, the tsc delta and kvmclock are
>>> synchronized as part of the VM state so there is no use of trapping it
>>> in the middle.
>>
>> I should study the clock in KVM, but won't tsc get updated by the HW
>> after migration?
>> I was wondering the following case for example:
>>
>> 1. The application on the guest calls rdtsc on host A.
>> 2. The application uses rdtsc value for something.
>> 3. Failover to host B.
>> 4. The application on the guest replays the rdtsc call on host B.
>> 5. If the rdtsc value is different between A and B, the application may
>> get into trouble because of it.
>
> Regarding the TSC, we need to guarantee that the guest sees a monotonic
> TSC after migration, which can be achieved by adjusting the TSC offset properly.
> Besides, we also need a trapping TSC, so that we can tackle the case where the
> primary node and the standby node have different TSC frequencies.

You're right, but this is already taken care of by the normal save/restore 
process.  Check the void kvm_load_tsc(CPUState *env) function.
Yoshiaki Tamura - April 26, 2010, 10:44 a.m.
Anthony Liguori wrote:
> On 04/22/2010 08:53 PM, Yoshiaki Tamura wrote:
>> Anthony Liguori wrote:
>>> On 04/22/2010 08:16 AM, Yoshiaki Tamura wrote:
>>>> [...]
>>>
>>> I'm slightly confused about the current implementation vs. my
>>> recollection of the original paper with Xen. I had thought that all disk
>>> and network I/O was buffered in such a way that at each checkpoint, the
>>> I/O operations would be released in a burst. Otherwise, you would have
>>> to synchronize after every I/O operation which is what it seems the
>>> current implementation does.
>>
>> Yes, you're almost right.
>> It's synchronizing before QEMU starts emulating I/O at each device model.
>
> If NodeA is the master and NodeB is the slave, if NodeA sends a network
> packet, you'll checkpoint before the packet is actually sent, and then
> if a failure occurs before the next checkpoint, won't that result in
> both NodeA and NodeB sending out a duplicate version of the packet?

Yes.  But I think it's better than taking the checkpoint afterwards.

If we took the checkpoint after sending the packet, say NodeA sent a TCP ACK to 
the client, and a hardware failure occurred on NodeA during the transaction *but 
the client received the TCP ACK*, NodeB would resume from the previous state and 
might still need to receive some data from the client.  However, because the 
client has already received the TCP ACK, it won't resend that data to NodeB.  It 
looks like this data would be dropped.

Anyway, I've just started planning to move the sync point to the network/block 
layer, and I will post the result for discussion again.
Yoshiaki Tamura - April 26, 2010, 10:44 a.m.
Avi Kivity wrote:
> On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:
>> Kemari starts synchronizing VMs when QEMU handles I/O requests.
>> Without this patch VCPU state is already proceeded before
>> synchronization, and after failover to the VM on the receiver, it
>> hangs because of this.
>
> We discussed moving the barrier to the actual output device, instead of
> the I/O port. This allows you to complete the I/O transaction before
> starting synchronization.
>
> Does it not work for some reason?

Sorry, I've just started working on that.
I've posted this series to share what I have done so far.
Thanks for looking.

Patch

=== event tapping === 

Event tapping is the core component of Kemari, and it decides on which event the
primary should synchronize with the secondary.  The basic assumption here is
that outgoing I/O operations are idempotent, which is usually true for disk I/O
and reliable network protocols such as TCP.

As discussed in the following thread, we may need to reconsider how and when to
start VM synchronization.

http://www.mail-archive.com/kvm@vger.kernel.org/msg31908.html

We would like to get as much feedback as possible on the current implementation
before moving on to the next approach.

TODO:
 - virtio polling
 - support for asynchronous I/O methods (eventfd)
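
As an illustration of the event-tapping idea described above, the following is
a minimal C sketch: before a device model emits outgoing I/O, the tap forces a
checkpoint so that, on failure, the secondary resumes from a state that
precedes the (idempotent) operation.  Names such as kemari_checkpoint() and
kemari_event_tap() are hypothetical, not the actual QEMU/Kemari API.

```c
#include <stdbool.h>
#include <stdio.h>

static int checkpoints_taken;   /* counts synchronization points taken */

/* Stand-in for transferring dirty pages and device state to the secondary;
 * returns true once the secondary has acknowledged the epoch. */
static bool kemari_checkpoint(void)
{
    checkpoints_taken++;
    return true;
}

/* Tap invoked on the primary just before an outgoing I/O event is emulated. */
static bool kemari_event_tap(const char *event)
{
    if (!kemari_checkpoint()) {
        fprintf(stderr, "sync failed before %s; expecting failover\n", event);
        return false;
    }
    return true;                /* safe to let the device model proceed */
}

/* Example outgoing path: a (simplified) virtio-net transmit. */
static int net_tx(const char *pkt)
{
    (void)pkt;
    if (!kemari_event_tap("virtio-net tx"))
        return -1;
    /* ... the device model would actually emit the packet here ... */
    return 0;
}
```

Note that, as discussed in the thread above, checkpointing before the send
means a failover can emit the same packet twice, which is why idempotence of
the outgoing operation matters.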

=== sender / receiver ===

To synchronize virtual machines, all the dirty pages since the last
synchronization point and the state of the VCPU and the virtual devices are sent
to the fallback node from the user-space QEMU process.

TODO:
 - Asynchronous VM transfer / pipelining (needed for SMP)
 - Zero copy VM transfer
 - VM transfer w/ RDMA
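
The sender side can be pictured with the small C sketch below: at each
synchronization point, the pages flagged in a dirty bitmap since the last
point are serialized, followed by a stand-in for the VCPU/device state blob.
This is only an illustration with toy sizes, not the actual QEMU migration
code; struct vm_epoch and send_epoch() are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NPAGES    8
#define PAGE_SIZE 16            /* tiny pages, just for the example */

struct vm_epoch {
    uint8_t  mem[NPAGES][PAGE_SIZE];
    uint8_t  dirty[NPAGES];     /* dirty bitmap since the last sync point */
    uint32_t vcpu_state;        /* stand-in for the real VCPU/device state */
};

/* Copy dirty pages plus the state blob into 'wire', clear the bitmap,
 * and return the number of bytes that would be sent to the fallback node. */
static size_t send_epoch(struct vm_epoch *vm, uint8_t *wire, size_t cap)
{
    size_t off = 0;

    for (int i = 0; i < NPAGES; i++) {
        if (!vm->dirty[i])
            continue;
        if (off + PAGE_SIZE > cap)
            break;
        memcpy(wire + off, vm->mem[i], PAGE_SIZE);
        off += PAGE_SIZE;
        vm->dirty[i] = 0;       /* page is clean for the next epoch */
    }
    memcpy(wire + off, &vm->vcpu_state, sizeof(vm->vcpu_state));
    off += sizeof(vm->vcpu_state);
    return off;
}

/* Example: two dirty pages plus the 4-byte state stand-in. */
static size_t demo(void)
{
    struct vm_epoch vm = {0};
    uint8_t wire[256];

    vm.dirty[1] = 1;
    vm.dirty[4] = 1;
    vm.vcpu_state = 7;
    return send_epoch(&vm, wire, sizeof(wire));
}
```

The TODO items above (pipelining, zero copy, RDMA) would all change how the
'wire' buffer is produced and transferred, not this basic dirty-page model.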

=== storage ===

Although Kemari needs some kind of shared storage, many users would rather avoid
it and expect to use Kemari in conjunction with software storage replication.

TODO:
 - Integration with other non-shared disk cluster storage solutions
   such as DRBD (might need changes to guarantee storage data
   consistency at Kemari synchronization points).
 - Integration with QEMU's block live migration functionality for
   non-shared disk configurations.

=== integration with HA stack (Pacemaker/Corosync) ===

The failover process kicks in whenever a failure in the primary node is
detected.  For Kemari for Xen, we have already finished the RA for Heartbeat,
and we are planning to integrate Kemari for KVM with the newer HA stacks
(Pacemaker, RHCS, etc.).

Ideally, we would like to leverage the hardware failure detection
capabilities of newer x86 hardware to trigger failover, the idea
being that transferring control to the fallback node proactively
when a problem is detected is much faster than relying on the polling
mechanisms used by most HA software.

TODO:
 - RA for Pacemaker.
 - Consider both HW failure and SW failure scenarios (failover
   between Kemari clusters).
 - Make the necessary changes to Pacemaker/Corosync to support
   event(HW failure, etc)-driven failover.
 - Take advantage of the RAS capabilities of newer CPUs/motherboards
   such as MCE to trigger failover.
 - Detect failures in I/O devices (block I/O errors, etc).

=== clock ===

Since synchronizing the virtual machines every time the TSC is accessed would be
prohibitive, the transmission of the TSC will be done lazily, which means
delaying it until a non-TSC synchronization point arrives.

TODO:
 - Synchronization of clock sources (need to intercept TSC reads, etc).
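
The lazy-TSC idea can be sketched as follows: on a (hypothetical) trapped TSC
read, the value handed to the guest is only recorded, and it is piggybacked on
the next ordinary synchronization point instead of forcing a sync of its own.
The function names here (guest_rdtsc(), sync_point()) are illustrative, not
the actual KVM/Kemari interfaces.

```c
#include <stdbool.h>
#include <stdint.h>

static uint64_t last_tsc;        /* last TSC value handed to the guest */
static bool     tsc_pending;     /* true if a TSC read awaits transmission */
static uint64_t transmitted_tsc; /* value shipped with the last checkpoint */

/* Trapped TSC read: remember the value, but do not synchronize yet. */
static uint64_t guest_rdtsc(uint64_t host_tsc)
{
    last_tsc = host_tsc;
    tsc_pending = true;
    return host_tsc;
}

/* Called at the next non-TSC synchronization point: piggyback the most
 * recent TSC value on the checkpoint that is being taken anyway. */
static void sync_point(void)
{
    if (tsc_pending) {
        transmitted_tsc = last_tsc;
        tsc_pending = false;
    }
}

/* Example: two reads collapse into a single transmitted value. */
static uint64_t demo_lazy(void)
{
    guest_rdtsc(100);
    guest_rdtsc(200);
    sync_point();
    return transmitted_tsc;
}
```

Only the most recent value needs to cross the wire, which is what makes the
lazy approach cheap; intercepting the reads in the first place is the open
TODO item above.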

=== usability ===

These are the items that define how users interact with Kemari.

TODO:
 - Kemarid daemon that takes care of the cluster management/monitoring
   side of things.
 - Some device emulators might need minor modifications to work well
   with Kemari.  Use white(black)-listing to take the burden of
   choosing the right device model off the users.

=== optimizations ===

Although the big picture can be realized by completing the TODO list above, we
need some optimizations/enhancements to make Kemari useful in the real world,
and these are the items that need to be done for that.

TODO:
 - SMP (for the sake of performance might need to implement a
   synchronization protocol that can maintain two or more
   synchronization points active at any given moment)
 - VGA (leverage VNC's sub-tiling mechanism to identify fb pages that
   are really dirty).
 

Any comments/suggestions would be greatly appreciated.

Thanks,

Yoshi

--

Kemari starts synchronizing VMs when QEMU handles I/O requests.
Without this patch the VCPU state has already advanced before
synchronization, and after failover to the VM on the receiver, it
hangs because of this.

Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
 arch/x86/include/asm/kvm_host.h |    1 +
 arch/x86/kvm/svm.c              |   11 ++++++++---
 arch/x86/kvm/vmx.c              |   11 ++++++++---
 arch/x86/kvm/x86.c              |    4 ++++
 4 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 26c629a..7b8f514 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -227,6 +227,7 @@  struct kvm_pio_request {
 	int in;
 	int port;
 	int size;
+	bool lazy_skip;
 };
 
 /*
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index d04c7ad..e373245 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1495,7 +1495,7 @@  static int io_interception(struct vcpu_svm *svm)
 {
 	struct kvm_vcpu *vcpu = &svm->vcpu;
 	u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
-	int size, in, string;
+	int size, in, string, ret;
 	unsigned port;
 
 	++svm->vcpu.stat.io_exits;
@@ -1507,9 +1507,14 @@  static int io_interception(struct vcpu_svm *svm)
 	port = io_info >> 16;
 	size = (io_info & SVM_IOIO_SIZE_MASK) >> SVM_IOIO_SIZE_SHIFT;
 	svm->next_rip = svm->vmcb->control.exit_info_2;
-	skip_emulated_instruction(&svm->vcpu);
 
-	return kvm_fast_pio_out(vcpu, size, port);
+	ret = kvm_fast_pio_out(vcpu, size, port);
+	if (ret)
+		skip_emulated_instruction(&svm->vcpu);
+	else
+		vcpu->arch.pio.lazy_skip = true;
+
+	return ret;
 }
 
 static int nmi_interception(struct vcpu_svm *svm)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 41e63bb..09052d6 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2975,7 +2975,7 @@  static int handle_triple_fault(struct kvm_vcpu *vcpu)
 static int handle_io(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification;
-	int size, in, string;
+	int size, in, string, ret;
 	unsigned port;
 
 	exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
@@ -2989,9 +2989,14 @@  static int handle_io(struct kvm_vcpu *vcpu)
 
 	port = exit_qualification >> 16;
 	size = (exit_qualification & 7) + 1;
-	skip_emulated_instruction(vcpu);
 
-	return kvm_fast_pio_out(vcpu, size, port);
+	ret = kvm_fast_pio_out(vcpu, size, port);
+	if (ret)
+		skip_emulated_instruction(vcpu);
+	else
+		vcpu->arch.pio.lazy_skip = true;
+
+	return ret;
 }
 
 static void
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fd5c3d3..cc308d2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4544,6 +4544,10 @@  int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	if (!irqchip_in_kernel(vcpu->kvm))
 		kvm_set_cr8(vcpu, kvm_run->cr8);
 
+	if (vcpu->arch.pio.lazy_skip)
+		kvm_x86_ops->skip_emulated_instruction(vcpu);
+	vcpu->arch.pio.lazy_skip = false;
+
 	if (vcpu->arch.pio.count || vcpu->mmio_needed ||
 	    vcpu->arch.emulate_ctxt.restart) {
 		if (vcpu->mmio_needed) {