
[RFC,v2,01/12] mc: add documentation for micro-checkpointing

Message ID 1392713429-18201-2-git-send-email-mrhines@linux.vnet.ibm.com
State New

Commit Message

mrhines@linux.vnet.ibm.com Feb. 18, 2014, 8:50 a.m. UTC
From: "Michael R. Hines" <mrhines@us.ibm.com>

Wiki: http://wiki.qemu.org/Features/MicroCheckpointing
Github: git@github.com:hinesmr/qemu.git, 'mc' branch

NOTE: This is a direct copy of the QEMU wiki page for the convenience
of the review process. Since this series is very much in flux, instead of
maintaining two copies of documentation in two different formats, this
documentation will be properly formatted in the future when the review
process has completed.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 docs/mc.txt | 222 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 222 insertions(+)
 create mode 100644 docs/mc.txt

Comments

Dr. David Alan Gilbert Feb. 18, 2014, 12:45 p.m. UTC | #1
* mrhines@linux.vnet.ibm.com (mrhines@linux.vnet.ibm.com) wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> Wiki: http://wiki.qemu.org/Features/MicroCheckpointing
> Github: git@github.com:hinesmr/qemu.git, 'mc' branch
> 
> NOTE: This is a direct copy of the QEMU wiki page for the convenience
> of the review process. Since this series very much in flux, instead of
> maintaing two copies of documentation in two different formats, this
> documentation will be properly formatted in the future when the review
> process has completed.

It seems to be picking up some truncations as well.

> +The Micro-Checkpointing Process
> +Basic Algorithm
> +Micro-Checkpoints (MC) work against the existing live migration path in QEMU, and can effectively be understood as a "live migration that never ends". As such, iteration rounds happen at the granularity of 10s of milliseconds and perform the following steps:
> +
> +1. After N milliseconds, stop the VM.
> +3. Generate a MC by invoking the live migration software path to identify and copy dirty memory into a local staging area inside QEMU.
> +4. Resume the VM immediately so that it can make forward progress.
> +5. Transmit the checkpoint to the destination.
> +6. Repeat
> +Upon failure, load the contents of the last MC at the destination back into memory and run the VM normally.

Later you talk about the memory allocation and how you grow the memory as needed
to fit the checkpoint - have you tried going the other way and triggering the
checkpoints sooner if they're taking too much memory?

> +1. MC over TCP/IP: Once the socket connection breaks, we assume
> failure. This happens very early in the loss of the latest MC not only
> because a very large amount of bytes is typically being sequenced in a
> TCP stream but perhaps also because of the timeout in acknowledgement
> of the receipt of a commit message by the destination.
> +
> +2. MC over RDMA: Since Infiniband does not provide any underlying
> timeout mechanisms, this implementation enhances QEMU's RDMA migration
> protocol to include a simple keep-alive. Upon the loss of multiple
> keep-alive messages, the sender is deemed to have failed.
> +
> +In both cases, either due to a failed TCP socket connection or lost RDMA keep-alive group, both the sender or the receiver can be deemed to have failed.
> +
> +If the sender is deemed to have failed, the destination takes over immediately using the contents of the last checkpoint.
> +
> +If the destination is deemed to be lost, we perform the same action
> as a live migration: resume the sender normally and wait for management
> software to make a policy decision about whether or not to re-protect
> the VM, which may involve a third-party to identify a new destination
>host again to use as a backup for the VM.

In this world what is making the decision about whether the sender/destination
should win - how do you avoid a split brain situation where both
VMs are running but the only thing that failed is the comms between them?
Is there any guarantee that you'll have received knowledge of the comms
failure before you pull the plug out and enable the corked packets to be
sent on the sender side?

<snip>

> +RDMA is used for two different reasons:
> +
> +1. Checkpoint generation (RDMA-based memcpy):
> +2. Checkpoint transmission
> +Checkpoint generation must be done while the VM is paused. In the
> worst case, the size of the checkpoint can be equal in size to the amount
> of memory in total use by the VM. In order to resume VM execution as
> fast as possible, the checkpoint is copied consistently locally into
> a staging area before transmission. A standard memcpy() of potentially
> such a large amount of memory not only gets no use out of the CPU cache
> but also potentially clogs up the CPU pipeline which would otherwise
> be useful by other neighbor VMs on the same physical node that could be
> scheduled for execution. To minimize the effect on neighbor VMs, we use
> RDMA to perform a "local" memcpy(), bypassing the host processor. On
> more recent processors, a 'beefy' enough memory bus architecture can
> move memory just as fast (sometimes faster) as a pure-software CPU-only
> optimized memcpy() from libc. However, on older computers, this feature
> only gives you the benefit of lower CPU-utilization at the expense of

Isn't there a generic kernel DMA ABI for doing this type of thing (I
think there was at one point, people have suggested things like using
graphics cards to do it but I don't know if it ever happened).
The other question is, do you always need to copy - what about something
like COWing the pages?

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
mrhines@linux.vnet.ibm.com Feb. 19, 2014, 1:40 a.m. UTC | #2
On 02/18/2014 08:45 PM, Dr. David Alan Gilbert wrote:
>> +The Micro-Checkpointing Process
>> +Basic Algorithm
>> +Micro-Checkpoints (MC) work against the existing live migration path in QEMU, and can effectively be understood as a "live migration that never ends". As such, iteration rounds happen at the granularity of 10s of milliseconds and perform the following steps:
>> +
>> +1. After N milliseconds, stop the VM.
>> +3. Generate a MC by invoking the live migration software path to identify and copy dirty memory into a local staging area inside QEMU.
>> +4. Resume the VM immediately so that it can make forward progress.
>> +5. Transmit the checkpoint to the destination.
>> +6. Repeat
>> +Upon failure, load the contents of the last MC at the destination back into memory and run the VM normally.
> Later you talk about the memory allocation and how you grow the memory as needed
> to fit the checkpoint, have you tried going the other way and triggering the
> checkpoints sooner if they're taking too much memory?

There is a 'knob' in this patch called "mc-set-delay" which was designed
to solve exactly that problem. It allows policy or management software
to make an independent decision about what the frequency of the
checkpoints should be.

I wasn't comfortable implementing policy directly inside the patch, as
that seemed less likely to be accepted quickly by the community.

>> +1. MC over TCP/IP: Once the socket connection breaks, we assume
>> failure. This happens very early in the loss of the latest MC not only
>> because a very large amount of bytes is typically being sequenced in a
>> TCP stream but perhaps also because of the timeout in acknowledgement
>> of the receipt of a commit message by the destination.
>> +
>> +2. MC over RDMA: Since Infiniband does not provide any underlying
>> timeout mechanisms, this implementation enhances QEMU's RDMA migration
>> protocol to include a simple keep-alive. Upon the loss of multiple
>> keep-alive messages, the sender is deemed to have failed.
>> +
>> +In both cases, either due to a failed TCP socket connection or lost RDMA keep-alive group, both the sender or the receiver can be deemed to have failed.
>> +
>> +If the sender is deemed to have failed, the destination takes over immediately using the contents of the last checkpoint.
>> +
>> +If the destination is deemed to be lost, we perform the same action
>> as a live migration: resume the sender normally and wait for management
>> software to make a policy decision about whether or not to re-protect
>> the VM, which may involve a third-party to identify a new destination
>> host again to use as a backup for the VM.
> In this world what is making the decision about whether the sender/destination
> should win - how do you avoid a split brain situation where both
> VMs are running but the only thing that failed is the comms between them?
> Is there any guarantee that you'll have received knowledge of the comms
> failure before you pull the plug out and enable the corked packets to be
> sent on the sender side?

Good question in general - I'll add it to the FAQ. The patch implements
a basic 'transaction' mechanism in coordination with an outbound I/O
buffer (documented further down). With these two things in
place, split-brain is not possible because the destination is not running.
We don't allow the destination to resume execution until a committed
transaction has been acknowledged by the destination, and only
then do we allow any outbound network traffic to be released to the
outside world.
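
To make the ordering concrete, one MC transaction looks roughly like this
(hypothetical helper names, purely for illustration - this is not code from
the patch):

    #include <stdbool.h>
    #include <stdio.h>

    /* Stand-ins for the real QEMU/migration machinery. */
    static void vm_stop(void)                   { puts("stop VM"); }
    static void vm_start(void)                  { puts("resume VM (outbound packets stay corked)"); }
    static void copy_dirty_ram_to_staging(void) { puts("generate MC into local staging"); }
    static void send_checkpoint(void)           { puts("transmit checkpoint (commit)"); }
    static bool wait_for_dest_ack(void)         { puts("wait for destination ACK"); return true; }
    static void uncork_buffered_packets(void)   { puts("release packets of the committed epoch"); }

    static bool mc_one_transaction(void)
    {
        vm_stop();
        copy_dirty_ram_to_staging();   /* done while the VM is paused          */
        vm_start();                    /* VM makes progress; traffic is corked */
        send_checkpoint();

        if (!wait_for_dest_ack()) {
            return false;              /* no ACK: nothing was uncorked, so the
                                          destination's last checkpoint is
                                          still consistent with the world      */
        }
        uncork_buffered_packets();     /* destination now holds this state     */
        return true;
    }

    int main(void)
    {
        return mc_one_transaction() ? 0 : 1;
    }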

> <snip>
>
>> +RDMA is used for two different reasons:
>> +
>> +1. Checkpoint generation (RDMA-based memcpy):
>> +2. Checkpoint transmission
>> +Checkpoint generation must be done while the VM is paused. In the
>> worst case, the size of the checkpoint can be equal in size to the amount
>> of memory in total use by the VM. In order to resume VM execution as
>> fast as possible, the checkpoint is copied consistently locally into
>> a staging area before transmission. A standard memcpy() of potentially
>> such a large amount of memory not only gets no use out of the CPU cache
>> but also potentially clogs up the CPU pipeline which would otherwise
>> be useful by other neighbor VMs on the same physical node that could be
>> scheduled for execution. To minimize the effect on neighbor VMs, we use
>> RDMA to perform a "local" memcpy(), bypassing the host processor. On
>> more recent processors, a 'beefy' enough memory bus architecture can
>> move memory just as fast (sometimes faster) as a pure-software CPU-only
>> optimized memcpy() from libc. However, on older computers, this feature
>> only gives you the benefit of lower CPU-utilization at the expense of
> Isn't there a generic kernel DMA ABI for doing this type of thing (I
> think there was at one point, people have suggested things like using
> graphics cards to do it but I don't know if it ever happened).
> The other question is, do you always need to copy - what about something
> like COWing the pages?

Excellent question! Responding in two parts:

1) The kernel ABI 'vmsplice' is what I think you're referring to. Correct
      me if I'm wrong, but vmsplice was actually designed to avoid copies
      entirely between two userspace programs to be able to move memory
      more efficiently - whereas a fault-tolerant system actually *needs*
      a copy to be made.

2) Using COW: Actually, I think that's an excellent idea. I've bounced that
      around with my colleagues, but we simply didn't have the manpower
      to implement it and benchmark it. There was also some concern about
      performance: Would the writable working set of the guest be so
      active/busy that COW would not get you much benefit? I think it's
      worth a try. Patches welcome =)
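
(For what it's worth, one quick-and-dirty way to prototype the COW idea -
purely illustrative, and not something this patch attempts - would be to let
fork() hand a copy-on-write view of the pages to a child that does the
transmission:)

    #include <sys/types.h>
    #include <unistd.h>

    /* Illustrative only: fork() gives the child a copy-on-write view of the
     * address space, so it could transmit a stable snapshot while the parent
     * (the running VM) keeps dirtying pages.  Real QEMU guest RAM is far more
     * complicated (vCPU threads, KVM fds, vhost), so treat this as a thought
     * experiment rather than a design. */
    static pid_t snapshot_and_send(void (*send_snapshot)(void))
    {
        pid_t child = fork();
        if (child == 0) {
            send_snapshot();   /* sees memory exactly as of the fork()     */
            _exit(0);
        }
        return child;          /* parent resumes at once; pages are COWed;
                                  it would reap the child with waitpid()   */
    }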

- Michael
Dr. David Alan Gilbert Feb. 19, 2014, 11:27 a.m. UTC | #3
* Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
> On 02/18/2014 08:45 PM, Dr. David Alan Gilbert wrote:
> >>+The Micro-Checkpointing Process
> >>+Basic Algorithm
> >>+Micro-Checkpoints (MC) work against the existing live migration path in QEMU, and can effectively be understood as a "live migration that never ends". As such, iteration rounds happen at the granularity of 10s of milliseconds and perform the following steps:
> >>+
> >>+1. After N milliseconds, stop the VM.
> >>+3. Generate a MC by invoking the live migration software path to identify and copy dirty memory into a local staging area inside QEMU.
> >>+4. Resume the VM immediately so that it can make forward progress.
> >>+5. Transmit the checkpoint to the destination.
> >>+6. Repeat
> >>+Upon failure, load the contents of the last MC at the destination back into memory and run the VM normally.
> >Later you talk about the memory allocation and how you grow the memory as needed
> >to fit the checkpoint, have you tried going the other way and triggering the
> >checkpoints sooner if they're taking too much memory?
> 
> There is a 'knob' in this patch called "mc-set-delay" which was designed
> to solve exactly that problem. It allows policy or management software
> to make an independent decision about what the frequency of the
> checkpoints should be.
> 
> I wasn't comfortable implementing policy directly inside the patch as
> that seemed less likely to get accepted by the community sooner.

I was just wondering if a separate 'max buffer size' knob would allow
you to more reasonably bound memory without setting policy; I don't think
people like having potentially x2 memory.

> >>+1. MC over TCP/IP: Once the socket connection breaks, we assume
> >>failure. This happens very early in the loss of the latest MC not only
> >>because a very large amount of bytes is typically being sequenced in a
> >>TCP stream but perhaps also because of the timeout in acknowledgement
> >>of the receipt of a commit message by the destination.
> >>+
> >>+2. MC over RDMA: Since Infiniband does not provide any underlying
> >>timeout mechanisms, this implementation enhances QEMU's RDMA migration
> >>protocol to include a simple keep-alive. Upon the loss of multiple
> >>keep-alive messages, the sender is deemed to have failed.
> >>+
> >>+In both cases, either due to a failed TCP socket connection or lost RDMA keep-alive group, both the sender or the receiver can be deemed to have failed.
> >>+
> >>+If the sender is deemed to have failed, the destination takes over immediately using the contents of the last checkpoint.
> >>+
> >>+If the destination is deemed to be lost, we perform the same action
> >>as a live migration: resume the sender normally and wait for management
> >>software to make a policy decision about whether or not to re-protect
> >>the VM, which may involve a third-party to identify a new destination
> >>host again to use as a backup for the VM.
> >In this world what is making the decision about whether the sender/destination
> >should win - how do you avoid a split brain situation where both
> >VMs are running but the only thing that failed is the comms between them?
> >Is there any guarantee that you'll have received knowledge of the comms
> >failure before you pull the plug out and enable the corked packets to be
> >sent on the sender side?
> 
> Good question in general - I'll add it to the FAQ. The patch implements
> a basic 'transaction' mechanism in coordination with an outbound I/O
> buffer (documented further down). With these two things in
> places, split-brain is not possible because the destination is not running.
> We don't allow the destination to resume execution until a committed
> transaction has been acknowledged by the destination and only until
> then do we allow any outbound network traffic to be release to the
> outside world.

Yeh I see the IO buffer, what I've not figured out is how:
  1) MC over TCP/IP gets an acknowledge on the source to know when
     it can unplug its buffer.
  2) Let's say the MC connection fails, so that ack never arrives,
     the source must assume the destination has failed and release its
     packets and carry on.
     The destination must assume the source has failed and take over.

     Now they're both running - and that's bad and it's standard
     split brain.
  3) If we're relying on TCP/IP timeout that's quite long.

> >>+RDMA is used for two different reasons:
> >>+
> >>+1. Checkpoint generation (RDMA-based memcpy):
> >>+2. Checkpoint transmission
> >>+Checkpoint generation must be done while the VM is paused. In the
> >>worst case, the size of the checkpoint can be equal in size to the amount
> >>of memory in total use by the VM. In order to resume VM execution as
> >>fast as possible, the checkpoint is copied consistently locally into
> >>a staging area before transmission. A standard memcpy() of potentially
> >>such a large amount of memory not only gets no use out of the CPU cache
> >>but also potentially clogs up the CPU pipeline which would otherwise
> >>be useful by other neighbor VMs on the same physical node that could be
> >>scheduled for execution. To minimize the effect on neighbor VMs, we use
> >>RDMA to perform a "local" memcpy(), bypassing the host processor. On
> >>more recent processors, a 'beefy' enough memory bus architecture can
> >>move memory just as fast (sometimes faster) as a pure-software CPU-only
> >>optimized memcpy() from libc. However, on older computers, this feature
> >>only gives you the benefit of lower CPU-utilization at the expense of
> >Isn't there a generic kernel DMA ABI for doing this type of thing (I
> >think there was at one point, people have suggested things like using
> >graphics cards to do it but I don't know if it ever happened).
> >The other question is, do you always need to copy - what about something
> >like COWing the pages?
> 
> Excellent question! Responding in two parts:
> 
> 1) The kernel ABI 'vmsplice' is what I think you're referring to. Correct
>      me if I'm wrong, but vmsplice was actually designed to avoid copies
>      entirely between two userspace programs to be able to move memory
>      more efficiently - whereas a fault tolerant system actually *needs*
>      copy to be made.

No, I wasn't thinking of vmsplice; I just have vague memories of suggestions
of the use of Intel's I/OAT, graphics cards, etc for doing things like page
zeroing and DMAing data around; I can see there is a dmaengine API in the
kernel, but I haven't found where, if anywhere, that is available to userspace.

> 2) Using COW: Actually, I think that's an excellent idea. I've bounced that
>      around with my colleagues, but we simply didn't have the manpower
>      to implement it and benchmark it. There was also some concern about
>      performance: Would the writable working set of the guest be so
> active/busy
>      that COW would not get you much benefit? I think it's worth a try.
>      Patches welcome =)

It's possible that might be doable with some of the same tricks I'm
looking at for post-copy, I'll see what I can do.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
mrhines@linux.vnet.ibm.com Feb. 20, 2014, 1:17 a.m. UTC | #4
On 02/19/2014 07:27 PM, Dr. David Alan Gilbert wrote:
>
> I was just wondering if a separate 'max buffer size' knob would allow
> you to more reasonably bound memory without setting policy; I don't think
> people like having potentially x2 memory.

Note: Checkpoint memory is not monotonic in this patchset (which
is unique to this implementation). Only if the guest actually dirties
100% of its memory between one checkpoint and the next will
the host experience 2x memory usage for a short period of time.

The patch has a 'slab' mechanism built in to it which implements
a water-mark style policy that throws away unused portions of
the 2x checkpoint memory if later checkpoints are much smaller
(which is likely to be the case if the writable working set size changes).

However, to answer your question: Such a knob could be added, but
the same effect could be achieved simply by tuning the checkpoint frequency
itself. Memory usage would thus be a function of the checkpoint frequency.

If the guest application was maniacal, banging away at all the memory,
there's very little that can be done in the first place, but if the
guest application was mildly busy, you don't want to throw away your
ability to be fault tolerant - you would just need more frequent
checkpoints to keep up with the dirty rate.

Once the application died down - the water-mark policy would kick in
and start freeing checkpoint memory. (Note: this policy happens on
both sides in the patchset because the patch has to be fully compatible
with RDMA memory pinning).

What is *not* exposed, however, are the watermark knobs themselves.
I definitely think those need to be exposed - that would also get you a
similar control to 'max buffer size' - you could place a time limit on
the slab list in the patch or something like that.
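
In code terms, the trimming is conceptually something like the following
(hypothetical structures and names - the patch's real slab code is different
and also has to deal with RDMA unpinning):

    #include <stddef.h>
    #include <stdlib.h>

    typedef struct MCSlab {
        struct MCSlab *next;
        size_t size;
        void *buf;
    } MCSlab;

    /* Free trailing slabs once total capacity exceeds a high-water mark
     * derived from what recent checkpoints actually needed. */
    static size_t mc_slab_trim(MCSlab **head, size_t high_water)
    {
        size_t kept = 0, freed = 0;
        MCSlab **pp = head;

        while (*pp) {
            if (kept + (*pp)->size > high_water) {
                MCSlab *victim = *pp;
                *pp = victim->next;      /* unlink and discard this slab  */
                freed += victim->size;
                free(victim->buf);
                free(victim);
            } else {
                kept += (*pp)->size;
                pp = &(*pp)->next;
            }
        }
        return freed;                    /* bytes handed back to the host */
    }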


>>
>> Good question in general - I'll add it to the FAQ. The patch implements
>> a basic 'transaction' mechanism in coordination with an outbound I/O
>> buffer (documented further down). With these two things in
>> places, split-brain is not possible because the destination is not running.
>> We don't allow the destination to resume execution until a committed
>> transaction has been acknowledged by the destination and only until
>> then do we allow any outbound network traffic to be release to the
>> outside world.
> Yeh I see the IO buffer, what I've not figured out is how:
>    1) MC over TCP/IP gets an acknowledge on the source to know when
>       it can unplug it's buffer.

Only partially correct (See the steps on the wiki). There are two I/O
buffers at any given time which protect against a split-brain scenario:
One buffer for the current checkpoint that is being generated (running VM)
and one buffer for the checkpoint that is being committed in a transaction.

>    2) Lets say the MC connection fails, so that ack never arrives,
>       the source must assume the destination has failed and release it's
>       packets and carry on.

Only the packets for Buffer A are released for the current committed
checkpoint after a completed transaction. The packets for Buffer B
(the current running VM) are still being held up until the next
transaction starts. Later, once the transaction completes and A is
released, B becomes the new A and a new buffer is installed to become
the new Buffer B for the current running VM.
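
Schematically (hypothetical names, not the patch's actual types):

    /* Two corked packet buffers exist at any time:
     *   A - packets generated during the epoch now being committed
     *   B - packets generated by the currently running VM            */
    typedef struct MCNetBuffer MCNetBuffer;   /* opaque corked-packet queue */

    typedef struct {
        MCNetBuffer *committed;   /* Buffer A */
        MCNetBuffer *current;     /* Buffer B */
    } MCBufferPair;

    /* Called only after BOTH sides have acknowledged the transaction. */
    static void mc_rotate_buffers(MCBufferPair *s,
                                  void (*release)(MCNetBuffer *),
                                  MCNetBuffer *(*alloc)(void))
    {
        release(s->committed);     /* A's packets finally reach the world */
        s->committed = s->current; /* B becomes the new A                 */
        s->current = alloc();      /* fresh B for the running VM          */
    }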


>       The destination must assume the source has failed and take over.

The destination must also receive an ACK. The ack goes both ways.

Only once the source and destination both acknowledge a completed
transaction does the source VM resume execution - and even then
its packets are still being buffered until the next transaction starts.
(That's why it's important to checkpoint as frequently as possible.)


>    3) If we're relying on TCP/IP timeout that's quite long.
>

Actually, my experience has been that TCP seems to have more than
one kind of timeout - if the receiver is not responding *at all*, it seems
that TCP has a dedicated timer for that. The socket API immediately
sends back an error code and the patchset closes the connection
on the destination and recovers.

> No, I wasn't thinking of vmsplice; I just have vague memories of suggestions
> of the use of Intel's I/OAT, graphics cards, etc for doing things like page
> zeroing and DMAing data around; I can see there is a dmaengine API in the
> kernel, I haven't found where if anywhere that is available to userspace.
>
>> 2) Using COW: Actually, I think that's an excellent idea. I've bounced that
>>       around with my colleagues, but we simply didn't have the manpower
>>       to implement it and benchmark it. There was also some concern about
>>       performance: Would the writable working set of the guest be so
>> active/busy
>>       that COW would not get you much benefit? I think it's worth a try.
>>       Patches welcome =)
> It's possible that might be doable with some of the same tricks I'm
> looking at for post-copy, I'll see what I can do.

That's great news - I'm very interested to see how this applies
to post-copy and any kind of patches.

- Michael
Dr. David Alan Gilbert Feb. 20, 2014, 10:09 a.m. UTC | #5
* Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
> On 02/19/2014 07:27 PM, Dr. David Alan Gilbert wrote:
> >
> >I was just wondering if a separate 'max buffer size' knob would allow
> >you to more reasonably bound memory without setting policy; I don't think
> >people like having potentially x2 memory.
> 
> Note: Checkpoint memory is not monotonic in this patchset (which
> is unique to this implementation). Only if the guest actually dirties
> 100% of it's memory between one checkpoint to the next will
> the host experience 2x memory usage for a short period of time.

Right, but that doesn't really help - if someone comes along and says
'How much memory do I need to be able to run an mc system?' the only
safe answer is 2x, otherwise we're adding a reason why the previously
stable guest might OOM.

> The patch has a 'slab' mechanism built in to it which implements
> a water-mark style policy that throws away unused portions of
> the 2x checkpoint memory if later checkpoints are much smaller
> (which is likely to be the case if the writable working set size changes).
> 
> However, to answer your question: Such a knob could be achieved, but
> the same could be achieved simply by tuning the checkpoint frequency
> itself. Memory usage would thus be a function of the checkpoint frequency.

> If the guest application was maniacal, banging away at all the memory,
> there's very little that can be done in the first place, but if the
> guest application
> was mildly busy, you don't want to throw away your ability to be fault
> tolerant - you would just need more frequent checkpoints to keep up with
> the dirty rate.

I'm not convinced; I can tune my checkpoint frequency until normal operation
makes a reasonable trade off between mc frequency and RAM usage,
but that doesn't prevent it running away when a garbage collect or some
other thing suddenly dirties a load of ram in one particular checkpoint.
Some management tool that watches ram usage etc can also help tune
it, but in the end it can't stop it taking loads of RAM.

> Once the application died down - the water-mark policy would kick in
> and start freeing checkpoint memory. (Note: this policy happens on
> both sides in the patchset because the patch has to be fully compatible
> with RDMA memory pinning).
> 
> What is *not* exposed, however, is the watermark knobs themselves,
> I definitely think that needs to be exposed - that would also get
> you a similar
> control to 'max buffer size' - you could place a time limit on the
> slab list in the patch or something like that.......
> 
> 
> >>
> >>Good question in general - I'll add it to the FAQ. The patch implements
> >>a basic 'transaction' mechanism in coordination with an outbound I/O
> >>buffer (documented further down). With these two things in
> >>places, split-brain is not possible because the destination is not running.
> >>We don't allow the destination to resume execution until a committed
> >>transaction has been acknowledged by the destination and only until
> >>then do we allow any outbound network traffic to be release to the
> >>outside world.
> >Yeh I see the IO buffer, what I've not figured out is how:
> >   1) MC over TCP/IP gets an acknowledge on the source to know when
> >      it can unplug it's buffer.
> 
> Only partially correct (See the steps on the wiki). There are two I/O
> buffers at any given time which protect against a split-brain scenario:
> One buffer for the current checkpoint that is being generated (running VM)
> and one buffer for the checkpoint that is being committed in a transaction.
> 
> >   2) Lets say the MC connection fails, so that ack never arrives,
> >      the source must assume the destination has failed and release it's
> >      packets and carry on.
> 
> Only the packets for Buffer A are released for the current committed
> checkpoint after a completed transaction. The packets for Buffer B
> (the current running VM) are still being held up until the next
> transaction starts.
> Later once the transaction completes and A is released, B becomes the
> new A and a new buffer is installed to become the new Buffer B for
> the current running VM.
> 
> 
> >      The destination must assume the source has failed and take over.
> 
> The destination must also receive an ACK. The ack goes both ways.
> 
> Once the source and destination both acknowledge a completed
> transation does the source VM resume execution - and even then
> it's packets are still being buffered until the next transaction starts.
> (That's why it's important to checkpoint as frequently as possible).

I think I understand normal operation - my question here is about failure;
what happens when neither side gets any ACKs.

> >   3) If we're relying on TCP/IP timeout that's quite long.
> >
> 
> Actually, my experience is been that TCP seems to have more than
> one kind of timeout - if receiver is not responding *at all* - it seems that
> TCP has a dedicated timer for that. The socket API immediately
> sends back an error code and the patchset closes the conneciton
> on the destination and recovers.

How did you test that?
My experience is that if a host knows that it has no route to the destination
(e.g. it has no route to try, that matches the destination, because someone
took the network interface away) you immediately get a 'no route to host',
however if an intermediate link disappears then it takes a while to time out.

> >No, I wasn't thinking of vmsplice; I just have vague memories of suggestions
> >of the use of Intel's I/OAT, graphics cards, etc for doing things like page
> >zeroing and DMAing data around; I can see there is a dmaengine API in the
> >kernel, I haven't found where if anywhere that is available to userspace.
> >
> >>2) Using COW: Actually, I think that's an excellent idea. I've bounced that
> >>      around with my colleagues, but we simply didn't have the manpower
> >>      to implement it and benchmark it. There was also some concern about
> >>      performance: Would the writable working set of the guest be so
> >>active/busy
> >>      that COW would not get you much benefit? I think it's worth a try.
> >>      Patches welcome =)
> >It's possible that might be doable with some of the same tricks I'm
> >looking at for post-copy, I'll see what I can do.
> 
> That's great news - I'm very interested to see how this applies
> to post-copy and any kind patches.
> 
> - Michael
> 

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
liguang Feb. 20, 2014, 11:14 a.m. UTC | #6
Dr. David Alan Gilbert wrote:
> * Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
>    
>> On 02/19/2014 07:27 PM, Dr. David Alan Gilbert wrote:
>>      
>>> I was just wondering if a separate 'max buffer size' knob would allow
>>> you to more reasonably bound memory without setting policy; I don't think
>>> people like having potentially x2 memory.
>>>        
>> Note: Checkpoint memory is not monotonic in this patchset (which
>> is unique to this implementation). Only if the guest actually dirties
>> 100% of it's memory between one checkpoint to the next will
>> the host experience 2x memory usage for a short period of time.
>>      
> Right, but that doesn't really help - if someone comes along and says
> 'How much memory do I need to be able to run an mc system?' the only
> safe answer is 2x, otherwise we're adding a reason why the previously
> stable guest might OOM.
>
>    

so we may have to involve some disk operations
to handle memory exhaustion.

Thanks!

>> The patch has a 'slab' mechanism built in to it which implements
>> a water-mark style policy that throws away unused portions of
>> the 2x checkpoint memory if later checkpoints are much smaller
>> (which is likely to be the case if the writable working set size changes).
>>
>> However, to answer your question: Such a knob could be achieved, but
>> the same could be achieved simply by tuning the checkpoint frequency
>> itself. Memory usage would thus be a function of the checkpoint frequency.
>>      
>    
>> If the guest application was maniacal, banging away at all the memory,
>> there's very little that can be done in the first place, but if the
>> guest application
>> was mildly busy, you don't want to throw away your ability to be fault
>> tolerant - you would just need more frequent checkpoints to keep up with
>> the dirty rate.
>>      
> I'm not convinced; I can tune my checkpoint frequency until normal operation
> makes a reasonable trade off between mc frequency and RAM usage,
> but that doesn't prevent it running away when a garbage collect or some
> other thing suddenly dirties a load of ram in one particular checkpoint.
> Some management tool that watches ram usage etc can also help tune
> it, but in the end it can't stop it taking loads of RAM.
>
>    
>> Once the application died down - the water-mark policy would kick in
>> and start freeing checkpoint memory. (Note: this policy happens on
>> both sides in the patchset because the patch has to be fully compatible
>> with RDMA memory pinning).
>>
>> What is *not* exposed, however, is the watermark knobs themselves,
>> I definitely think that needs to be exposed - that would also get
>> you a similar
>> control to 'max buffer size' - you could place a time limit on the
>> slab list in the patch or something like that.......
>>
>>
>>      
>>>> Good question in general - I'll add it to the FAQ. The patch implements
>>>> a basic 'transaction' mechanism in coordination with an outbound I/O
>>>> buffer (documented further down). With these two things in
>>>> places, split-brain is not possible because the destination is not running.
>>>> We don't allow the destination to resume execution until a committed
>>>> transaction has been acknowledged by the destination and only until
>>>> then do we allow any outbound network traffic to be release to the
>>>> outside world.
>>>>          
>>> Yeh I see the IO buffer, what I've not figured out is how:
>>>    1) MC over TCP/IP gets an acknowledge on the source to know when
>>>       it can unplug it's buffer.
>>>        
>> Only partially correct (See the steps on the wiki). There are two I/O
>> buffers at any given time which protect against a split-brain scenario:
>> One buffer for the current checkpoint that is being generated (running VM)
>> and one buffer for the checkpoint that is being committed in a transaction.
>>
>>      
>>>    2) Lets say the MC connection fails, so that ack never arrives,
>>>       the source must assume the destination has failed and release it's
>>>       packets and carry on.
>>>        
>> Only the packets for Buffer A are released for the current committed
>> checkpoint after a completed transaction. The packets for Buffer B
>> (the current running VM) are still being held up until the next
>> transaction starts.
>> Later once the transaction completes and A is released, B becomes the
>> new A and a new buffer is installed to become the new Buffer B for
>> the current running VM.
>>
>>
>>      
>>>       The destination must assume the source has failed and take over.
>>>        
>> The destination must also receive an ACK. The ack goes both ways.
>>
>> Once the source and destination both acknowledge a completed
>> transation does the source VM resume execution - and even then
>> it's packets are still being buffered until the next transaction starts.
>> (That's why it's important to checkpoint as frequently as possible).
>>      
> I think I understand normal operation - my question here is about failure;
> what happens when neither side gets any ACKs.
>
>    
>>>    3) If we're relying on TCP/IP timeout that's quite long.
>>>
>>>        
>> Actually, my experience is been that TCP seems to have more than
>> one kind of timeout - if receiver is not responding *at all* - it seems that
>> TCP has a dedicated timer for that. The socket API immediately
>> sends back an error code and the patchset closes the conneciton
>> on the destination and recovers.
>>      
> How did you test that?
> My experience is that if a host knows that it has no route to the destination
> (e.g. it has no route to try, that matches the destination, because someone
> took the network interface away) you immediately get a 'no route to host',
> however if an intermediate link disappears then it takes a while to time out.
>
>    
>>> No, I wasn't thinking of vmsplice; I just have vague memories of suggestions
>>> of the use of Intel's I/OAT, graphics cards, etc for doing things like page
>>> zeroing and DMAing data around; I can see there is a dmaengine API in the
>>> kernel, I haven't found where if anywhere that is available to userspace.
>>>
>>>        
>>>> 2) Using COW: Actually, I think that's an excellent idea. I've bounced that
>>>>       around with my colleagues, but we simply didn't have the manpower
>>>>       to implement it and benchmark it. There was also some concern about
>>>>       performance: Would the writable working set of the guest be so
>>>> active/busy
>>>>       that COW would not get you much benefit? I think it's worth a try.
>>>>       Patches welcome =)
>>>>          
>>> It's possible that might be doable with some of the same tricks I'm
>>> looking at for post-copy, I'll see what I can do.
>>>        
>> That's great news - I'm very interested to see how this applies
>> to post-copy and any kind patches.
>>
>> - Michael
>>
>>      
> Dave
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
>
>
mrhines@linux.vnet.ibm.com Feb. 20, 2014, 2:57 p.m. UTC | #7
On 02/20/2014 06:09 PM, Dr. David Alan Gilbert wrote:
> * Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
>> On 02/19/2014 07:27 PM, Dr. David Alan Gilbert wrote:
>>> I was just wondering if a separate 'max buffer size' knob would allow
>>> you to more reasonably bound memory without setting policy; I don't think
>>> people like having potentially x2 memory.
>> Note: Checkpoint memory is not monotonic in this patchset (which
>> is unique to this implementation). Only if the guest actually dirties
>> 100% of it's memory between one checkpoint to the next will
>> the host experience 2x memory usage for a short period of time.
> Right, but that doesn't really help - if someone comes along and says
> 'How much memory do I need to be able to run an mc system?' the only
> safe answer is 2x, otherwise we're adding a reason why the previously
> stable guest might OOM.

Yes, exactly. Running MC is expensive and will probably always be,
to some degree. Saving memory and having 100%
fault tolerance are at times mutually exclusive.
Expectations have to be managed here.

The bottom line is: if you put a *hard* constraint on memory usage,
what will happen to the guest when that garbage collection you mentioned
shows up later and runs for several minutes? How about an hour?
Are we just going to block the guest from being allowed to start a
checkpoint until the memory usage goes down just for the sake of avoiding
the 2x memory usage? If you block the guest from being checkpointed,
then what happens if there is a failure during that extended period?
We will have saved memory at the expense of availability.

The customer that is expecting 100% fault tolerance and the provider
who is supporting it need to have an understanding that fault tolerance
is not free and that constraining memory usage will adversely affect
the VM's ability to be protected.

Do I understand your expectations correctly? Is fault tolerance
something you're willing to sacrifice?

>> The patch has a 'slab' mechanism built in to it which implements
>> a water-mark style policy that throws away unused portions of
>> the 2x checkpoint memory if later checkpoints are much smaller
>> (which is likely to be the case if the writable working set size changes).
>>
>> However, to answer your question: Such a knob could be achieved, but
>> the same could be achieved simply by tuning the checkpoint frequency
>> itself. Memory usage would thus be a function of the checkpoint frequency.
>> If the guest application was maniacal, banging away at all the memory,
>> there's very little that can be done in the first place, but if the
>> guest application
>> was mildly busy, you don't want to throw away your ability to be fault
>> tolerant - you would just need more frequent checkpoints to keep up with
>> the dirty rate.
> I'm not convinced; I can tune my checkpoint frequency until normal operation
> makes a reasonable trade off between mc frequency and RAM usage,
> but that doesn't prevent it running away when a garbage collect or some
> other thing suddenly dirties a load of ram in one particular checkpoint.
> Some management tool that watches ram usage etc can also help tune
> it, but in the end it can't stop it taking loads of RAM.

That's correct. See above comment....

>
>> Once the application died down - the water-mark policy would kick in
>> and start freeing checkpoint memory. (Note: this policy happens on
>> both sides in the patchset because the patch has to be fully compatible
>> with RDMA memory pinning).
>>
>> What is *not* exposed, however, is the watermark knobs themselves,
>> I definitely think that needs to be exposed - that would also get
>> you a similar
>> control to 'max buffer size' - you could place a time limit on the
>> slab list in the patch or something like that.......
>>
>>
>>>> Good question in general - I'll add it to the FAQ. The patch implements
>>>> a basic 'transaction' mechanism in coordination with an outbound I/O
>>>> buffer (documented further down). With these two things in
>>>> places, split-brain is not possible because the destination is not running.
>>>> We don't allow the destination to resume execution until a committed
>>>> transaction has been acknowledged by the destination and only until
>>>> then do we allow any outbound network traffic to be release to the
>>>> outside world.
>>> Yeh I see the IO buffer, what I've not figured out is how:
>>>    1) MC over TCP/IP gets an acknowledge on the source to know when
>>>       it can unplug it's buffer.
>> Only partially correct (See the steps on the wiki). There are two I/O
>> buffers at any given time which protect against a split-brain scenario:
>> One buffer for the current checkpoint that is being generated (running VM)
>> and one buffer for the checkpoint that is being committed in a transaction.
>>
>>>    2) Lets say the MC connection fails, so that ack never arrives,
>>>       the source must assume the destination has failed and release it's
>>>       packets and carry on.
>> Only the packets for Buffer A are released for the current committed
>> checkpoint after a completed transaction. The packets for Buffer B
>> (the current running VM) are still being held up until the next
>> transaction starts.
>> Later once the transaction completes and A is released, B becomes the
>> new A and a new buffer is installed to become the new Buffer B for
>> the current running VM.
>>
>>
>>>       The destination must assume the source has failed and take over.
>> The destination must also receive an ACK. The ack goes both ways.
>>
>> Once the source and destination both acknowledge a completed
>> transation does the source VM resume execution - and even then
>> it's packets are still being buffered until the next transaction starts.
>> (That's why it's important to checkpoint as frequently as possible).
> I think I understand normal operation - my question here is about failure;
> what happens when neither side gets any ACKs.

Well, that's simple: If there is a failure of the source, the destination
will simply revert to the previous checkpoint using the same mode
of operation. The lost ACKs that you're curious about only
apply to the checkpoint that is in progress. Just because a
checkpoint is in progress does not mean that the previous checkpoint
is thrown away - it is already loaded into the destination's memory
and ready to be activated.

>
>>>    3) If we're relying on TCP/IP timeout that's quite long.
>>>
>> Actually, my experience is been that TCP seems to have more than
>> one kind of timeout - if receiver is not responding *at all* - it seems that
>> TCP has a dedicated timer for that. The socket API immediately
>> sends back an error code and the patchset closes the conneciton
>> on the destination and recovers.
> How did you test that?
> My experience is that if a host knows that it has no route to the destination
> (e.g. it has no route to try, that matches the destination, because someone
> took the network interface away) you immediately get a 'no route to host',
> however if an intermediate link disappears then it takes a while to time out.

We have a script architecture (not on github) which runs MC in a tight
loop hundreds of times, kills the source QEMU, and timestamps how quickly
the destination QEMU loses the TCP socket connection and receives an error
code from the kernel - every single time, the destination resumes nearly
instantaneously. I've not empirically seen a case where the socket just
hangs or doesn't change state.

I'm not very familiar with the internal Linux TCP/IP stack implementation
itself, but I have not had any problem with dependability - the Linux
socket layer has shut the connection down as soon as possible every time.

The RDMA implementation uses a manual keepalive mechanism that
I had to write from scratch - but I never ported this to the TCP
implementation simply because failure detection always worked fine
without it.
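
(If that ever becomes a problem, the TCP side could probably get an
equivalent of the RDMA keepalive almost for free from the standard socket
options - something along these lines; the patch does not currently do
this, and the exact values below are made up:)

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Illustrative only: bound how long a silent/hung peer can go undetected
     * on the MC socket, instead of waiting for the default TCP timeouts. */
    static int mc_tcp_tighten_timeouts(int fd)
    {
        int on = 1, idle = 2, intvl = 2, cnt = 3;
        unsigned int user_timeout_ms = 10000;   /* made-up value */

        if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0 ||
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0 ||
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0 ||
            setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) < 0) {
            return -1;
        }
        /* Linux-specific: fail pending writes after 10s without ACKs. */
        return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                          &user_timeout_ms, sizeof(user_timeout_ms));
    }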

- Michael
mrhines@linux.vnet.ibm.com Feb. 20, 2014, 2:58 p.m. UTC | #8
On 02/20/2014 07:14 PM, Li Guang wrote:
> Dr. David Alan Gilbert wrote:
>> * Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
>>> On 02/19/2014 07:27 PM, Dr. David Alan Gilbert wrote:
>>>> I was just wondering if a separate 'max buffer size' knob would allow
>>>> you to more reasonably bound memory without setting policy; I don't 
>>>> think
>>>> people like having potentially x2 memory.
>>> Note: Checkpoint memory is not monotonic in this patchset (which
>>> is unique to this implementation). Only if the guest actually dirties
>>> 100% of it's memory between one checkpoint to the next will
>>> the host experience 2x memory usage for a short period of time.
>> Right, but that doesn't really help - if someone comes along and says
>> 'How much memory do I need to be able to run an mc system?' the only
>> safe answer is 2x, otherwise we're adding a reason why the previously
>> stable guest might OOM.
>>
>
> so we may have to involve some disk operations
> to handle memory exhaustion.
>
> Thanks!

Like a cgroups memory limit, for example?

- Michael
Dr. David Alan Gilbert Feb. 20, 2014, 4:32 p.m. UTC | #9
* Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
> On 02/20/2014 06:09 PM, Dr. David Alan Gilbert wrote:
> >* Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
> >>On 02/19/2014 07:27 PM, Dr. David Alan Gilbert wrote:
> >>>I was just wondering if a separate 'max buffer size' knob would allow
> >>>you to more reasonably bound memory without setting policy; I don't think
> >>>people like having potentially x2 memory.
> >>Note: Checkpoint memory is not monotonic in this patchset (which
> >>is unique to this implementation). Only if the guest actually dirties
> >>100% of it's memory between one checkpoint to the next will
> >>the host experience 2x memory usage for a short period of time.
> >Right, but that doesn't really help - if someone comes along and says
> >'How much memory do I need to be able to run an mc system?' the only
> >safe answer is 2x, otherwise we're adding a reason why the previously
> >stable guest might OOM.
> 
> Yes, exactly. Running MC is expensive and will probably always be
> more or less to some degree. Saving memory and having 100%
> fault tolerance are (at times) sometimes mutually exclusive.
> Expectations have to be managed here.

I'm happy to use more memory to get FT; all I'm trying to do is see
if it's possible to bound it below 2x while still maintaining
full FT, at the expense of performance in the case where it uses
a lot of memory.

> The bottom line is: if you put a *hard* constraint on memory usage,
> what will happen to the guest when that garbage collection you mentioned
> shows up later and runs for several minutes? How about an hour?
> Are we just going to block the guest from being allowed to start a
> checkpoint until the memory usage goes down just for the sake of avoiding
> the 2x memory usage?

Yes, or move to the next checkpoint sooner than the N milliseconds when
we see the buffer is getting full.

> If you block the guest from being checkpointed,
> then what happens if there is a failure during that extended period?
> We will have saved memory at the expense of availability.

If the active machine fails during this time then the secondary carries
on from its last good snapshot in the knowledge that the active
never finished the new snapshot and so never uncorked its previous packets.

If the secondary machine fails during this time then the active drops
its nascent snapshot and carries on.

However, what you have made me realise is that I don't have an answer
for the memory usage on the secondary; while the primary can pause
its guest until the secondary acks the checkpoint, the secondary has
to rely on the primary not to send it huge checkpoints.

> The customer that is expecting 100% fault tolerance and the provider
> who is supporting it need to have an understanding that fault tolerance
> is not free and that constraining memory usage will adversely affect
> the VM's ability to be protected.
> 
> Do I understand your expectations correctly? Is fault tolerance
> something you're willing to sacrifice?

As above, no - I'm willing to sacrifice performance but not fault tolerance.
(It is entirely possible that others would want the other trade-off, i.e.
some minimum performance is worse than useless, so if we can't maintain
that performance then dropping FT leaves us in a more-working position.)

> >>The patch has a 'slab' mechanism built in to it which implements
> >>a water-mark style policy that throws away unused portions of
> >>the 2x checkpoint memory if later checkpoints are much smaller
> >>(which is likely to be the case if the writable working set size changes).
> >>
> >>However, to answer your question: Such a knob could be achieved, but
> >>the same could be achieved simply by tuning the checkpoint frequency
> >>itself. Memory usage would thus be a function of the checkpoint frequency.
> >>If the guest application was maniacal, banging away at all the memory,
> >>there's very little that can be done in the first place, but if the
> >>guest application
> >>was mildly busy, you don't want to throw away your ability to be fault
> >>tolerant - you would just need more frequent checkpoints to keep up with
> >>the dirty rate.
> >I'm not convinced; I can tune my checkpoint frequency until normal operation
> >makes a reasonable trade off between mc frequency and RAM usage,
> >but that doesn't prevent it running away when a garbage collect or some
> >other thing suddenly dirties a load of ram in one particular checkpoint.
> >Some management tool that watches ram usage etc can also help tune
> >it, but in the end it can't stop it taking loads of RAM.
> 
> That's correct. See above comment....
> 
> >
> >>Once the application died down - the water-mark policy would kick in
> >>and start freeing checkpoint memory. (Note: this policy happens on
> >>both sides in the patchset because the patch has to be fully compatible
> >>with RDMA memory pinning).
> >>
> >>What is *not* exposed, however, is the watermark knobs themselves,
> >>I definitely think that needs to be exposed - that would also get
> >>you a similar
> >>control to 'max buffer size' - you could place a time limit on the
> >>slab list in the patch or something like that.......
> >>
> >>
> >>>>Good question in general - I'll add it to the FAQ. The patch implements
> >>>>a basic 'transaction' mechanism in coordination with an outbound I/O
> >>>>buffer (documented further down). With these two things in
> >>>>places, split-brain is not possible because the destination is not running.
> >>>>We don't allow the destination to resume execution until a committed
> >>>>transaction has been acknowledged by the destination and only until
> >>>>then do we allow any outbound network traffic to be release to the
> >>>>outside world.
> >>>Yeh I see the IO buffer, what I've not figured out is how:
> >>>   1) MC over TCP/IP gets an acknowledge on the source to know when
> >>>      it can unplug it's buffer.
> >>Only partially correct (See the steps on the wiki). There are two I/O
> >>buffers at any given time which protect against a split-brain scenario:
> >>One buffer for the current checkpoint that is being generated (running VM)
> >>and one buffer for the checkpoint that is being committed in a transaction.
> >>
> >>>   2) Lets say the MC connection fails, so that ack never arrives,
> >>>      the source must assume the destination has failed and release it's
> >>>      packets and carry on.
> >>Only the packets for Buffer A are released for the current committed
> >>checkpoint after a completed transaction. The packets for Buffer B
> >>(the current running VM) are still being held up until the next
> >>transaction starts.
> >>Later once the transaction completes and A is released, B becomes the
> >>new A and a new buffer is installed to become the new Buffer B for
> >>the current running VM.
> >>
> >>
> >>>      The destination must assume the source has failed and take over.
> >>The destination must also receive an ACK. The ack goes both ways.
> >>
> >>Once the source and destination both acknowledge a completed
> >>transation does the source VM resume execution - and even then
> >>it's packets are still being buffered until the next transaction starts.
> >>(That's why it's important to checkpoint as frequently as possible).
> >I think I understand normal operation - my question here is about failure;
> >what happens when neither side gets any ACKs.
> 
> Well, that's simple: If there is a failure of the source, the destination
> will simply revert to the previous checkpoint using the same mode
> of operation. The lost ACKs that you're curious about only
> apply to the checkpoint that is in progress. Just because a
> checkpoint is in progress does not mean that the previous checkpoint
> is thrown away - it is already loaded into the destination's memory
> and ready to be activated.

I still don't see why, if the link between them fails, the destination
doesn't fall back to its previous checkpoint, AND the source carries
on running - I don't see how they can differentiate which of them has failed.

> >>>   3) If we're relying on TCP/IP timeout that's quite long.
> >>>
> >>Actually, my experience is been that TCP seems to have more than
> >>one kind of timeout - if receiver is not responding *at all* - it seems that
> >>TCP has a dedicated timer for that. The socket API immediately
> >>sends back an error code and the patchset closes the conneciton
> >>on the destination and recovers.
> >How did you test that?
> >My experience is that if a host knows that it has no route to the destination
> >(e.g. it has no route to try, that matches the destination, because someone
> >took the network interface away) you immediately get a 'no route to host',
> >however if an intermediate link disappears then it takes a while to time out.
> 
> We have a script architecture (not on github) which runs MC in a tight
> loop hundreds of times and kills the source QEMU and timestamps how
> quickly the
> destination QEMU loses the TCP socket connection receives an error code
> from the kernel - every single time, the destination resumes nearly
> instantaneously.
> I've not empirically seen a case where the socket just hangs or doesn't
> change state.
> 
> I'm not very familiar with the internal linux TCP/IP stack
> implementation itself,
> but I have not had a problem with the dependability of the linux socket
> not being able to shutdown the socket as soon as possible.

OK, that only covers a very small range of normal failures.
When you kill the destination QEMU the host OS knows that QEMU is dead
and sends a packet back closing the socket, hence the source knows
the destination is dead very quickly.
If:
   a) the destination machine was to lose power or hang, or
   b) a network link were to fail (other than the one attached to the source,
      possibly)

the source would have to do a full TCP timeout.

To test a,b I'd use an iptables rule somewhere to cause the packets to
be dropped (not rejected).  Stopping the qemu in gdb might be good enough.

> The RDMA implementation uses a manual keepalive mechanism that
> I had to write from scratch - but I never ported this to the TCP
> implementation
> simply because the failures always worked fine without it.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
mrhines@linux.vnet.ibm.com Feb. 21, 2014, 4:54 a.m. UTC | #10
On 02/21/2014 12:32 AM, Dr. David Alan Gilbert wrote:
>
> I'm happy to use more memory to get FT, all I'm trying to do is see
> if it's possible to put a lower bound than 2x on it while still maintaining
> full FT, at the expense of performance in the case where it uses
> a lot of memory.
>
>> The bottom line is: if you put a *hard* constraint on memory usage,
>> what will happen to the guest when that garbage collection you mentioned
>> shows up later and runs for several minutes? How about an hour?
>> Are we just going to block the guest from being allowed to start a
>> checkpoint until the memory usage goes down just for the sake of avoiding
>> the 2x memory usage?
> Yes, or move to the next checkpoint sooner than the N milliseconds when
> we see the buffer is getting full.

OK, I see there is definitely some common ground there: So to be
more specific, what we really need is two things: (I've learned that
the reviewers are very cautious about adding too much policy into
QEMU itself, but let's iron this out anyway:)

1. First, we need to throttle down the guest (QEMU can already do this
     using the recently introduced "auto-converge" feature). This means
     that the guest is still making forward progress, albeit slow progress.

2. Then we would need some kind of policy, or better yet, a trigger that
     does something to the effect of "we're about to use a whole lot of
     checkpoint memory soon - can we afford this much memory usage".
     Such a trigger would be conditional on the current policy of the
     administrator or management software: we would have a QMP
     command with a boolean flag that says "Yes" or "No", it's
     tolerable or not to use that much memory in the next checkpoint.

     If the answer is "Yes", then nothing changes.
     If the answer is "No", then we should either:
        a) throttle down the guest
        b) Adjust the checkpoint frequency
        c) Or pause it altogether while we migrate some other VMs off the
            host such that we can complete the next checkpoint in its 
entirety.

It's not clear to me how much (if any) of this control loop should
be in QEMU or in the management software, but I would definitely agree
that a minimum of at least the ability to detect the situation and remedy
the situation should be in QEMU. I'm not entirely convinced that the
ability to *decide* to remedy the situation should be in QEMU, though.


>
>> If you block the guest from being checkpointed,
>> then what happens if there is a failure during that extended period?
>> We will have saved memory at the expense of availability.
> If the active machine fails during this time then the secondary carries
> on from it's last good snapshot in the knowledge that the active
> never finished the new snapshot and so never uncorked it's previous packets.
>
> If the secondary machine fails during this time then tha active drops
> it's nascent snapshot and carries on.

Yes, that makes sense. Where would that policy go, though,
continuing the above concern?

> However, what you have made me realise is that I don't have an answer
> for the memory usage on the secondary; while the primary can pause
> it's guest until the secondary ack's the checkpoint, the secondary has
> to rely on the primary not to send it huge checkpoints.

Good question: There are a lot of ideas out there in the academic
community for compressing the secondary, pushing the secondary to
a flash-based device, or de-duplicating the secondary. I'm sure any
of them would put a dent in the problem, but I'm not seeing a smoking-gun
solution that would save all of that memory completely.

(Personally, I don't believe in swap. I wouldn't even consider swap
or any kind of traditional disk-based remedy to be a viable solution).

>> The customer that is expecting 100% fault tolerance and the provider
>> who is supporting it need to have an understanding that fault tolerance
>> is not free and that constraining memory usage will adversely affect
>> the VM's ability to be protected.
>>
>> Do I understand your expectations correctly? Is fault tolerance
>> something you're willing to sacrifice?
> As above, no I'm willing to sacrifice performance but not fault tolerance.
> (It is entirely possible that others would want the other trade off, i.e.
> some minimum performance is worse than useless, so if we can't maintain
> that performance then dropping FT leaves us in a more-working position).
>

Agreed - I think a "proactive" failover in this case would solve the 
problem.
If we observed that availability/fault tolerance was going to be at
risk soon (which is relatively easy to detect) - we could just *force*
a failover to the secondary host and restart the protection from
scratch.


>>
>> Well, that's simple: If there is a failure of the source, the destination
>> will simply revert to the previous checkpoint using the same mode
>> of operation. The lost ACKs that you're curious about only
>> apply to the checkpoint that is in progress. Just because a
>> checkpoint is in progress does not mean that the previous checkpoint
>> is thrown away - it is already loaded into the destination's memory
>> and ready to be activated.
> I still don't see why, if the link between them fails, the destination
> doesn't fall back it it's previous checkpoint, AND the source carries
> on running - I don't see how they can differentiate which of them has failed.

I think you're forgetting that the source I/O is buffered - it doesn't
matter that the source VM is still running. As long as its output is
buffered - it cannot have any non-fault-tolerant effect on the outside
world.

In the future, if a technician accesses the machine or the network
is restored, the management software can terminate the stale
source virtual machine.

>> We have a script architecture (not on github) which runs MC in a tight
>> loop hundreds of times and kills the source QEMU and timestamps how
>> quickly the
>> destination QEMU loses the TCP socket connection receives an error code
>> from the kernel - every single time, the destination resumes nearly
>> instantaneously.
>> I've not empirically seen a case where the socket just hangs or doesn't
>> change state.
>>
>> I'm not very familiar with the internal linux TCP/IP stack
>> implementation itself,
>> but I have not had a problem with the dependability of the linux socket
>> not being able to shutdown the socket as soon as possible.
> OK, that only covers a very small range of normal failures.
> When you kill the destination QEMU the host OS knows that QEMU is dead
> and sends a packet back closing the socket, hence the source knows
> the destination is dead very quickly.
> If:
>     a) The destination machine was to lose power or hang
>     b) Or a network link fail  (other than the one attached to the source
>        possibly)
>
> the source would have to do a full TCP timeout.
>
> To test a,b I'd use an iptables rule somewhere to cause the packets to
> be dropped (not rejected).  Stopping the qemu in gdb might be good enough.

Very good idea - I'll add that to the "todo" list of things to do
in my test infrastructure. It may indeed turn out to be necessary
to add a formal keepalive between the source and destination.

- Michael
Dr. David Alan Gilbert Feb. 21, 2014, 9:44 a.m. UTC | #11
* Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
> On 02/21/2014 12:32 AM, Dr. David Alan Gilbert wrote:
> >
> >I'm happy to use more memory to get FT, all I'm trying to do is see
> >if it's possible to put a lower bound than 2x on it while still maintaining
> >full FT, at the expense of performance in the case where it uses
> >a lot of memory.
> >
> >>The bottom line is: if you put a *hard* constraint on memory usage,
> >>what will happen to the guest when that garbage collection you mentioned
> >>shows up later and runs for several minutes? How about an hour?
> >>Are we just going to block the guest from being allowed to start a
> >>checkpoint until the memory usage goes down just for the sake of avoiding
> >>the 2x memory usage?
> >Yes, or move to the next checkpoint sooner than the N milliseconds when
> >we see the buffer is getting full.
> 
> OK, I see there is definitely some common ground there: So to be
> more specific, what we really need is two things: (I've learned that
> the reviewers are very cautious about adding to much policy into
> QEMU itself, but let's iron this out anyway:)
> 
> 1. First, we need to throttle down the guest (QEMU can already do this
>     using the recently introduced "auto-converge" feature). This means
>     that the guest is still making forward progress, albeit slow progress.
> 
> 2. Then we would need some kind of policy, or better yet, a trigger that
>     does something to the effect of "we're about to use a whole lot of
>     checkpoint memory soon - can we afford this much memory usage".
>     Such a trigger would be conditional on the current policy of the
>     administrator or management software: We would either have a QMP
>     command that with a boolean flag that says "Yes" or "No", it's
>     tolerable or not to use that much memory in the next checkpoint.
> 
>     If the answer is "Yes", then nothing changes.
>     If the answer is "No", then we should either:
>        a) throttle down the guest
>        b) Adjust the checkpoint frequency
>        c) Or pause it altogether while we migrate some other VMs off the
>            host such that we can complete the next checkpoint in its
> entirety.

Yes, I think so, although what I was thinking of was mainly (b), possibly
to the point of not starting the next checkpoint.

> It's not clear to me how much of this (or any) of this control loop should
> be in QEMU or in the management software, but I would definitely agree
> that a minimum of at least the ability to detect the situation and remedy
> the situation should be in QEMU. I'm not entirely convince that the
> ability to *decide* to remedy the situation should be in QEMU, though.

The management software access is low frequency, high latency; it should
be setting general parameters (max memory allowed, desired checkpoint
frequency, etc.) but I don't see that we can use it to do anything on
a faster than a few-second basis; so yes it can monitor things and
tweak the knobs if it sees the host as a whole is getting tight on RAM
etc. - but we can't rely on it to apply the brakes if this guest
suddenly decides to take bucketloads of RAM; something has to react
quickly in relation to previously set limits.

> >>If you block the guest from being checkpointed,
> >>then what happens if there is a failure during that extended period?
> >>We will have saved memory at the expense of availability.
> >If the active machine fails during this time then the secondary carries
> >on from it's last good snapshot in the knowledge that the active
> >never finished the new snapshot and so never uncorked it's previous packets.
> >
> >If the secondary machine fails during this time then tha active drops
> >it's nascent snapshot and carries on.
> 
> Yes, that makes sense. Where would that policy go, though,
> continuing the above concern?

I think there has to be some input from the management layer for failover,
because (as per my split-brain concerns) something has to make the decision
about which of the source/destination is to take over, and I don't
believe individual instances have that information.

> >However, what you have made me realise is that I don't have an answer
> >for the memory usage on the secondary; while the primary can pause
> >it's guest until the secondary ack's the checkpoint, the secondary has
> >to rely on the primary not to send it huge checkpoints.
> 
> Good question: There's a lot of work ideas out there in the academic
> community to compress the secondary, or push the secondary to
> a flash-based device, or de-duplicate the secondary. I'm sure any
> of them would put a dent in the problem, but I'm not seeing a smoking
> gun solution that would absolutely save all that memory completely.

Ah, I was thinking that flash would be a good solution for secondary;
it would be a nice demo.

> (Personally, I don't believe in swap. I wouldn't even consider swap
> or any kind of traditional disk-based remedy to be a viable solution).

Well it certainly exists - I've seen it!
Swap works well in limited circumstances; but as soon as you've got
multiple VMs fighting over something with 10s of ms latency you're doomed.

> >>The customer that is expecting 100% fault tolerance and the provider
> >>who is supporting it need to have an understanding that fault tolerance
> >>is not free and that constraining memory usage will adversely affect
> >>the VM's ability to be protected.
> >>
> >>Do I understand your expectations correctly? Is fault tolerance
> >>something you're willing to sacrifice?
> >As above, no I'm willing to sacrifice performance but not fault tolerance.
> >(It is entirely possible that others would want the other trade off, i.e.
> >some minimum performance is worse than useless, so if we can't maintain
> >that performance then dropping FT leaves us in a more-working position).
> >
> 
> Agreed - I think a "proactive" failover in this case would solve the
> problem.
> If we observed that availability/fault tolerance was going to be at
> risk soon (which is relatively easy to detect) - we could just *force*
> a failover to the secondary host and restart the protection from
> scratch.
> 
> 
> >>
> >>Well, that's simple: If there is a failure of the source, the destination
> >>will simply revert to the previous checkpoint using the same mode
> >>of operation. The lost ACKs that you're curious about only
> >>apply to the checkpoint that is in progress. Just because a
> >>checkpoint is in progress does not mean that the previous checkpoint
> >>is thrown away - it is already loaded into the destination's memory
> >>and ready to be activated.
> >I still don't see why, if the link between them fails, the destination
> >doesn't fall back it it's previous checkpoint, AND the source carries
> >on running - I don't see how they can differentiate which of them has failed.
> 
> I think you're forgetting that the source I/O is buffered - it doesn't
> matter that the source VM is still running. As long as it's output is
> buffered - it cannot have any non-fault-tolerant affect on the outside
> world.
> 
> In the future, if a technician access the machine or the network
> is restored, the management software can terminate the stale
> source virtual machine.

I think, going with my comment above, I'm working on the basis that it's just
as likely for the destination to fail as it is for the source to fail,
and a destination failure shouldn't kill the source; and in the case
of a destination failure the source is going to have to let its buffered
I/Os start going again.

> >>We have a script architecture (not on github) which runs MC in a tight
> >>loop hundreds of times and kills the source QEMU and timestamps how
> >>quickly the
> >>destination QEMU loses the TCP socket connection receives an error code
> >>from the kernel - every single time, the destination resumes nearly
> >>instantaneously.
> >>I've not empirically seen a case where the socket just hangs or doesn't
> >>change state.
> >>
> >>I'm not very familiar with the internal linux TCP/IP stack
> >>implementation itself,
> >>but I have not had a problem with the dependability of the linux socket
> >>not being able to shutdown the socket as soon as possible.
> >OK, that only covers a very small range of normal failures.
> >When you kill the destination QEMU the host OS knows that QEMU is dead
> >and sends a packet back closing the socket, hence the source knows
> >the destination is dead very quickly.
> >If:
> >    a) The destination machine was to lose power or hang
> >    b) Or a network link fail  (other than the one attached to the source
> >       possibly)
> >
> >the source would have to do a full TCP timeout.
> >
> >To test a,b I'd use an iptables rule somewhere to cause the packets to
> >be dropped (not rejected).  Stopping the qemu in gdb might be good enough.
> 
> Very good idea - I'll add that to the "todo" list of things to do
> in my test infrastructure. It may indeed turn out be necessary
> to add a formal keepalive between the source and destination.
> 
> - Michael

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
mrhines@linux.vnet.ibm.com March 3, 2014, 6:08 a.m. UTC | #12
On 02/21/2014 05:44 PM, Dr. David Alan Gilbert wrote:
>> It's not clear to me how much of this (or any) of this control loop should
>> be in QEMU or in the management software, but I would definitely agree
>> that a minimum of at least the ability to detect the situation and remedy
>> the situation should be in QEMU. I'm not entirely convince that the
>> ability to *decide* to remedy the situation should be in QEMU, though.
> The management software access is low frequency, high latency; it should
> be setting general parameters (max memory allowed, desired checkpoint
> frequency etc) but I don't see that we can use it to do anything on
> a sooner than a few second basis; so yes it can monitor things and
> tweek the knobs if it sees the host as a whole is getting tight on RAM
> etc - but we can't rely on it to throw in the breaks if this guest
> suddenly decides to take bucket loads of RAM; something has to react
> quickly in relation to previously set limits.

I agree - the boolean flag I mentioned previously would do just
that: setting the flag (or a state value, perhaps, instead of a boolean)
would indicate to QEMU which type of sacrifice to make:

A flag of "0" might mean "Throttle the guest in an emergency"
A flag of "1" might mean "Throttling is not acceptable, just let the 
guest use the extra memory"
A flag of "2" might mean "Neither one is acceptable, fail now and inform 
the management software to restart somewhere else".

Or something to that effect - roughly as sketched below.
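
As a minimal C sketch (every type and function name here is an illustrative
assumption, not an existing QEMU interface), the knob plus the check in the
checkpoint path might look something like this:

    #include <stdint.h>

    /* Hypothetical names - nothing below exists in QEMU today. */
    typedef enum {
        MC_POLICY_THROTTLE = 0,  /* "0": throttle the guest in an emergency      */
        MC_POLICY_ALLOW    = 1,  /* "1": let the guest use the extra memory      */
        MC_POLICY_FAIL     = 2,  /* "2": fail now and inform management software */
    } MCMemoryPolicy;

    typedef struct {
        MCMemoryPolicy policy;     /* set ahead of time via a QMP command        */
        uint64_t       slab_limit; /* maximum bytes of checkpoint staging memory */
    } MCPolicyState;

    /* Consulted just before a checkpoint, once the projected staging size of
     * the next MC is known.  Returns 0 to proceed, -1 to abort checkpointing
     * and hand the decision back to the management layer. */
    static int mc_policy_check(MCPolicyState *s, uint64_t projected_bytes)
    {
        if (projected_bytes <= s->slab_limit) {
            return 0;
        }
        switch (s->policy) {
        case MC_POLICY_THROTTLE:
            /* e.g. kick auto-converge style throttling, then proceed */
            return 0;
        case MC_POLICY_ALLOW:
            return 0;
        case MC_POLICY_FAIL:
        default:
            return -1;
        }
    }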

>>>> If you block the guest from being checkpointed,
>>>> then what happens if there is a failure during that extended period?
>>>> We will have saved memory at the expense of availability.
>>> If the active machine fails during this time then the secondary carries
>>> on from it's last good snapshot in the knowledge that the active
>>> never finished the new snapshot and so never uncorked it's previous packets.
>>>
>>> If the secondary machine fails during this time then tha active drops
>>> it's nascent snapshot and carries on.
>> Yes, that makes sense. Where would that policy go, though,
>> continuing the above concern?
> I think there has to be some input from the management layer for failover,
> because (as per my split-brain concerns) something has to make the decision
> about which of the source/destination is to take over, and I don't
> believe individual instances have that information.

Agreed - so the "ability" (as hinted at above) should be in QEMU,
but the decision to recover from the situation probably should not
be, where "recover" is defined as the VM being back in a fully running,
fully fault-tolerant protected state (potentially where the source VM
is on a different machine than it was before).

>
>>>> Well, that's simple: If there is a failure of the source, the destination
>>>> will simply revert to the previous checkpoint using the same mode
>>>> of operation. The lost ACKs that you're curious about only
>>>> apply to the checkpoint that is in progress. Just because a
>>>> checkpoint is in progress does not mean that the previous checkpoint
>>>> is thrown away - it is already loaded into the destination's memory
>>>> and ready to be activated.
>>> I still don't see why, if the link between them fails, the destination
>>> doesn't fall back it it's previous checkpoint, AND the source carries
>>> on running - I don't see how they can differentiate which of them has failed.
>> I think you're forgetting that the source I/O is buffered - it doesn't
>> matter that the source VM is still running. As long as it's output is
>> buffered - it cannot have any non-fault-tolerant affect on the outside
>> world.
>>
>> In the future, if a technician access the machine or the network
>> is restored, the management software can terminate the stale
>> source virtual machine.
> I think going with my comment above; I'm working on the basis it's just
> as likely for the destination to fail as it is for the source to fail,
> and a destination failure shouldn't kill the source; and in the case
> of a destination failure the source is going to have to let it's buffered
> I/Os start going again.

Yes, that's correct, but only after the management software knows about
the failure. If we're on a tightly coupled, fast LAN, there's no reason
to believe that libvirt, for example, would be so slow that we cannot
wait a few extra (tens of?) milliseconds after destination failure to
choose a new destination and restart the previous checkpoint.

But if management *is* too slow, which is not unlikely, then I think
we should just tell the source to migrate entirely and get out of that
environment.

Either way - this isn't something QEMU itself necessarily needs to
worry about - it just needs to know not to explode if the destination
fails and to wait for instructions on what to do next.

Alternatively, if the administrator "prefers" restarting the fault-tolerance
instead of Migration, we could have a QMP command that specifies
a "backup" destination (or even a "duplicate" destination) that QEMU
would automatically know about in the case of destination failure.

But, I wouldn't implement something like that until at least a first version
was accepted by the community.

- Michael
diff mbox

Patch

diff --git a/docs/mc.txt b/docs/mc.txt
new file mode 100644
index 0000000..5d4b5fe
--- /dev/null
+++ b/docs/mc.txt
@@ -0,0 +1,222 @@ 
+Micro Checkpointing Specification v1
+==============================================
+Wiki: http://wiki.qemu.org/Features/MicroCheckpointing
+Github: git@github.com:hinesmr/qemu.git, 'mc' branch
+
+Copyright (C) 2014 Michael R. Hines <mrhines@us.ibm.com>
+
+Contents
+1 Summary
+1.1 Contact
+1.2 Introduction
+2 The Micro-Checkpointing Process
+2.1 Basic Algorithm
+2.2 I/O buffering
+2.3 Failure Recovery
+3 Optimizations
+3.1 Memory Management
+3.2 RDMA Integration
+4 Usage
+4.1 BEFORE Running
+4.2 Running
+5 Performance
+6 TODO
+7 FAQ / Frequently Asked Questions
+7.1 What happens if a failure occurs in the *middle* of a flush of the network buffer?
+7.2 What's different about this implementation?
+Summary
+This is an implementation of Micro Checkpointing for memory and cpu state. Also known as: "Continuous Replication" or "Fault Tolerance" or 100 other different names - choose your poison.
+
+Contact
+Name: Michael Hines
+Email: mrhines@us.ibm.com
+Wiki: http://wiki.qemu.org/Features/MicroCheckpointing
+
+Github: http://github.com/hinesmr/qemu.git, 'mc' branch
+
+Libvirt Support: http://github.com/hinesmr/libvirt.git, 'mc' branch
+
+Copyright (C) 2014 IBM Michael R. Hines <mrhines@us.ibm.com>
+
+Introduction
+Micro-Checkpointing (MC) is one method for providing Fault Tolerance to a running virtual machine (VM) with little or no runtime assistance from the guest kernel or guest application software. Furthermore, Fault Tolerance is one method of providing high availability to a VM such that, from the perspective of the outside world (clients, devices, and neighboring VMs that may be paired with it), the VM and its applications have not lost any runtime state in the event of either a failure of the hypervisor/hardware to allow the VM to make forward progress or a complete loss of power. This mechanism for providing fault tolerance does *not* provide any protection whatsoever against software-level faults in the guest kernel or applications. In fact, due to the potentially extended lifetime of the VM because of this type of high availability, such software-level bugs may in fact manifest themselves more often than they otherwise ordinarily would, in which case you would need to employ other techniques to guard against such software-level failures.
+
+This implementation is also fully compatible with RDMA and has undergone special optimizations to support the use of RDMA. (See docs/rdma.txt for more details.)
+
+The Micro-Checkpointing Process
+Basic Algorithm
+Micro-Checkpoints (MC) work against the existing live migration path in QEMU, and can effectively be understood as a "live migration that never ends". As such, iteration rounds happen at the granularity of 10s of milliseconds and perform the following steps:
+
+1. After N milliseconds, stop the VM.
+2. Generate a MC by invoking the live migration software path to identify and copy dirty memory into a local staging area inside QEMU.
+3. Resume the VM immediately so that it can make forward progress.
+4. Transmit the checkpoint to the destination.
+5. Repeat.
+Upon failure, load the contents of the last MC at the destination back into memory and run the VM normally.
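
As a rough C-style restatement of this cycle (every function name below is a
placeholder standing in for the corresponding QEMU-internal step, not an
actual call in this patch series):

    /* Rough sketch only; all helpers are placeholders. */
    static void mc_basic_loop(int epoch_ms)
    {
        for (;;) {
            run_guest_for(epoch_ms);        /* let the VM run for N milliseconds */
            vm_stop();                      /* pause the guest                   */
            save_dirty_memory_to_staging(); /* generate the MC locally           */
            vm_start();                     /* resume immediately                */
            transmit_checkpoint();          /* send the staged MC, then repeat   */
        }
    }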
+
+I/O buffering
+Additionally, a MC must include a consistent view of device I/O, particularly the network, a problem commonly referred to as "output commit". This means that the outside world cannot be allowed to experience duplicate state that was committed by the virtual machine after a failure. Such duplication is possible because the VM may diverge by N milliseconds of execution and commit state while the current MC is still being transmitted to the destination.
+
+To guard against this problem, first, we must "buffer" the TX output of the network (not the input) between MCs until the current MC is safely received by the destination. For example, all outbound network packets must be held at the source until the MC is transmitted. After transmission is complete, those packets can be released. Similarly, in the case of disk I/O, we must ensure that either the contents of the local disk are safely mirrored to a remote disk before completing a MC or that the output to a shared disk, such as iSCSI, is also buffered between checkpoints and then later released in the same way.
+
+For the network in particular, buffering is performed using a series of netlink (libnl3) Qdisc "plugs", introduced by the Xen Remus implementation. All packets go through netlink in the host kernel - there are no exceptions and no gaps. Even while one buffer is being released (say, after a checkpoint has been saved), another plug will have already been initiated to hold the next round of packets simultaneously while the current round of packets are being released. Thus, at any given time, there may be as many as two simultaneous buffers in place.
+
+With this in mind, here is the extended procedure for the micro checkpointing process:
+
+1. Insert a new Qdisc plug (Buffer A).
+Repeat Forever:
+
+2. After N milliseconds, stop the VM.
+3. Generate a MC by invoking the live migration software path to identify and copy dirty memory into a local staging area inside QEMU.
+4. Insert a *new* Qdisc plug (Buffer B). This buffers all new packets only.
+5. Resume the VM immediately so that it can make forward progress (making use of Buffer B).
+6. Transmit the MC to the destination.
+7. Wait for acknowledgement.
+8. Acknowledged.
+9. Release the Qdisc plug for Buffer A.
+10. Qdisc Buffer B now becomes (is symbolically renamed to) the new Buffer A.
+11. Go back to Step 2
+This implementation *currently* only supports buffering for the network. (Any help on implementing disk support would be greatly appreciated). Due to this lack of disk support, this requires that the VM's root disk or any non-ephemeral disks also be made network-accessible directly from within the VM. Until the aforementioned buffering or mirroring support is available (ideally through drive-mirror), the only "consistent" way to provide full fault tolerance of the VM's non-ephemeral disks is to construct a VM whose root disk is made to boot directly from iSCSI or NFS or similar such that all disk I/O is translated into network I/O.
+
+Buffering is performed with an IFB device attached to the KVM tap device, combined with a netlink Qdisc plug (exactly like the Xen Remus solution).
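
For illustration, a simplified sketch of a single transaction using libnl3's
qdisc "plug" helpers is shown below. The qdisc/socket setup against the IFB
device, all error handling and the checkpoint/VM helpers are omitted or
assumed, and the exact way the plug action is (re)submitted to the kernel may
differ from the real implementation:

    #include <netlink/netlink.h>
    #include <netlink/route/qdisc.h>
    #include <netlink/route/qdisc/plug.h>

    /* Sketch only: 'sock' and 'qdisc' are assumed to already refer to the
     * plug qdisc installed on the IFB device; vm_stop(), vm_start(),
     * checkpoint_save() and checkpoint_send_and_wait_ack() are placeholders. */
    static void mc_transaction(struct nl_sock *sock, struct rtnl_qdisc *qdisc)
    {
        vm_stop();
        checkpoint_save();                  /* copy dirty memory into staging */

        /* Start Buffer B: hold every packet generated from now on. */
        rtnl_qdisc_plug_buffer(qdisc);
        rtnl_qdisc_add(sock, qdisc, NLM_F_CREATE);

        vm_start();                         /* guest runs against Buffer B    */
        checkpoint_send_and_wait_ack();     /* transmit the MC, wait for ACK  */

        /* Release Buffer A: packets belonging to the now-committed checkpoint
         * are safe to let out; Buffer B keeps holding the newer packets.     */
        rtnl_qdisc_plug_release_one(qdisc);
        rtnl_qdisc_add(sock, qdisc, NLM_F_CREATE);
    }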
+
+Failure Recovery
+Due to the high-frequency nature of micro-checkpointing, we expect a new MC to be generated many times per second. Even missing just a few MCs easily constitutes a failure. Because of the consistent buffering of device I/O, treating this as a failure is safe: device I/O is not committed to the outside world until the MC has been received at the destination.
+
+Failure is thus assumed under two conditions:
+
+1. MC over TCP/IP: Once the socket connection breaks, we assume failure. This is typically detected very early in the loss of the latest MC, not only because a very large number of bytes is typically being sequenced in a TCP stream, but perhaps also because of the timeout in acknowledgement of the receipt of a commit message by the destination.
+
+2. MC over RDMA: Since Infiniband does not provide any underlying timeout mechanisms, this implementation enhances QEMU's RDMA migration protocol to include a simple keep-alive. Upon the loss of multiple keep-alive messages, the sender is deemed to have failed.
+
+In both cases, whether due to a failed TCP socket connection or a lost group of RDMA keep-alives, either the sender or the receiver can be deemed to have failed.
+
+If the sender is deemed to have failed, the destination takes over immediately using the contents of the last checkpoint.
+
+If the destination is deemed to be lost, we perform the same action as a live migration: resume the sender normally and wait for management software to make a policy decision about whether or not to re-protect the VM, which may involve a third-party to identify a new destination host again to use as a backup for the VM.
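
A trivial sketch of the kind of keep-alive miss counting described above
(the structure, names and threshold are illustrative, not the values used by
this patch series):

    #include <stdbool.h>

    /* Illustrative only: counts consecutive polling intervals in which no
     * keep-alive arrived; after 'max_misses' the peer is deemed failed. */
    typedef struct {
        int misses;
        int max_misses;    /* e.g. 4 */
    } MCKeepalive;

    static bool mc_peer_failed(MCKeepalive *ka, bool keepalive_seen)
    {
        if (keepalive_seen) {
            ka->misses = 0;
            return false;
        }
        return ++ka->misses >= ka->max_misses;
    }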
+
+Optimizations
+Memory Management
+Managing QEMU memory usage in this implementation is critical to the performance of any micro-checkpointing (MC) implementation.
+
+MCs are typically only a few MB when idle. However, they can easily be very large during heavy workloads. In the *extreme* worst case, QEMU will need double the amount of main memory originally allocated to the virtual machine (for example, a 4 GB guest could transiently require up to 8 GB of staging plus guest memory on the source host).
+
+To support this variability during transient periods, a MC consists of a linked list of slabs, each of identical size. A better name would be welcome, as the name was only chosen because it resembles linux memory allocation. Because MCs occur several times per second (a frequency of 10s of milliseconds), slabs allow MCs to grow and shrink without constantly re-allocating all memory in place during each checkpoint. During steady-state, the 'head' slab is permanently allocated and never goes away, so when the VM is idle, there is no memory allocation at all. This design supports the use of RDMA. Since RDMA requires memory pinning, we must be able to hold on to a slab for a reasonable amount of time to get any real use out of it.
+
+Regardless, the current strategy taken will be:
+
+1. If the checkpoint size increases, then grow the number of slabs to support it.
+2. If the next checkpoint size is smaller than the last one, then that's a "strike".
+3. After N strikes, cut the size of the slab cache in half (to a minimum of 1 slab as described before).
+As of this writing, the average size of a Linux-based Idle-VM checkpoint is under 5MB.
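
A sketch of the slab cache and the grow/shrink policy above (the structure
layout, names and strike threshold are illustrative guesses, not the actual
data structures used by the implementation):

    #include <stdint.h>

    #define MC_SLAB_SHRINK_STRIKES 10   /* the "N strikes" - value is made up */

    typedef struct MCSlab {
        struct MCSlab *next;
        uint64_t       used;    /* bytes of checkpoint data held in this slab */
        uint8_t        buf[];   /* fixed-size payload, e.g. a few MB          */
    } MCSlab;

    typedef struct {
        MCSlab  *head;          /* permanently allocated, never freed         */
        unsigned nb_slabs;      /* current size of the slab cache             */
        unsigned strikes;       /* consecutive checkpoints smaller than last  */
    } MCSlabCache;

    /* Called after each checkpoint with the size of this MC and the last one.
     * Growing happens on demand elsewhere; this only tracks the shrink rule. */
    static void mc_adjust_slabs(MCSlabCache *c, uint64_t this_mc, uint64_t last_mc)
    {
        if (this_mc >= last_mc) {
            c->strikes = 0;
        } else if (++c->strikes >= MC_SLAB_SHRINK_STRIKES) {
            /* Record the new target; freeing the surplus slabs (down to a
             * minimum of the head slab) is left out of this sketch. */
            c->nb_slabs = c->nb_slabs > 1 ? c->nb_slabs / 2 : 1;
            c->strikes = 0;
        }
    }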
+
+RDMA Integration
+RDMA is instrumental in enabling better MC performance, which is the reason why it was introduced into QEMU first.
+
+RDMA is used for two different reasons:
+
+1. Checkpoint generation (RDMA-based memcpy):
+2. Checkpoint transmission
+Checkpoint generation must be done while the VM is paused. In the worst case, the checkpoint can be equal in size to the total amount of memory in use by the VM. In order to resume VM execution as fast as possible, the checkpoint is copied consistently into a local staging area before transmission. A standard memcpy() of such a potentially large amount of memory not only gets no use out of the CPU cache but also potentially clogs up the CPU pipeline, which would otherwise be useful to other neighbor VMs on the same physical node that could be scheduled for execution. To minimize the effect on neighbor VMs, we use RDMA to perform a "local" memcpy(), bypassing the host processor. On more recent processors, a 'beefy' enough memory bus architecture can move memory just as fast as (sometimes faster than) a pure-software, CPU-only optimized memcpy() from libc. However, on older computers, this feature only gives you the benefit of lower CPU utilization at the expense of raw copy performance.
+
+Checkpoint transmission can potentially also consume very large amounts of both bandwidth and CPU utilization that could otherwise be used by the VM itself or its neighbors. Once the aforementioned local copy of the checkpoint is saved, this implementation makes use of the same RDMA hardware to perform the transmission, exactly the same way that a live migration happens over RDMA (see docs/rdma.txt).
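
To give a feel for what the RDMA-assisted local copy amounts to, here is a
hedged sketch of posting a single RDMA_WRITE through a loopback-connected
queue pair. All of the verbs setup (device, PD, MR registration, QP
connection) and completion polling are omitted, and this is not the code used
by the patch series:

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Sketch only: copies 'len' bytes from a registered guest-RAM region into
     * a registered staging slab by posting an RDMA_WRITE on a loopback QP, so
     * the payload never passes through the host CPU. */
    static int rdma_local_copy(struct ibv_qp *loopback_qp,
                               struct ibv_mr *guest_mr, void *src,
                               struct ibv_mr *slab_mr, void *dst,
                               uint32_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)src,
            .length = len,
            .lkey   = guest_mr->lkey,
        };
        struct ibv_send_wr wr = {
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_WRITE,
            .send_flags = IBV_SEND_SIGNALED,
            .wr.rdma.remote_addr = (uintptr_t)dst,
            .wr.rdma.rkey        = slab_mr->rkey,
        };
        struct ibv_send_wr *bad = NULL;

        return ibv_post_send(loopback_qp, &wr, &bad);   /* 0 on success */
    }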
+
+Usage
+BEFORE Running
+First, compile QEMU with '--enable-mc' and ensure that the corresponding libraries for netlink (libnl3) are available. The netlink 'plug' support from the Qdisc functionality is required in particular, because it allows QEMU to direct the kernel to buffer outbound network packets between checkpoints as described previously. Do not proceed without this support in a production environment, or you risk corrupting the state of your I/O.
+
+$ git clone http://github.com/hinesmr/qemu.git
+$ git checkout 'mc'
+$ ./configure --enable-mc [other options]
+Next, start the VM that you want to protect using your standard procedures.
+
+Enable MC like this:
+
+QEMU Monitor Command:
+
+$ migrate_set_capability x-mc on # disabled by default
+Currently, only one network interface is supported, *and* you must ensure that the root disk of your VM is booted either directly from iSCSI or NFS, as described previously. This will be rectified with future improvements.
+
+For testing only, you can ignore the aforementioned requirements if you simply want to get an understanding of the performance penalties associated with activating this feature.
+
+Next, you can optionally disable network-buffering for additional test-only execution. This is useful if you want to get a breakdown only of what the cost of checkpointing the memory state is without the cost of checkpointing device state.
+
+QEMU Monitor Command:
+
+$ migrate_set_capability mc-net-disable on # buffering activated by default 
+Next, you can optionally enable RDMA 'memcpy' support. This is only valid if you have RDMA support compiled into QEMU and you intend to use the 'rdma' migration URI upon initiating MC as described later.
+
+QEMU Monitor Command:
+
+$ migrate_set_capability mc-rdma-copy on # disabled by default
+Finally, if you are using QEMU's support for RDMA migration, you will want to enable RDMA keep-alive support to allow quick detection of failure. If you are using TCP/IP, this is not required:
+
+QEMU Monitor Command:
+
+$ migrate_set_capability rdma-keepalive on # disabled by default
+Running
+First, make sure the IFB device kernel module is loaded
+
+$ modprobe ifb numifbs=100 # (or some large number)
+Now, install a Qdisc plug to the tap device using the same naming convention as the tap device created by QEMU (it must be the same, because QEMU needs to interact with the IFB device and the only mechanism we have right now of knowing the name of the IFB devices is to assume that it matches the tap device numbering scheme):
+
+$ ip link set up ifb0 # <= corresponds to tap device 'tap0'
+$ tc qdisc add dev tap0 ingress
+$ tc filter add dev tap0 parent ffff: proto ip pref 10 u32 match u32 0 0 action mirred egress redirect dev ifb0
+(You will need a script to automate the part above until the libvirt patches are more complete).
+
+Now that the network buffering connection is ready, MC can be initiated with exactly the same command as standard live migration:
+
+QEMU Monitor Command:
+
+$ migrate -d (tcp|rdma):host:port
+Upon failure, the destination VM will detect a loss in network connectivity and automatically revert to the last checkpoint taken and resume execution immediately. There is no need for additional QEMU monitor commands to initiate the recovery process.
+
+Performance
+By far, the biggest cost is network throughput. Virtual machines are capable of dirtying memory well in excess of the bandwidth provided by a commodity 1 Gbps network link. If so, the MC process will always lag behind the virtual machine and forward progress will be poor. It is highly recommended to use at least a 10 Gbps link when using MC.
+
+Numbers are still coming in, but without output buffering of network I/O, the performance penalty of a typical 4GB RAM Java-based application server workload using a 10 Gbps link (a good worst case for testing due to Java's constant garbage collection) is on the order of 25%. With network buffering activated, this can be as high as 50%.
+
+Assuming that you have a reasonable 10G (or RDMA) network in place, the majority of the penalty is due to the time it takes to copy the dirty memory into a staging area before transmission of the checkpoint. Any optimizations / proposals to speed this up would be welcome!
+
+The remaining penalty, which comes from network buffering, is typically due to checkpoints not occurring fast enough: the "round trip" time between the request of an application-level transaction and the corresponding response should ideally be larger than the time it takes to complete a checkpoint; otherwise, the response to the application within the VM will appear congested, since the VM's network endpoint may not even have received the TX request from the application in the first place.
+
+We believe that this effect is "amplified" by the poor performance of copying the dirty memory to staging: since an application-level RTT cannot be serviced with more frequent checkpoints, network I/O tends to get held in the buffer too long. This has the effect of causing the guest TCP/IP stack to experience congestion, propagating this artificially created delay all the way up to the application.
+
+TODO
+1. The main bottleneck is the performance of the local memory copy to staging memory. The faster we can copy, the faster we can flush the network buffer.
+
+2. Implement local disk mirroring by integrating with QEMU's 'drive-mirror' feature in order to fully support virtual machines with local storage.
+
+3. Implement output commit buffering for shared storage.
+
+FAQ / Frequently Asked Questions
+What happens if a failure occurs in the *middle* of a flush of the network buffer?
+Micro-Checkpointing depends *heavily* on the correctness of TCP/IP. Thus, this is not a problem, because the network buffer holds packets only for the last *committed* checkpoint (meaning that the last micro checkpoint must have been acknowledged as received successfully by the backup host). After understanding this, it is then important to understand how network buffering is repeated between checkpoints. *ALL* packets go through the buffer - there are no exceptions and no gaps. There is no situation in which newer packets slip through while the buffer is being flushed - that's not how it works. Please refer to the previous section "I/O buffering" for a detailed description of how network buffering works.
+
+Why is this not a problem?
+
+Example: Let's say we have packets "A" and "B" in the buffer.
+
+Packet A is sent successfully and a failure occurs before packet B is transmitted.
+
+Packet A) This is acceptable. The guest checkpoint has already recorded delivery of the packet from the guest's perspective. The network fabric can deliver or not deliver as it sees fit. Thus the buffer simply has the same effect as an additional network switch - it does not alter fault tolerance as viewed by the external world any more than another faulty hop in a traditional network would by causing congestion. The packet will never get RE-generated, because the checkpoint has already been committed at the destination, which corresponds to the transmission of that packet from the perspective of the virtual machine. Any FUTURE packets generated while the VM resumes execution are *also* buffered as described previously.
+
+Packet B) This is acceptable. This packet will be lost. This will result in a TCP-level timeout on the peer side of the connection in the case that packet B is an ACK, or in a timeout on the guest side of the connection in the case that the packet is a TCP PUSH. Either way, the packet will get re-transmitted as soon as the virtual machine resumes execution, either because the data was never acknowledged or because it was never received.
+
+What's different about this implementation?
+Several things about this implementation attempt are different from previous implementations:
+
+1. We are dedicated to seeing this through the community review process and staying current with the master branch.
+
+2. This implementation is 100% compatible with RDMA.
+
+3. Memory management is completely overhauled - malloc()/free() churn is reduced to a minimum.
+
+4. This is not a port of Kemari. Kemari is obsolete and incompatible with the most recent QEMU.
+
+5. Network I/O buffering is outsourced to the host kernel, using netlink code introduced by the Remus/Xen project.
+