diff mbox

[RFC,RDMA,support,v5:,03/12] comprehensive protocol documentation

Message ID 1365476681-31593-4-git-send-email-mrhines@linux.vnet.ibm.com
State New
Headers show

Commit Message

mrhines@linux.vnet.ibm.com April 9, 2013, 3:04 a.m. UTC
From: "Michael R. Hines" <mrhines@us.ibm.com>

Both the protocol and interfaces are elaborated in more detail,
including the new use of dynamic chunk registration, versioning,
and capabilities negotiation.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 docs/rdma.txt |  313 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 313 insertions(+)
 create mode 100644 docs/rdma.txt

Comments

Michael S. Tsirkin April 10, 2013, 5:27 a.m. UTC | #1
Below is a great high level overview. the protocol looks correct.
A bit more detail would be helpful, as noted below.

The main thing I'd like to see changed is that there are already
two protocols here: chunk-based and non chunk based.
We'll need to use versioning and capabilities going forward but in the
first version we don't need to maintain compatibility with legacy so
two versions seems like unnecessary pain.  Chunk based is somewhat slower and
that is worth fixing longer term, but seems like the way forward. So
let's implement a single chunk-based protocol in the first version we
merge.

Some more minor improvement suggestions below.

On Mon, Apr 08, 2013 at 11:04:32PM -0400, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> Both the protocol and interfaces are elaborated in more detail,
> including the new use of dynamic chunk registration, versioning,
> and capabilities negotiation.
> 
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>  docs/rdma.txt |  313 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 313 insertions(+)
>  create mode 100644 docs/rdma.txt
> 
> diff --git a/docs/rdma.txt b/docs/rdma.txt
> new file mode 100644
> index 0000000..e9fa4cd
> --- /dev/null
> +++ b/docs/rdma.txt
> @@ -0,0 +1,313 @@
> +Several changes since v4:
> +
> +- Created a "formal" protocol for the RDMA control channel
> +- Dynamic, chunked page registration now implemented on *both* the server and client
> +- Created new 'capability' for page registration
> +- Created new 'capability' for is_zero_page() - enabled by default
> +  (needed to test dynamic page registration)
> +- Created version-check before protocol begins at connection-time 
> +- no more migrate_use_rdma() !
> +
> +NOTE: While dynamic registration works on both sides now,
> +      it does *not* work with cgroups swap limits. This functionality with infiniband
> +      remains broken. (It works fine with TCP). So, in order to take full 
> +      advantage of this feature, a fix will have to be developed on the kernel side.
> +      Alternative proposed is use /dev/<pid>/pagemap. Patch will be submitted.

You mean the idea of using pagemap to detect shared pages created by KSM
and/or zero pages? That would be helpful for TCP migration, thanks!

> +

BTW the above comments belong outside both document and commit log,
after --- before diff.

> +Contents:
> +=================================
> +* Compiling
> +* Running (please readme before running)
> +* RDMA Protocol Description
> +* Versioning
> +* QEMUFileRDMA Interface
> +* Migration of pc.ram
> +* Error handling
> +* TODO
> +* Performance
> +
> +COMPILING:
> +===============================
> +
> +$ ./configure --enable-rdma --target-list=x86_64-softmmu
> +$ make
> +
> +RUNNING:
> +===============================
> +
> +First, decide if you want dynamic page registration on the server-side.
> +This always happens on the primary-VM side, but is optional on the server.
> +Doing this allows you to support overcommit (such as cgroups or ballooning)
> +with a smaller footprint on the server-side without having to register the
> +entire VM memory footprint. 
> +NOTE: This significantly slows down performance (about 30% slower).

Where does the overhead come from? It appears from the description that
you have exactly same amount of data to exchange using send messages,
either way?
Or are you using bigger chunks with upfront registration?

> +
> +$ virsh qemu-monitor-command --hmp \
> +    --cmd "migrate_set_capability chunk_register_destination on" # disabled by default

I think the right choice is to make chunk based the default, and remove
the non chunk based from code.  This will simplify the protocol a tiny bit,
and make us focus on improving chunk based long term so that it's as
fast as upfront registration.

> +
> +Next, if you decided *not* to use chunked registration on the server,
> +it is recommended to also disable zero page detection. While this is not
> +strictly necessary, zero page detection also significantly slows down
> +performance on higher-throughput links (by about 50%), like 40 gbps infiniband cards:

What is meant by performance here? downtime?

> +
> +$ virsh qemu-monitor-command --hmp \
> +    --cmd "migrate_set_capability check_for_zero off" # always enabled by default
> +
> +Finally, set the migration speed to match your hardware's capabilities:
> +
> +$ virsh qemu-monitor-command --hmp \
> +    --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
> +
> +Finally, perform the actual migration:
> +
> +$ virsh migrate domain rdma:xx.xx.xx.xx:port
> +
> +RDMA Protocol Description:
> +=================================
> +
> +Migration with RDMA is separated into two parts:
> +
> +1. The transmission of the pages using RDMA
> +2. Everything else (a control channel is introduced)
> +
> +"Everything else" is transmitted using a formal 
> +protocol now, consisting of infiniband SEND / RECV messages.
> +
> +An infiniband SEND message is the standard ibverbs
> +message used by applications of infiniband hardware.
> +The only difference between a SEND message and an RDMA
> +message is that SEND message cause completion notifications
> +to be posted to the completion queue (CQ) on the 
> +infiniband receiver side, whereas RDMA messages (used
> +for pc.ram) do not (to behave like an actual DMA).
> +    
> +Messages in infiniband require two things:
> +
> +1. registration of the memory that will be transmitted
> +2. (SEND/RECV only) work requests to be posted on both
> +   sides of the network before the actual transmission
> +   can occur.
> +
> +RDMA messages much easier to deal with. Once the memory
> +on the receiver side is registered and pinned, we're
> +basically done. All that is required is for the sender
> +side to start dumping bytes onto the link.

When is memory unregistered and unpinned on send and receive
sides?

> +
> +SEND messages require more coordination because the
> +receiver must have reserved space (using a receive
> +work request) on the receive queue (RQ) before QEMUFileRDMA
> +can start using them to carry all the bytes as
> +a transport for migration of device state.
> +
> +To begin the migration, the initial connection setup is
> +as follows (migration-rdma.c):
> +
> +1. Receiver and Sender are started (command line or libvirt):
> +2. Both sides post two RQ work requests

Okay this could be where the problem is. This means with chunk
based receive side does:

loop:
	receive request
	register
	send response

while with non chunk based it does:

receive request
send response
loop:
	register

In reality each request/response requires two network round-trips
with the Ready credit-management messsages.
So the overhead will likely be avoided if we add better pipelining:
allow multiple registration requests in the air, and add more
send/receive credits so the overhead of credit management can be
reduced.

There's no requirement to implement these optimizations upfront
before merging the first version, but let's remove the
non-chunkbased crutch unless we see it as absolutely necessary.

> +3. Receiver does listen()
> +4. Sender does connect()
> +5. Receiver accept()
> +6. Check versioning and capabilities (described later)
> +
> +At this point, we define a control channel on top of SEND messages
> +which is described by a formal protocol. Each SEND message has a 
> +header portion and a data portion (but together are transmitted 
> +as a single SEND message).
> +
> +Header:
> +    * Length  (of the data portion)
> +    * Type    (what command to perform, described below)
> +    * Version (protocol version validated before send/recv occurs)

What's the expected value for Version field?
Also, confusing.  Below mentions using private field in librdmacm instead?
Need to add # of bytes and endian-ness of each field.

> +
> +The 'type' field has 7 different command values:

0. Unused.

> +    1. None

you mean this is unused?

> +    2. Ready             (control-channel is available) 
> +    3. QEMU File         (for sending non-live device state) 
> +    4. RAM Blocks        (used right after connection setup)
> +    5. Register request  (dynamic chunk registration) 
> +    6. Register result   ('rkey' to be used by sender)

Hmm, don't you also need a virtual address for RDMA writes?

> +    7. Register finished (registration for current iteration finished)

What does Register finished mean and how it's used?

Need to add which commands have a data portion, and in what format.

> +
> +After connection setup is completed, we have two protocol-level
> +functions, responsible for communicating control-channel commands
> +using the above list of values: 
> +
> +Logically:
> +
> +qemu_rdma_exchange_recv(header, expected command type)
> +
> +1. We transmit a READY command to let the sender know that 

you call it Ready above, so better be consistent.

> +   we are *ready* to receive some data bytes on the control channel.
> +2. Before attempting to receive the expected command, we post another
> +   RQ work request to replace the one we just used up.
> +3. Block on a CQ event channel and wait for the SEND to arrive.
> +4. When the send arrives, librdmacm will unblock us.
> +5. Verify that the command-type and version received matches the one we expected.
> +
> +qemu_rdma_exchange_send(header, data, optional response header & data): 
> +
> +1. Block on the CQ event channel waiting for a READY command
> +   from the receiver to tell us that the receiver
> +   is *ready* for us to transmit some new bytes.
> +2. Optionally: if we are expecting a response from the command
> +   (that we have no yet transmitted),

Which commands expect result? Only Register request?

> let's post an RQ
> +   work request to receive that data a few moments later. 
> +3. When the READY arrives, librdmacm will 
> +   unblock us and we immediately post a RQ work request
> +   to replace the one we just used up.
> +4. Now, we can actually post the work request to SEND
> +   the requested command type of the header we were asked for.
> +5. Optionally, if we are expecting a response (as before),
> +   we block again and wait for that response using the additional
> +   work request we previously posted. (This is used to carry
> +   'Register result' commands #6 back to the sender which
> +   hold the rkey need to perform RDMA.
> +
> +All of the remaining command types (not including 'ready')
> +described above all use the aformentioned two functions to do the hard work:
> +
> +1. After connection setup, RAMBlock information is exchanged using
> +   this protocol before the actual migration begins.
> +2. During runtime, once a 'chunk' becomes full of pages ready to
> +   be sent with RDMA, the registration commands are used to ask the
> +   other side to register the memory for this chunk and respond
> +   with the result (rkey) of the registration.
> +3. Also, the QEMUFile interfaces also call these functions (described below)
> +   when transmitting non-live state, such as devices or to send
> +   its own protocol information during the migration process.
> +
> +Versioning
> +==================================
> +
> +librdmacm provides the user with a 'private data' area to be exchanged
> +at connection-setup time before any infiniband traffic is generated.
> +
> +This is a convenient place to check for protocol versioning because the
> +user does not need to register memory to transmit a few bytes of version
> +information.
> +
> +This is also a convenient place to negotiate capabilities
> +(like dynamic page registration).

This would be a good place to document the format of the
private data field.

> +
> +If the version is invalid, we throw an error.

Which version is valid in this specification?

> +
> +If the version is new, we only negotiate the capabilities that the
> +requested version is able to perform and ignore the rest.

What are these capabilities and how do we negotiate them?

> +QEMUFileRDMA Interface:
> +==================================
> +
> +QEMUFileRDMA introduces a couple of new functions:
> +
> +1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
> +2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
> +
> +These two functions are very short and simply used the protocol
> +describe above to deliver bytes without changing the upper-level
> +users of QEMUFile that depend on a bytstream abstraction.
> +
> +Finally, how do we handoff the actual bytes to get_buffer()?
> +
> +Again, because we're trying to "fake" a bytestream abstraction
> +using an analogy not unlike individual UDP frames, we have
> +to hold on to the bytes received from control-channel's SEND 
> +messages in memory.
> +
> +Each time we receive a complete "QEMU File" control-channel 
> +message, the bytes from SEND are copied into a small local holding area.
> +
> +Then, we return the number of bytes requested by get_buffer()
> +and leave the remaining bytes in the holding area until get_buffer()
> +comes around for another pass.
> +
> +If the buffer is empty, then we follow the same steps
> +listed above and issue another "QEMU File" protocol command,
> +asking for a new SEND message to re-fill the buffer.
> +
> +Migration of pc.ram:
> +===============================
> +
> +At the beginning of the migration, (migration-rdma.c),
> +the sender and the receiver populate the list of RAMBlocks
> +to be registered with each other into a structure.
> +Then, using the aforementioned protocol, they exchange a
> +description of these blocks with each other, to be used later 
> +during the iteration of main memory. This description includes
> +a list of all the RAMBlocks, their offsets and lengths and
> +possibly includes pre-registered RDMA keys in case dynamic
> +page registration was disabled on the server-side, otherwise not.

Worth mentioning here that memory hotplug will require a protocol
extension. That's also true of TCP so not a big deal ...

> +
> +Main memory is not migrated with the aforementioned protocol, 
> +but is instead migrated with normal RDMA Write operations.
> +
> +Pages are migrated in "chunks" (about 1 Megabyte right now).

Why "about"? This is not dynamic so needs to be exactly same
on both sides, right?

> +Chunk size is not dynamic, but it could be in a future implementation.
> +There's nothing to indicate that this is useful right now.
> +
> +When a chunk is full (or a flush() occurs), the memory backed by 
> +the chunk is registered with librdmacm and pinned in memory on 
> +both sides using the aforementioned protocol.
> +
> +After pinning, an RDMA Write is generated and tramsmitted
> +for the entire chunk.
> +
> +Chunks are also transmitted in batches: This means that we
> +do not request that the hardware signal the completion queue
> +for the completion of *every* chunk. The current batch size
> +is about 64 chunks (corresponding to 64 MB of memory).
> +Only the last chunk in a batch must be signaled.
> +This helps keep everything as asynchronous as possible
> +and helps keep the hardware busy performing RDMA operations.
> +
> +Error-handling:
> +===============================
> +
> +Infiniband has what is called a "Reliable, Connected"
> +link (one of 4 choices). This is the mode in which
> +we use for RDMA migration.
> +
> +If a *single* message fails,
> +the decision is to abort the migration entirely and
> +cleanup all the RDMA descriptors and unregister all
> +the memory.
> +
> +After cleanup, the Virtual Machine is returned to normal
> +operation the same way that would happen if the TCP
> +socket is broken during a non-RDMA based migration.

That's on sender side? Presumably this means you respond to
completion with error?
 How does receive side know
migration is complete?

> +
> +TODO:
> +=================================
> +1. Currently, cgroups swap limits for *both* TCP and RDMA
> +   on the sender-side is broken. This is more poignant for
> +   RDMA because RDMA requires memory registration.
> +   Fixing this requires infiniband page registrations to be
> +   zero-page aware, and this does not yet work properly.
> +2. Currently overcommit for the the *receiver* side of
> +   TCP works, but not for RDMA. While dynamic page registration
> +   *does* work, it is only useful if the is_zero_page() capability
> +   is remained enabled (which it is by default).
> +   However, leaving this capability turned on *significantly* slows
> +   down the RDMA throughput, particularly on hardware capable
> +   of transmitting faster than 10 gbps (such as 40gbps links).
> +3. Use of the recent /dev/<pid>/pagemap would likely solve some
> +   of these problems.
> +4. Also, some form of balloon-device usage tracking would also
> +   help aleviate some of these issues.
> +
> +PERFORMANCE
> +===================
> +
> +Using a 40gbps infinband link performing a worst-case stress test:
> +
> +RDMA Throughput With $ stress --vm-bytes 1024M --vm 1 --vm-keep
> +Approximately 30 gpbs (little better than the paper)
> +1. Average worst-case throughput 
> +TCP Throughput With $ stress --vm-bytes 1024M --vm 1 --vm-keep
> +2. Approximately 8 gpbs (using IPOIB IP over Infiniband)
> +
> +Average downtime (stop time) ranges between 28 and 33 milliseconds.
> +
> +An *exhaustive* paper (2010) shows additional performance details
> +linked on the QEMU wiki:
> +
> +http://wiki.qemu.org/Features/RDMALiveMigration
> -- 
> 1.7.10.4
mrhines@linux.vnet.ibm.com April 10, 2013, 1:04 p.m. UTC | #2
On 04/10/2013 01:27 AM, Michael S. Tsirkin wrote:
> Below is a great high level overview. the protocol looks correct.
> A bit more detail would be helpful, as noted below.
>
> The main thing I'd like to see changed is that there are already
> two protocols here: chunk-based and non chunk based.
> We'll need to use versioning and capabilities going forward but in the
> first version we don't need to maintain compatibility with legacy so
> two versions seems like unnecessary pain.  Chunk based is somewhat slower and
> that is worth fixing longer term, but seems like the way forward. So
> let's implement a single chunk-based protocol in the first version we
> merge.
>
> Some more minor improvement suggestions below.
Thanks.

However, IMHO restricting the policy to only used chunk-based is really
not an acceptable choice:

Here's the reason: Using my 10gbs RDMA hardware, throughput takes a dive 
from 10gbps to 6gbps.

But if I disable chunk-based registration altogether (forgoing 
overcommit), then performance comes back.

The reason for this is is the additional control trannel traffic needed 
to ask the server to register
memory pages on demand - without this traffic, we can easily saturate 
the link.

But with this traffic, the user needs to know (and be given the option) 
to disable the feature
in case they want performance instead of flexibility.

> On Mon, Apr 08, 2013 at 11:04:32PM -0400, mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> Both the protocol and interfaces are elaborated in more detail,
>> including the new use of dynamic chunk registration, versioning,
>> and capabilities negotiation.
>>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>> ---
>>   docs/rdma.txt |  313 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 313 insertions(+)
>>   create mode 100644 docs/rdma.txt
>>
>> diff --git a/docs/rdma.txt b/docs/rdma.txt
>> new file mode 100644
>> index 0000000..e9fa4cd
>> --- /dev/null
>> +++ b/docs/rdma.txt
>> @@ -0,0 +1,313 @@
>> +Several changes since v4:
>> +
>> +- Created a "formal" protocol for the RDMA control channel
>> +- Dynamic, chunked page registration now implemented on *both* the server and client
>> +- Created new 'capability' for page registration
>> +- Created new 'capability' for is_zero_page() - enabled by default
>> +  (needed to test dynamic page registration)
>> +- Created version-check before protocol begins at connection-time
>> +- no more migrate_use_rdma() !
>> +
>> +NOTE: While dynamic registration works on both sides now,
>> +      it does *not* work with cgroups swap limits. This functionality with infiniband
>> +      remains broken. (It works fine with TCP). So, in order to take full
>> +      advantage of this feature, a fix will have to be developed on the kernel side.
>> +      Alternative proposed is use /dev/<pid>/pagemap. Patch will be submitted.
> You mean the idea of using pagemap to detect shared pages created by KSM
> and/or zero pages? That would be helpful for TCP migration, thanks!

Yes, absolutely. This would *also* help the above registration problem.

We could use this to *pre-register* pages in advance, but that would be
an entirely different patch series (which I'm willing to write and submit).

>> +
> BTW the above comments belong outside both document and commit log,
> after --- before diff.
Acknowledged.

>> +Contents:
>> +=================================
>> +* Compiling
>> +* Running (please readme before running)
>> +* RDMA Protocol Description
>> +* Versioning
>> +* QEMUFileRDMA Interface
>> +* Migration of pc.ram
>> +* Error handling
>> +* TODO
>> +* Performance
>> +
>> +COMPILING:
>> +===============================
>> +
>> +$ ./configure --enable-rdma --target-list=x86_64-softmmu
>> +$ make
>> +
>> +RUNNING:
>> +===============================
>> +
>> +First, decide if you want dynamic page registration on the server-side.
>> +This always happens on the primary-VM side, but is optional on the server.
>> +Doing this allows you to support overcommit (such as cgroups or ballooning)
>> +with a smaller footprint on the server-side without having to register the
>> +entire VM memory footprint.
>> +NOTE: This significantly slows down performance (about 30% slower).
> Where does the overhead come from? It appears from the description that
> you have exactly same amount of data to exchange using send messages,
> either way?
> Or are you using bigger chunks with upfront registration?

Answer is above.

Upfront registration registers the entire VM before migration starts
where as dynamic registration (on both sides) registers chunks in
1 MB increments as they are requested by the migration_thread.

The extra send messages required to request the server to register
the memory means that the RDMA must block until those messages
complete before the RDMA can begin.

>> +
>> +$ virsh qemu-monitor-command --hmp \
>> +    --cmd "migrate_set_capability chunk_register_destination on" # disabled by default
> I think the right choice is to make chunk based the default, and remove
> the non chunk based from code.  This will simplify the protocol a tiny bit,
> and make us focus on improving chunk based long term so that it's as
> fast as upfront registration.
Answer above.

>> +
>> +Next, if you decided *not* to use chunked registration on the server,
>> +it is recommended to also disable zero page detection. While this is not
>> +strictly necessary, zero page detection also significantly slows down
>> +performance on higher-throughput links (by about 50%), like 40 gbps infiniband cards:
> What is meant by performance here? downtime?

Throughput. Zero page scanning (and dynamic registration) reduces 
throughput significantly.

>> +
>> +$ virsh qemu-monitor-command --hmp \
>> +    --cmd "migrate_set_capability check_for_zero off" # always enabled by default
>> +
>> +Finally, set the migration speed to match your hardware's capabilities:
>> +
>> +$ virsh qemu-monitor-command --hmp \
>> +    --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
>> +
>> +Finally, perform the actual migration:
>> +
>> +$ virsh migrate domain rdma:xx.xx.xx.xx:port
>> +
>> +RDMA Protocol Description:
>> +=================================
>> +
>> +Migration with RDMA is separated into two parts:
>> +
>> +1. The transmission of the pages using RDMA
>> +2. Everything else (a control channel is introduced)
>> +
>> +"Everything else" is transmitted using a formal
>> +protocol now, consisting of infiniband SEND / RECV messages.
>> +
>> +An infiniband SEND message is the standard ibverbs
>> +message used by applications of infiniband hardware.
>> +The only difference between a SEND message and an RDMA
>> +message is that SEND message cause completion notifications
>> +to be posted to the completion queue (CQ) on the
>> +infiniband receiver side, whereas RDMA messages (used
>> +for pc.ram) do not (to behave like an actual DMA).
>> +
>> +Messages in infiniband require two things:
>> +
>> +1. registration of the memory that will be transmitted
>> +2. (SEND/RECV only) work requests to be posted on both
>> +   sides of the network before the actual transmission
>> +   can occur.
>> +
>> +RDMA messages much easier to deal with. Once the memory
>> +on the receiver side is registered and pinned, we're
>> +basically done. All that is required is for the sender
>> +side to start dumping bytes onto the link.
> When is memory unregistered and unpinned on send and receive
> sides?
Only when the migration ends completely. Will update the documentation.

>> +
>> +SEND messages require more coordination because the
>> +receiver must have reserved space (using a receive
>> +work request) on the receive queue (RQ) before QEMUFileRDMA
>> +can start using them to carry all the bytes as
>> +a transport for migration of device state.
>> +
>> +To begin the migration, the initial connection setup is
>> +as follows (migration-rdma.c):
>> +
>> +1. Receiver and Sender are started (command line or libvirt):
>> +2. Both sides post two RQ work requests
> Okay this could be where the problem is. This means with chunk
> based receive side does:
>
> loop:
> 	receive request
> 	register
> 	send response
>
> while with non chunk based it does:
>
> receive request
> send response
> loop:
> 	register
No, that's incorrect. With "non" chunk based, the receive side does 
*not* communicate
during the migration of pc.ram.

The control channel is only used for chunk registration and device 
state, not RAM.

I will update the documentation to make that more clear.

> In reality each request/response requires two network round-trips
> with the Ready credit-management messsages.
> So the overhead will likely be avoided if we add better pipelining:
> allow multiple registration requests in the air, and add more
> send/receive credits so the overhead of credit management can be
> reduced.
Unfortunately, the migration thread doesn't work that way.
The thread only generates one page write at-a-time.

If someone were to write a patch which submits multiple
writes at the same time, I would be very interested in
consuming that feature and making chunk registration more
efficient by batching multiple registrations into fewer messages.

> There's no requirement to implement these optimizations upfront
> before merging the first version, but let's remove the
> non-chunkbased crutch unless we see it as absolutely necessary.
>
>> +3. Receiver does listen()
>> +4. Sender does connect()
>> +5. Receiver accept()
>> +6. Check versioning and capabilities (described later)
>> +
>> +At this point, we define a control channel on top of SEND messages
>> +which is described by a formal protocol. Each SEND message has a
>> +header portion and a data portion (but together are transmitted
>> +as a single SEND message).
>> +
>> +Header:
>> +    * Length  (of the data portion)
>> +    * Type    (what command to perform, described below)
>> +    * Version (protocol version validated before send/recv occurs)
> What's the expected value for Version field?
> Also, confusing.  Below mentions using private field in librdmacm instead?
> Need to add # of bytes and endian-ness of each field.

Correct, those are two separate versions. One for capability negotiation
and one for the protocol itself.

I will update the documentation.

>> +
>> +The 'type' field has 7 different command values:
> 0. Unused.
>
>> +    1. None
> you mean this is unused?

Correct - will update.

>> +    2. Ready             (control-channel is available)
>> +    3. QEMU File         (for sending non-live device state)
>> +    4. RAM Blocks        (used right after connection setup)
>> +    5. Register request  (dynamic chunk registration)
>> +    6. Register result   ('rkey' to be used by sender)
> Hmm, don't you also need a virtual address for RDMA writes?
>

The virtual addresses are communicated at the beginning of the
migration using command #4 "Ram blocks".

>> +    7. Register finished (registration for current iteration finished)
> What does Register finished mean and how it's used?
>
> Need to add which commands have a data portion, and in what format.

Acknowledged. "finished" signals that a migration round has completed
and that the receiver side can move to the next iteration.


>> +
>> +After connection setup is completed, we have two protocol-level
>> +functions, responsible for communicating control-channel commands
>> +using the above list of values:
>> +
>> +Logically:
>> +
>> +qemu_rdma_exchange_recv(header, expected command type)
>> +
>> +1. We transmit a READY command to let the sender know that
> you call it Ready above, so better be consistent.
>
>> +   we are *ready* to receive some data bytes on the control channel.
>> +2. Before attempting to receive the expected command, we post another
>> +   RQ work request to replace the one we just used up.
>> +3. Block on a CQ event channel and wait for the SEND to arrive.
>> +4. When the send arrives, librdmacm will unblock us.
>> +5. Verify that the command-type and version received matches the one we expected.
>> +
>> +qemu_rdma_exchange_send(header, data, optional response header & data):
>> +
>> +1. Block on the CQ event channel waiting for a READY command
>> +   from the receiver to tell us that the receiver
>> +   is *ready* for us to transmit some new bytes.
>> +2. Optionally: if we are expecting a response from the command
>> +   (that we have no yet transmitted),
> Which commands expect result? Only Register request?

Yes, only register. In the code, the command is #define 
RDMA_CONTROL_REGISTER_RESULT

>> let's post an RQ
>> +   work request to receive that data a few moments later.
>> +3. When the READY arrives, librdmacm will
>> +   unblock us and we immediately post a RQ work request
>> +   to replace the one we just used up.
>> +4. Now, we can actually post the work request to SEND
>> +   the requested command type of the header we were asked for.
>> +5. Optionally, if we are expecting a response (as before),
>> +   we block again and wait for that response using the additional
>> +   work request we previously posted. (This is used to carry
>> +   'Register result' commands #6 back to the sender which
>> +   hold the rkey need to perform RDMA.
>> +
>> +All of the remaining command types (not including 'ready')
>> +described above all use the aformentioned two functions to do the hard work:
>> +
>> +1. After connection setup, RAMBlock information is exchanged using
>> +   this protocol before the actual migration begins.
>> +2. During runtime, once a 'chunk' becomes full of pages ready to
>> +   be sent with RDMA, the registration commands are used to ask the
>> +   other side to register the memory for this chunk and respond
>> +   with the result (rkey) of the registration.
>> +3. Also, the QEMUFile interfaces also call these functions (described below)
>> +   when transmitting non-live state, such as devices or to send
>> +   its own protocol information during the migration process.
>> +
>> +Versioning
>> +==================================
>> +
>> +librdmacm provides the user with a 'private data' area to be exchanged
>> +at connection-setup time before any infiniband traffic is generated.
>> +
>> +This is a convenient place to check for protocol versioning because the
>> +user does not need to register memory to transmit a few bytes of version
>> +information.
>> +
>> +This is also a convenient place to negotiate capabilities
>> +(like dynamic page registration).
> This would be a good place to document the format of the
> private data field.

Acnkowledged.


>> +
>> +If the version is invalid, we throw an error.
> Which version is valid in this specification?
Version 1. Will update.
>> +
>> +If the version is new, we only negotiate the capabilities that the
>> +requested version is able to perform and ignore the rest.
> What are these capabilities and how do we negotiate them?
There is only one capability right now: dynamic server registration.

The client must tell the server whether or not the capability was
enabled or not on the primary VM side.

Will update the documentation.

>> +QEMUFileRDMA Interface:
>> +==================================
>> +
>> +QEMUFileRDMA introduces a couple of new functions:
>> +
>> +1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
>> +2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
>> +
>> +These two functions are very short and simply used the protocol
>> +describe above to deliver bytes without changing the upper-level
>> +users of QEMUFile that depend on a bytstream abstraction.
>> +
>> +Finally, how do we handoff the actual bytes to get_buffer()?
>> +
>> +Again, because we're trying to "fake" a bytestream abstraction
>> +using an analogy not unlike individual UDP frames, we have
>> +to hold on to the bytes received from control-channel's SEND
>> +messages in memory.
>> +
>> +Each time we receive a complete "QEMU File" control-channel
>> +message, the bytes from SEND are copied into a small local holding area.
>> +
>> +Then, we return the number of bytes requested by get_buffer()
>> +and leave the remaining bytes in the holding area until get_buffer()
>> +comes around for another pass.
>> +
>> +If the buffer is empty, then we follow the same steps
>> +listed above and issue another "QEMU File" protocol command,
>> +asking for a new SEND message to re-fill the buffer.
>> +
>> +Migration of pc.ram:
>> +===============================
>> +
>> +At the beginning of the migration, (migration-rdma.c),
>> +the sender and the receiver populate the list of RAMBlocks
>> +to be registered with each other into a structure.
>> +Then, using the aforementioned protocol, they exchange a
>> +description of these blocks with each other, to be used later
>> +during the iteration of main memory. This description includes
>> +a list of all the RAMBlocks, their offsets and lengths and
>> +possibly includes pre-registered RDMA keys in case dynamic
>> +page registration was disabled on the server-side, otherwise not.
> Worth mentioning here that memory hotplug will require a protocol
> extension. That's also true of TCP so not a big deal ...

Acknowledged.

>> +
>> +Main memory is not migrated with the aforementioned protocol,
>> +but is instead migrated with normal RDMA Write operations.
>> +
>> +Pages are migrated in "chunks" (about 1 Megabyte right now).
> Why "about"? This is not dynamic so needs to be exactly same
> on both sides, right?
About is a typo =). It is hard-coded to exactly 1MB.

>
>> +Chunk size is not dynamic, but it could be in a future implementation.
>> +There's nothing to indicate that this is useful right now.
>> +
>> +When a chunk is full (or a flush() occurs), the memory backed by
>> +the chunk is registered with librdmacm and pinned in memory on
>> +both sides using the aforementioned protocol.
>> +
>> +After pinning, an RDMA Write is generated and tramsmitted
>> +for the entire chunk.
>> +
>> +Chunks are also transmitted in batches: This means that we
>> +do not request that the hardware signal the completion queue
>> +for the completion of *every* chunk. The current batch size
>> +is about 64 chunks (corresponding to 64 MB of memory).
>> +Only the last chunk in a batch must be signaled.
>> +This helps keep everything as asynchronous as possible
>> +and helps keep the hardware busy performing RDMA operations.
>> +
>> +Error-handling:
>> +===============================
>> +
>> +Infiniband has what is called a "Reliable, Connected"
>> +link (one of 4 choices). This is the mode in which
>> +we use for RDMA migration.
>> +
>> +If a *single* message fails,
>> +the decision is to abort the migration entirely and
>> +cleanup all the RDMA descriptors and unregister all
>> +the memory.
>> +
>> +After cleanup, the Virtual Machine is returned to normal
>> +operation the same way that would happen if the TCP
>> +socket is broken during a non-RDMA based migration.
> That's on sender side? Presumably this means you respond to
> completion with error?
>   How does receive side know
> migration is complete?

Yes, on the sender side.

Migration "completeness" logic has not changed in this patch series.

Pleas recall that the entire QEMUFile protocol is still
happening at the upper-level inside of savevm.c/arch_init.c.



>> +
>> +TODO:
>> +=================================
>> +1. Currently, cgroups swap limits for *both* TCP and RDMA
>> +   on the sender-side is broken. This is more poignant for
>> +   RDMA because RDMA requires memory registration.
>> +   Fixing this requires infiniband page registrations to be
>> +   zero-page aware, and this does not yet work properly.
>> +2. Currently overcommit for the the *receiver* side of
>> +   TCP works, but not for RDMA. While dynamic page registration
>> +   *does* work, it is only useful if the is_zero_page() capability
>> +   is remained enabled (which it is by default).
>> +   However, leaving this capability turned on *significantly* slows
>> +   down the RDMA throughput, particularly on hardware capable
>> +   of transmitting faster than 10 gbps (such as 40gbps links).
>> +3. Use of the recent /dev/<pid>/pagemap would likely solve some
>> +   of these problems.
>> +4. Also, some form of balloon-device usage tracking would also
>> +   help aleviate some of these issues.
>> +
>> +PERFORMANCE
>> +===================
>> +
>> +Using a 40gbps infinband link performing a worst-case stress test:
>> +
>> +RDMA Throughput With $ stress --vm-bytes 1024M --vm 1 --vm-keep
>> +Approximately 30 gpbs (little better than the paper)
>> +1. Average worst-case throughput
>> +TCP Throughput With $ stress --vm-bytes 1024M --vm 1 --vm-keep
>> +2. Approximately 8 gpbs (using IPOIB IP over Infiniband)
>> +
>> +Average downtime (stop time) ranges between 28 and 33 milliseconds.
>> +
>> +An *exhaustive* paper (2010) shows additional performance details
>> +linked on the QEMU wiki:
>> +
>> +http://wiki.qemu.org/Features/RDMALiveMigration
>> -- 
>> 1.7.10.4
Michael S. Tsirkin April 10, 2013, 1:34 p.m. UTC | #3
On Wed, Apr 10, 2013 at 09:04:44AM -0400, Michael R. Hines wrote:
> On 04/10/2013 01:27 AM, Michael S. Tsirkin wrote:
> >Below is a great high level overview. the protocol looks correct.
> >A bit more detail would be helpful, as noted below.
> >
> >The main thing I'd like to see changed is that there are already
> >two protocols here: chunk-based and non chunk based.
> >We'll need to use versioning and capabilities going forward but in the
> >first version we don't need to maintain compatibility with legacy so
> >two versions seems like unnecessary pain.  Chunk based is somewhat slower and
> >that is worth fixing longer term, but seems like the way forward. So
> >let's implement a single chunk-based protocol in the first version we
> >merge.
> >
> >Some more minor improvement suggestions below.
> Thanks.
> 
> However, IMHO restricting the policy to only used chunk-based is really
> not an acceptable choice:
> 
> Here's the reason: Using my 10gbs RDMA hardware, throughput takes a
> dive from 10gbps to 6gbps.

Who cares about the throughput really? What we do care about
is how long the whole process takes.



> But if I disable chunk-based registration altogether (forgoing
> overcommit), then performance comes back.
> 
> The reason for this is is the additional control trannel traffic
> needed to ask the server to register
> memory pages on demand - without this traffic, we can easily
> saturate the link.
> But with this traffic, the user needs to know (and be given the
> option) to disable the feature
> in case they want performance instead of flexibility.
> 

IMO that's just because the current control protocol is so inefficient.
You just need to pipeline the registration: request the next chunk
while remote side is handling the previous one(s).

With any protocol, you still need to:
	register all memory
	send addresses and keys to source
	get notification that write is done
what is different with chunk based?
simply that there are several network roundtrips
before the process can start.
So part of the time you are not doing writes,
you are waiting for the next control message.

So you should be doing several in parallel.
This will complicate the procotol though, so I am not asking
for this right away.

But a broken pin-it-all alternative will just confuse matters.  It is
best to keep it out of tree.


> >On Mon, Apr 08, 2013 at 11:04:32PM -0400, mrhines@linux.vnet.ibm.com wrote:
> >>From: "Michael R. Hines" <mrhines@us.ibm.com>
> >>
> >>Both the protocol and interfaces are elaborated in more detail,
> >>including the new use of dynamic chunk registration, versioning,
> >>and capabilities negotiation.
> >>
> >>Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> >>---
> >>  docs/rdma.txt |  313 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 313 insertions(+)
> >>  create mode 100644 docs/rdma.txt
> >>
> >>diff --git a/docs/rdma.txt b/docs/rdma.txt
> >>new file mode 100644
> >>index 0000000..e9fa4cd
> >>--- /dev/null
> >>+++ b/docs/rdma.txt
> >>@@ -0,0 +1,313 @@
> >>+Several changes since v4:
> >>+
> >>+- Created a "formal" protocol for the RDMA control channel
> >>+- Dynamic, chunked page registration now implemented on *both* the server and client
> >>+- Created new 'capability' for page registration
> >>+- Created new 'capability' for is_zero_page() - enabled by default
> >>+  (needed to test dynamic page registration)
> >>+- Created version-check before protocol begins at connection-time
> >>+- no more migrate_use_rdma() !
> >>+
> >>+NOTE: While dynamic registration works on both sides now,
> >>+      it does *not* work with cgroups swap limits. This functionality with infiniband
> >>+      remains broken. (It works fine with TCP). So, in order to take full
> >>+      advantage of this feature, a fix will have to be developed on the kernel side.
> >>+      Alternative proposed is use /dev/<pid>/pagemap. Patch will be submitted.
> >You mean the idea of using pagemap to detect shared pages created by KSM
> >and/or zero pages? That would be helpful for TCP migration, thanks!
> 
> Yes, absolutely. This would *also* help the above registration problem.
> 
> We could use this to *pre-register* pages in advance, but that would be
> an entirely different patch series (which I'm willing to write and submit).
> 
> >>+
> >BTW the above comments belong outside both document and commit log,
> >after --- before diff.
> Acknowledged.
> 
> >>+Contents:
> >>+=================================
> >>+* Compiling
> >>+* Running (please readme before running)
> >>+* RDMA Protocol Description
> >>+* Versioning
> >>+* QEMUFileRDMA Interface
> >>+* Migration of pc.ram
> >>+* Error handling
> >>+* TODO
> >>+* Performance
> >>+
> >>+COMPILING:
> >>+===============================
> >>+
> >>+$ ./configure --enable-rdma --target-list=x86_64-softmmu
> >>+$ make
> >>+
> >>+RUNNING:
> >>+===============================
> >>+
> >>+First, decide if you want dynamic page registration on the server-side.
> >>+This always happens on the primary-VM side, but is optional on the server.
> >>+Doing this allows you to support overcommit (such as cgroups or ballooning)
> >>+with a smaller footprint on the server-side without having to register the
> >>+entire VM memory footprint.
> >>+NOTE: This significantly slows down performance (about 30% slower).
> >Where does the overhead come from? It appears from the description that
> >you have exactly same amount of data to exchange using send messages,
> >either way?
> >Or are you using bigger chunks with upfront registration?
> 
> Answer is above.
> 
> Upfront registration registers the entire VM before migration starts
> where as dynamic registration (on both sides) registers chunks in
> 1 MB increments as they are requested by the migration_thread.
> 
> The extra send messages required to request the server to register
> the memory means that the RDMA must block until those messages
> complete before the RDMA can begin.

So make the protocol smarter and fix this. This is not something
management needs to know about.


If you like, you can teach management to specify the max amount of
memory pinned. It should be specified at the appropriate place:
on the remote for remote, on source for source.

> >>+
> >>+$ virsh qemu-monitor-command --hmp \
> >>+    --cmd "migrate_set_capability chunk_register_destination on" # disabled by default
> >I think the right choice is to make chunk based the default, and remove
> >the non chunk based from code.  This will simplify the protocol a tiny bit,
> >and make us focus on improving chunk based long term so that it's as
> >fast as upfront registration.
> Answer above.
> 
> >>+
> >>+Next, if you decided *not* to use chunked registration on the server,
> >>+it is recommended to also disable zero page detection. While this is not
> >>+strictly necessary, zero page detection also significantly slows down
> >>+performance on higher-throughput links (by about 50%), like 40 gbps infiniband cards:
> >What is meant by performance here? downtime?
> 
> Throughput. Zero page scanning (and dynamic registration) reduces
> throughput significantly.

Again, not something management should worry about.
Do the right thing internally.

> >>+
> >>+$ virsh qemu-monitor-command --hmp \
> >>+    --cmd "migrate_set_capability check_for_zero off" # always enabled by default
> >>+
> >>+Finally, set the migration speed to match your hardware's capabilities:
> >>+
> >>+$ virsh qemu-monitor-command --hmp \
> >>+    --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
> >>+
> >>+Finally, perform the actual migration:
> >>+
> >>+$ virsh migrate domain rdma:xx.xx.xx.xx:port
> >>+
> >>+RDMA Protocol Description:
> >>+=================================
> >>+
> >>+Migration with RDMA is separated into two parts:
> >>+
> >>+1. The transmission of the pages using RDMA
> >>+2. Everything else (a control channel is introduced)
> >>+
> >>+"Everything else" is transmitted using a formal
> >>+protocol now, consisting of infiniband SEND / RECV messages.
> >>+
> >>+An infiniband SEND message is the standard ibverbs
> >>+message used by applications of infiniband hardware.
> >>+The only difference between a SEND message and an RDMA
> >>+message is that SEND message cause completion notifications
> >>+to be posted to the completion queue (CQ) on the
> >>+infiniband receiver side, whereas RDMA messages (used
> >>+for pc.ram) do not (to behave like an actual DMA).
> >>+
> >>+Messages in infiniband require two things:
> >>+
> >>+1. registration of the memory that will be transmitted
> >>+2. (SEND/RECV only) work requests to be posted on both
> >>+   sides of the network before the actual transmission
> >>+   can occur.
> >>+
> >>+RDMA messages much easier to deal with. Once the memory
> >>+on the receiver side is registered and pinned, we're
> >>+basically done. All that is required is for the sender
> >>+side to start dumping bytes onto the link.
> >When is memory unregistered and unpinned on send and receive
> >sides?
> Only when the migration ends completely. Will update the documentation.
> 
> >>+
> >>+SEND messages require more coordination because the
> >>+receiver must have reserved space (using a receive
> >>+work request) on the receive queue (RQ) before QEMUFileRDMA
> >>+can start using them to carry all the bytes as
> >>+a transport for migration of device state.
> >>+
> >>+To begin the migration, the initial connection setup is
> >>+as follows (migration-rdma.c):
> >>+
> >>+1. Receiver and Sender are started (command line or libvirt):
> >>+2. Both sides post two RQ work requests
> >Okay this could be where the problem is. This means with chunk
> >based receive side does:
> >
> >loop:
> >	receive request
> >	register
> >	send response
> >
> >while with non chunk based it does:
> >
> >receive request
> >send response
> >loop:
> >	register
> No, that's incorrect. With "non" chunk based, the receive side does
> *not* communicate
> during the migration of pc.ram.

It does not matter when this happens. What we care about is downtime and
total time from start of qemu on remote and until migration completes.
Not peak throughput.
If you don't count registration time on remote, that's just wrong.

> The control channel is only used for chunk registration and device
> state, not RAM.
> 
> I will update the documentation to make that more clear.

It's clear enough I think. But it seems you are measuring
the wrong things.

> >In reality each request/response requires two network round-trips
> >with the Ready credit-management messsages.
> >So the overhead will likely be avoided if we add better pipelining:
> >allow multiple registration requests in the air, and add more
> >send/receive credits so the overhead of credit management can be
> >reduced.
> Unfortunately, the migration thread doesn't work that way.
> The thread only generates one page write at-a-time.

Yes but you do not have to block it. Each page is in these states:
	- unpinned not sent
	- pinned no rkey
	- pinned have rkey
	- unpinned sent

Each time you get a new page, it's in unpinned not sent state.
So you can start it on this state machine, and tell migration thread
to proceed tothe next page.

> If someone were to write a patch which submits multiple
> writes at the same time, I would be very interested in
> consuming that feature and making chunk registration more
> efficient by batching multiple registrations into fewer messages.

No changes to migration core is necessary I think.
But assuming they are - your protocol design and
management API should not be driven by internal qemu APIs.

> >There's no requirement to implement these optimizations upfront
> >before merging the first version, but let's remove the
> >non-chunkbased crutch unless we see it as absolutely necessary.
> >
> >>+3. Receiver does listen()
> >>+4. Sender does connect()
> >>+5. Receiver accept()
> >>+6. Check versioning and capabilities (described later)
> >>+
> >>+At this point, we define a control channel on top of SEND messages
> >>+which is described by a formal protocol. Each SEND message has a
> >>+header portion and a data portion (but together are transmitted
> >>+as a single SEND message).
> >>+
> >>+Header:
> >>+    * Length  (of the data portion)
> >>+    * Type    (what command to perform, described below)
> >>+    * Version (protocol version validated before send/recv occurs)
> >What's the expected value for Version field?
> >Also, confusing.  Below mentions using private field in librdmacm instead?
> >Need to add # of bytes and endian-ness of each field.
> 
> Correct, those are two separate versions. One for capability negotiation
> and one for the protocol itself.
> 
> I will update the documentation.

Just drop the all-pinned version, and we'll work to improve
the chunk-based one until it has reasonable performance.
It seems to get a decent speed already: consider that
most people run migration with the default speed limit.
Supporting all-pinned will just be a pain down the road when
we fix performance for chunk based one.


> >>+
> >>+The 'type' field has 7 different command values:
> >0. Unused.
> >
> >>+    1. None
> >you mean this is unused?
> 
> Correct - will update.
> 
> >>+    2. Ready             (control-channel is available)
> >>+    3. QEMU File         (for sending non-live device state)
> >>+    4. RAM Blocks        (used right after connection setup)
> >>+    5. Register request  (dynamic chunk registration)
> >>+    6. Register result   ('rkey' to be used by sender)
> >Hmm, don't you also need a virtual address for RDMA writes?
> >
> 
> The virtual addresses are communicated at the beginning of the
> migration using command #4 "Ram blocks".

Yes but ram blocks are sent source to dest.
virtual address needs to be sent dest to source no?

> >>+    7. Register finished (registration for current iteration finished)
> >What does Register finished mean and how it's used?
> >
> >Need to add which commands have a data portion, and in what format.
> 
> Acknowledged. "finished" signals that a migration round has completed
> and that the receiver side can move to the next iteration.
> 
> 
> >>+
> >>+After connection setup is completed, we have two protocol-level
> >>+functions, responsible for communicating control-channel commands
> >>+using the above list of values:
> >>+
> >>+Logically:
> >>+
> >>+qemu_rdma_exchange_recv(header, expected command type)
> >>+
> >>+1. We transmit a READY command to let the sender know that
> >you call it Ready above, so better be consistent.
> >
> >>+   we are *ready* to receive some data bytes on the control channel.
> >>+2. Before attempting to receive the expected command, we post another
> >>+   RQ work request to replace the one we just used up.
> >>+3. Block on a CQ event channel and wait for the SEND to arrive.
> >>+4. When the send arrives, librdmacm will unblock us.
> >>+5. Verify that the command-type and version received matches the one we expected.
> >>+
> >>+qemu_rdma_exchange_send(header, data, optional response header & data):
> >>+
> >>+1. Block on the CQ event channel waiting for a READY command
> >>+   from the receiver to tell us that the receiver
> >>+   is *ready* for us to transmit some new bytes.
> >>+2. Optionally: if we are expecting a response from the command
> >>+   (that we have no yet transmitted),
> >Which commands expect result? Only Register request?
> 
> Yes, only register. In the code, the command is #define
> RDMA_CONTROL_REGISTER_RESULT
> 
> >>let's post an RQ
> >>+   work request to receive that data a few moments later.
> >>+3. When the READY arrives, librdmacm will
> >>+   unblock us and we immediately post a RQ work request
> >>+   to replace the one we just used up.
> >>+4. Now, we can actually post the work request to SEND
> >>+   the requested command type of the header we were asked for.
> >>+5. Optionally, if we are expecting a response (as before),
> >>+   we block again and wait for that response using the additional
> >>+   work request we previously posted. (This is used to carry
> >>+   'Register result' commands #6 back to the sender which
> >>+   hold the rkey need to perform RDMA.
> >>+
> >>+All of the remaining command types (not including 'ready')
> >>+described above all use the aformentioned two functions to do the hard work:
> >>+
> >>+1. After connection setup, RAMBlock information is exchanged using
> >>+   this protocol before the actual migration begins.
> >>+2. During runtime, once a 'chunk' becomes full of pages ready to
> >>+   be sent with RDMA, the registration commands are used to ask the
> >>+   other side to register the memory for this chunk and respond
> >>+   with the result (rkey) of the registration.
> >>+3. Also, the QEMUFile interfaces also call these functions (described below)
> >>+   when transmitting non-live state, such as devices or to send
> >>+   its own protocol information during the migration process.
> >>+
> >>+Versioning
> >>+==================================
> >>+
> >>+librdmacm provides the user with a 'private data' area to be exchanged
> >>+at connection-setup time before any infiniband traffic is generated.
> >>+
> >>+This is a convenient place to check for protocol versioning because the
> >>+user does not need to register memory to transmit a few bytes of version
> >>+information.
> >>+
> >>+This is also a convenient place to negotiate capabilities
> >>+(like dynamic page registration).
> >This would be a good place to document the format of the
> >private data field.
> 
> Acnkowledged.
> 
> 
> >>+
> >>+If the version is invalid, we throw an error.
> >Which version is valid in this specification?
> Version 1. Will update.
> >>+
> >>+If the version is new, we only negotiate the capabilities that the
> >>+requested version is able to perform and ignore the rest.
> >What are these capabilities and how do we negotiate them?
> There is only one capability right now: dynamic server registration.
> 
> The client must tell the server whether or not the capability was
> enabled or not on the primary VM side.
> 
> Will update the documentation.

Cool, best add an exact structure format.

> >>+QEMUFileRDMA Interface:
> >>+==================================
> >>+
> >>+QEMUFileRDMA introduces a couple of new functions:
> >>+
> >>+1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
> >>+2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
> >>+
> >>+These two functions are very short and simply used the protocol
> >>+describe above to deliver bytes without changing the upper-level
> >>+users of QEMUFile that depend on a bytstream abstraction.
> >>+
> >>+Finally, how do we handoff the actual bytes to get_buffer()?
> >>+
> >>+Again, because we're trying to "fake" a bytestream abstraction
> >>+using an analogy not unlike individual UDP frames, we have
> >>+to hold on to the bytes received from control-channel's SEND
> >>+messages in memory.
> >>+
> >>+Each time we receive a complete "QEMU File" control-channel
> >>+message, the bytes from SEND are copied into a small local holding area.
> >>+
> >>+Then, we return the number of bytes requested by get_buffer()
> >>+and leave the remaining bytes in the holding area until get_buffer()
> >>+comes around for another pass.
> >>+
> >>+If the buffer is empty, then we follow the same steps
> >>+listed above and issue another "QEMU File" protocol command,
> >>+asking for a new SEND message to re-fill the buffer.
> >>+
> >>+Migration of pc.ram:
> >>+===============================
> >>+
> >>+At the beginning of the migration, (migration-rdma.c),
> >>+the sender and the receiver populate the list of RAMBlocks
> >>+to be registered with each other into a structure.
> >>+Then, using the aforementioned protocol, they exchange a
> >>+description of these blocks with each other, to be used later
> >>+during the iteration of main memory. This description includes
> >>+a list of all the RAMBlocks, their offsets and lengths and
> >>+possibly includes pre-registered RDMA keys in case dynamic
> >>+page registration was disabled on the server-side, otherwise not.
> >Worth mentioning here that memory hotplug will require a protocol
> >extension. That's also true of TCP so not a big deal ...
> 
> Acknowledged.
> 
> >>+
> >>+Main memory is not migrated with the aforementioned protocol,
> >>+but is instead migrated with normal RDMA Write operations.
> >>+
> >>+Pages are migrated in "chunks" (about 1 Megabyte right now).
> >Why "about"? This is not dynamic so needs to be exactly same
> >on both sides, right?
> About is a typo =). It is hard-coded to exactly 1MB.

This, by the way, is something management *may* want to control.

> >
> >>+Chunk size is not dynamic, but it could be in a future implementation.
> >>+There's nothing to indicate that this is useful right now.
> >>+
> >>+When a chunk is full (or a flush() occurs), the memory backed by
> >>+the chunk is registered with librdmacm and pinned in memory on
> >>+both sides using the aforementioned protocol.
> >>+
> >>+After pinning, an RDMA Write is generated and tramsmitted
> >>+for the entire chunk.
> >>+
> >>+Chunks are also transmitted in batches: This means that we
> >>+do not request that the hardware signal the completion queue
> >>+for the completion of *every* chunk. The current batch size
> >>+is about 64 chunks (corresponding to 64 MB of memory).
> >>+Only the last chunk in a batch must be signaled.
> >>+This helps keep everything as asynchronous as possible
> >>+and helps keep the hardware busy performing RDMA operations.
> >>+
> >>+Error-handling:
> >>+===============================
> >>+
> >>+Infiniband has what is called a "Reliable, Connected"
> >>+link (one of 4 choices). This is the mode in which
> >>+we use for RDMA migration.
> >>+
> >>+If a *single* message fails,
> >>+the decision is to abort the migration entirely and
> >>+cleanup all the RDMA descriptors and unregister all
> >>+the memory.
> >>+
> >>+After cleanup, the Virtual Machine is returned to normal
> >>+operation the same way that would happen if the TCP
> >>+socket is broken during a non-RDMA based migration.
> >That's on sender side? Presumably this means you respond to
> >completion with error?
> >  How does receive side know
> >migration is complete?
> 
> Yes, on the sender side.
> 
> Migration "completeness" logic has not changed in this patch series.
> 
> Pleas recall that the entire QEMUFile protocol is still
> happening at the upper-level inside of savevm.c/arch_init.c.
> 

So basically receive side detects that migration is complete by
looking at the QEMUFile data?

> 
> >>+
> >>+TODO:
> >>+=================================
> >>+1. Currently, cgroups swap limits for *both* TCP and RDMA
> >>+   on the sender-side is broken. This is more poignant for
> >>+   RDMA because RDMA requires memory registration.
> >>+   Fixing this requires infiniband page registrations to be
> >>+   zero-page aware, and this does not yet work properly.
> >>+2. Currently overcommit for the the *receiver* side of
> >>+   TCP works, but not for RDMA. While dynamic page registration
> >>+   *does* work, it is only useful if the is_zero_page() capability
> >>+   is remained enabled (which it is by default).
> >>+   However, leaving this capability turned on *significantly* slows
> >>+   down the RDMA throughput, particularly on hardware capable
> >>+   of transmitting faster than 10 gbps (such as 40gbps links).
> >>+3. Use of the recent /dev/<pid>/pagemap would likely solve some
> >>+   of these problems.
> >>+4. Also, some form of balloon-device usage tracking would also
> >>+   help aleviate some of these issues.
> >>+
> >>+PERFORMANCE
> >>+===================
> >>+
> >>+Using a 40gbps infinband link performing a worst-case stress test:
> >>+
> >>+RDMA Throughput With $ stress --vm-bytes 1024M --vm 1 --vm-keep
> >>+Approximately 30 gpbs (little better than the paper)
> >>+1. Average worst-case throughput
> >>+TCP Throughput With $ stress --vm-bytes 1024M --vm 1 --vm-keep
> >>+2. Approximately 8 gpbs (using IPOIB IP over Infiniband)
> >>+
> >>+Average downtime (stop time) ranges between 28 and 33 milliseconds.
> >>+
> >>+An *exhaustive* paper (2010) shows additional performance details
> >>+linked on the QEMU wiki:
> >>+
> >>+http://wiki.qemu.org/Features/RDMALiveMigration
> >>-- 
> >>1.7.10.4
mrhines@linux.vnet.ibm.com April 10, 2013, 3:29 p.m. UTC | #4
On 04/10/2013 09:34 AM, Michael S. Tsirkin wrote:
> On Wed, Apr 10, 2013 at 09:04:44AM -0400, Michael R. Hines wrote:
>> On 04/10/2013 01:27 AM, Michael S. Tsirkin wrote:
>>> Below is a great high level overview. the protocol looks correct.
>>> A bit more detail would be helpful, as noted below.
>>>
>>> The main thing I'd like to see changed is that there are already
>>> two protocols here: chunk-based and non chunk based.
>>> We'll need to use versioning and capabilities going forward but in the
>>> first version we don't need to maintain compatibility with legacy so
>>> two versions seems like unnecessary pain.  Chunk based is somewhat slower and
>>> that is worth fixing longer term, but seems like the way forward. So
>>> let's implement a single chunk-based protocol in the first version we
>>> merge.
>>>
>>> Some more minor improvement suggestions below.
>> Thanks.
>>
>> However, IMHO restricting the policy to only used chunk-based is really
>> not an acceptable choice:
>>
>> Here's the reason: Using my 10gbs RDMA hardware, throughput takes a
>> dive from 10gbps to 6gbps.
> Who cares about the throughput really? What we do care about
> is how long the whole process takes.
>

Low latency and high throughput is very important =)

Without these properties of RDMA, many workloads simply either
take to long to finish migrating or do not converge to a stopping
point altogether.

*Not making this a configurable option would defeat the purpose of using 
RDMA altogether.

Otherwise, you're no better off than just using TCP.


>
>> But if I disable chunk-based registration altogether (forgoing
>> overcommit), then performance comes back.
>>
>> The reason for this is is the additional control trannel traffic
>> needed to ask the server to register
>> memory pages on demand - without this traffic, we can easily
>> saturate the link.
>> But with this traffic, the user needs to know (and be given the
>> option) to disable the feature
>> in case they want performance instead of flexibility.
>>
> IMO that's just because the current control protocol is so inefficient.
> You just need to pipeline the registration: request the next chunk
> while remote side is handling the previous one(s).
>
> With any protocol, you still need to:
> 	register all memory
> 	send addresses and keys to source
> 	get notification that write is done
> what is different with chunk based?
> simply that there are several network roundtrips
> before the process can start.
> So part of the time you are not doing writes,
> you are waiting for the next control message.
>
> So you should be doing several in parallel.
> This will complicate the procotol though, so I am not asking
> for this right away.
>
> But a broken pin-it-all alternative will just confuse matters.  It is
> best to keep it out of tree.

There's a huge difference. (Answer continued below this one).

The devil is in the details, here: Pipelining is simply not possible
right now because the migration thread has total control over
when and which pages are requested to be migrated.

You can't pipeline page registrations if you don't know the pages are 
dirty -
and the only way to that pages are dirty is if the migration thread told
you to save them.

On the other hand, advanced registration of *known* dirty pages
is very important - I will certainly be submitting a patch in the future
which attempts to handle this case.


> So make the protocol smarter and fix this. This is not something
> management needs to know about.
>
>
> If you like, you can teach management to specify the max amount of
> memory pinned. It should be specified at the appropriate place:
> on the remote for remote, on source for source.
>

Answer below.

>>>
>>> What is meant by performance here? downtime?
>> Throughput. Zero page scanning (and dynamic registration) reduces
>> throughput significantly.
> Again, not something management should worry about.
> Do the right thing internally.

I disagree with that: This is an entirely workload-specific decision,
not a system-level decision.

If I have a known memory-intensive workload that is virtualized,
then it would be "too late" to disable zero page detection *after*
the RDMA migration begins.

We have management tools already that are that smart - there's
nothing wrong with smart managment knowing in advance that
a workload is memory-intensive and also knowing that an RDMA
migration is going to be issued.

There's no way for QEMU to know that in advance without some kind
of advanced heuristic that tracks the behavior of the VM over time,
which I don't think anybody wants to get into the business of writing =)

>>>> +
>>>> +SEND messages require more coordination because the
>>>> +receiver must have reserved space (using a receive
>>>> +work request) on the receive queue (RQ) before QEMUFileRDMA
>>>> +can start using them to carry all the bytes as
>>>> +a transport for migration of device state.
>>>> +
>>>> +To begin the migration, the initial connection setup is
>>>> +as follows (migration-rdma.c):
>>>> +
>>>> +1. Receiver and Sender are started (command line or libvirt):
>>>> +2. Both sides post two RQ work requests
>>> Okay this could be where the problem is. This means with chunk
>>> based receive side does:
>>>
>>> loop:
>>> 	receive request
>>> 	register
>>> 	send response
>>>
>>> while with non chunk based it does:
>>>
>>> receive request
>>> send response
>>> loop:
>>> 	register
>> No, that's incorrect. With "non" chunk based, the receive side does
>> *not* communicate
>> during the migration of pc.ram.
> It does not matter when this happens. What we care about is downtime and
> total time from start of qemu on remote and until migration completes.
> Not peak throughput.
> If you don't count registration time on remote, that's just wrong.

Answer above.


>> The control channel is only used for chunk registration and device
>> state, not RAM.
>>
>> I will update the documentation to make that more clear.
> It's clear enough I think. But it seems you are measuring
> the wrong things.
>
>>> In reality each request/response requires two network round-trips
>>> with the Ready credit-management messsages.
>>> So the overhead will likely be avoided if we add better pipelining:
>>> allow multiple registration requests in the air, and add more
>>> send/receive credits so the overhead of credit management can be
>>> reduced.
>> Unfortunately, the migration thread doesn't work that way.
>> The thread only generates one page write at-a-time.
> Yes but you do not have to block it. Each page is in these states:
> 	- unpinned not sent
> 	- pinned no rkey
> 	- pinned have rkey
> 	- unpinned sent
>
> Each time you get a new page, it's in unpinned not sent state.
> So you can start it on this state machine, and tell migration thread
> to proceed tothe next page.

Yes, I'm doing that already (documented as "batching") in the
docs file.

But the problem is more complicated than that: there is no coordination
between the migration_thread and RDMA right now because Paolo is
trying to maintain a very clean separation of function.

However we *can* do what you described in a future patch like this:

1. Migration thread says "iteration starts, how much memory is dirty?"
2. RDMA protocol says "Is there a lot of dirty memory?"
         OK, yes? Then batch all the registration messages into a single 
request
         but do not write the memory until all the registrations have 
completed.

         OK, no?  Then just issue registrations with very little 
batching so that
                       we can quickly move on to the next iteration round.

Make sense?

>> If someone were to write a patch which submits multiple
>> writes at the same time, I would be very interested in
>> consuming that feature and making chunk registration more
>> efficient by batching multiple registrations into fewer messages.
> No changes to migration core is necessary I think.
> But assuming they are - your protocol design and
> management API should not be driven by internal qemu APIs.

Answer above.

>>> There's no requirement to implement these optimizations upfront
>>> before merging the first version, but let's remove the
>>> non-chunkbased crutch unless we see it as absolutely necessary.
>>>
>>>> +3. Receiver does listen()
>>>> +4. Sender does connect()
>>>> +5. Receiver accept()
>>>> +6. Check versioning and capabilities (described later)
>>>> +
>>>> +At this point, we define a control channel on top of SEND messages
>>>> +which is described by a formal protocol. Each SEND message has a
>>>> +header portion and a data portion (but together are transmitted
>>>> +as a single SEND message).
>>>> +
>>>> +Header:
>>>> +    * Length  (of the data portion)
>>>> +    * Type    (what command to perform, described below)
>>>> +    * Version (protocol version validated before send/recv occurs)
>>> What's the expected value for Version field?
>>> Also, confusing.  Below mentions using private field in librdmacm instead?
>>> Need to add # of bytes and endian-ness of each field.
>> Correct, those are two separate versions. One for capability negotiation
>> and one for the protocol itself.
>>
>> I will update the documentation.
> Just drop the all-pinned version, and we'll work to improve
> the chunk-based one until it has reasonable performance.
> It seems to get a decent speed already: consider that
> most people run migration with the default speed limit.
> Supporting all-pinned will just be a pain down the road when
> we fix performance for chunk based one.
>

The speed tops out at 6gbps, that's not good enough for a 40gbps link.

The migration could complete *much* faster by disabling chunk registration.

We have very large physical machines, where chunk registration is not as 
important
as migrating the workload very quickly with very little downtime.

In these cases, chunk registration just "gets in the way".

>>>> +
>>>> +The 'type' field has 7 different command values:
>>> 0. Unused.
>>>
>>>> +    1. None
>>> you mean this is unused?
>> Correct - will update.
>>
>>>> +    2. Ready             (control-channel is available)
>>>> +    3. QEMU File         (for sending non-live device state)
>>>> +    4. RAM Blocks        (used right after connection setup)
>>>> +    5. Register request  (dynamic chunk registration)
>>>> +    6. Register result   ('rkey' to be used by sender)
>>> Hmm, don't you also need a virtual address for RDMA writes?
>>>
>> The virtual addresses are communicated at the beginning of the
>> migration using command #4 "Ram blocks".
> Yes but ram blocks are sent source to dest.
> virtual address needs to be sent dest to source no?

I just said that, no? =)

>>
>> There is only one capability right now: dynamic server registration.
>>
>> The client must tell the server whether or not the capability was
>> enabled or not on the primary VM side.
>>
>> Will update the documentation.
> Cool, best add an exact structure format.

Acnkowledged.

>>>> +
>>>> +Main memory is not migrated with the aforementioned protocol,
>>>> +but is instead migrated with normal RDMA Write operations.
>>>> +
>>>> +Pages are migrated in "chunks" (about 1 Megabyte right now).
>>> Why "about"? This is not dynamic so needs to be exactly same
>>> on both sides, right?
>> About is a typo =). It is hard-coded to exactly 1MB.
> This, by the way, is something management *may* want to control.

Acknowledged.

>>>> +Chunk size is not dynamic, but it could be in a future implementation.
>>>> +There's nothing to indicate that this is useful right now.
>>>> +
>>>> +When a chunk is full (or a flush() occurs), the memory backed by
>>>> +the chunk is registered with librdmacm and pinned in memory on
>>>> +both sides using the aforementioned protocol.
>>>> +
>>>> +After pinning, an RDMA Write is generated and tramsmitted
>>>> +for the entire chunk.
>>>> +
>>>> +Chunks are also transmitted in batches: This means that we
>>>> +do not request that the hardware signal the completion queue
>>>> +for the completion of *every* chunk. The current batch size
>>>> +is about 64 chunks (corresponding to 64 MB of memory).
>>>> +Only the last chunk in a batch must be signaled.
>>>> +This helps keep everything as asynchronous as possible
>>>> +and helps keep the hardware busy performing RDMA operations.
>>>> +
>>>> +Error-handling:
>>>> +===============================
>>>> +
>>>> +Infiniband has what is called a "Reliable, Connected"
>>>> +link (one of 4 choices). This is the mode in which
>>>> +we use for RDMA migration.
>>>> +
>>>> +If a *single* message fails,
>>>> +the decision is to abort the migration entirely and
>>>> +cleanup all the RDMA descriptors and unregister all
>>>> +the memory.
>>>> +
>>>> +After cleanup, the Virtual Machine is returned to normal
>>>> +operation the same way that would happen if the TCP
>>>> +socket is broken during a non-RDMA based migration.
>>> That's on sender side? Presumably this means you respond to
>>> completion with error?
>>>   How does receive side know
>>> migration is complete?
>> Yes, on the sender side.
>>
>> Migration "completeness" logic has not changed in this patch series.
>>
>> Pleas recall that the entire QEMUFile protocol is still
>> happening at the upper-level inside of savevm.c/arch_init.c.
>>
> So basically receive side detects that migration is complete by
> looking at the QEMUFile data?
>

That's correct - same mechanism used by TCP.
Michael S. Tsirkin April 10, 2013, 5:41 p.m. UTC | #5
On Wed, Apr 10, 2013 at 11:29:24AM -0400, Michael R. Hines wrote:
> On 04/10/2013 09:34 AM, Michael S. Tsirkin wrote:
> >On Wed, Apr 10, 2013 at 09:04:44AM -0400, Michael R. Hines wrote:
> >>On 04/10/2013 01:27 AM, Michael S. Tsirkin wrote:
> >>>Below is a great high level overview. the protocol looks correct.
> >>>A bit more detail would be helpful, as noted below.
> >>>
> >>>The main thing I'd like to see changed is that there are already
> >>>two protocols here: chunk-based and non chunk based.
> >>>We'll need to use versioning and capabilities going forward but in the
> >>>first version we don't need to maintain compatibility with legacy so
> >>>two versions seems like unnecessary pain.  Chunk based is somewhat slower and
> >>>that is worth fixing longer term, but seems like the way forward. So
> >>>let's implement a single chunk-based protocol in the first version we
> >>>merge.
> >>>
> >>>Some more minor improvement suggestions below.
> >>Thanks.
> >>
> >>However, IMHO restricting the policy to only used chunk-based is really
> >>not an acceptable choice:
> >>
> >>Here's the reason: Using my 10gbs RDMA hardware, throughput takes a
> >>dive from 10gbps to 6gbps.
> >Who cares about the throughput really? What we do care about
> >is how long the whole process takes.
> >
> 
> Low latency and high throughput is very important =)
> 
> Without these properties of RDMA, many workloads simply either
> take to long to finish migrating or do not converge to a stopping
> point altogether.
> 
> *Not making this a configurable option would defeat the purpose of
> using RDMA altogether.
> 
> Otherwise, you're no better off than just using TCP.

So we have two protocols implemented: one is slow the other pins all
memory on destination indefinitely.

I see two options here:
- improve the slow version so it's fast, drop the pin all version
- give up and declare RDMA requires pinning all memory on destination

But giving management a way to do RDMA at the speed of TCP? Why is this
useful?

> 
> >
> >>But if I disable chunk-based registration altogether (forgoing
> >>overcommit), then performance comes back.
> >>
> >>The reason for this is is the additional control trannel traffic
> >>needed to ask the server to register
> >>memory pages on demand - without this traffic, we can easily
> >>saturate the link.
> >>But with this traffic, the user needs to know (and be given the
> >>option) to disable the feature
> >>in case they want performance instead of flexibility.
> >>
> >IMO that's just because the current control protocol is so inefficient.
> >You just need to pipeline the registration: request the next chunk
> >while remote side is handling the previous one(s).
> >
> >With any protocol, you still need to:
> >	register all memory
> >	send addresses and keys to source
> >	get notification that write is done
> >what is different with chunk based?
> >simply that there are several network roundtrips
> >before the process can start.
> >So part of the time you are not doing writes,
> >you are waiting for the next control message.
> >
> >So you should be doing several in parallel.
> >This will complicate the procotol though, so I am not asking
> >for this right away.
> >
> >But a broken pin-it-all alternative will just confuse matters.  It is
> >best to keep it out of tree.
> 
> There's a huge difference. (Answer continued below this one).
> 
> The devil is in the details, here: Pipelining is simply not possible
> right now because the migration thread has total control over
> when and which pages are requested to be migrated.
> 
> You can't pipeline page registrations if you don't know the pages
> are dirty -
> and the only way to that pages are dirty is if the migration thread told
> you to save them.


So it tells you to save them. It does not mean you need to start
RDMA immediately.  Note the address and start the process of
notifying the remote.


>
> On the other hand, advanced registration of *known* dirty pages
> is very important - I will certainly be submitting a patch in the future
> which attempts to handle this case.

Maybe I miss something, and there are changes in the migration core
that are prerequisite to making rdma fast. So take the time and make
these changes, that's better than maintaining a broken protocol
indefinitely.


> >So make the protocol smarter and fix this. This is not something
> >management needs to know about.
> >
> >
> >If you like, you can teach management to specify the max amount of
> >memory pinned. It should be specified at the appropriate place:
> >on the remote for remote, on source for source.
> >
> 
> Answer below.
> 
> >>>
> >>>What is meant by performance here? downtime?
> >>Throughput. Zero page scanning (and dynamic registration) reduces
> >>throughput significantly.
> >Again, not something management should worry about.
> >Do the right thing internally.
> 
> I disagree with that: This is an entirely workload-specific decision,
> not a system-level decision.
> 
> If I have a known memory-intensive workload that is virtualized,
> then it would be "too late" to disable zero page detection *after*
> the RDMA migration begins.
> 
> We have management tools already that are that smart - there's
> nothing wrong with smart managment knowing in advance that
> a workload is memory-intensive and also knowing that an RDMA
> migration is going to be issued.

"zero page detection" just cries out "implementation specific".

There's very little chance e.g. a different algorithm will have exactly
same performance tradeoffs. So we change some qemu internals and
suddenly your management carefully tuned for your workload is making all
the wrong decisions.



>
> There's no way for QEMU to know that in advance without some kind
> of advanced heuristic that tracks the behavior of the VM over time,
> which I don't think anybody wants to get into the business of writing =)

There's even less chance a management tool will make an
intelligent decision here. It's too tied to QEMU internals.

> >>>>+
> >>>>+SEND messages require more coordination because the
> >>>>+receiver must have reserved space (using a receive
> >>>>+work request) on the receive queue (RQ) before QEMUFileRDMA
> >>>>+can start using them to carry all the bytes as
> >>>>+a transport for migration of device state.
> >>>>+
> >>>>+To begin the migration, the initial connection setup is
> >>>>+as follows (migration-rdma.c):
> >>>>+
> >>>>+1. Receiver and Sender are started (command line or libvirt):
> >>>>+2. Both sides post two RQ work requests
> >>>Okay this could be where the problem is. This means with chunk
> >>>based receive side does:
> >>>
> >>>loop:
> >>>	receive request
> >>>	register
> >>>	send response
> >>>
> >>>while with non chunk based it does:
> >>>
> >>>receive request
> >>>send response
> >>>loop:
> >>>	register
> >>No, that's incorrect. With "non" chunk based, the receive side does
> >>*not* communicate
> >>during the migration of pc.ram.
> >It does not matter when this happens. What we care about is downtime and
> >total time from start of qemu on remote and until migration completes.
> >Not peak throughput.
> >If you don't count registration time on remote, that's just wrong.
> 
> Answer above.


I don't see it above.
> 
> >>The control channel is only used for chunk registration and device
> >>state, not RAM.
> >>
> >>I will update the documentation to make that more clear.
> >It's clear enough I think. But it seems you are measuring
> >the wrong things.
> >
> >>>In reality each request/response requires two network round-trips
> >>>with the Ready credit-management messsages.
> >>>So the overhead will likely be avoided if we add better pipelining:
> >>>allow multiple registration requests in the air, and add more
> >>>send/receive credits so the overhead of credit management can be
> >>>reduced.
> >>Unfortunately, the migration thread doesn't work that way.
> >>The thread only generates one page write at-a-time.
> >Yes but you do not have to block it. Each page is in these states:
> >	- unpinned not sent
> >	- pinned no rkey
> >	- pinned have rkey
> >	- unpinned sent
> >
> >Each time you get a new page, it's in unpinned not sent state.
> >So you can start it on this state machine, and tell migration thread
> >to proceed tothe next page.
> 
> Yes, I'm doing that already (documented as "batching") in the
> docs file.

All I see is a scheme to reduce the number of transmit completions.
This only gives a marginal gain.  E.g. you explicitly say there's a
single command in the air so another registration request can not even
start until you get a registration response.

> But the problem is more complicated than that: there is no coordination
> between the migration_thread and RDMA right now because Paolo is
> trying to maintain a very clean separation of function.
> 
> However we *can* do what you described in a future patch like this:
> 
> 1. Migration thread says "iteration starts, how much memory is dirty?"
> 2. RDMA protocol says "Is there a lot of dirty memory?"
>         OK, yes? Then batch all the registration messages into a
> single request
>         but do not write the memory until all the registrations have
> completed.
> 
>         OK, no?  Then just issue registrations with very little
> batching so that
>                       we can quickly move on to the next iteration round.
> 
> Make sense?

Actually, I think you just need to get a page from migration core and
give it to the FSM above.  Then let it give you another page, until you
have N pages in flight in the FSM all at different stages in the
pipeline.  That's the theory.

But if you want to try changing management core, go wild.  Very little
is written in stone here.

> >>If someone were to write a patch which submits multiple
> >>writes at the same time, I would be very interested in
> >>consuming that feature and making chunk registration more
> >>efficient by batching multiple registrations into fewer messages.
> >No changes to migration core is necessary I think.
> >But assuming they are - your protocol design and
> >management API should not be driven by internal qemu APIs.
> 
> Answer above.
> 
> >>>There's no requirement to implement these optimizations upfront
> >>>before merging the first version, but let's remove the
> >>>non-chunkbased crutch unless we see it as absolutely necessary.
> >>>
> >>>>+3. Receiver does listen()
> >>>>+4. Sender does connect()
> >>>>+5. Receiver accept()
> >>>>+6. Check versioning and capabilities (described later)
> >>>>+
> >>>>+At this point, we define a control channel on top of SEND messages
> >>>>+which is described by a formal protocol. Each SEND message has a
> >>>>+header portion and a data portion (but together are transmitted
> >>>>+as a single SEND message).
> >>>>+
> >>>>+Header:
> >>>>+    * Length  (of the data portion)
> >>>>+    * Type    (what command to perform, described below)
> >>>>+    * Version (protocol version validated before send/recv occurs)
> >>>What's the expected value for Version field?
> >>>Also, confusing.  Below mentions using private field in librdmacm instead?
> >>>Need to add # of bytes and endian-ness of each field.
> >>Correct, those are two separate versions. One for capability negotiation
> >>and one for the protocol itself.
> >>
> >>I will update the documentation.
> >Just drop the all-pinned version, and we'll work to improve
> >the chunk-based one until it has reasonable performance.
> >It seems to get a decent speed already: consider that
> >most people run migration with the default speed limit.
> >Supporting all-pinned will just be a pain down the road when
> >we fix performance for chunk based one.
> >
> 
> The speed tops out at 6gbps, that's not good enough for a 40gbps link.
> 
> The migration could complete *much* faster by disabling chunk registration.
> 
> We have very large physical machines, where chunk registration is
> not as important
> as migrating the workload very quickly with very little downtime.
> 
> In these cases, chunk registration just "gets in the way".

Well IMO you give up too early.

It gets in the way because you are not doing data transfers while
you are doing registration. You are doing it by chunks on the
source and source is much busier, it needs to find dirty pages,
and it needs to run VCPUs. Surely remote which is mostly idle should
be able to keep up with the demand.

Just fix the protocol so the control latency is less of the problem.


> >>>>+
> >>>>+The 'type' field has 7 different command values:
> >>>0. Unused.
> >>>
> >>>>+    1. None
> >>>you mean this is unused?
> >>Correct - will update.
> >>
> >>>>+    2. Ready             (control-channel is available)
> >>>>+    3. QEMU File         (for sending non-live device state)
> >>>>+    4. RAM Blocks        (used right after connection setup)
> >>>>+    5. Register request  (dynamic chunk registration)
> >>>>+    6. Register result   ('rkey' to be used by sender)
> >>>Hmm, don't you also need a virtual address for RDMA writes?
> >>>
> >>The virtual addresses are communicated at the beginning of the
> >>migration using command #4 "Ram blocks".
> >Yes but ram blocks are sent source to dest.
> >virtual address needs to be sent dest to source no?
> 
> I just said that, no? =)

You didn't previously.

> >>
> >>There is only one capability right now: dynamic server registration.
> >>
> >>The client must tell the server whether or not the capability was
> >>enabled or not on the primary VM side.
> >>
> >>Will update the documentation.
> >Cool, best add an exact structure format.
> 
> Acnkowledged.
> 
> >>>>+
> >>>>+Main memory is not migrated with the aforementioned protocol,
> >>>>+but is instead migrated with normal RDMA Write operations.
> >>>>+
> >>>>+Pages are migrated in "chunks" (about 1 Megabyte right now).
> >>>Why "about"? This is not dynamic so needs to be exactly same
> >>>on both sides, right?
> >>About is a typo =). It is hard-coded to exactly 1MB.
> >This, by the way, is something management *may* want to control.
> 
> Acknowledged.
> 
> >>>>+Chunk size is not dynamic, but it could be in a future implementation.
> >>>>+There's nothing to indicate that this is useful right now.
> >>>>+
> >>>>+When a chunk is full (or a flush() occurs), the memory backed by
> >>>>+the chunk is registered with librdmacm and pinned in memory on
> >>>>+both sides using the aforementioned protocol.
> >>>>+
> >>>>+After pinning, an RDMA Write is generated and tramsmitted
> >>>>+for the entire chunk.
> >>>>+
> >>>>+Chunks are also transmitted in batches: This means that we
> >>>>+do not request that the hardware signal the completion queue
> >>>>+for the completion of *every* chunk. The current batch size
> >>>>+is about 64 chunks (corresponding to 64 MB of memory).
> >>>>+Only the last chunk in a batch must be signaled.
> >>>>+This helps keep everything as asynchronous as possible
> >>>>+and helps keep the hardware busy performing RDMA operations.
> >>>>+
> >>>>+Error-handling:
> >>>>+===============================
> >>>>+
> >>>>+Infiniband has what is called a "Reliable, Connected"
> >>>>+link (one of 4 choices). This is the mode in which
> >>>>+we use for RDMA migration.
> >>>>+
> >>>>+If a *single* message fails,
> >>>>+the decision is to abort the migration entirely and
> >>>>+cleanup all the RDMA descriptors and unregister all
> >>>>+the memory.
> >>>>+
> >>>>+After cleanup, the Virtual Machine is returned to normal
> >>>>+operation the same way that would happen if the TCP
> >>>>+socket is broken during a non-RDMA based migration.
> >>>That's on sender side? Presumably this means you respond to
> >>>completion with error?
> >>>  How does receive side know
> >>>migration is complete?
> >>Yes, on the sender side.
> >>
> >>Migration "completeness" logic has not changed in this patch series.
> >>
> >>Pleas recall that the entire QEMUFile protocol is still
> >>happening at the upper-level inside of savevm.c/arch_init.c.
> >>
> >So basically receive side detects that migration is complete by
> >looking at the QEMUFile data?
> >
> 
> That's correct - same mechanism used by TCP.
>
mrhines@linux.vnet.ibm.com April 10, 2013, 8:05 p.m. UTC | #6
On 04/10/2013 01:41 PM, Michael S. Tsirkin wrote:
>>>>
>>>> Thanks.
>>>>
>>>> However, IMHO restricting the policy to only used chunk-based is really
>>>> not an acceptable choice:
>>>>
>>>> Here's the reason: Using my 10gbs RDMA hardware, throughput takes a
>>>> dive from 10gbps to 6gbps.
>>> Who cares about the throughput really? What we do care about
>>> is how long the whole process takes.
>>>
>> Low latency and high throughput is very important =)
>>
>> Without these properties of RDMA, many workloads simply either
>> take to long to finish migrating or do not converge to a stopping
>> point altogether.
>>
>> *Not making this a configurable option would defeat the purpose of
>> using RDMA altogether.
>>
>> Otherwise, you're no better off than just using TCP.
> So we have two protocols implemented: one is slow the other pins all
> memory on destination indefinitely.
>
> I see two options here:
> - improve the slow version so it's fast, drop the pin all version
> - give up and declare RDMA requires pinning all memory on destination
>
> But giving management a way to do RDMA at the speed of TCP? Why is this
> useful?

This is "useful" because of the overcommit concerns you brought
before, which is the reason why I volunteered to write dynamic
server registration in the first place. We never required that overcommit
and performance had

 From prior experience, I don't believe overcommit and good performance
are compatible with each other in general (i.e. using compression,
page sharing, etc, etc.), but that's a debate for another day =)

I would like to propose a compromise:

How about we *keep* the registration capability and leave it enabled by 
default?

This gives management tools the ability to get performance if they want to,
but also satisfies your requirements in case management doesn't know the
feature exists - they will just get the default enabled?
>> But the problem is more complicated than that: there is no coordination
>> between the migration_thread and RDMA right now because Paolo is
>> trying to maintain a very clean separation of function.
>>
>> However we *can* do what you described in a future patch like this:
>>
>> 1. Migration thread says "iteration starts, how much memory is dirty?"
>> 2. RDMA protocol says "Is there a lot of dirty memory?"
>>          OK, yes? Then batch all the registration messages into a
>> single request
>>          but do not write the memory until all the registrations have
>> completed.
>>
>>          OK, no?  Then just issue registrations with very little
>> batching so that
>>                        we can quickly move on to the next iteration round.
>>
>> Make sense?
> Actually, I think you just need to get a page from migration core and
> give it to the FSM above.  Then let it give you another page, until you
> have N pages in flight in the FSM all at different stages in the
> pipeline.  That's the theory.
>
> But if you want to try changing management core, go wild.  Very little
> is written in stone here.

The FSM and what I described are basically the same thing, I just
described it more abstractly than you did.

Either way, I agree that the optimization would be very useful,
but I disagree that it is possible for an optimized registration algorithm
to perform *as well as* the case when there is no dynamic registration 
at all.

The point is that dynamic registration *only* helps overcommitment.

It does nothing for performance - and since that's true any optimizations
that improve on dynamic registrations will always be sub-optimal to turning
off dynamic registration in the first place.

- Michael
Michael S. Tsirkin April 11, 2013, 7:19 a.m. UTC | #7
On Wed, Apr 10, 2013 at 04:05:34PM -0400, Michael R. Hines wrote:
> On 04/10/2013 01:41 PM, Michael S. Tsirkin wrote:
> >>>>
> >>>>Thanks.
> >>>>
> >>>>However, IMHO restricting the policy to only used chunk-based is really
> >>>>not an acceptable choice:
> >>>>
> >>>>Here's the reason: Using my 10gbs RDMA hardware, throughput takes a
> >>>>dive from 10gbps to 6gbps.
> >>>Who cares about the throughput really? What we do care about
> >>>is how long the whole process takes.
> >>>
> >>Low latency and high throughput is very important =)
> >>
> >>Without these properties of RDMA, many workloads simply either
> >>take to long to finish migrating or do not converge to a stopping
> >>point altogether.
> >>
> >>*Not making this a configurable option would defeat the purpose of
> >>using RDMA altogether.
> >>
> >>Otherwise, you're no better off than just using TCP.
> >So we have two protocols implemented: one is slow the other pins all
> >memory on destination indefinitely.
> >
> >I see two options here:
> >- improve the slow version so it's fast, drop the pin all version
> >- give up and declare RDMA requires pinning all memory on destination
> >
> >But giving management a way to do RDMA at the speed of TCP? Why is this
> >useful?
> 
> This is "useful" because of the overcommit concerns you brought
> before, which is the reason why I volunteered to write dynamic
> server registration in the first place. We never required that overcommit
> and performance had
> 
> From prior experience, I don't believe overcommit and good performance
> are compatible with each other in general (i.e. using compression,
> page sharing, etc, etc.), but that's a debate for another day =)

Maybe we should just say "RDMA is incompatible with memory overcommit"
and be done with it then. But see below.

> I would like to propose a compromise:
> 
> How about we *keep* the registration capability and leave it enabled
> by default?
> 
> This gives management tools the ability to get performance if they want to,
> but also satisfies your requirements in case management doesn't know the
> feature exists - they will just get the default enabled?

Well unfortunately the "overcommit" feature as implemented seems useless
really.  Someone wants to migrate with RDMA but with low performance?
Why not migrate with TCP then?

> >>But the problem is more complicated than that: there is no coordination
> >>between the migration_thread and RDMA right now because Paolo is
> >>trying to maintain a very clean separation of function.
> >>
> >>However we *can* do what you described in a future patch like this:
> >>
> >>1. Migration thread says "iteration starts, how much memory is dirty?"
> >>2. RDMA protocol says "Is there a lot of dirty memory?"
> >>         OK, yes? Then batch all the registration messages into a
> >>single request
> >>         but do not write the memory until all the registrations have
> >>completed.
> >>
> >>         OK, no?  Then just issue registrations with very little
> >>batching so that
> >>                       we can quickly move on to the next iteration round.
> >>
> >>Make sense?
> >Actually, I think you just need to get a page from migration core and
> >give it to the FSM above.  Then let it give you another page, until you
> >have N pages in flight in the FSM all at different stages in the
> >pipeline.  That's the theory.
> >
> >But if you want to try changing management core, go wild.  Very little
> >is written in stone here.
> 
> The FSM and what I described are basically the same thing, I just
> described it more abstractly than you did.

Yes but I'm saying it can be part of RDMA code, no strict need to
change anything else.

> Either way, I agree that the optimization would be very useful,
> but I disagree that it is possible for an optimized registration algorithm
> to perform *as well as* the case when there is no dynamic
> registration at all.
> 
> The point is that dynamic registration *only* helps overcommitment.
> 
> It does nothing for performance - and since that's true any optimizations
> that improve on dynamic registrations will always be sub-optimal to turning
> off dynamic registration in the first place.
> 
> - Michael

So you've given up on it.  Question is, sub-optimal by how much?  And
where's the bottleneck?

Let's do some math. Assume you send 16 bytes registration request and
get back a 16 byte response for each 4Kbyte page (16 bytes enough?).  That's
32/4096 < 1% transport overhead. Negligeable.

Is it the source CPU then? But CPU on source is basically doing same
things as with pre-registration: you do not pin all memory on source.

So it must be the destination CPU that does not keep up then?
But it has to do even less than the source CPU.

I suggest one explanation: the protocol you proposed is inefficient.
It seems to basically do everything in a single thread:
get a chunk,pin,wait for control credit,request,response,rdma,unpin,
There are two round-trips of send/receive here where you are not
going anything useful. Why not let migration proceed?

Doesn't all of this sound worth checking before we give up?
mrhines@linux.vnet.ibm.com April 11, 2013, 1:12 p.m. UTC | #8
On 04/11/2013 03:19 AM, Michael S. Tsirkin wrote:
> On Wed, Apr 10, 2013 at 04:05:34PM -0400, Michael R. Hines wrote:
> Maybe we should just say "RDMA is incompatible with memory overcommit" 
> and be done with it then. But see below.
>> I would like to propose a compromise:
>>
>> How about we *keep* the registration capability and leave it enabled
>> by default?
>>
>> This gives management tools the ability to get performance if they want to,
>> but also satisfies your requirements in case management doesn't know the
>> feature exists - they will just get the default enabled?
> Well unfortunately the "overcommit" feature as implemented seems useless
> really.  Someone wants to migrate with RDMA but with low performance?
> Why not migrate with TCP then?

Answer below.

>> Either way, I agree that the optimization would be very useful,
>> but I disagree that it is possible for an optimized registration algorithm
>> to perform *as well as* the case when there is no dynamic
>> registration at all.
>>
>> The point is that dynamic registration *only* helps overcommitment.
>>
>> It does nothing for performance - and since that's true any optimizations
>> that improve on dynamic registrations will always be sub-optimal to turning
>> off dynamic registration in the first place.
>>
>> - Michael
> So you've given up on it.  Question is, sub-optimal by how much?  And
> where's the bottleneck?
>
> Let's do some math. Assume you send 16 bytes registration request and
> get back a 16 byte response for each 4Kbyte page (16 bytes enough?).  That's
> 32/4096 < 1% transport overhead. Negligeable.
>
> Is it the source CPU then? But CPU on source is basically doing same
> things as with pre-registration: you do not pin all memory on source.
>
> So it must be the destination CPU that does not keep up then?
> But it has to do even less than the source CPU.
>
> I suggest one explanation: the protocol you proposed is inefficient.
> It seems to basically do everything in a single thread:
> get a chunk,pin,wait for control credit,request,response,rdma,unpin,
> There are two round-trips of send/receive here where you are not
> going anything useful. Why not let migration proceed?
>
> Doesn't all of this sound worth checking before we give up?
>
First, let me remind you:

Chunks are already doing this!

Perhaps you don't fully understand how chunks work or perhaps I should 
be more verbose
in the documentation. The protocol is already joining multiple pages into a
single chunk without issuing any writes. It is only until the chunk is 
full that an
actual page registration request occurs.

So, basically what you want to know is what happens if we *change* the 
chunk size
dynamically?

Something like this:

1. Chunk = 1MB, what is the performance?
2. Chunk = 2MB, what is the performance?
3. Chunk = 4MB, what is the performance?
4. Chunk = 8MB, what is the performance?
5. Chunk = 16MB, what is the performance?
6. Chunk = 32MB, what is the performance?
7. Chunk = 64MB, what is the performance?
8. Chunk = 128MB, what is the performance?

I'll get you a this table today. Expect an email soon.

- Michael
Michael S. Tsirkin April 11, 2013, 1:48 p.m. UTC | #9
On Thu, Apr 11, 2013 at 09:12:17AM -0400, Michael R. Hines wrote:
> On 04/11/2013 03:19 AM, Michael S. Tsirkin wrote:
> >On Wed, Apr 10, 2013 at 04:05:34PM -0400, Michael R. Hines wrote:
> >Maybe we should just say "RDMA is incompatible with memory
> >overcommit" and be done with it then. But see below.
> >>I would like to propose a compromise:
> >>
> >>How about we *keep* the registration capability and leave it enabled
> >>by default?
> >>
> >>This gives management tools the ability to get performance if they want to,
> >>but also satisfies your requirements in case management doesn't know the
> >>feature exists - they will just get the default enabled?
> >Well unfortunately the "overcommit" feature as implemented seems useless
> >really.  Someone wants to migrate with RDMA but with low performance?
> >Why not migrate with TCP then?
> 
> Answer below.
> 
> >>Either way, I agree that the optimization would be very useful,
> >>but I disagree that it is possible for an optimized registration algorithm
> >>to perform *as well as* the case when there is no dynamic
> >>registration at all.
> >>
> >>The point is that dynamic registration *only* helps overcommitment.
> >>
> >>It does nothing for performance - and since that's true any optimizations
> >>that improve on dynamic registrations will always be sub-optimal to turning
> >>off dynamic registration in the first place.
> >>
> >>- Michael
> >So you've given up on it.  Question is, sub-optimal by how much?  And
> >where's the bottleneck?
> >
> >Let's do some math. Assume you send 16 bytes registration request and
> >get back a 16 byte response for each 4Kbyte page (16 bytes enough?).  That's
> >32/4096 < 1% transport overhead. Negligeable.
> >
> >Is it the source CPU then? But CPU on source is basically doing same
> >things as with pre-registration: you do not pin all memory on source.
> >
> >So it must be the destination CPU that does not keep up then?
> >But it has to do even less than the source CPU.
> >
> >I suggest one explanation: the protocol you proposed is inefficient.
> >It seems to basically do everything in a single thread:
> >get a chunk,pin,wait for control credit,request,response,rdma,unpin,
> >There are two round-trips of send/receive here where you are not
> >going anything useful. Why not let migration proceed?
> >
> >Doesn't all of this sound worth checking before we give up?
> >
> First, let me remind you:
> 
> Chunks are already doing this!
> 
> Perhaps you don't fully understand how chunks work or perhaps I
> should be more verbose
> in the documentation. The protocol is already joining multiple pages into a
> single chunk without issuing any writes. It is only until the chunk
> is full that an
> actual page registration request occurs.

I think I got that at a high level.
But there is a stall between chunks. If you make chunks smaller,
but pipeline registration, then there will never be any stall.

> So, basically what you want to know is what happens if we *change*
> the chunk size
> dynamically?

What I wanted to know is where is performance going?
Why is chunk based slower? It's not the extra messages,
on the wire, these take up negligeable BW.

> Something like this:
> 
> 1. Chunk = 1MB, what is the performance?
> 2. Chunk = 2MB, what is the performance?
> 3. Chunk = 4MB, what is the performance?
> 4. Chunk = 8MB, what is the performance?
> 5. Chunk = 16MB, what is the performance?
> 6. Chunk = 32MB, what is the performance?
> 7. Chunk = 64MB, what is the performance?
> 8. Chunk = 128MB, what is the performance?
> 
> I'll get you a this table today. Expect an email soon.
> 
> - Michael
> 
> 
> 
> 
>
mrhines@linux.vnet.ibm.com April 11, 2013, 1:58 p.m. UTC | #10
On 04/11/2013 09:48 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 11, 2013 at 09:12:17AM -0400, Michael R. Hines wrote:
>> On 04/11/2013 03:19 AM, Michael S. Tsirkin wrote:
>>> On Wed, Apr 10, 2013 at 04:05:34PM -0400, Michael R. Hines wrote:
>>> Maybe we should just say "RDMA is incompatible with memory
>>> overcommit" and be done with it then. But see below.
>>>> I would like to propose a compromise:
>>>>
>>>> How about we *keep* the registration capability and leave it enabled
>>>> by default?
>>>>
>>>> This gives management tools the ability to get performance if they want to,
>>>> but also satisfies your requirements in case management doesn't know the
>>>> feature exists - they will just get the default enabled?
>>> Well unfortunately the "overcommit" feature as implemented seems useless
>>> really.  Someone wants to migrate with RDMA but with low performance?
>>> Why not migrate with TCP then?
>> Answer below.
>>
>>>> Either way, I agree that the optimization would be very useful,
>>>> but I disagree that it is possible for an optimized registration algorithm
>>>> to perform *as well as* the case when there is no dynamic
>>>> registration at all.
>>>>
>>>> The point is that dynamic registration *only* helps overcommitment.
>>>>
>>>> It does nothing for performance - and since that's true any optimizations
>>>> that improve on dynamic registrations will always be sub-optimal to turning
>>>> off dynamic registration in the first place.
>>>>
>>>> - Michael
>>> So you've given up on it.  Question is, sub-optimal by how much?  And
>>> where's the bottleneck?
>>>
>>> Let's do some math. Assume you send 16 bytes registration request and
>>> get back a 16 byte response for each 4Kbyte page (16 bytes enough?).  That's
>>> 32/4096 < 1% transport overhead. Negligeable.
>>>
>>> Is it the source CPU then? But CPU on source is basically doing same
>>> things as with pre-registration: you do not pin all memory on source.
>>>
>>> So it must be the destination CPU that does not keep up then?
>>> But it has to do even less than the source CPU.
>>>
>>> I suggest one explanation: the protocol you proposed is inefficient.
>>> It seems to basically do everything in a single thread:
>>> get a chunk,pin,wait for control credit,request,response,rdma,unpin,
>>> There are two round-trips of send/receive here where you are not
>>> going anything useful. Why not let migration proceed?
>>>
>>> Doesn't all of this sound worth checking before we give up?
>>>
>> First, let me remind you:
>>
>> Chunks are already doing this!
>>
>> Perhaps you don't fully understand how chunks work or perhaps I
>> should be more verbose
>> in the documentation. The protocol is already joining multiple pages into a
>> single chunk without issuing any writes. It is only until the chunk
>> is full that an
>> actual page registration request occurs.
> I think I got that at a high level.
> But there is a stall between chunks. If you make chunks smaller,
> but pipeline registration, then there will never be any stall.

Pipelineing == chunking. You cannot eliminate the stall,
that's impossible.

You can *grow* the chunk size (i.e. the pipeline)
to amortize the cost of the stall, but you cannot eliminate
the stall at the end of the pipeline.

At some point you have to flush the pipeline (i.e. the chunk),
whether you like it or not.


>> So, basically what you want to know is what happens if we *change*
>> the chunk size
>> dynamically?
> What I wanted to know is where is performance going?
> Why is chunk based slower? It's not the extra messages,
> on the wire, these take up negligeable BW.

Answer above.
Michael S. Tsirkin April 11, 2013, 2:37 p.m. UTC | #11
On Thu, Apr 11, 2013 at 09:58:50AM -0400, Michael R. Hines wrote:
> On 04/11/2013 09:48 AM, Michael S. Tsirkin wrote:
> >On Thu, Apr 11, 2013 at 09:12:17AM -0400, Michael R. Hines wrote:
> >>On 04/11/2013 03:19 AM, Michael S. Tsirkin wrote:
> >>>On Wed, Apr 10, 2013 at 04:05:34PM -0400, Michael R. Hines wrote:
> >>>Maybe we should just say "RDMA is incompatible with memory
> >>>overcommit" and be done with it then. But see below.
> >>>>I would like to propose a compromise:
> >>>>
> >>>>How about we *keep* the registration capability and leave it enabled
> >>>>by default?
> >>>>
> >>>>This gives management tools the ability to get performance if they want to,
> >>>>but also satisfies your requirements in case management doesn't know the
> >>>>feature exists - they will just get the default enabled?
> >>>Well unfortunately the "overcommit" feature as implemented seems useless
> >>>really.  Someone wants to migrate with RDMA but with low performance?
> >>>Why not migrate with TCP then?
> >>Answer below.
> >>
> >>>>Either way, I agree that the optimization would be very useful,
> >>>>but I disagree that it is possible for an optimized registration algorithm
> >>>>to perform *as well as* the case when there is no dynamic
> >>>>registration at all.
> >>>>
> >>>>The point is that dynamic registration *only* helps overcommitment.
> >>>>
> >>>>It does nothing for performance - and since that's true any optimizations
> >>>>that improve on dynamic registrations will always be sub-optimal to turning
> >>>>off dynamic registration in the first place.
> >>>>
> >>>>- Michael
> >>>So you've given up on it.  Question is, sub-optimal by how much?  And
> >>>where's the bottleneck?
> >>>
> >>>Let's do some math. Assume you send 16 bytes registration request and
> >>>get back a 16 byte response for each 4Kbyte page (16 bytes enough?).  That's
> >>>32/4096 < 1% transport overhead. Negligeable.
> >>>
> >>>Is it the source CPU then? But CPU on source is basically doing same
> >>>things as with pre-registration: you do not pin all memory on source.
> >>>
> >>>So it must be the destination CPU that does not keep up then?
> >>>But it has to do even less than the source CPU.
> >>>
> >>>I suggest one explanation: the protocol you proposed is inefficient.
> >>>It seems to basically do everything in a single thread:
> >>>get a chunk,pin,wait for control credit,request,response,rdma,unpin,
> >>>There are two round-trips of send/receive here where you are not
> >>>going anything useful. Why not let migration proceed?
> >>>
> >>>Doesn't all of this sound worth checking before we give up?
> >>>
> >>First, let me remind you:
> >>
> >>Chunks are already doing this!
> >>
> >>Perhaps you don't fully understand how chunks work or perhaps I
> >>should be more verbose
> >>in the documentation. The protocol is already joining multiple pages into a
> >>single chunk without issuing any writes. It is only until the chunk
> >>is full that an
> >>actual page registration request occurs.
> >I think I got that at a high level.
> >But there is a stall between chunks. If you make chunks smaller,
> >but pipeline registration, then there will never be any stall.
> 
> Pipelineing == chunking.

pipelining:
https://en.wikipedia.org/wiki/Pipeline_%28computing%29
chunking:
https://en.wikipedia.org/wiki/Chunking_%28computing%29

> You cannot eliminate the stall,
> that's impossible.

Sure, you can eliminate the stalls. Just hide them
behind data transfers. See a diagram below.


> You can *grow* the chunk size (i.e. the pipeline)
> to amortize the cost of the stall, but you cannot eliminate
> the stall at the end of the pipeline.
> 
> At some point you have to flush the pipeline (i.e. the chunk),
> whether you like it or not.

You can process many chunks in parallel. Make chunks smaller but process
them in a pipelined fashion.  Yes the pipe might stall but it won't if
receive side is as fast as send side, then you won't have to flush at
all.


> >>So, basically what you want to know is what happens if we *change*
> >>the chunk size
> >>dynamically?
> >What I wanted to know is where is performance going?
> >Why is chunk based slower? It's not the extra messages,
> >on the wire, these take up negligeable BW.
> 
> Answer above.


Here's how things are supposed to work in a pipeline:

req -> registration request
res -> response
done -> rdma done notification (remote can unregister)
pgX  -> page, or chunk, or whatever unit is used
        for registration
rdma -> one or more rdma write requests



pg1 ->  pin -> req -> res -> rdma -> done
        pg2 ->  pin -> req -> res -> rdma -> done
                pg3 -> pin -> req -> res -> rdma -> done
                       pg4 -> pin -> req -> res -> rdma -> done
                              pg4 -> pin -> req -> res -> rdma -> done



It's like a assembly line see?  So while software does the registration
roundtrip dance, hardware is processing rdma requests for previous
chunks.

....

When do you have to stall? when you run out of rx buffer credits so you
can not start a new req.  Your protocol has 2 outstanding buffers,
so you can only have one req in the air. Do more and
you will not need to stall - possibly at all.

One other minor point is that your protocol requires extra explicit
ready commands. You can pass the number of rx buffers as extra payload
in the traffic you are sending anyway, and reduce that overhead.
Paolo Bonzini April 11, 2013, 2:50 p.m. UTC | #12
Il 11/04/2013 16:37, Michael S. Tsirkin ha scritto:
> 
> pg1 ->  pin -> req -> res -> rdma -> done
>         pg2 ->  pin -> req -> res -> rdma -> done
>                 pg3 -> pin -> req -> res -> rdma -> done
>                        pg4 -> pin -> req -> res -> rdma -> done
>                               pg4 -> pin -> req -> res -> rdma -> done
> 
> It's like a assembly line see?  So while software does the registration
> roundtrip dance, hardware is processing rdma requests for previous
> chunks.

Does this only affects the implementation, or also the wire protocol?
Does the destination have to be aware that the source is doing pipelining?

Paolo

> 
> ....
> 
> When do you have to stall? when you run out of rx buffer credits so you
> can not start a new req.  Your protocol has 2 outstanding buffers,
> so you can only have one req in the air. Do more and
> you will not need to stall - possibly at all.
> 
> One other minor point is that your protocol requires extra explicit
> ready commands. You can pass the number of rx buffers as extra payload
> in the traffic you are sending anyway, and reduce that overhead.
Michael S. Tsirkin April 11, 2013, 2:56 p.m. UTC | #13
On Thu, Apr 11, 2013 at 04:50:21PM +0200, Paolo Bonzini wrote:
> Il 11/04/2013 16:37, Michael S. Tsirkin ha scritto:
> > 
> > pg1 ->  pin -> req -> res -> rdma -> done
> >         pg2 ->  pin -> req -> res -> rdma -> done
> >                 pg3 -> pin -> req -> res -> rdma -> done
> >                        pg4 -> pin -> req -> res -> rdma -> done
> >                               pg4 -> pin -> req -> res -> rdma -> done
> > 
> > It's like a assembly line see?  So while software does the registration
> > roundtrip dance, hardware is processing rdma requests for previous
> > chunks.
> 
> Does this only affects the implementation, or also the wire protocol?

It affects the wire protocol.

> Does the destination have to be aware that the source is doing pipelining?
> 
> Paolo

Yes. At the moment the protocol assumption is there's only one
outstanding command on the control queue.  So destination has to
prequeue multiple buffers on hardware receive queue, and keep the source
updated about the number of available buffers. Preferably it should do
this using existing responses, maybe a separate ready command
is enough - this needs some thought, since a separate command
consumes buffers itself.

> > 
> > ....
> > 
> > When do you have to stall? when you run out of rx buffer credits so you
> > can not start a new req.  Your protocol has 2 outstanding buffers,
> > so you can only have one req in the air. Do more and
> > you will not need to stall - possibly at all.
> > 
> > One other minor point is that your protocol requires extra explicit
> > ready commands. You can pass the number of rx buffers as extra payload
> > in the traffic you are sending anyway, and reduce that overhead.
mrhines@linux.vnet.ibm.com April 11, 2013, 3:01 p.m. UTC | #14
You cannot write data in the pipeline because you do not have the
permissions to do so yet until the registrations in the pipeline have
completed and been received by the primary VM.

On 04/11/2013 10:50 AM, Paolo Bonzini wrote:
> Il 11/04/2013 16:37, Michael S. Tsirkin ha scritto:
>> pg1 ->  pin -> req -> res -> rdma -> done
>>          pg2 ->  pin -> req -> res -> rdma -> done
>>                  pg3 -> pin -> req -> res -> rdma -> done
>>                         pg4 -> pin -> req -> res -> rdma -> done
>>                                pg4 -> pin -> req -> res -> rdma -> done
>>
>> It's like a assembly line see?  So while software does the registration
>> roundtrip dance, hardware is processing rdma requests for previous
>> chunks.
> Does this only affects the implementation, or also the wire protocol?
> Does the destination have to be aware that the source is doing pipelining?
>
> Paolo

Yes, the destination has to be aware. The destination has to acknowledge
all of the registrations in the pipeline *and* the primary-VM has to block
until all the registrations in the pipeline have been received.
mrhines@linux.vnet.ibm.com April 11, 2013, 3:18 p.m. UTC | #15
First of all, this whole argument should not even exist for the 
following reason:

Page registrations are supposed to be *rare* - once a page is registered, it
is registered for life. There is nothing in the design that says a page must
be "unregistered" and I do not believe anybody is proposing that.

Second, this means that my previous analysis showing that performance 
was reduced
was also incorrect because most of the RDMA transfers were against pages 
during
the bulk phase round, which incorrectly makes dynamic page registration 
look bad.
I should have done more testing *after* the bulk phase round,
and I apologize for not doing that.

Indeed when I do such a test (with the 'stress' command) the cost of 
page registration disappears
because most of the registrations have already completed a long time ago.

Thanks, Paolo for reminding us about the bulk-phase behavior to being with.

Third, this means that optimizing this protocol would not be helpful and 
that we should
follow the "keep it simple" approach because during steady-state phase 
of the migration
most of the pages should have already been registered.

- Michael


On 04/11/2013 10:37 AM, Michael S. Tsirkin wrote:
> Answer above.
>
> Here's how things are supposed to work in a pipeline:
>
> req -> registration request
> res -> response
> done -> rdma done notification (remote can unregister)
> pgX  -> page, or chunk, or whatever unit is used
>          for registration
> rdma -> one or more rdma write requests
>
>
>
> pg1 ->  pin -> req -> res -> rdma -> done
>          pg2 ->  pin -> req -> res -> rdma -> done
>                  pg3 -> pin -> req -> res -> rdma -> done
>                         pg4 -> pin -> req -> res -> rdma -> done
>                                pg4 -> pin -> req -> res -> rdma -> done
>
>
>
> It's like a assembly line see?  So while software does the registration
> roundtrip dance, hardware is processing rdma requests for previous
> chunks.
>
> ....
>
> When do you have to stall? when you run out of rx buffer credits so you
> can not start a new req.  Your protocol has 2 outstanding buffers,
> so you can only have one req in the air. Do more and
> you will not need to stall - possibly at all.
>
> One other minor point is that your protocol requires extra explicit
> ready commands. You can pass the number of rx buffers as extra payload
> in the traffic you are sending anyway, and reduce that overhead.
>
Paolo Bonzini April 11, 2013, 3:33 p.m. UTC | #16
Il 11/04/2013 17:18, Michael R. Hines ha scritto:
> First of all, this whole argument should not even exist for the 
> following reason:
> 
> Page registrations are supposed to be *rare* - once a page is 
> registered, it is registered for life.

Uh-oh.  That changes things a lot.  We do not even need to benchmark the
various chunk sizes.

> Third, this means that optimizing this protocol would not be helpful
> and that we should follow the "keep it simple" approach because
> during steady-state phase of the migration most of the pages should
> have already been registered.

Ok, let's keep it simple.  The only two things we need are:

1) remove the patch to disable is_dup_page

2) rename the transport to "x-rdma" (just in migration.c)

Both things together let us keep it safe for a release or two.  Let's
merge this thing.

Paolo
Michael S. Tsirkin April 11, 2013, 3:44 p.m. UTC | #17
On Thu, Apr 11, 2013 at 11:18:56AM -0400, Michael R. Hines wrote:
> First of all,

I know it's a hard habit to break but could you
please stop stop top-posting?

> this whole argument should not even exist for the
> following reason:
> 
> Page registrations are supposed to be *rare* - once a page is registered, it
> is registered for life. There is nothing in the design that says a page must
> be "unregistered" and I do not believe anybody is proposing that.

Hmm proposing what? Of course you need to unregister pages
eventually otherwise your pinned memory on the destination
will just grow indefinitely. People are often doing
registration caches to help reduce the overhead,
but never unregistering seems too aggressive.

You mean the chunk-based thing just delays the agony
until all guest memory is pinned for RDMA anyway?
Wait, is it registered for life on the source too?

Well this kind of explains why qemu was dying on OOM,
doesn't it?

> Second, this means that my previous analysis showing that
> performance was reduced
> was also incorrect because most of the RDMA transfers were against
> pages during
> the bulk phase round, which incorrectly makes dynamic page
> registration look bad.
> I should have done more testing *after* the bulk phase round,
> and I apologize for not doing that.
> 
> Indeed when I do such a test (with the 'stress' command) the cost of
> page registration disappears
> because most of the registrations have already completed a long time ago.
> 
> Thanks, Paolo for reminding us about the bulk-phase behavior to being with.
> 
> Third, this means that optimizing this protocol would not be helpful
> and that we should
> follow the "keep it simple" approach because during steady-state
> phase of the migration
> most of the pages should have already been registered.
> 
> - Michael

If you mean that registering all memory is a requirement,
then I am not sure I agree: you wrote one slow protocol, this
does not mean that there can't be a fast one.

But if you mean to say that the current chunk based code
is useless, then I'd have to agree.

> 
> On 04/11/2013 10:37 AM, Michael S. Tsirkin wrote:
> >Answer above.
> >
> >Here's how things are supposed to work in a pipeline:
> >
> >req -> registration request
> >res -> response
> >done -> rdma done notification (remote can unregister)
> >pgX  -> page, or chunk, or whatever unit is used
> >         for registration
> >rdma -> one or more rdma write requests
> >
> >
> >
> >pg1 ->  pin -> req -> res -> rdma -> done
> >         pg2 ->  pin -> req -> res -> rdma -> done
> >                 pg3 -> pin -> req -> res -> rdma -> done
> >                        pg4 -> pin -> req -> res -> rdma -> done
> >                               pg4 -> pin -> req -> res -> rdma -> done
> >
> >
> >
> >It's like a assembly line see?  So while software does the registration
> >roundtrip dance, hardware is processing rdma requests for previous
> >chunks.
> >
> >....
> >
> >When do you have to stall? when you run out of rx buffer credits so you
> >can not start a new req.  Your protocol has 2 outstanding buffers,
> >so you can only have one req in the air. Do more and
> >you will not need to stall - possibly at all.
> >
> >One other minor point is that your protocol requires extra explicit
> >ready commands. You can pass the number of rx buffers as extra payload
> >in the traffic you are sending anyway, and reduce that overhead.
> >
Michael S. Tsirkin April 11, 2013, 3:46 p.m. UTC | #18
On Thu, Apr 11, 2013 at 05:33:41PM +0200, Paolo Bonzini wrote:
> Il 11/04/2013 17:18, Michael R. Hines ha scritto:
> > First of all, this whole argument should not even exist for the 
> > following reason:
> > 
> > Page registrations are supposed to be *rare* - once a page is 
> > registered, it is registered for life.
> 
> Uh-oh.  That changes things a lot.  We do not even need to benchmark the
> various chunk sizes.
> 
> > Third, this means that optimizing this protocol would not be helpful
> > and that we should follow the "keep it simple" approach because
> > during steady-state phase of the migration most of the pages should
> > have already been registered.
> 
> Ok, let's keep it simple.  The only two things we need are:
> 
> 1) remove the patch to disable is_dup_page
> 
> 2) rename the transport to "x-rdma" (just in migration.c)
> 
> Both things together let us keep it safe for a release or two.  Let's
> merge this thing.
> 
> Paolo

I would drop the chunk based thing too.  Besides being slow, it turns
out that it pins all memory anyway. So no memory overcommit.
Paolo Bonzini April 11, 2013, 3:47 p.m. UTC | #19
Il 11/04/2013 17:46, Michael S. Tsirkin ha scritto:
> > Ok, let's keep it simple.  The only two things we need are:
> > 
> > 1) remove the patch to disable is_dup_page
> > 
> > 2) rename the transport to "x-rdma" (just in migration.c)
> > 
> > Both things together let us keep it safe for a release or two.  Let's
> > merge this thing.
> 
> I would drop the chunk based thing too.  Besides being slow, it turns
> out that it pins all memory anyway. So no memory overcommit.

It doesn't pin zero pages.  Those are never transmitted (it's a recent
change).  So pages that are ballooned at the beginning of migration, and
remain ballooned throughout, will never be pinned.

Paolo
Michael S. Tsirkin April 11, 2013, 3:58 p.m. UTC | #20
On Thu, Apr 11, 2013 at 05:47:53PM +0200, Paolo Bonzini wrote:
> Il 11/04/2013 17:46, Michael S. Tsirkin ha scritto:
> > > Ok, let's keep it simple.  The only two things we need are:
> > > 
> > > 1) remove the patch to disable is_dup_page
> > > 
> > > 2) rename the transport to "x-rdma" (just in migration.c)
> > > 
> > > Both things together let us keep it safe for a release or two.  Let's
> > > merge this thing.
> > 
> > I would drop the chunk based thing too.  Besides being slow, it turns
> > out that it pins all memory anyway. So no memory overcommit.
> 
> It doesn't pin zero pages.  Those are never transmitted (it's a recent
> change).  So pages that are ballooned at the beginning of migration, and
> remain ballooned throughout, will never be pinned.
> 
> Paolo

Of course Michael says it's slow unless you disable zero page detection,
and then I'm guessing it does?
mrhines@linux.vnet.ibm.com April 11, 2013, 4:06 p.m. UTC | #21
On 04/11/2013 11:58 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 11, 2013 at 05:47:53PM +0200, Paolo Bonzini wrote:
>> Il 11/04/2013 17:46, Michael S. Tsirkin ha scritto:
>>>> Ok, let's keep it simple.  The only two things we need are:
>>>>
>>>> 1) remove the patch to disable is_dup_page
>>>>
>>>> 2) rename the transport to "x-rdma" (just in migration.c)
>>>>
>>>> Both things together let us keep it safe for a release or two.  Let's
>>>> merge this thing.
>>> I would drop the chunk based thing too.  Besides being slow, it turns
>>> out that it pins all memory anyway. So no memory overcommit.
>> It doesn't pin zero pages.  Those are never transmitted (it's a recent
>> change).  So pages that are ballooned at the beginning of migration, and
>> remain ballooned throughout, will never be pinned.
>>
>> Paolo
> Of course Michael says it's slow unless you disable zero page detection,
> and then I'm guessing it does?
>
Only during the bulk phase round, and even then, as Paolo described,
zero pages do not get pinned on the destination.

Chunk registration is still very valuable when zero page detection is
activated.

The realization is that chunk registration (and zero page scanning) have
very little effect whatsoever on performance *after* the bulk phase 
round because
pages have already been mapped and already pinned in memory for life.

- Michael
mrhines@linux.vnet.ibm.com April 11, 2013, 4:09 p.m. UTC | #22
On 04/11/2013 11:44 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 11, 2013 at 11:18:56AM -0400, Michael R. Hines wrote:
>> First of all,
> I know it's a hard habit to break but could you
> please stop stop top-posting?
Acknowledged.
>
>> this whole argument should not even exist for the
>> following reason:
>>
>> Page registrations are supposed to be *rare* - once a page is registered, it
>> is registered for life. There is nothing in the design that says a page must
>> be "unregistered" and I do not believe anybody is proposing that.
> Hmm proposing what? Of course you need to unregister pages
> eventually otherwise your pinned memory on the destination
> will just grow indefinitely. People are often doing
> registration caches to help reduce the overhead,
> but never unregistering seems too aggressive.
>
> You mean the chunk-based thing just delays the agony
> until all guest memory is pinned for RDMA anyway?
> Wait, is it registered for life on the source too?
>
> Well this kind of explains why qemu was dying on OOM,
> doesn't it?

Yes, that's correct. The agony is just delayed. The right thing to do
in a future patch would be to pin as much as possible in advance
before the bulk phase round even begins (using the pagemap).

In the meantime, chunk registartion performance is still very good
so long as total migration time is not the metric you are optimizing for.

>> Second, this means that my previous analysis showing that
>> performance was reduced
>> was also incorrect because most of the RDMA transfers were against
>> pages during
>> the bulk phase round, which incorrectly makes dynamic page
>> registration look bad.
>> I should have done more testing *after* the bulk phase round,
>> and I apologize for not doing that.
>>
>> Indeed when I do such a test (with the 'stress' command) the cost of
>> page registration disappears
>> because most of the registrations have already completed a long time ago.
>>
>> Thanks, Paolo for reminding us about the bulk-phase behavior to being with.
>>
>> Third, this means that optimizing this protocol would not be helpful
>> and that we should
>> follow the "keep it simple" approach because during steady-state
>> phase of the migration
>> most of the pages should have already been registered.
>>
>> - Michael
> If you mean that registering all memory is a requirement,
> then I am not sure I agree: you wrote one slow protocol, this
> does not mean that there can't be a fast one.
>
> But if you mean to say that the current chunk based code
> is useless, then I'd have to agree.

Answer above.
mrhines@linux.vnet.ibm.com April 11, 2013, 4:13 p.m. UTC | #23
On 04/11/2013 11:44 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 11, 2013 at 11:18:56AM -0400, Michael R. Hines wrote:
>> First of all,
> I know it's a hard habit to break but could you
> please stop stop top-posting?
>
>> this whole argument should not even exist for the
>> following reason:
>>
>> Page registrations are supposed to be *rare* - once a page is registered, it
>> is registered for life. There is nothing in the design that says a page must
>> be "unregistered" and I do not believe anybody is proposing that.
> Hmm proposing what? Of course you need to unregister pages
> eventually otherwise your pinned memory on the destination
> will just grow indefinitely. People are often doing
> registration caches to help reduce the overhead,
> but never unregistering seems too aggressive.
>
> You mean the chunk-based thing just delays the agony
> until all guest memory is pinned for RDMA anyway?
> Wait, is it registered for life on the source too?
>
> Well this kind of explains why qemu was dying on OOM,
> doesn't it?
>
>> Second, this means that my previous analysis showing that
>> performance was reduced
>> was also incorrect because most of the RDMA transfers were against
>> pages during
>> the bulk phase round, which incorrectly makes dynamic page
>> registration look bad.
>> I should have done more testing *after* the bulk phase round,
>> and I apologize for not doing that.
>>
>> Indeed when I do such a test (with the 'stress' command) the cost of
>> page registration disappears
>> because most of the registrations have already completed a long time ago.
>>
>> Thanks, Paolo for reminding us about the bulk-phase behavior to being with.
>>
>> Third, this means that optimizing this protocol would not be helpful
>> and that we should
>> follow the "keep it simple" approach because during steady-state
>> phase of the migration
>> most of the pages should have already been registered.
>>
>> - Michael
>
> But if you mean to say that the current chunk based code
> is useless, then I'd have to agree.
>
Well, you asked me to write an overcommit solution, so I wrote one. =)

Second, there is *no need* for a fast registration protocol, as I've 
summarized,
because most of the page registrations are supposed have already completed
before the steady-state iterative phase of the migration already begins
(which will be further optimized in a later patch). You're complaining 
about a non-issue.

- Michael
Michael S. Tsirkin April 11, 2013, 5:04 p.m. UTC | #24
On Thu, Apr 11, 2013 at 12:09:44PM -0400, Michael R. Hines wrote:
> On 04/11/2013 11:44 AM, Michael S. Tsirkin wrote:
> >On Thu, Apr 11, 2013 at 11:18:56AM -0400, Michael R. Hines wrote:
> >>First of all,
> >I know it's a hard habit to break but could you
> >please stop stop top-posting?
> Acknowledged.
> >
> >>this whole argument should not even exist for the
> >>following reason:
> >>
> >>Page registrations are supposed to be *rare* - once a page is registered, it
> >>is registered for life. There is nothing in the design that says a page must
> >>be "unregistered" and I do not believe anybody is proposing that.
> >Hmm proposing what? Of course you need to unregister pages
> >eventually otherwise your pinned memory on the destination
> >will just grow indefinitely. People are often doing
> >registration caches to help reduce the overhead,
> >but never unregistering seems too aggressive.
> >
> >You mean the chunk-based thing just delays the agony
> >until all guest memory is pinned for RDMA anyway?
> >Wait, is it registered for life on the source too?
> >
> >Well this kind of explains why qemu was dying on OOM,
> >doesn't it?
> 
> Yes, that's correct. The agony is just delayed. The right thing to do
> in a future patch would be to pin as much as possible in advance
> before the bulk phase round even begins (using the pagemap).

IMHO the right thing is to unpin memory after it's sent.

> In the meantime, chunk registartion performance is still very good
> so long as total migration time is not the metric you are optimizing for.

You mean it has better downtime than TCP? Or lower host CPU
overhead? These are the metrics we care about.

> >>Second, this means that my previous analysis showing that
> >>performance was reduced
> >>was also incorrect because most of the RDMA transfers were against
> >>pages during
> >>the bulk phase round, which incorrectly makes dynamic page
> >>registration look bad.
> >>I should have done more testing *after* the bulk phase round,
> >>and I apologize for not doing that.
> >>
> >>Indeed when I do such a test (with the 'stress' command) the cost of
> >>page registration disappears
> >>because most of the registrations have already completed a long time ago.
> >>
> >>Thanks, Paolo for reminding us about the bulk-phase behavior to being with.
> >>
> >>Third, this means that optimizing this protocol would not be helpful
> >>and that we should
> >>follow the "keep it simple" approach because during steady-state
> >>phase of the migration
> >>most of the pages should have already been registered.
> >>
> >>- Michael
> >If you mean that registering all memory is a requirement,
> >then I am not sure I agree: you wrote one slow protocol, this
> >does not mean that there can't be a fast one.
> >
> >But if you mean to say that the current chunk based code
> >is useless, then I'd have to agree.
> 
> Answer above.

I don't see it above. What does "keep it simple mean"?
mrhines@linux.vnet.ibm.com April 11, 2013, 5:27 p.m. UTC | #25
On 04/11/2013 01:04 PM, Michael S. Tsirkin wrote:
> On Thu, Apr 11, 2013 at 12:09:44PM -0400, Michael R. Hines wrote:
>>
>> Yes, that's correct. The agony is just delayed. The right thing to do
>> in a future patch would be to pin as much as possible in advance
>> before the bulk phase round even begins (using the pagemap).
> IMHO the right thing is to unpin memory after it's sent.

Based on what, exactly? Would you unpin a hot page? Would you
unpin a cold page that becomes hot again later? I don't see how we can
know in advance the behavior of individual pages and make the decision
to unpin them - we probably don't want to know either.

Trying to build a more complex protocol just for something that's 
unpredictable
(and probably not the common case) doesn't seem like a good focus for 
debate.

Overcommit is really only useful when the "overcommitted" memory
is not expected to fluctuate.  Unpinning pages just so they can be 
overcommitted
later means that it was probably a bad idea to overcommit those pages in 
the first place....

What you're asking for is very fine-grained overcommitment, which, in my
experience is not a practical decision making process that QEMU can ever 
really know
about. Memory footprints tend to either be very big or very small
and they stay that way for a very long time until something comes along 
to change that.

>> In the meantime, chunk registartion performance is still very good
>> so long as total migration time is not the metric you are optimizing for.
> You mean it has better downtime than TCP? Or lower host CPU
> overhead? These are the metrics we care about.
Yes, it does indeed have better downtime because RDMA latencies are much
lower and *most* of the page registrations will have already occurred after
the bulk phase round has passed in the first iteration.

.

- Michael

>>> If you mean that registering all memory is a requirement,
>>> then I am not sure I agree: you wrote one slow protocol, this
>>> does not mean that there can't be a fast one.
>>>
>>> But if you mean to say that the current chunk based code
>>> is useless, then I'd have to agree.
>> Answer above.
> I don't see it above. What does "keep it simple mean"?
>

By simple, I mean the argument for a simpler protocol that I made above.

- Michael
mrhines@linux.vnet.ibm.com April 11, 2013, 5:49 p.m. UTC | #26
On 04/11/2013 10:56 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 11, 2013 at 04:50:21PM +0200, Paolo Bonzini wrote:
>> Il 11/04/2013 16:37, Michael S. Tsirkin ha scritto:
>>> pg1 ->  pin -> req -> res -> rdma -> done
>>>          pg2 ->  pin -> req -> res -> rdma -> done
>>>                  pg3 -> pin -> req -> res -> rdma -> done
>>>                         pg4 -> pin -> req -> res -> rdma -> done
>>>                                pg4 -> pin -> req -> res -> rdma -> done
>>>
>>> It's like a assembly line see?  So while software does the registration
>>> roundtrip dance, hardware is processing rdma requests for previous
>>> chunks.
>> Does this only affects the implementation, or also the wire protocol?
> It affects the wire protocol.

I *do* believe chunked registration was a *very* useful request by
the community, and I want to thank you for convincing me to implement it.

But, with all due respect, pipelining is a "solution looking for a problem".

Improving the protocol does not help the behavior of any well-known 
workloads,
because it is based on the idea the the memory footprint of a VM would
*rapidly* shrink and contract up and down during the steady-state iteration
rounds while the migration is taking place.

This simply does not happen - workloads don't behave that way - they either
grow really big or they grow really small and they settle that way for a 
reasonable
amount of time before the load on the application changes at a future 
point in time.

- Michael
Michael S. Tsirkin April 11, 2013, 7:15 p.m. UTC | #27
On Thu, Apr 11, 2013 at 01:49:34PM -0400, Michael R. Hines wrote:
> On 04/11/2013 10:56 AM, Michael S. Tsirkin wrote:
> >On Thu, Apr 11, 2013 at 04:50:21PM +0200, Paolo Bonzini wrote:
> >>Il 11/04/2013 16:37, Michael S. Tsirkin ha scritto:
> >>>pg1 ->  pin -> req -> res -> rdma -> done
> >>>         pg2 ->  pin -> req -> res -> rdma -> done
> >>>                 pg3 -> pin -> req -> res -> rdma -> done
> >>>                        pg4 -> pin -> req -> res -> rdma -> done
> >>>                               pg4 -> pin -> req -> res -> rdma -> done
> >>>
> >>>It's like a assembly line see?  So while software does the registration
> >>>roundtrip dance, hardware is processing rdma requests for previous
> >>>chunks.
> >>Does this only affects the implementation, or also the wire protocol?
> >It affects the wire protocol.
> 
> I *do* believe chunked registration was a *very* useful request by
> the community, and I want to thank you for convincing me to implement it.
> 
> But, with all due respect, pipelining is a "solution looking for a problem".

The problem is bad performance, isn't it?
If it wasn't we'd use chunk based all the time.

> Improving the protocol does not help the behavior of any well-known
> workloads,
> because it is based on the idea the the memory footprint of a VM would
> *rapidly* shrink and contract up and down during the steady-state iteration
> rounds while the migration is taking place.

What gave you that idea? Not at all.  It is based on the idea
of doing control actions in parallel with data transfers,
so that control latency does not degrade performance.

> This simply does not happen - workloads don't behave that way - they either
> grow really big or they grow really small and they settle that way
> for a reasonable
> amount of time before the load on the application changes at a
> future point in time.
> 
> - Michael

What is the bottleneck for chunk-based? Can you tell me that?  Find out,
and you will maybe see pipelining will help.

Basically to me, when you describe the protocol in detail the problems
become apparent.

I think you worry too much about what the guest does, what APIs are
exposed from the migration core and the specifics of the workload. Build
a sane protocol for data transfers and layer the workload on top.
mrhines@linux.vnet.ibm.com April 11, 2013, 8:33 p.m. UTC | #28
On 04/11/2013 03:15 PM, Michael S. Tsirkin wrote:
> On Thu, Apr 11, 2013 at 01:49:34PM -0400, Michael R. Hines wrote:
>> On 04/11/2013 10:56 AM, Michael S. Tsirkin wrote:
>>> On Thu, Apr 11, 2013 at 04:50:21PM +0200, Paolo Bonzini wrote:
>>>> Il 11/04/2013 16:37, Michael S. Tsirkin ha scritto:
>>>>> pg1 ->  pin -> req -> res -> rdma -> done
>>>>>          pg2 ->  pin -> req -> res -> rdma -> done
>>>>>                  pg3 -> pin -> req -> res -> rdma -> done
>>>>>                         pg4 -> pin -> req -> res -> rdma -> done
>>>>>                                pg4 -> pin -> req -> res -> rdma -> done
>>>>>
>>>>> It's like a assembly line see?  So while software does the registration
>>>>> roundtrip dance, hardware is processing rdma requests for previous
>>>>> chunks.
>>>> Does this only affects the implementation, or also the wire protocol?
>>> It affects the wire protocol.
>> I *do* believe chunked registration was a *very* useful request by
>> the community, and I want to thank you for convincing me to implement it.
>>
>> But, with all due respect, pipelining is a "solution looking for a problem".
> The problem is bad performance, isn't it?
> If it wasn't we'd use chunk based all the time.
>
>> Improving the protocol does not help the behavior of any well-known
>> workloads,
>> because it is based on the idea the the memory footprint of a VM would
>> *rapidly* shrink and contract up and down during the steady-state iteration
>> rounds while the migration is taking place.
> What gave you that idea? Not at all.  It is based on the idea
> of doing control actions in parallel with data transfers,
> so that control latency does not degrade performance.
Again, this parallelization is trying to solve a problem that doesn't 
exist.

As I've described before, I re-executed the worst-case memory stress hog
tests with RDMA *after* the bulk-phase round completes and determined
that RDMA throughput remains unaffected because most of the memory
was already registered in advance.

>> This simply does not happen - workloads don't behave that way - they either
>> grow really big or they grow really small and they settle that way
>> for a reasonable
>> amount of time before the load on the application changes at a
>> future point in time.
>>
>> - Michael
> What is the bottleneck for chunk-based? Can you tell me that?  Find out,
> and you will maybe see pipelining will help.
>
> Basically to me, when you describe the protocol in detail the problems
> become apparent.
>
> I think you worry too much about what the guest does, what APIs are
> exposed from the migration core and the specifics of the workload. Build
> a sane protocol for data transfers and layer the workload on top.
>

What is the point in enhancing a protocol to solve a problem will never 
be manifested?

We're trying to overlap two *completely different use cases* that are 
completely unrelated:

1. Static overcommit
2. Dynamic, fine-grained overcommit (at small time scales... seconds or 
minutes)

#1 Happens all the time. Cram a bunch of virtual machines with fixed 
workloads
and fixed writable working sets into the same place, and you're good to go.

#2 never happens. Ever. It just doesn't happen, and the enhancements you've
described are trying to protect against #2, when we should really be 
focused on #1.

It is not standard practice for a workload to expect high overcommit 
performance
in the *middle* of a relocation and nobody in the industry that I have 
met over the
years has expressed any such desire to do so.

Workloads just don't behave that way.

Dynamic registration does an excellent job at overcommitment for #1 
because most
of the registrations are done at the very beginning and can be further 
optimized to
cause little or no performance loss by simply issuing the registrations 
before the
migration ever begins.

Performance for #2 even with dynamic registration is excellent and I am not
experiencing any problems associated with it.

So, we're discussing a non-issue.

- Michael



Overcommit has two
mrhines@linux.vnet.ibm.com April 12, 2013, 5:10 a.m. UTC | #29
On 04/11/2013 11:33 AM, Paolo Bonzini wrote:
> 2) rename the transport to "x-rdma" (just in migration.c) 

What does this mean?
Paolo Bonzini April 12, 2013, 5:26 a.m. UTC | #30
Il 12/04/2013 07:10, Michael R. Hines ha scritto:
> On 04/11/2013 11:33 AM, Paolo Bonzini wrote:
>> 2) rename the transport to "x-rdma" (just in migration.c) 
> 
> What does this mean?

Use "migrate x-rdma:192.168.10.12" to migrate, to indicate it's
experimental and the protocol might change.  It's just to err on the
safe side.

Paolo
mrhines@linux.vnet.ibm.com April 12, 2013, 5:54 a.m. UTC | #31
On 04/12/2013 01:26 AM, Paolo Bonzini wrote:
> Il 12/04/2013 07:10, Michael R. Hines ha scritto:
>> On 04/11/2013 11:33 AM, Paolo Bonzini wrote:
>>> 2) rename the transport to "x-rdma" (just in migration.c)
>> What does this mean?
> Use "migrate x-rdma:192.168.10.12" to migrate, to indicate it's
> experimental and the protocol might change.  It's just to err on the
> safe side.
>
> Paolo
>
>
Ooops, you're not gonna make me re-send the patch, are you? =)
Michael S. Tsirkin April 12, 2013, 10:48 a.m. UTC | #32
On Thu, Apr 11, 2013 at 04:33:03PM -0400, Michael R. Hines wrote:
> On 04/11/2013 03:15 PM, Michael S. Tsirkin wrote:
> >On Thu, Apr 11, 2013 at 01:49:34PM -0400, Michael R. Hines wrote:
> >>On 04/11/2013 10:56 AM, Michael S. Tsirkin wrote:
> >>>On Thu, Apr 11, 2013 at 04:50:21PM +0200, Paolo Bonzini wrote:
> >>>>Il 11/04/2013 16:37, Michael S. Tsirkin ha scritto:
> >>>>>pg1 ->  pin -> req -> res -> rdma -> done
> >>>>>         pg2 ->  pin -> req -> res -> rdma -> done
> >>>>>                 pg3 -> pin -> req -> res -> rdma -> done
> >>>>>                        pg4 -> pin -> req -> res -> rdma -> done
> >>>>>                               pg4 -> pin -> req -> res -> rdma -> done
> >>>>>
> >>>>>It's like a assembly line see?  So while software does the registration
> >>>>>roundtrip dance, hardware is processing rdma requests for previous
> >>>>>chunks.
> >>>>Does this only affects the implementation, or also the wire protocol?
> >>>It affects the wire protocol.
> >>I *do* believe chunked registration was a *very* useful request by
> >>the community, and I want to thank you for convincing me to implement it.
> >>
> >>But, with all due respect, pipelining is a "solution looking for a problem".
> >The problem is bad performance, isn't it?
> >If it wasn't we'd use chunk based all the time.
> >
> >>Improving the protocol does not help the behavior of any well-known
> >>workloads,
> >>because it is based on the idea the the memory footprint of a VM would
> >>*rapidly* shrink and contract up and down during the steady-state iteration
> >>rounds while the migration is taking place.
> >What gave you that idea? Not at all.  It is based on the idea
> >of doing control actions in parallel with data transfers,
> >so that control latency does not degrade performance.
> Again, this parallelization is trying to solve a problem that
> doesn't exist.
> 
> As I've described before, I re-executed the worst-case memory stress hog
> tests with RDMA *after* the bulk-phase round completes and determined
> that RDMA throughput remains unaffected because most of the memory
> was already registered in advance.
> 
> >>This simply does not happen - workloads don't behave that way - they either
> >>grow really big or they grow really small and they settle that way
> >>for a reasonable
> >>amount of time before the load on the application changes at a
> >>future point in time.
> >>
> >>- Michael
> >What is the bottleneck for chunk-based? Can you tell me that?  Find out,
> >and you will maybe see pipelining will help.
> >
> >Basically to me, when you describe the protocol in detail the problems
> >become apparent.
> >
> >I think you worry too much about what the guest does, what APIs are
> >exposed from the migration core and the specifics of the workload. Build
> >a sane protocol for data transfers and layer the workload on top.
> >
> 
> What is the point in enhancing a protocol to solve a problem will
> never be manifested?
> 
> We're trying to overlap two *completely different use cases* that
> are completely unrelated:
> 
> 1. Static overcommit
> 2. Dynamic, fine-grained overcommit (at small time scales... seconds
> or minutes)
> 
> #1 Happens all the time. Cram a bunch of virtual machines with fixed
> workloads
> and fixed writable working sets into the same place, and you're good to go.
> 
> #2 never happens. Ever. It just doesn't happen, and the enhancements you've
> described are trying to protect against #2, when we should really be
> focused on #1.
> 
> It is not standard practice for a workload to expect high overcommit
> performance
> in the *middle* of a relocation and nobody in the industry that I
> have met over the
> years has expressed any such desire to do so.
> 

Depends on who you talk to I guess.  Almost everyone
overcommits to some level. They might not know it.
It depends on the amount of overcommit.  You pin all (at least non zero)
memory eventually, breaking memory overcommit completely. If I
overcommit by 4kilobytes do you expect performance to go completely
down? It does not make sense.

> Workloads just don't behave that way.
> 
> Dynamic registration does an excellent job at overcommitment for #1
> because most
> of the registrations are done at the very beginning and can be
> further optimized to
> cause little or no performance loss by simply issuing the
> registrations before the
> migration ever begins.

How does it? You pin all VM's memory eventually.
You said your tests have the OOM killer triggering.


> Performance for #2 even with dynamic registration is excellent and I am not
> experiencing any problems associated with it.

Well previously you said the reverse. You keep vaguely speaking about
performance.  We care about these metrics:

	1. total migration time: measured by:

	time
	 ssh dest qemu -incoming &;echo migrate > monitor
	time

	2.  min allowed downtime that lets migration converge

	3. average host CPU utilization during migration,
	   on source and destination

	4. max real memory used by qemu

Can you fill this table for TCP, and two protocol versions?

If dynamic works as well as static, this is a good reason to drop the
static one.  As the next step, fix the dynamic to unregister
memory (this is required for _GIFT anyway). When you do this
it is possible that pipelining is required.

> So, we're discussing a non-issue.
> 
> - Michael
> 

There are two issues.

1.  You have two protocols already and this does not make sense in
version 1 of the patch.  You said dynamic is slow so I pointed out ways
to improve it. Now you says it's as fast as static?  so drop static
then. At no point does it make sense to have management commands to play
with low level protocol details.

> 
> 
> Overcommit has two
Paolo Bonzini April 12, 2013, 10:53 a.m. UTC | #33
Il 12/04/2013 12:48, Michael S. Tsirkin ha scritto:
> 1.  You have two protocols already and this does not make sense in
> version 1 of the patch.

It makes sense if we consider it experimental (add x- in front of
transport and capability) and would like people to play with it.

Paolo
Michael S. Tsirkin April 12, 2013, 11:25 a.m. UTC | #34
On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
> Il 12/04/2013 12:48, Michael S. Tsirkin ha scritto:
> > 1.  You have two protocols already and this does not make sense in
> > version 1 of the patch.
> 
> It makes sense if we consider it experimental (add x- in front of
> transport and capability) and would like people to play with it.
> 
> Paolo

But it's not testable yet.  I see problems just reading the
documentation.  Author thinks "ulimit -l 10000000000" on both source and
destination is just fine.  This can easily crash host or cause OOM
killer to kill QEMU.  So why is there any need for extra testers?  Fix
the major bugs first.

There's a similar issue with device assignment - we can't fix it there,
and despite being available for years, this was one of two reasons that
has kept this feature out of hands of lots of users (and assuming guest
has lots of zero pages won't work: balloon is not widely used either
since it depends on a well-behaved guest to work correctly).

And it's entirely avoidable, just fix the protocol and the code.
mrhines@linux.vnet.ibm.com April 12, 2013, 1:47 p.m. UTC | #35
On 04/12/2013 06:48 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 11, 2013 at 04:33:03PM -0400, Michael R. Hines wrote:
>> On 04/11/2013 03:15 PM, Michael S. Tsirkin wrote:
>>> On Thu, Apr 11, 2013 at 01:49:34PM -0400, Michael R. Hines wrote:
>>>> On 04/11/2013 10:56 AM, Michael S. Tsirkin wrote:
>>>>> On Thu, Apr 11, 2013 at 04:50:21PM +0200, Paolo Bonzini wrote:
>>>>>> Il 11/04/2013 16:37, Michael S. Tsirkin ha scritto:
>>>>>>> pg1 ->  pin -> req -> res -> rdma -> done
>>>>>>>          pg2 ->  pin -> req -> res -> rdma -> done
>>>>>>>                  pg3 -> pin -> req -> res -> rdma -> done
>>>>>>>                         pg4 -> pin -> req -> res -> rdma -> done
>>>>>>>                                pg4 -> pin -> req -> res -> rdma -> done
>>>>>>>
>>>>>>> It's like a assembly line see?  So while software does the registration
>>>>>>> roundtrip dance, hardware is processing rdma requests for previous
>>>>>>> chunks.
>>>>>> Does this only affects the implementation, or also the wire protocol?
>>>>> It affects the wire protocol.
>>>> I *do* believe chunked registration was a *very* useful request by
>>>> the community, and I want to thank you for convincing me to implement it.
>>>>
>>>> But, with all due respect, pipelining is a "solution looking for a problem".
>>> The problem is bad performance, isn't it?
>>> If it wasn't we'd use chunk based all the time.
>>>
>>>> Improving the protocol does not help the behavior of any well-known
>>>> workloads,
>>>> because it is based on the idea the the memory footprint of a VM would
>>>> *rapidly* shrink and contract up and down during the steady-state iteration
>>>> rounds while the migration is taking place.
>>> What gave you that idea? Not at all.  It is based on the idea
>>> of doing control actions in parallel with data transfers,
>>> so that control latency does not degrade performance.
>> Again, this parallelization is trying to solve a problem that
>> doesn't exist.
>>
>> As I've described before, I re-executed the worst-case memory stress hog
>> tests with RDMA *after* the bulk-phase round completes and determined
>> that RDMA throughput remains unaffected because most of the memory
>> was already registered in advance.
>>
>>>> This simply does not happen - workloads don't behave that way - they either
>>>> grow really big or they grow really small and they settle that way
>>>> for a reasonable
>>>> amount of time before the load on the application changes at a
>>>> future point in time.
>>>>
>>>> - Michael
>>> What is the bottleneck for chunk-based? Can you tell me that?  Find out,
>>> and you will maybe see pipelining will help.
>>>
>>> Basically to me, when you describe the protocol in detail the problems
>>> become apparent.
>>>
>>> I think you worry too much about what the guest does, what APIs are
>>> exposed from the migration core and the specifics of the workload. Build
>>> a sane protocol for data transfers and layer the workload on top.
>>>
>> What is the point in enhancing a protocol to solve a problem will
>> never be manifested?
>>
>> We're trying to overlap two *completely different use cases* that
>> are completely unrelated:
>>
>> 1. Static overcommit
>> 2. Dynamic, fine-grained overcommit (at small time scales... seconds
>> or minutes)
>>
>> #1 Happens all the time. Cram a bunch of virtual machines with fixed
>> workloads
>> and fixed writable working sets into the same place, and you're good to go.
>>
>> #2 never happens. Ever. It just doesn't happen, and the enhancements you've
>> described are trying to protect against #2, when we should really be
>> focused on #1.
>>
>> It is not standard practice for a workload to expect high overcommit
>> performance
>> in the *middle* of a relocation and nobody in the industry that I
>> have met over the
>> years has expressed any such desire to do so.
>>
> Depends on who you talk to I guess.  Almost everyone
> overcommits to some level. They might not know it.
> It depends on the amount of overcommit.  You pin all (at least non zero)
> memory eventually, breaking memory overcommit completely. If I
> overcommit by 4kilobytes do you expect performance to go completely
> down? It does not make sense.
>
>> Workloads just don't behave that way.
>>
>> Dynamic registration does an excellent job at overcommitment for #1
>> because most
>> of the registrations are done at the very beginning and can be
>> further optimized to
>> cause little or no performance loss by simply issuing the
>> registrations before the
>> migration ever begins.
> How does it? You pin all VM's memory eventually.
> You said your tests have the OOM killer triggering.
>

That's because of cgroups memory limitations. Not the protocol.

Infiband was never designed to work with cgroups - that's a kernel
problem, not a QEMU problem or a protocol problem. Why do
we have to worry about that exactly?

>> Performance for #2 even with dynamic registration is excellent and I am not
>> experiencing any problems associated with it.
> Well previously you said the reverse. You keep vaguely speaking about
> performance.  We care about these metrics:
>
> 	1. total migration time: measured by:
>
> 	time
> 	 ssh dest qemu -incoming &;echo migrate > monitor
> 	time
>
> 	2.  min allowed downtime that lets migration converge
>
> 	3. average host CPU utilization during migration,
> 	   on source and destination
>
> 	4. max real memory used by qemu
>
> Can you fill this table for TCP, and two protocol versions?
>
> If dynamic works as well as static, this is a good reason to drop the
> static one.  As the next step, fix the dynamic to unregister
> memory (this is required for _GIFT anyway). When you do this
> it is possible that pipelining is required.

First, yes, I'm happy to fill out the table - let me address
Paolo's last requested changes (including the COMPRESS fix)

Second, there are not two protocol versions. That's incorrect.
There's only one protocol which can operate in different ways
as any protocol can operate in different ways. It has different
command types, not all of which need to be used by the protocol
at the same time.

Second, as I've explained, I strongly, strongly disagree with unregistering
memory for all of the aforementioned reasons - workloads do not
operate in such a manner that they can tolerate memory to be
pulled out from underneath them at such fine-grained time scales
in the *middle* of a relocation and I will not commit to writing a solution
for a problem that doesn't exist.

If you can prove (through some kind of anaylsis) that workloads
would benefit from this kind of fine-grained memory overcommit
by having cgroups swap out memory to disk underneath them
without their permission, I would happily reconsider my position.

- Michael



>> So, we're discussing a non-issue.
>>
>> - Michael
>>
> There are two issues.
>
> 1.  You have two protocols already and this does not make sense in
> version 1 of the patch.  You said dynamic is slow so I pointed out ways
> to improve it. Now you says it's as fast as static?  so drop static
> then. At no point does it make sense to have management commands to play
> with low level protocol details.
>
>>
>> Overcommit has two
Paolo Bonzini April 12, 2013, 2:43 p.m. UTC | #36
Il 12/04/2013 13:25, Michael S. Tsirkin ha scritto:
> On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
>> Il 12/04/2013 12:48, Michael S. Tsirkin ha scritto:
>>> 1.  You have two protocols already and this does not make sense in
>>> version 1 of the patch.
>>
>> It makes sense if we consider it experimental (add x- in front of
>> transport and capability) and would like people to play with it.
>>
>> Paolo
> 
> But it's not testable yet.  I see problems just reading the
> documentation.  Author thinks "ulimit -l 10000000000" on both source and
> destination is just fine.  This can easily crash host or cause OOM
> killer to kill QEMU.  So why is there any need for extra testers?  Fix
> the major bugs first.
> 
> There's a similar issue with device assignment - we can't fix it there,
> and despite being available for years, this was one of two reasons that
> has kept this feature out of hands of lots of users (and assuming guest
> has lots of zero pages won't work: balloon is not widely used either
> since it depends on a well-behaved guest to work correctly).

I agree assuming guest has lots of zero pages won't work, but I think
you are overstating the importance of overcommit.  Let's mark the damn
thing as experimental, and stop making perfect the enemy of good.

Paolo
Michael S. Tsirkin April 14, 2013, 8:28 a.m. UTC | #37
On Fri, Apr 12, 2013 at 09:47:08AM -0400, Michael R. Hines wrote:
> Second, as I've explained, I strongly, strongly disagree with unregistering
> memory for all of the aforementioned reasons - workloads do not
> operate in such a manner that they can tolerate memory to be
> pulled out from underneath them at such fine-grained time scales
> in the *middle* of a relocation and I will not commit to writing a solution
> for a problem that doesn't exist.

Exactly same thing happens with swap, doesn't it?
You are saying workloads simply can not tolerate swap.

> If you can prove (through some kind of anaylsis) that workloads
> would benefit from this kind of fine-grained memory overcommit
> by having cgroups swap out memory to disk underneath them
> without their permission, I would happily reconsider my position.
> 
> - Michael

This has nothing to do with cgroups directly, it's just a way to
demonstrate you have a bug.
Michael S. Tsirkin April 14, 2013, 11:59 a.m. UTC | #38
On Fri, Apr 12, 2013 at 04:43:54PM +0200, Paolo Bonzini wrote:
> Il 12/04/2013 13:25, Michael S. Tsirkin ha scritto:
> > On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
> >> Il 12/04/2013 12:48, Michael S. Tsirkin ha scritto:
> >>> 1.  You have two protocols already and this does not make sense in
> >>> version 1 of the patch.
> >>
> >> It makes sense if we consider it experimental (add x- in front of
> >> transport and capability) and would like people to play with it.
> >>
> >> Paolo
> > 
> > But it's not testable yet.  I see problems just reading the
> > documentation.  Author thinks "ulimit -l 10000000000" on both source and
> > destination is just fine.  This can easily crash host or cause OOM
> > killer to kill QEMU.  So why is there any need for extra testers?  Fix
> > the major bugs first.
> > 
> > There's a similar issue with device assignment - we can't fix it there,
> > and despite being available for years, this was one of two reasons that
> > has kept this feature out of hands of lots of users (and assuming guest
> > has lots of zero pages won't work: balloon is not widely used either
> > since it depends on a well-behaved guest to work correctly).
> 
> I agree assuming guest has lots of zero pages won't work, but I think
> you are overstating the importance of overcommit.  Let's mark the damn
> thing as experimental, and stop making perfect the enemy of good.
> 
> Paolo

It looks like we have to decide, before merging, whether migration with
rdma that breaks overcommit is worth it or not.  Since the author made
it very clear he does not intend to make it work with overcommit, ever.
Paolo Bonzini April 14, 2013, 2:09 p.m. UTC | #39
Il 14/04/2013 13:59, Michael S. Tsirkin ha scritto:
> > I agree assuming guest has lots of zero pages won't work, but I think
> > you are overstating the importance of overcommit.  Let's mark the damn
> > thing as experimental, and stop making perfect the enemy of good.
> 
> It looks like we have to decide, before merging, whether migration with
> rdma that breaks overcommit is worth it or not.  Since the author made
> it very clear he does not intend to make it work with overcommit, ever.

To me it is very much worth it.

I would like to understand if unregistration would require a protocol
change, but that's really more a curiosity than anything else.

Perhaps it would make sense to make chunk registration permanent only
after the bulk phase.  Chunks registered in the bulk phase are not
permanent.

Paolo
mrhines@linux.vnet.ibm.com April 14, 2013, 2:27 p.m. UTC | #40
On 04/14/2013 07:59 AM, Michael S. Tsirkin wrote:
> On Fri, Apr 12, 2013 at 04:43:54PM +0200, Paolo Bonzini wrote:
>> Il 12/04/2013 13:25, Michael S. Tsirkin ha scritto:
>>> On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
>>>> Il 12/04/2013 12:48, Michael S. Tsirkin ha scritto:
>>>>> 1.  You have two protocols already and this does not make sense in
>>>>> version 1 of the patch.
>>>> It makes sense if we consider it experimental (add x- in front of
>>>> transport and capability) and would like people to play with it.
>>>>
>>>> Paolo
>>> But it's not testable yet.  I see problems just reading the
>>> documentation.  Author thinks "ulimit -l 10000000000" on both source and
>>> destination is just fine.  This can easily crash host or cause OOM
>>> killer to kill QEMU.  So why is there any need for extra testers?  Fix
>>> the major bugs first.
>>>
>>> There's a similar issue with device assignment - we can't fix it there,
>>> and despite being available for years, this was one of two reasons that
>>> has kept this feature out of hands of lots of users (and assuming guest
>>> has lots of zero pages won't work: balloon is not widely used either
>>> since it depends on a well-behaved guest to work correctly).
>> I agree assuming guest has lots of zero pages won't work, but I think
>> you are overstating the importance of overcommit.  Let's mark the damn
>> thing as experimental, and stop making perfect the enemy of good.
>>
>> Paolo
> It looks like we have to decide, before merging, whether migration with
> rdma that breaks overcommit is worth it or not.  Since the author made
> it very clear he does not intend to make it work with overcommit, ever.
>
That depends entirely as what you define as overcommit.

The pages do get unregistered at the end of the migration =)

- Michael
mrhines@linux.vnet.ibm.com April 14, 2013, 2:31 p.m. UTC | #41
On 04/14/2013 04:28 AM, Michael S. Tsirkin wrote:
> On Fri, Apr 12, 2013 at 09:47:08AM -0400, Michael R. Hines wrote:
>> Second, as I've explained, I strongly, strongly disagree with unregistering
>> memory for all of the aforementioned reasons - workloads do not
>> operate in such a manner that they can tolerate memory to be
>> pulled out from underneath them at such fine-grained time scales
>> in the *middle* of a relocation and I will not commit to writing a solution
>> for a problem that doesn't exist.
> Exactly same thing happens with swap, doesn't it?
> You are saying workloads simply can not tolerate swap.
>
>> If you can prove (through some kind of anaylsis) that workloads
>> would benefit from this kind of fine-grained memory overcommit
>> by having cgroups swap out memory to disk underneath them
>> without their permission, I would happily reconsider my position.
>>
>> - Michael
> This has nothing to do with cgroups directly, it's just a way to
> demonstrate you have a bug.
>

If your datacenter or your cloud or your product does not want to
tolerate page registration, then don't use RDMA!

The bottom line is: RDMA is useless without page registration. Without
it, the performance of it will be crippled. If you define that as a bug,
then so be it.

- Michael
mrhines@linux.vnet.ibm.com April 14, 2013, 2:40 p.m. UTC | #42
On 04/14/2013 10:09 AM, Paolo Bonzini wrote:
> Il 14/04/2013 13:59, Michael S. Tsirkin ha scritto:
>>> I agree assuming guest has lots of zero pages won't work, but I think
>>> you are overstating the importance of overcommit.  Let's mark the damn
>>> thing as experimental, and stop making perfect the enemy of good.
>> It looks like we have to decide, before merging, whether migration with
>> rdma that breaks overcommit is worth it or not.  Since the author made
>> it very clear he does not intend to make it work with overcommit, ever.
> To me it is very much worth it.
>
> I would like to understand if unregistration would require a protocol
> change, but that's really more a curiosity than anything else.

Yes, it would require a protocol change. Either the source or the
destination would have to arbitrarily "decide" when is the time to
perform the unregistration without adversely causing the page
to be RE-registered over and over again during future iterations.

I really don't see how QEMU can accurately make such a decision.

> Perhaps it would make sense to make chunk registration permanent only
> after the bulk phase.  Chunks registered in the bulk phase are not
> permanent.

Unfortunately, that would require the entire memory footprint
to be pinned during the bulk round, which was what Michael
was originally trying to avoid a couple of weeks ago.

Nevertheless, the observation is accurate: We already have
a capability to disable chunk registration entirely.

If the user doesn't want it, they can just turn it off.


- Michael

> Paolo
>
Michael S. Tsirkin April 14, 2013, 4:03 p.m. UTC | #43
On Sun, Apr 14, 2013 at 10:27:24AM -0400, Michael R. Hines wrote:
> On 04/14/2013 07:59 AM, Michael S. Tsirkin wrote:
> >On Fri, Apr 12, 2013 at 04:43:54PM +0200, Paolo Bonzini wrote:
> >>Il 12/04/2013 13:25, Michael S. Tsirkin ha scritto:
> >>>On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
> >>>>Il 12/04/2013 12:48, Michael S. Tsirkin ha scritto:
> >>>>>1.  You have two protocols already and this does not make sense in
> >>>>>version 1 of the patch.
> >>>>It makes sense if we consider it experimental (add x- in front of
> >>>>transport and capability) and would like people to play with it.
> >>>>
> >>>>Paolo
> >>>But it's not testable yet.  I see problems just reading the
> >>>documentation.  Author thinks "ulimit -l 10000000000" on both source and
> >>>destination is just fine.  This can easily crash host or cause OOM
> >>>killer to kill QEMU.  So why is there any need for extra testers?  Fix
> >>>the major bugs first.
> >>>
> >>>There's a similar issue with device assignment - we can't fix it there,
> >>>and despite being available for years, this was one of two reasons that
> >>>has kept this feature out of hands of lots of users (and assuming guest
> >>>has lots of zero pages won't work: balloon is not widely used either
> >>>since it depends on a well-behaved guest to work correctly).
> >>I agree assuming guest has lots of zero pages won't work, but I think
> >>you are overstating the importance of overcommit.  Let's mark the damn
> >>thing as experimental, and stop making perfect the enemy of good.
> >>
> >>Paolo
> >It looks like we have to decide, before merging, whether migration with
> >rdma that breaks overcommit is worth it or not.  Since the author made
> >it very clear he does not intend to make it work with overcommit, ever.
> >
> That depends entirely as what you define as overcommit.

You don't get to define your own terms.  Look it up in wikipedia or
something.

> 
> The pages do get unregistered at the end of the migration =)
> 
> - Michael

The limitations are pretty clear, and you really should document them:

1. run qemu as root, or under ulimit -l <total guest memory> on both source and
  destination

2. expect that as much as that amount of memory is pinned
  and unvailable to host kernel and applications for
  arbitrarily long time.
  Make sure you have much more RAM in host or QEMU will get killed.

To me, especially 1 is an unacceptable security tradeoff.
It is entirely fixable but we both have other priorities,
so it'll stay broken.
mrhines@linux.vnet.ibm.com April 14, 2013, 4:07 p.m. UTC | #44
On 04/14/2013 12:03 PM, Michael S. Tsirkin wrote:
> On Sun, Apr 14, 2013 at 10:27:24AM -0400, Michael R. Hines wrote:
>> On 04/14/2013 07:59 AM, Michael S. Tsirkin wrote:
>>> On Fri, Apr 12, 2013 at 04:43:54PM +0200, Paolo Bonzini wrote:
>>>> Il 12/04/2013 13:25, Michael S. Tsirkin ha scritto:
>>>>> On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
>>>>>> Il 12/04/2013 12:48, Michael S. Tsirkin ha scritto:
>>>>>>> 1.  You have two protocols already and this does not make sense in
>>>>>>> version 1 of the patch.
>>>>>> It makes sense if we consider it experimental (add x- in front of
>>>>>> transport and capability) and would like people to play with it.
>>>>>>
>>>>>> Paolo
>>>>> But it's not testable yet.  I see problems just reading the
>>>>> documentation.  Author thinks "ulimit -l 10000000000" on both source and
>>>>> destination is just fine.  This can easily crash host or cause OOM
>>>>> killer to kill QEMU.  So why is there any need for extra testers?  Fix
>>>>> the major bugs first.
>>>>>
>>>>> There's a similar issue with device assignment - we can't fix it there,
>>>>> and despite being available for years, this was one of two reasons that
>>>>> has kept this feature out of hands of lots of users (and assuming guest
>>>>> has lots of zero pages won't work: balloon is not widely used either
>>>>> since it depends on a well-behaved guest to work correctly).
>>>> I agree assuming guest has lots of zero pages won't work, but I think
>>>> you are overstating the importance of overcommit.  Let's mark the damn
>>>> thing as experimental, and stop making perfect the enemy of good.
>>>>
>>>> Paolo
>>> It looks like we have to decide, before merging, whether migration with
>>> rdma that breaks overcommit is worth it or not.  Since the author made
>>> it very clear he does not intend to make it work with overcommit, ever.
>>>
>> That depends entirely as what you define as overcommit.
> You don't get to define your own terms.  Look it up in wikipedia or
> something.
>
>> The pages do get unregistered at the end of the migration =)
>>
>> - Michael
> The limitations are pretty clear, and you really should document them:
>
> 1. run qemu as root, or under ulimit -l <total guest memory> on both source and
>    destination
>
> 2. expect that as much as that amount of memory is pinned
>    and unvailable to host kernel and applications for
>    arbitrarily long time.
>    Make sure you have much more RAM in host or QEMU will get killed.
>
> To me, especially 1 is an unacceptable security tradeoff.
> It is entirely fixable but we both have other priorities,
> so it'll stay broken.
>

Agreed, the documentation should be clear.

So, if you define that scenario as broken, then yes, it's broken.

- Michael
mrhines@linux.vnet.ibm.com April 14, 2013, 4:40 p.m. UTC | #45
On 04/14/2013 12:03 PM, Michael S. Tsirkin wrote:
> On Sun, Apr 14, 2013 at 10:27:24AM -0400, Michael R. Hines wrote:
>> On 04/14/2013 07:59 AM, Michael S. Tsirkin wrote:
>>> On Fri, Apr 12, 2013 at 04:43:54PM +0200, Paolo Bonzini wrote:
>>>> Il 12/04/2013 13:25, Michael S. Tsirkin ha scritto:
>>>>> On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
>>>>>> Il 12/04/2013 12:48, Michael S. Tsirkin ha scritto:
>>>>>>> 1.  You have two protocols already and this does not make sense in
>>>>>>> version 1 of the patch.
>>>>>> It makes sense if we consider it experimental (add x- in front of
>>>>>> transport and capability) and would like people to play with it.
>>>>>>
>>>>>> Paolo
>>>>> But it's not testable yet.  I see problems just reading the
>>>>> documentation.  Author thinks "ulimit -l 10000000000" on both source and
>>>>> destination is just fine.  This can easily crash host or cause OOM
>>>>> killer to kill QEMU.  So why is there any need for extra testers?  Fix
>>>>> the major bugs first.
>>>>>
>>>>> There's a similar issue with device assignment - we can't fix it there,
>>>>> and despite being available for years, this was one of two reasons that
>>>>> has kept this feature out of hands of lots of users (and assuming guest
>>>>> has lots of zero pages won't work: balloon is not widely used either
>>>>> since it depends on a well-behaved guest to work correctly).
>>>> I agree assuming guest has lots of zero pages won't work, but I think
>>>> you are overstating the importance of overcommit.  Let's mark the damn
>>>> thing as experimental, and stop making perfect the enemy of good.
>>>>
>>>> Paolo
>>> It looks like we have to decide, before merging, whether migration with
>>> rdma that breaks overcommit is worth it or not.  Since the author made
>>> it very clear he does not intend to make it work with overcommit, ever.
>>>
>> That depends entirely as what you define as overcommit.
> You don't get to define your own terms.  Look it up in wikipedia or
> something.
>
>> The pages do get unregistered at the end of the migration =)
>>
>> - Michael
> The limitations are pretty clear, and you really should document them:
>
> 1. run qemu as root, or under ulimit -l <total guest memory> on both source and
>    destination
>
> 2. expect that as much as that amount of memory is pinned
>    and unvailable to host kernel and applications for
>    arbitrarily long time.
>    Make sure you have much more RAM in host or QEMU will get killed.
>
> To me, especially 1 is an unacceptable security tradeoff.
> It is entirely fixable but we both have other priorities,
> so it'll stay broken.
>

I've modified the beginning of docs/rdma.txt to say the following:

$ cat docs/rdma.txt

... snip ..

BEFORE RUNNING:
===============

Use of RDMA requires pinning and registering memory with the
hardware. If this is not acceptable for your application or
product, then the use of RDMA is strongly discouraged and you
should revert back to standard TCP-based migration.

Next, decide if you want dynamic page registration on the server-side.
For example, if you have an 8GB RAM virtual machine, but only 1GB
is in active use, then disabling this feature will cause all 8GB to
be pinned and resident in memory. This feature mostly affects the
bulk-phase round of the migration and can be disabled for extremely
high-performance RDMA hardware using the following command:

QEMU Monitor Command:
$ migrate_set_capability chunk_register_destination off # enabled by default

Performing this action will cause all 8GB to be pinned, so if that's
not what you want, then please ignore this step altogether.

RUNNING:
=======

..... snip ...

I'll group this change into a future patch whenever the current patch
gets pulled, and I will also update the QEMU wiki to make this point clear.

- Michael
Michael S. Tsirkin April 14, 2013, 6:30 p.m. UTC | #46
On Sun, Apr 14, 2013 at 12:40:10PM -0400, Michael R. Hines wrote:
> On 04/14/2013 12:03 PM, Michael S. Tsirkin wrote:
> >On Sun, Apr 14, 2013 at 10:27:24AM -0400, Michael R. Hines wrote:
> >>On 04/14/2013 07:59 AM, Michael S. Tsirkin wrote:
> >>>On Fri, Apr 12, 2013 at 04:43:54PM +0200, Paolo Bonzini wrote:
> >>>>Il 12/04/2013 13:25, Michael S. Tsirkin ha scritto:
> >>>>>On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
> >>>>>>Il 12/04/2013 12:48, Michael S. Tsirkin ha scritto:
> >>>>>>>1.  You have two protocols already and this does not make sense in
> >>>>>>>version 1 of the patch.
> >>>>>>It makes sense if we consider it experimental (add x- in front of
> >>>>>>transport and capability) and would like people to play with it.
> >>>>>>
> >>>>>>Paolo
> >>>>>But it's not testable yet.  I see problems just reading the
> >>>>>documentation.  Author thinks "ulimit -l 10000000000" on both source and
> >>>>>destination is just fine.  This can easily crash host or cause OOM
> >>>>>killer to kill QEMU.  So why is there any need for extra testers?  Fix
> >>>>>the major bugs first.
> >>>>>
> >>>>>There's a similar issue with device assignment - we can't fix it there,
> >>>>>and despite being available for years, this was one of two reasons that
> >>>>>has kept this feature out of hands of lots of users (and assuming guest
> >>>>>has lots of zero pages won't work: balloon is not widely used either
> >>>>>since it depends on a well-behaved guest to work correctly).
> >>>>I agree assuming guest has lots of zero pages won't work, but I think
> >>>>you are overstating the importance of overcommit.  Let's mark the damn
> >>>>thing as experimental, and stop making perfect the enemy of good.
> >>>>
> >>>>Paolo
> >>>It looks like we have to decide, before merging, whether migration with
> >>>rdma that breaks overcommit is worth it or not.  Since the author made
> >>>it very clear he does not intend to make it work with overcommit, ever.
> >>>
> >>That depends entirely as what you define as overcommit.
> >You don't get to define your own terms.  Look it up in wikipedia or
> >something.
> >
> >>The pages do get unregistered at the end of the migration =)
> >>
> >>- Michael
> >The limitations are pretty clear, and you really should document them:
> >
> >1. run qemu as root, or under ulimit -l <total guest memory> on both source and
> >   destination
> >
> >2. expect that as much as that amount of memory is pinned
> >   and unvailable to host kernel and applications for
> >   arbitrarily long time.
> >   Make sure you have much more RAM in host or QEMU will get killed.
> >
> >To me, especially 1 is an unacceptable security tradeoff.
> >It is entirely fixable but we both have other priorities,
> >so it'll stay broken.
> >
> 
> I've modified the beginning of docs/rdma.txt to say the following:

It really should say this, in a very prominent place:

BUGS:
1. You must run qemu as root, or under
   ulimit -l <total guest memory> on both source and destination

2. Expect that as much as that amount of memory to be locked
   and unvailable to host kernel and applications for
   arbitrarily long time.
   Make sure you have much more RAM in host otherwise QEMU,
   or some other arbitrary application on same host, will get killed.

3. Migration with RDMA support is experimental and unsupported.
   In particular, please do not expect it to work across qemu versions,
   and do not expect the management interface to be stable.
   

> 
> $ cat docs/rdma.txt
> 
> ... snip ..
> 
> BEFORE RUNNING:
> ===============
> 
> Use of RDMA requires pinning and registering memory with the
> hardware. If this is not acceptable for your application or
> product, then the use of RDMA is strongly discouraged and you
> should revert back to standard TCP-based migration.

No one knows of should know what "pinning and registering" means.
For which applications and products is it appropriate?
Also, you are talking about current QEMU
code using RDMA for migration but say "RDMA" generally.

> Next, decide if you want dynamic page registration on the server-side.
> For example, if you have an 8GB RAM virtual machine, but only 1GB
> is in active use, then disabling this feature will cause all 8GB to
> be pinned and resident in memory. This feature mostly affects the
> bulk-phase round of the migration and can be disabled for extremely
> high-performance RDMA hardware using the following command:
> QEMU Monitor Command:
> $ migrate_set_capability chunk_register_destination off # enabled by default
> 
> Performing this action will cause all 8GB to be pinned, so if that's
> not what you want, then please ignore this step altogether.

This does not make it clear what is the benefit of disabling this
capability. I think it's best to avoid options, just use chunk
based always.
If it's here "so people can play with it" then please rename
it to something like "x-unsupported-chunk_register_destination"
so people know this is unsupported and not to be used for production.

> RUNNING:
> =======
> 
> ..... snip ...
> 
> I'll group this change into a future patch whenever the current patch
> gets pulled, and I will also update the QEMU wiki to make this point clear.
> 
> - Michael
> 
> 
>
Michael S. Tsirkin April 14, 2013, 6:51 p.m. UTC | #47
On Sun, Apr 14, 2013 at 10:31:20AM -0400, Michael R. Hines wrote:
> On 04/14/2013 04:28 AM, Michael S. Tsirkin wrote:
> >On Fri, Apr 12, 2013 at 09:47:08AM -0400, Michael R. Hines wrote:
> >>Second, as I've explained, I strongly, strongly disagree with unregistering
> >>memory for all of the aforementioned reasons - workloads do not
> >>operate in such a manner that they can tolerate memory to be
> >>pulled out from underneath them at such fine-grained time scales
> >>in the *middle* of a relocation and I will not commit to writing a solution
> >>for a problem that doesn't exist.
> >Exactly same thing happens with swap, doesn't it?
> >You are saying workloads simply can not tolerate swap.
> >
> >>If you can prove (through some kind of anaylsis) that workloads
> >>would benefit from this kind of fine-grained memory overcommit
> >>by having cgroups swap out memory to disk underneath them
> >>without their permission, I would happily reconsider my position.
> >>
> >>- Michael
> >This has nothing to do with cgroups directly, it's just a way to
> >demonstrate you have a bug.
> >
> 
> If your datacenter or your cloud or your product does not want to
> tolerate page registration, then don't use RDMA!
> 
> The bottom line is: RDMA is useless without page registration. Without
> it, the performance of it will be crippled. If you define that as a bug,
> then so be it.
> 
> - Michael

No one cares if you do page registration or not.  ulimit -l 10g is the
problem.  You should limit the amount of locked memory.
Lots of good research went into making RDMA go fast with limited locked
memory, with some success. Search for "registration cache" for example.
mrhines@linux.vnet.ibm.com April 14, 2013, 7:06 p.m. UTC | #48
On 04/14/2013 02:30 PM, Michael S. Tsirkin wrote:
> On Sun, Apr 14, 2013 at 12:40:10PM -0400, Michael R. Hines wrote:
>> On 04/14/2013 12:03 PM, Michael S. Tsirkin wrote:
>>> On Sun, Apr 14, 2013 at 10:27:24AM -0400, Michael R. Hines wrote:
>>>> On 04/14/2013 07:59 AM, Michael S. Tsirkin wrote:
>>>>> On Fri, Apr 12, 2013 at 04:43:54PM +0200, Paolo Bonzini wrote:
>>>>>> Il 12/04/2013 13:25, Michael S. Tsirkin ha scritto:
>>>>>>> On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
>>>>>>>> Il 12/04/2013 12:48, Michael S. Tsirkin ha scritto:
>>>>>>>>> 1.  You have two protocols already and this does not make sense in
>>>>>>>>> version 1 of the patch.
>>>>>>>> It makes sense if we consider it experimental (add x- in front of
>>>>>>>> transport and capability) and would like people to play with it.
>>>>>>>>
>>>>>>>> Paolo
>>>>>>> But it's not testable yet.  I see problems just reading the
>>>>>>> documentation.  Author thinks "ulimit -l 10000000000" on both source and
>>>>>>> destination is just fine.  This can easily crash host or cause OOM
>>>>>>> killer to kill QEMU.  So why is there any need for extra testers?  Fix
>>>>>>> the major bugs first.
>>>>>>>
>>>>>>> There's a similar issue with device assignment - we can't fix it there,
>>>>>>> and despite being available for years, this was one of two reasons that
>>>>>>> has kept this feature out of hands of lots of users (and assuming guest
>>>>>>> has lots of zero pages won't work: balloon is not widely used either
>>>>>>> since it depends on a well-behaved guest to work correctly).
>>>>>> I agree assuming guest has lots of zero pages won't work, but I think
>>>>>> you are overstating the importance of overcommit.  Let's mark the damn
>>>>>> thing as experimental, and stop making perfect the enemy of good.
>>>>>>
>>>>>> Paolo
>>>>> It looks like we have to decide, before merging, whether migration with
>>>>> rdma that breaks overcommit is worth it or not.  Since the author made
>>>>> it very clear he does not intend to make it work with overcommit, ever.
>>>>>
>>>> That depends entirely as what you define as overcommit.
>>> You don't get to define your own terms.  Look it up in wikipedia or
>>> something.
>>>
>>>> The pages do get unregistered at the end of the migration =)
>>>>
>>>> - Michael
>>> The limitations are pretty clear, and you really should document them:
>>>
>>> 1. run qemu as root, or under ulimit -l <total guest memory> on both source and
>>>    destination
>>>
>>> 2. expect that as much as that amount of memory is pinned
>>>    and unvailable to host kernel and applications for
>>>    arbitrarily long time.
>>>    Make sure you have much more RAM in host or QEMU will get killed.
>>>
>>> To me, especially 1 is an unacceptable security tradeoff.
>>> It is entirely fixable but we both have other priorities,
>>> so it'll stay broken.
>>>
>> I've modified the beginning of docs/rdma.txt to say the following:
> It really should say this, in a very prominent place:
>
> BUGS:
Not a bug. We'll have to agree to disagree. Please drop this.
> 1. You must run qemu as root, or under
>     ulimit -l <total guest memory> on both source and destination
Good, will update the documentation now.
> 2. Expect that as much as that amount of memory to be locked
>     and unvailable to host kernel and applications for
>     arbitrarily long time.
>     Make sure you have much more RAM in host otherwise QEMU,
>     or some other arbitrary application on same host, will get killed.
This is implied already. The docs say "If you don't want pinning, then 
use TCP".
That's enough warning.
> 3. Migration with RDMA support is experimental and unsupported.
>     In particular, please do not expect it to work across qemu versions,
>     and do not expect the management interface to be stable.
>     

The only correct statement here is that it's experimental.

I will update the docs to reflect that.

>> $ cat docs/rdma.txt
>>
>> ... snip ..
>>
>> BEFORE RUNNING:
>> ===============
>>
>> Use of RDMA requires pinning and registering memory with the
>> hardware. If this is not acceptable for your application or
>> product, then the use of RDMA is strongly discouraged and you
>> should revert back to standard TCP-based migration.
> No one knows of should know what "pinning and registering" means.

I will define it in the docs, then.

> For which applications and products is it appropriate?

That's up to the vendor or user to decide, not us.

> Also, you are talking about current QEMU
> code using RDMA for migration but say "RDMA" generally.

Sure, I will fix the docs.

>> Next, decide if you want dynamic page registration on the server-side.
>> For example, if you have an 8GB RAM virtual machine, but only 1GB
>> is in active use, then disabling this feature will cause all 8GB to
>> be pinned and resident in memory. This feature mostly affects the
>> bulk-phase round of the migration and can be disabled for extremely
>> high-performance RDMA hardware using the following command:
>> QEMU Monitor Command:
>> $ migrate_set_capability chunk_register_destination off # enabled by default
>>
>> Performing this action will cause all 8GB to be pinned, so if that's
>> not what you want, then please ignore this step altogether.
> This does not make it clear what is the benefit of disabling this
> capability. I think it's best to avoid options, just use chunk
> based always.
> If it's here "so people can play with it" then please rename
> it to something like "x-unsupported-chunk_register_destination"
> so people know this is unsupported and not to be used for production.

Again, please drop the request for removing chunking.

Paolo already told me to use "x-rdma" - so that's enough for now.

- Michael
mrhines@linux.vnet.ibm.com April 14, 2013, 7:43 p.m. UTC | #49
On 04/14/2013 02:51 PM, Michael S. Tsirkin wrote:
> On Sun, Apr 14, 2013 at 10:31:20AM -0400, Michael R. Hines wrote:
>> On 04/14/2013 04:28 AM, Michael S. Tsirkin wrote:
>>> On Fri, Apr 12, 2013 at 09:47:08AM -0400, Michael R. Hines wrote:
>>>> Second, as I've explained, I strongly, strongly disagree with unregistering
>>>> memory for all of the aforementioned reasons - workloads do not
>>>> operate in such a manner that they can tolerate memory to be
>>>> pulled out from underneath them at such fine-grained time scales
>>>> in the *middle* of a relocation and I will not commit to writing a solution
>>>> for a problem that doesn't exist.
>>> Exactly same thing happens with swap, doesn't it?
>>> You are saying workloads simply can not tolerate swap.
>>>
>>>> If you can prove (through some kind of anaylsis) that workloads
>>>> would benefit from this kind of fine-grained memory overcommit
>>>> by having cgroups swap out memory to disk underneath them
>>>> without their permission, I would happily reconsider my position.
>>>>
>>>> - Michael
>>> This has nothing to do with cgroups directly, it's just a way to
>>> demonstrate you have a bug.
>>>
>> If your datacenter or your cloud or your product does not want to
>> tolerate page registration, then don't use RDMA!
>>
>> The bottom line is: RDMA is useless without page registration. Without
>> it, the performance of it will be crippled. If you define that as a bug,
>> then so be it.
>>
>> - Michael
> No one cares if you do page registration or not.  ulimit -l 10g is the
> problem.  You should limit the amount of locked memory.
> Lots of good research went into making RDMA go fast with limited locked
> memory, with some success. Search for "registration cache" for example.
>

Patches using such a cache would be welcome.

- Michael
Michael S. Tsirkin April 14, 2013, 9:10 p.m. UTC | #50
On Sun, Apr 14, 2013 at 03:06:11PM -0400, Michael R. Hines wrote:
> On 04/14/2013 02:30 PM, Michael S. Tsirkin wrote:
> >On Sun, Apr 14, 2013 at 12:40:10PM -0400, Michael R. Hines wrote:
> >>On 04/14/2013 12:03 PM, Michael S. Tsirkin wrote:
> >>>On Sun, Apr 14, 2013 at 10:27:24AM -0400, Michael R. Hines wrote:
> >>>>On 04/14/2013 07:59 AM, Michael S. Tsirkin wrote:
> >>>>>On Fri, Apr 12, 2013 at 04:43:54PM +0200, Paolo Bonzini wrote:
> >>>>>>Il 12/04/2013 13:25, Michael S. Tsirkin ha scritto:
> >>>>>>>On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
> >>>>>>>>Il 12/04/2013 12:48, Michael S. Tsirkin ha scritto:
> >>>>>>>>>1.  You have two protocols already and this does not make sense in
> >>>>>>>>>version 1 of the patch.
> >>>>>>>>It makes sense if we consider it experimental (add x- in front of
> >>>>>>>>transport and capability) and would like people to play with it.
> >>>>>>>>
> >>>>>>>>Paolo
> >>>>>>>But it's not testable yet.  I see problems just reading the
> >>>>>>>documentation.  Author thinks "ulimit -l 10000000000" on both source and
> >>>>>>>destination is just fine.  This can easily crash host or cause OOM
> >>>>>>>killer to kill QEMU.  So why is there any need for extra testers?  Fix
> >>>>>>>the major bugs first.
> >>>>>>>
> >>>>>>>There's a similar issue with device assignment - we can't fix it there,
> >>>>>>>and despite being available for years, this was one of two reasons that
> >>>>>>>has kept this feature out of hands of lots of users (and assuming guest
> >>>>>>>has lots of zero pages won't work: balloon is not widely used either
> >>>>>>>since it depends on a well-behaved guest to work correctly).
> >>>>>>I agree assuming guest has lots of zero pages won't work, but I think
> >>>>>>you are overstating the importance of overcommit.  Let's mark the damn
> >>>>>>thing as experimental, and stop making perfect the enemy of good.
> >>>>>>
> >>>>>>Paolo
> >>>>>It looks like we have to decide, before merging, whether migration with
> >>>>>rdma that breaks overcommit is worth it or not.  Since the author made
> >>>>>it very clear he does not intend to make it work with overcommit, ever.
> >>>>>
> >>>>That depends entirely as what you define as overcommit.
> >>>You don't get to define your own terms.  Look it up in wikipedia or
> >>>something.
> >>>
> >>>>The pages do get unregistered at the end of the migration =)
> >>>>
> >>>>- Michael
> >>>The limitations are pretty clear, and you really should document them:
> >>>
> >>>1. run qemu as root, or under ulimit -l <total guest memory> on both source and
> >>>   destination
> >>>
> >>>2. expect that as much as that amount of memory is pinned
> >>>   and unvailable to host kernel and applications for
> >>>   arbitrarily long time.
> >>>   Make sure you have much more RAM in host or QEMU will get killed.
> >>>
> >>>To me, especially 1 is an unacceptable security tradeoff.
> >>>It is entirely fixable but we both have other priorities,
> >>>so it'll stay broken.
> >>>
> >>I've modified the beginning of docs/rdma.txt to say the following:
> >It really should say this, in a very prominent place:
> >
> >BUGS:
> Not a bug. We'll have to agree to disagree. Please drop this.

It's not a feature, it makes management harder and
will bite some users who are not careful enough
to read documentation and know what to expect.

> >1. You must run qemu as root, or under
> >    ulimit -l <total guest memory> on both source and destination
> Good, will update the documentation now.
> >2. Expect that as much as that amount of memory to be locked
> >    and unvailable to host kernel and applications for
> >    arbitrarily long time.
> >    Make sure you have much more RAM in host otherwise QEMU,
> >    or some other arbitrary application on same host, will get killed.
> This is implied already. The docs say "If you don't want pinning,
> then use TCP".
> That's enough warning.

No it's not. Pinning is jargon, and does not mean locking
up gigabytes.  Why are you using jargon?
Explain the limitation in plain English so people know
when to expect things to work.

> >3. Migration with RDMA support is experimental and unsupported.
> >    In particular, please do not expect it to work across qemu versions,
> >    and do not expect the management interface to be stable.
> 
> The only correct statement here is that it's experimental.
> 
> I will update the docs to reflect that.
> 
> >>$ cat docs/rdma.txt
> >>
> >>... snip ..
> >>
> >>BEFORE RUNNING:
> >>===============
> >>
> >>Use of RDMA requires pinning and registering memory with the
> >>hardware. If this is not acceptable for your application or
> >>product, then the use of RDMA is strongly discouraged and you
> >>should revert back to standard TCP-based migration.
> >No one knows of should know what "pinning and registering" means.
> 
> I will define it in the docs, then.

Keep it simple. Just tell people what they need to know.
It's silly to expect users to understand internals of
the product before they even try it for the first time.

> >For which applications and products is it appropriate?
> 
> That's up to the vendor or user to decide, not us.

With zero information so far, no one will be
able to decide.

> >Also, you are talking about current QEMU
> >code using RDMA for migration but say "RDMA" generally.
> 
> Sure, I will fix the docs.
> 
> >>Next, decide if you want dynamic page registration on the server-side.
> >>For example, if you have an 8GB RAM virtual machine, but only 1GB
> >>is in active use, then disabling this feature will cause all 8GB to
> >>be pinned and resident in memory. This feature mostly affects the
> >>bulk-phase round of the migration and can be disabled for extremely
> >>high-performance RDMA hardware using the following command:
> >>QEMU Monitor Command:
> >>$ migrate_set_capability chunk_register_destination off # enabled by default
> >>
> >>Performing this action will cause all 8GB to be pinned, so if that's
> >>not what you want, then please ignore this step altogether.
> >This does not make it clear what is the benefit of disabling this
> >capability. I think it's best to avoid options, just use chunk
> >based always.
> >If it's here "so people can play with it" then please rename
> >it to something like "x-unsupported-chunk_register_destination"
> >so people know this is unsupported and not to be used for production.
> 
> Again, please drop the request for removing chunking.
> 
> Paolo already told me to use "x-rdma" - so that's enough for now.
> 
> - Michael

You are adding a new command that's also experimental, so you must tag
it explicitly too.
Michael S. Tsirkin April 14, 2013, 9:16 p.m. UTC | #51
On Sun, Apr 14, 2013 at 03:43:28PM -0400, Michael R. Hines wrote:
> On 04/14/2013 02:51 PM, Michael S. Tsirkin wrote:
> >On Sun, Apr 14, 2013 at 10:31:20AM -0400, Michael R. Hines wrote:
> >>On 04/14/2013 04:28 AM, Michael S. Tsirkin wrote:
> >>>On Fri, Apr 12, 2013 at 09:47:08AM -0400, Michael R. Hines wrote:
> >>>>Second, as I've explained, I strongly, strongly disagree with unregistering
> >>>>memory for all of the aforementioned reasons - workloads do not
> >>>>operate in such a manner that they can tolerate memory to be
> >>>>pulled out from underneath them at such fine-grained time scales
> >>>>in the *middle* of a relocation and I will not commit to writing a solution
> >>>>for a problem that doesn't exist.
> >>>Exactly same thing happens with swap, doesn't it?
> >>>You are saying workloads simply can not tolerate swap.
> >>>
> >>>>If you can prove (through some kind of anaylsis) that workloads
> >>>>would benefit from this kind of fine-grained memory overcommit
> >>>>by having cgroups swap out memory to disk underneath them
> >>>>without their permission, I would happily reconsider my position.
> >>>>
> >>>>- Michael
> >>>This has nothing to do with cgroups directly, it's just a way to
> >>>demonstrate you have a bug.
> >>>
> >>If your datacenter or your cloud or your product does not want to
> >>tolerate page registration, then don't use RDMA!
> >>
> >>The bottom line is: RDMA is useless without page registration. Without
> >>it, the performance of it will be crippled. If you define that as a bug,
> >>then so be it.
> >>
> >>- Michael
> >No one cares if you do page registration or not.  ulimit -l 10g is the
> >problem.  You should limit the amount of locked memory.
> >Lots of good research went into making RDMA go fast with limited locked
> >memory, with some success. Search for "registration cache" for example.
> >
> 
> Patches using such a cache would be welcome.
> 
> - Michael
> 

And when someone writes them one day, we'll have to carry the old code
around for interoperability as well. Not pretty.  To avoid that, you
need to explicitly say in the documenation that it's experimental and
unsupported.
mrhines@linux.vnet.ibm.com April 15, 2013, 1:06 a.m. UTC | #52
On 04/14/2013 05:10 PM, Michael S. Tsirkin wrote:
> On Sun, Apr 14, 2013 at 03:06:11PM -0400, Michael R. Hines wrote:
>> On 04/14/2013 02:30 PM, Michael S. Tsirkin wrote:
>>> On Sun, Apr 14, 2013 at 12:40:10PM -0400, Michael R. Hines wrote:
>>>> On 04/14/2013 12:03 PM, Michael S. Tsirkin wrote:
>>>>> On Sun, Apr 14, 2013 at 10:27:24AM -0400, Michael R. Hines wrote:
>>>>>> On 04/14/2013 07:59 AM, Michael S. Tsirkin wrote:
>>>>>>> On Fri, Apr 12, 2013 at 04:43:54PM +0200, Paolo Bonzini wrote:
>>>>>>>> Il 12/04/2013 13:25, Michael S. Tsirkin ha scritto:
>>>>>>>>> On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
>>>>>>>>>> Il 12/04/2013 12:48, Michael S. Tsirkin ha scritto:
>>>>>>>>>>> 1.  You have two protocols already and this does not make sense in
>>>>>>>>>>> version 1 of the patch.
>>>>>>>>>> It makes sense if we consider it experimental (add x- in front of
>>>>>>>>>> transport and capability) and would like people to play with it.
>>>>>>>>>>
>>>>>>>>>> Paolo
>>>>>>>>> But it's not testable yet.  I see problems just reading the
>>>>>>>>> documentation.  Author thinks "ulimit -l 10000000000" on both source and
>>>>>>>>> destination is just fine.  This can easily crash host or cause OOM
>>>>>>>>> killer to kill QEMU.  So why is there any need for extra testers?  Fix
>>>>>>>>> the major bugs first.
>>>>>>>>>
>>>>>>>>> There's a similar issue with device assignment - we can't fix it there,
>>>>>>>>> and despite being available for years, this was one of two reasons that
>>>>>>>>> has kept this feature out of hands of lots of users (and assuming guest
>>>>>>>>> has lots of zero pages won't work: balloon is not widely used either
>>>>>>>>> since it depends on a well-behaved guest to work correctly).
>>>>>>>> I agree assuming guest has lots of zero pages won't work, but I think
>>>>>>>> you are overstating the importance of overcommit.  Let's mark the damn
>>>>>>>> thing as experimental, and stop making perfect the enemy of good.
>>>>>>>>
>>>>>>>> Paolo
>>>>>>> It looks like we have to decide, before merging, whether migration with
>>>>>>> rdma that breaks overcommit is worth it or not.  Since the author made
>>>>>>> it very clear he does not intend to make it work with overcommit, ever.
>>>>>>>
>>>>>> That depends entirely as what you define as overcommit.
>>>>> You don't get to define your own terms.  Look it up in wikipedia or
>>>>> something.
>>>>>
>>>>>> The pages do get unregistered at the end of the migration =)
>>>>>>
>>>>>> - Michael
>>>>> The limitations are pretty clear, and you really should document them:
>>>>>
>>>>> 1. run qemu as root, or under ulimit -l <total guest memory> on both source and
>>>>>    destination
>>>>>
>>>>> 2. expect that as much as that amount of memory is pinned
>>>>>    and unvailable to host kernel and applications for
>>>>>    arbitrarily long time.
>>>>>    Make sure you have much more RAM in host or QEMU will get killed.
>>>>>
>>>>> To me, especially 1 is an unacceptable security tradeoff.
>>>>> It is entirely fixable but we both have other priorities,
>>>>> so it'll stay broken.
>>>>>
>>>> I've modified the beginning of docs/rdma.txt to say the following:
>>> It really should say this, in a very prominent place:
>>>
>>> BUGS:
>> Not a bug. We'll have to agree to disagree. Please drop this.
> It's not a feature, it makes management harder and
> will bite some users who are not careful enough
> to read documentation and know what to expect.

Something that does not exist cannot be a bug. That's called a 
non-existent optimization.

>>> 1. You must run qemu as root, or under
>>>     ulimit -l <total guest memory> on both source and destination
>> Good, will update the documentation now.
>>> 2. Expect that as much as that amount of memory to be locked
>>>     and unvailable to host kernel and applications for
>>>     arbitrarily long time.
>>>     Make sure you have much more RAM in host otherwise QEMU,
>>>     or some other arbitrary application on same host, will get killed.
>> This is implied already. The docs say "If you don't want pinning,
>> then use TCP".
>> That's enough warning.
> No it's not. Pinning is jargon, and does not mean locking
> up gigabytes.  Why are you using jargon?
> Explain the limitation in plain English so people know
> when to expect things to work.

Already done.

>>> 3. Migration with RDMA support is experimental and unsupported.
>>>     In particular, please do not expect it to work across qemu versions,
>>>     and do not expect the management interface to be stable.
>> The only correct statement here is that it's experimental.
>>
>> I will update the docs to reflect that.
>>
>>>> $ cat docs/rdma.txt
>>>>
>>>> ... snip ..
>>>>
>>>> BEFORE RUNNING:
>>>> ===============
>>>>
>>>> Use of RDMA requires pinning and registering memory with the
>>>> hardware. If this is not acceptable for your application or
>>>> product, then the use of RDMA is strongly discouraged and you
>>>> should revert back to standard TCP-based migration.
>>> No one knows of should know what "pinning and registering" means.
>> I will define it in the docs, then.
> Keep it simple. Just tell people what they need to know.
> It's silly to expect users to understand internals of
> the product before they even try it for the first time.

Agreed.

>>> For which applications and products is it appropriate?
>> That's up to the vendor or user to decide, not us.
> With zero information so far, no one will be
> able to decide.

There is plenty of information. Including this email thread.


>>> Also, you are talking about current QEMU
>>> code using RDMA for migration but say "RDMA" generally.
>> Sure, I will fix the docs.
>>
>>>> Next, decide if you want dynamic page registration on the server-side.
>>>> For example, if you have an 8GB RAM virtual machine, but only 1GB
>>>> is in active use, then disabling this feature will cause all 8GB to
>>>> be pinned and resident in memory. This feature mostly affects the
>>>> bulk-phase round of the migration and can be disabled for extremely
>>>> high-performance RDMA hardware using the following command:
>>>> QEMU Monitor Command:
>>>> $ migrate_set_capability chunk_register_destination off # enabled by default
>>>>
>>>> Performing this action will cause all 8GB to be pinned, so if that's
>>>> not what you want, then please ignore this step altogether.
>>> This does not make it clear what is the benefit of disabling this
>>> capability. I think it's best to avoid options, just use chunk
>>> based always.
>>> If it's here "so people can play with it" then please rename
>>> it to something like "x-unsupported-chunk_register_destination"
>>> so people know this is unsupported and not to be used for production.
>> Again, please drop the request for removing chunking.
>>
>> Paolo already told me to use "x-rdma" - so that's enough for now.
>>
>> - Michael
> You are adding a new command that's also experimental, so you must tag
> it explicitly too.

The entire migration is experimental - which by extension makes the 
capability experimental.
mrhines@linux.vnet.ibm.com April 15, 2013, 1:10 a.m. UTC | #53
On 04/14/2013 05:16 PM, Michael S. Tsirkin wrote:
> On Sun, Apr 14, 2013 at 03:43:28PM -0400, Michael R. Hines wrote:
>> On 04/14/2013 02:51 PM, Michael S. Tsirkin wrote:
>>> On Sun, Apr 14, 2013 at 10:31:20AM -0400, Michael R. Hines wrote:
>>>> On 04/14/2013 04:28 AM, Michael S. Tsirkin wrote:
>>>>> On Fri, Apr 12, 2013 at 09:47:08AM -0400, Michael R. Hines wrote:
>>>>>> Second, as I've explained, I strongly, strongly disagree with unregistering
>>>>>> memory for all of the aforementioned reasons - workloads do not
>>>>>> operate in such a manner that they can tolerate memory to be
>>>>>> pulled out from underneath them at such fine-grained time scales
>>>>>> in the *middle* of a relocation and I will not commit to writing a solution
>>>>>> for a problem that doesn't exist.
>>>>> Exactly same thing happens with swap, doesn't it?
>>>>> You are saying workloads simply can not tolerate swap.
>>>>>
>>>>>> If you can prove (through some kind of anaylsis) that workloads
>>>>>> would benefit from this kind of fine-grained memory overcommit
>>>>>> by having cgroups swap out memory to disk underneath them
>>>>>> without their permission, I would happily reconsider my position.
>>>>>>
>>>>>> - Michael
>>>>> This has nothing to do with cgroups directly, it's just a way to
>>>>> demonstrate you have a bug.
>>>>>
>>>> If your datacenter or your cloud or your product does not want to
>>>> tolerate page registration, then don't use RDMA!
>>>>
>>>> The bottom line is: RDMA is useless without page registration. Without
>>>> it, the performance of it will be crippled. If you define that as a bug,
>>>> then so be it.
>>>>
>>>> - Michael
>>> No one cares if you do page registration or not.  ulimit -l 10g is the
>>> problem.  You should limit the amount of locked memory.
>>> Lots of good research went into making RDMA go fast with limited locked
>>> memory, with some success. Search for "registration cache" for example.
>>>
>> Patches using such a cache would be welcome.
>>
>> - Michael
>>
> And when someone writes them one day, we'll have to carry the old code
> around for interoperability as well. Not pretty.  To avoid that, you
> need to explicitly say in the documenation that it's experimental and
> unsupported.
>

That's what protocols are for.

As I've already said, I've incorporated this into the design of the protocol
already.

The protocol already has a field called "repeat" which allows a user to
request multiple chunk registrations at the same time.

If you insist, I can add a capability / command to the protocol called 
"unregister chunk",
but I'm not volunteering to implement that command as I don't have any data
showing it to be of any value.

That would insulate the protocol against any such future "registration 
cache" design.

- Michael
Michael S. Tsirkin April 15, 2013, 6 a.m. UTC | #54
On Sun, Apr 14, 2013 at 09:06:36PM -0400, Michael R. Hines wrote:
> On 04/14/2013 05:10 PM, Michael S. Tsirkin wrote:
> >On Sun, Apr 14, 2013 at 03:06:11PM -0400, Michael R. Hines wrote:
> >>On 04/14/2013 02:30 PM, Michael S. Tsirkin wrote:
> >>>On Sun, Apr 14, 2013 at 12:40:10PM -0400, Michael R. Hines wrote:
> >>>>On 04/14/2013 12:03 PM, Michael S. Tsirkin wrote:
> >>>>>On Sun, Apr 14, 2013 at 10:27:24AM -0400, Michael R. Hines wrote:
> >>>>>>On 04/14/2013 07:59 AM, Michael S. Tsirkin wrote:
> >>>>>>>On Fri, Apr 12, 2013 at 04:43:54PM +0200, Paolo Bonzini wrote:
> >>>>>>>>Il 12/04/2013 13:25, Michael S. Tsirkin ha scritto:
> >>>>>>>>>On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
> >>>>>>>>>>Il 12/04/2013 12:48, Michael S. Tsirkin ha scritto:
> >>>>>>>>>>>1.  You have two protocols already and this does not make sense in
> >>>>>>>>>>>version 1 of the patch.
> >>>>>>>>>>It makes sense if we consider it experimental (add x- in front of
> >>>>>>>>>>transport and capability) and would like people to play with it.
> >>>>>>>>>>
> >>>>>>>>>>Paolo
> >>>>>>>>>But it's not testable yet.  I see problems just reading the
> >>>>>>>>>documentation.  Author thinks "ulimit -l 10000000000" on both source and
> >>>>>>>>>destination is just fine.  This can easily crash host or cause OOM
> >>>>>>>>>killer to kill QEMU.  So why is there any need for extra testers?  Fix
> >>>>>>>>>the major bugs first.
> >>>>>>>>>
> >>>>>>>>>There's a similar issue with device assignment - we can't fix it there,
> >>>>>>>>>and despite being available for years, this was one of two reasons that
> >>>>>>>>>has kept this feature out of hands of lots of users (and assuming guest
> >>>>>>>>>has lots of zero pages won't work: balloon is not widely used either
> >>>>>>>>>since it depends on a well-behaved guest to work correctly).
> >>>>>>>>I agree assuming guest has lots of zero pages won't work, but I think
> >>>>>>>>you are overstating the importance of overcommit.  Let's mark the damn
> >>>>>>>>thing as experimental, and stop making perfect the enemy of good.
> >>>>>>>>
> >>>>>>>>Paolo
> >>>>>>>It looks like we have to decide, before merging, whether migration with
> >>>>>>>rdma that breaks overcommit is worth it or not.  Since the author made
> >>>>>>>it very clear he does not intend to make it work with overcommit, ever.
> >>>>>>>
> >>>>>>That depends entirely as what you define as overcommit.
> >>>>>You don't get to define your own terms.  Look it up in wikipedia or
> >>>>>something.
> >>>>>
> >>>>>>The pages do get unregistered at the end of the migration =)
> >>>>>>
> >>>>>>- Michael
> >>>>>The limitations are pretty clear, and you really should document them:
> >>>>>
> >>>>>1. run qemu as root, or under ulimit -l <total guest memory> on both source and
> >>>>>   destination
> >>>>>
> >>>>>2. expect that as much as that amount of memory is pinned
> >>>>>   and unvailable to host kernel and applications for
> >>>>>   arbitrarily long time.
> >>>>>   Make sure you have much more RAM in host or QEMU will get killed.
> >>>>>
> >>>>>To me, especially 1 is an unacceptable security tradeoff.
> >>>>>It is entirely fixable but we both have other priorities,
> >>>>>so it'll stay broken.
> >>>>>
> >>>>I've modified the beginning of docs/rdma.txt to say the following:
> >>>It really should say this, in a very prominent place:
> >>>
> >>>BUGS:
> >>Not a bug. We'll have to agree to disagree. Please drop this.
> >It's not a feature, it makes management harder and
> >will bite some users who are not careful enough
> >to read documentation and know what to expect.
> 
> Something that does not exist cannot be a bug. That's called a
> non-existent optimization.

No because overcommit already exists, and works with migration.  It's
your patch that breaks it.  We already have a ton of migration variants
and they all work fine.  So in 2013 overcommit is a given.

Look we can include code with known bugs, but we have to be very
explicit about them, because someone *will* be confused.  If it's a hard
bug to fix it won't get solved quickly but please stop pretending it's
perfect.



> >>>1. You must run qemu as root, or under
> >>>    ulimit -l <total guest memory> on both source and destination
> >>Good, will update the documentation now.
> >>>2. Expect that as much as that amount of memory to be locked
> >>>    and unvailable to host kernel and applications for
> >>>    arbitrarily long time.
> >>>    Make sure you have much more RAM in host otherwise QEMU,
> >>>    or some other arbitrary application on same host, will get killed.
> >>This is implied already. The docs say "If you don't want pinning,
> >>then use TCP".
> >>That's enough warning.
> >No it's not. Pinning is jargon, and does not mean locking
> >up gigabytes.  Why are you using jargon?
> >Explain the limitation in plain English so people know
> >when to expect things to work.
> 
> Already done.
> 
> >>>3. Migration with RDMA support is experimental and unsupported.
> >>>    In particular, please do not expect it to work across qemu versions,
> >>>    and do not expect the management interface to be stable.
> >>The only correct statement here is that it's experimental.
> >>
> >>I will update the docs to reflect that.
> >>
> >>>>$ cat docs/rdma.txt
> >>>>
> >>>>... snip ..
> >>>>
> >>>>BEFORE RUNNING:
> >>>>===============
> >>>>
> >>>>Use of RDMA requires pinning and registering memory with the
> >>>>hardware. If this is not acceptable for your application or
> >>>>product, then the use of RDMA is strongly discouraged and you
> >>>>should revert back to standard TCP-based migration.
> >>>No one knows of should know what "pinning and registering" means.
> >>I will define it in the docs, then.
> >Keep it simple. Just tell people what they need to know.
> >It's silly to expect users to understand internals of
> >the product before they even try it for the first time.
> 
> Agreed.
> 
> >>>For which applications and products is it appropriate?
> >>That's up to the vendor or user to decide, not us.
> >With zero information so far, no one will be
> >able to decide.
> 
> There is plenty of information. Including this email thread.

Nowhere in this email thread or in your patchset did you tell anyone for
which applications and products is it appropriate.  You also expect
someone to answer this question before they run your code.  It looks
like the purpose of this phrase is to assign blame rather than to
inform.

> 
> >>>Also, you are talking about current QEMU
> >>>code using RDMA for migration but say "RDMA" generally.
> >>Sure, I will fix the docs.
> >>
> >>>>Next, decide if you want dynamic page registration on the server-side.
> >>>>For example, if you have an 8GB RAM virtual machine, but only 1GB
> >>>>is in active use, then disabling this feature will cause all 8GB to
> >>>>be pinned and resident in memory. This feature mostly affects the
> >>>>bulk-phase round of the migration and can be disabled for extremely
> >>>>high-performance RDMA hardware using the following command:
> >>>>QEMU Monitor Command:
> >>>>$ migrate_set_capability chunk_register_destination off # enabled by default
> >>>>
> >>>>Performing this action will cause all 8GB to be pinned, so if that's
> >>>>not what you want, then please ignore this step altogether.
> >>>This does not make it clear what is the benefit of disabling this
> >>>capability. I think it's best to avoid options, just use chunk
> >>>based always.
> >>>If it's here "so people can play with it" then please rename
> >>>it to something like "x-unsupported-chunk_register_destination"
> >>>so people know this is unsupported and not to be used for production.
> >>Again, please drop the request for removing chunking.
> >>
> >>Paolo already told me to use "x-rdma" - so that's enough for now.
> >>
> >>- Michael
> >You are adding a new command that's also experimental, so you must tag
> >it explicitly too.
> 
> The entire migration is experimental - which by extension makes the
> capability experimental.

Again the purpose of documentation is not to educate people about
qemu or rdma internals but to educate them how to use a feature.
It doesn't even mention rdma anywhere in the name of the capability.
Users won't make the connection.  You also didn't bother telling anyone
when to set the option.  Is it here "to be able to play with it"?  Does
it have any purpose for users not in a playful mood?  If yes your
documentation should say what it is, if no mention that.
Michael S. Tsirkin April 15, 2013, 6:10 a.m. UTC | #55
On Sun, Apr 14, 2013 at 09:10:36PM -0400, Michael R. Hines wrote:
> On 04/14/2013 05:16 PM, Michael S. Tsirkin wrote:
> >On Sun, Apr 14, 2013 at 03:43:28PM -0400, Michael R. Hines wrote:
> >>On 04/14/2013 02:51 PM, Michael S. Tsirkin wrote:
> >>>On Sun, Apr 14, 2013 at 10:31:20AM -0400, Michael R. Hines wrote:
> >>>>On 04/14/2013 04:28 AM, Michael S. Tsirkin wrote:
> >>>>>On Fri, Apr 12, 2013 at 09:47:08AM -0400, Michael R. Hines wrote:
> >>>>>>Second, as I've explained, I strongly, strongly disagree with unregistering
> >>>>>>memory for all of the aforementioned reasons - workloads do not
> >>>>>>operate in such a manner that they can tolerate memory to be
> >>>>>>pulled out from underneath them at such fine-grained time scales
> >>>>>>in the *middle* of a relocation and I will not commit to writing a solution
> >>>>>>for a problem that doesn't exist.
> >>>>>Exactly same thing happens with swap, doesn't it?
> >>>>>You are saying workloads simply can not tolerate swap.
> >>>>>
> >>>>>>If you can prove (through some kind of anaylsis) that workloads
> >>>>>>would benefit from this kind of fine-grained memory overcommit
> >>>>>>by having cgroups swap out memory to disk underneath them
> >>>>>>without their permission, I would happily reconsider my position.
> >>>>>>
> >>>>>>- Michael
> >>>>>This has nothing to do with cgroups directly, it's just a way to
> >>>>>demonstrate you have a bug.
> >>>>>
> >>>>If your datacenter or your cloud or your product does not want to
> >>>>tolerate page registration, then don't use RDMA!
> >>>>
> >>>>The bottom line is: RDMA is useless without page registration. Without
> >>>>it, the performance of it will be crippled. If you define that as a bug,
> >>>>then so be it.
> >>>>
> >>>>- Michael
> >>>No one cares if you do page registration or not.  ulimit -l 10g is the
> >>>problem.  You should limit the amount of locked memory.
> >>>Lots of good research went into making RDMA go fast with limited locked
> >>>memory, with some success. Search for "registration cache" for example.
> >>>
> >>Patches using such a cache would be welcome.
> >>
> >>- Michael
> >>
> >And when someone writes them one day, we'll have to carry the old code
> >around for interoperability as well. Not pretty.  To avoid that, you
> >need to explicitly say in the documenation that it's experimental and
> >unsupported.
> >
> 
> That's what protocols are for.
> 
> As I've already said, I've incorporated this into the design of the protocol
> already.
> 
> The protocol already has a field called "repeat" which allows a user to
> request multiple chunk registrations at the same time.
> If you insist, I can add a capability / command to the protocol
> called "unregister chunk",
> but I'm not volunteering to implement that command as I don't have any data
> showing it to be of any value.

The value would be being able to run your code in qemu as unpriveledged
user.

> That would insulate the protocol against any such future
> "registration cache" design.
> 
> - Michael
>

It won't.  If it's unimplemented it won't be of any use since now your
code does not implement the protocol fully.
Paolo Bonzini April 15, 2013, 8:26 a.m. UTC | #56
Il 14/04/2013 21:06, Michael R. Hines ha scritto:
> 
>> 3. Migration with RDMA support is experimental and unsupported.
>>     In particular, please do not expect it to work across qemu versions,
>>     and do not expect the management interface to be stable.
>>     
> 
> The only correct statement here is that it's experimental.

Actually no, this is correct.  The capabilities are experimental too,
the "x-rdma" will become "rdma" in the future, and we are free to modify
the protocol.

Will it happen?  Perhaps not.  But for the moment, that is the
situation.  The alternative is not merging, and it is a much worse
alternative IMHO.

Paolo
Paolo Bonzini April 15, 2013, 8:28 a.m. UTC | #57
Il 15/04/2013 03:06, Michael R. Hines ha scritto:
>>>>> Next, decide if you want dynamic page registration on the server-side.
>>>>> For example, if you have an 8GB RAM virtual machine, but only 1GB
>>>>> is in active use, then disabling this feature will cause all 8GB to
>>>>> be pinned and resident in memory. This feature mostly affects the
>>>>> bulk-phase round of the migration and can be disabled for extremely
>>>>> high-performance RDMA hardware using the following command:
>>>>> QEMU Monitor Command:
>>>>> $ migrate_set_capability chunk_register_destination off # enabled
>>>>> by default
>>>>>
>>>>> Performing this action will cause all 8GB to be pinned, so if that's
>>>>> not what you want, then please ignore this step altogether.
>>>> This does not make it clear what is the benefit of disabling this
>>>> capability. I think it's best to avoid options, just use chunk
>>>> based always.
>>>> If it's here "so people can play with it" then please rename
>>>> it to something like "x-unsupported-chunk_register_destination"
>>>> so people know this is unsupported and not to be used for production.
>>> Again, please drop the request for removing chunking.
>>>
>>> Paolo already told me to use "x-rdma" - so that's enough for now.
>>
>> You are adding a new command that's also experimental, so you must tag
>> it explicitly too.
> 
> The entire migration is experimental - which by extension makes the
> capability experimental.

You still have to mark it as "x-".  Of course not "x-unsupported-", that
is a pleonasm.

Paolo
Paolo Bonzini April 15, 2013, 8:34 a.m. UTC | #58
Il 15/04/2013 03:10, Michael R. Hines ha scritto:
>>>
>> And when someone writes them one day, we'll have to carry the old code
>> around for interoperability as well. Not pretty.  To avoid that, you
>> need to explicitly say in the documenation that it's experimental and
>> unsupported.
>>
> 
> That's what protocols are for.
> 
> As I've already said, I've incorporated this into the design of the
> protocol
> already.
> 
> The protocol already has a field called "repeat" which allows a user to
> request multiple chunk registrations at the same time.
> 
> If you insist, I can add a capability / command to the protocol called
> "unregister chunk",
> but I'm not volunteering to implement that command as I don't have any data
> showing it to be of any value.

Implementing it on the destination side would be of value because it
would make the implementation interoperable.

A very basic implementation would be "during the bulk phase, unregister
the previous chunk every time you register a chunk".  It would work
great when migrating an idle guest, for example.  It would probably be
faster than TCP (which is now at 4.2 Gbps).

On one hand this should not block merging the patches; on the other
hand, "agreeing to disagree" without having done any test is not very
fruitful.  You can disagree on the priorities (and I agree with you on
this), but what mst is proposing is absolutely reasonable.

Paolo

> That would insulate the protocol against any such future "registration
> cache" design.
mrhines@linux.vnet.ibm.com April 15, 2013, 1:07 p.m. UTC | #59
On 04/15/2013 02:00 AM, Michael S. Tsirkin wrote:
> On Sun, Apr 14, 2013 at 09:06:36PM -0400, Michael R. Hines wrote:
>> On 04/14/2013 05:10 PM, Michael S. Tsirkin wrote:
>>> On Sun, Apr 14, 2013 at 03:06:11PM -0400, Michael R. Hines wrote:
>>>> On 04/14/2013 02:30 PM, Michael S. Tsirkin wrote:
>>>>> On Sun, Apr 14, 2013 at 12:40:10PM -0400, Michael R. Hines wrote:
>>>>>> On 04/14/2013 12:03 PM, Michael S. Tsirkin wrote:
>>>>>>> On Sun, Apr 14, 2013 at 10:27:24AM -0400, Michael R. Hines wrote:
>>>>>>>> On 04/14/2013 07:59 AM, Michael S. Tsirkin wrote:
>>>>>>>>> On Fri, Apr 12, 2013 at 04:43:54PM +0200, Paolo Bonzini wrote:
>>>>>>>>>> Il 12/04/2013 13:25, Michael S. Tsirkin ha scritto:
>>>>>>>>>>> On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
>>>>>>>>>>>> Il 12/04/2013 12:48, Michael S. Tsirkin ha scritto:
>>>>>>>>>>>>> 1.  You have two protocols already and this does not make sense in
>>>>>>>>>>>>> version 1 of the patch.
>>>>>>>>>>>> It makes sense if we consider it experimental (add x- in front of
>>>>>>>>>>>> transport and capability) and would like people to play with it.
>>>>>>>>>>>>
>>>>>>>>>>>> Paolo
>>>>>>>>>>> But it's not testable yet.  I see problems just reading the
>>>>>>>>>>> documentation.  Author thinks "ulimit -l 10000000000" on both source and
>>>>>>>>>>> destination is just fine.  This can easily crash host or cause OOM
>>>>>>>>>>> killer to kill QEMU.  So why is there any need for extra testers?  Fix
>>>>>>>>>>> the major bugs first.
>>>>>>>>>>>
>>>>>>>>>>> There's a similar issue with device assignment - we can't fix it there,
>>>>>>>>>>> and despite being available for years, this was one of two reasons that
>>>>>>>>>>> has kept this feature out of hands of lots of users (and assuming guest
>>>>>>>>>>> has lots of zero pages won't work: balloon is not widely used either
>>>>>>>>>>> since it depends on a well-behaved guest to work correctly).
>>>>>>>>>> I agree assuming guest has lots of zero pages won't work, but I think
>>>>>>>>>> you are overstating the importance of overcommit.  Let's mark the damn
>>>>>>>>>> thing as experimental, and stop making perfect the enemy of good.
>>>>>>>>>>
>>>>>>>>>> Paolo
>>>>>>>>> It looks like we have to decide, before merging, whether migration with
>>>>>>>>> rdma that breaks overcommit is worth it or not.  Since the author made
>>>>>>>>> it very clear he does not intend to make it work with overcommit, ever.
>>>>>>>>>
>>>>>>>> That depends entirely as what you define as overcommit.
>>>>>>> You don't get to define your own terms.  Look it up in wikipedia or
>>>>>>> something.
>>>>>>>
>>>>>>>> The pages do get unregistered at the end of the migration =)
>>>>>>>>
>>>>>>>> - Michael
>>>>>>> The limitations are pretty clear, and you really should document them:
>>>>>>>
>>>>>>> 1. run qemu as root, or under ulimit -l <total guest memory> on both source and
>>>>>>>    destination
>>>>>>>
>>>>>>> 2. expect that as much as that amount of memory is pinned
>>>>>>>    and unvailable to host kernel and applications for
>>>>>>>    arbitrarily long time.
>>>>>>>    Make sure you have much more RAM in host or QEMU will get killed.
>>>>>>>
>>>>>>> To me, especially 1 is an unacceptable security tradeoff.
>>>>>>> It is entirely fixable but we both have other priorities,
>>>>>>> so it'll stay broken.
>>>>>>>
>>>>>> I've modified the beginning of docs/rdma.txt to say the following:
>>>>> It really should say this, in a very prominent place:
>>>>>
>>>>> BUGS:
>>>> Not a bug. We'll have to agree to disagree. Please drop this.
>>> It's not a feature, it makes management harder and
>>> will bite some users who are not careful enough
>>> to read documentation and know what to expect.
>> Something that does not exist cannot be a bug. That's called a
>> non-existent optimization.
> No because overcommit already exists, and works with migration.  It's
> your patch that breaks it.  We already have a ton of migration variants
> and they all work fine.  So in 2013 overcommit is a given.
>
> Look we can include code with known bugs, but we have to be very
> explicit about them, because someone *will* be confused.  If it's a hard
> bug to fix it won't get solved quickly but please stop pretending it's
> perfect.
>

Setting aside RDMA for the moment, Are you trying to tell me that 
someone would
*willingly* migrate a VM to a hypervisor without first validating 
(programmatically)
whether or not the machine already has enough memory to support the entire
footprint of the VM?

If you answer yes to that question is yes, it's a bug.

That also means *any* use of RDMA in any application in the universe is also
a bug and it also means that any HPC application running against cgroups is
also buggy.

I categorically refuse to believe that someone runs a datacenter in this 
manner.

>
>>>>> 1. You must run qemu as root, or under
>>>>>     ulimit -l <total guest memory> on both source and destination
>>>> Good, will update the documentation now.
>>>>> 2. Expect that as much as that amount of memory to be locked
>>>>>     and unvailable to host kernel and applications for
>>>>>     arbitrarily long time.
>>>>>     Make sure you have much more RAM in host otherwise QEMU,
>>>>>     or some other arbitrary application on same host, will get killed.
>>>> This is implied already. The docs say "If you don't want pinning,
>>>> then use TCP".
>>>> That's enough warning.
>>> No it's not. Pinning is jargon, and does not mean locking
>>> up gigabytes.  Why are you using jargon?
>>> Explain the limitation in plain English so people know
>>> when to expect things to work.
>> Already done.
>>
>>>>> 3. Migration with RDMA support is experimental and unsupported.
>>>>>     In particular, please do not expect it to work across qemu versions,
>>>>>     and do not expect the management interface to be stable.
>>>> The only correct statement here is that it's experimental.
>>>>
>>>> I will update the docs to reflect that.
>>>>
>>>>>> $ cat docs/rdma.txt
>>>>>>
>>>>>> ... snip ..
>>>>>>
>>>>>> BEFORE RUNNING:
>>>>>> ===============
>>>>>>
>>>>>> Use of RDMA requires pinning and registering memory with the
>>>>>> hardware. If this is not acceptable for your application or
>>>>>> product, then the use of RDMA is strongly discouraged and you
>>>>>> should revert back to standard TCP-based migration.
>>>>> No one knows of should know what "pinning and registering" means.
>>>> I will define it in the docs, then.
>>> Keep it simple. Just tell people what they need to know.
>>> It's silly to expect users to understand internals of
>>> the product before they even try it for the first time.
>> Agreed.
>>
>>>>> For which applications and products is it appropriate?
>>>> That's up to the vendor or user to decide, not us.
>>> With zero information so far, no one will be
>>> able to decide.
>> There is plenty of information. Including this email thread.
> Nowhere in this email thread or in your patchset did you tell anyone for
> which applications and products is it appropriate.  You also expect
> someone to answer this question before they run your code.  It looks
> like the purpose of this phrase is to assign blame rather than to
> inform.
>>>>> Also, you are talking about current QEMU
>>>>> code using RDMA for migration but say "RDMA" generally.
>>>> Sure, I will fix the docs.
>>>>
>>>>>> Next, decide if you want dynamic page registration on the server-side.
>>>>>> For example, if you have an 8GB RAM virtual machine, but only 1GB
>>>>>> is in active use, then disabling this feature will cause all 8GB to
>>>>>> be pinned and resident in memory. This feature mostly affects the
>>>>>> bulk-phase round of the migration and can be disabled for extremely
>>>>>> high-performance RDMA hardware using the following command:
>>>>>> QEMU Monitor Command:
>>>>>> $ migrate_set_capability chunk_register_destination off # enabled by default
>>>>>>
>>>>>> Performing this action will cause all 8GB to be pinned, so if that's
>>>>>> not what you want, then please ignore this step altogether.
>>>>> This does not make it clear what is the benefit of disabling this
>>>>> capability. I think it's best to avoid options, just use chunk
>>>>> based always.
>>>>> If it's here "so people can play with it" then please rename
>>>>> it to something like "x-unsupported-chunk_register_destination"
>>>>> so people know this is unsupported and not to be used for production.
>>>> Again, please drop the request for removing chunking.
>>>>
>>>> Paolo already told me to use "x-rdma" - so that's enough for now.
>>>>
>>>> - Michael
>>> You are adding a new command that's also experimental, so you must tag
>>> it explicitly too.
>> The entire migration is experimental - which by extension makes the
>> capability experimental.
> Again the purpose of documentation is not to educate people about
> qemu or rdma internals but to educate them how to use a feature.
> It doesn't even mention rdma anywhere in the name of the capability.
> Users won't make the connection.  You also didn't bother telling anyone
> when to set the option.  Is it here "to be able to play with it"?  Does
> it have any purpose for users not in a playful mood?  If yes your
> documentation should say what it is, if no mention that.
>

The purpose of the capability is made blatantly clear in the documentation.

- Michael
mrhines@linux.vnet.ibm.com April 15, 2013, 1:08 p.m. UTC | #60
On 04/15/2013 04:28 AM, Paolo Bonzini wrote:
> Il 15/04/2013 03:06, Michael R. Hines ha scritto:
>>>>>> Next, decide if you want dynamic page registration on the server-side.
>>>>>> For example, if you have an 8GB RAM virtual machine, but only 1GB
>>>>>> is in active use, then disabling this feature will cause all 8GB to
>>>>>> be pinned and resident in memory. This feature mostly affects the
>>>>>> bulk-phase round of the migration and can be disabled for extremely
>>>>>> high-performance RDMA hardware using the following command:
>>>>>> QEMU Monitor Command:
>>>>>> $ migrate_set_capability chunk_register_destination off # enabled
>>>>>> by default
>>>>>>
>>>>>> Performing this action will cause all 8GB to be pinned, so if that's
>>>>>> not what you want, then please ignore this step altogether.
>>>>> This does not make it clear what is the benefit of disabling this
>>>>> capability. I think it's best to avoid options, just use chunk
>>>>> based always.
>>>>> If it's here "so people can play with it" then please rename
>>>>> it to something like "x-unsupported-chunk_register_destination"
>>>>> so people know this is unsupported and not to be used for production.
>>>> Again, please drop the request for removing chunking.
>>>>
>>>> Paolo already told me to use "x-rdma" - so that's enough for now.
>>> You are adding a new command that's also experimental, so you must tag
>>> it explicitly too.
>> The entire migration is experimental - which by extension makes the
>> capability experimental.
> You still have to mark it as "x-".  Of course not "x-unsupported-", that
> is a pleonasm.
>
> Paolo
>

Sure, I'm happy add another 'x'. I will submit a patch with all the new
changes as soon as the pull completes.

- Michael
mrhines@linux.vnet.ibm.com April 15, 2013, 1:24 p.m. UTC | #61
On 04/15/2013 04:34 AM, Paolo Bonzini wrote:
> Il 15/04/2013 03:10, Michael R. Hines ha scritto:
>>> And when someone writes them one day, we'll have to carry the old code
>>> around for interoperability as well. Not pretty.  To avoid that, you
>>> need to explicitly say in the documenation that it's experimental and
>>> unsupported.
>>>
>> That's what protocols are for.
>>
>> As I've already said, I've incorporated this into the design of the
>> protocol
>> already.
>>
>> The protocol already has a field called "repeat" which allows a user to
>> request multiple chunk registrations at the same time.
>>
>> If you insist, I can add a capability / command to the protocol called
>> "unregister chunk",
>> but I'm not volunteering to implement that command as I don't have any data
>> showing it to be of any value.
> Implementing it on the destination side would be of value because it
> would make the implementation interoperable.
>
> A very basic implementation would be "during the bulk phase, unregister
> the previous chunk every time you register a chunk".  It would work
> great when migrating an idle guest, for example.  It would probably be
> faster than TCP (which is now at 4.2 Gbps).
>
> On one hand this should not block merging the patches; on the other
> hand, "agreeing to disagree" without having done any test is not very
> fruitful.  You can disagree on the priorities (and I agree with you on
> this), but what mst is proposing is absolutely reasonable.
>
> Paolo

Ok, I think I understand the disconnect here: So, let's continue to use
the above example that you described and let me ask another question.

Let's say the above mentioned idle VM is chosen, for whatever reason,
*not* to use TCP migration, and instead use RDMA. (I recommend against
choosing RDMA in the current docs, but let's stick to this example for
the sake of argument).

Now, in this example, let's say the migration starts up and the hypervisor
has run out of physical memory and starts swapping during the migration.
(also for the sake of argument).

The next thing that would immediately happen is the
next IB verbs function call: "ib_reg_mr()".

This function call would probably fail because there's nothing else left 
to pin
and the function call would return an error.

So my question is: Is it not sufficient to send a message back to the 
primary-VM
side of the connection which says:

"Your migration cannot proceed anymore, please resume the VM and try 
again somewhere else".

In this case, both the system administrator and the virtual machine are 
safe,
nothing has been killed, nothing has crashed, and the management software
can proceed to make a new management decision.

Is there something wrong with this sequence of events?

- Michael
Paolo Bonzini April 15, 2013, 1:30 p.m. UTC | #62
Il 15/04/2013 15:24, Michael R. Hines ha scritto:
> Now, in this example, let's say the migration starts up and the hypervisor
> has run out of physical memory and starts swapping during the migration.
> (also for the sake of argument).
> 
> The next thing that would immediately happen is the
> next IB verbs function call: "ib_reg_mr()".
> 
> This function call would probably fail because there's nothing else left
> to pin and the function call would return an error.
> 
> So my question is: Is it not sufficient to send a message back to the
> primary-VM side of the connection which says:
> 
> "Your migration cannot proceed anymore, please resume the VM and try
> again somewhere else".
> 
> In this case, both the system administrator and the virtual machine are safe,
> nothing has been killed, nothing has crashed, and the management software
> can proceed to make a new management decision.
> 
> Is there something wrong with this sequence of events?

I think it's good enough.  "info migrate" will then report that
migration failed.

Paolo
mrhines@linux.vnet.ibm.com April 15, 2013, 7:55 p.m. UTC | #63
On 04/15/2013 09:30 AM, Paolo Bonzini wrote:
> Il 15/04/2013 15:24, Michael R. Hines ha scritto:
>> Now, in this example, let's say the migration starts up and the hypervisor
>> has run out of physical memory and starts swapping during the migration.
>> (also for the sake of argument).
>>
>> The next thing that would immediately happen is the
>> next IB verbs function call: "ib_reg_mr()".
>>
>> This function call would probably fail because there's nothing else left
>> to pin and the function call would return an error.
>>
>> So my question is: Is it not sufficient to send a message back to the
>> primary-VM side of the connection which says:
>>
>> "Your migration cannot proceed anymore, please resume the VM and try
>> again somewhere else".
>>
>> In this case, both the system administrator and the virtual machine are safe,
>> nothing has been killed, nothing has crashed, and the management software
>> can proceed to make a new management decision.
>>
>> Is there something wrong with this sequence of events?
> I think it's good enough.  "info migrate" will then report that
> migration failed.
>
> Paolo
>

Ok, that's good. So the current patch "[PATCH v2] rdma" is not handling this
particular error condition properly, so that's a bug.

I'll send out a trivial patch to fix this after the pull along with all 
the other
documentation updates we have discussed.

- Michael
Michael S. Tsirkin April 15, 2013, 10:20 p.m. UTC | #64
On Mon, Apr 15, 2013 at 09:07:01AM -0400, Michael R. Hines wrote:
> On 04/15/2013 02:00 AM, Michael S. Tsirkin wrote:
> >On Sun, Apr 14, 2013 at 09:06:36PM -0400, Michael R. Hines wrote:
> >>On 04/14/2013 05:10 PM, Michael S. Tsirkin wrote:
> >>>On Sun, Apr 14, 2013 at 03:06:11PM -0400, Michael R. Hines wrote:
> >>>>On 04/14/2013 02:30 PM, Michael S. Tsirkin wrote:
> >>>>>On Sun, Apr 14, 2013 at 12:40:10PM -0400, Michael R. Hines wrote:
> >>>>>>On 04/14/2013 12:03 PM, Michael S. Tsirkin wrote:
> >>>>>>>On Sun, Apr 14, 2013 at 10:27:24AM -0400, Michael R. Hines wrote:
> >>>>>>>>On 04/14/2013 07:59 AM, Michael S. Tsirkin wrote:
> >>>>>>>>>On Fri, Apr 12, 2013 at 04:43:54PM +0200, Paolo Bonzini wrote:
> >>>>>>>>>>Il 12/04/2013 13:25, Michael S. Tsirkin ha scritto:
> >>>>>>>>>>>On Fri, Apr 12, 2013 at 12:53:11PM +0200, Paolo Bonzini wrote:
> >>>>>>>>>>>>Il 12/04/2013 12:48, Michael S. Tsirkin ha scritto:
> >>>>>>>>>>>>>1.  You have two protocols already and this does not make sense in
> >>>>>>>>>>>>>version 1 of the patch.
> >>>>>>>>>>>>It makes sense if we consider it experimental (add x- in front of
> >>>>>>>>>>>>transport and capability) and would like people to play with it.
> >>>>>>>>>>>>
> >>>>>>>>>>>>Paolo
> >>>>>>>>>>>But it's not testable yet.  I see problems just reading the
> >>>>>>>>>>>documentation.  Author thinks "ulimit -l 10000000000" on both source and
> >>>>>>>>>>>destination is just fine.  This can easily crash host or cause OOM
> >>>>>>>>>>>killer to kill QEMU.  So why is there any need for extra testers?  Fix
> >>>>>>>>>>>the major bugs first.
> >>>>>>>>>>>
> >>>>>>>>>>>There's a similar issue with device assignment - we can't fix it there,
> >>>>>>>>>>>and despite being available for years, this was one of two reasons that
> >>>>>>>>>>>has kept this feature out of hands of lots of users (and assuming guest
> >>>>>>>>>>>has lots of zero pages won't work: balloon is not widely used either
> >>>>>>>>>>>since it depends on a well-behaved guest to work correctly).
> >>>>>>>>>>I agree assuming guest has lots of zero pages won't work, but I think
> >>>>>>>>>>you are overstating the importance of overcommit.  Let's mark the damn
> >>>>>>>>>>thing as experimental, and stop making perfect the enemy of good.
> >>>>>>>>>>
> >>>>>>>>>>Paolo
> >>>>>>>>>It looks like we have to decide, before merging, whether migration with
> >>>>>>>>>rdma that breaks overcommit is worth it or not.  Since the author made
> >>>>>>>>>it very clear he does not intend to make it work with overcommit, ever.
> >>>>>>>>>
> >>>>>>>>That depends entirely as what you define as overcommit.
> >>>>>>>You don't get to define your own terms.  Look it up in wikipedia or
> >>>>>>>something.
> >>>>>>>
> >>>>>>>>The pages do get unregistered at the end of the migration =)
> >>>>>>>>
> >>>>>>>>- Michael
> >>>>>>>The limitations are pretty clear, and you really should document them:
> >>>>>>>
> >>>>>>>1. run qemu as root, or under ulimit -l <total guest memory> on both source and
> >>>>>>>   destination
> >>>>>>>
> >>>>>>>2. expect that as much as that amount of memory is pinned
> >>>>>>>   and unvailable to host kernel and applications for
> >>>>>>>   arbitrarily long time.
> >>>>>>>   Make sure you have much more RAM in host or QEMU will get killed.
> >>>>>>>
> >>>>>>>To me, especially 1 is an unacceptable security tradeoff.
> >>>>>>>It is entirely fixable but we both have other priorities,
> >>>>>>>so it'll stay broken.
> >>>>>>>
> >>>>>>I've modified the beginning of docs/rdma.txt to say the following:
> >>>>>It really should say this, in a very prominent place:
> >>>>>
> >>>>>BUGS:
> >>>>Not a bug. We'll have to agree to disagree. Please drop this.
> >>>It's not a feature, it makes management harder and
> >>>will bite some users who are not careful enough
> >>>to read documentation and know what to expect.
> >>Something that does not exist cannot be a bug. That's called a
> >>non-existent optimization.
> >No because overcommit already exists, and works with migration.  It's
> >your patch that breaks it.  We already have a ton of migration variants
> >and they all work fine.  So in 2013 overcommit is a given.
> >
> >Look we can include code with known bugs, but we have to be very
> >explicit about them, because someone *will* be confused.  If it's a hard
> >bug to fix it won't get solved quickly but please stop pretending it's
> >perfect.
> >
> 
> Setting aside RDMA for the moment, Are you trying to tell me that
> someone would
> *willingly* migrate a VM to a hypervisor without first validating
> (programmatically)
> whether or not the machine already has enough memory to support the entire
> footprint of the VM?
> 
> If you answer yes to that question is yes, it's a bug.

enough virtual memory. not physical memory.

> That also means *any* use of RDMA in any application in the universe is also
> a bug and it also means that any HPC application running against cgroups is
> also buggy.

no, people don't normally lock up gigabytes of memory in HPC either.

> I categorically refuse to believe that someone runs a datacenter in
> this manner.

so you don't believe people overcommit memory.

> >
> >>>>>1. You must run qemu as root, or under
> >>>>>    ulimit -l <total guest memory> on both source and destination
> >>>>Good, will update the documentation now.
> >>>>>2. Expect that as much as that amount of memory to be locked
> >>>>>    and unvailable to host kernel and applications for
> >>>>>    arbitrarily long time.
> >>>>>    Make sure you have much more RAM in host otherwise QEMU,
> >>>>>    or some other arbitrary application on same host, will get killed.
> >>>>This is implied already. The docs say "If you don't want pinning,
> >>>>then use TCP".
> >>>>That's enough warning.
> >>>No it's not. Pinning is jargon, and does not mean locking
> >>>up gigabytes.  Why are you using jargon?
> >>>Explain the limitation in plain English so people know
> >>>when to expect things to work.
> >>Already done.
> >>
> >>>>>3. Migration with RDMA support is experimental and unsupported.
> >>>>>    In particular, please do not expect it to work across qemu versions,
> >>>>>    and do not expect the management interface to be stable.
> >>>>The only correct statement here is that it's experimental.
> >>>>
> >>>>I will update the docs to reflect that.
> >>>>
> >>>>>>$ cat docs/rdma.txt
> >>>>>>
> >>>>>>... snip ..
> >>>>>>
> >>>>>>BEFORE RUNNING:
> >>>>>>===============
> >>>>>>
> >>>>>>Use of RDMA requires pinning and registering memory with the
> >>>>>>hardware. If this is not acceptable for your application or
> >>>>>>product, then the use of RDMA is strongly discouraged and you
> >>>>>>should revert back to standard TCP-based migration.
> >>>>>No one knows of should know what "pinning and registering" means.
> >>>>I will define it in the docs, then.
> >>>Keep it simple. Just tell people what they need to know.
> >>>It's silly to expect users to understand internals of
> >>>the product before they even try it for the first time.
> >>Agreed.
> >>
> >>>>>For which applications and products is it appropriate?
> >>>>That's up to the vendor or user to decide, not us.
> >>>With zero information so far, no one will be
> >>>able to decide.
> >>There is plenty of information. Including this email thread.
> >Nowhere in this email thread or in your patchset did you tell anyone for
> >which applications and products is it appropriate.  You also expect
> >someone to answer this question before they run your code.  It looks
> >like the purpose of this phrase is to assign blame rather than to
> >inform.
> >>>>>Also, you are talking about current QEMU
> >>>>>code using RDMA for migration but say "RDMA" generally.
> >>>>Sure, I will fix the docs.
> >>>>
> >>>>>>Next, decide if you want dynamic page registration on the server-side.
> >>>>>>For example, if you have an 8GB RAM virtual machine, but only 1GB
> >>>>>>is in active use, then disabling this feature will cause all 8GB to
> >>>>>>be pinned and resident in memory. This feature mostly affects the
> >>>>>>bulk-phase round of the migration and can be disabled for extremely
> >>>>>>high-performance RDMA hardware using the following command:
> >>>>>>QEMU Monitor Command:
> >>>>>>$ migrate_set_capability chunk_register_destination off # enabled by default
> >>>>>>
> >>>>>>Performing this action will cause all 8GB to be pinned, so if that's
> >>>>>>not what you want, then please ignore this step altogether.
> >>>>>This does not make it clear what is the benefit of disabling this
> >>>>>capability. I think it's best to avoid options, just use chunk
> >>>>>based always.
> >>>>>If it's here "so people can play with it" then please rename
> >>>>>it to something like "x-unsupported-chunk_register_destination"
> >>>>>so people know this is unsupported and not to be used for production.
> >>>>Again, please drop the request for removing chunking.
> >>>>
> >>>>Paolo already told me to use "x-rdma" - so that's enough for now.
> >>>>
> >>>>- Michael
> >>>You are adding a new command that's also experimental, so you must tag
> >>>it explicitly too.
> >>The entire migration is experimental - which by extension makes the
> >>capability experimental.
> >Again the purpose of documentation is not to educate people about
> >qemu or rdma internals but to educate them how to use a feature.
> >It doesn't even mention rdma anywhere in the name of the capability.
> >Users won't make the connection.  You also didn't bother telling anyone
> >when to set the option.  Is it here "to be able to play with it"?  Does
> >it have any purpose for users not in a playful mood?  If yes your
> >documentation should say what it is, if no mention that.
> >
> 
> The purpose of the capability is made blatantly clear in the documentation.
> 
> - Michael

I know it's not clear to me.

You ask people to decide whether to use it or not basically first
or second thing. So tell them, in plain English, then and there,
what they need to know in order to decide. Not ten pages
down in the middle of a description of qemu internals.
Or just drop one of the variants - how much speed difference
is there between them?
diff mbox

Patch

diff --git a/docs/rdma.txt b/docs/rdma.txt
new file mode 100644
index 0000000..e9fa4cd
--- /dev/null
+++ b/docs/rdma.txt
@@ -0,0 +1,313 @@ 
+Several changes since v4:
+
+- Created a "formal" protocol for the RDMA control channel
+- Dynamic, chunked page registration now implemented on *both* the server and client
+- Created new 'capability' for page registration
+- Created new 'capability' for is_zero_page() - enabled by default
+  (needed to test dynamic page registration)
+- Created version-check before protocol begins at connection-time 
+- no more migrate_use_rdma() !
+
+NOTE: While dynamic registration works on both sides now,
+      it does *not* work with cgroups swap limits. This functionality with infiniband
+      remains broken. (It works fine with TCP). So, in order to take full 
+      advantage of this feature, a fix will have to be developed on the kernel side.
+      Alternative proposed is use /dev/<pid>/pagemap. Patch will be submitted.
+
+Contents:
+=================================
+* Compiling
+* Running (please readme before running)
+* RDMA Protocol Description
+* Versioning
+* QEMUFileRDMA Interface
+* Migration of pc.ram
+* Error handling
+* TODO
+* Performance
+
+COMPILING:
+===============================
+
+$ ./configure --enable-rdma --target-list=x86_64-softmmu
+$ make
+
+RUNNING:
+===============================
+
+First, decide if you want dynamic page registration on the server-side.
+This always happens on the primary-VM side, but is optional on the server.
+Doing this allows you to support overcommit (such as cgroups or ballooning)
+with a smaller footprint on the server-side without having to register the
+entire VM memory footprint. 
+NOTE: This significantly slows down performance (about 30% slower).
+
+$ virsh qemu-monitor-command --hmp \
+    --cmd "migrate_set_capability chunk_register_destination on" # disabled by default
+
+Next, if you decided *not* to use chunked registration on the server,
+it is recommended to also disable zero page detection. While this is not
+strictly necessary, zero page detection also significantly slows down
+performance on higher-throughput links (by about 50%), like 40 gbps infiniband cards:
+
+$ virsh qemu-monitor-command --hmp \
+    --cmd "migrate_set_capability check_for_zero off" # always enabled by default
+
+Finally, set the migration speed to match your hardware's capabilities:
+
+$ virsh qemu-monitor-command --hmp \
+    --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
+
+Finally, perform the actual migration:
+
+$ virsh migrate domain rdma:xx.xx.xx.xx:port
+
+RDMA Protocol Description:
+=================================
+
+Migration with RDMA is separated into two parts:
+
+1. The transmission of the pages using RDMA
+2. Everything else (a control channel is introduced)
+
+"Everything else" is transmitted using a formal 
+protocol now, consisting of infiniband SEND / RECV messages.
+
+An infiniband SEND message is the standard ibverbs
+message used by applications of infiniband hardware.
+The only difference between a SEND message and an RDMA
+message is that SEND message cause completion notifications
+to be posted to the completion queue (CQ) on the 
+infiniband receiver side, whereas RDMA messages (used
+for pc.ram) do not (to behave like an actual DMA).
+    
+Messages in infiniband require two things:
+
+1. registration of the memory that will be transmitted
+2. (SEND/RECV only) work requests to be posted on both
+   sides of the network before the actual transmission
+   can occur.
+
+RDMA messages much easier to deal with. Once the memory
+on the receiver side is registered and pinned, we're
+basically done. All that is required is for the sender
+side to start dumping bytes onto the link.
+
+SEND messages require more coordination because the
+receiver must have reserved space (using a receive
+work request) on the receive queue (RQ) before QEMUFileRDMA
+can start using them to carry all the bytes as
+a transport for migration of device state.
+
+To begin the migration, the initial connection setup is
+as follows (migration-rdma.c):
+
+1. Receiver and Sender are started (command line or libvirt):
+2. Both sides post two RQ work requests
+3. Receiver does listen()
+4. Sender does connect()
+5. Receiver accept()
+6. Check versioning and capabilities (described later)
+
+At this point, we define a control channel on top of SEND messages
+which is described by a formal protocol. Each SEND message has a 
+header portion and a data portion (but together are transmitted 
+as a single SEND message).
+
+Header:
+    * Length  (of the data portion)
+    * Type    (what command to perform, described below)
+    * Version (protocol version validated before send/recv occurs)
+
+The 'type' field has 7 different command values:
+    1. None
+    2. Ready             (control-channel is available) 
+    3. QEMU File         (for sending non-live device state) 
+    4. RAM Blocks        (used right after connection setup)
+    5. Register request  (dynamic chunk registration) 
+    6. Register result   ('rkey' to be used by sender)
+    7. Register finished (registration for current iteration finished)
+
+After connection setup is completed, we have two protocol-level
+functions, responsible for communicating control-channel commands
+using the above list of values: 
+
+Logically:
+
+qemu_rdma_exchange_recv(header, expected command type)
+
+1. We transmit a READY command to let the sender know that 
+   we are *ready* to receive some data bytes on the control channel.
+2. Before attempting to receive the expected command, we post another
+   RQ work request to replace the one we just used up.
+3. Block on a CQ event channel and wait for the SEND to arrive.
+4. When the send arrives, librdmacm will unblock us.
+5. Verify that the command-type and version received matches the one we expected.
+
+qemu_rdma_exchange_send(header, data, optional response header & data): 
+
+1. Block on the CQ event channel waiting for a READY command
+   from the receiver to tell us that the receiver
+   is *ready* for us to transmit some new bytes.
+2. Optionally: if we are expecting a response from the command
+   (that we have no yet transmitted), let's post an RQ
+   work request to receive that data a few moments later. 
+3. When the READY arrives, librdmacm will 
+   unblock us and we immediately post a RQ work request
+   to replace the one we just used up.
+4. Now, we can actually post the work request to SEND
+   the requested command type of the header we were asked for.
+5. Optionally, if we are expecting a response (as before),
+   we block again and wait for that response using the additional
+   work request we previously posted. (This is used to carry
+   'Register result' commands #6 back to the sender which
+   hold the rkey need to perform RDMA.
+
+All of the remaining command types (not including 'ready')
+described above all use the aformentioned two functions to do the hard work:
+
+1. After connection setup, RAMBlock information is exchanged using
+   this protocol before the actual migration begins.
+2. During runtime, once a 'chunk' becomes full of pages ready to
+   be sent with RDMA, the registration commands are used to ask the
+   other side to register the memory for this chunk and respond
+   with the result (rkey) of the registration.
+3. Also, the QEMUFile interfaces also call these functions (described below)
+   when transmitting non-live state, such as devices or to send
+   its own protocol information during the migration process.
+
+Versioning
+==================================
+
+librdmacm provides the user with a 'private data' area to be exchanged
+at connection-setup time before any infiniband traffic is generated.
+
+This is a convenient place to check for protocol versioning because the
+user does not need to register memory to transmit a few bytes of version
+information.
+
+This is also a convenient place to negotiate capabilities
+(like dynamic page registration).
+
+If the version is invalid, we throw an error.
+
+If the version is new, we only negotiate the capabilities that the
+requested version is able to perform and ignore the rest.
+
+QEMUFileRDMA Interface:
+==================================
+
+QEMUFileRDMA introduces a couple of new functions:
+
+1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
+2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
+
+These two functions are very short and simply used the protocol
+describe above to deliver bytes without changing the upper-level
+users of QEMUFile that depend on a bytstream abstraction.
+
+Finally, how do we handoff the actual bytes to get_buffer()?
+
+Again, because we're trying to "fake" a bytestream abstraction
+using an analogy not unlike individual UDP frames, we have
+to hold on to the bytes received from control-channel's SEND 
+messages in memory.
+
+Each time we receive a complete "QEMU File" control-channel 
+message, the bytes from SEND are copied into a small local holding area.
+
+Then, we return the number of bytes requested by get_buffer()
+and leave the remaining bytes in the holding area until get_buffer()
+comes around for another pass.
+
+If the buffer is empty, then we follow the same steps
+listed above and issue another "QEMU File" protocol command,
+asking for a new SEND message to re-fill the buffer.
+
+Migration of pc.ram:
+===============================
+
+At the beginning of the migration, (migration-rdma.c),
+the sender and the receiver populate the list of RAMBlocks
+to be registered with each other into a structure.
+Then, using the aforementioned protocol, they exchange a
+description of these blocks with each other, to be used later 
+during the iteration of main memory. This description includes
+a list of all the RAMBlocks, their offsets and lengths and
+possibly includes pre-registered RDMA keys in case dynamic
+page registration was disabled on the server-side, otherwise not.
+
+Main memory is not migrated with the aforementioned protocol, 
+but is instead migrated with normal RDMA Write operations.
+
+Pages are migrated in "chunks" (about 1 Megabyte right now).
+Chunk size is not dynamic, but it could be in a future implementation.
+There's nothing to indicate that this is useful right now.
+
+When a chunk is full (or a flush() occurs), the memory backed by 
+the chunk is registered with librdmacm and pinned in memory on 
+both sides using the aforementioned protocol.
+
+After pinning, an RDMA Write is generated and tramsmitted
+for the entire chunk.
+
+Chunks are also transmitted in batches: This means that we
+do not request that the hardware signal the completion queue
+for the completion of *every* chunk. The current batch size
+is about 64 chunks (corresponding to 64 MB of memory).
+Only the last chunk in a batch must be signaled.
+This helps keep everything as asynchronous as possible
+and helps keep the hardware busy performing RDMA operations.
+
+Error-handling:
+===============================
+
+Infiniband has what is called a "Reliable, Connected"
+link (one of 4 choices). This is the mode in which
+we use for RDMA migration.
+
+If a *single* message fails,
+the decision is to abort the migration entirely and
+cleanup all the RDMA descriptors and unregister all
+the memory.
+
+After cleanup, the Virtual Machine is returned to normal
+operation the same way that would happen if the TCP
+socket is broken during a non-RDMA based migration.
+
+TODO:
+=================================
+1. Currently, cgroups swap limits for *both* TCP and RDMA
+   on the sender-side is broken. This is more poignant for
+   RDMA because RDMA requires memory registration.
+   Fixing this requires infiniband page registrations to be
+   zero-page aware, and this does not yet work properly.
+2. Currently overcommit for the the *receiver* side of
+   TCP works, but not for RDMA. While dynamic page registration
+   *does* work, it is only useful if the is_zero_page() capability
+   is remained enabled (which it is by default).
+   However, leaving this capability turned on *significantly* slows
+   down the RDMA throughput, particularly on hardware capable
+   of transmitting faster than 10 gbps (such as 40gbps links).
+3. Use of the recent /dev/<pid>/pagemap would likely solve some
+   of these problems.
+4. Also, some form of balloon-device usage tracking would also
+   help aleviate some of these issues.
+
+PERFORMANCE
+===================
+
+Using a 40gbps infinband link performing a worst-case stress test:
+
+RDMA Throughput With $ stress --vm-bytes 1024M --vm 1 --vm-keep
+Approximately 30 gpbs (little better than the paper)
+1. Average worst-case throughput 
+TCP Throughput With $ stress --vm-bytes 1024M --vm 1 --vm-keep
+2. Approximately 8 gpbs (using IPOIB IP over Infiniband)
+
+Average downtime (stop time) ranges between 28 and 33 milliseconds.
+
+An *exhaustive* paper (2010) shows additional performance details
+linked on the QEMU wiki:
+
+http://wiki.qemu.org/Features/RDMALiveMigration