[RFC,RDMA,support,v1:,12/13] updated protocol documentation

Message ID 1365632901-15470-13-git-send-email-mrhines@linux.vnet.ibm.com
State New

Commit Message

mrhines@linux.vnet.ibm.com April 10, 2013, 10:28 p.m. UTC
From: "Michael R. Hines" <mrhines@us.ibm.com>

Full documentation on the rdma protocol: docs/rdma.txt

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 docs/rdma.txt |  331 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 331 insertions(+)
 create mode 100644 docs/rdma.txt

Comments

Eric Blake April 11, 2013, 2:43 a.m. UTC | #1
On 04/10/2013 04:28 PM, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> Full documentation on the rdma protocol: docs/rdma.txt
> 
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>  docs/rdma.txt |  331 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 331 insertions(+)
>  create mode 100644 docs/rdma.txt
> 
> diff --git a/docs/rdma.txt b/docs/rdma.txt
> new file mode 100644
> index 0000000..ae68d2f
> --- /dev/null
> +++ b/docs/rdma.txt
> @@ -0,0 +1,331 @@
> +Changes since v6:
> +
> +(Thanks, Paolo - things look much cleaner now.)
> +
> +- Try to get patch-ordering correct =)
> +- Much cleaner use of QEMUFileOps
> +- Much fewer header files changes
> +- Convert zero check capability to QMP command instead
> +- Updated documentation

The above text probably shouldn't be in the file.

> +
> +Wiki: http://wiki.qemu.org/Features/RDMALiveMigration
> +Github: git@github.com:hinesmr/qemu.git
> +Contact: Michael R. Hines, mrhines@us.ibm.com

Missing a copyright statement, but that's just following the example of
other docs, so I guess it's okay?

> +
> +RDMA Live Migration Specification, Version # 1
> +
> +Contents:
> +=================================
> +* Running
> +* RDMA Protocol Description
> +* Versioning and Capabilities
> +* QEMUFileRDMA Interface
> +* Migration of pc.ram
> +* Error handling
> +* TODO
> +* Performance
> +

No high-level overview of what the acronym RDMA even stands for?

> +RUNNING:
> +===============================
> +
> +First, decide if you want dynamic page registration on the server-side.
> +This always happens on the primary-VM side, but is optional on the server.
> +Doing this allows you to support overcommit (such as cgroups or ballooning)
> +with a smaller footprint on the server-side without having to register the
> +entire VM memory footprint. 
> +NOTE: This significantly slows down RDMA throughput (about 30% slower).
> +
> +$ virsh qemu-monitor-command --hmp \
> +    --cmd "migrate_set_capability chunk_register_destination off" # enabled by default

'virsh qemu-monitor-command' is documented as unsupported by libvirt
(it's intended solely as a development/debugging aid); but I guess until
libvirt learns to expose RDMA support by default, this is okay for a
first cut of documentation.  Furthermore, you are missing a domain argument.

Do you really want to be requiring the user to do everything through
libvirt?  This is qemu documentation, so you should document how things
work without needing libvirt in the picture.

> +
> +Next, if you decided *not* to use chunked registration on the server,
> +it is recommended to also disable zero page detection. While this is not
> +strictly necessary, zero page detection also significantly slows down
> +throughput on higher-performance links (by about 50%), like 40 gbps infiniband cards:
> +
> +$ virsh qemu-monitor-command --hmp \
> +    --cmd "migrate_check_for_zero off" # enabled by default

Missing a domain argument.

> +
> +Finally, set the migration speed to match your hardware's capabilities:
> +
> +$ virsh qemu-monitor-command --hmp \
> +    --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device

This modifies qemu state behind libvirt's back, and won't necessarily do
what you want if libvirt tries to change things back to the speed it
thought it was managing.  Instead, use 'virsh migrate-setspeed $dom 40'.

> +
> +Finally, perform the actual migration:
> +
> +$ virsh migrate domain rdma:xx.xx.xx.xx:port

That's not quite valid syntax for 'virsh migrate'.  Again, do you really
want to be documenting libvirt's interface, or qemu's interface?

> +
> +RDMA Protocol Description:
> +=================================

Aesthetics: match the length of === to the line above it.

<snip> I'm not reviewing technical content, just face value...

> +
> +These two functions are very short and simply used the protocol
> +describe above to deliver bytes without changing the upper-level
> +users of QEMUFile that depend on a bytstream abstraction.

s/bytstream/bytestream/

...
> +
> +After pinning, an RDMA Write is generated and tramsmitted
> +for the entire chunk.

s/tramsmitted/transmitted/

> +5. Also, some form of balloon-device usage tracking would also
> +   help aleviate some of these issues.

s/aleviate/alleviate/

> +
> +PERFORMANCE
> +===================
> +
> +Using a 40gbps infinband link performing a worst-case stress test:

s/infinband/infiniband/

> +
> +RDMA Throughput With $ stress --vm-bytes 1024M --vm 1 --vm-keep
> +Approximately 30 gpbs (little better than the paper)

which paper? Call that out in your high-level summary

...
> +
> +An *exhaustive* paper (2010) shows additional performance details
> +linked on the QEMU wiki:

Missing the actual reference?  And it would help to mention it at the
beginning of the file.
mrhines@linux.vnet.ibm.com April 11, 2013, 2:47 a.m. UTC | #2
Great comments, thanks.

Paolo Bonzini April 11, 2013, 6:29 a.m. UTC | #3
Il 11/04/2013 00:28, mrhines@linux.vnet.ibm.com ha scritto:
> +
> +(Thanks, Paolo - things look much cleaner now.)

I agree!  No need to immortalize me in the docs, though. :)

Paolo

Patch

diff --git a/docs/rdma.txt b/docs/rdma.txt
new file mode 100644
index 0000000..ae68d2f
--- /dev/null
+++ b/docs/rdma.txt
@@ -0,0 +1,331 @@ 
+Changes since v6:
+
+(Thanks, Paolo - things look much cleaner now.)
+
+- Try to get patch-ordering correct =)
+- Much cleaner use of QEMUFileOps
+- Much fewer header files changes
+- Convert zero check capability to QMP command instead
+- Updated documentation
+
+Wiki: http://wiki.qemu.org/Features/RDMALiveMigration
+Github: git@github.com:hinesmr/qemu.git
+Contact: Michael R. Hines, mrhines@us.ibm.com
+
+RDMA Live Migration Specification, Version # 1
+
+Contents:
+=================================
+* Running
+* RDMA Protocol Description
+* Versioning and Capabilities
+* QEMUFileRDMA Interface
+* Migration of pc.ram
+* Error handling
+* TODO
+* Performance
+
+RUNNING:
+===============================
+
+First, decide if you want dynamic page registration on the server-side.
+This always happens on the primary-VM side, but is optional on the server.
+Doing this allows you to support overcommit (such as cgroups or ballooning)
+with a smaller footprint on the server-side without having to register the
+entire VM memory footprint. 
+NOTE: This significantly slows down RDMA throughput (about 30% slower).
+
+$ virsh qemu-monitor-command domain --hmp \
+    --cmd "migrate_set_capability chunk_register_destination off" # enabled by default
+
+Next, if you decided *not* to use chunked registration on the server,
+it is recommended to also disable zero page detection. While this is not
+strictly necessary, zero page detection also significantly slows down
+throughput on higher-performance links (by about 50%), like 40 gbps infiniband cards:
+
+$ virsh qemu-monitor-command domain --hmp \
+    --cmd "migrate_check_for_zero off" # enabled by default
+
+Then, set the migration speed to match your hardware's capabilities:
+
+$ virsh qemu-monitor-command domain --hmp \
+    --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
+
+Finally, perform the actual migration:
+
+$ virsh migrate domain rdma:xx.xx.xx.xx:port
+
+RDMA Protocol Description:
+=================================
+
+Migration with RDMA is separated into two parts:
+
+1. The transmission of the pages using RDMA
+2. Everything else (a control channel is introduced)
+
+"Everything else" is transmitted using a formal 
+protocol now, consisting of infiniband SEND / RECV messages.
+
+An infiniband SEND message is the standard ibverbs
+message used by applications of infiniband hardware.
+The only difference between a SEND message and an RDMA
+message is that SEND messages cause completion notifications
+to be posted to the completion queue (CQ) on the 
+infiniband receiver side, whereas RDMA messages (used
+for pc.ram) do not (to behave like an actual DMA).
+    
+Messages in infiniband require two things:
+
+1. registration of the memory that will be transmitted
+2. (SEND/RECV only) work requests to be posted on both
+   sides of the network before the actual transmission
+   can occur.
+
+RDMA messages are much easier to deal with. Once the memory
+on the receiver side is registered and pinned, we're
+basically done. All that is required is for the sender
+side to start dumping bytes onto the link.
+
+(Memory is not released from pinning until the migration
+completes, given that RDMA migrations are very fast.)
+
+SEND messages require more coordination because the
+receiver must have reserved space (using a receive
+work request) on the receive queue (RQ) before QEMUFileRDMA
+can start using them to carry all the bytes as
+a transport for migration of device state.
+
+To begin the migration, the initial connection setup is
+as follows (migration-rdma.c):
+
+1. Receiver and Sender are started (command line or libvirt)
+2. Both sides post two RQ work requests
+3. Receiver does listen()
+4. Sender does connect()
+5. Receiver does accept()
+6. Check versioning and capabilities (described later)
+
+At this point, we define a control channel on top of SEND messages
+which is described by a formal protocol. Each SEND message has a 
+header portion and a data portion (but together are transmitted 
+as a single SEND message).
+
+Header:
+    * Length  (of the data portion, uint32, network byte order)
+    * Type    (what command to perform, uint32, network byte order)
+    * Version (protocol version validated before send/recv occurs, uint32, network byte order)
+
+The 'type' field has 7 different command values:
+    1. None
+    2. Ready             (control-channel is available) 
+    3. QEMU File         (for sending non-live device state) 
+    4. RAM Blocks        (used right after connection setup)
+    5. Register request  (dynamic chunk registration) 
+    6. Register result   ('rkey' to be used by sender)
+    7. Register finished (registration for current iteration finished)
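
The header layout and command list above can be sketched as follows. This is a minimal illustration, not the code from migration-rdma.c: the names HEADER, Command, pack_message and unpack_message are hypothetical, and the numeric command values are assumed (counted from 0 in list order), since the text only gives the list positions.

```python
import struct
from enum import IntEnum

# Three uint32 fields in network byte order: length, type, version.
HEADER = struct.Struct("!III")

PROTOCOL_VERSION = 1


class Command(IntEnum):
    # Assumption: values count from 0 in the order listed in the document.
    NONE = 0
    READY = 1
    QEMU_FILE = 2
    RAM_BLOCKS = 3
    REGISTER_REQUEST = 4
    REGISTER_RESULT = 5
    REGISTER_FINISHED = 6


def pack_message(command, data=b""):
    """Build one SEND payload: the header followed by the data portion."""
    return HEADER.pack(len(data), int(command), PROTOCOL_VERSION) + data


def unpack_message(payload):
    """Split a SEND payload into (command, version, data)."""
    length, type_, version = HEADER.unpack_from(payload)
    data = payload[HEADER.size:]
    if len(data) != length:
        raise ValueError("length field does not match the data portion")
    return Command(type_), version, data
```

Header and data travel together as one buffer here, mirroring the statement that both portions are transmitted as a single SEND message.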
+
+After connection setup is completed, we have two protocol-level
+functions, responsible for communicating control-channel commands
+using the above list of values: 
+
+Logically:
+
+qemu_rdma_exchange_recv(header, expected command type)
+
+1. We transmit a READY command to let the sender know that 
+   we are *ready* to receive some data bytes on the control channel.
+2. Before attempting to receive the expected command, we post another
+   RQ work request to replace the one we just used up.
+3. Block on a CQ event channel and wait for the SEND to arrive.
+4. When the send arrives, librdmacm will unblock us.
+5. Verify that the command-type and version received match the ones we expected.
+
+qemu_rdma_exchange_send(header, data, optional response header & data): 
+
+1. Block on the CQ event channel waiting for a READY command
+   from the receiver to tell us that the receiver
+   is *ready* for us to transmit some new bytes.
+2. Optionally: if we are expecting a response from the command
+   (that we have not yet transmitted), let's post an RQ
+   work request to receive that data a few moments later. 
+3. When the READY arrives, librdmacm will 
+   unblock us and we immediately post a RQ work request
+   to replace the one we just used up.
+4. Now, we can actually post the work request to SEND
+   the requested command type of the header we were asked for.
+5. Optionally, if we are expecting a response (as before),
+   we block again and wait for that response using the additional
+   work request we previously posted. (This is used to carry
+   'Register result' commands #6 back to the sender, which
+   hold the rkey needed to perform RDMA.)
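
The READY handshake in the two step lists above can be modelled with a toy, single-threaded sketch. Everything here is hypothetical (Endpoint, connect, and the split of the receive side into a "begin" and a "finish" half); a deque stands in for the posted receive work requests and the completion queue, and nothing blocks, so the message must already be queued when it is waited for.

```python
from collections import deque


class Endpoint:
    """One side of the control channel (toy model, no real verbs calls)."""

    def __init__(self, name):
        self.name = name
        self.inbox = deque()   # stands in for RQ work requests plus the CQ
        self.peer = None
        self.log = []

    def send(self, command, data=b""):
        self.log.append(("send", command))
        self.peer.inbox.append((command, data))

    def wait_for(self, expected):
        # The real code blocks on the CQ event channel at this point.
        command, data = self.inbox.popleft()
        if command != expected:
            raise RuntimeError(f"{self.name}: expected {expected}, got {command}")
        return data


def connect(a, b):
    a.peer, b.peer = b, a


def exchange_recv_begin(receiver):
    # Receive side, step 1: announce that we are ready for a command.
    receiver.send("READY")


def exchange_send(sender, command, data):
    # Send side: wait for READY first, then SEND the command.
    sender.wait_for("READY")
    sender.send(command, data)


def exchange_recv_finish(receiver, expected):
    # Receive side, steps 3 to 5: take the SEND and verify its type.
    return receiver.wait_for(expected)
```

A usage sketch: connect two endpoints, call exchange_recv_begin() on the destination, exchange_send() on the source, then exchange_recv_finish() on the destination to obtain the data portion.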
+
+All of the remaining command types (not including 'ready')
+described above use the aforementioned two functions to do the hard work:
+
+1. After connection setup, RAMBlock information is exchanged using
+   this protocol before the actual migration begins. This information includes
+   a description of each RAMBlock on the server side as well as the virtual addresses
+   and lengths of each RAMBlock. This is used by the client to determine the
+   start and stop locations of chunks and how to register them dynamically
+   before performing the RDMA operations.
+2. During runtime, once a 'chunk' becomes full of pages ready to
+   be sent with RDMA, the registration commands are used to ask the
+   other side to register the memory for this chunk and respond
+   with the result (rkey) of the registration.
+3. The QEMUFile interfaces also call these functions (described below)
+   when transmitting non-live state, such as devices or to send
+   its own protocol information during the migration process.
+
+Versioning and Capabilities
+==================================
+The current version of the protocol is version #1, both for protocol
+traffic and capabilities negotiation (i.e., there is only one version
+number that is referred to by all communication).
+
+librdmacm provides the user with a 'private data' area to be exchanged
+at connection-setup time before any infiniband traffic is generated.
+
+Header:
+    * Version (protocol version validated before send/recv occurs), uint32, network byte order
+    * Flags   (bitwise OR of each capability), uint32, network byte order
+
+There is no data portion of this header right now, so there is
+no length field. The maximum size of the 'private data' section
+is only 192 bytes per the Infiniband specification, so it's not
+very useful for data anyway. This structure needs to remain small.
+
+This private data area is a convenient place to check for protocol 
+versioning because the user does not need to register memory to 
+transmit a few bytes of version information.
+
+This is also a convenient place to negotiate capabilities
+(like dynamic page registration).
+
+If the version is invalid, we throw an error.
+
+If the version is new, we only negotiate the capabilities that the
+requested version is able to perform and ignore the rest.
+
+Currently there is only *one* capability in Version #1: dynamic page registration
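
One possible reading of the capability exchange above, as a short sketch. The bit value chosen for dynamic page registration and the names CAP_HEADER, KNOWN_CAPS, pack_caps and negotiate are assumptions for illustration; only the two uint32 fields, the network byte order and the 192-byte limit come from the text.

```python
import struct

CAP_HEADER = struct.Struct("!II")    # version, flags: uint32, network byte order
PRIVATE_DATA_MAX = 192               # limit quoted in the text above
CAP_DYNAMIC_REGISTRATION = 0x01      # assumed bit value

# Capabilities each known protocol version is able to perform.
KNOWN_CAPS = {1: CAP_DYNAMIC_REGISTRATION}


def pack_caps(version, flags):
    """Build the blob carried in the 'private data' area."""
    blob = CAP_HEADER.pack(version, flags)
    if len(blob) > PRIVATE_DATA_MAX:
        raise ValueError("private data too large")
    return blob


def negotiate(blob):
    """Return (version, accepted flags); unknown capability bits are ignored."""
    version, flags = CAP_HEADER.unpack(blob)
    if version not in KNOWN_CAPS:
        raise ValueError(f"unsupported protocol version {version}")
    return version, flags & KNOWN_CAPS[version]
```

Masking with the known set is what "ignore the rest" amounts to in this sketch: a peer that requests extra bits still gets a usable connection with only the capabilities both sides understand.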
+
+QEMUFileRDMA Interface:
+==================================
+
+QEMUFileRDMA introduces a couple of new functions:
+
+1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
+2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
+
+These two functions are very short and simply use the protocol
+described above to deliver bytes without changing the upper-level
+users of QEMUFile that depend on a bytestream abstraction.
+
+Finally, how do we hand off the actual bytes to get_buffer()?
+
+Again, because we're trying to "fake" a bytestream abstraction
+using an analogy not unlike individual UDP frames, we have
+to hold on to the bytes received from control-channel's SEND 
+messages in memory.
+
+Each time we receive a complete "QEMU File" control-channel 
+message, the bytes from SEND are copied into a small local holding area.
+
+Then, we return the number of bytes requested by get_buffer()
+and leave the remaining bytes in the holding area until get_buffer()
+comes around for another pass.
+
+If the buffer is empty, then we follow the same steps
+listed above and issue another "QEMU File" protocol command,
+asking for a new SEND message to re-fill the buffer.
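
The holding-area behaviour described above can be sketched like this. HoldingArea and fetch_message are hypothetical names; fetch_message stands in for issuing a "QEMU File" protocol command and returning the data portion of the SEND that arrives. The sketch assumes a short read is acceptable when the holding area has fewer bytes than requested.

```python
class HoldingArea:
    """Turns whole control-channel messages into a byte stream."""

    def __init__(self, fetch_message):
        # fetch_message: hypothetical callable returning the bytes of the
        # next complete "QEMU File" message.
        self._fetch = fetch_message
        self._buf = b""

    def get_buffer(self, size):
        # Refill only when the holding area is empty.
        if not self._buf:
            self._buf = self._fetch()
        # Hand back up to `size` bytes and keep the remainder for later.
        out, self._buf = self._buf[:size], self._buf[size:]
        return out
```

With messages b"abcdef" and b"gh", three calls to get_buffer(4) return b"abcd", b"ef" and b"gh": the second call drains the leftover bytes before a new message is requested.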
+
+Migration of pc.ram:
+===============================
+
+At the beginning of the migration (migration-rdma.c),
+the sender and the receiver populate the list of RAMBlocks
+to be registered with each other into a structure.
+Then, using the aforementioned protocol, they exchange a
+description of these blocks with each other, to be used later 
+during the iteration of main memory. This description includes
+a list of all the RAMBlocks, their offsets and lengths and
+possibly includes pre-registered RDMA keys in case dynamic
+page registration was disabled on the server-side, otherwise not.
+
+Main memory is not migrated with the aforementioned protocol, 
+but is instead migrated with normal RDMA Write operations.
+
+Pages are migrated in "chunks" (hard-coded to 1 Megabyte right now).
+Chunk size is not dynamic, but it could be in a future implementation.
+There's nothing to indicate that this is useful right now.
+
+When a chunk is full (or a flush() occurs), the memory backed by 
+the chunk is registered with librdmacm and pinned in memory on 
+both sides using the aforementioned protocol.
+
+After pinning, an RDMA Write is generated and transmitted
+for the entire chunk.
+
+Chunks are also transmitted in batches: This means that we
+do not request that the hardware signal the completion queue
+for the completion of *every* chunk. The current batch size
+is about 64 chunks (corresponding to 64 MB of memory).
+Only the last chunk in a batch must be signaled.
+This helps keep everything as asynchronous as possible
+and helps keep the hardware busy performing RDMA operations.
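
The chunk and batch arithmetic above, as a small sketch. The constants follow the text (1 MB chunks, batches of 64); treating the batch size as exactly 64 is an assumption, since the text says "about 64", and the function names are hypothetical.

```python
CHUNK_SIZE = 1 << 20     # 1 MB, hard-coded per the text
BATCH_CHUNKS = 64        # assumed exact; the text says "about 64"


def chunk_index(offset):
    """Index of the chunk containing a byte offset within a RAMBlock."""
    return offset // CHUNK_SIZE


def chunk_range(block_length, index):
    """Start and end offsets of a chunk, clipped to the end of the block."""
    start = index * CHUNK_SIZE
    return start, min(start + CHUNK_SIZE, block_length)


def is_signaled(write_number):
    """True if this chunk write (counted from 0) is the last of its batch."""
    return (write_number + 1) % BATCH_CHUNKS == 0
```

Under these assumptions, 128 consecutive chunk writes produce only two completion events, for writes 63 and 127, which is what lets the sender keep posting work without waiting on every chunk.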
+
+Error-handling:
+===============================
+
+Infiniband has what is called a "Reliable, Connected"
+link (one of 4 choices). This is the mode that
+we use for RDMA migration.
+
+If a *single* message fails,
+the decision is to abort the migration entirely and
+clean up all the RDMA descriptors and unregister all
+the memory.
+
+After cleanup, the Virtual Machine is returned to normal
+operation the same way that would happen if the TCP
+socket is broken during a non-RDMA based migration.
+
+TODO:
+=================================
+1. Chunk server registration could be improved:
+   This can be done by holding chunks for a certain amount
+   of time and then registering all of the chunks at the same
+   time using fewer control messages. The
+   performance of this approach is unclear.
+2. Currently, cgroups swap limits for *both* TCP and RDMA
+   on the sender-side are broken. This is more pronounced for
+   RDMA because RDMA requires memory registration.
+   Fixing this requires infiniband page registrations to be
+   zero-page aware, and this does not yet work properly.
+3. Currently overcommit for the *receiver* side of
+   TCP works, but not for RDMA. While dynamic page registration
+   *does* work, it is only useful if the is_zero_page() capability
+   remains enabled (which it is by default).
+   However, leaving this capability turned on *significantly* slows
+   down the RDMA throughput, particularly on hardware capable
+   of transmitting faster than 10 gbps (such as 40gbps links).
+4. Use of the recent /proc/<pid>/pagemap would likely solve some
+   of these problems.
+5. Some form of balloon-device usage tracking would also
+   help alleviate some of these issues.
+
+PERFORMANCE
+===================
+
+Using a 40gbps infiniband link performing a worst-case stress test:
+
+RDMA Throughput With $ stress --vm-bytes 1024M --vm 1 --vm-keep
+Approximately 30 gbps (little better than the paper)
+1. Average worst-case throughput 
+TCP Throughput With $ stress --vm-bytes 1024M --vm 1 --vm-keep
+2. Approximately 8 gbps (using IPoIB, IP over Infiniband)
+3. Using chunked registration: approximately 6 gbps.
+
+Average downtime (stop time) ranges between 15 and 33 milliseconds.
+
+An *exhaustive* paper (2010) shows additional performance details
+linked on the QEMU wiki: