From patchwork Fri Apr 12 05:52:09 2013
X-Patchwork-Submitter: mrhines@linux.vnet.ibm.com
X-Patchwork-Id: 235971
From: mrhines@linux.vnet.ibm.com
To: qemu-devel@nongnu.org
Cc: aliguori@us.ibm.com, mst@redhat.com, owasserm@redhat.com, abali@us.ibm.com, mrhines@us.ibm.com, gokul@us.ibm.com, pbonzini@redhat.com
Date: Fri, 12 Apr 2013 01:52:09 -0400
Message-Id: <1365745929-24871-9-git-send-email-mrhines@linux.vnet.ibm.com>
In-Reply-To: <1365745929-24871-3-git-send-email-mrhines@linux.vnet.ibm.com>
References: <1365745929-24871-3-git-send-email-mrhines@linux.vnet.ibm.com>
Subject: [Qemu-devel] [PATCH 8/8] rdma: add documentation

From: "Michael R. Hines"

docs/rdma.txt contains full documentation, wiki links, GitHub URL
and contact information.

Signed-off-by: Michael R. Hines
---
 docs/rdma.txt | 338 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 338 insertions(+)
 create mode 100644 docs/rdma.txt

diff --git a/docs/rdma.txt b/docs/rdma.txt
new file mode 100644
index 0000000..a122138
--- /dev/null
+++ b/docs/rdma.txt
@@ -0,0 +1,338 @@
+(RDMA: Remote Direct Memory Access)
+RDMA Live Migration Specification, Version # 1
+==============================================
+Wiki: http://wiki.qemu.org/Features/RDMALiveMigration
+Github: git@github.com:hinesmr/qemu.git, 'rdma' branch
+
+Copyright (C) 2013 Michael R. Hines
+
+An *exhaustive* paper (2010), linked on the QEMU wiki above,
+shows additional performance details.
+
+Contents:
+=========
+* Running
+* RDMA Protocol Description
+* Versioning and Capabilities
+* QEMUFileRDMA Interface
+* Migration of pc.ram
+* Error handling
+* TODO
+* Performance
+
+RUNNING:
+========
+
+First, decide if you want dynamic page registration on the server-side.
+The only reason to change this setting is if you have a super-fast RDMA
+link which could be fully utilized during the bulk-phase round of migration.
+NOTE: Disabling this will pin all of the memory on the destination, so
+if that is not what you want, then don't do it!
+
+QEMU Monitor Command:
+$ migrate_set_capability chunk_register_destination off # enabled by default
+
+Next, set the migration speed to match your hardware's capabilities:
+
+QEMU Monitor Command:
+$ migrate_set_speed 40g # or whatever is the MAX of your RDMA device
+
+Next, on the destination machine, add the following to the QEMU command line:
+
+qemu ..... -incoming rdma:host:port
+
+Finally, perform the actual migration:
+
+QEMU Monitor Command:
+$ migrate -d rdma:host:port
+
+RDMA Protocol Description:
+==========================
+
+Migration with RDMA is separated into two parts:
+
+1. The transmission of the pages using RDMA
+2. Everything else (a control channel is introduced)
+
+"Everything else" is transmitted using a formal
+protocol now, consisting of infiniband SEND messages.
+
+An infiniband SEND message is the standard ibverbs
+message used by applications of infiniband hardware.
+The only difference between a SEND message and an RDMA
+message is that SEND messages cause notifications
+to be posted to the completion queue (CQ) on the
+infiniband receiver side, whereas RDMA messages (used
+for pc.ram) do not (to behave like an actual DMA).
+
+Messages in infiniband require two things:
+
+1. registration of the memory that will be transmitted
+2. (SEND only) work requests to be posted on both
+   sides of the network before the actual transmission
+   can occur.
+
+RDMA messages are much easier to deal with. Once the memory
+on the receiver side is registered and pinned, we're
+basically done. All that is required is for the sender
+side to start dumping bytes onto the link.
+
+(Memory is not released from pinning until the migration
+completes, given that RDMA migrations are very fast.)
+
+SEND messages require more coordination because the
+receiver must have reserved space (using a receive
+work request) on the receive queue (RQ) before QEMUFileRDMA
+can start using them to carry all the bytes as
+a control transport for migration of device state.
+
+To begin the migration, the initial connection setup is
+as follows (migration-rdma.c):
+
+1. Receiver and Sender are started (command line or libvirt).
+2. Both sides post two RQ work requests
+3. Receiver does listen()
+4. Sender does connect()
+5. Receiver does accept()
+6. Check versioning and capabilities (described later)
+
+At this point, we define a control channel on top of SEND messages
+which is described by a formal protocol. Each SEND message has a
+header portion and a data portion (the two are transmitted together
+as a single SEND message).
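As an illustration only, the header-plus-data layout of one control-channel SEND message can be sketched in Python as below. The field order follows the Header list in this specification (Length, Type, Version, Repeat). The sketch assumes that 'Repeat' is also a uint32 in network byte order, which the specification does not state, and the function names are hypothetical rather than taken from migration-rdma.c.

```python
import struct

# Assumed wire layout: four unsigned 32-bit fields in network byte order,
# in the order the specification lists them. The width and byte order of
# 'Repeat' are assumptions; the specification only gives them for the
# other three fields.
HEADER = struct.Struct("!IIII")
PROTOCOL_VERSION = 1
MAX_REPEAT = 4096  # hard-coded limit stated in the specification


def pack_control_message(msg_type, data, repeat=1):
    """Build one message: header immediately followed by the data portion."""
    if not 1 <= repeat <= MAX_REPEAT:
        raise ValueError("repeat out of range")
    return HEADER.pack(len(data), msg_type, PROTOCOL_VERSION, repeat) + data


def unpack_control_message(message):
    """Split a received message and validate version and length."""
    length, msg_type, version, repeat = HEADER.unpack_from(message)
    if version != PROTOCOL_VERSION:
        raise ValueError("unsupported protocol version")
    data = message[HEADER.size:HEADER.size + length]
    if len(data) != length:
        raise ValueError("truncated data portion")
    return msg_type, repeat, data
```

Under these assumptions the header occupies 16 bytes, and a receiver can reject a message with a mismatched version before looking at the data portion.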
+
+Header:
+    * Length  (of the data portion, uint32, network byte order)
+    * Type    (what command to perform, uint32, network byte order)
+    * Version (protocol version validated before send/recv occurs, uint32, network byte order)
+    * Repeat  (Number of commands in data portion, same type only)
+
+The 'Repeat' field is here to support future multiple page registrations
+in a single message without any need to change the protocol itself
+so that the protocol is compatible across multiple versions of QEMU.
+Version #1 requires that all server implementations of the protocol must
+check this field and register all requests found in the array of commands located
+in the data portion and return an equal number of results in the response.
+The maximum number of repeats is hard-coded to 4096. This is a conservative
+limit based on the maximum size of a SEND message along with empirical
+observations on the maximum future benefit of simultaneous page registrations.
+
+The 'type' field has 7 different command values:
+    1. Unused
+    2. Ready (control-channel is available)
+    3. QEMU File (for sending non-live device state)
+    4. RAM Blocks (used right after connection setup)
+    5. Register request (dynamic chunk registration)
+    6. Register result ('rkey' to be used by sender)
+    7. Register finished (registration for current iteration finished)
+
+A single control message, as hinted above, can contain within the data
+portion an array of many commands of the same type. If there is more than
+one command, then the 'repeat' field will be greater than 1.
+
+After connection setup is completed, we have two protocol-level
+functions, responsible for communicating control-channel commands
+using the above list of values:
+
+Logically:
+
+qemu_rdma_exchange_recv(header, expected command type)
+
+1. We transmit a READY command to let the sender know that
+   we are *ready* to receive some data bytes on the control channel.
+2. Before attempting to receive the expected command, we post another
+   RQ work request to replace the one we just used up.
+3. Block on a CQ event channel and wait for the SEND to arrive.
+4. When the send arrives, librdmacm will unblock us.
+5. Verify that the command-type and version received match the ones we expected.
+
+qemu_rdma_exchange_send(header, data, optional response header & data):
+
+1. Block on the CQ event channel waiting for a READY command
+   from the receiver to tell us that the receiver
+   is *ready* for us to transmit some new bytes.
+2. Optionally: if we are expecting a response from the command
+   (that we have not yet transmitted), let's post an RQ
+   work request to receive that data a few moments later.
+3. When the READY arrives, librdmacm will
+   unblock us and we immediately post an RQ work request
+   to replace the one we just used up.
+4. Now, we can actually post the work request to SEND
+   the requested command type of the header we were asked for.
+5. Optionally, if we are expecting a response (as before),
+   we block again and wait for that response using the additional
+   work request we previously posted. (This is used to carry
+   'Register result' commands #6 back to the sender which
+   hold the rkey needed to perform RDMA. Note that the virtual address
+   corresponding to this rkey was already exchanged at the beginning
+   of the connection (described below).)
+
+All of the remaining command types (not including 'ready')
+described above use the aforementioned two functions to do the hard work:
+
+1. After connection setup, RAMBlock information is exchanged using
+   this protocol before the actual migration begins. This information includes
+   a description of each RAMBlock on the server side as well as the virtual addresses
+   and lengths of each RAMBlock. This is used by the client to determine the
+   start and stop locations of chunks and how to register them dynamically
+   before performing the RDMA operations.
+2. During runtime, once a 'chunk' becomes full of pages ready to
+   be sent with RDMA, the registration commands are used to ask the
+   other side to register the memory for this chunk and respond
+   with the result (rkey) of the registration.
+3. Also, the QEMUFile interfaces call these functions (described below)
+   when transmitting non-live state, such as devices or to send
+   its own protocol information during the migration process.
+
+Versioning and Capabilities
+===========================
+Current version of the protocol is version #1.
+
+The same version applies to both protocol traffic and capabilities
+negotiation. (i.e. There is only one version number that is referred to
+by all communication).
+
+librdmacm provides the user with a 'private data' area to be exchanged
+at connection-setup time before any infiniband traffic is generated.
+
+Header:
+    * Version (protocol version validated before send/recv occurs), uint32, network byte order
+    * Flags   (bitwise OR of each capability), uint32, network byte order
+
+There is no data portion of this header right now, so there is
+no length field. The maximum size of the 'private data' section
+is only 192 bytes per the Infiniband specification, so it's not
+very useful for data anyway. This structure needs to remain small.
+
+This private data area is a convenient place to check for protocol
+versioning because the user does not need to register memory to
+transmit a few bytes of version information.
+
+This is also a convenient place to negotiate capabilities
+(like dynamic page registration).
+
+If the version is invalid, we throw an error.
+
+If the version is new, we only negotiate the capabilities that the
+requested version is able to perform and ignore the rest.
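A minimal sketch of the private-data header and the version check described in this section, written in Python for illustration. The two-field layout (Version, Flags, both uint32 in network byte order) comes from the text above. The bit value chosen for dynamic page registration and all names below are hypothetical, since this specification does not assign them.

```python
import struct

# Version and Flags, both uint32 in network byte order (from the text above).
CAP_HEADER = struct.Struct("!II")

# Hypothetical bit assignment; the specification does not give a value.
CAP_DYNAMIC_PAGE_REGISTRATION = 0x01

# Capabilities each known protocol version is able to perform.
KNOWN_CAPABILITIES = {1: CAP_DYNAMIC_PAGE_REGISTRATION}


def pack_private_data(version, flags):
    """Build the bytes placed in the librdmacm 'private data' area."""
    return CAP_HEADER.pack(version, flags)


def answer_private_data(private_data):
    """Destination side: validate the version, then keep only the flags
    that the requested version is able to perform and ignore the rest."""
    version, flags = CAP_HEADER.unpack_from(private_data)
    if version not in KNOWN_CAPABILITIES:
        raise ValueError("invalid protocol version")
    return pack_private_data(version, flags & KNOWN_CAPABILITIES[version])
```

With this layout the header is 8 bytes, so no memory registration is needed to carry it, and unknown flag bits simply come back cleared.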
+
+Currently there is only *one* capability in Version #1: dynamic page registration.
+
+Finally: Negotiation happens with the Flags field: If the primary-VM
+sets a flag, but the destination does not support this capability, it
+will return a zero-bit for that flag and the primary-VM will understand
+that as not being an available capability and will thus disable that
+capability on the primary-VM side.
+
+QEMUFileRDMA Interface:
+=======================
+
+QEMUFileRDMA introduces a couple of new functions:
+
+1. qemu_rdma_get_buffer() (QEMUFileOps rdma_read_ops)
+2. qemu_rdma_put_buffer() (QEMUFileOps rdma_write_ops)
+
+These two functions are very short and simply use the protocol
+described above to deliver bytes without changing the upper-level
+users of QEMUFile that depend on a bytestream abstraction.
+
+Finally, how do we hand off the actual bytes to get_buffer()?
+
+Again, because we're trying to "fake" a bytestream abstraction
+using an analogy not unlike individual UDP frames, we have
+to hold on to the bytes received from control-channel's SEND
+messages in memory.
+
+Each time we receive a complete "QEMU File" control-channel
+message, the bytes from SEND are copied into a small local holding area.
+
+Then, we return the number of bytes requested by get_buffer()
+and leave the remaining bytes in the holding area until get_buffer()
+comes around for another pass.
+
+If the buffer is empty, then we follow the same steps
+listed above and issue another "QEMU File" protocol command,
+asking for a new SEND message to re-fill the buffer.
+
+Migration of pc.ram:
+====================
+
+At the beginning of the migration (migration-rdma.c),
+the sender and the receiver populate the list of RAMBlocks
+to be registered with each other into a structure.
+Then, using the aforementioned protocol, they exchange a
+description of these blocks with each other, to be used later
+during the iteration of main memory. This description includes
+a list of all the RAMBlocks, their offsets and lengths, virtual
+addresses and possibly includes pre-registered RDMA keys in case dynamic
+page registration was disabled on the server-side, otherwise not.
+
+Main memory is not migrated with the aforementioned protocol,
+but is instead migrated with normal RDMA Write operations.
+
+Pages are migrated in "chunks" (hard-coded to 1 Megabyte right now).
+Chunk size is not dynamic, but it could be in a future implementation.
+There's nothing to indicate that this is useful right now.
+
+When a chunk is full (or a flush() occurs), the memory backed by
+the chunk is registered with librdmacm and pinned in memory on
+both sides using the aforementioned protocol.
+After pinning, an RDMA Write is generated and transmitted
+for the entire chunk.
+
+Chunks are also transmitted in batches: This means that we
+do not request that the hardware signal the completion queue
+for the completion of *every* chunk. The current batch size
+is about 64 chunks (corresponding to 64 MB of memory).
+Only the last chunk in a batch must be signaled.
+This helps keep everything as asynchronous as possible
+and helps keep the hardware busy performing RDMA operations.
+
+Error-handling:
+===============
+
+Infiniband has what is called a "Reliable, Connected"
+link (one of 4 choices). This is the mode that
+we use for RDMA migration.
+
+If a *single* message fails,
+the decision is to abort the migration entirely and
+clean up all the RDMA descriptors and unregister all
+the memory.
+
+After cleanup, the Virtual Machine is returned to normal
+operation the same way that would happen if the TCP
+socket were broken during a non-RDMA based migration.
+
+TODO:
+=====
+1. Chunk server registration could be improved:
+   This can be done by holding chunks for a certain amount
+   of time and then registering all of the chunks at the same
+   time using fewer control messages. The
+   performance of this approach is unclear.
+   Current version of the protocol does not need to change
+   to support such an optimization.
+2. Currently, cgroups swap limits for *both* TCP and RDMA
+   on the sender-side are broken. This is more acute for
+   RDMA because RDMA requires memory registration.
+   Fixing this requires infiniband page registrations to be
+   zero-page aware, and this does not yet work properly.
+3. Use of the recent /proc//pagemap would likely solve some
+   of these problems.
+4. Also, some form of balloon-device usage tracking would
+   help alleviate some of these issues.
+
+PERFORMANCE
+===========
+
+Using a 40gbps infiniband link performing a worst-case stress test:
+
+RDMA Throughput With $ stress --vm-bytes 1024M --vm 1 --vm-keep
+Approximately 26 gbps
+1. Average worst-case throughput
+TCP Throughput With $ stress --vm-bytes 1024M --vm 1 --vm-keep
+2. Approximately 8 gbps (using IPOIB IP over Infiniband)
+3. Using chunked registration: approximately 6 gbps.
+
+Average downtime (stop time) ranges between 15 and 33 milliseconds.
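
Returning to the chunking scheme from "Migration of pc.ram", the 1 MB chunk size and the 64-chunk batch can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and signaling the final chunk of a short last batch is an assumption modeled on the flush() case, not something the specification spells out.

```python
CHUNK_SIZE = 1 << 20   # 1 MB chunks, hard-coded per the specification
BATCH_SIZE = 64        # "about 64 chunks" per batch per the specification


def chunk_index(block_offset):
    """Map a byte offset inside a RAMBlock to the chunk that holds it."""
    return block_offset // CHUNK_SIZE


def plan_writes(num_chunks):
    """Yield (chunk, signaled) pairs for a run of RDMA Writes.

    Only the last chunk of each batch asks for a completion-queue
    notification. Assumption: the very last chunk is signaled as well,
    so that a short final batch still produces a completion.
    """
    for chunk in range(num_chunks):
        last_in_batch = (chunk + 1) % BATCH_SIZE == 0
        last_overall = chunk == num_chunks - 1
        yield chunk, last_in_batch or last_overall
```

For 130 chunks this plan requests three completions instead of 130, which is the point of batching: the sender keeps posting writes without waiting on each one.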