From patchwork Tue Jun 18 20:16:00 2013
X-Patchwork-Submitter: mrhines@linux.vnet.ibm.com
X-Patchwork-Id: 252430
From: mrhines@linux.vnet.ibm.com
To: qemu-devel@nongnu.org
Cc: aliguori@us.ibm.com, quintela@redhat.com, knoel@redhat.com, owasserm@redhat.com, abali@us.ibm.com, mrhines@us.ibm.com, gokul@us.ibm.com, pbonzini@redhat.com, chegu_vinod@hp.com
Date: Tue, 18 Jun 2013 16:16:00 -0400
Message-Id: <1371586561-22803-14-git-send-email-mrhines@linux.vnet.ibm.com>
In-Reply-To: <1371586561-22803-1-git-send-email-mrhines@linux.vnet.ibm.com>
References: <1371586561-22803-1-git-send-email-mrhines@linux.vnet.ibm.com>
X-Mailer: git-send-email 1.7.10.4
Subject: [Qemu-devel] [PATCH v10 13/14] rdma: fix mlock() freezes and accounting
List-Id: qemu-devel.nongnu.org
From: "Michael R.
Hines" This patch is contained within migration-rdma.c and fixes the problems experienced by others when the x-rdma-pin-all feature appeared to freeze the VM. By moving this operation out of connection setup time and into the ram_save_setup() path, we no longer execute pinning inside the BQL; the pinning is thus parallelized with VM execution and also properly accounted for inside the QMP migrate total time statistics. Reviewed-by: Paolo Bonzini Reviewed-by: Chegu Vinod Reviewed-by: Eric Blake Tested-by: Chegu Vinod Tested-by: Michael R. Hines Signed-off-by: Michael R. Hines --- arch_init.c | 5 -- docs/rdma.txt | 31 +++++--- migration-rdma.c | 211 +++++++++++++++++++++++++++--------------------- 3 files changed, 128 insertions(+), 119 deletions(-) diff --git a/arch_init.c b/arch_init.c index ebb601b..a1557d2 100644 --- a/arch_init.c +++ b/arch_init.c @@ -624,11 +624,6 @@ static int ram_save_setup(QEMUFile *f, void *opaque) qemu_mutex_unlock_ramlist(); - /* - * Please leave in place. These calls generate reserved messages in - * the RDMA protocol in order to pre-register RDMA memory in the - * future before the bulk round begins. - */ ram_control_before_iterate(f, RAM_CONTROL_SETUP); ram_control_after_iterate(f, RAM_CONTROL_SETUP); diff --git a/docs/rdma.txt b/docs/rdma.txt index 4f98a3b..09d36ef 100644 --- a/docs/rdma.txt +++ b/docs/rdma.txt @@ -76,6 +76,13 @@ of the migration, which can greatly reduce the "total" time of your migration. Example performance of this using an idle VM in the previous example can be found in the "Performance" section. +Note: for very large virtual machines (hundreds of GBs), pinning +*all* of the memory of your virtual machine in the kernel is very expensive +and may extend the initial bulk iteration time by many seconds, +thus extending the total migration time.
However, this will not +affect the determinism or predictability of your migration; you will +still gain the benefits of advance pinning with RDMA. + RUNNING: ======== @@ -195,22 +202,26 @@ The maximum number of repeats is hard-coded to 4096. This is a conservative limit based on the maximum size of a SEND message along with empirical observations on the maximum future benefit of simultaneous page registrations. -The 'type' field has 9 different command values: +The 'type' field has 10 different command values: 1. Unused - 2. Error (sent to the source during bad things) - 3. Ready (control-channel is available) - 4. QEMU File (for sending non-live device state) - 5. RAM Blocks (used right after connection setup) - 6. Compress page (zap zero page and skip registration) - 7. Register request (dynamic chunk registration) - 8. Register result ('rkey' to be used by sender) - 9. Register finished (registration for current iteration finished) + 2. Error (sent to the source during bad things) + 3. Ready (control-channel is available) + 4. QEMU File (for sending non-live device state) + 5. RAM Blocks request (used right after connection setup) + 6. RAM Blocks result (used right after connection setup) + 7. Compress page (zap zero page and skip registration) + 8. Register request (dynamic chunk registration) + 9. Register result ('rkey' to be used by sender) + 10. Register finished (registration for current iteration finished) A single control message, as hinted above, can contain within the data portion an array of many commands of the same type. If there is more than one command, then the 'repeat' field will be greater than 1. -After connection setup is completed, we have two protocol-level +After connection setup, messages 5 & 6 are used to exchange RAM block +information and optionally pin all the memory if requested by the user.
+ +After ram block exchange is completed, we have two protocol-level functions, responsible for communicating control-channel commands using the above list of values: diff --git a/migration-rdma.c b/migration-rdma.c index 79aaa0a..853de18 100644 --- a/migration-rdma.c +++ b/migration-rdma.c @@ -146,13 +146,14 @@ const char *wrid_desc[] = { enum { RDMA_CONTROL_NONE = 0, RDMA_CONTROL_ERROR, - RDMA_CONTROL_READY, /* ready to receive */ - RDMA_CONTROL_QEMU_FILE, /* QEMUFile-transmitted bytes */ - RDMA_CONTROL_RAM_BLOCKS, /* RAMBlock synchronization */ - RDMA_CONTROL_COMPRESS, /* page contains repeat values */ - RDMA_CONTROL_REGISTER_REQUEST, /* dynamic page registration */ - RDMA_CONTROL_REGISTER_RESULT, /* key to use after registration */ - RDMA_CONTROL_REGISTER_FINISHED, /* current iteration finished */ + RDMA_CONTROL_READY, /* ready to receive */ + RDMA_CONTROL_QEMU_FILE, /* QEMUFile-transmitted bytes */ + RDMA_CONTROL_RAM_BLOCKS_REQUEST, /* RAMBlock synchronization */ + RDMA_CONTROL_RAM_BLOCKS_RESULT, /* RAMBlock synchronization */ + RDMA_CONTROL_COMPRESS, /* page contains repeat values */ + RDMA_CONTROL_REGISTER_REQUEST, /* dynamic page registration */ + RDMA_CONTROL_REGISTER_RESULT, /* key to use after registration */ + RDMA_CONTROL_REGISTER_FINISHED, /* current iteration finished */ }; const char *control_desc[] = { @@ -160,7 +161,8 @@ const char *control_desc[] = { [RDMA_CONTROL_ERROR] = "ERROR", [RDMA_CONTROL_READY] = "READY", [RDMA_CONTROL_QEMU_FILE] = "QEMU FILE", - [RDMA_CONTROL_RAM_BLOCKS] = "REMOTE INFO", + [RDMA_CONTROL_RAM_BLOCKS_REQUEST] = "RAM BLOCKS REQUEST", + [RDMA_CONTROL_RAM_BLOCKS_RESULT] = "RAM BLOCKS RESULT", [RDMA_CONTROL_COMPRESS] = "COMPRESS", [RDMA_CONTROL_REGISTER_REQUEST] = "REGISTER REQUEST", [RDMA_CONTROL_REGISTER_RESULT] = "REGISTER RESULT", @@ -701,6 +703,9 @@ static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma, RDMALocalBlocks *rdma_local_ram_blocks) { int i; + uint64_t start = qemu_get_clock_ms(rt_clock); + (void)start; + 
for (i = 0; i < rdma_local_ram_blocks->num_blocks; i++) { rdma_local_ram_blocks->block[i].mr = ibv_reg_mr(rdma->pd, @@ -716,6 +721,8 @@ static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma, rdma->total_registrations++; } + DPRINTF("lock time: %" PRIu64 "\n", qemu_get_clock_ms(rt_clock) - start); + if (i >= rdma_local_ram_blocks->num_blocks) { return 0; } @@ -1338,7 +1345,6 @@ static int qemu_rdma_exchange_recv(RDMAContext *rdma, RDMAControlHeader *head, .repeat = 1, }; int ret; - int idx = 0; /* * Inform the source that we're ready to receive a message. @@ -1353,18 +1359,18 @@ static int qemu_rdma_exchange_recv(RDMAContext *rdma, RDMAControlHeader *head, /* * Block and wait for the message. */ - ret = qemu_rdma_exchange_get_response(rdma, head, expecting, idx); + ret = qemu_rdma_exchange_get_response(rdma, head, expecting, 0); if (ret < 0) { return ret; } - qemu_rdma_move_header(rdma, idx, head); + qemu_rdma_move_header(rdma, 0, head); /* * Post a new RECV work request to replace the one we just consumed. 
*/ - ret = qemu_rdma_post_recv_control(rdma, idx); + ret = qemu_rdma_post_recv_control(rdma, 0); if (ret) { fprintf(stderr, "rdma migration: error posting second control recv!"); return ret; @@ -1852,7 +1858,6 @@ err_rdma_source_init: static int qemu_rdma_connect(RDMAContext *rdma, Error **errp) { - RDMAControlHeader head; RDMACapabilities cap = { .version = RDMA_CONTROL_VERSION_CURRENT, .flags = 0, @@ -1864,8 +1869,6 @@ static int qemu_rdma_connect(RDMAContext *rdma, Error **errp) }; struct rdma_cm_event *cm_event; int ret; - int idx = 0; - int x; /* * Only negotiate the capability with destination if the user @@ -1923,44 +1926,12 @@ static int qemu_rdma_connect(RDMAContext *rdma, Error **errp) rdma_ack_cm_event(cm_event); - ret = qemu_rdma_post_recv_control(rdma, idx + 1); - if (ret) { - ERROR(errp, "posting first control recv!\n"); - goto err_rdma_source_connect; - } - - ret = qemu_rdma_post_recv_control(rdma, idx); + ret = qemu_rdma_post_recv_control(rdma, 0); if (ret) { ERROR(errp, "posting second control recv!\n"); goto err_rdma_source_connect; } - ret = qemu_rdma_exchange_get_response(rdma, - &head, RDMA_CONTROL_RAM_BLOCKS, idx + 1); - - if (ret < 0) { - ERROR(errp, "receiving remote info!\n"); - goto err_rdma_source_connect; - } - - qemu_rdma_move_header(rdma, idx + 1, &head); - memcpy(rdma->block, rdma->wr_data[idx + 1].control_curr, head.len); - - ret = qemu_rdma_process_remote_blocks(rdma, - (head.len / sizeof(RDMARemoteBlock)), errp); - if (ret) { - goto err_rdma_source_connect; - } - - if (!rdma->pin_all) { - for (x = 0; x < rdma->local_ram_blocks.num_blocks; x++) { - RDMALocalBlock *block = &(rdma->local_ram_blocks.block[x]); - int num_chunks = ram_chunk_count(block); - /* allocate memory to store remote rkeys */ - block->remote_keys = g_malloc0(num_chunks * sizeof(uint32_t)); - } - } - rdma->control_ready_expected = 1; rdma->num_signaled_send = 0; return 0; @@ -2321,11 +2292,6 @@ static size_t qemu_rdma_save_page(QEMUFile *f, void *opaque, static int 
qemu_rdma_accept(RDMAContext *rdma) { - RDMAControlHeader head = { .len = rdma->local_ram_blocks.num_blocks * - sizeof(RDMARemoteBlock), - .type = RDMA_CONTROL_RAM_BLOCKS, - .repeat = 1, - }; RDMACapabilities cap; struct rdma_conn_param conn_param = { .responder_resources = 2, @@ -2335,8 +2301,6 @@ static int qemu_rdma_accept(RDMAContext *rdma) struct rdma_cm_event *cm_event; struct ibv_context *verbs; int ret = -EINVAL; - RDMALocalBlocks *local = &rdma->local_ram_blocks; - int i; ret = rdma_get_cm_event(rdma->channel, &cm_event); if (ret) { @@ -2434,41 +2398,6 @@ static int qemu_rdma_accept(RDMAContext *rdma) goto err_rdma_dest_wait; } - if (rdma->pin_all) { - ret = qemu_rdma_reg_whole_ram_blocks(rdma, &rdma->local_ram_blocks); - if (ret) { - fprintf(stderr, "rdma migration: error dest " - "registering ram blocks!\n"); - goto err_rdma_dest_wait; - } - } - - /* - * Server uses this to prepare to transmit the RAMBlock descriptions - * to the primary VM after connection setup. - * Both sides use the "remote" structure to communicate and update - * their "local" descriptions with what was sent. 
- */ - for (i = 0; i < local->num_blocks; i++) { - rdma->block[i].remote_host_addr = - (uint64_t)(local->block[i].local_host_addr); - - if (rdma->pin_all) { - rdma->block[i].remote_rkey = local->block[i].mr->rkey; - } - - rdma->block[i].offset = local->block[i].offset; - rdma->block[i].length = local->block[i].length; - } - - - ret = qemu_rdma_post_send_control(rdma, (uint8_t *) rdma->block, &head); - - if (ret < 0) { - fprintf(stderr, "rdma migration: error sending remote info!\n"); - goto err_rdma_dest_wait; - } - qemu_rdma_dump_gid("dest_connect", rdma->cm_id); return 0; @@ -2495,8 +2424,10 @@ static int qemu_rdma_registration_handle(QEMUFile *f, void *opaque, .type = RDMA_CONTROL_REGISTER_RESULT, .repeat = 0, }; + RDMAControlHeader blocks = { .type = RDMA_CONTROL_RAM_BLOCKS_RESULT, .repeat = 1 }; QEMUFileRDMA *rfile = opaque; RDMAContext *rdma = rfile->rdma; + RDMALocalBlocks *local = &rdma->local_ram_blocks; RDMAControlHeader head; RDMARegister *reg, *registers; RDMACompress *comp; @@ -2507,13 +2438,10 @@ static int qemu_rdma_registration_handle(QEMUFile *f, void *opaque, int ret = 0; int idx = 0; int count = 0; + int i = 0; CHECK_ERROR_STATE(); - if (rdma->pin_all) { - return 0; - } - do { DDDPRINTF("Waiting for next registration %" PRIu64 "...\n", flags); @@ -2545,9 +2473,53 @@ static int qemu_rdma_registration_handle(QEMUFile *f, void *opaque, ram_handle_compressed(host_addr, comp->value, comp->length); break; + case RDMA_CONTROL_REGISTER_FINISHED: DDDPRINTF("Current registrations complete.\n"); goto out; + + case RDMA_CONTROL_RAM_BLOCKS_REQUEST: + DPRINTF("Initial setup info requested.\n"); + + if (rdma->pin_all) { + ret = qemu_rdma_reg_whole_ram_blocks(rdma, &rdma->local_ram_blocks); + if (ret) { + fprintf(stderr, "rdma migration: error dest " + "registering ram blocks!\n"); + goto out; + } + } + + /* + * Dest uses this to prepare to transmit the RAMBlock descriptions + * to the primary VM after connection setup. 
+ * Both sides use the "remote" structure to communicate and update + * their "local" descriptions with what was sent. + */ + for (i = 0; i < local->num_blocks; i++) { + rdma->block[i].remote_host_addr = + (uint64_t)(local->block[i].local_host_addr); + + if (rdma->pin_all) { + rdma->block[i].remote_rkey = local->block[i].mr->rkey; + } + + rdma->block[i].offset = local->block[i].offset; + rdma->block[i].length = local->block[i].length; + } + + blocks.len = rdma->local_ram_blocks.num_blocks + * sizeof(RDMARemoteBlock); + + ret = qemu_rdma_post_send_control(rdma, + (uint8_t *) rdma->block, &blocks); + + if (ret < 0) { + fprintf(stderr, "rdma migration: error sending remote info!\n"); + goto out; + } + + break; case RDMA_CONTROL_REGISTER_REQUEST: DDPRINTF("There are %d registration requests\n", head.repeat); @@ -2610,10 +2582,6 @@ static int qemu_rdma_registration_start(QEMUFile *f, void *opaque, CHECK_ERROR_STATE(); - if (rdma->pin_all) { - return 0; - } - DDDPRINTF("start section: %" PRIu64 "\n", flags); qemu_put_be64(f, RAM_SAVE_FLAG_HOOK); qemu_fflush(f); @@ -2628,12 +2596,12 @@ static int qemu_rdma_registration_start(QEMUFile *f, void *opaque, static int qemu_rdma_registration_stop(QEMUFile *f, void *opaque, uint64_t flags) { + Error *local_err = NULL, **errp = &local_err; QEMUFileRDMA *rfile = opaque; RDMAContext *rdma = rfile->rdma; - RDMAControlHeader head = { .len = 0, - .type = RDMA_CONTROL_REGISTER_FINISHED, - .repeat = 1, - }; + RDMAControlHeader head = { .len = 0, .repeat = 1 }; + RDMAControlHeader resp = {.type = RDMA_CONTROL_RAM_BLOCKS_RESULT }; + int reg_result_idx; int ret = 0; CHECK_ERROR_STATE(); @@ -2645,11 +2613,46 @@ static int qemu_rdma_registration_stop(QEMUFile *f, void *opaque, goto err; } - if (rdma->pin_all) { - return 0; + if (flags == RAM_CONTROL_SETUP) { + head.type = RDMA_CONTROL_RAM_BLOCKS_REQUEST; + DPRINTF("Sending registration setup for ram blocks...\n"); + + ret = qemu_rdma_exchange_send(rdma, &head, NULL, &resp, ®_result_idx); + if 
(ret < 0) { + ERROR(errp, "receiving remote info!\n"); + return ret; + } + + qemu_rdma_move_header(rdma, reg_result_idx, &resp); + memcpy(rdma->block, rdma->wr_data[reg_result_idx].control_curr, resp.len); + + ret = qemu_rdma_process_remote_blocks(rdma, + (resp.len / sizeof(RDMARemoteBlock)), errp); + if (ret) { + ERROR(errp, "processing remote blocks!\n"); + return ret; + } + + if (rdma->pin_all) { + ret = qemu_rdma_reg_whole_ram_blocks(rdma, &rdma->local_ram_blocks); + if (ret) { + fprintf(stderr, "rdma migration: error source " + "registering ram blocks!\n"); + return ret; + } + } else { + int x = 0; + for (x = 0; x < rdma->local_ram_blocks.num_blocks; x++) { + RDMALocalBlock *block = &(rdma->local_ram_blocks.block[x]); + int num_chunks = ram_chunk_count(block); + block->remote_keys = g_malloc0(num_chunks * sizeof(uint32_t)); + } + } } DDDPRINTF("Sending registration finish %" PRIu64 "...\n", flags); + + head.type = RDMA_CONTROL_REGISTER_FINISHED; ret = qemu_rdma_exchange_send(rdma, &head, NULL, NULL, NULL); if (ret < 0) {