Patchwork [v9,13/14] rdma: fix mlock() freezes and accounting

Submitter mrhines@linux.vnet.ibm.com
Date June 14, 2013, 8:35 p.m.
Message ID <1371242153-11262-14-git-send-email-mrhines@linux.vnet.ibm.com>
Permalink /patch/251529/
State New

Comments

mrhines@linux.vnet.ibm.com - June 14, 2013, 8:35 p.m.
From: "Michael R. Hines" <mrhines@us.ibm.com>

This patch fixes the problems experienced by others when the
x-rdma-pin-all feature appeared to freeze the VM. The code changes are
confined to migration-rdma.c. Pinning is moved out of connection setup
and into the ram_save_setup() path, so it no longer executes while
holding the BQL. As a result, pinning runs in parallel with VM
execution and is properly accounted for in the QMP migrate total time
statistics.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 arch_init.c      |    5 --
 docs/rdma.txt    |   31 +++++---
 migration-rdma.c |  211 +++++++++++++++++++++++++++---------------------------
 3 files changed, 128 insertions(+), 119 deletions(-)
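
As a rough illustration of the setup-time exchange this patch introduces
(not the actual QEMU code): the source sends RAM BLOCKS REQUEST, the
destination replies with RAM BLOCKS RESULT, and the source derives the
block count from the reply length. The enum mirrors the one in the patch.
Everything named Sketch*/sketch_* and SKETCH_MAX_BLOCKS is a hypothetical
stand-in: the real RDMAControlHeader and RDMARemoteBlock layouts are not
shown in this patch, and the real exchange goes over the RDMA control
channel rather than a memcpy().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Control message types, in the order the patch defines them. */
enum {
    RDMA_CONTROL_NONE = 0,
    RDMA_CONTROL_ERROR,
    RDMA_CONTROL_READY,
    RDMA_CONTROL_QEMU_FILE,
    RDMA_CONTROL_RAM_BLOCKS_REQUEST,
    RDMA_CONTROL_RAM_BLOCKS_RESULT,
    RDMA_CONTROL_COMPRESS,
    RDMA_CONTROL_REGISTER_REQUEST,
    RDMA_CONTROL_REGISTER_RESULT,
    RDMA_CONTROL_REGISTER_FINISHED,
};

/* Hypothetical, simplified stand-in for RDMAControlHeader. */
typedef struct {
    uint32_t len;
    uint32_t type;
    uint32_t repeat;
} SketchHeader;

/* Hypothetical, simplified stand-in for RDMARemoteBlock. */
typedef struct {
    uint64_t remote_host_addr;
    uint64_t offset;
    uint64_t length;
    uint32_t remote_rkey;
    uint32_t padding;
} SketchRemoteBlock;

#define SKETCH_MAX_BLOCKS 8

/* Destination: answer a RAM BLOCKS REQUEST with its block descriptions. */
static int sketch_dest_handle(const SketchHeader *req,
                              const SketchRemoteBlock *local, uint32_t nb,
                              SketchHeader *resp, SketchRemoteBlock *wire)
{
    if (req->type != RDMA_CONTROL_RAM_BLOCKS_REQUEST ||
        nb > SKETCH_MAX_BLOCKS) {
        return -1;
    }
    memcpy(wire, local, nb * sizeof(*local));
    resp->type = RDMA_CONTROL_RAM_BLOCKS_RESULT;
    resp->repeat = 1;
    resp->len = (uint32_t)(nb * sizeof(SketchRemoteBlock));
    return 0;
}

/*
 * Source: issue the request during setup and copy out the reply.
 * Returns the number of blocks received, or -1 on error.
 */
static int sketch_source_setup(const SketchRemoteBlock *dest_blocks,
                               uint32_t nb, SketchRemoteBlock *out)
{
    SketchHeader head = { .len = 0,
                          .type = RDMA_CONTROL_RAM_BLOCKS_REQUEST,
                          .repeat = 1 };
    SketchHeader resp = { 0 };
    SketchRemoteBlock wire[SKETCH_MAX_BLOCKS];

    if (sketch_dest_handle(&head, dest_blocks, nb, &resp, wire) < 0) {
        return -1;
    }
    if (resp.type != RDMA_CONTROL_RAM_BLOCKS_RESULT) {
        return -1;
    }
    /* The reply must hold a whole number of block descriptions. */
    assert(resp.len % sizeof(SketchRemoteBlock) == 0);
    memcpy(out, wire, resp.len);
    return (int)(resp.len / sizeof(SketchRemoteBlock));
}
```

Calling sketch_source_setup() with the destination's block array and an
output array of SKETCH_MAX_BLOCKS entries returns the number of blocks
described in the reply. In the patch itself, the equivalent step runs
from qemu_rdma_registration_stop() when flags == RAM_CONTROL_SETUP.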

Patch

diff --git a/arch_init.c b/arch_init.c
index ebb601b..a1557d2 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -624,11 +624,6 @@  static int ram_save_setup(QEMUFile *f, void *opaque)
 
     qemu_mutex_unlock_ramlist();
 
-    /*
-     * Please leave in place. These calls generate reserved messages in
-     * the RDMA protocol in order to pre-register RDMA memory in the
-     * future before the bulk round begins.
-     */
     ram_control_before_iterate(f, RAM_CONTROL_SETUP);
     ram_control_after_iterate(f, RAM_CONTROL_SETUP);
 
diff --git a/docs/rdma.txt b/docs/rdma.txt
index 4f98a3b..09d36ef 100644
--- a/docs/rdma.txt
+++ b/docs/rdma.txt
@@ -76,6 +76,13 @@  of the migration, which can greatly reduce the "total" time of your migration.
 Example performance of this using an idle VM in the previous example
 can be found in the "Performance" section.
 
+Note: for very large virtual machines (hundreds of GBs), pinning
+*all* of the memory of your virtual machine in the kernel is very expensive
+and may extend the initial bulk iteration time by many seconds,
+thus extending the total migration time. However, this will not
+affect the determinism or predictability of your migration; you will
+still gain from the benefits of advanced pinning with RDMA.
+
 RUNNING:
 ========
 
@@ -195,22 +202,26 @@  The maximum number of repeats is hard-coded to 4096. This is a conservative
 limit based on the maximum size of a SEND message along with emperical
 observations on the maximum future benefit of simultaneous page registrations.
 
-The 'type' field has 9 different command values:
+The 'type' field has 10 different command values:
     1. Unused
-    2. Error             (sent to the source during bad things)
-    3. Ready             (control-channel is available)
-    4. QEMU File         (for sending non-live device state)
-    5. RAM Blocks        (used right after connection setup)
-    6. Compress page     (zap zero page and skip registration)
-    7. Register request  (dynamic chunk registration)
-    8. Register result   ('rkey' to be used by sender)
-    9. Register finished (registration for current iteration finished)
+    2. Error              (sent to the source during bad things)
+    3. Ready              (control-channel is available)
+    4. QEMU File          (for sending non-live device state)
+    5. RAM Blocks request (used right after connection setup)
+    6. RAM Blocks result  (used right after connection setup)
+    7. Compress page      (zap zero page and skip registration)
+    8. Register request   (dynamic chunk registration)
+    9. Register result    ('rkey' to be used by sender)
+    10. Register finished (registration for current iteration finished)
 
 A single control message, as hinted above, can contain within the data
 portion an array of many commands of the same type. If there is more than
 one command, then the 'repeat' field will be greater than 1.
 
-After connection setup is completed, we have two protocol-level
+After connection setup, messages 5 & 6 are used to exchange ram block
+information and optionally pin all the memory if requested by the user.
+
+After ram block exchange is completed, we have two protocol-level
 functions, responsible for communicating control-channel commands
 using the above list of values:
 
diff --git a/migration-rdma.c b/migration-rdma.c
index 79aaa0a..853de18 100644
--- a/migration-rdma.c
+++ b/migration-rdma.c
@@ -146,13 +146,14 @@  const char *wrid_desc[] = {
 enum {
     RDMA_CONTROL_NONE = 0,
     RDMA_CONTROL_ERROR,
-    RDMA_CONTROL_READY,             /* ready to receive */
-    RDMA_CONTROL_QEMU_FILE,         /* QEMUFile-transmitted bytes */
-    RDMA_CONTROL_RAM_BLOCKS,        /* RAMBlock synchronization */
-    RDMA_CONTROL_COMPRESS,          /* page contains repeat values */
-    RDMA_CONTROL_REGISTER_REQUEST,  /* dynamic page registration */
-    RDMA_CONTROL_REGISTER_RESULT,   /* key to use after registration */
-    RDMA_CONTROL_REGISTER_FINISHED, /* current iteration finished */
+    RDMA_CONTROL_READY,              /* ready to receive */
+    RDMA_CONTROL_QEMU_FILE,          /* QEMUFile-transmitted bytes */
+    RDMA_CONTROL_RAM_BLOCKS_REQUEST, /* RAMBlock synchronization */
+    RDMA_CONTROL_RAM_BLOCKS_RESULT,  /* RAMBlock synchronization */
+    RDMA_CONTROL_COMPRESS,           /* page contains repeat values */
+    RDMA_CONTROL_REGISTER_REQUEST,   /* dynamic page registration */
+    RDMA_CONTROL_REGISTER_RESULT,    /* key to use after registration */
+    RDMA_CONTROL_REGISTER_FINISHED,  /* current iteration finished */
 };
 
 const char *control_desc[] = {
@@ -160,7 +161,8 @@  const char *control_desc[] = {
         [RDMA_CONTROL_ERROR] = "ERROR",
         [RDMA_CONTROL_READY] = "READY",
         [RDMA_CONTROL_QEMU_FILE] = "QEMU FILE",
-        [RDMA_CONTROL_RAM_BLOCKS] = "REMOTE INFO",
+        [RDMA_CONTROL_RAM_BLOCKS_REQUEST] = "RAM BLOCKS REQUEST",
+        [RDMA_CONTROL_RAM_BLOCKS_RESULT] = "RAM BLOCKS RESULT",
         [RDMA_CONTROL_COMPRESS] = "COMPRESS",
         [RDMA_CONTROL_REGISTER_REQUEST] = "REGISTER REQUEST",
         [RDMA_CONTROL_REGISTER_RESULT] = "REGISTER RESULT",
@@ -701,6 +703,9 @@  static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma,
                                 RDMALocalBlocks *rdma_local_ram_blocks)
 {
     int i;
+    uint64_t start = qemu_get_clock_ms(rt_clock);
+    (void)start;
+
     for (i = 0; i < rdma_local_ram_blocks->num_blocks; i++) {
         rdma_local_ram_blocks->block[i].mr =
             ibv_reg_mr(rdma->pd,
@@ -716,6 +721,8 @@  static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma,
         rdma->total_registrations++;
     }
 
+    DPRINTF("lock time: %" PRIu64 "\n", qemu_get_clock_ms(rt_clock) - start);
+
     if (i >= rdma_local_ram_blocks->num_blocks) {
         return 0;
     }
@@ -1338,7 +1345,6 @@  static int qemu_rdma_exchange_recv(RDMAContext *rdma, RDMAControlHeader *head,
                                 .repeat = 1,
                               };
     int ret;
-    int idx = 0;
 
     /*
      * Inform the source that we're ready to receive a message.
@@ -1353,18 +1359,18 @@  static int qemu_rdma_exchange_recv(RDMAContext *rdma, RDMAControlHeader *head,
     /*
      * Block and wait for the message.
      */
-    ret = qemu_rdma_exchange_get_response(rdma, head, expecting, idx);
+    ret = qemu_rdma_exchange_get_response(rdma, head, expecting, 0);
 
     if (ret < 0) {
         return ret;
     }
 
-    qemu_rdma_move_header(rdma, idx, head);
+    qemu_rdma_move_header(rdma, 0, head);
 
     /*
      * Post a new RECV work request to replace the one we just consumed.
      */
-    ret = qemu_rdma_post_recv_control(rdma, idx);
+    ret = qemu_rdma_post_recv_control(rdma, 0);
     if (ret) {
         fprintf(stderr, "rdma migration: error posting second control recv!");
         return ret;
@@ -1852,7 +1858,6 @@  err_rdma_source_init:
 
 static int qemu_rdma_connect(RDMAContext *rdma, Error **errp)
 {
-    RDMAControlHeader head;
     RDMACapabilities cap = {
                                 .version = RDMA_CONTROL_VERSION_CURRENT,
                                 .flags = 0,
@@ -1864,8 +1869,6 @@  static int qemu_rdma_connect(RDMAContext *rdma, Error **errp)
                                         };
     struct rdma_cm_event *cm_event;
     int ret;
-    int idx = 0;
-    int x;
 
     /*
      * Only negotiate the capability with destination if the user
@@ -1923,44 +1926,12 @@  static int qemu_rdma_connect(RDMAContext *rdma, Error **errp)
 
     rdma_ack_cm_event(cm_event);
 
-    ret = qemu_rdma_post_recv_control(rdma, idx + 1);
-    if (ret) {
-        ERROR(errp, "posting first control recv!\n");
-        goto err_rdma_source_connect;
-    }
-
-    ret = qemu_rdma_post_recv_control(rdma, idx);
+    ret = qemu_rdma_post_recv_control(rdma, 0);
     if (ret) {
         ERROR(errp, "posting second control recv!\n");
         goto err_rdma_source_connect;
     }
 
-    ret = qemu_rdma_exchange_get_response(rdma,
-                            &head, RDMA_CONTROL_RAM_BLOCKS, idx + 1);
-
-    if (ret < 0) {
-        ERROR(errp, "receiving remote info!\n");
-        goto err_rdma_source_connect;
-    }
-
-    qemu_rdma_move_header(rdma, idx + 1, &head);
-    memcpy(rdma->block, rdma->wr_data[idx + 1].control_curr, head.len);
-
-    ret = qemu_rdma_process_remote_blocks(rdma,
-                        (head.len / sizeof(RDMARemoteBlock)), errp);
-    if (ret) {
-        goto err_rdma_source_connect;
-    }
-
-    if (!rdma->pin_all) {
-        for (x = 0; x < rdma->local_ram_blocks.num_blocks; x++) {
-            RDMALocalBlock *block = &(rdma->local_ram_blocks.block[x]);
-            int num_chunks = ram_chunk_count(block);
-            /* allocate memory to store remote rkeys */
-            block->remote_keys = g_malloc0(num_chunks * sizeof(uint32_t));
-        }
-    }
-
     rdma->control_ready_expected = 1;
     rdma->num_signaled_send = 0;
     return 0;
@@ -2321,11 +2292,6 @@  static size_t qemu_rdma_save_page(QEMUFile *f, void *opaque,
 
 static int qemu_rdma_accept(RDMAContext *rdma)
 {
-    RDMAControlHeader head = { .len = rdma->local_ram_blocks.num_blocks *
-                                        sizeof(RDMARemoteBlock),
-                               .type = RDMA_CONTROL_RAM_BLOCKS,
-                               .repeat = 1,
-                             };
     RDMACapabilities cap;
     struct rdma_conn_param conn_param = {
                                             .responder_resources = 2,
@@ -2335,8 +2301,6 @@  static int qemu_rdma_accept(RDMAContext *rdma)
     struct rdma_cm_event *cm_event;
     struct ibv_context *verbs;
     int ret = -EINVAL;
-    RDMALocalBlocks *local = &rdma->local_ram_blocks;
-    int i;
 
     ret = rdma_get_cm_event(rdma->channel, &cm_event);
     if (ret) {
@@ -2434,41 +2398,6 @@  static int qemu_rdma_accept(RDMAContext *rdma)
         goto err_rdma_dest_wait;
     }
 
-    if (rdma->pin_all) {
-        ret = qemu_rdma_reg_whole_ram_blocks(rdma, &rdma->local_ram_blocks);
-        if (ret) {
-            fprintf(stderr, "rdma migration: error dest "
-                            "registering ram blocks!\n");
-            goto err_rdma_dest_wait;
-        }
-    }
-
-    /*
-     * Server uses this to prepare to transmit the RAMBlock descriptions
-     * to the primary VM after connection setup.
-     * Both sides use the "remote" structure to communicate and update
-     * their "local" descriptions with what was sent.
-     */
-    for (i = 0; i < local->num_blocks; i++) {
-            rdma->block[i].remote_host_addr =
-                (uint64_t)(local->block[i].local_host_addr);
-
-            if (rdma->pin_all) {
-                rdma->block[i].remote_rkey = local->block[i].mr->rkey;
-            }
-
-            rdma->block[i].offset = local->block[i].offset;
-            rdma->block[i].length = local->block[i].length;
-    }
-
-
-    ret = qemu_rdma_post_send_control(rdma, (uint8_t *) rdma->block, &head);
-
-    if (ret < 0) {
-        fprintf(stderr, "rdma migration: error sending remote info!\n");
-        goto err_rdma_dest_wait;
-    }
-
     qemu_rdma_dump_gid("dest_connect", rdma->cm_id);
 
     return 0;
@@ -2495,8 +2424,10 @@  static int qemu_rdma_registration_handle(QEMUFile *f, void *opaque,
                                .type = RDMA_CONTROL_REGISTER_RESULT,
                                .repeat = 0,
                              };
+    RDMAControlHeader blocks = { .type = RDMA_CONTROL_RAM_BLOCKS_RESULT, .repeat = 1 };
     QEMUFileRDMA *rfile = opaque;
     RDMAContext *rdma = rfile->rdma;
+    RDMALocalBlocks *local = &rdma->local_ram_blocks;
     RDMAControlHeader head;
     RDMARegister *reg, *registers;
     RDMACompress *comp;
@@ -2507,13 +2438,10 @@  static int qemu_rdma_registration_handle(QEMUFile *f, void *opaque,
     int ret = 0;
     int idx = 0;
     int count = 0;
+    int i = 0;
 
     CHECK_ERROR_STATE();
 
-    if (rdma->pin_all) {
-        return 0;
-    }
-
     do {
         DDDPRINTF("Waiting for next registration %" PRIu64 "...\n", flags);
 
@@ -2545,9 +2473,53 @@  static int qemu_rdma_registration_handle(QEMUFile *f, void *opaque,
 
             ram_handle_compressed(host_addr, comp->value, comp->length);
             break;
+
         case RDMA_CONTROL_REGISTER_FINISHED:
             DDDPRINTF("Current registrations complete.\n");
             goto out;
+
+        case RDMA_CONTROL_RAM_BLOCKS_REQUEST:
+            DPRINTF("Initial setup info requested.\n");
+
+            if (rdma->pin_all) {
+                ret = qemu_rdma_reg_whole_ram_blocks(rdma, &rdma->local_ram_blocks);
+                if (ret) {
+                    fprintf(stderr, "rdma migration: error dest "
+                                    "registering ram blocks!\n");
+                    goto out;
+                }
+            }
+
+            /*
+             * Dest uses this to prepare to transmit the RAMBlock descriptions
+             * to the primary VM after connection setup.
+             * Both sides use the "remote" structure to communicate and update
+             * their "local" descriptions with what was sent.
+             */
+            for (i = 0; i < local->num_blocks; i++) {
+                rdma->block[i].remote_host_addr =
+                    (uint64_t)(local->block[i].local_host_addr);
+
+                if (rdma->pin_all) {
+                    rdma->block[i].remote_rkey = local->block[i].mr->rkey;
+                }
+
+                rdma->block[i].offset = local->block[i].offset;
+                rdma->block[i].length = local->block[i].length;
+            }
+
+            blocks.len = rdma->local_ram_blocks.num_blocks
+                                                * sizeof(RDMARemoteBlock);
+
+            ret = qemu_rdma_post_send_control(rdma,
+                                        (uint8_t *) rdma->block, &blocks);
+
+            if (ret < 0) {
+                fprintf(stderr, "rdma migration: error sending remote info!\n");
+                goto out;
+            }
+
+            break;
         case RDMA_CONTROL_REGISTER_REQUEST:
             DDPRINTF("There are %d registration requests\n", head.repeat);
 
@@ -2610,10 +2582,6 @@  static int qemu_rdma_registration_start(QEMUFile *f, void *opaque,
 
     CHECK_ERROR_STATE();
 
-    if (rdma->pin_all) {
-        return 0;
-    }
-
     DDDPRINTF("start section: %" PRIu64 "\n", flags);
     qemu_put_be64(f, RAM_SAVE_FLAG_HOOK);
     qemu_fflush(f);
@@ -2628,12 +2596,12 @@  static int qemu_rdma_registration_start(QEMUFile *f, void *opaque,
 static int qemu_rdma_registration_stop(QEMUFile *f, void *opaque,
                                        uint64_t flags)
 {
+    Error *local_err = NULL, **errp = &local_err;
     QEMUFileRDMA *rfile = opaque;
     RDMAContext *rdma = rfile->rdma;
-    RDMAControlHeader head = { .len = 0,
-                               .type = RDMA_CONTROL_REGISTER_FINISHED,
-                               .repeat = 1,
-                             };
+    RDMAControlHeader head = { .len = 0, .repeat = 1 };
+    RDMAControlHeader resp = {.type = RDMA_CONTROL_RAM_BLOCKS_RESULT };
+    int reg_result_idx;
     int ret = 0;
 
     CHECK_ERROR_STATE();
@@ -2645,11 +2613,46 @@  static int qemu_rdma_registration_stop(QEMUFile *f, void *opaque,
         goto err;
     }
 
-    if (rdma->pin_all) {
-        return 0;
+    if (flags == RAM_CONTROL_SETUP) {
+        head.type = RDMA_CONTROL_RAM_BLOCKS_REQUEST;
+        DPRINTF("Sending registration setup for ram blocks...\n");
+
+        ret = qemu_rdma_exchange_send(rdma, &head, NULL, &resp, &reg_result_idx);
+        if (ret < 0) {
+            ERROR(errp, "receiving remote info!\n");
+            return ret;
+        }
+
+        qemu_rdma_move_header(rdma, reg_result_idx, &resp);
+        memcpy(rdma->block, rdma->wr_data[reg_result_idx].control_curr, resp.len);
+
+        ret = qemu_rdma_process_remote_blocks(rdma,
+                        (resp.len / sizeof(RDMARemoteBlock)), errp);
+        if (ret) {
+            ERROR(errp, "processing remote blocks!\n");
+            return ret;
+        }
+
+        if (rdma->pin_all) {
+            ret = qemu_rdma_reg_whole_ram_blocks(rdma, &rdma->local_ram_blocks);
+            if (ret) {
+                fprintf(stderr, "rdma migration: error source "
+                                "registering ram blocks!\n");
+                return ret;
+            }
+        } else {
+            int x = 0;
+            for (x = 0; x < rdma->local_ram_blocks.num_blocks; x++) {
+                RDMALocalBlock *block = &(rdma->local_ram_blocks.block[x]);
+                int num_chunks = ram_chunk_count(block);
+                block->remote_keys = g_malloc0(num_chunks * sizeof(uint32_t));
+            }
+        }
     }
 
     DDDPRINTF("Sending registration finish %" PRIu64 "...\n", flags);
+
+    head.type = RDMA_CONTROL_REGISTER_FINISHED;
     ret = qemu_rdma_exchange_send(rdma, &head, NULL, NULL, NULL);
 
     if (ret < 0) {