diff mbox

[v9,14/14] rdma: add pin-all accounting timestamp to QMP statistics

Message ID 1371242153-11262-15-git-send-email-mrhines@linux.vnet.ibm.com
State New
Headers show

Commit Message

mrhines@linux.vnet.ibm.com June 14, 2013, 8:35 p.m. UTC
From: "Michael R. Hines" <mrhines@us.ibm.com>

For very large virtual machines, pinning can take a long time.
While this does not affect the migration's *actual* time itself,
it is still important for the user to know what's going on and
to know what component of the total time is actual taken up by
pinning.

For example, using a 14GB virtual machine, pinning can take as
long as 5 seconds, for which the user would not otherwise know
what was happening.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 hmp.c                         |    4 +++
 include/migration/migration.h |    1 +
 migration-rdma.c              |   55 +++++++++++++++++++++++++++--------------
 migration.c                   |   13 +++++++++-
 qapi-schema.json              |    3 ++-
 5 files changed, 56 insertions(+), 20 deletions(-)

Comments

Eric Blake June 15, 2013, 9:06 a.m. UTC | #1
On 06/14/2013 09:35 PM, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> For very large virtual machines, pinning can take a long time.
> While this does not affect the migration's *actual* time itself,
> it is still important for the user to know what's going on and
> to know what component of the total time is actual taken up by
> pinning.
> 
> For example, using a 14GB virtual machine, pinning can take as
> long as 5 seconds, for which the user would not otherwise know
> what was happening.
> 
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---

> +++ b/qapi-schema.json
> @@ -586,7 +586,8 @@
>             '*xbzrle-cache': 'XBZRLECacheStats',
>             '*total-time': 'int',
>             '*expected-downtime': 'int',
> -           '*downtime': 'int'} }
> +           '*downtime': 'int',
> +           '*pin-all-time' : 'int'} }

Missing documentation of the new field (compare how @downtime is
documented as a (since 1.3) field).
Paolo Bonzini June 15, 2013, 5:32 p.m. UTC | #2
Il 15/06/2013 05:06, Eric Blake ha scritto:
> On 06/14/2013 09:35 PM, mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>> 
>> For very large virtual machines, pinning can take a long time. 
>> While this does not affect the migration's *actual* time itself, 
>> it is still important for the user to know what's going on and to
>> know what component of the total time is actual taken up by 
>> pinning.
>> 
>> For example, using a 14GB virtual machine, pinning can take as 
>> long as 5 seconds, for which the user would not otherwise know 
>> what was happening.
>> 
>> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
>> Michael R. Hines <mrhines@us.ibm.com> ---
> 
>> +++ b/qapi-schema.json @@ -586,7 +586,8 @@ '*xbzrle-cache':
>> 'XBZRLECacheStats', '*total-time': 'int', '*expected-downtime':
>> 'int', -           '*downtime': 'int'} } +           '*downtime':
>> 'int', +           '*pin-all-time' : 'int'} }
> 
> Missing documentation of the new field (compare how @downtime is 
> documented as a (since 1.3) field).
> 

Can you just time the setup phase instead?

Paolo
mrhines@linux.vnet.ibm.com June 16, 2013, 4:05 a.m. UTC | #3
On 06/15/2013 05:06 AM, Eric Blake wrote:
> On 06/14/2013 09:35 PM, mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> For very large virtual machines, pinning can take a long time.
>> While this does not affect the migration's *actual* time itself,
>> it is still important for the user to know what's going on and
>> to know what component of the total time is actual taken up by
>> pinning.
>>
>> For example, using a 14GB virtual machine, pinning can take as
>> long as 5 seconds, for which the user would not otherwise know
>> what was happening.
>>
>> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>> ---
>> +++ b/qapi-schema.json
>> @@ -586,7 +586,8 @@
>>              '*xbzrle-cache': 'XBZRLECacheStats',
>>              '*total-time': 'int',
>>              '*expected-downtime': 'int',
>> -           '*downtime': 'int'} }
>> +           '*downtime': 'int',
>> +           '*pin-all-time' : 'int'} }
> Missing documentation of the new field (compare how @downtime is
> documented as a (since 1.3) field).
>

Acknowledged....
mrhines@linux.vnet.ibm.com June 16, 2013, 4:06 a.m. UTC | #4
On 06/15/2013 01:32 PM, Paolo Bonzini wrote:
> Il 15/06/2013 05:06, Eric Blake ha scritto:
>> On 06/14/2013 09:35 PM, mrhines@linux.vnet.ibm.com wrote:
>>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>>
>>> For very large virtual machines, pinning can take a long time.
>>> While this does not affect the migration's *actual* time itself,
>>> it is still important for the user to know what's going on and to
>>> know what component of the total time is actual taken up by
>>> pinning.
>>>
>>> For example, using a 14GB virtual machine, pinning can take as
>>> long as 5 seconds, for which the user would not otherwise know
>>> what was happening.
>>>
>>> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by:
>>> Michael R. Hines <mrhines@us.ibm.com> ---
>>> +++ b/qapi-schema.json @@ -586,7 +586,8 @@ '*xbzrle-cache':
>>> 'XBZRLECacheStats', '*total-time': 'int', '*expected-downtime':
>>> 'int', -           '*downtime': 'int'} } +           '*downtime':
>>> 'int', +           '*pin-all-time' : 'int'} }
>> Missing documentation of the new field (compare how @downtime is
>> documented as a (since 1.3) field).
>>
> Can you just time the setup phase instead?
>
> Paolo
>

That's a great idea..... better to make it a more generic time.

I'll send out a V10 with everybody's suggestions.

- Michael
Chegu Vinod June 16, 2013, 6:37 p.m. UTC | #5
On 6/14/2013 1:35 PM, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> For very large virtual machines, pinning can take a long time.
> While this does not affect the migration's *actual* time itself,
> it is still important for the user to know what's going on and
> to know what component of the total time is actual taken up by
> pinning.
>
> For example, using a 14GB virtual machine, pinning can take as
> long as 5 seconds, for which the user would not otherwise know
> what was happening.
>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>   hmp.c                         |    4 +++
>   include/migration/migration.h |    1 +
>   migration-rdma.c              |   55 +++++++++++++++++++++++++++--------------
>   migration.c                   |   13 +++++++++-
>   qapi-schema.json              |    3 ++-
>   5 files changed, 56 insertions(+), 20 deletions(-)
>
> diff --git a/hmp.c b/hmp.c
> index 148a3fb..90c55f2 100644
> --- a/hmp.c
> +++ b/hmp.c
> @@ -164,6 +164,10 @@ void hmp_info_migrate(Monitor *mon, const QDict *qdict)
>               monitor_printf(mon, "downtime: %" PRIu64 " milliseconds\n",
>                              info->downtime);
>           }
> +        if (info->has_pin_all_time) {
> +            monitor_printf(mon, "pin-all: %" PRIu64 " milliseconds\n",
> +                           info->pin_all_time);
> +        }
>       }
>   
>       if (info->has_ram) {
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index b49e68b..d2ca75b 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -49,6 +49,7 @@ struct MigrationState
>       bool enabled_capabilities[MIGRATION_CAPABILITY_MAX];
>       int64_t xbzrle_cache_size;
>       double mbps;
> +    int64_t pin_all_time;
>   };
>   
>   void process_incoming_migration(QEMUFile *f);
> diff --git a/migration-rdma.c b/migration-rdma.c
> index 853de18..e407dce 100644
> --- a/migration-rdma.c
> +++ b/migration-rdma.c
> @@ -699,11 +699,11 @@ static int qemu_rdma_alloc_qp(RDMAContext *rdma)
>       return 0;
>   }
>   
> -static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma,
> -                                RDMALocalBlocks *rdma_local_ram_blocks)
> +static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
>   {
>       int i;
> -    uint64_t start = qemu_get_clock_ms(rt_clock);
> +    int64_t start = qemu_get_clock_ms(host_clock);
> +    RDMALocalBlocks *rdma_local_ram_blocks = &rdma->local_ram_blocks;
>       (void)start;
>   
>       for (i = 0; i < rdma_local_ram_blocks->num_blocks; i++) {
> @@ -721,7 +721,8 @@ static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma,
>           rdma->total_registrations++;
>       }
>   
> -    DPRINTF("lock time: %" PRIu64 "\n", qemu_get_clock_ms(rt_clock) - start);
> +    DPRINTF("local lock time: %" PRId64 "\n",
> +        qemu_get_clock_ms(host_clock) - start);
>   
>       if (i >= rdma_local_ram_blocks->num_blocks) {
>           return 0;
> @@ -1262,7 +1263,8 @@ static void qemu_rdma_move_header(RDMAContext *rdma, int idx,
>    */
>   static int qemu_rdma_exchange_send(RDMAContext *rdma, RDMAControlHeader *head,
>                                      uint8_t *data, RDMAControlHeader *resp,
> -                                   int *resp_idx)
> +                                   int *resp_idx,
> +                                   int (*callback)(RDMAContext *rdma))
>   {
>       int ret = 0;
>       int idx = 0;
> @@ -1315,6 +1317,14 @@ static int qemu_rdma_exchange_send(RDMAContext *rdma, RDMAControlHeader *head,
>        * If we're expecting a response, block and wait for it.
>        */
>       if (resp) {
> +        if (callback) {
> +            DPRINTF("Issuing callback before receiving response...\n");
> +            ret = callback(rdma);
> +            if (ret < 0) {
> +                return ret;
> +            }
> +        }
> +
>           DDPRINTF("Waiting for response %s\n", control_desc[resp->type]);
>           ret = qemu_rdma_exchange_get_response(rdma, resp, resp->type, idx + 1);
>   
> @@ -1464,7 +1474,7 @@ static int qemu_rdma_write_one(QEMUFile *f, RDMAContext *rdma,
>                       chunk, sge.length, current_index, offset);
>   
>                   ret = qemu_rdma_exchange_send(rdma, &head,
> -                                (uint8_t *) &comp, NULL, NULL);
> +                                (uint8_t *) &comp, NULL, NULL, NULL);
>   
>                   if (ret < 0) {
>                       return -EIO;
> @@ -1487,7 +1497,7 @@ static int qemu_rdma_write_one(QEMUFile *f, RDMAContext *rdma,
>                       chunk, sge.length, current_index, offset);
>   
>               ret = qemu_rdma_exchange_send(rdma, &head, (uint8_t *) &reg,
> -                                    &resp, &reg_result_idx);
> +                                    &resp, &reg_result_idx, NULL);
>               if (ret < 0) {
>                   return ret;
>               }
> @@ -2126,7 +2136,7 @@ static int qemu_rdma_put_buffer(void *opaque, const uint8_t *buf,
>           head.len = r->len;
>           head.type = RDMA_CONTROL_QEMU_FILE;
>   
> -        ret = qemu_rdma_exchange_send(rdma, &head, data, NULL, NULL);
> +        ret = qemu_rdma_exchange_send(rdma, &head, data, NULL, NULL, NULL);
>   
>           if (ret < 0) {
>               rdma->error_state = ret;
> @@ -2482,7 +2492,7 @@ static int qemu_rdma_registration_handle(QEMUFile *f, void *opaque,
>               DPRINTF("Initial setup info requested.\n");
>   
>               if (rdma->pin_all) {
> -                ret = qemu_rdma_reg_whole_ram_blocks(rdma, &rdma->local_ram_blocks);
> +                ret = qemu_rdma_reg_whole_ram_blocks(rdma);
>                   if (ret) {
>                       fprintf(stderr, "rdma migration: error dest "
>                                       "registering ram blocks!\n");
> @@ -2614,10 +2624,22 @@ static int qemu_rdma_registration_stop(QEMUFile *f, void *opaque,
>       }
>   
>       if (flags == RAM_CONTROL_SETUP) {
> +        int64_t start = qemu_get_clock_ms(host_clock);
> +        MigrationState *s = migrate_get_current();
> +
>           head.type = RDMA_CONTROL_RAM_BLOCKS_REQUEST;
>           DPRINTF("Sending registration setup for ram blocks...\n");
>   
> -        ret = qemu_rdma_exchange_send(rdma, &head, NULL, &resp, &reg_result_idx);
> +        /*
> +         * Make sure that we parallelize the pinning on both size.
> +         * For very large guests, doing this serially takes a really
> +         * long time, so we have to 'interleave' the pinning locally
> +         * by performing it before we receive the response from the
> +         * destination that the pinning has completed.
> +         */
> +        ret = qemu_rdma_exchange_send(rdma, &head, NULL, &resp,
> +                    &reg_result_idx, rdma->pin_all ?
> +                    qemu_rdma_reg_whole_ram_blocks : NULL);
>           if (ret < 0) {
>               ERROR(errp, "receiving remote info!\n");
>               return ret;
> @@ -2634,12 +2656,7 @@ static int qemu_rdma_registration_stop(QEMUFile *f, void *opaque,
>           }
>   
>           if (rdma->pin_all) {
> -            ret = qemu_rdma_reg_whole_ram_blocks(rdma, &rdma->local_ram_blocks);
> -            if (ret) {
> -                fprintf(stderr, "rdma migration: error source "
> -                                "registering ram blocks!\n");
> -                return ret;
> -            }
> +            s->pin_all_time = qemu_get_clock_ms(host_clock) - start;
>           } else {
>               int x = 0;
>               for (x = 0; x < rdma->local_ram_blocks.num_blocks; x++) {
> @@ -2653,7 +2670,7 @@ static int qemu_rdma_registration_stop(QEMUFile *f, void *opaque,
>       DDDPRINTF("Sending registration finish %" PRIu64 "...\n", flags);
>   
>       head.type = RDMA_CONTROL_REGISTER_FINISHED;
> -    ret = qemu_rdma_exchange_send(rdma, &head, NULL, NULL, NULL);
> +    ret = qemu_rdma_exchange_send(rdma, &head, NULL, NULL, NULL, NULL);
>   
>       if (ret < 0) {
>           goto err;
> @@ -2801,8 +2818,10 @@ void rdma_start_outgoing_migration(void *opaque,
>   
>       DPRINTF("qemu_rdma_source_connect success\n");
>   
> +    if (rdma->pin_all) {
> +        s->pin_all_time = 0;
> +    }
>       s->file = qemu_fopen_rdma(rdma, "wb");
> -    s->total_time = qemu_get_clock_ms(rt_clock);
>       migrate_fd_connect(s);
>       return;
>   err:
> diff --git a/migration.c b/migration.c
> index 62c6b85..17531e9 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -67,6 +67,7 @@ MigrationState *migrate_get_current(void)
>           .bandwidth_limit = MAX_THROTTLE,
>           .xbzrle_cache_size = DEFAULT_MIGRATE_CACHE_SIZE,
>           .mbps = -1,
> +        .pin_all_time = -1,
>       };
>   
>       return &current_migration;
> @@ -189,7 +190,7 @@ MigrationInfo *qmp_query_migrate(Error **errp)
>           break;
>       case MIG_STATE_ACTIVE:
>           info->has_status = true;
> -        info->status = g_strdup("active");
> +        info->status = g_strdup(s->pin_all_time != 0 ?  "active" : "pinning");

Pl. check the logic for setting the migration status correctly. In the 
current logic where you rely on "s->pin_all_time" if we did non-rdma 
migration the migration status will remain in "pinning" (instead of 
"active") till the migration completes....

Thx
Vinod

>           info->has_total_time = true;
>           info->total_time = qemu_get_clock_ms(rt_clock)
>               - s->total_time;
> @@ -208,6 +209,11 @@ MigrationInfo *qmp_query_migrate(Error **errp)
>           info->ram->dirty_pages_rate = s->dirty_pages_rate;
>           info->ram->mbps = s->mbps;
>   
> +        if (s->pin_all_time >= 0) {
> +            info->has_pin_all_time = true;
> +            info->pin_all_time = s->pin_all_time;
> +        }
> +
>           if (blk_mig_active()) {
>               info->has_disk = true;
>               info->disk = g_malloc0(sizeof(*info->disk));
> @@ -227,6 +233,11 @@ MigrationInfo *qmp_query_migrate(Error **errp)
>           info->has_downtime = true;
>           info->downtime = s->downtime;
>   
> +        if (s->pin_all_time >= 0) {
> +            info->has_pin_all_time = true;
> +            info->pin_all_time = s->pin_all_time;
> +        }
> +
>           info->has_ram = true;
>           info->ram = g_malloc0(sizeof(*info->ram));
>           info->ram->transferred = ram_bytes_transferred();
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 4e9cd6d..dfae59a 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -586,7 +586,8 @@
>              '*xbzrle-cache': 'XBZRLECacheStats',
>              '*total-time': 'int',
>              '*expected-downtime': 'int',
> -           '*downtime': 'int'} }
> +           '*downtime': 'int',
> +           '*pin-all-time' : 'int'} }
>   
>   ##
>   # @query-migrate
mrhines@linux.vnet.ibm.com June 18, 2013, 8:17 p.m. UTC | #6
On 06/16/2013 02:37 PM, Chegu Vinod wrote:
> On 6/14/2013 1:35 PM, mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> For very large virtual machines, pinning can take a long time.
>> While this does not affect the migration's *actual* time itself,
>> it is still important for the user to know what's going on and
>> to know what component of the total time is actual taken up by
>> pinning.
>>
>> For example, using a 14GB virtual machine, pinning can take as
>> long as 5 seconds, for which the user would not otherwise know
>> what was happening.
>>
>> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>> ---
>>   hmp.c                         |    4 +++
>>   include/migration/migration.h |    1 +
>>   migration-rdma.c              |   55 
>> +++++++++++++++++++++++++++--------------
>>   migration.c                   |   13 +++++++++-
>>   qapi-schema.json              |    3 ++-
>>   5 files changed, 56 insertions(+), 20 deletions(-)
>>
>> diff --git a/hmp.c b/hmp.c
>> index 148a3fb..90c55f2 100644
>> --- a/hmp.c
>> +++ b/hmp.c
>> @@ -164,6 +164,10 @@ void hmp_info_migrate(Monitor *mon, const QDict 
>> *qdict)
>>               monitor_printf(mon, "downtime: %" PRIu64 " 
>> milliseconds\n",
>>                              info->downtime);
>>           }
>> +        if (info->has_pin_all_time) {
>> +            monitor_printf(mon, "pin-all: %" PRIu64 " milliseconds\n",
>> +                           info->pin_all_time);
>> +        }
>>       }
>>         if (info->has_ram) {
>> diff --git a/include/migration/migration.h 
>> b/include/migration/migration.h
>> index b49e68b..d2ca75b 100644
>> --- a/include/migration/migration.h
>> +++ b/include/migration/migration.h
>> @@ -49,6 +49,7 @@ struct MigrationState
>>       bool enabled_capabilities[MIGRATION_CAPABILITY_MAX];
>>       int64_t xbzrle_cache_size;
>>       double mbps;
>> +    int64_t pin_all_time;
>>   };
>>     void process_incoming_migration(QEMUFile *f);
>> diff --git a/migration-rdma.c b/migration-rdma.c
>> index 853de18..e407dce 100644
>> --- a/migration-rdma.c
>> +++ b/migration-rdma.c
>> @@ -699,11 +699,11 @@ static int qemu_rdma_alloc_qp(RDMAContext *rdma)
>>       return 0;
>>   }
>>   -static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma,
>> -                                RDMALocalBlocks *rdma_local_ram_blocks)
>> +static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
>>   {
>>       int i;
>> -    uint64_t start = qemu_get_clock_ms(rt_clock);
>> +    int64_t start = qemu_get_clock_ms(host_clock);
>> +    RDMALocalBlocks *rdma_local_ram_blocks = &rdma->local_ram_blocks;
>>       (void)start;
>>         for (i = 0; i < rdma_local_ram_blocks->num_blocks; i++) {
>> @@ -721,7 +721,8 @@ static int 
>> qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma,
>>           rdma->total_registrations++;
>>       }
>>   -    DPRINTF("lock time: %" PRIu64 "\n", 
>> qemu_get_clock_ms(rt_clock) - start);
>> +    DPRINTF("local lock time: %" PRId64 "\n",
>> +        qemu_get_clock_ms(host_clock) - start);
>>         if (i >= rdma_local_ram_blocks->num_blocks) {
>>           return 0;
>> @@ -1262,7 +1263,8 @@ static void qemu_rdma_move_header(RDMAContext 
>> *rdma, int idx,
>>    */
>>   static int qemu_rdma_exchange_send(RDMAContext *rdma, 
>> RDMAControlHeader *head,
>>                                      uint8_t *data, RDMAControlHeader 
>> *resp,
>> -                                   int *resp_idx)
>> +                                   int *resp_idx,
>> +                                   int (*callback)(RDMAContext *rdma))
>>   {
>>       int ret = 0;
>>       int idx = 0;
>> @@ -1315,6 +1317,14 @@ static int qemu_rdma_exchange_send(RDMAContext 
>> *rdma, RDMAControlHeader *head,
>>        * If we're expecting a response, block and wait for it.
>>        */
>>       if (resp) {
>> +        if (callback) {
>> +            DPRINTF("Issuing callback before receiving response...\n");
>> +            ret = callback(rdma);
>> +            if (ret < 0) {
>> +                return ret;
>> +            }
>> +        }
>> +
>>           DDPRINTF("Waiting for response %s\n", 
>> control_desc[resp->type]);
>>           ret = qemu_rdma_exchange_get_response(rdma, resp, 
>> resp->type, idx + 1);
>>   @@ -1464,7 +1474,7 @@ static int qemu_rdma_write_one(QEMUFile *f, 
>> RDMAContext *rdma,
>>                       chunk, sge.length, current_index, offset);
>>                     ret = qemu_rdma_exchange_send(rdma, &head,
>> -                                (uint8_t *) &comp, NULL, NULL);
>> +                                (uint8_t *) &comp, NULL, NULL, NULL);
>>                     if (ret < 0) {
>>                       return -EIO;
>> @@ -1487,7 +1497,7 @@ static int qemu_rdma_write_one(QEMUFile *f, 
>> RDMAContext *rdma,
>>                       chunk, sge.length, current_index, offset);
>>                 ret = qemu_rdma_exchange_send(rdma, &head, (uint8_t 
>> *) &reg,
>> -                                    &resp, &reg_result_idx);
>> +                                    &resp, &reg_result_idx, NULL);
>>               if (ret < 0) {
>>                   return ret;
>>               }
>> @@ -2126,7 +2136,7 @@ static int qemu_rdma_put_buffer(void *opaque, 
>> const uint8_t *buf,
>>           head.len = r->len;
>>           head.type = RDMA_CONTROL_QEMU_FILE;
>>   -        ret = qemu_rdma_exchange_send(rdma, &head, data, NULL, NULL);
>> +        ret = qemu_rdma_exchange_send(rdma, &head, data, NULL, NULL, 
>> NULL);
>>             if (ret < 0) {
>>               rdma->error_state = ret;
>> @@ -2482,7 +2492,7 @@ static int 
>> qemu_rdma_registration_handle(QEMUFile *f, void *opaque,
>>               DPRINTF("Initial setup info requested.\n");
>>                 if (rdma->pin_all) {
>> -                ret = qemu_rdma_reg_whole_ram_blocks(rdma, 
>> &rdma->local_ram_blocks);
>> +                ret = qemu_rdma_reg_whole_ram_blocks(rdma);
>>                   if (ret) {
>>                       fprintf(stderr, "rdma migration: error dest "
>>                                       "registering ram blocks!\n");
>> @@ -2614,10 +2624,22 @@ static int 
>> qemu_rdma_registration_stop(QEMUFile *f, void *opaque,
>>       }
>>         if (flags == RAM_CONTROL_SETUP) {
>> +        int64_t start = qemu_get_clock_ms(host_clock);
>> +        MigrationState *s = migrate_get_current();
>> +
>>           head.type = RDMA_CONTROL_RAM_BLOCKS_REQUEST;
>>           DPRINTF("Sending registration setup for ram blocks...\n");
>>   -        ret = qemu_rdma_exchange_send(rdma, &head, NULL, &resp, 
>> &reg_result_idx);
>> +        /*
>> +         * Make sure that we parallelize the pinning on both size.
>> +         * For very large guests, doing this serially takes a really
>> +         * long time, so we have to 'interleave' the pinning locally
>> +         * by performing it before we receive the response from the
>> +         * destination that the pinning has completed.
>> +         */
>> +        ret = qemu_rdma_exchange_send(rdma, &head, NULL, &resp,
>> +                    &reg_result_idx, rdma->pin_all ?
>> +                    qemu_rdma_reg_whole_ram_blocks : NULL);
>>           if (ret < 0) {
>>               ERROR(errp, "receiving remote info!\n");
>>               return ret;
>> @@ -2634,12 +2656,7 @@ static int 
>> qemu_rdma_registration_stop(QEMUFile *f, void *opaque,
>>           }
>>             if (rdma->pin_all) {
>> -            ret = qemu_rdma_reg_whole_ram_blocks(rdma, 
>> &rdma->local_ram_blocks);
>> -            if (ret) {
>> -                fprintf(stderr, "rdma migration: error source "
>> -                                "registering ram blocks!\n");
>> -                return ret;
>> -            }
>> +            s->pin_all_time = qemu_get_clock_ms(host_clock) - start;
>>           } else {
>>               int x = 0;
>>               for (x = 0; x < rdma->local_ram_blocks.num_blocks; x++) {
>> @@ -2653,7 +2670,7 @@ static int qemu_rdma_registration_stop(QEMUFile 
>> *f, void *opaque,
>>       DDDPRINTF("Sending registration finish %" PRIu64 "...\n", flags);
>>         head.type = RDMA_CONTROL_REGISTER_FINISHED;
>> -    ret = qemu_rdma_exchange_send(rdma, &head, NULL, NULL, NULL);
>> +    ret = qemu_rdma_exchange_send(rdma, &head, NULL, NULL, NULL, NULL);
>>         if (ret < 0) {
>>           goto err;
>> @@ -2801,8 +2818,10 @@ void rdma_start_outgoing_migration(void *opaque,
>>         DPRINTF("qemu_rdma_source_connect success\n");
>>   +    if (rdma->pin_all) {
>> +        s->pin_all_time = 0;
>> +    }
>>       s->file = qemu_fopen_rdma(rdma, "wb");
>> -    s->total_time = qemu_get_clock_ms(rt_clock);
>>       migrate_fd_connect(s);
>>       return;
>>   err:
>> diff --git a/migration.c b/migration.c
>> index 62c6b85..17531e9 100644
>> --- a/migration.c
>> +++ b/migration.c
>> @@ -67,6 +67,7 @@ MigrationState *migrate_get_current(void)
>>           .bandwidth_limit = MAX_THROTTLE,
>>           .xbzrle_cache_size = DEFAULT_MIGRATE_CACHE_SIZE,
>>           .mbps = -1,
>> +        .pin_all_time = -1,
>>       };
>>         return &current_migration;
>> @@ -189,7 +190,7 @@ MigrationInfo *qmp_query_migrate(Error **errp)
>>           break;
>>       case MIG_STATE_ACTIVE:
>>           info->has_status = true;
>> -        info->status = g_strdup("active");
>> +        info->status = g_strdup(s->pin_all_time != 0 ? "active" : 
>> "pinning");
>
> Pl. check the logic for setting the migration status correctly. In the 
> current logic where you rely on "s->pin_all_time" if we did non-rdma 
> migration the migration status will remain in "pinning" (instead of 
> "active") till the migration completes....
>
> Thx
> Vinod

I just sent out a V10 which I have also validated with a separate patch 
to libvirt. (I'm maintaining a sister patch for RDMA to libvirt on 
github as well, in case you need to use libvirt too).

- Michael
diff mbox

Patch

diff --git a/hmp.c b/hmp.c
index 148a3fb..90c55f2 100644
--- a/hmp.c
+++ b/hmp.c
@@ -164,6 +164,10 @@  void hmp_info_migrate(Monitor *mon, const QDict *qdict)
             monitor_printf(mon, "downtime: %" PRIu64 " milliseconds\n",
                            info->downtime);
         }
+        if (info->has_pin_all_time) {
+            monitor_printf(mon, "pin-all: %" PRIu64 " milliseconds\n",
+                           info->pin_all_time);
+        }
     }
 
     if (info->has_ram) {
diff --git a/include/migration/migration.h b/include/migration/migration.h
index b49e68b..d2ca75b 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -49,6 +49,7 @@  struct MigrationState
     bool enabled_capabilities[MIGRATION_CAPABILITY_MAX];
     int64_t xbzrle_cache_size;
     double mbps;
+    int64_t pin_all_time; 
 };
 
 void process_incoming_migration(QEMUFile *f);
diff --git a/migration-rdma.c b/migration-rdma.c
index 853de18..e407dce 100644
--- a/migration-rdma.c
+++ b/migration-rdma.c
@@ -699,11 +699,11 @@  static int qemu_rdma_alloc_qp(RDMAContext *rdma)
     return 0;
 }
 
-static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma,
-                                RDMALocalBlocks *rdma_local_ram_blocks)
+static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma)
 {
     int i;
-    uint64_t start = qemu_get_clock_ms(rt_clock);
+    int64_t start = qemu_get_clock_ms(host_clock);
+    RDMALocalBlocks *rdma_local_ram_blocks = &rdma->local_ram_blocks;
     (void)start;
 
     for (i = 0; i < rdma_local_ram_blocks->num_blocks; i++) {
@@ -721,7 +721,8 @@  static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma,
         rdma->total_registrations++;
     }
 
-    DPRINTF("lock time: %" PRIu64 "\n", qemu_get_clock_ms(rt_clock) - start);
+    DPRINTF("local lock time: %" PRId64 "\n", 
+        qemu_get_clock_ms(host_clock) - start);
 
     if (i >= rdma_local_ram_blocks->num_blocks) {
         return 0;
@@ -1262,7 +1263,8 @@  static void qemu_rdma_move_header(RDMAContext *rdma, int idx,
  */
 static int qemu_rdma_exchange_send(RDMAContext *rdma, RDMAControlHeader *head,
                                    uint8_t *data, RDMAControlHeader *resp,
-                                   int *resp_idx)
+                                   int *resp_idx,
+                                   int (*callback)(RDMAContext *rdma))
 {
     int ret = 0;
     int idx = 0;
@@ -1315,6 +1317,14 @@  static int qemu_rdma_exchange_send(RDMAContext *rdma, RDMAControlHeader *head,
      * If we're expecting a response, block and wait for it.
      */
     if (resp) {
+        if (callback) {
+            DPRINTF("Issuing callback before receiving response...\n");
+            ret = callback(rdma);
+            if (ret < 0) {
+                return ret;
+            }
+        }
+
         DDPRINTF("Waiting for response %s\n", control_desc[resp->type]);
         ret = qemu_rdma_exchange_get_response(rdma, resp, resp->type, idx + 1);
 
@@ -1464,7 +1474,7 @@  static int qemu_rdma_write_one(QEMUFile *f, RDMAContext *rdma,
                     chunk, sge.length, current_index, offset);
 
                 ret = qemu_rdma_exchange_send(rdma, &head,
-                                (uint8_t *) &comp, NULL, NULL);
+                                (uint8_t *) &comp, NULL, NULL, NULL);
 
                 if (ret < 0) {
                     return -EIO;
@@ -1487,7 +1497,7 @@  static int qemu_rdma_write_one(QEMUFile *f, RDMAContext *rdma,
                     chunk, sge.length, current_index, offset);
 
             ret = qemu_rdma_exchange_send(rdma, &head, (uint8_t *) &reg,
-                                    &resp, &reg_result_idx);
+                                    &resp, &reg_result_idx, NULL);
             if (ret < 0) {
                 return ret;
             }
@@ -2126,7 +2136,7 @@  static int qemu_rdma_put_buffer(void *opaque, const uint8_t *buf,
         head.len = r->len;
         head.type = RDMA_CONTROL_QEMU_FILE;
 
-        ret = qemu_rdma_exchange_send(rdma, &head, data, NULL, NULL);
+        ret = qemu_rdma_exchange_send(rdma, &head, data, NULL, NULL, NULL);
 
         if (ret < 0) {
             rdma->error_state = ret;
@@ -2482,7 +2492,7 @@  static int qemu_rdma_registration_handle(QEMUFile *f, void *opaque,
             DPRINTF("Initial setup info requested.\n");
 
             if (rdma->pin_all) {
-                ret = qemu_rdma_reg_whole_ram_blocks(rdma, &rdma->local_ram_blocks);
+                ret = qemu_rdma_reg_whole_ram_blocks(rdma);
                 if (ret) {
                     fprintf(stderr, "rdma migration: error dest "
                                     "registering ram blocks!\n");
@@ -2614,10 +2624,22 @@  static int qemu_rdma_registration_stop(QEMUFile *f, void *opaque,
     }
 
     if (flags == RAM_CONTROL_SETUP) {
+        int64_t start = qemu_get_clock_ms(host_clock);
+        MigrationState *s = migrate_get_current();
+
         head.type = RDMA_CONTROL_RAM_BLOCKS_REQUEST;
         DPRINTF("Sending registration setup for ram blocks...\n");
 
-        ret = qemu_rdma_exchange_send(rdma, &head, NULL, &resp, &reg_result_idx);
+        /*
+         * Make sure that we parallelize the pinning on both size.
+         * For very large guests, doing this serially takes a really
+         * long time, so we have to 'interleave' the pinning locally
+         * by performing it before we receive the response from the
+         * destination that the pinning has completed.
+         */
+        ret = qemu_rdma_exchange_send(rdma, &head, NULL, &resp, 
+                    &reg_result_idx, rdma->pin_all ?
+                    qemu_rdma_reg_whole_ram_blocks : NULL);
         if (ret < 0) {
             ERROR(errp, "receiving remote info!\n");
             return ret;
@@ -2634,12 +2656,7 @@  static int qemu_rdma_registration_stop(QEMUFile *f, void *opaque,
         }
 
         if (rdma->pin_all) {
-            ret = qemu_rdma_reg_whole_ram_blocks(rdma, &rdma->local_ram_blocks);
-            if (ret) {
-                fprintf(stderr, "rdma migration: error source "
-                                "registering ram blocks!\n");
-                return ret;
-            }
+            s->pin_all_time = qemu_get_clock_ms(host_clock) - start;
         } else {
             int x = 0;
             for (x = 0; x < rdma->local_ram_blocks.num_blocks; x++) {
@@ -2653,7 +2670,7 @@  static int qemu_rdma_registration_stop(QEMUFile *f, void *opaque,
     DDDPRINTF("Sending registration finish %" PRIu64 "...\n", flags);
 
     head.type = RDMA_CONTROL_REGISTER_FINISHED;
-    ret = qemu_rdma_exchange_send(rdma, &head, NULL, NULL, NULL);
+    ret = qemu_rdma_exchange_send(rdma, &head, NULL, NULL, NULL, NULL);
 
     if (ret < 0) {
         goto err;
@@ -2801,8 +2818,10 @@  void rdma_start_outgoing_migration(void *opaque,
 
     DPRINTF("qemu_rdma_source_connect success\n");
 
+    if (rdma->pin_all) {
+        s->pin_all_time = 0;
+    }
     s->file = qemu_fopen_rdma(rdma, "wb");
-    s->total_time = qemu_get_clock_ms(rt_clock);
     migrate_fd_connect(s);
     return;
 err:
diff --git a/migration.c b/migration.c
index 62c6b85..17531e9 100644
--- a/migration.c
+++ b/migration.c
@@ -67,6 +67,7 @@  MigrationState *migrate_get_current(void)
         .bandwidth_limit = MAX_THROTTLE,
         .xbzrle_cache_size = DEFAULT_MIGRATE_CACHE_SIZE,
         .mbps = -1,
+        .pin_all_time = -1,
     };
 
     return &current_migration;
@@ -189,7 +190,7 @@  MigrationInfo *qmp_query_migrate(Error **errp)
         break;
     case MIG_STATE_ACTIVE:
         info->has_status = true;
-        info->status = g_strdup("active");
+        info->status = g_strdup(s->pin_all_time != 0 ?  "active" : "pinning");
         info->has_total_time = true;
         info->total_time = qemu_get_clock_ms(rt_clock)
             - s->total_time;
@@ -208,6 +209,11 @@  MigrationInfo *qmp_query_migrate(Error **errp)
         info->ram->dirty_pages_rate = s->dirty_pages_rate;
         info->ram->mbps = s->mbps;
 
+        if (s->pin_all_time >= 0) {
+            info->has_pin_all_time = true;
+            info->pin_all_time = s->pin_all_time;
+        }
+
         if (blk_mig_active()) {
             info->has_disk = true;
             info->disk = g_malloc0(sizeof(*info->disk));
@@ -227,6 +233,11 @@  MigrationInfo *qmp_query_migrate(Error **errp)
         info->has_downtime = true;
         info->downtime = s->downtime;
 
+        if (s->pin_all_time >= 0) {
+            info->has_pin_all_time = true;
+            info->pin_all_time = s->pin_all_time;
+        }
+
         info->has_ram = true;
         info->ram = g_malloc0(sizeof(*info->ram));
         info->ram->transferred = ram_bytes_transferred();
diff --git a/qapi-schema.json b/qapi-schema.json
index 4e9cd6d..dfae59a 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -586,7 +586,8 @@ 
            '*xbzrle-cache': 'XBZRLECacheStats',
            '*total-time': 'int',
            '*expected-downtime': 'int',
-           '*downtime': 'int'} }
+           '*downtime': 'int',
+           '*pin-all-time' : 'int'} }
 
 ##
 # @query-migrate