diff mbox

[04/10] Translate offsets to destination address space

Message ID 1429545445-28216-5-git-send-email-dgilbert@redhat.com
State New
Headers show

Commit Message

Dr. David Alan Gilbert April 20, 2015, 3:57 p.m. UTC
From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

The 'offset' field in RDMACompress and 'current_addr' field
in RDMARegister are commented as being offsets within a particular
RAMBlock, however they appear to actually be offsets within the
ram_addr_t space.

The code currently assumes that the offsets on the source/destination
match, this change removes the need for the assumption for these
structures by translating the addresses into the ram_addr_t space of
the destination host.

Note: An alternative would be to change the fields to actually
take the data they're commented for; this would potentially be
simpler but would break stream compatibility for those cases
that currently work.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 migration/rdma.c | 31 ++++++++++++++++++++++++-------
 1 file changed, 24 insertions(+), 7 deletions(-)

Comments

mrhines@linux.vnet.ibm.com May 19, 2015, 6:28 p.m. UTC | #1
On 04/20/2015 10:57 AM, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> The 'offset' field in RDMACompress and 'current_addr' field
> in RDMARegister are commented as being offsets within a particular
> RAMBlock, however they appear to actually be offsets within the
> ram_addr_t space.
>
> The code currently assumes that the offsets on the source/destination
> match, this change removes the need for the assumption for these
> structures by translating the addresses into the ram_addr_t space of
> the destination host.

I don't understand fully: If the offsets are not the same, then
why would the RAMBlocks be the same? If a RAMBlock
is hot-plugged on one side, shouldn't an identical one be
hotplugged on the other side, including the offset into ram_addr_t?
(Even if the base address of the ram_addr_t space is different
between source and destination, then at least the offsets
and list of blocks should be the same, no?)

Is hotplugging an asynchronous operation for post-copy or
something?

>
> Note: An alternative would be to change the fields to actually
> take the data they're commented for; this would potentially be
> simpler but would break stream compatibility for those cases
> that currently work.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>   migration/rdma.c | 31 ++++++++++++++++++++++++-------
>   1 file changed, 24 insertions(+), 7 deletions(-)
>
> diff --git a/migration/rdma.c b/migration/rdma.c
> index c3814c5..2c0d11b 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -411,7 +411,7 @@ static void network_to_control(RDMAControlHeader *control)
>    */
>   typedef struct QEMU_PACKED {
>       union QEMU_PACKED {
> -        uint64_t current_addr;  /* offset into the ramblock of the chunk */
> +        uint64_t current_addr;  /* offset into the ram_addr_t space */
>           uint64_t chunk;         /* chunk to lookup if unregistering */
>       } key;
>       uint32_t current_index; /* which ramblock the chunk belongs to */
> @@ -419,8 +419,19 @@ typedef struct QEMU_PACKED {
>       uint64_t chunks;            /* how many sequential chunks to register */
>   } RDMARegister;

The below seems OK, but I would prefer not to do this translation here.
Can the source and destination apply the offset calculations outside
of the byte-order functions? Like, before register_to_network, the
source removes the offset, and then when the message is received,
the destination then again re-applies the correct offset?

> -static void register_to_network(RDMARegister *reg)
> +static void register_to_network(RDMAContext *rdma, RDMARegister *reg)
>   {
> +    RDMALocalBlock *local_block;
> +    local_block  = &rdma->local_ram_blocks.block[reg->current_index];
> +
> +    if (local_block->is_ram_block) {
> +        /*
> +         * current_addr as passed in is an address in the local ram_addr_t
> +         * space, we need to translate this for the destination
> +         */
> +        reg->key.current_addr -= local_block->offset;
> +        reg->key.current_addr += rdma->dest_blocks[reg->current_index].offset;
> +    }
>       reg->key.current_addr = htonll(reg->key.current_addr);
>       reg->current_index = htonl(reg->current_index);
>       reg->chunks = htonll(reg->chunks);
Dr. David Alan Gilbert May 19, 2015, 6:44 p.m. UTC | #2
* Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
> On 04/20/2015 10:57 AM, Dr. David Alan Gilbert (git) wrote:
> >From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >
> >The 'offset' field in RDMACompress and 'current_addr' field
> >in RDMARegister are commented as being offsets within a particular
> >RAMBlock, however they appear to actually be offsets within the
> >ram_addr_t space.
> >
> >The code currently assumes that the offsets on the source/destination
> >match, this change removes the need for the assumption for these
> >structures by translating the addresses into the ram_addr_t space of
> >the destination host.
> 
> I don't understand fully: If the offsets are not the same, then
> why would the RAMBlocks be the same? If a RAMBlock
> is hot-plugged on one side, shouldn't an identical one be
> hotplugged on the other side, including the offset into ram_addr_t?

If a RAMBlock is hotplugged on the source, it's normally passed in on
the command line at startup on the destination, not hotplugged.

This difference in order of allocation of the RAMBlocks can cause
the allocation of space in ram_addr_t to be different.

In addition changes between qemu versions as to the order in which
devices are initialised can cause differences in the allocation in ram_addr_t.

Indeed the allocation algorithm isn't very deterministic, I think it looks
for the smallest gap that will fit for the allocation.
Also, lets say that you hot plugged 6 devices, and hot unplugged 5,
then migrated;  on the destination you only see one of them, not the history
of the other allocations that had to be performed.

> (Even if the base address of the ram_addr_t space is different
> between source and destination, then at least the offsets
> and list of blocks should be the same, no?)

Neither the offsets nor the order is deterministic (as mentioned above);
only the naming.

> Is hotplugging an asynchronous operation for post-copy or
> something?

Postcopy doesn't change that; just don't try hotplugging during the
migration.

Dave



> 
> >
> >Note: An alternative would be to change the fields to actually
> >take the data they're commented for; this would potentially be
> >simpler but would break stream compatibility for those cases
> >that currently work.
> >
> >Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> >---
> >  migration/rdma.c | 31 ++++++++++++++++++++++++-------
> >  1 file changed, 24 insertions(+), 7 deletions(-)
> >
> >diff --git a/migration/rdma.c b/migration/rdma.c
> >index c3814c5..2c0d11b 100644
> >--- a/migration/rdma.c
> >+++ b/migration/rdma.c
> >@@ -411,7 +411,7 @@ static void network_to_control(RDMAControlHeader *control)
> >   */
> >  typedef struct QEMU_PACKED {
> >      union QEMU_PACKED {
> >-        uint64_t current_addr;  /* offset into the ramblock of the chunk */
> >+        uint64_t current_addr;  /* offset into the ram_addr_t space */
> >          uint64_t chunk;         /* chunk to lookup if unregistering */
> >      } key;
> >      uint32_t current_index; /* which ramblock the chunk belongs to */
> >@@ -419,8 +419,19 @@ typedef struct QEMU_PACKED {
> >      uint64_t chunks;            /* how many sequential chunks to register */
> >  } RDMARegister;
> 
> The below seems OK, but I would prefer not to do this translation here.
> Can the source and destination apply the offset calculations outside
> of the byte-order functions? Like, before register_to_network, the
> source removes the offset, and then when the message is received,
> the destination then again re-applies the correct offset?
> 
> >-static void register_to_network(RDMARegister *reg)
> >+static void register_to_network(RDMAContext *rdma, RDMARegister *reg)
> >  {
> >+    RDMALocalBlock *local_block;
> >+    local_block  = &rdma->local_ram_blocks.block[reg->current_index];
> >+
> >+    if (local_block->is_ram_block) {
> >+        /*
> >+         * current_addr as passed in is an address in the local ram_addr_t
> >+         * space, we need to translate this for the destination
> >+         */
> >+        reg->key.current_addr -= local_block->offset;
> >+        reg->key.current_addr += rdma->dest_blocks[reg->current_index].offset;
> >+    }
> >      reg->key.current_addr = htonll(reg->key.current_addr);
> >      reg->current_index = htonl(reg->current_index);
> >      reg->chunks = htonll(reg->chunks);
> 
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
mrhines@linux.vnet.ibm.com May 19, 2015, 6:57 p.m. UTC | #3
On 05/19/2015 01:44 PM, Dr. David Alan Gilbert wrote:
> * Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
>> On 04/20/2015 10:57 AM, Dr. David Alan Gilbert (git) wrote:
>>> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>>>
>>> The 'offset' field in RDMACompress and 'current_addr' field
>>> in RDMARegister are commented as being offsets within a particular
>>> RAMBlock, however they appear to actually be offsets within the
>>> ram_addr_t space.
>>>
>>> The code currently assumes that the offsets on the source/destination
>>> match, this change removes the need for the assumption for these
>>> structures by translating the addresses into the ram_addr_t space of
>>> the destination host.
>> I don't understand fully: If the offsets are not the same, then
>> why would the RAMBlocks be the same? If a RAMBlock
>> is hot-plugged on one side, shouldn't an identical one be
>> hotplugged on the other side, including the offset into ram_addr_t?
> If a RAMBlock is hotplugged on the source, it's normally passed in on
> the command line at startup on the destination, not hotplugged.
>
> This difference in order of allocation of the RAMBlocks can cause
> the allocation of space in ram_addr_t to be different.
>
> In addition changes between qemu versions as to the order in which
> devices are initialised can cause differences in the allocation in ram_addr_t.
>
> Indeed the allocation algorithm isn't very deterministic, I think it looks
> for the smallest gap that will fit for the allocation.
> Also, lets say that you hot plugged 6 devices, and hot unplugged 5,
> then migrated;  on the destination you only see one of them, not the history
> of the other allocations that had to be performed.

I see ---- so until now, I just got "lucky" that the RAMBlocks were
in the same order with the same offsets =).
Dr. David Alan Gilbert May 19, 2015, 7:02 p.m. UTC | #4
* Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
> On 05/19/2015 01:44 PM, Dr. David Alan Gilbert wrote:
> >* Michael R. Hines (mrhines@linux.vnet.ibm.com) wrote:
> >>On 04/20/2015 10:57 AM, Dr. David Alan Gilbert (git) wrote:
> >>>From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >>>
> >>>The 'offset' field in RDMACompress and 'current_addr' field
> >>>in RDMARegister are commented as being offsets within a particular
> >>>RAMBlock, however they appear to actually be offsets within the
> >>>ram_addr_t space.
> >>>
> >>>The code currently assumes that the offsets on the source/destination
> >>>match, this change removes the need for the assumption for these
> >>>structures by translating the addresses into the ram_addr_t space of
> >>>the destination host.
> >>I don't understand fully: If the offsets are not the same, then
> >>why would the RAMBlocks be the same? If a RAMBlock
> >>is hot-plugged on one side, shouldn't an identical one be
> >>hotplugged on the other side, including the offset into ram_addr_t?
> >If a RAMBlock is hotplugged on the source, it's normally passed in on
> >the command line at startup on the destination, not hotplugged.
> >
> >This difference in order of allocation of the RAMBlocks can cause
> >the allocation of space in ram_addr_t to be different.
> >
> >In addition changes between qemu versions as to the order in which
> >devices are initialised can cause differences in the allocation in ram_addr_t.
> >
> >Indeed the allocation algorithm isn't very deterministic, I think it looks
> >for the smallest gap that will fit for the allocation.
> >Also, lets say that you hot plugged 6 devices, and hot unplugged 5,
> >then migrated;  on the destination you only see one of them, not the history
> >of the other allocations that had to be performed.
> 
> I see ---- so until now, I just got "lucky" that the RAMBlocks were
> in the same order with the same offsets =).

Yep.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
diff mbox

Patch

diff --git a/migration/rdma.c b/migration/rdma.c
index c3814c5..2c0d11b 100644
--- a/migration/rdma.c
+++ b/migration/rdma.c
@@ -411,7 +411,7 @@  static void network_to_control(RDMAControlHeader *control)
  */
 typedef struct QEMU_PACKED {
     union QEMU_PACKED {
-        uint64_t current_addr;  /* offset into the ramblock of the chunk */
+        uint64_t current_addr;  /* offset into the ram_addr_t space */
         uint64_t chunk;         /* chunk to lookup if unregistering */
     } key;
     uint32_t current_index; /* which ramblock the chunk belongs to */
@@ -419,8 +419,19 @@  typedef struct QEMU_PACKED {
     uint64_t chunks;            /* how many sequential chunks to register */
 } RDMARegister;
 
-static void register_to_network(RDMARegister *reg)
+static void register_to_network(RDMAContext *rdma, RDMARegister *reg)
 {
+    RDMALocalBlock *local_block;
+    local_block  = &rdma->local_ram_blocks.block[reg->current_index];
+
+    if (local_block->is_ram_block) {
+        /*
+         * current_addr as passed in is an address in the local ram_addr_t
+         * space, we need to translate this for the destination
+         */
+        reg->key.current_addr -= local_block->offset;
+        reg->key.current_addr += rdma->dest_blocks[reg->current_index].offset;
+    }
     reg->key.current_addr = htonll(reg->key.current_addr);
     reg->current_index = htonl(reg->current_index);
     reg->chunks = htonll(reg->chunks);
@@ -436,13 +447,19 @@  static void network_to_register(RDMARegister *reg)
 typedef struct QEMU_PACKED {
     uint32_t value;     /* if zero, we will madvise() */
     uint32_t block_idx; /* which ram block index */
-    uint64_t offset;    /* where in the remote ramblock this chunk */
+    uint64_t offset;    /* Address in remote ram_addr_t space */
     uint64_t length;    /* length of the chunk */
 } RDMACompress;
 
-static void compress_to_network(RDMACompress *comp)
+static void compress_to_network(RDMAContext *rdma, RDMACompress *comp)
 {
     comp->value = htonl(comp->value);
+    /*
+     * comp->offset as passed in is an address in the local ram_addr_t
+     * space, we need to translate this for the destination
+     */
+    comp->offset -= rdma->local_ram_blocks.block[comp->block_idx].offset;
+    comp->offset += rdma->dest_blocks[comp->block_idx].offset;
     comp->block_idx = htonl(comp->block_idx);
     comp->offset = htonll(comp->offset);
     comp->length = htonll(comp->length);
@@ -1288,7 +1305,7 @@  static int qemu_rdma_unregister_waiting(RDMAContext *rdma)
         rdma->total_registrations--;
 
         reg.key.chunk = chunk;
-        register_to_network(&reg);
+        register_to_network(rdma, &reg);
         ret = qemu_rdma_exchange_send(rdma, &head, (uint8_t *) &reg,
                                 &resp, NULL, NULL);
         if (ret < 0) {
@@ -1909,7 +1926,7 @@  retry:
                 trace_qemu_rdma_write_one_zero(chunk, sge.length,
                                                current_index, current_addr);
 
-                compress_to_network(&comp);
+                compress_to_network(rdma, &comp);
                 ret = qemu_rdma_exchange_send(rdma, &head,
                                 (uint8_t *) &comp, NULL, NULL, NULL);
 
@@ -1936,7 +1953,7 @@  retry:
             trace_qemu_rdma_write_one_sendreg(chunk, sge.length, current_index,
                                               current_addr);
 
-            register_to_network(&reg);
+            register_to_network(rdma, &reg);
             ret = qemu_rdma_exchange_send(rdma, &head, (uint8_t *) &reg,
                                     &resp, &reg_result_idx, NULL);
             if (ret < 0) {