[v3,resend/cleanup,1/8] rdma: update documentation to reflect new unpin support

Message ID 1373640028-5138-2-git-send-email-mrhines@linux.vnet.ibm.com
State New

Commit Message

mrhines@linux.vnet.ibm.com July 12, 2013, 2:40 p.m. UTC
From: "Michael R. Hines" <mrhines@us.ibm.com>

As requested, the protocol now includes memory unpinning support.
This has been implemented in a non-optimized manner, in such a way
that one could devise an LRU or other workload-specific information
on top of the basic mechanism to influence the way unpinning happens
during runtime.

The feature is not yet user-facing, and thus can only be enabled
at compile-time.

Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 docs/rdma.txt |   51 ++++++++++++++++++++++++++++++---------------------
 1 file changed, 30 insertions(+), 21 deletions(-)

Comments

Eric Blake July 12, 2013, 5:09 p.m. UTC | #1
On 07/12/2013 08:40 AM, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> As requested, the protocol now includes memory unpinning support.
> This has been implemented in a non-optimized manner, in such a way
> that one could devise an LRU or other workload-specific information
> on top of the basic mechanism to influence the way unpinning happens
> during runtime.
> 
> The feature is not yet user-facing, and is thus can only be enabled
> at compile-time.
> 
> Reviewed-by: Eric Blake <eblake@redhat.com>
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>  docs/rdma.txt |   51 ++++++++++++++++++++++++++++++---------------------
>  1 file changed, 30 insertions(+), 21 deletions(-)

I suggest splitting this patch into two; and cc-ing the first of the two
patches through qemu-trivial (since formatting cleanups can be applied
now, even while still waiting for a comprehensive review of the
algorithm in the rest of the series)

> 
> diff --git a/docs/rdma.txt b/docs/rdma.txt
> index 45a4b1d..45d1c8a 100644
> --- a/docs/rdma.txt
> +++ b/docs/rdma.txt
> @@ -35,7 +35,7 @@ memory tracked during each live migration iteration round cannot keep pace
>  with the rate of dirty memory produced by the workload.
>  
>  RDMA currently comes in two flavors: both Ethernet based (RoCE, or RDMA
> -over Convered Ethernet) as well as Infiniband-based. This implementation of
> +over Converged Ethernet) as well as Infiniband-based. This implementation of

Trivial

>  migration using RDMA is capable of using both technologies because of
>  the use of the OpenFabrics OFED software stack that abstracts out the
>  programming model irrespective of the underlying hardware.
> @@ -188,9 +188,9 @@ header portion and a data portion (but together are transmitted
>  as a single SEND message).
>  
>  Header:
> -    * Length  (of the data portion, uint32, network byte order)
> -    * Type    (what command to perform, uint32, network byte order)
> -    * Repeat  (Number of commands in data portion, same type only)
> +    * Length               (of the data portion, uint32, network byte order)
> +    * Type                 (what command to perform, uint32, network byte order)
> +    * Repeat               (Number of commands in data portion, same type only)

trivial

>  
>  The 'Repeat' field is here to support future multiple page registrations
>  in a single message without any need to change the protocol itself
> @@ -202,17 +202,19 @@ The maximum number of repeats is hard-coded to 4096. This is a conservative
>  limit based on the maximum size of a SEND message along with emperical
>  observations on the maximum future benefit of simultaneous page registrations.
>  
> -The 'type' field has 10 different command values:
> -    1. Unused
> -    2. Error              (sent to the source during bad things)
> -    3. Ready              (control-channel is available)
> -    4. QEMU File          (for sending non-live device state)
> -    5. RAM Blocks request (used right after connection setup)
> -    6. RAM Blocks result  (used right after connection setup)
> -    7. Compress page      (zap zero page and skip registration)
> -    8. Register request   (dynamic chunk registration)
> -    9. Register result    ('rkey' to be used by sender)
> -    10. Register finished  (registration for current iteration finished)
> +The 'type' field has 12 different command values:
> +     1. Unused
> +     2. Error                      (sent to the source during bad things)
> +     3. Ready                      (control-channel is available)
> +     4. QEMU File                  (for sending non-live device state)
> +     5. RAM Blocks request         (used right after connection setup)
> +     6. RAM Blocks result          (used right after connection setup)
> +     7. Compress page              (zap zero page and skip registration)
> +     8. Register request           (dynamic chunk registration)
> +     9. Register result            ('rkey' to be used by sender)
> +    10. Register finished          (registration for current iteration finished)

reformatting is trivial,

> +    11. Unregister request         (unpin previously registered memory)
> +    12. Unregister finished        (confirmation that unpin completed)

addition belongs in the second patch (so that we don't have to wade
through that much trivial stuff to find the real changes)

>  
>  A single control message, as hinted above, can contain within the data
>  portion an array of many commands of the same type. If there is more than
> @@ -243,7 +245,7 @@ qemu_rdma_exchange_send(header, data, optional response header & data):
>     from the receiver to tell us that the receiver
>     is *ready* for us to transmit some new bytes.
>  2. Optionally: if we are expecting a response from the command
> -   (that we have no yet transmitted), let's post an RQ
> +   (that we have not yet transmitted), let's post an RQ

trivial

>     work request to receive that data a few moments later.
>  3. When the READY arrives, librdmacm will
>     unblock us and we immediately post a RQ work request
> @@ -293,8 +295,10 @@ librdmacm provides the user with a 'private data' area to be exchanged
>  at connection-setup time before any infiniband traffic is generated.
>  
>  Header:
> -    * Version (protocol version validated before send/recv occurs), uint32, network byte order
> -    * Flags   (bitwise OR of each capability), uint32, network byte order
> +    * Version (protocol version validated before send/recv occurs),
> +                                               uint32, network byte order
> +    * Flags   (bitwise OR of each capability),
> +                                               uint32, network byte order

trivial

>  
>  There is no data portion of this header right now, so there is
>  no length field. The maximum size of the 'private data' section
> @@ -313,7 +317,7 @@ If the version is invalid, we throw an error.
>  If the version is new, we only negotiate the capabilities that the
>  requested version is able to perform and ignore the rest.
>  
> -Currently there is only *one* capability in Version #1: dynamic page registration
> +Currently there is only one capability in Version #1: dynamic page registration

trivial

>  
>  Finally: Negotiation happens with the Flags field: If the primary-VM
>  sets a flag, but the destination does not support this capability, it
> @@ -326,8 +330,8 @@ QEMUFileRDMA Interface:
>  
>  QEMUFileRDMA introduces a couple of new functions:
>  
> -1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
> -2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
> +1. qemu_rdma_get_buffer()               (QEMUFileOps rdma_read_ops)
> +2. qemu_rdma_put_buffer()               (QEMUFileOps rdma_write_ops)

trivial

>  
>  These two functions are very short and simply use the protocol
>  describe above to deliver bytes without changing the upper-level
> @@ -413,3 +417,8 @@ TODO:
>     the use of KSM and ballooning while using RDMA.
>  4. Also, some form of balloon-device usage tracking would also
>     help alleviate some issues.
> +5. Move UNREGISTER requests to a separate thread.
> +6. Use LRU to provide more fine-grained direction of UNREGISTER
> +   requests for unpinning memory in an overcommitted environment.
> +7. Expose UNREGISTER support to the user by way of workload-specific
> +   hints about application behavior.
> 

new content
mrhines@linux.vnet.ibm.com July 12, 2013, 5:26 p.m. UTC | #2
On 07/12/2013 01:09 PM, Eric Blake wrote:
> On 07/12/2013 08:40 AM, mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> As requested, the protocol now includes memory unpinning support.
>> This has been implemented in a non-optimized manner, in such a way
>> that one could devise an LRU or other workload-specific information
>> on top of the basic mechanism to influence the way unpinning happens
>> during runtime.
>>
>> The feature is not yet user-facing, and is thus can only be enabled
>> at compile-time.
>>
>> Reviewed-by: Eric Blake <eblake@redhat.com>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>> ---
>>   docs/rdma.txt |   51 ++++++++++++++++++++++++++++++---------------------
>>   1 file changed, 30 insertions(+), 21 deletions(-)
> I suggest splitting this patch into two; and cc-ing the first of the two
> patches through qemu-trivial (since formatting cleanups can be applied
> now, even while still waiting for a comprehensive review of the
> algorithm in the rest of the series)

My understanding is that the reviews have completed already,
including a very extensive test series that I performed which
included both virt-test results and non-virt-test results from both
myself and Chegu.

Am I mistaken?


Eric Blake July 12, 2013, 5:39 p.m. UTC | #3
On 07/12/2013 11:26 AM, Michael R. Hines wrote:
> On 07/12/2013 01:09 PM, Eric Blake wrote:
>> On 07/12/2013 08:40 AM, mrhines@linux.vnet.ibm.com wrote:
>>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>>
>>> As requested, the protocol now includes memory unpinning support.
>>> This has been implemented in a non-optimized manner, in such a way
>>> that one could devise an LRU or other workload-specific information
>>> on top of the basic mechanism to influence the way unpinning happens
>>> during runtime.
>>>

>>> ++++++++++++++++++++++++++++++---------------------
>>>   1 file changed, 30 insertions(+), 21 deletions(-)
>> I suggest splitting this patch into two; and cc-ing the first of the two
>> patches through qemu-trivial (since formatting cleanups can be applied
>> now, even while still waiting for a comprehensive review of the
>> algorithm in the rest of the series)
> 
> My understanding is that the reviews have completed already,
> including a very extensive test series that I performed which
> included both virt-test results and non-virt-test results from both
> myself and Chegu.
> 
> Am I mistaken?

It may have been reviewed and tested, but as you just barely posted v3
today and there is not yet a maintainer's queue with a PULL request, it
is still subject to any further review that people want to provide, and
up to the maintainer to state definitively if the further review
comments must be addressed.  It's not the end of the world if you don't
split this patch, but at the same time, splitting it makes it easier to
review, and to pick and choose which parts get backported (trivial
formatting vs. new feature).
mrhines@linux.vnet.ibm.com July 12, 2013, 5:46 p.m. UTC | #4
On 07/12/2013 01:39 PM, Eric Blake wrote:
> On 07/12/2013 11:26 AM, Michael R. Hines wrote:
>> On 07/12/2013 01:09 PM, Eric Blake wrote:
>>> On 07/12/2013 08:40 AM, mrhines@linux.vnet.ibm.com wrote:
>>>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>>>
>>>> As requested, the protocol now includes memory unpinning support.
>>>> This has been implemented in a non-optimized manner, in such a way
>>>> that one could devise an LRU or other workload-specific information
>>>> on top of the basic mechanism to influence the way unpinning happens
>>>> during runtime.
>>>>
>>>> ++++++++++++++++++++++++++++++---------------------
>>>>    1 file changed, 30 insertions(+), 21 deletions(-)
>>> I suggest splitting this patch into two; and cc-ing the first of the two
>>> patches through qemu-trivial (since formatting cleanups can be applied
>>> now, even while still waiting for a comprehensive review of the
>>> algorithm in the rest of the series)
>> My understanding is that the reviews have completed already,
>> including a very extensive test series that I performed which
>> included both virt-test results and non-virt-test results from both
>> myself and Chegu.
>>
>> Am I mistaken?
> It may have been reviewed and tested, but as you just barely posted v3
> today and there is not yet a maintainer's queue with a PULL request, it
> is still subject to any further review that people want to provide, and
> up to the maintainer to state definitively if the further review
> comments must be addressed.  It's not the end of the world if you don't
> split this patch, but at the same time, splitting it makes it easier to
> review, and to pick and choose which parts get backported (trivial
> formatting vs. new feature).
>

Alright - I'll wait until next week before re-sending again. No big deal.


Juan? Ping?
- Michael

Patch

diff --git a/docs/rdma.txt b/docs/rdma.txt
index 45a4b1d..45d1c8a 100644
--- a/docs/rdma.txt
+++ b/docs/rdma.txt
@@ -35,7 +35,7 @@  memory tracked during each live migration iteration round cannot keep pace
 with the rate of dirty memory produced by the workload.
 
 RDMA currently comes in two flavors: both Ethernet based (RoCE, or RDMA
-over Convered Ethernet) as well as Infiniband-based. This implementation of
+over Converged Ethernet) as well as Infiniband-based. This implementation of
 migration using RDMA is capable of using both technologies because of
 the use of the OpenFabrics OFED software stack that abstracts out the
 programming model irrespective of the underlying hardware.
@@ -188,9 +188,9 @@  header portion and a data portion (but together are transmitted
 as a single SEND message).
 
 Header:
-    * Length  (of the data portion, uint32, network byte order)
-    * Type    (what command to perform, uint32, network byte order)
-    * Repeat  (Number of commands in data portion, same type only)
+    * Length               (of the data portion, uint32, network byte order)
+    * Type                 (what command to perform, uint32, network byte order)
+    * Repeat               (Number of commands in data portion, same type only)
 
 The 'Repeat' field is here to support future multiple page registrations
 in a single message without any need to change the protocol itself
@@ -202,17 +202,19 @@  The maximum number of repeats is hard-coded to 4096. This is a conservative
 limit based on the maximum size of a SEND message along with emperical
 observations on the maximum future benefit of simultaneous page registrations.
 
-The 'type' field has 10 different command values:
-    1. Unused
-    2. Error              (sent to the source during bad things)
-    3. Ready              (control-channel is available)
-    4. QEMU File          (for sending non-live device state)
-    5. RAM Blocks request (used right after connection setup)
-    6. RAM Blocks result  (used right after connection setup)
-    7. Compress page      (zap zero page and skip registration)
-    8. Register request   (dynamic chunk registration)
-    9. Register result    ('rkey' to be used by sender)
-    10. Register finished  (registration for current iteration finished)
+The 'type' field has 12 different command values:
+     1. Unused
+     2. Error                      (sent to the source during bad things)
+     3. Ready                      (control-channel is available)
+     4. QEMU File                  (for sending non-live device state)
+     5. RAM Blocks request         (used right after connection setup)
+     6. RAM Blocks result          (used right after connection setup)
+     7. Compress page              (zap zero page and skip registration)
+     8. Register request           (dynamic chunk registration)
+     9. Register result            ('rkey' to be used by sender)
+    10. Register finished          (registration for current iteration finished)
+    11. Unregister request         (unpin previously registered memory)
+    12. Unregister finished        (confirmation that unpin completed)
 
 A single control message, as hinted above, can contain within the data
 portion an array of many commands of the same type. If there is more than
@@ -243,7 +245,7 @@  qemu_rdma_exchange_send(header, data, optional response header & data):
    from the receiver to tell us that the receiver
    is *ready* for us to transmit some new bytes.
 2. Optionally: if we are expecting a response from the command
-   (that we have no yet transmitted), let's post an RQ
+   (that we have not yet transmitted), let's post an RQ
    work request to receive that data a few moments later.
 3. When the READY arrives, librdmacm will
    unblock us and we immediately post a RQ work request
@@ -293,8 +295,10 @@  librdmacm provides the user with a 'private data' area to be exchanged
 at connection-setup time before any infiniband traffic is generated.
 
 Header:
-    * Version (protocol version validated before send/recv occurs), uint32, network byte order
-    * Flags   (bitwise OR of each capability), uint32, network byte order
+    * Version (protocol version validated before send/recv occurs),
+                                               uint32, network byte order
+    * Flags   (bitwise OR of each capability),
+                                               uint32, network byte order
 
 There is no data portion of this header right now, so there is
 no length field. The maximum size of the 'private data' section
@@ -313,7 +317,7 @@  If the version is invalid, we throw an error.
 If the version is new, we only negotiate the capabilities that the
 requested version is able to perform and ignore the rest.
 
-Currently there is only *one* capability in Version #1: dynamic page registration
+Currently there is only one capability in Version #1: dynamic page registration
 
 Finally: Negotiation happens with the Flags field: If the primary-VM
 sets a flag, but the destination does not support this capability, it
@@ -326,8 +330,8 @@  QEMUFileRDMA Interface:
 
 QEMUFileRDMA introduces a couple of new functions:
 
-1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
-2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
+1. qemu_rdma_get_buffer()               (QEMUFileOps rdma_read_ops)
+2. qemu_rdma_put_buffer()               (QEMUFileOps rdma_write_ops)
 
 These two functions are very short and simply use the protocol
 describe above to deliver bytes without changing the upper-level
@@ -413,3 +417,8 @@  TODO:
    the use of KSM and ballooning while using RDMA.
 4. Also, some form of balloon-device usage tracking would also
    help alleviate some issues.
+5. Move UNREGISTER requests to a separate thread.
+6. Use LRU to provide more fine-grained direction of UNREGISTER
+   requests for unpinning memory in an overcommitted environment.
+7. Expose UNREGISTER support to the user by way of workload-specific
+   hints about application behavior.
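
The control-message header described in the patch (Length, Type, Repeat) can be illustrated with a minimal C sketch. This is not the QEMU implementation: every name below is hypothetical, the on-wire field order is assumed to match the order in which docs/rdma.txt lists the fields, and 'Repeat' is assumed to be a uint32 in network byte order like the other two fields (the text only states this for Length and Type). The only limit enforced is the hard-coded maximum of 4096 repeats mentioned in the documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; none of these are taken from the QEMU sources. */
enum {
    SKETCH_HEADER_BYTES = 12,   /* three uint32 fields (assumed layout) */
    SKETCH_MAX_REPEAT   = 4096  /* limit quoted in docs/rdma.txt */
};

typedef struct {
    uint32_t len;     /* length of the data portion */
    uint32_t type;    /* which command to perform */
    uint32_t repeat;  /* number of same-type commands in the data portion */
} SketchControlHeader;

/* Store a 32-bit value in network (big-endian) byte order. */
static void sketch_put_be32(uint8_t *out, uint32_t v)
{
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}

/* Load a 32-bit value stored in network (big-endian) byte order. */
static uint32_t sketch_get_be32(const uint8_t *in)
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}

/* Serialize the header into a buffer of SKETCH_HEADER_BYTES bytes. */
static void sketch_header_pack(const SketchControlHeader *h, uint8_t *out)
{
    assert(h != NULL && out != NULL);
    sketch_put_be32(out + 0, h->len);
    sketch_put_be32(out + 4, h->type);
    sketch_put_be32(out + 8, h->repeat);
}

/*
 * Parse a header from the wire.
 * Returns 0 on success, -1 if 'repeat' exceeds the documented maximum.
 */
static int sketch_header_unpack(const uint8_t *in, SketchControlHeader *h)
{
    assert(in != NULL && h != NULL);
    h->len    = sketch_get_be32(in + 0);
    h->type   = sketch_get_be32(in + 4);
    h->repeat = sketch_get_be32(in + 8);
    return h->repeat > SKETCH_MAX_REPEAT ? -1 : 0;
}
```

A sender would call sketch_header_pack() on a 12-byte buffer before posting the SEND, and the receiver would call sketch_header_unpack() and reject the message on a non-zero return. Writing the bytes out by hand keeps the sketch independent of host endianness; real code could equally use htonl()/ntohl().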