Patchwork [v3,for-1.7] rdma: rename 'x-rdma' => 'rdma'

Submitter mrhines@linux.vnet.ibm.com
Date Nov. 6, 2013, 6:59 p.m.
Message ID <1383764354-10588-1-git-send-email-mrhines@linux.vnet.ibm.com>
Permalink /patch/288988/
State New

Comments

mrhines@linux.vnet.ibm.com - Nov. 6, 2013, 6:59 p.m.
From: "Michael R. Hines" <mrhines@us.ibm.com>

As far as we can tell, all known bugs have been fixed:

1. Parallel RDMA migrations are working
2. IPv6 migration is working
3. Libvirt patches are ready
4. virt-test is working

Objections?

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 docs/rdma.txt    |   24 ++++++++++--------------
 migration-rdma.c |    2 +-
 migration.c      |    6 +++---
 qapi-schema.json |    7 +++----
 4 files changed, 17 insertions(+), 22 deletions(-)
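
For reviewers, the user-visible effect of the rename is just the URI prefix. A rough Python stand-in for the strstart() dispatch in migration.c after this patch (illustrative only; scheme list abbreviated):

```python
# Illustrative stand-in for qemu_start_incoming_migration()'s
# strstart() chain after the rename; not QEMU code, just the
# prefix-matching logic. "x-rdma:" is no longer recognized.
def parse_migration_uri(uri: str):
    """Split a migration URI into (transport, remainder)."""
    for scheme in ("tcp:", "rdma:", "unix:", "exec:", "fd:"):
        if uri.startswith(scheme):
            return scheme[:-1], uri[len(scheme):]
    raise ValueError("unknown migration protocol: " + uri)

print(parse_migration_uri("rdma:192.168.1.2:4444"))  # ('rdma', '192.168.1.2:4444')
```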
Eric Blake - Nov. 6, 2013, 7:04 p.m.
On 11/06/2013 11:59 AM, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> As far as we can tell, all known bugs have been fixed:
> 
> 1. Parallel RDMA migrations are working
> 2. IPv6 migration is working
> 3. Libvirt patches are ready
> 4. virt-test is working
> 
> Objections?
> 
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>  docs/rdma.txt    |   24 ++++++++++--------------
>  migration-rdma.c |    2 +-
>  migration.c      |    6 +++---
>  qapi-schema.json |    7 +++----
>  4 files changed, 17 insertions(+), 22 deletions(-)

Reviewed-by: Eric Blake <eblake@redhat.com>
Daniel P. Berrange - Nov. 15, 2013, 5:06 p.m.
On Wed, Nov 06, 2013 at 01:59:14PM -0500, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> As far as we can tell, all known bugs have been fixed:

[snip]

> 3. Libvirt patches are ready

[snip]

> Objections?

There was a first round of patches posted to the libvirt list back
in July, but those were rejected since the QEMU side was still in flux.

More seriously though, from discussions at the KVM Forum it sounds
like there is a significant problem in actually using the RDMA
code. Unfortunately I can't remember who I was talking with about
it, but I was told that it requires the QEMU process to run as root
in order to talk to some of the kernel interfaces, and requires
manual updates to the cgroups device ACL to allow QEMU access to
some RDMA-related device nodes.

For this to be supportable in libvirt, we need this to work when
QEMU is running as an unprivileged user/group ID. If access to
any privileged resources is required, then there needs to be a
way to get privilege separation. Either libvirtd would need to
change file ownership to grant QEMU access to resources, or
libvirtd would need to open the resources & pass a FD across to
QEMU. Running QEMU as root is a non-starter.

I don't recall any new version of the patches being posted since
then to address this problem, so from the libvirt POV I don't
think this is ready, unless I was misinformed about this permission
problem.

Regards,
Daniel
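The FD-passing option Daniel mentions can be sketched in plain Python (SCM_RIGHTS over a Unix socket, exposed as socket.send_fds()/recv_fds() since Python 3.9). A temporary file stands in for a real device node here; this is a toy illustration, not libvirt code:

```python
# Sketch of privilege separation via FD passing: a privileged parent
# (standing in for libvirtd) opens a resource and hands the open
# descriptor to an unprivileged peer (standing in for QEMU) over a
# Unix socket using SCM_RIGHTS.
import os
import socket
import tempfile

def send_open_fd(sock: socket.socket, path: str) -> None:
    """Open `path` and pass the descriptor over `sock` (SCM_RIGHTS)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        socket.send_fds(sock, [b"fd"], [fd])
    finally:
        os.close(fd)  # the receiver holds its own duplicate

def recv_open_fd(sock: socket.socket) -> int:
    """Receive one descriptor passed over `sock`."""
    _, fds, _, _ = socket.recv_fds(sock, 1024, 1)
    return fds[0]

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile("w", delete=False) as f:
        f.write("device-node-stand-in")
        path = f.name
    parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    send_open_fd(parent, path)   # "libvirtd" side
    fd = recv_open_fd(child)     # "QEMU" side: never opened the path itself
    with os.fdopen(fd) as handle:
        print(handle.read())     # -> device-node-stand-in
    os.unlink(path)
```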
mrhines@linux.vnet.ibm.com - Nov. 15, 2013, 5:40 p.m.
On 11/15/2013 12:06 PM, Daniel P. Berrange wrote:
> On Wed, Nov 06, 2013 at 01:59:14PM -0500, mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> As far as we can tell, all known bugs have been fixed:
> [snip]
>
>> 3. Libvirt patches are ready
> [snip]
>
>> Objections?
> There was a first round of patches posted to the libvirt list back
> in July, but those were rejected since the QEMU side was still in flux.

Yes, I'm addressing all the reviews I received back in July,
I just haven't posted them to the list yet - stay tuned.

> More seriously though, from discussions at the KVM Forum it sounds
> like there is a significant problem in actually using the RDMA
> code. Unfortunately I can't remember who I was talking with about
> it, but I was told that it requires the QEMU process to run as root
> in order to talk to some of the kernel interfaces, and requires
> manual updates to the cgroups device ACL to allow QEMU access to
> some RDMA-related device nodes.

No, that was for Fault Tolerance, not for RDMA specifically.

You were speaking to me =). To jog your memory, the issue
was Micro-Checkpointing's need to access netlink sockets
to interact properly with the kernel.

This is unrelated to RDMA - accessing the /dev/infiniband
device nodes is already supported by libvirt by modifying
the configuration file in /etc, and that works just fine.

- Michael
Eric Blake - Nov. 15, 2013, 7:25 p.m.
On 11/15/2013 10:40 AM, Michael R. Hines wrote:
> 
> This is unrelated to RDMA - accessing the /dev/infiniband
> device nodes is already supported by libvirt by modifying
> the configuration file in /etc and that works just fine.

http://wiki.qemu.org/Features/RDMALiveMigration states that you modify
the .conf file to expose /dev/infiniband/rdma_cm and friends.  Are all
of these devices read/write accessible to non-root?  Or is there going
to be a problem if using user="qemu" group="qemu"?  (That is, merely
exposing the devices through cgroup device ACL checking may be
insufficient if you can't access the devices when not running root/root).

Libvirt can be patched so that the .conf file does not have to be edited
(i.e. change the defaults so that if cgroup_device_acl is not present in
the conf file, the defaults could still let a domain access the
/dev/infiniband devices).
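
For reference, the edit in question is the cgroup_device_acl list in /etc/libvirt/qemu.conf. Per the wiki page above, the addition looks roughly like this (sketch; the default entries are abbreviated and the exact uverbs device name varies with the adapter):

```
# /etc/libvirt/qemu.conf (sketch)
cgroup_device_acl = [
    "/dev/null", "/dev/full", "/dev/zero",
    "/dev/random", "/dev/urandom",
    "/dev/ptmx", "/dev/kvm",
    "/dev/infiniband/rdma_cm",
    "/dev/infiniband/uverbs0"
]
```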
mrhines@linux.vnet.ibm.com - Nov. 15, 2013, 7:44 p.m.
On 11/15/2013 02:25 PM, Eric Blake wrote:
> On 11/15/2013 10:40 AM, Michael R. Hines wrote:
>> This is unrelated to RDMA - accessing the /dev/infiniband
>> device nodes is already supported by libvirt by modifying
>> the configuration file in /etc and that works just fine.
> http://wiki.qemu.org/Features/RDMALiveMigration states that you modify
> the .conf file to expose /dev/infiniband/rdma_cm and friends.  Are all
> of these devices read/write accessible to non-root?  Or is there going
> to be a problem if using user="qemu" group="qemu"?  (That is, merely
> exposing the devices through cgroup device ACL checking may be
> insufficient if you can't access the devices when not running root/root).

Yes, non-root access is working just fine. Keep in mind that QEMU is no
different from any other standard HPC application (like an MPI program)
that performs RDMA operations.

QEMU is simply another librdmacm/libibverbs user and it does
not require any special privileges to the device files.

> Libvirt can be patched so that the .conf file does not have to be edited
> (i.e. change the defaults so that if cgroup_device_acl is not present in
> the conf file, the defaults could still let a domain access the
> /dev/infiniband devices).
>

That would be quite nice! Shall I include that in the next version
of my libvirt patch?

- Michael
Eric Blake - Nov. 15, 2013, 7:49 p.m.
On 11/15/2013 12:44 PM, Michael R. Hines wrote:
> On 11/15/2013 02:25 PM, Eric Blake wrote:
>> On 11/15/2013 10:40 AM, Michael R. Hines wrote:
>>> This is unrelated to RDMA - accessing the /dev/infiniband
>>> device nodes is already supported by libvirt by modifying
>>> the configuration file in /etc and that works just fine.
>> http://wiki.qemu.org/Features/RDMALiveMigration states that you modify
>> the .conf file to expose /dev/infiniband/rdma_cm and friends.  Are all
>> of these devices read/write accessible to non-root?  Or is there going
>> to be a problem if using user="qemu" group="qemu"?  (That is, merely
>> exposing the devices through cgroup device ACL checking may be
>> insufficient if you can't access the devices when not running root/root).
> 
> Yes, non-root access is working just fine. Keep in mind that QEMU is no
> different from any other standard HPC application (like an MPI program)
> that performs RDMA operations.
> 
> QEMU is simply another librdmacm/libibverbs user and it does
> not require any special privileges to the device files.
> 
>> Libvirt can be patched so that the .conf file does not have to be edited
>> (i.e. change the defaults so that if cgroup_device_acl is not present in
>> the conf file, the defaults could still let a domain access the
>> /dev/infiniband devices).
>>
> 
> That would be quite nice! Shall I include that in the next version
> of my libvirt patch?

Yes, if you can.

[rest of this is more for the libvirt list]
Ideally, the access should only be granted at the start of the
migration, and then revoked when no longer needed.  For an example of
granting ACL access while a domain is running, see how
qemuDomainPrepareDiskChainElement() calls qemuSetupDiskCgroup and
qemuTeardownDiskCgroup at appropriate times to alter the cgroup settings
on the fly.  These functions in turn call into virCgroupAllowDevicePath,
and follow it up with an audit log, to make it easy to track through the
audit which devices have been exposed to a guest.
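
The grant-then-revoke bracketing Eric describes, reduced to a toy Python sketch (all names hypothetical; a plain set stands in for the cgroup ACL that virCgroupAllowDevicePath would actually modify):

```python
# Toy model of granting device access only for the duration of a
# migration, mirroring how qemuSetupDiskCgroup/qemuTeardownDiskCgroup
# bracket disk access in libvirt.
from contextlib import contextmanager

acl = set()  # stand-in for the domain's cgroup device ACL

def allow_device(path):   # cf. virCgroupAllowDevicePath (hypothetical)
    acl.add(path)

def deny_device(path):
    acl.discard(path)

@contextmanager
def rdma_devices_granted(paths):
    """Grant the devices on entry, revoke on exit - even on error."""
    for p in paths:
        allow_device(p)
    try:
        yield
    finally:
        for p in paths:
            deny_device(p)

with rdma_devices_granted(["/dev/infiniband/rdma_cm"]):
    assert "/dev/infiniband/rdma_cm" in acl   # migration runs here
assert "/dev/infiniband/rdma_cm" not in acl   # revoked afterwards
```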
Daniel P. Berrange - Nov. 16, 2013, 10:32 a.m.
On Fri, Nov 15, 2013 at 12:25:30PM -0700, Eric Blake wrote:
> On 11/15/2013 10:40 AM, Michael R. Hines wrote:
> > 
> > This is unrelated to RDMA - accessing the /dev/infiniband
> > device nodes is already supported by libvirt my modifying
> > the configuration file in /etc and that works just fine.
> 
> http://wiki.qemu.org/Features/RDMALiveMigration states that you modify
> the .conf file to expose /dev/infiniband/rdma_cm and friends.  Are all
> of these devices read/write accessible to non-root?  Or is there going
> to be a problem if using user="qemu" group="qemu"?  (That is, merely
> exposing the devices through cgroup device ACL checking may be
> insufficient if you can't access the devices when not running root/root).
> 
> Libvirt can be patched so that the .conf file does not have to be edited
> (i.e. change the defaults so that if cgroup_device_acl is not present in
> the conf file, the defaults could still let a domain access the
> /dev/infiniband devices).

There's also an SELinux question to deal with here. If multiple QEMUs
need concurrent access, we can't do a selective grant of the device just
when migration is running - we would have to give all QEMUs access
all the time.  This would be a case where doing FD passing of the
pre-opened devices might be a better option. It depends on what the
downsides are to giving QEMU access to the devices unconditionally.

Daniel
Paolo Bonzini - Nov. 22, 2013, 4:50 p.m.
Il 16/11/2013 11:32, Daniel P. Berrange ha scritto:
> There's also an SELinux question to deal with there. If multiple QEMUs
> need concurrent access we can't do a selective grant of the device just
> when migration is running - we would have to give all QEMUs access
> all the time.  This would be a case where doing FD passing of the
> pre-opened devices might be a better option. It depends on what the
> downsides are to giving QEMU access to the devices unconditionally.

I think unconditional SELinux access + conditional cgroups access would
work best here.

How did Gluster deal with the same problem (for the gluster+rdma:// URI
scheme)?  I guess no one mentioned it when the Gluster patches
were committed, but the issue should be the same.  It would also arise
for userspace iSCSI if libiscsi were to grow support for iSER (iSCSI
Extensions for RDMA).

Paolo

Patch

diff --git a/docs/rdma.txt b/docs/rdma.txt
index 8d1e003..c559f9a 100644
--- a/docs/rdma.txt
+++ b/docs/rdma.txt
@@ -66,7 +66,7 @@  bulk-phase round of the migration and can be enabled for extremely
 high-performance RDMA hardware using the following command:
 
 QEMU Monitor Command:
-$ migrate_set_capability x-rdma-pin-all on # disabled by default
+$ migrate_set_capability rdma-pin-all on # disabled by default
 
 Performing this action will cause all 8GB to be pinned, so if that's
 not what you want, then please ignore this step altogether.
@@ -93,12 +93,12 @@  $ migrate_set_speed 40g # or whatever is the MAX of your RDMA device
 
 Next, on the destination machine, add the following to the QEMU command line:
 
-qemu ..... -incoming x-rdma:host:port
+qemu ..... -incoming rdma:host:port
 
 Finally, perform the actual migration on the source machine:
 
 QEMU Monitor Command:
-$ migrate -d x-rdma:host:port
+$ migrate -d rdma:host:port
 
 PERFORMANCE
 ===========
@@ -120,8 +120,8 @@  For example, in the same 8GB RAM example with all 8GB of memory in
 active use and the VM itself is completely idle using the same 40 gbps
 infiniband link:
 
-1. x-rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
-2. x-rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps
+1. rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
+2. rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps
 
 These numbers would of course scale up to whatever size virtual machine
 you have to migrate using RDMA.
@@ -407,18 +407,14 @@  socket is broken during a non-RDMA based migration.
 
 TODO:
 =====
-1. 'migrate x-rdma:host:port' and '-incoming x-rdma' options will be
-   renamed to 'rdma' after the experimental phase of this work has
-   completed upstream.
-2. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits
+1. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits
    are not compatible with infinband memory pinning and will result in
    an aborted migration (but with the source VM left unaffected).
-3. Use of the recent /proc/<pid>/pagemap would likely speed up
+2. Use of the recent /proc/<pid>/pagemap would likely speed up
    the use of KSM and ballooning while using RDMA.
-4. Also, some form of balloon-device usage tracking would also
+3. Also, some form of balloon-device usage tracking would also
    help alleviate some issues.
-5. Move UNREGISTER requests to a separate thread.
-6. Use LRU to provide more fine-grained direction of UNREGISTER
+4. Use LRU to provide more fine-grained direction of UNREGISTER
    requests for unpinning memory in an overcommitted environment.
-7. Expose UNREGISTER support to the user by way of workload-specific
+5. Expose UNREGISTER support to the user by way of workload-specific
    hints about application behavior.
diff --git a/migration-rdma.c b/migration-rdma.c
index f94f3b4..eeb4302 100644
--- a/migration-rdma.c
+++ b/migration-rdma.c
@@ -3412,7 +3412,7 @@  void rdma_start_outgoing_migration(void *opaque,
     }
 
     ret = qemu_rdma_source_init(rdma, &local_err,
-        s->enabled_capabilities[MIGRATION_CAPABILITY_X_RDMA_PIN_ALL]);
+        s->enabled_capabilities[MIGRATION_CAPABILITY_RDMA_PIN_ALL]);
 
     if (ret) {
         goto err;
diff --git a/migration.c b/migration.c
index b4f8462..d9c7a62 100644
--- a/migration.c
+++ b/migration.c
@@ -81,7 +81,7 @@  void qemu_start_incoming_migration(const char *uri, Error **errp)
     if (strstart(uri, "tcp:", &p))
         tcp_start_incoming_migration(p, errp);
 #ifdef CONFIG_RDMA
-    else if (strstart(uri, "x-rdma:", &p))
+    else if (strstart(uri, "rdma:", &p))
         rdma_start_incoming_migration(p, errp);
 #endif
 #if !defined(WIN32)
@@ -423,7 +423,7 @@  void qmp_migrate(const char *uri, bool has_blk, bool blk,
     if (strstart(uri, "tcp:", &p)) {
         tcp_start_outgoing_migration(s, p, &local_err);
 #ifdef CONFIG_RDMA
-    } else if (strstart(uri, "x-rdma:", &p)) {
+    } else if (strstart(uri, "rdma:", &p)) {
         rdma_start_outgoing_migration(s, p, &local_err);
 #endif
 #if !defined(WIN32)
@@ -501,7 +501,7 @@  bool migrate_rdma_pin_all(void)
 
     s = migrate_get_current();
 
-    return s->enabled_capabilities[MIGRATION_CAPABILITY_X_RDMA_PIN_ALL];
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_RDMA_PIN_ALL];
 }
 
 bool migrate_auto_converge(void)
diff --git a/qapi-schema.json b/qapi-schema.json
index 145eca8..92509f6 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -615,10 +615,9 @@ 
 #          This feature allows us to minimize migration traffic for certain work
 #          loads, by sending compressed difference of the pages
 #
-# @x-rdma-pin-all: Controls whether or not the entire VM memory footprint is
+# @rdma-pin-all: Controls whether or not the entire VM memory footprint is
 #          mlock()'d on demand or all at once. Refer to docs/rdma.txt for usage.
-#          Disabled by default. Experimental: may (or may not) be renamed after
-#          further testing is complete. (since 1.6)
+#          Disabled by default. (since 1.7)
 #
 # @zero-blocks: During storage migration encode blocks of zeroes efficiently. This
 #          essentially saves 1MB of zeroes per block on the wire. Enabling requires
@@ -632,7 +631,7 @@ 
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
-  'data': ['xbzrle', 'x-rdma-pin-all', 'auto-converge', 'zero-blocks'] }
+  'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks'] }
 
 ##
 # @MigrationCapabilityStatus