Patchwork [RFC,RDMA,support,v2:,1/6] add openfabrics RDMA libraries, configure options to build

Submitter mrhines@linux.vnet.ibm.com
Date Feb. 11, 2013, 10:59 p.m.
Message ID <1360623573-29737-1-git-send-email-mrhines@linux.vnet.ibm.com>
Permalink /patch/219693/
State New

Comments

mrhines@linux.vnet.ibm.com - Feb. 11, 2013, 10:59 p.m.
From: "Michael R. Hines" <mrhines@us.ibm.com>

This patchset introduces RDMA-based live migration to QEMU.

A copy of this documentation is located online:
http://wiki.qemu.org/Features/RDMALiveMigration

DESIGN:
==========
1. In order to provide maximum cross-device compatibility, we use the
   librdmacm library, which abstracts the RDMA capabilities of each
   individual type of RDMA device, including InfiniBand, iWARP, and
   RoCE. This patch has been tested on both RoCE and InfiniBand
   devices from Mellanox.
2. A new file named "migration-rdma.c" contains the core code required
   to perform librdmacm connection establishment and the transfer of 
   actual RDMA contents.
3. Files "arch_init.c" and "savevm.c" have been modified to transfer the
   VM's memory in the standard live migration path using RDMA
   instead of TCP.
4. None of the original logic for migration of devices and protocol
   synchronization changes; that happens simultaneously over TCP
   as it normally does.
5. Currently, the XBZRLE capability and the detection of zero pages
   (dup_page()) significantly slow down the empirical throughput observed
   when RDMA is activated, so the code path skips these capabilities when
   RDMA is enabled. Hopefully, we can stop doing this in the future and
   come up with a way to preserve these capabilities simultaneously with
   the use of RDMA.

PERFORMANCE:
============

Using a 40 Gbps InfiniBand link performing a worst-case stress test:

1. Average worst-case RDMA throughput with
   $ stress --vm-bytes 1024M --vm 1 --vm-keep
   Approximately 26 Gbps
2. Average worst-case TCP throughput with
   $ stress --vm-bytes 1024M --vm 1 --vm-keep
   Approximately 8 Gbps (using IPoIB, IP over InfiniBand)

Average downtime (stop time) ranges between 28 and 33 milliseconds.
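
As a rough illustration of what the two throughput figures imply, the
sketch below computes how long a single pass over the 1024M stress
working set would take at each rate. It assumes the link rate is the only
bottleneck and ignores pages being re-dirtied during the pass, so it is a
lower bound rather than a measurement. The helper name transfer_seconds
is hypothetical and not part of this patch.

```shell
#!/bin/sh
# Hypothetical helper (not part of the patch): seconds needed to move
# SIZE_MIB mebibytes once over a link sustaining RATE_GBPS gigabits/s.
# Assumes throughput is the only limit; page re-dirtying is ignored.
transfer_seconds() {
    awk -v mib="$1" -v gbps="$2" \
        'BEGIN { printf "%.2f\n", (mib * 1024 * 1024 * 8) / (gbps * 1e9) }'
}

# Rates taken from the figures reported above.
echo "RDMA: $(transfer_seconds 1024 26) s"
echo "TCP:  $(transfer_seconds 1024 8) s"
```

Under these assumptions one pass takes about 0.33 s over RDMA and about
1.07 s over TCP, which is the same 3.25x ratio as the raw throughputs.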

An *exhaustive* paper (2010), linked on the QEMU wiki, shows additional
performance details:

http://wiki.qemu.org/Features/RDMALiveMigration

USAGE:
==========
Complete instructions for compiling and running with RDMA are also
available on the wiki (probably too much for a cover letter).
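
Before passing --enable-rdma to configure, it may help to check that the
librdmacm headers and library are installed. The sketch below is a
standalone approximation of the probe this patch adds to configure: it
compiles the same two-line test program and links it against -lrdmacm.
The function name have_rdmacm, the use of mktemp, and the fallback to
"cc" are assumptions made for illustration; configure itself uses its own
compile_prog helper.

```shell
#!/bin/sh
# Hypothetical standalone check; approximates the configure probe in
# this patch. Returns 0 if a program including <rdma/rdma_cma.h> can be
# compiled and linked with -lrdmacm, 1 otherwise.
have_rdmacm() {
    tmpdir=$(mktemp -d) || return 1
    cat > "$tmpdir/probe.c" <<EOF
#include <rdma/rdma_cma.h>
int main(void) { return 0; }
EOF
    if ${CC:-cc} -o "$tmpdir/probe" "$tmpdir/probe.c" -lrdmacm 2>/dev/null; then
        status=0
    else
        status=1
    fi
    rm -rf "$tmpdir"
    return $status
}

if have_rdmacm; then
    echo "librdmacm found: ./configure --enable-rdma should pass the probe"
else
    echo "librdmacm not found: install its development files first"
fi
```

With the patch as posted, rdma defaults to "no", so RDMA support is only
built when --enable-rdma is given explicitly.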

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 Makefile.objs |    1 +
 configure     |   25 +++++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/Makefile.objs b/Makefile.objs
index 68eb0ce..38767cc 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -57,6 +57,7 @@  common-obj-$(CONFIG_POSIX) += os-posix.o
 common-obj-$(CONFIG_LINUX) += fsdev/
 
 common-obj-y += migration.o migration-tcp.o
+common-obj-$(CONFIG_RDMA) += migration-rdma.o
 common-obj-y += qemu-char.o #aio.o
 common-obj-y += block-migration.o
 common-obj-y += page_cache.o
diff --git a/configure b/configure
index b7635e4..893935f 100755
--- a/configure
+++ b/configure
@@ -170,6 +170,7 @@  xfs=""
 
 vhost_net="no"
 kvm="no"
+rdma="no"
 gprof="no"
 debug_tcg="no"
 debug="no"
@@ -897,6 +898,10 @@  for opt do
   ;;
   --enable-virtio-blk-data-plane) virtio_blk_data_plane="yes"
   ;;
+  --enable-rdma) rdma="yes"
+  ;;
+  --disable-rdma) rdma="no"
+  ;;
   *) echo "ERROR: unknown option $opt"; show_help="yes"
   ;;
   esac
@@ -1087,6 +1092,8 @@  echo "  --enable-bluez           enable bluez stack connectivity"
 echo "  --disable-slirp          disable SLIRP userspace network connectivity"
 echo "  --disable-kvm            disable KVM acceleration support"
 echo "  --enable-kvm             enable KVM acceleration support"
+echo "  --disable-rdma           disable RDMA-based migration support"
+echo "  --enable-rdma            enable RDMA-based migration support"
 echo "  --enable-tcg-interpreter enable TCG with bytecode interpreter (TCI)"
 echo "  --disable-nptl           disable usermode NPTL support"
 echo "  --enable-nptl            enable usermode NPTL support"
@@ -1718,6 +1725,18 @@  EOF
   libs_softmmu="$sdl_libs $libs_softmmu"
 fi
 
+if test "$rdma" = "yes" ; then
+  cat > $TMPC <<EOF
+#include <rdma/rdma_cma.h>
+int main(void) { return 0; }
+EOF
+  rdma_libs="-lrdmacm"
+  if ! compile_prog "" "$rdma_libs" ; then
+      feature_not_found "rdma"
+  fi
+
+fi
+
 ##########################################
 # VNC TLS/WS detection
 if test "$vnc" = "yes" -a \( "$vnc_tls" != "no" -o "$vnc_ws" != "no" \) ; then
@@ -3318,6 +3337,7 @@  echo "Linux AIO support $linux_aio"
 echo "ATTR/XATTR support $attr"
 echo "Install blobs     $blobs"
 echo "KVM support       $kvm"
+echo "RDMA support      $rdma"
 echo "TCG interpreter   $tcg_interpreter"
 echo "fdt support       $fdt"
 echo "preadv support    $preadv"
@@ -4278,6 +4298,11 @@  if [ "$pixman" = "internal" ]; then
   echo "config-host.h: subdir-pixman" >> $config_host_mak
 fi
 
+if test "$rdma" = "yes" ; then
+  echo "CONFIG_RDMA=y" >> $config_host_mak
+  echo "LIBS+=$rdma_libs" >> $config_host_mak
+fi
+
 # build tree in object directory in case the source is not in the current directory
 DIRS="tests tests/tcg tests/tcg/cris tests/tcg/lm32"
 DIRS="$DIRS pc-bios/optionrom pc-bios/spapr-rtas"