From: Marcel Apfelbaum
To: qemu-devel@nongnu.org
Cc: ehabkost@redhat.com, mst@redhat.com, f4bug@amsat.org, yuval.shaia@oracle.com, pbonzini@redhat.com, marcel@redhat.com, imammedo@redhat.com
Date: Wed, 3 Jan 2018 12:29:06 +0200
Subject: [Qemu-devel] [PATCH V3 0/5] hw/pvrdma: PVRDMA device implementation
Message-Id: <20180103102911.35562-1-marcel@redhat.com>
X-Patchwork-Id: 854960

V2 -> V3:
 - Addressed Michael S. Tsirkin and Philippe Mathieu-Daudé comments:
   - Moved the device to hw/rdma
 - Addressed Michael S. Tsirkin comments:
   - Split the code into generic (hw/rdma) and VMware-specific (hw/rdma/vmw)
   - Added more details to the documentation (VMware guest-host protocol)
   - Removed MAD processing
   - Limited the memory the guest can pin
 - Addressed Philippe Mathieu-Daudé comment:
   - s/roundup_pow_of_two/pow2roundup32 and moved it to qemu/host-utils.h
 - Added Shamit Rabinovici's review to the documentation
 - Rebased to latest master

RFC -> V2:
 - Full implementation of the pvrdma device
 - Backend is an ibdevice interface, no need for the KDBR module

General description
===================
PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
It works with the existing Linux kernel driver as is; no special guest
modifications are needed.

While it complies with the VMware device, it can also communicate with bare
metal RDMA-enabled machines and does not require an RDMA HCA in the host; it
can work with Soft-RoCE (rxe).

It does not require the whole guest RAM to be pinned, which allows memory
over-commit, and, although not implemented yet, migration support will be
possible with some HW assistance.

Design
======
 - Follows the behavior of VMware's pvrdma device, but is not tightly coupled
   with it; most of the code can be reused if we decide to continue to a
   Virtio-based RDMA device.
 - It exposes 3 BARs:
     BAR 0 - MSI-X; uses 3 vectors for the command ring, async events and
             completions
     BAR 1 - Configuration registers
     BAR 2 - UAR, used to pass HW commands from the driver
 - The device performs internal management of the RDMA resources (PDs, CQs,
   QPs, ...), meaning the objects are not directly coupled to the resources
   of a physical RDMA device.

The pvrdma backend is an ibdevice interface that can be exposed either by a
Soft-RoCE (rxe) device on machines with no RDMA device, or by an HCA SRIOV
function (VF/PF).
Note that ibdevice interfaces can't be shared between pvrdma devices; each
one requires a separate instance (rxe or SRIOV VF).

Tests and performance
=====================
Tested with the Soft-RoCE backend (rxe), Mellanox ConnectX3 and Mellanox
ConnectX4 HCAs with:
 - VMs in the same host
 - VMs in different hosts
 - VMs to bare metal

The best performance was achieved with ConnectX HCAs and buffer sizes larger
than 1MB, reaching the line rate of ~50Gb/s.
The conclusion is that with the PVRDMA device there is no actual performance
penalty compared to bare metal for big enough buffers (which is quite common
when using RDMA), while still allowing memory overcommit.

Marcel Apfelbaum (3):
  mem: add share parameter to memory-backend-ram
  docs: add pvrdma device documentation.
  MAINTAINERS: add entry for hw/rdma

Yuval Shaia (2):
  pci/shpc: Move function to generic header file
  pvrdma: initial implementation

 MAINTAINERS                        |   8 +
 Makefile.objs                      |   2 +
 backends/hostmem-file.c            |  25 +-
 backends/hostmem-ram.c             |   4 +-
 backends/hostmem.c                 |  21 +
 configure                          |   9 +-
 default-configs/arm-softmmu.mak    |   1 +
 default-configs/i386-softmmu.mak   |   1 +
 default-configs/x86_64-softmmu.mak |   1 +
 docs/pvrdma.txt                    | 145 +++++++
 exec.c                             |  26 +-
 hw/Makefile.objs                   |   1 +
 hw/pci/shpc.c                      |  13 +-
 hw/rdma/Makefile.objs              |   7 +
 hw/rdma/rdma_backend.c             | 815 +++++++++++++++++++++++++++++++++++++
 hw/rdma/rdma_backend.h             |  92 +++++
 hw/rdma/rdma_backend_defs.h        |  62 +++
 hw/rdma/rdma_rm.c                  | 619 ++++++++++++++++++++++++++++
 hw/rdma/rdma_rm.h                  |  69 ++++
 hw/rdma/rdma_rm_defs.h             | 106 +++++
 hw/rdma/rdma_utils.h               |  36 ++
 hw/rdma/trace-events               |   5 +
 hw/rdma/vmw/pvrdma.h               | 123 ++++++
 hw/rdma/vmw/pvrdma_cmd.c           | 585 ++++++++++++++++++++++++++
 hw/rdma/vmw/pvrdma_dev_api.h       | 580 ++++++++++++++++++++++++++
 hw/rdma/vmw/pvrdma_dev_ring.c      | 140 +++++++
 hw/rdma/vmw/pvrdma_dev_ring.h      |  43 ++
 hw/rdma/vmw/pvrdma_ib_verbs.h      | 400 ++++++++++++++++++
 hw/rdma/vmw/pvrdma_main.c          | 644 +++++++++++++++++++++++++++++
 hw/rdma/vmw/pvrdma_qp_ops.c        | 212 ++++++++++
 hw/rdma/vmw/pvrdma_qp_ops.h        |  27 ++
 hw/rdma/vmw/pvrdma_ring.h          | 134 ++++++
 hw/rdma/vmw/pvrdma_types.h         |  38 ++
 hw/rdma/vmw/pvrdma_utils.c         | 135 ++++++
 hw/rdma/vmw/pvrdma_utils.h         |  26 ++
 hw/rdma/vmw/trace-events           |   5 +
 include/exec/memory.h              |  23 ++
 include/exec/ram_addr.h            |   3 +-
 include/hw/pci/pci_ids.h           |   3 +
 include/qemu/host-utils.h          |  10 +
 include/qemu/osdep.h               |   2 +-
 include/sysemu/hostmem.h           |   2 +-
 include/sysemu/kvm.h               |   2 +-
 memory.c                           |  16 +-
 util/oslib-posix.c                 |   4 +-
 util/oslib-win32.c                 |   2 +-
 46 files changed, 5165 insertions(+), 62 deletions(-)
 create mode 100644 docs/pvrdma.txt
 create mode 100644 hw/rdma/Makefile.objs
 create mode 100644 hw/rdma/rdma_backend.c
 create mode 100644 hw/rdma/rdma_backend.h
 create mode 100644 hw/rdma/rdma_backend_defs.h
 create mode 100644 hw/rdma/rdma_rm.c
 create mode 100644 hw/rdma/rdma_rm.h
 create mode 100644 hw/rdma/rdma_rm_defs.h
 create mode 100644 hw/rdma/rdma_utils.h
 create mode 100644 hw/rdma/trace-events
 create mode 100644 hw/rdma/vmw/pvrdma.h
 create mode 100644 hw/rdma/vmw/pvrdma_cmd.c
 create mode 100644 hw/rdma/vmw/pvrdma_dev_api.h
 create mode 100644 hw/rdma/vmw/pvrdma_dev_ring.c
 create mode 100644 hw/rdma/vmw/pvrdma_dev_ring.h
 create mode 100644 hw/rdma/vmw/pvrdma_ib_verbs.h
 create mode 100644 hw/rdma/vmw/pvrdma_main.c
 create mode 100644 hw/rdma/vmw/pvrdma_qp_ops.c
 create mode 100644 hw/rdma/vmw/pvrdma_qp_ops.h
 create mode 100644 hw/rdma/vmw/pvrdma_ring.h
 create mode 100644 hw/rdma/vmw/pvrdma_types.h
 create mode 100644 hw/rdma/vmw/pvrdma_utils.c
 create mode 100644 hw/rdma/vmw/pvrdma_utils.h
 create mode 100644 hw/rdma/vmw/trace-events
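
Illustration of the BAR layout from the Design section
======================================================
The three BARs and the three MSI-X vectors listed under Design can be
sketched in C as below. This is only a reading aid: every identifier is
hypothetical and not taken from the patches, and the ordering of the MSI-X
vectors is an assumption (the cover letter only names their purposes).

```c
#include <stddef.h>

/* Hypothetical names; the BAR indexes and roles follow the Design section. */
enum pvrdma_bar {
    PVRDMA_BAR_MSIX = 0,  /* BAR 0: MSI-X */
    PVRDMA_BAR_REGS = 1,  /* BAR 1: configuration registers */
    PVRDMA_BAR_UAR  = 2,  /* BAR 2: UAR, HW commands from the driver */
    PVRDMA_BAR_COUNT
};

/*
 * The three MSI-X vectors named in the cover letter.
 * Assumption: the numeric order shown here is illustrative only.
 */
enum pvrdma_msix_vector {
    PVRDMA_VEC_CMD_RING = 0,  /* command ring */
    PVRDMA_VEC_ASYNC_EVENT,   /* async events */
    PVRDMA_VEC_COMPLETION,    /* completions */
    PVRDMA_VEC_COUNT
};

/* Return a short description of a BAR's role, or NULL for an unknown index. */
static const char *pvrdma_bar_role(int bar)
{
    switch (bar) {
    case PVRDMA_BAR_MSIX:
        return "MSI-X";
    case PVRDMA_BAR_REGS:
        return "registers";
    case PVRDMA_BAR_UAR:
        return "UAR";
    default:
        return NULL;
    }
}
```

Keeping the BAR indexes in one enum makes it harder for the device model and
any debugging helpers to disagree about which BAR carries which role.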