From patchwork Wed Dec 14 10:07:57 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Stephen Finucane X-Patchwork-Id: 705634
From: Stephen Finucane To: dev@openvswitch.org Date: Wed, 14 Dec 2016 10:07:57 +0000 Message-Id: <20161214100800.10687-2-stephen@that.guru> X-Mailer: git-send-email 2.9.3 In-Reply-To: <20161214100800.10687-1-stephen@that.guru> References: <20161214100800.10687-1-stephen@that.guru> Subject: [ovs-dev] [PATCH v2 1/4] doc: Split dpdk, dpdk-advanced into multiple docs Sender: ovs-dev-bounces@openvswitch.org Combined, the dpdk and dpdk-advanced installation documents provide a lot of useful information, but most of this information is unrelated to installation. Rework these documents, completely breaking up the dpdk-advanced document into multiple smaller documents in other sections and moving non-install aspects of the dpdk document into these sections. This aims to tie the DPDK docs into the documentation structure.
Signed-off-by: Stephen Finucane --- v2: - Resolve merge conflicts --- Documentation/automake.mk | 6 +- Documentation/howto/dpdk.rst | 603 +++++++++++++ Documentation/howto/index.rst | 1 + Documentation/index.rst | 13 + Documentation/intro/install/dpdk-advanced.rst | 938 --------------------- Documentation/intro/install/dpdk.rst | 584 ++++++------- Documentation/intro/install/index.rst | 5 - Documentation/topics/dpdk/index.rst | 32 + .../topics/{dpdk.rst => dpdk/ivshmem.rst} | 6 +- Documentation/topics/dpdk/vhost-user.rst | 396 +++++++++ Documentation/topics/index.rst | 3 +- Documentation/topics/testing.rst | 38 + 12 files changed, 1369 insertions(+), 1256 deletions(-) create mode 100644 Documentation/howto/dpdk.rst delete mode 100644 Documentation/intro/install/dpdk-advanced.rst create mode 100644 Documentation/topics/dpdk/index.rst rename Documentation/topics/{dpdk.rst => dpdk/ivshmem.rst} (93%) create mode 100644 Documentation/topics/dpdk/vhost-user.rst create mode 100644 Documentation/topics/testing.rst diff --git a/Documentation/automake.mk b/Documentation/automake.mk index b02d63e..ffb8ae3 100644 --- a/Documentation/automake.mk +++ b/Documentation/automake.mk @@ -9,7 +9,6 @@ EXTRA_DIST += \ Documentation/intro/install/index.rst \ Documentation/intro/install/bash-completion.rst \ Documentation/intro/install/debian.rst \ - Documentation/intro/install/dpdk-advanced.rst \ Documentation/intro/install/dpdk.rst \ Documentation/intro/install/fedora.rst \ Documentation/intro/install/general.rst \ @@ -25,7 +24,10 @@ EXTRA_DIST += \ Documentation/topics/bonding.rst \ Documentation/topics/datapath.rst \ Documentation/topics/design.rst \ - Documentation/topics/dpdk.rst \ + Documentation/topics/dpdk/index.rst \ + Documentation/topics/dpdk/vhost-user.rst \ + Documentation/topics/dpdk/ivshmem.rst \ + Documentation/topics/testing.rst \ Documentation/topics/high-availability.rst \ Documentation/topics/integration.rst \ Documentation/topics/openflow.rst \ diff --git a/Documentation/howto/dpdk.rst b/Documentation/howto/dpdk.rst new file mode 100644 index 0000000..f55ae3b --- /dev/null +++ b/Documentation/howto/dpdk.rst @@ -0,0 +1,603 @@ +.. + Licensed under the Apache License, Version 2.0 (the "License"); you may + not use this file except in compliance with the License. You may obtain + a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, WITHOUT + WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the + License for the specific language governing permissions and limitations + under the License. + + Convention for heading levels in Open vSwitch documentation: + + ======= Heading 0 (reserved for the title in a document) + ------- Heading 1 + ~~~~~~~ Heading 2 + +++++++ Heading 3 + ''''''' Heading 4 + + Avoid deeper levels because they do not render well. + +============================ +Using Open vSwitch with DPDK +============================ + +This document describes how to use Open vSwitch with DPDK datapath. + +.. important:: + + Using the DPDK datapath requires building OVS with DPDK support. Refer to + :doc:`/intro/install/dpdk` for more information. + +Ports and Bridges +----------------- + +ovs-vsctl can be used to set up bridges and other Open vSwitch features. 
+Bridges should be created with a ``datapath_type=netdev``:: + + $ ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev + +ovs-vsctl can also be used to add DPDK devices. OVS expects DPDK device names to start with ``dpdk`` and end with a portid. ovs-vswitchd should print the number of dpdk devices found in the log file:: + + $ ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk + $ ovs-vsctl add-port br0 dpdk1 -- set Interface dpdk1 type=dpdk + +After the DPDK ports are added to the switch, a polling thread continuously polls the DPDK devices and consumes 100% of the core, as can be checked with the ``top`` and ``ps`` commands:: + + $ top -H + $ ps -eLo pid,psr,comm | grep pmd + +Creating bonds of DPDK interfaces is slightly different to creating bonds of system interfaces. For DPDK, the interface type must be explicitly set. For example:: + + $ ovs-vsctl add-bond br0 dpdkbond dpdk0 dpdk1 \ + -- set Interface dpdk0 type=dpdk \ + -- set Interface dpdk1 type=dpdk + +To stop ovs-vswitchd and delete the bridge, run:: + + $ ovs-appctl -t ovs-vswitchd exit + $ ovs-appctl -t ovsdb-server exit + $ ovs-vsctl del-br br0 + +PMD Thread Statistics +--------------------- + +To show current stats:: + + $ ovs-appctl dpif-netdev/pmd-stats-show + +To clear previous stats:: + + $ ovs-appctl dpif-netdev/pmd-stats-clear + +Port/RXQ Assignment to PMD Threads +---------------------------------- + +To show port/rxq assignment:: + + $ ovs-appctl dpif-netdev/pmd-rxq-show + +To change the default rxq assignment to pmd threads, rxqs may be manually pinned to desired cores using:: + + $ ovs-vsctl set Interface <iface> \ + other_config:pmd-rxq-affinity=<rxq-affinity-list> + +where: + +- ``<rxq-affinity-list>`` is a CSV list of ``<queue-id>:<core-id>`` values + +For example:: + + $ ovs-vsctl set interface dpdk0 options:n_rxq=4 \ + other_config:pmd-rxq-affinity="0:3,1:7,3:8" + +This will ensure: + +- Queue #0 pinned to core 3 +- Queue #1 pinned to core 7 +- Queue #2 not pinned +- Queue #3 pinned to core 8 + +After that, PMD threads on cores where RX queues were pinned will become ``isolated``, meaning those threads will poll only the pinned RX queues. + +.. warning:: + If there are no ``non-isolated`` PMD threads, ``non-pinned`` RX queues will not be polled. Also, if the provided ``core_id`` is not available (e.g. the ``core_id`` is not in ``pmd-cpu-mask``), the RX queue will not be polled by any PMD thread. + +QoS +--- + +Assuming you have a vhost-user port transmitting traffic consisting of packets of size 64 bytes, the following command would limit the egress transmission rate of the port to ~1,000,000 packets per second:: + + $ ovs-vsctl set port vhost-user0 qos=@newqos -- \ + --id=@newqos create qos type=egress-policer other-config:cir=46000000 \ + other-config:cbs=2048 + +To examine the QoS configuration of the port, run:: + + $ ovs-appctl -t ovs-vswitchd qos/show vhost-user0 + +To clear the QoS configuration from the port and ovsdb, run:: + + $ ovs-vsctl destroy QoS vhost-user0 -- clear Port vhost-user0 qos + +Refer to vswitch.xml for more details on egress-policer.
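+The example figures above, and those in the Rate Limiting section below, can be cross-checked with some simple arithmetic (an editorial sketch: it assumes the policers count each 64B frame minus the 18B of L2 header and CRC, consistent with the 18B accounting used in the Jumbo Frames section later in this document):: + + 64B frame - 18B (L2 header + CRC) = 46B counted per packet + 1,000,000 packets/s x 46B = 46,000,000 B/s -> other-config:cir=46000000 + 1,000,000 packets/s x 46B x 8 = 368,000 kbit/s -> ingress_policing_rate=368000 (see Rate Limiting below) + +Scale ``cir`` (and, in the next section, ``ingress_policing_rate``) proportionally for other packet sizes or target rates.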
+Rate Limiting +-------------- + +Here is an example of ingress policing usage. Assuming you have a vhost-user port receiving traffic consisting of packets of size 64 bytes, the following command would limit the reception rate of the port to ~1,000,000 packets per second:: + + $ ovs-vsctl set interface vhost-user0 ingress_policing_rate=368000 \ + ingress_policing_burst=1000 + +To examine the ingress policer configuration of the port:: + + $ ovs-vsctl list interface vhost-user0 + +To clear the ingress policer configuration from the port:: + + $ ovs-vsctl set interface vhost-user0 ingress_policing_rate=0 + +Refer to vswitch.xml for more details on ingress-policer. + +Flow Control +------------ + +Flow control can be enabled only on DPDK physical ports. To enable flow control support on the tx side while adding a port, run:: + + $ ovs-vsctl add-port br0 dpdk0 -- \ + set Interface dpdk0 type=dpdk options:tx-flow-ctrl=true + +Similarly, to enable rx flow control, run:: + + $ ovs-vsctl add-port br0 dpdk0 -- \ + set Interface dpdk0 type=dpdk options:rx-flow-ctrl=true + +To enable flow control auto-negotiation, run:: + + $ ovs-vsctl add-port br0 dpdk0 -- \ + set Interface dpdk0 type=dpdk options:flow-ctrl-autoneg=true + +To turn on tx flow control at run time for an existing port, run:: + + $ ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=true + +Flow control can be turned off by setting the respective parameter to ``false``. For example, to disable flow control on the tx side, run:: + + $ ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=false + +pdump +----- + +pdump allows you to listen on DPDK ports and view the traffic that is passing on them. To use this utility, libpcap must be installed on the system. Furthermore, DPDK must be built with ``CONFIG_RTE_LIBRTE_PDUMP=y`` and ``CONFIG_RTE_LIBRTE_PMD_PCAP=y``. + +.. warning:: + A performance decrease is expected when using a monitoring application like the DPDK pdump app. + +To use pdump, simply launch OVS as usual, then navigate to the ``app/pdump`` directory in DPDK, ``make`` the application and run it like so:: + + $ sudo ./build/app/dpdk-pdump -- \ + --pdump port=0,queue=0,rx-dev=/tmp/pkts.pcap \ + --server-socket-path=/usr/local/var/run/openvswitch + +The above command captures traffic received on queue 0 of port 0 and stores it in ``/tmp/pkts.pcap``. Other combinations of port numbers, queue numbers and pcap locations are of course also available to use. For example, to capture all packets that traverse port 0 in a single pcap file:: + + $ sudo ./build/app/dpdk-pdump -- \ + --pdump 'port=0,queue=*,rx-dev=/tmp/pkts.pcap,tx-dev=/tmp/pkts.pcap' \ + --server-socket-path=/usr/local/var/run/openvswitch + +``server-socket-path`` must be set to the value of ``ovs_rundir()``, which typically resolves to ``/usr/local/var/run/openvswitch``. + +Many tools are available to view the contents of the pcap file. One example is tcpdump. Issue the following command to view the contents of ``pkts.pcap``:: + + $ tcpdump -r pkts.pcap + +More information on the pdump app and its usage can be found in the `DPDK docs `__. + +Jumbo Frames +------------ + +By default, DPDK ports are configured with standard Ethernet MTU (1500B). To enable Jumbo Frames support for a DPDK port, change the Interface's ``mtu_request`` attribute to a sufficiently large value.
For example, to add a DPDK Phy port with an MTU of 9000:: + + $ ovs-vsctl add-port br0 dpdk0 \ + -- set Interface dpdk0 type=dpdk \ + -- set Interface dpdk0 mtu_request=9000 + +Similarly, to change the MTU of an existing port to 6200:: + + $ ovs-vsctl set Interface dpdk0 mtu_request=6200 + +Some additional configuration is needed to take advantage of jumbo frames with vHost ports: + +1. *mergeable buffers* must be enabled for vHost ports, as demonstrated in the QEMU command line snippet below:: + + -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \ + -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on + +2. Where virtio devices are bound to the Linux kernel driver in a guest environment (i.e. interfaces are not bound to an in-guest DPDK driver), the MTU of those logical network interfaces must also be increased to a sufficiently large value. This avoids segmentation of Jumbo Frames received in the guest. Note that 'MTU' refers to the length of the IP packet only, and not that of the entire frame. + + To calculate the exact MTU of a standard IPv4 frame, subtract the L2 header and CRC lengths (i.e. 18B) from the max supported frame size. So, to set the MTU for a 9018B Jumbo Frame:: + + $ ifconfig eth1 mtu 9000 + +When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments is increased, such that a full Jumbo Frame of a specific size may be accommodated within a single mbuf segment. + +Jumbo frame support has been validated against 9728B frames, which is the largest frame size supported by the Fortville NIC using the DPDK i40e driver, but larger frames and other DPDK NIC drivers may be supported. Such configurations are typically relevant for use cases involving East-West traffic only. + +.. _dpdk-ovs-in-guest: + +OVS with DPDK Inside VMs +------------------------ + +Additional configuration is required if you want to run ovs-vswitchd with the DPDK backend inside a QEMU virtual machine. ovs-vswitchd creates separate DPDK TX queues for each CPU core available. This operation fails inside a QEMU virtual machine because, by default, the VirtIO NIC provided to the guest is configured to support only a single TX queue and a single RX queue. To change this behavior, you need to turn on the ``mq`` (multiqueue) property of all ``virtio-net-pci`` devices emulated by QEMU and used by DPDK. You may do this manually (by changing the QEMU command line) or, if you use Libvirt, by adding the following string to the ``<interface>`` sections of all network devices used by DPDK:: + + <driver name='vhost' queues='N'/> + +where: + +``N`` + determines how many queues can be used by the guest. + +This requires QEMU >= 2.2. + +.. _dpdk-phy-phy: + +PHY-PHY +------- + +Add a userspace bridge and two ``dpdk`` (PHY) ports:: + + # Add userspace bridge + $ ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev + + # Add two dpdk ports + $ ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk + $ ovs-vsctl add-port br0 dpdk1 -- set Interface dpdk1 type=dpdk + +Add test flows to forward packets between DPDK port 0 and port 1:: + + # Clear current flows + $ ovs-ofctl del-flows br0 + + # Add flows between port 1 (dpdk0) and port 2 (dpdk1) + $ ovs-ofctl add-flow br0 in_port=1,action=output:2 + $ ovs-ofctl add-flow br0 in_port=2,action=output:1 + +Transmit traffic into either port. You should see it returned via the other.
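+The flow rules above assume that dpdk0 and dpdk1 were assigned OpenFlow port numbers 1 and 2, respectively. If in doubt, the port numbering can be confirmed, and per-port traffic counters checked while traffic is applied, using standard OpenFlow tooling (a suggested sanity check, not part of the original test steps):: + + # Show the OpenFlow port numbers assigned to each interface + $ ovs-ofctl show br0 + + # Verify that the rx/tx counters on dpdk0 and dpdk1 increase + $ ovs-ofctl dump-ports br0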
.. _dpdk-vhost-loopback: + +PHY-VM-PHY (vHost Loopback) +--------------------------- + +Add a userspace bridge, two ``dpdk`` (PHY) ports, and two ``dpdkvhostuser`` ports:: + + # Add userspace bridge + $ ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev + + # Add two dpdk ports + $ ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk + $ ovs-vsctl add-port br0 dpdk1 -- set Interface dpdk1 type=dpdk + + # Add two dpdkvhostuser ports + $ ovs-vsctl add-port br0 dpdkvhostuser0 \ + -- set Interface dpdkvhostuser0 type=dpdkvhostuser + $ ovs-vsctl add-port br0 dpdkvhostuser1 \ + -- set Interface dpdkvhostuser1 type=dpdkvhostuser + +Add test flows to forward packets between DPDK devices and VM ports:: + + # Clear current flows + $ ovs-ofctl del-flows br0 + + # Add flows + $ ovs-ofctl add-flow br0 in_port=1,action=output:3 + $ ovs-ofctl add-flow br0 in_port=3,action=output:1 + $ ovs-ofctl add-flow br0 in_port=4,action=output:2 + $ ovs-ofctl add-flow br0 in_port=2,action=output:4 + + # Dump flows + $ ovs-ofctl dump-flows br0 + +Create a VM using the following configuration: + ++----------------------+--------+-----------------+ +| configuration | values | comments | ++----------------------+--------+-----------------+ +| qemu version | 2.2.0 | n/a | +| qemu thread affinity | core 5 | taskset 0x20 | +| memory | 4GB | n/a | +| cores | 2 | n/a | +| Qcow2 image | CentOS7| n/a | +| mrg_rxbuf | off | n/a | ++----------------------+--------+-----------------+ + +You can do this directly with QEMU via the ``qemu-system-x86_64`` application:: + + $ export VM_NAME=vhost-vm + $ export GUEST_MEM=3072M + $ export QCOW2_IMAGE=/root/CentOS7_x86_64.qcow2 + $ export VHOST_SOCK_DIR=/usr/local/var/run/openvswitch + + $ taskset 0x20 qemu-system-x86_64 -name $VM_NAME -cpu host -enable-kvm \ + -m $GUEST_MEM -drive file=$QCOW2_IMAGE --nographic -snapshot \ + -numa node,memdev=mem -mem-prealloc -smp sockets=1,cores=2 \ + -object memory-backend-file,id=mem,size=$GUEST_MEM,mem-path=/dev/hugepages,share=on \ + -chardev socket,id=char0,path=$VHOST_SOCK_DIR/dpdkvhostuser0 \ + -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \ + -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=off \ + -chardev socket,id=char1,path=$VHOST_SOCK_DIR/dpdkvhostuser1 \ + -netdev type=vhost-user,id=mynet2,chardev=char1,vhostforce \ + -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mrg_rxbuf=off + +For an explanation of this command, along with alternative approaches such as booting the VM via libvirt, refer to :doc:`/topics/dpdk/vhost-user`. + +Once the guest is configured and booted, configure DPDK packet forwarding within the guest. To accomplish this, build the ``testpmd`` application as described in :ref:`dpdk-testpmd`. Once compiled, run the application:: + + $ cd $DPDK_DIR/app/test-pmd; + $ ./testpmd -c 0x3 -n 4 --socket-mem 1024 -- \ + --burst=64 -i --txqflags=0xf00 --disable-hw-vlan + $ set fwd mac retry + $ start + +When you finish testing, bind the vNICs back to the kernel:: + + $ $DPDK_DIR/tools/dpdk-devbind.py --bind=virtio-pci 0000:00:03.0 + $ $DPDK_DIR/tools/dpdk-devbind.py --bind=virtio-pci 0000:00:04.0 + +.. note:: + + Valid PCI IDs must be passed in the above example. The PCI IDs can be retrieved like so:: + + $ $DPDK_DIR/tools/dpdk-devbind.py --status + +More information on the dpdkvhostuser ports can be found in :doc:`/topics/dpdk/vhost-user`.
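+To verify that traffic is actually flowing through the loopback once ``testpmd`` is forwarding, the statistics interfaces shown earlier in this document can be used on the host (a suggested check, not part of the original walkthrough):: + + # Per-port packet counters + $ ovs-ofctl dump-ports br0 + + # PMD packet and cycle statistics + $ ovs-appctl dpif-netdev/pmd-stats-show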
+ +PHY-VM-PHY (vHost Loopback) (Kernel Forwarding) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:ref:`dpdk-vhost-loopback` details the steps for the PHY-VM-PHY loopback testcase using the DPDK testpmd application for packet forwarding in the guest VM. If you wish to do packet forwarding using the kernel stack instead, run the below commands on the guest:: + + $ ifconfig eth1 1.1.1.2/24 + $ ifconfig eth2 1.1.2.2/24 + $ systemctl stop firewalld.service + $ systemctl stop iptables.service + $ sysctl -w net.ipv4.ip_forward=1 + $ sysctl -w net.ipv4.conf.all.rp_filter=0 + $ sysctl -w net.ipv4.conf.eth1.rp_filter=0 + $ sysctl -w net.ipv4.conf.eth2.rp_filter=0 + $ route add -net 1.1.2.0/24 eth2 + $ route add -net 1.1.1.0/24 eth1 + $ arp -s 1.1.2.99 DE:AD:BE:EF:CA:FE + $ arp -s 1.1.1.99 DE:AD:BE:EF:CA:EE + +PHY-VM-PHY (vHost Multiqueue) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +vHost Multiqueue functionality can also be validated using the PHY-VM-PHY configuration. To begin, follow the steps described in :ref:`dpdk-phy-phy` to create and initialize the database, start ovs-vswitchd and add ``dpdk``-type devices to bridge ``br0``. Once complete, follow the below steps: + +1. Configure PMD and RXQs. + + For example, set the number of dpdk port rx queues to at least 2. The number of rx queues at the vhost-user interface gets automatically configured after virtio device connection and doesn't need manual configuration:: + + $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0xc + $ ovs-vsctl set Interface dpdk0 options:n_rxq=2 + $ ovs-vsctl set Interface dpdk1 options:n_rxq=2 + +2. Instantiate the guest VM using the QEMU cmdline + + We must configure the VM with appropriate software versions to ensure this feature is supported. + + .. list-table:: Recommended Guest Configuration + :header-rows: 1 + + * - Setting + - Value + * - QEMU version + - 2.5.0 + * - QEMU thread affinity + - 2 cores (taskset 0x30) + * - Memory + - 4 GB + * - Cores + - 2 + * - Distro + - Fedora 22 + * - Multiqueue + - Enabled + + To do this, instantiate the guest as follows:: + + $ export VM_NAME=vhost-vm + $ export GUEST_MEM=4096M + $ export QCOW2_IMAGE=/root/Fedora22_x86_64.qcow2 + $ export VHOST_SOCK_DIR=/usr/local/var/run/openvswitch + $ taskset 0x30 qemu-system-x86_64 -cpu host -smp 2,cores=2 -m 4096M \ + -drive file=$QCOW2_IMAGE --enable-kvm -name $VM_NAME \ + -nographic -numa node,memdev=mem -mem-prealloc \ + -object memory-backend-file,id=mem,size=$GUEST_MEM,mem-path=/dev/hugepages,share=on \ + -chardev socket,id=char1,path=$VHOST_SOCK_DIR/dpdkvhostuser0 \ + -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=2 \ + -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mq=on,vectors=6 \ + -chardev socket,id=char2,path=$VHOST_SOCK_DIR/dpdkvhostuser1 \ + -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce,queues=2 \ + -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mq=on,vectors=6 + + .. note:: + The queue value above should match the number of queues configured in OVS. The vector value should be set to "number of queues x 2 + 2". + +3. Configure the guest interface + + Assuming there are 2 interfaces in the guest, named eth0 and eth1, check the channel configuration and set the number of combined channels to 2 for the virtio devices:: + + $ ethtool -l eth0 + $ ethtool -L eth0 combined 2 + $ ethtool -L eth1 combined 2 + + More information can be found in :doc:`/topics/dpdk/vhost-user`. +
4. Configure kernel packet forwarding + + Configure IP and enable interfaces:: + + $ ifconfig eth0 5.5.5.1/24 up + $ ifconfig eth1 90.90.90.1/24 up + + Configure IP forwarding and add route entries:: + + $ sysctl -w net.ipv4.ip_forward=1 + $ sysctl -w net.ipv4.conf.all.rp_filter=0 + $ sysctl -w net.ipv4.conf.eth0.rp_filter=0 + $ sysctl -w net.ipv4.conf.eth1.rp_filter=0 + $ ip route add 2.1.1.0/24 dev eth1 + $ route add default gw 2.1.1.2 eth1 + $ route add default gw 90.90.90.90 eth1 + $ arp -s 90.90.90.90 DE:AD:BE:EF:CA:FE + $ arp -s 2.1.1.2 DE:AD:BE:EF:CA:FA + + Check traffic on multiple queues:: + + $ cat /proc/interrupts | grep virtio + +PHY-VM-PHY (IVSHMEM loopback) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +IVSHMEM can also be validated using the PHY-VM-PHY configuration. To begin, add a userspace bridge, two ``dpdk`` (PHY) ports, and a single ``dpdkr`` port:: + + # Add userspace bridge + $ ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev + + # Add two dpdk ports + $ ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk + $ ovs-vsctl add-port br0 dpdk1 -- set Interface dpdk1 type=dpdk + + # Add one dpdkr port + $ ovs-vsctl add-port br0 dpdkr0 -- set Interface dpdkr0 type=dpdkr + +.. TODO(stephenfin): What flows should the user configure? + +QEMU must be patched to enable IVSHMEM support:: + + $ cd /usr/src/ + $ wget http://wiki.qemu.org/download/qemu-2.2.1.tar.bz2 + $ tar -jxvf qemu-2.2.1.tar.bz2 + $ cd /usr/src/qemu-2.2.1 + $ wget https://raw.githubusercontent.com/netgroup-polito/un-orchestrator/master/orchestrator/compute_controller/plugins/kvm-libvirt/patches/ivshmem-qemu-2.2.1.patch + $ patch -p1 < ivshmem-qemu-2.2.1.patch + $ ./configure --target-list=x86_64-softmmu --enable-debug --extra-cflags='-g' + $ make -j 4 + +In addition, the ``cmdline_generator`` utility must be downloaded and built:: + + $ mkdir -p /usr/src/cmdline_generator + $ cd /usr/src/cmdline_generator + $ wget https://raw.githubusercontent.com/netgroup-polito/un-orchestrator/master/orchestrator/compute_controller/plugins/kvm-libvirt/cmdline_generator/cmdline_generator.c + $ wget https://raw.githubusercontent.com/netgroup-polito/un-orchestrator/master/orchestrator/compute_controller/plugins/kvm-libvirt/cmdline_generator/Makefile + $ export RTE_SDK=/usr/src/dpdk-16.11 + $ export RTE_TARGET=x86_64-ivshmem-linuxapp-gcc + $ make + +Once both the patched QEMU and the ``cmdline_generator`` utility have been built, run ``cmdline_generator`` to generate a suitable QEMU commandline, and use this to instantiate a guest. For example:: + + $ ./build/cmdline_generator -m -p dpdkr0 XXX + $ cmdline=`cat OVSMEMPOOL` + $ export VM_NAME=ivshmem-vm + $ export QCOW2_IMAGE=/root/CentOS7_x86_64.qcow2 + $ export QEMU_BIN=/usr/src/qemu-2.2.1/x86_64-softmmu/qemu-system-x86_64 + $ taskset 0x20 $QEMU_BIN -cpu host -smp 2,cores=2 -hda $QCOW2_IMAGE \ + -m 4096 --enable-kvm -name $VM_NAME -nographic -vnc :2 \ + -pidfile /tmp/vm1.pid $cmdline + +When the guest has started, connect to it and build and run the sample ``dpdkr`` app.
This application will simply loopback packets received over the +DPDK ring port:: + + $ echo 1024 > /proc/sys/vm/nr_hugepages + $ mount -t hugetlbfs nodev /dev/hugepages (if not already mounted) + + # Build the DPDK ring application in the VM + $ export RTE_SDK=/root/dpdk-16.11 + $ export RTE_TARGET=x86_64-ivshmem-linuxapp-gcc + $ make + + # Run dpdkring application + $ ./build/dpdkr -c 1 -n 4 -- -n 0 + # where "-n 0" refers to ring '0' i.e dpdkr0 diff --git a/Documentation/howto/index.rst b/Documentation/howto/index.rst index fe85a34..0eb3d75 100644 --- a/Documentation/howto/index.rst +++ b/Documentation/howto/index.rst @@ -40,6 +40,7 @@ topics covered herein, refer to :doc:`/topics/index`. lisp native-tunneling vtep + dpdk .. toctree:: :maxdepth: 1 diff --git a/Documentation/index.rst b/Documentation/index.rst index 8484dbd..2eecf95 100644 --- a/Documentation/index.rst +++ b/Documentation/index.rst @@ -61,6 +61,19 @@ vSwitch? Start here. Deeper Dive ----------- +- **Architecture** :doc:`topics/design` | + :doc:`topics/openflow` | + :doc:`topics/integration` | + :doc:`topics/porting` + +- **DPDK** :doc:`howto/dpdk` | + :doc:`topics/dpdk/vhost-user` | + :doc:`topics/dpdk/ivshmem` + +- **Windows** :doc:`topics/windows` + +- **Testing** :doc:`topics/testing` + - **Reference Guides:** :doc:`ref/index` The Open vSwitch Project diff --git a/Documentation/intro/install/dpdk-advanced.rst b/Documentation/intro/install/dpdk-advanced.rst deleted file mode 100644 index 44d1cd7..0000000 --- a/Documentation/intro/install/dpdk-advanced.rst +++ /dev/null @@ -1,938 +0,0 @@ -.. - Licensed under the Apache License, Version 2.0 (the "License"); you may - not use this file except in compliance with the License. You may obtain - a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, WITHOUT - WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the - License for the specific language governing permissions and limitations - under the License. - - Convention for heading levels in Open vSwitch documentation: - - ======= Heading 0 (reserved for the title in a document) - ------- Heading 1 - ~~~~~~~ Heading 2 - +++++++ Heading 3 - ''''''' Heading 4 - - Avoid deeper levels because they do not render well. - -================================= -Open vSwitch with DPDK (Advanced) -================================= - -The Advanced Install Guide explains how to improve OVS performance when using -DPDK datapath. This guide provides information on tuning, system configuration, -troubleshooting, static code analysis and testcases. - -Building as a Shared Library ----------------------------- - -DPDK can be built as a static or a shared library and shall be linked by -applications using DPDK datapath. When building OVS with DPDK, you can link -Open vSwitch against the shared DPDK library. - -.. note:: - Minor performance loss is seen with OVS when using shared DPDK library as - compared to static library. - -To build Open vSwitch using DPDK as a shared library, first refer to -:doc:`/intro/install/dpdk` for download instructions for DPDK and OVS. - -Once DPDK and OVS have been downloaded, you must configure the DPDK library -accordingly. Simply set ``CONFIG_RTE_BUILD_SHARED_LIB=y`` in -``config/common_base``, then build and install DPDK. Once done, DPDK can be -built as usual. 
For example:: - - $ export DPDK_TARGET=x86_64-native-linuxapp-gcc - $ export DPDK_BUILD=$DPDK_DIR/$DPDK_TARGET - $ make install T=$DPDK_TARGET DESTDIR=install - -Once DPDK is built, export the DPDK shared library location and setup OVS as -detailed in :doc:`/intro/install/dpdk`:: - - $ export LD_LIBRARY_PATH=$DPDK_DIR/x86_64-native-linuxapp-gcc/lib - -System Configuration --------------------- - -To achieve optimal OVS performance, the system can be configured and that -includes BIOS tweaks, Grub cmdline additions, better understanding of NUMA -nodes and apt selection of PCIe slots for NIC placement. - -Recommended BIOS Settings -~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. list-table:: Recommended BIOS Settings - :header-rows: 1 - - * - Setting - - Value - * - C3 Power State - - Disabled - * - C6 Power State - - Disabled - * - MLC Streamer - - Enabled - * - MLC Spacial Prefetcher - - Enabled - * - DCU Data Prefetcher - - Enabled - * - DCA - - Enabled - * - CPU Power and Performance - - Performance - * - Memeory RAS and Performance Config -> NUMA optimized - - Enabled - -PCIe Slot Selection -~~~~~~~~~~~~~~~~~~~ - -The fastpath performance can be affected by factors related to the placement of -the NIC, such as channel speeds between PCIe slot and CPU or the proximity of -PCIe slot to the CPU cores running the DPDK application. Listed below are the -steps to identify right PCIe slot. - -#. Retrieve host details using ``dmidecode``. For example:: - - $ dmidecode -t baseboard | grep "Product Name" - -#. Download the technical specification for product listed, e.g: S2600WT2 - -#. Check the Product Architecture Overview on the Riser slot placement, CPU - sharing info and also PCIe channel speeds - - For example: On S2600WT, CPU1 and CPU2 share Riser Slot 1 with Channel speed - between CPU1 and Riser Slot1 at 32GB/s, CPU2 and Riser Slot1 at 16GB/s. - Running DPDK app on CPU1 cores and NIC inserted in to Riser card Slots will - optimize OVS performance in this case. - -#. Check the Riser Card #1 - Root Port mapping information, on the available - slots and individual bus speeds. In S2600WT slot 1, slot 2 has high bus - speeds and are potential slots for NIC placement. - -Advanced Hugepage Setup -~~~~~~~~~~~~~~~~~~~~~~~ - -Allocate and mount 1 GB hugepages. - -- For persistent allocation of huge pages, add the following options to the - kernel bootline:: - - default_hugepagesz=1GB hugepagesz=1G hugepages=N - - For platforms supporting multiple huge page sizes, add multiple options:: - - default_hugepagesz= hugepagesz= hugepages=N - - where: - - ``N`` - number of huge pages requested - ``size`` - huge page size with an optional suffix ``[kKmMgG]`` - -- For run-time allocation of huge pages:: - - $ echo N > /sys/devices/system/node/nodeX/hugepages/hugepages-1048576kB/nr_hugepages - - where: - - ``N`` - number of huge pages requested - ``X`` - NUMA Node - - .. note:: - For run-time allocation of 1G huge pages, Contiguous Memory Allocator - (``CONFIG_CMA``) has to be supported by kernel, check your Linux distro. - -Now mount the huge pages, if not already done so:: - - $ mount -t hugetlbfs -o pagesize=1G none /dev/hugepages - -Enable HyperThreading -~~~~~~~~~~~~~~~~~~~~~ - -With HyperThreading, or SMT, enabled, a physical core appears as two logical -cores. SMT can be utilized to spawn worker threads on logical cores of the same -physical core there by saving additional cores. 
- -With DPDK, when pinning pmd threads to logical cores, care must be taken to set -the correct bits of the ``pmd-cpu-mask`` to ensure that the pmd threads are -pinned to SMT siblings. - -Take a sample system configuration, with 2 sockets, 2 * 10 core processors, HT -enabled. This gives us a total of 40 logical cores. To identify the physical -core shared by two logical cores, run:: - - $ cat /sys/devices/system/cpu/cpuN/topology/thread_siblings_list - -where ``N`` is the logical core number. - -In this example, it would show that cores ``1`` and ``21`` share the same -physical core. As cores are counted from 0, the ``pmd-cpu-mask`` can be used -to enable these two pmd threads running on these two logical cores (one -physical core) is:: - - $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x200002 - -Isolate Cores -~~~~~~~~~~~~~ - -The ``isolcpus`` option can be used to isolate cores from the Linux scheduler. -The isolated cores can then be used to dedicatedly run HPC applications or -threads. This helps in better application performance due to zero context -switching and minimal cache thrashing. To run platform logic on core 0 and -isolate cores between 1 and 19 from scheduler, add ``isolcpus=1-19`` to GRUB -cmdline. - -.. note:: - It has been verified that core isolation has minimal advantage due to mature - Linux scheduler in some circumstances. - -NUMA/Cluster-on-Die -~~~~~~~~~~~~~~~~~~~ - -Ideally inter-NUMA datapaths should be avoided where possible as packets will -go across QPI and there may be a slight performance penalty when compared with -intra NUMA datapaths. On Intel Xeon Processor E5 v3, Cluster On Die is -introduced on models that have 10 cores or more. This makes it possible to -logically split a socket into two NUMA regions and again it is preferred where -possible to keep critical datapaths within the one cluster. - -It is good practice to ensure that threads that are in the datapath are pinned -to cores in the same NUMA area. e.g. pmd threads and QEMU vCPUs responsible for -forwarding. If DPDK is built with ``CONFIG_RTE_LIBRTE_VHOST_NUMA=y``, vHost -User ports automatically detect the NUMA socket of the QEMU vCPUs and will be -serviced by a PMD from the same node provided a core on this node is enabled in -the ``pmd-cpu-mask``. ``libnuma`` packages are required for this feature. - -Compiler Optimizations -~~~~~~~~~~~~~~~~~~~~~~ - -The default compiler optimization level is ``-O2``. Changing this to more -aggressive compiler optimization such as ``-O3 -march=native`` with -gcc (verified on 5.3.1) can produce performance gains though not siginificant. -``-march=native`` will produce optimized code on local machine and should be -used when software compilation is done on Testbed. - -Performance Tuning ------------------- - -Affinity -~~~~~~~~ - -For superior performance, DPDK pmd threads and Qemu vCPU threads needs to be -affinitized accordingly. - -- PMD thread Affinity - - A poll mode driver (pmd) thread handles the I/O of all DPDK interfaces - assigned to it. A pmd thread shall poll the ports for incoming packets, - switch the packets and send to tx port. pmd thread is CPU bound, and needs - to be affinitized to isolated cores for optimum performance. - - By setting a bit in the mask, a pmd thread is created and pinned to the - corresponding CPU core. e.g. to run a pmd thread on core 2:: - - $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x4 - - .. 
note:: - pmd thread on a NUMA node is only created if there is at least one DPDK - interface from that NUMA node added to OVS. - -- QEMU vCPU thread Affinity - - A VM performing simple packet forwarding or running complex packet pipelines - has to ensure that the vCPU threads performing the work has as much CPU - occupancy as possible. - - For example, on a multicore VM, multiple QEMU vCPU threads shall be spawned. - When the DPDK ``testpmd`` application that does packet forwarding is invoked, - the ``taskset`` command should be used to affinitize the vCPU threads to the - dedicated isolated cores on the host system. - -Multiple Poll-Mode Driver Threads -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -With pmd multi-threading support, OVS creates one pmd thread for each NUMA node -by default. However, in cases where there are multiple ports/rxq's producing -traffic, performance can be improved by creating multiple pmd threads running -on separate cores. These pmd threads can share the workload by each being -responsible for different ports/rxq's. Assignment of ports/rxq's to pmd threads -is done automatically. - -A set bit in the mask means a pmd thread is created and pinned to the -corresponding CPU core. For example, to run pmd threads on core 1 and 2:: - - $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x6 - -When using dpdk and dpdkvhostuser ports in a bi-directional VM loopback as -shown below, spreading the workload over 2 or 4 pmd threads shows significant -improvements as there will be more total CPU occupancy available:: - - NIC port0 <-> OVS <-> VM <-> OVS <-> NIC port 1 - -DPDK Physical Port Rx Queues -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -:: - - $ ovs-vsctl set Interface options:n_rxq= - -The command above sets the number of rx queues for DPDK physical interface. -The rx queues are assigned to pmd threads on the same NUMA node in a -round-robin fashion. - -DPDK Physical Port Queue Sizes -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -:: - - $ ovs-vsctl set Interface dpdk0 options:n_rxq_desc= - $ ovs-vsctl set Interface dpdk0 options:n_txq_desc= - -The command above sets the number of rx/tx descriptors that the NIC associated -with dpdk0 will be initialised with. - -Different ``n_rxq_desc`` and ``n_txq_desc`` configurations yield different -benefits in terms of throughput and latency for different scenarios. -Generally, smaller queue sizes can have a positive impact for latency at the -expense of throughput. The opposite is often true for larger queue sizes. -Note: increasing the number of rx descriptors eg. to 4096 may have a negative -impact on performance due to the fact that non-vectorised DPDK rx functions may -be used. This is dependant on the driver in use, but is true for the commonly -used i40e and ixgbe DPDK drivers. - -Exact Match Cache -~~~~~~~~~~~~~~~~~ - -Each pmd thread contains one Exact Match Cache (EMC). After initial flow setup -in the datapath, the EMC contains a single table and provides the lowest level -(fastest) switching for DPDK ports. If there is a miss in the EMC then the next -level where switching will occur is the datapath classifier. Missing in the -EMC and looking up in the datapath classifier incurs a significant performance -penalty. If lookup misses occur in the EMC because it is too small to handle -the number of flows, its size can be increased. The EMC size can be modified by -editing the define ``EM_FLOW_HASH_SHIFT`` in ``lib/dpif-netdev.c``. - -As mentioned above, an EMC is per pmd thread. 
An alternative way of increasing -the aggregate amount of possible flow entries in EMC and avoiding datapath -classifier lookups is to have multiple pmd threads running. - -Rx Mergeable Buffers -~~~~~~~~~~~~~~~~~~~~ - -Rx mergeable buffers is a virtio feature that allows chaining of multiple -virtio descriptors to handle large packet sizes. Large packets are handled by -reserving and chaining multiple free descriptors together. Mergeable buffer -support is negotiated between the virtio driver and virtio device and is -supported by the DPDK vhost library. This behavior is supported and enabled by -default, however in the case where the user knows that rx mergeable buffers are -not needed i.e. jumbo frames are not needed, it can be forced off by adding -``mrg_rxbuf=off`` to the QEMU command line options. By not reserving multiple -chains of descriptors it will make more individual virtio descriptors available -for rx to the guest using dpdkvhost ports and this can improve performance. - -OVS Testcases -------------- - -PHY-VM-PHY (vHost Loopback) -~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -:doc:`/intro/install/dpdk` details steps for PHY-VM-PHY loopback testcase and -packet forwarding using DPDK testpmd application in the Guest VM. For users -wishing to do packet forwarding using kernel stack below, you need to run the -below commands on the guest:: - - $ ifconfig eth1 1.1.1.2/24 - $ ifconfig eth2 1.1.2.2/24 - $ systemctl stop firewalld.service - $ systemctl stop iptables.service - $ sysctl -w net.ipv4.ip_forward=1 - $ sysctl -w net.ipv4.conf.all.rp_filter=0 - $ sysctl -w net.ipv4.conf.eth1.rp_filter=0 - $ sysctl -w net.ipv4.conf.eth2.rp_filter=0 - $ route add -net 1.1.2.0/24 eth2 - $ route add -net 1.1.1.0/24 eth1 - $ arp -s 1.1.2.99 DE:AD:BE:EF:CA:FE - $ arp -s 1.1.1.99 DE:AD:BE:EF:CA:EE - -PHY-VM-PHY (IVSHMEM) -~~~~~~~~~~~~~~~~~~~~ - -IVSHMEM can also be validated using the PHY-VM-PHY configuration. To begin, -follow the steps described in the :doc:`/intro/install/dpdk` to create and -initialize the database, start ovs-vswitchd and add ``dpdk``-type devices to -bridge ``br0``. Once complete, follow the below steps: - -1. Add DPDK ring port to the bridge:: - - $ ovs-vsctl add-port br0 dpdkr0 -- set Interface dpdkr0 type=dpdkr - -2. Build modified QEMU - - QEMU must be patched to enable IVSHMEM support:: - - $ cd /usr/src/ - $ wget http://wiki.qemu.org/download/qemu-2.2.1.tar.bz2 - $ tar -jxvf qemu-2.2.1.tar.bz2 - $ cd /usr/src/qemu-2.2.1 - $ wget https://raw.githubusercontent.com/netgroup-polito/un-orchestrator/master/orchestrator/compute_controller/plugins/kvm-libvirt/patches/ivshmem-qemu-2.2.1.patch - $ patch -p1 < ivshmem-qemu-2.2.1.patch - $ ./configure --target-list=x86_64-softmmu --enable-debug --extra-cflags='-g' - $ make -j 4 - -3. Generate QEMU commandline:: - - $ mkdir -p /usr/src/cmdline_generator - $ cd /usr/src/cmdline_generator - $ wget https://raw.githubusercontent.com/netgroup-polito/un-orchestrator/master/orchestrator/compute_controller/plugins/kvm-libvirt/cmdline_generator/cmdline_generator.c - $ wget https://raw.githubusercontent.com/netgroup-polito/un-orchestrator/master/orchestrator/compute_controller/plugins/kvm-libvirt/cmdline_generator/Makefile - $ export RTE_SDK=/usr/src/dpdk-16.11 - $ export RTE_TARGET=x86_64-ivshmem-linuxapp-gcc - $ make - $ ./build/cmdline_generator -m -p dpdkr0 XXX - $ cmdline=`cat OVSMEMPOOL` - -4. 
Start guest VM:: - - $ export VM_NAME=ivshmem-vm - $ export QCOW2_IMAGE=/root/CentOS7_x86_64.qcow2 - $ export QEMU_BIN=/usr/src/qemu-2.2.1/x86_64-softmmu/qemu-system-x86_64 - $ taskset 0x20 $QEMU_BIN -cpu host -smp 2,cores=2 -hda $QCOW2_IMAGE \ - -m 4096 --enable-kvm -name $VM_NAME -nographic -vnc :2 \ - -pidfile /tmp/vm1.pid $cmdline - -5. Build and run the sample ``dpdkr`` app in VM:: - - $ echo 1024 > /proc/sys/vm/nr_hugepages - $ mount -t hugetlbfs nodev /dev/hugepages (if not already mounted) - - # Build the DPDK ring application in the VM - $ export RTE_SDK=/root/dpdk-16.11 - $ export RTE_TARGET=x86_64-ivshmem-linuxapp-gcc - $ make - - # Run dpdkring application - $ ./build/dpdkr -c 1 -n 4 -- -n 0 - # where "-n 0" refers to ring '0' i.e dpdkr0 - -PHY-VM-PHY (vHost Multiqueue) -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -vHost Multique functionality can also be validated using the PHY-VM-PHY -configuration. To begin, follow the steps described in -:doc:`/intro/install/dpdk` to create and initialize the database, start -ovs-vswitchd and add ``dpdk``-type devices to bridge ``br0``. Once complete, -follow the below steps: - -1. Configure PMD and RXQs. - - For example, set the number of dpdk port rx queues to at least 2 The number - of rx queues at vhost-user interface gets automatically configured after - virtio device connection and doesn't need manual configuration:: - - $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0xC - $ ovs-vsctl set Interface dpdk0 options:n_rxq=2 - $ ovs-vsctl set Interface dpdk1 options:n_rxq=2 - -2. Instantiate Guest VM using QEMU cmdline - - We must configure with appropriate software versions to ensure this feature - is supported. - - .. list-table:: Recommended BIOS Settings - :header-rows: 1 - - * - Setting - - Value - * - QEMU version - - 2.5.0 - * - QEMU thread affinity - - 2 cores (taskset 0x30) - * - Memory - - 4 GB - * - Cores - - 2 - * - Distro - - Fedora 22 - * - Multiqueue - - Enabled - - To do this, instantiate the guest as follows:: - - $ export VM_NAME=vhost-vm - $ export GUEST_MEM=4096M - $ export QCOW2_IMAGE=/root/Fedora22_x86_64.qcow2 - $ export VHOST_SOCK_DIR=/usr/local/var/run/openvswitch - $ taskset 0x30 qemu-system-x86_64 -cpu host -smp 2,cores=2 -m 4096M \ - -drive file=$QCOW2_IMAGE --enable-kvm -name $VM_NAME \ - -nographic -numa node,memdev=mem -mem-prealloc \ - -object memory-backend-file,id=mem,size=$GUEST_MEM,mem-path=/dev/hugepages,share=on \ - -chardev socket,id=char1,path=$VHOST_SOCK_DIR/dpdkvhostuser0 \ - -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=2 \ - -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mq=on,vectors=6 \ - -chardev socket,id=char2,path=$VHOST_SOCK_DIR/dpdkvhostuser1 \ - -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce,queues=2 \ - -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mq=on,vectors=6 - - .. note:: - Queue value above should match the queues configured in OVS, The vector - value should be set to "number of queues x 2 + 2" - -3. Configure the guest interface - - Assuming there are 2 interfaces in the guest named eth0, eth1 check the - channel configuration and set the number of combined channels to 2 for - virtio devices:: - - $ ethtool -l eth0 - $ ethtool -L eth0 combined 2 - $ ethtool -L eth1 combined 2 - - More information can be found in vHost walkthrough section. - -4. 
Configure kernel packet forwarding - - Configure IP and enable interfaces:: - - $ ifconfig eth0 5.5.5.1/24 up - $ ifconfig eth1 90.90.90.1/24 up - - Configure IP forwarding and add route entries:: - - $ sysctl -w net.ipv4.ip_forward=1 - $ sysctl -w net.ipv4.conf.all.rp_filter=0 - $ sysctl -w net.ipv4.conf.eth0.rp_filter=0 - $ sysctl -w net.ipv4.conf.eth1.rp_filter=0 - $ ip route add 2.1.1.0/24 dev eth1 - $ route add default gw 2.1.1.2 eth1 - $ route add default gw 90.90.90.90 eth1 - $ arp -s 90.90.90.90 DE:AD:BE:EF:CA:FE - $ arp -s 2.1.1.2 DE:AD:BE:EF:CA:FA - - Check traffic on multiple queues:: - - $ cat /proc/interrupts | grep virtio - -vHost Walkthrough ------------------ - -Two types of vHost User ports are available in OVS: - -- vhost-user (``dpdkvhostuser``) - -- vhost-user-client (``dpdkvhostuserclient``) - -vHost User uses a client-server model. The server creates/manages/destroys the -vHost User sockets, and the client connects to the server. Depending on which -port type you use, ``dpdkvhostuser`` or ``dpdkvhostuserclient``, a different -configuration of the client-server model is used. - -For vhost-user ports, Open vSwitch acts as the server and QEMU the client. For -vhost-user-client ports, Open vSwitch acts as the client and QEMU the server. - -vhost-user -~~~~~~~~~~ - -1. Install the prerequisites: - - - QEMU version >= 2.2 - -2. Add vhost-user ports to the switch. - - Unlike DPDK ring ports, DPDK vhost-user ports can have arbitrary names, - except that forward and backward slashes are prohibited in the names. - - For vhost-user, the name of the port type is ``dpdkvhostuser``:: - - $ ovs-vsctl add-port br0 vhost-user-1 -- set Interface vhost-user-1 \ - type=dpdkvhostuser - - This action creates a socket located at - ``/usr/local/var/run/openvswitch/vhost-user-1``, which you must provide to - your VM on the QEMU command line. More instructions on this can be found in - the next section "Adding vhost-user ports to VM" - - .. note:: - If you wish for the vhost-user sockets to be created in a sub-directory of - ``/usr/local/var/run/openvswitch``, you may specify this directory in the - ovsdb like so:: - - $ ovs-vsctl --no-wait \ - set Open_vSwitch . other_config:vhost-sock-dir=subdir` - -3. Add vhost-user ports to VM - - 1. Configure sockets - - Pass the following parameters to QEMU to attach a vhost-user device:: - - -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 - -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce - -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1 - - where ``vhost-user-1`` is the name of the vhost-user port added to the - switch. - - Repeat the above parameters for multiple devices, changing the chardev - ``path`` and ``id`` as necessary. Note that a separate and different - chardev ``path`` needs to be specified for each vhost-user device. For - example you have a second vhost-user port named ``vhost-user-2``, you - append your QEMU command line with an additional set of parameters:: - - -chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2 - -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce - -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2 - - 2. Configure hugepages - - QEMU must allocate the VM's memory on hugetlbfs. vhost-user ports access - a virtio-net device's virtual rings and packet buffers mapping the VM's - physical memory on hugetlbfs. 
To enable vhost-user ports to map the VM's - memory into their process address space, pass the following parameters - to QEMU:: - - -object memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,share=on - -numa node,memdev=mem -mem-prealloc - - 3. Enable multiqueue support (optional) - - QEMU needs to be configured to use multiqueue:: - - -chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2 - -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce,queues=$q - -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mq=on,vectors=$v - - where: - - ``$q`` - The number of queues - ``$v`` - The number of vectors, which is ``$q`` * 2 + 2 - - The vhost-user interface will be automatically reconfigured with - required number of rx and tx queues after connection of virtio device. - Manual configuration of ``n_rxq`` is not supported because OVS will work - properly only if ``n_rxq`` will match number of queues configured in - QEMU. - - A least 2 PMDs should be configured for the vswitch when using - multiqueue. Using a single PMD will cause traffic to be enqueued to the - same vhost queue rather than being distributed among different vhost - queues for a vhost-user interface. - - If traffic destined for a VM configured with multiqueue arrives to the - vswitch via a physical DPDK port, then the number of rxqs should also be - set to at least 2 for that physical DPDK port. This is required to - increase the probability that a different PMD will handle the multiqueue - transmission to the guest using a different vhost queue. - - If one wishes to use multiple queues for an interface in the guest, the - driver in the guest operating system must be configured to do so. It is - recommended that the number of queues configured be equal to ``$q``. - - For example, this can be done for the Linux kernel virtio-net driver - with:: - - $ ethtool -L combined <$q> - - where: - - ``-L`` - Changes the numbers of channels of the specified network device - ``combined`` - Changes the number of multi-purpose channels. - -Configure the VM using libvirt -++++++++++++++++++++++++++++++ - -You can also build and configure the VM using libvirt rather than QEMU by -itself. - -1. Change the user/group, access control policty and restart libvirtd. - - - In ``/etc/libvirt/qemu.conf`` add/edit the following lines:: - - user = "root" - group = "root" - - - Disable SELinux or set to permissive mode:: - - $ setenforce 0 - - - Restart the libvirtd process, For example, on Fedora:: - - $ systemctl restart libvirtd.service - -2. Instantiate the VM - - - Copy the XML configuration described in :doc:`/intro/install/dpdk` - - - Start the VM:: - - $ virsh create demovm.xml - - - Connect to the guest console:: - - $ virsh console demovm - -3. Configure the VM - - The demovm xml configuration is aimed at achieving out of box performance on - VM. - - - The vcpus are pinned to the cores of the CPU socket 0 using ``vcpupin``. - - - Configure NUMA cell and memory shared using ``memAccess='shared'``. - - - Disable ``mrg_rxbuf='off'`` - -Refer to the `libvirt documentation `__ -for more information. - -vhost-user-client -~~~~~~~~~~~~~~~~~ - -1. Install the prerequisites: - - - QEMU version >= 2.7 - -2. Add vhost-user-client ports to the switch. - - Unlike vhost-user ports, the name given to port does not govern the name of - the socket device. ``vhost-server-path`` reflects the full path of the - socket that has been or will be created by QEMU for the given vHost User - client port. 
- - For vhost-user-client, the name of the port type is - ``dpdkvhostuserclient``:: - - $ VHOST_USER_SOCKET_PATH=/path/to/socker - $ ovs-vsctl add-port br0 vhost-client-1 \ - -- set Interface vhost-client-1 type=dpdkvhostuserclient \ - options:vhost-server-path=$VHOST_USER_SOCKET_PATH - -3. Add vhost-user-client ports to VM - - 1. Configure sockets - - Pass the following parameters to QEMU to attach a vhost-user device:: - - -chardev socket,id=char1,path=$VHOST_USER_SOCKET_PATH,server - -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce - -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1 - - where ``vhost-user-1`` is the name of the vhost-user port added to the - switch. - - If the corresponding dpdkvhostuserclient port has not yet been configured - in OVS with ``vhost-server-path=/path/to/socket``, QEMU will print a log - similar to the following:: - - QEMU waiting for connection on: disconnected:unix:/path/to/socket,server - - QEMU will wait until the port is created sucessfully in OVS to boot the VM. - - One benefit of using this mode is the ability for vHost ports to - 'reconnect' in event of the switch crashing or being brought down. Once - it is brought back up, the vHost ports will reconnect automatically and - normal service will resume. - -DPDK Backend Inside VM -~~~~~~~~~~~~~~~~~~~~~~ - -Additional configuration is required if you want to run ovs-vswitchd with DPDK -backend inside a QEMU virtual machine. Ovs-vswitchd creates separate DPDK TX -queues for each CPU core available. This operation fails inside QEMU virtual -machine because, by default, VirtIO NIC provided to the guest is configured to -support only single TX queue and single RX queue. To change this behavior, you -need to turn on ``mq`` (multiqueue) property of all ``virtio-net-pci`` devices -emulated by QEMU and used by DPDK. You may do it manually (by changing QEMU -command line) or, if you use Libvirt, by adding the following string to -```` sections of all network devices used by DPDK:: - - - -Where: - -``N`` - determines how many queues can be used by the guest. - -This requires QEMU >= 2.2. - -QoS ---- - -Assuming you have a vhost-user port transmitting traffic consisting of packets -of size 64 bytes, the following command would limit the egress transmission -rate of the port to ~1,000,000 packets per second:: - - $ ovs-vsctl set port vhost-user0 qos=@newqos -- \ - --id=@newqos create qos type=egress-policer other-config:cir=46000000 \ - other-config:cbs=2048` - -To examine the QoS configuration of the port, run:: - - $ ovs-appctl -t ovs-vswitchd qos/show vhost-user0 - -To clear the QoS configuration from the port and ovsdb, run:: - - $ ovs-vsctl destroy QoS vhost-user0 -- clear Port vhost-user0 qos - -Refer to vswitch.xml for more details on egress-policer. - -Rate Limiting --------------- - -Here is an example on Ingress Policing usage. Assuming you have a vhost-user -port receiving traffic consisting of packets of size 64 bytes, the following -command would limit the reception rate of the port to ~1,000,000 packets per -second:: - - $ ovs-vsctl set interface vhost-user0 ingress_policing_rate=368000 \ - ingress_policing_burst=1000` - -To examine the ingress policer configuration of the port:: - - $ ovs-vsctl list interface vhost-user0 - -To clear the ingress policer configuration from the port:: - - $ ovs-vsctl set interface vhost-user0 ingress_policing_rate=0 - -Refer to vswitch.xml for more details on ingress-policer. 
- -Flow Control ------------- - -Flow control can be enabled only on DPDK physical ports. To enable flow -control support at tx side while adding a port, run:: - - $ ovs-vsctl add-port br0 dpdk0 -- \ - set Interface dpdk0 type=dpdk options:tx-flow-ctrl=true - -Similarly, to enable rx flow control, run:: - - $ ovs-vsctl add-port br0 dpdk0 -- \ - set Interface dpdk0 type=dpdk options:rx-flow-ctrl=true - -To enable flow control auto-negotiation, run:: - - $ ovs-vsctl add-port br0 dpdk0 -- \ - set Interface dpdk0 type=dpdk options:flow-ctrl-autoneg=true - -To turn ON the tx flow control at run time(After the port is being added to -OVS):: - - $ ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=true - -The flow control parameters can be turned off by setting ``false`` to the -respective parameter. To disable the flow control at tx side, run:: - - $ ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=false - -pdump ------ - -Pdump allows you to listen on DPDK ports and view the traffic that is passing -on them. To use this utility, one must have libpcap installed on the system. -Furthermore, DPDK must be built with ``CONFIG_RTE_LIBRTE_PDUMP=y`` and -``CONFIG_RTE_LIBRTE_PMD_PCAP=y``. - -.. warning:: - A performance decrease is expected when using a monitoring application like - the DPDK pdump app. - -To use pdump, simply launch OVS as usual. Then, navigate to the ``app/pdump`` -directory in DPDK, ``make`` the application and run like so:: - - $ sudo ./build/app/dpdk-pdump -- \ - --pdump port=0,queue=0,rx-dev=/tmp/pkts.pcap \ - --server-socket-path=/usr/local/var/run/openvswitch - -The above command captures traffic received on queue 0 of port 0 and stores it -in ``/tmp/pkts.pcap``. Other combinations of port numbers, queues numbers and -pcap locations are of course also available to use. For example, to capture all -packets that traverse port 0 in a single pcap file:: - - $ sudo ./build/app/dpdk-pdump -- \ - --pdump 'port=0,queue=*,rx-dev=/tmp/pkts.pcap,tx-dev=/tmp/pkts.pcap' \ - --server-socket-path=/usr/local/var/run/openvswitch - -``server-socket-path`` must be set to the value of ovs_rundir() which typically -resolves to ``/usr/local/var/run/openvswitch``. - -Many tools are available to view the contents of the pcap file. Once example is -tcpdump. Issue the following command to view the contents of ``pkts.pcap``:: - - $ tcpdump -r pkts.pcap - -More information on the pdump app and its usage can be found in the `DPDK docs -`__. - -Jumbo Frames ------------- - -By default, DPDK ports are configured with standard Ethernet MTU (1500B). To -enable Jumbo Frames support for a DPDK port, change the Interface's -``mtu_request`` attribute to a sufficiently large value. For example, to add a -DPDK Phy port with MTU of 9000:: - - $ ovs-vsctl add-port br0 dpdk0 \ - -- set Interface dpdk0 type=dpdk \ - -- set Interface dpdk0 mtu_request=9000` - -Similarly, to change the MTU of an existing port to 6200:: - - $ ovs-vsctl set Interface dpdk0 mtu_request=6200 - -Some additional configuration is needed to take advantage of jumbo frames with -vHost ports: - -1. *mergeable buffers* must be enabled for vHost ports, as demonstrated in the - QEMU command line snippet below:: - - -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \ - -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on - -2. Where virtio devices are bound to the Linux kernel driver in a guest - environment (i.e. 
interfaces are not bound to an in-guest DPDK driver), the - MTU of those logical network interfaces must also be increased to a - sufficiently large value. This avoids segmentation of Jumbo Frames received - in the guest. Note that 'MTU' refers to the length of the IP packet only, - and not that of the entire frame. - - To calculate the exact MTU of a standard IPv4 frame, subtract the L2 header - and CRC lengths (i.e. 18B) from the max supported frame size. So, to set - the MTU for a 9018B Jumbo Frame:: - - $ ifconfig eth1 mtu 9000 - -When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments are -increased, such that a full Jumbo Frame of a specific size may be accommodated -within a single mbuf segment. - -Jumbo frame support has been validated against 9728B frames, which is the -largest frame size supported by Fortville NIC using the DPDK i40e driver, but -larger frames and other DPDK NIC drivers may be supported. These cases are -common for use cases involving East-West traffic only. - -vsperf ------- - -The vsperf project aims to develop a vSwitch test framework that can be used to -validate the suitability of different vSwitch implementations in a telco -deployment environment. More information can be found on the `OPNFV wiki -`__. - -Bug Reporting -------------- - -Report problems to bugs@openvswitch.org. diff --git a/Documentation/intro/install/dpdk.rst b/Documentation/intro/install/dpdk.rst index 7724c8a..87dc830 100644 --- a/Documentation/intro/install/dpdk.rst +++ b/Documentation/intro/install/dpdk.rst @@ -53,10 +53,7 @@ vSwitch with DPDK will require the following: present, it will be necessary to upgrade your kernel or build a custom kernel with these flags enabled. -.. TODO(stephenfin): drag the below information in from dpdk-advanced - -Detailed system requirements can be found at `DPDK requirements`_, while more -detailed install information can be found in :doc:`dpdk-advanced`. +Detailed system requirements can be found at `DPDK requirements`_. .. _DPDK supported NIC: http://dpdk.org/doc/nics .. _DPDK requirements: http://dpdk.org/doc/guides/linux_gsg/sys_reqs.html @@ -64,10 +61,10 @@ detailed install information can be found in :doc:`dpdk-advanced`. Installing ---------- -DPDK -~~~~ +Install DPDK +~~~~~~~~~~~~ -1. Download the `DPDK sources`_, extract the file and set ``DPDK_DIR``:: +#. Download the `DPDK sources`_, extract the file and set ``DPDK_DIR``:: $ cd /usr/src/ $ wget http://fast.dpdk.org/rel/dpdk-16.11.tar.xz @@ -75,7 +72,18 @@ DPDK $ export DPDK_DIR=/usr/src/dpdk-16.11 $ cd $DPDK_DIR -2. Configure and install DPDK +#. (Optional) Configure DPDK as a shared library + + DPDK can be built as either a static library or a shared library. By + default, it is configured for the former. If you wish to use the latter, set + ``CONFIG_RTE_BUILD_SHARED_LIB=y`` in ``$DPDK_DIR/config/common_base``. + + .. note:: + + Minor performance loss is expected when using OVS with a shared DPDK + library compared to a static DPDK library. + +#. Configure and install DPDK Build and install the DPDK library:: @@ -87,6 +95,13 @@ DPDK $ export DPDK_TARGET=x86_64-ivshmem-linuxapp-gcc +#. (Optional) Export the DPDK shared library location + + If DPDK was built as a shared library, export the path to this library for + use when building OVS:: + + $ export LD_LIBRARY_PATH=$DPDK_DIR/x86_64-native-linuxapp-gcc/lib + .. _DPDK sources: http://dpdk.org/rel Install OVS @@ -101,12 +116,12 @@ has to be configured with DPDK support (``--with-dpdk``). .. 
_OVS sources: http://openvswitch.org/releases/ -1. Ensure the standard OVS requirements, described in +#. Ensure the standard OVS requirements, described in :ref:`general-build-reqs`, are installed -2. Bootstrap, if required, as described in :ref:`general-bootstrapping` +#. Bootstrap, if required, as described in :ref:`general-bootstrapping` -3. Configure the package using the ``--with-dpdk`` flag:: +#. Configure the package using the ``--with-dpdk`` flag:: $ ./configure --with-dpdk=$DPDK_BUILD @@ -117,7 +132,7 @@ has to be configured with DPDK support (``--with-dpdk``). While ``--with-dpdk`` is required, you can pass any other configuration option described in :ref:`general-configuring`. -4. Build and install OVS, as described in :ref:`general-building` +#. Build and install OVS, as described in :ref:`general-building` Additional information can be found in :doc:`general`. @@ -225,7 +240,7 @@ threads and pin them to cores 1,2, run:: $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x6 -For details on using ivshmem with DPDK, refer to :doc:`dpdk-advanced`. +For details on using IVSHMEM with DPDK, refer to :doc:`/topics/dpdk/ivshmem`. Refer to ovs-vswitchd.conf.db(5) for additional information on configuration options. @@ -237,345 +252,300 @@ options. Validating ---------- -Creating bridges and ports -~~~~~~~~~~~~~~~~~~~~~~~~~~ - -You can now use ovs-vsctl to set up bridges and other Open vSwitch features. -Bridges should be created with a ``datapath_type=netdev``:: +At this point you can use ovs-vsctl to set up bridges and other Open vSwitch +features. Seeing as we've configured the DPDK datapath, we will use DPDK-type +ports. For example, to create a userspace bridge named ``br0`` and add two +``dpdk`` ports to it, run:: $ ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev - -Now you can add DPDK devices. OVS expects DPDK device names to start with -``dpdk`` and end with a portid. ovs-vswitchd should print the number of dpdk -devices found in the log file:: - $ ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk $ ovs-vsctl add-port br0 dpdk1 -- set Interface dpdk1 type=dpdk -After the DPDK ports get added to switch, a polling thread continuously polls -DPDK devices and consumes 100% of the core, as can be checked from 'top' and -'ps' cmds:: +Refer to ovs-vsctl(8) and :doc:`/howto/dpdk` for more details. - $ top -H - $ ps -eLo pid,psr,comm | grep pmd +Performance Tuning +------------------ -Creating bonds of DPDK interfaces is slightly different to creating bonds of -system interfaces. For DPDK, the interface type must be explicitly set. For -example:: +To achieve optimal OVS performance, the system can be configured and that +includes BIOS tweaks, Grub cmdline additions, better understanding of NUMA +nodes and apt selection of PCIe slots for NIC placement. - $ ovs-vsctl add-bond br0 dpdkbond dpdk0 dpdk1 \ - -- set Interface dpdk0 type=dpdk \ - -- set Interface dpdk1 type=dpdk +.. note:: -To stop ovs-vswitchd & delete bridge, run:: + This section is optional. Once installed as described above, OVS with DPDK + will work out of the box. - $ ovs-appctl -t ovs-vswitchd exit - $ ovs-appctl -t ovsdb-server exit - $ ovs-vsctl del-br br0 +Recommended BIOS Settings +~~~~~~~~~~~~~~~~~~~~~~~~~ -PMD thread statistics -~~~~~~~~~~~~~~~~~~~~~ +.. 
list-table:: Recommended BIOS Settings + :header-rows: 1 -To show current stats:: + * - Setting + - Value + * - C3 Power State + - Disabled + * - C6 Power State + - Disabled + * - MLC Streamer + - Enabled + * - MLC Spacial Prefetcher + - Enabled + * - DCU Data Prefetcher + - Enabled + * - DCA + - Enabled + * - CPU Power and Performance + - Performance + * - Memeory RAS and Performance Config -> NUMA optimized + - Enabled - $ ovs-appctl dpif-netdev/pmd-stats-show +PCIe Slot Selection +~~~~~~~~~~~~~~~~~~~ -To clear previous stats:: +The fastpath performance can be affected by factors related to the placement of +the NIC, such as channel speeds between PCIe slot and CPU or the proximity of +PCIe slot to the CPU cores running the DPDK application. Listed below are the +steps to identify right PCIe slot. - $ ovs-appctl dpif-netdev/pmd-stats-clear +#. Retrieve host details using ``dmidecode``. For example:: -Port/rxq assigment to PMD threads -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + $ dmidecode -t baseboard | grep "Product Name" -To show port/rxq assignment:: +#. Download the technical specification for product listed, e.g: S2600WT2 - $ ovs-appctl dpif-netdev/pmd-rxq-show +#. Check the Product Architecture Overview on the Riser slot placement, CPU + sharing info and also PCIe channel speeds -To change default rxq assignment to pmd threads, rxqs may be manually pinned to -desired cores using:: + For example: On S2600WT, CPU1 and CPU2 share Riser Slot 1 with Channel speed + between CPU1 and Riser Slot1 at 32GB/s, CPU2 and Riser Slot1 at 16GB/s. + Running DPDK app on CPU1 cores and NIC inserted in to Riser card Slots will + optimize OVS performance in this case. - $ ovs-vsctl set Interface \ - other_config:pmd-rxq-affinity= +#. Check the Riser Card #1 - Root Port mapping information, on the available + slots and individual bus speeds. In S2600WT slot 1, slot 2 has high bus + speeds and are potential slots for NIC placement. -where: +Advanced Hugepage Setup +~~~~~~~~~~~~~~~~~~~~~~~ -- ```` ::= ``NULL`` | ```` -- ```` ::= ```` | - ```` , ```` -- ```` ::= ```` : ```` +Allocate and mount 1 GB hugepages. -For example:: +- For persistent allocation of huge pages, add the following options to the + kernel bootline:: - $ ovs-vsctl set interface dpdk0 options:n_rxq=4 \ - other_config:pmd-rxq-affinity="0:3,1:7,3:8" + default_hugepagesz=1GB hugepagesz=1G hugepages=N -This will ensure: + For platforms supporting multiple huge page sizes, add multiple options:: -- Queue #0 pinned to core 3 -- Queue #1 pinned to core 7 -- Queue #2 not pinned -- Queue #3 pinned to core 8 + default_hugepagesz= hugepagesz= hugepages=N -After that PMD threads on cores where RX queues was pinned will become -``isolated``. This means that this thread will poll only pinned RX queues. + where: -.. warning:: - If there are no ``non-isolated`` PMD threads, ``non-pinned`` RX queues will - not be polled. Also, if provided ``core_id`` is not available (ex. this - ``core_id`` not in ``pmd-cpu-mask``), RX queue will not be polled by any PMD - thread. + ``N`` + number of huge pages requested + ``size`` + huge page size with an optional suffix ``[kKmMgG]`` -.. _dpdk-guest-setup: +- For run-time allocation of huge pages:: -DPDK in the VM --------------- + $ echo N > /sys/devices/system/node/nodeX/hugepages/hugepages-1048576kB/nr_hugepages -DPDK 'testpmd' application can be run in the Guest VM for high speed packet -forwarding between vhostuser ports. DPDK and testpmd application has to be -compiled on the guest VM. 
Below are the steps for setting up the testpmd -application in the VM. More information on the vhostuser ports can be found in -:doc:`dpdk-advanced`. + where: -.. note:: - Support for DPDK in the guest requires QEMU >= 2.2.0. - -To being, instantiate the guest:: - - $ export VM_NAME=Centos-vm export GUEST_MEM=3072M - $ export QCOW2_IMAGE=/root/CentOS7_x86_64.qcow2 - $ export VHOST_SOCK_DIR=/usr/local/var/run/openvswitch - - $ qemu-system-x86_64 -name $VM_NAME -cpu host -enable-kvm \ - -m $GUEST_MEM -drive file=$QCOW2_IMAGE --nographic -snapshot \ - -numa node,memdev=mem -mem-prealloc -smp sockets=1,cores=2 \ - -object memory-backend-file,id=mem,size=$GUEST_MEM,mem-path=/dev/hugepages,share=on \ - -chardev socket,id=char0,path=$VHOST_SOCK_DIR/dpdkvhostuser0 \ - -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \ - -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=off \ - -chardev socket,id=char1,path=$VHOST_SOCK_DIR/dpdkvhostuser1 \ - -netdev type=vhost-user,id=mynet2,chardev=char1,vhostforce \ - -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mrg_rxbuf=off \ - -Download the DPDK sourcs to VM and build DPDK:: - - $ cd /root/dpdk/ - $ wget http://fast.dpdk.org/rel/dpdk-16.11.tar.xz - $ tar xf dpdk-16.11.tar.xz - $ export DPDK_DIR=/root/dpdk/dpdk-16.11 - $ export DPDK_TARGET=x86_64-native-linuxapp-gcc - $ export DPDK_BUILD=$DPDK_DIR/$DPDK_TARGET - $ cd $DPDK_DIR - $ make install T=$DPDK_TARGET DESTDIR=install - -Build the test-pmd application:: - - $ cd app/test-pmd - $ export RTE_SDK=$DPDK_DIR - $ export RTE_TARGET=$DPDK_TARGET - $ make - -Setup huge pages and DPDK devices using UIO:: - - $ sysctl vm.nr_hugepages=1024 - $ mkdir -p /dev/hugepages - $ mount -t hugetlbfs hugetlbfs /dev/hugepages # only if not already mounted - $ modprobe uio - $ insmod $DPDK_BUILD/kmod/igb_uio.ko - $ $DPDK_DIR/tools/dpdk-devbind.py --status - $ $DPDK_DIR/tools/dpdk-devbind.py -b igb_uio 00:03.0 00:04.0 + ``N`` + number of huge pages requested + ``X`` + NUMA Node -.. note:: + .. note:: + For run-time allocation of 1G huge pages, Contiguous Memory Allocator + (``CONFIG_CMA``) has to be supported by kernel, check your Linux distro. - vhost ports pci ids can be retrieved using:: +Now mount the huge pages, if not already done so:: - lspci | grep Ethernet + $ mount -t hugetlbfs -o pagesize=1G none /dev/hugepages -Testing -------- +Enable HyperThreading +~~~~~~~~~~~~~~~~~~~~~ -Below are few testcases and the list of steps to be followed. Before beginning, -ensure a userspace bridge has been created and two DPDK ports added:: +With HyperThreading, or SMT, enabled, a physical core appears as two logical +cores. SMT can be utilized to spawn worker threads on logical cores of the same +physical core there by saving additional cores. - $ ovs-vsctl add-br br0 -- set bridge br0 datapath_type=netdev - $ ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk - $ ovs-vsctl add-port br0 dpdk1 -- set Interface dpdk1 type=dpdk +With DPDK, when pinning pmd threads to logical cores, care must be taken to set +the correct bits of the ``pmd-cpu-mask`` to ensure that the pmd threads are +pinned to SMT siblings. -PHY-PHY -~~~~~~~ - -Add test flows to forward packets betwen DPDK port 0 and port 1:: - - # Clear current flows - $ ovs-ofctl del-flows br0 - - # Add flows between port 1 (dpdk0) to port 2 (dpdk1) - $ ovs-ofctl add-flow br0 in_port=1,action=output:2 - $ ovs-ofctl add-flow br0 in_port=2,action=output:1 - -Transmit traffic into either port. You should see it returned via the other. 
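Regardless of the traffic generator used, the OVS-side counters can be checked
to confirm that the test flows are actually being hit (a suggested verification
step, not part of the original test description)::

    # per-port rx/tx packet counters for the bridge
    $ ovs-ofctl dump-ports br0

    # packet counts on the two test flows should increase
    $ ovs-ofctl dump-flows br0

    # per-PMD packet and cycle statistics
    $ ovs-appctl dpif-netdev/pmd-stats-show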
- -PHY-VM-PHY (vhost loopback) -~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Add two ``dpdkvhostuser`` ports to bridge ``br0``:: - - $ ovs-vsctl add-port br0 dpdkvhostuser0 \ - -- set Interface dpdkvhostuser0 type=dpdkvhostuser - $ ovs-vsctl add-port br0 dpdkvhostuser1 \ - -- set Interface dpdkvhostuser1 type=dpdkvhostuser - -Add test flows to forward packets betwen DPDK devices and VM ports:: - - # Clear current flows - $ ovs-ofctl del-flows br0 - - # Add flows - $ ovs-ofctl add-flow br0 in_port=1,action=output:3 - $ ovs-ofctl add-flow br0 in_port=3,action=output:1 - $ ovs-ofctl add-flow br0 in_port=4,action=output:2 - $ ovs-ofctl add-flow br0 in_port=2,action=output:4 - - # Dump flows - $ ovs-ofctl dump-flows br0 - -Create a VM using the following configuration: - -+----------------------+--------+-----------------+ -| configuration | values | comments | -+----------------------+--------+-----------------+ -| qemu version | 2.2.0 | n/a | -| qemu thread affinity | core 5 | taskset 0x20 | -| memory | 4GB | n/a | -| cores | 2 | n/a | -| Qcow2 image | CentOS7| n/a | -| mrg_rxbuf | off | n/a | -+----------------------+--------+-----------------+ - -You can do this directly with QEMU via the ``qemu-system-x86_64`` -application:: - - $ export VM_NAME=vhost-vm - $ export GUEST_MEM=3072M - $ export QCOW2_IMAGE=/root/CentOS7_x86_64.qcow2 - $ export VHOST_SOCK_DIR=/usr/local/var/run/openvswitch - - $ taskset 0x20 qemu-system-x86_64 -name $VM_NAME -cpu host -enable-kvm \ - -m $GUEST_MEM -drive file=$QCOW2_IMAGE --nographic -snapshot \ - -numa node,memdev=mem -mem-prealloc -smp sockets=1,cores=2 \ - -object memory-backend-file,id=mem,size=$GUEST_MEM,mem-path=/dev/hugepages,share=on \ - -chardev socket,id=char0,path=$VHOST_SOCK_DIR/dpdkvhostuser0 \ - -netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \ - -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=off \ - -chardev socket,id=char1,path=$VHOST_SOCK_DIR/dpdkvhostuser1 \ - -netdev type=vhost-user,id=mynet2,chardev=char1,vhostforce \ - -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mrg_rxbuf=off - -Alternatively, you can configure the guest using libvirt. Below is an XML -configuration for a 'demovm' guest that can be instantiated using `virsh`:: - - - demovm - 4a9b3f53-fa2a-47f3-a757-dd87720d9d1d - 4194304 - 4194304 - - - - - - 2 - - 4096 - - - - - - hvm - - - - - - - - - - - - - - destroy - restart - destroy - - /usr/bin/qemu-kvm - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Once the guest is configured and booted, configure DPDK packet forwarding -within the guest. To accomplish this, DPDK and testpmd application have to -be first compiled on the VM as described in **Guest Setup**. Once compiled, run -the ``test-pmd`` application:: - - $ cd $DPDK_DIR/app/test-pmd; - $ ./testpmd -c 0x3 -n 4 --socket-mem 1024 -- \ - --burst=64 -i --txqflags=0xf00 --disable-hw-vlan - $ set fwd mac retry - $ start - -When you finish testing, bind the vNICs back to kernel:: - - $ $DPDK_DIR/tools/dpdk-devbind.py --bind=virtio-pci 0000:00:03.0 - $ $DPDK_DIR/tools/dpdk-devbind.py --bind=virtio-pci 0000:00:04.0 +Take a sample system configuration, with 2 sockets, 2 * 10 core processors, HT +enabled. This gives us a total of 40 logical cores. To identify the physical +core shared by two logical cores, run:: -.. note:: - Appropriate PCI IDs to be passed in above example. The PCI IDs can be - retrieved like so:: + $ cat /sys/devices/system/cpu/cpuN/topology/thread_siblings_list + +where ``N`` is the logical core number. 
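For instance, on the 2-socket, 2x 10-core system described above, checking
logical core 1 might produce output like the following (the sibling pairing
shown here is illustrative; the exact numbering depends on how the platform
enumerates its cores)::

    $ cat /sys/devices/system/cpu/cpu1/topology/thread_siblings_list
    1,21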
+ +In this example, it would show that cores ``1`` and ``21`` share the same +physical core. As cores are counted from 0, the ``pmd-cpu-mask`` can be used +to enable these two pmd threads running on these two logical cores (one +physical core) is:: + + $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x200002 - $ $DPDK_DIR/tools/dpdk-devbind.py --status +Isolate Cores +~~~~~~~~~~~~~ + +The ``isolcpus`` option can be used to isolate cores from the Linux scheduler. +The isolated cores can then be used to dedicatedly run HPC applications or +threads. This helps in better application performance due to zero context +switching and minimal cache thrashing. To run platform logic on core 0 and +isolate cores between 1 and 19 from scheduler, add ``isolcpus=1-19`` to GRUB +cmdline. .. note:: - More information on the dpdkvhostuser ports can be found in - :doc:`dpdk-advanced`. + It has been verified that core isolation has minimal advantage due to mature + Linux scheduler in some circumstances. -PHY-VM-PHY (IVSHMEM loopback) -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +NUMA/Cluster-on-Die +~~~~~~~~~~~~~~~~~~~ + +Ideally inter-NUMA datapaths should be avoided where possible as packets will +go across QPI and there may be a slight performance penalty when compared with +intra NUMA datapaths. On Intel Xeon Processor E5 v3, Cluster On Die is +introduced on models that have 10 cores or more. This makes it possible to +logically split a socket into two NUMA regions and again it is preferred where +possible to keep critical datapaths within the one cluster. + +It is good practice to ensure that threads that are in the datapath are pinned +to cores in the same NUMA area. e.g. pmd threads and QEMU vCPUs responsible for +forwarding. If DPDK is built with ``CONFIG_RTE_LIBRTE_VHOST_NUMA=y``, vHost +User ports automatically detect the NUMA socket of the QEMU vCPUs and will be +serviced by a PMD from the same node provided a core on this node is enabled in +the ``pmd-cpu-mask``. ``libnuma`` packages are required for this feature. + +Compiler Optimizations +~~~~~~~~~~~~~~~~~~~~~~ + +The default compiler optimization level is ``-O2``. Changing this to more +aggressive compiler optimization such as ``-O3 -march=native`` with +gcc (verified on 5.3.1) can produce performance gains though not siginificant. +``-march=native`` will produce optimized code on local machine and should be +used when software compilation is done on Testbed. + +Affinity +~~~~~~~~ + +For superior performance, DPDK pmd threads and Qemu vCPU threads needs to be +affinitized accordingly. + +- PMD thread Affinity + + A poll mode driver (pmd) thread handles the I/O of all DPDK interfaces + assigned to it. A pmd thread shall poll the ports for incoming packets, + switch the packets and send to tx port. pmd thread is CPU bound, and needs + to be affinitized to isolated cores for optimum performance. + + By setting a bit in the mask, a pmd thread is created and pinned to the + corresponding CPU core. e.g. to run a pmd thread on core 2:: + + $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x4 + + .. note:: + pmd thread on a NUMA node is only created if there is at least one DPDK + interface from that NUMA node added to OVS. + +- QEMU vCPU thread Affinity + + A VM performing simple packet forwarding or running complex packet pipelines + has to ensure that the vCPU threads performing the work has as much CPU + occupancy as possible. + + For example, on a multicore VM, multiple QEMU vCPU threads shall be spawned. 
+ When the DPDK ``testpmd`` application that does packet forwarding is invoked, + the ``taskset`` command should be used to affinitize the vCPU threads to the + dedicated isolated cores on the host system. + +Multiple Poll-Mode Driver Threads +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +With pmd multi-threading support, OVS creates one pmd thread for each NUMA node +by default. However, in cases where there are multiple ports/rxq's producing +traffic, performance can be improved by creating multiple pmd threads running +on separate cores. These pmd threads can share the workload by each being +responsible for different ports/rxq's. Assignment of ports/rxq's to pmd threads +is done automatically. + +A set bit in the mask means a pmd thread is created and pinned to the +corresponding CPU core. For example, to run pmd threads on core 1 and 2:: + + $ ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x6 + +When using dpdk and dpdkvhostuser ports in a bi-directional VM loopback as +shown below, spreading the workload over 2 or 4 pmd threads shows significant +improvements as there will be more total CPU occupancy available:: + + NIC port0 <-> OVS <-> VM <-> OVS <-> NIC port 1 + +DPDK Physical Port Rx Queues +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + $ ovs-vsctl set Interface options:n_rxq= + +The above command sets the number of rx queues for DPDK physical interface. +The rx queues are assigned to pmd threads on the same NUMA node in a +round-robin fashion. + +DPDK Physical Port Queue Sizes +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +:: + + $ ovs-vsctl set Interface dpdk0 options:n_rxq_desc= + $ ovs-vsctl set Interface dpdk0 options:n_txq_desc= + +The above command sets the number of rx/tx descriptors that the NIC associated +with dpdk0 will be initialised with. + +Different ``n_rxq_desc`` and ``n_txq_desc`` configurations yield different +benefits in terms of throughput and latency for different scenarios. +Generally, smaller queue sizes can have a positive impact for latency at the +expense of throughput. The opposite is often true for larger queue sizes. +Note: increasing the number of rx descriptors eg. to 4096 may have a negative +impact on performance due to the fact that non-vectorised DPDK rx functions may +be used. This is dependant on the driver in use, but is true for the commonly +used i40e and ixgbe DPDK drivers. + +Exact Match Cache +~~~~~~~~~~~~~~~~~ + +Each pmd thread contains one Exact Match Cache (EMC). After initial flow setup +in the datapath, the EMC contains a single table and provides the lowest level +(fastest) switching for DPDK ports. If there is a miss in the EMC then the next +level where switching will occur is the datapath classifier. Missing in the +EMC and looking up in the datapath classifier incurs a significant performance +penalty. If lookup misses occur in the EMC because it is too small to handle +the number of flows, its size can be increased. The EMC size can be modified by +editing the define ``EM_FLOW_HASH_SHIFT`` in ``lib/dpif-netdev.c``. + +As mentioned above, an EMC is per pmd thread. An alternative way of increasing +the aggregate amount of possible flow entries in EMC and avoiding datapath +classifier lookups is to have multiple pmd threads running. + +Rx Mergeable Buffers +~~~~~~~~~~~~~~~~~~~~ -Refer to the :doc:`dpdk-advanced`. +Rx mergeable buffers is a virtio feature that allows chaining of multiple +virtio descriptors to handle large packet sizes. Large packets are handled by +reserving and chaining multiple free descriptors together. 
Mergeable buffer +support is negotiated between the virtio driver and virtio device and is +supported by the DPDK vhost library. This behavior is supported and enabled by +default, however in the case where the user knows that rx mergeable buffers are +not needed i.e. jumbo frames are not needed, it can be forced off by adding +``mrg_rxbuf=off`` to the QEMU command line options. By not reserving multiple +chains of descriptors it will make more individual virtio descriptors available +for rx to the guest using dpdkvhost ports and this can improve performance. Limitations ------------ diff --git a/Documentation/intro/install/index.rst b/Documentation/intro/install/index.rst index 2366388..0d3ea06 100644 --- a/Documentation/intro/install/index.rst +++ b/Documentation/intro/install/index.rst @@ -33,10 +33,6 @@ different environments and using different configurations. Installation from Source ------------------------ -.. TODO(stephenfin): The DPDK-ADVANCED doc is mostly usage material. The - install related instructions should be moved to the main doc, while the - rest should be moved to howto and topic docs - .. TODO(stephenfin): Based on the title alone, the NetBSD doc should probably be merged into the general install doc @@ -49,7 +45,6 @@ Installation from Source xenserver userspace dpdk - dpdk-advanced bash-completion Installation from Packages diff --git a/Documentation/topics/dpdk/index.rst b/Documentation/topics/dpdk/index.rst new file mode 100644 index 0000000..3c98a9a --- /dev/null +++ b/Documentation/topics/dpdk/index.rst @@ -0,0 +1,32 @@ +.. + Licensed under the Apache License, Version 2.0 (the "License"); you may + not use this file except in compliance with the License. You may obtain + a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, WITHOUT + WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the + License for the specific language governing permissions and limitations + under the License. + + Convention for heading levels in Open vSwitch documentation: + + ======= Heading 0 (reserved for the title in a document) + ------- Heading 1 + ~~~~~~~ Heading 2 + +++++++ Heading 3 + ''''''' Heading 4 + + Avoid deeper levels because they do not render well. + +================= +The DPDK Datapath +================= + +.. toctree:: + :maxdepth: 2 + + vhost-user + ivshmem diff --git a/Documentation/topics/dpdk.rst b/Documentation/topics/dpdk/ivshmem.rst similarity index 93% rename from Documentation/topics/dpdk.rst rename to Documentation/topics/dpdk/ivshmem.rst index 74e0266..bd4dd99 100644 --- a/Documentation/topics/dpdk.rst +++ b/Documentation/topics/dpdk/ivshmem.rst @@ -21,8 +21,8 @@ Avoid deeper levels because they do not render well. -================ -DPDK Integration -================ +================== +DPDK IVSHMEM Ports +================== **TODO** diff --git a/Documentation/topics/dpdk/vhost-user.rst b/Documentation/topics/dpdk/vhost-user.rst new file mode 100644 index 0000000..5448bd2 --- /dev/null +++ b/Documentation/topics/dpdk/vhost-user.rst @@ -0,0 +1,396 @@ +.. + Licensed under the Apache License, Version 2.0 (the "License"); you may + not use this file except in compliance with the License. 
You may obtain + a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, WITHOUT + WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the + License for the specific language governing permissions and limitations + under the License. + + Convention for heading levels in Open vSwitch documentation: + + ======= Heading 0 (reserved for the title in a document) + ------- Heading 1 + ~~~~~~~ Heading 2 + +++++++ Heading 3 + ''''''' Heading 4 + + Avoid deeper levels because they do not render well. + +===================== +DPDK vHost User Ports +===================== + +The DPDK datapath provides DPDK-backed vHost user ports as a primary way to +interact with guests. For more information on vHost User, refer to the `QEMU +documentation`_ on same. + +Quick Example +------------- + +This example demonstrates how to add two ``dpdkvhostuser`` ports to an existing +bridge called ``br0``:: + + $ ovs-vsctl add-port br0 dpdkvhostuser0 \ + -- set Interface dpdkvhostuser0 type=dpdkvhostuser + $ ovs-vsctl add-port br0 dpdkvhostuser1 \ + -- set Interface dpdkvhostuser1 type=dpdkvhostuser + +vhost-user vs. vhost-user-client +-------------------------------- + +Open vSwitch provides two types of vHost User ports: + +- vhost-user (``dpdkvhostuser``) + +- vhost-user-client (``dpdkvhostuserclient``) + +vHost User uses a client-server model. The server creates/manages/destroys the +vHost User sockets, and the client connects to the server. Depending on which +port type you use, ``dpdkvhostuser`` or ``dpdkvhostuserclient``, a different +configuration of the client-server model is used. + +For vhost-user ports, Open vSwitch acts as the server and QEMU the client. For +vhost-user-client ports, Open vSwitch acts as the client and QEMU the server. + +.. _dpdk-vhost-user: + +vhost-user +---------- + +.. important:: + + Use of vhost-user ports requires QEMU >= 2.2 + +To use vhost-user ports, you must first add said ports to the switch. Unlike +DPDK ring ports, DPDK vhost-user ports can have arbitrary names, except that +forward and backward slashes are prohibited in the names. For vhost-user, the +port type is ``dpdkvhostuser``:: + + $ ovs-vsctl add-port br0 vhost-user-1 -- set Interface vhost-user-1 \ + type=dpdkvhostuser + +This action creates a socket located at +``/usr/local/var/run/openvswitch/vhost-user-1``, which you must provide to your +VM on the QEMU command line. + +.. note:: + + If you wish for the vhost-user sockets to be created in a sub-directory of + ``/usr/local/var/run/openvswitch``, you may specify this directory in the + ovsdb like so:: + + $ ovs-vsctl --no-wait \ + set Open_vSwitch . other_config:vhost-sock-dir=subdir` + +Once the vhost-user ports have been added to the switch, they must be added to +the guest. There are two ways to do this: using QEMU directly, or using +libvirt. + +Adding vhost-user ports to the guest (QEMU) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To begin, you must attach the vhost-user device sockets to the guest. To do +this, you must pass the following parameters to QEMU:: + + -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 + -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce + -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1 + +where ``vhost-user-1`` is the name of the vhost-user port added to the switch. 
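Putting these fragments together, a minimal single-port invocation could look
something like the sketch below. The VM name, memory size and disk image are
placeholders, and the hugepage memory backing included here is explained in
more detail in the following paragraphs::

    $ qemu-system-x86_64 -name guest1 -cpu host -enable-kvm -m 3072M \
        -object memory-backend-file,id=mem,size=3072M,mem-path=/dev/hugepages,share=on \
        -numa node,memdev=mem -mem-prealloc \
        -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 \
        -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce \
        -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1 \
        -drive file=/path/to/image.qcow2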
+
+Repeat the above parameters for multiple devices, changing the chardev ``path``
+and ``id`` as necessary. Note that a separate and different chardev ``path``
+needs to be specified for each vhost-user device. For example, if you have a
+second vhost-user port named ``vhost-user-2``, append the following additional
+parameters to your QEMU command line::
+
+    -chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2
+    -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce
+    -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2
+
+In addition, QEMU must allocate the VM's memory on hugetlbfs. vhost-user ports
+access a virtio-net device's virtual rings and packet buffers, which are mapped
+in the VM's physical memory on hugetlbfs. To enable vhost-user ports to map the
+VM's memory into their process address space, pass the following parameters to
+QEMU::
+
+    -object memory-backend-file,id=mem,size=4096M,mem-path=/dev/hugepages,share=on
+    -numa node,memdev=mem -mem-prealloc
+
+Finally, you may wish to enable multiqueue support. This is optional but,
+should you wish to enable it, pass the following additional parameters::
+
+    -chardev socket,id=char2,path=/usr/local/var/run/openvswitch/vhost-user-2
+    -netdev type=vhost-user,id=mynet2,chardev=char2,vhostforce,queues=$q
+    -device virtio-net-pci,mac=00:00:00:00:00:02,netdev=mynet2,mq=on,vectors=$v
+
+where:
+
+``$q``
+  The number of queues
+``$v``
+  The number of vectors, which is ``$q`` * 2 + 2
+
+The vhost-user interface will be automatically reconfigured with the required
+number of rx and tx queues once the virtio device connects. Manual
+configuration of ``n_rxq`` is not supported because OVS will work properly only
+if ``n_rxq`` matches the number of queues configured in QEMU.
+
+At least two PMDs should be configured for the vswitch when using multiqueue.
+Using a single PMD will cause traffic to be enqueued to the same vhost queue
+rather than being distributed among different vhost queues for a vhost-user
+interface.
+
+If traffic destined for a VM configured with multiqueue arrives at the vswitch
+via a physical DPDK port, then the number of rxqs should also be set to at
+least 2 for that physical DPDK port. This is required to increase the
+probability that a different PMD will handle the multiqueue transmission to the
+guest using a different vhost queue.
+
+If you wish to use multiple queues for an interface in the guest, the driver
+in the guest operating system must be configured to do so. It is recommended
+that the number of queues configured be equal to ``$q``.
+
+For example, this can be done for the Linux kernel virtio-net driver with::
+
+    $ ethtool -L <DEVICE> combined <$q>
+
+where:
+
+``-L``
+  Changes the number of channels of the specified network device
+``combined``
+  Changes the number of multi-purpose channels.
+
+Adding vhost-user ports to the guest (libvirt)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. TODO(stephenfin): This seems like something that wouldn't be acceptable in
+   production. Is this really required?
+
+To begin, you must change the user and group that libvirt runs under, configure
+the access control policy and restart libvirtd.
+
+- In ``/etc/libvirt/qemu.conf`` add/edit the following lines::
+
+      user = "root"
+      group = "root"
+
+- Disable SELinux or set to permissive mode::
+
+      $ setenforce 0
+
+- Finally, restart the libvirtd process. For example, on Fedora::
+
+      $ systemctl restart libvirtd.service
+
+Once complete, instantiate the VM. A sample XML configuration file is provided
+at the :ref:`end of this file <dpdk-vhost-user-xml>`. Save this file, then
Save this file, then +create a VM using this file:: + + $ virsh create demovm.xml + +Once created, you can connect to the guest console:: + + $ virsh console demovm + +The demovm xml configuration is aimed at achieving out of box performance on +VM. These enhancements include: + +- The vcpus are pinned to the cores of the CPU socket 0 using ``vcpupin``. + +- Configure NUMA cell and memory shared using ``memAccess='shared'``. + +- Disable ``mrg_rxbuf='off'`` + +Refer to the `libvirt documentation `__ +for more information. + +.. _dpdk-vhost-user-client: + +vhost-user-client +----------------- + +.. important:: + + Use of vhost-user ports requires QEMU >= 2.7 + +To use vhost-user-client ports, you must first add said ports to the switch. +Like DPDK vhost-user ports, DPDK vhost-user-client ports can have mostly +arbitrary. However, the name given to the port does not govern the name of the +socket device. Instead, this must be configured by the user by way of a +``vhost-server-path`` option. For vhost-user-client, the port type is +``dpdkvhostuserclient``:: + + $ VHOST_USER_SOCKET_PATH=/path/to/socket + $ ovs-vsctl add-port br0 vhost-client-1 \ + -- set Interface vhost-client-1 type=dpdkvhostuserclient \ + options:vhost-server-path=$VHOST_USER_SOCKET_PATH + +Once the vhost-user-client ports have been added to the switch, they must be +added to the guest. Like vhost-user ports, there are two ways to do this: using +QEMU directly, or using libvirt. Only the QEMU case is covered here. + +Adding vhost-user-client ports to the guest (QEMU) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Attach the vhost-user device sockets to the guest. To do this, you must pass +the following parameters to QEMU:: + + -chardev socket,id=char1,path=$VHOST_USER_SOCKET_PATH,server + -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce + -device virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1 + +where ``vhost-user-1`` is the name of the vhost-user port added to the switch. + +If the corresponding ``dpdkvhostuserclient`` port has not yet been configured +in OVS with ``vhost-server-path=/path/to/socket``, QEMU will print a log +similar to the following:: + + QEMU waiting for connection on: disconnected:unix:/path/to/socket,server + +QEMU will wait until the port is created sucessfully in OVS to boot the VM. +One benefit of using this mode is the ability for vHost ports to 'reconnect' in +event of the switch crashing or being brought down. Once it is brought back up, +the vHost ports will reconnect automatically and normal service will resume. + +.. _dpdk-testpmd: + +DPDK in the Guest +----------------- + +The DPDK ``testpmd`` application can be run in guest VMs for high speed packet +forwarding between vhostuser ports. DPDK and testpmd application has to be +compiled on the guest VM. Below are the steps for setting up the testpmd +application in the VM. + +.. note:: + + Support for DPDK in the guest requires QEMU >= 2.2 + +To begin, instantiate a guest as described in :ref:`dpdk-vhost-user` or +:ref:`dpdk-vhost-user-client`. 
Once started, connect to the VM, download the +DPDK sources to VM and build DPDK:: + + $ cd /root/dpdk/ + $ wget http://fast.dpdk.org/rel/dpdk-16.11.tar.xz + $ tar xf dpdk-16.11.tar.xz + $ export DPDK_DIR=/root/dpdk/dpdk-16.11 + $ export DPDK_TARGET=x86_64-native-linuxapp-gcc + $ export DPDK_BUILD=$DPDK_DIR/$DPDK_TARGET + $ cd $DPDK_DIR + $ make install T=$DPDK_TARGET DESTDIR=install + +Build the test-pmd application:: + + $ cd app/test-pmd + $ export RTE_SDK=$DPDK_DIR + $ export RTE_TARGET=$DPDK_TARGET + $ make + +Setup huge pages and DPDK devices using UIO:: + + $ sysctl vm.nr_hugepages=1024 + $ mkdir -p /dev/hugepages + $ mount -t hugetlbfs hugetlbfs /dev/hugepages # only if not already mounted + $ modprobe uio + $ insmod $DPDK_BUILD/kmod/igb_uio.ko + $ $DPDK_DIR/tools/dpdk-devbind.py --status + $ $DPDK_DIR/tools/dpdk-devbind.py -b igb_uio 00:03.0 00:04.0 + +.. note:: + + vhost ports pci ids can be retrieved using:: + + lspci | grep Ethernet + +Finally, start the application:: + + # TODO + +.. _dpdk-vhost-user-xml: + +Sample XML +---------- + +:: + + + demovm + 4a9b3f53-fa2a-47f3-a757-dd87720d9d1d + 4194304 + 4194304 + + + + + + 2 + + 4096 + + + + + + hvm + + + + + + + + + + + + + + destroy + restart + destroy + + /usr/bin/qemu-kvm + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +.. _QEMU documentation: http://git.qemu-project.org/?p=qemu.git;a=blob;f=docs/specs/vhost-user.txt;h=7890d7169;hb=HEAD diff --git a/Documentation/topics/index.rst b/Documentation/topics/index.rst index 30f74fe..e5a8b4d 100644 --- a/Documentation/topics/index.rst +++ b/Documentation/topics/index.rst @@ -40,8 +40,9 @@ that way. openflow bonding ovsdb-replication - dpdk + dpdk/index windows + testing .. toctree:: :maxdepth: 2 diff --git a/Documentation/topics/testing.rst b/Documentation/topics/testing.rst new file mode 100644 index 0000000..5265ab1 --- /dev/null +++ b/Documentation/topics/testing.rst @@ -0,0 +1,38 @@ +.. + Licensed under the Apache License, Version 2.0 (the "License"); you may + not use this file except in compliance with the License. You may obtain + a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, WITHOUT + WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the + License for the specific language governing permissions and limitations + under the License. + + Convention for heading levels in Open vSwitch documentation: + + ======= Heading 0 (reserved for the title in a document) + ------- Heading 1 + ~~~~~~~ Heading 2 + +++++++ Heading 3 + ''''''' Heading 4 + + Avoid deeper levels because they do not render well. + +======= +Testing +======= + +.. TODO(stephenfin): Flesh this out with information from the general + installation guide, among others. + +vsperf +------ + +The vsperf project aims to develop a vSwitch test framework that can be used to +validate the suitability of different vSwitch implementations in a telco +deployment environment. More information can be found on the `OPNFV wiki`_. + +.. _OPNFV wiki: https://wiki.opnfv.org/display/vsperf/VSperf+Home