Message-Id: <1535757856-51123-1-git-send-email-u9012063@gmail.com>
From: William Tu <u9012063@gmail.com>
To: dev@openvswitch.org, iovisor-dev@lists.iovisor.org
Date: Fri, 31 Aug 2018 16:24:13 -0700
Subject: [ovs-dev] [PATCHv2 RFC 0/3] AF_XDP support for OVS
From: root <ovs-smartnic@vmware.com>

The patch series introduces AF_XDP support for the OVS netdev. AF_XDP is a
new address family that works together with eBPF. In short, a socket with
the AF_XDP family can receive and send packets from an eBPF/XDP program
attached to the netdev. For more details about AF_XDP, please see the
Linux kernel's Documentation/networking/af_xdp.rst.

OVS has a couple of netdev types, i.e., system, tap, or internal. The
patch first adds a new netdev type called "afxdp", and implements its
configuration, packet reception, and transmit functions. Since the AF_XDP
socket, xsk, operates in userspace, once ovs-vswitchd receives packets
from the xsk, the proposed architecture re-uses the existing userspace
dpif-netdev datapath. As a result, most of the packet processing happens
in userspace instead of the Linux kernel.

Architecture
============
               _
              |   +-------------------+
              |   |    ovs-vswitchd   |<-->ovsdb-server
              |   +-------------------+
              |   |      ofproto      |<-->OpenFlow controllers
              |   +--------+-+--------+
              |   | netdev | |ofproto-|
    userspace |   +--------+ |  dpif  |
              |   | netdev | +--------+
              |   |provider| |  dpif  |
              |   +---||---+ +--------+
              |       ||     | dpif-  |
              |       ||     | netdev |
              |_      ||     +--------+
                      ||
               _  +---||-----+--------+
              |   |  af_xdp prog +    |
       kernel |   |     xsk_map       |
              |_  +--------||---------+
                           ||
                      physical NIC

To start, create an OVS userspace bridge using dpif-netdev by setting the
datapath_type to netdev:
  # ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev

And attach a Linux netdev with type afxdp:
  # ovs-vsctl add-port br0 afxdp-p0 -- \
      set interface afxdp-p0 type="afxdp"

Most of the implementation follows the AF_XDP sample code in the Linux
kernel under samples/bpf/xdpsock_user.c.
Configuration
=============
When a new afxdp netdev is added to OVS, the patch does the following
configuration:
  1) attach the afxdp program and map to the netdev (see bpf/xdp.h)
     (currently a maximum of 4 afxdp netdevs is supported)
  2) create an AF_XDP socket (XSK) for the afxdp netdev
  3) allocate a virtually contiguous memory region, called umem, and
     register this memory to the XSK
  4) set up the rx/tx rings, and the umem's fill/completion rings

Packet Flow
===========
Currently, the af_xdp eBPF program loaded onto the netdev does nothing
but forward the packet to its receiving queue id.

The v1 patch simplified the buffer/ring management by introducing a copy
from the umem to OVS's internal buffer when receiving a packet, and, when
sending the packet out to another netdev, copying the packet to that
netdev's umem.

The v2 patch implements a simple push/pop umem list, where the list
contains unused umem elements. Before receiving any packet, pop N
elements from the list and place them into the FILL queue. When a batch
of packets is received, e.g., 32 packets, pop another 32 unused umem
elements and refill the FILL queue. When there are N packets to send, pop
N elements from the list, copy the packet data into each umem element,
and issue the send. Once finished, recycle the sent packets' umem
elements from the COMPLETION queue and push them back onto the umem list.

With v2, an AF_XDP packet forwarded from one netdev (ovs-p0) to another
netdev (ovs-p1) goes through the following path, considering the driver
mode:
  1) the xdp program at ovs-p0 copies the packet to the netdev's rx umem
  2) ovs-vswitchd receives the packet from ovs-p0
  3) ovs dpif-netdev parses the packet, looks up the flow table, and
     applies action=forward to another afxdp port
  4) ovs-vswitchd copies the packet to the umem of ovs-p1, kick_tx
  5) the kernel copies the packet from the umem to the ovs-p1 tx queue

Thus, the total number of copies between two ports is 3. I haven't tried
the AF_XDP zero copy mode driver. Hopefully by using AF_XDP zero copy
mode, 1) and 5) will be removed.
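The v2 push/pop umem list described above is essentially a LIFO free list
of fixed-size frame addresses. A minimal userspace sketch (hypothetical
names and sizes, not the patch's actual code) could look like:

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_FRAMES 16       /* assumed pool size for illustration */
#define FRAME_SIZE 2048     /* assumed umem frame size */

/* A LIFO free list of unused umem frame addresses. */
struct umem_pool {
    unsigned long addrs[NUM_FRAMES]; /* frame offsets into the umem area */
    int top;                         /* number of free frames on the stack */
};

static void umem_pool_init(struct umem_pool *p)
{
    /* Initially every frame is unused, so push them all. */
    for (p->top = 0; p->top < NUM_FRAMES; p->top++) {
        p->addrs[p->top] = (unsigned long)p->top * FRAME_SIZE;
    }
}

/* Push an unused frame back, e.g. after the COMPLETION ring reports it. */
static int umem_elem_push(struct umem_pool *p, unsigned long addr)
{
    if (p->top == NUM_FRAMES) {
        return -1;          /* pool full: likely a double free */
    }
    p->addrs[p->top++] = addr;
    return 0;
}

/* Pop an unused frame, e.g. to refill the FILL ring or build a TX frame. */
static int umem_elem_pop(struct umem_pool *p, unsigned long *addr)
{
    if (p->top == 0) {
        return -1;          /* no free frames available */
    }
    *addr = p->addrs[--p->top];
    return 0;
}
```

On RX, N addresses popped here would be written into the FILL ring; on TX
completion, addresses read back from the COMPLETION ring would be pushed
back into the pool.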
So the best case will be one copy between two netdevs.

Performance
===========
Performance is measured on 2 machines with a back-to-back dual-port
10Gbps link using the ixgbe driver. Each machine has an Intel Xeon CPU
E5-2440 v2 at 1.90GHz, 8 cores. One machine sends 14Mpps, single flow,
64-byte packets, to the other machine, which runs OVS with 2 afxdp ports.
One port is used for receiving packets, and the other port for sending
packets back. All experiments use a single core.

Baseline performance using xdpsock (Mpps, 64-byte)
  Benchmark   XDP_SKB   XDP_DRV
  rxdrop      1.1       2.9
  txpush      1.5       1.5
  l2fwd       1.08      1.12

With this patch, OVS with AF_XDP (Mpps, 64-byte)
  Benchmark   XDP_SKB   XDP_DRV
  rxdrop      1.4       3.3 *[1]
  txpush      N/A       N/A
  l2fwd       0.3       0.4

The rxdrop case uses the rule
  ovs-ofctl add-flow br0 "in_port=1 actions=drop"
The l2fwd case uses
  ovs-ofctl add-flow br0 "in_port=1 actions=output:2"
where ports 1 and 2 are ixgbe 10Gbps ports.

Apparently we need to find out why l2fwd is so slow.

[1] the number is higher than the baseline; I guess it's due to OVS's
pmd-mode netdev.

Evaluation1: perf
=================
1) RX drop 3Mpps
  15.43%  pmd7  ovs-vswitchd        [.] miniflow_extract
  11.63%  pmd7  ovs-vswitchd        [.] dp_netdev_input__
   9.69%  pmd7  libc-2.23.so        [.] __clock_gettime
   7.23%  pmd7  ovs-vswitchd        [.] umem_elem_push
   7.12%  pmd7  ovs-vswitchd        [.] pmd_thread_main
   6.76%  pmd7  [vdso]              [.] __vdso_clock_gettime
   6.65%  pmd7  ovs-vswitchd        [.] netdev_linux_rxq_xsk
   6.46%  pmd7  [kernel.vmlinux]    [k] nmi
   6.36%  pmd7  ovs-vswitchd        [.] odp_execute_actions
   5.85%  pmd7  ovs-vswitchd        [.] netdev_linux_rxq_recv
   4.62%  pmd7  ovs-vswitchd        [.] time_timespec__
   3.77%  pmd7  ovs-vswitchd        [.] dp_netdev_process_rxq_port

2) L2fwd 0.4Mpps
  20.05%  pmd7  [kernel.vmlinux]    [k] entry_SYSCALL_64_trampoline
  11.40%  pmd7  [kernel.vmlinux]    [k] syscall_return_via_sysret
  11.27%  pmd7  [kernel.vmlinux]    [k] __sys_sendto
   6.10%  pmd7  [kernel.vmlinux]    [k] __fget
   4.81%  pmd7  [kernel.vmlinux]    [k] xsk_sendmsg
   3.62%  pmd7  [kernel.vmlinux]    [k] sockfd_lookup_light
   3.29%  pmd7  [kernel.vmlinux]    [k] aa_label_sk_perm
   3.16%  pmd7  libpthread-2.23.so  [.] __GI___libc_sendto
   3.07%  pmd7  [kernel.vmlinux]    [k] nmi
   2.90%  pmd7  [kernel.vmlinux]    [k] entry_SYSCALL_64_after_hwframe
   2.89%  pmd7  [kernel.vmlinux]    [k] do_syscall_64

note: there are lots of sendto syscalls on l2fwd due to tx, and lots of
time is spent in [kernel.vmlinux], not ovs-vswitchd.

Evaluation2: strace -c
======================
1) RX drop 3Mpps
  % time     seconds  usecs/call     calls    errors syscall
  ------ ----------- ----------- --------- --------- ----------------
   24.94    0.000199           2       114       108 recvmsg
   23.06    0.000184          11        17           poll
   19.67    0.000157           3        54        54 accept
   14.41    0.000115           2        54        40 read
    5.39    0.000043           7         6           sendmsg
    4.76    0.000038           2        21        18 recvfrom
    3.88    0.000031           2        19           getrusage
    3.01    0.000024          24         1           restart_syscall
    0.88    0.000007           4         2           sendto
  ------ ----------- ----------- --------- --------- ----------------
  100.00    0.000798                   288       220 total

2) L2fwd 0.4Mpps
  % time     seconds  usecs/call     calls    errors syscall
  ------ ----------- ----------- --------- --------- ----------------
   99.35   11.180104      272685        41           poll
    0.64    0.072016       72016         1           restart_syscall
    0.00    0.000256           1       264       252 recvmsg
    0.00    0.000253           2       126       126 accept
    0.00    0.000102           1       126        92 read
    0.00    0.000071           6        12           sendmsg
    0.00    0.000054           1        50        42 recvfrom
    0.00    0.000035           1        44           getrusage
    0.00    0.000015           4         4           sendto
  ------ ----------- ----------- --------- --------- ----------------
  100.00   11.252906                   668       512 total

note: strace shows lots of poll syscalls, but using ftrace, I saw lots of
sendto system calls.
Evaluation3: perf stat
======================
1) RX drop 3Mpps
       20,010.33 msec cpu-clock           #    1.000 CPUs utilized
       20,010.35 msec task-clock          #    1.000 CPUs utilized
              93      context-switches    #    4.648 M/sec
     346,728,358      cache-references    # 17327754.023 M/sec       (19.06%)
          13,206      cache-misses        #    0.004 % of all cache refs (19.11%)
  45,404,902,649      cycles              # 2269110.577 GHz          (19.13%)
  79,272,455,182      instructions        #    1.75  insn per cycle
                                          #    0.29  stalled cycles per insn (23.88%)
  16,481,635,945      branches            # 823669962.269 M/sec      (23.88%)
               0      faults              #    0.000 K/sec
  22,691,286,646      stalled-cycles-frontend # 49.98% frontend cycles idle (23.86%)
 <not supported>      stalled-cycles-backend
      14,650,538      branch-misses       #    0.09% of all branches  (23.81%)
   1,982,308,381      bus-cycles          # 99065886.107 M/sec       (23.79%)
  37,669,050,133      ref-cycles          # 1882511251.024 M/sec     (28.55%)
     195,627,994      LLC-loads           # 9776511.444 M/sec        (23.79%)
           1,804      LLC-load-misses     #    0.00% of all LL-cache hits (23.79%)
     150,563,094      LLC-stores          # 7524392.504 M/sec        (9.51%)
             126      LLC-store-misses    #    6.297 M/sec           (9.51%)
     807,489,043      LLC-prefetches      # 40354275.012 M/sec       (9.51%)
           3,794      LLC-prefetch-misses #  189.605 M/sec           (9.51%)
  22,972,900,217      dTLB-loads          # 1148070975.362 M/sec     (9.51%)
       4,386,366      dTLB-load-misses    #    0.02% of all dTLB cache hits (9.51%)
  15,018,543,088      dTLB-stores         # 750551878.461 M/sec      (9.51%)
          22,187      dTLB-store-misses   # 1108.796 M/sec           (9.51%)
 <not supported>      dTLB-prefetches
 <not supported>      dTLB-prefetch-misses
       5,248,771      iTLB-loads          # 262307.396 M/sec         (9.51%)
          74,790      iTLB-load-misses    #    1.42% of all iTLB cache hits (14.27%)

2) L2fwd 0.4Mpps
Performance counter stats for process id '7095':
       10,005.64 msec cpu-clock           #    1.000 CPUs utilized
       10,005.64 msec task-clock          #    1.000 CPUs utilized
              47      context-switches    #    4.698 M/sec
      33,231,952      cache-references    # 3321534.433 M/sec        (19.09%)
           4,093      cache-misses        #    0.012 % of all cache refs (19.13%)
  21,940,530,397      cycles              # 2192956.561 GHz          (19.13%)
  15,082,324,838      instructions        #    0.69  insn per cycle
                                          #    0.95  stalled cycles per insn (23.89%)
   3,180,559,568      branches            # 317897008.296 M/sec      (23.89%)
               0      faults              #    0.000 K/sec
  14,284,571,595      stalled-cycles-frontend # 65.11% frontend cycles idle (23.83%)
 <not supported>      stalled-cycles-backend
      73,279,708      branch-misses       #    2.30% of all branches  (23.79%)
     997,678,140      bus-cycles          # 99717955.022 M/sec       (23.79%)
  18,958,361,307      ref-cycles          # 1894888686.357 M/sec     (28.54%)
      20,159,851      LLC-loads           # 2014977.611 M/sec        (23.79%)
             458      LLC-load-misses     #    0.00% of all LL-cache hits (23.79%)
      11,159,143      LLC-stores          # 1115356.622 M/sec        (9.51%)
              42      LLC-store-misses    #    4.198 M/sec           (9.51%)
      54,652,812      LLC-prefetches      # 5462549.925 M/sec        (9.51%)
           1,650      LLC-prefetch-misses #  164.918 M/sec           (9.51%)
   4,662,831,793      dTLB-loads          # 466050154.223 M/sec      (9.51%)
     137,048,612      dTLB-load-misses    #    2.94% of all dTLB cache hits (9.51%)
   3,427,752,805      dTLB-stores         # 342603978.511 M/sec      (9.51%)
       1,083,813      dTLB-store-misses   # 108327.136 M/sec         (9.51%)
 <not supported>      dTLB-prefetches
 <not supported>      dTLB-prefetch-misses
       4,982,148      iTLB-loads          # 497965.817 M/sec         (9.51%)
      14,543,131      iTLB-load-misses    #  291.90% of all iTLB cache hits (14.27%)

note: perf highlights 2 stats in red: stalled-cycles-frontend and
iTLB-load-misses. I think l2fwd has a high syscall rate, causing the high
iTLB-load-misses. I don't know why in both cases the frontend cycles,
which do the decode, are pretty high (49% and 65%).

Evaluation4: ovs-appctl dpif-netdev/pmd-stats-show
==================================================
1) RX drop 3Mpps
pmd thread numa_id 0 core_id 11:
  packets received: 82106476
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 82106156
  megaflow hits: 318
  avg. subtable lookups per megaflow hit: 1.00
  miss with success upcall: 2
  miss with failed upcall: 0
  avg. packets per output batch: 32.00
  idle cycles: 198086106617 (60.53%)
  processing cycles: 129181332198 (39.47%)
  avg cycles per packet: 3985.89 (327267438815/82106476)
  avg processing cycles per packet: 1573.34 (129181332198/82106476)

2) L2fwd 0.4Mpps
pmd thread numa_id 0 core_id 11:
  packets received: 8555669
  packet recirculations: 0
  avg. datapath passes per packet: 1.00
  emc hits: 8555509
  megaflow hits: 159
  avg. subtable lookups per megaflow hit: 1.00
  miss with success upcall: 1
  miss with failed upcall: 0
  avg. packets per output batch: 32.00
  idle cycles: 89532679391 (57.74%)
  processing cycles: 65538391726 (42.26%)
  avg cycles per packet: 18124.95 (155071071117/8555669)
  avg processing cycles per packet: 7660.23 (65538391726/8555669)

note: avg cycles per packet: 3985 vs. 18124

Next Step
=========
1) optimize the tx part as well as l2fwd
2) try the zero copy mode driver

v1->v2:
  - add a list to maintain unused umem elements
  - remove the copy from rx umem to OVS's internal buffer
  - use hugetlb to reduce misses (not much difference)
  - use pmd-mode netdev in OVS (huge performance improvement)
  - remove the malloc of dp_packet; instead put the dp_packet in umem

William Tu (3):
  afxdp: add ebpf code for afxdp and xskmap.
  netdev-linux: add new netdev type afxdp.
  tests: add afxdp test cases.
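The pmd-stats averages above are simply total cycles divided by packets
received; a small C check using the reported counters confirms the ~4.5x
per-packet cost gap between rxdrop and l2fwd:

```c
/* Average cycles per packet, exactly as pmd-stats-show reports it:
 * (idle + processing cycles) / packets received. */
static double avg_cycles_per_packet(double total_cycles, double packets)
{
    return total_cycles / packets;
}

/* Relative per-packet cost of one workload versus another. */
static double per_packet_cost_ratio(double cycles_a, double pkts_a,
                                    double cycles_b, double pkts_b)
{
    return avg_cycles_per_packet(cycles_a, pkts_a)
         / avg_cycles_per_packet(cycles_b, pkts_b);
}
```

Plugging in the numbers from the output above, rxdrop comes to about
3985.89 cycles/packet, l2fwd to about 18124.95, a ratio of roughly 4.5x.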
 acinclude.m4                    |   1 +
 bpf/api.h                       |   6 +
 bpf/helpers.h                   |   2 +
 bpf/maps.h                      |  12 +
 bpf/xdp.h                       |  42 ++-
 lib/automake.mk                 |   5 +-
 lib/bpf.c                       |  41 +-
 lib/bpf.h                       |   6 +-
 lib/dp-packet.c                 |  20 +
 lib/dp-packet.h                 |  27 +-
 lib/dpif-netdev-perf.h          |  16 +-
 lib/dpif-netdev.c               |  59 ++-
 lib/if_xdp.h                    |  79 ++++
 lib/netdev-dummy.c              |   1 +
 lib/netdev-linux.c              | 808 +++++++++++++++++++++++++++++++++++++++-
 lib/netdev-provider.h           |   2 +
 lib/netdev-vport.c              |   1 +
 lib/netdev.c                    |  11 +
 lib/netdev.h                    |   1 +
 lib/xdpsock.c                   |  70 ++++
 lib/xdpsock.h                   |  82 ++++
 tests/automake.mk               |  17 +
 tests/ofproto-macros.at         |   1 +
 tests/system-afxdp-macros.at    | 148 ++++++++
 tests/system-afxdp-testsuite.at |  25 ++
 tests/system-afxdp-traffic.at   |  38 ++
 vswitchd/bridge.c               |   1 +
 27 files changed, 1492 insertions(+), 30 deletions(-)
 create mode 100644 lib/if_xdp.h
 create mode 100644 lib/xdpsock.c
 create mode 100644 lib/xdpsock.h
 create mode 100644 tests/system-afxdp-macros.at
 create mode 100644 tests/system-afxdp-testsuite.at
 create mode 100644 tests/system-afxdp-traffic.at