From patchwork Fri Jul 12 23:50:55 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: William Tu
X-Patchwork-Id: 1131546
From: William Tu
To: dev@openvswitch.org
Date: Fri, 12 Jul 2019 16:50:55 -0700
Message-Id: <1562975456-97888-1-git-send-email-u9012063@gmail.com>
X-Mailer: git-send-email 2.7.4
Cc: i.maximets@samsung.com
Subject: [ovs-dev]
[PATCHv16 1/2] ovs-thread: Add pthread spin lock support.

The patch adds the basic spin lock functions:
ovs_spin_{lock, trylock, unlock, init, destroy}.
OSX does not support pthread spin lock, so make it
linux only.

Signed-off-by: William Tu
---
 include/openvswitch/thread.h | 22 ++++++++++++++++++++++
 lib/ovs-thread.c             | 31 +++++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+)

diff --git a/include/openvswitch/thread.h b/include/openvswitch/thread.h
index 2987db37c9dc..14cc9ad73900 100644
--- a/include/openvswitch/thread.h
+++ b/include/openvswitch/thread.h
@@ -33,6 +33,13 @@ struct OVS_LOCKABLE ovs_mutex {
     const char *where;          /* NULL if and only if uninitialized. */
 };
 
+#ifdef __linux__
+struct OVS_LOCKABLE ovs_spin {
+    pthread_spinlock_t lock;
+    const char *where;          /* NULL if and only if uninitialized. */
+};
+#endif
+
 /* "struct ovs_mutex" initializer. */
 #ifdef PTHREAD_ERRORCHECK_MUTEX_INITIALIZER_NP
 #define OVS_MUTEX_INITIALIZER { PTHREAD_ERRORCHECK_MUTEX_INITIALIZER_NP, \
@@ -70,6 +77,21 @@ int ovs_mutex_trylock_at(const struct ovs_mutex *mutex, const char *where)
 
 void ovs_mutex_cond_wait(pthread_cond_t *, const struct ovs_mutex *mutex)
     OVS_REQUIRES(mutex);
+
+#ifdef __linux__
+void ovs_spin_init(const struct ovs_spin *);
+void ovs_spin_destroy(const struct ovs_spin *);
+void ovs_spin_unlock(const struct ovs_spin *spin) OVS_RELEASES(spin);
+void ovs_spin_lock_at(const struct ovs_spin *spin, const char *where)
+    OVS_ACQUIRES(spin);
+#define ovs_spin_lock(spin) \
+        ovs_spin_lock_at(spin, OVS_SOURCE_LOCATOR)
+
+int ovs_spin_trylock_at(const struct ovs_spin *spin, const char *where)
+    OVS_TRY_LOCK(0, spin);
+#define ovs_spin_trylock(spin) \
+        ovs_spin_trylock_at(spin, OVS_SOURCE_LOCATOR)
+#endif
 
 /* Convenient once-only execution.
  *
diff --git a/lib/ovs-thread.c b/lib/ovs-thread.c
index 159d87e5b0ca..29a0b9e57acd 100644
--- a/lib/ovs-thread.c
+++ b/lib/ovs-thread.c
@@ -75,6 +75,9 @@ static bool multithreaded;
 LOCK_FUNCTION(mutex, lock);
 LOCK_FUNCTION(rwlock, rdlock);
 LOCK_FUNCTION(rwlock, wrlock);
+#ifdef __linux__
+LOCK_FUNCTION(spin, lock);
+#endif
 
 #define TRY_LOCK_FUNCTION(TYPE, FUN) \
     int \
@@ -103,6 +106,9 @@ LOCK_FUNCTION(rwlock, wrlock);
 TRY_LOCK_FUNCTION(mutex, trylock);
 TRY_LOCK_FUNCTION(rwlock, tryrdlock);
 TRY_LOCK_FUNCTION(rwlock, trywrlock);
+#ifdef __linux__
+TRY_LOCK_FUNCTION(spin, trylock);
+#endif
 
 #define UNLOCK_FUNCTION(TYPE, FUN, WHERE) \
     void \
@@ -125,6 +131,10 @@ UNLOCK_FUNCTION(mutex, unlock, "");
 UNLOCK_FUNCTION(mutex, destroy, NULL);
 UNLOCK_FUNCTION(rwlock, unlock, "");
 UNLOCK_FUNCTION(rwlock, destroy, NULL);
+#ifdef __linux__
+UNLOCK_FUNCTION(spin, unlock, "");
+UNLOCK_FUNCTION(spin, destroy, NULL);
+#endif
 
 #define XPTHREAD_FUNC1(FUNCTION, PARAM1) \
     void \
@@ -268,6 +278,27 @@ ovs_mutex_cond_wait(pthread_cond_t *cond, const struct ovs_mutex *mutex_)
     }
 }
 
+#ifdef __linux__
+static void
+ovs_spin_init__(const struct ovs_spin *l_, int pshared)
+{
+    struct ovs_spin *l = CONST_CAST(struct ovs_spin *, l_);
+    int error;
+
+    l->where = "";
+    error = pthread_spin_init(&l->lock, pshared);
+    if (OVS_UNLIKELY(error)) {
+        ovs_abort(error, "pthread_spin_init failed");
+    }
+}
+
+void
+ovs_spin_init(const struct ovs_spin *spin)
+{
+    ovs_spin_init__(spin, PTHREAD_PROCESS_PRIVATE);
+}
+#endif
+
 /* Initializes the 'barrier'.  'size' is the number of threads
  * expected to hit the barrier. */
 void

From patchwork Fri Jul 12 23:50:56 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: William Tu
X-Patchwork-Id: 1131547
From: William Tu
To: dev@openvswitch.org
Date: Fri, 12 Jul 2019 16:50:56 -0700
Message-Id: <1562975456-97888-2-git-send-email-u9012063@gmail.com>
X-Mailer: git-send-email 2.7.4
In-Reply-To: <1562975456-97888-1-git-send-email-u9012063@gmail.com>
References: <1562975456-97888-1-git-send-email-u9012063@gmail.com>
Cc: i.maximets@samsung.com
Subject: [ovs-dev] [PATCHv16 2/2] netdev-afxdp: add new netdev type for AF_XDP.

The patch introduces experimental AF_XDP support for OVS netdev.
AF_XDP, the Address Family of the eXpress Data Path, is a new Linux socket
type built upon the eBPF and XDP technology.  It aims to have comparable
performance to DPDK but cooperate better with the existing kernel's
networking stack.  An AF_XDP socket receives and sends packets from an
eBPF/XDP program attached to the netdev, by-passing a couple of Linux
kernel's subsystems.  As a result, AF_XDP socket shows much better
performance than AF_PACKET.  For more details about AF_XDP, please see
linux kernel's Documentation/networking/af_xdp.rst.  Note that by default,
this feature is not compiled in.

Signed-off-by: William Tu
---
v15:
* address review feedback from Ilya
  https://patchwork.ozlabs.org/patch/1125476/
* skip TCP related test cases
* reclaim all CONS_NUM_DESC at complete tx
* add retries to kick_tx
* increase memory pool size
* remove redundant xdp flag and bind flag
* remove unused rx_dropped var
* make tx_dropped counter atomic
* refactor dp_packet_init_afxdp using dp_packet_init__
* rebase to ovs master, test with latest bpf-next kernel commit b14a260e33ddb4
  Ilya's kernel patches are required
  commit 455302d1c9ae ("xdp: fix hang while unregistering device bound to xdp socket")
  commit 162c820ed896 ("xdp: hold device for umem regardless of zero-copy mode")

Possible issues:
* still lots of afxdp_cq_skip (ovs-appctl coverage/show)
  afxdp_cq_skip 44325273.6/sec 34362312.683/sec 572705.2114/sec total: 2106010377
* TODO:
  'make check-afxdp' still does not pass all tests
  IP fragmentation expiry test not fixed yet; need to implement
  deferred memory free, something like dpdk_mp_sweep.  Currently hit some
  missing umem descs when reclaiming.
  NSH test case still fails (not due to afxdp)

v16:
* address feedback from Ilya
* add deferred memory free
* add afxdp testsuite files to gitignore
---
 Documentation/automake.mk             |    1 +
 Documentation/index.rst               |    1 +
 Documentation/intro/install/afxdp.rst |  430 ++++++++++++++
 Documentation/intro/install/index.rst |    1 +
 acinclude.m4                          |   35 ++
 configure.ac                          |    1 +
 lib/automake.mk                       |   13 +
 lib/dp-packet.c                       |   23 +
 lib/dp-packet.h                       |   18 +-
 lib/dpif-netdev-perf.h                |   26 +
 lib/netdev-afxdp.c                    | 1001 +++++++++++++++++++++++++++++++++
 lib/netdev-afxdp.h                    |   74 +++
 lib/netdev-linux-private.h            |  138 +++++
 lib/netdev-linux.c                    |  121 ++--
 lib/netdev-provider.h                 |    3 +
 lib/netdev.c                          |   11 +
 lib/util.c                            |   92 ++-
 lib/util.h                            |    5 +
 lib/xdpsock.c                         |  176 ++++++
 lib/xdpsock.h                         |  105 ++++
 tests/.gitignore                      |    3 +
 tests/automake.mk                     |   16 +
 tests/system-afxdp-macros.at          |   39 ++
 tests/system-afxdp-testsuite.at       |   26 +
 tests/system-traffic.at               |    2 +
 vswitchd/vswitch.xml                  |   15 +
 26 files changed, 2268 insertions(+), 108 deletions(-)
 create mode 100644 Documentation/intro/install/afxdp.rst
 create mode 100644 lib/netdev-afxdp.c
 create mode 100644 lib/netdev-afxdp.h
 create mode 100644 lib/netdev-linux-private.h
 create mode 100644 lib/xdpsock.c
 create mode 100644 lib/xdpsock.h
 create mode 100644 tests/system-afxdp-macros.at
 create mode 100644 tests/system-afxdp-testsuite.at

diff --git a/Documentation/automake.mk b/Documentation/automake.mk
index 8472921746ba..2a3214a3cc7f 100644
--- a/Documentation/automake.mk
+++ b/Documentation/automake.mk
@@ -10,6 +10,7 @@ DOC_SOURCE = \
 	Documentation/intro/why-ovs.rst \
 	Documentation/intro/install/index.rst \
 	Documentation/intro/install/bash-completion.rst \
+	Documentation/intro/install/afxdp.rst \
 	Documentation/intro/install/debian.rst \
 	Documentation/intro/install/documentation.rst \
 	Documentation/intro/install/distributions.rst \
diff --git a/Documentation/index.rst b/Documentation/index.rst
index 331353fd337a..bace34dbf91b 100644
--- a/Documentation/index.rst
+++ b/Documentation/index.rst
@@ -59,6 +59,7 @@ vSwitch? Start here.
   :doc:`intro/install/windows` |
   :doc:`intro/install/xenserver` |
   :doc:`intro/install/dpdk` |
+  :doc:`intro/install/afxdp` |
   :doc:`Installation FAQs `
 
 - **Tutorials:** :doc:`tutorials/faucet` |
diff --git a/Documentation/intro/install/afxdp.rst b/Documentation/intro/install/afxdp.rst
new file mode 100644
index 000000000000..3655de68307a
--- /dev/null
+++ b/Documentation/intro/install/afxdp.rst
@@ -0,0 +1,430 @@
+..
+      Licensed under the Apache License, Version 2.0 (the "License"); you may
+      not use this file except in compliance with the License. You may obtain
+      a copy of the License at
+
+          http://www.apache.org/licenses/LICENSE-2.0
+
+      Unless required by applicable law or agreed to in writing, software
+      distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+      WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+      License for the specific language governing permissions and limitations
+      under the License.
+
+      Convention for heading levels in Open vSwitch documentation:
+
+      =======  Heading 0 (reserved for the title in a document)
+      -------  Heading 1
+      ~~~~~~~  Heading 2
+      +++++++  Heading 3
+      '''''''  Heading 4
+
+      Avoid deeper levels because they do not render well.
+
+
+========================
+Open vSwitch with AF_XDP
+========================
+
+This document describes how to build and install Open vSwitch using
+AF_XDP netdev.
+
+.. warning::
+  The AF_XDP support of Open vSwitch is considered 'experimental',
+  and it is not compiled in by default.
+
+
+Introduction
+------------
+AF_XDP, Address Family of the eXpress Data Path, is a new Linux socket type
+built upon the eBPF and XDP technology.  It aims to have comparable
+performance to DPDK but cooperate better with existing kernel's networking
+stack.  An AF_XDP socket receives and sends packets from an eBPF/XDP program
+attached to the netdev, by-passing a couple of Linux kernel's subsystems.
+As a result, AF_XDP socket shows much better performance than AF_PACKET.
+For more details about AF_XDP, please see linux kernel's
+Documentation/networking/af_xdp.rst
+
+
+AF_XDP Netdev
+-------------
+OVS has a couple of netdev types, e.g., system, tap, or
+dpdk. The AF_XDP feature adds a new netdev type called
+"afxdp", and implements its configuration, packet reception,
+and transmit functions.  Since the AF_XDP socket, called xsk,
+operates in userspace, once ovs-vswitchd receives packets
+from xsk, the afxdp netdev re-uses the existing userspace
+dpif-netdev datapath.  As a result, most of the packet processing
+happens in userspace instead of the Linux kernel.
+
+::
+
+              |   +-------------------+
+              |   |    ovs-vswitchd   |<-->ovsdb-server
+              |   +-------------------+
+              |   |      ofproto      |<-->OpenFlow controllers
+              |   +--------+-+--------+
+              |   | netdev | |ofproto-|
+    userspace |   +--------+ |  dpif  |
+              |   | afxdp  | +--------+
+              |   | netdev | |  dpif  |
+              |   +---||---+ +--------+
+              |       ||     |  dpif- |
+              |       ||     | netdev |
+              |_      ||     +--------+
+                      ||
+               _  +---||-----+--------+
+              |   | AF_XDP prog +     |
+       kernel |   |   xsk_map         |
+              |_  +--------||---------+
+                           ||
+                        physical
+                           NIC
+
+
+Build requirements
+------------------
+
+In addition to the requirements described in :doc:`general`, building Open
+vSwitch with AF_XDP will require the following:
+
+- libbpf from kernel source tree (kernel 5.0.0 or later)
+
+- Linux kernel XDP support, with the following options (required)
+
+  * CONFIG_BPF=y
+
+  * CONFIG_BPF_SYSCALL=y
+
+  * CONFIG_XDP_SOCKETS=y
+
+
+- The following optional Kconfig options are also recommended, but not
+  required:
+
+  * CONFIG_BPF_JIT=y (Performance)
+
+  * CONFIG_HAVE_BPF_JIT=y (Performance)
+
+  * CONFIG_XDP_SOCKETS_DIAG=y (Debugging)
+
+- Once your AF_XDP-enabled kernel is ready, if possible, run
+  **./xdpsock -r -N -z -i ** under linux/samples/bpf.
+  This is an OVS-independent benchmark tool for AF_XDP.
+  It makes sure your basic kernel requirements are met for AF_XDP.
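The required Kconfig options listed above can be checked against a kernel build configuration before building OVS. The following is only a sketch: `check_kconfig` is a hypothetical helper (not part of OVS or the kernel tree), and the config location `/boot/config-$(uname -r)` is an assumption that holds on many distributions but not all.

```shell
#!/bin/sh
# Hypothetical helper, not part of OVS: for each Kconfig option name given,
# report whether it is set to y or m in the given kernel config file.
check_kconfig() {
    cfg="$1"
    shift
    for opt in "$@"; do
        if grep -Eq "^${opt}=[ym]\$" "$cfg"; then
            echo "$opt: enabled"
        else
            echo "$opt: missing"
        fi
    done
}

# Assumed location of the running kernel's config; adjust for your distro.
cfg="/boot/config-$(uname -r)"
if [ -r "$cfg" ]; then
    check_kconfig "$cfg" CONFIG_BPF CONFIG_BPF_SYSCALL CONFIG_XDP_SOCKETS
fi
```

An option reported as missing here means only that it was not found in that file; kernels that expose their configuration elsewhere need a different path.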
+
+
+Installing
+----------
+For OVS to use AF_XDP netdev, it has to be configured with LIBBPF support.
+First, clone a recent version of Linux bpf-next tree::
+
+  git clone git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git
+
+Second, go into the Linux source directory and build libbpf in the tools
+directory::
+
+  cd bpf-next/
+  cd tools/lib/bpf/
+  make && make install
+  make install_headers
+
+.. note::
+   Make sure xsk.h and bpf.h are installed in system's include path,
+   e.g. /usr/local/include/bpf/ or /usr/include/bpf/
+
+Make sure the libbpf.so is installed correctly::
+
+  ldconfig
+  ldconfig -p | grep libbpf
+
+Third, ensure the standard OVS requirements are installed and
+bootstrap/configure the package::
+
+  ./boot.sh && ./configure --enable-afxdp
+
+Finally, build and install OVS::
+
+  make && make install
+
+To kick start end-to-end autotesting::
+
+  uname -a # make sure having 5.0+ kernel
+  make check-afxdp TESTSUITEFLAGS='1'
+
+.. note::
+   Not all test cases pass at this time. Currently all TCP related
+   tests, e.g. wget or http, fail due to veth XDP limitations, and
+   cvlan is skipped for now.
+
+If a test case fails, check the log at::
+
+  cat tests/system-afxdp-testsuite.dir/001/system-afxdp-testsuite.log
+
+
+Setup AF_XDP netdev
+-------------------
+Before running OVS with AF_XDP, make sure the libbpf and libelf are
+set up right::
+
+  ldd vswitchd/ovs-vswitchd
+
+Open vSwitch should be started using userspace datapath as described
+in :doc:`general`::
+
+  ovs-vswitchd ...
+  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev
+
+Make sure your device driver supports AF_XDP. To use 1 PMD (on core 4)
+on 1 queue (queue 0) device, configure these options: **pmd-cpu-mask,
+pmd-rxq-affinity, and n_rxq**. The **xdpmode** can be "drv" or "skb"::
+
+  ethtool -L enp2s0 combined 1
+  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
+  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
+    options:n_rxq=1 options:xdpmode=drv \
+    other_config:pmd-rxq-affinity="0:4"
+
+Or, use 4 pmds/cores and 4 queues by doing::
+
+  ethtool -L enp2s0 combined 4
+  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x1e
+  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
+    options:n_rxq=4 options:xdpmode=drv \
+    other_config:pmd-rxq-affinity="0:1,1:2,2:3,3:4"
+
+.. note::
+   pmd-rxq-affinity is optional. If not specified, system will auto-assign.
+
+To validate that the bridge has been successfully instantiated, run::
+
+  ovs-vsctl show
+
+It should show something like::
+
+  Port "ens802f0"
+   Interface "ens802f0"
+      type: afxdp
+      options: {n_rxq="1", xdpmode=drv}
+
+Otherwise, enable debugging by::
+
+  ovs-appctl vlog/set netdev_afxdp::dbg
+
+
+References
+----------
+Most of the design details are described in the paper presented at
+Linux Plumbers 2018, "Bringing the Power of eBPF to Open vSwitch"[1],
+section 4, and slides[2][4].
+"The Path to DPDK Speeds for AF XDP"[3] gives a very good introduction
+to AF_XDP current and future work.
+
+[1] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-afxdp.pdf
+
+[2] http://vger.kernel.org/lpc_net2018_talks/ovs-ebpf-lpc18-presentation.pdf
+
+[3] http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf
+
+[4] https://ovsfall2018.sched.com/event/IO7p/fast-userspace-ovs-with-afxdp
+
+
+Performance Tuning
+------------------
+The name of the game is to keep your CPU running in userspace, allowing PMD
+to keep polling the AF_XDP queues without any interference from the kernel.
+
+#. Make sure everything is in the same NUMA node (memory used by AF_XDP, pmd
+   running cores, device plug-in slot)
+
+#. Isolate your PMD cores with the isolcpus kernel boot parameter (GRUB).
+
+#. IRQs should not be assigned to cores running PMD threads.
+
+#. The Spectre and Meltdown fixes increase the overhead of system calls.
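Dedicating isolated cores to PMD threads, as the tuning steps above suggest, depends on `other_config:pmd-cpu-mask` having bit N set for every PMD core N (core 4 alone gives 0x10). A minimal sketch of that arithmetic follows; `pmd_mask` is a hypothetical helper name, not an OVS command.

```shell
#!/bin/sh
# Hypothetical helper: build a pmd-cpu-mask value from a list of core IDs
# by setting bit N for each core N given on the command line.
pmd_mask() {
    mask=0
    for core in "$@"; do
        mask=$(( mask | (1 << core) ))
    done
    printf '0x%x\n' "$mask"
}

pmd_mask 4          # one PMD on core 4: prints 0x10
pmd_mask 1 2 3 4    # four PMDs on cores 1-4: prints 0x1e
```

The printed value can be passed to `ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=...`; every core named in pmd-rxq-affinity should also be present in the mask.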
+
+
+Debugging performance issue
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+While running the traffic, use linux perf tool to see where your cpu
+spends its cycles::
+
+  cd bpf-next/tools/perf
+  make
+  ./perf record -p `pidof ovs-vswitchd` sleep 10
+  ./perf report
+
+Measure your system call rate by doing::
+
+  pstree -p `pidof ovs-vswitchd`
+  strace -c -p
+
+Or, use OVS pmd tool::
+
+  ovs-appctl dpif-netdev/pmd-stats-show
+
+
+Example Script
+--------------
+
+Below is a script using namespaces and veth peer::
+
+  #!/bin/bash
+  ovs-vswitchd --no-chdir --pidfile -vvconn -vofproto_dpif -vunixctl \
+    --disable-system --detach
+  ovs-vsctl -- add-br br0 -- set Bridge br0 \
+    protocols=OpenFlow10,OpenFlow11,OpenFlow12,OpenFlow13,OpenFlow14 \
+    fail-mode=secure datapath_type=netdev
+  ovs-vsctl -- --may-exist add-br br0 -- set Bridge br0 datapath_type=netdev
+
+  ip netns add at_ns0
+  ovs-appctl vlog/set netdev_afxdp::dbg
+
+  ip link add p0 type veth peer name afxdp-p0
+  ip link set p0 netns at_ns0
+  ip link set dev afxdp-p0 up
+  ovs-vsctl add-port br0 afxdp-p0 -- \
+    set interface afxdp-p0 external-ids:iface-id="p0" type="afxdp"
+
+  ip netns exec at_ns0 sh << NS_EXEC_HEREDOC
+  ip addr add "10.1.1.1/24" dev p0
+  ip link set dev p0 up
+  NS_EXEC_HEREDOC
+
+  ip netns add at_ns1
+  ip link add p1 type veth peer name afxdp-p1
+  ip link set p1 netns at_ns1
+  ip link set dev afxdp-p1 up
+
+  ovs-vsctl add-port br0 afxdp-p1 -- \
+    set interface afxdp-p1 external-ids:iface-id="p1" type="afxdp"
+  ip netns exec at_ns1 sh << NS_EXEC_HEREDOC
+  ip addr add "10.1.1.2/24" dev p1
+  ip link set dev p1 up
+  NS_EXEC_HEREDOC
+
+  ip netns exec at_ns0 ping -i .2 10.1.1.2
+
+
+Limitations/Known Issues
+------------------------
+#. Device's NUMA ID is always 0; need a way to find the NUMA ID from a netdev.
+#. No QoS support because AF_XDP netdev bypasses the Linux TC layer. A possible
+   work-around is to use OpenFlow meter action.
+#. Adding an AF_XDP device to a bridge, removing it, and adding it again fails.
+#. Most of the tests were done using a single i40e port. Multiple ports and
+   the ixgbe driver still need to be tested.
+#. No latency test results yet (TODO item)
+
+
+PVP using tap device
+--------------------
+Assume you have enp2s0 as physical nic, and a tap device connected to VM.
+First, start OVS, then add physical port::
+
+  ethtool -L enp2s0 combined 1
+  ovs-vsctl set Open_vSwitch . other_config:pmd-cpu-mask=0x10
+  ovs-vsctl add-port br0 enp2s0 -- set interface enp2s0 type="afxdp" \
+    options:n_rxq=1 options:xdpmode=drv \
+    other_config:pmd-rxq-affinity="0:4"
+
+Start a VM with virtio and tap device::
+
+  qemu-system-x86_64 -hda ubuntu1810.qcow \
+    -m 4096 \
+    -cpu host,+x2apic -enable-kvm \
+    -device virtio-net-pci,mac=00:02:00:00:00:01,netdev=net0,mq=on,\
+      vectors=10,mrg_rxbuf=on,rx_queue_size=1024 \
+    -netdev type=tap,id=net0,vhost=on,queues=8 \
+    -object memory-backend-file,id=mem,size=4096M,\
+      mem-path=/dev/hugepages,share=on \
+    -numa node,memdev=mem -mem-prealloc -smp 2
+
+Create OpenFlow rules::
+
+  ovs-vsctl add-port br0 tap0
+  ovs-ofctl del-flows br0
+  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:tap0"
+  ovs-ofctl add-flow br0 "in_port=tap0, actions=output:enp2s0"
+
+Inside the VM, use xdp_rxq_info to bounce back the traffic::
+
+  ./xdp_rxq_info --dev ens3 --action XDP_TX
+
+
+PVP using vhostuser device
+--------------------------
+First, build OVS with DPDK and AFXDP::
+
+  ./configure --enable-afxdp --with-dpdk=
+  make -j4 && make install
+
+Create a vhost-user port from OVS::
+
+  ovs-vsctl --no-wait set Open_vSwitch . other_config:dpdk-init=true
+  ovs-vsctl -- add-br br0 -- set Bridge br0 datapath_type=netdev \
+    other_config:pmd-cpu-mask=0xfff
+  ovs-vsctl add-port br0 vhost-user-1 \
+    -- set Interface vhost-user-1 type=dpdkvhostuser
+
+Start VM using vhost-user mode::
+
+  qemu-system-x86_64 -hda ubuntu1810.qcow \
+   -m 4096 \
+   -cpu host,+x2apic -enable-kvm \
+   -chardev socket,id=char1,path=/usr/local/var/run/openvswitch/vhost-user-1 \
+   -netdev type=vhost-user,id=mynet1,chardev=char1,vhostforce,queues=4 \
+   -device virtio-net-pci,mac=00:00:00:00:00:01,\
+      netdev=mynet1,mq=on,vectors=10 \
+   -object memory-backend-file,id=mem,size=4096M,\
+      mem-path=/dev/hugepages,share=on \
+   -numa node,memdev=mem -mem-prealloc -smp 2
+
+Set up the OpenFlow rules::
+
+  ovs-ofctl del-flows br0
+  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:vhost-user-1"
+  ovs-ofctl add-flow br0 "in_port=vhost-user-1, actions=output:enp2s0"
+
+Inside the VM, use xdp_rxq_info to drop or bounce back the traffic::
+
+  ./xdp_rxq_info --dev ens3 --action XDP_DROP
+  ./xdp_rxq_info --dev ens3 --action XDP_TX
+
+
+PCP container using veth
+------------------------
+Create namespace and veth peer devices::
+
+  ip netns add at_ns0
+  ip link add p0 type veth peer name afxdp-p0
+  ip link set p0 netns at_ns0
+  ip link set dev afxdp-p0 up
+  ip netns exec at_ns0 ip link set dev p0 up
+
+Attach the veth port to br0 (linux kernel mode)::
+
+  ovs-vsctl add-port br0 afxdp-p0 -- \
+    set interface afxdp-p0 options:n_rxq=1
+
+Or, use AF_XDP with skb mode::
+
+  ovs-vsctl add-port br0 afxdp-p0 -- \
+    set interface afxdp-p0 type="afxdp" options:n_rxq=1 options:xdpmode=skb
+
+Set up the OpenFlow rules::
+
+  ovs-ofctl del-flows br0
+  ovs-ofctl add-flow br0 "in_port=enp2s0, actions=output:afxdp-p0"
+  ovs-ofctl add-flow br0 "in_port=afxdp-p0, actions=output:enp2s0"
+
+In the namespace, drop or bounce back the packets::
+
+  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_DROP
+  ip netns exec at_ns0 ./xdp_rxq_info --dev p0 --action XDP_TX
+
+
+Bug Reporting
+-------------
+
+Please report problems to dev@openvswitch.org.
diff --git a/Documentation/intro/install/index.rst b/Documentation/intro/install/index.rst
index 3193c736cf17..c27a9c9d16ff 100644
--- a/Documentation/intro/install/index.rst
+++ b/Documentation/intro/install/index.rst
@@ -45,6 +45,7 @@ Installation from Source
    xenserver
    userspace
    dpdk
+   afxdp
 
 Installation from Packages
 --------------------------
diff --git a/acinclude.m4 b/acinclude.m4
index b8c9d6c06fba..f744ca85410d 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -238,6 +238,41 @@ AC_DEFUN([OVS_FIND_DEPENDENCY], [
   ])
 ])
 
+dnl OVS_CHECK_LINUX_AF_XDP
+dnl
+dnl Check both Linux kernel AF_XDP and libbpf support
+AC_DEFUN([OVS_CHECK_LINUX_AF_XDP], [
+  AC_ARG_ENABLE([afxdp],
+                [AC_HELP_STRING([--enable-afxdp], [Enable AF-XDP support])],
+                [], [enable_afxdp=no])
+  AC_MSG_CHECKING([whether AF_XDP is enabled])
+  if test "$enable_afxdp" != yes; then
+    AC_MSG_RESULT([no])
+    AF_XDP_ENABLE=false
+  else
+    AC_MSG_RESULT([yes])
+    AF_XDP_ENABLE=true
+
+    AC_CHECK_HEADER([bpf/libbpf.h], [],
+      [AC_MSG_ERROR([unable to find bpf/libbpf.h for AF_XDP support])])
+
+    AC_CHECK_HEADER([linux/if_xdp.h], [],
+      [AC_MSG_ERROR([unable to find linux/if_xdp.h for AF_XDP support])])
+
+    AC_CHECK_HEADER([bpf/xsk.h], [],
+      [AC_MSG_ERROR([unable to find bpf/xsk.h for AF_XDP support])])
+
+    AC_CHECK_HEADER([bpf/libbpf_util.h], [],
+      [AC_MSG_ERROR([unable to find bpf/libbpf_util.h for AF_XDP support])])
+
+    AC_DEFINE([HAVE_AF_XDP], [1],
+              [Define to 1 if AF_XDP support is available and enabled.])
+    LIBBPF_LDADD=" -lbpf -lelf"
+    AC_SUBST([LIBBPF_LDADD])
+  fi
+  AM_CONDITIONAL([HAVE_AF_XDP], test "$AF_XDP_ENABLE" = true)
+])
+
 dnl OVS_CHECK_DPDK
 dnl
 dnl Configure DPDK source tree
diff --git a/configure.ac b/configure.ac
index a9f0a06dc140..36ad246203db 100644
--- a/configure.ac
+++ b/configure.ac
@@ -98,6 +98,7 @@ OVS_CHECK_SPHINX
 OVS_CHECK_DOT
 OVS_CHECK_IF_DL
 OVS_CHECK_STRTOK_R
+OVS_CHECK_LINUX_AF_XDP
 AC_CHECK_DECLS([sys_siglist], [], [], [[#include ]])
 AC_CHECK_MEMBERS([struct stat.st_mtim.tv_nsec, struct stat.st_mtimensec],
   [], [], [[#include ]])
diff --git a/lib/automake.mk b/lib/automake.mk
index 1b89cac8c3a2..b07bb01c4ef7 100644
--- a/lib/automake.mk
+++ b/lib/automake.mk
@@ -14,6 +14,10 @@ if WIN32
 lib_libopenvswitch_la_LIBADD += ${PTHREAD_LIBS}
 endif
 
+if HAVE_AF_XDP
+lib_libopenvswitch_la_LIBADD += $(LIBBPF_LDADD)
+endif
+
 lib_libopenvswitch_la_LDFLAGS = \
         $(OVS_LTINFO) \
         -Wl,--version-script=$(top_builddir)/lib/libopenvswitch.sym \
@@ -394,6 +398,7 @@ lib_libopenvswitch_la_SOURCES += \
 	lib/if-notifier.h \
 	lib/netdev-linux.c \
 	lib/netdev-linux.h \
+	lib/netdev-linux-private.h \
 	lib/netdev-offload-tc.c \
 	lib/netlink-conntrack.c \
 	lib/netlink-conntrack.h \
@@ -410,6 +415,14 @@ lib_libopenvswitch_la_SOURCES += \
 	lib/tc.h
 endif
 
+if HAVE_AF_XDP
+lib_libopenvswitch_la_SOURCES += \
+	lib/xdpsock.c \
+	lib/xdpsock.h \
+	lib/netdev-afxdp.c \
+	lib/netdev-afxdp.h
+endif
+
 if DPDK_NETDEV
 lib_libopenvswitch_la_SOURCES += \
 	lib/dpdk.c \
diff --git a/lib/dp-packet.c b/lib/dp-packet.c
index 0976a35e758b..62d7faa4c59a 100644
--- a/lib/dp-packet.c
+++ b/lib/dp-packet.c
@@ -19,6 +19,7 @@
 #include
 
 #include "dp-packet.h"
+#include "netdev-afxdp.h"
 #include "netdev-dpdk.h"
 #include "openvswitch/dynamic-string.h"
 #include "util.h"
@@ -59,6 +60,22 @@ dp_packet_use(struct dp_packet *b, void *base, size_t allocated)
     dp_packet_use__(b, base, allocated, DPBUF_MALLOC);
 }
 
+#if HAVE_AF_XDP
+/* Initialize 'b' as an empty dp_packet that contains
+ * memory starting at AF_XDP umem base.
+ */
+void
+dp_packet_use_afxdp(struct dp_packet *b, void *data, size_t allocated,
+                    size_t headroom)
+{
+    dp_packet_set_base(b, (char *)data - headroom);
+    dp_packet_set_data(b, data);
+    dp_packet_set_size(b, 0);
+
+    dp_packet_init__(b, allocated, DPBUF_AFXDP);
+}
+#endif
+
 /* Initializes 'b' as an empty dp_packet that contains the 'allocated' bytes of
  * memory starting at 'base'.  'base' should point to a buffer on the stack.
  * (Nothing actually relies on 'base' being allocated on the stack.  It could
@@ -122,6 +139,8 @@ dp_packet_uninit(struct dp_packet *b)
              * created as a dp_packet */
             free_dpdk_buf((struct dp_packet*) b);
 #endif
+        } else if (b->source == DPBUF_AFXDP) {
+            free_afxdp_buf(b);
         }
     }
 }
@@ -248,6 +267,9 @@ dp_packet_resize__(struct dp_packet *b, size_t new_headroom, size_t new_tailroom
     case DPBUF_STACK:
         OVS_NOT_REACHED();
 
+    case DPBUF_AFXDP:
+        OVS_NOT_REACHED();
+
     case DPBUF_STUB:
         b->source = DPBUF_MALLOC;
         new_base = xmalloc(new_allocated);
@@ -433,6 +455,7 @@ dp_packet_steal_data(struct dp_packet *b)
 {
     void *p;
     ovs_assert(b->source != DPBUF_DPDK);
+    ovs_assert(b->source != DPBUF_AFXDP);
 
     if (b->source == DPBUF_MALLOC && dp_packet_data(b) == dp_packet_base(b)) {
         p = dp_packet_data(b);
diff --git a/lib/dp-packet.h b/lib/dp-packet.h
index a5e9ade1244a..47ea14b94f74 100644
--- a/lib/dp-packet.h
+++ b/lib/dp-packet.h
@@ -25,6 +25,7 @@
 #include
 #endif
 
+#include "netdev-afxdp.h"
 #include "netdev-dpdk.h"
 #include "openvswitch/list.h"
 #include "packets.h"
@@ -42,6 +43,7 @@ enum OVS_PACKED_ENUM dp_packet_source {
     DPBUF_DPDK,                /* buffer data is from DPDK allocated memory.
                                 * ref to dp_packet_init_dpdk() in dp-packet.c.
*/ + DPBUF_AFXDP, /* buffer data from XDP frame */ }; #define DP_PACKET_CONTEXT_SIZE 64 @@ -89,6 +91,13 @@ struct dp_packet { }; }; +#if HAVE_AF_XDP +struct dp_packet_afxdp { + struct umem_pool *mpool; + struct dp_packet packet; +}; +#endif + static inline void *dp_packet_data(const struct dp_packet *); static inline void dp_packet_set_data(struct dp_packet *, void *); static inline void *dp_packet_base(const struct dp_packet *); @@ -122,7 +131,9 @@ static inline const void *dp_packet_get_nd_payload(const struct dp_packet *); void dp_packet_use(struct dp_packet *, void *, size_t); void dp_packet_use_stub(struct dp_packet *, void *, size_t); void dp_packet_use_const(struct dp_packet *, const void *, size_t); - +#if HAVE_AF_XDP +void dp_packet_use_afxdp(struct dp_packet *, void *, size_t, size_t); +#endif void dp_packet_init_dpdk(struct dp_packet *); void dp_packet_init(struct dp_packet *, size_t); @@ -184,6 +195,11 @@ dp_packet_delete(struct dp_packet *b) return; } + if (b->source == DPBUF_AFXDP) { + free_afxdp_buf(b); + return; + } + dp_packet_uninit(b); free(b); } diff --git a/lib/dpif-netdev-perf.h b/lib/dpif-netdev-perf.h index 859c05613ddf..6b6dfda7db1c 100644 --- a/lib/dpif-netdev-perf.h +++ b/lib/dpif-netdev-perf.h @@ -21,6 +21,7 @@ #include #include #include +#include #include #ifdef DPDK_NETDEV @@ -186,6 +187,24 @@ struct pmd_perf_stats { char *log_reason; }; +#ifdef __linux__ +static inline uint64_t +rdtsc_syscall(struct pmd_perf_stats *s) +{ + struct timespec val; + uint64_t v; + + if (clock_gettime(CLOCK_MONOTONIC_RAW, &val) != 0) { + return s->last_tsc; + } + + v = (uint64_t) val.tv_sec * 1000000000LL; + v += (uint64_t) val.tv_nsec; + + return s->last_tsc = v; +} +#endif + /* Support for accurate timing of PMD execution on TSC clock cycle level. * These functions are intended to be invoked in the context of pmd threads. 
*/ @@ -198,6 +217,13 @@ cycles_counter_update(struct pmd_perf_stats *s) { #ifdef DPDK_NETDEV return s->last_tsc = rte_get_tsc_cycles(); +#elif !defined(_MSC_VER) && defined(__x86_64__) + uint32_t h, l; + asm volatile("rdtsc" : "=a" (l), "=d" (h)); + + return s->last_tsc = ((uint64_t) h << 32) | l; +#elif defined(__linux__) + return rdtsc_syscall(s); #else return s->last_tsc = 0; #endif diff --git a/lib/netdev-afxdp.c b/lib/netdev-afxdp.c new file mode 100644 index 000000000000..7693a876bf18 --- /dev/null +++ b/lib/netdev-afxdp.c @@ -0,0 +1,1001 @@ +/* + * Copyright (c) 2018, 2019 Nicira, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at: + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#include + +#include "netdev-linux-private.h" +#include "netdev-linux.h" +#include "netdev-afxdp.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "coverage.h" +#include "dp-packet.h" +#include "dpif-netdev.h" +#include "fatal-signal.h" +#include "openvswitch/dynamic-string.h" +#include "openvswitch/list.h" +#include "openvswitch/vlog.h" +#include "packets.h" +#include "socket-util.h" +#include "util.h" +#include "xdpsock.h" + +#ifndef SOL_XDP +#define SOL_XDP 283 +#endif + +COVERAGE_DEFINE(afxdp_cq_empty); +COVERAGE_DEFINE(afxdp_fq_full); +COVERAGE_DEFINE(afxdp_tx_full); +COVERAGE_DEFINE(afxdp_cq_skip); + +VLOG_DEFINE_THIS_MODULE(netdev_afxdp); +static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20); + +#define UMEM2DESC(elem, base) ((uint64_t)((char *)elem - (char *)base)) +#define UMEM2XPKT(base, i) \ + ALIGNED_CAST(struct dp_packet_afxdp *, (char *)base + \ + i * sizeof(struct dp_packet_afxdp)) + +static struct xsk_socket_info *xsk_configure(int ifindex, int xdp_queue_id, + int mode); +static void xsk_remove_xdp_program(uint32_t ifindex, int xdpmode); +static void xsk_destroy(struct xsk_socket_info *xsk); +static int xsk_configure_all(struct netdev *netdev); +static void xsk_destroy_all(struct netdev *netdev); + +struct unused_pool { + struct xsk_umem_info *umem_info; + int lost_in_rings; /* Number of packets left in tx, rx, cq and fq. 
*/ + struct ovs_list list_node; +}; + +static struct ovs_mutex unused_pools_mutex = OVS_MUTEX_INITIALIZER; +static struct ovs_list unused_pools OVS_GUARDED_BY(unused_pools_mutex) = + OVS_LIST_INITIALIZER(&unused_pools); + +static void +netdev_afxdp_cleanup_unused_pool(struct unused_pool *pool) +{ + /* free the packet buffer */ + free_pagealign(pool->umem_info->buffer); + + /* cleanup umem pool */ + umem_pool_cleanup(&pool->umem_info->mpool); + + /* cleanup metadata pool */ + xpacket_pool_cleanup(&pool->umem_info->xpool); + + free(pool->umem_info); +} + +static void +netdev_afxdp_sweep_unused_pools(void *aux OVS_UNUSED) +{ + struct unused_pool *pool, *next; + unsigned int count; + + ovs_mutex_lock(&unused_pools_mutex); + LIST_FOR_EACH_SAFE (pool, next, list_node, &unused_pools) { + + count = umem_pool_count(&pool->umem_info->mpool); + ovs_assert(count + pool->lost_in_rings <= NUM_FRAMES); + + if (count + pool->lost_in_rings == NUM_FRAMES) { + /* OVS doesn't use this memory pool anymore. Kernel doesn't + * use it since closing the xdp socket. So, it's safe to free + * the pool now. */ + VLOG_DBG("Freeing umem pool at 0x%"PRIxPTR, + (uintptr_t) pool->umem_info); + ovs_list_remove(&pool->list_node); + netdev_afxdp_cleanup_unused_pool(pool); + free(pool); + } + } + ovs_mutex_unlock(&unused_pools_mutex); +} + +static struct xsk_umem_info * +xsk_configure_umem(void *buffer, uint64_t size, int xdpmode) +{ + struct xsk_umem_config uconfig; + struct xsk_umem_info *umem; + int ret; + int i; + + umem = xcalloc(1, sizeof *umem); + + uconfig.fill_size = PROD_NUM_DESCS; + uconfig.comp_size = CONS_NUM_DESCS; + uconfig.frame_size = FRAME_SIZE; + uconfig.frame_headroom = OVS_XDP_HEADROOM; + + ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq, + &uconfig); + if (ret) { + VLOG_ERR("xsk_umem__create failed (%s) mode: %s", + ovs_strerror(errno), + xdpmode == XDP_COPY ? 
"SKB": "DRV"); + free(umem); + return NULL; + } + + umem->buffer = buffer; + + /* set-up umem pool */ + if (umem_pool_init(&umem->mpool, NUM_FRAMES) < 0) { + VLOG_ERR("umem_pool_init failed"); + if (xsk_umem__delete(umem->umem)) { + VLOG_ERR("xsk_umem__delete failed"); + } + free(umem); + return NULL; + } + + for (i = NUM_FRAMES - 1; i >= 0; i--) { + struct umem_elem *elem; + + elem = ALIGNED_CAST(struct umem_elem *, + (char *)umem->buffer + i * FRAME_SIZE); + umem_elem_push(&umem->mpool, elem); + } + + /* set-up metadata */ + if (xpacket_pool_init(&umem->xpool, NUM_FRAMES) < 0) { + VLOG_ERR("xpacket_pool_init failed"); + umem_pool_cleanup(&umem->mpool); + if (xsk_umem__delete(umem->umem)) { + VLOG_ERR("xsk_umem__delete failed"); + } + free(umem); + return NULL; + } + + VLOG_DBG("%s xpacket pool from %p to %p", __func__, + umem->xpool.array, + (char *)umem->xpool.array + + NUM_FRAMES * sizeof(struct dp_packet_afxdp)); + + for (i = NUM_FRAMES - 1; i >= 0; i--) { + struct dp_packet_afxdp *xpacket; + struct dp_packet *packet; + + xpacket = UMEM2XPKT(umem->xpool.array, i); + xpacket->mpool = &umem->mpool; + + packet = &xpacket->packet; + packet->source = DPBUF_AFXDP; + } + + return umem; +} + +static struct xsk_socket_info * +xsk_configure_socket(struct xsk_umem_info *umem, uint32_t ifindex, + uint32_t queue_id, int xdpmode) +{ + struct xsk_socket_config cfg; + struct xsk_socket_info *xsk; + char devname[IF_NAMESIZE]; + uint32_t idx = 0, prog_id; + int ret; + int i; + + xsk = xcalloc(1, sizeof(*xsk)); + xsk->umem = umem; + cfg.rx_size = CONS_NUM_DESCS; + cfg.tx_size = PROD_NUM_DESCS; + cfg.libbpf_flags = 0; + + if (xdpmode == XDP_ZEROCOPY) { + cfg.bind_flags = XDP_ZEROCOPY; + cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE; + } else { + cfg.bind_flags = XDP_COPY; + cfg.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_SKB_MODE; + } + + if (if_indextoname(ifindex, devname) == NULL) { + VLOG_ERR("ifindex %d to devname failed (%s)", + ifindex, 
ovs_strerror(errno)); + free(xsk); + return NULL; + } + + ret = xsk_socket__create(&xsk->xsk, devname, queue_id, umem->umem, + &xsk->rx, &xsk->tx, &cfg); + if (ret) { + VLOG_ERR("xsk_socket__create failed (%s) mode: %s qid: %d", + ovs_strerror(errno), + xdpmode == XDP_COPY ? "SKB": "DRV", + queue_id); + free(xsk); + return NULL; + } + + /* Make sure the built-in AF_XDP program is loaded */ + ret = bpf_get_link_xdp_id(ifindex, &prog_id, cfg.xdp_flags); + if (ret) { + VLOG_ERR("Get XDP prog ID failed (%s)", ovs_strerror(errno)); + xsk_socket__delete(xsk->xsk); + free(xsk); + return NULL; + } + + while (!xsk_ring_prod__reserve(&xsk->umem->fq, + PROD_NUM_DESCS, &idx)) { + VLOG_WARN_RL(&rl, "Retry xsk_ring_prod__reserve to FILL queue"); + } + + for (i = 0; + i < PROD_NUM_DESCS * FRAME_SIZE; + i += FRAME_SIZE) { + struct umem_elem *elem; + uint64_t addr; + + elem = umem_elem_pop(&xsk->umem->mpool); + addr = UMEM2DESC(elem, xsk->umem->buffer); + + *xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = addr; + } + + xsk_ring_prod__submit(&xsk->umem->fq, + PROD_NUM_DESCS); + return xsk; +} + +static struct xsk_socket_info * +xsk_configure(int ifindex, int xdp_queue_id, int xdpmode) +{ + struct xsk_socket_info *xsk; + struct xsk_umem_info *umem; + void *bufs; + + netdev_afxdp_sweep_unused_pools(NULL); + + /* umem memory region */ + bufs = xmalloc_pagealign(NUM_FRAMES * FRAME_SIZE); + memset(bufs, 0, NUM_FRAMES * FRAME_SIZE); + + /* create AF_XDP socket */ + umem = xsk_configure_umem(bufs, + NUM_FRAMES * FRAME_SIZE, + xdpmode); + if (!umem) { + free_pagealign(bufs); + return NULL; + } + + VLOG_DBG("Allocated umem pool at 0x%"PRIxPTR, (uintptr_t) umem); + + xsk = xsk_configure_socket(umem, ifindex, xdp_queue_id, xdpmode); + if (!xsk) { + /* clean up umem and xpacket pool */ + if (xsk_umem__delete(umem->umem)) { + VLOG_ERR("xsk_umem__delete failed"); + } + free_pagealign(bufs); + umem_pool_cleanup(&umem->mpool); + xpacket_pool_cleanup(&umem->xpool); + free(umem); + } + return xsk; 
+} + +static int +xsk_configure_all(struct netdev *netdev) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev); + struct xsk_socket_info *xsk_info; + int i, ifindex, n_rxq; + + ifindex = linux_get_ifindex(netdev_get_name(netdev)); + + n_rxq = netdev_n_rxq(netdev); + dev->xsks = xzalloc(n_rxq * sizeof(struct xsk_socket_info *)); + + /* configure each queue */ + for (i = 0; i < n_rxq; i++) { + VLOG_INFO("%s configure queue %d mode %s", __func__, i, + dev->xdpmode == XDP_COPY ? "SKB" : "DRV"); + xsk_info = xsk_configure(ifindex, i, dev->xdpmode); + if (!xsk_info) { + VLOG_ERR("failed to create AF_XDP socket on queue %d", i); + dev->xsks[i] = NULL; + goto err; + } + dev->xsks[i] = xsk_info; + xsk_info->tx_dropped = 0; + xsk_info->outstanding_tx = 0; + xsk_info->available_rx = PROD_NUM_DESCS; + } + + return 0; + +err: + xsk_destroy_all(netdev); + return EINVAL; +} + +static void +xsk_destroy(struct xsk_socket_info *xsk_info) +{ + struct xsk_umem *umem; + struct unused_pool *pool; + + xsk_socket__delete(xsk_info->xsk); + xsk_info->xsk = NULL; + + umem = xsk_info->umem->umem; + if (xsk_umem__delete(umem)) { + VLOG_ERR("xsk_umem__delete failed"); + } + + pool = xzalloc(sizeof *pool); + pool->umem_info = xsk_info->umem; + pool->lost_in_rings = xsk_info->outstanding_tx + xsk_info->available_rx; + + ovs_mutex_lock(&unused_pools_mutex); + ovs_list_push_back(&unused_pools, &pool->list_node); + ovs_mutex_unlock(&unused_pools_mutex); + + free(xsk_info); + + netdev_afxdp_sweep_unused_pools(NULL); +} + +static void +xsk_destroy_all(struct netdev *netdev) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev); + int i, ifindex; + + ifindex = linux_get_ifindex(netdev_get_name(netdev)); + + for (i = 0; i < netdev_n_rxq(netdev); i++) { + if (dev->xsks && dev->xsks[i]) { + VLOG_INFO("destroy xsk[%d]", i); + xsk_destroy(dev->xsks[i]); + dev->xsks[i] = NULL; + } + } + + VLOG_INFO("remove xdp program"); + xsk_remove_xdp_program(ifindex, dev->xdpmode); + + free(dev->xsks); +} + 
+static inline void OVS_UNUSED +log_xsk_stat(struct xsk_socket_info *xsk OVS_UNUSED) { + struct xdp_statistics stat; + socklen_t optlen; + + optlen = sizeof stat; + ovs_assert(getsockopt(xsk_socket__fd(xsk->xsk), SOL_XDP, XDP_STATISTICS, + &stat, &optlen) == 0); + + VLOG_DBG_RL(&rl, "rx dropped %llu, rx_invalid %llu, tx_invalid %llu", + stat.rx_dropped, + stat.rx_invalid_descs, + stat.tx_invalid_descs); +} + +int +netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args, + char **errp OVS_UNUSED) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev); + const char *str_xdpmode; + int xdpmode, new_n_rxq; + + ovs_mutex_lock(&dev->mutex); + new_n_rxq = MAX(smap_get_int(args, "n_rxq", NR_QUEUE), 1); + if (new_n_rxq > MAX_XSKQ) { + ovs_mutex_unlock(&dev->mutex); + VLOG_ERR("%s: Too big 'n_rxq' (%d > %d).", + netdev_get_name(netdev), new_n_rxq, MAX_XSKQ); + return EINVAL; + } + + str_xdpmode = smap_get_def(args, "xdpmode", "skb"); + if (!strcasecmp(str_xdpmode, "drv")) { + xdpmode = XDP_ZEROCOPY; + } else if (!strcasecmp(str_xdpmode, "skb")) { + xdpmode = XDP_COPY; + } else { + VLOG_ERR("%s: Incorrect xdpmode (%s).", + netdev_get_name(netdev), str_xdpmode); + ovs_mutex_unlock(&dev->mutex); + return EINVAL; + } + + if (dev->requested_n_rxq != new_n_rxq + || dev->requested_xdpmode != xdpmode) { + dev->requested_n_rxq = new_n_rxq; + dev->requested_xdpmode = xdpmode; + netdev_request_reconfigure(netdev); + } + ovs_mutex_unlock(&dev->mutex); + return 0; +} + +int +netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev); + + ovs_mutex_lock(&dev->mutex); + smap_add_format(args, "n_rxq", "%d", netdev->n_rxq); + smap_add_format(args, "xdpmode", "%s", + dev->xdpmode == XDP_ZEROCOPY ? 
"drv" : "skb"); + ovs_mutex_unlock(&dev->mutex); + return 0; +} + +static void +netdev_afxdp_alloc_txq(struct netdev *netdev) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev); + int n_txqs = netdev_n_rxq(netdev); + int i; + + dev->tx_locks = xmalloc(n_txqs * sizeof(struct ovs_spin)); + + for (i = 0; i < n_txqs; i++) { + ovs_spin_init(&dev->tx_locks[i]); + } +} + +int +netdev_afxdp_reconfigure(struct netdev *netdev) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev); + struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY}; + int err = 0; + + ovs_mutex_lock(&dev->mutex); + + if (netdev->n_rxq == dev->requested_n_rxq + && dev->xdpmode == dev->requested_xdpmode) { + goto out; + } + + xsk_destroy_all(netdev); + free(dev->tx_locks); + + netdev->n_rxq = dev->requested_n_rxq; + netdev_afxdp_alloc_txq(netdev); + + if (dev->requested_xdpmode == XDP_ZEROCOPY) { + dev->xdpmode = XDP_ZEROCOPY; + VLOG_INFO("AF_XDP device %s in DRV mode", netdev_get_name(netdev)); + if (setrlimit(RLIMIT_MEMLOCK, &r)) { + VLOG_ERR("ERROR: setrlimit(RLIMIT_MEMLOCK): %s", + ovs_strerror(errno)); + } + } else { + dev->xdpmode = XDP_COPY; + VLOG_INFO("AF_XDP device %s in SKB mode", netdev_get_name(netdev)); + /* TODO: set rlimit back to previous value + * when no device is in DRV mode. + */ + } + + err = xsk_configure_all(netdev); + if (err) { + VLOG_ERR("AF_XDP device %s reconfig fails", netdev_get_name(netdev)); + } + netdev_change_seq_changed(netdev); +out: + ovs_mutex_unlock(&dev->mutex); + return err; +} + +int +netdev_afxdp_get_numa_id(const struct netdev *netdev) +{ + /* FIXME: Get netdev's PCIe device ID, then find + * its NUMA node id. 
+ */ + VLOG_INFO("FIXME: Device %s always use numa id 0", + netdev_get_name(netdev)); + return 0; +} + +static void +xsk_remove_xdp_program(uint32_t ifindex, int xdpmode) +{ + uint32_t prog_id = 0; + uint32_t flags; + + flags = XDP_FLAGS_UPDATE_IF_NOEXIST; + + /* remove_xdp_program() */ + if (xdpmode == XDP_COPY) { + flags |= XDP_FLAGS_SKB_MODE; + VLOG_INFO("%s skb mode", __func__); + } else if (xdpmode == XDP_ZEROCOPY) { + flags |= XDP_FLAGS_DRV_MODE; + VLOG_INFO("%s drv mode", __func__); + } + + if (bpf_get_link_xdp_id(ifindex, &prog_id, flags)) { + VLOG_WARN("get xdp program id fails"); + } + bpf_set_link_xdp_fd(ifindex, -1, flags); +} + +void +signal_remove_xdp(struct netdev *netdev) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev); + int ifindex; + + ifindex = linux_get_ifindex(netdev_get_name(netdev)); + + VLOG_WARN("force remove xdp program"); + xsk_remove_xdp_program(ifindex, dev->xdpmode); +} + +static struct dp_packet_afxdp * +dp_packet_cast_afxdp(const struct dp_packet *d) +{ + ovs_assert(d->source == DPBUF_AFXDP); + return CONTAINER_OF(d, struct dp_packet_afxdp, packet); +} + +static inline void +prepare_fill_queue(struct xsk_socket_info *xsk_info) +{ + struct umem_elem *elems[BATCH_SIZE]; + struct xsk_umem_info *umem; + unsigned int idx_fq; + int i, ret; + + umem = xsk_info->umem; + + if (xsk_prod_nb_free(&umem->fq, BATCH_SIZE) < BATCH_SIZE) { + return; + } + + ret = umem_elem_pop_n(&umem->mpool, BATCH_SIZE, (void **)elems); + if (OVS_UNLIKELY(ret)) { + return; + } + + if (!xsk_ring_prod__reserve(&umem->fq, BATCH_SIZE, &idx_fq)) { + umem_elem_push_n(&umem->mpool, BATCH_SIZE, (void **)elems); + COVERAGE_INC(afxdp_fq_full); + return; + } + + for (i = 0; i < BATCH_SIZE; i++) { + uint64_t index; + struct umem_elem *elem; + + elem = elems[i]; + index = (uint64_t)((char *)elem - (char *)umem->buffer); + ovs_assert((index & FRAME_SHIFT_MASK) == 0); + *xsk_ring_prod__fill_addr(&umem->fq, idx_fq) = index; + + idx_fq++; + } + 
xsk_ring_prod__submit(&umem->fq, BATCH_SIZE); + xsk_info->available_rx += BATCH_SIZE; +} + +int +netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, struct dp_packet_batch *batch, + int *qfill) +{ + struct netdev_rxq_linux *rx = netdev_rxq_linux_cast(rxq_); + struct netdev *netdev = rx->up.netdev; + struct netdev_linux *dev = netdev_linux_cast(netdev); + struct xsk_socket_info *xsk_info; + struct xsk_umem_info *umem; + uint32_t idx_rx = 0; + int qid = rxq_->queue_id; + unsigned int rcvd, i; + + xsk_info = dev->xsks[qid]; + if (!xsk_info || !xsk_info->xsk) { + return EAGAIN; + } + + prepare_fill_queue(xsk_info); + + umem = xsk_info->umem; + rx->fd = xsk_socket__fd(xsk_info->xsk); + + rcvd = xsk_ring_cons__peek(&xsk_info->rx, BATCH_SIZE, &idx_rx); + if (!rcvd) { + return EAGAIN; + } + + /* Setup a dp_packet batch from descriptors in RX queue */ + for (i = 0; i < rcvd; i++) { + struct dp_packet_afxdp *xpacket; + const struct xdp_desc *desc; + struct dp_packet *packet; + uint64_t addr, index; + uint32_t len; + char *pkt; + + desc = xsk_ring_cons__rx_desc(&xsk_info->rx,idx_rx); + addr = desc->addr; + len = desc->len; + + pkt = xsk_umem__get_data(umem->buffer, addr); + index = addr >> FRAME_SHIFT; + xpacket = UMEM2XPKT(umem->xpool.array, index); + packet = &xpacket->packet; + + /* Initialize the struct dp_packet */ + dp_packet_use_afxdp(packet, pkt, + FRAME_SIZE - FRAME_HEADROOM, + OVS_XDP_HEADROOM); + dp_packet_set_size(packet, len); + + /* Add packet into batch, increase batch->count */ + dp_packet_batch_add(batch, packet); + + idx_rx++; + } + /* Release the RX queue */ + xsk_ring_cons__release(&xsk_info->rx, rcvd); + xsk_info->available_rx -= rcvd; + + if (qfill) { + /* TODO: return the number of remaining packets in the queue. 
*/ + *qfill = 0; + } + +#ifdef AFXDP_DEBUG + log_xsk_stat(xsk_info); +#endif + return 0; +} + +static inline int +kick_tx(struct xsk_socket_info *xsk_info, int xdpmode) +{ + int ret, retries; + static const int KERNEL_TX_BATCH_SIZE = 16; + + /* In SKB_MODE packet transmission is synchronous, and the kernel xmits + * only TX_BATCH_SIZE(16) packets for a single sendmsg syscall. + * So, we have to kick the kernel (n_packets / 16) times to be sure that + * all packets are transmitted. */ + retries = (xdpmode == XDP_COPY) + ? xsk_info->outstanding_tx / KERNEL_TX_BATCH_SIZE + : 0; +kick_retry: + /* This causes system call into kernel's xsk_sendmsg, and + * xsk_generic_xmit (skb mode) or xsk_async_xmit (driver mode). + */ + ret = sendto(xsk_socket__fd(xsk_info->xsk), NULL, 0, MSG_DONTWAIT, + NULL, 0); + if (ret < 0) { + if (retries-- && errno == EAGAIN) { + goto kick_retry; + } + if (errno == ENXIO || errno == ENOBUFS || errno == EOPNOTSUPP) { + return errno; + } + } + /* No error, or EBUSY, or too many retries on EAGAIN. 
*/ + return 0; +} + +void +free_afxdp_buf(struct dp_packet *p) +{ + struct dp_packet_afxdp *xpacket; + uintptr_t addr; + + xpacket = dp_packet_cast_afxdp(p); + if (xpacket->mpool) { + void *base = dp_packet_base(p); + + addr = (uintptr_t)base & (~FRAME_SHIFT_MASK); + umem_elem_push(xpacket->mpool, (void *)addr); + } +} + +static void +free_afxdp_buf_batch(struct dp_packet_batch *batch) +{ + struct dp_packet_afxdp *xpacket = NULL; + struct dp_packet *packet; + void *elems[BATCH_SIZE]; + uintptr_t addr; + + DP_PACKET_BATCH_FOR_EACH (i, packet, batch) { + void *base; + + xpacket = dp_packet_cast_afxdp(packet); + base = dp_packet_base(packet); + addr = (uintptr_t)base & (~FRAME_SHIFT_MASK); + elems[i] = (void *)addr; + } + umem_elem_push_n(xpacket->mpool, batch->count, elems); + dp_packet_batch_init(batch); +} + +static inline bool +check_free_batch(struct dp_packet_batch *batch) +{ + struct umem_pool *first_mpool = NULL; + struct dp_packet_afxdp *xpacket; + struct dp_packet *packet; + + DP_PACKET_BATCH_FOR_EACH (i, packet, batch) { + if (packet->source != DPBUF_AFXDP) { + return false; + } + xpacket = dp_packet_cast_afxdp(packet); + if (i == 0) { + first_mpool = xpacket->mpool; + continue; + } + if (xpacket->mpool != first_mpool) { + return false; + } + } + /* All packets are DPBUF_AFXDP and from the same mpool */ + return true; +} + +static inline void +afxdp_complete_tx(struct xsk_socket_info *xsk_info) +{ + struct umem_elem *elems_push[BATCH_SIZE]; + struct xsk_umem_info *umem; + uint32_t idx_cq = 0; + int tx_to_free = 0; + int tx_done, j; + + umem = xsk_info->umem; + tx_done = xsk_ring_cons__peek(&umem->cq, CONS_NUM_DESCS, &idx_cq); + + /* Recycle back to umem pool */ + for (j = 0; j < tx_done; j++) { + struct umem_elem *elem; + uint64_t *addr; + + addr = (uint64_t *)xsk_ring_cons__comp_addr(&umem->cq, idx_cq++); + if (*addr == UINT64_MAX) { + /* The elem has been pushed already */ + COVERAGE_INC(afxdp_cq_skip); + continue; + } + elem = ALIGNED_CAST(struct 
umem_elem *, + (char *)umem->buffer + *addr); + elems_push[tx_to_free] = elem; + *addr = UINT64_MAX; /* Mark as pushed */ + tx_to_free++; + + if (tx_to_free == BATCH_SIZE || j == tx_done - 1) { + umem_elem_push_n(&umem->mpool, tx_to_free, (void **)elems_push); + xsk_info->outstanding_tx -= tx_to_free; + tx_to_free = 0; + } + } + + if (tx_done > 0) { + xsk_ring_cons__release(&umem->cq, tx_done); + } else { + COVERAGE_INC(afxdp_cq_empty); + } +} + +static inline int +__netdev_afxdp_batch_send(struct netdev *netdev, int qid, + struct dp_packet_batch *batch) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev); + struct umem_elem *elems_pop[BATCH_SIZE]; + struct xsk_socket_info *xsk_info; + struct xsk_umem_info *umem; + struct dp_packet *packet; + bool free_batch = false; + unsigned long orig; + uint32_t idx = 0; + int error = 0; + int ret; + + xsk_info = dev->xsks[qid]; + if (!xsk_info || !xsk_info->xsk) { + goto out; + } + + afxdp_complete_tx(xsk_info); + + free_batch = check_free_batch(batch); + + umem = xsk_info->umem; + ret = umem_elem_pop_n(&umem->mpool, batch->count, (void **)elems_pop); + if (OVS_UNLIKELY(ret)) { + atomic_add_relaxed(&xsk_info->tx_dropped, batch->count, &orig); + VLOG_WARN_RL(&rl, "%s: send failed due to exhausted memory pool", + netdev_get_name(netdev)); + error = ENOMEM; + goto out; + } + + /* Make sure we have enough TX descs */ + ret = xsk_ring_prod__reserve(&xsk_info->tx, batch->count, &idx); + if (OVS_UNLIKELY(ret == 0)) { + umem_elem_push_n(&umem->mpool, batch->count, (void **)elems_pop); + atomic_add_relaxed(&xsk_info->tx_dropped, batch->count, &orig); + COVERAGE_INC(afxdp_tx_full); + afxdp_complete_tx(xsk_info); + kick_tx(xsk_info, dev->xdpmode); + error = ENOMEM; + goto out; + } + + DP_PACKET_BATCH_FOR_EACH (i, packet, batch) { + struct umem_elem *elem; + uint64_t index; + + elem = elems_pop[i]; + /* Copy the packet to the umem we just pop from umem pool. 
+ * TODO: avoid this copy if the packet and the pop umem + * are located in the same umem. + */ + memcpy(elem, dp_packet_data(packet), dp_packet_size(packet)); + + index = (uint64_t)((char *)elem - (char *)umem->buffer); + xsk_ring_prod__tx_desc(&xsk_info->tx, idx + i)->addr = index; + xsk_ring_prod__tx_desc(&xsk_info->tx, idx + i)->len + = dp_packet_size(packet); + } + xsk_ring_prod__submit(&xsk_info->tx, batch->count); + xsk_info->outstanding_tx += batch->count; + + ret = kick_tx(xsk_info, dev->xdpmode); + if (OVS_UNLIKELY(ret)) { + VLOG_WARN_RL(&rl, "%s: error sending AF_XDP packet: %s", + netdev_get_name(netdev), ovs_strerror(ret)); + } + +out: + if (free_batch) { + free_afxdp_buf_batch(batch); + } else { + dp_packet_delete_batch(batch, true); + } + + return error; +} + +int +netdev_afxdp_batch_send(struct netdev *netdev, int qid, + struct dp_packet_batch *batch, + bool concurrent_txq) +{ + struct netdev_linux *dev; + int ret; + + if (concurrent_txq) { + dev = netdev_linux_cast(netdev); + qid = qid % dev->up.n_txq; + + ovs_spin_lock(&dev->tx_locks[qid]); + ret = __netdev_afxdp_batch_send(netdev, qid, batch); + ovs_spin_unlock(&dev->tx_locks[qid]); + } else { + ret = __netdev_afxdp_batch_send(netdev, qid, batch); + } + + return ret; +} + +int +netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_ OVS_UNUSED) +{ + /* Done at reconfigure */ + return 0; +} + +void +netdev_afxdp_destruct(struct netdev *netdev) +{ + static struct ovsthread_once once = OVSTHREAD_ONCE_INITIALIZER; + struct netdev_linux *dev = netdev_linux_cast(netdev); + int n_txqs = netdev_n_rxq(netdev); + int i; + + if (ovsthread_once_start(&once)) { + fatal_signal_add_hook(netdev_afxdp_sweep_unused_pools, + NULL, NULL, true); + ovsthread_once_done(&once); + } + + /* Note: tc is by-passed when using drv-mode, but when using + * skb-mode, we might need to clean up tc. 
*/ + + xsk_destroy_all(netdev); + ovs_mutex_destroy(&dev->mutex); + + for (i = 0; i < n_txqs; i++) { + ovs_spin_destroy(&dev->tx_locks[i]); + } +} + +int +netdev_afxdp_get_stats(const struct netdev *netdev, + struct netdev_stats *stats) +{ + struct netdev_linux *dev = netdev_linux_cast(netdev); + struct xsk_socket_info *xsk_info; + struct netdev_stats dev_stats; + int error, i; + + ovs_mutex_lock(&dev->mutex); + + error = get_stats_via_netlink(netdev, &dev_stats); + if (error) { + VLOG_WARN_RL(&rl, "Error getting AF_XDP statistics"); + } else { + /* Use kernel netdev's packet and byte counts */ + stats->rx_packets = dev_stats.rx_packets; + stats->rx_bytes = dev_stats.rx_bytes; + stats->tx_packets = dev_stats.tx_packets; + stats->tx_bytes = dev_stats.tx_bytes; + + stats->rx_errors += dev_stats.rx_errors; + stats->tx_errors += dev_stats.tx_errors; + stats->rx_dropped += dev_stats.rx_dropped; + stats->tx_dropped += dev_stats.tx_dropped; + stats->multicast += dev_stats.multicast; + stats->collisions += dev_stats.collisions; + stats->rx_length_errors += dev_stats.rx_length_errors; + stats->rx_over_errors += dev_stats.rx_over_errors; + stats->rx_crc_errors += dev_stats.rx_crc_errors; + stats->rx_frame_errors += dev_stats.rx_frame_errors; + stats->rx_fifo_errors += dev_stats.rx_fifo_errors; + stats->rx_missed_errors += dev_stats.rx_missed_errors; + stats->tx_aborted_errors += dev_stats.tx_aborted_errors; + stats->tx_carrier_errors += dev_stats.tx_carrier_errors; + stats->tx_fifo_errors += dev_stats.tx_fifo_errors; + stats->tx_heartbeat_errors += dev_stats.tx_heartbeat_errors; + stats->tx_window_errors += dev_stats.tx_window_errors; + + /* Account the dropped in each xsk */ + for (i = 0; i < netdev_n_rxq(netdev); i++) { + xsk_info = dev->xsks[i]; + if (xsk_info) { + stats->tx_dropped += xsk_info->tx_dropped; + } + } + } + ovs_mutex_unlock(&dev->mutex); + + return error; +} diff --git a/lib/netdev-afxdp.h b/lib/netdev-afxdp.h new file mode 100644 index 
000000000000..dd2dc1a2064d --- /dev/null +++ b/lib/netdev-afxdp.h @@ -0,0 +1,74 @@ +/* + * Copyright (c) 2018, 2019 Nicira, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at: + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#ifndef NETDEV_AFXDP_H +#define NETDEV_AFXDP_H 1 + +#include + +#ifdef HAVE_AF_XDP + +#include +#include + +/* These functions are Linux AF_XDP specific, so they should be used directly + * only by Linux-specific code. */ + +#define MAX_XSKQ 16 + +struct netdev; +struct xsk_socket_info; +struct xdp_umem; +struct dp_packet_batch; +struct smap; +struct dp_packet; +struct netdev_rxq; +struct netdev_stats; + +int netdev_afxdp_rxq_construct(struct netdev_rxq *rxq_); +void netdev_afxdp_destruct(struct netdev *netdev_); + +int netdev_afxdp_rxq_recv(struct netdev_rxq *rxq_, + struct dp_packet_batch *batch, + int *qfill); +int netdev_afxdp_batch_send(struct netdev *netdev_, int qid, + struct dp_packet_batch *batch, + bool concurrent_txq); +int netdev_afxdp_set_config(struct netdev *netdev, const struct smap *args, + char **errp); +int netdev_afxdp_get_config(const struct netdev *netdev, struct smap *args); +int netdev_afxdp_get_numa_id(const struct netdev *netdev); +int netdev_afxdp_get_stats(const struct netdev *netdev_, + struct netdev_stats *stats); + +void free_afxdp_buf(struct dp_packet *p); +int netdev_afxdp_reconfigure(struct netdev *netdev); +void signal_remove_xdp(struct netdev *netdev); + +#else /* !HAVE_AF_XDP */ + +#include "openvswitch/compiler.h" + +struct dp_packet; + 
+static inline void +free_afxdp_buf(struct dp_packet *p OVS_UNUSED) +{ + /* Nothing */ +} + +#endif /* HAVE_AF_XDP */ +#endif /* netdev-afxdp.h */ diff --git a/lib/netdev-linux-private.h b/lib/netdev-linux-private.h new file mode 100644 index 000000000000..f8ffce270df3 --- /dev/null +++ b/lib/netdev-linux-private.h @@ -0,0 +1,138 @@ +/* + * Copyright (c) 2019 Nicira, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at: + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#ifndef NETDEV_LINUX_PRIVATE_H +#define NETDEV_LINUX_PRIVATE_H 1 + +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "netdev-afxdp.h" +#include "netdev-provider.h" +#include "netdev-vport.h" +#include "openvswitch/thread.h" +#include "ovs-atomic.h" +#include "timer.h" +#include "xdpsock.h" + +/* These functions are Linux specific, so they should be used directly only by + * Linux-specific code. */ + +struct netdev; + +struct netdev_rxq_linux { + struct netdev_rxq up; + bool is_tap; + int fd; +}; + +void netdev_linux_run(const struct netdev_class *); + +int netdev_linux_ethtool_set_flag(struct netdev *netdev, uint32_t flag, + const char *flag_name, bool enable); + +int get_stats_via_netlink(const struct netdev *netdev_, + struct netdev_stats *stats); + +struct netdev_linux { + struct netdev up; + + /* Protects all members below. */ + struct ovs_mutex mutex; + + unsigned int cache_valid; + + bool miimon; /* Link status of last poll. 
*/ + long long int miimon_interval; /* Miimon Poll rate. Disabled if <= 0. */ + struct timer miimon_timer; + + int netnsid; /* Network namespace ID. */ + /* The following are figured out "on demand" only. They are only valid + * when the corresponding VALID_* bit in 'cache_valid' is set. */ + int ifindex; + struct eth_addr etheraddr; + int mtu; + unsigned int ifi_flags; + long long int carrier_resets; + uint32_t kbits_rate; /* Policing data. */ + uint32_t kbits_burst; + int vport_stats_error; /* Cached error code from vport_get_stats(). + 0 or an errno value. */ + int netdev_mtu_error; /* Cached error code from SIOCGIFMTU + * or SIOCSIFMTU. + */ + int ether_addr_error; /* Cached error code from set/get etheraddr. */ + int netdev_policing_error; /* Cached error code from set policing. */ + int get_features_error; /* Cached error code from ETHTOOL_GSET. */ + int get_ifindex_error; /* Cached error code from SIOCGIFINDEX. */ + + enum netdev_features current; /* Cached from ETHTOOL_GSET. */ + enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */ + enum netdev_features supported; /* Cached from ETHTOOL_GSET. */ + + struct ethtool_drvinfo drvinfo; /* Cached from ETHTOOL_GDRVINFO. */ + struct tc *tc; + + /* For devices of class netdev_tap_class only. */ + int tap_fd; + bool present; /* If the device is present in the namespace */ + uint64_t tx_dropped; /* tap device can drop if the iface is down */ + + /* LAG information. */ + bool is_lag_master; /* True if the netdev is a LAG master. 
*/ + + /* AF_XDP information */ +#ifdef HAVE_AF_XDP + struct xsk_socket_info **xsks; + int requested_n_rxq; + int xdpmode; + int requested_xdpmode; + struct ovs_spin *tx_locks; +#endif +}; + +static bool +is_netdev_linux_class(const struct netdev_class *netdev_class) +{ + return netdev_class->run == netdev_linux_run; +} + +static struct netdev_linux * +netdev_linux_cast(const struct netdev *netdev) +{ + ovs_assert(is_netdev_linux_class(netdev_get_class(netdev))); + + return CONTAINER_OF(netdev, struct netdev_linux, up); +} + +static struct netdev_rxq_linux * +netdev_rxq_linux_cast(const struct netdev_rxq *rx) +{ + ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev))); + + return CONTAINER_OF(rx, struct netdev_rxq_linux, up); +} + +#endif /* netdev-linux-private.h */ diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c index e4ea94cf9243..2ba72e117989 100644 --- a/lib/netdev-linux.c +++ b/lib/netdev-linux.c @@ -17,6 +17,7 @@ #include #include "netdev-linux.h" +#include "netdev-linux-private.h" #include #include @@ -54,6 +55,7 @@ #include "fatal-signal.h" #include "hash.h" #include "openvswitch/hmap.h" +#include "netdev-afxdp.h" #include "netdev-provider.h" #include "netdev-vport.h" #include "netlink-notifier.h" @@ -486,57 +488,6 @@ static int tc_calc_cell_log(unsigned int mtu); static void tc_fill_rate(struct tc_ratespec *rate, uint64_t bps, int mtu); static int tc_calc_buffer(unsigned int Bps, int mtu, uint64_t burst_bytes); -struct netdev_linux { - struct netdev up; - - /* Protects all members below. */ - struct ovs_mutex mutex; - - unsigned int cache_valid; - - bool miimon; /* Link status of last poll. */ - long long int miimon_interval; /* Miimon Poll rate. Disabled if <= 0. */ - struct timer miimon_timer; - - int netnsid; /* Network namespace ID. */ - /* The following are figured out "on demand" only. They are only valid - * when the corresponding VALID_* bit in 'cache_valid' is set. 
*/ - int ifindex; - struct eth_addr etheraddr; - int mtu; - unsigned int ifi_flags; - long long int carrier_resets; - uint32_t kbits_rate; /* Policing data. */ - uint32_t kbits_burst; - int vport_stats_error; /* Cached error code from vport_get_stats(). - 0 or an errno value. */ - int netdev_mtu_error; /* Cached error code from SIOCGIFMTU or SIOCSIFMTU. */ - int ether_addr_error; /* Cached error code from set/get etheraddr. */ - int netdev_policing_error; /* Cached error code from set policing. */ - int get_features_error; /* Cached error code from ETHTOOL_GSET. */ - int get_ifindex_error; /* Cached error code from SIOCGIFINDEX. */ - - enum netdev_features current; /* Cached from ETHTOOL_GSET. */ - enum netdev_features advertised; /* Cached from ETHTOOL_GSET. */ - enum netdev_features supported; /* Cached from ETHTOOL_GSET. */ - - struct ethtool_drvinfo drvinfo; /* Cached from ETHTOOL_GDRVINFO. */ - struct tc *tc; - - /* For devices of class netdev_tap_class only. */ - int tap_fd; - bool present; /* If the device is present in the namespace */ - uint64_t tx_dropped; /* tap device can drop if the iface is down */ - - /* LAG information. */ - bool is_lag_master; /* True if the netdev is a LAG master. */ -}; - -struct netdev_rxq_linux { - struct netdev_rxq up; - bool is_tap; - int fd; -}; /* This is set pretty low because we probably won't learn anything from the * additional log messages. */ @@ -550,8 +501,6 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20); * changes in the device miimon status, so we can use atomic_count. 
*/ static atomic_count miimon_cnt = ATOMIC_COUNT_INIT(0); -static void netdev_linux_run(const struct netdev_class *); - static int netdev_linux_do_ethtool(const char *name, struct ethtool_cmd *, int cmd, const char *cmd_name); static int get_flags(const struct netdev *, unsigned int *flags); @@ -565,7 +514,6 @@ static int do_set_addr(struct netdev *netdev, struct in_addr addr); static int get_etheraddr(const char *netdev_name, struct eth_addr *ea); static int set_etheraddr(const char *netdev_name, const struct eth_addr); -static int get_stats_via_netlink(const struct netdev *, struct netdev_stats *); static int af_packet_sock(void); static bool netdev_linux_miimon_enabled(void); static void netdev_linux_miimon_run(void); @@ -573,31 +521,10 @@ static void netdev_linux_miimon_wait(void); static int netdev_linux_get_mtu__(struct netdev_linux *netdev, int *mtup); static bool -is_netdev_linux_class(const struct netdev_class *netdev_class) -{ - return netdev_class->run == netdev_linux_run; -} - -static bool is_tap_netdev(const struct netdev *netdev) { return netdev_get_class(netdev) == &netdev_tap_class; } - -static struct netdev_linux * -netdev_linux_cast(const struct netdev *netdev) -{ - ovs_assert(is_netdev_linux_class(netdev_get_class(netdev))); - - return CONTAINER_OF(netdev, struct netdev_linux, up); -} - -static struct netdev_rxq_linux * -netdev_rxq_linux_cast(const struct netdev_rxq *rx) -{ - ovs_assert(is_netdev_linux_class(netdev_get_class(rx->netdev))); - return CONTAINER_OF(rx, struct netdev_rxq_linux, up); -} static int netdev_linux_netnsid_update__(struct netdev_linux *netdev) @@ -773,7 +700,7 @@ netdev_linux_update_lag(struct rtnetlink_change *change) } } -static void +void netdev_linux_run(const struct netdev_class *netdev_class OVS_UNUSED) { struct nl_sock *sock; @@ -3278,9 +3205,7 @@ exit: .run = netdev_linux_run, \ .wait = netdev_linux_wait, \ .alloc = netdev_linux_alloc, \ - .destruct = netdev_linux_destruct, \ .dealloc = netdev_linux_dealloc, \ - 
.send = netdev_linux_send, \ .send_wait = netdev_linux_send_wait, \ .set_etheraddr = netdev_linux_set_etheraddr, \ .get_etheraddr = netdev_linux_get_etheraddr, \ @@ -3311,39 +3236,71 @@ exit: .arp_lookup = netdev_linux_arp_lookup, \ .update_flags = netdev_linux_update_flags, \ .rxq_alloc = netdev_linux_rxq_alloc, \ - .rxq_construct = netdev_linux_rxq_construct, \ .rxq_destruct = netdev_linux_rxq_destruct, \ .rxq_dealloc = netdev_linux_rxq_dealloc, \ - .rxq_recv = netdev_linux_rxq_recv, \ .rxq_wait = netdev_linux_rxq_wait, \ .rxq_drain = netdev_linux_rxq_drain const struct netdev_class netdev_linux_class = { NETDEV_LINUX_CLASS_COMMON, .type = "system", + .is_pmd = false, .construct = netdev_linux_construct, + .destruct = netdev_linux_destruct, .get_stats = netdev_linux_get_stats, .get_features = netdev_linux_get_features, .get_status = netdev_linux_get_status, - .get_block_id = netdev_linux_get_block_id + .get_block_id = netdev_linux_get_block_id, + .send = netdev_linux_send, + .rxq_construct = netdev_linux_rxq_construct, + .rxq_recv = netdev_linux_rxq_recv, }; const struct netdev_class netdev_tap_class = { NETDEV_LINUX_CLASS_COMMON, .type = "tap", + .is_pmd = false, .construct = netdev_linux_construct_tap, + .destruct = netdev_linux_destruct, .get_stats = netdev_tap_get_stats, .get_features = netdev_linux_get_features, .get_status = netdev_linux_get_status, + .send = netdev_linux_send, + .rxq_construct = netdev_linux_rxq_construct, + .rxq_recv = netdev_linux_rxq_recv, }; const struct netdev_class netdev_internal_class = { NETDEV_LINUX_CLASS_COMMON, .type = "internal", + .is_pmd = false, .construct = netdev_linux_construct, + .destruct = netdev_linux_destruct, .get_stats = netdev_internal_get_stats, .get_status = netdev_internal_get_status, + .send = netdev_linux_send, + .rxq_construct = netdev_linux_rxq_construct, + .rxq_recv = netdev_linux_rxq_recv, }; + +#ifdef HAVE_AF_XDP +const struct netdev_class netdev_afxdp_class = { + NETDEV_LINUX_CLASS_COMMON, + .type = 
"afxdp", + .is_pmd = true, + .construct = netdev_linux_construct, + .destruct = netdev_afxdp_destruct, + .get_stats = netdev_afxdp_get_stats, + .get_status = netdev_linux_get_status, + .set_config = netdev_afxdp_set_config, + .get_config = netdev_afxdp_get_config, + .reconfigure = netdev_afxdp_reconfigure, + .get_numa_id = netdev_afxdp_get_numa_id, + .send = netdev_afxdp_batch_send, + .rxq_construct = netdev_afxdp_rxq_construct, + .rxq_recv = netdev_afxdp_rxq_recv, +}; +#endif #define CODEL_N_QUEUES 0x0000 @@ -5915,7 +5872,7 @@ netdev_stats_from_rtnl_link_stats64(struct netdev_stats *dst, dst->tx_window_errors = src->tx_window_errors; } -static int +int get_stats_via_netlink(const struct netdev *netdev_, struct netdev_stats *stats) { struct ofpbuf request; diff --git a/lib/netdev-provider.h b/lib/netdev-provider.h index 2a545c986b4b..1e5a40c898fc 100644 --- a/lib/netdev-provider.h +++ b/lib/netdev-provider.h @@ -832,6 +832,9 @@ extern const struct netdev_class netdev_linux_class; extern const struct netdev_class netdev_internal_class; extern const struct netdev_class netdev_tap_class; +#ifdef HAVE_AF_XDP +extern const struct netdev_class netdev_afxdp_class; +#endif #ifdef __cplusplus } #endif diff --git a/lib/netdev.c b/lib/netdev.c index 6b34dec9c970..b1976d365428 100644 --- a/lib/netdev.c +++ b/lib/netdev.c @@ -103,6 +103,9 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, 20); static void restore_all_flags(void *aux OVS_UNUSED); void update_device_args(struct netdev *, const struct shash *args); +#ifdef HAVE_AF_XDP +void signal_remove_xdp(struct netdev *netdev); +#endif int netdev_n_txq(const struct netdev *netdev) @@ -147,6 +150,9 @@ netdev_initialize(void) netdev_vport_tunnel_register(); netdev_register_flow_api_provider(&netdev_offload_tc); +#ifdef HAVE_AF_XDP + netdev_register_provider(&netdev_afxdp_class); +#endif #endif #if defined(__FreeBSD__) || defined(__NetBSD__) netdev_register_provider(&netdev_tap_class); @@ -2021,6 +2027,11 @@ 
restore_all_flags(void *aux OVS_UNUSED) saved_flags & ~saved_values, &old_flags); } +#ifdef HAVE_AF_XDP + if (netdev->netdev_class == &netdev_afxdp_class) { + signal_remove_xdp(netdev); + } +#endif } } diff --git a/lib/util.c b/lib/util.c index 7b8ab81f6ee1..5eb20995b370 100644 --- a/lib/util.c +++ b/lib/util.c @@ -214,20 +214,19 @@ x2nrealloc(void *p, size_t *n, size_t s) return xrealloc(p, *n * s); } -/* Allocates and returns 'size' bytes of memory aligned to a cache line and in - * dedicated cache lines. That is, the memory block returned will not share a - * cache line with other data, avoiding "false sharing". +/* Allocates and returns 'size' bytes of memory aligned to 'alignment' bytes. + * 'alignment' must be a power of two and a multiple of sizeof(void *). * - * Use free_cacheline() to free the returned memory block. */ + * Use free_size_align() to free the returned memory block. */ void * -xmalloc_cacheline(size_t size) +xmalloc_size_align(size_t size, size_t alignment) { #ifdef HAVE_POSIX_MEMALIGN void *p; int error; COVERAGE_INC(util_xalloc); - error = posix_memalign(&p, CACHE_LINE_SIZE, size ? size : 1); + error = posix_memalign(&p, alignment, size ? size : 1); if (error != 0) { out_of_memory(); } @@ -235,16 +234,16 @@ xmalloc_cacheline(size_t size) #else /* Allocate room for: * - * - Header padding: Up to CACHE_LINE_SIZE - 1 bytes, to allow the - * pointer to be aligned exactly sizeof(void *) bytes before the - * beginning of a cache line. + * - Header padding: Up to alignment - 1 bytes, to allow the + * pointer 'q' to be aligned exactly sizeof(void *) bytes before the + * beginning of the aligned block. * * - Pointer: A pointer to the start of the header padding, to allow us * to free() the block later. * * - User data: 'size' bytes. * - * - Trailer padding: Enough to bring the user data up to a cache line + * - Trailer padding: Enough to bring the user data up to an alignment * multiple.
* * +---------------+---------+------------------------+---------+ @@ -255,18 +254,56 @@ xmalloc_cacheline(size_t size) * p q r * */ - void *p = xmalloc((CACHE_LINE_SIZE - 1) - + sizeof(void *) - + ROUND_UP(size, CACHE_LINE_SIZE)); - bool runt = PAD_SIZE((uintptr_t) p, CACHE_LINE_SIZE) < sizeof(void *); - void *r = (void *) ROUND_UP((uintptr_t) p + (runt ? CACHE_LINE_SIZE : 0), - CACHE_LINE_SIZE); - void **q = (void **) r - 1; + void *p, *r, **q; + bool runt; + + COVERAGE_INC(util_xalloc); + if (!IS_POW2(alignment) || (alignment % sizeof(void *) != 0)) { + ovs_abort(0, "Invalid alignment"); + } + + p = xmalloc((alignment - 1) + + sizeof(void *) + + ROUND_UP(size, alignment)); + + runt = PAD_SIZE((uintptr_t) p, alignment) < sizeof(void *); + /* When the padding size < sizeof(void*), we don't have enough room for + * pointer 'q'. As a result, we need to move 'r' to the next alignment. + * So ROUND_UP when xmalloc above, and ROUND_UP again when calculating 'r' + * below. + */ + r = (void *) ROUND_UP((uintptr_t) p + (runt ? alignment : 0), alignment); + q = (void **) r - 1; *q = p; + return r; #endif } +void +free_size_align(void *p) +{ +#ifdef HAVE_POSIX_MEMALIGN + free(p); +#else + if (p) { + void **q = (void **) p - 1; + free(*q); + } +#endif +} + +/* Allocates and returns 'size' bytes of memory aligned to a cache line and in + * dedicated cache lines. That is, the memory block returned will not share a + * cache line with other data, avoiding "false sharing". + * + * Use free_cacheline() to free the returned memory block. */ +void * +xmalloc_cacheline(size_t size) +{ + return xmalloc_size_align(size, CACHE_LINE_SIZE); +} + /* Like xmalloc_cacheline() but clears the allocated memory to all zero * bytes.
*/ void * @@ -282,14 +319,19 @@ xzalloc_cacheline(size_t size) void free_cacheline(void *p) { -#ifdef HAVE_POSIX_MEMALIGN - free(p); -#else - if (p) { - void **q = (void **) p - 1; - free(*q); - } -#endif + free_size_align(p); +} + +void * +xmalloc_pagealign(size_t size) +{ + return xmalloc_size_align(size, get_page_size()); +} + +void +free_pagealign(void *p) +{ + free_size_align(p); } char * diff --git a/lib/util.h b/lib/util.h index 095ede20f07f..7ad8758fe637 100644 --- a/lib/util.h +++ b/lib/util.h @@ -169,6 +169,11 @@ void ovs_strzcpy(char *dst, const char *src, size_t size); int string_ends_with(const char *str, const char *suffix); +void *xmalloc_pagealign(size_t) MALLOC_LIKE; +void free_pagealign(void *); +void *xmalloc_size_align(size_t, size_t) MALLOC_LIKE; +void free_size_align(void *); + /* The C standards say that neither the 'dst' nor 'src' argument to * memcpy() may be null, even if 'n' is zero. This wrapper tolerates * the null case. */ diff --git a/lib/xdpsock.c b/lib/xdpsock.c new file mode 100644 index 000000000000..f3d316b8ffe0 --- /dev/null +++ b/lib/xdpsock.c @@ -0,0 +1,176 @@ +/* + * Copyright (c) 2018, 2019 Nicira, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at: + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +#include + +#include "xdpsock.h" +#include "dp-packet.h" +#include "openvswitch/compiler.h" + +/* Note: + * umem_elem_push* shouldn't overflow because we always pop + * elem first, then push back to the stack. 
+ */ +static inline void +__umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs) +{ + void *ptr; + + if (OVS_UNLIKELY(umemp->index + n > umemp->size)) { + OVS_NOT_REACHED(); + } + + ptr = &umemp->array[umemp->index]; + memcpy(ptr, addrs, n * sizeof(void *)); + umemp->index += n; +} + +void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs) +{ + ovs_spin_lock(&umemp->lock); + __umem_elem_push_n(umemp, n, addrs); + ovs_spin_unlock(&umemp->lock); +} + +static inline void +__umem_elem_push(struct umem_pool *umemp, void *addr) +{ + if (OVS_UNLIKELY(umemp->index + 1 > umemp->size)) { + OVS_NOT_REACHED(); + } + + umemp->array[umemp->index++] = addr; +} + +void +umem_elem_push(struct umem_pool *umemp, void *addr) +{ + + ovs_assert(((uint64_t)addr & FRAME_SHIFT_MASK) == 0); + + ovs_spin_lock(&umemp->lock); + __umem_elem_push(umemp, addr); + ovs_spin_unlock(&umemp->lock); +} + +static inline int +__umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs) +{ + void *ptr; + + if (OVS_UNLIKELY(umemp->index - n < 0)) { + return -ENOMEM; + } + + umemp->index -= n; + ptr = &umemp->array[umemp->index]; + memcpy(addrs, ptr, n * sizeof(void *)); + + return 0; +} + +int +umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs) +{ + int ret; + + ovs_spin_lock(&umemp->lock); + ret = __umem_elem_pop_n(umemp, n, addrs); + ovs_spin_unlock(&umemp->lock); + + return ret; +} + +static inline void * +__umem_elem_pop(struct umem_pool *umemp) +{ + if (OVS_UNLIKELY(umemp->index - 1 < 0)) { + return NULL; + } + + return umemp->array[--umemp->index]; +} + +void * +umem_elem_pop(struct umem_pool *umemp) +{ + void *ptr; + + ovs_spin_lock(&umemp->lock); + ptr = __umem_elem_pop(umemp); + ovs_spin_unlock(&umemp->lock); + + return ptr; +} + +static void ** +__umem_pool_alloc(unsigned int size) +{ + void *bufs; + + bufs = xmalloc_pagealign(size * sizeof(void *)); + memset(bufs, 0, size * sizeof(void *)); + + return (void **)bufs; +} + +int +umem_pool_init(struct umem_pool
*umemp, unsigned int size) +{ + umemp->array = __umem_pool_alloc(size); + if (!umemp->array) { + return -ENOMEM; + } + + umemp->size = size; + umemp->index = 0; + ovs_spin_init(&umemp->lock); + return 0; +} + +void +umem_pool_cleanup(struct umem_pool *umemp) +{ + free_pagealign(umemp->array); + umemp->array = NULL; +} + +unsigned int +umem_pool_count(struct umem_pool *umemp) +{ + return umemp->index; +} + +/* AF_XDP metadata init/destroy */ +int +xpacket_pool_init(struct xpacket_pool *xp, unsigned int size) +{ + void *bufs; + + bufs = xmalloc_pagealign(size * sizeof(struct dp_packet_afxdp)); + memset(bufs, 0, size * sizeof(struct dp_packet_afxdp)); + + xp->array = bufs; + xp->size = size; + + return 0; +} + +void +xpacket_pool_cleanup(struct xpacket_pool *xp) +{ + free_pagealign(xp->array); + xp->array = NULL; +} diff --git a/lib/xdpsock.h b/lib/xdpsock.h new file mode 100644 index 000000000000..4c4df9c8ae16 --- /dev/null +++ b/lib/xdpsock.h @@ -0,0 +1,105 @@ +/* + * Copyright (c) 2018, 2019 Nicira, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at: + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +#ifndef XDPSOCK_H +#define XDPSOCK_H 1 + +#include + +#ifdef HAVE_AF_XDP + +#include +#include +#include +#include + +#include "openvswitch/thread.h" +#include "ovs-atomic.h" + +#define FRAME_HEADROOM XDP_PACKET_HEADROOM +#define OVS_XDP_HEADROOM 128 +#define FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE +#define FRAME_SHIFT XSK_UMEM__DEFAULT_FRAME_SHIFT +#define FRAME_SHIFT_MASK ((1 << FRAME_SHIFT) - 1) + +#define PROD_NUM_DESCS XSK_RING_PROD__DEFAULT_NUM_DESCS +#define CONS_NUM_DESCS XSK_RING_CONS__DEFAULT_NUM_DESCS + +/* The worst case is that all 4 queues (TX/CQ/RX/FILL) are full and some + * packets are still being processed in threads. The number of packets + * currently in OVS processing is hard to estimate because it depends on the + * number of ports. Setting NUM_FRAMES to twice the total of the ring sizes + * should be enough for most corner cases. + */ +#define NUM_FRAMES (4 * (PROD_NUM_DESCS + CONS_NUM_DESCS)) + +#define BATCH_SIZE NETDEV_MAX_BURST + +BUILD_ASSERT_DECL(IS_POW2(NUM_FRAMES)); +BUILD_ASSERT_DECL(PROD_NUM_DESCS == CONS_NUM_DESCS); +BUILD_ASSERT_DECL(NUM_FRAMES == 4 * (PROD_NUM_DESCS + CONS_NUM_DESCS)); + +/* LIFO ptr_array */ +struct umem_pool { + int index; /* point to top */ + unsigned int size; + struct ovs_spin lock; + void **array; /* a pointer array, point to umem buf */ +}; + +/* array-based dp_packet_afxdp */ +struct xpacket_pool { + unsigned int size; + struct dp_packet_afxdp **array; +}; + +struct xsk_umem_info { + struct umem_pool mpool; + struct xpacket_pool xpool; + struct xsk_ring_prod fq; + struct xsk_ring_cons cq; + struct xsk_umem *umem; + void *buffer; +}; + +struct xsk_socket_info { + struct xsk_ring_cons rx; + struct xsk_ring_prod tx; + struct xsk_umem_info *umem; + struct xsk_socket *xsk; + uint32_t outstanding_tx; /* Number of descriptors filled in tx and cq. */ + uint32_t available_rx; /* Number of descriptors filled in rx and fq.
*/ + atomic_ulong tx_dropped; +}; + +struct umem_elem { + struct umem_elem *next; +}; + +void umem_elem_push(struct umem_pool *umemp, void *addr); +void umem_elem_push_n(struct umem_pool *umemp, int n, void **addrs); + +void *umem_elem_pop(struct umem_pool *umemp); +int umem_elem_pop_n(struct umem_pool *umemp, int n, void **addrs); + +int umem_pool_init(struct umem_pool *umemp, unsigned int size); +void umem_pool_cleanup(struct umem_pool *umemp); +unsigned int umem_pool_count(struct umem_pool *umemp); +int xpacket_pool_init(struct xpacket_pool *xp, unsigned int size); +void xpacket_pool_cleanup(struct xpacket_pool *xp); + +#endif +#endif diff --git a/tests/.gitignore b/tests/.gitignore index 9b07508bd056..c5abb32d025a 100644 --- a/tests/.gitignore +++ b/tests/.gitignore @@ -13,6 +13,9 @@ /ovsdb-cluster-testsuite.dir/ /ovsdb-cluster-testsuite.log /pki/ +/system-afxdp-testsuite +/system-afxdp-testsuite.dir/ +/system-afxdp-testsuite.log /system-dpdk-testsuite /system-dpdk-testsuite.dir/ /system-dpdk-testsuite.log diff --git a/tests/automake.mk b/tests/automake.mk index 2956e68b242c..d6ab51732908 100644 --- a/tests/automake.mk +++ b/tests/automake.mk @@ -4,12 +4,14 @@ EXTRA_DIST += \ $(SYSTEM_TESTSUITE_AT) \ $(SYSTEM_KMOD_TESTSUITE_AT) \ $(SYSTEM_USERSPACE_TESTSUITE_AT) \ + $(SYSTEM_AFXDP_TESTSUITE_AT) \ $(SYSTEM_OFFLOADS_TESTSUITE_AT) \ $(SYSTEM_DPDK_TESTSUITE_AT) \ $(OVSDB_CLUSTER_TESTSUITE_AT) \ $(TESTSUITE) \ $(SYSTEM_KMOD_TESTSUITE) \ $(SYSTEM_USERSPACE_TESTSUITE) \ + $(SYSTEM_AFXDP_TESTSUITE) \ $(SYSTEM_OFFLOADS_TESTSUITE) \ $(SYSTEM_DPDK_TESTSUITE) \ $(OVSDB_CLUSTER_TESTSUITE) \ @@ -160,6 +162,11 @@ SYSTEM_USERSPACE_TESTSUITE_AT = \ tests/system-userspace-macros.at \ tests/system-userspace-packet-type-aware.at +SYSTEM_AFXDP_TESTSUITE_AT = \ + tests/system-userspace-macros.at \ + tests/system-afxdp-testsuite.at \ + tests/system-afxdp-macros.at + SYSTEM_TESTSUITE_AT = \ tests/system-common-macros.at \ tests/system-ovn.at \ @@ -184,6 +191,7 @@ TESTSUITE = 
$(srcdir)/tests/testsuite TESTSUITE_PATCH = $(srcdir)/tests/testsuite.patch SYSTEM_KMOD_TESTSUITE = $(srcdir)/tests/system-kmod-testsuite SYSTEM_USERSPACE_TESTSUITE = $(srcdir)/tests/system-userspace-testsuite +SYSTEM_AFXDP_TESTSUITE = $(srcdir)/tests/system-afxdp-testsuite SYSTEM_OFFLOADS_TESTSUITE = $(srcdir)/tests/system-offloads-testsuite SYSTEM_DPDK_TESTSUITE = $(srcdir)/tests/system-dpdk-testsuite OVSDB_CLUSTER_TESTSUITE = $(srcdir)/tests/ovsdb-cluster-testsuite @@ -317,6 +325,10 @@ check-system-userspace: all set $(SHELL) '$(SYSTEM_USERSPACE_TESTSUITE)' -C tests AUTOTEST_PATH='$(AUTOTEST_PATH)'; \ "$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck) +check-afxdp: all + set $(SHELL) '$(SYSTEM_AFXDP_TESTSUITE)' -C tests AUTOTEST_PATH='$(AUTOTEST_PATH)' $(TESTSUITEFLAGS) -j1; \ + "$$@" || (test X'$(RECHECK)' = Xyes && "$$@" --recheck) + check-offloads: all set $(SHELL) '$(SYSTEM_OFFLOADS_TESTSUITE)' -C tests AUTOTEST_PATH='$(AUTOTEST_PATH)'; \ "$$@" $(TESTSUITEFLAGS) -j1 || (test X'$(RECHECK)' = Xyes && "$$@" --recheck) @@ -354,6 +366,10 @@ $(SYSTEM_USERSPACE_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_USERSP $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at $(AM_V_at)mv $@.tmp $@ +$(SYSTEM_AFXDP_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_AFXDP_TESTSUITE_AT) $(COMMON_MACROS_AT) + $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at + $(AM_V_at)mv $@.tmp $@ + $(SYSTEM_OFFLOADS_TESTSUITE): package.m4 $(SYSTEM_TESTSUITE_AT) $(SYSTEM_OFFLOADS_TESTSUITE_AT) $(COMMON_MACROS_AT) $(AM_V_GEN)$(AUTOTEST) -I '$(srcdir)' -o $@.tmp $@.at $(AM_V_at)mv $@.tmp $@ diff --git a/tests/system-afxdp-macros.at b/tests/system-afxdp-macros.at new file mode 100644 index 000000000000..f0683c0a901b --- /dev/null +++ b/tests/system-afxdp-macros.at @@ -0,0 +1,39 @@ +# Add port to ovs bridge by using afxdp mode. +# This will use generic XDP support in the veth driver. 
+m4_define([ADD_VETH], + [ AT_CHECK([ip link add $1 type veth peer name ovs-$1 || return 77]) + CONFIGURE_VETH_OFFLOADS([$1]) + AT_CHECK([ip link set $1 netns $2]) + AT_CHECK([ip link set dev ovs-$1 up]) + AT_CHECK([ovs-vsctl add-port $3 ovs-$1 -- \ + set interface ovs-$1 external-ids:iface-id="$1" type="afxdp"]) + NS_CHECK_EXEC([$2], [ip addr add $4 dev $1 $7]) + NS_CHECK_EXEC([$2], [ip link set dev $1 up]) + if test -n "$5"; then + NS_CHECK_EXEC([$2], [ip link set dev $1 address $5]) + fi + if test -n "$6"; then + NS_CHECK_EXEC([$2], [ip route add default via $6]) + fi + on_exit 'ip link del ovs-$1' + ] +) + +m4_define([OVS_CHECK_8021AD], + [AT_SKIP_IF([:])]) + +# CONFIGURE_VETH_OFFLOADS([VETH]) +# +# Disable TX offloads and VLAN offloads for veths used in AF_XDP. +m4_define([CONFIGURE_VETH_OFFLOADS], + [AT_CHECK([ethtool -K $1 tx off], [0], [ignore], [ignore]) + AT_CHECK([ethtool -K $1 txvlan off], [0], [ignore], [ignore]) + ] +) + +# OVS_START_L7([namespace], [protocol]) +# +# AF_XDP doesn't work with TCP over virtual interfaces for now. +# +m4_define([OVS_START_L7], + [AT_SKIP_IF([:])]) diff --git a/tests/system-afxdp-testsuite.at b/tests/system-afxdp-testsuite.at new file mode 100644 index 000000000000..9b7a29066614 --- /dev/null +++ b/tests/system-afxdp-testsuite.at @@ -0,0 +1,26 @@ +AT_INIT + +AT_COPYRIGHT([Copyright (c) 2018, 2019 Nicira, Inc. + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at: + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+See the License for the specific language governing permissions and +limitations under the License.]) + +m4_ifdef([AT_COLOR_TESTS], [AT_COLOR_TESTS]) + +m4_include([tests/ovs-macros.at]) +m4_include([tests/ovsdb-macros.at]) +m4_include([tests/ofproto-macros.at]) +m4_include([tests/system-common-macros.at]) +m4_include([tests/system-userspace-macros.at]) +m4_include([tests/system-afxdp-macros.at]) + +m4_include([tests/system-traffic.at]) diff --git a/tests/system-traffic.at b/tests/system-traffic.at index 8ea450887076..4bd91a03946e 100644 --- a/tests/system-traffic.at +++ b/tests/system-traffic.at @@ -71,6 +71,7 @@ AT_CLEANUP AT_SETUP([datapath - ping between two ports on cvlan]) OVS_TRAFFIC_VSWITCHD_START() +OVS_CHECK_8021AD() AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"]) @@ -161,6 +162,7 @@ AT_CLEANUP AT_SETUP([datapath - ping6 between two ports on cvlan]) OVS_TRAFFIC_VSWITCHD_START() +OVS_CHECK_8021AD() AT_CHECK([ovs-ofctl add-flow br0 "actions=normal"]) diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml index 6d99f7c270cd..027aee2f523b 100644 --- a/vswitchd/vswitch.xml +++ b/vswitchd/vswitch.xml @@ -3107,6 +3107,21 @@ ovs-vsctl add-port br0 p0 -- set Interface p0 type=patch options:peer=p1 \

+ +

+ Specifies the operational mode of the XDP program. + If "drv", the XDP program is loaded into the device driver with + zero-copy RX and TX enabled. This mode requires a device driver with + AF_XDP support and gives the best performance. + If "skb", the XDP program uses the kernel's generic XDP mode, with + extra data copying between userspace and the kernel. No device driver + support is needed. Note that this option applies to the afxdp netdev + type only. Defaults to "skb" mode. +

+
+
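As a reading aid for the xdpmode description in the vswitch.xml hunk, here is a minimal C sketch of how an option value could be mapped to a mode. It assumes that the only accepted strings are "skb" and "drv" and that an unset option falls back to "skb", as the documentation text states. The names sketch_xdpmode and sketch_parse_xdpmode are hypothetical and do not appear in this patch; the real parsing is presumably done in netdev_afxdp_set_config(), whose body is not shown in this section and may well differ.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names, for illustration only; not part of the patch. */
enum sketch_xdpmode {
    SKETCH_XDPMODE_SKB,      /* Generic XDP, extra copy, no driver support. */
    SKETCH_XDPMODE_DRV,      /* Driver-loaded XDP with zero-copy RX and TX. */
    SKETCH_XDPMODE_INVALID   /* Unrecognized option value. */
};

/* Maps an 'xdpmode' option string to a mode.  A NULL 'value' means the
 * option was not set, which selects the documented default of "skb".
 * The comparison is case-sensitive, which is an assumption of this sketch. */
static enum sketch_xdpmode
sketch_parse_xdpmode(const char *value)
{
    if (value == NULL || !strcmp(value, "skb")) {
        return SKETCH_XDPMODE_SKB;
    } else if (!strcmp(value, "drv")) {
        return SKETCH_XDPMODE_DRV;
    }
    return SKETCH_XDPMODE_INVALID;
}
```

The sketch rejects unknown strings instead of silently falling back, so a typo such as "driver" does not quietly select the slower mode. Whether the patch behaves the same way cannot be seen from this section.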