{"id":819582,"url":"http://patchwork.ozlabs.org/api/patches/819582/?format=json","web_url":"http://patchwork.ozlabs.org/project/netdev/patch/150660343811.2808.7680200486950101509.stgit@firesoul/","project":{"id":7,"url":"http://patchwork.ozlabs.org/api/projects/7/?format=json","name":"Linux network development","link_name":"netdev","list_id":"netdev.vger.kernel.org","list_email":"netdev@vger.kernel.org","web_url":null,"scm_url":null,"webscm_url":null,"list_archive_url":"","list_archive_url_format":"","commit_url_format":""},"msgid":"<150660343811.2808.7680200486950101509.stgit@firesoul>","list_archive_url":null,"date":"2017-09-28T12:57:18","name":"[net-next,3/5] bpf: cpumap xdp_buff to skb conversion and allocation","commit_ref":null,"pull_url":null,"state":"superseded","archived":true,"hash":"c85e05803d4d6943fdedde77ec8609417393d3ae","submitter":{"id":13625,"url":"http://patchwork.ozlabs.org/api/people/13625/?format=json","name":"Jesper Dangaard Brouer","email":"brouer@redhat.com"},"delegate":{"id":34,"url":"http://patchwork.ozlabs.org/api/users/34/?format=json","username":"davem","first_name":"David","last_name":"Miller","email":"davem@davemloft.net"},"mbox":"http://patchwork.ozlabs.org/project/netdev/patch/150660343811.2808.7680200486950101509.stgit@firesoul/mbox/","series":[{"id":5560,"url":"http://patchwork.ozlabs.org/api/series/5560/?format=json","web_url":"http://patchwork.ozlabs.org/project/netdev/list/?series=5560","date":"2017-09-28T12:57:02","name":"New bpf cpumap type for XDP_REDIRECT","version":1,"mbox":"http://patchwork.ozlabs.org/series/5560/mbox/"}],"comments":"http://patchwork.ozlabs.org/api/patches/819582/comments/","check":"pending","checks":"http://patchwork.ozlabs.org/api/patches/819582/checks/","tags":{},"related":[],"headers":{"Return-Path":"<netdev-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":["ozlabs.org;\n\tspf=none (mailfrom) 
smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=netdev-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","ext-mx03.extmail.prod.ext.phx2.redhat.com;\n\tdmarc=none (p=none dis=none) header.from=redhat.com","ext-mx03.extmail.prod.ext.phx2.redhat.com;\n\tspf=fail smtp.mailfrom=brouer@redhat.com"],"Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3y2vqG311Lz9tXd\n\tfor <patchwork-incoming@ozlabs.org>;\n\tThu, 28 Sep 2017 22:57:30 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1753149AbdI1M52 (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tThu, 28 Sep 2017 08:57:28 -0400","from mx1.redhat.com ([209.132.183.28]:60686 \"EHLO mx1.redhat.com\"\n\trhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP\n\tid S1753090AbdI1M5Y (ORCPT <rfc822;netdev@vger.kernel.org>);\n\tThu, 28 Sep 2017 08:57:24 -0400","from smtp.corp.redhat.com\n\t(int-mx06.intmail.prod.int.phx2.redhat.com [10.5.11.16])\n\t(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))\n\t(No client certificate requested)\n\tby mx1.redhat.com (Postfix) with ESMTPS id 23B9319224C;\n\tThu, 28 Sep 2017 12:57:24 +0000 (UTC)","from firesoul.localdomain (ovpn-200-26.brq.redhat.com\n\t[10.40.200.26])\n\tby smtp.corp.redhat.com (Postfix) with ESMTP id 047681803D;\n\tThu, 28 Sep 2017 12:57:19 +0000 (UTC)","from [192.168.5.1] (localhost [IPv6:::1])\n\tby firesoul.localdomain (Postfix) with ESMTP id 2B58537CC8001;\n\tThu, 28 Sep 2017 14:57:18 +0200 (CEST)"],"DMARC-Filter":"OpenDMARC Filter v1.3.2 mx1.redhat.com 23B9319224C","Subject":"[net-next PATCH 3/5] bpf: cpumap xdp_buff to skb conversion and\n\tallocation","From":"Jesper Dangaard Brouer <brouer@redhat.com>","To":"netdev@vger.kernel.org","Cc":"jakub.kicinski@netronome.com, \"Michael S. 
Tsirkin\" <mst@redhat.com>,\n\tJason Wang <jasowang@redhat.com>, mchan@broadcom.com,\n\tJohn Fastabend <john.fastabend@gmail.com>, peter.waskiewicz.jr@intel.com,\n\tJesper Dangaard Brouer <brouer@redhat.com>,\n\tDaniel Borkmann <borkmann@iogearbox.net>,\n\tAlexei Starovoitov <alexei.starovoitov@gmail.com>,\n\tAndy Gospodarek <andy@greyhouse.net>","Date":"Thu, 28 Sep 2017 14:57:18 +0200","Message-ID":"<150660343811.2808.7680200486950101509.stgit@firesoul>","In-Reply-To":"<150660339205.2808.7084136789768233829.stgit@firesoul>","References":"<150660339205.2808.7084136789768233829.stgit@firesoul>","User-Agent":"StGit/0.17.1-dirty","MIME-Version":"1.0","Content-Type":"text/plain; charset=\"utf-8\"","Content-Transfer-Encoding":"7bit","X-Scanned-By":"MIMEDefang 2.79 on 10.5.11.16","X-Greylist":"Sender IP whitelisted, not delayed by milter-greylist-4.5.16\n\t(mx1.redhat.com [10.5.110.27]);\n\tThu, 28 Sep 2017 12:57:24 +0000 (UTC)","Sender":"netdev-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<netdev.vger.kernel.org>","X-Mailing-List":"netdev@vger.kernel.org"},"content":"This patch makes cpumap functional, by adding SKB allocation and\ninvoking the network stack on the dequeuing CPU.\n\nFor constructing the SKB on the remote CPU, the xdp_buff is converted\ninto a struct xdp_pkt, and it is mapped into the top headroom of the\npacket, to avoid allocating separate mem.  For now, struct xdp_pkt is\njust a cpumap internal data structure, with info carried from\nenqueue to dequeue.\n\nIf a driver doesn't have enough headroom, the packet is simply dropped, with\nreturn code -EOVERFLOW.  
This will be picked up by the xdp tracepoint\ninfrastructure, to allow users to catch this.\n\nSigned-off-by: Jesper Dangaard Brouer <brouer@redhat.com>\n---\n kernel/bpf/cpumap.c |  153 ++++++++++++++++++++++++++++++++++++++++++++-------\n 1 file changed, 132 insertions(+), 21 deletions(-)","diff":"diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c\nindex ce2490ad860d..352cc071c9cc 100644\n--- a/kernel/bpf/cpumap.c\n+++ b/kernel/bpf/cpumap.c\n@@ -24,6 +24,9 @@\n #include <linux/workqueue.h>\n #include <linux/kthread.h>\n \n+#include <linux/netdevice.h>   /* netif_receive_skb */\n+#include <linux/etherdevice.h> /* eth_type_trans */\n+\n /*\n  * General idea: XDP packets getting XDP redirected to another CPU,\n  * will maximum be stored/queued for one driver ->poll() call.  It is\n@@ -160,20 +163,139 @@ static void cpu_map_kthread_stop(struct work_struct *work)\n \tkthread_stop(rcpu->kthread); /* calls put_cpu_map_entry */\n }\n \n+/* For now, xdp_pkt is a cpumap internal data structure, with info\n+ * carried from enqueue to dequeue. 
It is mapped into the top\n+ * headroom of the packet, to avoid allocating separate mem.\n+ */\n+struct xdp_pkt {\n+\tvoid *data;\n+\tu16 len;\n+\tu16 headroom;\n+\tstruct net_device *dev_rx;\n+};\n+\n+/* Convert xdp_buff to xdp_pkt */\n+static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp)\n+{\n+\tstruct xdp_pkt *xdp_pkt;\n+\tint headroom;\n+\n+\t/* Assure headroom is available for storing info */\n+\theadroom = xdp->data - xdp->data_hard_start;\n+\tif (headroom < sizeof(*xdp_pkt))\n+\t\treturn NULL;\n+\n+\t/* Store info in top of packet */\n+\txdp_pkt = xdp->data_hard_start;\n+\n+\txdp_pkt->data = xdp->data;\n+\txdp_pkt->len  = xdp->data_end - xdp->data;\n+\txdp_pkt->headroom = headroom - sizeof(*xdp_pkt);\n+\n+\treturn xdp_pkt;\n+}\n+\n+static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,\n+\t\t\t\t\t struct xdp_pkt *xdp_pkt)\n+{\n+\tunsigned int frame_size;\n+\tvoid *pkt_data_start;\n+\tstruct sk_buff *skb;\n+\n+\t/* build_skb needs to place skb_shared_info after SKB end, and\n+\t * also wants to know the memory \"truesize\".  Thus, we need to\n+\t * know the memory frame size backing xdp_buff.\n+\t *\n+\t * XDP was designed to have PAGE_SIZE frames, but this\n+\t * assumption is no longer true with ixgbe and i40e.  It\n+\t * would be preferred to set frame_size to 2048 or 4096\n+\t * depending on the driver.\n+\t *   frame_size = 2048;\n+\t *   frame_len  = frame_size - sizeof(*xdp_pkt);\n+\t *\n+\t * Instead, with info avail, skb_shared_info is placed after\n+\t * packet len.  
This, unfortunately, fakes the truesize.\n+\t * Another disadvantage of this approach is that the skb_shared_info\n+\t * is not at a fixed memory location, with mixed length\n+\t * packets, which is bad for cache-line hotness.\n+\t */\n+\tframe_size = SKB_DATA_ALIGN(xdp_pkt->len) + xdp_pkt->headroom +\n+\t\tSKB_DATA_ALIGN(sizeof(struct skb_shared_info));\n+\n+\tpkt_data_start = xdp_pkt->data - xdp_pkt->headroom;\n+\tskb = build_skb(pkt_data_start, frame_size);\n+\tif (!skb)\n+\t\treturn NULL;\n+\n+\tskb_reserve(skb, xdp_pkt->headroom);\n+\t__skb_put(skb, xdp_pkt->len);\n+\n+\t/* Essential SKB info: protocol and skb->dev */\n+\tskb->protocol = eth_type_trans(skb, xdp_pkt->dev_rx);\n+\n+\t/* Optional SKB info, currently missing:\n+\t * - HW checksum info\t\t(skb->ip_summed)\n+\t * - HW RX hash\t\t\t(skb_set_hash)\n+\t * - RX ring dev queue index\t(skb_record_rx_queue)\n+\t */\n+\n+\treturn skb;\n+}\n+\n static int cpu_map_kthread_run(void *data)\n {\n+\tconst unsigned long busy_poll_jiffies = usecs_to_jiffies(2000);\n+\tunsigned long time_limit = jiffies + busy_poll_jiffies;\n \tstruct bpf_cpu_map_entry *rcpu = data;\n+\tunsigned int empty_cnt = 0;\n \n \tset_current_state(TASK_INTERRUPTIBLE);\n \twhile (!kthread_should_stop()) {\n+\t\tunsigned int processed = 0, drops = 0;\n \t\tstruct xdp_pkt *xdp_pkt;\n \n-\t\tschedule();\n-\t\t/* Do work */\n-\t\twhile ((xdp_pkt = ptr_ring_consume(rcpu->queue))) {\n-\t\t\t/* For now just \"refcnt-free\" */\n-\t\t\tpage_frag_free(xdp_pkt);\n+\t\t/* Release CPU reschedule checks */\n+\t\tif ((time_after_eq(jiffies, time_limit) || empty_cnt > 25) &&\n+\t\t    __ptr_ring_empty(rcpu->queue)) {\n+\t\t\tempty_cnt++;\n+\t\t\tschedule();\n+\t\t\ttime_limit = jiffies + busy_poll_jiffies;\n+\t\t\tWARN_ON(smp_processor_id() != rcpu->cpu);\n+\t\t} else {\n+\t\t\tcond_resched();\n \t\t}\n+\n+\t\t/* Process packets in rcpu->queue */\n+\t\tlocal_bh_disable();\n+\t\t/*\n+\t\t * The bpf_cpu_map_entry is single consumer, with this\n+\t\t * kthread CPU 
pinned. Lockless access to the ptr_ring\n+\t\t * consume side is valid, as no resizing of the queue is allowed.\n+\t\t */\n+\t\twhile ((xdp_pkt = __ptr_ring_consume(rcpu->queue))) {\n+\t\t\tstruct sk_buff *skb;\n+\t\t\tint ret;\n+\n+\t\t\t/* Allow busy polling again */\n+\t\t\tempty_cnt = 0;\n+\n+\t\t\tskb = cpu_map_build_skb(rcpu, xdp_pkt);\n+\t\t\tif (!skb) {\n+\t\t\t\tpage_frag_free(xdp_pkt);\n+\t\t\t\tcontinue;\n+\t\t\t}\n+\n+\t\t\t/* Inject into network stack */\n+\t\t\tret = netif_receive_skb(skb);\n+\t\t\tif (ret == NET_RX_DROP)\n+\t\t\t\tdrops++;\n+\n+\t\t\t/* Limit BH-disable period */\n+\t\t\tif (++processed == 8)\n+\t\t\t\tbreak;\n+\t\t}\n+\t\tlocal_bh_enable();\n+\n \t\t__set_current_state(TASK_INTERRUPTIBLE);\n \t}\n \tput_cpu_map_entry(rcpu);\n@@ -458,13 +580,6 @@ static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,\n \treturn 0;\n }\n \n-/* Notice: Will change in later patch */\n-struct xdp_pkt {\n-\tvoid *data;\n-\tu16 len;\n-\tu16 headroom;\n-};\n-\n /* Runs under RCU-read-side, plus in softirq under NAPI protection.\n  * Thus, safe percpu variable access.\n  */\n@@ -492,17 +607,13 @@ int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp,\n \t\t    struct net_device *dev_rx)\n {\n \tstruct xdp_pkt *xdp_pkt;\n-\tint headroom;\n \n-\t/* Convert xdp_buff to xdp_pkt */\n-\theadroom = xdp->data - xdp->data_hard_start;\n-\tif (headroom < sizeof(*xdp_pkt))\n+\txdp_pkt = convert_to_xdp_pkt(xdp);\n+\tif (!xdp_pkt)\n \t\treturn -EOVERFLOW;\n-\txdp_pkt = xdp->data_hard_start;\n-\txdp_pkt->data = xdp->data;\n-\txdp_pkt->len  = xdp->data_end - xdp->data;\n-\txdp_pkt->headroom = headroom - sizeof(*xdp_pkt);\n-\t/* For now this is just used as a void pointer to data_hard_start */\n+\n+\t/* Info needed when constructing SKB on remote CPU */\n+\txdp_pkt->dev_rx = dev_rx;\n+\n \tbq_enqueue(rcpu, xdp_pkt);\n \treturn 0;\n","prefixes":["net-next","3/5"]}