[{"id":1777313,"web_url":"http://patchwork.ozlabs.org/comment/1777313/","msgid":"<59CD7B94.8010103@iogearbox.net>","list_archive_url":null,"date":"2017-09-28T22:45:40","subject":"Re: [net-next PATCH 0/5] New bpf cpumap type for XDP_REDIRECT","submitter":{"id":65705,"url":"http://patchwork.ozlabs.org/api/people/65705/","name":"Daniel Borkmann","email":"daniel@iogearbox.net"},"content":"On 09/28/2017 02:57 PM, Jesper Dangaard Brouer wrote:\n> Introducing a new way to redirect XDP frames.  Notice how no driver\n> changes are necessary given the design of XDP_REDIRECT.\n>\n> This redirect map type is called 'cpumap', as it allows redirection\n> XDP frames to remote CPUs.  The remote CPU will do the SKB allocation\n> and start the network stack invocation on that CPU.\n>\n> This is a scalability and isolation mechanism, that allow separating\n> the early driver network XDP layer, from the rest of the netstack, and\n> assigning dedicated CPUs for this stage.  The sysadm control/configure\n> the RX-CPU to NIC-RX queue (as usual) via procfs smp_affinity and how\n> many queues are configured via ethtool --set-channels.  Benchmarks\n> show that a single CPU can handle approx 11Mpps.  Thus, only assigning\n> two NIC RX-queues (and two CPUs) is sufficient for handling 10Gbit/s\n> wirespeed smallest packet 14.88Mpps.  Reducing the number of queues\n> have the advantage that more packets being \"bulk\" available per hard\n> interrupt[1].\n>\n> [1] https://www.netdevconf.org/2.1/papers/BusyPollingNextGen.pdf\n>\n> Use-cases:\n>\n> 1. End-host based pre-filtering for DDoS mitigation.  This is fast\n>     enough to allow software to see and filter all packets wirespeed.\n>     Thus, no packets getting silently dropped by hardware.\n>\n> 2. Given NIC HW unevenly distributes packets across RX queue, this\n>     mechanism can be used for redistribution load across CPUs.  This\n>     usually happens when HW is unaware of a new protocol.  This\n>     resembles RPS (Receive Packet Steering), just faster, but with more\n>     responsibility placed on the BPF program for correct steering.\n>\n> 3. Auto-scaling or power saving via only activating the appropriate\n>     number of remote CPUs for handling the current load.  The cpumap\n>     tracepoints can function as a feedback loop for this purpose.\n\nInteresting work, thanks! Still digesting the code a bit. I think\nit pretty much goes into the direction that Eric describes in his\nnetdev paper quoted above; not on a generic level though but specific\nto XDP at least; theoretically XDP could just run transparently on\nthe CPU doing the filtering, and raw buffers are handed to remote\nCPU with similar batching, but it would need some different config\ninterface at minimum.\n\nShouldn't we take the CPU(s) running XDP on the RX queues out from\nthe normal process scheduler, so that we have a guarantee that user\nspace or unrelated kernel tasks cannot interfere with them anymore,\nand we could then turn them into busy polling eventually (e.g. as\nlong as XDP is running there and once off could put them back into\nnormal scheduling domain transparently)?\n\nWhat about RPS/RFS in the sense that once you punt them to remote\nCPU, could we reuse application locality information so they'd end\nup on the right CPU in the first place (w/o backlog detour), or is\nthe intent to rather disable it and have some own orchestration\nwith relation to the CPU map?\n\nCheers,\nDaniel","headers":{"Return-Path":"<netdev-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":"ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=netdev-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3y38t61jtcz9ryr\n\tfor <patchwork-incoming@ozlabs.org>;\n\tFri, 29 Sep 2017 08:45:50 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1751345AbdI1Wps (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tThu, 28 Sep 2017 18:45:48 -0400","from www62.your-server.de ([213.133.104.62]:52750 \"EHLO\n\twww62.your-server.de\" rhost-flags-OK-OK-OK-OK) by vger.kernel.org\n\twith ESMTP id S1750947AbdI1Wpr (ORCPT\n\t<rfc822;netdev@vger.kernel.org>); Thu, 28 Sep 2017 18:45:47 -0400","from [85.7.161.218] (helo=localhost.localdomain)\n\tby www62.your-server.de with esmtpsa (TLSv1.2:DHE-RSA-AES256-SHA:256)\n\t(Exim 4.85_2) (envelope-from <daniel@iogearbox.net>)\n\tid 1dxhYt-0001WN-4A; Fri, 29 Sep 2017 00:45:43 +0200"],"Message-ID":"<59CD7B94.8010103@iogearbox.net>","Date":"Fri, 29 Sep 2017 00:45:40 +0200","From":"Daniel Borkmann <daniel@iogearbox.net>","User-Agent":"Mozilla/5.0 (X11; Linux x86_64;\n\trv:31.0) Gecko/20100101 Thunderbird/31.7.0","MIME-Version":"1.0","To":"Jesper Dangaard Brouer <brouer@redhat.com>, netdev@vger.kernel.org","CC":"jakub.kicinski@netronome.com, \"Michael S. Tsirkin\" <mst@redhat.com>,\n\tJason Wang <jasowang@redhat.com>, mchan@broadcom.com,\n\tJohn Fastabend <john.fastabend@gmail.com>, peter.waskiewicz.jr@intel.com,\n\tDaniel Borkmann <borkmann@iogearbox.net>,\n\tAlexei Starovoitov <alexei.starovoitov@gmail.com>,\n\tAndy Gospodarek <andy@greyhouse.net>, edumazet@google.com","Subject":"Re: [net-next PATCH 0/5] New bpf cpumap type for XDP_REDIRECT","References":"<150660339205.2808.7084136789768233829.stgit@firesoul>","In-Reply-To":"<150660339205.2808.7084136789768233829.stgit@firesoul>","Content-Type":"text/plain; charset=utf-8; format=flowed","Content-Transfer-Encoding":"7bit","X-Authenticated-Sender":"daniel@iogearbox.net","X-Virus-Scanned":"Clear (ClamAV 0.99.2/23884/Thu Sep 28 22:46:49 2017)","Sender":"netdev-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<netdev.vger.kernel.org>","X-Mailing-List":"netdev@vger.kernel.org"}},{"id":1777374,"web_url":"http://patchwork.ozlabs.org/comment/1777374/","msgid":"<20170929085313.4ff4815b@redhat.com>","list_archive_url":null,"date":"2017-09-29T06:53:13","subject":"Re: [net-next PATCH 0/5] New bpf cpumap type for XDP_REDIRECT","submitter":{"id":13625,"url":"http://patchwork.ozlabs.org/api/people/13625/","name":"Jesper Dangaard Brouer","email":"brouer@redhat.com"},"content":"On Fri, 29 Sep 2017 00:45:40 +0200\nDaniel Borkmann <daniel@iogearbox.net> wrote:\n\n> On 09/28/2017 02:57 PM, Jesper Dangaard Brouer wrote:\n> > Introducing a new way to redirect XDP frames.  Notice how no driver\n> > changes are necessary given the design of XDP_REDIRECT.\n> >\n> > This redirect map type is called 'cpumap', as it allows redirection\n> > XDP frames to remote CPUs.  The remote CPU will do the SKB allocation\n> > and start the network stack invocation on that CPU.\n> >\n> > This is a scalability and isolation mechanism, that allow separating\n> > the early driver network XDP layer, from the rest of the netstack, and\n> > assigning dedicated CPUs for this stage.  The sysadm control/configure\n> > the RX-CPU to NIC-RX queue (as usual) via procfs smp_affinity and how\n> > many queues are configured via ethtool --set-channels.  Benchmarks\n> > show that a single CPU can handle approx 11Mpps.  Thus, only assigning\n> > two NIC RX-queues (and two CPUs) is sufficient for handling 10Gbit/s\n> > wirespeed smallest packet 14.88Mpps.  Reducing the number of queues\n> > have the advantage that more packets being \"bulk\" available per hard\n> > interrupt[1].\n> >\n> > [1] https://www.netdevconf.org/2.1/papers/BusyPollingNextGen.pdf\n> >\n> > Use-cases:\n> >\n> > 1. End-host based pre-filtering for DDoS mitigation.  This is fast\n> >     enough to allow software to see and filter all packets wirespeed.\n> >     Thus, no packets getting silently dropped by hardware.\n> >\n> > 2. Given NIC HW unevenly distributes packets across RX queue, this\n> >     mechanism can be used for redistribution load across CPUs.  This\n> >     usually happens when HW is unaware of a new protocol.  This\n> >     resembles RPS (Receive Packet Steering), just faster, but with more\n> >     responsibility placed on the BPF program for correct steering.\n> >\n> > 3. Auto-scaling or power saving via only activating the appropriate\n> >     number of remote CPUs for handling the current load.  The cpumap\n> >     tracepoints can function as a feedback loop for this purpose.  \n> \n> Interesting work, thanks! Still digesting the code a bit. I think\n> it pretty much goes into the direction that Eric describes in his\n> netdev paper quoted above; not on a generic level though but specific\n> to XDP at least; theoretically XDP could just run transparently on\n> the CPU doing the filtering, and raw buffers are handed to remote\n> CPU with similar batching, but it would need some different config\n> interface at minimum.\n\nGood that you noticed this is (implicit) implementing RX bulking, which\nis where much of the performance gain originates from.\n\nIt is true, I am inspired by Eric's paper (I love it). Do notice that\nthis is not blocking or interfering with Erics/others continued work in\nthis area.  This implementation just show that the section \"break the\npipe!\" idea works very well for XDP. \n\nMore on config knobs below.\n \n> Shouldn't we take the CPU(s) running XDP on the RX queues out from\n> the normal process scheduler, so that we have a guarantee that user\n> space or unrelated kernel tasks cannot interfere with them anymore,\n> and we could then turn them into busy polling eventually (e.g. as\n> long as XDP is running there and once off could put them back into\n> normal scheduling domain transparently)?\n\nWe should be careful not to invent networking config knobs that belongs\nto other parts of the kernel, like the scheduler.  We already have\nability to control where IRQ's land via procfs smp_affinity.  And if\nyou want to avoid CPU isolation, we can use the boot cmdline\n\"isolcpus\" (hint like DPDK recommend/use for zero-loss configs).  It is\nthe userspace tool (or sysadm) loading the XDP program, who is\nresponsible for having configures the CPU smp_affinity alignment.\n\nMaking NAPI busy-poll is out of scope for this patchset. Someone\nshould work on this separately.  It would just help/improve this kind\nof scheme.\n\nI actually think it would be more relevant to add/put the \"remote\" CPUs\nin the 'cpumap' into a separate scheduler group.  To implement stuff\nlike auto-scaling and power-saving.\n\n\n> What about RPS/RFS in the sense that once you punt them to remote\n> CPU, could we reuse application locality information so they'd end\n> up on the right CPU in the first place (w/o backlog detour), or is\n> the intent to rather disable it and have some own orchestration\n> with relation to the CPU map?\n\nAn advanced bpf orchestration could basically implement what you\ndescribe, combined with a userspace side tool that taskset/pin\napplications.  To know when a task can move between CPUs, you use the\ntracepoints to see when the CPU queue is empty (hint, time_limit=true\nand processed=0).\n\nFor now, I'm not targeting such advanced use-cases.  My main target is\na customer that have double tagged VLANS, and ixgbe cannot RSS\ndistribute these, thus they all end-up on queue 0.  And as I\ndemonstrated (in another email) RPS is too slow to fix this.","headers":{"Return-Path":"<netdev-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":["ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=netdev-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","ext-mx04.extmail.prod.ext.phx2.redhat.com;\n\tdmarc=none (p=none dis=none) header.from=redhat.com","ext-mx04.extmail.prod.ext.phx2.redhat.com;\n\tspf=fail smtp.mailfrom=brouer@redhat.com"],"Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3y3Mhm1wcJz9t2f\n\tfor <patchwork-incoming@ozlabs.org>;\n\tFri, 29 Sep 2017 16:53:28 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1751921AbdI2GxZ (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tFri, 29 Sep 2017 02:53:25 -0400","from mx1.redhat.com ([209.132.183.28]:58100 \"EHLO mx1.redhat.com\"\n\trhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP\n\tid S1751219AbdI2GxY (ORCPT <rfc822;netdev@vger.kernel.org>);\n\tFri, 29 Sep 2017 02:53:24 -0400","from smtp.corp.redhat.com\n\t(int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12])\n\t(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))\n\t(No client certificate requested)\n\tby mx1.redhat.com (Postfix) with ESMTPS id 1F1CC8553D;\n\tFri, 29 Sep 2017 06:53:24 +0000 (UTC)","from localhost (ovpn-200-30.brq.redhat.com [10.40.200.30])\n\tby smtp.corp.redhat.com (Postfix) with ESMTP id DC1E01880B;\n\tFri, 29 Sep 2017 06:53:15 +0000 (UTC)"],"DMARC-Filter":"OpenDMARC Filter v1.3.2 mx1.redhat.com 1F1CC8553D","Date":"Fri, 29 Sep 2017 08:53:13 +0200","From":"Jesper Dangaard Brouer <brouer@redhat.com>","To":"Daniel Borkmann <daniel@iogearbox.net>","Cc":"netdev@vger.kernel.org, jakub.kicinski@netronome.com,\n\t\"Michael S. Tsirkin\" <mst@redhat.com>,\n\tJason Wang <jasowang@redhat.com>, mchan@broadcom.com,\n\tJohn Fastabend <john.fastabend@gmail.com>, peter.waskiewicz.jr@intel.com,\n\tDaniel Borkmann <borkmann@iogearbox.net>,\n\tAlexei Starovoitov <alexei.starovoitov@gmail.com>,\n\tAndy Gospodarek <andy@greyhouse.net>, edumazet@google.com,\n\tbrouer@redhat.com","Subject":"Re: [net-next PATCH 0/5] New bpf cpumap type for XDP_REDIRECT","Message-ID":"<20170929085313.4ff4815b@redhat.com>","In-Reply-To":"<59CD7B94.8010103@iogearbox.net>","References":"<150660339205.2808.7084136789768233829.stgit@firesoul>\n\t<59CD7B94.8010103@iogearbox.net>","MIME-Version":"1.0","Content-Type":"text/plain; charset=US-ASCII","Content-Transfer-Encoding":"7bit","X-Scanned-By":"MIMEDefang 2.79 on 10.5.11.12","X-Greylist":"Sender IP whitelisted, not delayed by milter-greylist-4.5.16\n\t(mx1.redhat.com [10.5.110.28]);\n\tFri, 29 Sep 2017 06:53:24 +0000 (UTC)","Sender":"netdev-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<netdev.vger.kernel.org>","X-Mailing-List":"netdev@vger.kernel.org"}}]