[{"id":1773910,"web_url":"http://patchwork.ozlabs.org/comment/1773910/","msgid":"<1506117524.29839.176.camel@edumazet-glaptop3.roam.corp.google.com>","list_archive_url":null,"date":"2017-09-22T21:58:44","subject":"Re: [RFC PATCH 00/11] udp: full early demux for unconnected sockets","submitter":{"id":2404,"url":"http://patchwork.ozlabs.org/api/people/2404/","name":"Eric Dumazet","email":"eric.dumazet@gmail.com"},"content":"On Fri, 2017-09-22 at 23:06 +0200, Paolo Abeni wrote:\n> This series refactor the UDP early demux code so that:\n> \n> * full socket lookup is performed for unicast packets\n> * a sk is grabbed even for unconnected socket match\n> * a dst cache is used even in such scenario\n> \n> To perform this tasks a couple of facilities are added:\n> \n> * noref socket references, scoped inside the current RCU section, to be\n>   explicitly cleared before leaving such section\n> * a dst cache inside the inet and inet6 local addresses tables, caching the\n>   related local dst entry\n> \n> The measured performance gain under small packet UDP flood is as follow:\n> \n> ingress NIC\tvanilla\t\tpatched\t\tdelta\n> rx queues\t(kpps)\t\t(kpps)\t\t(%)\n> [ipv4]\n> 1\t\t2177\t\t2414\t\t10\n> 2\t\t2527\t\t2892\t\t14\n> 3\t\t3050\t\t3733\t\t22\n\n\nThis is a clear sign your program is not using latest SO_REUSEPORT +\n[ec]BPF filter [1]\n\nreturn socket[RX_QUEUE# | or CPU#];\n\nIf udp_sink uses SO_REUSEPORT with no extra hint, socket selection is\nbased on a lazy hash, meaning that you do not have proper siloing.\n\nreturn socket[hash(skb)];\n\nMultiple cpus can then :\n - compete on grabbing same socket refcount\n - compete on grabbing the receive queue lock\n - compete for releasing lock and socket refcount\n - skb freeing done on different cpus than where allocated.\n\nYou are adding complexity to the kernel because you are using a\nsub-optimal user space program, favoring false sharing.\n\nFirst solve the false sharing issue.\n\nPerformance with 2 rx queues should 
be almost twice the performance with\n1 rx queue.\n\nThen we can see if the gains you claim are still applicable.\n\nThanks\n\nPS: Wei Wan is about to release the IPV6 changes so that the big\ndifferences you showed are going to disappear soon.\n\nRefs [1]\n\ntools/testing/selftests/net/reuseport_bpf.c\n\n6a5ef90c58daada158ba16ba330558efc3471491 Merge branch 'faster-soreuseport'\n3ca8e4029969d40ab90e3f1ecd83ab1cadd60fbb soreuseport: BPF selection functional test\n538950a1b7527a0a52ccd9337e3fcd304f027f13 soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF\ne32ea7e747271a0abcd37e265005e97cc81d9df5 soreuseport: fast reuseport UDP socket selection\nef456144da8ef507c8cf504284b6042e9201a05c soreuseport: define reuseport groups","headers":{"Return-Path":"<netdev-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":["ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=netdev-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","ozlabs.org; dkim=pass (2048-bit key;\n\tunprotected) header.d=gmail.com header.i=@gmail.com\n\theader.b=\"EozDZ2E+\"; dkim-atps=neutral"],"Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3xzS6f2hrCz9t16\n\tfor <patchwork-incoming@ozlabs.org>;\n\tSat, 23 Sep 2017 07:58:50 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S1752563AbdIVV6r (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tFri, 22 Sep 2017 17:58:47 -0400","from mail-pf0-f196.google.com ([209.85.192.196]:36328 \"EHLO\n\tmail-pf0-f196.google.com\" rhost-flags-OK-OK-OK-OK) by vger.kernel.org\n\twith ESMTP id S1752472AbdIVV6q (ORCPT\n\t<rfc822;netdev@vger.kernel.org>); Fri, 22 Sep 2017 17:58:46 -0400","by mail-pf0-f196.google.com with SMTP id f84so964426pfj.3\n\tfor <netdev@vger.kernel.org>; Fri, 22 Sep 2017 14:58:46 -0700 
(PDT)","from ?IPv6:2620:15c:2c1:100:7c10:8290:3f81:dde2?\n\t([2620:15c:2c1:100:7c10:8290:3f81:dde2])\n\tby smtp.googlemail.com with ESMTPSA id\n\ts68sm965332pfd.72.2017.09.22.14.58.44\n\t(version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128);\n\tFri, 22 Sep 2017 14:58:45 -0700 (PDT)"],"DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=gmail.com; s=20161025;\n\th=message-id:subject:from:to:cc:date:in-reply-to:references\n\t:mime-version:content-transfer-encoding;\n\tbh=vYU2zfUKyY1HEElxaDOhm85YF6s5AtxUJG4slbY21FE=;\n\tb=EozDZ2E+OKX75M2CFwUjXHVRmNypPgziYWDhPRww/DGR4FuUI6gMcFEsNLKcxzQp3B\n\t56LFwVqgHAusSHFw9SKsGBHV0FUcmvRn+V7s3AwpUtlAUDJT85CcAOkexmuKm8Pe12gW\n\tq5xSyEeGgZ9veYb5Bnj1j4zpAPCcKpWB3FkzRjbOkRySt97BZ9hOIqzaPs+rZXBmyhIr\n\tomi5vRXw12hlte6zOl3pzGjkRJTxxXXojB7apSwWpji8deGrtgXKefPS0+aUpCXEU7W4\n\tRKzgy9vcgyOgurvb0ixJ8hMy6al6mbmINR2lQyXtPHvHfQmAhhfRFs47JZYbLd8/s9zm\n\toJbg==","X-Google-DKIM-Signature":"v=1; a=rsa-sha256; c=relaxed/relaxed;\n\td=1e100.net; s=20161025;\n\th=x-gm-message-state:message-id:subject:from:to:cc:date:in-reply-to\n\t:references:mime-version:content-transfer-encoding;\n\tbh=vYU2zfUKyY1HEElxaDOhm85YF6s5AtxUJG4slbY21FE=;\n\tb=ejIvdlO4eHIjWWPyPHbKjU5ozRq6I4dR1gMrGqS0/EyWhglLJNkt0t3iWwKiq9OQnL\n\tyF+3GL+x7cYwbdhrDRZTH3CfQTb939+a8eHqhQarQlNOb6GZaIKXTPyGQVGXw5vX3WVA\n\tls4qupGKivwQ/fXfAr08FYiC3Kemq9zmw5NHAl9p4KXT8XviQLfWM1p488cK1Vw7Oh8m\n\ti8FBijfN31+blurGOib/YzMwJppOO3JjCJJs6yFmWumBTbm1tQ/ZNQTwjz0n7l9NGUI3\n\t1tjeCZ64y9bLXOVsPVnYfiHXIzZF6FBm9qj8yqblPqFTeYFx0USdKFydLeds02wHRrUj\n\trl/g==","X-Gm-Message-State":"AHPjjUjYKQM/1mMYZmJxz7Kj3IXsSApbtlJo/BKRuocS0SrCixrRbX/N\n\tTZIufZXSub4NUPqgwcp6BIQ=","X-Google-Smtp-Source":"AOwi7QDVi0n6XfpCfIiqH829BQoAXFBbK6rOsZoE8XZl2PkYPytKtpN9wNHYbV1hr35mvFZQUUlNQg==","X-Received":"by 10.98.246.17 with SMTP id x17mr419447pfh.209.1506117526201;\n\tFri, 22 Sep 2017 14:58:46 -0700 (PDT)","Message-ID":"<1506117524.29839.176.camel@edumazet-glaptop3.roam.corp.google.com>","Subject":"Re: [RFC PATCH 
00/11] udp: full early demux for unconnected sockets","From":"Eric Dumazet <eric.dumazet@gmail.com>","To":"Paolo Abeni <pabeni@redhat.com>","Cc":"netdev@vger.kernel.org, \"David S. Miller\" <davem@davemloft.net>,\n\tPablo Neira Ayuso <pablo@netfilter.org>, Florian Westphal <fw@strlen.de>,\n\tEric Dumazet <edumazet@google.com>,\n\tHannes Frederic Sowa <hannes@stressinduktion.org>","Date":"Fri, 22 Sep 2017 14:58:44 -0700","In-Reply-To":"<cover.1506114055.git.pabeni@redhat.com>","References":"<cover.1506114055.git.pabeni@redhat.com>","Content-Type":"text/plain; charset=\"UTF-8\"","X-Mailer":"Evolution 3.10.4-0ubuntu2 ","Mime-Version":"1.0","Content-Transfer-Encoding":"7bit","Sender":"netdev-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<netdev.vger.kernel.org>","X-Mailing-List":"netdev@vger.kernel.org"}},{"id":1774962,"web_url":"http://patchwork.ozlabs.org/comment/1774962/","msgid":"<1506371169.2614.3.camel@redhat.com>","list_archive_url":null,"date":"2017-09-25T20:26:09","subject":"Re: [RFC PATCH 00/11] udp: full early demux for unconnected sockets","submitter":{"id":67312,"url":"http://patchwork.ozlabs.org/api/people/67312/","name":"Paolo Abeni","email":"pabeni@redhat.com"},"content":"On Fri, 2017-09-22 at 14:58 -0700, Eric Dumazet wrote:\n> On Fri, 2017-09-22 at 23:06 +0200, Paolo Abeni wrote:\n> > This series refactor the UDP early demux code so that:\n> > \n> > * full socket lookup is performed for unicast packets\n> > * a sk is grabbed even for unconnected socket match\n> > * a dst cache is used even in such scenario\n> > \n> > To perform this tasks a couple of facilities are added:\n> > \n> > * noref socket references, scoped inside the current RCU section, to be\n> >   explicitly cleared before leaving such section\n> > * a dst cache inside the inet and inet6 local addresses tables, caching the\n> >   related local dst entry\n> > \n> > The measured performance gain under small packet UDP flood is as follow:\n> > \n> > ingress 
NIC\tvanilla\t\tpatched\t\tdelta\n> > rx queues\t(kpps)\t\t(kpps)\t\t(%)\n> > [ipv4]\n> > 1\t\t2177\t\t2414\t\t10\n> > 2\t\t2527\t\t2892\t\t14\n> > 3\t\t3050\t\t3733\t\t22\n> \n> \n> This is a clear sign your program is not using latest SO_REUSEPORT +\n> [ec]BPF filter [1]\n> \n> return socket[RX_QUEUE# | or CPU#];\n> \n> If udp_sink uses SO_REUSEPORT with no extra hint, socket selection is\n> based on a lazy hash, meaning that you do not have proper siloing.\n> \n> return socket[hash(skb)];\n> \n> Multiple cpus can then :\n>  - compete on grabbing same socket refcount\n>  - compete on grabbing the receive queue lock\n>  - compete for releasing lock and socket refcount\n>  - skb freeing done on different cpus than where allocated.\n> \n> You are adding complexity to the kernel because you are using a\n> sub-optimal user space program, favoring false sharing.\n> \n> First solve the false sharing issue.\n> \n> Performance with 2 rx queues should be almost twice the performance with\n> 1 rx queue.\n> \n> Then we can see if the gains you claim are still applicable.\n\nHere are the performance results using a BPF filter to distribute the\ningress packet to the reuseport socket with the same id of the ingress\nCPU - we have 1 to 1 mapping between the ingress receive queue and the\ndestination socket:\n\ningress NIC     vanilla         patched         delta\nrx queues       (kpps)          (kpps)          (%)\n[ipv4]\n2               3020                3663                21\n3               4352                5179                19\n4               5318                6194                16\n5               6258                7583                21\n6               7376                8558                16\n\n[ipv6]\n2               2446                3949                61\n3               3099                5092                64\n4               3698                6611                78\n5               4382                7852                79\n6              
 5116                8851                73\n\nSome notes:\n\n- figures obtained with: \n\nethtool  -L em2 combined $n\nMASK=1\nfor I in `seq 0 $((n - 1))`; do\n        [ $I -eq 0 ] && USE_BPF=\"--use_bpf\" || USE_BPF=\"\"\n        udp_sink  --reuseport $USE_BPF --recvfrom --count 10000000 --port 9 &\n        taskset -p $((MASK << ($I + $n) )) $!\ndone\n\n- in the IPv6 routing code we currently have a relevant bottleneck in\nip6_pol_route(): I see a lot of contention on a dst refcount, so\nwithout early demux performance does not scale well there.\n\n- For maximum performance, the BH and the user-space sink need to run on\ndifferent CPUs - yes, we have some more cacheline misses and a little\ncontention on the receive queue spin lock, but far fewer icache misses\nand more CPU cycles available; overall throughput is much higher than\nbinding on the same CPU where the BH is running.\n\n> PS: Wei Wan is about to release the IPV6 changes so that the big\n> differences you showed are going to disappear soon.\n\nInteresting, looking forward to that!\n\nCheers,\n\nPaolo","headers":{"Return-Path":"<netdev-owner@vger.kernel.org>","X-Original-To":"patchwork-incoming@ozlabs.org","Delivered-To":"patchwork-incoming@ozlabs.org","Authentication-Results":["ozlabs.org;\n\tspf=none (mailfrom) smtp.mailfrom=vger.kernel.org\n\t(client-ip=209.132.180.67; helo=vger.kernel.org;\n\tenvelope-from=netdev-owner@vger.kernel.org;\n\treceiver=<UNKNOWN>)","ext-mx08.extmail.prod.ext.phx2.redhat.com;\n\tdmarc=none (p=none dis=none) header.from=redhat.com","ext-mx08.extmail.prod.ext.phx2.redhat.com;\n\tspf=fail smtp.mailfrom=pabeni@redhat.com"],"Received":["from vger.kernel.org (vger.kernel.org [209.132.180.67])\n\tby ozlabs.org (Postfix) with ESMTP id 3y1FwS1hb9z9s82\n\tfor <patchwork-incoming@ozlabs.org>;\n\tTue, 26 Sep 2017 06:26:16 +1000 (AEST)","(majordomo@vger.kernel.org) by vger.kernel.org via listexpand\n\tid S936240AbdIYU0O (ORCPT <rfc822;patchwork-incoming@ozlabs.org>);\n\tMon, 25 Sep 2017 
16:26:14 -0400","from mx1.redhat.com ([209.132.183.28]:41176 \"EHLO mx1.redhat.com\"\n\trhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP\n\tid S934254AbdIYU0N (ORCPT <rfc822;netdev@vger.kernel.org>);\n\tMon, 25 Sep 2017 16:26:13 -0400","from smtp.corp.redhat.com\n\t(int-mx01.intmail.prod.int.phx2.redhat.com [10.5.11.11])\n\t(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))\n\t(No client certificate requested)\n\tby mx1.redhat.com (Postfix) with ESMTPS id D63AEC0587E9;\n\tMon, 25 Sep 2017 20:26:12 +0000 (UTC)","from ovpn-116-53.ams2.redhat.com (ovpn-116-53.ams2.redhat.com\n\t[10.36.116.53])\n\tby smtp.corp.redhat.com (Postfix) with ESMTP id AEAA0600C6;\n\tMon, 25 Sep 2017 20:26:10 +0000 (UTC)"],"DMARC-Filter":"OpenDMARC Filter v1.3.2 mx1.redhat.com D63AEC0587E9","Message-ID":"<1506371169.2614.3.camel@redhat.com>","Subject":"Re: [RFC PATCH 00/11] udp: full early demux for unconnected sockets","From":"Paolo Abeni <pabeni@redhat.com>","To":"Eric Dumazet <eric.dumazet@gmail.com>","Cc":"netdev@vger.kernel.org, \"David S. Miller\" <davem@davemloft.net>,\n\tPablo Neira Ayuso <pablo@netfilter.org>, Florian Westphal <fw@strlen.de>,\n\tEric Dumazet <edumazet@google.com>,\n\tHannes Frederic Sowa <hannes@stressinduktion.org>","Date":"Mon, 25 Sep 2017 22:26:09 +0200","In-Reply-To":"<1506117524.29839.176.camel@edumazet-glaptop3.roam.corp.google.com>","References":"<cover.1506114055.git.pabeni@redhat.com>\n\t<1506117524.29839.176.camel@edumazet-glaptop3.roam.corp.google.com>","Content-Type":"text/plain; charset=\"UTF-8\"","Mime-Version":"1.0","Content-Transfer-Encoding":"7bit","X-Scanned-By":"MIMEDefang 2.79 on 10.5.11.11","X-Greylist":"Sender IP whitelisted, not delayed by milter-greylist-4.5.16\n\t(mx1.redhat.com [10.5.110.32]);\n\tMon, 25 Sep 2017 20:26:13 +0000 (UTC)","Sender":"netdev-owner@vger.kernel.org","Precedence":"bulk","List-ID":"<netdev.vger.kernel.org>","X-Mailing-List":"netdev@vger.kernel.org"}}]