From patchwork Thu Dec  6 20:36:34 2012
X-Patchwork-Submitter: Willem de Bruijn
X-Patchwork-Id: 204318
X-Patchwork-Delegate: davem@davemloft.net
From: Willem de Bruijn
To: netdev@vger.kernel.org, davem@davemloft.net, edumazet@google.com,
	therbert@google.com
Cc: Willem de Bruijn
Subject: [PATCH net-next] rps: overflow prevention for saturated cpus
Date: Thu,  6 Dec 2012 15:36:34 -0500
Message-Id: <1354826194-9289-1-git-send-email-willemb@google.com>
X-Mailer: git-send-email 1.7.7.3
X-Mailing-List: netdev@vger.kernel.org

RPS and RFS balance load across cpus with flow affinity. This can cause
local bottlenecks, where a small number of flows, or a single large
(DoS) flow, can saturate one CPU while others are idle.
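For readers less familiar with RPS: the target cpu is chosen by scaling the
packet's flow hash into the rx queue's rps_map, so every packet of a flow
lands on the same cpu. The following standalone userspace sketch
(illustration only, not part of the patch; pick_rps_cpu() is a hypothetical
stand-in for the kernel's map->cpus[((u64) skb->rxhash * map->len) >> 32]
lookup) shows why a single large flow pins a single cpu:

    #include <stdint.h>
    #include <stdio.h>

    /* Scale a 32-bit flow hash into a cpu map of length len, as RPS does. */
    static int pick_rps_cpu(uint32_t rxhash, const int *cpus, uint32_t len)
    {
            return cpus[((uint64_t) rxhash * len) >> 32];
    }

    int main(void)
    {
            int cpus[] = { 0, 1, 2, 3 };

            /* All packets with the same hash hit the same cpu; only distinct
             * hashes spread load, so one flow saturates one cpu. */
            printf("flow 0x12345678 -> cpu %d\n", pick_rps_cpu(0x12345678, cpus, 4));
            printf("flow 0x9abcdef0 -> cpu %d\n", pick_rps_cpu(0x9abcdef0, cpus, 4));
            return 0;
    }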
This patch maintains flow affinity in normal conditions, but trades it
for throughput when a cpu becomes saturated. Then, packets destined to
that cpu (only) are redirected to the least loaded cpu in the rxqueue's
rps_map. This breaks flow affinity under high load for some flows, in
favor of processing packets up to the capacity of the complete rps_map
cpuset in all circumstances.

Overload on an rps cpu is detected when the cpu's input queue exceeds
a high watermark. This threshold, netdev_max_rps_backlog, is
configurable through sysctl. By default, it is equal to
netdev_max_backlog, the threshold at which packets are dropped, so the
behavior is disabled by default. It is enabled by setting the rps
threshold lower than the drop threshold.

This mechanism is orthogonal to filtering approaches that handle
unwanted large flows (DoS). The goal here is to avoid dropping any
traffic until the entire system is saturated.

Tested:
Sent a steady stream of 1.4M UDP packets to a machine with four RPS
cpus 0--3, set in /sys/class/net/eth0/queues/rx-N/rps_cpus. RFS is
disabled to illustrate load balancing more clearly. Showing the output
from /proc/net/softnet_stat, columns 0 (processed), 1 (dropped),
2 (time squeeze) and 9 (rps), for the relevant CPUs:

- without patch, 40 source IP addresses (i.e., balanced):
0: ok=00409483 drop=00000000 time=00001224 rps=00000089
1: ok=00496336 drop=00051365 time=00001551 rps=00000000
2: ok=00374380 drop=00000000 time=00001105 rps=00000129
3: ok=00411348 drop=00000000 time=00001175 rps=00000165

- without patch, 1 source IP address:
0: ok=00856313 drop=00863842 time=00002676 rps=00000000
1: ok=00000003 drop=00000000 time=00000000 rps=00000001
2: ok=00000001 drop=00000000 time=00000000 rps=00000001
3: ok=00000014 drop=00000000 time=00000000 rps=00000000

- with patch, 1 source IP address:
0: ok=00278675 drop=00000000 time=00000475 rps=00001201
1: ok=00276154 drop=00000000 time=00000459 rps=00001213
2: ok=00647050 drop=00000000 time=00002022 rps=00000000
3: ok=00276228 drop=00000000 time=00000464 rps=00001218

(let me know if commit messages like this are too wordy)
---
 Documentation/networking/scaling.txt |   12 +++++++++++
 include/linux/netdevice.h            |    3 ++
 net/core/dev.c                       |   37 ++++++++++++++++++++++++++++++++-
 net/core/sysctl_net_core.c           |    9 ++++++++
 4 files changed, 59 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
index 579994a..f454564 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -135,6 +135,18 @@ packets have been queued to their backlog queue. The IPI wakes backlog
 processing on the remote CPU, and any queued packets are then processed
 up the networking stack.
 
+==== RPS Overflow Protection
+
+By selecting the same cpu from the cpuset for each packet in the same
+flow, RPS will cause load imbalance when input flows are not uniformly
+random. In the extreme case of a single flow, all packets are handled on
+a single CPU, which limits the throughput of the machine to the throughput
+of that CPU. RPS has optional overflow protection, which disables flow
+affinity when an RPS CPU becomes saturated: during overload, its packets
+will be sent to the least loaded other CPU in the RPS cpuset. To enable
+this option, set sysctl net.core.netdev_max_rps_backlog to be smaller than
+net.core.netdev_max_backlog. Setting it to half is a reasonable heuristic.
+
 ==== RPS Configuration
 
 RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 18c5dc9..84624fa 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2609,6 +2609,9 @@ extern void netdev_stats_to_stats64(struct rtnl_link_stats64 *stats64,
 					    const struct net_device_stats *netdev_stats);
 
 extern int netdev_max_backlog;
+#ifdef CONFIG_RPS
+extern int netdev_max_rps_backlog;
+#endif
 extern int netdev_tstamp_prequeue;
 extern int weight_p;
 extern int bpf_jit_enable;
diff --git a/net/core/dev.c b/net/core/dev.c
index 2f94df2..08c99ad 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2734,6 +2734,9 @@ EXPORT_SYMBOL(dev_queue_xmit);
 
 int netdev_max_backlog __read_mostly = 1000;
 EXPORT_SYMBOL(netdev_max_backlog);
+#ifdef CONFIG_RPS
+int netdev_max_rps_backlog __read_mostly = 1000;
+#endif
 int netdev_tstamp_prequeue __read_mostly = 1;
 int netdev_budget __read_mostly = 300;
 int weight_p __read_mostly = 64;            /* old backlog weight */
@@ -2834,6 +2837,36 @@ set_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 	return rflow;
 }
 
+/* @return cpu under normal conditions, another rps_cpu if backlogged. */
+static int get_rps_overflow_cpu(int cpu, const struct rps_map *map)
+{
+	struct softnet_data *sd;
+	unsigned int cur, tcpu, min;
+	int i;
+
+	if (skb_queue_len(&per_cpu(softnet_data, cpu).input_pkt_queue) <
+	    netdev_max_rps_backlog || !map)
+		return cpu;
+
+	/* leave room to prioritize the flows sent to the cpu by rxhash. */
+	min = netdev_max_rps_backlog;
+	min -= min >> 3;
+
+	for (i = 0; i < map->len; i++) {
+		tcpu = map->cpus[i];
+		if (cpu_online(tcpu)) {
+			sd = &per_cpu(softnet_data, tcpu);
+			cur = skb_queue_len(&sd->input_pkt_queue);
+			if (cur < min) {
+				min = cur;
+				cpu = tcpu;
+			}
+		}
+	}
+
+	return cpu;
+}
+
 /*
  * get_rps_cpu is called from netif_receive_skb and returns the target
  * CPU from the RPS map of the receiving queue for a given skb.
@@ -2912,7 +2945,7 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 
 		if (tcpu != RPS_NO_CPU && cpu_online(tcpu)) {
 			*rflowp = rflow;
-			cpu = tcpu;
+			cpu = get_rps_overflow_cpu(tcpu, map);
 			goto done;
 		}
 	}
@@ -2921,7 +2954,7 @@ static int get_rps_cpu(struct net_device *dev, struct sk_buff *skb,
 		tcpu = map->cpus[((u64) skb->rxhash * map->len) >> 32];
 
 		if (cpu_online(tcpu)) {
-			cpu = tcpu;
+			cpu = get_rps_overflow_cpu(tcpu, map);
 			goto done;
 		}
 	}
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index d1b0804..c1b7829 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -129,6 +129,15 @@ static struct ctl_table net_core_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+#ifdef CONFIG_RPS
+	{
+		.procname	= "netdev_max_rps_backlog",
+		.data		= &netdev_max_rps_backlog,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+#endif
 #ifdef CONFIG_BPF_JIT
 	{
 		.procname	= "bpf_jit_enable",
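To make the overflow policy concrete, here is a standalone userspace sketch
of the same idea as get_rps_overflow_cpu() (illustration only, not kernel
code; MAX_RPS_BACKLOG and backlog[] are hypothetical stand-ins for
netdev_max_rps_backlog and the per-cpu input_pkt_queue lengths). Once the
hashed cpu's backlog reaches the watermark, packets fall back to the least
loaded cpu in the map; the 1/8 headroom keeps hashed flows prioritized on
their own cpu:

    #include <stdio.h>

    #define MAX_RPS_BACKLOG 1000    /* stands in for netdev_max_rps_backlog */

    /* Mocked per-cpu input queue lengths; in the kernel these come from
     * skb_queue_len(&per_cpu(softnet_data, cpu).input_pkt_queue). */
    static unsigned int backlog[4] = { 1000, 120, 950, 40 };

    static int overflow_cpu(int cpu, const int *map, int len)
    {
            unsigned int min = MAX_RPS_BACKLOG - (MAX_RPS_BACKLOG >> 3);
            int i;

            /* Below the watermark: keep flow affinity. */
            if (backlog[cpu] < MAX_RPS_BACKLOG)
                    return cpu;

            /* Saturated: pick the least loaded cpu in the map, but only one
             * sitting below the 7/8 headroom, so hashed flows keep priority. */
            for (i = 0; i < len; i++) {
                    if (backlog[map[i]] < min) {
                            min = backlog[map[i]];
                            cpu = map[i];
                    }
            }
            return cpu;
    }

    int main(void)
    {
            int map[4] = { 0, 1, 2, 3 };

            printf("hashed to cpu 0 -> runs on cpu %d\n", overflow_cpu(0, map, 4)); /* 3 */
            printf("hashed to cpu 1 -> runs on cpu %d\n", overflow_cpu(1, map, 4)); /* 1 */
            return 0;
    }

With netdev_max_rps_backlog left at its default (equal to netdev_max_backlog),
the fallback never triggers before drops would occur anyway, which is how the
feature stays disabled by default.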