From patchwork Wed Apr 24 00:37:27 2013
X-Patchwork-Submitter: Willem de Bruijn
X-Patchwork-Id: 239025
X-Patchwork-Delegate: davem@davemloft.net
From: Willem de Bruijn
To: eric.dumazet@gmail.com, netdev@vger.kernel.org, davem@davemloft.net,
    stephen@networkplumber.org
Cc: Willem de Bruijn
Subject: [PATCH net-next v5] rps: selective flow shedding during softnet overflow
Date: Tue, 23 Apr 2013 20:37:27 -0400
Message-Id: <1366763847-9279-1-git-send-email-willemb@google.com>
X-Mailer: git-send-email 1.8.2.1
In-Reply-To: <1366762196.8964.46.camel@edumazet-glaptop>
References: <1366762196.8964.46.camel@edumazet-glaptop>
X-Mailing-List: netdev@vger.kernel.org

A cpu executing the network receive path sheds packets when its input
queue grows to netdev_max_backlog. A single high rate flow (such as a
spoofed source DoS) can exceed a single cpu's processing rate and will
degrade throughput of other flows hashed onto the same cpu.

This patch adds a more fine grained hashtable. If the netdev backlog
is above a threshold, IRQ cpus track the ratio of total traffic of
each flow (using 4096 buckets, configurable). The ratio is measured by
counting the number of packets per flow over the last 256 packets from
the source cpu. Any flow that occupies a large fraction of this (set
at 50%) will see packet drop while above the threshold.

Tested:
Setup is a multi-threaded UDP echo server with network rx IRQ on cpu0,
kernel receive (RPS) on cpu0 and application threads on cpus 2--7 each
handling 20k req/s. Throughput halves when hit with a 400 kpps
antagonist storm. With this patch applied, antagonist overload is
dropped and the server processes its complete load.

The patch is effective when kernel receive processing is the
bottleneck. The above RPS scenario is an extreme case, but the same is
reached with RFS and sufficient kernel processing (iptables, packet
socket tap, ..).

Signed-off-by: Willem de Bruijn
Acked-by: Eric Dumazet
---
Changes
  v5 - depend on RPS, automatically build if RPS is enabled.
  v4 - remove unnecessary synchronize_rcu after rcu_assign_pointer to NULL ptr
     - simplify lookup of current cpu's softnet
  v3 - fix race between updates to table_len sysctl during bitmap sysctl.
     - fix NULL pointer dereference on alloc failure.
  v2 - add fl->num_buckets element to use the actual allocated table length.
     - disable the kconfig option by default, as it is workload specific.
---
 include/linux/netdevice.h  |  17 ++++++
 net/Kconfig                |  12 ++++++
 net/core/dev.c             |  48 ++++++++++++++++++++-
 net/core/net-procfs.c      |  16 ++++++-
 net/core/sysctl_net_core.c | 104 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 194 insertions(+), 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f8898a4..d781cf1 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1779,6 +1779,19 @@ static inline int unregister_gifconf(unsigned int family)
 	return register_gifconf(family, NULL);
 }
 
+#ifdef CONFIG_NET_FLOW_LIMIT
+#define FLOW_LIMIT_HISTORY	(1 << 8)	/* must be ^2 */
+struct sd_flow_limit {
+	u64			count;
+	unsigned int		num_buckets;
+	unsigned int		history_head;
+	u16			history[FLOW_LIMIT_HISTORY];
+	u8			buckets[];
+};
+
+extern int netdev_flow_limit_table_len;
+#endif /* CONFIG_NET_FLOW_LIMIT */
+
 /*
  * Incoming packets are placed on per-cpu queues
  */
@@ -1808,6 +1821,10 @@ struct softnet_data {
 	unsigned int		dropped;
 	struct sk_buff_head	input_pkt_queue;
 	struct napi_struct	backlog;
+
+#ifdef CONFIG_NET_FLOW_LIMIT
+	struct sd_flow_limit	*flow_limit;
+#endif
 };
 
 static inline void input_queue_head_incr(struct softnet_data *sd)
diff --git a/net/Kconfig b/net/Kconfig
index 1a22216..02ebc71 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -268,6 +268,18 @@ config BPF_JIT
 	  packet sniffing (libpcap/tcpdump). Note : Admin should enable
 	  this feature changing /proc/sys/net/core/bpf_jit_enable
 
+config NET_FLOW_LIMIT
+	bool "Flow shedding under load"
+	depends on RPS
+	default y
+	---help---
+	  The network stack has to drop packets when a receive processing
+	  CPU's backlog reaches netdev_max_backlog.
+	  If a few out of many active flows generate the vast majority of load,
+	  drop their traffic earlier to maintain capacity for the other flows.
+	  This feature provides servers with many clients some protection against
+	  DoS by a single (spoofed) flow that greatly exceeds average workload.
+
 menu "Network testing"
 
 config NET_PKTGEN
diff --git a/net/core/dev.c b/net/core/dev.c
index 9e26b8d..c9b7106 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3058,6 +3058,46 @@ static int rps_ipi_queued(struct softnet_data *sd)
 	return 0;
 }
 
+#ifdef CONFIG_NET_FLOW_LIMIT
+int netdev_flow_limit_table_len __read_mostly = (1 << 12);
+#endif
+
+static bool skb_flow_limit(struct sk_buff *skb, unsigned int qlen)
+{
+#ifdef CONFIG_NET_FLOW_LIMIT
+	struct sd_flow_limit *fl;
+	struct softnet_data *sd;
+	unsigned int old_flow, new_flow;
+
+	if (qlen < (netdev_max_backlog >> 1))
+		return false;
+
+	sd = &__get_cpu_var(softnet_data);
+
+	rcu_read_lock();
+	fl = rcu_dereference(sd->flow_limit);
+	if (fl) {
+		new_flow = skb_get_rxhash(skb) & (fl->num_buckets - 1);
+		old_flow = fl->history[fl->history_head];
+		fl->history[fl->history_head] = new_flow;
+
+		fl->history_head++;
+		fl->history_head &= FLOW_LIMIT_HISTORY - 1;
+
+		if (likely(fl->buckets[old_flow]))
+			fl->buckets[old_flow]--;
+
+		if (++fl->buckets[new_flow] > (FLOW_LIMIT_HISTORY >> 1)) {
+			fl->count++;
+			rcu_read_unlock();
+			return true;
+		}
+	}
+	rcu_read_unlock();
+#endif
+	return false;
+}
+
 /*
  * enqueue_to_backlog is called to queue an skb to a per CPU backlog
  * queue (may be a remote CPU queue).
@@ -3067,13 +3107,15 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
 {
 	struct softnet_data *sd;
 	unsigned long flags;
+	unsigned int qlen;
 
 	sd = &per_cpu(softnet_data, cpu);
 
 	local_irq_save(flags);
 
 	rps_lock(sd);
-	if (skb_queue_len(&sd->input_pkt_queue) <= netdev_max_backlog) {
+	qlen = skb_queue_len(&sd->input_pkt_queue);
+	if (qlen <= netdev_max_backlog && !skb_flow_limit(skb, qlen)) {
 		if (skb_queue_len(&sd->input_pkt_queue)) {
 enqueue:
 			__skb_queue_tail(&sd->input_pkt_queue, skb);
@@ -6263,6 +6305,10 @@ static int __init net_dev_init(void)
 		sd->backlog.weight = weight_p;
 		sd->backlog.gro_list = NULL;
 		sd->backlog.gro_count = 0;
+
+#ifdef CONFIG_NET_FLOW_LIMIT
+		sd->flow_limit = NULL;
+#endif
 	}
 
 	dev_boot_phase = 0;
diff --git a/net/core/net-procfs.c b/net/core/net-procfs.c
index 569d355..2bf8329 100644
--- a/net/core/net-procfs.c
+++ b/net/core/net-procfs.c
@@ -146,11 +146,23 @@ static void softnet_seq_stop(struct seq_file *seq, void *v)
 static int softnet_seq_show(struct seq_file *seq, void *v)
 {
 	struct softnet_data *sd = v;
+	unsigned int flow_limit_count = 0;
 
-	seq_printf(seq, "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
+#ifdef CONFIG_NET_FLOW_LIMIT
+	struct sd_flow_limit *fl;
+
+	rcu_read_lock();
+	fl = rcu_dereference(sd->flow_limit);
+	if (fl)
+		flow_limit_count = fl->count;
+	rcu_read_unlock();
+#endif
+
+	seq_printf(seq,
+		   "%08x %08x %08x %08x %08x %08x %08x %08x %08x %08x %08x\n",
 		   sd->processed, sd->dropped, sd->time_squeeze, 0,
 		   0, 0, 0, 0, /* was fastroute */
-		   sd->cpu_collision, sd->received_rps);
+		   sd->cpu_collision, sd->received_rps, flow_limit_count);
 
 	return 0;
 }
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index cfdb46a..9e3e644 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -87,6 +87,96 @@ static int rps_sock_flow_sysctl(ctl_table *table, int write,
 }
 #endif /* CONFIG_RPS */
 
+#ifdef CONFIG_NET_FLOW_LIMIT
+static DEFINE_MUTEX(flow_limit_update_mutex);
+
+static int flow_limit_cpu_sysctl(ctl_table *table, int write,
+				 void __user *buffer, size_t *lenp,
+				 loff_t *ppos)
+{
+	struct sd_flow_limit *cur;
+	struct softnet_data *sd;
+	cpumask_var_t mask;
+	int i, len, ret = 0;
+
+	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
+		return -ENOMEM;
+
+	if (write) {
+		ret = cpumask_parse_user(buffer, *lenp, mask);
+		if (ret)
+			goto done;
+
+		mutex_lock(&flow_limit_update_mutex);
+		len = sizeof(*cur) + netdev_flow_limit_table_len;
+		for_each_possible_cpu(i) {
+			sd = &per_cpu(softnet_data, i);
+			cur = rcu_dereference_protected(sd->flow_limit,
+				     lockdep_is_held(&flow_limit_update_mutex));
+			if (cur && !cpumask_test_cpu(i, mask)) {
+				RCU_INIT_POINTER(sd->flow_limit, NULL);
+				synchronize_rcu();
+				kfree(cur);
+			} else if (!cur && cpumask_test_cpu(i, mask)) {
+				cur = kzalloc(len, GFP_KERNEL);
+				if (!cur) {
+					/* not unwinding previous changes */
+					ret = -ENOMEM;
+					goto write_unlock;
+				}
+				cur->num_buckets = netdev_flow_limit_table_len;
+				rcu_assign_pointer(sd->flow_limit, cur);
+			}
+		}
+write_unlock:
+		mutex_unlock(&flow_limit_update_mutex);
+	} else {
+		if (*ppos || !*lenp) {
+			*lenp = 0;
+			goto done;
+		}
+
+		cpumask_clear(mask);
+		rcu_read_lock();
+		for_each_possible_cpu(i) {
+			sd = &per_cpu(softnet_data, i);
+			if (rcu_dereference(sd->flow_limit))
+				cpumask_set_cpu(i, mask);
+		}
+		rcu_read_unlock();
+
+		len = cpumask_scnprintf(buffer, *lenp, mask);
+		*lenp = len + 1;
+		*ppos += len + 1;
+	}
+
+done:
+	free_cpumask_var(mask);
+	return ret;
+}
+
+static int flow_limit_table_len_sysctl(ctl_table *table, int write,
+				       void __user *buffer, size_t *lenp,
+				       loff_t *ppos)
+{
+	unsigned int old, *ptr;
+	int ret;
+
+	mutex_lock(&flow_limit_update_mutex);
+
+	ptr = table->data;
+	old = *ptr;
+	ret = proc_dointvec(table, write, buffer, lenp, ppos);
+	if (!ret && write && !is_power_of_2(*ptr)) {
+		*ptr = old;
+		ret = -EINVAL;
+	}
+
+	mutex_unlock(&flow_limit_update_mutex);
+	return ret;
+}
+#endif /* CONFIG_NET_FLOW_LIMIT */
+
 static struct ctl_table net_core_table[] = {
 #ifdef CONFIG_NET
 	{
@@ -180,6 +270,20 @@ static struct ctl_table net_core_table[] = {
 		.proc_handler	= rps_sock_flow_sysctl
 	},
 #endif
+#ifdef CONFIG_NET_FLOW_LIMIT
+	{
+		.procname	= "flow_limit_cpu_bitmap",
+		.mode		= 0644,
+		.proc_handler	= flow_limit_cpu_sysctl
+	},
+	{
+		.procname	= "flow_limit_table_len",
+		.data		= &netdev_flow_limit_table_len,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= flow_limit_table_len_sysctl
+	},
+#endif /* CONFIG_NET_FLOW_LIMIT */
 #endif /* CONFIG_NET */
 	{
 		.procname	= "netdev_budget",
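
For readers following the accounting in skb_flow_limit() above, here is a
minimal, self-contained userspace sketch of the same rolling-window idea.
The names, constants and the standalone test harness are illustrative and
not part of the patch; HISTORY mirrors FLOW_LIMIT_HISTORY (256) and
NUM_BUCKETS the default netdev_flow_limit_table_len (4096).

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <stdlib.h>

	#define HISTORY     256		/* recent packets remembered, power of 2 */
	#define NUM_BUCKETS 4096	/* flow hash buckets, power of 2 */

	struct flow_limit {
		unsigned int history_head;	/* next history slot to overwrite */
		uint16_t history[HISTORY];	/* bucket id of each recent packet */
		uint16_t buckets[NUM_BUCKETS];	/* packets per bucket in the window
						   (the patch packs these into u8) */
	};

	/* Return true if the flow hashing to rxhash should be shed. */
	static bool flow_over_limit(struct flow_limit *fl, uint32_t rxhash)
	{
		unsigned int new_flow = rxhash & (NUM_BUCKETS - 1);
		unsigned int old_flow = fl->history[fl->history_head];

		/* Overwrite the oldest history entry with the newest packet. */
		fl->history[fl->history_head] = new_flow;
		fl->history_head = (fl->history_head + 1) & (HISTORY - 1);

		/* The evicted packet no longer counts toward its bucket. */
		if (fl->buckets[old_flow])
			fl->buckets[old_flow]--;

		/* Shed once one bucket owns more than half of the window. */
		return ++fl->buckets[new_flow] > (HISTORY >> 1);
	}

	int main(void)
	{
		static struct flow_limit fl;	/* zero-initialized */
		unsigned int drops = 0;

		/* 9 of 10 packets belong to one "elephant" flow (hash 42). */
		for (int i = 0; i < 100000; i++) {
			uint32_t hash = (i % 10) ? 42 : (uint32_t)rand();
			if (flow_over_limit(&fl, hash))
				drops++;
		}
		printf("flagged %u of 100000 packets for drop\n", drops);
		return 0;
	}

Run as is, nearly all of the elephant flow's packets are flagged once its
bucket climbs past half the window, while the background flows pass
essentially untouched. In the kernel the equivalent state lives behind RCU
in per-cpu softnet_data and is only consulted once the input queue exceeds
half of netdev_max_backlog.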