From patchwork Wed May 22 17:54:40 2013
X-Patchwork-Submitter: Willem de Bruijn <willemb@google.com>
X-Patchwork-Id: 245694
X-Patchwork-Delegate: davem@davemloft.net
From: Willem de Bruijn <willemb@google.com>
To: netdev@vger.kernel.org, eric.dumazet@gmail.com, davem@davemloft.net
Cc: Willem de Bruijn <willemb@google.com>
Subject: [PATCH net-next] rps: document flow limit in scaling.txt
Date: Wed, 22 May 2013 13:54:40 -0400
Message-Id: <1369245280-9585-1-git-send-email-willemb@google.com>
X-Mailer: git-send-email 1.8.2.1
X-Mailing-List: netdev@vger.kernel.org

Explain the mechanism and API of the recently merged rps flow limit
patch.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
Perhaps a bit wordy.
I can strip the context and leave only the API documentation.
---
 Documentation/networking/scaling.txt | 58 ++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
index 579994a..ca6977f 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -163,6 +163,64 @@ and unnecessary. If there are fewer hardware queues than CPUs, then
 RPS might be beneficial if the rps_cpus for each queue are the ones that
 share the same memory domain as the interrupting CPU for that queue.
 
+==== RPS Flow Limit
+
+RPS scales kernel receive processing across CPUs without introducing
+reordering. The trade-off to sending all packets from the same flow
+to the same CPU is CPU load imbalance if flows vary in packet rate.
+In the extreme case a single flow dominates traffic. Especially on
+common server workloads with many concurrent connections, such
+behavior indicates a problem such as a misconfiguration or spoofed
+source Denial of Service attack.
+
+Flow Limit is an optional RPS feature that prioritizes small flows
+during CPU contention by dropping packets from large flows slightly
+ahead of those from small flows. It is active only when an RPS or RFS
+destination CPU approaches saturation. Once a CPU's input packet
+queue exceeds half the maximum queue length (as set by sysctl
+net.core.netdev_max_backlog), the kernel starts a per-flow packet
+count over the last 256 packets. If a flow exceeds a set ratio (by
+default, half) of these packets when a new packet arrives, then the
+new packet is dropped. Packets from other flows are still only
+dropped once the input packet queue reaches netdev_max_backlog.
+No packets are dropped when the input packet queue length is below
+the threshold, so flow limit does not sever connections outright:
+even large flows maintain connectivity.
+
+== Interface
+
+Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not
+turned on. It is implemented for each CPU independently (to avoid lock
+and cache contention) and toggled per CPU by setting the relevant bit
+in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU
+bitmap interface as rps_cpus (see above) when called from procfs:
+
+ /proc/sys/net/core/flow_limit_cpu_bitmap
+
+Per-flow rate is calculated by hashing each packet into a hashtable
+bucket and incrementing a per-bucket counter. The hash function is
+the same that selects a CPU in RPS, but as the number of buckets can
+be much larger than the number of CPUs, flow limit has finer-grained
+identification of large flows and fewer false positives. The default
+table has 4096 buckets. This value can be modified through sysctl
+
+ net.core.flow_limit_table_len
+
+The value is only consulted when a new table is allocated. Modifying
+it does not update active tables.
+
+== Suggested Configuration
+
+Flow limit is useful on systems with many concurrent connections,
+where a single connection taking up 50% of a CPU indicates a problem.
+In such environments, enable the feature on all CPUs that handle
+network rx interrupts (as set in /proc/irq/N/smp_affinity).
+
+The feature depends on the input packet queue length to exceed
+the flow limit threshold (50%) + the flow history length (256).
+Setting net.core.netdev_max_backlog to either 1000 or 10000
+performed well in experiments.
+
 RFS: Receive Flow Steering
 ==========================
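
To make the accounting concrete, here is a minimal, illustrative C
sketch of the drop decision described above. It is not the kernel
implementation: the names (flow_limit_should_drop, HISTORY_LEN,
NUM_BUCKETS) and the standalone calling convention are invented for
this example, and the real per-CPU state, locking and counters are
omitted. It only restates the documented algorithm: once the input
queue is half full, track which bucket each of the last 256 packets
hashed into, and drop a new packet if its flow already owns more than
half of that window.

#include <stdbool.h>
#include <stdint.h>

#define HISTORY_LEN 256   /* packets of history kept per CPU (per the text above) */
#define NUM_BUCKETS 4096  /* default net.core.flow_limit_table_len */

/* Caller is expected to zero-initialize one of these per CPU. */
struct flow_limit {
	unsigned int history_head;        /* next history slot to overwrite */
	uint16_t history[HISTORY_LEN];    /* bucket hit by each recent packet */
	uint16_t buckets[NUM_BUCKETS];    /* packets per bucket in the window */
};

/*
 * Return true if a newly arrived packet, identified by its RPS flow
 * hash, should be dropped. qlen is the CPU's current input packet
 * queue length and max_backlog is net.core.netdev_max_backlog.
 */
bool flow_limit_should_drop(struct flow_limit *fl, uint32_t flow_hash,
			    unsigned int qlen, unsigned int max_backlog)
{
	unsigned int old_bucket, new_bucket;

	/* Accounting only starts once the queue is at least half full. */
	if (qlen < max_backlog / 2)
		return false;

	new_bucket = flow_hash & (NUM_BUCKETS - 1);

	/* Age out the oldest packet in the 256-packet sliding window... */
	old_bucket = fl->history[fl->history_head];
	if (fl->buckets[old_bucket] > 0)
		fl->buckets[old_bucket]--;

	/* ...and record the new packet in its place. */
	fl->history[fl->history_head] = new_bucket;
	fl->history_head = (fl->history_head + 1) % HISTORY_LEN;

	/* Drop if this flow owns more than half of the recent packets. */
	return ++fl->buckets[new_bucket] > HISTORY_LEN / 2;
}

Because NUM_BUCKETS is much larger than the number of CPUs, two small
flows rarely share a bucket, so a drop almost always hits the single
flow that is actually dominating the window.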
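
As a purely illustrative configuration, on a machine where CPUs 0-3
take network rx interrupts the table length could be raised before the
per-CPU bits are set, since only newly allocated tables pick up the new
length (the mask "f" and the table size 8192 are example values only;
the bitmap uses the same hexadecimal CPU mask format as rps_cpus, and
10000 is one of the backlog values mentioned above):

 echo 8192 > /proc/sys/net/core/flow_limit_table_len
 echo f > /proc/sys/net/core/flow_limit_cpu_bitmap
 echo 10000 > /proc/sys/net/core/netdev_max_backlog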