From patchwork Wed May 22 17:54:40 2013
X-Patchwork-Submitter: Willem de Bruijn <willemb@google.com>
X-Patchwork-Id: 245694
X-Patchwork-Delegate: davem@davemloft.net
From: Willem de Bruijn <willemb@google.com>
To: netdev@vger.kernel.org, eric.dumazet@gmail.com, davem@davemloft.net
Cc: Willem de Bruijn <willemb@google.com>
Subject: [PATCH net-next] rps: document flow limit in scaling.txt
Date: Wed, 22 May 2013 13:54:40 -0400
Message-Id: <1369245280-9585-1-git-send-email-willemb@google.com>
X-Mailer: git-send-email 1.8.2.1
X-Mailing-List: netdev@vger.kernel.org

Explain the mechanism and API of the recently merged rps flow limit
patch.

Signed-off-by: Willem de Bruijn <willemb@google.com>
---
Perhaps a bit wordy.
I can strip the context and leave only the API documentation.
---
 Documentation/networking/scaling.txt | 58 ++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
index 579994a..ca6977f 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -163,6 +163,64 @@ and unnecessary. If there are fewer hardware queues than CPUs, then
 RPS might be beneficial if the rps_cpus for each queue are the ones that
 share the same memory domain as the interrupting CPU for that queue.
 
+==== RPS Flow Limit
+
+RPS scales kernel receive processing across CPUs without introducing
+reordering. The trade-off to sending all packets from the same flow
+to the same CPU is CPU load imbalance if flows vary in packet rate.
+In the extreme case a single flow dominates traffic. Especially on
+common server workloads with many concurrent connections, such
+behavior indicates a problem such as a misconfiguration or spoofed
+source Denial of Service attack.
+
+Flow Limit is an optional RPS feature that prioritizes small flows
+during CPU contention by dropping packets from large flows slightly
+ahead of those from small flows. It is active only when an RPS or RFS
+destination CPU approaches saturation. Once a CPU's input packet
+queue exceeds half the maximum queue length (as set by sysctl
+net.core.netdev_max_backlog), the kernel starts a per-flow packet
+count over the last 256 packets. If a flow exceeds a set ratio (by
+default, half) of these packets when a new packet arrives, then the
+new packet is dropped. Packets from other flows are still only
+dropped once the input packet queue reaches netdev_max_backlog.
+No packets are dropped when the input packet queue length is below
+the threshold, so flow limit does not sever connections outright:
+even large flows maintain connectivity.
+
+== Interface
+
+Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not
+turned on. It is implemented for each CPU independently (to avoid lock
+and cache contention) and toggled per CPU by setting the relevant bit
+in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU
+bitmap interface as rps_cpus (see above) when called from procfs:
+
+ /proc/sys/net/core/flow_limit_cpu_bitmap
+
+Per-flow rate is calculated by hashing each packet into a hashtable
+bucket and incrementing a per-bucket counter. The hash function is
+the same that selects a CPU in RPS, but as the number of buckets can
+be much larger than the number of CPUs, flow limit has finer-grained
+identification of large flows and fewer false positives. The default
+table has 4096 buckets. This value can be modified through sysctl
+
+ net.core.flow_limit_table_len
+
+The value is only consulted when a new table is allocated. Modifying
+it does not update active tables.
+
+== Suggested Configuration
+
+Flow limit is useful on systems with many concurrent connections,
+where a single connection taking up 50% of a CPU indicates a problem.
+In such environments, enable the feature on all CPUs that handle
+network rx interrupts (as set in /proc/irq/N/smp_affinity).
+
+The feature depends on the input packet queue length to exceed
+the flow limit threshold (50%) + the flow history length (256).
+Setting net.core.netdev_max_backlog to either 1000 or 10000
+performed well in experiments.
+
 RFS: Receive Flow Steering
 ==========================
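
To make the accounting concrete, here is a minimal, illustrative C
sketch of the drop decision described above. It is not the kernel
implementation: the names (flow_limit_should_drop, HISTORY_LEN,
NUM_BUCKETS) and the standalone calling convention are invented for
this example, and the real per-CPU state, locking and counters are
omitted. It only restates the documented algorithm: once the input
queue is half full, track which bucket each of the last 256 packets
hashed into, and drop a new packet if its flow already owns more than
half of that window.

#include <stdbool.h>
#include <stdint.h>

#define HISTORY_LEN 256   /* packets of history kept per CPU (per the text above) */
#define NUM_BUCKETS 4096  /* default net.core.flow_limit_table_len */

/* Caller is expected to zero-initialize one of these per CPU. */
struct flow_limit {
	unsigned int history_head;        /* next history slot to overwrite */
	uint16_t history[HISTORY_LEN];    /* bucket hit by each recent packet */
	uint16_t buckets[NUM_BUCKETS];    /* packets per bucket in the window */
};

/*
 * Return true if a newly arrived packet, identified by its RPS flow
 * hash, should be dropped. qlen is the CPU's current input packet
 * queue length and max_backlog is net.core.netdev_max_backlog.
 */
bool flow_limit_should_drop(struct flow_limit *fl, uint32_t flow_hash,
			    unsigned int qlen, unsigned int max_backlog)
{
	unsigned int old_bucket, new_bucket;

	/* Accounting only starts once the queue is at least half full. */
	if (qlen < max_backlog / 2)
		return false;

	new_bucket = flow_hash & (NUM_BUCKETS - 1);

	/* Age out the oldest packet in the 256-packet sliding window... */
	old_bucket = fl->history[fl->history_head];
	if (fl->buckets[old_bucket] > 0)
		fl->buckets[old_bucket]--;

	/* ...and record the new packet in its place. */
	fl->history[fl->history_head] = new_bucket;
	fl->history_head = (fl->history_head + 1) % HISTORY_LEN;

	/* Drop if this flow owns more than half of the recent packets. */
	return ++fl->buckets[new_bucket] > HISTORY_LEN / 2;
}

Because NUM_BUCKETS is much larger than the number of CPUs, two small
flows rarely share a bucket, so a drop almost always hits the single
flow that is actually dominating the window.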
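
As a purely illustrative configuration, on a machine where CPUs 0-3
take network rx interrupts the table length could be raised before the
per-CPU bits are set, since only newly allocated tables pick up the new
length (the mask "f" and the table size 8192 are example values only;
the bitmap uses the same hexadecimal CPU mask format as rps_cpus, and
10000 is one of the backlog values mentioned above):

 echo 8192 > /proc/sys/net/core/flow_limit_table_len
 echo f > /proc/sys/net/core/flow_limit_cpu_bitmap
 echo 10000 > /proc/sys/net/core/netdev_max_backlog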