diff mbox

[RFC] net: store port/representative id in metadata_dst

Message ID 1474572417-15907-1-git-send-email-jakub.kicinski@netronome.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Jakub Kicinski Sept. 22, 2016, 7:26 p.m. UTC
Switches and modern SR-IOV enabled NICs may multiplex traffic from
representators and control messages over single set of hardware queues.
Control messages and muxed traffic may need ordered delivery.

Those requirements make it hard to comfortably use TC infrastructure
today unless we have a way of attaching metadata to skbs at the upper
device.  Because single set of queues is used for many netdevs stopping
TC/sched queues of all of them reliably is impossible and lower
device has to resort to returning NETDEV_TX_BUSY and usually
has to take extra locks on the fastpath.

This patch attempts to enable port/representative devs to attach metadata
to skbs which carry port id.  This way representatives can be queueless
and all queuing can be performed at the lower netdev in the usual way.

Traffic arriving on the port/representative interfaces will have
metadata attached and will subsequently be queued to the lower device
for transmission.  The lower device should recognize the metadata and
translate it to HW specific format which is most likely either a special
header inserted before the network headers or descriptor/metadata fields.

Metadata is associated with the lower device by storing the netdev pointer
along with port id so that if TC decides to redirect or mirror, the new
netdev will not try to interpret it.

This is mostly for SR-IOV devices since switches don't have lower
netdevs today.

Since I don't have any real user in the tree at this point please
allow me to present a trivial example use here:

void upper_init(struct upper *upper, struct lower *lower, unsigned int id)
{
	upper->lower_dev = lower;

	upper->dst_meta = metadata_dst_alloc(0, METADATA_HW_PORT_MUX,
					     GFP_KERNEL);
	upper->dst_meta.u.lower_dev = lower->netdev;
	upper->dst_meta.u.port_info.port_id = id;
}

int upper_tx(struct sk_buff *skb, struct net_device *netdev)
{
	struct upper *upper = netdev_priv(netdev);

	skb_dst_drop(skb);
	skb_dst_set_noref(skb, upper->dst_meta);

	return dev_queue_xmit(upper->lower_dev, skb);
}

/*
 * ndo_start_xmit of the lower device: if the skb carries HW port-mux
 * metadata destined for this netdev, translate the port id into the
 * HW-specific format (descriptor field or encap header).
 */
int lower_tx(struct sk_buff *skb, struct net_device *netdev)
{
	struct metadata_dst *md = skb_metadata_dst(skb);
	struct lower *lower = netdev_priv(netdev);

	/* skb_metadata_dst() returns NULL when the skb's dst is not a
	 * metadata dst, so md must be checked before dereferencing.
	 */
	if (md && md->type == METADATA_HW_PORT_MUX &&
	    md->u.port_info.lower_dev == netdev) {
		/* use md->u.port_info.port_id to set port in
		 * descriptor/metadata/do encap
		 */
	}
	...
}

Other approaches considered but found inferior:
 - in-data tags - inserting tags into data will be confusing
   to classifiers which start parsing from mac headers, also
   in-band data is less perfect and allows sufficiently privileged
   user to inject control messages from userspace (this is DSA model
   - note that in SR-IOV switchdev mode I control both upper and lower
   device which differs from DSA where lower device can be any MAC);
 - per-VFR HW queues - requiring a queue per VF is a little wasteful and
   less scalable, muxing allows us to use all PF queues to transmit
   and receive with full RSS (this is model of existing SR-IOV switchdev
   mode implementations);
 - per-VFR TC queue - we could use per-VFR queue in the lower device,
   tag traffic and TX on smaller set of HW queues but again scaling
   would suffer, we would need to lock an extra queue and we have no
   way to stop all queues when HW queues fill up reliably (this model
   would piggy back on dev_queue_xmit_accel() to select queue).


Any comments, reactions would be much appreciated!

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 include/net/dst_metadata.h     | 35 ++++++++++++++++++++++++++++-------
 net/core/dst.c                 | 14 +++++++++-----
 net/core/filter.c              |  1 +
 net/ipv4/ip_tunnel_core.c      |  5 +++--
 net/openvswitch/flow_netlink.c |  3 ++-
 5 files changed, 43 insertions(+), 15 deletions(-)

Comments

Jiri Pirko Sept. 23, 2016, 6:34 a.m. UTC | #1
Thu, Sep 22, 2016 at 09:26:57PM CEST, jakub.kicinski@netronome.com wrote:
>Switches and modern SR-IOV enabled NICs may multiplex traffic from
>representators and control messages over single set of hardware queues.
>Control messages and muxed traffic may need ordered delivery.
>
>Those requirements make it hard to comfortably use TC infrastructure
>today unless we have a way of attaching metadata to skbs at the upper
>device.  Because single set of queues is used for many netdevs stopping
>TC/sched queues of all of them reliably is impossible and lower
>device has to retreat to returning NETDEV_TX_BUSY and usually
>has to take extra locks on the fastpath.
>
>This patch attempts to enable port/representative devs to attach metadata
>to skbs which carry port id.  This way representatives can be queueless
>and all queuing can be performed at the lower netdev in the usual way.
>
>Traffic arriving on the port/representative interfaces will be have 
>metadata attached and will subsequently be queued to the lower device
>for transmission.  The lower device should recognize the metadata and
>translate it to HW specific format which is most likely either a special
>header inserted before the network headers or descriptor/metadata fields.
>
>Metadata is associated with the lower device by storing the netdev pointer
>along with port id so that if TC decides to redirect or mirror the new 
>netdev will not try to interpret it.
>
>This is mostly for SR-IOV devices since switches don't have lower
>netdevs today.
>
>Since I don't have any real user in the tree at this point please
>allow me to present a trivial example use here:
>
>void upper_init(struct upper *upper, struct lower *lower, unsigned int id)
>{
>	upper->lower_dev = lower;
>
>	upper->dst_meta = metadata_dst_alloc(0, METADATA_HW_PORT_MUX,
>					     GFP_KERNEL);
>	upper->dst_meta.u.lower_dev = lower->netdev;
>	upper->dst_meta.u.port_info.port_id = id;
>}
>
>int upper_tx(struct sk_buff *skb, struct net_device *netdev)
>{
>	struct upper *upper = netdev_priv(netdev);
>
>	skb_dst_drop(skb);
>	skb_dst_set_noref(skb, upper->dst_meta);
>
>	return dev_queue_xmit(upper->lower_dev, skb);
>}
>
>int lower_tx(struct sk_buff *skb, struct net_device *netdev)
>{
>	struct metadata_dst *md = skb_metadata_dst(skb);
>	struct lower *lower = netdev_priv(netdev);
>
>	if (md->type == METADATA_HW_PORT_MUX &&
>	    md->u.lower_dev == netdev) {

What else would it be?


>		/* use md->u.port_id to set port in
>		 * descriptor/metadata/do encap
>		 */
>	}
>	...
>}

So if I understand that correctly, this would need some "shared netdev"
which would effectively serve only as a sink for all port netdevices to
tx packets to. On RX, this would be completely avoided. This lower
device looks like half zombie to me. I don't like it :( I wonder if the
solution would not be possible without this lower netdev.

Btw, for the example implementation, you can use mlxsw, as we have exactly
the same scenario there as you describe.

Thanks.

Jiri

>
>Other approaches considered but found inferior:
> - in-data tags - inserting tags into data will be confusing
>   to classifiers which start parsing from mac headers, also
>   in-band data is less perfect and allows sufficiently privileged
>   user to inject control messages from userspace (this is DSA model
>   - note that in SR-IOV switchdev mode I control both upper and lower
>   device which differs from DSA where lower device can be any MAC);
> - per-VFR HW queues - requiring a queue per VF is a little wasteful and
>   less scalable, muxing allows us to use all PF queues to transmit
>   and receive with full RSS (this is model of existing SR-IOV switchdev
>   mode implementations);
> - per-VFR TC queue - we could use per-VFR queue in the lower device,
>   tag traffic and TX on smaller set of HW queues but again scaling
>   would suffer, we would need to lock an extra queue and we have no
>   way to stop all queues when HW queues fill up reliably (this model
>   would piggy back on dev_queue_xmit_accel() to select queue).
>
>
>Any comments, reactions would be much appreciated!
>
>Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
>---
> include/net/dst_metadata.h     | 35 ++++++++++++++++++++++++++++-------
> net/core/dst.c                 | 14 +++++++++-----
> net/core/filter.c              |  1 +
> net/ipv4/ip_tunnel_core.c      |  5 +++--
> net/openvswitch/flow_netlink.c |  3 ++-
> 5 files changed, 43 insertions(+), 15 deletions(-)
>
>diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
>index 6965c8f68ade..6d7e1e4f3acd 100644
>--- a/include/net/dst_metadata.h
>+++ b/include/net/dst_metadata.h
>@@ -5,10 +5,22 @@
> #include <net/ip_tunnels.h>
> #include <net/dst.h>
> 
>+enum metadata_type {
>+	METADATA_IP_TUNNEL,
>+	METADATA_HW_PORT_MUX,
>+};
>+
>+struct hw_port_info {
>+	struct netdevice *lower_dev;
>+	u32 port_id;
>+};
>+
> struct metadata_dst {
> 	struct dst_entry		dst;
>+	enum metadata_type		type;
> 	union {
> 		struct ip_tunnel_info	tun_info;
>+		struct hw_port_info	port_info;
> 	} u;
> };
> 
>@@ -27,7 +39,7 @@ static inline struct ip_tunnel_info *skb_tunnel_info(struct sk_buff *skb)
> 	struct metadata_dst *md_dst = skb_metadata_dst(skb);
> 	struct dst_entry *dst;
> 
>-	if (md_dst)
>+	if (md_dst && md_dst->type == METADATA_IP_TUNNEL)
> 		return &md_dst->u.tun_info;
> 
> 	dst = skb_dst(skb);
>@@ -55,7 +67,14 @@ static inline int skb_metadata_dst_cmp(const struct sk_buff *skb_a,
> 	a = (const struct metadata_dst *) skb_dst(skb_a);
> 	b = (const struct metadata_dst *) skb_dst(skb_b);
> 
>-	if (!a != !b || a->u.tun_info.options_len != b->u.tun_info.options_len)
>+	if (!a != !b || a->type != b->type)
>+		return 1;
>+
>+	if (a->type == METADATA_HW_PORT_MUX)
>+		return memcmp(&a->u.port_info, &b->u.port_info,
>+			      sizeof(a->u.port_info));
>+
>+	if (a->u.tun_info.options_len != b->u.tun_info.options_len)
> 		return 1;
> 
> 	return memcmp(&a->u.tun_info, &b->u.tun_info,
>@@ -63,14 +82,16 @@ static inline int skb_metadata_dst_cmp(const struct sk_buff *skb_a,
> }
> 
> void metadata_dst_free(struct metadata_dst *);
>-struct metadata_dst *metadata_dst_alloc(u8 optslen, gfp_t flags);
>-struct metadata_dst __percpu *metadata_dst_alloc_percpu(u8 optslen, gfp_t flags);
>+struct metadata_dst *metadata_dst_alloc(u8 optslen, enum metadata_type type,
>+					gfp_t flags);
>+struct metadata_dst __percpu *
>+metadata_dst_alloc_percpu(u8 optslen, enum metadata_type type, gfp_t flags);
> 
> static inline struct metadata_dst *tun_rx_dst(int md_size)
> {
> 	struct metadata_dst *tun_dst;
> 
>-	tun_dst = metadata_dst_alloc(md_size, GFP_ATOMIC);
>+	tun_dst = metadata_dst_alloc(md_size, METADATA_IP_TUNNEL, GFP_ATOMIC);
> 	if (!tun_dst)
> 		return NULL;
> 
>@@ -85,11 +106,11 @@ static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
> 	int md_size;
> 	struct metadata_dst *new_md;
> 
>-	if (!md_dst)
>+	if (!md_dst || md_dst->type != METADATA_IP_TUNNEL)
> 		return ERR_PTR(-EINVAL);
> 
> 	md_size = md_dst->u.tun_info.options_len;
>-	new_md = metadata_dst_alloc(md_size, GFP_ATOMIC);
>+	new_md = metadata_dst_alloc(md_size, METADATA_IP_TUNNEL, GFP_ATOMIC);
> 	if (!new_md)
> 		return ERR_PTR(-ENOMEM);
> 
>diff --git a/net/core/dst.c b/net/core/dst.c
>index b5cbbe07f786..dc8c0c0b197b 100644
>--- a/net/core/dst.c
>+++ b/net/core/dst.c
>@@ -367,7 +367,8 @@ static int dst_md_discard(struct sk_buff *skb)
> 	return 0;
> }
> 
>-static void __metadata_dst_init(struct metadata_dst *md_dst, u8 optslen)
>+static void __metadata_dst_init(struct metadata_dst *md_dst,
>+				enum metadata_type type, u8 optslen)
> {
> 	struct dst_entry *dst;
> 
>@@ -379,9 +380,11 @@ static void __metadata_dst_init(struct metadata_dst *md_dst, u8 optslen)
> 	dst->output = dst_md_discard_out;
> 
> 	memset(dst + 1, 0, sizeof(*md_dst) + optslen - sizeof(*dst));
>+	md_dst->type = type;
> }
> 
>-struct metadata_dst *metadata_dst_alloc(u8 optslen, gfp_t flags)
>+struct metadata_dst *metadata_dst_alloc(u8 optslen, enum metadata_type type,
>+					gfp_t flags)
> {
> 	struct metadata_dst *md_dst;
> 
>@@ -389,7 +392,7 @@ struct metadata_dst *metadata_dst_alloc(u8 optslen, gfp_t flags)
> 	if (!md_dst)
> 		return NULL;
> 
>-	__metadata_dst_init(md_dst, optslen);
>+	__metadata_dst_init(md_dst, type, optslen);
> 
> 	return md_dst;
> }
>@@ -403,7 +406,8 @@ void metadata_dst_free(struct metadata_dst *md_dst)
> 	kfree(md_dst);
> }
> 
>-struct metadata_dst __percpu *metadata_dst_alloc_percpu(u8 optslen, gfp_t flags)
>+struct metadata_dst __percpu *
>+metadata_dst_alloc_percpu(u8 optslen, enum metadata_type type, gfp_t flags)
> {
> 	int cpu;
> 	struct metadata_dst __percpu *md_dst;
>@@ -414,7 +418,7 @@ struct metadata_dst __percpu *metadata_dst_alloc_percpu(u8 optslen, gfp_t flags)
> 		return NULL;
> 
> 	for_each_possible_cpu(cpu)
>-		__metadata_dst_init(per_cpu_ptr(md_dst, cpu), optslen);
>+		__metadata_dst_init(per_cpu_ptr(md_dst, cpu), type, optslen);
> 
> 	return md_dst;
> }
>diff --git a/net/core/filter.c b/net/core/filter.c
>index 0920c2ac1d00..61536a7e932e 100644
>--- a/net/core/filter.c
>+++ b/net/core/filter.c
>@@ -2386,6 +2386,7 @@ bpf_get_skb_set_tunnel_proto(enum bpf_func_id which)
> 		 * that is holding verifier mutex.
> 		 */
> 		md_dst = metadata_dst_alloc_percpu(IP_TUNNEL_OPTS_MAX,
>+						   METADATA_IP_TUNNEL,
> 						   GFP_KERNEL);
> 		if (!md_dst)
> 			return NULL;
>diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
>index 777bc1883870..12ffbc4a4daa 100644
>--- a/net/ipv4/ip_tunnel_core.c
>+++ b/net/ipv4/ip_tunnel_core.c
>@@ -145,10 +145,11 @@ struct metadata_dst *iptunnel_metadata_reply(struct metadata_dst *md,
> 	struct metadata_dst *res;
> 	struct ip_tunnel_info *dst, *src;
> 
>-	if (!md || md->u.tun_info.mode & IP_TUNNEL_INFO_TX)
>+	if (!md || md->type != METADATA_IP_TUNNEL ||
>+	    md->u.tun_info.mode & IP_TUNNEL_INFO_TX)
> 		return NULL;
> 
>-	res = metadata_dst_alloc(0, flags);
>+	res = metadata_dst_alloc(0, METADATA_IP_TUNNEL, flags);
> 	if (!res)
> 		return NULL;
> 
>diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
>index ae25ded82b3b..c9971701d0af 100644
>--- a/net/openvswitch/flow_netlink.c
>+++ b/net/openvswitch/flow_netlink.c
>@@ -2072,7 +2072,8 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
> 	if (start < 0)
> 		return start;
> 
>-	tun_dst = metadata_dst_alloc(key.tun_opts_len, GFP_KERNEL);
>+	tun_dst = metadata_dst_alloc(key.tun_opts_len, METADATA_IP_TUNNEL,
>+				     GFP_KERNEL);
> 	if (!tun_dst)
> 		return -ENOMEM;
> 
>-- 
>1.9.1
>
Jiri Benc Sept. 23, 2016, 9:06 a.m. UTC | #2
On Fri, 23 Sep 2016 08:34:29 +0200, Jiri Pirko wrote:
> So if I understand that correctly, this would need some "shared netdev"
> which would effectively serve only as a sink for all port netdevices to
> tx packets to. On RX, this would be completely avoided. This lower
> device looks like half zombie to me.

Looks more like a quarter zombie. Even tx would not be allowed unless
going through one of the ports, as all skbs without
METADATA_HW_PORT_MUX metadata_dst would be dropped. But it would be
possible to attach qdisc to the "lower" netdevice and it would actually
have an effect. On rx this netdevice would be ignored completely. This
is very weird behavior.

> I don't like it :( I wonder if the
> solution would not be possible without this lower netdev.

I agree. This approach doesn't sound correct. The skbs should not be
requeued.

 Jiri
Jakub Kicinski Sept. 23, 2016, 12:55 p.m. UTC | #3
On Fri, 23 Sep 2016 11:06:09 +0200, Jiri Benc wrote:
> On Fri, 23 Sep 2016 08:34:29 +0200, Jiri Pirko wrote:
> > So if I understand that correctly, this would need some "shared netdev"
> > which would effectively serve only as a sink for all port netdevices to
> > tx packets to. On RX, this would be completely avoided. This lower
> > device looks like half zombie to me.  
> 
> Looks more like a quarter zombie. Even tx would not be allowed unless
> going through one of the ports, as all skbs without
> METADATA_HW_PORT_MUX metadata_dst would be dropped. But it would be
> possible to attach qdisc to the "lower" netdevice and it would actually
> have an effect. On rx this netdevice would be ignored completely. This
> is very weird behavior.
> 
> > I don't like it :( I wonder if the
> > solution would not be possible without this lower netdev.  
> 
> I agree. This approach doesn't sound correct. The skbs should not be
> requeued.

Thanks for the responses!

I think SR-IOV NICs are coming at this problem from a different angle,
we already have a big, feature-full per-port netdevs and now we want to
spawn representators for VFs to handle fallback traffic.  This patch
would help us mux VFR traffic on all the queues of the physical port
netdevs (the ones which were already present in legacy mode, that's the
lower device).

I read the mlxsw code when I was thinking about this and I wasn't
100% comfortable with returning NETDEV_TX_BUSY, I thought this
behaviour should be generally avoided.  (BTW a very lame question - does
mlxsw ever stop the queues?  AFAICS it only returns BUSY, isn't that
confusing to the stack?)

FWIW the switchdev SR-IOV model we have now seems to be to treat the
existing netdevs as "MAC ports" and spawn representatives for VFs but
not represent PFs in any way.  This makes it impossible to install
VF-PF flow rules.  I worry this can bite us later but that's slightly
different discussion :)  For the purpose of this patch please assume
the lower dev is the MAC/physical/external port.
John Fastabend Sept. 23, 2016, 2:23 p.m. UTC | #4
On 16-09-23 05:55 AM, Jakub Kicinski wrote:
> On Fri, 23 Sep 2016 11:06:09 +0200, Jiri Benc wrote:
>> On Fri, 23 Sep 2016 08:34:29 +0200, Jiri Pirko wrote:
>>> So if I understand that correctly, this would need some "shared netdev"
>>> which would effectively serve only as a sink for all port netdevices to
>>> tx packets to. On RX, this would be completely avoided. This lower
>>> device looks like half zombie to me.  
>>
>> Looks more like a quarter zombie. Even tx would not be allowed unless
>> going through one of the ports, as all skbs without
>> METADATA_HW_PORT_MUX metadata_dst would be dropped. But it would be
>> possible to attach qdisc to the "lower" netdevice and it would actually
>> have an effect. On rx this netdevice would be ignored completely. This
>> is very weird behavior.
>>
>>> I don't like it :( I wonder if the
>>> solution would not be possible without this lower netdev.  
>>
>> I agree. This approach doesn't sound correct. The skbs should not be
>> requeued.
> 
> Thanks for the responses!

Nice timing we were just thinking about this.

> 
> I think SR-IOV NICs are coming at this problem from a different angle,
> we already have a big, feature-full per-port netdevs and now we want to
> spawn representators for VFs to handle fallback traffic.  This patch
> would help us mux VFR traffic on all the queues of the physical port
> netdevs (the ones which were already present in legacy mode, that's the
> lower device).

Yep, I like the idea in general. I had a slightly different approach in
mind though. If you look at __dev_queue_xmit() there is a void
accel_priv pointer (gather you found this based on your commit note).
My take was we could extend this a bit so it can be used by the VFR
devices and they could do a dev_queue_xmit_accel(). In this way there is
no need to touch /net/core/{filter, dst, ip_tunnel}.c etc. Maybe the
accel logic needs to be extended to push the priv pointer all the way
through the xmit routine of the target netdev though. This should look
a lot like the macvlan accelerated xmit device path without the
switching logic.

Of course maybe the name would be extended to dev_queue_xmit_extended()
or something.

So the flow on ingress would be,

  1. pkt_received_by_PF_netdev
  2. PF_netdev reads some tag off packet/descriptor and sets correct
     skb->dev field. This is needed so stack "sees" packets from
     correct VF ports.
  3. packet passed up to stack.

I guess it is a bit "zombie" like on the receive path because the packet
is never actually handled by VF netdev code per se and on egress can
traverse both the VFR and PF netdevs qdiscs. But on the other hand the
VFR netdevs and PF netdevs are all in the same driver. Plus using a
queue per VFR is a bit of a waste as its not needed and also hardware
may not have any mechanism to push VF traffic onto a rx queue.

On egress,

  1. VFR xmit is called
  2. VFR xmit calls dev_queue_xmit_accel() with some meta-data if needed
     for the lower netdev
  3. lower netdev sends out the packet.

Again we don't need to waste any queues for each VFR and the VFR can be
a LLTX device. In this scheme I think you avoid much of the changes in
your patch and keep it all contained in the driver. Any thoughts?

To address 'I wonder if the solution can be done without this lower
netdev' I think it can be but it creates two issues which I'm not sure
have a good solution.

Without a lowerdev we either (a) give each VFR its own queue which I
don't like because it complicates mgmt and uses resources or (b) we
implicitly share queues. The later could be fine it just looks a bit
cleaner IMO to make it explicit.

With regard to VF-PF flow rules, if we allow matching on ingress port,
can all your flow rules be pushed through the PF netdevices, or do you
want any of the VFR netdevs? After all I expect the flow rule
table is actually a shared resource between all attached ports.

Thanks,
.John





> 
> I read the mlxsw code when I was thinking about this and I wasn't
> 100% comfortable with returning NETDEV_TX_BUSY, I thought this
> behaviour should be generally avoided.  (BTW a very lame question - does
> mlxsw ever stop the queues?  AFAICS it only returns BUSY, isn't that
> confusing to the stack?)
> 
> FWIW the switchdev SR-IOV model we have now seems to be to treat the
> existing netdevs as "MAC ports" and spawn representatives for VFs but
> not represent PFs in any way.  This makes it impossible to install
> VF-PF flow rules.  I worry this can bite us later but that's slightly
> different discussion :)  For the purpose of this patch please assume
> the lower dev is the MAC/physical/external port.
>
Jakub Kicinski Sept. 23, 2016, 3:29 p.m. UTC | #5
On Fri, 23 Sep 2016 07:23:26 -0700, John Fastabend wrote:
> On 16-09-23 05:55 AM, Jakub Kicinski wrote:
> > On Fri, 23 Sep 2016 11:06:09 +0200, Jiri Benc wrote:  
> >> On Fri, 23 Sep 2016 08:34:29 +0200, Jiri Pirko wrote:  
> >>> So if I understand that correctly, this would need some "shared netdev"
> >>> which would effectively serve only as a sink for all port netdevices to
> >>> tx packets to. On RX, this would be completely avoided. This lower
> >>> device looks like half zombie to me.    
> >>
> >> Looks more like a quarter zombie. Even tx would not be allowed unless
> >> going through one of the ports, as all skbs without
> >> METADATA_HW_PORT_MUX metadata_dst would be dropped. But it would be
> >> possible to attach qdisc to the "lower" netdevice and it would actually
> >> have an effect. On rx this netdevice would be ignored completely. This
> >> is very weird behavior.
> >>  
> >>> I don't like it :( I wonder if the
> >>> solution would not be possible without this lower netdev.    
> >>
> >> I agree. This approach doesn't sound correct. The skbs should not be
> >> requeued.  
> > 
> > Thanks for the responses!  
> 
> Nice timing we were just thinking about this.
> 
> > 
> > I think SR-IOV NICs are coming at this problem from a different angle,
> > we already have a big, feature-full per-port netdevs and now we want to
> > spawn representators for VFs to handle fallback traffic.  This patch
> > would help us mux VFR traffic on all the queues of the physical port
> > netdevs (the ones which were already present in legacy mode, that's the
> > lower device).  
> 
> Yep, I like the idea in general. I had a slightly different approach in
> mind though. If you look at __dev_queue_xmit() there is a void
> accel_priv pointer (gather you found this based on your commit note).
> My take was we could extend this a bit so it can be used by the VFR
> devices and they could do a dev_queue_xmit_accel(). In this way there is
> no need to touch /net/core/{filter, dst, ip_tunnel}.c etc. Maybe the
> accel logic needs to be extended to push the priv pointer all the way
> through the xmit routine of the target netdev though. This should look
> a lot like the macvlan accelerated xmit device path without the
> switching logic.
> 
> Of course maybe the name would be extended to dev_queue_xmit_extended()
> or something.
> 
> So the flow on ingress would be,
> 
>   1. pkt_received_by_PF_netdev
>   2. PF_netdev reads some tag off packet/descriptor and sets correct
>      skb->dev field. This is needed so stack "sees" packets from
>      correct VF ports.
>   3. packet passed up to stack.
> 
> I guess it is a bit "zombie" like on the receive path because the packet
> is never actually handled by VF netdev code per se and on egress can
> traverse both the VFR and PF netdevs qdiscs. But on the other hand the
> VFR netdevs and PF netdevs are all in the same driver. Plus using a
> queue per VFR is a bit of a waste as its not needed and also hardware
> may not have any mechanism to push VF traffic onto a rx queue.
> 
> On egress,
> 
>   1. VFR xmit is called
>   2. VFR xmit calls dev_queue_xmit_accel() with some meta-data if needed
>      for the lower netdev
>   3. lower netdev sends out the packet.
> 
> Again we don't need to waste any queues for each VFR and the VFR can be
> a LLTX device. In this scheme I think you avoid much of the changes in
> your patch and keep it all contained in the driver. Any thoughts?

Goes without saying that you have a much better understanding of packet
scheduling so please bear with me :)  My target model is that I have
n_cpus x "n_tc/prio" queues on the PF and I want to transmit the
fallback traffic over those same queues.  So no new HW queues are used
for VFRs at all.  This is a reverse of macvlan offload which AFAICT has
"bastard hw queues" which actually TX for a separate software device.

My understanding was that I can rework this model to have software
queues for VFRs (#sw queues == #PF queues + #VFRs) but no extra HW
queues (#hw queues == #PF queues) but then when the driver sees a
packet on sw-only VFR queue it has to pick one of the PF queues (which
one?), lock PF software queue to own it, and only then can it
transmit.  With the dst_metadata there is no need for extra locking or
queue selection.

> To address 'I wonder if the solution can be done without this lower
> netdev' I think it can be but it creates two issues which I'm not sure
> have a good solution.
> 
> Without a lowerdev we either (a) give each VFR its own queue which I
> don't like because it complicates mgmt and uses resources or (b) we
> implicitly share queues. The later could be fine it just looks a bit
> cleaner IMO to make it explicit.
> 
> With regard to VF-PF flow rules if we allow matching on ingress port
> then can all your flow rules be pushed through the PF netdevices or
> if you want any of the VFR netdevs? After all I expsect the flow rule
> table is actually a shared resource between all attached ports.

With the VF-PF forwarding rules I was just inching towards re-opening
the discussion on whether there should be an CPU port netdev.  I guess
there are good reasons why there isn't so maybe let's not go there :)
The meaning of PF netdevs in SR-IOV switchdev mode is "external ports"
AFAICT which could make it cumbersome to reach the host.
Samudrala, Sridhar Sept. 23, 2016, 5:22 p.m. UTC | #6
On 9/23/2016 8:29 AM, Jakub Kicinski wrote:
> On Fri, 23 Sep 2016 07:23:26 -0700, John Fastabend wrote:
>> On 16-09-23 05:55 AM, Jakub Kicinski wrote:
>>> On Fri, 23 Sep 2016 11:06:09 +0200, Jiri Benc wrote:
>>>> On Fri, 23 Sep 2016 08:34:29 +0200, Jiri Pirko wrote:
>>>>> So if I understand that correctly, this would need some "shared netdev"
>>>>> which would effectively serve only as a sink for all port netdevices to
>>>>> tx packets to. On RX, this would be completely avoided. This lower
>>>>> device looks like half zombie to me.
>>>> Looks more like a quarter zombie. Even tx would not be allowed unless
>>>> going through one of the ports, as all skbs without
>>>> METADATA_HW_PORT_MUX metadata_dst would be dropped. But it would be
>>>> possible to attach qdisc to the "lower" netdevice and it would actually
>>>> have an effect. On rx this netdevice would be ignored completely. This
>>>> is very weird behavior.
>>>>   
>>>>> I don't like it :( I wonder if the
>>>>> solution would not be possible without this lower netdev.
>>>> I agree. This approach doesn't sound correct. The skbs should not be
>>>> requeued.
>>> Thanks for the responses!
>> Nice timing we were just thinking about this.
>>
>>> I think SR-IOV NICs are coming at this problem from a different angle,
>>> we already have a big, feature-full per-port netdevs and now we want to
>>> spawn representators for VFs to handle fallback traffic.  This patch
>>> would help us mux VFR traffic on all the queues of the physical port
>>> netdevs (the ones which were already present in legacy mode, that's the
>>> lower device).
>> Yep, I like the idea in general. I had a slightly different approach in
>> mind though. If you look at __dev_queue_xmit() there is a void
>> accel_priv pointer (gather you found this based on your commit note).
>> My take was we could extend this a bit so it can be used by the VFR
>> devices and they could do a dev_queue_xmit_accel(). In this way there is
>> no need to touch /net/core/{filter, dst, ip_tunnel}.c etc. Maybe the
>> accel logic needs to be extended to push the priv pointer all the way
>> through the xmit routine of the target netdev though. This should look
>> a lot like the macvlan accelerated xmit device path without the
>> switching logic.
>>
>> Of course maybe the name would be extended to dev_queue_xmit_extended()
>> or something.
>>
>> So the flow on ingress would be,
>>
>>    1. pkt_received_by_PF_netdev
>>    2. PF_netdev reads some tag off packet/descriptor and sets correct
>>       skb->dev field. This is needed so stack "sees" packets from
>>       correct VF ports.
>>    3. packet passed up to stack.
>>
>> I guess it is a bit "zombie" like on the receive path because the packet
>> is never actually handled by VF netdev code per se and on egress can
>> traverse both the VFR and PF netdevs qdiscs. But on the other hand the
>> VFR netdevs and PF netdevs are all in the same driver. Plus using a
>> queue per VFR is a bit of a waste as its not needed and also hardware
>> may not have any mechanism to push VF traffic onto a rx queue.
>>
>> On egress,
>>
>>    1. VFR xmit is called
>>    2. VFR xmit calls dev_queue_xmit_accel() with some meta-data if needed
>>       for the lower netdev
>>    3. lower netdev sends out the packet.
>>
>> Again we don't need to waste any queues for each VFR and the VFR can be
>> a LLTX device. In this scheme I think you avoid much of the changes in
>> your patch and keep it all contained in the driver. Any thoughts?

The 'accel' parameter in dev_queue_xmit_accel() is currently only passed
to ndo_select_queue() via netdev_pick_tx() and is used to select the tx 
queue.
Also, it is not passed all the way to the driver specific xmit routine.  
Doesn't it require
changing all the driver xmit routines if we want to pass this parameter?

> Goes without saying that you have a much better understanding of packet
> scheduling so please bear with me :)  My target model is that I have
> n_cpus x "n_tc/prio" queues on the PF and I want to transmit the
> fallback traffic over those same queues.  So no new HW queues are used
> for VFRs at all.  This is a reverse of macvlan offload which AFAICT has
> "bastard hw queues" which actually TX for a separate software device.
>
> My understanding was that I can rework this model to have software
> queues for VFRs (#sw queues == #PF queues + #VFRs) but no extra HW
> queues (#hw queues == #PF queues) but then when the driver sees a
> packet on sw-only VFR queue it has to pick one of the PF queues (which
> one?), lock PF software queue to own it, and only then can it
> transmit.  With the dst_metadata there is no need for extra locking or
> queue selection.

Yes.  The VFPR netdevs don't have any HW queues associated with them and
we would like to use the PF queues for the xmit.
I was also looking into some way of passing the port id via an skb
parameter to the dev_queue_xmit() call so that the PF xmit routine can
do a directed transmit to a specific VF.
Is skb->cb an option to pass this info?
The dst_metadata approach would work too if it is acceptable.


>
>> To address 'I wonder if the solution can be done without this lower
>> netdev' I think it can be but it creates two issues which I'm not sure
>> have a good solution.
>>
>> Without a lowerdev we either (a) give each VFR its own queue which I
>> don't like because it complicates mgmt and uses resources or (b) we
>> implicitly share queues. The later could be fine it just looks a bit
>> cleaner IMO to make it explicit.
>>
>> With regard to VF-PF flow rules if we allow matching on ingress port
>> then can all your flow rules be pushed through the PF netdevices or
>> if you want any of the VFR netdevs? After all I expsect the flow rule
>> table is actually a shared resource between all attached ports.
> With the VF-PF forwarding rules I was just inching towards re-opening
> the discussion on whether there should be an CPU port netdev.  I guess
> there are good reasons why there isn't so maybe let's not go there :)
> The meaning of PF netdevs in SR-IOV switchdev mode is "external ports"
> AFAICT which could make it cumbersome to reach the host.
Jakub Kicinski Sept. 23, 2016, 8:17 p.m. UTC | #7
On Fri, 23 Sep 2016 10:22:59 -0700, Samudrala, Sridhar wrote:
> On 9/23/2016 8:29 AM, Jakub Kicinski wrote:
> > On Fri, 23 Sep 2016 07:23:26 -0700, John Fastabend wrote:  
> >> Yep, I like the idea in general. I had a slightly different approach in
> >> mind though. If you look at __dev_queue_xmit() there is a void
> >> accel_priv pointer (gather you found this based on your commit note).
> >> My take was we could extend this a bit so it can be used by the VFR
> >> devices and they could do a dev_queue_xmit_accel(). In this way there is
> >> no need to touch /net/core/{filter, dst, ip_tunnel}.c etc. Maybe the
> >> accel logic needs to be extended to push the priv pointer all the way
> >> through the xmit routine of the target netdev though. This should look
> >> a lot like the macvlan accelerated xmit device path without the
> >> switching logic.
> >>
> >> Of course maybe the name would be extended to dev_queue_xmit_extended()
> >> or something.
> >>
> >> So the flow on ingress would be,
> >>
> >>    1. pkt_received_by_PF_netdev
> >>    2. PF_netdev reads some tag off packet/descriptor and sets correct
> >>       skb->dev field. This is needed so stack "sees" packets from
> >>       correct VF ports.
> >>    3. packet passed up to stack.
> >>
> >> I guess it is a bit "zombie" like on the receive path because the packet
> >> is never actually handled by VF netdev code per se and on egress can
> >> traverse both the VFR and PF netdevs qdiscs. But on the other hand the
> >> VFR netdevs and PF netdevs are all in the same driver. Plus using a
> >> queue per VFR is a bit of a waste as its not needed and also hardware
> >> may not have any mechanism to push VF traffic onto a rx queue.
> >>
> >> On egress,
> >>
> >>    1. VFR xmit is called
> >>    2. VFR xmit calls dev_queue_xmit_accel() with some meta-data if needed
> >>       for the lower netdev
> >>    3. lower netdev sends out the packet.
> >>
> >> Again we don't need to waste any queues for each VFR and the VFR can be
> >> a LLTX device. In this scheme I think you avoid much of the changes in
> >> your patch and keep it all contained in the driver. Any thoughts?  
> 
> The 'accel' parameter in dev_queue_xmit_accel() is currently only passed
> to ndo_select_queue() via netdev_pick_tx() and is used to select the tx 
> queue.
> Also, it is not passed all the way to the driver specific xmit routine.  
> Doesn't it require
> changing all the driver xmit routines if we want to pass this parameter?
> 
> > Goes without saying that you have a much better understanding of packet
> > scheduling so please bear with me :)  My target model is that I have
> > n_cpus x "n_tc/prio" queues on the PF and I want to transmit the
> > fallback traffic over those same queues.  So no new HW queues are used
> > for VFRs at all.  This is a reverse of macvlan offload which AFAICT has
> > "bastard hw queues" which actually TX for a separate software device.
> >
> > My understanding was that I can rework this model to have software
> > queues for VFRs (#sw queues == #PF queues + #VFRs) but no extra HW
> > queues (#hw queues == #PF queues) but then when the driver sees a
> > packet on sw-only VFR queue it has to pick one of the PF queues (which
> > one?), lock PF software queue to own it, and only then can it
> > transmit.  With the dst_metadata there is no need for extra locking or
> > queue selection.  
> 
> Yes.  The VFPR netdevs don't have any HW queues associated with them and 
> we would like
> to use the PF queues for the xmit.
> I was also looking into some way of passing the port id via skb 
> parameter to the
> dev_queue_xmit() call so that the PF xmit routine can do a directed 
> transmit to a specifc VF.
> Is skb->cb an option to pass this info?
> dst_metadata approach would work  too if it is acceptable.

I don't think we can trust skb->cb to be set to anything meaningful
when the skb is received by the lower device.
John Fastabend Sept. 23, 2016, 8:25 p.m. UTC | #8
On 16-09-23 01:17 PM, Jakub Kicinski wrote:
> On Fri, 23 Sep 2016 10:22:59 -0700, Samudrala, Sridhar wrote:
>> On 9/23/2016 8:29 AM, Jakub Kicinski wrote:
>>> On Fri, 23 Sep 2016 07:23:26 -0700, John Fastabend wrote:  
>>>> Yep, I like the idea in general. I had a slightly different approach in
>>>> mind though. If you look at __dev_queue_xmit() there is a void
>>>> accel_priv pointer (gather you found this based on your commit note).
>>>> My take was we could extend this a bit so it can be used by the VFR
>>>> devices and they could do a dev_queue_xmit_accel(). In this way there is
>>>> no need to touch /net/core/{filter, dst, ip_tunnel}.c etc. Maybe the
>>>> accel logic needs to be extended to push the priv pointer all the way
>>>> through the xmit routine of the target netdev though. This should look
>>>> a lot like the macvlan accelerated xmit device path without the
>>>> switching logic.
>>>>
>>>> Of course maybe the name would be extended to dev_queue_xmit_extended()
>>>> or something.
>>>>
>>>> So the flow on ingress would be,
>>>>
>>>>    1. pkt_received_by_PF_netdev
>>>>    2. PF_netdev reads some tag off packet/descriptor and sets correct
>>>>       skb->dev field. This is needed so stack "sees" packets from
>>>>       correct VF ports.
>>>>    3. packet passed up to stack.
>>>>
>>>> I guess it is a bit "zombie" like on the receive path because the packet
>>>> is never actually handled by VF netdev code per se and on egress can
>>>> traverse both the VFR and PF netdevs qdiscs. But on the other hand the
>>>> VFR netdevs and PF netdevs are all in the same driver. Plus using a
>>>> queue per VFR is a bit of a waste as its not needed and also hardware
>>>> may not have any mechanism to push VF traffic onto a rx queue.
>>>>
>>>> On egress,
>>>>
>>>>    1. VFR xmit is called
>>>>    2. VFR xmit calls dev_queue_xmit_accel() with some meta-data if needed
>>>>       for the lower netdev
>>>>    3. lower netdev sends out the packet.
>>>>
>>>> Again we don't need to waste any queues for each VFR and the VFR can be
>>>> a LLTX device. In this scheme I think you avoid much of the changes in
>>>> your patch and keep it all contained in the driver. Any thoughts?  
>>
>> The 'accel' parameter in dev_queue_xmit_accel() is currently only passed
>> to ndo_select_queue() via netdev_pick_tx() and is used to select the tx 
>> queue.
>> Also, it is not passed all the way to the driver specific xmit routine.  
>> Doesn't it require
>> changing all the driver xmit routines if we want to pass this parameter?
>>
>>> Goes without saying that you have a much better understanding of packet
>>> scheduling so please bear with me :)  My target model is that I have
>>> n_cpus x "n_tc/prio" queues on the PF and I want to transmit the
>>> fallback traffic over those same queues.  So no new HW queues are used
>>> for VFRs at all.  This is a reverse of macvlan offload which AFAICT has
>>> "bastard hw queues" which actually TX for a separate software device.
>>>
>>> My understanding was that I can rework this model to have software
>>> queues for VFRs (#sw queues == #PF queues + #VFRs) but no extra HW
>>> queues (#hw queues == #PF queues) but then when the driver sees a
>>> packet on sw-only VFR queue it has to pick one of the PF queues (which
>>> one?), lock PF software queue to own it, and only then can it
>>> transmit.  With the dst_metadata there is no need for extra locking or
>>> queue selection.  
>>
>> Yes.  The VFPR netdevs don't have any HW queues associated with them and 
>> we would like
>> to use the PF queues for the xmit.
>> I was also looking into some way of passing the port id via skb 
>> parameter to the
>> dev_queue_xmit() call so that the PF xmit routine can do a directed 
>> transmit to a specifc VF.
>> Is skb->cb an option to pass this info?
>> dst_metadata approach would work  too if it is acceptable.
> 
> I don't think we can trust skb->cb to be set to anything meaningful
> when the skb is received by the lower device. 
> 

Agreed. I wouldn't recommend using skb->cb. How about passing it via
dev_queue_xmit_accel() through to the driver?

If you pass the metadata through the dev_queue_xmit_accel() handle, tx
queue selection would work using the normal mechanisms (xps,
select_queue, cls hook, etc.). If you wanted to pick some specific queue
based on policy, the policy could be loaded into one of those hooks.

.John
Jakub Kicinski Sept. 23, 2016, 8:45 p.m. UTC | #9
On Fri, 23 Sep 2016 13:25:10 -0700, John Fastabend wrote:
> On 16-09-23 01:17 PM, Jakub Kicinski wrote:
> > On Fri, 23 Sep 2016 10:22:59 -0700, Samudrala, Sridhar wrote:  
> >> On 9/23/2016 8:29 AM, Jakub Kicinski wrote:  
>  [...]  
>  [...]  
> >>
> >> The 'accel' parameter in dev_queue_xmit_accel() is currently only passed
> >> to ndo_select_queue() via netdev_pick_tx() and is used to select the tx 
> >> queue.
> >> Also, it is not passed all the way to the driver specific xmit routine.  
> >> Doesn't it require
> >> changing all the driver xmit routines if we want to pass this parameter?
> >>  
>  [...]  
> >>
> >> Yes.  The VFPR netdevs don't have any HW queues associated with them and 
> >> we would like
> >> to use the PF queues for the xmit.
> >> I was also looking into some way of passing the port id via skb 
> >> parameter to the
> >> dev_queue_xmit() call so that the PF xmit routine can do a directed 
> >> transmit to a specifc VF.
> >> Is skb->cb an option to pass this info?
> >> dst_metadata approach would work  too if it is acceptable.  
> > 
> > I don't think we can trust skb->cb to be set to anything meaningful
> > when the skb is received by the lower device. 
> 
> Agreed. I wouldn't recommend using skb->cb. How about passing it through
> dev_queue_xmit_accel() through to the driver?
> 
> If you pass the metadata through the dev_queue_xmit_accel() handle tx
> queue  selection would work using normal mechanisms (xps, select_queue,
> cls  hook, etc.). If you wanted to pick some specific queue based on
> policy the policy could be loaded into one of those hooks.

Do you mean without extending how accel is handled by
dev_queue_xmit_accel() today?  If my goal is to not have extra HW
queues then I don't see how I could mux in the lower dev without extra
locking (as I tried to explain two emails ago).  Sorry for being slow
here :(
John Fastabend Sept. 23, 2016, 9:20 p.m. UTC | #10
On 16-09-23 01:45 PM, Jakub Kicinski wrote:
> On Fri, 23 Sep 2016 13:25:10 -0700, John Fastabend wrote:
>> On 16-09-23 01:17 PM, Jakub Kicinski wrote:
>>> On Fri, 23 Sep 2016 10:22:59 -0700, Samudrala, Sridhar wrote:  
>>>> On 9/23/2016 8:29 AM, Jakub Kicinski wrote:  
>>  [...]  
>>  [...]  
>>>>
>>>> The 'accel' parameter in dev_queue_xmit_accel() is currently only passed
>>>> to ndo_select_queue() via netdev_pick_tx() and is used to select the tx 
>>>> queue.
>>>> Also, it is not passed all the way to the driver specific xmit routine.  
>>>> Doesn't it require
>>>> changing all the driver xmit routines if we want to pass this parameter?
>>>>  
>>  [...]  
>>>>
>>>> Yes.  The VFPR netdevs don't have any HW queues associated with them and 
>>>> we would like
>>>> to use the PF queues for the xmit.
>>>> I was also looking into some way of passing the port id via skb 
>>>> parameter to the
>>>> dev_queue_xmit() call so that the PF xmit routine can do a directed 
>>>> transmit to a specifc VF.
>>>> Is skb->cb an option to pass this info?
>>>> dst_metadata approach would work  too if it is acceptable.  
>>>
>>> I don't think we can trust skb->cb to be set to anything meaningful
>>> when the skb is received by the lower device. 
>>
>> Agreed. I wouldn't recommend using skb->cb. How about passing it through
>> dev_queue_xmit_accel() through to the driver?
>>
>> If you pass the metadata through the dev_queue_xmit_accel() handle tx
>> queue  selection would work using normal mechanisms (xps, select_queue,
>> cls  hook, etc.). If you wanted to pick some specific queue based on
>> policy the policy could be loaded into one of those hooks.
> 
> Do you mean without extending how accel is handled by
> dev_queue_xmit_accel() today?  If my goal is to not have extra HW
> queues then I don't see how I could mux in the lower dev without extra
> locking (as I tried to explain two emails ago).  Sorry for being slow
> here :(
> 

Not slow here I think I was overly optimistic...

Yeh let me try this, roughly the current flow is,

   dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv);
   __dev_queue_xmit(skb, accel_priv);
   netdev_pick_tx(dev, skb, accel_priv);
	ndo_select_queue(dev, skb, accel_priv, ...);
   [...]
   q->enqueue();
   [...]
   dev_hard_start_xmit();
   [...]
    <driver code here>

So in this flow the VFR netdev driver handles its xmit routine by
calling dev_queue_xmit_accel after setting skb->dev to the physical
device and passing a cookie via accel that the select_queue() routine
can use to pick a tx queue. The rest of the stack q->enqueue() and
friends will ensure that locking and qdisc is handled correctly.

But accel_priv was lost at queue selection and so its not being passed
down to the driver so no way to set your descriptor bits or whatever
needed to push to the VF. I was sort of thinking we could map it from
the select_queue routine but I can't figure out how to do that either.

The metadata idea doesn't seem that bad now that I've spent some more
time going through it. Either that or hijack some field in the skb but
I think that might be worse than the proposal here.

I'm trying to think up some other alternative now and will let you know
if I think of anything clever but got nothing at the moment.

.John
Jakub Kicinski Sept. 29, 2016, 11:10 a.m. UTC | #11
On Fri, 23 Sep 2016 14:20:40 -0700, John Fastabend wrote:
> On 16-09-23 01:45 PM, Jakub Kicinski wrote:
> > On Fri, 23 Sep 2016 13:25:10 -0700, John Fastabend wrote:  
> >> On 16-09-23 01:17 PM, Jakub Kicinski wrote:  
> >>> On Fri, 23 Sep 2016 10:22:59 -0700, Samudrala, Sridhar wrote:    
> >>>> On 9/23/2016 8:29 AM, Jakub Kicinski wrote:    
> >>  [...]  
> >>  [...]    
> >>>>
> >>>> The 'accel' parameter in dev_queue_xmit_accel() is currently only passed
> >>>> to ndo_select_queue() via netdev_pick_tx() and is used to select the tx 
> >>>> queue.
> >>>> Also, it is not passed all the way to the driver specific xmit routine.  
> >>>> Doesn't it require
> >>>> changing all the driver xmit routines if we want to pass this parameter?
> >>>>    
> >>  [...]    
> >>>>
> >>>> Yes.  The VFPR netdevs don't have any HW queues associated with them and 
> >>>> we would like
> >>>> to use the PF queues for the xmit.
> >>>> I was also looking into some way of passing the port id via skb 
> >>>> parameter to the
> >>>> dev_queue_xmit() call so that the PF xmit routine can do a directed 
> >>>> transmit to a specifc VF.
> >>>> Is skb->cb an option to pass this info?
> >>>> dst_metadata approach would work  too if it is acceptable.    
> >>>
> >>> I don't think we can trust skb->cb to be set to anything meaningful
> >>> when the skb is received by the lower device.   
> >>
> >> Agreed. I wouldn't recommend using skb->cb. How about passing it through
> >> dev_queue_xmit_accel() through to the driver?
> >>
> >> If you pass the metadata through the dev_queue_xmit_accel() handle tx
> >> queue  selection would work using normal mechanisms (xps, select_queue,
> >> cls  hook, etc.). If you wanted to pick some specific queue based on
> >> policy the policy could be loaded into one of those hooks.  
> > 
> > Do you mean without extending how accel is handled by
> > dev_queue_xmit_accel() today?  If my goal is to not have extra HW
> > queues then I don't see how I could mux in the lower dev without extra
> > locking (as I tried to explain two emails ago).  Sorry for being slow
> > here :(
> >   
> 
> Not slow here I think I was overly optimistic...
> 
> Yeh let me try this, roughly the current flow is,
> 
>    dev_queue_xmit_accel(struct sk_buff *skb, void *accel_priv);
>    __dev_queue_xmit(skb, accel_priv);
>    netdev_pick_tx(dev, skb, accel_priv);
> 	ndo_select_queue(dev, skb, accel_priv, ...);
>    [...]
>    q->enqueue();
>    [...]
>    dev_hard_start_xmit();
>    [...]
>     <driver code here>
> 
> So in this flow the VFR netdev driver handles its xmit routine by
> calling dev_queue_xmit_accel after setting skb->dev to the physical
> device and passing a cookie via accel that the select_queue() routine
> can use to pick a tx queue. The rest of the stack q->enqueue() and
> friends will ensure that locking and qdisc is handled correctly.
> 
> But accel_priv was lost at queue selection and so its not being passed
> down to the driver so no way to set your descriptor bits or whatever
> needed to push to the VF. I was sort of thinking we could map it from
> the select_queue routine but I can't figure out how to do that either.
> 
> The metadata idea doesn't seem that bad now that I've spent some more
> time going through it. Either that or hijack some field in the skb but
> I think that might be worse than the proposal here.
> 
> I'm trying to think up some other alternative now and will let you know
> if I think of anything clever but got nothing at the moment.
	
Cool, I'm happy to discuss this further at netdev but it seems like
there is no strong opposition so far?

FWIW in the example I gave I didn't do refcounting on the dst but I
think that's incorrect since we don't have control over lifetime of
redirected/stolen skbs.
diff mbox

Patch

diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
index 6965c8f68ade..6d7e1e4f3acd 100644
--- a/include/net/dst_metadata.h
+++ b/include/net/dst_metadata.h
@@ -5,10 +5,22 @@ 
 #include <net/ip_tunnels.h>
 #include <net/dst.h>
 
+enum metadata_type {
+	METADATA_IP_TUNNEL,
+	METADATA_HW_PORT_MUX,
+};
+
+struct hw_port_info {
+	struct net_device *lower_dev;
+	u32 port_id;
+};
+
 struct metadata_dst {
 	struct dst_entry		dst;
+	enum metadata_type		type;
 	union {
 		struct ip_tunnel_info	tun_info;
+		struct hw_port_info	port_info;
 	} u;
 };
 
@@ -27,7 +39,7 @@  static inline struct ip_tunnel_info *skb_tunnel_info(struct sk_buff *skb)
 	struct metadata_dst *md_dst = skb_metadata_dst(skb);
 	struct dst_entry *dst;
 
-	if (md_dst)
+	if (md_dst && md_dst->type == METADATA_IP_TUNNEL)
 		return &md_dst->u.tun_info;
 
 	dst = skb_dst(skb);
@@ -55,7 +67,14 @@  static inline int skb_metadata_dst_cmp(const struct sk_buff *skb_a,
 	a = (const struct metadata_dst *) skb_dst(skb_a);
 	b = (const struct metadata_dst *) skb_dst(skb_b);
 
-	if (!a != !b || a->u.tun_info.options_len != b->u.tun_info.options_len)
+	if (!a != !b || a->type != b->type)
+		return 1;
+
+	if (a->type == METADATA_HW_PORT_MUX)
+		return memcmp(&a->u.port_info, &b->u.port_info,
+			      sizeof(a->u.port_info));
+
+	if (a->u.tun_info.options_len != b->u.tun_info.options_len)
 		return 1;
 
 	return memcmp(&a->u.tun_info, &b->u.tun_info,
@@ -63,14 +82,16 @@  static inline int skb_metadata_dst_cmp(const struct sk_buff *skb_a,
 }
 
 void metadata_dst_free(struct metadata_dst *);
-struct metadata_dst *metadata_dst_alloc(u8 optslen, gfp_t flags);
-struct metadata_dst __percpu *metadata_dst_alloc_percpu(u8 optslen, gfp_t flags);
+struct metadata_dst *metadata_dst_alloc(u8 optslen, enum metadata_type type,
+					gfp_t flags);
+struct metadata_dst __percpu *
+metadata_dst_alloc_percpu(u8 optslen, enum metadata_type type, gfp_t flags);
 
 static inline struct metadata_dst *tun_rx_dst(int md_size)
 {
 	struct metadata_dst *tun_dst;
 
-	tun_dst = metadata_dst_alloc(md_size, GFP_ATOMIC);
+	tun_dst = metadata_dst_alloc(md_size, METADATA_IP_TUNNEL, GFP_ATOMIC);
 	if (!tun_dst)
 		return NULL;
 
@@ -85,11 +106,11 @@  static inline struct metadata_dst *tun_dst_unclone(struct sk_buff *skb)
 	int md_size;
 	struct metadata_dst *new_md;
 
-	if (!md_dst)
+	if (!md_dst || md_dst->type != METADATA_IP_TUNNEL)
 		return ERR_PTR(-EINVAL);
 
 	md_size = md_dst->u.tun_info.options_len;
-	new_md = metadata_dst_alloc(md_size, GFP_ATOMIC);
+	new_md = metadata_dst_alloc(md_size, METADATA_IP_TUNNEL, GFP_ATOMIC);
 	if (!new_md)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/net/core/dst.c b/net/core/dst.c
index b5cbbe07f786..dc8c0c0b197b 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -367,7 +367,8 @@  static int dst_md_discard(struct sk_buff *skb)
 	return 0;
 }
 
-static void __metadata_dst_init(struct metadata_dst *md_dst, u8 optslen)
+static void __metadata_dst_init(struct metadata_dst *md_dst,
+				enum metadata_type type, u8 optslen)
 {
 	struct dst_entry *dst;
 
@@ -379,9 +380,11 @@  static void __metadata_dst_init(struct metadata_dst *md_dst, u8 optslen)
 	dst->output = dst_md_discard_out;
 
 	memset(dst + 1, 0, sizeof(*md_dst) + optslen - sizeof(*dst));
+	md_dst->type = type;
 }
 
-struct metadata_dst *metadata_dst_alloc(u8 optslen, gfp_t flags)
+struct metadata_dst *metadata_dst_alloc(u8 optslen, enum metadata_type type,
+					gfp_t flags)
 {
 	struct metadata_dst *md_dst;
 
@@ -389,7 +392,7 @@  struct metadata_dst *metadata_dst_alloc(u8 optslen, gfp_t flags)
 	if (!md_dst)
 		return NULL;
 
-	__metadata_dst_init(md_dst, optslen);
+	__metadata_dst_init(md_dst, type, optslen);
 
 	return md_dst;
 }
@@ -403,7 +406,8 @@  void metadata_dst_free(struct metadata_dst *md_dst)
 	kfree(md_dst);
 }
 
-struct metadata_dst __percpu *metadata_dst_alloc_percpu(u8 optslen, gfp_t flags)
+struct metadata_dst __percpu *
+metadata_dst_alloc_percpu(u8 optslen, enum metadata_type type, gfp_t flags)
 {
 	int cpu;
 	struct metadata_dst __percpu *md_dst;
@@ -414,7 +418,7 @@  struct metadata_dst __percpu *metadata_dst_alloc_percpu(u8 optslen, gfp_t flags)
 		return NULL;
 
 	for_each_possible_cpu(cpu)
-		__metadata_dst_init(per_cpu_ptr(md_dst, cpu), optslen);
+		__metadata_dst_init(per_cpu_ptr(md_dst, cpu), type, optslen);
 
 	return md_dst;
 }
diff --git a/net/core/filter.c b/net/core/filter.c
index 0920c2ac1d00..61536a7e932e 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2386,6 +2386,7 @@  bpf_get_skb_set_tunnel_proto(enum bpf_func_id which)
 		 * that is holding verifier mutex.
 		 */
 		md_dst = metadata_dst_alloc_percpu(IP_TUNNEL_OPTS_MAX,
+						   METADATA_IP_TUNNEL,
 						   GFP_KERNEL);
 		if (!md_dst)
 			return NULL;
diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index 777bc1883870..12ffbc4a4daa 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -145,10 +145,11 @@  struct metadata_dst *iptunnel_metadata_reply(struct metadata_dst *md,
 	struct metadata_dst *res;
 	struct ip_tunnel_info *dst, *src;
 
-	if (!md || md->u.tun_info.mode & IP_TUNNEL_INFO_TX)
+	if (!md || md->type != METADATA_IP_TUNNEL ||
+	    md->u.tun_info.mode & IP_TUNNEL_INFO_TX)
 		return NULL;
 
-	res = metadata_dst_alloc(0, flags);
+	res = metadata_dst_alloc(0, METADATA_IP_TUNNEL, flags);
 	if (!res)
 		return NULL;
 
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index ae25ded82b3b..c9971701d0af 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -2072,7 +2072,8 @@  static int validate_and_copy_set_tun(const struct nlattr *attr,
 	if (start < 0)
 		return start;
 
-	tun_dst = metadata_dst_alloc(key.tun_opts_len, GFP_KERNEL);
+	tun_dst = metadata_dst_alloc(key.tun_opts_len, METADATA_IP_TUNNEL,
+				     GFP_KERNEL);
 	if (!tun_dst)
 		return -ENOMEM;