
[net-next-2.6,2/2] bonding: allow user-controlled output slave selection

Message ID 20100511003245.GB7497@gospo.rdu.redhat.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Andy Gospodarek May 11, 2010, 12:32 a.m. UTC
This patch gives the user the ability to control the output slave for
round-robin and active-backup bonding.  Similar functionality was
discussed in the past, but Jay Vosburgh indicated he would rather see a
feature like this added to existing modes rather than creating a
completely new mode.  Jay's thoughts as well as Neil's input surrounding
some of the issues with the first implementation pushed us toward a
design that relied on the queue_mapping rather than skb marks.
Round-robin and active-backup modes were chosen as the first users of
this slave selection as they seemed like the most logical choices when
considering a multi-switch environment.

Round-robin mode works without any modification, but active-backup does
require inclusion of the first patch in this series and setting
the 'keep_all' flag.  This will allow reception of unicast traffic on
any of the backup interfaces.

This was tested with IPv4-based filters as well as VLAN-based filters
with good results.

More information as well as a configuration example is available in the
patch to Documentation/networking/bonding.txt.
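
For reference, a minimal configuration sketch distilled from that
documentation (the interface names, queue number, and address are
illustrative; 'keep_all' is the knob added by the first patch in this
series, 'queue_id' by this one):

# allow unicast reception on all slaves, and map eth1 to bond0 queue 2
echo 1 > /sys/class/net/bond0/bonding/keep_all
echo "eth1:2" > /sys/class/net/bond0/bonding/queue_id

# steer traffic destined for 192.168.1.100 to queue 2, and thus out eth1
tc qdisc add dev bond0 handle 1 root multiq
tc filter add dev bond0 protocol ip parent 1: prio 1 u32 \
	match ip dst 192.168.1.100 action skbedit queue_mapping 2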

Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
---
 Documentation/networking/bonding.txt |   76 +++++++++++++++++++++-
 drivers/net/bonding/bond_main.c      |   77 +++++++++++++++++++++-
 drivers/net/bonding/bond_sysfs.c     |  117 +++++++++++++++++++++++++++++++++-
 drivers/net/bonding/bonding.h        |    5 ++
 include/linux/if_bonding.h           |    1 +
 5 files changed, 270 insertions(+), 6 deletions(-)

Comments

Jay Vosburgh May 11, 2010, 8:09 p.m. UTC | #1
Andy Gospodarek <andy@greyhouse.net> wrote:

>This patch gives the user the ability to control the output slave for
>round-robin and active-backup bonding.  Similar functionality was
>discussed in the past, but Jay Vosburgh indicated he would rather see a
>feature like this added to existing modes rather than creating a
>completely new mode.  Jay's thoughts as well as Neil's input surrounding
>some of the issues with the first implementation pushed us toward a
>design that relied on the queue_mapping rather than skb marks.
>Round-robin and active-backup modes were chosen as the first users of
>this slave selection as they seemed like the most logical choices when
>considering a multi-switch environment.
>
>Round-robin mode works without any modification, but active-backup does
>require inclusion of the first patch in this series and setting
>the 'keep_all' flag.  This will allow reception of unicast traffic on
>any of the backup interfaces.

	Yes, I did think that the mark business fit better into existing
modes (I thought of it as kind of a new hash for xor and 802.3ad modes).
I also didn't expect to see so much new stuff (this, as well as the FCOE
special cases being discussed elsewhere) being shoehorned into the
active-backup mode.  I'm not so sure that adding so many special cases
to active-backup is a good thing.

	Now, I'm starting to wonder if you were right, and it would be
better overall to have a "manual" mode that would hopefully satisfy this
case as well as the FCOE special case.  I don't think either of these is
a bad use case, I'm just not sure the right way to handle them is
another special knob in active-backup mode (either directly, or
implicitly in __netif_receive_skb), which wasn't what I expected to see.

	I presume you're overloading active-backup because it's not
etherchannel, 802.3ad, etc, and just talks right to the switch.  For the
regular load balance modes, I still think overlay into the existing
modes is preferable (more on that later); I'm thinking of "manual"
instead of another tweak to active-backup.

	If users want to have actual hot-standby functionality, then
active-backup would do that, and nothing else (and it can be multi-queue
aware, but only one slave active at a time).

	Users who want the set of bonded slaves to look like a big
multiqueue buffet could use this "manual" mode and set things up however
they want.  One way to set it up is simply that the bond is N queues
wide, where N is the total of the queue counts of all the slaves.  If a
slave fails, N gets smaller, and the user code has to deal with that.
Since the queue count of a device can't change dynamically, the bond
would have to actually be set up with some big number of queues, and
then only a subset is actually active (or there is some sort of wrap).

	In such an implementation, each slave would have a range of
queue IDs, not necessarily just one.  I'm a bit leery of exposing an API
where each slave is one queue ID, as it could make transitioning to real
multi-queue awareness difficult.

	There might also be a way to tie it in to the new RPS code on
the receive side.

	If the slaves all have the same MAC and attach to a single
switch via etherchannel, then it all looks pretty much like a single big
honkin' multiqueue device.  The switch probably won't map the flows back
the same way, though.

	If the slaves are on discrete switches (without etherchannel),
things become more complicated.  If the slaves have the same MAC, then
the switches will be irritated about seeing that same MAC coming in from
multiple places.  If the slaves have different MACs, then ARP has the
same sort of issues.

	In thinking about it, if it's linux bonding at both ends, there
could be any number of discrete switches in the path, and it wouldn't
matter as long as the linux end can work things out, e.g.,

        -- switch 1 --
hostA  /              \  hostB
bond  ---- switch 2 ---- bond
       \              /
        -- switch 3 --

	For something like this, the switches would never share MAC
information for the bonding slaves.  The issue here then becomes more of
detecting link failures (it would require either a "trunk failover" type
of function on the switch, or some kind of active probe between the
bonds).

	Now, I realize that I'm babbling a bit, as from reading your
description, this isn't necessarily your target topology (which sounded
more like a case of slave A can reach only network X, and slave B can
reach anywhere, so sending to network X should use slave A
preferentially), or, as long as I'm doing ASCII-art,

       --- switch 1 ---- network X
hostA /               /
bond  ---- switch 2 -+-- anywhere

	Is that an accurate representation?  Or is it something a bit
different, e.g.,

       --- switch 1 ---- network X -\
hostA /                             /
bond  ---- switch 2 ---- anywhere --

	I.e., the "anywhere" connects back to network X from the
outside, so to speak.  Or, oh, maybe I'm missing it entirely, and you're
thinking of something like this:

       --- switch 1 --- VPN --- web site
hostA /                          /
bond  ---- switch 2 - Internet -/

	Where you prefer to hit "web site" via the VPN (perhaps it's a
more efficient or secure path), but can do it from the public network at
large if necessary.

	Now, regardless of the above, your first patch ("keep_all") is
to deal with the reverse problem, if this is a piggyback on top of
active-backup mode: how to get packets back, when both channels can be
active simultaneously.  That actually dovetails to a degree with work
I've been doing lately, but the solution there probably isn't what
you're looking for (there's a user space daemon to do path finding, and
the "bond IP" address is piggybacked on the slaves' MAC addresses, which
are not changed; the "bond IP" set exists in a separate subnet all its
own).

	As I said, I'm not convinced that the "keep_all" option to
active-backup is really better than just a "manual" mode that lacks the
dup suppression and expects the user to set everything up.

	As for the round-robin change in this patch, if I'm reading it
right, then the way it works is that the packets are round-robined,
unless there's a queue id passed in, in which case it's assigned to the
slave mapped to that queue id.  I'm not entirely sure why you picked
round-robin mode for that over balance-xor; it doesn't seem to fit well
with the description in the documentation.  Or is it just sort of a
demonstrator?

	I do like one other aspect of the patch, and that's the concept
of overlaying the queue map on top of the balance algorithm.  So, e.g.,
balance-xor would do its usual thing, unless the packet is queue mapped,
in which case the packet's assignment is obeyed.  The balance-xor could
even optionally do its xor across the full set of all slaves output
queues instead of just across the slaves.  Round-robin can operate
similarly.  For those modes, a "balance by queue vs. balance by slave"
seems like a reasonable knob to have.

	I do understand that you're proposing something relatively
simple, and I'm thinking out loud about alternate or additional
implementation details.  Some of this is "ooh ahh what if", but we also
don't want to end up with something that's forwards incompatible, and
I'm hoping to find one solution to multiple problems.

	Thoughts?

	-J

>This was tested with IPv4-based filters as well as VLAN-based filters
>with good results.
>
>More information as well as a configuration example is available in the
>patch to Documentation/networking/bonding.txt.
>
>Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
>Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
>---
> Documentation/networking/bonding.txt |   76 +++++++++++++++++++++-
> drivers/net/bonding/bond_main.c      |   77 +++++++++++++++++++++-
> drivers/net/bonding/bond_sysfs.c     |  117 +++++++++++++++++++++++++++++++++-
> drivers/net/bonding/bonding.h        |    5 ++
> include/linux/if_bonding.h           |    1 +
> 5 files changed, 270 insertions(+), 6 deletions(-)
>
>diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
>index d64fd2f..fd277c1 100644
>--- a/Documentation/networking/bonding.txt
>+++ b/Documentation/networking/bonding.txt
>@@ -49,6 +49,7 @@ Table of Contents
> 3.3	Configuring Bonding Manually with Ifenslave
> 3.3.1		Configuring Multiple Bonds Manually
> 3.4	Configuring Bonding Manually via Sysfs
>+3.5	Overriding Configuration for Special Cases
>
> 4. Querying Bonding Configuration
> 4.1	Bonding Configuration
>@@ -1333,8 +1334,79 @@ echo 2000 > /sys/class/net/bond1/bonding/arp_interval
> echo +eth2 > /sys/class/net/bond1/bonding/slaves
> echo +eth3 > /sys/class/net/bond1/bonding/slaves
>
>-
>-4. Querying Bonding Configuration 
>+3.5 Overriding Configuration for Special Cases
>+----------------------------------------------
>+Nominally, when using the bonding driver, the physical port which transmits a
>+frame is selected by the bonding driver, and is not relevant to the user or
>+system administrator.  The output port is simply selected using the policies of
>+the selected bonding mode.  On occasion however, it is helpful to direct certain
>+classes of traffic to certain physical interfaces on output to implement
>+slightly more complex policies.  For example, to reach a web server over a
>+bonded interface in which eth0 connects to a private network, while eth1
>+connects via a public network, it may be desirous to bias the bond to send said
>+traffic over eth0 first, using eth1 only as a fall back, while all other traffic
>+can safely be sent over either interface.  Such configurations may be achieved
>+using the traffic control utilities inherent in linux.
>+
>+By default the bonding driver is multiqueue aware and 16 queues are created
>+when the driver initializes (see Documentation/networking/multiqueue.txt
>+for details).  If more or fewer queues are desired, the module parameter
>+tx_queues can be used to change this value.  There is no sysfs parameter
>+available as the allocation is done at module init time.
>+
>+The output of the file /proc/net/bonding/bondX has changed so the output Queue
>+ID is now printed for each slave:
>+
>+Bonding Mode: fault-tolerance (active-backup)
>+Primary Slave: None
>+Currently Active Slave: eth0
>+MII Status: up
>+MII Polling Interval (ms): 0
>+Up Delay (ms): 0
>+Down Delay (ms): 0
>+
>+Slave Interface: eth0
>+MII Status: up
>+Link Failure Count: 0
>+Permanent HW addr: 00:1a:a0:12:8f:cb
>+Slave queue ID: 0
>+
>+Slave Interface: eth1
>+MII Status: up
>+Link Failure Count: 0
>+Permanent HW addr: 00:1a:a0:12:8f:cc
>+Slave queue ID: 2
>+
>+The queue_id for a slave can be set using the command:
>+
>+# echo "eth1:2" > /sys/class/net/bond0/bonding/queue_id
>+
>+Set the queue_id for any interface that needs one with a call like the one
>+above, repeating until the proper IDs are set for all interfaces.  On
>+distributions that allow configuration via initscripts, multiple 'queue_id'
>+arguments can be added to BONDING_OPTS to set all needed slave queues.
>+
>+These queue IDs can be used in conjunction with the tc utility to configure
>+a multiqueue qdisc and filters to bias certain traffic to transmit on certain
>+slave devices.  For instance, say we wanted, in the above configuration, to
>+force all traffic bound for 192.168.1.100 to use eth1 in the bond as its output
>+device.  The following commands would accomplish this:
>+
>+# tc qdisc add dev bond0 handle 1 root multiq
>+
>+# tc filter add dev bond0 protocol ip parent 1: prio 1 u32 match ip dst \
>+	192.168.1.100 action skbedit queue_mapping 2
>+
>+These commands tell the kernel to attach a multiqueue queue discipline to the
>+bond0 interface and filter traffic enqueued to it, such that packets with a dst
>+ip of 192.168.1.100 have their output queue mapping value overwritten to 2.
>+This value is then passed into the driver, causing the normal output path
>+selection policy to be overridden, selecting instead qid 2, which maps to eth1.
>+
>+Note that qid values begin at 1.  qid 0 is reserved to indicate to the driver
>+that normal output policy selection should take place.
>+
>+4 Querying Bonding Configuration
> =================================
>
> 4.1 Bonding Configuration
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index eb86363..aa6a79a 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -90,6 +90,7 @@
> #define BOND_LINK_ARP_INTERV	0
>
> static int max_bonds	= BOND_DEFAULT_MAX_BONDS;
>+static int tx_queues	= BOND_DEFAULT_TX_QUEUES;
> static int num_grat_arp = 1;
> static int num_unsol_na = 1;
> static int miimon	= BOND_LINK_MON_INTERV;
>@@ -111,6 +112,8 @@ static struct bond_params bonding_defaults;
>
> module_param(max_bonds, int, 0);
> MODULE_PARM_DESC(max_bonds, "Max number of bonded devices");
>+module_param(tx_queues, int, 0);
>+MODULE_PARM_DESC(tx_queues, "Max number of transmit queues (default = 16)");
> module_param(num_grat_arp, int, 0644);
> MODULE_PARM_DESC(num_grat_arp, "Number of gratuitous ARP packets to send on failover event");
> module_param(num_unsol_na, int, 0644);
>@@ -1532,6 +1535,12 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
> 		goto err_undo_flags;
> 	}
>
>+	/*
>+	 * Set the new_slave's queue_id to be zero.  Queue ID mapping
>+	 * is set via sysfs or module option if desired.
>+	 */
>+	new_slave->queue_id = 0;
>+
> 	/* save slave's original flags before calling
> 	 * netdev_set_master and dev_open
> 	 */
>@@ -1790,6 +1799,7 @@ err_restore_mac:
> 	}
>
> err_free:
>+	new_slave->queue_id = 0;
> 	kfree(new_slave);
>
> err_undo_flags:
>@@ -1977,6 +1987,7 @@ int bond_release(struct net_device *bond_dev, struct net_device *slave_dev)
> 				   IFF_SLAVE_INACTIVE | IFF_BONDING |
> 				   IFF_SLAVE_NEEDARP);
>
>+	slave->queue_id = 0;
> 	kfree(slave);
>
> 	return 0;  /* deletion OK */
>@@ -3269,6 +3280,7 @@ static void bond_info_show_slave(struct seq_file *seq,
> 		else
> 			seq_puts(seq, "Aggregator ID: N/A\n");
> 	}
>+	seq_printf(seq, "Slave queue ID: %d\n", slave->queue_id);
> }
>
> static int bond_info_seq_show(struct seq_file *seq, void *v)
>@@ -4405,9 +4417,59 @@ static void bond_set_xmit_hash_policy(struct bonding *bond)
> 	}
> }
>
>+/*
>+ * Lookup the slave that corresponds to a qid
>+ */
>+static inline int bond_slave_override(struct bonding *bond,
>+				      struct sk_buff *skb)
>+{
>+	int i, res = 1;
>+	struct slave *slave = NULL;
>+	struct slave *check_slave;
>+
>+	read_lock(&bond->lock);
>+
>+	if (!BOND_IS_OK(bond) || !skb->queue_mapping)
>+		goto out;
>+
>+	/* Find out if any slaves have the same mapping as this skb. */
>+	bond_for_each_slave(bond, check_slave, i) {
>+		if (check_slave->queue_id == skb->queue_mapping) {
>+			slave = check_slave;
>+			break;
>+		}
>+	}
>+
>+	/* If the slave isn't UP, use default transmit policy. */
>+	if (slave && slave->queue_id && IS_UP(slave->dev) &&
>+	    (slave->link == BOND_LINK_UP)) {
>+		res = bond_dev_queue_xmit(bond, skb, slave->dev);
>+	}
>+
>+out:
>+	read_unlock(&bond->lock);
>+	return res;
>+}
>+
>+static u16 bond_select_queue(struct net_device *dev, struct sk_buff *skb)
>+{
>+	/*
>+	 * This helper function exists to help dev_pick_tx get the correct
>+	 * destination queue.  Using a helper function skips a call to
>+	 * skb_tx_hash and will put the skbs in the queue we expect on their
>+	 * way down to the bonding driver.
>+	 */
>+	return skb->queue_mapping;
>+}
>+
> static netdev_tx_t bond_start_xmit(struct sk_buff *skb, struct net_device *dev)
> {
>-	const struct bonding *bond = netdev_priv(dev);
>+	struct bonding *bond = netdev_priv(dev);
>+
>+	if (TX_QUEUE_OVERRIDE(bond->params.mode)) {
>+		if (!bond_slave_override(bond, skb))
>+			return NETDEV_TX_OK;
>+	}
>
> 	switch (bond->params.mode) {
> 	case BOND_MODE_ROUNDROBIN:
>@@ -4492,6 +4554,7 @@ static const struct net_device_ops bond_netdev_ops = {
> 	.ndo_open		= bond_open,
> 	.ndo_stop		= bond_close,
> 	.ndo_start_xmit		= bond_start_xmit,
>+	.ndo_select_queue	= bond_select_queue,
> 	.ndo_get_stats		= bond_get_stats,
> 	.ndo_do_ioctl		= bond_do_ioctl,
> 	.ndo_set_multicast_list	= bond_set_multicast_list,
>@@ -4763,6 +4826,13 @@ static int bond_check_params(struct bond_params *params)
> 		}
> 	}
>
>+	if (tx_queues < 1 || tx_queues > 255) {
>+		pr_warning("Warning: tx_queues (%d) should be between "
>+			   "1 and 255, resetting to %d\n",
>+			   tx_queues, BOND_DEFAULT_TX_QUEUES);
>+		tx_queues = BOND_DEFAULT_TX_QUEUES;
>+	}
>+
> 	if ((keep_all != 0) && (keep_all != 1)) {
> 		pr_warning("Warning: keep_all module parameter (%d), "
> 			   "not of valid value (0/1), so it was set to "
>@@ -4940,6 +5010,7 @@ static int bond_check_params(struct bond_params *params)
> 	params->primary[0] = 0;
> 	params->primary_reselect = primary_reselect_value;
> 	params->fail_over_mac = fail_over_mac_value;
>+	params->tx_queues = tx_queues;
> 	params->keep_all = keep_all;
>
> 	if (primary) {
>@@ -5027,8 +5098,8 @@ int bond_create(struct net *net, const char *name)
>
> 	rtnl_lock();
>
>-	bond_dev = alloc_netdev(sizeof(struct bonding), name ? name : "",
>-				bond_setup);
>+	bond_dev = alloc_netdev_mq(sizeof(struct bonding), name ? name : "",
>+				bond_setup, tx_queues);
> 	if (!bond_dev) {
> 		pr_err("%s: eek! can't alloc netdev!\n", name);
> 		rtnl_unlock();
>diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
>index 44651ce..87bfcf1 100644
>--- a/drivers/net/bonding/bond_sysfs.c
>+++ b/drivers/net/bonding/bond_sysfs.c
>@@ -1472,6 +1472,121 @@ static ssize_t bonding_show_ad_partner_mac(struct device *d,
> static DEVICE_ATTR(ad_partner_mac, S_IRUGO, bonding_show_ad_partner_mac, NULL);
>
> /*
>+ * Show the queue_ids of the slaves in the current bond.
>+ */
>+static ssize_t bonding_show_queue_id(struct device *d,
>+				     struct device_attribute *attr,
>+				     char *buf)
>+{
>+	struct slave *slave;
>+	int i, res = 0;
>+	struct bonding *bond = to_bond(d);
>+
>+	if (!rtnl_trylock())
>+		return restart_syscall();
>+
>+	read_lock(&bond->lock);
>+	bond_for_each_slave(bond, slave, i) {
>+		if (res > (PAGE_SIZE - 6)) {
>+			/* not enough space for another interface name */
>+			if ((PAGE_SIZE - res) > 10)
>+				res = PAGE_SIZE - 10;
>+			res += sprintf(buf + res, "++more++ ");
>+			break;
>+		}
>+		res += sprintf(buf + res, "%s:%d ",
>+			       slave->dev->name, slave->queue_id);
>+	}
>+	read_unlock(&bond->lock);
>+	if (res)
>+		buf[res-1] = '\n'; /* eat the leftover space */
>+	rtnl_unlock();
>+	return res;
>+}
>+
>+/*
>+ * Set the queue_ids of the slaves in the current bond.  The slave
>+ * interface must already be enslaved for this to work.
>+ */
>+static ssize_t bonding_store_queue_id(struct device *d,
>+				      struct device_attribute *attr,
>+				      const char *buffer, size_t count)
>+{
>+	struct slave *slave, *update_slave;
>+	struct bonding *bond = to_bond(d);
>+	u16 qid;
>+	int i, ret = count;
>+	char *delim;
>+	struct net_device *sdev = NULL;
>+
>+	if (!rtnl_trylock())
>+		return restart_syscall();
>+
>+	/* delim will point to queue id if successful */
>+	delim = strchr(buffer, ':');
>+	if (!delim)
>+		goto err_no_cmd;
>+
>+	/*
>+	 * Terminate string that points to device name and bump it
>+	 * up one, so we can read the queue id there.
>+	 */
>+	*delim = '\0';
>+	if (sscanf(++delim, "%hd\n", &qid) != 1)
>+		goto err_no_cmd;
>+
>+	/* Check buffer length, valid ifname and queue id */
>+	if (strlen(buffer) > IFNAMSIZ ||
>+	    !dev_valid_name(buffer) ||
>+	    qid > bond->params.tx_queues)
>+		goto err_no_cmd;
>+
>+	/* Get the pointer to that interface if it exists */
>+	sdev = __dev_get_by_name(dev_net(bond->dev), buffer);
>+	if (!sdev)
>+		goto err_no_cmd;
>+
>+	read_lock(&bond->lock);
>+
>+	/* Search for the slave and check for duplicate qids */
>+	update_slave = NULL;
>+	bond_for_each_slave(bond, slave, i) {
>+		if (sdev == slave->dev)
>+			/*
>+			 * We don't need to check the matching
>+			 * slave for dups, since we're overwriting it
>+			 */
>+			update_slave = slave;
>+		else if (qid && qid == slave->queue_id) {
>+			goto err_no_cmd_unlock;
>+		}
>+	}
>+
>+	if (!update_slave)
>+		goto err_no_cmd_unlock;
>+
>+	/* Actually set the qids for the slave */
>+	update_slave->queue_id = qid;
>+
>+	read_unlock(&bond->lock);
>+out:
>+	rtnl_unlock();
>+	return ret;
>+
>+err_no_cmd_unlock:
>+	read_unlock(&bond->lock);
>+err_no_cmd:
>+	pr_info("invalid input for queue_id set for %s.\n",
>+		bond->dev->name);
>+	ret = -EPERM;
>+	goto out;
>+}
>+
>+static DEVICE_ATTR(queue_id, S_IRUGO | S_IWUSR, bonding_show_queue_id,
>+		   bonding_store_queue_id);
>+
>+
>+/*
>  * Show and set the keep_all flag.
>  */
> static ssize_t bonding_show_keep(struct device *d,
>@@ -1513,7 +1628,6 @@ static DEVICE_ATTR(keep_all, S_IRUGO | S_IWUSR,
> 		   bonding_show_keep, bonding_store_keep);
>
>
>-
> static struct attribute *per_bond_attrs[] = {
> 	&dev_attr_slaves.attr,
> 	&dev_attr_mode.attr,
>@@ -1539,6 +1653,7 @@ static struct attribute *per_bond_attrs[] = {
> 	&dev_attr_ad_actor_key.attr,
> 	&dev_attr_ad_partner_key.attr,
> 	&dev_attr_ad_partner_mac.attr,
>+	&dev_attr_queue_id.attr,
> 	&dev_attr_keep_all.attr,
> 	NULL,
> };
>diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
>index 3b7532f..274a3a1 100644
>--- a/drivers/net/bonding/bonding.h
>+++ b/drivers/net/bonding/bonding.h
>@@ -60,6 +60,9 @@
> 		 ((mode) == BOND_MODE_TLB)          ||	\
> 		 ((mode) == BOND_MODE_ALB))
>
>+#define TX_QUEUE_OVERRIDE(mode)				\
>+			(((mode) == BOND_MODE_ACTIVEBACKUP) ||	\
>+			 ((mode) == BOND_MODE_ROUNDROBIN))
> /*
>  * Less bad way to call ioctl from within the kernel; this needs to be
>  * done some other way to get the call out of interrupt context.
>@@ -131,6 +134,7 @@ struct bond_params {
> 	char primary[IFNAMSIZ];
> 	int primary_reselect;
> 	__be32 arp_targets[BOND_MAX_ARP_TARGETS];
>+	int tx_queues;
> 	int keep_all;
> };
>
>@@ -166,6 +170,7 @@ struct slave {
> 	u8     perm_hwaddr[ETH_ALEN];
> 	u16    speed;
> 	u8     duplex;
>+	u16    queue_id;
> 	struct ad_slave_info ad_info; /* HUGE - better to dynamically alloc */
> 	struct tlb_slave_info tlb_info;
> };
>diff --git a/include/linux/if_bonding.h b/include/linux/if_bonding.h
>index cd525fa..2c79943 100644
>--- a/include/linux/if_bonding.h
>+++ b/include/linux/if_bonding.h
>@@ -83,6 +83,7 @@
>
> #define BOND_DEFAULT_MAX_BONDS  1   /* Default maximum number of devices to support */
>
>+#define BOND_DEFAULT_TX_QUEUES 16   /* Default number of tx queues per device */
> /* hashing types */
> #define BOND_XMIT_POLICY_LAYER2		0 /* layer 2 (MAC only), default */
> #define BOND_XMIT_POLICY_LAYER34	1 /* layer 3+4 (IP ^ (TCP || UDP)) */
>-- 
>1.6.2.5

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
Neil Horman May 12, 2010, 12:27 a.m. UTC | #2
On Tue, May 11, 2010 at 01:09:39PM -0700, Jay Vosburgh wrote:
> Andy Gospodarek <andy@greyhouse.net> wrote:
> 
> >This patch gives the user the ability to control the output slave for
> >round-robin and active-backup bonding.  Similar functionality was
> >discussed in the past, but Jay Vosburgh indicated he would rather see a
> >feature like this added to existing modes rather than creating a
> >completely new mode.  Jay's thoughts as well as Neil's input surrounding
> >some of the issues with the first implementation pushed us toward a
> >design that relied on the queue_mapping rather than skb marks.
> >Round-robin and active-backup modes were chosen as the first users of
> >this slave selection as they seemed like the most logical choices when
> >considering a multi-switch environment.
> >
> >Round-robin mode works without any modification, but active-backup does
> >require inclusion of the first patch in this series and setting
> >the 'keep_all' flag.  This will allow reception of unicast traffic on
> >any of the backup interfaces.
> 
> 	Yes, I did think that the mark business fit better into existing
> modes (I thought of it as kind of a new hash for xor and 802.3ad modes).
> I also didn't expect to see so much new stuff (this, as well as the FCOE
> special cases being discussed elsewhere) being shoehorned into the
> active-backup mode.  I'm not so sure that adding so many special cases
> to active-backup is a good thing.
> 
> 	Now, I'm starting to wonder if you were right, and it would be
> better overall to have a "manual" mode that would hopefully satisfy this
> case as well as the FCOE special case.  I don't think either of these is
> a bad use case, I'm just not sure the right way to handle them is
> another special knob in active-backup mode (either directly, or
> implicitly in __netif_receive_skb), which wasn't what I expected to see.
> 
I honestly don't think a separate mode is warranted here.  While I'm not opposed
to adding a new mode, I really think doing so is no different from overloading
an existing mode.  I say that because to add a new mode in which we explicitly
expect traffic to be directed to various slaves requires that we implement a
policy for frames which have no queue mapping determined on egress.  Any policy
I can think of is really an approximation of an existing policy, so we may as
well reuse the policy code that we already have in place.  About the only way a
separate mode makes sense is in the 'passthrough' queue mode you document below.
In this model, in which queue ids map to slaves in a 1:1 fashion, it doesn't
make sense.


> 	I presume you're overloading active-backup because it's not
> etherchannel, 802.3ad, etc, and just talks right to the switch.  For the
> regular load balance modes, I still think overlay into the existing
> modes is preferable (more on that later); I'm thinking of "manual"
> instead of another tweak to active-backup.
> 
> 	If users want to have actual hot-standby functionality, then
> active-backup would do that, and nothing else (and it can be multi-queue
> aware, but only one slave active at a time).
> 
Yes, but active-backup doesn't provide preferred output path selection in and
of itself.  That's the feature here.

> 	Users who want the set of bonded slaves to look like a big
> multiqueue buffet could use this "manual" mode and set things up however
> they want.  One way to set it up is simply that the bond is N queues
> wide, where N is the total of the queue counts of all the slaves.  If a
> slave fails, N gets smaller, and the user code has to deal with that.
> Since the queue count of a device can't change dynamically, the bond
> would have to actually be set up with some big number of queues, and
> then only a subset is actually active (or there is some sort of wrap).
> 
> 	In such an implementation, each slave would have a range of
> queue IDs, not necessarily just one.  I'm a bit leery of exposing an API
> where each slave is one queue ID, as it could make transitioning to real
> multi-queue awareness difficult.
> 
I'm sorry, what exactly do you mean when you say 'real' multi-queue
awareness?  How is this any less real than any other implementation?  The
approach you outline above isn't any more or less valid than this one.

While we're on the subject, Andy and I did discuss a model similar to what you
describe above (what I'll refer to as a queue id passthrough model), in which
you can tell the bonding driver to map a frame to a queue, and the bonding
driver doesn't really do anything with the queue id other than pass it to the
slave device for hardware-based multiqueue tx handling.  While we could do
that, it's my feeling such a model isn't the way to go for two primary reasons:

1) Inconsistent behavior.  Such an implementation makes assumptions regarding
queue id specification within a driver.  For example, what if one of the slaves
reserves some fixed number of low-order queues for a specific purpose, so that
general-use queues begin at an offset from zero, while other slaves do not?
While it's easy to accommodate such needs when writing the tc filters, if a
slave fails over, such a bias would change output traffic behavior, as the
bonding driver can't be clearly informed of such a bias.  Likewise, what if a
slave driver allocates more queues than it actually supports in hardware (like
the implementation you propose; ixgbe IIRC actually does this)?  If slaves
handled unimplemented tx queues differently (if one wrapped queues while the
other simply dropped frames to unimplemented queues, for instance), a failover
would change traffic patterns dramatically.

2) Need.  While (1) can pretty easily be managed with a few configuration
guidelines (output queues on slaves have to be configured identically, lest
chaos and madness befall you, etc.), there's really no reason to bind users to
such a system.  We're using tc filters to set the queue id on skbs enqueued to
the bonding driver; there's absolutely no reason you can't add additional
filters to the slaves directly.  Since the bonding driver uses dev_queue_xmit
to send a frame to a slave, the frame has the opportunity to pass through
another set of queuing disciplines and filters that can reset and re-assign
the skb's queue mapping.  So with the approach in this patch you get direct
output control without sacrificing actual hardware tx output queue control.
With a passthrough model, you save a bit of filter configuration, but at the
expense of having to be much more careful about how you configure your slave
NICs, and such configuration errors would be rather difficult to track down,
as exposing one would require the generation of traffic that hit the right
filter after a failover.
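
To make the two-stage arrangement above concrete, a hypothetical setup
(device names, the address, and queue numbers are examples only) could first
steer a flow to a slave at the bond level, then re-map it to a specific
hardware tx queue on the slave itself:

# stage 1: on the bond, direct the flow to bond queue 2 (mapped to eth1)
tc qdisc add dev bond0 handle 1 root multiq
tc filter add dev bond0 protocol ip parent 1: prio 1 u32 \
	match ip dst 192.168.1.100 action skbedit queue_mapping 2

# stage 2: on the slave, re-assign the same flow to hardware tx queue 1
tc qdisc add dev eth1 handle 2 root multiq
tc filter add dev eth1 protocol ip parent 2: prio 1 u32 \
	match ip dst 192.168.1.100 action skbedit queue_mapping 1

Because the bond hands the frame to the slave via dev_queue_xmit, the slave's
own root qdisc and filters run as usual, which is what makes the second stage
possible.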


> 	There might also be a way to tie it in to the new RPS code on
> the receive side.
> 
> 	If the slaves all have the same MAC and attach to a single
> switch via etherchannel, then it all looks pretty much like a single big
> honkin' multiqueue device.  The switch probably won't map the flows back
> the same way, though.
> 
I agree, they probably won't.  Receive side handling wasn't really our focus here,
though.  That's largely why we chose round robin and active backup as our first
modes to use this with.  They are already written to expect frames on either
interface.

> 	If the slaves are on discrete switches (without etherchannel),
> things become more complicated.  If the slaves have the same MAC, then
> the switches will be irritated about seeing that same MAC coming in from
> multiple places.  If the slaves have different MACs, then ARP has the
> same sort of issues.
> 
> 	In thinking about it, if it's linux bonding at both ends, there
> could be any number of discrete switches in the path, and it wouldn't
> matter as long as the linux end can work things out, e.g.,
> 
>         -- switch 1 --
> hostA  /              \  hostB
> bond  ---- switch 2 ---- bond
>        \              /
>         -- switch 3 --
> 
> 	For something like this, the switches would never share MAC
> information for the bonding slaves.  The issue here then becomes more of
> detecting link failures (it would require either a "trunk failover" type
> of function on the switch, or some kind of active probe between the
> bonds).
> 
> 	Now, I realize that I'm babbling a bit, as from reading your
> description, this isn't necessarily your target topology (which sounded
> more like a case of slave A can reach only network X, and slave B can
> reach anywhere, so sending to network X should use slave A
> preferentially), or, as long as I'm doing ASCII-art,
> 
>        --- switch 1 ---- network X
> hostA /               /
> bond  ---- switch 2 -+-- anywhere
> 
> 	Is that an accurate representation?  Or is it something a bit
> different, e.g.,
> 
>        --- switch 1 ---- network X -\
> hostA /                             /
> bond  ---- switch 2 ---- anywhere --
> 
> 	I.e., the "anywhere" connects back to network X from the
> outside, so to speak.  Or, oh, maybe I'm missing it entirely, and you're
> thinking of something like this:
> 
>        --- switch 1 --- VPN --- web site
> hostA /                          /
> bond  ---- switch 2 - Internet -/
> 
> 	Where you prefer to hit "web site" via the VPN (perhaps it's a
> more efficient or secure path), but can do it from the public network at
> large if necessary.
> 
Yes, this one.  I think the other models are equally interesting, but this
model, in which either path has universal reachability but for some classes of
traffic one path is preferred over the other, is the one we had in mind.

> 	Now, regardless of the above, your first patch ("keep_all") is
> to deal with the reverse problem, if this is a piggyback on top of
> active-backup mode: how to get packets back, when both channels can be
> active simultaneously.  That actually dovetails to a degree with work
> I've been doing lately, but the solution there probably isn't what
> you're looking for (there's a user space daemon to do path finding, and
> the "bond IP" address is piggybacked on the slaves' MAC addresses, which
> are not changed; the "bond IP" set exists in a separate subnet all its
> own).
> 
> 	As I said, I'm not convinced that the "keep_all" option to
> active-backup is really better than just a "manual" mode that lacks the
> dup suppression and expects the user to set everything up.
> 
> 	As for the round-robin change in this patch, if I'm reading it
> right, then the way it works is that the packets are round-robined,
> unless there's a queue id passed in, in which case it's assigned to the
> slave mapped to that queue id.  I'm not entirely sure why you picked
> round-robin mode for that over balance-xor; it doesn't seem to fit well
> with the description in the documentation.  Or is it just sort of a
> demonstrator?
> 
It was selected because round robin allows transmits on any interface already,
and expects frames on any interface, so it was a 'safe' choice.  I would think
balance-xor would also work.  Ideally it would be nice to get more modes
supporting this mechanism.

> 	I do like one other aspect of the patch, and that's the concept
> of overlaying the queue map on top of the balance algorithm.  So, e.g.,
> balance-xor would do its usual thing, unless the packet is queue mapped,
> in which case the packet's assignment is obeyed.  The balance-xor could
> even optionally do its xor across the full set of all slaves output
> queues instead of just across the slaves.  Round-robin can operate
> similarly.  For those modes, a "balance by queue vs. balance by slave"
> seems like a reasonable knob to have.
Not sure what you mean here.  In the model implemented by this patch, there is
one output queue per slave, and as such, balance by queue == balance by slave.
That would make sense in the model you describe earlier in this note, but not in
the model presented by this patch.

> 
> 	I do understand that you're proposing something relatively
> simple, and I'm thinking out loud about alternate or additional
> implementation details.  Some of this is "ooh ahh what if", but we also
> don't want to end up with something that's forwards incompatible, and
> I'm hoping to find one solution to multiple problems.
> 
For clarification, can you enumerate what other problems you are trying to
solve with this feature, or features similar to this?  From this email, the one
that I most clearly see is the desire to allow a passthrough mode of queue
selection, which I think I've noted can be done already (even without this
patch), by attaching additional tc filters to the slaves' output queues directly.
What else do you have in mind?

Thanks & Regards
Neil

> 
Jay Vosburgh May 12, 2010, 7:41 p.m. UTC | #3
Neil Horman <nhorman@tuxdriver.com> wrote:

>On Tue, May 11, 2010 at 01:09:39PM -0700, Jay Vosburgh wrote:
>> Andy Gospodarek <andy@greyhouse.net> wrote:
>> 
>> >This patch gives the user the ability to control the output slave for
>> >round-robin and active-backup bonding.  Similar functionality was
>> >discussed in the past, but Jay Vosburgh indicated he would rather see a
>> >feature like this added to existing modes rather than creating a
>> >completely new mode.  Jay's thoughts as well as Neil's input surrounding
>> >some of the issues with the first implementation pushed us toward a
>> >design that relied on the queue_mapping rather than skb marks.
>> >Round-robin and active-backup modes were chosen as the first users of
>> >this slave selection as they seemed like the most logical choices when
>> >considering a multi-switch environment.
>> >
>> >Round-robin mode works without any modification, but active-backup does
>> >require inclusion of the first patch in this series and setting
>> >the 'keep_all' flag.  This will allow reception of unicast traffic on
>> >any of the backup interfaces.
>> 
>> 	Yes, I did think that the mark business fit better into existing
>> modes (I thought of it as kind of a new hash for xor and 802.3ad modes).
>> I also didn't expect to see so much new stuff (this, as well as the FCOE
>> special cases being discussed elsewhere) being shoehorned into the
>> active-backup mode.  I'm not so sure that adding so many special cases
>> to active-backup is a good thing.
>> 
>> 	Now, I'm starting to wonder if you were right, and it would be
>> better overall to have a "manual" mode that would hopefully satisfy this
>> case as well as the FCOE special case.  I don't think either of these is
>> a bad use case, I'm just not sure the right way to handle them is
>> another special knob in active-backup mode (either directly, or
>> implicitly in __netif_receive_skb), which wasn't what I expected to see.
>> 
>I honestly don't think a separate mode is warranted here.  While I'm not opposed
>to adding a new mode, I really think doing so is no different from overloading
>an existing mode.  I say that because to add a new mode in which we explicitly
>expect traffic to be directed to various slaves requires that we implement a
>policy for frames which have no queue mapping determined on egress.  Any policy
>I can think of is really an approximation of an existing policy, so we may as
>well reuse the policy code that we already have in place.  About the only way a
>separate mode makes sense is in the 'passthrough' queue mode you document below.
>In this model, in which queue ids map to slaves in a 1:1 fashion, it doesn't
>make sense.

	One goal I'm hoping to achieve is something that would satisfy
both the queue map stuff that you're looking for, and would meet the
needs of the FCOE people who also want to disable the duplicate
suppression (i.e., permit incoming traffic on the inactive slave) for a
different special case.

	The FCOE proposal revolves around, essentially, active-backup
mode, but permitting incoming traffic on the inactive slave.  At the
moment, the patches attempt to special case things such that only
dev_add_pack listeners directly bound to the inactive slave are checked
(to permit the FCOE traffic to pass on the inactive slave, but still
dropping IP, as ip_rcv is a wildcard bind).

	Your keep_all patch is, by and large, the same thing, except
that it permits anything to come in on the "inactive" slave, and it's a
switch that has to be turned on.

	This seems like needless duplication to me; I'd prefer to see a
single solution that handles both cases instead of two special cases
that each do 90% of what the other does.

	As far as a new mode goes, one major reason I think a separate
mode is warranted is the semantics: with either of these changes (to
permit more or less regular use of the "inactive" slaves), the mode
isn't really an "active-backup" mode any longer; there is no "inactive"
or "backup" slave.  I think of this as being a major change of
functionality, not simply a minor option.

	Hence my thought that "active-backup" could stay as a "true" hot
standby mode (backup slaves are just that: for backup, only), and this
new mode would be the place to do the special queue-map / FCOE /
whatever that isn't really a hot standby configuration any longer.

	As far as the behavior of the new mode (your concern about its
policy map approximations), in the end, it would probably act pretty
much like active-backup with your patch applied: traffic goes out the
active slave, unless directed otherwise.  It's a lot less complicated
than I had feared.

>> 	I presume you're overloading active-backup because it's not
>> etherchannel, 802.3ad, etc, and just talks right to the switch.  For the
>> regular load balance modes, I still think overlay into the existing
>> modes is preferable (more on that later); I'm thinking of "manual"
>> instead of another tweak to active-backup.
>> 
>> 	If users want to have actual hot-standby functionality, then
>> active-backup would do that, and nothing else (and it can be multi-queue
>> aware, but only one slave active at a time).
>> 
>Yes, but active-backup doesn't provide preferred output path selection in and
>of itself.  That's the feature here.

	I understand that; I'm suggesting that active-backup should
provide no service other than hot standby, and not be overloaded into a
manual load balancing scheme (both for your use, and for FCOE).

	Maybe I'm worrying too much about defending the purity of the
active-backup mode; I understand what you're trying to do a little
better now, and yes, the "manual" mode I think of (in your queue mapping
scheme, not the other doodads I talked about) would basically be
active-backup with your queue mapper, minus the duplicate suppressor.

>> 	Users who want the set of bonded slaves to look like a big
>> multiqueue buffet could use this "manual" mode and set things up however
>> they want.  One way to set it up is simply that the bond is N queues
>> wide, where N is the total of the queue counts of all the slaves.  If a
>> slave fails, N gets smaller, and the user code has to deal with that.
>> Since the queue count of a device can't change dynamically, the bond
>> would have to actually be set up with some big number of queues, and
>> then only a subset is actually active (or there is some sort of wrap).
>> 
>> 	In such an implementation, each slave would have a range of
>> queue IDs, not necessarily just one.  I'm a bit leery of exposing an API
>> where each slave is one queue ID, as it could make transitioning to real
>> multi-queue awareness difficult.
>> 
>I'm sorry, what exactly do you mean when you say 'real' multi-queue
>awareness?  How is this any less real than any other implementation?  The
>approach you outline above isn't any more or less valid than this one.

	That was my misunderstanding of how you planned to handle
things.  I had thought this patch was simply a scheme to use the queue
IDs for slave selection, without any method to further perform queue
selection on the slave itself (I hadn't thought of placing a tc action
on the slave itself, which you described later on).  I had been thinking
in terms of schemes to expose all of the slave queues on the bonding
device.

	So, I don't see any issues with the queue mapping part.  I still
want to find a common solution for FCOE / your patch with regards to the
duplicate suppressor.

>While we're on the subject, Andy and I did discuss a model similar to what you
>describe above (what I'll refer to as a queue id passthrough model), in which
>you can tell the bonding driver to map a frame to a queue, and the bonding
>driver doesn't really do anything with the queue id other than pass it to the
>slave device for hardware-based multiqueue tx handling.  While we could do
>that, it's my feeling such a model isn't the way to go for two primary reasons:
>
>1) Inconsistent behavior.  Such an implementation makes assumptions regarding
>queue id specification within a driver.  For example, what if one of the slaves
>reserves some fixed number of low-order queues for a specific purpose, so that
>general-use queues begin at an offset from zero, while other slaves do not?
>While it's easy to accommodate such needs when writing the tc filters, if a
>slave fails over, such a bias would change output traffic behavior, as the
>bonding driver can't be clearly informed of such a bias.  Likewise, what if a
>slave driver allocates more queues than it actually supports in hardware (like
>the implementation you propose; ixgbe IIRC actually does this)?  If slaves
>handled unimplemented tx queues differently (if one wrapped queues while the
>other simply dropped frames to unimplemented queues, for instance), a failover
>would change traffic patterns dramatically.
>
>2) Need.  While (1) can pretty easily be managed with a few configuration
>guidelines (output queues on slaves have to be configured identically, lest
>chaos and madness befall you, etc.), there's really no reason to bind users to
>such a system.  We're using tc filters to set the queue id on skbs enqueued to
>the bonding driver; there's absolutely no reason you can't add additional
>filters to the slaves directly.  Since the bonding driver uses dev_queue_xmit
>to send a frame to a slave, the frame has the opportunity to pass through
>another set of queuing disciplines and filters that can reset and re-assign
>the skb's queue mapping.  So with the approach in this patch you get direct
>output control without sacrificing actual hardware tx output queue control.
>With a passthrough model, you save a bit of filter configuration, but at the
>expense of having to be much more careful about how you configure your slave
>NICs, and such configuration errors would be rather difficult to track down,
>as exposing one would require the generation of traffic that hit the right
>filter after a failover.

	I don't disagree with any of this.  One thought I do have is
that Eric Dumazet, I believe, has mentioned that the read lock in
bonding is a limiting factor on 10G performance.  In the far distant
future when bonding is RCU, going through the lock(s) on the tc actions
of the slave could have the same net effect, and in such a case, a
qdisc-less path may be of benefit.  Not a concern for today, I suspect.

>> 	There might also be a way to tie it in to the new RPS code on
>> the receive side.
>> 
>> 	If the slaves all have the same MAC and attach to a single
>> switch via etherchannel, then it all looks pretty much like a single big
>> honkin' multiqueue device.  The switch probably won't map the flows back
>> the same way, though.
>> 
>I agree, they probably won't.  Receive side handling wasn't really our focus here,
>though.  That's largely why we chose round robin and active backup as our first
>modes to use this with.  They are already written to expect frames on either
>interface.
>
>> 	If the slaves are on discrete switches (without etherchannel),
>> things become more complicated.  If the slaves have the same MAC, then
>> the switches will be irritated about seeing that same MAC coming in from
>> multiple places.  If the slaves have different MACs, then ARP has the
>> same sort of issues.
>> 
>> 	In thinking about it, if it's linux bonding at both ends, there
>> could be any number of discrete switches in the path, and it wouldn't
>> matter as long as the linux end can work things out, e.g.,
>> 
>>         -- switch 1 --
>> hostA  /              \  hostB
>> bond  ---- switch 2 ---- bond
>>        \              /
>>         -- switch 3 --
>> 
>> 	For something like this, the switches would never share MAC
>> information for the bonding slaves.  The issue here then becomes more of
>> detecting link failures (it would require either a "trunk failover" type
>> of function on the switch, or some kind of active probe between the
>> bonds).
>> 
>> 	Now, I realize that I'm babbling a bit, as from reading your
>> description, this isn't necessarily your target topology (which sounded
>> more like a case of slave A can reach only network X, and slave B can
>> reach anywhere, so sending to network X should use slave A
>> preferentially), or, as long as I'm doing ASCII-art,
>> 
>>        --- switch 1 ---- network X
>> hostA /               /
>> bond  ---- switch 2 -+-- anywhere
>> 
>> 	Is that an accurate representation?  Or is it something a bit
>> different, e.g.,
>> 
>>        --- switch 1 ---- network X -\
>> hostA /                             /
>> bond  ---- switch 2 ---- anywhere --
>> 
>> 	I.e., the "anywhere" connects back to network X from the
>> outside, so to speak.  Or, oh, maybe I'm missing it entirely, and you're
>> thinking of something like this:
>> 
>>        --- switch 1 --- VPN --- web site
>> hostA /                          /
>> bond  ---- switch 2 - Internet -/
>> 
>> 	Where you prefer to hit "web site" via the VPN (perhaps it's a
>> more efficient or secure path), but can do it from the public network at
>> large if necessary.
>> 
>Yes, this one.  I think the other models are equally interesting, but this
>model, in which either path has universal reachability but for some classes of
>traffic one path is preferred over the other, is the one we had in mind.
>
>> 	Now, regardless of the above, your first patch ("keep_all") is
>> to deal with the reverse problem, if this is a piggyback on top of
>> active-backup mode: how to get packets back, when both channels can be
>> active simultaneously.  That actually dovetails to a degree with work
>> I've been doing lately, but the solution there probably isn't what
>> you're looking for (there's a user space daemon to do path finding, and
>> the "bond IP" address is piggybacked on the slaves' MAC addresses, which
>> are not changed; the "bond IP" set exists in a separate subnet all its
>> own).
>> 
>> 	As I said, I'm not convinced that the "keep_all" option to
>> active-backup is really better than just a "manual" mode that lacks the
>> dup suppression and expects the user to set everything up.
>> 
>> 	As for the round-robin change in this patch, if I'm reading it
>> right, then the way it works is that the packets are round-robined,
>> unless there's a queue id passed in, in which case it's assigned to the
>> slave mapped to that queue id.  I'm not entirely sure why you picked
>> round-robin mode for that over balance-xor; it doesn't seem to fit well
>> with the description in the documentation.  Or is it just sort of a
>> demonstrator?
>> 
>It was selected because round robin allows transmits on any interface already,
>and expects frames on any interface, so it was a 'safe' choice.  I would think
>balance-xor would also work.  Ideally it would be nice to get more modes
>supporting this mechanism.

	I think that this should work for balance-xor and 802.3ad.  The
only limitation for 802.3ad is that the spec requires "conversations" to
not be striped or to skip around in a manner that could lead to out of
order delivery.

	I'm not so sure about the alb/tlb modes; at first thought, I
think it could have conflicts with the internal balancing done within
the modes (if, e.g., the tc action put traffic for the same destination
on two slaves).

>> 	I do like one other aspect of the patch, and that's the concept
>> of overlaying the queue map on top of the balance algorithm.  So, e.g.,
>> balance-xor would do its usual thing, unless the packet is queue mapped,
>> in which case the packet's assignment is obeyed.  The balance-xor could
>> even optionally do its xor across the full set of all slaves output
>> queues instead of just across the slaves.  Round-robin can operate
>> similarly.  For those modes, a "balance by queue vs. balance by slave"
>> seems like a reasonable knob to have.
>Not sure what you mean here.  In the model implemented by this patch, there is
>one output queue per slave, and as such, balance by queue == balance by slave.
>That would make sense in the model you describe earlier in this note, but not in
>the model presented by this patch.

	Yes, I was thinking about what I had described; again,
predicated on my misunderstanding of how it all worked.

>> 	I do understand that you're proposing something relatively
>> simple, and I'm thinking out loud about alternate or additional
>> implementation details.  Some of this is "ooh ahh what if", but we also
>> don't want to end up with something that's forwards incompatible, and
>> I'm hoping to find one solution to multiple problems.
>> 
>For clarification, can you enumerate what other problems you are trying to
>solve with this feature, or features similar to this?  From this email, the one
>that I most clearly see is the desire to allow a passthrough mode of queue
>selection, which I think I've noted can be done already (even without this
>patch), by attaching additional tc filters to the slaves' output queues directly.
>What else do you have in mind?

	As I said above, I hadn't thought of stacking tc actions on to
the slaves directly, so I was thinking on ways to expose the slave
queues.

	I still find something intriguing about a round-robin or xor
mode that robins/xors through all of the slave queues, though, but that
should be something separate (I'm not sure if such a scheme is actually
"better", either).

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
Andy Gospodarek May 12, 2010, 10:14 p.m. UTC | #4
On Wed, May 12, 2010 at 12:41:54PM -0700, Jay Vosburgh wrote:
> Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> >On Tue, May 11, 2010 at 01:09:39PM -0700, Jay Vosburgh wrote:
> >> Andy Gospodarek <andy@greyhouse.net> wrote:
> >> 
> >> >This patch gives the user the ability to control the output slave for
> >> >round-robin and active-backup bonding.  Similar functionality was
> >> >discussed in the past, but Jay Vosburgh indicated he would rather see a
> >> >feature like this added to existing modes rather than creating a
> >> >completely new mode.  Jay's thoughts as well as Neil's input surrounding
> >> >some of the issues with the first implementation pushed us toward a
> >> >design that relied on the queue_mapping rather than skb marks.
> >> >Round-robin and active-backup modes were chosen as the first users of
> >> >this slave selection as they seemed like the most logical choices when
> >> >considering a multi-switch environment.
> >> >
> >> >Round-robin mode works without any modification, but active-backup does
> >> >require inclusion of the first patch in this series and setting
> >> >the 'keep_all' flag.  This will allow reception of unicast traffic on
> >> >any of the backup interfaces.
> >> 
> >> 	Yes, I did think that the mark business fit better into existing
> >> modes (I thought of it as kind of a new hash for xor and 802.3ad modes).
> >> I also didn't expect to see so much new stuff (this, as well as the FCOE
> >> special cases being discussed elsewhere) being shoehorned into the
> >> active-backup mode.  I'm not so sure that adding so many special cases
> >> to active-backup is a good thing.
> >> 
> >> 	Now, I'm starting to wonder if you were right, and it would be
> >> better overall to have a "manual" mode that would hopefully satisfy this
> >> case as well as the FCOE special case.  I don't think either of these is
> >> a bad use case, I'm just not sure the right way to handle them is
> >> another special knob in active-backup mode (either directly, or
> >> implicitly in __netif_receive_skb), which wasn't what I expected to see.
> >> 
> >I honestly don't think a separate mode is warranted here.  While I'm not opposed
> >to adding a new mode, I really think doing so is no different from overloading
> >an existing mode.  I say that because to add a new mode in which we explicitly
> >expect traffic to be directed to various slaves requires that we implement a
> >policy for frames which have no queue mapping determined on egress.  Any policy
> >I can think of is really an approximation of an existing policy, so we may as
> >well reuse the policy code that we already have in place.  About the only way a
> >separate mode makes sense is in the 'passthrough' queue mode you document below.
> >In this model, in which queue ids map to slaves in a 1:1 fashion it doesn't make
> >sense.
> 
> 	One goal I'm hoping to achieve is something that would satisfy
> both the queue map stuff that you're looking for, and would meet the
> needs of the FCOE people who also want to disable the duplicate
> suppression (i.e., permit incoming traffic on the inactive slave) for a
> different special case.
> 
> 	The FCOE proposal revolves around, essentially, active-backup
> mode, but permitting incoming traffic on the inactive slave.  At the
> moment, the patches attempt to special case things such that only
> dev_add_pack listeners directly bound to the inactive slave are checked
> (to permit the FCOE traffic to pass on the inactive slave, but still
> dropping IP, as ip_rcv is a wildcard bind).
> 
> 	Your keep_all patch is, by and large, the same thing, except
> that it permits anything to come in on the "inactive" slave, and it's a
> switch that has to be turned on.
> 
> 	This seems like needless duplication to me; I'd prefer to see a
> single solution that handles both cases instead of two special cases
> that each do 90% of what the other does.
> 
> 	As far as a new mode goes, one major reason I think a separate
> mode is warranted is the semantics: with either of these changes (to
> permit more or less regular use of the "inactive" slaves), the mode
> isn't really an "active-backup" mode any longer; there is no "inactive"
> or "backup" slave.  I think of this as being a major change of
> functionality, not simply a minor option.
> 
> 	Hence my thought that "active-backup" could stay as a "true" hot
> standby mode (backup slaves are just that: for backup, only), and this
> new mode would be the place to do the special queue-map / FCOE /
> whatever that isn't really a hot standby configuration any longer.
> 
> 	As far as the behavior of the new mode (your concern about its
> policy map approximations), in the end, it would probably act pretty
> much like active-backup with your patch applied: traffic goes out the
> active slave, unless directed otherwise.  It's a lot less complicated
> than I had feared.
> 

It's beginning to sound like the 'FCoE use-case' and the one Neil and I
are proposing are quite similar.  The main goal of both is to have the
option to have multiple slaves send and receive traffic during the
steady-state, but in the event of a failover all traffic would run on a
single interface.

The implementation proposed with this patch is a bit different from the
'mark-mode' patch you may recall I posted a few months ago.  It created
a new mode that essentially did exactly what you are describing --
transmit on the primary interface unless pushed to another interface via
info in the skb and receive on all interfaces.  We initially did not
create a new mode based on your reservations about the previous
mark-mode patch and went the direction of enhancing one or two modes
initially (figuring it would be good to walk before running), with the
idea that other modes could take care of this output slave selection
logic in the future.
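
For reference, the basic usage we have in mind looks roughly like this
(a minimal sketch: the queue-to-slave map assumes the sysfs interface
added in this series, and the address and queue number are made up):

    # map queue id 2 to slave eth1
    echo "eth1:2" > /sys/class/net/bond0/bonding/queue_id

    # attach a multiq qdisc to the bond and use the skbedit action to
    # set skb->queue_mapping for flows that should go out eth1
    tc qdisc add dev bond0 handle 1 root multiq
    tc filter add dev bond0 protocol ip parent 1: prio 1 u32 \
        match ip dst 192.168.1.100 action skbedit queue_mapping 2

Anything that matches no filter keeps queue_mapping 0 and falls back to
the mode's normal slave selection.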


> >> 	I presume you're overloading active-backup because it's not
> >> etherchannel, 802.3ad, etc, and just talks right to the switch.  For the
> >> regular load balance modes, I still think overlay into the existing
> >> modes is preferable (more on that later); I'm thinking of "manual"
> >> instead of another tweak to active-backup.
> >> 
> >> 	If users want to have actual hot-standby functionality, then
> >> active-backup would do that, and nothing else (and it can be multi-queue
> >> aware, but only one slave active at a time).
> >> 
> > >Yes, but active backup doesn't provide preferred output path selection in and of
> > >itself.  That's the feature here.
> 
> 	I understand that; I'm suggesting that active-backup should
> provide no service other than hot standby, and not be overloaded into a
> manual load balancing scheme (both for your use, and for FCOE).
> 
> 	Maybe I'm worrying too much about defending the purity of the
> active-backup mode; I understand what you're trying to do a little
> better now, and yes, the "manual" mode I think of (in your queue mapping
> scheme, not the other doodads I talked about) would basically be
> active-backup with your queue mapper, minus the duplicate suppressor.
> 

It doesn't matter terribly to me which direction is taken.  Again, a
major reason this route was proposed was that you were not as keen on
creating a new mode as I was at the time of that patch posting.  It's
somewhat understandable as once a mode is added it's tough to take away,
but when one sees how much we are really changing the way active-backup
might behave in some cases maybe it makes sense to use a new mode?

I guess I like the idea of adding this output selection to existing
modes because it at least gives us the option to use queue maps to
select output interfaces for more than a mode that looks like
present-day active-backup minus the duplicate suppression.   I'm happy to
code-up a patch that creates a new mode, but before I go do that and
test it, I'd like to know we have come to an agreement on the direction
for the future.

> >> 	Users who want the set of bonded slaves to look like a big
> >> multiqueue buffet could use this "manual" mode and set things up however
> >> they want.  One way to set it up is simply that the bond is N queues
> >> wide, where N is the total of the queue counts of all the slaves.  If a
> >> slave fails, N gets smaller, and the user code has to deal with that.
> >> Since the queue count of a device can't change dynamically, the bond
> >> would have to actually be set up with some big number of queues, and
> >> then only a subset is actually active (or there is some sort of wrap).
> >> 
> >> 	In such an implementation, each slave would have a range of
> >> queue IDs, not necessarily just one.  I'm a bit leery of exposing an API
> >> where each slave is one queue ID, as it could make transitioning to real
> >> multi-queue awareness difficult.
> >> 
> >I'm sorry, what exactly do you mean when you say 'real' multi queue
> >awareness?  How is this any less real than any other implementation?  The
> >approach you outline above isn't any more or less valid than this one.
> 
> 	That was my misunderstanding of how you planned to handle
> things.  I had thought this patch was simply a scheme to use the queue
> IDs for slave selection, without any method to further perform queue
> selection on the slave itself (I hadn't thought of placing a tc action
> on the slave itself, which you described later on).  I had been thinking
> in terms of schemes to expose all of the slave queues on the bonding
> device.

It wasn't our original intention either.  I didn't mention it in my
original post as it wasn't really the intent of our patch, but a nice
side-effect for the informed user. :) Obviously a bit more testing could
take place and we could add more examples to the documentation for the
nice side-effect feature of this patch, but since this wasn't our
original intent and we didn't test it, we did not advertise it.
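
A sketch of that side-effect (untested, and the device and queue numbers
are only illustrative): since the bond hands each frame to the slave via
dev_queue_xmit, a second filter stacked on the slave can re-mark the
frame to select a hardware tx queue there:

    # on the slave itself, remap the same flow to hardware tx queue 3
    tc qdisc add dev eth1 handle 1 root multiq
    tc filter add dev eth1 protocol ip parent 1: prio 1 u32 \
        match ip dst 192.168.1.100 action skbedit queue_mapping 3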

> 	So, I don't see any issues with the queue mapping part.  I still
> want to find a common solution for FCOE / your patch with regards to the
> duplicate suppressor.

Understood.

> >While we're on the subject, Andy and I did discuss a model similar to what you
> >describe above (what I'll refer to as a queue id passthrough model), in which
> >you can tell the bonding driver to map a frame to a queue, and the bonding
> >driver doesn't really do anything with the queue id other than pass it to the slave
> >device for hardware-based multiqueue tx handling.  While we could do that, it's
> >my feeling such a model isn't the way to go for two primary reasons:
> >
> >1) Inconsistent behavior.  Such an implementation makes assumptions regarding
> >queue id specification within a driver.  For example, what if one of the slaves
> >reserves some fixed number of low order queues for a specific purpose, and as
> >such general use queues begin at an offset from zero, while other slaves do not.
> >While it's easy to accommodate such needs when writing the tc filters, if a slave
> >fails over, such a bias would change output traffic behavior, as the bonding
> >driver can't be clearly informed of such a bias.  Likewise, what if a slave
> >driver allocates more queues than it actually supports in hardware (like the
> >implementation you propose, ixgbe IIRC actually does this).  If slaves handled
> >unimplemented tx queues differently (if one wrapped queues, while the other simply
> >dropped frames to unimplemented queues, for instance), a failover would change
> >traffic patterns dramatically.
> >
> >2) Need.  While (1) can pretty easily be managed with a few configuration
> >guidelines (output queues on slaves have to be configured identically, lest
> >chaos and madness befall you, etc.), there's really no reason to bind users to
> >such a system.  We're using tc filters to set the queue id on skbs enqueued to
> >the bonding driver; there's absolutely no reason you can't add additional filters to
> >the slaves directly.  Since the bonding driver uses dev_queue_xmit to send a
> >frame to a slave, it has the opportunity to pass through another set of queuing
> >disciplines and filters that can reset and re-assign the skb's queue mapping.  So
> >with the approach in this patch you can get direct output control without
> >sacrificing actual hardware tx output queue control.  With a passthrough model,
> >you save a bit of filter configuration, but at the expense of having to be much
> >more careful about how you configure your slave NICs, and detecting such errors
> >in configuration would be rather difficult to track down, as it would require
> >the generation of traffic that hits the right filter after a failover.
> 
> 	I don't disagree with any of this.  One thought I do have is
> that Eric Dumazet, I believe, has mentioned that the read lock in
> bonding is a limiting factor on 10G performance.  In the far distant
> future when bonding is RCU, going through the lock(s) on the tc actions
> of the slave could have the same net effect, and in such a case, a
> qdisc-less path may be of benefit.  Not a concern for today, I suspect.
> 
> >> 	There might also be a way to tie it in to the new RPS code on
> >> the receive side.
> >> 
> >> 	If the slaves all have the same MAC and attach to a single
> >> switch via etherchannel, then it all looks pretty much like a single big
> >> honkin' multiqueue device.  The switch probably won't map the flows back
> >> the same way, though.
> >> 
> >I agree, they probably won't.  Receive side handling wasn't really our focus here
> >though.  That's largely why we chose round robin and active backup as our first
> >modes to use this with.  They are already written to expect frames on either
> >interface.
> >
> >> 	If the slaves are on discrete switches (without etherchannel),
> >> things become more complicated.  If the slaves have the same MAC, then
> >> the switches will be irritated about seeing that same MAC coming in from
> >> multiple places.  If the slaves have different MACs, then ARP has the
> >> same sort of issues.
> >> 
> >> 	In thinking about it, if it's linux bonding at both ends, there
> >> could be any number of discrete switches in the path, and it wouldn't
> >> matter as long as the linux end can work things out, e.g.,
> >> 
> >>         -- switch 1 --
> >> hostA  /              \  hostB
> >> bond  ---- switch 2 ---- bond
> >>        \              /
> >>         -- switch 3 --
> >> 
> >> 	For something like this, the switches would never share MAC
> >> information for the bonding slaves.  The issue here then becomes more of
> >> detecting link failures (it would require either a "trunk failover" type
> >> of function on the switch, or some kind of active probe between the
> >> bonds).
> >> 
> >> 	Now, I realize that I'm babbling a bit, as from reading your
> >> description, this isn't necessarily your target topology (which sounded
> >> more like a case of slave A can reach only network X, and slave B can
> >> reach anywhere, so sending to network X should use slave A
> >> preferentially), or, as long as I'm doing ASCII-art,
> >> 
> >>        --- switch 1 ---- network X
> >> hostA /               /
> >> bond  ---- switch 2 -+-- anywhere
> >> 
> >> 	Is that an accurate representation?  Or is it something a bit
> >> different, e.g.,
> >> 
> >>        --- switch 1 ---- network X -\
> >> hostA /                             /
> >> bond  ---- switch 2 ---- anywhere --
> >> 
> >> 	I.e., the "anywhere" connects back to network X from the
> >> outside, so to speak.  Or, oh, maybe I'm missing it entirely, and you're
> >> thinking of something like this:
> >> 
> >>        --- switch 1 --- VPN --- web site
> >> hostA /                          /
> >> bond  ---- switch 2 - Internet -/
> >> 
> >> 	Where you prefer to hit "web site" via the VPN (perhaps it's a
> >> more efficient or secure path), but can do it from the public network at
> >> large if necessary.
> >> 
> >Yes, this one.  I think the other models are equally interesting, but this model
> >in which either path had universal reachability, but for some classes of traffic
> >one path is preferred over the other is the one we had in mind.
> >
> >> 	Now, regardless of the above, your first patch ("keep_all") is
> >> to deal with the reverse problem, if this is a piggyback on top of
> >> active-backup mode: how to get packets back, when both channels can be
> >> active simultaneously.  That actually dovetails to a degree with work
> >> I've been doing lately, but the solution there probably isn't what
> >> you're looking for (there's a user space daemon to do path finding, and
> >> the "bond IP" address is piggybacked on the slaves' MAC addresses, which
> >> are not changed; the "bond IP" set exists in a separate subnet all its
> >> own).
> >> 
> >> 	As I said, I'm not convinced that the "keep_all" option to
> >> active-backup is really better than just a "manual" mode that lacks the
> >> dup suppression and expects the user to set everything up.
> >> 
> >> 	As for the round-robin change in this patch, if I'm reading it
> >> right, then the way it works is that the packets are round-robined,
> >> unless there's a queue id passed in, in which case it's assigned to the
> >> slave mapped to that queue id.  I'm not entirely sure why you picked
> >> round-robin mode for that over balance-xor; it doesn't seem to fit well
> >> with the description in the documentation.  Or is it just sort of a
> >> demonstrator?
> >> 
> >It was selected because round robin allows transmits on any interface already,
> >and expects frames on any interface, so it was a 'safe' choice.  I would think
> >balance-xor would also work.  Ideally it would be nice to get more modes
> >supporting this mechanism.
> 
> 	I think that this should work for balance-xor and 802.3ad.  The
> only limitation for 802.3ad is that the spec requires "conversations" to
> not be striped or to skip around in a manner that could lead to out of
> order delivery.

Agreed.  Checking would probably also have to be done to make sure that
we were not transmitting on an inactive aggregator.

> 
> 	I'm not so sure about the alb/tlb modes; at first thought, I
> think it could have conflicts with the internal balancing done within
> the modes (if, e.g., the tc action put traffic for the same destination
> on two slaves).
> 

TLB and ALB modes would certainly have to be done differently.  It
should not be terribly difficult to move from the existing hashing
that's done to one that relies on the queue_mapping, but it will take a
bit to make sure it's not a complete hack.

We decided against doing that for all modes on the first pass as it
seemed like active-backup and round-robin were the most likely
users.  We also wanted to present the code early rather than spending time
supporting this on every mode to find out that it just wasn't rational
to do it on some of them.

> >> 	I do like one other aspect of the patch, and that's the concept
> >> of overlaying the queue map on top of the balance algorithm.  So, e.g.,
> >> balance-xor would do its usual thing, unless the packet is queue mapped,
> >> in which case the packet's assignment is obeyed.  The balance-xor could
> >> even optionally do its xor across the full set of all slaves output
> >> queues instead of just across the slaves.  Round-robin can operate
> >> similarly.  For those modes, a "balance by queue vs. balance by slave"
> >> seems like a reasonable knob to have.
> >Not sure what you mean here.  In the model implemented by this patch, there is
> >one output queue per slave, and as such, balance by queue == balance by slave.
> >That would make sense in the model you describe earlier in this note, but not in
> >the model presented by this patch.
> 
> 	Yes, I was thinking about what I had described; again,
> predicated on my misunderstanding of how it all worked.
> 
> >> 	I do understand that you're proposing something relatively
> >> simple, and I'm thinking out loud about alternate or additional
> >> implementation details.  Some of this is "ooh ahh what if", but we also
> >> don't want to end up with something that's forwards incompatible, and
> >> I'm hoping to find one solution to multiple problems.
> >> 
> >For clarification, can you ennumerate what other problems you are trying to
> >solve with this feature, or features simmilar to this?  From this email, the one
> >that I most clearly see is the desire to allow a passthrough mode of queue
> >selection, which I think I've noted can be done already (even without this
> >patch), by attaching additional tc filters to the slaves output queues directly.
> >What else do you have in mind?
> 
> 	As I said above, I hadn't thought of stacking tc actions on to
> the slaves directly, so I was thinking on ways to expose the slave
> queues.
> 
> 	I still find something intriguing about a round-robin or xor
> mode that robins/xors through all of the slave queues, though, but that
> should be something separate (I'm not sure if such a scheme is actually
> "better", either).
> 
> 	-J
> 
> ---
> 	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
John Fastabend May 13, 2010, 7:32 a.m. UTC | #5
Andy Gospodarek wrote:
> On Wed, May 12, 2010 at 12:41:54PM -0700, Jay Vosburgh wrote:
>> Neil Horman <nhorman@tuxdriver.com> wrote:
>>
>>> On Tue, May 11, 2010 at 01:09:39PM -0700, Jay Vosburgh wrote:
>>>> Andy Gospodarek <andy@greyhouse.net> wrote:
>>>>
>>>>> This patch gives the user the ability to control the output slave for
>>>>> round-robin and active-backup bonding.  Similar functionality was
>>>>> discussed in the past, but Jay Vosburgh indicated he would rather see a
>>>>> feature like this added to existing modes rather than creating a
>>>>> completely new mode.  Jay's thoughts as well as Neil's input surrounding
>>>>> some of the issues with the first implementation pushed us toward a
>>>>> design that relied on the queue_mapping rather than skb marks.
>>>>> Round-robin and active-backup modes were chosen as the first users of
>>>>> this slave selection as they seemed like the most logical choices when
>>>>> considering a multi-switch environment.
>>>>>
>>>>> Round-robin mode works without any modification, but active-backup does
>>>>> require inclusion of the first patch in this series and setting
>>>>> the 'keep_all' flag.  This will allow reception of unicast traffic on
>>>>> any of the backup interfaces.
>>>>    Yes, I did think that the mark business fit better into existing
>>>> modes (I thought of it as kind of a new hash for xor and 802.3ad modes).
>>>> I also didn't expect to see so much new stuff (this, as well as the FCOE
>>>> special cases being discussed elsewhere) being shoehorned into the
>>>> active-backup mode.  I'm not so sure that adding so many special cases
>>>> to active-backup is a good thing.
>>>>
>>>>    Now, I'm starting to wonder if you were right, and it would be
>>>> better overall to have a "manual" mode that would hopefully satisfy this
>>>> case as well as the FCOE special case.  I don't think either of these is
>>>> a bad use case, I'm just not sure the right way to handle them is
>>>> another special knob in active-backup mode (either directly, or
>>>> implicitly in __netif_receive_skb), which wasn't what I expected to see.
>>>>
>>> I honestly don't think a separate mode is warranted here.  While I'm not opposed
>>> to adding a new mode, I really think doing so is no different from overloading
>>> an existing mode.  I say that because to add a new mode in which we explicitly
>>> expect traffic to be directed to various slaves requires that we implement a
>>> policy for frames which have no queue mapping determined on egress.  Any policy
>>> I can think of is really an approximation of an existing policy, so we may as
>>> well reuse the policy code that we already have in place.  About the only way a
>>> separate mode makes sense is in the 'passthrough' queue mode you document below.
>>> In this model, in which queue ids map to slaves in a 1:1 fashion it doesn't make
>>> sense.
>>       One goal I'm hoping to achieve is something that would satisfy
>> both the queue map stuff that you're looking for, and would meet the
>> needs of the FCOE people who also want to disable the duplicate
>> suppression (i.e., permit incoming traffic on the inactive slave) for a
>> different special case.
>>
>>       The FCOE proposal revolves around, essentially, active-backup
>> mode, but permitting incoming traffic on the inactive slave.  At the
>> moment, the patches attempt to special case things such that only
>> dev_add_pack listeners directly bound to the inactive slave are checked
>> (to permit the FCOE traffic to pass on the inactive slave, but still
>> dropping IP, as ip_rcv is a wildcard bind).
>>
>>       Your keep_all patch is, by and large, the same thing, except
>> that it permits anything to come in on the "inactive" slave, and it's a
>> switch that has to be turned on.
>>
>>       This seems like needless duplication to me; I'd prefer to see a
>> single solution that handles both cases instead of two special cases
>> that each do 90% of what the other does.
>>
>>       As far as a new mode goes, one major reason I think a separate
>> mode is warranted is the semantics: with either of these changes (to
>> permit more or less regular use of the "inactive" slaves), the mode
>> isn't really an "active-backup" mode any longer; there is no "inactive"
>> or "backup" slave.  I think of this as being a major change of
>> functionality, not simply a minor option.
>>
>>       Hence my thought that "active-backup" could stay as a "true" hot
>> standby mode (backup slaves are just that: for backup, only), and this
>> new mode would be the place to do the special queue-map / FCOE /
>> whatever that isn't really a hot standby configuration any longer.
>>
>>       As far as the behavior of the new mode (your concern about its
>> policy map approximations), in the end, it would probably act pretty
>> much like active-backup with your patch applied: traffic goes out the
>> active slave, unless directed otherwise.  It's a lot less complicated
>> than I had feared.
>>
> 
> It's beginning to sound like the 'FCoE use-case' and the one Neil and I
> are proposing are quite similar.  The main goal of both is to have the
> option to have multiple slaves send and receive traffic during the
> steady-state, but in the event of a failover all traffic would run on a
> single interface.
> 

I believe they are similar, although I never considered using FCoE over a
device that is actually in the bond.  For example, the current FCoE use
case is:

bond0 ------> ethx
                |
vlan-fcoe -->  |

Here vlan-fcoe is not a slave of bond0.  With the keep_all patch this
would still work, plus an additional configuration becomes possible:

bond0 --> vlan-fcoe1  ---> ethx
    \                        |
     \ --- vlan-fcoe2  --->  |

Here both vlan-fcoe1 and vlan-fcoe2 are slaves of bond0.
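
For concreteness, that second setup could be built up roughly like this
(a hypothetical sketch: the VLAN IDs are invented, and I'm assuming the
keep_all flag from the first patch is exposed as a sysfs knob):

    # two FCoE VLAN devices on ethx, both enslaved to bond0
    vconfig add ethx 101          # vlan-fcoe1 -> ethx.101
    vconfig add ethx 102          # vlan-fcoe2 -> ethx.102
    echo +ethx.101 > /sys/class/net/bond0/bonding/slaves
    echo +ethx.102 > /sys/class/net/bond0/bonding/slaves
    echo 1 > /sys/class/net/bond0/bonding/keep_all   # accept rx on all slaves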

Even with the keep_all patch, it still seems a little inconsistent to
drop a packet outright if it is received on an inactive slave and
destined for a vlan on the bond, yet to deliver the packet to
devices that have exact matches if it is received on an inactive slave
but destined for the bond device.  I'll post a patch in just a moment
that hopefully illustrates what I see as an unexpected side effect.


> The implementation proposed with this patch is a bit different from the
> 'mark-mode' patch you may recall I posted a few months ago.  It created
> a new mode that essentially did exactly what you are describing --
> transmit on the primary interface unless pushed to another interface via
> info in the skb and receive on all interfaces.  We initially did not
> create a new mode based on your reservations about the previous
> mark-mode patch and went the direction of enhancing one or two modes
> initially (figuring it would be good to walk before running), with the
> idea that other modes could take care of this output slave selection
> logic in the future.
> 
> 
>>>>    I presume you're overloading active-backup because it's not
>>>> etherchannel, 802.3ad, etc, and just talks right to the switch.  For the
>>>> regular load balance modes, I still think overlay into the existing
>>>> modes is preferable (more on that later); I'm thinking of "manual"
>>>> instead of another tweak to active-backup.
>>>>
>>>>    If users want to have actual hot-standby functionality, then
>>>> active-backup would do that, and nothing else (and it can be multi-queue
>>>> aware, but only one slave active at a time).
>>>>
>>>> Yes, but active backup doesn't provide preferred output path selection in and of
>>>> itself.  That's the feature here.
>>       I understand that; I'm suggesting that active-backup should
>> provide no service other than hot standby, and not be overloaded into a
>> manual load balancing scheme (both for your use, and for FCOE).
>>
>>       Maybe I'm worrying too much about defending the purity of the
>> active-backup mode; I understand what you're trying to do a little
>> better now, and yes, the "manual" mode I think of (in your queue mapping
>> scheme, not the other doodads I talked about) would basically be
>> active-backup with your queue mapper, minus the duplicate suppressor.
>>
> 
> It doesn't matter terribly to me which direction is taken.  Again, a
> major reason this route was proposed was that you were not as keen on
> creating a new mode as I was at the time of that patch posting.  It's
> somewhat understandable as once a mode is added it's tough to take away,
> but when one sees how much we are really changing the way active-backup
> might behave in some cases maybe it makes sense to use a new mode?
> 
> I guess I like the idea of adding this output selection to existing
> modes because it at least gives us the option to use queue maps to
> select output interfaces for more than a mode that looks like
> present-day active-backup minus the duplicate suppression.   I'm happy to
> code-up a patch that creates a new mode, but before I go do that and
> test it, I'd like to know we have come to an agreement on the direction
> for the future.
> 
>>>>    Users who want the set of bonded slaves to look like a big
>>>> multiqueue buffet could use this "manual" mode and set things up however
>>>> they want.  One way to set it up is simply that the bond is N queues
>>>> wide, where N is the total of the queue counts of all the slaves.  If a
>>>> slave fails, N gets smaller, and the user code has to deal with that.
>>>> Since the queue count of a device can't change dynamically, the bond
>>>> would have to actually be set up with some big number of queues, and
>>>> then only a subset is actually active (or there is some sort of wrap).
>>>>
>>>>    In such an implementation, each slave would have a range of
>>>> queue IDs, not necessarily just one.  I'm a bit leery of exposing an API
>>>> where each slave is one queue ID, as it could make transitioning to real
>>>> multi-queue awareness difficult.
>>>>
>>> I'm sorry, what exactly do you mean when you say 'real' multi queue
>>> awareness?  How is this any less real than any other implementation?  The
>>> approach you outline above isn't any more or less valid than this one.
>>       That was my misunderstanding of how you planned to handle
>> things.  I had thought this patch was simply a scheme to use the queue
>> IDs for slave selection, without any method to further perform queue
>> selection on the slave itself (I hadn't thought of placing a tc action
>> on the slave itself, which you described later on).  I had been thinking
>> in terms of schemes to expose all of the slave queues on the bonding
>> device.
> 
> It wasn't our original intention either.  I didn't mention it in my
> original post as it wasn't really the intent of our patch, but a nice
> side-effect for the informed user. :) Obviously a bit more testing could
> take place and we could add more examples to the documentation for the
> nice side-effect feature of this patch, but since this wasn't our
> original intent and we didn't test it, we did not advertise it.
> 
>>       So, I don't see any issues with the queue mapping part.  I still
>> want to find a common solution for FCOE / your patch with regards to the
>> duplicate suppressor.
> 
> Understood.
> 
>>> While we're on the subject, Andy and I did discuss a model similar to what you
>>> describe above (what I'll refer to as a queue id passthrough model), in which
>>> you can tell the bonding driver to map a frame to a queue, and the bonding
>>> driver doesn't really do anything with the queue id other than pass it to the slave
>>> device for hardware-based multiqueue tx handling.  While we could do that, it's
>>> my feeling such a model isn't the way to go for two primary reasons:
>>>
>>> 1) Inconsistent behavior.  Such an implementation makes assumptions regarding
>>> queue id specification within a driver.  For example, what if one of the slaves
>>> reserves some fixed number of low order queues for a specific purpose, and as
>>> such general use queues begin at an offset from zero, while other slaves do not.
>>> While it's easy to accommodate such needs when writing the tc filters, if a slave
>>> fails over, such a bias would change output traffic behavior, as the bonding
>>> driver can't be clearly informed of such a bias.  Likewise, what if a slave
>>> driver allocates more queues than it actually supports in hardware (like the
>>> implementation you propose, ixgbe IIRC actually does this).  If slaves handled
>>> unimplemented tx queues differently (if one wrapped queues, while the other simply
>>> dropped frames to unimplemented queues, for instance), a failover would change
>>> traffic patterns dramatically.
>>>
>>> 2) Need.  While (1) can pretty easily be managed with a few configuration
>>> guidelines (output queues on slaves have to be configured identically, lest
>>> chaos and madness befall you, etc.), there's really no reason to bind users to
>>> such a system.  We're using tc filters to set the queue id on skbs enqueued to
>>> the bonding driver; there's absolutely no reason you can't add additional filters to
>>> the slaves directly.  Since the bonding driver uses dev_queue_xmit to send a
>>> frame to a slave, it has the opportunity to pass through another set of queuing
>>> disciplines and filters that can reset and re-assign the skb's queue mapping.  So
>>> with the approach in this patch you can get direct output control without
>>> sacrificing actual hardware tx output queue control.  With a passthrough model,
>>> you save a bit of filter configuration, but at the expense of having to be much
>>> more careful about how you configure your slave NICs, and detecting such errors
>>> in configuration would be rather difficult to track down, as it would require
>>> the generation of traffic that hits the right filter after a failover.
>>       I don't disagree with any of this.  One thought I do have is
>> that Eric Dumazet, I believe, has mentioned that the read lock in
>> bonding is a limiting factor on 10G performance.  In the far distant
>> future when bonding is RCU, going through the lock(s) on the tc actions
>> of the slave could have the same net effect, and in such a case, a
>> qdisc-less path may be of benefit.  Not a concern for today, I suspect.
>>
>>>>    There might also be a way to tie it in to the new RPS code on
>>>> the receive side.
>>>>
>>>>    If the slaves all have the same MAC and attach to a single
>>>> switch via etherchannel, then it all looks pretty much like a single big
>>>> honkin' multiqueue device.  The switch probably won't map the flows back
>>>> the same way, though.
>>>>
>>> I agree, they probably won't.  Receive side handling wasn't really our focus here
>>> though.  That's largely why we chose round robin and active backup as our first
>>> modes to use this with.  They are already written to expect frames on either
>>> interface.
>>>
>>>>    If the slaves are on discrete switches (without etherchannel),
>>>> things become more complicated.  If the slaves have the same MAC, then
>>>> the switches will be irritated about seeing that same MAC coming in from
>>>> multiple places.  If the slaves have different MACs, then ARP has the
>>>> same sort of issues.
>>>>
>>>>    In thinking about it, if it's linux bonding at both ends, there
>>>> could be any number of discrete switches in the path, and it wouldn't
>>>> matter as long as the linux end can work things out, e.g.,
>>>>
>>>>         -- switch 1 --
>>>> hostA  /              \  hostB
>>>> bond  ---- switch 2 ---- bond
>>>>        \              /
>>>>         -- switch 3 --
>>>>
>>>>    For something like this, the switches would never share MAC
>>>> information for the bonding slaves.  The issue here then becomes more of
>>>> detecting link failures (it would require either a "trunk failover" type
>>>> of function on the switch, or some kind of active probe between the
>>>> bonds).
>>>>
>>>>    Now, I realize that I'm babbling a bit, as from reading your
>>>> description, this isn't necessarily your target topology (which sounded
>>>> more like a case of slave A can reach only network X, and slave B can
>>>> reach anywhere, so sending to network X should use slave A
>>>> preferentially), or, as long as I'm doing ASCII-art,
>>>>
>>>>        --- switch 1 ---- network X
>>>> hostA /               /
>>>> bond  ---- switch 2 -+-- anywhere
>>>>
>>>>    Is that an accurate representation?  Or is it something a bit
>>>> different, e.g.,
>>>>
>>>>        --- switch 1 ---- network X -\
>>>> hostA /                             /
>>>> bond  ---- switch 2 ---- anywhere --
>>>>
>>>>    I.e., the "anywhere" connects back to network X from the
>>>> outside, so to speak.  Or, oh, maybe I'm missing it entirely, and you're
>>>> thinking of something like this:
>>>>
>>>>        --- switch 1 --- VPN --- web site
>>>> hostA /                          /
>>>> bond  ---- switch 2 - Internet -/
>>>>
>>>>    Where you prefer to hit "web site" via the VPN (perhaps it's a
>>>> more efficient or secure path), but can do it from the public network at
>>>> large if necessary.
>>>>
>>> Yes, this one.  I think the other models are equally interesting, but this model
>>> in which either path had universal reachability, but for some classes of traffic
>>> one path is preferred over the other is the one we had in mind.
>>>
>>>>    Now, regardless of the above, your first patch ("keep_all") is
>>>> to deal with the reverse problem, if this is a piggyback on top of
>>>> active-backup mode: how to get packets back, when both channels can be
>>>> active simultaneously.  That actually dovetails to a degree with work
>>>> I've been doing lately, but the solution there probably isn't what
>>>> you're looking for (there's a user space daemon to do path finding, and
>>>> the "bond IP" address is piggybacked on the slaves' MAC addresses, which
>>>> are not changed; the "bond IP" set exists in a separate subnet all its
>>>> own).
>>>>
>>>>    As I said, I'm not convinced that the "keep_all" option to
>>>> active-backup is really better than just a "manual" mode that lacks the
>>>> dup suppression and expects the user to set everything up.
>>>>
>>>>    As for the round-robin change in this patch, if I'm reading it
>>>> right, then the way it works is that the packets are round-robined,
>>>> unless there's a queue id passed in, in which case it's assigned to the
>>>> slave mapped to that queue id.  I'm not entirely sure why you picked
>>>> round-robin mode for that over balance-xor; it doesn't seem to fit well
>>>> with the description in the documentation.  Or is it just sort of a
>>>> demonstrator?
>>>>
>>> It was selected because round robin allows transmits on any interface already,
>>> and expects frames on any interface, so it was a 'safe' choice.  I would think
>>> balance-xor would also work.  Ideally it would be nice to get more modes
>>> supporting this mechanism.
>>       I think that this should work for balance-xor and 802.3ad.  The
>> only limitation for 802.3ad is that the spec requires "conversations" to
>> not be striped or to skip around in a manner that could lead to out of
>> order delivery.
> 
> Agreed.  Checking would probably also have to be done to make sure that
> we were not transmitting on an inactive aggregator.
> 
>>       I'm not so sure about the alb/tlb modes; at first thought, I
>> think it could have conflicts with the internal balancing done within
>> the modes (if, e.g., the tc action put traffic for the same destination
>> on two slaves).
>>
> 
> TLB and ALB modes would certainly have to be done differently.  It
> should not be terribly difficult to move from the existing hashing
> that's done to one that relies on the queue_mapping, but it will take a
> bit to make sure it's not a complete hack.
> 
> We decided against doing that for all modes on the first pass as it
> seemed like active-backup and round-robin were the most likely
> users.  We also wanted to present the code early rather than spending time
> supporting this on every mode to find out that it just wasn't rational
> to do it on some of them.
> 
>>>>    I do like one other aspect of the patch, and that's the concept
>>>> of overlaying the queue map on top of the balance algorithm.  So, e.g.,
>>>> balance-xor would do its usual thing, unless the packet is queue mapped,
>>>> in which case the packet's assignment is obeyed.  The balance-xor could
>>>> even optionally do its xor across the full set of all slaves output
>>>> queues instead of just across the slaves.  Round-robin can operate
>>>> similarly.  For those modes, a "balance by queue vs. balance by slave"
>>>> seems like a reasonable knob to have.
>>> Not sure what you mean here.  In the model implemented by this patch, there is
>>> one output queue per slave, and as such, balance by queue == balance by slave.
>>> That would make sense in the model you describe earlier in this note, but not in
>>> the model presented by this patch.
>>       Yes, I was thinking about what I had described; again,
>> predicated on my misunderstanding of how it all worked.
>>
>>>>    I do understand that you're proposing something relatively
>>>> simple, and I'm thinking out loud about alternate or additional
>>>> implementation details.  Some of this is "ooh ahh what if", but we also
>>>> don't want to end up with something that's forwards incompatible, and
>>>> I'm hoping to find one solution to multiple problems.
>>>>
>>> For clarification, can you enumerate what other problems you are trying to
>>> solve with this feature, or features similar to this?  From this email, the one
>>> that I most clearly see is the desire to allow a passthrough mode of queue
>>> selection, which I think I've noted can be done already (even without this
>>> patch), by attaching additional tc filters to the slaves' output queues directly.
>>> What else do you have in mind?
>>       As I said above, I hadn't thought of stacking tc actions on to
>> the slaves directly, so I was thinking on ways to expose the slave
>> queues.
>>
>>       I still find something intriguing about a round-robin or xor
>> mode that robins/xors through all of the slave queues, though, but that
>> should be something separate (I'm not sure if such a scheme is actually
>> "better", either).
>>
>>       -J

It would be best if there were a solution for the FCoE use case that
works with the current bonding modes, including 802.3ad.  There is switch
support to run MPIO FCoE while doing link aggregation on the LAN side,
and we should support that.  I'm not sure the keep_all patch would be good
in this case; Jay, I think you mentioned this at some point, but I missed
the conclusion?  Although maybe it would be OK; I'll think about it some
more tomorrow.

Thanks,
John



>>
>> ---
>>       -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
Neil Horman May 13, 2010, 5:15 p.m. UTC | #6
On Wed, May 12, 2010 at 06:14:08PM -0400, Andy Gospodarek wrote:
> On Wed, May 12, 2010 at 12:41:54PM -0700, Jay Vosburgh wrote:
> > Neil Horman <nhorman@tuxdriver.com> wrote:
> > 
> > >On Tue, May 11, 2010 at 01:09:39PM -0700, Jay Vosburgh wrote:
> > >> Andy Gospodarek <andy@greyhouse.net> wrote:
> > >> 
> > >> >This patch gives the user the ability to control the output slave for
> > >> >round-robin and active-backup bonding.  Similar functionality was
> > >> >discussed in the past, but Jay Vosburgh indicated he would rather see a
> > >> >feature like this added to existing modes rather than creating a
> > >> >completely new mode.  Jay's thoughts as well as Neil's input surrounding
> > >> >some of the issues with the first implementation pushed us toward a
> > >> >design that relied on the queue_mapping rather than skb marks.
> > >> >Round-robin and active-backup modes were chosen as the first users of
> > >> >this slave selection as they seemed like the most logical choices when
> > >> >considering a multi-switch environment.
> > >> >
> > >> >Round-robin mode works without any modification, but active-backup does
> > >> >require inclusion of the first patch in this series and setting
> > >> >the 'keep_all' flag.  This will allow reception of unicast traffic on
> > >> >any of the backup interfaces.
> > >> 
> > >> 	Yes, I did think that the mark business fit better into existing
> > >> modes (I thought of it as kind of a new hash for xor and 802.3ad modes).
> > >> I also didn't expect to see so much new stuff (this, as well as the FCOE
> > >> special cases being discussed elsewhere) being shoehorned into the
> > >> active-backup mode.  I'm not so sure that adding so many special cases
> > >> to active-backup is a good thing.
> > >> 
> > >> 	Now, I'm starting to wonder if you were right, and it would be
> > >> better overall to have a "manual" mode that would hopefully satisfy this
> > >> case as well as the FCOE special case.  I don't think either of these is
> > >> a bad use case, I'm just not sure the right way to handle them is
> > >> another special knob in active-backup mode (either directly, or
> > >> implicitly in __netif_receive_skb), which wasn't what I expected to see.
> > >> 
> > >I honestly don't think a separate mode is warranted here.  While I'm not opposed
> > >to adding a new mode, I really think doing so is no different from overloading
> > >an existing mode.  I say that because to add a new mode in which we explicitly
> > >expect traffic to be directed to various slaves requires that we implement a
> > >policy for frames which have no queue mapping determined on egress.  Any policy
> > >I can think of is really an approximation of an existing policy, so we may as
> > >well reuse the policy code that we already have in place.  About the only way a
> > >separate mode makes sense is in the 'passthrough' queue mode you document below.
> > >In this model, in which queue ids map to slaves in a 1:1 fashion it doesn't make
> > >sense.
> > 
> > 	One goal I'm hoping to achieve is something that would satisfy
> > both the queue map stuff that you're looking for, and would meet the
> > needs of the FCOE people who also want to disable the duplicate
> > suppression (i.e., permit incoming traffic on the inactive slave) for a
> > different special case.
> > 
> > 	The FCOE proposal revolves around, essentially, active-backup
> > mode, but permitting incoming traffic on the inactive slave.  At the
> > moment, the patches attempt to special case things such that only
> > dev_add_pack listeners directly bound to the inactive slave are checked
> > (to permit the FCOE traffic to pass on the inactive slave, but still
> > dropping IP, as ip_rcv is a wildcard bind).
> > 
> > 	Your keep_all patch is, by and large, the same thing, except
> > that it permits anything to come in on the "inactive" slave, and it's a
> > switch that has to be turned on.
> > 
> > 	This seems like needless duplication to me; I'd prefer to see a
> > single solution that handles both cases instead of two special cases
> > that each do 90% of what the other does.
> > 
> > 	As far as a new mode goes, one major reason I think a separate
> > mode is warranted is the semantics: with either of these changes (to
> > permit more or less regular use of the "inactive" slaves), the mode
> > isn't really an "active-backup" mode any longer; there is no "inactive"
> > or "backup" slave.  I think of this as being a major change of
> > functionality, not simply a minor option.
> > 
> > 	Hence my thought that "active-backup" could stay as a "true" hot
> > standby mode (backup slaves are just that: for backup, only), and this
> > new mode would be the place to do the special queue-map / FCOE /
> > whatever that isn't really a hot standby configuration any longer.
> > 
> > 	As far as the behavior of the new mode (your concern about its
> > policy map approximations), in the end, it would probably act pretty
> > much like active-backup with your patch applied: traffic goes out the
> > active slave, unless directed otherwise.  It's a lot less complicated
> > than I had feared.
> > 
> 
> It's beginning to sound like the 'FCoE use-case' and the one Neil and I
> are proposing are quite similar.  The main goal of both is to have the
> option to have multiple slaves send and receive traffic during the
> steady-state, but in the event of a failover all traffic would run on a
> single interface.
> 
> The implementation proposed with this patch is a bit different from the
> 'mark-mode' patch you may recall I posted a few months ago.  It created
> a new mode that essentially did exactly what you are describing --
> transmit on the primary interface unless pushed to another interface via
> info in the skb and receive on all interfaces.  We initially did not
> create a new mode based on your reservations about the previous
> mark-mode patch and went the direction of enhancing one or two modes
> initially (figuring it would be good to walk before running), with the
> idea that other modes could take care of this output slave selection
> logic in the future.


So, it's sounding to me like everyone is leaning toward a new mode approach here.
Before we go ahead and start coding, I hear the bullet points for this approach
as:

1) Implement a new bond mode where queue ids are used to steer frames to output
interfaces

2) Use said mode to imply universal frame reception (i.e. remove the keep_all
knob, and turn on that behavior when this new mode is selected)

3) Use John F.'s skb_should_drop changes to implement the keep_all feature.

Does that sound about right?
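
In configuration terms, I'd picture the result looking something like
this (purely hypothetical; the mode name and the knobs are not settled):

    # hypothetical "manual" mode: queue ids steer output, and rx is
    # accepted on all slaves without a separate keep_all knob
    echo manual > /sys/class/net/bond0/bonding/mode
    echo "eth0:1" > /sys/class/net/bond0/bonding/queue_id   # queue 1 -> eth0
    echo "eth1:2" > /sys/class/net/bond0/bonding/queue_id   # queue 2 -> eth1
    tc qdisc add dev bond0 handle 1 root multiq
    tc filter add dev bond0 protocol ip parent 1: prio 1 u32 \
        match ip dst 10.0.0.0/8 action skbedit queue_mapping 2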

Regards
Neil

Jay Vosburgh May 13, 2010, 6:54 p.m. UTC | #7
John Fastabend <john.r.fastabend@intel.com> wrote:

>Andy Gospodarek wrote:
>> On Wed, May 12, 2010 at 12:41:54PM -0700, Jay Vosburgh wrote:
[...]
>>>       One goal I'm hoping to achieve is something that would satisfy
>>> both the queue map stuff that you're looking for, and would meet the
>>> needs of the FCOE people who also want to disable the duplicate
>>> suppression (i.e., permit incoming traffic on the inactive slave) for a
>>> different special case.
>>>
>>>       The FCOE proposal revolves around, essentially, active-backup
>>> mode, but permitting incoming traffic on the inactive slave.  At the
>>> moment, the patches attempt to special case things such that only
>>> dev_add_pack listeners directly bound to the inactive slave are checked
>>> (to permit the FCOE traffic to pass on the inactive slave, but still
>>> dropping IP, as ip_rcv is a wildcard bind).
>>>
>>>       Your keep_all patch is, by and large, the same thing, except
>>> that it permits anything to come in on the "inactive" slave, and it's a
>>> switch that has to be turned on.
>>>
>>>       This seems like needless duplication to me; I'd prefer to see a
>>> single solution that handles both cases instead of two special cases
>>> that each do 90% of what the other does.
>>>
>>>       As far as a new mode goes, one major reason I think a separate
>>> mode is warranted is the semantics: with either of these changes (to
>>> permit more or less regular use of the "inactive" slaves), the mode
>>> isn't really an "active-backup" mode any longer; there is no "inactive"
>>> or "backup" slave.  I think of this as being a major change of
>>> functionality, not simply a minor option.
>>>
>>>       Hence my thought that "active-backup" could stay as a "true" hot
>>> standby mode (backup slaves are just that: for backup, only), and this
>>> new mode would be the place to do the special queue-map / FCOE /
>>> whatever that isn't really a hot standby configuration any longer.
>>>
>>>       As far as the behavior of the new mode (your concern about its
>>> policy map approximations), in the end, it would probably act pretty
>>> much like active-backup with your patch applied: traffic goes out the
>>> active slave, unless directed otherwise.  It's a lot less complicated
>>> than I had feared.
>>>
>>
>> It's beginning to sound like the 'FCoE use-case' and the one Neil and I
>> are proposing are quite similar.  The main goal of both is to have the
>> option to have multiple slaves send and receive traffic during the
>> steady-state, but in the event of a failover all traffic would run on a
>> single interface.
>>
>
>I believe they are similar, although I never considered using FCoE over a
>device that is actually in the bond.  For example, the current FCoE use
>case is:
>
>bond0 ------> ethx
>               |
>vlan-fcoe -->  |
>
>Here vlan-fcoe is not a slave of bond0.  With the keep_all patch this
>would still work, plus an additional configuration becomes possible:

	I also recall discussion that another valid FCOE configuration
is to simply bind to ethx, with no VLANs involved.

>bond0 --> vlan-fcoe1  ---> ethx
>   \                        |
>    \ --- vlan-fcoe2  --->  |
>
>Here both vlan-fcoe1 and vlan-fcoe2 are slaves of bond0.
>
>Even with the keep_all patch, it still seems a little inconsistent to drop
>a packet outright if it is received on an inactive slave and destined for
>a vlan on the bond, yet to deliver the packet to devices that have
>exact matches if it is received on an inactive slave but destined for the
>bond device.  I'll post a patch in just a moment that hopefully
>illustrates what I see as an unexpected side effect.

	Yes, I understand this, and I view this as a separate concern
from the duplicate suppressor (although they are linked to a degree).

	The ultimate intent (for your changes) is to permit slaves to
operate simultaneously as members of the bond as well as independent
entities, which is a significant behavior change from the past.

>> The implementation proposed with this patch is a bit different from the
>> 'mark-mode' patch you may recall I posted a few months ago.  It created
>> a new mode that essentially did exactly what you are describing --
>> transmit on the primary interface unless pushed to another interface via
>> info in the skb and receive on all interfaces.  We initially did not
>> create a new mode based on your reservations about the previous
>> mark-mode patch and went the direction of enhancing one or two modes
>> initially (figuring it would be good to walk before running), with the
>> idea that other modes could take care of this output slave selection
>> logic in the future.
>>
>>
>>>>>    I presume you're overloading active-backup because it's not
>>>>> etherchannel, 802.3ad, etc, and just talks right to the switch.  For the
>>>>> regular load balance modes, I still think overlay into the existing
>>>>> modes is preferable (more on that later); I'm thinking of "manual"
>>>>> instead of another tweak to active-backup.
>>>>>
>>>>>    If users want to have actual hot-standby functionality, then
>>>>> active-backup would do that, and nothing else (and it can be multi-queue
>>>>> aware, but only one slave active at a time).
>>>>>
>>>> Yes, but active backup doesn't provide preferred output path selection in and of
>>>> itself.  That's the feature here.
>>>       I understand that; I'm suggesting that active-backup should
>>> provide no service other than hot standby, and not be overloaded into a
>>> manual load balancing scheme (both for your use, and for FCOE).
>>>
>>>       Maybe I'm worrying too much about defending the purity of the
>>> active-backup mode; I understand what you're trying to do a little
>>> better now, and yes, the "manual" mode I think of (in your queue mapping
>>> scheme, not the other doodads I talked about) would basically be
>>> active-backup with your queue mapper, minus the duplicate suppressor.
>>>
>>
>> It doesn't matter terribly to me which direction is taken.  Again, a
>> major reason this route was proposed was that you were not as keen on
>> creating a new mode as I was at the time of that patch posting.  It's
>> somewhat understandable as once a mode is added it's tough to take away,
>> but when one sees how much we are really changing the way active-backup
>> might behave in some cases maybe it makes sense to use a new mode?
>>
>> I guess I like the idea of adding this output selection to existing
>> modes because it at least gives us the option to use queue maps to
>> select output interfaces for more than a mode that looks like
>> present-day active-backup minus the duplicate suppression.   I'm happy to
>> code-up a patch that creates a new mode, but before I go do that and
>> test it, I'd like to know we have come to an agreement on the direction
>> for the future.
>>
>>>>>    Users who want the set of bonded slaves to look like a big
>>>>> multiqueue buffet could use this "manual" mode and set things up however
>>>>> they want.  One way to set it up is simply that the bond is N queues
>>>>> wide, where N is the total of the queue counts of all the slaves.  If a
>>>>> slave fails, N gets smaller, and the user code has to deal with that.
>>>>> Since the queue count of a device can't change dynamically, the bond
>>>>> would have to actually be set up with some big number of queues, and
>>>>> then only a subset is actually active (or there is some sort of wrap).
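
With the posted patch, that "big number of queues" would be fixed at module
load time via the new tx_queues parameter; a minimal sketch (the other
options shown are illustrative):

    # tx_queues defaults to 16 and is clamped to 1-255 by the patch
    modprobe bonding mode=balance-rr miimon=100 tx_queues=64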
>>>>>
>>>>>    In such an implementation, each slave would have a range of
>>>>> queue IDs, not necessarily just one.  I'm a bit leery of exposing an API
>>>>> where each slave is one queue ID, as it could make transitioning to real
>>>>> multi-queue awareness difficult.
>>>>>
>>>> I'm sorry, what exactly do you mean when you say 'real' multi queue
>>>> awareness?  How is this any less real than any other implementation?  The
>>>> approach you outline above isn't any more or less valid than this one.
>>>       That was my misunderstanding of how you planned to handle
>>> things.  I had thought this patch was simply a scheme to use the queue
>>> IDs for slave selection, without any method to further perform queue
>>> selection on the slave itself (I hadn't thought of placing a tc action
>>> on the slave itself, which you described later on).  I had been thinking
>>> in terms of schemes to expose all of the slave queues on the bonding
>>> device.
>>
>> It wasn't our original intention either.  I didn't mention it in my
>> original post as it wasn't really the intent of our patch, but a nice
>> side-effect for the informed user. :) Obviously a bit more testing could
>> take place and we could add more examples to the documentation for the
>> nice side-effect feature of this patch, but since this wasn't our
>> original intent and we didn't test it, we did not advertise it.
>>
>>>       So, I don't see any issues with the queue mapping part.  I still
>>> want to find a common solution for FCOE / your patch with regards to the
>>> duplicate suppressor.
>>
>> Understood.
>>
>>>> While we're on the subject, Andy and I did discuss a model similar to what you
>>>> describe above (what I'll refer to as a queue id passthrough model), in which
>>>> you can tell the bonding driver to map a frame to a queue, and the bonding
>>>> driver doesn't really do anything with the queue id other than pass it to the
>>>> slave device for hardware based multiqueue tx handling.  While we could do
>>>> that, it's my feeling such a model isn't the way to go for two primary reasons:
>>>>
>>>> 1) Inconsistent behavior.  Such an implementation makes assumptions regarding
>>>> queue id specification within a driver.  For example, what if one of the slaves
>>>> reserves some fixed number of low order queues for a specific purpose, and as
>>>> such its general use queues begin at an offset from zero, while other slaves do
>>>> not?  While it's easy to accommodate such needs when writing the tc filters, if
>>>> a slave fails over, such a bias would change output traffic behavior, as the
>>>> bonding driver can't be clearly informed of such a bias.  Likewise, what if a
>>>> slave driver allocates more queues than it actually supports in hardware (like
>>>> the implementation you propose; ixgbe IIRC actually does this)?  If slaves
>>>> handled unimplemented tx queues differently (if one wrapped queues while the
>>>> other simply dropped frames to unimplemented queues, for instance), a failover
>>>> would change traffic patterns dramatically.
>>>>
>>>> 2) Need.  While (1) can pretty easily be managed with a few configuration
>>>> guidelines (output queues on slaves have to be configured identically, lest
>>>> chaos and madness befall you, etc.), there's really no reason to bind users to
>>>> such a system.  We're using tc filters to set the queue id on skbs enqueued to
>>>> the bonding driver; there's absolutely no reason you can't add additional
>>>> filters to the slaves directly.  Since the bonding driver uses dev_queue_xmit
>>>> to send a frame to a slave, it has the opportunity to pass through another set
>>>> of queuing disciplines and filters that can reset and re-assign the skb's
>>>> queue mapping.  So with the approach in this patch you get both direct output
>>>> control and actual hardware tx output queue control.  With a passthrough
>>>> model, you save a bit of filter configuration, but at the expense of having to
>>>> be much more careful about how you configure your slave nics, and detecting
>>>> such errors in configuration would be rather difficult to track down, as it
>>>> would require the generation of traffic that hits the right filter after a
>>>> failover.
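
A minimal sketch of that two-level arrangement, reusing the multiq/skbedit
syntax from the patch's documentation below (the address and queue ids are
illustrative):

    # on the bond: override output slave selection (bond queue 2 -> eth1)
    tc qdisc add dev bond0 handle 1 root multiq
    tc filter add dev bond0 protocol ip parent 1: prio 1 u32 \
            match ip dst 192.168.1.100 action skbedit queue_mapping 2

    # on the slave: re-map the same traffic to a hardware tx queue
    tc qdisc add dev eth1 handle 1 root multiq
    tc filter add dev eth1 protocol ip parent 1: prio 1 u32 \
            match ip dst 192.168.1.100 action skbedit queue_mapping 1
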
>>>       I don't disagree with any of this.  One thought I do have is
>>> that Eric Dumazet, I believe, has mentioned that the read lock in
>>> bonding is a limiting factor on 10G performance.  In the far distant
>>> future when bonding is RCU, going through the lock(s) on the tc actions
>>> of the slave could have the same net effect, and in such a case, a
>>> qdisc-less path may be of benefit.  Not a concern for today, I suspect.
>>>
>>>>>    There might also be a way to tie it in to the new RPS code on
>>>>> the receive side.
>>>>>
>>>>>    If the slaves all have the same MAC and attach to a single
>>>>> switch via etherchannel, then it all looks pretty much like a single big
>>>>> honkin' multiqueue device.  The switch probably won't map the flows back
>>>>> the same way, though.
>>>>>
>>>> I agree, they probably won't.  Receive side handling wasn't really our focus
>>>> here though.  That's largely why we chose round robin and active backup as our first
>>>> modes to use this with.  They are already written to expect frames on either
>>>> interface.
>>>>
>>>>>    If the slaves are on discrete switches (without etherchannel),
>>>>> things become more complicated.  If the slaves have the same MAC, then
>>>>> the switches will be irritated about seeing that same MAC coming in from
>>>>> multiple places.  If the slaves have different MACs, then ARP has the
>>>>> same sort of issues.
>>>>>
>>>>>    In thinking about it, if it's linux bonding at both ends, there
>>>>> could be any number of discrete switches in the path, and it wouldn't
>>>>> matter as long as the linux end can work things out, e.g.,
>>>>>
>>>>>         -- switch 1 --
>>>>> hostA  /              \  hostB
>>>>> bond  ---- switch 2 ---- bond
>>>>>        \              /
>>>>>         -- switch 3 --
>>>>>
>>>>>    For something like this, the switches would never share MAC
>>>>> information for the bonding slaves.  The issue here then becomes more of
>>>>> detecting link failures (it would require either a "trunk failover" type
>>>>> of function on the switch, or some kind of active probe between the
>>>>> bonds).
>>>>>
>>>>>    Now, I realize that I'm babbling a bit, as from reading your
>>>>> description, this isn't necessarily your target topology (which sounded
>>>>> more like a case of slave A can reach only network X, and slave B can
>>>>> reach anywhere, so sending to network X should use slave A
>>>>> preferentially), or, as long as I'm doing ASCII-art,
>>>>>
>>>>>        --- switch 1 ---- network X
>>>>> hostA /               /
>>>>> bond  ---- switch 2 -+-- anywhere
>>>>>
>>>>>    Is that an accurate representation?  Or is it something a bit
>>>>> different, e.g.,
>>>>>
>>>>>        --- switch 1 ---- network X -\
>>>>> hostA /                             /
>>>>> bond  ---- switch 2 ---- anywhere --
>>>>>
>>>>>    I.e., the "anywhere" connects back to network X from the
>>>>> outside, so to speak.  Or, oh, maybe I'm missing it entirely, and you're
>>>>> thinking of something like this:
>>>>>
>>>>>        --- switch 1 --- VPN --- web site
>>>>> hostA /                          /
>>>>> bond  ---- switch 2 - Internet -/
>>>>>
>>>>>    Where you prefer to hit "web site" via the VPN (perhaps it's a
>>>>> more efficient or secure path), but can do it from the public network at
>>>>> large if necessary.
>>>>>
>>>> Yes, this one.  I think the other models are equally interesting, but this
>>>> model, in which either path has universal reachability but for some classes
>>>> of traffic one path is preferred over the other, is the one we had in mind.
>>>>
>>>>>    Now, regardless of the above, your first patch ("keep_all") is
>>>>> to deal with the reverse problem, if this is a piggyback on top of
>>>>> active-backup mode: how to get packets back, when both channels can be
>>>>> active simultaneously.  That actually dovetails to a degree with work
>>>>> I've been doing lately, but the solution there probably isn't what
>>>>> you're looking for (there's a user space daemon to do path finding, and
>>>>> the "bond IP" address is piggybacked on the slaves' MAC addresses, which
>>>>> are not changed; the "bond IP" set exists in a separate subnet all its
>>>>> own).
>>>>>
>>>>>    As I said, I'm not convinced that the "keep_all" option to
>>>>> active-backup is really better than just a "manual" mode that lacks the
>>>>> dup suppression and expects the user to set everything up.
>>>>>
>>>>>    As for the round-robin change in this patch, if I'm reading it
>>>>> right, then the way it works is that the packets are round-robined,
>>>>> unless there's a queue id passed in, in which case it's assigned to the
>>>>> slave mapped to that queue id.  I'm not entirely sure why you picked
>>>>> round-robin mode for that over balance-xor; it doesn't seem to fit well
>>>>> with the description in the documentation.  Or is it just sort of a
>>>>> demonstrator?
>>>>>
>>>> It was selected because round robin allows transmits on any interface already,
>>>> and expects frames on any interface, so it was a 'safe' choice.  I would think
>>>> balance-xor would also work.  Ideally it would be nice to get more modes
>>>> supporting this mechanism.
>>>       I think that this should work for balance-xor and 802.3ad.  The
>>> only limitation for 802.3ad is that the spec requires "conversations" to
>>> not be striped or to skip around in a manner that could lead to out of
>>> order delivery.
>>
>> Agreed.  Checking would probably also have to be done to make sure that
>> we were not transmitting on an inactive aggregator.
>>
>>>       I'm not so sure about the alb/tlb modes; at first thought, I
>>> think it could have conflicts with the internal balancing done within
>>> the modes (if, e.g., the tc action put traffic for the same destination
>>> on two slaves).
>>>
>>
>> TLB and ALB modes would certainly have to be done differently.  It
>> should not be terribly difficult to move from the existing hashing
>> that's done to one that relies on the queue_mapping, but it will take a
>> bit to make sure it's not a complete hack.
>>
>> We decided against doing that for all modes on the first pass as it
>> seemed like active-backup and round-robin were the most likely
>> users.  We also wanted to present the code early rather than spending time
>> supporting this in every mode only to find out that it just wasn't rational
>> to do it on some of them.
>>
>>>>>    I do like one other aspect of the patch, and that's the concept
>>>>> of overlaying the queue map on top of the balance algorithm.  So, e.g.,
>>>>> balance-xor would do its usual thing, unless the packet is queue mapped,
>>>>> in which case the packet's assignment is obeyed.  The balance-xor could
>>>>> even optionally do its xor across the full set of all slaves output
>>>>> queues instead of just across the slaves.  Round-robin can operate
>>>>> similarly.  For those modes, a "balance by queue vs. balance by slave"
>>>>> seems like a reasonable knob to have.
>>>> Not sure what you mean here.  In the model implemented by this patch, there is
>>>> one output queue per slave, and as such, balance by queue == balance by slave.
>>>> That would make sense in the model you describe earlier in this note, but not in
>>>> the model presented by this patch.
>>>       Yes, I was thinking about what I had described; again,
>>> predicated on my misunderstanding of how it all worked.
>>>
>>>>>    I do understand that you're proposing something relatively
>>>>> simple, and I'm thinking out loud about alternate or additional
>>>>> implementation details.  Some of this is "ooh ahh what if", but we also
>>>>> don't want to end up with something that's forwards incompatible, and
>>>>> I'm hoping to find one solution to multiple problems.
>>>>>
>>>> For clarification, can you enumerate what other problems you are trying to
>>>> solve with this feature, or features similar to this?  From this email, the one
>>>> that I most clearly see is the desire to allow a passthrough mode of queue
>>>> selection, which I think I've noted can be done already (even without this
>>>> patch), by attaching additional tc filters to the slaves' output queues directly.
>>>> What else do you have in mind?
>>>       As I said above, I hadn't thought of stacking tc actions on to
>>> the slaves directly, so I was thinking on ways to expose the slave
>>> queues.
>>>
>>>       I still find something intriguing about a round-robin or xor
>>> mode that robins/xors through all of the slave queues, though, but that
>>> should be something separate (I'm not sure if such a scheme is actually
>>> "better", either).
>>>
>>>       -J
>
>It would be best if there were a solution for the FCoE use case that works
>with the current bonding modes, including 802.3ad.  There is switch support
>to run mpio FCoE while doing link aggregation on the LAN side that we
>should support.  I'm not sure the keep_all patch would be good in this
>case.  Jay, I think you mentioned this at some point, but I missed the
>conclusion?  Although maybe it would be OK; I'll think about it some more
>tomorrow.

	How does that mpio FCOE / switch support function?  Does it rely
on utilizing ports (for the FC traffic) that are not members of the
802.3ad active aggregator?  E.g.:

       / eth0,eth1 -- switch A -- etc
bond0 -
       \ eth2,eth3 -- switch B -- etc

	Bond0 has four slaves, eth0 - eth3.  Eth0 and eth1 connect to
switch A; eth2 and eth3 to switch B.  Presuming that the switches aren't
stacked / magic, either eth0/eth1 or eth2/eth3 will be the active
aggregator (linux bonding only supports one active aggregator).

	Am I correct in presuming that the FCOE balancer gizmo doesn't
care about the 802.3ad state of the ports, and it and the switch will
run FC traffic across all four ports, regardless of which ports are
active and which are not?

	Or is the switch even simpler than that, and it processes all
FCOE traffic to ports, regardless of how the ports are configured
(etherchannel, 802.3ad, etc)?

	As for other bonding modes, balance-xor or balance-rr (round
robin) shouldn't have the same problems with the duplicate suppression
logic that active-backup and 802.3ad have.  The alb/tlb modes might or
might not be workable at all, depending upon how the FCOE traffic looks
(e.g., what source and destination MAC addresses are in the FCOE
frames?).

	In any event, wanting to run FCOE in conjunction with a variety
of bonding modes suggests that Neil was right all along, and the
duplicate suppressor change should be an option, not a new mode.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
John Fastabend May 14, 2010, 8:53 a.m. UTC | #8
Jay Vosburgh wrote:
> John Fastabend <john.r.fastabend@intel.com> wrote:
> 
>> Andy Gospodarek wrote:
>>> On Wed, May 12, 2010 at 12:41:54PM -0700, Jay Vosburgh wrote:
> [...]
>>>>       One goal I'm hoping to achieve is something that would satisfy
>>>> both the queue map stuff that you're looking for, and would meet the
>>>> needs of the FCOE people who also want to disable the duplicate
>>>> suppression (i.e., permit incoming traffic on the inactive slave) for a
>>>> different special case.
>>>>
>>>>       The FCOE proposal revolves around, essentially, active-backup
>>>> mode, but permitting incoming traffic on the inactive slave.  At the
>>>> moment, the patches attempt to special case things such that only
>>>> dev_add_pack listeners directly bound to the inactive slave are checked
>>>> (to permit the FCOE traffic to pass on the inactive slave, but still
>>>> dropping IP, as ip_rcv is a wildcard bind).
>>>>
>>>>       Your keep_all patch is, by and large, the same thing, except
>>>> that it permits anything to come in on the "inactive" slave, and it's a
>>>> switch that has to be turned on.
>>>>
>>>>       This seems like needless duplication to me; I'd prefer to see a
>>>> single solution that handles both cases instead of two special cases
>>>> that each do 90% of what the other does.
>>>>
>>>>       As far as a new mode goes, one major reason I think a separate
>>>> mode is warranted is the semantics: with either of these changes (to
>>>> permit more or less regular use of the "inactive" slaves), the mode
>>>> isn't really an "active-backup" mode any longer; there is no "inactive"
>>>> or "backup" slave.  I think of this as being a major change of
>>>> functionality, not simply a minor option.
>>>>
>>>>       Hence my thought that "active-backup" could stay as a "true" hot
>>>> standby mode (backup slaves are just that: for backup, only), and this
>>>> new mode would be the place to do the special queue-map / FCOE /
>>>> whatever that isn't really a hot standby configuration any longer.
>>>>
>>>>       As far as the behavior of the new mode (your concern about its
>>>> policy map approximations), in the end, it would probably act pretty
>>>> much like active-backup with your patch applied: traffic goes out the
>>>> active slave, unless directed otherwise.  It's a lot less complicated
>>>> than I had feared.
>>>>
>>> It's beginning to sound like the 'FCoE use-case' and the one Neil and I
>>> are proposing are quite similar.  The main goal of both is to have the
>>> option to have multiple slaves send and receive traffic during the
>>> steady-state, but in the event of a failover all traffic would run on a
>>> single interface.
>>>
>> I believe they are similar although I never considered using FCoE over a
>> device that is actually in the bond.  For example the current FCoE use
>> case is,
>>
>> bond0 ------> ethx
>>               |
>> vlan-fcoe -->  |
>>
>> Here vlan-fcoe is not a slave of bond0.  With the keep_all patch this
>> would work plus an additional configuration,
> 
>         I also recall discussion that another valid FCOE configuration
> is to simply bind to ethx, with no VLANs involved.
> 

Correct, this is valid and works today because the skb will be delivered
to exact matches in this case.

>> bond0 --> vlan-fcoe1  ---> ethx
>>   \                        |
>>    \ --- vlan-fcoe2  --->  |
>>
>> Here both vlan-fcoe1 and vlan-fcoe2 are slaves of bond0.
>>
>> Even with the keep_all patch it still seems a little inconsistent to drop
>> a packet outright if it is received on an inactive slave and destined for
>> a vlan on the bond and then to deliver the packet to devices that have
>> exact matches if it is received on an inactive slave but destined for the
>> bond device.  I'll post a patch in just a moment that hopefully
>> illustrates what I see as an unexpected side effect.
> 
>         Yes, I understand this, and I view this as a separate concern
> from the duplicate suppressor (although they are linked to a degree).

OK, my third patch, adding a bond_should_drop flag to the sk_buff struct,
should address this.

> 
>         The ultimate intent (for your changes) is to permit slaves to
> operate simultaneously as members of the bond as well as independent
> entities, which is a significant behavior change from the past.
> 

I expect the primary use case to be attaching a vlan outside the bond
over an enslaved real device.  Most of this is already implemented, though,
by delivering skbs to exact matches in the inactive case.

>>> The implementation proposed with this patch is a bit different than the
>>> 'mark-mode' patch you may recall I posted a few months ago.  It created
>>> a new mode that essentially did exactly what you are describing --
>>> transmit on the primary interface unless pushed to another interface via
>>> info in the skb and receive on all interfaces.  We initially did not
>>> create a new mode based on your reservations about the previous
>>> mark-mode patch and went the direction of enhancing one or two modes
>>> initially (figuring it would be good to walk before running), with the
>>> idea that other modes could take care of this output slave selection
>>> logic in the future.
>>>
>>>
>>>>>>    I presume you're overloading active-backup because it's not
>>>>>> etherchannel, 802.3ad, etc, and just talks right to the switch.  For the
>>>>>> regular load balance modes, I still think overlay into the existing
>>>>>> modes is preferable (more on that later); I'm thinking of "manual"
>>>>>> instead of another tweak to active-backup.
>>>>>>
>>>>>>    If users want to have actual hot-standby functionality, then
>>>>>> active-backup would do that, and nothing else (and it can be multi-queue
>>>>>> aware, but only one slave active at a time).
>>>>>>
>>>>> Yes, but active backup doesn't provide preferred output path selection in and
>>>>> of itself.  That's the feature here.
>>>>       I understand that; I'm suggesting that active-backup should
>>>> provide no service other than hot standby, and not be overloaded into a
>>>> manual load balancing scheme (both for your use, and for FCOE).
>>>>
>>>>       Maybe I'm worrying too much about defending the purity of the
>>>> active-backup mode; I understand what you're trying to do a little
>>>> better now, and yes, the "manual" mode I think of (in your queue mapping
>>>> scheme, not the other doodads I talked about) would basically be
>>>> active-backup with your queue mapper, minus the duplicate suppressor.
>>>>
>>> It doesn't matter terribly to me which direction is taken.  Again, a
>>> major reason this route was proposed was that you were not as keen on
>>> creating a new mode as I was at the time of that patch posting.  It's
>>> somewhat understandable as once a mode is added it's tough to take away,
>>> but when one sees how much we are really changing the way active-backup
>>> might behave in some cases maybe it makes sense to use a new mode?
>>>
>>> I guess I like the idea of adding this output selection to existing
>>> modes because it at least gives us the option to use queue maps to
>>> select output interfaces for more than a mode that looks like
>>> present-day active-backup minus the duplicate suppression.   I'm happy to
>>> code-up a patch that creates a new mode, but before I go do that and
>>> test it, I'd like to know we have come to an agreement on the direction
>>> for the future.
>>>
>>>>>>    Users who want the set of bonded slaves to look like a big
>>>>>> multiqueue buffet could use this "manual" mode and set things up however
>>>>>> they want.  One way to set it up is simply that the bond is N queues
>>>>>> wide, where N is the total of the queue counts of all the slaves.  If a
>>>>>> slave fails, N gets smaller, and the user code has to deal with that.
>>>>>> Since the queue count of a device can't change dynamically, the bond
>>>>>> would have to actually be set up with some big number of queues, and
>>>>>> then only a subset is actually active (or there is some sort of wrap).
>>>>>>
>>>>>>    In such an implementation, each slave would have a range of
>>>>>> queue IDs, not necessarily just one.  I'm a bit leery of exposing an API
>>>>>> where each slave is one queue ID, as it could make transitioning to real
>>>>>> multi-queue awareness difficult.
>>>>>>
>>>>> I'm sorry, what exactly do you mean when you say 'real' multi queue
>>>>> awareness?  How is this any less real than any other implementation?  The
>>>>> approach you outline above isn't any more or less valid than this one.
>>>>       That was my misunderstanding of how you planned to handle
>>>> things.  I had thought this patch was simply a scheme to use the queue
>>>> IDs for slave selection, without any method to further perform queue
>>>> selection on the slave itself (I hadn't thought of placing a tc action
>>>> on the slave itself, which you described later on).  I had been thinking
>>>> in terms of schemes to expose all of the slave queues on the bonding
>>>> device.
>>> It wasn't our original intention either.  I didn't mention it in my
>>> original post as it wasn't really the intent of our patch, but a nice
>>> side-effect for the informed user. :) Obviously a bit more testing could
>>> take place and we could add more examples to the documentation for the
>>> nice side-effect feature of this patch, but since this wasn't our
>>> original intent and we didn't test it, we did not advertise it.
>>>
>>>>       So, I don't see any issues with the queue mapping part.  I still
>>>> want to find a common solution for FCOE / your patch with regards to the
>>>> duplicate suppressor.
>>> Understood.
>>>
>>>>> While we're on the subject, Andy and I did discuss a model similar to what you
>>>>> describe above (what I'll refer to as a queue id passthrough model), in which
>>>>> you can tell the bonding driver to map a frame to a queue, and the bonding
>>>>> driver doesn't really do anything with the queue id other than pass it to the
>>>>> slave device for hardware based multiqueue tx handling.  While we could do
>>>>> that, it's my feeling such a model isn't the way to go for two primary reasons:
>>>>>
>>>>> 1) Inconsistent behavior.  Such an implementation makes assumptions regarding
>>>>> queue id specification within a driver.  For example, what if one of the slaves
>>>>> reserves some fixed number of low order queues for a specific purpose, and as
>>>>> such its general use queues begin at an offset from zero, while other slaves do
>>>>> not?  While it's easy to accommodate such needs when writing the tc filters, if
>>>>> a slave fails over, such a bias would change output traffic behavior, as the
>>>>> bonding driver can't be clearly informed of such a bias.  Likewise, what if a
>>>>> slave driver allocates more queues than it actually supports in hardware (like
>>>>> the implementation you propose; ixgbe IIRC actually does this)?  If slaves
>>>>> handled unimplemented tx queues differently (if one wrapped queues while the
>>>>> other simply dropped frames to unimplemented queues, for instance), a failover
>>>>> would change traffic patterns dramatically.
>>>>>
>>>>> 2) Need.  While (1) can pretty easily be managed with a few configuration
>>>>> guidelines (output queues on slaves have to be configured identically, lest
>>>>> chaos and madness befall you, etc.), there's really no reason to bind users to
>>>>> such a system.  We're using tc filters to set the queue id on skbs enqueued to
>>>>> the bonding driver; there's absolutely no reason you can't add additional
>>>>> filters to the slaves directly.  Since the bonding driver uses dev_queue_xmit
>>>>> to send a frame to a slave, it has the opportunity to pass through another set
>>>>> of queuing disciplines and filters that can reset and re-assign the skb's
>>>>> queue mapping.  So with the approach in this patch you get both direct output
>>>>> control and actual hardware tx output queue control.  With a passthrough
>>>>> model, you save a bit of filter configuration, but at the expense of having to
>>>>> be much more careful about how you configure your slave nics, and detecting
>>>>> such errors in configuration would be rather difficult to track down, as it
>>>>> would require the generation of traffic that hits the right filter after a
>>>>> failover.
>>>>       I don't disagree with any of this.  One thought I do have is
>>>> that Eric Dumazet, I believe, has mentioned that the read lock in
>>>> bonding is a limiting factor on 10G performance.  In the far distant
>>>> future when bonding is RCU, going through the lock(s) on the tc actions
>>>> of the slave could have the same net effect, and in such a case, a
>>>> qdisc-less path may be of benefit.  Not a concern for today, I suspect.
>>>>
>>>>>>    There might also be a way to tie it in to the new RPS code on
>>>>>> the receive side.
>>>>>>
>>>>>>    If the slaves all have the same MAC and attach to a single
>>>>>> switch via etherchannel, then it all looks pretty much like a single big
>>>>>> honkin' multiqueue device.  The switch probably won't map the flows back
>>>>>> the same way, though.
>>>>>>
>>>>> I agree, they probably won't.  Receive side handling wasn't really our focus
>>>>> here though.  That's largely why we chose round robin and active backup as our first
>>>>> modes to use this with.  They are already written to expect frames on either
>>>>> interface.
>>>>>
>>>>>>    If the slaves are on discrete switches (without etherchannel),
>>>>>> things become more complicated.  If the slaves have the same MAC, then
>>>>>> the switches will be irritated about seeing that same MAC coming in from
>>>>>> multiple places.  If the slaves have different MACs, then ARP has the
>>>>>> same sort of issues.
>>>>>>
>>>>>>    In thinking about it, if it's linux bonding at both ends, there
>>>>>> could be any number of discrete switches in the path, and it wouldn't
>>>>>> matter as long as the linux end can work things out, e.g.,
>>>>>>
>>>>>>         -- switch 1 --
>>>>>> hostA  /              \  hostB
>>>>>> bond  ---- switch 2 ---- bond
>>>>>>        \              /
>>>>>>         -- switch 3 --
>>>>>>
>>>>>>    For something like this, the switches would never share MAC
>>>>>> information for the bonding slaves.  The issue here then becomes more of
>>>>>> detecting link failures (it would require either a "trunk failover" type
>>>>>> of function on the switch, or some kind of active probe between the
>>>>>> bonds).
>>>>>>
>>>>>>    Now, I realize that I'm babbling a bit, as from reading your
>>>>>> description, this isn't necessarily your target topology (which sounded
>>>>>> more like a case of slave A can reach only network X, and slave B can
>>>>>> reach anywhere, so sending to network X should use slave A
>>>>>> preferentially), or, as long as I'm doing ASCII-art,
>>>>>>
>>>>>>        --- switch 1 ---- network X
>>>>>> hostA /               /
>>>>>> bond  ---- switch 2 -+-- anywhere
>>>>>>
>>>>>>    Is that an accurate representation?  Or is it something a bit
>>>>>> different, e.g.,
>>>>>>
>>>>>>        --- switch 1 ---- network X -\
>>>>>> hostA /                             /
>>>>>> bond  ---- switch 2 ---- anywhere --
>>>>>>
>>>>>>    I.e., the "anywhere" connects back to network X from the
>>>>>> outside, so to speak.  Or, oh, maybe I'm missing it entirely, and you're
>>>>>> thinking of something like this:
>>>>>>
>>>>>>        --- switch 1 --- VPN --- web site
>>>>>> hostA /                          /
>>>>>> bond  ---- switch 2 - Internet -/
>>>>>>
>>>>>>    Where you prefer to hit "web site" via the VPN (perhaps it's a
>>>>>> more efficient or secure path), but can do it from the public network at
>>>>>> large if necessary.
>>>>>>
>>>>> Yes, this one.  I think the other models are equally interesting, but this
>>>>> model, in which either path has universal reachability but for some classes
>>>>> of traffic one path is preferred over the other, is the one we had in mind.
>>>>>
>>>>>>    Now, regardless of the above, your first patch ("keep_all") is
>>>>>> to deal with the reverse problem, if this is a piggyback on top of
>>>>>> active-backup mode: how to get packets back, when both channels can be
>>>>>> active simultaneously.  That actually dovetails to a degree with work
>>>>>> I've been doing lately, but the solution there probably isn't what
>>>>>> you're looking for (there's a user space daemon to do path finding, and
>>>>>> the "bond IP" address is piggybacked on the slaves' MAC addresses, which
>>>>>> are not changed; the "bond IP" set exists in a separate subnet all its
>>>>>> own).
>>>>>>
>>>>>>    As I said, I'm not convinced that the "keep_all" option to
>>>>>> active-backup is really better than just a "manual" mode that lacks the
>>>>>> dup suppression and expects the user to set everything up.
>>>>>>
>>>>>>    As for the round-robin change in this patch, if I'm reading it
>>>>>> right, then the way it works is that the packets are round-robined,
>>>>>> unless there's a queue id passed in, in which case it's assigned to the
>>>>>> slave mapped to that queue id.  I'm not entirely sure why you picked
>>>>>> round-robin mode for that over balance-xor; it doesn't seem to fit well
>>>>>> with the description in the documentation.  Or is it just sort of a
>>>>>> demonstrator?
>>>>>>
>>>>> It was selected because round robin allows transmits on any interface already,
>>>>> and expects frames on any interface, so it was a 'safe' choice.  I would think
>>>>> balance-xor would also work.  Ideally it would be nice to get more modes
>>>>> supporting this mechanism.
>>>>       I think that this should work for balance-xor and 802.3ad.  The
>>>> only limitation for 802.3ad is that the spec requires "conversations" to
>>>> not be striped or to skip around in a manner that could lead to out of
>>>> order delivery.
>>> Agreed.  Checking would probably also have to be done to make sure that
>>> we were not transmitting on an inactive aggregator.
>>>
>>>>       I'm not so sure about the alb/tlb modes; at first thought, I
>>>> think it could have conflicts with the internal balancing done within
>>>> the modes (if, e.g., the tc action put traffic for the same destination
>>>> on two slaves).
>>>>
>>> TLB and ALB modes would certainly have to be done differently.  It
>>> should not be terribly difficult to move from the existing hashing
>>> that's done to one that relies on the queue_mapping, but it will take a
>>> bit to make sure it's not a complete hack.
>>>
>>> We decided against doing that for all modes on the first pass as it
>>> seemed like active-backup and round-robin were the most likely
>>> users.  We also wanted to present the code early rather than spending time
>>> supporting this in every mode only to find out that it just wasn't rational
>>> to do it on some of them.
>>>
>>>>>>    I do like one other aspect of the patch, and that's the concept
>>>>>> of overlaying the queue map on top of the balance algorithm.  So, e.g.,
>>>>>> balance-xor would do its usual thing, unless the packet is queue mapped,
>>>>>> in which case the packet's assignment is obeyed.  The balance-xor could
>>>>>> even optionally do its xor across the full set of all slaves output
>>>>>> queues instead of just across the slaves.  Round-robin can operate
>>>>>> similarly.  For those modes, a "balance by queue vs. balance by slave"
>>>>>> seems like a reasonable knob to have.
>>>>> Not sure what you mean here.  In the model implemented by this patch, there is
>>>>> one output queue per slave, and as such, balance by queue == balance by slave.
>>>>> That would make sense in the model you describe earlier in this note, but not in
>>>>> the model presented by this patch.
>>>>       Yes, I was thinking about what I had described; again,
>>>> predicated on my misunderstanding of how it all worked.
>>>>
>>>>>>    I do understand that you're proposing something relatively
>>>>>> simple, and I'm thinking out loud about alternate or additional
>>>>>> implementation details.  Some of this is "ooh ahh what if", but we also
>>>>>> don't want to end up with something that's forwards incompatible, and
>>>>>> I'm hoping to find one solution to multiple problems.
>>>>>>
>>>>> For clarification, can you enumerate what other problems you are trying to
>>>>> solve with this feature, or features similar to this?  From this email, the one
>>>>> that I most clearly see is the desire to allow a passthrough mode of queue
>>>>> selection, which I think I've noted can be done already (even without this
>>>>> patch), by attaching additional tc filters to the slaves' output queues directly.
>>>>> What else do you have in mind?
>>>>       As I said above, I hadn't thought of stacking tc actions on to
>>>> the slaves directly, so I was thinking on ways to expose the slave
>>>> queues.
>>>>
>>>>       I still find something intriguing about a round-robin or xor
>>>> mode that robins/xors through all of the slave queues, though, but that
>>>> should be something separate (I'm not sure if such a scheme is actually
>>>> "better", either).
>>>>
>>>>       -J
>> It would be best if there were a solution for the FCoE use case that works
>> with the current bonding modes, including 802.3ad.  There is switch support
>> to run mpio FCoE while doing link aggregation on the LAN side that we
>> should support.  I'm not sure the keep_all patch would be good in this
>> case.  Jay, I think you mentioned this at some point, but I missed the
>> conclusion?  Although maybe it would be OK; I'll think about it some more
>> tomorrow.
> 
>         How does that mpio FCOE / switch support function?  Does it rely
> on utilizing ports (for the FC traffic) that are not members of the
> 802.3ad active aggregator?  E.g.:
> 
>        / eth0,eth1 -- switch A -- etc
> bond0 -
>        \ eth2,eth3 -- switch B -- etc
> 
>         Bond0 has four slaves, eth0 - eth3.  Eth0 and eth1 connect to
> switch A; eth2 and eth3 to switch B.  Presuming that the switches aren't
> stacked / magic, either eth0/eth1 or eth2/eth3 will be the active
> aggregator (linux bonding only supports one active aggregator).
> 
>         Am I correct in presuming that the FCOE balancer gizmo doesn't
> care about the 802.3ad state of the ports, and it and the switch will
> run FC traffic across all four ports, regardless of which ports are
> active and which are not?
> 
>         Or is the switch even simpler than that, and it processes all
> FCOE traffic to ports, regardless of how the ports are configured
> (etherchannel, 802.3ad, etc)?
>

The switch is simple and FCoE traffic is bound to ports which are in 
turn bound to VSAN (virtual SAN) ports on the FC side.  So the FCoE 
traffic works regardless of how the ports are configured (etherchannel, 
802.3ad, etc).

>         As for other bonding modes, balance-xor or balance-rr (round
> robin) shouldn't have the same problems with the duplicate suppression
> logic that active-backup and 802.3ad have.  The alb/tlb modes might or
> might not be workable at all, depending upon how the FCOE traffic looks
> (e.g., what source and destination MAC addresses are in the FCOE
> frames?).

We have the ability to use a SAN MAC that can be different from the LAN
MAC, which should allow this to work.

> 
>         In any event, wanting to run FCOE in conjunction with a variety
> of bonding modes suggests that Neil was right all along, and the
> duplicate suppressor change should be an option, not a new mode.

This should be OK, but all FCoE really needs is skbs to be delivered to
devices that have packet handlers with exact matches.  Basically, making
the inactive VLAN case behave the same as an inactive real device.

Thanks,
John


> 
>         -J
> 
> ---
>         -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

Neil Horman May 17, 2010, 6:45 p.m. UTC | #9
On Thu, May 13, 2010 at 01:15:04PM -0400, Neil Horman wrote:
> On Wed, May 12, 2010 at 06:14:08PM -0400, Andy Gospodarek wrote:
> > On Wed, May 12, 2010 at 12:41:54PM -0700, Jay Vosburgh wrote:
> > > Neil Horman <nhorman@tuxdriver.com> wrote:
> > > 
> > > >On Tue, May 11, 2010 at 01:09:39PM -0700, Jay Vosburgh wrote:
> > > >> Andy Gospodarek <andy@greyhouse.net> wrote:
> > > >> 
> > > >> >This patch give the user the ability to control the output slave for
> > > >> >round-robin and active-backup bonding.  Similar functionality was
> > > >> >discussed in the past, but Jay Vosburgh indicated he would rather see a
> > > >> >feature like this added to existing modes rather than creating a
> > > >> >completely new mode.  Jay's thoughts as well as Neil's input surrounding
> > > >> >some of the issues with the first implementation pushed us toward a
> > > >> >design that relied on the queue_mapping rather than skb marks.
> > > >> >Round-robin and active-backup modes were chosen as the first users of
> > > >> >this slave selection as they seemed like the most logical choices when
> > > >> >considering a multi-switch environment.
> > > >> >
> > > >> >Round-robin mode works without any modification, but active-backup does
> > > >> >require inclusion of the first patch in this series and setting
> > > >> >the 'keep_all' flag.  This will allow reception of unicast traffic on
> > > >> >any of the backup interfaces.
> > > >> 
> > > >> 	Yes, I did think that the mark business fit better into existing
> > > >> modes (I thought of it as kind of a new hash for xor and 802.3ad modes).
> > > >> I also didn't expect to see so much new stuff (this, as well as the FCOE
> > > >> special cases being discussed elsewhere) being shoehorned into the
> > > >> active-backup mode.  I'm not so sure that adding so many special cases
> > > >> to active-backup is a good thing.
> > > >> 
> > > >> 	Now, I'm starting to wonder if you were right, and it would be
> > > >> better overall to have a "manual" mode that would hopefully satisfy this
> > > >> case as well as the FCOE special case.  I don't think either of these is
> > > >> a bad use case, I'm just not sure the right way to handle them is
> > > >> another special knob in active-backup mode (either directly, or
> > > >> implicitly in __netif_receive_skb), which wasn't what I expected to see.
> > > >> 
> > > >I honestly don't think a separate mode is warranted here.  While I'm not opposed
> > > >to adding a new mode, I really think doing so is no different from overloading
> > > >an existing mode.  I say that because to add a new mode in which we explicitly
> > > >expect traffic to be directed to various slaves requires that we implement a
> > > >policy for frames which have no queue mapping determined on egress.  Any policy
> > > >I can think of is really an approximation of an existing policy, so we may as
> > > >well reuse the policy code that we already have in place.  About the only way a
> > > >separate mode makes sense is in the 'passthrough' queue mode you document below.
> > > >In this model, in which queue ids map to slaves in a 1:1 fashion, it
> > > >doesn't make sense.
> > > 
> > > 	One goal I'm hoping to achieve is something that would satisfy
> > > both the queue map stuff that you're looking for, and would meet the
> > > needs of the FCOE people who also want to disable the duplicate
> > > suppression (i.e., permit incoming traffic on the inactive slave) for a
> > > different special case.
> > > 
> > > 	The FCOE proposal revolves around, essentially, active-backup
> > > mode, but permitting incoming traffic on the inactive slave.  At the
> > > moment, the patches attempt to special case things such that only
> > > dev_add_pack listeners directly bound to the inactive slave are checked
> > > (to permit the FCOE traffic to pass on the inactive slave, but still
> > > dropping IP, as ip_rcv is a wildcard bind).
> > > 
> > > 	Your keep_all patch is, by and large, the same thing, except
> > > that it permits anything to come in on the "inactive" slave, and it's a
> > > switch that has to be turned on.
> > > 
> > > 	This seems like needless duplication to me; I'd prefer to see a
> > > single solution that handles both cases instead of two special cases
> > > that each do 90% of what the other does.
> > > 
> > > 	As far as a new mode goes, one major reason I think a separate
> > > mode is warranted is the semantics: with either of these changes (to
> > > permit more or less regular use of the "inactive" slaves), the mode
> > > isn't really an "active-backup" mode any longer; there is no "inactive"
> > > or "backup" slave.  I think of this as being a major change of
> > > functionality, not simply a minor option.
> > > 
> > > 	Hence my thought that "active-backup" could stay as a "true" hot
> > > standby mode (backup slaves are just that: for backup, only), and this
> > > new mode would be the place to do the special queue-map / FCOE /
> > > whatever that isn't really a hot standby configuration any longer.
> > > 
> > > 	As far as the behavior of the new mode (your concern about its
> > > policy map approximations), in the end, it would probably act pretty
> > > much like active-backup with your patch applied: traffic goes out the
> > > active slave, unless directed otherwise.  It's a lot less complicated
> > > than I had feared.
> > > 
> > 
> > It's beginning to sound like the 'FCoE use-case' and the one Neil and I
> > are proposing are quite similar.  The main goal of both is to have the
> > option to have multiple slaves send and receive traffic during the
> > steady-state, but in the event of a failover all traffic would run on a
> > single interface.
> > 
> > The implementation proposed with this patch is a bit different than the
> > 'mark-mode' patch you may recall I posted a few months ago.  It created
> > a new mode that essentially did exactly what you are describing --
> > transmit on the primary interface unless pushed to another interface via
> > info in the skb and receive on all interfaces.  We initially did not
> > create a new mode based on your reservations about the previous
> > mark-mode patch and went the direction of enhancing one or two modes
> > initially (figuring it would be good to walk before running), with the
> > idea that other modes could take care of this output slave selection
> > logic in the future.
> 
> 
> So, it's sounding to me like everyone is leaning toward a new mode approach here.
> Before we go ahead and start coding, I hear the bullet points for this approach
> as:
> 
> 1) Implement a new bond mode where queue ids are used to steer frames to output
> interfaces
> 
> 2) Use said mode to imply universal frame reception (i.e. remove the keep_all
> knob, and turn on that behavior when this new mode is selected)
> 
> 3) use John F.'s skb_should_drop changes to implement the keep_all feature.
> 
> Does that sound about right?
> 
> Regards
> Neil
> 
Ping, Jay, any feedback on these bullet points?  I'd like to keep working on
this while we have the time, but I'd rather not play guess & check on the list
here any more than we need to.

Jay Vosburgh May 17, 2010, 7:25 p.m. UTC | #10
Neil Horman <nhorman@tuxdriver.com> wrote:
>Neil Horman <nhorman@tuxdriver.com> wrote:
>> So, its sounding to me like everyone is leaning toward a new mode approach here.
>> Before we go ahead and start coding, I hear the bullet points for this approach
>> as:
>> 
>> 1) Implement a new bond mode where queue ids are used to steer frames to output
>> interfaces
>> 
>> 2) Use said mode to imply universal frame reception (i.e. remove the keep_all
>> knob, and turn on that behavior when this new mode is selected)
>> 
>> 3) use John F.'s skb_should_drop changes to implement the keep_all feature.
>> 
>> Does that sound about right?
>> 
>> Regards
>> Neil
>> 
>Ping, Jay, any feedback on these bullet points.  I'd like to keep working on
>this while we have the time, but I'd rather not play guess & check on the list
>here any more than we need to.

	I've been thinking on this and trying some variations.

	After further discussion with John Fastabend, I don't think a
new mode is warranted for this particular work, largely because the FCOE
case wants to run under essentially any mode (the magic FCOE switches
don't care, they just divert the FC traffic regardless of the switch
port settings).  Perhaps some specific multi-queue gizmo will become a
new mode down the road.  

	Part of my thinking for originally wanting a new mode was that
your changes didn't expose the slave device queues, but after further
discussion, that isn't actually the case.

	I thought about removing the whole special case for "deliver to
direct bindings" and have a switch instead to always deliver to
everybody.  In the end, I don't think that would end up making things
much simpler, because the special cases have to stay for the "normal"
inactive slave behavior to pass the ARP, ARP-over-VLAN, ETH_P_SLOW, etc
traffic.  It also would come at the cost of disabling the duplicate
suppressor for the FCOE case, and I don't think that's what they want.

	I'm thinking right now to go with more or less the three patch
series that John Fastabend posted last week, along with a variant of
your patch.  As much as I'd like to do both things as a unified patch,
doing so just doesn't seem to simplify the existing code, and ends up
being less than optimal for both cases.

	For John's patches, a few minor changes, but that basic idea
(flag in the skb); I'm still chewing on the "VLAN don't copy IFF_MASTER
or SLAVE flag" patch though, just to make sure it won't break anything,
but I don't think it's a critical change.

	For your patch, I'm exploring the idea of not setting
IFF_SLAVE_INACTIVE on "inactive" slaves for an "all_slaves_active"
option (I think that's a more descriptive name than "keep_all") instead
of adding a new KEEP_ALL flag bit to priv_flags.  Did you consider this
methodology and exclude it for some reason?
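
	If it does pan out that way, the knob would presumably be toggled
like any other bonding sysfs option; a sketch, assuming the name above and
a 0/1 value (not taken from any posted patch):

    # hypothetical option name; 1 = accept traffic on every slave
    echo 1 > /sys/class/net/bond0/bonding/all_slaves_active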

	Hopefully there won't be any pushback against using a flag bit
in the skb.  I haven't thought of a way to do what's desired in an
efficient manner without storing state in the skb somewhere.  The
skb->skb_iif does hold the original device receiving the skb, but I have
to believe that converting that to a struct net_device * (for the
"direct bind to original slave" case) in __netif_receive_skb is more
expensive than stashing a flag in the skb.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
diff mbox

Patch

diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index d64fd2f..fd277c1 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -49,6 +49,7 @@  Table of Contents
 3.3	Configuring Bonding Manually with Ifenslave
 3.3.1		Configuring Multiple Bonds Manually
 3.4	Configuring Bonding Manually via Sysfs
+3.5	Overriding Configuration for Special Cases
 
 4. Querying Bonding Configuration
 4.1	Bonding Configuration
@@ -1333,8 +1334,79 @@  echo 2000 > /sys/class/net/bond1/bonding/arp_interval
 echo +eth2 > /sys/class/net/bond1/bonding/slaves
 echo +eth3 > /sys/class/net/bond1/bonding/slaves
 
-
-4. Querying Bonding Configuration 
+3.5 Overriding Configuration for Special Cases
+----------------------------------------------
+Nominally, when using the bonding driver, the physical port which transmits a
+frame is selected by the bonding driver, and is not relevant to the user or
+system administrator.  The output port is simply selected using the policies of
+the selected bonding mode.  On occasion, however, it is helpful to direct certain
+classes of traffic to certain physical interfaces on output to implement
+slightly more complex policies.  For example, to reach a web server over a
+bonded interface in which eth0 connects to a private network, while eth1
+connects via a public network, it may be desirable to bias the bond to send said
+traffic over eth0 first, using eth1 only as a fallback, while all other traffic
+can safely be sent over either interface.  Such configurations may be achieved
+using the traffic control utilities inherent in linux.
+
+By default the bonding driver is multiqueue aware and 16 queues are created
+when the driver initializes (see Documentation/networking/multiqueue.txt
+for details).  If more or fewer queues are desired, the module parameter
+tx_queues can be used to change this value.  There is no sysfs parameter
+available as the allocation is done at module init time.
+
+The output of the file /proc/net/bonding/bondX has changed so the output Queue
+ID is now printed for each slave:
+
+Bonding Mode: fault-tolerance (active-backup)
+Primary Slave: None
+Currently Active Slave: eth0
+MII Status: up
+MII Polling Interval (ms): 0
+Up Delay (ms): 0
+Down Delay (ms): 0
+
+Slave Interface: eth0
+MII Status: up
+Link Failure Count: 0
+Permanent HW addr: 00:1a:a0:12:8f:cb
+Slave queue ID: 0
+
+Slave Interface: eth1
+MII Status: up
+Link Failure Count: 0
+Permanent HW addr: 00:1a:a0:12:8f:cc
+Slave queue ID: 2
+
+The queue_id for a slave can be set using the command:
+
+# echo "eth1:2" > /sys/class/net/bond0/bonding/queue_id
+
+Any interface that needs a queue_id should have it set with repeated calls
+like the one above, until the desired mapping exists for all interfaces.  On
+distributions that allow configuration via initscripts, multiple 'queue_id'
+arguments can be added to BONDING_OPTS to set all needed slave queues.
+
+These queue ids can be used in conjunction with the tc utility to configure
+a multiqueue qdisc and filters to bias certain traffic to transmit on certain
+slave devices.  For instance, say we wanted, in the above configuration, to
+force all traffic bound to 192.168.1.100 to use eth1 in the bond as its output
+device. The following commands would accomplish this:
+
+# tc qdisc add dev bond0 handle 1 root multiq
+
+# tc filter add dev bond0 protocol ip parent 1: prio 1 u32 match ip dst \
+	192.168.1.100 action skbedit queue_mapping 2
+
+These commands tell the kernel to attach a multiqueue queue discipline to the
+bond0 interface and filter traffic enqueued to it, such that packets with a dst
+ip of 192.168.1.100 have their output queue mapping value overwritten to 2.
+This value is then passed into the driver, causing the normal output path
+selection policy to be overridden, selecting instead qid 2, which maps to eth1.
+
+Note that qid values begin at 1.  qid 0 is reserved to indicate to the driver
+that normal output policy selection should take place.
+
+4. Querying Bonding Configuration
 =================================
 
 4.1 Bonding Configuration
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index eb86363..aa6a79a 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -90,6 +90,7 @@ 
 #define BOND_LINK_ARP_INTERV	0
 
 static int max_bonds	= BOND_DEFAULT_MAX_BONDS;
+static int tx_queues	= BOND_DEFAULT_TX_QUEUES;
 static int num_grat_arp = 1;
 static int num_unsol_na = 1;
 static int miimon	= BOND_LINK_MON_INTERV;
@@ -111,6 +112,8 @@  static struct bond_params bonding_defaults;
 
 module_param(max_bonds, int, 0);
 MODULE_PARM_DESC(max_bonds, "Max number of bonded devices");
+module_param(tx_queues, int, 0);
+MODULE_PARM_DESC(tx_queues, "Max number of transmit queues (default = 16)");
 module_param(num_grat_arp, int, 0644);
 MODULE_PARM_DESC(num_grat_arp, "Number of gratuitous ARP packets to send on failover event");
 module_param(num_unsol_na, int, 0644);
@@ -1532,6 +1535,12 @@  int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 		goto err_undo_flags;
 	}
 
+	/*
+	 * Set the new_slave's queue_id to be zero.  Queue ID mapping
+	 * is set via sysfs or module option if desired.
+	 */
+	new_slave->queue_id = 0;
+
 	/* save slave's original flags before calling
 	 * netdev_set_master and dev_open
 	 */
@@ -1790,6 +1799,7 @@  err_restore_mac:
 	}
 
 err_free:
+	new_slave->queue_id = 0;
 	kfree(new_slave);
 
 err_undo_flags:
@@ -1977,6 +1987,7 @@  int bond_release(struct net_device *bond_dev, struct net_device *slave_dev)
 				   IFF_SLAVE_INACTIVE | IFF_BONDING |
 				   IFF_SLAVE_NEEDARP);
 
+	slave->queue_id = 0;
 	kfree(slave);
 
 	return 0;  /* deletion OK */
@@ -3269,6 +3280,7 @@  static void bond_info_show_slave(struct seq_file *seq,
 		else
 			seq_puts(seq, "Aggregator ID: N/A\n");
 	}
+	seq_printf(seq, "Slave queue ID: %d\n", slave->queue_id);
 }
 
 static int bond_info_seq_show(struct seq_file *seq, void *v)
@@ -4405,9 +4417,59 @@  static void bond_set_xmit_hash_policy(struct bonding *bond)
 	}
 }
 
+/*
+ * Look up the slave that corresponds to a given qid
+ */
+static inline int bond_slave_override(struct bonding *bond,
+				      struct sk_buff *skb)
+{
+	int i, res = 1;
+	struct slave *slave = NULL;
+	struct slave *check_slave;
+
+	read_lock(&bond->lock);
+
+	if (!BOND_IS_OK(bond) || !skb->queue_mapping)
+		goto out;
+
+	/* Find out if any slaves have the same mapping as this skb. */
+	bond_for_each_slave(bond, check_slave, i) {
+		if (check_slave->queue_id == skb->queue_mapping) {
+			slave = check_slave;
+			break;
+		}
+	}
+
+	/*
+	 * If a matching slave was found and is up, transmit through it;
+	 * otherwise fall back to the default transmit policy.
+	 */
+	if (slave && slave->queue_id && IS_UP(slave->dev) &&
+	    (slave->link == BOND_LINK_UP)) {
+		res = bond_dev_queue_xmit(bond, skb, slave->dev);
+	}
+
+out:
+	read_unlock(&bond->lock);
+	return res;
+}
+
+static u16 bond_select_queue(struct net_device *dev, struct sk_buff *skb)
+{
+	/*
+	 * This helper function exists to help dev_pick_tx get the correct
+	 * destination queue.  Using a helper function skips a call to
+	 * skb_tx_hash and will put the skbs in the queue we expect on their
+	 * way down to the bonding driver.
+	 */
+	return skb->queue_mapping;
+}
+
 static netdev_tx_t bond_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
-	const struct bonding *bond = netdev_priv(dev);
+	struct bonding *bond = netdev_priv(dev);
+
+	if (TX_QUEUE_OVERRIDE(bond->params.mode)) {
+		if (!bond_slave_override(bond, skb))
+			return NETDEV_TX_OK;
+	}
 
 	switch (bond->params.mode) {
 	case BOND_MODE_ROUNDROBIN:
@@ -4492,6 +4554,7 @@  static const struct net_device_ops bond_netdev_ops = {
 	.ndo_open		= bond_open,
 	.ndo_stop		= bond_close,
 	.ndo_start_xmit		= bond_start_xmit,
+	.ndo_select_queue	= bond_select_queue,
 	.ndo_get_stats		= bond_get_stats,
 	.ndo_do_ioctl		= bond_do_ioctl,
 	.ndo_set_multicast_list	= bond_set_multicast_list,
@@ -4763,6 +4826,13 @@  static int bond_check_params(struct bond_params *params)
 		}
 	}
 
+	if (tx_queues < 1 || tx_queues > 255) {
+		pr_warning("Warning: tx_queues (%d) should be between "
+			   "1 and 255, resetting to %d\n",
+			   tx_queues, BOND_DEFAULT_TX_QUEUES);
+		tx_queues = BOND_DEFAULT_TX_QUEUES;
+	}
+
 	if ((keep_all != 0) && (keep_all != 1)) {
 		pr_warning("Warning: keep_all module parameter (%d), "
 			   "not of valid value (0/1), so it was set to "
@@ -4940,6 +5010,7 @@  static int bond_check_params(struct bond_params *params)
 	params->primary[0] = 0;
 	params->primary_reselect = primary_reselect_value;
 	params->fail_over_mac = fail_over_mac_value;
+	params->tx_queues = tx_queues;
 	params->keep_all = keep_all;
 
 	if (primary) {
@@ -5027,8 +5098,8 @@  int bond_create(struct net *net, const char *name)
 
 	rtnl_lock();
 
-	bond_dev = alloc_netdev(sizeof(struct bonding), name ? name : "",
-				bond_setup);
+	bond_dev = alloc_netdev_mq(sizeof(struct bonding), name ? name : "",
+				bond_setup, tx_queues);
 	if (!bond_dev) {
 		pr_err("%s: eek! can't alloc netdev!\n", name);
 		rtnl_unlock();
diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 44651ce..87bfcf1 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -1472,6 +1472,121 @@  static ssize_t bonding_show_ad_partner_mac(struct device *d,
 static DEVICE_ATTR(ad_partner_mac, S_IRUGO, bonding_show_ad_partner_mac, NULL);
 
 /*
+ * Show the queue_ids of the slaves in the current bond.
+ */
+static ssize_t bonding_show_queue_id(struct device *d,
+				     struct device_attribute *attr,
+				     char *buf)
+{
+	struct slave *slave;
+	int i, res = 0;
+	struct bonding *bond = to_bond(d);
+
+	if (!rtnl_trylock())
+		return restart_syscall();
+
+	read_lock(&bond->lock);
+	bond_for_each_slave(bond, slave, i) {
+		if (res > (PAGE_SIZE - IFNAMSIZ - 6)) {
+			/* not enough space for another name:qid pair */
+			if ((PAGE_SIZE - res) > 10)
+				res = PAGE_SIZE - 10;
+			res += sprintf(buf + res, "++more++ ");
+			break;
+		}
+		res += sprintf(buf + res, "%s:%d ",
+			       slave->dev->name, slave->queue_id);
+	}
+	read_unlock(&bond->lock);
+	if (res)
+		buf[res-1] = '\n'; /* eat the leftover space */
+	rtnl_unlock();
+	return res;
+}
+
+/*
+ * Set the queue_id of a slave in the current bond.  The slave
+ * interface must already be enslaved for this to work.
+ */
+static ssize_t bonding_store_queue_id(struct device *d,
+				      struct device_attribute *attr,
+				      const char *buffer, size_t count)
+{
+	struct slave *slave, *update_slave;
+	struct bonding *bond = to_bond(d);
+	u16 qid;
+	int i, ret = count;
+	char *delim;
+	struct net_device *sdev = NULL;
+
+	if (!rtnl_trylock())
+		return restart_syscall();
+
+	/* delim will point to queue id if successful */
+	delim = strchr(buffer, ':');
+	if (!delim)
+		goto err_no_cmd;
+
+	/*
+	 * Terminate string that points to device name and bump it
+	 * up one, so we can read the queue id there.
+	 */
+	*delim = '\0';
+	if (sscanf(++delim, "%hu\n", &qid) != 1)
+		goto err_no_cmd;
+
+	/* Check buffer length, valid ifname and queue id */
+	if (strlen(buffer) > IFNAMSIZ ||
+	    !dev_valid_name(buffer) ||
+	    qid > bond->params.tx_queues)
+		goto err_no_cmd;
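+
+	/*
+	 * A qid of 0 is accepted here: writing "<ifname>:0" clears any
+	 * existing mapping and restores the default output policy for
+	 * that slave.
+	 */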
+
+	/* Get the pointer to that interface if it exists */
+	sdev = __dev_get_by_name(dev_net(bond->dev), buffer);
+	if (!sdev)
+		goto err_no_cmd;
+
+	read_lock(&bond->lock);
+
+	/* Search for the slave and check for duplicate qids */
+	update_slave = NULL;
+	bond_for_each_slave(bond, slave, i) {
+		if (sdev == slave->dev) {
+			/*
+			 * We don't need to check the matching
+			 * slave for dups, since we're overwriting it
+			 */
+			update_slave = slave;
+		} else if (qid && qid == slave->queue_id) {
+			goto err_no_cmd_unlock;
+		}
+	}
+
+	if (!update_slave)
+		goto err_no_cmd_unlock;
+
+	/* Actually set the qid for the slave */
+	update_slave->queue_id = qid;
+
+	read_unlock(&bond->lock);
+out:
+	rtnl_unlock();
+	return ret;
+
+err_no_cmd_unlock:
+	read_unlock(&bond->lock);
+err_no_cmd:
+	pr_info("invalid input for queue_id set for %s.\n",
+		bond->dev->name);
+	ret = -EPERM;
+	goto out;
+}
+
+static DEVICE_ATTR(queue_id, S_IRUGO | S_IWUSR, bonding_show_queue_id,
+		   bonding_store_queue_id);
+
+
+/*
  * Show and set the keep_all flag.
  */
 static ssize_t bonding_show_keep(struct device *d,
@@ -1513,7 +1628,6 @@  static DEVICE_ATTR(keep_all, S_IRUGO | S_IWUSR,
 		   bonding_show_keep, bonding_store_keep);
 
 
-
 static struct attribute *per_bond_attrs[] = {
 	&dev_attr_slaves.attr,
 	&dev_attr_mode.attr,
@@ -1539,6 +1653,7 @@  static struct attribute *per_bond_attrs[] = {
 	&dev_attr_ad_actor_key.attr,
 	&dev_attr_ad_partner_key.attr,
 	&dev_attr_ad_partner_mac.attr,
+	&dev_attr_queue_id.attr,
 	&dev_attr_keep_all.attr,
 	NULL,
 };
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 3b7532f..274a3a1 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -60,6 +60,9 @@ 
 		 ((mode) == BOND_MODE_TLB)          ||	\
 		 ((mode) == BOND_MODE_ALB))
 
+#define TX_QUEUE_OVERRIDE(mode)				\
+			(((mode) == BOND_MODE_ACTIVEBACKUP) ||	\
+			 ((mode) == BOND_MODE_ROUNDROBIN))
+
 /*
  * Less bad way to call ioctl from within the kernel; this needs to be
  * done some other way to get the call out of interrupt context.
@@ -131,6 +134,7 @@  struct bond_params {
 	char primary[IFNAMSIZ];
 	int primary_reselect;
 	__be32 arp_targets[BOND_MAX_ARP_TARGETS];
+	int tx_queues;
 	int keep_all;
 };
 
@@ -166,6 +170,7 @@  struct slave {
 	u8     perm_hwaddr[ETH_ALEN];
 	u16    speed;
 	u8     duplex;
+	u16    queue_id;
 	struct ad_slave_info ad_info; /* HUGE - better to dynamically alloc */
 	struct tlb_slave_info tlb_info;
 };
diff --git a/include/linux/if_bonding.h b/include/linux/if_bonding.h
index cd525fa..2c79943 100644
--- a/include/linux/if_bonding.h
+++ b/include/linux/if_bonding.h
@@ -83,6 +83,7 @@ 
 
 #define BOND_DEFAULT_MAX_BONDS  1   /* Default maximum number of devices to support */
 
+#define BOND_DEFAULT_TX_QUEUES 16   /* Default number of tx queues per device */
+
 /* hashing types */
 #define BOND_XMIT_POLICY_LAYER2		0 /* layer 2 (MAC only), default */
 #define BOND_XMIT_POLICY_LAYER34	1 /* layer 3+4 (IP ^ (TCP || UDP)) */