diff mbox

[v4] net/bonding: send arp in interval if no active slave

Message ID 1444161197-38442-1-git-send-email-jarod@redhat.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Jarod Wilson Oct. 6, 2015, 7:53 p.m. UTC
From: Uwe Koziolek <uwe.koziolek@redknee.com>

With some very finicky switch hardware, active backup bonding can get into
a situation where we play ping-pong between interfaces, trying to get one
to come up as the active slave. There seems to be an issue with the
switch's arp replies either taking too long, or simply getting lost, so we
wind up unable to get any interface up and active. Sometimes, the issue
sorts itself out after a while, sometimes it doesn't.

Testing with num_grat_arp has proven fruitless, but sending an additional
arp on curr_arp_slave if we're still in the arp_interval timeslice in
bond_ab_arp_probe(), has shown to produce 100% reliability in testing with
this hardware combination.

[jarod: manufacturing of changelog, addition of modparam gating]
CC: Jay Vosburgh <jay.vosburgh@canonical.com>
CC: Andy Gospodarek <gospo@cumulusnetworks.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: netdev@vger.kernel.org
Signed-off-by: Uwe Koziolek <uwe.koziolek@redknee.com>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
---
v2: add code comment as to why change is needed
v3: fix wrapping of comments
v4: [jarod] add module parameter gating of code addition

 drivers/net/bonding/bond_main.c | 24 ++++++++++++++++++++++++
 include/net/bonding.h           |  1 +
 2 files changed, 25 insertions(+)

Comments

Jarod Wilson Oct. 6, 2015, 7:58 p.m. UTC | #1
Jarod Wilson wrote:
> From: Uwe Koziolek<uwe.koziolek@redknee.com>
>
> With some very finicky switch hardware, active backup bonding can get into
> a situation where we play ping-pong between interfaces, trying to get one
> to come up as the active slave. There seems to be an issue with the
> switch's arp replies either taking too long, or simply getting lost, so we
> wind up unable to get any interface up and active. Sometimes, the issue
> sorts itself out after a while, sometimes it doesn't.
>
> Testing with num_grat_arp has proven fruitless, but sending an additional
> arp on curr_arp_slave if we're still in the arp_interval timeslice in
> bond_ab_arp_probe(), has shown to produce 100% reliability in testing with
> this hardware combination.
>
> [jarod: manufacturing of changelog, addition of modparam gating]
> CC: Jay Vosburgh<jay.vosburgh@canonical.com>
> CC: Andy Gospodarek<gospo@cumulusnetworks.com>
> CC: Veaceslav Falico<vfalico@gmail.com>
> CC: netdev@vger.kernel.org
> Signed-off-by: Uwe Koziolek<uwe.koziolek@redknee.com>
> Signed-off-by: Jarod Wilson<jarod@redhat.com>
> ---
> v2: add code comment as to why change is needed
> v3: fix wrapping of comments
> v4: [jarod] add module parameter gating of code addition
>
>   drivers/net/bonding/bond_main.c | 24 ++++++++++++++++++++++++
>   include/net/bonding.h           |  1 +
>   2 files changed, 25 insertions(+)
>
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 90f2615..72ab512 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -95,6 +95,7 @@ static int miimon;
>   static int updelay;
>   static int downdelay;
>   static int use_carrier	= 1;
> +static int arp_slow_switch;
>   static char *mode;
>   static char *primary;
>   static char *primary_reselect;
> @@ -133,6 +134,10 @@ MODULE_PARM_DESC(downdelay, "Delay before considering link down, "
>   module_param(use_carrier, int, 0);
>   MODULE_PARM_DESC(use_carrier, "Use netif_carrier_ok (vs MII ioctls) in miimon; "
>   			      "0 for off, 1 for on (default)");
> +module_param(arp_slow_switch, int, 0);
> +MODULE_PARM_DESC(arp_slow_switch, "Do extra arp checks for switches with arp "
> +				  "caches that are slow to update; "
> +				  "0 for off (default), 1 for on");
>   module_param(mode, charp, 0);
>   MODULE_PARM_DESC(mode, "Mode of operation; 0 for balance-rr, "
>   		       "1 for active-backup, 2 for balance-xor, "
> @@ -2793,6 +2798,18 @@ static bool bond_ab_arp_probe(struct bonding *bond)
>   			return should_notify_rtnl;
>   	}
>
> +	/* Sometimes the forwarding tables of the switches are not update
> +	 * fast enough, so the first arp response after a slave change is
> +	 * received on the wrong slave.
> +	 *
> +	 * The arp requests will be retried 2 times on the same slave.
> +	 */
> +	if (arp_slow_switch &&

This here should actually be bond->params.arp_slow_switch, but I'd like 
to hear first if a module parameter gating this change is even a 
remotely acceptable idea. It'd keep the logic identical in the default 
case though, and still allow for people like Uwe that need it to deploy 
the work-around.

Though I'm slightly curious if this problem does NOT manifest by simply 
setting a larger arp_interval. Early on, I thought I'd heard that other 
intervals had been tried with the same results, but a comment in this 
thread suggested maybe only 500 had been tried.
Nikolay Aleksandrov Oct. 7, 2015, 12:03 p.m. UTC | #2
On 10/06/2015 09:53 PM, Jarod Wilson wrote:
> From: Uwe Koziolek <uwe.koziolek@redknee.com>
> 
> With some very finicky switch hardware, active backup bonding can get into
> a situation where we play ping-pong between interfaces, trying to get one
> to come up as the active slave. There seems to be an issue with the
> switch's arp replies either taking too long, or simply getting lost, so we
> wind up unable to get any interface up and active. Sometimes, the issue
> sorts itself out after a while, sometimes it doesn't.
> 
> Testing with num_grat_arp has proven fruitless, but sending an additional
> arp on curr_arp_slave if we're still in the arp_interval timeslice in
> bond_ab_arp_probe(), has shown to produce 100% reliability in testing with
> this hardware combination.
> 
> [jarod: manufacturing of changelog, addition of modparam gating]
> CC: Jay Vosburgh <jay.vosburgh@canonical.com>
> CC: Andy Gospodarek <gospo@cumulusnetworks.com>
> CC: Veaceslav Falico <vfalico@gmail.com>
> CC: netdev@vger.kernel.org
> Signed-off-by: Uwe Koziolek <uwe.koziolek@redknee.com>
> Signed-off-by: Jarod Wilson <jarod@redhat.com>
> ---
> v2: add code comment as to why change is needed
> v3: fix wrapping of comments
> v4: [jarod] add module parameter gating of code addition
> 
Hi all,
As Andy already stated I'm not a fan of such workarounds either but it's
necessary sometimes so if this is going to be actually considered then a
few things need to be fixed. Please make this a proper bonding option
which can be changed at runtime and not only via a module parameter.
Now, I saw that you've only tested with 500 ms, can't this be fixed by using
a different interval ? This seems like a very specific problem to have a
whole new option for.
I really want to say fix the switch but I know that's not an option. :-)
A few minor nits below,

>  drivers/net/bonding/bond_main.c | 24 ++++++++++++++++++++++++
>  include/net/bonding.h           |  1 +
>  2 files changed, 25 insertions(+)
> 
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 90f2615..72ab512 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -95,6 +95,7 @@ static int miimon;
>  static int updelay;
>  static int downdelay;
>  static int use_carrier	= 1;
> +static int arp_slow_switch;
>  static char *mode;
>  static char *primary;
>  static char *primary_reselect;
> @@ -133,6 +134,10 @@ MODULE_PARM_DESC(downdelay, "Delay before considering link down, "
>  module_param(use_carrier, int, 0);
>  MODULE_PARM_DESC(use_carrier, "Use netif_carrier_ok (vs MII ioctls) in miimon; "
>  			      "0 for off, 1 for on (default)");
> +module_param(arp_slow_switch, int, 0);
> +MODULE_PARM_DESC(arp_slow_switch, "Do extra arp checks for switches with arp "
> +				  "caches that are slow to update; "
> +				  "0 for off (default), 1 for on");
>  module_param(mode, charp, 0);
>  MODULE_PARM_DESC(mode, "Mode of operation; 0 for balance-rr, "
>  		       "1 for active-backup, 2 for balance-xor, "
> @@ -2793,6 +2798,18 @@ static bool bond_ab_arp_probe(struct bonding *bond)
>  			return should_notify_rtnl;
>  	}
>  
> +	/* Sometimes the forwarding tables of the switches are not update
^ s/update/updated/

> +	 * fast enough, so the first arp response after a slave change is
> +	 * received on the wrong slave.
> +	 *
> +	 * The arp requests will be retried 2 times on the same slave.
> +	 */
> +	if (arp_slow_switch &&
> +	    bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2)) {
> +		bond_arp_send_all(bond, curr_arp_slave);
> +		return should_notify_rtnl;
> +	}
> +
>  	bond_set_slave_inactive_flags(curr_arp_slave, BOND_SLAVE_NOTIFY_LATER);
>  
>  	bond_for_each_slave_rcu(bond, slave, iter) {
> @@ -4280,6 +4297,12 @@ static int bond_check_params(struct bond_params *params)
>  		use_carrier = 1;
>  	}
>  
> +	if ((arp_slow_switch != 0) && (arp_slow_switch != 1)) {
^^ no need for the extra ()

> +		pr_warn("Warning: arp_slow_switch module parameter (%d), not of valid value (0/1), so it was set to 1\n",
> +			arp_slow_switch);
> +		arp_slow_switch = 1;
^^ please default to old behaviour in this case (0)

> +	}
> +
>  	if (num_peer_notif < 0 || num_peer_notif > 255) {
>  		pr_warn("Warning: num_grat_arp/num_unsol_na (%d) not in range 0-255 so it was reset to 1\n",
>  			num_peer_notif);
> @@ -4516,6 +4539,7 @@ static int bond_check_params(struct bond_params *params)
>  	params->updelay = updelay;
>  	params->downdelay = downdelay;
>  	params->use_carrier = use_carrier;
> +	params->arp_slow_switch = arp_slow_switch;
>  	params->lacp_fast = lacp_fast;
>  	params->primary[0] = 0;
>  	params->primary_reselect = primary_reselect_value;
> diff --git a/include/net/bonding.h b/include/net/bonding.h
> index c1740a2..208d31c 100644
> --- a/include/net/bonding.h
> +++ b/include/net/bonding.h
> @@ -120,6 +120,7 @@ struct bond_params {
>  	int arp_validate;
>  	int arp_all_targets;
>  	int use_carrier;
> +	int arp_slow_switch;
>  	int fail_over_mac;
>  	int updelay;
>  	int downdelay;
> 

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarod Wilson Oct. 7, 2015, 1:29 p.m. UTC | #3
Nikolay Aleksandrov wrote:
> On 10/06/2015 09:53 PM, Jarod Wilson wrote:
>> From: Uwe Koziolek<uwe.koziolek@redknee.com>
>>
>> With some very finicky switch hardware, active backup bonding can get into
>> a situation where we play ping-pong between interfaces, trying to get one
>> to come up as the active slave. There seems to be an issue with the
>> switch's arp replies either taking too long, or simply getting lost, so we
>> wind up unable to get any interface up and active. Sometimes, the issue
>> sorts itself out after a while, sometimes it doesn't.
>>
>> Testing with num_grat_arp has proven fruitless, but sending an additional
>> arp on curr_arp_slave if we're still in the arp_interval timeslice in
>> bond_ab_arp_probe(), has shown to produce 100% reliability in testing with
>> this hardware combination.
>>
>> [jarod: manufacturing of changelog, addition of modparam gating]
>> CC: Jay Vosburgh<jay.vosburgh@canonical.com>
>> CC: Andy Gospodarek<gospo@cumulusnetworks.com>
>> CC: Veaceslav Falico<vfalico@gmail.com>
>> CC: netdev@vger.kernel.org
>> Signed-off-by: Uwe Koziolek<uwe.koziolek@redknee.com>
>> Signed-off-by: Jarod Wilson<jarod@redhat.com>
>> ---
>> v2: add code comment as to why change is needed
>> v3: fix wrapping of comments
>> v4: [jarod] add module parameter gating of code addition
>>
> Hi all,
> As Andy already stated I'm not a fan of such workarounds either but it's
> necessary sometimes so if this is going to be actually considered then a
> few things need to be fixed. Please make this a proper bonding option
> which can be changed at runtime and not only via a module parameter.

Okay, I can give that a shot, however...

> Now, I saw that you've only tested with 500 ms, can't this be fixed by using
> a different interval ? This seems like a very specific problem to have a
> whole new option for.

...I'll wait until we've heard confirmation from Uwe that intervals 
other than 500ms don't fix things.

> I really want to say fix the switch but I know that's not an option. :-)

Yeah, unfortunately not!

> A few minor nits below,
>
>>   drivers/net/bonding/bond_main.c | 24 ++++++++++++++++++++++++
>>   include/net/bonding.h           |  1 +
>>   2 files changed, 25 insertions(+)
>>
>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>> index 90f2615..72ab512 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -95,6 +95,7 @@ static int miimon;
>>   static int updelay;
>>   static int downdelay;
>>   static int use_carrier	= 1;
>> +static int arp_slow_switch;
>>   static char *mode;
>>   static char *primary;
>>   static char *primary_reselect;
>> @@ -133,6 +134,10 @@ MODULE_PARM_DESC(downdelay, "Delay before considering link down, "
>>   module_param(use_carrier, int, 0);
>>   MODULE_PARM_DESC(use_carrier, "Use netif_carrier_ok (vs MII ioctls) in miimon; "
>>   			      "0 for off, 1 for on (default)");
>> +module_param(arp_slow_switch, int, 0);
>> +MODULE_PARM_DESC(arp_slow_switch, "Do extra arp checks for switches with arp "
>> +				  "caches that are slow to update; "
>> +				  "0 for off (default), 1 for on");
>>   module_param(mode, charp, 0);
>>   MODULE_PARM_DESC(mode, "Mode of operation; 0 for balance-rr, "
>>   		       "1 for active-backup, 2 for balance-xor, "
>> @@ -2793,6 +2798,18 @@ static bool bond_ab_arp_probe(struct bonding *bond)
>>   			return should_notify_rtnl;
>>   	}
>>
>> +	/* Sometimes the forwarding tables of the switches are not update
> ^ s/update/updated/

D'oh. Fixed locally.

>> @@ -4280,6 +4297,12 @@ static int bond_check_params(struct bond_params *params)
>>   		use_carrier = 1;
>>   	}
>>
>> +	if ((arp_slow_switch != 0) &&  (arp_slow_switch != 1)) {
> ^^ no need for the extra ()

Copy-pasta from use_carrier checks right above it. Never quite sure if I 
should stick with the same possibly sub-optimal formatting conventions 
already in the file, try to fix them while also fixing bugs, or just mix 
styles...


>> +		pr_warn("Warning: arp_slow_switch module parameter (%d), not of valid value (0/1), so it was set to 1\n",
>> +			arp_slow_switch);
>> +		arp_slow_switch = 1;
> ^^ please default to old behaviour in this case (0)

Will do.
Jarod Wilson Oct. 9, 2015, 2:36 p.m. UTC | #4
Jarod Wilson wrote:
...
> As Andy already stated I'm not a fan of such workarounds either but it's
> necessary sometimes so if this is going to be actually considered then a
> few things need to be fixed. Please make this a proper bonding option
> which can be changed at runtime and not only via a module parameter.

Is there any particular userspace tool that would need some updating, or 
is adding the sysfs knobs sufficient here? I think I've got all the 
sysfs stuff thrown together now, but still need to test.


>> Now, I saw that you've only tested with 500 ms, can't this be fixed by
>> using
>> a different interval ? This seems like a very specific problem to have a
>> whole new option for.
>
> ...I'll wait until we've heard confirmation from Uwe that intervals
> other than 500ms don't fix things.

Okay, so I believe the "only tested with 500ms" was in reference to 
testing with Uwe's initial patch. I do have supporting evidence in a 
bugzilla report that shows upwards of 5000ms still experience the 
problem here.
Nikolay Aleksandrov Oct. 9, 2015, 3:25 p.m. UTC | #5
On 10/09/2015 04:36 PM, Jarod Wilson wrote:
> Jarod Wilson wrote:
> ...
>> As Andy already stated I'm not a fan of such workarounds either but it's
>> necessary sometimes so if this is going to be actually considered then a
>> few things need to be fixed. Please make this a proper bonding option
>> which can be changed at runtime and not only via a module parameter.
> 
> Is there any particular userspace tool that would need some updating, or is adding the sysfs knobs sufficient here? I think I've got all the sysfs stuff thrown together now, but still need to test.
> 
I'd say adding netlink support at this point is more important, and it'd be nice
if you can add support to iproute2 for the new attribute. Currently all bonding
options have both netlink and sysfs support, so you can follow that, the others
can correct me if I'm wrong here.

One more thing please don't forget to update Documentation/networking/bonding.txt

> 
>>> Now, I saw that you've only tested with 500 ms, can't this be fixed by
>>> using
>>> a different interval ? This seems like a very specific problem to have a
>>> whole new option for.
>>
>> ...I'll wait until we've heard confirmation from Uwe that intervals
>> other than 500ms don't fix things.
> 
> Okay, so I believe the "only tested with 500ms" was in reference to testing with Uwe's initial patch. I do have supporting evidence in a bugzilla report that shows upwards of 5000ms still experience the problem here.
_5 seconds_ are not enough to receive a reply, but sending it twice
in a second fixes the issue ?!
This sounds like the ARP request is not properly handled/received
and there's no reply.

Cheers,
 Nik

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jay Vosburgh Oct. 9, 2015, 3:31 p.m. UTC | #6
Jarod Wilson <jarod@redhat.com> wrote:

>Jarod Wilson wrote:
>...
>> As Andy already stated I'm not a fan of such workarounds either but it's
>> necessary sometimes so if this is going to be actually considered then a
>> few things need to be fixed. Please make this a proper bonding option
>> which can be changed at runtime and not only via a module parameter.
>
>Is there any particular userspace tool that would need some updating, or
>is adding the sysfs knobs sufficient here? I think I've got all the sysfs
>stuff thrown together now, but still need to test.

	Most (all?) bonding options should be configurable via iproute
(netlink) now.

>
>>> Now, I saw that you've only tested with 500 ms, can't this be fixed by
>>> using
>>> a different interval ? This seems like a very specific problem to have a
>>> whole new option for.
>>
>> ...I'll wait until we've heard confirmation from Uwe that intervals
>> other than 500ms don't fix things.
>
>Okay, so I believe the "only tested with 500ms" was in reference to
>testing with Uwe's initial patch. I do have supporting evidence in a
>bugzilla report that shows upwards of 5000ms still experience the problem
>here.

	I did set up some switches and attempt to reproduce this
yesterday; I daisy-chained three switches (two Cisco and an HP) together
and connected the bonded interfaces to the "end" switches.  I tried
various ARP targets (the switch, hosts on various points of the switch)
and varying arp_intervals and was unable to reproduce the problem.

	As I understand it, the working theory is something like this:

	- host with two bonded interfaces, A and B.  For active-backup
mode, the interfaces have been assigned the same MAC address.

	- switch has MAC for B in its forwarding table

	- bonding goes from down to up, and thinks all its slaves are
down, and starts the "curr_arp_slave" search for an active
arp_ip_target.  In this case, it starts with A, and sends an ARP from A.

	As an aside, I'm not 100% clear on what exactly is going on in
the "bonding goes from down to up" transition; this seems to be key in
reproducing the issue.

	- switch sees source mac coming from port A, starts to update
its forwarding table

	- meanwhile, switch forwards ARP request, and receives ARP
reply, which it forwards to port B.  Bonding drops this, as the slave is
inactive.

	- switch finishes updating forwarding table, MAC is now assigned
to port A.

	- bonding now tries sending on port B, and the cycle repeats.

	If this is what's taking place, then the arp_interval itself is
irrelevant, the race is between the switch table update and the
generation of the ARP reply.

	Also, presuming the above is what's going on, we could modify
the ARP "curr_arp_slave" logic a bit to resolve this without requiring
any magic knobs.

	For example, we could change the "drop on inactive" logic to
recognise the "curr_arp_slave" search and accept the unicast ARP reply,
and perhaps make that receiving slave the next curr_arp_slave
automatically.

	I also wonder if the fail_over_mac option would affect this
behavior, as it would cause the slaves to keep their MAC address for the
duration, so the switch would not see the MAC move from port to port.

	Another thought would be to have the curr_arp_slave cycle
through the slaves in random order, but that could create
non-deterministic results even when things are working correctly.

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jarod Wilson Oct. 12, 2015, 3:33 p.m. UTC | #7
Jay Vosburgh wrote:
> Jarod Wilson<jarod@redhat.com>  wrote:
>
>> Jarod Wilson wrote:
>> ...
>>> As Andy already stated I'm not a fan of such workarounds either but it's
>>> necessary sometimes so if this is going to be actually considered then a
>>> few things need to be fixed. Please make this a proper bonding option
>>> which can be changed at runtime and not only via a module parameter.
>> Is there any particular userspace tool that would need some updating, or
>> is adding the sysfs knobs sufficient here? I think I've got all the sysfs
>> stuff thrown together now, but still need to test.
>
> 	Most (all?) bonding options should be configurable via iproute
> (netlink) now.

D'oh, of course. I've done the kernel-side netlink bits now too, and 
started looking at the iproute source. However...


>>>> Now, I saw that you've only tested with 500 ms, can't this be fixed by
>>>> using
>>>> a different interval ? This seems like a very specific problem to have a
>>>> whole new option for.
>>> ...I'll wait until we've heard confirmation from Uwe that intervals
>>> other than 500ms don't fix things.
>> Okay, so I believe the "only tested with 500ms" was in reference to
>> testing with Uwe's initial patch. I do have supporting evidence in a
>> bugzilla report that shows upwards of 5000ms still experience the problem
>> here.
>
> 	I did set up some switches and attempt to reproduce this
> yesterday; I daisy-chained three switches (two Cisco and an HP) together
> and connected the bonded interfaces to the "end" switches.  I tried
> various ARP targets (the switch, hosts on various points of the switch)
> and varying arp_intervals and was unable to reproduce the problem.
>
> 	As I understand it, the working theory is something like this:
>
> 	- host with two bonded interfaces, A and B.  For active-backup
> mode, the interfaces have been assigned the same MAC address.
>
> 	- switch has MAC for B in its forwarding table
>
> 	- bonding goes from down to up, and thinks all its slaves are
> down, and starts the "curr_arp_slave" search for an active
> arp_ip_target.  In this case, it starts with A, and sends an ARP from A.
>
> 	As an aside, I'm not 100% clear on what exactly is going on in
> the "bonding goes from down to up" transition; this seems to be key in
> reproducing the issue.
>
> 	- switch sees source mac coming from port A, starts to update
> its forwarding table
>
> 	- meanwhile, switch forwards ARP request, and receives ARP
> reply, which it forwards to port B.  Bonding drops this, as the slave is
> inactive.
>
> 	- switch finishes updating forwarding table, MAC is now assigned
> to port A.
>
> 	- bonding now tries sending on port B, and the cycle repeats.
>
> 	If this is what's taking place, then the arp_interval itself is
> irrelevant, the race is between the switch table update and the
> generation of the ARP reply.
>
> 	Also, presuming the above is what's going on, we could modify
> the ARP "curr_arp_slave" logic a bit to resolve this without requiring
> any magic knobs.

I really like this idea. Still trying to grasp exactly how we get into 
this situation and what everything looks like as we hop through the 
various bond_ab_arp_* functions though.

> 	For example, we could change the "drop on inactive" logic to
> recognise the "curr_arp_slave" search and accept the unicast ARP reply,
> and perhaps make that receiving slave the next curr_arp_slave
> automatically.

Nothing ever actually getting picked as curr_arp_slave does appear to be 
the problem, so that does sound like it could do the trick.

> 	I also wonder if the fail_over_mac option would affect this
> behavior, as it would cause the slaves to keep their MAC address for the
> duration, so the switch would not see the MAC move from port to port.

Not sure if that's an option for the particular environment, but we 
could certainly ask Uwe to give it a try.

> 	Another thought would be to have the curr_arp_slave cycle
> through the slaves in random order, but that could create
> non-deterministic results even when things are working correctly.

I'd say avoid this route if at all possible, would rather not make 
things less predictable.
Uwe Koziolek Oct. 30, 2015, 6:59 p.m. UTC | #8
> Nikolay Aleksandrov wrote:
>> On 10/06/2015 09:53 PM, Jarod Wilson wrote:
>>> From: Uwe Koziolek<uwe.koziolek@redknee.com>
>>>
>>> With some very finicky switch hardware, active backup bonding can get into
>>> a situation where we play ping-pong between interfaces, trying to get one
>>> to come up as the active slave. There seems to be an issue with the
>>> switch's arp replies either taking too long, or simply getting lost, so we
>>> wind up unable to get any interface up and active. Sometimes, the issue
>>> sorts itself out after a while, sometimes it doesn't.
>>>
>>> Testing with num_grat_arp has proven fruitless, but sending an additional
>>> arp on curr_arp_slave if we're still in the arp_interval timeslice in
>>> bond_ab_arp_probe(), has shown to produce 100% reliability in testing with
>>> this hardware combination.
>>>
>>> [jarod: manufacturing of changelog, addition of modparam gating]
>>> CC: Jay Vosburgh<jay.vosburgh@canonical.com>
>>> CC: Andy Gospodarek<gospo@cumulusnetworks.com>
>>> CC: Veaceslav Falico<vfalico@gmail.com>
>>> CC: netdev@vger.kernel.org
>>> Signed-off-by: Uwe Koziolek<uwe.koziolek@redknee.com>
>>> Signed-off-by: Jarod Wilson<jarod@redhat.com>
>>> ---
>>> v2: add code comment as to why change is needed
>>> v3: fix wrapping of comments
>>> v4: [jarod] add module parameter gating of code addition
>>>
>> Hi all,
>> As Andy already stated I'm not a fan of such workarounds either but it's
>> necessary sometimes so if this is going to be actually considered then a
>> few things need to be fixed. Please make this a proper bonding option
>> which can be changed at runtime and not only via a module parameter.
>
> Okay, I can give that a shot, however...
>
>> Now, I saw that you've only tested with 500 ms, can't this be fixed by using
>> a different interval ? This seems like a very specific problem to have a
>> whole new option for.
>
> ...I'll wait until we've heard confirmation from Uwe that intervals other than 500ms don't fix things.
>
A test with 5000 ms don't fix the problem. Tested with Cisco C3750, 4 bonds.

>> I really want to say fix the switch but I know that's not an option. :-)
>
> Yeah, unfortunately not!
>
>> A few minor nits below,
>>
>>>   drivers/net/bonding/bond_main.c | 24 ++++++++++++++++++++++++
>>>   include/net/bonding.h           |  1 +
>>>   2 files changed, 25 insertions(+)
>>>
>>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>>> index 90f2615..72ab512 100644
>>> --- a/drivers/net/bonding/bond_main.c
>>> +++ b/drivers/net/bonding/bond_main.c
>>> @@ -95,6 +95,7 @@ static int miimon;
>>>   static int updelay;
>>>   static int downdelay;
>>>   static int use_carrier    = 1;
>>> +static int arp_slow_switch;
>>>   static char *mode;
>>>   static char *primary;
>>>   static char *primary_reselect;
>>> @@ -133,6 +134,10 @@ MODULE_PARM_DESC(downdelay, "Delay before considering link down, "
>>>   module_param(use_carrier, int, 0);
>>>   MODULE_PARM_DESC(use_carrier, "Use netif_carrier_ok (vs MII ioctls) in miimon; "
>>>                     "0 for off, 1 for on (default)");
>>> +module_param(arp_slow_switch, int, 0);
>>> +MODULE_PARM_DESC(arp_slow_switch, "Do extra arp checks for switches with arp "
>>> +                  "caches that are slow to update; "
>>> +                  "0 for off (default), 1 for on");
>>>   module_param(mode, charp, 0);
>>>   MODULE_PARM_DESC(mode, "Mode of operation; 0 for balance-rr, "
>>>                  "1 for active-backup, 2 for balance-xor, "
>>> @@ -2793,6 +2798,18 @@ static bool bond_ab_arp_probe(struct bonding *bond)
>>>               return should_notify_rtnl;
>>>       }
>>>
>>> +    /* Sometimes the forwarding tables of the switches are not update
>> ^ s/update/updated/
>
> D'oh. Fixed locally.
>
>>> @@ -4280,6 +4297,12 @@ static int bond_check_params(struct bond_params *params)
>>>           use_carrier = 1;
>>>       }
>>>
>>> +    if ((arp_slow_switch != 0) &&  (arp_slow_switch != 1)) {
>> ^^ no need for the extra ()
>
> Copy-pasta from use_carrier checks right above it. Never quite sure if I should stick with the same possibly sub-optimal
> formatting conventions already in the file, try to fix them while also fixing bugs, or just mix styles...
>
>
>>> +        pr_warn("Warning: arp_slow_switch module parameter (%d), not of valid value (0/1), so it was set to 1\n",
>>> +            arp_slow_switch);
>>> +        arp_slow_switch = 1;
>> ^^ please default to old behaviour in this case (0)
>
> Will do.
>
Uwe Koziolek

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 90f2615..72ab512 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -95,6 +95,7 @@  static int miimon;
 static int updelay;
 static int downdelay;
 static int use_carrier	= 1;
+static int arp_slow_switch;
 static char *mode;
 static char *primary;
 static char *primary_reselect;
@@ -133,6 +134,10 @@  MODULE_PARM_DESC(downdelay, "Delay before considering link down, "
 module_param(use_carrier, int, 0);
 MODULE_PARM_DESC(use_carrier, "Use netif_carrier_ok (vs MII ioctls) in miimon; "
 			      "0 for off, 1 for on (default)");
+module_param(arp_slow_switch, int, 0);
+MODULE_PARM_DESC(arp_slow_switch, "Do extra arp checks for switches with arp "
+				  "caches that are slow to update; "
+				  "0 for off (default), 1 for on");
 module_param(mode, charp, 0);
 MODULE_PARM_DESC(mode, "Mode of operation; 0 for balance-rr, "
 		       "1 for active-backup, 2 for balance-xor, "
@@ -2793,6 +2798,18 @@  static bool bond_ab_arp_probe(struct bonding *bond)
 			return should_notify_rtnl;
 	}
 
+	/* Sometimes the forwarding tables of the switches are not update
+	 * fast enough, so the first arp response after a slave change is
+	 * received on the wrong slave.
+	 *
+	 * The arp requests will be retried 2 times on the same slave.
+	 */
+	if (arp_slow_switch &&
+	    bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2)) {
+		bond_arp_send_all(bond, curr_arp_slave);
+		return should_notify_rtnl;
+	}
+
 	bond_set_slave_inactive_flags(curr_arp_slave, BOND_SLAVE_NOTIFY_LATER);
 
 	bond_for_each_slave_rcu(bond, slave, iter) {
@@ -4280,6 +4297,12 @@  static int bond_check_params(struct bond_params *params)
 		use_carrier = 1;
 	}
 
+	if ((arp_slow_switch != 0) && (arp_slow_switch != 1)) {
+		pr_warn("Warning: arp_slow_switch module parameter (%d), not of valid value (0/1), so it was set to 1\n",
+			arp_slow_switch);
+		arp_slow_switch = 1;
+	}
+
 	if (num_peer_notif < 0 || num_peer_notif > 255) {
 		pr_warn("Warning: num_grat_arp/num_unsol_na (%d) not in range 0-255 so it was reset to 1\n",
 			num_peer_notif);
@@ -4516,6 +4539,7 @@  static int bond_check_params(struct bond_params *params)
 	params->updelay = updelay;
 	params->downdelay = downdelay;
 	params->use_carrier = use_carrier;
+	params->arp_slow_switch = arp_slow_switch;
 	params->lacp_fast = lacp_fast;
 	params->primary[0] = 0;
 	params->primary_reselect = primary_reselect_value;
diff --git a/include/net/bonding.h b/include/net/bonding.h
index c1740a2..208d31c 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -120,6 +120,7 @@  struct bond_params {
 	int arp_validate;
 	int arp_all_targets;
 	int use_carrier;
+	int arp_slow_switch;
 	int fail_over_mac;
 	int updelay;
 	int downdelay;