diff mbox

net/bonding: send arp in interval if no active slave

Message ID 55E63065.8070906@redknee.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Uwe Koziolek Sept. 1, 2015, 11:10 p.m. UTC
On Tue, Sep 01, 2015 at 05:41 PM +0200, Andy Gospodarek wrote:
> On Mon, Aug 17, 2015 at 10:51:27PM +0200, Uwe Koziolek wrote:
>> On Mon, Aug 17, 2015 at 09:14PM +0200, Jay Vosburgh wrote:
>>> Uwe Koziolek <uwe.koziolek@redknee.com> wrote:
>>>
>>>> On2015-08-17 07:12 PM,Jarod Wilson wrote:
>>>>> On 2015-08-17 12:55 PM, Veaceslav Falico wrote:
>>>>>> On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote:
>>>>>>> From: Uwe Koziolek <uwe.koziolek@redknee.com>
>>>>>>>
>>>>>>> With some very finicky switch hardware, active backup bonding can get
>>>>>>> into
>>>>>>> a situation where we play ping-pong between interfaces, trying to get
>>>>>>> one
>>>>>>> to come up as the active slave. There seems to be an issue with the
>>>>>>> switch's arp replies either taking too long, or simply getting lost,
>>>>>>> so we
>>>>>>> wind up unable to get any interface up and active. Sometimes, the issue
>>>>>>> sorts itself out after a while, sometimes it doesn't.
>>>>>>>
>>>>>>> Testing with num_grat_arp has proven fruitless, but sending an
>>>>>>> additional
>>>>>>> arp on curr_arp_slave if we're still in the arp_interval timeslice in
>>>>>>> bond_ab_arp_probe(), has shown to produce 100% reliability in testing
>>>>>>> with
>>>>>>> this hardware combination.
>>>>>> Sorry, I don't understand the logic of why it works, and what exactly
>>>>>> are
>>>>>> we fixiing here.
>>>>>>
>>>>>> It also breaks completely the logic for link state management in case
>>>>>> of no
>>>>>> current active slave for 2*arp_interval.
>>>>>>
>>>>>> Could you please elaborate what exactly is fixed here, and how it
>>>>>> works? :)
>>>>> I can either duplicate some information from the bug, or Uwe can, to
>>>>> illustrate the exact nature of the problem.
>>>>>
>>>>>> p.s. num_grat_arp maybe could help?
>>>>> That was my thought as well, but as I understand it, that route was
>>>>> explored, and it didn't help any. I don't actually have a reproducer
>>>>> setup of my own, unfortunately, so I'm kind of caught in the middle
>>>>> here...
>>>>>
>>>>> Uwe, can you perhaps further enlighten us as to what num_grat_arp
>>>>> settings were tried that didn't help? I'm still of the mind that if
>>>>> num_grat_arp *didn't* help, we probably need to do something keyed off
>>>>> num_grat_arp.
>>>> The bonding slaves are connected to high available switches, each of the
>>>> slaves is connected to a different switch. If the bond is starting, only
>>>> the selected slave sends one arp-request. If a matching arp_response was
>>>> received, this slave and the bond is going into state up, sending the
>>>> gratitious arps...
>>>> But if you got no arp reply the next slave was selected.
>>>> With most of the newer switches, not overloaded, or with other software
>>>> bugs, or with a single switch configuration, you would get a arp response
>>>> on the first arp request.
>>>> But in case of high availability configuration with non perfect switches
>>>> like HP ProCurve 54xx, also with some Cisco models, you may not get a
>>>> response on the first arp request.
>>>>
>>>> I have seen network snoops, there the switches are not responding to the
>>>> first arp request on slave 1, the second arp request was sent on slave 2
>>>> but the response was received on slave one,  and all following arp
>>>> requests are anwsered on the wrong slave for a longer time.
>>> 	Could you elaborate on the exact "high availability
>>> configuration" here, including the model(s) of switch(es) involved?
>>>
>>> 	Is this some kind of race between the switch or switches
>>> updating the forwarding tables and the bond flip flopping between the
>>> slaves?  E.g., source MAC from ARP sent on slave 1 is used to populate
>>> the forwarding table, but (for whatever reason) there is no reply.  ARP
>>> on slave 2 is sent (using the same source MAC, unless you set
>>> fail_over_mac), but forwarding tables still send that MAC to slave 1, so
>>> reply is sent there.
>> High availability:
>> 2 managed switches with routing capabilities have an interconnect.
>> One slave of a bonding interface is connected to the first switch, the
>> second slave is connected to the other switch.
>> The switch models are HP ProCurve 5406 and HP ProCurve 5412. As far as i
>> remember also HP E 3500 and  E 3800 are also
>> affected, for the affected Cisco models I can't answer today.
>> Affected single switch configurations was not seen.
>>
>> Yes, race conditions with delayed upgrades of the forwarding tables is a
>> well matching explanation for the problem.
>>
>>>> The proposed change sents up to 3 arp requests on a down bond using the
>>>> same slave, delayed by arp_interval.
>>>> Using problematic switches i have seen the the arp response on the right
>>>> slave at latest on the second arp request. So the bond is going into state
>>>> up.
>>>>
>>>> How does it works:
>>>> The bonds in up state are handled on the beginning of bond_ab_arp_probe
>>>> procedure, the other part of this procedure is handling the slave change.
>>>> The proposed change is bypassing the slave change for 2 additional calls
>>>> of bond_ab_arp_probe.
>>>> Now the retries are not only for an up bond available, they are also
>>>> implemented for a down bond.
>>> 	Does this delay failover or bringup on switches that are not
>>> "problematic"?  I.e., if arp_interval is, say, 1000 (1 second), will
>>> this impact failover / recovery times?
>>>
>>> 	-J
>> It depends.
>> failover times are not impacted, this is handled different.
>> Only the transition from a down bonding interface (bond and all slaves are
>> down) to the state up can be increased by up to 2 times arp_interval,
>> If the selected interface did not came up .If well working switches are
>> used, and everything other is also ok, there are no impacts.
> So I'm not a huge fan of workarounds like these, but I also understand
> from a practical standpoint that this is useful.  My only issue with the
> patch would be to please include a small comment (1-2 lines) in the code
> that describes the behavior.  I know we have the changelog entries for
> this, but I would feel better about having an exception like this in the
> code for those reading it and wondering:
>
> "Why would we wait 2 intervals before failing over to the next interface
> when there are no active interfaces?"
>


>>>> The num_grat_arp has no chance to solve the problem. The num_grat_arp is
>>>> only used, if a different slave is going active.
>>>> But in our case, the bonding slaves are not going into the state active
>>>> for a longer time.
>>>>>>> [jarod: manufacturing of changelog]
>>>>>>> CC: Jay Vosburgh <j.vosburgh@gmail.com>
>>>>>>> CC: Veaceslav Falico <vfalico@gmail.com>
>>>>>>> CC: Andy Gospodarek <gospo@cumulusnetworks.com>
>>>>>>> CC: netdev@vger.kernel.org
>>>>>>> Signed-off-by: Uwe Koziolek <uwe.koziolek@redknee.com>
>>>>>>> Signed-off-by: Jarod Wilson <jarod@redhat.com>
>>>>>>> ---
>>>>>>> drivers/net/bonding/bond_main.c | 5 +++++
>>>>>>> 1 file changed, 5 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/net/bonding/bond_main.c
>>>>>>> b/drivers/net/bonding/bond_main.c
>>>>>>> index 0c627b4..60b9483 100644
>>>>>>> --- a/drivers/net/bonding/bond_main.c
>>>>>>> +++ b/drivers/net/bonding/bond_main.c
>>>>>>> @@ -2794,6 +2794,11 @@ static bool bond_ab_arp_probe(struct bonding
>>>>>>> *bond)
>>>>>>>              return should_notify_rtnl;
>>>>>>>      }
>>>>>>>
>>>>>>> +    if (bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2))
>>>>>>> {
>>>>>>> +        bond_arp_send_all(bond, curr_arp_slave);
>>>>>>> +        return should_notify_rtnl;
>>>>>>> +    }
>>>>>>> +
>>>>>>>      bond_set_slave_inactive_flags(curr_arp_slave,
>>>>>>> BOND_SLAVE_NOTIFY_LATER);
>>>>>>>
>>>>>>>      bond_for_each_slave_rcu(bond, slave, iter) {
>>>>>>> -- 
>>>>>>> 1.8.3.1
>>> ---
>>> 	-Jay Vosburgh, jay.vosburgh@canonical.com

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Jay Vosburgh Sept. 3, 2015, 3:05 p.m. UTC | #1
Uwe Koziolek <uwe.koziolek@redknee.com> wrote:

>On Tue, Sep 01, 2015 at 05:41 PM +0200, Andy Gospodarek wrote:
>> On Mon, Aug 17, 2015 at 10:51:27PM +0200, Uwe Koziolek wrote:
>>> On Mon, Aug 17, 2015 at 09:14PM +0200, Jay Vosburgh wrote:
>>>> Uwe Koziolek <uwe.koziolek@redknee.com> wrote:
>>>>
>>>>> On2015-08-17 07:12 PM,Jarod Wilson wrote:
>>>>>> On 2015-08-17 12:55 PM, Veaceslav Falico wrote:
>>>>>>> On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote:
>>>>>>>> From: Uwe Koziolek <uwe.koziolek@redknee.com>
>>>>>>>>
>>>>>>>> With some very finicky switch hardware, active backup bonding can get
>>>>>>>> into
>>>>>>>> a situation where we play ping-pong between interfaces, trying to get
>>>>>>>> one
>>>>>>>> to come up as the active slave. There seems to be an issue with the
>>>>>>>> switch's arp replies either taking too long, or simply getting lost,
>>>>>>>> so we
>>>>>>>> wind up unable to get any interface up and active. Sometimes, the issue
>>>>>>>> sorts itself out after a while, sometimes it doesn't.
>>>>>>>>
>>>>>>>> Testing with num_grat_arp has proven fruitless, but sending an
>>>>>>>> additional
>>>>>>>> arp on curr_arp_slave if we're still in the arp_interval timeslice in
>>>>>>>> bond_ab_arp_probe(), has shown to produce 100% reliability in testing
>>>>>>>> with
>>>>>>>> this hardware combination.
>>>>>>> Sorry, I don't understand the logic of why it works, and what exactly
>>>>>>> are
>>>>>>> we fixiing here.
>>>>>>>
>>>>>>> It also breaks completely the logic for link state management in case
>>>>>>> of no
>>>>>>> current active slave for 2*arp_interval.
>>>>>>>
>>>>>>> Could you please elaborate what exactly is fixed here, and how it
>>>>>>> works? :)
>>>>>> I can either duplicate some information from the bug, or Uwe can, to
>>>>>> illustrate the exact nature of the problem.
>>>>>>
>>>>>>> p.s. num_grat_arp maybe could help?
>>>>>> That was my thought as well, but as I understand it, that route was
>>>>>> explored, and it didn't help any. I don't actually have a reproducer
>>>>>> setup of my own, unfortunately, so I'm kind of caught in the middle
>>>>>> here...
>>>>>>
>>>>>> Uwe, can you perhaps further enlighten us as to what num_grat_arp
>>>>>> settings were tried that didn't help? I'm still of the mind that if
>>>>>> num_grat_arp *didn't* help, we probably need to do something keyed off
>>>>>> num_grat_arp.
>>>>> The bonding slaves are connected to high available switches, each of the
>>>>> slaves is connected to a different switch. If the bond is starting, only
>>>>> the selected slave sends one arp-request. If a matching arp_response was
>>>>> received, this slave and the bond is going into state up, sending the
>>>>> gratitious arps...
>>>>> But if you got no arp reply the next slave was selected.
>>>>> With most of the newer switches, not overloaded, or with other software
>>>>> bugs, or with a single switch configuration, you would get a arp response
>>>>> on the first arp request.
>>>>> But in case of high availability configuration with non perfect switches
>>>>> like HP ProCurve 54xx, also with some Cisco models, you may not get a
>>>>> response on the first arp request.
>>>>>
>>>>> I have seen network snoops, there the switches are not responding to the
>>>>> first arp request on slave 1, the second arp request was sent on slave 2
>>>>> but the response was received on slave one,  and all following arp
>>>>> requests are anwsered on the wrong slave for a longer time.
>>>> 	Could you elaborate on the exact "high availability
>>>> configuration" here, including the model(s) of switch(es) involved?
>>>>
>>>> 	Is this some kind of race between the switch or switches
>>>> updating the forwarding tables and the bond flip flopping between the
>>>> slaves?  E.g., source MAC from ARP sent on slave 1 is used to populate
>>>> the forwarding table, but (for whatever reason) there is no reply.  ARP
>>>> on slave 2 is sent (using the same source MAC, unless you set
>>>> fail_over_mac), but forwarding tables still send that MAC to slave 1, so
>>>> reply is sent there.
>>> High availability:
>>> 2 managed switches with routing capabilities have an interconnect.
>>> One slave of a bonding interface is connected to the first switch, the
>>> second slave is connected to the other switch.
>>> The switch models are HP ProCurve 5406 and HP ProCurve 5412. As far as i
>>> remember also HP E 3500 and  E 3800 are also
>>> affected, for the affected Cisco models I can't answer today.
>>> Affected single switch configurations was not seen.
>>>
>>> Yes, race conditions with delayed upgrades of the forwarding tables is a
>>> well matching explanation for the problem.
>>>
>>>>> The proposed change sents up to 3 arp requests on a down bond using the
>>>>> same slave, delayed by arp_interval.
>>>>> Using problematic switches i have seen the the arp response on the right
>>>>> slave at latest on the second arp request. So the bond is going into state
>>>>> up.
>>>>>
>>>>> How does it works:
>>>>> The bonds in up state are handled on the beginning of bond_ab_arp_probe
>>>>> procedure, the other part of this procedure is handling the slave change.
>>>>> The proposed change is bypassing the slave change for 2 additional calls
>>>>> of bond_ab_arp_probe.
>>>>> Now the retries are not only for an up bond available, they are also
>>>>> implemented for a down bond.
>>>> 	Does this delay failover or bringup on switches that are not
>>>> "problematic"?  I.e., if arp_interval is, say, 1000 (1 second), will
>>>> this impact failover / recovery times?
>>>>
>>>> 	-J
>>> It depends.
>>> failover times are not impacted, this is handled different.
>>> Only the transition from a down bonding interface (bond and all slaves are
>>> down) to the state up can be increased by up to 2 times arp_interval,
>>> If the selected interface did not came up .If well working switches are
>>> used, and everything other is also ok, there are no impacts.
>> So I'm not a huge fan of workarounds like these, but I also understand
>> from a practical standpoint that this is useful.  My only issue with the
>> patch would be to please include a small comment (1-2 lines) in the code
>> that describes the behavior.  I know we have the changelog entries for
>> this, but I would feel better about having an exception like this in the
>> code for those reading it and wondering:
>>
>> "Why would we wait 2 intervals before failing over to the next interface
>> when there are no active interfaces?"
>>
>
>diff -up a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>--- a/drivers/net/bonding/bond_main.c   2015-08-30 20:34:09.000000000 +0200
>+++ b/drivers/net/bonding/bond_main.c   2015-09-02 00:39:10.000298202 +0200
>@@ -2795,6 +2795,16 @@ static bool bond_ab_arp_probe(struct bon
>                        return should_notify_rtnl;
>        }
>
>+       /* sometimes the forwarding tables of the switches are not updated fast enough
>+        * the first arp response after a slave change is received on the wrong slave.
>+        * the arp requests will be retried 2 times on the same slave
>+        */
>+
>+       if (bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2)) {
>+               bond_arp_send_all(bond, curr_arp_slave);
>+               return should_notify_rtnl;
>+       }
>+

	I probably should have asked this in the beginning, but at what
range of arp_interval values does the problem manifest?  If it's a race
condition with the switch update, I'd expect that only very small
arp_interval values would be affected.

	Also, your proposed comment wraps past 80 columns.

	-J


>        bond_set_slave_inactive_flags(curr_arp_slave, BOND_SLAVE_NOTIFY_LATER);
>
>        bond_for_each_slave_rcu(bond, slave, iter) {
>
>>>>> The num_grat_arp has no chance to solve the problem. The num_grat_arp is
>>>>> only used, if a different slave is going active.
>>>>> But in our case, the bonding slaves are not going into the state active
>>>>> for a longer time.
>>>>>>>> [jarod: manufacturing of changelog]
>>>>>>>> CC: Jay Vosburgh <j.vosburgh@gmail.com>
>>>>>>>> CC: Veaceslav Falico <vfalico@gmail.com>
>>>>>>>> CC: Andy Gospodarek <gospo@cumulusnetworks.com>
>>>>>>>> CC: netdev@vger.kernel.org
>>>>>>>> Signed-off-by: Uwe Koziolek <uwe.koziolek@redknee.com>
>>>>>>>> Signed-off-by: Jarod Wilson <jarod@redhat.com>
>>>>>>>> ---
>>>>>>>> drivers/net/bonding/bond_main.c | 5 +++++
>>>>>>>> 1 file changed, 5 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/net/bonding/bond_main.c
>>>>>>>> b/drivers/net/bonding/bond_main.c
>>>>>>>> index 0c627b4..60b9483 100644
>>>>>>>> --- a/drivers/net/bonding/bond_main.c
>>>>>>>> +++ b/drivers/net/bonding/bond_main.c
>>>>>>>> @@ -2794,6 +2794,11 @@ static bool bond_ab_arp_probe(struct bonding
>>>>>>>> *bond)
>>>>>>>>              return should_notify_rtnl;
>>>>>>>>      }
>>>>>>>>
>>>>>>>> +    if (bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2))
>>>>>>>> {
>>>>>>>> +        bond_arp_send_all(bond, curr_arp_slave);
>>>>>>>> +        return should_notify_rtnl;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>>      bond_set_slave_inactive_flags(curr_arp_slave,
>>>>>>>> BOND_SLAVE_NOTIFY_LATER);
>>>>>>>>
>>>>>>>>      bond_for_each_slave_rcu(bond, slave, iter) {
>>>>>>>> -- 
>>>>>>>> 1.8.3.1

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff -up a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
--- a/drivers/net/bonding/bond_main.c   2015-08-30 20:34:09.000000000 +0200
+++ b/drivers/net/bonding/bond_main.c   2015-09-02 00:39:10.000298202 +0200
@@ -2795,6 +2795,16 @@  static bool bond_ab_arp_probe(struct bon
                         return should_notify_rtnl;
         }

+       /* sometimes the forwarding tables of the switches are not updated fast enough
+        * the first arp response after a slave change is received on the wrong slave.
+        * the arp requests will be retried 2 times on the same slave
+        */
+
+       if (bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2)) {
+               bond_arp_send_all(bond, curr_arp_slave);
+               return should_notify_rtnl;
+       }
+
         bond_set_slave_inactive_flags(curr_arp_slave, BOND_SLAVE_NOTIFY_LATER);

         bond_for_each_slave_rcu(bond, slave, iter) {