
Bond recovery from BOND_LINK_FAIL state not working

Message ID f9a6457e-f575-e9c1-2519-7694e30c474f@hpe.com
State Changes Requested, archived
Delegated to: David Miller
Series Bond recovery from BOND_LINK_FAIL state not working

Commit Message

Alex Sidorenko Nov. 1, 2017, 6:09 p.m. UTC
The problem was found while trying to deploy RHEL7 on the HPE Synergy platform; it is seen 
both in the customer's environment and in the HPE test lab.

There are several bonds configured in TLB mode with miimon=100; all other options are at 
their defaults. Slaves are connected to VirtualConnect modules. Rebooting a VC module should 
bring one bond slave (ens3f0) down temporarily, but not the other one (ens3f1). But what we 
see is

Oct 24 10:37:12 SYDC1LNX kernel: bond0: link status up again after 0 ms for interface ens3f1

and it never recovers. When VC reboot is complete, everything goes back to normal again.

Red Hat has backported all recent upstream commits and instrumented the bonding driver. We 
have found the following (when the VC goes down):

In bond_miimon_inspect() the first slave goes to
	bond_propose_link_state(slave, BOND_LINK_FAIL);
		and
	slave->new_link = BOND_LINK_DOWN;

The second slave is still
	slave->link = BOND_LINK_UP;
		and
	slave->new_link = BOND_LINK_NOCHANGE;

This is as expected. But in bond_miimon_commit() we see that _both_ slaves are in 
BOND_LINK_FAIL.  That is, something changes the state of the second slave from another 
thread. We suspect NetworkManager, as the problem occurs _only_ when the bonds are 
controlled by it; if we set NM_CONTROLLED=no, everything starts working normally.

While we still do not understand how NM affects the bond state, I think that the bonding 
driver needs to be made robust enough to recover even from this state.

At the moment, when we enter bond_miimon_inspect() with
slave->link = BOND_LINK_FAIL and reach the following code

                        /*FALLTHRU*/
                case BOND_LINK_BACK:
                        if (!link_state) {
                                bond_propose_link_state(slave, BOND_LINK_DOWN);
                                netdev_info(bond->dev, "link status down again after %d ms for interface %s\n",
                                            (bond->params.updelay - slave->delay) *
                                            bond->params.miimon,
                                            slave->dev->name);

                                commit++;
                                continue;
                        }


we propose a new state and do 'commit++', but we do not change slave->new_link from 
BOND_LINK_NOCHANGE. As a result, bond_miimon_commit() will not process this slave.
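
Schematically (not verbatim driver code, just the shape of the path taken for
this slave), what happens is:

	bond_propose_link_state(slave, BOND_LINK_DOWN); /* proposal recorded */
	commit++;                                       /* forces a commit pass */
	continue;                                       /* but slave->new_link is still
	                                                 * BOND_LINK_NOCHANGE, so
	                                                 * bond_miimon_commit() will
	                                                 * skip this slave */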

The following patch fixes the issue:

****
If we enter bond_miimon_inspect() with slave->link = BOND_LINK_FAIL
and recover, we do bond_propose_link_state(slave, BOND_LINK_UP);
but do not change slave->new_link, so it is left in
BOND_LINK_NOCHANGE. As a result, bond_miimon_commit() will not
process that slave and it never recovers. We need to set
slave->new_link = BOND_LINK_UP to make bond_miimon_commit() work.
---
  drivers/net/bonding/bond_main.c | 1 +
  1 file changed, 1 insertion(+)

Comments

Jay Vosburgh Nov. 1, 2017, 9:34 p.m. UTC | #1
Alex Sidorenko <alexandre.sidorenko@hpe.com> wrote:

>The problem has been found while trying to deploy RHEL7 on HPE Synergy
>platform, it is seen both in customer's environment and in HPE test lab.
>
>There are several bonds configured in TLB mode and miimon=100, all other
>options are default. Slaves are connected to VirtualConnect
>modules. Rebooting a VC module should bring one bond slave (ens3f0) down
>temporarily, but not another one (ens3f1). But what we see is
>
>Oct 24 10:37:12 SYDC1LNX kernel: bond0: link status up again after 0 ms for interface ens3f1
>
>and it never recovers. When VC reboot is complete, everything goes back to normal again.
>
>Redhat has backported all recent upstream commits and instrumented the
>bonding driver. We have found the following (when VC goes down)
>
>In bond_miimon_inspect() the first slave goes to
>	bond_propose_link_state(slave, BOND_LINK_FAIL);
>		and
>	slave->new_link = BOND_LINK_DOWN;
>
>The second slave is still
>	slave->link = BOND_LINK_UP;
>		and
>        slave->new_link = BOND_LINK_NOCHANGE;
>
>This is as expected. But in bond_miimon_commit() we see that _both_ slaves
>are in BOND_LINK_FAIL.  That is, something changes the state of the second
>slave from another thread. We suspect the NetworkManager, as the problem
>is there _only_ when bonds are controlled by it, if we set
>NM_CONTROLLED=no everything starts working normally.
>
>While we still do not understand how NM affects bond state, I think that
>bonding driver needs to be made reliable enough to recover even from this
>state.

	You're suggesting that the bonding driver be able to distinguish
"false" down states set by network manager from a genuine link failure?
Am I misunderstanding?

>At this moment when we enter bond_miimon_inspect() with
>slave->link = BOND_LINK_FAIL and are in the following code
>
>                        /*FALLTHRU*/
>                case BOND_LINK_BACK:
>                        if (!link_state) {
>                                bond_propose_link_state(slave, BOND_LINK_DOWN);
>                                netdev_info(bond->dev, "link status down
>again after %d ms for interface %s\n",
>                                            (bond->params.updelay - slave->delay) *
>                                            bond->params.miimon,
>                                            slave->dev->name);
>
>                                commit++;
>                                continue;
>                        }
>

	Looking at bonding in the current net-next, if a slave enters
bond_miimon_inspect with slave->link == BOND_LINK_FAIL, it will not
proceed to the BOND_LINK_BACK block of the switch; BOND_LINK_FAIL does
not fall through that far.

	Did you actually mean the BOND_LINK_FAIL block of the switch?
I'll assume so for the rest of my reply.

>we propose a new state and do 'commit++', but we do not change
>slave->new_link from BOND_LINK_NOCHANGE. As a result, bond_miimon_commit()
>will not process this slave.
>
>The following patch fixes the issue:
>
>****
>If we enter bond_miimon_inspect() with slave_link=BOND_LINK_FAIL
>and recover, we do bond_propose_link_state(slave, BOND_LINK_UP);
>but do not change slave->new_link, so it is left in
>BOND_LINK_NOCHANGE. As a result, bond_miimon_commit() will not
>process that slave and it never recovers. We need to set
>slave->new_link = BOND_LINK_UP to make bond_miimon_commit() work

	What is your downdelay set to?  In principle,
bond_miimon_inspect should not enter with a slave in BOND_LINK_FAIL
state if downdelay is 0.

> drivers/net/bonding/bond_main.c | 1 +
> 1 file changed, 1 insertion(+)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index c99dc59..07aa7ba 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -2072,6 +2072,7 @@ static int bond_miimon_inspect(struct bonding *bond)
>                                            (bond->params.downdelay - slave->delay) *
>                                            bond->params.miimon,
>                                            slave->dev->name);
>+                               slave->new_link = BOND_LINK_UP;
>                                commit++;
>                                continue;
>                        }
>-- 
>2.7.4

	Your patch does not apply to net-next, so I'm not absolutely
sure where this is, but presuming that this is in the BOND_LINK_FAIL
case of the switch, it looks like both BOND_LINK_FAIL and BOND_LINK_BACK
will have the issue that if the link recovers or fails, respectively,
within the delay window (for down/updelay > 0) it won't set a
slave->new_link.

	Looks like this got lost somewhere along the line, as originally
the transition back to UP (or DOWN) happened immediately, and that has
been lost somewhere.

	I'll have to dig out when that broke, but I'll see about a test
patch this afternoon.

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
Jay Vosburgh Nov. 2, 2017, 12:35 a.m. UTC | #2
Jay Vosburgh <jay.vosburgh@canonical.com> wrote:

>Alex Sidorenko <alexandre.sidorenko@hpe.com> wrote:
>
>>The problem has been found while trying to deploy RHEL7 on HPE Synergy
>>platform, it is seen both in customer's environment and in HPE test lab.
>>
>>There are several bonds configured in TLB mode and miimon=100, all other
>>options are default. Slaves are connected to VirtualConnect
>>modules. Rebooting a VC module should bring one bond slave (ens3f0) down
>>temporarily, but not another one (ens3f1). But what we see is
>>
>>Oct 24 10:37:12 SYDC1LNX kernel: bond0: link status up again after 0 ms for interface ens3f1

	In net-next, I don't see a path in the code that will lead to
this message, as it would apparently require entering
bond_miimon_inspect in state BOND_LINK_FAIL but with downdelay set to 0.
If downdelay is 0, the code will transition to BOND_LINK_DOWN and not
remain in _FAIL state.

>>and it never recovers. When VC reboot is complete, everything goes back to normal again.
>>
>>Redhat has backported all recent upstream commits and instrumented the
>>bonding driver. We have found the following (when VC goes down)
>>
>>In bond_miimon_inspect() the first slave goes to
>>	bond_propose_link_state(slave, BOND_LINK_FAIL);
>>		and
>>	slave->new_link = BOND_LINK_DOWN;
>>
>>The second slave is still
>>	slave->link = BOND_LINK_UP;
>>		and
>>        slave->new_link = BOND_LINK_NOCHANGE;
>>
>>This is as expected. But in bond_miimon_commit() we see that _both_ slaves
>>are in BOND_LINK_FAIL.  That is, something changes the state of the second
>>slave from another thread. We suspect the NetworkManager, as the problem
>>is there _only_ when bonds are controlled by it, if we set
>>NM_CONTROLLED=no everything starts working normally.
>>
>>While we still do not understand how NM affects bond state, I think that
>>bonding driver needs to be made reliable enough to recover even from this
>>state.
>
>	You're suggesting that the bonding driver be able to distinguish
>"false" down states set by network manager from a genuine link failure?
>Am I misunderstanding?
>
>>At this moment when we enter bond_miimon_inspect() with
>>slave->link = BOND_LINK_FAIL and are in the following code
>>
>>                        /*FALLTHRU*/
>>                case BOND_LINK_BACK:
>>                        if (!link_state) {
>>                                bond_propose_link_state(slave, BOND_LINK_DOWN);
>>                                netdev_info(bond->dev, "link status down
>>again after %d ms for interface %s\n",
>>                                            (bond->params.updelay - slave->delay) *
>>                                            bond->params.miimon,
>>                                            slave->dev->name);
>>
>>                                commit++;
>>                                continue;
>>                        }
>>
>
>	Looking at bonding in the current net-next, if a slave enters
>bond_miimon_inspect with slave->link == BOND_LINK_FAIL, it will not
>proceed to the BOND_LINK_BACK block of the switch; BOND_LINK_FAIL does
>not fall through that far.
>
>	Did you actually mean the BOND_LINK_FAIL block of the switch?
>I'll assume so for the rest of my reply.
>
>>we propose a new state and do 'commit++', but we do not change
>>slave->new_link from BOND_LINK_NOCHANGE. As a result, bond_miimon_commit()
>>will not process this slave.
>>
>>The following patch fixes the issue:
>>
>>****
>>If we enter bond_miimon_inspect() with slave_link=BOND_LINK_FAIL
>>and recover, we do bond_propose_link_state(slave, BOND_LINK_UP);
>>but do not change slave->new_link, so it is left in
>>BOND_LINK_NOCHANGE. As a result, bond_miimon_commit() will not
>>process that slave and it never recovers. We need to set
>>slave->new_link = BOND_LINK_UP to make bond_miimon_commit() work
>
>	What is your downdelay set to?  In principle,
>bond_miimon_inspect should not enter with a slave in BOND_LINK_FAIL
>state if downdelay is 0.
>
>> drivers/net/bonding/bond_main.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>>index c99dc59..07aa7ba 100644
>>--- a/drivers/net/bonding/bond_main.c
>>+++ b/drivers/net/bonding/bond_main.c
>>@@ -2072,6 +2072,7 @@ static int bond_miimon_inspect(struct bonding *bond)
>>                                            (bond->params.downdelay - slave->delay) *
>>                                            bond->params.miimon,
>>                                            slave->dev->name);
>>+                               slave->new_link = BOND_LINK_UP;
>>                                commit++;
>>                                continue;
>>                        }
>>-- 
>>2.7.4
>
>	Your patch does not apply to net-next, so I'm not absolutely
>sure where this is, but presuming that this is in the BOND_LINK_FAIL
>case of the switch, it looks like both BOND_LINK_FAIL and BOND_LINK_BACK
>will have the issue that if the link recovers or fails, respectively,
>within the delay window (for down/updelay > 0) it won't set a
>slave->new_link.
>
>	Looks like this got lost somewhere along the line, as originally
>the transition back to UP (or DOWN) happened immediately, and that has
>been lost somewhere.
>
>	I'll have to dig out when that broke, but I'll see about a test
>patch this afternoon.

	The case I was concerned with was moved around; the proposed
state is committed in bond_mii_monitor.  But to commit to _FAIL state,
the downdelay would have to be > 0.  I'm not seeing any errors in
net-next; can you reproduce your erroneous behavior on net-next?

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
Jarod Wilson Nov. 2, 2017, 2:37 a.m. UTC | #3
On 2017-11-01 8:35 PM, Jay Vosburgh wrote:
> Jay Vosburgh <jay.vosburgh@canonical.com> wrote:
> 
>> Alex Sidorenko <alexandre.sidorenko@hpe.com> wrote:
>>
>>> The problem has been found while trying to deploy RHEL7 on HPE Synergy
>>> platform, it is seen both in customer's environment and in HPE test lab.
>>>
>>> There are several bonds configured in TLB mode and miimon=100, all other
>>> options are default. Slaves are connected to VirtualConnect
>>> modules. Rebooting a VC module should bring one bond slave (ens3f0) down
>>> temporarily, but not another one (ens3f1). But what we see is
>>>
>>> Oct 24 10:37:12 SYDC1LNX kernel: bond0: link status up again after 0 ms for interface ens3f1
> 
> 	In net-next, I don't see a path in the code that will lead to
> this message, as it would apparently require entering
> bond_miimon_inspect in state BOND_LINK_FAIL but with downdelay set to 0.
> If downdelay is 0, the code will transition to BOND_LINK_DOWN and not
> remain in _FAIL state.

The kernel in question is laden with a fair bit of additional debug 
spew, as we were going back and forth, trying to isolate where things 
were going wrong.  That was indeed from the BOND_LINK_FAIL state in 
bond_miimon_inspect, inside the if (link_state) clause though, so after 
commit++, there's a continue, which ... does what now? Doesn't it take 
us back to the top of the bond_for_each_slave_rcu() loop, so we bypass 
the next few lines of code that would have led to a transition to 
BOND_LINK_DOWN?

...
>> 	Your patch does not apply to net-next, so I'm not absolutely
>> sure where this is, but presuming that this is in the BOND_LINK_FAIL
>> case of the switch, it looks like both BOND_LINK_FAIL and BOND_LINK_BACK
>> will have the issue that if the link recovers or fails, respectively,
>> within the delay window (for down/updelay > 0) it won't set a
>> slave->new_link.
>>
>> 	Looks like this got lost somewhere along the line, as originally
>> the transition back to UP (or DOWN) happened immediately, and that has
>> been lost somewhere.
>>
>> 	I'll have to dig out when that broke, but I'll see about a test
>> patch this afternoon.
> 
> 	The case I was concerned with was moved around; the proposed
> state is committed in bond_mii_monitor.  But to commit to _FAIL state,
> the downdelay would have to be > 0.  I'm not seeing any errors in
> net-next; can you reproduce your erroneous behavior on net-next?

I can try to get a net-next-ish kernel into their hands, but the bonding 
driver we're working with here is quite close to current net-next 
already, so I'm fairly confident the same thing will happen.
Jay Vosburgh Nov. 2, 2017, 4:51 a.m. UTC | #4
Jarod Wilson <jarod@redhat.com> wrote:

>On 2017-11-01 8:35 PM, Jay Vosburgh wrote:
>> Jay Vosburgh <jay.vosburgh@canonical.com> wrote:
>>
>>> Alex Sidorenko <alexandre.sidorenko@hpe.com> wrote:
>>>
>>>> The problem has been found while trying to deploy RHEL7 on HPE Synergy
>>>> platform, it is seen both in customer's environment and in HPE test lab.
>>>>
>>>> There are several bonds configured in TLB mode and miimon=100, all other
>>>> options are default. Slaves are connected to VirtualConnect
>>>> modules. Rebooting a VC module should bring one bond slave (ens3f0) down
>>>> temporarily, but not another one (ens3f1). But what we see is
>>>>
>>>> Oct 24 10:37:12 SYDC1LNX kernel: bond0: link status up again after 0 ms for interface ens3f1
>>
>> 	In net-next, I don't see a path in the code that will lead to
>> this message, as it would apparently require entering
>> bond_miimon_inspect in state BOND_LINK_FAIL but with downdelay set to 0.
>> If downdelay is 0, the code will transition to BOND_LINK_DOWN and not
>> remain in _FAIL state.
>
>The kernel in question is laden with a fair bit of additional debug spew,
>as we were going back and forth, trying to isolate where things were going
>wrong.  That was indeed from the BOND_LINK_FAIL state in
>bond_miimon_inspect, inside the if (link_state) clause though, so after
>commit++, there's a continue, which ... does what now? Doesn't it take us
>back to the top of the bond_for_each_slave_rcu() loop, so we bypass the
>next few lines of code that would have led to a transition to
>BOND_LINK_DOWN?

	Just to confirm: your downdelay is 0, correct?

	And, do you get any other log messages other than "link status
up again after 0 ms"?

	To answer your question, yes, the "if (link_state) {" block in
the BOND_LINK_FAIL case of bond_miimon_inspect ends in continue, but
this path is nominally for the downdelay logic.  If downdelay is active
and the link recovers before the delay expires, the link should never
have moved to BOND_LINK_DOWN.  The commit++ causes bond_miimon_inspect
to return nonzero, causing in turn the bond_propose_link_state change to
BOND_LINK_FAIL state to be committed.  This path deliberately does not
set slave->new_link, as downdelay is purposely delaying the transition
to BOND_LINK_DOWN.

	If downdelay is 0, the slave->link should not persist in
BOND_LINK_FAIL state; it should set new_link = BOND_LINK_DOWN which will
cause a transition in bond_miimon_commit.  The bond_propose_link_state
call to set BOND_LINK_FAIL in the BOND_LINK_UP case will be committed in
bond_mii_monitor prior to calling bond_miimon_commit, which will in turn
do the transition to _DOWN state.  In this case, the BOND_LINK_FAIL case
"if (link_state) {" block should never be entered.

	I'm a little leery of adding the state transition you suggest
without understanding how this situation arose, as it shouldn't get into
this condition in the first place.

	-J

>...
>>> 	Your patch does not apply to net-next, so I'm not absolutely
>>> sure where this is, but presuming that this is in the BOND_LINK_FAIL
>>> case of the switch, it looks like both BOND_LINK_FAIL and BOND_LINK_BACK
>>> will have the issue that if the link recovers or fails, respectively,
>>> within the delay window (for down/updelay > 0) it won't set a
>>> slave->new_link.
>>>
>>> 	Looks like this got lost somewhere along the line, as originally
>>> the transition back to UP (or DOWN) happened immediately, and that has
>>> been lost somewhere.
>>>
>>> 	I'll have to dig out when that broke, but I'll see about a test
>>> patch this afternoon.
>>
>> 	The case I was concerned with was moved around; the proposed
>> state is committed in bond_mii_monitor.  But to commit to _FAIL state,
>> the downdelay would have to be > 0.  I'm not seeing any errors in
>> net-next; can you reproduce your erroneous behavior on net-next?
>
>I can try to get a net-next-ish kernel into their hands, but the bonding
>driver we're working with here is quite close to current net-next already,
>so I'm fairly confident the same thing will happen.
>
>-- 
>Jarod Wilson
>jarod@redhat.com

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
Alex Sidorenko Nov. 2, 2017, 12:47 p.m. UTC | #5
On 11/02/2017 12:51 AM, Jay Vosburgh wrote:
> Jarod Wilson <jarod@redhat.com> wrote:
> 
>> On 2017-11-01 8:35 PM, Jay Vosburgh wrote:
>>> Jay Vosburgh <jay.vosburgh@canonical.com> wrote:
>>>
>>>> Alex Sidorenko <alexandre.sidorenko@hpe.com> wrote:
>>>>
>>>>> The problem has been found while trying to deploy RHEL7 on HPE Synergy
>>>>> platform, it is seen both in customer's environment and in HPE test lab.
>>>>>
>>>>> There are several bonds configured in TLB mode and miimon=100, all other
>>>>> options are default. Slaves are connected to VirtualConnect
>>>>> modules. Rebooting a VC module should bring one bond slave (ens3f0) down
>>>>> temporarily, but not another one (ens3f1). But what we see is
>>>>>
>>>>> Oct 24 10:37:12 SYDC1LNX kernel: bond0: link status up again after 0 ms for interface ens3f1
>>>
>>> 	In net-next, I don't see a path in the code that will lead to
>>> this message, as it would apparently require entering
>>> bond_miimon_inspect in state BOND_LINK_FAIL but with downdelay set to 0.
>>> If downdelay is 0, the code will transition to BOND_LINK_DOWN and not
>>> remain in _FAIL state.
>>
>> The kernel in question is laden with a fair bit of additional debug spew,
>> as we were going back and forth, trying to isolate where things were going
>> wrong.  That was indeed from the BOND_LINK_FAIL state in
>> bond_miimon_inspect, inside the if (link_state) clause though, so after
>> commit++, there's a continue, which ... does what now? Doesn't it take us
>> back to the top of the bond_for_each_slave_rcu() loop, so we bypass the
>> next few lines of code that would have led to a transition to
>> BOND_LINK_DOWN?
> 
> 	Just to confirm: your downdelay is 0, correct?

Correct.

> 
> 	And, do you get any other log messages other than "link status
> up again after 0 ms"?

Yes, here are some messages (from an early instrumentation):

Oct 24 10:56:27 SYDC1LNX kernel: bond0: bond_miimon_inspect returned non-zero
Oct 24 10:56:27 SYDC1LNX kernel: bond0: committing link state for interface ens3f0, 20000 Mbps full 
duplex
Oct 24 10:56:27 SYDC1LNX kernel: bond0: slave->should_notify_link for interface ens3f0 now: 1
Oct 24 10:56:27 SYDC1LNX kernel: bond0: committing link state for interface ens3f1, 20000 Mbps full 
duplex
Oct 24 10:56:27 SYDC1LNX kernel: bond0: slave->should_notify_link for interface ens3f1 now: 1
Oct 24 10:56:27 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave ens3f0
Oct 24 10:56:27 SYDC1LNX kernel: bond0: bond_miimon_commit: BOND_LINK_DOWN
Oct 24 10:56:27 SYDC1LNX kernel: bond0: link status definitely down for interface ens3f0, disabling it
Oct 24 10:56:27 SYDC1LNX kernel: bond0: bond_select_active_slave: best_slave != curr_active_slave
Oct 24 10:56:27 SYDC1LNX kernel: bond0: bond_select_active_slave: best_slave is NULL, this is 
probably bad
Oct 24 10:56:27 SYDC1LNX kernel: bond0: bond_change_active_slave: old_active: ens3f0
Oct 24 10:56:27 SYDC1LNX kernel: bond0: bond_change_active_slave: new_active: NULL
Oct 24 10:56:27 SYDC1LNX kernel: bond0: Removing MAC from old_active
Oct 24 10:56:27 SYDC1LNX kernel: bond0: ALB and TLB modes should call bond_alb_handle_active_change
Oct 24 10:56:27 SYDC1LNX kernel: bond0: bond_set_carrier: turning carrier off
Oct 24 10:56:27 SYDC1LNX kernel: bond0: now running without any active interface!
Oct 24 10:56:27 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave ens3f1
Oct 24 10:56:27 SYDC1LNX kernel: bond0: bond_miimon_commit: BOND_LINK_NOCHANGE on slave ens3f1
Oct 24 10:56:27 SYDC1LNX kernel: bond0: link status up again after 0 ms for interface ens3f1
Oct 24 10:56:27 SYDC1LNX kernel: bond0: bond_miimon_inspect returned non-zero
Oct 24 10:56:27 SYDC1LNX kernel: bond0: committing link state for interface ens3f0, 20000 Mbps full 
duplex
Oct 24 10:56:27 SYDC1LNX kernel: bond0: slave->should_notify_link for interface ens3f0 now: 0
Oct 24 10:56:27 SYDC1LNX kernel: bond0: committing link state for interface ens3f1, 20000 Mbps full 
duplex
Oct 24 10:56:27 SYDC1LNX kernel: bond0: slave->should_notify_link for interface ens3f1 now: 0
Oct 24 10:56:27 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave ens3f0
Oct 24 10:56:27 SYDC1LNX kernel: bond0: bond_miimon_commit: BOND_LINK_NOCHANGE on slave ens3f0
Oct 24 10:56:27 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave ens3f1
Oct 24 10:56:27 SYDC1LNX kernel: bond0: bond_miimon_commit: BOND_LINK_NOCHANGE on slave ens3f1


That is, we never see ens3f1 going to BOND_LINK_DOWN and it continues staying in 
BOND_LINK_NOCHANGE/BOND_LINK_FAIL


> 
> 	To answer your question, yes, the "if (link_state) {" block in
> the BOND_LINK_FAIL case of bond_miimon_inspect ends in continue, but
> this path is nominally for the downdelay logic.  If downdelay is active
> and the link recovers before the delay expires, the link should never
> have moved to BOND_LINK_DOWN.  The commit++ causes bond_miimon_inspect
> to return nonzero, causing in turn the bond_propose_link_state change to
> BOND_LINK_FAIL state to be committed.  This path deliberately does not
> set slave->new_link, as downdelay is purposely delaying the transition
> to BOND_LINK_DOWN.
> 
> 	If downdelay is 0, the slave->link should not persist in
> BOND_LINK_FAIL state; it should set new_link = BOND_LINK_DOWN which will
> cause a transition in bond_miimon_commit.  The bond_propose_link_state
> call to set BOND_LINK_FAIL in the BOND_LINK_UP case will be committed in
> bond_mii_monitor prior to calling bond_miimon_commit, which will in turn
> do the transition to _DOWN state.  In this case, the BOND_LINK_FAIL case
> "if (link_state) {" block should never be entered.

I totally agree with your description of transition logic, and this is why
we were puzzled by how this can occur until we noticed NetworkManager
messages around this time and decided to run a test without it.
Without NM, everything works as expected. After that, adding more
instrumentation, we have found that we do not propose BOND_LINK_FAIL inside
bond_miimon_inspect() but elsewhere (NetworkManager?).

> 
> 	I'm a little leery of adding the state transition you suggest
> without understanding how this situation arose, as it shouldn't get into
> this condition in the first place.

I think that adding such a transition could be a reasonable safety measure
and should help in diagnosing problems. It could be done only when
bond->params.downdelay == 0, adding a pr_warn_once() with text suggesting that
this is unexpected.
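
Something along these lines, as an untested sketch of the idea (the exact
placement inside the BOND_LINK_FAIL "if (link_state)" block and the message
text are just illustrative):

	if (link_state) {
		/* Safety net: with downdelay == 0 we should never still be
		 * in BOND_LINK_FAIL here, so warn once and let
		 * bond_miimon_commit() recover this slave instead of
		 * leaving it in BOND_LINK_NOCHANGE forever.
		 */
		if (bond->params.downdelay == 0) {
			pr_warn_once("%s: unexpected BOND_LINK_FAIL with downdelay == 0 on %s\n",
				     bond->dev->name, slave->dev->name);
			slave->new_link = BOND_LINK_UP;
		}
		bond_propose_link_state(slave, BOND_LINK_UP);
		commit++;
		continue;
	}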

Alex

> 
> 	-J
> 
>> ...
>>>> 	Your patch does not apply to net-next, so I'm not absolutely
>>>> sure where this is, but presuming that this is in the BOND_LINK_FAIL
>>>> case of the switch, it looks like both BOND_LINK_FAIL and BOND_LINK_BACK
>>>> will have the issue that if the link recovers or fails, respectively,
>>>> within the delay window (for down/updelay > 0) it won't set a
>>>> slave->new_link.
>>>>
>>>> 	Looks like this got lost somewhere along the line, as originally
>>>> the transition back to UP (or DOWN) happened immediately, and that has
>>>> been lost somewhere.
>>>>
>>>> 	I'll have to dig out when that broke, but I'll see about a test
>>>> patch this afternoon.
>>>
>>> 	The case I was concerned with was moved around; the proposed
>>> state is committed in bond_mii_monitor.  But to commit to _FAIL state,
>>> the downdelay would have to be > 0.  I'm not seeing any errors in
>>> net-next; can you reproduce your erroneous behavior on net-next?
>>
>> I can try to get a net-next-ish kernel into their hands, but the bonding
>> driver we're working with here is quite close to current net-next already,
>> so I'm fairly confident the same thing will happen.
>>
>> -- 
>> Jarod Wilson
>> jarod@redhat.com
> 
> ---
> 	-Jay Vosburgh, jay.vosburgh@canonical.com
>
Jay Vosburgh Nov. 3, 2017, 1:11 a.m. UTC | #6
Alex Sidorenko <alexandre.sidorenko@hpe.com> wrote:
>On 11/02/2017 12:51 AM, Jay Vosburgh wrote:
>> Jarod Wilson <jarod@redhat.com> wrote:
>>
>>> On 2017-11-01 8:35 PM, Jay Vosburgh wrote:
>>>> Jay Vosburgh <jay.vosburgh@canonical.com> wrote:
>>>>
>>>>> Alex Sidorenko <alexandre.sidorenko@hpe.com> wrote:
>>>>>
>>>>>> The problem has been found while trying to deploy RHEL7 on HPE Synergy
>>>>>> platform, it is seen both in customer's environment and in HPE test lab.
>>>>>>
>>>>>> There are several bonds configured in TLB mode and miimon=100, all other
>>>>>> options are default. Slaves are connected to VirtualConnect
>>>>>> modules. Rebooting a VC module should bring one bond slave (ens3f0) down
>>>>>> temporarily, but not another one (ens3f1). But what we see is
>>>>>>
>>>>>> Oct 24 10:37:12 SYDC1LNX kernel: bond0: link status up again after 0 ms for interface ens3f1
>>>>
>>>> 	In net-next, I don't see a path in the code that will lead to
>>>> this message, as it would apparently require entering
>>>> bond_miimon_inspect in state BOND_LINK_FAIL but with downdelay set to 0.
>>>> If downdelay is 0, the code will transition to BOND_LINK_DOWN and not
>>>> remain in _FAIL state.
>>>
>>> The kernel in question is laden with a fair bit of additional debug spew,
>>> as we were going back and forth, trying to isolate where things were going
>>> wrong.  That was indeed from the BOND_LINK_FAIL state in
>>> bond_miimon_inspect, inside the if (link_state) clause though, so after
>>> commit++, there's a continue, which ... does what now? Doesn't it take us
>>> back to the top of the bond_for_each_slave_rcu() loop, so we bypass the
>>> next few lines of code that would have led to a transition to
>>> BOND_LINK_DOWN?
>>
>> 	Just to confirm: your downdelay is 0, correct?
>
>Correct.
>
>>
>> 	And, do you get any other log messages other than "link status
>> up again after 0 ms"?
>
>Yes, here are some messages (from an early instrumentation):
[...]
>That is, we never see ens3f1 going to BOND_LINK_DOWN and it continues
>staying in BOND_LINK_NOCHANGE/BOND_LINK_FAIL
>
>
>>
>> 	To answer your question, yes, the "if (link_state) {" block in
>> the BOND_LINK_FAIL case of bond_miimon_inspect ends in continue, but
>> this path is nominally for the downdelay logic.  If downdelay is active
>> and the link recovers before the delay expires, the link should never
>> have moved to BOND_LINK_DOWN.  The commit++ causes bond_miimon_inspect
>> to return nonzero, causing in turn the bond_propose_link_state change to
>> BOND_LINK_FAIL state to be committed.  This path deliberately does not
>> set slave->new_link, as downdelay is purposely delaying the transition
>> to BOND_LINK_DOWN.
>>
>> 	If downdelay is 0, the slave->link should not persist in
>> BOND_LINK_FAIL state; it should set new_link = BOND_LINK_DOWN which will
>> cause a transition in bond_miimon_commit.  The bond_propose_link_state
>> call to set BOND_LINK_FAIL in the BOND_LINK_UP case will be committed in
>> bond_mii_monitor prior to calling bond_miimon_commit, which will in turn
>> do the transition to _DOWN state.  In this case, the BOND_LINK_FAIL case
>> "if (link_state) {" block should never be entered.
>
>I totally agree with your description of transition logic, and this is why
>we were puzzled by how this can occur until we noticed NetworkManager
>messages around this time and decided to run a test without it.
>Without NM, everything works as expected. After that, adding more
>instrumentation, we have found that we do not propose BOND_LINK_FAIL inside
>bond_miimon_inspect() but elsewhere (NetworkManager?).

	I think I see the flaw in the logic.

	1) bond_miimon_inspect finds link_state = 0, then makes a call
to bond_propose_link_state(BOND_LINK_FAIL), setting link_new_state to
BOND_LINK_FAIL.  _inspect then sets slave->new_link = BOND_LINK_DOWN and
returns non-zero.

	2) bond_mii_monitor rtnl_trylock fails, it reschedules.

	3) bond_mii_monitor runs again, and calls bond_miimon_inspect.

	4) the slave's link has recovered, so link_state != 0.
slave->link is still BOND_LINK_UP.  The slave's link_new_state remains
set to BOND_LINK_FAIL, but new_link is reset to NOCHANGE.
bond_miimon_inspect returns 0, so nothing is committed.

	5) step 4 can repeat indefinitely.

	6) eventually, the other slave does something that causes
commit++, making bond_mii_monitor call bond_commit_link_state and then
bond_miimon_commit.  The slave in question from steps 1-4 still has
link_new_state as BOND_LINK_FAIL, but new_link is NOCHANGE, so it ends
up in BOND_LINK_FAIL state.

	I think step 6 could also occur concurrently with the initial
pass through step 4 to induce the problem.
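
	For reference, the propose/commit helpers are essentially the
following (paraphrased from include/net/bonding.h, notification handling
trimmed); the proposal merely records the desired state, so a stale
link_new_state silently survives across bond_miimon_inspect passes until
something commits it:

	static inline void bond_propose_link_state(struct slave *slave, int state)
	{
		slave->link_new_state = state;
	}

	static inline void bond_commit_link_state(struct slave *slave, bool notify)
	{
		if (slave->link == slave->link_new_state)
			return;

		slave->link = slave->link_new_state;
		/* ... link state change notification elided ... */
	}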

	It looks like Mahesh mostly fixed this in

commit fb9eb899a6dc663e4a2deed9af2ac28f507d0ffb
Author: Mahesh Bandewar <maheshb@google.com>
Date:   Tue Apr 11 22:36:00 2017 -0700

    bonding: handle link transition from FAIL to UP correctly

	but the window still exists, and requires the slave link state
to change between the failed rtnl_trylock and the second pass through
_inspect.  The problem is that a state transition has been kept between
invocations to _inspect, but the condition that induced the transition
has changed.

	I haven't tested these, but I suspect the solution is either to
clear link_new_state on entry to the loop in bond_miimon_inspect, or
merge new_state and link_new_state as a single "next state" (which is
cleared on entry to the loop).

	The first of these is a pretty simple patch:

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 18b58e1376f1..6f89f9981a6c 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2046,6 +2046,7 @@ static int bond_miimon_inspect(struct bonding *bond)
 
 	bond_for_each_slave_rcu(bond, slave, iter) {
 		slave->new_link = BOND_LINK_NOCHANGE;
+		slave->link_new_state = slave->link;
 
 		link_state = bond_check_dev_link(bond, slave->dev, 0);
 

	Alex / Jarod, could you check my logic, and would you be able to
test this patch if my analysis appears sound?

	Thanks,

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
Alex Sidorenko Nov. 3, 2017, 3:40 p.m. UTC | #7
Jay,

while the scenario you describe makes sense, it does not match what we see in our tests.

The instrumentation prints info every time we enter bond_mii_monitor(), bond_miimon_inspect(),
bond_miimon_commit() and every time we are committing link state. And we print a message every time we
propose BOND_LINK_FAIL in bond_miimon_inspect()

So in your scenario,

2) bond_mii_monitor rtnl_trylock fails, it reschedules

we should see a bond_miimon_inspect() 'proposed BOND_LINK_FAIL' debugging message without a matching
'commit' message. But what we see in reality is that for each 'proposed' message there is a matching 'commit' message.
(And we don't expect ens3f1 to really go down when the VC module is rebooted - it is not connected to it).

Here is debugging output (with the fix I have suggested in my first email *applied*),
my comments inline.

   (FYI: in "bond_mii_monitor: ens3f0 current link state: 0" lines we print slave->link when entering bond_mii_monitor)

***

Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link state: 0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link state: 0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_UP case on slave ens3f0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_UP case on slave ens3f1
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link state: 0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link state: 0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_UP case on slave ens3f0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_UP case on slave ens3f1
<snipped NetworkManager lines>
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link state: 0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link state: 0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_UP case on slave ens3f0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: proposed BOND_LINK_FAIL on slave ens3f0
              /*FALLTHRU*/
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_FAIL case on slave ens3f0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: set new_link=BOND_LINK_DOWN on slave ens3f0

Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_UP case on slave ens3f1
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect returned non-zero

    As you see in lines above, we reach BOND_LINK_FAIL on ens3f0 only, ens3f1 has good link_state and we
    do not reach BOND_LINK_FAIL fallthru and do not propose anything
    (otherwise there would be debugging output for it)

    Now we are in bond_mii_monitor() bond_for_each_slave() loop, committing link states - and suddenly
        we have state 0->1 for interface ens3f1 as well as for ens3f0

Oct 31 09:09:25 SYDC1LNX kernel: bond0: committing link state 0->1 for interface ens3f0, 0 Mbps full duplex
Oct 31 09:09:25 SYDC1LNX kernel: bond0: slave->should_notify_link for interface ens3f0 now: 1
Oct 31 09:09:25 SYDC1LNX kernel: bond0: committing link state 0->1 for interface ens3f1, 20000 Mbps full duplex
Oct 31 09:09:25 SYDC1LNX kernel: bond0: slave->should_notify_link for interface ens3f1 now: 1

    Entering bond_miimon_commit()
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave ens3f0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: BOND_LINK_DOWN
Oct 31 09:09:25 SYDC1LNX kernel: bond0: link status definitely down for interface ens3f0, disabling it
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: slave->link: 2, up: false, slave->delay: 0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: slave->link: 1, up: true, slave->delay: 0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: no bestslave found, bond failure imminent
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_select_active_slave: best_slave != curr_active_slave
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_select_active_slave: best_slave is NULL, this is probably bad
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_change_active_slave: old_active: ens3f0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_change_active_slave: new_active: NULL
Oct 31 09:09:25 SYDC1LNX kernel: bond0: Removing MAC from old_active
Oct 31 09:09:25 SYDC1LNX kernel: bond0: ALB and TLB modes should call bond_alb_handle_active_change
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_set_carrier: turning carrier off
Oct 31 09:09:25 SYDC1LNX kernel: bond0: now running without any active interface!

         Even though link state of ens3f1 has changed, we have BOND_LINK_NOCHANGE
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave ens3f1
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: BOND_LINK_NOCHANGE on slave ens3f1

     Another invocation of bond_mii_monitor
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link state: 2
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link state: 1
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_DOWN case on slave ens3f0

Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_FAIL case on slave ens3f1

    We see "up again on ens3f1" and execute slave->new_link = BOND_LINK_UP; (my initial suggestion)
Oct 31 09:09:25 SYDC1LNX kernel: bond0: link status up again after 0 ms for interface ens3f1
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect returned non-zero
Oct 31 09:09:25 SYDC1LNX kernel: bond0: committing link state 2->2 for interface ens3f0, 0 Mbps full duplex
Oct 31 09:09:25 SYDC1LNX kernel: bond0: slave->should_notify_link for interface ens3f0 now: 0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: committing link state 1->0 for interface ens3f1, 20000 Mbps full duplex
Oct 31 09:09:25 SYDC1LNX kernel: bond0: slave->should_notify_link for interface ens3f1 now: 0

    As we nudged slave->new_link, bond_miimon_commit() now works
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave ens3f0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: BOND_LINK_NOCHANGE on slave ens3f0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave ens3f1
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: BOND_LINK_UP
Oct 31 09:09:25 SYDC1LNX kernel: bond0: link status definitely up for interface ens3f1, 20000 Mbps full duplex
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: slave->link: 2, up: false, slave->delay: 0
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: slave ens3f1 BOND_LINK_UP
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_select_active_slave: best_slave != curr_active_slave
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_change_active_slave: old_active: NULL
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_change_active_slave: new_active: ens3f1
Oct 31 09:09:25 SYDC1LNX kernel: bond0: Non-BOND_LINK_BACK case
Oct 31 09:09:25 SYDC1LNX kernel: bond0: making interface ens3f1 the new active one
Oct 31 09:09:25 SYDC1LNX kernel: bond0: Setting MAC on new_active
Oct 31 09:09:25 SYDC1LNX kernel: bond0: ALB and TLB modes should call bond_alb_handle_active_change
Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_set_carrier: turning carrier on
Oct 31 09:09:25 SYDC1LNX kernel: bond0: first active interface up!
<snipped NetworkManager lines>
Oct 31 09:09:26 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link state: 2
Oct 31 09:09:26 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link state: 0
Oct 31 09:09:26 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_DOWN case on slave ens3f0
Oct 31 09:09:26 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_UP case on slave ens3f1
...

Alex

On 11/02/2017 09:11 PM, Jay Vosburgh wrote:
> Alex Sidorenko <alexandre.sidorenko@hpe.com> wrote:
>> On 11/02/2017 12:51 AM, Jay Vosburgh wrote:
>>> Jarod Wilson <jarod@redhat.com> wrote:
>>>
>>>> On 2017-11-01 8:35 PM, Jay Vosburgh wrote:
>>>>> Jay Vosburgh <jay.vosburgh@canonical.com> wrote:
>>>>>
>>>>>> Alex Sidorenko <alexandre.sidorenko@hpe.com> wrote:
>>>>>>
>>>>>>> The problem has been found while trying to deploy RHEL7 on HPE Synergy
>>>>>>> platform, it is seen both in customer's environment and in HPE test lab.
>>>>>>>
>>>>>>> There are several bonds configured in TLB mode and miimon=100, all other
>>>>>>> options are default. Slaves are connected to VirtualConnect
>>>>>>> modules. Rebooting a VC module should bring one bond slave (ens3f0) down
>>>>>>> temporarily, but not another one (ens3f1). But what we see is
>>>>>>>
>>>>>>> Oct 24 10:37:12 SYDC1LNX kernel: bond0: link status up again after 0 ms for interface ens3f1
>>>>>
>>>>> 	In net-next, I don't see a path in the code that will lead to
>>>>> this message, as it would apparently require entering
>>>>> bond_miimon_inspect in state BOND_LINK_FAIL but with downdelay set to 0.
>>>>> If downdelay is 0, the code will transition to BOND_LINK_DOWN and not
>>>>> remain in _FAIL state.
>>>>
>>>> The kernel in question is laden with a fair bit of additional debug spew,
>>>> as we were going back and forth, trying to isolate where things were going
>>>> wrong.  That was indeed from the BOND_LINK_FAIL state in
>>>> bond_miimon_inspect, inside the if (link_state) clause though, so after
>>>> commit++, there's a continue, which ... does what now? Doesn't it take us
>>>> back to the top of the bond_for_each_slave_rcu() loop, so we bypass the
>>>> next few lines of code that would have led to a transition to
>>>> BOND_LINK_DOWN?
>>>
>>> 	Just to confirm: your downdelay is 0, correct?
>>
>> Correct.
>>
>>>
>>> 	And, do you get any other log messages other than "link status
>>> up again after 0 ms"?
>>
>> Yes, here are some messages (from an early instrumentation):
> [...]
>> That is, we never see ens3f1 going to BOND_LINK_DOWN and it continues
>> staying in BOND_LINK_NOCHANGE/BOND_LINK_FAIL
>>
>>
>>>
>>> 	To answer your question, yes, the "if (link_state) {" block in
>>> the BOND_LINK_FAIL case of bond_miimon_inspect ends in continue, but
>>> this path is nominally for the downdelay logic.  If downdelay is active
>>> and the link recovers before the delay expires, the link should never
>>> have moved to BOND_LINK_DOWN.  The commit++ causes bond_miimon_inspect
>>> to return nonzero, causing in turn the bond_propose_link_state change to
>>> BOND_LINK_FAIL state to be committed.  This path deliberately does not
>>> set slave->new_link, as downdelay is purposely delaying the transition
>>> to BOND_LINK_DOWN.
>>>
>>> 	If downdelay is 0, the slave->link should not persist in
>>> BOND_LINK_FAIL state; it should set new_link = BOND_LINK_DOWN which will
>>> cause a transition in bond_miimon_commit.  The bond_propose_link_state
>>> call to set BOND_LINK_FAIL in the BOND_LINK_UP case will be committed in
>>> bond_mii_monitor prior to calling bond_miimon_commit, which will in turn
>>> do the transition to _DOWN state.  In this case, the BOND_LINK_FAIL case
>>> "if (link_state) {" block should never be entered.
>>
>> I totally agree with your description of transition logic, and this is why
>> we were puzzled by how this can occur until we noticed NetworkManager
>> messages around this time and decided to run a test without it.
>> Without NM, everything works as expected. After that, adding more
>> instrumentation, we have found that we do not propose BOND_LINK_FAIL inside
>> bond_miimon_inspect() but elsewhere (NetworkManager?).
> 
> 	I think I see the flaw in the logic.
> 
> 	1) bond_miimon_inspect finds link_state = 0, then makes a call
> to bond_propose_link_state(BOND_LINK_FAIL), setting link_new_state to
> BOND_LINK_FAIL.  _inspect then sets slave->new_link = BOND_LINK_DOWN and
> returns non-zero.
> 
> 	2) bond_mii_monitor rtnl_trylock fails, it reschedules.
> 
> 	3) bond_mii_monitor runs again, and calls bond_miimon_inspect.
> 
> 	4) the slave's link has recovered, so link_state != 0.
> slave->link is still BOND_LINK_UP.  The slave's link_new_state remains
> set to BOND_LINK_FAIL, but new_link is reset to NOCHANGE.
> bond_miimon_inspect returns 0, so nothing is committed.
> 
> 	5) step 4 can repeat indefinitely.
> 
> 	6) eventually, the other slave does something that causes
> commit++, making bond_mii_monitor call bond_commit_link_state and then
> bond_miimon_commit.  The slave in question from steps 1-4 still has
> link_new_state as BOND_LINK_FAIL, but new_link is NOCHANGE, so it ends
> up in BOND_LINK_FAIL state.
> 
> 	I think step 6 could also occur concurrently with the initial
> pass through step 4 to induce the problem.
> 
> 	It looks like Mahesh mostly fixed this in
> 
> commit fb9eb899a6dc663e4a2deed9af2ac28f507d0ffb
> Author: Mahesh Bandewar <maheshb@google.com>
> Date:   Tue Apr 11 22:36:00 2017 -0700
> 
>      bonding: handle link transition from FAIL to UP correctly
> 
> 	but the window still exists, and requires the slave link state
> to change between the failed rtnl_trylock and the second pass through
> _inspect.  The problem is that a state transition has been kept between
> invocations to _inspect, but the condition that induced the transition
> has changed.
> 
> 	I haven't tested these, but I suspect the solution is either to
> clear link_new_state on entry to the loop in bond_miimon_inspect, or
> merge new_state and link_new_state as a single "next state" (which is
> cleared on entry to the loop).
> 
> 	The first of these is a pretty simple patch:
> 
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 18b58e1376f1..6f89f9981a6c 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -2046,6 +2046,7 @@ static int bond_miimon_inspect(struct bonding *bond)
>   
>   	bond_for_each_slave_rcu(bond, slave, iter) {
>   		slave->new_link = BOND_LINK_NOCHANGE;
> +		slave->link_new_state = slave->link;
>   
>   		link_state = bond_check_dev_link(bond, slave->dev, 0);
>   
> 
> 	Alex / Jarod, could you check my logic, and would you be able to
> test this patch if my analysis appears sound?
> 
> 	Thanks,
> 
> 	-J
> 
> ---
> 	-Jay Vosburgh, jay.vosburgh@canonical.com
>
Jay Vosburgh Nov. 3, 2017, 6:26 p.m. UTC | #8
Alex Sidorenko <alexandre.sidorenko@hpe.com> wrote:

>Jay,
>
>while scenario you describe makes sense, it does not match what we see in our tests.
>
>The instrumentation prints info every time we enter bond_mii_monitor(), bond_miimon_inspect(),
>bond_miimon_commit() and every time we are committing link state. And we print a message every time we
>propose BOND_LINK_FAIL in bond_miimon_inspect()
>
>So in your scenario,
>
>2) bond_mii_monitor rtnl_trylock fails, it reschedules
>
>we should see bond_miimon_inspect() 'proposed BOND_LINK_FAIL' debugging message without matching
>'commit' messages. But what we see in reality is that for each 'proposed' there is 'commit' message.
>(And we don't expect ens3f1 to really go down when VC module is rebooted - it is not connected to it).
>
>Here is debugging output (with the fix I have suggested in my first email *applied*),
>my comments inline.
>
>  (FYI: in "bond_mii_monitor: ens3f0 current link state: 0" lines we print slave->link when entering bond_mii_monitor)
>
>***
>
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link state: 0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link state: 0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_UP case on slave ens3f0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_UP case on slave ens3f1
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link state: 0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link state: 0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_UP case on slave ens3f0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_UP case on slave ens3f1
><snipped NetworkManager lines>
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link state: 0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link state: 0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_UP case on slave ens3f0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: proposed BOND_LINK_FAIL on slave ens3f0
>             /*FALLTHRU*/
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_FAIL case on slave ens3f0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: set new_link=BOND_LINK_DOWN on slave ens3f0
>
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_UP case on slave ens3f1
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect returned non-zero
>
>   As you see in lines above, we reach BOND_LINK_FAIL on ens3f0 only, ens3f1 has good link_state and we
>   do not reach BOND_LINK_FAIL fallthru and do not propose anything
>   (otherwise there would be debugging output for it)
>
>   Now we are in bond_mii_monitor() bond_for_each_slave() loop, committing link states - and suddenly
>       we have state 0->1 for interface ens3f1 as well as for ens3f0
>
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: committing link state 0->1 for interface ens3f0, 0 Mbps full duplex
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: slave->should_notify_link for interface ens3f0 now: 1
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: committing link state 0->1 for interface ens3f1, 20000 Mbps full duplex
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: slave->should_notify_link for interface ens3f1 now: 1

	Does your instrumentation show each slave's ->link_new_state at
each entry to bond_miimon_inspect?  Not just when commit, et al, is
called, but the actual value of the variable each time through the
function?  Or maybe you've got an entry for the propose way back in the
log somewhere?

	I'm wondering here whether you're seeing "step 6" from the
failure path I described, i.e., the slave->link_new_state on ens3f1 was
set some time previously (perhaps a long time) and has been silently
pending until some event happens on the other slave to trigger a commit
cycle.

	Something had to have set the variable, and from your
instrumentation, it does not appear that it was the immediately prior
instance of bond_miimon_inspect.  In net-next's bonding driver, nothing
other than the _propose call will enter BOND_LINK_FAIL state.

	-J

>   Entering bond_miimon_commit()
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave ens3f0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: BOND_LINK_DOWN
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: link status definitely down for interface ens3f0, disabling it
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: slave->link: 2, up: false, slave->delay: 0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: slave->link: 1, up: true, slave->delay: 0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: no bestslave found, bond failure imminent
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_select_active_slave: best_slave != curr_active_slave
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_select_active_slave: best_slave is NULL, this is probably bad
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_change_active_slave: old_active: ens3f0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_change_active_slave: new_active: NULL
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: Removing MAC from old_active
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: ALB and TLB modes should call bond_alb_handle_active_change
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_set_carrier: turning carrier off
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: now running without any active interface!
>
>        Even though link state of ens3f1 has changed, we have BOND_LINK_NOCHANGE
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave ens3f1
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: BOND_LINK_NOCHANGE on slave ens3f1
>
>    Another invocation of bond_mii_monitor
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link state: 2
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link state: 1
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_DOWN case on slave ens3f0
>
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_FAIL case on slave ens3f1
>
>   We see "up again on ens3f1" and execute slave->new_link = BOND_LINK_UP; (my initial suggestion)
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: link status up again after 0 ms for interface ens3f1
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect returned non-zero
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: committing link state 2->2 for interface ens3f0, 0 Mbps full duplex
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: slave->should_notify_link for interface ens3f0 now: 0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: committing link state 1->0 for interface ens3f1, 20000 Mbps full duplex
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: slave->should_notify_link for interface ens3f1 now: 0
>
>   As we nudged slave->new_link, bond_miimon_commit() now works
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave ens3f0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: BOND_LINK_NOCHANGE on slave ens3f0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave ens3f1
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: BOND_LINK_UP
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: link status definitely up for interface ens3f1, 20000 Mbps full duplex
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: slave->link: 2, up: false, slave->delay: 0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: slave ens3f1 BOND_LINK_UP
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_select_active_slave: best_slave != curr_active_slave
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_change_active_slave: old_active: NULL
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_change_active_slave: new_active: ens3f1
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: Non-BOND_LINK_BACK case
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: making interface ens3f1 the new active one
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: Setting MAC on new_active
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: ALB and TLB modes should call bond_alb_handle_active_change
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_set_carrier: turning carrier on
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: first active interface up!
><snipped NetworkManager lines>
>Oct 31 09:09:26 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link state: 2
>Oct 31 09:09:26 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link state: 0
>Oct 31 09:09:26 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_DOWN case on slave ens3f0
>Oct 31 09:09:26 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_UP case on slave ens3f1
>...
>
>Alex
>
>On 11/02/2017 09:11 PM, Jay Vosburgh wrote:
>> Alex Sidorenko <alexandre.sidorenko@hpe.com> wrote:
>>> On 11/02/2017 12:51 AM, Jay Vosburgh wrote:
>>>> Jarod Wilson <jarod@redhat.com> wrote:
>>>>
>>>>> On 2017-11-01 8:35 PM, Jay Vosburgh wrote:
>>>>>> Jay Vosburgh <jay.vosburgh@canonical.com> wrote:
>>>>>>
>>>>>>> Alex Sidorenko <alexandre.sidorenko@hpe.com> wrote:
>>>>>>>
>>>>>>>> The problem has been found while trying to deploy RHEL7 on HPE Synergy
>>>>>>>> platform, it is seen both in customer's environment and in HPE test lab.
>>>>>>>>
>>>>>>>> There are several bonds configured in TLB mode and miimon=100, all other
>>>>>>>> options are default. Slaves are connected to VirtualConnect
>>>>>>>> modules. Rebooting a VC module should bring one bond slave (ens3f0) down
>>>>>>>> temporarily, but not another one (ens3f1). But what we see is
>>>>>>>>
>>>>>>>> Oct 24 10:37:12 SYDC1LNX kernel: bond0: link status up again after 0 ms for interface ens3f1
>>>>>>
>>>>>> 	In net-next, I don't see a path in the code that will lead to
>>>>>> this message, as it would apparently require entering
>>>>>> bond_miimon_inspect in state BOND_LINK_FAIL but with downdelay set to 0.
>>>>>> If downdelay is 0, the code will transition to BOND_LINK_DOWN and not
>>>>>> remain in _FAIL state.
>>>>>
>>>>> The kernel in question is laden with a fair bit of additional debug spew,
>>>>> as we were going back and forth, trying to isolate where things were going
>>>>> wrong.  That was indeed from the BOND_LINK_FAIL state in
>>>>> bond_miimon_inspect, inside the if (link_state) clause though, so after
>>>>> commit++, there's a continue, which ... does what now? Doesn't it take us
>>>>> back to the top of the bond_for_each_slave_rcu() loop, so we bypass the
>>>>> next few lines of code that would have led to a transition to
>>>>> BOND_LINK_DOWN?
>>>>
>>>> 	Just to confirm: your downdelay is 0, correct?
>>>
>>> Correct.
>>>
>>>>
>>>> 	And, do you get any other log messages other than "link status
>>>> up again after 0 ms"?
>>>
>>> Yes, here are some messages (from an early instrumentation):
>> [...]
>>> That is, we never see ens3f1 going to BOND_LINK_DOWN and it continues
>>> staying in BOND_LINK_NOCHANGE/BOND_LINK_FAIL
>>>
>>>
>>>>
>>>> 	To answer your question, yes, the "if (link_state) {" block in
>>>> the BOND_LINK_FAIL case of bond_miimon_inspect ends in continue, but
>>>> this path is nominally for the downdelay logic.  If downdelay is active
>>>> and the link recovers before the delay expires, the link should never
>>>> have moved to BOND_LINK_DOWN.  The commit++ causes bond_miimon_inspect
>>>> to return nonzero, causing in turn the bond_propose_link_state change to
>>>> BOND_LINK_FAIL state to be committed.  This path deliberately does not
>>>> set slave->new_link, as downdelay is purposely delaying the transition
>>>> to BOND_LINK_DOWN.
>>>>
>>>> 	If downdelay is 0, the slave->link should not persist in
>>>> BOND_LINK_FAIL state; it should set new_link = BOND_LINK_DOWN which will
>>>> cause a transition in bond_miimon_commit.  The bond_propose_link_state
>>>> call to set BOND_LINK_FAIL in the BOND_LINK_UP case will be committed in
>>>> bond_mii_monitor prior to calling bond_miimon_commit, which will in turn
>>>> do the transition to _DOWN state.  In this case, the BOND_LINK_FAIL case
>>>> "if (link_state) {" block should never be entered.
>>>
>>> I totally agree with your description of transition logic, and this is why
>>> we were puzzled by how this can occur until we noticed NetworkManager
>>> messages around this time and decided to run a test without it.
>>> Without NM, everything works as expected. After that, adding more
>>> instrumentation, we have found that we do not propose BOND_LINK_FAIL inside
>>> bond_miimon_inspect() but elsewhere (NetworkManager?).
>>
>> 	I think I see the flaw in the logic.
>>
>> 	1) bond_miimon_inspect finds link_state = 0, then makes a call
>> to bond_propose_link_state(BOND_LINK_FAIL), setting link_new_state to
>> BOND_LINK_FAIL.  _inspect then sets slave->new_link = BOND_LINK_DOWN and
>> returns non-zero.
>>
>> 	2) bond_mii_monitor rtnl_trylock fails, it reschedules.
>>
>> 	3) bond_mii_monitor runs again, and calls bond_miimon_inspect.
>>
>> 	4) the slave's link has recovered, so link_state != 0.
>> slave->link is still BOND_LINK_UP.  The slave's link_new_state remains
>> set to BOND_LINK_FAIL, but new_link is reset to NOCHANGE.
>> bond_miimon_inspect returns 0, so nothing is committed.
>>
>> 	5) step 4 can repeat indefinitely.
>>
>> 	6) eventually, the other slave does something that causes
>> commit++, making bond_mii_monitor call bond_commit_link_state and then
>> bond_miimon_commit.  The slave in question from steps 1-4 still has
>> link_new_state as BOND_LINK_FAIL, but new_link is NOCHANGE, so it ends
>> up in BOND_LINK_FAIL state.
>>
>> 	I think step 6 could also occur concurrently with the initial
>> pass through step 4 to induce the problem.
>>
>> 	It looks like Mahesh mostly fixed this in
>>
>> commit fb9eb899a6dc663e4a2deed9af2ac28f507d0ffb
>> Author: Mahesh Bandewar <maheshb@google.com>
>> Date:   Tue Apr 11 22:36:00 2017 -0700
>>
>>      bonding: handle link transition from FAIL to UP correctly
>>
>> 	but the window still exists, and requires the slave link state
>> to change between the failed rtnl_trylock and the second pass through
>> _inspect.  The problem is that a state transition has been kept between
>> invocations to _inspect, but the condition that induced the transition
>> has changed.
>>
>> 	I haven't tested these, but I suspect the solution is either to
>> clear link_new_state on entry to the loop in bond_miimon_inspect, or
>> merge new_state and link_new_state as a single "next state" (which is
>> cleared on entry to the loop).
>>
>> 	The first of these is a pretty simple patch:
>>
>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>> index 18b58e1376f1..6f89f9981a6c 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -2046,6 +2046,7 @@ static int bond_miimon_inspect(struct bonding *bond)
>>     	bond_for_each_slave_rcu(bond, slave, iter) {
>>   		slave->new_link = BOND_LINK_NOCHANGE;
>> +		slave->link_new_state = slave->link;
>>     		link_state = bond_check_dev_link(bond, slave->dev, 0);
>>   
>>
>> 	Alex / Jarod, could you check my logic, and would you be able to
>> test this patch if my analysis appears sound?
>>
>> 	Thanks,
>>
>> 	-J
>>
>> ---
>> 	-Jay Vosburgh, jay.vosburgh@canonical.com
>>
>
>-- 
>------------------------------------------------------------------
>Alex Sidorenko	email: asid@hpe.com
>ERT  Linux 	Hewlett-Packard Enterprise (Canada)
>------------------------------------------------------------------
Alex Sidorenko Nov. 3, 2017, 7:30 p.m. UTC | #9
Indeed, we do not print the slave's ->link_new_state on each entry, so it is quite possible that we are at step 6.

It is even possible that this has something to do with how NM initially created bonds.
Customer says that the problem occurs only once, right after a host reboot;
after that, failover works fine no matter how many times he changes the
state of the VirtualConnect modules.

Jarod,

could you please add a printout of slave->link_new_state for both slaves at
each entry to bond_miimon_inspect?

(and instead of nudging slave->new_link like I suggested, use Jay's patch).
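
Something like this at the top of the bond_for_each_slave_rcu() loop in
bond_miimon_inspect() should be enough (just a sketch, adjust it to fit the
instrumentation you already have):

		/* dump the slave's link-state fields on every pass */
		netdev_info(bond->dev,
			    "bond_miimon_inspect: %s link=%d link_new_state=%d new_link=%d\n",
			    slave->dev->name, slave->link,
			    slave->link_new_state, slave->new_link);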

Alex


On 11/03/2017 02:26 PM, Jay Vosburgh wrote:
> Alex Sidorenko <alexandre.sidorenko@hpe.com> wrote:
>
>> Jay,
>>
>> while scenario you describe makes sense, it does not match what we see in our tests.
>>
>> The instrumentation prints info every time we enter bond_mii_monitor(), bond_miimon_inspect(),
>> bond_miimon_commit() and every time we are committing link state. And we print a message every time we
>> propose BOND_LINK_FAIL in bond_miimon_inspect()
>>
>> So in your scenario,
>>
>> 2) bond_mii_monitor rtnl_trylock fails, it reschedules
>>
>> we should see bond_miimon_inspect() 'proposed BOND_LINK_FAIL' debugging message without matching
>> 'commit' messages. But what we see in reality is that for each 'proposed' there is 'commit' message.
>> (And we don't expect ens3f1 to really go down when VC module is rebooted - it is not connected to it).
>>
>> Here is debugging output (with the fix I have suggested in my first email *applied*),
>> my comments inline.
>>
>>   (FYI: in "bond_mii_monitor: ens3f0 current link state: 0" lines we print slave->link when entering bond_mii_monitor)
>>
>> ***
>>
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link state: 0
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link state: 0
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_UP case on slave ens3f0
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_UP case on slave ens3f1
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link state: 0
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link state: 0
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_UP case on slave ens3f0
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_UP case on slave ens3f1
>> <snipped NetworkManager lines>
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link state: 0
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link state: 0
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_UP case on slave ens3f0
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: proposed BOND_LINK_FAIL on slave ens3f0
>>              /*FALLTHRU*/
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_FAIL case on slave ens3f0
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: set new_link=BOND_LINK_DOWN on slave ens3f0
>>
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_UP case on slave ens3f1
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect returned non-zero
>>
>>    As you see in lines above, we reach BOND_LINK_FAIL on ens3f0 only, ens3f1 has good link_state and we
>>    do not reach BOND_LINK_FAIL fallthru and do not propose anything
>>    (otherwise there would be debugging output for it)
>>
>>    Now we are in bond_mii_monitor() bond_for_each_slave() loop, committing link states - and suddenly
>>        we have state 0->1 for interface ens3f1 as well as for ens3f0
>>
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: committing link state 0->1 for interface ens3f0, 0 Mbps full duplex
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: slave->should_notify_link for interface ens3f0 now: 1
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: committing link state 0->1 for interface ens3f1, 20000 Mbps full duplex
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: slave->should_notify_link for interface ens3f1 now: 1
> 	Does your instrumentation show each slave's ->link_new_state at
> each entry to bond_miimon_inspect?  Not just when commit, et al, is
> called, but the actual value of the variable each time through the
> function?  Or maybe you've got an entry for the propose way back in the
> log somewhere?
>
> 	I'm wondering here whether you're seeing "step 6" from the
> failure path I described, i.e., the slave->link_new_state on ens3f1 was
> set some time previously (perhaps a long time) and has been silently
> pending until some event happens on the other slave to trigger a commit
> cycle.
>
> 	Something had to have set the variable, and from your
> instrumentation, it does not appear that it was the immediately prior
> instance of bond_miimon_inspect.  In net-next's bonding driver, nothing
> other than the _propose call will enter BOND_LINK_FAIL state.
>
> 	-J
>
>>    Entering bond_miimon_commit()
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave ens3f0
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: BOND_LINK_DOWN
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: link status definitely down for interface ens3f0, disabling it
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: slave->link: 2, up: false, slave->delay: 0
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: slave->link: 1, up: true, slave->delay: 0
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: no bestslave found, bond failure imminent
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_select_active_slave: best_slave != curr_active_slave
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_select_active_slave: best_slave is NULL, this is probably bad
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_change_active_slave: old_active: ens3f0
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_change_active_slave: new_active: NULL
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: Removing MAC from old_active
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: ALB and TLB modes should call bond_alb_handle_active_change
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_set_carrier: turning carrier off
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: now running without any active interface!
>>
>>         Even though link state of ens3f1 has changed, we have BOND_LINK_NOCHANGE
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave ens3f1
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: BOND_LINK_NOCHANGE on slave ens3f1
>>
>>     Another invocation of bond_mii_monitor
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link state: 2
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link state: 1
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_DOWN case on slave ens3f0
>>
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_FAIL case on slave ens3f1
>>
>>    We see "up again on ens3f1" and execute slave->new_link = BOND_LINK_UP; (my initial suggestion)
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: link status up again after 0 ms for interface ens3f1
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect returned non-zero
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: committing link state 2->2 for interface ens3f0, 0 Mbps full duplex
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: slave->should_notify_link for interface ens3f0 now: 0
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: committing link state 1->0 for interface ens3f1, 20000 Mbps full duplex
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: slave->should_notify_link for interface ens3f1 now: 0
>>
>>    As we nudged slave->new_link, bond_miimon_commit() now works
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave ens3f0
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: BOND_LINK_NOCHANGE on slave ens3f0
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave ens3f1
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: BOND_LINK_UP
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: link status definitely up for interface ens3f1, 20000 Mbps full duplex
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: slave->link: 2, up: false, slave->delay: 0
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: slave ens3f1 BOND_LINK_UP
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_select_active_slave: best_slave != curr_active_slave
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_change_active_slave: old_active: NULL
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_change_active_slave: new_active: ens3f1
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: Non-BOND_LINK_BACK case
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: making interface ens3f1 the new active one
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: Setting MAC on new_active
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: ALB and TLB modes should call bond_alb_handle_active_change
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_set_carrier: turning carrier on
>> Oct 31 09:09:25 SYDC1LNX kernel: bond0: first active interface up!
>> <snipped NetworkManager lines>
>> Oct 31 09:09:26 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link state: 2
>> Oct 31 09:09:26 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link state: 0
>> Oct 31 09:09:26 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_DOWN case on slave ens3f0
>> Oct 31 09:09:26 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered BOND_LINK_UP case on slave ens3f1
>> ...
>>
>> Alex
>>
>> On 11/02/2017 09:11 PM, Jay Vosburgh wrote:
>>> Alex Sidorenko <alexandre.sidorenko@hpe.com> wrote:
>>>> On 11/02/2017 12:51 AM, Jay Vosburgh wrote:
>>>>> Jarod Wilson <jarod@redhat.com> wrote:
>>>>>
>>>>>> On 2017-11-01 8:35 PM, Jay Vosburgh wrote:
>>>>>>> Jay Vosburgh <jay.vosburgh@canonical.com> wrote:
>>>>>>>
>>>>>>>> Alex Sidorenko <alexandre.sidorenko@hpe.com> wrote:
>>>>>>>>
>>>>>>>>> The problem has been found while trying to deploy RHEL7 on HPE Synergy
>>>>>>>>> platform, it is seen both in customer's environment and in HPE test lab.
>>>>>>>>>
>>>>>>>>> There are several bonds configured in TLB mode and miimon=100, all other
>>>>>>>>> options are default. Slaves are connected to VirtualConnect
>>>>>>>>> modules. Rebooting a VC module should bring one bond slave (ens3f0) down
>>>>>>>>> temporarily, but not another one (ens3f1). But what we see is
>>>>>>>>>
>>>>>>>>> Oct 24 10:37:12 SYDC1LNX kernel: bond0: link status up again after 0 ms for interface ens3f1
>>>>>>> 	In net-next, I don't see a path in the code that will lead to
>>>>>>> this message, as it would apparently require entering
>>>>>>> bond_miimon_inspect in state BOND_LINK_FAIL but with downdelay set to 0.
>>>>>>> If downdelay is 0, the code will transition to BOND_LINK_DOWN and not
>>>>>>> remain in _FAIL state.
>>>>>> The kernel in question is laden with a fair bit of additional debug spew,
>>>>>> as we were going back and forth, trying to isolate where things were going
>>>>>> wrong.  That was indeed from the BOND_LINK_FAIL state in
>>>>>> bond_miimon_inspect, inside the if (link_state) clause though, so after
>>>>>> commit++, there's a continue, which ... does what now? Doesn't it take us
>>>>>> back to the top of the bond_for_each_slave_rcu() loop, so we bypass the
>>>>>> next few lines of code that would have led to a transition to
>>>>>> BOND_LINK_DOWN?
>>>>> 	Just to confirm: your downdelay is 0, correct?
>>>> Correct.
>>>>
>>>>> 	And, do you get any other log messages other than "link status
>>>>> up again after 0 ms"?
>>>> Yes, here are some messages (from an early instrumentation):
>>> [...]
>>>> That is, we never see ens3f1 going to BOND_LINK_DOWN and it continues
>>>> staying in BOND_LINK_NOCHANGE/BOND_LINK_FAIL
>>>>
>>>>
>>>>> 	To answer your question, yes, the "if (link_state) {" block in
>>>>> the BOND_LINK_FAIL case of bond_miimon_inspect ends in continue, but
>>>>> this path is nominally for the downdelay logic.  If downdelay is active
>>>>> and the link recovers before the delay expires, the link should never
>>>>> have moved to BOND_LINK_DOWN.  The commit++ causes bond_miimon_inspect
>>>>> to return nonzero, causing in turn the bond_propose_link_state change to
>>>>> BOND_LINK_FAIL state to be committed.  This path deliberately does not
>>>>> set slave->new_link, as downdelay is purposely delaying the transition
>>>>> to BOND_LINK_DOWN.
>>>>>
>>>>> 	If downdelay is 0, the slave->link should not persist in
>>>>> BOND_LINK_FAIL state; it should set new_link = BOND_LINK_DOWN which will
>>>>> cause a transition in bond_miimon_commit.  The bond_propose_link_state
>>>>> call to set BOND_LINK_FAIL in the BOND_LINK_UP case will be committed in
>>>>> bond_mii_monitor prior to calling bond_miimon_commit, which will in turn
>>>>> do the transition to _DOWN state.  In this case, the BOND_LINK_FAIL case
>>>>> "if (link_state) {" block should never be entered.
>>>> I totally agree with your description of transition logic, and this is why
>>>> we were puzzled by how this can occur until we noticed NetworkManager
>>>> messages around this time and decided to run a test without it.
>>>> Without NM, everything works as expected. After that, adding more
>>>> instrumentation, we have found that we do not propose BOND_LINK_FAIL inside
>>>> bond_miimon_inspect() but elsewhere (NetworkManager?).
>>> 	I think I see the flaw in the logic.
>>>
>>> 	1) bond_miimon_inspect finds link_state = 0, then makes a call
>>> to bond_propose_link_state(BOND_LINK_FAIL), setting link_new_state to
>>> BOND_LINK_FAIL.  _inspect then sets slave->new_link = BOND_LINK_DOWN and
>>> returns non-zero.
>>>
>>> 	2) bond_mii_monitor rtnl_trylock fails, it reschedules.
>>>
>>> 	3) bond_mii_monitor runs again, and calls bond_miimon_inspect.
>>>
>>> 	4) the slave's link has recovered, so link_state != 0.
>>> slave->link is still BOND_LINK_UP.  The slave's link_new_state remains
>>> set to BOND_LINK_FAIL, but new_link is reset to NOCHANGE.
>>> bond_miimon_inspect returns 0, so nothing is committed.
>>>
>>> 	5) step 4 can repeat indefinitely.
>>>
>>> 	6) eventually, the other slave does something that causes
>>> commit++, making bond_mii_monitor call bond_commit_link_state and then
>>> bond_miimon_commit.  The slave in question from steps 1-4 still has
>>> link_new_state as BOND_LINK_FAIL, but new_link is NOCHANGE, so it ends
>>> up in BOND_LINK_FAIL state.
>>>
>>> 	I think step 6 could also occur concurrently with the initial
>>> pass through step 4 to induce the problem.
>>>
>>> 	It looks like Mahesh mostly fixed this in
>>>
>>> commit fb9eb899a6dc663e4a2deed9af2ac28f507d0ffb
>>> Author: Mahesh Bandewar <maheshb@google.com>
>>> Date:   Tue Apr 11 22:36:00 2017 -0700
>>>
>>>       bonding: handle link transition from FAIL to UP correctly
>>>
>>> 	but the window still exists, and requires the slave link state
>>> to change between the failed rtnl_trylock and the second pass through
>>> _inspect.  The problem is that a state transition has been kept between
>>> invocations to _inspect, but the condition that induced the transition
>>> has changed.
>>>
>>> 	I haven't tested these, but I suspect the solution is either to
>>> clear link_new_state on entry to the loop in bond_miimon_inspect, or
>>> merge new_state and link_new_state as a single "next state" (which is
>>> cleared on entry to the loop).
>>>
>>> 	The first of these is a pretty simple patch:
>>>
>>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>>> index 18b58e1376f1..6f89f9981a6c 100644
>>> --- a/drivers/net/bonding/bond_main.c
>>> +++ b/drivers/net/bonding/bond_main.c
>>> @@ -2046,6 +2046,7 @@ static int bond_miimon_inspect(struct bonding *bond)
>>>      	bond_for_each_slave_rcu(bond, slave, iter) {
>>>    		slave->new_link = BOND_LINK_NOCHANGE;
>>> +		slave->link_new_state = slave->link;
>>>      		link_state = bond_check_dev_link(bond, slave->dev, 0);
>>>    
>>>
>>> 	Alex / Jarod, could you check my logic, and would you be able to
>>> test this patch if my analysis appears sound?
>>>
>>> 	Thanks,
>>>
>>> 	-J
>>>
>>> ---
>>> 	-Jay Vosburgh, jay.vosburgh@canonical.com
>>>
>> -- 
>> ------------------------------------------------------------------
>> Alex Sidorenko	email: asid@hpe.com
>> ERT  Linux 	Hewlett-Packard Enterprise (Canada)
>> ------------------------------------------------------------------
Jarod Wilson Nov. 3, 2017, 9:46 p.m. UTC | #10
On 2017-11-03 3:30 PM, Alex Sidorenko wrote:
> Indeed, we do not print the slave's ->link_new_state on each entry, so it 
> is quite possible that we are at step 6.
> 
> It is even possible that this has something to do with how NM initially 
> created bonds.
> Customer says that the problem occurs only once, right after a host 
> reboot; after that, failover works fine no matter how many times he 
> changes the state of the VirtualConnect modules.
> 
> Jarod,
> 
> could you please add a printout of slave->link_new_state for both slaves 
> at each entry to bond_miimon_inspect?
> 
> (and instead of nudging slave->new_link like I suggested, use Jay's patch).

Will do; the test build is just about ready here.
Jarod Wilson Nov. 6, 2017, 6:06 a.m. UTC | #11
On 2017-11-02 9:11 PM, Jay Vosburgh wrote:
> Alex Sidorenko <alexandre.sidorenko@hpe.com> wrote:
...> 	I think I see the flaw in the logic.
> 
> 	1) bond_miimon_inspect finds link_state = 0, then makes a call
> to bond_propose_link_state(BOND_LINK_FAIL), setting link_new_state to
> BOND_LINK_FAIL.  _inspect then sets slave->new_link = BOND_LINK_DOWN and
> returns non-zero.
> 
> 	2) bond_mii_monitor rtnl_trylock fails, it reschedules.
> 
> 	3) bond_mii_monitor runs again, and calls bond_miimon_inspect.
> 
> 	4) the slave's link has recovered, so link_state != 0.
> slave->link is still BOND_LINK_UP.  The slave's link_new_state remains
> set to BOND_LINK_FAIL, but new_link is reset to NOCHANGE.
> bond_miimon_inspect returns 0, so nothing is committed.
> 
> 	5) step 4 can repeat indefinitely.
> 
> 	6) eventually, the other slave does something that causes
> commit++, making bond_mii_monitor call bond_commit_link_state and then
> bond_miimon_commit.  The slave in question from steps 1-4 still has
> link_new_state as BOND_LINK_FAIL, but new_link is NOCHANGE, so it ends
> up in BOND_LINK_FAIL state.
> 
> 	I think step 6 could also occur concurrently with the initial
> pass through step 4 to induce the problem.
> 
> 	It looks like Mahesh mostly fixed this in
> 
> commit fb9eb899a6dc663e4a2deed9af2ac28f507d0ffb
> Author: Mahesh Bandewar <maheshb@google.com>
> Date:   Tue Apr 11 22:36:00 2017 -0700
> 
>      bonding: handle link transition from FAIL to UP correctly
> 
> 	but the window still exists, and requires the slave link state
> to change between the failed rtnl_trylock and the second pass through
> _inspect.  The problem is that a state transition has been kept between
> invocations to _inspect, but the condition that induced the transition
> has changed.
> 
> 	I haven't tested these, but I suspect the solution is either to
> clear link_new_state on entry to the loop in bond_miimon_inspect, or
> merge new_state and link_new_state as a single "next state" (which is
> cleared on entry to the loop).
> 
> 	The first of these is a pretty simple patch:
> 
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 18b58e1376f1..6f89f9981a6c 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -2046,6 +2046,7 @@ static int bond_miimon_inspect(struct bonding *bond)
>   
>   	bond_for_each_slave_rcu(bond, slave, iter) {
>   		slave->new_link = BOND_LINK_NOCHANGE;
> +		slave->link_new_state = slave->link;
>   
>   		link_state = bond_check_dev_link(bond, slave->dev, 0);
>   
> 
> 	Alex / Jarod, could you check my logic, and would you be able to
> test this patch if my analysis appears sound?

This patch looks good; the original reproducing setup successfully 
recovers after the original active slave goes down, even with 
NetworkManager in the mix.

Reviewed-by: Jarod Wilson <jarod@redhat.com>
Jay Vosburgh Nov. 7, 2017, 2:47 a.m. UTC | #12
Jarod Wilson <jarod@redhat.com> wrote:

>On 2017-11-02 9:11 PM, Jay Vosburgh wrote:
[...]
>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>> index 18b58e1376f1..6f89f9981a6c 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -2046,6 +2046,7 @@ static int bond_miimon_inspect(struct bonding *bond)
>>     	bond_for_each_slave_rcu(bond, slave, iter) {
>>   		slave->new_link = BOND_LINK_NOCHANGE;
>> +		slave->link_new_state = slave->link;
>>     		link_state = bond_check_dev_link(bond, slave->dev, 0);
>>   
>>
>> 	Alex / Jarod, could you check my logic, and would you be able to
>> test this patch if my analysis appears sound?
>
>This patch looks good, the original reproducing setup successfully
>recovers after the original active slave goes down, even with
>NetworkManager in the mix.
>
>Reviewed-by: Jarod Wilson <jarod@redhat.com>

	Thanks, I'll get a formal patch submission out later today.

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
diff mbox series

Patch

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index c99dc59..07aa7ba 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2072,6 +2072,7 @@  static int bond_miimon_inspect(struct bonding *bond)
                                             (bond->params.downdelay - slave->delay) *
                                             bond->params.miimon,
                                             slave->dev->name);
+                               slave->new_link = BOND_LINK_UP;
                                 commit++;
                                 continue;
                         }
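
For context, the added line sits in the link-recovery branch of the
BOND_LINK_FAIL case in bond_miimon_inspect().  Roughly, reconstructed
around the hunk above rather than copied verbatim:

		case BOND_LINK_FAIL:
			if (link_state) {
				/* link recovered before downdelay expired */
				bond_propose_link_state(slave, BOND_LINK_UP);
				netdev_info(bond->dev, "link status up again after %d ms for interface %s\n",
					    (bond->params.downdelay - slave->delay) *
					    bond->params.miimon,
					    slave->dev->name);
				slave->new_link = BOND_LINK_UP;	/* the added line */
				commit++;
				continue;
			}

Without the added assignment, only link_new_state carries the recovery and
bond_miimon_commit() skips the slave, as seen in the BOND_LINK_NOCHANGE
lines for ens3f1 in the logs above.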