diff mbox

bonding reports interface up with 0 Mbps

Message ID 87618083B2453E4A8714035B62D6799250524233@FMSMSX105.amr.corp.intel.com
State RFC, archived
Delegated to: David Miller

Commit Message

Tantilov, Emil S Feb. 3, 2016, 11:10 p.m. UTC
We are seeing an occasional issue where the bonding driver may report interface up with 0 Mbps:
bond0: link status definitely up for interface eth0, 0 Mbps full duplex

So far in all the failed traces I have collected this happens on NETDEV_CHANGELOWERSTATE event:

<...>-20533 [000] .... 81811.041241: ixgbe_service_task: eth1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
<...>-20533 [000] .... 81811.041257: ixgbe_check_vf_rate_limit <-ixgbe_service_task
<...>-20533 [000] .... 81811.041272: ixgbe_ping_all_vfs <-ixgbe_service_task
kworker/u48:0-7503  [010] .... 81811.041345: ixgbe_get_stats64 <-dev_get_stats
kworker/u48:0-7503  [010] .... 81811.041393: bond_netdev_event: eth1: event: 1b
kworker/u48:0-7503  [010] .... 81811.041394: bond_netdev_event: eth1: IFF_SLAVE
kworker/u48:0-7503  [010] .... 81811.041395: bond_netdev_event: eth1: slave->speed = ffffffff
<...>-20533 [000] .... 81811.041407: ixgbe_ptp_overflow_check <-ixgbe_service_task
kworker/u48:0-7503  [010] .... 81811.041407: bond_mii_monitor: bond0: link status definitely up for interface eth1, 0 Mbps full duplex
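One detail worth noting in the trace: the ffffffff in the slave->speed line is (u32)-1, i.e. SPEED_UNKNOWN from include/uapi/linux/ethtool.h, not a garbled real speed. A quick sketch of the arithmetic:

```shell
# SPEED_UNKNOWN is defined as -1; stored in an unsigned 32-bit field
# it reads back as 0xffffffff, matching the slave->speed trace line.
speed_unknown=$(printf '%x' $(( (1 << 32) - 1 )))
echo "$speed_unknown"    # -> ffffffff
```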

As a proof of concept I added NETDEV_CHANGELOWERSTATE in bond_slave_netdev_event() along with NETDEV_UP/CHANGE:

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 56b5605..a9dac4c 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3014,6 +3014,7 @@ static int bond_slave_netdev_event(unsigned long event,
 		break;
 	case NETDEV_UP:
 	case NETDEV_CHANGE:
+	case NETDEV_CHANGELOWERSTATE:
 		bond_update_speed_duplex(slave);
 		if (BOND_MODE(bond) == BOND_MODE_8023AD)
 			bond_3ad_adapter_speed_duplex_changed(slave);

With this change I have not seen 0 Mbps reported by the bonding driver (about 12 hours of testing so far,
vs. failures within 2-3 hours otherwise). Although I suppose it could also be some sort of race/timing issue with bond_mii_monitor().

This test is with current bonding driver from net-next (top commit 03d84a5f83).

The bond is configured as such:

mode = 802.3ad
lacp_rate = fast
miimon = 100
xmit_hash_policy = layer3+4
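For reference, one common way to apply the settings above is via module options (a sketch; the exact mechanism depends on the distribution, and sysfs or ifcfg files work as well):

```
# /etc/modprobe.d/bonding.conf
options bonding mode=802.3ad lacp_rate=fast miimon=100 xmit_hash_policy=layer3+4
```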

I should note that the speed is reported correctly in /proc/net/bonding/bond0 once the bond0 interface is up,
so this seems to be just an issue with the initial detection of the speed. At least from what I have seen so far.
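For anyone following along, the per-slave speed referred to above can be checked like this (a sketch; assumes the bond is named bond0):

```shell
# Print the per-slave Speed/Duplex lines from the bonding status file.
BOND_STATUS=/proc/net/bonding/bond0
if [ -r "$BOND_STATUS" ]; then
    status=$(grep -E 'Slave Interface|Speed|Duplex' "$BOND_STATUS")
else
    status="(no bond0 on this host)"
fi
printf '%s\n' "$status"
```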

Thanks,
Emil

Comments

Zhu Yanjun Feb. 4, 2016, 2:56 a.m. UTC | #1
Hi, Emil

Thanks for your hard work.

NETDEV_CHANGELOWERSTATE does not exist in kernel 3.14, yet my user still
ran into
"bond_mii_monitor: bond0: link status definitely up for interface eth1, 
0 Mbps full duplex".

How do you explain that?

Would you be willing to run your tests with kernel 3.14?

Thanks a lot.

Zhu Yanjun

On 02/04/2016 07:10 AM, Tantilov, Emil S wrote:
> We are seeing an occasional issue where the bonding driver may report interface up with 0 Mbps:
> bond0: link status definitely up for interface eth0, 0 Mbps full duplex
>
> So far in all the failed traces I have collected this happens on NETDEV_CHANGELOWERSTATE event:
>
> <...>-20533 [000] .... 81811.041241: ixgbe_service_task: eth1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
> <...>-20533 [000] .... 81811.041257: ixgbe_check_vf_rate_limit <-ixgbe_service_task
> <...>-20533 [000] .... 81811.041272: ixgbe_ping_all_vfs <-ixgbe_service_task
> kworker/u48:0-7503  [010] .... 81811.041345: ixgbe_get_stats64 <-dev_get_stats
> kworker/u48:0-7503  [010] .... 81811.041393: bond_netdev_event: eth1: event: 1b
> kworker/u48:0-7503  [010] .... 81811.041394: bond_netdev_event: eth1: IFF_SLAVE
> kworker/u48:0-7503  [010] .... 81811.041395: bond_netdev_event: eth1: slave->speed = ffffffff
> <...>-20533 [000] .... 81811.041407: ixgbe_ptp_overflow_check <-ixgbe_service_task
> kworker/u48:0-7503  [010] .... 81811.041407: bond_mii_monitor: bond0: link status definitely up for interface eth1, 0 Mbps full duplex
>
> As a proof of concept I added NETDEV_CHANGELOWERSTATE in bond_slave_netdev_event() along with NETDEV_UP/CHANGE:
>
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index 56b5605..a9dac4c 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -3014,6 +3014,7 @@ static int bond_slave_netdev_event(unsigned long event,
>   		break;
>   	case NETDEV_UP:
>   	case NETDEV_CHANGE:
> +	case NETDEV_CHANGELOWERSTATE:
>   		bond_update_speed_duplex(slave);
>   		if (BOND_MODE(bond) == BOND_MODE_8023AD)
>   			bond_3ad_adapter_speed_duplex_changed(slave);
>
> With this change I have not seen 0 Mbps reported by the bonding driver (around 12 hour test up to this point
> vs. 2-3 hours otherwise). Although I suppose it could also be some sort of race/timing issue with bond_mii_monitor().
>
> This test is with current bonding driver from net-next (top commit 03d84a5f83).
>
> The bond is configured as such:
>
> mode = 802.3ad
> lacp_rate = fast
> miimon = 100
> xmit_hash_policy = layer3+4
>
> I should note that the speed is reported correctly in /proc/net/bonding/bond0 once the bond0 interface is up,
> so this seems to be just an issue with the initial detection of the speed. At least from what I have seen so far.
>
> Thanks,
> Emil
>
>
Jay Vosburgh Feb. 4, 2016, 8:29 p.m. UTC | #2
Tantilov, Emil S <emil.s.tantilov@intel.com> wrote:

>We are seeing an occasional issue where the bonding driver may report interface up with 0 Mbps:
>bond0: link status definitely up for interface eth0, 0 Mbps full duplex
>
>So far in all the failed traces I have collected this happens on NETDEV_CHANGELOWERSTATE event:
>
><...>-20533 [000] .... 81811.041241: ixgbe_service_task: eth1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
><...>-20533 [000] .... 81811.041257: ixgbe_check_vf_rate_limit <-ixgbe_service_task
><...>-20533 [000] .... 81811.041272: ixgbe_ping_all_vfs <-ixgbe_service_task
>kworker/u48:0-7503  [010] .... 81811.041345: ixgbe_get_stats64 <-dev_get_stats
>kworker/u48:0-7503  [010] .... 81811.041393: bond_netdev_event: eth1: event: 1b
>kworker/u48:0-7503  [010] .... 81811.041394: bond_netdev_event: eth1: IFF_SLAVE
>kworker/u48:0-7503  [010] .... 81811.041395: bond_netdev_event: eth1: slave->speed = ffffffff
><...>-20533 [000] .... 81811.041407: ixgbe_ptp_overflow_check <-ixgbe_service_task
>kworker/u48:0-7503  [010] .... 81811.041407: bond_mii_monitor: bond0: link status definitely up for interface eth1, 0 Mbps full duplex

	Thinking about the trace again... Emil: what happens in the
trace before this?  Is there ever a call to the ixgbe_get_settings?
Does a NETDEV_UP or NETDEV_CHANGE event ever hit the bond_netdev_event
function?

	Could you describe your test that reproduces this?  I'd like to
see if I can set it up locally.

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
Tantilov, Emil S Feb. 5, 2016, 12:07 a.m. UTC | #3
>-----Original Message-----
>From: Jay Vosburgh [mailto:jay.vosburgh@canonical.com]
>Sent: Thursday, February 04, 2016 12:30 PM
>To: Tantilov, Emil S
>Cc: netdev@vger.kernel.org; gospo@cumulusnetworks.com; zhuyj;
>jiri@mellanox.com
>Subject: Re: bonding reports interface up with 0 Mbps
>
>Tantilov, Emil S <emil.s.tantilov@intel.com> wrote:
>
>>We are seeing an occasional issue where the bonding driver may report interface up with 0 Mbps:
>>bond0: link status definitely up for interface eth0, 0 Mbps full duplex
>>
>>So far in all the failed traces I have collected this happens on NETDEV_CHANGELOWERSTATE event:
>>
>><...>-20533 [000] .... 81811.041241: ixgbe_service_task: eth1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
>><...>-20533 [000] .... 81811.041257: ixgbe_check_vf_rate_limit <-ixgbe_service_task
>><...>-20533 [000] .... 81811.041272: ixgbe_ping_all_vfs <-ixgbe_service_task
>>kworker/u48:0-7503  [010] .... 81811.041345: ixgbe_get_stats64 <-dev_get_stats
>>kworker/u48:0-7503  [010] .... 81811.041393: bond_netdev_event: eth1: event: 1b
>>kworker/u48:0-7503  [010] .... 81811.041394: bond_netdev_event: eth1: IFF_SLAVE
>>kworker/u48:0-7503  [010] .... 81811.041395: bond_netdev_event: eth1: slave->speed = ffffffff
>><...>-20533 [000] .... 81811.041407: ixgbe_ptp_overflow_check <-ixgbe_service_task
>>kworker/u48:0-7503  [010] .... 81811.041407: bond_mii_monitor: bond0: link status definitely up for interface eth1, 0 Mbps full duplex
>
>	Thinking about the trace again... Emil: what happens in the
>trace before this?  Is there ever a call to the ixgbe_get_settings?
>Does a NETDEV_UP or NETDEV_CHANGE event ever hit the bond_netdev_event
>function?

Yes, there are calls to ixgbe_get_settings, but the interface is still
down at that time. I managed to trim the ftrace filters down to where
the trace comes out at a decent size, also added some additional debugging
for link_state and slave->link in bond_miimon_inspect() - see attached file.
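For anyone trying to collect a similar trace, the ftrace setup looks roughly like this (a sketch; the path is the standard debugfs tracing mount, and the wildcard filter patterns are assumptions based on the functions visible in the trace above):

```shell
# Limit the function tracer to the bonding and ixgbe symbols of interest.
TRACING=/sys/kernel/debug/tracing
if [ -w "$TRACING/set_ftrace_filter" ]; then
    echo 'bond_*'  >  "$TRACING/set_ftrace_filter"
    echo 'ixgbe_*' >> "$TRACING/set_ftrace_filter"
    echo function  >  "$TRACING/current_tracer"
    msg="tracing configured"
else
    msg="tracing not available (needs root and a mounted debugfs)"
fi
echo "$msg"
```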
 
>
>	Could you describe your test that reproduces this?  I'd like to
>see if I can set it up locally.

It is basically an up/down of the bonding interface in a loop:

ifdown bond0
ifup eth0
ifup eth1
<wait for link>
<check dmesg>

ifdown bond0 brings bond0 and eth0/1 down; ifup eth0/1 brings the ixgbe
interfaces back up, which kicks the bond0 interface up as well.
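The loop described above can be sketched as a small script (the interface names and the 5-second settle time are assumptions; adjust to the setup):

```shell
#!/bin/sh
# Flap the bond and watch for the 0 Mbps message in the kernel log.
BOND=bond0
SLAVES="eth0 eth1"

repro_once() {
    ifdown "$BOND"
    for s in $SLAVES; do ifup "$s"; done
    sleep 5                                  # <wait for link>
    dmesg | tail -n 20 | grep "0 Mbps"       # <check dmesg>
}

# Run until the message appears:
#   while ! repro_once; do :; done
```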

Thanks,
Emil
Patch

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 56b5605..a9dac4c 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3014,6 +3014,7 @@ static int bond_slave_netdev_event(unsigned long event,
 		break;
 	case NETDEV_UP:
 	case NETDEV_CHANGE:
+	case NETDEV_CHANGELOWERSTATE:
 		bond_update_speed_duplex(slave);
 		if (BOND_MODE(bond) == BOND_MODE_8023AD)
 			bond_3ad_adapter_speed_duplex_changed(slave);