Message ID | 20240503101836.32755-1-en-wei.wu@canonical.com |
---|---|
State | Changes Requested |
Headers | show |
Series | [v2,1/2] e1000e: let the sleep codes run every time | expand |
On Fri, May 03, 2024 at 06:18:36PM +0800, Ricky Wu wrote: > As described in https://bugzilla.kernel.org/show_bug.cgi?id=218642, > Intel I219-LM reports link up -> link down -> link up after hot-plugging > the Ethernet cable. Please could you quote some parts of 802.3 which state this is a problem. How is this breaking the standard. Andrew
On 07/05/2024 15:31, Andrew Lunn wrote: > On Fri, May 03, 2024 at 06:18:36PM +0800, Ricky Wu wrote: >> As described in https://bugzilla.kernel.org/show_bug.cgi?id=218642, >> Intel I219-LM reports link up -> link down -> link up after hot-plugging >> the Ethernet cable. > > Please could you quote some parts of 802.3 which state this is a > problem. How is this breaking the standard. > > Andrew In I219-* parts used LSI PHY. This PHY is compliant with the 802.3 IEEE standard if I recall correctly. Auto-negotiation and link establishment are processed following the IEEE standard and could vary from platform to platform but are not violent to the IEEE standard. En-Wei, My recommendation is not to accept these patches. If you think there is a HW/PHY problem - open a ticket on Intel PAE. Sasha
On 08/05/2024 8:05, Sasha Neftin wrote: > On 07/05/2024 15:31, Andrew Lunn wrote: >> On Fri, May 03, 2024 at 06:18:36PM +0800, Ricky Wu wrote: >>> As described in https://bugzilla.kernel.org/show_bug.cgi?id=218642, >>> Intel I219-LM reports link up -> link down -> link up after hot-plugging >>> the Ethernet cable. >> >> Please could you quote some parts of 802.3 which state this is a >> problem. How is this breaking the standard. >> >> Andrew > > In I219-* parts used LSI PHY. This PHY is compliant with the 802.3 IEEE > standard if I recall correctly. Auto-negotiation and link establishment > are processed following the IEEE standard and could vary from platform > to platform but are not violent to the IEEE standard. > > En-Wei, My recommendation is not to accept these patches. If you think > there is a HW/PHY problem - open a ticket on Intel PAE. > > Sasha I concur. I am wary of changing the behavior of some driver fundamentals, to satisfy a particular validation/certification flow, if there is no real functionality problem. It can open a big Pandora box. Checking the Bugzilla report again, I am not sure we understand the issue fully: [ 143.141006] e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Half Duplex, Flow Control: None [ 143.144878] e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Down [ 146.838980] e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None This looks like a very quick link "flap", following by proper link establishment ~3.7 seconds later. These ~3.7 seconds are in line of what link auto-negotiation would take (auto-negotiation is the default mode for this driver). The first print (1000 Mbps Half Duplex) actually makes no sense - it cannot be real link status since 1000/Half is not a supported speed. So it seems to me that actually the first "link up" is an incorrect/incomplete/premature reading, not the "link down". --Dima
On Thu, May 09, 2024 at 12:13:27PM +0300, Ruinskiy, Dima wrote: > On 08/05/2024 8:05, Sasha Neftin wrote: > > On 07/05/2024 15:31, Andrew Lunn wrote: > > > On Fri, May 03, 2024 at 06:18:36PM +0800, Ricky Wu wrote: > > > > As described in https://bugzilla.kernel.org/show_bug.cgi?id=218642, > > > > Intel I219-LM reports link up -> link down -> link up after hot-plugging > > > > the Ethernet cable. > > > > > > Please could you quote some parts of 802.3 which state this is a > > > problem. How is this breaking the standard. > > > > > > Andrew > > > > In I219-* parts used LSI PHY. This PHY is compliant with the 802.3 IEEE > > standard if I recall correctly. Auto-negotiation and link establishment > > are processed following the IEEE standard and could vary from platform > > to platform but are not violent to the IEEE standard. > > > > En-Wei, My recommendation is not to accept these patches. If you think > > there is a HW/PHY problem - open a ticket on Intel PAE. > > > > Sasha > > I concur. I am wary of changing the behavior of some driver fundamentals, to > satisfy a particular validation/certification flow, if there is no real > functionality problem. It can open a big Pandora box. > > Checking the Bugzilla report again, I am not sure we understand the issue > fully: > > [ 143.141006] e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Half > Duplex, Flow Control: None > [ 143.144878] e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Down > [ 146.838980] e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full > Duplex, Flow Control: None > > This looks like a very quick link "flap", following by proper link > establishment ~3.7 seconds later. These ~3.7 seconds are in line of what > link auto-negotiation would take (auto-negotiation is the default mode for > this driver). That actually seems slow. It is normally a little over 1 second. I forget the exact number. But is the PHY being polled once a second, rather than being interrupt driven? > The first print (1000 Mbps Half Duplex) actually makes no > sense - it cannot be real link status since 1000/Half is not a supported > speed. It would be interesting to see what the link partner sees. What does it think the I219-LM is advertising? Is it advertising 1000BaseT_Half? But why would auto-neg resolve to that if 1000BaseT_Full is available? > So it seems to me that actually the first "link up" is an > incorrect/incomplete/premature reading, not the "link down". Agreed. Root cause this, which looks like a real problem, rather than apply a band-aid for a test system. Andrew
> En-Wei, My recommendation is not to accept these patches. If you think > there is a HW/PHY problem - open a ticket on Intel PAE. > I concur. I am wary of changing the behavior of some driver > fundamentals, to satisfy a particular validation/certification flow, if > there is no real functionality problem. It can open a big Pandora box. OK. Thanks for your help. I think we can end this patchset now. > It is normally a little over 1 second. I > forget the exact number. But is the PHY being polled once a second, > rather than being interrupt driven? If I read the code correctly, the PHY is polled every 2 seconds by the e1000e watchdog. But if an interrupt occurs and it's a link-status-change interrupt, the watchdog will be called immediately and the PHY is polled. > What does it think the I219-LM is advertising? Is it advertising 1000BaseT_Half? > But why would auto-neg resolve to that if 1000BaseT_Full is available? I'm also interested in it. I'll do some checking later to see what's advertising by us and the link partner. > Agreed. Root cause this, which looks like a real problem, rather than > apply a band-aid for a test system. OK. I think there is a clue which is related to auto-negotiation. I'll work on it later. Thank all of you for your help, I really appreciate it. On Thu, 9 May 2024 at 15:46, Andrew Lunn <andrew@lunn.ch> wrote: > > On Thu, May 09, 2024 at 12:13:27PM +0300, Ruinskiy, Dima wrote: > > On 08/05/2024 8:05, Sasha Neftin wrote: > > > On 07/05/2024 15:31, Andrew Lunn wrote: > > > > On Fri, May 03, 2024 at 06:18:36PM +0800, Ricky Wu wrote: > > > > > As described in https://bugzilla.kernel.org/show_bug.cgi?id=218642, > > > > > Intel I219-LM reports link up -> link down -> link up after hot-plugging > > > > > the Ethernet cable. > > > > > > > > Please could you quote some parts of 802.3 which state this is a > > > > problem. How is this breaking the standard. > > > > > > > > Andrew > > > > > > In I219-* parts used LSI PHY. This PHY is compliant with the 802.3 IEEE > > > standard if I recall correctly. Auto-negotiation and link establishment > > > are processed following the IEEE standard and could vary from platform > > > to platform but are not violent to the IEEE standard. > > > > > > En-Wei, My recommendation is not to accept these patches. If you think > > > there is a HW/PHY problem - open a ticket on Intel PAE. > > > > > > Sasha > > > > I concur. I am wary of changing the behavior of some driver fundamentals, to > > satisfy a particular validation/certification flow, if there is no real > > functionality problem. It can open a big Pandora box. > > > > Checking the Bugzilla report again, I am not sure we understand the issue > > fully: > > > > [ 143.141006] e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Half > > Duplex, Flow Control: None > > [ 143.144878] e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Down > > [ 146.838980] e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full > > Duplex, Flow Control: None > > > > This looks like a very quick link "flap", following by proper link > > establishment ~3.7 seconds later. These ~3.7 seconds are in line of what > > link auto-negotiation would take (auto-negotiation is the default mode for > > this driver). > > That actually seems slow. It is normally a little over 1 second. I > forget the exact number. But is the PHY being polled once a second, > rather than being interrupt driven? > > > The first print (1000 Mbps Half Duplex) actually makes no > > sense - it cannot be real link status since 1000/Half is not a supported > > speed. > > It would be interesting to see what the link partner sees. What does > it think the I219-LM is advertising? Is it advertising 1000BaseT_Half? > But why would auto-neg resolve to that if 1000BaseT_Full is available? > > > So it seems to me that actually the first "link up" is an > > incorrect/incomplete/premature reading, not the "link down". > > Agreed. Root cause this, which looks like a real problem, rather than > apply a band-aid for a test system. > > Andrew
On 09/05/2024 16:46, Andrew Lunn wrote: > On Thu, May 09, 2024 at 12:13:27PM +0300, Ruinskiy, Dima wrote: >> On 08/05/2024 8:05, Sasha Neftin wrote: >>> On 07/05/2024 15:31, Andrew Lunn wrote: >>>> On Fri, May 03, 2024 at 06:18:36PM +0800, Ricky Wu wrote: >>>>> As described in https://bugzilla.kernel.org/show_bug.cgi?id=218642, >>>>> Intel I219-LM reports link up -> link down -> link up after hot-plugging >>>>> the Ethernet cable. >>>> >>>> Please could you quote some parts of 802.3 which state this is a >>>> problem. How is this breaking the standard. >>>> >>>> Andrew >>> >>> In I219-* parts used LSI PHY. This PHY is compliant with the 802.3 IEEE >>> standard if I recall correctly. Auto-negotiation and link establishment >>> are processed following the IEEE standard and could vary from platform >>> to platform but are not violent to the IEEE standard. >>> >>> En-Wei, My recommendation is not to accept these patches. If you think >>> there is a HW/PHY problem - open a ticket on Intel PAE. >>> >>> Sasha >> >> I concur. I am wary of changing the behavior of some driver fundamentals, to >> satisfy a particular validation/certification flow, if there is no real >> functionality problem. It can open a big Pandora box. >> >> Checking the Bugzilla report again, I am not sure we understand the issue >> fully: >> >> [ 143.141006] e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Half >> Duplex, Flow Control: None >> [ 143.144878] e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Down >> [ 146.838980] e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full >> Duplex, Flow Control: None >> >> This looks like a very quick link "flap", following by proper link >> establishment ~3.7 seconds later. These ~3.7 seconds are in line of what >> link auto-negotiation would take (auto-negotiation is the default mode for >> this driver). > > That actually seems slow. It is normally a little over 1 second. I > forget the exact number. But is the PHY being polled once a second, > rather than being interrupt driven? > >> The first print (1000 Mbps Half Duplex) actually makes no >> sense - it cannot be real link status since 1000/Half is not a supported >> speed. > > It would be interesting to see what the link partner sees. What does > it think the I219-LM is advertising? Is it advertising 1000BaseT_Half? i219 parts come with LSI PHY. 1000BASE-T half-duplex is not supported. 1000BASET half-duplex not advertised in IEEE 1000BASE-T Control Register 9. > But why would auto-neg resolve to that if 1000BaseT_Full is available? > >> So it seems to me that actually the first "link up" is an >> incorrect/incomplete/premature reading, not the "link down". > > Agreed. Root cause this, which looks like a real problem, rather than > apply a band-aid for a test system. > > Andrew
> > It would be interesting to see what the link partner sees. What does > > it think the I219-LM is advertising? Is it advertising 1000BaseT_Half? > > i219 parts come with LSI PHY. 1000BASE-T half-duplex is not supported. > 1000BASET half-duplex not advertised in IEEE 1000BASE-T Control Register 9. That is the theory. But in practice? What does the link partner really see? I've come across systems which get advertisement wrong. However, in that case, i suspect it is the software above the PHY, not the PHY itself which was wrong. Andrew
On 10/05/2024 15:32, Andrew Lunn wrote: >>> It would be interesting to see what the link partner sees. What does >>> it think the I219-LM is advertising? Is it advertising 1000BaseT_Half? >> >> i219 parts come with LSI PHY. 1000BASE-T half-duplex is not supported. >> 1000BASET half-duplex not advertised in IEEE 1000BASE-T Control Register 9. > > That is the theory. But in practice? What does the link partner really > see? I've come across systems which get advertisement wrong. However, > in that case, i suspect it is the software above the PHY, not the PHY > itself which was wrong. > > Andrew Not only in the theory. On the link partner IEEE 1000BASE-T Status Register - Address 10, Link Partner 1000BASE-T Half-Duplex Capability is 0. (1000BAS-T HD not advertised). I believe here is some false positive indication during the auto-negotiation process.(this is not related to the link fluctuation) Sasha
diff --git a/drivers/net/ethernet/intel/e1000e/ich8lan.c b/drivers/net/ethernet/intel/e1000e/ich8lan.c index f9e94be36e97..68f5698a22b0 100644 --- a/drivers/net/ethernet/intel/e1000e/ich8lan.c +++ b/drivers/net/ethernet/intel/e1000e/ich8lan.c @@ -1428,7 +1428,17 @@ static s32 e1000_check_for_copper_link_ich8lan(struct e1000_hw *hw) * link. If so, then we want to get the current speed/duplex * of the PHY. */ - ret_val = e1000e_phy_has_link_generic(hw, 1, 0, &link); + /* We've seen that I219-LM sometimes has link fluctuations + * (link up -> link down -> link up) after hot-plugging the cable. + * The problem is caused by the instability of the Link Status bit + * (BMSR_LSTATUS) in MII Status Register. The average time between + * the first link up and link down is between 3~4 ms. + * Increasing the iteration times and setting up the delay to + * 100ms (which is safe) solves the problem. + * This behavior hasn't been seen on other NICs and also not being + * documented in datasheet/errata. + */ + ret_val = e1000e_phy_has_link_generic(hw, COPPER_LINK_UP_LIMIT, 100000, &link); if (ret_val) goto out;
As described in https://bugzilla.kernel.org/show_bug.cgi?id=218642, Intel I219-LM reports link up -> link down -> link up after hot-plugging the Ethernet cable. The problem is because the unstable behavior of Link Status bit in PHY Status Register of some e1000e NIC. When we re-plug the cable, the e1000e_phy_has_link_generic() (called after the Link-Status-Changed interrupt) has read this bit with 1->0->1 (1=link up, 0=link down) and e1000e reports it to net device layer respectively. This patch solves the problem by passing polling delays on e1000e_phy_has_link_generic() so that it will not get the unstable states of Link Status bit. Link: https://bugzilla.kernel.org/show_bug.cgi?id=218642 Fixes: 7d3cabbcc86 ("e1000e: disable K1 at 1000Mbps for 82577/82578") Signed-off-by: Ricky Wu <en-wei.wu@canonical.com> --- In v2: * Split the sleep codes part into PATCHSET [1/2] --- drivers/net/ethernet/intel/e1000e/ich8lan.c | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-)