Message ID | 20151121223627.GA3493@electric-eye.fr.zoreil.com |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
On Sat, Nov 21, 2015 at 11:36:27PM +0100, Francois Romieu wrote: > Francois Romieu <romieu@fr.zoreil.com> : > [...] > > If you can crash your system at will, you may apply the patch below to > da78dbff2e05630921c551dbbc70a4b7981a8fff ("r8169: remove work from irq > handler.") parent (aka 1e874e041fc7c222cbd85b20c4406070be1f687a) and > build it in a current tree (say 4.2). No problem crashing the machine at present. It is an internal test machine which I am using to chase this issue at present. As I understand the above you would like me to get r8169.c from 1e874e041fc7c222cbd85b20c4406070be1f687a, apply the patch, drop it into kernel 4.2 and run with that. That's fine. I'm a little confused though as to which patch you want me to apply. There was an inline patch against r8169.c in your message, and then there was another patch to r8169.c in the form of an attachment. Both patches removed the include of asm/system.h but the rest of the content differs. Did you want each tried in turn? Apologies if I'm missing something obvious. > How much memory and CPU may I rely on in your test computer ? Since the problem can be triggered without me having to run the full data acquisition system, the machine is basically unloaded (it doesn't take many resources to send/receive 6 udp packets at a time :-) ). The CPU is an i7-860 CPU at 2.8 GHz, with 4 GB of RAM fitted and 8 GB swap. The kernel and userspace are 32-bit, so we have the usual 3 GB per-process memory limit. Regards jonathan -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Jonathan Woithe <jwoithe@atrad.com.au> : [...] > I'm a little confused though as to which patch you want me to apply. There > was an inline patch against r8169.c in your message, and then there was > another patch to r8169.c in the form of an attachment. Both patches removed > the include of asm/system.h but the rest of the content differs. Did you > want each tried in turn? Apologies if I'm missing something obvious. Use the inlined one. I forgot to remove the 2.6.35.11 based attachment.
On Mon, Nov 23, 2015 at 12:02:44AM +0100, Francois Romieu wrote: > Jonathan Woithe <jwoithe@atrad.com.au> : > [...] > > I'm a little confused though as to which patch you want me to apply. There > > was an inline patch against r8169.c in your message, and then there was > > another patch to r8169.c in the form of an attachment. Both patches removed > > the include of asm/system.h but the rest of the content differs. Did you > > want each tried in turn? Apologies if I'm missing something obvious. > > Use the inlined one. I forgot to remove the 2.6.35.11 based attachment. Ok, no problem. That's done now and it appears to work. That is, I am not seeing any of the erroneous UDP packet delivery problems noted previously and therefore the communication with the external device is working normally. I dropped the patched r8169.c into a 4.3.0 kernel since that's what I already had on the system. Regards jonathan -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Nov 23, 2015 at 12:02:44AM +0100, Francois Romieu wrote: > Jonathan Woithe <jwoithe@atrad.com.au> : > [...] > > I'm a little confused though as to which patch you want me to apply. There > > was an inline patch against r8169.c in your message, and then there was > > another patch to r8169.c in the form of an attachment. Both patches removed > > the include of asm/system.h but the rest of the content differs. Did you > > want each tried in turn? Apologies if I'm missing something obvious. > > Use the inlined one. I forgot to remove the 2.6.35.11 based attachment. Any thoughts or progress at this stage? Are there further tests you need me to do? To recap, the inlined patch against 4.3 resolved the problem. Regards jonathan -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Jonathan Woithe <jwoithe@atrad.com.au> : [...] > Any thoughts or progress at this stage? Are there further tests you need me > to do ? Yes but you should expect two more days without signal.
Hi Francois On Wed, Dec 02, 2015 at 12:58:52AM +0100, Francois Romieu wrote: > Jonathan Woithe <jwoithe@atrad.com.au> : > [...] > > Any thoughts or progress at this stage? Are there further tests you need me > > to do ? > > Yes but you should expect two more days without signal. FYI I am now back from annual leave and linux.conf.au. This means I am in a position to test possible solutions to this problem once you are able to make them available. Regards jonathan
On Mon, Feb 08, 2016 at 01:03:19PM +1030, Jonathan Woithe wrote: > On Wed, Dec 02, 2015 at 12:58:52AM +0100, Francois Romieu wrote: > > Jonathan Woithe <jwoithe@atrad.com.au> : > > [...] > > > Any thoughts or progress at this stage? Are there further tests you need me > > > to do ? > > > > Yes but you should expect two more days without signal. > > FYI I am now back from annual leave and linux.conf.au. This means I am in > a position to test possible solutions to this problem once you are able to > make them available. Has there been any further progress on this problem? I still have access to the hardware and systems if further tests are required. Regards jonathan
On Thu, Apr 07, 2016 at 12:14:18PM +0930, Jonathan Woithe wrote: > On Mon, Feb 08, 2016 at 01:03:19PM +1030, Jonathan Woithe wrote: > > On Wed, Dec 02, 2015 at 12:58:52AM +0100, Francois Romieu wrote: > > > Jonathan Woithe <jwoithe@atrad.com.au> : > > > [...] > > > > Any thoughts or progress at this stage? Are there further tests you need me > > > > to do ? > > > > > > Yes but you should expect two more days without signal. > > > > FYI I am now back from annual leave and linux.conf.au. This means I am in > > a position to test possible solutions to this problem once you are able to > > make them available. > > Has there been any further progress on this problem? I still have access to > the hardware and systems if further tests are required. Any further progress? Regards jonathan
On Thu, Apr 07, 2016 at 12:14:18PM +0930, Jonathan Woithe wrote: > On Mon, Feb 08, 2016 at 01:03:19PM +1030, Jonathan Woithe wrote: > > On Wed, Dec 02, 2015 at 12:58:52AM +0100, Francois Romieu wrote: > > > Jonathan Woithe <jwoithe@atrad.com.au> : > > > [...] > > > > Any thoughts or progress at this stage? Are there further tests you need me > > > > to do ? > > > > > > Yes but you should expect two more days without signal. > > > > FYI I am now back from annual leave and linux.conf.au. This means I am in > > a position to test possible solutions to this problem once you are able to > > make them available. > > Has there been any further progress on this problem? I still have access to > the hardware and systems if further tests are required. To assist in picking up this issue I thought I'd summarise where things are at from my perspective. The problem is that small low-speed UDP packets are being dropped by the r8169-based network card in our systems. Git bisect showed that the regression started with commit da78dbff2e05630921c551dbbc70a4b7981a8fff. On 23 Nov 2015 Francois Romieu requested: > If you can crash your system at will, you may apply the patch below to > da78dbff2e05630921c551dbbc70a4b7981a8fff ("r8169: remove work from irq > handler.") parent (aka 1e874e041fc7c222cbd85b20c4406070be1f687a) and > build it in a current tree (say 4.2). I did this (in a 4.3 tree) and the regression was resolved. Following the communuication on 2 Dec 2015 indicating there would be further news in a couple of days, I haven't head anything else on the subject. I am keen to resolve this problem as soon as possible as it will allow us to move systems away from the ancient 2.6 kernel we are currently forced to use as a result of this UDP regression (changing the network card for something based on another chipset is not an option). If there are further tests needed to progress a fix please let me know as I still have an offline system I can use for testing. Regards jonathan
On Wed, May 18, 2016 at 04:21:37PM +0930, Jonathan Woithe wrote: > On Thu, Apr 07, 2016 at 12:14:18PM +0930, Jonathan Woithe wrote: > > On Mon, Feb 08, 2016 at 01:03:19PM +1030, Jonathan Woithe wrote: > > > On Wed, Dec 02, 2015 at 12:58:52AM +0100, Francois Romieu wrote: > > > > Jonathan Woithe <jwoithe@atrad.com.au> : > > > > [...] > > > > > Any thoughts or progress at this stage? Are there further tests you need me > > > > > to do ? > > > > > > > > Yes but you should expect two more days without signal. > > > > > > FYI I am now back from annual leave and linux.conf.au. This means I am in > > > a position to test possible solutions to this problem once you are able to > > > make them available. > > > > Has there been any further progress on this problem? I still have access to > > the hardware and systems if further tests are required. > > To assist in picking up this issue I thought I'd summarise where things are > at from my perspective. > > The problem is that small low-speed UDP packets are being dropped by the > r8169-based network card in our systems. Git bisect showed that the > regression started with commit da78dbff2e05630921c551dbbc70a4b7981a8fff. > : FYI I have just tested the kernel.org 4.6 kernel and unsurprisingly it still exhibits the regression. Is there anything more I can do to move this forward? I really do not want to limit our systems to old kernels[1] just to avoid this regression, but that is what I have to do while the regression remains. Regards jonathan [1] We are currently on 2.6.35.11 with some additional device driver patches we need for our hardware (which have subsequently been merged into mainline). The regression is present in all kernels starting with 3.3 which means the latest kernel we can use at present is 3.2.x.
Hi all On Wed, Jun 01, 2016 at 10:01:23AM +0930, Jonathan Woithe wrote: > On Wed, May 18, 2016 at 04:21:37PM +0930, Jonathan Woithe wrote: > > On Thu, Apr 07, 2016 at 12:14:18PM +0930, Jonathan Woithe wrote: > > > On Mon, Feb 08, 2016 at 01:03:19PM +1030, Jonathan Woithe wrote: > > > > On Wed, Dec 02, 2015 at 12:58:52AM +0100, Francois Romieu wrote: > > > > > Jonathan Woithe <jwoithe@atrad.com.au> : > > > > > [...] > > > > > > Any thoughts or progress at this stage? Are there further tests you need me > > > > > > to do ? > > > > > > > > > > Yes but you should expect two more days without signal. > > > > > > > > FYI I am now back from annual leave and linux.conf.au. This means I am in > > > > a position to test possible solutions to this problem once you are able to > > > > make them available. > > > > > > Has there been any further progress on this problem? I still have access to > > > the hardware and systems if further tests are required. > > > > To assist in picking up this issue I thought I'd summarise where things are > > at from my perspective. > > > > The problem is that small low-speed UDP packets are being dropped by the > > r8169-based network card in our systems. Git bisect showed that the > > regression started with commit da78dbff2e05630921c551dbbc70a4b7981a8fff. > > : > > FYI I have just tested the kernel.org 4.6 kernel and unsurprisingly it still > exhibits the regression. Is there any chance that this regression can be resolved? It's been 6 months since the last contact was received from the list in relation to this issue. If the r8169 driver is to remain broken with respect to UDP traffic then we will have no choice but to factor in a change in our standard hardware for future systems. Unfortunately this also means that dozens of systems in the field cannot be upgraded to recent kernels since doing so will trigger the regression.[1] I would very much like to see the r8169 regression fixed and remain ready to run more tests as needed. I still have the test rig in place to recreate the situation which triggers it, although it's likely the hardware will be redeployed in the next few months. If the decision has been made to leave the regression unfixed, please let me know. A least we will then understand the situation and we will stop waiting for a solution which will never appear and start looking at changing our standard hardware build. For various reasons this is a significant undertaking which we would rather not do and it has implications for ongoing support of existing systems. Obviously though if there's a bug in the r8169 kernel driver which will never be fixed then this is the approach we will regrettably have to take. Regards jonathan [1] Exchanging network cards in existing systems just to avoid a kernel regression is not a viable option because there are commercial software licences node locked to the existing cards. Finding a non-r8169 PCI card for these systems may also prove to be a challenge.
--- r8169.c 2015-11-21 23:02:10.435275753 +0100 +++ r8169.c 2015-11-21 23:21:49.429554012 +0100 @@ -29,7 +29,6 @@ #include <linux/pci-aspm.h> #include <linux/prefetch.h> -#include <asm/system.h> #include <asm/io.h> #include <asm/irq.h> @@ -1616,7 +1615,7 @@ static int rtl8169_set_features(struct n else tp->cp_cmd &= ~RxChkSum; - if (dev->features & NETIF_F_HW_VLAN_RX) + if (dev->features & NETIF_F_HW_VLAN_CTAG_RX) tp->cp_cmd |= RxVlan; else tp->cp_cmd &= ~RxVlan; @@ -1632,8 +1631,8 @@ static int rtl8169_set_features(struct n static inline u32 rtl8169_tx_vlan_tag(struct rtl8169_private *tp, struct sk_buff *skb) { - return (vlan_tx_tag_present(skb)) ? - TxVlanTag | swab16(vlan_tx_tag_get(skb)) : 0x00; + return (skb_vlan_tag_present(skb)) ? + TxVlanTag | swab16(skb_vlan_tag_get(skb)) : 0x00; } static void rtl8169_rx_vlan_tag(struct RxDesc *desc, struct sk_buff *skb) @@ -1641,7 +1640,7 @@ static void rtl8169_rx_vlan_tag(struct R u32 opts2 = le32_to_cpu(desc->opts2); if (opts2 & RxVlanTag) - __vlan_hwaccel_put_tag(skb, swab16(opts2 & 0xffff)); + __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), swab16(opts2 & 0xffff)); desc->opts2 = 0; } @@ -3508,7 +3507,7 @@ static const struct net_device_ops rtl81 }; -static void __devinit rtl_init_mdio_ops(struct rtl8169_private *tp) +static void rtl_init_mdio_ops(struct rtl8169_private *tp) { struct mdio_ops *ops = &tp->mdio_ops; @@ -3725,7 +3724,7 @@ static void rtl_pll_power_up(struct rtl8 rtl_generic_op(tp, tp->pll_power_ops.up); } -static void __devinit rtl_init_pll_power_ops(struct rtl8169_private *tp) +static void rtl_init_pll_power_ops(struct rtl8169_private *tp) { struct pll_power_ops *ops = &tp->pll_power_ops; @@ -3905,7 +3904,7 @@ static void r8168b_1_hw_jumbo_disable(st RTL_W8(Config4, RTL_R8(Config4) & ~(1 << 0)); } -static void __devinit rtl_init_jumbo_ops(struct rtl8169_private *tp) +static void rtl_init_jumbo_ops(struct rtl8169_private *tp) { struct jumbo_ops *ops = &tp->jumbo_ops; @@ -3971,7 +3970,7 @@ static void rtl_hw_reset(struct rtl8169_ } } -static int __devinit +static int rtl8169_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) { const struct rtl_cfg_info *cfg = rtl_cfg_infos + ent->driver_data; @@ -4137,7 +4136,7 @@ rtl8169_init_one(struct pci_dev *pdev, c dev->dev_addr[i] = RTL_R8(MAC0 + i); memcpy(dev->perm_addr, dev->dev_addr, dev->addr_len); - SET_ETHTOOL_OPS(dev, &rtl8169_ethtool_ops); + dev->ethtool_ops = &rtl8169_ethtool_ops; dev->watchdog_timeo = RTL8169_TX_TIMEOUT; dev->irq = pdev->irq; dev->base_addr = (unsigned long) ioaddr; @@ -4147,16 +4146,16 @@ rtl8169_init_one(struct pci_dev *pdev, c /* don't enable SG, IP_CSUM and TSO by default - it might not work * properly for all devices */ dev->features |= NETIF_F_RXCSUM | - NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX; + NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX; dev->hw_features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO | - NETIF_F_RXCSUM | NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX; + NETIF_F_RXCSUM | NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX; dev->vlan_features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO | NETIF_F_HIGHDMA; if (tp->mac_version == RTL_GIGA_MAC_VER_05) /* 8110SCd requires hardware Rx VLAN - disallow toggling */ - dev->hw_features &= ~NETIF_F_HW_VLAN_RX; + dev->hw_features &= ~NETIF_F_HW_VLAN_CTAG_RX; tp->intr_mask = 0xffff; tp->hw_start = cfg->hw_start; @@ -4217,7 +4216,7 @@ err_out_free_dev_1: goto out; } -static void __devexit rtl8169_remove_one(struct pci_dev *pdev) +static void rtl8169_remove_one(struct pci_dev *pdev) { struct net_device *dev = pci_get_drvdata(pdev); struct rtl8169_private *tp = netdev_priv(dev); @@ -6218,20 +6217,9 @@ static struct pci_driver rtl8169_pci_dri .name = MODULENAME, .id_table = rtl8169_pci_tbl, .probe = rtl8169_init_one, - .remove = __devexit_p(rtl8169_remove_one), + .remove = rtl8169_remove_one, .shutdown = rtl_shutdown, .driver.pm = RTL8169_PM_OPS, }; -static int __init rtl8169_init_module(void) -{ - return pci_register_driver(&rtl8169_pci_driver); -} - -static void __exit rtl8169_cleanup_module(void) -{ - pci_unregister_driver(&rtl8169_pci_driver); -} - -module_init(rtl8169_init_module); -module_exit(rtl8169_cleanup_module); +module_pci_driver(rtl8169_pci_driver);