diff mbox

r8169 regression: UDP packets dropped intermittantly

Message ID 20151121223627.GA3493@electric-eye.fr.zoreil.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Francois Romieu Nov. 21, 2015, 10:36 p.m. UTC
Francois Romieu <romieu@fr.zoreil.com> :
[...]

If you can crash your system at will, you may apply the patch below to
da78dbff2e05630921c551dbbc70a4b7981a8fff ("r8169: remove work from irq
handler.") parent (aka 1e874e041fc7c222cbd85b20c4406070be1f687a) and
build it in a current tree (say 4.2).

If it works it will be easier for me to compare behaviors in current
tree (especially as none of my current test boxes wants to boot with
my own ga311).

How much memory and CPU may I rely on in your test computer ?

Comments

Jonathan Woithe Nov. 22, 2015, 10:50 p.m. UTC | #1
On Sat, Nov 21, 2015 at 11:36:27PM +0100, Francois Romieu wrote:
> Francois Romieu <romieu@fr.zoreil.com> :
> [...]
> 
> If you can crash your system at will, you may apply the patch below to
> da78dbff2e05630921c551dbbc70a4b7981a8fff ("r8169: remove work from irq
> handler.") parent (aka 1e874e041fc7c222cbd85b20c4406070be1f687a) and
> build it in a current tree (say 4.2).

No problem crashing the machine at present.  It is an internal test machine
which I am using to chase this issue at present.

As I understand the above you would like me to get r8169.c from
1e874e041fc7c222cbd85b20c4406070be1f687a, apply the patch, drop it into
kernel 4.2 and run with that.  That's fine.

I'm a little confused though as to which patch you want me to apply.  There
was an inline patch against r8169.c in your message, and then there was
another patch to r8169.c in the form of an attachment.  Both patches removed
the include of asm/system.h but the rest of the content differs.  Did you
want each tried in turn?  Apologies if I'm missing something obvious.

> How much memory and CPU may I rely on in your test computer ?

Since the problem can be triggered without me having to run the full data
acquisition system, the machine is basically unloaded (it doesn't take many
resources to send/receive 6 udp packets at a time :-) ).  The CPU is an
i7-860 CPU at 2.8 GHz, with 4 GB of RAM fitted and 8 GB swap.  The kernel
and userspace are 32-bit, so we have the usual 3 GB per-process memory
limit.

Regards
  jonathan
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Francois Romieu Nov. 22, 2015, 11:02 p.m. UTC | #2
Jonathan Woithe <jwoithe@atrad.com.au> :
[...]
> I'm a little confused though as to which patch you want me to apply.  There
> was an inline patch against r8169.c in your message, and then there was
> another patch to r8169.c in the form of an attachment.  Both patches removed
> the include of asm/system.h but the rest of the content differs.  Did you
> want each tried in turn?  Apologies if I'm missing something obvious.

Use the inlined one. I forgot to remove the 2.6.35.11 based attachment.
Jonathan Woithe Nov. 23, 2015, 12:07 a.m. UTC | #3
On Mon, Nov 23, 2015 at 12:02:44AM +0100, Francois Romieu wrote:
> Jonathan Woithe <jwoithe@atrad.com.au> :
> [...]
> > I'm a little confused though as to which patch you want me to apply.  There
> > was an inline patch against r8169.c in your message, and then there was
> > another patch to r8169.c in the form of an attachment.  Both patches removed
> > the include of asm/system.h but the rest of the content differs.  Did you
> > want each tried in turn?  Apologies if I'm missing something obvious.
> 
> Use the inlined one. I forgot to remove the 2.6.35.11 based attachment.

Ok, no problem.

That's done now and it appears to work.  That is, I am not seeing any of the
erroneous UDP packet delivery problems noted previously and therefore the
communication with the external device is working normally.

I dropped the patched r8169.c into a 4.3.0 kernel since that's what I
already had on the system.

Regards
  jonathan
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jonathan Woithe Nov. 30, 2015, 6:42 a.m. UTC | #4
On Mon, Nov 23, 2015 at 12:02:44AM +0100, Francois Romieu wrote:
> Jonathan Woithe <jwoithe@atrad.com.au> :
> [...]
> > I'm a little confused though as to which patch you want me to apply.  There
> > was an inline patch against r8169.c in your message, and then there was
> > another patch to r8169.c in the form of an attachment.  Both patches removed
> > the include of asm/system.h but the rest of the content differs.  Did you
> > want each tried in turn?  Apologies if I'm missing something obvious.
> 
> Use the inlined one. I forgot to remove the 2.6.35.11 based attachment.

Any thoughts or progress at this stage?  Are there further tests you need me
to do?  To recap, the inlined patch against 4.3 resolved the problem.

Regards
  jonathan
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Francois Romieu Dec. 1, 2015, 11:58 p.m. UTC | #5
Jonathan Woithe <jwoithe@atrad.com.au> :
[...]
> Any thoughts or progress at this stage?  Are there further tests you need me
> to do ?

Yes but you should expect two more days without signal.
Jonathan Woithe Feb. 8, 2016, 2:33 a.m. UTC | #6
Hi Francois

On Wed, Dec 02, 2015 at 12:58:52AM +0100, Francois Romieu wrote:
> Jonathan Woithe <jwoithe@atrad.com.au> :
> [...]
> > Any thoughts or progress at this stage?  Are there further tests you need me
> > to do ?
> 
> Yes but you should expect two more days without signal.

FYI I am now back from annual leave and linux.conf.au.  This means I am in a
position to test possible solutions to this problem once you are able to
make them available.

Regards
  jonathan
Jonathan Woithe April 7, 2016, 2:44 a.m. UTC | #7
On Mon, Feb 08, 2016 at 01:03:19PM +1030, Jonathan Woithe wrote:
> On Wed, Dec 02, 2015 at 12:58:52AM +0100, Francois Romieu wrote:
> > Jonathan Woithe <jwoithe@atrad.com.au> :
> > [...]
> > > Any thoughts or progress at this stage?  Are there further tests you need me
> > > to do ?
> > 
> > Yes but you should expect two more days without signal.
> 
> FYI I am now back from annual leave and linux.conf.au.  This means I am in
> a position to test possible solutions to this problem once you are able to
> make them available.

Has there been any further progress on this problem?  I still have access to
the hardware and systems if further tests are required.

Regards
  jonathan
Jonathan Woithe May 17, 2016, 6:52 a.m. UTC | #8
On Thu, Apr 07, 2016 at 12:14:18PM +0930, Jonathan Woithe wrote:
> On Mon, Feb 08, 2016 at 01:03:19PM +1030, Jonathan Woithe wrote:
> > On Wed, Dec 02, 2015 at 12:58:52AM +0100, Francois Romieu wrote:
> > > Jonathan Woithe <jwoithe@atrad.com.au> :
> > > [...]
> > > > Any thoughts or progress at this stage?  Are there further tests you need me
> > > > to do ?
> > > 
> > > Yes but you should expect two more days without signal.
> > 
> > FYI I am now back from annual leave and linux.conf.au.  This means I am in
> > a position to test possible solutions to this problem once you are able to
> > make them available.
> 
> Has there been any further progress on this problem?  I still have access to
> the hardware and systems if further tests are required.

Any further progress?

Regards
  jonathan
Jonathan Woithe May 18, 2016, 6:51 a.m. UTC | #9
On Thu, Apr 07, 2016 at 12:14:18PM +0930, Jonathan Woithe wrote:
> On Mon, Feb 08, 2016 at 01:03:19PM +1030, Jonathan Woithe wrote:
> > On Wed, Dec 02, 2015 at 12:58:52AM +0100, Francois Romieu wrote:
> > > Jonathan Woithe <jwoithe@atrad.com.au> :
> > > [...]
> > > > Any thoughts or progress at this stage?  Are there further tests you need me
> > > > to do ?
> > > 
> > > Yes but you should expect two more days without signal.
> > 
> > FYI I am now back from annual leave and linux.conf.au.  This means I am in
> > a position to test possible solutions to this problem once you are able to
> > make them available.
> 
> Has there been any further progress on this problem?  I still have access to
> the hardware and systems if further tests are required.

To assist in picking up this issue I thought I'd summarise where things are
at from my perspective.

The problem is that small low-speed UDP packets are being dropped by the
r8169-based network card in our systems.  Git bisect showed that the
regression started with commit da78dbff2e05630921c551dbbc70a4b7981a8fff.

On 23 Nov 2015 Francois Romieu requested:

> If you can crash your system at will, you may apply the patch below to
> da78dbff2e05630921c551dbbc70a4b7981a8fff ("r8169: remove work from irq
> handler.") parent (aka 1e874e041fc7c222cbd85b20c4406070be1f687a) and
> build it in a current tree (say 4.2).

I did this (in a 4.3 tree) and the regression was resolved.  Following the
communuication on 2 Dec 2015 indicating there would be further news in a
couple of days, I haven't head anything else on the subject.

I am keen to resolve this problem as soon as possible as it will allow us to
move systems away from the ancient 2.6 kernel we are currently forced to use
as a result of this UDP regression (changing the network card for something
based on another chipset is not an option).  If there are further tests
needed to progress a fix please let me know as I still have an offline
system I can use for testing.

Regards
  jonathan
Jonathan Woithe June 1, 2016, 12:31 a.m. UTC | #10
On Wed, May 18, 2016 at 04:21:37PM +0930, Jonathan Woithe wrote:
> On Thu, Apr 07, 2016 at 12:14:18PM +0930, Jonathan Woithe wrote:
> > On Mon, Feb 08, 2016 at 01:03:19PM +1030, Jonathan Woithe wrote:
> > > On Wed, Dec 02, 2015 at 12:58:52AM +0100, Francois Romieu wrote:
> > > > Jonathan Woithe <jwoithe@atrad.com.au> :
> > > > [...]
> > > > > Any thoughts or progress at this stage?  Are there further tests you need me
> > > > > to do ?
> > > > 
> > > > Yes but you should expect two more days without signal.
> > > 
> > > FYI I am now back from annual leave and linux.conf.au.  This means I am in
> > > a position to test possible solutions to this problem once you are able to
> > > make them available.
> > 
> > Has there been any further progress on this problem?  I still have access to
> > the hardware and systems if further tests are required.
> 
> To assist in picking up this issue I thought I'd summarise where things are
> at from my perspective.
> 
> The problem is that small low-speed UDP packets are being dropped by the
> r8169-based network card in our systems.  Git bisect showed that the
> regression started with commit da78dbff2e05630921c551dbbc70a4b7981a8fff.
> :

FYI I have just tested the kernel.org 4.6 kernel and unsurprisingly it still
exhibits the regression.

Is there anything more I can do to move this forward?  I really do not want
to limit our systems to old kernels[1] just to avoid this regression, but
that is what I have to do while the regression remains.

Regards
  jonathan

[1] We are currently on 2.6.35.11 with some additional device driver patches
    we need for our hardware (which have subsequently been merged into
    mainline).  The regression is present in all kernels starting with 3.3
    which means the latest kernel we can use at present is 3.2.x.
Jonathan Woithe June 21, 2016, 3:21 a.m. UTC | #11
Hi all

On Wed, Jun 01, 2016 at 10:01:23AM +0930, Jonathan Woithe wrote:
> On Wed, May 18, 2016 at 04:21:37PM +0930, Jonathan Woithe wrote:
> > On Thu, Apr 07, 2016 at 12:14:18PM +0930, Jonathan Woithe wrote:
> > > On Mon, Feb 08, 2016 at 01:03:19PM +1030, Jonathan Woithe wrote:
> > > > On Wed, Dec 02, 2015 at 12:58:52AM +0100, Francois Romieu wrote:
> > > > > Jonathan Woithe <jwoithe@atrad.com.au> :
> > > > > [...]
> > > > > > Any thoughts or progress at this stage?  Are there further tests you need me
> > > > > > to do ?
> > > > > 
> > > > > Yes but you should expect two more days without signal.
> > > > 
> > > > FYI I am now back from annual leave and linux.conf.au.  This means I am in
> > > > a position to test possible solutions to this problem once you are able to
> > > > make them available.
> > > 
> > > Has there been any further progress on this problem?  I still have access to
> > > the hardware and systems if further tests are required.
> > 
> > To assist in picking up this issue I thought I'd summarise where things are
> > at from my perspective.
> > 
> > The problem is that small low-speed UDP packets are being dropped by the
> > r8169-based network card in our systems.  Git bisect showed that the
> > regression started with commit da78dbff2e05630921c551dbbc70a4b7981a8fff.
> > :
> 
> FYI I have just tested the kernel.org 4.6 kernel and unsurprisingly it still
> exhibits the regression.

Is there any chance that this regression can be resolved?  It's been 6
months since the last contact was received from the list in relation to this
issue.  If the r8169 driver is to remain broken with respect to UDP traffic
then we will have no choice but to factor in a change in our standard
hardware for future systems.  Unfortunately this also means that dozens of
systems in the field cannot be upgraded to recent kernels since doing so
will trigger the regression.[1]

I would very much like to see the r8169 regression fixed and remain ready to
run more tests as needed.  I still have the test rig in place to recreate
the situation which triggers it, although it's likely the hardware will be
redeployed in the next few months.

If the decision has been made to leave the regression unfixed, please let me
know.  A least we will then understand the situation and we will stop waiting
for a solution which will never appear and start looking at changing our
standard hardware build.  For various reasons this is a significant
undertaking which we would rather not do and it has implications for ongoing
support of existing systems.  Obviously though if there's a bug in the r8169
kernel driver which will never be fixed then this is the approach we will
regrettably have to take.

Regards
  jonathan

[1] Exchanging network cards in existing systems just to avoid a kernel
regression is not a viable option because there are commercial software
licences node locked to the existing cards.  Finding a non-r8169 PCI card
for these systems may also prove to be a challenge.
diff mbox

Patch

--- r8169.c	2015-11-21 23:02:10.435275753 +0100
+++ r8169.c	2015-11-21 23:21:49.429554012 +0100
@@ -29,7 +29,6 @@ 
 #include <linux/pci-aspm.h>
 #include <linux/prefetch.h>
 
-#include <asm/system.h>
 #include <asm/io.h>
 #include <asm/irq.h>
 
@@ -1616,7 +1615,7 @@  static int rtl8169_set_features(struct n
 	else
 		tp->cp_cmd &= ~RxChkSum;
 
-	if (dev->features & NETIF_F_HW_VLAN_RX)
+	if (dev->features & NETIF_F_HW_VLAN_CTAG_RX)
 		tp->cp_cmd |= RxVlan;
 	else
 		tp->cp_cmd &= ~RxVlan;
@@ -1632,8 +1631,8 @@  static int rtl8169_set_features(struct n
 static inline u32 rtl8169_tx_vlan_tag(struct rtl8169_private *tp,
 				      struct sk_buff *skb)
 {
-	return (vlan_tx_tag_present(skb)) ?
-		TxVlanTag | swab16(vlan_tx_tag_get(skb)) : 0x00;
+	return (skb_vlan_tag_present(skb)) ?
+		TxVlanTag | swab16(skb_vlan_tag_get(skb)) : 0x00;
 }
 
 static void rtl8169_rx_vlan_tag(struct RxDesc *desc, struct sk_buff *skb)
@@ -1641,7 +1640,7 @@  static void rtl8169_rx_vlan_tag(struct R
 	u32 opts2 = le32_to_cpu(desc->opts2);
 
 	if (opts2 & RxVlanTag)
-		__vlan_hwaccel_put_tag(skb, swab16(opts2 & 0xffff));
+		__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), swab16(opts2 & 0xffff));
 
 	desc->opts2 = 0;
 }
@@ -3508,7 +3507,7 @@  static const struct net_device_ops rtl81
 
 };
 
-static void __devinit rtl_init_mdio_ops(struct rtl8169_private *tp)
+static void rtl_init_mdio_ops(struct rtl8169_private *tp)
 {
 	struct mdio_ops *ops = &tp->mdio_ops;
 
@@ -3725,7 +3724,7 @@  static void rtl_pll_power_up(struct rtl8
 	rtl_generic_op(tp, tp->pll_power_ops.up);
 }
 
-static void __devinit rtl_init_pll_power_ops(struct rtl8169_private *tp)
+static void rtl_init_pll_power_ops(struct rtl8169_private *tp)
 {
 	struct pll_power_ops *ops = &tp->pll_power_ops;
 
@@ -3905,7 +3904,7 @@  static void r8168b_1_hw_jumbo_disable(st
 	RTL_W8(Config4, RTL_R8(Config4) & ~(1 << 0));
 }
 
-static void __devinit rtl_init_jumbo_ops(struct rtl8169_private *tp)
+static void rtl_init_jumbo_ops(struct rtl8169_private *tp)
 {
 	struct jumbo_ops *ops = &tp->jumbo_ops;
 
@@ -3971,7 +3970,7 @@  static void rtl_hw_reset(struct rtl8169_
 	}
 }
 
-static int __devinit
+static int
 rtl8169_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 {
 	const struct rtl_cfg_info *cfg = rtl_cfg_infos + ent->driver_data;
@@ -4137,7 +4136,7 @@  rtl8169_init_one(struct pci_dev *pdev, c
 		dev->dev_addr[i] = RTL_R8(MAC0 + i);
 	memcpy(dev->perm_addr, dev->dev_addr, dev->addr_len);
 
-	SET_ETHTOOL_OPS(dev, &rtl8169_ethtool_ops);
+	dev->ethtool_ops = &rtl8169_ethtool_ops;
 	dev->watchdog_timeo = RTL8169_TX_TIMEOUT;
 	dev->irq = pdev->irq;
 	dev->base_addr = (unsigned long) ioaddr;
@@ -4147,16 +4146,16 @@  rtl8169_init_one(struct pci_dev *pdev, c
 	/* don't enable SG, IP_CSUM and TSO by default - it might not work
 	 * properly for all devices */
 	dev->features |= NETIF_F_RXCSUM |
-		NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX;
+		NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX;
 
 	dev->hw_features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO |
-		NETIF_F_RXCSUM | NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX;
+		NETIF_F_RXCSUM | NETIF_F_HW_VLAN_CTAG_TX | NETIF_F_HW_VLAN_CTAG_RX;
 	dev->vlan_features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO |
 		NETIF_F_HIGHDMA;
 
 	if (tp->mac_version == RTL_GIGA_MAC_VER_05)
 		/* 8110SCd requires hardware Rx VLAN - disallow toggling */
-		dev->hw_features &= ~NETIF_F_HW_VLAN_RX;
+		dev->hw_features &= ~NETIF_F_HW_VLAN_CTAG_RX;
 
 	tp->intr_mask = 0xffff;
 	tp->hw_start = cfg->hw_start;
@@ -4217,7 +4216,7 @@  err_out_free_dev_1:
 	goto out;
 }
 
-static void __devexit rtl8169_remove_one(struct pci_dev *pdev)
+static void rtl8169_remove_one(struct pci_dev *pdev)
 {
 	struct net_device *dev = pci_get_drvdata(pdev);
 	struct rtl8169_private *tp = netdev_priv(dev);
@@ -6218,20 +6217,9 @@  static struct pci_driver rtl8169_pci_dri
 	.name		= MODULENAME,
 	.id_table	= rtl8169_pci_tbl,
 	.probe		= rtl8169_init_one,
-	.remove		= __devexit_p(rtl8169_remove_one),
+	.remove		= rtl8169_remove_one,
 	.shutdown	= rtl_shutdown,
 	.driver.pm	= RTL8169_PM_OPS,
 };
 
-static int __init rtl8169_init_module(void)
-{
-	return pci_register_driver(&rtl8169_pci_driver);
-}
-
-static void __exit rtl8169_cleanup_module(void)
-{
-	pci_unregister_driver(&rtl8169_pci_driver);
-}
-
-module_init(rtl8169_init_module);
-module_exit(rtl8169_cleanup_module);
+module_pci_driver(rtl8169_pci_driver);