Message ID | 8f84fe39-3d8d-396d-3b97-027e0a83f8cb@gmail.com
---|---
State | Changes Requested, archived
Delegated to | David Miller
Series | [net] r8169: fix NAPI handling under high load
On 10/16/18 22:37, Heiner Kallweit wrote:
> rtl_rx() and rtl_tx() are called only if the respective bits are set in the interrupt status register. Under high load NAPI may not be able to process all data (work_done == budget) and it will schedule subsequent calls to the poll callback.
> rtl_ack_events() however resets the bits in the interrupt status register, therefore subsequent calls to rtl8169_poll() won't call rtl_rx() and rtl_tx() - chip interrupts are still disabled.

Very interesting! Could this be the reason for the mysterious hangs & resets we experienced when enabling BQL for r8169? They happened more often with TSO/GSO enabled, and several people attempted to fix those hangs unsuccessfully; BQL was later reverted and has stayed out since then (#87cda7cb43).

If this bug has been there "forever" it might be tempting to re-apply BQL and see what happens. Any chance you could give that a try? I'll gladly test patches, just like I'll run this one.

cheers
Holger
On Tue, 16 Oct 2018 22:37:31 +0200 Heiner Kallweit <hkallweit1@gmail.com> wrote:
> rtl_rx() and rtl_tx() are called only if the respective bits are set in the interrupt status register. Under high load NAPI may not be able to process all data (work_done == budget) and it will schedule subsequent calls to the poll callback.
> rtl_ack_events() however resets the bits in the interrupt status register, therefore subsequent calls to rtl8169_poll() won't call rtl_rx() and rtl_tx() - chip interrupts are still disabled.
>
> Fix this by calling rtl_rx() and rtl_tx() independent of the bits set in the interrupt status register. Both functions will detect if there's nothing to do for them.
>
> This issue has been there more or less forever (at least it exists in 3.16 already), so I can't provide a "Fixes" tag.
>
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>

Another issue is this:

	if (work_done < budget) {
		napi_complete_done(napi, work_done);

		rtl_irq_enable(tp, enable_mask);
		mmiowb();
	}

	return work_done;
}

The code needs to check the return value of napi_complete_done():

	if (work_done < budget &&
	    napi_complete_done(napi, work_done)) {
		rtl_irq_enable(tp, enable_mask);
		mmiowb();
	}

	return work_done;
}

Try that, it might fix the problem and your logic would be unnecessary.
On Tue, 16 Oct 2018 23:17:31 +0200 Holger Hoffstätte <holger@applied-asynchrony.com> wrote:
> On 10/16/18 22:37, Heiner Kallweit wrote:
>> rtl_rx() and rtl_tx() are called only if the respective bits are set in the interrupt status register. Under high load NAPI may not be able to process all data (work_done == budget) and it will schedule subsequent calls to the poll callback.
>> rtl_ack_events() however resets the bits in the interrupt status register, therefore subsequent calls to rtl8169_poll() won't call rtl_rx() and rtl_tx() - chip interrupts are still disabled.
>
> Very interesting! Could this be the reason for the mysterious hangs & resets we experienced when enabling BQL for r8169? They happened more often with TSO/GSO enabled and several people attempted to fix those hangs unsuccessfully; it was later reverted and has been since then (#87cda7cb43). If this bug has been there "forever" it might be tempting to re-apply BQL and see what happens. Any chance you could give that a try? I'll gladly test patches, just like I'll run this one.
>
> cheers
> Holger

Many drivers have buggy usage of napi_complete_done.

Might even be worth forcing all network drivers to check the return value. But fixing 150 broken drivers will be a nuisance.

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index dc1d9ed33b31..c38bc66ffe74 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -466,7 +466,8 @@ static inline bool napi_reschedule(struct napi_struct *napi)
 	return false;
 }
 
-bool napi_complete_done(struct napi_struct *n, int work_done);
+bool __must_check napi_complete_done(struct napi_struct *n, int work_done);
+
 /**
  * napi_complete - NAPI processing complete
  * @n: NAPI context
On 10/16/2018 04:03 PM, Stephen Hemminger wrote:
> On Tue, 16 Oct 2018 23:17:31 +0200 Holger Hoffstätte <holger@applied-asynchrony.com> wrote:
>> On 10/16/18 22:37, Heiner Kallweit wrote:
>>> rtl_rx() and rtl_tx() are called only if the respective bits are set in the interrupt status register. Under high load NAPI may not be able to process all data (work_done == budget) and it will schedule subsequent calls to the poll callback.
>>> rtl_ack_events() however resets the bits in the interrupt status register, therefore subsequent calls to rtl8169_poll() won't call rtl_rx() and rtl_tx() - chip interrupts are still disabled.
>>
>> Very interesting! Could this be the reason for the mysterious hangs & resets we experienced when enabling BQL for r8169? They happened more often with TSO/GSO enabled and several people attempted to fix those hangs unsuccessfully; it was later reverted and has been since then (#87cda7cb43). If this bug has been there "forever" it might be tempting to re-apply BQL and see what happens. Any chance you could give that a try? I'll gladly test patches, just like I'll run this one.
>>
>> cheers
>> Holger
>
> Many drivers have buggy usage of napi_complete_done.
>
> Might even be worth forcing all network drivers to check the return value. But fixing 150 broken drivers will be a nuisance.

I had started doing that about a month ago in light of the ixgbe ndo_poll_controller vs. napi problem, but have not had time to submit that series yet:

https://github.com/ffainelli/linux/commits/napi-check

feel free to piggy back on top of that series if you would like to address this.

> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index dc1d9ed33b31..c38bc66ffe74 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -466,7 +466,8 @@ static inline bool napi_reschedule(struct napi_struct *napi)
>  	return false;
>  }
>  
> -bool napi_complete_done(struct napi_struct *n, int work_done);
> +bool __must_check napi_complete_done(struct napi_struct *n, int work_done);
> +
>  /**
>   * napi_complete - NAPI processing complete
>   * @n: NAPI context
On 10/16/2018 03:17 PM, Stephen Hemminger wrote:
> On Tue, 16 Oct 2018 22:37:31 +0200 Heiner Kallweit <hkallweit1@gmail.com> wrote:
>> rtl_rx() and rtl_tx() are called only if the respective bits are set in the interrupt status register. Under high load NAPI may not be able to process all data (work_done == budget) and it will schedule subsequent calls to the poll callback.
>> rtl_ack_events() however resets the bits in the interrupt status register, therefore subsequent calls to rtl8169_poll() won't call rtl_rx() and rtl_tx() - chip interrupts are still disabled.
>>
>> Fix this by calling rtl_rx() and rtl_tx() independent of the bits set in the interrupt status register. Both functions will detect if there's nothing to do for them.
>>
>> This issue has been there more or less forever (at least it exists in 3.16 already), so I can't provide a "Fixes" tag.
>>
>> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
>
> Another issue is this:
>
> 	if (work_done < budget) {
> 		napi_complete_done(napi, work_done);
>
> 		rtl_irq_enable(tp, enable_mask);
> 		mmiowb();
> 	}
>
> 	return work_done;
> }
>
> The code needs to check the return value of napi_complete_done():
>
> 	if (work_done < budget &&
> 	    napi_complete_done(napi, work_done)) {
> 		rtl_irq_enable(tp, enable_mask);
> 		mmiowb();
> 	}
>
> 	return work_done;
> }
>
> Try that, it might fix the problem and your logic would be unnecessary.

Well, I do not believe this is related.

Testing napi_complete_done() is not mandatory; it is only an optimization [1], and only for busy polling users. In short, by default this should not matter, since busy polling is not enabled by default.

[1] Busy polling users are spinning anyway, so it is not even clear whether this is really an optimization, unless maybe the cost of irq enabling is really, really high...
On 10/16/2018 04:03 PM, Stephen Hemminger wrote:
> Many drivers have buggy usage of napi_complete_done.
>
> Might even be worth forcing all network drivers to check the return value. But fixing 150 broken drivers will be a nuisance.
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index dc1d9ed33b31..c38bc66ffe74 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -466,7 +466,8 @@ static inline bool napi_reschedule(struct napi_struct *napi)
>  	return false;
>  }
>  
> -bool napi_complete_done(struct napi_struct *n, int work_done);
> +bool __must_check napi_complete_done(struct napi_struct *n, int work_done);
> +
>  /**
>   * napi_complete - NAPI processing complete
>   * @n: NAPI context

I disagree completely. This never has been a requirement.
On 10/16/2018 04:08 PM, Florian Fainelli wrote:
> I had started doing that about a month ago in light of the ixgbe ndo_poll_controller vs. napi problem, but have not had time to submit that series yet:
>
> https://github.com/ffainelli/linux/commits/napi-check
>
> feel free to piggy back on top of that series if you would like to address this.

But the root cause was different, you remember?

We fixed the real issue with netpoll's ability to stick all NIC IRQs onto one victim CPU.
On 10/16/2018 5:23 PM, Eric Dumazet wrote:
> On 10/16/2018 04:08 PM, Florian Fainelli wrote:
>> I had started doing that about a month ago in light of the ixgbe ndo_poll_controller vs. napi problem, but have not had time to submit that series yet:
>>
>> https://github.com/ffainelli/linux/commits/napi-check
>>
>> feel free to piggy back on top of that series if you would like to address this.
>
> But the root cause was different, you remember?
>
> We fixed the real issue with netpoll's ability to stick all NIC IRQs onto one victim CPU.

I do remember; after seeing the resolution I decided to leave this in a branch and not submit it, because it was not particularly relevant anymore.
On 16.10.2018 23:17, Holger Hoffstätte wrote:
> On 10/16/18 22:37, Heiner Kallweit wrote:
>> rtl_rx() and rtl_tx() are called only if the respective bits are set in the interrupt status register. Under high load NAPI may not be able to process all data (work_done == budget) and it will schedule subsequent calls to the poll callback.
>> rtl_ack_events() however resets the bits in the interrupt status register, therefore subsequent calls to rtl8169_poll() won't call rtl_rx() and rtl_tx() - chip interrupts are still disabled.
>
> Very interesting! Could this be the reason for the mysterious hangs & resets we experienced when enabling BQL for r8169? They happened more often with TSO/GSO enabled and several people attempted to fix those hangs unsuccessfully; it was later reverted and has been since then (#87cda7cb43). If this bug has been there "forever" it might be tempting to re-apply BQL and see what happens. Any chance you could give that a try? I'll gladly test patches, just like I'll run this one.

After reading through the old mail threads regarding BQL on r8169 I don't think the fix here is related. It seems that BQL on r8169 worked fine for most people; just one person had problems on one of his systems. I assume the issue was specific to one chip version. Of the ~50 chip versions supported by r8169, more or less each one requires its own quirks.

If we're lucky the chip-version-specific issue has been fixed in the meantime and we can simply apply the old BQL patch again. If it turns out that certain chip versions can't be used with BQL, then we can disable the feature for those chip versions instead of removing it completely.

I will apply the old BQL patch and see how it behaves on my system (with GRO and SG enabled).

> cheers
> Holger

Heiner
Tomas, more than three years back you reported network problems after BQL was added to the r8169 driver. Due to this the change was reverted. Now the discussion about adding BQL has popped up again.

You mentioned that the issue exists on one of your systems only. Therefore it could be an issue specific to a particular chip version. I'd be interested in the chip version of the affected system. You linked to another similar report; there the chip version was:

r8169 0000:10:00.0 eth0: RTL8168c/8111c at 0xf8130000, 00:e0:4c:68:48:d2, XID 1c4000c0 IRQ 29

In case you still have the affected system or at least the old dmesg logs, I'd appreciate it if you could let me know the quoted line from the dmesg output.

Thanks a lot,
Heiner

-------- Forwarded Message --------
Subject: Re: [PATCH net] r8169: fix NAPI handling under high load
Date: Wed, 17 Oct 2018 20:12:48 +0200
From: Heiner Kallweit <hkallweit1@gmail.com>
To: Holger Hoffstätte <holger@applied-asynchrony.com>, David Miller <davem@davemloft.net>, Realtek linux nic maintainers <nic_swsd@realtek.com>
CC: netdev@vger.kernel.org

On 16.10.2018 23:17, Holger Hoffstätte wrote:
> On 10/16/18 22:37, Heiner Kallweit wrote:
>> rtl_rx() and rtl_tx() are called only if the respective bits are set in the interrupt status register. Under high load NAPI may not be able to process all data (work_done == budget) and it will schedule subsequent calls to the poll callback.
>> rtl_ack_events() however resets the bits in the interrupt status register, therefore subsequent calls to rtl8169_poll() won't call rtl_rx() and rtl_tx() - chip interrupts are still disabled.
>
> Very interesting! Could this be the reason for the mysterious hangs & resets we experienced when enabling BQL for r8169? They happened more often with TSO/GSO enabled and several people attempted to fix those hangs unsuccessfully; it was later reverted and has been since then (#87cda7cb43).
> If this bug has been there "forever" it might be tempting to re-apply BQL and see what happens. Any chance you could give that a try? I'll gladly test patches, just like I'll run this one.

After reading through the old mail threads regarding BQL on r8169 I don't think the fix here is related. It seems that BQL on r8169 worked fine for most people; just one person had problems on one of his systems. I assume the issue was specific to one chip version. Of the ~50 chip versions supported by r8169, more or less each one requires its own quirks.

If we're lucky the chip-version-specific issue has been fixed in the meantime and we can simply apply the old BQL patch again. If it turns out that certain chip versions can't be used with BQL, then we can disable the feature for those chip versions instead of removing it completely.

I will apply the old BQL patch and see how it behaves on my system (with GRO and SG enabled).

> cheers
> Holger

Heiner
On 10/17/18 20:12, Heiner Kallweit wrote:
> On 16.10.2018 23:17, Holger Hoffstätte wrote:
>> On 10/16/18 22:37, Heiner Kallweit wrote:
>>> rtl_rx() and rtl_tx() are called only if the respective bits are set in the interrupt status register. Under high load NAPI may not be able to process all data (work_done == budget) and it will schedule subsequent calls to the poll callback.
>>> rtl_ack_events() however resets the bits in the interrupt status register, therefore subsequent calls to rtl8169_poll() won't call rtl_rx() and rtl_tx() - chip interrupts are still disabled.
>>
>> Very interesting! Could this be the reason for the mysterious hangs & resets we experienced when enabling BQL for r8169? They happened more often with TSO/GSO enabled and several people attempted to fix those hangs unsuccessfully; it was later reverted and has been since then (#87cda7cb43). If this bug has been there "forever" it might be tempting to re-apply BQL and see what happens. Any chance you could give that a try? I'll gladly test patches, just like I'll run this one.
>
> After reading through the old mail threads regarding BQL on r8169 I don't think the fix here is related.
> It seems that BQL on r8169 worked fine for most people, just one had problems on one of his systems. I assume the issue was specific

I continued to use the BQL patch in my private tree after it was reverted and also had occasional timeouts, but *only* after I started playing with ethtool to change offload settings. Without offloads or the BQL patch everything has been rock-solid since then.

The other weird problem was that timeouts would occur on an otherwise *completely idle* system. Since that occasionally borked my NFS server over night I ultimately removed BQL as well. Rock-solid since then.

> I will apply the old BQL patch and see how it's on my system (with GRO and SG enabled).

I don't think it still applies cleanly, but if you cook up an updated version I'll gladly test it.

Thanks! :)
Holger
On 17.10.2018 21:11, Holger Hoffstätte wrote:
> On 10/17/18 20:12, Heiner Kallweit wrote:
>> On 16.10.2018 23:17, Holger Hoffstätte wrote:
>>> On 10/16/18 22:37, Heiner Kallweit wrote:
>>>> rtl_rx() and rtl_tx() are called only if the respective bits are set in the interrupt status register. Under high load NAPI may not be able to process all data (work_done == budget) and it will schedule subsequent calls to the poll callback.
>>>> rtl_ack_events() however resets the bits in the interrupt status register, therefore subsequent calls to rtl8169_poll() won't call rtl_rx() and rtl_tx() - chip interrupts are still disabled.
>>>
>>> Very interesting! Could this be the reason for the mysterious hangs & resets we experienced when enabling BQL for r8169? They happened more often with TSO/GSO enabled and several people attempted to fix those hangs unsuccessfully; it was later reverted and has been since then (#87cda7cb43). If this bug has been there "forever" it might be tempting to re-apply BQL and see what happens. Any chance you could give that a try? I'll gladly test patches, just like I'll run this one.
>>
>> After reading through the old mail threads regarding BQL on r8169 I don't think the fix here is related.
>> It seems that BQL on r8169 worked fine for most people, just one had problems on one of his systems. I assume the issue was specific
>
> I continued to use the BQL patch in my private tree after it was reverted and also had occasional timeouts, but *only* after I started playing with ethtool to change offload settings. Without offloads or the BQL patch everything has been rock-solid since then.
> The other weird problem was that timeouts would occur on an otherwise *completely idle* system. Since that occasionally borked my NFS server over night I ultimately removed BQL as well. Rock-solid since then.
>
>> I will apply the old BQL patch and see how it's on my system (with GRO and SG enabled).
>
> I don't think it still applies cleanly, but if you cook up an updated version I'll gladly test it.
>
> Thanks! :)
> Holger

Good to know. What's your kernel version and RTL8168 chip version? Regarding the chip version the dmesg line with the XID would be relevant.

Below is the slightly modified original BQL patch; I just moved the call to netdev_reset_queue(). This patch applies at least to latest linux-next.

My test system:
- RTL8168evl
- latest linux-next
- BQL patch applied
- SG/GRO enabled:

rx-checksumming: on
tx-checksumming: on
	tx-checksum-ipv4: on
	tx-checksum-ip-generic: off [fixed]
	tx-checksum-ipv6: on
	tx-checksum-fcoe-crc: off [fixed]
	tx-checksum-sctp: off [fixed]
scatter-gather: on
	tx-scatter-gather: on
	tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
	tx-tcp-segmentation: on
	tx-tcp-ecn-segmentation: off [fixed]
	tx-tcp-mangleid-segmentation: on
	tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on

I briefly tested normal operation and did some tests with iperf3. Everything looks good so far.
---
 drivers/net/ethernet/realtek/r8169.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 0d8070adc..e236b46b8 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -5852,6 +5852,7 @@ static void rtl8169_tx_clear(struct rtl8169_private *tp)
 {
 	rtl8169_tx_clear_range(tp, tp->dirty_tx, NUM_TX_DESC);
 	tp->cur_tx = tp->dirty_tx = 0;
+	netdev_reset_queue(tp->dev);
 }
 
 static void rtl_reset_work(struct rtl8169_private *tp)
@@ -6154,6 +6155,8 @@ static netdev_tx_t rtl8169_start_xmit(struct sk_buff *skb,
 
 	txd->opts2 = cpu_to_le32(opts[1]);
 
+	netdev_sent_queue(dev, skb->len);
+
 	skb_tx_timestamp(skb);
 
 	/* Force memory writes to complete before releasing descriptor */
@@ -6252,7 +6255,7 @@ static void rtl8169_pcierr_interrupt(struct net_device *dev)
 
 static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
 {
-	unsigned int dirty_tx, tx_left;
+	unsigned int dirty_tx, tx_left, bytes_compl = 0, pkts_compl = 0;
 
 	dirty_tx = tp->dirty_tx;
 	smp_rmb();
@@ -6276,10 +6279,8 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
 		rtl8169_unmap_tx_skb(tp_to_dev(tp), tx_skb,
 				     tp->TxDescArray + entry);
 		if (status & LastFrag) {
-			u64_stats_update_begin(&tp->tx_stats.syncp);
-			tp->tx_stats.packets++;
-			tp->tx_stats.bytes += tx_skb->skb->len;
-			u64_stats_update_end(&tp->tx_stats.syncp);
+			pkts_compl++;
+			bytes_compl += tx_skb->skb->len;
 			dev_consume_skb_any(tx_skb->skb);
 			tx_skb->skb = NULL;
 		}
@@ -6288,6 +6289,13 @@ static void rtl_tx(struct net_device *dev, struct rtl8169_private *tp)
 	}
 
 	if (tp->dirty_tx != dirty_tx) {
+		netdev_completed_queue(dev, pkts_compl, bytes_compl);
+
+		u64_stats_update_begin(&tp->tx_stats.syncp);
+		tp->tx_stats.packets += pkts_compl;
+		tp->tx_stats.bytes += bytes_compl;
+		u64_stats_update_end(&tp->tx_stats.syncp);
+
 		tp->dirty_tx = dirty_tx;
 		/* Sync with rtl8169_start_xmit:
 		 * - publish dirty_tx ring index (write barrier)
On 10/17/18 21:27, Heiner Kallweit wrote:
(snip)
> Good to know. What's your kernel version and RTL8168 chip version? Regarding the chip version the dmesg line with the XID would be relevant.

4.18.15 + PDS (custom CPU scheduler) + cherry pickings from mainline. Applied both the original patch in this thread & BQL; built fine.

Server:
r8169 0000:04:00.0 eth0: RTL8168evl/8111evl, c8:60:00:68:33:cc, XID 2c900800, IRQ 30

Workstation:
r8169 0000:04:00.0 eth0: RTL8168evl/8111evl, 50:e5:49:41:7d:ad, XID 2c900800, IRQ 33

So same chipsets. On both:

ethtool --coalesce eth0 rx-frames 0 rx-usecs 50 tx-frames 0 tx-usecs 50
ethtool --offload eth0 rx on tx on gro on gso on sg on tso on

Let's see how it goes. :)

cheers
Holger
Heiner Kallweit <hkallweit1@gmail.com> :
[...]
> This issue has been there more or less forever (at least it exists in 3.16 already), so I can't provide a "Fixes" tag.

Hardly forever. It fixes da78dbff2e05630921c551dbbc70a4b7981a8fff.
Holger Hoffstätte <holger@applied-asynchrony.com> :
[...]
> I continued to use the BQL patch in my private tree after it was reverted and also had occasional timeouts, but *only* after I started playing with ethtool to change offload settings. Without offloads or the BQL patch everything has been rock-solid since then.
> The other weird problem was that timeouts would occur on an otherwise *completely idle* system. Since that occasionally borked my NFS server over night I ultimately removed BQL as well. Rock-solid since then.

The bug will induce delayed rx processing when a spike of "load" is followed by an idle period. I do not see how BQL would matter per se, though.
From: Francois Romieu <romieu@fr.zoreil.com>
Date: Thu, 18 Oct 2018 01:30:45 +0200

> Heiner Kallweit <hkallweit1@gmail.com> :
> [...]
>> This issue has been there more or less forever (at least it exists in 3.16 already), so I can't provide a "Fixes" tag.
>
> Hardly forever. It fixes da78dbff2e05630921c551dbbc70a4b7981a8fff.

I don't see exactly how that can be true.

That commit didn't change the parts of the NAPI poll processing which are relevant here, mainly the guarding of the RX and TX work using the status bits which are cleared.

Maybe I'm missing something? If so, indeed it would be nice to add a proper Fixes: tag here.

Thanks!
On Thu, Oct 18, 2018 at 01:30:51AM +0200, Francois Romieu wrote:
> Holger Hoffstätte <holger@applied-asynchrony.com> :
> [...]
>> I continued to use the BQL patch in my private tree after it was reverted and also had occasional timeouts, but *only* after I started playing with ethtool to change offload settings. Without offloads or the BQL patch everything has been rock-solid since then.
>> The other weird problem was that timeouts would occur on an otherwise *completely idle* system. Since that occasionally borked my NFS server over night I ultimately removed BQL as well. Rock-solid since then.
>
> The bug will induce delayed rx processing when a spike of "load" is followed by an idle period.

If this is the case, I wonder whether this bug might also be the cause of the long reception delays we've observed at times when a period of high network load is followed by almost nothing [1]. That thread [2] details the investigations subsequently done. A git bisect showed that commit da78dbff2e05630921c551dbbc70a4b7981a8fff was the origin of the misbehaviour we were observing.

We still see the problem when we test with recent kernels. It would be great if the underlying problem has now been identified.

I can possibly scrape some hardware together to test any proposed fix under our workload if there is interest.

Regards
jonathan

[1] https://marc.info/?l=linux-netdev&m=136281333207734&w=2
[2] https://marc.info/?t=136281339500002&r=1&w=2
On 18.10.2018 07:21, David Miller wrote:
> From: Francois Romieu <romieu@fr.zoreil.com>
> Date: Thu, 18 Oct 2018 01:30:45 +0200
>
>> Heiner Kallweit <hkallweit1@gmail.com> :
>> [...]
>>> This issue has been there more or less forever (at least it exists in 3.16 already), so I can't provide a "Fixes" tag.
>>
>> Hardly forever. It fixes da78dbff2e05630921c551dbbc70a4b7981a8fff.
>
> I don't see exactly how that can be true.
>
> That commit didn't change the parts of the NAPI poll processing which are relevant here, mainly the guarding of the RX and TX work using the status bits which are cleared.

AFAICS Francois is right: patch da78dbff2e05 ("r8169: remove work from irq handler") introduced the guarding of RX and TX work. I just checked back to 3.16 as the oldest LTS kernel version.

> Maybe I'm missing something? If so, indeed it would be nice to add a proper Fixes: tag here.

Shall I submit a v2 including the Fixes line?

> Thanks!
On 18.10.2018 07:58, Jonathan Woithe wrote:
> On Thu, Oct 18, 2018 at 01:30:51AM +0200, Francois Romieu wrote:
>> Holger Hoffstätte <holger@applied-asynchrony.com> :
>> [...]
>>> I continued to use the BQL patch in my private tree after it was reverted and also had occasional timeouts, but *only* after I started playing with ethtool to change offload settings. Without offloads or the BQL patch everything has been rock-solid since then.
>>> The other weird problem was that timeouts would occur on an otherwise *completely idle* system. Since that occasionally borked my NFS server over night I ultimately removed BQL as well. Rock-solid since then.
>>
>> The bug will induce delayed rx processing when a spike of "load" is followed by an idle period.
>
> If this is the case, I wonder whether this bug might also be the cause of the long reception delays we've observed at times when a period of high network load is followed by almost nothing [1]. That thread [2] details the investigations subsequently done. A git bisect showed that commit da78dbff2e05630921c551dbbc70a4b7981a8fff was the origin of the misbehaviour we were observing.
>
> We still see the problem when we test with recent kernels. It would be great if the underlying problem has now been identified.
>
> I can possibly scrape some hardware together to test any proposed fix under our workload if there was interest.

Proposed fix is here:
https://patchwork.ozlabs.org/patch/985014/

Would be good if you could test it. Thanks!

Heiner

> Regards
> jonathan
>
> [1] https://marc.info/?l=linux-netdev&m=136281333207734&w=2
> [2] https://marc.info/?t=136281339500002&r=1&w=2
On Thu, Oct 18, 2018 at 08:03:32AM +0200, Heiner Kallweit wrote:
> On 18.10.2018 07:58, Jonathan Woithe wrote:
>> On Thu, Oct 18, 2018 at 01:30:51AM +0200, Francois Romieu wrote:
>>> Holger Hoffstätte <holger@applied-asynchrony.com> :
>>> [...]
>>> The bug will induce delayed rx processing when a spike of "load" is followed by an idle period.
>>
>> If this is the case, I wonder whether this bug might also be the cause of the long reception delays we've observed at times when a period of high network load is followed by almost nothing [1]. That thread [2] details the investigations subsequently done. A git bisect showed that commit da78dbff2e05630921c551dbbc70a4b7981a8fff was the origin of the misbehaviour we were observing.
>>
>> We still see the problem when we test with recent kernels. It would be great if the underlying problem has now been identified.
>>
>> I can possibly scrape some hardware together to test any proposed fix under our workload if there was interest.
>
> Proposed fix is here:
> https://patchwork.ozlabs.org/patch/985014/
> Would be good if you could test it. Thanks!

I should be able to do so tomorrow. Which kernel would you like me to apply the patch to?

Regards
jonathan
From: Heiner Kallweit <hkallweit1@gmail.com>
Date: Thu, 18 Oct 2018 07:58:52 +0200

> On 18.10.2018 07:21, David Miller wrote:
>> From: Francois Romieu <romieu@fr.zoreil.com>
>> Date: Thu, 18 Oct 2018 01:30:45 +0200
>>
>>> Heiner Kallweit <hkallweit1@gmail.com> :
>>> [...]
>>>> This issue has been there more or less forever (at least it exists in 3.16 already), so I can't provide a "Fixes" tag.
>>>
>>> Hardly forever. It fixes da78dbff2e05630921c551dbbc70a4b7981a8fff.
>>
>> I don't see exactly how that can be true.
>>
>> That commit didn't change the parts of the NAPI poll processing which are relevant here, mainly the guarding of the RX and TX work using the status bits which are cleared.
>
> AFAICS Francois is right and patch da78dbff2e05 ("r8169: remove work from irq handler") introduced the guarding of RX and TX work. I just checked back to 3.16 as the oldest LTS kernel version.

Aha, now I see it.

>> Maybe I'm missing something? If so, indeed it would be nice to add a proper Fixes: tag here.
>
> Shall I submit a v2 including the Fixes line?

Yes, please do!
On 10/18/18 08:15, Jonathan Woithe wrote:
> On Thu, Oct 18, 2018 at 08:03:32AM +0200, Heiner Kallweit wrote:
>> On 18.10.2018 07:58, Jonathan Woithe wrote:
>>> On Thu, Oct 18, 2018 at 01:30:51AM +0200, Francois Romieu wrote:
>>>> Holger Hoffstätte <holger@applied-asynchrony.com> :
>>>> [...]
>>>> The bug will induce delayed rx processing when a spike of "load" is followed by an idle period.
>>>
>>> If this is the case, I wonder whether this bug might also be the cause of the long reception delays we've observed at times when a period of high network load is followed by almost nothing [1]. That thread [2] details the investigations subsequently done. A git bisect showed that commit da78dbff2e05630921c551dbbc70a4b7981a8fff was the origin of the misbehaviour we were observing.
>>>
>>> We still see the problem when we test with recent kernels. It would be great if the underlying problem has now been identified.
>>>
>>> I can possibly scrape some hardware together to test any proposed fix under our workload if there was interest.
>>
>> Proposed fix is here:
>> https://patchwork.ozlabs.org/patch/985014/
>> Would be good if you could test it. Thanks!
>
> I should be able to do so tomorrow. Which kernel would you like me to apply the patch to?

Hi Jonathan,

I'm already running it on 4.18.15, so either that or the latest 4.19-rc would work as well.

cheers
Holger
On Thu, Oct 18, 2018 at 01:52:33PM +0200, Holger Hoffstätte wrote: > On 10/18/18 08:15, Jonathan Woithe wrote: > > On Thu, Oct 18, 2018 at 08:03:32AM +0200, Heiner Kallweit wrote: > > > Proposed fix is here: > > > https://patchwork.ozlabs.org/patch/985014/ > > > Would be good if you could test it. Thanks! > > > > I should be able to do so tomorrow. Which kernel would you like me to apply > > the patch to? > > Hi Jonathan, > > I'm already running it on 4.18.15, so either that or latest 4.19-rc would > work as well. It turns out I couldn't compile 4.18.15 conveniently for the test system due to the age of the compiler. The compiler, like the rest of the distribution, is ancient: the presence of the bug has prevented us updating the system due to a need to support hardware in the field. The bug has affected the operation of our systems in all kernels since (I think) 3.3. I could compile 4.14 though - I had previously confirmed that the problem was still seen with that kernel. I therefore applied the fix to 4.14 (which was trivial) and started a test with that. The system has been running without exhibiting the effects of the bug for over five hours. With an unpatched 4.14, symptoms were typically seen within 30 minutes. I will leave the system to operate over the weekend to be sure, but at this stage it seems likely that the patch will resolve this long-standing difficulty that we've experienced. I will report back on Monday. Regards jonathan
On 10/17/18 22:07, Holger Hoffstätte wrote: > On 10/17/18 21:27, Heiner Kallweit wrote: > (snip) >> Good to know. What's your kernel version and RTL8168 chip version? >> Regarding the chip version the dmesg line with the XID would be relevant. > > 4.18.15 + PDS (custom CPU scheduler) + cherry pickings from mainline. > Applied both the original patch in this thread & bql, built fine. Good news everyone! Been running with the new BQL patch for three days now on 2 machines and not a single hang/reset, regardless of load. Coupled with the original patch in this thread (already in mainline) this looks pretty good! So while I can only speak for myself & my hardware, here's a Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Thanks Heiner! -h
On Fri, 19 Oct 2018 17:59:21 +1030, Jonathan Woithe wrote: > On 10/18/18 08:15, Jonathan Woithe wrote: > > On Thu, Oct 18, 2018 at 08:03:32AM +0200, Heiner Kallweit wrote: > > > Proposed fix is here: > > > https://patchwork.ozlabs.org/patch/985014/ > > > Would be good if you could test it. Thanks! > > > > I should be able to do so tomorrow. Which kernel would you like me to apply > > the patch to? > > It turns out I couldn't compile 4.18.15 conveniently ... > I could compile 4.14 though - I had previously confirmed that the problem > was still seen with that kernel. I therefore applied the fix to 4.14 (which > was trivial) and started a test with that. The system has been running > without exhibiting the effects of the bug for over five hours. With an > unpatched 4.14, symptoms were typically seen within 30 minutes. > > I will leave the system to operate over the weekend to be sure, but at this > stage it seems likely that the patch will resolve this long-standing > difficulty that we've experienced. I will report back on Monday. Our test system has now been running for just under 3 days (69 hours) and there has not been a single occurrence of the problem with this patch applied to the 4.14 r8169 driver. Without the patch, multiple problems are logged within 30 minutes. Given this, I conclude that this patch fixes the problem for us. Although belated (since the patch has already been accepted): Tested-by: Jonathan Woithe <jwoithe@atrad.com.au> Thank you very much for your work which led to the patch! It means we can finally provide an upgrade path for systems in the field which are equipped with the r8169 hardware. For reference, the problem being tested on our systems is the one discussed in the "r8169 regression: UDP packets dropped intermittantly" thread on this mailing list. Regards jonathan
diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index 8c4f49adc..7caf3b7e9 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -6528,17 +6528,15 @@ static int rtl8169_poll(struct napi_struct *napi, int budget) struct rtl8169_private *tp = container_of(napi, struct rtl8169_private, napi); struct net_device *dev = tp->dev; u16 enable_mask = RTL_EVENT_NAPI | tp->event_slow; - int work_done= 0; + int work_done; u16 status; status = rtl_get_events(tp); rtl_ack_events(tp, status & ~tp->event_slow); - if (status & RTL_EVENT_NAPI_RX) - work_done = rtl_rx(dev, tp, (u32) budget); + work_done = rtl_rx(dev, tp, (u32) budget); - if (status & RTL_EVENT_NAPI_TX) - rtl_tx(dev, tp); + rtl_tx(dev, tp); if (status & tp->event_slow) { enable_mask &= ~tp->event_slow;
rtl_rx() and rtl_tx() are called only if the respective bits are set in the interrupt status register. Under high load NAPI may not be able to process all data (work_done == budget) and it will schedule subsequent calls to the poll callback. rtl_ack_events() however resets the bits in the interrupt status register, therefore subsequent calls to rtl8169_poll() won't call rtl_rx() and rtl_tx() - chip interrupts are still disabled. Fix this by calling rtl_rx() and rtl_tx() independent of the bits set in the interrupt status register. Both functions will detect if there's nothing to do for them. This issue has been there more or less forever (at least it exists in 3.16 already), so I can't provide a "Fixes" tag. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> --- drivers/net/ethernet/realtek/r8169.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-)