Message ID | 1359055016-13603-1-git-send-email-bjorn@mork.no |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
On Thursday 24 January 2013 20:16:56 Bjørn Mork wrote:
> A device sending 0 length frames as fast as it can has been
> observed killing the host system due to the resulting memory
> pressure.
>
> Temporarily disable RX skb allocation and URB submission when
> the current error ratio is high, preventing us from trying to
> allocate an infinite number of skbs. Reenable as soon as we
> are finished processing the done queue, allowing the device
> to continue working after short error bursts.
>
> Signed-off-by: Bjørn Mork <bjorn@mork.no>
> ---
> So is this starting to look OK?

It seems to me that we at least need to try some error recovery.
How about resetting the device when it is no longer used?

	Regards
		Oliver

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2013-01-24 at 20:16 +0100, Bjørn Mork wrote:
> A device sending 0 length frames as fast as it can has been
> observed killing the host system due to the resulting memory
> pressure.
[]
> diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
[]
> @@ -539,6 +545,22 @@ block:
>  		break;
>  	}
>
> +	/* stop rx if packet error rate is high */
> +	if (++dev->pkt_cnt > 30) {
> +		dev->pkt_cnt = 0;
> +		dev->pkt_err = 0;
> +	} else {
> +		if (state == rx_cleanup)
> +			dev->pkt_err++;
> +		if (dev->pkt_err > 20) {
> +			set_bit(EVENT_RX_KILL, &dev->flags);
> +			if (net_ratelimit())
> +				netif_dbg(dev, rx_err, dev->net,
> +					  "rx kill: high error rate\n");
> +			dev->pkt_err = 0;
> +		}
> +	}

Maybe use ratelimit() here?

> diff --git a/include/linux/usb/usbnet.h b/include/linux/usb/usbnet.h
[]
> @@ -33,6 +33,7 @@ struct usbnet {
>  	wait_queue_head_t	*wait;
>  	struct mutex		phy_mutex;
>  	unsigned char		suspend_count;
> +	unsigned char		pkt_cnt, pkt_err;

and instead:

	struct ratelimit_state errors;
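For readers unfamiliar with interval/burst limiting, the general idea behind a rate-limit state can be sketched in user space as below. This is only an illustrative model, not the kernel's implementation and not what the patch does: `struct rl_model`, `rl_model_allow()` and the integer `now` timestamp are all hypothetical names chosen for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of an interval/burst limiter.
 * At most `burst` events are allowed per `interval` time units. */
struct rl_model {
	long interval;	/* length of one window, in arbitrary time units */
	int burst;	/* events allowed per window */
	long begin;	/* start time of the current window */
	int used;	/* events already allowed in the current window */
};

/* Returns true if an event at time `now` is within the limit. */
static bool rl_model_allow(struct rl_model *rs, long now)
{
	assert(rs != NULL);

	/* start a fresh window once the previous one has expired */
	if (now - rs->begin >= rs->interval) {
		rs->begin = now;
		rs->used = 0;
	}

	if (rs->used < rs->burst) {
		rs->used++;
		return true;
	}
	return false;
}
```

In such a model, a caller could treat "limit exceeded" as the signal for a high error rate; whether that fits usbnet better than the two counters is exactly what the replies discuss.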
Oliver Neukum <oliver@neukum.org> writes:
> On Thursday 24 January 2013 20:16:56 Bjørn Mork wrote:
>> A device sending 0 length frames as fast as it can has been
>> observed killing the host system due to the resulting memory
>> pressure.
>>
>> Temporarily disable RX skb allocation and URB submission when
>> the current error ratio is high, preventing us from trying to
>> allocate an infinite number of skbs. Reenable as soon as we
>> are finished processing the done queue, allowing the device
>> to continue working after short error bursts.
>>
>> Signed-off-by: Bjørn Mork <bjorn@mork.no>
>> ---
>> So is this starting to look OK?
>
> It seems to me that we at least need to try some error recovery.

Won't the disabling code in usbnet_bh do? RX will only stay disabled
until the done queue is handled.

> How about resetting the device when it is no longer used?

Yes, that we should do. I guess usbnet_open is the place to reset the
flag and counters? I'll send another version taking care of this and
Joe's comment.


Bjørn
Joe Perches <joe@perches.com> writes:
> On Thu, 2013-01-24 at 20:16 +0100, Bjørn Mork wrote:
>> A device sending 0 length frames as fast as it can has been
>> observed killing the host system due to the resulting memory
>> pressure.
> []
>> diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
> []
>> @@ -539,6 +545,22 @@ block:
>>  		break;
>>  	}
>>
>> +	/* stop rx if packet error rate is high */
>> +	if (++dev->pkt_cnt > 30) {
>> +		dev->pkt_cnt = 0;
>> +		dev->pkt_err = 0;
>> +	} else {
>> +		if (state == rx_cleanup)
>> +			dev->pkt_err++;
>> +		if (dev->pkt_err > 20) {
>> +			set_bit(EVENT_RX_KILL, &dev->flags);
>> +			if (net_ratelimit())
>> +				netif_dbg(dev, rx_err, dev->net,
>> +					  "rx kill: high error rate\n");
>> +			dev->pkt_err = 0;
>> +		}
>> +	}
>
> Maybe use ratelimit() here?
>
>> diff --git a/include/linux/usb/usbnet.h b/include/linux/usb/usbnet.h
> []
>> @@ -33,6 +33,7 @@ struct usbnet {
>>  	wait_queue_head_t	*wait;
>>  	struct mutex		phy_mutex;
>>  	unsigned char		suspend_count;
>> +	unsigned char		pkt_cnt, pkt_err;
>
> and instead:
>
> 	struct ratelimit_state errors;

Thanks. I took a look at this, but it seems to be more complex than I
really wanted for keeping the debug noise down here.

The rest of usbnet does not care much about rate limiting debug messages
at all. I'll get a message for every 0 length packet for example. Maybe
usbnet should get a private debug ratelimiter all over?

Is the problem that these instances will hide more important net
messages? Would it help to make the ratelimit call depend on whether
debugging is enabled like ath5k and brcm80211 seem to do?


Bjørn
On Friday 25 January 2013 08:13:15 Bjørn Mork wrote:
> Oliver Neukum <oliver@neukum.org> writes:
> > On Thursday 24 January 2013 20:16:56 Bjørn Mork wrote:
> >> A device sending 0 length frames as fast as it can has been
> >> observed killing the host system due to the resulting memory
> >> pressure.
> >>
> >> Temporarily disable RX skb allocation and URB submission when
> >> the current error ratio is high, preventing us from trying to
> >> allocate an infinite number of skbs. Reenable as soon as we
> >> are finished processing the done queue, allowing the device
> >> to continue working after short error bursts.
> >>
> >> Signed-off-by: Bjørn Mork <bjorn@mork.no>
> >> ---
> >> So is this starting to look OK?
> >
> > It seems to me that we at least need to try some error recovery.
>
> Won't the disabling code in usbnet_bh do? RX will only stay disabled
> until the done queue is handled.

So will the burst of bogus packets stop by itself?

> > How about resetting the device when it is no longer used?
>
> Yes, that we should do. I guess usbnet_open is the place to reset the
> flag and counters? I'll send another version taking care of this and
> Joe's comment.

I was thinking about resetting the device, not just counters.
But yes, open() needs to reset the counters, too.

	Regards
		Oliver
Oliver Neukum <oliver@neukum.org> writes:
> On Friday 25 January 2013 08:13:15 Bjørn Mork wrote:
>> Oliver Neukum <oliver@neukum.org> writes:
>> > On Thursday 24 January 2013 20:16:56 Bjørn Mork wrote:
>> >> A device sending 0 length frames as fast as it can has been
>> >> observed killing the host system due to the resulting memory
>> >> pressure.
>> >>
>> >> Temporarily disable RX skb allocation and URB submission when
>> >> the current error ratio is high, preventing us from trying to
>> >> allocate an infinite number of skbs. Reenable as soon as we
>> >> are finished processing the done queue, allowing the device
>> >> to continue working after short error bursts.
>> >>
>> >> Signed-off-by: Bjørn Mork <bjorn@mork.no>
>> >> ---
>> >> So is this starting to look OK?
>> >
>> > It seems to me that we at least need to try some error recovery.
>>
>> Won't the disabling code in usbnet_bh do? RX will only stay disabled
>> until the done queue is handled.
>
> So will the burst of bogus packets stop by itself?

No, in the case I am looking at it won't. So we end up switching this
off/on endlessly. But I believe that is fine. There is no way we can
*know* that the errors won't stop unless we start receiving packets
again. Other devices may have similar temporary bugs, making them start
working again after a while. If we permanently disable RX then we will
just make any such device fail for no good reason.

My only wish for this patch is that it makes usbnet survive the buggy
device without bringing the host down. Not magically fix the device (of
course impossible), or even hide the bug in any way. A non-functional
device will still appear as a non-functional device. Manual user
intervention is required to make it work. This might involve a firmware
upgrade for all we know...

>> > How about resetting the device when it is no longer used?
>>
>> Yes, that we should do. I guess usbnet_open is the place to reset the
>> flag and counters? I'll send another version taking care of this and
>> Joe's comment.
>
> I was thinking about resetting the device, not just counters.

What's the point? We only risk making the issue worse if some device
has a similar temporary bug, fixing itself a while after reset. I think
we should leave any such actions to the user.

> But yes, open() needs to reset the counters, too.

OK, will add that.


Bjørn
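The counter reset agreed on above could look roughly like the following user-space sketch. It is an assumption about the shape of a follow-up change, not code from this thread: `struct usbnet_model` and `model_reset_rx_error_state()` are hypothetical stand-ins, and a plain bit mask replaces the kernel's atomic bit helpers. Only the field names and the `EVENT_RX_KILL` value of 10 come from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define EVENT_RX_KILL 10	/* bit number taken from the patch */

/* Hypothetical stand-in for the relevant parts of struct usbnet. */
struct usbnet_model {
	unsigned long flags;
	unsigned char pkt_cnt, pkt_err;
};

/* What an open() path might do: start with a clean error window
 * and with RX enabled, leaving all other flag bits untouched. */
static void model_reset_rx_error_state(struct usbnet_model *dev)
{
	assert(dev != NULL);

	dev->pkt_cnt = 0;
	dev->pkt_err = 0;
	dev->flags &= ~(1UL << EVENT_RX_KILL);
}
```

Resetting only this state, rather than the device itself, matches the position taken in the reply above: the host protects itself and leaves recovery of the device to the user.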
diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index f34b2eb..64657d6 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -380,6 +380,12 @@ static int rx_submit (struct usbnet *dev, struct urb *urb, gfp_t flags)
 	unsigned long		lockflags;
 	size_t			size = dev->rx_urb_size;
 
+	/* prevent rx skb allocation when error ratio is high */
+	if (test_bit(EVENT_RX_KILL, &dev->flags)) {
+		usb_free_urb(urb);
+		return -ENOLINK;
+	}
+
 	skb = __netdev_alloc_skb_ip_align(dev->net, size, flags);
 	if (!skb) {
 		netif_dbg(dev, rx_err, dev->net, "no rx skb\n");
@@ -539,6 +545,22 @@ block:
 		break;
 	}
 
+	/* stop rx if packet error rate is high */
+	if (++dev->pkt_cnt > 30) {
+		dev->pkt_cnt = 0;
+		dev->pkt_err = 0;
+	} else {
+		if (state == rx_cleanup)
+			dev->pkt_err++;
+		if (dev->pkt_err > 20) {
+			set_bit(EVENT_RX_KILL, &dev->flags);
+			if (net_ratelimit())
+				netif_dbg(dev, rx_err, dev->net,
+					  "rx kill: high error rate\n");
+			dev->pkt_err = 0;
+		}
+	}
+
 	state = defer_bh(dev, skb, &dev->rxq, state);
 
 	if (urb) {
@@ -1254,6 +1276,10 @@ static void usbnet_bh (unsigned long param)
 		}
 	}
 
+	/* restart RX again after disabling due to high error rate */
+	if (test_and_clear_bit(EVENT_RX_KILL, &dev->flags) && net_ratelimit())
+		netif_dbg(dev, rx_err, dev->net, "rx kill: restarting\n");
+
 	// waiting for all pending urbs to complete?
 	if (dev->wait) {
 		if ((dev->txq.qlen + dev->rxq.qlen + dev->done.qlen) == 0) {
diff --git a/include/linux/usb/usbnet.h b/include/linux/usb/usbnet.h
index 5de7a22..0de078d 100644
--- a/include/linux/usb/usbnet.h
+++ b/include/linux/usb/usbnet.h
@@ -33,6 +33,7 @@ struct usbnet {
 	wait_queue_head_t	*wait;
 	struct mutex		phy_mutex;
 	unsigned char		suspend_count;
+	unsigned char		pkt_cnt, pkt_err;
 
 	/* i/o info: pipes etc */
 	unsigned		in, out;
@@ -70,6 +71,7 @@ struct usbnet {
 #		define EVENT_DEV_OPEN	7
 #		define EVENT_DEVICE_REPORT_IDLE	8
 #		define EVENT_NO_RUNTIME_PM	9
+#		define EVENT_RX_KILL	10
 };
 
 static inline struct usb_driver *driver_of(struct usb_interface *intf)
A device sending 0 length frames as fast as it can has been
observed killing the host system due to the resulting memory
pressure.

Temporarily disable RX skb allocation and URB submission when
the current error ratio is high, preventing us from trying to
allocate an infinite number of skbs. Reenable as soon as we
are finished processing the done queue, allowing the device
to continue working after short error bursts.

Signed-off-by: Bjørn Mork <bjorn@mork.no>
---
So is this starting to look OK?

usbnet already uses "throttle", "halt" and "stop" for other functions,
so I decided to name the new flag "kill". No other reason.

Didn't see any point in calculating the error limit. A fixed number
works just as well.

Restarting in usbnet_bh was a simple way to achieve what I wanted:
enabling RX again when we know we can handle it.


Bjørn

 drivers/net/usb/usbnet.c   |   26 ++++++++++++++++++++++++++
 include/linux/usb/usbnet.h |    2 ++
 2 files changed, 28 insertions(+), 0 deletions(-)
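To see how the fixed 30-packet window and 20-error limit behave, the counting logic from the patch can be mirrored in a small user-space model. This is a sketch for experimentation only: `struct rx_model` and the `rx_model_*` functions are hypothetical names, a `bool` stands in for the `EVENT_RX_KILL` flag bit, and `is_error` stands in for the `state == rx_cleanup` test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PKT_WINDOW 30	/* packets per counting window, as in the patch */
#define ERR_LIMIT  20	/* errors tolerated within one window */

struct rx_model {
	unsigned char pkt_cnt, pkt_err;
	bool rx_kill;	/* models the EVENT_RX_KILL flag bit */
};

/* Mirrors the counting hunk added to the RX completion path. */
static void rx_model_complete(struct rx_model *m, bool is_error)
{
	assert(m != NULL);

	if (++m->pkt_cnt > PKT_WINDOW) {
		/* window over: start counting from scratch */
		m->pkt_cnt = 0;
		m->pkt_err = 0;
	} else {
		if (is_error)
			m->pkt_err++;
		if (m->pkt_err > ERR_LIMIT) {
			m->rx_kill = true;
			m->pkt_err = 0;
		}
	}
}

/* Mirrors the check added to rx_submit: no new RX while killed. */
static bool rx_model_may_submit(const struct rx_model *m)
{
	return !m->rx_kill;
}

/* Mirrors the restart added to usbnet_bh. */
static void rx_model_bh(struct rx_model *m)
{
	m->rx_kill = false;
}
```

Feeding this model nothing but errors sets the kill flag on the 21st completion, and `rx_model_bh()` clears it again, which is the endless off/on cycle described in the thread. A stream where every second packet is bad never exceeds the limit within a window, so RX stays enabled.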