Patchwork [net] net: usbnet: prevent buggy devices from killing us

Submitter: Bjørn Mork
Date: Jan. 24, 2013, 7:16 p.m.
Message ID: <1359055016-13603-1-git-send-email-bjorn@mork.no>
Permalink: /patch/215481/
State: Changes Requested
Delegated to: David Miller

Comments

Bjørn Mork - Jan. 24, 2013, 7:16 p.m.
A device sending 0 length frames as fast as it can has been
observed killing the host system due to the resulting memory
pressure.

Temporarily disable RX skb allocation and URB submission when
the current error ratio is high, preventing us from trying to
allocate an infinite number of skbs.  Reenable as soon as we
are finished processing the done queue, allowing the device
to continue working after short error bursts.

Signed-off-by: Bjørn Mork <bjorn@mork.no>
---
So is this starting to look OK?

usbnet already uses "throttle", "halt" and "stop" for other
functions, so I decided to name the new flag "kill".  No other
reason.

Didn't see any point in calculating the error limit.  A fixed
number works just as well.

Restarting in usbnet_bh was a simple way to achieve what I
wanted: enabling RX again when we know we can handle it.


Bjørn

 drivers/net/usb/usbnet.c   |   26 ++++++++++++++++++++++++++
 include/linux/usb/usbnet.h |    2 ++
 2 files changed, 28 insertions(+), 0 deletions(-)
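The windowed error counter in the second usbnet.c hunk can be exercised outside the kernel. The sketch below is a hypothetical userspace model, not driver code. `struct rx_err_window`, `rx_account()` and `rx_feed()` are illustrative names, a plain `bool` stands in for the EVENT_RX_KILL bit, and `is_error` stands in for `state == rx_cleanup`. The arithmetic is copied from the hunk: 21 errors within one 30-completion window trip the flag, and the 31st completion starts a new window without being counted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the counters the patch adds to
 * struct usbnet; rx_kill stands in for the EVENT_RX_KILL bit. */
struct rx_err_window {
	unsigned char pkt_cnt;	/* completions counted in this window */
	unsigned char pkt_err;	/* rx_cleanup completions in this window */
	bool rx_kill;		/* set once the error limit is exceeded */
};

/* Same logic as the hunk; is_error models "state == rx_cleanup". */
static void rx_account(struct rx_err_window *w, bool is_error)
{
	if (++w->pkt_cnt > 30) {
		/* window over: restart counting; this completion is not counted */
		w->pkt_cnt = 0;
		w->pkt_err = 0;
	} else {
		if (is_error)
			w->pkt_err++;
		if (w->pkt_err > 20) {
			/* 21 errors within one window: disable RX */
			w->rx_kill = true;
			w->pkt_err = 0;
		}
	}
}

/* Feed n completions that are either all errors or all successes. */
static void rx_feed(struct rx_err_window *w, int n, bool is_error)
{
	while (n-- > 0)
		rx_account(w, is_error);
}
```

In this model, twenty errors followed by ten good frames fill a window without tripping the flag, while 21 consecutive errors trip it. Within a single window, the fixed limit therefore amounts to "at least 21 of 30 completions failed", an error ratio of 70% or more.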
Oliver Neukum - Jan. 24, 2013, 10:09 p.m.
On Thursday 24 January 2013 20:16:56 Bjørn Mork wrote:
> A device sending 0 length frames as fast as it can has been
> observed killing the host system due to the resulting memory
> pressure.
> 
> Temporarily disable RX skb allocation and URB submission when
> the current error ratio is high, preventing us from trying to
> allocate an infinite number of skbs.  Reenable as soon as we
> are finished processing the done queue, allowing the device
> to continue working after short error bursts.
> 
> Signed-off-by: Bjørn Mork <bjorn@mork.no>
> ---
> So is this starting to look OK?

It seems to me that we at least need to try some error recovery.
How about resetting the device when it is no longer used?

	Regards
		Oliver

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Joe Perches - Jan. 24, 2013, 11:57 p.m.
On Thu, 2013-01-24 at 20:16 +0100, Bjørn Mork wrote:
> A device sending 0 length frames as fast as it can has been
> observed killing the host system due to the resulting memory
> pressure.
[]
> diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
[]
> @@ -539,6 +545,22 @@ block:
>  		break;
>  	}
>  
> +	/* stop rx if packet error rate is high */
> +	if (++dev->pkt_cnt > 30) {
> +		dev->pkt_cnt = 0;
> +		dev->pkt_err = 0;
> +	} else {
> +		if (state == rx_cleanup)
> +			dev->pkt_err++;
> +		if (dev->pkt_err > 20) {
> +			set_bit(EVENT_RX_KILL, &dev->flags);
> +			if (net_ratelimit())
> +				netif_dbg(dev, rx_err, dev->net,
> +					  "rx kill: high error rate\n");
> +			dev->pkt_err = 0;
> +		}
> +	}

Maybe use ratelimit() here?

> diff --git a/include/linux/usb/usbnet.h b/include/linux/usb/usbnet.h
[]
> @@ -33,6 +33,7 @@ struct usbnet {
>  	wait_queue_head_t	*wait;
>  	struct mutex		phy_mutex;
>  	unsigned char		suspend_count;
> +	unsigned char		pkt_cnt, pkt_err;

and instead:

	struct ratelimit_state errors;
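Joe's suggestion would replace the global net_ratelimit() with a per-device limiter. As a rough illustration of what an interval/burst limiter of the `struct ratelimit_state` kind does, here is a hypothetical userspace model. `struct rl_model`, `rl_init()` and `rl_allow()` are made-up names, and time is passed in as plain ticks rather than jiffies. At most `burst` messages are let through per `interval`, and the rest are counted as missed. This is a sketch of the idea, not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of an interval/burst rate limiter; "now" is an
 * abstract tick count supplied by the caller. */
struct rl_model {
	long interval;	/* window length in ticks */
	int burst;	/* messages allowed per window */
	long begin;	/* tick at which the current window started */
	int printed;	/* messages allowed in the current window */
	int missed;	/* messages suppressed in total */
};

static void rl_init(struct rl_model *rs, long interval, int burst)
{
	rs->interval = interval;
	rs->burst = burst;
	rs->begin = 0;
	rs->printed = 0;
	rs->missed = 0;
}

/* Returns true if a message may be emitted at tick "now". */
static bool rl_allow(struct rl_model *rs, long now)
{
	if (now - rs->begin >= rs->interval) {
		/* the previous window has expired: start a new one */
		rs->begin = now;
		rs->printed = 0;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}
	rs->missed++;
	return false;
}
```

A per-device instance would keep a noisy usbnet device from draining the budget that net_ratelimit() shares with the rest of the networking stack. The trade-off is extra state and initialisation for every device.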



Bjørn Mork - Jan. 25, 2013, 7:13 a.m.
Oliver Neukum <oliver@neukum.org> writes:
> On Thursday 24 January 2013 20:16:56 Bjørn Mork wrote:
>> []
>> So is this starting to look OK?
>
> It seems to me that we at least need to try some error recovery.

Won't the disabling code in usbnet_bh do? RX will only stay disabled
until the done queue is handled.

> How about resetting the device when it is no longer used?

Yes, that we should do. I guess usbnet_open is the place to reset the
flag and counters? I'll send another version taking care of this and
Joe's comment.


Bjørn
Bjørn Mork - Jan. 25, 2013, 8:14 a.m.
Joe Perches <joe@perches.com> writes:

> On Thu, 2013-01-24 at 20:16 +0100, Bjørn Mork wrote:
> []
>> diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
> []
>> @@ -539,6 +545,22 @@ block:
>>  		break;
>>  	}
>>  
>> +	/* stop rx if packet error rate is high */
>> +	if (++dev->pkt_cnt > 30) {
>> +		dev->pkt_cnt = 0;
>> +		dev->pkt_err = 0;
>> +	} else {
>> +		if (state == rx_cleanup)
>> +			dev->pkt_err++;
>> +		if (dev->pkt_err > 20) {
>> +			set_bit(EVENT_RX_KILL, &dev->flags);
>> +			if (net_ratelimit())
>> +				netif_dbg(dev, rx_err, dev->net,
>> +					  "rx kill: high error rate\n");
>> +			dev->pkt_err = 0;
>> +		}
>> +	}
>
> Maybe use ratelimit() here?
>
>> diff --git a/include/linux/usb/usbnet.h b/include/linux/usb/usbnet.h
> []
>> @@ -33,6 +33,7 @@ struct usbnet {
>>  	wait_queue_head_t	*wait;
>>  	struct mutex		phy_mutex;
>>  	unsigned char		suspend_count;
>> +	unsigned char		pkt_cnt, pkt_err;
>
> and instead:
>
> 	struct ratelimit_state errors;

Thanks.  I took a look at this, but it seems to be more complex than I
really wanted for keeping the debug noise down here.  The rest of usbnet
does not care much about rate limiting debug messages at all.  I'll get
a message for every 0 length packet for example.

Maybe usbnet should get a private debug ratelimiter used throughout the driver?

Is the problem that these instances will hide more important net
messages?  Would it help to make the ratelimit call depend on whether
debugging is enabled, as ath5k and brcm80211 seem to do?



Bjørn
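The concern behind the last question can be shown with a toy model. `if (net_ratelimit()) netif_dbg(...)` consumes the shared budget before netif_dbg() decides whether the debug message is enabled at all, so suppressed debug output can still use up slots that other messages need. Checking the debug condition first avoids that. In the sketch below, `struct budget`, `take()` and both `log_*()` helpers are hypothetical, and a simple counter stands in for the real limiter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a limiter shared by many callers. */
struct budget {
	int left;	/* messages still allowed */
};

static bool take(struct budget *b)
{
	if (b->left > 0) {
		b->left--;
		return true;
	}
	return false;
}

/* Order used in the patch: the limiter is consulted (and charged)
 * even when the debug message would not be printed. */
static bool log_limit_first(struct budget *b, bool dbg_enabled)
{
	return take(b) && dbg_enabled;
}

/* Alternative order: only charge the limiter for messages that
 * would actually be printed. */
static bool log_enabled_first(struct budget *b, bool dbg_enabled)
{
	return dbg_enabled && take(b);
}
```

After five calls with debugging disabled, the first variant has nothing left for an unrelated message, while the second still has its full budget.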

Oliver Neukum - Jan. 25, 2013, 12:02 p.m.
On Friday 25 January 2013 08:13:15 Bjørn Mork wrote:
> Oliver Neukum <oliver@neukum.org> writes:
> > On Thursday 24 January 2013 20:16:56 Bjørn Mork wrote:
> >> []
> >> So is this starting to look OK?
> >
> > It seems to me that we at least need to try some error recovery.
> 
> Won't the disabling code in usbnet_bh do? RX will only stay disabled
> until the done queue is handled.

So will the burst of bogus packets stop by itself?

> 
> > How about resetting the device when it is no longer used?
> 
> Yes, that we should do. I guess usbnet_open is the place to reset the
> flag and counters? I'll send another version taking care of this and
> Joe's comment.

I was thinking about resetting the device, not just counters.
But yes, open() needs to reset the counters, too.

	Regards
		Oliver

Bjørn Mork - Jan. 25, 2013, 12:27 p.m.
Oliver Neukum <oliver@neukum.org> writes:
> On Friday 25 January 2013 08:13:15 Bjørn Mork wrote:
>> Oliver Neukum <oliver@neukum.org> writes:
>> > On Thursday 24 January 2013 20:16:56 Bjørn Mork wrote:
>> >> []
>> >> So is this starting to look OK?
>> >
>> > It seems to me that we at least need to try some error recovery.
>> 
>> Won't the disabling code in usbnet_bh do? RX will only stay disabled
>> until the done queue is handled.
>
> So will the burst of bogus packets stop by itself?

No, in the case I am looking at it won't.  So we end up switching this
off/on endlessly.

But I believe that is fine. There is no way we can *know* that the
errors won't stop unless we start receiving packets again.  Other
devices may have similar temporary bugs, making them start working again
after a while. If we permanently disable RX then we will just make any
such device fail for no good reason.

My only wish for this patch is that it makes usbnet survive the buggy
device without bringing the host down.  It is not meant to magically fix
the device (which is of course impossible), or even to hide the bug in
any way.  A non-functional device will still appear as a non-functional
device, and manual user intervention is required to make it work.  This
might involve a firmware upgrade for all we know...
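What "switching this off/on endlessly" costs can be estimated with a deliberately simplified model. It assumes one URB in flight, every completion an error, rx_submit() refusing to allocate while the kill flag is set, and the bottom half clearing the flag afterwards. `struct rx_model`, `submit()`, `complete_error()` and `run_burst()` are hypothetical names. The counter logic is the one from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one usbnet device with a single RX URB. */
struct rx_model {
	unsigned char pkt_cnt, pkt_err;	/* counters from the patch */
	bool rx_kill;			/* stands in for EVENT_RX_KILL */
	long skbs;			/* total skb allocations */
};

/* rx_submit(): no allocation while RX is killed (the patch returns
 * -ENOLINK in that case). */
static bool submit(struct rx_model *m)
{
	if (m->rx_kill)
		return false;
	m->skbs++;
	return true;
}

/* Completion handling for a frame that always ends up in rx_cleanup. */
static void complete_error(struct rx_model *m)
{
	if (++m->pkt_cnt > 30) {
		m->pkt_cnt = 0;
		m->pkt_err = 0;
	} else if (++m->pkt_err > 20) {
		m->rx_kill = true;
		m->pkt_err = 0;
	}
}

/* Resubmit until rx_submit() refuses, then let usbnet_bh() clear the
 * flag. Returns the number of skbs allocated during this burst. */
static long run_burst(struct rx_model *m)
{
	long before = m->skbs;

	while (submit(m))
		complete_error(m);
	m->rx_kill = false;	/* test_and_clear_bit() in usbnet_bh() */
	return m->skbs - before;
}
```

In this model the first burst allocates 21 skbs and every later burst allocates 31, because only pkt_err, not pkt_cnt, is reset when the flag trips. RX keeps cycling as described above, but each cycle allocates a bounded number of skbs instead of an unbounded stream.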

>> > How about resetting the device when it is no longer used?
>> 
>> Yes, that we should do. I guess usbnet_open is the place to reset the
>> flag and counters? I'll send another version taking care of this and
>> Joe's comment.
>
> I was thinking about resetting the device, not just counters.

What's the point? We only risk making the issue worse if some device has
a similar temporary bug, fixing itself a while after reset.  I think we
should leave any such actions to the user.

> But yes, open() needs to reset the counters, too.

OK, will add that.


Bjørn
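The reset on open() that Oliver and Bjørn agree on could look roughly like the hypothetical helper below, again modelled in userspace: zero both counters and clear the kill flag. In the driver this would presumably mean clearing `dev->pkt_cnt`, `dev->pkt_err` and the EVENT_RX_KILL bit in usbnet_open(). That is an assumption about the next revision, not code from this thread. The model shows why the reset matters: state left over from a previous session can trip the limit on the very first error after reopening.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace copy of the RX error state. */
struct rx_err_state {
	unsigned char pkt_cnt, pkt_err;
	bool rx_kill;	/* stands in for EVENT_RX_KILL */
};

/* Counter logic from the patch; is_error models "state == rx_cleanup". */
static void rx_err_account(struct rx_err_state *s, bool is_error)
{
	if (++s->pkt_cnt > 30) {
		s->pkt_cnt = 0;
		s->pkt_err = 0;
	} else {
		if (is_error)
			s->pkt_err++;
		if (s->pkt_err > 20) {
			s->rx_kill = true;
			s->pkt_err = 0;
		}
	}
}

/* Hypothetical helper for the open path: forget everything counted
 * while the interface was previously up. */
static void rx_err_reset(struct rx_err_state *s)
{
	s->pkt_cnt = 0;
	s->pkt_err = 0;
	s->rx_kill = false;
}
```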

Patch

diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index f34b2eb..64657d6 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -380,6 +380,12 @@  static int rx_submit (struct usbnet *dev, struct urb *urb, gfp_t flags)
 	unsigned long		lockflags;
 	size_t			size = dev->rx_urb_size;
 
+	/* prevent rx skb allocation when error ratio is high */
+	if (test_bit(EVENT_RX_KILL, &dev->flags)) {
+		usb_free_urb(urb);
+		return -ENOLINK;
+	}
+
 	skb = __netdev_alloc_skb_ip_align(dev->net, size, flags);
 	if (!skb) {
 		netif_dbg(dev, rx_err, dev->net, "no rx skb\n");
@@ -539,6 +545,22 @@  block:
 		break;
 	}
 
+	/* stop rx if packet error rate is high */
+	if (++dev->pkt_cnt > 30) {
+		dev->pkt_cnt = 0;
+		dev->pkt_err = 0;
+	} else {
+		if (state == rx_cleanup)
+			dev->pkt_err++;
+		if (dev->pkt_err > 20) {
+			set_bit(EVENT_RX_KILL, &dev->flags);
+			if (net_ratelimit())
+				netif_dbg(dev, rx_err, dev->net,
+					  "rx kill: high error rate\n");
+			dev->pkt_err = 0;
+		}
+	}
+
 	state = defer_bh(dev, skb, &dev->rxq, state);
 
 	if (urb) {
@@ -1254,6 +1276,10 @@  static void usbnet_bh (unsigned long param)
 		}
 	}
 
+	/* restart RX again after disabling due to high error rate */
+	if (test_and_clear_bit(EVENT_RX_KILL, &dev->flags) && net_ratelimit())
+		netif_dbg(dev, rx_err, dev->net, "rx kill: restarting\n");
+
 	// waiting for all pending urbs to complete?
 	if (dev->wait) {
 		if ((dev->txq.qlen + dev->rxq.qlen + dev->done.qlen) == 0) {
diff --git a/include/linux/usb/usbnet.h b/include/linux/usb/usbnet.h
index 5de7a22..0de078d 100644
--- a/include/linux/usb/usbnet.h
+++ b/include/linux/usb/usbnet.h
@@ -33,6 +33,7 @@  struct usbnet {
 	wait_queue_head_t	*wait;
 	struct mutex		phy_mutex;
 	unsigned char		suspend_count;
+	unsigned char		pkt_cnt, pkt_err;
 
 	/* i/o info: pipes etc */
 	unsigned		in, out;
@@ -70,6 +71,7 @@  struct usbnet {
 #		define EVENT_DEV_OPEN	7
 #		define EVENT_DEVICE_REPORT_IDLE	8
 #		define EVENT_NO_RUNTIME_PM	9
+#		define EVENT_RX_KILL	10
 };
 
 static inline struct usb_driver *driver_of(struct usb_interface *intf)