Patchwork [RFC] net: usbnet: prevent buggy devices from killing us

Submitter Bjørn Mork
Date Jan. 24, 2013, 10:25 a.m.
Message ID <1359023152-32576-1-git-send-email-bjorn@mork.no>
Permalink /patch/215306/
State Superseded

Comments

Bjørn Mork - Jan. 24, 2013, 10:25 a.m.
A device sending 0 length frames as fast as it can has been
observed killing the host system due to the resulting memory
pressure. We handle the done queue as fast as we can, so
if this queue is filling up then that is an indication that we
are under too heavy pressure.  Refusing further allocations
until the done queue is handled prevents the buggy device
from taking the system down.

Signed-off-by: Bjørn Mork <bjorn@mork.no>
---
Hello Oliver,

The MBIM firmware for the Sierra Wireless MC7710 is a nice source
of "interesting" device issues.  One of the uglier ones is that
it under certain conditions will start flooding us with frames
having length 0 as fast as it can.  And that is pretty fast...

My older laptop dies immediately under this.  It just cannot keep
up with the infinite allocations usbnet will do when the done
queue first starts growing beyond reason.

I really do not have a clue how to handle this problem, but this
patch seems to do the job for me without affecting normal devices.
The queue limit is just a number which Works For Me, leaving the
system running with the buggy device and not kicking in under
normal load.

What do you think? Is there some other way this should be solved?



Bjørn

 drivers/net/usb/usbnet.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)
Oliver Neukum - Jan. 24, 2013, 10:46 a.m.
On Thursday 24 January 2013 11:25:52 Bjørn Mork wrote:
> The MBIM firmware for the Sierra Wireless MC7710 is a nice source
> of "interesting" device issues.  One of the uglier ones is that
> it under certain conditions will start flooding us with frames
> having length 0 as fast as it can.  And that is pretty fast...

If you can tell that those frames are bogus, why not count them
as opposed to generic qlen? Say, if you see 20 among the last 30
are of this type, throttle.

	Regards
		Oliver

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjørn Mork - Jan. 24, 2013, 10:47 a.m.
Bjørn Mork <bjorn@mork.no> writes:

> A device sending 0 length frames as fast as it can has been
> observed killing the host system due to the resulting memory
> pressure. We handle the done queue as fast as we can, so
> if this queue is filling up then that is an indication that we
> are under too heavy pressure.  Refusing further allocations
> until the done queue is handled prevents the buggy device
> from taking the system down.
>
> Signed-off-by: Bjørn Mork <bjorn@mork.no>
> ---
> Hello Oliver,
>
> The MBIM firmware for the Sierra Wireless MC7710 is a nice source
> of "interesting" device issues.  One of the uglier ones is that
> it under certain conditions will start flooding us with frames
> having length 0 as fast as it can.  And that is pretty fast...
>
> My older laptop dies immediately under this.  It just cannot keep
> up with the infinite allocations usbnet will do when the done
> queue first starts growing beyond reason.
>
> I really do not have a clue how to handle this problem, but this
> patch seems to do the job for me without affecting normal devices.
> The queue limit is just a number which Works For Me, leaving the
> system running with the buggy device and not kicking in under
> normal load.
>
> What do you think? Is there some other way this should be solved?


To illustrate the problem, this is the start and stop debug output for such
a buggy device session *with* the RFC patch applied:

Jan 24 11:16:23 nemi kernel: [ 3187.624164] qmi_wwan 8-4:1.8 wwan0: open: enable queueing (rx 60, tx 60) mtu 1500 simple framing
Jan 24 11:16:38 nemi kernel: [ 3202.536921] qmi_wwan 8-4:1.8 wwan0: stop stats: rx/tx 1/11, errs 738980/0

I believe the stats tell the full story...

I do not have any logs without the throttling patch, as that takes down
everything on my laptop including the ahci driver and keyboard.  Not
even the magic sysrq is working then.

If anyone is interested in the full debug log (211KB compressed) from
the above session, then I've put it on
http://www.mork.no/~bjorn/usbnet-zero-packet-fix.log.gz


It is mostly full of "rx length 0" lines, but with an occasional
sequence of

Jan 24 11:16:23 nemi kernel: [ 3187.682669] qmi_wwan 8-4:1.8 wwan0: done queue filling up (1025) - throttling
Jan 24 11:16:23 nemi kernel: [ 3187.682669] qmi_wwan 8-4:1.8 wwan0: rx length 0
Jan 24 11:16:23 nemi kernel: [ 3187.682669] qmi_wwan 8-4:1.8 wwan0: done queue filling up (1026) - throttling
Jan 24 11:16:23 nemi kernel: [ 3187.682669] qmi_wwan 8-4:1.8 wwan0: rx length 0
Jan 24 11:16:23 nemi kernel: [ 3187.682669] qmi_wwan 8-4:1.8 wwan0: done queue filling up (1027) - throttling
Jan 24 11:16:23 nemi kernel: [ 3187.682669] qmi_wwan 8-4:1.8 wwan0: rx length 0
Jan 24 11:16:23 nemi kernel: [ 3187.682669] qmi_wwan 8-4:1.8 wwan0: done queue filling up (1028) - throttling
Jan 24 11:16:23 nemi kernel: [ 3187.682669] qmi_wwan 8-4:1.8 wwan0: rx length 0
Jan 24 11:16:23 nemi kernel: [ 3187.682669] qmi_wwan 8-4:1.8 wwan0: done queue filling up (1029) - throttling
Jan 24 11:16:23 nemi kernel: [ 3187.682669] qmi_wwan 8-4:1.8 wwan0: rx length 0
Jan 24 11:16:23 nemi kernel: [ 3187.682669] qmi_wwan 8-4:1.8 wwan0: done queue filling up (1030) - throttling
Jan 24 11:16:23 nemi kernel: [ 3187.682669] qmi_wwan 8-4:1.8 wwan0: rx length 0
Jan 24 11:16:23 nemi kernel: [ 3187.682669] qmi_wwan 8-4:1.8 wwan0: done queue filling up (1031) - throttling
Jan 24 11:16:23 nemi kernel: [ 3187.682669] qmi_wwan 8-4:1.8 wwan0: rx length 0
Jan 24 11:16:23 nemi kernel: [ 3187.682669] qmi_wwan 8-4:1.8 wwan0: done queue filling up (1032) - throttling
Jan 24 11:16:23 nemi kernel: [ 3187.682669] qmi_wwan 8-4:1.8 wwan0: rx length 0
Jan 24 11:16:23 nemi kernel: [ 3187.682669] qmi_wwan 8-4:1.8 wwan0: done queue filling up (1033) - throttling
Jan 24 11:16:23 nemi kernel: [ 3187.682669] qmi_wwan 8-4:1.8 wwan0: rx length 0
Jan 24 11:16:23 nemi kernel: [ 3187.682669] qmi_wwan 8-4:1.8 wwan0: done queue filling up (1034) - throttling
Jan 24 11:16:23 nemi kernel: [ 3187.697826] qmi_wwan 8-4:1.8 wwan0: rxqlen 0 --> 10


showing that the throttling is kicking in and doing its job.



Bjørn
Bjørn Mork - Jan. 24, 2013, 10:52 a.m.
Oliver Neukum <oneukum@suse.de> writes:

> On Thursday 24 January 2013 11:25:52 Bjørn Mork wrote:
>> The MBIM firmware for the Sierra Wireless MC7710 is a nice source
>> of "interesting" device issues.  One of the uglier ones is that
>> it under certain conditions will start flooding us with frames
>> having length 0 as fast as it can.  And that is pretty fast...
>
> If you can tell that those frames are bogus, why not count them
> as opposed to generic qlen? Say, if you see 20 among the last 30
> are of this type, throttle.

Sounds like a good idea, but where do I add/get that statistics?


Bjørn
Oliver Neukum - Jan. 24, 2013, 11:03 a.m.
On Thursday 24 January 2013 11:52:22 Bjørn Mork wrote:
> Oliver Neukum <oneukum@suse.de> writes:
> 
> > On Thursday 24 January 2013 11:25:52 Bjørn Mork wrote:
> >> The MBIM firmware for the Sierra Wireless MC7710 is a nice source
> >> of "interesting" device issues.  One of the uglier ones is that
> >> it under certain conditions will start flooding us with frames
> >> having length 0 as fast as it can.  And that is pretty fast...
> >
> > If you can tell that those frames are bogus, why not count them
> > as opposed to generic qlen? Say, if you see 20 among the last 30
> > are of this type, throttle.
> 
> Sounds like a good idea, but where do I add/get that statistics?

rx_complete

It need not be very accurate.

	Regards
		Oliver

Bjørn Mork - Jan. 24, 2013, 11:22 a.m.
Oliver Neukum <oneukum@suse.de> writes:

> On Thursday 24 January 2013 11:52:22 Bjørn Mork wrote:
>> Oliver Neukum <oneukum@suse.de> writes:
>> 
>> > On Thursday 24 January 2013 11:25:52 Bjørn Mork wrote:
>> >> The MBIM firmware for the Sierra Wireless MC7710 is a nice source
>> >> of "interesting" device issues.  One of the uglier ones is that
>> >> it under certain conditions will start flooding us with frames
>> >> having length 0 as fast as it can.  And that is pretty fast...
>> >
>> > If you can tell that those frames are bogus, why not count them
>> > as opposed to generic qlen? Say, if you see 20 among the last 30
>> > are of this type, throttle.
>> 
>> Sounds like a good idea, but where do I add/get that statistics?
>
> rx_complete
>
> It need not be very accurate.

Sorry for being daft, but how do I code the "20 among the last 30" part
there?


Bjørn
Oliver Neukum - Jan. 24, 2013, 11:31 a.m.
On Thursday 24 January 2013 12:22:54 Bjørn Mork wrote:
> Oliver Neukum <oneukum@suse.de> writes:
> 
> > On Thursday 24 January 2013 11:52:22 Bjørn Mork wrote:
> >> Oliver Neukum <oneukum@suse.de> writes:
> >> 
> >> > On Thursday 24 January 2013 11:25:52 Bjørn Mork wrote:
> >> >> The MBIM firmware for the Sierra Wireless MC7710 is a nice source
> >> >> of "interesting" device issues.  One of the uglier ones is that
> >> >> it under certain conditions will start flooding us with frames
> >> >> having length 0 as fast as it can.  And that is pretty fast...
> >> >
> >> > If you can tell that those frames are bogus, why not count them
> >> > as opposed to generic qlen? Say, if you see 20 among the last 30
> >> > are of this type, throttle.
> >> 
> >> Sounds like a good idea, but where do I add/get that statistics?
> >
> > rx_complete
> >
> > It need not be very accurate.
> 
> Sorry for being daft, but how do I code the "20 among the last 30" part
> there?

Just by agreeing that you can live with false negatives but not false positives

if (++counter > 30) {
	counter = bogus = 0;
} else {
	if (is_bogus(packet))
		bogus++;
	if (bogus > counter/2)
		throttle();
}

	Regards
		Oliver
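
The counting idea discussed above can be tried out in user space. The
following is a minimal sketch, not the code from this thread: it is a
variant that only decides at the end of each block of 30 frames (a
tumbling window rather than a true "last 30"), so a single zero-length
frame cannot trigger throttling by itself. The names bogus_window,
frame_is_bogus() and bogus_window_update() are hypothetical, and treating
"length 0" as the bogus test is an assumption taken from the bug report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WINDOW_FRAMES   30  /* frames per evaluation window */
#define BOGUS_THRESHOLD 20  /* bogus frames per window that trigger throttling */

/* Hypothetical per-device state; not a field of struct usbnet. */
struct bogus_window {
	unsigned int seen;   /* frames counted in the current window */
	unsigned int bogus;  /* of those, how many were bogus */
};

/* Assumed test for a bogus frame: here simply a zero-length one. */
static bool frame_is_bogus(size_t len)
{
	return len == 0;
}

/*
 * Account for one received frame. Returns true only when a full window
 * has just ended and at least BOGUS_THRESHOLD of its frames were bogus.
 * The counters are reset at the end of every window.
 */
static bool bogus_window_update(struct bogus_window *w, size_t frame_len)
{
	bool throttle;

	assert(w != NULL);

	if (frame_is_bogus(frame_len))
		w->bogus++;
	if (++w->seen < WINDOW_FRAMES)
		return false;

	throttle = w->bogus >= BOGUS_THRESHOLD;
	w->seen = 0;
	w->bogus = 0;
	return throttle;
}
```

A caller would invoke bogus_window_update() once per completed receive and
throttle when it returns true. Deciding only at window boundaries trades
some reaction time for fewer false positives, in line with the
"false negatives but not false positives" remark.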

Sergei Shtylyov - Jan. 24, 2013, 12:39 p.m.
Hello.

On 24-01-2013 14:25, Bjørn Mork wrote:

> A device sending 0 length frames as fast as it can has been
> observed killing the host system due to the resulting memory
> pressure. We handle the done queue as fast as we can, so
> if this queue is filling up then that is an indication that we
> are under too heavy pressure.  Refusing further allocations
> until the done queue is handled prevents the buggy device
> from taking the system down.
>
> Signed-off-by: Bjørn Mork <bjorn@mork.no>
[...]

>   drivers/net/usb/usbnet.c |    8 ++++++++
>   1 files changed, 8 insertions(+), 0 deletions(-)

> diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
> index f34b2eb..85c7ffd 100644
> --- a/drivers/net/usb/usbnet.c
> +++ b/drivers/net/usb/usbnet.c
> @@ -380,6 +380,14 @@ static int rx_submit (struct usbnet *dev, struct urb *urb, gfp_t flags)
>   	unsigned long		lockflags;
>   	size_t			size = dev->rx_urb_size;
>
> +	/* Do not let a device flood us to death! */
> +	if (dev->done.qlen > 1024) {
> +		netif_dbg(dev, rx_err, dev->net, "done queue filling up (%u) - throttling\n", dev->done.qlen);
> +		usbnet_defer_kevent (dev, EVENT_RX_MEMORY);
> +		usb_free_urb (urb);

    Run your patch thru scripts/checkpatch.pl please -- spaces before parens 
are not allowed.

WBR, Sergei

Joe Perches - Jan. 24, 2013, 1:01 p.m.
On Thu, 2013-01-24 at 16:39 +0400, Sergei Shtylyov wrote:
> On 24-01-2013 14:25, Bjørn Mork wrote:
> > A device sending 0 length frames as fast as it can has been
> > observed killing the host system due to the resulting memory
> > pressure.
[]
> > diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
[]
> > +	/* Do not let a device flood us to death! */
> > +	if (dev->done.qlen > 1024) {
> > +		netif_dbg(dev, rx_err, dev->net, "done queue filling up (%u) - throttling\n", dev->done.qlen);
> > +		usbnet_defer_kevent (dev, EVENT_RX_MEMORY);
> > +		usb_free_urb (urb);
> 
>     Run your patch thru scripts/checkpatch.pl please

And maybe ratelimit the netif_dbg
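
As an illustration of what rate limiting the debug message means, here is
a small user-space sketch. It is not kernel code; in the kernel one would
more likely reach for an existing helper such as net_ratelimit(). The
struct log_limit type and log_allowed() function are hypothetical, and
"ticks" stands in for whatever clock the caller has.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical limiter: at most 'burst' messages per 'interval' ticks. */
struct log_limit {
	unsigned long interval;      /* window length in ticks, must be > 0 */
	unsigned int  burst;         /* messages allowed per window */
	unsigned long window_start;  /* tick at which the current window began */
	unsigned int  printed;       /* messages allowed in the current window */
	unsigned int  missed;        /* messages suppressed so far */
};

/* Returns true if a message may be printed at time 'now'. */
static bool log_allowed(struct log_limit *rl, unsigned long now)
{
	assert(rl != NULL && rl->interval > 0);

	/* Start a fresh window once the previous one has expired. */
	if (now - rl->window_start >= rl->interval) {
		rl->window_start = now;
		rl->printed = 0;
	}
	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}
	rl->missed++;
	return false;
}
```

With such a check in front of the "done queue filling up" message, a
flood like the one in the log excerpt would produce a handful of lines per
interval instead of one per refused submission.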


Patch

diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index f34b2eb..85c7ffd 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -380,6 +380,14 @@  static int rx_submit (struct usbnet *dev, struct urb *urb, gfp_t flags)
 	unsigned long		lockflags;
 	size_t			size = dev->rx_urb_size;
 
+	/* Do not let a device flood us to death! */
+	if (dev->done.qlen > 1024) {
+		netif_dbg(dev, rx_err, dev->net, "done queue filling up (%u) - throttling\n", dev->done.qlen);
+		usbnet_defer_kevent (dev, EVENT_RX_MEMORY);
+		usb_free_urb (urb);
+		return -ENOMEM;
+	}
+
 	skb = __netdev_alloc_skb_ip_align(dev->net, size, flags);
 	if (!skb) {
 		netif_dbg(dev, rx_err, dev->net, "no rx skb\n");
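
The effect of the queue-length check in this patch can be modelled in user
space. The sketch below is only a model under stated assumptions: struct
model_dev and the model_*() functions are hypothetical stand-ins, not
usbnet code, and the real patch additionally frees the URB and defers
EVENT_RX_MEMORY when it refuses a submission.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DONE_QLEN_LIMIT 1024  /* threshold used in the RFC patch */

/* Hypothetical stand-in for the parts of the device state that matter here. */
struct model_dev {
	unsigned int done_qlen;  /* completed buffers awaiting processing */
	unsigned int refused;    /* submissions refused while over the limit */
};

/* Models the new check: refuse to allocate while the done queue is too long. */
static int model_rx_submit(struct model_dev *dev)
{
	assert(dev != NULL);

	if (dev->done_qlen > DONE_QLEN_LIMIT) {
		dev->refused++;
		return -ENOMEM;
	}
	return 0;
}

/* Models a receive completion: the buffer lands on the done queue. */
static void model_rx_complete(struct model_dev *dev)
{
	dev->done_qlen++;
}

/* Models the bottom half draining up to 'budget' entries from the queue. */
static void model_bh(struct model_dev *dev, unsigned int budget)
{
	dev->done_qlen -= (budget < dev->done_qlen) ? budget : dev->done_qlen;
}
```

Driving this model with an endless stream of completions and no draining
leaves done_qlen capped at 1025, which matches the first value reported in
the "done queue filling up" log lines; once model_bh() drains the queue,
submissions are accepted again.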