diff mbox

[RFC,V1,1/1] net: cdc_ncm: Reduce memory use when kernel memory low

Message ID 1494956480-6127-2-git-send-email-jim_baxter@mentor.com
State Superseded, archived
Delegated to: David Miller
Headers show

Commit Message

Jim Baxter May 16, 2017, 5:41 p.m. UTC
The CDC-NCM driver can require large amounts of memory to create
skb's and this can be a problem when the memory becomes fragmented.

This especially affects embedded systems that have constrained
resources but wish to maximise the throughput of CDC-NCM with 16KiB
NTB's.

The issue is after running for a while the kernel memory can become
fragmented and it needs compacting.
If the NTB allocation is needed before the memory has been compacted
the atomic allocation can fail which can cause increased latency,
large re-transmissions or disconnections depending upon the data
being transmitted at the time.
This situation occurs for less than a second until the kernel has
compacted the memory but the failed devices can take a lot longer to
recover from the failed TX packets.

To ease this temporary situation I modified the CDC-NCM TX path to
temporarily switch into a reduced memory mode which allocates an NTB
that will fit into a USB_CDC_NCM_NTB_MIN_OUT_SIZE (default 2048 Bytes)
sized memory block and only transmit NTB's with a single network frame
until the memory situation is resolved.
Once the memory is compacted the CDC-NCM data can resume transmitting
at the normal tx_max rate once again.

Signed-off-by: Jim Baxter <jim_baxter@mentor.com>
---
 drivers/net/usb/cdc_ncm.c   | 39 +++++++++++++++++++++++++++------------
 include/linux/usb/cdc_ncm.h |  1 +
 2 files changed, 28 insertions(+), 12 deletions(-)

Comments

Bjørn Mork May 16, 2017, 6:24 p.m. UTC | #1
Jim Baxter <jim_baxter@mentor.com> writes:

> The CDC-NCM driver can require large amounts of memory to create
> skb's and this can be a problem when the memory becomes fragmented.
>
> This especially affects embedded systems that have constrained
> resources but wish to maximise the throughput of CDC-NCM with 16KiB
> NTB's.
>
> The issue is after running for a while the kernel memory can become
> fragmented and it needs compacting.
> If the NTB allocation is needed before the memory has been compacted
> the atomic allocation can fail which can cause increased latency,
> large re-transmissions or disconnections depending upon the data
> being transmitted at the time.
> This situation occurs for less than a second until the kernel has
> compacted the memory but the failed devices can take a lot longer to
> recover from the failed TX packets.
>
> To ease this temporary situation I modified the CDC-NCM TX path to
> temporarily switch into a reduced memory mode which allocates an NTB
> that will fit into a USB_CDC_NCM_NTB_MIN_OUT_SIZE (default 2048 Bytes)
> sized memory block and only transmit NTB's with a single network frame
> until the memory situation is resolved.
> Once the memory is compacted the CDC-NCM data can resume transmitting
> at the normal tx_max rate once again.

I must say that I don't like the additional complexity added here.  If
there are memory issues and you can reduce the buffer size to
USB_CDC_NCM_NTB_MIN_OUT_SIZE, then why don't you just set a lower tx_max
buffer size in the first place?

  echo 2048 > /sys/class/net/wwan0/cdc_ncm/tx_max




Bjørn
Oliver Neukum May 17, 2017, 7:44 a.m. UTC | #2
Am Dienstag, den 16.05.2017, 20:24 +0200 schrieb Bjørn Mork:
> 
> I must say that I don't like the additional complexity added here.  If
> there are memory issues and you can reduce the buffer size to
> USB_CDC_NCM_NTB_MIN_OUT_SIZE, then why don't you just set a lower tx_max
> buffer size in the first place?
> 
>   echo 2048 > /sys/class/net/wwan0/cdc_ncm/tx_max
> 

Hi,

that would hurt performance across the board though.
Can we trigger memory compactation earlier?

	Regards
		Oliver
Jim Baxter May 17, 2017, 10:56 a.m. UTC | #3
From: Oliver Neukum (oneukum@suse.com) Sent: Wed, 17 May 2017 09:44:20 +0200 

> Am Dienstag, den 16.05.2017, 20:24 +0200 schrieb Bjørn Mork:
>>
>> I must say that I don't like the additional complexity added here.  If
>> there are memory issues and you can reduce the buffer size to
>> USB_CDC_NCM_NTB_MIN_OUT_SIZE, then why don't you just set a lower tx_max
>> buffer size in the first place?
>>
>>   echo 2048 > /sys/class/net/wwan0/cdc_ncm/tx_max
>>

Hi

The issue is we need the higher performance for Mirrorlink to be able to
work correctly. The low memory situation only occurs very occasionally and
once the kernel has run compaction if this issue occurs again it will be
many hours later.

> 
> Hi,
> 
> that would hurt performance across the board though.
> Can we trigger memory compactation earlier?
> 
> 	Regards
> 		Oliver
> 

We initially tried increasing the vm.min_free_kbytes but the value needed to address
the problem was around 65536 which then meant some other applications were unable to
run due to there not being enough free memory.
The i.MX6 based system has 863MB of RAM in total.

Regards,
Jim
David Miller May 17, 2017, 6:18 p.m. UTC | #4
From: Bjørn Mork <bjorn@mork.no>
Date: Tue, 16 May 2017 20:24:10 +0200

> Jim Baxter <jim_baxter@mentor.com> writes:
> 
>> The CDC-NCM driver can require large amounts of memory to create
>> skb's and this can be a problem when the memory becomes fragmented.
>>
>> This especially affects embedded systems that have constrained
>> resources but wish to maximise the throughput of CDC-NCM with 16KiB
>> NTB's.
>>
>> The issue is after running for a while the kernel memory can become
>> fragmented and it needs compacting.
>> If the NTB allocation is needed before the memory has been compacted
>> the atomic allocation can fail which can cause increased latency,
>> large re-transmissions or disconnections depending upon the data
>> being transmitted at the time.
>> This situation occurs for less than a second until the kernel has
>> compacted the memory but the failed devices can take a lot longer to
>> recover from the failed TX packets.
>>
>> To ease this temporary situation I modified the CDC-NCM TX path to
>> temporarily switch into a reduced memory mode which allocates an NTB
>> that will fit into a USB_CDC_NCM_NTB_MIN_OUT_SIZE (default 2048 Bytes)
>> sized memory block and only transmit NTB's with a single network frame
>> until the memory situation is resolved.
>> Once the memory is compacted the CDC-NCM data can resume transmitting
>> at the normal tx_max rate once again.
> 
> I must say that I don't like the additional complexity added here.  If
> there are memory issues and you can reduce the buffer size to
> USB_CDC_NCM_NTB_MIN_OUT_SIZE, then why don't you just set a lower tx_max
> buffer size in the first place?
> 
>   echo 2048 > /sys/class/net/wwan0/cdc_ncm/tx_max

When there isn't memory pressure this will hurt performance of
course.

It is a quite common paradigm to back down to 0 order memory requests
when higher order ones fail, so this isn't such a bad change from the
perspective.

However, one negative about it is that when the system is under memory
stress it doesn't help at all to keep attemping high order allocations
when the system hasn't recovered yet.  In fact, this can make it
worse.
Oliver Neukum May 18, 2017, 10:01 a.m. UTC | #5
Am Mittwoch, den 17.05.2017, 14:18 -0400 schrieb David Miller:
> From: Bjørn Mork <bjorn@mork.no>
> Date: Tue, 16 May 2017 20:24:10 +0200
> 
> > Jim Baxter <jim_baxter@mentor.com> writes:
> > 
> >> The CDC-NCM driver can require large amounts of memory to create
> >> skb's and this can be a problem when the memory becomes fragmented.
> >>
> >> This especially affects embedded systems that have constrained
> >> resources but wish to maximise the throughput of CDC-NCM with 16KiB
> >> NTB's.
> >>
> >> The issue is after running for a while the kernel memory can become
> >> fragmented and it needs compacting.
> >> If the NTB allocation is needed before the memory has been compacted
> >> the atomic allocation can fail which can cause increased latency,
> >> large re-transmissions or disconnections depending upon the data
> >> being transmitted at the time.
> >> This situation occurs for less than a second until the kernel has
> >> compacted the memory but the failed devices can take a lot longer to
> >> recover from the failed TX packets.
> >>
> >> To ease this temporary situation I modified the CDC-NCM TX path to
> >> temporarily switch into a reduced memory mode which allocates an NTB
> >> that will fit into a USB_CDC_NCM_NTB_MIN_OUT_SIZE (default 2048 Bytes)
> >> sized memory block and only transmit NTB's with a single network frame
> >> until the memory situation is resolved.
> >> Once the memory is compacted the CDC-NCM data can resume transmitting
> >> at the normal tx_max rate once again.
> > 
> > I must say that I don't like the additional complexity added here.  If
> > there are memory issues and you can reduce the buffer size to
> > USB_CDC_NCM_NTB_MIN_OUT_SIZE, then why don't you just set a lower tx_max
> > buffer size in the first place?
> > 
> >   echo 2048 > /sys/class/net/wwan0/cdc_ncm/tx_max
> 
> When there isn't memory pressure this will hurt performance of
> course.
> 
> It is a quite common paradigm to back down to 0 order memory requests
> when higher order ones fail, so this isn't such a bad change from the
> perspective.
> 
> However, one negative about it is that when the system is under memory
> stress it doesn't help at all to keep attemping high order allocations
> when the system hasn't recovered yet.  In fact, this can make it
> worse.

This makes me wonder why there is no notifier chain for this.
Or am I just too stupid to find it?

	Regards
		Oliver
David Laight May 19, 2017, 11:10 a.m. UTC | #6
From: linux-usb-owner@vger.kernel.org [mailto:linux-usb-owner@vger.kernel.org] On Behalf Of Jim Baxter
> Sent: 16 May 2017 18:41
> 
> The CDC-NCM driver can require large amounts of memory to create
> skb's and this can be a problem when the memory becomes fragmented.
> 
> This especially affects embedded systems that have constrained
> resources but wish to maximise the throughput of CDC-NCM with 16KiB
> NTB's.

Why is this driver copying multiple tx messages into a single skb.
Surely there are better ways to do this??

I think it is generating a 'multi-ethernet frame' URB with an
overall header for each URB and a header for each ethernet frame.

Given that the USB stack allows multiple concurrent transmits I'm
surprised that batching large ethernet frames makes much difference.

Also the USB target can't actually tell when URB that contain
multiples of the USB packet size end.
So it is possible to send a single NTB as multiple URB.
Of course, the usb_net might have other ideas.

	David
Bjørn Mork May 19, 2017, 1:55 p.m. UTC | #7
David Laight <David.Laight@ACULAB.COM> writes:

> From: linux-usb-owner@vger.kernel.org [mailto:linux-usb-owner@vger.kernel.org] On Behalf Of Jim Baxter
>> Sent: 16 May 2017 18:41
>> 
>> The CDC-NCM driver can require large amounts of memory to create
>> skb's and this can be a problem when the memory becomes fragmented.
>> 
>> This especially affects embedded systems that have constrained
>> resources but wish to maximise the throughput of CDC-NCM with 16KiB
>> NTB's.
>
> Why is this driver copying multiple tx messages into a single skb.

Mostly becasue it already did that when I started messing with it, and I
didn't know how to avoid that.

> Surely there are better ways to do this??

But I have been there thinking this exact thought a couple of times.
Suggestions are appreciated.

> I think it is generating a 'multi-ethernet frame' URB with an
> overall header for each URB and a header for each ethernet frame.

With some negotiated alignment restrictions, and a linked list of frame
pointer arrays.  But yes, that is basically it.

(it's not always ethernet - with MBIM it can be IP or arbitrary as well,
but I don't think that makes any difference)

> Given that the USB stack allows multiple concurrent transmits I'm
> surprised that batching large ethernet frames makes much difference.

Me too.  Actually, I don't think it does.  The protocol was developed
with specific device restrictions in mind. These might be invalid today.
There is no reason to believe that using simple CDC ECM framing
(i.e. one ethernet frame per URB) is any problem.

> Also the USB target can't actually tell when URB that contain
> multiples of the USB packet size end.
> So it is possible to send a single NTB as multiple URB.

Nice idea!  Never thought of that.  Yes, the driver could use a number
smaller buffers regardless of the NTB size, by abusing the fact that the
device will see them as a contigious USB transfer as long as they fall
on USB packet boundaries.

Started thinking about how to do this in practice.  It seemed obviously
simply to jsut fire off the buffers as they fill up until the the max
aggregation size or time has been exceeded.  But the header makes this
harder than necessary.  It contains both a length and a pointer to the
first frame pointer array (NDP).  So we will have to decide the size of
the NTB and where to put the first NDP before sending the first USB
packet. This is possible if we always go for the pad-to-max strategy.
We'll also have to make some assumptions about the size of the NDP(s) as
we add them, but we already do that so I don't think it is a big deal.

Might be the way to go.

Unless someone has a nice way to just collect a list of skbs and have
them converted to proper framing on the fly when transmitting, without
having to care about USB packet boundaries.



Bjørn
David Laight May 19, 2017, 2:46 p.m. UTC | #8
From: Bjørn Mork

> Sent: 19 May 2017 14:56

...
> Unless someone has a nice way to just collect a list of skbs and have

> them converted to proper framing on the fly when transmitting, without

> having to care about USB packet boundaries.


skb can be linked into arbitrary chains (or even trees), but I suspect
the usbnet code would need to be taught about them.

For XHCI it isn't too bad because it will do arbitrary scatter-gather
(apart from the ring end).
But I believe the earlier controllers only support fragments that
end on usb packet boundaries.

I looked at the usbnet code a few years ago, I'm sure it ought to
be possible to shortcut most of the code that uses URB and directly
write to the xhci (in particular) ring descriptors.

For receive you probably want to feed the USB stack multiple (probably)
2k buffers, and handle the debatching into ethernet frames yourself.

One of the ASIX drivers used to be particularly bad, it allocated 64k
skb for receive (the hardware can merge IP packets) and then hacked
the true_size before passing upstream.

	David
Oliver Neukum May 22, 2017, 1:27 p.m. UTC | #9
Am Freitag, den 19.05.2017, 14:46 +0000 schrieb David Laight:
> For XHCI it isn't too bad because it will do arbitrary scatter-gather
> (apart from the ring end).
> But I believe the earlier controllers only support fragments that
> end on usb packet boundaries.
> 
> I looked at the usbnet code a few years ago, I'm sure it ought to
> be possible to shortcut most of the code that uses URB and directly
> write to the xhci (in particular) ring descriptors.

Hi,

we cannot break the layering. URBs can support scatter/gather and
that infrastructure must be used. And usbnet will work with sg used
in skbs. What you should honor in general is not splitting packets.
So 512 byte chunks.

	Regards
		Oliver
Jim Baxter May 22, 2017, 3:45 p.m. UTC | #10
From: David S. Miller (davem@davemloft.net)
Sent: Wed, 17 May 2017 14:18:19 -0400 

> 
> When there isn't memory pressure this will hurt performance of
> course.
> 
> It is a quite common paradigm to back down to 0 order memory requests
> when higher order ones fail, so this isn't such a bad change from the
> perspective.
> 
> However, one negative about it is that when the system is under memory
> stress it doesn't help at all to keep attemping high order allocations
> when the system hasn't recovered yet.  In fact, this can make it
> worse.
> 

Hello David,

Do you think the patch should be modified to extend the length of time
the 0 order memory requests with a time period of 1 minute for example?

Or do you feel the patch is not the correct way this should be performed?

Best regards,
Jim
David Miller May 22, 2017, 3:54 p.m. UTC | #11
From: "Baxter, Jim" <jim_baxter@mentor.com>
Date: Mon, 22 May 2017 16:45:42 +0100

> From: David S. Miller (davem@davemloft.net)
> Sent: Wed, 17 May 2017 14:18:19 -0400 
> 
>> 
>> When there isn't memory pressure this will hurt performance of
>> course.
>> 
>> It is a quite common paradigm to back down to 0 order memory requests
>> when higher order ones fail, so this isn't such a bad change from the
>> perspective.
>> 
>> However, one negative about it is that when the system is under memory
>> stress it doesn't help at all to keep attemping high order allocations
>> when the system hasn't recovered yet.  In fact, this can make it
>> worse.
>> 
> 
> Do you think the patch should be modified to extend the length of time
> the 0 order memory requests with a time period of 1 minute for example?
> 
> Or do you feel the patch is not the correct way this should be performed?

Unfortunately without a real notifier of some sort (there isn't one, and
it isn't actually easy to come up with a clean way to do this which is
probably why it doesn't exist yet in the first place) I really cannot
recommend anything better.

That being said, probably for the time being we should just backoff each
and every request, always trying initially to do the higher order thing.
Oliver Neukum May 23, 2017, 8:42 a.m. UTC | #12
Am Montag, den 22.05.2017, 11:54 -0400 schrieb David Miller:
> 
> Unfortunately without a real notifier of some sort (there isn't one, and
> it isn't actually easy to come up with a clean way to do this which is
> probably why it doesn't exist yet in the first place) I really cannot
> recommend anything better.
> 
> That being said, probably for the time being we should just backoff each
> and every request, always trying initially to do the higher order thing.

We could use a counter. After the first failure, do it once, after the
second twice and so on. And reset the counter as a higher order
allocation works. (just bound it somewhere)

	Regards
		Oliver
David Miller May 23, 2017, 3:26 p.m. UTC | #13
From: Oliver Neukum <oneukum@suse.com>
Date: Tue, 23 May 2017 10:42:48 +0200

> Am Montag, den 22.05.2017, 11:54 -0400 schrieb David Miller:
>> 
>> Unfortunately without a real notifier of some sort (there isn't one, and
>> it isn't actually easy to come up with a clean way to do this which is
>> probably why it doesn't exist yet in the first place) I really cannot
>> recommend anything better.
>> 
>> That being said, probably for the time being we should just backoff each
>> and every request, always trying initially to do the higher order thing.
> 
> We could use a counter. After the first failure, do it once, after the
> second twice and so on. And reset the counter as a higher order
> allocation works. (just bound it somewhere)

So an exponential backoff, that might work.
diff mbox

Patch

diff --git a/drivers/net/usb/cdc_ncm.c b/drivers/net/usb/cdc_ncm.c
index b5cec18..c06d20f 100644
--- a/drivers/net/usb/cdc_ncm.c
+++ b/drivers/net/usb/cdc_ncm.c
@@ -1055,10 +1055,10 @@  static struct usb_cdc_ncm_ndp16 *cdc_ncm_ndp(struct cdc_ncm_ctx *ctx, struct sk_
 
 	/* align new NDP */
 	if (!(ctx->drvflags & CDC_NCM_FLAG_NDP_TO_END))
-		cdc_ncm_align_tail(skb, ctx->tx_ndp_modulus, 0, ctx->tx_max);
+		cdc_ncm_align_tail(skb, ctx->tx_ndp_modulus, 0, ctx->tx_curr_size);
 
 	/* verify that there is room for the NDP and the datagram (reserve) */
-	if ((ctx->tx_max - skb->len - reserve) < ctx->max_ndp_size)
+	if ((ctx->tx_curr_size - skb->len - reserve) < ctx->max_ndp_size)
 		return NULL;
 
 	/* link to it */
@@ -1111,13 +1111,28 @@  struct sk_buff *
 
 	/* allocate a new OUT skb */
 	if (!skb_out) {
-		skb_out = alloc_skb(ctx->tx_max, GFP_ATOMIC);
+		ctx->tx_curr_size = ctx->tx_max;
+		skb_out = alloc_skb(ctx->tx_curr_size, GFP_ATOMIC);
 		if (skb_out == NULL) {
-			if (skb != NULL) {
-				dev_kfree_skb_any(skb);
-				dev->net->stats.tx_dropped++;
+			/* See if a very small allocation is possible.
+			 * We will send this packet immediately and hope
+			 * that there is more memory available later.
+			 */
+			if (skb)
+				ctx->tx_curr_size = max(skb->len,
+					(u32)USB_CDC_NCM_NTB_MIN_OUT_SIZE);
+			else
+				ctx->tx_curr_size = USB_CDC_NCM_NTB_MIN_OUT_SIZE;
+			skb_out = alloc_skb(ctx->tx_curr_size, GFP_ATOMIC);
+
+			/* No allocation possible so we will abort */
+			if (skb_out == NULL) {
+				if (skb != NULL) {
+					dev_kfree_skb_any(skb);
+					dev->net->stats.tx_dropped++;
+				}
+				goto exit_no_skb;
 			}
-			goto exit_no_skb;
 		}
 		/* fill out the initial 16-bit NTB header */
 		nth16 = (struct usb_cdc_ncm_nth16 *)memset(skb_put(skb_out, sizeof(struct usb_cdc_ncm_nth16)), 0, sizeof(struct usb_cdc_ncm_nth16));
@@ -1148,10 +1163,10 @@  struct sk_buff *
 		ndp16 = cdc_ncm_ndp(ctx, skb_out, sign, skb->len + ctx->tx_modulus + ctx->tx_remainder);
 
 		/* align beginning of next frame */
-		cdc_ncm_align_tail(skb_out,  ctx->tx_modulus, ctx->tx_remainder, ctx->tx_max);
+		cdc_ncm_align_tail(skb_out,  ctx->tx_modulus, ctx->tx_remainder, ctx->tx_curr_size);
 
 		/* check if we had enough room left for both NDP and frame */
-		if (!ndp16 || skb_out->len + skb->len + delayed_ndp_size > ctx->tx_max) {
+		if (!ndp16 || skb_out->len + skb->len + delayed_ndp_size > ctx->tx_curr_size) {
 			if (n == 0) {
 				/* won't fit, MTU problem? */
 				dev_kfree_skb_any(skb);
@@ -1227,7 +1242,7 @@  struct sk_buff *
 	/* If requested, put NDP at end of frame. */
 	if (ctx->drvflags & CDC_NCM_FLAG_NDP_TO_END) {
 		nth16 = (struct usb_cdc_ncm_nth16 *)skb_out->data;
-		cdc_ncm_align_tail(skb_out, ctx->tx_ndp_modulus, 0, ctx->tx_max);
+		cdc_ncm_align_tail(skb_out, ctx->tx_ndp_modulus, 0, ctx->tx_curr_size);
 		nth16->wNdpIndex = cpu_to_le16(skb_out->len);
 		memcpy(skb_put(skb_out, ctx->max_ndp_size), ctx->delayed_ndp16, ctx->max_ndp_size);
 
@@ -1246,9 +1261,9 @@  struct sk_buff *
 	 */
 	if (!(dev->driver_info->flags & FLAG_SEND_ZLP) &&
 	    skb_out->len > ctx->min_tx_pkt) {
-		padding_count = ctx->tx_max - skb_out->len;
+		padding_count = ctx->tx_curr_size - skb_out->len;
 		memset(skb_put(skb_out, padding_count), 0, padding_count);
-	} else if (skb_out->len < ctx->tx_max &&
+	} else if (skb_out->len < ctx->tx_curr_size &&
 		   (skb_out->len % dev->maxpacket) == 0) {
 		*skb_put(skb_out, 1) = 0;	/* force short packet */
 	}
diff --git a/include/linux/usb/cdc_ncm.h b/include/linux/usb/cdc_ncm.h
index 00d2324..5162f38 100644
--- a/include/linux/usb/cdc_ncm.h
+++ b/include/linux/usb/cdc_ncm.h
@@ -117,6 +117,7 @@  struct cdc_ncm_ctx {
 	u32 tx_curr_frame_num;
 	u32 rx_max;
 	u32 tx_max;
+	u32 tx_curr_size;
 	u32 max_datagram_size;
 	u16 tx_max_datagrams;
 	u16 tx_remainder;