Patchwork changing usbnet's API to better deal with cdc-ncm

login
register
mail settings
Submitter Oliver Neukum
Date Sept. 7, 2012, 7:35 a.m.
Message ID <1446113.XC2Qbo4flT@linux-lqwf.site>
Download mbox | patch
Permalink /patch/182339/
State RFC
Delegated to: David Miller
Headers show

Comments

Oliver Neukum - Sept. 7, 2012, 7:35 a.m.
On Friday 07 September 2012 11:55:53 Ming Lei wrote:
> On Fri, Sep 7, 2012 at 1:56 AM, Oliver Neukum <oneukum@suse.de> wrote:
> > On Friday 07 September 2012 00:09:13 Ming Lei wrote:
> >> On Thu, Sep 6, 2012 at 4:30 PM, Bjørn Mork <bjorn@mork.no> wrote:
> >> > Ming Lei <tom.leiming@gmail.com> writes:
> >
> >> >> Looks the introduced .tx_bundle is not necessary since .tx_fixup is OK.
> >> >
> >> > The minidriver does not have any information about tx in progress.  The
> >>
> >> Inside .tx_fixup, the low level driver will get the tx progress information.
> >
> > That information changes after tx_fixup
> 
> You mean the low level drive can't see if the later transmission
> for the aggregated frame is successful?  If so, there is no any complete

Not successful, necessary.

> notification to all low level drivers (with aggregation capability
> or not) now, isn't there?

There doesn't need to be.

> > Well, that is the mistake. Using a timer is a bad idea.
> 
> Why is a bad idea?  Without a timer, how can usbnet or
> low level driver aggregate the later coming transmit frames to
> form a biger aggregated frame?

By registering demand in an abstract sense. The timer makes sense
only when no other buffers are being transfered.

The timer means that we transmit if no other more packages are coming
down from the upper layers. Now, either the device is still busy or it is not.
While it is busy, we might just as well wait, because the upper layer may
change its mind.
If the device is not busy we worsen latency by waiting for the timer.

In addition the timer causes complications with recursing into usbnet
and hassle with suspend/resume and disconnection.

> >> If we can abstract some common things about aggregation, it should be
> >> meaningful. As far as I know, most of aggregation protocol is very different,
> >> so almost all aggregation work is only done by low level driver, such as
> >> cdc_ncm.
> >>
> >> If we want to implement some aggregation framework, maybe below is
> >> one solution, correct me if it is wrong.
> >
> > It isn't so much wrong as incomplete.
> >
> > It seems to me we can have
> >
> > - can queue, buffer not full -> do nothing more
> 
> As I said above, without a timeout timer, the queued skb
> might not be sent out even some long time passed if no
> frame comes later.

Therefore we check whether the device is still busy.

> So could you add the wait for more mechanism to your patch
> for further discussion?

I am attaching a version that is algorithmically complete
except for error handling.

> > - can queue, buffer full -> transmit
> > - cannot queue, buffer full -> transmit and then try again to queue
> 
> This might not happen, the buffer full should have been observed
> in last xmit path.

Possibly, but the driver cannot know in advance how large the next package
will be.

	Reagrds
		Oliver


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexey ORISHKO - Sept. 7, 2012, 12:01 p.m.
> -----Original Message-----
> From: Oliver Neukum [mailto:oneukum@suse.de]
> Sent: Friday, September 07, 2012 9:36 AM

> > > Well, that is the mistake. Using a timer is a bad idea.
> >
> > Why is a bad idea?  Without a timer, how can usbnet or low level
> > driver aggregate the later coming transmit frames to form a biger
> > aggregated frame?
> 
> By registering demand in an abstract sense. The timer makes sense only
> when no other buffers are being transfered.
> 
> The timer means that we transmit if no other more packages are coming
> down from the upper layers. Now, either the device is still busy or it
> is not.
> While it is busy, we might just as well wait, because the upper layer
> may change its mind.
> If the device is not busy we worsen latency by waiting for the timer.

...

> > As I said above, without a timeout timer, the queued skb might not be
> > sent out even some long time passed if no frame comes later.
> 
> Therefore we check whether the device is still busy.
> 

The purpose of NCM is to reduce the amount of interrupts and the number
of DMA jobs that must be handled by the device. It is most efficient to
send and especially to receive NTBs of agreed size, padded if needed.

Misunderstanding of the reason why NCM protocol is doing aggregation
of several ETH frames into a single NTB leads to crippled implementation. 
Look at NCM gadget, which is not really NCM device at all, it's kinda ECM
device with NCM framing.

The experience from early implementations and prototyping of NCM was that
using NCM with 4-8kB NTB increased max throughput in loop-mode by a factor
of 5-6 even 8-10 times compared to using ECM. One real-world example was
modem for 21+6Mbit/s what used 100% CPU with ECM responsible for approx. 40%
of the MIPS used. Using NCM instead CPU was only at approx. 65% utilization.
Which allowed multiple other functions to be added and significantly increased
the usability and value of the modem.

NCM efficiency is gained either by accessing TX queue of upper layer or
by using timer. 

These findings were later confirmed by multiple major industry players
(names withheld), and demonstrated during multiple NCM interoperability
workshops using multiple device HW platforms, multiple operating systems
and multiple host HW ranging from Beagleboard to latest quad-core x86.    

Until we do something with network device framework in order to get access
to upper layer Tx queue we need to utilize timer.

Regards,
Alexey
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Oliver Neukum - Sept. 7, 2012, 1:08 p.m.
On Friday 07 September 2012 14:01:23 Alexey ORISHKO wrote:
> The experience from early implementations and prototyping of NCM was that
> using NCM with 4-8kB NTB increased max throughput in loop-mode by a factor
> of 5-6 even 8-10 times compared to using ECM. One real-world example was
> modem for 21+6Mbit/s what used 100% CPU with ECM responsible for approx. 40%
> of the MIPS used. Using NCM instead CPU was only at approx. 65% utilization.
> Which allowed multiple other functions to be added and significantly increased
> the usability and value of the modem.
> 
> NCM efficiency is gained either by accessing TX queue of upper layer or
> by using timer. 
> 
> These findings were later confirmed by multiple major industry players
> (names withheld), and demonstrated during multiple NCM interoperability
> workshops using multiple device HW platforms, multiple operating systems
> and multiple host HW ranging from Beagleboard to latest quad-core x86.    
> 
> Until we do something with network device framework in order to get access
> to upper layer Tx queue we need to utilize timer.

Could you explain your reasoning? From what you say we must reduce the number
of transfers, thus use them efficiently, but why by means of a timer?

	Regards
		Oliver

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexey ORISHKO - Sept. 7, 2012, 3:23 p.m.
> -----Original Message-----
> From: Oliver Neukum [mailto:oneukum@suse.de]
> Sent: Friday, September 07, 2012 3:09 PM

> >
> > Until we do something with network device framework in order to get
> > access to upper layer Tx queue we need to utilize timer.
> 
> Could you explain your reasoning? From what you say we must reduce the
> number of transfers, thus use them efficiently, but why by means of a
> timer?
> 

What was omitted in the previous mail is the fact that in addition to
reducing amount of transfers we need to fill in NTB to the maximum size.
It's quite important especially for embedded systems.

As an example, on device (modem) side, DMA job for max NTB is initially
set. 
- If device receives full NTB, only one interrupt (per NTB) is
generated at the completion of DMA job and NCM function can immediately
parse incoming NTB frame. 
- In case of "skinny" (not full) NTB, device gets INT for the short
packet, then original DMA job for full NTB shall be cancelled, then DMA
job to get short packet shall be set and after that we get INT for its
completion. Only at this point all data is received and NTB parsing
can be done by NCM function. All above takes quite more time than first
case.

There is a temptation to send full NTBs even with a single IP packet,
But it will lead to wasted USB bandwidth and reduced ability to send
real data for other functions in the device or other devices on the
same root hub. Most important it will also harm IN direction.

The challenge is to ensure that an acceptable timeout value is used. Too
long and latency is introduced. Too short and too many padded NTBs will
go out and that reduces the available throughput. The ideal timer value
depends on the throughput available in the network. Something that is not
really explicitly known to the NCM layer nor to the layer above.

Alternate methods exist to achieve the same result without using a timer.
But an optimal implementation requires that the amount of IP packets "in
progress" or queued up is known to NCM so NCM can decide to send short or
padded NTB or aggregate and send one or more full NTBs plus short or
padded NTB. 

An implementation where NCM only knows if there is more data available or
not can be shown to have side-effects that are not easily circumvented.
And likewise shown to limit throughput compared to a timer-based solution.   

Regards,
Alexey
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Oliver Neukum - Sept. 7, 2012, 6:23 p.m.
On Friday 07 September 2012 17:23:33 Alexey ORISHKO wrote:

> There is a temptation to send full NTBs even with a single IP packet,
> But it will lead to wasted USB bandwidth and reduced ability to send
> real data for other functions in the device or other devices on the
> same root hub. Most important it will also harm IN direction.

Well, we will eventually need to do so. In that case, shall we send a short package
or not?

> The challenge is to ensure that an acceptable timeout value is used. Too
> long and latency is introduced. Too short and too many padded NTBs will
> go out and that reduces the available throughput. The ideal timer value
> depends on the throughput available in the network. Something that is not
> really explicitly known to the NCM layer nor to the layer above.

Well, we know how many packages are available to the device because
they have already been scheduled. It seems to be clear that we need to
batch while enough are on the way.

> Alternate methods exist to achieve the same result without using a timer.
> But an optimal implementation requires that the amount of IP packets "in
> progress" or queued up is known to NCM so NCM can decide to send short or
> padded NTB or aggregate and send one or more full NTBs plus short or
> padded NTB.

What exactly means "in progress"?

> An implementation where NCM only knows if there is more data available or
> not can be shown to have side-effects that are not easily circumvented.
> And likewise shown to limit throughput compared to a timer-based solution.   

Well, I must say that the timer is ugly. If absolutely necessary we can keep
it, but then I'd much prefer to put it into usbnet and have a generic batching
functionality. However, I'd like to explore aternatives.

	Regards
		Oliver

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexey ORISHKO - Sept. 10, 2012, 5:10 p.m.
> -----Original Message-----
> From: Oliver Neukum [mailto:oliver@neukum.org]
> Sent: Friday, September 07, 2012 8:23 PM
> 
> > There is a temptation to send full NTBs even with a single IP packet,
> > But it will lead to wasted USB bandwidth and reduced ability to send
> > real data for other functions in the device or other devices on the
> > same root hub. Most important it will also harm IN direction.
> 
> Well, we will eventually need to do so. In that case, shall we send a
> short package or not?

The trick is to find a compromise for device load vs wasted bandwidth. 
Currently if data size is less than 512 bytes, a shot packet will be sent. 
However, this value might be better tuned based on bus speed (HS/SS)
and max NTB size negotiated.

Note, that ZLP was avoided in the ncm driver by using flag in usbnet,
because on some embedded USB controllers handling ZLP was more costly
than short packet.

> Well, we know how many packages are available to the device because
> they have already been scheduled. It seems to be clear that we need to
> batch while enough are on the way.

It doesn't matter much that some packets on the way to device. What
matters is the amount of packets in the application Tx queue for
ncm network device.

> > But an optimal implementation requires that the amount of IP packets
> > "in progress" or queued up is known to NCM so NCM can decide to send
> > short or padded NTB or aggregate and send one or more full NTBs plus
> > short or padded NTB.
> 
> What exactly means "in progress"?

Sent by application to the network device.

> 
> > An implementation where NCM only knows if there is more data
> available
> > or not can be shown to have side-effects that are not easily
> circumvented.
> > And likewise shown to limit throughput compared to a timer-based
> solution.
> 
> Well, I must say that the timer is ugly. If absolutely necessary we can
> keep it, but then I'd much prefer to put it into usbnet and have a
> generic batching functionality. However, I'd like to explore
> aternatives.

Is it possible to implement upper layer Tx queue in usbnet, so mini driver 
could get a clue about outstanding packets to be transmitted?

Regards,
Alexey
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/drivers/net/usb/cdc_ncm.c b/drivers/net/usb/cdc_ncm.c
index 4cd582a..56ef743 100644
--- a/drivers/net/usb/cdc_ncm.c
+++ b/drivers/net/usb/cdc_ncm.c
@@ -135,9 +135,6 @@  struct cdc_ncm_ctx {
 	u16 connected;
 };
 
-static void cdc_ncm_txpath_bh(unsigned long param);
-static void cdc_ncm_tx_timeout_start(struct cdc_ncm_ctx *ctx);
-static enum hrtimer_restart cdc_ncm_tx_timer_cb(struct hrtimer *hr_timer);
 static struct usb_driver cdc_ncm_driver;
 
 static void
@@ -464,10 +461,6 @@  static int cdc_ncm_bind(struct usbnet *dev, struct usb_interface *intf)
 	if (ctx == NULL)
 		return -ENODEV;
 
-	hrtimer_init(&ctx->tx_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
-	ctx->tx_timer.function = &cdc_ncm_tx_timer_cb;
-	ctx->bh.data = (unsigned long)ctx;
-	ctx->bh.func = cdc_ncm_txpath_bh;
 	atomic_set(&ctx->stop, 0);
 	spin_lock_init(&ctx->mtx);
 	ctx->netdev = dev->net;
@@ -650,7 +643,7 @@  static void cdc_ncm_zero_fill(u8 *ptr, u32 first, u32 end, u32 max)
 	memset(ptr + first, 0, end - first);
 }
 
-static struct sk_buff *
+static int
 cdc_ncm_fill_tx_frame(struct cdc_ncm_ctx *ctx, struct sk_buff *skb)
 {
 	struct sk_buff *skb_out;
@@ -659,12 +652,7 @@  cdc_ncm_fill_tx_frame(struct cdc_ncm_ctx *ctx, struct sk_buff *skb)
 	u32 last_offset;
 	u16 n = 0, index;
 	u8 ready2send = 0;
-
-	/* if there is a remaining skb, it gets priority */
-	if (skb != NULL)
-		swap(skb, ctx->tx_rem_skb);
-	else
-		ready2send = 1;
+	u8 error = 0;
 
 	/*
 	 * +----------------+
@@ -690,7 +678,7 @@  cdc_ncm_fill_tx_frame(struct cdc_ncm_ctx *ctx, struct sk_buff *skb)
 				dev_kfree_skb_any(skb);
 				ctx->netdev->stats.tx_dropped++;
 			}
-			goto exit_no_skb;
+			return -EBUSY;
 		}
 
 		/* make room for NTH and NDP */
@@ -719,28 +707,15 @@  cdc_ncm_fill_tx_frame(struct cdc_ncm_ctx *ctx, struct sk_buff *skb)
 		/* compute maximum buffer size */
 		rem = ctx->tx_max - offset;
 
-		if (skb == NULL) {
-			skb = ctx->tx_rem_skb;
-			ctx->tx_rem_skb = NULL;
-
-			/* check for end of skb */
-			if (skb == NULL)
-				break;
-		}
-
 		if (skb->len > rem) {
 			if (n == 0) {
 				/* won't fit, MTU problem? */
 				dev_kfree_skb_any(skb);
 				skb = NULL;
 				ctx->netdev->stats.tx_dropped++;
+				error = 1;
 			} else {
-				/* no room for skb - store for later */
-				if (ctx->tx_rem_skb != NULL) {
-					dev_kfree_skb_any(ctx->tx_rem_skb);
-					ctx->netdev->stats.tx_dropped++;
-				}
-				ctx->tx_rem_skb = skb;
+
 				skb = NULL;
 				ready2send = 1;
 			}
@@ -768,13 +743,6 @@  cdc_ncm_fill_tx_frame(struct cdc_ncm_ctx *ctx, struct sk_buff *skb)
 		skb = NULL;
 	}
 
-	/* free up any dangling skb */
-	if (skb != NULL) {
-		dev_kfree_skb_any(skb);
-		skb = NULL;
-		ctx->netdev->stats.tx_dropped++;
-	}
-
 	ctx->tx_curr_frame_num = n;
 
 	if (n == 0) {
@@ -791,9 +759,7 @@  cdc_ncm_fill_tx_frame(struct cdc_ncm_ctx *ctx, struct sk_buff *skb)
 		ctx->tx_curr_skb = skb_out;
 		ctx->tx_curr_offset = offset;
 		ctx->tx_curr_last_offset = last_offset;
-		/* set the pending count */
-		if (n < CDC_NCM_RESTART_TIMER_DATAGRAM_CNT)
-			ctx->tx_timer_pending = CDC_NCM_TIMER_PENDING_CNT;
+
 		goto exit_no_skb;
 
 	} else {
@@ -874,71 +840,37 @@  cdc_ncm_fill_tx_frame(struct cdc_ncm_ctx *ctx, struct sk_buff *skb)
 	/* return skb */
 	ctx->tx_curr_skb = NULL;
 	ctx->netdev->stats.tx_packets += ctx->tx_curr_frame_num;
-	return skb_out;
 
-exit_no_skb:
-	/* Start timer, if there is a remaining skb */
-	if (ctx->tx_curr_skb != NULL)
-		cdc_ncm_tx_timeout_start(ctx);
-	return NULL;
-}
-
-static void cdc_ncm_tx_timeout_start(struct cdc_ncm_ctx *ctx)
-{
-	/* start timer, if not already started */
-	if (!(hrtimer_active(&ctx->tx_timer) || atomic_read(&ctx->stop)))
-		hrtimer_start(&ctx->tx_timer,
-				ktime_set(0, CDC_NCM_TIMER_INTERVAL),
-				HRTIMER_MODE_REL);
-}
+	if (error)
+		return -EBUSY;
+	if (ready2send)
+		return -EBUSY;
+	else
+		return 0;
 
-static enum hrtimer_restart cdc_ncm_tx_timer_cb(struct hrtimer *timer)
-{
-	struct cdc_ncm_ctx *ctx =
-			container_of(timer, struct cdc_ncm_ctx, tx_timer);
+exit_no_skb:
 
-	if (!atomic_read(&ctx->stop))
-		tasklet_schedule(&ctx->bh);
-	return HRTIMER_NORESTART;
+	return -EAGAIN;
 }
 
-static void cdc_ncm_txpath_bh(unsigned long param)
+static int cdc_ncm_tx_bundle(struct usbnet *dev, struct sk_buff *skb, gfp_t flags)
 {
-	struct cdc_ncm_ctx *ctx = (struct cdc_ncm_ctx *)param;
-
-	spin_lock_bh(&ctx->mtx);
-	if (ctx->tx_timer_pending != 0) {
-		ctx->tx_timer_pending--;
-		cdc_ncm_tx_timeout_start(ctx);
-		spin_unlock_bh(&ctx->mtx);
-	} else if (ctx->netdev != NULL) {
-		spin_unlock_bh(&ctx->mtx);
-		netif_tx_lock_bh(ctx->netdev);
-		usbnet_start_xmit(NULL, ctx->netdev);
-		netif_tx_unlock_bh(ctx->netdev);
-	}
+	int err;
+	struct cdc_ncm_ctx *ctx = (struct cdc_ncm_ctx *)dev->data[0];
+	
+	err = cdc_ncm_fill_tx_frame(ctx, skb);
+	return err;
 }
 
 static struct sk_buff *
 cdc_ncm_tx_fixup(struct usbnet *dev, struct sk_buff *skb, gfp_t flags)
 {
-	struct sk_buff *skb_out;
 	struct cdc_ncm_ctx *ctx = (struct cdc_ncm_ctx *)dev->data[0];
 
-	/*
-	 * The Ethernet API we are using does not support transmitting
-	 * multiple Ethernet frames in a single call. This driver will
-	 * accumulate multiple Ethernet frames and send out a larger
-	 * USB frame when the USB buffer is full or when a single jiffies
-	 * timeout happens.
-	 */
 	if (ctx == NULL)
 		goto error;
 
-	spin_lock_bh(&ctx->mtx);
-	skb_out = cdc_ncm_fill_tx_frame(ctx, skb);
-	spin_unlock_bh(&ctx->mtx);
-	return skb_out;
+	return ctx->tx_curr_skb;
 
 error:
 	if (skb != NULL)
@@ -1197,6 +1129,7 @@  static const struct driver_info cdc_ncm_info = {
 	.manage_power = cdc_ncm_manage_power,
 	.status = cdc_ncm_status,
 	.rx_fixup = cdc_ncm_rx_fixup,
+	.tx_bundle = cdc_ncm_tx_bundle,
 	.tx_fixup = cdc_ncm_tx_fixup,
 };
 
@@ -1211,6 +1144,7 @@  static const struct driver_info wwan_info = {
 	.manage_power = cdc_ncm_manage_power,
 	.status = cdc_ncm_status,
 	.rx_fixup = cdc_ncm_rx_fixup,
+	.tx_bundle = cdc_ncm_tx_bundle,
 	.tx_fixup = cdc_ncm_tx_fixup,
 };
 
diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index 8531c1c..429b08b 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -86,6 +86,8 @@  static u8	node_id [ETH_ALEN];
 
 static const char driver_name [] = "usbnet";
 
+static void tx_complete(struct urb *urb);
+
 /* use ethtool to change the level for any given device */
 static int msg_level = -1;
 module_param (msg_level, int, 0);
@@ -1016,17 +1018,76 @@  skip_reset:
 		netdev_dbg(dev->net, "kevent done, flags = 0x%lx\n", dev->flags);
 }
 
+static int submit_next_bundle(struct usbnet *dev, struct sk_buff *skb)
+{
+	struct skb_data *entry;
+	int length = skb->len;
+	int retval;
+	struct driver_info *info = dev->driver_info;
+	struct urb *urb;
+
+	urb = usb_alloc_urb(0, GFP_ATOMIC);
+	if (!urb)
+		return -ENOMEM;
+
+	entry = (struct skb_data *) skb->cb;
+	entry->urb = urb;
+	entry->dev = dev;
+	entry->length = length;
+
+	usb_fill_bulk_urb (urb, dev->udev, dev->out,
+			skb->data, skb->len, tx_complete, skb);
+
+	/* don't assume the hardware handles USB_ZERO_PACKET
+	 * NOTE:  strictly conforming cdc-ether devices should expect
+	 * the ZLP here, but ignore the one-byte packet.
+	 * NOTE2: CDC NCM specification is different from CDC ECM when
+	 * handling ZLP/short packets, so cdc_ncm driver will make short
+	 * packet itself if needed.
+	 */
+	if (length % dev->maxpacket == 0) {
+		if (!(info->flags & FLAG_SEND_ZLP)) {
+			if (!(info->flags & FLAG_MULTI_PACKET)) {
+				urb->transfer_buffer_length++;
+				if (skb_tailroom(skb)) {
+					skb->data[skb->len] = 0;
+					__skb_put(skb, 1);
+				}
+			}
+		} else
+			urb->transfer_flags |= URB_ZERO_PACKET;
+	}
+
+	atomic_inc(&dev->tx_in_flight);
+	retval = usb_submit_urb(urb, GFP_ATOMIC);
+	if (retval < 0)
+		atomic_dec(&dev->tx_in_flight);
+	//FIXME: full error handling
+
+	return retval;
+}
 /*-------------------------------------------------------------------------*/
 
 static void tx_complete (struct urb *urb)
 {
 	struct sk_buff		*skb = (struct sk_buff *) urb->context;
+	struct sk_buff		*skb2;
 	struct skb_data		*entry = (struct skb_data *) skb->cb;
 	struct usbnet		*dev = entry->dev;
+	struct driver_info	*info = dev->driver_info;
+	int			in_flight;
 
+	in_flight = atomic_dec_return(&dev->tx_in_flight);
 	if (urb->status == 0) {
-		if (!(dev->driver_info->flags & FLAG_MULTI_PACKET))
+		if (!(dev->driver_info->flags & FLAG_MULTI_PACKET)) {
 			dev->net->stats.tx_packets++;
+		} else {
+			if (in_flight < 2) {
+				skb2 = info->tx_fixup(dev, NULL, GFP_ATOMIC);
+				if (skb2)
+					submit_next_bundle(dev, skb2);
+			}
+		}
 		dev->net->stats.tx_bytes += entry->length;
 	} else {
 		dev->net->stats.tx_errors++;
@@ -1089,23 +1150,50 @@  netdev_tx_t usbnet_start_xmit (struct sk_buff *skb,
 	struct urb		*urb = NULL;
 	struct skb_data		*entry;
 	struct driver_info	*info = dev->driver_info;
+	struct sk_buff		*skb_old = NULL;
 	unsigned long		flags;
 	int retval;
+	int transmit_now = 1;
+	int bundle_again = 0;
 
 	if (skb)
 		skb_tx_timestamp(skb);
 
+	/*
+	 * first we allow drivers to bundle packets together
+	 * maintainance of the buffer is the responsibility
+	 * of the lower layer
+	 */
+rebundle:
+	if (info->tx_bundle) {
+		bundle_again = 0;
+		retval = info->tx_bundle(dev, skb, GFP_ATOMIC);
+
+		switch (retval) {
+		case 0: /* the package has been bundled */
+			if (atomic_read(&dev->tx_in_flight) < 2)
+				transmit_now = 1;
+			else
+				transmit_now = 0;
+			break;
+		case -EAGAIN:
+			transmit_now = 1;
+			bundle_again = 1;
+			skb_old = skb;
+			break;
+		case -EBUSY:
+			transmit_now = 1;
+			break;
+		}
+	}
 	// some devices want funky USB-level framing, for
 	// win32 driver (usually) and/or hardware quirks
-	if (info->tx_fixup) {
+	if (transmit_now && info->tx_fixup) {
 		skb = info->tx_fixup (dev, skb, GFP_ATOMIC);
 		if (!skb) {
 			if (netif_msg_tx_err(dev)) {
 				netif_dbg(dev, tx_err, dev->net, "can't tx_fixup skb\n");
 				goto drop;
-			} else {
-				/* cdc_ncm collected packet; waits for more */
-				goto not_drop;
 			}
 		}
 	}
@@ -1164,14 +1252,17 @@  netdev_tx_t usbnet_start_xmit (struct sk_buff *skb,
 	}
 #endif
 
+	atomic_inc(&dev->tx_in_flight);
 	switch ((retval = usb_submit_urb (urb, GFP_ATOMIC))) {
 	case -EPIPE:
 		netif_stop_queue (net);
 		usbnet_defer_kevent (dev, EVENT_TX_HALT);
+		atomic_dec(&dev->tx_in_flight);
 		usb_autopm_put_interface_async(dev->intf);
 		break;
 	default:
 		usb_autopm_put_interface_async(dev->intf);
+		atomic_dec(&dev->tx_in_flight);
 		netif_dbg(dev, tx_err, dev->net,
 			  "tx: submit urb err %d\n", retval);
 		break;
@@ -1187,7 +1278,6 @@  netdev_tx_t usbnet_start_xmit (struct sk_buff *skb,
 		netif_dbg(dev, tx_err, dev->net, "drop, code %d\n", retval);
 drop:
 		dev->net->stats.tx_dropped++;
-not_drop:
 		if (skb)
 			dev_kfree_skb_any (skb);
 		usb_free_urb (urb);
@@ -1197,6 +1287,10 @@  not_drop:
 #ifdef CONFIG_PM
 deferred:
 #endif
+	if (bundle_again) {
+		skb = skb_old;
+		goto rebundle;
+	}
 	return NETDEV_TX_OK;
 }
 EXPORT_SYMBOL_GPL(usbnet_start_xmit);
@@ -1393,6 +1487,7 @@  usbnet_probe (struct usb_interface *udev, const struct usb_device_id *prod)
 	dev->delay.data = (unsigned long) dev;
 	init_timer (&dev->delay);
 	mutex_init (&dev->phy_mutex);
+	atomic_set(&dev->tx_in_flight, 0); 
 
 	dev->net = net;
 	strcpy (net->name, "usb%d");
diff --git a/include/linux/usb/usbnet.h b/include/linux/usb/usbnet.h
index f87cf62..bb2f622 100644
--- a/include/linux/usb/usbnet.h
+++ b/include/linux/usb/usbnet.h
@@ -33,6 +33,7 @@  struct usbnet {
 	wait_queue_head_t	*wait;
 	struct mutex		phy_mutex;
 	unsigned char		suspend_count;
+	atomic_t		tx_in_flight;
 
 	/* i/o info: pipes etc */
 	unsigned		in, out;
@@ -133,6 +134,12 @@  struct driver_info {
 	/* fixup rx packet (strip framing) */
 	int	(*rx_fixup)(struct usbnet *dev, struct sk_buff *skb);
 
+	/* bundle individual package for transmission as
+	 * larger package. This cannot sleep
+	 */
+	int	(*tx_bundle)(struct usbnet *dev,
+				struct sk_buff *skb, gfp_t flags);
+
 	/* fixup tx packet (add framing) */
 	struct sk_buff	*(*tx_fixup)(struct usbnet *dev,
 				struct sk_buff *skb, gfp_t flags);