Message ID | 1446113.XC2Qbo4flT@linux-lqwf.site |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
> -----Original Message----- > From: Oliver Neukum [mailto:oneukum@suse.de] > Sent: Friday, September 07, 2012 9:36 AM > > > Well, that is the mistake. Using a timer is a bad idea. > > > > Why is a bad idea? Without a timer, how can usbnet or low level > > driver aggregate the later coming transmit frames to form a biger > > aggregated frame? > > By registering demand in an abstract sense. The timer makes sense only > when no other buffers are being transfered. > > The timer means that we transmit if no other more packages are coming > down from the upper layers. Now, either the device is still busy or it > is not. > While it is busy, we might just as well wait, because the upper layer > may change its mind. > If the device is not busy we worsen latency by waiting for the timer. ... > > As I said above, without a timeout timer, the queued skb might not be > > sent out even some long time passed if no frame comes later. > > Therefore we check whether the device is still busy. > The purpose of NCM is to reduce the amount of interrupts and the number of DMA jobs that must be handled by the device. It is most efficient to send and especially to receive NTBs of agreed size, padded if needed. Misunderstanding of the reason why NCM protocol is doing aggregation of several ETH frames into a single NTB leads to crippled implementation. Look at NCM gadget, which is not really NCM device at all, it's kinda ECM device with NCM framing. The experience from early implementations and prototyping of NCM was that using NCM with 4-8kB NTB increased max throughput in loop-mode by a factor of 5-6 even 8-10 times compared to using ECM. One real-world example was modem for 21+6Mbit/s what used 100% CPU with ECM responsible for approx. 40% of the MIPS used. Using NCM instead CPU was only at approx. 65% utilization. Which allowed multiple other functions to be added and significantly increased the usability and value of the modem. NCM efficiency is gained either by accessing TX queue of upper layer or by using timer. These findings were later confirmed by multiple major industry players (names withheld), and demonstrated during multiple NCM interoperability workshops using multiple device HW platforms, multiple operating systems and multiple host HW ranging from Beagleboard to latest quad-core x86. Until we do something with network device framework in order to get access to upper layer Tx queue we need to utilize timer. Regards, Alexey -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Friday 07 September 2012 14:01:23 Alexey ORISHKO wrote: > The experience from early implementations and prototyping of NCM was that > using NCM with 4-8kB NTB increased max throughput in loop-mode by a factor > of 5-6 even 8-10 times compared to using ECM. One real-world example was > modem for 21+6Mbit/s what used 100% CPU with ECM responsible for approx. 40% > of the MIPS used. Using NCM instead CPU was only at approx. 65% utilization. > Which allowed multiple other functions to be added and significantly increased > the usability and value of the modem. > > NCM efficiency is gained either by accessing TX queue of upper layer or > by using timer. > > These findings were later confirmed by multiple major industry players > (names withheld), and demonstrated during multiple NCM interoperability > workshops using multiple device HW platforms, multiple operating systems > and multiple host HW ranging from Beagleboard to latest quad-core x86. > > Until we do something with network device framework in order to get access > to upper layer Tx queue we need to utilize timer. Could you explain your reasoning? From what you say we must reduce the number of transfers, thus use them efficiently, but why by means of a timer? Regards Oliver -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: Oliver Neukum [mailto:oneukum@suse.de] > Sent: Friday, September 07, 2012 3:09 PM > > > > Until we do something with network device framework in order to get > > access to upper layer Tx queue we need to utilize timer. > > Could you explain your reasoning? From what you say we must reduce the > number of transfers, thus use them efficiently, but why by means of a > timer? > What was omitted in the previous mail is the fact that in addition to reducing amount of transfers we need to fill in NTB to the maximum size. It's quite important especially for embedded systems. As an example, on device (modem) side, DMA job for max NTB is initially set. - If device receives full NTB, only one interrupt (per NTB) is generated at the completion of DMA job and NCM function can immediately parse incoming NTB frame. - In case of "skinny" (not full) NTB, device gets INT for the short packet, then original DMA job for full NTB shall be cancelled, then DMA job to get short packet shall be set and after that we get INT for its completion. Only at this point all data is received and NTB parsing can be done by NCM function. All above takes quite more time than first case. There is a temptation to send full NTBs even with a single IP packet, But it will lead to wasted USB bandwidth and reduced ability to send real data for other functions in the device or other devices on the same root hub. Most important it will also harm IN direction. The challenge is to ensure that an acceptable timeout value is used. Too long and latency is introduced. Too short and too many padded NTBs will go out and that reduces the available throughput. The ideal timer value depends on the throughput available in the network. Something that is not really explicitly known to the NCM layer nor to the layer above. Alternate methods exist to achieve the same result without using a timer. But an optimal implementation requires that the amount of IP packets "in progress" or queued up is known to NCM so NCM can decide to send short or padded NTB or aggregate and send one or more full NTBs plus short or padded NTB. An implementation where NCM only knows if there is more data available or not can be shown to have side-effects that are not easily circumvented. And likewise shown to limit throughput compared to a timer-based solution. Regards, Alexey -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Friday 07 September 2012 17:23:33 Alexey ORISHKO wrote: > There is a temptation to send full NTBs even with a single IP packet, > But it will lead to wasted USB bandwidth and reduced ability to send > real data for other functions in the device or other devices on the > same root hub. Most important it will also harm IN direction. Well, we will eventually need to do so. In that case, shall we send a short package or not? > The challenge is to ensure that an acceptable timeout value is used. Too > long and latency is introduced. Too short and too many padded NTBs will > go out and that reduces the available throughput. The ideal timer value > depends on the throughput available in the network. Something that is not > really explicitly known to the NCM layer nor to the layer above. Well, we know how many packages are available to the device because they have already been scheduled. It seems to be clear that we need to batch while enough are on the way. > Alternate methods exist to achieve the same result without using a timer. > But an optimal implementation requires that the amount of IP packets "in > progress" or queued up is known to NCM so NCM can decide to send short or > padded NTB or aggregate and send one or more full NTBs plus short or > padded NTB. What exactly means "in progress"? > An implementation where NCM only knows if there is more data available or > not can be shown to have side-effects that are not easily circumvented. > And likewise shown to limit throughput compared to a timer-based solution. Well, I must say that the timer is ugly. If absolutely necessary we can keep it, but then I'd much prefer to put it into usbnet and have a generic batching functionality. However, I'd like to explore aternatives. Regards Oliver -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> -----Original Message----- > From: Oliver Neukum [mailto:oliver@neukum.org] > Sent: Friday, September 07, 2012 8:23 PM > > > There is a temptation to send full NTBs even with a single IP packet, > > But it will lead to wasted USB bandwidth and reduced ability to send > > real data for other functions in the device or other devices on the > > same root hub. Most important it will also harm IN direction. > > Well, we will eventually need to do so. In that case, shall we send a > short package or not? The trick is to find a compromise for device load vs wasted bandwidth. Currently if data size is less than 512 bytes, a shot packet will be sent. However, this value might be better tuned based on bus speed (HS/SS) and max NTB size negotiated. Note, that ZLP was avoided in the ncm driver by using flag in usbnet, because on some embedded USB controllers handling ZLP was more costly than short packet. > Well, we know how many packages are available to the device because > they have already been scheduled. It seems to be clear that we need to > batch while enough are on the way. It doesn't matter much that some packets on the way to device. What matters is the amount of packets in the application Tx queue for ncm network device. > > But an optimal implementation requires that the amount of IP packets > > "in progress" or queued up is known to NCM so NCM can decide to send > > short or padded NTB or aggregate and send one or more full NTBs plus > > short or padded NTB. > > What exactly means "in progress"? Sent by application to the network device. > > > An implementation where NCM only knows if there is more data > available > > or not can be shown to have side-effects that are not easily > circumvented. > > And likewise shown to limit throughput compared to a timer-based > solution. > > Well, I must say that the timer is ugly. If absolutely necessary we can > keep it, but then I'd much prefer to put it into usbnet and have a > generic batching functionality. However, I'd like to explore > aternatives. Is it possible to implement upper layer Tx queue in usbnet, so mini driver could get a clue about outstanding packets to be transmitted? Regards, Alexey -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/net/usb/cdc_ncm.c b/drivers/net/usb/cdc_ncm.c index 4cd582a..56ef743 100644 --- a/drivers/net/usb/cdc_ncm.c +++ b/drivers/net/usb/cdc_ncm.c @@ -135,9 +135,6 @@ struct cdc_ncm_ctx { u16 connected; }; -static void cdc_ncm_txpath_bh(unsigned long param); -static void cdc_ncm_tx_timeout_start(struct cdc_ncm_ctx *ctx); -static enum hrtimer_restart cdc_ncm_tx_timer_cb(struct hrtimer *hr_timer); static struct usb_driver cdc_ncm_driver; static void @@ -464,10 +461,6 @@ static int cdc_ncm_bind(struct usbnet *dev, struct usb_interface *intf) if (ctx == NULL) return -ENODEV; - hrtimer_init(&ctx->tx_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL); - ctx->tx_timer.function = &cdc_ncm_tx_timer_cb; - ctx->bh.data = (unsigned long)ctx; - ctx->bh.func = cdc_ncm_txpath_bh; atomic_set(&ctx->stop, 0); spin_lock_init(&ctx->mtx); ctx->netdev = dev->net; @@ -650,7 +643,7 @@ static void cdc_ncm_zero_fill(u8 *ptr, u32 first, u32 end, u32 max) memset(ptr + first, 0, end - first); } -static struct sk_buff * +static int cdc_ncm_fill_tx_frame(struct cdc_ncm_ctx *ctx, struct sk_buff *skb) { struct sk_buff *skb_out; @@ -659,12 +652,7 @@ cdc_ncm_fill_tx_frame(struct cdc_ncm_ctx *ctx, struct sk_buff *skb) u32 last_offset; u16 n = 0, index; u8 ready2send = 0; - - /* if there is a remaining skb, it gets priority */ - if (skb != NULL) - swap(skb, ctx->tx_rem_skb); - else - ready2send = 1; + u8 error = 0; /* * +----------------+ @@ -690,7 +678,7 @@ cdc_ncm_fill_tx_frame(struct cdc_ncm_ctx *ctx, struct sk_buff *skb) dev_kfree_skb_any(skb); ctx->netdev->stats.tx_dropped++; } - goto exit_no_skb; + return -EBUSY; } /* make room for NTH and NDP */ @@ -719,28 +707,15 @@ cdc_ncm_fill_tx_frame(struct cdc_ncm_ctx *ctx, struct sk_buff *skb) /* compute maximum buffer size */ rem = ctx->tx_max - offset; - if (skb == NULL) { - skb = ctx->tx_rem_skb; - ctx->tx_rem_skb = NULL; - - /* check for end of skb */ - if (skb == NULL) - break; - } - if (skb->len > rem) { if (n == 0) { /* won't fit, MTU problem? */ dev_kfree_skb_any(skb); skb = NULL; ctx->netdev->stats.tx_dropped++; + error = 1; } else { - /* no room for skb - store for later */ - if (ctx->tx_rem_skb != NULL) { - dev_kfree_skb_any(ctx->tx_rem_skb); - ctx->netdev->stats.tx_dropped++; - } - ctx->tx_rem_skb = skb; + skb = NULL; ready2send = 1; } @@ -768,13 +743,6 @@ cdc_ncm_fill_tx_frame(struct cdc_ncm_ctx *ctx, struct sk_buff *skb) skb = NULL; } - /* free up any dangling skb */ - if (skb != NULL) { - dev_kfree_skb_any(skb); - skb = NULL; - ctx->netdev->stats.tx_dropped++; - } - ctx->tx_curr_frame_num = n; if (n == 0) { @@ -791,9 +759,7 @@ cdc_ncm_fill_tx_frame(struct cdc_ncm_ctx *ctx, struct sk_buff *skb) ctx->tx_curr_skb = skb_out; ctx->tx_curr_offset = offset; ctx->tx_curr_last_offset = last_offset; - /* set the pending count */ - if (n < CDC_NCM_RESTART_TIMER_DATAGRAM_CNT) - ctx->tx_timer_pending = CDC_NCM_TIMER_PENDING_CNT; + goto exit_no_skb; } else { @@ -874,71 +840,37 @@ cdc_ncm_fill_tx_frame(struct cdc_ncm_ctx *ctx, struct sk_buff *skb) /* return skb */ ctx->tx_curr_skb = NULL; ctx->netdev->stats.tx_packets += ctx->tx_curr_frame_num; - return skb_out; -exit_no_skb: - /* Start timer, if there is a remaining skb */ - if (ctx->tx_curr_skb != NULL) - cdc_ncm_tx_timeout_start(ctx); - return NULL; -} - -static void cdc_ncm_tx_timeout_start(struct cdc_ncm_ctx *ctx) -{ - /* start timer, if not already started */ - if (!(hrtimer_active(&ctx->tx_timer) || atomic_read(&ctx->stop))) - hrtimer_start(&ctx->tx_timer, - ktime_set(0, CDC_NCM_TIMER_INTERVAL), - HRTIMER_MODE_REL); -} + if (error) + return -EBUSY; + if (ready2send) + return -EBUSY; + else + return 0; -static enum hrtimer_restart cdc_ncm_tx_timer_cb(struct hrtimer *timer) -{ - struct cdc_ncm_ctx *ctx = - container_of(timer, struct cdc_ncm_ctx, tx_timer); +exit_no_skb: - if (!atomic_read(&ctx->stop)) - tasklet_schedule(&ctx->bh); - return HRTIMER_NORESTART; + return -EAGAIN; } -static void cdc_ncm_txpath_bh(unsigned long param) +static int cdc_ncm_tx_bundle(struct usbnet *dev, struct sk_buff *skb, gfp_t flags) { - struct cdc_ncm_ctx *ctx = (struct cdc_ncm_ctx *)param; - - spin_lock_bh(&ctx->mtx); - if (ctx->tx_timer_pending != 0) { - ctx->tx_timer_pending--; - cdc_ncm_tx_timeout_start(ctx); - spin_unlock_bh(&ctx->mtx); - } else if (ctx->netdev != NULL) { - spin_unlock_bh(&ctx->mtx); - netif_tx_lock_bh(ctx->netdev); - usbnet_start_xmit(NULL, ctx->netdev); - netif_tx_unlock_bh(ctx->netdev); - } + int err; + struct cdc_ncm_ctx *ctx = (struct cdc_ncm_ctx *)dev->data[0]; + + err = cdc_ncm_fill_tx_frame(ctx, skb); + return err; } static struct sk_buff * cdc_ncm_tx_fixup(struct usbnet *dev, struct sk_buff *skb, gfp_t flags) { - struct sk_buff *skb_out; struct cdc_ncm_ctx *ctx = (struct cdc_ncm_ctx *)dev->data[0]; - /* - * The Ethernet API we are using does not support transmitting - * multiple Ethernet frames in a single call. This driver will - * accumulate multiple Ethernet frames and send out a larger - * USB frame when the USB buffer is full or when a single jiffies - * timeout happens. - */ if (ctx == NULL) goto error; - spin_lock_bh(&ctx->mtx); - skb_out = cdc_ncm_fill_tx_frame(ctx, skb); - spin_unlock_bh(&ctx->mtx); - return skb_out; + return ctx->tx_curr_skb; error: if (skb != NULL) @@ -1197,6 +1129,7 @@ static const struct driver_info cdc_ncm_info = { .manage_power = cdc_ncm_manage_power, .status = cdc_ncm_status, .rx_fixup = cdc_ncm_rx_fixup, + .tx_bundle = cdc_ncm_tx_bundle, .tx_fixup = cdc_ncm_tx_fixup, }; @@ -1211,6 +1144,7 @@ static const struct driver_info wwan_info = { .manage_power = cdc_ncm_manage_power, .status = cdc_ncm_status, .rx_fixup = cdc_ncm_rx_fixup, + .tx_bundle = cdc_ncm_tx_bundle, .tx_fixup = cdc_ncm_tx_fixup, }; diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c index 8531c1c..429b08b 100644 --- a/drivers/net/usb/usbnet.c +++ b/drivers/net/usb/usbnet.c @@ -86,6 +86,8 @@ static u8 node_id [ETH_ALEN]; static const char driver_name [] = "usbnet"; +static void tx_complete(struct urb *urb); + /* use ethtool to change the level for any given device */ static int msg_level = -1; module_param (msg_level, int, 0); @@ -1016,17 +1018,76 @@ skip_reset: netdev_dbg(dev->net, "kevent done, flags = 0x%lx\n", dev->flags); } +static int submit_next_bundle(struct usbnet *dev, struct sk_buff *skb) +{ + struct skb_data *entry; + int length = skb->len; + int retval; + struct driver_info *info = dev->driver_info; + struct urb *urb; + + urb = usb_alloc_urb(0, GFP_ATOMIC); + if (!urb) + return -ENOMEM; + + entry = (struct skb_data *) skb->cb; + entry->urb = urb; + entry->dev = dev; + entry->length = length; + + usb_fill_bulk_urb (urb, dev->udev, dev->out, + skb->data, skb->len, tx_complete, skb); + + /* don't assume the hardware handles USB_ZERO_PACKET + * NOTE: strictly conforming cdc-ether devices should expect + * the ZLP here, but ignore the one-byte packet. + * NOTE2: CDC NCM specification is different from CDC ECM when + * handling ZLP/short packets, so cdc_ncm driver will make short + * packet itself if needed. + */ + if (length % dev->maxpacket == 0) { + if (!(info->flags & FLAG_SEND_ZLP)) { + if (!(info->flags & FLAG_MULTI_PACKET)) { + urb->transfer_buffer_length++; + if (skb_tailroom(skb)) { + skb->data[skb->len] = 0; + __skb_put(skb, 1); + } + } + } else + urb->transfer_flags |= URB_ZERO_PACKET; + } + + atomic_inc(&dev->tx_in_flight); + retval = usb_submit_urb(urb, GFP_ATOMIC); + if (retval < 0) + atomic_dec(&dev->tx_in_flight); + //FIXME: full error handling + + return retval; +} /*-------------------------------------------------------------------------*/ static void tx_complete (struct urb *urb) { struct sk_buff *skb = (struct sk_buff *) urb->context; + struct sk_buff *skb2; struct skb_data *entry = (struct skb_data *) skb->cb; struct usbnet *dev = entry->dev; + struct driver_info *info = dev->driver_info; + int in_flight; + in_flight = atomic_dec_return(&dev->tx_in_flight); if (urb->status == 0) { - if (!(dev->driver_info->flags & FLAG_MULTI_PACKET)) + if (!(dev->driver_info->flags & FLAG_MULTI_PACKET)) { dev->net->stats.tx_packets++; + } else { + if (in_flight < 2) { + skb2 = info->tx_fixup(dev, NULL, GFP_ATOMIC); + if (skb2) + submit_next_bundle(dev, skb2); + } + } dev->net->stats.tx_bytes += entry->length; } else { dev->net->stats.tx_errors++; @@ -1089,23 +1150,50 @@ netdev_tx_t usbnet_start_xmit (struct sk_buff *skb, struct urb *urb = NULL; struct skb_data *entry; struct driver_info *info = dev->driver_info; + struct sk_buff *skb_old = NULL; unsigned long flags; int retval; + int transmit_now = 1; + int bundle_again = 0; if (skb) skb_tx_timestamp(skb); + /* + * first we allow drivers to bundle packets together + * maintainance of the buffer is the responsibility + * of the lower layer + */ +rebundle: + if (info->tx_bundle) { + bundle_again = 0; + retval = info->tx_bundle(dev, skb, GFP_ATOMIC); + + switch (retval) { + case 0: /* the package has been bundled */ + if (atomic_read(&dev->tx_in_flight) < 2) + transmit_now = 1; + else + transmit_now = 0; + break; + case -EAGAIN: + transmit_now = 1; + bundle_again = 1; + skb_old = skb; + break; + case -EBUSY: + transmit_now = 1; + break; + } + } // some devices want funky USB-level framing, for // win32 driver (usually) and/or hardware quirks - if (info->tx_fixup) { + if (transmit_now && info->tx_fixup) { skb = info->tx_fixup (dev, skb, GFP_ATOMIC); if (!skb) { if (netif_msg_tx_err(dev)) { netif_dbg(dev, tx_err, dev->net, "can't tx_fixup skb\n"); goto drop; - } else { - /* cdc_ncm collected packet; waits for more */ - goto not_drop; } } } @@ -1164,14 +1252,17 @@ netdev_tx_t usbnet_start_xmit (struct sk_buff *skb, } #endif + atomic_inc(&dev->tx_in_flight); switch ((retval = usb_submit_urb (urb, GFP_ATOMIC))) { case -EPIPE: netif_stop_queue (net); usbnet_defer_kevent (dev, EVENT_TX_HALT); + atomic_dec(&dev->tx_in_flight); usb_autopm_put_interface_async(dev->intf); break; default: usb_autopm_put_interface_async(dev->intf); + atomic_dec(&dev->tx_in_flight); netif_dbg(dev, tx_err, dev->net, "tx: submit urb err %d\n", retval); break; @@ -1187,7 +1278,6 @@ netdev_tx_t usbnet_start_xmit (struct sk_buff *skb, netif_dbg(dev, tx_err, dev->net, "drop, code %d\n", retval); drop: dev->net->stats.tx_dropped++; -not_drop: if (skb) dev_kfree_skb_any (skb); usb_free_urb (urb); @@ -1197,6 +1287,10 @@ not_drop: #ifdef CONFIG_PM deferred: #endif + if (bundle_again) { + skb = skb_old; + goto rebundle; + } return NETDEV_TX_OK; } EXPORT_SYMBOL_GPL(usbnet_start_xmit); @@ -1393,6 +1487,7 @@ usbnet_probe (struct usb_interface *udev, const struct usb_device_id *prod) dev->delay.data = (unsigned long) dev; init_timer (&dev->delay); mutex_init (&dev->phy_mutex); + atomic_set(&dev->tx_in_flight, 0); dev->net = net; strcpy (net->name, "usb%d"); diff --git a/include/linux/usb/usbnet.h b/include/linux/usb/usbnet.h index f87cf62..bb2f622 100644 --- a/include/linux/usb/usbnet.h +++ b/include/linux/usb/usbnet.h @@ -33,6 +33,7 @@ struct usbnet { wait_queue_head_t *wait; struct mutex phy_mutex; unsigned char suspend_count; + atomic_t tx_in_flight; /* i/o info: pipes etc */ unsigned in, out; @@ -133,6 +134,12 @@ struct driver_info { /* fixup rx packet (strip framing) */ int (*rx_fixup)(struct usbnet *dev, struct sk_buff *skb); + /* bundle individual package for transmission as + * larger package. This cannot sleep + */ + int (*tx_bundle)(struct usbnet *dev, + struct sk_buff *skb, gfp_t flags); + /* fixup tx packet (add framing) */ struct sk_buff *(*tx_fixup)(struct usbnet *dev, struct sk_buff *skb, gfp_t flags);