Message ID: 1310363206.2512.26.camel@edumazet-laptop
State: RFC, archived
Delegated to: David Miller
On Mon, Jul 11, 2011 at 07:46:46AM +0200, Eric Dumazet wrote:
> Le lundi 11 juillet 2011 à 02:52 +0200, Michał Mirosław a écrit :
> > Introduce __netdev_alloc_skb_aligned() to return skb with skb->data
> > aligned at a specified 2^n multiple.
> >
> > Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
>
> Hi Michal
>
> Could we synchronize our work to not introduce things that might
> disappear shortly ?

Sure. Are you saying that you'll convert all drivers to build_skb()? :-)

> Here is the RFC patch about build_skb() :
>
> [PATCH] net: introduce build_skb()
>
> One of the things we discussed during the netdev 2011 conference was the
> idea of changing network drivers to allocate/populate their skb at RX
> completion time, right before feeding the skb to the network stack.
>
> Right now, we allocate skbs when populating the RX ring, and that's a
> waste of CPU cache, since allocating an skb means a full memset() to
> clear the skb and its skb_shared_info portion. By the time the NIC fills
> a frame in the data buffer and the host can get it, the CPU has probably
> evicted those cache lines, because of huge RX ring sizes.
>
> So the deal would be to allocate only the data buffer for the NIC to
> populate its RX ring buffer, and use build_skb() at RX completion to
> attach the data buffer (now filled with an Ethernet frame) to a new skb,
> initialize the skb_shared_info portion, and give the hot skb to the
> network stack.
>
> build_skb() is the function to allocate an skb, with the caller providing
> the data buffer that should be attached to it. Drivers are expected to
> call skb_reserve() right after build_skb() to make skb->data point to
> the Ethernet frame (usually skipping NET_SKB_PAD and NET_IP_ALIGN).
[...]
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -234,6 +234,54 @@ nodata:
[...]
>  /**
> + * build_skb - build a network buffer
> + * @data: data buffer provided by caller
> + * @size: size of data buffer, not including skb_shared_info
> + *
> + * Allocate a new &sk_buff. Caller provides space holding head and
> + * skb_shared_info. Mostly used in driver RX path.
> + * The return is the buffer. On a failure the return is %NULL.
> + * Notes :
> + * Before IO, the driver allocates only the data buffer where the NIC
> + * puts the incoming frame. The driver SHOULD add room at head
> + * (NET_SKB_PAD) and MUST add room at tail (to hold skb_shared_info).
> + * After IO, the driver calls build_skb() to get a hot skb instead of a
> + * cold one before giving the packet to the stack. RX rings only contain
> + * data buffers, not full skbs.
> + */
> +struct sk_buff *build_skb(void *data, unsigned int size)
> +{
> +	struct skb_shared_info *shinfo;
> +	struct sk_buff *skb;
> +
> +	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
> +	if (!skb)
> +		return NULL;
> +
> +	size = SKB_DATA_ALIGN(size);
> +
> +	memset(skb, 0, offsetof(struct sk_buff, tail));
> +	skb->truesize = size + sizeof(struct sk_buff);
> +	atomic_set(&skb->users, 1);
> +	skb->head = data;
> +	skb->data = data;
> +	skb_reset_tail_pointer(skb);
> +	skb->end = skb->tail + size;
> +#ifdef NET_SKBUFF_DATA_USES_OFFSET
> +	skb->mac_header = ~0U;
> +#endif
> +
> +	/* make sure we initialize shinfo sequentially */
> +	shinfo = skb_shinfo(skb);
> +	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> +	atomic_set(&shinfo->dataref, 1);
> +	kmemcheck_annotate_variable(shinfo->destructor_arg);
> +
> +	return skb;
> +}
> +EXPORT_SYMBOL(build_skb);

I like the idea. From a driver writer's perspective, I would also like to
see a function that, given the max frame size and DMA alignment, would
allocate the buffer for me. In short:

 * rx_refill:
	[ptr, size] = alloc_rx_buffer(size, alignment, offset);
	[dma_addr] = map_buffer(ptr, size);
	append to rx buffer list

 * rx_poll:
	unmap_buffer(dma_addr, size);
	[skb] = build_skb(ptr, size);
	if (!skb)
		reuse_buffer, return
	skb_reserve(skb, rx_offset);
	skb_put(skb, pkt_len);
	(indicate offloads)
	rx_skb(skb);
	call rx_refill

 * rx_poll with copybreak:
	sync_buffer_to_cpu(dma_addr, data_len);
	[skb, copied] = build_or_copy_skb(ptr, size, hw_rx_offset, pkt_len);
	if (copied || !skb)
		append to rx buffer list
	else
		unmap_buffer(dma_addr, size);
	if (!skb)
		return;
	(indicate offloads)
	rx_skb(skb);
	if (!copied)
		call rx_refill

For even less driver code this could happen:

 * rx_refill(ptr, dma_addr):
	[size/alignment stored in queue or net_device struct]
	if (is_rx_free_list_full)
		return -EBUSY;
	append [ptr, dma_addr, size] to rx buffer list
	return !is_rx_free_list_full;

 * rx_poll with copybreak:
	[copy threshold stored in queue or net_device struct]
	[skb] = finish_rx(ptr, dma_addr, size, hw_rx_offset, pkt_len);
	if (!skb)
		return;
	(fill in offloads: checksum/etc)
	rx_skb(skb);

This could be extended to handle frames spanning multiple buffers.

BTW, napi_get_frags() + napi_gro_frags() use a similar idea of allocating
the skb late.

Best Regards,
Michał Mirosław
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
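The refill/poll split sketched above can be modeled outside the kernel. The following is a userspace sketch, not kernel code: `malloc()` stands in for the DMA-able buffer allocation, `struct model_skb` stands in for `struct sk_buff`, and all names (`alloc_rx_buffer`, `model_build_skb`, `RX_OFFSET`, `SHINFO_ROOM`) and constants are made up for illustration. It only shows the ordering: the buffer exists alone on the RX ring, and the skb is built and reserved at completion time.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define RX_OFFSET   64   /* stand-in for NET_SKB_PAD + NET_IP_ALIGN */
#define SHINFO_ROOM 320  /* stand-in for sizeof(struct skb_shared_info) */

struct model_skb {
	unsigned char *head, *data, *tail;
	unsigned int len;
};

/* rx_refill: allocate only the data buffer, no skb yet.
 * Room is left at head for headers and at tail for shared info. */
static unsigned char *alloc_rx_buffer(unsigned int frame_room)
{
	return malloc(RX_OFFSET + frame_room + SHINFO_ROOM);
}

/* rx_poll: attach the (now filled) buffer to a freshly built skb;
 * head, data and tail all start at the buffer, as in build_skb(). */
static struct model_skb *model_build_skb(unsigned char *buf)
{
	struct model_skb *skb = calloc(1, sizeof(*skb));
	if (!skb)
		return NULL;
	skb->head = skb->data = skb->tail = buf;
	return skb;
}

/* skb_reserve(): skip the headroom so data points at the frame */
static void model_skb_reserve(struct model_skb *skb, unsigned int n)
{
	skb->data += n;
	skb->tail += n;
}

/* skb_put(): account for the received payload */
static void model_skb_put(struct model_skb *skb, unsigned int n)
{
	skb->tail += n;
	skb->len += n;
}
```

A driver-style usage would be: fill the buffer (the NIC's DMA), then build, reserve and put, exactly in the `rx_poll` order above.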
Le lundi 11 juillet 2011 à 07:46 +0200, Eric Dumazet a écrit :

> [PATCH] net: introduce build_skb()
>
> One of the things we discussed during the netdev 2011 conference was the
> idea of changing network drivers to allocate/populate their skb at RX
> completion time, right before feeding the skb to the network stack.
>
> Right now, we allocate skbs when populating the RX ring, and that's a
> waste of CPU cache, since allocating an skb means a full memset() to
> clear the skb and its skb_shared_info portion. By the time the NIC fills
> a frame in the data buffer and the host can get it, the CPU has probably
> evicted those cache lines, because of huge RX ring sizes.
>
> So the deal would be to allocate only the data buffer for the NIC to
> populate its RX ring buffer, and use build_skb() at RX completion to
> attach the data buffer (now filled with an Ethernet frame) to a new skb,
> initialize the skb_shared_info portion, and give the hot skb to the
> network stack.

Update :

First results are impressive: about a 15% throughput increase with the
igb driver on my small desktop machine, and I am limited by the wire
speed :)

(AMD Athlon(tm) II X2 B24 Processor, 3GHz, cache size: 1024K)

Setup: one dual-port Intel card: Ethernet controller: Intel Corporation
82576 Gigabit Network Connection (rev 01)

eth1 direct attach on eth2, Gigabit speed.
eth2 RX ring set to 4096 slots (default is 256)

CPU0: pktgen sending on eth1, line rate (1488137 pps)
CPU1: receives eth2 interrupts; packets are dropped in the raw netfilter
table to bypass the upper stacks.

Before patch: 15% packet losses, ksoftirqd/1 using 100% of CPU
After patch: residual losses (less than 0.1%), ksoftirqd not used, 80%
CPU used

I'll do more tests with a 10Gb card (ixgbe driver) to not be wire
limited.
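The benchmark setup described above uses standard tools. A hedged sketch follows: the interface names are from this test box, and the raw-table DROP rule is one plausible way to realize "packets dropped in the raw netfilter table"; the message does not say exactly which rule was used.

```shell
# Grow the RX ring on the receiving port (driver default was 256)
ethtool -G eth2 rx 4096

# Drop received packets in the raw table, before conntrack and the
# upper stacks ever see them
iptables -t raw -A PREROUTING -i eth2 -j DROP

# Load pktgen on the sender; its per-CPU kernel threads are then
# configured through /proc/net/pktgen/ to transmit on eth1 at line rate
modprobe pktgen
```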
On Tue, Jul 12, 2011 at 05:40:16PM +0200, Eric Dumazet wrote:
> Le lundi 11 juillet 2011 à 07:46 +0200, Eric Dumazet a écrit :
> > [PATCH] net: introduce build_skb()
[...]
> Update :
>
> First results are impressive: about a 15% throughput increase with the
> igb driver on my small desktop machine, and I am limited by the wire
> speed :)
[...]
> Before patch: 15% packet losses, ksoftirqd/1 using 100% of CPU
> After patch: residual losses (less than 0.1%), ksoftirqd not used, 80%
> CPU used
>
> I'll do more tests with a 10Gb card (ixgbe driver) to not be wire
> limited.

I remember observing a similar increase after switching from allocating
skbs to allocating pages and using napi_get_frags() + napi_gro_frags().
That was with the sl351x driver posted for review some time ago.

Best Regards,
Michał Mirosław
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 32ada53..5e903e7 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -507,6 +507,7 @@ static inline struct rtable *skb_rtable(const struct sk_buff *skb)
 extern void kfree_skb(struct sk_buff *skb);
 extern void consume_skb(struct sk_buff *skb);
 extern void __kfree_skb(struct sk_buff *skb);
+extern struct sk_buff *build_skb(void *data, unsigned int size);
 extern struct sk_buff *__alloc_skb(unsigned int size,
 				   gfp_t priority, int fclone, int node);
 static inline struct sk_buff *alloc_skb(unsigned int size,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d220119..9193d7e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -234,6 +234,54 @@ nodata:
 EXPORT_SYMBOL(__alloc_skb);
 
 /**
+ * build_skb - build a network buffer
+ * @data: data buffer provided by caller
+ * @size: size of data buffer, not including skb_shared_info
+ *
+ * Allocate a new &sk_buff. Caller provides space holding head and
+ * skb_shared_info. Mostly used in driver RX path.
+ * The return is the buffer. On a failure the return is %NULL.
+ * Notes :
+ * Before IO, the driver allocates only the data buffer where the NIC
+ * puts the incoming frame. The driver SHOULD add room at head
+ * (NET_SKB_PAD) and MUST add room at tail (to hold skb_shared_info).
+ * After IO, the driver calls build_skb() to get a hot skb instead of a
+ * cold one before giving the packet to the stack. RX rings only contain
+ * data buffers, not full skbs.
+ */
+struct sk_buff *build_skb(void *data, unsigned int size)
+{
+	struct skb_shared_info *shinfo;
+	struct sk_buff *skb;
+
+	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
+	if (!skb)
+		return NULL;
+
+	size = SKB_DATA_ALIGN(size);
+
+	memset(skb, 0, offsetof(struct sk_buff, tail));
+	skb->truesize = size + sizeof(struct sk_buff);
+	atomic_set(&skb->users, 1);
+	skb->head = data;
+	skb->data = data;
+	skb_reset_tail_pointer(skb);
+	skb->end = skb->tail + size;
+#ifdef NET_SKBUFF_DATA_USES_OFFSET
+	skb->mac_header = ~0U;
+#endif
+
+	/* make sure we initialize shinfo sequentially */
+	shinfo = skb_shinfo(skb);
+	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
+	atomic_set(&shinfo->dataref, 1);
+	kmemcheck_annotate_variable(shinfo->destructor_arg);
+
+	return skb;
+}
+EXPORT_SYMBOL(build_skb);
+
+/**
  * __netdev_alloc_skb - allocate an skbuff for rx on a specific device
  * @dev: network device to receive on
  * @length: length to allocate
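The size accounting in build_skb() above can be checked in userspace. This is a sketch of just the arithmetic, under stated assumptions: SMP_CACHE_BYTES is taken as 64 (common on x86), `SIZEOF_SK_BUFF` is a stand-in for the arch-dependent `sizeof(struct sk_buff)`, and `build_skb_layout` is a made-up helper, not a kernel function. It mirrors `size = SKB_DATA_ALIGN(size)`, `skb->end = skb->tail + size` (with tail == head at this point), and `skb->truesize = size + sizeof(struct sk_buff)`.

```c
#include <assert.h>

#define SMP_CACHE_BYTES 64
/* SKB_DATA_ALIGN: round the buffer size up to a full cache line */
#define SKB_DATA_ALIGN(x) \
	(((x) + (SMP_CACHE_BYTES - 1)) & ~(SMP_CACHE_BYTES - 1))
#define SIZEOF_SK_BUFF 240u /* stand-in; real value is arch-dependent */

struct layout {
	unsigned int aligned_size; /* size after SKB_DATA_ALIGN() */
	unsigned int end_off;      /* skb->end as offset from skb->head */
	unsigned int truesize;     /* memory charged to the socket */
};

/* Mirror of the assignments in build_skb(): given `size` usable
 * bytes in the caller's buffer, where do end and truesize land? */
static struct layout build_skb_layout(unsigned int size)
{
	struct layout l;
	l.aligned_size = SKB_DATA_ALIGN(size);
	l.end_off = l.aligned_size;            /* tail == head here */
	l.truesize = l.aligned_size + SIZEOF_SK_BUFF;
	return l;
}
```

One consequence visible here: because the size is rounded up after the caller picked it, the caller's "MUST add room at tail" reservation has to account for the alignment as well, or skb_shared_info would overlap the rounded-up data area.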