
[net-next-2.6] net: introduce build_skb()

Message ID 1310363206.2512.26.camel@edumazet-laptop
State RFC, archived
Delegated to: David Miller

Commit Message

Eric Dumazet July 11, 2011, 5:46 a.m. UTC
On Monday, July 11, 2011 at 02:52 +0200, Michał Mirosław wrote:
> Introduce __netdev_alloc_skb_aligned() to return skb with skb->data
> aligned at specified 2^n multiple.
> 
> Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
> ---

Hi Michal


Could we synchronize our work to not introduce things that might
disappear shortly ?

Here is the RFC patch about build_skb() :

[PATCH] net: introduce build_skb()

One of the things we discussed during the netdev 2011 conference was the
idea of changing network drivers to allocate/populate their skb at RX
completion time, right before feeding the skb to the network stack.

Right now, we allocate skbs when populating the RX ring, and that's a
waste of CPU cache, since allocating an skb means a full memset() to clear
the skb and its skb_shared_info portion. By the time the NIC has filled a
frame in the data buffer and the host can get to it, the CPU has probably
evicted those cache lines, because of huge RX ring sizes.

So the deal would be to allocate only the data buffer for the NIC to
populate its RX ring buffer, and to use build_skb() at RX completion to
attach a data buffer (now filled with an Ethernet frame) to a new skb,
initialize the skb_shared_info portion, and give the hot skb to the
network stack.

build_skb() is the function to allocate an skb, with the caller providing
the data buffer that should be attached to it. Drivers are expected to call
skb_reserve() right after build_skb() to make skb->data point to the
Ethernet frame (usually skipping NET_SKB_PAD and NET_IP_ALIGN).
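
To illustrate the intended driver flow, here is a minimal RX completion
sketch using build_skb(); the function name my_rx_complete(), the
frame_room value and the exact buffer layout are illustrative assumptions,
not part of this patch:

#include <linux/etherdevice.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical RX completion handler. The RX ring holds raw data buffers
 * allocated as NET_SKB_PAD + NET_IP_ALIGN + frame_room bytes, plus
 * SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) of tail room, and the
 * NIC writes the frame at offset NET_SKB_PAD + NET_IP_ALIGN.
 */
static void my_rx_complete(struct net_device *dev, void *data,
			   unsigned int pkt_len)
{
	const unsigned int frame_room = 1536;	/* assumed frame space */
	struct sk_buff *skb;

	/* the size passed to build_skb() excludes skb_shared_info */
	skb = build_skb(data, NET_SKB_PAD + NET_IP_ALIGN + frame_room);
	if (!skb)
		return;		/* keep the data buffer in the ring for reuse */

	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);	/* skb->data -> frame */
	skb_put(skb, pkt_len);
	skb->protocol = eth_type_trans(skb, dev);
	netif_receive_skb(skb);
}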


Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/skbuff.h |    1 
 net/core/skbuff.c      |   48 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 49 insertions(+)




Comments

Michał Mirosław July 11, 2011, 10:53 a.m. UTC | #1
On Mon, Jul 11, 2011 at 07:46:46AM +0200, Eric Dumazet wrote:
> On Monday, July 11, 2011 at 02:52 +0200, Michał Mirosław wrote:
> > Introduce __netdev_alloc_skb_aligned() to return skb with skb->data
> > aligned at specified 2^n multiple.
> > 
> > Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
> Hi Michal
> 
> Could we synchronize our work to not introduce things that might
> disappear shortly ?

Sure. Are you saying that you'll convert all drivers to build_skb()? :-)

> Here is the RFC patch about build_skb() :
> 
> [PATCH] net: introduce build_skb()
> 
> One of the things we discussed during the netdev 2011 conference was the
> idea of changing network drivers to allocate/populate their skb at RX
> completion time, right before feeding the skb to the network stack.
> 
> Right now, we allocate skbs when populating the RX ring, and that's a
> waste of CPU cache, since allocating an skb means a full memset() to clear
> the skb and its skb_shared_info portion. By the time the NIC has filled a
> frame in the data buffer and the host can get to it, the CPU has probably
> evicted those cache lines, because of huge RX ring sizes.
> 
> So the deal would be to allocate only the data buffer for the NIC to
> populate its RX ring buffer, and to use build_skb() at RX completion to
> attach a data buffer (now filled with an Ethernet frame) to a new skb,
> initialize the skb_shared_info portion, and give the hot skb to the
> network stack.
> 
> build_skb() is the function to allocate an skb, with the caller providing
> the data buffer that should be attached to it. Drivers are expected to call
> skb_reserve() right after build_skb() to make skb->data point to the
> Ethernet frame (usually skipping NET_SKB_PAD and NET_IP_ALIGN).
[...]
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -234,6 +234,54 @@ nodata:
[...]
>  /**
> + * build_skb - build a network buffer
> + * @data: data buffer provided by caller
> + * @size: size of data buffer, not including skb_shared_info
> + *
> + * Allocate a new &sk_buff. Caller provides space holding head and
> + * skb_shared_info. Mostly used in driver RX path.
> + * The return is the buffer. On a failure the return is %NULL.
> + * Notes :
> + *  Before IO, driver allocates only the data buffer where the NIC puts the
> + *  incoming frame. Driver SHOULD add room at head (NET_SKB_PAD) and
> + *  MUST add room at tail (to hold skb_shared_info).
> + *  After IO, driver calls build_skb(), to get a hot skb instead of a cold one,
> + *  before giving the packet to the stack. RX rings only contain data buffers,
> + *  not full skbs.
> + */
> +struct sk_buff *build_skb(void *data, unsigned int size)
> +{
> +	struct skb_shared_info *shinfo;
> +	struct sk_buff *skb;
> +
> +	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
> +	if (!skb)
> +		return NULL;
> +
> +	size = SKB_DATA_ALIGN(size);
> +
> +	memset(skb, 0, offsetof(struct sk_buff, tail));
> +	skb->truesize = size + sizeof(struct sk_buff);
> +	atomic_set(&skb->users, 1);
> +	skb->head = data;
> +	skb->data = data;
> +	skb_reset_tail_pointer(skb);
> +	skb->end = skb->tail + size;
> +#ifdef NET_SKBUFF_DATA_USES_OFFSET
> +	skb->mac_header = ~0U;
> +#endif
> +
> +	/* make sure we initialize shinfo sequentially */
> +	shinfo = skb_shinfo(skb);
> +	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> +	atomic_set(&shinfo->dataref, 1);
> +	kmemcheck_annotate_variable(shinfo->destructor_arg);
> +
> +	return skb;
> +}
> +EXPORT_SYMBOL(build_skb);

I like the idea. From a driver writer's perspective, I would also like to
see a function that, given a max frame size and DMA alignment, would
allocate the buffer for me (a rough sketch of such a helper follows the
pseudocode below).

In short:

 * rx_refill:

	[ptr, size] = alloc_rx_buffer(size, alignment, offset);
	[dma_addr] = map_buffer(ptr, size);
	append to rx buffer list

 * rx_poll:

	unmap_buffer(dma_addr, size);
	[skb] = build_skb(ptr, size);
	if (!skb)
		reuse_buffer
		return
	skb_reserve(skb, rx_offset);
	skb_put(skb, pkt_len);
	(indicate offloads)
	rx_skb(skb);
	call rx_refill

 * rx_poll with copybreak:

	sync_buffer_to_cpu(dma_addr, data_len);
	[skb, copied] = build_or_copy_skb(ptr, size, hw_rx_offset, pkt_len);
	if (copied || !skb)
		append to rx buffer list
	else
		unmap_buffer(dma_addr, size);
	if (!skb)
		return;
	(indicate offloads)
	rx_skb(skb);
	if (!copied)
		call rx_refill


For even less driver code this could happen:

 * rx_refill(ptr, dma_addr)
	[size/alignment stored in queue or net_device struct]

	if (is_rx_free_list_full)
		return -EBUSY;
	append to rx buffer list [ptr, dma_addr, size]
	return !is_rx_free_list_full;

 * rx_poll with copybreak:
	[copy threshold stored in queue or net_device struct]

	[skb] = finish_rx(ptr, dma_addr, size, hw_rx_offset, pkt_len);
	if (!skb)
		return;
	(fill in offloads: checksum/etc)
	rx_skb(skb);


This could be extended to handle frames spanning multiple buffers.
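
As a point of reference, a rough sketch of the allocation helper mentioned
above could look like the following; alloc_rx_buffer(), its parameters and
its sizing policy are purely hypothetical, not an existing API:

#include <linux/skbuff.h>
#include <linux/slab.h>

/* Hypothetical helper: allocate a raw RX data buffer suitable for a later
 * build_skb() call, i.e. with head room (NET_SKB_PAD + NET_IP_ALIGN),
 * frame_size bytes for the frame, and tail room for skb_shared_info.
 * kmalloc() already guarantees ARCH_KMALLOC_MINALIGN alignment; a real
 * helper would also honor an explicit DMA alignment argument.
 */
static void *alloc_rx_buffer(unsigned int frame_size, unsigned int *buf_size)
{
	unsigned int size = NET_SKB_PAD + NET_IP_ALIGN + frame_size;

	*buf_size = size;	/* value the driver later passes to build_skb() */
	size = SKB_DATA_ALIGN(size) +
	       SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
	return kmalloc(size, GFP_ATOMIC);
}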

BTW, napi_get_frags() + napi_gro_frags() use a similar idea of allocating
the skb late.
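
For comparison, that page-based late-allocation pattern looks roughly like
this; napi_get_frags() and napi_gro_frags() are existing APIs, while
my_rx_page() and the surrounding driver logic are hypothetical:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical RX handler built on napi_get_frags()/napi_gro_frags():
 * the skb head is allocated only at completion time, and the received
 * page is attached as a fragment.
 */
static void my_rx_page(struct napi_struct *napi, struct page *page,
		       unsigned int offset, unsigned int len)
{
	struct sk_buff *skb = napi_get_frags(napi);

	if (!skb)
		return;		/* no skb: the driver can recycle the page */

	skb_fill_page_desc(skb, 0, page, offset, len);
	skb->len += len;
	skb->data_len += len;
	skb->truesize += len;	/* account for the attached page data */
	napi_gro_frags(napi);
}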

Best Regards,
Michał Mirosław
Eric Dumazet July 12, 2011, 3:40 p.m. UTC | #2
On Monday, July 11, 2011 at 07:46 +0200, Eric Dumazet wrote:

> [PATCH] net: introduce build_skb()
> 
> One of the things we discussed during the netdev 2011 conference was the
> idea of changing network drivers to allocate/populate their skb at RX
> completion time, right before feeding the skb to the network stack.
> 
> Right now, we allocate skbs when populating the RX ring, and that's a
> waste of CPU cache, since allocating an skb means a full memset() to clear
> the skb and its skb_shared_info portion. By the time the NIC has filled a
> frame in the data buffer and the host can get to it, the CPU has probably
> evicted those cache lines, because of huge RX ring sizes.
> 
> So the deal would be to allocate only the data buffer for the NIC to
> populate its RX ring buffer, and to use build_skb() at RX completion to
> attach a data buffer (now filled with an Ethernet frame) to a new skb,
> initialize the skb_shared_info portion, and give the hot skb to the
> network stack.

Update:

First results are impressive: about a 15% throughput increase with the igb
driver on my small desktop machine, and I am limited by the wire
speed :)

(AMD Athlon(tm) II X2 B24 Processor, 3GHz, cache size: 1024K)

Setup: one dual-port Intel card: Ethernet controller: Intel
Corporation 82576 Gigabit Network Connection (rev 01)

eth1 directly attached to eth2, Gigabit speed.
eth2 RX ring set to 4096 slots (default is 256)

CPU0: pktgen sending on eth1 at line rate (1488137 pps)
CPU1: receives eth2 interrupts; packets are dropped in the raw netfilter
table to bypass the upper stacks.

Before the patch: 15% packet loss, ksoftirqd/1 using 100% of the CPU.
After the patch: residual losses (less than 0.1%), ksoftirqd not used, 80%
CPU used.

I'll do more tests with a 10Gb card (ixgbe driver) so I am not wire
limited.



Michał Mirosław July 12, 2011, 3:54 p.m. UTC | #3
On Tue, Jul 12, 2011 at 05:40:16PM +0200, Eric Dumazet wrote:
> On Monday, July 11, 2011 at 07:46 +0200, Eric Dumazet wrote:
> 
> > [PATCH] net: introduce build_skb()
> > 
> > One of the things we discussed during the netdev 2011 conference was the
> > idea of changing network drivers to allocate/populate their skb at RX
> > completion time, right before feeding the skb to the network stack.
> > 
> > Right now, we allocate skbs when populating the RX ring, and that's a
> > waste of CPU cache, since allocating an skb means a full memset() to clear
> > the skb and its skb_shared_info portion. By the time the NIC has filled a
> > frame in the data buffer and the host can get to it, the CPU has probably
> > evicted those cache lines, because of huge RX ring sizes.
> > 
> > So the deal would be to allocate only the data buffer for the NIC to
> > populate its RX ring buffer, and to use build_skb() at RX completion to
> > attach a data buffer (now filled with an Ethernet frame) to a new skb,
> > initialize the skb_shared_info portion, and give the hot skb to the
> > network stack.
> 
> Update:
> 
> First results are impressive: about a 15% throughput increase with the igb
> driver on my small desktop machine, and I am limited by the wire
> speed :)
> 
> (AMD Athlon(tm) II X2 B24 Processor, 3GHz, cache size: 1024K)
> 
> Setup: one dual-port Intel card: Ethernet controller: Intel
> Corporation 82576 Gigabit Network Connection (rev 01)
> 
> eth1 directly attached to eth2, Gigabit speed.
> eth2 RX ring set to 4096 slots (default is 256)
> 
> CPU0: pktgen sending on eth1 at line rate (1488137 pps)
> CPU1: receives eth2 interrupts; packets are dropped in the raw netfilter
> table to bypass the upper stacks.
> 
> Before the patch: 15% packet loss, ksoftirqd/1 using 100% of the CPU.
> After the patch: residual losses (less than 0.1%), ksoftirqd not used, 80%
> CPU used.
> 
> I'll do more tests with a 10Gb card (ixgbe driver) so I am not wire
> limited.

I remember observing a similar increase after switching from allocating
skbs to allocating pages and using napi_get_frags() + napi_gro_frags().
That was with the sl351x driver posted for review some time ago.

Best Regards,
Michał Mirosław

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 32ada53..5e903e7 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -507,6 +507,7 @@  static inline struct rtable *skb_rtable(const struct sk_buff *skb)
 extern void kfree_skb(struct sk_buff *skb);
 extern void consume_skb(struct sk_buff *skb);
 extern void	       __kfree_skb(struct sk_buff *skb);
+extern struct sk_buff *build_skb(void *data, unsigned int size);
 extern struct sk_buff *__alloc_skb(unsigned int size,
 				   gfp_t priority, int fclone, int node);
 static inline struct sk_buff *alloc_skb(unsigned int size,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d220119..9193d7e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -234,6 +234,54 @@  nodata:
 EXPORT_SYMBOL(__alloc_skb);
 
 /**
+ * build_skb - build a network buffer
+ * @data: data buffer provided by caller
+ * @size: size of data buffer, not including skb_shared_info
+ *
+ * Allocate a new &sk_buff. Caller provides space holding head and
+ * skb_shared_info. Mostly used in driver RX path.
+ * The return is the buffer. On a failure the return is %NULL.
+ * Notes :
+ *  Before IO, driver allocates only the data buffer where the NIC puts the
+ *  incoming frame. Driver SHOULD add room at head (NET_SKB_PAD) and
+ *  MUST add room at tail (to hold skb_shared_info).
+ *  After IO, driver calls build_skb(), to get a hot skb instead of a cold one,
+ *  before giving the packet to the stack. RX rings only contain data buffers,
+ *  not full skbs.
+ */
+struct sk_buff *build_skb(void *data, unsigned int size)
+{
+	struct skb_shared_info *shinfo;
+	struct sk_buff *skb;
+
+	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
+	if (!skb)
+		return NULL;
+
+	size = SKB_DATA_ALIGN(size);
+
+	memset(skb, 0, offsetof(struct sk_buff, tail));
+	skb->truesize = size + sizeof(struct sk_buff);
+	atomic_set(&skb->users, 1);
+	skb->head = data;
+	skb->data = data;
+	skb_reset_tail_pointer(skb);
+	skb->end = skb->tail + size;
+#ifdef NET_SKBUFF_DATA_USES_OFFSET
+	skb->mac_header = ~0U;
+#endif
+
+	/* make sure we initialize shinfo sequentially */
+	shinfo = skb_shinfo(skb);
+	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
+	atomic_set(&shinfo->dataref, 1);
+	kmemcheck_annotate_variable(shinfo->destructor_arg);
+
+	return skb;
+}
+EXPORT_SYMBOL(build_skb);
+
+/**
  *	__netdev_alloc_skb - allocate an skbuff for rx on a specific device
  *	@dev: network device to receive on
  *	@length: length to allocate