
tcp: splice as many packets as possible at once

Message ID 20090113.163705.130074998.davem@davemloft.net
State RFC, archived
Delegated to: David Miller

Commit Message

David Miller Jan. 14, 2009, 12:37 a.m. UTC
From: Evgeniy Polyakov <zbr@ioremap.net>
Date: Wed, 14 Jan 2009 03:22:52 +0300

> On Tue, Jan 13, 2009 at 04:16:25PM -0800, David Miller (davem@davemloft.net) wrote:
> > I wish there were some way we could make this code grab and release a
> > reference to the SKB data area (I mean skb_shinfo(skb)->dataref) to
> > accomplish its goals.
> 
> Ugh... Clone without cloning, but by increasing the dataref. Given
> that splice only needs the skb to track the head of the original, this
> may really work.
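
In other words: take an extra reference on skb_shinfo(skb)->dataref when
handing the head to the pipe, and drop it via skb_release_data() when the
pipe buffer is released. A minimal sketch of that pairing (hypothetical
helper names; the real attempt is the patch at the bottom of this page):

	/* Pin the skb data area without cloning the skb itself. */
	static void skb_data_hold(struct sk_buff *skb)
	{
		atomic_inc(&skb_shinfo(skb)->dataref);
	}

	/* Unpin; the head and frags are freed on the last reference. */
	static void skb_data_unhold(struct sk_buff *skb)
	{
		skb_release_data(skb);
	}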

Here is something I scrambled together; it is largely based upon
Jarek's patch:


Comments

Herbert Xu Jan. 14, 2009, 3:51 a.m. UTC | #1
David Miller <davem@davemloft.net> wrote:
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 5110b35..05126da 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -70,12 +70,17 @@
> static struct kmem_cache *skbuff_head_cache __read_mostly;
> static struct kmem_cache *skbuff_fclone_cache __read_mostly;
> 
> +static void skb_release_data(struct sk_buff *skb);
> +
> static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
>                                  struct pipe_buffer *buf)
> {
>        struct sk_buff *skb = (struct sk_buff *) buf->private;
> 
> -       kfree_skb(skb);
> +       if (skb)
> +               skb_release_data(skb);
> +       else
> +               put_page(buf->page);
> }

Unfortunately this won't work, not even for network destinations.

The reason is that this gets called as soon as the destination's
splice hook returns; for networking, that means when sendpage returns.

So by that time we'll still be left with just a page reference
to a page whose slab memory may already have been freed.

To make this work we need to get the destination's splice hooks
to acquire this reference.
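
Roughly, the sequence is (an annotated sketch of the splice consumer of
that era; names approximate, not verbatim kernel code):

	/* splice_from_pipe(), simplified */
	for (;;) {
		struct pipe_buffer *buf = next_pipe_buffer(pipe);

		if (!buf)
			break;
		ret = actor(pipe, buf, sd);	/* tcp: ->sendpage() queues the
						 * page and returns, holding
						 * only a page reference */
		buf->ops->release(pipe, buf);	/* sock_pipe_buf_release():
						 * skb_release_data() can free
						 * the slab head here, long
						 * before TX completes */
	}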

Cheers,
David Miller Jan. 14, 2009, 4:25 a.m. UTC | #2
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 14 Jan 2009 14:51:24 +1100

> Unfortunately this won't work, not even for network destinations.
> 
> The reason is that this gets called as soon as the destination's
> splice hook returns; for networking, that means when sendpage returns.
> 
> So by that time we'll still be left with just a page reference
> to a page whose slab memory may already have been freed.
> 
> To make this work we need to get the destination's splice hooks
> to acquire this reference.

Yes, I realized this after your earlier posting today.
David Miller Jan. 14, 2009, 7:27 a.m. UTC | #3
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 14 Jan 2009 14:51:24 +1100

> Unfortunately this won't work, not even for network destinations.
> 
> The reason is that this gets called as soon as the destination's
> splice hook returns; for networking, that means when sendpage returns.
> 
> So by that time we'll still be left with just a page reference
> to a page whose slab memory may already have been freed.
> 
> To make this work we need to get the destination's splice hooks
> to acquire this reference.

So while trying to figure out a sane way to fix this, I found
another bug:

	/*
	 * map the linear part
	 */
	if (__splice_segment(virt_to_page(skb->data),
			     (unsigned long) skb->data & (PAGE_SIZE - 1),
			     skb_headlen(skb),
			     offset, len, skb, spd))
		return 1;

This will explode if the SLAB cache for skb->head is using compound
(i.e. order > 0) pages.

For example, if an order-1 page is being used for the skb->head
data (which would be true on most systems for jumbo MTU frames being
received into a linear SKB), the offset will be wrong, and depending
upon skb_headlen() we could reference past the end of the
non-compound page we end up grabbing a reference to.

And then we'll end up with a compound page in an skb_shinfo() frag
array, which is illegal.

Well, at least, I can list several drivers that will barf when
trying to TX that (Acenic, atlx1, cassini, jme, sungem), since
they use pci_map_page(... virt_to_page(skb->data)) or similar.

The core KMAP'ing support for SKBs will also not be able to grok
such a beastly SKB.
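
To restate the head mapping with the hazard annotated (same code as
above, comments added):

	page = virt_to_page(skb->data);		/* with an order > 0 head this
						 * can be a tail page of a
						 * compound allocation */
	poff = (unsigned long) skb->data & (PAGE_SIZE - 1);
						/* an offset within a single
						 * 4K page, while the headlen
						 * may run beyond that page */

	/* Whatever page this pair names is what gets referenced and
	 * stored in spd->pages[], and later in a frag array. */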

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu Jan. 14, 2009, 8:26 a.m. UTC | #4
On Tue, Jan 13, 2009 at 11:27:10PM -0800, David Miller wrote:
> 
> So while trying to figure out a sane way to fix this, I found
> another bug:
> 
> 	/*
> 	 * map the linear part
> 	 */
> 	if (__splice_segment(virt_to_page(skb->data),
> 			     (unsigned long) skb->data & (PAGE_SIZE - 1),
> 			     skb_headlen(skb),
> 			     offset, len, skb, spd))
> 		return 1;
> 
> This will explode if the SLAB cache for skb->head is using compound
> (i.e. order > 0) pages.
> 
> For example, if an order-1 page is being used for the skb->head
> data (which would be true on most systems for jumbo MTU frames being
> received into a linear SKB), the offset will be wrong, and depending
> upon skb_headlen() we could reference past the end of the
> non-compound page we end up grabbing a reference to.

I'm actually not that worried about these packets, since these
drivers should be converted to skb frags anyway; otherwise they'll
probably stop working after a while due to memory fragmentation.

But yeah, for correctness we definitely should address this in
skb_splice_bits.

I still think Jarek's approach (the copying one) is probably the
easiest for now until we can find a better way.
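
For reference, a sketch of that copying direction (hypothetical helper,
not a committed fix): copy the linear chunk into a page the pipe owns
outright, so releasing it is a plain put_page() with no slab lifetime
games:

	static struct page *linear_to_page(struct sk_buff *skb,
					   unsigned int offset,
					   unsigned int *len)
	{
		struct page *p = alloc_page(GFP_ATOMIC);

		if (!p)
			return NULL;

		*len = min_t(unsigned int, *len, PAGE_SIZE);
		memcpy(page_address(p), skb->data + offset, *len);
		return p;
	}

The pipe buffer would then carry NULL in ->private and be released with
put_page(), exactly like a frag page in the patch below.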

Cheers,
Jarek Poplawski Jan. 14, 2009, 8:53 a.m. UTC | #5
On Wed, Jan 14, 2009 at 07:26:30PM +1100, Herbert Xu wrote:
> On Tue, Jan 13, 2009 at 11:27:10PM -0800, David Miller wrote:
> > 
> > So while trying to figure out a sane way to fix this, I found
> > another bug:
> > 
> > 	/*
> > 	 * map the linear part
> > 	 */
> > 	if (__splice_segment(virt_to_page(skb->data),
> > 			     (unsigned long) skb->data & (PAGE_SIZE - 1),
> > 			     skb_headlen(skb),
> > 			     offset, len, skb, spd))
> > 		return 1;
> > 
> > This will explode if the SLAB cache for skb->head is using compound
> > (i.e. order > 0) pages.
> > 
> > For example, if an order-1 page is being used for the skb->head
> > data (which would be true on most systems for jumbo MTU frames being
> > received into a linear SKB), the offset will be wrong, and depending
> > upon skb_headlen() we could reference past the end of the
> > non-compound page we end up grabbing a reference to.
> 
> I'm actually not that worried about these packets, since these
> drivers should be converted to skb frags anyway; otherwise they'll
> probably stop working after a while due to memory fragmentation.
> 
> But yeah, for correctness we definitely should address this in
> skb_splice_bits.
> 
> I still think Jarek's approach (the copying one) is probably the
> easiest for now until we can find a better way.
> 

Actually, I still think my second approach (the PageSlab one) is
probably (if tested) the easiest for now, because it should fix the
reported (Willy's) problem without any change or copy overhead for
splice to file (which could still be wrong, but not obviously wrong).
Then we could search for the one right way (which is most probably
around Herbert's new skb page allocator). IMHO "my" copying approach
is too risky, e.g. for stable, because of unknown memory requirements,
especially for some larger page-size configs/systems.
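
As I read that second approach, the splice path would key off PageSlab()
to decide when zero-copy is safe (an untested sketch; helper name
hypothetical):

	/* Only a slab-backed head needs special handling; real pages
	 * (frags, page-based RX) can keep the zero-copy get_page(). */
	static bool skb_head_is_slab(const struct sk_buff *skb)
	{
		return PageSlab(virt_to_page(skb->data));
	}

i.e. copy the linear segment into a fresh page only when this returns
true, and splice everything else by reference as before.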

Jarek P.

Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5110b35..05126da 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -70,12 +70,17 @@ 
 static struct kmem_cache *skbuff_head_cache __read_mostly;
 static struct kmem_cache *skbuff_fclone_cache __read_mostly;
 
+static void skb_release_data(struct sk_buff *skb);
+
 static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
 				  struct pipe_buffer *buf)
 {
 	struct sk_buff *skb = (struct sk_buff *) buf->private;
 
-	kfree_skb(skb);
+	if (skb)
+		skb_release_data(skb);
+	else
+		put_page(buf->page);
 }
 
 static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
@@ -83,7 +88,10 @@  static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
 {
 	struct sk_buff *skb = (struct sk_buff *) buf->private;
 
-	skb_get(skb);
+	if (skb)
+		atomic_inc(&skb_shinfo(skb)->dataref);
+	else
+		get_page(buf->page);
 }
 
 static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
@@ -1336,7 +1344,10 @@  static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
 {
 	struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
 
-	kfree_skb(skb);
+	if (skb)
+		skb_release_data(skb);
+	else
+		put_page(spd->pages[i]);
 }
 
 /*
@@ -1344,7 +1355,7 @@  static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
  */
 static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
 				unsigned int len, unsigned int offset,
-				struct sk_buff *skb)
+				struct sk_buff *skb, int linear)
 {
 	if (unlikely(spd->nr_pages == PIPE_BUFFERS))
 		return 1;
@@ -1352,8 +1363,15 @@  static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
 	spd->pages[spd->nr_pages] = page;
 	spd->partial[spd->nr_pages].len = len;
 	spd->partial[spd->nr_pages].offset = offset;
-	spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
+	spd->partial[spd->nr_pages].private =
+		(unsigned long) (linear ? skb : NULL);
 	spd->nr_pages++;
+
+	if (linear)
+		atomic_inc(&skb_shinfo(skb)->dataref);
+	else
+		get_page(page);
+
 	return 0;
 }
 
@@ -1369,7 +1387,7 @@  static inline void __segment_seek(struct page **page, unsigned int *poff,
 static inline int __splice_segment(struct page *page, unsigned int poff,
 				   unsigned int plen, unsigned int *off,
 				   unsigned int *len, struct sk_buff *skb,
-				   struct splice_pipe_desc *spd)
+				   struct splice_pipe_desc *spd, int linear)
 {
 	if (!*len)
 		return 1;
@@ -1392,7 +1410,7 @@  static inline int __splice_segment(struct page *page, unsigned int poff,
 		/* the linear region may spread across several pages  */
 		flen = min_t(unsigned int, flen, PAGE_SIZE - poff);
 
-		if (spd_fill_page(spd, page, flen, poff, skb))
+		if (spd_fill_page(spd, page, flen, poff, skb, linear))
 			return 1;
 
 		__segment_seek(&page, &poff, &plen, flen);
@@ -1419,7 +1437,7 @@  static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
 	if (__splice_segment(virt_to_page(skb->data),
 			     (unsigned long) skb->data & (PAGE_SIZE - 1),
 			     skb_headlen(skb),
-			     offset, len, skb, spd))
+			     offset, len, skb, spd, 1))
 		return 1;
 
 	/*
@@ -1429,7 +1447,7 @@  static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
 		const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];
 
 		if (__splice_segment(f->page, f->page_offset, f->size,
-				     offset, len, skb, spd))
+				     offset, len, skb, spd, 0))
 			return 1;
 	}