[v3] tcp: splice as many packets as possible at once

Message ID 20090122090442.GB11139@ff.dom.local
State Accepted, archived
Delegated to: David Miller

Commit Message

Jarek Poplawski Jan. 22, 2009, 9:04 a.m. UTC
On Tue, Jan 20, 2009 at 09:16:16AM -0800, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Tue, 20 Jan 2009 11:01:44 +0000
> 
> > On Tue, Jan 20, 2009 at 01:31:22PM +0300, Evgeniy Polyakov wrote:
> > > On Tue, Jan 20, 2009 at 10:20:53AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> > > > Good question! Alas I can't check this soon, but if it's really like
> > > > this, of course this needs some better idea and rework. (BTW, I'd like
> > > > to prevent here as much as possible some strange activities like 1
> > > > byte (payload) packets getting full pages without any accounting.)
> > > 
> > > I believe approach to meet all our goals is to have own network memory
> > > allocator, so that each skb could have its payload in the fragments, we
> > > would not suffer from the heavy fragmentation and power-of-two overhead
> > > for the larger MTUs, have a reserve for the OOM condition and generally
> > > do not depend on the main system behaviour.
> > 
> > 100% right! But I guess we need this current fix for -stable, and I'm
> > a bit worried about safety.
> 
> Jarek, we already have a page and offset you can use.
> 
> It's called sk_sndmsg_page but that is just the (current) name.
> Nothing prevents you from reusing it for your purposes here.

It seems this sk_sndmsg_page usage (refcounting) isn't consistent.
I used here tcp_sendmsg() way, but I think I'll go back to this question
soon.

Thanks,
Jarek P.

------------> take 3

net: Optimize memory usage when splicing from sockets.

The recent fix of data corruption when splicing from sockets uses
memory very inefficiently, allocating a new page to copy each chunk of
the linear part of an skb. This patch reuses the same page until it's
(almost) full by caching the page in the sk_sndmsg_page field.

With changes from David S. Miller <davem@davemloft.net>

Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

Tested-by: needed...
---

 net/core/skbuff.c |   45 +++++++++++++++++++++++++++++++++++----------
 1 files changed, 35 insertions(+), 10 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
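
The page-caching idea from the commit message can be modelled in userspace. The following is only a minimal sketch under stated assumptions, not the kernel code: struct model_page, struct model_sock and model_linear_to_page() are hypothetical stand-ins for struct page, the sk_sndmsg_page/sk_sndmsg_off pair and linear_to_page(), the page size is assumed to be 4096 bytes, and the 64-byte threshold is taken from the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */
#define MODEL_MIN_TAIL  64u	/* a tail smaller than this may be abandoned */

/* Hypothetical stand-in for struct page: a buffer plus a reference count. */
struct model_page {
	unsigned int refs;
	unsigned char data[MODEL_PAGE_SIZE];
};

/* Hypothetical stand-in for the sk_sndmsg_page / sk_sndmsg_off pair. */
struct model_sock {
	struct model_page *cached;
	unsigned int off;
	unsigned int pages_allocated;
};

static void model_put_page(struct model_page *p)
{
	if (--p->refs == 0)
		free(p);
}

/*
 * Copy up to *len bytes of src into the cached page. On return, *len holds
 * the number of bytes actually copied and *offset where they start in the
 * page. The returned page carries one extra reference owned by the caller.
 */
static struct model_page *model_linear_to_page(struct model_sock *sk,
					       const unsigned char *src,
					       unsigned int *len,
					       unsigned int *offset)
{
	struct model_page *p = sk->cached;
	unsigned int room = p ? MODEL_PAGE_SIZE - sk->off : 0;

	if (p && room < MODEL_MIN_TAIL && room < *len) {
		/* Tail too small to be worth using: drop the cache reference. */
		model_put_page(p);
		sk->cached = p = NULL;
	}
	if (!p) {
		p = calloc(1, sizeof(*p));
		if (!p)
			return NULL;
		p->refs = 1;		/* reference held by the cache */
		sk->cached = p;
		sk->off = 0;
		sk->pages_allocated++;
		room = MODEL_PAGE_SIZE;
	}
	if (*len > room)
		*len = room;		/* caller learns about the short copy */

	memcpy(p->data + sk->off, src, *len);
	*offset = sk->off;
	sk->off += *len;
	p->refs++;			/* reference handed to the caller */
	return p;
}
```

In this model, two 100-byte chunks end up in the same page at offsets 0 and 100, where the previous approach would have allocated two pages.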

Comments

David Miller Jan. 26, 2009, 5:22 a.m. UTC | #1
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Thu, 22 Jan 2009 09:04:42 +0000

> It seems this sk_sndmsg_page usage (refcounting) isn't consistent.
> I used here tcp_sendmsg() way, but I think I'll go back to this question
> soon.

Indeed, it is something to look into, as well as locking.

I'll try to find some time for this, thanks Jarek.
Herbert Xu Jan. 27, 2009, 7:11 a.m. UTC | #2
David Miller <davem@davemloft.net> wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Thu, 22 Jan 2009 09:04:42 +0000
> 
>> It seems this sk_sndmsg_page usage (refcounting) isn't consistent.
>> I used here tcp_sendmsg() way, but I think I'll go back to this question
>> soon.
> 
> Indeed, it is something to look into, as well as locking.
> 
> I'll try to find some time for this, thanks Jarek.

After a quick look it seems to be OK to me.  The code in the patch
is called from tcp_splice_read, which holds the socket lock.  So as
long as the patch uses the usual TCP convention it should work.

Cheers,
Jarek Poplawski Jan. 27, 2009, 7:54 a.m. UTC | #3
On Tue, Jan 27, 2009 at 06:11:30PM +1100, Herbert Xu wrote:
> David Miller <davem@davemloft.net> wrote:
> > From: Jarek Poplawski <jarkao2@gmail.com>
> > Date: Thu, 22 Jan 2009 09:04:42 +0000
> > 
> >> It seems this sk_sndmsg_page usage (refcounting) isn't consistent.
> >> I used here tcp_sendmsg() way, but I think I'll go back to this question
> >> soon.
> > 
> > Indeed, it is something to look into, as well as locking.
> > 
> > I'll try to find some time for this, thanks Jarek.
> 
> After a quick look it seems to be OK to me.  The code in the patch
> is called from tcp_splice_read, which holds the socket lock.  So as
> long as the patch uses the usual TCP convention it should work.

Yes, but ip_append_data() (and skb_append_datato_frags() for
NETIF_F_UFO only, so currently not a problem), uses this differently,
and these pages in sk->sk_sndmsg_page could leak or be used after
kfree. (I didn't track locking in these other places).

Thanks,
Jarek P.
Herbert Xu Jan. 27, 2009, 10:09 a.m. UTC | #4
On Tue, Jan 27, 2009 at 07:54:18AM +0000, Jarek Poplawski wrote:
> 
> Yes, but ip_append_data() (and skb_append_datato_frags() for
> NETIF_F_UFO only, so currently not a problem), uses this differently,
> and these pages in sk->sk_sndmsg_page could leak or be used after
> kfree. (I didn't track locking in these other places).

It'll be freed when the socket is freed so that should be fine.

Cheers,
Jarek Poplawski Jan. 27, 2009, 10:35 a.m. UTC | #5
On Tue, Jan 27, 2009 at 09:09:58PM +1100, Herbert Xu wrote:
> On Tue, Jan 27, 2009 at 07:54:18AM +0000, Jarek Poplawski wrote:
> > 
> > Yes, but ip_append_data() (and skb_append_datato_frags() for
> > NETIF_F_UFO only, so currently not a problem), uses this differently,
> > and these pages in sk->sk_sndmsg_page could leak or be used after
> > kfree. (I didn't track locking in these other places).
> 
> It'll be freed when the socket is freed so that should be fine.
> 

I don't think so: these places can overwrite sk->sk_sndmsg_page left
after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new
pointer without put_page() (they only reference copied chunks and
expect auto freeing). On the other hand, if tcp_sendmsg() reads after
them it could use a pointer after the page is freed, I guess.

Cheers,
Jarek P.
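
The concern above comes down to two different ownership conventions for one cached pointer. The following userspace sketch is illustrative only: struct demo_page, struct demo_sock and the demo_* helpers are hypothetical and are not kernel APIs. It assumes that one code path treats the cached pointer as owning a reference while another simply overwrites it. Whether both paths can ever touch the same socket is a separate question.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct page; only the reference count matters. */
struct demo_page {
	unsigned int refs;
};

/* Hypothetical stand-in for the socket's sk_sndmsg_page field. */
struct demo_sock {
	struct demo_page *cached;
};

static unsigned int demo_live_pages;	/* pages allocated and not yet freed */

static struct demo_page *demo_alloc_page(void)
{
	struct demo_page *p = malloc(sizeof(*p));

	if (p) {
		p->refs = 1;
		demo_live_pages++;
	}
	return p;
}

static void demo_put_page(struct demo_page *p)
{
	if (--p->refs == 0) {
		free(p);
		demo_live_pages--;
	}
}

/* Convention 1: the cache owns a reference and drops it on replacement. */
static void demo_cache_replace(struct demo_sock *sk, struct demo_page *newp)
{
	if (sk->cached)
		demo_put_page(sk->cached);
	sk->cached = newp;
}

/* Convention 2: the cache is only a hint; the pointer is overwritten
 * without any put, so a reference owned by the cache would be lost. */
static void demo_cache_overwrite(struct demo_sock *sk, struct demo_page *newp)
{
	sk->cached = newp;
}
```

Mixing the two on one socket, that is, storing a page with demo_cache_replace() and then calling demo_cache_overwrite(), leaves the first page allocated with nothing pointing at it.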
Jarek Poplawski Jan. 27, 2009, 10:57 a.m. UTC | #6
On Tue, Jan 27, 2009 at 10:35:11AM +0000, Jarek Poplawski wrote:
> On Tue, Jan 27, 2009 at 09:09:58PM +1100, Herbert Xu wrote:
> > On Tue, Jan 27, 2009 at 07:54:18AM +0000, Jarek Poplawski wrote:
> > > 
> > > Yes, but ip_append_data() (and skb_append_datato_frags() for
> > > NETIF_F_UFO only, so currently not a problem), uses this differently,
> > > and these pages in sk->sk_sndmsg_page could leak or be used after
> > > kfree. (I didn't track locking in these other places).
> > 
> > It'll be freed when the socket is freed so that should be fine.
> > 
> 
> I don't think so: these places can overwrite sk->sk_sndmsg_page left
> after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new
> pointer without put_page() (they only reference copied chunks and
> expect auto freeing). On the other hand, if tcp_sendmsg() reads after
> them it could use a pointer after the page is freed, I guess.

tcp_v4_destroy_sock() looks vulnerable too.

BTW, skb_append_datato_frags() currently doesn't need to use this
sk->sk_sndmsg_page at all - it doesn't use caching between calls.

Jarek P.
Herbert Xu Jan. 27, 2009, 11:48 a.m. UTC | #7
On Tue, Jan 27, 2009 at 10:35:11AM +0000, Jarek Poplawski wrote:
>
> > > Yes, but ip_append_data() (and skb_append_datato_frags() for
> > > NETIF_F_UFO only, so currently not a problem), uses this differently,
> > > and these pages in sk->sk_sndmsg_page could leak or be used after
> > > kfree. (I didn't track locking in these other places).
> > 
> > It'll be freed when the socket is freed so that should be fine.
> 
> I don't think so: these places can overwrite sk->sk_sndmsg_page left
> after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new
> pointer without put_page() (they only reference copied chunks and
> expect auto freeing). On the other hand, if tcp_sendmsg() reads after
> them it could use a pointer after the page is freed, I guess.

I wasn't referring to the first part of your sentence.  That can't
happen because they're only used for UDP sockets, this is a TCP
socket.

Cheers,
Jarek Poplawski Jan. 27, 2009, 12:16 p.m. UTC | #8
On Tue, Jan 27, 2009 at 10:48:05PM +1100, Herbert Xu wrote:
> On Tue, Jan 27, 2009 at 10:35:11AM +0000, Jarek Poplawski wrote:
> >
> > > > Yes, but ip_append_data() (and skb_append_datato_frags() for
> > > > NETIF_F_UFO only, so currently not a problem), uses this differently,
> > > > and these pages in sk->sk_sndmsg_page could leak or be used after
> > > > kfree. (I didn't track locking in these other places).
> > > 
> > > It'll be freed when the socket is freed so that should be fine.
> > 
> > I don't think so: these places can overwrite sk->sk_sndmsg_page left
> > after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new
> > pointer without put_page() (they only reference copied chunks and
> > expect auto freeing). On the other hand, if tcp_sendmsg() reads after
> > them it could use a pointer after the page is freed, I guess.
> 
> I wasn't referring to the first part of your sentence.  That can't
> happen because they're only used for UDP sockets, this is a TCP
> socket.

Do you mean this part from ip_append_data() isn't used for TCP?:

	if (page && (left = PAGE_SIZE - off) > 0) {
		if (copy >= left)
			copy = left;
		if (page != frag->page) {
			if (i == MAX_SKB_FRAGS) {
				err = -EMSGSIZE;
				goto error;
			}
			get_page(page);
			skb_fill_page_desc(skb, i, page, sk->sk_sndmsg_off, 0);
			frag = &skb_shinfo(skb)->frags[i];
		}
	} else if (i < MAX_SKB_FRAGS) {
		if (copy > PAGE_SIZE)
			copy = PAGE_SIZE;
		page = alloc_pages(sk->sk_allocation, 0);
		if (page == NULL)  {
			err = -ENOMEM;
			goto error;
		}
		sk->sk_sndmsg_page = page;
		sk->sk_sndmsg_off = 0;

		skb_fill_page_desc(skb, i, page, 0, 0);
		frag = &skb_shinfo(skb)->frags[i];
	} else {

Jarek P.
Jarek Poplawski Jan. 27, 2009, 12:31 p.m. UTC | #9
On Tue, Jan 27, 2009 at 12:16:42PM +0000, Jarek Poplawski wrote:
> On Tue, Jan 27, 2009 at 10:48:05PM +1100, Herbert Xu wrote:
> > On Tue, Jan 27, 2009 at 10:35:11AM +0000, Jarek Poplawski wrote:
> > >
> > > > > Yes, but ip_append_data() (and skb_append_datato_frags() for
> > > > > NETIF_F_UFO only, so currently not a problem), uses this differently,
> > > > > and these pages in sk->sk_sndmsg_page could leak or be used after
> > > > > kfree. (I didn't track locking in these other places).
> > > > 
> > > > It'll be freed when the socket is freed so that should be fine.
> > > 
> > > I don't think so: these places can overwrite sk->sk_sndmsg_page left
> > > after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new
> > > pointer without put_page() (they only reference copied chunks and
> > > expect auto freeing). On the other hand, if tcp_sendmsg() reads after
> > > them it could use a pointer after the page is freed, I guess.
> > 
> > I wasn't referring to the first part of your sentence.  That can't
> > happen because they're only used for UDP sockets, this is a TCP
> > socket.
> 
> Do you mean this part from ip_append_data() isn't used for TCP?:

Actually, the beginning part of ip_append_data() should be enough too.
So I guess I missed your point...

Jarek P.
David Miller Jan. 27, 2009, 5:06 p.m. UTC | #10
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue, 27 Jan 2009 12:31:11 +0000

> On Tue, Jan 27, 2009 at 12:16:42PM +0000, Jarek Poplawski wrote:
> > On Tue, Jan 27, 2009 at 10:48:05PM +1100, Herbert Xu wrote:
> > > On Tue, Jan 27, 2009 at 10:35:11AM +0000, Jarek Poplawski wrote:
> > > >
> > > > > > Yes, but ip_append_data() (and skb_append_datato_frags() for
> > > > > > NETIF_F_UFO only, so currently not a problem), uses this differently,
> > > > > > and these pages in sk->sk_sndmsg_page could leak or be used after
> > > > > > kfree. (I didn't track locking in these other places).
> > > > > 
> > > > > It'll be freed when the socket is freed so that should be fine.
> > > > 
> > > > I don't think so: these places can overwrite sk->sk_sndmsg_page left
> > > > after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new
> > > > pointer without put_page() (they only reference copied chunks and
> > > > expect auto freeing). On the other hand, if tcp_sendmsg() reads after
> > > > them it could use a pointer after the page is freed, I guess.
> > > 
> > > I wasn't referring to the first part of your sentence.  That can't
> > > happen because they're only used for UDP sockets, this is a TCP
> > > socket.
> > 
> > Do you mean this part from ip_append_data() isn't used for TCP?:
> 
> Actually, the beginning part of ip_append_data() should be enough too.
> So I guess I missed your point...

TCP doesn't use ip_append_data(), period.
Jarek Poplawski Jan. 28, 2009, 8:10 a.m. UTC | #11
On Tue, Jan 27, 2009 at 09:06:51AM -0800, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Tue, 27 Jan 2009 12:31:11 +0000
> 
> > On Tue, Jan 27, 2009 at 12:16:42PM +0000, Jarek Poplawski wrote:
> > > On Tue, Jan 27, 2009 at 10:48:05PM +1100, Herbert Xu wrote:
> > > > On Tue, Jan 27, 2009 at 10:35:11AM +0000, Jarek Poplawski wrote:
> > > > >
> > > > > > > Yes, but ip_append_data() (and skb_append_datato_frags() for
> > > > > > > NETIF_F_UFO only, so currently not a problem), uses this differently,
> > > > > > > and these pages in sk->sk_sndmsg_page could leak or be used after
> > > > > > > kfree. (I didn't track locking in these other places).
> > > > > > 
> > > > > > It'll be freed when the socket is freed so that should be fine.
> > > > > 
> > > > > I don't think so: these places can overwrite sk->sk_sndmsg_page left
> > > > > after tcp_sendmsg(), or skb_splice_bits() now, with NULL or a new
> > > > > pointer without put_page() (they only reference copied chunks and
> > > > > expect auto freeing). On the other hand, if tcp_sendmsg() reads after
> > > > > them it could use a pointer after the page is freed, I guess.
> > > > 
> > > > I wasn't referring to the first part of your sentence.  That can't
> > > > happen because they're only used for UDP sockets, this is a TCP
> > > > socket.
> > > 
> > > Do you mean this part from ip_append_data() isn't used for TCP?:
> > 
> > Actually, the beginning part of ip_append_data() should be enough too.
> > So I guess I missed your point...
> 
> TCP doesn't use ip_append_data(), period.

Hmm... I see: TCP does use ip_send_reply(), so ip_append_data() too,
but with a special socket.

Thanks for the explanations,
Jarek P.
David Miller Feb. 1, 2009, 8:41 a.m. UTC | #12
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 27 Jan 2009 18:11:30 +1100

> David Miller <davem@davemloft.net> wrote:
> > From: Jarek Poplawski <jarkao2@gmail.com>
> > Date: Thu, 22 Jan 2009 09:04:42 +0000
> > 
> >> It seems this sk_sndmsg_page usage (refcounting) isn't consistent.
> >> I used here tcp_sendmsg() way, but I think I'll go back to this question
> >> soon.
> > 
> > Indeed, it is something to look into, as well as locking.
> > 
> > I'll try to find some time for this, thanks Jarek.
> 
> After a quick look it seems to be OK to me.  The code in the patch
> is called from tcp_splice_read, which holds the socket lock.  So as
> long as the patch uses the usual TCP convention it should work.

I've tossed Jarek's patch into net-next-2.6, thanks everyone.
Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2e5f2ca..2e64c1b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1333,14 +1333,39 @@  static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
 	put_page(spd->pages[i]);
 }
 
-static inline struct page *linear_to_page(struct page *page, unsigned int len,
-					  unsigned int offset)
-{
-	struct page *p = alloc_pages(GFP_KERNEL, 0);
+static inline struct page *linear_to_page(struct page *page, unsigned int *len,
+					  unsigned int *offset,
+					  struct sk_buff *skb)
+{
+	struct sock *sk = skb->sk;
+	struct page *p = sk->sk_sndmsg_page;
+	unsigned int off;
+
+	if (!p) {
+new_page:
+		p = sk->sk_sndmsg_page = alloc_pages(sk->sk_allocation, 0);
+		if (!p)
+			return NULL;
 
-	if (!p)
-		return NULL;
-	memcpy(page_address(p) + offset, page_address(page) + offset, len);
+		off = sk->sk_sndmsg_off = 0;
+		/* hold one ref to this page until it's full */
+	} else {
+		unsigned int mlen;
+
+		off = sk->sk_sndmsg_off;
+		mlen = PAGE_SIZE - off;
+		if (mlen < 64 && mlen < *len) {
+			put_page(p);
+			goto new_page;
+		}
+
+		*len = min_t(unsigned int, *len, mlen);
+	}
+
+	memcpy(page_address(p) + off, page_address(page) + *offset, *len);
+	sk->sk_sndmsg_off += *len;
+	*offset = off;
+	get_page(p);
 
 	return p;
 }
@@ -1349,21 +1374,21 @@  static inline struct page *linear_to_page(struct page *page, unsigned int len,
  * Fill page/offset/length into spd, if it can hold more pages.
  */
 static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
-				unsigned int len, unsigned int offset,
+				unsigned int *len, unsigned int offset,
 				struct sk_buff *skb, int linear)
 {
 	if (unlikely(spd->nr_pages == PIPE_BUFFERS))
 		return 1;
 
 	if (linear) {
-		page = linear_to_page(page, len, offset);
+		page = linear_to_page(page, len, &offset, skb);
 		if (!page)
 			return 1;
 	} else
 		get_page(page);
 
 	spd->pages[spd->nr_pages] = page;
-	spd->partial[spd->nr_pages].len = len;
+	spd->partial[spd->nr_pages].len = *len;
 	spd->partial[spd->nr_pages].offset = offset;
 	spd->nr_pages++;
 
@@ -1405,7 +1430,7 @@  static inline int __splice_segment(struct page *page, unsigned int poff,
 		/* the linear region may spread across several pages  */
 		flen = min_t(unsigned int, flen, PAGE_SIZE - poff);
 
-		if (spd_fill_page(spd, page, flen, poff, skb, linear))
+		if (spd_fill_page(spd, page, &flen, poff, skb, linear))
 			return 1;
 
 		__segment_seek(&page, &poff, &plen, flen);
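
The patch changes len to be passed by pointer so that the filler can shorten a copy and the caller then advances by what was actually consumed. The following is a rough userspace sketch of that calling pattern, not the kernel code: struct demo_spd, demo_fill() and demo_splice_segment() are hypothetical, DEMO_MAX_DESCS merely stands in for PIPE_BUFFERS, and the fixed room parameter is a simplification of the space left in the cached page.

```c
#include <assert.h>

#define DEMO_MAX_DESCS 16u	/* stands in for PIPE_BUFFERS */

struct demo_desc {
	unsigned int offset;
	unsigned int len;
};

struct demo_spd {
	struct demo_desc partial[DEMO_MAX_DESCS];
	unsigned int nr;
};

/*
 * Hypothetical filler: accepts at most `room` bytes per call and reports
 * through *len how many bytes it really took. Returns 1 when it cannot
 * accept anything more, 0 otherwise.
 */
static int demo_fill(struct demo_spd *spd, unsigned int *len,
		     unsigned int offset, unsigned int room)
{
	if (spd->nr == DEMO_MAX_DESCS || room == 0)
		return 1;
	if (*len > room)
		*len = room;

	spd->partial[spd->nr].offset = offset;
	spd->partial[spd->nr].len = *len;
	spd->nr++;
	return 0;
}

/* Walk plen bytes, seeking forward by whatever the filler consumed. */
static unsigned int demo_splice_segment(struct demo_spd *spd,
					unsigned int plen, unsigned int room)
{
	unsigned int poff = 0;

	while (plen) {
		unsigned int flen = plen;

		if (demo_fill(spd, &flen, poff, room))
			break;
		/* flen may now be smaller than requested */
		poff += flen;
		plen -= flen;
	}
	return poff;
}
```

With a by-value len, the caller would have advanced by the requested length and silently skipped the bytes that did not fit.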