Message ID | 20090114.012919.117682429.davem@davemloft.net |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote: > From: Jarek Poplawski <jarkao2@gmail.com> > Date: Wed, 14 Jan 2009 08:53:08 +0000 > > > Actually, I still think my second approach (the PageSlab) is probably > > (if tested) the easiest for now, because it should fix the reported > > (Willy's) problem, without any change or copy overhead for splice to > > file (which could be still wrong, but not obviously wrong). > > It's a simple fix, but as Herbert stated it leaves other ->sendpage() > implementations exposed to data corruption when the from side of the > pipe buffer is a socket. I don't think Herbert meant other ->sendpage() implementations, but I could miss something. > That, to me, is almost worse than a bad fix. > > It's definitely worse than a slower but full fix, which the copy > patch is. Sorry, I can't see how this patch could make sendpage worse. Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote: ... > Therefore what I'll likely do is push Jarek's copy based cure, > and meanwhile we can brainstorm some more on how to fix this > properly in the long term. > > So, I've put together a full commit message and Jarek's patch > below. One thing I notice is that the silly skb_clone() done > by SKB splicing is no longer necessary. > > We could get rid of that to offset (some) of the cost we are > adding with this bug fix. > > Comments? Yes, this should lessen the overhead a bit. > > net: Fix data corruption when splicing from sockets. > > From: Jarek Poplawski <jarkao2@gmail.com> > > The trick in socket splicing where we try to convert the skb->data > into a page based reference using virt_to_page() does not work so > well. > > The idea is to pass the virt_to_page() reference via the pipe > buffer, and refcount the buffer using a SKB reference. > > But if we are splicing from a socket to a socket (via sendpage) > this doesn't work. > > The from side processing will grab the page (and SKB) references. > The sendpage() calls will grab page references only, return, and > then the from side processing completes and drops the SKB ref. > > The page based reference to skb->data is not enough to keep the > kmalloc() buffer backing it from being reused. Yet, that is > all that the socket send side has at this point. > > This leads to data corruption if the skb->data buffer is reused > by SLAB before the send side socket actually gets the TX packet > out to the device. > > The fix employed here is to simply allocate a page and copy the > skb->data bytes into that page. > > This will hurt performance, but there is no clear way to fix this > properly without a copy at the present time, and it is important > to get rid of the data corruption. > > Signed-off-by: David S. Miller <davem@davemloft.net> You are very brave! I'd prefer to wait for at least minimal testing by Willy... Thanks, Jarek P. BTW, an skb parameter could be removed from spd_fill_page() to make it even faster... ... > static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page, > unsigned int len, unsigned int offset, > - struct sk_buff *skb) > + struct sk_buff *skb, int linear) > { > if (unlikely(spd->nr_pages == PIPE_BUFFERS)) > return 1; > > + if (linear) { > + page = linear_to_page(page, len, offset); > + if (!page) > + return 1; > + } > + > spd->pages[spd->nr_pages] = page; > spd->partial[spd->nr_pages].len = len; > spd->partial[spd->nr_pages].offset = offset; > - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb); > spd->nr_pages++; > + get_page(page); > + > return 0; > } > > @@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff, > static inline int __splice_segment(struct page *page, unsigned int poff, > unsigned int plen, unsigned int *off, > unsigned int *len, struct sk_buff *skb, > - struct splice_pipe_desc *spd) > + struct splice_pipe_desc *spd, int linear) > { > if (!*len) > return 1; > @@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff, > /* the linear region may spread across several pages */ > flen = min_t(unsigned int, flen, PAGE_SIZE - poff); > > - if (spd_fill_page(spd, page, flen, poff, skb)) > + if (spd_fill_page(spd, page, flen, poff, skb, linear)) > return 1; > > __segment_seek(&page, &poff, &plen, flen); > @@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset, > if (__splice_segment(virt_to_page(skb->data), > (unsigned long) skb->data & (PAGE_SIZE - 1), > skb_headlen(skb), > - offset, len, skb, spd)) > + offset, len, skb, spd, 1)) > return 1; > > /* > @@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset, > const skb_frag_t *f = &skb_shinfo(skb)->frags[seg]; > > if (__splice_segment(f->page, f->page_offset, f->size, > - offset, len, skb, spd)) > + offset, len, skb, spd, 0)) > return 1; > } > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 14, 2009 at 09:54:54AM +0000, Jarek Poplawski wrote: > You are very brave! I'd prefer to wait for at least minimal testing > by Willy... Will test, probably this evening. Thanks guys, Willy -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Jarek Poplawski <jarkao2@gmail.com> Date: Wed, 14 Jan 2009 09:42:16 +0000 > On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote: > > From: Jarek Poplawski <jarkao2@gmail.com> > > Date: Wed, 14 Jan 2009 08:53:08 +0000 > > > > > Actually, I still think my second approach (the PageSlab) is probably > > > (if tested) the easiest for now, because it should fix the reported > > > (Willy's) problem, without any change or copy overhead for splice to > > > file (which could be still wrong, but not obviously wrong). > > > > It's a simple fix, but as Herbert stated it leaves other ->sendpage() > > implementations exposed to data corruption when the from side of the > > pipe buffer is a socket. > > I don't think Herbert meant other ->sendpage() implementations, but I > could miss something. I think he did :-) Or, more generally, he could have been referring to splice pipe outputs. All of these things grab references to pages and expect that to keep the underlying data from being reallocated. That doesn't work for this skb->data case. > > That, to me, is almost worse than a bad fix. > > > > It's definitely worse than a slower but full fix, which the copy > > patch is. > > Sorry, I can't see how this patch could make sendpage worse. Because that patch only fixes TCP's ->sendpage() implementation. There are others out there which could end up experiencing similar data corruption. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 14, 2009 at 02:06:37AM -0800, David Miller wrote: > From: Jarek Poplawski <jarkao2@gmail.com> > Date: Wed, 14 Jan 2009 09:42:16 +0000 > > > On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote: > > > From: Jarek Poplawski <jarkao2@gmail.com> > > > Date: Wed, 14 Jan 2009 08:53:08 +0000 > > > > > > > Actually, I still think my second approach (the PageSlab) is probably > > > > (if tested) the easiest for now, because it should fix the reported > > > > (Willy's) problem, without any change or copy overhead for splice to > > > > file (which could be still wrong, but not obviously wrong). > > > > > > It's a simple fix, but as Herbert stated it leaves other ->sendpage() > > > implementations exposed to data corruption when the from side of the > > > pipe buffer is a socket. > > > > I don't think Herbert meant other ->sendpage() implementations, but I > > could miss something. > > I think he did :-) I hope Herbert will make it clear. > > Or, more generally, he could have been referring to splice pipe > outputs. All of these things grab references to pages and > expect that to keep the underlying data from being reallocated. > > That doesn't work for this skb->data case. > > > > That, to me, is almost worse than a bad fix. > > > > > > It's definitely worse than a slower but full fix, which the copy > > > patch is. > > > > Sorry, I can't see how this patch could make sendpage worse. > > Because that patch only fixes TCP's ->sendpage() implementation. Yes, it fixes (I hope) the only reported implementation. > > There are others out there which could end up experiencing similar > data corruption. Since this fix is very simple (and IMHO safe) it could be probably used elsewhere too, but I doubt we should care at the moment. Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote: > > It's a simple fix, but as Herbert stated it leaves other ->sendpage() > implementations exposed to data corruption when the from side of the > pipe buffer is a socket. > > That, to me, is almost worse than a bad fix. Yep, so far nobody has verified the disk path at all. So for all we know, if there is a delay on the way to disk, the exact same thing can occur. Besides, the PageSlab thing is going to copy for network to network anyway. > So, I've put together a full commit message and Jarek's patch > below. One thing I notice is that the silly skb_clone() done > by SKB splicing is no longer necessary. > > We could get rid of that to offset (some) of the cost we are > adding with this bug fix. > > Comments? Yes that's probably a good idea. Cheers,
On Wed, Jan 14, 2009 at 10:47:16AM +0000, Jarek Poplawski wrote: > > > > I don't think Herbert meant other ->sendpage() implementations, but I > > > could miss something. > > > > I think he did :-) > > I hope Herbert will make it clear. Yes I did mean the other splice paths, and in particular, the path to disk. I hope that's clear enough :) Cheers,
On Wed, Jan 14, 2009 at 10:29:03PM +1100, Herbert Xu wrote: > On Wed, Jan 14, 2009 at 10:47:16AM +0000, Jarek Poplawski wrote: > > > > > > I don't think Herbert meant other ->sendpage() implementations, but I > > > > could miss something. > > > > > > I think he did :-) > > > > I hope Herbert will make it clear. > > Yes I did mean the other splice paths, and in particular, the path > to disk. I hope that's clear enough :) So, I think I got it right: otherwise it would be enough to make this copy to a new page only before ->sendpage() calls e.g. in generic_splice_sendpage(). Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 14, 2009 at 11:40:59AM +0000, Jarek Poplawski wrote: > On Wed, Jan 14, 2009 at 10:29:03PM +1100, Herbert Xu wrote: > > On Wed, Jan 14, 2009 at 10:47:16AM +0000, Jarek Poplawski wrote: > > > > > > > > I don't think Herbert meant other ->sendpage() implementations, but I > > > > > could miss something. > > > > > > > > I think he did :-) > > > > > > I hope Herbert will make it clear. > > > > Yes I did mean the other splice paths, and in particular, the path > > to disk. I hope that's clear enough :) > > So, I think I got it right: otherwise it would be enough to make this > copy to a new page only before ->sendpage() calls e.g. in > generic_splice_sendpage(). Hmm... Actually in pipe_to_sendpage(). Jarek P. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 14, 2009 at 09:54:54AM +0000, Jarek Poplawski wrote: > On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote: ... > > net: Fix data corruption when splicing from sockets. > > > > From: Jarek Poplawski <jarkao2@gmail.com> ... > > > > Signed-off-by: David S. Miller <davem@davemloft.net> > > You are very brave! I'd prefer to wait for at least minimal testing > by Willy... On the other hand, I can't waste your trust like this (plus I'm not suicidal), so a little update: Based on a review by Changli Gao <xiaosuo@gmail.com>: http://lkml.org/lkml/2008/2/26/210 Foreseen-by: Changli Gao <xiaosuo@gmail.com> Diagnosed-by: Willy Tarreau <w@1wt.eu> Reported-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> > > Thanks, > Jarek P. > > BTW, an skb parameter could be removed from spd_fill_page() to make it > even faster... > > ... > > static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page, > > unsigned int len, unsigned int offset, > > - struct sk_buff *skb) > > + struct sk_buff *skb, int linear) > > { > > if (unlikely(spd->nr_pages == PIPE_BUFFERS)) > > return 1; > > > > + if (linear) { > > + page = linear_to_page(page, len, offset); > > + if (!page) > > + return 1; > > + } > > + > > spd->pages[spd->nr_pages] = page; > > spd->partial[spd->nr_pages].len = len; > > spd->partial[spd->nr_pages].offset = offset; > > - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb); > > spd->nr_pages++; > > + get_page(page); > > + > > return 0; > > } > > > > @@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff, > > static inline int __splice_segment(struct page *page, unsigned int poff, > > unsigned int plen, unsigned int *off, > > unsigned int *len, struct sk_buff *skb, > > - struct splice_pipe_desc *spd) > > + struct splice_pipe_desc *spd, int linear) > > { > > if (!*len) > > return 1; > > @@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff, > > /* the linear region may spread across several pages */ > > flen = min_t(unsigned int, flen, PAGE_SIZE - poff); > > > > - if (spd_fill_page(spd, page, flen, poff, skb)) > > + if (spd_fill_page(spd, page, flen, poff, skb, linear)) > > return 1; > > > > __segment_seek(&page, &poff, &plen, flen); > > @@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset, > > if (__splice_segment(virt_to_page(skb->data), > > (unsigned long) skb->data & (PAGE_SIZE - 1), > > skb_headlen(skb), > > - offset, len, skb, spd)) > > + offset, len, skb, spd, 1)) > > return 1; > > > > /* > > @@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset, > > const skb_frag_t *f = &skb_shinfo(skb)->frags[seg]; > > > > if (__splice_segment(f->page, f->page_offset, f->size, > > - offset, len, skb, spd)) > > + offset, len, skb, spd, 0)) > > return 1; > > } > > > > -- > > To unsubscribe from this list: send the line "unsubscribe netdev" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Sorry: take 2: On Wed, Jan 14, 2009 at 09:54:54AM +0000, Jarek Poplawski wrote: > On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote: ... > > net: Fix data corruption when splicing from sockets. > > > > From: Jarek Poplawski <jarkao2@gmail.com> ... > > > > Signed-off-by: David S. Miller <davem@davemloft.net> > > You are very brave! I'd prefer to wait for at least minimal testing > by Willy... On the other hand, I can't waste your trust like this (plus I'm not suicidal), so a little update: Based on a review by Changli Gao <xiaosuo@gmail.com>: http://lkml.org/lkml/2008/2/26/210 Foreseen-by: Changli Gao <xiaosuo@gmail.com> Diagnosed-by: Willy Tarreau <w@1wt.eu> Reported-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com> Fixed-by: Jens Axboe <jens.axboe@oracle.com> > > Thanks, > Jarek P. > > BTW, an skb parameter could be removed from spd_fill_page() to make it > even faster... > > ... > > static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page, > > unsigned int len, unsigned int offset, > > - struct sk_buff *skb) > > + struct sk_buff *skb, int linear) > > { > > if (unlikely(spd->nr_pages == PIPE_BUFFERS)) > > return 1; > > > > + if (linear) { > > + page = linear_to_page(page, len, offset); > > + if (!page) > > + return 1; > > + } > > + > > spd->pages[spd->nr_pages] = page; > > spd->partial[spd->nr_pages].len = len; > > spd->partial[spd->nr_pages].offset = offset; > > - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb); > > spd->nr_pages++; > > + get_page(page); > > + > > return 0; > > } > > > > @@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff, > > static inline int __splice_segment(struct page *page, unsigned int poff, > > unsigned int plen, unsigned int *off, > > unsigned int *len, struct sk_buff *skb, > > - struct splice_pipe_desc *spd) > > + struct splice_pipe_desc *spd, int linear) > > { > > if (!*len) > > return 1; > > @@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff, > > /* the linear region may spread across several pages */ > > flen = min_t(unsigned int, flen, PAGE_SIZE - poff); > > > > - if (spd_fill_page(spd, page, flen, poff, skb)) > > + if (spd_fill_page(spd, page, flen, poff, skb, linear)) > > return 1; > > > > __segment_seek(&page, &poff, &plen, flen); > > @@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset, > > if (__splice_segment(virt_to_page(skb->data), > > (unsigned long) skb->data & (PAGE_SIZE - 1), > > skb_headlen(skb), > > - offset, len, skb, spd)) > > + offset, len, skb, spd, 1)) > > return 1; > > > > /* > > @@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset, > > const skb_frag_t *f = &skb_shinfo(skb)->frags[seg]; > > > > if (__splice_segment(f->page, f->page_offset, f->size, > > - offset, len, skb, spd)) > > + offset, len, skb, spd, 0)) > > return 1; > > } > > > > -- > > To unsubscribe from this list: send the line "unsubscribe netdev" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Jan 14, 2009 at 01:29:19AM -0800, David Miller wrote: > Therefore what I'll likely do is push Jarek's copy based cure, > and meanwhile we can brainstorm some more on how to fix this > properly in the long term. > > So, I've put together a full commit message and Jarek's patch > below. One thing I notice is that the silly skb_clone() done > by SKB splicing is no longer necessary. > > We could get rid of that to offset (some) of the cost we are > adding with this bug fix. > > Comments? David, please don't merge it as-is. I've just tried it and got an OOM in __alloc_pages_internal after a few seconds of data transfer. I'm leaving the patch below for comments, maybe someone will spot something ? Don't we need at least one kfree() somewhere to match alloc_pages() ? Regards, Willy -- > > net: Fix data corruption when splicing from sockets. > > From: Jarek Poplawski <jarkao2@gmail.com> > > The trick in socket splicing where we try to convert the skb->data > into a page based reference using virt_to_page() does not work so > well. > > The idea is to pass the virt_to_page() reference via the pipe > buffer, and refcount the buffer using a SKB reference. > > But if we are splicing from a socket to a socket (via sendpage) > this doesn't work. > > The from side processing will grab the page (and SKB) references. > The sendpage() calls will grab page references only, return, and > then the from side processing completes and drops the SKB ref. > > The page based reference to skb->data is not enough to keep the > kmalloc() buffer backing it from being reused. Yet, that is > all that the socket send side has at this point. > > This leads to data corruption if the skb->data buffer is reused > by SLAB before the send side socket actually gets the TX packet > out to the device. > > The fix employed here is to simply allocate a page and copy the > skb->data bytes into that page. > > This will hurt performance, but there is no clear way to fix this > properly without a copy at the present time, and it is important > to get rid of the data corruption. > > Signed-off-by: David S. Miller <davem@davemloft.net> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 5110b35..6e43d52 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly; > static void sock_pipe_buf_release(struct pipe_inode_info *pipe, > struct pipe_buffer *buf) > { > - struct sk_buff *skb = (struct sk_buff *) buf->private; > - > - kfree_skb(skb); > + put_page(buf->page); > } > > static void sock_pipe_buf_get(struct pipe_inode_info *pipe, > struct pipe_buffer *buf) > { > - struct sk_buff *skb = (struct sk_buff *) buf->private; > - > - skb_get(skb); > + get_page(buf->page); > } > > static int sock_pipe_buf_steal(struct pipe_inode_info *pipe, > @@ -1334,9 +1330,19 @@ fault: > */ > static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i) > { > - struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private; > + put_page(spd->pages[i]); > +} > > - kfree_skb(skb); > +static inline struct page *linear_to_page(struct page *page, unsigned int len, > + unsigned int offset) > +{ > + struct page *p = alloc_pages(GFP_KERNEL, 0); > + > + if (!p) > + return NULL; > + memcpy(page_address(p) + offset, page_address(page) + offset, len); > + > + return p; > } > > /* > @@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i) > */ > static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page, > unsigned int len, unsigned int offset, > - struct sk_buff *skb) > + struct sk_buff *skb, int linear) > { > if (unlikely(spd->nr_pages == PIPE_BUFFERS)) > return 1; > > + if (linear) { > + page = linear_to_page(page, len, offset); > + if (!page) > + return 1; > + } > + > spd->pages[spd->nr_pages] = page; > spd->partial[spd->nr_pages].len = len; > spd->partial[spd->nr_pages].offset = offset; > - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb); > spd->nr_pages++; > + get_page(page); > + > return 0; > } > > @@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff, > static inline int __splice_segment(struct page *page, unsigned int poff, > unsigned int plen, unsigned int *off, > unsigned int *len, struct sk_buff *skb, > - struct splice_pipe_desc *spd) > + struct splice_pipe_desc *spd, int linear) > { > if (!*len) > return 1; > @@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff, > /* the linear region may spread across several pages */ > flen = min_t(unsigned int, flen, PAGE_SIZE - poff); > > - if (spd_fill_page(spd, page, flen, poff, skb)) > + if (spd_fill_page(spd, page, flen, poff, skb, linear)) > return 1; > > __segment_seek(&page, &poff, &plen, flen); > @@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset, > if (__splice_segment(virt_to_page(skb->data), > (unsigned long) skb->data & (PAGE_SIZE - 1), > skb_headlen(skb), > - offset, len, skb, spd)) > + offset, len, skb, spd, 1)) > return 1; > > /* > @@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset, > const skb_frag_t *f = &skb_shinfo(skb)->frags[seg]; > > if (__splice_segment(f->page, f->page_offset, f->size, > - offset, len, skb, spd)) > + offset, len, skb, spd, 0)) > return 1; > } > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Willy Tarreau <w@1wt.eu> Date: Fri, 16 Jan 2009 00:03:31 +0100 > please don't merge it as-is. I've just tried it and got an OOM > in __alloc_pages_internal after a few seconds of data transfer. > > I'm leaving the patch below for comments, maybe someone will spot > something ? Don't we need at least one kfree() somewhere to match > alloc_pages() ? What is needed is a put_page(). But that should be happening as a result of the splice callback. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jan 16, 2009 at 12:03:31AM +0100, Willy Tarreau wrote: > > I'm leaving the patch below for comments, maybe someone will spot > something ? Don't we need at least one kfree() somewhere to match > alloc_pages() ? Indeed. > > static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page, > > unsigned int len, unsigned int offset, > > - struct sk_buff *skb) > > + struct sk_buff *skb, int linear) > > { > > if (unlikely(spd->nr_pages == PIPE_BUFFERS)) > > return 1; > > > > + if (linear) { > > + page = linear_to_page(page, len, offset); > > + if (!page) > > + return 1; > > + } > > + > > spd->pages[spd->nr_pages] = page; > > spd->partial[spd->nr_pages].len = len; > > spd->partial[spd->nr_pages].offset = offset; > > - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb); > > spd->nr_pages++; > > + get_page(page); This get_page needs to be moved into an else clause of the previous if block. Cheers,
On Fri, Jan 16, 2009 at 10:19:35AM +1100, Herbert Xu wrote: > On Fri, Jan 16, 2009 at 12:03:31AM +0100, Willy Tarreau wrote: > > > > I'm leaving the patch below for comments, maybe someone will spot > > something ? Don't we need at least one kfree() somewhere to match > > alloc_pages() ? > > Indeed. > > > > static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page, > > > unsigned int len, unsigned int offset, > > > - struct sk_buff *skb) > > > + struct sk_buff *skb, int linear) > > > { > > > if (unlikely(spd->nr_pages == PIPE_BUFFERS)) > > > return 1; > > > > > > + if (linear) { > > > + page = linear_to_page(page, len, offset); > > > + if (!page) > > > + return 1; > > > + } > > > + > > > spd->pages[spd->nr_pages] = page; > > > spd->partial[spd->nr_pages].len = len; > > > spd->partial[spd->nr_pages].offset = offset; > > > - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb); > > > spd->nr_pages++; > > > + get_page(page); > > This get_page needs to be moved into an else clause of the previous > if block. Good catch Herbert, it's working fine now. The performance I get (30.6 MB/s) is between the original splice version (24.1) and the recently optimized one (35.7). That's not that bad considering we're copying the data. Cheers, Willy -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Willy Tarreau <w@1wt.eu> Date: Fri, 16 Jan 2009 00:32:20 +0100 > Good catch Herbert, it's working fine now. The performance I get > (30.6 MB/s) is between the original splice version (24.1) and the > recently optimized one (35.7). That's not that bad considering > we're copying the data. Willy, see how much my skb clone removal in the patch I just posted helps, if at all. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 5110b35..6e43d52 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -73,17 +73,13 @@ static struct kmem_cache *skbuff_fclone_cache __read_mostly; static void sock_pipe_buf_release(struct pipe_inode_info *pipe, struct pipe_buffer *buf) { - struct sk_buff *skb = (struct sk_buff *) buf->private; - - kfree_skb(skb); + put_page(buf->page); } static void sock_pipe_buf_get(struct pipe_inode_info *pipe, struct pipe_buffer *buf) { - struct sk_buff *skb = (struct sk_buff *) buf->private; - - skb_get(skb); + get_page(buf->page); } static int sock_pipe_buf_steal(struct pipe_inode_info *pipe, @@ -1334,9 +1330,19 @@ fault: */ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i) { - struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private; + put_page(spd->pages[i]); +} - kfree_skb(skb); +static inline struct page *linear_to_page(struct page *page, unsigned int len, + unsigned int offset) +{ + struct page *p = alloc_pages(GFP_KERNEL, 0); + + if (!p) + return NULL; + memcpy(page_address(p) + offset, page_address(page) + offset, len); + + return p; } /* @@ -1344,16 +1350,23 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i) */ static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page, unsigned int len, unsigned int offset, - struct sk_buff *skb) + struct sk_buff *skb, int linear) { if (unlikely(spd->nr_pages == PIPE_BUFFERS)) return 1; + if (linear) { + page = linear_to_page(page, len, offset); + if (!page) + return 1; + } + spd->pages[spd->nr_pages] = page; spd->partial[spd->nr_pages].len = len; spd->partial[spd->nr_pages].offset = offset; - spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb); spd->nr_pages++; + get_page(page); + return 0; } @@ -1369,7 +1382,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff, static inline int __splice_segment(struct page *page, unsigned int poff, unsigned int plen, unsigned int *off, unsigned int *len, struct sk_buff *skb, - struct splice_pipe_desc *spd) + struct splice_pipe_desc *spd, int linear) { if (!*len) return 1; @@ -1392,7 +1405,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff, /* the linear region may spread across several pages */ flen = min_t(unsigned int, flen, PAGE_SIZE - poff); - if (spd_fill_page(spd, page, flen, poff, skb)) + if (spd_fill_page(spd, page, flen, poff, skb, linear)) return 1; __segment_seek(&page, &poff, &plen, flen); @@ -1419,7 +1432,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset, if (__splice_segment(virt_to_page(skb->data), (unsigned long) skb->data & (PAGE_SIZE - 1), skb_headlen(skb), - offset, len, skb, spd)) + offset, len, skb, spd, 1)) return 1; /* @@ -1429,7 +1442,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset, const skb_frag_t *f = &skb_shinfo(skb)->frags[seg]; if (__splice_segment(f->page, f->page_offset, f->size, - offset, len, skb, spd)) + offset, len, skb, spd, 0)) return 1; }