Message ID | 20090113.163705.130074998.davem@davemloft.net
---|---
State | RFC, archived
Delegated to | David Miller
David Miller <davem@davemloft.net> wrote:
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 5110b35..05126da 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -70,12 +70,17 @@
>  static struct kmem_cache *skbuff_head_cache __read_mostly;
>  static struct kmem_cache *skbuff_fclone_cache __read_mostly;
>
> +static void skb_release_data(struct sk_buff *skb);
> +
>  static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
> 				  struct pipe_buffer *buf)
>  {
> 	struct sk_buff *skb = (struct sk_buff *) buf->private;
>
> -	kfree_skb(skb);
> +	if (skb)
> +		skb_release_data(skb);
> +	else
> +		put_page(buf->page);
>  }

Unfortunately this won't work, not even for network destinations.

The reason is that this gets called as soon as the destination's
splice hook returns; for networking, that means when sendpage
returns.

So by that time we'll still be left with just a page reference on a
page where the slab memory may already have been freed.

To make this work we need to get the destination's splice hooks to
acquire this reference.

Cheers,
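[Editorial note: the lifetime rule Herbert describes — the pipe drops its reference the moment the destination's splice hook returns, so the destination must take its own reference if the send is asynchronous — can be sketched in plain userspace C. All names below are illustrative stand-ins, not kernel API.]

	#include <assert.h>

	/* Toy refcounted buffer standing in for an skb's linear data. */
	struct buf {
		int refs;
		int freed;	/* set when the last reference is dropped */
	};

	static void buf_get(struct buf *b) { b->refs++; }
	static void buf_put(struct buf *b) { if (--b->refs == 0) b->freed = 1; }

	/* Destination splice hook: must take its OWN reference, because
	 * the pipe-side release fires as soon as this function returns. */
	static void dest_splice_hook(struct buf *b, struct buf **inflight)
	{
		buf_get(b);		/* the fix Herbert is asking for */
		*inflight = b;		/* async send still holds the data */
	}

	int main(void)
	{
		struct buf b = { .refs = 1, .freed = 0 }; /* pipe's reference */
		struct buf *inflight = NULL;

		dest_splice_hook(&b, &inflight);
		buf_put(&b);		/* pipe buffer release after hook returns */
		assert(!b.freed);	/* data still alive for the async send */

		buf_put(inflight);	/* async completion drops the last ref */
		assert(b.freed);
		return 0;
	}

Without the buf_get() inside the hook, the first buf_put() would free the data while the send is still in flight — exactly the slab-already-freed scenario described above.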
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 14 Jan 2009 14:51:24 +1100

> Unfortunately this won't work, not even for network destinations.
>
> The reason is that this gets called as soon as the destination's
> splice hook returns, for networking that means when sendpage returns.
>
> So by that time we'll still be left with just a page reference
> on a page where the slab memory may already have been freed.
>
> To make this work we need to get the destination's splice hooks
> to acquire this reference.

Yes, I realized this after your earlier posting today.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 14 Jan 2009 14:51:24 +1100

> Unfortunately this won't work, not even for network destinations.
>
> The reason is that this gets called as soon as the destination's
> splice hook returns, for networking that means when sendpage returns.
>
> So by that time we'll still be left with just a page reference
> on a page where the slab memory may already have been freed.
>
> To make this work we need to get the destination's splice hooks
> to acquire this reference.

So while trying to figure out a sane way to fix this, I found another
bug:

	/*
	 * map the linear part
	 */
	if (__splice_segment(virt_to_page(skb->data),
			     (unsigned long) skb->data & (PAGE_SIZE - 1),
			     skb_headlen(skb),
			     offset, len, skb, spd))
		return 1;

This will explode if the SLAB cache for skb->head is using compound
(ie. order > 0) pages.

For example, if an order-1 page is being used for the skb->head data
(which would be true on most systems for jumbo-MTU frames received
into a linear SKB), the offset will be wrong, and depending upon
skb_headlen() we could reference past the end of the non-compound
page we end up grabbing a reference to.

And then we'll end up with a compound page in an skb_shinfo() frag
array, which is illegal. Well, at least, I can list several drivers
that will barf when trying to TX that (Acenic, atlx1, cassini, jme,
sungem), since they use pci_map_page(... virt_to_page(skb->data)) or
similar.

The core KMAP'ing support for SKBs will also not be able to grok such
a beastly SKB.
On Tue, Jan 13, 2009 at 11:27:10PM -0800, David Miller wrote:
>
> So while trying to figure out a sane way to fix this, I found
> another bug:
>
> 	/*
> 	 * map the linear part
> 	 */
> 	if (__splice_segment(virt_to_page(skb->data),
> 			     (unsigned long) skb->data & (PAGE_SIZE - 1),
> 			     skb_headlen(skb),
> 			     offset, len, skb, spd))
> 		return 1;
>
> This will explode if the SLAB cache for skb->head is using compound
> (ie. order > 0) pages.
>
> For example, if this is an order-1 page being used for the skb->head
> data (which would be true on most systems for jumbo MTU frames being
> received into a linear SKB), the offset will be wrong and depending
> upon skb_headlen() we could reference past the end of that
> non-compound page we will end up grabbing a reference to.

I'm actually not worried so much about these packets, since those
drivers should be converted to skb frags anyway; otherwise they'll
probably stop working after a while due to memory fragmentation.

But yeah, for correctness we definitely should address this in
skb_splice_bits.

I still think Jarek's approach (the copying one) is probably the
easiest for now, until we can find a better way.

Cheers,
On Wed, Jan 14, 2009 at 07:26:30PM +1100, Herbert Xu wrote:
> On Tue, Jan 13, 2009 at 11:27:10PM -0800, David Miller wrote:
> >
> > So while trying to figure out a sane way to fix this, I found
> > another bug:
> >
> > 	/*
> > 	 * map the linear part
> > 	 */
> > 	if (__splice_segment(virt_to_page(skb->data),
> > 			     (unsigned long) skb->data & (PAGE_SIZE - 1),
> > 			     skb_headlen(skb),
> > 			     offset, len, skb, spd))
> > 		return 1;
> >
> > This will explode if the SLAB cache for skb->head is using compound
> > (ie. order > 0) pages.
> >
> > For example, if this is an order-1 page being used for the skb->head
> > data (which would be true on most systems for jumbo MTU frames being
> > received into a linear SKB), the offset will be wrong and depending
> > upon skb_headlen() we could reference past the end of that
> > non-compound page we will end up grabbing a reference to.
>
> I'm actually not worried so much about these packets since these
> drivers should be converted to skb frags as otherwise they'll
> probably stop working after a while due to memory fragmentation.
>
> But yeah for correctness we definitely should address this in
> skb_splice_bits.
>
> I still think Jarek's approach (the copying one) is probably the
> easiest for now until we can find a better way.

Actually, I still think my second approach (the PageSlab one) is
probably (if tested) the easiest for now, because it should fix the
reported problem (Willy's) without any change or copy overhead for
splice to file (which could still be wrong, but not obviously wrong).
Then we could search for the only right way (which is most probably
around Herbert's new skb page allocator).

IMHO "my" copying approach is too risky, e.g. for stable, because of
its unknown memory requirements, especially on configs/systems with
larger page sizes.

Jarek P.
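[Editorial note: the two approaches being weighed — copy slab-backed linear data into a private page versus just taking a page reference on real frag pages — can be sketched as one decision point. A userspace model, with fake_page and spd_grab as illustrative stand-ins for PageSlab()-gated logic in spd_fill_page(), not actual kernel code:]

	#include <assert.h>
	#include <stdlib.h>
	#include <string.h>

	#define PAGE_SIZE 4096

	/* Stand-in for the two cases spd_fill_page() must distinguish:
	 * slab-backed linear data (unsafe to hand to a pipe by reference)
	 * versus a real frag/page-cache page (safe to just get_page()). */
	struct fake_page {
		char data[PAGE_SIZE];
		int is_slab;	/* models PageSlab(page) */
		int refs;
	};

	/* Copy-on-splice: slab memory is duplicated into a private page
	 * the pipe owns outright; ordinary pages just gain a reference. */
	static struct fake_page *spd_grab(struct fake_page *p,
					  unsigned int off, unsigned int len)
	{
		if (p->is_slab) {
			struct fake_page *copy = calloc(1, sizeof(*copy));
			if (!copy)
				return NULL;
			memcpy(copy->data, p->data + off, len);
			copy->refs = 1;
			return copy;	/* caller owns the copy */
		}
		p->refs++;		/* models get_page(page) */
		return p;
	}

	int main(void)
	{
		struct fake_page slab = { .is_slab = 1, .refs = 1 };
		struct fake_page plain = { .is_slab = 0, .refs = 1 };
		struct fake_page *got;

		memcpy(slab.data, "payload", 8);
		got = spd_grab(&slab, 0, 8);
		assert(got != &slab);			/* private copy */
		assert(memcmp(got->data, "payload", 8) == 0);
		free(got);

		got = spd_grab(&plain, 0, 8);
		assert(got == &plain && plain.refs == 2); /* ref only */
		return 0;
	}

This is also why the memory cost of the copying approach is hard to bound: every spliced linear segment allocates a fresh page for the pipe's lifetime, which is the risk Jarek raises for stable.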
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5110b35..05126da 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -70,12 +70,17 @@
 static struct kmem_cache *skbuff_head_cache __read_mostly;
 static struct kmem_cache *skbuff_fclone_cache __read_mostly;
 
+static void skb_release_data(struct sk_buff *skb);
+
 static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
				  struct pipe_buffer *buf)
 {
	struct sk_buff *skb = (struct sk_buff *) buf->private;
 
-	kfree_skb(skb);
+	if (skb)
+		skb_release_data(skb);
+	else
+		put_page(buf->page);
 }
 
 static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
@@ -83,7 +88,10 @@ static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
 {
	struct sk_buff *skb = (struct sk_buff *) buf->private;
 
-	skb_get(skb);
+	if (skb)
+		atomic_inc(&skb_shinfo(skb)->dataref);
+	else
+		get_page(buf->page);
 }
 
 static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
@@ -1336,7 +1344,10 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
 {
	struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
 
-	kfree_skb(skb);
+	if (skb)
+		skb_release_data(skb);
+	else
+		put_page(spd->pages[i]);
 }
 
 /*
@@ -1344,7 +1355,7 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
  */
 static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
				unsigned int len, unsigned int offset,
-				struct sk_buff *skb)
+				struct sk_buff *skb, int linear)
 {
	if (unlikely(spd->nr_pages == PIPE_BUFFERS))
		return 1;
@@ -1352,8 +1363,15 @@ static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
	spd->pages[spd->nr_pages] = page;
	spd->partial[spd->nr_pages].len = len;
	spd->partial[spd->nr_pages].offset = offset;
-	spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
+	spd->partial[spd->nr_pages].private =
+		(unsigned long) (linear ? skb : NULL);
	spd->nr_pages++;
+
+	if (linear)
+		atomic_inc(&skb_shinfo(skb)->dataref);
+	else
+		get_page(page);
+
	return 0;
 }
 
@@ -1369,7 +1387,7 @@ static inline void __segment_seek(struct page **page, unsigned int *poff,
 static inline int __splice_segment(struct page *page, unsigned int poff,
				   unsigned int plen, unsigned int *off,
				   unsigned int *len, struct sk_buff *skb,
-				   struct splice_pipe_desc *spd)
+				   struct splice_pipe_desc *spd, int linear)
 {
	if (!*len)
		return 1;
@@ -1392,7 +1410,7 @@ static inline int __splice_segment(struct page *page, unsigned int poff,
		/* the linear region may spread across several pages */
		flen = min_t(unsigned int, flen, PAGE_SIZE - poff);
 
-		if (spd_fill_page(spd, page, flen, poff, skb))
+		if (spd_fill_page(spd, page, flen, poff, skb, linear))
			return 1;
 
		__segment_seek(&page, &poff, &plen, flen);
@@ -1419,7 +1437,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
	if (__splice_segment(virt_to_page(skb->data),
			     (unsigned long) skb->data & (PAGE_SIZE - 1),
			     skb_headlen(skb),
-			     offset, len, skb, spd))
+			     offset, len, skb, spd, 1))
		return 1;
 
	/*
@@ -1429,7 +1447,7 @@ static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
		const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];
 
		if (__splice_segment(f->page, f->page_offset, f->size,
-				     offset, len, skb, spd))
+				     offset, len, skb, spd, 0))
			return 1;
	}