
tcp: splice as many packets as possible at once

Message ID 20090115.152608.89323697.davem@davemloft.net
State Accepted, archived
Delegated to: David Miller

Commit Message

David Miller Jan. 15, 2009, 11:26 p.m. UTC
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 16 Jan 2009 10:19:35 +1100

> On Fri, Jan 16, 2009 at 12:03:31AM +0100, Willy Tarreau wrote:
> > > +	if (linear) {
> > > +		page = linear_to_page(page, len, offset);
> > > +		if (!page)
> > > +			return 1;
> > > +	}
> > > +
> > >  	spd->pages[spd->nr_pages] = page;
> > >  	spd->partial[spd->nr_pages].len = len;
> > >  	spd->partial[spd->nr_pages].offset = offset;
> > > -	spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
> > >  	spd->nr_pages++;
> > > +	get_page(page);
> 
> This get_page needs to be moved into an else clause of the previous
> if block.

Yep, good catch Herbert.
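(For reference, the corrected form, as it ends up in the final patch at the
bottom of this page, is:)

	if (linear) {
		page = linear_to_page(page, len, offset);
		if (!page)
			return 1;
	} else
		get_page(page);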

New patch, this has the SKB clone removal as well:

net: Fix data corruption when splicing from sockets.

From: Jarek Poplawski <jarkao2@gmail.com>

The trick in socket splicing where we try to convert the skb->data
into a page-based reference using virt_to_page() does not work so
well.

The idea is to pass the virt_to_page() reference via the pipe
buffer, and refcount the buffer using a SKB reference.

But if we are splicing from a socket to a socket (via sendpage)
this doesn't work.

The from side processing will grab the page (and SKB) references.
The sendpage() calls will grab page references only, return, and
then the from side processing completes and drops the SKB ref.

The page-based reference to skb->data is not enough to keep the
kmalloc() buffer backing it from being reused.  Yet, that is
all that the socket send side has at this point.

This leads to data corruption if the skb->data buffer is reused
by SLAB before the send side socket actually gets the TX packet
out to the device.
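
To put it another way, the broken lifetime assumption boils down to
roughly the following sketch (a simplification for illustration, not
the actual splice code):

	/* sketch only: a page reference does not pin the kmalloc()ed
	 * object living inside that page
	 */
	struct page *p = virt_to_page(skb->data);

	get_page(p);	/* send side ends up holding only this page ref... */
	kfree_skb(skb);	/* ...receive side drops its skb ref, so SLAB may
			 * now reuse the buffer inside 'p' */
	/* the TX path later reads from 'p' => possibly overwritten data */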

The fix employed here is to simply allocate a page and copy the
skb->data bytes into that page.

This will hurt performance, but there is no clear way to fix this
properly without a copy at the present time, and it is important
to get rid of the data corruption.

With fixes from Herbert Xu.

Signed-off-by: David S. Miller <davem@davemloft.net>

Comments

Herbert Xu Jan. 15, 2009, 11:32 p.m. UTC | #1
On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
>
> New patch, this has the SKB clone removal as well:

Thanks Dave!

Something else just came to mind though.

> +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> +					  unsigned int offset)
> +{
> +	struct page *p = alloc_pages(GFP_KERNEL, 0);
> +
> +	if (!p)
> +		return NULL;
> +	memcpy(page_address(p) + offset, page_address(page) + offset, len);

This won't work very well if skb->head is longer than a page.

We'll need to divide it up into individual pages.

Cheers,
David Miller Jan. 15, 2009, 11:34 p.m. UTC | #2
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Fri, 16 Jan 2009 10:32:05 +1100

> On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
> > +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> > +					  unsigned int offset)
> > +{
> > +	struct page *p = alloc_pages(GFP_KERNEL, 0);
> > +
> > +	if (!p)
> > +		return NULL;
> > +	memcpy(page_address(p) + offset, page_address(page) + offset, len);
> 
> This won't work very well if skb->head is longer than a page.
> 
> We'll need to divide it up into individual pages.

Oh yes the same bug I pointed out the other day.

But Willy can test this patch as-is, since he is not using
jumbo frames in linear SKBs.
Willy Tarreau Jan. 15, 2009, 11:42 p.m. UTC | #3
On Thu, Jan 15, 2009 at 03:34:49PM -0800, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Fri, 16 Jan 2009 10:32:05 +1100
> 
> > On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
> > > +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> > > +					  unsigned int offset)
> > > +{
> > > +	struct page *p = alloc_pages(GFP_KERNEL, 0);
> > > +
> > > +	if (!p)
> > > +		return NULL;
> > > +	memcpy(page_address(p) + offset, page_address(page) + offset, len);
> > 
> > This won't work very well if skb->head is longer than a page.
> > 
> > We'll need to divide it up into individual pages.
> 
> Oh yes the same bug I pointed out the other day.
> 
> But Willy can test this patch as-is,

Hey, nice work Dave. +3% performance from your previous patch
(31.6 MB/s). It's going fine and stable here.

> since he is not using jumbo frames in linear SKBs.

If you're interested, this weekend I can do some tests on my
myri10ge NICs, which support LRO. I frequently observe 23 kB
packets there, and they also support jumbo frames. Those should
cover the case above.

I'm afraid that's all for me for this evening; I have to get some
sleep before going to work. If you want to cook up more patches,
I'll be able to do a bit of testing 5 hours from now.

Cheers!
Willy

Willy Tarreau Jan. 15, 2009, 11:44 p.m. UTC | #4
On Fri, Jan 16, 2009 at 12:42:55AM +0100, Willy Tarreau wrote:
> On Thu, Jan 15, 2009 at 03:34:49PM -0800, David Miller wrote:
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Date: Fri, 16 Jan 2009 10:32:05 +1100
> > 
> > > On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
> > > > +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> > > > +					  unsigned int offset)
> > > > +{
> > > > +	struct page *p = alloc_pages(GFP_KERNEL, 0);
> > > > +
> > > > +	if (!p)
> > > > +		return NULL;
> > > > +	memcpy(page_address(p) + offset, page_address(page) + offset, len);
> > > 
> > > This won't work very well if skb->head is longer than a page.
> > > 
> > > We'll need to divide it up into individual pages.
> > 
> > Oh yes the same bug I pointed out the other day.
> > 
> > But Willy can test this patch as-is,
> 
> Hey, nice work Dave. +3% performance from your previous patch
> (31.6 MB/s). It's going fine and stable here.

And BTW feel free to add my Tested-by if you want in case you merge
this fix.

Willy

David Miller Jan. 15, 2009, 11:54 p.m. UTC | #5
From: Willy Tarreau <w@1wt.eu>
Date: Fri, 16 Jan 2009 00:44:08 +0100

> And BTW feel free to add my Tested-by if you want in case you merge
> this fix.

Done, thanks Willy.
Jarek Poplawski Jan. 16, 2009, 6:51 a.m. UTC | #6
On Fri, Jan 16, 2009 at 12:44:08AM +0100, Willy Tarreau wrote:
> On Fri, Jan 16, 2009 at 12:42:55AM +0100, Willy Tarreau wrote:
> > On Thu, Jan 15, 2009 at 03:34:49PM -0800, David Miller wrote:
> > > From: Herbert Xu <herbert@gondor.apana.org.au>
> > > Date: Fri, 16 Jan 2009 10:32:05 +1100
> > > 
> > > > On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
> > > > > +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> > > > > +					  unsigned int offset)
> > > > > +{
> > > > > +	struct page *p = alloc_pages(GFP_KERNEL, 0);
> > > > > +
> > > > > +	if (!p)
> > > > > +		return NULL;
> > > > > +	memcpy(page_address(p) + offset, page_address(page) + offset, len);
> > > > 
> > > > This won't work very well if skb->head is longer than a page.
> > > > 
> > > > We'll need to divide it up into individual pages.
> > > 
> > > Oh yes the same bug I pointed out the other day.
> > > 
> > > But Willy can test this patch as-is,
> > 
> > Hey, nice work Dave. +3% performance from your previous patch
> > (31.6 MB/s). It's going fine and stable here.
> 
> And BTW feel free to add my Tested-by if you want in case you merge
> this fix.
> 
> Willy
> 

Herbert, good catch!

David, if it's not too late I think more credits are needed,
especially for Willy. He did "a bit" more than testing.

Alas, I can't see this problem with skb->head longer than a page. There
is even a comment about this in __splice_segment(), but I may be missing
something.

I'm more concerned with memory usage if these skbs are not acked for
some reason. Isn't there a possible DoS issue?

Thanks everybody,
Jarek P.
--------->
Based on a review by Changli Gao <xiaosuo@gmail.com>:
http://lkml.org/lkml/2008/2/26/210

Foreseen-by: Changli Gao <xiaosuo@gmail.com>
Diagnosed-by: Willy Tarreau <w@1wt.eu>
Reported-by: Willy Tarreau <w@1wt.eu>
Fixed-by: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
Willy Tarreau Jan. 19, 2009, 12:42 a.m. UTC | #7
Hi guys,

On Thu, Jan 15, 2009 at 03:54:34PM -0800, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Fri, 16 Jan 2009 00:44:08 +0100
> 
> > And BTW feel free to add my Tested-by if you want in case you merge
> > this fix.
> 
> Done, thanks Willy.

Just for the record, I've now re-integrated those changes in a test kernel
that I booted on my 10gig machines. I have updated my user-space code in
haproxy to run a new series of tests. Even though there is a memcpy(), the
results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs):

  - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
    (3.2 Gbps at 100% CPU without splice)

  - 9.2 Gbps at 50% CPU using MTU=1500 with LRO

  - 10 Gbps at 20% CPU using MTU=9000 without LRO (7 Gbps at 100% CPU without
    splice)

  - 10 Gbps at 15% CPU using MTU=9000 with LRO

These last ones are really impressive. While I had already observed such
performance on the Myri10GE with Tux, it's the first time I can reach that
level with so little CPU usage in haproxy!

So I think that the memcpy() workaround might be a non-issue for some time.
I agree it's not beautiful but it works pretty well for now.

The 3 patches I used on top of 2.6.27.10 were the fix to return 0 instead of
-EAGAIN on end of read, the one to process multiple skbs at once, and Dave's
last patch based on Jarek's workaround for the corruption issue.

Cheers,
Willy

Herbert Xu Jan. 19, 2009, 3:08 a.m. UTC | #8
On Mon, Jan 19, 2009 at 01:42:06AM +0100, Willy Tarreau wrote:
>
> Just for the record, I've now re-integrated those changes in a test kernel
> that I booted on my 10gig machines. I have updated my user-space code in
> haproxy to run a new series of tests. Eventhough there is a memcpy(), the
> results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) :
> 
>   - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
>     (3.2 Gbps at 100% CPU without splice)

One thing to note is that Myricom's driver probably uses page
frags which means that you're not actually triggering the copy.

Cheers,
David Miller Jan. 19, 2009, 3:27 a.m. UTC | #9
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 19 Jan 2009 14:08:44 +1100

> On Mon, Jan 19, 2009 at 01:42:06AM +0100, Willy Tarreau wrote:
> >
> > Just for the record, I've now re-integrated those changes in a test kernel
> > that I booted on my 10gig machines. I have updated my user-space code in
> > haproxy to run a new series of tests. Eventhough there is a memcpy(), the
> > results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) :
> > 
> >   - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
> >     (3.2 Gbps at 100% CPU without splice)
> 
> One thing to note is that Myricom's driver probably uses page
> frags which means that you're not actually triggering the copy.

Right.

And this is also the only reason why jumbo MTU worked :-)
David Miller Jan. 19, 2009, 3:28 a.m. UTC | #10
From: Willy Tarreau <w@1wt.eu>
Date: Mon, 19 Jan 2009 01:42:06 +0100

> Hi guys,
> 
> On Thu, Jan 15, 2009 at 03:54:34PM -0800, David Miller wrote:
> > From: Willy Tarreau <w@1wt.eu>
> > Date: Fri, 16 Jan 2009 00:44:08 +0100
> > 
> > > And BTW feel free to add my Tested-by if you want in case you merge
> > > this fix.
> > 
> > Done, thanks Willy.
> 
> Just for the record, I've now re-integrated those changes in a test kernel
> that I booted on my 10gig machines. I have updated my user-space code in
> haproxy to run a new series of tests. Eventhough there is a memcpy(), the
> results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) :
> 
>   - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
>     (3.2 Gbps at 100% CPU without splice)
> 
>   - 9.2 Gbps at 50% CPU using MTU=1500 with LRO
> 
>   - 10 Gbps at 20% CPU using MTU=9000 without LRO (7 Gbps at 100% CPU without
>     splice)
> 
>   - 10 Gbps at 15% CPU using MTU=9000 with LRO

Thanks for the numbers.

We can almost certainly do a lot better, so if you have the time and
can get some oprofile dumps for these various cases that would be
useful to us.

Thanks again.
David Miller Jan. 19, 2009, 6:08 a.m. UTC | #11
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Fri, 16 Jan 2009 06:51:08 +0000

> David, if it's not too late I think more credits are needed,
> especially for Willy. He did "a bit" more than testing.
 ...
> Foreseen-by: Changli Gao <xiaosuo@gmail.com>
> Diagnosed-by: Willy Tarreau <w@1wt.eu>
> Reported-by: Willy Tarreau <w@1wt.eu>
> Fixed-by: Jens Axboe <jens.axboe@oracle.com>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

I will be sure to add these before committing, thanks Jarek.
Willy Tarreau Jan. 19, 2009, 6:11 a.m. UTC | #12
On Sun, Jan 18, 2009 at 07:28:15PM -0800, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Mon, 19 Jan 2009 01:42:06 +0100
> 
> > Hi guys,
> > 
> > On Thu, Jan 15, 2009 at 03:54:34PM -0800, David Miller wrote:
> > > From: Willy Tarreau <w@1wt.eu>
> > > Date: Fri, 16 Jan 2009 00:44:08 +0100
> > > 
> > > > And BTW feel free to add my Tested-by if you want in case you merge
> > > > this fix.
> > > 
> > > Done, thanks Willy.
> > 
> > Just for the record, I've now re-integrated those changes in a test kernel
> > that I booted on my 10gig machines. I have updated my user-space code in
> > haproxy to run a new series of tests. Eventhough there is a memcpy(), the
> > results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) :
> > 
> >   - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
> >     (3.2 Gbps at 100% CPU without splice)
> > 
> >   - 9.2 Gbps at 50% CPU using MTU=1500 with LRO
> > 
> >   - 10 Gbps at 20% CPU using MTU=9000 without LRO (7 Gbps at 100% CPU without
> >     splice)
> > 
> >   - 10 Gbps at 15% CPU using MTU=9000 with LRO
> 
> Thanks for the numbers.
> 
> We can almost certainly do a lot better, so if you have the time and
> can get some oprofile dumps for these various cases that would be
> useful to us.

No problem, of course. It's just a matter of time, but if we can push
the numbers further, let's try.

Regards,
Willy

Willy Tarreau Jan. 19, 2009, 6:14 a.m. UTC | #13
On Sun, Jan 18, 2009 at 07:27:19PM -0800, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Mon, 19 Jan 2009 14:08:44 +1100
> 
> > On Mon, Jan 19, 2009 at 01:42:06AM +0100, Willy Tarreau wrote:
> > >
> > > Just for the record, I've now re-integrated those changes in a test kernel
> > > that I booted on my 10gig machines. I have updated my user-space code in
> > > haproxy to run a new series of tests. Eventhough there is a memcpy(), the
> > > results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) :
> > > 
> > >   - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
> > >     (3.2 Gbps at 100% CPU without splice)
> > 
> > One thing to note is that Myricom's driver probably uses page
> > frags which means that you're not actually triggering the copy.

So does this mean that the corruption problem should still be there for
such a driver? I'm asking before testing, because at these speeds,
validity tests are not that easy ;-)

> Right.
> 
> And this is also the only reason why jumbo MTU worked :-)

What should we expect from other drivers with jumbo frames? Hangs,
corruption, errors, packet loss ?

Thanks,
Willy

David Miller Jan. 19, 2009, 6:16 a.m. UTC | #14
From: David Miller <davem@davemloft.net>
Date: Thu, 15 Jan 2009 15:34:49 -0800 (PST)

> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Fri, 16 Jan 2009 10:32:05 +1100
> 
> > On Thu, Jan 15, 2009 at 03:26:08PM -0800, David Miller wrote:
> > > +static inline struct page *linear_to_page(struct page *page, unsigned int len,
> > > +					  unsigned int offset)
> > > +{
> > > +	struct page *p = alloc_pages(GFP_KERNEL, 0);
> > > +
> > > +	if (!p)
> > > +		return NULL;
> > > +	memcpy(page_address(p) + offset, page_address(page) + offset, len);
> > 
> > This won't work very well if skb->head is longer than a page.
> > 
> > We'll need to divide it up into individual pages.
> 
> Oh yes the same bug I pointed out the other day.
> 
> But Willy can test this patch as-is, since he is not using
> jumbo frames in linear SKBs.

Actually, Herbert, it turns out this case should be OK.

Look a level or two higher, at __splice_segment(), it even has a
comment :-)

--------------------
	do {
		unsigned int flen = min(*len, plen);

		/* the linear region may spread across several pages  */
		flen = min_t(unsigned int, flen, PAGE_SIZE - poff);

		if (spd_fill_page(spd, page, flen, poff, skb, linear))
			return 1;

		__segment_seek(&page, &poff, &plen, flen);
		*len -= flen;

	} while (*len && plen);
--------------------

That code should work and do what we need.
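
For example (numbers purely illustrative, assuming PAGE_SIZE == 4096):

	/* plen = 9000, poff = 100:
	 *   1st pass: flen = 3996  (rest of the first page)
	 *   2nd pass: flen = 4096  (one full page)
	 *   3rd pass: flen = 908   (the remainder)
	 * so spd_fill_page() never sees a chunk that crosses a page
	 * boundary.
	 */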
David Miller Jan. 19, 2009, 6:19 a.m. UTC | #15
From: Willy Tarreau <w@1wt.eu>
Date: Mon, 19 Jan 2009 07:14:20 +0100

> On Sun, Jan 18, 2009 at 07:27:19PM -0800, David Miller wrote:
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Date: Mon, 19 Jan 2009 14:08:44 +1100
> > 
> > > One thing to note is that Myricom's driver probably uses page
> > > frags which means that you're not actually triggering the copy.
> 
> So does this mean that the corruption problem should still there for
> such a driver ? I'm asking before testing, because at these speeds,
> validity tests are not that easy ;-)

It ought not to, but it seems that is the case where you
saw the original corruptions, so hmmm...

Actually, I see, the myri10ge driver does put up to
64 bytes of the initial packet into the linear area.
If the IPV4 + TCP headers are less than this, you will
hit the corruption case even with the myri10ge driver.

So everything checks out.

> > And this is also the only reason why jumbo MTU worked :-)
> 
> What should we expect from other drivers with jumbo frames ? Hangs,
> corruption, errors, packet loss ?

Upon recent review I think jumbo frames in such drivers should
actually be fine.
Willy Tarreau Jan. 19, 2009, 6:45 a.m. UTC | #16
On Sun, Jan 18, 2009 at 10:19:08PM -0800, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Mon, 19 Jan 2009 07:14:20 +0100
> 
> > On Sun, Jan 18, 2009 at 07:27:19PM -0800, David Miller wrote:
> > > From: Herbert Xu <herbert@gondor.apana.org.au>
> > > Date: Mon, 19 Jan 2009 14:08:44 +1100
> > > 
> > > > One thing to note is that Myricom's driver probably uses page
> > > > frags which means that you're not actually triggering the copy.
> > 
> > So does this mean that the corruption problem should still there for
> > such a driver ? I'm asking before testing, because at these speeds,
> > validity tests are not that easy ;-)
> 
> It ought not to, but it seems that is the case where you
> saw the original corruptions, so hmmm...
> 
> Actually, I see, the myri10ge driver does put up to
> 64 bytes of the initial packet into the linear area.
> If the IPV4 + TCP headers are less than this, you will
> hit the corruption case even with the myri10ge driver.
> 
> So everything checks out.

OK, so I will modify my tools in order to perform a few checks.

Thanks!
Willy

Jarek Poplawski Jan. 19, 2009, 8:40 a.m. UTC | #17
On Mon, Jan 19, 2009 at 07:14:20AM +0100, Willy Tarreau wrote:
> On Sun, Jan 18, 2009 at 07:27:19PM -0800, David Miller wrote:
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Date: Mon, 19 Jan 2009 14:08:44 +1100
> > 
> > > On Mon, Jan 19, 2009 at 01:42:06AM +0100, Willy Tarreau wrote:
> > > >
> > > > Just for the record, I've now re-integrated those changes in a test kernel
> > > > that I booted on my 10gig machines. I have updated my user-space code in
> > > > haproxy to run a new series of tests. Eventhough there is a memcpy(), the
> > > > results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) :
> > > > 
> > > >   - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
> > > >     (3.2 Gbps at 100% CPU without splice)
> > > 
> > > One thing to note is that Myricom's driver probably uses page
> > > frags which means that you're not actually triggering the copy.
> 
> So does this mean that the corruption problem should still there for
> such a driver ? I'm asking before testing, because at these speeds,
> validity tests are not that easy ;-)

I guess David meant the performance: there is not much change because
only a small part gets copied. The case hurt most should be jumbo
frames in linear-only skbs. But no corruption is expected.

Jarek P.
Herbert Xu Jan. 19, 2009, 10:19 a.m. UTC | #18
On Sun, Jan 18, 2009 at 10:19:08PM -0800, David Miller wrote:
> 
> Actually, I see, the myri10ge driver does put up to
> 64 bytes of the initial packet into the linear area.
> If the IPV4 + TCP headers are less than this, you will
> hit the corruption case even with the myri10ge driver.

I thought splice only mapped the payload areas, no?

So we should probably test with a non-paged driver to be totally
sure.

Cheers,
Herbert Xu Jan. 19, 2009, 10:20 a.m. UTC | #19
On Sun, Jan 18, 2009 at 10:16:03PM -0800, David Miller wrote:
> 
> Actually, Herbert, it turns out this case should be OK.
> 
> Look a level or two higher, at __splice_segment(), it even has a
> comment :-)

Aha, that should take care of it.

Thanks,
David Miller Jan. 19, 2009, 8:59 p.m. UTC | #20
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 19 Jan 2009 21:19:24 +1100

> On Sun, Jan 18, 2009 at 10:19:08PM -0800, David Miller wrote:
> > 
> > Actually, I see, the myri10ge driver does put up to
> > 64 bytes of the initial packet into the linear area.
> > If the IPV4 + TCP headers are less than this, you will
> > hit the corruption case even with the myri10ge driver.
> 
> I thought splice only mapped the payload areas, no?

And the difference between 64 and IPV4+TCP header len becomes the
payload, don't you see? :-)

myri10ge just pulls min(64, skb->len) bytes from the SKB frags into
the linear area, unconditionally.  So a small number of payload bytes
can in fact end up there.

Otherwise Willy could never have triggered this bug.
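
(For illustration only, the pattern described above looks roughly like the
sketch below; the helper name is made up and this is not the actual
myri10ge receive path:)

	static void rx_pull_prefix(struct sk_buff *skb)
	{
		/* copy a fixed-size prefix of the packet from the page
		 * frags into the kmalloc()ed linear area (skb->head)
		 */
		__pskb_pull_tail(skb, min_t(unsigned int, 64, skb->len));
	}

If the IPv4 + TCP headers take less than those 64 bytes, the tail of that
prefix is payload sitting in the linear area, which is exactly what the
linear_to_page() copy now protects.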
Herbert Xu Jan. 19, 2009, 9:24 p.m. UTC | #21
On Mon, Jan 19, 2009 at 12:59:41PM -0800, David Miller wrote:
>
> And the difference between 64 and IPV4+TCP header len becomes the
> payload, don't you see? :-)
> 
> myri10ge just pulls min(64, skb->len) bytes from the SKB frags into
> the linear area, unconditionally.  So a small number of payload bytes
> can in fact end up there.

Ah, I thought they were doing a precise division.  Yes, pulling
a constant length will trigger it if it's big enough.

Cheers,
Ben Mansell Jan. 20, 2009, 12:01 p.m. UTC | #22
Willy Tarreau wrote:
> Hi guys,
> 
> On Thu, Jan 15, 2009 at 03:54:34PM -0800, David Miller wrote:
>> From: Willy Tarreau <w@1wt.eu>
>> Date: Fri, 16 Jan 2009 00:44:08 +0100
>>
>>> And BTW feel free to add my Tested-by if you want in case you merge
>>> this fix.
>> Done, thanks Willy.
> 
> Just for the record, I've now re-integrated those changes in a test kernel
> that I booted on my 10gig machines. I have updated my user-space code in
> haproxy to run a new series of tests. Eventhough there is a memcpy(), the
> results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) :
> 
>   - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
>     (3.2 Gbps at 100% CPU without splice)
> 
>   - 9.2 Gbps at 50% CPU using MTU=1500 with LRO
> 
>   - 10 Gbps at 20% CPU using MTU=9000 without LRO (7 Gbps at 100% CPU without
>     splice)
> 
>   - 10 Gbps at 15% CPU using MTU=9000 with LRO
> 
> These last ones are really impressive. While I had already observed such
> performance on the Myri10GE with Tux, it's the first time I can reach that
> level with so little CPU usage in haproxy !
> 
> So I think that the memcpy() workaround might be a non-issue for some time.
> I agree it's not beautiful but it works pretty well for now.
> 
> The 3 patches I used on top of 2.6.27.10 were the fix to return 0 intead of
> -EAGAIN on end of read, the one to process multiple skbs at once, and Dave's
> last patch based on Jarek's workaround for the corruption issue.

I've also tested on the same three patches (against 2.6.27.2 here), and 
the patches appear to work just fine. I'm running a similar proxy 
benchmark test to Willy, on a machine with 4 gigabit NICs (2xtg3, 
2xforcedeth). splice is working OK now, although I get identical results 
when using splice() or read()/write(): 2.4 Gbps at 100% CPU (2% user, 
98% system).

I may be hitting a h/w limitation which prevents any higher throughput, 
but I'm a little surprised that splice() didn't use less CPU time. 
Anyway, the splice code is working which is the important part!


Ben

Evgeniy Polyakov Jan. 20, 2009, 12:11 p.m. UTC | #23
On Tue, Jan 20, 2009 at 12:01:13PM +0000, Ben Mansell (ben@zeus.com) wrote:
> I've also tested on the same three patches (against 2.6.27.2 here), and 
> the patches appear to work just fine. I'm running a similar proxy 
> benchmark test to Willy, on a machine with 4 gigabit NICs (2xtg3, 
> 2xforcedeth). splice is working OK now, although I get identical results 
> when using splice() or read()/write(): 2.4 Gbps at 100% CPU (2% user, 
> 98% system).

With small MTU or when driver does not support fragmented allocation
(iirc at least forcedeth does not) skb will contain all the data in the
linear part and thus will be copied in the kernel. read()/write() does
effectively the same, but in userspace.
This should only affect splice usage which involves socket->pipe data
transfer.
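
(As a side note, a rough way to express that distinction, with an
illustrative helper name rather than anything taken from the kernel:)

	/* sketch only: the splice path copies the linear part of each skb;
	 * when a driver builds skbs with no page frags at all, the linear
	 * part is the whole packet, so the whole payload gets copied, much
	 * as read()/write() would do from userspace.
	 */
	static inline bool skb_entirely_linear(const struct sk_buff *skb)
	{
		return !skb_is_nonlinear(skb);	/* i.e. skb->data_len == 0 */
	}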

> I may be hitting a h/w limitation which prevents any higher throughput, 
> but I'm a little surprised that splice() didn't use less CPU time. 
> Anyway, the splice code is working which is the important part!

Does splice without the patches (but with the performance improvement for
non-blocking splice) have the same performance? It does not copy data,
but you may hit the data corruption. If the performance is the same, this
may indeed be a HW limitation.
Ben Mansell Jan. 20, 2009, 1:43 p.m. UTC | #24
Evgeniy Polyakov wrote:
> On Tue, Jan 20, 2009 at 12:01:13PM +0000, Ben Mansell (ben@zeus.com) wrote:
>> I've also tested on the same three patches (against 2.6.27.2 here), and 
>> the patches appear to work just fine. I'm running a similar proxy 
>> benchmark test to Willy, on a machine with 4 gigabit NICs (2xtg3, 
>> 2xforcedeth). splice is working OK now, although I get identical results 
>> when using splice() or read()/write(): 2.4 Gbps at 100% CPU (2% user, 
>> 98% system).
> 
> With small MTU or when driver does not support fragmented allocation
> (iirc at least forcedeth does not) skb will contain all the data in the
> linear part and thus will be copied in the kernel. read()/write() does
> effectively the same, but in userspace.
> This should only affect splice usage which involves socket->pipe data
> transfer.

I'll try with some larger MTUs and see if that helps - it should also 
give an improvement if I'm hitting a limit on the number of 
packets/second that the cards can process, regardless of splice...

>> I may be hitting a h/w limitation which prevents any higher throughput, 
>> but I'm a little surprised that splice() didn't use less CPU time. 
>> Anyway, the splice code is working which is the important part!
> 
> Does splice without patches (but with performance improvement for
> non-blocking splice) has the same performance? It does not copy data,
> but you may hit the data corruption? If performance is the same, this
> maybe indeed HW limitation.

With an unpatched kernel, the splice performance was worse (due to the 
one packet per-splice issues). With the small patch to fix that, I was 
getting around 2 Gbps performance, although oddly enough, I could only 
get 2 Gbps with read()/write() then as well...

I'll try and do some tests on a machine that hopefully doesn't have the 
bottlenecks (and one that uses different NICs)


Ben
Jarek Poplawski Jan. 20, 2009, 2:06 p.m. UTC | #25
On Tue, Jan 20, 2009 at 01:43:52PM +0000, Ben Mansell wrote:
...
> With an unpatched kernel, the splice performance was worse (due to the  
> one packet per-splice issues). With the small patch to fix that, I was  
> getting around 2 Gbps performance, although oddly enough, I could only  
> get 2 Gbps with read()/write() then as well...
>
> I'll try and do some tests on a machine that hopefully doesn't have the  
> bottlenecks (and one that uses different NICs)

I guess you should especially check if SG and checksums are on, and it
could depend on the chip in those NICs as well.

Jarek P. 
Willy Tarreau Jan. 24, 2009, 9:23 p.m. UTC | #26
Hi David,

On Sun, Jan 18, 2009 at 07:28:15PM -0800, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Mon, 19 Jan 2009 01:42:06 +0100
> 
> > Hi guys,
> > 
> > On Thu, Jan 15, 2009 at 03:54:34PM -0800, David Miller wrote:
> > > From: Willy Tarreau <w@1wt.eu>
> > > Date: Fri, 16 Jan 2009 00:44:08 +0100
> > > 
> > > > And BTW feel free to add my Tested-by if you want in case you merge
> > > > this fix.
> > > 
> > > Done, thanks Willy.
> > 
> > Just for the record, I've now re-integrated those changes in a test kernel
> > that I booted on my 10gig machines. I have updated my user-space code in
> > haproxy to run a new series of tests. Eventhough there is a memcpy(), the
> > results are EXCELLENT (on a C2D 2.66 GHz using Myricom's Myri10GE NICs) :
> > 
> >   - 4.8 Gbps at 100% CPU using MTU=1500 without LRO
> >     (3.2 Gbps at 100% CPU without splice)
> > 
> >   - 9.2 Gbps at 50% CPU using MTU=1500 with LRO
> > 
> >   - 10 Gbps at 20% CPU using MTU=9000 without LRO (7 Gbps at 100% CPU without
> >     splice)
> > 
> >   - 10 Gbps at 15% CPU using MTU=9000 with LRO
> 
> Thanks for the numbers.
> 
> We can almost certainly do a lot better, so if you have the time and
> can get some oprofile dumps for these various cases that would be
> useful to us.

Well, I tried to get oprofile to work on those machines, but I'm surely
missing something, as "opreport" always complains :
  opreport error: No sample file found: try running opcontrol --dump
  or specify a session containing sample files

I've straced opcontrol --dump, opcontrol --stop, and I see nothing
creating any file in the samples directory. I thought it would be
opjitconv which would do it, but it's hard to debug, and I haven't
used oprofile for two or three years now. I followed exactly the instructions
in the kernel doc, as well as a howto found on the net, but I remain
out of luck. I just see a "complete_dump" file created with only two
bytes. It's not easy to debug on those machines as they're diskless
and PXE-booted from squashfs root images.

However, upon Ingo's suggestion I tried his perfcounters while
running a test at 5 Gbps with haproxy running alone on one core,
and IRQs on the other one. No LRO was used and MTU was 1500.

For instance, kerneltop tells how many CPU cycles are spent in each
function:

# kerneltop -e 0 -d 1 -c 1000000 -C 1

             events         RIP          kernel function
  ______     ______   ________________   _______________

            1864.00 - 00000000f87be000 : init_module    [myri10ge]
            1078.00 - 00000000784a6580 : tcp_read_sock
             901.00 - 00000000784a7840 : tcp_sendpage
             857.00 - 0000000078470be0 : __skb_splice_bits
             617.00 - 00000000784b2260 : tcp_transmit_skb
             604.00 - 000000007849fdf0 : __ip_local_out
             581.00 - 0000000078470460 : __copy_skb_header
             569.00 - 000000007850cac0 : _spin_lock     [myri10ge]
             472.00 - 0000000078185150 : __slab_free
             443.00 - 000000007850cc10 : _spin_lock_bh  [sky2]
             434.00 - 00000000781852e0 : __slab_alloc
             408.00 - 0000000078488620 : __qdisc_run
             355.00 - 0000000078185b20 : kmem_cache_free
             352.00 - 0000000078472950 : __alloc_skb
             348.00 - 00000000784705f0 : __skb_clone
             337.00 - 0000000078185870 : kmem_cache_alloc       [myri10ge]
             302.00 - 0000000078472150 : skb_release_data
             297.00 - 000000007847bcf0 : dev_queue_xmit
             285.00 - 00000000784a08f0 : ip_queue_xmit

You should ignore the init_module, _spin_lock, etc. lines (in fact, all
lines indicating a module), as there's something wrong there: they
always report the name of the most recently loaded module, and the name
changes when the module is unloaded.

I also tried dumping the number of cache misses per function:

------------------------------------------------------------------------------
 KernelTop:    1146 irqs/sec  [NMI, 1000 cache-misses],  (all, cpu: 1)
------------------------------------------------------------------------------

             events         RIP          kernel function
  ______     ______   ________________   _______________

            7512.00 - 00000000784a6580 : tcp_read_sock
            2716.00 - 0000000078470be0 : __skb_splice_bits
            2516.00 - 00000000784a08f0 : ip_queue_xmit
             986.00 - 00000000784a7840 : tcp_sendpage
             587.00 - 00000000781a40c0 : sys_splice
             462.00 - 00000000781856b0 : kfree  [myri10ge]
             451.00 - 000000007849fdf0 : __ip_local_out
             242.00 - 0000000078185b20 : kmem_cache_free
             205.00 - 00000000784b1b90 : __tcp_select_window
             153.00 - 000000007850cac0 : _spin_lock     [myri10ge]
             142.00 - 000000007849f6a0 : ip_fragment
             119.00 - 0000000078188950 : rw_verify_area
             117.00 - 00000000784a99e0 : tcp_rcv_space_adjust
             107.00 - 000000007850cc10 : _spin_lock_bh  [sky2]
             104.00 - 00000000781852e0 : __slab_alloc

There are other options to combine events, but I don't understand
the output when I enable them.

I think that when properly used, these tools can report useful
information.

Regards,
Willy

Willy Tarreau Jan. 25, 2009, 9:03 p.m. UTC | #27
Hi David,

On Mon, Jan 19, 2009 at 12:59:41PM -0800, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Mon, 19 Jan 2009 21:19:24 +1100
> 
> > On Sun, Jan 18, 2009 at 10:19:08PM -0800, David Miller wrote:
> > > 
> > > Actually, I see, the myri10ge driver does put up to
> > > 64 bytes of the initial packet into the linear area.
> > > If the IPV4 + TCP headers are less than this, you will
> > > hit the corruption case even with the myri10ge driver.
> > 
> > I thought splice only mapped the payload areas, no?
> 
> And the difference between 64 and IPV4+TCP header len becomes the
> payload, don't you see? :-)
> 
> myri10ge just pulls min(64, skb->len) bytes from the SKB frags into
> the linear area, unconditionally.  So a small number of payload bytes
> can in fact end up there.
> 
> Otherwise Willy could never have triggered this bug.

Just FWIW, I've updated my tools in order to perform content checks more
easily. I cannot reproduce the issue at all with the myri10ge NICs, neither
with large frames nor with tiny ones (8 bytes).

However, I have noticed that the load is now sensitive to the number of
concurrent sessions. I'm using 2.6.29-rc2 with the perfcounters patches,
and I'm not sure whether the difference in behaviour came with the data
corruption fixes or with the new kernel (which has some profiling options
turned on). Basically, below 800-1000 concurrent sessions, I have no
problem reaching 10 Gbps with LRO and MTU=1500, with about 60% CPU. Above
this number of sessions, the CPU suddenly jumps to 100% and the data rate
drops to about 6.7 Gbps.

I spent a long time trying to figure out what it was, but I think I
have found it. Kerneltop reports different figures above and below the
limit.

1) below the limit :

            1429.00 - 00000000784a7840 : tcp_sendpage
             561.00 - 00000000784a6580 : tcp_read_sock
             485.00 - 00000000f87e13c0 : myri10ge_xmit  [myri10ge]
             433.00 - 00000000781a40c0 : sys_splice
             411.00 - 00000000784a6eb0 : tcp_poll
             344.00 - 000000007847bcf0 : dev_queue_xmit
             342.00 - 0000000078470be0 : __skb_splice_bits
             319.00 - 0000000078472950 : __alloc_skb
             310.00 - 0000000078185870 : kmem_cache_alloc
             285.00 - 00000000784b2260 : tcp_transmit_skb
             285.00 - 000000007850cac0 : _spin_lock
             250.00 - 00000000781afda0 : sys_epoll_ctl
             238.00 - 000000007810334c : system_call
             232.00 - 000000007850ac20 : schedule
             230.00 - 000000007850cc10 : _spin_lock_bh
             222.00 - 00000000784705f0 : __skb_clone
             220.00 - 000000007850cbc0 : _spin_lock_irqsave
             213.00 - 00000000784a08f0 : ip_queue_xmit
             211.00 - 0000000078185ea0 : __kmalloc_track_caller

2) above the limit :

            1778.00 - 00000000784a7840 : tcp_sendpage
            1281.00 - 0000000078472950 : __alloc_skb
             639.00 - 00000000784a6780 : sk_stream_alloc_skb
             507.00 - 0000000078185ea0 : __kmalloc_track_caller
             484.00 - 0000000078185870 : kmem_cache_alloc
             476.00 - 00000000784a6580 : tcp_read_sock
             451.00 - 00000000784a08f0 : ip_queue_xmit
             421.00 - 00000000f87e13c0 : myri10ge_xmit  [myri10ge]
             374.00 - 00000000781852e0 : __slab_alloc
             361.00 - 00000000781a40c0 : sys_splice
             273.00 - 0000000078470be0 : __skb_splice_bits
             231.00 - 000000007850cac0 : _spin_lock
             206.00 - 0000000078168b30 : get_pageblock_flags_group
             165.00 - 00000000784a0260 : ip_finish_output
             165.00 - 00000000784b2260 : tcp_transmit_skb
             161.00 - 0000000078470460 : __copy_skb_header
             153.00 - 000000007816d6d0 : put_page
             144.00 - 000000007850cbc0 : _spin_lock_irqsave
             137.00 - 0000000078189be0 : fget_light

The memory allocation clearly is the culprit here. I'll try Jarek's
patch which reduces memory allocation to see if that changes something,
as I'm sure we can do fairly better, given how it behaves with limited
sessions.

Regards,
Willy

PS: this thread is long; if some of the people in CC want to get off
    the thread, please complain.

Jarek Poplawski Jan. 26, 2009, 7:59 a.m. UTC | #28
On Sun, Jan 25, 2009 at 10:03:25PM +0100, Willy Tarreau wrote:
...
> The memory allocation clearly is the culprit here. I'll try Jarek's
> patch which reduces memory allocation to see if that changes something,
> as I'm sure we can do fairly better, given how it behaves with limited
> sessions.

I think you are right, but I wonder if it's not better to wait with
more profiling until this splicing is really redone.

Regards,
Jarek P.
Willy Tarreau Jan. 26, 2009, 8:12 a.m. UTC | #29
On Mon, Jan 26, 2009 at 07:59:33AM +0000, Jarek Poplawski wrote:
> On Sun, Jan 25, 2009 at 10:03:25PM +0100, Willy Tarreau wrote:
> ...
> > The memory allocation clearly is the culprit here. I'll try Jarek's
> > patch which reduces memory allocation to see if that changes something,
> > as I'm sure we can do fairly better, given how it behaves with limited
> > sessions.
> 
> I think you are right, but I wonder if it's not better to wait with
> more profiling until this splicing is really redone.

Agreed. In fact I have run a few tests even with your patch and I could
see no obvious figure starting to appear. I'll go back to 2.6.27-stable
+ the fixes because I don't really know what I'm testing under 2.6.29-rc2.
Once I'm able to get reproducible reference numbers, I'll test again
with the latest kernel.

Regards,
Willy


Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 65eac77..56272ac 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -73,17 +73,13 @@  static struct kmem_cache *skbuff_fclone_cache __read_mostly;
 static void sock_pipe_buf_release(struct pipe_inode_info *pipe,
 				  struct pipe_buffer *buf)
 {
-	struct sk_buff *skb = (struct sk_buff *) buf->private;
-
-	kfree_skb(skb);
+	put_page(buf->page);
 }
 
 static void sock_pipe_buf_get(struct pipe_inode_info *pipe,
 				struct pipe_buffer *buf)
 {
-	struct sk_buff *skb = (struct sk_buff *) buf->private;
-
-	skb_get(skb);
+	get_page(buf->page);
 }
 
 static int sock_pipe_buf_steal(struct pipe_inode_info *pipe,
@@ -1334,9 +1330,19 @@  fault:
  */
 static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
 {
-	struct sk_buff *skb = (struct sk_buff *) spd->partial[i].private;
+	put_page(spd->pages[i]);
+}
 
-	kfree_skb(skb);
+static inline struct page *linear_to_page(struct page *page, unsigned int len,
+					  unsigned int offset)
+{
+	struct page *p = alloc_pages(GFP_KERNEL, 0);
+
+	if (!p)
+		return NULL;
+	memcpy(page_address(p) + offset, page_address(page) + offset, len);
+
+	return p;
 }
 
 /*
@@ -1344,16 +1350,23 @@  static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
  */
 static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
 				unsigned int len, unsigned int offset,
-				struct sk_buff *skb)
+				struct sk_buff *skb, int linear)
 {
 	if (unlikely(spd->nr_pages == PIPE_BUFFERS))
 		return 1;
 
+	if (linear) {
+		page = linear_to_page(page, len, offset);
+		if (!page)
+			return 1;
+	} else
+		get_page(page);
+
 	spd->pages[spd->nr_pages] = page;
 	spd->partial[spd->nr_pages].len = len;
 	spd->partial[spd->nr_pages].offset = offset;
-	spd->partial[spd->nr_pages].private = (unsigned long) skb_get(skb);
 	spd->nr_pages++;
+
 	return 0;
 }
 
@@ -1369,7 +1382,7 @@  static inline void __segment_seek(struct page **page, unsigned int *poff,
 static inline int __splice_segment(struct page *page, unsigned int poff,
 				   unsigned int plen, unsigned int *off,
 				   unsigned int *len, struct sk_buff *skb,
-				   struct splice_pipe_desc *spd)
+				   struct splice_pipe_desc *spd, int linear)
 {
 	if (!*len)
 		return 1;
@@ -1392,7 +1405,7 @@  static inline int __splice_segment(struct page *page, unsigned int poff,
 		/* the linear region may spread across several pages  */
 		flen = min_t(unsigned int, flen, PAGE_SIZE - poff);
 
-		if (spd_fill_page(spd, page, flen, poff, skb))
+		if (spd_fill_page(spd, page, flen, poff, skb, linear))
 			return 1;
 
 		__segment_seek(&page, &poff, &plen, flen);
@@ -1419,7 +1432,7 @@  static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
 	if (__splice_segment(virt_to_page(skb->data),
 			     (unsigned long) skb->data & (PAGE_SIZE - 1),
 			     skb_headlen(skb),
-			     offset, len, skb, spd))
+			     offset, len, skb, spd, 1))
 		return 1;
 
 	/*
@@ -1429,7 +1442,7 @@  static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
 		const skb_frag_t *f = &skb_shinfo(skb)->frags[seg];
 
 		if (__splice_segment(f->page, f->page_offset, f->size,
-				     offset, len, skb, spd))
+				     offset, len, skb, spd, 0))
 			return 1;
 	}
 
@@ -1442,7 +1455,7 @@  static int __skb_splice_bits(struct sk_buff *skb, unsigned int *offset,
  * the frag list, if such a thing exists. We'd probably need to recurse to
  * handle that cleanly.
  */
-int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
+int skb_splice_bits(struct sk_buff *skb, unsigned int offset,
 		    struct pipe_inode_info *pipe, unsigned int tlen,
 		    unsigned int flags)
 {
@@ -1455,16 +1468,6 @@  int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
 		.ops = &sock_pipe_buf_ops,
 		.spd_release = sock_spd_release,
 	};
-	struct sk_buff *skb;
-
-	/*
-	 * I'd love to avoid the clone here, but tcp_read_sock()
-	 * ignores reference counts and unconditonally kills the sk_buff
-	 * on return from the actor.
-	 */
-	skb = skb_clone(__skb, GFP_KERNEL);
-	if (unlikely(!skb))
-		return -ENOMEM;
 
 	/*
 	 * __skb_splice_bits() only fails if the output has no room left,
@@ -1488,15 +1491,9 @@  int skb_splice_bits(struct sk_buff *__skb, unsigned int offset,
 	}
 
 done:
-	/*
-	 * drop our reference to the clone, the pipe consumption will
-	 * drop the rest.
-	 */
-	kfree_skb(skb);
-
 	if (spd.nr_pages) {
+		struct sock *sk = skb->sk;
 		int ret;
-		struct sock *sk = __skb->sk;
 
 		/*
 		 * Drop the socket lock, otherwise we have reverse