
[v2] tcp: splice as many packets as possible at once

Message ID 20090120093352.GB13806@ff.dom.local
State Rejected, archived
Delegated to: David Miller

Commit Message

Jarek Poplawski Jan. 20, 2009, 9:33 a.m. UTC
On Tue, Jan 20, 2009 at 08:37:26AM +0000, Jarek Poplawski wrote:
...
> Here is a tiny upgrade to save some memory by reusing a page for more
> chunks if possible, which I think could be considered after the
> testing of the main patch is finished. (There could also be added an
> additional freeing of this cached page before socket destruction,
> maybe in tcp_splice_read(), if somebody finds a good place.)

OOPS! I did it again... Here is better refcounting.

Jarek P.

--- (take 2)

 include/net/sock.h  |    4 ++++
 net/core/skbuff.c   |   32 ++++++++++++++++++++++++++------
 net/core/sock.c     |    2 ++
 net/ipv4/tcp_ipv4.c |    8 ++++++++
 4 files changed, 40 insertions(+), 6 deletions(-)
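
The shape of the idea, as a rough sketch rather than the patch body itself:
keep one cached page per socket, copy successive linear chunks into it at
increasing offsets, and take one page reference per chunk handed to the
pipe, plus one held by the cache. Roughly (illustration only, modelled on
the linear_to_page() copy helper in net/core/skbuff.c; callers are assumed
to pass at most one page worth of data per chunk):

/* (in net/core/skbuff.c) illustration of the refcounting, not the patch */
static struct page *linear_to_page(struct page *page, unsigned int *len,
                                   unsigned int *offset, struct sk_buff *skb)
{
        struct sock *sk = skb->sk;
        struct page *p = sk->sk_splice_page;
        unsigned int off;

        if (!p) {
new_page:
                p = sk->sk_splice_page = alloc_pages(sk->sk_allocation, 0);
                if (!p)
                        return NULL;
                off = sk->sk_splice_off = 0;
                /* alloc_pages() gave us the reference held by the cache */
        } else {
                off = sk->sk_splice_off;
                if (off + *len > PAGE_SIZE) {
                        /* page is full: drop only the cache's reference
                         * (pipe buffers keep theirs) and start a new page */
                        put_page(p);
                        goto new_page;
                }
        }

        memcpy(page_address(p) + off, page_address(page) + *offset, *len);
        sk->sk_splice_off = off + *len;
        *offset = off;
        get_page(p);                    /* reference for the pipe buffer */
        return p;
}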


Comments

Evgeniy Polyakov Jan. 20, 2009, 10 a.m. UTC | #1
Hi Jarek.

On Tue, Jan 20, 2009 at 09:33:52AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> > Here is a tiny upgrade to save some memory by reusing a page for more
> > chunks if possible, which I think could be considered after the
> > testing of the main patch is finished. (There could also be added an
> > additional freeing of this cached page before socket destruction,
> > maybe in tcp_splice_read(), if somebody finds a good place.)
> 
> OOPS! I did it again... Here is better refcounting.
> 
> Jarek P.
> 
> --- (take 2)
> 
>  include/net/sock.h  |    4 ++++
>  net/core/skbuff.c   |   32 ++++++++++++++++++++++++++------
>  net/core/sock.c     |    2 ++
>  net/ipv4/tcp_ipv4.c |    8 ++++++++
>  4 files changed, 40 insertions(+), 6 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 5a3a151..4ded741 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -190,6 +190,8 @@ struct sock_common {
>    *	@sk_user_data: RPC layer private data
>    *	@sk_sndmsg_page: cached page for sendmsg
>    *	@sk_sndmsg_off: cached offset for sendmsg
> +  *	@sk_splice_page: cached page for splice
> +  *	@sk_splice_off: cached offset for splice

Ugh, that increases every socket by 16 bytes... Does the TCP one still
fit into a page?
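
(A quick way to check would be a build-time assertion, or running
"pahole -C tcp_sock vmlinux" from the dwarves tools on a built kernel;
the function below is only a sketch, not something in the patch:)

#include <linux/kernel.h>
#include <linux/tcp.h>
#include <asm/page.h>

static inline void tcp_sock_size_check(void)
{
        /* complain at compile time if struct tcp_sock outgrows one page */
        BUILD_BUG_ON(sizeof(struct tcp_sock) > PAGE_SIZE);
}
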
Jarek Poplawski Jan. 20, 2009, 10:20 a.m. UTC | #2
On Tue, Jan 20, 2009 at 01:00:43PM +0300, Evgeniy Polyakov wrote:
> Hi Jarek.

Hi Evgeniy.
...
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 5a3a151..4ded741 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -190,6 +190,8 @@ struct sock_common {
> >    *	@sk_user_data: RPC layer private data
> >    *	@sk_sndmsg_page: cached page for sendmsg
> >    *	@sk_sndmsg_off: cached offset for sendmsg
> > +  *	@sk_splice_page: cached page for splice
> > +  *	@sk_splice_off: cached offset for splice
> 
> Ugh, that increases every socket by 16 bytes... Does the TCP one still
> fit into a page?

Good question! Alas I can't check this soon, but if it's really like
this, of course this needs some better idea and rework. (BTW, I'd like
to prevent here as much as possible some strange activities like 1
byte (payload) packets getting full pages without any accounting.)

Thanks,
Jarek P.
Evgeniy Polyakov Jan. 20, 2009, 10:31 a.m. UTC | #3
On Tue, Jan 20, 2009 at 10:20:53AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> Good question! Alas I can't check this soon, but if it's really like
> this, of course this needs some better idea and rework. (BTW, I'd like
> to prevent here as much as possible some strange activities like 1
> byte (payload) packets getting full pages without any accounting.)

I believe the approach that meets all our goals is to have our own network
memory allocator, so that each skb could have its payload in fragments: we
would not suffer from heavy fragmentation and the power-of-two overhead
for larger MTUs, we would have a reserve for OOM conditions, and we would
generally not depend on the main system's behaviour.

I will resurrect my network allocator at some point to check how things
go in the modern environment, if no one beats this idea first :)

1. Network (tree) allocator
http://www.ioremap.net/projects/nta
Jarek Poplawski Jan. 20, 2009, 11:01 a.m. UTC | #4
On Tue, Jan 20, 2009 at 01:31:22PM +0300, Evgeniy Polyakov wrote:
> On Tue, Jan 20, 2009 at 10:20:53AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> > Good question! Alas I can't check this soon, but if it's really like
> > this, of course this needs some better idea and rework. (BTW, I'd like
> > to prevent here as much as possible some strange activities like 1
> > byte (payload) packets getting full pages without any accounting.)
> 
> I believe approach to meet all our goals is to have own network memory
> allocator, so that each skb could have its payload in the fragments, we
> would not suffer from the heavy fragmentation and power-of-two overhead
> for the larger MTUs, have a reserve for the OOM condition and generally
> do not depend on the main system behaviour.

100% right! But I guess we need this current fix for -stable, and I'm
a bit worried about safety.

> 
> I will resurrect to some point my network allocator to check how things
> go in the modern environment, if no one will beat this idea first :)

I can't see too much beating of ideas around this problem now... I wish
you luck!

> 
> 1. Network (tree) allocator
> http://www.ioremap.net/projects/nta
> 

Great, I'll try to learn a bit, btw.
Jarek P.
David Miller Jan. 20, 2009, 5:16 p.m. UTC | #5
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue, 20 Jan 2009 11:01:44 +0000

> On Tue, Jan 20, 2009 at 01:31:22PM +0300, Evgeniy Polyakov wrote:
> > On Tue, Jan 20, 2009 at 10:20:53AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> > > Good question! Alas I can't check this soon, but if it's really like
> > > this, of course this needs some better idea and rework. (BTW, I'd like
> > > to prevent here as much as possible some strange activities like 1
> > > byte (payload) packets getting full pages without any accounting.)
> > 
> > I believe approach to meet all our goals is to have own network memory
> > allocator, so that each skb could have its payload in the fragments, we
> > would not suffer from the heavy fragmentation and power-of-two overhead
> > for the larger MTUs, have a reserve for the OOM condition and generally
> > do not depend on the main system behaviour.
> 
> 100% right! But I guess we need this current fix for -stable, and I'm
> a bit worried about safety.

Jarek, we already have a page and offset you can use.

It's called sk_sndmsg_page but that is just the (current) name.
Nothing prevents you from reusing it for your purposes here.
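
(What that reuse might look like in the splice copy path; sketch only,
with a made-up helper name - only the struct sock fields change compared
to adding new sk_splice_* members, the refcounting stays the same:)

/* reuse the existing sendmsg cache instead of adding sk_splice_* fields */
static struct page *splice_cache_page(struct sock *sk, unsigned int len)
{
        struct page *p = sk->sk_sndmsg_page;

        if (p && sk->sk_sndmsg_off + len > PAGE_SIZE) {
                put_page(p);            /* drop the cache's reference */
                p = NULL;
        }
        if (!p) {
                p = alloc_pages(sk->sk_allocation, 0);
                if (!p)
                        return NULL;
                sk->sk_sndmsg_page = p;
                sk->sk_sndmsg_off = 0;
        }
        return p;       /* caller copies at sk_sndmsg_off and takes a ref */
}
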
Jarek Poplawski Jan. 21, 2009, 9:54 a.m. UTC | #6
On Tue, Jan 20, 2009 at 09:16:16AM -0800, David Miller wrote:
...
> Jarek, we already have a page and offset you can use.
> 
> It's called sk_sndmsg_page but that is just the (current) name.
> Nothing prevents you from reusing it for your purposes here.

I'm trying to get some know-how about this field.

Thanks,
Jarek P.
Jarek Poplawski Jan. 26, 2009, 8:20 a.m. UTC | #7
On 20-01-2009 11:31, Evgeniy Polyakov wrote:
...
> I believe approach to meet all our goals is to have own network memory
> allocator, so that each skb could have its payload in the fragments, we
> would not suffer from the heavy fragmentation and power-of-two overhead
> for the larger MTUs, have a reserve for the OOM condition and generally
> do not depend on the main system behaviour.
> 
> I will resurrect to some point my network allocator to check how things
> go in the modern environment, if no one will beat this idea first :)
> 
> 1. Network (tree) allocator
> http://www.ioremap.net/projects/nta

I looked at this a bit, but alas I didn't find much there for Herbert's
idea of payload in fragments/pages. Maybe some kind of API RFC is
needed before this resurrection?

Jarek P.
Evgeniy Polyakov Jan. 26, 2009, 9:21 p.m. UTC | #8
Hi Jarek.

On Mon, Jan 26, 2009 at 08:20:36AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> > 1. Network (tree) allocator
> > http://www.ioremap.net/projects/nta
> 
> I looked at this a bit, but alas I didn't find much for this Herbert's
> idea of payload in fragments/pages. Maybe some kind of API RFC is
> needed before this resurrection?

The basic idea is to steal some (probably a lot of) pages from the slab
allocator and put network buffers there, without the strict need for
power-of-two alignment and the possible wraps when we add skb_shared_info
at the end, which is why the old e1000 driver required order-4 allocations
for jumbo frames. We can do that in alloc_skb() and friends: put the
returned buffers into the skb's fraglist, update the reference counters
for those pages, and additionally copy the network headers into skb->head.
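
(Very roughly, the chunk-from-page part could look like the sketch below;
the names are made up, per-CPU pools and locking are left out, and sizes
are assumed to be at most one page:)

#include <linux/kernel.h>
#include <linux/cache.h>
#include <linux/gfp.h>
#include <linux/mm.h>

struct net_page_pool {
        struct page     *page;  /* current, partially used page */
        unsigned int     off;   /* next free offset inside it   */
};

/* hand out a cache-line-aligned chunk; the consumer frees it simply with
 * put_page(*pagep), and the page goes back once all its chunks are gone */
static void *net_pool_alloc(struct net_page_pool *pool, unsigned int size,
                            struct page **pagep)
{
        void *ptr;

        size = ALIGN(size, SMP_CACHE_BYTES);    /* cache line, not power of two */
        if (size > PAGE_SIZE)
                return NULL;

        if (!pool->page || pool->off + size > PAGE_SIZE) {
                if (pool->page)
                        put_page(pool->page);   /* drop the pool's reference */
                pool->page = alloc_page(GFP_ATOMIC);
                if (!pool->page)
                        return NULL;
                pool->off = 0;
        }

        ptr = page_address(pool->page) + pool->off;
        pool->off += size;
        get_page(pool->page);           /* one reference per live chunk */
        *pagep = pool->page;
        return ptr;
}
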
David Miller Jan. 27, 2009, 6:10 a.m. UTC | #9
From: Evgeniy Polyakov <zbr@ioremap.net>
Date: Tue, 27 Jan 2009 00:21:30 +0300

> Hi Jarek.
> 
> On Mon, Jan 26, 2009 at 08:20:36AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> > > 1. Network (tree) allocator
> > > http://www.ioremap.net/projects/nta
> > 
> > I looked at this a bit, but alas I didn't find much for this Herbert's
> > idea of payload in fragments/pages. Maybe some kind of API RFC is
> > needed before this resurrection?
> 
> Basic idea is to steal some (probably a lot) pages from the slab
> allocator and put network buffers there without strict need for
> power-of-two alignment and possible wraps when we add skb_shared_info at
> the end, so that old e1000 driver required order-4 allocations for the
> jumbo frames. We can do that in alloc_skb() and friends and put returned
> buffers into skb's fraglist and updated reference counters for those
> pages; and with additional copy of the network headers into skb->head.

We are going back and forth saying the same thing, I think :-)
(BTW, I think NTA is cool and we might do something like that
eventually)

The basic thing we have to do is make the drivers receive into
pages, and then slide the network headers (only) into the linear
SKB data area.

Even for drivers like NIU and myri10ge that do this, they only
use heuristics or some fixed minimum to decide how much to
move to the linear area.

Result?  Some data payload bits end up there because it overshoots.

Since we have pskb_may_pull() calls everywhere necessary, which
means not in eth_type_trans(), we could just make these drivers
(and future drivers converted to operate in this way) only
put the ethernet headers there initially.

Then the rest of the stack will take care of moving the network
and transport payloads there, as necessary.

I bet it won't even hurt latency or routing/firewall performance.

I did test this with the NIU driver at one point, and it did not
change TCP latency nor throughput at all even at 10g speeds.
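
(Roughly what such a receive path looks like; the helper name is made up,
and DMA unmapping/error handling are omitted:)

#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/string.h>
#include <linux/mm.h>

/* build an skb whose linear area holds only the ethernet header,
 * with the rest of the packet left in the receive page as frag 0 */
static struct sk_buff *rx_page_to_skb(struct net_device *dev,
                                      struct page *page, unsigned int off,
                                      unsigned int len)
{
        struct sk_buff *skb;

        skb = netdev_alloc_skb(dev, ETH_HLEN + NET_IP_ALIGN);
        if (!skb)
                return NULL;
        skb_reserve(skb, NET_IP_ALIGN);

        /* only the ethernet header goes into the linear area ... */
        memcpy(skb_put(skb, ETH_HLEN), page_address(page) + off, ETH_HLEN);

        /* ... the payload stays in the page */
        skb_fill_page_desc(skb, 0, page, off + ETH_HLEN, len - ETH_HLEN);
        skb->len      += len - ETH_HLEN;
        skb->data_len += len - ETH_HLEN;
        skb->truesize += len - ETH_HLEN;

        skb->protocol = eth_type_trans(skb, dev);
        /* network/transport headers get pulled later via pskb_may_pull() */
        return skb;
}
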
Jarek Poplawski Jan. 27, 2009, 7:40 a.m. UTC | #10
On Mon, Jan 26, 2009 at 10:10:56PM -0800, David Miller wrote:
> From: Evgeniy Polyakov <zbr@ioremap.net>
> Date: Tue, 27 Jan 2009 00:21:30 +0300
> 
> > Hi Jarek.
> > 
> > On Mon, Jan 26, 2009 at 08:20:36AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> > > > 1. Network (tree) allocator
> > > > http://www.ioremap.net/projects/nta
> > > 
> > > I looked at this a bit, but alas I didn't find much for this Herbert's
> > > idea of payload in fragments/pages. Maybe some kind of API RFC is
> > > needed before this resurrection?
> > 
> > Basic idea is to steal some (probably a lot) pages from the slab
> > allocator and put network buffers there without strict need for
> > power-of-two alignment and possible wraps when we add skb_shared_info at
> > the end, so that old e1000 driver required order-4 allocations for the
> > jumbo frames. We can do that in alloc_skb() and friends and put returned
> > buffers into skb's fraglist and updated reference counters for those
> > pages; and with additional copy of the network headers into skb->head.

I think the main problem is to respect put_page() more, and maybe you
mean to add this to your allocator too, but using slab pages for this
looks a bit complex to me; then again, I may be missing something.

> We are going back and forth saying the same thing, I think :-)
> (BTW, I think NTA is cool and we might do something like that
> eventually)
> 
> The basic thing we have to do is make the drivers receive into
> pages, and then slide the network headers (only) into the linear
> SKB data area.

As a matter of fact, I wonder if these headers should always be
separated. Their "chunk" could be refcounted as well, I guess.

Jarek P.
David Miller Jan. 27, 2009, 6:42 p.m. UTC | #11
From: David Miller <davem@davemloft.net>
Date: Mon, 26 Jan 2009 22:10:56 -0800 (PST)

> Even for drivers like NIU and myri10ge that do this, they only
> use heuristics or some fixed minimum to decide how much to
> move to the linear area.
> 
> Result?  Some data payload bits end up there because it overshoots.
 ...
> I did test this with the NIU driver at one point, and it did not
> change TCP latency nor throughput at all even at 10g speeds.

As a followup, it turns out that NIU right now does this properly.

It only pulls a maximum of ETH_HLEN into the linear area before giving
the SKB to netif_receive_skb().
David Miller Jan. 30, 2009, 9:42 p.m. UTC | #12
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue, 27 Jan 2009 07:40:48 +0000

> I think the main problem is to respect put_page() more, and maybe you
> mean to add this to your allocator too, but using slab pages for this
> looks a bit complex to me, but I can miss something.

Hmmm, Jarek's comments here made me realize that we might be
able to do some hack with cooperation with SLAB.

Basically the idea is that if the page count of a SLAB page
is greater than one, SLAB will not use that page for new
allocations.

It's cheesy and the SLAB developers will likely barf at the
idea, but it would certainly work.

Back to real life, I think long term the thing to do is to just do the
cached page allocator thing we'll be doing after Jarek's socket page
patch is integrated, and for best performance the driver has to
receive its data into pages, only explicitly pulling the ethernet
header into the linear area, like NIU does.
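
(Purely illustrative - the helper name is made up and real SLAB internals
are more involved - but the check itself is just:)

#include <linux/mm.h>

/* the "cheesy" test: if anybody besides SLAB still holds a reference on
 * this page (e.g. a pipe buffer created by splice), don't carve new
 * objects out of it */
static int slab_page_reusable(struct page *page)
{
        return page_count(page) == 1;
}
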
Willy Tarreau Jan. 30, 2009, 9:59 p.m. UTC | #13
On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Tue, 27 Jan 2009 07:40:48 +0000
> 
> > I think the main problem is to respect put_page() more, and maybe you
> > mean to add this to your allocator too, but using slab pages for this
> > looks a bit complex to me, but I can miss something.
> 
> Hmmm, Jarek's comments here made me realize that we might be
> able to do some hack with cooperation with SLAB.
> 
> Basically the idea is that if the page count of a SLAB page
> is greater than one, SLAB will not use that page for new
> allocations.

I thought that was the standard behaviour. That may explain why I
did not understand much of the previous discussion then :-/

> It's cheesy and the SLAB developers will likely barf at the
> idea, but it would certainly work.

Maybe that would be enough as a definitive fix for a stable
release, so that we can go on with deeper changes in newer
versions?

> Back to real life, I think long term the thing to do is to just do the
> cached page allocator thing we'll be doing after Jarek's socket page
> patch is integrated, and for best performance the driver has to
> receive it's data into pages, only explicitly pulling the ethernet
> header into the linear area, like NIU does.

Are there NICs out there able to do that themselves or does the
driver need to rely on complex hacks in order to achieve this ?

Willy

David Miller Jan. 30, 2009, 10:03 p.m. UTC | #14
From: Willy Tarreau <w@1wt.eu>
Date: Fri, 30 Jan 2009 22:59:20 +0100

> On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote:
> > It's cheesy and the SLAB developers will likely barf at the
> > idea, but it would certainly work.
> 
> Maybe that would be enough as a definitive fix for a stable
> release, so that we can go on with deeper changes in newer
> versions ?

Such a check could have performance ramifications; I wouldn't
risk it, and I already intend to push Jarek's page allocator
splice fix back to -stable eventually.

> > Back to real life, I think long term the thing to do is to just do the
> > cached page allocator thing we'll be doing after Jarek's socket page
> > patch is integrated, and for best performance the driver has to
> > receive it's data into pages, only explicitly pulling the ethernet
> > header into the linear area, like NIU does.
> 
> Are there NICs out there able to do that themselves or does the
> driver need to rely on complex hacks in order to achieve this ?

Any NIC, even the dumbest ones, can be made to receive into pages.
Willy Tarreau Jan. 30, 2009, 10:13 p.m. UTC | #15
On Fri, Jan 30, 2009 at 02:03:46PM -0800, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Fri, 30 Jan 2009 22:59:20 +0100
> 
> > On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote:
> > > It's cheesy and the SLAB developers will likely barf at the
> > > idea, but it would certainly work.
> > 
> > Maybe that would be enough as a definitive fix for a stable
> > release, so that we can go on with deeper changes in newer
> > versions ?
> 
> Such a check could have performance ramifications, I wouldn't
> risk it and already I intend to push Jarek's page allocator
> splice fix back to -stable eventually.

OK.

> > > Back to real life, I think long term the thing to do is to just do the
> > > cached page allocator thing we'll be doing after Jarek's socket page
> > > patch is integrated, and for best performance the driver has to
> > > receive it's data into pages, only explicitly pulling the ethernet
> > > header into the linear area, like NIU does.
> > 
> > Are there NICs out there able to do that themselves or does the
> > driver need to rely on complex hacks in order to achieve this ?
> 
> Any NIC, even the dumbest ones, can be made to receive into pages.

OK I thought that it was not always easy to split between headers
and payload. I know that myri10ge can be configured to receive into
either skbs or pages, but I was not sure about the real implications
behind that.

Thanks
Willy

David Miller Jan. 30, 2009, 10:15 p.m. UTC | #16
From: Willy Tarreau <w@1wt.eu>
Date: Fri, 30 Jan 2009 23:13:46 +0100

> On Fri, Jan 30, 2009 at 02:03:46PM -0800, David Miller wrote:
> > Any NIC, even the dumbest ones, can be made to receive into pages.
> 
> OK I thought that it was not always easy to split between headers
> and payload. I know that myri10ge can be configured to receive into
> either skbs or pages, but I was not sure about the real implications
> behind that.

For a dumb NIC you wouldn't split; you'd receive the entire packet
directly into part of a page.

Then right before you give it to the stack, you pull the ethernet
header from the page into the linear area.
Herbert Xu Jan. 30, 2009, 10:16 p.m. UTC | #17
On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote:
> 
> Hmmm, Jarek's comments here made me realize that we might be
> able to do some hack with cooperation with SLAB.
> 
> Basically the idea is that if the page count of a SLAB page
> is greater than one, SLAB will not use that page for new
> allocations.
> 
> It's cheesy and the SLAB developers will likely barf at the
> idea, but it would certainly work.

I'm not going anywhere near that discussion :)

> Back to real life, I think long term the thing to do is to just do the
> cached page allocator thing we'll be doing after Jarek's socket page
> patch is integrated, and for best performance the driver has to
> receive it's data into pages, only explicitly pulling the ethernet
> header into the linear area, like NIU does.

Yes that sounds like the way to go.

Cheers,
Jarek Poplawski Feb. 2, 2009, 8:08 a.m. UTC | #18
On Sat, Jan 31, 2009 at 09:16:04AM +1100, Herbert Xu wrote:
> On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote:
...
> > Back to real life, I think long term the thing to do is to just do the
> > cached page allocator thing we'll be doing after Jarek's socket page
> > patch is integrated, and for best performance the driver has to
> > receive it's data into pages, only explicitly pulling the ethernet
> > header into the linear area, like NIU does.
> 
> Yes that sounds like the way to go.

Looks like a lot of changes in drivers, plus: would it work with jumbo
frames? I wonder why the linear area can't be allocated as paged, and
freed with put_page() instead of kfree(skb->head) in skb_release_data().

Actually, at least for some time, both of these methods could be used
(falling back on paged-alloc failure), with some ifs in skb_release_data()
and spd_fill_page() (to check whether linear_to_page() is needed).
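
Something like this in the free path, just a sketch (head_is_page is a
made-up bit that would have to be added to struct sk_buff):

/* (in net/core/skbuff.c) */
static void skb_free_head(struct sk_buff *skb)
{
        if (skb->head_is_page)          /* hypothetical flag */
                put_page(virt_to_page(skb->head));
        else
                kfree(skb->head);
}

spd_fill_page() could then simply take a page reference on such heads
instead of going through the linear_to_page() copy.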

Jarek P.
David Miller Feb. 2, 2009, 8:18 a.m. UTC | #19
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Mon, 2 Feb 2009 08:08:55 +0000

> On Sat, Jan 31, 2009 at 09:16:04AM +1100, Herbert Xu wrote:
> > On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote:
> ...
> > > Back to real life, I think long term the thing to do is to just do the
> > > cached page allocator thing we'll be doing after Jarek's socket page
> > > patch is integrated, and for best performance the driver has to
> > > receive it's data into pages, only explicitly pulling the ethernet
> > > header into the linear area, like NIU does.
> > 
> > Yes that sounds like the way to go.
> 
> Looks like a lot of changes in drivers, plus: would it work with jumbo
> frames? I wonder why the linear area can't be allocated as paged, and
> freed with put_page() instead of kfree(skb->head) in skb_release_data().

Allocating 4096 or 8192 bytes for a 1500 byte frame is wasteful.
Jarek Poplawski Feb. 2, 2009, 8:43 a.m. UTC | #20
On Mon, Feb 02, 2009 at 12:18:54AM -0800, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Mon, 2 Feb 2009 08:08:55 +0000
> 
> > On Sat, Jan 31, 2009 at 09:16:04AM +1100, Herbert Xu wrote:
> > > On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote:
> > ...
> > > > Back to real life, I think long term the thing to do is to just do the
> > > > cached page allocator thing we'll be doing after Jarek's socket page
> > > > patch is integrated, and for best performance the driver has to
> > > > receive it's data into pages, only explicitly pulling the ethernet
> > > > header into the linear area, like NIU does.
> > > 
> > > Yes that sounds like the way to go.
> > 
> > Looks like a lot of changes in drivers, plus: would it work with jumbo
> > frames? I wonder why the linear area can't be allocated as paged, and
> > freed with put_page() instead of kfree(skb->head) in skb_release_data().
> 
> Allocating 4096 or 8192 bytes for a 1500 byte frame is wasteful.

I mean allocating chunks of cached pages, similarly to the sk_sndmsg_page
way. I guess a similar problem has to be worked out in any case. But
it seems doing it on the linear area requires fewer changes in other
places.

Jarek P.
David Miller Feb. 3, 2009, 7:50 a.m. UTC | #21
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Mon, 2 Feb 2009 08:43:58 +0000

> On Mon, Feb 02, 2009 at 12:18:54AM -0800, David Miller wrote:
> > Allocating 4096 or 8192 bytes for a 1500 byte frame is wasteful.
> 
> I mean allocating chunks of cached pages similarly to sk_sndmsg_page
> way. I guess the similar problem is to be worked out in any case. But
> it seems doing it on the linear area requires less changes in other
> places.

This is a very interesting idea, but it has some drawbacks:

1) Just like any other allocator we'll need to find a way to
   handle > PAGE_SIZE allocations, and thus add handling for
   compound pages etc.
 
   And exactly the drivers that want such huge SKB data areas
   on receive should be converted to use scatter gather page
   vectors in order to avoid multi-order pages and thus strains
   on the page allocator.

2) Space wastage and poor packing can be an issue.

   Even with SLAB/SLUB we get poor packing; look at Evgeniy's
   graphs that he made when writing his NTA patches.

Now, when choosing a way to move forward, I'm willing to accept a
little bit of the issues in #2 for the sake of avoiding the
issues in #1 above.

Jarek, note that we can just keep your current splice() copy hacks in
there.  And as a result we can have an easier to handle migration
path.  We just do the page RX allocation conversions in the drivers
where performance really matters, for hardware a lot of people have.

That's a lot smoother and has fewer issues than turning the system-wide
SKB allocator upside down.

Jarek Poplawski Feb. 3, 2009, 9:41 a.m. UTC | #22
On Mon, Feb 02, 2009 at 11:50:17PM -0800, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Mon, 2 Feb 2009 08:43:58 +0000
> 
> > On Mon, Feb 02, 2009 at 12:18:54AM -0800, David Miller wrote:
> > > Allocating 4096 or 8192 bytes for a 1500 byte frame is wasteful.
> > 
> > I mean allocating chunks of cached pages similarly to sk_sndmsg_page
> > way. I guess the similar problem is to be worked out in any case. But
> > it seems doing it on the linear area requires less changes in other
> > places.
> 
> This is a very interesting idea, but it has some drawbacks:
> 
> 1) Just like any other allocator we'll need to find a way to
>    handle > PAGE_SIZE allocations, and thus add handling for
>    compound pages etc.
>  
>    And exactly the drivers that want such huge SKB data areas
>    on receive should be converted to use scatter gather page
>    vectors in order to avoid multi-order pages and thus strains
>    on the page allocator.

I guess compound pages are handled by put_page() well enough, but I don't
think they should be the main argument here, and I agree: scatter-gather
should be used where possible.

> 
> 2) Space wastage and poor packing can be an issue.
> 
>    Even with SLAB/SLUB we get poor packing, look at Evegeniy's
>    graphs that he made when writing his NTA patches.

I'm a bit lost here: could you "remind" me of the way page space would be
used/saved in your paged variant, e.g. for ~1500B skbs?

> 
> Now, when choosing a way to move forward, I'm willing to accept a
> little bit of the issues in #2 for the sake of avoiding the
> issues in #1 above.
> 
> Jarek, note that we can just keep your current splice() copy hacks in
> there.  And as a result we can have an easier to handle migration
> path.  We just do the page RX allocation conversions in the drivers
> where performance really matters, for hardware a lot of people have.
> 
> That's a lot smoother and has less issues that converting the system
> wide SKB allocator upside down.
> 

Yes, this looks reasonable. On the other hand, I think it would be
nice to get some opinions of slab folks (incl. Evgeniy) on the expected
efficiency of such a solution. (It seems releasing with put_page() will
always have some cost with delayed reusing and/or waste of space.)

Jarek P.
Evgeniy Polyakov Feb. 3, 2009, 11:10 a.m. UTC | #23
On Tue, Feb 03, 2009 at 09:41:08AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> > 1) Just like any other allocator we'll need to find a way to
> >    handle > PAGE_SIZE allocations, and thus add handling for
> >    compound pages etc.
> >  
> >    And exactly the drivers that want such huge SKB data areas
> >    on receive should be converted to use scatter gather page
> >    vectors in order to avoid multi-order pages and thus strains
> >    on the page allocator.
> 
> I guess compound pages are handled by put_page() enough, but I don't
> think they should be main argument here, and I agree: scatter gather
> should be used where possible.

The problem is allocating them, since over time memory will become
quite fragmented, which will not allow finding a big enough page.

NTA tried to solve this by not allowing data allocated on one CPU to be
freed on a different CPU, contrary to what SLAB does. Modulo cache
coherency improvements, this allows freed chunks to be combined back into
pages, and those in turn to be combined to get bigger contiguous areas
suitable for drivers which were not converted to the scatter-gather
approach. I even believe that for some hardware it is the only way to
deal with jumbo frames.

> > 2) Space wastage and poor packing can be an issue.
> > 
> >    Even with SLAB/SLUB we get poor packing, look at Evegeniy's
> >    graphs that he made when writing his NTA patches.
> 
> I'm a bit lost here: could you "remind" the way page space would be
> used/saved in your paged variant e.g. for ~1500B skbs?

At least in NTA I used cache-line alignment for smaller chunks, while
SLAB uses powers of two. Thus for a 1500 MTU, SLAB wastes about 500 bytes
per packet (modulo the size of the shared info structure).

> Yes, this looks reasonable. On the other hand, I think it would be
> nice to get some opinions of slab folks (incl. Evgeniy) on the expected
> efficiency of such a solution. (It seems releasing with put_page() will
> always have some cost with delayed reusing and/or waste of space.)

Well, my opinion is rather biased here :)
Herbert Xu Feb. 3, 2009, 11:24 a.m. UTC | #24
On Tue, Feb 03, 2009 at 02:10:12PM +0300, Evgeniy Polyakov wrote:
>
> I even believe that for some hardware it is the only way to deal
> with the jumbo frames.

Not necessarily.  Even if the hardware can only DMA into contiguous
memory, we can always allocate a sufficient number of contiguous
buffers initially, and then always copy them into fragmented skbs
at receive time.  This way the contiguous buffers are never
depleted.

Granted copying sucks, but this is really because the underlying
hardware is badly designed.  Also copying is way better than
not receiving at all due to memory fragmentation.

Cheers,
Nick Piggin Feb. 3, 2009, 11:38 a.m. UTC | #25
On Saturday 31 January 2009 08:42:27 David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Tue, 27 Jan 2009 07:40:48 +0000
>
> > I think the main problem is to respect put_page() more, and maybe you
> > mean to add this to your allocator too, but using slab pages for this
> > looks a bit complex to me, but I can miss something.
>
> Hmmm, Jarek's comments here made me realize that we might be
> able to do some hack with cooperation with SLAB.
>
> Basically the idea is that if the page count of a SLAB page
> is greater than one, SLAB will not use that page for new
> allocations.

Wouldn't your caller need to know what objects are already
allocated in that page too?


> It's cheesy and the SLAB developers will likely barf at the
> idea, but it would certainly work.

It is nasty, yes. Using the page allocator directly seems
like a better approach.

And btw. be careful of using page->_count for anything, due
to speculative page references... basically it is useful only
to test zero or non-zero refcount. If designing a new scheme
for the network layer, it would be nicer to begin by using
say _mapcount or private or some other field in there for a
refcount (and I have a patch to avoid the atomic
put_page_testzero in page freeing for a caller that does their
own refcounting, so don't fear that extra overhead too much :)).


Evgeniy Polyakov Feb. 3, 2009, 11:49 a.m. UTC | #26
On Tue, Feb 03, 2009 at 10:24:31PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> > I even believe that for some hardware it is the only way to deal
> > with the jumbo frames.
> 
> Not necessarily.  Even if the hardware can only DMA into contiguous
> memory, we can always allocate a sufficient number of contiguous
> buffers initially, and then always copy them into fragmented skbs
> at receive time.  This way the contiguous buffers are never
> depleted.

How many such preallocated frames are enough? Is it enough to have all
sockets' recv buffer sizes divided by the MTU size? Or just some of them,
or... That will work, but there are way too many corner cases.

> Granted copying sucks, but this is really because the underlying
> hardware is badly designed.  Also copying is way better than
> not receiving at all due to memory fragmentation.

Maybe just do not allow jumbo frames when memory is fragmented enough,
and fall back to a smaller MTU in that case? With the LRO/GRO stuff there
should not be that much overhead compared to multiple-page
copies.
Herbert Xu Feb. 3, 2009, 11:53 a.m. UTC | #27
On Tue, Feb 03, 2009 at 02:49:44PM +0300, Evgeniy Polyakov wrote:
> How many such preallocated frames is enough? Does it enough to have all
> sockets recv buffer sizes divided by the MTU size? Or just some of them,
> or... That will work but there are way too many corner cases.

Easy, the driver is already allocating them right now so we don't
have to change a thing :)

All we have to do is change the refill mechanism to always allocate
a replacement skb in the rx path, and if that fails, allocate a
fragmented skb instead and copy the received data into it so that
the contiguous skb can be reused.

Cheers,
Evgeniy Polyakov Feb. 3, 2009, 12:07 p.m. UTC | #28
On Tue, Feb 03, 2009 at 10:53:13PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> > How many such preallocated frames is enough? Does it enough to have all
> > sockets recv buffer sizes divided by the MTU size? Or just some of them,
> > or... That will work but there are way too many corner cases.
> 
> Easy, the driver is already allocating them right now so we don't
> have to change a thing :)

How many? A hundred or so descriptors (or even several thousand) -
this really does not scale for somewhat loaded IO servers; that's
why we frequently get questions about why dmesg is filled with order-3
and higher allocation-failure dumps.

> All we have to do is change the refill mechanism to always allocate
> a replacement skb in the rx path, and if that fails, allocate a
> fragmented skb instead and copy the received data into it so that
> the contiguous skb can be reused.

Having a 'reserve' skb per socket is a good idea, but what if the number
of sockets is way too big?
Herbert Xu Feb. 3, 2009, 12:12 p.m. UTC | #29
On Tue, Feb 03, 2009 at 03:07:15PM +0300, Evgeniy Polyakov wrote:
>
> How many? A hundred or so descriptors (or even several thousands) -
> this really does not scale for the somewhat loaded IO servers, that's
> why we frequently get questions why dmesg is filler with order-3 and
> higher allocation failure dumps.

I think you've misunderstood my suggested scheme.

I'm suggesting that we keep the driver initialisation path as is,
so however many skb's the driver is allocating at open() time
remains unchanged.  Usually this would be the number of entries
on the ring buffer.  We can't do any better than that since if
the hardware can't do SG then you'll just have to find this many
contiguous buffers.

The only change we need to make is at receive time.  Instead of
always pushing the received skb into the stack, we should try to
allocate a linear replacement skb, and if that fails, allocate
a fragmented skb and copy the data into it.  That way we can
always push a linear skb back into the ring buffer.
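
A sketch of that receive-time fallback (helper names are made up; ring and
DMA handling are omitted, and the frag count is assumed to stay within
MAX_SKB_FRAGS):

#include <linux/kernel.h>
#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/string.h>
#include <linux/gfp.h>
#include <linux/mm.h>

/* copy a received packet out of the linear ring skb into a fragmented
 * skb, so that the linear skb can be left in the rx ring */
static struct sk_buff *rx_copy_to_frag_skb(struct net_device *dev,
                                           const struct sk_buff *ring_skb,
                                           unsigned int len)
{
        struct sk_buff *skb;
        unsigned int copied;
        int i = 0;

        skb = netdev_alloc_skb(dev, ETH_HLEN + NET_IP_ALIGN);
        if (!skb)
                return NULL;
        skb_reserve(skb, NET_IP_ALIGN);

        /* ethernet header into the linear area */
        memcpy(skb_put(skb, ETH_HLEN), ring_skb->data, ETH_HLEN);

        /* the rest of the payload into order-0 pages */
        for (copied = ETH_HLEN; copied < len; i++) {
                unsigned int chunk = min_t(unsigned int, len - copied, PAGE_SIZE);
                struct page *page = alloc_page(GFP_ATOMIC);

                if (!page) {
                        kfree_skb(skb);
                        return NULL;
                }
                memcpy(page_address(page), ring_skb->data + copied, chunk);
                skb_fill_page_desc(skb, i, page, 0, chunk);
                skb->len      += chunk;
                skb->data_len += chunk;
                skb->truesize += chunk;
                copied += chunk;
        }

        return skb;
}

The rx handler would still try to allocate a linear replacement skb first,
exactly as it does today, and only call something like the above (keeping
the old skb in the ring) when that allocation fails.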

Cheers,
Evgeniy Polyakov Feb. 3, 2009, 12:12 p.m. UTC | #30
On Tue, Feb 03, 2009 at 05:05:14AM -0800, david@lang.hm (david@lang.hm) wrote:
> >Maybe just do not allow jumbo frames when memory is fragmented enough
> >and fallback to the smaller MTU in this case? With LRO/GRO stuff there
> >should be not that much of the overhead compared to multiple-page
> >copies.
> 
> 
> 1. define 'fragmented enough'

When the allocator cannot provide the requested amount of data.

> 2. the packet size was already negotiated on your existing connections, 
> how are you going to change all those on the fly?

I.e., the MTU cannot be changed on the fly? Magic world.

> 3. what do you do when a remote system sends you a large packet? drop it 
> on the floor?

We already do just that when a jumbo frame cannot be allocated :)

> having some pool of large buffers to receive into (and copy out of those 
> buffers as quickly as possible) would cause a performance hit when things 
> get bad, but isn't that better than dropping packets?

It is a solution, but I think it will behave noticeably worse than
with a decreased MTU.

> as for the number of buffers to use. make a reasonable guess. if you only 
> have a small number of packets around, use the buffers directly, as you 
> use more of them start copying, as useage climbs attempt to allocate more. 
> if you can't allocate more (and you have all of your existing ones in use) 
> you will have to drop the packet, but at that point are you really in any 
> worse shape than if you didn't have some mechanism to copy out of the 
> large buffers?

That's the main point: how to deal with broken hardware? I think (though
I have no strong numbers) that having 6 packets with 1500 MTU
combined into a GRO/LRO frame will be processed way faster than copying a
9k MTU frame into 3 pages and processing a single skb.
Herbert Xu Feb. 3, 2009, 12:18 p.m. UTC | #31
On Tue, Feb 03, 2009 at 03:12:19PM +0300, Evgeniy Polyakov wrote:
> 
> It is a solution, but I think it will behave noticebly worse than
> with decresed MTU.

Not necessarily.  Remember GSO/GRO in essence are just hacks to
get around the fact that we can't increase the MTU to where we
want it to be.  MTU reduces the cost over the entire path while
GRO/GSO only do so for the sender and the receiver.

In other words when given the choice between a larger MTU with
copying or GRO, the larger MTU will probably win anyway as it's
optimising the entire path rather than just the receiver.
 
> That's the main point: how to deal with broken hardware? I think (but
> have no strong numbers though) that having 6 packets with 1500 MTU
> combined into GRO/LRO frame will be processed way faster than copying 9k
> MTU into 3 pages and process single skb.

Please note that with my scheme, you'd only start copying if you
can't allocate a linear skb.  So if memory fragmentation doesn't
happen then there is no copying at all.

Cheers,
Evgeniy Polyakov Feb. 3, 2009, 12:18 p.m. UTC | #32
On Tue, Feb 03, 2009 at 11:12:09PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> The only change we need to make is at receive time.  Instead of
> always pushing the received skb into the stack, we should try to
> allocate a linear replacement skb, and if that fails, allocate
> a fragmented skb and copy the data into it.  That way we can
> always push a linear skb back into the ring buffer.

Yes, that was the part about a 'reserve' buffer for the sockets that you
cut :)

I agree that this will work and will be better than nothing, but copying
9kb into 3 pages is a rather CPU-hungry operation, and I think (though I
have no numbers) that the system will behave faster if the MTU is reduced
to the standard one.
Another solution is to have a proper allocator which will be able to
defragment the data, if we are talking about alternatives to the drop.

So:
1. copy the whole jumbo skb into fragmented one
2. reduce the MTU
3. rely on the allocator

For the 'good' hardware and drivers nothing from the above is really needed.
Willy Tarreau Feb. 3, 2009, 12:25 p.m. UTC | #33
On Tue, Feb 03, 2009 at 03:18:36PM +0300, Evgeniy Polyakov wrote:
> On Tue, Feb 03, 2009 at 11:12:09PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> > The only change we need to make is at receive time.  Instead of
> > always pushing the received skb into the stack, we should try to
> > allocate a linear replacement skb, and if that fails, allocate
> > a fragmented skb and copy the data into it.  That way we can
> > always push a linear skb back into the ring buffer.
> 
> Yes, that's was the part about 'reserve' buffer for the sockets you cut
> :)
> 
> I agree that this will work and will be better than nothing, but copying
> 9kb into 3 pages is rather CPU hungry operation, and I think (but have
> no numbers though) that system will behave faster if MTU is reduced to
> the standard one.

Well, FWIW, I've always observed better performance with 4k MTU (4080 to
be precise) than with 9K, and I think that the overhead of allocating 3
contiguous pages is a major reason for this.

> Another solution is to have a proper allocator which will be able to
> defragment the data, if talking about the alternatives to the drop.
>
> So:
> 1. copy the whole jumbo skb into fragmented one
> 2. reduce the MTU

you'll not reduce MTU of established connections though. And trying to
advertise MSS changes in the middle of a TCP connection is an awful
hack which I think will not work everywhere.

> 3. rely on the allocator
> 
> For the 'good' hardware and drivers nothing from the above is really needed.
> 
> -- 
> 	Evgeniy Polyakov

Willy

Herbert Xu Feb. 3, 2009, 12:27 p.m. UTC | #34
On Tue, Feb 03, 2009 at 03:18:36PM +0300, Evgeniy Polyakov wrote:
>
> I agree that this will work and will be better than nothing, but copying
> 9kb into 3 pages is rather CPU hungry operation, and I think (but have
> no numbers though) that system will behave faster if MTU is reduced to
> the standard one.

Reducing the MTU can create all sorts of problems so it should be
avoided if at all possible.  These days, path MTU discovery is
haphazard at best.  In fact MTU problems are the main reason why
jumbo frames simply don't get deployed.

> Another solution is to have a proper allocator which will be able to
> defragment the data, if talking about the alternatives to the drop.

Sure, if we can create an allocator that can guarantee contiguous
allocations all the time then by all means go for it.  But until
we get there, doing what I suggested is way better than stopping
the receiving process altogether.

> So:
> 1. copy the whole jumbo skb into fragmented one
> 2. reduce the MTU
> 3. rely on the allocator

Yes, improving the allocator would obviously increase the performance;
however, there is nothing against employing both methods.  I'd
always avoid reducing the MTU at run-time though.

> For the 'good' hardware and drivers nothing from the above is really needed.

Right, that's why there is a point beyond which improving the
allocator is no longer worthwhile.

Cheers,
Herbert Xu Feb. 3, 2009, 12:28 p.m. UTC | #35
On Tue, Feb 03, 2009 at 01:25:35PM +0100, Willy Tarreau wrote:
>
> > So:
> > 1. copy the whole jumbo skb into fragmented one
> > 2. reduce the MTU
> 
> you'll not reduce MTU of established connections though. And trying to
> advertise MSS changes in the middle of a TCP connection is an awful
> hack which I think will not work everywhere.

Not to mention that Ethernet isn't just IP, so the protocol that
is being used might not have a concept of path MTU discovery at
all.

Cheers,
Evgeniy Polyakov Feb. 3, 2009, 12:30 p.m. UTC | #36
On Tue, Feb 03, 2009 at 11:18:08PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
> > It is a solution, but I think it will behave noticebly worse than
> > with decresed MTU.
> 
> Not necessarily.  Remember GSO/GRO in essence are just hacks to
> get around the fact that we can't increase the MTU to where we
> want it to be.  MTU reduces the cost over the entire path while
> GRO/GSO only do so for the sender and the receiver.
> 
> In other words when given the choice between a larger MTU with
> copying or GRO, the larger MTU will probably win anyway as it's
> optimising the entire path rather than just the receiver.

Well, we both do not have the data and very likely will not change our
opinions :)
But we can continue the discussion in case something interesting
appears. For example, I can hack up the e1000e driver to do a dumb copy
of 9k each time it receives a jumbo frame and compare it with the usual
1.5k MTU performance. But given that modern CPUs are loafing with
noticeably big IO chunks, this may only show that CPU usage increased
with the copy. Still, it may work.

> > That's the main point: how to deal with broken hardware? I think (but
> > have no strong numbers though) that having 6 packets with 1500 MTU
> > combined into GRO/LRO frame will be processed way faster than copying 9k
> > MTU into 3 pages and process single skb.
> 
> Please note that with my scheme, you'd only start copying if you
> can't allocate a linear skb.  So if memory fragmentation doesn't
> happen then there is no copying at all.

Yes, absolutely.
Nick Piggin Feb. 3, 2009, 12:33 p.m. UTC | #37
On Tuesday 03 February 2009 23:18:08 Herbert Xu wrote:
> On Tue, Feb 03, 2009 at 03:12:19PM +0300, Evgeniy Polyakov wrote:
> > It is a solution, but I think it will behave noticebly worse than
> > with decresed MTU.
>
> Not necessarily.  Remember GSO/GRO in essence are just hacks to
> get around the fact that we can't increase the MTU to where we
> want it to be.  MTU reduces the cost over the entire path while
> GRO/GSO only do so for the sender and the receiver.
>
> In other words when given the choice between a larger MTU with
> copying or GRO, the larger MTU will probably win anyway as it's
> optimising the entire path rather than just the receiver.
>
> > That's the main point: how to deal with broken hardware? I think (but
> > have no strong numbers though) that having 6 packets with 1500 MTU
> > combined into GRO/LRO frame will be processed way faster than copying 9k
> > MTU into 3 pages and process single skb.
>
> Please note that with my scheme, you'd only start copying if you
> can't allocate a linear skb.  So if memory fragmentation doesn't
> happen then there is no copying at all.

This sounds like a really nice idea (to the layman)!

Herbert Xu Feb. 3, 2009, 12:33 p.m. UTC | #38
On Tue, Feb 03, 2009 at 03:30:15PM +0300, Evgeniy Polyakov wrote:
>
> But we can proceed the discussion in case something interesting will
> appear. For example I can hack up e1000e driver to do a dumb copy of 9k
> each time it has received a jumbo frame and compare it with usual 1.5k
> MTU performance. But getting that modern CPUs are loafing with noticebly
> big IO chunks, this may only show that CPU was increased with the copy.
> But still may work.

Comparing performance is pointless because the only time you need
to do the copy is when the allocator has failed.  So there is *no*
alternative to copying, regardless of how slow it is.

You can always improve the allocator whether we do this copying
fallback or not.

Cheers,
Jarek Poplawski Feb. 3, 2009, 12:36 p.m. UTC | #39
On Tue, Feb 03, 2009 at 02:10:12PM +0300, Evgeniy Polyakov wrote:
> On Tue, Feb 03, 2009 at 09:41:08AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> > > 1) Just like any other allocator we'll need to find a way to
> > >    handle > PAGE_SIZE allocations, and thus add handling for
> > >    compound pages etc.
> > >  
> > >    And exactly the drivers that want such huge SKB data areas
> > >    on receive should be converted to use scatter gather page
> > >    vectors in order to avoid multi-order pages and thus strains
> > >    on the page allocator.
> > 
> > I guess compound pages are handled by put_page() enough, but I don't
> > think they should be main argument here, and I agree: scatter gather
> > should be used where possible.
> 
> Problem is to allocate them, since with the time memory will be
> quite fragmented, which will not allow to find a big enough page.

Yes, it's a problem, but I don't think the main one. Since we're
currently concerned with zero-copy for splice, I think we could
concentrate on the most common cases and treat jumbo frames with best
effort only: if there are free compound pages - fine; otherwise we
fall back to slab and copy in splice.

> 
> NTA tried to solve this by not allowing to free the data allocated on
> the different CPU, contrary to what SLAB does. Modulo cache coherency
> improvements, it allows to combine freed chunks back into the pages and
> combine them in turn to get bigger contiguous areas suitable for the
> drivers which were not converted to use the scatter gather approach.
> I even believe that for some hardware it is the only way to deal
> with the jumbo frames.
> 
> > > 2) Space wastage and poor packing can be an issue.
> > > 
> > >    Even with SLAB/SLUB we get poor packing, look at Evegeniy's
> > >    graphs that he made when writing his NTA patches.
> > 
> > I'm a bit lost here: could you "remind" the way page space would be
> > used/saved in your paged variant e.g. for ~1500B skbs?
> 
> At least in NTA I used cache line alignment for smaller chunks, while
> SLAB uses power of two. Thus for 1500 MTU SLAB wastes about 500 bytes
> per packet (modulo size of the shared info structure).
> 
> > Yes, this looks reasonable. On the other hand, I think it would be
> > nice to get some opinions of slab folks (incl. Evgeniy) on the expected
> > efficiency of such a solution. (It seems releasing with put_page() will
> > always have some cost with delayed reusing and/or waste of space.)
> 
> Well, my opinion is rather biased here :)

I understand NTA could be better than slabs in the above-mentioned cases,
but I'm not sure you have explained your point on solving this
zero-copy problem with NTA well enough?

Jarek P.
David Lang Feb. 3, 2009, 1:05 p.m. UTC | #40
On Tue, 3 Feb 2009, Evgeniy Polyakov wrote:

> On Tue, Feb 03, 2009 at 10:24:31PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote:
>>> I even believe that for some hardware it is the only way to deal
>>> with the jumbo frames.
>>
>> Not necessarily.  Even if the hardware can only DMA into contiguous
>> memory, we can always allocate a sufficient number of contiguous
>> buffers initially, and then always copy them into fragmented skbs
>> at receive time.  This way the contiguous buffers are never
>> depleted.
>
> How many such preallocated frames is enough? Does it enough to have all
> sockets recv buffer sizes divided by the MTU size? Or just some of them,
> or... That will work but there are way too many corner cases.
>
>> Granted copying sucks, but this is really because the underlying
>> hardware is badly designed.  Also copying is way better than
>> not receiving at all due to memory fragmentation.
>
> Maybe just do not allow jumbo frames when memory is fragmented enough
> and fallback to the smaller MTU in this case? With LRO/GRO stuff there
> should be not that much of the overhead compared to multiple-page
> copies.


1. define 'fragmented enough'

2. the packet size was already negotiated on your existing connections, 
how are you going to change all those on the fly?

3. what do you do when a remote system sends you a large packet? drop it 
on the floor?

having some pool of large buffers to receive into (and copy out of those 
buffers as quickly as possible) would cause a performance hit when things 
get bad, but isn't that better than dropping packets?

as for the number of buffers to use. make a reasonable guess. if you only 
have a small number of packets around, use the buffers directly, as you 
use more of them start copying, as usage climbs attempt to allocate more. 
if you can't allocate more (and you have all of your existing ones in use) 
you will have to drop the packet, but at that point are you really in any 
worse shape than if you didn't have some mechanism to copy out of the 
large buffers?

David Lang
Evgeniy Polyakov Feb. 3, 2009, 1:06 p.m. UTC | #41
On Tue, Feb 03, 2009 at 12:36:28PM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> I understand NTA could be better than slabs in above-mentioned cases,
> but I'm not sure you explaind enough your point on solving this
> zero-copy problem vs. NTA?

NTA steals pages from the SLAB so we can maintain any reference counter
logic in them, so the linear part of the skb may not really be freed/reused
until the reference counter hits zero.
Jarek Poplawski Feb. 3, 2009, 1:25 p.m. UTC | #42
On Tue, Feb 03, 2009 at 04:06:06PM +0300, Evgeniy Polyakov wrote:
> On Tue, Feb 03, 2009 at 12:36:28PM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> > I understand NTA could be better than slabs in above-mentioned cases,
> > but I'm not sure you explaind enough your point on solving this
> > zero-copy problem vs. NTA?
> 
> NTA steals pages from the SLAB so we can maintain any reference counter
> logic in them, so linear part of the skb may be not really freed/reused
> until reference counter hits zero.

Now it's clear. So this looks like one of the options considered by
David. Then I wonder about details... It seems some kind of scheduled
browsing for refcounts is needed or is there something better?

Jarek P.
Evgeniy Polyakov Feb. 3, 2009, 2:20 p.m. UTC | #43
On Tue, Feb 03, 2009 at 01:25:37PM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> Now it's clear. So this looks like one of the options considered by
> David. Then I wonder about the details... It seems some kind of scheduled
> scan of the refcounts is needed, or is there something better?

It depends on the implementation; for example, each kfree() may check the
reference counter and return the page to the allocator when it is really
free. Since a page may contain multiple objects, its reference counter may
hit zero someday in the future, or never reach it if the data is never freed.
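
A minimal sketch of that idea, assuming objects are carved out of whole pages
and every object pins the page with one reference (this is only an
illustration of the scheme, not NTA code):

#include <linux/mm.h>

/* carve a chunk out of a page; each chunk takes one page reference */
static void *carve_from_page(struct page *page, unsigned int *off,
			     unsigned int len)
{
	void *obj;

	if (*off + len > PAGE_SIZE)
		return NULL;		/* caller must start a fresh page */

	obj = page_address(page) + *off;
	*off += len;
	get_page(page);
	return obj;
}

/*
 * Freeing is just put_page(): the page goes back to the page allocator
 * only when the last chunk carved from it (plus the carver's own initial
 * reference) has been dropped.
 */
static void carve_free(void *obj)
{
	put_page(virt_to_page(obj));
}
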
David Miller Feb. 4, 2009, 12:46 a.m. UTC | #44
From: Evgeniy Polyakov <zbr@ioremap.net>
Date: Tue, 3 Feb 2009 14:10:12 +0300

> NTA tried to solve this by not allowing to free the data allocated on
> the different CPU, contrary to what SLAB does. Modulo cache coherency
> improvements,

This could kill performance on NUMA systems if we are not careful.

If we ever consider NTA seriously, these issues would need to
be performance tested.
David Miller Feb. 4, 2009, 12:46 a.m. UTC | #45
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Tue, 3 Feb 2009 22:24:31 +1100

> Not necessarily.  Even if the hardware can only DMA into contiguous
> memory, we can always allocate a sufficient number of contiguous
> buffers initially, and then always copy them into fragmented skbs
> at receive time.  This way the contiguous buffers are never
> depleted.
> 
> Granted copying sucks, but this is really because the underlying
> hardware is badly designed.  Also copying is way better than
> not receiving at all due to memory fragmentation.

This scheme sounds very reasonable.
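
A rough sketch of that receive-time copy, built on the existing paged-skb
helpers (the function name is made up, and the truesize/header handling a
real driver would need is left out):

#include <linux/skbuff.h>
#include <linux/mm.h>
#include <linux/string.h>

/* copy a frame out of a contiguous DMA buffer into a fragmented skb,
 * so the contiguous buffer can be reposted to the hardware right away */
static struct sk_buff *copy_to_paged_skb(const void *buf, unsigned int len)
{
	struct sk_buff *skb = alloc_skb(0, GFP_ATOMIC);
	unsigned int copied = 0;
	int i = 0;

	if (!skb)
		return NULL;

	while (copied < len) {
		unsigned int chunk = min_t(unsigned int, len - copied,
					   PAGE_SIZE);
		struct page *page;

		if (i >= MAX_SKB_FRAGS)
			goto drop;

		page = alloc_page(GFP_ATOMIC);
		if (!page)
			goto drop;

		memcpy(page_address(page), buf + copied, chunk);
		skb_fill_page_desc(skb, i++, page, 0, chunk);
		copied += chunk;
	}

	skb->len += len;
	skb->data_len += len;
	return skb;

drop:
	kfree_skb(skb);
	return NULL;
}
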
David Miller Feb. 4, 2009, 12:47 a.m. UTC | #46
From: Willy Tarreau <w@1wt.eu>
Date: Tue, 3 Feb 2009 13:25:35 +0100

> Well, FWIW, I've always observed better performance with 4k MTU (4080 to
> be precise) than with 9K, and I think that the overhead of allocating 3
> contiguous pages is a major reason for this.

With what hardware?  If it's with myri10ge, that driver uses page
frags so would not be using 3 contiguous pages even for jumbo frames.
Willy Tarreau Feb. 4, 2009, 6:19 a.m. UTC | #47
On Tue, Feb 03, 2009 at 04:47:34PM -0800, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Tue, 3 Feb 2009 13:25:35 +0100
> 
> > Well, FWIW, I've always observed better performance with 4k MTU (4080 to
> > be precise) than with 9K, and I think that the overhead of allocating 3
> > contiguous pages is a major reason for this.
> 
> With what hardware?  If it's with myri10ge, that driver uses page
> frags so would not be using 3 contiguous pages even for jumbo frames.

Yes myri10ge for the optimal 4080, but with e1000 too (though I don't
remember the exact optimal value, I think it was slightly lower).

For the myri10ge, could this be caused by the cache footprint then ?
I can also retry with various values between 4 and 9k, including
values close to 8k. Maybe the fact that 4k is better than 9 is
because we get better filling of all pages ?

I also remember having used a 7 kB MTU on e1000 and dl2k in the past.
BTW, 7k MTU on my NFS server which uses e1000 definitely stopped the
allocation failures which were polluting the logs, so it's been running
with that setting for years now.

Willy

Evgeniy Polyakov Feb. 4, 2009, 8:08 a.m. UTC | #48
On Tue, Feb 03, 2009 at 04:46:09PM -0800, David Miller (davem@davemloft.net) wrote:
> > NTA tried to solve this by not allowing to free the data allocated on
> > the different CPU, contrary to what SLAB does. Modulo cache coherency
> > improvements,
> 
> This could kill performance on NUMA systems if we are not careful.
> 
> If we ever consider NTA seriously, these issues would need to
> be performance tested.

Quite the contrary, I think. Memory is allocated and freed on the same CPU,
which means in the same memory domain, the one closest to the CPU in question.

I did not test NUMA though, but NTA performance on an ordinary CPU (it is
2.5 years old already :) was noticeably good.
Evgeniy Polyakov Feb. 4, 2009, 8:12 a.m. UTC | #49
On Wed, Feb 04, 2009 at 07:19:47AM +0100, Willy Tarreau (w@1wt.eu) wrote:
> Yes myri10ge for the optimal 4080, but with e1000 too (though I don't
> remember the exact optimal value, I think it was slightly lower).

Very likely it is related to the allocator - the same allocation
overhead to get a page, but 2.5 times bigger frame.

> For the myri10ge, could this be caused by the cache footprint then ?
> I can also retry with various values between 4 and 9k, including
> values close to 8k. Maybe the fact that 4k is better than 9 is
> because we get better filling of all pages ?
> 
> I also remember having used a 7 kB MTU on e1000 and dl2k in the past.
> BTW, 7k MTU on my NFS server which uses e1000 definitely stopped the
> allocation failures which were polluting the logs, so it's been running
> with that setting for years now.

Recent e1000 (e1000e) uses fragments, so it does not suffer from the
high-order allocation failures.
Willy Tarreau Feb. 4, 2009, 8:54 a.m. UTC | #50
On Wed, Feb 04, 2009 at 11:12:01AM +0300, Evgeniy Polyakov wrote:
> On Wed, Feb 04, 2009 at 07:19:47AM +0100, Willy Tarreau (w@1wt.eu) wrote:
> > Yes myri10ge for the optimal 4080, but with e1000 too (though I don't
> > remember the exact optimal value, I think it was slightly lower).
> 
> Very likely it is related to the allocator - the same allocation
> overhead to get a page, but 2.5 times bigger frame.
> 
> > For the myri10ge, could this be caused by the cache footprint then ?
> > I can also retry with various values between 4 and 9k, including
> > values close to 8k. Maybe the fact that 4k is better than 9 is
> > because we get better filling of all pages ?
> > 
> > I also remember having used a 7 kB MTU on e1000 and dl2k in the past.
> > BTW, 7k MTU on my NFS server which uses e1000 definitely stopped the
> > allocation failures which were polluting the logs, so it's been running
> > with that setting for years now.
> 
> Recent e1000 (e1000e) uses fragments, so it does not suffer from the
> high-order allocation failures.

My server is running 2.4 :-), but I observed the same issues with older
2.6 as well. I can certainly imagine that things have changed a lot since,
but the initial point remains : jumbo frames are expensive to deal with,
and with recent NICs and drivers, we might get close performance for
little additional cost. After all, initial justification for jumbo frames
was the devastating interrupt rate and all NICs coalesce interrupts now.

So if we can optimize all the infrastructure for extremely fast
processing of standard frames (1500) and still support jumbo frames
in a suboptimal mode, I think it could be a very good trade-off.

Regards,
willy

Herbert Xu Feb. 4, 2009, 8:59 a.m. UTC | #51
On Wed, Feb 04, 2009 at 09:54:32AM +0100, Willy Tarreau wrote:
>
> My server is running 2.4 :-), but I observed the same issues with older
> 2.6 as well. I can certainly imagine that things have changed a lot since,
> but the initial point remains : jumbo frames are expensive to deal with,
> and with recent NICs and drivers, we might get close performance for
> little additional cost. After all, initial justification for jumbo frames
> was the devastating interrupt rate and all NICs coalesce interrupts now.

This is total crap! Jumbo frames are way better than any of the
hacks (such as GSO) that people have come up with to get around it.
The only reason we are not using it as much is because of this
nasty thing called the Internet.

Cheers,
David Miller Feb. 4, 2009, 9:01 a.m. UTC | #52
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 4 Feb 2009 19:59:07 +1100

> On Wed, Feb 04, 2009 at 09:54:32AM +0100, Willy Tarreau wrote:
> >
> > My server is running 2.4 :-), but I observed the same issues with older
> > 2.6 as well. I can certainly imagine that things have changed a lot since,
> > but the initial point remains : jumbo frames are expensive to deal with,
> > and with recent NICs and drivers, we might get close performance for
> > little additional cost. After all, initial justification for jumbo frames
> > was the devastating interrupt rate and all NICs coalesce interrupts now.
> 
> This is total crap! Jumbo frames are way better than any of the
> hacks (such as GSO) that people have come up with to get around it.
> The only reason we are not using it as much is because of this
> nasty thing called the Internet.

Completely agreed.

If Jumbo frames are slower, it is NOT some fundamental issue.  It is
rather due to some misdesign of the hardware or its driver.
Willy Tarreau Feb. 4, 2009, 9:12 a.m. UTC | #53
On Wed, Feb 04, 2009 at 01:01:46AM -0800, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Wed, 4 Feb 2009 19:59:07 +1100
> 
> > On Wed, Feb 04, 2009 at 09:54:32AM +0100, Willy Tarreau wrote:
> > >
> > > My server is running 2.4 :-), but I observed the same issues with older
> > > 2.6 as well. I can certainly imagine that things have changed a lot since,
> > > but the initial point remains : jumbo frames are expensive to deal with,
> > > and with recent NICs and drivers, we might get close performance for
> > > little additional cost. After all, initial justification for jumbo frames
> > > was the devastating interrupt rate and all NICs coalesce interrupts now.
> > 
> > This is total crap! Jumbo frames are way better than any of the
> > hacks (such as GSO) that people have come up with to get around it.
> > The only reason we are not using it as much is because of this
> > nasty thing called the Internet.
> 
> Completely agreed.
> 
> If Jumbo frames are slower, it is NOT some fundamental issue.  It is
> rather due to some misdesign of the hardware or its driver.

Agreed we can't use them *because* of the internet, but this
limitation has forced hardware designers to find valid alternatives.
For instance, having the ability to reach 10 Gbps with 1500 bytes
frames on myri10ge with a low CPU usage is a real achievement. This
is "only" 800 kpps after all.

And the arbitrary choice of 9k for jumbo frames was total crap too.
It's clear that no hardware designer was involved in the process.
They have to stuff 16kB of RAM on a NIC to use only 9. And we need
to allocate 3 pages for slightly more than 2. 7.5 kB would have been
better in this regard.

I still find it nice to lower CPU usage with frames larger than 1500,
but given the fact that this is rarely used (even in datacenters), I
think our efforts should concentrate on where the real users are, ie
<1500.

Regards,
Willy

David Miller Feb. 4, 2009, 9:12 a.m. UTC | #54
From: Willy Tarreau <w@1wt.eu>
Date: Wed, 4 Feb 2009 07:19:47 +0100

> On Tue, Feb 03, 2009 at 04:47:34PM -0800, David Miller wrote:
> > From: Willy Tarreau <w@1wt.eu>
> > Date: Tue, 3 Feb 2009 13:25:35 +0100
> > 
> > > Well, FWIW, I've always observed better performance with 4k MTU (4080 to
> > > be precise) than with 9K, and I think that the overhead of allocating 3
> > > contiguous pages is a major reason for this.
> > 
> > With what hardware?  If it's with myri10ge, that driver uses page
> > frags so would not be using 3 contiguous pages even for jumbo frames.
> 
> Yes myri10ge for the optimal 4080, but with e1000 too (though I don't
> remember the exact optimal value, I think it was slightly lower).
> 
> For the myri10ge, could this be caused by the cache footprint then ?
> I can also retry with various values between 4 and 9k, including
> values close to 8k. Maybe the fact that 4k is better than 9 is
> because we get better filling of all pages ?

Looking quickly, myri10ge's buffer manager is incredibly simplistic so
it wastes a lot of memory and gives terrible cache behavior.

When using JUMBO MTU it just gives whole pages to the chip.

So it looks like, assuming 4096 byte PAGE_SIZE and 9000 byte
jumbo MTU, the chip will allocate for a full size frame:

	FULL PAGE
	FULL PAGE
	FULL PAGE

and only ~1K of that last full page will be utilized.

The headers will therefore always land on the same cache lines,
and PAGE_SIZE-~1K will be wasted.

Whereas for MTU selections below PAGE_SIZE, it will give MTU-sized
blocks to the chip for packet data allocation.
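
(In numbers: a 9000-byte frame spans three 4096-byte pages, so 12288 bytes
get pinned while only 9000 carry data, roughly 3.2K of the third page sits
idle, and every frame starts at offset 0 of a page, which is why the headers
keep hitting the same cache lines.)
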
David Miller Feb. 4, 2009, 9:15 a.m. UTC | #55
From: Willy Tarreau <w@1wt.eu>
Date: Wed, 4 Feb 2009 10:12:17 +0100

> And the arbitrary choice of 9k for jumbo frames was total crap too.
> It's clear that no hardware designer was involved in the process.

Willy, do some research; this is also completely wrong.

Alteon effectively created jumbo MTUs in their AceNIC chips, since that
was the first chip to ever do it.  Those were hardware engineers only
making those design decisions.

I think this is where I will stop taking part in this part of the
discussion.  Every posting is full of misinformation and I've got
better things to do than to refute it every 5 minutes.  :-/
Nick Piggin Feb. 4, 2009, 9:23 a.m. UTC | #56
On Wednesday 04 February 2009 19:08:51 Evgeniy Polyakov wrote:
> On Tue, Feb 03, 2009 at 04:46:09PM -0800, David Miller (davem@davemloft.net) wrote:
> > > NTA tried to solve this by not allowing to free the data allocated on
> > > the different CPU, contrary to what SLAB does. Modulo cache coherency
> > > improvements,
> >
> > This could kill performance on NUMA systems if we are not careful.
> >
> > If we ever consider NTA seriously, these issues would need to
> > be performance tested.
>
> Quite the contrary, I think. Memory is allocated and freed on the same CPU,
> which means in the same memory domain, the one closest to the CPU in question.
>
> I did not test NUMA though, but NTA performance on an ordinary CPU (it is
> 2.5 years old already :) was noticeably good.

I had a quick look at NTA... I didn't understand much of it yet, but
the remote freeing scheme is kind of like what I did for slqb. The
freeing CPU queues objects back to the CPU that allocated them, which
eventually checks the queue and frees them itself.

I don't know how much of a cache coherency gain you get from this -- in
most slab allocations, I think the object tends to be cached on the
CPU that frees it. I'm doing it mainly to try to avoid locking... I guess
that makes for a cache coherency benefit in itself.

If NTA does significantly better than the slab allocator, I would be quite
interested. It might be something that we can learn from and use in
the general slab allocator (or maybe something more network-specific
that NTA does).

Benny Amorsen Feb. 4, 2009, 9:41 a.m. UTC | #57
David Miller <davem@davemloft.net> writes:

> From: Herbert Xu <herbert@gondor.apana.org.au>

>> Granted copying sucks, but this is really because the underlying
>> hardware is badly designed.  Also copying is way better than
>> not receiving at all due to memory fragmentation.
>
> This scheme sounds very reasonable.

Would it be possible to add a counter somewhere for this, or is that
too expensive?


/Benny

Herbert Xu Feb. 4, 2009, 12:01 p.m. UTC | #58
Benny Amorsen <benny+usenet@amorsen.dk> wrote:
>
> Would it be possible to add a counter somewhere for this, or is that
> too expensive?

Yes a counter would be useful and is reasonable.  But you can
probably deduce it by just looking at slabinfo.

Cheers,
Roland Dreier Feb. 4, 2009, 7:19 p.m. UTC | #59
> And the arbitrary choice of 9k for jumbo frames was total crap too.
 > It's clear that no hardware designer was involved in the process.
 > They have to stuff 16kB of RAM on a NIC to use only 9. And we need
 > to allocate 3 pages for slightly more than 2. 7.5 kB would have been
 > better in this regard.

9K was not totally arbitrary.  The CRC used for checksumming Ethernet
packets has a probability of undetected errors that goes up sharply at
around eleven-thousand-something bytes.  So the real limit is ~11000 bytes,
and I believe ~9000 was chosen to be able to carry 8K NFS payloads + all
XDR and transport headers without fragmentation.
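
(In round numbers: an 8192-byte NFS payload plus RPC/XDR and UDP/IP headers,
a few hundred bytes at most, fits comfortably under 9000, while 9000 itself
stays safely below the ~11000-byte point where the CRC's error-detection
guarantees weaken.)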

 - R.
Willy Tarreau Feb. 4, 2009, 7:28 p.m. UTC | #60
On Wed, Feb 04, 2009 at 11:19:06AM -0800, Roland Dreier wrote:
>  > And the arbitrary choice of 9k for jumbo frames was total crap too.
>  > It's clear that no hardware designer was involved in the process.
>  > They have to stuff 16kB of RAM on a NIC to use only 9. And we need
>  > to allocate 3 pages for slightly more than 2. 7.5 kB would have been
>  > better in this regard.
> 
> 9K was not totally arbitrary.  The CRC used for checksumming Ethernet
> packets has a probability of undetected errors that goes up sharply at
> around eleven-thousand-something bytes.  So the real limit is ~11000 bytes,
> and I believe ~9000 was chosen to be able to carry 8K NFS payloads + all
> XDR and transport headers without fragmentation.

Yes, I know that initial motivation. But IMHO it was a purely functional
motivation without real consideration of the implications. When you
read Alteon's initial proposal, there is even a biased analysis (they
compare the fill ratio obtained with one 8k frame with that of six 1.5k
frames). Their own argument does not hold for their final proposal!
I think there might already have been people pushing for 8k rather than
9k by then, but in order to get wide acceptance in datacenters,
they had to please the NFS admins.

Now it's useless to speculate on history and I agree with Davem that
we're wasting our time with this discussion, let's go back to the
keyboard ;-)

Willy

Jarek Poplawski Feb. 4, 2009, 7:48 p.m. UTC | #61
On Wed, Feb 04, 2009 at 08:28:20PM +0100, Willy Tarreau wrote:
...
> Now it's useless to speculate on history and I agree with Davem that
> we're wasting our time with this discussion, let's go back to the
> keyboard ;-)

Davem knows everything, so he is ...wrong. This is a very useful
discussion. Go back to the keyboards, please :-)

Jarek P.
Bill Fink Feb. 5, 2009, 8:32 a.m. UTC | #62
On Wed, 4 Feb 2009, Willy Tarreau wrote:

> On Wed, Feb 04, 2009 at 01:01:46AM -0800, David Miller wrote:
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Date: Wed, 4 Feb 2009 19:59:07 +1100
> > 
> > > On Wed, Feb 04, 2009 at 09:54:32AM +0100, Willy Tarreau wrote:
> > > >
> > > > My server is running 2.4 :-), but I observed the same issues with older
> > > > 2.6 as well. I can certainly imagine that things have changed a lot since,
> > > > but the initial point remains : jumbo frames are expensive to deal with,
> > > > and with recent NICs and drivers, we might get close performance for
> > > > little additional cost. After all, initial justification for jumbo frames
> > > > was the devastating interrupt rate and all NICs coalesce interrupts now.
> > > 
> > > This is total crap! Jumbo frames are way better than any of the
> > > hacks (such as GSO) that people have come up with to get around it.
> > > The only reason we are not using it as much is because of this
> > > nasty thing called the Internet.
> > 
> > Completely agreed.
> > 
> > If Jumbo frames are slower, it is NOT some fundamental issue.  It is
> > rather due to some misdesign of the hardware or its driver.
> 
> Agreed we can't use them *because* of the internet, but this
> limitation has forced hardware designers to find valid alternatives.
> For instance, having the ability to reach 10 Gbps with 1500 bytes
> frames on myri10ge with a low CPU usage is a real achievement. This
> is "only" 800 kpps after all.
> 
> And the arbitrary choice of 9k for jumbo frames was total crap too.
> It's clear that no hardware designer was involved in the process.
> They have to stuff 16kB of RAM on a NIC to use only 9. And we need
> to allocate 3 pages for slightly more than 2. 7.5 kB would have been
> better in this regard.
> 
> I still find it nice to lower CPU usage with frames larger than 1500,
> but given the fact that this is rarely used (even in datacenters), I
> think our efforts should concentrate on where the real users are, ie
> <1500.

Those in the HPC realm use 9000 byte jumbo frames because it makes
a major performance difference, especially across large RTT paths,
and the Internet2 backbone fully supports 9000 byte jumbo frames
(with some wishing we could support much larger frame sizes).

Local environment:

9000 byte jumbo frames:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.1875 MB /  10.01 sec = 9905.9707 Mbps 100 %TX 76 %RX 0 retrans 0.15 msRTT

4080 byte MTU:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
 9171.6875 MB /  10.02 sec = 7680.7663 Mbps 100 %TX 99 %RX 0 retrans 0.19 msRTT

The performance impact is even more pronounced on a large RTT path
such as the following netem emulated 80 ms RTT path:

9000 byte jumbo frames:

[root@lang2 ~]# nuttcp -T30 -w80m 192.168.89.15
25904.2500 MB /  30.16 sec = 7205.8755 Mbps 96 %TX 55 %RX 0 retrans 82.73 msRTT

4080 byte MTU:

[root@lang2 ~]# nuttcp -T30 -w80m 192.168.89.15
 8650.0129 MB /  30.25 sec = 2398.8862 Mbps 33 %TX 19 %RX 2371 retrans 81.98 msRTT

And if there's any loss in the path, the performance difference is also
dramatic, such as here across a real MAN environment with about a 1 ms RTT:

9000 byte jumbo frames:

[root@chance9 ~]# nuttcp -w20m 192.168.88.8
 7711.8750 MB /  10.05 sec = 6436.2406 Mbps 82 %TX 96 %RX 261 retrans 0.92 msRTT

4080 byte MTU:

[root@chance9 ~]# nuttcp -w20m 192.168.88.8
 4551.0625 MB /  10.08 sec = 3786.2108 Mbps 50 %TX 95 %RX 42 retrans 0.95 msRTT

All testing was with myri10ge on the transmitter side (2.6.20.7 kernel).

So my experience has definitely been that 9000 byte jumbo frames are a
major performance win for high throughput applications.

						-Bill
David Miller Feb. 6, 2009, 7:52 a.m. UTC | #63
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue, 3 Feb 2009 09:41:08 +0000

> Yes, this looks reasonable. On the other hand, I think it would be
> nice to get some opinions of slab folks (incl. Evgeniy) on the expected
> efficiency of such a solution. (It seems releasing with put_page() will
> always have some cost with delayed reusing and/or waste of space.)

I think we can't avoid using carved up pages for skb->data in the end.
The whole kernel wants to speak in pages and be able to grab and
release them in one way and one way only (get_page() and put_page()).

What do you think is more likely?  Us teaching the whole entire kernel
how to hold onto SKB linear data buffers, or the networking fixing
itself to operate on pages for its header metadata? :-)

What we'll end up with is likely a hybrid scheme.  High speed devices
will receive into pages.  And also the skb->data area will be page
backed and held using get_page()/put_page() references.

It is not even worth optimizing for skb->data holding the entire
packet, that's not the case that matters.

These skb->data areas will thus be 128 bytes plus the skb_shinfo
structure blob.  They also will be recycled often, rather than held
onto for long periods of time.

In fact we can optimize that even further in many ways, for example by
dropping the skb->data backed memory once the skb is queued to the
socket receive buffer.  That will make skb->data buffer lifetimes
miniscule even under heavy receive load.

In that kind of situation, doing even the stupidest page slicing
algorithm, similar to what we do now with sk->sk_sndmsg_page, is
more than adequate, and things like NTA (purely to solve this problem)
are overengineering.

Herbert Xu Feb. 6, 2009, 8:09 a.m. UTC | #64
On Thu, Feb 05, 2009 at 11:52:58PM -0800, David Miller wrote:
>
> In fact we can optimize that even further in many ways, for example by
> dropping the skb->data backed memory once the skb is queued to the
> socket receive buffer.  That will make skb->data buffer lifetimes
> miniscule even under heavy receive load.

Indeed, while I was doing the tun accounting stuff and reviewing
the rx accounting users, it was apparent that we don't need to
carry most of this stuff in our receive queues.

This is almost the opposite of the skbless data idea for TSO.

Cheers,
Jarek Poplawski Feb. 6, 2009, 9:10 a.m. UTC | #65
On Thu, Feb 05, 2009 at 11:52:58PM -0800, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Tue, 3 Feb 2009 09:41:08 +0000
> 
> > Yes, this looks reasonable. On the other hand, I think it would be
> > nice to get some opinions of slab folks (incl. Evgeniy) on the expected
> > efficiency of such a solution. (It seems releasing with put_page() will
> > always have some cost with delayed reusing and/or waste of space.)
> 
> I think we can't avoid using carved up pages for skb->data in the end.
> The whole kernel wants to speak in pages and be able to grab and
> release them in one way and one way only (get_page() and put_page()).
> 
> What do you think is more likely?  Us teaching the whole entire kernel
> how to hold onto SKB linear data buffers, or the networking fixing
> itself to operate on pages for its header metadata? :-)

This idea looks very reasonable, except I wonder why nobody else
has needed this kind of mm interface. Another question: it seems
many mechanisms like fast searching, defragmentation, etc. could be
reused.

> What we'll end up with is likely a hybrid scheme.  High speed devices
> will receive into pages.  And also the skb->data area will be page
> backed and held using get_page()/put_page() references.
> 
> It is not even worth optimizing for skb->data holding the entire
> packet, that's not the case that matters.
> 
> These skb->data areas will thus be 128 bytes plus the skb_shinfo
> structure blob.  They also will be recycled often, rather than held
> onto for long periods of time.

Looks fine, except: you mentioned dumb NICs, which would need this
page space on receive, anyway. BTW, don't they need this on transmit
again?

> In fact we can optimize that even further in many ways, for example by
> dropping the skb->data backed memory once the skb is queued to the
> socket receive buffer.  That will make skb->data buffer lifetimes
> miniscule even under heavy receive load.
> 
> In that kind of situation, doing even the stupidest page slicing
> algorithm, similar to what we do now with sk->sk_sndmsg_page, is
> more than adequate, and things like NTA (purely to solve this problem)
> are overengineering.

Hmm... I don't get it. It seems these slabs do a lot of advanced work,
and still some people like Evgeniy or Nick thought it wasn't enough,
and even found it worth their time to rework this.

There is also a question of memory accounting: do you think admins
don't care if we give away, say, an extra 25%?

Jarek P.
David Miller Feb. 6, 2009, 9:17 a.m. UTC | #66
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Fri, 6 Feb 2009 09:10:34 +0000

> Hmm... I don't get it. It seems these slabs do a lot of advanced work,
> and still some people like Evgeniy or Nick thought it wasn't enough,
> and even found it worth their time to rework this.

Note that, at least to some extent, the memory allocators are
duplicating some of the locality and NUMA logic that's already present
in the page allocator itself.

Except that they are handling the fact that objects are moving around
instead of pages.

Also keep in mind that we might want to encourage drivers to make
use of the SKB recycling mechanisms we have.  That will decrease
lifetimes, and thus reduce the wastage and locality issues immensely.

We truly want something different from what the general purpose
allocator provides.  Namely, a reference countable buffer.

And all I'm saying is that since the page allocator provides that
facility, and using pages solves all of the splice() et al.  problems,
building something extremely simple on top of the page allocator seems
to be a good way to go.
Herbert Xu Feb. 6, 2009, 9:23 a.m. UTC | #67
On Fri, Feb 06, 2009 at 09:10:34AM +0000, Jarek Poplawski wrote:
> 
> Looks fine, except: you mentioned dumb NICs, which would need this
> page space on receive, anyway. BTW, don't they need this on transmit
> again?

A lot more NICs support SG on tx than rx.

Cheers,
Jarek Poplawski Feb. 6, 2009, 9:42 a.m. UTC | #68
On Fri, Feb 06, 2009 at 01:17:22AM -0800, David Miller wrote:
...
> And all I'm saying is that since the page allocator provides that
> facility, and using pages solves all of the splice() et al.  problems,
> building something extremely simple on top of the page allocator seems
> to be a good way to go.

This is all absolutely right if we can afford it: the simpler it is, the
more memory is wasted. I hope you're right, but on the other hand, people
don't use slob by default.

Jarek P.
David Miller Feb. 6, 2009, 9:49 a.m. UTC | #69
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Fri, 6 Feb 2009 09:42:53 +0000

> I hope you're right, but on the other hand, people don't use slob by
> default.

The use is different, so it's a bad comparison, really.

SLAB has to satisfy all kinds of object lifetimes, all kinds
of sizes and use cases, and perform generally well in all
situations with zero knowledge about usage patterns.

Networking is strictly the opposite of that set of requirements.  We
know all of these parameters (or their ranges) and can on top of that
control them if necessary.
Jarek Poplawski Feb. 6, 2009, 9:51 a.m. UTC | #70
On Fri, Feb 06, 2009 at 08:23:26PM +1100, Herbert Xu wrote:
> On Fri, Feb 06, 2009 at 09:10:34AM +0000, Jarek Poplawski wrote:
> > 
> > Looks fine, except: you mentioned dumb NICs, which would need this
> > page space on receive, anyway. BTW, don't they need this on transmit
> > again?
> 
> A lot more NICs support SG on tx than rx.

OK, but since there is not so much difference, and we need to waste
it in some cases anyway, plus handle it later some special way, I'm
a bit in doubt.

Jarek P.
Herbert Xu Feb. 6, 2009, 10:28 a.m. UTC | #71
On Fri, Feb 06, 2009 at 09:51:20AM +0000, Jarek Poplawski wrote:
>
> OK, but since there is not so much difference, and we need to waste
> it in some cases anyway, plus handle it later some special way, I'm
> a bit in doubt.

Well the thing is cards that don't support SG on tx probably
don't support jumbo frames either.

Cheers,
Jarek Poplawski Feb. 6, 2009, 10:58 a.m. UTC | #72
On Fri, Feb 06, 2009 at 09:28:22PM +1100, Herbert Xu wrote:
> On Fri, Feb 06, 2009 at 09:51:20AM +0000, Jarek Poplawski wrote:
> >
> > OK, but since there is not so much difference, and we need to waste
> > it in some cases anyway, plus handle it later some special way, I'm
> > a bit in doubt.
> 
> Well the thing is cards that don't support SG on tx probably
> don't support jumbo frames either.

?? I mean this 128-byte chunk would be hard to reuse after copying
to skb->data, and if reused, we could miss it for some NICs on TX,
so the whole packet would need a copy.

BTW, David mentioned something simple like sk_sndmsg_page would be
enough, but I guess not for these non-SG NICs. We have to allocate
bigger chunks for them, so more fragmentation to handle.

Cheers,
Jarek P.
Willy Tarreau Feb. 6, 2009, 11:10 a.m. UTC | #73
On Fri, Feb 06, 2009 at 10:58:07AM +0000, Jarek Poplawski wrote:
> On Fri, Feb 06, 2009 at 09:28:22PM +1100, Herbert Xu wrote:
> > On Fri, Feb 06, 2009 at 09:51:20AM +0000, Jarek Poplawski wrote:
> > >
> > > OK, but since there is not so much difference, and we need to waste
> > > it in some cases anyway, plus handle it later some special way, I'm
> > > a bit in doubt.
> > 
> > Well the thing is cards that don't support SG on tx probably
> > don't support jumbo frames either.
> 
> ?? I mean this 128-byte chunk would be hard to reuse after copying
> to skb->data, and if reused, we could miss it for some NICs on TX,
> so the whole packet would need a copy.

Couldn't we stuff up to 32 128-byte chunks in a page and use a 32-bit
map to indicate which slot is used and which one is free? This would
just be a matter of calling ffz() to find one spare place in a page.
Also, that bitmap might serve as a refcount, because if it drops to
zero, it means all slots are unused. And -1 means all slots are used.

This would reduce wastage if we need to allocate 128 bytes often.
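
Something like this, for the sake of argument (assuming 4k pages, so exactly
32 slots of 128 bytes; no locking is shown, a real version would need the
socket lock or an atomic bitmap, and the struct/function names are made up):

#include <linux/types.h>
#include <linux/bitops.h>
#include <linux/mm.h>

struct chunk_page {
	struct page	*page;
	u32		used;	/* bit i set => 128-byte slot i in use */
};

static void *chunk_alloc(struct chunk_page *cp)
{
	unsigned int slot;

	if (cp->used == ~0U)
		return NULL;			/* all 32 slots taken */

	slot = ffz(cp->used);			/* first free slot */
	cp->used |= 1U << slot;
	return page_address(cp->page) + slot * 128;
}

static void chunk_free(struct chunk_page *cp, void *obj)
{
	unsigned int slot = (obj - page_address(cp->page)) / 128;

	cp->used &= ~(1U << slot);
	if (!cp->used) {
		/* page completely free again: reuse it or put_page() it */
	}
}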

Willy

Jarek Poplawski Feb. 6, 2009, 11:47 a.m. UTC | #74
On Fri, Feb 06, 2009 at 12:10:15PM +0100, Willy Tarreau wrote:
> On Fri, Feb 06, 2009 at 10:58:07AM +0000, Jarek Poplawski wrote:
> > On Fri, Feb 06, 2009 at 09:28:22PM +1100, Herbert Xu wrote:
> > > On Fri, Feb 06, 2009 at 09:51:20AM +0000, Jarek Poplawski wrote:
> > > >
> > > > OK, but since there is not so much difference, and we need to waste
> > > > it in some cases anyway, plus handle it later some special way, I'm
> > > > a bit in doubt.
> > > 
> > > Well the thing is cards that don't support SG on tx probably
> > > don't support jumbo frames either.
> > 
> > ?? I mean this 128-byte chunk would be hard to reuse after copying
> > to skb->data, and if reused, we could miss it for some NICs on TX,
> > so the whole packet would need a copy.
> 
> Couldn't we stuff up to 32 128-byte chunks in a page and use a 32-bit
> map to indicate which slot is used and which one is free? This would
> just be a matter of calling ffz() to find one spare place in a page.
> Also, that bitmap might serve as a refcount, because if it drops to
> zero, it means all slots are unused. And -1 means all slots are used.
> 
> This would reduce wastage if we need to allocate 128 bytes often.

Something like this would be useful for SG NICs for the paged
skb->data area. But I'm concerned about non-SG ones: if I got it right,
for 1500-byte packets we need to allocate such a chunk, copy 128+
bytes to skb->data, and we have 128+ bytes unused. If more such short
packets come later, we don't need this space: so why copy?

But even if we find a way to use this, or don't reserve such space
while receiving from an SG NIC, then we will need to copy both chunks
to another page for TX on non-SG NICs.

Jarek P.
Jarek Poplawski Feb. 6, 2009, 6:59 p.m. UTC | #75
David Miller wrote, On 02/06/2009 08:52 AM:

> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Tue, 3 Feb 2009 09:41:08 +0000
> 
>> Yes, this looks reasonable. On the other hand, I think it would be
>> nice to get some opinions of slab folks (incl. Evgeniy) on the expected
>> efficiency of such a solution. (It seems releasing with put_page() will
>> always have some cost with delayed reusing and/or waste of space.)
> 
> I think we can't avoid using carved up pages for skb->data in the end.
> The whole kernel wants to speak in pages and be able to grab and
> release them in one way and one way only (get_page() and put_page()).
> 
> What do you think is more likely?  Us teaching the whole entire kernel
> how to hold onto SKB linear data buffers, or the networking fixing
> itself to operate on pages for its header metadata? :-)
> 
> What we'll end up with is likely a hybrid scheme.  High speed devices
> will receive into pages.  And also the skb->data area will be page
> backed and held using get_page()/put_page() references.

So, after fully waking up I think I got your point at last! I thought
all along we were trying to do something more general, while you are
seemingly focused on SG-capable NICs, with myri10ge or niu as the model
to follow. I'm OK with this. Very nice idea and much less work! (It's
enough to just CC all the maintainers!)

> It is not even worth optimizing for skb->data holding the entire
> packet, that's not the case that matters.
> 
> These skb->data areas will thus be 128 bytes plus the skb_shinfo
> structure blob.  They also will be recycled often, rather than held
> onto for long periods of time.
> 
> In fact we can optimize that even further in many ways, for example by
> dropping the skb->data backed memory once the skb is queued to the
> socket receive buffer.  That will make skb->data buffer lifetimes
> miniscule even under heavy receive load.
> 
> In that kind of situation, doing even the stupidest page slicing
> algorithm, similar to what we do now with sk->sk_sndmsg_page, is
> more than adequate, and things like NTA (purely to solve this problem)
> are overengineering.

This is 100% right, except if we try to do something for non-SG NICs
and/or jumbos - IMHO that does require some overengineering.

Jarek P.
diff mbox

Patch

diff --git a/include/net/sock.h b/include/net/sock.h
index 5a3a151..4ded741 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -190,6 +190,8 @@  struct sock_common {
   *	@sk_user_data: RPC layer private data
   *	@sk_sndmsg_page: cached page for sendmsg
   *	@sk_sndmsg_off: cached offset for sendmsg
+  *	@sk_splice_page: cached page for splice
+  *	@sk_splice_off: cached offset for splice
   *	@sk_send_head: front of stuff to transmit
   *	@sk_security: used by security modules
   *	@sk_mark: generic packet mark
@@ -279,6 +281,8 @@  struct sock {
 	struct page		*sk_sndmsg_page;
 	struct sk_buff		*sk_send_head;
 	__u32			sk_sndmsg_off;
+	struct page		*sk_splice_page;
+	__u32			sk_splice_off;
 	int			sk_write_pending;
 #ifdef CONFIG_SECURITY
 	void			*sk_security;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 56272ac..02a1a6c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1334,13 +1334,33 @@  static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
 }
 
 static inline struct page *linear_to_page(struct page *page, unsigned int len,
-					  unsigned int offset)
+					  unsigned int *offset,
+					  struct sk_buff *skb)
 {
-	struct page *p = alloc_pages(GFP_KERNEL, 0);
+	struct sock *sk = skb->sk;
+	struct page *p = sk->sk_splice_page;
+	unsigned int off;
 
-	if (!p)
-		return NULL;
-	memcpy(page_address(p) + offset, page_address(page) + offset, len);
+	if (!p) {
+new_page:
+		p = sk->sk_splice_page = alloc_pages(sk->sk_allocation, 0);
+		if (!p)
+			return NULL;
+
+		off = sk->sk_splice_off = 0;
+		/* we hold one ref to this page until it's full or unneeded */
+	} else {
+		off = sk->sk_splice_off;
+		if (off + len > PAGE_SIZE) {
+			put_page(p);
+			goto new_page;
+		}
+	}
+
+	memcpy(page_address(p) + off, page_address(page) + *offset, len);
+	sk->sk_splice_off += len;
+	*offset = off;
+	get_page(p);
 
 	return p;
 }
@@ -1356,7 +1376,7 @@  static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
 		return 1;
 
 	if (linear) {
-		page = linear_to_page(page, len, offset);
+		page = linear_to_page(page, len, &offset, skb);
 		if (!page)
 			return 1;
 	} else
diff --git a/net/core/sock.c b/net/core/sock.c
index f3a0d08..6b258a9 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1732,6 +1732,8 @@  void sock_init_data(struct socket *sock, struct sock *sk)
 
 	sk->sk_sndmsg_page	=	NULL;
 	sk->sk_sndmsg_off	=	0;
+	sk->sk_splice_page	=	NULL;
+	sk->sk_splice_off	=	0;
 
 	sk->sk_peercred.pid 	=	0;
 	sk->sk_peercred.uid	=	-1;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 19d7b42..cf3d367 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1848,6 +1848,14 @@  void tcp_v4_destroy_sock(struct sock *sk)
 		sk->sk_sndmsg_page = NULL;
 	}
 
+	/*
+	 * If splice cached page exists, toss it.
+	 */
+	if (sk->sk_splice_page) {
+		__free_page(sk->sk_splice_page);
+		sk->sk_splice_page = NULL;
+	}
+
 	percpu_counter_dec(&tcp_sockets_allocated);
 }