Message ID | 20090120093352.GB13806@ff.dom.local
---|---
State | Rejected, archived
Delegated to | David Miller
Hi Jarek.

On Tue, Jan 20, 2009 at 09:33:52AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote:
> > Here is a tiny upgrade to save some memory by reusing a page for more
> > chunks if possible, which I think could be considered, after the
> > testing of the main patch is finished. (There could be also added an
> > additional freeing of this cached page before socket destruction,
> > maybe in tcp_splice_read(), if somebody finds good place.)
>
> OOPS! I did it again... Here is better refcounting.
>
> Jarek P.
>
> --- (take 2)
>
>  include/net/sock.h  |  4 ++++
>  net/core/skbuff.c   | 32 ++++++++++++++++++++++++++------
>  net/core/sock.c     |  2 ++
>  net/ipv4/tcp_ipv4.c |  8 ++++++++
>  4 files changed, 40 insertions(+), 6 deletions(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 5a3a151..4ded741 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -190,6 +190,8 @@ struct sock_common {
>   * @sk_user_data: RPC layer private data
>   * @sk_sndmsg_page: cached page for sendmsg
>   * @sk_sndmsg_off: cached offset for sendmsg
> + * @sk_splice_page: cached page for splice
> + * @sk_splice_off: cached offset for splice

Ugh, increase every socket by 16 bytes... Does TCP one still fit the page?
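The refcounting scheme under discussion — one cached page per socket, carved into chunks, with each chunk holding its own page reference so the page is freed only when the socket cache and every chunk have dropped theirs — can be modeled in plain userspace C. Everything below (`struct fake_page`, `alloc_chunk()`, all names) is a hypothetical illustration of the idea, not the actual patch:

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

/* Userspace stand-in for a kernel page: data plus a reference count. */
struct fake_page {
	char data[PAGE_SIZE];
	int refcnt;
};

/* Cached page and offset, modeled after sk_sndmsg_page/sk_sndmsg_off. */
struct fake_sock {
	struct fake_page *cache_page;
	size_t cache_off;
};

/* Carve a chunk out of the cached page, allocating a fresh page only
 * when the current one cannot hold the chunk.  Each chunk holds one
 * reference on its page, released later with fake_put_page(). */
static char *alloc_chunk(struct fake_sock *sk, size_t len,
			 struct fake_page **pagep)
{
	if (!sk->cache_page || sk->cache_off + len > PAGE_SIZE) {
		struct fake_page *p = calloc(1, sizeof(*p));
		if (!p)
			return NULL;
		p->refcnt = 1;			/* the socket's cache reference */
		if (sk->cache_page && --sk->cache_page->refcnt == 0)
			free(sk->cache_page);	/* drop ref on the old page */
		sk->cache_page = p;
		sk->cache_off = 0;
	}
	sk->cache_page->refcnt++;		/* reference held by this chunk */
	*pagep = sk->cache_page;
	char *chunk = sk->cache_page->data + sk->cache_off;
	sk->cache_off += len;
	return chunk;
}

static void fake_put_page(struct fake_page *p)
{
	if (--p->refcnt == 0)
		free(p);
}
```

With this layout two small chunks share one page instead of pinning a full page each, which is exactly the memory saving the patch aims at.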
On Tue, Jan 20, 2009 at 01:00:43PM +0300, Evgeniy Polyakov wrote:
> Hi Jarek.

Hi Evgeniy.

...
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 5a3a151..4ded741 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -190,6 +190,8 @@ struct sock_common {
> >   * @sk_user_data: RPC layer private data
> >   * @sk_sndmsg_page: cached page for sendmsg
> >   * @sk_sndmsg_off: cached offset for sendmsg
> > + * @sk_splice_page: cached page for splice
> > + * @sk_splice_off: cached offset for splice
>
> Ugh, increase every socket by 16 bytes... Does TCP one still fit the
> page?

Good question! Alas I can't check this soon, but if it's really like this, of course this needs some better idea and rework. (BTW, I'd like to prevent here as much as possible some strange activities like 1 byte (payload) packets getting full pages without any accounting.)

Thanks,
Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jan 20, 2009 at 10:20:53AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote: > Good question! Alas I can't check this soon, but if it's really like > this, of course this needs some better idea and rework. (BTW, I'd like > to prevent here as much as possible some strange activities like 1 > byte (payload) packets getting full pages without any accounting.) I believe approach to meet all our goals is to have own network memory allocator, so that each skb could have its payload in the fragments, we would not suffer from the heavy fragmentation and power-of-two overhead for the larger MTUs, have a reserve for the OOM condition and generally do not depend on the main system behaviour. I will resurrect to some point my network allocator to check how things go in the modern environment, if no one will beat this idea first :) 1. Network (tree) allocator http://www.ioremap.net/projects/nta
On Tue, Jan 20, 2009 at 01:31:22PM +0300, Evgeniy Polyakov wrote: > On Tue, Jan 20, 2009 at 10:20:53AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote: > > Good question! Alas I can't check this soon, but if it's really like > > this, of course this needs some better idea and rework. (BTW, I'd like > > to prevent here as much as possible some strange activities like 1 > > byte (payload) packets getting full pages without any accounting.) > > I believe approach to meet all our goals is to have own network memory > allocator, so that each skb could have its payload in the fragments, we > would not suffer from the heavy fragmentation and power-of-two overhead > for the larger MTUs, have a reserve for the OOM condition and generally > do not depend on the main system behaviour. 100% right! But I guess we need this current fix for -stable, and I'm a bit worried about safety. > > I will resurrect to some point my network allocator to check how things > go in the modern environment, if no one will beat this idea first :) I can't see too much beating of ideas around this problem now... I wish you luck! > > 1. Network (tree) allocator > http://www.ioremap.net/projects/nta > Great, I'll try to learn a bit btw., Jarek P.
From: Jarek Poplawski <jarkao2@gmail.com> Date: Tue, 20 Jan 2009 11:01:44 +0000 > On Tue, Jan 20, 2009 at 01:31:22PM +0300, Evgeniy Polyakov wrote: > > On Tue, Jan 20, 2009 at 10:20:53AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote: > > > Good question! Alas I can't check this soon, but if it's really like > > > this, of course this needs some better idea and rework. (BTW, I'd like > > > to prevent here as much as possible some strange activities like 1 > > > byte (payload) packets getting full pages without any accounting.) > > > > I believe approach to meet all our goals is to have own network memory > > allocator, so that each skb could have its payload in the fragments, we > > would not suffer from the heavy fragmentation and power-of-two overhead > > for the larger MTUs, have a reserve for the OOM condition and generally > > do not depend on the main system behaviour. > > 100% right! But I guess we need this current fix for -stable, and I'm > a bit worried about safety. Jarek, we already have a page and offset you can use. It's called sk_sndmsg_page but that is just the (current) name. Nothing prevents you from reusing it for your purposes here.
On Tue, Jan 20, 2009 at 09:16:16AM -0800, David Miller wrote: ... > Jarek, we already have a page and offset you can use. > > It's called sk_sndmsg_page but that is just the (current) name. > Nothing prevents you from reusing it for your purposes here. I'm trying to get some know-how about this field. Thanks, Jarek P.
On 20-01-2009 11:31, Evgeniy Polyakov wrote: ... > I believe approach to meet all our goals is to have own network memory > allocator, so that each skb could have its payload in the fragments, we > would not suffer from the heavy fragmentation and power-of-two overhead > for the larger MTUs, have a reserve for the OOM condition and generally > do not depend on the main system behaviour. > > I will resurrect to some point my network allocator to check how things > go in the modern environment, if no one will beat this idea first :) > > 1. Network (tree) allocator > http://www.ioremap.net/projects/nta I looked at this a bit, but alas I didn't find much for this Herbert's idea of payload in fragments/pages. Maybe some kind of API RFC is needed before this resurrection? Jarek P.
Hi Jarek. On Mon, Jan 26, 2009 at 08:20:36AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote: > > 1. Network (tree) allocator > > http://www.ioremap.net/projects/nta > > I looked at this a bit, but alas I didn't find much for this Herbert's > idea of payload in fragments/pages. Maybe some kind of API RFC is > needed before this resurrection? Basic idea is to steal some (probably a lot) pages from the slab allocator and put network buffers there without strict need for power-of-two alignment and possible wraps when we add skb_shared_info at the end, so that old e1000 driver required order-4 allocations for the jumbo frames. We can do that in alloc_skb() and friends and put returned buffers into skb's fraglist and updated reference counters for those pages; and with additional copy of the network headers into skb->head.
From: Evgeniy Polyakov <zbr@ioremap.net> Date: Tue, 27 Jan 2009 00:21:30 +0300 > Hi Jarek. > > On Mon, Jan 26, 2009 at 08:20:36AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote: > > > 1. Network (tree) allocator > > > http://www.ioremap.net/projects/nta > > > > I looked at this a bit, but alas I didn't find much for this Herbert's > > idea of payload in fragments/pages. Maybe some kind of API RFC is > > needed before this resurrection? > > Basic idea is to steal some (probably a lot) pages from the slab > allocator and put network buffers there without strict need for > power-of-two alignment and possible wraps when we add skb_shared_info at > the end, so that old e1000 driver required order-4 allocations for the > jumbo frames. We can do that in alloc_skb() and friends and put returned > buffers into skb's fraglist and updated reference counters for those > pages; and with additional copy of the network headers into skb->head. We are going back and forth saying the same thing, I think :-) (BTW, I think NTA is cool and we might do something like that eventually) The basic thing we have to do is make the drivers receive into pages, and then slide the network headers (only) into the linear SKB data area. Even for drivers like NIU and myri10ge that do this, they only use heuristics or some fixed minimum to decide how much to move to the linear area. Result? Some data payload bits end up there because it overshoots. Since we have pskb_may_pull() calls everywhere necessary, which means not in eth_type_trans(), we could just make these drivers (and future drivers converted to operate in this way) only put the ethernet headers there initially. Then the rest of the stack will take care of moving the network and transport payloads there, as necessary. I bet it won't even hurt latency or routing/firewall performance. I did test this with the NIU driver at one point, and it did not change TCP latency nor throughput at all even at 10g speeds. 
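The receive strategy described above — DMA the whole frame into a page, copy only the Ethernet header into the linear area, and let later `pskb_may_pull()`-style calls linearize further headers on demand — can be sketched as a small userspace model (all names here are made up for illustration; the real code works on `struct sk_buff` page fragments):

```c
#include <assert.h>
#include <string.h>

#define ETH_HLEN 14

/* Minimal model of an skb whose payload lives in a page fragment:
 * "linear" holds only what has been pulled so far; the rest stays
 * in the (fake) page the NIC DMAed into. */
struct fake_skb {
	unsigned char linear[128];	/* linear data area */
	size_t linear_len;
	const unsigned char *frag;	/* page fragment with the rest */
	size_t frag_len;
};

/* Driver receive path: keep everything in the page, pull only the
 * Ethernet header into the linear area before handing the skb up. */
static void rx_into_page(struct fake_skb *skb,
			 const unsigned char *pkt, size_t len)
{
	memcpy(skb->linear, pkt, ETH_HLEN);
	skb->linear_len = ETH_HLEN;
	skb->frag = pkt + ETH_HLEN;
	skb->frag_len = len - ETH_HLEN;
}

/* pskb_may_pull() analogue: make sure the first n bytes are linear,
 * copying more out of the fragment on demand. */
static int may_pull(struct fake_skb *skb, size_t n)
{
	if (n > skb->linear_len + skb->frag_len || n > sizeof(skb->linear))
		return 0;
	if (n > skb->linear_len) {
		size_t need = n - skb->linear_len;
		memcpy(skb->linear + skb->linear_len, skb->frag, need);
		skb->frag += need;
		skb->frag_len -= need;
		skb->linear_len = n;
	}
	return 1;
}
```

The point David makes is that the stack already pulls what it needs, so the driver never has to guess: the IP and TCP layers call the pull themselves, and payload bytes stay in the page.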
On Mon, Jan 26, 2009 at 10:10:56PM -0800, David Miller wrote: > From: Evgeniy Polyakov <zbr@ioremap.net> > Date: Tue, 27 Jan 2009 00:21:30 +0300 > > > Hi Jarek. > > > > On Mon, Jan 26, 2009 at 08:20:36AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote: > > > > 1. Network (tree) allocator > > > > http://www.ioremap.net/projects/nta > > > > > > I looked at this a bit, but alas I didn't find much for this Herbert's > > > idea of payload in fragments/pages. Maybe some kind of API RFC is > > > needed before this resurrection? > > > > Basic idea is to steal some (probably a lot) pages from the slab > > allocator and put network buffers there without strict need for > > power-of-two alignment and possible wraps when we add skb_shared_info at > > the end, so that old e1000 driver required order-4 allocations for the > > jumbo frames. We can do that in alloc_skb() and friends and put returned > > buffers into skb's fraglist and updated reference counters for those > > pages; and with additional copy of the network headers into skb->head. I think the main problem is to respect put_page() more, and maybe you mean to add this to your allocator too, but using slab pages for this looks a bit complex to me, but I can miss something. > We are going back and forth saying the same thing, I think :-) > (BTW, I think NTA is cool and we might do something like that > eventually) > > The basic thing we have to do is make the drivers receive into > pages, and then slide the network headers (only) into the linear > SKB data area. As a matter of fact, I wonder if these headers should be always separated. Their "chunk" could be refcounted as well, I guess. Jarek P.
From: David Miller <davem@davemloft.net> Date: Mon, 26 Jan 2009 22:10:56 -0800 (PST) > Even for drivers like NIU and myri10ge that do this, they only > use heuristics or some fixed minimum to decide how much to > move to the linear area. > > Result? Some data payload bits end up there because it overshoots. ... > I did test this with the NIU driver at one point, and it did not > change TCP latency nor throughput at all even at 10g speeds. As a followup, it turns out that NIU right now does this properly. It only pulls a maximum of ETH_HLEN into the linear area before giving the SKB to netif_receive_skb().
From: Jarek Poplawski <jarkao2@gmail.com> Date: Tue, 27 Jan 2009 07:40:48 +0000 > I think the main problem is to respect put_page() more, and maybe you > mean to add this to your allocator too, but using slab pages for this > looks a bit complex to me, but I can miss something. Hmmm, Jarek's comments here made me realize that we might be able to do some hack with cooperation with SLAB. Basically the idea is that if the page count of a SLAB page is greater than one, SLAB will not use that page for new allocations. It's cheesy and the SLAB developers will likely barf at the idea, but it would certainly work. Back to real life, I think long term the thing to do is to just do the cached page allocator thing we'll be doing after Jarek's socket page patch is integrated, and for best performance the driver has to receive its data into pages, only explicitly pulling the ethernet header into the linear area, like NIU does.
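A toy model of the SLAB-cooperation hack floated above: the allocator owns one reference on each of its pages and refuses to hand out new objects from any page whose count is above that, i.e. a page still pinned elsewhere (say, by a splice pipe buffer). All names are invented for the sketch; this is not how SLAB is actually structured:

```c
#include <assert.h>
#include <stddef.h>

#define NPAGES 4

struct toy_page {
	int refcnt;	/* 1 == only the allocator's own reference */
	int in_use;	/* an object is currently carved from this page */
};

static struct toy_page pool[NPAGES];

static void pool_init(void)
{
	for (int i = 0; i < NPAGES; i++) {
		pool[i].refcnt = 1;
		pool[i].in_use = 0;
	}
}

/* Pick a page for a new allocation, skipping externally referenced ones. */
static struct toy_page *alloc_from_pool(void)
{
	for (int i = 0; i < NPAGES; i++) {
		if (!pool[i].in_use && pool[i].refcnt == 1) {
			pool[i].in_use = 1;
			return &pool[i];
		}
	}
	return NULL;
}

/* The object is freed back, but someone else may still pin the page. */
static void free_to_pool(struct toy_page *p)
{
	p->in_use = 0;
}
```

The cost David hints at is visible even here: every allocation now inspects the refcount, and pinned pages sit idle until the external reference is dropped.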
On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote: > From: Jarek Poplawski <jarkao2@gmail.com> > Date: Tue, 27 Jan 2009 07:40:48 +0000 > > > I think the main problem is to respect put_page() more, and maybe you > > mean to add this to your allocator too, but using slab pages for this > > looks a bit complex to me, but I can miss something. > > Hmmm, Jarek's comments here made me realize that we might be > able to do some hack with cooperation with SLAB. > > Basically the idea is that if the page count of a SLAB page > is greater than one, SLAB will not use that page for new > allocations. I thought it was the standard behaviour. That may explain why I did not understand much of previous discussion then :-/ > It's cheesy and the SLAB developers will likely barf at the > idea, but it would certainly work. Maybe that would be enough as a definitive fix for a stable release, so that we can go on with deeper changes in newer versions? > Back to real life, I think long term the thing to do is to just do the > cached page allocator thing we'll be doing after Jarek's socket page > patch is integrated, and for best performance the driver has to > receive its data into pages, only explicitly pulling the ethernet > header into the linear area, like NIU does. Are there NICs out there able to do that themselves or does the driver need to rely on complex hacks in order to achieve this? Willy
From: Willy Tarreau <w@1wt.eu> Date: Fri, 30 Jan 2009 22:59:20 +0100 > On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote: > > It's cheesy and the SLAB developers will likely barf at the > > idea, but it would certainly work. > > Maybe that would be enough as a definitive fix for a stable > release, so that we can go on with deeper changes in newer > versions? Such a check could have performance ramifications, I wouldn't risk it and already I intend to push Jarek's page allocator splice fix back to -stable eventually. > > Back to real life, I think long term the thing to do is to just do the > > cached page allocator thing we'll be doing after Jarek's socket page > > patch is integrated, and for best performance the driver has to > > receive its data into pages, only explicitly pulling the ethernet > > header into the linear area, like NIU does. > > Are there NICs out there able to do that themselves or does the > driver need to rely on complex hacks in order to achieve this? Any NIC, even the dumbest ones, can be made to receive into pages.
On Fri, Jan 30, 2009 at 02:03:46PM -0800, David Miller wrote: > From: Willy Tarreau <w@1wt.eu> > Date: Fri, 30 Jan 2009 22:59:20 +0100 > > > On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote: > > > It's cheesy and the SLAB developers will likely barf at the > > > idea, but it would certainly work. > > > > Maybe that would be enough as a definitive fix for a stable > > release, so that we can go on with deeper changes in newer > > versions? > > Such a check could have performance ramifications, I wouldn't > risk it and already I intend to push Jarek's page allocator > splice fix back to -stable eventually. OK. > > > Back to real life, I think long term the thing to do is to just do the > > > cached page allocator thing we'll be doing after Jarek's socket page > > > patch is integrated, and for best performance the driver has to > > > receive its data into pages, only explicitly pulling the ethernet > > > header into the linear area, like NIU does. > > > > Are there NICs out there able to do that themselves or does the > > driver need to rely on complex hacks in order to achieve this? > > Any NIC, even the dumbest ones, can be made to receive into pages. OK I thought that it was not always easy to split between headers and payload. I know that myri10ge can be configured to receive into either skbs or pages, but I was not sure about the real implications behind that. Thanks Willy
From: Willy Tarreau <w@1wt.eu> Date: Fri, 30 Jan 2009 23:13:46 +0100 > On Fri, Jan 30, 2009 at 02:03:46PM -0800, David Miller wrote: > > Any NIC, even the dumbest ones, can be made to receive into pages. > > OK I thought that it was not always easy to split between headers > and payload. I know that myri10ge can be configured to receive into > either skbs or pages, but I was not sure about the real implications > behind that. For a dumb NIC you wouldn't split, you'd receive directly, the entire packet, into part of a page. Then right before you give it to the stack, you pull the ethernet header from the page into the linear area.
On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote: > > Hmmm, Jarek's comments here made me realize that we might be > able to do some hack with cooperation with SLAB. > > Basically the idea is that if the page count of a SLAB page > is greater than one, SLAB will not use that page for new > allocations. > > It's cheesy and the SLAB developers will likely barf at the > idea, but it would certainly work. I'm not going anywhere near that discussion :) > Back to real life, I think long term the thing to do is to just do the > cached page allocator thing we'll be doing after Jarek's socket page > patch is integrated, and for best performance the driver has to > receive its data into pages, only explicitly pulling the ethernet > header into the linear area, like NIU does. Yes that sounds like the way to go. Cheers,
On Sat, Jan 31, 2009 at 09:16:04AM +1100, Herbert Xu wrote: > On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote: ... > > Back to real life, I think long term the thing to do is to just do the > > cached page allocator thing we'll be doing after Jarek's socket page > > patch is integrated, and for best performance the driver has to > > receive its data into pages, only explicitly pulling the ethernet > > header into the linear area, like NIU does. > > Yes that sounds like the way to go. Looks like a lot of changes in drivers, plus: would it work with jumbo frames? I wonder why the linear area can't be allocated as paged, and freed with put_page() instead of kfree(skb->head) in skb_release_data(). Actually, at least for some time, there could be used both of these methods (on paged alloc failure) with some ifs in skb_release_data() and spd_fill_page() (to check if linear_to_page() is needed). Jarek P.
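The dual release path Jarek suggests — a linear area that may come either from a refcounted page (released with put_page(), so splice can keep a reference) or from the classic kmalloc'ed head — can be sketched in userspace. The structures and names below are illustrative only, not the real skb_release_data():

```c
#include <assert.h>
#include <stdlib.h>

/* Userspace stand-in for a refcounted page. */
struct fake_page {
	char data[4096];
	int refcnt;
};

struct fake_skb {
	char *head;
	struct fake_page *head_page;	/* NULL => head was kmalloc'ed */
};

static int pages_freed;			/* instrumentation for the sketch */

static void fake_put_page(struct fake_page *p)
{
	if (--p->refcnt == 0) {
		free(p);
		pages_freed++;
	}
}

/* skb_release_data() analogue with both release paths. */
static void release_data(struct fake_skb *skb)
{
	if (skb->head_page)
		fake_put_page(skb->head_page);	/* paged linear area */
	else
		free(skb->head);		/* classic kmalloc'ed head */
}
```

The win is that a splice pipe buffer could then pin the linear area directly with a page reference, instead of needing linear_to_page()-style copies.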
From: Jarek Poplawski <jarkao2@gmail.com> Date: Mon, 2 Feb 2009 08:08:55 +0000 > On Sat, Jan 31, 2009 at 09:16:04AM +1100, Herbert Xu wrote: > > On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote: > ... > > > Back to real life, I think long term the thing to do is to just do the > > > cached page allocator thing we'll be doing after Jarek's socket page > > > patch is integrated, and for best performance the driver has to > > > receive its data into pages, only explicitly pulling the ethernet > > > header into the linear area, like NIU does. > > > > Yes that sounds like the way to go. > > Looks like a lot of changes in drivers, plus: would it work with jumbo > frames? I wonder why the linear area can't be allocated as paged, and > freed with put_page() instead of kfree(skb->head) in skb_release_data(). Allocating 4096 or 8192 bytes for a 1500 byte frame is wasteful.
On Mon, Feb 02, 2009 at 12:18:54AM -0800, David Miller wrote: > From: Jarek Poplawski <jarkao2@gmail.com> > Date: Mon, 2 Feb 2009 08:08:55 +0000 > > > On Sat, Jan 31, 2009 at 09:16:04AM +1100, Herbert Xu wrote: > > > On Fri, Jan 30, 2009 at 01:42:27PM -0800, David Miller wrote: > > ... > > > > Back to real life, I think long term the thing to do is to just do the > > > > cached page allocator thing we'll be doing after Jarek's socket page > > > > patch is integrated, and for best performance the driver has to > > > > receive its data into pages, only explicitly pulling the ethernet > > > > header into the linear area, like NIU does. > > > > > > Yes that sounds like the way to go. > > > > Looks like a lot of changes in drivers, plus: would it work with jumbo > > frames? I wonder why the linear area can't be allocated as paged, and > > freed with put_page() instead of kfree(skb->head) in skb_release_data(). > > Allocating 4096 or 8192 bytes for a 1500 byte frame is wasteful. I mean allocating chunks of cached pages similarly to the sk_sndmsg_page way. I guess the similar problem is to be worked out in any case. But it seems doing it on the linear area requires fewer changes in other places. Jarek P.
From: Jarek Poplawski <jarkao2@gmail.com> Date: Mon, 2 Feb 2009 08:43:58 +0000 > On Mon, Feb 02, 2009 at 12:18:54AM -0800, David Miller wrote: > > Allocating 4096 or 8192 bytes for a 1500 byte frame is wasteful. > > I mean allocating chunks of cached pages similarly to the sk_sndmsg_page > way. I guess the similar problem is to be worked out in any case. But > it seems doing it on the linear area requires fewer changes in other > places. This is a very interesting idea, but it has some drawbacks: 1) Just like any other allocator we'll need to find a way to handle > PAGE_SIZE allocations, and thus add handling for compound pages etc. And exactly the drivers that want such huge SKB data areas on receive should be converted to use scatter gather page vectors in order to avoid multi-order pages and thus strains on the page allocator. 2) Space wastage and poor packing can be an issue. Even with SLAB/SLUB we get poor packing, look at Evgeniy's graphs that he made when writing his NTA patches. Now, when choosing a way to move forward, I'm willing to accept a little bit of the issues in #2 for the sake of avoiding the issues in #1 above. Jarek, note that we can just keep your current splice() copy hacks in there. And as a result we can have an easier to handle migration path. We just do the page RX allocation conversions in the drivers where performance really matters, for hardware a lot of people have. That's a lot smoother and has fewer issues than converting the system-wide SKB allocator upside down.
On Mon, Feb 02, 2009 at 11:50:17PM -0800, David Miller wrote: > From: Jarek Poplawski <jarkao2@gmail.com> > Date: Mon, 2 Feb 2009 08:43:58 +0000 > > > On Mon, Feb 02, 2009 at 12:18:54AM -0800, David Miller wrote: > > > Allocating 4096 or 8192 bytes for a 1500 byte frame is wasteful. > > > > I mean allocating chunks of cached pages similarly to the sk_sndmsg_page > > way. I guess the similar problem is to be worked out in any case. But > > it seems doing it on the linear area requires fewer changes in other > > places. > > This is a very interesting idea, but it has some drawbacks: > > 1) Just like any other allocator we'll need to find a way to > handle > PAGE_SIZE allocations, and thus add handling for > compound pages etc. > > And exactly the drivers that want such huge SKB data areas > on receive should be converted to use scatter gather page > vectors in order to avoid multi-order pages and thus strains > on the page allocator. I guess compound pages are handled by put_page() enough, but I don't think they should be main argument here, and I agree: scatter gather should be used where possible. > > 2) Space wastage and poor packing can be an issue. > > Even with SLAB/SLUB we get poor packing, look at Evgeniy's > graphs that he made when writing his NTA patches. I'm a bit lost here: could you "remind" the way page space would be used/saved in your paged variant e.g. for ~1500B skbs? > > Now, when choosing a way to move forward, I'm willing to accept a > little bit of the issues in #2 for the sake of avoiding the > issues in #1 above. > > Jarek, note that we can just keep your current splice() copy hacks in > there. And as a result we can have an easier to handle migration > path. We just do the page RX allocation conversions in the drivers > where performance really matters, for hardware a lot of people have. > > That's a lot smoother and has fewer issues than converting the system- > wide SKB allocator upside down. > Yes, this looks reasonable.
On the other hand, I think it would be nice to get some opinions of slab folks (incl. Evgeniy) on the expected efficiency of such a solution. (It seems releasing with put_page() will always have some cost with delayed reusing and/or waste of space.) Jarek P.
On Tue, Feb 03, 2009 at 09:41:08AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote: > > 1) Just like any other allocator we'll need to find a way to > > handle > PAGE_SIZE allocations, and thus add handling for > > compound pages etc. > > > > And exactly the drivers that want such huge SKB data areas > > on receive should be converted to use scatter gather page > > vectors in order to avoid multi-order pages and thus strains > > on the page allocator. > > I guess compound pages are handled by put_page() enough, but I don't > think they should be main argument here, and I agree: scatter gather > should be used where possible. Problem is to allocate them, since with the time memory will be quite fragmented, which will not allow to find a big enough page. NTA tried to solve this by not allowing to free the data allocated on the different CPU, contrary to what SLAB does. Modulo cache coherency improvements, it allows to combine freed chunks back into the pages and combine them in turn to get bigger contiguous areas suitable for the drivers which were not converted to use the scatter gather approach. I even believe that for some hardware it is the only way to deal with the jumbo frames. > > 2) Space wastage and poor packing can be an issue. > > > > Even with SLAB/SLUB we get poor packing, look at Evegeniy's > > graphs that he made when writing his NTA patches. > > I'm a bit lost here: could you "remind" the way page space would be > used/saved in your paged variant e.g. for ~1500B skbs? At least in NTA I used cache line alignment for smaller chunks, while SLAB uses power of two. Thus for 1500 MTU SLAB wastes about 500 bytes per packet (modulo size of the shared info structure). > Yes, this looks reasonable. On the other hand, I think it would be > nice to get some opinions of slab folks (incl. Evgeniy) on the expected > efficiency of such a solution. (It seems releasing with put_page() will > always have some cost with delayed reusing and/or waste of space.) 
Well, my opinion is rather biased here :)
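The arithmetic behind Evgeniy's "about 500 bytes per packet" figure is easy to check: a kmalloc-style allocator rounds the buffer up to the next power of two, while a cache-line based allocator (as in NTA) rounds up to 64 bytes. The numbers below are illustrative only; a real skb also adds struct skb_shared_info on top of the 1500 bytes:

```c
#include <assert.h>

/* Round n up to the next power of two, as a SLAB-style allocator does. */
static unsigned round_pow2(unsigned n)
{
	unsigned r = 1;
	while (r < n)
		r <<= 1;
	return r;
}

/* Round n up to a 64-byte cache line, as a cache-line allocator does. */
static unsigned round_cacheline(unsigned n)
{
	return (n + 63) & ~63u;
}
```

For a 1500-byte MTU frame the power-of-two rounding lands on 2048 bytes, wasting 548 bytes, while cache-line rounding lands on 1536 bytes, wasting only 36.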
On Tue, Feb 03, 2009 at 02:10:12PM +0300, Evgeniy Polyakov wrote: > > I even believe that for some hardware it is the only way to deal > with the jumbo frames. Not necessarily. Even if the hardware can only DMA into contiguous memory, we can always allocate a sufficient number of contiguous buffers initially, and then always copy them into fragmented skbs at receive time. This way the contiguous buffers are never depleted. Granted copying sucks, but this is really because the underlying hardware is badly designed. Also copying is way better than not receiving at all due to memory fragmentation. Cheers,
On Saturday 31 January 2009 08:42:27 David Miller wrote: > From: Jarek Poplawski <jarkao2@gmail.com> > Date: Tue, 27 Jan 2009 07:40:48 +0000 > > > I think the main problem is to respect put_page() more, and maybe you > > mean to add this to your allocator too, but using slab pages for this > > looks a bit complex to me, but I can miss something. > > Hmmm, Jarek's comments here made me realize that we might be > able to do some hack with cooperation with SLAB. > > Basically the idea is that if the page count of a SLAB page > is greater than one, SLAB will not use that page for new > allocations. Wouldn't your caller need to know what objects are already allocated in that page too? > It's cheesy and the SLAB developers will likely barf at the > idea, but it would certainly work. It is nasty, yes. Using the page allocator directly seems like a better approach. And btw. be careful of using page->_count for anything, due to speculative page references... basically it is useful only to test zero or non-zero refcount. If designing a new scheme for the network layer, it would be nicer to begin by using say _mapcount or private or some other field in there for a refcount (and I have a patch to avoid the atomic put_page_testzero in page freeing for a caller that does their own refcounting, so don't fear that extra overhead too much :)).
On Tue, Feb 03, 2009 at 10:24:31PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote: > > I even believe that for some hardware it is the only way to deal > > with the jumbo frames. > > Not necessarily. Even if the hardware can only DMA into contiguous > memory, we can always allocate a sufficient number of contiguous > buffers initially, and then always copy them into fragmented skbs > at receive time. This way the contiguous buffers are never > depleted. How many such preallocated frames are enough? Is it enough to have all sockets' recv buffer sizes divided by the MTU size? Or just some of them, or... That will work but there are way too many corner cases. > Granted copying sucks, but this is really because the underlying > hardware is badly designed. Also copying is way better than > not receiving at all due to memory fragmentation. Maybe just do not allow jumbo frames when memory is fragmented enough and fall back to the smaller MTU in this case? With LRO/GRO stuff there should not be that much overhead compared to multiple-page copies.
On Tue, Feb 03, 2009 at 02:49:44PM +0300, Evgeniy Polyakov wrote: > How many such preallocated frames is enough? Does it enough to have all > sockets recv buffer sizes divided by the MTU size? Or just some of them, > or... That will work but there are way too many corner cases. Easy, the driver is already allocating them right now so we don't have to change a thing :) All we have to do is change the refill mechanism to always allocate a replacement skb in the rx path, and if that fails, allocate a fragmented skb instead and copy the received data into it so that the contiguous skb can be reused. Cheers,
On Tue, Feb 03, 2009 at 10:53:13PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote: > > How many such preallocated frames is enough? Does it enough to have all > > sockets recv buffer sizes divided by the MTU size? Or just some of them, > > or... That will work but there are way too many corner cases. > > Easy, the driver is already allocating them right now so we don't > have to change a thing :) How many? A hundred or so descriptors (or even several thousands) - this really does not scale for somewhat loaded IO servers, that's why we frequently get questions why dmesg is filled with order-3 and higher allocation failure dumps. > All we have to do is change the refill mechanism to always allocate > a replacement skb in the rx path, and if that fails, allocate a > fragmented skb instead and copy the received data into it so that > the contiguous skb can be reused. Having a 'reserve' skb per socket is a good idea, but what if the number of sockets is way too big?
On Tue, Feb 03, 2009 at 03:07:15PM +0300, Evgeniy Polyakov wrote: > > How many? A hundred or so descriptors (or even several thousands) - > this really does not scale for the somewhat loaded IO servers, that's > why we frequently get questions why dmesg is filler with order-3 and > higher allocation failure dumps. I think you've misunderstood my suggested scheme. I'm suggesting that we keep the driver initialisation path as is, so however many skb's the driver is allocating at open() time remains unchanged. Usually this would be the number of entries on the ring buffer. We can't do any better than that since if the hardware can't do SG then you'll just have to find this many contiguous buffers. The only change we need to make is at receive time. Instead of always pushing the received skb into the stack, we should try to allocate a linear replacement skb, and if that fails, allocate a fragmented skb and copy the data into it. That way we can always push a linear skb back into the ring buffer. Cheers,
On Tue, Feb 03, 2009 at 05:05:14AM -0800, david@lang.hm (david@lang.hm) wrote: > >Maybe just do not allow jumbo frames when memory is fragmented enough > >and fallback to the smaller MTU in this case? With LRO/GRO stuff there > >should be not that much of the overhead compared to multiple-page > >copies. > > > 1. define 'fragmented enough' When the allocator can not provide the requested amount of data. > 2. the packet size was already negotiated on your existing connections, > how are you going to change all those on the fly? I.e. MTU can not be changed on the fly? Magic world. > 3. what do you do when a remote system sends you a large packet? drop it > on the floor? We already do just that when a jumbo frame can not be allocated :) > having some pool of large buffers to receive into (and copy out of those > buffers as quickly as possible) would cause a performance hit when things > get bad, but isn't that better than dropping packets? It is a solution, but I think it will behave noticeably worse than with a decreased MTU. > as for the number of buffers to use. make a reasonable guess. if you only > have a small number of packets around, use the buffers directly, as you > use more of them start copying, as useage climbs attempt to allocate more. > if you can't allocate more (and you have all of your existing ones in use) > you will have to drop the packet, but at that point are you really in any > worse shape than if you didn't have some mechanism to copy out of the > large buffers? That's the main point: how to deal with broken hardware? I think (but have no strong numbers though) that having 6 packets with 1500 MTU combined into a GRO/LRO frame will be processed way faster than copying a 9k MTU into 3 pages and processing a single skb.
On Tue, Feb 03, 2009 at 03:12:19PM +0300, Evgeniy Polyakov wrote: > > It is a solution, but I think it will behave noticebly worse than > with decresed MTU. Not necessarily. Remember GSO/GRO in essence are just hacks to get around the fact that we can't increase the MTU to where we want it to be. MTU reduces the cost over the entire path while GRO/GSO only do so for the sender and the receiver. In other words when given the choice between a larger MTU with copying or GRO, the larger MTU will probably win anyway as it's optimising the entire path rather than just the receiver. > That's the main point: how to deal with broken hardware? I think (but > have no strong numbers though) that having 6 packets with 1500 MTU > combined into GRO/LRO frame will be processed way faster than copying 9k > MTU into 3 pages and process single skb. Please note that with my scheme, you'd only start copying if you can't allocate a linear skb. So if memory fragmentation doesn't happen then there is no copying at all. Cheers,
On Tue, Feb 03, 2009 at 11:12:09PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote: > The only change we need to make is at receive time. Instead of > always pushing the received skb into the stack, we should try to > allocate a linear replacement skb, and if that fails, allocate > a fragmented skb and copy the data into it. That way we can > always push a linear skb back into the ring buffer. Yes, that was the part about a 'reserve' buffer for the sockets you cut :) I agree that this will work and will be better than nothing, but copying 9kb into 3 pages is a rather CPU-hungry operation, and I think (but have no numbers though) that the system will behave faster if the MTU is reduced to the standard one. Another solution is to have a proper allocator which will be able to defragment the data, if talking about the alternatives to the drop. So: 1. copy the whole jumbo skb into a fragmented one 2. reduce the MTU 3. rely on the allocator For the 'good' hardware and drivers nothing from the above is really needed.
On Tue, Feb 03, 2009 at 03:18:36PM +0300, Evgeniy Polyakov wrote: > On Tue, Feb 03, 2009 at 11:12:09PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote: > > The only change we need to make is at receive time. Instead of > > always pushing the received skb into the stack, we should try to > > allocate a linear replacement skb, and if that fails, allocate > > a fragmented skb and copy the data into it. That way we can > > always push a linear skb back into the ring buffer. > > Yes, that's was the part about 'reserve' buffer for the sockets you cut > :) > > I agree that this will work and will be better than nothing, but copying > 9kb into 3 pages is rather CPU hungry operation, and I think (but have > no numbers though) that system will behave faster if MTU is reduced to > the standard one. Well, FWIW, I've always observed better performance with 4k MTU (4080 to be precise) than with 9K, and I think that the overhead of allocating 3 contiguous pages is a major reason for this. > Another solution is to have a proper allocator which will be able to > defragment the data, if talking about the alternatives to the drop. > > So: > 1. copy the whole jumbo skb into fragmented one > 2. reduce the MTU you'll not reduce MTU of established connections though. And trying to advertise MSS changes in the middle of a TCP connection is an awful hack which I think will not work everywhere. > 3. rely on the allocator > > For the 'good' hardware and drivers nothing from the above is really needed. > > -- > Evgeniy Polyakov Willy
On Tue, Feb 03, 2009 at 03:18:36PM +0300, Evgeniy Polyakov wrote: > > I agree that this will work and will be better than nothing, but copying > 9kb into 3 pages is rather CPU hungry operation, and I think (but have > no numbers though) that system will behave faster if MTU is reduced to > the standard one. Reducing the MTU can create all sorts of problems so it should be avoided if at all possible. These days, path MTU discovery is haphazard at best. In fact MTU problems are the main reason why jumbo frames simply don't get deployed. > Another solution is to have a proper allocator which will be able to > defragment the data, if talking about the alternatives to the drop. Sure, if we can create an allocator that can guarantee contiguous allocations all the time then by all means go for it. But until we get there, doing what I suggested is way better than stopping the receiving process altogether. > So: > 1. copy the whole jumbo skb into fragmented one > 2. reduce the MTU > 3. rely on the allocator Yes, improving the allocator would obviously increase the performance, however, there is nothing against employing both methods. I'd always avoid reducing the MTU at run-time though. > For the 'good' hardware and drivers nothing from the above is really needed. Right, that's why there is a point beyond which improving the allocator is no longer worthwhile. Cheers,
On Tue, Feb 03, 2009 at 01:25:35PM +0100, Willy Tarreau wrote: > > > So: > > 1. copy the whole jumbo skb into fragmented one > > 2. reduce the MTU > > you'll not reduce MTU of established connections though. And trying to > advertise MSS changes in the middle of a TCP connection is an awful > hack which I think will not work everywhere. Not to mention that Ethernet isn't just IP, so the protocol that is being used might not have a concept of path MTU discovery at all. Cheers,
On Tue, Feb 03, 2009 at 11:18:08PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote: > > It is a solution, but I think it will behave noticebly worse than > > with decresed MTU. > > Not necessarily. Remember GSO/GRO in essence are just hacks to > get around the fact that we can't increase the MTU to where we > want it to be. MTU reduces the cost over the entire path while > GRO/GSO only do so for the sender and the receiver. > > In other words when given the choice between a larger MTU with > copying or GRO, the larger MTU will probably win anyway as it's > optimising the entire path rather than just the receiver. Well, we both do not have the data and very likely will not change our opinions :) But we can continue the discussion in case something interesting appears. For example I can hack up the e1000e driver to do a dumb copy of 9k each time it has received a jumbo frame and compare it with the usual 1.5k MTU performance. But given that modern CPUs are loafing with noticeably big IO chunks, this may only show that CPU usage was increased by the copy. Still, it may work. > > That's the main point: how to deal with broken hardware? I think (but > > have no strong numbers though) that having 6 packets with 1500 MTU > > combined into GRO/LRO frame will be processed way faster than copying 9k > > MTU into 3 pages and process single skb. > > Please note that with my scheme, you'd only start copying if you > can't allocate a linear skb. So if memory fragmentation doesn't > happen then there is no copying at all. Yes, absolutely.
On Tuesday 03 February 2009 23:18:08 Herbert Xu wrote: > On Tue, Feb 03, 2009 at 03:12:19PM +0300, Evgeniy Polyakov wrote: > > It is a solution, but I think it will behave noticebly worse than > > with decresed MTU. > > Not necessarily. Remember GSO/GRO in essence are just hacks to > get around the fact that we can't increase the MTU to where we > want it to be. MTU reduces the cost over the entire path while > GRO/GSO only do so for the sender and the receiver. > > In other words when given the choice between a larger MTU with > copying or GRO, the larger MTU will probably win anyway as it's > optimising the entire path rather than just the receiver. > > > That's the main point: how to deal with broken hardware? I think (but > > have no strong numbers though) that having 6 packets with 1500 MTU > > combined into GRO/LRO frame will be processed way faster than copying 9k > > MTU into 3 pages and process single skb. > > Please note that with my scheme, you'd only start copying if you > can't allocate a linear skb. So if memory fragmentation doesn't > happen then there is no copying at all. This sounds like a really nice idea (to the layman)!
On Tue, Feb 03, 2009 at 03:30:15PM +0300, Evgeniy Polyakov wrote: > > But we can proceed the discussion in case something interesting will > appear. For example I can hack up e1000e driver to do a dumb copy of 9k > each time it has received a jumbo frame and compare it with usual 1.5k > MTU performance. But getting that modern CPUs are loafing with noticebly > big IO chunks, this may only show that CPU was increased with the copy. > But still may work. Comparing performance is pointless because the only time you need to do the copy is when the allocator has failed. So there is *no* alternative to copying, regardless of how slow it is. You can always improve the allocator whether we do this copying fallback or not. Cheers,
On Tue, Feb 03, 2009 at 02:10:12PM +0300, Evgeniy Polyakov wrote: > On Tue, Feb 03, 2009 at 09:41:08AM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote: > > > 1) Just like any other allocator we'll need to find a way to > > > handle > PAGE_SIZE allocations, and thus add handling for > > > compound pages etc. > > > > > > And exactly the drivers that want such huge SKB data areas > > > on receive should be converted to use scatter gather page > > > vectors in order to avoid multi-order pages and thus strains > > > on the page allocator. > > > > I guess compound pages are handled by put_page() enough, but I don't > > think they should be main argument here, and I agree: scatter gather > > should be used where possible. > > Problem is to allocate them, since with the time memory will be > quite fragmented, which will not allow to find a big enough page. Yes, it's a problem, but I don't think the main one. Since we're currently concerned with zero-copy for splice I think we could concentrate on the most common cases, and treat jumbo frames with best effort only: if there are free compound pages - fine, otherwise we fall back to slab and copy in splice. > > NTA tried to solve this by not allowing to free the data allocated on > the different CPU, contrary to what SLAB does. Modulo cache coherency > improvements, it allows to combine freed chunks back into the pages and > combine them in turn to get bigger contiguous areas suitable for the > drivers which were not converted to use the scatter gather approach. > I even believe that for some hardware it is the only way to deal > with the jumbo frames. > > > > 2) Space wastage and poor packing can be an issue. > > > > > > Even with SLAB/SLUB we get poor packing, look at Evgeniy's > > > graphs that he made when writing his NTA patches. > > > > I'm a bit lost here: could you "remind" the way page space would be > > used/saved in your paged variant e.g. for ~1500B skbs? 
> > At least in NTA I used cache line alignment for smaller chunks, while > SLAB uses power of two. Thus for 1500 MTU SLAB wastes about 500 bytes > per packet (modulo size of the shared info structure). > > > Yes, this looks reasonable. On the other hand, I think it would be > > nice to get some opinions of slab folks (incl. Evgeniy) on the expected > > efficiency of such a solution. (It seems releasing with put_page() will > > always have some cost with delayed reusing and/or waste of space.) > > Well, my opinion is rather biased here :) I understand NTA could be better than slabs in the above-mentioned cases, but I'm not sure you explained your point enough on solving this zero-copy problem vs. NTA? Jarek P.
On Tue, 3 Feb 2009, Evgeniy Polyakov wrote: > On Tue, Feb 03, 2009 at 10:24:31PM +1100, Herbert Xu (herbert@gondor.apana.org.au) wrote: >>> I even believe that for some hardware it is the only way to deal >>> with the jumbo frames. >> >> Not necessarily. Even if the hardware can only DMA into contiguous >> memory, we can always allocate a sufficient number of contiguous >> buffers initially, and then always copy them into fragmented skbs >> at receive time. This way the contiguous buffers are never >> depleted. > > How many such preallocated frames is enough? Does it enough to have all > sockets recv buffer sizes divided by the MTU size? Or just some of them, > or... That will work but there are way too many corner cases. > >> Granted copying sucks, but this is really because the underlying >> hardware is badly designed. Also copying is way better than >> not receiving at all due to memory fragmentation. > > Maybe just do not allow jumbo frames when memory is fragmented enough > and fallback to the smaller MTU in this case? With LRO/GRO stuff there > should be not that much of the overhead compared to multiple-page > copies. 1. define 'fragmented enough' 2. the packet size was already negotiated on your existing connections, how are you going to change all those on the fly? 3. what do you do when a remote system sends you a large packet? drop it on the floor? having some pool of large buffers to receive into (and copy out of those buffers as quickly as possible) would cause a performance hit when things get bad, but isn't that better than dropping packets? as for the number of buffers to use. make a reasonable guess. if you only have a small number of packets around, use the buffers directly, as you use more of them start copying, as usage climbs attempt to allocate more. 
if you can't allocate more (and you have all of your existing ones in use) you will have to drop the packet, but at that point are you really in any worse shape than if you didn't have some mechanism to copy out of the large buffers? David Lang
On Tue, Feb 03, 2009 at 12:36:28PM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote: > I understand NTA could be better than slabs in above-mentioned cases, > but I'm not sure you explaind enough your point on solving this > zero-copy problem vs. NTA? NTA steals pages from the SLAB so we can maintain any reference counter logic in them, so the linear part of the skb may not be really freed/reused until the reference counter hits zero.
On Tue, Feb 03, 2009 at 04:06:06PM +0300, Evgeniy Polyakov wrote: > On Tue, Feb 03, 2009 at 12:36:28PM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote: > > I understand NTA could be better than slabs in above-mentioned cases, > > but I'm not sure you explaind enough your point on solving this > > zero-copy problem vs. NTA? > > NTA steals pages from the SLAB so we can maintain any reference counter > logic in them, so linear part of the skb may be not really freed/reused > until reference counter hits zero. Now it's clear. So this looks like one of the options considered by David. Then I wonder about details... It seems some kind of scheduled browsing for refcounts is needed or is there something better? Jarek P.
On Tue, Feb 03, 2009 at 01:25:37PM +0000, Jarek Poplawski (jarkao2@gmail.com) wrote: > Now it's clear. So this looks like one of the options considered by > David. Then I wonder about details... It seems some kind of scheduled > browsing for refcounts is needed or is there something better? It depends on the implementation; for example, each kfree() may check the reference counter and return the page to the allocator when it is really free. Since a page may contain multiple objects, its reference counter may hit zero someday in the future, or never reach it if the data is not freed.
From: Evgeniy Polyakov <zbr@ioremap.net> Date: Tue, 3 Feb 2009 14:10:12 +0300 > NTA tried to solve this by not allowing to free the data allocated on > the different CPU, contrary to what SLAB does. Modulo cache coherency > improvements, This could kill performance on NUMA systems if we are not careful. If we ever consider NTA seriously, these issues would need to be performance tested.
From: Herbert Xu <herbert@gondor.apana.org.au> Date: Tue, 3 Feb 2009 22:24:31 +1100 > Not necessarily. Even if the hardware can only DMA into contiguous > memory, we can always allocate a sufficient number of contiguous > buffers initially, and then always copy them into fragmented skbs > at receive time. This way the contiguous buffers are never > depleted. > > Granted copying sucks, but this is really because the underlying > hardware is badly designed. Also copying is way better than > not receiving at all due to memory fragmentation. This scheme sounds very reasonable.
From: Willy Tarreau <w@1wt.eu> Date: Tue, 3 Feb 2009 13:25:35 +0100 > Well, FWIW, I've always observed better performance with 4k MTU (4080 to > be precise) than with 9K, and I think that the overhead of allocating 3 > contiguous pages is a major reason for this. With what hardware? If it's with myri10ge, that driver uses page frags so would not be using 3 contiguous pages even for jumbo frames.
On Tue, Feb 03, 2009 at 04:47:34PM -0800, David Miller wrote: > From: Willy Tarreau <w@1wt.eu> > Date: Tue, 3 Feb 2009 13:25:35 +0100 > > > Well, FWIW, I've always observed better performance with 4k MTU (4080 to > > be precise) than with 9K, and I think that the overhead of allocating 3 > > contiguous pages is a major reason for this. > > With what hardware? If it's with myri10ge, that driver uses page > frags so would not be using 3 contiguous pages even for jumbo frames. Yes myri10ge for the optimal 4080, but with e1000 too (though I don't remember the exact optimal value, I think it was slightly lower). For the myri10ge, could this be caused by the cache footprint then ? I can also retry with various values between 4 and 9k, including values close to 8k. Maybe the fact that 4k is better than 9 is because we get better filling of all pages ? I also remember having used a 7 kB MTU on e1000 and dl2k in the past. BTW, 7k MTU on my NFS server which uses e1000 definitely stopped the allocation failures which were polluting the logs, so it's been running with that setting for years now. Willy
On Tue, Feb 03, 2009 at 04:46:09PM -0800, David Miller (davem@davemloft.net) wrote: > > NTA tried to solve this by not allowing to free the data allocated on > > the different CPU, contrary to what SLAB does. Modulo cache coherency > > improvements, > > This could kill performance on NUMA systems if we are not careful. > > If we ever consider NTA seriously, these issues would need to > be performance tested. Quite the contrary, I think. Memory is allocated and freed on the same CPU, which means on the same memory domain, closest to the CPU in question. I did not test NUMA though, but NTA performance on the usual CPU (it is 2.5 years old already :) was noticeably good.
On Wed, Feb 04, 2009 at 07:19:47AM +0100, Willy Tarreau (w@1wt.eu) wrote: > Yes myri10ge for the optimal 4080, but with e1000 too (though I don't > remember the exact optimal value, I think it was slightly lower). Very likely it is related to the allocator - the same allocation overhead to get a page, but 2.5 times bigger frame. > For the myri10ge, could this be caused by the cache footprint then ? > I can also retry with various values between 4 and 9k, including > values close to 8k. Maybe the fact that 4k is better than 9 is > because we get better filling of all pages ? > > I also remember having used a 7 kB MTU on e1000 and dl2k in the past. > BTW, 7k MTU on my NFS server which uses e1000 definitely stopped the > allocation failures which were polluting the logs, so it's been running > with that setting for years now. Recent e1000 (e1000e) uses fragments, so it does not suffer from the high-order allocation failures.
On Wed, Feb 04, 2009 at 11:12:01AM +0300, Evgeniy Polyakov wrote: > On Wed, Feb 04, 2009 at 07:19:47AM +0100, Willy Tarreau (w@1wt.eu) wrote: > > Yes myri10ge for the optimal 4080, but with e1000 too (though I don't > > remember the exact optimal value, I think it was slightly lower). > > Very likely it is related to the allocator - the same allocation > overhead to get a page, but 2.5 times bigger frame. > > > For the myri10ge, could this be caused by the cache footprint then ? > > I can also retry with various values between 4 and 9k, including > > values close to 8k. Maybe the fact that 4k is better than 9 is > > because we get better filling of all pages ? > > > > I also remember having used a 7 kB MTU on e1000 and dl2k in the past. > > BTW, 7k MTU on my NFS server which uses e1000 definitely stopped the > > allocation failures which were polluting the logs, so it's been running > > with that setting for years now. > > Recent e1000 (e1000e) uses fragments, so it does not suffer from the > high-order allocation failures. My server is running 2.4 :-), but I observed the same issues with older 2.6 as well. I can certainly imagine that things have changed a lot since, but the initial point remains : jumbo frames are expensive to deal with, and with recent NICs and drivers, we might get close performance for little additional cost. After all, initial justification for jumbo frames was the devastating interrupt rate and all NICs coalesce interrupts now. So if we can optimize all the infrastructure for extremely fast processing of standard frames (1500) and still support jumbo frames in a suboptimal mode, I think it could be a very good trade-off. Regards, willy
On Wed, Feb 04, 2009 at 09:54:32AM +0100, Willy Tarreau wrote: > > My server is running 2.4 :-), but I observed the same issues with older > 2.6 as well. I can certainly imagine that things have changed a lot since, > but the initial point remains : jumbo frames are expensive to deal with, > and with recent NICs and drivers, we might get close performance for > little additional cost. After all, initial justification for jumbo frames > was the devastating interrupt rate and all NICs coalesce interrupts now. This is total crap! Jumbo frames are way better than any of the hacks (such as GSO) that people have come up with to get around it. The only reason we are not using it as much is because of this nasty thing called the Internet. Cheers,
From: Herbert Xu <herbert@gondor.apana.org.au> Date: Wed, 4 Feb 2009 19:59:07 +1100 > On Wed, Feb 04, 2009 at 09:54:32AM +0100, Willy Tarreau wrote: > > > > My server is running 2.4 :-), but I observed the same issues with older > > 2.6 as well. I can certainly imagine that things have changed a lot since, > > but the initial point remains : jumbo frames are expensive to deal with, > > and with recent NICs and drivers, we might get close performance for > > little additional cost. After all, initial justification for jumbo frames > > was the devastating interrupt rate and all NICs coalesce interrupts now. > > This is total crap! Jumbo frames are way better than any of the > hacks (such as GSO) that people have come up with to get around it. > The only reason we are not using it as much is because of this > nasty thing called the Internet. Completely agreed. If Jumbo frames are slower, it is NOT some fundamental issue. It is rather due to some misdesign of the hardware or its driver.
On Wed, Feb 04, 2009 at 01:01:46AM -0800, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Wed, 4 Feb 2009 19:59:07 +1100
>
> > On Wed, Feb 04, 2009 at 09:54:32AM +0100, Willy Tarreau wrote:
> > >
> > > My server is running 2.4 :-), but I observed the same issues with older
> > > 2.6 as well. I can certainly imagine that things have changed a lot since,
> > > but the initial point remains: jumbo frames are expensive to deal with,
> > > and with recent NICs and drivers, we might get close performance for
> > > little additional cost. After all, the initial justification for jumbo frames
> > > was the devastating interrupt rate, and all NICs coalesce interrupts now.
> >
> > This is total crap! Jumbo frames are way better than any of the
> > hacks (such as GSO) that people have come up with to get around it.
> > The only reason we are not using it as much is because of this
> > nasty thing called the Internet.
>
> Completely agreed.
>
> If jumbo frames are slower, it is NOT some fundamental issue. It is
> rather due to some misdesign of the hardware or its driver.

Agreed we can't use them *because* of the Internet, but this
limitation has forced hardware designers to find valid alternatives.
For instance, having the ability to reach 10 Gbps with 1500-byte
frames on myri10ge with low CPU usage is a real achievement. This
is "only" 800 kpps after all.

And the arbitrary choice of 9k for jumbo frames was total crap too.
It's clear that no hardware designer was involved in the process.
They have to stuff 16kB of RAM on a NIC to use only 9. And we need
to allocate 3 pages for slightly more than 2. 7.5 kB would have been
better in this regard.

I still find it nice to lower CPU usage with frames larger than 1500,
but given that this is rarely used (even in datacenters), I think our
efforts should concentrate on where the real users are, i.e. <1500.

Regards,
Willy
From: Willy Tarreau <w@1wt.eu>
Date: Wed, 4 Feb 2009 07:19:47 +0100

> On Tue, Feb 03, 2009 at 04:47:34PM -0800, David Miller wrote:
> > From: Willy Tarreau <w@1wt.eu>
> > Date: Tue, 3 Feb 2009 13:25:35 +0100
> >
> > > Well, FWIW, I've always observed better performance with 4k MTU (4080 to
> > > be precise) than with 9k, and I think that the overhead of allocating 3
> > > contiguous pages is a major reason for this.
> >
> > With what hardware? If it's with myri10ge, that driver uses page
> > frags so would not be using 3 contiguous pages even for jumbo frames.
>
> Yes, myri10ge for the optimal 4080, but with e1000 too (though I don't
> remember the exact optimal value, I think it was slightly lower).
>
> For the myri10ge, could this be caused by the cache footprint then?
> I can also retry with various values between 4 and 9k, including
> values close to 8k. Maybe the fact that 4k is better than 9k is
> because we get better filling of all the pages?

Looking quickly, myri10ge's buffer manager is incredibly simplistic,
so it wastes a lot of memory and gives terrible cache behavior.

When using a jumbo MTU it just gives whole pages to the chip. So it
looks like, assuming a 4096-byte PAGE_SIZE and a 9000-byte jumbo MTU,
the chip will allocate for a full-size frame:

	FULL PAGE
	FULL PAGE
	FULL PAGE

and only ~1K of that last full page will be utilized. The headers
will therefore always land on the same cache lines, and PAGE_SIZE-~1K
will be wasted.

Whereas for < PAGE_SIZE MTU selections, it will give MTU-sized blocks
to the chip for packet data allocation.
From: Willy Tarreau <w@1wt.eu>
Date: Wed, 4 Feb 2009 10:12:17 +0100

> And the arbitrary choice of 9k for jumbo frames was total crap too.
> It's clear that no hardware designer was involved in the process.

Willy, do some research, this is also completely wrong.

Alteon effectively created jumbo MTUs with their Acenic chips, since
that was the first chip to ever do it. Those design decisions were
made by hardware engineers only.

I think this is where I will stop taking part in this part of the
discussion. Every posting is full of misinformation, and I've got
better things to do than to refute them every 5 minutes. :-/
On Wednesday 04 February 2009 19:08:51 Evgeniy Polyakov wrote:
> On Tue, Feb 03, 2009 at 04:46:09PM -0800, David Miller (davem@davemloft.net) wrote:
> > > NTA tried to solve this by not allowing to free the data allocated on
> > > a different CPU, contrary to what SLAB does. Modulo cache coherency
> > > improvements,
> >
> > This could kill performance on NUMA systems if we are not careful.
> >
> > If we ever consider NTA seriously, these issues would need to
> > be performance tested.
>
> Quite the contrary, I think. Memory is allocated and freed on the same
> CPU, which means on the same memory domain, closest to the CPU in
> question.
>
> I did not test NUMA though, but NTA performance on a usual CPU (it is
> 2.5 years old already :) was noticeably good.

I had a quick look at NTA... I didn't understand much of it yet, but
the remote freeing scheme is kind of like what I did for slqb. The
freeing CPU queues objects back to the CPU that allocated them, which
eventually checks the queue and frees them itself.

I don't know how much cache coherency you gain from this -- in most
slab allocations, I think the object tends to be cache-hot on the CPU
that frees it. I'm doing it mainly to try to avoid locking... I guess
that makes for a cache coherency benefit in itself.

If NTA does significantly better than the slab allocator, I would be
quite interested. It might be something that we can learn from and use
in the general slab allocator (or maybe something more network-specific
that NTA does).
David Miller <davem@davemloft.net> writes:

> From: Herbert Xu <herbert@gondor.apana.org.au>
>> Granted copying sucks, but this is really because the underlying
>> hardware is badly designed. Also copying is way better than
>> not receiving at all due to memory fragmentation.
>
> This scheme sounds very reasonable.

Would it be possible to add a counter somewhere for this, or is that
too expensive?

/Benny
Benny Amorsen <benny+usenet@amorsen.dk> wrote:
>
> Would it be possible to add a counter somewhere for this, or is that
> too expensive?

Yes, a counter would be useful and is reasonable. But you can probably
deduce it by just looking at slabinfo.

Cheers,
> And the arbitrary choice of 9k for jumbo frames was total crap too.
> It's clear that no hardware designer was involved in the process.
> They have to stuff 16kB of RAM on a NIC to use only 9. And we need
> to allocate 3 pages for slightly more than 2. 7.5 kB would have been
> better in this regard.

9K was not totally arbitrary. The CRC used for checksumming ethernet
packets has a probability of undetected errors that goes up above
eleven-thousand-something bytes. So the real limit is ~11000 bytes,
and I believe ~9000 was chosen to be able to carry 8K NFS payloads +
all XDR and transport headers without fragmentation.

 - R.
On Wed, Feb 04, 2009 at 11:19:06AM -0800, Roland Dreier wrote:
> > And the arbitrary choice of 9k for jumbo frames was total crap too.
> > It's clear that no hardware designer was involved in the process.
> > They have to stuff 16kB of RAM on a NIC to use only 9. And we need
> > to allocate 3 pages for slightly more than 2. 7.5 kB would have been
> > better in this regard.
>
> 9K was not totally arbitrary. The CRC used for checksumming ethernet
> packets has a probability of undetected errors that goes up above
> eleven-thousand-something bytes. So the real limit is ~11000 bytes,
> and I believe ~9000 was chosen to be able to carry 8K NFS payloads +
> all XDR and transport headers without fragmentation.

Yes, I know that initial motivation. But IMHO it was a purely
functional motivation, without real consideration of the implications.
When you read Alteon's initial proposal, there is even a biased
analysis (they compare the fill ratio obtained with one 8k frame to
that of six 1.5k frames). Their own argument does not stand with their
final proposal! I think there might already have been people pushing
for 8k and not 9k by then, but in order to get wide acceptance in
datacenters, they had to please the NFS admins.

Now it's useless to speculate on history, and I agree with Davem that
we're wasting our time with this discussion, so let's go back to the
keyboard ;-)

Willy
On Wed, Feb 04, 2009 at 08:28:20PM +0100, Willy Tarreau wrote:
...
> Now it's useless to speculate on history and I agree with Davem that
> we're wasting our time with this discussion, let's go back to the
> keyboard ;-)

Davem knows everything, so he is ...wrong. This is a very useful
discussion. Go back to the keyboards, please :-)

Jarek P.
On Wed, 4 Feb 2009, Willy Tarreau wrote:

> On Wed, Feb 04, 2009 at 01:01:46AM -0800, David Miller wrote:
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Date: Wed, 4 Feb 2009 19:59:07 +1100
> >
> > > On Wed, Feb 04, 2009 at 09:54:32AM +0100, Willy Tarreau wrote:
> > > >
> > > > My server is running 2.4 :-), but I observed the same issues with older
> > > > 2.6 as well. I can certainly imagine that things have changed a lot since,
> > > > but the initial point remains: jumbo frames are expensive to deal with,
> > > > and with recent NICs and drivers, we might get close performance for
> > > > little additional cost. After all, the initial justification for jumbo frames
> > > > was the devastating interrupt rate, and all NICs coalesce interrupts now.
> > >
> > > This is total crap! Jumbo frames are way better than any of the
> > > hacks (such as GSO) that people have come up with to get around it.
> > > The only reason we are not using it as much is because of this
> > > nasty thing called the Internet.
> >
> > Completely agreed.
> >
> > If jumbo frames are slower, it is NOT some fundamental issue. It is
> > rather due to some misdesign of the hardware or its driver.
>
> Agreed we can't use them *because* of the Internet, but this
> limitation has forced hardware designers to find valid alternatives.
> For instance, having the ability to reach 10 Gbps with 1500-byte
> frames on myri10ge with low CPU usage is a real achievement. This
> is "only" 800 kpps after all.
>
> And the arbitrary choice of 9k for jumbo frames was total crap too.
> It's clear that no hardware designer was involved in the process.
> They have to stuff 16kB of RAM on a NIC to use only 9. And we need
> to allocate 3 pages for slightly more than 2. 7.5 kB would have been
> better in this regard.
>
> I still find it nice to lower CPU usage with frames larger than 1500,
> but given the fact that this is rarely used (even in datacenters), I
> think our efforts should concentrate on where the real users are, ie
> <1500.

Those in the HPC realm use 9000-byte jumbo frames because it makes a
major performance difference, especially across large-RTT paths, and
the Internet2 backbone fully supports 9000-byte jumbo frames (with
some wishing we could support much larger frame sizes).

Local environment:

9000 byte jumbo frames:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
11818.1875 MB / 10.01 sec = 9905.9707 Mbps 100 %TX 76 %RX 0 retrans 0.15 msRTT

4080 byte MTU:

[root@lang2 ~]# nuttcp -w10m 192.168.88.16
 9171.6875 MB / 10.02 sec = 7680.7663 Mbps 100 %TX 99 %RX 0 retrans 0.19 msRTT

The performance impact is even more pronounced on a large-RTT path,
such as the following netem-emulated 80 ms RTT path:

9000 byte jumbo frames:

[root@lang2 ~]# nuttcp -T30 -w80m 192.168.89.15
25904.2500 MB / 30.16 sec = 7205.8755 Mbps 96 %TX 55 %RX 0 retrans 82.73 msRTT

4080 byte MTU:

[root@lang2 ~]# nuttcp -T30 -w80m 192.168.89.15
 8650.0129 MB / 30.25 sec = 2398.8862 Mbps 33 %TX 19 %RX 2371 retrans 81.98 msRTT

And if there's any loss in the path, the performance difference is
also dramatic, as here across a real MAN environment with about a
1 ms RTT:

9000 byte jumbo frames:

[root@chance9 ~]# nuttcp -w20m 192.168.88.8
 7711.8750 MB / 10.05 sec = 6436.2406 Mbps 82 %TX 96 %RX 261 retrans 0.92 msRTT

4080 byte MTU:

[root@chance9 ~]# nuttcp -w20m 192.168.88.8
 4551.0625 MB / 10.08 sec = 3786.2108 Mbps 50 %TX 95 %RX 42 retrans 0.95 msRTT

All testing was with myri10ge on the transmitter side (2.6.20.7
kernel). So my experience has definitely been that 9000-byte jumbo
frames are a major performance win for high-throughput applications.

-Bill
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue, 3 Feb 2009 09:41:08 +0000

> Yes, this looks reasonable. On the other hand, I think it would be
> nice to get some opinions of slab folks (incl. Evgeniy) on the expected
> efficiency of such a solution. (It seems releasing with put_page() will
> always have some cost with delayed reusing and/or waste of space.)

I think we can't avoid using carved-up pages for skb->data in the end.
The whole kernel wants to speak in pages and be able to grab and
release them in one way and one way only (get_page() and put_page()).

What do you think is more likely? Us teaching the entire kernel how to
hold onto SKB linear data buffers, or the networking fixing itself to
operate on pages for its header metadata? :-)

What we'll end up with is likely a hybrid scheme. High-speed devices
will receive into pages. And the skb->data area will also be
page-backed and held using get_page()/put_page() references.

It is not even worth optimizing for skb->data holding the entire
packet; that's not the case that matters.

These skb->data areas will thus be 128 bytes plus the skb_shinfo
structure blob. They will also be recycled often, rather than held
onto for long periods of time.

In fact we can optimize that even further in many ways, for example by
dropping the skb->data backed memory once the skb is queued to the
socket receive buffer. That will make skb->data buffer lifetimes
minuscule even under heavy receive load.

In that kind of situation, doing even the stupidest page-slicing
algorithm, similar to what we do now with sk->sk_sndmsg_page, is
more than adequate, and things like NTA (purely to solve this problem)
are overengineering.
On Thu, Feb 05, 2009 at 11:52:58PM -0800, David Miller wrote:
>
> In fact we can optimize that even further in many ways, for example by
> dropping the skb->data backed memory once the skb is queued to the
> socket receive buffer. That will make skb->data buffer lifetimes
> miniscule even under heavy receive load.

Indeed, while I was doing the tun accounting stuff and reviewing the
rx accounting users, it was apparent that we don't need to carry most
of this stuff in our receive queues.

This is almost the opposite of the skbless data idea for TSO.

Cheers,
On Thu, Feb 05, 2009 at 11:52:58PM -0800, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Tue, 3 Feb 2009 09:41:08 +0000
>
> > Yes, this looks reasonable. On the other hand, I think it would be
> > nice to get some opinions of slab folks (incl. Evgeniy) on the expected
> > efficiency of such a solution. (It seems releasing with put_page() will
> > always have some cost with delayed reusing and/or waste of space.)
>
> I think we can't avoid using carved up pages for skb->data in the end.
> The whole kernel wants to speak in pages and be able to grab and
> release them in one way and one way only (get_page() and put_page()).
>
> What do you think is more likely? Us teaching the whole entire kernel
> how to hold onto SKB linear data buffers, or the networking fixing
> itself to operate on pages for its header metadata? :-)

This idea looks very reasonable, except I wonder why nobody else has
needed this kind of mm interface. Another question: it seems many
mechanisms like fast searching, defragmentation, etc. could be reused.

> What we'll end up with is likely a hybrid scheme. High speed devices
> will receive into pages. And also the skb->data area will be page
> backed and held using get_page()/put_page() references.
>
> It is not even worth optimizing for skb->data holding the entire
> packet, that's not the case that matters.
>
> These skb->data areas will thus be 128 bytes plus the skb_shinfo
> structure blob. They also will be recycled often, rather than held
> onto for long periods of time.

Looks fine, except: you mentioned dumb NICs, which would need this
page space on receive anyway. BTW, don't they need it on transmit
again?

> In fact we can optimize that even further in many ways, for example by
> dropping the skb->data backed memory once the skb is queued to the
> socket receive buffer. That will make skb->data buffer lifetimes
> miniscule even under heavy receive load.
>
> In that kind of situation, doing even the most stupidest page slicing
> algorithm, similar to what we do now with sk->sk_sndmsg_page, is
> more than adequate and things like NTA (purely to solve this problem)
> is overengineering.

Hmm... I don't get it. It seems these slabs do a lot of advanced work,
and still some people like Evgeniy or Nick thought it's not enough,
and even found it worth their time to rework this. There is also a
question of memory accounting: do you think admins don't care if we
give away, say, 25% additionally?

Jarek P.
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Fri, 6 Feb 2009 09:10:34 +0000

> Hmm... I don't get it. It seems these slabs do a lot of advanced work,
> and still some people like Evgeniy or Nick thought it's not enough,
> and even found it worth of their time to rework this.

Note that, at least to some extent, the memory allocators are
duplicating some of the locality and NUMA logic that's already present
in the page allocator itself, except that they are handling the fact
that objects are moving around instead of pages.

Also keep in mind that we might want to encourage drivers to make use
of the SKB recycling mechanisms we have. That will decrease lifetimes,
and thus the wastage and locality issues, immensely.

We truly want something different from what the general-purpose
allocator provides: namely, a reference-countable buffer.

And all I'm saying is that since the page allocator provides that
facility, and using pages solves all of the splice() et al. problems,
building something extremely simple on top of the page allocator seems
to be a good way to go.
On Fri, Feb 06, 2009 at 09:10:34AM +0000, Jarek Poplawski wrote:
>
> Looks fine, except: you mentioned dumb NICs, which would need this
> page space on receive, anyway. BTW, don't they need this on transmit
> again?

A lot more NICs support SG on tx than on rx.

Cheers,
On Fri, Feb 06, 2009 at 01:17:22AM -0800, David Miller wrote:
...
> And all I'm saying is that since the page allocator provides that
> facility, and using pages solves all of the splice() et al. problems,
> building something extremely simple on top of the page allocator seems
> to be a good way to go.

This is all absolutely right if we can afford it: the simpler it is,
the more memory is wasted. I hope you're right, but on the other hand,
people don't use slob by default.

Jarek P.
From: Jarek Poplawski <jarkao2@gmail.com>
Date: Fri, 6 Feb 2009 09:42:53 +0000

> I hope you're right, but on the other hand, people don't use slob by
> default.

The use is different, so it's a bad comparison, really.

SLAB has to satisfy all kinds of object lifetimes, all kinds of sizes
and use cases, and perform generally well in all situations with zero
knowledge about usage patterns.

Networking is strictly the opposite of that set of requirements. We
know all of these parameters (or their ranges) and can, on top of
that, control them if necessary.
On Fri, Feb 06, 2009 at 08:23:26PM +1100, Herbert Xu wrote:
> On Fri, Feb 06, 2009 at 09:10:34AM +0000, Jarek Poplawski wrote:
> >
> > Looks fine, except: you mentioned dumb NICs, which would need this
> > page space on receive, anyway. BTW, don't they need this on transmit
> > again?
>
> A lot more NICs support SG on tx than rx.

OK, but since there is not so much difference, and we need to waste it
in some cases anyway, plus handle it later in some special way, I'm a
bit in doubt.

Jarek P.
On Fri, Feb 06, 2009 at 09:51:20AM +0000, Jarek Poplawski wrote:
>
> OK, but since there is not so much difference, and we need to waste
> it in some cases anyway, plus handle it later some special way, I'm
> a bit in doubt.

Well, the thing is, cards that don't support SG on tx probably don't
support jumbo frames either.

Cheers,
On Fri, Feb 06, 2009 at 09:28:22PM +1100, Herbert Xu wrote:
> On Fri, Feb 06, 2009 at 09:51:20AM +0000, Jarek Poplawski wrote:
> >
> > OK, but since there is not so much difference, and we need to waste
> > it in some cases anyway, plus handle it later some special way, I'm
> > a bit in doubt.
>
> Well the thing is cards that don't support SG on tx probably
> don't support jumbo frames either.

?? I mean this 128-byte chunk would be hard to reuse after copying to
skb->data, and if reused, we could miss it for some NICs on TX, so
the whole packet would need a copy.

BTW, David mentioned that something simple like sk_sndmsg_page would
be enough, but I guess not for these non-SG NICs. We have to allocate
bigger chunks for them, so there is more fragmentation to handle.

Cheers,
Jarek P.
On Fri, Feb 06, 2009 at 10:58:07AM +0000, Jarek Poplawski wrote:
> On Fri, Feb 06, 2009 at 09:28:22PM +1100, Herbert Xu wrote:
> > On Fri, Feb 06, 2009 at 09:51:20AM +0000, Jarek Poplawski wrote:
> > >
> > > OK, but since there is not so much difference, and we need to waste
> > > it in some cases anyway, plus handle it later some special way, I'm
> > > a bit in doubt.
> >
> > Well the thing is cards that don't support SG on tx probably
> > don't support jumbo frames either.
>
> ?? I mean this 128 byte chunk would be hard to reuse after copying
> to skb->data, and if reused, we could miss this for some NICs on TX,
> so the whole packet would need a copy.

Couldn't we stuff up to 32 128-byte chunks in a page and use a 32-bit
map to indicate which slot is used and which one is free? It would
just be a matter of calling ffz() to find a spare place in a page.
Also, that bitmap might serve as a refcount, because if it drops to
zero, it means all slots are unused, and -1 means all slots are used.

This would reduce wastage if we need to allocate 128 bytes often.

Willy
On Fri, Feb 06, 2009 at 12:10:15PM +0100, Willy Tarreau wrote:
> On Fri, Feb 06, 2009 at 10:58:07AM +0000, Jarek Poplawski wrote:
> > On Fri, Feb 06, 2009 at 09:28:22PM +1100, Herbert Xu wrote:
> > > On Fri, Feb 06, 2009 at 09:51:20AM +0000, Jarek Poplawski wrote:
> > > >
> > > > OK, but since there is not so much difference, and we need to waste
> > > > it in some cases anyway, plus handle it later some special way, I'm
> > > > a bit in doubt.
> > >
> > > Well the thing is cards that don't support SG on tx probably
> > > don't support jumbo frames either.
> >
> > ?? I mean this 128 byte chunk would be hard to reuse after copying
> > to skb->data, and if reused, we could miss this for some NICs on TX,
> > so the whole packet would need a copy.
>
> couldn't we stuff up to 32 128-byte chunks in a page and use a 32-bit
> map to indicate which slot is used and which one is free ? This would
> just be a matter of calling ffz() to find one spare place in a page.
> Also, that bitmap might serve as a refcount, because if it drops to
> zero, it means all slots are unused. And -1 means all slots are used.
>
> This would reduce wastage if we need to allocate 128 bytes often.

Something like this would be useful for SG NICs for the paged
skb->data area. But I'm concerned with the non-SG ones: if I got it
right, for 1500-byte packets we need to allocate such a chunk, copy
128+ bytes to skb->data, and have 128+ bytes unused. If there are
later such short packets, we don't need this space: why copy?

But even if we find a way to use this, or don't reserve such space
while receiving from an SG NIC, then we will need to copy both chunks
to another page for TX on non-SG NICs.

Jarek P.
David Miller wrote, On 02/06/2009 08:52 AM:

> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Tue, 3 Feb 2009 09:41:08 +0000
>
>> Yes, this looks reasonable. On the other hand, I think it would be
>> nice to get some opinions of slab folks (incl. Evgeniy) on the expected
>> efficiency of such a solution. (It seems releasing with put_page() will
>> always have some cost with delayed reusing and/or waste of space.)
>
> I think we can't avoid using carved up pages for skb->data in the end.
> The whole kernel wants to speak in pages and be able to grab and
> release them in one way and one way only (get_page() and put_page()).
>
> What do you think is more likely? Us teaching the whole entire kernel
> how to hold onto SKB linear data buffers, or the networking fixing
> itself to operate on pages for its header metadata? :-)
>
> What we'll end up with is likely a hybrid scheme. High speed devices
> will receive into pages. And also the skb->data area will be page
> backed and held using get_page()/put_page() references.

So, after a full awakening, I think I got your point at last! I
thought all the time that we were trying to do something more general,
while you're seemingly focused on SG-capable NICs, with myri10ge or
niu as the model to follow. I'm OK with this. Very nice idea, and much
less work! (It's only enough to CC all the maintainers!)

> It is not even worth optimizing for skb->data holding the entire
> packet, that's not the case that matters.
>
> These skb->data areas will thus be 128 bytes plus the skb_shinfo
> structure blob. They also will be recycled often, rather than held
> onto for long periods of time.
>
> In fact we can optimize that even further in many ways, for example by
> dropping the skb->data backed memory once the skb is queued to the
> socket receive buffer. That will make skb->data buffer lifetimes
> miniscule even under heavy receive load.
>
> In that kind of situation, doing even the most stupidest page slicing
> algorithm, similar to what we do now with sk->sk_sndmsg_page, is
> more than adequate and things like NTA (purely to solve this problem)
> is overengineering.

This is 100% right, except if we try to do something with non-SG NICs
and/or jumbos - IMHO that requires some overengineering.

Jarek P.
diff --git a/include/net/sock.h b/include/net/sock.h
index 5a3a151..4ded741 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -190,6 +190,8 @@ struct sock_common {
   *	@sk_user_data: RPC layer private data
   *	@sk_sndmsg_page: cached page for sendmsg
   *	@sk_sndmsg_off: cached offset for sendmsg
+  *	@sk_splice_page: cached page for splice
+  *	@sk_splice_off: cached offset for splice
   *	@sk_send_head: front of stuff to transmit
   *	@sk_security: used by security modules
   *	@sk_mark: generic packet mark
@@ -279,6 +281,8 @@ struct sock {
 	struct page		*sk_sndmsg_page;
 	struct sk_buff		*sk_send_head;
 	__u32			sk_sndmsg_off;
+	struct page		*sk_splice_page;
+	__u32			sk_splice_off;
 	int			sk_write_pending;
 #ifdef CONFIG_SECURITY
 	void			*sk_security;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 56272ac..02a1a6c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1334,13 +1334,33 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
 }
 
 static inline struct page *linear_to_page(struct page *page, unsigned int len,
-					  unsigned int offset)
+					  unsigned int *offset,
+					  struct sk_buff *skb)
 {
-	struct page *p = alloc_pages(GFP_KERNEL, 0);
+	struct sock *sk = skb->sk;
+	struct page *p = sk->sk_splice_page;
+	unsigned int off;
 
-	if (!p)
-		return NULL;
-	memcpy(page_address(p) + offset, page_address(page) + offset, len);
+	if (!p) {
+new_page:
+		p = sk->sk_splice_page = alloc_pages(sk->sk_allocation, 0);
+		if (!p)
+			return NULL;
+
+		off = sk->sk_splice_off = 0;
+		/* we hold one ref to this page until it's full or unneeded */
+	} else {
+		off = sk->sk_splice_off;
+		if (off + len > PAGE_SIZE) {
+			put_page(p);
+			goto new_page;
+		}
+	}
+
+	memcpy(page_address(p) + off, page_address(page) + *offset, len);
+	sk->sk_splice_off += len;
+	*offset = off;
+	get_page(p);
 
 	return p;
 }
@@ -1356,7 +1376,7 @@ static inline int spd_fill_page(struct splice_pipe_desc *spd, struct page *page,
 		return 1;
 
 	if (linear) {
-		page = linear_to_page(page, len, offset);
+		page = linear_to_page(page, len, &offset, skb);
 		if (!page)
 			return 1;
 	} else
diff --git a/net/core/sock.c b/net/core/sock.c
index f3a0d08..6b258a9 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1732,6 +1732,8 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 
 	sk->sk_sndmsg_page	=	NULL;
 	sk->sk_sndmsg_off	=	0;
+	sk->sk_splice_page	=	NULL;
+	sk->sk_splice_off	=	0;
 
 	sk->sk_peercred.pid	=	0;
 	sk->sk_peercred.uid	=	-1;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 19d7b42..cf3d367 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1848,6 +1848,14 @@ void tcp_v4_destroy_sock(struct sock *sk)
 		sk->sk_sndmsg_page = NULL;
 	}
 
+	/*
+	 * If splice cached page exists, toss it.
+	 */
+	if (sk->sk_splice_page) {
+		__free_page(sk->sk_splice_page);
+		sk->sk_splice_page = NULL;
+	}
+
 	percpu_counter_dec(&tcp_sockets_allocated);
 }