Message ID: 1286859925.30423.184.camel@edumazet-laptop
State: Accepted, archived
Delegated to: David Miller
Acked-by: Tom Herbert <therbert@google.com> On Mon, Oct 11, 2010 at 10:05 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > Le mardi 12 octobre 2010 à 01:22 +0200, Eric Dumazet a écrit : >> Le mardi 12 octobre 2010 à 01:03 +0200, Eric Dumazet a écrit : >> > >> > For multi queue devices, it makes more sense to allocate skb on local >> > node of the cpu handling RX interrupts. This allow each cpu to >> > manipulate its own slub/slab queues/structures without doing expensive >> > cross-node business. >> > >> > For non multi queue devices, IRQ affinity should be set so that a cpu >> > close to the device services interrupts. Even if not set, using >> > dev_alloc_skb() is faster. >> > >> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> >> >> Or maybe revert : >> >> commit b30973f877fea1a3fb84e05599890fcc082a88e5 >> Author: Christoph Hellwig <hch@lst.de> >> Date: Wed Dec 6 20:32:36 2006 -0800 >> >> [PATCH] node-aware skb allocation >> >> Node-aware allocation of skbs for the receive path. >> >> Details: >> >> - __alloc_skb gets a new node argument and cals the node-aware >> slab functions with it. >> - netdev_alloc_skb passed the node number it gets from dev_to_node >> to it, everyone else passes -1 (any node) >> >> Signed-off-by: Christoph Hellwig <hch@lst.de> >> Cc: Christoph Lameter <clameter@engr.sgi.com> >> Cc: "David S. Miller" <davem@davemloft.net> >> Signed-off-by: Andrew Morton <akpm@osdl.org> >> >> >> Apparently, only Christoph and Andrew signed it. >> >> > > [PATCH net-next] net: allocate skbs on local node > > commit b30973f877 (node-aware skb allocation) spread a wrong habit of > allocating net drivers skbs on a given memory node : The one closest to > the NIC hardware. This is wrong because as soon as we try to scale > network stack, we need to use many cpus to handle traffic and hit > slub/slab management on cross-node allocations/frees when these cpus > have to alloc/free skbs bound to a central node. 
> > skb allocated in RX path are ephemeral, they have a very short > lifetime : Extra cost to maintain NUMA affinity is too expensive. What > appeared as a nice idea four years ago is in fact a bad one. > > In 2010, NIC hardwares are multiqueue, or we use RPS to spread the load, > and two 10Gb NIC might deliver more than 28 million packets per second, > needing all the available cpus. > > Cost of cross-node handling in network and vm stacks outperforms the > small benefit hardware had when doing its DMA transfert in its 'local' > memory node at RX time. Even trying to differentiate the two allocations > done for one skb (the sk_buff on local node, the data part on NIC > hardware node) is not enough to bring good performance. > > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> > --- > include/linux/skbuff.h | 20 ++++++++++++++++---- > net/core/skbuff.c | 13 +------------ > 2 files changed, 17 insertions(+), 16 deletions(-) > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > index 0b53c43..05a358f 100644 > --- a/include/linux/skbuff.h > +++ b/include/linux/skbuff.h > @@ -496,13 +496,13 @@ extern struct sk_buff *__alloc_skb(unsigned int size, > static inline struct sk_buff *alloc_skb(unsigned int size, > gfp_t priority) > { > - return __alloc_skb(size, priority, 0, -1); > + return __alloc_skb(size, priority, 0, NUMA_NO_NODE); > } > > static inline struct sk_buff *alloc_skb_fclone(unsigned int size, > gfp_t priority) > { > - return __alloc_skb(size, priority, 1, -1); > + return __alloc_skb(size, priority, 1, NUMA_NO_NODE); > } > > extern bool skb_recycle_check(struct sk_buff *skb, int skb_size); > @@ -1563,13 +1563,25 @@ static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev, > return skb; > } > > -extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask); > +/** > + * __netdev_alloc_page - allocate a page for ps-rx on a specific device > + * @dev: network device to receive on > + * @gfp_mask: 
alloc_pages_node mask > + * > + * Allocate a new page. dev currently unused. > + * > + * %NULL is returned if there is no free memory. > + */ > +static inline struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask) > +{ > + return alloc_pages_node(NUMA_NO_NODE, gfp_mask, 0); > +} > > /** > * netdev_alloc_page - allocate a page for ps-rx on a specific device > * @dev: network device to receive on > * > - * Allocate a new page node local to the specified device. > + * Allocate a new page. dev currently unused. > * > * %NULL is returned if there is no free memory. > */ > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 752c197..4e8b82e 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -247,10 +247,9 @@ EXPORT_SYMBOL(__alloc_skb); > struct sk_buff *__netdev_alloc_skb(struct net_device *dev, > unsigned int length, gfp_t gfp_mask) > { > - int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1; > struct sk_buff *skb; > > - skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node); > + skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, NUMA_NO_NODE); > if (likely(skb)) { > skb_reserve(skb, NET_SKB_PAD); > skb->dev = dev; > @@ -259,16 +258,6 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, > } > EXPORT_SYMBOL(__netdev_alloc_skb); > > -struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask) > -{ > - int node = dev->dev.parent ? 
dev_to_node(dev->dev.parent) : -1; > - struct page *page; > - > - page = alloc_pages_node(node, gfp_mask, 0); > - return page; > -} > -EXPORT_SYMBOL(__netdev_alloc_page); > - > void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off, > int size) > { > > > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 12 Oct 2010 07:05:25 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote: > Le mardi 12 octobre 2010 à 01:22 +0200, Eric Dumazet a écrit : > > Le mardi 12 octobre 2010 à 01:03 +0200, Eric Dumazet a écrit : > > > > > > For multi queue devices, it makes more sense to allocate skb on local > > > node of the cpu handling RX interrupts. This allow each cpu to > > > manipulate its own slub/slab queues/structures without doing expensive > > > cross-node business. > > > > > > For non multi queue devices, IRQ affinity should be set so that a cpu > > > close to the device services interrupts. Even if not set, using > > > dev_alloc_skb() is faster. > > > > > > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> > > > > Or maybe revert : > > > > commit b30973f877fea1a3fb84e05599890fcc082a88e5 > > Author: Christoph Hellwig <hch@lst.de> > > Date: Wed Dec 6 20:32:36 2006 -0800 > > > > [PATCH] node-aware skb allocation > > > > Node-aware allocation of skbs for the receive path. > > > > Details: > > > > - __alloc_skb gets a new node argument and cals the node-aware > > slab functions with it. > > - netdev_alloc_skb passed the node number it gets from dev_to_node > > to it, everyone else passes -1 (any node) > > > > Signed-off-by: Christoph Hellwig <hch@lst.de> > > Cc: Christoph Lameter <clameter@engr.sgi.com> > > Cc: "David S. Miller" <davem@davemloft.net> > > Signed-off-by: Andrew Morton <akpm@osdl.org> > > > > > > Apparently, only Christoph and Andrew signed it. > > > > > > [PATCH net-next] net: allocate skbs on local node > > commit b30973f877 (node-aware skb allocation) spread a wrong habit of > allocating net drivers skbs on a given memory node : The one closest to > the NIC hardware. This is wrong because as soon as we try to scale > network stack, we need to use many cpus to handle traffic and hit > slub/slab management on cross-node allocations/frees when these cpus > have to alloc/free skbs bound to a central node. 
> > skb allocated in RX path are ephemeral, they have a very short > lifetime : Extra cost to maintain NUMA affinity is too expensive. What > appeared as a nice idea four years ago is in fact a bad one. > > In 2010, NIC hardwares are multiqueue, or we use RPS to spread the load, > and two 10Gb NIC might deliver more than 28 million packets per second, > needing all the available cpus. > > Cost of cross-node handling in network and vm stacks outperforms the > small benefit hardware had when doing its DMA transfert in its 'local' > memory node at RX time. Even trying to differentiate the two allocations > done for one skb (the sk_buff on local node, the data part on NIC > hardware node) is not enough to bring good performance. > This is all conspicuously hand-wavy and unquantified. (IOW: prove it!) The mooted effects should be tested for on both slab and slub, I suggest. They're pretty different beasts.
Le lundi 11 octobre 2010 à 23:03 -0700, Andrew Morton a écrit : > On Tue, 12 Oct 2010 07:05:25 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote: > > [PATCH net-next] net: allocate skbs on local node > > > > commit b30973f877 (node-aware skb allocation) spread a wrong habit of > > allocating net drivers skbs on a given memory node : The one closest to > > the NIC hardware. This is wrong because as soon as we try to scale > > network stack, we need to use many cpus to handle traffic and hit > > slub/slab management on cross-node allocations/frees when these cpus > > have to alloc/free skbs bound to a central node. > > > > skb allocated in RX path are ephemeral, they have a very short > > lifetime : Extra cost to maintain NUMA affinity is too expensive. What > > appeared as a nice idea four years ago is in fact a bad one. > > > > In 2010, NIC hardwares are multiqueue, or we use RPS to spread the load, > > and two 10Gb NIC might deliver more than 28 million packets per second, > > needing all the available cpus. > > > > Cost of cross-node handling in network and vm stacks outperforms the > > small benefit hardware had when doing its DMA transfert in its 'local' > > memory node at RX time. Even trying to differentiate the two allocations > > done for one skb (the sk_buff on local node, the data part on NIC > > hardware node) is not enough to bring good performance. > > > > This is all conspicuously hand-wavy and unquantified. (IOW: prove it!) > I would say, _you_ should prove that original patch was good. It seems no network guy was really in the discussion ? Just run a test on a bnx2x or ixgbe multiqueue 10Gb adapter, and see the difference. Thats about a 40% slowdown on high packet rates, on a dual socket machine (dual X5570 @2.93GHz). You can expect higher values on four nodes (I dont have such hardware to do the test) > The mooted effects should be tested for on both slab and slub, I > suggest. They're pretty different beasts. 
SLAB is so slow on NUMA these days, you can forget it for good. Its about 40% slower on some tests I did this week on net-next, to speedup output (and routing) performance, so it was with normal (local) allocations, not even cross-nodes ones. Once you remove network bottlenecks, you badly hit contention on SLAB and are forced to switch to SLUB ;) Sending 160.000.000 udp frames on same neighbour/destination, IP route cache disabled (to mimic DDOS on a router) 16 threads, 16 logical cpus. 32bit kernel (dual E5540 @ 2.53GHz) (It takes more than 2 minutes with linux-2.6, so use net-next-2.6 if you really want to get these numbers) SLUB : real 0m50.661s user 0m15.973s sys 11m42.548s 18348.00 21.4% dst_destroy vmlinux 5674.00 6.6% fib_table_lookup vmlinux 5563.00 6.5% dst_alloc vmlinux 5226.00 6.1% neigh_lookup vmlinux 3590.00 4.2% __ip_route_output_key vmlinux 2712.00 3.2% neigh_resolve_output vmlinux 2511.00 2.9% fib_semantic_match vmlinux 2488.00 2.9% ipv4_dst_destroy vmlinux 2206.00 2.6% __xfrm_lookup vmlinux 2119.00 2.5% memset vmlinux 2015.00 2.4% __copy_from_user_ll vmlinux 1722.00 2.0% udp_sendmsg vmlinux 1679.00 2.0% __slab_free vmlinux 1152.00 1.3% ip_append_data vmlinux 1044.00 1.2% __alloc_skb vmlinux 952.00 1.1% kmem_cache_free vmlinux 942.00 1.1% udp_push_pending_frames vmlinux 877.00 1.0% kfree vmlinux 870.00 1.0% __call_rcu vmlinux 829.00 1.0% ip_push_pending_frames vmlinux 799.00 0.9% _raw_spin_lock_bh vmlinux SLAB: real 1m10.771s user 0m13.941s sys 12m42.188s 22734.00 26.0% _raw_spin_lock vmlinux 8238.00 9.4% dst_destroy vmlinux 4393.00 5.0% fib_table_lookup vmlinux 3652.00 4.2% dst_alloc vmlinux 3335.00 3.8% neigh_lookup vmlinux 2444.00 2.8% memset vmlinux 2443.00 2.8% __ip_route_output_key vmlinux 1916.00 2.2% fib_semantic_match vmlinux 1708.00 2.0% __copy_from_user_ll vmlinux 1669.00 1.9% __xfrm_lookup vmlinux 1642.00 1.9% free_block vmlinux 1554.00 1.8% neigh_resolve_output vmlinux 1388.00 1.6% ipv4_dst_destroy vmlinux 1335.00 1.5% udp_sendmsg 
vmlinux 1109.00 1.3% kmem_cache_free vmlinux 1007.00 1.2% __alloc_skb vmlinux 1004.00 1.1% kfree vmlinux 1002.00 1.1% ip_append_data vmlinux 975.00 1.1% cache_grow vmlinux 936.00 1.1% ____cache_alloc_node vmlinux 925.00 1.1% udp_push_pending_frames vmlinux All this raw_spin_lock overhead comes from SLAB.
On Tue, 12 Oct 2010 08:58:19 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote: > Le lundi 11 octobre 2010 à 23:03 -0700, Andrew Morton a écrit : > > On Tue, 12 Oct 2010 07:05:25 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > > [PATCH net-next] net: allocate skbs on local node > > > > > > commit b30973f877 (node-aware skb allocation) spread a wrong habit of > > > allocating net drivers skbs on a given memory node : The one closest to > > > the NIC hardware. This is wrong because as soon as we try to scale > > > network stack, we need to use many cpus to handle traffic and hit > > > slub/slab management on cross-node allocations/frees when these cpus > > > have to alloc/free skbs bound to a central node. > > > > > > skb allocated in RX path are ephemeral, they have a very short > > > lifetime : Extra cost to maintain NUMA affinity is too expensive. What > > > appeared as a nice idea four years ago is in fact a bad one. > > > > > > In 2010, NIC hardwares are multiqueue, or we use RPS to spread the load, > > > and two 10Gb NIC might deliver more than 28 million packets per second, > > > needing all the available cpus. > > > > > > Cost of cross-node handling in network and vm stacks outperforms the > > > small benefit hardware had when doing its DMA transfert in its 'local' > > > memory node at RX time. Even trying to differentiate the two allocations > > > done for one skb (the sk_buff on local node, the data part on NIC > > > hardware node) is not enough to bring good performance. > > > > > > > This is all conspicuously hand-wavy and unquantified. (IOW: prove it!) > > > > I would say, _you_ should prove that original patch was good. It seems > no network guy was really in the discussion ? Two wrongs and all that. The 2006 patch has nothing to do with it, apart from demonstrating the importance of including performance measurements in a performance patch. > Just run a test on a bnx2x or ixgbe multiqueue 10Gb adapter, and see the > difference. 
Thats about a 40% slowdown on high packet rates, on a dual > socket machine (dual X5570 @2.93GHz). You can expect higher values on > four nodes (I dont have such hardware to do the test) Like that. Please flesh it out and stick it in the changelog. > > > The mooted effects should be tested for on both slab and slub, I > > suggest. They're pretty different beasts. > > SLAB is so slow on NUMA these days, you can forget it for good. I'd love to forget it, but it's faster for some things (I forget which). Which is why it's still around. And the ghastly thing about this is that you're forced to care about it too because some people are, apparently, still using it.
Le mardi 12 octobre 2010 à 00:24 -0700, Andrew Morton a écrit : > I'd love to forget it, but it's faster for some things (I forget > which). Which is why it's still around. Yes, two years ago it was true on pathological/obscure cases. Every time I did the comparison, SLUB won. You asked me, I did yet another test this morning, and 40% is pretty serious, I believe. > > And the ghastly thing about this is that you're forced to care about it > too because some people are, apparently, still using it. > Yes, some people (in my company) still use linux 2.6.9 32bit on HP G6/G7 machines, I know... I am not saying we should not care, but for any serious network workload on NUMA arches, SLUB is the best, and seeing Christoph recent work, it might even get better. BTW, I believe all modern distros ship SLUB, dont they ?
On Tue, 12 Oct 2010 09:49:53 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote: > Le mardi 12 octobre 2010 à 00:24 -0700, Andrew Morton a écrit : > > > I'd love to forget it, but it's faster for some things (I forget > > which). Which is why it's still around. > > Yes, two years ago it was true on pathological/obscure cases. > Every time I did the comparison, SLUB won. > You asked me, I did yet another test this morning, and 40% is pretty > serious, I believe. > > > > > And the ghastly thing about this is that you're forced to care about it > > too because some people are, apparently, still using it. > > > > Yes, some people (in my company) still use linux 2.6.9 32bit on HP G6/G7 > machines, I know... > > I am not saying we should not care, but for any serious network workload > on NUMA arches, SLUB is the best, and seeing Christoph recent work, it > might even get better. > > BTW, I believe all modern distros ship SLUB, dont they ? > Dunno. Pekka, why haven't we deleted slab yet??
On 10/12/10 10:58 AM, Andrew Morton wrote: > On Tue, 12 Oct 2010 09:49:53 +0200 Eric Dumazet<eric.dumazet@gmail.com> wrote: > >> Le mardi 12 octobre 2010 à 00:24 -0700, Andrew Morton a écrit : >> >>> I'd love to forget it, but it's faster for some things (I forget >>> which). Which is why it's still around. >> >> Yes, two years ago it was true on pathological/obscure cases. >> Every time I did the comparison, SLUB won. >> You asked me, I did yet another test this morning, and 40% is pretty >> serious, I believe. >> >>> And the ghastly thing about this is that you're forced to care about it >>> too because some people are, apparently, still using it. >> >> Yes, some people (in my company) still use linux 2.6.9 32bit on HP G6/G7 >> machines, I know... >> >> I am not saying we should not care, but for any serious network workload >> on NUMA arches, SLUB is the best, and seeing Christoph recent work, it >> might even get better. >> >> BTW, I believe all modern distros ship SLUB, dont they ? > > Dunno. > > Pekka, why haven't we deleted slab yet?? To make a long story short, we still have relevant performance regressions that need to be taken care of. The most interesting one is a regression in netperf TCP_RR that's been reported by David Rientjes a while back. There's bunch of SLUB cleanups queued for 2.6.37 that pave the way for Christoph's SLUB queueing work that should hopefully fix that particular issue for 2.6.38. There's little point in discussing the removal of SLAB as long as there are performance regressions for real workloads from people who are willing to share results and test patches. I'm optimistic that we'll be able to try removing SLAB some time next year unless something interesting pops up... Pekka
On Tue, 12 Oct 2010, Pekka Enberg wrote: > There's little point in discussing the removal of SLAB as long as there are > performance regressions for real workloads from people who are willing to > share results and test patches. I'm optimistic that we'll be able to try > removing SLAB some time next year unless something interesting pops up... Hmmm. Given these effects I think we should be more cautious regarding the unification work. May be the "unified allocator" should replace SLAB instead and SLUB can stay unchanged? The unification patches go back to the one lock per node SLAB thing because the queue maintenance overhead is otherwise causing large regressions in hackbench because of lots of atomic ops. The per node lock seem to be causing problems here in the network stack,. Take the unified as a SLAB cleanup instead? Then at least we have a large common code base and just differentiate through the locking mechanism?
On Tue, 12 Oct 2010, Christoph Lameter wrote: > Hmmm. Given these effects I think we should be more cautious regarding the > unification work. May be the "unified allocator" should replace SLAB > instead and SLUB can stay unchanged? Linus has said that he refuses to merge another allocator until one is removed or replaced, so that would force the unificiation patches to go into slab instead if you want to leave slub untouched. > The unification patches go back to > the one lock per node SLAB thing because the queue maintenance overhead is > otherwise causing large regressions in hackbench because of lots of atomic > ops. The per node lock seem to be causing problems here in the network > stack,. The TCP_RR regression on slub is because of what I described a couple years ago as "slab thrashing" where cpu slabs would be filled with allocations, then frees would occur to move those slabs from the full to partial list with only a few free objects, those partial slabs would quickly become full, etc. Performance gets better if you change the per-node lock to a trylock when iterating the partial list and preallocate and have a substantially longer partial list than normal (and it still didn't rival slab's performance), so I don't think it's only a per-node lock that's the issue , it's all the slowpath overhead of swapping the cpu slab out for another slab. The TCP_RR load would show slub stats that indicate certain caches, kmalloc-256 and kmalloc-2048, would have ~98% of allocations coming from the slowpath. This gets better if you allocate higher order slabs (and kmalloc-2048 is already order-3 by default) but then allocating new slabs gets really slow if not impossible on smaller machines. The overhead of even compaction will kill us. > Take the unified as a SLAB cleanup instead? Then at least we have > a large common code base and just differentiate through the locking > mechanism? > Will you be adding the extensive slub debugging to slab then? 
It would be a shame to lose it because one allocator is chosen over another for performance reasons and then we need to recompile to debug issues as they arise.
On Tue, 12 Oct 2010, Christoph Lameter wrote: >> Hmmm. Given these effects I think we should be more cautious regarding the >> unification work. May be the "unified allocator" should replace SLAB >> instead and SLUB can stay unchanged? On 10/12/10 10:43 PM, David Rientjes wrote: > Linus has said that he refuses to merge another allocator until one is > removed or replaced, so that would force the unificiation patches to go > into slab instead if you want to leave slub untouched. Yes, and quite frankly, I'm not interested in introducing a new one either. On Tue, 12 Oct 2010, Christoph Lameter wrote: >> The unification patches go back to >> the one lock per node SLAB thing because the queue maintenance overhead is >> otherwise causing large regressions in hackbench because of lots of atomic >> ops. The per node lock seem to be causing problems here in the network >> stack,. On Tue, 12 Oct 2010, Christoph Lameter wrote: >> Take the unified as a SLAB cleanup instead? Then at least we have >> a large common code base and just differentiate through the locking >> mechanism? On 10/12/10 10:43 PM, David Rientjes wrote: > Will you be adding the extensive slub debugging to slab then? It would be > a shame to lose it because one allocator is chosen over another for > performance reasons and then we need to recompile to debug issues as they > arise. I think Christoph is saying that we'd remove SLAB and make the unified allocator the new SLAB while keeping SLUB in place. In any case, yes, the debugging support in SLUB is something that we want to keep. Pekka
On Wed, 13 Oct 2010, Pekka Enberg wrote: > I think Christoph is saying that we'd remove SLAB and make the unified > allocator the new SLAB while keeping SLUB in place. Right, so his unification patches would be against slab instead of slub which is a pretty major change from the current state of the patchset, although it might actually be smaller?
On Wed, 13 Oct 2010, Pekka Enberg wrote: >> I think Christoph is saying that we'd remove SLAB and make the unified >> allocator the new SLAB while keeping SLUB in place. On 10/13/10 9:31 AM, David Rientjes wrote: >> Right, so his unification patches would be against slab instead of slub >> which is a pretty major change from the current state of the patchset, >> although it might actually be smaller? No, the way I understood it was:

rm mm/slab.c
cp mm/slub.c mm/slab.c
<apply patches on top of slab.c>

Pekka
On Tue, 12 Oct 2010, David Rientjes wrote: > > Take the unified as a SLAB cleanup instead? Then at least we have > > a large common code base and just differentiate through the locking > > mechanism? > > > > Will you be adding the extensive slub debugging to slab then? It would be > a shame to lose it because one allocator is chosen over another for performance reasons and then we need to recompile to debug issues as they > arise. Well basically we would copy SLUB to SLAB apply unification patches to SLAB instead of SLUBB. We first have to make sure that the unified patches have the same performance as SLAB. It maybe much better to isolate the debug features and general bootstrap from the particulars of the allocation strategy of either SLUB or SLAB. That way a common code base exists and it would be easier to add different allocation strategies. Basically have slab.c with the basic functions and then slab_queueing.c and slab_noqueue.c for SLAB/SLUB with the particulars of the allocation strategy?
On Wed, 13 Oct 2010, Christoph Lameter wrote: > > Will you be adding the extensive slub debugging to slab then? It would be > > a shame to lose it because one allocator is chosen over another for > > performance reasons and then we need to recompile to debug issues as they > > arise. > > Well basically we would copy SLUB to SLAB apply unification patches to > SLAB instead of SLUBB. We first have to make sure that the unified patches > have the same performance as SLAB. > I see, so all of the development will be done in Pekka's tree on mm/slub.c and then when we can see no performance regression compared to the slab baseline, merge it into Linus' tree as mm/slab.c. I'm not exactly sure how that set of diffs being sent to Linus would look. Are the changes to slub in the unification patchset so intrusive that it wouldn't be possible to isolate many of the features under #ifdef or boot-time options in a single, truly unified, allocator? It seems like a shame that we'll have two allocators where the base is the same and much of the debugging code is the same. > It maybe much better to isolate the debug features and general bootstrap > from the particulars of the allocation strategy of either SLUB or SLAB. > That way a common code base exists and it would be easier to add different > allocation strategies. > > Basically have slab.c with the basic functions and then slab_queueing.c > and slab_noqueue.c for SLAB/SLUB with the particulars of the allocation > strategy? > I was going to mention that as an idea, but I thought storing the metadata for certain debugging features might differ from the two allocators so substantially that it would be even more convoluted and difficult to maintain?
On Wed, 13 Oct 2010, David Rientjes wrote: > > Basically have slab.c with the basic functions and then slab_queueing.c > > and slab_noqueue.c for SLAB/SLUB with the particulars of the allocation > > strategy? > > > > I was going to mention that as an idea, but I thought storing the metadata > for certain debugging features might differ from the two allocators so > substantially that it would be even more convoluted and difficult to > maintain? We could have some callbacks to store allocator specific metadata?
On Wed, 13 Oct 2010, Christoph Lameter wrote: > > I was going to mention that as an idea, but I thought storing the metadata > > for certain debugging features might differ from the two allocators so > > substantially that it would be even more convoluted and difficult to > > maintain? > > We could have some callbacks to store allocator specific metadata? > It depends on whether we could share the same base for both slab (unified allocator) and slub, which you snipped from your reply, that would make this cleaner.
On Wed, 13 Oct 2010, Christoph Lameter wrote: >>> I was going to mention that as an idea, but I thought storing the metadata >>> for certain debugging features might differ from the two allocators so >>> substantially that it would be even more convoluted and difficult to >>> maintain? >> >> We could have some callbacks to store allocator specific metadata? On 10/14/10 1:41 AM, David Rientjes wrote: > It depends on whether we could share the same base for both slab (unified > allocator) and slub, which you snipped from your reply, that would make > this cleaner. Argh. Why would we want to introduce something that's effectively a new allocator based on SLUB? If there's something controversial in the current patch series, lets just keep it out of mainline. A "rewrite" is the reason we're in this mess so lets not repeat the same mistake again! Pekka
On Thu, 14 Oct 2010, Pekka Enberg wrote:

> Argh. Why would we want to introduce something that's effectively a new
> allocator based on SLUB? If there's something controversial in the
> current patch series, let's just keep it out of mainline. A "rewrite" is
> the reason we're in this mess, so let's not repeat the same mistake
> again!

SLUB is a good base framework for developing just about any slab
allocator you can imagine, in part because of its enhanced debugging
facilities. Nick originally developed SLQB with much of the same SLUB
framework, and the queueing changes that Christoph is proposing in his
new unified allocator build upon SLUB.

Instead of the slab.c, slab_queue.c, and slab_nonqueue.c trifecta, I
suggested building as much of the core allocator into a single file as
possible and then extending that with a config option such as
CONFIG_SLAB_QUEUEING, if possible. Christoph knows his allocator better
than anybody, so he'd be the person to ask whether this is indeed
feasible; if so, I think it's in the best interest of a long-term
maintainable kernel.

I care about how this is organized because I think the current config
option, which demands that users select between SLAB and SLUB without
really understanding the differences, is detrimental, especially for
users who run a very wide range of applications and for whom better
microbenchmark results for one allocator over another aren't at all
convincing.
> This is all conspicuously hand-wavy and unquantified. (IOW: prove it!)
>
> The mooted effects should be tested for on both slab and slub, I
> suggest. They're pretty different beasts.

Some results running the netperf TCP_RR test with 200 instances, 1 byte
request and response, on a 16 core AMD machine using bnx2x with 16
queues, one for each CPU:

SLAB

Without patch 553570 tps at 86% CPU
With patch 791883 tps at 93% CPU

SLUB

Without patch 704879 tps at 95% CPU
With patch 775278 tps at 92% CPU

I believe both show good benefits with the patch, and it actually looks
like the impact is more pronounced for SLAB. I would also note that we
have already internally patched __netdev_alloc_skb to do local node
allocation, which we have been running in production for quite some
time.

Tom
On Thu, 14 Oct 2010 08:31:01 -0700 Tom Herbert <therbert@google.com> wrote:

> > This is all conspicuously hand-wavy and unquantified. (IOW: prove it!)
> >
> > The mooted effects should be tested for on both slab and slub, I
> > suggest. They're pretty different beasts.
>
> Some results running the netperf TCP_RR test with 200 instances, 1 byte
> request and response, on a 16 core AMD machine using bnx2x with 16
> queues, one for each CPU:
>
> SLAB
>
> Without patch 553570 tps at 86% CPU
> With patch 791883 tps at 93% CPU
>
> SLUB
>
> Without patch 704879 tps at 95% CPU
> With patch 775278 tps at 92% CPU
>
> I believe both show good benefits with the patch, and it actually looks
> like the impact is more pronounced for SLAB. I would also note that we
> have already internally patched __netdev_alloc_skb to do local node
> allocation, which we have been running in production for quite some
> time.

Yes, that's a solid gain.

Can we think of any hardware configuration for which the change would be
harmful? Something with really expensive cross-node DMA maybe?
On Thursday 14 October 2010 at 12:27 -0700, Andrew Morton wrote:

> Can we think of any hardware configuration for which the change would be
> harmful? Something with really expensive cross-node DMA maybe?

If such hardware exists (yes it does, but none close at hand...), then
NIC IRQs should probably be handled by cpus close to the device, or it
might be very slow. That has nothing to do with the skb allocation
layer. I believe we should not try to correct wrong IRQ affinities with
NUMA games.

The network stack will wake up threads, and the scheduler will run them
on the same cpu if possible. If an skb stays long enough on the socket
receive queue, the application will need to fetch from the remote NUMA
node again when it reads the socket a few milliseconds later, because
the cache will be cold. In total: two cross-node transfers per incoming
frame.

The bigger the node distances, the more speedup we can expect from this
patch.

Thanks
On Wed, 13 Oct 2010, David Rientjes wrote:

> On Wed, 13 Oct 2010, Christoph Lameter wrote:
>
> > > I was going to mention that as an idea, but I thought storing the metadata
> > > for certain debugging features might differ from the two allocators so
> > > substantially that it would be even more convoluted and difficult to
> > > maintain?
> >
> > We could have some callbacks to store allocator specific metadata?
>
> It depends on whether we could share the same base for both slab (unified
> allocator) and slub, which you snipped from your reply; that would make
> this cleaner.

I already said before that we should consider having a common base, so I
thought that was not necessary.
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 12 Oct 2010 07:05:25 +0200

> [PATCH net-next] net: allocate skbs on local node
>
> commit b30973f877 (node-aware skb allocation) spread a wrong habit of
> allocating net driver skbs on a given memory node: the one closest to
> the NIC hardware. This is wrong because as soon as we try to scale the
> network stack, we need many cpus to handle traffic, and those cpus hit
> slub/slab management on cross-node allocations/frees when they have to
> alloc/free skbs bound to a central node.
>
> skbs allocated in the RX path are ephemeral; they have a very short
> lifetime, and the extra cost of maintaining NUMA affinity is too
> expensive. What appeared to be a nice idea four years ago is in fact a
> bad one.
>
> In 2010, NIC hardware is multiqueue, or we use RPS to spread the load,
> and two 10Gb NICs might deliver more than 28 million packets per
> second, needing all the available cpus.
>
> The cost of cross-node handling in the network and vm stacks outweighs
> the small benefit the hardware had when doing its DMA transfer into its
> 'local' memory node at RX time. Even trying to differentiate the two
> allocations done for one skb (the sk_buff on the local node, the data
> part on the NIC hardware node) is not enough to bring good performance.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0b53c43..05a358f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -496,13 +496,13 @@ extern struct sk_buff *__alloc_skb(unsigned int size,
 static inline struct sk_buff *alloc_skb(unsigned int size,
 					gfp_t priority)
 {
-	return __alloc_skb(size, priority, 0, -1);
+	return __alloc_skb(size, priority, 0, NUMA_NO_NODE);
 }
 
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 					       gfp_t priority)
 {
-	return __alloc_skb(size, priority, 1, -1);
+	return __alloc_skb(size, priority, 1, NUMA_NO_NODE);
 }
 
 extern bool skb_recycle_check(struct sk_buff *skb, int skb_size);
@@ -1563,13 +1563,25 @@ static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev,
 	return skb;
 }
 
-extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
+/**
+ * __netdev_alloc_page - allocate a page for ps-rx on a specific device
+ * @dev: network device to receive on
+ * @gfp_mask: alloc_pages_node mask
+ *
+ * Allocate a new page. dev currently unused.
+ *
+ * %NULL is returned if there is no free memory.
+ */
+static inline struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
+{
+	return alloc_pages_node(NUMA_NO_NODE, gfp_mask, 0);
+}
 
 /**
  * netdev_alloc_page - allocate a page for ps-rx on a specific device
  * @dev: network device to receive on
  *
- * Allocate a new page node local to the specified device.
+ * Allocate a new page. dev currently unused.
  *
  * %NULL is returned if there is no free memory.
  */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 752c197..4e8b82e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -247,10 +247,9 @@ EXPORT_SYMBOL(__alloc_skb);
 struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 		unsigned int length, gfp_t gfp_mask)
 {
-	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
 	struct sk_buff *skb;
 
-	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
+	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, NUMA_NO_NODE);
 	if (likely(skb)) {
 		skb_reserve(skb, NET_SKB_PAD);
 		skb->dev = dev;
@@ -259,16 +258,6 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 }
 EXPORT_SYMBOL(__netdev_alloc_skb);
 
-struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
-{
-	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
-	struct page *page;
-
-	page = alloc_pages_node(node, gfp_mask, 0);
-	return page;
-}
-EXPORT_SYMBOL(__netdev_alloc_page);
-
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page,
 		     int off, int size)
 {