
[net-next] net: allocate skbs on local node

Message ID 1286859925.30423.184.camel@edumazet-laptop
State Accepted, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Oct. 12, 2010, 5:05 a.m. UTC
On Tuesday 12 October 2010 at 01:22 +0200, Eric Dumazet wrote:
> On Tuesday 12 October 2010 at 01:03 +0200, Eric Dumazet wrote:
> > 
> > For multi queue devices, it makes more sense to allocate skbs on the
> > local node of the cpu handling RX interrupts. This allows each cpu to
> > manipulate its own slub/slab queues/structures without doing expensive
> > cross-node business.
> > 
> > For non multi queue devices, IRQ affinity should be set so that a cpu
> > close to the device services its interrupts. Even if it is not set,
> > using dev_alloc_skb() is faster.
> > 
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> Or maybe revert:
> 
> commit b30973f877fea1a3fb84e05599890fcc082a88e5
> Author: Christoph Hellwig <hch@lst.de>
> Date:   Wed Dec 6 20:32:36 2006 -0800
> 
>     [PATCH] node-aware skb allocation
>     
>     Node-aware allocation of skbs for the receive path.
>     
>     Details:
>     
> >       - __alloc_skb gets a new node argument and calls the node-aware
>         slab functions with it.
>       - netdev_alloc_skb passed the node number it gets from dev_to_node
>         to it, everyone else passes -1 (any node)
>     
>     Signed-off-by: Christoph Hellwig <hch@lst.de>
>     Cc: Christoph Lameter <clameter@engr.sgi.com>
>     Cc: "David S. Miller" <davem@davemloft.net>
>     Signed-off-by: Andrew Morton <akpm@osdl.org>
> 
> 
> Apparently, only Christoph and Andrew signed it.
> 
> 

[PATCH net-next] net: allocate skbs on local node

commit b30973f877 (node-aware skb allocation) spread a wrong habit of
allocating net driver skbs on a given memory node: the one closest to
the NIC hardware. This is wrong because, as soon as we try to scale the
network stack, we need many cpus to handle traffic, and those cpus hit
slub/slab management with cross-node allocations/frees when they have
to alloc/free skbs bound to a central node.

skbs allocated in the RX path are ephemeral; they have a very short
lifetime, and the extra cost of maintaining NUMA affinity is too
expensive. What appeared to be a nice idea four years ago is in fact a
bad one.

In 2010, NIC hardware is multiqueue, or we use RPS to spread the load,
and two 10Gb NICs might deliver more than 28 million packets per
second, needing all the available cpus.

The cost of cross-node handling in the network and vm stacks outweighs
the small benefit the hardware got from doing its DMA transfer into its
'local' memory node at RX time. Even differentiating the two
allocations done for one skb (the sk_buff on the local node, the data
part on the NIC's node) is not enough to bring good performance.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/skbuff.h |   20 ++++++++++++++++----
 net/core/skbuff.c      |   13 +------------
 2 files changed, 17 insertions(+), 16 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Tom Herbert Oct. 12, 2010, 5:35 a.m. UTC | #1
Acked-by: Tom Herbert <therbert@google.com>

On Mon, Oct 11, 2010 at 10:05 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> [quoted changelog trimmed]
> ---
>  include/linux/skbuff.h |   20 ++++++++++++++++----
>  net/core/skbuff.c      |   13 +------------
>  2 files changed, 17 insertions(+), 16 deletions(-)
>
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 0b53c43..05a358f 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -496,13 +496,13 @@ extern struct sk_buff *__alloc_skb(unsigned int size,
>  static inline struct sk_buff *alloc_skb(unsigned int size,
>                                        gfp_t priority)
>  {
> -       return __alloc_skb(size, priority, 0, -1);
> +       return __alloc_skb(size, priority, 0, NUMA_NO_NODE);
>  }
>
>  static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
>                                               gfp_t priority)
>  {
> -       return __alloc_skb(size, priority, 1, -1);
> +       return __alloc_skb(size, priority, 1, NUMA_NO_NODE);
>  }
>
>  extern bool skb_recycle_check(struct sk_buff *skb, int skb_size);
> @@ -1563,13 +1563,25 @@ static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev,
>        return skb;
>  }
>
> -extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
> +/**
> + *     __netdev_alloc_page - allocate a page for ps-rx on a specific device
> + *     @dev: network device to receive on
> + *     @gfp_mask: alloc_pages_node mask
> + *
> + *     Allocate a new page. dev currently unused.
> + *
> + *     %NULL is returned if there is no free memory.
> + */
> +static inline struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
> +{
> +       return alloc_pages_node(NUMA_NO_NODE, gfp_mask, 0);
> +}
>
>  /**
>  *     netdev_alloc_page - allocate a page for ps-rx on a specific device
>  *     @dev: network device to receive on
>  *
> - *     Allocate a new page node local to the specified device.
> + *     Allocate a new page. dev currently unused.
>  *
>  *     %NULL is returned if there is no free memory.
>  */
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 752c197..4e8b82e 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -247,10 +247,9 @@ EXPORT_SYMBOL(__alloc_skb);
>  struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
>                unsigned int length, gfp_t gfp_mask)
>  {
> -       int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
>        struct sk_buff *skb;
>
> -       skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
> +       skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, NUMA_NO_NODE);
>        if (likely(skb)) {
>                skb_reserve(skb, NET_SKB_PAD);
>                skb->dev = dev;
> @@ -259,16 +258,6 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
>  }
>  EXPORT_SYMBOL(__netdev_alloc_skb);
>
> -struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
> -{
> -       int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
> -       struct page *page;
> -
> -       page = alloc_pages_node(node, gfp_mask, 0);
> -       return page;
> -}
> -EXPORT_SYMBOL(__netdev_alloc_page);
> -
>  void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
>                int size)
>  {
>
>
Andrew Morton Oct. 12, 2010, 6:03 a.m. UTC | #2
On Tue, 12 Oct 2010 07:05:25 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> [quoted changelog trimmed]
> 

This is all conspicuously hand-wavy and unquantified.  (IOW: prove it!)

The mooted effects should be tested for on both slab and slub, I
suggest.  They're pretty different beasts.
Eric Dumazet Oct. 12, 2010, 6:58 a.m. UTC | #3
On Monday 11 October 2010 at 23:03 -0700, Andrew Morton wrote:
> On Tue, 12 Oct 2010 07:05:25 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> > [quoted changelog trimmed]
> 
> This is all conspicuously hand-wavy and unquantified.  (IOW: prove it!)
> 

I would say _you_ should prove that the original patch was good. It
seems no network guy was really in the discussion?

Just run a test on a bnx2x or ixgbe multiqueue 10Gb adapter and see the
difference. That's about a 40% slowdown at high packet rates on a dual
socket machine (dual X5570 @ 2.93GHz). You can expect higher values on
four nodes (I don't have such hardware to do the test).


> The mooted effects should be tested for on both slab and slub, I
> suggest.  They're pretty different beasts.

SLAB is so slow on NUMA these days, you can forget it for good.

It's about 40% slower on some tests I did this week on net-next to
speed up output (and routing) performance, so that was with normal
(local) allocations, not even cross-node ones.

Once you remove network bottlenecks, you badly hit contention on SLAB
and are forced to switch to SLUB ;)

Sending 160,000,000 UDP frames to the same neighbour/destination,
IP route cache disabled (to mimic a DDoS on a router),
16 threads, 16 logical cpus, 32bit kernel (dual E5540 @ 2.53GHz)


(It takes more than 2 minutes with linux-2.6, so use net-next-2.6 if you
really want to get these numbers)

SLUB :

real	0m50.661s
user	0m15.973s
sys	11m42.548s


18348.00 21.4% dst_destroy             vmlinux
 5674.00  6.6% fib_table_lookup        vmlinux
 5563.00  6.5% dst_alloc               vmlinux
 5226.00  6.1% neigh_lookup            vmlinux
 3590.00  4.2% __ip_route_output_key   vmlinux
 2712.00  3.2% neigh_resolve_output    vmlinux
 2511.00  2.9% fib_semantic_match      vmlinux
 2488.00  2.9% ipv4_dst_destroy        vmlinux
 2206.00  2.6% __xfrm_lookup           vmlinux
 2119.00  2.5% memset                  vmlinux
 2015.00  2.4% __copy_from_user_ll     vmlinux
 1722.00  2.0% udp_sendmsg             vmlinux
 1679.00  2.0% __slab_free             vmlinux
 1152.00  1.3% ip_append_data          vmlinux
 1044.00  1.2% __alloc_skb             vmlinux
  952.00  1.1% kmem_cache_free         vmlinux
  942.00  1.1% udp_push_pending_frames vmlinux
  877.00  1.0% kfree                   vmlinux
  870.00  1.0% __call_rcu              vmlinux
  829.00  1.0% ip_push_pending_frames  vmlinux
  799.00  0.9% _raw_spin_lock_bh       vmlinux

SLAB:

real	1m10.771s
user	0m13.941s
sys	12m42.188s


22734.00 26.0% _raw_spin_lock          vmlinux
 8238.00  9.4% dst_destroy             vmlinux
 4393.00  5.0% fib_table_lookup        vmlinux
 3652.00  4.2% dst_alloc               vmlinux
 3335.00  3.8% neigh_lookup            vmlinux
 2444.00  2.8% memset                  vmlinux
 2443.00  2.8% __ip_route_output_key   vmlinux
 1916.00  2.2% fib_semantic_match      vmlinux
 1708.00  2.0% __copy_from_user_ll     vmlinux
 1669.00  1.9% __xfrm_lookup           vmlinux
 1642.00  1.9% free_block              vmlinux
 1554.00  1.8% neigh_resolve_output    vmlinux
 1388.00  1.6% ipv4_dst_destroy        vmlinux
 1335.00  1.5% udp_sendmsg             vmlinux
 1109.00  1.3% kmem_cache_free         vmlinux
 1007.00  1.2% __alloc_skb             vmlinux
 1004.00  1.1% kfree                   vmlinux
 1002.00  1.1% ip_append_data          vmlinux
  975.00  1.1% cache_grow              vmlinux
  936.00  1.1% ____cache_alloc_node    vmlinux
  925.00  1.1% udp_push_pending_frames vmlinux


All this raw_spin_lock overhead comes from SLAB.


Andrew Morton Oct. 12, 2010, 7:24 a.m. UTC | #4
On Tue, 12 Oct 2010 08:58:19 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> [quoted changelog trimmed]
> > 
> > This is all conspicuously hand-wavy and unquantified.  (IOW: prove it!)
> > 
> 
> I would say, _you_ should prove that original patch was good. It seems
> no network guy was really in the discussion ?

Two wrongs and all that.  The 2006 patch has nothing to do with it,
apart from demonstrating the importance of including performance
measurements in a performance patch.

> Just run a test on a bnx2x or ixgbe multiqueue 10Gb adapter, and see the
> difference. Thats about a 40% slowdown on high packet rates, on a dual
> socket machine (dual X5570  @2.93GHz). You can expect higher values on
> four nodes (I dont have such hardware to do the test)

Like that.  Please flesh it out and stick it in the changelog.

> 
> > The mooted effects should be tested for on both slab and slub, I
> > suggest.  They're pretty different beasts.
> 
> SLAB is so slow on NUMA these days, you can forget it for good.

I'd love to forget it, but it's faster for some things (I forget
which).  Which is why it's still around.

And the ghastly thing about this is that you're forced to care about it
too because some people are, apparently, still using it.

Eric Dumazet Oct. 12, 2010, 7:49 a.m. UTC | #5
On Tuesday 12 October 2010 at 00:24 -0700, Andrew Morton wrote:

> I'd love to forget it, but it's faster for some things (I forget
> which).  Which is why it's still around.

Yes, two years ago that was true in pathological/obscure cases.
Every time I did the comparison, SLUB won.
You asked me; I did yet another test this morning, and 40% is pretty
serious, I believe.

> 
> And the ghastly thing about this is that you're forced to care about it
> too because some people are, apparently, still using it.
> 

Yes, some people (in my company) still use linux 2.6.9 32bit on HP G6/G7
machines, I know...

I am not saying we should not care, but for any serious network workload
on NUMA arches SLUB is the best, and seeing Christoph's recent work, it
might even get better.

BTW, I believe all modern distros ship SLUB, don't they?


Andrew Morton Oct. 12, 2010, 7:58 a.m. UTC | #6
On Tue, 12 Oct 2010 09:49:53 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> [quoted text trimmed]
> BTW, I believe all modern distros ship SLUB, don't they?
> 

Dunno.

Pekka, why haven't we deleted slab yet??
Pekka Enberg Oct. 12, 2010, 11:08 a.m. UTC | #7
On 10/12/10 10:58 AM, Andrew Morton wrote:
> [quoted text trimmed]
>> BTW, I believe all modern distros ship SLUB, dont they ?
>
> Dunno.
>
> Pekka, why haven't we deleted slab yet??

To make a long story short, we still have relevant performance 
regressions that need to be taken care of. The most interesting one is a 
regression in netperf TCP_RR that was reported by David Rientjes a 
while back. There's a bunch of SLUB cleanups queued for 2.6.37 that pave 
the way for Christoph's SLUB queueing work, which should hopefully fix 
that particular issue for 2.6.38.

There's little point in discussing the removal of SLAB as long as there 
are performance regressions for real workloads from people who are 
willing to share results and test patches. I'm optimistic that we'll be 
able to try removing SLAB some time next year unless something 
interesting pops up...

			Pekka
Christoph Lameter (Ampere) Oct. 12, 2010, 12:50 p.m. UTC | #8
On Tue, 12 Oct 2010, Pekka Enberg wrote:

> There's little point in discussing the removal of SLAB as long as there are
> performance regressions for real workloads from people who are willing to
> share results and test patches. I'm optimistic that we'll be able to try
> removing SLAB some time next year unless something interesting pops up...

Hmmm. Given these effects I think we should be more cautious regarding the
unification work. Maybe the "unified allocator" should replace SLAB
instead, and SLUB can stay unchanged? The unification patches go back to
the one-lock-per-node SLAB scheme because the queue maintenance overhead
otherwise causes large regressions in hackbench due to lots of atomic
ops. The per-node lock seems to be causing problems here in the network
stack. Take the unified allocator as a SLAB cleanup instead? Then at least
we have a large common code base and just differentiate through the
locking mechanism?

David Rientjes Oct. 12, 2010, 7:43 p.m. UTC | #9
On Tue, 12 Oct 2010, Christoph Lameter wrote:

> Hmmm. Given these effects I think we should be more cautious regarding the
> unification work. May be the "unified allocator" should replace SLAB
> instead and SLUB can stay unchanged?

Linus has said that he refuses to merge another allocator until one is 
removed or replaced, so that would force the unification patches to go 
into slab instead if you want to leave slub untouched.

> The unification patches go back to
> the one lock per node SLAB thing because the queue maintenance overhead is
> otherwise causing large regressions in hackbench because of lots of atomic
> ops. The per node lock seem to be causing problems here in the network
> stack,.

The TCP_RR regression on slub is because of what I described a couple of 
years ago as "slab thrashing": cpu slabs get filled with allocations, 
then frees move those slabs from the full to the partial list with only 
a few free objects, those partial slabs quickly become full again, and 
so on.  Performance gets better if you change the per-node lock to a 
trylock when iterating the partial list, preallocate, and keep a 
substantially longer partial list than normal (and it still didn't rival 
slab's performance), so I don't think it's only the per-node lock that's 
the issue; it's all the slowpath overhead of swapping the cpu slab out 
for another slab.  The TCP_RR load would show slub stats indicating that 
certain caches, kmalloc-256 and kmalloc-2048, had ~98% of allocations 
coming from the slowpath.

This gets better if you allocate higher-order slabs (and kmalloc-2048 is 
already order-3 by default), but then allocating new slabs gets really 
slow, if not impossible, on smaller machines. Even the overhead of 
compaction will kill us.

> Take the unified as a SLAB cleanup instead? Then at least we have
> a large common code base and just differentiate through the locking
> mechanism?
> 

Will you be adding the extensive slub debugging to slab then?  It would be 
a shame to lose it because one allocator is chosen over another for 
performance reasons and then we need to recompile to debug issues as they 
arise.
Pekka Enberg Oct. 13, 2010, 6:17 a.m. UTC | #10
On Tue, 12 Oct 2010, Christoph Lameter wrote:
>> Hmmm. Given these effects I think we should be more cautious regarding the
>> unification work. May be the "unified allocator" should replace SLAB
>> instead and SLUB can stay unchanged?

On 10/12/10 10:43 PM, David Rientjes wrote:
> Linus has said that he refuses to merge another allocator until one is
> removed or replaced, so that would force the unificiation patches to go
> into slab instead if you want to leave slub untouched.

Yes, and quite frankly, I'm not interested in introducing a new one either.

On Tue, 12 Oct 2010, Christoph Lameter wrote:
>> The unification patches go back to
>> the one lock per node SLAB thing because the queue maintenance overhead is
>> otherwise causing large regressions in hackbench because of lots of atomic
>> ops. The per node lock seem to be causing problems here in the network
>> stack,.

On Tue, 12 Oct 2010, Christoph Lameter wrote:
>> Take the unified as a SLAB cleanup instead? Then at least we have
>> a large common code base and just differentiate through the locking
>> mechanism?

On 10/12/10 10:43 PM, David Rientjes wrote:
> Will you be adding the extensive slub debugging to slab then?  It would be
> a shame to lose it because one allocator is chosen over another for
> performance reasons and then we need to recompile to debug issues as they
> arise.

I think Christoph is saying that we'd remove SLAB and make the unified 
allocator the new SLAB while keeping SLUB in place. In any case, yes, 
the debugging support in SLUB is something that we want to keep.

			Pekka
David Rientjes Oct. 13, 2010, 6:31 a.m. UTC | #11
On Wed, 13 Oct 2010, Pekka Enberg wrote:

> I think Christoph is saying that we'd remove SLAB and make the unified
> allocator the new SLAB while keeping SLUB in place.

Right, so his unification patches would be against slab instead of slub, 
which is a pretty major change from the current state of the patchset, 
although it might actually be smaller?
Pekka Enberg Oct. 13, 2010, 6:36 a.m. UTC | #12
On Wed, 13 Oct 2010, Pekka Enberg wrote:
>> I think Christoph is saying that we'd remove SLAB and make the unified
>> allocator the new SLAB while keeping SLUB in place.

On 10/13/10 9:31 AM, David Rientjes wrote:
>> Right, so his unification patches would be against slab instead of slub
>> which is a pretty major change from the current state of the patchset,
>> although it might actually be smaller?

No, the way I understood it was:

   rm mm/slab.c
   cp mm/slub.c mm/slab.c
<apply patches on top of slab.c>

             Pekka
Christoph Lameter (Ampere) Oct. 13, 2010, 4 p.m. UTC | #13
On Tue, 12 Oct 2010, David Rientjes wrote:

> > Take the unified as a SLAB cleanup instead? Then at least we have
> > a large common code base and just differentiate through the locking
> > mechanism?
> >
>
> Will you be adding the extensive slub debugging to slab then?  It would be
> a shame to lose it because one allocator is chosen over another for
> performance reasons and then we need to recompile to debug issues as they
> arise.

Well, basically we would copy SLUB to SLAB and apply the unification
patches to SLAB instead of SLUB. We first have to make sure that the
unified patches have the same performance as SLAB.

It may be much better to isolate the debug features and general bootstrap
from the particulars of the allocation strategy of either SLUB or SLAB.
That way a common code base exists and it would be easier to add different
allocation strategies.

Basically have slab.c with the basic functions and then slab_queueing.c
and slab_noqueue.c for SLAB/SLUB with the particulars of the allocation
strategy?


David Rientjes Oct. 13, 2010, 8:48 p.m. UTC | #14
On Wed, 13 Oct 2010, Christoph Lameter wrote:

> > Will you be adding the extensive slub debugging to slab then?  It would be
> > a shame to lose it because one allocator is chosen over another for
> > performance reasons and then we need to recompile to debug issues as they
> > arise.
> 
> Well, basically we would copy SLUB to SLAB and apply the unification
> patches to SLAB instead of SLUB. We first have to make sure that the
> unified patches have the same performance as SLAB.
> 

I see, so all of the development will be done in Pekka's tree on mm/slub.c 
and then when we can see no performance regression compared to the slab 
baseline, merge it into Linus' tree as mm/slab.c.  I'm not exactly sure 
how that set of diffs being sent to Linus would look.

Are the changes to slub in the unification patchset so intrusive that it 
wouldn't be possible to isolate many of the features under #ifdef or 
boot-time options in a single, truly unified, allocator?  It seems like a 
shame that we'll have two allocators where the base is the same and much 
of the debugging code is the same.

> It may be much better to isolate the debug features and general bootstrap
> from the particulars of the allocation strategy of either SLUB or SLAB.
> That way a common code base exists and it would be easier to add different
> allocation strategies.
> 
> Basically have slab.c with the basic functions and then slab_queueing.c
> and slab_noqueue.c for SLAB/SLUB with the particulars of the allocation
> strategy?
> 

I was going to mention that as an idea, but I thought storing the metadata 
for certain debugging features might differ between the two allocators so 
substantially that it would be even more convoluted and difficult to 
maintain?
Christoph Lameter (Ampere) Oct. 13, 2010, 9:43 p.m. UTC | #15
On Wed, 13 Oct 2010, David Rientjes wrote:

> > Basically have slab.c with the basic functions and then slab_queueing.c
> > and slab_noqueue.c for SLAB/SLUB with the particulars of the allocation
> > strategy?
> >
>
> I was going to mention that as an idea, but I thought storing the metadata
> for certain debugging features might differ from the two allocators so
> substantially that it would be even more convoluted and difficult to
> maintain?

We could have some callbacks to store allocator specific metadata?

David Rientjes Oct. 13, 2010, 10:41 p.m. UTC | #16
On Wed, 13 Oct 2010, Christoph Lameter wrote:

> > I was going to mention that as an idea, but I thought storing the metadata
> > for certain debugging features might differ from the two allocators so
> > substantially that it would be even more convoluted and difficult to
> > maintain?
> 
> We could have some callbacks to store allocator specific metadata?
> 

It depends on whether we could share the same base for both slab (unified 
allocator) and slub, which you snipped from your reply, that would make 
this cleaner.
Pekka Enberg Oct. 14, 2010, 6:22 a.m. UTC | #17
On Wed, 13 Oct 2010, Christoph Lameter wrote:
>>> I was going to mention that as an idea, but I thought storing the metadata
>>> for certain debugging features might differ from the two allocators so
>>> substantially that it would be even more convoluted and difficult to
>>> maintain?
>>
>> We could have some callbacks to store allocator specific metadata?

On 10/14/10 1:41 AM, David Rientjes wrote:
> It depends on whether we could share the same base for both slab (unified
> allocator) and slub, which you snipped from your reply, that would make
> this cleaner.

Argh. Why would we want to introduce something that's effectively a new 
allocator based on SLUB? If there's something controversial in the 
current patch series, let's just keep it out of mainline. A "rewrite" is 
the reason we're in this mess, so let's not repeat the same mistake again!

			Pekka
David Rientjes Oct. 14, 2010, 7:23 a.m. UTC | #18
On Thu, 14 Oct 2010, Pekka Enberg wrote:

> Argh. Why would we want to introduce something that's effectively a new
> allocator based on SLUB? If there's something controversial in the current
> patch series, let's just keep it out of mainline. A "rewrite" is the reason
> we're in this mess, so let's not repeat the same mistake again!
> 

SLUB is a good base framework for developing just about any slab allocator 
you can imagine, in part because of its enhanced debugging facilities.  
Nick originally developed SLQB on much of the same SLUB framework, and 
the queueing design that Christoph is proposing in his new unified 
allocator also builds upon SLUB.

Instead of the slab.c, slab_queue.c, and slab_nonqueue.c trifecta, I 
suggested building as much of the core allocator into a single file as 
possible and then extending that with a config option such as 
CONFIG_SLAB_QUEUEING, if possible.  Christoph knows his allocator better 
than anybody so he'd be the person to ask if this was indeed feasible and, 
if so, I think it's in the best interest of a long-term maintainable 
kernel.

I care about how this is organized because I think the current config 
option is detrimental: it demands that users select between SLAB and 
SLUB without really understanding the differences (especially users who 
run a very wide range of applications, for whom better microbenchmark 
results for one allocator over another aren't at all convincing).
Tom Herbert Oct. 14, 2010, 3:31 p.m. UTC | #19
> This is all conspicuously hand-wavy and unquantified.  (IOW: prove it!)
>
> The mooted effects should be tested for on both slab and slub, I
> suggest.  They're pretty different beasts.
> --

Some results running the netperf TCP_RR test with 200 instances, 1-byte
request and response, on a 16-core AMD system using bnx2x with 16 queues,
one for each CPU.

SLAB

Without patch 553570 tps at 86% CPU
With patch 791883 tps at 93% CPU

SLUB

Without patch 704879 tps at 95% CPU
With patch 775278 tps at 92% CPU

I believe both show good benefits with the patch, and it actually looks
like the impact is more pronounced for SLAB.  I would also note that
we have actually already internally patched __netdev_alloc_skb to do
local-node allocation, which we have been running in production for
quite some time.

Tom
Andrew Morton Oct. 14, 2010, 7:27 p.m. UTC | #20
On Thu, 14 Oct 2010 08:31:01 -0700
Tom Herbert <therbert@google.com> wrote:

> > This is all conspicuously hand-wavy and unquantified.  (IOW: prove it!)
> >
> > The mooted effects should be tested for on both slab and slub, I
> > suggest.  They're pretty different beasts.
> > --
> 
> Some results running the netperf TCP_RR test with 200 instances, 1-byte
> request and response, on a 16-core AMD system using bnx2x with 16 queues,
> one for each CPU.
> 
> SLAB
> 
> Without patch 553570 tps at 86% CPU
> With patch 791883 tps at 93% CPU
> 
> SLUB
> 
> Without patch 704879 tps at 95% CPU
> With patch 775278 tps at 92% CPU
> 
> I believe both show good benefits with the patch, and it actually looks
> like the impact is more pronounced for SLAB.  I would also note that
> we have actually already internally patched __netdev_alloc_skb to do
> local-node allocation, which we have been running in production for
> quite some time.
> 

Yes, that's a solid gain.

Can we think of any hardware configuration for which the change would
be harmful?  Something with really expensive cross-node DMA maybe?

Eric Dumazet Oct. 14, 2010, 7:59 p.m. UTC | #21
Le jeudi 14 octobre 2010 à 12:27 -0700, Andrew Morton a écrit :

> Can we think of any hardware configuration for which the change would
> be harmful?  Something with really expensive cross-node DMA maybe?
> 

If such hardware exists (yes it does, but not close to my hands...), then
NIC IRQs should probably be handled by cpus close to the device, or it
might be very slow. This has nothing to do with the skb allocation layer.

I believe we should not try to correct wrong IRQ affinities with NUMA
games. The network stack will wake up threads and the scheduler will run
them on the same cpu, if possible.

If skbs stay long enough on the socket receive queue, the application
will need to fetch from the remote NUMA node again when reading the
socket a few milliseconds later, because the cache will be cold. In
total: two cross-node transfers per incoming frame.

The bigger the node distances, the more speedup we can expect from
this patch.

Thanks


Christoph Lameter (Ampere) Oct. 15, 2010, 2:23 p.m. UTC | #22
On Wed, 13 Oct 2010, David Rientjes wrote:

> On Wed, 13 Oct 2010, Christoph Lameter wrote:
>
> > > I was going to mention that as an idea, but I thought storing the metadata
> > > for certain debugging features might differ from the two allocators so
> > > substantially that it would be even more convoluted and difficult to
> > > maintain?
> >
> > We could have some callbacks to store allocator specific metadata?
> >
>
> It depends on whether we could share the same base for both slab (unified
> allocator) and slub, which you snipped from your reply, that would make
> this cleaner.

I already said before that we should consider having a common base, so I
thought that was not necessary.

David Miller Oct. 16, 2010, 6:54 p.m. UTC | #23
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 12 Oct 2010 07:05:25 +0200

> [PATCH net-next] net: allocate skbs on local node
> 
> commit b30973f877 (node-aware skb allocation) spread a wrong habit of
> allocating net driver skbs on a given memory node: the one closest to
> the NIC hardware. This is wrong because as soon as we try to scale the
> network stack, we need to use many cpus to handle traffic, and we hit
> slub/slab management on cross-node allocations/frees when these cpus
> have to alloc/free skbs bound to a central node.
> 
> skbs allocated in the RX path are ephemeral; they have a very short
> lifetime: the extra cost to maintain NUMA affinity is too expensive.
> What appeared as a nice idea four years ago is in fact a bad one.
> 
> In 2010, NIC hardware is multiqueue, or we use RPS to spread the load,
> and two 10Gb NICs might deliver more than 28 million packets per second,
> needing all the available cpus.
> 
> Cost of cross-node handling in the network and vm stacks outweighs the
> small benefit the hardware had when doing its DMA transfer into its
> 'local' memory node at RX time. Even trying to differentiate the two
> allocations done for one skb (the sk_buff on the local node, the data
> part on the NIC hardware node) is not enough to bring good performance.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.
Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0b53c43..05a358f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -496,13 +496,13 @@  extern struct sk_buff *__alloc_skb(unsigned int size,
 static inline struct sk_buff *alloc_skb(unsigned int size,
 					gfp_t priority)
 {
-	return __alloc_skb(size, priority, 0, -1);
+	return __alloc_skb(size, priority, 0, NUMA_NO_NODE);
 }
 
 static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 					       gfp_t priority)
 {
-	return __alloc_skb(size, priority, 1, -1);
+	return __alloc_skb(size, priority, 1, NUMA_NO_NODE);
 }
 
 extern bool skb_recycle_check(struct sk_buff *skb, int skb_size);
@@ -1563,13 +1563,25 @@  static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev,
 	return skb;
 }
 
-extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
+/**
+ *	__netdev_alloc_page - allocate a page for ps-rx on a specific device
+ *	@dev: network device to receive on
+ *	@gfp_mask: alloc_pages_node mask
+ *
+ * 	Allocate a new page. dev currently unused.
+ *
+ * 	%NULL is returned if there is no free memory.
+ */
+static inline struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
+{
+	return alloc_pages_node(NUMA_NO_NODE, gfp_mask, 0);
+}
 
 /**
  *	netdev_alloc_page - allocate a page for ps-rx on a specific device
  *	@dev: network device to receive on
  *
- * 	Allocate a new page node local to the specified device.
+ * 	Allocate a new page. dev currently unused.
  *
  * 	%NULL is returned if there is no free memory.
  */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 752c197..4e8b82e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -247,10 +247,9 @@  EXPORT_SYMBOL(__alloc_skb);
 struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 		unsigned int length, gfp_t gfp_mask)
 {
-	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
 	struct sk_buff *skb;
 
-	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
+	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, NUMA_NO_NODE);
 	if (likely(skb)) {
 		skb_reserve(skb, NET_SKB_PAD);
 		skb->dev = dev;
@@ -259,16 +258,6 @@  struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 }
 EXPORT_SYMBOL(__netdev_alloc_skb);
 
-struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask)
-{
-	int node = dev->dev.parent ? dev_to_node(dev->dev.parent) : -1;
-	struct page *page;
-
-	page = alloc_pages_node(node, gfp_mask, 0);
-	return page;
-}
-EXPORT_SYMBOL(__netdev_alloc_page);
-
 void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 		int size)
 {