diff mbox

[net-next,V1,1/3] net: bulk alloc and reuse of SKBs in NAPI context

Message ID 20160509134429.3573.4048.stgit@firesoul
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Jesper Dangaard Brouer May 9, 2016, 1:44 p.m. UTC
This patch introduce bulk alloc of SKBs and allow reuse of SKBs
free'ed in same softirq cycle.  SKBs are normally free'ed during TX
completion, but most high speed drivers also cleanup TX ring during
NAPI RX poll cycle.  Thus, if using napi_consume_skb/__kfree_skb_defer,
SKBs will be avail in the napi_alloc_cache->skb_cache.

If no SKBs are avail for reuse, then only bulk alloc 8 SKBs, to limit
the potential overshooting unused SKBs needed to free'ed when NAPI
cycle ends (flushed in net_rx_action via __kfree_skb_flush()).

Generalize ___build_skb() to allow passing it a preallocated SKB.

I've previously demonstrated a 1% speedup for IPv4 forwarding, when
used on the ixgbe driver.  If SKB alloc and free happens on different
CPUs (like in RPS use-case) the performance benefit is expected to
increase.

All drivers using the napi_alloc_skb() call will benefit from
this change automatically.

Joint work with Alexander Duyck.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
 net/core/skbuff.c |   71 ++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 45 insertions(+), 26 deletions(-)

Comments

Alexander H Duyck May 9, 2016, 4:20 p.m. UTC | #1
On Mon, May 9, 2016 at 6:44 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> This patch introduce bulk alloc of SKBs and allow reuse of SKBs
> free'ed in same softirq cycle.  SKBs are normally free'ed during TX
> completion, but most high speed drivers also cleanup TX ring during
> NAPI RX poll cycle.  Thus, if using napi_consume_skb/__kfree_skb_defer,
> SKBs will be avail in the napi_alloc_cache->skb_cache.
>
> If no SKBs are avail for reuse, then only bulk alloc 8 SKBs, to limit
> the potential overshooting unused SKBs needed to free'ed when NAPI
> cycle ends (flushed in net_rx_action via __kfree_skb_flush()).
>
> Generalize ___build_skb() to allow passing it a preallocated SKB.
>
> I've previously demonstrated a 1% speedup for IPv4 forwarding, when
> used on the ixgbe driver.  If SKB alloc and free happens on different
> CPUs (like in RPS use-case) the performance benefit is expected to
> increase.

Really I am not sure this patch series is worth the effort.  For
freeing buffers in Tx it was an obvious win.  But adding all this
complexity for a 1% gain just doesn't seen worth the effort.

> All drivers using the napi_alloc_skb() call will benefit from
> this change automatically.

Yes, but a 1% improvement is essentially just noise.  I'd say we need
to show a better gain or be able to more precisely show that this is a
definite gain and not just a test fluctuation resulting in the
improvement.  For all I know all of the gain could be due to a
function shifting around so that some loop is now 16 byte aligned.

> Joint work with Alexander Duyck.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>

The fact that this is still using my redhat.com address says volumes.
If I recall I think we were working on this code around 9 months ago
and had similar issues with it showing either negative performance or
no gain.  My advice then was to push the bulk free code and try to
find a way to fix the bulk allocation.  If we are still at this state
for bulk allocation then maybe we need to drop the bulk allocation and
start looking at other avenues to pursue.

> ---
>  net/core/skbuff.c |   71 ++++++++++++++++++++++++++++++++++-------------------
>  1 file changed, 45 insertions(+), 26 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 5586be93632f..e85f1065b263 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -281,32 +281,14 @@ nodata:
>  }
>  EXPORT_SYMBOL(__alloc_skb);
>
> -/**
> - * __build_skb - build a network buffer
> - * @data: data buffer provided by caller
> - * @frag_size: size of data, or 0 if head was kmalloced
> - *
> - * Allocate a new &sk_buff. Caller provides space holding head and
> - * skb_shared_info. @data must have been allocated by kmalloc() only if
> - * @frag_size is 0, otherwise data should come from the page allocator
> - *  or vmalloc()
> - * The return is the new skb buffer.
> - * On a failure the return is %NULL, and @data is not freed.
> - * Notes :
> - *  Before IO, driver allocates only data buffer where NIC put incoming frame
> - *  Driver should add room at head (NET_SKB_PAD) and
> - *  MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info))
> - *  After IO, driver calls build_skb(), to allocate sk_buff and populate it
> - *  before giving packet to stack.
> - *  RX rings only contains data buffers, not full skbs.
> - */
> -struct sk_buff *__build_skb(void *data, unsigned int frag_size)
> +/* Allows skb being (pre)allocated by caller */
> +static inline
> +struct sk_buff *___build_skb(void *data, unsigned int frag_size,
> +                            struct sk_buff *skb)
>  {
>         struct skb_shared_info *shinfo;
> -       struct sk_buff *skb;
>         unsigned int size = frag_size ? : ksize(data);
>
> -       skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
>         if (!skb)
>                 return NULL;
>
> @@ -331,6 +313,34 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size)
>         return skb;
>  }
>
> +/**
> + * __build_skb - build a network buffer
> + * @data: data buffer provided by caller
> + * @frag_size: size of data, or 0 if head was kmalloced
> + *
> + * Allocate a new &sk_buff. Caller provides space holding head and
> + * skb_shared_info. @data must have been allocated by kmalloc() only if
> + * @frag_size is 0, otherwise data should come from the page allocator
> + *  or vmalloc()
> + * The return is the new skb buffer.
> + * On a failure the return is %NULL, and @data is not freed.
> + * Notes :
> + *  Before IO, driver allocates only data buffer where NIC put incoming frame
> + *  Driver should add room at head (NET_SKB_PAD) and
> + *  MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info))
> + *  After IO, driver calls build_skb(), to allocate sk_buff and populate it
> + *  before giving packet to stack.
> + *  RX rings only contains data buffers, not full skbs.
> + */
> +struct sk_buff *__build_skb(void *data, unsigned int frag_size)
> +{
> +       struct sk_buff *skb;
> +       unsigned int size = frag_size ? : ksize(data);
> +
> +       skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
> +       return ___build_skb(data, size, skb);
> +}
> +
>  /* build_skb() is wrapper over __build_skb(), that specifically
>   * takes care of skb->head and skb->pfmemalloc
>   * This means that if @frag_size is not zero, then @data must be backed

If we can avoid having to break up build_skb into more functions that
would be preferred.  I realize I probably wrote this code in order to
enable the bulk allocation approach, but really I don't want to add
more complexity unless we can show a strong gain which we haven't been
able to demonstrate.

> @@ -490,8 +500,8 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
>
>         len += NET_SKB_PAD + NET_IP_ALIGN;
>
> -       if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
> -           (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
> +       if (unlikely((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
> +                    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA)))) {
>                 skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
>                 if (!skb)
>                         goto skb_fail;

Does this unlikely really make a difference?  If so you might want to
move it into a patch all on its own.

> @@ -508,11 +518,20 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
>         if (unlikely(!data))
>                 return NULL;
>
> -       skb = __build_skb(data, len);
> -       if (unlikely(!skb)) {
> +#define BULK_ALLOC_SIZE 8
> +       if (!nc->skb_count) {
> +               nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache,
> +                                                     gfp_mask, BULK_ALLOC_SIZE,
> +                                                     nc->skb_cache);
> +       }
> +       if (likely(nc->skb_count)) {
> +               skb = (struct sk_buff *)nc->skb_cache[--nc->skb_count];
> +       } else {
> +               /* alloc bulk failed */

So did you try doing a low latency socket test with this patch?  I'm
guessing not as I am certain there is a negative impact from having to
allocate 8 buffers, and then free back 7 every time you call the NAPI
polling routine with just one buffer in the ring.

>                 skb_free_frag(data);
>                 return NULL;
>         }
> +       skb = ___build_skb(data, len, skb);
>
>         /* use OR instead of assignment to avoid clearing of bits in mask */
>         if (nc->page.pfmemalloc)
>
Jesper Dangaard Brouer May 9, 2016, 7:59 p.m. UTC | #2
On Mon, 9 May 2016 09:20:41 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On Mon, May 9, 2016 at 6:44 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > This patch introduce bulk alloc of SKBs and allow reuse of SKBs
> > free'ed in same softirq cycle.  SKBs are normally free'ed during TX
> > completion, but most high speed drivers also cleanup TX ring during
> > NAPI RX poll cycle.  Thus, if using napi_consume_skb/__kfree_skb_defer,
> > SKBs will be avail in the napi_alloc_cache->skb_cache.
> >
> > If no SKBs are avail for reuse, then only bulk alloc 8 SKBs, to limit
> > the potential overshooting unused SKBs needed to free'ed when NAPI
> > cycle ends (flushed in net_rx_action via __kfree_skb_flush()).
> >
> > Generalize ___build_skb() to allow passing it a preallocated SKB.
> >
> > I've previously demonstrated a 1% speedup for IPv4 forwarding, when
> > used on the ixgbe driver.  If SKB alloc and free happens on different
> > CPUs (like in RPS use-case) the performance benefit is expected to
> > increase.  
> 
> Really I am not sure this patch series is worth the effort.  For
> freeing buffers in Tx it was an obvious win.  But adding all this
> complexity for a 1% gain just doesn't seen worth the effort.

I still think it is worth it, because: 1) it enables recycling, which
is more likely for real-life traffic (e.g. some TCP ACKs gets TX DMA
completion cleanup, and RX can re-use these), and 2) because bulk alloc
and bulk free gets "coupled" (mostly relevant when doing cross CPU).


> > All drivers using the napi_alloc_skb() call will benefit from
> > this change automatically.  
> 
> Yes, but a 1% improvement is essentially just noise.  I'd say we need
> to show a better gain or be able to more precisely show that this is a
> definite gain and not just a test fluctuation resulting in the
> improvement.  For all I know all of the gain could be due to a
> function shifting around so that some loop is now 16 byte aligned.
> 
> > Joint work with Alexander Duyck.
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>  
> 
> The fact that this is still using my redhat.com address says volumes.
> If I recall I think we were working on this code around 9 months ago
> and had similar issues with it showing either negative performance or
> no gain.  My advice then was to push the bulk free code and try to
> find a way to fix the bulk allocation.  If we are still at this state
> for bulk allocation then maybe we need to drop the bulk allocation and
> start looking at other avenues to pursue.

I'm sure there is a gain here. Sure I can spend 2-3 weeks coming up
with a benchmark that show a bigger gain, but is it time well spend?

My quick benchmarking with mlx4 actually showed 2.7% improvement, for
local UDP socket delivery. And only 0.6% on IP-forwarding.  If I find
some bidirectional traffic then the benefit should be higher due to
recycling kicking in.

We did do as you recommended, and the bulk free code first. I'm
just getting the last pieces pushed.  I didn't see any negative
performance in my testing. 

As you also know, tuning the SLUB system will give higher performance,
easily.  In the future, I'm planning to get some auto-tuning into the
SLUB allocator.  I've already discussed this with Christiph Lameter, at
MM-summit, see presentation[1] slides 4 and 5.

[1] http://people.netfilter.org/hawk/presentations/MM-summit2016/slab_mm_summit2016.odp
 
> > ---
> >  net/core/skbuff.c |   71
> > ++++++++++++++++++++++++++++++++++------------------- 1 file
> > changed, 45 insertions(+), 26 deletions(-)
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 5586be93632f..e85f1065b263 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -281,32 +281,14 @@ nodata:
> >  }
> >  EXPORT_SYMBOL(__alloc_skb);
> >
> > -/**
> > - * __build_skb - build a network buffer
> > - * @data: data buffer provided by caller
> > - * @frag_size: size of data, or 0 if head was kmalloced
> > - *
> > - * Allocate a new &sk_buff. Caller provides space holding head and
> > - * skb_shared_info. @data must have been allocated by kmalloc()
> > only if
> > - * @frag_size is 0, otherwise data should come from the page
> > allocator
> > - *  or vmalloc()
> > - * The return is the new skb buffer.
> > - * On a failure the return is %NULL, and @data is not freed.
> > - * Notes :
> > - *  Before IO, driver allocates only data buffer where NIC put
> > incoming frame
> > - *  Driver should add room at head (NET_SKB_PAD) and
> > - *  MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info))
> > - *  After IO, driver calls build_skb(), to allocate sk_buff and
> > populate it
> > - *  before giving packet to stack.
> > - *  RX rings only contains data buffers, not full skbs.
> > - */
> > -struct sk_buff *__build_skb(void *data, unsigned int frag_size)
> > +/* Allows skb being (pre)allocated by caller */
> > +static inline
> > +struct sk_buff *___build_skb(void *data, unsigned int frag_size,
> > +                            struct sk_buff *skb)
> >  {
> >         struct skb_shared_info *shinfo;
> > -       struct sk_buff *skb;
> >         unsigned int size = frag_size ? : ksize(data);
> >
> > -       skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
> >         if (!skb)
> >                 return NULL;
> >
> > @@ -331,6 +313,34 @@ struct sk_buff *__build_skb(void *data,
> > unsigned int frag_size) return skb;
> >  }
> >
> > +/**
> > + * __build_skb - build a network buffer
> > + * @data: data buffer provided by caller
> > + * @frag_size: size of data, or 0 if head was kmalloced
> > + *
> > + * Allocate a new &sk_buff. Caller provides space holding head and
> > + * skb_shared_info. @data must have been allocated by kmalloc()
> > only if
> > + * @frag_size is 0, otherwise data should come from the page
> > allocator
> > + *  or vmalloc()
> > + * The return is the new skb buffer.
> > + * On a failure the return is %NULL, and @data is not freed.
> > + * Notes :
> > + *  Before IO, driver allocates only data buffer where NIC put
> > incoming frame
> > + *  Driver should add room at head (NET_SKB_PAD) and
> > + *  MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info))
> > + *  After IO, driver calls build_skb(), to allocate sk_buff and
> > populate it
> > + *  before giving packet to stack.
> > + *  RX rings only contains data buffers, not full skbs.
> > + */
> > +struct sk_buff *__build_skb(void *data, unsigned int frag_size)
> > +{
> > +       struct sk_buff *skb;
> > +       unsigned int size = frag_size ? : ksize(data);
> > +
> > +       skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
> > +       return ___build_skb(data, size, skb);
> > +}
> > +
> >  /* build_skb() is wrapper over __build_skb(), that specifically
> >   * takes care of skb->head and skb->pfmemalloc
> >   * This means that if @frag_size is not zero, then @data must be
> > backed  
> 
> If we can avoid having to break up build_skb into more functions that
> would be preferred.  I realize I probably wrote this code in order to
> enable the bulk allocation approach, but really I don't want to add
> more complexity unless we can show a strong gain which we haven't been
> able to demonstrate.

You do notice that ___build_skb gets inlined, thus there is not
performance loss.  And this change was explicitly requested last time
this patch was reviewed.  And I think this variant is much less
intrusive.

 
> > @@ -490,8 +500,8 @@ struct sk_buff *__napi_alloc_skb(struct
> > napi_struct *napi, unsigned int len,
> >
> >         len += NET_SKB_PAD + NET_IP_ALIGN;
> >
> > -       if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
> > -           (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
> > +       if (unlikely((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
> > +                    (gfp_mask & (__GFP_DIRECT_RECLAIM |
> > GFP_DMA)))) { skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX,
> > NUMA_NO_NODE); if (!skb)
> >                         goto skb_fail;  
> 
> Does this unlikely really make a difference?  If so you might want to
> move it into a patch all on its own.

I can place it in a patch of it's own.  I just noticed the compiler
layed out the code wrongly compared to the normal use-case.


> > @@ -508,11 +518,20 @@ struct sk_buff *__napi_alloc_skb(struct
> > napi_struct *napi, unsigned int len, if (unlikely(!data))
> >                 return NULL;
> >
> > -       skb = __build_skb(data, len);
> > -       if (unlikely(!skb)) {
> > +#define BULK_ALLOC_SIZE 8
> > +       if (!nc->skb_count) {
> > +               nc->skb_count =
> > kmem_cache_alloc_bulk(skbuff_head_cache,
> > +                                                     gfp_mask,
> > BULK_ALLOC_SIZE,
> > +
> > nc->skb_cache);
> > +       }
> > +       if (likely(nc->skb_count)) {
> > +               skb = (struct sk_buff
> > *)nc->skb_cache[--nc->skb_count];
> > +       } else {
> > +               /* alloc bulk failed */  
> 
> So did you try doing a low latency socket test with this patch?  I'm
> guessing not as I am certain there is a negative impact from having to
> allocate 8 buffers, and then free back 7 every time you call the NAPI
> polling routine with just one buffer in the ring.

There is a very high probability that pulling out 8 objects, and
returning 7 object, will have the same cost of alloc and free of a
single object.  This is due to how the SLUB allocator's per CPU
allocator works.

Notice I said "high probability".  I have adjusted the slab bulk APIs,
such that we can extend them to bulk return "upto" a given number of
objects.  Then the SLUB allocator can remove the "high probability"
part, and make sure only to return object on the per CPU slab-page,
thus guaranteeing no cmpxchg_double calls, and only
local_irq_disable/enable cycles, which is actually faster than the
normal fastpath local cmpxchg_double (non-atomic variant).

 
> >                 skb_free_frag(data);
> >                 return NULL;
> >         }
> > +       skb = ___build_skb(data, len, skb);
> >
> >         /* use OR instead of assignment to avoid clearing of bits
> > in mask */ if (nc->page.pfmemalloc)
> >
Alexander H Duyck May 9, 2016, 8:46 p.m. UTC | #3
On Mon, May 9, 2016 at 12:59 PM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Mon, 9 May 2016 09:20:41 -0700
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> On Mon, May 9, 2016 at 6:44 AM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>> > This patch introduce bulk alloc of SKBs and allow reuse of SKBs
>> > free'ed in same softirq cycle.  SKBs are normally free'ed during TX
>> > completion, but most high speed drivers also cleanup TX ring during
>> > NAPI RX poll cycle.  Thus, if using napi_consume_skb/__kfree_skb_defer,
>> > SKBs will be avail in the napi_alloc_cache->skb_cache.
>> >
>> > If no SKBs are avail for reuse, then only bulk alloc 8 SKBs, to limit
>> > the potential overshooting unused SKBs needed to free'ed when NAPI
>> > cycle ends (flushed in net_rx_action via __kfree_skb_flush()).
>> >
>> > Generalize ___build_skb() to allow passing it a preallocated SKB.
>> >
>> > I've previously demonstrated a 1% speedup for IPv4 forwarding, when
>> > used on the ixgbe driver.  If SKB alloc and free happens on different
>> > CPUs (like in RPS use-case) the performance benefit is expected to
>> > increase.
>>
>> Really I am not sure this patch series is worth the effort.  For
>> freeing buffers in Tx it was an obvious win.  But adding all this
>> complexity for a 1% gain just doesn't seen worth the effort.
>
> I still think it is worth it, because: 1) it enables recycling, which
> is more likely for real-life traffic (e.g. some TCP ACKs gets TX DMA
> completion cleanup, and RX can re-use these), and 2) because bulk alloc
> and bulk free gets "coupled" (mostly relevant when doing cross CPU).

I disagree.  While there will be some ACKs it will likely be
significantly less then the number of packets received, especially
when you take GRO into account.

>
>> > All drivers using the napi_alloc_skb() call will benefit from
>> > this change automatically.
>>
>> Yes, but a 1% improvement is essentially just noise.  I'd say we need
>> to show a better gain or be able to more precisely show that this is a
>> definite gain and not just a test fluctuation resulting in the
>> improvement.  For all I know all of the gain could be due to a
>> function shifting around so that some loop is now 16 byte aligned.
>>
>> > Joint work with Alexander Duyck.
>> >
>> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>> > Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
>>
>> The fact that this is still using my redhat.com address says volumes.
>> If I recall I think we were working on this code around 9 months ago
>> and had similar issues with it showing either negative performance or
>> no gain.  My advice then was to push the bulk free code and try to
>> find a way to fix the bulk allocation.  If we are still at this state
>> for bulk allocation then maybe we need to drop the bulk allocation and
>> start looking at other avenues to pursue.
>
> I'm sure there is a gain here. Sure I can spend 2-3 weeks coming up
> with a benchmark that show a bigger gain, but is it time well spend?
>
> My quick benchmarking with mlx4 actually showed 2.7% improvement, for
> local UDP socket delivery. And only 0.6% on IP-forwarding.  If I find
> some bidirectional traffic then the benefit should be higher due to
> recycling kicking in.

Is that with or without patch 2 being a part of the set?  This patch
should be able to prove itself as a stand-alone patch.  I'd say go
ahead and submit patch 2 as a standalone and then we can start working
on collecting numbers on the mlx4.  Also are you running your test
with GRO disabled?  If so you might try enabling it on the mlx4
because that changes the skbuff allocation which would impact the
behavior for the device.

Try testing with TCP_RR instead and watch the CPU utilization.  I'm
suspecting allocating 8 and freeing 7 buffers for every 1 buffer
received will blow any gains right out of the water.  Also try it with
a mix of traffic.  So have one NIC doing TCP_RR while another is doing
a stream test.  You are stuffing 7 buffers onto a queue that were were
using to perform bulk freeing.  How much of a penalty do you take if
you are now limited on how many you can bulk free because somebody
left a stray 7 packets sitting on the queue?

Also please stop focusing on IP-forwarding.  There are some use cases
for it out there, but the vast majority of people don't care that much
about it.  It is just one data point out of many.  Try running a few
TCP based applications and see if you can notice any difference.  It
is okay to focus on routing numbers when you are doing work that will
only impact routing or has next to no likelihood of introducing a
regression, but the fact is the bulk alloc has a strong likelihood of
having effects on other parts of the network stack since you are
pulling in more buffers then you actually need.

> We did do as you recommended, and the bulk free code first. I'm
> just getting the last pieces pushed.  I didn't see any negative
> performance in my testing.

Not seeing any negative performance is not the same as seeing gains.
You are introducing complexity, there needs to be something to show
for it.  Otherwise you are just making the code harder to maintain for
no good reason.

> As you also know, tuning the SLUB system will give higher performance,
> easily.  In the future, I'm planning to get some auto-tuning into the
> SLUB allocator.  I've already discussed this with Christiph Lameter, at
> MM-summit, see presentation[1] slides 4 and 5.

We aren't discussing tuning parameters.  We are discussing this patch.
If you want to argue that with certain tuning parameters this shows
more performance then bring the numbers, but don't try to bias things.
If you have to tune the system in some way that nobody will there is
probably no point in submitting the patch because nobody will use it
that way.

You also didn't test for low latency.  As I said this is coming at a
high expense to other use cases, but your use case only shows a 1%
improvement.  In my opinion it is not worth adding the complexity if
we cannot show much of a gain.  If you want to win me over we need to
be able to justify the complexity.  Right now I am not seeing it.  In
addition patch 3 illustrates the added complexity you are bringing to
all this because you are having to add limitations to napi_alloc_skb
because you are trying to make it work with recycling which
historically has never been a win because there are too many traffic
patterns where recycling cannot occur.

> [1] http://people.netfilter.org/hawk/presentations/MM-summit2016/slab_mm_summit2016.odp
>
>> > ---
>> >  net/core/skbuff.c |   71
>> > ++++++++++++++++++++++++++++++++++------------------- 1 file
>> > changed, 45 insertions(+), 26 deletions(-)
>> >
>> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> > index 5586be93632f..e85f1065b263 100644
>> > --- a/net/core/skbuff.c
>> > +++ b/net/core/skbuff.c
>> > @@ -281,32 +281,14 @@ nodata:
>> >  }
>> >  EXPORT_SYMBOL(__alloc_skb);
>> >
>> > -/**
>> > - * __build_skb - build a network buffer
>> > - * @data: data buffer provided by caller
>> > - * @frag_size: size of data, or 0 if head was kmalloced
>> > - *
>> > - * Allocate a new &sk_buff. Caller provides space holding head and
>> > - * skb_shared_info. @data must have been allocated by kmalloc()
>> > only if
>> > - * @frag_size is 0, otherwise data should come from the page
>> > allocator
>> > - *  or vmalloc()
>> > - * The return is the new skb buffer.
>> > - * On a failure the return is %NULL, and @data is not freed.
>> > - * Notes :
>> > - *  Before IO, driver allocates only data buffer where NIC put
>> > incoming frame
>> > - *  Driver should add room at head (NET_SKB_PAD) and
>> > - *  MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info))
>> > - *  After IO, driver calls build_skb(), to allocate sk_buff and
>> > populate it
>> > - *  before giving packet to stack.
>> > - *  RX rings only contains data buffers, not full skbs.
>> > - */
>> > -struct sk_buff *__build_skb(void *data, unsigned int frag_size)
>> > +/* Allows skb being (pre)allocated by caller */
>> > +static inline
>> > +struct sk_buff *___build_skb(void *data, unsigned int frag_size,
>> > +                            struct sk_buff *skb)
>> >  {
>> >         struct skb_shared_info *shinfo;
>> > -       struct sk_buff *skb;
>> >         unsigned int size = frag_size ? : ksize(data);
>> >
>> > -       skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
>> >         if (!skb)
>> >                 return NULL;
>> >
>> > @@ -331,6 +313,34 @@ struct sk_buff *__build_skb(void *data,
>> > unsigned int frag_size) return skb;
>> >  }
>> >
>> > +/**
>> > + * __build_skb - build a network buffer
>> > + * @data: data buffer provided by caller
>> > + * @frag_size: size of data, or 0 if head was kmalloced
>> > + *
>> > + * Allocate a new &sk_buff. Caller provides space holding head and
>> > + * skb_shared_info. @data must have been allocated by kmalloc()
>> > only if
>> > + * @frag_size is 0, otherwise data should come from the page
>> > allocator
>> > + *  or vmalloc()
>> > + * The return is the new skb buffer.
>> > + * On a failure the return is %NULL, and @data is not freed.
>> > + * Notes :
>> > + *  Before IO, driver allocates only data buffer where NIC put
>> > incoming frame
>> > + *  Driver should add room at head (NET_SKB_PAD) and
>> > + *  MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info))
>> > + *  After IO, driver calls build_skb(), to allocate sk_buff and
>> > populate it
>> > + *  before giving packet to stack.
>> > + *  RX rings only contains data buffers, not full skbs.
>> > + */
>> > +struct sk_buff *__build_skb(void *data, unsigned int frag_size)
>> > +{
>> > +       struct sk_buff *skb;
>> > +       unsigned int size = frag_size ? : ksize(data);
>> > +
>> > +       skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
>> > +       return ___build_skb(data, size, skb);
>> > +}
>> > +
>> >  /* build_skb() is wrapper over __build_skb(), that specifically
>> >   * takes care of skb->head and skb->pfmemalloc
>> >   * This means that if @frag_size is not zero, then @data must be
>> > backed
>>
>> If we can avoid having to break up build_skb into more functions that
>> would be preferred.  I realize I probably wrote this code in order to
>> enable the bulk allocation approach, but really I don't want to add
>> more complexity unless we can show a strong gain which we haven't been
>> able to demonstrate.
>
> You do notice that ___build_skb gets inlined, thus there is not
> performance loss.  And this change was explicitly requested last time
> this patch was reviewed.  And I think this variant is much less
> intrusive.

I didn't notice the inline, so it means we are bloating the code a bit
instead of adding more function calls.  I guess that is a better trade
off, but still not the most desirable thing to do.

>
>> > @@ -490,8 +500,8 @@ struct sk_buff *__napi_alloc_skb(struct
>> > napi_struct *napi, unsigned int len,
>> >
>> >         len += NET_SKB_PAD + NET_IP_ALIGN;
>> >
>> > -       if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
>> > -           (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
>> > +       if (unlikely((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
>> > +                    (gfp_mask & (__GFP_DIRECT_RECLAIM |
>> > GFP_DMA)))) { skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX,
>> > NUMA_NO_NODE); if (!skb)
>> >                         goto skb_fail;
>>
>> Does this unlikely really make a difference?  If so you might want to
>> move it into a patch all on its own.
>
> I can place it in a patch of it's own.  I just noticed the compiler
> layed out the code wrongly compared to the normal use-case.

What version of gcc are you seeing this with?  From what I can tell on
the 4.8.5 version that came with my RHEL 7.2 build I am seeing the
proper setup with 2 compares and 2 jumps to the remote code.

>> > @@ -508,11 +518,20 @@ struct sk_buff *__napi_alloc_skb(struct
>> > napi_struct *napi, unsigned int len, if (unlikely(!data))
>> >                 return NULL;
>> >
>> > -       skb = __build_skb(data, len);
>> > -       if (unlikely(!skb)) {
>> > +#define BULK_ALLOC_SIZE 8
>> > +       if (!nc->skb_count) {
>> > +               nc->skb_count =
>> > kmem_cache_alloc_bulk(skbuff_head_cache,
>> > +                                                     gfp_mask,
>> > BULK_ALLOC_SIZE,
>> > +
>> > nc->skb_cache);
>> > +       }
>> > +       if (likely(nc->skb_count)) {
>> > +               skb = (struct sk_buff
>> > *)nc->skb_cache[--nc->skb_count];
>> > +       } else {
>> > +               /* alloc bulk failed */
>>
>> So did you try doing a low latency socket test with this patch?  I'm
>> guessing not as I am certain there is a negative impact from having to
>> allocate 8 buffers, and then free back 7 every time you call the NAPI
>> polling routine with just one buffer in the ring.
>
> There is a very high probability that pulling out 8 objects, and
> returning 7 object, will have the same cost of alloc and free of a
> single object.  This is due to how the SLUB allocator's per CPU
> allocator works.

I very much doubt that.  You are still having to move around many more
objects then you really should be and dragging in 8 cache lines as you
have to walk the percpu freelist to pull the entries and then push
them back in.

> Notice I said "high probability".  I have adjusted the slab bulk APIs,
> such that we can extend them to bulk return "upto" a given number of
> objects.  Then the SLUB allocator can remove the "high probability"
> part, and make sure only to return object on the per CPU slab-page,
> thus guaranteeing no cmpxchg_double calls, and only
> local_irq_disable/enable cycles, which is actually faster than the
> normal fastpath local cmpxchg_double (non-atomic variant).

On your system.  There is no guarantee that is the case across all
CPUs supported by the x86 architecture.

Also I can think of scenarios where your code would become very
expensive.  The fact is you are pushing an additional 7 objects into
the L1 cache each time you do one of these bulk allocations.  The
question you have to ask yourself is what is it you are evicting when
you do it.  There are scenerios where you could be evicting critical
data back out to L2 which will negatively impact your performance.  A
1% gain 99% of the time is kind of useless when there is a risk of a
significant performance hit for that remaining 1% of cases.
Jesper Dangaard Brouer May 10, 2016, 9:29 a.m. UTC | #4
On Mon, 9 May 2016 13:46:32 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> > As you also know, tuning the SLUB system will give higher performance,
> > easily.  In the future, I'm planning to get some auto-tuning into the
> > SLUB allocator.  I've already discussed this with Christiph Lameter, at
> > MM-summit, see presentation[1] slides 4 and 5.  
> 
> We aren't discussing tuning parameters.  We are discussing this patch.
> If you want to argue that with certain tuning parameters this shows
> more performance then bring the numbers, but don't try to bias things.
> If you have to tune the system in some way that nobody will there is
> probably no point in submitting the patch because nobody will use it
> that way.

I think you missed the point. I didn't do parameter tuning for my
benchmarks. I hate tuning parameters, they are huge problem for users
of the kernel.
 My point is that I want to implement auto-tuning in the SLUB
allocator.  This network stack use-case, will just be one use-case
where the auto-tuning need to show improvements.  It is not that
complicated. FreeBSD have this kind of auto-tuning to the workload in
their slab implementation.
Jesper Dangaard Brouer May 10, 2016, 12:30 p.m. UTC | #5
On Mon, 9 May 2016 13:46:32 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> Try testing with TCP_RR instead and watch the CPU utilization.  I'm
> suspecting allocating 8 and freeing 7 buffers for every 1 buffer
> received will blow any gains right out of the water.  Also try it with
> a mix of traffic.  So have one NIC doing TCP_RR while another is doing
> a stream test.  You are stuffing 7 buffers onto a queue that were were
> using to perform bulk freeing.  How much of a penalty do you take if
> you are now limited on how many you can bulk free because somebody
> left a stray 7 packets sitting on the queue?

Testing with TCP_RR, is not a very "clean" network test. One have to be
very careful what is actually being tested, is it the server or client
which is the bottleneck. And most of all this is test of the CPU/process
scheduler.

We can avoid the scheduler problem by enabling busy_poll/busy_read.

I guess you want to see the "scheduler test" first.  Default setting of
disabled busy poll on both client and server:

Disable busy poll on both client and server, Not patched:

 $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
 MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 
 ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
 Local /Remote
 Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
 Send   Recv   Size    Size   Time    Rate     local  remote local   remote
 bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
 
 16384  87380  1       1      60.00   78077.55  3.74   2.69   3.830   8.265  
 16384  87380 

Disable busy poll on both client and server, patched:

 $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
 MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2
 ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
 Local /Remote
 Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
 Send   Recv   Size    Size   Time    Rate     local  remote local   remote
 bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
 
 16384  87380  1       1      60.00   78517.32  3.06   2.84   3.118   8.677  
 16384  87380 

I will not call this an improvement... the results are basically the same.



Next step enabling busy poll on the server.  The server is likely the
bottleneck, given it's CPU is slower than the client.  Context switches
on the server is too high 156K/sec, after enabling busy poll reduced to
620/sec. Note the client is doing around 233k/sec context switches,
(fairly impressive).

Enabling busy poll on the server:
 sysctl -w net.core.busy_poll=50
 sysctl -w net.core.busy_read=50


Enabled busy poll only on server, Not patched:
 $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
 MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 
 ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
 Local /Remote
 Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
 Send   Recv   Size    Size   Time    Rate     local  remote local   remote
 bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

 16384  87380  1       1      60.00   112480.72  5.90   4.68   4.194   9.984  
 16384  87380 

Enabled busy poll only on server, patched:

 $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
 MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 
 ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
 Local /Remote
 Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
 Send   Recv   Size    Size   Time    Rate     local  remote local   remote
 bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

 16384  87380  1       1      60.00   110152.34  5.84   4.60   4.242   10.014 
 16384  87380 

Numbers are too close, for any conclusions.

Running a second run, on Not-patched kernel:
 Enabled busy poll only on server, Not patched:
 [jbrouer@canyon ~]$ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
 MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 
 ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
 Local /Remote
 Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
 Send   Recv   Size    Size   Time    Rate     local  remote local   remote
 bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

 16384  87380  1       1      60.00   101554.90  4.12   4.31   3.245   10.185 
 16384  87380 

Thus, variation between runs are bigger than any improvement/regression,
thus no performance conclusions from this change can be drawn.


Lets move beyond testing the CPU/process scheduler by enabling
busy-polling on both client and server:
 (sysctl -w net.core.busy_poll=50 ;sysctl -w net.core.busy_read=50)

Enable busy poll on both client and server, Not patched:

$ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      60.00   137987.86  13.18  4.77   7.643   8.298  
16384  87380 


Enable busy poll on both client and server, patched:

$ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      60.00   147324.38  13.76  4.76   7.474   7.747  
16384  87380 

I've a little bit surprised to see such a large improvement here 6.76%.
 147324/137987*100 = 106.76

I'm remaining skeptic towards this measurement, as the improvement
should not be this high.  Even if recycling is happening.

Perf record does show less calls to __slab_free(), indicating better
interaction with SLUB, and perhaps recycling working.  But this is
only a perf-report change from 0.37% to 0.33%.

More testing show not-patched kernel fluctuate between 125k-143k/sec,
and patched kernel fluctuate between 131k-152k/sec. The ranges are too
high, to say anything conclusive.  It seems to be timing dependent, as
starting and stoping the test with -D 1, show a rate variation within
2k/sec, but rate itself can vary withing the range stated.
Eric Dumazet May 10, 2016, 1:48 p.m. UTC | #6
On Tue, 2016-05-10 at 14:30 +0200, Jesper Dangaard Brouer wrote:

> Disable busy poll on both client and server, Not patched:
> 
>  $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
>  MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 
>  ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
>  Local /Remote
>  Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
>  Send   Recv   Size    Size   Time    Rate     local  remote local   remote
>  bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
>  
>  16384  87380  1       1      60.00   78077.55  3.74   2.69   3.830   8.265  
>  16384  87380 

Tell us more about the -T6,6

For example how many TX/RX queues you have on the NIC, and which cpus
service interrupts.
Jesper Dangaard Brouer May 10, 2016, 2:48 p.m. UTC | #7
On Tue, 10 May 2016 06:48:54 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Tue, 2016-05-10 at 14:30 +0200, Jesper Dangaard Brouer wrote:
> 
> > Disable busy poll on both client and server, Not patched:
> > 
> >  $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
> >  MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 port 0 AF_INET to 198.18.40.2 
> >  ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
> >  Local /Remote
> >  Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> >  Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> >  bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
> >  
> >  16384  87380  1       1      60.00   78077.55  3.74   2.69   3.830   8.265  
> >  16384  87380   
> 
> Tell us more about the -T6,6
> 
> For example how many TX/RX queues you have on the NIC, and which cpus
> service interrupts.

The -T6,6 option:
 -T lcpu,rcpu      Request netperf/netserver be bound to local/remote cpu

I use the option to get more stable results.  If I don't pin/bind the
CPU netperf/netserver is running on then the CPU scheduler will migrate
the processes around.  This gives unpredictable results, worst for the
busy_poll tests.  Especially if the RX softirq runs on the same CPU
(also true if it runs on a HyperTread siping).  

Netperf client (8 cores i7-4790K CPU @ 4.00GHz)  RX:8 and TX:8 queues.
Netserver server (2x 12 cores E5-2630 @ 2.30GHz) RX:8 and TX:24 queues.
Driver mlx4.
Disabled GRO to hit code path I changed in patch 2.
Eric Dumazet May 10, 2016, 3:44 p.m. UTC | #8
On Tue, 2016-05-10 at 16:48 +0200, Jesper Dangaard Brouer wrote:
> On Tue, 10 May 2016 06:48:54 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Tue, 2016-05-10 at 14:30 +0200, Jesper Dangaard Brouer wrote:
> > 
> > > Disable busy poll on both client and server, Not patched:
> > > 
> > >  $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
> > >  MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 port 0 AF_INET to 198.18.40.2 
> > >  ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
> > >  Local /Remote
> > >  Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> > >  Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> > >  bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
> > >  
> > >  16384  87380  1       1      60.00   78077.55  3.74   2.69   3.830   8.265  
> > >  16384  87380   
> > 
> > Tell us more about the -T6,6
> > 
> > For example how many TX/RX queues you have on the NIC, and which cpus
> > service interrupts.
> 
> The -T6,6 option:
>  -T lcpu,rcpu      Request netperf/netserver be bound to local/remote cpu
> 

Sure, I know -T option in netperf.

> I use the option to get more stable results.  If I don't pin/bind the
> CPU netperf/netserver is running on then the CPU scheduler will migrate
> the processes around.  This gives unpredictable results, worst for the
> busy_poll tests.  Especially if the RX softirq runs on the same CPU
> (also true if it runs on a HyperTread siping).  
> 
> Netperf client (8 cores i7-4790K CPU @ 4.00GHz)  RX:8 and TX:8 queues.
> Netserver server (2x 12 cores E5-2630 @ 2.30GHz) RX:8 and TX:24 queues.
> Driver mlx4.
> Disabled GRO to hit code path I changed in patch 2.

But are you using stuff like aRFS, RPS , RFS ?

Each netperf run lands on different cpus, and we know that results can
have a 25% variability  because of that, even more on 2-node systems.

By forcing -T6,6 you force the netperf/netserver cpu, not the RX queues.

A nice effort would be to be able to chose the source in the 4-tuple at
connect() time so that we know that Toeplitz hash will select the
'correct' RX queue.
Alexander H Duyck May 10, 2016, 5:46 p.m. UTC | #9
On 05/10/2016 05:30 AM, Jesper Dangaard Brouer wrote:
>
> On Mon, 9 May 2016 13:46:32 -0700
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> Try testing with TCP_RR instead and watch the CPU utilization.  I'm
>> suspecting allocating 8 and freeing 7 buffers for every 1 buffer
>> received will blow any gains right out of the water.  Also try it with
>> a mix of traffic.  So have one NIC doing TCP_RR while another is doing
>> a stream test.  You are stuffing 7 buffers onto a queue that were were
>> using to perform bulk freeing.  How much of a penalty do you take if
>> you are now limited on how many you can bulk free because somebody
>> left a stray 7 packets sitting on the queue?
>
> Testing with TCP_RR, is not a very "clean" network test. One have to be
> very careful what is actually being tested, is it the server or client
> which is the bottleneck. And most of all this is test of the CPU/process
> scheduler.
>
> We can avoid the scheduler problem by enabling busy_poll/busy_read.
>
> I guess you want to see the "scheduler test" first.  Default setting of
> disabled busy poll on both client and server:
>
> Disable busy poll on both client and server, Not patched:

Lets also define what "Not patched" means.  I only want the bulk 
allocation patch tested.  The other patches you have for the mlx4 and 
the WARN_ON are just noise.  If possible it would be best to focus on 
just the one patch that is high risk.

>   $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
>   MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2
>   ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
>   Local /Remote
>   Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
>   Send   Recv   Size    Size   Time    Rate     local  remote local   remote
>   bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
>
>   16384  87380  1       1      60.00   78077.55  3.74   2.69   3.830   8.265
>   16384  87380
>
> Disable busy poll on both client and server, patched:
>
>   $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
>   MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2
>   ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
>   Local /Remote
>   Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
>   Send   Recv   Size    Size   Time    Rate     local  remote local   remote
>   bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
>
>   16384  87380  1       1      60.00   78517.32  3.06   2.84   3.118   8.677
>   16384  87380
>
> I will not call this an improvement... the results are basically the same.

So I have a few suggestions.

1.  Either switch to ixgbe and use ATR/Flow Director or look at setting 
up your test so that the RSS key and indirection table are the same for 
each test and use the "-- -P" option in netperf to force the use of the 
same 5 tuple for each test.

2.  Cut down on the noise.  Specifically rebuild your kernel with as few 
options enabled as possible.  If you don't need it drop it out so that 
we can identify exactly how much gain there is to be had from your 
patches.  Also you should increase your test to use multiple CPUs, or 
cut down on the number of CPUs so that you aren't cutting down the the 
CPU utilization so much.  If you have to you might even look at doing 
something like using SAR -P 6 in order to be able to monitor the CPU 
utilization of just CPU 6 on your test system.

The goal is you want your tests to be repeatable if you had to move to 
another system or another NIC so I would recommend trying to find a way 
to make it so that much of the fluctuation is ruled out and that your 
numbers are as reliable as possible.

3.  You might even try a pktgen Rx and drop test to see what the 
difference is in ns/packet for your allocation routine.  Assuming you 
can clear out the variability that would be a useful datapoint as you 
could then also collect the perf data to show which functions have 
reduced their total CPU time.

> Next step enabling busy poll on the server.  The server is likely the
> bottleneck, given it's CPU is slower than the client.  Context switches
> on the server is too high 156K/sec, after enabling busy poll reduced to
> 620/sec. Note the client is doing around 233k/sec context switches,
> (fairly impressive).
>
> Enabling busy poll on the server:
>   sysctl -w net.core.busy_poll=50
>   sysctl -w net.core.busy_read=50
>
>
> Enabled busy poll only on server, Not patched:
>   $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
>   MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2
>   ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
>   Local /Remote
>   Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
>   Send   Recv   Size    Size   Time    Rate     local  remote local   remote
>   bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
>
>   16384  87380  1       1      60.00   112480.72  5.90   4.68   4.194   9.984
>   16384  87380
>
> Enabled busy poll only on server, patched:
>
>   $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
>   MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2
>   ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
>   Local /Remote
>   Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
>   Send   Recv   Size    Size   Time    Rate     local  remote local   remote
>   bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
>
>   16384  87380  1       1      60.00   110152.34  5.84   4.60   4.242   10.014
>   16384  87380
>
> Numbers are too close, for any conclusions.

Agreed.

> Running a second run, on Not-patched kernel:
>   Enabled busy poll only on server, Not patched:
>   [jbrouer@canyon ~]$ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
>   MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2
>   ()  port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
>   Local /Remote
>   Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
>   Send   Recv   Size    Size   Time    Rate     local  remote local   remote
>   bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
>
>   16384  87380  1       1      60.00   101554.90  4.12   4.31   3.245   10.185
>   16384  87380
>
> Thus, variation between runs are bigger than any improvement/regression,
> thus no performance conclusions from this change can be drawn.

Like Eric mentioned this is likely the fact that you are bouncing 
between Rx queues.

> Lets move beyond testing the CPU/process scheduler by enabling
> busy-polling on both client and server:
>   (sysctl -w net.core.busy_poll=50 ;sysctl -w net.core.busy_read=50)
>
> Enable busy poll on both client and server, Not patched:
>
> $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
>
> 16384  87380  1       1      60.00   137987.86  13.18  4.77   7.643   8.298
> 16384  87380
>
>
> Enable busy poll on both client and server, patched:
>
> $ netperf -H 198.18.40.2 -t TCP_RR  -l 60 -T 6,6 -Cc
> MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
>
> 16384  87380  1       1      60.00   147324.38  13.76  4.76   7.474   7.747
> 16384  87380
>
> I've a little bit surprised to see such a large improvement here 6.76%.
>   147324/137987*100 = 106.76

That is a difference of 500ns per packet.  I am highly doubtful we are 
seeing that much of an improvement as well.  Odds are you were doing 
something cross-node in  your first run or something along those lines 
since the difference is too large to be attributed to the bulk 
allocation change.

> I'm remaining skeptic towards this measurement, as the improvement
> should not be this high.  Even if recycling is happening.
>
> Perf record does show less calls to __slab_free(), indicating better
> interaction with SLUB, and perhaps recycling working.  But this is
> only a perf-report change from 0.37% to 0.33%.
>
> More testing show not-patched kernel fluctuate between 125k-143k/sec,
> and patched kernel fluctuate between 131k-152k/sec. The ranges are too
> high, to say anything conclusive.  It seems to be timing dependent, as
> starting and stoping the test with -D 1, show a rate variation within
> 2k/sec, but rate itself can vary withing the range stated.

I am pretty sure we are guaranteed to see a performance regression for 
socket based workloads.  You are hoping for recycling to occur, but in 
almost all cases recycling almost always fails to show any gains and 
just ends up introducing the possibility for regressions as something 
always gets overlooked.  On top of that you aren't guaranteed a frame 
from Tx clean-up is going to be warm in the cache so you may end up 
taking a few cache line misses as well.

I haven't seen enough change in these patches to justify having them 
submitted to the kernel.  A 1% improvement in one specific test case is 
kind of a vague reason to do something that very likely introduces 
regressions in many other cases.

- Alex
diff mbox

Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5586be93632f..e85f1065b263 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -281,32 +281,14 @@  nodata:
 }
 EXPORT_SYMBOL(__alloc_skb);
 
-/**
- * __build_skb - build a network buffer
- * @data: data buffer provided by caller
- * @frag_size: size of data, or 0 if head was kmalloced
- *
- * Allocate a new &sk_buff. Caller provides space holding head and
- * skb_shared_info. @data must have been allocated by kmalloc() only if
- * @frag_size is 0, otherwise data should come from the page allocator
- *  or vmalloc()
- * The return is the new skb buffer.
- * On a failure the return is %NULL, and @data is not freed.
- * Notes :
- *  Before IO, driver allocates only data buffer where NIC put incoming frame
- *  Driver should add room at head (NET_SKB_PAD) and
- *  MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info))
- *  After IO, driver calls build_skb(), to allocate sk_buff and populate it
- *  before giving packet to stack.
- *  RX rings only contains data buffers, not full skbs.
- */
-struct sk_buff *__build_skb(void *data, unsigned int frag_size)
+/* Allows skb being (pre)allocated by caller */
+static inline
+struct sk_buff *___build_skb(void *data, unsigned int frag_size,
+			     struct sk_buff *skb)
 {
 	struct skb_shared_info *shinfo;
-	struct sk_buff *skb;
 	unsigned int size = frag_size ? : ksize(data);
 
-	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
 	if (!skb)
 		return NULL;
 
@@ -331,6 +313,34 @@  struct sk_buff *__build_skb(void *data, unsigned int frag_size)
 	return skb;
 }
 
+/**
+ * __build_skb - build a network buffer
+ * @data: data buffer provided by caller
+ * @frag_size: size of data, or 0 if head was kmalloced
+ *
+ * Allocate a new &sk_buff. Caller provides space holding head and
+ * skb_shared_info. @data must have been allocated by kmalloc() only if
+ * @frag_size is 0, otherwise data should come from the page allocator
+ *  or vmalloc()
+ * The return is the new skb buffer.
+ * On a failure the return is %NULL, and @data is not freed.
+ * Notes :
+ *  Before IO, driver allocates only data buffer where NIC put incoming frame
+ *  Driver should add room at head (NET_SKB_PAD) and
+ *  MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info))
+ *  After IO, driver calls build_skb(), to allocate sk_buff and populate it
+ *  before giving packet to stack.
+ *  RX rings only contains data buffers, not full skbs.
+ */
+struct sk_buff *__build_skb(void *data, unsigned int frag_size)
+{
+	struct sk_buff *skb;
+	unsigned int size = frag_size ? : ksize(data);
+
+	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
+	return ___build_skb(data, size, skb);
+}
+
 /* build_skb() is wrapper over __build_skb(), that specifically
  * takes care of skb->head and skb->pfmemalloc
  * This means that if @frag_size is not zero, then @data must be backed
@@ -490,8 +500,8 @@  struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 
 	len += NET_SKB_PAD + NET_IP_ALIGN;
 
-	if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
-	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
+	if (unlikely((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
+		     (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA)))) {
 		skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
 		if (!skb)
 			goto skb_fail;
@@ -508,11 +518,20 @@  struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	if (unlikely(!data))
 		return NULL;
 
-	skb = __build_skb(data, len);
-	if (unlikely(!skb)) {
+#define BULK_ALLOC_SIZE 8
+	if (!nc->skb_count) {
+		nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache,
+						      gfp_mask, BULK_ALLOC_SIZE,
+						      nc->skb_cache);
+	}
+	if (likely(nc->skb_count)) {
+		skb = (struct sk_buff *)nc->skb_cache[--nc->skb_count];
+	} else {
+		/* alloc bulk failed */
 		skb_free_frag(data);
 		return NULL;
 	}
+	skb = ___build_skb(data, len, skb);
 
 	/* use OR instead of assignment to avoid clearing of bits in mask */
 	if (nc->page.pfmemalloc)