| Message ID | 20160509134429.3573.4048.stgit@firesoul |
|---|---|
| State | Changes Requested, archived |
| Delegated to | David Miller |
On Mon, May 9, 2016 at 6:44 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote: > This patch introduce bulk alloc of SKBs and allow reuse of SKBs > free'ed in same softirq cycle. SKBs are normally free'ed during TX > completion, but most high speed drivers also cleanup TX ring during > NAPI RX poll cycle. Thus, if using napi_consume_skb/__kfree_skb_defer, > SKBs will be avail in the napi_alloc_cache->skb_cache. > > If no SKBs are avail for reuse, then only bulk alloc 8 SKBs, to limit > the potential overshooting unused SKBs needed to free'ed when NAPI > cycle ends (flushed in net_rx_action via __kfree_skb_flush()). > > Generalize ___build_skb() to allow passing it a preallocated SKB. > > I've previously demonstrated a 1% speedup for IPv4 forwarding, when > used on the ixgbe driver. If SKB alloc and free happens on different > CPUs (like in RPS use-case) the performance benefit is expected to > increase. Really I am not sure this patch series is worth the effort. For freeing buffers in Tx it was an obvious win. But adding all this complexity for a 1% gain just doesn't seen worth the effort. > All drivers using the napi_alloc_skb() call will benefit from > this change automatically. Yes, but a 1% improvement is essentially just noise. I'd say we need to show a better gain or be able to more precisely show that this is a definite gain and not just a test fluctuation resulting in the improvement. For all I know all of the gain could be due to a function shifting around so that some loop is now 16 byte aligned. > Joint work with Alexander Duyck. > > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> > Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> The fact that this is still using my redhat.com address says volumes. If I recall I think we were working on this code around 9 months ago and had similar issues with it showing either negative performance or no gain. My advice then was to push the bulk free code and try to find a way to fix the bulk allocation. If we are still at this state for bulk allocation then maybe we need to drop the bulk allocation and start looking at other avenues to pursue. > --- > net/core/skbuff.c | 71 ++++++++++++++++++++++++++++++++++------------------- > 1 file changed, 45 insertions(+), 26 deletions(-) > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 5586be93632f..e85f1065b263 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -281,32 +281,14 @@ nodata: > } > EXPORT_SYMBOL(__alloc_skb); > > -/** > - * __build_skb - build a network buffer > - * @data: data buffer provided by caller > - * @frag_size: size of data, or 0 if head was kmalloced > - * > - * Allocate a new &sk_buff. Caller provides space holding head and > - * skb_shared_info. @data must have been allocated by kmalloc() only if > - * @frag_size is 0, otherwise data should come from the page allocator > - * or vmalloc() > - * The return is the new skb buffer. > - * On a failure the return is %NULL, and @data is not freed. > - * Notes : > - * Before IO, driver allocates only data buffer where NIC put incoming frame > - * Driver should add room at head (NET_SKB_PAD) and > - * MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info)) > - * After IO, driver calls build_skb(), to allocate sk_buff and populate it > - * before giving packet to stack. > - * RX rings only contains data buffers, not full skbs. 
> - */ > -struct sk_buff *__build_skb(void *data, unsigned int frag_size) > +/* Allows skb being (pre)allocated by caller */ > +static inline > +struct sk_buff *___build_skb(void *data, unsigned int frag_size, > + struct sk_buff *skb) > { > struct skb_shared_info *shinfo; > - struct sk_buff *skb; > unsigned int size = frag_size ? : ksize(data); > > - skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC); > if (!skb) > return NULL; > > @@ -331,6 +313,34 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size) > return skb; > } > > +/** > + * __build_skb - build a network buffer > + * @data: data buffer provided by caller > + * @frag_size: size of data, or 0 if head was kmalloced > + * > + * Allocate a new &sk_buff. Caller provides space holding head and > + * skb_shared_info. @data must have been allocated by kmalloc() only if > + * @frag_size is 0, otherwise data should come from the page allocator > + * or vmalloc() > + * The return is the new skb buffer. > + * On a failure the return is %NULL, and @data is not freed. > + * Notes : > + * Before IO, driver allocates only data buffer where NIC put incoming frame > + * Driver should add room at head (NET_SKB_PAD) and > + * MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info)) > + * After IO, driver calls build_skb(), to allocate sk_buff and populate it > + * before giving packet to stack. > + * RX rings only contains data buffers, not full skbs. > + */ > +struct sk_buff *__build_skb(void *data, unsigned int frag_size) > +{ > + struct sk_buff *skb; > + unsigned int size = frag_size ? : ksize(data); > + > + skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC); > + return ___build_skb(data, size, skb); > +} > + > /* build_skb() is wrapper over __build_skb(), that specifically > * takes care of skb->head and skb->pfmemalloc > * This means that if @frag_size is not zero, then @data must be backed If we can avoid having to break up build_skb into more functions that would be preferred. I realize I probably wrote this code in order to enable the bulk allocation approach, but really I don't want to add more complexity unless we can show a strong gain which we haven't been able to demonstrate. > @@ -490,8 +500,8 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len, > > len += NET_SKB_PAD + NET_IP_ALIGN; > > - if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) || > - (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) { > + if (unlikely((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) || > + (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA)))) { > skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE); > if (!skb) > goto skb_fail; Does this unlikely really make a difference? If so you might want to move it into a patch all on its own. > @@ -508,11 +518,20 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len, > if (unlikely(!data)) > return NULL; > > - skb = __build_skb(data, len); > - if (unlikely(!skb)) { > +#define BULK_ALLOC_SIZE 8 > + if (!nc->skb_count) { > + nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache, > + gfp_mask, BULK_ALLOC_SIZE, > + nc->skb_cache); > + } > + if (likely(nc->skb_count)) { > + skb = (struct sk_buff *)nc->skb_cache[--nc->skb_count]; > + } else { > + /* alloc bulk failed */ So did you try doing a low latency socket test with this patch? I'm guessing not as I am certain there is a negative impact from having to allocate 8 buffers, and then free back 7 every time you call the NAPI polling routine with just one buffer in the ring. 
> skb_free_frag(data); > return NULL; > } > + skb = ___build_skb(data, len, skb); > > /* use OR instead of assignment to avoid clearing of bits in mask */ > if (nc->page.pfmemalloc) >
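For readers skimming the thread, the alloc/reuse/flush cycle described in the commit message can be pictured with a small userspace model. This is only an illustrative sketch, not kernel code: malloc()/free() stand in for the slab bulk calls, and apart from the skb_cache/skb_count fields and the bulk size of 8 (both taken from the patch) every name below is invented.

```c
#include <stdio.h>
#include <stdlib.h>

#define SKB_CACHE_SIZE 64   /* mirrors NAPI_SKB_CACHE_SIZE */
#define BULK_ALLOC_SIZE 8   /* refill quantum used by the patch */

struct napi_alloc_cache_model {
	void *skb_cache[SKB_CACHE_SIZE];
	unsigned int skb_count;
};

/* napi_alloc_skb() path: reuse a cached head, else bulk-refill with 8 */
static void *model_alloc(struct napi_alloc_cache_model *nc)
{
	if (!nc->skb_count) {
		for (int i = 0; i < BULK_ALLOC_SIZE; i++)  /* kmem_cache_alloc_bulk() stand-in */
			nc->skb_cache[nc->skb_count++] = malloc(256);
	}
	return nc->skb_cache[--nc->skb_count];
}

/* napi_consume_skb()/__kfree_skb_defer() path: park the head for reuse */
static void model_free_defer(struct napi_alloc_cache_model *nc, void *skb)
{
	if (nc->skb_count < SKB_CACHE_SIZE)
		nc->skb_cache[nc->skb_count++] = skb;
	else
		free(skb);
}

/* __kfree_skb_flush(): return whatever is left when the NAPI cycle ends */
static void model_flush(struct napi_alloc_cache_model *nc)
{
	while (nc->skb_count)
		free(nc->skb_cache[--nc->skb_count]);  /* kmem_cache_free_bulk() stand-in */
}

int main(void)
{
	struct napi_alloc_cache_model nc = { .skb_count = 0 };

	void *tx_done = model_alloc(&nc);   /* pretend this head sits on a TX ring */
	model_free_defer(&nc, tx_done);     /* TX completion inside the NAPI poll */
	void *rx = model_alloc(&nc);        /* RX alloc in the same cycle reuses it */
	printf("reused: %s\n", rx == tx_done ? "yes" : "no");
	model_free_defer(&nc, rx);
	model_flush(&nc);                   /* net_rx_action ends the softirq cycle */
	return 0;
}
```

Running it prints "reused: yes", which is the recycling effect the commit message counts on when TX completion and RX allocation happen in the same softirq.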
On Mon, 9 May 2016 09:20:41 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote: > On Mon, May 9, 2016 at 6:44 AM, Jesper Dangaard Brouer > <brouer@redhat.com> wrote: > > This patch introduce bulk alloc of SKBs and allow reuse of SKBs > > free'ed in same softirq cycle. SKBs are normally free'ed during TX > > completion, but most high speed drivers also cleanup TX ring during > > NAPI RX poll cycle. Thus, if using napi_consume_skb/__kfree_skb_defer, > > SKBs will be avail in the napi_alloc_cache->skb_cache. > > > > If no SKBs are avail for reuse, then only bulk alloc 8 SKBs, to limit > > the potential overshooting unused SKBs needed to free'ed when NAPI > > cycle ends (flushed in net_rx_action via __kfree_skb_flush()). > > > > Generalize ___build_skb() to allow passing it a preallocated SKB. > > > > I've previously demonstrated a 1% speedup for IPv4 forwarding, when > > used on the ixgbe driver. If SKB alloc and free happens on different > > CPUs (like in RPS use-case) the performance benefit is expected to > > increase. > > Really I am not sure this patch series is worth the effort. For > freeing buffers in Tx it was an obvious win. But adding all this > complexity for a 1% gain just doesn't seen worth the effort. I still think it is worth it, because: 1) it enables recycling, which is more likely for real-life traffic (e.g. some TCP ACKs gets TX DMA completion cleanup, and RX can re-use these), and 2) because bulk alloc and bulk free gets "coupled" (mostly relevant when doing cross CPU). > > All drivers using the napi_alloc_skb() call will benefit from > > this change automatically. > > Yes, but a 1% improvement is essentially just noise. I'd say we need > to show a better gain or be able to more precisely show that this is a > definite gain and not just a test fluctuation resulting in the > improvement. For all I know all of the gain could be due to a > function shifting around so that some loop is now 16 byte aligned. > > > Joint work with Alexander Duyck. > > > > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> > > Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> > > The fact that this is still using my redhat.com address says volumes. > If I recall I think we were working on this code around 9 months ago > and had similar issues with it showing either negative performance or > no gain. My advice then was to push the bulk free code and try to > find a way to fix the bulk allocation. If we are still at this state > for bulk allocation then maybe we need to drop the bulk allocation and > start looking at other avenues to pursue. I'm sure there is a gain here. Sure I can spend 2-3 weeks coming up with a benchmark that show a bigger gain, but is it time well spend? My quick benchmarking with mlx4 actually showed 2.7% improvement, for local UDP socket delivery. And only 0.6% on IP-forwarding. If I find some bidirectional traffic then the benefit should be higher due to recycling kicking in. We did do as you recommended, and the bulk free code first. I'm just getting the last pieces pushed. I didn't see any negative performance in my testing. As you also know, tuning the SLUB system will give higher performance, easily. In the future, I'm planning to get some auto-tuning into the SLUB allocator. I've already discussed this with Christiph Lameter, at MM-summit, see presentation[1] slides 4 and 5. 
[1] http://people.netfilter.org/hawk/presentations/MM-summit2016/slab_mm_summit2016.odp > > --- > > net/core/skbuff.c | 71 > > ++++++++++++++++++++++++++++++++++------------------- 1 file > > changed, 45 insertions(+), 26 deletions(-) > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > > index 5586be93632f..e85f1065b263 100644 > > --- a/net/core/skbuff.c > > +++ b/net/core/skbuff.c > > @@ -281,32 +281,14 @@ nodata: > > } > > EXPORT_SYMBOL(__alloc_skb); > > > > -/** > > - * __build_skb - build a network buffer > > - * @data: data buffer provided by caller > > - * @frag_size: size of data, or 0 if head was kmalloced > > - * > > - * Allocate a new &sk_buff. Caller provides space holding head and > > - * skb_shared_info. @data must have been allocated by kmalloc() > > only if > > - * @frag_size is 0, otherwise data should come from the page > > allocator > > - * or vmalloc() > > - * The return is the new skb buffer. > > - * On a failure the return is %NULL, and @data is not freed. > > - * Notes : > > - * Before IO, driver allocates only data buffer where NIC put > > incoming frame > > - * Driver should add room at head (NET_SKB_PAD) and > > - * MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info)) > > - * After IO, driver calls build_skb(), to allocate sk_buff and > > populate it > > - * before giving packet to stack. > > - * RX rings only contains data buffers, not full skbs. > > - */ > > -struct sk_buff *__build_skb(void *data, unsigned int frag_size) > > +/* Allows skb being (pre)allocated by caller */ > > +static inline > > +struct sk_buff *___build_skb(void *data, unsigned int frag_size, > > + struct sk_buff *skb) > > { > > struct skb_shared_info *shinfo; > > - struct sk_buff *skb; > > unsigned int size = frag_size ? : ksize(data); > > > > - skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC); > > if (!skb) > > return NULL; > > > > @@ -331,6 +313,34 @@ struct sk_buff *__build_skb(void *data, > > unsigned int frag_size) return skb; > > } > > > > +/** > > + * __build_skb - build a network buffer > > + * @data: data buffer provided by caller > > + * @frag_size: size of data, or 0 if head was kmalloced > > + * > > + * Allocate a new &sk_buff. Caller provides space holding head and > > + * skb_shared_info. @data must have been allocated by kmalloc() > > only if > > + * @frag_size is 0, otherwise data should come from the page > > allocator > > + * or vmalloc() > > + * The return is the new skb buffer. > > + * On a failure the return is %NULL, and @data is not freed. > > + * Notes : > > + * Before IO, driver allocates only data buffer where NIC put > > incoming frame > > + * Driver should add room at head (NET_SKB_PAD) and > > + * MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info)) > > + * After IO, driver calls build_skb(), to allocate sk_buff and > > populate it > > + * before giving packet to stack. > > + * RX rings only contains data buffers, not full skbs. > > + */ > > +struct sk_buff *__build_skb(void *data, unsigned int frag_size) > > +{ > > + struct sk_buff *skb; > > + unsigned int size = frag_size ? : ksize(data); > > + > > + skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC); > > + return ___build_skb(data, size, skb); > > +} > > + > > /* build_skb() is wrapper over __build_skb(), that specifically > > * takes care of skb->head and skb->pfmemalloc > > * This means that if @frag_size is not zero, then @data must be > > backed > > If we can avoid having to break up build_skb into more functions that > would be preferred. 
I realize I probably wrote this code in order to > enable the bulk allocation approach, but really I don't want to add > more complexity unless we can show a strong gain which we haven't been > able to demonstrate. You do notice that ___build_skb gets inlined, thus there is not performance loss. And this change was explicitly requested last time this patch was reviewed. And I think this variant is much less intrusive. > > @@ -490,8 +500,8 @@ struct sk_buff *__napi_alloc_skb(struct > > napi_struct *napi, unsigned int len, > > > > len += NET_SKB_PAD + NET_IP_ALIGN; > > > > - if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) || > > - (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) { > > + if (unlikely((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) || > > + (gfp_mask & (__GFP_DIRECT_RECLAIM | > > GFP_DMA)))) { skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, > > NUMA_NO_NODE); if (!skb) > > goto skb_fail; > > Does this unlikely really make a difference? If so you might want to > move it into a patch all on its own. I can place it in a patch of it's own. I just noticed the compiler layed out the code wrongly compared to the normal use-case. > > @@ -508,11 +518,20 @@ struct sk_buff *__napi_alloc_skb(struct > > napi_struct *napi, unsigned int len, if (unlikely(!data)) > > return NULL; > > > > - skb = __build_skb(data, len); > > - if (unlikely(!skb)) { > > +#define BULK_ALLOC_SIZE 8 > > + if (!nc->skb_count) { > > + nc->skb_count = > > kmem_cache_alloc_bulk(skbuff_head_cache, > > + gfp_mask, > > BULK_ALLOC_SIZE, > > + > > nc->skb_cache); > > + } > > + if (likely(nc->skb_count)) { > > + skb = (struct sk_buff > > *)nc->skb_cache[--nc->skb_count]; > > + } else { > > + /* alloc bulk failed */ > > So did you try doing a low latency socket test with this patch? I'm > guessing not as I am certain there is a negative impact from having to > allocate 8 buffers, and then free back 7 every time you call the NAPI > polling routine with just one buffer in the ring. There is a very high probability that pulling out 8 objects, and returning 7 object, will have the same cost of alloc and free of a single object. This is due to how the SLUB allocator's per CPU allocator works. Notice I said "high probability". I have adjusted the slab bulk APIs, such that we can extend them to bulk return "upto" a given number of objects. Then the SLUB allocator can remove the "high probability" part, and make sure only to return object on the per CPU slab-page, thus guaranteeing no cmpxchg_double calls, and only local_irq_disable/enable cycles, which is actually faster than the normal fastpath local cmpxchg_double (non-atomic variant). > > skb_free_frag(data); > > return NULL; > > } > > + skb = ___build_skb(data, len, skb); > > > > /* use OR instead of assignment to avoid clearing of bits > > in mask */ if (nc->page.pfmemalloc) > >
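For readers who have not looked at the slab side, the bulk interfaces being debated here have been in the tree since v4.3 and are used roughly as below. This is a kernel-style sketch, not code from this series; "my_cache" and both helper names are hypothetical. The relevant property is that kmem_cache_alloc_bulk() returns the number of objects it actually placed in the array (0 on failure), which is why the patch assigns its return value to nc->skb_count and checks it before use.

```c
#include <linux/slab.h>

/* Hypothetical helpers showing the shape of the slab bulk API. */
static int refill_from_slab(struct kmem_cache *my_cache,
			    void **slots, unsigned int want)
{
	/* Returns how many objects landed in slots[], 0 on failure. */
	return kmem_cache_alloc_bulk(my_cache, GFP_ATOMIC, want, slots);
}

static void drain_to_slab(struct kmem_cache *my_cache,
			  void **slots, unsigned int count)
{
	if (count)
		kmem_cache_free_bulk(my_cache, count, slots);
}
```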
On Mon, May 9, 2016 at 12:59 PM, Jesper Dangaard Brouer <brouer@redhat.com> wrote: > On Mon, 9 May 2016 09:20:41 -0700 > Alexander Duyck <alexander.duyck@gmail.com> wrote: > >> On Mon, May 9, 2016 at 6:44 AM, Jesper Dangaard Brouer >> <brouer@redhat.com> wrote: >> > This patch introduce bulk alloc of SKBs and allow reuse of SKBs >> > free'ed in same softirq cycle. SKBs are normally free'ed during TX >> > completion, but most high speed drivers also cleanup TX ring during >> > NAPI RX poll cycle. Thus, if using napi_consume_skb/__kfree_skb_defer, >> > SKBs will be avail in the napi_alloc_cache->skb_cache. >> > >> > If no SKBs are avail for reuse, then only bulk alloc 8 SKBs, to limit >> > the potential overshooting unused SKBs needed to free'ed when NAPI >> > cycle ends (flushed in net_rx_action via __kfree_skb_flush()). >> > >> > Generalize ___build_skb() to allow passing it a preallocated SKB. >> > >> > I've previously demonstrated a 1% speedup for IPv4 forwarding, when >> > used on the ixgbe driver. If SKB alloc and free happens on different >> > CPUs (like in RPS use-case) the performance benefit is expected to >> > increase. >> >> Really I am not sure this patch series is worth the effort. For >> freeing buffers in Tx it was an obvious win. But adding all this >> complexity for a 1% gain just doesn't seen worth the effort. > > I still think it is worth it, because: 1) it enables recycling, which > is more likely for real-life traffic (e.g. some TCP ACKs gets TX DMA > completion cleanup, and RX can re-use these), and 2) because bulk alloc > and bulk free gets "coupled" (mostly relevant when doing cross CPU). I disagree. While there will be some ACKs it will likely be significantly less then the number of packets received, especially when you take GRO into account. > >> > All drivers using the napi_alloc_skb() call will benefit from >> > this change automatically. >> >> Yes, but a 1% improvement is essentially just noise. I'd say we need >> to show a better gain or be able to more precisely show that this is a >> definite gain and not just a test fluctuation resulting in the >> improvement. For all I know all of the gain could be due to a >> function shifting around so that some loop is now 16 byte aligned. >> >> > Joint work with Alexander Duyck. >> > >> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> >> > Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com> >> >> The fact that this is still using my redhat.com address says volumes. >> If I recall I think we were working on this code around 9 months ago >> and had similar issues with it showing either negative performance or >> no gain. My advice then was to push the bulk free code and try to >> find a way to fix the bulk allocation. If we are still at this state >> for bulk allocation then maybe we need to drop the bulk allocation and >> start looking at other avenues to pursue. > > I'm sure there is a gain here. Sure I can spend 2-3 weeks coming up > with a benchmark that show a bigger gain, but is it time well spend? > > My quick benchmarking with mlx4 actually showed 2.7% improvement, for > local UDP socket delivery. And only 0.6% on IP-forwarding. If I find > some bidirectional traffic then the benefit should be higher due to > recycling kicking in. Is that with or without patch 2 being a part of the set? This patch should be able to prove itself as a stand-alone patch. I'd say go ahead and submit patch 2 as a standalone and then we can start working on collecting numbers on the mlx4. 
Also are you running your test with GRO disabled? If so you might try enabling it on the mlx4 because that changes the skbuff allocation which would impact the behavior for the device. Try testing with TCP_RR instead and watch the CPU utilization. I'm suspecting allocating 8 and freeing 7 buffers for every 1 buffer received will blow any gains right out of the water. Also try it with a mix of traffic. So have one NIC doing TCP_RR while another is doing a stream test. You are stuffing 7 buffers onto a queue that were were using to perform bulk freeing. How much of a penalty do you take if you are now limited on how many you can bulk free because somebody left a stray 7 packets sitting on the queue? Also please stop focusing on IP-forwarding. There are some use cases for it out there, but the vast majority of people don't care that much about it. It is just one data point out of many. Try running a few TCP based applications and see if you can notice any difference. It is okay to focus on routing numbers when you are doing work that will only impact routing or has next to no likelihood of introducing a regression, but the fact is the bulk alloc has a strong likelihood of having effects on other parts of the network stack since you are pulling in more buffers then you actually need. > We did do as you recommended, and the bulk free code first. I'm > just getting the last pieces pushed. I didn't see any negative > performance in my testing. Not seeing any negative performance is not the same as seeing gains. You are introducing complexity, there needs to be something to show for it. Otherwise you are just making the code harder to maintain for no good reason. > As you also know, tuning the SLUB system will give higher performance, > easily. In the future, I'm planning to get some auto-tuning into the > SLUB allocator. I've already discussed this with Christiph Lameter, at > MM-summit, see presentation[1] slides 4 and 5. We aren't discussing tuning parameters. We are discussing this patch. If you want to argue that with certain tuning parameters this shows more performance then bring the numbers, but don't try to bias things. If you have to tune the system in some way that nobody will there is probably no point in submitting the patch because nobody will use it that way. You also didn't test for low latency. As I said this is coming at a high expense to other use cases, but your use case only shows a 1% improvement. In my opinion it is not worth adding the complexity if we cannot show much of a gain. If you want to win me over we need to be able to justify the complexity. Right now I am not seeing it. In addition patch 3 illustrates the added complexity you are bringing to all this because you are having to add limitations to napi_alloc_skb because you are trying to make it work with recycling which historically has never been a win because there are too many traffic patterns where recycling cannot occur. 
> [1] http://people.netfilter.org/hawk/presentations/MM-summit2016/slab_mm_summit2016.odp > >> > --- >> > net/core/skbuff.c | 71 >> > ++++++++++++++++++++++++++++++++++------------------- 1 file >> > changed, 45 insertions(+), 26 deletions(-) >> > >> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c >> > index 5586be93632f..e85f1065b263 100644 >> > --- a/net/core/skbuff.c >> > +++ b/net/core/skbuff.c >> > @@ -281,32 +281,14 @@ nodata: >> > } >> > EXPORT_SYMBOL(__alloc_skb); >> > >> > -/** >> > - * __build_skb - build a network buffer >> > - * @data: data buffer provided by caller >> > - * @frag_size: size of data, or 0 if head was kmalloced >> > - * >> > - * Allocate a new &sk_buff. Caller provides space holding head and >> > - * skb_shared_info. @data must have been allocated by kmalloc() >> > only if >> > - * @frag_size is 0, otherwise data should come from the page >> > allocator >> > - * or vmalloc() >> > - * The return is the new skb buffer. >> > - * On a failure the return is %NULL, and @data is not freed. >> > - * Notes : >> > - * Before IO, driver allocates only data buffer where NIC put >> > incoming frame >> > - * Driver should add room at head (NET_SKB_PAD) and >> > - * MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info)) >> > - * After IO, driver calls build_skb(), to allocate sk_buff and >> > populate it >> > - * before giving packet to stack. >> > - * RX rings only contains data buffers, not full skbs. >> > - */ >> > -struct sk_buff *__build_skb(void *data, unsigned int frag_size) >> > +/* Allows skb being (pre)allocated by caller */ >> > +static inline >> > +struct sk_buff *___build_skb(void *data, unsigned int frag_size, >> > + struct sk_buff *skb) >> > { >> > struct skb_shared_info *shinfo; >> > - struct sk_buff *skb; >> > unsigned int size = frag_size ? : ksize(data); >> > >> > - skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC); >> > if (!skb) >> > return NULL; >> > >> > @@ -331,6 +313,34 @@ struct sk_buff *__build_skb(void *data, >> > unsigned int frag_size) return skb; >> > } >> > >> > +/** >> > + * __build_skb - build a network buffer >> > + * @data: data buffer provided by caller >> > + * @frag_size: size of data, or 0 if head was kmalloced >> > + * >> > + * Allocate a new &sk_buff. Caller provides space holding head and >> > + * skb_shared_info. @data must have been allocated by kmalloc() >> > only if >> > + * @frag_size is 0, otherwise data should come from the page >> > allocator >> > + * or vmalloc() >> > + * The return is the new skb buffer. >> > + * On a failure the return is %NULL, and @data is not freed. >> > + * Notes : >> > + * Before IO, driver allocates only data buffer where NIC put >> > incoming frame >> > + * Driver should add room at head (NET_SKB_PAD) and >> > + * MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info)) >> > + * After IO, driver calls build_skb(), to allocate sk_buff and >> > populate it >> > + * before giving packet to stack. >> > + * RX rings only contains data buffers, not full skbs. >> > + */ >> > +struct sk_buff *__build_skb(void *data, unsigned int frag_size) >> > +{ >> > + struct sk_buff *skb; >> > + unsigned int size = frag_size ? 
: ksize(data); >> > + >> > + skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC); >> > + return ___build_skb(data, size, skb); >> > +} >> > + >> > /* build_skb() is wrapper over __build_skb(), that specifically >> > * takes care of skb->head and skb->pfmemalloc >> > * This means that if @frag_size is not zero, then @data must be >> > backed >> >> If we can avoid having to break up build_skb into more functions that >> would be preferred. I realize I probably wrote this code in order to >> enable the bulk allocation approach, but really I don't want to add >> more complexity unless we can show a strong gain which we haven't been >> able to demonstrate. > > You do notice that ___build_skb gets inlined, thus there is not > performance loss. And this change was explicitly requested last time > this patch was reviewed. And I think this variant is much less > intrusive. I didn't notice the inline, so it means we are bloating the code a bit instead of adding more function calls. I guess that is a better trade off, but still not the most desirable thing to do. > >> > @@ -490,8 +500,8 @@ struct sk_buff *__napi_alloc_skb(struct >> > napi_struct *napi, unsigned int len, >> > >> > len += NET_SKB_PAD + NET_IP_ALIGN; >> > >> > - if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) || >> > - (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) { >> > + if (unlikely((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) || >> > + (gfp_mask & (__GFP_DIRECT_RECLAIM | >> > GFP_DMA)))) { skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, >> > NUMA_NO_NODE); if (!skb) >> > goto skb_fail; >> >> Does this unlikely really make a difference? If so you might want to >> move it into a patch all on its own. > > I can place it in a patch of it's own. I just noticed the compiler > layed out the code wrongly compared to the normal use-case. What version of gcc are you seeing this with? From what I can tell on the 4.8.5 version that came with my RHEL 7.2 build I am seeing the proper setup with 2 compares and 2 jumps to the remote code. >> > @@ -508,11 +518,20 @@ struct sk_buff *__napi_alloc_skb(struct >> > napi_struct *napi, unsigned int len, if (unlikely(!data)) >> > return NULL; >> > >> > - skb = __build_skb(data, len); >> > - if (unlikely(!skb)) { >> > +#define BULK_ALLOC_SIZE 8 >> > + if (!nc->skb_count) { >> > + nc->skb_count = >> > kmem_cache_alloc_bulk(skbuff_head_cache, >> > + gfp_mask, >> > BULK_ALLOC_SIZE, >> > + >> > nc->skb_cache); >> > + } >> > + if (likely(nc->skb_count)) { >> > + skb = (struct sk_buff >> > *)nc->skb_cache[--nc->skb_count]; >> > + } else { >> > + /* alloc bulk failed */ >> >> So did you try doing a low latency socket test with this patch? I'm >> guessing not as I am certain there is a negative impact from having to >> allocate 8 buffers, and then free back 7 every time you call the NAPI >> polling routine with just one buffer in the ring. > > There is a very high probability that pulling out 8 objects, and > returning 7 object, will have the same cost of alloc and free of a > single object. This is due to how the SLUB allocator's per CPU > allocator works. I very much doubt that. You are still having to move around many more objects then you really should be and dragging in 8 cache lines as you have to walk the percpu freelist to pull the entries and then push them back in. > Notice I said "high probability". I have adjusted the slab bulk APIs, > such that we can extend them to bulk return "upto" a given number of > objects. 
> Then the SLUB allocator can remove the "high probability" > part, and make sure only to return objects on the per CPU slab-page, > thus guaranteeing no cmpxchg_double calls, and only > local_irq_disable/enable cycles, which is actually faster than the > normal fastpath local cmpxchg_double (non-atomic variant). On your system. There is no guarantee that is the case across all CPUs supported by the x86 architecture. Also I can think of scenarios where your code would become very expensive. The fact is you are pushing an additional 7 objects into the L1 cache each time you do one of these bulk allocations. The question you have to ask yourself is what it is you are evicting when you do it. There are scenarios where you could be evicting critical data back out to L2, which will negatively impact your performance. A 1% gain 99% of the time is kind of useless when there is a risk of a significant performance hit for that remaining 1% of cases.
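For context on the "allocate 8, free back 7" concern: the flush that runs when net_rx_action() ends already exists in the tree this series targets, and roughly does the following (a paraphrase of __kfree_skb_flush() from memory, so treat it as a sketch rather than a quote). With the bulk-alloc patch applied, a poll cycle that handled a single packet would refill with 8 heads, consume one, and push the remaining 7 back through this one bulk free.

```c
/* Rough sketch of __kfree_skb_flush() in net/core/skbuff.c of that era:
 * empty the per-CPU skb_cache back into the slab in a single bulk call
 * once the NAPI/softirq cycle ends. */
void __kfree_skb_flush(void)
{
	struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);

	/* flush the skb_cache if it holds any objects */
	if (nc->skb_count) {
		kmem_cache_free_bulk(skbuff_head_cache, nc->skb_count,
				     nc->skb_cache);
		nc->skb_count = 0;
	}
}
```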
On Mon, 9 May 2016 13:46:32 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote: > > As you also know, tuning the SLUB system will give higher performance, > > easily. In the future, I'm planning to get some auto-tuning into the > > SLUB allocator. I've already discussed this with Christoph Lameter, at > > MM-summit, see presentation[1] slides 4 and 5. > > We aren't discussing tuning parameters. We are discussing this patch. > If you want to argue that with certain tuning parameters this shows > more performance then bring the numbers, but don't try to bias things. > If you have to tune the system in some way that nobody will there is > probably no point in submitting the patch because nobody will use it > that way. I think you missed the point. I didn't do parameter tuning for my benchmarks. I hate tuning parameters; they are a huge problem for users of the kernel. My point is that I want to implement auto-tuning in the SLUB allocator. This network stack use-case will just be one use-case where the auto-tuning needs to show improvements. It is not that complicated. FreeBSD has this kind of auto-tuning to the workload in its slab implementation.
On Mon, 9 May 2016 13:46:32 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote: > Try testing with TCP_RR instead and watch the CPU utilization. I'm > suspecting allocating 8 and freeing 7 buffers for every 1 buffer > received will blow any gains right out of the water. Also try it with > a mix of traffic. So have one NIC doing TCP_RR while another is doing > a stream test. You are stuffing 7 buffers onto a queue that were were > using to perform bulk freeing. How much of a penalty do you take if > you are now limited on how many you can bulk free because somebody > left a stray 7 packets sitting on the queue? Testing with TCP_RR, is not a very "clean" network test. One have to be very careful what is actually being tested, is it the server or client which is the bottleneck. And most of all this is test of the CPU/process scheduler. We can avoid the scheduler problem by enabling busy_poll/busy_read. I guess you want to see the "scheduler test" first. Default setting of disabled busy poll on both client and server: Disable busy poll on both client and server, Not patched: $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr 16384 87380 1 1 60.00 78077.55 3.74 2.69 3.830 8.265 16384 87380 Disable busy poll on both client and server, patched: $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr 16384 87380 1 1 60.00 78517.32 3.06 2.84 3.118 8.677 16384 87380 I will not call this an improvement... the results are basically the same. Next step enabling busy poll on the server. The server is likely the bottleneck, given it's CPU is slower than the client. Context switches on the server is too high 156K/sec, after enabling busy poll reduced to 620/sec. Note the client is doing around 233k/sec context switches, (fairly impressive). Enabling busy poll on the server: sysctl -w net.core.busy_poll=50 sysctl -w net.core.busy_read=50 Enabled busy poll only on server, Not patched: $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr 16384 87380 1 1 60.00 112480.72 5.90 4.68 4.194 9.984 16384 87380 Enabled busy poll only on server, patched: $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. 
per sec % S % S us/Tr us/Tr 16384 87380 1 1 60.00 110152.34 5.84 4.60 4.242 10.014 16384 87380 Numbers are too close, for any conclusions. Running a second run, on Not-patched kernel: Enabled busy poll only on server, Not patched: [jbrouer@canyon ~]$ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr 16384 87380 1 1 60.00 101554.90 4.12 4.31 3.245 10.185 16384 87380 Thus, variation between runs are bigger than any improvement/regression, thus no performance conclusions from this change can be drawn. Lets move beyond testing the CPU/process scheduler by enabling busy-polling on both client and server: (sysctl -w net.core.busy_poll=50 ;sysctl -w net.core.busy_read=50) Enable busy poll on both client and server, Not patched: $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr 16384 87380 1 1 60.00 137987.86 13.18 4.77 7.643 8.298 16384 87380 Enable busy poll on both client and server, patched: $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr 16384 87380 1 1 60.00 147324.38 13.76 4.76 7.474 7.747 16384 87380 I've a little bit surprised to see such a large improvement here 6.76%. 147324/137987*100 = 106.76 I'm remaining skeptic towards this measurement, as the improvement should not be this high. Even if recycling is happening. Perf record does show less calls to __slab_free(), indicating better interaction with SLUB, and perhaps recycling working. But this is only a perf-report change from 0.37% to 0.33%. More testing show not-patched kernel fluctuate between 125k-143k/sec, and patched kernel fluctuate between 131k-152k/sec. The ranges are too high, to say anything conclusive. It seems to be timing dependent, as starting and stoping the test with -D 1, show a rate variation within 2k/sec, but rate itself can vary withing the range stated.
On Tue, 2016-05-10 at 14:30 +0200, Jesper Dangaard Brouer wrote: > Disable busy poll on both client and server, Not patched: > > $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 > () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote > bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr > > 16384 87380 1 1 60.00 78077.55 3.74 2.69 3.830 8.265 > 16384 87380 Tell us more about the -T6,6 For example how many TX/RX queues you have on the NIC, and which cpus service interrupts.
On Tue, 10 May 2016 06:48:54 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Tue, 2016-05-10 at 14:30 +0200, Jesper Dangaard Brouer wrote: > > > Disable busy poll on both client and server, Not patched: > > > > $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc > > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 port 0 AF_INET to 198.18.40.2 > > () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind > > Local /Remote > > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > > Send Recv Size Size Time Rate local remote local remote > > bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr > > > > 16384 87380 1 1 60.00 78077.55 3.74 2.69 3.830 8.265 > > 16384 87380 > > Tell us more about the -T6,6 > > For example how many TX/RX queues you have on the NIC, and which cpus service interrupts. The -T6,6 option: -T lcpu,rcpu Request netperf/netserver be bound to local/remote cpu I use the option to get more stable results. If I don't pin/bind the CPU netperf/netserver is running on, then the CPU scheduler will migrate the processes around. This gives unpredictable results, worst of all for the busy_poll tests, especially if the RX softirq runs on the same CPU (also true if it runs on a HyperThread sibling). Netperf client (8 cores i7-4790K CPU @ 4.00GHz) RX:8 and TX:8 queues. Netserver server (2x 12 cores E5-2630 @ 2.30GHz) RX:8 and TX:24 queues. Driver mlx4. Disabled GRO to hit the code path I changed in patch 2.
On Tue, 2016-05-10 at 16:48 +0200, Jesper Dangaard Brouer wrote: > On Tue, 10 May 2016 06:48:54 -0700 > Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > On Tue, 2016-05-10 at 14:30 +0200, Jesper Dangaard Brouer wrote: > > > > > Disable busy poll on both client and server, Not patched: > > > > > > $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc > > > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 port 0 AF_INET to 198.18.40.2 > > > () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind > > > Local /Remote > > > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > > > Send Recv Size Size Time Rate local remote local remote > > > bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr > > > > > > 16384 87380 1 1 60.00 78077.55 3.74 2.69 3.830 8.265 > > > 16384 87380 > > > > Tell us more about the -T6,6 > > > > For example how many TX/RX queues you have on the NIC, and which cpus > > service interrupts. > > The -T6,6 option: > -T lcpu,rcpu Request netperf/netserver be bound to local/remote cpu > Sure, I know -T option in netperf. > I use the option to get more stable results. If I don't pin/bind the > CPU netperf/netserver is running on then the CPU scheduler will migrate > the processes around. This gives unpredictable results, worst for the > busy_poll tests. Especially if the RX softirq runs on the same CPU > (also true if it runs on a HyperTread siping). > > Netperf client (8 cores i7-4790K CPU @ 4.00GHz) RX:8 and TX:8 queues. > Netserver server (2x 12 cores E5-2630 @ 2.30GHz) RX:8 and TX:24 queues. > Driver mlx4. > Disabled GRO to hit code path I changed in patch 2. But are you using stuff like aRFS, RPS , RFS ? Each netperf run lands on different cpus, and we know that results can have a 25% variability because of that, even more on 2-node systems. By forcing -T6,6 you force the netperf/netserver cpu, not the RX queues. A nice effort would be to be able to chose the source in the 4-tuple at connect() time so that we know that Toeplitz hash will select the 'correct' RX queue.
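Eric's closing point — fixing the source side of the 4-tuple so the Toeplitz/RSS hash, and hence the RX queue, is identical on every run — can be done from a test program by binding the local port before connect(). A minimal sketch (the function name, addresses and ports are placeholders; netperf itself can pin the data-connection ports with its test-specific -P option, as mentioned later in the thread):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Pin the local port before connect() so the full 4-tuple -- and therefore
 * the Toeplitz/RSS hash and the RX queue it maps to -- is the same on
 * every run.  Addresses and ports here are just placeholders. */
int connect_with_fixed_sport(const char *dst_ip, uint16_t dst_port,
			     uint16_t src_port)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	struct sockaddr_in src = { .sin_family = AF_INET,
				   .sin_port = htons(src_port),
				   .sin_addr.s_addr = htonl(INADDR_ANY) };
	if (bind(fd, (struct sockaddr *)&src, sizeof(src)) < 0) {
		close(fd);
		return -1;
	}

	struct sockaddr_in dst = { .sin_family = AF_INET,
				   .sin_port = htons(dst_port) };
	inet_pton(AF_INET, dst_ip, &dst.sin_addr);
	if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```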
On 05/10/2016 05:30 AM, Jesper Dangaard Brouer wrote: > > On Mon, 9 May 2016 13:46:32 -0700 > Alexander Duyck <alexander.duyck@gmail.com> wrote: > >> Try testing with TCP_RR instead and watch the CPU utilization. I'm >> suspecting allocating 8 and freeing 7 buffers for every 1 buffer >> received will blow any gains right out of the water. Also try it with >> a mix of traffic. So have one NIC doing TCP_RR while another is doing >> a stream test. You are stuffing 7 buffers onto a queue that were were >> using to perform bulk freeing. How much of a penalty do you take if >> you are now limited on how many you can bulk free because somebody >> left a stray 7 packets sitting on the queue? > > Testing with TCP_RR, is not a very "clean" network test. One have to be > very careful what is actually being tested, is it the server or client > which is the bottleneck. And most of all this is test of the CPU/process > scheduler. > > We can avoid the scheduler problem by enabling busy_poll/busy_read. > > I guess you want to see the "scheduler test" first. Default setting of > disabled busy poll on both client and server: > > Disable busy poll on both client and server, Not patched: Lets also define what "Not patched" means. I only want the bulk allocation patch tested. The other patches you have for the mlx4 and the WARN_ON are just noise. If possible it would be best to focus on just the one patch that is high risk. > $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 > () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote > bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr > > 16384 87380 1 1 60.00 78077.55 3.74 2.69 3.830 8.265 > 16384 87380 > > Disable busy poll on both client and server, patched: > > $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 > () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote > bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr > > 16384 87380 1 1 60.00 78517.32 3.06 2.84 3.118 8.677 > 16384 87380 > > I will not call this an improvement... the results are basically the same. So I have a few suggestions. 1. Either switch to ixgbe and use ATR/Flow Director or look at setting up your test so that the RSS key and indirection table are the same for each test and use the "-- -P" option in netperf to force the use of the same 5 tuple for each test. 2. Cut down on the noise. Specifically rebuild your kernel with as few options enabled as possible. If you don't need it drop it out so that we can identify exactly how much gain there is to be had from your patches. Also you should increase your test to use multiple CPUs, or cut down on the number of CPUs so that you aren't cutting down the the CPU utilization so much. If you have to you might even look at doing something like using SAR -P 6 in order to be able to monitor the CPU utilization of just CPU 6 on your test system. 
The goal is you want your tests to be repeatable if you had to move to another system or another NIC so I would recommend trying to find a way to make it so that much of the fluctuation is ruled out and that your numbers are as reliable as possible. 3. You might even try a pktgen Rx and drop test to see what the difference is in ns/packet for your allocation routine. Assuming you can clear out the variability that would be a useful datapoint as you could then also collect the perf data to show which functions have reduced their total CPU time. > Next step enabling busy poll on the server. The server is likely the > bottleneck, given it's CPU is slower than the client. Context switches > on the server is too high 156K/sec, after enabling busy poll reduced to > 620/sec. Note the client is doing around 233k/sec context switches, > (fairly impressive). > > Enabling busy poll on the server: > sysctl -w net.core.busy_poll=50 > sysctl -w net.core.busy_read=50 > > > Enabled busy poll only on server, Not patched: > $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 > () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote > bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr > > 16384 87380 1 1 60.00 112480.72 5.90 4.68 4.194 9.984 > 16384 87380 > > Enabled busy poll only on server, patched: > > $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 > () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote > bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr > > 16384 87380 1 1 60.00 110152.34 5.84 4.60 4.242 10.014 > 16384 87380 > > Numbers are too close, for any conclusions. Agreed. > Running a second run, on Not-patched kernel: > Enabled busy poll only on server, Not patched: > [jbrouer@canyon ~]$ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 > () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote > bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr > > 16384 87380 1 1 60.00 101554.90 4.12 4.31 3.245 10.185 > 16384 87380 > > Thus, variation between runs are bigger than any improvement/regression, > thus no performance conclusions from this change can be drawn. Like Eric mentioned this is likely the fact that you are bouncing between Rx queues. > Lets move beyond testing the CPU/process scheduler by enabling > busy-polling on both client and server: > (sysctl -w net.core.busy_poll=50 ;sysctl -w net.core.busy_read=50) > > Enable busy poll on both client and server, Not patched: > > $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote > bytes bytes bytes bytes secs. 
per sec % S % S us/Tr us/Tr > > 16384 87380 1 1 60.00 137987.86 13.18 4.77 7.643 8.298 > 16384 87380 > > > Enable busy poll on both client and server, patched: > > $ netperf -H 198.18.40.2 -t TCP_RR -l 60 -T 6,6 -Cc > MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 198.18.40.2 () port 0 AF_INET : histogram : demo : first burst 0 : cpu bind > Local /Remote > Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem > Send Recv Size Size Time Rate local remote local remote > bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr > > 16384 87380 1 1 60.00 147324.38 13.76 4.76 7.474 7.747 > 16384 87380 > > I've a little bit surprised to see such a large improvement here 6.76%. > 147324/137987*100 = 106.76 That is a difference of 500ns per packet. I am highly doubtful we are seeing that much of an improvement as well. Odds are you were doing something cross-node in your first run or something along those lines since the difference is too large to be attributed to the bulk allocation change. > I'm remaining skeptic towards this measurement, as the improvement > should not be this high. Even if recycling is happening. > > Perf record does show less calls to __slab_free(), indicating better > interaction with SLUB, and perhaps recycling working. But this is > only a perf-report change from 0.37% to 0.33%. > > More testing show not-patched kernel fluctuate between 125k-143k/sec, > and patched kernel fluctuate between 131k-152k/sec. The ranges are too > high, to say anything conclusive. It seems to be timing dependent, as > starting and stoping the test with -D 1, show a rate variation within > 2k/sec, but rate itself can vary withing the range stated. I am pretty sure we are guaranteed to see a performance regression for socket based workloads. You are hoping for recycling to occur, but in almost all cases recycling almost always fails to show any gains and just ends up introducing the possibility for regressions as something always gets overlooked. On top of that you aren't guaranteed a frame from Tx clean-up is going to be warm in the cache so you may end up taking a few cache line misses as well. I haven't seen enough change in these patches to justify having them submitted to the kernel. A 1% improvement in one specific test case is kind of a vague reason to do something that very likely introduces regressions in many other cases. - Alex
```diff
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5586be93632f..e85f1065b263 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -281,32 +281,14 @@ nodata:
 }
 EXPORT_SYMBOL(__alloc_skb);
 
-/**
- * __build_skb - build a network buffer
- * @data: data buffer provided by caller
- * @frag_size: size of data, or 0 if head was kmalloced
- *
- * Allocate a new &sk_buff. Caller provides space holding head and
- * skb_shared_info. @data must have been allocated by kmalloc() only if
- * @frag_size is 0, otherwise data should come from the page allocator
- * or vmalloc()
- * The return is the new skb buffer.
- * On a failure the return is %NULL, and @data is not freed.
- * Notes :
- *  Before IO, driver allocates only data buffer where NIC put incoming frame
- *  Driver should add room at head (NET_SKB_PAD) and
- *  MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info))
- *  After IO, driver calls build_skb(), to allocate sk_buff and populate it
- *  before giving packet to stack.
- *  RX rings only contains data buffers, not full skbs.
- */
-struct sk_buff *__build_skb(void *data, unsigned int frag_size)
+/* Allows skb being (pre)allocated by caller */
+static inline
+struct sk_buff *___build_skb(void *data, unsigned int frag_size,
+			     struct sk_buff *skb)
 {
 	struct skb_shared_info *shinfo;
-	struct sk_buff *skb;
 	unsigned int size = frag_size ? : ksize(data);
 
-	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
 	if (!skb)
 		return NULL;
 
@@ -331,6 +313,34 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size)
 	return skb;
 }
 
+/**
+ * __build_skb - build a network buffer
+ * @data: data buffer provided by caller
+ * @frag_size: size of data, or 0 if head was kmalloced
+ *
+ * Allocate a new &sk_buff. Caller provides space holding head and
+ * skb_shared_info. @data must have been allocated by kmalloc() only if
+ * @frag_size is 0, otherwise data should come from the page allocator
+ * or vmalloc()
+ * The return is the new skb buffer.
+ * On a failure the return is %NULL, and @data is not freed.
+ * Notes :
+ *  Before IO, driver allocates only data buffer where NIC put incoming frame
+ *  Driver should add room at head (NET_SKB_PAD) and
+ *  MUST add room at tail (SKB_DATA_ALIGN(skb_shared_info))
+ *  After IO, driver calls build_skb(), to allocate sk_buff and populate it
+ *  before giving packet to stack.
+ *  RX rings only contains data buffers, not full skbs.
+ */
+struct sk_buff *__build_skb(void *data, unsigned int frag_size)
+{
+	struct sk_buff *skb;
+	unsigned int size = frag_size ? : ksize(data);
+
+	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
+	return ___build_skb(data, size, skb);
+}
+
 /* build_skb() is wrapper over __build_skb(), that specifically
  * takes care of skb->head and skb->pfmemalloc
  * This means that if @frag_size is not zero, then @data must be backed
@@ -490,8 +500,8 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 
 	len += NET_SKB_PAD + NET_IP_ALIGN;
 
-	if ((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
-	    (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA))) {
+	if (unlikely((len > SKB_WITH_OVERHEAD(PAGE_SIZE)) ||
+		     (gfp_mask & (__GFP_DIRECT_RECLAIM | GFP_DMA)))) {
 		skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
 		if (!skb)
 			goto skb_fail;
@@ -508,11 +518,20 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
 	if (unlikely(!data))
 		return NULL;
 
-	skb = __build_skb(data, len);
-	if (unlikely(!skb)) {
+#define BULK_ALLOC_SIZE 8
+	if (!nc->skb_count) {
+		nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache,
+						      gfp_mask, BULK_ALLOC_SIZE,
+						      nc->skb_cache);
+	}
+	if (likely(nc->skb_count)) {
+		skb = (struct sk_buff *)nc->skb_cache[--nc->skb_count];
+	} else {
+		/* alloc bulk failed */
 		skb_free_frag(data);
 		return NULL;
 	}
+	skb = ___build_skb(data, len, skb);
 
 	/* use OR instead of assignment to avoid clearing of bits in mask */
 	if (nc->page.pfmemalloc)
```
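As a usage note on the claim that all napi_alloc_skb() callers benefit automatically: the calling pattern the series relies on looks roughly like the skeleton below. This is not taken from any real driver — the ring structure, helpers and buffer length are hypothetical; only napi_alloc_skb(), napi_consume_skb() and napi_complete_done() are real kernel APIs. Heads freed by napi_consume_skb() during TX cleanup land in the per-CPU skb_cache, and napi_alloc_skb() on the RX side of the same poll can hand them straight back out.

```c
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical driver state and helpers -- stand-ins, not real code. */
struct example_ring {
	struct napi_struct napi;
	/* ... descriptor rings, DMA state ... */
};
bool example_rx_frame_ready(struct example_ring *ring);
void example_rx_deliver(struct example_ring *ring, struct sk_buff *skb);
void example_clean_tx_ring(struct example_ring *ring, int budget);
#define EXAMPLE_RX_BUF_LEN 256

static int example_napi_poll(struct napi_struct *napi, int budget)
{
	struct example_ring *ring = container_of(napi, struct example_ring, napi);
	int done = 0;

	/* TX completion first: napi_consume_skb() (called inside the helper)
	 * parks the freed heads in the per-CPU napi_alloc_cache->skb_cache. */
	example_clean_tx_ring(ring, budget);

	/* RX side: napi_alloc_skb() can now reuse those parked heads. */
	while (done < budget && example_rx_frame_ready(ring)) {
		struct sk_buff *skb = napi_alloc_skb(napi, EXAMPLE_RX_BUF_LEN);

		if (!skb)
			break;
		example_rx_deliver(ring, skb);	/* copy frame, napi_gro_receive() etc. */
		done++;
	}

	if (done < budget)
		napi_complete_done(napi, done);
	return done;
}
```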