Message ID: 20141210034042.2114.29360.stgit@ahduyck-vm-fedora20
State: Accepted, archived
Delegated to: David Miller
On Tue, Dec 9, 2014 at 7:40 PM, Alexander Duyck <alexander.h.duyck@redhat.com> wrote:
> This patch splits the netdev_alloc_frag function up so that it can be used
> on one of two page frag pools instead of being fixed on the
> netdev_alloc_cache. By doing this we can add a NAPI specific function
> __napi_alloc_frag that accesses a pool that is only used from softirq
> context. The advantage to this is that we do not need to call
> local_irq_save/restore which can be a significant savings.
>
> I also took the opportunity to refactor the core bits that were placed in
> __alloc_page_frag. First I updated the allocation to do either a 32K
> allocation or an order 0 page. This is based on the changes in commit
> d9b2938aa where it was found that latencies could be reduced in case of

thanks for explaining that piece of it.

> +	struct page *page = NULL;
> +	gfp_t gfp = gfp_mask;
> +
> +	if (order) {
> +		gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY;
> +		page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
> +		nc->frag.size = PAGE_SIZE << (page ? order : 0);
> +	}
>
> -	local_irq_save(flags);
> -	nc = this_cpu_ptr(&netdev_alloc_cache);
> -	if (unlikely(!nc->frag.page)) {
> +	if (unlikely(!page))
> +		page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);

I'm guessing you're not combining this 'if' with the above one to
keep gfp untouched, so there is a 'warn' when it actually fails the
2nd time. Tricky :)
Anyway looks good to me and I think I understand it enough to say:

Acked-by: Alexei Starovoitov <ast@plumgrid.com>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/09/2014 08:16 PM, Alexei Starovoitov wrote:
> On Tue, Dec 9, 2014 at 7:40 PM, Alexander Duyck
> <alexander.h.duyck@redhat.com> wrote:
>> This patch splits the netdev_alloc_frag function up so that it can be used
>> on one of two page frag pools instead of being fixed on the
>> netdev_alloc_cache. By doing this we can add a NAPI specific function
>> __napi_alloc_frag that accesses a pool that is only used from softirq
>> context. The advantage to this is that we do not need to call
>> local_irq_save/restore which can be a significant savings.
>>
>> I also took the opportunity to refactor the core bits that were placed in
>> __alloc_page_frag. First I updated the allocation to do either a 32K
>> allocation or an order 0 page. This is based on the changes in commit
>> d9b2938aa where it was found that latencies could be reduced in case of
> thanks for explaining that piece of it.
>
>> +	struct page *page = NULL;
>> +	gfp_t gfp = gfp_mask;
>> +
>> +	if (order) {
>> +		gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY;
>> +		page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
>> +		nc->frag.size = PAGE_SIZE << (page ? order : 0);
>> +	}
>>
>> -	local_irq_save(flags);
>> -	nc = this_cpu_ptr(&netdev_alloc_cache);
>> -	if (unlikely(!nc->frag.page)) {
>> +	if (unlikely(!page))
>> +		page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
> I'm guessing you're not combining this 'if' with the above one to
> keep gfp untouched, so there is a 'warn' when it actually fails the 2nd time.
> Tricky :)
> Anyway looks good to me and I think I understand it enough to say:
> Acked-by: Alexei Starovoitov <ast@plumgrid.com>

Thanks. Yes, the compiler is smart enough to combine the frag.size
assignment and the second check into one if order is non-zero. The other
trick here is that if order is 0 then that whole block disappears and I
don't have to touch frag.size or gfp at all; the code gets much simpler
because the page = NULL initialization falls through and cancels out the
'if' as a compile-time check.
- Alex
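The two-step refill that the exchange above picks apart can be sketched in plain user-space C. This is a hypothetical simulation, not kernel code: malloc stands in for alloc_pages_node(), a "nowarn" flag stands in for __GFP_NOWARN, and a failure-injection switch lets the fallback path be exercised. The point it illustrates is the one Alexei noticed: the high-order attempt ORs the quiet flags into gfp_mask, while the order-0 fallback reuses the caller's untouched gfp, so only a genuine final failure would warn.

```c
#include <stdlib.h>

#define PAGE_SIZE 4096u
#define FRAG_PAGE_MAX_ORDER 3	/* 32K chunk, like NETDEV_FRAG_PAGE_MAX_ORDER */

/* Failure-injection switch so the order-0 fallback can be exercised. */
static int fail_high_order;

/* Stand-in for alloc_pages_node(): returns a block of 2^order pages or
 * NULL. The nowarn flag mirrors __GFP_NOWARN; in a real kernel a failure
 * without it would be reported. */
static void *fake_alloc_pages(unsigned int order, int nowarn)
{
	(void)nowarn;	/* a real allocator would suppress warnings when set */
	if (order > 0 && fail_high_order)
		return NULL;
	return malloc((size_t)PAGE_SIZE << order);
}

/* Mirrors __page_frag_refill(): try one big chunk with the "quiet" flags
 * first, then retry a single page with the caller's original (warning-
 * enabled) flags. *size reports how much backing store was obtained. */
static void *page_frag_refill(size_t *size)
{
	const unsigned int order = FRAG_PAGE_MAX_ORDER;
	void *page = NULL;

	if (order) {
		page = fake_alloc_pages(order, /*nowarn=*/1);
		*size = (size_t)PAGE_SIZE << (page ? order : 0);
	}
	if (!page)	/* 2nd attempt keeps the original flags, so it can warn */
		page = fake_alloc_pages(0, /*nowarn=*/0);
	return page;
}
```

When the 32K attempt fails, *size has already been reset to PAGE_SIZE, so the caller's accounting matches whichever page actually backs the pool.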
On Tue, 2014-12-09 at 19:40 -0800, Alexander Duyck wrote:
> I also took the opportunity to refactor the core bits that were placed in
> __alloc_page_frag. First I updated the allocation to do either a 32K
> allocation or an order 0 page. This is based on the changes in commit
> d9b2938aa where it was found that latencies could be reduced in case of
> failures.

GFP_KERNEL and GFP_ATOMIC allocation constraints are quite different.

I have no idea how expensive it is to attempt order-3, order-2, order-1
allocations with GFP_ATOMIC.

I did an interesting experiment on the mlx4 driver, allocating the pages
needed to store the fragments, using a small layer before the
alloc_page() that is normally used:

- Attempt order-9 allocations, and use split_page() to hand out the
  individual pages.

The boost in performance is 10% on TCP bulk receive, because of fewer
TLB misses. With the huge amounts of memory these days, alloc_page()
tends to give pages spread all over memory, with poor TLB locality.

With this strategy, a 1024-entry RX ring is backed by only 2 huge pages.
On 12/10/2014 08:02 AM, Eric Dumazet wrote:
> On Tue, 2014-12-09 at 19:40 -0800, Alexander Duyck wrote:
>
>> I also took the opportunity to refactor the core bits that were placed in
>> __alloc_page_frag. First I updated the allocation to do either a 32K
>> allocation or an order 0 page. This is based on the changes in commit
>> d9b2938aa where it was found that latencies could be reduced in case of
>> failures.
>
> GFP_KERNEL and GFP_ATOMIC allocation constraints are quite different.
>
> I have no idea how expensive it is to attempt order-3, order-2, order-1
> allocations with GFP_ATOMIC.

The most likely case is the successful first allocation, so I didn't see
much point in trying to optimize for the failure cases. I personally
prefer to see a fast failure rather than one that is dragged out over
several failed allocation attempts. In addition, I can get away with
several optimization tricks that I cannot with the loop.

> I did an interesting experiment on the mlx4 driver, allocating the pages
> needed to store the fragments, using a small layer before the
> alloc_page() that is normally used:
>
> - Attempt order-9 allocations, and use split_page() to hand out the
>   individual pages.
>
> The boost in performance is 10% on TCP bulk receive, because of fewer
> TLB misses.
>
> With the huge amounts of memory these days, alloc_page() tends to give
> pages spread all over memory, with poor TLB locality.
>
> With this strategy, a 1024-entry RX ring is backed by only 2 huge pages.

That is an interesting idea. I wonder if there would be a similar
benefit for small packets. If nothing else I might try a few
experiments with ixgbe to see if I can take advantage of something similar.

- Alex
On Wed, 2014-12-10 at 09:06 -0800, Alexander Duyck wrote:

> That is an interesting idea. I wonder if there would be a similar
> benefit for small packets. If nothing else I might try a few
> experiments with ixgbe to see if I can take advantage of something similar.

I was planning to submit a core infrastructure when net-next reopens, a
simple layer to use instead of alloc_page(), and immediately usable in
Intel drivers.
On 12/10/2014 09:13 AM, Eric Dumazet wrote:
> On Wed, 2014-12-10 at 09:06 -0800, Alexander Duyck wrote:
>
>> That is an interesting idea. I wonder if there would be a similar
>> benefit for small packets. If nothing else I might try a few
>> experiments with ixgbe to see if I can take advantage of something similar.
> I was planning to submit a core infrastructure when net-next reopens, a
> simple layer to use instead of alloc_page(), and immediately usable in
> Intel drivers.

Sounds good. I look forward to it.
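Eric's "small layer before alloc_page()" can be approximated in user space as well. The sketch below is hypothetical (names invented; aligned_alloc stands in for the order-9 page allocation and split_page()): one contiguous allocation is carved into page-sized chunks, so consecutive RX buffers land adjacent in memory, which is where the TLB-locality win comes from. A real layer would fall back to plain alloc_page() when the big allocation fails.

```c
#include <stdlib.h>

#define PAGE_SIZE 4096u
#define HUGE_ORDER 9	/* 2MB block, as in the order-9 experiment */

struct page_pool {
	unsigned char *base;	/* start of the contiguous block */
	unsigned int next;	/* next page index to hand out */
	unsigned int npages;	/* pages carved from the block */
};

/* Allocate one big contiguous block and carve it into page-sized chunks,
 * the user-space analogue of an order-9 allocation plus split_page(). */
static int pool_init(struct page_pool *pool)
{
	pool->npages = 1u << HUGE_ORDER;
	pool->base = aligned_alloc(PAGE_SIZE,
				   (size_t)PAGE_SIZE * pool->npages);
	pool->next = 0;
	return pool->base ? 0 : -1;
}

/* Hand out the next page-sized chunk; NULL once the block is exhausted
 * (a real layer would fall back to the normal page allocator here). */
static void *pool_get_page(struct page_pool *pool)
{
	if (pool->next >= pool->npages)
		return NULL;
	return pool->base + (size_t)pool->next++ * PAGE_SIZE;
}
```

With this, a 1024-buffer RX ring draws from only two such blocks instead of 1024 scattered pages.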
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ef64cec..b2b53b0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2164,6 +2164,8 @@ static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev,
 	return __netdev_alloc_skb_ip_align(dev, length, GFP_ATOMIC);
 }
 
+void *napi_alloc_frag(unsigned int fragsz);
+
 /**
  * __dev_alloc_pages - allocate page for network Rx
  * @gfp_mask: allocation priority. Set __GFP_NOMEMALLOC if not for network Rx
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7a338fb..56ed17c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -336,59 +336,85 @@ struct netdev_alloc_cache {
 	unsigned int		pagecnt_bias;
 };
 static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);
+static DEFINE_PER_CPU(struct netdev_alloc_cache, napi_alloc_cache);
 
-static void *__netdev_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
+static struct page *__page_frag_refill(struct netdev_alloc_cache *nc,
+				       gfp_t gfp_mask)
 {
-	struct netdev_alloc_cache *nc;
-	void *data = NULL;
-	int order;
-	unsigned long flags;
+	const unsigned int order = NETDEV_FRAG_PAGE_MAX_ORDER;
+	struct page *page = NULL;
+	gfp_t gfp = gfp_mask;
+
+	if (order) {
+		gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY;
+		page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
+		nc->frag.size = PAGE_SIZE << (page ? order : 0);
+	}
 
-	local_irq_save(flags);
-	nc = this_cpu_ptr(&netdev_alloc_cache);
-	if (unlikely(!nc->frag.page)) {
+	if (unlikely(!page))
+		page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
+
+	nc->frag.page = page;
+
+	return page;
+}
+
+static void *__alloc_page_frag(struct netdev_alloc_cache __percpu *cache,
+			       unsigned int fragsz, gfp_t gfp_mask)
+{
+	struct netdev_alloc_cache *nc = this_cpu_ptr(cache);
+	struct page *page = nc->frag.page;
+	unsigned int size;
+	int offset;
+
+	if (unlikely(!page)) {
 refill:
-		for (order = NETDEV_FRAG_PAGE_MAX_ORDER; ;) {
-			gfp_t gfp = gfp_mask;
+		page = __page_frag_refill(nc, gfp_mask);
+		if (!page)
+			return NULL;
+
+		/* if size can vary use frag.size else just use PAGE_SIZE */
+		size = NETDEV_FRAG_PAGE_MAX_ORDER ? nc->frag.size : PAGE_SIZE;
 
-			if (order)
-				gfp |= __GFP_COMP | __GFP_NOWARN;
-			nc->frag.page = alloc_pages(gfp, order);
-			if (likely(nc->frag.page))
-				break;
-			if (--order < 0)
-				goto end;
-		}
-		nc->frag.size = PAGE_SIZE << order;
 		/* Even if we own the page, we do not use atomic_set().
 		 * This would break get_page_unless_zero() users.
 		 */
-		atomic_add(NETDEV_PAGECNT_MAX_BIAS - 1,
-			   &nc->frag.page->_count);
-		nc->pagecnt_bias = NETDEV_PAGECNT_MAX_BIAS;
-		nc->frag.offset = 0;
+		atomic_add(size - 1, &page->_count);
+
+		/* reset page count bias and offset to start of new frag */
+		nc->pagecnt_bias = size;
+		nc->frag.offset = size;
 	}
 
-	if (nc->frag.offset + fragsz > nc->frag.size) {
-		if (atomic_read(&nc->frag.page->_count) != nc->pagecnt_bias) {
-			if (!atomic_sub_and_test(nc->pagecnt_bias,
-						 &nc->frag.page->_count))
-				goto refill;
-			/* OK, page count is 0, we can safely set it */
-			atomic_set(&nc->frag.page->_count,
-				   NETDEV_PAGECNT_MAX_BIAS);
-		} else {
-			atomic_add(NETDEV_PAGECNT_MAX_BIAS - nc->pagecnt_bias,
-				   &nc->frag.page->_count);
-		}
-		nc->pagecnt_bias = NETDEV_PAGECNT_MAX_BIAS;
-		nc->frag.offset = 0;
+	offset = nc->frag.offset - fragsz;
+	if (unlikely(offset < 0)) {
+		if (!atomic_sub_and_test(nc->pagecnt_bias, &page->_count))
+			goto refill;
+
+		/* if size can vary use frag.size else just use PAGE_SIZE */
+		size = NETDEV_FRAG_PAGE_MAX_ORDER ? nc->frag.size : PAGE_SIZE;
+
+		/* OK, page count is 0, we can safely set it */
+		atomic_set(&page->_count, size);
+
+		/* reset page count bias and offset to start of new frag */
+		nc->pagecnt_bias = size;
+		offset = size - fragsz;
 	}
 
-	data = page_address(nc->frag.page) + nc->frag.offset;
-	nc->frag.offset += fragsz;
 	nc->pagecnt_bias--;
-end:
+	nc->frag.offset = offset;
+
+	return page_address(page) + offset;
+}
+
+static void *__netdev_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
+{
+	unsigned long flags;
+	void *data;
+
+	local_irq_save(flags);
+	data = __alloc_page_frag(&netdev_alloc_cache, fragsz, gfp_mask);
 	local_irq_restore(flags);
 	return data;
 }
@@ -406,6 +432,17 @@ void *netdev_alloc_frag(unsigned int fragsz)
 }
 EXPORT_SYMBOL(netdev_alloc_frag);
 
+static void *__napi_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
+{
+	return __alloc_page_frag(&napi_alloc_cache, fragsz, gfp_mask);
+}
+
+void *napi_alloc_frag(unsigned int fragsz)
+{
+	return __napi_alloc_frag(fragsz, GFP_ATOMIC | __GFP_COLD);
+}
+EXPORT_SYMBOL(napi_alloc_frag);
+
 /**
  * __netdev_alloc_skb - allocate an skbuff for rx on a specific device
  * @dev: network device to receive on
This patch splits the netdev_alloc_frag function up so that it can be used
on one of two page frag pools instead of being fixed on the
netdev_alloc_cache. By doing this we can add a NAPI specific function
__napi_alloc_frag that accesses a pool that is only used from softirq
context. The advantage to this is that we do not need to call
local_irq_save/restore which can be a significant savings.

I also took the opportunity to refactor the core bits that were placed in
__alloc_page_frag. First I updated the allocation to do either a 32K
allocation or an order 0 page. This is based on the changes in commit
d9b2938aa where it was found that latencies could be reduced in case of
failures. Then I also rewrote the logic to work from the end of the page
to the start. By doing this the size value doesn't have to be used unless
we have run out of space for page fragments. Finally I cleaned up the
atomic bits so that we just do an atomic_sub_and_test and if that returns
true then we set the page->_count via an atomic_set. This way we can
remove the extra conditional for the atomic_read since it would have led
to an atomic_inc in the case of success anyway.

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
 include/linux/skbuff.h |    2 +
 net/core/skbuff.c      |  117 ++++++++++++++++++++++++++++++++----------------
 2 files changed, 79 insertions(+), 40 deletions(-)
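The "work from the end of the page to the start" logic described in the commit message can be sketched in user-space C. This is a hypothetical simulation, not the kernel implementation: malloc stands in for the page allocator, and the atomic page->_count recycling is omitted (the refill here simply replaces the page). What it shows is the hot-path property the patch is after: handing out fragments only needs "offset -= fragsz; if (offset < 0) refill", so the page size is not consulted until the page is exhausted.

```c
#include <stdlib.h>

#define PAGE_SIZE 4096

struct frag_cache {
	unsigned char *page;	/* backing page (user-space stand-in) */
	int offset;		/* current offset, counting DOWN toward 0 */
	int pagecnt_bias;	/* stand-in for the page->_count bias */
};

/* Hand out fragments from the END of the page toward the start. */
static void *frag_alloc(struct frag_cache *nc, int fragsz)
{
	int offset = nc->offset - fragsz;

	if (offset < 0) {	/* page exhausted: grab a fresh one */
		free(nc->page);
		nc->page = malloc(PAGE_SIZE);
		if (!nc->page)
			return NULL;
		nc->pagecnt_bias = PAGE_SIZE;	/* reset the bias ... */
		offset = PAGE_SIZE - fragsz;	/* ... and restart at the end */
	}
	nc->pagecnt_bias--;
	nc->offset = offset;
	return nc->page + offset;
}
```

Starting from an empty cache, the first fragment comes from the top of the page and each subsequent one sits immediately below its predecessor, with no size comparison on the fast path.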