
[net-next,1/6] net: Split netdev_alloc_frag into __alloc_page_frag and add __napi_alloc_frag

Message ID 20141210034042.2114.29360.stgit@ahduyck-vm-fedora20
State Accepted, archived
Delegated to: David Miller

Commit Message

Alexander Duyck Dec. 10, 2014, 3:40 a.m. UTC
This patch splits the netdev_alloc_frag function up so that it can be used
on one of two page frag pools instead of being fixed on the
netdev_alloc_cache.  By doing this we can add a NAPI specific function
__napi_alloc_frag that accesses a pool that is only used from softirq
context.  The advantage to this is that we do not need to call
local_irq_save/restore which can be a significant savings.
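
In rough outline, the two entry points end up as below (condensed from the
patch at the bottom of this page, which keeps a separate __napi_alloc_frag
helper in between):

static void *__netdev_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
{
	unsigned long flags;
	void *data;

	/* the netdev cache can be used from any context, so IRQs must
	 * be masked around the per-cpu access
	 */
	local_irq_save(flags);
	data = __alloc_page_frag(&netdev_alloc_cache, fragsz, gfp_mask);
	local_irq_restore(flags);
	return data;
}

void *napi_alloc_frag(unsigned int fragsz)
{
	/* the napi cache is only touched from softirq context, so no
	 * IRQ save/restore is needed
	 */
	return __alloc_page_frag(&napi_alloc_cache, fragsz,
				 GFP_ATOMIC | __GFP_COLD);
}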

I also took the opportunity to refactor the core bits that were placed in
__alloc_page_frag.  First I updated the allocation to do either a 32K
allocation or an order 0 page.  This is based on the changes in commit
d9b2938aa where it was found that latencies could be reduced in case of
failures.  Then I also rewrote the logic to work from the end of the page to
the start.  By doing this the size value doesn't have to be used unless we
have run out of space for page fragments.  Finally I cleaned up the atomic
bits so that we just do an atomic_sub_and_test and if that returns true then
we set the page->_count via an atomic_set.  This way we can remove the extra
conditional for the atomic_read since it would have led to an atomic_inc in
the case of success anyway.
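
Annotated, the reworked fast path from __alloc_page_frag in the patch below
comes down to:

	offset = nc->frag.offset - fragsz;
	if (unlikely(offset < 0)) {
		/* drop all of our remaining references at once; if that
		 * leaves the count at zero the page is ours to recycle
		 */
		if (!atomic_sub_and_test(nc->pagecnt_bias, &page->_count))
			goto refill;

		/* if size can vary use frag.size else just use PAGE_SIZE */
		size = NETDEV_FRAG_PAGE_MAX_ORDER ? nc->frag.size : PAGE_SIZE;

		/* page count is 0, so it is safe to set it directly */
		atomic_set(&page->_count, size);

		/* reset the bias and hand out frags from the end again */
		nc->pagecnt_bias = size;
		offset = size - fragsz;
	}

	nc->pagecnt_bias--;
	nc->frag.offset = offset;

	return page_address(page) + offset;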

Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
---
 include/linux/skbuff.h |    2 +
 net/core/skbuff.c      |  117 ++++++++++++++++++++++++++++++++----------------
 2 files changed, 79 insertions(+), 40 deletions(-)



Comments

Alexei Starovoitov Dec. 10, 2014, 4:16 a.m. UTC | #1
On Tue, Dec 9, 2014 at 7:40 PM, Alexander Duyck
<alexander.h.duyck@redhat.com> wrote:
> This patch splits the netdev_alloc_frag function up so that it can be used
> on one of two page frag pools instead of being fixed on the
> netdev_alloc_cache.  By doing this we can add a NAPI specific function
> __napi_alloc_frag that accesses a pool that is only used from softirq
> context.  The advantage to this is that we do not need to call
> local_irq_save/restore which can be a significant savings.
>
> I also took the opportunity to refactor the core bits that were placed in
> __alloc_page_frag.  First I updated the allocation to do either a 32K
> allocation or an order 0 page.  This is based on the changes in commit
> d9b2938aa where it was found that latencies could be reduced in case of

thanks for explaining that piece of it.

> +       struct page *page = NULL;
> +       gfp_t gfp = gfp_mask;
> +
> +       if (order) {
> +               gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY;
> +               page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
> +               nc->frag.size = PAGE_SIZE << (page ? order : 0);
> +       }
>
> -       local_irq_save(flags);
> -       nc = this_cpu_ptr(&netdev_alloc_cache);
> -       if (unlikely(!nc->frag.page)) {
> +       if (unlikely(!page))
> +               page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);

I'm guessing you're not combining this 'if' with the above one to
keep gfp untouched, so there is a 'warn' when it actually fails the 2nd time.
Tricky :)
Anyway looks good to me and I think I understand it enough to say:
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Alexander H Duyck Dec. 10, 2014, 3:21 p.m. UTC | #2
On 12/09/2014 08:16 PM, Alexei Starovoitov wrote:
> On Tue, Dec 9, 2014 at 7:40 PM, Alexander Duyck
> <alexander.h.duyck@redhat.com> wrote:
>> This patch splits the netdev_alloc_frag function up so that it can be used
>> on one of two page frag pools instead of being fixed on the
>> netdev_alloc_cache.  By doing this we can add a NAPI specific function
>> __napi_alloc_frag that accesses a pool that is only used from softirq
>> context.  The advantage to this is that we do not need to call
>> local_irq_save/restore which can be a significant savings.
>>
>> I also took the opportunity to refactor the core bits that were placed in
>> __alloc_page_frag.  First I updated the allocation to do either a 32K
>> allocation or an order 0 page.  This is based on the changes in commit
>> d9b2938aa where it was found that latencies could be reduced in case of
> thanks for explaining that piece of it.
>
>> +       struct page *page = NULL;
>> +       gfp_t gfp = gfp_mask;
>> +
>> +       if (order) {
>> +               gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY;
>> +               page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
>> +               nc->frag.size = PAGE_SIZE << (page ? order : 0);
>> +       }
>>
>> -       local_irq_save(flags);
>> -       nc = this_cpu_ptr(&netdev_alloc_cache);
>> -       if (unlikely(!nc->frag.page)) {
>> +       if (unlikely(!page))
>> +               page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
> I'm guessing you're not combining this 'if' with the above one to
> keep gfp untouched, so there is a 'warn' when it actually fails the 2nd time.
> Tricky :)
> Anyway looks good to me and I think I understand it enough to say:
> Acked-by: Alexei Starovoitov <ast@plumgrid.com>

Thanks.  Yes, the compiler is smart enough to combine the frag.size and
the second check into one if order is non-zero.  The other trick here is
that if order is 0 then that whole block disappears and I don't have to
touch frag.size or gfp at all; the code gets much simpler as the *page =
NULL falls through and cancels out the 'if' as a compile-time check.

- Alex
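
For reference, the refill helper being discussed here, as it appears in the
patch at the bottom of the page, with the two points above called out in
comments:

static struct page *__page_frag_refill(struct netdev_alloc_cache *nc,
				       gfp_t gfp_mask)
{
	const unsigned int order = NETDEV_FRAG_PAGE_MAX_ORDER;
	struct page *page = NULL;
	gfp_t gfp = gfp_mask;	/* untouched copy kept for the fallback */

	if (order) {
		/* high-order attempt: no warning, no retries, so a miss
		 * stays cheap
		 */
		gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY;
		page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
		nc->frag.size = PAGE_SIZE << (page ? order : 0);
	}

	/* the order-0 fallback uses the original gfp, so a failure here
	 * still warns; when NETDEV_FRAG_PAGE_MAX_ORDER is 0 the block
	 * above is compile-time dead code and this is the only attempt
	 */
	if (unlikely(!page))
		page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);

	nc->frag.page = page;

	return page;
}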
Eric Dumazet Dec. 10, 2014, 4:02 p.m. UTC | #3
On Tue, 2014-12-09 at 19:40 -0800, Alexander Duyck wrote:

> I also took the opportunity to refactor the core bits that were placed in
> __alloc_page_frag.  First I updated the allocation to do either a 32K
> allocation or an order 0 page.  This is based on the changes in commit
> d9b2938aa where it was found that latencies could be reduced in case of
> failures. 


GFP_KERNEL and GFP_ATOMIC allocation constraints are quite different.

I have no idea how expensive it is to attempt order-3, order-2, order-1
allocations with GFP_ATOMIC.

I did an interesting experiment on the mlx4 driver, allocating the pages
needed to store the fragments, using a small layer before the
alloc_page() that is normally used:

- Attempt order-9 allocations, and use split_page() to give the
individual pages.

The boost in performance is 10% on TCP bulk receive, because of fewer TLB
misses.

With the huge amounts of memory these days, alloc_page() tends to give pages
spread all over memory, with poor TLB locality.

With this strategy, a 1024-entry RX ring is backed by only 2 huge pages.
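
The mlx4 layer described here was not posted in this thread, but a minimal
sketch of the idea, with purely illustrative names, might look like:

/* Try to grab an order-9 block and hand out its constituent order-0
 * pages one at a time, falling back to a plain page allocation when
 * the big block is unavailable.  Teardown of a partially consumed
 * block is omitted for brevity.
 */
struct rx_page_cache {
	struct page	*block;	/* start of the split order-9 block */
	unsigned int	next;	/* next unused page within the block */
};

static struct page *rx_cache_alloc_page(struct rx_page_cache *c, gfp_t gfp)
{
	if (c->block && c->next < (1U << 9))
		return c->block + c->next++;

	/* keep the high-order attempt cheap: no warning, no retries */
	c->block = alloc_pages(gfp | __GFP_NOWARN | __GFP_NORETRY, 9);
	if (c->block) {
		/* give each of the 512 constituent pages its own
		 * reference count so consumers can put_page() them
		 * individually
		 */
		split_page(c->block, 9);
		c->next = 1;
		return c->block;
	}

	/* fall back to the usual single-page allocation */
	return alloc_page(gfp);
}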


Alexander H Duyck Dec. 10, 2014, 5:06 p.m. UTC | #4
On 12/10/2014 08:02 AM, Eric Dumazet wrote:
> On Tue, 2014-12-09 at 19:40 -0800, Alexander Duyck wrote:
>
>> I also took the opportunity to refactor the core bits that were placed in
>> __alloc_page_frag.  First I updated the allocation to do either a 32K
>> allocation or an order 0 page.  This is based on the changes in commit
>> d9b2938aa where it was found that latencies could be reduced in case of
>> failures. 
>
> GFP_KERNEL and GFP_ATOMIC allocation constraints are quite different.
>
> I have no idea how expensive it is to attempt order-3, order-2, order-1
> allocations with GFP_ATOMIC.

The most likely case is the successful first allocation, so I didn't see
much point in trying to optimize for the failure cases.  I personally
prefer to see a fast failure rather than one that is dragged out over
several failed allocation attempts.  In addition, I can get away with
several optimization tricks that I cannot with the loop.

> I did an interesting experiment on the mlx4 driver, allocating the pages
> needed to store the fragments, using a small layer before the
> alloc_page() that is normally used:
>
> - Attempt order-9 allocations, and use split_page() to give the
> individual pages.
>
> The boost in performance is 10% on TCP bulk receive, because of fewer TLB
> misses.
>
> With the huge amounts of memory these days, alloc_page() tends to give pages
> spread all over memory, with poor TLB locality.
>
> With this strategy, a 1024-entry RX ring is backed by only 2 huge pages.

That is an interesting idea.  I wonder if there would be a similar
benefit for small packets.  If nothing else I might try a few
experiments with ixgbe to see if I can take advantage of something similar.

- Alex


Eric Dumazet Dec. 10, 2014, 5:13 p.m. UTC | #5
On Wed, 2014-12-10 at 09:06 -0800, Alexander Duyck wrote:

> That is an interesting idea.  I wonder if there would be a similar
> benefit for small packets.  If nothing else I might try a few
> experiments with ixgbe to see if I can take advantage of something similar.

I was planning to submit some core infrastructure when net-next reopens: a
simple layer to use instead of alloc_page(), immediately usable in the
Intel drivers.



Alexander H Duyck Dec. 10, 2014, 5:16 p.m. UTC | #6
On 12/10/2014 09:13 AM, Eric Dumazet wrote:
> On Wed, 2014-12-10 at 09:06 -0800, Alexander Duyck wrote:
>
>> That is an interesting idea.  I wonder if there would be a similar
>> benefit for small packets.  If nothing else I might try a few
>> experiments with ixgbe to see if I can take advantage of something similar.
> I was planning to submit some core infrastructure when net-next reopens: a
> simple layer to use instead of alloc_page(), immediately usable in the
> Intel drivers.

Sounds good.  I look forward to it.

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ef64cec..b2b53b0 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2164,6 +2164,8 @@  static inline struct sk_buff *netdev_alloc_skb_ip_align(struct net_device *dev,
 	return __netdev_alloc_skb_ip_align(dev, length, GFP_ATOMIC);
 }
 
+void *napi_alloc_frag(unsigned int fragsz);
+
 /**
  * __dev_alloc_pages - allocate page for network Rx
  * @gfp_mask: allocation priority. Set __GFP_NOMEMALLOC if not for network Rx
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7a338fb..56ed17c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -336,59 +336,85 @@  struct netdev_alloc_cache {
 	unsigned int		pagecnt_bias;
 };
 static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);
+static DEFINE_PER_CPU(struct netdev_alloc_cache, napi_alloc_cache);
 
-static void *__netdev_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
+static struct page *__page_frag_refill(struct netdev_alloc_cache *nc,
+				       gfp_t gfp_mask)
 {
-	struct netdev_alloc_cache *nc;
-	void *data = NULL;
-	int order;
-	unsigned long flags;
+	const unsigned int order = NETDEV_FRAG_PAGE_MAX_ORDER;
+	struct page *page = NULL;
+	gfp_t gfp = gfp_mask;
+
+	if (order) {
+		gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY;
+		page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
+		nc->frag.size = PAGE_SIZE << (page ? order : 0);
+	}
 
-	local_irq_save(flags);
-	nc = this_cpu_ptr(&netdev_alloc_cache);
-	if (unlikely(!nc->frag.page)) {
+	if (unlikely(!page))
+		page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
+
+	nc->frag.page = page;
+
+	return page;
+}
+
+static void *__alloc_page_frag(struct netdev_alloc_cache __percpu *cache,
+			       unsigned int fragsz, gfp_t gfp_mask)
+{
+	struct netdev_alloc_cache *nc = this_cpu_ptr(cache);
+	struct page *page = nc->frag.page;
+	unsigned int size;
+	int offset;
+
+	if (unlikely(!page)) {
 refill:
-		for (order = NETDEV_FRAG_PAGE_MAX_ORDER; ;) {
-			gfp_t gfp = gfp_mask;
+		page = __page_frag_refill(nc, gfp_mask);
+		if (!page)
+			return NULL;
+
+		/* if size can vary use frag.size else just use PAGE_SIZE */
+		size = NETDEV_FRAG_PAGE_MAX_ORDER ? nc->frag.size : PAGE_SIZE;
 
-			if (order)
-				gfp |= __GFP_COMP | __GFP_NOWARN;
-			nc->frag.page = alloc_pages(gfp, order);
-			if (likely(nc->frag.page))
-				break;
-			if (--order < 0)
-				goto end;
-		}
-		nc->frag.size = PAGE_SIZE << order;
 		/* Even if we own the page, we do not use atomic_set().
 		 * This would break get_page_unless_zero() users.
 		 */
-		atomic_add(NETDEV_PAGECNT_MAX_BIAS - 1,
-			   &nc->frag.page->_count);
-		nc->pagecnt_bias = NETDEV_PAGECNT_MAX_BIAS;
-		nc->frag.offset = 0;
+		atomic_add(size - 1, &page->_count);
+
+		/* reset page count bias and offset to start of new frag */
+		nc->pagecnt_bias = size;
+		nc->frag.offset = size;
 	}
 
-	if (nc->frag.offset + fragsz > nc->frag.size) {
-		if (atomic_read(&nc->frag.page->_count) != nc->pagecnt_bias) {
-			if (!atomic_sub_and_test(nc->pagecnt_bias,
-						 &nc->frag.page->_count))
-				goto refill;
-			/* OK, page count is 0, we can safely set it */
-			atomic_set(&nc->frag.page->_count,
-				   NETDEV_PAGECNT_MAX_BIAS);
-		} else {
-			atomic_add(NETDEV_PAGECNT_MAX_BIAS - nc->pagecnt_bias,
-				   &nc->frag.page->_count);
-		}
-		nc->pagecnt_bias = NETDEV_PAGECNT_MAX_BIAS;
-		nc->frag.offset = 0;
+	offset = nc->frag.offset - fragsz;
+	if (unlikely(offset < 0)) {
+		if (!atomic_sub_and_test(nc->pagecnt_bias, &page->_count))
+			goto refill;
+
+		/* if size can vary use frag.size else just use PAGE_SIZE */
+		size = NETDEV_FRAG_PAGE_MAX_ORDER ? nc->frag.size : PAGE_SIZE;
+
+		/* OK, page count is 0, we can safely set it */
+		atomic_set(&page->_count, size);
+
+		/* reset page count bias and offset to start of new frag */
+		nc->pagecnt_bias = size;
+		offset = size - fragsz;
 	}
 
-	data = page_address(nc->frag.page) + nc->frag.offset;
-	nc->frag.offset += fragsz;
 	nc->pagecnt_bias--;
-end:
+	nc->frag.offset = offset;
+
+	return page_address(page) + offset;
+}
+
+static void *__netdev_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
+{
+	unsigned long flags;
+	void *data;
+
+	local_irq_save(flags);
+	data = __alloc_page_frag(&netdev_alloc_cache, fragsz, gfp_mask);
 	local_irq_restore(flags);
 	return data;
 }
@@ -406,6 +432,17 @@  void *netdev_alloc_frag(unsigned int fragsz)
 }
 EXPORT_SYMBOL(netdev_alloc_frag);
 
+static void *__napi_alloc_frag(unsigned int fragsz, gfp_t gfp_mask)
+{
+	return __alloc_page_frag(&napi_alloc_cache, fragsz, gfp_mask);
+}
+
+void *napi_alloc_frag(unsigned int fragsz)
+{
+	return __napi_alloc_frag(fragsz, GFP_ATOMIC | __GFP_COLD);
+}
+EXPORT_SYMBOL(napi_alloc_frag);
+
 /**
  *	__netdev_alloc_skb - allocate an skbuff for rx on a specific device
  *	@dev: network device to receive on