[bpf-next,V5,11/15] page_pool: refurbish version of page_pool code

Message ID 152180753479.20167.856688163861554435.stgit@firesoul
State Changes Requested, archived
Delegated to: BPF Maintainers
Series XDP redirect memory return API

Commit Message

Jesper Dangaard Brouer March 23, 2018, 12:18 p.m. UTC
The ndo_xdp_xmit API needs a fast page-recycle mechanism for returning
pages at DMA-TX completion time, one with good cross-CPU performance,
given that DMA-TX completion can happen on a remote CPU.

Refurbish my page_pool code, which was presented[1] at MM-summit 2016.
The page_pool code is adapted to not depend on the page allocator or on
integration into struct page.  The DMA mapping feature is kept, even
though it will not be activated/used in this patchset.

[1] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf
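
For orientation, here is a minimal sketch of how a driver might wire up
this API per RX-queue.  The helper functions and the concrete values
below are illustrative only, not part of this patch:

#include <net/page_pool.h>

/* Hypothetical per-RX-queue setup (names are illustrative) */
static struct page_pool *rxq_create_page_pool(struct device *dev, int nid)
{
	struct page_pool_params pp_params = {
		.size      = PAGE_POOL_PARAMS_SIZE,
		.order     = 0,                  /* one page per frame */
		.flags     = PP_FLAG_DMA_MAP,    /* pool keeps the DMA mapping */
		.dev       = dev,
		.nid       = nid,
		.dma_dir   = DMA_BIDIRECTIONAL,  /* also allow XDP_TX */
		.pool_size = 1024,               /* recycle ring size */
	};

	/* Returns ERR_PTR() on failure, never NULL (see V2 note below) */
	return page_pool_create(&pp_params);
}

/* RX refill: replaces dev_alloc_pages() */
static struct page *rxq_alloc_rx_page(struct page_pool *pool)
{
	return page_pool_dev_alloc_pages(pool);
}

/* ndo_xdp_xmit DMA-TX completion, possibly on a remote CPU */
static void rxq_xdp_tx_complete(struct page_pool *pool, struct page *page)
{
	page_pool_put_page(pool, page);
}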

V2: Adjustments requested by Tariq
 - Changed page_pool_create() return codes: don't return NULL, only
   ERR_PTR, as this simplifies error handling in drivers.

V4: many small improvements and cleanups
- Add a DOC comment section that can be used by kernel-doc
- Improve fallback mode, to work better with refcnt-based recycling,
  e.g. remove a WARN as pointed out by Tariq,
  e.g. quicker fallback if the ptr_ring is empty.

V5: Fixed SPDX license as pointed out by Alexei

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/net/page_pool.h |  133 +++++++++++++++++++
 net/core/Makefile       |    1 
 net/core/page_pool.c    |  329 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 463 insertions(+)
 create mode 100644 include/net/page_pool.h
 create mode 100644 net/core/page_pool.c

Comments

Eric Dumazet March 23, 2018, 1:28 p.m. UTC | #1
On 03/23/2018 05:18 AM, Jesper Dangaard Brouer wrote:

> +
> +	/* Note, below struct compat code was primarily needed when
> +	 * page_pool code lived under MM-tree control, given mmots and
> +	 * net-next trees progress in very different rates.
> +	 *
> +	 * Allow kernel devel trees and driver to progress at different rates
> +	 */
> +	param_copy_sz = PAGE_POOL_PARAMS_SIZE;
> +	memset(&pool->p, 0, param_copy_sz);
> +	if (params->size < param_copy_sz) {
> +		/* Older module calling newer kernel, handled by only
> +		 * copying supplied size, and keep remaining params zero
> +		 */
> +		param_copy_sz = params->size;
> +	} else if (params->size > param_copy_sz) {
> +		/* Newer module calling older kernel. Need to validate
> +		 * no new features were requested.
> +		 */
> +		unsigned char *addr = (unsigned char *)params + param_copy_sz;
> +		unsigned char *end  = (unsigned char *)params + params->size;
> +
> +		for (; addr < end; addr++) {
> +			if (*addr != 0)
> +				return -E2BIG;
> +		}
> +	}

I do not see the need for this part.
Eric Dumazet March 23, 2018, 1:29 p.m. UTC | #2
On 03/23/2018 05:18 AM, Jesper Dangaard Brouer wrote:

> +
> +void page_pool_destroy_rcu(struct page_pool *pool)
> +{
> +	call_rcu(&pool->rcu, __page_pool_destroy_rcu);
> +}
> +EXPORT_SYMBOL(page_pool_destroy_rcu);
> 


Why do we need to respect one rcu grace period before destroying a page pool ?

In any case, this should be called page_pool_destroy()
Eric Dumazet March 23, 2018, 1:37 p.m. UTC | #3
On 03/23/2018 05:18 AM, Jesper Dangaard Brouer wrote:


> +#define PP_ALLOC_CACHE_SIZE	128
> +#define PP_ALLOC_CACHE_REFILL	64
> +struct pp_alloc_cache {
> +	u32 count ____cacheline_aligned_in_smp;
> +	void *cache[PP_ALLOC_CACHE_SIZE];
> +};
> +
> +struct page_pool_params {
...
> +};
> +#define	PAGE_POOL_PARAMS_SIZE	offsetof(struct page_pool_params, end_marker)
> +
> +struct page_pool {
> +	struct page_pool_params p;
> +
> +	struct pp_alloc_cache alloc;
> +
>...
> +	struct ptr_ring ring;
> +
> +	struct rcu_head rcu;
> +};
> +


The placement of ____cacheline_aligned_in_smp in pp_alloc_cache is odd.

I would rather put it in struct page_pool as in :

+struct page_pool {
+	struct page_pool_params p;

+	struct pp_alloc_cache alloc ____cacheline_aligned_in_smp;

To make the intent clear here (i.e. let the page_pool_params sit in a read-only cache line).

Also you can probably move the struct rcu_head between p and the pp_alloc_cache alloc member to fill a hole.

(assuming allowing variable sized params is no longer needed)
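
Read as a layout sketch, the suggestion amounts to something like the
following (illustrative only, and assuming the variable-sized params
compat indeed goes away):

/* Keep the read-mostly params plus the rcu_head together, and start
 * the hot allocation-side cache on its own cache line.
 */
struct page_pool {
	struct page_pool_params p;	/* read-mostly after init */
	struct rcu_head rcu;		/* fills the hole after p */

	struct pp_alloc_cache alloc ____cacheline_aligned_in_smp;

	struct ptr_ring ring;
};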
Jesper Dangaard Brouer March 23, 2018, 2:15 p.m. UTC | #4
On Fri, 23 Mar 2018 06:29:55 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On 03/23/2018 05:18 AM, Jesper Dangaard Brouer wrote:
> 
> > +
> > +void page_pool_destroy_rcu(struct page_pool *pool)
> > +{
> > +	call_rcu(&pool->rcu, __page_pool_destroy_rcu);
> > +}
> > +EXPORT_SYMBOL(page_pool_destroy_rcu);
> >   
> 
> 
> Why do we need to respect one rcu grace period before destroying a page pool ?

Due to the previous allocator ID patch, which can hold a pointer reference
to a page_pool; the allocator ID lookup uses RCU.

> In any case, this should be called page_pool_destroy()

Okay.
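
To illustrate the lifetime issue generically (the table and function
names below are hypothetical; the real ID lookup lives in an earlier
patch of this series, and real callers run under NAPI/softirq
protection):

#include <linux/rcupdate.h>
#include <net/page_pool.h>

#define POOL_TABLE_SIZE 256			/* hypothetical */
static struct page_pool __rcu *pool_table[POOL_TABLE_SIZE];

static struct page *alloc_via_allocator_id(u32 id, gfp_t gfp)
{
	struct page_pool *pool;
	struct page *page = NULL;

	rcu_read_lock();
	pool = rcu_dereference(pool_table[id % POOL_TABLE_SIZE]);
	if (pool)
		page = page_pool_alloc_pages(pool, gfp);
	rcu_read_unlock();	/* pool memory must stay valid until here */

	return page;
}

A reader like this can still be dereferencing the pool after the owning
driver dropped its reference, so the final kfree() has to be deferred by
call_rcu().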
Eric Dumazet March 23, 2018, 2:55 p.m. UTC | #5
On 03/23/2018 07:15 AM, Jesper Dangaard Brouer wrote:
> On Fri, 23 Mar 2018 06:29:55 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
>> On 03/23/2018 05:18 AM, Jesper Dangaard Brouer wrote:
>>
>>> +
>>> +void page_pool_destroy_rcu(struct page_pool *pool)
>>> +{
>>> +	call_rcu(&pool->rcu, __page_pool_destroy_rcu);
>>> +}
>>> +EXPORT_SYMBOL(page_pool_destroy_rcu);
>>>   
>>
>>
>> Why do we need to respect one rcu grace period before destroying a page pool ?
> 
> Due to previous allocator ID patch, which can have a pointer reference
> to a page_pool, and the allocator ID lookup uses RCU.
> 

I am not convinced.  How come a patch that comes _before_ this one can have any impact?

Normally, we put the infrastructure first, then something using it.

An RCU grace period before freeing huge quantities of pages is problematic and could
be used by syzbot to OOM the host.
Jesper Dangaard Brouer March 26, 2018, 2:09 p.m. UTC | #6
On Fri, 23 Mar 2018 06:28:13 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On 03/23/2018 05:18 AM, Jesper Dangaard Brouer wrote:
> 
> > +
> > +	/* Note, below struct compat code was primarily needed when
> > +	 * page_pool code lived under MM-tree control, given mmots and
> > +	 * net-next trees progress in very different rates.
> > +	 *
> > +	 * Allow kernel devel trees and driver to progress at different rates
> > +	 */
> > +	param_copy_sz = PAGE_POOL_PARAMS_SIZE;
> > +	memset(&pool->p, 0, param_copy_sz);
> > +	if (params->size < param_copy_sz) {
> > +		/* Older module calling newer kernel, handled by only
> > +		 * copying supplied size, and keep remaining params zero
> > +		 */
> > +		param_copy_sz = params->size;
> > +	} else if (params->size > param_copy_sz) {
> > +		/* Newer module calling older kernel. Need to validate
> > +		 * no new features were requested.
> > +		 */
> > +		unsigned char *addr = (unsigned char *)params + param_copy_sz;
> > +		unsigned char *end  = (unsigned char *)params + params->size;
> > +
> > +		for (; addr < end; addr++) {
> > +			if (*addr != 0)
> > +				return -E2BIG;
> > +		}
> > +	}  
> 
> I do not see the need for this part.

Okay, then I'll just drop it.
I'm considering changing page_pool_create() to just take everything
as parameters.
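
Purely as a hypothetical sketch of that direction (not what this patch
implements), the create call could take plain parameters instead of a
versioned struct:

struct page_pool *page_pool_create(struct device *dev, int nid,
				   unsigned int order,
				   unsigned int pool_size,
				   unsigned long flags,
				   enum dma_data_direction dma_dir);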
Jesper Dangaard Brouer March 26, 2018, 3:19 p.m. UTC | #7
On Fri, 23 Mar 2018 07:55:37 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> An RCU grace period before freeing huge quantities of pages is
> problematic and could be used by syzbot to OOM the host.

Okay. I've adjusted the code to empty the ring right away when the driver
calls destroy, and to only RCU-delay freeing the page_pool object itself,
with a final empty/free of the ring to catch pages from any in-flight
producers.
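
A rough sketch of that split, reusing helpers from the patch below (this
is a reading of the description above, not the actual follow-up code):

static void __page_pool_empty_ring(struct page_pool *pool)
{
	struct page *page;

	while ((page = ptr_ring_consume(&pool->ring)))
		__page_pool_return_page(pool, page);
}

static void __page_pool_destroy_rcu(struct rcu_head *rcu)
{
	struct page_pool *pool = container_of(rcu, struct page_pool, rcu);

	/* Catch pages from producers that raced with destroy */
	__page_pool_empty_ring(pool);
	ptr_ring_cleanup(&pool->ring, NULL);
	kfree(pool);
}

void page_pool_destroy(struct page_pool *pool)
{
	struct page *page;

	/* Caller guarantees the allocation side is no longer in use */
	while (pool->alloc.count) {
		page = pool->alloc.cache[--pool->alloc.count];
		__page_pool_return_page(pool, page);
	}

	/* Release the bulk of the pages immediately, instead of holding
	 * them across a grace period (the OOM concern above).
	 */
	__page_pool_empty_ring(pool);

	/* Only the small page_pool object waits for RCU readers
	 * (e.g. the allocator ID lookup) to be done with the pointer.
	 */
	call_rcu(&pool->rcu, __page_pool_destroy_rcu);
}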

Patch

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
new file mode 100644
index 000000000000..4486fedbf180
--- /dev/null
+++ b/include/net/page_pool.h
@@ -0,0 +1,133 @@ 
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * page_pool.h
+ *	Author:	Jesper Dangaard Brouer <netoptimizer@brouer.com>
+ *	Copyright (C) 2016 Red Hat, Inc.
+ */
+
+/**
+ * DOC: page_pool allocator
+ *
+ * This page_pool allocator is optimized for the XDP mode that
+ * uses one-frame-per-page, but has fallbacks that act like the
+ * regular page allocator APIs.
+ *
+ * Basic use involves replacing alloc_pages() calls with the
+ * page_pool_alloc_pages() call.  Drivers should likely use
+ * page_pool_dev_alloc_pages(), replacing dev_alloc_pages().
+ *
+ * If page_pool handles DMA mapping (using page->private), then the API
+ * user is responsible for invoking page_pool_put_page() once.  In case
+ * of an elevated refcnt, the DMA state is released, assuming other users
+ * of the page will eventually call put_page().
+ *
+ * If no DMA mapping is done, then it can act as a shim layer that
+ * falls through to alloc_page().  As no state is kept on the page, the
+ * regular put_page() call is sufficient.
+ */
+#ifndef _NET_PAGE_POOL_H
+#define _NET_PAGE_POOL_H
+
+#include <linux/mm.h> /* Needed by ptr_ring */
+#include <linux/ptr_ring.h>
+#include <linux/dma-direction.h>
+
+#define PP_FLAG_DMA_MAP 1 /* Should page_pool do the DMA map/unmap */
+#define PP_FLAG_ALL	PP_FLAG_DMA_MAP
+
+/*
+ * Fast allocation side cache array/stack
+ *
+ * The cache size and refill watermark are related to the network
+ * use-case.  The NAPI budget is 64 packets.  After a NAPI poll the RX
+ * ring is usually refilled and the max consumed elements will be 64,
+ * thus a natural max size of objects needed in the cache.
+ *
+ * Keeping room for more objects is due to the XDP_DROP use-case, as
+ * XDP_DROP allows the opportunity to recycle objects directly into
+ * this array, since it shares the same softirq/NAPI protection.  If
+ * the cache is already full (or partly full), then the XDP_DROP
+ * recycles would have to take a slower code path.
+ */
+#define PP_ALLOC_CACHE_SIZE	128
+#define PP_ALLOC_CACHE_REFILL	64
+struct pp_alloc_cache {
+	u32 count ____cacheline_aligned_in_smp;
+	void *cache[PP_ALLOC_CACHE_SIZE];
+};
+
+struct page_pool_params {
+	u32		size; /* caller sets size of struct */
+	unsigned int	order;
+	unsigned long	flags;
+	struct device	*dev; /* device, for DMA pre-mapping purposes */
+	int		nid;  /* NUMA node id to allocate pages from */
+	enum dma_data_direction dma_dir; /* DMA mapping direction */
+	unsigned int	pool_size;
+	char		end_marker[0]; /* must be last struct member */
+};
+#define	PAGE_POOL_PARAMS_SIZE	offsetof(struct page_pool_params, end_marker)
+
+struct page_pool {
+	struct page_pool_params p;
+
+	/*
+	 * Data structure for allocation side
+	 *
+	 * The driver's allocation side usually already performs some
+	 * kind of resource protection.  Piggyback on this protection,
+	 * and require the driver to protect the allocation side.
+	 *
+	 * For NIC drivers this means allocating a page_pool per
+	 * RX-queue, as the RX-queue is already protected by
+	 * softirq/BH scheduling and napi_schedule.  NAPI scheduling
+	 * guarantees that a single napi_struct will only be scheduled
+	 * on a single CPU (see napi_schedule).
+	 */
+	struct pp_alloc_cache alloc;
+
+	/* Data structure for storing recycled pages.
+	 *
+	 * Returning/freeing pages is more complicated synchronization-
+	 * wise, because frees can happen on remote CPUs, with no
+	 * association with the allocation resource.
+	 *
+	 * Use ptr_ring, as it separates consumer and producer
+	 * efficiently, in a way that doesn't bounce cache-lines.
+	 *
+	 * TODO: Implement bulk return pages into this structure.
+	 */
+	struct ptr_ring ring;
+
+	struct rcu_head rcu;
+};
+
+struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);
+
+static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool)
+{
+	gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN);
+
+	return page_pool_alloc_pages(pool, gfp);
+}
+
+struct page_pool *page_pool_create(const struct page_pool_params *params);
+
+void page_pool_destroy_rcu(struct page_pool *pool);
+
+/* Never call this directly, use helpers below */
+void __page_pool_put_page(struct page_pool *pool,
+			  struct page *page, bool allow_direct);
+
+static inline void page_pool_put_page(struct page_pool *pool, struct page *page)
+{
+	__page_pool_put_page(pool, page, false);
+}
+/* Very limited use-cases allow recycle direct */
+static inline void page_pool_recycle_direct(struct page_pool *pool,
+					    struct page *page)
+{
+	__page_pool_put_page(pool, page, true);
+}
+
+#endif /* _NET_PAGE_POOL_H */
diff --git a/net/core/Makefile b/net/core/Makefile
index 6dbbba8c57ae..100a2b3b2a08 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -14,6 +14,7 @@  obj-y		     += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o \
 			fib_notifier.o xdp.o
 
 obj-y += net-sysfs.o
+obj-y += page_pool.o
 obj-$(CONFIG_PROC_FS) += net-procfs.o
 obj-$(CONFIG_NET_PKTGEN) += pktgen.o
 obj-$(CONFIG_NETPOLL) += netpoll.o
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
new file mode 100644
index 000000000000..351fdfb1d872
--- /dev/null
+++ b/net/core/page_pool.c
@@ -0,0 +1,329 @@ 
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * page_pool.c
+ *	Author:	Jesper Dangaard Brouer <netoptimizer@brouer.com>
+ *	Copyright (C) 2016 Red Hat, Inc.
+ */
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+
+#include <net/page_pool.h>
+#include <linux/dma-direction.h>
+#include <linux/dma-mapping.h>
+#include <linux/page-flags.h>
+#include <linux/mm.h> /* for __put_page() */
+
+int page_pool_init(struct page_pool *pool,
+		   const struct page_pool_params *params)
+{
+	int ring_qsize = 1024; /* Default */
+	int param_copy_sz;
+
+	if (!pool)
+		return -EFAULT;
+
+	/* Note, below struct compat code was primarily needed when
+	 * page_pool code lived under MM-tree control, given mmots and
+	 * net-next trees progress in very different rates.
+	 *
+	 * Allow kernel devel trees and driver to progress at different rates
+	 */
+	param_copy_sz = PAGE_POOL_PARAMS_SIZE;
+	memset(&pool->p, 0, param_copy_sz);
+	if (params->size < param_copy_sz) {
+		/* Older module calling newer kernel, handled by only
+		 * copying supplied size, and keep remaining params zero
+		 */
+		param_copy_sz = params->size;
+	} else if (params->size > param_copy_sz) {
+		/* Newer module calling older kernel. Need to validate
+		 * no new features were requested.
+		 */
+		unsigned char *addr = (unsigned char *)params + param_copy_sz;
+		unsigned char *end  = (unsigned char *)params + params->size;
+
+		for (; addr < end; addr++) {
+			if (*addr != 0)
+				return -E2BIG;
+		}
+	}
+	memcpy(&pool->p, params, param_copy_sz);
+
+	/* Validate only known flags were used */
+	if (pool->p.flags & ~(PP_FLAG_ALL))
+		return -EINVAL;
+
+	if (pool->p.pool_size)
+		ring_qsize = pool->p.pool_size;
+
+	if (ptr_ring_init(&pool->ring, ring_qsize, GFP_KERNEL) < 0)
+		return -ENOMEM;
+
+	/* DMA direction is either DMA_FROM_DEVICE or DMA_BIDIRECTIONAL.
+	 * DMA_BIDIRECTIONAL is for allowing the page to be used for DMA sending,
+	 * which is the XDP_TX use-case.
+	 */
+	if ((pool->p.dma_dir != DMA_FROM_DEVICE) &&
+	    (pool->p.dma_dir != DMA_BIDIRECTIONAL))
+		return -EINVAL;
+
+	return 0;
+}
+
+struct page_pool *page_pool_create(const struct page_pool_params *params)
+{
+	struct page_pool *pool;
+	int err = 0;
+
+	if (params->size < offsetof(struct page_pool_params, nid)) {
+		WARN(1, "Fix page_pool_params->size code\n");
+		return ERR_PTR(-EBADR);
+	}
+
+	pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, params->nid);
+	err = page_pool_init(pool, params);
+	if (err < 0) {
+		pr_warn("%s() gave up with errno %d\n", __func__, err);
+		kfree(pool);
+		return ERR_PTR(err);
+	}
+	return pool;
+}
+EXPORT_SYMBOL(page_pool_create);
+
+/* fast path */
+static struct page *__page_pool_get_cached(struct page_pool *pool)
+{
+	struct ptr_ring *r = &pool->ring;
+	struct page *page;
+
+	/* Quicker fallback, avoid locks when ring is empty */
+	if (__ptr_ring_empty(r))
+		return NULL;
+
+	/* Test for safe-context, caller should provide this guarantee */
+	if (likely(in_serving_softirq())) {
+		if (likely(pool->alloc.count)) {
+			/* Fast-path */
+			page = pool->alloc.cache[--pool->alloc.count];
+			return page;
+		}
+		/* Slower-path: Alloc array empty, time to refill
+		 *
+		 * Open-coded bulk ptr_ring consumer.
+		 *
+		 * Discussion: the ring consumer lock is not really
+		 * needed due to the softirq/NAPI protection, but we
+		 * later need the ability to reclaim pages from the
+		 * ring.  Thus, keep the locks.
+		 */
+		spin_lock(&r->consumer_lock);
+		while ((page = __ptr_ring_consume(r))) {
+			if (pool->alloc.count == PP_ALLOC_CACHE_REFILL)
+				break;
+			pool->alloc.cache[pool->alloc.count++] = page;
+		}
+		spin_unlock(&r->consumer_lock);
+		return page;
+	}
+
+	/* Slow-path: Get page from locked ring queue */
+	page = ptr_ring_consume(&pool->ring);
+	return page;
+}
+
+/* slow path */
+noinline
+static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
+						 gfp_t _gfp)
+{
+	struct page *page;
+	gfp_t gfp = _gfp;
+	dma_addr_t dma;
+
+	/* We could always set __GFP_COMP, and avoid this branch, as
+	 * prep_new_page() can handle order-0 with __GFP_COMP.
+	 */
+	if (pool->p.order)
+		gfp |= __GFP_COMP;
+
+	/* FUTURE development:
+	 *
+	 * Current slow-path essentially falls back to single page
+	 * allocations, which doesn't improve performance.  This code
+	 * needs bulk allocation support from the page allocator code.
+	 */
+
+	/* Cache was empty, do real allocation */
+	page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
+	if (!page)
+		return NULL;
+
+	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
+		goto skip_dma_map;
+
+	/* Setup DMA mapping: use page->private for DMA-addr
+	 * This mapping is kept for the lifetime of the page, until it leaves the pool.
+	 */
+	dma = dma_map_page(pool->p.dev, page, 0,
+			   (PAGE_SIZE << pool->p.order),
+			   pool->p.dma_dir);
+	if (dma_mapping_error(pool->p.dev, dma)) {
+		put_page(page);
+		return NULL;
+	}
+	set_page_private(page, dma); /* page->private = dma; */
+
+skip_dma_map:
+	/* When a page has just been alloc'ed, it should/must have refcnt 1. */
+	return page;
+}
+
+/* For using page_pool to replace alloc_pages() API calls, but the caller
+ * must provide the synchronization guarantee for the allocation side.
+ */
+struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
+{
+	struct page *page;
+
+	/* Fast-path: Get a page from cache */
+	page = __page_pool_get_cached(pool);
+	if (page)
+		return page;
+
+	/* Slow-path: cache empty, do real allocation */
+	page = __page_pool_alloc_pages_slow(pool, gfp);
+	return page;
+}
+EXPORT_SYMBOL(page_pool_alloc_pages);
+
+/* Cleanup page_pool state from page */
+static void __page_pool_clean_page(struct page_pool *pool,
+				   struct page *page)
+{
+	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
+		return;
+
+	/* DMA unmap */
+	dma_unmap_page(pool->p.dev, page_private(page),
+		       PAGE_SIZE << pool->p.order, pool->p.dma_dir);
+	set_page_private(page, 0);
+}
+
+/* Return a page to the page allocator, cleaning up our state */
+static void __page_pool_return_page(struct page_pool *pool, struct page *page)
+{
+	__page_pool_clean_page(pool, page);
+	put_page(page);
+	/* An optimization would be to call __free_pages(page, pool->p.order)
+	 * knowing page is not part of page-cache (thus avoiding a
+	 * __page_cache_release() call).
+	 */
+}
+
+bool __page_pool_recycle_into_ring(struct page_pool *pool,
+				   struct page *page)
+{
+	int ret;
+	/* BH protection not needed if current is serving softirq */
+	if (in_serving_softirq())
+		ret = ptr_ring_produce(&pool->ring, page);
+	else
+		ret = ptr_ring_produce_bh(&pool->ring, page);
+
+	return (ret == 0);
+}
+
+/* Only allow direct recycling in special circumstances, into the
+ * alloc side cache.  E.g. during RX-NAPI processing for XDP_DROP use-case.
+ *
+ * Caller must provide appropriate safe context.
+ */
+static bool __page_pool_recycle_direct(struct page *page,
+				       struct page_pool *pool)
+{
+	if (unlikely(pool->alloc.count == PP_ALLOC_CACHE_SIZE))
+		return false;
+
+	/* Caller MUST have verified/know (page_ref_count(page) == 1) */
+	pool->alloc.cache[pool->alloc.count++] = page;
+	return true;
+}
+
+void __page_pool_put_page(struct page_pool *pool,
+			  struct page *page, bool allow_direct)
+{
+	/* This allocator is optimized for the XDP mode that uses
+	 * one-frame-per-page, but has fallbacks that act like the
+	 * regular page allocator APIs.
+	 *
+	 * refcnt == 1 means page_pool owns page, and can recycle it.
+	 */
+	if (likely(page_ref_count(page) == 1)) {
+		/* Read barrier done in page_ref_count / READ_ONCE */
+
+		if (allow_direct && in_serving_softirq())
+			if (__page_pool_recycle_direct(page, pool))
+				return;
+
+		if (!__page_pool_recycle_into_ring(pool, page)) {
+			/* Cache full, fallback to free pages */
+			__page_pool_return_page(pool, page);
+		}
+		return;
+	}
+	/* Fallback/non-XDP mode: API user has an elevated refcnt.
+	 *
+	 * Many drivers split up the page into fragments, and some
+	 * want to keep doing this to save memory and do refcnt based
+	 * recycling. Support this use case too, to ease drivers
+	 * switching between XDP/non-XDP.
+	 *
+	 * In case page_pool maintains the DMA mapping, the API user must
+	 * call page_pool_put_page() once.  In this elevated refcnt
+	 * case, the DMA mapping is unmapped/released, as the driver is
+	 * likely doing refcnt-based recycle tricks, meaning another
+	 * user will be invoking put_page().
+	 */
+	__page_pool_clean_page(pool, page);
+	put_page(page);
+}
+EXPORT_SYMBOL(__page_pool_put_page);
+
+/* Cleanup and release resources */
+void __page_pool_destroy_rcu(struct rcu_head *rcu)
+{
+	struct page_pool *pool;
+	struct page *page;
+
+	pool = container_of(rcu, struct page_pool, rcu);
+
+	/* Empty alloc cache, assume caller made sure this is
+	 * no longer in use, and that page_pool_alloc_pages() cannot be
+	 * called concurrently.
+	 */
+	while (pool->alloc.count) {
+		page = pool->alloc.cache[--pool->alloc.count];
+		__page_pool_return_page(pool, page);
+	}
+
+	/* Empty recycle ring */
+	while ((page = ptr_ring_consume(&pool->ring))) {
+		/* Verify the refcnt invariant of cached pages */
+		if (page_ref_count(page) != 1) {
+			pr_crit("%s() page_pool refcnt %d violation\n",
+				__func__, page_ref_count(page));
+			WARN_ON(1);
+		}
+		__page_pool_return_page(pool, page);
+	}
+	ptr_ring_cleanup(&pool->ring, NULL);
+	kfree(pool);
+}
+
+void page_pool_destroy_rcu(struct page_pool *pool)
+{
+	call_rcu(&pool->rcu, __page_pool_destroy_rcu);
+}
+EXPORT_SYMBOL(page_pool_destroy_rcu);