
[net-next,V2,2/3] page_pool: Don't recycle non-reusable pages

Message ID: 20191023193632.26917-3-saeedm@mellanox.com
State: Changes Requested
Delegated to: David Miller
Series: page_pool: API for numa node change handling

Commit Message

Saeed Mahameed Oct. 23, 2019, 7:37 p.m. UTC
A page is NOT reusable when at least one of the following is true:
1) it was allocated while the system was under memory pressure
   (page_is_pfmemalloc).
2) it belongs to a different NUMA node than pool->p.nid.

To update pool->p.nid, users should call page_pool_update_nid().

Holding on to such pages in the pool hurts consumer performance when
the pool migrates to a different NUMA node.
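
As a rough illustration of how a driver would use this, a NAPI poll
handler could retarget the pool when it notices it is running on a new
node. This is a hypothetical sketch: the drv_* names are made up, and
only page_pool_update_nid() (added in patch 1/3 of this series) and
pool->p.nid come from the actual code.

/* Hypothetical driver-side usage; drv_* names are for illustration. */
static int drv_napi_poll(struct napi_struct *napi, int budget)
{
	struct drv_rx_ring *ring = container_of(napi, struct drv_rx_ring,
						napi);
	int nid = numa_mem_id();	/* node this poll runs on now */

	/* If the RX IRQ migrated to another node, retarget the pool so
	 * that pages from the old node stop being treated as reusable.
	 */
	if (unlikely(ring->page_pool->p.nid != nid))
		page_pool_update_nid(ring->page_pool, nid);

	return drv_poll_rx_cq(ring, budget);
}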

Performance testing:
XDP drop/TX rate and TCP single/multi-stream throughput, on the mlx5
driver, while migrating the RX ring IRQ from the close to the far NUMA
node.

The mlx5 internal page cache was locally disabled to get pure page
pool results.

CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
NIC: Mellanox Technologies MT27700 Family [ConnectX-4] (100G)

XDP Drop/TX single core:
NUMA  | XDP  | Before    | After
---------------------------------------
Close | Drop | 11   Mpps | 10.9 Mpps
Far   | Drop | 4.4  Mpps | 5.8  Mpps

Close | TX   | 6.5 Mpps  | 6.5 Mpps
Far   | TX   | 3.5 Mpps  | 4  Mpps

The far-NUMA drop rate improves by about 30% and the far-NUMA TX rate
by about 15%.
No degradation in the close-NUMA tests.

TCP single/multi cpu/stream:
NUMA  | #cpu | Before  | After
--------------------------------------
Close | 1    | 18 Gbps | 18 Gbps
Far   | 1    | 15 Gbps | 18 Gbps
Close | 12   | 80 Gbps | 80 Gbps
Far   | 12   | 68 Gbps | 80 Gbps

In all test cases we see an improvement for the far-NUMA case and no
impact on the close-NUMA case.

The cost of the added per-page check is negligible and shows no
measurable performance degradation. Functionally it is also more
correct and more robust for the page pool to verify whether a page can
be recycled, since the page pool cannot guarantee where its pages come
from.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 net/core/page_pool.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

Comments

Ilias Apalodimas Oct. 24, 2019, 5 a.m. UTC | #1
On Wed, Oct 23, 2019 at 07:37:00PM +0000, Saeed Mahameed wrote:
> A page is NOT reusable when at least one of the following is true:
> 1) allocated when system was under some pressure. (page_is_pfmemalloc)
> 2) belongs to a different NUMA node than pool->p.nid.
> 
> To update pool->p.nid users should call page_pool_update_nid().
> 
> Holding on to such pages in the pool will hurt the consumer performance
> when the pool migrates to a different numa node.
> 
> Performance testing:
> XDP drop/tx rate and TCP single/multi stream, on mlx5 driver
> while migrating rx ring irq from close to far numa:
> 
> mlx5 internal page cache was locally disabled to get pure page pool
> results.
> 
> CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
> NIC: Mellanox Technologies MT27700 Family [ConnectX-4] (100G)
> 
> XDP Drop/TX single core:
> NUMA  | XDP  | Before    | After
> ---------------------------------------
> Close | Drop | 11   Mpps | 10.9 Mpps
> Far   | Drop | 4.4  Mpps | 5.8  Mpps
> 
> Close | TX   | 6.5 Mpps  | 6.5 Mpps
> Far   | TX   | 3.5 Mpps  | 4  Mpps
> 
> Improvement is about 30% drop packet rate, 15% tx packet rate for numa
> far test.
> No degradation for numa close tests.
> 
> TCP single/multi cpu/stream:
> NUMA  | #cpu | Before  | After
> --------------------------------------
> Close | 1    | 18 Gbps | 18 Gbps
> Far   | 1    | 15 Gbps | 18 Gbps
> Close | 12   | 80 Gbps | 80 Gbps
> Far   | 12   | 68 Gbps | 80 Gbps
> 
> In all test cases we see improvement for the far numa case, and no
> impact on the close numa case.
> 
> The impact of adding a check per page is very negligible, and shows no
> performance degradation whatsoever, also functionality wise it seems more
> correct and more robust for page pool to verify when pages should be
> recycled, since page pool can't guarantee where pages are coming from.
> 
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> ---
>  net/core/page_pool.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 953af6d414fb..73e4173c4dce 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -283,6 +283,17 @@ static bool __page_pool_recycle_direct(struct page *page,
>  	return true;
>  }
>  
> +/* page is NOT reusable when:
> + * 1) allocated when system is under some pressure. (page_is_pfmemalloc)
> + * 2) belongs to a different NUMA node than pool->p.nid.
> + *
> + * To update pool->p.nid users must call page_pool_update_nid.
> + */
> +static bool pool_page_reusable(struct page_pool *pool, struct page *page)
> +{
> +	return !page_is_pfmemalloc(page) && page_to_nid(page) == pool->p.nid;
> +}
> +
>  void __page_pool_put_page(struct page_pool *pool,
>  			  struct page *page, bool allow_direct)
>  {
> @@ -292,7 +303,8 @@ void __page_pool_put_page(struct page_pool *pool,
>  	 *
>  	 * refcnt == 1 means page_pool owns page, and can recycle it.
>  	 */
> -	if (likely(page_ref_count(page) == 1)) {
> +	if (likely(page_ref_count(page) == 1 &&
> +		   pool_page_reusable(pool, page))) {
>  		/* Read barrier done in page_ref_count / READ_ONCE */
>  
>  		if (allow_direct && in_serving_softirq())
> -- 
> 2.21.0
> 

Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Jesper Dangaard Brouer Oct. 25, 2019, 1:33 p.m. UTC | #2
On Wed, 23 Oct 2019 19:37:00 +0000
Saeed Mahameed <saeedm@mellanox.com> wrote:

> A page is NOT reusable when at least one of the following is true:
> 1) allocated when system was under some pressure. (page_is_pfmemalloc)
> 2) belongs to a different NUMA node than pool->p.nid.
> 
> To update pool->p.nid users should call page_pool_update_nid().
> 
> Holding on to such pages in the pool will hurt the consumer performance
> when the pool migrates to a different numa node.
> 
> Performance testing:
> XDP drop/tx rate and TCP single/multi stream, on mlx5 driver
> while migrating rx ring irq from close to far numa:
> 
> mlx5 internal page cache was locally disabled to get pure page pool
> results.

Could you show us the code that disables the local page cache?


> CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
> NIC: Mellanox Technologies MT27700 Family [ConnectX-4] (100G)
> 
> XDP Drop/TX single core:
> NUMA  | XDP  | Before    | After
> ---------------------------------------
> Close | Drop | 11   Mpps | 10.9 Mpps
> Far   | Drop | 4.4  Mpps | 5.8  Mpps
> 
> Close | TX   | 6.5 Mpps  | 6.5 Mpps
> Far   | TX   | 3.5 Mpps  | 4  Mpps
> 
> Improvement is about 30% drop packet rate, 15% tx packet rate for numa
> far test.
> No degradation for numa close tests.
> 
> TCP single/multi cpu/stream:
> NUMA  | #cpu | Before  | After
> --------------------------------------
> Close | 1    | 18 Gbps | 18 Gbps
> Far   | 1    | 15 Gbps | 18 Gbps
> Close | 12   | 80 Gbps | 80 Gbps
> Far   | 12   | 68 Gbps | 80 Gbps
> 
> In all test cases we see improvement for the far numa case, and no
> impact on the close numa case.
> 
> The impact of adding a check per page is very negligible, and shows no
> performance degradation whatsoever, also functionality wise it seems more
> correct and more robust for page pool to verify when pages should be
> recycled, since page pool can't guarantee where pages are coming from.
> 
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> ---
>  net/core/page_pool.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 953af6d414fb..73e4173c4dce 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -283,6 +283,17 @@ static bool __page_pool_recycle_direct(struct page *page,
>  	return true;
>  }
>  
> +/* page is NOT reusable when:
> + * 1) allocated when system is under some pressure. (page_is_pfmemalloc)
> + * 2) belongs to a different NUMA node than pool->p.nid.
> + *
> + * To update pool->p.nid users must call page_pool_update_nid.
> + */
> +static bool pool_page_reusable(struct page_pool *pool, struct page *page)
> +{
> +	return !page_is_pfmemalloc(page) && page_to_nid(page) == pool->p.nid;
> +}
> +
>  void __page_pool_put_page(struct page_pool *pool,
>  			  struct page *page, bool allow_direct)
>  {
> @@ -292,7 +303,8 @@ void __page_pool_put_page(struct page_pool *pool,
>  	 *
>  	 * refcnt == 1 means page_pool owns page, and can recycle it.
>  	 */
> -	if (likely(page_ref_count(page) == 1)) {
> +	if (likely(page_ref_count(page) == 1 &&
> +		   pool_page_reusable(pool, page))) {

I'm afraid that we are slowly chipping away at the performance benefit
with these incremental changes, adding more checks. We have an extreme
performance use-case with XDP_DROP, where we want drivers to use this
code path to hit __page_pool_recycle_direct(), which is a simple array
update (protected under NAPI) into pool->alloc.cache[].

To preserve this hot path, you could instead flush pool->alloc.cache[]
in page_pool_update_nid() and move the pool_page_reusable() check into
__page_pool_recycle_into_ring().  (The '>>' lines below show the
remaining code, to make this easier to see.)


>  		/* Read barrier done in page_ref_count / READ_ONCE */
>  
>  		if (allow_direct && in_serving_softirq())
>>			if (__page_pool_recycle_direct(page, pool))
>>				return;
>>
>>		if (!__page_pool_recycle_into_ring(pool, page)) {
>>			/* Cache full, fallback to free pages */
>>			__page_pool_return_page(pool, page);
>>		}
>>		return;
>>	}
>>	/* Fallback/non-XDP mode: API user have elevated refcnt.
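
A minimal sketch of the alternative Jesper describes above, written
against the net/core/page_pool.c code quoted in this thread. The flush
helper and the *_checked wrapper are made-up names; pool_page_reusable(),
__page_pool_return_page(), __page_pool_recycle_into_ring(),
page_pool_update_nid() and pool->alloc.cache[] are taken from the
series and the quoted code.

/* Sketch only: flush the per-CPU alloc cache when the pool's NUMA node
 * changes, and do the reusable check on the ring path instead of in
 * front of the XDP_DROP fast path.
 */
static void page_pool_flush_alloc_cache(struct page_pool *pool)
{
	/* Return every cached page to the page allocator so no pages
	 * from the old node remain in alloc.cache[].  Assumes it runs
	 * in the softirq/NAPI context that owns the cache.
	 */
	while (pool->alloc.count)
		__page_pool_return_page(pool,
					pool->alloc.cache[--pool->alloc.count]);
}

void page_pool_update_nid(struct page_pool *pool, int new_nid)
{
	pool->p.nid = new_nid;
	page_pool_flush_alloc_cache(pool);
}

static bool __page_pool_recycle_into_ring_checked(struct page_pool *pool,
						  struct page *page)
{
	/* Non-reusable pages never enter the ring; callers fall back to
	 * __page_pool_return_page() when this returns false.
	 */
	if (!pool_page_reusable(pool, page))
		return false;

	return __page_pool_recycle_into_ring(pool, page);
}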
Jonathan Lemon Oct. 25, 2019, 6:37 p.m. UTC | #3
On 25 Oct 2019, at 6:33, Jesper Dangaard Brouer wrote:

> On Wed, 23 Oct 2019 19:37:00 +0000
> Saeed Mahameed <saeedm@mellanox.com> wrote:
>
>> @@ -292,7 +303,8 @@ void __page_pool_put_page(struct page_pool *pool,
>>  	 *
>>  	 * refcnt == 1 means page_pool owns page, and can recycle it.
>>  	 */
>> -	if (likely(page_ref_count(page) == 1)) {
>> +	if (likely(page_ref_count(page) == 1 &&
>> +		   pool_page_reusable(pool, page))) {
>
> I'm afraid that we are slowly chipping away the performance benefit
> with these incremental changes, adding more checks. We have an extreme
> performance use-case with XDP_DROP, where we want drivers to use this
> code path to hit __page_pool_recycle_direct(), that is a simple array
> update (protected under NAPI) into pool->alloc.cache[].

The checks have to be paid for somewhere.  mlx4, mlx5, i40e, igb, etc.
already do these checks when deciding whether to recycle a page.  The
pfmemalloc check can't be moved into the flush path.  I'd rather have
a common place where this is performed, moving it out of each driver.
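
For reference, the per-driver test Jonathan refers to looks roughly
like this (a generic sketch; the actual helper names and details
differ from driver to driver):

/* Generic sketch of the reuse check drivers carry in their own RX
 * page caches today; this patch centralizes the equivalent test in
 * pool_page_reusable().
 */
static bool drv_page_is_reusable(struct page *page)
{
	/* Skip emergency-reserve pages and pages from a remote node. */
	return !page_is_pfmemalloc(page) &&
	       page_to_nid(page) == numa_mem_id();
}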
Saeed Mahameed Oct. 28, 2019, 10:57 p.m. UTC | #4
On Fri, 2019-10-25 at 15:33 +0200, Jesper Dangaard Brouer wrote:
> On Wed, 23 Oct 2019 19:37:00 +0000
> Saeed Mahameed <saeedm@mellanox.com> wrote:
> 
> > A page is NOT reusable when at least one of the following is true:
> > 1) allocated when system was under some pressure.
> > (page_is_pfmemalloc)
> > 2) belongs to a different NUMA node than pool->p.nid.
> > 
> > To update pool->p.nid users should call page_pool_update_nid().
> > 
> > Holding on to such pages in the pool will hurt the consumer
> > performance
> > when the pool migrates to a different numa node.
> > 
> > Performance testing:
> > XDP drop/tx rate and TCP single/multi stream, on mlx5 driver
> > while migrating rx ring irq from close to far numa:
> > 
> > mlx5 internal page cache was locally disabled to get pure page pool
> > results.
> 
> Could you show us the code that disable the local page cache?
> 

Here you go, a very simple patch: it just avoids calling the mlx5
internal cache API and skips the DMA mapping/unmapping.

https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/commit/?h=topic/page-pool-numa&id=aa7345f62cad62cf19fbf2dee2a8b15de34b33e3

> 
> > CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
> > NIC: Mellanox Technologies MT27700 Family [ConnectX-4] (100G)
> > 
> > XDP Drop/TX single core:
> > NUMA  | XDP  | Before    | After
> > ---------------------------------------
> > Close | Drop | 11   Mpps | 10.9 Mpps
> > Far   | Drop | 4.4  Mpps | 5.8  Mpps
> > 
> > Close | TX   | 6.5 Mpps  | 6.5 Mpps
> > Far   | TX   | 3.5 Mpps  | 4  Mpps
> > 
> > Improvement is about 30% drop packet rate, 15% tx packet rate for
> > numa
> > far test.
> > No degradation for numa close tests.
> > 
> > TCP single/multi cpu/stream:
> > NUMA  | #cpu | Before  | After
> > --------------------------------------
> > Close | 1    | 18 Gbps | 18 Gbps
> > Far   | 1    | 15 Gbps | 18 Gbps
> > Close | 12   | 80 Gbps | 80 Gbps
> > Far   | 12   | 68 Gbps | 80 Gbps
> > 
> > In all test cases we see improvement for the far numa case, and no
> > impact on the close numa case.
> > 
> > The impact of adding a check per page is very negligible, and shows
> > no
> > performance degradation whatsoever, also functionality wise it
> > seems more
> > correct and more robust for page pool to verify when pages should
> > be
> > recycled, since page pool can't guarantee where pages are coming
> > from.
> > 
> > Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> > Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
> > ---
> >  net/core/page_pool.c | 14 +++++++++++++-
> >  1 file changed, 13 insertions(+), 1 deletion(-)
> > 
> > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > index 953af6d414fb..73e4173c4dce 100644
> > --- a/net/core/page_pool.c
> > +++ b/net/core/page_pool.c
> > @@ -283,6 +283,17 @@ static bool __page_pool_recycle_direct(struct
> > page *page,
> >  	return true;
> >  }
> >  
> > +/* page is NOT reusable when:
> > + * 1) allocated when system is under some pressure.
> > (page_is_pfmemalloc)
> > + * 2) belongs to a different NUMA node than pool->p.nid.
> > + *
> > + * To update pool->p.nid users must call page_pool_update_nid.
> > + */
> > +static bool pool_page_reusable(struct page_pool *pool, struct page
> > *page)
> > +{
> > +	return !page_is_pfmemalloc(page) && page_to_nid(page) == pool-
> > >p.nid;
> > +}
> > +
> >  void __page_pool_put_page(struct page_pool *pool,
> >  			  struct page *page, bool allow_direct)
> >  {
> > @@ -292,7 +303,8 @@ void __page_pool_put_page(struct page_pool
> > *pool,
> >  	 *
> >  	 * refcnt == 1 means page_pool owns page, and can recycle it.
> >  	 */
> > -	if (likely(page_ref_count(page) == 1)) {
> > +	if (likely(page_ref_count(page) == 1 &&
> > +		   pool_page_reusable(pool, page))) {
> 
> I'm afraid that we are slowly chipping away the performance benefit
> with these incremental changes, adding more checks. We have an
> extreme
> performance use-case with XDP_DROP, where we want drivers to use this
> code path to hit __page_pool_recycle_direct(), that is a simple array
> update (protected under NAPI) into pool->alloc.cache[].
> 
> To preserve this hot-path, you could instead flush pool-
> >alloc.cache[]
> in the call page_pool_update_nid().  And move the
> pool_page_reusable()
> check into __page_pool_recycle_into_ring().  (Below added the '>>'
> with
> remaining code to make this easier to see)
> 

A flush simply won't work. What about pages that are currently in use
by the driver HW and will only be returned to the cache after the
flush? Who will flush those pages? Flushing the HW queue would be too
heavy.

As Jonathan pointed out, all drivers are doing this check today; all
we need to do is move it into the page pool and take it out of the
drivers' responsibility.

For pfmemalloc, we could make sure the page pool never allows such
pages to be allocated in the first place, but that is very risky and
can break many "kernel in emergency" features.

I think what Jonathan and I did here is the best course of action.

> 
> >  		/* Read barrier done in page_ref_count / READ_ONCE */
> >  
> >  		if (allow_direct && in_serving_softirq())
> > > 			if (__page_pool_recycle_direct(page, pool))
> > > 				return;
> > > 
> > > 		if (!__page_pool_recycle_into_ring(pool, page)) {
> > > 			/* Cache full, fallback to free pages */
> > > 			__page_pool_return_page(pool, page);
> > > 		}
> > > 		return;
> > > 	}
> > > 	/* Fallback/non-XDP mode: API user have elevated refcnt.
> 
>

Patch

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 953af6d414fb..73e4173c4dce 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -283,6 +283,17 @@  static bool __page_pool_recycle_direct(struct page *page,
 	return true;
 }
 
+/* page is NOT reusable when:
+ * 1) allocated when system is under some pressure. (page_is_pfmemalloc)
+ * 2) belongs to a different NUMA node than pool->p.nid.
+ *
+ * To update pool->p.nid users must call page_pool_update_nid.
+ */
+static bool pool_page_reusable(struct page_pool *pool, struct page *page)
+{
+	return !page_is_pfmemalloc(page) && page_to_nid(page) == pool->p.nid;
+}
+
 void __page_pool_put_page(struct page_pool *pool,
 			  struct page *page, bool allow_direct)
 {
@@ -292,7 +303,8 @@  void __page_pool_put_page(struct page_pool *pool,
 	 *
 	 * refcnt == 1 means page_pool owns page, and can recycle it.
 	 */
-	if (likely(page_ref_count(page) == 1)) {
+	if (likely(page_ref_count(page) == 1 &&
+		   pool_page_reusable(pool, page))) {
 		/* Read barrier done in page_ref_count / READ_ONCE */
 
 		if (allow_direct && in_serving_softirq())