
[v2] page_pool: handle page recycle for NUMA_NO_NODE condition

Message ID 1575624767-3343-1-git-send-email-lirongqing@baidu.com
State Changes Requested
Delegated to: David Miller
Series: [v2] page_pool: handle page recycle for NUMA_NO_NODE condition

Commit Message

Li RongQing Dec. 6, 2019, 9:32 a.m. UTC
Some drivers use page pool but do not require pages to be
allocated from a bound node, or simply assign pool.p.nid to
NUMA_NO_NODE; commit d5394610b1ba ("page_pool:
Don't recycle non-reusable pages") blocks recycling for
this kind of driver.

So treat a page as reusable when the page belongs to the current
memory node and nid is NUMA_NO_NODE.

v1-->v2: add a check against numa_mem_id(), as suggested by Yunsheng

Fixes: d5394610b1ba ("page_pool: Don't recycle non-reusable pages")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Suggested-by: Yunsheng Lin <linyunsheng@huawei.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
---
 net/core/page_pool.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Saeed Mahameed Dec. 7, 2019, 3:52 a.m. UTC | #1
On Fri, 2019-12-06 at 17:32 +0800, Li RongQing wrote:
> some drivers uses page pool, but not require to allocate
> pages from bound node, or simply assign pool.p.nid to
> NUMA_NO_NODE, and the commit d5394610b1ba ("page_pool:
> Don't recycle non-reusable pages") will block this kind
> of driver to recycle
> 
> so take page as reusable when page belongs to current
> memory node if nid is NUMA_NO_NODE
> 
> v1-->v2: add check with numa_mem_id from Yunsheng
> 
> Fixes: d5394610b1ba ("page_pool: Don't recycle non-reusable pages")
> Signed-off-by: Li RongQing <lirongqing@baidu.com>
> Suggested-by: Yunsheng Lin <linyunsheng@huawei.com>
> Cc: Saeed Mahameed <saeedm@mellanox.com>
> ---
>  net/core/page_pool.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index a6aefe989043..3c8b51ccd1c1 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -312,12 +312,17 @@ static bool __page_pool_recycle_direct(struct
> page *page,
>  /* page is NOT reusable when:
>   * 1) allocated when system is under some pressure.
> (page_is_pfmemalloc)
>   * 2) belongs to a different NUMA node than pool->p.nid.
> + * 3) belongs to a different memory node than current context
> + * if pool->p.nid is NUMA_NO_NODE
>   *
>   * To update pool->p.nid users must call page_pool_update_nid.
>   */
>  static bool pool_page_reusable(struct page_pool *pool, struct page
> *page)
>  {
> -	return !page_is_pfmemalloc(page) && page_to_nid(page) == pool-
> >p.nid;
> +	return !page_is_pfmemalloc(page) &&
> +		(page_to_nid(page) == pool->p.nid ||
> +		(pool->p.nid == NUMA_NO_NODE &&
> +		page_to_nid(page) == numa_mem_id()));
>  }
>  

Cc'ed Jesper, Ilias & Jonathan.

I don't think it is correct to check that the page nid is the same as
numa_mem_id() if the pool is NUMA_NO_NODE. In such a case we should allow all
pages to recycle, because you can't assume where pages are allocated
from and where they are being handled.

I suggest the following:

return !page_is_pfmemalloc(page) &&
	(page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE);

1) never recycle emergency pages, regardless of pool nid.
2) always recycle if pool is NUMA_NO_NODE.

the above change should not add any overhead, a modest branch predictor
will handle this with no effort.

Jesper et al. what do you think?

-Saeed.
Yunsheng Lin Dec. 9, 2019, 1:31 a.m. UTC | #2
On 2019/12/7 11:52, Saeed Mahameed wrote:
> On Fri, 2019-12-06 at 17:32 +0800, Li RongQing wrote:
>> some drivers uses page pool, but not require to allocate
>> pages from bound node, or simply assign pool.p.nid to
>> NUMA_NO_NODE, and the commit d5394610b1ba ("page_pool:
>> Don't recycle non-reusable pages") will block this kind
>> of driver to recycle
>>
>> so take page as reusable when page belongs to current
>> memory node if nid is NUMA_NO_NODE
>>
>> v1-->v2: add check with numa_mem_id from Yunsheng
>>
>> Fixes: d5394610b1ba ("page_pool: Don't recycle non-reusable pages")
>> Signed-off-by: Li RongQing <lirongqing@baidu.com>
>> Suggested-by: Yunsheng Lin <linyunsheng@huawei.com>
>> Cc: Saeed Mahameed <saeedm@mellanox.com>
>> ---
>>  net/core/page_pool.c | 7 ++++++-
>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>> index a6aefe989043..3c8b51ccd1c1 100644
>> --- a/net/core/page_pool.c
>> +++ b/net/core/page_pool.c
>> @@ -312,12 +312,17 @@ static bool __page_pool_recycle_direct(struct
>> page *page,
>>  /* page is NOT reusable when:
>>   * 1) allocated when system is under some pressure.
>> (page_is_pfmemalloc)
>>   * 2) belongs to a different NUMA node than pool->p.nid.
>> + * 3) belongs to a different memory node than current context
>> + * if pool->p.nid is NUMA_NO_NODE
>>   *
>>   * To update pool->p.nid users must call page_pool_update_nid.
>>   */
>>  static bool pool_page_reusable(struct page_pool *pool, struct page
>> *page)
>>  {
>> -	return !page_is_pfmemalloc(page) && page_to_nid(page) == pool-
>>> p.nid;
>> +	return !page_is_pfmemalloc(page) &&
>> +		(page_to_nid(page) == pool->p.nid ||
>> +		(pool->p.nid == NUMA_NO_NODE &&
>> +		page_to_nid(page) == numa_mem_id()));
>>  }
>>  
> 
> Cc'ed Jesper, Ilias & Jonathan.
> 
> I don't think it is correct to check that the page nid is same as
> numa_mem_id() if pool is NUMA_NO_NODE. In such case we should allow all
> pages to recycle, because you can't assume where pages are allocated
> from and where they are being handled.
> 
> I suggest the following:
> 
> return !page_pfmemalloc() && 
> ( page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE );
> 
> 1) never recycle emergency pages, regardless of pool nid.
> 2) always recycle if pool is NUMA_NO_NODE.

As far as I can see, below are the cases where pool->p.nid could be NUMA_NO_NODE:

1. kernel built with CONFIG_NUMA off.

2. kernel built with CONFIG_NUMA on, but FW/BIOS does not provide
   a valid node id through ACPI/DT, which has the below sub-cases:

   a). the hardware itself is a single NUMA node system, so it may be valid
       not to provide a node for the device.
   b). the hardware itself is a multi NUMA node system, and the FW/BIOS is
       broken in that it does not provide a valid one.

3. kernel built with CONFIG_NUMA on, and FW/BIOS does provide a
   valid node id through ACPI/DT, but the driver does not pass the valid
   node id when calling page_pool_init().

I am not sure which case this patch is trying to fix, maybe Rongqing can
help clarify.

For case 1 and case 2 (a), I suppose checking for pool->p.nid being NUMA_NO_NODE
is enough.

For case 2 (b), there may be an argument that it should be fixed in the BIOS/FW,
but it can also be argued that the numa_mem_id() check is already done by drivers
that are not using page pool yet when deciding whether to recycle, see [1]. If I
understand correctly, recycling is normally done in NAPI polling, which has the
same affinity as the rx interrupt, and the rx interrupt does not change very often.
If it does change to a different memory node, maybe it makes sense not to recycle
the old page belonging to the old node?


For case 3, I am not sure if any driver is doing that, or whether the page pool API
even allows that?

[1] https://elixir.bootlin.com/linux/latest/ident/numa_mem_id
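For reference, the numa_mem_id() check mentioned above typically looks
something like the sketch below in such drivers (the function name is
made up and not taken from any specific driver):

static bool example_rx_page_reusable(struct page *page)
{
	/* reuse the page only when it sits on the node local to the CPU
	 * currently running the RX/NAPI handler, and it is not an
	 * emergency (pfmemalloc) page
	 */
	return page_to_nid(page) == numa_mem_id() &&
	       !page_is_pfmemalloc(page);
}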

> 
> the above change should not add any overhead, a modest branch predictor
> will handle this with no effort.
> 
> Jesper et al. what do you think?
> 
> -Saeed.
>
Li RongQing Dec. 9, 2019, 3:47 a.m. UTC | #3
Cc: Grygorii Strashko, Ivan Khoronzhuk

I see that cpsw uses NUMA_NO_NODE when initializing its page pool.

> On 2019/12/7 11:52, Saeed Mahameed wrote:
> > On Fri, 2019-12-06 at 17:32 +0800, Li RongQing wrote:
> >> some drivers uses page pool, but not require to allocate pages from
> >> bound node, or simply assign pool.p.nid to NUMA_NO_NODE, and the
> >> commit d5394610b1ba ("page_pool:
> >> Don't recycle non-reusable pages") will block this kind of driver to
> >> recycle
> >>
> >> so take page as reusable when page belongs to current memory node if
> >> nid is NUMA_NO_NODE
> >>
> >> v1-->v2: add check with numa_mem_id from Yunsheng
> >>
> >> Fixes: d5394610b1ba ("page_pool: Don't recycle non-reusable pages")
> >> Signed-off-by: Li RongQing <lirongqing@baidu.com>
> >> Suggested-by: Yunsheng Lin <linyunsheng@huawei.com>
> >> Cc: Saeed Mahameed <saeedm@mellanox.com>
> >> ---
> >>  net/core/page_pool.c | 7 ++++++-
> >>  1 file changed, 6 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/net/core/page_pool.c b/net/core/page_pool.c index
> >> a6aefe989043..3c8b51ccd1c1 100644
> >> --- a/net/core/page_pool.c
> >> +++ b/net/core/page_pool.c
> >> @@ -312,12 +312,17 @@ static bool __page_pool_recycle_direct(struct
> >> page *page,
> >>  /* page is NOT reusable when:
> >>   * 1) allocated when system is under some pressure.
> >> (page_is_pfmemalloc)
> >>   * 2) belongs to a different NUMA node than pool->p.nid.
> >> + * 3) belongs to a different memory node than current context
> >> + * if pool->p.nid is NUMA_NO_NODE
> >>   *
> >>   * To update pool->p.nid users must call page_pool_update_nid.
> >>   */
> >>  static bool pool_page_reusable(struct page_pool *pool, struct page
> >> *page)
> >>  {
> >> -	return !page_is_pfmemalloc(page) && page_to_nid(page) == pool-
> >>> p.nid;
> >> +	return !page_is_pfmemalloc(page) &&
> >> +		(page_to_nid(page) == pool->p.nid ||
> >> +		(pool->p.nid == NUMA_NO_NODE &&
> >> +		page_to_nid(page) == numa_mem_id()));
> >>  }
> >>
> >
> > Cc'ed Jesper, Ilias & Jonathan.
> >
> > I don't think it is correct to check that the page nid is same as
> > numa_mem_id() if pool is NUMA_NO_NODE. In such case we should allow
> > all pages to recycle, because you can't assume where pages are
> > allocated from and where they are being handled.
> >
> > I suggest the following:
> >
> > return !page_pfmemalloc() &&
> > ( page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE );
> >
> > 1) never recycle emergency pages, regardless of pool nid.
> > 2) always recycle if pool is NUMA_NO_NODE.
> 
> As I can see, below are the cases that the pool->p.nid could be
> NUMA_NO_NODE:
> 
> 1. kernel built with the CONFIG_NUMA being off.
> 
> 2. kernel built with the CONFIG_NUMA being on, but FW/BIOS dose not provide
>    a valid node id through ACPI/DT, and it has the below cases:
> 
>    a). the hardware itself is single numa node system, so maybe it is valid
>        to not provide a valid node for the device.
>    b). the hardware itself is multi numa nodes system, and the FW/BIOS is
>        broken that it does not provide a valid one.
> 
> 3. kernel built with the CONFIG_NUMA being on, and FW/BIOS dose provide a
>    valid node id through ACPI/DT, but the driver does not pass the valid
>    node id when calling page_pool_init().
> 
> I am not sure which case this patch is trying to fix, maybe Rongqing can help to
> clarify.
> 
> For case 1 and case 2 (a), I suppose checking pool->p.nid being
> NUMA_NO_NODE is enough.
> 
> For case 2 (b), There may be argument that it should be fixed in the BIOS/FW,
> But it also can be argued that the numa_mem_id() checking has been done in
> the driver that has not using page pool yet when deciding whether to do
> recycling, see [1]. If I understanding correctly, recycling is normally done at the
> NAPI pooling, which has the same affinity as the rx interrupt, and rx interrupt is
> not changed very often. If it does change to different memory node, maybe it
> makes sense not to recycle the old page belongs to old node?
> 
> 
> For case 3, I am not sure if any driver is doing that, and if the page pool API
> even allow that?
> 

I think pool_page_reusable() should support NUMA_NO_NODE no matter which case applies.


And recycling is normally done in NAPI polling; NUMA_NO_NODE hints to use the
local node, except when memory usage is unbalanced, so I added the check that the
page nid is the same as numa_mem_id() if the pool is NUMA_NO_NODE

-Li


> [1] https://elixir.bootlin.com/linux/latest/ident/numa_mem_id
> 
> >
> > the above change should not add any overhead, a modest branch
> > predictor will handle this with no effort.
> >
> > Jesper et al. what do you think?
> >
> > -Saeed.
> >
Ilias Apalodimas Dec. 9, 2019, 9:30 a.m. UTC | #4
On Mon, Dec 09, 2019 at 03:47:50AM +0000, Li,Rongqing wrote:
> > >
[...]
> > > Cc'ed Jesper, Ilias & Jonathan.
> > >
> > > I don't think it is correct to check that the page nid is same as
> > > numa_mem_id() if pool is NUMA_NO_NODE. In such case we should allow
> > > all pages to recycle, because you can't assume where pages are
> > > allocated from and where they are being handled.
> > >
> > > I suggest the following:
> > >
> > > return !page_pfmemalloc() &&
> > > ( page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE );
> > >
> > > 1) never recycle emergency pages, regardless of pool nid.
> > > 2) always recycle if pool is NUMA_NO_NODE.
> > 
> > As I can see, below are the cases that the pool->p.nid could be
> > NUMA_NO_NODE:
> > 
> > 1. kernel built with the CONFIG_NUMA being off.
> > 
> > 2. kernel built with the CONFIG_NUMA being on, but FW/BIOS dose not provide
> >    a valid node id through ACPI/DT, and it has the below cases:
> > 
> >    a). the hardware itself is single numa node system, so maybe it is valid
> >        to not provide a valid node for the device.
> >    b). the hardware itself is multi numa nodes system, and the FW/BIOS is
> >        broken that it does not provide a valid one.
> > 
> > 3. kernel built with the CONFIG_NUMA being on, and FW/BIOS dose provide a
> >    valid node id through ACPI/DT, but the driver does not pass the valid
> >    node id when calling page_pool_init().
> > 
> > I am not sure which case this patch is trying to fix, maybe Rongqing can help to
> > clarify.
> > 
> > For case 1 and case 2 (a), I suppose checking pool->p.nid being
> > NUMA_NO_NODE is enough.
> > 
> > For case 2 (b), There may be argument that it should be fixed in the BIOS/FW,
> > But it also can be argued that the numa_mem_id() checking has been done in
> > the driver that has not using page pool yet when deciding whether to do
> > recycling, see [1]. If I understanding correctly, recycling is normally done at the
> > NAPI pooling, which has the same affinity as the rx interrupt, and rx interrupt is
> > not changed very often. If it does change to different memory node, maybe it
> > makes sense not to recycle the old page belongs to old node?
> > 
> > 
> > For case 3, I am not sure if any driver is doing that, and if the page pool API
> > even allow that?
> > 
> 
> I think pool_page_reusable should support NUMA_NO_NODE no matter which cases
> 

Yes 

> 
> And recycling is normally done at the NAPI pooling, NUMA_NO_NODE hint to use the
> local node, except memory usage is unbalance, so I add the check that the page nid is
> same as numa_mem_id() if pool is NUMA_NO_NODE

I am not sure I am following here.
Whether recycling is done in NAPI or not should have nothing to do with NUMA.
What do you mean by 'memory usage is unbalanced'?


Thanks
/Ilias
Li RongQing Dec. 9, 2019, 10:37 a.m. UTC | #5
> -----Original Message-----
> From: Ilias Apalodimas [mailto:ilias.apalodimas@linaro.org]
> Sent: December 9, 2019 17:30
> To: Li,Rongqing <lirongqing@baidu.com>
> Cc: Yunsheng Lin <linyunsheng@huawei.com>; Saeed Mahameed
> <saeedm@mellanox.com>; jonathan.lemon@gmail.com;
> netdev@vger.kernel.org; brouer@redhat.com; ivan.khoronzhuk@linaro.org;
> grygorii.strashko@ti.com
> Subject: Re: Re: [PATCH][v2] page_pool: handle page recycle for
> NUMA_NO_NODE condition
> 
> On Mon, Dec 09, 2019 at 03:47:50AM +0000, Li,Rongqing wrote:
> > > >
> [...]
> > > > Cc'ed Jesper, Ilias & Jonathan.
> > > >
> > > > I don't think it is correct to check that the page nid is same as
> > > > numa_mem_id() if pool is NUMA_NO_NODE. In such case we should
> > > > allow all pages to recycle, because you can't assume where pages
> > > > are allocated from and where they are being handled.
> > > >
> > > > I suggest the following:
> > > >
> > > > return !page_pfmemalloc() &&
> > > > ( page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE
> > > > );
> > > >
> > > > 1) never recycle emergency pages, regardless of pool nid.
> > > > 2) always recycle if pool is NUMA_NO_NODE.
> > >
> > > As I can see, below are the cases that the pool->p.nid could be
> > > NUMA_NO_NODE:
> > >
> > > 1. kernel built with the CONFIG_NUMA being off.
> > >
> > > 2. kernel built with the CONFIG_NUMA being on, but FW/BIOS dose not
> provide
> > >    a valid node id through ACPI/DT, and it has the below cases:
> > >
> > >    a). the hardware itself is single numa node system, so maybe it is valid
> > >        to not provide a valid node for the device.
> > >    b). the hardware itself is multi numa nodes system, and the FW/BIOS is
> > >        broken that it does not provide a valid one.
> > >
> > > 3. kernel built with the CONFIG_NUMA being on, and FW/BIOS dose provide
> a
> > >    valid node id through ACPI/DT, but the driver does not pass the valid
> > >    node id when calling page_pool_init().
> > >
> > > I am not sure which case this patch is trying to fix, maybe Rongqing
> > > can help to clarify.
> > >
> > > For case 1 and case 2 (a), I suppose checking pool->p.nid being
> > > NUMA_NO_NODE is enough.
> > >
> > > For case 2 (b), There may be argument that it should be fixed in the
> > > BIOS/FW, But it also can be argued that the numa_mem_id() checking
> > > has been done in the driver that has not using page pool yet when
> > > deciding whether to do recycling, see [1]. If I understanding
> > > correctly, recycling is normally done at the NAPI pooling, which has
> > > the same affinity as the rx interrupt, and rx interrupt is not
> > > changed very often. If it does change to different memory node, maybe it
> makes sense not to recycle the old page belongs to old node?
> > >
> > >
> > > For case 3, I am not sure if any driver is doing that, and if the
> > > page pool API even allow that?
> > >
> >
> > I think pool_page_reusable should support NUMA_NO_NODE no matter
> which
> > cases
> >
> 
> Yes
> 
> >
> > And recycling is normally done at the NAPI pooling, NUMA_NO_NODE hint
> > to use the local node, except memory usage is unbalance, so I add the
> > check that the page nid is same as numa_mem_id() if pool is
> > NUMA_NO_NODE
> 
> I am not sure i am following here.
> Recycling done at NAPI or not should have nothing to do with NUMA.
> What do you mean by 'memory usage is unbalance'?
> 
> 

I am saying there is a possible case where alloc_pages_node(NUMA_NO_NODE) cannot
allocate a local page, which would cause pool_page_reusable() to always return false
if we add the check that the page nid is the same as numa_mem_id()

it should rarely happen

-Li
> Thanks
> /Ilias
Jesper Dangaard Brouer Dec. 9, 2019, 12:14 p.m. UTC | #6
On Sat, 7 Dec 2019 03:52:41 +0000
Saeed Mahameed <saeedm@mellanox.com> wrote:

> On Fri, 2019-12-06 at 17:32 +0800, Li RongQing wrote:
> > some drivers uses page pool, but not require to allocate
> > pages from bound node, or simply assign pool.p.nid to
> > NUMA_NO_NODE, and the commit d5394610b1ba ("page_pool:
> > Don't recycle non-reusable pages") will block this kind
> > of driver to recycle
> > 
> > so take page as reusable when page belongs to current
> > memory node if nid is NUMA_NO_NODE
> > 
> > v1-->v2: add check with numa_mem_id from Yunsheng
> > 
> > Fixes: d5394610b1ba ("page_pool: Don't recycle non-reusable pages")
> > Signed-off-by: Li RongQing <lirongqing@baidu.com>
> > Suggested-by: Yunsheng Lin <linyunsheng@huawei.com>
> > Cc: Saeed Mahameed <saeedm@mellanox.com>
> > ---
> >  net/core/page_pool.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > index a6aefe989043..3c8b51ccd1c1 100644
> > --- a/net/core/page_pool.c
> > +++ b/net/core/page_pool.c
> > @@ -312,12 +312,17 @@ static bool __page_pool_recycle_direct(struct
> > page *page,
> >  /* page is NOT reusable when:
> >   * 1) allocated when system is under some pressure.
> > (page_is_pfmemalloc)
> >   * 2) belongs to a different NUMA node than pool->p.nid.
> > + * 3) belongs to a different memory node than current context
> > + * if pool->p.nid is NUMA_NO_NODE
> >   *
> >   * To update pool->p.nid users must call page_pool_update_nid.
> >   */
> >  static bool pool_page_reusable(struct page_pool *pool, struct page
> > *page)
> >  {
> > -	return !page_is_pfmemalloc(page) && page_to_nid(page) == pool-  
> > >p.nid;  
> > +	return !page_is_pfmemalloc(page) &&
> > +		(page_to_nid(page) == pool->p.nid ||
> > +		(pool->p.nid == NUMA_NO_NODE &&
> > +		page_to_nid(page) == numa_mem_id()));
> >  }
> >    
> 
> Cc'ed Jesper, Ilias & Jonathan.
> 
> I don't think it is correct to check that the page nid is same as
> numa_mem_id() if pool is NUMA_NO_NODE. In such case we should allow all
> pages to recycle, because you can't assume where pages are allocated
> from and where they are being handled.
> 
> I suggest the following:
> 
> return !page_pfmemalloc() && 
> ( page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE );
> 
> 1) never recycle emergency pages, regardless of pool nid.
> 2) always recycle if pool is NUMA_NO_NODE.
> 
> the above change should not add any overhead, a modest branch predictor
> will handle this with no effort.
> 
> Jesper et al. what do you think?

The patch description doesn't explain the problem very well.

Let's first establish what the problem is.  After I took a closer look,
I do think we have a real problem here...

If function alloc_pages_node() is called with NUMA_NO_NODE (see below
signature), then the nid is re-assigned to numa_mem_id().

Our current code checks: page_to_nid(page) == pool->p.nid which seems
bogus, as pool->p.nid=NUMA_NO_NODE and the page NID will not return
NUMA_NO_NODE... as it was set to the locally detected NUMA node, right?

So, we do need a fix... but the question is what semantics do we want?
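For reference, alloc_pages_node() roughly does the following (paraphrased
from include/linux/gfp.h; details may differ between kernel versions):

static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
					    unsigned int order)
{
	if (nid == NUMA_NO_NODE)
		nid = numa_mem_id();	/* NUMA_NO_NODE falls back to the local node */

	return __alloc_pages_node(nid, gfp_mask, order);
}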
Saeed Mahameed Dec. 9, 2019, 11:34 p.m. UTC | #7
On Mon, 2019-12-09 at 13:14 +0100, Jesper Dangaard Brouer wrote:
> On Sat, 7 Dec 2019 03:52:41 +0000
> Saeed Mahameed <saeedm@mellanox.com> wrote:
> 
> > On Fri, 2019-12-06 at 17:32 +0800, Li RongQing wrote:
> > > some drivers uses page pool, but not require to allocate
> > > pages from bound node, or simply assign pool.p.nid to
> > > NUMA_NO_NODE, and the commit d5394610b1ba ("page_pool:
> > > Don't recycle non-reusable pages") will block this kind
> > > of driver to recycle
> > > 
> > > so take page as reusable when page belongs to current
> > > memory node if nid is NUMA_NO_NODE
> > > 
> > > v1-->v2: add check with numa_mem_id from Yunsheng
> > > 
> > > Fixes: d5394610b1ba ("page_pool: Don't recycle non-reusable
> > > pages")
> > > Signed-off-by: Li RongQing <lirongqing@baidu.com>
> > > Suggested-by: Yunsheng Lin <linyunsheng@huawei.com>
> > > Cc: Saeed Mahameed <saeedm@mellanox.com>
> > > ---
> > >  net/core/page_pool.c | 7 ++++++-
> > >  1 file changed, 6 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > > index a6aefe989043..3c8b51ccd1c1 100644
> > > --- a/net/core/page_pool.c
> > > +++ b/net/core/page_pool.c
> > > @@ -312,12 +312,17 @@ static bool
> > > __page_pool_recycle_direct(struct
> > > page *page,
> > >  /* page is NOT reusable when:
> > >   * 1) allocated when system is under some pressure.
> > > (page_is_pfmemalloc)
> > >   * 2) belongs to a different NUMA node than pool->p.nid.
> > > + * 3) belongs to a different memory node than current context
> > > + * if pool->p.nid is NUMA_NO_NODE
> > >   *
> > >   * To update pool->p.nid users must call page_pool_update_nid.
> > >   */
> > >  static bool pool_page_reusable(struct page_pool *pool, struct
> > > page
> > > *page)
> > >  {
> > > -	return !page_is_pfmemalloc(page) && page_to_nid(page) ==
> > > pool-  
> > > > p.nid;  
> > > +	return !page_is_pfmemalloc(page) &&
> > > +		(page_to_nid(page) == pool->p.nid ||
> > > +		(pool->p.nid == NUMA_NO_NODE &&
> > > +		page_to_nid(page) == numa_mem_id()));
> > >  }
> > >    
> > 
> > Cc'ed Jesper, Ilias & Jonathan.
> > 
> > I don't think it is correct to check that the page nid is same as
> > numa_mem_id() if pool is NUMA_NO_NODE. In such case we should allow
> > all
> > pages to recycle, because you can't assume where pages are
> > allocated
> > from and where they are being handled.
> > 
> > I suggest the following:
> > 
> > return !page_pfmemalloc() && 
> > ( page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE
> > );
> > 
> > 1) never recycle emergency pages, regardless of pool nid.
> > 2) always recycle if pool is NUMA_NO_NODE.
> > 
> > the above change should not add any overhead, a modest branch
> > predictor
> > will handle this with no effort.
> > 
> > Jesper et al. what do you think?
> 
> The patch description doesn't explain the problem very well.
> 
> Lets first establish what the problem is.  After I took at closer
> look,
> I do think we have a real problem here...
> 
> If function alloc_pages_node() is called with NUMA_NO_NODE (see below
> signature), then the nid is re-assigned to numa_mem_id().
> 
> Our current code checks: page_to_nid(page) == pool->p.nid which seems
> bogus, as pool->p.nid=NUMA_NO_NODE and the page NID will not return
> NUMA_NO_NODE... as it was set to the local detect numa node, right?
> 

right.

> So, we do need a fix... but the question is that semantics do we
> want?
> 

maybe assume that __page_pool_recycle_direct() is always called from
the right node and change the current bogus check:

from:
page_to_nid(page) == pool->p.nid 

to:
page_to_nid(page) == numa_mem_id()

This will allow recycling only if the handling node is the same as where
the page was allocated, regardless of pool->p.nid.

so semantics are:

1) allocate from: pool->p.nid, as chosen by the user.
2) recycle when: page_to_nid(page) == numa_mem_id().
3) the pool user must guarantee that the handler will run on the right
node, which should always be the case; otherwise recycling will be
skipped (no cross-NUMA recycling).


a) if the pool migrates, we will stop recycling until the pool moves
back to the original node, or the user calls pool_update_nid() as we do in
mlx5.
b) if the pool is NUMA_NO_NODE, then allocation and handling will be done
on numa_mem_id(), which means the above check will work.
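In diff form, a rough sketch of that change (illustration only, not a
tested patch):

-	return !page_is_pfmemalloc(page) && page_to_nid(page) == pool->p.nid;
+	return !page_is_pfmemalloc(page) && page_to_nid(page) == numa_mem_id();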

Thanks,
Saeed.
Yunsheng Lin Dec. 10, 2019, 1:31 a.m. UTC | #8
On 2019/12/10 7:34, Saeed Mahameed wrote:
> On Mon, 2019-12-09 at 13:14 +0100, Jesper Dangaard Brouer wrote:
>> On Sat, 7 Dec 2019 03:52:41 +0000
>> Saeed Mahameed <saeedm@mellanox.com> wrote:
>>
>>> On Fri, 2019-12-06 at 17:32 +0800, Li RongQing wrote:
>>>> some drivers uses page pool, but not require to allocate
>>>> pages from bound node, or simply assign pool.p.nid to
>>>> NUMA_NO_NODE, and the commit d5394610b1ba ("page_pool:
>>>> Don't recycle non-reusable pages") will block this kind
>>>> of driver to recycle
>>>>
>>>> so take page as reusable when page belongs to current
>>>> memory node if nid is NUMA_NO_NODE
>>>>
>>>> v1-->v2: add check with numa_mem_id from Yunsheng
>>>>
>>>> Fixes: d5394610b1ba ("page_pool: Don't recycle non-reusable
>>>> pages")
>>>> Signed-off-by: Li RongQing <lirongqing@baidu.com>
>>>> Suggested-by: Yunsheng Lin <linyunsheng@huawei.com>
>>>> Cc: Saeed Mahameed <saeedm@mellanox.com>
>>>> ---
>>>>  net/core/page_pool.c | 7 ++++++-
>>>>  1 file changed, 6 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>>>> index a6aefe989043..3c8b51ccd1c1 100644
>>>> --- a/net/core/page_pool.c
>>>> +++ b/net/core/page_pool.c
>>>> @@ -312,12 +312,17 @@ static bool
>>>> __page_pool_recycle_direct(struct
>>>> page *page,
>>>>  /* page is NOT reusable when:
>>>>   * 1) allocated when system is under some pressure.
>>>> (page_is_pfmemalloc)
>>>>   * 2) belongs to a different NUMA node than pool->p.nid.
>>>> + * 3) belongs to a different memory node than current context
>>>> + * if pool->p.nid is NUMA_NO_NODE
>>>>   *
>>>>   * To update pool->p.nid users must call page_pool_update_nid.
>>>>   */
>>>>  static bool pool_page_reusable(struct page_pool *pool, struct
>>>> page
>>>> *page)
>>>>  {
>>>> -	return !page_is_pfmemalloc(page) && page_to_nid(page) ==
>>>> pool-  
>>>>> p.nid;  
>>>> +	return !page_is_pfmemalloc(page) &&
>>>> +		(page_to_nid(page) == pool->p.nid ||
>>>> +		(pool->p.nid == NUMA_NO_NODE &&
>>>> +		page_to_nid(page) == numa_mem_id()));
>>>>  }
>>>>    
>>>
>>> Cc'ed Jesper, Ilias & Jonathan.
>>>
>>> I don't think it is correct to check that the page nid is same as
>>> numa_mem_id() if pool is NUMA_NO_NODE. In such case we should allow
>>> all
>>> pages to recycle, because you can't assume where pages are
>>> allocated
>>> from and where they are being handled.
>>>
>>> I suggest the following:
>>>
>>> return !page_pfmemalloc() && 
>>> ( page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE
>>> );
>>>
>>> 1) never recycle emergency pages, regardless of pool nid.
>>> 2) always recycle if pool is NUMA_NO_NODE.
>>>
>>> the above change should not add any overhead, a modest branch
>>> predictor
>>> will handle this with no effort.
>>>
>>> Jesper et al. what do you think?
>>
>> The patch description doesn't explain the problem very well.
>>
>> Lets first establish what the problem is.  After I took at closer
>> look,
>> I do think we have a real problem here...
>>
>> If function alloc_pages_node() is called with NUMA_NO_NODE (see below
>> signature), then the nid is re-assigned to numa_mem_id().
>>
>> Our current code checks: page_to_nid(page) == pool->p.nid which seems
>> bogus, as pool->p.nid=NUMA_NO_NODE and the page NID will not return
>> NUMA_NO_NODE... as it was set to the local detect numa node, right?
>>
> 
> right.
> 
>> So, we do need a fix... but the question is that semantics do we
>> want?
>>
> 
> maybe assume that __page_pool_recycle_direct() is always called from
> the right node and change the current bogus check:
> 
> from:
> page_to_nid(page) == pool->p.nid 
> 
> to:
> page_to_nid(page) == numa_mem_id()
> 
> This will allow recycling only if handling node is the same as where
> the page was allocated, regardless of pool->p.nid.
> 
> so semantics are:
> 
> 1) allocate from: pool->p.nid, as chosen by user.
> 2) recycle when: page_to_nid(page) == numa_mem_id().
> 3) pool user must guarantee that the handler will run on the right
> node. which should always be the case. otherwise recycling will be
> skipped (no cross numa recycling).
> 
> 
> a) if the pool migrates, we will stop recycling until the pool moves
> back to original node, or user calls pool_update_nid() as we do in
> mlx5.
> b) if pool is NUMA_NO_NODE, then allocation and handling will be done
> on numa_mem_id(), which means the above check will work.


Only checking page_to_nid(page) == numa_mem_id() may not work for the below
case in mvneta.c:

static int mvneta_create_page_pool(struct mvneta_port *pp,
				   struct mvneta_rx_queue *rxq, int size)
{
	struct bpf_prog *xdp_prog = READ_ONCE(pp->xdp_prog);
	struct page_pool_params pp_params = {
		.order = 0,
		.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
		.pool_size = size,
		.nid = cpu_to_node(0),
		.dev = pp->dev->dev.parent,
		.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE,
		.offset = pp->rx_offset_correction,
		.max_len = MVNETA_MAX_RX_BUF_SIZE,
	};

Here pool->p.nid is not NUMA_NO_NODE, so the node of a page allocated for rx
may not be numa_mem_id() when running the NAPI polling, because pool->p.nid
is not the same as the node of the cpu running the NAPI polling.

Does the page pool support recycling for the above case?

Or do we "fix" the above case by setting pool->p.nid to NUMA_NO_NODE/dev_to_node(),
or by calling pool_update_nid() in NAPI polling as mlx5 does?


> 
> Thanks,
> Saeed.
> 
> 
> 
> 
> 
> 
>
Ilias Apalodimas Dec. 10, 2019, 3:02 p.m. UTC | #9
Hi Saeed,

> > 
> > The patch description doesn't explain the problem very well.
> > 
> > Lets first establish what the problem is.  After I took at closer
> > look,
> > I do think we have a real problem here...
> > 
> > If function alloc_pages_node() is called with NUMA_NO_NODE (see below
> > signature), then the nid is re-assigned to numa_mem_id().
> > 
> > Our current code checks: page_to_nid(page) == pool->p.nid which seems
> > bogus, as pool->p.nid=NUMA_NO_NODE and the page NID will not return
> > NUMA_NO_NODE... as it was set to the local detect numa node, right?
> > 
> 
> right.
> 
> > So, we do need a fix... but the question is that semantics do we
> > want?
> > 
> 
> maybe assume that __page_pool_recycle_direct() is always called from
> the right node and change the current bogus check:

Is this a typo? pool_page_reusable() is called from __page_pool_put_page().

page_pool_put_page() and page_pool_recycle_direct() (no underscores) call that.
Can we guarantee that those will always run from the correct cpu?
In the current code base, if they are only called under NAPI, this might be true.
In the page_pool skb recycling patches though (yes, we'll eventually send those
:)), this is called from kfree_skb().
I don't think we can get such a guarantee there, right?

Regards
/Ilias
Saeed Mahameed Dec. 10, 2019, 7:45 p.m. UTC | #10
On Tue, 2019-12-10 at 09:31 +0800, Yunsheng Lin wrote:
> On 2019/12/10 7:34, Saeed Mahameed wrote:
> > On Mon, 2019-12-09 at 13:14 +0100, Jesper Dangaard Brouer wrote:
> > > On Sat, 7 Dec 2019 03:52:41 +0000
> > > Saeed Mahameed <saeedm@mellanox.com> wrote:
> > > 
> > > > On Fri, 2019-12-06 at 17:32 +0800, Li RongQing wrote:
> > > > > some drivers uses page pool, but not require to allocate
> > > > > pages from bound node, or simply assign pool.p.nid to
> > > > > NUMA_NO_NODE, and the commit d5394610b1ba ("page_pool:
> > > > > Don't recycle non-reusable pages") will block this kind
> > > > > of driver to recycle
> > > > > 
> > > > > so take page as reusable when page belongs to current
> > > > > memory node if nid is NUMA_NO_NODE
> > > > > 
> > > > > v1-->v2: add check with numa_mem_id from Yunsheng
> > > > > 
> > > > > Fixes: d5394610b1ba ("page_pool: Don't recycle non-reusable
> > > > > pages")
> > > > > Signed-off-by: Li RongQing <lirongqing@baidu.com>
> > > > > Suggested-by: Yunsheng Lin <linyunsheng@huawei.com>
> > > > > Cc: Saeed Mahameed <saeedm@mellanox.com>
> > > > > ---
> > > > >  net/core/page_pool.c | 7 ++++++-
> > > > >  1 file changed, 6 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > > > > index a6aefe989043..3c8b51ccd1c1 100644
> > > > > --- a/net/core/page_pool.c
> > > > > +++ b/net/core/page_pool.c
> > > > > @@ -312,12 +312,17 @@ static bool
> > > > > __page_pool_recycle_direct(struct
> > > > > page *page,
> > > > >  /* page is NOT reusable when:
> > > > >   * 1) allocated when system is under some pressure.
> > > > > (page_is_pfmemalloc)
> > > > >   * 2) belongs to a different NUMA node than pool->p.nid.
> > > > > + * 3) belongs to a different memory node than current
> > > > > context
> > > > > + * if pool->p.nid is NUMA_NO_NODE
> > > > >   *
> > > > >   * To update pool->p.nid users must call
> > > > > page_pool_update_nid.
> > > > >   */
> > > > >  static bool pool_page_reusable(struct page_pool *pool,
> > > > > struct
> > > > > page
> > > > > *page)
> > > > >  {
> > > > > -	return !page_is_pfmemalloc(page) && page_to_nid(page)
> > > > > ==
> > > > > pool-  
> > > > > > p.nid;  
> > > > > +	return !page_is_pfmemalloc(page) &&
> > > > > +		(page_to_nid(page) == pool->p.nid ||
> > > > > +		(pool->p.nid == NUMA_NO_NODE &&
> > > > > +		page_to_nid(page) == numa_mem_id()));
> > > > >  }
> > > > >    
> > > > 
> > > > Cc'ed Jesper, Ilias & Jonathan.
> > > > 
> > > > I don't think it is correct to check that the page nid is same
> > > > as
> > > > numa_mem_id() if pool is NUMA_NO_NODE. In such case we should
> > > > allow
> > > > all
> > > > pages to recycle, because you can't assume where pages are
> > > > allocated
> > > > from and where they are being handled.
> > > > 
> > > > I suggest the following:
> > > > 
> > > > return !page_pfmemalloc() && 
> > > > ( page_to_nid(page) == pool->p.nid || pool->p.nid ==
> > > > NUMA_NO_NODE
> > > > );
> > > > 
> > > > 1) never recycle emergency pages, regardless of pool nid.
> > > > 2) always recycle if pool is NUMA_NO_NODE.
> > > > 
> > > > the above change should not add any overhead, a modest branch
> > > > predictor
> > > > will handle this with no effort.
> > > > 
> > > > Jesper et al. what do you think?
> > > 
> > > The patch description doesn't explain the problem very well.
> > > 
> > > Lets first establish what the problem is.  After I took at closer
> > > look,
> > > I do think we have a real problem here...
> > > 
> > > If function alloc_pages_node() is called with NUMA_NO_NODE (see
> > > below
> > > signature), then the nid is re-assigned to numa_mem_id().
> > > 
> > > Our current code checks: page_to_nid(page) == pool->p.nid which
> > > seems
> > > bogus, as pool->p.nid=NUMA_NO_NODE and the page NID will not
> > > return
> > > NUMA_NO_NODE... as it was set to the local detect numa node,
> > > right?
> > > 
> > 
> > right.
> > 
> > > So, we do need a fix... but the question is that semantics do we
> > > want?
> > > 
> > 
> > maybe assume that __page_pool_recycle_direct() is always called
> > from
> > the right node and change the current bogus check:
> > 
> > from:
> > page_to_nid(page) == pool->p.nid 
> > 
> > to:
> > page_to_nid(page) == numa_mem_id()
> > 
> > This will allow recycling only if handling node is the same as
> > where
> > the page was allocated, regardless of pool->p.nid.
> > 
> > so semantics are:
> > 
> > 1) allocate from: pool->p.nid, as chosen by user.
> > 2) recycle when: page_to_nid(page) == numa_mem_id().
> > 3) pool user must guarantee that the handler will run on the right
> > node. which should always be the case. otherwise recycling will be
> > skipped (no cross numa recycling).
> > 
> > 
> > a) if the pool migrates, we will stop recycling until the pool
> > moves
> > back to original node, or user calls pool_update_nid() as we do in
> > mlx5.
> > b) if pool is NUMA_NO_NODE, then allocation and handling will be
> > done
> > on numa_mem_id(), which means the above check will work.
> 
> Only checking page_to_nid(page) == numa_mem_id() may not work for the
> below
> case in mvneta.c:
> 
> static int mvneta_create_page_pool(struct mvneta_port *pp,
> 				   struct mvneta_rx_queue *rxq, int
> size)
> {
> 	struct bpf_prog *xdp_prog = READ_ONCE(pp->xdp_prog);
> 	struct page_pool_params pp_params = {
> 		.order = 0,
> 		.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
> 		.pool_size = size,
> 		.nid = cpu_to_node(0),
> 		.dev = pp->dev->dev.parent,
> 		.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL :
> DMA_FROM_DEVICE,
> 		.offset = pp->rx_offset_correction,
> 		.max_len = MVNETA_MAX_RX_BUF_SIZE,
> 	};
> 
> the pool->p.nid is not NUMA_NO_NODE, then the node of page allocated
> for rx
> may not be numa_mem_id() when running in the NAPI polling, because
> pool->p.nid
> is not the same as the node of cpu running in the NAPI polling.
> 
> Does the page pool support recycling for above case?
> 

I don't think you want to allow cross numa recycling.

> Or we "fix' the above case by setting pool->p.nid to
> NUMA_NO_NODE/dev_to_node(),
> or by calling pool_update_nid() in NAPI polling as mlx5 does?
> 

Yes, just update the nid when needed, and make sure the NAPI polling runs on
a consistent core; eventually allocation/recycling will happen on the
same core.
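As an illustration of that pattern, a rough sketch of a driver NAPI poll
calling page_pool_nid_changed() the way mlx5 does (struct example_channel
and example_rx_poll() are made-up placeholders):

static int example_napi_poll(struct napi_struct *napi, int budget)
{
	struct example_channel *ch =
		container_of(napi, struct example_channel, napi);

	/* cheap no-op when the nid already matches; otherwise the pool's
	 * nid follows the node of the CPU the NAPI poll migrated to
	 */
	page_pool_nid_changed(ch->page_pool, numa_mem_id());

	return example_rx_poll(ch, budget);
}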

> 
> > Thanks,
> > Saeed.
> > 
> > 
> > 
> > 
> > 
> > 
> >
Saeed Mahameed Dec. 10, 2019, 8:02 p.m. UTC | #11
On Tue, Dec 10, 2019 at 7:02 AM Ilias Apalodimas
<ilias.apalodimas@linaro.org> wrote:
>
> Hi Saeed,
>
> > >
> > > The patch description doesn't explain the problem very well.
> > >
> > > Lets first establish what the problem is.  After I took at closer
> > > look,
> > > I do think we have a real problem here...
> > >
> > > If function alloc_pages_node() is called with NUMA_NO_NODE (see below
> > > signature), then the nid is re-assigned to numa_mem_id().
> > >
> > > Our current code checks: page_to_nid(page) == pool->p.nid which seems
> > > bogus, as pool->p.nid=NUMA_NO_NODE and the page NID will not return
> > > NUMA_NO_NODE... as it was set to the local detect numa node, right?
> > >
> >
> > right.
> >
> > > So, we do need a fix... but the question is that semantics do we
> > > want?
> > >
> >
> > maybe assume that __page_pool_recycle_direct() is always called from
> > the right node and change the current bogus check:
>
> Is this a typo? pool_page_reusable() is called from __page_pool_put_page().
>
> page_pool_put_page and page_pool_recycle_direct() (no underscores) call that.

Yes, a typo :), thanks for the correction.

> Can we guarantee that those will always run from the correct cpu?
No, but we add the tool to correct any discrepancy: page_pool_nid_changed()

> In the current code base if they are only called under NAPI this might be true.
> On the page_pool skb recycling patches though (yes we'll eventually send those
> :)) this is called from kfree_skb().
> I don't think we can get such a guarantee there, right?
>

Yes, but this has nothing to do with page recycling at the pool owner's
level (driver NAPI).
For SKB recycling we can use pool.nid to recycle, and not numa_mem_id().

> Regards
> /Ilias
Ilias Apalodimas Dec. 10, 2019, 8:10 p.m. UTC | #12
Hi Saeed,

On Tue, Dec 10, 2019 at 12:02:12PM -0800, Saeed Mahameed wrote:
> On Tue, Dec 10, 2019 at 7:02 AM Ilias Apalodimas
> <ilias.apalodimas@linaro.org> wrote:
> >
> > Hi Saeed,
> >
> > > >
> > > > The patch description doesn't explain the problem very well.
> > > >
> > > > Lets first establish what the problem is.  After I took at closer
> > > > look,
> > > > I do think we have a real problem here...
> > > >
> > > > If function alloc_pages_node() is called with NUMA_NO_NODE (see below
> > > > signature), then the nid is re-assigned to numa_mem_id().
> > > >
> > > > Our current code checks: page_to_nid(page) == pool->p.nid which seems
> > > > bogus, as pool->p.nid=NUMA_NO_NODE and the page NID will not return
> > > > NUMA_NO_NODE... as it was set to the local detect numa node, right?
> > > >
> > >
> > > right.
> > >
> > > > So, we do need a fix... but the question is that semantics do we
> > > > want?
> > > >
> > >
> > > maybe assume that __page_pool_recycle_direct() is always called from
> > > the right node and change the current bogus check:
> >
> > Is this a typo? pool_page_reusable() is called from __page_pool_put_page().
> >
> > page_pool_put_page and page_pool_recycle_direct() (no underscores) call that.
> 
> Yes a typo :) , thanks for the correction.
> 
> > Can we guarantee that those will always run from the correct cpu?
> No, but we add the tool to correct any discrepancy: page_pool_nid_changed()
> 
> > In the current code base if they are only called under NAPI this might be true.
> > On the page_pool skb recycling patches though (yes we'll eventually send those
> > :)) this is called from kfree_skb().
> > I don't think we can get such a guarantee there, right?
> >
> 
> Yes, but this has nothing to do with page recycling from pool's owner
> level (driver napi)
>  for SKB recycling we can use pool.nid to recycle, and not numa_mem_id().

Right, I responded to an email without the proper context!
Let me try again. You suggested changing the check
from page_to_nid(page) == pool->p.nid to page_to_nid(page) == numa_mem_id().

Since the skb recycling code is using page_pool_put_page(), won't that break the
recycling for that patchset?

Thanks
/Ilias
Yunsheng Lin Dec. 11, 2019, 3:01 a.m. UTC | #13
On 2019/12/11 3:45, Saeed Mahameed wrote:
>>> maybe assume that __page_pool_recycle_direct() is always called
>>> from
>>> the right node and change the current bogus check:
>>>
>>> from:
>>> page_to_nid(page) == pool->p.nid 
>>>
>>> to:
>>> page_to_nid(page) == numa_mem_id()
>>>
>>> This will allow recycling only if handling node is the same as
>>> where
>>> the page was allocated, regardless of pool->p.nid.
>>>
>>> so semantics are:
>>>
>>> 1) allocate from: pool->p.nid, as chosen by user.
>>> 2) recycle when: page_to_nid(page) == numa_mem_id().
>>> 3) pool user must guarantee that the handler will run on the right
>>> node. which should always be the case. otherwise recycling will be
>>> skipped (no cross numa recycling).
>>>
>>>
>>> a) if the pool migrates, we will stop recycling until the pool
>>> moves
>>> back to original node, or user calls pool_update_nid() as we do in
>>> mlx5.
>>> b) if pool is NUMA_NO_NODE, then allocation and handling will be
>>> done
>>> on numa_mem_id(), which means the above check will work.
>>
>> Only checking page_to_nid(page) == numa_mem_id() may not work for the
>> below
>> case in mvneta.c:
>>
>> static int mvneta_create_page_pool(struct mvneta_port *pp,
>> 				   struct mvneta_rx_queue *rxq, int
>> size)
>> {
>> 	struct bpf_prog *xdp_prog = READ_ONCE(pp->xdp_prog);
>> 	struct page_pool_params pp_params = {
>> 		.order = 0,
>> 		.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
>> 		.pool_size = size,
>> 		.nid = cpu_to_node(0),
>> 		.dev = pp->dev->dev.parent,
>> 		.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL :
>> DMA_FROM_DEVICE,
>> 		.offset = pp->rx_offset_correction,
>> 		.max_len = MVNETA_MAX_RX_BUF_SIZE,
>> 	};
>>
>> the pool->p.nid is not NUMA_NO_NODE, then the node of page allocated
>> for rx
>> may not be numa_mem_id() when running in the NAPI polling, because
>> pool->p.nid
>> is not the same as the node of cpu running in the NAPI polling.
>>
>> Does the page pool support recycling for above case?
>>
> 
> I don't think you want to allow cross numa recycling.

Cross numa recycling is not what I want.

> 
>> Or we "fix' the above case by setting pool->p.nid to
>> NUMA_NO_NODE/dev_to_node(),
>> or by calling pool_update_nid() in NAPI polling as mlx5 does?
>>
> 
> Yes just update_nid when needed, and make sure the NAPI polling runs on
> a consistent core and eventually alloc/recycling will happen on the
> same core.

To me, passing NUMA_NO_NODE/dev_to_node() seems to always work.
Calling pool_update_nid() in NAPI polling is another way of passing
NUMA_NO_NODE to page_pool_init().

And it seems to be a copy & paste problem for the mvneta and netsec
drivers, which use cpu_to_node(0) as pool->p.nid but do not call
page_pool_nid_changed() in the NAPI polling as mlx5 does.

So I suggest removing page_pool_nid_changed() and always using
NUMA_NO_NODE/dev_to_node() as pool->p.nid, or making it clear (by
comment or warning?) that page_pool_nid_changed() should be
called when pool->p.nid is NUMA_NO_NODE/dev_to_node().

I would prefer to remove page_pool_nid_changed() if we do not allow
cross-NUMA recycling.
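For illustration, that would mean initializing the pool roughly like the
sketch below (dev, size and the error handling are placeholders, not
taken from any specific driver):

	struct page_pool_params pp_params = {
		.order		= 0,
		.pool_size	= size,
		.nid		= dev_to_node(dev),	/* or NUMA_NO_NODE */
		.dev		= dev,
		.dma_dir	= DMA_FROM_DEVICE,
	};
	struct page_pool *pool;

	pool = page_pool_create(&pp_params);
	if (IS_ERR(pool))
		return PTR_ERR(pool);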


> 
>>
>>> Thanks,
>>> Saeed.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
Yunsheng Lin Dec. 11, 2019, 3:06 a.m. UTC | #14
On 2019/12/11 11:01, Yunsheng Lin wrote:
> On 2019/12/11 3:45, Saeed Mahameed wrote:
>>>> maybe assume that __page_pool_recycle_direct() is always called
>>>> from
>>>> the right node and change the current bogus check:
>>>>
>>>> from:
>>>> page_to_nid(page) == pool->p.nid 
>>>>
>>>> to:
>>>> page_to_nid(page) == numa_mem_id()
>>>>
>>>> This will allow recycling only if handling node is the same as
>>>> where
>>>> the page was allocated, regardless of pool->p.nid.
>>>>
>>>> so semantics are:
>>>>
>>>> 1) allocate from: pool->p.nid, as chosen by user.
>>>> 2) recycle when: page_to_nid(page) == numa_mem_id().
>>>> 3) pool user must guarantee that the handler will run on the right
>>>> node. which should always be the case. otherwise recycling will be
>>>> skipped (no cross numa recycling).
>>>>
>>>>
>>>> a) if the pool migrates, we will stop recycling until the pool
>>>> moves
>>>> back to original node, or user calls pool_update_nid() as we do in
>>>> mlx5.
>>>> b) if pool is NUMA_NO_NODE, then allocation and handling will be
>>>> done
>>>> on numa_mem_id(), which means the above check will work.
>>>
>>> Only checking page_to_nid(page) == numa_mem_id() may not work for the
>>> below
>>> case in mvneta.c:
>>>
>>> static int mvneta_create_page_pool(struct mvneta_port *pp,
>>> 				   struct mvneta_rx_queue *rxq, int
>>> size)
>>> {
>>> 	struct bpf_prog *xdp_prog = READ_ONCE(pp->xdp_prog);
>>> 	struct page_pool_params pp_params = {
>>> 		.order = 0,
>>> 		.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
>>> 		.pool_size = size,
>>> 		.nid = cpu_to_node(0),
>>> 		.dev = pp->dev->dev.parent,
>>> 		.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL :
>>> DMA_FROM_DEVICE,
>>> 		.offset = pp->rx_offset_correction,
>>> 		.max_len = MVNETA_MAX_RX_BUF_SIZE,
>>> 	};
>>>
>>> the pool->p.nid is not NUMA_NO_NODE, then the node of page allocated
>>> for rx
>>> may not be numa_mem_id() when running in the NAPI polling, because
>>> pool->p.nid
>>> is not the same as the node of cpu running in the NAPI polling.
>>>
>>> Does the page pool support recycling for above case?
>>>
>>
>> I don't think you want to allow cross numa recycling.
> 
> Cross numa recycling is not what I want.
> 
>>
>>> Or we "fix' the above case by setting pool->p.nid to
>>> NUMA_NO_NODE/dev_to_node(),
>>> or by calling pool_update_nid() in NAPI polling as mlx5 does?
>>>
>>
>> Yes just update_nid when needed, and make sure the NAPI polling runs on
>> a consistent core and eventually alloc/recycling will happen on the
>> same core.
> 
> To me, passing NUMA_NO_NODE/dev_to_node() seems to always work.
> Calling pool_update_nid() in NAPI polling is another way of passing
> NUMA_NO_NODE to page_pool_init().
> 
> And it seems it is a copy & paste problem for mvneta and netsec
> driver that uses cpu_to_node(0) as pool->p.nid but does not call
> page_pool_nid_changed() in the NAPI polling as mlx5 does.
> 
> So I suggest to remove page_pool_nid_changed() and always use
> NUMA_NO_NODE/dev_to_node() as pool->p.nid or make it clear (
> by comment or warning?)that page_pool_nid_changed() should be
> called when pool->p.nid is NUMA_NO_NODE/dev_to_node().
                          ~~

sorry for the typo.

I meant pool->p.nid is not NUMA_NO_NODE/dev_to_node().

> 
> I prefer to remove page_pool_nid_changed() if we do not allow
> cross numa recycling.
> 
> 
>>
>>>
>>>> Thanks,
>>>> Saeed.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
> 
> 
> .
>
Jesper Dangaard Brouer Dec. 11, 2019, 6:49 p.m. UTC | #15
On Sat, 7 Dec 2019 03:52:41 +0000
Saeed Mahameed <saeedm@mellanox.com> wrote:

> I don't think it is correct to check that the page nid is same as
> numa_mem_id() if pool is NUMA_NO_NODE. In such case we should allow all
> pages to recycle, because you can't assume where pages are allocated
> from and where they are being handled.

I agree, using numa_mem_id() is not valid, because it takes the numa
node id from the executing CPU and the call to __page_pool_put_page()
can happen on a remote CPU (e.g. cpumap redirect, and in future SKBs).


> I suggest the following:
> 
> return !page_pfmemalloc() && 
> ( page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE );

Above code doesn't generate optimal ASM code, I suggest:

 static bool pool_page_reusable(struct page_pool *pool, struct page *page)
 {
	return !page_is_pfmemalloc(page) &&
		pool->p.nid != NUMA_NO_NODE &&
		page_to_nid(page) == pool->p.nid;
 }

I have compiled different variants and looked at the ASM code generated
by GCC.  This seems to give the best result.


> 1) never recycle emergency pages, regardless of pool nid.
> 2) always recycle if pool is NUMA_NO_NODE.

Yes, this defines the semantics, that a page_pool configured with
NUMA_NO_NODE means skip NUMA checks.  I think that sounds okay...


> the above change should not add any overhead, a modest branch
> predictor will handle this with no effort.

It still annoys me that we keep adding instructions to this code
hot-path (I counted 34 bytes and 11 instructions in my proposed
function).

I think that it might be possible to move these NUMA checks to the
alloc side (instead of the return/recycle side as today), and perhaps only
on the slow path when dequeuing from the ptr_ring (as recycles that call
__page_pool_recycle_direct() will be pinned during NAPI).  But let's
focus on a smaller fix for the immediate issue...
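Purely to illustrate that idea, a rough sketch of an alloc-side check at
ptr_ring dequeue time (the function name is made up, this is not the
actual page_pool internals, and a real implementation would also have to
undo the DMA mapping before dropping a page):

static struct page *example_refill_alloc_cache(struct page_pool *pool)
{
	struct page *page;

	/* slow path only: filter out foreign-node pages while refilling,
	 * so the fast recycle path needs no per-page NUMA check
	 */
	while ((page = ptr_ring_consume(&pool->ring)) != NULL) {
		if (likely(page_to_nid(page) == numa_mem_id()))
			return page;
		put_page(page);	/* foreign node: give it back to the allocator */
	}
	return NULL;
}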
Saeed Mahameed Dec. 11, 2019, 8:57 p.m. UTC | #16
On Wed, 2019-12-11 at 11:01 +0800, Yunsheng Lin wrote:
> On 2019/12/11 3:45, Saeed Mahameed wrote:
> > > > maybe assume that __page_pool_recycle_direct() is always called
> > > > from
> > > > the right node and change the current bogus check:
> > > > 
> > > > from:
> > > > page_to_nid(page) == pool->p.nid 
> > > > 
> > > > to:
> > > > page_to_nid(page) == numa_mem_id()
> > > > 
> > > > This will allow recycling only if handling node is the same as
> > > > where
> > > > the page was allocated, regardless of pool->p.nid.
> > > > 
> > > > so semantics are:
> > > > 
> > > > 1) allocate from: pool->p.nid, as chosen by user.
> > > > 2) recycle when: page_to_nid(page) == numa_mem_id().
> > > > 3) pool user must guarantee that the handler will run on the
> > > > right
> > > > node. which should always be the case. otherwise recycling will
> > > > be
> > > > skipped (no cross numa recycling).
> > > > 
> > > > 
> > > > a) if the pool migrates, we will stop recycling until the pool
> > > > moves
> > > > back to original node, or user calls pool_update_nid() as we do
> > > > in
> > > > mlx5.
> > > > b) if pool is NUMA_NO_NODE, then allocation and handling will
> > > > be
> > > > done
> > > > on numa_mem_id(), which means the above check will work.
> > > 
> > > Only checking page_to_nid(page) == numa_mem_id() may not work for
> > > the
> > > below
> > > case in mvneta.c:
> > > 
> > > static int mvneta_create_page_pool(struct mvneta_port *pp,
> > > 				   struct mvneta_rx_queue *rxq, int
> > > size)
> > > {
> > > 	struct bpf_prog *xdp_prog = READ_ONCE(pp->xdp_prog);
> > > 	struct page_pool_params pp_params = {
> > > 		.order = 0,
> > > 		.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
> > > 		.pool_size = size,
> > > 		.nid = cpu_to_node(0),
> > > 		.dev = pp->dev->dev.parent,
> > > 		.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL :
> > > DMA_FROM_DEVICE,
> > > 		.offset = pp->rx_offset_correction,
> > > 		.max_len = MVNETA_MAX_RX_BUF_SIZE,
> > > 	};
> > > 
> > > the pool->p.nid is not NUMA_NO_NODE, then the node of page
> > > allocated
> > > for rx
> > > may not be numa_mem_id() when running in the NAPI polling,
> > > because
> > > pool->p.nid
> > > is not the same as the node of cpu running in the NAPI polling.
> > > 
> > > Does the page pool support recycling for above case?
> > > 
> > 
> > I don't think you want to allow cross numa recycling.
> 
> Cross numa recycling is not what I want.
> 
> > > Or we "fix' the above case by setting pool->p.nid to
> > > NUMA_NO_NODE/dev_to_node(),
> > > or by calling pool_update_nid() in NAPI polling as mlx5 does?
> > > 
> > 
> > Yes just update_nid when needed, and make sure the NAPI polling
> > runs on
> > a consistent core and eventually alloc/recycling will happen on the
> > same core.
> 
> To me, passing NUMA_NO_NODE/dev_to_node() seems to always work.
> Calling pool_update_nid() in NAPI polling is another way of passing
> NUMA_NO_NODE to page_pool_init().
> 
> And it seems it is a copy & paste problem for mvneta and netsec
> driver that uses cpu_to_node(0) as pool->p.nid but does not call
> page_pool_nid_changed() in the NAPI polling as mlx5 does.
> 
> So I suggest to remove page_pool_nid_changed() and always use
> NUMA_NO_NODE/dev_to_node() as pool->p.nid or make it clear (
> by comment or warning?)that page_pool_nid_changed() should be
> called when pool->p.nid is NUMA_NO_NODE/dev_to_node().
> 

not an option.

rx rings should always allocate data buffers according to their cpu
affinity, not the dev node, and not default to NUMA_NO_NODE.

> I prefer to remove page_pool_nid_changed() if we do not allow
> cross numa recycling.
> 

This is not for cross-NUMA recycling.
Since rx rings can migrate between cores (msix affinity/IRQ balancer),
we need page_pool_nid_changed() for seamless migration, so that
recycling and allocation migrate with the ring.

> 
> > > > Thanks,
> > > > Saeed.
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > >
Saeed Mahameed Dec. 11, 2019, 9:24 p.m. UTC | #17
On Wed, 2019-12-11 at 19:49 +0100, Jesper Dangaard Brouer wrote:
> On Sat, 7 Dec 2019 03:52:41 +0000
> Saeed Mahameed <saeedm@mellanox.com> wrote:
> 
> > I don't think it is correct to check that the page nid is same as
> > numa_mem_id() if pool is NUMA_NO_NODE. In such case we should allow
> > all
> > pages to recycle, because you can't assume where pages are
> > allocated
> > from and where they are being handled.
> 
> I agree, using numa_mem_id() is not valid, because it takes the numa
> node id from the executing CPU and the call to __page_pool_put_page()
> can happen on a remote CPU (e.g. cpumap redirect, and in future
> SKBs).
> 
> 
> > I suggest the following:
> > 
> > return !page_pfmemalloc() && 
> > ( page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE
> > );
> 
> Above code doesn't generate optimal ASM code, I suggest:
> 
>  static bool pool_page_reusable(struct page_pool *pool, struct page
> *page)
>  {
> 	return !page_is_pfmemalloc(page) &&
> 		pool->p.nid != NUMA_NO_NODE &&
> 		page_to_nid(page) == pool->p.nid;
>  }
> 

this is not equivalent to the above. Here in case pool->p.nid is
NUMA_NO_NODE, pool_page_reusable() will always be false.

We can avoid the extra check in data path.
How about avoiding NUMA_NO_NODE in page_pool altogether, and force
numa_mem_id() as pool->p.nid when user requests NUMA_NO_NODE at page
pool init, as already done in alloc_pages_node(). 

which will imply recycling without adding any extra condition to the
data path.

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index a6aefe989043..00c99282a306 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -28,6 +28,9 @@ static int page_pool_init(struct page_pool *pool,
 
        memcpy(&pool->p, params, sizeof(pool->p));
 
+	/* overwrite to allow recycling.. */
+       if (pool->p.nid == NUMA_NO_NODE) 
+               pool->p.nid = numa_mem_id(); 
+

After a quick look, i don't see any reason why to keep NUMA_NO_NODE in
pool->p.nid.. 


> I have compiled different variants and looked at the ASM code generated
> by GCC.  This seems to give the best result.
> 
> 
> > 1) never recycle emergency pages, regardless of pool nid.
> > 2) always recycle if pool is NUMA_NO_NODE.
> 
> Yes, this defines the semantics, that a page_pool configured with
> NUMA_NO_NODE means skip NUMA checks.  I think that sounds okay...
> 
> 
> > the above change should not add any overhead, a modest branch
> > predictor will handle this with no effort.
> 
> It still annoys me that we keep adding instructions to this code
> hot-path (I counted 34 bytes and 11 instructions in my proposed
> function).
> 
> I think that it might be possible to move these NUMA checks to
> alloc-side (instead of return/recycles side as today), and perhaps
> only
> on slow-path when dequeuing from ptr_ring (as recycles that call
> __page_pool_recycle_direct() will be pinned during NAPI).  But lets
> focus on a smaller fix for the immediate issue...
> 

I know. It annoys me too, but we need recycling to work in production :
where rings/napi can migrate and numa nodes can be NUMA_NO_NODE :-(.
Yunsheng Lin Dec. 12, 2019, 1:04 a.m. UTC | #18
On 2019/12/12 4:57, Saeed Mahameed wrote:
> On Wed, 2019-12-11 at 11:01 +0800, Yunsheng Lin wrote:
>> On 2019/12/11 3:45, Saeed Mahameed wrote:
>>>>> maybe assume that __page_pool_recycle_direct() is always called
>>>>> from
>>>>> the right node and change the current bogus check:
>>>>>
>>>>> from:
>>>>> page_to_nid(page) == pool->p.nid 
>>>>>
>>>>> to:
>>>>> page_to_nid(page) == numa_mem_id()
>>>>>
>>>>> This will allow recycling only if handling node is the same as
>>>>> where
>>>>> the page was allocated, regardless of pool->p.nid.
>>>>>
>>>>> so semantics are:
>>>>>
>>>>> 1) allocate from: pool->p.nid, as chosen by user.
>>>>> 2) recycle when: page_to_nid(page) == numa_mem_id().
>>>>> 3) pool user must guarantee that the handler will run on the
>>>>> right
>>>>> node. which should always be the case. otherwise recycling will
>>>>> be
>>>>> skipped (no cross numa recycling).
>>>>>
>>>>>
>>>>> a) if the pool migrates, we will stop recycling until the pool
>>>>> moves
>>>>> back to original node, or user calls pool_update_nid() as we do
>>>>> in
>>>>> mlx5.
>>>>> b) if pool is NUMA_NO_NODE, then allocation and handling will
>>>>> be
>>>>> done
>>>>> on numa_mem_id(), which means the above check will work.
>>>>
>>>> Only checking page_to_nid(page) == numa_mem_id() may not work for
>>>> the
>>>> below
>>>> case in mvneta.c:
>>>>
>>>> static int mvneta_create_page_pool(struct mvneta_port *pp,
>>>> 				   struct mvneta_rx_queue *rxq, int
>>>> size)
>>>> {
>>>> 	struct bpf_prog *xdp_prog = READ_ONCE(pp->xdp_prog);
>>>> 	struct page_pool_params pp_params = {
>>>> 		.order = 0,
>>>> 		.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
>>>> 		.pool_size = size,
>>>> 		.nid = cpu_to_node(0),
>>>> 		.dev = pp->dev->dev.parent,
>>>> 		.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL :
>>>> DMA_FROM_DEVICE,
>>>> 		.offset = pp->rx_offset_correction,
>>>> 		.max_len = MVNETA_MAX_RX_BUF_SIZE,
>>>> 	};
>>>>
>>>> the pool->p.nid is not NUMA_NO_NODE, then the node of page
>>>> allocated
>>>> for rx
>>>> may not be numa_mem_id() when running in the NAPI polling,
>>>> because
>>>> pool->p.nid
>>>> is not the same as the node of cpu running in the NAPI polling.
>>>>
>>>> Does the page pool support recycling for above case?
>>>>
>>>
>>> I don't think you want to allow cross numa recycling.
>>
>> Cross numa recycling is not what I want.
>>
>>>> Or we "fix' the above case by setting pool->p.nid to
>>>> NUMA_NO_NODE/dev_to_node(),
>>>> or by calling pool_update_nid() in NAPI polling as mlx5 does?
>>>>
>>>
>>> Yes just update_nid when needed, and make sure the NAPI polling
>>> runs on
>>> a consistent core and eventually alloc/recycling will happen on the
>>> same core.
>>
>> To me, passing NUMA_NO_NODE/dev_to_node() seems to always work.
>> Calling pool_update_nid() in NAPI polling is another way of passing
>> NUMA_NO_NODE to page_pool_init().
>>
>> And it seems it is a copy & paste problem for mvneta and netsec
>> driver that uses cpu_to_node(0) as pool->p.nid but does not call
>> page_pool_nid_changed() in the NAPI polling as mlx5 does.
>>
>> So I suggest to remove page_pool_nid_changed() and always use
>> NUMA_NO_NODE/dev_to_node() as pool->p.nid or make it clear (
>> by comment or warning?)that page_pool_nid_changed() should be
>> called when pool->p.nid is NUMA_NO_NODE/dev_to_node().
>>
> 
> not an option.
> 
> rx rings should always allocate data buffers according to their cpu
> affinity and not dev_node or default to NUMA_NO_NODE.

Does "their cpu affinity" mean numa_mem_id() when runnig in the
NAPI polling context?

If yes, the node of data buffers allocated for rx will be default to
numa_mem_id() when the pool->p.nid is NUMA_NO_NODE.

See:

/*
 * Allocate pages, preferring the node given as nid. When nid == NUMA_NO_NODE,
 * prefer the current CPU's closest node. Otherwise node must be valid and
 * online.
 */
static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
						unsigned int order)
{
	if (nid == NUMA_NO_NODE)
		nid = numa_mem_id();

	return __alloc_pages_node(nid, gfp_mask, order);
}

If the above is true, NUMA_NO_NODE defaults to numa_mem_id().

> 
>> I prefer to remove page_pool_nid_changed() if we do not allow
>> cross numa recycling.
>>
> 
> This is not for cross numa recycling. 
> Since rx rings can migragte between cores, (msix affinity/IRQ balancer)
> we need page_pool_nid_changed() for seamless migration and for
> recycling and allocation to migrate with the ring.


If the allocation and recycling of rx data buffers is guaranteed to be in
NAPI polling context, then when an rx ring migrates between nodes, it will stop
recycling the old pages allocated from the old node, and start allocating new
pages from the new node and recycling those new pages, because numa_mem_id()
will return a different node id based on the node the current cpu is running on.

Or is the allocation and recycling of rx data buffers not guaranteed to be
in the same NAPI polling context for a specific ring? Is there a usecase for
this?


> 
>>
>>>>> Thanks,
>>>>> Saeed.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
Yunsheng Lin Dec. 12, 2019, 1:34 a.m. UTC | #19
+CC Michal, Peter, Greg and Bjorn
Because there has been discussion about where and how NUMA_NO_NODE
should be handled before.

On 2019/12/12 5:24, Saeed Mahameed wrote:
> On Wed, 2019-12-11 at 19:49 +0100, Jesper Dangaard Brouer wrote:
>> On Sat, 7 Dec 2019 03:52:41 +0000
>> Saeed Mahameed <saeedm@mellanox.com> wrote:
>>
>>> I don't think it is correct to check that the page nid is same as
>>> numa_mem_id() if pool is NUMA_NO_NODE. In such case we should allow
>>> all
>>> pages to recycle, because you can't assume where pages are
>>> allocated
>>> from and where they are being handled.
>>
>> I agree, using numa_mem_id() is not valid, because it takes the numa
>> node id from the executing CPU and the call to __page_pool_put_page()
>> can happen on a remote CPU (e.g. cpumap redirect, and in future
>> SKBs).
>>
>>
>>> I suggest the following:
>>>
>>> return !page_pfmemalloc() && 
>>> ( page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE
>>> );
>>
>> Above code doesn't generate optimal ASM code, I suggest:
>>
>>  static bool pool_page_reusable(struct page_pool *pool, struct page
>> *page)
>>  {
>> 	return !page_is_pfmemalloc(page) &&
>> 		pool->p.nid != NUMA_NO_NODE &&
>> 		page_to_nid(page) == pool->p.nid;
>>  }
>>
> 
> this is not equivalent to the above. Here in case pool->p.nid is
> NUMA_NO_NODE, pool_page_reusable() will always be false.
> 
> We can avoid the extra check in data path.
> How about avoiding NUMA_NO_NODE in page_pool altogether, and force
> numa_mem_id() as pool->p.nid when user requests NUMA_NO_NODE at page
> pool init, as already done in alloc_pages_node(). 

That means we will not support page reuse migration for NUMA_NO_NODE,
which is not the same semantics as alloc_pages_node() handling of NUMA_NO_NODE,
because alloc_pages_node() will allocate the page based on the node
of the current running cpu.

Also, there seems to be wild guessing of the node id here, which has
been discussed before and has not reached an agreement yet.

> 
> which will imply recycling without adding any extra condition to the
> data path.
> 
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index a6aefe989043..00c99282a306 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -28,6 +28,9 @@ static int page_pool_init(struct page_pool *pool,
>  
>         memcpy(&pool->p, params, sizeof(pool->p));
>  
> +	/* overwrite to allow recycling.. */
> +       if (pool->p.nid == NUMA_NO_NODE) 
> +               pool->p.nid = numa_mem_id(); 
> +
> 
> After a quick look, i don't see any reason why to keep NUMA_NO_NODE in
> pool->p.nid.. 
> 
> 
>> I have compiled different variants and looked at the ASM code generated
>> by GCC.  This seems to give the best result.
>>
>>
>>> 1) never recycle emergency pages, regardless of pool nid.
>>> 2) always recycle if pool is NUMA_NO_NODE.
>>
>> Yes, this defines the semantics, that a page_pool configured with
>> NUMA_NO_NODE means skip NUMA checks.  I think that sounds okay...
>>
>>
>>> the above change should not add any overhead, a modest branch
>>> predictor will handle this with no effort.
>>
>> It still annoys me that we keep adding instructions to this code
>> hot-path (I counted 34 bytes and 11 instructions in my proposed
>> function).
>>
>> I think that it might be possible to move these NUMA checks to
>> alloc-side (instead of return/recycles side as today), and perhaps
>> only
>> on slow-path when dequeuing from ptr_ring (as recycles that call
>> __page_pool_recycle_direct() will be pinned during NAPI).  But lets
>> focus on a smaller fix for the immediate issue...
>>
> 
> I know. It annoys me too, but we need recycling to work in production :
> where rings/napi can migrate and numa nodes can be NUMA_NO_NODE :-(.
> 
>
Jesper Dangaard Brouer Dec. 12, 2019, 10:18 a.m. UTC | #20
On Thu, 12 Dec 2019 09:34:14 +0800
Yunsheng Lin <linyunsheng@huawei.com> wrote:

> +CC Michal, Peter, Greg and Bjorn
> Because there has been disscusion about where and how the NUMA_NO_NODE
> should be handled before.
> 
> On 2019/12/12 5:24, Saeed Mahameed wrote:
> > On Wed, 2019-12-11 at 19:49 +0100, Jesper Dangaard Brouer wrote:  
> >> On Sat, 7 Dec 2019 03:52:41 +0000
> >> Saeed Mahameed <saeedm@mellanox.com> wrote:
> >>  
> >>> I don't think it is correct to check that the page nid is same as
> >>> numa_mem_id() if pool is NUMA_NO_NODE. In such case we should allow
> >>> all  pages to recycle, because you can't assume where pages are
> >>> allocated from and where they are being handled.  
> >>
> >> I agree, using numa_mem_id() is not valid, because it takes the numa
> >> node id from the executing CPU and the call to __page_pool_put_page()
> >> can happen on a remote CPU (e.g. cpumap redirect, and in future
> >> SKBs).
> >>
> >>  
> >>> I suggest the following:
> >>>
> >>> return !page_pfmemalloc() && 
> >>> ( page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE );  
> >>
> >> Above code doesn't generate optimal ASM code, I suggest:
> >>
> >>  static bool pool_page_reusable(struct page_pool *pool, struct page *page)
> >>  {
> >> 	return !page_is_pfmemalloc(page) &&
> >> 		pool->p.nid != NUMA_NO_NODE &&
> >> 		page_to_nid(page) == pool->p.nid;
> >>  }
> >>  
> > 
> > this is not equivalent to the above. Here in case pool->p.nid is
> > NUMA_NO_NODE, pool_page_reusable() will always be false.
> > 
> > We can avoid the extra check in data path.
> > How about avoiding NUMA_NO_NODE in page_pool altogether, and force
> > numa_mem_id() as pool->p.nid when user requests NUMA_NO_NODE at page
> > pool init, as already done in alloc_pages_node().   
> 
> That means we will not support page reuse mitigation for NUMA_NO_NODE,
> which is not same semantic that alloc_pages_node() handle NUMA_NO_NODE,
> because alloc_pages_node() will allocate the page based on the node
> of the current running cpu.

True, as I wrote (below), my code defines the semantics as: a page_pool
configured with NUMA_NO_NODE means skip NUMA checks, and allow recycling
regardless of which NUMA node the page belongs to.  It seems that you want
different semantics.

I'm open to other semantics. My main concern is performance.  The
page_pool fast-path for the driver recycling use-case of XDP_DROP has
extreme performance requirements, as it needs to compete with driver
local recycle tricks (else we cannot use page_pool to simplify drivers).
The extreme performance target is 100Gbit/s = 148Mpps = 6.72ns, and
in practice I'm measuring 25Mpps = 40ns with Mlx5 driver (single q),
and Bjørn is showing 30 Mpps = 33.3ns with i40e.  At this level every
cycle/instruction counts.

 
> Also, There seems to be a wild guessing of the node id here, which has
> been disscussed before and has not reached a agreement yet.
> 
> > 
> > which will imply recycling without adding any extra condition to the
> > data path.

I love code that moves thing out of our fast-path.

> > 
> > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > index a6aefe989043..00c99282a306 100644
> > --- a/net/core/page_pool.c
> > +++ b/net/core/page_pool.c
> > @@ -28,6 +28,9 @@ static int page_pool_init(struct page_pool *pool,
> >  
> >         memcpy(&pool->p, params, sizeof(pool->p));
> >  
> > +	/* overwrite to allow recycling.. */
> > +       if (pool->p.nid == NUMA_NO_NODE) 
> > +               pool->p.nid = numa_mem_id(); 
> > +

The problem is that page_pool_init() can be initiated from a random
CPU, first at driver setup/bringup, and later at other queue changes
that can be started via ethtool or XDP attach. (numa_mem_id() picks
from running CPU).

As Yunsheng mentioned elsewhere, there is also a dev_to_node() function.
Isn't that what we want in a place like this?


One issue with dev_to_node() is that in the case of !CONFIG_NUMA it returns
NUMA_NO_NODE (-1).  (And device_initialize() also sets it to -1.)  Thus,
in that case we set pool->p.nid = 0, as page_to_nid() will also return
zero in that case (as far as I follow the code).
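
A minimal sketch of what that fallback could look like in page_pool_init()
(only an illustration of the idea above, not a proposed patch; whether
dev_to_node() is the right source is exactly what is being debated here):

	/* Sketch: fall back from an explicit nid, to the device's node,
	 * to node 0 (the value page_to_nid() reports when !CONFIG_NUMA,
	 * where dev_to_node() returns NUMA_NO_NODE).
	 */
	if (pool->p.nid == NUMA_NO_NODE)
		pool->p.nid = dev_to_node(pool->p.dev);
	if (pool->p.nid == NUMA_NO_NODE)
		pool->p.nid = 0;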


> > After a quick look, i don't see any reason why to keep NUMA_NO_NODE in
> > pool->p.nid.. 
> > 
> >   
> >> I have compiled different variants and looked at the ASM code
> >> generated by GCC.  This seems to give the best result.
> >>
> >>  
> >>> 1) never recycle emergency pages, regardless of pool nid.
> >>> 2) always recycle if pool is NUMA_NO_NODE.  
> >>
> >> Yes, this defines the semantics, that a page_pool configured with
> >> NUMA_NO_NODE means skip NUMA checks.  I think that sounds okay...
> >>
> >>  
> >>> the above change should not add any overhead, a modest branch
> >>> predictor will handle this with no effort.  
> >>
> >> It still annoys me that we keep adding instructions to this code
> >> hot-path (I counted 34 bytes and 11 instructions in my proposed
> >> function).
> >>
> >> I think that it might be possible to move these NUMA checks to
> >> alloc-side (instead of return/recycles side as today), and perhaps
> >> only on slow-path when dequeuing from ptr_ring (as recycles that
> >> call __page_pool_recycle_direct() will be pinned during NAPI).
> >> But lets focus on a smaller fix for the immediate issue...
> >>  
> > 
> > I know. It annoys me too, but we need recycling to work in
> > production : where rings/napi can migrate and numa nodes can be
> > NUMA_NO_NODE :-(.
> > 
> >
Yunsheng Lin Dec. 13, 2019, 3:40 a.m. UTC | #21
On 2019/12/12 18:18, Jesper Dangaard Brouer wrote:
> On Thu, 12 Dec 2019 09:34:14 +0800
> Yunsheng Lin <linyunsheng@huawei.com> wrote:
> 
>> +CC Michal, Peter, Greg and Bjorn
>> Because there has been disscusion about where and how the NUMA_NO_NODE
>> should be handled before.
>>
>> On 2019/12/12 5:24, Saeed Mahameed wrote:
>>> On Wed, 2019-12-11 at 19:49 +0100, Jesper Dangaard Brouer wrote:  
>>>> On Sat, 7 Dec 2019 03:52:41 +0000
>>>> Saeed Mahameed <saeedm@mellanox.com> wrote:
>>>>  
>>>>> I don't think it is correct to check that the page nid is same as
>>>>> numa_mem_id() if pool is NUMA_NO_NODE. In such case we should allow
>>>>> all  pages to recycle, because you can't assume where pages are
>>>>> allocated from and where they are being handled.  
>>>>
>>>> I agree, using numa_mem_id() is not valid, because it takes the numa
>>>> node id from the executing CPU and the call to __page_pool_put_page()
>>>> can happen on a remote CPU (e.g. cpumap redirect, and in future
>>>> SKBs).
>>>>
>>>>  
>>>>> I suggest the following:
>>>>>
>>>>> return !page_pfmemalloc() && 
>>>>> ( page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE );  
>>>>
>>>> Above code doesn't generate optimal ASM code, I suggest:
>>>>
>>>>  static bool pool_page_reusable(struct page_pool *pool, struct page *page)
>>>>  {
>>>> 	return !page_is_pfmemalloc(page) &&
>>>> 		pool->p.nid != NUMA_NO_NODE &&
>>>> 		page_to_nid(page) == pool->p.nid;
>>>>  }
>>>>  
>>>
>>> this is not equivalent to the above. Here in case pool->p.nid is
>>> NUMA_NO_NODE, pool_page_reusable() will always be false.
>>>
>>> We can avoid the extra check in data path.
>>> How about avoiding NUMA_NO_NODE in page_pool altogether, and force
>>> numa_mem_id() as pool->p.nid when user requests NUMA_NO_NODE at page
>>> pool init, as already done in alloc_pages_node().   
>>
>> That means we will not support page reuse mitigation for NUMA_NO_NODE,
>> which is not same semantic that alloc_pages_node() handle NUMA_NO_NODE,
>> because alloc_pages_node() will allocate the page based on the node
>> of the current running cpu.
> 
> True, as I wrote (below) my code defines semantics as: that a page_pool
> configured with NUMA_NO_NODE means skip NUMA checks, and allow recycle
> regardless of NUMA node page belong to.  It seems that you want another
> semantics.

For drivers that do not have page pool support yet, the semantics seem
to be: always allocate and recycle local pages (except pfmemalloc ones). Page
reuse migrates when the rx interrupt affinity moves (the NAPI polling
context moves accordingly), regardless of dev_to_node().

It would be good to maintain the above semantics. And the rx data page seems
to be close to the cpu doing the rx cleaning, which means the cpu
can process the buffer quicker?

> 
> I'm open to other semantics. My main concern is performance.  The
> page_pool fast-path for driver recycling use-case of XDP_DROP, have
> extreme performance requirements, as it needs to compete with driver
> local recycle tricks (else we cannot use page_pool to simplify drivers).
> The extreme performance target is 100Gbit/s = 148Mpps = 6.72ns, and
> in practice I'm measuring 25Mpps = 40ns with Mlx5 driver (single q),
> and Bjørn is showing 30 Mpps = 33.3ns with i40e.  At this level every
> cycle/instruction counts.

Yes, the performance is a concern too.
But if the rx page is closer to the cpu, maybe the time taken to process
the buffer can be reduced?

It is good to allocate the rx page close to both the cpu and the device, but if
both goals cannot be reached, maybe we choose to allocate pages that are close
to the cpu?

> 
>  
>> Also, There seems to be a wild guessing of the node id here, which has
>> been disscussed before and has not reached a agreement yet.
>>
>>>
>>> which will imply recycling without adding any extra condition to the
>>> data path.
> 
> I love code that moves thing out of our fast-path.
> 
>>>
>>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
>>> index a6aefe989043..00c99282a306 100644
>>> --- a/net/core/page_pool.c
>>> +++ b/net/core/page_pool.c
>>> @@ -28,6 +28,9 @@ static int page_pool_init(struct page_pool *pool,
>>>  
>>>         memcpy(&pool->p, params, sizeof(pool->p));
>>>  
>>> +	/* overwrite to allow recycling.. */
>>> +       if (pool->p.nid == NUMA_NO_NODE) 
>>> +               pool->p.nid = numa_mem_id(); 
>>> +
> 
> The problem is that page_pool_init() is can be initiated from a random
> CPU, first at driver setup/bringup, and later at other queue changes
> that can be started via ethtool or XDP attach. (numa_mem_id() picks
> from running CPU).

Yes, changing the ring count or ring depth releases and allocates rx data pages,
so using NUMA_NO_NODE to allocate pages and always recycling those pages may
have a performance difference noticeable to the user.

> 
> As Yunsheng mentioned elsewhere, there is also a dev_to_node() function.
> Isn't that what we want in a place like this?
> 
> 
> One issue with dev_to_node() is that in case of !CONFIG_NUMA it returns
> NUMA_NO_NODE (-1).  (And device_initialize() also set it to -1).  Thus,
> in that case we set pool->p.nid = 0, as page_to_nid() will also return
> zero in that case (as far as I follow the code).
> 
> 
>>> After a quick look, i don't see any reason why to keep NUMA_NO_NODE in
>>> pool->p.nid.. 
>>>
>>>   
>>>> I have compiled different variants and looked at the ASM code
>>>> generated by GCC.  This seems to give the best result.
>>>>
>>>>  
>>>>> 1) never recycle emergency pages, regardless of pool nid.
>>>>> 2) always recycle if pool is NUMA_NO_NODE.  
>>>>
>>>> Yes, this defines the semantics, that a page_pool configured with
>>>> NUMA_NO_NODE means skip NUMA checks.  I think that sounds okay...
>>>>
>>>>  
>>>>> the above change should not add any overhead, a modest branch
>>>>> predictor will handle this with no effort.  
>>>>
>>>> It still annoys me that we keep adding instructions to this code
>>>> hot-path (I counted 34 bytes and 11 instructions in my proposed
>>>> function).
>>>>
>>>> I think that it might be possible to move these NUMA checks to
>>>> alloc-side (instead of return/recycles side as today), and perhaps
>>>> only on slow-path when dequeuing from ptr_ring (as recycles that
>>>> call __page_pool_recycle_direct() will be pinned during NAPI).
>>>> But lets focus on a smaller fix for the immediate issue...
>>>>  
>>>
>>> I know. It annoys me too, but we need recycling to work in
>>> production : where rings/napi can migrate and numa nodes can be
>>> NUMA_NO_NODE :-(.
>>>
>>>   
>
Li RongQing Dec. 13, 2019, 6:27 a.m. UTC | #22
> 
> It is good to allocate the rx page close to both cpu and device, but if
> both goal can not be reached, maybe we choose to allocate page that close
> to cpu?
> 
I think it is true.

If it is true, we can remove pool->p.nid, replace alloc_pages_node() with
alloc_pages() in __page_pool_alloc_pages_slow(), and change pool_page_reusable()
so that page_to_nid(page) is checked against numa_mem_id(),

since alloc_pages() hints to use pages from the current node, and
__page_pool_alloc_pages_slow() will be called often in NAPI polling if recycling
fails; after some cycles, the pages will come from the local memory node.
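
A rough sketch of that suggestion, reusing the names of the existing page_pool
functions (untested; only to make the idea concrete):

	/* Allocation side (in __page_pool_alloc_pages_slow): drop the nid
	 * parameter and let the allocator prefer the node of the running CPU.
	 */
	page = alloc_pages(gfp, pool->p.order);

	/* Recycle side: only reuse pages local to the recycling CPU. */
	static bool pool_page_reusable(struct page_pool *pool, struct page *page)
	{
		return !page_is_pfmemalloc(page) &&
		       page_to_nid(page) == numa_mem_id();
	}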

-Li
Yunsheng Lin Dec. 13, 2019, 6:53 a.m. UTC | #23
On 2019/12/13 14:27, Li,Rongqing wrote:
>>
>> It is good to allocate the rx page close to both cpu and device, but if
>> both goal can not be reached, maybe we choose to allocate page that close
>> to cpu?
>>
> I think it is true
> 
> If it is true, , we can remove pool->p.nid, and replace alloc_pages_node with
> alloc_pages in __page_pool_alloc_pages_slow, and change pool_page_reusable as
> that page_to_nid(page) is checked with numa_mem_id()  
> 
> since alloc_pages hint to use the current node page, and __page_pool_alloc_pages_slow 
> will be called in NAPI polling often if recycle failed, after some cycle, the page will be from
> local memory node.

Yes, if allocation and recycling are in the same NAPI polling context.

As pointed out by Saeed and Ilias, the allocation and recycling may not
be happening in the same NAPI polling context, see:

"In the current code base if they are only called under NAPI this might be true.
On the page_pool skb recycling patches though (yes we'll eventually send those
:)) this is called from kfree_skb()."

So this may need some additional attention.


> 
> -Li
>
Jesper Dangaard Brouer Dec. 13, 2019, 8:48 a.m. UTC | #24
On Fri, 13 Dec 2019 14:53:37 +0800
Yunsheng Lin <linyunsheng@huawei.com> wrote:

> On 2019/12/13 14:27, Li,Rongqing wrote:
> >>
> >> It is good to allocate the rx page close to both cpu and device,
> >> but if both goal can not be reached, maybe we choose to allocate
> >> page that close to cpu?
> >>  
> > I think it is true
> > 
> > If it is true, , we can remove pool->p.nid, and replace
> > alloc_pages_node with alloc_pages in __page_pool_alloc_pages_slow,
> > and change pool_page_reusable as that page_to_nid(page) is checked
> > with numa_mem_id()  

No, as I explained before, you cannot use numa_mem_id() in pool_page_reusable,
because the recycle call can happen from/on a remote CPU (and numa_mem_id()
uses the local CPU to determine the NUMA id).

Today we already have XDP cpumap redirect.  And we are working on
extending this to SKBs, which will increase the likelihood even more.

I don't think we want to not-recycle if a remote NUMA node CPU happened
to touch the packet?

(Based on the optimizations done for Facebook (the reason we added this))
What seems to matter is the NUMA node of CPU that runs RX NAPI-polling,
this is the first CPU that touch the packet memory and struct-page
memory.  For these drivers, it is also the same "RX-CPU" that does the
allocation of new pages (to refill the RX-ring), and these "new" pages
can come from the page_pool.

With this in mind, it does seem strange that we are doing the check on
the "free"/recycles code path, that can run on remote CPUs...


> > since alloc_pages hint to use the current node page, and
> > __page_pool_alloc_pages_slow will be called in NAPI polling often
> > if recycle failed, after some cycle, the page will be from local
> > memory node.  

You are basically saying that the NUMA check should be moved to
allocation time, as that runs on the RX-CPU (NAPI).  And eventually,
after some time, the pages will come from the correct NUMA node.

I think we can do that, and only affect the semi-fast-path.
We just need to handle that pages in the ptr_ring that are recycled can
be from the wrong NUMA node.  In __page_pool_get_cached() when
consuming pages from the ptr_ring (__ptr_ring_consume_batched), then we
can evict pages from wrong NUMA node.

For the pool->alloc.cache we either accept that it will eventually,
after some time, be emptied (it is only in a 100% XDP_DROP workload that
it will continue to reuse the same pages).  Or we simply clear the
pool->alloc.cache when calling page_pool_update_nid().
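
A sketch of the eviction idea on the refill path (simplified from the existing
__page_pool_get_cached()/ptr_ring consume loop; the pref_nid handling and the
use of __page_pool_return_page() as the release helper are assumptions, not
final code):

	struct ptr_ring *r = &pool->ring;
	struct page *page;
	int pref_nid;

	/* Called with r->consumer_lock held: refill alloc.cache from the
	 * ptr_ring, releasing pages whose node no longer matches, so the
	 * NUMA check is paid here instead of on every recycle.
	 */
	pref_nid = (pool->p.nid == NUMA_NO_NODE) ? numa_mem_id() : pool->p.nid;

	while (pool->alloc.count < PP_ALLOC_CACHE_REFILL) {
		page = __ptr_ring_consume(r);
		if (!page)
			break;

		if (likely(page_to_nid(page) == pref_nid))
			pool->alloc.cache[pool->alloc.count++] = page;
		else
			__page_pool_return_page(pool, page);
	}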


> Yes if allocation and recycling are in the same NAPI polling context.

Which is a false assumption.

> As pointed out by Saeed and Ilias, the allocation and recycling seems
> to may not be happening in the same NAPI polling context, see:
> 
> "In the current code base if they are only called under NAPI this
> might be true. On the page_pool skb recycling patches though (yes
> we'll eventually send those :)) this is called from kfree_skb()."
> 
> So there may need some additionl attention.

Yes, as explained above.
Yunsheng Lin Dec. 16, 2019, 1:51 a.m. UTC | #25
On 2019/12/13 16:48, Jesper Dangaard Brouer wrote:
> You are basically saying that the NUMA check should be moved to
> allocation time, as it is running the RX-CPU (NAPI).  And eventually
> after some time the pages will come from correct NUMA node.
> 
> I think we can do that, and only affect the semi-fast-path.
> We just need to handle that pages in the ptr_ring that are recycled can
> be from the wrong NUMA node.  In __page_pool_get_cached() when
> consuming pages from the ptr_ring (__ptr_ring_consume_batched), then we
> can evict pages from wrong NUMA node.

Yes, that's workable.

> 
> For the pool->alloc.cache we either accept, that it will eventually
> after some time be emptied (it is only in a 100% XDP_DROP workload that
> it will continue to reuse same pages).   Or we simply clear the
> pool->alloc.cache when calling page_pool_update_nid().

Simply clearing the pool->alloc.cache when calling page_pool_update_nid()
seems better.
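
For concreteness, a sketch of such a flush added at the top of
page_pool_update_nid() (assuming it is only called from the softirq/NAPI
context that owns pool->alloc, as the drivers in this thread do, and reusing
__page_pool_return_page() as the release helper):

	struct page *page;

	/* Flush the lockless alloc cache so stale-node pages are released
	 * rather than reused after the nid change.
	 */
	while (pool->alloc.count) {
		page = pool->alloc.cache[--pool->alloc.count];
		__page_pool_return_page(pool, page);
	}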

>
Michal Hocko Dec. 16, 2019, 12:15 p.m. UTC | #26
On Thu 12-12-19 09:34:14, Yunsheng Lin wrote:
> +CC Michal, Peter, Greg and Bjorn
> Because there has been disscusion about where and how the NUMA_NO_NODE
> should be handled before.

I do not have a full context. What is the question here?
Ilias Apalodimas Dec. 16, 2019, 12:34 p.m. UTC | #27
Hi Michal, 
On Mon, Dec 16, 2019 at 01:15:57PM +0100, Michal Hocko wrote:
> On Thu 12-12-19 09:34:14, Yunsheng Lin wrote:
> > +CC Michal, Peter, Greg and Bjorn
> > Because there has been disscusion about where and how the NUMA_NO_NODE
> > should be handled before.
> 
> I do not have a full context. What is the question here?

When we allocate pages for the page_pool API, during init the driver writer
decides which NUMA node to use. The API can, in some cases, recycle the memory
instead of freeing it and re-allocating it. If the NUMA node has changed (irq
affinity for example), we forbid recycling and free the memory, since recycling
and using memory on far NUMA nodes is more expensive (more expensive than
recycling, at least on the architectures we tried anyway).
Since this would be expensive to do per packet, the burden falls on the
driver writer. Drivers *have* to call page_pool_update_nid() or
page_pool_nid_changed() if they want to check for that, which runs once
per NAPI cycle.
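
For readers unfamiliar with the drivers involved, the per-NAPI call looks
roughly like the sketch below (the drv_* names and struct are made up; mlx5
does the equivalent from its NAPI poll routine):

	struct drv_rxq {			/* made-up driver state */
		struct napi_struct napi;
		struct page_pool *page_pool;
	};

	static int drv_napi_poll(struct napi_struct *napi, int budget)
	{
		struct drv_rxq *rxq = container_of(napi, struct drv_rxq, napi);
		int work_done;

		/* Once per NAPI cycle: if the ring has migrated to a CPU on a
		 * different memory node, let the pool follow it.
		 */
		page_pool_nid_changed(rxq->page_pool, numa_mem_id());

		work_done = drv_clean_rx(rxq, budget);	/* made-up rx-clean helper */
		if (work_done < budget)
			napi_complete_done(napi, work_done);

		return work_done;
	}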

The current code in the API though does not account for NUMA_NO_NODE. That's
what this is trying to fix.
If the page_pool params are initialized with that, we *never* recycle
the memory. This is happening because the API is allocating memory with 
'nid = numa_mem_id()' if NUMA_NO_NODE is configured so the current if statement
'page_to_nid(page) == pool->p.nid' will never trigger.

The initial proposal was to check:
pool->p.nid == NUMA_NO_NODE && page_to_nid(page) == numa_mem_id()));

After that, the thread spun out of control :)
My question is: do we *really* have to check for
page_to_nid(page) == numa_mem_id()? If the architecture is not NUMA aware,
wouldn't pool->p.nid == NUMA_NO_NODE be enough?

Thanks
/Ilias
> -- 
> Michal Hocko
> SUSE Labs
Michal Hocko Dec. 16, 2019, 1:08 p.m. UTC | #28
On Mon 16-12-19 14:34:26, Ilias Apalodimas wrote:
> Hi Michal, 
> On Mon, Dec 16, 2019 at 01:15:57PM +0100, Michal Hocko wrote:
> > On Thu 12-12-19 09:34:14, Yunsheng Lin wrote:
> > > +CC Michal, Peter, Greg and Bjorn
> > > Because there has been disscusion about where and how the NUMA_NO_NODE
> > > should be handled before.
> > 
> > I do not have a full context. What is the question here?
> 
> When we allocate pages for the page_pool API, during the init, the driver writer
> decides which NUMA node to use. The API can,  in some cases recycle the memory,
> instead of freeing it and re-allocating it. If the NUMA node has changed (irq
> affinity for example), we forbid recycling and free the memory, since recycling
> and using memory on far NUMA nodes is more expensive (more expensive than
> recycling, at least on the architectures we tried anyway).
> Since this would be expensive to do it per packet, the burden falls on the 
> driver writer for that. Drivers *have* to call page_pool_update_nid() or 
> page_pool_nid_changed() if they want to check for that which runs once
> per NAPI cycle.

Thanks for the clarification.

> The current code in the API though does not account for NUMA_NO_NODE. That's
> what this is trying to fix.
> If the page_pool params are initialized with that, we *never* recycle
> the memory. This is happening because the API is allocating memory with 
> 'nid = numa_mem_id()' if NUMA_NO_NODE is configured so the current if statement
> 'page_to_nid(page) == pool->p.nid' will never trigger.

OK. There is no explicit mention of the expected behavior for
NUMA_NO_NODE. The semantic is usually that there is no NUMA placement
requirement and the MM code simply starts the allocation from a local node
in that case. But the memory might come from any node so there is no
"local node" guarantee.

So the main question is what is the expected semantic? Do people expect
that NUMA_NO_NODE implies locality? Why don't you simply always reuse
when there was no explicit numa requirement?

> The initial proposal was to check:
> pool->p.nid == NUMA_NO_NODE && page_to_nid(page) == numa_mem_id()));

> After that the thread span out of control :)
> My question is do we *really* have to check for 
> page_to_nid(page) == numa_mem_id()? if the architecture is not NUMA aware
> wouldn't pool->p.nid == NUMA_NO_NODE be enough?

If the architecture is !NUMA then numa_mem_id and page_to_nid should
always be equal, and both zero.
Ilias Apalodimas Dec. 16, 2019, 1:21 p.m. UTC | #29
On Mon, Dec 16, 2019 at 02:08:45PM +0100, Michal Hocko wrote:
> On Mon 16-12-19 14:34:26, Ilias Apalodimas wrote:
> > Hi Michal, 
> > On Mon, Dec 16, 2019 at 01:15:57PM +0100, Michal Hocko wrote:
> > > On Thu 12-12-19 09:34:14, Yunsheng Lin wrote:
> > > > +CC Michal, Peter, Greg and Bjorn
> > > > Because there has been disscusion about where and how the NUMA_NO_NODE
> > > > should be handled before.
> > > 
> > > I do not have a full context. What is the question here?
> > 
> > When we allocate pages for the page_pool API, during the init, the driver writer
> > decides which NUMA node to use. The API can,  in some cases recycle the memory,
> > instead of freeing it and re-allocating it. If the NUMA node has changed (irq
> > affinity for example), we forbid recycling and free the memory, since recycling
> > and using memory on far NUMA nodes is more expensive (more expensive than
> > recycling, at least on the architectures we tried anyway).
> > Since this would be expensive to do it per packet, the burden falls on the 
> > driver writer for that. Drivers *have* to call page_pool_update_nid() or 
> > page_pool_nid_changed() if they want to check for that which runs once
> > per NAPI cycle.
> 
> Thanks for the clarification.
> 
> > The current code in the API though does not account for NUMA_NO_NODE. That's
> > what this is trying to fix.
> > If the page_pool params are initialized with that, we *never* recycle
> > the memory. This is happening because the API is allocating memory with 
> > 'nid = numa_mem_id()' if NUMA_NO_NODE is configured so the current if statement
> > 'page_to_nid(page) == pool->p.nid' will never trigger.
> 
> OK. There is no explicit mention of the expected behavior for
> NUMA_NO_NODE. The semantic is usually that there is no NUMA placement
> requirement and the MM code simply starts the allocate from a local node
> in that case. But the memory might come from any node so there is no
> "local node" guarantee.
> 
> So the main question is what is the expected semantic? Do people expect
> that NUMA_NO_NODE implies locality? Why don't you simply always reuse
> when there was no explicit numa requirement?
> 

Well they shouldn't. Hence my next proposal. I think we are pretty much saying
the same thing here. 
If the driver defines NUMA_NO_NODE, just blindly recycle memory.

> > The initial proposal was to check:
> > pool->p.nid == NUMA_NO_NODE && page_to_nid(page) == numa_mem_id()));
> 
> > After that the thread span out of control :)
> > My question is do we *really* have to check for 
> > page_to_nid(page) == numa_mem_id()? if the architecture is not NUMA aware
> > wouldn't pool->p.nid == NUMA_NO_NODE be enough?
> 
> If the architecture is !NUMA then numa_mem_id and page_to_nid should
> always equal and be both zero.
> 

Ditto

> -- 
> Michal Hocko
> SUSE Labs


Thanks
/Ilias
Yunsheng Lin Dec. 17, 2019, 2:11 a.m. UTC | #30
On 2019/12/16 21:21, Ilias Apalodimas wrote:
> On Mon, Dec 16, 2019 at 02:08:45PM +0100, Michal Hocko wrote:
>> On Mon 16-12-19 14:34:26, Ilias Apalodimas wrote:
>>> Hi Michal, 
>>> On Mon, Dec 16, 2019 at 01:15:57PM +0100, Michal Hocko wrote:
>>>> On Thu 12-12-19 09:34:14, Yunsheng Lin wrote:
>>>>> +CC Michal, Peter, Greg and Bjorn
>>>>> Because there has been disscusion about where and how the NUMA_NO_NODE
>>>>> should be handled before.
>>>>
>>>> I do not have a full context. What is the question here?
>>>
>>> When we allocate pages for the page_pool API, during the init, the driver writer
>>> decides which NUMA node to use. The API can,  in some cases recycle the memory,
>>> instead of freeing it and re-allocating it. If the NUMA node has changed (irq
>>> affinity for example), we forbid recycling and free the memory, since recycling
>>> and using memory on far NUMA nodes is more expensive (more expensive than
>>> recycling, at least on the architectures we tried anyway).
>>> Since this would be expensive to do it per packet, the burden falls on the 
>>> driver writer for that. Drivers *have* to call page_pool_update_nid() or 
>>> page_pool_nid_changed() if they want to check for that which runs once
>>> per NAPI cycle.
>>
>> Thanks for the clarification.
>>
>>> The current code in the API though does not account for NUMA_NO_NODE. That's
>>> what this is trying to fix.
>>> If the page_pool params are initialized with that, we *never* recycle
>>> the memory. This is happening because the API is allocating memory with 
>>> 'nid = numa_mem_id()' if NUMA_NO_NODE is configured so the current if statement
>>> 'page_to_nid(page) == pool->p.nid' will never trigger.
>>
>> OK. There is no explicit mention of the expected behavior for
>> NUMA_NO_NODE. The semantic is usually that there is no NUMA placement
>> requirement and the MM code simply starts the allocate from a local node
>> in that case. But the memory might come from any node so there is no
>> "local node" guarantee.
>>
>> So the main question is what is the expected semantic? Do people expect
>> that NUMA_NO_NODE implies locality? Why don't you simply always reuse
>> when there was no explicit numa requirement?

For drivers that do not support page pool yet, NUMA_NO_NODE seems to
imply locality, see [1].

And for those drivers, locality is decided by rx interrupt affinity, not
dev_to_node(). So when the rx interrupt affinity changes, the old pages from the
old node will not be recycled (checked via page_to_nid(page) == numa_mem_id()),
new pages will be allocated to replace the old pages, and the new pages will
be recycled because allocation and recycling happen in the same context,
which means numa_mem_id() returns the same node as the newly allocated pages, see
[2].
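
The driver-local reuse test referenced in [2] is essentially of this form
(paraphrased; the helper name varies per driver):

	static inline bool drv_page_is_reusable(struct page *page)
	{
		/* Reuse only pages on the node of the CPU doing RX clean-up,
		 * and never emergency (pfmemalloc) pages.
		 */
		return page_to_nid(page) == numa_mem_id() &&
		       !page_is_pfmemalloc(page);
	}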

As to why we do not allocate and recycle pages based on dev_to_node(), Jesper has
explained it better than me, see [3]:

"(Based on the optimizations done for Facebook (the reason we added this))
What seems to matter is the NUMA node of CPU that runs RX NAPI-polling,
this is the first CPU that touch the packet memory and struct-page
memory.  For these drivers, it is also the same "RX-CPU" that does the
allocation of new pages (to refill the RX-ring), and these "new" pages
can come from the page_pool."

So maybe the tricky part for page pool is to find the same context to
allocate and recycle pages, so that NUMA_NO_NODE can be used to allocate
pages and numa_mem_id() can be used to decide recycling. Or update the
node dynamically as proposed by Rongqing, see [4].


[1] https://elixir.bootlin.com/linux/v5.5-rc1/ident/dev_alloc_pages
[2] https://elixir.bootlin.com/linux/v5.5-rc1/source/drivers/net/ethernet/intel/i40e/i40e_txrx.c#L1856
[3] https://lkml.org/lkml/2019/12/13/132
[4] https://lkml.org/lkml/2019/12/15/353
>>
> 
> Well they shouldn't. Hence my next proposal. I think we are pretty much saying
> the same thing here. 
> If the driver defines NUMA_NO_NODE, just blindly recycle memory.

Not really. see above.

> 
>>> The initial proposal was to check:
>>> pool->p.nid == NUMA_NO_NODE && page_to_nid(page) == numa_mem_id()));
>>
>>> After that the thread span out of control :)
>>> My question is do we *really* have to check for 
>>> page_to_nid(page) == numa_mem_id()? if the architecture is not NUMA aware
>>> wouldn't pool->p.nid == NUMA_NO_NODE be enough?
>>
>> If the architecture is !NUMA then numa_mem_id and page_to_nid should
>> always equal and be both zero.

The tricky one is that dev_to_node() always returns -1 when CONFIG_NUMA is
not defined, and some drivers may use that to allocate pages, and the node of
the page may be checked against dev_to_node() to decide whether to recycle.

>>
> 
> Ditto
> 
>> -- 
>> Michal Hocko
>> SUSE Labs
> 
> 
> Thanks
> /Ilias
> 
> .
>
Michal Hocko Dec. 17, 2019, 9:11 a.m. UTC | #31
On Tue 17-12-19 10:11:12, Yunsheng Lin wrote:
> On 2019/12/16 21:21, Ilias Apalodimas wrote:
> > On Mon, Dec 16, 2019 at 02:08:45PM +0100, Michal Hocko wrote:
> >> On Mon 16-12-19 14:34:26, Ilias Apalodimas wrote:
> >>> Hi Michal, 
> >>> On Mon, Dec 16, 2019 at 01:15:57PM +0100, Michal Hocko wrote:
> >>>> On Thu 12-12-19 09:34:14, Yunsheng Lin wrote:
> >>>>> +CC Michal, Peter, Greg and Bjorn
> >>>>> Because there has been disscusion about where and how the NUMA_NO_NODE
> >>>>> should be handled before.
> >>>>
> >>>> I do not have a full context. What is the question here?
> >>>
> >>> When we allocate pages for the page_pool API, during the init, the driver writer
> >>> decides which NUMA node to use. The API can,  in some cases recycle the memory,
> >>> instead of freeing it and re-allocating it. If the NUMA node has changed (irq
> >>> affinity for example), we forbid recycling and free the memory, since recycling
> >>> and using memory on far NUMA nodes is more expensive (more expensive than
> >>> recycling, at least on the architectures we tried anyway).
> >>> Since this would be expensive to do it per packet, the burden falls on the 
> >>> driver writer for that. Drivers *have* to call page_pool_update_nid() or 
> >>> page_pool_nid_changed() if they want to check for that which runs once
> >>> per NAPI cycle.
> >>
> >> Thanks for the clarification.
> >>
> >>> The current code in the API though does not account for NUMA_NO_NODE. That's
> >>> what this is trying to fix.
> >>> If the page_pool params are initialized with that, we *never* recycle
> >>> the memory. This is happening because the API is allocating memory with 
> >>> 'nid = numa_mem_id()' if NUMA_NO_NODE is configured so the current if statement
> >>> 'page_to_nid(page) == pool->p.nid' will never trigger.
> >>
> >> OK. There is no explicit mention of the expected behavior for
> >> NUMA_NO_NODE. The semantic is usually that there is no NUMA placement
> >> requirement and the MM code simply starts the allocate from a local node
> >> in that case. But the memory might come from any node so there is no
> >> "local node" guarantee.
> >>
> >> So the main question is what is the expected semantic? Do people expect
> >> that NUMA_NO_NODE implies locality? Why don't you simply always reuse
> >> when there was no explicit numa requirement?
> 
> For driver that has not supported page pool yet, NUMA_NO_NODE seems to
> imply locality, see [1].

Which is kinda awkward, no? Is there any reason for __dev_alloc_pages to
not use numa_mem_id explicitly when the local node affinity is required?
There is no real guarantee that NUMA_NO_NODE is going to imply local
node and we do not want to grow any subtle dependency on that behavior.

> And for those drivers, locality is decided by rx interrupt affinity, not
> dev_to_node(). So when rx interrupt affinity changes, the old page from old
> node will not be recycled(by checking page_to_nid(page) == numa_mem_id()),
> new pages will be allocated to replace the old pages and the new pages will
> be recycled because allocation and recycling happens in the same context,
> which means numa_mem_id() returns the same node of new page allocated, see
> [2].

Well, but my understanding is that the generic page pool implementation
has a clear means to change the affinity (namely page_pool_update_nid()).
So my primary question is, why should NUMA_NO_NODE be used as a
bypass for that?
Saeed Mahameed Dec. 17, 2019, 7:27 p.m. UTC | #32
On Thu, 2019-12-12 at 11:18 +0100, Jesper Dangaard Brouer wrote:
> On Thu, 12 Dec 2019 09:34:14 +0800
> Yunsheng Lin <linyunsheng@huawei.com> wrote:
> 
> > +CC Michal, Peter, Greg and Bjorn
> > Because there has been disscusion about where and how the
> > NUMA_NO_NODE
> > should be handled before.
> > 
> > On 2019/12/12 5:24, Saeed Mahameed wrote:
> > > On Wed, 2019-12-11 at 19:49 +0100, Jesper Dangaard Brouer
> > > wrote:  
> > > > On Sat, 7 Dec 2019 03:52:41 +0000
> > > > Saeed Mahameed <saeedm@mellanox.com> wrote:
> > > >  
> > > > > I don't think it is correct to check that the page nid is
> > > > > same as
> > > > > numa_mem_id() if pool is NUMA_NO_NODE. In such case we should
> > > > > allow
> > > > > all  pages to recycle, because you can't assume where pages
> > > > > are
> > > > > allocated from and where they are being handled.  
> > > > 
> > > > I agree, using numa_mem_id() is not valid, because it takes the
> > > > numa
> > > > node id from the executing CPU and the call to
> > > > __page_pool_put_page()
> > > > can happen on a remote CPU (e.g. cpumap redirect, and in future
> > > > SKBs).
> > > > 
> > > >  
> > > > > I suggest the following:
> > > > > 
> > > > > return !page_pfmemalloc() && 
> > > > > ( page_to_nid(page) == pool->p.nid || pool->p.nid ==
> > > > > NUMA_NO_NODE );  
> > > > 
> > > > Above code doesn't generate optimal ASM code, I suggest:
> > > > 
> > > >  static bool pool_page_reusable(struct page_pool *pool, struct
> > > > page *page)
> > > >  {
> > > > 	return !page_is_pfmemalloc(page) &&
> > > > 		pool->p.nid != NUMA_NO_NODE &&
> > > > 		page_to_nid(page) == pool->p.nid;
> > > >  }
> > > >  
> > > 
> > > this is not equivalent to the above. Here in case pool->p.nid is
> > > NUMA_NO_NODE, pool_page_reusable() will always be false.
> > > 
> > > We can avoid the extra check in data path.
> > > How about avoiding NUMA_NO_NODE in page_pool altogether, and
> > > force
> > > numa_mem_id() as pool->p.nid when user requests NUMA_NO_NODE at
> > > page
> > > pool init, as already done in alloc_pages_node().   
> > 
> > That means we will not support page reuse mitigation for
> > NUMA_NO_NODE,
> > which is not same semantic that alloc_pages_node() handle
> > NUMA_NO_NODE,
> > because alloc_pages_node() will allocate the page based on the node
> > of the current running cpu.
> 
> True, as I wrote (below) my code defines semantics as: that a
> page_pool
> configured with NUMA_NO_NODE means skip NUMA checks, and allow
> recycle

Your code will NOT allow recycling when NUMA_NO_NODE is configured,
so i am not sure what semantics you are referring to :)

> regardless of NUMA node page belong to.  It seems that you want
> another
> semantics.
> 

I think that the semantics we want are to allow recycling when
NUMA_NO_NODE is selected, to solve the reported issue, no?

> I'm open to other semantics. My main concern is performance.  The
> page_pool fast-path for driver recycling use-case of XDP_DROP, have
> extreme performance requirements, as it needs to compete with driver
> local recycle tricks (else we cannot use page_pool to simplify
> drivers).
> The extreme performance target is 100Gbit/s = 148Mpps = 6.72ns, and
> in practice I'm measuring 25Mpps = 40ns with Mlx5 driver (single q),
> and Bjørn is showing 30 Mpps = 33.3ns with i40e.  At this level every
> cycle/instruction counts.
> 

I agree.

>  
> > Also, There seems to be a wild guessing of the node id here, which
> > has
> > been disscussed before and has not reached a agreement yet.
> > 
> > > which will imply recycling without adding any extra condition to
> > > the
> > > data path.
> 
> I love code that moves thing out of our fast-path.
> 
> > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > > index a6aefe989043..00c99282a306 100644
> > > --- a/net/core/page_pool.c
> > > +++ b/net/core/page_pool.c
> > > @@ -28,6 +28,9 @@ static int page_pool_init(struct page_pool
> > > *pool,
> > >  
> > >         memcpy(&pool->p, params, sizeof(pool->p));
> > >  
> > > +	/* overwrite to allow recycling.. */
> > > +       if (pool->p.nid == NUMA_NO_NODE) 
> > > +               pool->p.nid = numa_mem_id(); 
> > > +
> 
> The problem is that page_pool_init() is can be initiated from a
> random
> CPU, first at driver setup/bringup, and later at other queue changes
> that can be started via ethtool or XDP attach. (numa_mem_id() picks
> from running CPU).
> 

Well, if the user selected NUMA_NO_NODE, then it is his choice..
Also the user always has the ability to change pool->p.nid, using the
API i introduced, so i don't see any issue with this.

> As Yunsheng mentioned elsewhere, there is also a dev_to_node()
> function.
> Isn't that what we want in a place like this?
> 

We should keep this outside page_pool; if the user wants dev_to_node, he
can set this as a parameter to the page pool from outside. When
NUMA_NO_NODE is selected, this means the user doesn't really care.

> 
> One issue with dev_to_node() is that in case of !CONFIG_NUMA it
> returns
> NUMA_NO_NODE (-1).  (And device_initialize() also set it to
> -1).  Thus,
> in that case we set pool->p.nid = 0, as page_to_nid() will also
> return
> zero in that case (as far as I follow the code).
> 

Yes, this is the idea: since alloc_page will also select cpu 0 if you
keep NUMA_NO_NODE, recycling will happen inherently by design
without any change to the data path.

> > > After a quick look, i don't see any reason why to keep
> > > NUMA_NO_NODE in
> > > pool->p.nid.. 
> > > 
> > >   
> > > > I have compiled different variants and looked at the ASM code
> > > > generated by GCC.  This seems to give the best result.
> > > > 
> > > >  
> > > > > 1) never recycle emergency pages, regardless of pool nid.
> > > > > 2) always recycle if pool is NUMA_NO_NODE.  
> > > > 
> > > > Yes, this defines the semantics, that a page_pool configured
> > > > with
> > > > NUMA_NO_NODE means skip NUMA checks.  I think that sounds
> > > > okay...
> > > > 
> > > >  
> > > > > the above change should not add any overhead, a modest branch
> > > > > predictor will handle this with no effort.  
> > > > 
> > > > It still annoys me that we keep adding instructions to this
> > > > code
> > > > hot-path (I counted 34 bytes and 11 instructions in my proposed
> > > > function).
> > > > 
> > > > I think that it might be possible to move these NUMA checks to
> > > > alloc-side (instead of return/recycles side as today), and
> > > > perhaps
> > > > only on slow-path when dequeuing from ptr_ring (as recycles
> > > > that
> > > > call __page_pool_recycle_direct() will be pinned during NAPI).
> > > > But lets focus on a smaller fix for the immediate issue...
> > > >  
> > > 
> > > I know. It annoys me too, but we need recycling to work in
> > > production : where rings/napi can migrate and numa nodes can be
> > > NUMA_NO_NODE :-(.
> > > 
> > >
Saeed Mahameed Dec. 17, 2019, 7:35 p.m. UTC | #33
On Fri, 2019-12-13 at 11:40 +0800, Yunsheng Lin wrote:
> On 2019/12/12 18:18, Jesper Dangaard Brouer wrote:
> > On Thu, 12 Dec 2019 09:34:14 +0800
> > Yunsheng Lin <linyunsheng@huawei.com> wrote:
> > 
> > > +CC Michal, Peter, Greg and Bjorn
> > > Because there has been disscusion about where and how the
> > > NUMA_NO_NODE
> > > should be handled before.
> > > 
> > > On 2019/12/12 5:24, Saeed Mahameed wrote:
> > > > On Wed, 2019-12-11 at 19:49 +0100, Jesper Dangaard Brouer
> > > > wrote:  
> > > > > On Sat, 7 Dec 2019 03:52:41 +0000
> > > > > Saeed Mahameed <saeedm@mellanox.com> wrote:
> > > > >  
> > > > > > I don't think it is correct to check that the page nid is
> > > > > > same as
> > > > > > numa_mem_id() if pool is NUMA_NO_NODE. In such case we
> > > > > > should allow
> > > > > > all  pages to recycle, because you can't assume where pages
> > > > > > are
> > > > > > allocated from and where they are being handled.  
> > > > > 
> > > > > I agree, using numa_mem_id() is not valid, because it takes
> > > > > the numa
> > > > > node id from the executing CPU and the call to
> > > > > __page_pool_put_page()
> > > > > can happen on a remote CPU (e.g. cpumap redirect, and in
> > > > > future
> > > > > SKBs).
> > > > > 
> > > > >  
> > > > > > I suggest the following:
> > > > > > 
> > > > > > return !page_pfmemalloc() && 
> > > > > > ( page_to_nid(page) == pool->p.nid || pool->p.nid ==
> > > > > > NUMA_NO_NODE );  
> > > > > 
> > > > > Above code doesn't generate optimal ASM code, I suggest:
> > > > > 
> > > > >  static bool pool_page_reusable(struct page_pool *pool,
> > > > > struct page *page)
> > > > >  {
> > > > > 	return !page_is_pfmemalloc(page) &&
> > > > > 		pool->p.nid != NUMA_NO_NODE &&
> > > > > 		page_to_nid(page) == pool->p.nid;
> > > > >  }
> > > > >  
> > > > 
> > > > this is not equivalent to the above. Here in case pool->p.nid
> > > > is
> > > > NUMA_NO_NODE, pool_page_reusable() will always be false.
> > > > 
> > > > We can avoid the extra check in data path.
> > > > How about avoiding NUMA_NO_NODE in page_pool altogether, and
> > > > force
> > > > numa_mem_id() as pool->p.nid when user requests NUMA_NO_NODE at
> > > > page
> > > > pool init, as already done in alloc_pages_node().   
> > > 
> > > That means we will not support page reuse mitigation for
> > > NUMA_NO_NODE,
> > > which is not same semantic that alloc_pages_node() handle
> > > NUMA_NO_NODE,
> > > because alloc_pages_node() will allocate the page based on the
> > > node
> > > of the current running cpu.
> > 
> > True, as I wrote (below) my code defines semantics as: that a
> > page_pool
> > configured with NUMA_NO_NODE means skip NUMA checks, and allow
> > recycle
> > regardless of NUMA node page belong to.  It seems that you want
> > another
> > semantics.
> 
> For driver that does not have page pool support yet, the semantics
> seems
> to be: always allocate and recycle local page(excpet pfmemalloc one).
> Page
> reuse migration moves when the rx interrupt affinity moves(the NAPI
> polling
> context moves accordingly) regardless of the dev_to_node().
> 
> It would be good to maintain the above semantics. And rx data page
> seems
> to be close to the cpu that doing the rx cleaning, which means the
> cpu
> can process the buffer quicker?
> 
> > I'm open to other semantics. My main concern is performance.  The
> > page_pool fast-path for driver recycling use-case of XDP_DROP, have
> > extreme performance requirements, as it needs to compete with
> > driver
> > local recycle tricks (else we cannot use page_pool to simplify
> > drivers).
> > The extreme performance target is 100Gbit/s = 148Mpps = 6.72ns, and
> > in practice I'm measuring 25Mpps = 40ns with Mlx5 driver (single
> > q),
> > and Bjørn is showing 30 Mpps = 33.3ns with i40e.  At this level
> > every
> > cycle/instruction counts.
> 
> Yes, the performance is a concern too.
> But if the rx page is closer to the cpu, maybe the time taken to
> process
> the buffer can be reduced?
> 
> It is good to allocate the rx page close to both cpu and device, but
> if
> both goal can not be reached, maybe we choose to allocate page that
> close
> to cpu?
> 

This should be up to the user, page pool shouldn't care much.
At init, the user picks whatever node he wants.
On the data path/napi, recycling will initially only occur on the
initial pool->p.nid; dynamically, the user can change it via
page_pool_nid_changed();

Using dev_to_node() or numa_mem_id() are totally different scenarios
for different use cases.

Normally you should use numa_mem_id() (close to the cpu) for RX data
buffers, and dev_to_node() (close to the device) for HW ring descriptors.

If we make the pool biased to one use case then we can't support the
other, thus keep this outside of the pool.
The user must pass the nid at initialization and change it on the fly if
necessary.

If the user chooses NUMA_NO_NODE, then the pool might decide to go with
numa_mem_id(); that is my latest proposed solution for this, and it will
at least solve the regression.
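
To make that concrete, here is a rough sketch (not from any in-tree
driver; the foo_* names and rx_cpu are made up) of where the calls would
sit, assuming the driver knows which CPU its RX queue is pinned to:

/* init path: keep RX data buffers close to the CPU running NAPI;
 * descriptor rings would separately be allocated with dev_to_node().
 */
static struct page_pool *foo_create_rx_pool(struct net_device *netdev,
					    int rx_cpu)
{
	struct page_pool_params pp_params = {
		.order		= 0,
		.pool_size	= 1024,
		.nid		= cpu_to_mem(rx_cpu),	/* not NUMA_NO_NODE */
		.dev		= netdev->dev.parent,
		.dma_dir	= DMA_FROM_DEVICE,
	};

	return page_pool_create(&pp_params);
}

/* NAPI poll: follow irq affinity changes; this is a no-op unless the
 * node actually differs (napi_complete_done() etc. omitted).
 */
static int foo_napi_poll(struct napi_struct *napi, int budget)
{
	struct foo_rq *rq = container_of(napi, struct foo_rq, napi);

	page_pool_nid_changed(rq->page_pool, numa_mem_id());

	return foo_clean_rx(rq, budget);
}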


> >  
> > > Also, there seems to be some wild guessing of the node id here,
> > > which has been discussed before and has not reached an agreement
> > > yet.
> > > 
> > > > which will imply recycling without adding any extra condition
> > > > to the
> > > > data path.
> > 
> > I love code that moves things out of our fast-path.
> > 
> > > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > > > index a6aefe989043..00c99282a306 100644
> > > > --- a/net/core/page_pool.c
> > > > +++ b/net/core/page_pool.c
> > > > @@ -28,6 +28,9 @@ static int page_pool_init(struct page_pool
> > > > *pool,
> > > >  
> > > >         memcpy(&pool->p, params, sizeof(pool->p));
> > > >  
> > > > +	/* overwrite to allow recycling.. */
> > > > +       if (pool->p.nid == NUMA_NO_NODE) 
> > > > +               pool->p.nid = numa_mem_id(); 
> > > > +
> > 
> > The problem is that page_pool_init() can be initiated from a random
> > CPU, first at driver setup/bringup, and later at other queue changes
> > that can be started via ethtool or XDP attach. (numa_mem_id() picks
> > from the running CPU).
> 
> Yes, changing the ring number or ring depth releases and allocates rx
> data pages, so using NUMA_NO_NODE to allocate pages and always
> recycling those pages may have a performance difference noticeable to
> the user.
> 
> > As Yunsheng mentioned elsewhere, there is also a dev_to_node()
> > function.
> > Isn't that what we want in a place like this?
> > 
> > 
> > One issue with dev_to_node() is that in case of !CONFIG_NUMA it
> > returns
> > NUMA_NO_NODE (-1).  (And device_initialize() also sets it to
> > -1).  Thus,
> > in that case we set pool->p.nid = 0, as page_to_nid() will also
> > return
> > zero in that case (as far as I follow the code).
> > 
> > 
> > > > After a quick look, I don't see any reason to keep NUMA_NO_NODE
> > > > in pool->p.nid..
> > > > 
> > > >   
> > > > > I have compiled different variants and looked at the ASM code
> > > > > generated by GCC.  This seems to give the best result.
> > > > > 
> > > > >  
> > > > > > 1) never recycle emergency pages, regardless of pool nid.
> > > > > > 2) always recycle if pool is NUMA_NO_NODE.  
> > > > > 
> > > > > Yes, this defines the semantics, that a page_pool configured
> > > > > with
> > > > > NUMA_NO_NODE means skip NUMA checks.  I think that sounds
> > > > > okay...
> > > > > 
> > > > >  
> > > > > > the above change should not add any overhead, a modest
> > > > > > branch
> > > > > > predictor will handle this with no effort.  
> > > > > 
> > > > > It still annoys me that we keep adding instructions to this
> > > > > code
> > > > > hot-path (I counted 34 bytes and 11 instructions in my
> > > > > proposed
> > > > > function).
> > > > > 
> > > > > I think that it might be possible to move these NUMA checks
> > > > > to
> > > > > alloc-side (instead of return/recycles side as today), and
> > > > > perhaps
> > > > > only on slow-path when dequeuing from ptr_ring (as recycles
> > > > > that
> > > > > call __page_pool_recycle_direct() will be pinned during
> > > > > NAPI).
> > > > > But let's focus on a smaller fix for the immediate issue...
> > > > >  
> > > > 
> > > > I know. It annoys me too, but we need recycling to work in
> > > > production : where rings/napi can migrate and numa nodes can be
> > > > NUMA_NO_NODE :-(.
> > > > 
> > > >
Yunsheng Lin Dec. 19, 2019, 2:09 a.m. UTC | #34
On 2019/12/17 17:11, Michal Hocko wrote:
> On Tue 17-12-19 10:11:12, Yunsheng Lin wrote:
>> On 2019/12/16 21:21, Ilias Apalodimas wrote:
>>> On Mon, Dec 16, 2019 at 02:08:45PM +0100, Michal Hocko wrote:
>>>> On Mon 16-12-19 14:34:26, Ilias Apalodimas wrote:
>>>>> Hi Michal, 
>>>>> On Mon, Dec 16, 2019 at 01:15:57PM +0100, Michal Hocko wrote:
>>>>>> On Thu 12-12-19 09:34:14, Yunsheng Lin wrote:
>>>>>>> +CC Michal, Peter, Greg and Bjorn
>>>>>>> Because there has been disscusion about where and how the NUMA_NO_NODE
>>>>>>> should be handled before.
>>>>>>
>>>>>> I do not have a full context. What is the question here?
>>>>>
>>>>> When we allocate pages for the page_pool API, during init the driver writer
>>>>> decides which NUMA node to use. The API can, in some cases, recycle the memory
>>>>> instead of freeing it and re-allocating it. If the NUMA node has changed (irq
>>>>> affinity for example), we forbid recycling and free the memory, since recycling
>>>>> and using memory on far NUMA nodes is more expensive (more expensive than
>>>>> recycling, at least on the architectures we tried anyway).
>>>>> Since this would be expensive to do per packet, the burden falls on the
>>>>> driver writer for that. Drivers *have* to call page_pool_update_nid() or
>>>>> page_pool_nid_changed() if they want that check, which runs once
>>>>> per NAPI cycle.
>>>>
>>>> Thanks for the clarification.
>>>>
>>>>> The current code in the API though does not account for NUMA_NO_NODE. That's
>>>>> what this is trying to fix.
>>>>> If the page_pool params are initialized with that, we *never* recycle
>>>>> the memory. This is happening because the API is allocating memory with 
>>>>> 'nid = numa_mem_id()' if NUMA_NO_NODE is configured so the current if statement
>>>>> 'page_to_nid(page) == pool->p.nid' will never trigger.
>>>>
>>>> OK. There is no explicit mention of the expected behavior for
>>>> NUMA_NO_NODE. The semantic is usually that there is no NUMA placement
>>>> requirement and the MM code simply starts the allocate from a local node
>>>> in that case. But the memory might come from any node so there is no
>>>> "local node" guarantee.
>>>>
>>>> So the main question is what is the expected semantic? Do people expect
>>>> that NUMA_NO_NODE implies locality? Why don't you simply always reuse
>>>> when there was no explicit numa requirement?
>>
>> For drivers that do not support page pool yet, NUMA_NO_NODE seems to
>> imply locality, see [1].
> 
> Which is kinda awkward, no? Is there any reason for __dev_alloc_pages to
> not use numa_mem_id explicitly when the local node affinity is required?

Not sure why.

And it seems other places in the net code use NUMA_NO_NODE too:

grep -rn "NUMA_NO_NODE" net
net/openvswitch/flow_table.c:85:                                      node_online(0) ? 0 : NUMA_NO_NODE);
net/core/skbuff.c:436:          skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
net/core/skbuff.c:507:          skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
net/core/skbuff.c:1514:                                 skb_alloc_rx_flag(skb), NUMA_NO_NODE);
net/core/skbuff.c:1553: struct sk_buff *n = __alloc_skb(size, gfp_mask, flags, NUMA_NO_NODE);
net/core/skbuff.c:1629:                        gfp_mask, NUMA_NO_NODE, NULL);
net/core/skbuff.c:1747:                                 NUMA_NO_NODE);
net/core/skbuff.c:3811:                                    NUMA_NO_NODE);
net/core/skbuff.c:5720:                        gfp_mask, NUMA_NO_NODE, NULL);
net/core/skbuff.c:5844:                        gfp_mask, NUMA_NO_NODE, NULL);
net/core/dev.c:2378:                            NUMA_NO_NODE);
net/core/dev.c:2617:                                         numa_node_id : NUMA_NO_NODE);
net/core/dev.c:9117:    netdev_queue_numa_node_write(queue, NUMA_NO_NODE);
net/core/pktgen.c:3632: pkt_dev->node = NUMA_NO_NODE;
net/sunrpc/svc.c:296:   return NUMA_NO_NODE;
net/qrtr/qrtr.c:97:static unsigned int qrtr_local_nid = NUMA_NO_NODE;


> There is no real guarantee that NUMA_NO_NODE is going to imply the local
> node and we do not want to grow any subtle dependency on that behavior.

Strictly speaking, using numa_mem_id() also gives no real guarantee that
it will allocate local memory when local memory is used up, right?
And alloc_pages_node() basically turns the node into numa_mem_id()
when it is NUMA_NO_NODE.

Unless we disallow passing NUMA_NO_NODE to alloc_pages_node(), I cannot
see a difference between NUMA_NO_NODE and numa_mem_id().
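
(For reference, alloc_pages_node() in include/linux/gfp.h is roughly the
following -- quoting from memory, so check the exact tree:

static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
					    unsigned int order)
{
	if (nid == NUMA_NO_NODE)
		nid = numa_mem_id();

	return __alloc_pages_node(nid, gfp_mask, order);
}

so a NUMA_NO_NODE caller ends up with the same node hint as an explicit
numa_mem_id() caller.)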

> 
> >> And for those drivers, locality is decided by rx interrupt affinity, not
> >> dev_to_node(). So when the rx interrupt affinity changes, old pages from the
> >> old node will not be recycled (by checking page_to_nid(page) == numa_mem_id());
> >> new pages will be allocated to replace the old pages, and the new pages will
> >> be recycled because allocation and recycling happen in the same context,
> >> which means numa_mem_id() returns the same node as the newly allocated
> >> pages, see [2].
> 
> Well, but my understanding is that the generic page pool implementation
> has a clear means to change the affinity (namely page_pool_update_nid()).
> So my primary question is, why should NUMA_NO_NODE be used as a
> bypass for that?

In that case, page_pool_update_nid() needs to be called explicitly, which
may not match reality: of the drivers using page pool now, mlx5 seems
to be the only one calling page_pool_update_nid(), which may lead to
copy & paste problems when people are not careful enough.


>
Michal Hocko Dec. 19, 2019, 11:53 a.m. UTC | #35
On Thu 19-12-19 10:09:33, Yunsheng Lin wrote:
[...]
> > There is no real guarantee that NUMA_NO_NODE is going to imply the local
> > node and we do not want to grow any subtle dependency on that behavior.
> 
> Strictly speaking, using numa_mem_id() also gives no real guarantee that
> it will allocate local memory when local memory is used up, right?
> And alloc_pages_node() basically turns the node into numa_mem_id()
> when it is NUMA_NO_NODE.

yes, both allocations are allowed to fallback to other nodes unless
there is an explicit nodemask specified.

> Unless we disallow passing NUMA_NO_NODE to alloc_pages_node(), I cannot
> see a difference between NUMA_NO_NODE and numa_mem_id().

The difference is in the stated intention. NUMA_NO_NODE means no node
preference. We turn it into an implicit local node preference because
this is the best assumption we can make in general. If you provide
numa_mem_id() then you explicitly ask for the local node preference
because you know that this is best for your specific code. See the
difference?

The NUMA_NO_NODE -> local node mapping is a heuristic which might change
(albeit unlikely).
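
A tiny illustration of that distinction (hypothetical call sites, same
allocator behavior today):

	struct page *page;

	/* "no placement preference at all" - the allocator picks,
	 * currently falling back to the local node:
	 */
	page = alloc_pages_node(NUMA_NO_NODE, GFP_KERNEL, 0);

	/* "I explicitly want memory near the CPU I am running on": */
	page = alloc_pages_node(numa_mem_id(), GFP_KERNEL, 0);

Both resolve to the same node today, but only the second states the
intent.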
 
> >> And for those drivers, locality is decided by rx interrupt affinity, not
> >> dev_to_node(). So when the rx interrupt affinity changes, old pages from the
> >> old node will not be recycled (by checking page_to_nid(page) == numa_mem_id());
> >> new pages will be allocated to replace the old pages, and the new pages will
> >> be recycled because allocation and recycling happen in the same context,
> >> which means numa_mem_id() returns the same node as the newly allocated
> >> pages, see [2].
> > 
> > Well, but my understanding is that the generic page pool implementation
> > has a clear means to change the affinity (namely page_pool_update_nid()).
> > So my primary question is, why should NUMA_NO_NODE be used as a
> > bypass for that?
> 
> In that case, page_pool_update_nid() needs to be called explicitly, which
> may not match reality: of the drivers using page pool now, mlx5 seems
> to be the only one calling page_pool_update_nid(), which may lead to
> copy & paste problems when people are not careful enough.

The API is quite new AFAIU and I think it would be better to use it in
the intended way. Relying on implicit and undocumented behavior is just
going to bend that API in the future and it will impose an additional
burden on any future changes.
diff mbox series

Patch

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index a6aefe989043..3c8b51ccd1c1 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -312,12 +312,17 @@  static bool __page_pool_recycle_direct(struct page *page,
 /* page is NOT reusable when:
  * 1) allocated when system is under some pressure. (page_is_pfmemalloc)
  * 2) belongs to a different NUMA node than pool->p.nid.
+ * 3) belongs to a different memory node than current context
+ * if pool->p.nid is NUMA_NO_NODE
  *
  * To update pool->p.nid users must call page_pool_update_nid.
  */
 static bool pool_page_reusable(struct page_pool *pool, struct page *page)
 {
-	return !page_is_pfmemalloc(page) && page_to_nid(page) == pool->p.nid;
+	return !page_is_pfmemalloc(page) &&
+		(page_to_nid(page) == pool->p.nid ||
+		(pool->p.nid == NUMA_NO_NODE &&
+		page_to_nid(page) == numa_mem_id()));
 }
 
 void __page_pool_put_page(struct page_pool *pool, struct page *page,