| Message ID | 1575624767-3343-1-git-send-email-lirongqing@baidu.com |
|---|---|
| State | Changes Requested |
| Delegated to | David Miller |
| Series | [v2] page_pool: handle page recycle for NUMA_NO_NODE condition |
On Fri, 2019-12-06 at 17:32 +0800, Li RongQing wrote:
> some drivers uses page pool, but not require to allocate
> pages from bound node, or simply assign pool.p.nid to
> NUMA_NO_NODE, and the commit d5394610b1ba ("page_pool:
> Don't recycle non-reusable pages") will block this kind
> of driver to recycle
>
> so take page as reusable when page belongs to current
> memory node if nid is NUMA_NO_NODE
>
> v1-->v2: add check with numa_mem_id from Yunsheng
>
> Fixes: d5394610b1ba ("page_pool: Don't recycle non-reusable pages")
> Signed-off-by: Li RongQing <lirongqing@baidu.com>
> Suggested-by: Yunsheng Lin <linyunsheng@huawei.com>
> Cc: Saeed Mahameed <saeedm@mellanox.com>
> ---
>  net/core/page_pool.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index a6aefe989043..3c8b51ccd1c1 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -312,12 +312,17 @@ static bool __page_pool_recycle_direct(struct page *page,
>  /* page is NOT reusable when:
>   * 1) allocated when system is under some pressure. (page_is_pfmemalloc)
>   * 2) belongs to a different NUMA node than pool->p.nid.
> + * 3) belongs to a different memory node than current context
> + *    if pool->p.nid is NUMA_NO_NODE
>   *
>   * To update pool->p.nid users must call page_pool_update_nid.
>   */
>  static bool pool_page_reusable(struct page_pool *pool, struct page *page)
>  {
> -	return !page_is_pfmemalloc(page) && page_to_nid(page) == pool->p.nid;
> +	return !page_is_pfmemalloc(page) &&
> +		(page_to_nid(page) == pool->p.nid ||
> +		 (pool->p.nid == NUMA_NO_NODE &&
> +		  page_to_nid(page) == numa_mem_id()));
>  }

Cc'ed Jesper, Ilias & Jonathan.

I don't think it is correct to check that the page nid is same as
numa_mem_id() if pool is NUMA_NO_NODE. In such case we should allow all
pages to recycle, because you can't assume where pages are allocated
from and where they are being handled.
I suggest the following:

	return !page_is_pfmemalloc(page) &&
		(page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE);

1) never recycle emergency pages, regardless of pool nid.
2) always recycle if pool is NUMA_NO_NODE.

The above change should not add any overhead; a modest branch predictor
will handle this with no effort.

Jesper et al. what do you think?

-Saeed.
On 2019/12/7 11:52, Saeed Mahameed wrote:
> On Fri, 2019-12-06 at 17:32 +0800, Li RongQing wrote:
[...]
> Cc'ed Jesper, Ilias & Jonathan.
>
> I don't think it is correct to check that the page nid is same as
> numa_mem_id() if pool is NUMA_NO_NODE.
> In such case we should allow all pages to recycle, because you can't
> assume where pages are allocated from and where they are being
> handled.
>
> I suggest the following:
>
> return !page_is_pfmemalloc(page) &&
>	(page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE);
>
> 1) never recycle emergency pages, regardless of pool nid.
> 2) always recycle if pool is NUMA_NO_NODE.

As I can see, below are the cases where pool->p.nid could be
NUMA_NO_NODE:

1. kernel built with CONFIG_NUMA off.

2. kernel built with CONFIG_NUMA on, but the FW/BIOS does not provide
   a valid node id through ACPI/DT, which has the below cases:

   a). the hardware itself is a single numa node system, so maybe it is
       valid not to provide a valid node for the device.
   b). the hardware itself is a multi numa node system, and the FW/BIOS
       is broken in that it does not provide a valid one.

3. kernel built with CONFIG_NUMA on, and the FW/BIOS does provide a
   valid node id through ACPI/DT, but the driver does not pass the
   valid node id when calling page_pool_init().

I am not sure which case this patch is trying to fix; maybe Rongqing
can help to clarify.

For case 1 and case 2 (a), I suppose checking pool->p.nid being
NUMA_NO_NODE is enough.

For case 2 (b), there may be an argument that it should be fixed in the
BIOS/FW, but it can also be argued that the numa_mem_id() check has
been done in drivers not yet using the page pool when deciding whether
to do recycling, see [1]. If I understand correctly, recycling is
normally done in NAPI polling, which has the same affinity as the rx
interrupt, and the rx interrupt is not changed very often. If it does
change to a different memory node, maybe it makes sense not to recycle
the old page belonging to the old node?

For case 3, I am not sure if any driver is doing that, or if the page
pool API even allows that?
[1] https://elixir.bootlin.com/linux/latest/ident/numa_mem_id

> the above change should not add any overhead, a modest branch
> predictor will handle this with no effort.
>
> Jesper et al. what do you think?
>
> -Saeed.
Cc: Grygorii Strashko, Ivan Khoronzhuk

I see that cpsw is using NUMA_NO_NODE when initializing its page pool.

> On 2019/12/7 11:52, Saeed Mahameed wrote:
> > On Fri, 2019-12-06 at 17:32 +0800, Li RongQing wrote:
[...]
> > Cc'ed Jesper, Ilias & Jonathan.
[...]
> For case 3, I am not sure if any driver is doing that, and if the
> page pool API even allow that?

I think pool_page_reusable should support NUMA_NO_NODE no matter which
case it is.

And recycling is normally done in NAPI polling; NUMA_NO_NODE hints to
use the local node, except memory usage is unbalance, so I added the
check that the page nid is the same as numa_mem_id() if the pool is
NUMA_NO_NODE.

-Li

> [1] https://elixir.bootlin.com/linux/latest/ident/numa_mem_id
>
> the above change should not add any overhead, a modest branch
> predictor will handle this with no effort.
>
> Jesper et al. what do you think?
>
> -Saeed.
On Mon, Dec 09, 2019 at 03:47:50AM +0000, Li,Rongqing wrote:
> > > [...]
> > > Cc'ed Jesper, Ilias & Jonathan.
[...]
> > If I understanding correctly, recycling is normally done at the
> > NAPI polling, which has the same affinity as the rx interrupt, and
> > rx interrupt is not changed very often. If it does change to
> > different memory node, maybe it makes sense not to recycle the old
> > page belongs to old node?
> >
> > For case 3, I am not sure if any driver is doing that, and if the
> > page pool API even allow that?
>
> I think pool_page_reusable should support NUMA_NO_NODE no matter
> which cases

Yes

> And recycling is normally done at the NAPI polling, NUMA_NO_NODE hint
> to use the local node, except memory usage is unbalance, so I add the
> check that the page nid is same as numa_mem_id() if pool is
> NUMA_NO_NODE

I am not sure I am following here.
Recycling done at NAPI or not should have nothing to do with NUMA.
What do you mean by 'memory usage is unbalance'?

Thanks
/Ilias
> -----Original Message-----
> From: Ilias Apalodimas [mailto:ilias.apalodimas@linaro.org]
> Sent: December 9, 2019 17:30
> To: Li,Rongqing <lirongqing@baidu.com>
> Cc: Yunsheng Lin <linyunsheng@huawei.com>; Saeed Mahameed
> <saeedm@mellanox.com>; jonathan.lemon@gmail.com;
> netdev@vger.kernel.org; brouer@redhat.com; ivan.khoronzhuk@linaro.org;
> grygorii.strashko@ti.com
> Subject: Re: Reply: [PATCH][v2] page_pool: handle page recycle for
> NUMA_NO_NODE condition
>
> On Mon, Dec 09, 2019 at 03:47:50AM +0000, Li,Rongqing wrote:
> > > > [...]
[...]
[...]
> I am not sure I am following here.
> Recycling done at NAPI or not should have nothing to do with NUMA.
> What do you mean by 'memory usage is unbalance'?

I am saying that a possible reason is that
alloc_pages_node(NUMA_NO_NODE) could not allocate a local page, which
would cause pool_page_reusable() to always return false if we add the
check that the page nid is the same as numa_mem_id(). It should rarely
happen.

-Li

> Thanks
> /Ilias
On Sat, 7 Dec 2019 03:52:41 +0000 Saeed Mahameed <saeedm@mellanox.com> wrote:

> On Fri, 2019-12-06 at 17:32 +0800, Li RongQing wrote:
[...]
> Cc'ed Jesper, Ilias & Jonathan.
>
> I don't think it is correct to check that the page nid is same as
> numa_mem_id() if pool is NUMA_NO_NODE.
> In such case we should allow all pages to recycle, because you can't
> assume where pages are allocated from and where they are being
> handled.
>
> I suggest the following:
>
> return !page_is_pfmemalloc(page) &&
>	(page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE);
>
> 1) never recycle emergency pages, regardless of pool nid.
> 2) always recycle if pool is NUMA_NO_NODE.
>
> the above change should not add any overhead, a modest branch
> predictor will handle this with no effort.
>
> Jesper et al. what do you think?

The patch description doesn't explain the problem very well.

Let's first establish what the problem is. After I took a closer look,
I do think we have a real problem here...

If the function alloc_pages_node() is called with NUMA_NO_NODE (see
below signature), then the nid is re-assigned to numa_mem_id().

Our current code checks: page_to_nid(page) == pool->p.nid, which seems
bogus, as pool->p.nid == NUMA_NO_NODE and the page NID will not return
NUMA_NO_NODE... as it was set to the locally detected numa node, right?

So, we do need a fix... but the question is: what semantics do we want?
On Mon, 2019-12-09 at 13:14 +0100, Jesper Dangaard Brouer wrote:
> On Sat, 7 Dec 2019 03:52:41 +0000
> Saeed Mahameed <saeedm@mellanox.com> wrote:
>
> > On Fri, 2019-12-06 at 17:32 +0800, Li RongQing wrote:
[...]
[...]
> The patch description doesn't explain the problem very well.
>
> Lets first establish what the problem is. After I took a closer look,
> I do think we have a real problem here...
>
> If function alloc_pages_node() is called with NUMA_NO_NODE (see below
> signature), then the nid is re-assigned to numa_mem_id().
>
> Our current code checks: page_to_nid(page) == pool->p.nid which seems
> bogus, as pool->p.nid=NUMA_NO_NODE and the page NID will not return
> NUMA_NO_NODE... as it was set to the local detect numa node, right?

right.

> So, we do need a fix... but the question is what semantics do we
> want?
maybe assume that __page_pool_recycle_direct() is always called from
the right node and change the current bogus check:

from:
	page_to_nid(page) == pool->p.nid

to:
	page_to_nid(page) == numa_mem_id()

This will allow recycling only if the handling node is the same as
where the page was allocated, regardless of pool->p.nid.

so semantics are:

1) allocate from: pool->p.nid, as chosen by user.
2) recycle when: page_to_nid(page) == numa_mem_id().
3) pool user must guarantee that the handler will run on the right
   node, which should always be the case; otherwise recycling will be
   skipped (no cross numa recycling).

a) if the pool migrates, we will stop recycling until the pool moves
   back to the original node, or the user calls page_pool_update_nid()
   as we do in mlx5.
b) if the pool is NUMA_NO_NODE, then allocation and handling will be
   done on numa_mem_id(), which means the above check will work.

Thanks,
Saeed.
On 2019/12/10 7:34, Saeed Mahameed wrote:
> On Mon, 2019-12-09 at 13:14 +0100, Jesper Dangaard Brouer wrote:
>> On Sat, 7 Dec 2019 03:52:41 +0000
>> Saeed Mahameed <saeedm@mellanox.com> wrote:
[...]
[...]
> maybe assume that __page_pool_recycle_direct() is always called from
> the right node and change the current bogus check:
>
> from:
>	page_to_nid(page) == pool->p.nid
> to:
>	page_to_nid(page) == numa_mem_id()
>
> This will allow recycling only if handling node is the same as where
> the page was allocated, regardless of pool->p.nid.
>
> so semantics are:
>
> 1) allocate from: pool->p.nid, as chosen by user.
> 2) recycle when: page_to_nid(page) == numa_mem_id().
> 3) pool user must guarantee that the handler will run on the right
>    node. which should always be the case. otherwise recycling will be
>    skipped (no cross numa recycling).
>
> a) if the pool migrates, we will stop recycling until the pool moves
>    back to original node, or user calls pool_update_nid() as we do in
>    mlx5.
> b) if pool is NUMA_NO_NODE, then allocation and handling will be done
>    on numa_mem_id(), which means the above check will work.

Only checking page_to_nid(page) == numa_mem_id() may not work for the
below case in mvneta.c:

static int mvneta_create_page_pool(struct mvneta_port *pp,
				   struct mvneta_rx_queue *rxq, int size)
{
	struct bpf_prog *xdp_prog = READ_ONCE(pp->xdp_prog);
	struct page_pool_params pp_params = {
		.order = 0,
		.flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
		.pool_size = size,
		.nid = cpu_to_node(0),
		.dev = pp->dev->dev.parent,
		.dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE,
		.offset = pp->rx_offset_correction,
		.max_len = MVNETA_MAX_RX_BUF_SIZE,
	};

Here pool->p.nid is not NUMA_NO_NODE, so the node of a page allocated
for rx may not be numa_mem_id() when running in NAPI polling, because
pool->p.nid is not the same as the node of the cpu running the NAPI
polling.

Does the page pool support recycling for the above case?

Or do we "fix" the above case by setting pool->p.nid to
NUMA_NO_NODE/dev_to_node(), or by calling page_pool_update_nid() in
NAPI polling as mlx5 does?

> Thanks,
> Saeed.
Hi Saeed,

> > The patch description doesn't explain the problem very well.
[...]
> maybe assume that __page_pool_recycle_direct() is always called from
> the right node and change the current bogus check:

Is this a typo? pool_page_reusable() is called from
__page_pool_put_page(); page_pool_put_page() and
page_pool_recycle_direct() (no underscores) call that.

Can we guarantee that those will always run from the correct cpu? In
the current code base, if they are only called under NAPI this might be
true. In the page_pool skb recycling patches though (yes, we'll
eventually send those :)) this is called from kfree_skb(). I don't
think we can get such a guarantee there, right?

Regards
/Ilias
On Tue, 2019-12-10 at 09:31 +0800, Yunsheng Lin wrote:
> On 2019/12/10 7:34, Saeed Mahameed wrote:
> > On Mon, 2019-12-09 at 13:14 +0100, Jesper Dangaard Brouer wrote:
> > > On Sat, 7 Dec 2019 03:52:41 +0000
> > > Saeed Mahameed <saeedm@mellanox.com> wrote:
[...]
> > > > > */ > > > > > static bool pool_page_reusable(struct page_pool *pool, > > > > > struct > > > > > page > > > > > *page) > > > > > { > > > > > - return !page_is_pfmemalloc(page) && page_to_nid(page) > > > > > == > > > > > pool- > > > > > > p.nid; > > > > > + return !page_is_pfmemalloc(page) && > > > > > + (page_to_nid(page) == pool->p.nid || > > > > > + (pool->p.nid == NUMA_NO_NODE && > > > > > + page_to_nid(page) == numa_mem_id())); > > > > > } > > > > > > > > > > > > > Cc'ed Jesper, Ilias & Jonathan. > > > > > > > > I don't think it is correct to check that the page nid is same > > > > as > > > > numa_mem_id() if pool is NUMA_NO_NODE. In such case we should > > > > allow > > > > all > > > > pages to recycle, because you can't assume where pages are > > > > allocated > > > > from and where they are being handled. > > > > > > > > I suggest the following: > > > > > > > > return !page_pfmemalloc() && > > > > ( page_to_nid(page) == pool->p.nid || pool->p.nid == > > > > NUMA_NO_NODE > > > > ); > > > > > > > > 1) never recycle emergency pages, regardless of pool nid. > > > > 2) always recycle if pool is NUMA_NO_NODE. > > > > > > > > the above change should not add any overhead, a modest branch > > > > predictor > > > > will handle this with no effort. > > > > > > > > Jesper et al. what do you think? > > > > > > The patch description doesn't explain the problem very well. > > > > > > Lets first establish what the problem is. After I took at closer > > > look, > > > I do think we have a real problem here... > > > > > > If function alloc_pages_node() is called with NUMA_NO_NODE (see > > > below > > > signature), then the nid is re-assigned to numa_mem_id(). > > > > > > Our current code checks: page_to_nid(page) == pool->p.nid which > > > seems > > > bogus, as pool->p.nid=NUMA_NO_NODE and the page NID will not > > > return > > > NUMA_NO_NODE... as it was set to the local detect numa node, > > > right? > > > > > > > right. > > > > > So, we do need a fix... 
but the question is that semantics do we > > > want? > > > > > > > maybe assume that __page_pool_recycle_direct() is always called > > from > > the right node and change the current bogus check: > > > > from: > > page_to_nid(page) == pool->p.nid > > > > to: > > page_to_nid(page) == numa_mem_id() > > > > This will allow recycling only if handling node is the same as > > where > > the page was allocated, regardless of pool->p.nid. > > > > so semantics are: > > > > 1) allocate from: pool->p.nid, as chosen by user. > > 2) recycle when: page_to_nid(page) == numa_mem_id(). > > 3) pool user must guarantee that the handler will run on the right > > node. which should always be the case. otherwise recycling will be > > skipped (no cross numa recycling). > > > > > > a) if the pool migrates, we will stop recycling until the pool > > moves > > back to original node, or user calls pool_update_nid() as we do in > > mlx5. > > b) if pool is NUMA_NO_NODE, then allocation and handling will be > > done > > on numa_mem_id(), which means the above check will work. > > Only checking page_to_nid(page) == numa_mem_id() may not work for the > below > case in mvneta.c: > > static int mvneta_create_page_pool(struct mvneta_port *pp, > struct mvneta_rx_queue *rxq, int > size) > { > struct bpf_prog *xdp_prog = READ_ONCE(pp->xdp_prog); > struct page_pool_params pp_params = { > .order = 0, > .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV, > .pool_size = size, > .nid = cpu_to_node(0), > .dev = pp->dev->dev.parent, > .dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : > DMA_FROM_DEVICE, > .offset = pp->rx_offset_correction, > .max_len = MVNETA_MAX_RX_BUF_SIZE, > }; > > the pool->p.nid is not NUMA_NO_NODE, then the node of page allocated > for rx > may not be numa_mem_id() when running in the NAPI polling, because > pool->p.nid > is not the same as the node of cpu running in the NAPI polling. > > Does the page pool support recycling for above case? > I don't think you want to allow cross numa recycling. 
> Or we "fix' the above case by setting pool->p.nid to > NUMA_NO_NODE/dev_to_node(), > or by calling pool_update_nid() in NAPI polling as mlx5 does? > Yes just update_nid when needed, and make sure the NAPI polling runs on a consistent core and eventually alloc/recycling will happen on the same core. > > > Thanks, > > Saeed. > > > > > > > > > > > > > >
On Tue, Dec 10, 2019 at 7:02 AM Ilias Apalodimas <ilias.apalodimas@linaro.org> wrote: > > Hi Saeed, > > > > > > > The patch description doesn't explain the problem very well. > > > > > > Lets first establish what the problem is. After I took at closer > > > look, > > > I do think we have a real problem here... > > > > > > If function alloc_pages_node() is called with NUMA_NO_NODE (see below > > > signature), then the nid is re-assigned to numa_mem_id(). > > > > > > Our current code checks: page_to_nid(page) == pool->p.nid which seems > > > bogus, as pool->p.nid=NUMA_NO_NODE and the page NID will not return > > > NUMA_NO_NODE... as it was set to the local detect numa node, right? > > > > > > > right. > > > > > So, we do need a fix... but the question is that semantics do we > > > want? > > > > > > > maybe assume that __page_pool_recycle_direct() is always called from > > the right node and change the current bogus check: > > Is this a typo? pool_page_reusable() is called from __page_pool_put_page(). > > page_pool_put_page and page_pool_recycle_direct() (no underscores) call that. Yes a typo :) , thanks for the correction. > Can we guarantee that those will always run from the correct cpu? No, but we add the tool to correct any discrepancy: page_pool_nid_changed() > In the current code base if they are only called under NAPI this might be true. > On the page_pool skb recycling patches though (yes we'll eventually send those > :)) this is called from kfree_skb(). > I don't think we can get such a guarantee there, right? > Yes, but this has nothing to do with page recycling from pool's owner level (driver napi) for SKB recycling we can use pool.nid to recycle, and not numa_mem_id(). > Regards > /Ilias
Hi Saeed, On Tue, Dec 10, 2019 at 12:02:12PM -0800, Saeed Mahameed wrote: > On Tue, Dec 10, 2019 at 7:02 AM Ilias Apalodimas > <ilias.apalodimas@linaro.org> wrote: > > > > Hi Saeed, > > > > > > > > > > The patch description doesn't explain the problem very well. > > > > > > > > Lets first establish what the problem is. After I took at closer > > > > look, > > > > I do think we have a real problem here... > > > > > > > > If function alloc_pages_node() is called with NUMA_NO_NODE (see below > > > > signature), then the nid is re-assigned to numa_mem_id(). > > > > > > > > Our current code checks: page_to_nid(page) == pool->p.nid which seems > > > > bogus, as pool->p.nid=NUMA_NO_NODE and the page NID will not return > > > > NUMA_NO_NODE... as it was set to the local detect numa node, right? > > > > > > > > > > right. > > > > > > > So, we do need a fix... but the question is that semantics do we > > > > want? > > > > > > > > > > maybe assume that __page_pool_recycle_direct() is always called from > > > the right node and change the current bogus check: > > > > Is this a typo? pool_page_reusable() is called from __page_pool_put_page(). > > > > page_pool_put_page and page_pool_recycle_direct() (no underscores) call that. > > Yes a typo :) , thanks for the correction. > > > Can we guarantee that those will always run from the correct cpu? > No, but we add the tool to correct any discrepancy: page_pool_nid_changed() > > > In the current code base if they are only called under NAPI this might be true. > > On the page_pool skb recycling patches though (yes we'll eventually send those > > :)) this is called from kfree_skb(). > > I don't think we can get such a guarantee there, right? > > > > Yes, but this has nothing to do with page recycling from pool's owner > level (driver napi) > for SKB recycling we can use pool.nid to recycle, and not numa_mem_id(). Right i responded to an email without the proper context! Let me try again. 
You suggested changing the check from page_to_nid(page) == pool->p.nid to page_to_nid(page) == numa_mem_id(). Since the skb recycling code is using page_pool_put_page(), won't that break the recycling for that patchset? Thanks /Ilias
On 2019/12/11 3:45, Saeed Mahameed wrote: >>> maybe assume that __page_pool_recycle_direct() is always called >>> from >>> the right node and change the current bogus check: >>> >>> from: >>> page_to_nid(page) == pool->p.nid >>> >>> to: >>> page_to_nid(page) == numa_mem_id() >>> >>> This will allow recycling only if handling node is the same as >>> where >>> the page was allocated, regardless of pool->p.nid. >>> >>> so semantics are: >>> >>> 1) allocate from: pool->p.nid, as chosen by user. >>> 2) recycle when: page_to_nid(page) == numa_mem_id(). >>> 3) pool user must guarantee that the handler will run on the right >>> node. which should always be the case. otherwise recycling will be >>> skipped (no cross numa recycling). >>> >>> >>> a) if the pool migrates, we will stop recycling until the pool >>> moves >>> back to original node, or user calls pool_update_nid() as we do in >>> mlx5. >>> b) if pool is NUMA_NO_NODE, then allocation and handling will be >>> done >>> on numa_mem_id(), which means the above check will work. >> >> Only checking page_to_nid(page) == numa_mem_id() may not work for the >> below >> case in mvneta.c: >> >> static int mvneta_create_page_pool(struct mvneta_port *pp, >> struct mvneta_rx_queue *rxq, int >> size) >> { >> struct bpf_prog *xdp_prog = READ_ONCE(pp->xdp_prog); >> struct page_pool_params pp_params = { >> .order = 0, >> .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV, >> .pool_size = size, >> .nid = cpu_to_node(0), >> .dev = pp->dev->dev.parent, >> .dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : >> DMA_FROM_DEVICE, >> .offset = pp->rx_offset_correction, >> .max_len = MVNETA_MAX_RX_BUF_SIZE, >> }; >> >> the pool->p.nid is not NUMA_NO_NODE, then the node of page allocated >> for rx >> may not be numa_mem_id() when running in the NAPI polling, because >> pool->p.nid >> is not the same as the node of cpu running in the NAPI polling. >> >> Does the page pool support recycling for above case? 
>> > > I don't think you want to allow cross numa recycling. Cross numa recycling is not what I want. > >> Or we "fix' the above case by setting pool->p.nid to >> NUMA_NO_NODE/dev_to_node(), >> or by calling pool_update_nid() in NAPI polling as mlx5 does? >> > > Yes just update_nid when needed, and make sure the NAPI polling runs on > a consistent core and eventually alloc/recycling will happen on the > same core. To me, passing NUMA_NO_NODE/dev_to_node() seems to always work. Calling pool_update_nid() in NAPI polling is another way of passing NUMA_NO_NODE to page_pool_init(). And it seems it is a copy & paste problem for the mvneta and netsec drivers that use cpu_to_node(0) as pool->p.nid but do not call page_pool_nid_changed() in the NAPI polling as mlx5 does. So I suggest removing page_pool_nid_changed() and always using NUMA_NO_NODE/dev_to_node() as pool->p.nid, or making it clear (by comment or warning?) that page_pool_nid_changed() should be called when pool->p.nid is NUMA_NO_NODE/dev_to_node(). I prefer to remove page_pool_nid_changed() if we do not allow cross numa recycling. > >> >>> Thanks, >>> Saeed.
On 2019/12/11 11:01, Yunsheng Lin wrote: > On 2019/12/11 3:45, Saeed Mahameed wrote: >>>> maybe assume that __page_pool_recycle_direct() is always called >>>> from >>>> the right node and change the current bogus check: >>>> >>>> from: >>>> page_to_nid(page) == pool->p.nid >>>> >>>> to: >>>> page_to_nid(page) == numa_mem_id() >>>> >>>> This will allow recycling only if handling node is the same as >>>> where >>>> the page was allocated, regardless of pool->p.nid. >>>> >>>> so semantics are: >>>> >>>> 1) allocate from: pool->p.nid, as chosen by user. >>>> 2) recycle when: page_to_nid(page) == numa_mem_id(). >>>> 3) pool user must guarantee that the handler will run on the right >>>> node. which should always be the case. otherwise recycling will be >>>> skipped (no cross numa recycling). >>>> >>>> >>>> a) if the pool migrates, we will stop recycling until the pool >>>> moves >>>> back to original node, or user calls pool_update_nid() as we do in >>>> mlx5. >>>> b) if pool is NUMA_NO_NODE, then allocation and handling will be >>>> done >>>> on numa_mem_id(), which means the above check will work. >>> >>> Only checking page_to_nid(page) == numa_mem_id() may not work for the >>> below >>> case in mvneta.c: >>> >>> static int mvneta_create_page_pool(struct mvneta_port *pp, >>> struct mvneta_rx_queue *rxq, int >>> size) >>> { >>> struct bpf_prog *xdp_prog = READ_ONCE(pp->xdp_prog); >>> struct page_pool_params pp_params = { >>> .order = 0, >>> .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV, >>> .pool_size = size, >>> .nid = cpu_to_node(0), >>> .dev = pp->dev->dev.parent, >>> .dma_dir = xdp_prog ? DMA_BIDIRECTIONAL : >>> DMA_FROM_DEVICE, >>> .offset = pp->rx_offset_correction, >>> .max_len = MVNETA_MAX_RX_BUF_SIZE, >>> }; >>> >>> the pool->p.nid is not NUMA_NO_NODE, then the node of page allocated >>> for rx >>> may not be numa_mem_id() when running in the NAPI polling, because >>> pool->p.nid >>> is not the same as the node of cpu running in the NAPI polling. 
>>> >>> Does the page pool support recycling for above case? >>> >> >> I don't think you want to allow cross numa recycling. > > Cross numa recycling is not what I want. > >> >>> Or we "fix' the above case by setting pool->p.nid to >>> NUMA_NO_NODE/dev_to_node(), >>> or by calling pool_update_nid() in NAPI polling as mlx5 does? >>> >> >> Yes just update_nid when needed, and make sure the NAPI polling runs on >> a consistent core and eventually alloc/recycling will happen on the >> same core. > > To me, passing NUMA_NO_NODE/dev_to_node() seems to always work. > Calling pool_update_nid() in NAPI polling is another way of passing > NUMA_NO_NODE to page_pool_init(). > > And it seems it is a copy & paste problem for mvneta and netsec > driver that uses cpu_to_node(0) as pool->p.nid but does not call > page_pool_nid_changed() in the NAPI polling as mlx5 does. > > So I suggest to remove page_pool_nid_changed() and always use > NUMA_NO_NODE/dev_to_node() as pool->p.nid or make it clear ( > by comment or warning?)that page_pool_nid_changed() should be > called when pool->p.nid is NUMA_NO_NODE/dev_to_node(). ~~ sorry for the typo. I meant pool->p.nid is not NUMA_NO_NODE/dev_to_node(). > > I prefer to remove page_pool_nid_changed() if we do not allow > cross numa recycling. > > >> >>> >>>> Thanks, >>>> Saeed. >>>> >>>> >>>> >>>> >>>> >>>> >>>> > > > . >
On Sat, 7 Dec 2019 03:52:41 +0000 Saeed Mahameed <saeedm@mellanox.com> wrote: > I don't think it is correct to check that the page nid is same as > numa_mem_id() if pool is NUMA_NO_NODE. In such case we should allow all > pages to recycle, because you can't assume where pages are allocated > from and where they are being handled. I agree, using numa_mem_id() is not valid, because it takes the numa node id from the executing CPU and the call to __page_pool_put_page() can happen on a remote CPU (e.g. cpumap redirect, and in future SKBs). > I suggest the following: > > return !page_pfmemalloc() && > ( page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE ); Above code doesn't generate optimal ASM code, I suggest: static bool pool_page_reusable(struct page_pool *pool, struct page *page) { return !page_is_pfmemalloc(page) && pool->p.nid != NUMA_NO_NODE && page_to_nid(page) == pool->p.nid; } I have compiled different variants and looked at the ASM code generated by GCC. This seems to give the best result. > 1) never recycle emergency pages, regardless of pool nid. > 2) always recycle if pool is NUMA_NO_NODE. Yes, this defines the semantics, that a page_pool configured with NUMA_NO_NODE means skip NUMA checks. I think that sounds okay... > the above change should not add any overhead, a modest branch > predictor will handle this with no effort. It still annoys me that we keep adding instructions to this code hot-path (I counted 34 bytes and 11 instructions in my proposed function). I think that it might be possible to move these NUMA checks to alloc-side (instead of return/recycles side as today), and perhaps only on slow-path when dequeuing from ptr_ring (as recycles that call __page_pool_recycle_direct() will be pinned during NAPI). But lets focus on a smaller fix for the immediate issue...
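[Editorial illustration] The two predicates proposed above have opposite semantics for a NUMA_NO_NODE pool, which is the crux of the disagreement. A minimal userspace model (stubbed types, page_is_pfmemalloc() dropped for brevity; names are illustrative, not the kernel API) makes the divergence concrete:

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_NUMA_NO_NODE (-1)

struct mock_pool { int nid; };

/* Saeed's proposal: NUMA_NO_NODE disables the NUMA check, so every
 * (non-pfmemalloc) page is recyclable. */
static bool reusable_no_node_recycles(const struct mock_pool *pool,
				      int page_nid)
{
	return page_nid == pool->nid || pool->nid == MOCK_NUMA_NO_NODE;
}

/* Jesper's variant: a NUMA_NO_NODE pool never recycles, trading
 * recycling for (claimed) better generated code. */
static bool reusable_no_node_skips(const struct mock_pool *pool,
				   int page_nid)
{
	return pool->nid != MOCK_NUMA_NO_NODE && page_nid == pool->nid;
}
```

For a pool bound to a real node the two agree; they differ exactly when pool->p.nid is NUMA_NO_NODE, where the first always recycles and the second never does.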
On Wed, 2019-12-11 at 11:01 +0800, Yunsheng Lin wrote: > On 2019/12/11 3:45, Saeed Mahameed wrote: > > > > maybe assume that __page_pool_recycle_direct() is always called > > > > from > > > > the right node and change the current bogus check: > > > > > > > > from: > > > > page_to_nid(page) == pool->p.nid > > > > > > > > to: > > > > page_to_nid(page) == numa_mem_id() > > > > > > > > This will allow recycling only if handling node is the same as > > > > where > > > > the page was allocated, regardless of pool->p.nid. > > > > > > > > so semantics are: > > > > > > > > 1) allocate from: pool->p.nid, as chosen by user. > > > > 2) recycle when: page_to_nid(page) == numa_mem_id(). > > > > 3) pool user must guarantee that the handler will run on the > > > > right > > > > node. which should always be the case. otherwise recycling will > > > > be > > > > skipped (no cross numa recycling). > > > > > > > > > > > > a) if the pool migrates, we will stop recycling until the pool > > > > moves > > > > back to original node, or user calls pool_update_nid() as we do > > > > in > > > > mlx5. > > > > b) if pool is NUMA_NO_NODE, then allocation and handling will > > > > be > > > > done > > > > on numa_mem_id(), which means the above check will work. > > > > > > Only checking page_to_nid(page) == numa_mem_id() may not work for > > > the > > > below > > > case in mvneta.c: > > > > > > static int mvneta_create_page_pool(struct mvneta_port *pp, > > > struct mvneta_rx_queue *rxq, int > > > size) > > > { > > > struct bpf_prog *xdp_prog = READ_ONCE(pp->xdp_prog); > > > struct page_pool_params pp_params = { > > > .order = 0, > > > .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV, > > > .pool_size = size, > > > .nid = cpu_to_node(0), > > > .dev = pp->dev->dev.parent, > > > .dma_dir = xdp_prog ? 
DMA_BIDIRECTIONAL : > > > DMA_FROM_DEVICE, > > > .offset = pp->rx_offset_correction, > > > .max_len = MVNETA_MAX_RX_BUF_SIZE, > > > }; > > > > > > the pool->p.nid is not NUMA_NO_NODE, then the node of page > > > allocated > > > for rx > > > may not be numa_mem_id() when running in the NAPI polling, > > > because > > > pool->p.nid > > > is not the same as the node of cpu running in the NAPI polling. > > > > > > Does the page pool support recycling for above case? > > > > > > > I don't think you want to allow cross numa recycling. > > Cross numa recycling is not what I want. > > > > Or we "fix' the above case by setting pool->p.nid to > > > NUMA_NO_NODE/dev_to_node(), > > > or by calling pool_update_nid() in NAPI polling as mlx5 does? > > > > > > > Yes just update_nid when needed, and make sure the NAPI polling > > runs on > > a consistent core and eventually alloc/recycling will happen on the > > same core. > > To me, passing NUMA_NO_NODE/dev_to_node() seems to always work. > Calling pool_update_nid() in NAPI polling is another way of passing > NUMA_NO_NODE to page_pool_init(). > > And it seems it is a copy & paste problem for mvneta and netsec > driver that uses cpu_to_node(0) as pool->p.nid but does not call > page_pool_nid_changed() in the NAPI polling as mlx5 does. > > So I suggest to remove page_pool_nid_changed() and always use > NUMA_NO_NODE/dev_to_node() as pool->p.nid or make it clear ( > by comment or warning?)that page_pool_nid_changed() should be > called when pool->p.nid is NUMA_NO_NODE/dev_to_node(). > not an option. rx rings should always allocate data buffers according to their cpu affinity and not dev_node or default to NUMA_NO_NODE. > I prefer to remove page_pool_nid_changed() if we do not allow > cross numa recycling. > This is not for cross numa recycling. Since rx rings can migrate between cores (msix affinity/IRQ balancer), we need page_pool_nid_changed() for seamless migration and for recycling and allocation to migrate with the ring.
On Wed, 2019-12-11 at 19:49 +0100, Jesper Dangaard Brouer wrote: > On Sat, 7 Dec 2019 03:52:41 +0000 > Saeed Mahameed <saeedm@mellanox.com> wrote: > > > I don't think it is correct to check that the page nid is same as > > numa_mem_id() if pool is NUMA_NO_NODE. In such case we should allow > > all > > pages to recycle, because you can't assume where pages are > > allocated > > from and where they are being handled. > > I agree, using numa_mem_id() is not valid, because it takes the numa > node id from the executing CPU and the call to __page_pool_put_page() > can happen on a remote CPU (e.g. cpumap redirect, and in future > SKBs). > > > > I suggest the following: > > > > return !page_pfmemalloc() && > > ( page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE > > ); > > Above code doesn't generate optimal ASM code, I suggest: > > static bool pool_page_reusable(struct page_pool *pool, struct page > *page) > { > return !page_is_pfmemalloc(page) && > pool->p.nid != NUMA_NO_NODE && > page_to_nid(page) == pool->p.nid; > } > this is not equivalent to the above. Here in case pool->p.nid is NUMA_NO_NODE, pool_page_reusable() will always be false. We can avoid the extra check in data path. How about avoiding NUMA_NO_NODE in page_pool altogether, and force numa_mem_id() as pool->p.nid when user requests NUMA_NO_NODE at page pool init, as already done in alloc_pages_node(). which will imply recycling without adding any extra condition to the data path. diff --git a/net/core/page_pool.c b/net/core/page_pool.c index a6aefe989043..00c99282a306 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -28,6 +28,9 @@ static int page_pool_init(struct page_pool *pool, memcpy(&pool->p, params, sizeof(pool->p)); + /* overwrite to allow recycling.. */ + if (pool->p.nid == NUMA_NO_NODE) + pool->p.nid = numa_mem_id(); + After a quick look, i don't see any reason why to keep NUMA_NO_NODE in pool->p.nid.. 
> I have compiled different variants and looked at the ASM code generated > by GCC. This seems to give the best result. > > > > 1) never recycle emergency pages, regardless of pool nid. > > 2) always recycle if pool is NUMA_NO_NODE. > > Yes, this defines the semantics, that a page_pool configured with > NUMA_NO_NODE means skip NUMA checks. I think that sounds okay... > > > > the above change should not add any overhead, a modest branch > > predictor will handle this with no effort. > > It still annoys me that we keep adding instructions to this code > hot-path (I counted 34 bytes and 11 instructions in my proposed > function). > > I think that it might be possible to move these NUMA checks to > alloc-side (instead of return/recycles side as today), and perhaps > only > on slow-path when dequeuing from ptr_ring (as recycles that call > __page_pool_recycle_direct() will be pinned during NAPI). But lets > focus on a smaller fix for the immediate issue... > I know. It annoys me too, but we need recycling to work in production: where rings/napi can migrate and numa nodes can be NUMA_NO_NODE :-(.
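[Editorial illustration] Saeed's diff above resolves NUMA_NO_NODE once at pool init, the way alloc_pages_node() does, so the recycle hot path keeps a single equality compare. A userspace sketch of that idea (all names are illustrative stand-ins; mock_numa_mem_id() stands in for numa_mem_id() and simply returns a fixed node here):

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_NUMA_NO_NODE (-1)

/* Stand-in for numa_mem_id(); in the kernel it returns the node of
 * the CPU executing the call, which is exactly the caveat raised
 * later in the thread about where page_pool_init() runs. */
static int mock_numa_mem_id(void) { return 2; }

struct mock_pool { int nid; };

static void mock_pool_init(struct mock_pool *pool, int requested_nid)
{
	/* Resolve NUMA_NO_NODE once at init, as alloc_pages_node() does,
	 * so the recycle fast path needs no extra NUMA_NO_NODE branch. */
	pool->nid = (requested_nid == MOCK_NUMA_NO_NODE) ?
		    mock_numa_mem_id() : requested_nid;
}

static bool mock_page_reusable(const struct mock_pool *pool, int page_nid)
{
	return page_nid == pool->nid; /* hot path: single compare */
}
```

The trade-off, as Yunsheng notes in the next reply, is that the pool's node is then fixed to whatever node happened to run init, rather than tracking the current CPU per allocation.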
On 2019/12/12 4:57, Saeed Mahameed wrote: > On Wed, 2019-12-11 at 11:01 +0800, Yunsheng Lin wrote: >> On 2019/12/11 3:45, Saeed Mahameed wrote: >>>>> maybe assume that __page_pool_recycle_direct() is always called >>>>> from >>>>> the right node and change the current bogus check: >>>>> >>>>> from: >>>>> page_to_nid(page) == pool->p.nid >>>>> >>>>> to: >>>>> page_to_nid(page) == numa_mem_id() >>>>> >>>>> This will allow recycling only if handling node is the same as >>>>> where >>>>> the page was allocated, regardless of pool->p.nid. >>>>> >>>>> so semantics are: >>>>> >>>>> 1) allocate from: pool->p.nid, as chosen by user. >>>>> 2) recycle when: page_to_nid(page) == numa_mem_id(). >>>>> 3) pool user must guarantee that the handler will run on the >>>>> right >>>>> node. which should always be the case. otherwise recycling will >>>>> be >>>>> skipped (no cross numa recycling). >>>>> >>>>> >>>>> a) if the pool migrates, we will stop recycling until the pool >>>>> moves >>>>> back to original node, or user calls pool_update_nid() as we do >>>>> in >>>>> mlx5. >>>>> b) if pool is NUMA_NO_NODE, then allocation and handling will >>>>> be >>>>> done >>>>> on numa_mem_id(), which means the above check will work. >>>> >>>> Only checking page_to_nid(page) == numa_mem_id() may not work for >>>> the >>>> below >>>> case in mvneta.c: >>>> >>>> static int mvneta_create_page_pool(struct mvneta_port *pp, >>>> struct mvneta_rx_queue *rxq, int >>>> size) >>>> { >>>> struct bpf_prog *xdp_prog = READ_ONCE(pp->xdp_prog); >>>> struct page_pool_params pp_params = { >>>> .order = 0, >>>> .flags = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV, >>>> .pool_size = size, >>>> .nid = cpu_to_node(0), >>>> .dev = pp->dev->dev.parent, >>>> .dma_dir = xdp_prog ? 
DMA_BIDIRECTIONAL : >>>> DMA_FROM_DEVICE, >>>> .offset = pp->rx_offset_correction, >>>> .max_len = MVNETA_MAX_RX_BUF_SIZE, >>>> }; >>>> >>>> the pool->p.nid is not NUMA_NO_NODE, then the node of page >>>> allocated >>>> for rx >>>> may not be numa_mem_id() when running in the NAPI polling, >>>> because >>>> pool->p.nid >>>> is not the same as the node of cpu running in the NAPI polling. >>>> >>>> Does the page pool support recycling for above case? >>>> >>> >>> I don't think you want to allow cross numa recycling. >> >> Cross numa recycling is not what I want. >> >>>> Or we "fix' the above case by setting pool->p.nid to >>>> NUMA_NO_NODE/dev_to_node(), >>>> or by calling pool_update_nid() in NAPI polling as mlx5 does? >>>> >>> >>> Yes just update_nid when needed, and make sure the NAPI polling >>> runs on >>> a consistent core and eventually alloc/recycling will happen on the >>> same core. >> >> To me, passing NUMA_NO_NODE/dev_to_node() seems to always work. >> Calling pool_update_nid() in NAPI polling is another way of passing >> NUMA_NO_NODE to page_pool_init(). >> >> And it seems it is a copy & paste problem for mvneta and netsec >> driver that uses cpu_to_node(0) as pool->p.nid but does not call >> page_pool_nid_changed() in the NAPI polling as mlx5 does. >> >> So I suggest to remove page_pool_nid_changed() and always use >> NUMA_NO_NODE/dev_to_node() as pool->p.nid or make it clear ( >> by comment or warning?)that page_pool_nid_changed() should be >> called when pool->p.nid is NUMA_NO_NODE/dev_to_node(). >> > > not an option. > > rx rings should always allocate data buffers according to their cpu > affinity and not dev_node or default to NUMA_NO_NODE. Does "their cpu affinity" mean numa_mem_id() when runnig in the NAPI polling context? If yes, the node of data buffers allocated for rx will be default to numa_mem_id() when the pool->p.nid is NUMA_NO_NODE. See: /* * Allocate pages, preferring the node given as nid. 
When nid == NUMA_NO_NODE, * prefer the current CPU's closest node. Otherwise node must be valid and * online. */ static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask, unsigned int order) { if (nid == NUMA_NO_NODE) nid = numa_mem_id(); return __alloc_pages_node(nid, gfp_mask, order); } If the above is true, NUMA_NO_NODE defaults to numa_mem_id(). > >> I prefer to remove page_pool_nid_changed() if we do not allow >> cross numa recycling. >> > > This is not for cross numa recycling. > Since rx rings can migrate between cores (msix affinity/IRQ balancer), > we need page_pool_nid_changed() for seamless migration and for > recycling and allocation to migrate with the ring. If the allocation and recycling for rx data buffers is guaranteed to be in NAPI polling context, then when rx rings migrate between nodes, the pool will stop recycling the old pages allocated from the old node, and start allocating new pages from the new node and recycling those new pages, because numa_mem_id() will return a different node id based on the node the current cpu is running on. Or is allocation and recycling for rx data buffers not guaranteed to be in the same NAPI polling context for a specific ring? Is there a use case for this?
+CC Michal, Peter, Greg and Bjorn Because there has been discussion about where and how the NUMA_NO_NODE should be handled before. On 2019/12/12 5:24, Saeed Mahameed wrote: > On Wed, 2019-12-11 at 19:49 +0100, Jesper Dangaard Brouer wrote: >> On Sat, 7 Dec 2019 03:52:41 +0000 >> Saeed Mahameed <saeedm@mellanox.com> wrote: >> >>> I don't think it is correct to check that the page nid is same as >>> numa_mem_id() if pool is NUMA_NO_NODE. In such case we should allow >>> all >>> pages to recycle, because you can't assume where pages are >>> allocated >>> from and where they are being handled. >> >> I agree, using numa_mem_id() is not valid, because it takes the numa >> node id from the executing CPU and the call to __page_pool_put_page() >> can happen on a remote CPU (e.g. cpumap redirect, and in future >> SKBs). >> >> >>> I suggest the following: >>> >>> return !page_pfmemalloc() && >>> ( page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE >>> ); >> >> Above code doesn't generate optimal ASM code, I suggest: >> >> static bool pool_page_reusable(struct page_pool *pool, struct page >> *page) >> { >> return !page_is_pfmemalloc(page) && >> pool->p.nid != NUMA_NO_NODE && >> page_to_nid(page) == pool->p.nid; >> } >> > > this is not equivalent to the above. Here in case pool->p.nid is > NUMA_NO_NODE, pool_page_reusable() will always be false. > > We can avoid the extra check in data path. > How about avoiding NUMA_NO_NODE in page_pool altogether, and force > numa_mem_id() as pool->p.nid when user requests NUMA_NO_NODE at page > pool init, as already done in alloc_pages_node(). That means we will not support page reuse migration for NUMA_NO_NODE, which is not the same semantics as alloc_pages_node()'s handling of NUMA_NO_NODE, because alloc_pages_node() will allocate the page based on the node of the current running cpu. Also, there seems to be wild guessing of the node id here, which has been discussed before and has not reached an agreement yet.
> > which will imply recycling without adding any extra condition to the > data path. > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c > index a6aefe989043..00c99282a306 100644 > --- a/net/core/page_pool.c > +++ b/net/core/page_pool.c > @@ -28,6 +28,9 @@ static int page_pool_init(struct page_pool *pool, > > memcpy(&pool->p, params, sizeof(pool->p)); > > + /* overwrite to allow recycling.. */ > + if (pool->p.nid == NUMA_NO_NODE) > + pool->p.nid = numa_mem_id(); > + > > After a quick look, i don't see any reason why to keep NUMA_NO_NODE in > pool->p.nid.. > > >> I have compiled different variants and looked at the A >> SM code generated >> by GCC. This seems to give the best result. >> >> >>> 1) never recycle emergency pages, regardless of pool nid. >>> 2) always recycle if pool is NUMA_NO_NODE. >> >> Yes, this defines the semantics, that a page_pool configured with >> NUMA_NO_NODE means skip NUMA checks. I think that sounds okay... >> >> >>> the above change should not add any overhead, a modest branch >>> predictor will handle this with no effort. >> >> It still annoys me that we keep adding instructions to this code >> hot-path (I counted 34 bytes and 11 instructions in my proposed >> function). >> >> I think that it might be possible to move these NUMA checks to >> alloc-side (instead of return/recycles side as today), and perhaps >> only >> on slow-path when dequeuing from ptr_ring (as recycles that call >> __page_pool_recycle_direct() will be pinned during NAPI). But lets >> focus on a smaller fix for the immediate issue... >> > > I know. It annoys me too, but we need recycling to work in production : > where rings/napi can migrate and numa nodes can be NUMA_NO_NODE :-(. > >
On Thu, 12 Dec 2019 09:34:14 +0800 Yunsheng Lin <linyunsheng@huawei.com> wrote: > +CC Michal, Peter, Greg and Bjorn > Because there has been disscusion about where and how the NUMA_NO_NODE > should be handled before. > > On 2019/12/12 5:24, Saeed Mahameed wrote: > > On Wed, 2019-12-11 at 19:49 +0100, Jesper Dangaard Brouer wrote: > >> On Sat, 7 Dec 2019 03:52:41 +0000 > >> Saeed Mahameed <saeedm@mellanox.com> wrote: > >> > >>> I don't think it is correct to check that the page nid is same as > >>> numa_mem_id() if pool is NUMA_NO_NODE. In such case we should allow > >>> all pages to recycle, because you can't assume where pages are > >>> allocated from and where they are being handled. > >> > >> I agree, using numa_mem_id() is not valid, because it takes the numa > >> node id from the executing CPU and the call to __page_pool_put_page() > >> can happen on a remote CPU (e.g. cpumap redirect, and in future > >> SKBs). > >> > >> > >>> I suggest the following: > >>> > >>> return !page_pfmemalloc() && > >>> ( page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE ); > >> > >> Above code doesn't generate optimal ASM code, I suggest: > >> > >> static bool pool_page_reusable(struct page_pool *pool, struct page *page) > >> { > >> return !page_is_pfmemalloc(page) && > >> pool->p.nid != NUMA_NO_NODE && > >> page_to_nid(page) == pool->p.nid; > >> } > >> > > > > this is not equivalent to the above. Here in case pool->p.nid is > > NUMA_NO_NODE, pool_page_reusable() will always be false. > > > > We can avoid the extra check in data path. > > How about avoiding NUMA_NO_NODE in page_pool altogether, and force > > numa_mem_id() as pool->p.nid when user requests NUMA_NO_NODE at page > > pool init, as already done in alloc_pages_node(). 
> > That means we will not support page reuse mitigation for NUMA_NO_NODE, > which is not same semantic that alloc_pages_node() handle NUMA_NO_NODE, > because alloc_pages_node() will allocate the page based on the node > of the current running cpu. True, as I wrote (below), my code defines the semantics as: a page_pool configured with NUMA_NO_NODE means skip NUMA checks, and allow recycling regardless of which NUMA node the page belongs to. It seems that you want different semantics. I'm open to other semantics. My main concern is performance. The page_pool fast-path for the driver-recycling use-case of XDP_DROP has extreme performance requirements, as it needs to compete with driver local recycle tricks (else we cannot use page_pool to simplify drivers). The extreme performance target is 100Gbit/s = 148Mpps = 6.72ns, and in practice I'm measuring 25Mpps = 40ns with Mlx5 driver (single q), and Bjørn is showing 30 Mpps = 33.3ns with i40e. At this level every cycle/instruction counts. > Also, There seems to be a wild guessing of the node id here, which has > been disscussed before and has not reached a agreement yet. > > > > > which will imply recycling without adding any extra condition to the > > data path. I love code that moves things out of our fast-path. > > > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c > > index a6aefe989043..00c99282a306 100644 > > --- a/net/core/page_pool.c > > +++ b/net/core/page_pool.c > > @@ -28,6 +28,9 @@ static int page_pool_init(struct page_pool *pool, > > > > memcpy(&pool->p, params, sizeof(pool->p)); > > > > + /* overwrite to allow recycling.. */ > > + if (pool->p.nid == NUMA_NO_NODE) > > + pool->p.nid = numa_mem_id(); > > + The problem is that page_pool_init() can be initiated from a random CPU, first at driver setup/bringup, and later at other queue changes that can be started via ethtool or XDP attach. (numa_mem_id() picks from running CPU). As Yunsheng mentioned elsewhere, there is also a dev_to_node() function.
Isn't that what we want in a place like this? One issue with dev_to_node() is that in case of !CONFIG_NUMA it returns NUMA_NO_NODE (-1). (And device_initialize() also set it to -1). Thus, in that case we set pool->p.nid = 0, as page_to_nid() will also return zero in that case (as far as I follow the code). > > After a quick look, i don't see any reason why to keep NUMA_NO_NODE in > > pool->p.nid.. > > > > > >> I have compiled different variants and looked at the ASM code > >> generated by GCC. This seems to give the best result. > >> > >> > >>> 1) never recycle emergency pages, regardless of pool nid. > >>> 2) always recycle if pool is NUMA_NO_NODE. > >> > >> Yes, this defines the semantics, that a page_pool configured with > >> NUMA_NO_NODE means skip NUMA checks. I think that sounds okay... > >> > >> > >>> the above change should not add any overhead, a modest branch > >>> predictor will handle this with no effort. > >> > >> It still annoys me that we keep adding instructions to this code > >> hot-path (I counted 34 bytes and 11 instructions in my proposed > >> function). > >> > >> I think that it might be possible to move these NUMA checks to > >> alloc-side (instead of return/recycles side as today), and perhaps > >> only on slow-path when dequeuing from ptr_ring (as recycles that > >> call __page_pool_recycle_direct() will be pinned during NAPI). > >> But lets focus on a smaller fix for the immediate issue... > >> > > > > I know. It annoys me too, but we need recycling to work in > > production : where rings/napi can migrate and numa nodes can be > > NUMA_NO_NODE :-(. > > > >
On 2019/12/12 18:18, Jesper Dangaard Brouer wrote: > On Thu, 12 Dec 2019 09:34:14 +0800 > Yunsheng Lin <linyunsheng@huawei.com> wrote: > >> +CC Michal, Peter, Greg and Bjorn >> Because there has been disscusion about where and how the NUMA_NO_NODE >> should be handled before. >> >> On 2019/12/12 5:24, Saeed Mahameed wrote: >>> On Wed, 2019-12-11 at 19:49 +0100, Jesper Dangaard Brouer wrote: >>>> On Sat, 7 Dec 2019 03:52:41 +0000 >>>> Saeed Mahameed <saeedm@mellanox.com> wrote: >>>> >>>>> I don't think it is correct to check that the page nid is same as >>>>> numa_mem_id() if pool is NUMA_NO_NODE. In such case we should allow >>>>> all pages to recycle, because you can't assume where pages are >>>>> allocated from and where they are being handled. >>>> >>>> I agree, using numa_mem_id() is not valid, because it takes the numa >>>> node id from the executing CPU and the call to __page_pool_put_page() >>>> can happen on a remote CPU (e.g. cpumap redirect, and in future >>>> SKBs). >>>> >>>> >>>>> I suggest the following: >>>>> >>>>> return !page_pfmemalloc() && >>>>> ( page_to_nid(page) == pool->p.nid || pool->p.nid == NUMA_NO_NODE ); >>>> >>>> Above code doesn't generate optimal ASM code, I suggest: >>>> >>>> static bool pool_page_reusable(struct page_pool *pool, struct page *page) >>>> { >>>> return !page_is_pfmemalloc(page) && >>>> pool->p.nid != NUMA_NO_NODE && >>>> page_to_nid(page) == pool->p.nid; >>>> } >>>> >>> >>> this is not equivalent to the above. Here in case pool->p.nid is >>> NUMA_NO_NODE, pool_page_reusable() will always be false. >>> >>> We can avoid the extra check in data path. >>> How about avoiding NUMA_NO_NODE in page_pool altogether, and force >>> numa_mem_id() as pool->p.nid when user requests NUMA_NO_NODE at page >>> pool init, as already done in alloc_pages_node(). 
>> >> That means we will not support page reuse mitigation for NUMA_NO_NODE, >> which is not same semantic that alloc_pages_node() handle NUMA_NO_NODE, >> because alloc_pages_node() will allocate the page based on the node >> of the current running cpu. > > True, as I wrote (below) my code defines semantics as: that a page_pool > configured with NUMA_NO_NODE means skip NUMA checks, and allow recycle > regardless of NUMA node page belong to. It seems that you want another > semantics. For drivers that do not have page pool support yet, the semantics seem to be: always allocate and recycle local pages (except pfmemalloc ones). Page reuse migration moves when the rx interrupt affinity moves (the NAPI polling context moves accordingly) regardless of dev_to_node(). It would be good to maintain the above semantics. And the rx data page seems to be close to the cpu that is doing the rx cleaning, which means the cpu can process the buffer quicker? > > I'm open to other semantics. My main concern is performance. The > page_pool fast-path for driver recycling use-case of XDP_DROP, have > extreme performance requirements, as it needs to compete with driver > local recycle tricks (else we cannot use page_pool to simplify drivers). > The extreme performance target is 100Gbit/s = 148Mpps = 6.72ns, and > in practice I'm measuring 25Mpps = 40ns with Mlx5 driver (single q), > and Bjørn is showing 30 Mpps = 33.3ns with i40e. At this level every > cycle/instruction counts. Yes, the performance is a concern too. But if the rx page is closer to the cpu, maybe the time taken to process the buffer can be reduced? It is good to allocate the rx page close to both cpu and device, but if both goals cannot be reached, maybe we choose to allocate pages that are close to the cpu? > > >> Also, There seems to be a wild guessing of the node id here, which has >> been disscussed before and has not reached a agreement yet.
>> >>> >>> which will imply recycling without adding any extra condition to the >>> data path. > > I love code that moves thing out of our fast-path. > >>> >>> diff --git a/net/core/page_pool.c b/net/core/page_pool.c >>> index a6aefe989043..00c99282a306 100644 >>> --- a/net/core/page_pool.c >>> +++ b/net/core/page_pool.c >>> @@ -28,6 +28,9 @@ static int page_pool_init(struct page_pool *pool, >>> >>> memcpy(&pool->p, params, sizeof(pool->p)); >>> >>> + /* overwrite to allow recycling.. */ >>> + if (pool->p.nid == NUMA_NO_NODE) >>> + pool->p.nid = numa_mem_id(); >>> + > > The problem is that page_pool_init() is can be initiated from a random > CPU, first at driver setup/bringup, and later at other queue changes > that can be started via ethtool or XDP attach. (numa_mem_id() picks > from running CPU). Yes, changing the ring number or ring depth releases and allocates rx data pages, so using NUMA_NO_NODE to allocate pages and always recycling those pages may have a performance difference noticeable to the user. > > As Yunsheng mentioned elsewhere, there is also a dev_to_node() function. > Isn't that what we want in a place like this? > > > One issue with dev_to_node() is that in case of !CONFIG_NUMA it returns > NUMA_NO_NODE (-1). (And device_initialize() also set it to -1). Thus, > in that case we set pool->p.nid = 0, as page_to_nid() will also return > zero in that case (as far as I follow the code). > > >>> After a quick look, i don't see any reason why to keep NUMA_NO_NODE in >>> pool->p.nid.. >>> >>> >>>> I have compiled different variants and looked at the ASM code >>>> generated by GCC. This seems to give the best result. >>>> >>>> >>>>> 1) never recycle emergency pages, regardless of pool nid. >>>>> 2) always recycle if pool is NUMA_NO_NODE. >>>> >>>> Yes, this defines the semantics, that a page_pool configured with >>>> NUMA_NO_NODE means skip NUMA checks. I think that sounds okay...
>>>> >>>> >>>>> the above change should not add any overhead, a modest branch >>>>> predictor will handle this with no effort. >>>> >>>> It still annoys me that we keep adding instructions to this code >>>> hot-path (I counted 34 bytes and 11 instructions in my proposed >>>> function). >>>> >>>> I think that it might be possible to move these NUMA checks to >>>> alloc-side (instead of return/recycles side as today), and perhaps >>>> only on slow-path when dequeuing from ptr_ring (as recycles that >>>> call __page_pool_recycle_direct() will be pinned during NAPI). >>>> But lets focus on a smaller fix for the immediate issue... >>>> >>> >>> I know. It annoys me too, but we need recycling to work in >>> production : where rings/napi can migrate and numa nodes can be >>> NUMA_NO_NODE :-(. >>> >>> >
> > It is good to allocate the rx page close to both cpu and device, but if > both goal can not be reached, maybe we choose to allocate page that close > to cpu? > I think it is true. If it is true, we can remove pool->p.nid, replace alloc_pages_node() with alloc_pages() in __page_pool_alloc_pages_slow(), and change pool_page_reusable() so that page_to_nid(page) is checked against numa_mem_id(), since alloc_pages() hints to use current-node pages, and __page_pool_alloc_pages_slow() will often be called in NAPI polling if recycling failed; after some cycles, the pages will be from the local memory node. -Li
On 2019/12/13 14:27, Li,Rongqing wrote: >> >> It is good to allocate the rx page close to both cpu and device, but if >> both goal can not be reached, maybe we choose to allocate page that close >> to cpu? >> > I think it is true > > If it is true, we can remove pool->p.nid, and replace alloc_pages_node with > alloc_pages in __page_pool_alloc_pages_slow, and change pool_page_reusable as > that page_to_nid(page) is checked with numa_mem_id() > > since alloc_pages hint to use the current node page, and __page_pool_alloc_pages_slow > will be called in NAPI polling often if recycle failed, after some cycle, the page will be from > local memory node. Yes, if allocation and recycling are in the same NAPI polling context. As pointed out by Saeed and Ilias, the allocation and recycling may not be happening in the same NAPI polling context, see: "In the current code base if they are only called under NAPI this might be true. On the page_pool skb recycling patches though (yes we'll eventually send those :)) this is called from kfree_skb()." So this may need some additional attention. > > -Li >
On Fri, 13 Dec 2019 14:53:37 +0800 Yunsheng Lin <linyunsheng@huawei.com> wrote: > On 2019/12/13 14:27, Li,Rongqing wrote: > >> > >> It is good to allocate the rx page close to both cpu and device, > >> but if both goal can not be reached, maybe we choose to allocate > >> page that close to cpu? > >> > > I think it is true > > > > If it is true, , we can remove pool->p.nid, and replace > > alloc_pages_node with alloc_pages in __page_pool_alloc_pages_slow, > > and change pool_page_reusable as that page_to_nid(page) is checked > > with numa_mem_id() No, as I explained before, you cannot use numa_mem_id() in pool_page_reusable, because recycle call can happen from/on a remote CPU (and numa_mem_id() uses the local CPU to determine the NUMA id). Today we already have XDP cpumap redirect. And we are working on extending this to SKBs, which will increase the likelihood even more. I don't think we want to not-recycle if a remote NUMA node CPU happened to touch the packet? (Based on the optimizations done for Facebook (the reason we added this)) What seems to matter is the NUMA node of CPU that runs RX NAPI-polling, this is the first CPU that touch the packet memory and struct-page memory. For these drivers, it is also the same "RX-CPU" that does the allocation of new pages (to refill the RX-ring), and these "new" pages can come from the page_pool. With this in mind, it does seem strange that we are doing the check on the "free"/recycles code path, that can run on remote CPUs... > > since alloc_pages hint to use the current node page, and > > __page_pool_alloc_pages_slow will be called in NAPI polling often > > if recycle failed, after some cycle, the page will be from local > > memory node. You are basically saying that the NUMA check should be moved to allocation time, as it is running the RX-CPU (NAPI). And eventually after some time the pages will come from correct NUMA node. I think we can do that, and only affect the semi-fast-path. 
We just need to handle that pages in the ptr_ring that are recycled can be from the wrong NUMA node. In __page_pool_get_cached() when consuming pages from the ptr_ring (__ptr_ring_consume_batched), then we can evict pages from wrong NUMA node. For the pool->alloc.cache we either accept, that it will eventually after some time be emptied (it is only in a 100% XDP_DROP workload that it will continue to reuse same pages). Or we simply clear the pool->alloc.cache when calling page_pool_update_nid(). > Yes if allocation and recycling are in the same NAPI polling context. Which is a false assumption. > As pointed out by Saeed and Ilias, the allocation and recycling seems > to may not be happening in the same NAPI polling context, see: > > "In the current code base if they are only called under NAPI this > might be true. On the page_pool skb recycling patches though (yes > we'll eventually send those :)) this is called from kfree_skb()." > > So there may need some additionl attention. Yes, as explained above.
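The alloc-side scheme described above can be reduced to a rough userspace sketch (the pool, ring and helpers below are illustrative mocks, not the real page_pool structures): pages dequeued from the ring that belong to the wrong NUMA node are released instead of reused, and the lockless per-pool cache is simply emptied when the pool's node is updated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUMA_NO_NODE (-1)
#define CACHE_SIZE 64

struct page { int nid; };
static int page_to_nid(const struct page *p) { return p->nid; }

struct mock_pool {
	int nid;                        /* stands in for pool->p.nid */
	struct page *ring[CACHE_SIZE];  /* stands in for the ptr_ring */
	size_t ring_len;
	struct page *cache[CACHE_SIZE]; /* stands in for pool->alloc.cache */
	size_t cache_len;
	size_t evicted;                 /* pages released instead of reused */
};

/* Consume from the ring, evicting pages from the wrong NUMA node. */
static struct page *mock_get_from_ring(struct mock_pool *pool)
{
	while (pool->ring_len) {
		struct page *page = pool->ring[--pool->ring_len];

		if (pool->nid == NUMA_NO_NODE ||
		    page_to_nid(page) == pool->nid)
			return page;    /* right node: reuse it */
		pool->evicted++;        /* wrong node: "free" it */
	}
	return NULL;                    /* caller falls back to page allocator */
}

/* page_pool_update_nid()-style update that also drops the alloc cache. */
static void mock_update_nid(struct mock_pool *pool, int new_nid)
{
	pool->nid = new_nid;
	pool->evicted += pool->cache_len;
	pool->cache_len = 0;
}
```

The point of the sketch is that the per-node filtering cost moves entirely to the semi-fast alloc path, leaving the recycle path free of any node comparison.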
On 2019/12/13 16:48, Jesper Dangaard Brouer wrote: > You are basically saying that the NUMA check should be moved to > allocation time, as it is running the RX-CPU (NAPI). And eventually > after some time the pages will come from correct NUMA node. > > I think we can do that, and only affect the semi-fast-path. > We just need to handle that pages in the ptr_ring that are recycled can > be from the wrong NUMA node. In __page_pool_get_cached() when > consuming pages from the ptr_ring (__ptr_ring_consume_batched), then we > can evict pages from wrong NUMA node. Yes, that's workable. > > For the pool->alloc.cache we either accept, that it will eventually > after some time be emptied (it is only in a 100% XDP_DROP workload that > it will continue to reuse same pages). Or we simply clear the > pool->alloc.cache when calling page_pool_update_nid(). Simply clearing the pool->alloc.cache when calling page_pool_update_nid() seems better. >
On Thu 12-12-19 09:34:14, Yunsheng Lin wrote: > +CC Michal, Peter, Greg and Bjorn > Because there has been disscusion about where and how the NUMA_NO_NODE > should be handled before. I do not have a full context. What is the question here?
Hi Michal, On Mon, Dec 16, 2019 at 01:15:57PM +0100, Michal Hocko wrote: > On Thu 12-12-19 09:34:14, Yunsheng Lin wrote: > > +CC Michal, Peter, Greg and Bjorn > > Because there has been disscusion about where and how the NUMA_NO_NODE > > should be handled before. > > I do not have a full context. What is the question here? When we allocate pages for the page_pool API, during the init, the driver writer decides which NUMA node to use. The API can, in some cases, recycle the memory, instead of freeing it and re-allocating it. If the NUMA node has changed (irq affinity for example), we forbid recycling and free the memory, since recycling and using memory on far NUMA nodes is more expensive (more expensive than recycling, at least on the architectures we tried anyway). Since this would be expensive to do per packet, the burden falls on the driver writer for that. Drivers *have* to call page_pool_update_nid() or page_pool_nid_changed() if they want to check for that, which runs once per NAPI cycle. The current code in the API though does not account for NUMA_NO_NODE. That's what this is trying to fix. If the page_pool params are initialized with that, we *never* recycle the memory. This is happening because the API is allocating memory with 'nid = numa_mem_id()' if NUMA_NO_NODE is configured, so the current if statement 'page_to_nid(page) == pool->p.nid' will never trigger. The initial proposal was to check: pool->p.nid == NUMA_NO_NODE && page_to_nid(page) == numa_mem_id())); After that the thread spun out of control :) My question is, do we *really* have to check for page_to_nid(page) == numa_mem_id()? If the architecture is not NUMA aware, wouldn't pool->p.nid == NUMA_NO_NODE be enough? Thanks /Ilias > -- > Michal Hocko > SUSE Labs
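The initial proposal Ilias refers to (the v2 patch check) can be written out as a self-contained userspace sketch, with the kernel helpers mocked; numa_mem_id() is modeled as a plain parameter here because in the kernel it reads the node of whichever CPU happens to execute the recycle path:

```c
#include <assert.h>
#include <stdbool.h>

#define NUMA_NO_NODE (-1)

/* Illustrative mocks, not the kernel definitions. */
struct page { int nid; bool pfmemalloc; };
struct page_pool_params { int nid; };
struct page_pool { struct page_pool_params p; };

static int page_to_nid(const struct page *p) { return p->nid; }
static bool page_is_pfmemalloc(const struct page *p) { return p->pfmemalloc; }

/* The v2 patch's check: for NUMA_NO_NODE pools, compare the page's node
 * against the node of the CPU doing the recycling (mem_id parameter). */
static bool proposed_reusable(const struct page_pool *pool,
			      const struct page *page, int mem_id)
{
	return !page_is_pfmemalloc(page) &&
	       (page_to_nid(page) == pool->p.nid ||
		(pool->p.nid == NUMA_NO_NODE &&
		 page_to_nid(page) == mem_id));
}
```

Making mem_id an explicit parameter exposes the objection raised earlier in the thread: for a NUMA_NO_NODE pool, the same page flips between reusable and not reusable depending on which CPU runs the recycle path (e.g. under cpumap redirect).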
On Mon 16-12-19 14:34:26, Ilias Apalodimas wrote: > Hi Michal, > On Mon, Dec 16, 2019 at 01:15:57PM +0100, Michal Hocko wrote: > > On Thu 12-12-19 09:34:14, Yunsheng Lin wrote: > > > +CC Michal, Peter, Greg and Bjorn > > > Because there has been disscusion about where and how the NUMA_NO_NODE > > > should be handled before. > > > > I do not have a full context. What is the question here? > > When we allocate pages for the page_pool API, during the init, the driver writer > decides which NUMA node to use. The API can, in some cases recycle the memory, > instead of freeing it and re-allocating it. If the NUMA node has changed (irq > affinity for example), we forbid recycling and free the memory, since recycling > and using memory on far NUMA nodes is more expensive (more expensive than > recycling, at least on the architectures we tried anyway). > Since this would be expensive to do it per packet, the burden falls on the > driver writer for that. Drivers *have* to call page_pool_update_nid() or > page_pool_nid_changed() if they want to check for that which runs once > per NAPI cycle. Thanks for the clarification. > The current code in the API though does not account for NUMA_NO_NODE. That's > what this is trying to fix. > If the page_pool params are initialized with that, we *never* recycle > the memory. This is happening because the API is allocating memory with > 'nid = numa_mem_id()' if NUMA_NO_NODE is configured so the current if statement > 'page_to_nid(page) == pool->p.nid' will never trigger. OK. There is no explicit mention of the expected behavior for NUMA_NO_NODE. The semantic is usually that there is no NUMA placement requirement and the MM code simply starts the allocate from a local node in that case. But the memory might come from any node so there is no "local node" guarantee. So the main question is what is the expected semantic? Do people expect that NUMA_NO_NODE implies locality? 
Why don't you simply always reuse when there was no explicit numa requirement? > The initial proposal was to check: > pool->p.nid == NUMA_NO_NODE && page_to_nid(page) == numa_mem_id())); > After that the thread span out of control :) > My question is do we *really* have to check for > page_to_nid(page) == numa_mem_id()? if the architecture is not NUMA aware > wouldn't pool->p.nid == NUMA_NO_NODE be enough? If the architecture is !NUMA then numa_mem_id and page_to_nid should always equal and be both zero.
On Mon, Dec 16, 2019 at 02:08:45PM +0100, Michal Hocko wrote: > On Mon 16-12-19 14:34:26, Ilias Apalodimas wrote: > > Hi Michal, > > On Mon, Dec 16, 2019 at 01:15:57PM +0100, Michal Hocko wrote: > > > On Thu 12-12-19 09:34:14, Yunsheng Lin wrote: > > > > +CC Michal, Peter, Greg and Bjorn > > > > Because there has been disscusion about where and how the NUMA_NO_NODE > > > > should be handled before. > > > > > > I do not have a full context. What is the question here? > > > > When we allocate pages for the page_pool API, during the init, the driver writer > > decides which NUMA node to use. The API can, in some cases recycle the memory, > > instead of freeing it and re-allocating it. If the NUMA node has changed (irq > > affinity for example), we forbid recycling and free the memory, since recycling > > and using memory on far NUMA nodes is more expensive (more expensive than > > recycling, at least on the architectures we tried anyway). > > Since this would be expensive to do it per packet, the burden falls on the > > driver writer for that. Drivers *have* to call page_pool_update_nid() or > > page_pool_nid_changed() if they want to check for that which runs once > > per NAPI cycle. > > Thanks for the clarification. > > > The current code in the API though does not account for NUMA_NO_NODE. That's > > what this is trying to fix. > > If the page_pool params are initialized with that, we *never* recycle > > the memory. This is happening because the API is allocating memory with > > 'nid = numa_mem_id()' if NUMA_NO_NODE is configured so the current if statement > > 'page_to_nid(page) == pool->p.nid' will never trigger. > > OK. There is no explicit mention of the expected behavior for > NUMA_NO_NODE. The semantic is usually that there is no NUMA placement > requirement and the MM code simply starts the allocate from a local node > in that case. But the memory might come from any node so there is no > "local node" guarantee. 
> > So the main question is what is the expected semantic? Do people expect > that NUMA_NO_NODE implies locality? Why don't you simply always reuse > when there was no explicit numa requirement? > Well they shouldn't. Hence my next proposal. I think we are pretty much saying the same thing here. If the driver defines NUMA_NO_NODE, just blindly recycle memory. > > The initial proposal was to check: > > pool->p.nid == NUMA_NO_NODE && page_to_nid(page) == numa_mem_id())); > > > After that the thread span out of control :) > > My question is do we *really* have to check for > > page_to_nid(page) == numa_mem_id()? if the architecture is not NUMA aware > > wouldn't pool->p.nid == NUMA_NO_NODE be enough? > > If the architecture is !NUMA then numa_mem_id and page_to_nid should > always equal and be both zero. > Ditto > -- > Michal Hocko > SUSE Labs Thanks /Ilias
On 2019/12/16 21:21, Ilias Apalodimas wrote: > On Mon, Dec 16, 2019 at 02:08:45PM +0100, Michal Hocko wrote: >> On Mon 16-12-19 14:34:26, Ilias Apalodimas wrote: >>> Hi Michal, >>> On Mon, Dec 16, 2019 at 01:15:57PM +0100, Michal Hocko wrote: >>>> On Thu 12-12-19 09:34:14, Yunsheng Lin wrote: >>>>> +CC Michal, Peter, Greg and Bjorn >>>>> Because there has been disscusion about where and how the NUMA_NO_NODE >>>>> should be handled before. >>>> >>>> I do not have a full context. What is the question here? >>> >>> When we allocate pages for the page_pool API, during the init, the driver writer >>> decides which NUMA node to use. The API can, in some cases recycle the memory, >>> instead of freeing it and re-allocating it. If the NUMA node has changed (irq >>> affinity for example), we forbid recycling and free the memory, since recycling >>> and using memory on far NUMA nodes is more expensive (more expensive than >>> recycling, at least on the architectures we tried anyway). >>> Since this would be expensive to do it per packet, the burden falls on the >>> driver writer for that. Drivers *have* to call page_pool_update_nid() or >>> page_pool_nid_changed() if they want to check for that which runs once >>> per NAPI cycle. >> >> Thanks for the clarification. >> >>> The current code in the API though does not account for NUMA_NO_NODE. That's >>> what this is trying to fix. >>> If the page_pool params are initialized with that, we *never* recycle >>> the memory. This is happening because the API is allocating memory with >>> 'nid = numa_mem_id()' if NUMA_NO_NODE is configured so the current if statement >>> 'page_to_nid(page) == pool->p.nid' will never trigger. >> >> OK. There is no explicit mention of the expected behavior for >> NUMA_NO_NODE. The semantic is usually that there is no NUMA placement >> requirement and the MM code simply starts the allocate from a local node >> in that case. 
But the memory might come from any node so there is no >> "local node" guarantee. >> >> So the main question is what is the expected semantic? Do people expect >> that NUMA_NO_NODE implies locality? Why don't you simply always reuse >> when there was no explicit numa requirement? For drivers that do not support page pool yet, NUMA_NO_NODE seems to imply locality, see [1]. And for those drivers, locality is decided by rx interrupt affinity, not dev_to_node(). So when rx interrupt affinity changes, the old pages from the old node will not be recycled (by checking page_to_nid(page) == numa_mem_id()); new pages will be allocated to replace the old pages, and the new pages will be recycled because allocation and recycling happen in the same context, which means numa_mem_id() returns the same node as that of the newly allocated page, see [2]. As for why we do not allocate and recycle pages based on dev_to_node(), Jesper has explained it better than me, see [3]: "(Based on the optimizations done for Facebook (the reason we added this)) What seems to matter is the NUMA node of CPU that runs RX NAPI-polling, this is the first CPU that touch the packet memory and struct-page memory. For these drivers, it is also the same "RX-CPU" that does the allocation of new pages (to refill the RX-ring), and these "new" pages can come from the page_pool." So maybe the tricky part for page pool is to find the same context to allocate and recycle pages, so that NUMA_NO_NODE can be used to allocate pages and numa_mem_id() can be used to decide recycling. Or update the node dynamically as proposed by Rongqing, see [4]. [1] https://elixir.bootlin.com/linux/v5.5-rc1/ident/dev_alloc_pages [2] https://elixir.bootlin.com/linux/v5.5-rc1/source/drivers/net/ethernet/intel/i40e/i40e_txrx.c#L1856 [3] https://lkml.org/lkml/2019/12/13/132 [4] https://lkml.org/lkml/2019/12/15/353 >> > > Well they shouldn't. Hence my next proposal. I think we are pretty much saying > the same thing here.
> If the driver defines NUMA_NO_NODE, just blindly recycle memory. Not really, see above. > >>> The initial proposal was to check: >>> pool->p.nid == NUMA_NO_NODE && page_to_nid(page) == numa_mem_id())); >> >>> After that the thread span out of control :) >>> My question is do we *really* have to check for >>> page_to_nid(page) == numa_mem_id()? if the architecture is not NUMA aware >>> wouldn't pool->p.nid == NUMA_NO_NODE be enough? >> >> If the architecture is !NUMA then numa_mem_id and page_to_nid should >> always equal and be both zero. The tricky one is that dev_to_node() always returns -1 when CONFIG_NUMA is not defined, and some drivers may use that to allocate pages, and the node of the page may be checked against dev_to_node() to decide whether to recycle. >> > > Ditto > >> -- >> Michal Hocko >> SUSE Labs > > > Thanks > /Ilias > > . >
On Tue 17-12-19 10:11:12, Yunsheng Lin wrote: > On 2019/12/16 21:21, Ilias Apalodimas wrote: > > On Mon, Dec 16, 2019 at 02:08:45PM +0100, Michal Hocko wrote: > >> On Mon 16-12-19 14:34:26, Ilias Apalodimas wrote: > >>> Hi Michal, > >>> On Mon, Dec 16, 2019 at 01:15:57PM +0100, Michal Hocko wrote: > >>>> On Thu 12-12-19 09:34:14, Yunsheng Lin wrote: > >>>>> +CC Michal, Peter, Greg and Bjorn > >>>>> Because there has been disscusion about where and how the NUMA_NO_NODE > >>>>> should be handled before. > >>>> > >>>> I do not have a full context. What is the question here? > >>> > >>> When we allocate pages for the page_pool API, during the init, the driver writer > >>> decides which NUMA node to use. The API can, in some cases recycle the memory, > >>> instead of freeing it and re-allocating it. If the NUMA node has changed (irq > >>> affinity for example), we forbid recycling and free the memory, since recycling > >>> and using memory on far NUMA nodes is more expensive (more expensive than > >>> recycling, at least on the architectures we tried anyway). > >>> Since this would be expensive to do it per packet, the burden falls on the > >>> driver writer for that. Drivers *have* to call page_pool_update_nid() or > >>> page_pool_nid_changed() if they want to check for that which runs once > >>> per NAPI cycle. > >> > >> Thanks for the clarification. > >> > >>> The current code in the API though does not account for NUMA_NO_NODE. That's > >>> what this is trying to fix. > >>> If the page_pool params are initialized with that, we *never* recycle > >>> the memory. This is happening because the API is allocating memory with > >>> 'nid = numa_mem_id()' if NUMA_NO_NODE is configured so the current if statement > >>> 'page_to_nid(page) == pool->p.nid' will never trigger. > >> > >> OK. There is no explicit mention of the expected behavior for > >> NUMA_NO_NODE. 
The semantic is usually that there is no NUMA placement > >> requirement and the MM code simply starts the allocate from a local node > >> in that case. But the memory might come from any node so there is no > >> "local node" guarantee. > >> > >> So the main question is what is the expected semantic? Do people expect > >> that NUMA_NO_NODE implies locality? Why don't you simply always reuse > >> when there was no explicit numa requirement? > > For driver that has not supported page pool yet, NUMA_NO_NODE seems to > imply locality, see [1]. Which is kinda awkward, no? Is there any reason for __dev_alloc_pages to not use numa_mem_id explicitly when the local node affinity is required? There is no real guarantee that NUMA_NO_NODE is going to imply local node and we do not want to grow any subtle dependency on that behavior. > And for those drivers, locality is decided by rx interrupt affinity, not > dev_to_node(). So when rx interrupt affinity changes, the old page from old > node will not be recycled(by checking page_to_nid(page) == numa_mem_id()), > new pages will be allocated to replace the old pages and the new pages will > be recycled because allocation and recycling happens in the same context, > which means numa_mem_id() returns the same node of new page allocated, see > [2]. Well, but my understanding is that the generic page pool implementation has a clear means to change the affinity (namely page_pool_update_nid()). So my primary question is, why should NUMA_NO_NODE be used as a bypass for that?
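The page_pool_update_nid() mechanism Michal points at follows a simple check-then-update pattern on the driver side: once per NAPI poll, compare the pool's node with the node RX processing currently runs on, and update only on change. A mocked userspace sketch of that pattern (illustrative names, not the kernel implementation):

```c
#include <assert.h>
#include <stdbool.h>

struct mock_pool { int nid; };

/* Once-per-NAPI-poll check: only touch the pool when the node actually
 * changed, so the steady-state cost is a single comparison. Returns true
 * when an update happened (the real helper would then call
 * page_pool_update_nid()). */
static bool mock_nid_changed(struct mock_pool *pool, int cur_node)
{
	if (pool->nid != cur_node) {
		pool->nid = cur_node;
		return true;
	}
	return false;
}
```

Under this pattern the pool's node simply follows wherever RX processing migrates, which is the explicit alternative to treating NUMA_NO_NODE as a recycling bypass.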
On Thu, 2019-12-12 at 11:18 +0100, Jesper Dangaard Brouer wrote: > On Thu, 12 Dec 2019 09:34:14 +0800 > Yunsheng Lin <linyunsheng@huawei.com> wrote: > > > +CC Michal, Peter, Greg and Bjorn > > Because there has been disscusion about where and how the > > NUMA_NO_NODE > > should be handled before. > > > > On 2019/12/12 5:24, Saeed Mahameed wrote: > > > On Wed, 2019-12-11 at 19:49 +0100, Jesper Dangaard Brouer > > > wrote: > > > > On Sat, 7 Dec 2019 03:52:41 +0000 > > > > Saeed Mahameed <saeedm@mellanox.com> wrote: > > > > > > > > > I don't think it is correct to check that the page nid is > > > > > same as > > > > > numa_mem_id() if pool is NUMA_NO_NODE. In such case we should > > > > > allow > > > > > all pages to recycle, because you can't assume where pages > > > > > are > > > > > allocated from and where they are being handled. > > > > > > > > I agree, using numa_mem_id() is not valid, because it takes the > > > > numa > > > > node id from the executing CPU and the call to > > > > __page_pool_put_page() > > > > can happen on a remote CPU (e.g. cpumap redirect, and in future > > > > SKBs). > > > > > > > > > > > > > I suggest the following: > > > > > > > > > > return !page_pfmemalloc() && > > > > > ( page_to_nid(page) == pool->p.nid || pool->p.nid == > > > > > NUMA_NO_NODE ); > > > > > > > > Above code doesn't generate optimal ASM code, I suggest: > > > > > > > > static bool pool_page_reusable(struct page_pool *pool, struct > > > > page *page) > > > > { > > > > return !page_is_pfmemalloc(page) && > > > > pool->p.nid != NUMA_NO_NODE && > > > > page_to_nid(page) == pool->p.nid; > > > > } > > > > > > > > > > this is not equivalent to the above. Here in case pool->p.nid is > > > NUMA_NO_NODE, pool_page_reusable() will always be false. > > > > > > We can avoid the extra check in data path. 
> > > How about avoiding NUMA_NO_NODE in page_pool altogether, and > > > force > > > numa_mem_id() as pool->p.nid when user requests NUMA_NO_NODE at > > > page > > > pool init, as already done in alloc_pages_node(). > > > > That means we will not support page reuse mitigation for > > NUMA_NO_NODE, > > which is not same semantic that alloc_pages_node() handle > > NUMA_NO_NODE, > > because alloc_pages_node() will allocate the page based on the node > > of the current running cpu. > > True, as I wrote (below) my code defines semantics as: that a > page_pool > configured with NUMA_NO_NODE means skip NUMA checks, and allow > recycle Your code will NOT allow recycling when NUMA_NO_NODE is configured. so i am not sure what semantics you are referring to :) > regardless of NUMA node page belong to. It seems that you want > another > semantics. > I think that the semantics we want is to allow recycling when NUMA_NO_NODE is selected, to solve the reported issue ? no ? > I'm open to other semantics. My main concern is performance. The > page_pool fast-path for driver recycling use-case of XDP_DROP, have > extreme performance requirements, as it needs to compete with driver > local recycle tricks (else we cannot use page_pool to simplify > drivers). > The extreme performance target is 100Gbit/s = 148Mpps = 6.72ns, and > in practice I'm measuring 25Mpps = 40ns with Mlx5 driver (single q), > and Bjørn is showing 30 Mpps = 33.3ns with i40e. At this level every > cycle/instruction counts. > I agree. > > > Also, There seems to be a wild guessing of the node id here, which > > has > > been disscussed before and has not reached a agreement yet. > > > > > which will imply recycling without adding any extra condition to > > > the > > > data path. > > I love code that moves thing out of our fast-path. 
>
> > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > > index a6aefe989043..00c99282a306 100644
> > > --- a/net/core/page_pool.c
> > > +++ b/net/core/page_pool.c
> > > @@ -28,6 +28,9 @@ static int page_pool_init(struct page_pool *pool,
> > >
> > >  	memcpy(&pool->p, params, sizeof(pool->p));
> > >
> > > +	/* overwrite to allow recycling.. */
> > > +	if (pool->p.nid == NUMA_NO_NODE)
> > > +		pool->p.nid = numa_mem_id();
> > > +
>
> The problem is that page_pool_init() can be initiated from a random
> CPU, first at driver setup/bringup, and later at other queue changes
> that can be started via ethtool or XDP attach. (numa_mem_id() picks
> from running CPU).

Well, if the user selected NUMA_NO_NODE, then it is his choice.
Also the user always has the ability to change pool->p.nid, using the API
I introduced. So I don't see any issue with this.

> As Yunsheng mentioned elsewhere, there is also a dev_to_node()
> function. Isn't that what we want in a place like this?

We should keep this outside page_pool; if the user wants dev_to_node, he
can set this as a parameter to the page pool from outside. When
NUMA_NO_NODE is selected this means that the user doesn't really care.

> One issue with dev_to_node() is that in case of !CONFIG_NUMA it returns
> NUMA_NO_NODE (-1). (And device_initialize() also set it to -1). Thus,
> in that case we set pool->p.nid = 0, as page_to_nid() will also return
> zero in that case (as far as I follow the code).

Yes, this is the idea: since alloc_page will also select node 0 if you
keep NUMA_NO_NODE, recycling will then happen inherently by design
without any change to the data path.

> > > After a quick look, i don't see any reason why to keep
> > > NUMA_NO_NODE in pool->p.nid..
> >
> > > > I have compiled different variants and looked at the ASM code
> > > > generated by GCC. This seems to give the best result.
> > > >
> > > > > 1) never recycle emergency pages, regardless of pool nid.
> > > > > 2) always recycle if pool is NUMA_NO_NODE. > > > > > > > > Yes, this defines the semantics, that a page_pool configured > > > > with > > > > NUMA_NO_NODE means skip NUMA checks. I think that sounds > > > > okay... > > > > > > > > > > > > > the above change should not add any overhead, a modest branch > > > > > predictor will handle this with no effort. > > > > > > > > It still annoys me that we keep adding instructions to this > > > > code > > > > hot-path (I counted 34 bytes and 11 instructions in my proposed > > > > function). > > > > > > > > I think that it might be possible to move these NUMA checks to > > > > alloc-side (instead of return/recycles side as today), and > > > > perhaps > > > > only on slow-path when dequeuing from ptr_ring (as recycles > > > > that > > > > call __page_pool_recycle_direct() will be pinned during NAPI). > > > > But lets focus on a smaller fix for the immediate issue... > > > > > > > > > > I know. It annoys me too, but we need recycling to work in > > > production : where rings/napi can migrate and numa nodes can be > > > NUMA_NO_NODE :-(. > > > > > >
On Fri, 2019-12-13 at 11:40 +0800, Yunsheng Lin wrote: > On 2019/12/12 18:18, Jesper Dangaard Brouer wrote: > > On Thu, 12 Dec 2019 09:34:14 +0800 > > Yunsheng Lin <linyunsheng@huawei.com> wrote: > > > > > +CC Michal, Peter, Greg and Bjorn > > > Because there has been disscusion about where and how the > > > NUMA_NO_NODE > > > should be handled before. > > > > > > On 2019/12/12 5:24, Saeed Mahameed wrote: > > > > On Wed, 2019-12-11 at 19:49 +0100, Jesper Dangaard Brouer > > > > wrote: > > > > > On Sat, 7 Dec 2019 03:52:41 +0000 > > > > > Saeed Mahameed <saeedm@mellanox.com> wrote: > > > > > > > > > > > I don't think it is correct to check that the page nid is > > > > > > same as > > > > > > numa_mem_id() if pool is NUMA_NO_NODE. In such case we > > > > > > should allow > > > > > > all pages to recycle, because you can't assume where pages > > > > > > are > > > > > > allocated from and where they are being handled. > > > > > > > > > > I agree, using numa_mem_id() is not valid, because it takes > > > > > the numa > > > > > node id from the executing CPU and the call to > > > > > __page_pool_put_page() > > > > > can happen on a remote CPU (e.g. cpumap redirect, and in > > > > > future > > > > > SKBs). > > > > > > > > > > > > > > > > I suggest the following: > > > > > > > > > > > > return !page_pfmemalloc() && > > > > > > ( page_to_nid(page) == pool->p.nid || pool->p.nid == > > > > > > NUMA_NO_NODE ); > > > > > > > > > > Above code doesn't generate optimal ASM code, I suggest: > > > > > > > > > > static bool pool_page_reusable(struct page_pool *pool, > > > > > struct page *page) > > > > > { > > > > > return !page_is_pfmemalloc(page) && > > > > > pool->p.nid != NUMA_NO_NODE && > > > > > page_to_nid(page) == pool->p.nid; > > > > > } > > > > > > > > > > > > > this is not equivalent to the above. Here in case pool->p.nid > > > > is > > > > NUMA_NO_NODE, pool_page_reusable() will always be false. > > > > > > > > We can avoid the extra check in data path. 
> > > > How about avoiding NUMA_NO_NODE in page_pool altogether, and > > > > force > > > > numa_mem_id() as pool->p.nid when user requests NUMA_NO_NODE at > > > > page > > > > pool init, as already done in alloc_pages_node(). > > > > > > That means we will not support page reuse mitigation for > > > NUMA_NO_NODE, > > > which is not same semantic that alloc_pages_node() handle > > > NUMA_NO_NODE, > > > because alloc_pages_node() will allocate the page based on the > > > node > > > of the current running cpu. > > > > True, as I wrote (below) my code defines semantics as: that a > > page_pool > > configured with NUMA_NO_NODE means skip NUMA checks, and allow > > recycle > > regardless of NUMA node page belong to. It seems that you want > > another > > semantics. > > For driver that does not have page pool support yet, the semantics > seems > to be: always allocate and recycle local page(excpet pfmemalloc one). > Page > reuse migration moves when the rx interrupt affinity moves(the NAPI > polling > context moves accordingly) regardless of the dev_to_node(). > > It would be good to maintain the above semantics. And rx data page > seems > to be close to the cpu that doing the rx cleaning, which means the > cpu > can process the buffer quicker? > > > I'm open to other semantics. My main concern is performance. The > > page_pool fast-path for driver recycling use-case of XDP_DROP, have > > extreme performance requirements, as it needs to compete with > > driver > > local recycle tricks (else we cannot use page_pool to simplify > > drivers). > > The extreme performance target is 100Gbit/s = 148Mpps = 6.72ns, and > > in practice I'm measuring 25Mpps = 40ns with Mlx5 driver (single > > q), > > and Bjørn is showing 30 Mpps = 33.3ns with i40e. At this level > > every > > cycle/instruction counts. > > Yes, the performance is a concern too. > But if the rx page is closer to the cpu, maybe the time taken to > process > the buffer can be reduced? 
>
> It is good to allocate the rx page close to both cpu and device, but if
> both goals can not be reached, maybe we choose to allocate the page
> close to the cpu?

This should be up to the user; the page pool shouldn't care much. At init,
the user picks whatever node he wants. On the data path/napi, recycling
will initially only occur on the initial pool->p.nid; dynamically the user
can change it via page_pool_nid_changed().

Using dev_to_node() or numa_mem_id() are totally different scenarios for
different use cases. Normally you should use numa_mem_id() (close to cpu)
for RX data buffers, and dev_to_node() (close to device) for HW ring
descriptors. If we make the pool biased to one use case then we can't
support the other, thus keep this outside of the pool. The user must set
the nid on initialization and change it on the fly if necessary. If the
user chooses NUMA_NO_NODE, then the pool might decide to go with
numa_mem_id(), which is my latest proposed solution for this, and which
at least will solve the regression.

> > > Also, there seems to be a wild guessing of the node id here, which
> > > has been discussed before and has not reached an agreement yet.
> > >
> > > > which will imply recycling without adding any extra condition to
> > > > the data path.
> >
> > I love code that moves things out of our fast-path.
> >
> > > > diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> > > > index a6aefe989043..00c99282a306 100644
> > > > --- a/net/core/page_pool.c
> > > > +++ b/net/core/page_pool.c
> > > > @@ -28,6 +28,9 @@ static int page_pool_init(struct page_pool *pool,
> > > >
> > > >  	memcpy(&pool->p, params, sizeof(pool->p));
> > > >
> > > > +	/* overwrite to allow recycling.. */
> > > > +	if (pool->p.nid == NUMA_NO_NODE)
> > > > +		pool->p.nid = numa_mem_id();
> > > > +
> >
> > The problem is that page_pool_init() can be initiated from a random
> > CPU, first at driver setup/bringup, and later at other queue changes
> > that can be started via ethtool or XDP attach. (numa_mem_id() picks
> > from running CPU).
>
> Yes, changing ring num or ring depth releases and allocates rx data
> pages, so using NUMA_NO_NODE to allocate pages and always recycling
> those pages may have a performance difference noticeable to the user.
>
> > As Yunsheng mentioned elsewhere, there is also a dev_to_node()
> > function. Isn't that what we want in a place like this?
> >
> > One issue with dev_to_node() is that in case of !CONFIG_NUMA it
> > returns NUMA_NO_NODE (-1). (And device_initialize() also set it to
> > -1). Thus, in that case we set pool->p.nid = 0, as page_to_nid() will
> > also return zero in that case (as far as I follow the code).
>
> > > > After a quick look, i don't see any reason why to keep
> > > > NUMA_NO_NODE in pool->p.nid..
> > > >
> > > > > I have compiled different variants and looked at the ASM code
> > > > > generated by GCC. This seems to give the best result.
> > > > >
> > > > > > 1) never recycle emergency pages, regardless of pool nid.
> > > > > > 2) always recycle if pool is NUMA_NO_NODE.
> > > > >
> > > > > Yes, this defines the semantics, that a page_pool configured
> > > > > with NUMA_NO_NODE means skip NUMA checks. I think that sounds
> > > > > okay...
> > > > >
> > > > > > the above change should not add any overhead, a modest branch
> > > > > > predictor will handle this with no effort.
> > > > >
> > > > > It still annoys me that we keep adding instructions to this code
> > > > > hot-path (I counted 34 bytes and 11 instructions in my proposed
> > > > > function).
> > > > > > > > > > I think that it might be possible to move these NUMA checks > > > > > to > > > > > alloc-side (instead of return/recycles side as today), and > > > > > perhaps > > > > > only on slow-path when dequeuing from ptr_ring (as recycles > > > > > that > > > > > call __page_pool_recycle_direct() will be pinned during > > > > > NAPI). > > > > > But lets focus on a smaller fix for the immediate issue... > > > > > > > > > > > > > I know. It annoys me too, but we need recycling to work in > > > > production : where rings/napi can migrate and numa nodes can be > > > > NUMA_NO_NODE :-(. > > > > > > > >
On 2019/12/17 17:11, Michal Hocko wrote: > On Tue 17-12-19 10:11:12, Yunsheng Lin wrote: >> On 2019/12/16 21:21, Ilias Apalodimas wrote: >>> On Mon, Dec 16, 2019 at 02:08:45PM +0100, Michal Hocko wrote: >>>> On Mon 16-12-19 14:34:26, Ilias Apalodimas wrote: >>>>> Hi Michal, >>>>> On Mon, Dec 16, 2019 at 01:15:57PM +0100, Michal Hocko wrote: >>>>>> On Thu 12-12-19 09:34:14, Yunsheng Lin wrote: >>>>>>> +CC Michal, Peter, Greg and Bjorn >>>>>>> Because there has been disscusion about where and how the NUMA_NO_NODE >>>>>>> should be handled before. >>>>>> >>>>>> I do not have a full context. What is the question here? >>>>> >>>>> When we allocate pages for the page_pool API, during the init, the driver writer >>>>> decides which NUMA node to use. The API can, in some cases recycle the memory, >>>>> instead of freeing it and re-allocating it. If the NUMA node has changed (irq >>>>> affinity for example), we forbid recycling and free the memory, since recycling >>>>> and using memory on far NUMA nodes is more expensive (more expensive than >>>>> recycling, at least on the architectures we tried anyway). >>>>> Since this would be expensive to do it per packet, the burden falls on the >>>>> driver writer for that. Drivers *have* to call page_pool_update_nid() or >>>>> page_pool_nid_changed() if they want to check for that which runs once >>>>> per NAPI cycle. >>>> >>>> Thanks for the clarification. >>>> >>>>> The current code in the API though does not account for NUMA_NO_NODE. That's >>>>> what this is trying to fix. >>>>> If the page_pool params are initialized with that, we *never* recycle >>>>> the memory. This is happening because the API is allocating memory with >>>>> 'nid = numa_mem_id()' if NUMA_NO_NODE is configured so the current if statement >>>>> 'page_to_nid(page) == pool->p.nid' will never trigger. >>>> >>>> OK. There is no explicit mention of the expected behavior for >>>> NUMA_NO_NODE. 
>>>> The semantic is usually that there is no NUMA placement
>>>> requirement and the MM code simply starts the allocation from a local
>>>> node in that case. But the memory might come from any node so there is
>>>> no "local node" guarantee.
>>>>
>>>> So the main question is what is the expected semantic? Do people expect
>>>> that NUMA_NO_NODE implies locality? Why don't you simply always reuse
>>>> when there was no explicit numa requirement?
>>
>> For driver that has not supported page pool yet, NUMA_NO_NODE seems to
>> imply locality, see [1].
>
> Which is kinda awkward, no? Is there any reason for __dev_alloc_pages to
> not use numa_mem_id explicitly when the local node affinity is required?

Not sure why. And it seems other places use NUMA_NO_NODE in the net code
too:

grep -rn "NUMA_NO_NODE" net
net/openvswitch/flow_table.c:85:	node_online(0) ? 0 : NUMA_NO_NODE);
net/core/skbuff.c:436:	skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
net/core/skbuff.c:507:	skb = __alloc_skb(len, gfp_mask, SKB_ALLOC_RX, NUMA_NO_NODE);
net/core/skbuff.c:1514:	skb_alloc_rx_flag(skb), NUMA_NO_NODE);
net/core/skbuff.c:1553:	struct sk_buff *n = __alloc_skb(size, gfp_mask, flags, NUMA_NO_NODE);
net/core/skbuff.c:1629:	gfp_mask, NUMA_NO_NODE, NULL);
net/core/skbuff.c:1747:	NUMA_NO_NODE);
net/core/skbuff.c:3811:	NUMA_NO_NODE);
net/core/skbuff.c:5720:	gfp_mask, NUMA_NO_NODE, NULL);
net/core/skbuff.c:5844:	gfp_mask, NUMA_NO_NODE, NULL);
net/core/dev.c:2378:	NUMA_NO_NODE);
net/core/dev.c:2617:	numa_node_id : NUMA_NO_NODE);
net/core/dev.c:9117:	netdev_queue_numa_node_write(queue, NUMA_NO_NODE);
net/core/pktgen.c:3632:	pkt_dev->node = NUMA_NO_NODE;
net/sunrpc/svc.c:296:	return NUMA_NO_NODE;
net/qrtr/qrtr.c:97:static unsigned int qrtr_local_nid = NUMA_NO_NODE;

> There is no real guarantee that NUMA_NO_NODE is going to imply the local
> node and we do not want to grow any subtle dependency on that behavior.

Strictly speaking, using numa_mem_id() also does not have a real guarantee
that it will allocate local memory when local memory is used up, right?
Because alloc_pages_node() is basically turning the node into numa_mem_id()
when it is NUMA_NO_NODE.

Unless we do not allow passing NUMA_NO_NODE to alloc_pages_node(), otherwise
I can not see a difference between NUMA_NO_NODE and numa_mem_id().

>
>> And for those drivers, locality is decided by rx interrupt affinity, not
>> dev_to_node(). So when rx interrupt affinity changes, the old page from
>> the old node will not be recycled (by checking page_to_nid(page) ==
>> numa_mem_id()), new pages will be allocated to replace the old pages and
>> the new pages will be recycled because allocation and recycling happen in
>> the same context, which means numa_mem_id() returns the same node as the
>> new page allocated, see [2].
>
> Well, but my understanding is that the generic page pool implementation
> has a clear means to change the affinity (namely page_pool_update_nid()).
> So my primary question is: why should NUMA_NO_NODE be used as a
> bypass for that?

In that case, page_pool_update_nid() needs to be called explicitly, which
may not be the reality, because for drivers using page pool now, mlx5 seems
to be the only one to call page_pool_update_nid(), which may lead to
copy & paste problems when not careful enough.

>
On Thu 19-12-19 10:09:33, Yunsheng Lin wrote:
[...]
> > There is no real guarantee that NUMA_NO_NODE is going to imply the local
> > node and we do not want to grow any subtle dependency on that behavior.
>
> Strictly speaking, using numa_mem_id() also does not have a real guarantee
> that it will allocate local memory when local memory is used up, right?
> Because alloc_pages_node() is basically turning the node into numa_mem_id()
> when it is NUMA_NO_NODE.

yes, both allocations are allowed to fallback to other nodes unless there
is an explicit nodemask specified.

> Unless we do not allow passing NUMA_NO_NODE to alloc_pages_node(), otherwise
> I can not see a difference between NUMA_NO_NODE and numa_mem_id().

The difference is in the presented intention. NUMA_NO_NODE means no node
preference. We turn it into an implicit local node preference because this
is the best assumption we can make in general. If you provide numa_mem_id
then you explicitly ask for the local node preference because you know that
this is the best for your specific code. See the difference? The
NUMA_NO_NODE -> local node mapping is a heuristic which might change
(albeit unlikely).

> >> And for those drivers, locality is decided by rx interrupt affinity, not
> >> dev_to_node(). So when rx interrupt affinity changes, the old page from
> >> the old node will not be recycled (by checking page_to_nid(page) ==
> >> numa_mem_id()), new pages will be allocated to replace the old pages and
> >> the new pages will be recycled because allocation and recycling happen in
> >> the same context, which means numa_mem_id() returns the same node as the
> >> new page allocated, see [2].
> >
> > Well, but my understanding is that the generic page pool implementation
> > has a clear means to change the affinity (namely page_pool_update_nid()).
> > So my primary question is: why should NUMA_NO_NODE be used as a
> > bypass for that?
>
> In that case, page_pool_update_nid() needs to be called explicitly, which
> may not be the reality, because for drivers using page pool now, mlx5 seems
> to be the only one to call page_pool_update_nid(), which may lead to
> copy & paste problems when not careful enough.

The API is quite new AFAIU and I think it would be better to use it in the
intended way. Relying on implicit and undocumented behavior is just going
to bend that API in the future and it will impose an additional burden to
any future changes.
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index a6aefe989043..3c8b51ccd1c1 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -312,12 +312,17 @@ static bool __page_pool_recycle_direct(struct page *page,
 /* page is NOT reusable when:
  * 1) allocated when system is under some pressure. (page_is_pfmemalloc)
  * 2) belongs to a different NUMA node than pool->p.nid.
+ * 3) belongs to a different memory node than current context
+ *    if pool->p.nid is NUMA_NO_NODE
  *
  * To update pool->p.nid users must call page_pool_update_nid.
  */
 static bool pool_page_reusable(struct page_pool *pool, struct page *page)
 {
-	return !page_is_pfmemalloc(page) && page_to_nid(page) == pool->p.nid;
+	return !page_is_pfmemalloc(page) &&
+		(page_to_nid(page) == pool->p.nid ||
+		 (pool->p.nid == NUMA_NO_NODE &&
+		  page_to_nid(page) == numa_mem_id()));
 }
 
 void __page_pool_put_page(struct page_pool *pool, struct page *page,
some drivers use page pool but do not require pages allocated from the
bound node, or simply assign pool.p.nid to NUMA_NO_NODE; commit
d5394610b1ba ("page_pool: Don't recycle non-reusable pages") blocks
recycling for this kind of driver

so treat the page as reusable when it belongs to the current memory
node and nid is NUMA_NO_NODE

v1-->v2: add check with numa_mem_id from Yunsheng

Fixes: d5394610b1ba ("page_pool: Don't recycle non-reusable pages")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Suggested-by: Yunsheng Lin <linyunsheng@huawei.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
---
 net/core/page_pool.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)