| Message ID | 1484809388.16328.19.camel@edumazet-glaptop3.roam.corp.google.com |
|---|---|
| State | Accepted, archived |
| Delegated to: | David Miller |
On Thu, Jan 19, 2017 at 9:03 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> A driver using dev_alloc_page() must not reuse a page allocated from emergency memory reserve.
>
> Otherwise all packets using this page will be immediately dropped, unless for very specific sockets having SOCK_MEMALLOC bit set.
>
> This issue might be hard to debug, because only a fraction of received packets would be dropped.

Hi Eric,

When you say reuse, do you mean pointing to the same page from several SKBs?

Because in our page cache implementation we don't reuse pages that were already passed to the stack; we just keep them in the page cache until the ref count drops back to one, and then we recycle them (i.e. they will be re-used only when no one else is using them).

On the mlx5e RX re-post to the HW buffer we will call mlx5e_page_alloc_mapped->mlx5e_rx_cache_get, and mlx5e_rx_cache_get will only return a page that has already been freed by all stack users; otherwise it will allocate a brand new one.

A page from the mlx5e page cache can be in one of 2 states:
1. passed to the stack (not free)
2. free (already used and freed by the stack)

A non-free page will never get reused for another SKB.

This is true for both mlx5 RX modes (striding and non-striding), but in striding mode, regardless of the page cache, small packets can always fit in one page and this page can get pointed to by several SKBs - should we be concerned about this case?

Anyway, if you meant by this patch to avoid keeping such special pages in the mlx5 page cache, then it looks very good to me.

Thanks,
Saeed.

> Fixes: 4415a0319f92 ("net/mlx5e: Implement RX mapped page cache for page recycle")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> Cc: Saeed Mahameed <saeedm@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index 0e2fb3ed1790900ccdc0bbbef8bc5c33267df092..ce46409bde2736ea4d1246ca97855098f052f271 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -193,6 +193,9 @@ static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
> 		return false;
> 	}
>
> +	if (unlikely(page_is_pfmemalloc(dma_info->page)))
> +		return false;
> +
> 	cache->page_cache[cache->tail] = *dma_info;
> 	cache->tail = tail_next;
> 	return true;
On Thu, 2017-01-19 at 21:14 +0200, Saeed Mahameed wrote:
> On Thu, Jan 19, 2017 at 9:03 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > A driver using dev_alloc_page() must not reuse a page allocated from emergency memory reserve.
> >
> > Otherwise all packets using this page will be immediately dropped, unless for very specific sockets having SOCK_MEMALLOC bit set.
> >
> > This issue might be hard to debug, because only a fraction of received packets would be dropped.
>
> Hi Eric,
>
> When you say reuse, do you mean pointing to the same page from several SKBs?
>
> Because in our page cache implementation we don't reuse pages that were already passed to the stack; we just keep them in the page cache until the ref count drops back to one, and then we recycle them (i.e. they will be re-used only when no one else is using them).

Look at the page_is_pfmemalloc() calls in other network drivers.

The point is: we do not want to reuse a page that was initially allocated from the emergency reserve. Chances are high that the packet would simply be dropped.

If your cache is not refilled with such problematic pages, future dev_alloc_page() calls will likely give you sane pages once the stress has ended.

By definition, emergency reserves are small; we do not want to keep pages from the reserve in another cache, depleting the reserve for too long.

dev_alloc_page() in mlx5 was introduced in commit bc77b240b3c57236cdcc08d64ca390655d3a16ff ("net/mlx5e: Add fragmented memory support for RX multi packet WQE")

Maybe my Fixes: tag was slightly wrong.

> On the mlx5e RX re-post to the HW buffer we will call mlx5e_page_alloc_mapped->mlx5e_rx_cache_get, and mlx5e_rx_cache_get will only return a page that has already been freed by all stack users; otherwise it will allocate a brand new one.
>
> A page from the mlx5e page cache can be in one of 2 states:
> 1. passed to the stack (not free)
> 2. free (already used and freed by the stack)
>
> A non-free page will never get reused for another SKB.

Of course.

> This is true for both mlx5 RX modes (striding and non-striding), but in striding mode, regardless of the page cache, small packets can always fit in one page and this page can get pointed to by several SKBs - should we be concerned about this case?
>
> Anyway, if you meant by this patch to avoid keeping such special pages in the mlx5 page cache, then it looks very good to me.
>
> Thanks,
> Saeed.
>
> > Fixes: 4415a0319f92 ("net/mlx5e: Implement RX mapped page cache for page recycle")
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Tariq Toukan <tariqt@mellanox.com>
> > Cc: Saeed Mahameed <saeedm@mellanox.com>
> > ---
> >  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > index 0e2fb3ed1790900ccdc0bbbef8bc5c33267df092..ce46409bde2736ea4d1246ca97855098f052f271 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > @@ -193,6 +193,9 @@ static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
> > 		return false;
> > 	}
> >
> > +	if (unlikely(page_is_pfmemalloc(dma_info->page)))
> > +		return false;
> > +
> > 	cache->page_cache[cache->tail] = *dma_info;
> > 	cache->tail = tail_next;
> > 	return true;
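A minimal sketch of the recycle-guard pattern being discussed (the structure and helper names here are hypothetical, not the mlx5 code; only page_is_pfmemalloc() is the real kernel helper from <linux/mm.h>):

```c
#include <linux/mm.h>	/* page_is_pfmemalloc() */

/* Hypothetical per-ring recycle cache slot. */
struct rx_cache_slot {
	struct page *page;
};

/*
 * Admit a page into the driver's recycle cache only if it did NOT come
 * from the emergency (pfmemalloc) reserve.  Reserve pages must return
 * to the page allocator as soon as possible, so the small reserve is
 * not depleted and packets built on them are not dropped by the stack.
 */
static bool rx_cache_slot_put(struct rx_cache_slot *slot, struct page *page)
{
	if (page_is_pfmemalloc(page))
		return false;	/* caller releases the page instead */

	slot->page = page;
	return true;
}
```

The actual patch quoted in this thread applies the same check as an early return at the top of mlx5e_rx_cache_put().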
On Thu, Jan 19, 2017 at 9:25 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2017-01-19 at 21:14 +0200, Saeed Mahameed wrote:
>> On Thu, Jan 19, 2017 at 9:03 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> From: Eric Dumazet <edumazet@google.com>
>>>
>>> A driver using dev_alloc_page() must not reuse a page allocated from emergency memory reserve.
>>>
>>> Otherwise all packets using this page will be immediately dropped, unless for very specific sockets having SOCK_MEMALLOC bit set.
>>>
>>> This issue might be hard to debug, because only a fraction of received packets would be dropped.
>>
>> Hi Eric,
>>
>> When you say reuse, do you mean pointing to the same page from several SKBs?
>>
>> Because in our page cache implementation we don't reuse pages that were already passed to the stack; we just keep them in the page cache until the ref count drops back to one, and then we recycle them (i.e. they will be re-used only when no one else is using them).
>
> Look at the page_is_pfmemalloc() calls in other network drivers.
>
> The point is: we do not want to reuse a page that was initially allocated from the emergency reserve. Chances are high that the packet would simply be dropped.
>
> If your cache is not refilled with such problematic pages, future dev_alloc_page() calls will likely give you sane pages once the stress has ended.
>
> By definition, emergency reserves are small; we do not want to keep pages from the reserve in another cache, depleting the reserve for too long.

Got it, thanks for the explanation. I got confused since the term "reuse" can also imply "reuse the same page for another RX packet"; I guess the correct term for the mlx5 page cache is "recycle".

> dev_alloc_page() in mlx5 was introduced in commit bc77b240b3c57236cdcc08d64ca390655d3a16ff ("net/mlx5e: Add fragmented memory support for RX multi packet WQE")
>
> Maybe my Fixes: tag was slightly wrong.

No, I think you got it right in the first place: you are fixing the RX cache to not keep hold of such pages. Before the RX cache we didn't have this issue.

>> On the mlx5e RX re-post to the HW buffer we will call mlx5e_page_alloc_mapped->mlx5e_rx_cache_get, and mlx5e_rx_cache_get will only return a page that has already been freed by all stack users; otherwise it will allocate a brand new one.
>>
>> A page from the mlx5e page cache can be in one of 2 states:
>> 1. passed to the stack (not free)
>> 2. free (already used and freed by the stack)
>>
>> A non-free page will never get reused for another SKB.
>
> Of course.
>
>> This is true for both mlx5 RX modes (striding and non-striding), but in striding mode, regardless of the page cache, small packets can always fit in one page and this page can get pointed to by several SKBs - should we be concerned about this case?
>>
>> Anyway, if you meant by this patch to avoid keeping such special pages in the mlx5 page cache, then it looks very good to me.
>>
>> Thanks,
>> Saeed.
>>
>>> Fixes: 4415a0319f92 ("net/mlx5e: Implement RX mapped page cache for page recycle")
>>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>>> Cc: Tariq Toukan <tariqt@mellanox.com>
>>> Cc: Saeed Mahameed <saeedm@mellanox.com>
>>> ---
>>>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 3 +++
>>>  1 file changed, 3 insertions(+)
>>>
>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
>>> index 0e2fb3ed1790900ccdc0bbbef8bc5c33267df092..ce46409bde2736ea4d1246ca97855098f052f271 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
>>> @@ -193,6 +193,9 @@ static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
>>> 		return false;
>>> 	}
>>>
>>> +	if (unlikely(page_is_pfmemalloc(dma_info->page)))
>>> +		return false;
>>> +
>>> 	cache->page_cache[cache->tail] = *dma_info;
>>> 	cache->tail = tail_next;
>>> 	return true;
On Thu, Jan 19, 2017 at 9:03 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> A driver using dev_alloc_page() must not reuse a page allocated from emergency memory reserve.
>
> Otherwise all packets using this page will be immediately dropped, unless for very specific sockets having SOCK_MEMALLOC bit set.
>
> This issue might be hard to debug, because only a fraction of received packets would be dropped.
>
> Fixes: 4415a0319f92 ("net/mlx5e: Implement RX mapped page cache for page recycle")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> Cc: Saeed Mahameed <saeedm@mellanox.com>

Acked-by: Saeed Mahameed <saeedm@mellanox.com>
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 18 Jan 2017 23:03:08 -0800

> From: Eric Dumazet <edumazet@google.com>
>
> A driver using dev_alloc_page() must not reuse a page allocated from emergency memory reserve.
>
> Otherwise all packets using this page will be immediately dropped, unless for very specific sockets having SOCK_MEMALLOC bit set.
>
> This issue might be hard to debug, because only a fraction of received packets would be dropped.
>
> Fixes: 4415a0319f92 ("net/mlx5e: Implement RX mapped page cache for page recycle")
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied and queued up for -stable, thanks.
On Thu, Jan 19, 2017 at 11:14 AM, Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> On Thu, Jan 19, 2017 at 9:03 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> From: Eric Dumazet <edumazet@google.com>
>>
>> A driver using dev_alloc_page() must not reuse a page allocated from emergency memory reserve.
>>
>> Otherwise all packets using this page will be immediately dropped, unless for very specific sockets having SOCK_MEMALLOC bit set.
>>
>> This issue might be hard to debug, because only a fraction of received packets would be dropped.
>
> Hi Eric,
>
> When you say reuse, do you mean pointing to the same page from several SKBs?
>
> Because in our page cache implementation we don't reuse pages that were already passed to the stack; we just keep them in the page cache until the ref count drops back to one, and then we recycle them (i.e. they will be re-used only when no one else is using them).

Saeed,

Speaking of the mlx page cache, can we remove this or at least make it optional to use? It is another example of complex functionality being put into drivers that makes things like backports more complicated and provides at best some marginal value. In the case of the mlx5e cache code, the results from pktgen really weren't very impressive in the first place. Also, the cache suffers from HOL blocking, where we can block the whole cache due to an outstanding reference on just one page (something that you wouldn't see in pktgen but is likely to happen in real applications).

Thanks,
Tom

> On the mlx5e RX re-post to the HW buffer we will call mlx5e_page_alloc_mapped->mlx5e_rx_cache_get, and mlx5e_rx_cache_get will only return a page that has already been freed by all stack users; otherwise it will allocate a brand new one.
>
> A page from the mlx5e page cache can be in one of 2 states:
> 1. passed to the stack (not free)
> 2. free (already used and freed by the stack)
>
> A non-free page will never get reused for another SKB.
>
> This is true for both mlx5 RX modes (striding and non-striding), but in striding mode, regardless of the page cache, small packets can always fit in one page and this page can get pointed to by several SKBs - should we be concerned about this case?
>
> Anyway, if you meant by this patch to avoid keeping such special pages in the mlx5 page cache, then it looks very good to me.
>
> Thanks,
> Saeed.
>
>> Fixes: 4415a0319f92 ("net/mlx5e: Implement RX mapped page cache for page recycle")
>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>> Cc: Tariq Toukan <tariqt@mellanox.com>
>> Cc: Saeed Mahameed <saeedm@mellanox.com>
>> ---
>>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
>> index 0e2fb3ed1790900ccdc0bbbef8bc5c33267df092..ce46409bde2736ea4d1246ca97855098f052f271 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
>> @@ -193,6 +193,9 @@ static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
>> 		return false;
>> 	}
>>
>> +	if (unlikely(page_is_pfmemalloc(dma_info->page)))
>> +		return false;
>> +
>> 	cache->page_cache[cache->tail] = *dma_info;
>> 	cache->tail = tail_next;
>> 	return true;
On Sat, 2017-01-21 at 20:12 +0100, kernel netdev wrote:
> Den 21. jan. 2017 7.10 PM skrev "Tom Herbert" <tom@herbertland.com>:
> On Thu, Jan 19, 2017 at 11:14 AM, Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
>> On Thu, Jan 19, 2017 at 9:03 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> From: Eric Dumazet <edumazet@google.com>
>>>
>>> A driver using dev_alloc_page() must not reuse a page allocated from emergency memory reserve.
>>>
>>> Otherwise all packets using this page will be immediately dropped, unless for very specific sockets having SOCK_MEMALLOC bit set.
>>>
>>> This issue might be hard to debug, because only a fraction of received packets would be dropped.
>>
>> Hi Eric,
>>
>> When you say reuse, do you mean pointing to the same page from several SKBs?
>>
>> Because in our page cache implementation we don't reuse pages that were already passed to the stack; we just keep them in the page cache until the ref count drops back to one, and then we recycle them (i.e. they will be re-used only when no one else is using them).
>
> Saeed,
>
> Speaking of the mlx page cache, can we remove this or at least make it optional to use? It is another example of complex functionality being put into drivers that makes things like backports more complicated and provides at best some marginal value. In the case of the mlx5e cache code, the results from pktgen really weren't very impressive in the first place. Also, the cache suffers from HOL blocking, where we can block the whole cache due to an outstanding reference on just one page (something that you wouldn't see in pktgen but is likely to happen in real applications).
>
> (Sent from phone in car)
>
> To Tom, have you measured the effect of this page cache, before claiming it is ineffective?
>
> My previous measurements show approx 20% speedup on a UDP test with delivery to remote CPU.

I find this a bit strange. When you have time (i.e. not while driving your car or during the weekend), please give more details, for example on message size.

Was it before skb_condense() was added ?

> Removing the cache would of course be a good use case for speeding up the page allocator (PCP), which Mel Gorman and I are working on. AFAIK a current page order-0 allocation costs 240 cycles. Mel has reduced it to 180, and without NUMA 150 cycles. And with bulking this can be amortized to 80 cycles.
>
> --Jesper
On Sat, Jan 21, 2017 at 9:12 PM, kernel netdev <netdev@brouer.com> wrote:
> Den 21. jan. 2017 7.10 PM skrev "Tom Herbert" <tom@herbertland.com>:
> On Thu, Jan 19, 2017 at 11:14 AM, Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
>> On Thu, Jan 19, 2017 at 9:03 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> From: Eric Dumazet <edumazet@google.com>
>>>
>>> A driver using dev_alloc_page() must not reuse a page allocated from emergency memory reserve.
>>>
>>> Otherwise all packets using this page will be immediately dropped, unless for very specific sockets having SOCK_MEMALLOC bit set.
>>>
>>> This issue might be hard to debug, because only a fraction of received packets would be dropped.
>>
>> Hi Eric,
>>
>> When you say reuse, do you mean pointing to the same page from several SKBs?
>>
>> Because in our page cache implementation we don't reuse pages that were already passed to the stack; we just keep them in the page cache until the ref count drops back to one, and then we recycle them (i.e. they will be re-used only when no one else is using them).
>
> Saeed,
>
> Speaking of the mlx page cache, can we remove this or at least make it optional to use? It is another example of complex functionality being put into drivers that makes things like backports more complicated and

Re complexity, I am not sure the mlx page cache is that complex; we just wrap alloc_page/put_page with our own page cache calls. Roughly, the page cache implementation is 200-300 LOC tops, all concentrated in one place in the code.

> provides at best some marginal value. In the case of the mlx5e cache code, the results from pktgen really weren't very impressive in the first place. Also, the cache suffers from HOL blocking, where we can

Well, with pktgen you won't notice a huge improvement, since the pages are freed in the stack directly from our RX receive handler (gro_receive); those pages go back to the page allocator and get requested immediately again from the driver (no stress).

The real improvements are seen when you really stress the page allocator with real use cases such as TCP/UDP with user applications, where the pages are held longer than the driver needs them. The page cache in this case plays an important role of reducing the stress on the page allocator, since with those use cases we are juggling with more pages than with pktgen.

With more stress (#cores TCP streams) our humble page cache sometimes can't hold! It will get full fast enough and will fail to recycle for a huge percentage of the driver's page requests.

Before our own page cache we used dev_alloc_skb, which used its own cache "page_frag_cache", and it worked nicely enough. So I don't really recommend removing the page cache until we have a generic RX page cache solution for all device drivers.

> block the whole cache due to an outstanding reference on just one page (something that you wouldn't see in pktgen but is likely to happen in real applications).

Re the HOL issue, we have some upcoming patches that would drastically improve the HOL blocking issue (we will simply swap the HOL on every sample).

> (Sent from phone in car)

Drive safe :) ..

> To Tom, have you measured the effect of this page cache, before claiming it is ineffective?
>
> My previous measurements show approx 20% speedup on a UDP test with delivery to remote CPU.
>
> Removing the cache would of course be a good use case for speeding up the page allocator (PCP), which Mel Gorman and I are working on. AFAIK a current page order-0 allocation costs 240 cycles. Mel has reduced it to 180, and without NUMA 150 cycles. And with bulking this can be amortized to 80 cycles.

Are you trying to say that we won't need the cache if you manage to deliver those optimizations? Can you compare those optimizations with the page_frag_cache from dev_alloc_skb?

> --Jesper
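For reference, a rough sketch of the recycle ("get") side described above, simplified and with hypothetical names rather than the driver's actual code (page_ref_count() is the real kernel helper). It also makes the head-of-line property Tom mentions visible: if the head page is still referenced, nothing behind it is considered.

```c
#include <linux/types.h>
#include <linux/kernel.h>	/* ARRAY_SIZE() */
#include <linux/mm.h>
#include <linux/page_ref.h>	/* page_ref_count() */

/* Hypothetical FIFO recycle cache, roughly mirroring the description above. */
struct rx_page_cache {
	struct page *pages[256];
	u16 head;
	u16 tail;
};

/*
 * A cached page may be handed back to the RX ring only once the stack
 * has dropped all its references, i.e. the refcount is back to 1 (the
 * cache's own reference).  Otherwise the caller falls back to
 * dev_alloc_page().
 */
static struct page *rx_cache_get(struct rx_page_cache *c)
{
	struct page *page;

	if (c->head == c->tail)
		return NULL;			/* cache empty */

	page = c->pages[c->head];
	if (page_ref_count(page) != 1)
		return NULL;			/* head still in use: HOL blocking */

	c->head = (c->head + 1) & (ARRAY_SIZE(c->pages) - 1);
	return page;
}
```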
On Sat, Jan 21, 2017 at 12:31 PM, Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> On Sat, Jan 21, 2017 at 9:12 PM, kernel netdev <netdev@brouer.com> wrote:
>> Den 21. jan. 2017 7.10 PM skrev "Tom Herbert" <tom@herbertland.com>:
>> On Thu, Jan 19, 2017 at 11:14 AM, Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
>>> On Thu, Jan 19, 2017 at 9:03 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>> From: Eric Dumazet <edumazet@google.com>
>>>>
>>>> A driver using dev_alloc_page() must not reuse a page allocated from emergency memory reserve.
>>>>
>>>> Otherwise all packets using this page will be immediately dropped, unless for very specific sockets having SOCK_MEMALLOC bit set.
>>>>
>>>> This issue might be hard to debug, because only a fraction of received packets would be dropped.
>>>
>>> Hi Eric,
>>>
>>> When you say reuse, do you mean pointing to the same page from several SKBs?
>>>
>>> Because in our page cache implementation we don't reuse pages that were already passed to the stack; we just keep them in the page cache until the ref count drops back to one, and then we recycle them (i.e. they will be re-used only when no one else is using them).
>>
>> Saeed,
>>
>> Speaking of the mlx page cache, can we remove this or at least make it optional to use? It is another example of complex functionality being put into drivers that makes things like backports more complicated and
>
> Re complexity, I am not sure the mlx page cache is that complex; we just wrap alloc_page/put_page with our own page cache calls. Roughly, the page cache implementation is 200-300 LOC tops, all concentrated in one place in the code.
>
Taken as part of the RX buffer management code, the whole thing is very complicated and seems to be completely bereft of any comments in the code as to how things are supposed to work.

>> provides at best some marginal value. In the case of the mlx5e cache code, the results from pktgen really weren't very impressive in the first place. Also, the cache suffers from HOL blocking, where we can
>
> Well, with pktgen you won't notice a huge improvement, since the pages are freed in the stack directly from our RX receive handler (gro_receive); those pages go back to the page allocator and get requested immediately again from the driver (no stress).
>
> The real improvements are seen when you really stress the page allocator with real use cases such as TCP/UDP with user applications, where the pages are held longer than the driver needs them. The page cache in this case plays an important role of reducing the stress on the page allocator, since with those use cases we are juggling with more pages than with pktgen.
>
> With more stress (#cores TCP streams) our humble page cache sometimes can't hold! It will get full fast enough and will fail to recycle for a huge percentage of the driver's page requests.
>
> Before our own page cache we used dev_alloc_skb, which used its own cache "page_frag_cache", and it worked nicely enough. So I don't really recommend removing the page cache until we have a generic RX page cache solution for all device drivers.
>
>> block the whole cache due to an outstanding reference on just one page (something that you wouldn't see in pktgen but is likely to happen in real applications).
>
> Re the HOL issue, we have some upcoming patches that would drastically improve the HOL blocking issue (we will simply swap the HOL on every sample).
>
>> (Sent from phone in car)
>
> Drive safe :) ..
>
>> To Tom, have you measured the effect of this page cache, before claiming it is ineffective?
>>
>> My previous measurements show approx 20% speedup on a UDP test with delivery to remote CPU.
>>
>> Removing the cache would of course be a good use case for speeding up the page allocator (PCP), which Mel Gorman and I are working on. AFAIK a current page order-0 allocation costs 240 cycles. Mel has reduced it to 180, and without NUMA 150 cycles. And with bulking this can be amortized to 80 cycles.
>
> Are you trying to say that we won't need the cache if you manage to deliver those optimizations? Can you compare those optimizations with the page_frag_cache from dev_alloc_skb?
>
>> --Jesper
On Sat, Jan 21, 2017 at 11:12 AM, kernel netdev <netdev@brouer.com> wrote:
> Den 21. jan. 2017 7.10 PM skrev "Tom Herbert" <tom@herbertland.com>:
> On Thu, Jan 19, 2017 at 11:14 AM, Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
>> On Thu, Jan 19, 2017 at 9:03 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> From: Eric Dumazet <edumazet@google.com>
>>>
>>> A driver using dev_alloc_page() must not reuse a page allocated from emergency memory reserve.
>>>
>>> Otherwise all packets using this page will be immediately dropped, unless for very specific sockets having SOCK_MEMALLOC bit set.
>>>
>>> This issue might be hard to debug, because only a fraction of received packets would be dropped.
>>
>> Hi Eric,
>>
>> When you say reuse, do you mean pointing to the same page from several SKBs?
>>
>> Because in our page cache implementation we don't reuse pages that were already passed to the stack; we just keep them in the page cache until the ref count drops back to one, and then we recycle them (i.e. they will be re-used only when no one else is using them).
>
> Saeed,
>
> Speaking of the mlx page cache, can we remove this or at least make it optional to use? It is another example of complex functionality being put into drivers that makes things like backports more complicated and provides at best some marginal value. In the case of the mlx5e cache code, the results from pktgen really weren't very impressive in the first place. Also, the cache suffers from HOL blocking, where we can block the whole cache due to an outstanding reference on just one page (something that you wouldn't see in pktgen but is likely to happen in real applications).
>
> (Sent from phone in car)
>
> To Tom, have you measured the effect of this page cache, before claiming it is ineffective?

No, I have not. TBH, I have spent most of the past few weeks trying to debug a backport of the code from 4.9 to 4.6. Until we have a working backport, performance is immaterial for our purposes. Unfortunately, we are seeing some issues: the checksum faults I described previously, and crashes on bad page refcnts which are presumably being caused by the logic in RX buffer processing. This is why I am now having to dissect the code and try to disable things like the page cache that are not essential functionality. In any case, the HOL blocking issue is obvious from reading the code and implies bimodal behavior -- we don't need a test to know it's bad, as we've seen the bad effects of that in many other contexts.

> My previous measurements show approx 20% speedup on a UDP test with delivery to remote CPU.
>
> Removing the cache would of course be a good use case for speeding up the page allocator (PCP), which Mel Gorman and I are working on. AFAIK a current page order-0 allocation costs 240 cycles. Mel has reduced it to 180, and without NUMA 150 cycles. And with bulking this can be amortized to 80 cycles.

That would be great. If only I had a nickel for every time someone started working on a driver and came to the conclusion that they need to do a custom memory allocator because the kernel allocator is so inefficient!

Tom

> --Jesper
On Sat, 21 Jan 2017 22:31:26 +0200 Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
>> My previous measurements show approx 20% speedup on a UDP test with delivery to remote CPU.
>>
>> Removing the cache would of course be a good use case for speeding up the page allocator (PCP), which Mel Gorman and I are working on. AFAIK a current page order-0 allocation costs 240 cycles. Mel has reduced it to 180, and without NUMA 150 cycles. And with bulking this can be amortized to 80 cycles.
>
> Are you trying to say that we won't need the cache if you manage to deliver those optimizations?

These page alloc optimizations are good, but not fast enough to replace your cache. Maybe I can improve it further, but it will be very hard to compete with a recycle cache (it can skip many checks the page allocator needs, as it knows the page state).

Said in another way, the gap will be significantly smaller, and you will not see a big boost from using this cache.

BUT there are other advantages to using a guaranteed recycle pool facility (like the page_pool). Namely, (1) DMA overhead: keeping pages DMA mapped to counter DMA+IOMMU overhead, and (2) RX zero-copy: it opens up a software memory model solution for pre-VMA-mapping pages to userspace (see [1] for the argument of how this avoids leaking kernel memory, and only exposes/leaks packet-data memory).

> can you compare those optimizations with the page_frag_cache from dev_alloc_skb?

IMHO the page_frag_cache is indirectly a bulk page allocator facility, which limits the number of times the page allocator is called and thus amortizes the cost of talking to the page allocator. Performance wise, it is likely comparable to your page cache, and likely faster than the 80 cycles order-0 bulking. BUT we generally worry about using these 32K pages, as it opens a window for attackers pinning down memory.

[1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
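To illustrate the indirect bulking point, a conceptual sketch only: netdev_alloc_frag() and SKB_DATA_ALIGN() are the real kernel APIs behind dev_alloc_skb()'s buffer allocation, while the wrapper below is hypothetical.

```c
#include <linux/skbuff.h>	/* netdev_alloc_frag(), SKB_DATA_ALIGN() */

/*
 * The page_frag_cache used by dev_alloc_skb()/netdev_alloc_frag() grabs
 * one higher-order page (up to 32KB) and carves it into fragments, so
 * the page allocator is only visited once every N buffers.  That is the
 * amortization referred to above; the 32KB page is also what raises the
 * memory-pinning concern.
 */
static void *rx_frag_alloc(unsigned int len)
{
	/* The fragment comes from the current per-CPU cache page; a fresh
	 * page is only allocated once the cached one is exhausted.
	 */
	return netdev_alloc_frag(SKB_DATA_ALIGN(len));
}
```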
On Sat, 21 Jan 2017 11:26:49 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> My previous measurements show approx 20% speedup on a UDP test with delivery to remote CPU.
>
> I find this a bit strange. When you have time (i.e. not while driving your car or during the weekend), please give more details, for example on message size.

I tested this with both 64 bytes and 1500 bytes. After I moved to 50G and 100G testing, I no longer need to use 64 byte packets to provoke the bottlenecks in the stack ;-)

> Was it before skb_condense() was added ?

I tested this just before skb_condense() was added. BUT skb_condense() does not get activated when using mlx5, because it uses build_skb(), i.e. it is not using frags.

For people that don't realize this: Eric's optimization in skb_condense() is about trading a remote CPU atomic refcnt (put_page) for a copy + a local CPU refcnt decrement.

My measurements show the cycle cost is local=31 vs. remote=208, thus an estimated saving of around 177 cycles. This is spent on calling a fairly complex function, __pskb_pull_tail(), and it only works for more complex SKBs with frags.
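A rough paraphrase of the skb_condense() idea being described, as a sketch under the assumptions above rather than the exact upstream function:

```c
#include <linux/skbuff.h>

/*
 * If a small amount of payload sits in page frags and the skb is not
 * cloned, pull it into the linear area so the frag pages can be
 * released right now (cheap local refcount drop) instead of later on a
 * remote CPU (costly atomic), then shrink truesize accordingly.
 */
static void condense_sketch(struct sk_buff *skb)
{
	if (!skb->data_len || skb_cloned(skb))
		return;
	if (skb->data_len > skb_tailroom(skb))
		return;		/* not enough room in the linear part */

	__pskb_pull_tail(skb, skb->data_len);	/* copies frags, frees pages */
	skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
}
```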
From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Mon, 23 Jan 2017 09:39:40 +0100

> BUT there are other advantages to using a guaranteed recycle pool facility (like the page_pool). Namely, (1) DMA overhead: keeping pages DMA mapped to counter DMA+IOMMU overhead, and (2) RX zero-copy: it opens up a software memory model solution for pre-VMA-mapping pages to userspace (see [1] for the argument of how this avoids leaking kernel memory, and only exposes/leaks packet-data memory).

+1
On Mon, Jan 23, 2017 at 11:14 AM, Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> On Sat, 21 Jan 2017 11:26:49 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>> > My previous measurements show approx 20% speedup on a UDP test with delivery to remote CPU.
>> >
>> I find this a bit strange. When you have time (i.e. not while driving your car or during the weekend), please give more details, for example on message size.
>
> I tested this with both 64 bytes and 1500 bytes. After I moved to 50G and 100G testing, I no longer need to use 64 byte packets to provoke the bottlenecks in the stack ;-)

Exactly! For XDP-like use cases the page cache may be a non-required optimization, but when you start testing typical TCP use cases over a 50/100G link you will need more buffers (pages) to host the traffic for longer periods, and you will hit that bottleneck.

>> Was it before skb_condense() was added ?
>
> I tested this just before skb_condense() was added. BUT skb_condense() does not get activated when using mlx5, because it uses build_skb(), i.e. it is not using frags.

Well, we can always replace build_skb with alloc_skb + memcpy(skb->data, headlen) + add_skb_frag(payload). Is it worth it? And is it healthy that both skb->data and skb_shinfo(skb)->frags[i] point to the same page?

> For people that don't realize this: Eric's optimization in skb_condense() is about trading a remote CPU atomic refcnt (put_page) for a copy + a local CPU refcnt decrement.
>
> My measurements show the cycle cost is local=31 vs. remote=208, thus an estimated saving of around 177 cycles. This is spent on calling a fairly complex function, __pskb_pull_tail(), and it only works for more complex SKBs with frags.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
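For what it's worth, a sketch of the alternative Saeed describes (illustrative only; the function name, sizes, offsets and the truesize accounting are placeholders, not the mlx5 code):

```c
#include <linux/mm.h>
#include <linux/skbuff.h>
#include <linux/string.h>

/*
 * Instead of build_skb() over the whole receive buffer: allocate a small
 * skb, copy only the headers into its linear area, and attach the payload
 * page as a frag.  Such an skb carries frags, so skb_condense() can act
 * on it later.
 */
static struct sk_buff *rx_copy_break(struct napi_struct *napi,
				     struct page *page, unsigned int offset,
				     unsigned int len, unsigned int hlen)
{
	struct sk_buff *skb = napi_alloc_skb(napi, hlen);

	if (!skb)
		return NULL;

	memcpy(skb_put(skb, hlen), page_address(page) + offset, hlen);

	if (len > hlen) {
		get_page(page);		/* the frag holds its own reference */
		skb_add_rx_frag(skb, 0, page, offset + hlen,
				len - hlen, PAGE_SIZE);
	}
	return skb;
}
```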
On Sun, Jan 22, 2017 at 7:50 PM, Tom Herbert <tom@herbertland.com> wrote:
> On Sat, Jan 21, 2017 at 12:31 PM, Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
>> On Sat, Jan 21, 2017 at 9:12 PM, kernel netdev <netdev@brouer.com> wrote:
>>> Den 21. jan. 2017 7.10 PM skrev "Tom Herbert" <tom@herbertland.com>:
>>> On Thu, Jan 19, 2017 at 11:14 AM, Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
>>>> On Thu, Jan 19, 2017 at 9:03 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>>> From: Eric Dumazet <edumazet@google.com>
>>>>>
>>>>> A driver using dev_alloc_page() must not reuse a page allocated from emergency memory reserve.
>>>>>
>>>>> Otherwise all packets using this page will be immediately dropped, unless for very specific sockets having SOCK_MEMALLOC bit set.
>>>>>
>>>>> This issue might be hard to debug, because only a fraction of received packets would be dropped.
>>>>
>>>> Hi Eric,
>>>>
>>>> When you say reuse, do you mean pointing to the same page from several SKBs?
>>>>
>>>> Because in our page cache implementation we don't reuse pages that were already passed to the stack; we just keep them in the page cache until the ref count drops back to one, and then we recycle them (i.e. they will be re-used only when no one else is using them).
>>>
>>> Saeed,
>>>
>>> Speaking of the mlx page cache, can we remove this or at least make it optional to use? It is another example of complex functionality being put into drivers that makes things like backports more complicated and
>>
>> Re complexity, I am not sure the mlx page cache is that complex; we just wrap alloc_page/put_page with our own page cache calls. Roughly, the page cache implementation is 200-300 LOC tops, all concentrated in one place in the code.
>>
> Taken as part of the RX buffer management code, the whole thing is very complicated and seems to be completely bereft of any comments in the code as to how things are supposed to work.
>
The idea is fairly simple; maybe we could have made it cleaner if we had better decoupled the page cache from the RX logic.

The page cache works in parallel to the RX logic and takes its own ref counts. This way the RX logic keeps working as if there was no cache, but instead of the kernel get/put_page calls it uses the internal cache page get/put calls.

Yeah, I agree, some comments could help. We will work on this when we fix the HOL issues.
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 0e2fb3ed1790900ccdc0bbbef8bc5c33267df092..ce46409bde2736ea4d1246ca97855098f052f271 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -193,6 +193,9 @@ static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
 		return false;
 	}
 
+	if (unlikely(page_is_pfmemalloc(dma_info->page)))
+		return false;
+
 	cache->page_cache[cache->tail] = *dma_info;
 	cache->tail = tail_next;
 	return true;