
[net] net/mlx5e: Do not recycle pages from emergency reserve

Message ID 1484809388.16328.19.camel@edumazet-glaptop3.roam.corp.google.com
State Accepted, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Jan. 19, 2017, 7:03 a.m. UTC
From: Eric Dumazet <edumazet@google.com>

A driver using dev_alloc_page() must not reuse a page allocated from
emergency memory reserve.

Otherwise all packets using this page will be immediately dropped,
unless for very specific sockets having SOCK_MEMALLOC bit set.

This issue might be hard to debug, because only a fraction of received
packets would be dropped.

Fixes: 4415a0319f92 ("net/mlx5e: Implement RX mapped page cache for page recycle")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |    3 +++
 1 file changed, 3 insertions(+)
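
For readers unfamiliar with the mechanism the commit message refers to, here
is an illustrative sketch, paraphrased from memory of the net/core code of
that era (the helper names below are invented for the sketch and are not the
kernel's):

/* Sketch only: why a page from the emergency reserve makes every
 * packet built on it droppable for ordinary sockets.
 */

/* When the driver attaches the page to an skb, the pfmemalloc mark is
 * propagated; build_skb() and __skb_fill_page_desc() do roughly this:
 */
static void taint_skb_from_page(struct sk_buff *skb, struct page *page)
{
	if (page_is_pfmemalloc(page))
		skb->pfmemalloc = true;
}

/* On socket delivery, the stack then refuses such skbs for any socket
 * that does not have SOCK_MEMALLOC set; sk_filter() does roughly this:
 */
static int memalloc_drop_check(struct sock *sk, const struct sk_buff *skb)
{
	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
		return -ENOMEM;	/* packet is dropped */
	return 0;
}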

Comments

Saeed Mahameed Jan. 19, 2017, 7:14 p.m. UTC | #1
On Thu, Jan 19, 2017 at 9:03 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> A driver using dev_alloc_page() must not reuse a page allocated from
> emergency memory reserve.
>
> Otherwise all packets using this page will be immediately dropped,
> unless for very specific sockets having SOCK_MEMALLOC bit set.
>
> This issue might be hard to debug, because only a fraction of received
> packets would be dropped.

Hi Eric,

When you say reuse, do you mean pointing to the same page from several SKBs?

Because in our page cache implementation we don't reuse pages that were
already passed to the stack; we just keep them in the page cache until
the ref count drops back to one, and then we recycle them (i.e. they
will be re-used only when no one else is using them).

On mlx5e RX re-post to the HW buffer we call
mlx5e_page_alloc_mapped->mlx5e_rx_cache_get, and mlx5e_rx_cache_get
will only return a page that has already been freed by all stack users;
otherwise it will allocate a brand new one.
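
A minimal sketch of the recycle-eligibility check described above,
paraphrased from this description rather than copied from the driver
(structure and field names are approximate):

/* Sketch only: a page leaves the cache only when the stack has dropped
 * every reference to it (refcount back to one, i.e. the driver's own
 * reference); otherwise the caller falls back to a fresh
 * dev_alloc_page(). Note the FIFO head check: one still-referenced
 * page at the head keeps everything behind it from being recycled
 * (the HOL-blocking concern raised later in this thread).
 */
static bool rx_cache_get_sketch(struct mlx5e_page_cache *cache,
				struct mlx5e_dma_info *dma_info)
{
	if (cache->head == cache->tail)		/* cache empty */
		return false;

	if (page_ref_count(cache->page_cache[cache->head].page) != 1)
		return false;			/* still held by the stack */

	*dma_info = cache->page_cache[cache->head];
	cache->head = (cache->head + 1) & (MLX5E_CACHE_SIZE - 1);
	return true;
}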

A page from the mlx5e page cache can be in one of two states:
1. passed to the stack (not free)
2. free (already used and freed by the stack)

A non-free page will never be reused for another SKB.

This is true for both mlx5 RX modes (striding and non-striding).

But in striding mode, regardless of the page cache, small packets can
always fit in one page and that page can end up pointed to by several
SKBs - should we be concerned about this case?

Anyway, if the intent of this patch is to avoid keeping such special
pages in the mlx5 page cache, then it looks very good to me.

Thanks,
Saeed.


>
> Fixes: 4415a0319f92 ("net/mlx5e: Implement RX mapped page cache for page recycle")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> Cc: Saeed Mahameed <saeedm@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |    3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index 0e2fb3ed1790900ccdc0bbbef8bc5c33267df092..ce46409bde2736ea4d1246ca97855098f052f271 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -193,6 +193,9 @@ static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
>                 return false;
>         }
>
> +       if (unlikely(page_is_pfmemalloc(dma_info->page)))
> +               return false;
> +
>         cache->page_cache[cache->tail] = *dma_info;
>         cache->tail = tail_next;
>         return true;
>
>
Eric Dumazet Jan. 19, 2017, 7:25 p.m. UTC | #2
On Thu, 2017-01-19 at 21:14 +0200, Saeed Mahameed wrote:
> On Thu, Jan 19, 2017 at 9:03 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > A driver using dev_alloc_page() must not reuse a page allocated from
> > emergency memory reserve.
> >
> > Otherwise all packets using this page will be immediately dropped,
> > unless for very specific sockets having SOCK_MEMALLOC bit set.
> >
> > This issue might be hard to debug, because only a fraction of received
> > packets would be dropped.
> 
> Hi Eric,
> 
> When you say reuse, you mean point to the same page from several SKBs ?
> 
> Because in our page cache implementation we don't reuse pages that
> already passed to the stack,
> we just keep them in the page cache until the ref count drop back to
> one, so we recycle them (i,e they will be re-used only when no one
> else is using them).

Look at other page_is_pfmemalloc() calls in other network drivers.

The point is:

We do not want to reuse a page that was initially allocated from the
emergency reserve.

Chances are high that the packet will simply be dropped.

If your cache is not refilled with such problematic pages, future
dev_alloc_page() calls will likely give you sane pages after the stress
has ended.

By definition, emergency reserves are small; we do not want to keep
pages from the reserve in another cache and deplete the reserve for too
long.

dev_alloc_page() in mlx5 was introduced in commit
bc77b240b3c57236cdcc08d64ca390655d3a16ff
("net/mlx5e: Add fragmented memory support for RX multi packet WQE")

Maybe my Fixes: tag was slightly wrong.
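
For reference, the idiom Eric points to looks roughly like the following
in several other NIC drivers; this is a paraphrase of that pattern with a
hypothetical helper name, not code taken from any specific driver:

/* Common recycling guard (paraphrased): never put a page back into the
 * driver's recycle pool if it came from the pfmemalloc emergency
 * reserve (many drivers also refuse pages from a remote NUMA node).
 */
static bool rx_page_is_reusable(struct page *page)
{
	return !page_is_pfmemalloc(page) &&
	       page_to_nid(page) == numa_node_id();
}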

> 
> on mlx5e rx re-post to the HW buffer we will call
> mlx5e_page_alloc_mapped->mlx5e_rx_cache_get, and  mlx5e_rx_cache_get
> will only return a page that is already freed by all stack users,
> otherwise it will allocated a brand new one.
> 
> a page form mlx5e_page_cache can have one of 2 states
> 1. passed to the stack (not free)
> 2. free (already used and freed by the stack)
> 
> a non free page will never get reused for another SKB.


Of course.

> 
> this is true for both mlx5 rx modes (striding and non-striding)
> 
> but in the striding mode regardless of page cache, small packets can
> always fit in one page and this page can get pointed to by several
> SKBs - should we be concerned about this case ?
> 
> Anyway, if you meant by this patch to avoid keeping such special pages
> in the mlx5 page cache,
> then it looks very good to me.
> 
> Thanks,
> Saeed.
> 
> 
> >
> > Fixes: 4415a0319f92 ("net/mlx5e: Implement RX mapped page cache for page recycle")
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Tariq Toukan <tariqt@mellanox.com>
> > Cc: Saeed Mahameed <saeedm@mellanox.com>
> > ---
> >  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |    3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > index 0e2fb3ed1790900ccdc0bbbef8bc5c33267df092..ce46409bde2736ea4d1246ca97855098f052f271 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > @@ -193,6 +193,9 @@ static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
> >                 return false;
> >         }
> >
> > +       if (unlikely(page_is_pfmemalloc(dma_info->page)))
> > +               return false;
> > +
> >         cache->page_cache[cache->tail] = *dma_info;
> >         cache->tail = tail_next;
> >         return true;
> >
> >
Saeed Mahameed Jan. 19, 2017, 7:39 p.m. UTC | #3
On Thu, Jan 19, 2017 at 9:25 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2017-01-19 at 21:14 +0200, Saeed Mahameed wrote:
>> On Thu, Jan 19, 2017 at 9:03 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > From: Eric Dumazet <edumazet@google.com>
>> >
>> > A driver using dev_alloc_page() must not reuse a page allocated from
>> > emergency memory reserve.
>> >
>> > Otherwise all packets using this page will be immediately dropped,
>> > unless for very specific sockets having SOCK_MEMALLOC bit set.
>> >
>> > This issue might be hard to debug, because only a fraction of received
>> > packets would be dropped.
>>
>> Hi Eric,
>>
>> When you say reuse, you mean point to the same page from several SKBs ?
>>
>> Because in our page cache implementation we don't reuse pages that
>> already passed to the stack,
>> we just keep them in the page cache until the ref count drop back to
>> one, so we recycle them (i,e they will be re-used only when no one
>> else is using them).
>
> Look at other page_is_pfmemalloc() calls in other network drivers.
>
> The point is :
>
> We do not want to reuse a page that was initially allocated from
> emergency reserve.
>
> Chances are high that the packet will be simply dropped.
>
> If your cache is not refilled with such problematic pages,
> future dev_page_alloc() will likely give you sane pages, after stress
> has ended.
>
> By definition, emergency reserves are small, we do not want to keep
> pages from the reserve in another cache, depleting the reserve for too
> long.

Got it, thanks for the explanation. I got confused since the term "reuse"
can also imply "reuse the same page for another RX packet"; I guess the
correct term for the mlx5 page cache is "recycle".

>
> dev_alloc_page() in mlx5 was introduced in commit
> bc77b240b3c57236cdcc08d64ca390655d3a16ff
> ("net/mlx5e: Add fragmented memory support for RX multi packet WQE")
>
> Maybe my Fixes: tag was slightly wrong.
>

No, I think you got it right in the first place: you are fixing the RX
cache so it does not keep hold of such pages.
Before the RX cache we didn't have this issue.

>>
>> on mlx5e rx re-post to the HW buffer we will call
>> mlx5e_page_alloc_mapped->mlx5e_rx_cache_get, and  mlx5e_rx_cache_get
>> will only return a page that is already freed by all stack users,
>> otherwise it will allocated a brand new one.
>>
>> a page form mlx5e_page_cache can have one of 2 states
>> 1. passed to the stack (not free)
>> 2. free (already used and freed by the stack)
>>
>> a non free page will never get reused for another SKB.
>
>
> Of course.
>
>>
>> this is true for both mlx5 rx modes (striding and non-striding)
>>
>> but in the striding mode regardless of page cache, small packets can
>> always fit in one page and this page can get pointed to by several
>> SKBs - should we be concerned about this case ?
>>
>> Anyway, if you meant by this patch to avoid keeping such special pages
>> in the mlx5 page cache,
>> then it looks very good to me.
>>
>> Thanks,
>> Saeed.
>>
>>
>> >
>> > Fixes: 4415a0319f92 ("net/mlx5e: Implement RX mapped page cache for page recycle")
>> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>> > Cc: Tariq Toukan <tariqt@mellanox.com>
>> > Cc: Saeed Mahameed <saeedm@mellanox.com>
>> > ---
>> >  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |    3 +++
>> >  1 file changed, 3 insertions(+)
>> >
>> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
>> > index 0e2fb3ed1790900ccdc0bbbef8bc5c33267df092..ce46409bde2736ea4d1246ca97855098f052f271 100644
>> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
>> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
>> > @@ -193,6 +193,9 @@ static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
>> >                 return false;
>> >         }
>> >
>> > +       if (unlikely(page_is_pfmemalloc(dma_info->page)))
>> > +               return false;
>> > +
>> >         cache->page_cache[cache->tail] = *dma_info;
>> >         cache->tail = tail_next;
>> >         return true;
>> >
>> >
>
>
Saeed Mahameed Jan. 20, 2017, 10:28 a.m. UTC | #4
On Thu, Jan 19, 2017 at 9:03 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> A driver using dev_alloc_page() must not reuse a page allocated from
> emergency memory reserve.
>
> Otherwise all packets using this page will be immediately dropped,
> unless for very specific sockets having SOCK_MEMALLOC bit set.
>
> This issue might be hard to debug, because only a fraction of received
> packets would be dropped.
>
> Fixes: 4415a0319f92 ("net/mlx5e: Implement RX mapped page cache for page recycle")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> Cc: Saeed Mahameed <saeedm@mellanox.com>

Acked-by: Saeed Mahameed <saeedm@mellanox.com>
David Miller Jan. 20, 2017, 5:43 p.m. UTC | #5
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 18 Jan 2017 23:03:08 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> A driver using dev_alloc_page() must not reuse a page allocated from
> emergency memory reserve.
> 
> Otherwise all packets using this page will be immediately dropped,
> unless for very specific sockets having SOCK_MEMALLOC bit set.
> 
> This issue might be hard to debug, because only a fraction of received
> packets would be dropped.
> 
> Fixes: 4415a0319f92 ("net/mlx5e: Implement RX mapped page cache for page recycle")
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied and queued up for -stable, thanks.
Tom Herbert Jan. 21, 2017, 6:09 p.m. UTC | #6
On Thu, Jan 19, 2017 at 11:14 AM, Saeed Mahameed
<saeedm@dev.mellanox.co.il> wrote:
> On Thu, Jan 19, 2017 at 9:03 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> From: Eric Dumazet <edumazet@google.com>
>>
>> A driver using dev_alloc_page() must not reuse a page allocated from
>> emergency memory reserve.
>>
>> Otherwise all packets using this page will be immediately dropped,
>> unless for very specific sockets having SOCK_MEMALLOC bit set.
>>
>> This issue might be hard to debug, because only a fraction of received
>> packets would be dropped.
>
> Hi Eric,
>
> When you say reuse, you mean point to the same page from several SKBs ?
>
> Because in our page cache implementation we don't reuse pages that
> already passed to the stack,
> we just keep them in the page cache until the ref count drop back to
> one, so we recycle them (i,e they will be re-used only when no one
> else is using them).
>
Saeed,

Speaking of the mlx page cache, can we remove this or at least make it
optional to use? It is another example of complex functionality being
put into drivers that makes things like backports more complicated and
provides at best some marginal value. In the case of the mlx5e cache
code the results from pktgen really weren't very impressive in the
first place. Also, the cache suffers from HOL blocking, where we can
block the whole cache due to an outstanding reference on just one page
(something that you wouldn't see in pktgen but is likely to happen in
real applications).

Thanks,
Tom

> on mlx5e rx re-post to the HW buffer we will call
> mlx5e_page_alloc_mapped->mlx5e_rx_cache_get, and  mlx5e_rx_cache_get
> will only return a page that is already freed by all stack users,
> otherwise it will allocated a brand new one.
>
> a page form mlx5e_page_cache can have one of 2 states
> 1. passed to the stack (not free)
> 2. free (already used and freed by the stack)
>
> a non free page will never get reused for another SKB.
>
> this is true for both mlx5 rx modes (striding and non-striding)
>
> but in the striding mode regardless of page cache, small packets can
> always fit in one page and this page can get pointed to by several
> SKBs - should we be concerned about this case ?
>
> Anyway, if you meant by this patch to avoid keeping such special pages
> in the mlx5 page cache,
> then it looks very good to me.
>
> Thanks,
> Saeed.
>
>
>>
>> Fixes: 4415a0319f92 ("net/mlx5e: Implement RX mapped page cache for page recycle")
>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>> Cc: Tariq Toukan <tariqt@mellanox.com>
>> Cc: Saeed Mahameed <saeedm@mellanox.com>
>> ---
>>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |    3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
>> index 0e2fb3ed1790900ccdc0bbbef8bc5c33267df092..ce46409bde2736ea4d1246ca97855098f052f271 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
>> @@ -193,6 +193,9 @@ static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
>>                 return false;
>>         }
>>
>> +       if (unlikely(page_is_pfmemalloc(dma_info->page)))
>> +               return false;
>> +
>>         cache->page_cache[cache->tail] = *dma_info;
>>         cache->tail = tail_next;
>>         return true;
>>
>>
Eric Dumazet Jan. 21, 2017, 7:26 p.m. UTC | #7
On Sat, 2017-01-21 at 20:12 +0100, kernel netdev wrote:
> 
> 
> Den 21. jan. 2017 7.10 PM skrev "Tom Herbert" <tom@herbertland.com>:
>         On Thu, Jan 19, 2017 at 11:14 AM, Saeed Mahameed
>         <saeedm@dev.mellanox.co.il> wrote:
>         > On Thu, Jan 19, 2017 at 9:03 AM, Eric Dumazet
>         <eric.dumazet@gmail.com> wrote:
>         >> From: Eric Dumazet <edumazet@google.com>
>         >>
>         >> A driver using dev_alloc_page() must not reuse a page
>         allocated from
>         >> emergency memory reserve.
>         >>
>         >> Otherwise all packets using this page will be immediately
>         dropped,
>         >> unless for very specific sockets having SOCK_MEMALLOC bit
>         set.
>         >>
>         >> This issue might be hard to debug, because only a fraction
>         of received
>         >> packets would be dropped.
>         >
>         > Hi Eric,
>         >
>         > When you say reuse, you mean point to the same page from
>         several SKBs ?
>         >
>         > Because in our page cache implementation we don't reuse
>         pages that
>         > already passed to the stack,
>         > we just keep them in the page cache until the ref count drop
>         back to
>         > one, so we recycle them (i,e they will be re-used only when
>         no one
>         > else is using them).
>         >
>         
>         Saeed,
>         
>         Speaking of the mlx page cache can we remove this or a least
>         make it
>         optional to use. It is another example of complex
>         functionality being
>         put into drivers that makes things like backports more
>         complicated and
>         provide at best some marginal value. In the case of the mlx5e
>         cache
>         code the results from pktgen really weren't very impressive in
>         the
>         first place. Also, the cache suffers from HOL blocking where
>         we can
>         block the whole cache due to an outstanding reference on just
>         one page
>         (something that you wouldn't see in pktgen but is likely to
>         happen in
>         real applications).
> 
> 
> (Send from phone in car)
> 
> 
> To Tom, have you measured the effect of this page cache? Before
> claiming it is ineffective.
> 
> 
> My previous measurements show approx 20% speedup on a UDP test with
> delivery to remote CPU.
> 
I find this a bit strange. When you have time (i.e. not while driving your
car or during the weekend) please give more details, for example on message
size. Was it before skb_condense() was added?

> 
> Removing the cache would of cause be a good usecase for speeding up
> the page allocator (PCP). Which Mel Gorman and me are working on.
> AFAIK current page order0 cost 240 cycles. Mel have reduced til to
> 180, and without NUMA 150 cycles. And with bulking this can be
> amortized to 80 cycles.
> 
> 
> --Jesper
> 
> 
>
Saeed Mahameed Jan. 21, 2017, 8:31 p.m. UTC | #8
On Sat, Jan 21, 2017 at 9:12 PM, kernel netdev <netdev@brouer.com> wrote:
>
>
> Den 21. jan. 2017 7.10 PM skrev "Tom Herbert" <tom@herbertland.com>:
>
> On Thu, Jan 19, 2017 at 11:14 AM, Saeed Mahameed
> <saeedm@dev.mellanox.co.il> wrote:
>> On Thu, Jan 19, 2017 at 9:03 AM, Eric Dumazet <eric.dumazet@gmail.com>
>> wrote:
>>> From: Eric Dumazet <edumazet@google.com>
>>>
>>> A driver using dev_alloc_page() must not reuse a page allocated from
>>> emergency memory reserve.
>>>
>>> Otherwise all packets using this page will be immediately dropped,
>>> unless for very specific sockets having SOCK_MEMALLOC bit set.
>>>
>>> This issue might be hard to debug, because only a fraction of received
>>> packets would be dropped.
>>
>> Hi Eric,
>>
>> When you say reuse, you mean point to the same page from several SKBs ?
>>
>> Because in our page cache implementation we don't reuse pages that
>> already passed to the stack,
>> we just keep them in the page cache until the ref count drop back to
>> one, so we recycle them (i,e they will be re-used only when no one
>> else is using them).
>>
> Saeed,
>
> Speaking of the mlx page cache can we remove this or a least make it
> optional to use. It is another example of complex functionality being
> put into drivers that makes things like backports more complicated and

Re complexity, I am not sure the mlx page cache is that complex;
we just wrap alloc_page/put_page with our own page cache calls.
Roughly, the page cache implementation is 200-300 LOC tops, all
concentrated in one place in the code.

> provide at best some marginal value. In the case of the mlx5e cache
> code the results from pktgen really weren't very impressive in the
> first place. Also, the cache suffers from HOL blocking where we can

Well, with pktgen you won't notice a huge improvement, since the pages
are freed in the stack directly from our RX receive handler
(gro_receive); those pages go back to the page allocator and get
requested again immediately by the driver (no stress).

The real improvements are seen when you really stress the page
allocator with real use cases such as TCP/UDP with user applications,
where the pages are held longer than the driver needs them. In this
case the page cache plays an important role in reducing the stress on
the page allocator, since with those use cases we are juggling many
more pages than pktgen ever would.

With more stress (#cores TCP streams) our humble page cache sometimes
can't cope: it gets full fast enough and fails to recycle pages for a
huge percentage of the driver's page requests.

Before our own page cache we used dev_alloc_skb(), which used its own
cache ("page_frag_cache") and worked nicely enough. So I don't really
recommend removing the page cache until we have a generic RX page cache
solution for all device drivers.

> block the whole cache due to an outstanding reference on just one page
> (something that you wouldn't see in pktgen but is likely to happen in
> real applications).
>

Re the HOL issue, we have some upcoming patches that should drastically
improve the HOL blocking (we will simply swap the head-of-line entry on
every sample).

>
> (Send from phone in car)
>

Drive safe :) ..

> To Tom, have you measured the effect of this page cache? Before claiming it
> is ineffective.
>
> My previous measurements show approx 20% speedup on a UDP test with delivery
> to remote CPU.
>
> Removing the cache would of cause be a good usecase for speeding up the page
> allocator (PCP). Which Mel Gorman and me are working on. AFAIK current page
> order0 cost 240 cycles. Mel have reduced til to 180, and without NUMA 150
> cycles. And with bulking this can be amortized to 80 cycles.
>

Are you trying to say that we won't need the cache if you manage to
deliver those optimizations?
Can you compare those optimizations with the page_frag_cache from
dev_alloc_skb()?

> --Jesper
>
Tom Herbert Jan. 22, 2017, 5:50 p.m. UTC | #9
On Sat, Jan 21, 2017 at 12:31 PM, Saeed Mahameed
<saeedm@dev.mellanox.co.il> wrote:
> On Sat, Jan 21, 2017 at 9:12 PM, kernel netdev <netdev@brouer.com> wrote:
>>
>>
>> Den 21. jan. 2017 7.10 PM skrev "Tom Herbert" <tom@herbertland.com>:
>>
>> On Thu, Jan 19, 2017 at 11:14 AM, Saeed Mahameed
>> <saeedm@dev.mellanox.co.il> wrote:
>>> On Thu, Jan 19, 2017 at 9:03 AM, Eric Dumazet <eric.dumazet@gmail.com>
>>> wrote:
>>>> From: Eric Dumazet <edumazet@google.com>
>>>>
>>>> A driver using dev_alloc_page() must not reuse a page allocated from
>>>> emergency memory reserve.
>>>>
>>>> Otherwise all packets using this page will be immediately dropped,
>>>> unless for very specific sockets having SOCK_MEMALLOC bit set.
>>>>
>>>> This issue might be hard to debug, because only a fraction of received
>>>> packets would be dropped.
>>>
>>> Hi Eric,
>>>
>>> When you say reuse, you mean point to the same page from several SKBs ?
>>>
>>> Because in our page cache implementation we don't reuse pages that
>>> already passed to the stack,
>>> we just keep them in the page cache until the ref count drop back to
>>> one, so we recycle them (i,e they will be re-used only when no one
>>> else is using them).
>>>
>> Saeed,
>>
>> Speaking of the mlx page cache can we remove this or a least make it
>> optional to use. It is another example of complex functionality being
>> put into drivers that makes things like backports more complicated and
>
> Re complexity, I am not sure the mlx page cache is that complex,
> we just wrap alloc_page/put_page with our own page cache calls.
> Roughly the page cache implementation is 200-300 LOC tops all concentrated
> in one place in the code.
>
Taken as part of the RX buffer management code, the whole thing is very
complicated and seems to be completely bereft of any comments in the
code as to how things are supposed to work.

>> provide at best some marginal value. In the case of the mlx5e cache
>> code the results from pktgen really weren't very impressive in the
>> first place. Also, the cache suffers from HOL blocking where we can
>
> Well, with pktgen you won't notice a huge improvement since the pages are freed
> in the stack directly from our rx receive handler (gro_receive), those
> pages will go back to the page allocator and get requested immediately
> again from the driver (no stress).
>
> The real improvements are seen when you really stress the page allocator with
> real uses cases such as  TCP/UDP with user applications, where the
> pages are held longer than the driver needs, the page cache in this
> case will play an important role of reducing the stress
> on the page allocater since with those use cases we are juggling with
> more pages than pktgen could.
>
> With more stress (#cores TCP streams) our humble page cache some times
> can't hold ! and it will get full fast enough and will fail to recycle
> for a huge percentage of the driver pages requests.
>
> Before our own page cache we used dev_alloc_skb which used its own
> cache "page_frag_cache" and it worked nice enough. So i don't really
> recommend removing the page cache, until we have
> a generic RX page cache solution for all device drivers.
>
>> block the whole cache due to an outstanding reference on just one page
>> (something that you wouldn't see in pktgen but is likely to happen in
>> real applications).
>>
>
> Re the HOL issue, we have some upcoming patches that would drastically
> improve the HOL blocking issue (we will simple swap the HOL on every
> sample).
>
>>
>> (Send from phone in car)
>>
>
> Driver Safe :) ..
>
>> To Tom, have you measured the effect of this page cache? Before claiming it
>> is ineffective.
>>
>> My previous measurements show approx 20% speedup on a UDP test with delivery
>> to remote CPU.
>>
>> Removing the cache would of cause be a good usecase for speeding up the page
>> allocator (PCP). Which Mel Gorman and me are working on. AFAIK current page
>> order0 cost 240 cycles. Mel have reduced til to 180, and without NUMA 150
>> cycles. And with bulking this can be amortized to 80 cycles.
>>
>
> Are you trying to say that we won't need the cache if you manage to
> deliver those optimizations ?
> can you compare those optimizations with the page_frag_cache from
> dev_alloc_skb ?
>
>> --Jesper
>>
Tom Herbert Jan. 22, 2017, 6:09 p.m. UTC | #10
On Sat, Jan 21, 2017 at 11:12 AM, kernel netdev <netdev@brouer.com> wrote:
>
>
> Den 21. jan. 2017 7.10 PM skrev "Tom Herbert" <tom@herbertland.com>:
>
> On Thu, Jan 19, 2017 at 11:14 AM, Saeed Mahameed
> <saeedm@dev.mellanox.co.il> wrote:
>> On Thu, Jan 19, 2017 at 9:03 AM, Eric Dumazet <eric.dumazet@gmail.com>
>> wrote:
>>> From: Eric Dumazet <edumazet@google.com>
>>>
>>> A driver using dev_alloc_page() must not reuse a page allocated from
>>> emergency memory reserve.
>>>
>>> Otherwise all packets using this page will be immediately dropped,
>>> unless for very specific sockets having SOCK_MEMALLOC bit set.
>>>
>>> This issue might be hard to debug, because only a fraction of received
>>> packets would be dropped.
>>
>> Hi Eric,
>>
>> When you say reuse, you mean point to the same page from several SKBs ?
>>
>> Because in our page cache implementation we don't reuse pages that
>> already passed to the stack,
>> we just keep them in the page cache until the ref count drop back to
>> one, so we recycle them (i,e they will be re-used only when no one
>> else is using them).
>>
> Saeed,
>
> Speaking of the mlx page cache can we remove this or a least make it
> optional to use. It is another example of complex functionality being
> put into drivers that makes things like backports more complicated and
> provide at best some marginal value. In the case of the mlx5e cache
> code the results from pktgen really weren't very impressive in the
> first place. Also, the cache suffers from HOL blocking where we can
> block the whole cache due to an outstanding reference on just one page
> (something that you wouldn't see in pktgen but is likely to happen in
> real applications).
>
>
> (Send from phone in car)
>
> To Tom, have you measured the effect of this page cache? Before claiming it
> is ineffective.

No, I have not. TBH, I have spent most of the past few weeks trying to
debug a backport of the code from 4.9 to 4.6. Until we have a working
backport, performance is immaterial for our purposes. Unfortunately, we
are seeing some issues: the checksum faults I described previously and
crashes on bad page refcounts which are presumably being caused by the
logic in RX buffer processing. This is why I am now having to dissect
the code and try to disable things like the page cache that are not
essential functionality.

In any case the HOL blocking issue is obvious from reading the code
and implies bimodal behavior; we don't need a test for that to know
it's bad, as we've seen the bad effects of it in many other contexts.
>
> My previous measurements show approx 20% speedup on a UDP test with delivery
> to remote CPU.
>
> Removing the cache would of cause be a good usecase for speeding up the page
> allocator (PCP). Which Mel Gorman and me are working on. AFAIK current page
> order0 cost 240 cycles. Mel have reduced til to 180, and without NUMA 150
> cycles. And with bulking this can be amortized to 80 cycles.
>
That would be great. If only I had a nickel for every time someone
started working on a driver and came to the conclusion that they needed
to write a custom memory allocator because the kernel allocator is so
inefficient!

Tom

> --Jesper
>
Jesper Dangaard Brouer Jan. 23, 2017, 8:39 a.m. UTC | #11
On Sat, 21 Jan 2017 22:31:26 +0200 Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:

> > My previous measurements show approx 20% speedup on a UDP test with delivery
> > to remote CPU.
> >
> > Removing the cache would of cause be a good usecase for speeding up the page
> > allocator (PCP). Which Mel Gorman and me are working on. AFAIK current page
> > order0 cost 240 cycles. Mel have reduced til to 180, and without NUMA 150
> > cycles. And with bulking this can be amortized to 80 cycles.
> >  
> 
> Are you trying to say that we won't need the cache if you manage to
> deliver those optimizations ?

These page allocator optimizations are good, but not fast enough to replace
your cache.  Maybe I can improve it further, but it will be very hard
to compete with a recycle cache (it can skip many checks the page allocator
needs, as it knows the page state).

Said another way, the gap will be significantly smaller, and you
will not see a big boost from using this cache.


BUT there are other advantages to using a guaranteed recycle pool
facility (like the page_pool).  Namely, (1) DMA overhead: keeping pages
DMA-mapped to counter DMA+IOMMU overhead, and (2) RX zero-copy: it opens
up a software memory model solution for pre-VMA-mapping pages to
userspace (see [1] for the argument for how this avoids leaking kernel
memory and only exposes/leaks packet-data memory).


> can you compare those optimizations with the page_frag_cache from
> dev_alloc_skb ?

IMHO the page_frag_cache is indirectly a bulk page allocation facility,
which limits the number of times the page allocator is called and thus
amortizes the cost of talking to it.  Performance-wise, it is likely
comparable to your page cache, and likely faster than the 80-cycle
order-0 bulking.
BUT we generally worry about using these 32K pages, as they open a
window for attackers to pin down memory.


[1] https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
Jesper Dangaard Brouer Jan. 23, 2017, 9:14 a.m. UTC | #12
On Sat, 21 Jan 2017 11:26:49 -0800
Eric Dumazet <eric.dumazet@gmail.com> wrote:

>> > My previous measurements show approx 20% speedup on a UDP test with
> > delivery to remote CPU.
> >   
> I find this a bit strange. When you have time (ie not while driving your
> car or during week end) please give more details, for example on message
> size.

I tested this with both 64-byte and 1500-byte packets.  Since moving to
50G and 100G testing I don't need to use 64-byte packets to provoke
the bottlenecks in the stack ;-)

> Was it before skb_condense() was added ?

I tested this just before skb_condense() was added.  BUT
skb_condense() does not get activated when using mlx5, because it uses
build_skb(), i.e. it is not using frags.

For people that don't realize this:
 Eric's optimization in skb_condense() is about trading a remote-CPU
 atomic refcount (put_page) for a copy plus a local-CPU refcount decrement.

My measurements show a cycle cost of local=31 vs. remote=208, thus an
estimated saving of around 177 cycles, which is spent on calling a fairly
complex function, __pskb_pull_tail(), and only works for more complex
SKBs with frags.
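
For readers following along, a paraphrased sketch of the skb_condense()
idea (not the verbatim net/core implementation; the _sketch suffix marks
it as an approximation):

/* If a small skb still carries page frags, pull the frag data into the
 * linear area so the (possibly remote) pages can be released right
 * away, trading a copy for the remote atomic put_page().
 */
static void skb_condense_sketch(struct sk_buff *skb)
{
	if (!skb->data_len)	/* linear-only skb, e.g. mlx5e build_skb():
				 * nothing to condense */
		return;
	if (skb->data_len > skb_tailroom(skb) || skb_cloned(skb))
		return;		/* frag data does not fit, or skb is shared */

	/* copy frag data into the linear area and drop the page refs */
	__pskb_pull_tail(skb, skb->data_len);
}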
David Miller Jan. 23, 2017, 3:45 p.m. UTC | #13
From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Mon, 23 Jan 2017 09:39:40 +0100

> BUT there are other advantages of using a guaranteed recycle pool
> facility (like the page_pool).  Namely, (1) DMA-overhead: keeping page
> DMA mapped to counter DMA+IOMMU overhead, (2) RX-zero-copy: opens up
> for a software memory model solution for pre-VMA-mapping pages to
> userspace (See: [1] for argument how this avoids leaking kernel mem,
> but only expose/leak packet-data mem)

+1
Saeed Mahameed Jan. 24, 2017, 9:21 a.m. UTC | #14
On Mon, Jan 23, 2017 at 11:14 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Sat, 21 Jan 2017 11:26:49 -0800
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>> > My previous measurements show approx 20% speedup on a UDP test with
>> > delivery to remote CPU.
>> >
>> I find this a bit strange. When you have time (ie not while driving your
>> car or during week end) please give more details, for example on message
>> size.
>
> I tested this with both 64 bytes and 1500 bytes.  After I moved to 50G
> and 100G testing then I don't need to use 64 bytes packets to provoke
> the bottlenecks in the stack ;-)
>

Exactly! For XDP-like use cases the page cache may not be a required
optimization, but when you start testing typical TCP use cases over a
50/100G link you will need more buffers (pages) to host the traffic for
longer periods, and you will hit that bottleneck.

>> Was it before skb_condense() was added ?
>
> It tested this just before skb_condense() was added.  BUT
> skb_condense() does not get activated when using mlx5, because uses
> build_skb() ie. not using frags.
>

Well, we can always replace build_skb with alloc_skb +
memcpy(skb->data, headlen) + add_skb_frag(payload).
Is it worth it? And is it healthy that both skb->data and
skb_shinfo(skb)->frags[i] point to the same page?
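
A rough sketch of that alternative, for illustration only (the function
and its parameters are hypothetical, and DMA sync/unmap as well as page
reference accounting are deliberately omitted):

/* Hypothetical variant: copy only the headers into a small linear
 * area and attach the payload as a page frag, instead of build_skb()
 * over the whole page.
 */
static struct sk_buff *rx_copybreak_build_skb(struct napi_struct *napi,
					      struct page *page, u32 offset,
					      u32 len, u32 headlen)
{
	struct sk_buff *skb = napi_alloc_skb(napi, headlen);

	if (unlikely(!skb))
		return NULL;

	memcpy(skb_put(skb, headlen), page_address(page) + offset, headlen);

	if (len > headlen)
		skb_add_rx_frag(skb, 0, page, offset + headlen,
				len - headlen, PAGE_SIZE);
	return skb;
}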

> For people that don't realize this:
>  Eric's optimization in skb_condense() is about trading remote CPU
>  atomic refcnt (put_page) for copy + local CPU refcnt dec.
>
> My measurements show cycles cost local=31 vs. remote=208, thus a
> estimated saving around 177 cycles.  Which is spend on calling a fairly
> complex function __pskb_pull_tail(), and only works for more complex
> SKBs with frags.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
Saeed Mahameed Jan. 24, 2017, 9:29 a.m. UTC | #15
On Sun, Jan 22, 2017 at 7:50 PM, Tom Herbert <tom@herbertland.com> wrote:
> On Sat, Jan 21, 2017 at 12:31 PM, Saeed Mahameed
> <saeedm@dev.mellanox.co.il> wrote:
>> On Sat, Jan 21, 2017 at 9:12 PM, kernel netdev <netdev@brouer.com> wrote:
>>>
>>>
>>> Den 21. jan. 2017 7.10 PM skrev "Tom Herbert" <tom@herbertland.com>:
>>>
>>> On Thu, Jan 19, 2017 at 11:14 AM, Saeed Mahameed
>>> <saeedm@dev.mellanox.co.il> wrote:
>>>> On Thu, Jan 19, 2017 at 9:03 AM, Eric Dumazet <eric.dumazet@gmail.com>
>>>> wrote:
>>>>> From: Eric Dumazet <edumazet@google.com>
>>>>>
>>>>> A driver using dev_alloc_page() must not reuse a page allocated from
>>>>> emergency memory reserve.
>>>>>
>>>>> Otherwise all packets using this page will be immediately dropped,
>>>>> unless for very specific sockets having SOCK_MEMALLOC bit set.
>>>>>
>>>>> This issue might be hard to debug, because only a fraction of received
>>>>> packets would be dropped.
>>>>
>>>> Hi Eric,
>>>>
>>>> When you say reuse, you mean point to the same page from several SKBs ?
>>>>
>>>> Because in our page cache implementation we don't reuse pages that
>>>> already passed to the stack,
>>>> we just keep them in the page cache until the ref count drop back to
>>>> one, so we recycle them (i,e they will be re-used only when no one
>>>> else is using them).
>>>>
>>> Saeed,
>>>
>>> Speaking of the mlx page cache can we remove this or a least make it
>>> optional to use. It is another example of complex functionality being
>>> put into drivers that makes things like backports more complicated and
>>
>> Re complexity, I am not sure the mlx page cache is that complex,
>> we just wrap alloc_page/put_page with our own page cache calls.
>> Roughly the page cache implementation is 200-300 LOC tops all concentrated
>> in one place in the code.
>>
> Taken as part of the RX buffer management code the whole thing in very
> complicated and seems to be completely bereft of any comments in the
> code as to how things are supposed to work.
>

The idea is fairly simple; maybe we could have made it cleaner if we
had better decoupled the page cache from the RX logic.

The page cache works in parallel to the RX logic and takes its own ref
counts; this way the RX logic continues working as if there were no
cache, but instead of the kernel's get/put_page it calls the internal
cache get/put_cache_page.

Yeah I agree, some comments could help. We will work on this when we
fix the HOL issues.

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 0e2fb3ed1790900ccdc0bbbef8bc5c33267df092..ce46409bde2736ea4d1246ca97855098f052f271 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -193,6 +193,9 @@  static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
 		return false;
 	}
 
+	if (unlikely(page_is_pfmemalloc(dma_info->page)))
+		return false;
+
 	cache->page_cache[cache->tail] = *dma_info;
 	cache->tail = tail_next;
 	return true;