diff mbox

[net-next] mlx4: support __GFP_MEMALLOC for rx

Message ID 1484712850.13165.86.camel@edumazet-glaptop3.roam.corp.google.com
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Jan. 18, 2017, 4:14 a.m. UTC
From: Eric Dumazet <edumazet@google.com>

Commit 04aeb56a1732 ("net/mlx4_en: allocate non 0-order pages for RX
ring with __GFP_NOMEMALLOC") added code that appears to be not needed at
that time, since mlx4 never used __GFP_MEMALLOC allocations anyway.

As using memory reserves is a must in some situations (swap over NFS or
iSCSI), this patch adds this flag.

Note that this driver does not reuse pages (yet) so we do not have to
add anything else.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Tariq Toukan <tariqt@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Konstantin Khlebnikov Jan. 18, 2017, 9:31 a.m. UTC | #1
On 18.01.2017 07:14, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Commit 04aeb56a1732 ("net/mlx4_en: allocate non 0-order pages for RX
> ring with __GFP_NOMEMALLOC") added code that appears to be not needed at
> that time, since mlx4 never used __GFP_MEMALLOC allocations anyway.
>
> As using memory reserves is a must in some situations (swap over NFS or
> iSCSI), this patch adds this flag.

AFAIK __GFP_MEMALLOC is used for TX, not for RX: for allocations which are required by memory reclaimer to free some pages.

Allocation RX buffers with __GFP_MEMALLOC is a straight way to depleting all reserves by flood from network.

>
> Note that this driver does not reuse pages (yet) so we do not have to
> add anything else.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index eac527e25ec902c2a586e9952272b9e8e599e2c8..e362f99334d03c0df4d88320977670015870dd9c 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -706,7 +706,8 @@ static bool mlx4_en_refill_rx_buffers(struct mlx4_en_priv *priv,
>  	do {
>  		if (mlx4_en_prepare_rx_desc(priv, ring,
>  					    ring->prod & ring->size_mask,
> -					    GFP_ATOMIC | __GFP_COLD))
> +					    GFP_ATOMIC | __GFP_COLD |
> +					    __GFP_MEMALLOC))
>  			break;
>  		ring->prod++;
>  	} while (--missing);
>
>
Eric Dumazet Jan. 18, 2017, 2:23 p.m. UTC | #2
On Wed, 2017-01-18 at 12:31 +0300, Konstantin Khlebnikov wrote:
> On 18.01.2017 07:14, Eric Dumazet wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > Commit 04aeb56a1732 ("net/mlx4_en: allocate non 0-order pages for RX
> > ring with __GFP_NOMEMALLOC") added code that appears to be not needed at
> > that time, since mlx4 never used __GFP_MEMALLOC allocations anyway.
> >
> > As using memory reserves is a must in some situations (swap over NFS or
> > iSCSI), this patch adds this flag.
> 
> AFAIK __GFP_MEMALLOC is used for TX, not for RX: for allocations which
> are required by memory reclaimer to free some pages.
> 
> Allocation RX buffers with __GFP_MEMALLOC is a straight way to
> depleting all reserves by flood from network.

You are mistaken.

How do you think a TCP flow can make progress sending data if no ACK
packet can go back in RX ?

Take a look at sk_filter_trim_cap(), where the RX packets received on a
socket which does not have SOCK_MEMALLOC is dropped.

        /*
         * If the skb was allocated from pfmemalloc reserves, only
         * allow SOCK_MEMALLOC sockets to use it as this socket is
         * helping free memory
         */
        if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
                return -ENOMEM;

Also take a look at __dev_alloc_pages()

static inline struct page *__dev_alloc_pages(gfp_t gfp_mask,
                                             unsigned int order)
{
        /* This piece of code contains several assumptions.
         * 1.  This is for device Rx, therefor a cold page is preferred.
         * 2.  The expectation is the user wants a compound page.
         * 3.  If requesting a order 0 page it will not be compound
         *     due to the check to see if order has a value in prep_new_page
         * 4.  __GFP_MEMALLOC is ignored if __GFP_NOMEMALLOC is set due to
         *     code in gfp_to_alloc_flags that should be enforcing this.
         */
        gfp_mask |= __GFP_COLD | __GFP_COMP | __GFP_MEMALLOC;

        return alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
}


So __GFP_MEMALLOC in RX is absolutely supported.

But drivers have to opt-in, either using __dev_alloc_pages() or
dev_alloc_pages, or explicitely ORing __GFP_MEMALLOC when using
alloc_page[s]()
Konstantin Khlebnikov Jan. 18, 2017, 3:11 p.m. UTC | #3
On 18.01.2017 17:23, Eric Dumazet wrote:
> On Wed, 2017-01-18 at 12:31 +0300, Konstantin Khlebnikov wrote:
>> On 18.01.2017 07:14, Eric Dumazet wrote:
>>> From: Eric Dumazet <edumazet@google.com>
>>>
>>> Commit 04aeb56a1732 ("net/mlx4_en: allocate non 0-order pages for RX
>>> ring with __GFP_NOMEMALLOC") added code that appears to be not needed at
>>> that time, since mlx4 never used __GFP_MEMALLOC allocations anyway.
>>>
>>> As using memory reserves is a must in some situations (swap over NFS or
>>> iSCSI), this patch adds this flag.
>>
>> AFAIK __GFP_MEMALLOC is used for TX, not for RX: for allocations which
>> are required by memory reclaimer to free some pages.
>>
>> Allocation RX buffers with __GFP_MEMALLOC is a straight way to
>> depleting all reserves by flood from network.
>
> You are mistaken.
>
> How do you think a TCP flow can make progress sending data if no ACK
> packet can go back in RX ?

Well. Ok. I mistaken.

>
> Take a look at sk_filter_trim_cap(), where the RX packets received on a
> socket which does not have SOCK_MEMALLOC is dropped.
>
>         /*
>          * If the skb was allocated from pfmemalloc reserves, only
>          * allow SOCK_MEMALLOC sockets to use it as this socket is
>          * helping free memory
>          */
>         if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
>                 return -ENOMEM;

I suppose this happens in BH context right after receiving packet?

Potentially any ACK could free memory in TCP send queue,
so using reserves here makes sense.

>
> Also take a look at __dev_alloc_pages()
>
> static inline struct page *__dev_alloc_pages(gfp_t gfp_mask,
>                                              unsigned int order)
> {
>         /* This piece of code contains several assumptions.
>          * 1.  This is for device Rx, therefor a cold page is preferred.
>          * 2.  The expectation is the user wants a compound page.
>          * 3.  If requesting a order 0 page it will not be compound
>          *     due to the check to see if order has a value in prep_new_page
>          * 4.  __GFP_MEMALLOC is ignored if __GFP_NOMEMALLOC is set due to
>          *     code in gfp_to_alloc_flags that should be enforcing this.
>          */
>         gfp_mask |= __GFP_COLD | __GFP_COMP | __GFP_MEMALLOC;
>
>         return alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
> }
>
>
> So __GFP_MEMALLOC in RX is absolutely supported.
>
> But drivers have to opt-in, either using __dev_alloc_pages() or
> dev_alloc_pages, or explicitely ORing __GFP_MEMALLOC when using
> alloc_page[s]()
>
>
>
Eric Dumazet Jan. 18, 2017, 3:24 p.m. UTC | #4
On Wed, 2017-01-18 at 18:11 +0300, Konstantin Khlebnikov wrote:
> On 18.01.2017 17:23, Eric Dumazet wrote:

> >
> > Take a look at sk_filter_trim_cap(), where the RX packets received on a
> > socket which does not have SOCK_MEMALLOC is dropped.
> >
> >         /*
> >          * If the skb was allocated from pfmemalloc reserves, only
> >          * allow SOCK_MEMALLOC sockets to use it as this socket is
> >          * helping free memory
> >          */
> >         if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
> >                 return -ENOMEM;
> 
> I suppose this happens in BH context right after receiving packet?
> 
> Potentially any ACK could free memory in TCP send queue,
> so using reserves here makes sense.

Yes, but only sockets with SOCK_MEMALLOC have this contract with the mm
layer.

For 'other' sockets, one possible trick would be that if only the page
fragment attached to skb had the pfmemalloc bit, and not the sk_buff
itself, we could attempt a skb_condense() operation [1], but it is not
really easy to properly recompute skb->pfmemalloc.

Pure TCP ACK packets can usually be trimmed by skb_condense().
Since they have no payload, we have a guarantee they wont sit in a queue
and hold memory.
David Miller Jan. 20, 2017, 4:36 a.m. UTC | #5
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 17 Jan 2017 20:14:10 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> Commit 04aeb56a1732 ("net/mlx4_en: allocate non 0-order pages for RX
> ring with __GFP_NOMEMALLOC") added code that appears to be not needed at
> that time, since mlx4 never used __GFP_MEMALLOC allocations anyway.
> 
> As using memory reserves is a must in some situations (swap over NFS or
> iSCSI), this patch adds this flag.
> 
> Note that this driver does not reuse pages (yet) so we do not have to
> add anything else.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.
diff mbox

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index eac527e25ec902c2a586e9952272b9e8e599e2c8..e362f99334d03c0df4d88320977670015870dd9c 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -706,7 +706,8 @@  static bool mlx4_en_refill_rx_buffers(struct mlx4_en_priv *priv,
 	do {
 		if (mlx4_en_prepare_rx_desc(priv, ring,
 					    ring->prod & ring->size_mask,
-					    GFP_ATOMIC | __GFP_COLD))
+					    GFP_ATOMIC | __GFP_COLD |
+					    __GFP_MEMALLOC))
 			break;
 		ring->prod++;
 	} while (--missing);