[1/4] net/skbuff: don't waste memory reserves

Message ID 20190418180524.23489-1-aryabinin@virtuozzo.com
State Rejected
Delegated to: David Miller
Series: [1/4] net/skbuff: don't waste memory reserves

Commit Message

Andrey Ryabinin April 18, 2019, 6:05 p.m. UTC
In some workloads we have noticed packets being dropped by
sk_filter_trim_cap() because the 'skb' was allocated from pfmemalloc
reserves:

	/*
	 * If the skb was allocated from pfmemalloc reserves, only
	 * allow SOCK_MEMALLOC sockets to use it as this socket is
	 * helping free memory
	 */
	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC)) {
		NET_INC_STATS(sock_net(sk), LINUX_MIB_PFMEMALLOCDROP);
		return -ENOMEM;
	}

Memalloc sockets are used for things like swap over NBD or NFS,
and only memalloc sockets can process memalloc skbs. Since we
don't have any memalloc sockets in our setups, we shouldn't have
memalloc skbs either. It simply doesn't make any sense to waste
memory reserves on an skb which will be dropped anyway.
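
Such sockets are flagged via sk_set_memalloc(); roughly (a simplified
sketch of the helper in net/core/sock.c, details vary by kernel
version):

	void sk_set_memalloc(struct sock *sk)
	{
		sock_set_flag(sk, SOCK_MEMALLOC);
		sk->sk_allocation |= __GFP_MEMALLOC;
		static_branch_inc(&memalloc_socks_key);
	}

Until something calls this, sk_memalloc_socks() stays false and nothing
in the system can consume a pfmemalloc skb.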

It appears that __dev_alloc_pages() unconditionally uses
__GFP_MEMALLOC, so unless the caller added __GFP_NOMEMALLOC,
__dev_alloc_pages() may dive into memory reserves.
Later, build_skb() or __skb_fill_page_desc() sets skb->pfmemalloc = 1,
so such an skb is always dropped by sk_filter_trim_cap().
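
The propagation itself is a one-liner; a simplified sketch of the check
done in __skb_fill_page_desc() (the real code also looks at the
compound head page):

	if (page_is_pfmemalloc(page))
		skb->pfmemalloc = true;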

Instead of wasting memory reserves, we simply shouldn't use them when
there are no memalloc sockets in the system. Do this by adding
__GFP_MEMALLOC only when such a socket is present.

Fixes: 0614002bb5f7 ("netvm: propagate page->pfmemalloc from skb_alloc_page to skb")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
---
 include/linux/skbuff.h | 17 ++++++++++++++++-
 include/net/sock.h     | 15 ---------------
 2 files changed, 16 insertions(+), 16 deletions(-)

Comments

Eric Dumazet April 18, 2019, 6:55 p.m. UTC | #1
On 04/18/2019 11:05 AM, Andrey Ryabinin wrote:
> [...]

Hi Andrey

Are you targeting the net or net-next tree?

AFAIK, drivers allocate skbs way before a frame is actually received
(at RX ring buffer initialization or refill).

So sk_memalloc_socks() might be false at that time, but true later.
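
For illustration, the usual refill pattern looks roughly like this
(a hypothetical driver: struct rx_ring and give_page_to_hw() are
made-up names, only dev_alloc_page() is a real helper):

	/* Called at ifup and from the poll loop, long before any frame
	 * arrives -- and thus before any memalloc socket may exist.
	 */
	static int ring_refill(struct rx_ring *ring)
	{
		while (ring->free_descs) {
			struct page *page = dev_alloc_page();

			if (!page)
				return -ENOMEM;	/* ring stays short */
			give_page_to_hw(ring, page);
			ring->free_descs--;
		}
		return 0;
	}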
David Miller April 18, 2019, 6:56 p.m. UTC | #2
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 18 Apr 2019 11:55:16 -0700

> AFAIK, drivers allocate skbs way before a frame is actually received
> (at RX ring buffer initialization or refill).
> 
> So sk_memalloc_socks() might be false at that time, but true later.

This is exactly what I was going to say too.
Andrey Ryabinin April 19, 2019, 1:17 p.m. UTC | #3
On 4/18/19 9:55 PM, Eric Dumazet wrote:
> On 04/18/2019 11:05 AM, Andrey Ryabinin wrote:
>> [...]
> 
> Hi Andrey
> 
> Are you targeting the net or net-next tree?
> 

I think it's up to Dave to decide where the patches should go; they apply cleanly to both trees.
The last two patches are just minor cleanups, so they are definitely net-next material.

> AFAIK, drivers allocate skbs way before a frame is actually received
> (at RX ring buffer initialization or refill).
> 
> So sk_memalloc_socks() might be false at that time, but true later.
> 

I don't see why that would be a problem. If a refill failed because we didn't have
access to reserves, then there is going to be another refill attempt, right?
And the next refill attempt will have access to the reserves if a memalloc socket was created.
We can't predict the future, so until a memalloc socket appears we must assume that those
RX ring buffers won't be used to reclaim memory (and that is actually true in 99% of cases).
Eric Dumazet April 19, 2019, 1:36 p.m. UTC | #4
On 04/19/2019 06:17 AM, Andrey Ryabinin wrote:
> 
> 
> On 4/18/19 9:55 PM, Eric Dumazet wrote:

>> Are you targeting the net or net-next tree?
>>
> 
> I think it's up to Dave to decide where the patches should go; they apply cleanly to both trees.
> The last two patches are just minor cleanups, so they are definitely net-next material.
> 

Not at all. Please carefully read

Documentation/networking/netdev-FAQ.rst

Q: How do I indicate which tree (net vs. net-next) my patch should be in?

You need to split your patches into two series, not assume that David will
do the manual work for you.

Thank you
Eric Dumazet April 19, 2019, 1:41 p.m. UTC | #5
On 04/19/2019 06:17 AM, Andrey Ryabinin wrote:
> 

> I don't see why that would be a problem. If a refill failed because we didn't have
> access to reserves, then there is going to be another refill attempt, right?
> And the next refill attempt will have access to the reserves if a memalloc socket was created.
> We can't predict the future, so until a memalloc socket appears we must assume that those
> RX ring buffers won't be used to reclaim memory (and that is actually true in 99% of cases).
> 

I just said that the alloc might be attempted "in the past".

Yes, we cannot predict the future; this is why we need to access the reserves _now_ and not
at the time the packet is received.

The 'being true in 99% of cases' argument is not really convincing.

You want the NIC to be ready to receive packets even before sk_memalloc_socks() becomes true.

If a NIC driver has a bug, please fix it.
Andrey Ryabinin April 19, 2019, 4:25 p.m. UTC | #6
On 4/19/19 4:41 PM, Eric Dumazet wrote:
> On 04/19/2019 06:17 AM, Andrey Ryabinin wrote:
>> [...]
> 
> I just said that the alloc might be attempted "in the past".
> 
> Yes, we cannot predict the future; this is why we need to access the reserves _now_ and not
> at the time the packet is received.
> 
> The 'being true in 99% of cases' argument is not really convincing.
> 
> You want the NIC to be ready to receive packets even before sk_memalloc_socks() becomes true.

I completely understand that we allocate pages before the socket is created.

But why is that failed allocation such a problem?

1. sk_memalloc_socks() is false
2. The NIC driver tries to allocate pages and fails
3. swapon /nfs/swap_file /* memalloc socket created */
4. We may be losing some packets because of 2, but there will be retransmits; I suppose the NIC driver will
try to allocate pages again, and this time with access to the reserves, so eventually we're all good.

So what is the problem? The potential loss of some packets for some period of time after swapon?


> If a NIC driver has a bug, please fix it.
> 

I don't follow; what kind of bug are you talking about?
The problem I'm trying to fix is entirely in __dev_alloc_pages():
1. sk_memalloc_socks() is false all the time.
2. The NIC takes pages from memory reserves.
3. Occupying memory reserves increases the chance that the next page will also be taken from the reserves,
which means more dropped packets than we would have without using the reserves.
Eric Dumazet April 19, 2019, 4:27 p.m. UTC | #7
On Fri, Apr 19, 2019 at 9:24 AM Andrey Ryabinin <aryabinin@virtuozzo.com> wrote:
>
>
> But why is that failed allocation such a problem?
>
> 1. sk_memalloc_socks() is false
> 2. The NIC driver tries to allocate pages and fails

The NIC then is unable to receive any frames.

We need to be able to populate the RX ring buffer before the NIC can
actually be started.

Basically you are saying: we need to allocate memory only _after_ a
frame has been received by the NIC.
I am saying: we need to allocate memory so that the NIC can put a
future frame in it.

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a06275a618f0..676e54f84de4 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2784,6 +2784,19 @@  void napi_consume_skb(struct sk_buff *skb, int budget);
 void __kfree_skb_flush(void);
 void __kfree_skb_defer(struct sk_buff *skb);
 
+#ifdef CONFIG_NET
+DECLARE_STATIC_KEY_FALSE(memalloc_socks_key);
+static inline int sk_memalloc_socks(void)
+{
+	return static_branch_unlikely(&memalloc_socks_key);
+}
+#else
+static inline int sk_memalloc_socks(void)
+{
+	return 0;
+}
+#endif
+
 /**
  * __dev_alloc_pages - allocate page for network Rx
  * @gfp_mask: allocation priority. Set __GFP_NOMEMALLOC if not for network Rx
@@ -2804,7 +2817,9 @@  static inline struct page *__dev_alloc_pages(gfp_t gfp_mask,
 	 * 4.  __GFP_MEMALLOC is ignored if __GFP_NOMEMALLOC is set due to
 	 *     code in gfp_to_alloc_flags that should be enforcing this.
 	 */
-	gfp_mask |= __GFP_COMP | __GFP_MEMALLOC;
+	gfp_mask |=  __GFP_COMP;
+	if (sk_memalloc_socks())
+		gfp_mask |= __GFP_MEMALLOC;
 
 	return alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
 }
diff --git a/include/net/sock.h b/include/net/sock.h
index bdd77bbce7d8..5b2138d47bd8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -838,21 +838,6 @@  static inline bool sock_flag(const struct sock *sk, enum sock_flags flag)
 	return test_bit(flag, &sk->sk_flags);
 }
 
-#ifdef CONFIG_NET
-DECLARE_STATIC_KEY_FALSE(memalloc_socks_key);
-static inline int sk_memalloc_socks(void)
-{
-	return static_branch_unlikely(&memalloc_socks_key);
-}
-#else
-
-static inline int sk_memalloc_socks(void)
-{
-	return 0;
-}
-
-#endif
-
 static inline gfp_t sk_gfp_mask(const struct sock *sk, gfp_t gfp_mask)
 {
 	return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
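
For context, a usage sketch of the changed helper (the GFP flags below
are illustrative, not taken from the patch):

	/* Normal Rx refill: with this patch, __GFP_MEMALLOC is added
	 * only while a memalloc socket exists.
	 */
	struct page *page = __dev_alloc_pages(GFP_ATOMIC, 0);

	/* A caller that must never touch the reserves can still opt
	 * out explicitly, as the kerneldoc above suggests:
	 */
	struct page *p = __dev_alloc_pages(GFP_ATOMIC | __GFP_NOMEMALLOC, 0);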