Message ID:   20190418180524.23489-1-aryabinin@virtuozzo.com
State:        Rejected
Delegated to: David Miller
Series:       [1/4] net/skbuff: don't waste memory reserves
On 04/18/2019 11:05 AM, Andrey Ryabinin wrote:
> In some workloads we have noticed packets being dropped by
> sk_filter_trim_cap() because the 'skb' was allocated from pfmemalloc
> reserves:
>
> 	/*
> 	 * If the skb was allocated from pfmemalloc reserves, only
> 	 * allow SOCK_MEMALLOC sockets to use it as this socket is
> 	 * helping free memory
> 	 */
> 	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC)) {
> 		NET_INC_STATS(sock_net(sk), LINUX_MIB_PFMEMALLOCDROP);
> 		return -ENOMEM;
> 	}
>
> Memalloc sockets are used for stuff like swap over NBD or NFS
> and only memalloc sockets can process memalloc skbs. Since we
> don't have any memalloc sockets in our setups we shouldn't have
> memalloc skbs either. It simply doesn't make any sense to waste
> memory reserves on an skb which will be dropped anyway.
>
> It appears that __dev_alloc_pages() unconditionally uses
> __GFP_MEMALLOC, so unless the caller added __GFP_NOMEMALLOC,
> __dev_alloc_pages() may dive into memory reserves.
> Later build_skb() or __skb_fill_page_desc() sets skb->pfmemalloc = 1,
> so this skb is always dropped by sk_filter_trim_cap().
>
> Instead of wasting memory reserves we simply shouldn't use them in the
> absence of memalloc sockets in the system. Do this by adding
> __GFP_MEMALLOC only when such a socket is present in the system.
>
> Fixes: 0614002bb5f7 ("netvm: propagate page->pfmemalloc from skb_alloc_page to skb")
> Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
> ---
>  include/linux/skbuff.h | 17 ++++++++++++++++-
>  include/net/sock.h     | 15 ---------------
>  2 files changed, 16 insertions(+), 16 deletions(-)

Hi Andrey

Are you targeting net or net-next tree ?

AFAIK, drivers allocate skbs way before a frame is actually received
(at RX ring buffer initialization or refill).

So sk_memalloc_socks() might be false at that time, but true later.
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 18 Apr 2019 11:55:16 -0700

> AFAIK, drivers allocate skbs way before a frame is actually received,
> (at RX ring buffer initialization or refill)
>
> So sk_memalloc_socks() might be false at that time, but true later.

This is exactly what I was going to say too.
On 4/18/19 9:55 PM, Eric Dumazet wrote:
> On 04/18/2019 11:05 AM, Andrey Ryabinin wrote:
>> [commit message snipped]
>
> Hi Andrey
>
> Are you targeting net or net-next tree ?

I think it's up to Dave to decide where the patches should go. They apply
cleanly on both trees. The last two patches are just minor cleanups, so
they are definitely net-next material.
> AFAIK, drivers allocate skbs way before a frame is actually received,
> (at RX ring buffer initialization or refill)
>
> So sk_memalloc_socks() might be false at that time, but true later.

I don't see why that would be a problem. If a refill failed because we
didn't have access to reserves, then there is going to be another refill
attempt, right? And the next refill attempt will have access to the
reserves if a memalloc socket was created.
We can't predict the future, so until a memalloc socket appears we must
assume that those RX ring buffers won't be used to reclaim memory (and
that is actually true in 99% of cases).
On 04/19/2019 06:17 AM, Andrey Ryabinin wrote:
> On 4/18/19 9:55 PM, Eric Dumazet wrote:
>> Are you targeting net or net-next tree ?
>
> I think it's up to Dave to decide where the patches should go. They apply cleanly on both trees.
> The last two patches just minor cleanups so they are definitely net-next material.

Not at all, please carefully read Documentation/networking/netdev-FAQ.rst

Q: How do I indicate which tree (net vs. net-next) my patch should be in?

You need to split your patches in two series, not assume that David will
do the manual work for you.

Thank you
On 04/19/2019 06:17 AM, Andrey Ryabinin wrote:
> I don't see why that would be a problem. If refill failed because we didn't have
> access to reserves, then there is going to be another refill attempt, right?
> And the next refill attempt will have access to the reserves if a memalloc socket was created.
> We can't predict the future, so until a memalloc socket appears we must assume that those
> RX ring buffers won't be used to reclaim memory (and that is actually true in 99% of cases).

I just said that the alloc might be attempted "in the past".

Yes, we can not predict the future; this is why we need to access the
reserve _now_ and not at the time the packet is received.

The 'being true in 99% of cases' argument is not really convincing.

You want the NIC to be ready to receive packets even before
sk_memalloc_socks() becomes true.

If a NIC driver has a bug, please fix it.
On 4/19/19 4:41 PM, Eric Dumazet wrote:
> I just said that the alloc might be attempted "in the past"
>
> Yes, we can not predict the future, this is why we need to access the reserve _now_ and not
> at the time the packet is received.
>
> The 'being true in 99% of cases' argument is not really convincing.
>
> You want the NIC to be ready to receive packets even before sk_memalloc_socks() becomes true.

I completely understand that we allocate pages before the socket is
created. But why is that failed allocation such a problem?

1. sk_memalloc_socks() is false
2. NIC driver tries to allocate pages and fails
3. swapon /nfs/swap_file  /* memalloc socket created */
4. We may be losing some packets because of 2. But there will be a
   retransmit, and I suppose the NIC driver will try to allocate pages
   again, this time with access to reserves, so eventually we are all
   good.

So what is the problem? A potential loss of some packets for some period
of time after swapon?

> If a NIC driver has a bug, please fix it.

I don't follow, what kind of bug are you talking about? The problem I'm
trying to fix is entirely in __dev_alloc_pages():

1. sk_memalloc_socks() is false all the time.
2. NIC takes pages from memory reserves.
3. The fact that we occupy memory reserves increases the chances that
   the next page will be taken from reserves as well, which means more
   dropped packets than we would have without using the reserves.
On Fri, Apr 19, 2019 at 9:24 AM Andrey Ryabinin <aryabinin@virtuozzo.com> wrote:
>
> But why is that failed allocation such a problem?
>
> 1. sk_memalloc_socks() is false
> 2. NIC driver tries to allocate pages and fails

The NIC then is unable to receive any frames.

We need to be able to populate the RX ring buffer before the NIC can
actually be started.

Basically you are saying: we need to allocate memory only _after_ a
frame has been received by the NIC.

I am saying: we need to allocate memory so that the NIC can put a
future frame in it.
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a06275a618f0..676e54f84de4 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2784,6 +2784,19 @@ void napi_consume_skb(struct sk_buff *skb, int budget);
 void __kfree_skb_flush(void);
 void __kfree_skb_defer(struct sk_buff *skb);
 
+#ifdef CONFIG_NET
+DECLARE_STATIC_KEY_FALSE(memalloc_socks_key);
+static inline int sk_memalloc_socks(void)
+{
+	return static_branch_unlikely(&memalloc_socks_key);
+}
+#else
+static inline int sk_memalloc_socks(void)
+{
+	return 0;
+}
+#endif
+
 /**
  * __dev_alloc_pages - allocate page for network Rx
  * @gfp_mask: allocation priority. Set __GFP_NOMEMALLOC if not for network Rx
@@ -2804,7 +2817,9 @@ static inline struct page *__dev_alloc_pages(gfp_t gfp_mask,
 	 * 4.  __GFP_MEMALLOC is ignored if __GFP_NOMEMALLOC is set due to
 	 *     code in gfp_to_alloc_flags that should be enforcing this.
 	 */
-	gfp_mask |= __GFP_COMP | __GFP_MEMALLOC;
+	gfp_mask |= __GFP_COMP;
+	if (sk_memalloc_socks())
+		gfp_mask |= __GFP_MEMALLOC;
 
 	return alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
 }
diff --git a/include/net/sock.h b/include/net/sock.h
index bdd77bbce7d8..5b2138d47bd8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -838,21 +838,6 @@ static inline bool sock_flag(const struct sock *sk, enum sock_flags flag)
 	return test_bit(flag, &sk->sk_flags);
 }
 
-#ifdef CONFIG_NET
-DECLARE_STATIC_KEY_FALSE(memalloc_socks_key);
-static inline int sk_memalloc_socks(void)
-{
-	return static_branch_unlikely(&memalloc_socks_key);
-}
-#else
-
-static inline int sk_memalloc_socks(void)
-{
-	return 0;
-}
-
-#endif
-
 static inline gfp_t sk_gfp_mask(const struct sock *sk, gfp_t gfp_mask)
 {
 	return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
In some workloads we have noticed packets being dropped by
sk_filter_trim_cap() because the 'skb' was allocated from pfmemalloc
reserves:

	/*
	 * If the skb was allocated from pfmemalloc reserves, only
	 * allow SOCK_MEMALLOC sockets to use it as this socket is
	 * helping free memory
	 */
	if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC)) {
		NET_INC_STATS(sock_net(sk), LINUX_MIB_PFMEMALLOCDROP);
		return -ENOMEM;
	}

Memalloc sockets are used for stuff like swap over NBD or NFS
and only memalloc sockets can process memalloc skbs. Since we
don't have any memalloc sockets in our setups we shouldn't have
memalloc skbs either. It simply doesn't make any sense to waste
memory reserves on an skb which will be dropped anyway.

It appears that __dev_alloc_pages() unconditionally uses
__GFP_MEMALLOC, so unless the caller added __GFP_NOMEMALLOC,
__dev_alloc_pages() may dive into memory reserves.
Later build_skb() or __skb_fill_page_desc() sets skb->pfmemalloc = 1,
so this skb is always dropped by sk_filter_trim_cap().

Instead of wasting memory reserves we simply shouldn't use them in the
absence of memalloc sockets in the system. Do this by adding
__GFP_MEMALLOC only when such a socket is present in the system.

Fixes: 0614002bb5f7 ("netvm: propagate page->pfmemalloc from skb_alloc_page to skb")
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
---
 include/linux/skbuff.h | 17 ++++++++++++++++-
 include/net/sock.h     | 15 ---------------
 2 files changed, 16 insertions(+), 16 deletions(-)