diff mbox

[net] net: Do not hold the reference for the same sk_rx_dst.

Message ID 1489651711-4105-1-git-send-email-kaiwen.xu@hulu.com
State Rejected, archived
Delegated to: David Miller
Headers show

Commit Message

Kevin Xu March 16, 2017, 8:08 a.m. UTC
In some rare cases, inet_sk_rx_dst_set() may be called multiple times
on the same dst, causing reference count leakage. Eventually, it
prevents net_device to be destroyed. The bug then manifested as

unregister_netdevice: waiting for lo to become free. Usage count = 1

in the kernel log, preventing new network namespace creation.

The patch works around the issue by checking whether the socket already
has the same dst set.

Signed-off-by: Kevin Xu <kaiwen.xu@hulu.com>
---
 net/ipv4/tcp_ipv4.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

Comments

Jakub Sitnicki March 16, 2017, 10:01 a.m. UTC | #1
On Thu, Mar 16, 2017 at 08:08 AM GMT, Kevin Xu wrote:
> In some rare cases, inet_sk_rx_dst_set() may be called multiple times
> on the same dst, causing reference count leakage. Eventually, it
> prevents net_device to be destroyed. The bug then manifested as
>
> unregister_netdevice: waiting for lo to become free. Usage count = 1
>
> in the kernel log, preventing new network namespace creation.
>
> The patch works around the issue by checking whether the socket already
> has the same dst set.
>
> Signed-off-by: Kevin Xu <kaiwen.xu@hulu.com>
> ---

FWIW, with this patch applied I'm still sometimes seeing:

[  125.928095] unregister_netdevice: waiting for lo to become free. Usage count = 1

-Jakub
Kevin Xu March 16, 2017, 10:12 a.m. UTC | #2
Do you mean the message looping endlessly?

If so, then I suppose it's a different bug.

Kevin

> On Mar 16, 2017, at 3:01 AM, Jakub Sitnicki <jkbs@redhat.com> wrote:
> 
>> On Thu, Mar 16, 2017 at 08:08 AM GMT, Kevin Xu wrote:
>> In some rare cases, inet_sk_rx_dst_set() may be called multiple times
>> on the same dst, causing reference count leakage. Eventually, it
>> prevents net_device to be destroyed. The bug then manifested as
>> 
>> unregister_netdevice: waiting for lo to become free. Usage count = 1
>> 
>> in the kernel log, preventing new network namespace creation.
>> 
>> The patch works around the issue by checking whether the socket already
>> has the same dst set.
>> 
>> Signed-off-by: Kevin Xu <kaiwen.xu@hulu.com>
>> ---
> 
> FWIW, with this patch applied I'm still sometimes seeing:
> 
> [  125.928095] unregister_netdevice: waiting for lo to become free. Usage count = 1
> 
> -Jakub
Jakub Sitnicki March 16, 2017, 10:45 a.m. UTC | #3
On Thu, Mar 16, 2017 at 10:12 AM GMT, Kevin Xu wrote:
> Do you mean the message looping endlessly?

No, the message is emitted just once. Around 100 seconds after
destroying a few namespaces. Occurs not so often, maybe once per ten
runs.

-Jakub

> If so, then I suppose it's a different bug.
>
> Kevin
>
>> On Mar 16, 2017, at 3:01 AM, Jakub Sitnicki <jkbs@redhat.com> wrote:
>>
>>> On Thu, Mar 16, 2017 at 08:08 AM GMT, Kevin Xu wrote:
>>> In some rare cases, inet_sk_rx_dst_set() may be called multiple times
>>> on the same dst, causing reference count leakage. Eventually, it
>>> prevents net_device to be destroyed. The bug then manifested as
>>>
>>> unregister_netdevice: waiting for lo to become free. Usage count = 1
>>>
>>> in the kernel log, preventing new network namespace creation.
>>>
>>> The patch works around the issue by checking whether the socket already
>>> has the same dst set.
>>>
>>> Signed-off-by: Kevin Xu <kaiwen.xu@hulu.com>
>>> ---
>>
>> FWIW, with this patch applied I'm still sometimes seeing:
>>
>> [  125.928095] unregister_netdevice: waiting for lo to become free. Usage count = 1
>>
>> -Jakub
David Miller March 16, 2017, 6:16 p.m. UTC | #4
From: Kevin Xu <kaiwen.xu@hulu.com>
Date: Thu, 16 Mar 2017 01:08:30 -0700

> In some rare cases, inet_sk_rx_dst_set() may be called multiple times
> on the same dst, causing reference count leakage. Eventually, it
> prevents net_device to be destroyed. The bug then manifested as
> 
> unregister_netdevice: waiting for lo to become free. Usage count = 1
> 
> in the kernel log, preventing new network namespace creation.
> 
> The patch works around the issue by checking whether the socket already
> has the same dst set.
> 
> Signed-off-by: Kevin Xu <kaiwen.xu@hulu.com>

You need to prevent this parallel execution of this function or use
atomic compare-and-exchange to set the rx_dst in order to prevent
the double refcounting.

This patch by itself is even worse than a workaround, because depending
upon how the compiler reloads values from memory, the problem can still
occur.
Kevin Xu March 16, 2017, 6:18 p.m. UTC | #5
On Thu, Mar 16, 2017 at 11:45:03AM +0100, Jakub Sitnicki wrote:
> On Thu, Mar 16, 2017 at 10:12 AM GMT, Kevin Xu wrote:
> > Do you mean the message looping endlessly?
> 
> No, the message is emitted just once. Around 100 seconds after
> destroying a few namespaces. Occurs not so often, maybe once per ten
> runs.
> 
> -Jakub
> 

I saw that happening from time to time during my test as well, I suspect
it was because some TCP sockets stays in either TCP_TIME_WAIT or
TCP_FIN_WAIT1. But eventually those sockets get destroyed and lo gets
deleted as well.

The patch was fixing an issue I am seeing, that the message gets looped
forever, and causing a deadlock on new network ns creation.

Kevin

> > If so, then I suppose it's a different bug.
> >
> > Kevin
> >
> >> On Mar 16, 2017, at 3:01 AM, Jakub Sitnicki <jkbs@redhat.com> wrote:
> >>
> >>> On Thu, Mar 16, 2017 at 08:08 AM GMT, Kevin Xu wrote:
> >>> In some rare cases, inet_sk_rx_dst_set() may be called multiple times
> >>> on the same dst, causing reference count leakage. Eventually, it
> >>> prevents net_device to be destroyed. The bug then manifested as
> >>>
> >>> unregister_netdevice: waiting for lo to become free. Usage count = 1
> >>>
> >>> in the kernel log, preventing new network namespace creation.
> >>>
> >>> The patch works around the issue by checking whether the socket already
> >>> has the same dst set.
> >>>
> >>> Signed-off-by: Kevin Xu <kaiwen.xu@hulu.com>
> >>> ---
> >>
> >> FWIW, with this patch applied I'm still sometimes seeing:
> >>
> >> [  125.928095] unregister_netdevice: waiting for lo to become free. Usage count = 1
> >>
> >> -Jakub
Cong Wang March 16, 2017, 6:42 p.m. UTC | #6
On Thu, Mar 16, 2017 at 11:18 AM, Kaiwen Xu <kaiwen.xu@hulu.com> wrote:
> On Thu, Mar 16, 2017 at 11:45:03AM +0100, Jakub Sitnicki wrote:
>> On Thu, Mar 16, 2017 at 10:12 AM GMT, Kevin Xu wrote:
>> > Do you mean the message looping endlessly?
>>
>> No, the message is emitted just once. Around 100 seconds after
>> destroying a few namespaces. Occurs not so often, maybe once per ten
>> runs.
>>
>> -Jakub
>>
>
> I saw that happening from time to time during my test as well, I suspect
> it was because some TCP sockets stays in either TCP_TIME_WAIT or
> TCP_FIN_WAIT1. But eventually those sockets get destroyed and lo gets
> deleted as well.

But TIMEWAIT sockets are purged by inet_twsk_purge() during netns
destroy, apparently before lo is destroyed.

>
> The patch was fixing an issue I am seeing, that the message gets looped
> forever, and causing a deadlock on new network ns creation.
>

But as DaveM said, the race still could happen after your patch,
the if check you add is not atomic at all.

Also, we should already lock the sock at the time we call
inet_sk_rx_dst_set(), but perhaps not for TCP_LISTEN case...
Eric Dumazet March 16, 2017, 9:33 p.m. UTC | #7
On Thu, 2017-03-16 at 01:08 -0700, Kevin Xu wrote:
> In some rare cases, inet_sk_rx_dst_set() may be called multiple times
> on the same dst, causing reference count leakage. Eventually, it
> prevents net_device to be destroyed. The bug then manifested as
> 
> unregister_netdevice: waiting for lo to become free. Usage count = 1
> 
> in the kernel log, preventing new network namespace creation.
> 
> The patch works around the issue by checking whether the socket already
> has the same dst set.
> 
> Signed-off-by: Kevin Xu <kaiwen.xu@hulu.com>
> ---
>  net/ipv4/tcp_ipv4.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 575e19d..f425c14 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1807,9 +1807,14 @@ void inet_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
>  {
>  	struct dst_entry *dst = skb_dst(skb);
>  
> -	if (dst && dst_hold_safe(dst)) {
> -		sk->sk_rx_dst = dst;
> -		inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
> +	if (dst) {
> +		if (unlikely(dst == sk->sk_rx_dst))
> +			return;
> +
> +		if (dst_hold_safe(dst)) {
> +			sk->sk_rx_dst = dst;
> +			inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
> +		}


That should not be needed.

Have you backported the redirect fix ?

commit 45caeaa5ac0b4b11784ac6f932c0ad4c6b67cda0

Or other fixes that went very recently (pick David Miller net tree)
Cong Wang March 16, 2017, 11:13 p.m. UTC | #8
On Thu, Mar 16, 2017 at 2:33 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Have you backported the redirect fix ?
>
> commit 45caeaa5ac0b4b11784ac6f932c0ad4c6b67cda0
>
> Or other fixes that went very recently (pick David Miller net tree)

Why the commit above is relevant here? It fixes a double-release,
while Kevin's case is a double-hold... Not to mention it sk_rx_dst
instead of sk_dst_cache, according to Kevin.
Eric Dumazet March 16, 2017, 11:29 p.m. UTC | #9
On Thu, 2017-03-16 at 16:13 -0700, Cong Wang wrote:
> On Thu, Mar 16, 2017 at 2:33 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > Have you backported the redirect fix ?
> >
> > commit 45caeaa5ac0b4b11784ac6f932c0ad4c6b67cda0
> >
> > Or other fixes that went very recently (pick David Miller net tree)
> 
> Why the commit above is relevant here? It fixes a double-release,
> while Kevin's case is a double-hold... Not to mention it sk_rx_dst
> instead of sk_dst_cache, according to Kevin.

Yes you are right.

I was lazy, the other fix I wanted to mention was 
02b2faaf0af1d85585f

In any case I wanted to make sure the reported issue is not for some old
kernels.
diff mbox

Patch

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 575e19d..f425c14 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1807,9 +1807,14 @@  void inet_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
 
-	if (dst && dst_hold_safe(dst)) {
-		sk->sk_rx_dst = dst;
-		inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
+	if (dst) {
+		if (unlikely(dst == sk->sk_rx_dst))
+			return;
+
+		if (dst_hold_safe(dst)) {
+			sk->sk_rx_dst = dst;
+			inet_sk(sk)->rx_dst_ifindex = skb->skb_iif;
+		}
 	}
 }
 EXPORT_SYMBOL(inet_sk_rx_dst_set);