diff mbox

netns refcnt leak for kernel accept sock

Message ID 20150727142146.GC16447@oracle.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Sowmini Varadhan July 27, 2015, 2:21 p.m. UTC
I'm running into a netns refcnt issue, and I suspect that 
eeb1bd5c has something to do with it (perhaps we need an 
additional change in sk_clone_lock() after eeb1bd5c). 
Here's the problem:

When we create an syn_recv sock based on a kernel listen sock, we
take a get_net() ref  with a stack similar to the one shown below.
Note that the parent (kernel, listen) sock itself has not taken
a get_net() ref, because it explicitly calls sock_create_kern().

  get_net /* for the newsk */
  sk_clone_lock
  inet_csk_clone_lock
  tcp_create_openreq_child
  tcp_v4_syn_recv_sock
  tcp_check_req
  tcp_v4_do_rcv
  tcp_v4_rcv 
   :

But it's not clear to me where this refcnt will be released: 
in my case, I expect to create/cleanup kernel sockets as part 
of ->init/->exit for my module, but because the accept socket 
has a netns refcnt, it blocks cleanup_net(), thus my ->exit 
pernet_subsys op cannot run and clean this up, and we have a leak.

I think that sk_clone_lock() should only do a get_net() if the parent
is not a kernel socket (making this similar to sk_alloc()), i.e.,


Does this sound right?

--Sowmini

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric W. Biederman July 27, 2015, 5:40 p.m. UTC | #1
sock_create_kern and friends are specialied interfaces for special
purposes.  At a quick read through I don't think we have a single in
tree user doing with them what you are trying to do.

Without seeing code using the interfaces in the way are trying to use
them I do not have enough information to comment intelligently.

Eric

Sowmini Varadhan <sowmini.varadhan@oracle.com> writes:
> I'm running into a netns refcnt issue, and I suspect that 
> eeb1bd5c has something to do with it (perhaps we need an 
> additional change in sk_clone_lock() after eeb1bd5c). 
> Here's the problem:
>
> When we create an syn_recv sock based on a kernel listen sock, we
> take a get_net() ref  with a stack similar to the one shown below.
> Note that the parent (kernel, listen) sock itself has not taken
> a get_net() ref, because it explicitly calls sock_create_kern().
>
>   get_net /* for the newsk */
>   sk_clone_lock
>   inet_csk_clone_lock
>   tcp_create_openreq_child
>   tcp_v4_syn_recv_sock
>   tcp_check_req
>   tcp_v4_do_rcv
>   tcp_v4_rcv 
>    :
>
> But it's not clear to me where this refcnt will be released: 
> in my case, I expect to create/cleanup kernel sockets as part 
> of ->init/->exit for my module, but because the accept socket 
> has a netns refcnt, it blocks cleanup_net(), thus my ->exit 
> pernet_subsys op cannot run and clean this up, and we have a leak.
>
> I think that sk_clone_lock() should only do a get_net() if the parent
> is not a kernel socket (making this similar to sk_alloc()), i.e.,
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 08f16db..371d1b7 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1497,7 +1497,8 @@ struct sock *sk_clone_lock(const struct sock *sk, const gf
>                 sock_copy(newsk, sk);
>
>                 /* SANITY */
> -               get_net(sock_net(newsk));
> +               if (likely(newsk->sk_net_refcnt))
> +                       get_net(sock_net(newsk));
>                 sk_node_init(&newsk->sk_node);
>                 sock_lock_init(newsk);
>                 bh_lock_sock(newsk);
>
> Does this sound right?
>
> --Sowmini
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Cong Wang July 27, 2015, 6:13 p.m. UTC | #2
On Mon, Jul 27, 2015 at 7:21 AM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
>
> I'm running into a netns refcnt issue, and I suspect that
> eeb1bd5c has something to do with it (perhaps we need an
> additional change in sk_clone_lock() after eeb1bd5c).
> Here's the problem:
>
> When we create an syn_recv sock based on a kernel listen sock, we
> take a get_net() ref  with a stack similar to the one shown below.
> Note that the parent (kernel, listen) sock itself has not taken
> a get_net() ref, because it explicitly calls sock_create_kern().
>
>   get_net /* for the newsk */
>   sk_clone_lock
>   inet_csk_clone_lock
>   tcp_create_openreq_child
>   tcp_v4_syn_recv_sock
>   tcp_check_req
>   tcp_v4_do_rcv
>   tcp_v4_rcv
>    :
>
> But it's not clear to me where this refcnt will be released:
> in my case, I expect to create/cleanup kernel sockets as part
> of ->init/->exit for my module, but because the accept socket
> has a netns refcnt, it blocks cleanup_net(), thus my ->exit
> pernet_subsys op cannot run and clean this up, and we have a leak.


That refcnt should be released in sock destructor too, when the tcp
connection is terminated.

>
> I think that sk_clone_lock() should only do a get_net() if the parent
> is not a kernel socket (making this similar to sk_alloc()), i.e.,
>

Given the fact that sk_destruct() checks for sk_net_refcnt, your
patch makes sense to me. But I am not sure how a TCP kernel
socket is supposed to use.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sowmini Varadhan July 27, 2015, 6:19 p.m. UTC | #3
On (07/27/15 11:13), Cong Wang wrote:
> 
> That refcnt should be released in sock destructor too, when the tcp
> connection is terminated.

yes, but in my case, the listen socket is opened as part of
the ->init indirection in pernet_operations (thus it is a kernel socket)
and the expectation is that this listen socket, and any accept sockets
derived from it, will be closed in ->exit.

But if the accept socket is treated as a uspace socket (thus holds a get_net())
then it will block cleanup_net() and the associated ->exit cleanup operations.

This is probably not a problem for other systems like vxlan/gue/geneve etc
because they all use udp sockets, thus dont have the "accept" equivalent.

But fundamentally, its wrong for a kspace listen socket to result in a
"uspace" accept socket.

> Given the fact that sk_destruct() checks for sk_net_refcnt, your
> patch makes sense to me. But I am not sure how a TCP kernel
> socket is supposed to use.

Thanks for the confirmation - I think RDS is a bit of a maverick here in
that it uses tcp sockets unlike vxlan etc.
For those curious about RDS-TCP, I've actually updated the documentation at
https://oss.oracle.com/projects/rds/dist/documentation/rds-3.1-spec.html
recently. I hope that helps.

--Sowmini
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Cong Wang July 27, 2015, 6:37 p.m. UTC | #4
On Mon, Jul 27, 2015 at 11:19 AM, Sowmini Varadhan
<sowmini.varadhan@oracle.com> wrote:
> On (07/27/15 11:13), Cong Wang wrote:
>>
>> That refcnt should be released in sock destructor too, when the tcp
>> connection is terminated.
>
> yes, but in my case, the listen socket is opened as part of
> the ->init indirection in pernet_operations (thus it is a kernel socket)
> and the expectation is that this listen socket, and any accept sockets
> derived from it, will be closed in ->exit.
>
> But if the accept socket is treated as a uspace socket (thus holds a get_net())
> then it will block cleanup_net() and the associated ->exit cleanup operations.
>
> This is probably not a problem for other systems like vxlan/gue/geneve etc
> because they all use udp sockets, thus dont have the "accept" equivalent.
>

dlm uses a kernel TCP socket too, but it allocates a new socket and calls
->accept() by itself. ;)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Sowmini Varadhan July 27, 2015, 6:50 p.m. UTC | #5
On (07/27/15 11:37), Cong Wang wrote:
> 
> dlm uses a kernel TCP socket too, but it allocates a new socket and calls
> ->accept() by itself. ;)

sure, and rds does this in rds_tcp_accept_one() too.

But the newsk being created in sk_clone_lock  is the one on an 
incoming syn, i.e., the one that is saved up as part of listen backlog, 
to be returned later on the accept.

I dont know the details of dlm- can you have one dlm instance per
network namespace? That's where I'm running into this issue- when we
try to have one rds listen socket per netns, and want to be able to
do both
- dynamically build/tear down new network namepsaces, without 
  unloading rds_tcp globally
- unload rds_tcp globally withouth tearing down individual netns.

But perhaps we digress.

Fundamental issue remains: newsk is the syn_recv version of the
listen socket. If the listen socket is a "kernel" socket (kern == 1
for sk_alloc, and the listen socket thus has no sk_net_refcnt),
the syn_recv socket must also have that behavior, so that it is
cleaned up in the same way.

--Sowmini

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/core/sock.c b/net/core/sock.c
index 08f16db..371d1b7 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1497,7 +1497,8 @@  struct sock *sk_clone_lock(const struct sock *sk, const gf
                sock_copy(newsk, sk);

                /* SANITY */
-               get_net(sock_net(newsk));
+               if (likely(newsk->sk_net_refcnt))
+                       get_net(sock_net(newsk));
                sk_node_init(&newsk->sk_node);
                sock_lock_init(newsk);
                bh_lock_sock(newsk);