Message ID | 20150727142146.GC16447@oracle.com |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
sock_create_kern and friends are specialied interfaces for special purposes. At a quick read through I don't think we have a single in tree user doing with them what you are trying to do. Without seeing code using the interfaces in the way are trying to use them I do not have enough information to comment intelligently. Eric Sowmini Varadhan <sowmini.varadhan@oracle.com> writes: > I'm running into a netns refcnt issue, and I suspect that > eeb1bd5c has something to do with it (perhaps we need an > additional change in sk_clone_lock() after eeb1bd5c). > Here's the problem: > > When we create an syn_recv sock based on a kernel listen sock, we > take a get_net() ref with a stack similar to the one shown below. > Note that the parent (kernel, listen) sock itself has not taken > a get_net() ref, because it explicitly calls sock_create_kern(). > > get_net /* for the newsk */ > sk_clone_lock > inet_csk_clone_lock > tcp_create_openreq_child > tcp_v4_syn_recv_sock > tcp_check_req > tcp_v4_do_rcv > tcp_v4_rcv > : > > But it's not clear to me where this refcnt will be released: > in my case, I expect to create/cleanup kernel sockets as part > of ->init/->exit for my module, but because the accept socket > has a netns refcnt, it blocks cleanup_net(), thus my ->exit > pernet_subsys op cannot run and clean this up, and we have a leak. > > I think that sk_clone_lock() should only do a get_net() if the parent > is not a kernel socket (making this similar to sk_alloc()), i.e., > > diff --git a/net/core/sock.c b/net/core/sock.c > index 08f16db..371d1b7 100644 > --- a/net/core/sock.c > +++ b/net/core/sock.c > @@ -1497,7 +1497,8 @@ struct sock *sk_clone_lock(const struct sock *sk, const gf > sock_copy(newsk, sk); > > /* SANITY */ > - get_net(sock_net(newsk)); > + if (likely(newsk->sk_net_refcnt)) > + get_net(sock_net(newsk)); > sk_node_init(&newsk->sk_node); > sock_lock_init(newsk); > bh_lock_sock(newsk); > > Does this sound right? > > --Sowmini -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jul 27, 2015 at 7:21 AM, Sowmini Varadhan <sowmini.varadhan@oracle.com> wrote: > > I'm running into a netns refcnt issue, and I suspect that > eeb1bd5c has something to do with it (perhaps we need an > additional change in sk_clone_lock() after eeb1bd5c). > Here's the problem: > > When we create an syn_recv sock based on a kernel listen sock, we > take a get_net() ref with a stack similar to the one shown below. > Note that the parent (kernel, listen) sock itself has not taken > a get_net() ref, because it explicitly calls sock_create_kern(). > > get_net /* for the newsk */ > sk_clone_lock > inet_csk_clone_lock > tcp_create_openreq_child > tcp_v4_syn_recv_sock > tcp_check_req > tcp_v4_do_rcv > tcp_v4_rcv > : > > But it's not clear to me where this refcnt will be released: > in my case, I expect to create/cleanup kernel sockets as part > of ->init/->exit for my module, but because the accept socket > has a netns refcnt, it blocks cleanup_net(), thus my ->exit > pernet_subsys op cannot run and clean this up, and we have a leak. That refcnt should be released in sock destructor too, when the tcp connection is terminated. > > I think that sk_clone_lock() should only do a get_net() if the parent > is not a kernel socket (making this similar to sk_alloc()), i.e., > Given the fact that sk_destruct() checks for sk_net_refcnt, your patch makes sense to me. But I am not sure how a TCP kernel socket is supposed to use. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On (07/27/15 11:13), Cong Wang wrote: > > That refcnt should be released in sock destructor too, when the tcp > connection is terminated. yes, but in my case, the listen socket is opened as part of the ->init indirection in pernet_operations (thus it is a kernel socket) and the expectation is that this listen socket, and any accept sockets derived from it, will be closed in ->exit. But if the accept socket is treated as a uspace socket (thus holds a get_net()) then it will block cleanup_net() and the associated ->exit cleanup operations. This is probably not a problem for other systems like vxlan/gue/geneve etc because they all use udp sockets, thus dont have the "accept" equivalent. But fundamentally, its wrong for a kspace listen socket to result in a "uspace" accept socket. > Given the fact that sk_destruct() checks for sk_net_refcnt, your > patch makes sense to me. But I am not sure how a TCP kernel > socket is supposed to use. Thanks for the confirmation - I think RDS is a bit of a maverick here in that it uses tcp sockets unlike vxlan etc. For those curious about RDS-TCP, I've actually updated the documentation at https://oss.oracle.com/projects/rds/dist/documentation/rds-3.1-spec.html recently. I hope that helps. --Sowmini -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jul 27, 2015 at 11:19 AM, Sowmini Varadhan <sowmini.varadhan@oracle.com> wrote: > On (07/27/15 11:13), Cong Wang wrote: >> >> That refcnt should be released in sock destructor too, when the tcp >> connection is terminated. > > yes, but in my case, the listen socket is opened as part of > the ->init indirection in pernet_operations (thus it is a kernel socket) > and the expectation is that this listen socket, and any accept sockets > derived from it, will be closed in ->exit. > > But if the accept socket is treated as a uspace socket (thus holds a get_net()) > then it will block cleanup_net() and the associated ->exit cleanup operations. > > This is probably not a problem for other systems like vxlan/gue/geneve etc > because they all use udp sockets, thus dont have the "accept" equivalent. > dlm uses a kernel TCP socket too, but it allocates a new socket and calls ->accept() by itself. ;) -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On (07/27/15 11:37), Cong Wang wrote: > > dlm uses a kernel TCP socket too, but it allocates a new socket and calls > ->accept() by itself. ;) sure, and rds does this in rds_tcp_accept_one() too. But the newsk being created in sk_clone_lock is the one on an incoming syn, i.e., the one that is saved up as part of listen backlog, to be returned later on the accept. I dont know the details of dlm- can you have one dlm instance per network namespace? That's where I'm running into this issue- when we try to have one rds listen socket per netns, and want to be able to do both - dynamically build/tear down new network namepsaces, without unloading rds_tcp globally - unload rds_tcp globally withouth tearing down individual netns. But perhaps we digress. Fundamental issue remains: newsk is the syn_recv version of the listen socket. If the listen socket is a "kernel" socket (kern == 1 for sk_alloc, and the listen socket thus has no sk_net_refcnt), the syn_recv socket must also have that behavior, so that it is cleaned up in the same way. --Sowmini -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/core/sock.c b/net/core/sock.c index 08f16db..371d1b7 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1497,7 +1497,8 @@ struct sock *sk_clone_lock(const struct sock *sk, const gf sock_copy(newsk, sk); /* SANITY */ - get_net(sock_net(newsk)); + if (likely(newsk->sk_net_refcnt)) + get_net(sock_net(newsk)); sk_node_init(&newsk->sk_node); sock_lock_init(newsk); bh_lock_sock(newsk);