diff mbox series

[net-next,2/3] rds: tcp: correctly sequence cleanup on netns deletion.

Message ID 4c8332354d9423207740a814ea75706f59ca7e72.1511195030.git.sowmini.varadhan@oracle.com
State Accepted, archived
Delegated to: David Miller
Headers show
Series rds-tcp netns delete related fixes | expand

Commit Message

Sowmini Varadhan Nov. 30, 2017, 7:11 p.m. UTC
Commit 8edc3affc077 ("rds: tcp: Take explicit refcounts on struct net")
introduces a regression in rds-tcp netns cleanup. The cleanup_net(),
(and thus rds_tcp_dev_event notification) is only called from put_net()
when all netns refcounts go to 0, but this cannot happen if the
rds_connection itself is holding a c_net ref that it expects to
release in rds_tcp_kill_sock.

Instead, the rds_tcp_kill_sock callback should make sure to
tear down state carefully, ensuring that the socket teardown
is only done after all data-structures and workqs that depend
on it are quiesced.

The original motivation for commit 8edc3affc077 ("rds: tcp: Take explicit
refcounts on struct net") was to resolve a race condition reported by
syzkaller where workqs for tx/rx/connect were triggered after the
namespace was deleted. Those worker threads should have been
cancelled/flushed before socket tear-down and indeed,
rds_conn_path_destroy() does try to sequence this by doing
     /* cancel cp_send_w */
     /* cancel cp_recv_w */
     /* flush cp_down_w */
     /* free data structures */
Here the "flush cp_down_w" will trigger rds_conn_shutdown and thus
invoke rds_tcp_conn_path_shutdown() to close the tcp socket, so that
we ought to have satisfied the requirement that "socket-close is
done after all other dependent state is quiesced". However,
rds_conn_shutdown has a bug in that it *always* triggers the reconnect
workq (and if connection is successful, we always restart tx/rx
workqs so with the right timing, we risk the race conditions reported
by syzkaller).

Netns deletion is like module teardown- no need to restart a
reconnect in this case. We can use the c_destroy_in_prog bit
to avoid restarting the reconnect.

Fixes: 8edc3affc077 ("rds: tcp: Take explicit refcounts on struct net")
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
---
 net/rds/connection.c |    3 ++-
 net/rds/rds.h        |    6 +++---
 net/rds/tcp.c        |    4 ++--
 3 files changed, 7 insertions(+), 6 deletions(-)

Comments

Santosh Shilimkar Nov. 30, 2017, 8:28 p.m. UTC | #1
On 11/30/2017 11:11 AM, Sowmini Varadhan wrote:
> Commit 8edc3affc077 ("rds: tcp: Take explicit refcounts on struct net")
> introduces a regression in rds-tcp netns cleanup. The cleanup_net(),
> (and thus rds_tcp_dev_event notification) is only called from put_net()
> when all netns refcounts go to 0, but this cannot happen if the
> rds_connection itself is holding a c_net ref that it expects to
> release in rds_tcp_kill_sock.
> 
> Instead, the rds_tcp_kill_sock callback should make sure to
> tear down state carefully, ensuring that the socket teardown
> is only done after all data-structures and workqs that depend
> on it are quiesced.
> 
> The original motivation for commit 8edc3affc077 ("rds: tcp: Take explicit
> refcounts on struct net") was to resolve a race condition reported by
> syzkaller where workqs for tx/rx/connect were triggered after the
> namespace was deleted. Those worker threads should have been
> cancelled/flushed before socket tear-down and indeed,
> rds_conn_path_destroy() does try to sequence this by doing
>       /* cancel cp_send_w */
>       /* cancel cp_recv_w */
>       /* flush cp_down_w */
>       /* free data structures */
> Here the "flush cp_down_w" will trigger rds_conn_shutdown and thus
> invoke rds_tcp_conn_path_shutdown() to close the tcp socket, so that
> we ought to have satisfied the requirement that "socket-close is
> done after all other dependent state is quiesced". However,
> rds_conn_shutdown has a bug in that it *always* triggers the reconnect
> workq (and if connection is successful, we always restart tx/rx
> workqs so with the right timing, we risk the race conditions reported
> by syzkaller).
> 
> Netns deletion is like module teardown- no need to restart a
> reconnect in this case. We can use the c_destroy_in_prog bit
> to avoid restarting the reconnect.
> 
> Fixes: 8edc3affc077 ("rds: tcp: Take explicit refcounts on struct net")
> Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
> ---
>   net/rds/connection.c |    3 ++-
>   net/rds/rds.h        |    6 +++---
>   net/rds/tcp.c        |    4 ++--
>   3 files changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/net/rds/connection.c b/net/rds/connection.c
> index 7ee2d5d..9efc82c 100644
> --- a/net/rds/connection.c
> +++ b/net/rds/connection.c
> @@ -366,6 +366,8 @@ void rds_conn_shutdown(struct rds_conn_path *cp)
>   	 * to the conn hash, so we never trigger a reconnect on this
>   	 * conn - the reconnect is always triggered by the active peer. */
>   	cancel_delayed_work_sync(&cp->cp_conn_w);
> +	if (conn->c_destroy_in_prog)
> +		return;

Not related to this patch but it will be more safe to use
cp_flags or if needed add flag and conn level for bundle
and use bit wise to avoid possible races to set c_destroy_in_prog.

Something similar to RDS_DESTROY_PENDING etc.

The patch itself looks good to me in terms of netns ref counting.

Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
diff mbox series

Patch

diff --git a/net/rds/connection.c b/net/rds/connection.c
index 7ee2d5d..9efc82c 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -366,6 +366,8 @@  void rds_conn_shutdown(struct rds_conn_path *cp)
 	 * to the conn hash, so we never trigger a reconnect on this
 	 * conn - the reconnect is always triggered by the active peer. */
 	cancel_delayed_work_sync(&cp->cp_conn_w);
+	if (conn->c_destroy_in_prog)
+		return;
 	rcu_read_lock();
 	if (!hlist_unhashed(&conn->c_hash_node)) {
 		rcu_read_unlock();
@@ -445,7 +447,6 @@  void rds_conn_destroy(struct rds_connection *conn)
 	 */
 	rds_cong_remove_conn(conn);
 
-	put_net(conn->c_net);
 	kfree(conn->c_path);
 	kmem_cache_free(rds_conn_slab, conn);
 
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 2e0315b..d11301b 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -149,7 +149,7 @@  struct rds_connection {
 
 	/* Protocol version */
 	unsigned int		c_version;
-	struct net		*c_net;
+	possible_net_t		c_net;
 
 	struct list_head	c_map_item;
 	unsigned long		c_map_queued;
@@ -164,13 +164,13 @@  struct rds_connection {
 static inline
 struct net *rds_conn_net(struct rds_connection *conn)
 {
-	return conn->c_net;
+	return read_pnet(&conn->c_net);
 }
 
 static inline
 void rds_conn_net_set(struct rds_connection *conn, struct net *net)
 {
-	conn->c_net = get_net(net);
+	write_pnet(&conn->c_net, net);
 }
 
 #define RDS_FLAG_CONG_BITMAP	0x01
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 222cc53..f580f72 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -506,7 +506,7 @@  static void rds_tcp_kill_sock(struct net *net)
 	rds_tcp_listen_stop(lsock, &rtn->rds_tcp_accept_w);
 	spin_lock_irq(&rds_tcp_conn_lock);
 	list_for_each_entry_safe(tc, _tc, &rds_tcp_conn_list, t_tcp_node) {
-		struct net *c_net = tc->t_cpath->cp_conn->c_net;
+		struct net *c_net = read_pnet(&tc->t_cpath->cp_conn->c_net);
 
 		if (net != c_net || !tc->t_sock)
 			continue;
@@ -563,7 +563,7 @@  static void rds_tcp_sysctl_reset(struct net *net)
 
 	spin_lock_irq(&rds_tcp_conn_lock);
 	list_for_each_entry_safe(tc, _tc, &rds_tcp_conn_list, t_tcp_node) {
-		struct net *c_net = tc->t_cpath->cp_conn->c_net;
+		struct net *c_net = read_pnet(&tc->t_cpath->cp_conn->c_net);
 
 		if (net != c_net || !tc->t_sock)
 			continue;