
[v1,bpf-next,05/11] tcp: Migrate TCP_NEW_SYN_RECV requests.

Message ID 20201201144418.35045-6-kuniyu@amazon.co.jp
State Superseded
Series Socket migration for SO_REUSEPORT.

Commit Message

Kuniyuki Iwashima Dec. 1, 2020, 2:44 p.m. UTC
This patch renames reuseport_select_sock() to __reuseport_select_sock() and
adds two wrapper functions that pass the migration type defined in the
previous commit.

  reuseport_select_sock          : BPF_SK_REUSEPORT_MIGRATE_NO
  reuseport_select_migrated_sock : BPF_SK_REUSEPORT_MIGRATE_REQUEST

As mentioned before, we have to select a new listener for TCP_NEW_SYN_RECV
requests when receiving the final ACK or sending a SYN+ACK. Therefore, this
patch also changes the code to call reuseport_select_migrated_sock() even
if the listening socket is in the TCP_CLOSE state. If we can pick a
listening socket out of the reuseport group, we rewrite
request_sock.rsk_listener and resume processing the request.

Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
---
 include/net/inet_connection_sock.h | 12 +++++++++++
 include/net/request_sock.h         | 13 ++++++++++++
 include/net/sock_reuseport.h       |  8 +++----
 net/core/sock_reuseport.c          | 34 ++++++++++++++++++++++++------
 net/ipv4/inet_connection_sock.c    | 13 ++++++++++--
 net/ipv4/tcp_ipv4.c                |  9 ++++++--
 net/ipv6/tcp_ipv6.c                |  9 ++++++--
 7 files changed, 81 insertions(+), 17 deletions(-)

Comments

Eric Dumazet Dec. 1, 2020, 3:13 p.m. UTC | #1
On 12/1/20 3:44 PM, Kuniyuki Iwashima wrote:
[ ... ]
> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> index 2ea2d743f8fc..1e0958f5eb21 100644
> --- a/include/net/inet_connection_sock.h
> +++ b/include/net/inet_connection_sock.h
> @@ -272,6 +272,18 @@ static inline void inet_csk_reqsk_queue_added(struct sock *sk)
>  	reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue);
>  }
>  
> +static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
> +						 struct sock *nsk,
> +						 struct request_sock *req)
> +{
> +	reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
> +			     &inet_csk(nsk)->icsk_accept_queue,
> +			     req);
> +	sock_put(sk);
> +	sock_hold(nsk);

This looks racy to me. nsk refcount might be zero at this point.

If you think it can _not_ be zero, please add a big comment here,
because that would mean something has been done before reaching this function,
and this sock_hold() would not be needed in the first place.

There is a good reason reqsk_alloc() is using refcount_inc_not_zero().

> +	req->rsk_listener = nsk;
> +}
> +

Honestly, this patch series looks quite complex, and finding a bug in the
very first function I am looking at is not really a good sign...
Kuniyuki Iwashima Dec. 3, 2020, 2:12 p.m. UTC | #2
From:   Eric Dumazet <eric.dumazet@gmail.com>
Date:   Tue, 1 Dec 2020 16:13:39 +0100
> On 12/1/20 3:44 PM, Kuniyuki Iwashima wrote:
[ ... ]
> > diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> > index 2ea2d743f8fc..1e0958f5eb21 100644
> > --- a/include/net/inet_connection_sock.h
> > +++ b/include/net/inet_connection_sock.h
> > @@ -272,6 +272,18 @@ static inline void inet_csk_reqsk_queue_added(struct sock *sk)
> >  	reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue);
> >  }
> >  
> > +static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
> > +						 struct sock *nsk,
> > +						 struct request_sock *req)
> > +{
> > +	reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
> > +			     &inet_csk(nsk)->icsk_accept_queue,
> > +			     req);
> > +	sock_put(sk);
> > +	sock_hold(nsk);
> 
> This looks racy to me. nsk refcount might be zero at this point.
> 
> If you think it can _not_ be zero, please add a big comment here,
> because this would mean something has been done before reaching this function,
> and this sock_hold() would be not needed in the first place.
> 
> There is a good reason reqsk_alloc() is using refcount_inc_not_zero().

Exactly, I will fix this in the next spin like below.
Thank you.

---8<---
diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 1e0958f5eb21..d8c3be31e987 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -280,7 +280,6 @@ static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
                             &inet_csk(nsk)->icsk_accept_queue,
                             req);
        sock_put(sk);
-       sock_hold(nsk);
        req->rsk_listener = nsk;
 }
 
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index 6b475897b496..4d07bddcf678 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -386,7 +386,14 @@ EXPORT_SYMBOL(reuseport_select_sock);
 struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash,
                                            struct sk_buff *skb)
 {
-       return __reuseport_select_sock(sk, hash, skb, 0, BPF_SK_REUSEPORT_MIGRATE_REQUEST);
+       struct sock *nsk;
+
+       nsk = __reuseport_select_sock(sk, hash, skb, 0, BPF_SK_REUSEPORT_MIGRATE_REQUEST);
+       if (IS_ERR_OR_NULL(nsk) ||
+           unlikely(!refcount_inc_not_zero(&nsk->sk_refcnt)))
+               return NULL;
+
+       return nsk;
 }
 EXPORT_SYMBOL(reuseport_select_migrated_sock);
 
---8<---


> > +	req->rsk_listener = nsk;
> > +}
> > +
> 
> Honestly, this patch series looks quite complex, and finding a bug in the
> very first function I am looking at is not really a good sign...

I also think this issue is quite complex, but it might be easier to fix
than when it was discussed in 2015 [1], thanks to some of your refactoring.

[1] https://lore.kernel.org/netdev/1443313848-751-1-git-send-email-tolga.ceylan@gmail.com/
Martin KaFai Lau Dec. 10, 2020, 12:07 a.m. UTC | #3
On Tue, Dec 01, 2020 at 11:44:12PM +0900, Kuniyuki Iwashima wrote:
[ ... ]
> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> index 2ea2d743f8fc..1e0958f5eb21 100644
> --- a/include/net/inet_connection_sock.h
> +++ b/include/net/inet_connection_sock.h
> @@ -272,6 +272,18 @@ static inline void inet_csk_reqsk_queue_added(struct sock *sk)
>  	reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue);
>  }
>  
> +static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
> +						 struct sock *nsk,
> +						 struct request_sock *req)
> +{
> +	reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
> +			     &inet_csk(nsk)->icsk_accept_queue,
> +			     req);
> +	sock_put(sk);
I am not sure it is safe to do this here.
IIUC, when the req->rsk_refcnt is held, it also holds a refcnt
to req->rsk_listener such that sock_hold(req->rsk_listener) is
safe because its sk_refcnt is not zero.

> +	sock_hold(nsk);
> +	req->rsk_listener = nsk;
> +}
> +

[ ... ]

> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index 361efe55b1ad..e71653c6eae2 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -743,8 +743,17 @@ static void reqsk_timer_handler(struct timer_list *t)
>  	struct request_sock_queue *queue = &icsk->icsk_accept_queue;
>  	int max_syn_ack_retries, qlen, expire = 0, resend = 0;
>  
> -	if (inet_sk_state_load(sk_listener) != TCP_LISTEN)
> -		goto drop;
> +	if (inet_sk_state_load(sk_listener) != TCP_LISTEN) {
> +		sk_listener = reuseport_select_migrated_sock(sk_listener,
> +							     req_to_sk(req)->sk_hash, NULL);
> +		if (!sk_listener) {
> +			sk_listener = req->rsk_listener;
> +			goto drop;
> +		}
> +		inet_csk_reqsk_queue_migrated(req->rsk_listener, sk_listener, req);
> +		icsk = inet_csk(sk_listener);
> +		queue = &icsk->icsk_accept_queue;
> +	}
>  
>  	max_syn_ack_retries = icsk->icsk_syn_retries ? : net->ipv4.sysctl_tcp_synack_retries;
>  	/* Normally all the openreqs are young and become mature
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index e4b31e70bd30..9a9aa27c6069 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1973,8 +1973,13 @@ int tcp_v4_rcv(struct sk_buff *skb)
>  			goto csum_error;
>  		}
>  		if (unlikely(sk->sk_state != TCP_LISTEN)) {
> -			inet_csk_reqsk_queue_drop_and_put(sk, req);
> -			goto lookup;
> +			nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, skb);
> +			if (!nsk) {
> +				inet_csk_reqsk_queue_drop_and_put(sk, req);
> +				goto lookup;
> +			}
> +			inet_csk_reqsk_queue_migrated(sk, nsk, req);
> +			sk = nsk;
>  		}
>  		/* We own a reference on the listener, increase it again
>  		 * as we might lose it too soon.
> diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> index 992cbf3eb9e3..ff11f3c0cb96 100644
> --- a/net/ipv6/tcp_ipv6.c
> +++ b/net/ipv6/tcp_ipv6.c
> @@ -1635,8 +1635,13 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
>  			goto csum_error;
>  		}
>  		if (unlikely(sk->sk_state != TCP_LISTEN)) {
> -			inet_csk_reqsk_queue_drop_and_put(sk, req);
> -			goto lookup;
> +			nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, skb);
> +			if (!nsk) {
> +				inet_csk_reqsk_queue_drop_and_put(sk, req);
> +				goto lookup;
> +			}
> +			inet_csk_reqsk_queue_migrated(sk, nsk, req);
> +			sk = nsk;
>  		}
>  		sock_hold(sk);
For example, this sock_hold(sk).  sk here is req->rsk_listener.

>  		refcounted = true;
> -- 
> 2.17.2 (Apple Git-113)
>
Kuniyuki Iwashima Dec. 10, 2020, 5:15 a.m. UTC | #4
From:   Martin KaFai Lau <kafai@fb.com>
Date:   Wed, 9 Dec 2020 16:07:07 -0800
> On Tue, Dec 01, 2020 at 11:44:12PM +0900, Kuniyuki Iwashima wrote:
[ ... ]
> > diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> > index 2ea2d743f8fc..1e0958f5eb21 100644
> > --- a/include/net/inet_connection_sock.h
> > +++ b/include/net/inet_connection_sock.h
> > @@ -272,6 +272,18 @@ static inline void inet_csk_reqsk_queue_added(struct sock *sk)
> >  	reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue);
> >  }
> >  
> > +static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
> > +						 struct sock *nsk,
> > +						 struct request_sock *req)
> > +{
> > +	reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
> > +			     &inet_csk(nsk)->icsk_accept_queue,
> > +			     req);
> > +	sock_put(sk);
> not sure if it is safe to do here.
> IIUC, when the req->rsk_refcnt is held, it also holds a refcnt
> to req->rsk_listener such that sock_hold(req->rsk_listener) is
> safe because its sk_refcnt is not zero.

I think it is safe to call sock_put() for the old listener here.

Without this patchset, when receiving the final ACK or retransmitting the
SYN+ACK, if sk_state == TCP_CLOSE, sock_put(req->rsk_listener) is done
by calling reqsk_put() twice in inet_csk_reqsk_queue_drop_and_put(). And
then, we do `goto lookup;` and overwrite the sk.

In the v2 patchset, refcount_inc_not_zero() is done for the new listener in
reuseport_select_migrated_sock(), so we have to call sock_put() for the old
listener instead to free it properly.

---8<---
+struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash,
+					    struct sk_buff *skb)
+{
+	struct sock *nsk;
+
+	nsk = __reuseport_select_sock(sk, hash, skb, 0, BPF_SK_REUSEPORT_MIGRATE_REQUEST);
+	if (nsk && likely(refcount_inc_not_zero(&nsk->sk_refcnt)))
+		return nsk;
+
+	return NULL;
+}
+EXPORT_SYMBOL(reuseport_select_migrated_sock);
---8<---
https://lore.kernel.org/netdev/20201207132456.65472-8-kuniyu@amazon.co.jp/


> > +	sock_hold(nsk);
> > +	req->rsk_listener = nsk;
> > +}
> > +
> 
> [ ... ]
> 
> > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> > index 361efe55b1ad..e71653c6eae2 100644
> > --- a/net/ipv4/inet_connection_sock.c
> > +++ b/net/ipv4/inet_connection_sock.c
> > @@ -743,8 +743,17 @@ static void reqsk_timer_handler(struct timer_list *t)
> >  	struct request_sock_queue *queue = &icsk->icsk_accept_queue;
> >  	int max_syn_ack_retries, qlen, expire = 0, resend = 0;
> >  
> > -	if (inet_sk_state_load(sk_listener) != TCP_LISTEN)
> > -		goto drop;
> > +	if (inet_sk_state_load(sk_listener) != TCP_LISTEN) {
> > +		sk_listener = reuseport_select_migrated_sock(sk_listener,
> > +							     req_to_sk(req)->sk_hash, NULL);
> > +		if (!sk_listener) {
> > +			sk_listener = req->rsk_listener;
> > +			goto drop;
> > +		}
> > +		inet_csk_reqsk_queue_migrated(req->rsk_listener, sk_listener, req);
> > +		icsk = inet_csk(sk_listener);
> > +		queue = &icsk->icsk_accept_queue;
> > +	}
> >  
> >  	max_syn_ack_retries = icsk->icsk_syn_retries ? : net->ipv4.sysctl_tcp_synack_retries;
> >  	/* Normally all the openreqs are young and become mature
> > diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> > index e4b31e70bd30..9a9aa27c6069 100644
> > --- a/net/ipv4/tcp_ipv4.c
> > +++ b/net/ipv4/tcp_ipv4.c
> > @@ -1973,8 +1973,13 @@ int tcp_v4_rcv(struct sk_buff *skb)
> >  			goto csum_error;
> >  		}
> >  		if (unlikely(sk->sk_state != TCP_LISTEN)) {
> > -			inet_csk_reqsk_queue_drop_and_put(sk, req);
> > -			goto lookup;
> > +			nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, skb);
> > +			if (!nsk) {
> > +				inet_csk_reqsk_queue_drop_and_put(sk, req);
> > +				goto lookup;
> > +			}
> > +			inet_csk_reqsk_queue_migrated(sk, nsk, req);
> > +			sk = nsk;
> >  		}
> >  		/* We own a reference on the listener, increase it again
> >  		 * as we might lose it too soon.
> > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
> > index 992cbf3eb9e3..ff11f3c0cb96 100644
> > --- a/net/ipv6/tcp_ipv6.c
> > +++ b/net/ipv6/tcp_ipv6.c
> > @@ -1635,8 +1635,13 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
> >  			goto csum_error;
> >  		}
> >  		if (unlikely(sk->sk_state != TCP_LISTEN)) {
> > -			inet_csk_reqsk_queue_drop_and_put(sk, req);
> > -			goto lookup;
> > +			nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, skb);
> > +			if (!nsk) {
> > +				inet_csk_reqsk_queue_drop_and_put(sk, req);
> > +				goto lookup;
> > +			}
> > +			inet_csk_reqsk_queue_migrated(sk, nsk, req);
> > +			sk = nsk;
> >  		}
> >  		sock_hold(sk);
> For example, this sock_hold(sk).  sk here is req->rsk_listener.

After migration, this is for the new listener, and it is safe because
refcount_inc_not_zero() for the new listener is called in
reuseport_select_migrated_sock().


> >  		refcounted = true;
> > -- 
> > 2.17.2 (Apple Git-113)
Martin KaFai Lau Dec. 10, 2020, 6:49 p.m. UTC | #5
On Thu, Dec 10, 2020 at 02:15:38PM +0900, Kuniyuki Iwashima wrote:
> From:   Martin KaFai Lau <kafai@fb.com>
> Date:   Wed, 9 Dec 2020 16:07:07 -0800
> > On Tue, Dec 01, 2020 at 11:44:12PM +0900, Kuniyuki Iwashima wrote:
[ ... ]
> > > diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> > > index 2ea2d743f8fc..1e0958f5eb21 100644
> > > --- a/include/net/inet_connection_sock.h
> > > +++ b/include/net/inet_connection_sock.h
> > > @@ -272,6 +272,18 @@ static inline void inet_csk_reqsk_queue_added(struct sock *sk)
> > >  	reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue);
> > >  }
> > >  
> > > +static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
> > > +						 struct sock *nsk,
> > > +						 struct request_sock *req)
> > > +{
> > > +	reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
> > > +			     &inet_csk(nsk)->icsk_accept_queue,
> > > +			     req);
> > > +	sock_put(sk);
> > not sure if it is safe to do here.
> > IIUC, when the req->rsk_refcnt is held, it also holds a refcnt
> > to req->rsk_listener such that sock_hold(req->rsk_listener) is
> > safe because its sk_refcnt is not zero.
> 
> I think it is safe to call sock_put() for the old listener here.
> 
> Without this patchset, at receiving the final ACK or retransmitting
> SYN+ACK, if sk_state == TCP_CLOSE, sock_put(req->rsk_listener) is done
> by calling reqsk_put() twice in inet_csk_reqsk_queue_drop_and_put().
Note that in your example (final ACK), sock_put(req->rsk_listener) is
_only_ called when reqsk_put() can get refcount_dec_and_test(&req->rsk_refcnt)
to reach zero.

Here in this patch, it calls sock_put(req->rsk_listener) without
req->rsk_refcnt reaching zero.

Let's say there are two cores, each holding a refcnt to the req (one for
each core) from looking the req up in ehash.  One of the cores does this
migration and sock_put(req->rsk_listener).  The other core does
sock_hold(req->rsk_listener).

	Core1					Core2
						sock_put(req->rsk_listener)

	sock_hold(req->rsk_listener)

> And then, we do `goto lookup;` and overwrite the sk.
> 
> In the v2 patchset, refcount_inc_not_zero() is done for the new listener in
> reuseport_select_migrated_sock(), so we have to call sock_put() for the old
> listener instead to free it properly.
> 
> ---8<---
> +struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash,
> +					    struct sk_buff *skb)
> +{
> +	struct sock *nsk;
> +
> +	nsk = __reuseport_select_sock(sk, hash, skb, 0, BPF_SK_REUSEPORT_MIGRATE_REQUEST);
> +	if (nsk && likely(refcount_inc_not_zero(&nsk->sk_refcnt)))
There is another potential issue here.  The TCP_LISTEN nsk is protected
by rcu.  refcount_inc_not_zero(&nsk->sk_refcnt) cannot be done if it
is not under rcu_read_lock().

The receive path may be OK as it already runs under rcu_read_lock().  You
may need to check the other call sites.

> +		return nsk;
> +
> +	return NULL;
> +}
> +EXPORT_SYMBOL(reuseport_select_migrated_sock);
> ---8<---
> https://lore.kernel.org/netdev/20201207132456.65472-8-kuniyu@amazon.co.jp/
> 
> 
> > > +	sock_hold(nsk);
> > > +	req->rsk_listener = nsk;
It looks like there is another race here.  What
if multiple cores try to update req->rsk_listener?

Patch

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 2ea2d743f8fc..1e0958f5eb21 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -272,6 +272,18 @@  static inline void inet_csk_reqsk_queue_added(struct sock *sk)
 	reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue);
 }
 
+static inline void inet_csk_reqsk_queue_migrated(struct sock *sk,
+						 struct sock *nsk,
+						 struct request_sock *req)
+{
+	reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue,
+			     &inet_csk(nsk)->icsk_accept_queue,
+			     req);
+	sock_put(sk);
+	sock_hold(nsk);
+	req->rsk_listener = nsk;
+}
+
 static inline int inet_csk_reqsk_queue_len(const struct sock *sk)
 {
 	return reqsk_queue_len(&inet_csk(sk)->icsk_accept_queue);
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 29e41ff3ec93..d18ba0b857cc 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -226,6 +226,19 @@  static inline void reqsk_queue_added(struct request_sock_queue *queue)
 	atomic_inc(&queue->qlen);
 }
 
+static inline void reqsk_queue_migrated(struct request_sock_queue *old_accept_queue,
+					struct request_sock_queue *new_accept_queue,
+					const struct request_sock *req)
+{
+	atomic_dec(&old_accept_queue->qlen);
+	atomic_inc(&new_accept_queue->qlen);
+
+	if (req->num_timeout == 0) {
+		atomic_dec(&old_accept_queue->young);
+		atomic_inc(&new_accept_queue->young);
+	}
+}
+
 static inline int reqsk_queue_len(const struct request_sock_queue *queue)
 {
 	return atomic_read(&queue->qlen);
diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
index 09a1b1539d4c..a48259a974be 100644
--- a/include/net/sock_reuseport.h
+++ b/include/net/sock_reuseport.h
@@ -32,10 +32,10 @@  extern int reuseport_alloc(struct sock *sk, bool bind_inany);
 extern int reuseport_add_sock(struct sock *sk, struct sock *sk2,
 			      bool bind_inany);
 extern struct sock *reuseport_detach_sock(struct sock *sk);
-extern struct sock *reuseport_select_sock(struct sock *sk,
-					  u32 hash,
-					  struct sk_buff *skb,
-					  int hdr_len);
+extern struct sock *reuseport_select_sock(struct sock *sk, u32 hash,
+					  struct sk_buff *skb, int hdr_len);
+extern struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash,
+						   struct sk_buff *skb);
 extern int reuseport_attach_prog(struct sock *sk, struct bpf_prog *prog);
 extern int reuseport_detach_prog(struct sock *sk);
 
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index 60d7c1f28809..b4fe0829c9ab 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -202,7 +202,7 @@  int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
 	}
 
 	reuse->socks[reuse->num_socks] = sk;
-	/* paired with smp_rmb() in reuseport_select_sock() */
+	/* paired with smp_rmb() in __reuseport_select_sock() */
 	smp_wmb();
 	reuse->num_socks++;
 	rcu_assign_pointer(sk->sk_reuseport_cb, reuse);
@@ -313,12 +313,13 @@  static struct sock *run_bpf_filter(struct sock_reuseport *reuse, u16 socks,
  *  @hdr_len: BPF filter expects skb data pointer at payload data.  If
  *    the skb does not yet point at the payload, this parameter represents
  *    how far the pointer needs to advance to reach the payload.
+ *  @migration: represents if it is selecting a listener for SYN or
+ *    migrating ESTABLISHED/SYN_RECV sockets or NEW_SYN_RECV socket.
  *  Returns a socket that should receive the packet (or NULL on error).
  */
-struct sock *reuseport_select_sock(struct sock *sk,
-				   u32 hash,
-				   struct sk_buff *skb,
-				   int hdr_len)
+struct sock *__reuseport_select_sock(struct sock *sk, u32 hash,
+				     struct sk_buff *skb, int hdr_len,
+				     u8 migration)
 {
 	struct sock_reuseport *reuse;
 	struct bpf_prog *prog;
@@ -332,13 +333,19 @@  struct sock *reuseport_select_sock(struct sock *sk,
 	if (!reuse)
 		goto out;
 
-	prog = rcu_dereference(reuse->prog);
 	socks = READ_ONCE(reuse->num_socks);
 	if (likely(socks)) {
 		/* paired with smp_wmb() in reuseport_add_sock() */
 		smp_rmb();
 
-		if (!prog || !skb)
+		prog = rcu_dereference(reuse->prog);
+		if (!prog)
+			goto select_by_hash;
+
+		if (migration)
+			goto out;
+
+		if (!skb)
 			goto select_by_hash;
 
 		if (prog->type == BPF_PROG_TYPE_SK_REUSEPORT)
@@ -367,8 +374,21 @@  struct sock *reuseport_select_sock(struct sock *sk,
 	rcu_read_unlock();
 	return sk2;
 }
+
+struct sock *reuseport_select_sock(struct sock *sk, u32 hash,
+				   struct sk_buff *skb, int hdr_len)
+{
+	return __reuseport_select_sock(sk, hash, skb, hdr_len, BPF_SK_REUSEPORT_MIGRATE_NO);
+}
 EXPORT_SYMBOL(reuseport_select_sock);
 
+struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash,
+					    struct sk_buff *skb)
+{
+	return __reuseport_select_sock(sk, hash, skb, 0, BPF_SK_REUSEPORT_MIGRATE_REQUEST);
+}
+EXPORT_SYMBOL(reuseport_select_migrated_sock);
+
 int reuseport_attach_prog(struct sock *sk, struct bpf_prog *prog)
 {
 	struct sock_reuseport *reuse;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 361efe55b1ad..e71653c6eae2 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -743,8 +743,17 @@  static void reqsk_timer_handler(struct timer_list *t)
 	struct request_sock_queue *queue = &icsk->icsk_accept_queue;
 	int max_syn_ack_retries, qlen, expire = 0, resend = 0;
 
-	if (inet_sk_state_load(sk_listener) != TCP_LISTEN)
-		goto drop;
+	if (inet_sk_state_load(sk_listener) != TCP_LISTEN) {
+		sk_listener = reuseport_select_migrated_sock(sk_listener,
+							     req_to_sk(req)->sk_hash, NULL);
+		if (!sk_listener) {
+			sk_listener = req->rsk_listener;
+			goto drop;
+		}
+		inet_csk_reqsk_queue_migrated(req->rsk_listener, sk_listener, req);
+		icsk = inet_csk(sk_listener);
+		queue = &icsk->icsk_accept_queue;
+	}
 
 	max_syn_ack_retries = icsk->icsk_syn_retries ? : net->ipv4.sysctl_tcp_synack_retries;
 	/* Normally all the openreqs are young and become mature
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index e4b31e70bd30..9a9aa27c6069 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1973,8 +1973,13 @@  int tcp_v4_rcv(struct sk_buff *skb)
 			goto csum_error;
 		}
 		if (unlikely(sk->sk_state != TCP_LISTEN)) {
-			inet_csk_reqsk_queue_drop_and_put(sk, req);
-			goto lookup;
+			nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, skb);
+			if (!nsk) {
+				inet_csk_reqsk_queue_drop_and_put(sk, req);
+				goto lookup;
+			}
+			inet_csk_reqsk_queue_migrated(sk, nsk, req);
+			sk = nsk;
 		}
 		/* We own a reference on the listener, increase it again
 		 * as we might lose it too soon.
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 992cbf3eb9e3..ff11f3c0cb96 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1635,8 +1635,13 @@  INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb)
 			goto csum_error;
 		}
 		if (unlikely(sk->sk_state != TCP_LISTEN)) {
-			inet_csk_reqsk_queue_drop_and_put(sk, req);
-			goto lookup;
+			nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, skb);
+			if (!nsk) {
+				inet_csk_reqsk_queue_drop_and_put(sk, req);
+				goto lookup;
+			}
+			inet_csk_reqsk_queue_migrated(sk, nsk, req);
+			sk = nsk;
 		}
 		sock_hold(sk);
 		refcounted = true;