
[v2,mptcp-next,1/8] mptcp: enable busypoll from mptcp receive path

Message ID 20210511133659.29982-2-fw@strlen.de
State Superseded, archived
Delegated to: Paolo Abeni
Series: add cmsg support to receive path

Commit Message

Florian Westphal May 11, 2021, 1:36 p.m. UTC
The setting is only relevant for the msk socket.
While at it, also handle rcvlowat/rcvtimeo this way.
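
For illustration only (not part of this patch), an application would
enable busy polling on the mptcp socket roughly like this; the value
is an arbitrary example:

/* Hypothetical userspace sketch: request ~50us of busy polling per
 * receive on an MPTCP socket.  Error handling omitted.
 */
#include <sys/socket.h>

static int enable_busy_poll(int fd)
{
	int usecs = 50;	/* example value, in microseconds */

	return setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usecs, sizeof(usecs));
}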

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/mptcp/protocol.c | 7 +++++++
 net/mptcp/sockopt.c  | 8 ++++++++
 2 files changed, 15 insertions(+)

Comments

Paolo Abeni May 12, 2021, 8:19 a.m. UTC | #1
On Tue, 2021-05-11 at 15:36 +0200, Florian Westphal wrote:
> The setting is only relevant for the msk socket.
> While at it, also handle rcvlowat/rcvtimeo this way.
> 
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
>  net/mptcp/protocol.c | 7 +++++++
>  net/mptcp/sockopt.c  | 8 ++++++++
>  2 files changed, 15 insertions(+)
> 
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index 652e55a0c6e8..86e599eb4403 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -12,6 +12,7 @@
>  #include <linux/sched/signal.h>
>  #include <linux/atomic.h>
>  #include <net/sock.h>
> +#include <net/busy_poll.h>
>  #include <net/inet_common.h>
>  #include <net/inet_hashtables.h>
>  #include <net/protocol.h>
> @@ -1982,6 +1983,12 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
>  	if (unlikely(flags & MSG_ERRQUEUE))
>  		return inet_recv_error(sk, msg, len, addr_len);
>  
> +	if (sk_can_busy_loop(sk) &&
> +	    skb_queue_empty_lockless(&msk->receive_queue) &&
> +	    skb_queue_empty_lockless(&sk->sk_receive_queue) &&
> +	    inet_sk_state_load(sk) == TCP_ESTABLISHED)
> +		sk_busy_loop(sk, nonblock);

If I understand the code correctly, here we need a valid
sk->sk_napi_id. That field is set by the TCP input path
in tcp_finish_connect()/tcp_v{4,6}_do_rcv()/tcp_child_process() for the
subflow ssk.

I think sk_napi_id is not currently available for the msk?!? 
Copying it from the subflow requires some care, as the msk could end up
busy polling on the 'wrong' napi instance ?!?
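
For reference, without a valid napi id the busy loop is a no-op. A
simplified sketch of the gating in sk_busy_loop() (from
include/net/busy_poll.h, details elided):

/* Simplified sketch: sk_busy_loop() only spins when a real napi id
 * (>= MIN_NAPI_ID) has been recorded on the socket.
 */
static inline void sk_busy_loop(struct sock *sk, int nonblock)
{
#ifdef CONFIG_NET_RX_BUSY_POLL
	unsigned int napi_id = READ_ONCE(sk->sk_napi_id);

	if (napi_id >= MIN_NAPI_ID)
		napi_busy_loop(napi_id, nonblock ? NULL : sk_busy_loop_end, sk,
			       READ_ONCE(sk->sk_prefer_busy_poll),
			       READ_ONCE(sk->sk_busy_poll_budget) ?: BUSY_POLL_BUDGET);
#endif
}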

/P
Florian Westphal May 12, 2021, 9:08 a.m. UTC | #2
Paolo Abeni <pabeni@redhat.com> wrote:
> On Tue, 2021-05-11 at 15:36 +0200, Florian Westphal wrote:
> > The setting is only relevant for the msk socket.
> > While at it, also handle rcvlowat/rcvtimeo this way.
> > 
> > Signed-off-by: Florian Westphal <fw@strlen.de>
> > ---
> >  net/mptcp/protocol.c | 7 +++++++
> >  net/mptcp/sockopt.c  | 8 ++++++++
> >  2 files changed, 15 insertions(+)
> > 
> > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> > index 652e55a0c6e8..86e599eb4403 100644
> > --- a/net/mptcp/protocol.c
> > +++ b/net/mptcp/protocol.c
> > @@ -12,6 +12,7 @@
> >  #include <linux/sched/signal.h>
> >  #include <linux/atomic.h>
> >  #include <net/sock.h>
> > +#include <net/busy_poll.h>
> >  #include <net/inet_common.h>
> >  #include <net/inet_hashtables.h>
> >  #include <net/protocol.h>
> > @@ -1982,6 +1983,12 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
> >  	if (unlikely(flags & MSG_ERRQUEUE))
> >  		return inet_recv_error(sk, msg, len, addr_len);
> >  
> > +	if (sk_can_busy_loop(sk) &&
> > +	    skb_queue_empty_lockless(&msk->receive_queue) &&
> > +	    skb_queue_empty_lockless(&sk->sk_receive_queue) &&
> > +	    inet_sk_state_load(sk) == TCP_ESTABLISHED)
> > +		sk_busy_loop(sk, nonblock);
> 
> If I understand the code correctly, here we need a valid
> sk->sk_napi_id. That field is set by the TCP input path
> in tcp_finish_connect()/tcp_v{4,6}_do_rcv()/tcp_child_process() for the
> subflow ssk.

Can this patch just be dropped from the series?
Florian Westphal May 12, 2021, 10:04 a.m. UTC | #3
Paolo Abeni <pabeni@redhat.com> wrote:
> On Tue, 2021-05-11 at 15:36 +0200, Florian Westphal wrote:
> > +	if (sk_can_busy_loop(sk) &&
> > +	    skb_queue_empty_lockless(&msk->receive_queue) &&
> > +	    skb_queue_empty_lockless(&sk->sk_receive_queue) &&
> > +	    inet_sk_state_load(sk) == TCP_ESTABLISHED)
> > +		sk_busy_loop(sk, nonblock);
> 
> If I understand the code correctly, here we need a valid
> sk->sk_napi_id. That field is set by the TCP input path
> in tcp_finish_connect()/tcp_v{4,6}_do_rcv()/tcp_child_process() for the
> subflow ssk.
> 
> I think sk_napi_id is not currently available for the msk?!? 
> Copying it from the subflow requires some care, as the msk could end up
> busy polling on the 'wrong' napi instance ?!?

What about:

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -303,6 +303,7 @@ static bool __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk,
 		if (tail && mptcp_try_coalesce(sk, tail, skb))
 			return true;
 
+		sk_mark_napi_id_once(sk, skb);
 		skb_set_owner_r(skb, sk);
 		__skb_queue_tail(&sk->sk_receive_queue, skb);
 		return true;

This sets the napi id to the one that provided the last mptcp-in-sequence skb.
Paolo Abeni May 12, 2021, 3:06 p.m. UTC | #4
On Wed, 2021-05-12 at 12:04 +0200, Florian Westphal wrote:
> Paolo Abeni <pabeni@redhat.com> wrote:
> > On Tue, 2021-05-11 at 15:36 +0200, Florian Westphal wrote:
> > > +	if (sk_can_busy_loop(sk) &&
> > > +	    skb_queue_empty_lockless(&msk->receive_queue) &&
> > > +	    skb_queue_empty_lockless(&sk->sk_receive_queue) &&
> > > +	    inet_sk_state_load(sk) == TCP_ESTABLISHED)
> > > +		sk_busy_loop(sk, nonblock);
> > 
> > If I understand the code correctly, here we need a valid
> > sk->sk_napi_id. That field is set by the TCP input path
> > in tcp_finish_connect()/tcp_v{4,6}_do_rcv()/tcp_child_process() for the
> > subflow ssk.
> > 
> > I think sk_napi_id is not currently available for the msk?!? 
> > Copying it from the subflow requires some care, as the msk could end up
> > busy polling on the 'wrong' napi instance ?!?
> 
> What about:
> 
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -303,6 +303,7 @@ static bool __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk,
>  		if (tail && mptcp_try_coalesce(sk, tail, skb))
>  			return true;
>  
> +		sk_mark_napi_id_once(sk, skb);
>  		skb_set_owner_r(skb, sk);
>  		__skb_queue_tail(&sk->sk_receive_queue, skb);
>  		return true;
> 
> This sets the napi id to the one that provided the last mptcp-in-sequence skb.

I *think* the above sets the napi id to the one provided by the first
received skb.

On the flip side, updating it on every packet would probably cause
contention on a 'read mostly' sock cacheline.

What about something as crazy as the following? Not even build tested,
but I hope it can illustrate the idea better than words.

Then we could call mptcp_busy_loop() instead of sk_busy_loop().

We could also try to be smarter, e.g. fetch all the napi_ids in a single
mptcp_for_each_subflow() traversal and store them in a small, fixed-size
array, avoiding duplicate values.

Thinking again about it, this latter strategy looks much better than
what is described below...
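
A hedged sketch of that single-traversal collection (the helper name,
array size and locking details are illustrative only):

/* Hypothetical sketch: gather the distinct subflow napi ids in one
 * mptcp_for_each_subflow() traversal.  Name and array size are made up.
 */
#define MPTCP_NAPI_IDS_MAX	8

static int mptcp_collect_napi_ids(struct sock *sk, unsigned int *ids)
{
	struct mptcp_subflow_context *subflow;
	bool slow = lock_sock_fast(sk);
	int i, count = 0;

	mptcp_for_each_subflow(mptcp_sk(sk), subflow) {
		unsigned int id = READ_ONCE(mptcp_subflow_tcp_sock(subflow)->sk_napi_id);
		bool dup = false;

		for (i = 0; i < count; i++)
			dup |= ids[i] == id;

		if (!dup && count < MPTCP_NAPI_IDS_MAX)
			ids[count++] = id;
	}

	unlock_sock_fast(sk, slow);
	return count;
}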
---
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 2d21a4793d9d..f5d71aae38cc 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1960,6 +1960,43 @@ static bool __mptcp_move_skbs(struct mptcp_sock *msk)
 	return !skb_queue_empty(&msk->receive_queue);
 }
 
+/* return -1 if nr >= the number of subflows */
+static int napi_id_by_subflow_nr(struct sock *sk, int nr)
+{
+	struct mptcp_subflow_context *subflow;
+	bool slow = lock_sock_fast(sk);
+	int id = -1, i = 0;
+
+	mptcp_for_each_subflow(mptcp_sk(sk), subflow) {
+		if (i++ == nr) {
+			id = READ_ONCE(mptcp_subflow_tcp_sock(subflow)->sk_napi_id);
+			break;
+		}
+	}
+
+	unlock_sock_fast(sk, slow);
+	return id;
+}
+
+static void mptcp_busy_loop(struct sock *sk, bool nonblock)
+{
+	unsigned long start_time = busy_loop_current_time();
+
+	for (;;) {
+		int nr, napi_id = napi_id_by_subflow_nr(sk, 0);
+
+		/* NULL loop_end: one ->poll() pass per napi id */
+		for (nr = 0; napi_id >= 0; napi_id = napi_id_by_subflow_nr(sk, ++nr)) {
+			napi_busy_loop(napi_id, NULL, NULL,
+			       READ_ONCE(sk->sk_prefer_busy_poll),
+			       READ_ONCE(sk->sk_busy_poll_budget) ?: BUSY_POLL_BUDGET);
+			/* some subflow delivered data: stop polling */
+			if (!skb_queue_empty_lockless(&sk->sk_receive_queue))
+				return;
+		}
+		if (nonblock || sk_busy_loop_end(sk, start_time))
+			return;
+	}
+}
+
 static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 			 int nonblock, int flags, int *addr_len)
 {
Florian Westphal May 14, 2021, 9:57 a.m. UTC | #5
Paolo Abeni <pabeni@redhat.com> wrote:
> On Wed, 2021-05-12 at 12:04 +0200, Florian Westphal wrote:
> > Paolo Abeni <pabeni@redhat.com> wrote:
> > > On Tue, 2021-05-11 at 15:36 +0200, Florian Westphal wrote:
> > > > +	if (sk_can_busy_loop(sk) &&
> > > > +	    skb_queue_empty_lockless(&msk->receive_queue) &&
> > > > +	    skb_queue_empty_lockless(&sk->sk_receive_queue) &&
> > > > +	    inet_sk_state_load(sk) == TCP_ESTABLISHED)
> > > > +		sk_busy_loop(sk, nonblock);
> > > 
> > > If I understand the code correctly, here we need a valid
> > > sk->sk_napi_id. That field is set by the TCP input path
> > > in tcp_finish_connect()/tcp_v{4,6}_do_rcv()/tcp_child_process() for the
> > > subflow ssk.
> > > 
> > > I think sk_napi_id is not currently available for the msk?!? 
> > > Copying it from the subflow requires some care, as the msk could end up
> > > busy polling on the 'wrong' napi instance ?!?
> > 
> > What about:
> > 
> > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> > --- a/net/mptcp/protocol.c
> > +++ b/net/mptcp/protocol.c
> > @@ -303,6 +303,7 @@ static bool __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk,
> >  		if (tail && mptcp_try_coalesce(sk, tail, skb))
> >  			return true;
> >  
> > +		sk_mark_napi_id_once(sk, skb);
> >  		skb_set_owner_r(skb, sk);
> >  		__skb_queue_tail(&sk->sk_receive_queue, skb);
> >  		return true;
> > 
> > This sets the napi id to the one that provided the last mptcp-in-sequence skb.
> 
> I *think* the above sets the napi id to the one provided by the first
> received skb.

Depends on whether the receive queue is empty or not.

It's the most recent subflow that had activity.

> What about something as crazy as the following? Not even build tested,
> but I hope it can illustrate the idea better than words.

I'm no longer convinced busypolling makes any sense with mptcp sockets,
see below.

> +	for (;;) {
> +		int nr, napi_id = napi_id_by_subflow_nr(sk, 0);
> +
> +		for (nr = 0; napi_id >= 0; napi_id = napi_id_by_subflow_nr(sk, ++nr)) {
> +			napi_busy_loop(napi_id, NULL, NULL,
> +			       READ_ONCE(sk->sk_prefer_busy_poll),
> +			       READ_ONCE(sk->sk_busy_poll_budget) ?: BUSY_POLL_BUDGET);
> +			if (!skb_queue_empty_lockless(&sk->sk_receive_queue))
> +				return;

We would need a new conditional to early-terminate the busypoll loop.

Else we might be busypolling on napi instance 1 while another instance
just made a packet available.

Furthermore, we'd have to divide the budget among the napi instances to
avoid excess usage.
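
Hypothetically, something along these lines (nr_ids, the number of
distinct napi instances, is a name made up here):

	/* Hypothetical split: equal share per napi instance, floor of 1. */
	u16 budget = READ_ONCE(sk->sk_busy_poll_budget) ?: BUSY_POLL_BUDGET;
	u16 per_napi = max_t(u16, budget / nr_ids, 1);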

The way I see it, busypoll should never increase latency, but with the
above, afaics, it would, since we don't have a 1:1 mapping of mptcp
socket to a napi instance that we expect packets on.

The way I see it there are two options:
1. Ignore busypoll requests, or
2. Set the napi instance ID from __mptcp_move_skb().
   As soon as the instance changes (another subflow on a different path),
   disable busypoll for that mptcp socket (see the sketch below).

1) is simpler; 2) would still support busypoll for fallback mode and
for the '1 flow' and 'multiple flows on the same napi instance' cases.

The latter could happen if we assume a server with a single
link that serves clients with multiple uplinks.
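
A rough sketch of option 2 (the sentinel trick is a pure assumption;
ids below MIN_NAPI_ID never trigger busy polling, so a nonzero
sentinel such as 1 effectively disables it for this msk):

/* Hypothetical sketch, called from __mptcp_move_skb(). */
static void mptcp_track_napi_id(struct sock *sk, const struct sk_buff *skb)
{
	unsigned int cur = READ_ONCE(sk->sk_napi_id);

	if (!cur)
		WRITE_ONCE(sk->sk_napi_id, skb->napi_id);	/* first subflow seen */
	else if (cur != skb->napi_id)
		WRITE_ONCE(sk->sk_napi_id, 1);	/* second napi instance: give up */
}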
Paolo Abeni May 14, 2021, 10:35 a.m. UTC | #6
On Fri, 2021-05-14 at 11:57 +0200, Florian Westphal wrote:
> Paolo Abeni <pabeni@redhat.com> wrote:
> > On Wed, 2021-05-12 at 12:04 +0200, Florian Westphal wrote:
> > > Paolo Abeni <pabeni@redhat.com> wrote:
> > > > On Tue, 2021-05-11 at 15:36 +0200, Florian Westphal wrote:
> > > > > +	if (sk_can_busy_loop(sk) &&
> > > > > +	    skb_queue_empty_lockless(&msk->receive_queue) &&
> > > > > +	    skb_queue_empty_lockless(&sk->sk_receive_queue) &&
> > > > > +	    inet_sk_state_load(sk) == TCP_ESTABLISHED)
> > > > > +		sk_busy_loop(sk, nonblock);
> > > > 
> > > > If I understand the code correctly, here we need a valid
> > > > sk->sk_napi_id. That field is set by the TCP input path
> > > > in tcp_finish_connect()/tcp_v{4,6}_do_rcv()/tcp_child_process() for the
> > > > subflow ssk.
> > > > 
> > > > I think sk_napi_id is not currently available for the msk?!? 
> > > > Copying it from the subflow requires some care, as the msk could end up
> > > > busy polling on the 'wrong' napi instance ?!?
> > > 
> > > What about:
> > > 
> > > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> > > --- a/net/mptcp/protocol.c
> > > +++ b/net/mptcp/protocol.c
> > > @@ -303,6 +303,7 @@ static bool __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk,
> > >  		if (tail && mptcp_try_coalesce(sk, tail, skb))
> > >  			return true;
> > >  
> > > +		sk_mark_napi_id_once(sk, skb);
> > >  		skb_set_owner_r(skb, sk);
> > >  		__skb_queue_tail(&sk->sk_receive_queue, skb);
> > >  		return true;
> > > 
> > > This sets the napi id to the one that provided the last mptcp-in-sequence skb.
> > 
> > I *think* the above sets the napi id to the one provided by the first
> > received skb.
> 
> Depends on whether the receive queue is empty or not.
> 
> It's the most recent subflow that had activity.

I mean: sk_mark_napi_id_once() is called on the most recent subflow
that had activity, but AFAICS it only sets sk->sk_napi_id - the msk
napi id, in this case - if the current value is 0 - that is, if that
field has not been initialized yet.
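
For reference, a simplified sketch of the helpers in
include/net/busy_poll.h (CONFIG_NET_RX_BUSY_POLL variants, details
elided):

static inline void sk_mark_napi_id(struct sock *sk, const struct sk_buff *skb)
{
	/* unconditionally remember the napi instance of this skb */
	WRITE_ONCE(sk->sk_napi_id, skb->napi_id);
}

static inline void sk_mark_napi_id_once(struct sock *sk, const struct sk_buff *skb)
{
	/* only the first skb counts: later ids are never recorded */
	if (!READ_ONCE(sk->sk_napi_id))
		sk_mark_napi_id(sk, skb);
}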

> > +	for (;;) {
> > +		int nr, napi_id = napi_id_by_subflow_nr(sk, 0);
> > +
> > +		for (nr = 0; napi_id >= 0; napi_id = napi_id_by_subflow_nr(sk, ++nr)) {
> > +			napi_busy_loop(napi_id, NULL, NULL,
> > +			       READ_ONCE(sk->sk_prefer_busy_poll),
> > +			       READ_ONCE(sk->sk_busy_poll_budget) ?: BUSY_POLL_BUDGET);
> > +			if (!skb_queue_empty_lockless(&sk->sk_receive_queue))
> > +				return;
> 
> We would need a new conditional to early-terminate the busypoll loop.
> 
> Else we might be busypolling on napi instance 1 while another instance
> just made a packet available.
> 
> Furthermore, we'd have to divide the budget among the napi instances to
> avoid excess usage.

With the above code, in the inner loop, we will do a single ->poll()
invocation on each napi_id: passing a NULL 'loop_end' argument
to napi_busy_loop() causes the latter to do a single poll invocation.

That should also produce fair budget usage.
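
Roughly, the relevant exit check inside napi_busy_loop() behaves like
this (simplified sketch of the loop body in net/core/dev.c):

	/* with a NULL loop_end callback the loop ends after one pass */
	if (!loop_end || loop_end(loop_end_arg, start_time))
		break;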

> The way I see it, busypoll should never increase latency, but with the
> above, afaics, it would, since we don't have a 1:1 mapping of mptcp
> socket to a napi instance that we expect packets on.

We will have an N:1 mapping (where N is the number of subflows using the
code above and could be "optimized" to the number of different napi_id
in use by such subflows with the changes discussed in the last part of
my previous email).

But it still should not introduce additional latency. If e.g. we are
polling on napi_id_1 and packets arrive on napi_id_2, since the msk
socket lock is not held, the rx subflow input path will deliver them to
the msk receive queue and the:

	'if (!skb_queue_empty_lockless(&sk->sk_receive_queue))'

check will terminate the loop as soon as polling on napi_id_1
completes.

Well, that could actually introduce some latency, e.g. if napi_id_1 is
under heavy traffic sent from some different peer. Not 110% sure
anyway: BP reduces latency mostly by avoiding tripping on the process
scheduler, and we will get that part right.

Assuming this latter consideration is correct, we could eventually
consider always doing busy polling on a single napi instance (belonging
to an active, non-backup subflow) even in the presence of multiple active
subflows, and that should still reduce latency.
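
A hypothetical sketch of that single-instance choice
(mptcp_subflow_active() is the existing protocol.c helper, the rest is
made up):

static unsigned int mptcp_pick_poll_napi_id(struct mptcp_sock *msk)
{
	struct mptcp_subflow_context *subflow;

	mptcp_for_each_subflow(msk, subflow) {
		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);

		/* prefer the first active, non-backup subflow */
		if (mptcp_subflow_active(subflow) && !subflow->backup)
			return READ_ONCE(ssk->sk_napi_id);
	}
	return 0;
}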

I guess a little experimentation will help ;) 

> The way I see it there are two options:
> 1. Ignore busypoll requests, or
> 2. Set the napi instance ID from __mptcp_move_skb().
>    As soon as the instance changes (another subflow on a different path),
>    disable busypoll for that mptcp socket.
> 
> 1) is simpler; 2) would still support busypoll for fallback mode and
> for the '1 flow' and 'multiple flows on the same napi instance' cases.

Both sound reasonable

> The latter could happen if we assume a server with a single
> link that serves clients with multiple uplinks.

... and such link/NIC has a single napi instance, which is quite
uncommon on server H/W, I think.

I'm open to any option. 

Could you please review the above stuff regarding multiple-napi
BP? Does that make any sense?

Thanks!

Paolo
Paolo Abeni May 14, 2021, 10:50 a.m. UTC | #7
On Fri, 2021-05-14 at 12:35 +0200, Paolo Abeni wrote:
> We will have an N:1 mapping (where N is the number of subflows using the
> code above and could be "optimized" to the number of different napi_id
> in use by such subflows with the changes discussed in the last part of
> my previous email).
> 
> But it still should not introduce additional latency. If e.g. we are
> polling on napi_id_1 and packets arrive on napi_id_2, since the msk
> socket lock is not held, the rx subflow input path will deliver them to
> the msk receive queue and the:
> 
> 	'if (!skb_queue_empty_lockless(&sk->sk_receive_queue))'
> 
> check will terminate the loop as soon as polling on napi_id_1
> completes.
> 
> Well, that could actually introduce some latency, e.g. if napi_id_1 is
> under heavy traffic sent from some different peer. Not 110% sure
> anyway: BP reduces latency mostly by avoiding tripping on the process
> scheduler, and we will get that part right.
> 
> Assuming this latter consideration is correct, we could eventually
> consider always doing busy polling on a single napi instance (belonging
> to an active, non-backup subflow) even in the presence of multiple active
> subflows, and that should still reduce latency.

If I read correctly, this is roughly the approach used by mptcp.org.
> 
> I guess a little experimentation will help ;) 

@Christoph, do you have info/paper/data for BP && mptcp && multiple
napi ids? Or (half-seriously) some university PhD willing to do some
academic research on it? ;)))

/P

Patch

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 652e55a0c6e8..86e599eb4403 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -12,6 +12,7 @@ 
 #include <linux/sched/signal.h>
 #include <linux/atomic.h>
 #include <net/sock.h>
+#include <net/busy_poll.h>
 #include <net/inet_common.h>
 #include <net/inet_hashtables.h>
 #include <net/protocol.h>
@@ -1982,6 +1983,12 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	if (unlikely(flags & MSG_ERRQUEUE))
 		return inet_recv_error(sk, msg, len, addr_len);
 
+	if (sk_can_busy_loop(sk) &&
+	    skb_queue_empty_lockless(&msk->receive_queue) &&
+	    skb_queue_empty_lockless(&sk->sk_receive_queue) &&
+	    inet_sk_state_load(sk) == TCP_ESTABLISHED)
+		sk_busy_loop(sk, nonblock);
+
 	mptcp_lock_sock(sk, __mptcp_splice_receive_queue(sk));
 	if (unlikely(sk->sk_state == TCP_LISTEN)) {
 		copied = -ENOTCONN;
diff --git a/net/mptcp/sockopt.c b/net/mptcp/sockopt.c
index a79798189599..ee55e4145d7e 100644
--- a/net/mptcp/sockopt.c
+++ b/net/mptcp/sockopt.c
@@ -254,6 +254,14 @@ static int mptcp_setsockopt_sol_socket(struct mptcp_sock *msk, int optname,
 		return mptcp_setsockopt_sol_socket_int(msk, optname, optval, optlen);
 	case SO_LINGER:
 		return mptcp_setsockopt_sol_socket_linger(msk, optval, optlen);
+	case SO_RCVLOWAT:
+	case SO_RCVTIMEO_OLD:
+	case SO_RCVTIMEO_NEW:
+	case SO_BUSY_POLL:
+	case SO_PREFER_BUSY_POLL:
+	case SO_BUSY_POLL_BUDGET:
+		/* No need to copy: only relevant for msk */
+		break;
 	case SO_NO_CHECK:
 	case SO_DONTROUTE:
 	case SO_BROADCAST: