
[RFC,mptcp-next] mptcp: add ooo prune support

Message ID 20201002154535.28412-1-fw@strlen.de
State RFC
Delegated to: Paolo Abeni
Series [RFC,mptcp-next] mptcp: add ooo prune support

Commit Message

Florian Westphal Oct. 2, 2020, 3:45 p.m. UTC
It might be possible that the entire receive buffer is occupied by
skbs in the OOO queue.

In this case we can't pull more skbs from subflows and the holes
will never be filled.

If this happens, schedule the work queue and prune ~12% of skbs to
make space available. Also add a MIB counter for this.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 Paolo, this does relate a bit to our discussion wrt. oow
 tracking.  I thought we might need to add some sort of cushion to
 account for window discrepancies, but that might then get us
 in a state where wmem might be full...

 What do you think?

 I did NOT see such a problem in practice, this is a theoretical "fix".
 TCP has similar code to deal with corner cases of small-oow packets.

 net/mptcp/mib.c      |  1 +
 net/mptcp/mib.h      |  1 +
 net/mptcp/protocol.c | 48 ++++++++++++++++++++++++++++++++++++++++++--
 net/mptcp/protocol.h |  1 +
 4 files changed, 49 insertions(+), 2 deletions(-)

Comments

Mat Martineau Oct. 9, 2020, 12:29 a.m. UTC | #1
On Fri, 2 Oct 2020, Florian Westphal wrote:

> It might be possible that the entire receive buffer is occupied by
> skbs in the OOO queue.
>
> In this case we can't pull more skbs from subflows and the holes
> will never be filled.
>
> If this happens, schedule the work queue and prune ~12% of skbs to
> make space available. Also add a MIB counter for this.
>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
> Paolo, this does relate a bit to our discussion wrt. oow
> tracking.  I thought we might need to add some sort of cushion to
> account for window discrepancies, but that might then get us
> in a state where wmem might be full...
>
> What do you think?
>
> I did NOT see such a problem in practice, this is a theoretical "fix".
> TCP has similar code to deal with corner cases of small-oow packets.
>

Is there a benefit to relying on the workqueue to discard skbs from the 
ooo queue rather than handling that as data is moved from the subflows?

The cause of "holes" in MPTCP reassembly is a little different from TCP.
Since TCP takes care of the packet loss problem, it seems to me the main
issue is recovering from path loss (assuming the window and rcv buffer
sizes are sane). We need to keep pulling from the subflows to potentially
fill the holes, and the most valuable data for making progress is earlier
in the sequence space. We would potentially throw less away by discarding
later-sequence ooo queued skbs as subflow data arrives, rather than
dropping the whole 12.5% chunk, and would avoid the latency of waiting
for the workqueue to get scheduled.
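
For illustration only (this is not part of the series, and the function
name mptcp_ofo_make_room() and its 'needed' argument are hypothetical), a
minimal sketch of that inline alternative, reusing the helpers the patch
below already relies on (mptcp_drop(), skb_rb_prev(), msk->ooo_last_skb),
could look roughly like this:

static void mptcp_ofo_make_room(struct mptcp_sock *msk, int needed)
{
	struct sock *sk = (struct sock *)msk;
	struct sk_buff *skb = msk->ooo_last_skb;

	/* Drop later-sequence OOO skbs only until 'needed' bytes fit,
	 * instead of deferring a fixed ~12.5% prune to the worker.
	 */
	while (skb &&
	       atomic_read(&sk->sk_rmem_alloc) + needed > READ_ONCE(sk->sk_rcvbuf)) {
		struct sk_buff *prev = skb_rb_prev(skb);

		rb_erase(&skb->rbnode, &msk->out_of_order_queue);
		mptcp_drop(sk, skb);
		msk->ooo_last_skb = prev;
		skb = prev;
	}
}

A caller moving data from a subflow could pass the truesize of the skb it
is about to append, so only as much later-sequence data is discarded as is
needed to make room for earlier-sequence data.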

(I should look at the multipath-tcp.org kernel to understand how it's done 
there but I haven't had a chance to do that yet)

--
Mat Martineau
Intel
Florian Westphal Oct. 9, 2020, 11:11 a.m. UTC | #2
Mat Martineau <mathew.j.martineau@linux.intel.com> wrote:
> On Fri, 2 Oct 2020, Florian Westphal wrote:
> 
> > It might be possible that the entire receive buffer is occupied by
> > skbs in the OOO queue.
> > 
> > In this case we can't pull more skbs from subflows and the holes
> > will never be filled.
> > 
> > If this happens, schedule the work queue and prune ~12% of skbs to
> > make space available. Also add a MIB counter for this.
> > 
> > Signed-off-by: Florian Westphal <fw@strlen.de>
> > ---
> > Paolo, this does relate a bit to our discussion wrt. oow
> > tracking.  I thought we might need to add some sort of cushion to
> > account for window discrepancies, but that might then get us
> > in a state where wmem might be full...
> > 
> > What do you think?
> > 
> > I did NOT see such a problem in practice, this is a theoretical "fix".
> > TCP has similar code to deal with corner cases of small-oow packets.
> > 
> 
> Is there a benefit to relying on the workqueue to discard skbs from the ooo
> queue rather than handling that as data is moved from the subflows?

It seemed simpler to do it this way.  When the data_ready callback is
running, it's the first spot where we make checks wrt. buffer occupancy.

move_skbs_to_msk() (which is called right after) may not be able to
acquire the msk lock.  I did not consider this performance critical
enough to handle both the 'can do it inline' and 'need to defer to
worker' cases.

> The cause of "holes" in MPTCP reassembly is a little different from TCP.
> Since TCP takes care of the packet loss problem, it seems to me the main
> issue is recovering from path loss (assuming the window and rcv buffer sizes
> are sane). We need to keep pulling from the subflows to potentially fill the
> holes, and the most valuable data for making progress is earlier in the
> sequence space. We would potentially throw less away by discarding
> later-sequence ooo queued skbs as subflow data arrives, rather than dropping
> the whole 12.5% chunk, and would avoid the latency of waiting for the
> workqueue to get scheduled.

Hmm, that's true. OTOH I never observed this happening, so I just
went for a 'compact' patch.  Not sure it makes sense to optimize this.
Mat Martineau Oct. 9, 2020, 4:35 p.m. UTC | #3
On Fri, 9 Oct 2020, Florian Westphal wrote:

> Mat Martineau <mathew.j.martineau@linux.intel.com> wrote:
>> On Fri, 2 Oct 2020, Florian Westphal wrote:
>>
>>> It might be possible that the entire receive buffer is occupied by
>>> skbs in the OOO queue.
>>>
>>> In this case we can't pull more skbs from subflows and the holes
>>> will never be filled.
>>>
>>> If this happens, schedule the work queue and prune ~12% of skbs to
>>> make space available. Also add a MIB counter for this.
>>>
>>> Signed-off-by: Florian Westphal <fw@strlen.de>
>>> ---
>>> Paolo, this does relate a bit to our discussion wrt. oow
>>> tracking.  I thought we might need to add some sort of cushion to
>>> account for window discrepancies, but that might then get us
>>> in a state where wmem might be full...
>>>
>>> What do you think?
>>>
>>> I did NOT see such a problem in practice, this is a theoretical "fix".
>>> TCP has similar code to deal with corner cases of small-oow packets.
>>>
>>
>> Is there a benefit to relying on the workqueue to discard skbs from the ooo
>> queue rather than handling that as data is moved from the subflows?
>
> It seemed simpler to do it this way.  When the data_ready callback is
> running, it's the first spot where we make checks wrt. buffer occupancy.
>
> move_skbs_to_msk() (which is called right after) may not be able to
> acquire the msk lock.  I did not consider this performance critical
> enough to handle both the 'can do it inline' and 'need to defer to
> worker' cases.
>

Ok, makes sense.

>> The cause of "holes" in MPTCP reassembly is a little different from TCP.
>> Since TCP takes care of the packet loss problem, it seems to me the main
>> issue is recovering from path loss (assuming the window and rcv buffer sizes
>> are sane). We need to keep pulling from the subflows to potentially fill the
>> holes, and the most valuable data for making progress is earlier in the
>> sequence space. We would potentially throw less away by discarding
>> later-sequence ooo queued skbs as subflow data arrives, rather than dropping
>> the whole 12.5% chunk, and would avoid the latency of waiting for the
>> workqueue to get scheduled.
>
> Hmm, that's true. OTOH I never observed this happening, so I just
> went for a 'compact' patch.  Not sure it makes sense to optimize this.
>

I do wonder if different packet schedulers might run into this more, but
agree that there isn't evidence to justify optimizing now. If we merge
this (or something similar), we will have the MIB counter to help
understand how often it does happen.

--
Mat Martineau
Intel

Patch

diff --git a/net/mptcp/mib.c b/net/mptcp/mib.c
index 84d119436b22..65c575e3af60 100644
--- a/net/mptcp/mib.c
+++ b/net/mptcp/mib.c
@@ -25,6 +25,7 @@  static const struct snmp_mib mptcp_snmp_list[] = {
 	SNMP_MIB_ITEM("OFOQueueTail", MPTCP_MIB_OFOQUEUETAIL),
 	SNMP_MIB_ITEM("OFOQueue", MPTCP_MIB_OFOQUEUE),
 	SNMP_MIB_ITEM("OFOMerge", MPTCP_MIB_OFOMERGE),
+	SNMP_MIB_ITEM("OFOPrune", MPTCP_MIB_OFOPRUNE),
 	SNMP_MIB_ITEM("NoDSSInWindow", MPTCP_MIB_NODSSWINDOW),
 	SNMP_MIB_ITEM("DuplicateData", MPTCP_MIB_DUPDATA),
 	SNMP_MIB_ITEM("AddAddr", MPTCP_MIB_ADDADDR),
diff --git a/net/mptcp/mib.h b/net/mptcp/mib.h
index 47bcecce1106..75a7fb3a87db 100644
--- a/net/mptcp/mib.h
+++ b/net/mptcp/mib.h
@@ -18,6 +18,7 @@  enum linux_mptcp_mib_field {
 	MPTCP_MIB_OFOQUEUETAIL,	/* Segments inserted into OoO queue tail */
 	MPTCP_MIB_OFOQUEUE,		/* Segments inserted into OoO queue */
 	MPTCP_MIB_OFOMERGE,		/* Segments merged in OoO queue */
+	MPTCP_MIB_OFOPRUNE,		/* Segments pruned from OoO queue */
 	MPTCP_MIB_NODSSWINDOW,		/* Segments not in MPTCP windows */
 	MPTCP_MIB_DUPDATA,		/* Segments discarded due to duplicate DSS */
 	MPTCP_MIB_ADDADDR,		/* Received ADD_ADDR with echo-flag=0 */
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 79cd8e879c10..4cc30a3d426c 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -658,8 +658,17 @@  void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 		sk_rbuf = ssk_rbuf;
 
 	/* over limit? can't append more skbs to msk */
-	if (atomic_read(&sk->sk_rmem_alloc) > sk_rbuf)
-		goto wake;
+	if (atomic_read(&sk->sk_rmem_alloc) > sk_rbuf) {
+		if (likely(!skb_queue_empty(&sk->sk_receive_queue)))
+			goto wake;
+
+		/* Entire recvbuf occupied by OOO skbs? Prune time. */
+		if (!test_and_set_bit(MPTCP_WORK_PRUNE_OFO, &msk->flags) &&
+		     schedule_work(&msk->work))
+			sock_hold(sk);
+
+		return;
+	}
 
 	if (move_skbs_to_msk(msk, ssk))
 		goto wake;
@@ -1797,6 +1806,38 @@  static bool mptcp_check_close_timeout(const struct sock *sk)
 	return true;
 }
 
+static void mptcp_prune_ofo(struct mptcp_sock *msk)
+{
+	struct sock *sk = &msk->sk.icsk_inet.sk;
+	struct sk_buff *skb, *prev = NULL;
+	int goal;
+
+	if (!skb_queue_empty(&sk->sk_receive_queue) ||
+	    atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
+		return;
+
+	if (WARN_ON_ONCE(RB_EMPTY_ROOT(&msk->out_of_order_queue)))
+		return;
+
+	MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFOPRUNE);
+
+	goal = READ_ONCE(sk->sk_rcvbuf) >> 3;
+	skb = msk->ooo_last_skb;
+
+	while (skb) {
+		prev = skb_rb_prev(skb);
+		rb_erase(&skb->rbnode, &msk->out_of_order_queue);
+		goal -= skb->truesize;
+		mptcp_drop(sk, skb);
+
+		if (goal <= 0)
+			break;
+		skb = prev;
+	}
+
+	msk->ooo_last_skb = prev;
+}
+
 static void mptcp_worker(struct work_struct *work)
 {
 	struct mptcp_sock *msk = container_of(work, struct mptcp_sock, work);
@@ -1819,6 +1860,9 @@  static void mptcp_worker(struct work_struct *work)
 	if (mptcp_send_head(sk))
 		mptcp_push_pending(sk, 0);
 
+	if (test_and_clear_bit(MPTCP_WORK_PRUNE_OFO, &msk->flags))
+		mptcp_prune_ofo(msk);
+
 	if (msk->pm.status)
 		pm_work(msk);
 
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index d33e9676a1a3..360441fdaa93 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -91,6 +91,7 @@ 
 #define MPTCP_WORK_EOF		3
 #define MPTCP_FALLBACK_DONE	4
 #define MPTCP_WORKER_RUNNING	5
+#define MPTCP_WORK_PRUNE_OFO	6
 
 static inline bool before64(__u64 seq1, __u64 seq2)
 {