diff mbox series

[net-next,v3,07/11] tcp: Prevent coalesce/collapse when skb has MPTCP extensions

Message ID 20191217203807.12579-8-mathew.j.martineau@linux.intel.com
State Changes Requested
Delegated to: David Miller
Headers show
Series Multipath TCP: Prerequisites | expand

Commit Message

Mat Martineau Dec. 17, 2019, 8:38 p.m. UTC
The MPTCP extension data needs to be preserved as it passes through the
TCP stack. Make sure that these skbs are not appended to others during
coalesce or collapse, so the data remains associated with the payload of
the given skb.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 include/net/mptcp.h   | 16 ++++++++++++++++
 include/net/tcp.h     |  8 ++++++++
 net/ipv4/tcp_input.c  | 10 ++++++++--
 net/ipv4/tcp_output.c |  2 +-
 4 files changed, 33 insertions(+), 3 deletions(-)

Comments

Eric Dumazet Dec. 18, 2019, 7:50 p.m. UTC | #1
On 12/17/19 12:38 PM, Mat Martineau wrote:
> The MPTCP extension data needs to be preserved as it passes through the
> TCP stack. Make sure that these skbs are not appended to others during
> coalesce or collapse, so the data remains associated with the payload of
> the given skb.
>


This seems a very pessimistic change to me.

Are you planing later to refine this limitation ?

Surely if a sender sends TSO packet, we allow all the segments
being aggregated at receive side either by GRO or TCP coalescing.
David Miller Dec. 18, 2019, 8:45 p.m. UTC | #2
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 18 Dec 2019 11:50:24 -0800

> On 12/17/19 12:38 PM, Mat Martineau wrote:
>> The MPTCP extension data needs to be preserved as it passes through the
>> TCP stack. Make sure that these skbs are not appended to others during
>> coalesce or collapse, so the data remains associated with the payload of
>> the given skb.
> 
> This seems a very pessimistic change to me.
> 
> Are you planing later to refine this limitation ?
> 
> Surely if a sender sends TSO packet, we allow all the segments
> being aggregated at receive side either by GRO or TCP coalescing.

This turns off absolutely crucial functional elements of our TCP
stack, and will avoid all of the machinery that avoids wastage in TCP
packets sitting in the various queues.  skb->truesize management, etc.

I will not apply these patches with such a non-trivial regression in
place for MPTCP streams, sorry.
Mat Martineau Dec. 18, 2019, 9:02 p.m. UTC | #3
On Wed, 18 Dec 2019, David Miller wrote:

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 18 Dec 2019 11:50:24 -0800
>
>> On 12/17/19 12:38 PM, Mat Martineau wrote:
>>> The MPTCP extension data needs to be preserved as it passes through the
>>> TCP stack. Make sure that these skbs are not appended to others during
>>> coalesce or collapse, so the data remains associated with the payload of
>>> the given skb.
>>
>> This seems a very pessimistic change to me.
>>
>> Are you planing later to refine this limitation ?
>>
>> Surely if a sender sends TSO packet, we allow all the segments
>> being aggregated at receive side either by GRO or TCP coalescing.
>
> This turns off absolutely crucial functional elements of our TCP
> stack, and will avoid all of the machinery that avoids wastage in TCP
> packets sitting in the various queues.  skb->truesize management, etc.
>
> I will not apply these patches with such a non-trivial regression in
> place for MPTCP streams, sorry.
>

Ok, understood. Not every packet has this MPTCP extension data so 
coalescing was not always turned off, but given the importance of avoiding 
this memory waste I'll confirm GRO behavior and work on maintaining 
coalesce/collapse with identical MPTCP extension data.

--
Mat Martineau
Intel
Subash Abhinov Kasiviswanathan Dec. 18, 2019, 9:36 p.m. UTC | #4
> Ok, understood. Not every packet has this MPTCP extension data so
> coalescing was not always turned off, but given the importance of
> avoiding
> 
> this memory waste I'll confirm GRO behavior and work on maintaining
> coalesce/collapse with identical MPTCP extension data.
> 

Hi Mat

Are identical MPTCP extensions a common case?
AFAIK the data sequence number and the subflow sequence number change
per packet even in a single stream scenario.
Paolo Abeni Dec. 18, 2019, 9:52 p.m. UTC | #5
On Wed, 2019-12-18 at 14:36 -0700, subashab@codeaurora.org wrote:
> > Ok, understood. Not every packet has this MPTCP extension data so
> > coalescing was not always turned off, but given the importance of
> > avoiding
> > 
> > this memory waste I'll confirm GRO behavior and work on maintaining
> > coalesce/collapse with identical MPTCP extension data.
> > 
> 
> Hi Mat
> 
> Are identical MPTCP extensions a common case?

Yes, they are.

> AFAIK the data sequence number and the subflow sequence number change
> per packet even in a single stream scenario.

What actually change on per packet basis is the TCP sequence number.
The DSS mapping can spawn on multiple packets, and will have constand
(base) sequence number and length.

Cheers,

Paolo
Paolo Abeni Dec. 18, 2019, 10:15 p.m. UTC | #6
On Wed, 2019-12-18 at 12:45 -0800, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 18 Dec 2019 11:50:24 -0800
> 
> > On 12/17/19 12:38 PM, Mat Martineau wrote:
> >> The MPTCP extension data needs to be preserved as it passes through the
> >> TCP stack. Make sure that these skbs are not appended to others during
> >> coalesce or collapse, so the data remains associated with the payload of
> >> the given skb.
> > 
> > This seems a very pessimistic change to me.
> > 
> > Are you planing later to refine this limitation ?
> > 
> > Surely if a sender sends TSO packet, we allow all the segments
> > being aggregated at receive side either by GRO or TCP coalescing.
> 
> This turns off absolutely crucial functional elements of our TCP
> stack, and will avoid all of the machinery that avoids wastage in TCP
> packets sitting in the various queues.  skb->truesize management, etc.

Thank you for the feedback!

Just to clarify, with the code we have currently posted TSO trains of
MPTCP packets can be aggregated by the GRO engine almost exactly as
currently happens for plain TCP packets.

We still have chances to aggregate packets belonging to a MPTCP stream,
as not all of them carry a DSS option.

We opted to not coalesce at the TCP level for the moment to avoid
adding additional hook code inside the coalescing code.

If you are ok without such hooks in the initial version, we can handle
MPTCP coalescing, too. The real work will likely land in part 2.

Would that fit you?

Thanks,

Paolo
David Miller Dec. 18, 2019, 11:17 p.m. UTC | #7
From: Paolo Abeni <pabeni@redhat.com>
Date: Wed, 18 Dec 2019 23:15:46 +0100

> Just to clarify, with the code we have currently posted TSO trains of
> MPTCP packets can be aggregated by the GRO engine almost exactly as
> currently happens for plain TCP packets.
> 
> We still have chances to aggregate packets belonging to a MPTCP stream,
> as not all of them carry a DSS option.
> 
> We opted to not coalesce at the TCP level for the moment to avoid
> adding additional hook code inside the coalescing code.
> 
> If you are ok without such hooks in the initial version, we can handle
> MPTCP coalescing, too. The real work will likely land in part 2.
> 
> Would that fit you?

Can't you someone encode the MPTCP metadata so that it will only apply
to a range or something like that?  Would that help?
diff mbox series

Patch

diff --git a/include/net/mptcp.h b/include/net/mptcp.h
index f9f668ac4339..43ddfdf9e4a3 100644
--- a/include/net/mptcp.h
+++ b/include/net/mptcp.h
@@ -8,6 +8,7 @@ 
 #ifndef __NET_MPTCP_H
 #define __NET_MPTCP_H
 
+#include <linux/skbuff.h>
 #include <linux/types.h>
 
 /* MPTCP sk_buff extension data */
@@ -24,4 +25,19 @@  struct mptcp_ext {
 			__unused:2;
 };
 
+#ifdef CONFIG_MPTCP
+
+static inline bool mptcp_skb_ext_exist(const struct sk_buff *skb)
+{
+	return skb_ext_exist(skb, SKB_EXT_MPTCP);
+}
+
+#else
+
+static inline bool mptcp_skb_ext_exist(const struct sk_buff *skb)
+{
+	return false;
+}
+
+#endif /* CONFIG_MPTCP */
 #endif /* __NET_MPTCP_H */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index c82b2f75d024..c483c73b8d41 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -39,6 +39,7 @@ 
 #include <net/tcp_states.h>
 #include <net/inet_ecn.h>
 #include <net/dst.h>
+#include <net/mptcp.h>
 
 #include <linux/seq_file.h>
 #include <linux/memcontrol.h>
@@ -978,6 +979,13 @@  static inline bool tcp_skb_can_collapse_to(const struct sk_buff *skb)
 	return likely(!TCP_SKB_CB(skb)->eor);
 }
 
+static inline bool tcp_skb_can_collapse(const struct sk_buff *to,
+					const struct sk_buff *from)
+{
+	return likely(tcp_skb_can_collapse_to(to) &&
+		      !mptcp_skb_ext_exist(from));
+}
+
 /* Events passed to congestion control interface */
 enum tcp_ca_event {
 	CA_EVENT_TX_START,	/* first transmit when no packets in flight */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 88b987ca9ebb..55b460a2ece2 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1422,7 +1422,7 @@  static struct sk_buff *tcp_shift_skb_data(struct sock *sk, struct sk_buff *skb,
 	if ((TCP_SKB_CB(prev)->sacked & TCPCB_TAGBITS) != TCPCB_SACKED_ACKED)
 		goto fallback;
 
-	if (!tcp_skb_can_collapse_to(prev))
+	if (!tcp_skb_can_collapse(prev, skb))
 		goto fallback;
 
 	in_sack = !after(start_seq, TCP_SKB_CB(skb)->seq) &&
@@ -4420,6 +4420,9 @@  static bool tcp_try_coalesce(struct sock *sk,
 	if (TCP_SKB_CB(from)->seq != TCP_SKB_CB(to)->end_seq)
 		return false;
 
+	if (mptcp_skb_ext_exist(from))
+		return false;
+
 #ifdef CONFIG_TLS_DEVICE
 	if (from->decrypted != to->decrypted)
 		return false;
@@ -4928,10 +4931,12 @@  tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 
 		/* The first skb to collapse is:
 		 * - not SYN/FIN and
+		 * - does not include a MPTCP skb extension
 		 * - bloated or contains data before "start" or
 		 *   overlaps to the next one.
 		 */
 		if (!(TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) &&
+		    !mptcp_skb_ext_exist(skb) &&
 		    (tcp_win_from_space(sk, skb->truesize) > skb->len ||
 		     before(TCP_SKB_CB(skb)->seq, start))) {
 			end_of_skbs = false;
@@ -4947,7 +4952,7 @@  tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 		/* Decided to skip this, advance start seq. */
 		start = TCP_SKB_CB(skb)->end_seq;
 	}
-	if (end_of_skbs ||
+	if (end_of_skbs || mptcp_skb_ext_exist(skb) ||
 	    (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)))
 		return;
 
@@ -4990,6 +4995,7 @@  tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
 				skb = tcp_collapse_one(sk, skb, list, root);
 				if (!skb ||
 				    skb == tail ||
+				    mptcp_skb_ext_exist(skb) ||
 				    (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)))
 					goto end;
 #ifdef CONFIG_TLS_DEVICE
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index b184f03d7437..9e04d45bc0e4 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2854,7 +2854,7 @@  static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
 		if (!tcp_can_collapse(sk, skb))
 			break;
 
-		if (!tcp_skb_can_collapse_to(to))
+		if (!tcp_skb_can_collapse(to, skb))
 			break;
 
 		space -= skb->len;