[2/3] tcp: Add ESP encapsulation support

Message ID E1eZcnl-0002VF-JD@gondolin.hengli.com.au
State Awaiting Upstream, archived
Delegated to: David Miller
Series [1/3] skbuff: Avoid sleeping in skb_send_sock_locked

Commit Message

Herbert Xu Jan. 11, 2018, 1:21 p.m. UTC
This patch adds the plumbing in TCP for ESP encapsulation support
per RFC8229.

The patch mostly deals with inbound processing, as well as enabling
TCP encapsulation on a socket through setsockopt.  The outbound
processing is dealt with in the ESP code as is done for UDP.
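
For illustration (not part of this patch), a user-space IKE daemon
would switch an already-established connection (normally one to port
4500, per RFC8229) into encapsulation mode with the new TCP_ENCAP
socket option.  The option value is not examined in this version of
the patch, so the "1" below is only a placeholder and the helper
itself is hypothetical:

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>	/* assumes headers carrying the new TCP_ENCAP (35) define */

static int enable_espintcp(int fd)
{
	int one = 1;

	/* Only valid on an established connection; the patch returns
	 * -ENOTCONN otherwise.
	 */
	return setsockopt(fd, IPPROTO_TCP, TCP_ENCAP, &one, sizeof(one));
}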

The inbound processing is split into two halves.  First of all,
the softirq path directly intercepts ESP packets and feeds them
into the IPsec stack.  Most of the time the packet will be freed
right away if it contains complete ESP packets.  However, if
the message is incomplete or it contains non-ESP data, then the
skb will be added to the receive queue.  We also add packets to
the receive queue if it is currently non-empty, in order to
preserve sequence number continuity and minimise the changes
to the TCP code.

On the user-space facing side, packets marked as ESP-only are
skipped and not visible to user-space.  However, some ESP data
may seep through.  For example, if we receive a partial message
then we will always give it to user-space regardless of whether
it turns out to be ESP or not.  So user-space should be prepared
to skip ESP messages (SPI != 0).
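
To make that concrete, here is a minimal sketch (hypothetical helper
names) of the user-space read loop this implies, assuming the RFC8229
framing that the in-kernel parser below also relies on: a 2-byte
big-endian length that covers the length field itself, followed by
either a non-zero SPI (ESP) or the 4-byte zero non-ESP marker (IKE):

#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read exactly n bytes; TCP is free to return short reads. */
static int readn(int fd, void *buf, size_t n)
{
	char *p = buf;

	while (n) {
		ssize_t r = read(fd, p, n);

		if (r <= 0)
			return -1;
		p += r;
		n -= r;
	}
	return 0;
}

/* Return the next IKE message, silently discarding ESP records. */
static ssize_t next_ike_message(int fd, unsigned char *buf, size_t buflen)
{
	for (;;) {
		uint16_t lenbe;
		uint32_t marker;
		size_t len;

		if (readn(fd, &lenbe, sizeof(lenbe)))
			return -1;
		len = ntohs(lenbe);
		if (len < sizeof(lenbe) + sizeof(marker) ||
		    len - sizeof(lenbe) > buflen)
			return -1;
		len -= sizeof(lenbe);
		if (readn(fd, buf, len))
			return -1;

		memcpy(&marker, buf, sizeof(marker));
		if (marker)	/* SPI != 0: ESP that seeped through, skip it */
			continue;

		return len;	/* IKE message, still led by its 4-byte non-ESP marker */
	}
}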

There is a little bit of code dealing with the encapsulation side.
In particular, if encapsulation data comes in while the socket
is owned by user-space, the packets will be stored in tp->encap_out
and processed during release_sock.
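
The queueing above is driven from the ESP output path (patch 3/3 of
this series, referred to as esp_output_tcp_encap2 in the tsq_enum
comment), which is not part of this patch.  Purely as a sketch of how
the pieces added here fit together, the caller presumably does
something along these lines (names and details are guesses, not the
actual ESP code):

static int esp_tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
{
	struct tcp_sock *tp = tcp_sk(sk);
	int err = 0;

	bh_lock_sock(sk);
	if (sock_owned_by_user(sk)) {
		/* User space holds the socket: queue the packet and let
		 * tcp_release_cb() flush it via tcp_process_encap().
		 */
		__skb_queue_tail(&tp->encap_out, skb);
		set_bit(TCP_ESP_DEFERRED, &sk->sk_tsq_flags);
	} else {
		err = tcp_encap_output(sk, skb);
	}
	bh_unlock_sock(sk);

	return err;
}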

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
---

 include/linux/tcp.h      |   15 ++
 include/net/tcp.h        |   27 +++
 include/uapi/linux/tcp.h |    1 
 include/uapi/linux/udp.h |    1 
 net/ipv4/tcp.c           |   68 +++++++++
 net/ipv4/tcp_input.c     |  326 +++++++++++++++++++++++++++++++++++++++++++++--
 net/ipv4/tcp_ipv4.c      |    1 
 net/ipv4/tcp_output.c    |   48 ++++++
 8 files changed, 473 insertions(+), 14 deletions(-)

Comments

Eric Dumazet Jan. 12, 2018, 4:38 p.m. UTC | #1
On Fri, 2018-01-12 at 00:21 +1100, Herbert Xu wrote:
> This patch adds the plumbing in TCP for ESP encapsulation support
> per RFC8229.
> 
> The patch mostly deals with inbound processing, as well as enabling
> TCP encapsulation on a socket through setsockopt.  The outbound
> processing is dealt with in the ESP code as is done for UDP.
> 
> The inbound processing is split into two halves.  First of all,
> the softirq path directly intercepts ESP packets and feeds them
> into the IPsec stack.  Most of the time the packet will be freed
> right away if it contains complete ESP packets.  However, if
> the message is incomplete or it contains non-ESP data, then the
> skb will be added to the receive queue.  We also add packets to
> the receive queue if it is currently non-empty, in order to
> preserve sequence number continuity and minimise the changes
> to the TCP code.
> 
> On the user-space facing side, packets marked as ESP-only are
> skipped and not visible to user-space.  However, some ESP data
> may seep through.  For example, if we receive a partial message
> then we will always give it to user-space regardless of whether
> it turns out to be ESP or not.  So user-space should be prepared
> to skip ESP messages (SPI != 0).
> 
> There is a little bit of code dealing with the encapsulation side.
> In particular, if encapsulation data comes in while the socket
> is owned by user-space, the packets will be stored in tp->encap_out
> and processed during release_sock.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> ---
> 
>  include/linux/tcp.h      |   15 ++
>  include/net/tcp.h        |   27 +++
>  include/uapi/linux/tcp.h |    1 
>  include/uapi/linux/udp.h |    1 
>  net/ipv4/tcp.c           |   68 +++++++++
>  net/ipv4/tcp_input.c     |  326 +++++++++++++++++++++++++++++++++++++++++++++--
>  net/ipv4/tcp_ipv4.c      |    1 
>  net/ipv4/tcp_output.c    |   48 ++++++
>  8 files changed, 473 insertions(+), 14 deletions(-)
> 

Ouch...

Is there any chance this can be done with almost no change in TCP
stack, using a layer model ? ( net/kcm comes to mind )

NFS uses TCP sockets, but does not invade TCP stack either.

I believe Christoph Paasch sent a patch series during the holidays trying
to clean up the MD5 mess (I had no time to review it, sorry)
Steffen Klassert Jan. 16, 2018, 10:28 a.m. UTC | #2
On Fri, Jan 12, 2018 at 08:38:01AM -0800, Eric Dumazet wrote:
> On Fri, 2018-01-12 at 00:21 +1100, Herbert Xu wrote:
> > This patch adds the plumbing in TCP for ESP encapsulation support
> > per RFC8229.
> > 
> > The patch mostly deals with inbound processing, as well as enabling
> > TCP encapsulation on a socket through setsockopt.  The outbound
> > processing is dealt with in the ESP code as is done for UDP.
> > 
> > The inbound processing is split into two halves.  First of all,
> > the softirq path directly intercepts ESP packets and feeds them
> > into the IPsec stack.  Most of the time the packet will be freed
> > right away if it contains complete ESP packets.  However, if
> > the message is incomplete or it contains non-ESP data, then the
> > skb will be added to the receive queue.  We also add packets to
> > the receive queue if it is currently non-empty, in order to
> > preserve sequence number continuity and minimise the changes
> > to the TCP code.
> > 
> > On the user-space facing side, packets marked as ESP-only are
> > skipped and not visible to user-space.  However, some ESP data
> > may seep through.  For example, if we receive a partial message
> > then we will always give it to user-space regardless of whether
> > it turns out to be ESP or not.  So user-space should be prepared
> > to skip ESP messages (SPI != 0).
> > 
> > There is a little bit of code dealing with the encapsulation side.
> > In particular, if encapsulation data comes in while the socket
> > is owned by user-space, the packets will be stored in tp->encap_out
> > and processed during release_sock.
> > 
> > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> > ---
> > 
> >  include/linux/tcp.h      |   15 ++
> >  include/net/tcp.h        |   27 +++
> >  include/uapi/linux/tcp.h |    1 
> >  include/uapi/linux/udp.h |    1 
> >  net/ipv4/tcp.c           |   68 +++++++++
> >  net/ipv4/tcp_input.c     |  326 +++++++++++++++++++++++++++++++++++++++++++++--
> >  net/ipv4/tcp_ipv4.c      |    1 
> >  net/ipv4/tcp_output.c    |   48 ++++++
> >  8 files changed, 473 insertions(+), 14 deletions(-)
> > 
> 
> Ouch...
> 
> Is there any chance this can be done with almost no change in TCP
> stack, using a layer model ? ( net/kcm comes to mind )

Herbert, would this be an option or is this not possible?

Thanks!
Herbert Xu Jan. 18, 2018, 3:49 a.m. UTC | #3
On Tue, Jan 16, 2018 at 11:28:23AM +0100, Steffen Klassert wrote:
>
> > Is there any chance this can be done with almost no change in TCP
> > stack, using a layer model ? ( net/kcm comes to mind )
> 
> Herbert, would this be an option or is this not possible?

Yes it can be done.  I'm working on it.

Cheers,

Patch

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index ca4a636..1360a0e 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -225,7 +225,8 @@  struct tcp_sock {
 		fastopen_connect:1, /* FASTOPEN_CONNECT sockopt */
 		fastopen_no_cookie:1, /* Allow send/recv SYN+data without a cookie */
 		is_sack_reneg:1,    /* in recovery from loss with SACK reneg? */
-		unused:2;
+		encap:1,	/* TCP IKE/ESP encapsulation */
+		encap_lenhi_valid:1;
 	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
 		unused1	    : 1,
@@ -373,6 +374,16 @@  struct tcp_sock {
 	 */
 	struct request_sock *fastopen_rsk;
 	u32	*saved_syn;
+
+#ifdef CONFIG_XFRM
+/* TCP ESP encapsulation */
+	struct sk_buff *encap_in;
+	struct sk_buff_head encap_out;
+	u32	encap_seq;
+	u32	encap_last;
+	u16	encap_backlog;
+	u8	encap_lenhi;
+#endif
 };
 
 enum tsq_enum {
@@ -384,6 +395,7 @@  enum tsq_enum {
 	TCP_MTU_REDUCED_DEFERRED,  /* tcp_v{4|6}_err() could not call
 				    * tcp_v{4|6}_mtu_reduced()
 				    */
+	TCP_ESP_DEFERRED,	   /* esp_output_tcp_encap2 queued packets */
 };
 
 enum tsq_flags {
@@ -393,6 +405,7 @@  enum tsq_flags {
 	TCPF_WRITE_TIMER_DEFERRED	= (1UL << TCP_WRITE_TIMER_DEFERRED),
 	TCPF_DELACK_TIMER_DEFERRED	= (1UL << TCP_DELACK_TIMER_DEFERRED),
 	TCPF_MTU_REDUCED_DEFERRED	= (1UL << TCP_MTU_REDUCED_DEFERRED),
+	TCPF_ESP_DEFERRED		= (1UL << TCP_ESP_DEFERRED),
 };
 
 static inline struct tcp_sock *tcp_sk(const struct sock *sk)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6da880d..6513ae2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -327,6 +327,7 @@  int tcp_sendpage_locked(struct sock *sk, struct page *page, int offset,
 			size_t size, int flags);
 ssize_t do_tcp_sendpages(struct sock *sk, struct page *page, int offset,
 		 size_t size, int flags);
+int tcp_encap_output(struct sock *sk, struct sk_buff *skb);
 void tcp_release_cb(struct sock *sk);
 void tcp_wfree(struct sk_buff *skb);
 void tcp_write_timer_handler(struct sock *sk);
@@ -399,6 +400,7 @@  int compat_tcp_setsockopt(struct sock *sk, int level, int optname,
 			  char __user *optval, unsigned int optlen);
 void tcp_set_keepalive(struct sock *sk, int val);
 void tcp_syn_ack_timeout(const struct request_sock *req);
+void tcp_cleanup_rbuf(struct sock *sk, int copied);
 int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 		int flags, int *addr_len);
 void tcp_parse_options(const struct net *net, const struct sk_buff *skb,
@@ -789,7 +791,8 @@  struct tcp_skb_cb {
 	__u8		txstamp_ack:1,	/* Record TX timestamp for ack? */
 			eor:1,		/* Is skb MSG_EOR marked? */
 			has_rxtstamp:1,	/* SKB has a RX timestamp	*/
-			unused:5;
+			esp_skip:1,	/* SKB is pure ESP */
+			unused:4;
 	__u32		ack_seq;	/* Sequence number ACK'd	*/
 	union {
 		struct {
@@ -2062,4 +2065,26 @@  static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk)
 #if IS_ENABLED(CONFIG_SMC)
 extern struct static_key_false tcp_have_smc;
 #endif
+
+#ifdef CONFIG_XFRM
+DECLARE_STATIC_KEY_FALSE(tcp_encap_needed);
+
+int tcp_encap_enable(struct sock *sk);
+
+static inline bool tcp_esp_skipped(struct sk_buff *skb)
+{
+	return TCP_SKB_CB(skb)->esp_skip;
+}
+#else
+static inline int tcp_encap_enable(struct sock *sk)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline bool tcp_esp_skipped(struct sk_buff *skb)
+{
+	return false;
+}
+#endif
+
 #endif	/* _TCP_H */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index b4a4f64..769cab0 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -122,6 +122,7 @@  enum {
 #define TCP_MD5SIG_EXT		32	/* TCP MD5 Signature with extensions */
 #define TCP_FASTOPEN_KEY	33	/* Set the key for Fast Open (cookie) */
 #define TCP_FASTOPEN_NO_COOKIE	34	/* Enable TFO without a TFO cookie */
+#define TCP_ENCAP		35	/* Set the socket to accept encapsulated packets */
 
 struct tcp_repair_opt {
 	__u32	opt_code;
diff --git a/include/uapi/linux/udp.h b/include/uapi/linux/udp.h
index efb7b59..1102846 100644
--- a/include/uapi/linux/udp.h
+++ b/include/uapi/linux/udp.h
@@ -39,5 +39,6 @@  struct udphdr {
 #define UDP_ENCAP_L2TPINUDP	3 /* rfc2661 */
 #define UDP_ENCAP_GTP0		4 /* GSM TS 09.60 */
 #define UDP_ENCAP_GTP1U		5 /* 3GPP TS 29.060 */
+#define TCP_ENCAP_ESPINTCP	6 /* Yikes, this is really xfrm encap types. */
 
 #endif /* _UAPI_LINUX_UDP_H */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f08eebe..032b46c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1545,7 +1545,7 @@  static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len)
  * calculation of whether or not we must ACK for the sake of
  * a window update.
  */
-static void tcp_cleanup_rbuf(struct sock *sk, int copied)
+void tcp_cleanup_rbuf(struct sock *sk, int copied)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	bool time_to_ack = false;
@@ -1627,6 +1627,35 @@  static struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
 	return NULL;
 }
 
+#ifdef CONFIG_XFRM
+static void __tcp_esp_skip(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *skb;
+	unsigned offset;
+	unsigned used;
+
+	while ((skb = tcp_recv_skb(sk, tp->copied_seq, &offset)) &&
+	       tcp_esp_skipped(skb)) {
+		used = skb->len - offset;
+		tp->copied_seq += used;
+		tcp_rcv_space_adjust(sk);
+		sk_eat_skb(sk, skb);
+	}
+}
+
+static inline void tcp_esp_skip(struct sock *sk, int flags)
+{
+	if (static_branch_unlikely(&tcp_encap_needed) &&
+	    tcp_sk(sk)->encap && !(flags & MSG_PEEK))
+		__tcp_esp_skip(sk);
+}
+#else
+static inline void tcp_esp_skip(struct sock *sk, int flags)
+{
+}
+#endif
+
 /*
  * This routine provides an alternative to tcp_recvmsg() for routines
  * that would like to handle copying from skbuffs directly in 'sendfile'
@@ -1650,7 +1679,9 @@  int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	if (sk->sk_state == TCP_LISTEN)
 		return -ENOTCONN;
 	while ((skb = tcp_recv_skb(sk, seq, &offset)) != NULL) {
-		if (offset < skb->len) {
+		if (tcp_esp_skipped(skb))
+			seq += skb->len - offset;
+		else if (offset < skb->len) {
 			int used;
 			size_t len;
 
@@ -1704,6 +1735,7 @@  int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 	/* Clean up data we have read: This will do ACK frames. */
 	if (copied > 0) {
 		tcp_recv_skb(sk, seq, &offset);
+		tcp_esp_skip(sk, 0);
 		tcp_cleanup_rbuf(sk, copied);
 	}
 	return copied;
@@ -1946,6 +1978,13 @@  int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 	found_ok_skb:
 		/* Ok so how much can we use? */
 		used = skb->len - offset;
+
+		if (tcp_esp_skipped(skb)) {
+			*seq += used;
+			urg_hole += used;
+			goto skip_copy;
+		}
+
 		if (len < used)
 			used = len;
 
@@ -2009,6 +2048,8 @@  int tcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock,
 		break;
 	} while (len > 0);
 
+	tcp_esp_skip(sk, flags);
+
 	/* According to UNIX98, msg_name/msg_namelen are ignored
 	 * on connected socket. I was just happy when found this 8) --ANK
 	 */
@@ -2146,6 +2187,21 @@  bool tcp_check_oom(struct sock *sk, int shift)
 	return too_many_orphans || out_of_socket_memory;
 }
 
+#ifdef CONFIG_XFRM
+static inline void tcp_encap_free(struct tcp_sock *tp)
+{
+	struct sk_buff *skb;
+
+	kfree_skb(tp->encap_in);
+	while ((skb = __skb_dequeue(&tp->encap_out)) != NULL)
+		__kfree_skb(skb);
+}
+#else
+static inline void tcp_encap_free(struct tcp_sock *tp)
+{
+}
+#endif
+
 void tcp_close(struct sock *sk, long timeout)
 {
 	struct sk_buff *skb;
@@ -2177,6 +2233,8 @@  void tcp_close(struct sock *sk, long timeout)
 		__kfree_skb(skb);
 	}
 
+	tcp_encap_free(tcp_sk(sk));
+
 	sk_mem_reclaim(sk);
 
 	/* If socket has been already reset (e.g. in tcp_reset()) - kill it. */
@@ -2583,6 +2641,12 @@  static int do_tcp_setsockopt(struct sock *sk, int level,
 
 		return tcp_fastopen_reset_cipher(net, sk, key, sizeof(key));
 	}
+	case TCP_ENCAP:
+		if (sk->sk_state == TCP_ESTABLISHED)
+			return tcp_encap_enable(sk);
+		else
+			return -ENOTCONN;
+		break;
 	default:
 		/* fallthru */
 		break;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9550cc4..22c9f70 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -72,12 +72,14 @@ 
 #include <linux/prefetch.h>
 #include <net/dst.h>
 #include <net/tcp.h>
+#include <net/xfrm.h>
 #include <net/inet_common.h>
 #include <linux/ipsec.h>
 #include <asm/unaligned.h>
 #include <linux/errqueue.h>
 #include <trace/events/tcp.h>
 #include <linux/static_key.h>
+#include <uapi/linux/udp.h>
 
 int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
 
@@ -110,6 +112,10 @@ 
 #define REXMIT_LOST	1 /* retransmit packets marked lost */
 #define REXMIT_NEW	2 /* FRTO-style transmit of unsent/new packets */
 
+#ifdef CONFIG_XFRM
+DEFINE_STATIC_KEY_FALSE(tcp_encap_needed);
+#endif
+
 static void tcp_gro_dev_warn(struct sock *sk, const struct sk_buff *skb,
 			     unsigned int len)
 {
@@ -4294,6 +4300,314 @@  static void tcp_drop(struct sock *sk, struct sk_buff *skb)
 	__kfree_skb(skb);
 }
 
+#ifdef CONFIG_XFRM
+static void tcp_set_encap_seq(struct tcp_sock *tp, struct sk_buff *skb,
+			      unsigned offset, __be16 len)
+{
+	while ((offset += min(be16_to_cpu(len), 2)) + 1 < skb->len)
+		skb_copy_bits(skb, offset, &len, 2);
+
+	if (skb->len <= offset) {
+		tp->encap_seq = TCP_SKB_CB(skb)->seq + offset;
+		return;
+	}
+
+	skb_copy_bits(skb, offset, &tp->encap_lenhi, 1);
+	tp->encap_lenhi_valid = true;
+}
+
+static void tcp_encap_error(struct tcp_sock *tp, struct sk_buff *skb,
+			    unsigned offset)
+{
+	struct sk_buff *prev = tp->encap_in;
+	union {
+		u8 bytes[2];
+		__be16 len;
+	} hdr;
+
+	if (!prev) {
+		tcp_set_encap_seq(tp, skb, offset - 2, 0);
+		return;
+	}
+
+	if (prev->len == 1) {
+		skb_copy_bits(prev, 0, &hdr.bytes[0], 1);
+		skb_copy_bits(skb, offset, &hdr.bytes[1], 1);
+		tcp_set_encap_seq(tp, skb, offset - 1, hdr.len);
+	}
+
+	__kfree_skb(prev);
+	tp->encap_in = NULL;
+}
+
+static void tcp_encap_error_free(struct tcp_sock *tp, struct sk_buff *skb,
+			       unsigned offset)
+{
+	tcp_encap_error(tp, skb, offset);
+	__kfree_skb(skb);
+}
+
+static int tcp_decap_skb(struct tcp_sock *tp, struct sk_buff *skb)
+{
+	/* Get rid of length field to get pure ESP. */
+	if (!__pskb_pull(skb, 2))
+		return -ENOMEM;
+	skb_reset_transport_header(skb);
+
+	rcu_read_lock();
+	skb->dev = dev_get_by_index_rcu(sock_net((struct sock *)tp),
+					skb->skb_iif);
+	if (skb->dev)
+		xfrm4_rcv_encap(skb, IPPROTO_ESP, 0, TCP_ENCAP_ESPINTCP);
+	rcu_read_unlock();
+	return 0;
+}
+
+static bool __tcp_encap_process(struct tcp_sock *tp, struct sk_buff *skb)
+{
+	struct sock *sk = (void *)tp;
+	struct {
+		union {
+			u8 bytes[2];
+			__be16 len;
+		};
+		__be32 spi;
+	} hdr;
+	struct sk_buff *prev;
+	bool eaten = false;
+	unsigned headlen;
+	unsigned offset2;
+	unsigned offset;
+	bool fragstolen;
+	int delta;
+
+	offset = tp->encap_last - TCP_SKB_CB(skb)->seq;
+	if (unlikely(skb->len <= offset))
+		return false;
+
+	tp->encap_last = TCP_SKB_CB(skb)->seq + skb->len;
+
+	if (unlikely(tp->encap_lenhi_valid)) {
+		tp->encap_lenhi_valid = false;
+		hdr.bytes[0] = tp->encap_lenhi;
+		skb_copy_bits(skb, offset, &hdr.bytes[1], 1);
+		tcp_set_encap_seq(tp, skb, offset - 1, hdr.len);
+		return false;
+	}
+
+	if (unlikely(tp->urg_data))
+		goto slow_path;
+
+	if (unlikely(tp->encap_in))
+		goto slow_path;
+
+	offset = tp->encap_seq - TCP_SKB_CB(skb)->seq;
+	if (unlikely(skb->len <= offset))
+		return false;
+
+	if (unlikely(offset))
+		goto slow_path;
+
+	if (unlikely(skb_has_frag_list(skb)))
+		goto slow_path;
+
+	offset2 = 0;
+
+	do {
+		if (unlikely(skb->len < sizeof(hdr)))
+			goto slow_path;
+
+		skb_copy_bits(skb, offset2, &hdr, sizeof(hdr));
+		offset2 += be16_to_cpu(hdr.len);
+		if (skb->len < offset2)
+			goto slow_path;
+
+		if (!hdr.spi)
+			goto slow_path;
+	} while (skb->len > offset2);
+
+	if (offset2 != be16_to_cpu(hdr.len))
+		goto slow_path;
+
+	tp->encap_seq = TCP_SKB_CB(skb)->seq + skb->len;
+
+	if (!skb_peek_tail(&sk->sk_receive_queue) &&
+	    !(TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)) {
+		tp->copied_seq = tp->encap_seq;
+		tcp_rcv_space_adjust(sk);
+		tcp_cleanup_rbuf(sk, skb->len);
+		eaten = true;
+	}
+
+	TCP_SKB_CB(skb)->esp_skip = 1;
+
+	skb = skb_clone(skb, GFP_ATOMIC);
+	if (unlikely(!skb))
+		return eaten;
+
+	if (unlikely(tcp_decap_skb(tp, skb)))
+		__kfree_skb(skb);
+
+	return eaten;
+
+slow_path:
+	headlen = -(skb_mac_header_was_set(skb) ? skb_mac_offset(skb) :
+						  skb_network_offset(skb));
+	__skb_push(skb, headlen);
+	prev = skb;
+
+	skb = pskb_copy(prev, GFP_ATOMIC);
+	__skb_pull(prev, headlen);
+
+	if (!skb) {
+		tcp_encap_error(tp, prev, offset);
+		return false;
+	}
+
+	__skb_pull(skb, headlen);
+	skb->mac_len = prev->mac_len;
+
+	if (!__pskb_pull(skb, offset)) {
+		tcp_encap_error_free(tp, skb, offset);
+		return false;
+	}
+
+	TCP_SKB_CB(skb)->seq += offset;
+	prev = tp->encap_in;
+	tp->encap_in = NULL;
+
+	if (!prev)
+		prev = skb;
+	else if (skb_try_coalesce(prev, skb, &fragstolen, &delta))
+		kfree_skb_partial(skb, fragstolen);
+	else {
+		skb_shinfo(prev)->frag_list = skb;
+		prev->data_len += skb->len;
+		prev->len += skb->len;
+		prev->truesize += skb->truesize;
+	}
+
+	/* We could do a list instead of linearising, but that would
+	 * open the door to abuses such as a stream of single-byte
+	 * datagrams up to 64K.
+	 */
+	if (skb_has_frag_list(prev) && __skb_linearize(prev)) {
+		tcp_encap_error_free(tp, prev, 0);
+		return false;
+	}
+
+	headlen = -(skb_mac_header_was_set(prev) ? skb_mac_offset(prev) :
+						   skb_network_offset(prev));
+
+	while (prev->len >= sizeof(hdr.len)) {
+		skb_copy_bits(prev, 0, &hdr,
+			      min((unsigned)sizeof(hdr), prev->len));
+
+		offset = be16_to_cpu(hdr.len);
+		tp->encap_seq = TCP_SKB_CB(prev)->seq + offset;
+
+		if (prev->len < offset)
+			break;
+
+		skb = prev;
+		if (prev->len > offset) {
+			int nsize = skb_headlen(skb) - offset;
+
+			if (nsize < 0)
+				nsize = 0;
+
+			prev = alloc_skb(nsize + headlen, GFP_ATOMIC);
+			if (!prev) {
+				tcp_encap_error_free(tp, skb, offset);
+				return false;
+			}
+
+			/* Slap on a header on each message. */
+			if (skb_mac_header_was_set(skb)) {
+				skb_reset_mac_header(prev);
+				skb_set_network_header(
+					prev, skb_mac_header_len(skb));
+				prev->mac_len = skb->mac_len;
+			} else
+				skb_reset_network_header(prev);
+			memcpy(__skb_put(prev, headlen),
+			       skb->data - headlen, headlen);
+			__skb_pull(prev, headlen);
+
+			nsize = skb->len - offset - nsize;
+
+			skb_split(skb, prev, offset);
+			skb->truesize -= nsize;
+			prev->truesize += nsize;
+			prev->skb_iif = skb->skb_iif;
+			TCP_SKB_CB(prev)->seq = TCP_SKB_CB(skb)->seq + offset;
+		}
+
+		if (!hdr.spi || tcp_decap_skb(tp, skb))
+			__kfree_skb(skb);
+
+		if (prev == skb)
+			return eaten;
+	}
+
+	tp->encap_in = prev;
+
+	return false;
+}
+
+int tcp_encap_enable(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *skb;
+
+	lock_sock(sk);
+
+	if (tp->encap)
+		goto out;
+
+	__skb_queue_head_init(&tp->encap_out);
+
+	tp->encap_last = tp->encap_seq = tp->copied_seq;
+
+	skb_queue_walk(&sk->sk_receive_queue, skb)
+		__tcp_encap_process(tp, skb);
+
+	tp->encap = 1;
+	static_branch_enable(&tcp_encap_needed);
+
+out:
+	release_sock(sk);
+
+	return 0;
+}
+
+static inline bool tcp_encap_process(struct tcp_sock *tp, struct sk_buff *skb)
+{
+	if (static_branch_unlikely(&tcp_encap_needed) && tp->encap)
+		return __tcp_encap_process(tp, skb);
+
+	return false;
+}
+#else
+static inline bool tcp_encap_process(struct tcp_sock *tp, struct sk_buff *skb)
+{
+	return false;
+}
+#endif
+
+static bool tcp_eat_skb(struct sock *sk, struct sk_buff *skb, bool *fragstolen)
+{
+	struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
+	struct tcp_sock *tp = tcp_sk(sk);
+	bool eaten;
+
+	eaten = tcp_encap_process(tp, skb) ||
+		(tail && tcp_try_coalesce(sk, tail, skb, fragstolen));
+	tcp_rcv_nxt_update(tp, TCP_SKB_CB(skb)->end_seq);
+
+	return eaten;
+}
+
 /* This one checks to see if we can put data from the
  * out_of_order queue into the receive_queue.
  */
@@ -4302,7 +4616,7 @@  static void tcp_ofo_queue(struct sock *sk)
 	struct tcp_sock *tp = tcp_sk(sk);
 	__u32 dsack_high = tp->rcv_nxt;
 	bool fin, fragstolen, eaten;
-	struct sk_buff *skb, *tail;
+	struct sk_buff *skb;
 	struct rb_node *p;
 
 	p = rb_first(&tp->out_of_order_queue);
@@ -4329,9 +4643,7 @@  static void tcp_ofo_queue(struct sock *sk)
 			   tp->rcv_nxt, TCP_SKB_CB(skb)->seq,
 			   TCP_SKB_CB(skb)->end_seq);
 
-		tail = skb_peek_tail(&sk->sk_receive_queue);
-		eaten = tail && tcp_try_coalesce(sk, tail, skb, &fragstolen);
-		tcp_rcv_nxt_update(tp, TCP_SKB_CB(skb)->end_seq);
+		eaten = tcp_eat_skb(sk, skb, &fragstolen);
 		fin = TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN;
 		if (!eaten)
 			__skb_queue_tail(&sk->sk_receive_queue, skb);
@@ -4508,13 +4820,9 @@  static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int
 		  bool *fragstolen)
 {
 	int eaten;
-	struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue);
 
 	__skb_pull(skb, hdrlen);
-	eaten = (tail &&
-		 tcp_try_coalesce(sk, tail,
-				  skb, fragstolen)) ? 1 : 0;
-	tcp_rcv_nxt_update(tcp_sk(sk), TCP_SKB_CB(skb)->end_seq);
+	eaten = tcp_eat_skb(sk, skb, fragstolen);
 	if (!eaten) {
 		__skb_queue_tail(&sk->sk_receive_queue, skb);
 		skb_set_owner_r(skb, sk);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 77ea45d..a613ff4 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1617,6 +1617,7 @@  static void tcp_v4_fill_cb(struct sk_buff *skb, const struct iphdr *iph,
 	TCP_SKB_CB(skb)->sacked	 = 0;
 	TCP_SKB_CB(skb)->has_rxtstamp =
 			skb->tstamp || skb_hwtstamps(skb)->hwtstamp;
+	TCP_SKB_CB(skb)->esp_skip = 0;
 }
 
 /*
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a4d214c..66e1121 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -830,10 +830,53 @@  static void tcp_tasklet_func(unsigned long data)
 	}
 }
 
+#ifdef CONFIG_XFRM
+int tcp_encap_output(struct sock *sk, struct sk_buff *skb)
+{
+	int offset;
+	unsigned len;
+
+	if (sk->sk_state != TCP_ESTABLISHED)
+		return -ECONNRESET;
+
+	offset = skb_transport_offset(skb);
+	len = skb->len - offset;
+
+	*(__be16 *)skb_transport_header(skb) = cpu_to_be16(len);
+
+	offset = skb_send_sock_locked(sk, skb, offset, len);
+	if (offset >= 0) {
+		__kfree_skb(skb);
+		offset = 0;
+	}
+
+	return offset;
+}
+EXPORT_SYMBOL(tcp_encap_output);
+
+static void tcp_process_encap(struct sock *sk)
+{
+	struct sk_buff_head queue;
+	struct sk_buff *skb;
+
+	__skb_queue_head_init(&queue);
+	skb_queue_splice_init(&tcp_sk(sk)->encap_out, &queue);
+
+	while ((skb = __skb_dequeue(&queue)))
+		if (tcp_encap_output(sk, skb))
+			__kfree_skb(skb);
+}
+#else
+static inline void tcp_process_encap(struct sock *sk)
+{
+}
+#endif
+
 #define TCP_DEFERRED_ALL (TCPF_TSQ_DEFERRED |		\
 			  TCPF_WRITE_TIMER_DEFERRED |	\
 			  TCPF_DELACK_TIMER_DEFERRED |	\
-			  TCPF_MTU_REDUCED_DEFERRED)
+			  TCPF_MTU_REDUCED_DEFERRED |	\
+			  TCPF_ESP_DEFERRED)
 /**
  * tcp_release_cb - tcp release_sock() callback
  * @sk: socket
@@ -879,6 +922,8 @@  void tcp_release_cb(struct sock *sk)
 		inet_csk(sk)->icsk_af_ops->mtu_reduced(sk);
 		__sock_put(sk);
 	}
+	if (flags & TCPF_ESP_DEFERRED)
+		tcp_process_encap(sk);
 }
 EXPORT_SYMBOL(tcp_release_cb);
 
@@ -1609,6 +1654,7 @@  unsigned int tcp_current_mss(struct sock *sk)
 
 	return mss_now;
 }
+EXPORT_SYMBOL(tcp_current_mss);
 
 /* RFC2861, slow part. Adjust cwnd, after it was not full during one rto.
  * As additional protections, we do not touch cwnd in retransmission phases,