From patchwork Mon Apr 23 08:30:08 2012
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Simon Horman
X-Patchwork-Id: 154348
X-Patchwork-Delegate: davem@davemloft.net
Return-Path:
X-Original-To: patchwork-incoming@ozlabs.org
Delivered-To: patchwork-incoming@ozlabs.org
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by ozlabs.org (Postfix) with ESMTP id D3FE8B6EEB
	for ; Mon, 23 Apr 2012 18:30:21 +1000 (EST)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754247Ab2DWIaT (ORCPT ); Mon, 23 Apr 2012 04:30:19 -0400
Received: from kirsty.vergenet.net ([202.4.237.240]:47347
	"EHLO kirsty.vergenet.net" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1754226Ab2DWIaM (ORCPT );
	Mon, 23 Apr 2012 04:30:12 -0400
Received: from ayumi.akashicho.tokyo.vergenet.net
	(p5238-ipbfp903kobeminato.hyogo.ocn.ne.jp [123.221.44.238])
	by kirsty.vergenet.net (Postfix) with ESMTP id 924EE25BF66;
	Mon, 23 Apr 2012 18:30:09 +1000 (EST)
Received: by ayumi.akashicho.tokyo.vergenet.net (Postfix, from userid 7100)
	id 4171FEDE090; Mon, 23 Apr 2012 17:30:08 +0900 (JST)
Date: Mon, 23 Apr 2012 17:30:08 +0900
From: Simon Horman
To: David Miller
Cc: jhs@mojatatu.com, stephen.hemminger@vyatta.com, netdev@vger.kernel.org,
	dev@openvswitch.org, eric.dumazet@gmail.com
Subject: Re: [RFC v4] Add TCP encap_rcv hook (repost)
Message-ID: <20120423083007.GB22556@verge.net.au>
References: <61c89e02-c916-421e-b469-62b307853b1b@tahiti.vyatta.com>
	<1335110082.2132.22.camel@mojatatu>
	<20120423051359.GE11672@verge.net.au>
	<20120423.033658.1229108613501573952.davem@davemloft.net>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <20120423.033658.1229108613501573952.davem@davemloft.net>
Organisation: Horms Solutions Ltd.
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: netdev@vger.kernel.org

On Mon, Apr 23, 2012 at 03:36:58AM -0400, David Miller wrote:
> From: Simon Horman
> Date: Mon, 23 Apr 2012 14:14:02 +0900
>
> > On Sun, Apr 22, 2012 at 11:54:42AM -0400, Jamal Hadi Salim wrote:
> >> On Sun, 2012-04-22 at 08:22 -0700, Stephen Hemminger wrote:
> >>
> >> > STT isn't really doing TCP, it just lying and pretending to be
> >> > TCP to allow TSO to work! There is no packet ordering, sequence
> >> > numbers or any real transport layer.
> >
> > Yes, that is my understanding. Originally I envisaged that an STT
> > implementation would rely more heavily on the TCP stack. However, as
> > STT doesn't rely on any of the features of TCP other than its header
> > this was not the case and (almost) bypassing the TCP stack seems
> > to be sufficient.
> >
> > I believe the motivation for reusing TCP is, as Stephen suggests,
> > to allow some hardware acceleration to occur.
>
> Yes, this is what the IETF draft states.
>
> But I wonder about your encap_rcv hook placement, nevermind
> that your posted patch won't compile since tcp_sock lacks
> an encap_tcv member and your patch didn't add one. :-)

I'm pretty sure the patch I posted added encap_rcv to tcp_sock.
Am I missing the point?

> You'll need to somehow create either a fully established or a
> listening socket for that hook to work.
>
> You'd need to perform a full handshake to get a socket into
> established state, and it seems STT doesn't do a TCP handshake.
>
> That leaves you with the listening socket option, and in that case I
> want to know how you're going to send packets out of this STT tunnel?

Currently I am setting up a listening socket. The Open vSwitch tunneling
code transmits skbs using either dev_queue_xmit() or ip_local_out().
I'm not sure that I have exercised the ip_local_out() case yet.

But perhaps that doesn't answer your question?

> In order to get the advertised benefits of this STT thing, you'll need
> to go through the whole TCP data packet sending engine, in order to
> get all the TSO/GSO stuff initialized properly on the SKB so the NIC
> will do it's thing.
>
> But you can't send data out of an un-established TCP socket.
>
> At the very least, we'll need to see the rest of your full
> implementation before we can say whether this encap_rcv hook is the
> right way to do things.

Sure, I'm happy to provide my implementation, though it is still WIP.
The most recent patch is below.

I should point out that the actual transmission of packets occurs
outside of that patch, in existing Open vSwitch code. I am unsure of the
best way to make that available to you. It is the ovs_tnl_send() function
in datapath/tunnel.c, which is available in the openvswitch git repository.

git://openvswitch.org/openvswitch

For reference I have included the file in this email after the STT patch.

---- begin stt patch ----

tunnelling: stt: Prototype Implementation

This is a not yet well exercised implementation of STT intended for
review. I am sure there are numerous areas that need improvement.

In particular:
- The transmit path's generation of partial checksums needs to be tested
- The VLAN stripping code needs to be exercised
- The code needs to be exercised in the presence of HW checksumming
- In general, the code has been exercised by running Open vSwitch in KVM
  guests on the same host. Testing between physical hosts is needed.

This implementation is based on the CAPWAP implementation and in particular
includes defragmentation code almost identical to CAPWAP.

It seems to me that while fragmentation can be handled by GSO/TSO,
defragmentation code is needed in STT in the case where LRO/GRO doesn't
reassemble an entire STT frame for some reason.

If the defragmentation code, which is of non-trivial length, remains more
or less in its present state then there is some scope for consolidation
with CAPWAP.
Other code that may possibly be consolidated with CAPWAP has been marked
accordingly.

This code depends on an encap_rcv hook being added to the Linux kernel's
TCP stack. A patch to add such a hook will be posted separately. Ultimately
this change or some alternative will need to be applied to the mainline
Linux kernel's TCP stack if STT is to be widely deployed. Motivating this
change to the TCP stack is part of the purpose of this prototype STT
implementation.

The configuration of STT is analogous to that of other tunneling
protocols such as GRE which are supported by Open vSwitch.

e.g.
ovs-vsctl add bridge project0 ports @newport \
	-- --id=@newport create port name=stt0 interfaces=[@newinterface] \
	-- --id=@newinterface create interface name=stt0 type=stt options="remote_ip=10.0.99.192,key=64"

Signed-off-by: Simon Horman
---

v3
* Correct stripping of vlan tag on transmit
* Correct setting of vlan TCI on receive
  - Use __vlan_hwaccel_put_tag instead of vlan_put_tag
* Use encap_rcv_enable() to enable receiving packets from the TCP stack
  - This is an update for the new implementation of the TCP stack
    patch that adds encap_rcv
* Call pskb_may_pull() for STT_FRAME_HLEN + ETH_HLEN bytes in
  process_stt_proto() as this is required by ovs_flow_extract()
* Include "stt: " in pr_fmt
* Make use of pr_* instead of printk
* Rate limit all packet-generated pr_* messages
* STT flags are 8 bits wide so don't define them using __cpu_to_be16()
* Only include l4_offset if
  1. get_ip_summed(skb) is OVS_CSUM_PARTIAL
  2. skb->csum_start is non-zero
  3. it is between 0 and 255
  - Warn if the first two conditions are met but not the third one.
* Only set STT_FLAG_CHECKSUM_VERIFIED if get_ip_summed(skb) is
  OVS_CSUM_UNNECESSARY
* Print a debug message if get_ip_summed(skb) is OVS_CSUM_UNNECESSARY;
  this case is yet to be exercised
* In the rx path, adjust skb->csum_start to take into account pulling
  STT_FRAME_HLEN if get_ip_summed(skb) is OVS_CSUM_PARTIAL
* Warn if skb->dev is NULL on defragmentation and stop processing the skb.
  - This fixes a crash bug
  - But how can this occur?

v2
* Transmit
  - Correct calculation of segment offset
  - Streamline source port calculation and setting STT_FLAG_IP_VERSION.
    This allows IPv4 and IPv6 to share more code and reduces the overall
    amount of code.
  - Calculate partial checksum for GSO skbs. Is this correct?
  - Only calculate full checksum for non-GSO skbs.
  - Set STT_FLAG_CHECKSUM_VERIFIED for all non-GSO skbs.
  - Remove use of l4_offset; the patch modifying the tunnelling code
    to supply this has been dropped. Instead calculate the value based
    on csum_start if it is set and the network protocol of the inner
    packet is IPv4 or IPv6
* Receive
  - Correct number of bytes pulled
    + Only the TCP header plus the STT header less the pad needs to be
      pulled.
  - Only access STT header after it has been pulled
  - Verify checksum on receive
  - Remove use of encap_type; it is no longer present in the proposed
    TCP stack patch
  - Use the acknowledgement (tcph->ack_seq) as the fragment id in
    defragmentation
* Transmit and Receive
  - Add stt_seg_len() helper and use it in segmentation and
    desegmentation code. This corrects several offset calculation errors.
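
As an aside on the segmentation notes above: the offset handling depends on
how the TCP sequence number field is reused. Below is a minimal userspace
sketch of that packing, assuming (as FRAME_LEN_SHIFT and FRAG_OFF_MASK in
the patch suggest) that the upper 16 bits of tcph->seq carry the STT frame
length and the lower 16 bits the fragment offset. The helper names
stt_seq_pack(), stt_seq_frame_len(), stt_seq_frag_off() and
stt_seg_is_last() are hypothetical and do not appear in the patch, and
byte-order conversion (htonl/ntohl) is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

#define FRAG_OFF_MASK   0xffffU
#define FRAME_LEN_SHIFT 16

/* Hypothetical helper: build a host-order sequence value with the frame
 * length in the upper 16 bits and the fragment offset in the lower 16. */
static inline uint32_t stt_seq_pack(uint32_t frame_len, uint32_t frag_off)
{
	assert(frame_len <= 0xffffU && frag_off <= FRAG_OFF_MASK);
	return (frame_len << FRAME_LEN_SHIFT) | frag_off;
}

/* Hypothetical helper: total STT frame length carried in the upper half. */
static inline uint16_t stt_seq_frame_len(uint32_t seq)
{
	return (uint16_t)(seq >> FRAME_LEN_SHIFT);
}

/* Hypothetical helper: offset of this segment within the STT frame. */
static inline uint16_t stt_seq_frag_off(uint32_t seq)
{
	return (uint16_t)(seq & FRAG_OFF_MASK);
}

/* Mirrors the test in defrag(): a segment is the last one when its
 * offset plus its own length reaches the frame length. */
static inline int stt_seg_is_last(uint32_t seq, uint16_t seg_len)
{
	return stt_seq_frame_len(seq) ==
	       (uint32_t)stt_seq_frag_off(seq) + seg_len;
}
```

If segmentation advances the sequence number by each segment's payload
length, as I would expect TSO to do, then a 3000 byte frame sent with 1448
byte segments would carry offsets 0, 1448 and 2896, and the final 104 byte
segment is the one for which stt_seg_is_last() holds.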
--- acinclude.m4 | 3 + datapath/Modules.mk | 3 +- datapath/tunnel.h | 1 + datapath/vport-stt.c | 803 +++++++++++++++++++++++++++++++++++++++++++ datapath/vport.c | 3 + datapath/vport.h | 2 + include/linux/openvswitch.h | 1 + lib/netdev-vport.c | 9 +- vswitchd/vswitch.xml | 10 + 9 files changed, 833 insertions(+), 2 deletions(-) create mode 100644 datapath/vport-stt.c diff --git a/acinclude.m4 b/acinclude.m4 index 69bb772..f3a52fa 100644 --- a/acinclude.m4 +++ b/acinclude.m4 @@ -266,6 +266,9 @@ AC_DEFUN([OVS_CHECK_LINUX_COMPAT], [ OVS_GREP_IFELSE([$KSRC/include/linux/if_vlan.h], [ADD_ALL_VLANS_CMD], [OVS_DEFINE([HAVE_VLAN_BUG_WORKAROUND])]) + OVS_GREP_IFELSE([$KSRC/include/linux/tcp.h], [encap_rcv], + [OVS_DEFINE([HAVE_TCP_ENCAP_RCV])]) + OVS_CHECK_LOG2_H if cmp -s datapath/linux/kcompat.h.new \ diff --git a/datapath/Modules.mk b/datapath/Modules.mk index 24c1075..6fbe3dd 100644 --- a/datapath/Modules.mk +++ b/datapath/Modules.mk @@ -26,7 +26,8 @@ openvswitch_sources = \ vport-gre.c \ vport-internal_dev.c \ vport-netdev.c \ - vport-patch.c + vport-patch.c \ + vport-stt.c openvswitch_headers = \ checksum.h \ diff --git a/datapath/tunnel.h b/datapath/tunnel.h index 33eb63c..96f59b1 100644 --- a/datapath/tunnel.h +++ b/datapath/tunnel.h @@ -41,6 +41,7 @@ */ #define TNL_T_PROTO_GRE 0 #define TNL_T_PROTO_CAPWAP 1 +#define TNL_T_PROTO_STT 2 /* These flags are only needed when calling tnl_find_port(). */ #define TNL_T_KEY_EXACT (1 << 10) diff --git a/datapath/vport-stt.c b/datapath/vport-stt.c new file mode 100644 index 0000000..638998d --- /dev/null +++ b/datapath/vport-stt.c @@ -0,0 +1,803 @@ +/* + * Copyright (c) 2012 Horms Solutions Ltd. + * Distributed under the terms of the GNU GPL version 2. + * + * Significant portions of this file may be copied from parts of the Linux + * kernel, by Linus Torvalds and others. + * + * Significant portions of this file may be copied from + * other parts of Open vSwitch, by Nicira Networks and others. 
+ */ + +#define pr_fmt(fmt) KBUILD_MODNAME ": stt: " fmt + +#include +#ifdef HAVE_TCP_ENCAP_RCV + +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include + +#include "datapath.h" +#include "tunnel.h" +#include "vport.h" +#include "vport-generic.h" + +#define STT_DST_PORT 58882 /* Change to actual port number once awarded by IANA */ + +/* XXX: Possible Consolidation: The same values as capwap */ +#define STT_FRAG_TIMEOUT (30 * HZ) +#define STT_FRAG_MAX_MEM (256 * 1024) +#define STT_FRAG_PRUNE_MEM (192 * 1024) +#define STT_FRAG_SECRET_INTERVAL (10 * 60 * HZ) + +#define STT_FLAG_CHECKSUM_VERIFIED (1 << 0) +#define STT_FLAG_CHECKSUM_PARTIAL (1 << 1) +#define STT_FLAG_IP_VERSION (1 << 2) +#define STT_FLAG_TCP_PAYLOAD (1 << 3) + +#define FRAG_OFF_MASK 0xffffU +#define FRAME_LEN_SHIFT 16 + +struct stthdr { + uint8_t version; + uint8_t flags; + uint8_t l4_offset; + uint8_t reserved; + __be16 mss; + __be16 vlan_tci; + __be64 context_id; +}; + +/* + * Not in stthdr to avoid that structure being padded to + * a 64bit boundary - 2 bytes of pad are required, not 8 + */ +struct stthdr_pad { + uint8_t pad[2]; +}; + +static struct stthdr *stt_hdr(const struct sk_buff *skb) +{ + return (struct stthdr *)(tcp_hdr(skb) + 1); +} + +/* + * The minimum header length. 
+ */ +#define STT_SEG_HLEN sizeof(struct tcphdr) +#define STT_FRAME_HLEN (STT_SEG_HLEN + sizeof(struct stthdr) + \ + sizeof(struct stthdr_pad)) + +static inline int stt_seg_len(struct sk_buff *skb) +{ + return skb->len - skb_transport_offset(skb) - STT_SEG_HLEN; +} + +static inline struct ethhdr *stt_inner_eth_header(struct sk_buff *skb) +{ + return (struct ethhdr *)((char *)skb_transport_header(skb) + + STT_FRAME_HLEN); +} + +/* XXX: Possible Consolidation: Same as capwap */ +struct frag_match { + __be32 saddr; + __be32 daddr; + __be32 id; +}; + +/* XXX: Possible Consolidation: Same as capwap */ +struct frag_queue { + struct inet_frag_queue ifq; + struct frag_match match; +}; + +/* XXX: Possible Consolidation: Same as capwap */ +struct frag_skb_cb { + u16 offset; +}; +#define FRAG_CB(skb) ((struct frag_skb_cb *)(skb)->cb) + +static struct sk_buff *defrag(struct sk_buff *skb, u16 frame_len); + +static void stt_frag_init(struct inet_frag_queue *, void *match); +static unsigned int stt_frag_hash(struct inet_frag_queue *); +static int stt_frag_match(struct inet_frag_queue *, void *match); +static void stt_frag_expire(unsigned long ifq); + +static struct inet_frags frag_state = { + .constructor = stt_frag_init, + .qsize = sizeof(struct frag_queue), + .hashfn = stt_frag_hash, + .match = stt_frag_match, + .frag_expire = stt_frag_expire, + .secret_interval = STT_FRAG_SECRET_INTERVAL, +}; + +/* random value for selecting source ports */ +static u32 stt_port_rnd __read_mostly; + +static int stt_hdr_len(const struct tnl_mutable_config *mutable) +{ + return (int)STT_FRAME_HLEN; +} + +static void stt_build_header(const struct vport *vport, + const struct tnl_mutable_config *mutable, + void *header) +{ + struct tcphdr *tcph = header; + struct stthdr *stth = (struct stthdr *)(tcph + 1); + struct stthdr_pad *pad = (struct stthdr_pad *)(stth + 1); + + tcph->dest = htons(STT_DST_PORT); + tcp_flag_word(tcph) = 0; + tcph->doff = sizeof(struct tcphdr) / 4; + tcph->ack = 1; + 
pad->pad[0] = pad->pad[1] = 0; +} + +static u16 stt_src_port(u32 hash) +{ + int low, high; + inet_get_local_port_range(&low, &high); + return hash % (high - low) + low; +} + +struct sk_buff *stt_update_header(const struct vport *vport, + const struct tnl_mutable_config *mutable, + struct dst_entry *dst, + struct sk_buff *skb) +{ + struct tcphdr *tcph; + struct stthdr *stth; + struct ethhdr *inner_ethh; + struct tnl_vport *tnl_vport = tnl_vport_priv(vport); + __be32 frag_id = htonl(atomic_inc_return(&tnl_vport->frag_id)); + __be32 vlan_tci = 0; + u32 hash = jhash_1word(skb->protocol, stt_port_rnd); + int l4_protocol = IPPROTO_MAX; + + if (skb->protocol == htons(ETH_P_8021Q)) { + struct vlan_ethhdr *vlanh; + + if (unlikely(!pskb_may_pull(skb, VLAN_ETH_HLEN))) + goto err; + + vlanh = (struct vlan_ethhdr *)stt_inner_eth_header(skb); + vlan_tci = vlanh->h_vlan_TCI; + + /* STT requires that the encapsulated frame be untagged + * and the STT header only allows saving one VLAN TCI. + * So there seems to be no way to handle the presence of + * more than one vlan tag other than to drop the packet + */ + if (vlan_eth_hdr(skb)->h_vlan_encapsulated_proto == + htons(ETH_P_8021Q)) + goto err; + + memmove(skb->data + VLAN_HLEN, skb->data, + (size_t)((char *)vlanh - (char *)skb->data) + + 2 * ETH_ALEN); + if (unlikely(!skb_pull(skb, VLAN_HLEN))) + goto err; + + skb->protocol = vlan_eth_hdr(skb)->h_vlan_encapsulated_proto; + skb->mac_header += VLAN_HLEN; + skb->network_header += VLAN_HLEN; + skb->transport_header += VLAN_HLEN; + } + + tcph = tcp_hdr(skb); + stth = (struct stthdr *)(tcph + 1); + inner_ethh = stt_inner_eth_header(skb); + + stth->flags = 0; + + if (skb->protocol == htons(ETH_P_IP)) { + struct iphdr *iph = (struct iphdr *)(inner_ethh + 1); + hash = jhash_2words(iph->saddr, iph->daddr, hash); + l4_protocol = iph->protocol; + stth->flags |= STT_FLAG_IP_VERSION; + } else if (skb->protocol == htons(ETH_P_IPV6)) { + struct ipv6hdr *ipv6h = (struct ipv6hdr *)(inner_ethh + 1); 
+ hash = jhash(ipv6h->saddr.s6_addr, + sizeof(ipv6h->saddr.s6_addr), hash); + hash = jhash(ipv6h->daddr.s6_addr, + sizeof(ipv6h->daddr.s6_addr), hash); + l4_protocol = ipv6h->nexthdr; + } + + stth->l4_offset = 0; + if (get_ip_summed(skb) == OVS_CSUM_PARTIAL && skb->csum_start) { + int off = skb->csum_start - skb_headroom(skb); + if (likely(off < 256 && off > 0)) + stth->l4_offset = off; + else if (net_ratelimit()) + pr_err("%s: l4_offset is out of range %d should be " + "between 0 and 255", __func__, off); + } + + if (stth->l4_offset && (l4_protocol == IPPROTO_TCP || + l4_protocol == IPPROTO_UDP || + l4_protocol == IPPROTO_DCCP || + l4_protocol == IPPROTO_SCTP)) { + /* TCP, UDP, DCCP and SCTP place the source and destination + * ports in the first and second 16-bits of their header, + * so grabbing the first 32-bits will give a combined value. + */ + __be32 *ports = (__be32 *)((char *)inner_ethh + + stth->l4_offset); + hash = jhash_1word(*ports, hash); + } + + if (l4_protocol == IPPROTO_TCP) + stth->flags |= STT_FLAG_TCP_PAYLOAD; + + stth->reserved = 0; + stth->mss = htons(dst_mtu(dst)); + stth->vlan_tci = vlan_tci; + stth->context_id = mutable->out_key; + + tcph->source = htons(stt_src_port(hash)); + tcph->seq = htonl(stt_seg_len(skb) << FRAME_LEN_SHIFT); + tcph->ack_seq = frag_id; + tcph->ack = 1; + tcph->psh = 1; + + switch (get_ip_summed(skb)) { + case OVS_CSUM_PARTIAL: + stth->flags |= STT_FLAG_CHECKSUM_PARTIAL; + tcph->check = ~tcp_v4_check(skb->len, + ip_hdr(skb)->saddr, + ip_hdr(skb)->daddr, 0); + skb->csum_start = skb_transport_header(skb) - skb->head; + skb->csum_offset = offsetof(struct tcphdr, check); + break; + case OVS_CSUM_UNNECESSARY: + stth->flags |= STT_FLAG_CHECKSUM_VERIFIED; + pr_debug_once("%s: checsum unnecessary\n", __func__); + default: + tcph->check = 0; + skb->csum = skb_checksum(skb, skb_transport_offset(skb), + skb->len - skb_transport_offset(skb), + 0); + tcph->check = tcp_v4_check(skb->len - skb_transport_offset(skb), + 
ip_hdr(skb)->saddr, + ip_hdr(skb)->daddr, skb->csum); + set_ip_summed(skb, OVS_CSUM_UNNECESSARY); + } + forward_ip_summed(skb, 1); + + return skb; +err: + kfree_skb(skb); + return NULL; +} + +static inline struct capwap_net *ovs_get_stt_net(struct net *net) +{ + struct ovs_net *ovs_net = net_generic(net, ovs_net_id); + return &ovs_net->vport_net.stt; +} + +static struct sk_buff *process_stt_proto(struct sk_buff *skb, __be64 *key) +{ + struct tcphdr *tcph = tcp_hdr(skb); + struct stthdr *stth; + u16 frame_len; + + skb_postpull_rcsum(skb, skb_transport_header(skb), + STT_SEG_HLEN + ETH_HLEN); + + frame_len = ntohl(tcph->seq) >> FRAME_LEN_SHIFT; + if (stt_seg_len(skb) < frame_len) { + skb = defrag(skb, frame_len); + if (!skb) + return NULL; + } + + if (skb->len < (tcph->doff << 2) || tcp_checksum_complete(skb)) { + if (net_ratelimit()) { + struct iphdr *iph = ip_hdr(skb); + pr_info("stt: dropped frame with " + "invalid checksum (%pI4, %d)->(%pI4, %d)\n", + &iph->saddr, ntohs(tcph->source), + &iph->daddr, ntohs(tcph->dest)); + } + goto error; + } + + /* STT_FRAME_HLEN less two pad bytes is needed here. + * STT_FRAME_HLEN is needed by our caller, stt_rcv(). + * An additional ETH_HLEN bytes are required by ovs_flow_extract() + * which is called indirectly by our caller. + */ + if (unlikely(!pskb_may_pull(skb, STT_FRAME_HLEN + ETH_HLEN))) { + if (net_ratelimit()) + pr_info("dropped frame that is too short! %d < %lu\n", + skb->len, STT_FRAME_HLEN + ETH_HLEN); + goto error; + } + + stth = stt_hdr(skb); + /* Only accept STT version 0, its all we know */ + if (stth->version != 0) + goto error; + + *key = stth->context_id; + __vlan_hwaccel_put_tag(skb, ntohs(stth->vlan_tci)); + + return skb; +error: + kfree_skb(skb); + return NULL; +} + +/* Called with rcu_read_lock and BH disabled. 
*/ +static int stt_rcv(struct sock *sk, struct sk_buff *skb) +{ + struct vport *vport; + const struct tnl_mutable_config *mutable; + struct iphdr *iph; + __be64 key = 0; + + /* pskb_may_pull() has already been called for + * sizeof(struct tcphdr) in tcp_v4_rcv(), so there + * is no need to do so again here + */ + + skb = process_stt_proto(skb, &key); + if (unlikely(!skb)) + goto out; + + iph = ip_hdr(skb); + vport = ovs_tnl_find_port(sock_net(sk), iph->daddr, iph->saddr, key, + TNL_T_PROTO_STT, &mutable); + if (unlikely(!vport)) { + icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0); + goto error; + } + + if (mutable->flags & TNL_F_IN_KEY_MATCH) + OVS_CB(skb)->tun_id = key; + else + OVS_CB(skb)->tun_id = 0; + + __skb_pull(skb, STT_FRAME_HLEN); + skb_postpull_rcsum(skb, skb_transport_header(skb), + STT_FRAME_HLEN + ETH_HLEN); + if (get_ip_summed(skb) == OVS_CSUM_PARTIAL) + skb->csum_start += STT_FRAME_HLEN; + + ovs_tnl_rcv(vport, skb, iph->tos); + goto out; + +error: + kfree_skb(skb); +out: + return 0; +} + +static const struct tnl_ops stt_tnl_ops = { + .tunnel_type = TNL_T_PROTO_STT, + .ipproto = IPPROTO_TCP, + .hdr_len = stt_hdr_len, + .build_header = stt_build_header, + .update_header = stt_update_header, +}; + +static int init_socket(struct net *net) +{ + int err; + struct capwap_net *stt_net = ovs_get_stt_net(net); + struct sockaddr_in sin; + + if (stt_net->n_tunnels) { + stt_net->n_tunnels++; + return 0; + } + + err = sock_create_kern(AF_INET, SOCK_STREAM, 0, + &stt_net->capwap_rcv_socket); + if (err) + goto error; + + /* release net ref. 
*/ + sk_change_net(stt_net->capwap_rcv_socket->sk, net); + + sin.sin_family = AF_INET; + sin.sin_addr.s_addr = htonl(INADDR_ANY); + sin.sin_port = htons(STT_DST_PORT); + + err = kernel_bind(stt_net->capwap_rcv_socket, (struct sockaddr *)&sin, + sizeof(struct sockaddr_in)); + if (err) + goto error_sock; + + tcp_sk(stt_net->capwap_rcv_socket->sk)->encap_rcv = stt_rcv; + tcp_encap_enable(); + + stt_net->frag_state.timeout = STT_FRAG_TIMEOUT; + stt_net->frag_state.high_thresh = STT_FRAG_MAX_MEM; + stt_net->frag_state.low_thresh = STT_FRAG_PRUNE_MEM; + + inet_frags_init_net(&stt_net->frag_state); + + err = kernel_listen(stt_net->capwap_rcv_socket, 7); + if (err) + goto error_sock; + + stt_net->n_tunnels++; + return 0; + +error_sock: + sk_release_kernel(stt_net->capwap_rcv_socket->sk); +error: + pr_warn("cannot register protocol handler : %d\n", err); + return err; +} + +/* XXX: Possible Consolidation: Very similar to vport-capwap.c:release_socket() */ +static void release_socket(struct net *net) +{ + struct capwap_net *stt_net = ovs_get_stt_net(net); + + stt_net->n_tunnels--; + if (stt_net->n_tunnels) + return; + + inet_frags_exit_net(&stt_net->frag_state, &frag_state); + sk_release_kernel(stt_net->capwap_rcv_socket->sk); +} + +/* XXX: Possible Consolidation: Very similar to capwap_create() */ +static struct vport *stt_create(const struct vport_parms *parms) +{ + struct vport *vport; + int err; + + err = init_socket(ovs_dp_get_net(parms->dp)); + if (err) + return ERR_PTR(err); + + vport = ovs_tnl_create(parms, &ovs_stt_vport_ops, &stt_tnl_ops); + if (IS_ERR(vport)) + release_socket(ovs_dp_get_net(parms->dp)); + + return vport; +} + +/* XXX: Possible Consolidation: Same as capwap_destroy() */ +static void stt_destroy(struct vport *vport) +{ + ovs_tnl_destroy(vport); + release_socket(ovs_dp_get_net(vport->dp)); +} + +/* XXX: Possible Consolidation: Same as capwap_init() */ +static int stt_init(void) +{ + inet_frags_init(&frag_state); + get_random_bytes(&stt_port_rnd, 
sizeof(stt_port_rnd)); + return 0; +} + +/* XXX: Possible Consolidation: Same as capwap_exit() */ +static void stt_exit(void) +{ + inet_frags_fini(&frag_state); +} + +/* All of the following functions relate to fragmentation reassembly. */ + +static struct frag_queue *ifq_cast(struct inet_frag_queue *ifq) +{ + return container_of(ifq, struct frag_queue, ifq); +} + +/* XXX: Possible Consolidation: Identical to to vport-capwap.c:frag_hash() */ +static u32 frag_hash(struct frag_match *match) +{ + return jhash_3words((__force u16)match->id, (__force u32)match->saddr, + (__force u32)match->daddr, + frag_state.rnd) & (INETFRAGS_HASHSZ - 1); +} + +/* XXX: Possible Consolidation: Identical to to vport-capwap.c:queue_find() */ +static struct frag_queue *queue_find(struct netns_frags *ns_frag_state, + struct frag_match *match) +{ + struct inet_frag_queue *ifq; + + read_lock(&frag_state.lock); + + ifq = inet_frag_find(ns_frag_state, &frag_state, match, frag_hash(match)); + if (!ifq) + return NULL; + + /* Unlock happens inside inet_frag_find(). */ + + return ifq_cast(ifq); +} + +/* XXX: Possible Consolidation: Identical to to vport-capwap.c:frag_reasm() */ +static struct sk_buff *frag_reasm(struct frag_queue *fq, struct net_device *dev) +{ + struct sk_buff *head = fq->ifq.fragments; + struct sk_buff *frag; + + /* Succeed or fail, we're done with this queue. */ + inet_frag_kill(&fq->ifq, &frag_state); + + if (fq->ifq.len > 65535) + return NULL; + + /* Can't have the head be a clone. */ + if (skb_cloned(head) && pskb_expand_head(head, 0, 0, GFP_ATOMIC)) + return NULL; + + /* + * We're about to build frag list for this SKB. If it already has a + * frag list, alloc a new SKB and put the existing frag list there. 
+ */ + if (skb_shinfo(head)->frag_list) { + int i; + int paged_len = 0; + + frag = alloc_skb(0, GFP_ATOMIC); + if (!frag) + return NULL; + + frag->next = head->next; + head->next = frag; + skb_shinfo(frag)->frag_list = skb_shinfo(head)->frag_list; + skb_shinfo(head)->frag_list = NULL; + + for (i = 0; i < skb_shinfo(head)->nr_frags; i++) + paged_len += skb_shinfo(head)->frags[i].size; + frag->len = frag->data_len = head->data_len - paged_len; + head->data_len -= frag->len; + head->len -= frag->len; + + frag->ip_summed = head->ip_summed; + atomic_add(frag->truesize, &fq->ifq.net->mem); + } + + skb_shinfo(head)->frag_list = head->next; + atomic_sub(head->truesize, &fq->ifq.net->mem); + + /* Properly account for data in various packets. */ + for (frag = head->next; frag; frag = frag->next) { + head->data_len += frag->len; + head->len += frag->len; + + if (head->ip_summed != frag->ip_summed) + head->ip_summed = CHECKSUM_NONE; + else if (head->ip_summed == CHECKSUM_COMPLETE) + head->csum = csum_add(head->csum, frag->csum); + + head->truesize += frag->truesize; + atomic_sub(frag->truesize, &fq->ifq.net->mem); + } + + head->next = NULL; + head->dev = dev; + head->tstamp = fq->ifq.stamp; + fq->ifq.fragments = NULL; + + return head; +} + +/* XXX: Possible Consolidation: Identical to to vport-capwap.c:frag_queue() */ +static struct sk_buff *frag_queue(struct frag_queue *fq, struct sk_buff *skb, + u16 offset, bool frag_last) +{ + struct sk_buff *prev, *next; + struct net_device *dev; + int end; + + if (fq->ifq.last_in & INET_FRAG_COMPLETE) + goto error; + + if (stt_seg_len(skb) <= 0) + goto error; + + end = offset + stt_seg_len(skb); + + if (frag_last) { + /* + * Last fragment, shouldn't already have data past our end or + * have another last fragment. + */ + if (end < fq->ifq.len || fq->ifq.last_in & INET_FRAG_LAST_IN) + goto error; + + fq->ifq.last_in |= INET_FRAG_LAST_IN; + fq->ifq.len = end; + } else { + /* Fragments should align to 8 byte chunks. 
*/ + if (end & ~FRAG_OFF_MASK) + goto error; + + if (end > fq->ifq.len) { + /* + * Shouldn't have data past the end, if we already + * have one. + */ + if (fq->ifq.last_in & INET_FRAG_LAST_IN) + goto error; + + fq->ifq.len = end; + } + } + + /* Find where we fit in. */ + prev = NULL; + for (next = fq->ifq.fragments; next != NULL; next = next->next) { + if (FRAG_CB(next)->offset >= offset) + break; + prev = next; + } + + /* + * Overlapping fragments aren't allowed. We shouldn't start before + * the end of the previous fragment. + */ + if (prev && FRAG_CB(prev)->offset + stt_seg_len(prev) > offset) + goto error; + + /* We also shouldn't end after the beginning of the next fragment. */ + if (next && end > FRAG_CB(next)->offset) + goto error; + + FRAG_CB(skb)->offset = offset; + + /* Link into list. */ + skb->next = next; + if (prev) + prev->next = skb; + else + fq->ifq.fragments = skb; + + dev = skb->dev; + skb->dev = NULL; + + fq->ifq.stamp = skb->tstamp; + fq->ifq.meat += stt_seg_len(skb); + atomic_add(skb->truesize, &fq->ifq.net->mem); + if (offset == 0) + fq->ifq.last_in |= INET_FRAG_FIRST_IN; + + /* If we have all fragments do reassembly. 
*/ + if (fq->ifq.last_in == (INET_FRAG_FIRST_IN | INET_FRAG_LAST_IN) && + fq->ifq.meat == fq->ifq.len) + return frag_reasm(fq, dev); + + write_lock(&frag_state.lock); + list_move_tail(&fq->ifq.lru_list, &fq->ifq.net->lru_list); + write_unlock(&frag_state.lock); + + return NULL; + +error: + kfree_skb(skb); + return NULL; +} + +/* XXX: Possible Consolidation: Similar to vport-capwap.c:defrag() */ +static struct sk_buff *defrag(struct sk_buff *skb, u16 frame_len) +{ + struct iphdr *iph = ip_hdr(skb); + struct tcphdr *tcph = tcp_hdr(skb); + struct netns_frags *ns_frag_state; + struct frag_match match; + u16 frag_off; + struct frag_queue *fq; + bool frag_last = false; + + if (unlikely(!skb->dev)) { + if (net_ratelimit()) + pr_err("%s: No skb->dev!\n", __func__); + goto out; + } + + ns_frag_state = &ovs_get_stt_net(dev_net(skb->dev))->frag_state; + if (atomic_read(&ns_frag_state->mem) > ns_frag_state->high_thresh) + inet_frag_evictor(ns_frag_state, &frag_state); + + match.daddr = iph->daddr; + match.saddr = iph->saddr; + match.id = tcph->ack_seq; + frag_off = ntohl(tcph->seq) & FRAG_OFF_MASK; + if (frame_len == stt_seg_len(skb) + frag_off) + frag_last = true; + + fq = queue_find(ns_frag_state, &match); + if (fq) { + spin_lock(&fq->ifq.lock); + skb = frag_queue(fq, skb, frag_off, frag_last); + spin_unlock(&fq->ifq.lock); + + inet_frag_put(&fq->ifq, &frag_state); + + return skb; + } + +out: + kfree_skb(skb); + return NULL; +} + +/* XXX: Possible Consolidation: Functionally identical to capwap_frag_init */ +static void stt_frag_init(struct inet_frag_queue *ifq, void *match_) +{ + struct frag_match *match = match_; + + ifq_cast(ifq)->match = *match; +} + +/* XXX: Possible Consolidation: Functionally identical to capwap_frag_hash */ +static unsigned int stt_frag_hash(struct inet_frag_queue *ifq) +{ + return frag_hash(&ifq_cast(ifq)->match); +} + +/* XXX: Possible Consolidation: Almost functionally identical to capwap_frag_match */ +static int stt_frag_match(struct 
inet_frag_queue *ifq, void *a_) +{ + struct frag_match *a = a_; + struct frag_match *b = &ifq_cast(ifq)->match; + + return a->id == b->id && a->saddr == b->saddr && a->daddr == b->daddr; +} + +/* Run when the timeout for a given queue expires. */ +/* XXX: Possible Consolidation: Functionally identical to capwap_frag_hash */ +static void stt_frag_expire(unsigned long ifq) +{ + struct frag_queue *fq; + + fq = ifq_cast((struct inet_frag_queue *)ifq); + + spin_lock(&fq->ifq.lock); + + if (!(fq->ifq.last_in & INET_FRAG_COMPLETE)) + inet_frag_kill(&fq->ifq, &frag_state); + + spin_unlock(&fq->ifq.lock); + inet_frag_put(&fq->ifq, &frag_state); +} + +const struct vport_ops ovs_stt_vport_ops = { + .type = OVS_VPORT_TYPE_STT, + .flags = VPORT_F_TUN_ID, + .init = stt_init, + .exit = stt_exit, + .create = stt_create, + .destroy = stt_destroy, + .set_addr = ovs_tnl_set_addr, + .get_name = ovs_tnl_get_name, + .get_addr = ovs_tnl_get_addr, + .get_options = ovs_tnl_get_options, + .set_options = ovs_tnl_set_options, + .get_dev_flags = ovs_vport_gen_get_dev_flags, + .is_running = ovs_vport_gen_is_running, + .get_operstate = ovs_vport_gen_get_operstate, + .send = ovs_tnl_send, +}; +#else +#warning STT requires TCP encap_rcv hook in Kernel +#endif /* HAVE_TCP_ENCAP_RCV */ diff --git a/datapath/vport.c b/datapath/vport.c index b75a866..575e7a2 100644 --- a/datapath/vport.c +++ b/datapath/vport.c @@ -44,6 +44,9 @@ static const struct vport_ops *base_vport_ops_list[] = { #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,26) &ovs_capwap_vport_ops, #endif +#ifdef HAVE_TCP_ENCAP_RCV + &ovs_stt_vport_ops, +#endif }; static const struct vport_ops **vport_ops_list; diff --git a/datapath/vport.h b/datapath/vport.h index 2aafde0..3994eb1 100644 --- a/datapath/vport.h +++ b/datapath/vport.h @@ -33,6 +33,7 @@ struct vport_parms; struct vport_net { struct capwap_net capwap; + struct capwap_net stt; }; /* The following definitions are for users of the vport subsytem: */ @@ -257,5 +258,6 @@ extern const 
struct vport_ops ovs_internal_vport_ops; extern const struct vport_ops ovs_patch_vport_ops; extern const struct vport_ops ovs_gre_vport_ops; extern const struct vport_ops ovs_capwap_vport_ops; +extern const struct vport_ops ovs_stt_vport_ops; #endif /* vport.h */ diff --git a/include/linux/openvswitch.h b/include/linux/openvswitch.h index 0578b5f..47f6dca 100644 --- a/include/linux/openvswitch.h +++ b/include/linux/openvswitch.h @@ -185,6 +185,7 @@ enum ovs_vport_type { OVS_VPORT_TYPE_PATCH = 100, /* virtual tunnel connecting two vports */ OVS_VPORT_TYPE_GRE, /* GRE tunnel */ OVS_VPORT_TYPE_CAPWAP, /* CAPWAP tunnel */ + OVS_VPORT_TYPE_STT, /* STT tunnel */ __OVS_VPORT_TYPE_MAX }; diff --git a/lib/netdev-vport.c b/lib/netdev-vport.c index 7bd50a4..346878b 100644 --- a/lib/netdev-vport.c +++ b/lib/netdev-vport.c @@ -165,6 +165,9 @@ netdev_vport_get_netdev_type(const struct dpif_linux_vport *vport) case OVS_VPORT_TYPE_CAPWAP: return "capwap"; + case OVS_VPORT_TYPE_STT: + return "stt"; + case __OVS_VPORT_TYPE_MAX: break; } @@ -965,7 +968,11 @@ netdev_vport_register(void) { OVS_VPORT_TYPE_PATCH, { "patch", VPORT_FUNCTIONS(NULL) }, - parse_patch_config, unparse_patch_config } + parse_patch_config, unparse_patch_config }, + + { OVS_VPORT_TYPE_STT, + { "stt", VPORT_FUNCTIONS(netdev_vport_get_drv_info) }, + parse_tunnel_config, unparse_tunnel_config } }; int i; diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml index f3ea338..d8c860e 100644 --- a/vswitchd/vswitch.xml +++ b/vswitchd/vswitch.xml @@ -1177,6 +1177,16 @@ A pair of virtual devices that act as a patch cable. +
stt
+
+ An Ethernet tunnel over STT (IETF draft-davie-stt-01). TCP
+ port 58882 is used as the destination port and ports from the
+ ephemeral range, which may be configured via proc using
+ sys/net/ipv4/ip_local_port_range, are used as the source ports.
+ STT currently requires modifications to the Linux kernel and is
+ not supported by any released kernel version.
+
+
null
An ignored interface.