[RFC,v4] Add TCP encap_rcv hook (repost)

Message ID 20120423083007.GB22556@verge.net.au
State RFC, archived
Delegated to: David Miller

Commit Message

Simon Horman April 23, 2012, 8:30 a.m. UTC
On Mon, Apr 23, 2012 at 03:36:58AM -0400, David Miller wrote:
> From: Simon Horman <horms@verge.net.au>
> Date: Mon, 23 Apr 2012 14:14:02 +0900
> 
> > On Sun, Apr 22, 2012 at 11:54:42AM -0400, Jamal Hadi Salim wrote:
> >> On Sun, 2012-04-22 at 08:22 -0700, Stephen Hemminger wrote:
> >> 
> >> > STT isn't really doing TCP, it's just lying and pretending to be
> >> > TCP to allow TSO to work! There is no packet ordering, sequence
> >> > numbers or any real transport layer. 
> > 
> > Yes, that is my understanding. Originally I envisaged that an STT
> > implementation would rely more heavily on the TCP stack. However, as
> > STT doesn't rely on any of the features of TCP other than its header,
> > this was not the case, and (almost) bypassing the TCP stack seems
> > to be sufficient.
> > 
> > I believe the motivation for reusing TCP is, as Stephen suggests,
> > to allow some hardware acceleration to occur.
> 
> Yes, this is what the IETF draft states.
> 
> But I wonder about your encap_rcv hook placement, never mind
> that your posted patch won't compile since tcp_sock lacks
> an encap_tcv member and your patch didn't add one. :-)

I'm pretty sure the patch I posted added encap_rcv to tcp_sock.
Am I missing the point?
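
For readers without the earlier patch to hand: the hook mirrors the
long-standing encap_rcv mechanism on udp_sock. A minimal sketch of the
shape of such a change (placement and exact semantics are assumptions
here, not a copy of the posted patch):

/* include/linux/tcp.h -- sketch only, not the posted patch */
struct tcp_sock {
        /* ... existing members ... */

        /*
         * If non-NULL, called early on receive before normal TCP
         * processing, in the style of udp_sock->encap_rcv.  A return
         * value of 0 means the handler consumed the skb.
         */
        int (*encap_rcv)(struct sock *sk, struct sk_buff *skb);
};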

> You'll need to somehow create either a fully established or a
> listening socket for that hook to work.
> 
> You'd need to perform a full handshake to get a socket into
> established state, and it seems STT doesn't do a TCP handshake.
> 
> That leaves you with the listening socket option, and in that case I
> want to know how you're going to send packets out of this STT tunnel?

Currently I am setting up a listening socket. The Open vSwitch tunneling
code transmits skbs using either dev_queue_xmit() or ip_local_out().
I'm not sure that I have exercised the ip_local_out() case yet.
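
For concreteness, a minimal sketch of that receive-side setup follows.
sock_create_kern(), kernel_bind() and kernel_listen() are standard
kernel APIs; encap_rcv_enable() is the helper from the proposed TCP
patch, and its exact signature is an assumption:

/* Sketch: in-kernel listening TCP socket for STT.  stt_rcv() is the
 * STT receive handler (not shown). */
static struct socket *stt_sock;
static int stt_rcv(struct sock *sk, struct sk_buff *skb);

static int stt_socket_init(__be16 port)
{
        struct sockaddr_in sin = {
                .sin_family      = AF_INET,
                .sin_addr.s_addr = htonl(INADDR_ANY),
                .sin_port        = port,
        };
        int err;

        err = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &stt_sock);
        if (err)
                return err;

        err = kernel_bind(stt_sock, (struct sockaddr *)&sin, sizeof(sin));
        if (err)
                goto error;

        err = kernel_listen(stt_sock, 0);       /* listen, never accept */
        if (err)
                goto error;

        /* Assumed helper from the proposed encap_rcv patch. */
        encap_rcv_enable(stt_sock->sk, stt_rcv);
        return 0;

error:
        sock_release(stt_sock);
        stt_sock = NULL;
        return err;
}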

But perhaps that doesn't answer your question?

> In order to get the advertised benefits of this STT thing, you'll need
> to go through the whole TCP data packet sending engine, in order to
> get all the TSO/GSO stuff initialized properly on the SKB so the NIC
> will do its thing.
> 
> But you can't send data out of an un-established TCP socket.
> 
> At the very least, we'll need to see the rest of your full
> implementation before we can say whether this encap_rcv hook is the
> right way to do things.

Sure, I'm happy to provide my implementation, though it is still WIP.
The most recent patch is below.

I should point out that the actual transmission of packets occurs outside
of that patch in existing Open vSwitch code. I am unsure of the best
way to make that available to you.

It is the ovs_tnl_send() function in datapath/tunnel.c
which is available in the openvswitch git repository.

git://openvswitch.org/openvswitch

For reference I have included the file in this email after the STT patch.


---- begin stt patch ----
tunnelling: stt: Prototype Implementation

This is a not-yet-well-exercised implementation of STT intended for review;
I am sure there are numerous areas that need improvement.

In particular:
- The transmit path's generation of partial checksums needs to be tested
- The VLAN stripping code needs to be exercised
- The code needs to be exercised in the presence of HW checksumming
- In general, the code has been exercised by running Open vSwitch in
  KVM guests on the same host. Testing between physical hosts is needed.

This implementation is based on the CAPWAP implementation and in particular
includes defragmentation code almost identical to CAPWAP. It seems to me
that while fragmentation can be handled by GSO/TSO, defragmentation code is
needed in STT in the case where LRO/GRO doesn't reassemble an entire STT
frame for some reason.

If the defragmentation code, which is of non-trivial length, remains more
or less in its present state then there is some scope for consolidation
with CAPWAP. Other code that may possibly be consolidated with CAPWAP has
been marked accordingly.

This code depends on an encap_rcv hook being added to the Linux kernel's TCP
stack. A patch to add such a hook will be posted separately. Ultimately
this change or some alternative will need to be applied to the mainline
Linux kernel's TCP stack if STT is to be widely deployed. Motivating this
change to the TCP stack is part of the purpose of this prototype STT
implementation.
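
As an illustration of how the receive side could consult such a hook
(a sketch by analogy with UDP's encap_rcv; the exact call site and
return-value convention belong to the separate TCP patch):

/* Sketch: give an encapsulation handler first look at the skb before
 * normal TCP receive processing, like udp_queue_rcv_skb() does. */
static inline int tcp_encap_rcv(struct sock *sk, struct sk_buff *skb)
{
        struct tcp_sock *tp = tcp_sk(sk);
        int (*handler)(struct sock *sk, struct sk_buff *skb);

        handler = ACCESS_ONCE(tp->encap_rcv);
        if (handler && handler(sk, skb) == 0)
                return 0;       /* consumed by the encapsulation handler */

        return 1;               /* continue with normal TCP processing */
}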

The configuration of STT is analogous to that of other tunneling
protocols such as GRE which are supported by Open vSwitch.

e.g.

ovs-vsctl add bridge project0 ports @newport \
        -- --id=@newport create port name=stt0 interfaces=[@newinterface] \
        -- --id=@newinterface create interface name=stt0 type=stt options="remote_ip=10.0.99.192,key=64"

Signed-off-by: Simon Horman <horms@verge.net.au>

---

v3
* Correct stripping of vlan tag on transmit
* Correct setting of vlan TCI on receive
  - Use __vlan_hwaccel_put_tag instead of vlan_put_tag
* Use encap_rcv_enable() to enable receiving packets from the TCP stack
  - This is an update for the new implementation of the TCP stack
    patch that adds encap_rcv
* Call pskb_may_pull() for STT_FRAME_HLEN + ETH_HLEN bytes in
  process_stt_proto() as this is required by ovs_flow_extract()
* Include "stt: " in pr_fmt
* Make use of pr_* instead of printk
* Rate limit all packet-generated pr_* messages
* STT flags are 8 bits wide so don't define them using __cpu_to_be16()
* Only include l4_offset if
  1. get_ip_summed(skb) is OVS_CSUM_PARTIAL
  2. skb->csum_start is non-zero
  3. it is between 0 and 255
  - Warn if the first two conditions are met but not the third one.
* Only set STT_FLAG_CHECKSUM_VERIFIED if
  get_ip_summed(skb) is OVS_CSUM_UNNECESSARY
* Print a debug message if get_ip_summed(skb) is OVS_CSUM_UNNECESSARY,
  this case is yet to be exercised
* In the rx path, adjust skb->csum_start to take into account pulling
  STT_FRAME_HLEN if get_ip_summed(skb) is OVS_CSUM_PARTIAL
* Warn if skb->dev is NULL on defragmentation and stop processing the skb.
  - This fixes a crash bug
  - But how can this occur?

v2

* Transmit
  - Correct calculation of segment offset
  - Streamline source port calculation and setting STT_FLAG_IP_VERSION.
    This allows IPv4 and IPv6 to share more code and reduces the
    overall amount of code.
  - Calculate partial checksum for GSO skbs. Is this correct?
  - Only calculate full checksum for non-GSO skbs.
  - Set STT_FLAG_CHECKSUM_VERIFIED for all non-GSO skbs.
  - Remove use of l4_offset, the patch modifying the tunnelling code
    to supply this has been dropped. Instead calculate the value
    based on csum_start if it is set and the network protocol of
    the inner packet is IPv4 or IPv6

* Receive
  - Correct number of bytes pulled
    + Only the TCP header plus the STT header less the pad needs to be pulled.
  - Only access STT header after it has been pulled
  - Verify checksum on receive
  - Remove use of encap_type, it is no longer present in the proposed
    TCP stack patch
  - Use the acknowledgement (tcph->ack_seq) as the fragment id
    in defragmentation

* Transmit and Receive
  - Add stt_seg_len() helper and use it in segmentation and desegmentation
    code. This corrects several offset calculation errors.
---
 acinclude.m4                |    3 +
 datapath/Modules.mk         |    3 +-
 datapath/tunnel.h           |    1 +
 datapath/vport-stt.c        |  803 +++++++++++++++++++++++++++++++++++++++++++
 datapath/vport.c            |    3 +
 datapath/vport.h            |    2 +
 include/linux/openvswitch.h |    1 +
 lib/netdev-vport.c          |    9 +-
 vswitchd/vswitch.xml        |   10 +
 9 files changed, 833 insertions(+), 2 deletions(-)
 create mode 100644 datapath/vport-stt.c

Comments

David Miller April 23, 2012, 7:15 p.m. UTC | #1
From: Simon Horman <horms@verge.net.au>
Date: Mon, 23 Apr 2012 17:30:08 +0900

> I'm pretty sure the patch I posted added encap_rcv to tcp_sock.
> Am I missing the point?

It did, my eyes are failing me :-)

> Currently I am setting up a listening socket. The Open vSwitch tunneling
> code transmits skbs using either dev_queue_xmit() or ip_local_out().
> I'm not sure that I have exercised the ip_local_out() case yet.

I don't see where on transmit you're going to realize the primary
stated benefit of STT, that being TSO/GSO.

You'll probably want to gather as many packets as possible into a
larger STT frame for this purpose.  And when switching between STT
tunnels, leave the packet alone since a GRO STT frame on receive will
transparently become an STT GSO frame on transmit.

stephen hemminger April 23, 2012, 7:19 p.m. UTC | #2
On Mon, 23 Apr 2012 15:15:33 -0400 (EDT)
David Miller <davem@davemloft.net> wrote:

> [...]
> 
> I don't see where on transmit you're going to realize the primary
> stated benefit of STT, that being TSO/GSO.
> 
> You'll probably want to gather as many packets as possible into a
> larger STT frame for this purpose.  And when switching between STT
> tunnels, leave the packet alone since a GRO STT frame on receive will
> transparently become an STT GSO frame on transmit.
> 

I think the point of the TSO hack is to get around the MTU problem when tunneling.
The added header of the tunnel eats into the possible MTU. The use of TSO
in STT is designed to deal with the fact that hardware can't do IP
fragmentation of encapsulated IP (or UDP) packets.
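
To put rough numbers on that (header sizes per the STT draft, so treat
them as approximate): a full 1500-byte inner IP packet plus its 14-byte
Ethernet header picks up about 18 bytes of STT frame header, 20 bytes
of TCP-like header and 20 bytes of outer IPv4 header, i.e.
1514 + 18 + 20 + 20 = 1572 bytes, which no longer fits a 1500-byte
physical MTU. TSO lets the NIC segment the oversized frame instead of
relying on IP fragmentation.
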
Jesse Gross April 23, 2012, 8:08 p.m. UTC | #3
On Mon, Apr 23, 2012 at 12:19 PM, Stephen Hemminger
<shemminger@vyatta.com> wrote:
> [...]
>
> I think the point of the TSO hack is to get around the MTU problem when tunneling.
> The added header of the tunnel eats into the possible MTU. The use of TSO
> in STT is designed to deal with the fact that hardware can't do IP
> fragmentation of encapsulated IP (or UDP) packets.

That is a beneficial side effect, although the main goal is just to
get back all of the offloads that are lost because hardware can't see
inside of encapsulated packets, with TSO, LRO, and RSS being the main
examples.

Assuming that the TCP stack generates large TSO frames on transmit
(which could be the local stack; something sent by a VM; or packets
received, coalesced by GRO and then encapsulated by STT) then you can
just prepend the STT header (possibly adjusting things like
requested MSS, number of segments, etc.).  After that it's
possible to just output the resulting frame through the IP stack like
all tunnels do today.  Similarly, on the other side the NIC will be
able to perform its normal offloading operations as well.
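
As a sketch of that prepend step (struct stthdr, STT_HDR_LEN and the
fields set here are assumptions based on the STT draft, not code from
this patch):

/* Sketch: prepend an STT frame header to an already-built TSO skb. */
static int stt_prepend_header(struct sk_buff *skb, __be64 ctx_id)
{
        struct stthdr *stth;

        /* Make room for the STT header in front of the inner frame. */
        if (skb_cow_head(skb, STT_HDR_LEN))
                return -ENOMEM;

        stth = (struct stthdr *)__skb_push(skb, STT_HDR_LEN);
        memset(stth, 0, STT_HDR_LEN);
        stth->context_id = ctx_id;
        /* Record the inner MSS so the receiver can reconstruct it. */
        stth->mss = htons(skb_shinfo(skb)->gso_size);

        return 0;
}
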
David Miller April 23, 2012, 8:13 p.m. UTC | #4
From: Jesse Gross <jesse@nicira.com>
Date: Mon, 23 Apr 2012 13:08:49 -0700

> Assuming that the TCP stack generates large TSO frames on transmit
> (which could be the local stack; something sent by a VM; or packets
> received, coalesced by GRO and then encapsulated by STT) then you can
> just prepend the STT header (possibly adjusting things like
> requested MSS, number of segments, etc.).  After that it's
> possible to just output the resulting frame through the IP stack like
> all tunnels do today.

Which seems to potentially suggest a stronger integration of the STT
tunnel transmit path into our IP stack rather than the approach Simon
is taking.

Jesse Gross April 23, 2012, 8:53 p.m. UTC | #5
On Mon, Apr 23, 2012 at 1:13 PM, David Miller <davem@davemloft.net> wrote:
> [...]
>
> Which seems to potentially suggest a stronger integration of the STT
> tunnel transmit path into our IP stack rather than the approach Simon
> is taking.

Did you have something in mind?  Since the originating stack already
generates TSO frames today, it's just a few lines of code to adjust
for the addition of the STT header as the skb is encapsulated.
Otherwise, the transmit path is the same as something like GRE.  L2TP
follows a fairly similar path - on receive it binds to a listening UDP
socket and on transmit it prepends a header, sets up checksum
offloading, and outputs directly via ip_queue_xmit().
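
Concretely, the checksum-offload part of that recipe looks something
like the following sketch, mirroring what the L2TP code does (the
length handling assumes skb->len currently covers the TCP-like header
plus payload):

/* Sketch: hand the outer TCP-like checksum to the NIC, L2TP-style. */
static void stt_setup_csum(struct sk_buff *skb, __be32 saddr, __be32 daddr)
{
        struct tcphdr *th = tcp_hdr(skb);

        /* Seed the checksum field with the pseudo-header sum; the NIC
         * (or the software fallback) completes it over the payload. */
        th->check = ~tcp_v4_check(skb->len, saddr, daddr, 0);
        skb->ip_summed = CHECKSUM_PARTIAL;
        skb->csum_start = skb_transport_header(skb) - skb->head;
        skb->csum_offset = offsetof(struct tcphdr, check);
}
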
David Miller April 23, 2012, 9:08 p.m. UTC | #6
From: Jesse Gross <jesse@nicira.com>
Date: Mon, 23 Apr 2012 13:53:42 -0700

> [...]
>>
>> Which seems to potentially suggest a stronger integration of the STT
>> tunnel transmit path into our IP stack rather than the approach Simon
>> is taking.
> 
> Did you have something in mind?

A normal bona fide tunnel netdevice driver like GRE instead of the
Open vSwitch approach Simon is using.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jesse Gross April 23, 2012, 9:38 p.m. UTC | #7
On Mon, Apr 23, 2012 at 2:08 PM, David Miller <davem@davemloft.net> wrote:
> [...]
>> Did you have something in mind?
>
> A normal bona fide tunnel netdevice driver like GRE instead of the
> Open vSwitch approach Simon is using.

Ahh, yes, that I agree with.  Independent of this, there's work being
done to make it so that OVS can use the normal in-tree tunneling code
and not need its own.  Once that's done I expect that STT will follow
the same model.
Simon Horman April 23, 2012, 10:32 p.m. UTC | #8
On Mon, Apr 23, 2012 at 02:38:07PM -0700, Jesse Gross wrote:
> [...]
> 
> Ahh, yes, that I agree with.  Independent of this, there's work being
> done to make it so that OVS can use the normal in-tree tunneling code
> and not need its own.  Once that's done I expect that STT will follow
> the same model.

Hi Jesse,

I am wondering how firm the plans are to allow OVS to use in-tree tunnel
code. I'm happy to move my efforts over to an in-tree STT implementation,
but ultimately I would like to get STT running in conjunction with OVS.
Jesse Gross April 23, 2012, 10:59 p.m. UTC | #9
On Mon, Apr 23, 2012 at 3:32 PM, Simon Horman <horms@verge.net.au> wrote:
> [...]
>
> Hi Jesse,
>
> I am wondering how firm the plans are to allow OVS to use in-tree tunnel
> code. I'm happy to move my efforts over to an in-tree STT implementation,
> but ultimately I would like to get STT running in conjunction with OVS.

I would say that it's a firm goal but the implementation probably
still has a ways to go.  Kyle Mestery (CC'ed) has volunteered to work
on this in support of adding VXLAN, which needs some additional
flexibility that this approach would also provide.  You might want to
talk to him to see if there are ways that you guys can work together
on it if you are interested.  Having better integration with upstream
tunneling is definitely a step that OVS needs to make and sooner would
be better than later.
Simon Horman April 24, 2012, 2:25 a.m. UTC | #10
On Mon, Apr 23, 2012 at 03:59:24PM -0700, Jesse Gross wrote:
> [...]
> 
> I would say that it's a firm goal but the implementation probably
> still has a ways to go.  Kyle Mestery (CC'ed) has volunteered to work
> on this in support of adding VXLAN, which needs some additional
> flexibility that this approach would also provide.  You might want to
> talk to him to see if there are ways that you guys can work together
> on it if you are interested.  Having better integration with upstream
> tunneling is definitely a step that OVS needs to make and sooner would
> be better than later.

Hi Jesse, Hi Kyle,

That sounds like an excellent plan.

Kyle, do you have any thoughts on how we might best work together on this?
Perhaps there are some patches floating around that I could take a look at?

Stephen Hemminger April 24, 2012, 4:40 a.m. UTC | #11
----- Original Message -----
> [...]
> Kyle, do you have any thoughts on how we might best work together on this?
> Perhaps there are some patches floating around that I could take a look at?

ChrisW had a start to a VXLAN tunnel (non OVS), and I promised to work on
finishing it.
Simon Horman April 24, 2012, 5:42 a.m. UTC | #12
On Mon, Apr 23, 2012 at 09:40:57PM -0700, Stephen Hemminger wrote:
> [...]
> ChrisW had a start to a VXLAN tunnel (non OVS), and I promised to work on
> finishing it.

Thanks. I guess that I might be able to base parts of an STT implementation
on that work.

I'd like to use an STT implementation with OVS, so in-tree tunnel support
for OVS is also important to me.

Kyle Mestery (kmestery) April 24, 2012, 4:02 p.m. UTC | #13
On Apr 23, 2012, at 9:25 PM, Simon Horman wrote:
> [...]
> Kyle, do you have any thoughts on how we might best work together on this?
> Perhaps there are some patches floating around that I could take a look at?
> 

Hi Simon:

The VXLAN work has been slow going for me at this point. What I have works, but is far from complete. It's available here:

https://github.com/mestery/ovs-vxlan/tree/vxlan

This is based on a fairly recent version of OVS. I'm currently working to allow tunnels to be flow-based rather than port-based, as they currently exist. As Jesse may have mentioned, doing this allows us to move most tunnel state into user space. The outer header can now be part of the flow lookup and can be passed to user space, so things like multicast learning for VXLAN become possible.
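
For illustration, the kind of addition being described is roughly the
following (a sketch; the real structure and field names in the OVS tree
may differ):

/* Sketch: tunnel metadata carried in the OVS flow key so the outer
 * header can be matched on and passed to user space. */
struct ovs_tun_key {
        __be64 tun_id;          /* key/VNI from the outer header */
        __be32 ipv4_src;        /* outer source address */
        __be32 ipv4_dst;        /* outer destination address */
        u8     tos;             /* outer ToS */
        u8     ttl;             /* outer TTL */
};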

With regards to working together, ping me off-list and we can work something out, I'm very much in favor of this!

Thanks!
Kyle

stephen hemminger April 24, 2012, 4:13 p.m. UTC | #14
On Tue, 24 Apr 2012 16:02:41 +0000
"Kyle Mestery (kmestery)" <kmestery@cisco.com> wrote:

> [...]
> This is based on a fairly recent version of OVS. I'm currently working to allow tunnels to be flow-based rather than port-based, as they currently exist. As Jesse may have mentioned, doing this allows us to move most tunnel state into user space. The outer header can now be part of the flow lookup and can be passed to user space, so things like multicast learning for VXLAN become possible.
> 
> With regards to working together, ping me off-list and we can work something out, I'm very much in favor of this!
> 

My use of VXLAN was to be key based (like existing GRE), not flow based.

Kyle Mestery (kmestery) April 24, 2012, 4:16 p.m. UTC | #15
On Apr 24, 2012, at 11:13 AM, Stephen Hemminger wrote:
> [...]
> My use of VXLAN was to be key based (like existing GRE), not flow based.
> 

Yes, for OVS the idea is to add the tunnel key values to the flow-key in the OVS kernel module.

Simon Horman April 25, 2012, 8:39 a.m. UTC | #16
On Tue, Apr 24, 2012 at 04:02:41PM +0000, Kyle Mestery (kmestery) wrote:
> [...]
> The VXLAN work has been slow going for me at this point. What I have works, but is far from complete. It's available here:
> 
> https://github.com/mestery/ovs-vxlan/tree/vxlan
> 
> This is based on a fairly recent version of OVS. I'm currently working to allow tunnels to be flow-based rather than port-based, as they currently exist.
> As Jesse may have mentioned, doing this allows us to move most tunnel state into user space. The outer header can now be part of the flow lookup and can
> be passed to user space, so things like multicast learning for VXLAN become possible.
> 
> With regards to working together, ping me off-list and we can work something out, I'm very much in favor of this!

Hi Kyle,

The component that is of most interest to me is enabling OVS to use in-tree
tunnelling code - as it seems that makes most sense for an implementation
of STT. I have taken a brief look over your VXLAN work and it isn't clear
to me if it is moving towards being an in-tree implementation.  Moreover,
I'm rather unclear on what changes need to be made to OVS in order for
in-tree tunneling to be used.

My recollection is that OVS did make use of in-tree tunnelling code
but this was removed in favour of the current implementation for various
reasons (performance being one IIRC). I gather that revisiting in-tree
tunnelling won't reintroduce the previous set of problems, but I'm unclear how.

Jesse, is it possible for you to describe that in a little detail
or point me to some information?
Kyle Mestery (kmestery) April 25, 2012, 1:36 p.m. UTC | #17
On Apr 25, 2012, at 3:39 AM, Simon Horman wrote:
> [...]
> 
> Hi Kyle,
> 
> The component that is of most interest to me is enabling OVS to use in-tree
> tunnelling code - as it seems that makes most sense for an implementation
> of STT. I have taken a brief look over your VXLAN work and it isn't clear
> to me if it is moving towards being an in-tree implementation.  Moreover,
> I'm rather unclear on what changes need to be made to OVS in order for
> in-tree tunneling to be used.
> 
> My recollection is that OVS did make use of in-tree tunnelling code
> but this was removed in favour of the current implementation for various
> reasons (performance being one IIRC). I gather that revisiting in-tree
> tunnelling won't reintroduce the previous set of problems, but I'm unclear how.
> 
> Jesse, is it possible for you to describe that in a little detail
> or point me to some information?

Simon:

The changes I have in there now take the first step of adding support for flow-based tunneling, in the case of VXLAN. Once we do that, we can remove (if we want) the existing port-based tunneling code. I would also like to better understand from Jesse the direction with regards to moving to in-tree tunneling; I assume the changes Jesse and I had talked about a few months back around flow-based tunneling will still be compatible with the in-tree tunneling as well.

Thanks,
Kyle

Jesse Gross April 25, 2012, 5:17 p.m. UTC | #18
On Wed, Apr 25, 2012 at 1:39 AM, Simon Horman <horms@verge.net.au> wrote:
> [...]
>
> Hi Kyle,
>
> the component that is of most interest to me is enabling OVS to use in-tree
> tunnelling code - as it seems that makes most sense for an implementation
> of STT. I have taken a brief look over your vxlan work and it isn't clear
> to me if it is moving towards being an in-tree implementation.  Moreover,
> I'm rather unclear on what changes need to be made to OVS in order for
> in-tree tunneling to be used.
>
> My recollection is that OVS did make use of in-tree tunnelling code
> but this was removed in favour of the current implementation for various
> reasons (performance being one IIRC). I gather that revisiting in-tree
> tunnelling won't reintroduce the previous set of problems. But I'm unclear how.
>
> Jesse, is it possible for you to describe that in a little detail
> or point me to some information?

This was what I had originally written a while back, although it's
more about OVS internally and less about how to connect to the in-tree
code:
http://openvswitch.org/pipermail/dev/2012-February/014779.html

In order to flexibly implement support for current and future tunnel
protocols OVS needs to be able to get/set information about the outer
tunnel header when processing the inner packet.  At the very least
this is src/dst IP addresses and the key/ID/VNI/etc.  In the upstream
tunnel implementations those are implicitly encoded in the device that
sends or receives the packet.  However, this has two problems: the
number of devices and the ability to handle unknown values.  We addressed
part of this problem by allowing the tunnel ID to be set and matched
through the OVS flow table and an action.  In order to do this with
the in-tree tunneling code, we obviously need a way of passing this
information around since it would currently get lost as we pass
through the Linux device layer.
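
For illustration, the tunnel ID can already be matched and set through
the flow table with ovs-ofctl, roughly like this (syntax approximate;
exact field and action names may vary by version):

    ovs-ofctl add-flow br0 "tun_id=5,actions=output:1"
    ovs-ofctl add-flow br0 "in_port=1,actions=set_tunnel:5,output:2"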

The plan to deal with that is to add a function to the in-tree
tunneling code that allows a skb to be encapsulated with specific
parameters and conversely a hook to receive decapsulated packets along
with header info.  This would make all of the kernel tunneling code
common, while still giving OVS userspace the ability to implement
essentially any type of tunneling policy.  In many ways, this is very
similar to how vlans look in OVS today.
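
A rough sketch of what such an interface might look like (all names
here are hypothetical, not an existing kernel API):

    /* Outer-header parameters supplied by, or reported to, OVS. */
    struct ovs_tnl_parms {
            __be32 saddr, daddr;    /* outer IP addresses */
            __be64 key;             /* key/ID/VNI */
            u8 tos, ttl;
    };

    /* Encapsulate and transmit skb with explicit outer parameters. */
    int ovs_tnl_xmit(struct sk_buff *skb, const struct ovs_tnl_parms *p);

    /* Hook invoked with each decapsulated packet plus the information
     * recovered from its outer header. */
    typedef int (*ovs_tnl_rcv_t)(struct sk_buff *skb,
                                 const struct ovs_tnl_parms *p);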

While it would be possible to implement the hook to use the in-tree
tunnel code today without a lot of changes, we already know that we
want to move away from the port-based model in the OVS kernel module
towards the flow model.  As we push this upstream the userspace/kernel
API should be the correct one, so that's why these two things are tied
together.
Simon Horman April 26, 2012, 7:13 a.m. UTC | #19
On Wed, Apr 25, 2012 at 10:17:25AM -0700, Jesse Gross wrote:
> On Wed, Apr 25, 2012 at 1:39 AM, Simon Horman <horms@verge.net.au> wrote:
> >
> > Hi Kyle,
> >
> > the component that is of most interest to me is enabling OVS to use in-tree
> > tunnelling code - as it seems that makes most sense for an implementation
> > of STT. I have taken a brief look over your vxlan work and it isn't clear
> > to me if it is moving towards being an in-tree implementation.  Moreover,
> > I'm rather unclear on what changes need to be made to OVS in order for
> > in-tree tunneling to be used.
> >
> > My recollection is that OVS did make use of in-tree tunnelling code
> > but this was removed in favour of the current implementation for various
> > reasons (performance being one IIRC). I gather that revisiting in-tree
> > tunnelling won't reintroduce the previous set of problems. But I'm unclear how.
> >
> > Jesse, is it possible for you to describe that in a little detail
> > or point me to some information?
> 
> This was what I had originally written a while back, although it's
> more about OVS internally and less about how to connect to the in-tree
> code:
> http://openvswitch.org/pipermail/dev/2012-February/014779.html
> 
> In order to flexibly implement support for current and future tunnel
> protocols OVS needs to be able to get/set information about the outer
> tunnel header when processing the inner packet.  At the very least
> this is src/dst IP addresses and the key/ID/VNI/etc.  In the upstream
> tunnel implementations those are implicitly encoded in the device that
> sends or receives the packet.  However, this has two problems: the
> number of devices and the ability to handle unknown values.  We addressed
> part of this problem by allowing the tunnel ID to be set and matched
> through the OVS flow table and an action.  In order to do this with
> the in-tree tunneling code, we obviously need a way of passing this
> information around since it would currently get lost as we pass
> through the Linux device layer.
> 
> The plan to deal with that is to add a function to the in-tree
> tunneling code that allows a skb to be encapsulated with specific
> parameters and conversely a hook to receive decapsulated packets along
> with header info.  This would make all of the kernel tunneling code
> common, while still giving OVS userspace the ability to implement
> essentially any type of tunneling policy.  In many ways, this is very
> similar to how vlans look in OVS today.
> 
> While it would be possible to implement the hook to use the in-tree
> tunnel code today without a lot of changes, we already know that we
> want to move away from the port-based model in the OVS kernel module
> towards the flow model.  As we push this upstream the userspace/kernel
> API should be the correct one, so that's why these two things are tied
> together.


Thanks, that explanation along with Kyle's response helps a lot.

It seems to me that something I could help out with is the implementation
of the set_tunnel action which extends and replaces the tun_id action.
It seems that is a requirement for the scheme you describe above.

http://openvswitch.org/pipermail/dev/2012-April/016239.html
Jesse Gross April 26, 2012, 4:13 p.m. UTC | #20
On Thu, Apr 26, 2012 at 12:13 AM, Simon Horman <horms@verge.net.au> wrote:
> On Wed, Apr 25, 2012 at 10:17:25AM -0700, Jesse Gross wrote:
>> On Wed, Apr 25, 2012 at 1:39 AM, Simon Horman <horms@verge.net.au> wrote:
>> >
>> > Hi Kyle,
>> >
>> > the component that is of most interest to me is enabling OVS to use in-tree
>> > tunnelling code - as it seems that makes most sense for an implementation
>> > of STT. I have taken a brief look over your vxlan work and it isn't clear
>> > to me if it is moving towards being an in-tree implementation.  Moreover,
>> > I'm rather unclear on what changes need to be made to OVS in order for
>> > in-tree tunneling to be used.
>> >
>> > My recollection is that OVS did make use of in-tree tunnelling code
>> > but this was removed in favour of the current implementation for various
>> > reasons (performance being one IIRC). I gather that revisiting in-tree
>> > tunnelling won't reintroduce the previous set of problems. But I'm unclear how.
>> >
>> > Jesse, is it possible for you to describe that in a little detail
>> > or point me to some information?
>>
>> This was what I had originally written a while back, although it's
>> more about OVS internally and less about how to connect to the in-tree
>> code:
>> http://openvswitch.org/pipermail/dev/2012-February/014779.html
>>
>> In order to flexibly implement support for current and future tunnel
>> protocols OVS needs to be able to get/set information about the outer
>> tunnel header when processing the inner packet.  At the very least
>> this is src/dst IP addresses and the key/ID/VNI/etc.  In the upstream
>> tunnel implementations those are implicitly encoded in the device that
>> sends or receives the packet.  However, this has two problems: the
>> number of devices and the ability to handle unknown values.  We addressed
>> part of this problem by allowing the tunnel ID to be set and matched
>> through the OVS flow table and an action.  In order to do this with
>> the in-tree tunneling code, we obviously need a way of passing this
>> information around since it would currently get lost as we pass
>> through the Linux device layer.
>>
>> The plan to deal with that is to add a function to the in-tree
>> tunneling code that allows a skb to be encapsulated with specific
>> parameters and conversely a hook to receive decapsulated packets along
>> with header info.  This would make all of the kernel tunneling code
>> common, while still giving OVS userspace the ability to implement
>> essentially any type of tunneling policy.  In many ways, this is very
>> similar to how vlans look in OVS today.
>>
>> While it would be possible to implement the hook to use the in-tree
>> tunnel code today without a lot of changes, we already know that we
>> want to move away from the port-based model in the OVS kernel module
>> towards the flow model.  As we push this upstream the userspace/kernel
>> API should be the correct one, so that's why these two things are tied
>> together.
>
>
> Thanks, that explanation along with Kyle's response helps a lot.
>
> It seems to me that something I could help out with is the implementation
> of the set_tunnel action which extends and replaces the tun_id action.
> It seems that is a requirement for the scheme you describe above.
>
> http://openvswitch.org/pipermail/dev/2012-April/016239.html

I agree that's probably the best place to start unless Kyle has some
specific plans otherwise.
Kyle Mestery (kmestery) April 26, 2012, 4:16 p.m. UTC | #21
On Apr 26, 2012, at 11:13 AM, Jesse Gross wrote:

> On Thu, Apr 26, 2012 at 12:13 AM, Simon Horman <horms@verge.net.au> wrote:
>> On Wed, Apr 25, 2012 at 10:17:25AM -0700, Jesse Gross wrote:
>>> On Wed, Apr 25, 2012 at 1:39 AM, Simon Horman <horms@verge.net.au> wrote:
>>>> 
>>>> Hi Kyle,
>>>> 
>>>> the component that is of most interest to me is enabling OVS to use in-tree
>>>> tunnelling code - as it seems that makes most sense for an implementation
>>>> of STT. I have taken a brief look over your vxlan work and it isn't clear
>>>> to me if it is moving towards being an in-tree implementation.  Moreover,
>>>> I'm rather unclear on what changes need to be made to OVS in order for
>>>> in-tree tunneling to be used.
>>>> 
>>>> My recollection is that OVS did make use of in-tree tunnelling code
>>>> but this was removed in favour of the current implementation for various
>>>> reasons (performance being one IIRC). I gather that revisiting in-tree
>>>> tunnelling won't reintroduce the previous set of problems. But I'm unclear how.
>>>> 
>>>> Jesse, is it possible for you to describe that in a little detail
>>>> or point me to some information?
>>> 
>>> This was what I had originally written a while back, although it's
>>> more about OVS internally and less about how to connect to the in-tree
>>> code:
>>> http://openvswitch.org/pipermail/dev/2012-February/014779.html
>>> 
>>> In order to flexibly implement support for current and future tunnel
>>> protocols OVS needs to be able to get/set information about the outer
>>> tunnel header when processing the inner packet.  At the very least
>>> this is src/dst IP addresses and the key/ID/VNI/etc.  In the upstream
>>> tunnel implementations those are implicitly encoded in the device that
>>> sends or receives the packet.  However, this has two problems: the
>>> number of devices and the ability to handle unknown values.  We addressed
>>> part of this problem by allowing the tunnel ID to be set and matched
>>> through the OVS flow table and an action.  In order to do this with
>>> the in-tree tunneling code, we obviously need a way of passing this
>>> information around since it would currently get lost as we pass
>>> through the Linux device layer.
>>> 
>>> The plan to deal with that is to add a function to the in-tree
>>> tunneling code that allows a skb to be encapsulated with specific
>>> parameters and conversely a hook to receive decapsulated packets along
>>> with header info.  This would make all of the kernel tunneling code
>>> common, while still giving OVS userspace the ability to implement
>>> essentially any type of tunneling policy.  In many ways, this is very
>>> similar to how vlans look in OVS today.
>>> 
>>> While it would be possible to implement the hook to use the in-tree
>>> tunnel code today without a lot of changes, we already know that we
>>> want to move away from the port-based model in the OVS kernel module
>>> towards the flow model.  As we push this upstream the userspace/kernel
>>> API should be the correct one, so that's why these two things are tied
>>> together.
>> 
>> 
>> Thanks, that explanation along with Kyle's response helps a lot.
>> 
>> It seems to me that something I could help out with is the implementation
>> of the set_tunnel action which extends and replaces the tun_id action.
>> It seems that is a requirement for the scheme you describe above.
>> 
>> http://openvswitch.org/pipermail/dev/2012-April/016239.html
> 
> I agree that's probably the best place to start unless Kyle has some
> specific plans otherwise.

Simon and I chatted off-list, and this is indeed where we plan to start.
diff mbox

Patch

diff --git a/acinclude.m4 b/acinclude.m4
index 69bb772..f3a52fa 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -266,6 +266,9 @@  AC_DEFUN([OVS_CHECK_LINUX_COMPAT], [
   OVS_GREP_IFELSE([$KSRC/include/linux/if_vlan.h], [ADD_ALL_VLANS_CMD],
                   [OVS_DEFINE([HAVE_VLAN_BUG_WORKAROUND])])
 
+  OVS_GREP_IFELSE([$KSRC/include/linux/tcp.h], [encap_rcv],
+                  [OVS_DEFINE([HAVE_TCP_ENCAP_RCV])])
+
   OVS_CHECK_LOG2_H
 
   if cmp -s datapath/linux/kcompat.h.new \
diff --git a/datapath/Modules.mk b/datapath/Modules.mk
index 24c1075..6fbe3dd 100644
--- a/datapath/Modules.mk
+++ b/datapath/Modules.mk
@@ -26,7 +26,8 @@  openvswitch_sources = \
 	vport-gre.c \
 	vport-internal_dev.c \
 	vport-netdev.c \
-	vport-patch.c
+	vport-patch.c \
+	vport-stt.c
 
 openvswitch_headers = \
 	checksum.h \
diff --git a/datapath/tunnel.h b/datapath/tunnel.h
index 33eb63c..96f59b1 100644
--- a/datapath/tunnel.h
+++ b/datapath/tunnel.h
@@ -41,6 +41,7 @@ 
  */
 #define TNL_T_PROTO_GRE		0
 #define TNL_T_PROTO_CAPWAP	1
+#define TNL_T_PROTO_STT		2
 
 /* These flags are only needed when calling tnl_find_port(). */
 #define TNL_T_KEY_EXACT		(1 << 10)
diff --git a/datapath/vport-stt.c b/datapath/vport-stt.c
new file mode 100644
index 0000000..638998d
--- /dev/null
+++ b/datapath/vport-stt.c
@@ -0,0 +1,803 @@ 
+/*
+ * Copyright (c) 2012 Horms Solutions Ltd.
+ * Distributed under the terms of the GNU GPL version 2.
+ *
+ * Significant portions of this file may be copied from parts of the Linux
+ * kernel, by Linus Torvalds and others.
+ *
+ * Significant portions of this file may be copied from
+ * other parts of Open vSwitch, by Nicira Networks and others.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": stt: " fmt
+
+#include <linux/version.h>
+#ifdef HAVE_TCP_ENCAP_RCV
+
+#include <linux/if.h>
+#include <linux/in.h>
+#include <linux/ip.h>
+#include <linux/list.h>
+#include <linux/net.h>
+#include <net/net_namespace.h>
+
+#include <net/icmp.h>
+#include <net/inet_frag.h>
+#include <net/ip.h>
+#include <net/ipv6.h>
+#include <net/protocol.h>
+#include <net/udp.h>
+#include <net/tcp.h>
+
+#include "datapath.h"
+#include "tunnel.h"
+#include "vport.h"
+#include "vport-generic.h"
+
+#define STT_DST_PORT 58882 /* Change to actual port number once awarded by IANA */
+
+/* XXX: Possible Consolidation: The same values as capwap */
+#define STT_FRAG_TIMEOUT (30 * HZ)
+#define STT_FRAG_MAX_MEM (256 * 1024)
+#define STT_FRAG_PRUNE_MEM (192 * 1024)
+#define STT_FRAG_SECRET_INTERVAL (10 * 60 * HZ)
+
+#define STT_FLAG_CHECKSUM_VERIFIED	(1 << 0)
+#define STT_FLAG_CHECKSUM_PARTIAL	(1 << 1)
+#define STT_FLAG_IP_VERSION		(1 << 2)
+#define STT_FLAG_TCP_PAYLOAD		(1 << 3)
+
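+/*
+ * STT reuses the TCP sequence number field: the upper 16 bits carry
+ * the length of the whole STT frame and the lower 16 bits the offset
+ * of this segment within the frame.  The acknowledgement field
+ * carries the fragment ID (see stt_update_header() and defrag()).
+ */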
+#define FRAG_OFF_MASK	0xffffU
+#define FRAME_LEN_SHIFT	16
+
+struct stthdr {
+	uint8_t	version;
+	uint8_t flags;
+	uint8_t l4_offset;
+	uint8_t reserved;
+	__be16 mss;
+	__be16 vlan_tci;
+	__be64 context_id;
+};
+
+/*
+ * Not in stthdr to avoid that structure being padded to
+ * a 64bit boundary - 2 bytes of pad are required, not 8
+ */
+struct stthdr_pad {
+	uint8_t pad[2];
+};
+
+static struct stthdr *stt_hdr(const struct sk_buff *skb)
+{
+	return (struct stthdr *)(tcp_hdr(skb) + 1);
+}
+
+/*
+ * Header lengths: every segment carries a bare TCP-like header
+ * (STT_SEG_HLEN); a complete frame additionally carries the STT
+ * header and two bytes of padding (STT_FRAME_HLEN).
+ */
+#define STT_SEG_HLEN   sizeof(struct tcphdr)
+#define STT_FRAME_HLEN (STT_SEG_HLEN + sizeof(struct stthdr) + \
+			sizeof(struct stthdr_pad))
+
+static inline int stt_seg_len(struct sk_buff *skb)
+{
+	return skb->len - skb_transport_offset(skb) - STT_SEG_HLEN;
+}
+
+static inline struct ethhdr *stt_inner_eth_header(struct sk_buff *skb)
+{
+	return (struct ethhdr *)((char *)skb_transport_header(skb)
+				 + STT_FRAME_HLEN);
+}
+
+/* XXX: Possible Consolidation: Same as capwap */
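+/* Reassembly key: the outer IP addresses plus the fragment ID, which
+ * STT carries in the TCP acknowledgement field (see defrag()). */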
+struct frag_match {
+	__be32 saddr;
+	__be32 daddr;
+	__be32 id;
+};
+
+/* XXX: Possible Consolidation: Same as capwap */
+struct frag_queue {
+	struct inet_frag_queue ifq;
+	struct frag_match match;
+};
+
+/* XXX: Possible Consolidation: Same as capwap */
+struct frag_skb_cb {
+	u16 offset;
+};
+#define FRAG_CB(skb) ((struct frag_skb_cb *)(skb)->cb)
+
+static struct sk_buff *defrag(struct sk_buff *skb, u16 frame_len);
+
+static void stt_frag_init(struct inet_frag_queue *, void *match);
+static unsigned int stt_frag_hash(struct inet_frag_queue *);
+static int stt_frag_match(struct inet_frag_queue *, void *match);
+static void stt_frag_expire(unsigned long ifq);
+
+static struct inet_frags frag_state = {
+	.constructor	= stt_frag_init,
+	.qsize		= sizeof(struct frag_queue),
+	.hashfn		= stt_frag_hash,
+	.match		= stt_frag_match,
+	.frag_expire	= stt_frag_expire,
+	.secret_interval = STT_FRAG_SECRET_INTERVAL,
+};
+
+/* random value for selecting source ports */
+static u32 stt_port_rnd __read_mostly;
+
+static int stt_hdr_len(const struct tnl_mutable_config *mutable)
+{
+	return (int)STT_FRAME_HLEN;
+}
+
+static void stt_build_header(const struct vport *vport,
+			     const struct tnl_mutable_config *mutable,
+			     void *header)
+{
+	struct tcphdr *tcph = header;
+	struct stthdr *stth = (struct stthdr *)(tcph + 1);
+	struct stthdr_pad *pad = (struct stthdr_pad *)(stth + 1);
+
+	tcph->dest = htons(STT_DST_PORT);
+	tcp_flag_word(tcph) = 0;
+	tcph->doff = sizeof(struct tcphdr) / 4;
+	tcph->ack = 1;
+	pad->pad[0] = pad->pad[1] = 0;
+}
+
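+/* Map a hash of the inner flow to a port in the local ephemeral range
+ * so the outer source port gives per-flow entropy to ECMP and NIC RSS. */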
+static u16 stt_src_port(u32 hash)
+{
+	int low, high;
+	inet_get_local_port_range(&low, &high);
+	return hash % (high - low + 1) + low; /* the range is inclusive */
+}
+
+struct sk_buff *stt_update_header(const struct vport *vport,
+				  const struct tnl_mutable_config *mutable,
+				  struct dst_entry *dst,
+				  struct sk_buff *skb)
+{
+	struct tcphdr *tcph;
+	struct stthdr *stth;
+	struct ethhdr *inner_ethh;
+	struct tnl_vport *tnl_vport = tnl_vport_priv(vport);
+	__be32 frag_id = htonl(atomic_inc_return(&tnl_vport->frag_id));
+	__be32 vlan_tci = 0;
+	u32 hash = jhash_1word(skb->protocol, stt_port_rnd);
+	int l4_protocol = IPPROTO_MAX;
+
+	if (skb->protocol == htons(ETH_P_8021Q)) {
+		struct vlan_ethhdr *vlanh;
+
+		if (unlikely(!pskb_may_pull(skb, VLAN_ETH_HLEN)))
+			goto err;
+
+		vlanh = (struct vlan_ethhdr *)stt_inner_eth_header(skb);
+		vlan_tci = vlanh->h_vlan_TCI;
+
+		/* STT requires that the encapsulated frame be untagged
+		 * and the STT header only allows saving one VLAN TCI.
+		 * So there seems to be no way to handle the presence of
+		 * more than one vlan tag other than to drop the packet
+		 */
+		if (vlan_eth_hdr(skb)->h_vlan_encapsulated_proto ==
+		    htons(ETH_P_8021Q))
+			goto err;
+
+		memmove(skb->data + VLAN_HLEN, skb->data,
+			(size_t)((char *)vlanh - (char *)skb->data) +
+			2 * ETH_ALEN);
+		if (unlikely(!skb_pull(skb, VLAN_HLEN)))
+			goto err;
+
+		skb->protocol = vlan_eth_hdr(skb)->h_vlan_encapsulated_proto;
+		skb->mac_header += VLAN_HLEN;
+		skb->network_header += VLAN_HLEN;
+		skb->transport_header += VLAN_HLEN;
+	}
+
+	tcph = tcp_hdr(skb);
+	stth = (struct stthdr *)(tcph + 1);
+	inner_ethh = stt_inner_eth_header(skb);
+
+	stth->flags = 0;
+
+	if (skb->protocol == htons(ETH_P_IP)) {
+		struct iphdr *iph = (struct iphdr *)(inner_ethh + 1);
+		hash = jhash_2words(iph->saddr, iph->daddr, hash);
+		l4_protocol = iph->protocol;
+		stth->flags |= STT_FLAG_IP_VERSION;
+	} else if (skb->protocol == htons(ETH_P_IPV6)) {
+		struct ipv6hdr *ipv6h = (struct ipv6hdr *)(inner_ethh + 1);
+		hash = jhash(ipv6h->saddr.s6_addr,
+			     sizeof(ipv6h->saddr.s6_addr), hash);
+		hash = jhash(ipv6h->daddr.s6_addr,
+			     sizeof(ipv6h->daddr.s6_addr), hash);
+		l4_protocol = ipv6h->nexthdr;
+	}
+
+	stth->l4_offset = 0;
+	if (get_ip_summed(skb) == OVS_CSUM_PARTIAL && skb->csum_start) {
+		int off = skb->csum_start - skb_headroom(skb);
+		if (likely(off > 0 && off < 256))
+			stth->l4_offset = off;
+		else if (net_ratelimit())
+			pr_err("%s: l4_offset %d is out of range, should be "
+			       "between 0 and 255\n", __func__, off);
+	}
+
+	if (stth->l4_offset && (l4_protocol == IPPROTO_TCP ||
+				l4_protocol == IPPROTO_UDP ||
+				l4_protocol == IPPROTO_DCCP ||
+				l4_protocol == IPPROTO_SCTP)) {
+		/* TCP, UDP, DCCP and SCTP place the source and destination
+		 * ports in the first and second 16-bits of their header,
+		 * so grabbing the first 32-bits will give a combined value.
+		 */
+		__be32 *ports = (__be32 *)((char *)inner_ethh +
+					   stth->l4_offset);
+		hash = jhash_1word(*ports, hash);
+	}
+
+	if (l4_protocol == IPPROTO_TCP)
+		stth->flags |= STT_FLAG_TCP_PAYLOAD;
+
+	stth->reserved = 0;
+	stth->mss = htons(dst_mtu(dst));
+	stth->vlan_tci = vlan_tci;
+	stth->context_id = mutable->out_key;
+
+	tcph->source = htons(stt_src_port(hash));
+	tcph->seq = htonl(stt_seg_len(skb) << FRAME_LEN_SHIFT);
+	tcph->ack_seq = frag_id;
+	tcph->ack = 1;
+	tcph->psh = 1;
+
+	switch (get_ip_summed(skb)) {
+	case OVS_CSUM_PARTIAL:
+		stth->flags |= STT_FLAG_CHECKSUM_PARTIAL;
+		tcph->check = ~tcp_v4_check(skb->len,
+					    ip_hdr(skb)->saddr,
+					    ip_hdr(skb)->daddr, 0);
+		skb->csum_start = skb_transport_header(skb) - skb->head;
+		skb->csum_offset = offsetof(struct tcphdr, check);
+		break;
+	case OVS_CSUM_UNNECESSARY:
+		stth->flags |= STT_FLAG_CHECKSUM_VERIFIED;
+		pr_debug_once("%s: checksum unnecessary\n", __func__);
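+		/* fall through: a full outer checksum is computed even
+		 * when the inner checksum has already been verified */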
+	default:
+		tcph->check = 0;
+		skb->csum = skb_checksum(skb, skb_transport_offset(skb),
+					 skb->len - skb_transport_offset(skb),
+					 0);
+		tcph->check = tcp_v4_check(skb->len - skb_transport_offset(skb),
+					   ip_hdr(skb)->saddr,
+					   ip_hdr(skb)->daddr, skb->csum);
+		set_ip_summed(skb, OVS_CSUM_UNNECESSARY);
+	}
+	forward_ip_summed(skb, 1);
+
+	return skb;
+err:
+	kfree_skb(skb);
+	return NULL;
+}
+
+static inline struct capwap_net *ovs_get_stt_net(struct net *net)
+{
+	struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
+	return &ovs_net->vport_net.stt;
+}
+
+static struct sk_buff *process_stt_proto(struct sk_buff *skb, __be64 *key)
+{
+	struct tcphdr *tcph = tcp_hdr(skb);
+	struct stthdr *stth;
+	u16 frame_len;
+
+	skb_postpull_rcsum(skb, skb_transport_header(skb),
+			   STT_SEG_HLEN + ETH_HLEN);
+
+	frame_len = ntohl(tcph->seq) >> FRAME_LEN_SHIFT;
+	if (stt_seg_len(skb) < frame_len) {
+		skb = defrag(skb, frame_len);
+		if (!skb)
+			return NULL;
+	}
+
+	if (skb->len < (tcph->doff << 2) || tcp_checksum_complete(skb)) {
+		if (net_ratelimit()) {
+			struct iphdr *iph = ip_hdr(skb);
+			pr_info("stt: dropped frame with "
+			       "invalid checksum (%pI4, %d)->(%pI4, %d)\n",
+			       &iph->saddr, ntohs(tcph->source),
+			       &iph->daddr, ntohs(tcph->dest));
+		}
+		goto error;
+	}
+
+	/* STT_FRAME_HLEN less two pad bytes is needed here.
+	 * STT_FRAME_HLEN is needed by our caller, stt_rcv().
+	 * An additional ETH_HLEN bytes are required by ovs_flow_extract()
+	 * which is called indirectly by our caller.
+	 */
+	if (unlikely(!pskb_may_pull(skb, STT_FRAME_HLEN + ETH_HLEN))) {
+		if (net_ratelimit())
+			pr_info("dropped frame that is too short: %u < %zu\n",
+				skb->len, STT_FRAME_HLEN + ETH_HLEN);
+		goto error;
+	}
+
+	stth = stt_hdr(skb);
+	/* Only accept STT version 0, it's all we know */
+	if (stth->version != 0)
+		goto error;
+
+	*key = stth->context_id;
+	__vlan_hwaccel_put_tag(skb, ntohs(stth->vlan_tci));
+
+	return skb;
+error:
+	kfree_skb(skb);
+	return NULL;
+}
+
+/* Called with rcu_read_lock and BH disabled. */
+static int stt_rcv(struct sock *sk, struct sk_buff *skb)
+{
+	struct vport *vport;
+	const struct tnl_mutable_config *mutable;
+	struct iphdr *iph;
+	__be64 key = 0;
+
+	/* pskb_may_pull() has already been called for
+	 * sizeof(struct tcphdr) in tcp_v4_rcv(), so there
+	 * is no need to do so again here
+	 */
+
+	skb = process_stt_proto(skb, &key);
+	if (unlikely(!skb))
+		goto out;
+
+	iph = ip_hdr(skb);
+	vport = ovs_tnl_find_port(sock_net(sk), iph->daddr, iph->saddr, key,
+				  TNL_T_PROTO_STT, &mutable);
+	if (unlikely(!vport)) {
+		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);
+		goto error;
+	}
+
+	if (mutable->flags & TNL_F_IN_KEY_MATCH)
+		OVS_CB(skb)->tun_id = key;
+	else
+		OVS_CB(skb)->tun_id = 0;
+
+	__skb_pull(skb, STT_FRAME_HLEN);
+	skb_postpull_rcsum(skb, skb_transport_header(skb),
+			   STT_FRAME_HLEN + ETH_HLEN);
+	if (get_ip_summed(skb) == OVS_CSUM_PARTIAL)
+		skb->csum_start += STT_FRAME_HLEN;
+
+	ovs_tnl_rcv(vport, skb, iph->tos);
+	goto out;
+
+error:
+	kfree_skb(skb);
+out:
+	return 0;
+}
+
+static const struct tnl_ops stt_tnl_ops = {
+	.tunnel_type	= TNL_T_PROTO_STT,
+	.ipproto	= IPPROTO_TCP,
+	.hdr_len	= stt_hdr_len,
+	.build_header	= stt_build_header,
+	.update_header	= stt_update_header,
+};
+
+static int init_socket(struct net *net)
+{
+	int err;
+	struct capwap_net *stt_net = ovs_get_stt_net(net);
+	struct sockaddr_in sin;
+
+	if (stt_net->n_tunnels) {
+		stt_net->n_tunnels++;
+		return 0;
+	}
+
+	err = sock_create_kern(AF_INET, SOCK_STREAM, 0,
+			       &stt_net->capwap_rcv_socket);
+	if (err)
+		goto error;
+
+	/* release net ref. */
+	sk_change_net(stt_net->capwap_rcv_socket->sk, net);
+
+	sin.sin_family = AF_INET;
+	sin.sin_addr.s_addr = htonl(INADDR_ANY);
+	sin.sin_port = htons(STT_DST_PORT);
+
+	err = kernel_bind(stt_net->capwap_rcv_socket, (struct sockaddr *)&sin,
+			  sizeof(struct sockaddr_in));
+	if (err)
+		goto error_sock;
+
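+	/* Attach the STT receive handler via the proposed TCP encap_rcv
+	 * hook (added to tcp_sock by the accompanying kernel patch). */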
+	tcp_sk(stt_net->capwap_rcv_socket->sk)->encap_rcv = stt_rcv;
+	tcp_encap_enable();
+
+	stt_net->frag_state.timeout = STT_FRAG_TIMEOUT;
+	stt_net->frag_state.high_thresh	= STT_FRAG_MAX_MEM;
+	stt_net->frag_state.low_thresh	= STT_FRAG_PRUNE_MEM;
+
+	inet_frags_init_net(&stt_net->frag_state);
+
+	err = kernel_listen(stt_net->capwap_rcv_socket, 7);
+	if (err)
+		goto error_sock;
+
+	stt_net->n_tunnels++;
+	return 0;
+
+error_sock:
+	sk_release_kernel(stt_net->capwap_rcv_socket->sk);
+error:
+	pr_warn("cannot register protocol handler: %d\n", err);
+	return err;
+}
+
+/* XXX: Possible Consolidation: Very similar to vport-capwap.c:release_socket() */
+static void release_socket(struct net *net)
+{
+	struct capwap_net *stt_net = ovs_get_stt_net(net);
+
+	stt_net->n_tunnels--;
+	if (stt_net->n_tunnels)
+		return;
+
+	inet_frags_exit_net(&stt_net->frag_state, &frag_state);
+	sk_release_kernel(stt_net->capwap_rcv_socket->sk);
+}
+
+/* XXX: Possible Consolidation: Very similar to capwap_create() */
+static struct vport *stt_create(const struct vport_parms *parms)
+{
+	struct vport *vport;
+	int err;
+
+	err = init_socket(ovs_dp_get_net(parms->dp));
+	if (err)
+		return ERR_PTR(err);
+
+	vport = ovs_tnl_create(parms, &ovs_stt_vport_ops, &stt_tnl_ops);
+	if (IS_ERR(vport))
+		release_socket(ovs_dp_get_net(parms->dp));
+
+	return vport;
+}
+
+/* XXX: Possible Consolidation: Same as capwap_destroy() */
+static void stt_destroy(struct vport *vport)
+{
+	ovs_tnl_destroy(vport);
+	release_socket(ovs_dp_get_net(vport->dp));
+}
+
+/* XXX: Possible Consolidation: Same as capwap_init() */
+static int stt_init(void)
+{
+	inet_frags_init(&frag_state);
+	get_random_bytes(&stt_port_rnd, sizeof(stt_port_rnd));
+	return 0;
+}
+
+/* XXX: Possible Consolidation: Same as capwap_exit() */
+static void stt_exit(void)
+{
+	inet_frags_fini(&frag_state);
+}
+
+/* All of the following functions relate to fragmentation reassembly. */
+
+static struct frag_queue *ifq_cast(struct inet_frag_queue *ifq)
+{
+	return container_of(ifq, struct frag_queue, ifq);
+}
+
+/* XXX: Possible Consolidation: Identical to vport-capwap.c:frag_hash() */
+static u32 frag_hash(struct frag_match *match)
+{
+	return jhash_3words((__force u16)match->id, (__force u32)match->saddr,
+			    (__force u32)match->daddr,
+			    frag_state.rnd) & (INETFRAGS_HASHSZ - 1);
+}
+
+/* XXX: Possible Consolidation: Identical to vport-capwap.c:queue_find() */
+static struct frag_queue *queue_find(struct netns_frags *ns_frag_state,
+				     struct frag_match *match)
+{
+	struct inet_frag_queue *ifq;
+
+	read_lock(&frag_state.lock);
+
+	ifq = inet_frag_find(ns_frag_state, &frag_state, match, frag_hash(match));
+	if (!ifq)
+		return NULL;
+
+	/* Unlock happens inside inet_frag_find(). */
+
+	return ifq_cast(ifq);
+}
+
+/* XXX: Possible Consolidation: Identical to vport-capwap.c:frag_reasm() */
+static struct sk_buff *frag_reasm(struct frag_queue *fq, struct net_device *dev)
+{
+	struct sk_buff *head = fq->ifq.fragments;
+	struct sk_buff *frag;
+
+	/* Succeed or fail, we're done with this queue. */
+	inet_frag_kill(&fq->ifq, &frag_state);
+
+	if (fq->ifq.len > 65535)
+		return NULL;
+
+	/* Can't have the head be a clone. */
+	if (skb_cloned(head) && pskb_expand_head(head, 0, 0, GFP_ATOMIC))
+		return NULL;
+
+	/*
+	 * We're about to build frag list for this SKB.  If it already has a
+	 * frag list, alloc a new SKB and put the existing frag list there.
+	 */
+	if (skb_shinfo(head)->frag_list) {
+		int i;
+		int paged_len = 0;
+
+		frag = alloc_skb(0, GFP_ATOMIC);
+		if (!frag)
+			return NULL;
+
+		frag->next = head->next;
+		head->next = frag;
+		skb_shinfo(frag)->frag_list = skb_shinfo(head)->frag_list;
+		skb_shinfo(head)->frag_list = NULL;
+
+		for (i = 0; i < skb_shinfo(head)->nr_frags; i++)
+			paged_len += skb_shinfo(head)->frags[i].size;
+		frag->len = frag->data_len = head->data_len - paged_len;
+		head->data_len -= frag->len;
+		head->len -= frag->len;
+
+		frag->ip_summed = head->ip_summed;
+		atomic_add(frag->truesize, &fq->ifq.net->mem);
+	}
+
+	skb_shinfo(head)->frag_list = head->next;
+	atomic_sub(head->truesize, &fq->ifq.net->mem);
+
+	/* Properly account for data in various packets. */
+	for (frag = head->next; frag; frag = frag->next) {
+		head->data_len += frag->len;
+		head->len += frag->len;
+
+		if (head->ip_summed != frag->ip_summed)
+			head->ip_summed = CHECKSUM_NONE;
+		else if (head->ip_summed == CHECKSUM_COMPLETE)
+			head->csum = csum_add(head->csum, frag->csum);
+
+		head->truesize += frag->truesize;
+		atomic_sub(frag->truesize, &fq->ifq.net->mem);
+	}
+
+	head->next = NULL;
+	head->dev = dev;
+	head->tstamp = fq->ifq.stamp;
+	fq->ifq.fragments = NULL;
+
+	return head;
+}
+
+/* XXX: Possible Consolidation: Identical to vport-capwap.c:frag_queue() */
+static struct sk_buff *frag_queue(struct frag_queue *fq, struct sk_buff *skb,
+				  u16 offset, bool frag_last)
+{
+	struct sk_buff *prev, *next;
+	struct net_device *dev;
+	int end;
+
+	if (fq->ifq.last_in & INET_FRAG_COMPLETE)
+		goto error;
+
+	if (stt_seg_len(skb) <= 0)
+		goto error;
+
+	end = offset + stt_seg_len(skb);
+
+	if (frag_last) {
+		/*
+		 * Last fragment, shouldn't already have data past our end or
+		 * have another last fragment.
+		 */
+		if (end < fq->ifq.len || fq->ifq.last_in & INET_FRAG_LAST_IN)
+			goto error;
+
+		fq->ifq.last_in |= INET_FRAG_LAST_IN;
+		fq->ifq.len = end;
+	} else {
+		/* Fragments should align to 8 byte chunks. */
+		if (end & ~FRAG_OFF_MASK)
+			goto error;
+
+		if (end > fq->ifq.len) {
+			/*
+			 * Shouldn't have data past the end, if we already
+			 * have one.
+			 */
+			if (fq->ifq.last_in & INET_FRAG_LAST_IN)
+				goto error;
+
+			fq->ifq.len = end;
+		}
+	}
+
+	/* Find where we fit in. */
+	prev = NULL;
+	for (next = fq->ifq.fragments; next != NULL; next = next->next) {
+		if (FRAG_CB(next)->offset >= offset)
+			break;
+		prev = next;
+	}
+
+	/*
+	 * Overlapping fragments aren't allowed.  We shouldn't start before
+	 * the end of the previous fragment.
+	 */
+	if (prev && FRAG_CB(prev)->offset + stt_seg_len(prev) > offset)
+		goto error;
+
+	/* We also shouldn't end after the beginning of the next fragment. */
+	if (next && end > FRAG_CB(next)->offset)
+		goto error;
+
+	FRAG_CB(skb)->offset = offset;
+
+	/* Link into list. */
+	skb->next = next;
+	if (prev)
+		prev->next = skb;
+	else
+		fq->ifq.fragments = skb;
+
+	dev = skb->dev;
+	skb->dev = NULL;
+
+	fq->ifq.stamp = skb->tstamp;
+	fq->ifq.meat += stt_seg_len(skb);
+	atomic_add(skb->truesize, &fq->ifq.net->mem);
+	if (offset == 0)
+		fq->ifq.last_in |= INET_FRAG_FIRST_IN;
+
+	/* If we have all fragments do reassembly. */
+	if (fq->ifq.last_in == (INET_FRAG_FIRST_IN | INET_FRAG_LAST_IN) &&
+	    fq->ifq.meat == fq->ifq.len)
+		return frag_reasm(fq, dev);
+
+	write_lock(&frag_state.lock);
+	list_move_tail(&fq->ifq.lru_list, &fq->ifq.net->lru_list);
+	write_unlock(&frag_state.lock);
+
+	return NULL;
+
+error:
+	kfree_skb(skb);
+	return NULL;
+}
+
+/* XXX: Possible Consolidation: Similar to vport-capwap.c:defrag() */
+static struct sk_buff *defrag(struct sk_buff *skb, u16 frame_len)
+{
+	struct iphdr *iph = ip_hdr(skb);
+	struct tcphdr *tcph = tcp_hdr(skb);
+	struct netns_frags *ns_frag_state;
+	struct frag_match match;
+	u16 frag_off;
+	struct frag_queue *fq;
+	bool frag_last = false;
+
+	if (unlikely(!skb->dev)) {
+		if (net_ratelimit())
+			pr_err("%s: No skb->dev!\n", __func__);
+		goto out;
+	}
+
+	ns_frag_state = &ovs_get_stt_net(dev_net(skb->dev))->frag_state;
+	if (atomic_read(&ns_frag_state->mem) > ns_frag_state->high_thresh)
+		inet_frag_evictor(ns_frag_state, &frag_state);
+
+	match.daddr = iph->daddr;
+	match.saddr = iph->saddr;
+	match.id = tcph->ack_seq;
+	frag_off = ntohl(tcph->seq) & FRAG_OFF_MASK;
+	if (frame_len == stt_seg_len(skb) + frag_off)
+		frag_last = true;
+
+	fq = queue_find(ns_frag_state, &match);
+	if (fq) {
+		spin_lock(&fq->ifq.lock);
+		skb = frag_queue(fq, skb, frag_off, frag_last);
+		spin_unlock(&fq->ifq.lock);
+
+		inet_frag_put(&fq->ifq, &frag_state);
+
+		return skb;
+	}
+
+out:
+	kfree_skb(skb);
+	return NULL;
+}
+
+/* XXX: Possible Consolidation: Functionally identical to capwap_frag_init */
+static void stt_frag_init(struct inet_frag_queue *ifq, void *match_)
+{
+	struct frag_match *match = match_;
+
+	ifq_cast(ifq)->match = *match;
+}
+
+/* XXX: Possible Consolidation: Functionally identical to capwap_frag_hash */
+static unsigned int stt_frag_hash(struct inet_frag_queue *ifq)
+{
+	return frag_hash(&ifq_cast(ifq)->match);
+}
+
+/* XXX: Possible Consolidation: Almost functionally identical to capwap_frag_match */
+static int stt_frag_match(struct inet_frag_queue *ifq, void *a_)
+{
+	struct frag_match *a = a_;
+	struct frag_match *b = &ifq_cast(ifq)->match;
+
+	return a->id == b->id && a->saddr == b->saddr && a->daddr == b->daddr;
+}
+
+/* Run when the timeout for a given queue expires. */
+/* XXX: Possible Consolidation: Functionally identical to capwap_frag_expire */
+static void stt_frag_expire(unsigned long ifq)
+{
+	struct frag_queue *fq;
+
+	fq = ifq_cast((struct inet_frag_queue *)ifq);
+
+	spin_lock(&fq->ifq.lock);
+
+	if (!(fq->ifq.last_in & INET_FRAG_COMPLETE))
+		inet_frag_kill(&fq->ifq, &frag_state);
+
+	spin_unlock(&fq->ifq.lock);
+	inet_frag_put(&fq->ifq, &frag_state);
+}
+
+const struct vport_ops ovs_stt_vport_ops = {
+	.type		= OVS_VPORT_TYPE_STT,
+	.flags		= VPORT_F_TUN_ID,
+	.init		= stt_init,
+	.exit		= stt_exit,
+	.create		= stt_create,
+	.destroy	= stt_destroy,
+	.set_addr	= ovs_tnl_set_addr,
+	.get_name	= ovs_tnl_get_name,
+	.get_addr	= ovs_tnl_get_addr,
+	.get_options	= ovs_tnl_get_options,
+	.set_options	= ovs_tnl_set_options,
+	.get_dev_flags	= ovs_vport_gen_get_dev_flags,
+	.is_running	= ovs_vport_gen_is_running,
+	.get_operstate	= ovs_vport_gen_get_operstate,
+	.send		= ovs_tnl_send,
+};
+#else
+#warning STT requires the TCP encap_rcv hook in the kernel
+#endif /* HAVE_TCP_ENCAP_RCV */
diff --git a/datapath/vport.c b/datapath/vport.c
index b75a866..575e7a2 100644
--- a/datapath/vport.c
+++ b/datapath/vport.c
@@ -44,6 +44,9 @@  static const struct vport_ops *base_vport_ops_list[] = {
 #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,26)
 	&ovs_capwap_vport_ops,
 #endif
+#ifdef HAVE_TCP_ENCAP_RCV
+	&ovs_stt_vport_ops,
+#endif
 };
 
 static const struct vport_ops **vport_ops_list;
diff --git a/datapath/vport.h b/datapath/vport.h
index 2aafde0..3994eb1 100644
--- a/datapath/vport.h
+++ b/datapath/vport.h
@@ -33,6 +33,7 @@  struct vport_parms;
 
 struct vport_net {
 	struct capwap_net capwap;
+	struct capwap_net stt;
 };
 
 /* The following definitions are for users of the vport subsytem: */
@@ -257,5 +258,6 @@  extern const struct vport_ops ovs_internal_vport_ops;
 extern const struct vport_ops ovs_patch_vport_ops;
 extern const struct vport_ops ovs_gre_vport_ops;
 extern const struct vport_ops ovs_capwap_vport_ops;
+extern const struct vport_ops ovs_stt_vport_ops;
 
 #endif /* vport.h */
diff --git a/include/linux/openvswitch.h b/include/linux/openvswitch.h
index 0578b5f..47f6dca 100644
--- a/include/linux/openvswitch.h
+++ b/include/linux/openvswitch.h
@@ -185,6 +185,7 @@  enum ovs_vport_type {
 	OVS_VPORT_TYPE_PATCH = 100, /* virtual tunnel connecting two vports */
 	OVS_VPORT_TYPE_GRE,      /* GRE tunnel */
 	OVS_VPORT_TYPE_CAPWAP,   /* CAPWAP tunnel */
+	OVS_VPORT_TYPE_STT,      /* STT tunnel */
 	__OVS_VPORT_TYPE_MAX
 };
 
diff --git a/lib/netdev-vport.c b/lib/netdev-vport.c
index 7bd50a4..346878b 100644
--- a/lib/netdev-vport.c
+++ b/lib/netdev-vport.c
@@ -165,6 +165,9 @@  netdev_vport_get_netdev_type(const struct dpif_linux_vport *vport)
     case OVS_VPORT_TYPE_CAPWAP:
         return "capwap";
 
+    case OVS_VPORT_TYPE_STT:
+        return "stt";
+
     case __OVS_VPORT_TYPE_MAX:
         break;
     }
@@ -965,7 +968,11 @@  netdev_vport_register(void)
 
         { OVS_VPORT_TYPE_PATCH,
           { "patch", VPORT_FUNCTIONS(NULL) },
-          parse_patch_config, unparse_patch_config }
+          parse_patch_config, unparse_patch_config },
+
+        { OVS_VPORT_TYPE_STT,
+          { "stt", VPORT_FUNCTIONS(netdev_vport_get_drv_info) },
+          parse_tunnel_config, unparse_tunnel_config }
     };
 
     int i;
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index f3ea338..d8c860e 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -1177,6 +1177,16 @@ 
             A pair of virtual devices that act as a patch cable.
           </dd>
 
+          <dt><code>stt</code></dt>
+          <dd>
+	    An Ethernet tunnel over STT (IETF draft-davie-stt-01).  TCP
+	    port 58882 is used as the destination port, and source ports
+	    are taken from the ephemeral range, which may be configured
+	    via /proc/sys/net/ipv4/ip_local_port_range.
+	    STT currently requires modifications to the Linux kernel and is
+	    not supported by any released kernel version.
+          </dd>
+
           <dt><code>null</code></dt>
           <dd>An ignored interface.</dd>
         </dl>