Message ID | 20120423083007.GB22556@verge.net.au
---|---
State | RFC, archived
Delegated to: | David Miller
From: Simon Horman <horms@verge.net.au> Date: Mon, 23 Apr 2012 17:30:08 +0900 > I'm pretty sure the patch I posted added encap_rcv to tcp_sock. > Am I missing the point? It did, my eyes are failing me :-) > Currently I am setting up a listening socket. The Open vSwitch tunneling > code transmits skbs using either dev_queue_xmit() or ip_local_out(). > I'm not sure that I have exercised the ip_local_out() case yet. I don't see where on transmit you're going to realize the primary stated benefit of STT, that being TSO/GSO. You'll probably want to gather as many packets as possible into a larger STT frame for this purpose. And when switching between STT tunnels, leave the packet alone since a GRO STT frame on receive will transparently become an STT GSO frame on transmit. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
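The encap_rcv hook under discussion can be modeled in a few lines. This is an illustrative userspace sketch of the demultiplexing idea only, with invented types (`toy_tcp_sock`, `pkt`), not the kernel's actual `tcp_sock`; the pattern mirrors the `encap_rcv` hook that UDP sockets already carry:

```c
#include <stddef.h>

/* Toy model of an encap_rcv hook on a TCP socket: if a listening
 * socket has registered an encapsulation handler, the receive path
 * offers the segment to it before doing normal stream processing.
 * All names here are illustrative, not the kernel's. */

struct pkt { const unsigned char *data; size_t len; };

struct toy_tcp_sock {
    /* Returns 0 if the packet was consumed as tunnel traffic,
     * nonzero to fall back to normal TCP processing. */
    int (*encap_rcv)(struct toy_tcp_sock *sk, struct pkt *p);
    int normal_rcv_count;   /* packets that took the normal path */
    int encap_rcv_count;    /* packets consumed by the tunnel hook */
};

static int toy_tcp_rcv(struct toy_tcp_sock *sk, struct pkt *p)
{
    if (sk->encap_rcv && sk->encap_rcv(sk, p) == 0) {
        sk->encap_rcv_count++;   /* consumed by e.g. an STT receiver */
        return 0;
    }
    sk->normal_rcv_count++;      /* ordinary TCP segment */
    return 0;
}
```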
On Mon, 23 Apr 2012 15:15:33 -0400 (EDT) David Miller <davem@davemloft.net> wrote: > From: Simon Horman <horms@verge.net.au> > Date: Mon, 23 Apr 2012 17:30:08 +0900 > > > I'm pretty sure the patch I posted added encap_rcv to tcp_sock. > > Am I missing the point? > > It did, my eyes are failing me :-) > > > Currently I am setting up a listening socket. The Open vSwitch tunneling > > code transmits skbs using either dev_queue_xmit() or ip_local_out(). > > I'm not sure that I have exercised the ip_local_out() case yet. > > I don't see where on transmit you're going to realize the primary > stated benefit of STT, that being TSO/GSO. > > You'll probably want to gather as many packets as possible into a > larger STT frame for this purpose. And when switching between STT > tunnels, leave the packet alone since a GRO STT frame on receive will > transparently become an STT GSO frame on transmit. I think the point of the TSO hack is to get around the MTU problem when tunneling. The added header of the tunnel eats into the possible MTU. The use of TSO in STT is designed to deal with the fact that hardware can't do IP fragmentation of IP (or UDP) packets.
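The "tunnel header eats into the MTU" point is simple arithmetic. The header sizes below are representative assumptions (outer IPv4 20, outer TCP 20, an 18-byte STT frame header per the draft, inner Ethernet 14), not figures stated in this thread:

```c
/* Illustrative arithmetic for tunnel encapsulation overhead: how much
 * inner packet fits in one physical frame if there is no TSO and
 * fragmentation must be avoided. Header sizes are assumptions. */

enum {
    OUTER_IP  = 20,  /* outer IPv4 header */
    OUTER_TCP = 20,  /* outer TCP header carrying the STT payload */
    STT_HDR   = 18,  /* STT frame header (per the draft; assumed) */
    INNER_ETH = 14,  /* encapsulated Ethernet header */
};

/* Largest inner IP packet that fits in one physical frame. */
static int inner_mtu(int phys_mtu)
{
    return phys_mtu - OUTER_IP - OUTER_TCP - STT_HDR - INNER_ETH;
}
```

With a standard 1500-byte MTU this leaves 1428 bytes for the inner IP packet; TSO sidesteps the squeeze because the stack can hand down a much larger frame and let the NIC cut it into MTU-sized outer segments.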
On Mon, Apr 23, 2012 at 12:19 PM, Stephen Hemminger <shemminger@vyatta.com> wrote: > On Mon, 23 Apr 2012 15:15:33 -0400 (EDT) > David Miller <davem@davemloft.net> wrote: > >> From: Simon Horman <horms@verge.net.au> >> Date: Mon, 23 Apr 2012 17:30:08 +0900 >> >> > I'm pretty sure the patch I posted added encap_rcv to tcp_sock. >> > Am I missing the point? >> >> It did, my eyes are failing me :-) >> >> > Currently I am setting up a listening socket. The Open vSwitch tunneling >> > code transmits skbs using either dev_queue_xmit() or ip_local_out(). >> > I'm not sure that I have exercised the ip_local_out() case yet. >> >> I don't see where on transmit you're going to realize the primary >> stated benefit of STT, that being TSO/GSO. >> >> You'll probably want to gather as many packets as possible into a >> larger STT frame for this purpose. And when switching between STT >> tunnels, leave the packet alone since a GRO STT frame on receive will >> transparently become an STT GSO frame on transmit. > > I think the point of the TSO hack is to get around the MTU problem when tunneling. > The added header of the tunnel eats into the possible MTU. The use of TSO > in STT is designed to deal with the fact that hardware can't do IP fragmentation > of IP (or UDP) packets. That is a beneficial side effect, although the main goal is just to get back all of the offloads that are lost because hardware can't see inside of encapsulated packets, with TSO, LRO, and RSS being the main examples. Assuming that the TCP stack generates large TSO frames on transmit (which could be the local stack; something sent by a VM; or packets received, coalesced by GRO and then encapsulated by STT) then you can just prepend the STT header (possibly slightly adjusting things like requested MSS, number of segments, etc.). After that it's possible to just output the resulting frame through the IP stack like all tunnels do today.
Similarly, on the other side the NIC will be able to perform its normal offloading operations as well.
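The "prepend the STT header and adjust the GSO metadata" step can be sketched as plain arithmetic. In the kernel the relevant fields are `skb_shinfo(skb)->gso_size` and `gso_segs`; the struct below is a stand-in for them, and the 18-byte STT header length is an assumption taken from the draft, not from this thread:

```c
/* Model of adjusting GSO metadata when an STT header is prepended:
 * the header becomes the first bytes of the outer TCP payload, so the
 * amount of data the NIC segments grows and the expected segment
 * count must be recomputed against the same MSS. */

#define STT_HDR_LEN 18  /* assumed, per the STT draft */

struct gso_meta {
    unsigned int payload_len;  /* bytes the NIC will segment */
    unsigned int gso_size;     /* requested MSS per wire segment */
    unsigned int gso_segs;     /* expected number of segments */
};

static void stt_encap_adjust(struct gso_meta *m)
{
    m->payload_len += STT_HDR_LEN;
    /* ceil(payload / mss) */
    m->gso_segs = (m->payload_len + m->gso_size - 1) / m->gso_size;
}
```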
From: Jesse Gross <jesse@nicira.com> Date: Mon, 23 Apr 2012 13:08:49 -0700 > Assuming that the TCP stack generates large TSO frames on transmit > (which could be the local stack; something sent by a VM; or packets > received, coalesced by GRO and then encapsulated by STT) then you can > just prepend the STT header (possibly slightly adjusting things like > requested MSS, number of segments, etc.). After that it's > possible to just output the resulting frame through the IP stack like > all tunnels do today. Which seems to potentially suggest a stronger integration of the STT tunnel transmit path into our IP stack rather than the approach Simon is taking.
On Mon, Apr 23, 2012 at 1:13 PM, David Miller <davem@davemloft.net> wrote: > From: Jesse Gross <jesse@nicira.com> > Date: Mon, 23 Apr 2012 13:08:49 -0700 > >> Assuming that the TCP stack generates large TSO frames on transmit >> (which could be the local stack; something sent by a VM; or packets >> received, coalesced by GRO and then encapsulated by STT) then you can >> just prepend the STT header (possibly slightly adjusting things like >> requested MSS, number of segments, etc.). After that it's >> possible to just output the resulting frame through the IP stack like >> all tunnels do today. > > Which seems to potentially suggest a stronger integration of the STT > tunnel transmit path into our IP stack rather than the approach Simon > is taking Did you have something in mind? Since the originating stack already generates TSO frames today, it's just a few lines of code to adjust for the addition of the STT header as the skb is encapsulated. Otherwise, the transmit path is the same as something like GRE. L2TP follows a fairly similar path - on receive it binds to a listening UDP socket and on transmit it prepends a header, sets up checksum offloading, and outputs directly via ip_queue_xmit().
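The L2TP-style transmit step described above boils down to prepending a tunnel header before handing the buffer to the IP stack. Below is a userspace model of just that encapsulation step; in-kernel this would be `skb_push()` followed by `ip_queue_xmit()`, and the checksum-offload setup is elided. The 8-byte header length is purely hypothetical:

```c
#include <stdlib.h>
#include <string.h>

#define TUN_HDR_LEN 8  /* hypothetical tunnel header size */

/* Build the encapsulated frame: a newly allocated buffer of
 * (TUN_HDR_LEN + len) bytes with the tunnel header first and the
 * inner payload copied after it. Caller frees. Returns NULL on
 * allocation failure. */
static unsigned char *tun_encap(const unsigned char *hdr,
                                const unsigned char *payload, size_t len)
{
    unsigned char *buf = malloc(TUN_HDR_LEN + len);
    if (!buf)
        return NULL;
    memcpy(buf, hdr, TUN_HDR_LEN);            /* tunnel header */
    memcpy(buf + TUN_HDR_LEN, payload, len);  /* inner frame */
    return buf;
}
```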
From: Jesse Gross <jesse@nicira.com> Date: Mon, 23 Apr 2012 13:53:42 -0700 > On Mon, Apr 23, 2012 at 1:13 PM, David Miller <davem@davemloft.net> wrote: >> From: Jesse Gross <jesse@nicira.com> >> Date: Mon, 23 Apr 2012 13:08:49 -0700 >> >>> Assuming that the TCP stack generates large TSO frames on transmit >>> (which could be the local stack; something sent by a VM; or packets >>> received, coalesced by GRO and then encapsulated by STT) then you can >>> just prepend the STT header (possibly slightly adjusting things like >>> requested MSS, number of segments, etc.). After that it's >>> possible to just output the resulting frame through the IP stack like >>> all tunnels do today. >> >> Which seems to potentially suggest a stronger integration of the STT >> tunnel transmit path into our IP stack rather than the approach Simon >> is taking > > Did you have something in mind? A normal bona fide tunnel netdevice driver like GRE instead of the openvswitch approach Simon is using.
On Mon, Apr 23, 2012 at 2:08 PM, David Miller <davem@davemloft.net> wrote: > From: Jesse Gross <jesse@nicira.com> > Date: Mon, 23 Apr 2012 13:53:42 -0700 > >> [...] >> Did you have something in mind? > > A normal bona fide tunnel netdevice driver like GRE instead of the > openvswitch approach Simon is using. Ahh, yes, that I agree with. Independent of this, there's work being done to make it so that OVS can use the normal in-tree tunneling code and not need its own. Once that's done I expect that STT will follow the same model.
On Mon, Apr 23, 2012 at 02:38:07PM -0700, Jesse Gross wrote: > On Mon, Apr 23, 2012 at 2:08 PM, David Miller <davem@davemloft.net> wrote: > > [...] > > A normal bona fide tunnel netdevice driver like GRE instead of the > > openvswitch approach Simon is using. > > Ahh, yes, that I agree with. Independent of this, there's work being > done to make it so that OVS can use the normal in-tree tunneling code > and not need its own. Once that's done I expect that STT will follow > the same model. Hi Jesse, I am wondering how firm the plans on allowing OVS to use in-tree tunnel code are. I'm happy to move my efforts over to an in-tree STT implementation but ultimately I would like to get STT running in conjunction with OVS.
On Mon, Apr 23, 2012 at 3:32 PM, Simon Horman <horms@verge.net.au> wrote: > On Mon, Apr 23, 2012 at 02:38:07PM -0700, Jesse Gross wrote: >> [...] >> Ahh, yes, that I agree with. Independent of this, there's work being >> done to make it so that OVS can use the normal in-tree tunneling code >> and not need its own. Once that's done I expect that STT will follow >> the same model. > > Hi Jesse, > > I am wondering how firm the plans on allowing OVS to use in-tree tunnel > code are. I'm happy to move my efforts over to an in-tree STT implementation > but ultimately I would like to get STT running in conjunction with OVS. I would say that it's a firm goal but the implementation probably still has a ways to go. Kyle Mestery (CC'ed) has volunteered to work on this in support of adding VXLAN, which needs some additional flexibility that this approach would also provide. You might want to talk to him to see if there are ways that you guys can work together on it if you are interested. Having better integration with upstream tunneling is definitely a step that OVS needs to make and sooner would be better than later.
On Mon, Apr 23, 2012 at 03:59:24PM -0700, Jesse Gross wrote: > On Mon, Apr 23, 2012 at 3:32 PM, Simon Horman <horms@verge.net.au> wrote: > > [...] > > I am wondering how firm the plans on allowing OVS to use in-tree tunnel > > code are. I'm happy to move my efforts over to an in-tree STT implementation > > but ultimately I would like to get STT running in conjunction with OVS. > > I would say that it's a firm goal but the implementation probably > still has a ways to go. Kyle Mestery (CC'ed) has volunteered to work > on this in support of adding VXLAN, which needs some additional > flexibility that this approach would also provide. You might want to > talk to him to see if there are ways that you guys can work together > on it if you are interested. Having better integration with upstream > tunneling is definitely a step that OVS needs to make and sooner would > be better than later. Hi Jesse, Hi Kyle, that sounds like an excellent plan. Kyle, do you have any thoughts on how we might best work together on this? Perhaps there are some patches floating around that I could take a look at?
----- Original Message ----- > On Mon, Apr 23, 2012 at 03:59:24PM -0700, Jesse Gross wrote: > > [...] > > Hi Jesse, Hi Kyle, > > that sounds like an excellent plan. > > Kyle, do you have any thoughts on how we might best work together on this? > Perhaps there are some patches floating around that I could take a look at? ChrisW had a start to a VXLAN tunnel (non OVS), and I promised to work on finishing it.
On Mon, Apr 23, 2012 at 09:40:57PM -0700, Stephen Hemminger wrote: > ----- Original Message ----- > > [...] > > Kyle, do you have any thoughts on how we might best work together on this? > > Perhaps there are some patches floating around that I could take a look at? > > ChrisW had a start to a VXLAN tunnel (non OVS), and I promised to work on > finishing it. Thanks. I guess that I might be able to base parts of an STT implementation on that work. I'd like to use an STT implementation with OVS, so in-tree tunnel support for OVS is also important to me.
On Apr 23, 2012, at 9:25 PM, Simon Horman wrote: > On Mon, Apr 23, 2012 at 03:59:24PM -0700, Jesse Gross wrote: >> [...] >> I would say that it's a firm goal but the implementation probably >> still has a ways to go. Kyle Mestery (CC'ed) has volunteered to work >> on this in support of adding VXLAN, which needs some additional >> flexibility that this approach would also provide. > > Hi Jesse, Hi Kyle, > > that sounds like an excellent plan. > > Kyle, do you have any thoughts on how we might best work together on this? > Perhaps there are some patches floating around that I could take a look at? Hi Simon: The VXLAN work has been slow going for me at this point. What I have works, but is far from complete. It's available here: https://github.com/mestery/ovs-vxlan/tree/vxlan This is based on a fairly recent version of OVS. I'm currently working to allow tunnels to be flow-based rather than port-based, as they currently exist. As Jesse may have mentioned, doing this allows us to move most tunnel state into user space. The outer header can now be part of the flow lookup and can be passed to user space, so things like multicast learning for VXLAN become possible. With regards to working together, ping me off-list and we can work something out, I'm very much in favor of this! Thanks! Kyle
On Tue, 24 Apr 2012 16:02:41 +0000 "Kyle Mestery (kmestery)" <kmestery@cisco.com> wrote: > On Apr 23, 2012, at 9:25 PM, Simon Horman wrote: > > [...] > > Hi Simon: > > The VXLAN work has been slow going for me at this point. What I have works, but is far from complete. It's available here: > > https://github.com/mestery/ovs-vxlan/tree/vxlan > > This is based on a fairly recent version of OVS. I'm currently working to allow tunnels to be flow-based rather than port-based, as they currently exist. > > [...] My use of VXLAN was to be key-based (like existing GRE), not flow-based.
On Apr 24, 2012, at 11:13 AM, Stephen Hemminger wrote: > On Tue, 24 Apr 2012 16:02:41 +0000 > "Kyle Mestery (kmestery)" <kmestery@cisco.com> wrote: > >> On Apr 23, 2012, at 9:25 PM, Simon Horman wrote: >>> On Mon, Apr 23, 2012 at 03:59:24PM -0700, Jesse Gross wrote: >>>> On Mon, Apr 23, 2012 at 3:32 PM, Simon Horman <horms@verge.net.au> wrote: >>>>> On Mon, Apr 23, 2012 at 02:38:07PM -0700, Jesse Gross wrote: >>>>>> On Mon, Apr 23, 2012 at 2:08 PM, David Miller <davem@davemloft.net> wrote: >>>>>>> From: Jesse Gross <jesse@nicira.com> >>>>>>> Date: Mon, 23 Apr 2012 13:53:42 -0700 >>>>>>> >>>>>>>> On Mon, Apr 23, 2012 at 1:13 PM, David Miller <davem@davemloft.net> wrote: >>>>>>>>> From: Jesse Gross <jesse@nicira.com> >>>>>>>>> Date: Mon, 23 Apr 2012 13:08:49 -0700 >>>>>>>>> >>>>>>>>>> Assuming that the TCP stack generates large TSO frames on transmit >>>>>>>>>> (which could be the local stack; something sent by a VM; or packets >>>>>>>>>> received, coalesced by GRO and then encapsulated by STT) then you can >>>>>>>>>> just prepend the STT header (possibly slightly adjusting things like >>>>>>>>>> requested MSS, number of segments, etc. slightly). After that it's >>>>>>>>>> possible to just output the resulting frame through the IP stack like >>>>>>>>>> all tunnels do today. >>>>>>>>> >>>>>>>>> Which seems to potentially suggest a stronger intergration of the STT >>>>>>>>> tunnel transmit path into our IP stack rather than the approach Simon >>>>>>>>> is taking >>>>>>>> >>>>>>>> Did you have something in mind? >>>>>>> >>>>>>> A normal bonafide tunnel netdevice driver like GRE instead of the >>>>>>> openvswitch approach Simon is using. >>>>>> >>>>>> Ahh, yes, that I agree with. Independent of this, there's work being >>>>>> done to make it so that OVS can use the normal in-tree tunneling code >>>>>> and not need its own. Once that's done I expect that STT will follow >>>>>> the same model. 
>>>>> >>>>> Hi Jesse, >>>>> >>>>> I am wondering how firm the plans to on allowing OVS to use in-tree tunnel >>>>> code are. I'm happy to move my efforts over to an in-tree STT implementation >>>>> but ultimately I would like to get STT running in conjunction with OVS. >>>> >>>> I would say that it's a firm goal but the implementation probably >>>> still has a ways to go. Kyle Mestery (CC'ed) has volunteered to work >>>> on this in support of adding VXLAN, which needs some additional >>>> flexibility that this approach would also provide. You might want to >>>> talk to him to see if there are ways that you guys can work together >>>> on it if you are interested. Having better integration with upstream >>>> tunneling is definitely a step that OVS needs to make and sooner would >>>> be better than later. >>> >>> Hi Jesse, Hi Kyle, >>> >>> that sounds like an excellent plan. >>> >>> Kyle, do you have any thoughts on how we might best work together on this? >>> Perhaps there are some patches floating around that I could take a look at? >>> >> >> Hi Simon: >> >> The VXLAN work has been slow going for me at this point. What I have works, but is far from complete. It's available here: >> >> https://github.com/mestery/ovs-vxlan/tree/vxlan >> >> This is based on a fairly recent version of OVS. I'm currently working to allow tunnels to be flow-based rather than port-based, as they currently exist. As Jesse may have mentioned, doing this allows us to move most tunnel state into user space. The outer header can now be part of the flow lookup and can be passed to user space, so things like multicast learning for VXLAN become possible. >> >> With regards to working together, ping me off-list and we can work something out, I'm very much in favor of this! >> > > My use of VXVLAN was to be key based (like existing GRE), not flow based. 
> Yes, for OVS the idea is to add the tunnel key values to the flow-key in the OVS kernel module.
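To make "adding the tunnel key values to the flow-key" concrete, here is a small userspace C sketch. All struct and function names are hypothetical, not the actual OVS kernel API: the point is only that the outer-header values become ordinary fields of the lookup key, so the flow table can match on them and hand them to user space like any other field.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch only -- not the real OVS flow key.  The outer
 * tunnel values (tunnel ID plus outer IPv4 addresses) become ordinary
 * fields of the flow key. */
struct tun_key {
    uint64_t tun_id;    /* key/ID/VNI from the tunnel header */
    uint32_t ipv4_src;  /* outer source address */
    uint32_t ipv4_dst;  /* outer destination address */
};

struct flow_key {
    struct tun_key tun; /* new: outer-header part of the lookup key */
    uint32_t in_port;   /* existing fields, reduced for the sketch */
    uint16_t eth_type;
};

/* Field-wise comparison: packets that differ only in tunnel ID now hit
 * different flows, which is what makes per-VNI handling in user space
 * (e.g. multicast learning for VXLAN) possible. */
static int flow_key_equal(const struct flow_key *a, const struct flow_key *b)
{
    return a->tun.tun_id   == b->tun.tun_id &&
           a->tun.ipv4_src == b->tun.ipv4_src &&
           a->tun.ipv4_dst == b->tun.ipv4_dst &&
           a->in_port      == b->in_port &&
           a->eth_type     == b->eth_type;
}
```

With keys like this, a key-based setup (as Stephen describes for GRE) is just the special case where the flow matches only on tun_id.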
On Tue, Apr 24, 2012 at 04:02:41PM +0000, Kyle Mestery (kmestery) wrote: > On Apr 23, 2012, at 9:25 PM, Simon Horman wrote: > > On Mon, Apr 23, 2012 at 03:59:24PM -0700, Jesse Gross wrote: > >> On Mon, Apr 23, 2012 at 3:32 PM, Simon Horman <horms@verge.net.au> wrote: > >>> On Mon, Apr 23, 2012 at 02:38:07PM -0700, Jesse Gross wrote: > >>>> On Mon, Apr 23, 2012 at 2:08 PM, David Miller <davem@davemloft.net> wrote: > >>>>> From: Jesse Gross <jesse@nicira.com> > >>>>> Date: Mon, 23 Apr 2012 13:53:42 -0700 > >>>>> > >>>>>> On Mon, Apr 23, 2012 at 1:13 PM, David Miller <davem@davemloft.net> wrote: > >>>>>>> From: Jesse Gross <jesse@nicira.com> > >>>>>>> Date: Mon, 23 Apr 2012 13:08:49 -0700 > >>>>>>> > >>>>>>>> Assuming that the TCP stack generates large TSO frames on transmit > >>>>>>>> (which could be the local stack; something sent by a VM; or packets > >>>>>>>> received, coalesced by GRO and then encapsulated by STT) then you can > >>>>>>>> just prepend the STT header (possibly slightly adjusting things like > >>>>>>>> requested MSS, number of segments, etc. slightly). After that it's > >>>>>>>> possible to just output the resulting frame through the IP stack like > >>>>>>>> all tunnels do today. > >>>>>>> > >>>>>>> Which seems to potentially suggest a stronger intergration of the STT > >>>>>>> tunnel transmit path into our IP stack rather than the approach Simon > >>>>>>> is taking > >>>>>> > >>>>>> Did you have something in mind? > >>>>> > >>>>> A normal bonafide tunnel netdevice driver like GRE instead of the > >>>>> openvswitch approach Simon is using. > >>>> > >>>> Ahh, yes, that I agree with. Independent of this, there's work being > >>>> done to make it so that OVS can use the normal in-tree tunneling code > >>>> and not need its own. Once that's done I expect that STT will follow > >>>> the same model. > >>> > >>> Hi Jesse, > >>> > >>> I am wondering how firm the plans to on allowing OVS to use in-tree tunnel > >>> code are. 
I'm happy to move my efforts over to an in-tree STT implementation > >>> but ultimately I would like to get STT running in conjunction with OVS. > >> > >> I would say that it's a firm goal but the implementation probably > >> still has a ways to go. Kyle Mestery (CC'ed) has volunteered to work > >> on this in support of adding VXLAN, which needs some additional > >> flexibility that this approach would also provide. You might want to > >> talk to him to see if there are ways that you guys can work together > >> on it if you are interested. Having better integration with upstream > >> tunneling is definitely a step that OVS needs to make and sooner would > >> be better than later. > > > > Hi Jesse, Hi Kyle, > > > > that sounds like an excellent plan. > > > > Kyle, do you have any thoughts on how we might best work together on this? > > Perhaps there are some patches floating around that I could take a look at? > > > > Hi Simon: > > The VXLAN work has been slow going for me at this point. What I have works, but is far from complete. It's available here: > > https://github.com/mestery/ovs-vxlan/tree/vxlan > > This is based on a fairly recent version of OVS. I'm currently working to allow tunnels to be flow-based rather than port-based, as they currently exist. > As Jesse may have mentioned, doing this allows us to move most tunnel state into user space. The outer header can now be part of the flow lookup and can > be passed to user space, so things like multicast learning for VXLAN become possible. > > With regards to working together, ping me off-list and we can work something out, I'm very much in favor of this! Hi Kyle, the component that is of most interest to me is enabling OVS to use in-tree tunnelling code - as it seems that makes most sense for an implementation of STT. I have taken a brief look over your vxlan work and it isn't clear to me if it is moving towards being an in-tree implementation. 
Moreover, I'm rather unclear on what changes need to be made to OVS in order for in-tree tunneling to be used. My recollection is that OVS did make use of in-tree tunnelling code but this was removed in favour of the current implementation for various reasons (performance being one, IIRC). I gather that revisiting in-tree tunnelling won't reintroduce the previous set of problems, but I'm unclear how. Jesse, is it possible for you to describe that in a little detail or point me to some information?
On Apr 25, 2012, at 3:39 AM, Simon Horman wrote: > On Tue, Apr 24, 2012 at 04:02:41PM +0000, Kyle Mestery (kmestery) wrote: >> On Apr 23, 2012, at 9:25 PM, Simon Horman wrote: >>> On Mon, Apr 23, 2012 at 03:59:24PM -0700, Jesse Gross wrote: >>>> On Mon, Apr 23, 2012 at 3:32 PM, Simon Horman <horms@verge.net.au> wrote: >>>>> On Mon, Apr 23, 2012 at 02:38:07PM -0700, Jesse Gross wrote: >>>>>> On Mon, Apr 23, 2012 at 2:08 PM, David Miller <davem@davemloft.net> wrote: >>>>>>> From: Jesse Gross <jesse@nicira.com> >>>>>>> Date: Mon, 23 Apr 2012 13:53:42 -0700 >>>>>>> >>>>>>>> On Mon, Apr 23, 2012 at 1:13 PM, David Miller <davem@davemloft.net> wrote: >>>>>>>>> From: Jesse Gross <jesse@nicira.com> >>>>>>>>> Date: Mon, 23 Apr 2012 13:08:49 -0700 >>>>>>>>> >>>>>>>>>> Assuming that the TCP stack generates large TSO frames on transmit >>>>>>>>>> (which could be the local stack; something sent by a VM; or packets >>>>>>>>>> received, coalesced by GRO and then encapsulated by STT) then you can >>>>>>>>>> just prepend the STT header (possibly slightly adjusting things like >>>>>>>>>> requested MSS, number of segments, etc. slightly). After that it's >>>>>>>>>> possible to just output the resulting frame through the IP stack like >>>>>>>>>> all tunnels do today. >>>>>>>>> >>>>>>>>> Which seems to potentially suggest a stronger intergration of the STT >>>>>>>>> tunnel transmit path into our IP stack rather than the approach Simon >>>>>>>>> is taking >>>>>>>> >>>>>>>> Did you have something in mind? >>>>>>> >>>>>>> A normal bonafide tunnel netdevice driver like GRE instead of the >>>>>>> openvswitch approach Simon is using. >>>>>> >>>>>> Ahh, yes, that I agree with. Independent of this, there's work being >>>>>> done to make it so that OVS can use the normal in-tree tunneling code >>>>>> and not need its own. Once that's done I expect that STT will follow >>>>>> the same model. 
>>>>> >>>>> Hi Jesse, >>>>> >>>>> I am wondering how firm the plans to on allowing OVS to use in-tree tunnel >>>>> code are. I'm happy to move my efforts over to an in-tree STT implementation >>>>> but ultimately I would like to get STT running in conjunction with OVS. >>>> >>>> I would say that it's a firm goal but the implementation probably >>>> still has a ways to go. Kyle Mestery (CC'ed) has volunteered to work >>>> on this in support of adding VXLAN, which needs some additional >>>> flexibility that this approach would also provide. You might want to >>>> talk to him to see if there are ways that you guys can work together >>>> on it if you are interested. Having better integration with upstream >>>> tunneling is definitely a step that OVS needs to make and sooner would >>>> be better than later. >>> >>> Hi Jesse, Hi Kyle, >>> >>> that sounds like an excellent plan. >>> >>> Kyle, do you have any thoughts on how we might best work together on this? >>> Perhaps there are some patches floating around that I could take a look at? >>> >> >> Hi Simon: >> >> The VXLAN work has been slow going for me at this point. What I have works, but is far from complete. It's available here: >> >> https://github.com/mestery/ovs-vxlan/tree/vxlan >> >> This is based on a fairly recent version of OVS. I'm currently working to allow tunnels to be flow-based rather than port-based, as they currently exist. >> As Jesse may have mentioned, doing this allows us to move most tunnel state into user space. The outer header can now be part of the flow lookup and can >> be passed to user space, so things like multicast learning for VXLAN become possible. >> >> With regards to working together, ping me off-list and we can work something out, I'm very much in favor of this! > > Hi Kyle, > > the component that is of most interest to me is enabling OVS to use in-tree > tunnelling code - as it seems that makes most sense for an implementation > of STT. 
I have taken a brief look over your vxlan work and it isn't clear > to me if it is moving towards being an in-tree implementation. Moreover, > I'm a rather unclear on what changes need to be made to OVS in order for > in-tree tunneling to be used. > > My recollection is that OVS did make use of in-tree tunnelling code > but this was removed in favour of the current implementation for various > reasons (performance being one IIRC). I gather that revisiting in-tree > tunnelling won't revisit the previous set of problems. But I'm unclear how. > > Jesse, is it possible for you to describe that in a little detail > or point me to some information? Simon: The changes I have in there now are taking the first step of trying to add support for flow-based tunneling, in the case of VXLAN. Once we do that, we can remove (if we want) the existing port-based tunneling code. I was planning this as a first step. I would also like to understand from Jesse better the direction with regards to moving to in-tree tunneling. I assume the changes Jesse and I had talked about a few months back around flow-based tunneling will still be compatible with the in-tree tunneling as well. Thanks, Kyle
On Wed, Apr 25, 2012 at 1:39 AM, Simon Horman <horms@verge.net.au> wrote: > On Tue, Apr 24, 2012 at 04:02:41PM +0000, Kyle Mestery (kmestery) wrote: >> On Apr 23, 2012, at 9:25 PM, Simon Horman wrote: >> > On Mon, Apr 23, 2012 at 03:59:24PM -0700, Jesse Gross wrote: >> >> On Mon, Apr 23, 2012 at 3:32 PM, Simon Horman <horms@verge.net.au> wrote: >> >>> On Mon, Apr 23, 2012 at 02:38:07PM -0700, Jesse Gross wrote: >> >>>> On Mon, Apr 23, 2012 at 2:08 PM, David Miller <davem@davemloft.net> wrote: >> >>>>> From: Jesse Gross <jesse@nicira.com> >> >>>>> Date: Mon, 23 Apr 2012 13:53:42 -0700 >> >>>>> >> >>>>>> On Mon, Apr 23, 2012 at 1:13 PM, David Miller <davem@davemloft.net> wrote: >> >>>>>>> From: Jesse Gross <jesse@nicira.com> >> >>>>>>> Date: Mon, 23 Apr 2012 13:08:49 -0700 >> >>>>>>> >> >>>>>>>> Assuming that the TCP stack generates large TSO frames on transmit >> >>>>>>>> (which could be the local stack; something sent by a VM; or packets >> >>>>>>>> received, coalesced by GRO and then encapsulated by STT) then you can >> >>>>>>>> just prepend the STT header (possibly slightly adjusting things like >> >>>>>>>> requested MSS, number of segments, etc. slightly). After that it's >> >>>>>>>> possible to just output the resulting frame through the IP stack like >> >>>>>>>> all tunnels do today. >> >>>>>>> >> >>>>>>> Which seems to potentially suggest a stronger intergration of the STT >> >>>>>>> tunnel transmit path into our IP stack rather than the approach Simon >> >>>>>>> is taking >> >>>>>> >> >>>>>> Did you have something in mind? >> >>>>> >> >>>>> A normal bonafide tunnel netdevice driver like GRE instead of the >> >>>>> openvswitch approach Simon is using. >> >>>> >> >>>> Ahh, yes, that I agree with. Independent of this, there's work being >> >>>> done to make it so that OVS can use the normal in-tree tunneling code >> >>>> and not need its own. Once that's done I expect that STT will follow >> >>>> the same model. 
>> >>> >> >>> Hi Jesse, >> >>> >> >>> I am wondering how firm the plans to on allowing OVS to use in-tree tunnel >> >>> code are. I'm happy to move my efforts over to an in-tree STT implementation >> >>> but ultimately I would like to get STT running in conjunction with OVS. >> >> >> >> I would say that it's a firm goal but the implementation probably >> >> still has a ways to go. Kyle Mestery (CC'ed) has volunteered to work >> >> on this in support of adding VXLAN, which needs some additional >> >> flexibility that this approach would also provide. You might want to >> >> talk to him to see if there are ways that you guys can work together >> >> on it if you are interested. Having better integration with upstream >> >> tunneling is definitely a step that OVS needs to make and sooner would >> >> be better than later. >> > >> > Hi Jesse, Hi Kyle, >> > >> > that sounds like an excellent plan. >> > >> > Kyle, do you have any thoughts on how we might best work together on this? >> > Perhaps there are some patches floating around that I could take a look at? >> > >> >> Hi Simon: >> >> The VXLAN work has been slow going for me at this point. What I have works, but is far from complete. It's available here: >> >> https://github.com/mestery/ovs-vxlan/tree/vxlan >> >> This is based on a fairly recent version of OVS. I'm currently working to allow tunnels to be flow-based rather than port-based, as they currently exist. >> As Jesse may have mentioned, doing this allows us to move most tunnel state into user space. The outer header can now be part of the flow lookup and can >> be passed to user space, so things like multicast learning for VXLAN become possible. >> >> With regards to working together, ping me off-list and we can work something out, I'm very much in favor of this! > > Hi Kyle, > > the component that is of most interest to me is enabling OVS to use in-tree > tunnelling code - as it seems that makes most sense for an implementation > of STT. 
I have taken a brief look over your vxlan work and it isn't clear > to me if it is moving towards being an in-tree implementation. Moreover, > I'm a rather unclear on what changes need to be made to OVS in order for > in-tree tunneling to be used. > > My recollection is that OVS did make use of in-tree tunnelling code > but this was removed in favour of the current implementation for various > reasons (performance being one IIRC). I gather that revisiting in-tree > tunnelling won't revisit the previous set of problems. But I'm unclear how. > > Jesse, is it possible for you to describe that in a little detail > or point me to some information? This was what I had originally written a while back, although it's more about OVS internally and less about how to connect to the in-tree code: http://openvswitch.org/pipermail/dev/2012-February/014779.html In order to flexibly implement support for current and future tunnel protocols OVS needs to be able to get/set information about the outer tunnel header when processing the inner packet. At the very least this is src/dst IP addresses and the key/ID/VNI/etc. In the upstream tunnel implementations those are implicitly encoded in the device that sends or receives the packet. However, this has two problems: number of devices and ability to handle unknown values. We addressed part of this problem by allowing the tunnel ID to be set and matched through the OVS flow table and an action. In order to do this with the in-tree tunneling code, we obviously need a way of passing this information around since it would currently get lost as we pass through the Linux device layer. The plan to deal with that is to add a function to the in-tree tunneling code that allows a skb to be encapsulated with specific parameters and conversely a hook to receive decapsulated packets along with header info. 
This would make all of the kernel tunneling code common, while still giving OVS userspace the ability to implement essentially any type of tunneling policy. In many ways, this is very similar to how vlans look in OVS today. While it would be possible to implement the hook to use the in-tree tunnel code today without a lot of changes, we already know that we want to move away from the port-based model in the OVS kernel module towards the flow model. As we push this upstream, the userspace/kernel API should be the correct one, so that's why these two things are tied together.
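The shape of the API described above — encapsulate with explicit per-packet parameters on transmit, hand recovered outer-header info to a hook on receive — can be mocked in a few lines of userspace C. All names here are hypothetical (the real in-tree code deals in skbs and net_devices); this is a sketch of the calling convention, not kernel code.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical mock of the proposed API shape.  The caller supplies
 * the outer-header values per packet instead of having them implied
 * by a tunnel netdevice. */
struct tnl_params {
    uint32_t saddr;  /* outer source IPv4 address */
    uint32_t daddr;  /* outer destination IPv4 address */
    uint64_t key;    /* tunnel ID / VNI */
};

/* Simplified wire format for the sketch: just the parameters. */
struct tnl_wire_hdr {
    uint32_t saddr, daddr;
    uint64_t key;
};

/* Receive hook: gets the decapsulated payload plus the outer-header
 * info, which OVS could then feed into its flow lookup. */
typedef void (*tnl_recv_hook_t)(const uint8_t *payload, size_t len,
                                const struct tnl_params *params);

/* Transmit side: encapsulate with caller-supplied parameters. */
static void tnl_build_header(const struct tnl_params *p,
                             struct tnl_wire_hdr *h)
{
    h->saddr = p->saddr;
    h->daddr = p->daddr;
    h->key   = p->key;
}

/* Receive side: recover the parameters before calling the hook. */
static void tnl_parse_header(const struct tnl_wire_hdr *h,
                             struct tnl_params *p)
{
    p->saddr = h->saddr;
    p->daddr = h->daddr;
    p->key   = h->key;
}
```

The vlan analogy holds here too: the tunnel values ride alongside the packet as metadata rather than being baked into a device.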
On Wed, Apr 25, 2012 at 10:17:25AM -0700, Jesse Gross wrote: > On Wed, Apr 25, 2012 at 1:39 AM, Simon Horman <horms@verge.net.au> wrote: > > > > Hi Kyle, > > > > the component that is of most interest to me is enabling OVS to use in-tree > > tunnelling code - as it seems that makes most sense for an implementation > > of STT. I have taken a brief look over your vxlan work and it isn't clear > > to me if it is moving towards being an in-tree implementation. Moreover, > > I'm a rather unclear on what changes need to be made to OVS in order for > > in-tree tunneling to be used. > > > > My recollection is that OVS did make use of in-tree tunnelling code > > but this was removed in favour of the current implementation for various > > reasons (performance being one IIRC). I gather that revisiting in-tree > > tunnelling won't revisit the previous set of problems. But I'm unclear how. > > > > Jesse, is it possible for you to describe that in a little detail > > or point me to some information? > > This was what I had originally written a while back, although it's > more about OVS internally and less about how to connect to the in-tree > code: > http://openvswitch.org/pipermail/dev/2012-February/014779.html > > In order to flexibly implement support for current and future tunnel > protocols OVS needs to be able to get/set information about the outer > tunnel header when processing the inner packet. At the very least > this is src/dst IP addresses and the key/ID/VNI/etc. In the upstream > tunnel implementations those are implicitly encoded in the device that > sends or receives the packet. However, this has a two problems: > number of devices and ability to handle unknown values. We addressed > part of this problem by allowing the tunnel ID to be set and matched > through the OVS flow table and an action. 
In order to do this with > the in-tree tunneling code, we obviously need a way of passing this > information around since it would currently get lost as we pass > through the Linux device layer. > > The plan to deal with that is to add a function to the in-tree > tunneling code that allows a skb to be encapsulated with specific > parameters and conversely a hook to receive decapsulated packets along > with header info. This would make all of the kernel tunneling code > common, while still giving OVS userspace the ability to implement > essentially any type of tunneling policy. In many ways, this is very > similar to how vlans look in OVS today. > > While it would be possible to implement the hook to use the in-tree > tunnel code today without a lot of changes, we already know that we > want to move away from port-based model in the OVS kernel module > towards the flow model. As we push this upstream the userspace/kernel > API should be the correct one, so that's why these two things are tied > together. Thanks, that explanation along with Kyle's response helps a lot. It seems to me that something I could help out with is the implementation of the set_tunnel action which extends and replaces the tun_id action. It seems that is a requirement for the scheme you describe above. http://openvswitch.org/pipermail/dev/2012-April/016239.html
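For context, the tun_id machinery is already visible at the OpenFlow layer: ovs-ofctl exposes a set_tunnel action (a Nicira extension) that stamps a tunnel ID on a packet before it is output to a tunnel port. An illustrative flow, with made-up bridge and port numbers, would look like this (configuration example only):

```shell
# Illustrative only: tag traffic arriving on port 1 with tunnel ID 0x10
# and send it out the tunnel vport on port 2 of bridge br0.
ovs-ofctl add-flow br0 "in_port=1,actions=set_tunnel:0x10,output:2"
```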
On Thu, Apr 26, 2012 at 12:13 AM, Simon Horman <horms@verge.net.au> wrote: > On Wed, Apr 25, 2012 at 10:17:25AM -0700, Jesse Gross wrote: >> On Wed, Apr 25, 2012 at 1:39 AM, Simon Horman <horms@verge.net.au> wrote: >> > >> > Hi Kyle, >> > >> > the component that is of most interest to me is enabling OVS to use in-tree >> > tunnelling code - as it seems that makes most sense for an implementation >> > of STT. I have taken a brief look over your vxlan work and it isn't clear >> > to me if it is moving towards being an in-tree implementation. Moreover, >> > I'm a rather unclear on what changes need to be made to OVS in order for >> > in-tree tunneling to be used. >> > >> > My recollection is that OVS did make use of in-tree tunnelling code >> > but this was removed in favour of the current implementation for various >> > reasons (performance being one IIRC). I gather that revisiting in-tree >> > tunnelling won't revisit the previous set of problems. But I'm unclear how. >> > >> > Jesse, is it possible for you to describe that in a little detail >> > or point me to some information? >> >> This was what I had originally written a while back, although it's >> more about OVS internally and less about how to connect to the in-tree >> code: >> http://openvswitch.org/pipermail/dev/2012-February/014779.html >> >> In order to flexibly implement support for current and future tunnel >> protocols OVS needs to be able to get/set information about the outer >> tunnel header when processing the inner packet. At the very least >> this is src/dst IP addresses and the key/ID/VNI/etc. In the upstream >> tunnel implementations those are implicitly encoded in the device that >> sends or receives the packet. However, this has a two problems: >> number of devices and ability to handle unknown values. We addressed >> part of this problem by allowing the tunnel ID to be set and matched >> through the OVS flow table and an action. 
In order to do this with >> the in-tree tunneling code, we obviously need a way of passing this >> information around since it would currently get lost as we pass >> through the Linux device layer. >> >> The plan to deal with that is to add a function to the in-tree >> tunneling code that allows a skb to be encapsulated with specific >> parameters and conversely a hook to receive decapsulated packets along >> with header info. This would make all of the kernel tunneling code >> common, while still giving OVS userspace the ability to implement >> essentially any type of tunneling policy. In many ways, this is very >> similar to how vlans look in OVS today. >> >> While it would be possible to implement the hook to use the in-tree >> tunnel code today without a lot of changes, we already know that we >> want to move away from port-based model in the OVS kernel module >> towards the flow model. As we push this upstream the userspace/kernel >> API should be the correct one, so that's why these two things are tied >> together. > > > Thanks, that explanation along with Kyle's response helps a lot. > > It seems to me that something I could help out with is the implementation > of the set_tunnel action which extents and replaces the tun_id action. > It seems that is a requirement for the scheme you describe above. > > http://openvswitch.org/pipermail/dev/2012-April/016239.html I agree that's probably the best place to start unless Kyle has some specific plans otherwise.
On Apr 26, 2012, at 11:13 AM, Jesse Gross wrote: > On Thu, Apr 26, 2012 at 12:13 AM, Simon Horman <horms@verge.net.au> wrote: >> On Wed, Apr 25, 2012 at 10:17:25AM -0700, Jesse Gross wrote: >>> On Wed, Apr 25, 2012 at 1:39 AM, Simon Horman <horms@verge.net.au> wrote: >>>> >>>> Hi Kyle, >>>> >>>> the component that is of most interest to me is enabling OVS to use in-tree >>>> tunnelling code - as it seems that makes most sense for an implementation >>>> of STT. I have taken a brief look over your vxlan work and it isn't clear >>>> to me if it is moving towards being an in-tree implementation. Moreover, >>>> I'm a rather unclear on what changes need to be made to OVS in order for >>>> in-tree tunneling to be used. >>>> >>>> My recollection is that OVS did make use of in-tree tunnelling code >>>> but this was removed in favour of the current implementation for various >>>> reasons (performance being one IIRC). I gather that revisiting in-tree >>>> tunnelling won't revisit the previous set of problems. But I'm unclear how. >>>> >>>> Jesse, is it possible for you to describe that in a little detail >>>> or point me to some information? >>> >>> This was what I had originally written a while back, although it's >>> more about OVS internally and less about how to connect to the in-tree >>> code: >>> http://openvswitch.org/pipermail/dev/2012-February/014779.html >>> >>> In order to flexibly implement support for current and future tunnel >>> protocols OVS needs to be able to get/set information about the outer >>> tunnel header when processing the inner packet. At the very least >>> this is src/dst IP addresses and the key/ID/VNI/etc. In the upstream >>> tunnel implementations those are implicitly encoded in the device that >>> sends or receives the packet. However, this has a two problems: >>> number of devices and ability to handle unknown values. 
We addressed >>> part of this problem by allowing the tunnel ID to be set and matched >>> through the OVS flow table and an action. In order to do this with >>> the in-tree tunneling code, we obviously need a way of passing this >>> information around since it would currently get lost as we pass >>> through the Linux device layer. >>> >>> The plan to deal with that is to add a function to the in-tree >>> tunneling code that allows a skb to be encapsulated with specific >>> parameters and conversely a hook to receive decapsulated packets along >>> with header info. This would make all of the kernel tunneling code >>> common, while still giving OVS userspace the ability to implement >>> essentially any type of tunneling policy. In many ways, this is very >>> similar to how vlans look in OVS today. >>> >>> While it would be possible to implement the hook to use the in-tree >>> tunnel code today without a lot of changes, we already know that we >>> want to move away from port-based model in the OVS kernel module >>> towards the flow model. As we push this upstream the userspace/kernel >>> API should be the correct one, so that's why these two things are tied >>> together. >> >> >> Thanks, that explanation along with Kyle's response helps a lot. >> >> It seems to me that something I could help out with is the implementation >> of the set_tunnel action which extents and replaces the tun_id action. >> It seems that is a requirement for the scheme you describe above. >> >> http://openvswitch.org/pipermail/dev/2012-April/016239.html > > I agree that's probably the best place to start unless Kyle has some > specific plans otherwise. Simon and I chatted off-list, and this is indeed where we plan to start.
diff --git a/acinclude.m4 b/acinclude.m4
index 69bb772..f3a52fa 100644
--- a/acinclude.m4
+++ b/acinclude.m4
@@ -266,6 +266,9 @@ AC_DEFUN([OVS_CHECK_LINUX_COMPAT], [
   OVS_GREP_IFELSE([$KSRC/include/linux/if_vlan.h], [ADD_ALL_VLANS_CMD],
                   [OVS_DEFINE([HAVE_VLAN_BUG_WORKAROUND])])
 
+  OVS_GREP_IFELSE([$KSRC/include/linux/tcp.h], [encap_rcv],
+                  [OVS_DEFINE([HAVE_TCP_ENCAP_RCV])])
+
   OVS_CHECK_LOG2_H
 
   if cmp -s datapath/linux/kcompat.h.new \
diff --git a/datapath/Modules.mk b/datapath/Modules.mk
index 24c1075..6fbe3dd 100644
--- a/datapath/Modules.mk
+++ b/datapath/Modules.mk
@@ -26,7 +26,8 @@ openvswitch_sources = \
 	vport-gre.c \
 	vport-internal_dev.c \
 	vport-netdev.c \
-	vport-patch.c
+	vport-patch.c \
+	vport-stt.c
 
 openvswitch_headers = \
 	checksum.h \
diff --git a/datapath/tunnel.h b/datapath/tunnel.h
index 33eb63c..96f59b1 100644
--- a/datapath/tunnel.h
+++ b/datapath/tunnel.h
@@ -41,6 +41,7 @@
  */
 #define TNL_T_PROTO_GRE		0
 #define TNL_T_PROTO_CAPWAP	1
+#define TNL_T_PROTO_STT		2
 
 /* These flags are only needed when calling tnl_find_port(). */
 #define TNL_T_KEY_EXACT		(1 << 10)
diff --git a/datapath/vport-stt.c b/datapath/vport-stt.c
new file mode 100644
index 0000000..638998d
--- /dev/null
+++ b/datapath/vport-stt.c
@@ -0,0 +1,803 @@
+/*
+ * Copyright (c) 2012 Horms Solutions Ltd.
+ * Distributed under the terms of the GNU GPL version 2.
+ *
+ * Significant portions of this file may be copied from parts of the Linux
+ * kernel, by Linus Torvalds and others.
+ *
+ * Significant portions of this file may be copied from
+ * other parts of Open vSwitch, by Nicira Networks and others.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": stt: " fmt
+
+#include <linux/version.h>
+#ifdef HAVE_TCP_ENCAP_RCV
+
+#include <linux/if.h>
+#include <linux/in.h>
+#include <linux/ip.h>
+#include <linux/list.h>
+#include <linux/net.h>
+#include <net/net_namespace.h>
+
+#include <net/icmp.h>
+#include <net/inet_frag.h>
+#include <net/ip.h>
+#include <net/ipv6.h>
+#include <net/protocol.h>
+#include <net/udp.h>
+#include <net/tcp.h>
+
+#include "datapath.h"
+#include "tunnel.h"
+#include "vport.h"
+#include "vport-generic.h"
+
+#define STT_DST_PORT 58882	/* Change to actual port number once awarded by IANA */
+
+/* XXX: Possible Consolidation: The same values as capwap */
+#define STT_FRAG_TIMEOUT		(30 * HZ)
+#define STT_FRAG_MAX_MEM		(256 * 1024)
+#define STT_FRAG_PRUNE_MEM		(192 * 1024)
+#define STT_FRAG_SECRET_INTERVAL	(10 * 60 * HZ)
+
+#define STT_FLAG_CHECKSUM_VERIFIED	(1 << 0)
+#define STT_FLAG_CHECKSUM_PARTIAL	(1 << 1)
+#define STT_FLAG_IP_VERSION		(1 << 2)
+#define STT_FLAG_TCP_PAYLOAD		(1 << 3)
+
+#define FRAG_OFF_MASK	0xffffU
+#define FRAME_LEN_SHIFT	16
+
+struct stthdr {
+	uint8_t	version;
+	uint8_t	flags;
+	uint8_t	l4_offset;
+	uint8_t	reserved;
+	__be16	mss;
+	__be16	vlan_tci;
+	__be64	context_id;
+};
+
+/*
+ * Not in stthdr to avoid that structure being padded to
+ * a 64bit boundary - 2 bytes of pad are required, not 8
+ */
+struct stthdr_pad {
+	uint8_t pad[2];
+};
+
+static struct stthdr *stt_hdr(const struct sk_buff *skb)
+{
+	return (struct stthdr *)(tcp_hdr(skb) + 1);
+}
+
+/*
+ * The minimum header length.
+ */
+#define STT_SEG_HLEN	sizeof(struct tcphdr)
+#define STT_FRAME_HLEN	(STT_SEG_HLEN + sizeof(struct stthdr) + \
+			 sizeof(struct stthdr_pad))
+
+static inline int stt_seg_len(struct sk_buff *skb)
+{
+	return skb->len - skb_transport_offset(skb) - STT_SEG_HLEN;
+}
+
+static inline struct ethhdr *stt_inner_eth_header(struct sk_buff *skb)
+{
+	return (struct ethhdr *)((char *)skb_transport_header(skb) +
+				 STT_FRAME_HLEN);
+}
+
+/* XXX: Possible Consolidation: Same as capwap */
+struct frag_match {
+	__be32 saddr;
+	__be32 daddr;
+	__be32 id;
+};
+
+/* XXX: Possible Consolidation: Same as capwap */
+struct frag_queue {
+	struct inet_frag_queue ifq;
+	struct frag_match match;
+};
+
+/* XXX: Possible Consolidation: Same as capwap */
+struct frag_skb_cb {
+	u16 offset;
+};
+#define FRAG_CB(skb) ((struct frag_skb_cb *)(skb)->cb)
+
+static struct sk_buff *defrag(struct sk_buff *skb, u16 frame_len);
+
+static void stt_frag_init(struct inet_frag_queue *, void *match);
+static unsigned int stt_frag_hash(struct inet_frag_queue *);
+static int stt_frag_match(struct inet_frag_queue *, void *match);
+static void stt_frag_expire(unsigned long ifq);
+
+static struct inet_frags frag_state = {
+	.constructor	= stt_frag_init,
+	.qsize		= sizeof(struct frag_queue),
+	.hashfn		= stt_frag_hash,
+	.match		= stt_frag_match,
+	.frag_expire	= stt_frag_expire,
+	.secret_interval = STT_FRAG_SECRET_INTERVAL,
+};
+
+/* random value for selecting source ports */
+static u32 stt_port_rnd __read_mostly;
+
+static int stt_hdr_len(const struct tnl_mutable_config *mutable)
+{
+	return (int)STT_FRAME_HLEN;
+}
+
+static void stt_build_header(const struct vport *vport,
+			     const struct tnl_mutable_config *mutable,
+			     void *header)
+{
+	struct tcphdr *tcph = header;
+	struct stthdr *stth = (struct stthdr *)(tcph + 1);
+	struct stthdr_pad *pad = (struct stthdr_pad *)(stth + 1);
+
+	tcph->dest = htons(STT_DST_PORT);
+	tcp_flag_word(tcph) = 0;
+	tcph->doff = sizeof(struct tcphdr) / 4;
+	tcph->ack = 1;
+
+	pad->pad[0] = pad->pad[1] = 0;
+}
+
+static u16 stt_src_port(u32 hash)
+{
+	int low, high;
+	inet_get_local_port_range(&low, &high);
+	return hash % (high - low) + low;
+}
+
+struct sk_buff *stt_update_header(const struct vport *vport,
+				  const struct tnl_mutable_config *mutable,
+				  struct dst_entry *dst,
+				  struct sk_buff *skb)
+{
+	struct tcphdr *tcph;
+	struct stthdr *stth;
+	struct ethhdr *inner_ethh;
+	struct tnl_vport *tnl_vport = tnl_vport_priv(vport);
+	__be32 frag_id = htonl(atomic_inc_return(&tnl_vport->frag_id));
+	__be32 vlan_tci = 0;
+	u32 hash = jhash_1word(skb->protocol, stt_port_rnd);
+	int l4_protocol = IPPROTO_MAX;
+
+	if (skb->protocol == htons(ETH_P_8021Q)) {
+		struct vlan_ethhdr *vlanh;
+
+		if (unlikely(!pskb_may_pull(skb, VLAN_ETH_HLEN)))
+			goto err;
+
+		vlanh = (struct vlan_ethhdr *)stt_inner_eth_header(skb);
+		vlan_tci = vlanh->h_vlan_TCI;
+
+		/* STT requires that the encapsulated frame be untagged
+		 * and the STT header only allows saving one VLAN TCI.
+		 * So there seems to be no way to handle the presence of
+		 * more than one vlan tag other than to drop the packet.
+		 */
+		if (vlan_eth_hdr(skb)->h_vlan_encapsulated_proto ==
+		    htons(ETH_P_8021Q))
+			goto err;
+
+		memmove(skb->data + VLAN_HLEN, skb->data,
+			(size_t)((char *)vlanh - (char *)skb->data) +
+			2 * ETH_ALEN);
+		if (unlikely(!skb_pull(skb, VLAN_HLEN)))
+			goto err;
+
+		skb->protocol = vlan_eth_hdr(skb)->h_vlan_encapsulated_proto;
+		skb->mac_header += VLAN_HLEN;
+		skb->network_header += VLAN_HLEN;
+		skb->transport_header += VLAN_HLEN;
+	}
+
+	tcph = tcp_hdr(skb);
+	stth = (struct stthdr *)(tcph + 1);
+	inner_ethh = stt_inner_eth_header(skb);
+
+	stth->flags = 0;
+
+	if (skb->protocol == htons(ETH_P_IP)) {
+		struct iphdr *iph = (struct iphdr *)(inner_ethh + 1);
+		hash = jhash_2words(iph->saddr, iph->daddr, hash);
+		l4_protocol = iph->protocol;
+		stth->flags |= STT_FLAG_IP_VERSION;
+	} else if (skb->protocol == htons(ETH_P_IPV6)) {
+		struct ipv6hdr *ipv6h = (struct ipv6hdr *)(inner_ethh + 1);
+		hash = jhash(ipv6h->saddr.s6_addr,
+			     sizeof(ipv6h->saddr.s6_addr), hash);
+		hash = jhash(ipv6h->daddr.s6_addr,
+			     sizeof(ipv6h->daddr.s6_addr), hash);
+		l4_protocol = ipv6h->nexthdr;
+	}
+
+	stth->l4_offset = 0;
+	if (get_ip_summed(skb) == OVS_CSUM_PARTIAL && skb->csum_start) {
+		int off = skb->csum_start - skb_headroom(skb);
+		if (likely(off < 256 && off > 0))
+			stth->l4_offset = off;
+		else if (net_ratelimit())
+			pr_err("%s: l4_offset is out of range %d should be "
+			       "between 0 and 255", __func__, off);
+	}
+
+	if (stth->l4_offset && (l4_protocol == IPPROTO_TCP ||
+				l4_protocol == IPPROTO_UDP ||
+				l4_protocol == IPPROTO_DCCP ||
+				l4_protocol == IPPROTO_SCTP)) {
+		/* TCP, UDP, DCCP and SCTP place the source and destination
+		 * ports in the first and second 16-bits of their header,
+		 * so grabbing the first 32-bits will give a combined value.
+		 */
+		__be32 *ports = (__be32 *)((char *)inner_ethh +
+					   stth->l4_offset);
+		hash = jhash_1word(*ports, hash);
+	}
+
+	if (l4_protocol == IPPROTO_TCP)
+		stth->flags |= STT_FLAG_TCP_PAYLOAD;
+
+	stth->reserved = 0;
+	stth->mss = htons(dst_mtu(dst));
+	stth->vlan_tci = vlan_tci;
+	stth->context_id = mutable->out_key;
+
+	tcph->source = htons(stt_src_port(hash));
+	tcph->seq = htonl(stt_seg_len(skb) << FRAME_LEN_SHIFT);
+	tcph->ack_seq = frag_id;
+	tcph->ack = 1;
+	tcph->psh = 1;
+
+	switch (get_ip_summed(skb)) {
+	case OVS_CSUM_PARTIAL:
+		stth->flags |= STT_FLAG_CHECKSUM_PARTIAL;
+		tcph->check = ~tcp_v4_check(skb->len,
+					    ip_hdr(skb)->saddr,
+					    ip_hdr(skb)->daddr, 0);
+		skb->csum_start = skb_transport_header(skb) - skb->head;
+		skb->csum_offset = offsetof(struct tcphdr, check);
+		break;
+	case OVS_CSUM_UNNECESSARY:
+		stth->flags |= STT_FLAG_CHECKSUM_VERIFIED;
+		pr_debug_once("%s: checksum unnecessary\n", __func__);
+		/* fall through */
+	default:
+		tcph->check = 0;
+		skb->csum = skb_checksum(skb, skb_transport_offset(skb),
+					 skb->len - skb_transport_offset(skb),
+					 0);
+		tcph->check = tcp_v4_check(skb->len - skb_transport_offset(skb),
+					   ip_hdr(skb)->saddr,
+					   ip_hdr(skb)->daddr, skb->csum);
+		set_ip_summed(skb, OVS_CSUM_UNNECESSARY);
+	}
+	forward_ip_summed(skb, 1);
+
+	return skb;
+err:
+	kfree_skb(skb);
+	return NULL;
+}
+
+static inline struct capwap_net *ovs_get_stt_net(struct net *net)
+{
+	struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
+	return &ovs_net->vport_net.stt;
+}
+
+static struct sk_buff *process_stt_proto(struct sk_buff *skb, __be64 *key)
+{
+	struct tcphdr *tcph = tcp_hdr(skb);
+	struct stthdr *stth;
+	u16 frame_len;
+
+	skb_postpull_rcsum(skb, skb_transport_header(skb),
+			   STT_SEG_HLEN + ETH_HLEN);
+
+	frame_len = ntohl(tcph->seq) >> FRAME_LEN_SHIFT;
+	if (stt_seg_len(skb) < frame_len) {
+		skb = defrag(skb, frame_len);
+		if (!skb)
+			return NULL;
+	}
+
+	if (skb->len < (tcph->doff << 2) || tcp_checksum_complete(skb)) {
+		if (net_ratelimit()) {
+			struct iphdr *iph = ip_hdr(skb);
+			pr_info("stt: dropped frame with "
+				"invalid checksum (%pI4, %d)->(%pI4, %d)\n",
+				&iph->saddr, ntohs(tcph->source),
+				&iph->daddr, ntohs(tcph->dest));
+		}
+		goto error;
+	}
+
+	/* STT_FRAME_HLEN less two pad bytes is needed here.
+	 * STT_FRAME_HLEN is needed by our caller, stt_rcv().
+	 * An additional ETH_HLEN bytes are required by ovs_flow_extract()
+	 * which is called indirectly by our caller.
+	 */
+	if (unlikely(!pskb_may_pull(skb, STT_FRAME_HLEN + ETH_HLEN))) {
+		if (net_ratelimit())
+			pr_info("dropped frame that is too short! %d < %lu\n",
+				skb->len, STT_FRAME_HLEN + ETH_HLEN);
+		goto error;
+	}
+
+	stth = stt_hdr(skb);
+	/* Only accept STT version 0, it's all we know */
+	if (stth->version != 0)
+		goto error;
+
+	*key = stth->context_id;
+	__vlan_hwaccel_put_tag(skb, ntohs(stth->vlan_tci));
+
+	return skb;
+error:
+	kfree_skb(skb);
+	return NULL;
+}
+
+/* Called with rcu_read_lock and BH disabled.
+ */
+static int stt_rcv(struct sock *sk, struct sk_buff *skb)
+{
+	struct vport *vport;
+	const struct tnl_mutable_config *mutable;
+	struct iphdr *iph;
+	__be64 key = 0;
+
+	/* pskb_may_pull() has already been called for
+	 * sizeof(struct tcphdr) in tcp_v4_rcv(), so there
+	 * is no need to do so again here
+	 */
+
+	skb = process_stt_proto(skb, &key);
+	if (unlikely(!skb))
+		goto out;
+
+	iph = ip_hdr(skb);
+	vport = ovs_tnl_find_port(sock_net(sk), iph->daddr, iph->saddr, key,
+				  TNL_T_PROTO_STT, &mutable);
+	if (unlikely(!vport)) {
+		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_PORT_UNREACH, 0);
+		goto error;
+	}
+
+	if (mutable->flags & TNL_F_IN_KEY_MATCH)
+		OVS_CB(skb)->tun_id = key;
+	else
+		OVS_CB(skb)->tun_id = 0;
+
+	__skb_pull(skb, STT_FRAME_HLEN);
+	skb_postpull_rcsum(skb, skb_transport_header(skb),
+			   STT_FRAME_HLEN + ETH_HLEN);
+	if (get_ip_summed(skb) == OVS_CSUM_PARTIAL)
+		skb->csum_start += STT_FRAME_HLEN;
+
+	ovs_tnl_rcv(vport, skb, iph->tos);
+	goto out;
+
+error:
+	kfree_skb(skb);
+out:
+	return 0;
+}
+
+static const struct tnl_ops stt_tnl_ops = {
+	.tunnel_type	= TNL_T_PROTO_STT,
+	.ipproto	= IPPROTO_TCP,
+	.hdr_len	= stt_hdr_len,
+	.build_header	= stt_build_header,
+	.update_header	= stt_update_header,
+};
+
+static int init_socket(struct net *net)
+{
+	int err;
+	struct capwap_net *stt_net = ovs_get_stt_net(net);
+	struct sockaddr_in sin;
+
+	if (stt_net->n_tunnels) {
+		stt_net->n_tunnels++;
+		return 0;
+	}
+
+	err = sock_create_kern(AF_INET, SOCK_STREAM, 0,
+			       &stt_net->capwap_rcv_socket);
+	if (err)
+		goto error;
+
+	/* release net ref.
+	 */
+	sk_change_net(stt_net->capwap_rcv_socket->sk, net);
+
+	sin.sin_family = AF_INET;
+	sin.sin_addr.s_addr = htonl(INADDR_ANY);
+	sin.sin_port = htons(STT_DST_PORT);
+
+	err = kernel_bind(stt_net->capwap_rcv_socket, (struct sockaddr *)&sin,
+			  sizeof(struct sockaddr_in));
+	if (err)
+		goto error_sock;
+
+	tcp_sk(stt_net->capwap_rcv_socket->sk)->encap_rcv = stt_rcv;
+	tcp_encap_enable();
+
+	stt_net->frag_state.timeout	= STT_FRAG_TIMEOUT;
+	stt_net->frag_state.high_thresh	= STT_FRAG_MAX_MEM;
+	stt_net->frag_state.low_thresh	= STT_FRAG_PRUNE_MEM;
+
+	inet_frags_init_net(&stt_net->frag_state);
+
+	err = kernel_listen(stt_net->capwap_rcv_socket, 7);
+	if (err)
+		goto error_sock;
+
+	stt_net->n_tunnels++;
+	return 0;
+
+error_sock:
+	sk_release_kernel(stt_net->capwap_rcv_socket->sk);
+error:
+	pr_warn("cannot register protocol handler : %d\n", err);
+	return err;
+}
+
+/* XXX: Possible Consolidation: Very similar to vport-capwap.c:release_socket() */
+static void release_socket(struct net *net)
+{
+	struct capwap_net *stt_net = ovs_get_stt_net(net);
+
+	stt_net->n_tunnels--;
+	if (stt_net->n_tunnels)
+		return;
+
+	inet_frags_exit_net(&stt_net->frag_state, &frag_state);
+	sk_release_kernel(stt_net->capwap_rcv_socket->sk);
+}
+
+/* XXX: Possible Consolidation: Very similar to capwap_create() */
+static struct vport *stt_create(const struct vport_parms *parms)
+{
+	struct vport *vport;
+	int err;
+
+	err = init_socket(ovs_dp_get_net(parms->dp));
+	if (err)
+		return ERR_PTR(err);
+
+	vport = ovs_tnl_create(parms, &ovs_stt_vport_ops, &stt_tnl_ops);
+	if (IS_ERR(vport))
+		release_socket(ovs_dp_get_net(parms->dp));
+
+	return vport;
+}
+
+/* XXX: Possible Consolidation: Same as capwap_destroy() */
+static void stt_destroy(struct vport *vport)
+{
+	ovs_tnl_destroy(vport);
+	release_socket(ovs_dp_get_net(vport->dp));
+}
+
+/* XXX: Possible Consolidation: Same as capwap_init() */
+static int stt_init(void)
+{
+	inet_frags_init(&frag_state);
+	get_random_bytes(&stt_port_rnd, sizeof(stt_port_rnd));
+	return 0;
+}
+
+/* XXX: Possible Consolidation: Same as capwap_exit() */
+static void stt_exit(void)
+{
+	inet_frags_fini(&frag_state);
+}
+
+/* All of the following functions relate to fragmentation reassembly. */
+
+static struct frag_queue *ifq_cast(struct inet_frag_queue *ifq)
+{
+	return container_of(ifq, struct frag_queue, ifq);
+}
+
+/* XXX: Possible Consolidation: Identical to vport-capwap.c:frag_hash() */
+static u32 frag_hash(struct frag_match *match)
+{
+	return jhash_3words((__force u16)match->id, (__force u32)match->saddr,
+			    (__force u32)match->daddr,
+			    frag_state.rnd) & (INETFRAGS_HASHSZ - 1);
+}
+
+/* XXX: Possible Consolidation: Identical to vport-capwap.c:queue_find() */
+static struct frag_queue *queue_find(struct netns_frags *ns_frag_state,
+				     struct frag_match *match)
+{
+	struct inet_frag_queue *ifq;
+
+	read_lock(&frag_state.lock);
+
+	ifq = inet_frag_find(ns_frag_state, &frag_state, match, frag_hash(match));
+	if (!ifq)
+		return NULL;
+
+	/* Unlock happens inside inet_frag_find(). */
+
+	return ifq_cast(ifq);
+}
+
+/* XXX: Possible Consolidation: Identical to vport-capwap.c:frag_reasm() */
+static struct sk_buff *frag_reasm(struct frag_queue *fq, struct net_device *dev)
+{
+	struct sk_buff *head = fq->ifq.fragments;
+	struct sk_buff *frag;
+
+	/* Succeed or fail, we're done with this queue. */
+	inet_frag_kill(&fq->ifq, &frag_state);
+
+	if (fq->ifq.len > 65535)
+		return NULL;
+
+	/* Can't have the head be a clone. */
+	if (skb_cloned(head) && pskb_expand_head(head, 0, 0, GFP_ATOMIC))
+		return NULL;
+
+	/*
+	 * We're about to build frag list for this SKB. If it already has a
+	 * frag list, alloc a new SKB and put the existing frag list there.
+	 */
+	if (skb_shinfo(head)->frag_list) {
+		int i;
+		int paged_len = 0;
+
+		frag = alloc_skb(0, GFP_ATOMIC);
+		if (!frag)
+			return NULL;
+
+		frag->next = head->next;
+		head->next = frag;
+		skb_shinfo(frag)->frag_list = skb_shinfo(head)->frag_list;
+		skb_shinfo(head)->frag_list = NULL;
+
+		for (i = 0; i < skb_shinfo(head)->nr_frags; i++)
+			paged_len += skb_shinfo(head)->frags[i].size;
+		frag->len = frag->data_len = head->data_len - paged_len;
+		head->data_len -= frag->len;
+		head->len -= frag->len;
+
+		frag->ip_summed = head->ip_summed;
+		atomic_add(frag->truesize, &fq->ifq.net->mem);
+	}
+
+	skb_shinfo(head)->frag_list = head->next;
+	atomic_sub(head->truesize, &fq->ifq.net->mem);
+
+	/* Properly account for data in various packets. */
+	for (frag = head->next; frag; frag = frag->next) {
+		head->data_len += frag->len;
+		head->len += frag->len;
+
+		if (head->ip_summed != frag->ip_summed)
+			head->ip_summed = CHECKSUM_NONE;
+		else if (head->ip_summed == CHECKSUM_COMPLETE)
+			head->csum = csum_add(head->csum, frag->csum);
+
+		head->truesize += frag->truesize;
+		atomic_sub(frag->truesize, &fq->ifq.net->mem);
+	}
+
+	head->next = NULL;
+	head->dev = dev;
+	head->tstamp = fq->ifq.stamp;
+	fq->ifq.fragments = NULL;
+
+	return head;
+}
+
+/* XXX: Possible Consolidation: Identical to vport-capwap.c:frag_queue() */
+static struct sk_buff *frag_queue(struct frag_queue *fq, struct sk_buff *skb,
+				  u16 offset, bool frag_last)
+{
+	struct sk_buff *prev, *next;
+	struct net_device *dev;
+	int end;
+
+	if (fq->ifq.last_in & INET_FRAG_COMPLETE)
+		goto error;
+
+	if (stt_seg_len(skb) <= 0)
+		goto error;
+
+	end = offset + stt_seg_len(skb);
+
+	if (frag_last) {
+		/*
+		 * Last fragment, shouldn't already have data past our end or
+		 * have another last fragment.
+		 */
+		if (end < fq->ifq.len || fq->ifq.last_in & INET_FRAG_LAST_IN)
+			goto error;
+
+		fq->ifq.last_in |= INET_FRAG_LAST_IN;
+		fq->ifq.len = end;
+	} else {
+		/* Fragments should align to 8 byte chunks.
+		 */
+		if (end & ~FRAG_OFF_MASK)
+			goto error;
+
+		if (end > fq->ifq.len) {
+			/*
+			 * Shouldn't have data past the end, if we already
+			 * have one.
+			 */
+			if (fq->ifq.last_in & INET_FRAG_LAST_IN)
+				goto error;
+
+			fq->ifq.len = end;
+		}
+	}
+
+	/* Find where we fit in. */
+	prev = NULL;
+	for (next = fq->ifq.fragments; next != NULL; next = next->next) {
+		if (FRAG_CB(next)->offset >= offset)
+			break;
+		prev = next;
+	}
+
+	/*
+	 * Overlapping fragments aren't allowed. We shouldn't start before
+	 * the end of the previous fragment.
+	 */
+	if (prev && FRAG_CB(prev)->offset + stt_seg_len(prev) > offset)
+		goto error;
+
+	/* We also shouldn't end after the beginning of the next fragment. */
+	if (next && end > FRAG_CB(next)->offset)
+		goto error;
+
+	FRAG_CB(skb)->offset = offset;
+
+	/* Link into list. */
+	skb->next = next;
+	if (prev)
+		prev->next = skb;
+	else
+		fq->ifq.fragments = skb;
+
+	dev = skb->dev;
+	skb->dev = NULL;
+
+	fq->ifq.stamp = skb->tstamp;
+	fq->ifq.meat += stt_seg_len(skb);
+	atomic_add(skb->truesize, &fq->ifq.net->mem);
+	if (offset == 0)
+		fq->ifq.last_in |= INET_FRAG_FIRST_IN;
+
+	/* If we have all fragments do reassembly.
+	 */
+	if (fq->ifq.last_in == (INET_FRAG_FIRST_IN | INET_FRAG_LAST_IN) &&
+	    fq->ifq.meat == fq->ifq.len)
+		return frag_reasm(fq, dev);
+
+	write_lock(&frag_state.lock);
+	list_move_tail(&fq->ifq.lru_list, &fq->ifq.net->lru_list);
+	write_unlock(&frag_state.lock);
+
+	return NULL;
+
+error:
+	kfree_skb(skb);
+	return NULL;
+}
+
+/* XXX: Possible Consolidation: Similar to vport-capwap.c:defrag() */
+static struct sk_buff *defrag(struct sk_buff *skb, u16 frame_len)
+{
+	struct iphdr *iph = ip_hdr(skb);
+	struct tcphdr *tcph = tcp_hdr(skb);
+	struct netns_frags *ns_frag_state;
+	struct frag_match match;
+	u16 frag_off;
+	struct frag_queue *fq;
+	bool frag_last = false;
+
+	if (unlikely(!skb->dev)) {
+		if (net_ratelimit())
+			pr_err("%s: No skb->dev!\n", __func__);
+		goto out;
+	}
+
+	ns_frag_state = &ovs_get_stt_net(dev_net(skb->dev))->frag_state;
+	if (atomic_read(&ns_frag_state->mem) > ns_frag_state->high_thresh)
+		inet_frag_evictor(ns_frag_state, &frag_state);
+
+	match.daddr = iph->daddr;
+	match.saddr = iph->saddr;
+	match.id = tcph->ack_seq;
+	frag_off = ntohl(tcph->seq) & FRAG_OFF_MASK;
+	if (frame_len == stt_seg_len(skb) + frag_off)
+		frag_last = true;
+
+	fq = queue_find(ns_frag_state, &match);
+	if (fq) {
+		spin_lock(&fq->ifq.lock);
+		skb = frag_queue(fq, skb, frag_off, frag_last);
+		spin_unlock(&fq->ifq.lock);
+
+		inet_frag_put(&fq->ifq, &frag_state);
+
+		return skb;
+	}
+
+out:
+	kfree_skb(skb);
+	return NULL;
+}
+
+/* XXX: Possible Consolidation: Functionally identical to capwap_frag_init */
+static void stt_frag_init(struct inet_frag_queue *ifq, void *match_)
+{
+	struct frag_match *match = match_;
+
+	ifq_cast(ifq)->match = *match;
+}
+
+/* XXX: Possible Consolidation: Functionally identical to capwap_frag_hash */
+static unsigned int stt_frag_hash(struct inet_frag_queue *ifq)
+{
+	return frag_hash(&ifq_cast(ifq)->match);
+}
+
+/* XXX: Possible Consolidation: Almost functionally identical to capwap_frag_match */
+static int stt_frag_match(struct inet_frag_queue *ifq, void *a_)
+{
+	struct frag_match *a = a_;
+	struct frag_match *b = &ifq_cast(ifq)->match;
+
+	return a->id == b->id && a->saddr == b->saddr && a->daddr == b->daddr;
+}
+
+/* Run when the timeout for a given queue expires. */
+/* XXX: Possible Consolidation: Functionally identical to capwap_frag_expire */
+static void stt_frag_expire(unsigned long ifq)
+{
+	struct frag_queue *fq;
+
+	fq = ifq_cast((struct inet_frag_queue *)ifq);
+
+	spin_lock(&fq->ifq.lock);
+
+	if (!(fq->ifq.last_in & INET_FRAG_COMPLETE))
+		inet_frag_kill(&fq->ifq, &frag_state);
+
+	spin_unlock(&fq->ifq.lock);
+	inet_frag_put(&fq->ifq, &frag_state);
+}
+
+const struct vport_ops ovs_stt_vport_ops = {
+	.type		= OVS_VPORT_TYPE_STT,
+	.flags		= VPORT_F_TUN_ID,
+	.init		= stt_init,
+	.exit		= stt_exit,
+	.create		= stt_create,
+	.destroy	= stt_destroy,
+	.set_addr	= ovs_tnl_set_addr,
+	.get_name	= ovs_tnl_get_name,
+	.get_addr	= ovs_tnl_get_addr,
+	.get_options	= ovs_tnl_get_options,
+	.set_options	= ovs_tnl_set_options,
+	.get_dev_flags	= ovs_vport_gen_get_dev_flags,
+	.is_running	= ovs_vport_gen_is_running,
+	.get_operstate	= ovs_vport_gen_get_operstate,
+	.send		= ovs_tnl_send,
+};
+#else
+#warning STT requires TCP encap_rcv hook in Kernel
+#endif /* HAVE_TCP_ENCAP_RCV */
diff --git a/datapath/vport.c b/datapath/vport.c
index b75a866..575e7a2 100644
--- a/datapath/vport.c
+++ b/datapath/vport.c
@@ -44,6 +44,9 @@ static const struct vport_ops *base_vport_ops_list[] = {
 #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,26)
 	&ovs_capwap_vport_ops,
 #endif
+#ifdef HAVE_TCP_ENCAP_RCV
+	&ovs_stt_vport_ops,
+#endif
 };
 
 static const struct vport_ops **vport_ops_list;
diff --git a/datapath/vport.h b/datapath/vport.h
index 2aafde0..3994eb1 100644
--- a/datapath/vport.h
+++ b/datapath/vport.h
@@ -33,6 +33,7 @@ struct vport_parms;
 
 struct vport_net {
 	struct capwap_net capwap;
+	struct capwap_net stt;
 };
 
 /* The following definitions are for users of the vport subsytem: */
@@ -257,5 +258,6 @@ extern const struct vport_ops ovs_internal_vport_ops;
 extern const struct vport_ops ovs_patch_vport_ops;
 extern const struct vport_ops ovs_gre_vport_ops;
 extern const struct vport_ops ovs_capwap_vport_ops;
+extern const struct vport_ops ovs_stt_vport_ops;
 
 #endif /* vport.h */
diff --git a/include/linux/openvswitch.h b/include/linux/openvswitch.h
index 0578b5f..47f6dca 100644
--- a/include/linux/openvswitch.h
+++ b/include/linux/openvswitch.h
@@ -185,6 +185,7 @@ enum ovs_vport_type {
 	OVS_VPORT_TYPE_PATCH = 100, /* virtual tunnel connecting two vports */
 	OVS_VPORT_TYPE_GRE,      /* GRE tunnel */
 	OVS_VPORT_TYPE_CAPWAP,   /* CAPWAP tunnel */
+	OVS_VPORT_TYPE_STT,      /* STT tunnel */
 	__OVS_VPORT_TYPE_MAX
 };
 
diff --git a/lib/netdev-vport.c b/lib/netdev-vport.c
index 7bd50a4..346878b 100644
--- a/lib/netdev-vport.c
+++ b/lib/netdev-vport.c
@@ -165,6 +165,9 @@ netdev_vport_get_netdev_type(const struct dpif_linux_vport *vport)
     case OVS_VPORT_TYPE_CAPWAP:
         return "capwap";
 
+    case OVS_VPORT_TYPE_STT:
+        return "stt";
+
     case __OVS_VPORT_TYPE_MAX:
         break;
     }
@@ -965,7 +968,11 @@ netdev_vport_register(void)
 
         { OVS_VPORT_TYPE_PATCH,
           { "patch", VPORT_FUNCTIONS(NULL) },
-          parse_patch_config, unparse_patch_config }
+          parse_patch_config, unparse_patch_config },
+
+        { OVS_VPORT_TYPE_STT,
+          { "stt", VPORT_FUNCTIONS(netdev_vport_get_drv_info) },
+          parse_tunnel_config, unparse_tunnel_config }
     };
 
     int i;
diff --git a/vswitchd/vswitch.xml b/vswitchd/vswitch.xml
index f3ea338..d8c860e 100644
--- a/vswitchd/vswitch.xml
+++ b/vswitchd/vswitch.xml
@@ -1177,6 +1177,16 @@
         A pair of virtual devices that act as a patch cable.
       </dd>
 
+      <dt><code>stt</code></dt>
+      <dd>
+        An Ethernet tunnel over STT (IETF draft-davie-stt-01). TCP port
+        58882 is used as the destination port, and source ports are
+        drawn from the kernel's ephemeral range, which may be configured
+        via /proc/sys/net/ipv4/ip_local_port_range. STT currently
+        requires modifications to the Linux kernel and is not supported
+        by any released kernel version.
+      </dd>
+
       <dt><code>null</code></dt>
       <dd>An ignored interface.</dd>
     </dl>