Message ID | 1411134590-4586-9-git-send-email-jiri@resnulli.us
---|---
State | Changes Requested, archived |
Delegated to: | David Miller |
On 09/19/14 09:49, Jiri Pirko wrote:
> This patch exposes switchdev API using generic Netlink.
> Example userspace utility is here:
> https://github.com/jpirko/switchdev

Is this just a temporary test tool? Otherwise i dont see reason
for its existence (or the API that it feeds on).

cheers,
jamal
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>On 09/19/14 09:49, Jiri Pirko wrote:
>>This patch exposes switchdev API using generic Netlink.
>>Example userspace utility is here:
>>https://github.com/jpirko/switchdev
>>
>
>Is this just a temporary test tool? Otherwise i dont see reason
>for its existence (or the API that it feeds on).

Please read the conversation I had with Pravin and Jesse in v1 thread.
Long story short they like to have the api separated from ovs datapath
so ovs daemon can use it to directly communicate with driver. Also John
Fastabend requested a way to work with driver flows without using ovs ->
that was the original reason I created switchdev genl api.

Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
will use directly switchdev genl api.

I hope I cleared this out.
On 09/19/14 11:49, Jiri Pirko wrote:
> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>> Is this just a temporary test tool? Otherwise i dont see reason
>> for its existence (or the API that it feeds on).
>
> Please read the conversation I had with Pravin and Jesse in v1 thread.
> Long story short they like to have the api separated from ovs datapath
> so ovs daemon can use it to directly communicate with driver. Also John
> Fastabend requested a way to work with driver flows without using ovs ->
> that was the original reason I created switchdev genl api.
>
> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
> will use directly switchdev genl api.
>
> I hope I cleared this out.

It is - thanks Jiri.

cheers,
jamal
On 09/19/2014 10:57 AM, Jamal Hadi Salim wrote:
> On 09/19/14 11:49, Jiri Pirko wrote:
>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>
>>> Is this just a temporary test tool? Otherwise i dont see reason
>>> for its existence (or the API that it feeds on).
>>
>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>> Long story short they like to have the api separated from ovs datapath
>> so ovs daemon can use it to directly communicate with driver. Also John
>> Fastabend requested a way to work with driver flows without using ovs ->
>> that was the original reason I created switchdev genl api.
>>
>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>> will use directly switchdev genl api.
>>
>> I hope I cleared this out.
>>
>
> It is - thanks Jiri.
>
> cheers,
> jamal

Hi Jiri,

I was considering a slightly different approach where the
device would report via netlink the fields/actions it
supported rather than creating pre-defined enums for every
possible key.

I already need to have an API to report fields/matches
that are being supported why not have the device report
the headers as header fields (len, offset) and the
associated parse graph the hardware uses? Vendors should
have this already to describe/design their real hardware.

As always its better to have code and when I get some
time I'll try to write it up. Maybe its just a separate
classifier although I don't actually want two hardware
flow APIs.

I see you dropped the RFC tag are you proposing we include
this now?

.John
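John's proposal here — the device describing its own headers as (offset, length) fields plus the parse graph its parser walks, instead of the kernel hard-coding an enum per possible key — can be sketched roughly as follows. Everything below (the `Field` type, the header layouts, the graph shape) is purely illustrative and not any real kernel or driver API:

```python
# Hypothetical sketch of a device self-describing its match space:
# headers are lists of (name, bit offset, bit length) fields, and a
# parse graph says which header can follow which. A flow tool can then
# derive the supported match keys instead of relying on fixed enums.

from dataclasses import dataclass

@dataclass(frozen=True)
class Field:
    name: str
    offset: int   # bit offset within the header
    length: int   # field width in bits

# Headers the (imaginary) device reports it can parse.
ETHERNET = {
    "dst": Field("dst", 0, 48),
    "src": Field("src", 48, 48),
    "ethertype": Field("ethertype", 96, 16),
}
IPV4 = {
    "src_ip": Field("src_ip", 96, 32),
    "dst_ip": Field("dst_ip", 128, 32),
}

# Parse graph: node -> {selector value -> next node}, mirroring how the
# hardware parser transitions from header to header (0x0800 = IPv4).
PARSE_GRAPH = {
    "ethernet": {0x0800: "ipv4"},
    "ipv4": {},
}
HEADERS = {"ethernet": ETHERNET, "ipv4": IPV4}

def supported_fields(root="ethernet"):
    """Collect every match field reachable from the parser root."""
    seen, stack, fields = set(), [root], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        fields.update(f"{node}.{f}" for f in HEADERS[node])
        stack.extend(PARSE_GRAPH[node].values())
    return fields

print(sorted(supported_fields()))
```

A userspace tool (or the kernel) could walk this report to validate a requested flow match before trying to program it, rather than maintaining a per-vendor key list.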
On 09/19/14 18:12, John Fastabend wrote:
> On 09/19/2014 10:57 AM, Jamal Hadi Salim wrote:
>> On 09/19/14 11:49, Jiri Pirko wrote:
>>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>>
>>>> Is this just a temporary test tool? Otherwise i dont see reason
>>>> for its existence (or the API that it feeds on).
>>>
>>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>>> Long story short they like to have the api separated from ovs datapath
>>> so ovs daemon can use it to directly communicate with driver. Also John
>>> Fastabend requested a way to work with driver flows without using ovs ->
>>> that was the original reason I created switchdev genl api.
>>>
>>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>> will use directly switchdev genl api.
>>>
>>> I hope I cleared this out.
>>>
>>
>> It is - thanks Jiri.
>>
>> cheers,
>> jamal
>
> Hi Jiri,
>
> I was considering a slightly different approach where the
> device would report via netlink the fields/actions it
> supported rather than creating pre-defined enums for every
> possible key.
>
> I already need to have an API to report fields/matches
> that are being supported why not have the device report
> the headers as header fields (len, offset) and the
> associated parse graph the hardware uses? Vendors should
> have this already to describe/design their real hardware.
>
> As always its better to have code and when I get some
> time I'll try to write it up. Maybe its just a separate
> classifier although I don't actually want two hardware
> flow APIs.
>
> I see you dropped the RFC tag are you proposing we include
> this now?
>

Actually I just realized i missed something very basic that
Jiri said. I think i understand the tool being there for testing
but i am assumed the same about the genlink api.
Jiri, are you saying that genlink api is there to stay?

cheers,
jamal
On 9/19/14, 8:49 AM, Jiri Pirko wrote:
> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>> On 09/19/14 09:49, Jiri Pirko wrote:
>>> This patch exposes switchdev API using generic Netlink.
>>> Example userspace utility is here:
>>> https://github.com/jpirko/switchdev
>>>
>> Is this just a temporary test tool? Otherwise i dont see reason
>> for its existence (or the API that it feeds on).
> Please read the conversation I had with Pravin and Jesse in v1 thread.
> Long story short they like to have the api separated from ovs datapath
> so ovs daemon can use it to directly communicate with driver. Also John
> Fastabend requested a way to work with driver flows without using ovs ->
> that was the original reason I created switchdev genl api.
>
> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
> will use directly switchdev genl api.
>
> I hope I cleared this out.

We already have all the needed rtnetlink kernel api and userspace tools
around it to support all switching asic features. ie, the rtnetlink api
is the switchdev api. We can do l2, l3, acl's with it.
Its unclear to me why we need another new netlink api. Which will mean
none of the existing tools to create bridges etc will work on a switchdev.
Which seems like going in the direction exactly opposite to what we had
discussed earlier.

If a non-ovs flow interface is needed from userspace, we can extend the
existing interface to include flows.
I don't understand why we should replace the existing rtnetlink switchdev
api to accommodate flows.

Thanks,
Roopa
On 09/19/14 15:12, John Fastabend wrote:
> On 09/19/2014 10:57 AM, Jamal Hadi Salim wrote:
>> On 09/19/14 11:49, Jiri Pirko wrote:
>>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>>
>>>> Is this just a temporary test tool? Otherwise i dont see reason
>>>> for its existence (or the API that it feeds on).
>>>
>>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>>> Long story short they like to have the api separated from ovs datapath
>>> so ovs daemon can use it to directly communicate with driver. Also John
>>> Fastabend requested a way to work with driver flows without using ovs ->
>>> that was the original reason I created switchdev genl api.
>>>
>>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>> will use directly switchdev genl api.
>>>
>>> I hope I cleared this out.
>>>
>>
>> It is - thanks Jiri.
>>
>> cheers,
>> jamal
>
> Hi Jiri,
>
> I was considering a slightly different approach where the
> device would report via netlink the fields/actions it
> supported rather than creating pre-defined enums for every
> possible key.
>
> I already need to have an API to report fields/matches
> that are being supported why not have the device report
> the headers as header fields (len, offset) and the
> associated parse graph the hardware uses? Vendors should
> have this already to describe/design their real hardware.

Humm would not that slightly go against coming with a netlink API that
is generic? Surely we could pay close attention when reviewing what is
being added and spot when a common API needs to be introduced... This
might become very similar to the private ioctl(), private wireless
extensions, nl80211 testmode and well it's not extremely pretty.

> As always its better to have code and when I get some
> time I'll try to write it up. Maybe its just a separate
> classifier although I don't actually want two hardware
> flow APIs.
>
> I see you dropped the RFC tag are you proposing we include
> this now?
>
> .John
On 09/19/14 15:18, Jamal Hadi Salim wrote:
> On 09/19/14 18:12, John Fastabend wrote:
>> On 09/19/2014 10:57 AM, Jamal Hadi Salim wrote:
>>> On 09/19/14 11:49, Jiri Pirko wrote:
>>>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>>>
>>>>> Is this just a temporary test tool? Otherwise i dont see reason
>>>>> for its existence (or the API that it feeds on).
>>>>
>>>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>>>> Long story short they like to have the api separated from ovs datapath
>>>> so ovs daemon can use it to directly communicate with driver. Also John
>>>> Fastabend requested a way to work with driver flows without using
>>>> ovs ->
>>>> that was the original reason I created switchdev genl api.
>>>>
>>>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>>> will use directly switchdev genl api.
>>>>
>>>> I hope I cleared this out.
>>>>
>>>
>>> It is - thanks Jiri.
>>>
>>> cheers,
>>> jamal
>>
>> Hi Jiri,
>>
>> I was considering a slightly different approach where the
>> device would report via netlink the fields/actions it
>> supported rather than creating pre-defined enums for every
>> possible key.
>>
>> I already need to have an API to report fields/matches
>> that are being supported why not have the device report
>> the headers as header fields (len, offset) and the
>> associated parse graph the hardware uses? Vendors should
>> have this already to describe/design their real hardware.
>>
>> As always its better to have code and when I get some
>> time I'll try to write it up. Maybe its just a separate
>> classifier although I don't actually want two hardware
>> flow APIs.
>>
>> I see you dropped the RFC tag are you proposing we include
>> this now?
>>
>
> Actually I just realized i missed something very basic that
> Jiri said. I think i understand the tool being there for testing
> but i am assumed the same about the genlink api.
> Jiri, are you saying that genlink api is there to stay?

So, I really have mixed feelings about this netlink API, in particular
because it is not clear to me where is the line between what should be a
network device ndo operation, what should be an ethtool command, what
should be a netlink message, and the rest.

I can certainly acknowledge the fact that manipulating flows is not ideal
with the current set of tools, but really once we are there with netlink,
how far are we from not having any network devices at all, and how does
that differ from OpenWrt's swconfig in the end [1]?

[1]: https://lwn.net/Articles/571390/
--
Florian
Sat, Sep 20, 2014 at 05:41:16AM CEST, roopa@cumulusnetworks.com wrote:
>On 9/19/14, 8:49 AM, Jiri Pirko wrote:
>>Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>>>On 09/19/14 09:49, Jiri Pirko wrote:
>>>>This patch exposes switchdev API using generic Netlink.
>>>>Example userspace utility is here:
>>>>https://github.com/jpirko/switchdev
>>>>
>>>Is this just a temporary test tool? Otherwise i dont see reason
>>>for its existence (or the API that it feeds on).
>>Please read the conversation I had with Pravin and Jesse in v1 thread.
>>Long story short they like to have the api separated from ovs datapath
>>so ovs daemon can use it to directly communicate with driver. Also John
>>Fastabend requested a way to work with driver flows without using ovs ->
>>that was the original reason I created switchdev genl api.
>>
>>Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>will use directly switchdev genl api.
>>
>>I hope I cleared this out.
>We already have all the needed rtnetlink kernel api and userspace tools
>around it to support all switching asic features. ie, the rtnetlink api
>is the switchdev api. We can do l2, l3, acl's with it.
>Its unclear to me why we need another new netlink api. Which will mean
>none of the existing tools to create bridges etc will work on a switchdev.

No one is proposing such API. Note that what I'm trying to solve in my
patchset is FLOW world. There is only one API there, ovs genl. But the
usage of that for hw offload purposes was nacked by ovs maintainer. Plus
couple of people wanted to run the offloading independently on ovs
instance. Therefore I introduced the switchdev genl, which takes care of
that. No plan to extend it for other things you mentioned, just flows.

>Which seems like going in the direction exactly opposite to what we had
>discussed earlier.

Nope. The previous discussion ignored flows.

>
>If a non-ovs flow interface is needed from userspace, we can extend the
>existing interface to include flows.

How? You mean to extend rtnetlink? What advantage would it bring
compared to a separate genl iface?

>I don't understand why we should replace the existing rtnetlink switchdev
>api to accommodate flows.

Sorry, I do not understand what "existing rtnetlink switchdev api" you
have on mind. Would you care to explain?

>
>Thanks,
>Roopa
On Sep 19, 2014, at 8:41 PM, Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
> On 9/19/14, 8:49 AM, Jiri Pirko wrote:
>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>>> On 09/19/14 09:49, Jiri Pirko wrote:
>>>> This patch exposes switchdev API using generic Netlink.
>>>> Example userspace utility is here:
>>>> https://github.com/jpirko/switchdev
>>>>
>>> Is this just a temporary test tool? Otherwise i dont see reason
>>> for its existence (or the API that it feeds on).
>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>> Long story short they like to have the api separated from ovs datapath
>> so ovs daemon can use it to directly communicate with driver. Also John
>> Fastabend requested a way to work with driver flows without using ovs ->
>> that was the original reason I created switchdev genl api.
>>
>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>> will use directly switchdev genl api.
>>
>> I hope I cleared this out.
> We already have all the needed rtnetlink kernel api and userspace tools
> around it to support all switching asic features. ie, the rtnetlink api
> is the switchdev api. We can do l2, l3, acl's with it.
> Its unclear to me why we need another new netlink api. Which will mean
> none of the existing tools to create bridges etc will work on a switchdev.
> Which seems like going in the direction exactly opposite to what we had
> discussed earlier.

Existing rtnetlink isn’t available to swdev without some kind of snooping
the echoes from the various kernel components (bridge, fib, etc). With
swdev_flow, as Jiri has defined it, there is an additional conversion
needed to bridge the gap (bad expression, I know) between rtnetlink and
swdev_flow. This conversion happens in the kernel components.

For example, the bridge module, still driven from userspace by existing
rtnetlink, will formulate the necessary swdev_flow insert/remove calls to
the swdev driver such that HW will offload the fwd path.

You have:

  user -> rtnetlink -> kernel -> netlink echo -> [some process] -> [some driver] -> HW

Jiri has:

  user -> rtnetlink -> kernel -> swdev_* -> swdev driver -> HW

> If a non-ovs flow interface is needed from userspace, we can extend the
> existing interface to include flows.
> I don't understand why we should replace the existing rtnetlink switchdev
> api to accommodate flows.
>
> Thanks,
> Roopa

-scott
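The in-kernel translation path Scott describes — the bridge module, driven by ordinary rtnetlink, formulating swdev_flow insert/remove calls toward the driver — might look like the following toy model. `swdev_flow_insert` and the match/action shapes here are illustrative stand-ins, not the actual ndo signatures from the patchset:

```python
# Illustrative (non-kernel) sketch of "user -> rtnetlink -> kernel ->
# swdev_* -> swdev driver -> HW": an FDB add arriving via rtnetlink is
# translated by the bridge code into a hardware flow that forwards
# frames for that MAC out of the learned port.

installed = []  # pretend this is the ASIC's flow table

def swdev_flow_insert(match, actions):
    """Stand-in for the driver ndo that programs the hardware."""
    installed.append((match, tuple(actions)))

def bridge_fdb_add(mac, port):
    """What the bridge module would do on an FDB add: derive a flow
    matching the destination MAC with a single output action."""
    swdev_flow_insert(match={"eth_dst": mac}, actions=[("output", port)])

# Userspace adds an FDB entry with existing tools; no new netlink API
# is involved on this path -- the conversion happens in the kernel.
bridge_fdb_add("52:54:00:12:34:56", port=2)
```

The point of the sketch is that existing rtnetlink tooling keeps working unchanged; only the kernel component grows a translation step to the swdev driver.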
Sat, Sep 20, 2014 at 12:12:12AM CEST, john.r.fastabend@intel.com wrote:
>On 09/19/2014 10:57 AM, Jamal Hadi Salim wrote:
>> On 09/19/14 11:49, Jiri Pirko wrote:
>>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>>
>>>> Is this just a temporary test tool? Otherwise i dont see reason
>>>> for its existence (or the API that it feeds on).
>>>
>>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>>> Long story short they like to have the api separated from ovs datapath
>>> so ovs daemon can use it to directly communicate with driver. Also John
>>> Fastabend requested a way to work with driver flows without using ovs ->
>>> that was the original reason I created switchdev genl api.
>>>
>>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>> will use directly switchdev genl api.
>>>
>>> I hope I cleared this out.
>>>
>>
>> It is - thanks Jiri.
>>
>> cheers,
>> jamal
>
>Hi Jiri,
>
>I was considering a slightly different approach where the
>device would report via netlink the fields/actions it
>supported rather than creating pre-defined enums for every
>possible key.
>
>I already need to have an API to report fields/matches
>that are being supported why not have the device report
>the headers as header fields (len, offset) and the
>associated parse graph the hardware uses? Vendors should
>have this already to describe/design their real hardware.

Hmm, let me think about this a bit more. I will try to figure out how to
handle that. Sound logic though. Will try to incorporate the idea in the
patchset.

>
>As always its better to have code and when I get some
>time I'll try to write it up. Maybe its just a separate
>classifier although I don't actually want two hardware
>flow APIs.

Understood.

>
>I see you dropped the RFC tag are you proposing we include
>this now?

v11 is my bet :)

>
>.John
Sat, Sep 20, 2014 at 12:18:02AM CEST, jhs@mojatatu.com wrote:
>On 09/19/14 18:12, John Fastabend wrote:
>>On 09/19/2014 10:57 AM, Jamal Hadi Salim wrote:
>>>On 09/19/14 11:49, Jiri Pirko wrote:
>>>>Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>>>
>>>>>Is this just a temporary test tool? Otherwise i dont see reason
>>>>>for its existence (or the API that it feeds on).
>>>>
>>>>Please read the conversation I had with Pravin and Jesse in v1 thread.
>>>>Long story short they like to have the api separated from ovs datapath
>>>>so ovs daemon can use it to directly communicate with driver. Also John
>>>>Fastabend requested a way to work with driver flows without using ovs ->
>>>>that was the original reason I created switchdev genl api.
>>>>
>>>>Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>>>will use directly switchdev genl api.
>>>>
>>>>I hope I cleared this out.
>>>>
>>>
>>>It is - thanks Jiri.
>>>
>>>cheers,
>>>jamal
>>
>>Hi Jiri,
>>
>>I was considering a slightly different approach where the
>>device would report via netlink the fields/actions it
>>supported rather than creating pre-defined enums for every
>>possible key.
>>
>>I already need to have an API to report fields/matches
>>that are being supported why not have the device report
>>the headers as header fields (len, offset) and the
>>associated parse graph the hardware uses? Vendors should
>>have this already to describe/design their real hardware.
>>
>>As always its better to have code and when I get some
>>time I'll try to write it up. Maybe its just a separate
>>classifier although I don't actually want two hardware
>>flow APIs.
>>
>>I see you dropped the RFC tag are you proposing we include
>>this now?
>>
>
>Actually I just realized i missed something very basic that
>Jiri said. I think i understand the tool being there for testing
>but i am assumed the same about the genlink api.
>Jiri, are you saying that genlink api is there to stay?

Yes, that I say. It is needed for flow manipulation, because such api
does not exist. As I stated earlier, I do not want to use switchdev genl
for anything other than flow manipulation.

>
>cheers,
>jamal
Sat, Sep 20, 2014 at 07:39:51AM CEST, f.fainelli@gmail.com wrote:
>On 09/19/14 15:18, Jamal Hadi Salim wrote:
>>On 09/19/14 18:12, John Fastabend wrote:
>>>On 09/19/2014 10:57 AM, Jamal Hadi Salim wrote:
>>>>On 09/19/14 11:49, Jiri Pirko wrote:
>>>>>Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>>>>
>>>>>>Is this just a temporary test tool? Otherwise i dont see reason
>>>>>>for its existence (or the API that it feeds on).
>>>>>
>>>>>Please read the conversation I had with Pravin and Jesse in v1 thread.
>>>>>Long story short they like to have the api separated from ovs datapath
>>>>>so ovs daemon can use it to directly communicate with driver. Also John
>>>>>Fastabend requested a way to work with driver flows without using
>>>>>ovs ->
>>>>>that was the original reason I created switchdev genl api.
>>>>>
>>>>>Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>>>>will use directly switchdev genl api.
>>>>>
>>>>>I hope I cleared this out.
>>>>>
>>>>
>>>>It is - thanks Jiri.
>>>>
>>>>cheers,
>>>>jamal
>>>
>>>Hi Jiri,
>>>
>>>I was considering a slightly different approach where the
>>>device would report via netlink the fields/actions it
>>>supported rather than creating pre-defined enums for every
>>>possible key.
>>>
>>>I already need to have an API to report fields/matches
>>>that are being supported why not have the device report
>>>the headers as header fields (len, offset) and the
>>>associated parse graph the hardware uses? Vendors should
>>>have this already to describe/design their real hardware.
>>>
>>>As always its better to have code and when I get some
>>>time I'll try to write it up. Maybe its just a separate
>>>classifier although I don't actually want two hardware
>>>flow APIs.
>>>
>>>I see you dropped the RFC tag are you proposing we include
>>>this now?
>>>
>>
>>Actually I just realized i missed something very basic that
>>Jiri said. I think i understand the tool being there for testing
>>but i am assumed the same about the genlink api.
>>Jiri, are you saying that genlink api is there to stay?
>
>So, I really have mixed feelings about this netlink API, in particular
>because it is not clear to me where is the line between what should be a
>network device ndo operation, what should be an ethtool command, what
>should be a netlink message, and the rest.

Well as I said, this api should serve for flow manipulation only,
therefore swdev flow related ndos are used.

>
>I can certainly acknowledge the fact that manipulating flows is not ideal
>with the current set of tools, but really once we are there with netlink,
>how far are we from not having any network devices at all, and how does
>that differ from OpenWrt's swconfig in the end [1]?

I'm all ears on proposals how to make flow manipulation better.

>
>[1]: https://lwn.net/Articles/571390/
>--
>Florian
On 09/20/14 04:17, Jiri Pirko wrote:
> Yes, that I say. It is needed for flow manipulation, because such api
> does not exist.

Come on Jiri!
The ovs guys are against this and now no *api exists*?
Write a 15 tuple classifier tc classifier and use it. I will be more
than happy to help you. I will get to it when we have basics L2 working
on real devices.

> As I stated earlier, I do not want to use switchdev genl for
> anything other than flow manipulation.

Totally unacceptable in my books. If the OVS guys want some way out
to be able to ride on some vendor sdks then that is their problem.
We shouldnt allow for such loopholes. This is why/how TOE never made it
in the kernel.

cheers,
jamal
On 09/20/14 04:10, Scott Feldman wrote:
>
> On Sep 19, 2014, at 8:41 PM, Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
>
> Existing rtnetlink isn’t available to swdev without some kind of snooping
> the echoes from the various kernel components (bridge, fib, etc).

You have made this claim before Scott and I am still not following.
Why do we need to echo things to get FDB or FIB to work?
device ops for FDB offload for example already exist. I think they need
to be revamped, but that consensus can be reasonably reached.
Why do we need this flow api for such activities?

cheers,
jamal
On 09/20/14 at 10:14am, Jiri Pirko wrote:
> Sat, Sep 20, 2014 at 12:12:12AM CEST, john.r.fastabend@intel.com wrote:
> >I was considering a slightly different approach where the
> >device would report via netlink the fields/actions it
> >supported rather than creating pre-defined enums for every
> >possible key.
> >
> >I already need to have an API to report fields/matches
> >that are being supported why not have the device report
> >the headers as header fields (len, offset) and the
> >associated parse graph the hardware uses? Vendors should
> >have this already to describe/design their real hardware.
>
> Hmm, let me think about this a bit more. I will try to figure out how to
> handle that. Sound logic though. Will try to incorporate the idea in the
> patchset.

I think this is the right track. I agree with Jamal that there is no
need for a new permanent and separate Netlink interface for this.

I think this would best be described as a structure of nested Netlink
attributes in the form John proposes which is then embedded into
existing Netlink interfaces such as rtnetlink and OVS genl.

OVS can register new genl ops to check capabilities and insert hardware
flows which allows implementation of the offload decision in user space
and allows for arbitrary combination of hardware and software flows. It
also allows to run a eBPF software data path in combination with a
hardware flow setup.

rtnetlink can embed the nested attribute structure into existing APIs
to allow feature capability detection from user space, statistic
reporting and optional direct hardware offload if a transparent offload
is not feasible.

Would that work for you John? I think we should focus on getting the
layering right and make it generic enough so we allow evolving
naturally.
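For concreteness, the "structure of nested Netlink attributes" Thomas describes is just the standard NLA TLV encoding: a 4-byte header (length, type) followed by a payload padded to 4 bytes, where a nest is simply an attribute whose payload is more attributes (with `NLA_F_NESTED` set). A minimal sketch of how such an embeddable capability blob is laid out on the wire; the attribute type numbers are invented for illustration:

```python
# Encode netlink-style TLV attributes by hand, including a nested
# attribute, to show how one capability structure could be embedded
# into rtnetlink or OVS genl messages alike.

import struct

NLA_F_NESTED = 0x8000
NLA_HDRLEN = 4

def nla(attr_type, payload):
    """Encode one attribute: u16 length, u16 type, payload, 4-byte pad."""
    length = NLA_HDRLEN + len(payload)
    pad = (4 - length % 4) % 4
    return struct.pack("<HH", length, attr_type) + payload + b"\0" * pad

def nla_nest(attr_type, *children):
    """A nest is an attribute whose payload is more attributes."""
    return nla(attr_type | NLA_F_NESTED, b"".join(children))

# Hypothetical capability report: a nest of supported match-field ids.
CAP_FIELDS = 1      # made-up nest type
CAP_FIELD_ID = 1    # made-up u32 attribute type

blob = nla_nest(CAP_FIELDS,
                nla(CAP_FIELD_ID, struct.pack("<I", 10)),
                nla(CAP_FIELD_ID, struct.pack("<I", 20)))
print(blob.hex())
```

Because the blob is self-describing TLV data, the same bytes can ride inside an rtnetlink message, an OVS genl message, or a tc classifier attribute without any of those interfaces needing to know its internals.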
On 09/20/14 at 06:19am, Jamal Hadi Salim wrote:
> The ovs guys are against this and now no *api exists*?
> Write a 15 tuple classifier tc classifier and use it. I will be more
> than happy to help you. I will get to it when we have basics L2 working
> on real devices.

Nothing speaks against having such a tc classifier. In fact, having
the interface consist of only an embedded Netlink attribute structure
would allow for such a classifier in a very straight forward way.

That doesn't mean everybody should be forced to use the stateful
tc interface.

> Totally unacceptable in my books. If the OVS guys want some way out
> to be able to ride on some vendor sdks then that is their problem.
> We shouldnt allow for such loopholes. This is why/how TOE never made it
> in the kernel.

No need for false accusations here. Nobody ever mentioned vendor SDKs.

The statement was that the requirement of deriving hardware flows from
software flows *in the kernel* is not flexible enough for the future
for reasons such as:

1) The OVS software data path might be based on eBPF in the future and
   it is unclear how we could derive hardware flows from that
   transparently.

2) Depending on hardware capabilities. Hardware flows might need to be
   assisted by software flow counterparts and it is believed that it
   is the wrong approach to push all the necessary context for the
   decision down into the kernel. This can be argued about and I don't
   feel strongly either way.
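The user-space placement decision in point 2 — hardware flows assisted by software counterparts — reduces, in its simplest form, to a policy like the following toy sketch. The capability set and field names are assumptions for illustration, not anything a real device reports:

```python
# Toy user-space offload decision: a flow goes to hardware only if the
# device has reported support for every field the flow matches on;
# otherwise it stays in the software data path, and the two coexist.

HW_FIELDS = {"eth_dst", "eth_src", "vlan_id"}  # assumed device report

def place(flow_match):
    """Return 'hw' if all match fields are offloadable, else 'sw'."""
    return "hw" if set(flow_match) <= HW_FIELDS else "sw"

flows = [
    {"eth_dst": "aa:bb:cc:dd:ee:ff"},                  # fully offloadable
    {"eth_dst": "aa:bb:cc:dd:ee:ff", "tcp_dst": 80},   # tcp_dst not in HW
]
print([place(f) for f in flows])
```

Keeping this decision in user space is exactly the argument being made: the daemon holds the full context (capabilities, flow mix, counters) that the kernel would otherwise have to be taught.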
On 09/20/14 07:01, Thomas Graf wrote:
> Nothing speaks against having such a tc classifier. In fact, having
> the interface consist of only an embedded Netlink attribute structure
> would allow for such a classifier in a very straight forward way.
>
> That doesn't mean everybody should be forced to use the stateful
> tc interface.
>

Agreed. The response was to Jiri's strange statement that now that he
cant use OVS, there is no such api. I point to tc as very capable of
such usage.

> No need for false accusations here. Nobody ever mentioned vendor SDKs.
>

I am sorry to have tied the two together. Maybe not OVS but the approach
described is heaven for vendor SDKs.

> The statement was that the requirement of deriving hardware flows from
> software flows *in the kernel* is not flexible enough for the future
> for reasons such as:
>
> 1) The OVS software data path might be based on eBPF in the future and
>    it is unclear how we could derive hardware flows from that
>    transparently.
>

Who says you cant put BPF in hardware?
And why is OVS defining how BPF should evolve or how it should be used?

> 2) Depending on hardware capabilities. Hardware flows might need to be
>    assisted by software flow counterparts and it is believed that it
>    is the wrong approach to push all the necessary context for the
>    decision down into the kernel. This can be argued about and I don't
>    feel strongly either way.
>

Pointing to the current FDB offload: You can select to bypass
and not use s/ware.

cheers,
jamal
On 09/20/14 at 07:32am, Jamal Hadi Salim wrote: > I am sorry to have tied the two together. Maybe not OVS but the approach > described is heaven for vendor SDKs. I fail to see the connection. You can use a switch vendor SDK no matter how we define the kernel APIs. They already exist and have been designed in a way to be completely independent from the kernel. Are you referring to vendor-specific decisions in user space in general? I believe that the whole point of swdev is to provide *that* level of abstraction so decisions can be made in a vendor-neutral way. > >The statement was that the requirement of deriving hardware flows from > >software flows *in the kernel* is not flexible enough for the future > >for reasons such as: > > > >1) The OVS software data path might be based on eBPF in the future and > > it is unclear how we could derive hardware flows from that > > transparently. > > > > Who says you cant put BPF in hardware? I don't think anybody is saying that. P4 is likely a reality soon. But we definitely want hardware offload in a BPF world even if the hardware can't do BPF yet. > And why is OVS defining how BPF should evolve or how it should be used? Not sure I understand. OVS would be a user of eBPF just like tracing, xt_BPF, socket filter, ... > >2) Depending on hardware capabilities. Hardware flows might need to be > > assisted by software flow counterparts and it is believed that it > > is the wrong approach to push all the necessary context for the > > decision down into the kernel. This can be argued about and I don't > > feel strongly either way. > > > > Pointing to the current FDB offload: You can select to bypass > and not use s/ware. As I said, this can be argued about. It would require pushing a lot of context into the kernel though. The FDB offload is relatively trivial in comparison to the complexity OVS user space can handle.
I can't think of any reason to complicate the kernel further with OVS-specific knowledge as long as we can guarantee the vendor abstraction.
On 09/20/14 07:51, Thomas Graf wrote: > I fail to see the connection. You can use switch vendor SDK no matter > how we define the kernel APIs. They already exist and have been > designed in a way to be completely indepenedent from the kernel. > > Are you referring to vendor specific decisions in user space in > general? I believe that the whole point of swdev is to provide *that* > level of abstraction so decisions can be made in a vendor neutral way. > I am not against the swdev idea. I think we have disagreements for the general classification/action interface how that should look like - but that is resolvable with correct interfaces. The vendor neutral way *already exists* via current netlink abstractions that existing tools use. When we need to add new interfaces then we should. > I don't think anybody is saying that. P4 is likely a reality soon. But > we definitely want hardware offload in a BPF world even if the hardware > can't do BPF yet. > I dont think we have contradictions. We are speaking past each other. You implied that in the future OVS s/w path might be based on BPF. I implied BPF itself could be offloaded and stands on its own merit and should work if we have the correct interface. As an example, I dont care about P4 or OVS - but i have no problem if they use the common interfaces provided by Linux. i.e If i want to build a little cpu running the BPF instruction set and use that as my offload then that interface should work and if it doesnt i should provide extensions. > Not sure I understand. OVS would be a user of eBPF just like tracing, > xt_BPF, socket filter, ... > Ok, we are on the same page then. > As I said, this can be argued about. It would require to push a lot of > context into the kernel though. The FDB offload is relatively trivial > in comparison to the complexity OVS user space can handle. 
I can't think > of any reasons why to complicate the kernel further with OVS specific > knowledge as long as we can guarantee the vendor abstraction. > I disagree. OVS may be complex in that sense (I am sorry i am making an assumption based on what you are saying) but i dont think there is any other kernel subsystem that has this challenge. Note: i am pointing to fdb only because it carries the concept of "put this in hardware and/or software". I agree the fdb may be reasonably simpler. cheers, jamal
On 9/20/14, 1:09 AM, Jiri Pirko wrote: > Sat, Sep 20, 2014 at 05:41:16AM CEST, roopa@cumulusnetworks.com wrote: >> On 9/19/14, 8:49 AM, Jiri Pirko wrote: >>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote: >>>> On 09/19/14 09:49, Jiri Pirko wrote: >>>>> This patch exposes switchdev API using generic Netlink. >>>>> Example userspace utility is here: >>>>> https://github.com/jpirko/switchdev >>>>> >>>> Is this just a temporary test tool? Otherwise i dont see reason >>>> for its existence (or the API that it feeds on). >>> Please read the conversation I had with Pravin and Jesse in v1 thread. >>> Long story short they like to have the api separated from ovs datapath >>> so ovs daemon can use it to directly communicate with driver. Also John >>> Fastabend requested a way to work with driver flows without using ovs -> >>> that was the original reason I created switchdev genl api. >>> >>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon >>> will use directly switchdev genl api. >>> >>> I hope I cleared this out. >> We already have all the needed rtnetlink kernel api and userspace tools >> around it to support all >> switching asic features. ie, the rtnetlink api is the switchdev api. We can >> do l2, l3, acl's with it. >> Its unclear to me why we need another new netlink api. Which will mean none >> of the existing tools to >> create bridges etc will work on a switchdev. > No one is proposing such API. Note that what I'm trying to solve in my > patchset is FLOW world. There is only one API there, ovs genl. But the > usage of that for hw offload purposes was nacked by ovs maintainer. Plus > couple of people wanted to run the offloading independently on ovs > instance. Therefore I introduced the switchdev genl, which takes care of > that. No plan to extend it for other things you mentioned, just flows. ok, That was not clear to me. 
Introducing a new genl api and calling it the switchdev api can result in non-flow creep into it in the future. > > >> Which seems like going in the direction exactly opposite to what we had >> discussed earlier. > Nope. The previous discussion ignored flows. >> If a non-ovs flow interface is needed from userspace, we can extend the >> existing interface to include flows. > How? You mean to extend rtnetlink? What advantage it would bring > comparing to separate genl iface? yes. Advantage would be that we dont have yet another parallel switchdev netlink api. >> I don't understand why we should replace the existing rtnetlink switchdev api >> to accommodate flows. > Sorry, I do not understand what "existing rtnetlink switchdev api" you > have on mind. Would you care to explain? I am talking about the existing rtnetlink api that bridge, ip link use to talk l2 and l3 to the kernel. RTM_NEWROUTE etc.
On Sep 20, 2014, at 5:51 AM, Roopa Prabhu <roopa@cumulusnetworks.com> wrote: > On 9/20/14, 1:10 AM, Scott Feldman wrote: >> On Sep 19, 2014, at 8:41 PM, Roopa Prabhu <roopa@cumulusnetworks.com> >> wrote: >> >> >>> On 9/19/14, 8:49 AM, Jiri Pirko wrote: >>> >>>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com >>>> wrote: >>>> >>>>> On 09/19/14 09:49, Jiri Pirko wrote: >>>>> >>>>>> This patch exposes switchdev API using generic Netlink. >>>>>> Example userspace utility is here: >>>>>> >>>>>> https://github.com/jpirko/switchdev >>>>>> >>>>>> >>>>>> >>>>> Is this just a temporary test tool? Otherwise i dont see reason >>>>> for its existence (or the API that it feeds on). >>>>> >>>> Please read the conversation I had with Pravin and Jesse in v1 thread. >>>> Long story short they like to have the api separated from ovs datapath >>>> so ovs daemon can use it to directly communicate with driver. Also John >>>> Fastabend requested a way to work with driver flows without using ovs -> >>>> that was the original reason I created switchdev genl api. >>>> >>>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon >>>> will use directly switchdev genl api. >>>> >>>> I hope I cleared this out. >>>> >>> We already have all the needed rtnetlink kernel api and userspace tools around it to support all >>> switching asic features. ie, the rtnetlink api is the switchdev api. We can do l2, l3, acl's with it. >>> Its unclear to me why we need another new netlink api. Which will mean none of the existing tools to >>> create bridges etc will work on a switchdev. >>> Which seems like going in the direction exactly opposite to what we had discussed earlier. >>> >> Existing rtnetlink isn’t available to swdev without some kind of snooping the echoes from the various kernel components (bridge, fib, etc). With swdev_flow, as Jiri has defined it, there is an additional conversion needed to bridge the gap (bad expression, I know) between rtnetlink and swdev_flow. 
This conversion happens in the kernel components. For example, the bridge module, still driven from userspace by existing rtnetlink, will formulate the necessary swdev_flow insert/remove calls to the swdev driver such that HW will offload the fwd path. >> >> You have: >> user -> rtnetlink -> kernel -> netlink echo -> [some process] -> [some driver] -> HW >> >> >> Jiri has: >> user -> rtnetlink -> kernel -> swdev_* -> swdev driver -> HW >> >> > Keeping the goal to not change or not add a new userspace API in mind, > I have : > user -> rtnetlink -> kernel -> ndo_op -> swdev driver -> HW > Then you have the same as Jiri, for the traditional L2/L3 case. > Jiri has: > user -> genl (newapi) -> kernel -> swdev_* -> swdev driver -> HW Jiri’s genl is for userspace apps that are talking rtnetlink, like OVS. It’s not a substitute for rtnetlink, it’s an alternative. The complete picture is: user -> swdev genl ----- \ \ -------> kernel -> ndo_swdev_* -> swdev driver -> HW / / user -> rtnetlink ------ -scott -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Sat, Sep 20, 2014 at 07:21:10PM CEST, sfeldma@cumulusnetworks.com wrote: > >On Sep 20, 2014, at 5:51 AM, Roopa Prabhu <roopa@cumulusnetworks.com> wrote: > >> On 9/20/14, 1:10 AM, Scott Feldman wrote: >>> On Sep 19, 2014, at 8:41 PM, Roopa Prabhu <roopa@cumulusnetworks.com> >>> wrote: >>> >>> >>>> On 9/19/14, 8:49 AM, Jiri Pirko wrote: >>>> >>>>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com >>>>> wrote: >>>>> >>>>>> On 09/19/14 09:49, Jiri Pirko wrote: >>>>>> >>>>>>> This patch exposes switchdev API using generic Netlink. >>>>>>> Example userspace utility is here: >>>>>>> >>>>>>> https://github.com/jpirko/switchdev >>>>>>> >>>>>>> >>>>>>> >>>>>> Is this just a temporary test tool? Otherwise i dont see reason >>>>>> for its existence (or the API that it feeds on). >>>>>> >>>>> Please read the conversation I had with Pravin and Jesse in v1 thread. >>>>> Long story short they like to have the api separated from ovs datapath >>>>> so ovs daemon can use it to directly communicate with driver. Also John >>>>> Fastabend requested a way to work with driver flows without using ovs -> >>>>> that was the original reason I created switchdev genl api. >>>>> >>>>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon >>>>> will use directly switchdev genl api. >>>>> >>>>> I hope I cleared this out. >>>>> >>>> We already have all the needed rtnetlink kernel api and userspace tools around it to support all >>>> switching asic features. ie, the rtnetlink api is the switchdev api. We can do l2, l3, acl's with it. >>>> Its unclear to me why we need another new netlink api. Which will mean none of the existing tools to >>>> create bridges etc will work on a switchdev. >>>> Which seems like going in the direction exactly opposite to what we had discussed earlier. >>>> >>> Existing rtnetlink isn’t available to swdev without some kind of snooping the echoes from the various kernel components (bridge, fib, etc). 
With swdev_flow, as Jiri has defined it, there is an additional conversion needed to bridge the gap (bad expression, I know) between rtnetlink and swdev_flow. This conversion happens in the kernel components. For example, the bridge module, still driven from userspace by existing rtnetlink, will formulate the necessary swdev_flow insert/remove calls to the swdev driver such that HW will offload the fwd path. >>> >>> You have: >>> user -> rtnetlink -> kernel -> netlink echo -> [some process] -> [some driver] -> HW >>> >>> >>> Jiri has: >>> user -> rtnetlink -> kernel -> swdev_* -> swdev driver -> HW >>> >>> >> Keeping the goal to not change or not add a new userspace API in mind, >> I have : >> user -> rtnetlink -> kernel -> ndo_op -> swdev driver -> HW >> > >Then you have the same as Jiri, for the traditional L2/L3 case. > >> Jiri has: >> user -> genl (newapi) -> kernel -> swdev_* -> swdev driver -> HW > >Jiri’s genl is for userspace apps that are talking rtnetlink, like OVS. It’s not a substitute for rtnetlink, it’s an alternative. The complete picture is: Not an alternative, an addition. > >user -> swdev genl ----- > \ > \ > -------> kernel -> ndo_swdev_* -> swdev driver -> HW > / > / >user -> rtnetlink ------ True is that, as Thomas pointed out, we can probably nest this into rtnl_link messages. That might work. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sat, Sep 20, 2014 at 3:53 AM, Thomas Graf <tgraf@suug.ch> wrote: > On 09/20/14 at 10:14am, Jiri Pirko wrote: >> Sat, Sep 20, 2014 at 12:12:12AM CEST, john.r.fastabend@intel.com wrote: >> >I was considering a slightly different approach where the >> >device would report via netlink the fields/actions it >> >supported rather than creating pre-defined enums for every >> >possible key. >> > >> >I already need to have an API to report fields/matches >> >that are being supported why not have the device report >> >the headers as header fields (len, offset) and the >> >associated parse graph the hardware uses? Vendors should >> >have this already to describe/design their real hardware. >> >> Hmm, let me think about this a bit more. I will try to figure out how to >> handle that. Sound logic though. Will try to incorporate the idea in the >> patchset. > > I think this is the right track. I agree with John and Thomas here. I think HW should not be limited by SW abstractions whether these abstractions are called flows, n-tuples, bridge or else. Really looking forward to see "device reporting the headers as header fields (len, offset) and the associated parse graph" as the first step. Another topic that this discussion didn't cover yet is how this all connects to tunnels and what is 'tunnel offloading'. imo flow offloading by itself serves only academic interest.
On 9/20/14, 10:38 AM, Jiri Pirko wrote: > Sat, Sep 20, 2014 at 07:21:10PM CEST, sfeldma@cumulusnetworks.com wrote: >> On Sep 20, 2014, at 5:51 AM, Roopa Prabhu <roopa@cumulusnetworks.com> wrote: >> >>> On 9/20/14, 1:10 AM, Scott Feldman wrote: >>>> On Sep 19, 2014, at 8:41 PM, Roopa Prabhu <roopa@cumulusnetworks.com> >>>> wrote: >>>> >>>> >>>>> On 9/19/14, 8:49 AM, Jiri Pirko wrote: >>>>> >>>>>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com >>>>>> wrote: >>>>>> >>>>>>> On 09/19/14 09:49, Jiri Pirko wrote: >>>>>>> >>>>>>>> This patch exposes switchdev API using generic Netlink. >>>>>>>> Example userspace utility is here: >>>>>>>> >>>>>>>> https://github.com/jpirko/switchdev >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> Is this just a temporary test tool? Otherwise i dont see reason >>>>>>> for its existence (or the API that it feeds on). >>>>>>> >>>>>> Please read the conversation I had with Pravin and Jesse in v1 thread. >>>>>> Long story short they like to have the api separated from ovs datapath >>>>>> so ovs daemon can use it to directly communicate with driver. Also John >>>>>> Fastabend requested a way to work with driver flows without using ovs -> >>>>>> that was the original reason I created switchdev genl api. >>>>>> >>>>>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon >>>>>> will use directly switchdev genl api. >>>>>> >>>>>> I hope I cleared this out. >>>>>> >>>>> We already have all the needed rtnetlink kernel api and userspace tools around it to support all >>>>> switching asic features. ie, the rtnetlink api is the switchdev api. We can do l2, l3, acl's with it. >>>>> Its unclear to me why we need another new netlink api. Which will mean none of the existing tools to >>>>> create bridges etc will work on a switchdev. >>>>> Which seems like going in the direction exactly opposite to what we had discussed earlier. 
>>>>> >>>> Existing rtnetlink isn’t available to swdev without some kind of snooping the echoes from the various kernel components (bridge, fib, etc). With swdev_flow, as Jiri has defined it, there is an additional conversion needed to bridge the gap (bad expression, I know) between rtnetlink and swdev_flow. This conversion happens in the kernel components. For example, the bridge module, still driven from userspace by existing rtnetlink, will formulate the necessary swdev_flow insert/remove calls to the swdev driver such that HW will offload the fwd path. >>>> >>>> You have: >>>> user -> rtnetlink -> kernel -> netlink echo -> [some process] -> [some driver] -> HW >>>> >>>> >>>> Jiri has: >>>> user -> rtnetlink -> kernel -> swdev_* -> swdev driver -> HW >>>> >>>> >>> Keeping the goal to not change or not add a new userspace API in mind, >>> I have : >>> user -> rtnetlink -> kernel -> ndo_op -> swdev driver -> HW >>> >> Then you have the same as Jiri, for the traditional L2/L3 case. >> >>> Jiri has: >>> user -> genl (newapi) -> kernel -> swdev_* -> swdev driver -> HW >> Jiri’s genl is for userspace apps that are talking rtnetlink, like OVS. It’s not a substitute for rtnetlink, it’s an alternative. The complete picture is: > Not an alternative, an addition. > >> user -> swdev genl ----- >> \ >> \ >> -------> kernel -> ndo_swdev_* -> swdev driver -> HW >> / >> / >> user -> rtnetlink ------ > True is that, as Thomas pointed out, we can probably nest this into > rtnl_link messages. That might work. That's the thing i was hinting as well. You can extend the existing infrastructure/api instead of adding a new one. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Sat, Sep 20, 2014 at 01:32:30PM CEST, jhs@mojatatu.com wrote: >On 09/20/14 07:01, Thomas Graf wrote: > >>Nothing speaks against having such a tc classifier. In fact, having >>the interface consist of only an embedded Netlink attribute structure >>would allow for such a classifier in a very straight forward way. >> >>That doesn't mean everybody should be forced to use the stateful >>tc interface. >> > > >Agreed. The response was to Jiri's strange statement that now that >he cant use OVS, there is no such api. I point to tc as very capable of >such usage. Jamal, would you please give us some examples on how to use tc to work with flows? I have a feeling that you see something other people do not. Let's get on the same page now. Thanks. > >>No need for false accusations here. Nobody ever mentioned vendor SDKs. >> > >I am sorry to have tied the two together. Maybe not OVS but the approach >described is heaven for vendor SDKs. > >>The statement was that the requirement of deriving hardware flows from >>software flows *in the kernel* is not flexible enough for the future >>for reasons such as: >> >>1) The OVS software data path might be based on eBPF in the future and >> it is unclear how we could derive hardware flows from that >> transparently. >> > >Who says you cant put BPF in hardware? >And why is OVS defining how BPF should evolve or how it should be used? > >>2) Depending on hardware capabilities. Hardware flows might need to be >> assisted by software flow counterparts and it is believed that it >> is the wrong approach to push all the necessary context for the >> decision down into the kernel. This can be argued about and I don't >> feel strongly either way. >> > >Pointing to the current FDB offload: You can select to bypass >and not use s/ware. >cheers, >jamal
On 09/20/14 at 03:50pm, Alexei Starovoitov wrote: > I think HW should not be limited by SW abstractions whether > these abstractions are called flows, n-tuples, bridge or else. > Really looking forward to see "device reporting the headers as > header fields (len, offset) and the associated parse graph" > as the first step. > > Another topic that this discussion didn't cover yet is how this > all connects to tunnels and what is 'tunnel offloading'. > imo flow offloading by itself serves only academic interest. We haven't touched encryption yet either ;-) Certainly true for the host case. The Linux on TOR case is less dependent on this and L2/L3 offload w/o encap already has value. I'm with you though, all of this has little value on the host in the DC if stateful encap offload is not incorporated. I expect the HW to provide filters on the outer header plus metadata in the encap. Actually, this was a follow-up question I had for John as this is not easily describable with offset/len filters. How would we represent such capabilities? The TX side of this was one of the reasons why I initially thought it would be beneficial to implement a cache like offload as we could serve an initial encap in SW, do the FIB lookup and offload it transparently to avoid replicating the FIB in user space. What seems most feasible to me right now is to separate the offload of the encap action from the IP -> dev mapping decision. The eSwitch would send the first encap for an unknown dest IP to the CPU due to a miss in the IP mapping table, the CPU would do the FIB lookup, update the table and send it back. What do you have in mind?
On 09/22/14 03:53, Jiri Pirko wrote: > Jamal, would you please give us some examples on how to use tc to work > with flows? I have a feeling that you see something other people does not. I will be a little verbose so as to avoid knowledge assumption. Lets talk about tc classifier/action subsystem because that is what would take advantage of flows. We could also talk about qdiscs i.e schedulers and queue objects because the two are often related (the default classification action is "classid" which typically maps to a queue class). tc classification/action subsystem allows you to specify arbitrary classifiers and actions. You can then specify (using a precise BNF grammar) how filters and actions are to be related. Look at iproute2/f_*.c to see the currently defined ones. Each classifier has a name/id and attributes/options specific to itself. Classifiers dont necessarily have to filter on packet headers; they could filter on metadata for example. Each classifier running in software may be offloaded. I think that simple model would allow usable tools. The classifier you have defined currently in your patches could be realized via the u32 classifier but i think that would require knowledge of u32. So for usability reasons I would suggest to write a brand new classifier. For lack of a better name, lets call it "multi-tuple classifier". I would expect this classifier to be usable in software tc as well without necessarily being offloaded. There are two important details to note: 1) many different types of classifiers exist. This would very likely depend on hardware implementation. It is academic bullshit (i.e not pragmatic) to claim all hardware offload can use the same classification language. As i was telling Thomas I dont see why one wouldnt offload the defined bpf classifier. From an API level, this means your ->flow_add/del/get would have to support ability to define different classifiers. 2) Each classifier will have different semantics. 
From a device API level this means you have to allow the different classifiers to pass attributes specific to them. This means each classifier may override the ops(). I am indifferent how it is achieved. So while you could pass one big structure such as your flow struct, one should be able to do u32 kind of semantics. We also need to discover which device supports which classifiers and what constraints exist in the hardware implementation (we can talk about that because it is important). Example if one supports u32, how many u32 rules can be offloaded etc. As to how it is to be implemented: I like the semantics of the current bridge code. I have always wondered why we didnt use that scheme for offloading qdiscs. Each device supporting FDB offload has an ->fdb_add/del/get (dont quote me on the naming). User space describes what it wants. If something is to be offloaded we already know the netdev the user is pointing to. We invoke the appropriate ->flow() calls with appropriately cooked structures. I am not sure i like that we pass the netlink structure as Scott often seems to point to; i think that passing the internal structure we would install in s/ware may be the better approach since: a) we would need to parse the data anyways for validation etc b) each hardware offload will likely need to translate further in internal format c) we have a well-defined mapping between user and offload, the generic structure will be very close to hardware. note: that is what the fdb offload does. Note: I described this using tc, but i dont see why nftable couldnt follow the same approach. My angle is that we dont impede other users by over-focussing on ovs and whatever other things that surround it. cheers, jamal
On Mon, Sep 22, 2014 at 1:13 AM, Thomas Graf <tgraf@suug.ch> wrote: > On 09/20/14 at 03:50pm, Alexei Starovoitov wrote: >> I think HW should not be limited by SW abstractions whether >> these abstractions are called flows, n-tuples, bridge or else. >> Really looking forward to see "device reporting the headers as >> header fields (len, offset) and the associated parse graph" >> as the first step. >> >> Another topic that this discussion didn't cover yet is how this >> all connects to tunnels and what is 'tunnel offloading'. >> imo flow offloading by itself serves only academic interest. > > We haven't touched encryption yet either ;-) > > Certainly true for the host case. The Linux on TOR case is less > dependant on this and L2/L3 offload w/o encap already has value. > Thomas, can you (or someone else) quantify what the host case is. I suppose there may be merit in using a switch on NIC for kernel bypass scenarios, but I'm still having a hard time understanding how this could be integrated into the host stack with benefits that outweigh complexity. The history of stateful offloads in NICs is not great, and encapsulation (stuffing a few bytes of header into a packet) is in itself not nearly an expensive enough operation to warrant offloading to the NIC. Personally, I wish if NIC vendors are going to focus on stateful offload I rather see it be for encryption which I believe currently does warrant offload at 40G and higher speeds. Tom > I'm with you though, all of this has little value on the host in > the DC if stateful encap offload is not incorporated. I expect the > HW to provide filters on the outer header plus metadata in the > encap. Actually, this was a follow-up question I had for John as > this is not easily describable with offset/len filters. How would > we represent such capabilities? 
> > The TX side of this was one of the reasons why I initially thought > it would be beneficial to implement a cache like offload as we could > serve an initial encap in SW, do the FIB lookup and offload it > transparently to avoid replicating the FIB in user space. > > What seems most feasisble to me right now is to separate the offload > of the encap action from the IP -> dev mapping decision. The eSwitch > would send the first encap for an unknown dest IP to the CPU due > to a miss in the IP mapping table, the CPU would do the FIB lookup, > update the table and send it back. > > What do you have in mind? > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 09/22/14 at 08:10am, Tom Herbert wrote: > Thomas, can you (or someone else) quantify what the host case is. I > suppose there may be merit in using a switch on NIC for kernel bypass > scenarios, but I'm still having a hard time understanding how this > could be integrated into the host stack with benefits that outweigh Personally my primary interest is on lxc and vm based workloads w/ end to end encryption, encap, distributed L3 and NAT, and policy enforcement including service graphs which imply both east-west and north-south traffic patterns on a host. The usual I guess ;-) > complexity. The history of stateful offloads in NICs is not great, and > encapsulation (stuffing a few bytes of header into a packet) is in > itself not nearly an expensive enough operation to warrant offloading No argument here. The direct benchmark comparisons I've measured showed only around 2% improvement. What makes stateful offload interesting to me is that the final destination of a packet is known at RX and can be redirected to a queue or VF. This allows building packet batches on shared pages while preserving the security model. Will the gains outweigh complexity? I hope so but I don't know for sure. If you have insights, let me know. What I know for sure is that I don't want to rely on a kernel bypass for the above. > to the NIC. Personally, I wish if NIC vendors are going to focus on > stateful offload I rather see it be for encryption which I believe > currently does warrant offload at 40G and higher speeds. Agreed. I would like to see a focus on both.
On Mon, Sep 22, 2014 at 3:17 PM, Thomas Graf <tgraf@suug.ch> wrote: > On 09/22/14 at 08:10am, Tom Herbert wrote: >> Thomas, can you (or someone else) quantify what the host case is. I >> suppose there may be merit in using a switch on NIC for kernel bypass >> scenarios, but I'm still having a hard time understanding how this >> could be integrated into the host stack with benefits that outweigh > > Personally my primary interest is on lxc and vm based workloads w/ > end to end encryption, encap, distributed L3 and NAT, and policy > enforcement including service graphs which imply both east-west > and north-south traffic patterns on a host. The usual I guess ;-) > >> complexity. The history of stateful offloads in NICs is not great, and >> encapsulation (stuffing a few bytes of header into a packet) is in >> itself not nearly an expensive enough operation to warrant offloading > > No argument here. The direct benchmark comparisons I've measured showed > only around 2% improvement. > > What makes stateful offload interesting to me is that the final > destination of a packet is known at RX and can be redirected to a > queue or VF. This allows building packet batches on shared pages > while preserving the security model. > How is this different from what rx-filtering already does? > Will the gains outweigh complexity? I hope so but I don't know for > sure. If you have insights, let me know. What I know for sure is that > I don't want to rely on a kernel bypass for the above. > >> to the NIC. Personally, I wish if NIC vendors are going to focus on >> stateful offload I rather see it be for encryption which I believe >> currently does warrant offload at 40G and higher speeds. > > Agreed. I would like to see a focus on both.
On 09/22/14 at 03:40pm, Tom Herbert wrote: > On Mon, Sep 22, 2014 at 3:17 PM, Thomas Graf <tgraf@suug.ch> wrote: > > What makes stateful offload interesting to me is that the final > > destination of a packet is known at RX and can be redirected to a > > queue or VF. This allows building packet batches on shared pages > > while preserving the security model. > > > How is this different from what rx-filtering already does? Without stateful offload I can't know where the packet is destined until after I've allocated an skb and parsed the packet in software. I might be missing what you refer to here specifically.
On Mon, Sep 22, 2014 at 3:53 PM, Thomas Graf <tgraf@suug.ch> wrote: > On 09/22/14 at 03:40pm, Tom Herbert wrote: >> On Mon, Sep 22, 2014 at 3:17 PM, Thomas Graf <tgraf@suug.ch> wrote: >> > What makes stateful offload interesting to me is that the final >> > destination of a packet is known at RX and can be redirected to a >> > queue or VF. This allows building packet batches on shared pages >> > while preserving the security model. >> > >> How is this different from what rx-filtering already does? > > Without stateful offload I can't know where the packet is destined > until after I've allocated an skb and parsed the packet in > software. I might be missing what you refer to here specifically. n-tuple filtering as exposed by ethtool.
On 09/22/2014 04:07 PM, Tom Herbert wrote: > On Mon, Sep 22, 2014 at 3:53 PM, Thomas Graf <tgraf@suug.ch> wrote: >> On 09/22/14 at 03:40pm, Tom Herbert wrote: >>> On Mon, Sep 22, 2014 at 3:17 PM, Thomas Graf <tgraf@suug.ch> wrote: >>>> What makes stateful offload interesting to me is that the final >>>> destination of a packet is known at RX and can be redirected to a >>>> queue or VF. This allows building packet batches on shared pages >>>> while preserving the security model. >>>> >>> How is this different from what rx-filtering already does? >> >> Without stateful offload I can't know where the packet is destined >> until after I've allocated an skb and parsed the packet in >> software. I might be missing what you refer to here specifically. > > n-tuple filtering as exposed by ethtool. n-tuple has some deficiencies, - it's not possible to get the capabilities to learn what fields are supported by the device, what actions, etc. - it's ioctl based so we have to poll the device - only supports a single table, where we have devices with multiple tables - sort of the same as above but it doesn't allow creating new tables or destroying old tables. I probably missed a few others but those are the main ones that I would like to address. Granted other than the ioctl line the rest could be solved by extending the existing interface. However I would just as soon port it to ndo_ops and netlink than extend the existing ioctl, seeing it needs a reasonable overhaul to support the above anyway. We could port the ethtool ops over to the new interface to simplify drivers. .John
On Mon, Sep 22, 2014 at 8:10 AM, Tom Herbert <therbert@google.com> wrote: > On Mon, Sep 22, 2014 at 1:13 AM, Thomas Graf <tgraf@suug.ch> wrote: >> On 09/20/14 at 03:50pm, Alexei Starovoitov wrote: >>> I think HW should not be limited by SW abstractions whether >>> these abstractions are called flows, n-tuples, bridge or else. >>> Really looking forward to see "device reporting the headers as >>> header fields (len, offset) and the associated parse graph" >>> as the first step. >>> >>> Another topic that this discussion didn't cover yet is how this >>> all connects to tunnels and what is 'tunnel offloading'. > encapsulation (stuffing a few bytes of header into a packet) is in > itself not nearly an expensive enough operation to warrant offloading > to the NIC. Personally, if NIC vendors are going to focus on On the contrary, generic tunneling is the most important one to get right when we're talking offloads. Adding an encap header is easy to do in hw, but it breaks all other offloads if hw is not generic. Consider a gso packet coming from a vm. A generic tunnel allows sw to add inner headers, outer headers and set up offload offsets, so that HW does segmentation, checksumming of the inner packet, adjusts inner headers and adds the final outer encap. And this is just tx offload. On rx a smart tunnel offload in HW parses the encap and goes all the way to inner headers to verify checksums, it also steers based on inner headers. Try mellanox nics with and without vxlan offload to see the difference. It looks like fm10k will be just as good, but existing encaps are not going to last forever, so RX should be improved the way John is saying. There's got to be a 'parse graph' for HW to see past variable length encap and into inner headers. checksum_complete style of offloading checksum verification is not efficient. The cost of adjusting it over and over while parsing encaps is too high. Plus cpu steering based on outer headers is just too slow when speeds are in 40G range.
> stateful offload I'd rather see it be for encryption which I believe > currently does warrant offload at 40G and higher speeds. encryption offload is badly needed as well. Unfortunately it's not seen as a nic feature yet.
On Mon, Sep 22, 2014 at 6:54 PM, Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > On Mon, Sep 22, 2014 at 8:10 AM, Tom Herbert <therbert@google.com> wrote: >> On Mon, Sep 22, 2014 at 1:13 AM, Thomas Graf <tgraf@suug.ch> wrote: >>> On 09/20/14 at 03:50pm, Alexei Starovoitov wrote: >>>> I think HW should not be limited by SW abstractions whether >>>> these abstractions are called flows, n-tuples, bridge or else. >>>> Really looking forward to see "device reporting the headers as >>>> header fields (len, offset) and the associated parse graph" >>>> as the first step. >>>> >>>> Another topic that this discussion didn't cover yet is how this >>>> all connects to tunnels and what is 'tunnel offloading'. > >> encapsulation (stuffing a few bytes of header into a packet) is in >> itself not nearly an expensive enough operation to warrant offloading >> to the NIC. Personally, if NIC vendors are going to focus on > > On the contrary, generic tunneling is the most important one to get right > when we're talking offloads. > Adding an encap header is easy to do in hw, but it breaks all other > offloads if hw is not generic. Consider a gso packet coming from a vm. > A generic tunnel allows sw to add inner headers, outer headers and > set up offload offsets, so that HW does segmentation, checksumming > of the inner packet, adjusts inner headers and adds the final outer encap. As I pointed out on a previous thread, we already have a sufficiently generic interface to allow HW to do encapsulated TSO (SKB_GSO_UDP_TUNNEL and SKB_GSO_UDP_TUNNEL_CSUM with the inner headers). If properly implemented, HW can implement a whole bunch of UDP encap protocols without knowing how to parse them. I don't see how a switch on the NIC helps this... > And this is just tx offload. On rx a smart tunnel offload in HW parses > the encap and goes all the way to inner headers to verify checksums, > it also steers based on inner headers. > Try mellanox nics with and without vxlan offload to see > the difference.
Turn on UDP RSS on the device and I bet you'll see those differences go away! Once we moved to UDP encapsulation, there's really little value in looking at inner headers for RSS or ECMP, this should be sufficient. Sure someone might want to parse the inner headers for some sort of advanced RX steering, but again this implies rx-filtering and not switch functionality. Alexei, I believe you previously said that SW should not dictate HW models. I agree with this, but also believe the converse is true-- HW shouldn't dictate SW model. This is really why I'm raising the question of what it means to integrate a switch into the host stack. If this is something that doesn't require any model change to the stack and is just a clever backend for rx-filters or tc, then I'm fine with that! Thanks, Tom > It looks like fm10k will be just as good, but existing encaps are > not going to last forever, so RX should be improved the way John > is saying. There's got to be a 'parse graph' for HW to see past > variable length encap and into inner headers. > checksum_complete style of offloading checksum verification > is not efficient. The cost of adjusting it over and over while > parsing encaps is too high. Plus cpu steering based on outer > headers is just too slow when speeds are in 40G range. > >> stateful offload I'd rather see it be for encryption which I believe >> currently does warrant offload at 40G and higher speeds. > > encryption offload is badly needed as well. Unfortunately it's > not seen as a nic feature yet.
On Mon, Sep 22, 2014 at 07:16:47PM -0700, Tom Herbert wrote: [...] > > Alexei, I believe you previously said that SW should not dictate > HW models. I agree with this, but also believe the converse is true-- > HW shouldn't dictate SW model. This is really why I'm raising the > question of what it means to integrate a switch into the host stack. Tom, when I read this I cannot help but be reminded of the intentions/hopes/dreams of those on this thread and how different their views can be on what it means to add additional 'offload support' to the kernel. There are clearly some that are most interested in how an eSwitch on an SR-IOV capable NIC can be controlled to provide traditional forwarding help as well as offload the various technologies they hope to terminate at/inside their endpoint (host/guest/container) -- Thomas's _simple_ use-case demonstrates this. ;) This is a logical extension/increase in functionality that is offered in many eSwitches that was previously hidden from the user with the first generation SR-IOV capable network devices on hosts/servers. Others (like Florian, who has been working to extend DSA, or those pushing hardware vendors to make SDKs more open) are interested in how the existing bridging/ routing/offload code can take advantage of the hardware offload/encap available in merchant silicon. The general idea seems to be to add knowledge of offload hardware to the kernel -- either via new ndo_ops or netlink. This gives users who have this hardware the ability to have a solution for their router/switch that makes it feel like Linux is actually helping make forwarding decisions -- rather than just being the kernel chosen to provide an environment where some other non-community code runs that makes all of the decisions. And now we also have the patchset that spawned what I think has been more excellent discussion.
Jiri and Scott's patches bring up another, more generic model that, while not currently backed by hardware, provides an example/vision for what could be done if such hardware existed and how to consider interacting with that driver/hardware (that clearly has been met with some resistance, but the discussion has been great). Their ultimate goals appear to be similar to those that want full offload/forwarding support for a device, but via a different method than what some would consider standard. I am personally hopeful that most who are passionate about this will be able to get together next month at LPC (or send someone to represent them!) so that those interested can sit in the same room and try to better understand each other's desires and start to form some concrete direction towards a solution that seems to meet the needs of most while not being an architectural disaster. Of course that may be way too optimistic for this crowd! :-D
On 09/22/14 at 06:36pm, John Fastabend wrote: > n-tuple has some deficiencies, > > - it's not possible to get the capabilities to learn what > fields are supported by the device, what actions, etc. > > - it's ioctl based so we have to poll the device > > - only supports a single table, where we have devices with > multiple tables > > - sort of the same as above but it doesn't allow creating new > tables or destroying old tables. OK, I understand where Tom was going. Given that we add feature detection capabilities, this could be used to identify the guest for fixed length encap. I still assume HW won't be able to match on the inner header for any variable length encap with metadata unless it can actually parse the encap. I hope I didn't bring encap format to this thread at this very moment ;-)
On 09/22/14 at 03:40pm, Tom Herbert wrote: > On Mon, Sep 22, 2014 at 3:17 PM, Thomas Graf <tgraf@suug.ch> wrote: > > What makes stateful offload interesting to me is that the final > > destination of a packet is known at RX and can be redirected to a > > queue or VF. This allows building packet batches on shared pages > > while preserving the security model. To put this in other words: it is equivalent to applying the snabbswitch + vhost-user principle to the kernel but with encap support. The SR-IOV case would be a further optimization of that.
On 09/22/14 at 07:16pm, Tom Herbert wrote: > Turn on UDP RSS on the device and I bet you'll see those differences > go away! Once we moved to UDP encapsulation, there's really little > value in looking at inner headers for RSS or ECMP, this should be > sufficient. Sure someone might want to parse the inner headers for > some sort of advanced RX steering, but again this implies rx-filtering > and not switch functionality. Agreed. The reason we discuss this in the context of this thread is that the required rx-filtering capabilities seem to be introduced in the form of (adapted) switch chip integrations onto NICs. In that sense, OVS is essentially doing advanced RX steering in software. I agree that switch functionality (whatever that specifically implies) is not strictly required for the host if you consider queue redirection as part of RX steering. The exception here would be use of SR-IOV which could be highly interesting for corner cases if combined with smart elephant guest detection. A classic example would be NFV deployed in a virtualized environment, e.g. a virtual firewall or DPI application serving a bunch of guests. > If this is something that doesn't require any model change to the > stack and is just a clever backend for rx-filters or tc, then I'm fine > with that! I haven't seen any model change proposed. I'm most certainly not advocating that. Anyone who can live with a model change might as well just stick to SnabbSwitch or DPDK.
On 09/23/14 at 12:11am, Andy Gospodarek wrote: > There are clearly some that are most interested in how an eSwitch on an > SR-IOV capable NIC can be controlled to provide traditional forwarding help > as well as offload the various technologies they hope to terminate > at/inside their endpoint (host/guest/container) -- Thomas's _simple_ > use-case demonstrates this. ;) This is a logical extension/increase in > functionality that is offered in many eSwitches that was previously > hidden from the user with the first generation SR-IOV capable network > devices on hosts/servers. I think we can define this more broadly and state that providing RX steering capabilities to identify a guest in the NIC allows directly mapping packets into a memory region shared between host and guest. Not a new concept at all but the existing dMAC and VLAN rx filtering is just too limiting. We require a programmable API with support for encap and encryption. SR-IOV is a hardware assisted form of that which can expedite the guest to guest path on a host.
On 09/22/14 21:36, John Fastabend wrote: > n-tuple has some deficiencies, > > - it's not possible to get the capabilities to learn what > fields are supported by the device, what actions, etc. > > - it's ioctl based so we have to poll the device > > - only supports a single table, where we have devices with > multiple tables > > - sort of the same as above but it doesn't allow creating new > tables or destroying old tables. > > I probably missed a few others A few more I can think of which are generic: The whole event subsystem allowing multi-user sync or monitoring offered by netlink is missing because ethtool ioctls go directly to the driver. The async interface offered by netlink is more effective for user programmability than the synchronous one. The ioctl binary interface, whose extensibility is a pain (don't let Stephen H hear you mention ioctls for just this one reason). > but those are the main ones that I > would like to address. Granted other than the ioctl line the rest could > be solved by extending the existing interface. However I would just > as soon port it to ndo_ops and netlink than extend the existing ioctl, > seeing it needs a reasonable overhaul to support the above anyway. > > We could port the ethtool ops over to the new interface to > simplify drivers. Indeed. cheers, jamal > .John >
On 9/23/2014 7:11 AM, Andy Gospodarek wrote: > On Mon, Sep 22, 2014 at 07:16:47PM -0700, Tom Herbert wrote: > [...] >> Alexei, I believe you previously said that SW should not dictate >> HW models. I agree with this, but also believe the converse is true-- >> HW shouldn't dictate SW model. This is really why I'm raising the >> question of what it means to integrate a switch into the host stack. > Tom, when I read this I cannot help but be reminded of the > intentions/hopes/dreams of those on this thread and how different their > views can be on what it means to add additional 'offload support' to the > kernel. > > There are clearly some that are most interested in how an eSwitch on an > SR-IOV capable NIC can be controlled to provide traditional forwarding help > as well as offload the various technologies they hope to terminate > at/inside their endpoint (host/guest/container) -- Thomas's _simple_ > use-case demonstrates this. ;) This is a logical extension/increase in > functionality that is offered in many eSwitches that was previously > hidden from the user with the first generation SR-IOV capable network > devices on hosts/servers. Indeed. The idea is to leverage OVS to manage eSwitch (embedded NIC switch) as well (NOT to offload OVS). We envision a seamless integration of user environment which is based on OVS with SRIOV eSwitch and the grounds for that were very well supported in Jiri’s V1. The eSwitch hardware does not need to have multiple tables and ‘enjoys’ the flat rule of OVS. The kernel datapath does not need to be aware of the existence of HW nor its capabilities, it just pushes the flow also to the switchdev which represents the eSwitch. If the flow can be supported in HW it will be forwarded in HW and if not it will be forwarded by the kernel. > [....] > > And now we also have the patchset that spawned what I think has been > more excellent discussion.
> Jiri and Scott's patches bring up another, > more generic model that, while not currently backed by hardware, provides > an example/vision for what could be done if such hardware existed and > how to consider interacting with that driver/hardware (that clearly has > been met with some resistance, but the discussion has been great). > Their ultimate goals appear to be similar to those that want full > offload/forwarding support for a device, but via a different method > than what some would consider standard. > > I am personally hopeful that most who are passionate about this will be > able to get together next month at LPC (or send someone to represent > them!) so that those interested can sit in the same room and try to > better understand each other's desires and start to form some concrete > direction towards a solution that seems to meet the needs of most while > not being an architectural disaster. > Yep. LPC is the time and place to go over the multiple use-cases (physical switch, eSwitch, eBPF, etc) that could (should) be supported by the basic framework. Or.
On 09/23/14 at 06:32pm, Or Gerlitz wrote: > Indeed. > > The idea is to leverage OVS to manage eSwitch (embedded NIC switch) as well > (NOT to offload OVS). > > We envision a seamless integration of user environment which is based on OVS > with SRIOV eSwitch and the grounds for that were very well supported in > Jiri’s V1. Please consider comparing your model with what is described here [0]. I'm trying to write down an architecture document that we can finalize in Düsseldorf. [0] http://goo.gl/qkzW5y > The eSwitch hardware does not need to have multiple tables and ‘enjoys’ the > flat rule of OVS. The kernel datapath does not need to be aware of the > existence of HW nor its capabilities, it just pushes the flow also to the > switchdev which represents the eSwitch. I think you are saying that the kernel should not be required to make the offload decision which is fair. We definitely don't want to force the decision to be outside though, there are several legit reasons to support transparent offloads within the kernel as well as outside of OVS. > Yep. LPC is the time and place to go over the multiple use-cases (physical > switch, eSwitch, eBPF, etc) that could (should) be supported by the basic > framework. For reference: http://www.linuxplumbersconf.org/2014/ocw/proposals/2463
On Wed, Sep 24, 2014 at 4:32 PM, Thomas Graf <tgraf@suug.ch> wrote: > On 09/23/14 at 06:32pm, Or Gerlitz wrote: >> Indeed. >> >> The idea is to leverage OVS to manage eSwitch (embedded NIC switch) as well >> (NOT to offload OVS). >> >> We envision a seamless integration of user environment which is based on OVS >> with SRIOV eSwitch and the grounds for that were very well supported in >> Jiri’s V1. > > Please consider comparing your model with what is described here [0]. > I'm trying to write down an architecture document that we can finalize > in Düsseldorf. > [0] http://goo.gl/qkzW5y Yep, this can serve us for the architecture discussion @ LPC. Re the SRIOV case, you referred to the case where guest VF traffic goes through HW (say) VXLAN encap/decap -- just to make sure, we need also to support the simpler case, where guest traffic just goes through vlan tag/strip. >> The eSwitch hardware does not need to have multiple tables and ‘enjoys’ the >> flat rule of OVS. The kernel datapath does not need to be aware of the >> existence of HW nor its capabilities, it just pushes the flow also to the >> switchdev which represents the eSwitch. > I think you are saying that the kernel should not be required to make > the offload decision which is fair. We definitely don't want to force > the decision to be outside though, there are several legit reasons to > support transparent offloads within the kernel as well as outside of OVS. > >> Yep. LPC is the time and place to go over the multiple use-cases (physical >> switch, eSwitch, eBPF, etc) that could (should) be supported by the basic >> framework. > > For reference: > http://www.linuxplumbersconf.org/2014/ocw/proposals/2463 The SRIOV case is only mentioned here in the "Compatibility with existing FDB ioctls for SR-IOV" bullet, so I'm a bit nervous... we need to have it clear in the agenda. Also, this BoF needs to be double-length, two hours, can you act to get that done? thanks, Or.
On 09/26/14 at 11:03pm, Or Gerlitz wrote: > Yep, this can serve us for the architecture discussion @ LPC. Re the > SRIOV case, you referred to the case where guest VF traffic goes > through HW (say) VXLAN encap/decap -- just to make sure, we need also > to support the simpler case, where guest traffic just goes through > vlan tag/strip. Agreed. > The SRIOV case is only mentioned here in the "Compatibility with > existing FDB ioctls for SR-IOV" bullet, so I'm a bit nervous... we > need to have it clear in the agenda. I think the offload API discussion should consider the SR-IOV case but we might need to discuss additional details outside of that BoF to ensure that the BoF can keep focus on the offload API itself. That said, I suggest we define the specific agenda once we know that the BoF has been accepted and 2 hours have been allocated ;-) > Also, this BoF needs to be double-length, two hours, can you act to > get that done? This has already been requested.
diff --git a/MAINTAINERS b/MAINTAINERS index f1f26db..0fe2822 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -8832,6 +8832,7 @@ L: netdev@vger.kernel.org S: Supported F: net/switchdev/ F: include/net/switchdev.h +F: include/uapi/linux/switchdev.h SYNOPSYS ARC ARCHITECTURE M: Vineet Gupta <vgupta@synopsys.com> diff --git a/include/uapi/linux/switchdev.h b/include/uapi/linux/switchdev.h new file mode 100644 index 0000000..f945b57 --- /dev/null +++ b/include/uapi/linux/switchdev.h @@ -0,0 +1,113 @@ +/* + * include/uapi/linux/switchdev.h - Netlink interface to Switch device + * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + */ + +#ifndef _UAPI_LINUX_SWITCHDEV_H_ +#define _UAPI_LINUX_SWITCHDEV_H_ + +enum { + SWDEV_CMD_NOOP, + SWDEV_CMD_FLOW_INSERT, + SWDEV_CMD_FLOW_REMOVE, +}; + +enum { + SWDEV_ATTR_UNSPEC, + SWDEV_ATTR_IFINDEX, /* u32 */ + SWDEV_ATTR_FLOW, /* nest */ + + __SWDEV_ATTR_MAX, + SWDEV_ATTR_MAX = (__SWDEV_ATTR_MAX - 1), +}; + +enum { + SWDEV_ATTR_FLOW_MATCH_KEY_UNSPEC, + SWDEV_ATTR_FLOW_MATCH_KEY_PHY_PRIORITY, /* u32 */ + SWDEV_ATTR_FLOW_MATCH_KEY_PHY_IN_PORT, /* u32 (ifindex) */ + SWDEV_ATTR_FLOW_MATCH_KEY_ETH_SRC, /* ETH_ALEN */ + SWDEV_ATTR_FLOW_MATCH_KEY_ETH_DST, /* ETH_ALEN */ + SWDEV_ATTR_FLOW_MATCH_KEY_ETH_TCI, /* be16 */ + SWDEV_ATTR_FLOW_MATCH_KEY_ETH_TYPE, /* be16 */ + SWDEV_ATTR_FLOW_MATCH_KEY_IP_PROTO, /* u8 */ + SWDEV_ATTR_FLOW_MATCH_KEY_IP_TOS, /* u8 */ + SWDEV_ATTR_FLOW_MATCH_KEY_IP_TTL, /* u8 */ + SWDEV_ATTR_FLOW_MATCH_KEY_IP_FRAG, /* u8 */ + SWDEV_ATTR_FLOW_MATCH_KEY_TP_SRC, /* be16 */ + SWDEV_ATTR_FLOW_MATCH_KEY_TP_DST, /* be16 */ + SWDEV_ATTR_FLOW_MATCH_KEY_TP_FLAGS, /* be16 */ + SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ADDR_SRC, /* be32 */ + SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ADDR_DST, /* be32 
*/ + SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ARP_SHA, /* ETH_ALEN */ + SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ARP_THA, /* ETH_ALEN */ + SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ADDR_SRC, /* struct in6_addr */ + SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ADDR_DST, /* struct in6_addr */ + SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_LABEL, /* be32 */ + SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_TARGET, /* struct in6_addr */ + SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_SLL, /* ETH_ALEN */ + SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_TLL, /* ETH_ALEN */ + + __SWDEV_ATTR_FLOW_MATCH_KEY_MAX, + SWDEV_ATTR_FLOW_MATCH_KEY_MAX = (__SWDEV_ATTR_FLOW_MATCH_KEY_MAX - 1), +}; + +enum { + SWDEV_FLOW_ACTION_TYPE_OUTPUT, + SWDEV_FLOW_ACTION_TYPE_VLAN_PUSH, + SWDEV_FLOW_ACTION_TYPE_VLAN_POP, +}; + +enum { + SWDEV_ATTR_FLOW_ACTION_UNSPEC, + SWDEV_ATTR_FLOW_ACTION_TYPE, /* u32 */ + SWDEV_ATTR_FLOW_ACTION_OUT_PORT, /* u32 (ifindex) */ + SWDEV_ATTR_FLOW_ACTION_VLAN_PROTO, /* be16 */ + SWDEV_ATTR_FLOW_ACTION_VLAN_TCI, /* u16 */ + + __SWDEV_ATTR_FLOW_ACTION_MAX, + SWDEV_ATTR_FLOW_ACTION_MAX = (__SWDEV_ATTR_FLOW_ACTION_MAX - 1), +}; + +enum { + SWDEV_ATTR_FLOW_ITEM_UNSPEC, + SWDEV_ATTR_FLOW_ITEM_ACTION, /* nest */ + + __SWDEV_ATTR_FLOW_ITEM_MAX, + SWDEV_ATTR_FLOW_ITEM_MAX = (__SWDEV_ATTR_FLOW_ITEM_MAX - 1), +}; + +enum { + SWDEV_ATTR_FLOW_UNSPEC, + SWDEV_ATTR_FLOW_MATCH_KEY, /* nest */ + SWDEV_ATTR_FLOW_MATCH_KEY_MASK, /* nest */ + SWDEV_ATTR_FLOW_LIST_ACTION, /* nest */ + + __SWDEV_ATTR_FLOW_MAX, + SWDEV_ATTR_FLOW_MAX = (__SWDEV_ATTR_FLOW_MAX - 1), +}; + +/* Nested layout of flow add/remove command message: + * + * [SWDEV_ATTR_IFINDEX] + * [SWDEV_ATTR_FLOW] + * [SWDEV_ATTR_FLOW_MATCH_KEY] + * [SWDEV_ATTR_FLOW_MATCH_KEY_*], ... + * [SWDEV_ATTR_FLOW_MATCH_KEY_MASK] + * [SWDEV_ATTR_FLOW_MATCH_KEY_*], ... + * [SWDEV_ATTR_FLOW_LIST_ACTION] + * [SWDEV_ATTR_FLOW_ITEM_ACTION] + * [SWDEV_ATTR_FLOW_ACTION_*], ... + * [SWDEV_ATTR_FLOW_ITEM_ACTION] + * [SWDEV_ATTR_FLOW_ACTION_*], ... + * ... 
+ */ + +#define SWITCHDEV_GENL_NAME "switchdev" +#define SWITCHDEV_GENL_VERSION 0x1 + +#endif /* _UAPI_LINUX_SWITCHDEV_H_ */ diff --git a/net/switchdev/Kconfig b/net/switchdev/Kconfig index 20e8ed2..4470d6e 100644 --- a/net/switchdev/Kconfig +++ b/net/switchdev/Kconfig @@ -7,3 +7,14 @@ config NET_SWITCHDEV depends on INET ---help--- This module provides support for hardware switch chips. + +config NET_SWITCHDEV_NETLINK + tristate "Netlink interface to Switch device" + depends on NET_SWITCHDEV + default m + ---help--- + This module provides a Generic Netlink interface to hardware switch + chips. + + To compile this code as a module, choose M here: the + module will be called switchdev_netlink. diff --git a/net/switchdev/Makefile b/net/switchdev/Makefile index 5ed63ed..0695b53 100644 --- a/net/switchdev/Makefile +++ b/net/switchdev/Makefile @@ -3,3 +3,4 @@ # obj-$(CONFIG_NET_SWITCHDEV) += switchdev.o +obj-$(CONFIG_NET_SWITCHDEV_NETLINK) += switchdev_netlink.o diff --git a/net/switchdev/switchdev_netlink.c b/net/switchdev/switchdev_netlink.c new file mode 100644 index 0000000..d97db8b --- /dev/null +++ b/net/switchdev/switchdev_netlink.c @@ -0,0 +1,441 @@ +/* + * net/switchdev/switchdev_netlink.c - Netlink interface to Switch device + * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version.
+ */ + +#include <linux/kernel.h> +#include <linux/types.h> +#include <linux/module.h> +#include <linux/init.h> +#include <linux/netdevice.h> +#include <linux/etherdevice.h> +#include <net/switchdev.h> +#include <net/netlink.h> +#include <net/genetlink.h> +#include <uapi/linux/switchdev.h> + +static struct genl_family swdev_nl_family = { + .id = GENL_ID_GENERATE, + .name = SWITCHDEV_GENL_NAME, + .version = SWITCHDEV_GENL_VERSION, + .maxattr = SWDEV_ATTR_MAX, + .netnsok = true, +}; + +static const struct nla_policy swdev_nl_flow_policy[SWDEV_ATTR_FLOW_MAX + 1] = { + [SWDEV_ATTR_FLOW_UNSPEC] = { .type = NLA_UNSPEC, }, + [SWDEV_ATTR_FLOW_MATCH_KEY] = { .type = NLA_NESTED }, + [SWDEV_ATTR_FLOW_MATCH_KEY_MASK] = { .type = NLA_NESTED }, + [SWDEV_ATTR_FLOW_LIST_ACTION] = { .type = NLA_NESTED }, +}; + +#define __IN6_ALEN sizeof(struct in6_addr) + +static const struct nla_policy +swdev_nl_flow_match_key_policy[SWDEV_ATTR_FLOW_MATCH_KEY_MAX + 1] = { + [SWDEV_ATTR_FLOW_MATCH_KEY_UNSPEC] = { .type = NLA_UNSPEC, }, + [SWDEV_ATTR_FLOW_MATCH_KEY_PHY_PRIORITY] = { .type = NLA_U32, }, + [SWDEV_ATTR_FLOW_MATCH_KEY_PHY_IN_PORT] = { .type = NLA_U32, }, + [SWDEV_ATTR_FLOW_MATCH_KEY_ETH_SRC] = { .len = ETH_ALEN, }, + [SWDEV_ATTR_FLOW_MATCH_KEY_ETH_DST] = { .len = ETH_ALEN, }, + [SWDEV_ATTR_FLOW_MATCH_KEY_ETH_TCI] = { .type = NLA_U16, }, + [SWDEV_ATTR_FLOW_MATCH_KEY_ETH_TYPE] = { .type = NLA_U16, }, + [SWDEV_ATTR_FLOW_MATCH_KEY_IP_PROTO] = { .type = NLA_U8, }, + [SWDEV_ATTR_FLOW_MATCH_KEY_IP_TOS] = { .type = NLA_U8, }, + [SWDEV_ATTR_FLOW_MATCH_KEY_IP_TTL] = { .type = NLA_U8, }, + [SWDEV_ATTR_FLOW_MATCH_KEY_IP_FRAG] = { .type = NLA_U8, }, + [SWDEV_ATTR_FLOW_MATCH_KEY_TP_SRC] = { .type = NLA_U16, }, + [SWDEV_ATTR_FLOW_MATCH_KEY_TP_DST] = { .type = NLA_U16, }, + [SWDEV_ATTR_FLOW_MATCH_KEY_TP_FLAGS] = { .type = NLA_U16, }, + [SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ADDR_SRC] = { .type = NLA_U32, }, + [SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ADDR_DST] = { .type = NLA_U32, }, +
[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ARP_SHA] = { .len = ETH_ALEN, }, + [SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ARP_THA] = { .len = ETH_ALEN, }, + [SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ADDR_SRC] = { .len = __IN6_ALEN, }, + [SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ADDR_DST] = { .len = __IN6_ALEN, }, + [SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_LABEL] = { .type = NLA_U32, }, + [SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_TARGET] = { .len = __IN6_ALEN, }, + [SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_SLL] = { .len = ETH_ALEN }, + [SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_TLL] = { .len = ETH_ALEN }, +}; + +static const struct nla_policy +swdev_nl_flow_action_policy[SWDEV_ATTR_FLOW_ACTION_MAX + 1] = { + [SWDEV_ATTR_FLOW_ACTION_UNSPEC] = { .type = NLA_UNSPEC, }, + [SWDEV_ATTR_FLOW_ACTION_TYPE] = { .type = NLA_U32, }, + [SWDEV_ATTR_FLOW_ACTION_VLAN_PROTO] = { .type = NLA_U16, }, + [SWDEV_ATTR_FLOW_ACTION_VLAN_TCI] = { .type = NLA_U16, }, +}; + +static int swdev_nl_cmd_noop(struct sk_buff *skb, struct genl_info *info) +{ + struct sk_buff *msg; + void *hdr; + int err; + + msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL); + if (!msg) + return -ENOMEM; + + hdr = genlmsg_put(msg, info->snd_portid, info->snd_seq, + &swdev_nl_family, 0, SWDEV_CMD_NOOP); + if (!hdr) { + err = -EMSGSIZE; + goto err_msg_put; + } + + genlmsg_end(msg, hdr); + + return genlmsg_unicast(genl_info_net(info), msg, info->snd_portid); + +err_msg_put: + nlmsg_free(msg); + + return err; +} + +static int swdev_nl_parse_flow_match_key(struct nlattr *key_attr, + struct swdev_flow_match_key *key) +{ + struct nlattr *attrs[SWDEV_ATTR_FLOW_MATCH_KEY_MAX + 1]; + int err; + + err = nla_parse_nested(attrs, SWDEV_ATTR_FLOW_MATCH_KEY_MAX, + key_attr, swdev_nl_flow_match_key_policy); + if (err) + return err; + + if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_PHY_PRIORITY]) + key->phy.priority = + nla_get_u32(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_PHY_PRIORITY]); + + if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_PHY_IN_PORT]) + key->phy.in_port_ifindex = + nla_get_u32(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_PHY_IN_PORT]); 
+ + if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_SRC]) + ether_addr_copy(key->eth.src, + nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_SRC])); + + if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_DST]) + ether_addr_copy(key->eth.dst, + nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_DST])); + + if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_TCI]) + key->eth.tci = + nla_get_be16(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_TCI]); + + if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_TYPE]) + key->eth.type = + nla_get_be16(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_TYPE]); + + if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IP_PROTO]) + key->ip.proto = + nla_get_u8(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IP_PROTO]); + + if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IP_TOS]) + key->ip.tos = + nla_get_u8(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IP_TOS]); + + if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IP_TTL]) + key->ip.ttl = + nla_get_u8(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IP_TTL]); + + if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IP_FRAG]) + key->ip.frag = + nla_get_u8(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IP_FRAG]); + + if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_TP_SRC]) + key->tp.src = + nla_get_be16(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_TP_SRC]); + + if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_TP_DST]) + key->tp.dst = + nla_get_be16(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_TP_DST]); + + if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_TP_FLAGS]) + key->tp.flags = + nla_get_be16(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_TP_FLAGS]); + + if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ADDR_SRC]) + key->ipv4.addr.src = + nla_get_be32(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ADDR_SRC]); + + if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ADDR_DST]) + key->ipv4.addr.dst = + nla_get_be32(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ADDR_DST]); + + if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ARP_SHA]) + ether_addr_copy(key->ipv4.arp.sha, + nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ARP_SHA])); + + if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ARP_THA]) + ether_addr_copy(key->ipv4.arp.tha, + nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ARP_THA])); + + if 
(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ADDR_SRC]) + memcpy(&key->ipv6.addr.src, + nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ADDR_SRC]), + sizeof(key->ipv6.addr.src)); + + if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ADDR_DST]) + memcpy(&key->ipv6.addr.dst, + nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ADDR_DST]), + sizeof(key->ipv6.addr.dst)); + + if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_LABEL]) + key->ipv6.label = + nla_get_be32(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_LABEL]); + + if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_TARGET]) + memcpy(&key->ipv6.nd.target, + nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_TARGET]), + sizeof(key->ipv6.nd.target)); + + if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_SLL]) + ether_addr_copy(key->ipv6.nd.sll, + nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_SLL])); + + if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_TLL]) + ether_addr_copy(key->ipv6.nd.tll, + nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_TLL])); + + return 0; +} + +static int swdev_nl_parse_flow_action(struct nlattr *action_attr, + struct swdev_flow_action *flow_action) +{ + struct nlattr *attrs[SWDEV_ATTR_FLOW_ACTION_MAX + 1]; + int err; + + err = nla_parse_nested(attrs, SWDEV_ATTR_FLOW_ACTION_MAX, + action_attr, swdev_nl_flow_action_policy); + if (err) + return err; + + if (!attrs[SWDEV_ATTR_FLOW_ACTION_TYPE]) + return -EINVAL; + + switch (nla_get_u32(attrs[SWDEV_ATTR_FLOW_ACTION_TYPE])) { + case SWDEV_FLOW_ACTION_TYPE_OUTPUT: + if (!attrs[SWDEV_ATTR_FLOW_ACTION_OUT_PORT]) + return -EINVAL; + flow_action->out_port_ifindex = + nla_get_u32(attrs[SWDEV_ATTR_FLOW_ACTION_OUT_PORT]); + flow_action->type = SW_FLOW_ACTION_TYPE_OUTPUT; + break; + case SWDEV_FLOW_ACTION_TYPE_VLAN_PUSH: + if (!attrs[SWDEV_ATTR_FLOW_ACTION_VLAN_PROTO] || + !attrs[SWDEV_ATTR_FLOW_ACTION_VLAN_TCI]) + return -EINVAL; + flow_action->vlan.proto = + nla_get_be16(attrs[SWDEV_ATTR_FLOW_ACTION_VLAN_PROTO]); + flow_action->vlan.tci = + nla_get_u16(attrs[SWDEV_ATTR_FLOW_ACTION_VLAN_TCI]); + 
flow_action->type = SW_FLOW_ACTION_TYPE_VLAN_PUSH; + break; + case SWDEV_FLOW_ACTION_TYPE_VLAN_POP: + flow_action->type = SW_FLOW_ACTION_TYPE_VLAN_POP; + break; + default: + return -EINVAL; + } + + return 0; +} + +static int swdev_nl_parse_flow_actions(struct nlattr *actions_attr, + struct swdev_flow_action *action) +{ + struct swdev_flow_action *cur; + struct nlattr *action_attr; + int rem; + int err; + + cur = action; + nla_for_each_nested(action_attr, actions_attr, rem) { + err = swdev_nl_parse_flow_action(action_attr, cur); + if (err) + return err; + cur++; + } + return 0; +} + +static int swdev_nl_parse_flow_action_count(struct nlattr *actions_attr, + unsigned *p_action_count) +{ + struct nlattr *action_attr; + int rem; + int count = 0; + + nla_for_each_nested(action_attr, actions_attr, rem) { + if (nla_type(action_attr) != SWDEV_ATTR_FLOW_ITEM_ACTION) + return -EINVAL; + count++; + } + *p_action_count = count; + return 0; +} + +static void swdev_nl_free_flow(struct swdev_flow *flow) +{ + kfree(flow); +} + +static int swdev_nl_parse_flow(struct nlattr *flow_attr, struct swdev_flow **p_flow) +{ + struct swdev_flow *flow; + struct nlattr *attrs[SWDEV_ATTR_FLOW_MAX + 1]; + unsigned action_count; + int err; + + err = nla_parse_nested(attrs, SWDEV_ATTR_FLOW_MAX, + flow_attr, swdev_nl_flow_policy); + if (err) + return err; + + if (!attrs[SWDEV_ATTR_FLOW_MATCH_KEY] || + !attrs[SWDEV_ATTR_FLOW_MATCH_KEY_MASK] || + !attrs[SWDEV_ATTR_FLOW_LIST_ACTION]) + return -EINVAL; + + err = swdev_nl_parse_flow_action_count(attrs[SWDEV_ATTR_FLOW_LIST_ACTION], + &action_count); + if (err) + return err; + flow = swdev_flow_alloc(action_count, GFP_KERNEL); + if (!flow) + return -ENOMEM; + + err = swdev_nl_parse_flow_match_key(attrs[SWDEV_ATTR_FLOW_MATCH_KEY], + &flow->match.key); + if (err) + goto out; + + err = swdev_nl_parse_flow_match_key(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_MASK], + &flow->match.key_mask); + if (err) + goto out; + + err = 
swdev_nl_parse_flow_actions(attrs[SWDEV_ATTR_FLOW_LIST_ACTION], + flow->action); + if (err) + goto out; + + *p_flow = flow; + return 0; + +out: + kfree(flow); + return err; +} + +static struct net_device *swdev_nl_dev_get(struct genl_info *info) +{ + struct net *net = genl_info_net(info); + int ifindex; + + if (!info->attrs[SWDEV_ATTR_IFINDEX]) + return NULL; + + ifindex = nla_get_u32(info->attrs[SWDEV_ATTR_IFINDEX]); + return dev_get_by_index(net, ifindex); +} + +static void swdev_nl_dev_put(struct net_device *dev) +{ + dev_put(dev); +} + +static int swdev_nl_cmd_flow_insert(struct sk_buff *skb, struct genl_info *info) +{ + struct net_device *dev; + struct swdev_flow *flow; + int err; + + if (!info->attrs[SWDEV_ATTR_FLOW]) + return -EINVAL; + + dev = swdev_nl_dev_get(info); + if (!dev) + return -EINVAL; + + err = swdev_nl_parse_flow(info->attrs[SWDEV_ATTR_FLOW], &flow); + if (err) + goto dev_put; + + err = swdev_flow_insert(dev, flow); + swdev_nl_free_flow(flow); +dev_put: + swdev_nl_dev_put(dev); + return err; +} + +static int swdev_nl_cmd_flow_remove(struct sk_buff *skb, struct genl_info *info) +{ + struct net_device *dev; + struct swdev_flow *flow; + int err; + + if (!info->attrs[SWDEV_ATTR_FLOW]) + return -EINVAL; + + dev = swdev_nl_dev_get(info); + if (!dev) + return -EINVAL; + + err = swdev_nl_parse_flow(info->attrs[SWDEV_ATTR_FLOW], &flow); + if (err) + goto dev_put; + + err = swdev_flow_remove(dev, flow); + swdev_nl_free_flow(flow); +dev_put: + swdev_nl_dev_put(dev); + return err; +} + +static const struct genl_ops swdev_nl_ops[] = { + { + .cmd = SWDEV_CMD_NOOP, + .doit = swdev_nl_cmd_noop, + }, + { + .cmd = SWDEV_CMD_FLOW_INSERT, + .doit = swdev_nl_cmd_flow_insert, + .policy = swdev_nl_flow_policy, + .flags = GENL_ADMIN_PERM, + }, + { + .cmd = SWDEV_CMD_FLOW_REMOVE, + .doit = swdev_nl_cmd_flow_remove, + .policy = swdev_nl_flow_policy, + .flags = GENL_ADMIN_PERM, + }, +}; + +static int __init swdev_nl_module_init(void) +{ + return 
genl_register_family_with_ops(&swdev_nl_family, swdev_nl_ops); +} + +static void swdev_nl_module_fini(void) +{ + genl_unregister_family(&swdev_nl_family); +} + +module_init(swdev_nl_module_init); +module_exit(swdev_nl_module_fini); + +MODULE_LICENSE("GPL v2"); +MODULE_AUTHOR("Jiri Pirko <jiri@resnulli.us>"); +MODULE_DESCRIPTION("Netlink interface to Switch device"); +MODULE_ALIAS_GENL_FAMILY(SWITCHDEV_GENL_NAME);
This patch exposes the switchdev API using generic Netlink.

Example userspace utility is here:
https://github.com/jpirko/switchdev

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 MAINTAINERS                       |   1 +
 include/uapi/linux/switchdev.h    | 113 ++++++++++
 net/switchdev/Kconfig             |  11 +
 net/switchdev/Makefile            |   1 +
 net/switchdev/switchdev_netlink.c | 441 ++++++++++++++++++++++++++++++++++++++
 5 files changed, 567 insertions(+)
 create mode 100644 include/uapi/linux/switchdev.h
 create mode 100644 net/switchdev/switchdev_netlink.c