
[net-next,v2,8/9] switchdev: introduce Netlink API

Message ID 1411134590-4586-9-git-send-email-jiri@resnulli.us
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Jiri Pirko Sept. 19, 2014, 1:49 p.m. UTC
This patch exposes switchdev API using generic Netlink.
Example userspace utility is here:
https://github.com/jpirko/switchdev

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 MAINTAINERS                       |   1 +
 include/uapi/linux/switchdev.h    | 113 ++++++++++
 net/switchdev/Kconfig             |  11 +
 net/switchdev/Makefile            |   1 +
 net/switchdev/switchdev_netlink.c | 441 ++++++++++++++++++++++++++++++++++++++
 5 files changed, 567 insertions(+)
 create mode 100644 include/uapi/linux/switchdev.h
 create mode 100644 net/switchdev/switchdev_netlink.c

Comments

Jamal Hadi Salim Sept. 19, 2014, 3:25 p.m. UTC | #1
On 09/19/14 09:49, Jiri Pirko wrote:
> This patch exposes switchdev API using generic Netlink.
> Example userspace utility is here:
> https://github.com/jpirko/switchdev
>

Is this just a temporary test tool? Otherwise I don't see a reason
for its existence (or the API that it feeds on).

cheers,
jamal

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiri Pirko Sept. 19, 2014, 3:49 p.m. UTC | #2
Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>On 09/19/14 09:49, Jiri Pirko wrote:
>>This patch exposes switchdev API using generic Netlink.
>>Example userspace utility is here:
>>https://github.com/jpirko/switchdev
>>
>
>Is this just a temporary test tool? Otherwise i dont see reason
>for its existence (or the API that it feeds on).

Please read the conversation I had with Pravin and Jesse in the v1 thread.
Long story short, they would like to have the API separated from the ovs
datapath so the ovs daemon can use it to communicate directly with the
driver. Also, John Fastabend requested a way to work with driver flows
without using ovs -> that was the original reason I created the switchdev
genl API.

Regarding the "sw" tool, yes, it is for testing purposes now. The ovs
daemon will use the switchdev genl API directly.

I hope this clears it up.
Jamal Hadi Salim Sept. 19, 2014, 5:57 p.m. UTC | #3
On 09/19/14 11:49, Jiri Pirko wrote:
> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:

>> Is this just a temporary test tool? Otherwise i dont see reason
>> for its existence (or the API that it feeds on).
>
> Please read the conversation I had with Pravin and Jesse in v1 thread.
> Long story short they like to have the api separated from ovs datapath
> so ovs daemon can use it to directly communicate with driver. Also John
> Fastabend requested a way to work with driver flows without using ovs ->
> that was the original reason I created switchdev genl api.
>
> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
> will use directly switchdev genl api.
>
> I hope I cleared this out.
>

It is - thanks Jiri.

cheers,
jamal
John Fastabend Sept. 19, 2014, 10:12 p.m. UTC | #4
On 09/19/2014 10:57 AM, Jamal Hadi Salim wrote:
> On 09/19/14 11:49, Jiri Pirko wrote:
>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
> 
>>> Is this just a temporary test tool? Otherwise i dont see reason
>>> for its existence (or the API that it feeds on).
>>
>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>> Long story short they like to have the api separated from ovs datapath
>> so ovs daemon can use it to directly communicate with driver. Also John
>> Fastabend requested a way to work with driver flows without using ovs ->
>> that was the original reason I created switchdev genl api.
>>
>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>> will use directly switchdev genl api.
>>
>> I hope I cleared this out.
>>
> 
> It is - thanks Jiri.
> 
> cheers,
> jamal

Hi Jiri,

I was considering a slightly different approach where the
device would report via netlink the fields/actions it
supports, rather than creating pre-defined enums for every
possible key.

I already need an API to report the fields/matches that are
supported, so why not have the device report the headers as
header fields (len, offset) and the associated parse graph
the hardware uses? Vendors should already have this to
describe/design their real hardware.

As always, it's better to have code, and when I get some
time I'll try to write it up. Maybe it's just a separate
classifier, although I don't actually want two hardware
flow APIs.

I see you dropped the RFC tag; are you proposing we include
this now?

.John
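John's idea of a device self-describing its headers and parse graph could be sketched roughly as below. This is a standalone, illustrative userspace model, not code from the patchset; every struct, field, and function name here is hypothetical:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A device describes each supported header as a set of named
 * (offset, len) fields, plus a parse graph saying which header may
 * follow which. All names below are hypothetical. */

struct hw_field {
	const char *name;
	uint16_t offset;	/* bit offset within the header */
	uint16_t len;		/* field length in bits */
};

struct hw_header {
	const char *name;
	const struct hw_field *fields;
	size_t num_fields;
};

/* One edge of the parse graph: from header `from`, if `key_field`
 * equals `key_value`, continue parsing with header `to`. */
struct hw_parse_edge {
	const struct hw_header *from;
	const struct hw_header *to;
	const char *key_field;
	uint32_t key_value;
};

static const struct hw_field eth_fields[] = {
	{ "dst_mac", 0, 48 }, { "src_mac", 48, 48 }, { "ethertype", 96, 16 },
};
static const struct hw_header eth_hdr = { "ethernet", eth_fields, 3 };

static const struct hw_field ipv4_fields[] = {
	{ "src_ip", 96, 32 }, { "dst_ip", 128, 32 },
};
static const struct hw_header ipv4_hdr = { "ipv4", ipv4_fields, 2 };

static const struct hw_parse_edge parse_graph[] = {
	{ &eth_hdr, &ipv4_hdr, "ethertype", 0x0800 },
};

/* The device supports matching on a field iff the field appears in a
 * header it reported. */
static int device_supports_field(const struct hw_header *h, const char *field)
{
	size_t i;

	for (i = 0; i < h->num_fields; i++)
		if (strcmp(h->fields[i].name, field) == 0)
			return 1;
	return 0;
}

/* Walk one edge of the reported parse graph. */
static const struct hw_header *next_header(const struct hw_header *from,
					   uint32_t key_value)
{
	size_t i;

	for (i = 0; i < sizeof(parse_graph) / sizeof(parse_graph[0]); i++)
		if (parse_graph[i].from == from &&
		    parse_graph[i].key_value == key_value)
			return parse_graph[i].to;
	return NULL;
}
```

The point of the model: userspace never needs a pre-defined enum per key; it walks the graph the device reported and only offers the fields found there.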
Jamal Hadi Salim Sept. 19, 2014, 10:18 p.m. UTC | #5
On 09/19/14 18:12, John Fastabend wrote:
> On 09/19/2014 10:57 AM, Jamal Hadi Salim wrote:
>> On 09/19/14 11:49, Jiri Pirko wrote:
>>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>>
>>>> Is this just a temporary test tool? Otherwise i dont see reason
>>>> for its existence (or the API that it feeds on).
>>>
>>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>>> Long story short they like to have the api separated from ovs datapath
>>> so ovs daemon can use it to directly communicate with driver. Also John
>>> Fastabend requested a way to work with driver flows without using ovs ->
>>> that was the original reason I created switchdev genl api.
>>>
>>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>> will use directly switchdev genl api.
>>>
>>> I hope I cleared this out.
>>>
>>
>> It is - thanks Jiri.
>>
>> cheers,
>> jamal
>
> Hi Jiri,
>
> I was considering a slightly different approach where the
> device would report via netlink the fields/actions it
> supported rather than creating pre-defined enums for every
> possible key.
>
> I already need to have an API to report fields/matches
> that are being supported why not have the device report
> the headers as header fields (len, offset) and the
> associated parse graph the hardware uses? Vendors should
> have this already to describe/design their real hardware.
>
> As always its better to have code and when I get some
> time I'll try to write it up. Maybe its just a separate
> classifier although I don't actually want two hardware
> flow APIs.
>
> I see you dropped the RFC tag are you proposing we include
> this now?
>


Actually, I just realized I missed something very basic that
Jiri said. I understand the tool being there for testing,
but I assumed the same about the genl API.
Jiri, are you saying that the genl API is there to stay?

cheers,
jamal
Roopa Prabhu Sept. 20, 2014, 3:41 a.m. UTC | #6
On 9/19/14, 8:49 AM, Jiri Pirko wrote:
> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>> On 09/19/14 09:49, Jiri Pirko wrote:
>>> This patch exposes switchdev API using generic Netlink.
>>> Example userspace utility is here:
>>> https://github.com/jpirko/switchdev
>>>
>> Is this just a temporary test tool? Otherwise i dont see reason
>> for its existence (or the API that it feeds on).
> Please read the conversation I had with Pravin and Jesse in v1 thread.
> Long story short they like to have the api separated from ovs datapath
> so ovs daemon can use it to directly communicate with driver. Also John
> Fastabend requested a way to work with driver flows without using ovs ->
> that was the original reason I created switchdev genl api.
>
> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
> will use directly switchdev genl api.
>
> I hope I cleared this out.
We already have all the needed rtnetlink kernel API and the userspace
tools around it to support all switching ASIC features; i.e., the
rtnetlink API is the switchdev API. We can do L2, L3, and ACLs with it.
It's unclear to me why we need another new netlink API, which would mean
none of the existing tools to create bridges etc. will work on a
switchdev. That seems like going in the direction exactly opposite to
what we had discussed earlier.

If a non-ovs flow interface is needed from userspace, we can extend the
existing interface to include flows. I don't understand why we should
replace the existing rtnetlink switchdev API to accommodate flows.

Thanks,
Roopa

Florian Fainelli Sept. 20, 2014, 5:36 a.m. UTC | #7
On 09/19/14 15:12, John Fastabend wrote:
> On 09/19/2014 10:57 AM, Jamal Hadi Salim wrote:
>> On 09/19/14 11:49, Jiri Pirko wrote:
>>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>>
>>>> Is this just a temporary test tool? Otherwise i dont see reason
>>>> for its existence (or the API that it feeds on).
>>>
>>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>>> Long story short they like to have the api separated from ovs datapath
>>> so ovs daemon can use it to directly communicate with driver. Also John
>>> Fastabend requested a way to work with driver flows without using ovs ->
>>> that was the original reason I created switchdev genl api.
>>>
>>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>> will use directly switchdev genl api.
>>>
>>> I hope I cleared this out.
>>>
>>
>> It is - thanks Jiri.
>>
>> cheers,
>> jamal
>
> Hi Jiri,
>
> I was considering a slightly different approach where the
> device would report via netlink the fields/actions it
> supported rather than creating pre-defined enums for every
> possible key.
>
> I already need to have an API to report fields/matches
> that are being supported why not have the device report
> the headers as header fields (len, offset) and the
> associated parse graph the hardware uses? Vendors should
> have this already to describe/design their real hardware.

Hmm, wouldn't that slightly go against coming up with a netlink API that
is generic? Surely we could pay close attention when reviewing what is
being added and spot when a common API needs to be introduced...

This might become very similar to private ioctl(), private wireless
extensions, and nl80211 testmode, and, well, that is not extremely pretty.

>
> As always its better to have code and when I get some
> time I'll try to write it up. Maybe its just a separate
> classifier although I don't actually want two hardware
> flow APIs.
>
> I see you dropped the RFC tag are you proposing we include
> this now?
>
> .John
>

Florian Fainelli Sept. 20, 2014, 5:39 a.m. UTC | #8
On 09/19/14 15:18, Jamal Hadi Salim wrote:
> On 09/19/14 18:12, John Fastabend wrote:
>> On 09/19/2014 10:57 AM, Jamal Hadi Salim wrote:
>>> On 09/19/14 11:49, Jiri Pirko wrote:
>>>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>>>
>>>>> Is this just a temporary test tool? Otherwise i dont see reason
>>>>> for its existence (or the API that it feeds on).
>>>>
>>>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>>>> Long story short they like to have the api separated from ovs datapath
>>>> so ovs daemon can use it to directly communicate with driver. Also John
>>>> Fastabend requested a way to work with driver flows without using
>>>> ovs ->
>>>> that was the original reason I created switchdev genl api.
>>>>
>>>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>>> will use directly switchdev genl api.
>>>>
>>>> I hope I cleared this out.
>>>>
>>>
>>> It is - thanks Jiri.
>>>
>>> cheers,
>>> jamal
>>
>> Hi Jiri,
>>
>> I was considering a slightly different approach where the
>> device would report via netlink the fields/actions it
>> supported rather than creating pre-defined enums for every
>> possible key.
>>
>> I already need to have an API to report fields/matches
>> that are being supported why not have the device report
>> the headers as header fields (len, offset) and the
>> associated parse graph the hardware uses? Vendors should
>> have this already to describe/design their real hardware.
>>
>> As always its better to have code and when I get some
>> time I'll try to write it up. Maybe its just a separate
>> classifier although I don't actually want two hardware
>> flow APIs.
>>
>> I see you dropped the RFC tag are you proposing we include
>> this now?
>>
>
>
> Actually I just realized i missed something very basic that
> Jiri said. I think i understand the tool being there for testing
> but i am assumed the same about the genlink api.
> Jiri, are you saying that genlink api is there to
> stay?

So, I really have mixed feelings about this netlink API, in particular
because it is not clear to me where the line is between what should be
a network device ndo operation, what should be an ethtool command, what
should be a netlink message, and the rest.

I can certainly acknowledge that manipulating flows is not ideal with
the current set of tools, but once we are there with netlink, how far
are we from not having any network devices at all, and how does that
differ from OpenWrt's swconfig in the end [1]?

[1]: https://lwn.net/Articles/571390/
--
Florian
Jiri Pirko Sept. 20, 2014, 8:09 a.m. UTC | #9
Sat, Sep 20, 2014 at 05:41:16AM CEST, roopa@cumulusnetworks.com wrote:
>On 9/19/14, 8:49 AM, Jiri Pirko wrote:
>>Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>>>On 09/19/14 09:49, Jiri Pirko wrote:
>>>>This patch exposes switchdev API using generic Netlink.
>>>>Example userspace utility is here:
>>>>https://github.com/jpirko/switchdev
>>>>
>>>Is this just a temporary test tool? Otherwise i dont see reason
>>>for its existence (or the API that it feeds on).
>>Please read the conversation I had with Pravin and Jesse in v1 thread.
>>Long story short they like to have the api separated from ovs datapath
>>so ovs daemon can use it to directly communicate with driver. Also John
>>Fastabend requested a way to work with driver flows without using ovs ->
>>that was the original reason I created switchdev genl api.
>>
>>Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>will use directly switchdev genl api.
>>
>>I hope I cleared this out.
>We already have all the needed rtnetlink kernel api and userspace tools
>around it to support all
>switching asic features. ie, the rtnetlink api is the switchdev api. We can
>do l2, l3, acl's with it.
>Its unclear to me why we need another new netlink api. Which will mean none
>of the existing tools to
>create bridges etc will work on a switchdev.

No one is proposing such an API. Note that what I'm trying to solve in
my patchset is the FLOW world. There is only one API there, the ovs
genl one, but using it for hw offload purposes was nacked by the ovs
maintainer. Plus, a couple of people wanted to run the offloading
independently of the ovs instance. Therefore I introduced the switchdev
genl API, which takes care of that. There is no plan to extend it to
the other things you mentioned, just flows.


>Which seems like going in the direction exactly opposite to what we had
>discussed earlier.

Nope. The previous discussion ignored flows.


>
>If a non-ovs flow interface is needed from userspace, we can extend the
>existing interface to include flows.

How? You mean to extend rtnetlink? What advantage would that bring
compared to a separate genl iface?


>I don't understand why we should replace the existing rtnetlink switchdev api
>to accommodate flows.

Sorry, I do not understand what "existing rtnetlink switchdev api" you
have in mind. Would you care to explain?


>
>Thanks,
>Roopa
>
sfeldma@cumulusnetworks.com Sept. 20, 2014, 8:10 a.m. UTC | #10
On Sep 19, 2014, at 8:41 PM, Roopa Prabhu <roopa@cumulusnetworks.com> wrote:

> On 9/19/14, 8:49 AM, Jiri Pirko wrote:
>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>>> On 09/19/14 09:49, Jiri Pirko wrote:
>>>> This patch exposes switchdev API using generic Netlink.
>>>> Example userspace utility is here:
>>>> https://github.com/jpirko/switchdev
>>>> 
>>> Is this just a temporary test tool? Otherwise i dont see reason
>>> for its existence (or the API that it feeds on).
>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>> Long story short they like to have the api separated from ovs datapath
>> so ovs daemon can use it to directly communicate with driver. Also John
>> Fastabend requested a way to work with driver flows without using ovs ->
>> that was the original reason I created switchdev genl api.
>> 
>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>> will use directly switchdev genl api.
>> 
>> I hope I cleared this out.
> We already have all the needed rtnetlink kernel api and userspace tools around it to support all
> switching asic features. ie, the rtnetlink api is the switchdev api. We can do l2, l3, acl's with it.
> Its unclear to me why we need another new netlink api. Which will mean none of the existing tools to
> create bridges etc will work on a switchdev.
> Which seems like going in the direction exactly opposite to what we had discussed earlier.

Existing rtnetlink isn't available to a swdev without some kind of snooping of the netlink echoes from the various kernel components (bridge, fib, etc.).  With swdev_flow, as Jiri has defined it, there is an additional conversion needed to bridge the gap (bad expression, I know) between rtnetlink and swdev_flow.  This conversion happens in the kernel components.  For example, the bridge module, still driven from userspace by existing rtnetlink, will formulate the necessary swdev_flow insert/remove calls to the swdev driver so that the HW will offload the forwarding path.

You have:
    user -> rtnetlink -> kernel -> netlink echo -> [some process] -> [some driver] -> HW

Jiri has:
    user -> rtnetlink -> kernel -> swdev_* -> swdev driver -> HW
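The second path can be modeled in a few lines of C. This is purely an illustrative userspace sketch of the idea, not the patchset's code; `swdev_ops`, `flow_insert`, and `bridge_fdb_add` are made-up names standing in for the real ndo-based calls:

```c
#include <stdint.h>
#include <string.h>

/* Illustrative model of the "kernel -> swdev_* -> swdev driver -> HW"
 * leg: the bridge module, driven by existing rtnetlink, converts an
 * FDB entry into a flow insert call on the swdev driver. */

struct swdev_flow {
	uint8_t dst_mac[6];
	int out_port;
};

struct swdev_ops {
	int (*flow_insert)(const struct swdev_flow *flow);
};

/* Stand-in for HW: the "driver" records the flows it was asked to
 * offload. */
static struct swdev_flow hw_table[16];
static int hw_entries;

static int fake_driver_flow_insert(const struct swdev_flow *flow)
{
	if (hw_entries >= 16)
		return -1;
	hw_table[hw_entries++] = *flow;
	return 0;
}

static const struct swdev_ops fake_driver = { fake_driver_flow_insert };

/* What the bridge module would do on an FDB add: no netlink echo and
 * no userspace snooping; the rtnetlink-to-flow conversion happens in
 * the kernel component itself. */
static int bridge_fdb_add(const struct swdev_ops *ops,
			  const uint8_t mac[6], int port)
{
	struct swdev_flow flow;

	memcpy(flow.dst_mac, mac, 6);
	flow.out_port = port;
	return ops->flow_insert(&flow);
}
```

The design point is that userspace keeps talking plain rtnetlink, and only the kernel component knows how to express its state as swdev flows.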

> If a non-ovs flow interface is needed from userspace, we can extend the existing interface to include flows.
> I don't understand why we should replace the existing rtnetlink switchdev api to accommodate flows.
> 
> Thanks,
> Roopa
> 


-scott



Jiri Pirko Sept. 20, 2014, 8:14 a.m. UTC | #11
Sat, Sep 20, 2014 at 12:12:12AM CEST, john.r.fastabend@intel.com wrote:
>On 09/19/2014 10:57 AM, Jamal Hadi Salim wrote:
>> On 09/19/14 11:49, Jiri Pirko wrote:
>>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>> 
>>>> Is this just a temporary test tool? Otherwise i dont see reason
>>>> for its existence (or the API that it feeds on).
>>>
>>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>>> Long story short they like to have the api separated from ovs datapath
>>> so ovs daemon can use it to directly communicate with driver. Also John
>>> Fastabend requested a way to work with driver flows without using ovs ->
>>> that was the original reason I created switchdev genl api.
>>>
>>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>> will use directly switchdev genl api.
>>>
>>> I hope I cleared this out.
>>>
>> 
>> It is - thanks Jiri.
>> 
>> cheers,
>> jamal
>
>Hi Jiri,
>
>I was considering a slightly different approach where the
>device would report via netlink the fields/actions it
>supported rather than creating pre-defined enums for every
>possible key.
>
>I already need to have an API to report fields/matches
>that are being supported why not have the device report
>the headers as header fields (len, offset) and the
>associated parse graph the hardware uses? Vendors should
>have this already to describe/design their real hardware.

Hmm, let me think about this a bit more. I will try to figure out how to
handle that. Sounds logical, though. I will try to incorporate the idea
into the patchset.


>
>As always its better to have code and when I get some
>time I'll try to write it up. Maybe its just a separate
>classifier although I don't actually want two hardware
>flow APIs.

Understood.

>
>I see you dropped the RFC tag are you proposing we include
>this now?

v11 is my bet :)

>
>.John
Jiri Pirko Sept. 20, 2014, 8:17 a.m. UTC | #12
Sat, Sep 20, 2014 at 12:18:02AM CEST, jhs@mojatatu.com wrote:
>On 09/19/14 18:12, John Fastabend wrote:
>>On 09/19/2014 10:57 AM, Jamal Hadi Salim wrote:
>>>On 09/19/14 11:49, Jiri Pirko wrote:
>>>>Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>>>
>>>>>Is this just a temporary test tool? Otherwise i dont see reason
>>>>>for its existence (or the API that it feeds on).
>>>>
>>>>Please read the conversation I had with Pravin and Jesse in v1 thread.
>>>>Long story short they like to have the api separated from ovs datapath
>>>>so ovs daemon can use it to directly communicate with driver. Also John
>>>>Fastabend requested a way to work with driver flows without using ovs ->
>>>>that was the original reason I created switchdev genl api.
>>>>
>>>>Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>>>will use directly switchdev genl api.
>>>>
>>>>I hope I cleared this out.
>>>>
>>>
>>>It is - thanks Jiri.
>>>
>>>cheers,
>>>jamal
>>
>>Hi Jiri,
>>
>>I was considering a slightly different approach where the
>>device would report via netlink the fields/actions it
>>supported rather than creating pre-defined enums for every
>>possible key.
>>
>>I already need to have an API to report fields/matches
>>that are being supported why not have the device report
>>the headers as header fields (len, offset) and the
>>associated parse graph the hardware uses? Vendors should
>>have this already to describe/design their real hardware.
>>
>>As always its better to have code and when I get some
>>time I'll try to write it up. Maybe its just a separate
>>classifier although I don't actually want two hardware
>>flow APIs.
>>
>>I see you dropped the RFC tag are you proposing we include
>>this now?
>>
>
>
>Actually I just realized i missed something very basic that
>Jiri said. I think i understand the tool being there for testing
>but i am assumed the same about the genlink api.
>Jiri, are you saying that genlink api is there to
>stay?

Yes, that is what I'm saying. It is needed for flow manipulation,
because such an API does not exist. As I stated earlier, I do not want
to use switchdev genl for anything other than flow manipulation.

>
>cheers,
>jamal
Jiri Pirko Sept. 20, 2014, 8:25 a.m. UTC | #13
Sat, Sep 20, 2014 at 07:39:51AM CEST, f.fainelli@gmail.com wrote:
>On 09/19/14 15:18, Jamal Hadi Salim wrote:
>>On 09/19/14 18:12, John Fastabend wrote:
>>>On 09/19/2014 10:57 AM, Jamal Hadi Salim wrote:
>>>>On 09/19/14 11:49, Jiri Pirko wrote:
>>>>>Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>>>>
>>>>>>Is this just a temporary test tool? Otherwise i dont see reason
>>>>>>for its existence (or the API that it feeds on).
>>>>>
>>>>>Please read the conversation I had with Pravin and Jesse in v1 thread.
>>>>>Long story short they like to have the api separated from ovs datapath
>>>>>so ovs daemon can use it to directly communicate with driver. Also John
>>>>>Fastabend requested a way to work with driver flows without using
>>>>>ovs ->
>>>>>that was the original reason I created switchdev genl api.
>>>>>
>>>>>Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>>>>will use directly switchdev genl api.
>>>>>
>>>>>I hope I cleared this out.
>>>>>
>>>>
>>>>It is - thanks Jiri.
>>>>
>>>>cheers,
>>>>jamal
>>>
>>>Hi Jiri,
>>>
>>>I was considering a slightly different approach where the
>>>device would report via netlink the fields/actions it
>>>supported rather than creating pre-defined enums for every
>>>possible key.
>>>
>>>I already need to have an API to report fields/matches
>>>that are being supported why not have the device report
>>>the headers as header fields (len, offset) and the
>>>associated parse graph the hardware uses? Vendors should
>>>have this already to describe/design their real hardware.
>>>
>>>As always its better to have code and when I get some
>>>time I'll try to write it up. Maybe its just a separate
>>>classifier although I don't actually want two hardware
>>>flow APIs.
>>>
>>>I see you dropped the RFC tag are you proposing we include
>>>this now?
>>>
>>
>>
>>Actually I just realized i missed something very basic that
>>Jiri said. I think i understand the tool being there for testing
>>but i am assumed the same about the genlink api.
>>Jiri, are you saying that genlink api is there to
>>stay?
>
>So, I really have mixed feelings about this netlink API, in particular
>because it is not clear to me where is the line between what should be a
>network device ndo operation, what should be an ethtool command, what should
>be a netlink message, and the rest.

Well, as I said, this API should serve flow manipulation only;
therefore the swdev flow-related ndos are used.


>
>I can certainly acknowledge the fact that manipulating flows is not ideal
>with the current set of tools, but really once we are there with netlink, how
>far are we from not having any network devices at all, and how does that
>differ from OpenWrt's swconfig in the end [1]?


I'm all ears for proposals on how to make flow manipulation better.


>
>[1]: https://lwn.net/Articles/571390/
>--
>Florian
Jamal Hadi Salim Sept. 20, 2014, 10:19 a.m. UTC | #14
On 09/20/14 04:17, Jiri Pirko wrote:

> Yes, that I say. It is needed for flow manipulation, because such api does
> not exist.

Come on, Jiri!
The ovs guys are against this, and now no *api exists*?
Write a 15-tuple tc classifier and use it. I will be more
than happy to help you. I will get to it when we have basic L2 working
on real devices.

>As I stated earlier, I do not want to use switchdev genl for
> anything other than flow manipulation.


Totally unacceptable in my books. If the OVS guys want some way out
to be able to ride on some vendor SDKs, then that is their problem.
We shouldn't allow such loopholes. This is why/how TOE never made it
into the kernel.

cheers,
jamal
Jamal Hadi Salim Sept. 20, 2014, 10:31 a.m. UTC | #15
On 09/20/14 04:10, Scott Feldman wrote:
>
> On Sep 19, 2014, at 8:41 PM, Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
>

>
> Existing rtnetlink isn’t available to swdev without some kind of snooping the echoes from the various kernel components
>(bridge, fib, etc).

You have made this claim before, Scott, and I am still not following.
Why do we need to echo things to get FDB or FIB to work? Device ops for
FDB offload, for example, already exist. I think they need to be
revamped, but consensus on that can reasonably be reached. Why do we
need this flow API for such activities?

cheers,
jamal

Thomas Graf Sept. 20, 2014, 10:53 a.m. UTC | #16
On 09/20/14 at 10:14am, Jiri Pirko wrote:
> Sat, Sep 20, 2014 at 12:12:12AM CEST, john.r.fastabend@intel.com wrote:
> >I was considering a slightly different approach where the
> >device would report via netlink the fields/actions it
> >supported rather than creating pre-defined enums for every
> >possible key.
> >
> >I already need to have an API to report fields/matches
> >that are being supported why not have the device report
> >the headers as header fields (len, offset) and the
> >associated parse graph the hardware uses? Vendors should
> >have this already to describe/design their real hardware.
> 
> Hmm, let me think about this a bit more. I will try to figure out how to
> handle that. Sound logic though. Will try to incorporate the idea in the
> patchset.

I think this is the right track.

I agree with Jamal that there is no need for a new permanent and
separate Netlink interface for this. I think this would best be
described as a structure of nested Netlink attributes in the form John
proposes, which is then embedded into existing Netlink interfaces such
as rtnetlink and the OVS genl.

OVS can register new genl ops to check capabilities and insert hardware
flows, which allows implementing the offload decision in user space and
allows for arbitrary combinations of hardware and software flows. It
also allows running an eBPF software data path in combination with a
hardware flow setup.

rtnetlink can embed the nested attribute structure into existing APIs
to allow feature capability detection from user space, statistics
reporting, and optional direct hardware offload if a transparent
offload is not feasible. Would that work for you, John?

I think we should focus on getting the layering right and make it
generic enough that it can evolve naturally.
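The embeddable nested-attribute idea can be sketched with a minimal userspace model of Netlink TLV packing. This is illustrative only: the attribute type numbers and helper names are made up, and real kernel code would use the standard nla_put()/nla_nest_start()/nla_nest_end() helpers instead:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Minimal model of Netlink attributes: 4-byte TLV headers, payloads
 * padded to 4-byte boundaries, nests expressed as attributes whose
 * payload is itself a list of attributes. */

#define MY_NLA_ALIGN(len) (((len) + 3U) & ~3U)

struct my_nlattr {
	uint16_t nla_len;	/* header + payload, unaligned */
	uint16_t nla_type;
};

/* Append one attribute at *off; advances *off past the padding. */
static void attr_put(uint8_t *buf, size_t *off, uint16_t type,
		     const void *data, uint16_t len)
{
	struct my_nlattr *nla = (struct my_nlattr *)(buf + *off);

	nla->nla_len = sizeof(*nla) + len;
	nla->nla_type = type;
	memcpy(nla + 1, data, len);
	*off += MY_NLA_ALIGN(nla->nla_len);
}

/* Start a nested attribute; return its offset so the caller can fix
 * up the length once the nest is complete. */
static size_t nest_start(uint8_t *buf, size_t *off, uint16_t type)
{
	size_t start = *off;
	struct my_nlattr *nla = (struct my_nlattr *)(buf + start);

	nla->nla_type = type;
	*off += sizeof(*nla);
	return start;
}

static void nest_end(uint8_t *buf, size_t *off, size_t start)
{
	struct my_nlattr *nla = (struct my_nlattr *)(buf + start);

	nla->nla_len = *off - start;
}

/* Find a top-level attribute by type; the same walk works inside a
 * nest, which is what makes the structure embeddable anywhere. */
static struct my_nlattr *attr_find(uint8_t *buf, size_t len, uint16_t type)
{
	size_t off = 0;

	while (off + sizeof(struct my_nlattr) <= len) {
		struct my_nlattr *nla = (struct my_nlattr *)(buf + off);

		if (nla->nla_len < sizeof(*nla))
			return NULL;	/* malformed */
		if (nla->nla_type == type)
			return nla;
		off += MY_NLA_ALIGN(nla->nla_len);
	}
	return NULL;
}
```

Because the flow description is just such a nest, the identical attribute blob could be carried by rtnetlink, by the OVS genl, or by a tc classifier without redefining it per interface.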
Thomas Graf Sept. 20, 2014, 11:01 a.m. UTC | #17
On 09/20/14 at 06:19am, Jamal Hadi Salim wrote:
> The ovs guys are against this and now no *api exists*?
> Write a 15 tuple classifier tc classifier and use it. I will be more
> than happy to help you. I will get to it when we have basics L2 working
> on real devices.

Nothing speaks against having such a tc classifier. In fact, having the
interface consist of only an embedded Netlink attribute structure would
allow for such a classifier in a very straightforward way.

That doesn't mean everybody should be forced to use the stateful
tc interface.

> Totally unacceptable in my books. If the OVS guys want some way out
> to be able to ride on some vendor sdks then that is their problem.
> We shouldnt allow for such loopholes. This is why/how TOE never made it
> in the kernel.

No need for false accusations here. Nobody ever mentioned vendor SDKs.

The statement was that the requirement of deriving hardware flows from
software flows *in the kernel* is not flexible enough for the future
for reasons such as:

1) The OVS software data path might be based on eBPF in the future and
   it is unclear how we could derive hardware flows from that
   transparently.

2) Dependence on hardware capabilities. Hardware flows might need to be
   assisted by software flow counterparts, and it is believed that it
   is the wrong approach to push all the necessary context for the
   decision down into the kernel. This can be argued about and I don't
   feel strongly either way.
Jamal Hadi Salim Sept. 20, 2014, 11:32 a.m. UTC | #18
On 09/20/14 07:01, Thomas Graf wrote:

> Nothing speaks against having such a tc classifier. In fact, having
> the interface consist of only an embedded Netlink attribute structure
> would allow for such a classifier in a very straight forward way.
>
> That doesn't mean everybody should be forced to use the stateful
> tc interface.
>


Agreed. The response was to Jiri's strange statement that now that
he can't use OVS, there is no such API. I point to tc as very capable of
such usage.

> No need for false accusations here. Nobody ever mentioned vendor SDKs.
>

I am sorry to have tied the two together. Maybe not OVS, but the approach
described is heaven for vendor SDKs.

> The statement was that the requirement of deriving hardware flows from
> software flows *in the kernel* is not flexible enough for the future
> for reasons such as:
>
> 1) The OVS software data path might be based on eBPF in the future and
>     it is unclear how we could derive hardware flows from that
>     transparently.
>

Who says you can't put BPF in hardware?
And why is OVS defining how BPF should evolve or how it should be used?

> 2) Depending on hardware capabilities. Hardware flows might need to be
>     assisted by software flow counterparts and it is believed that it
>     is the wrong approach to push all the necessary context for the
>     decision down into the kernel. This can be argued about and I don't
>     feel strongly either way.
>

Pointing to the current FDB offload: you can choose to bypass
and not use software.

cheers,
jamal
Thomas Graf Sept. 20, 2014, 11:51 a.m. UTC | #19
On 09/20/14 at 07:32am, Jamal Hadi Salim wrote:
> I am sorry to have tied the two together. Maybe not OVS but the approach
> described is heaven for vendor SDKs.

I fail to see the connection. You can use a switch vendor SDK no matter
how we define the kernel APIs. They already exist and have been
designed in a way to be completely independent of the kernel.

Are you referring to vendor-specific decisions in user space in
general? I believe that the whole point of swdev is to provide *that*
level of abstraction so decisions can be made in a vendor-neutral way.

> >The statement was that the requirement of deriving hardware flows from
> >software flows *in the kernel* is not flexible enough for the future
> >for reasons such as:
> >
> >1) The OVS software data path might be based on eBPF in the future and
> >    it is unclear how we could derive hardware flows from that
> >    transparently.
> >
> 
> Who says you cant put BPF in hardware?

I don't think anybody is saying that. P4 is likely a reality soon. But
we definitely want hardware offload in a BPF world even if the hardware
can't do BPF yet.

> And why is OVS defining how BPF should evolve or how it should be used?

Not sure I understand. OVS would be a user of eBPF just like tracing,
xt_BPF, socket filter, ...

> >2) Depending on hardware capabilities. Hardware flows might need to be
> >    assisted by software flow counterparts and it is believed that it
> >    is the wrong approach to push all the necessary context for the
> >    decision down into the kernel. This can be argued about and I don't
> >    feel strongly either way.
> >
> 
> Pointing to the current FDB offload: You can select to bypass
> and not use s/ware.

As I said, this can be argued about. It would require pushing a lot of
context into the kernel, though. The FDB offload is relatively trivial
in comparison to the complexity OVS user space can handle. I can't think
of any reason to complicate the kernel further with OVS-specific
knowledge as long as we can guarantee the vendor abstraction.
Jamal Hadi Salim Sept. 20, 2014, 12:35 p.m. UTC | #20
On 09/20/14 07:51, Thomas Graf wrote:

> I fail to see the connection. You can use switch vendor SDK no matter
> how we define the kernel APIs. They already exist and have been
> designed in a way to be completely indepenedent from the kernel.
>
> Are you referring to vendor specific decisions in user space in
> general? I believe that the whole point of swdev is to provide *that*
> level of abstraction so decisions can be made in a vendor neutral way.
>

I am not against the swdev idea. I think we have disagreements
about how the general classification/action interface should look
- but that is resolvable with correct interfaces.
The vendor-neutral way *already exists* via the current netlink
abstractions that existing tools use. When we need to add new
interfaces, then we should.

> I don't think anybody is saying that. P4 is likely a reality soon. But
> we definitely want hardware offload in a BPF world even if the hardware
> can't do BPF yet.
>


I don't think we have contradictions. We are speaking past each other.
You implied that in the future the OVS software path might be based on
BPF. I implied BPF itself could be offloaded, stands on its own merit,
and should work if we have the correct interface. As an example,
I don't care about P4 or OVS - but I have no problem if they use
the common interfaces provided by Linux. I.e.,
if I want to build a little CPU running the BPF instruction set
and use that as my offload, then that interface should work, and if
it doesn't I should provide extensions.

> Not sure I understand. OVS would be a user of eBPF just like tracing,
> xt_BPF, socket filter, ...
>

Ok, we are on the same page then.

> As I said, this can be argued about. It would require to push a lot of
> context into the kernel though. The FDB offload is relatively trivial
> in comparison to the complexity OVS user space can handle. I can't think
> of any reasons why to complicate the kernel further with OVS specific
> knowledge as long as we can guarantee the vendor abstraction.
>

I disagree. OVS may be complex in that sense (I am sorry, I am making
an assumption based on what you are saying) but I don't think there is
any other kernel subsystem that has this challenge.
Note: I am pointing to the FDB only because it carries the concept of "put
this in hardware and/or software". I agree the FDB may be reasonably
simpler.

cheers,
jamal
Roopa Prabhu Sept. 20, 2014, 12:39 p.m. UTC | #21
On 9/20/14, 1:09 AM, Jiri Pirko wrote:
> Sat, Sep 20, 2014 at 05:41:16AM CEST, roopa@cumulusnetworks.com wrote:
>> On 9/19/14, 8:49 AM, Jiri Pirko wrote:
>>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com wrote:
>>>> On 09/19/14 09:49, Jiri Pirko wrote:
>>>>> This patch exposes switchdev API using generic Netlink.
>>>>> Example userspace utility is here:
>>>>> https://github.com/jpirko/switchdev
>>>>>
>>>> Is this just a temporary test tool? Otherwise i dont see reason
>>>> for its existence (or the API that it feeds on).
>>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>>> Long story short they like to have the api separated from ovs datapath
>>> so ovs daemon can use it to directly communicate with driver. Also John
>>> Fastabend requested a way to work with driver flows without using ovs ->
>>> that was the original reason I created switchdev genl api.
>>>
>>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>> will use directly switchdev genl api.
>>>
>>> I hope I cleared this out.
>> We already have all the needed rtnetlink kernel api and userspace tools
>> around it to support all
>> switching asic features. ie, the rtnetlink api is the switchdev api. We can
>> do l2, l3, acl's with it.
>> Its unclear to me why we need another new netlink api. Which will mean none
>> of the existing tools to
>> create bridges etc will work on a switchdev.
> No one is proposing such API. Note that what I'm trying to solve in my
> patchset is FLOW world. There is only one API there, ovs genl. But the
> usage of that for hw offload purposes was nacked by ovs maintainer. Plus
> couple of people wanted to run the offloading independently on ovs
> instance. Therefore I introduced the switchdev genl, which takes care of
> that. No plan to extend it for other things you mentioned, just flows.
OK, that was not clear to me. Introducing a new genl API and calling it the
switchdev API can result in non-flow creep into it in the future.
>
>
>> Which seems like going in the direction exactly opposite to what we had
>> discussed earlier.
> Nope. The previous discussion ignored flows.
>> If a non-ovs flow interface is needed from userspace, we can extend the
>> existing interface to include flows.
> How? You mean to extend rtnetlink? What advantage it would bring
> comparing to separate genl iface?
Yes. The advantage would be that we don't have yet another parallel switchdev
netlink API.


>> I don't understand why we should replace the existing rtnetlink switchdev api
>> to accommodate flows.
> Sorry, I do not undertand what "existing rtnetlink switchdev api" you
> have on mind. Would you care to explain?

I am talking about the existing rtnetlink API that bridge and ip link use to
talk L2 and L3 to the kernel -
RTM_NEWROUTE etc.

sfeldma@cumulusnetworks.com Sept. 20, 2014, 5:21 p.m. UTC | #22
On Sep 20, 2014, at 5:51 AM, Roopa Prabhu <roopa@cumulusnetworks.com> wrote:

> On 9/20/14, 1:10 AM, Scott Feldman wrote:
>> On Sep 19, 2014, at 8:41 PM, Roopa Prabhu <roopa@cumulusnetworks.com>
>>  wrote:
>> 
>> 
>>> On 9/19/14, 8:49 AM, Jiri Pirko wrote:
>>> 
>>>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com
>>>>  wrote:
>>>> 
>>>>> On 09/19/14 09:49, Jiri Pirko wrote:
>>>>> 
>>>>>> This patch exposes switchdev API using generic Netlink.
>>>>>> Example userspace utility is here:
>>>>>> 
>>>>>> https://github.com/jpirko/switchdev
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> Is this just a temporary test tool? Otherwise i dont see reason
>>>>> for its existence (or the API that it feeds on).
>>>>> 
>>>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>>>> Long story short they like to have the api separated from ovs datapath
>>>> so ovs daemon can use it to directly communicate with driver. Also John
>>>> Fastabend requested a way to work with driver flows without using ovs ->
>>>> that was the original reason I created switchdev genl api.
>>>> 
>>>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>>> will use directly switchdev genl api.
>>>> 
>>>> I hope I cleared this out.
>>>> 
>>> We already have all the needed rtnetlink kernel api and userspace tools around it to support all
>>> switching asic features. ie, the rtnetlink api is the switchdev api. We can do l2, l3, acl's with it.
>>> Its unclear to me why we need another new netlink api. Which will mean none of the existing tools to
>>> create bridges etc will work on a switchdev.
>>> Which seems like going in the direction exactly opposite to what we had discussed earlier.
>>> 
>> Existing rtnetlink isn’t available to swdev without some kind of snooping the echoes from the various kernel components (bridge, fib, etc).  With swdev_flow, as Jiri has defined it, there is an additional conversion needed to bridge the gap (bad expression, I know) between rtnetlink and swdev_flow.  This conversion happens in the kernel components.  For example, the bridge module, still driven from userspace by existing rtnetlink, will formulate the necessary swdev_flow insert/remove calls to the swdev driver such that HW will offload the fwd path.
>> 
>> You have:
>>     user -> rtnetlink -> kernel -> netlink echo -> [some process] -> [some driver] -> HW
>> 
>> 
>> Jiri has:
>>     user -> rtnetlink -> kernel -> swdev_* -> swdev driver -> HW
>> 
>> 
> Keeping the goal to not change or not add a new userspace API in mind,
> I have :
>     user -> rtnetlink -> kernel -> ndo_op -> swdev driver  -> HW
> 

Then you have the same as Jiri, for the traditional L2/L3 case.

> Jiri has:
>     user -> genl (newapi) -> kernel -> swdev_* -> swdev driver -> HW

Jiri’s genl is for userspace apps that aren’t talking rtnetlink, like OVS.  It’s not a substitute for rtnetlink, it’s an alternative.  The complete picture is:

user -> swdev genl -----
                        \
                         \
                          -------> kernel -> ndo_swdev_* -> swdev driver -> HW
                         /
                        /
user -> rtnetlink ------


-scott



Jiri Pirko Sept. 20, 2014, 5:38 p.m. UTC | #23
Sat, Sep 20, 2014 at 07:21:10PM CEST, sfeldma@cumulusnetworks.com wrote:
>
>On Sep 20, 2014, at 5:51 AM, Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
>
>> On 9/20/14, 1:10 AM, Scott Feldman wrote:
>>> On Sep 19, 2014, at 8:41 PM, Roopa Prabhu <roopa@cumulusnetworks.com>
>>>  wrote:
>>> 
>>> 
>>>> On 9/19/14, 8:49 AM, Jiri Pirko wrote:
>>>> 
>>>>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com
>>>>>  wrote:
>>>>> 
>>>>>> On 09/19/14 09:49, Jiri Pirko wrote:
>>>>>> 
>>>>>>> This patch exposes switchdev API using generic Netlink.
>>>>>>> Example userspace utility is here:
>>>>>>> 
>>>>>>> https://github.com/jpirko/switchdev
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> Is this just a temporary test tool? Otherwise i dont see reason
>>>>>> for its existence (or the API that it feeds on).
>>>>>> 
>>>>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>>>>> Long story short they like to have the api separated from ovs datapath
>>>>> so ovs daemon can use it to directly communicate with driver. Also John
>>>>> Fastabend requested a way to work with driver flows without using ovs ->
>>>>> that was the original reason I created switchdev genl api.
>>>>> 
>>>>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>>>> will use directly switchdev genl api.
>>>>> 
>>>>> I hope I cleared this out.
>>>>> 
>>>> We already have all the needed rtnetlink kernel api and userspace tools around it to support all
>>>> switching asic features. ie, the rtnetlink api is the switchdev api. We can do l2, l3, acl's with it.
>>>> Its unclear to me why we need another new netlink api. Which will mean none of the existing tools to
>>>> create bridges etc will work on a switchdev.
>>>> Which seems like going in the direction exactly opposite to what we had discussed earlier.
>>>> 
>>> Existing rtnetlink isn’t available to swdev without some kind of snooping the echoes from the various kernel components (bridge, fib, etc).  With swdev_flow, as Jiri has defined it, there is an additional conversion needed to bridge the gap (bad expression, I know) between rtnetlink and swdev_flow.  This conversion happens in the kernel components.  For example, the bridge module, still driven from userspace by existing rtnetlink, will formulate the necessary swdev_flow insert/remove calls to the swdev driver such that HW will offload the fwd path.
>>> 
>>> You have:
>>>     user -> rtnetlink -> kernel -> netlink echo -> [some process] -> [some driver] -> HW
>>> 
>>> 
>>> Jiri has:
>>>     user -> rtnetlink -> kernel -> swdev_* -> swdev driver -> HW
>>> 
>>> 
>> Keeping the goal to not change or not add a new userspace API in mind,
>> I have :
>>     user -> rtnetlink -> kernel -> ndo_op -> swdev driver  -> HW
>> 
>
>Then you have the same as Jiri, for the traditional L2/L3 case.
>
>> Jiri has:
>>     user -> genl (newapi) -> kernel -> swdev_* -> swdev driver -> HW
>
>Jiri’s genl is for userspace apps that are talking rtnetlink, like OVS.  It’s not a substitute for rtnetlink, it’s an alternative.  The complete picture is:

Not an alternative, an addition.

>
>user -> swdev genl -----
>                        \
>                         \
>                          -------> kernel -> ndo_swdev_* -> swdev driver -> HW
>                         /
>                        /
>user -> rtnetlink ------

It is true that, as Thomas pointed out, we can probably nest this into
rtnl_link messages. That might work.
Alexei Starovoitov Sept. 20, 2014, 10:50 p.m. UTC | #24
On Sat, Sep 20, 2014 at 3:53 AM, Thomas Graf <tgraf@suug.ch> wrote:
> On 09/20/14 at 10:14am, Jiri Pirko wrote:
>> Sat, Sep 20, 2014 at 12:12:12AM CEST, john.r.fastabend@intel.com wrote:
>> >I was considering a slightly different approach where the
>> >device would report via netlink the fields/actions it
>> >supported rather than creating pre-defined enums for every
>> >possible key.
>> >
>> >I already need to have an API to report fields/matches
>> >that are being supported why not have the device report
>> >the headers as header fields (len, offset) and the
>> >associated parse graph the hardware uses? Vendors should
>> >have this already to describe/design their real hardware.
>>
>> Hmm, let me think about this a bit more. I will try to figure out how to
>> handle that. Sound logic though. Will try to incorporate the idea in the
>> patchset.
>
> I think this is the right track.

I agree with John and Thomas here.
I think HW should not be limited by SW abstractions, whether
those abstractions are called flows, n-tuples, bridges or something else.
Really looking forward to seeing "device reporting the headers as
header fields (len, offset) and the associated parse graph"
as the first step.

Another topic that this discussion didn't cover yet is how this
all connects to tunnels and what 'tunnel offloading' is.
IMO, flow offloading by itself serves only academic interest.
Roopa Prabhu Sept. 21, 2014, 1:30 a.m. UTC | #25
On 9/20/14, 10:38 AM, Jiri Pirko wrote:
> Sat, Sep 20, 2014 at 07:21:10PM CEST, sfeldma@cumulusnetworks.com wrote:
>> On Sep 20, 2014, at 5:51 AM, Roopa Prabhu <roopa@cumulusnetworks.com> wrote:
>>
>>> On 9/20/14, 1:10 AM, Scott Feldman wrote:
>>>> On Sep 19, 2014, at 8:41 PM, Roopa Prabhu <roopa@cumulusnetworks.com>
>>>>   wrote:
>>>>
>>>>
>>>>> On 9/19/14, 8:49 AM, Jiri Pirko wrote:
>>>>>
>>>>>> Fri, Sep 19, 2014 at 05:25:48PM CEST, jhs@mojatatu.com
>>>>>>   wrote:
>>>>>>
>>>>>>> On 09/19/14 09:49, Jiri Pirko wrote:
>>>>>>>
>>>>>>>> This patch exposes switchdev API using generic Netlink.
>>>>>>>> Example userspace utility is here:
>>>>>>>>
>>>>>>>> https://github.com/jpirko/switchdev
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> Is this just a temporary test tool? Otherwise i dont see reason
>>>>>>> for its existence (or the API that it feeds on).
>>>>>>>
>>>>>> Please read the conversation I had with Pravin and Jesse in v1 thread.
>>>>>> Long story short they like to have the api separated from ovs datapath
>>>>>> so ovs daemon can use it to directly communicate with driver. Also John
>>>>>> Fastabend requested a way to work with driver flows without using ovs ->
>>>>>> that was the original reason I created switchdev genl api.
>>>>>>
>>>>>> Regarding the "sw" tool, yes it is for testing purposes now. ovs daemon
>>>>>> will use directly switchdev genl api.
>>>>>>
>>>>>> I hope I cleared this out.
>>>>>>
>>>>> We already have all the needed rtnetlink kernel api and userspace tools around it to support all
>>>>> switching asic features. ie, the rtnetlink api is the switchdev api. We can do l2, l3, acl's with it.
>>>>> Its unclear to me why we need another new netlink api. Which will mean none of the existing tools to
>>>>> create bridges etc will work on a switchdev.
>>>>> Which seems like going in the direction exactly opposite to what we had discussed earlier.
>>>>>
>>>> Existing rtnetlink isn’t available to swdev without some kind of snooping the echoes from the various kernel components (bridge, fib, etc).  With swdev_flow, as Jiri has defined it, there is an additional conversion needed to bridge the gap (bad expression, I know) between rtnetlink and swdev_flow.  This conversion happens in the kernel components.  For example, the bridge module, still driven from userspace by existing rtnetlink, will formulate the necessary swdev_flow insert/remove calls to the swdev driver such that HW will offload the fwd path.
>>>>
>>>> You have:
>>>>      user -> rtnetlink -> kernel -> netlink echo -> [some process] -> [some driver] -> HW
>>>>
>>>>
>>>> Jiri has:
>>>>      user -> rtnetlink -> kernel -> swdev_* -> swdev driver -> HW
>>>>
>>>>
>>> Keeping the goal to not change or not add a new userspace API in mind,
>>> I have :
>>>      user -> rtnetlink -> kernel -> ndo_op -> swdev driver  -> HW
>>>
>> Then you have the same as Jiri, for the traditional L2/L3 case.
>>
>>> Jiri has:
>>>      user -> genl (newapi) -> kernel -> swdev_* -> swdev driver -> HW
>> Jiri’s genl is for userspace apps that are talking rtnetlink, like OVS.  It’s not a substitute for rtnetlink, it’s an alternative.  The complete picture is:
> Not an alternative, an addition.
>
>> user -> swdev genl -----
>>                         \
>>                          \
>>                           -------> kernel -> ndo_swdev_* -> swdev driver -> HW
>>                          /
>>                         /
>> user -> rtnetlink ------
> True is that, as Thomas pointed out, we can probably nest this into
> rtnl_link messages. That might work.
That's the thing I was hinting at as well. You can extend the existing
infrastructure/API instead of adding a new one.
Jiri Pirko Sept. 22, 2014, 7:53 a.m. UTC | #26
Sat, Sep 20, 2014 at 01:32:30PM CEST, jhs@mojatatu.com wrote:
>On 09/20/14 07:01, Thomas Graf wrote:
>
>>Nothing speaks against having such a tc classifier. In fact, having
>>the interface consist of only an embedded Netlink attribute structure
>>would allow for such a classifier in a very straight forward way.
>>
>>That doesn't mean everybody should be forced to use the stateful
>>tc interface.
>>
>
>
>Agreed. The response was to Jiri's strange statement that now that
>he cant use OVS, there is no such api. I point to tc as very capable of
>such usage.

Jamal, would you please give us some examples of how to use tc to work
with flows? I have a feeling that you see something other people do not.
Let's get on the same page now.
Thanks.


>
>>No need for false accusations here. Nobody ever mentioned vendor SDKs.
>>
>
>I am sorry to have tied the two together. Maybe not OVS but the approach
>described is heaven for vendor SDKs.
>
>>The statement was that the requirement of deriving hardware flows from
>>software flows *in the kernel* is not flexible enough for the future
>>for reasons such as:
>>
>>1) The OVS software data path might be based on eBPF in the future and
>>    it is unclear how we could derive hardware flows from that
>>    transparently.
>>
>
>Who says you cant put BPF in hardware?
>And why is OVS defining how BPF should evolve or how it should be used?
>
>>2) Depending on hardware capabilities. Hardware flows might need to be
>>    assisted by software flow counterparts and it is believed that it
>>    is the wrong approach to push all the necessary context for the
>>    decision down into the kernel. This can be argued about and I don't
>>    feel strongly either way.
>>
>
>Pointing to the current FDB offload: You can select to bypass
>and not use s/ware.
>
>cheers,
>jamal
Thomas Graf Sept. 22, 2014, 8:13 a.m. UTC | #27
On 09/20/14 at 03:50pm, Alexei Starovoitov wrote:
> I think HW should not be limited by SW abstractions whether
> these abstractions are called flows, n-tuples, bridge or else.
> Really looking forward to see "device reporting the headers as
> header fields (len, offset) and the associated parse graph"
> as the first step.
> 
> Another topic that this discussion didn't cover yet is how this
> all connects to tunnels and what is 'tunnel offloading'.
> imo flow offloading by itself serves only academic interest.

We haven't touched encryption yet either ;-)

Certainly true for the host case. The Linux on ToR case is less
dependent on this, and L2/L3 offload without encap already has value.

I'm with you, though: all of this has little value on the host in
the DC if stateful encap offload is not incorporated. I expect the
HW to provide filters on the outer header plus metadata in the
encap. Actually, this was a follow-up question I had for John, as
this is not easily describable with offset/len filters. How would
we represent such capabilities?

The TX side of this was one of the reasons why I initially thought
it would be beneficial to implement a cache-like offload, as we could
serve an initial encap in SW, do the FIB lookup, and offload it
transparently to avoid replicating the FIB in user space.

What seems most feasible to me right now is to separate the offload
of the encap action from the IP -> dev mapping decision. The eSwitch
would send the first encap for an unknown destination IP to the CPU due
to a miss in the IP mapping table; the CPU would do the FIB lookup,
update the table and send it back.
What do you have in mind?
Jamal Hadi Salim Sept. 22, 2014, 11:48 a.m. UTC | #28
On 09/22/14 03:53, Jiri Pirko wrote:

> Jamal, would you please give us some examples on how to use tc to work
> with flows? I have a feeling that you see something other people does not.

I will be a little verbose so as to avoid assuming prior knowledge.

Let's talk about the tc classifier/action subsystem, because that is what
would take advantage of flows. We could also talk about qdiscs, i.e.
schedulers and queue objects, because the two are often related
(the default classification action is "classid", which typically
maps to a queue class).

The tc classification/action subsystem allows you to specify arbitrary
classifiers and actions.
You can then specify (using a precise BNF grammar) how filters and
actions are to be related.
Look at iproute2/f_*.c to see the currently defined ones.

Each classifier has a name/id and attributes/options specific to
itself. Classifiers don't necessarily have to filter on packet
headers; they could filter on metadata, for example.
Each classifier running in software may be offloaded. I think that
simple model would allow usable tools.
The classifier you have currently defined in your patches could
be realized via the u32 classifier, but I think that would
require knowledge of u32. So for usability reasons I would
suggest writing a brand new classifier. For lack of a better
name, let's call it the "multi-tuple classifier".
I would expect this classifier to be usable in software tc as
well, without necessarily being offloaded.

There are two important details to note:
1) Many different types of classifiers exist. This would very
likely depend on the hardware implementation. It is academic bullshit
(i.e. not pragmatic) to claim all hardware offload can use the
same classification language. As I was telling Thomas,
I don't see why one wouldn't offload the defined BPF classifier.
From an API level, this means your ->flow_add/del/get would have
to support the ability to define different classifiers.

2) Each classifier will have different semantics.
From a device API level, this means you have to allow the different
classifiers to pass attributes specific to them. This means
each classifier may override the ops(). I am indifferent to how
it is achieved. So while you could pass one big structure
such as your flow struct, one should be able to express u32-style
semantics.

We also need to discover which device supports which classifiers
and what constraints exist in the hardware implementation
(we can talk about that, because it is important). For example,
if one supports u32, how many u32 rules can be offloaded, etc.

As to how it is to be implemented:
I like the semantics of the current bridge code. I have always
wondered why we didn't use that scheme for offloading qdiscs.
Each device supporting FDB offload has an ->fdb_add/del/get
(don't quote me on the naming). User space describes what
it wants. If something is to be offloaded, we already know the
netdev the user is pointing to. We invoke the appropriate
->flow() calls with appropriately cooked structures.
I am not sure I like that we pass the netlink structure, as Scott
often seems to suggest; I think that passing the internal
structure we would install in software may be the better approach
since:
a) we would need to parse the data anyway for validation etc.;
b) each hardware offload will likely need to translate further into
an internal format;
c) we have a well-defined mapping between user and offload, and the
generic structure will be very close to hardware.
Note: that is what the FDB offload does.

Note: I described this using tc, but I don't see why nftables
couldn't follow the same approach. My angle is that we shouldn't
impede other users by over-focusing on OVS and whatever
other things surround it.
cheers,
jamal
Tom Herbert Sept. 22, 2014, 3:10 p.m. UTC | #29
On Mon, Sep 22, 2014 at 1:13 AM, Thomas Graf <tgraf@suug.ch> wrote:
> On 09/20/14 at 03:50pm, Alexei Starovoitov wrote:
>> I think HW should not be limited by SW abstractions whether
>> these abstractions are called flows, n-tuples, bridge or else.
>> Really looking forward to see "device reporting the headers as
>> header fields (len, offset) and the associated parse graph"
>> as the first step.
>>
>> Another topic that this discussion didn't cover yet is how this
>> all connects to tunnels and what is 'tunnel offloading'.
>> imo flow offloading by itself serves only academic interest.
>
> We haven't touched encryption yet either ;-)
>
> Certainly true for the host case. The Linux on TOR case is less
> dependant on this and L2/L3 offload w/o encap already has value.
>
Thomas, can you (or someone else) quantify what the host case is? I
suppose there may be merit in using a switch on a NIC for kernel-bypass
scenarios, but I'm still having a hard time understanding how this
could be integrated into the host stack with benefits that outweigh
the complexity. The history of stateful offloads in NICs is not great, and
encapsulation (stuffing a few bytes of header into a packet) is in
itself not nearly an expensive enough operation to warrant offloading
to the NIC. Personally, if NIC vendors are going to focus on
stateful offload, I would rather see it be for encryption, which I believe
currently does warrant offload at 40G and higher speeds.

Tom

> I'm with you though, all of this has little value on the host in
> the DC if stateful encap offload is not incorporated. I expect the
> HW to provide filters on the outer header plus metadata in the
> encap. Actually, this was a follow-up question I had for John as
> this is not easily describable with offset/len filters. How would
> we represent such capabilities?
>
> The TX side of this was one of the reasons why I initially thought
> it would be beneficial to implement a cache like offload as we could
> serve an initial encap in SW, do the FIB lookup and offload it
> transparently to avoid replicating the FIB in user space.
>
> What seems most feasible to me right now is to separate the offload
> of the encap action from the IP -> dev mapping decision. The eSwitch
> would send the first encap for an unknown dest IP to the CPU due
> to a miss in the IP mapping table, the CPU would do the FIB lookup,
> update the table and send it back.
>
> What do you have in mind?
Thomas Graf Sept. 22, 2014, 10:17 p.m. UTC | #30
On 09/22/14 at 08:10am, Tom Herbert wrote:
> Thomas, can you (or someone else) quantify what the host case is. I
> suppose there may be merit in using a switch on NIC for kernel bypass
> scenarios, but I'm still having a hard time understanding how this
> could be integrated into the host stack with benefits that outweigh

Personally, my primary interest is in lxc and vm based workloads w/
end-to-end encryption, encap, distributed L3 and NAT, and policy
enforcement including service graphs, which imply both east-west
and north-south traffic patterns on a host. The usual, I guess ;-)

> complexity. The history of stateful offloads in NICs is not great, and
> encapsulation (stuffing a few bytes of header into a packet) is in
> itself not nearly an expensive enough operation to warrant offloading

No argument here. The direct benchmark comparisons I've measured showed
only around 2% improvement.

What makes stateful offload interesting to me is that the final
destination of a packet is known at RX and can be redirected to a
queue or VF. This makes it possible to build packet batches on shared
pages while preserving the security model.

Will the gains outweigh complexity? I hope so but I don't know for
sure. If you have insights, let me know. What I know for sure is that
I don't want to rely on a kernel bypass for the above.

> to the NIC. Personally, I wish if NIC vendors are going to focus on
> stateful offload I rather see it be for encryption which I believe
> currently does warrant offload at 40G and higher speeds.

Agreed. I would like to see a focus on both.
Tom Herbert Sept. 22, 2014, 10:40 p.m. UTC | #31
On Mon, Sep 22, 2014 at 3:17 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 09/22/14 at 08:10am, Tom Herbert wrote:
>> Thomas, can you (or someone else) quantify what the host case is. I
>> suppose there may be merit in using a switch on NIC for kernel bypass
>> scenarios, but I'm still having a hard time understanding how this
>> could be integrated into the host stack with benefits that outweigh
>
> Personally my primary interest is on lxc and vm based workloads w/
> end to end encryption, encap, distributed L3 and NAT, and policy
> enforcement including service graphs which imply both east-west
> and north-south traffic patterns on a host. The usual I guess ;-)
>
>> complexity. The history of stateful offloads in NICs is not great, and
>> encapsulation (stuffing a few bytes of header into a packet) is in
>> itself not nearly an expensive enough operation to warrant offloading
>
> No argument here. The direct benchmark comparisons I've measured showed
> only around 2% improvement.
>
> What makes stateful offload interesting to me is that the final
> desintation of a packet is known at RX and can be redirected to a
> queue or VF. This allows to build packet batches on shared pages
> while preserving the securiy model.
>
How is this different from what rx-filtering already does?

> Will the gains outweigh complexity? I hope so but I don't know for
> sure. If you have insights, let me know. What I know for sure is that
> I don't want to rely on a kernel bypass for the above.
>
>> to the NIC. Personally, I wish if NIC vendors are going to focus on
>> stateful offload I rather see it be for encryption which I believe
>> currently does warrant offload at 40G and higher speeds.
>
> Agreed. I would like to be see a focus on both.
Thomas Graf Sept. 22, 2014, 10:53 p.m. UTC | #32
On 09/22/14 at 03:40pm, Tom Herbert wrote:
> On Mon, Sep 22, 2014 at 3:17 PM, Thomas Graf <tgraf@suug.ch> wrote:
> > What makes stateful offload interesting to me is that the final
> > desintation of a packet is known at RX and can be redirected to a
> > queue or VF. This allows to build packet batches on shared pages
> > while preserving the securiy model.
> >
> How is this different from what rx-filtering already does?

Without stateful offload I can't know where the packet is destined
to until after I've allocated an skb and parsed the packet in
software. I might be missing what you refer to here specifically.
Tom Herbert Sept. 22, 2014, 11:07 p.m. UTC | #33
On Mon, Sep 22, 2014 at 3:53 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 09/22/14 at 03:40pm, Tom Herbert wrote:
>> On Mon, Sep 22, 2014 at 3:17 PM, Thomas Graf <tgraf@suug.ch> wrote:
>> > What makes stateful offload interesting to me is that the final
>> > desintation of a packet is known at RX and can be redirected to a
>> > queue or VF. This allows to build packet batches on shared pages
>> > while preserving the securiy model.
>> >
>> How is this different from what rx-filtering already does?
>
> Without stateful offload I can't know where the packet is destined
> to until after I've allocated an skb and parsed the packet in
> software. I might be missing what you refer to here specifically.

n-tuple filtering as exposed by ethtool.
John Fastabend Sept. 23, 2014, 1:36 a.m. UTC | #34
On 09/22/2014 04:07 PM, Tom Herbert wrote:
> On Mon, Sep 22, 2014 at 3:53 PM, Thomas Graf <tgraf@suug.ch> wrote:
>> On 09/22/14 at 03:40pm, Tom Herbert wrote:
>>> On Mon, Sep 22, 2014 at 3:17 PM, Thomas Graf <tgraf@suug.ch> wrote:
>>>> What makes stateful offload interesting to me is that the final
>>>> desintation of a packet is known at RX and can be redirected to a
>>>> queue or VF. This allows to build packet batches on shared pages
>>>> while preserving the securiy model.
>>>>
>>> How is this different from what rx-filtering already does?
>>
>> Without stateful offload I can't know where the packet is destined
>> to until after I've allocated an skb and parsed the packet in
>> software. I might be missing what you refer to here specifically.
>
> n-tuple filtering in as exposed by ethtool.

n-tuple has some deficiencies,

	- it's not possible to get the capabilities to learn what
	  fields are supported by the device, what actions, etc.

	- it's ioctl based, so we have to poll the device

	- it only supports a single table, whereas we have devices with
	  multiple tables

	- sort of the same as above, but it doesn't allow creating new
	  tables or destroying old ones.

I probably missed a few others, but those are the main ones that I
would like to address. Granted, other than the ioctl issue the rest
could be solved by extending the existing interface. However, I would
just as soon port it to ndo_ops and netlink and then extend it there,
seeing that it needs a reasonable overhaul to support the above anyway.

We could port the ethtool ops over to the new interface to
simplify drivers.

.John
Alexei Starovoitov Sept. 23, 2014, 1:54 a.m. UTC | #35
On Mon, Sep 22, 2014 at 8:10 AM, Tom Herbert <therbert@google.com> wrote:
> On Mon, Sep 22, 2014 at 1:13 AM, Thomas Graf <tgraf@suug.ch> wrote:
>> On 09/20/14 at 03:50pm, Alexei Starovoitov wrote:
>>> I think HW should not be limited by SW abstractions whether
>>> these abstractions are called flows, n-tuples, bridge or else.
>>> Really looking forward to see "device reporting the headers as
>>> header fields (len, offset) and the associated parse graph"
>>> as the first step.
>>>
>>> Another topic that this discussion didn't cover yet is how this
>>> all connects to tunnels and what is 'tunnel offloading'.

> encapsulation (stuffing a few bytes of header into a packet) is in
> itself not nearly an expensive enough operation to warrant offloading
> to the NIC. Personally, I wish if NIC vendors are going to focus on

On the contrary, generic tunneling is the most important one to get
right when we're talking offloads.
Adding an encap header is easy to do in hw, but it breaks all other
offloads if the hw is not generic. Consider a gso packet coming from a
vm. A generic tunnel allows sw to add inner headers, outer headers and
set up offload offsets, so that HW does segmentation and checksumming
of the inner packet, adjusts inner headers and adds the final outer
encap. And this is just tx offload. On rx, a smart tunnel offload in HW
parses the encap and goes all the way to the inner headers to verify
checksums; it also steers based on inner headers.
Try mellanox nics with and without vxlan offload to see
the difference.
It looks like fm10k will be just as good, but existing encaps are
not going to last forever, so RX should be improved the way John
is saying. There's got to be a 'parse graph' for HW to see past
variable-length encap and into the inner headers.
checksum_complete-style offloading of checksum verification
is not efficient. The cost of adjusting it over and over while
parsing encaps is too high. Plus CPU steering based on outer
headers is just too slow when speeds are in the 40G range.

> stateful offload I rather see it be for encryption which I believe
> currently does warrant offload at 40G and higher speeds.

encryption offload is badly needed as well. Unfortunately it's
not seen as a nic feature yet.
Tom Herbert Sept. 23, 2014, 2:16 a.m. UTC | #36
On Mon, Sep 22, 2014 at 6:54 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Mon, Sep 22, 2014 at 8:10 AM, Tom Herbert <therbert@google.com> wrote:
>> On Mon, Sep 22, 2014 at 1:13 AM, Thomas Graf <tgraf@suug.ch> wrote:
>>> On 09/20/14 at 03:50pm, Alexei Starovoitov wrote:
>>>> I think HW should not be limited by SW abstractions whether
>>>> these abstractions are called flows, n-tuples, bridge or else.
>>>> Really looking forward to see "device reporting the headers as
>>>> header fields (len, offset) and the associated parse graph"
>>>> as the first step.
>>>>
>>>> Another topic that this discussion didn't cover yet is how this
>>>> all connects to tunnels and what is 'tunnel offloading'.
>
>> encapsulation (stuffing a few bytes of header into a packet) is in
>> itself not nearly an expensive enough operation to warrant offloading
>> to the NIC. Personally, I wish if NIC vendors are going to focus on
>
> On contrary, generic tunneling is most important one to get right
> when we're talking offloads.
> Adding encap header is easy to do in hw, but it breaks all other
> offloads if hw is not generic. Consider gso packet coming from vm.
> Generic tunnel allows sw to add inner headers, outer headers and
> setup offload offsets, so that HW does segmentation, checksuming
> of inner packet, adjusts inner headers and adds final outer encap.

As I pointed out on a previous thread, we already have a sufficiently
generic interface to allow HW to do encapsulated TSO
(SKB_GSO_UDP_TUNNEL and SKB_GSO_UDP_TUNNEL_CSUM with the inner
headers). Properly implemented, HW can support a whole bunch of
UDP encap protocols without knowing how to parse them. I don't see how
a switch on the NIC helps this...

> And this is just tx offload. On rx smart tunnel offload in HW parses
> encap and goes all the way to inner headers to verify checksums,
> it also steers based on inner headers.
> Try mellanox nics with and without vxlan offload to see
> the difference.

Turn on UDP RSS on the device and I bet you'll see those differences
go away! Once we've moved to UDP encapsulation, there's really little
value in looking at inner headers for RSS or ECMP; this should be
sufficient. Sure, someone might want to parse the inner headers for
some sort of advanced RX steering, but again this implies rx-filtering
and not switch functionality.

Alexei, I believe you previously said that SW should not dictate
HW models. I agree with this, but also believe the converse is true--
HW shouldn't dictate SW model. This is really why I'm raising the
question of what it means to integrate a switch into the host stack.
If this is something that doesn't require any model change to the
stack and is just a clever backend for rx-filters or tc, then I'm fine
with that!

Thanks,
Tom

> It looks like fm10k will be just as good, but existing encaps are
> not going to last forever, so RX should be improved they way John
> is saying. There gotta to be a 'parse graph' for HW to see past
> variable length encap and into inner headers.
> checksum_complete style of offloading checksum verification
> is not efficient. The cost of adjusting it over and over while
> parsing encaps is too high. Plus cpu steering based on outer
> headers is just too slow when speeds are in 40G range.
>
>> stateful offload I rather see it be for encryption which I believe
>> currently does warrant offload at 40G and higher speeds.
>
> encryption offload is badly needed as well. Unfortunately it's
> not seen as nic feature yet.
Andy Gospodarek Sept. 23, 2014, 4:11 a.m. UTC | #37
On Mon, Sep 22, 2014 at 07:16:47PM -0700, Tom Herbert wrote:
[...]
> 
> Alexei, I believe you said previously said that SW should not dictate
> HW models. I agree with this, but also believe the converse is true--
> HW shouldn't dictate SW model. This is really why I'm raising the
> question of what it means to integrate a switch into the host stack.

Tom, when I read this I cannot help but remind myself of the
intentions/hopes/dreams of those on this thread, and how different
their views can be on what it means to add additional 'offload support'
to the kernel.

There are clearly some who are most interested in how an eSwitch on an
SR-IOV capable NIC can be controlled to provide traditional forwarding
help as well as offload the various technologies they hope to terminate
at/inside their endpoint (host/guest/container) -- Thomas's _simple_
use-case demonstrates this. ;)  This is a logical extension/increase in
functionality that is offered in many eSwitches but was previously
hidden from the user with the first-generation SR-IOV capable network
devices on hosts/servers.

Others (like Florian, who has been working to extend DSA, or those
pushing hardware vendors to make SDKs more open) want the existing
bridging/routing/offload code to take advantage of the hardware
offload/encap available in merchant silicon.  The general idea seems to
be to add knowledge of offload hardware to the kernel -- either via new
ndo_ops or netlink.  This gives users who have this hardware the
ability to have a solution for their router/switch that makes it feel
like Linux is actually helping make forwarding decisions -- rather than
just being the kernel chosen to provide an environment where some other
non-community code runs that makes all of the decisions.

And now we also have the patchset that spawned what I think has been
more excellent discussion.  Jiri and Scott's patches bring up another,
more generic model that, while not currently backed by hardware,
provides an example/vision for what could be done if such hardware
existed and how to consider interacting with that driver/hardware
(it has clearly been met with some resistance, but the discussion has
been great).  Their ultimate goals appear to be similar to those of
people who want full offload/forwarding support for a device, but via
a different method than what some would consider standard.

I am personally hopeful that most who are passionate about this will be
able to get together next month at LPC (or send someone to represent
them!) so that those interested can sit in the same room and try to
better understand each other's desires and start to form some concrete
direction towards a solution that seems to meet the needs of most while
not being an architectural disaster.

Of course that may be way too optimistic for this crowd!  :-D

Thomas Graf Sept. 23, 2014, 7:19 a.m. UTC | #38
On 09/22/14 at 06:36pm, John Fastabend wrote:
> n-tuple has some deficiencies,
> 
> 	- its not possible to get the capabilities to learn what
> 	  fields are supported by the device, what actions, etc.
> 
> 	- its ioctl based so we have to poll the device
> 
> 	- only supports a single table, where we have devices with
> 	  multiple tables
> 
> 	- sort of the same as above but it doesn't allow creating new
> 	  tables or destroying old tables.

OK, I understand where Tom was going. Given we add feature detection
capabilities, this could be used to identify the guest for fixed-length
encap. I still assume HW won't be able to match on the inner header
of any variable-length encap-with-metadata packet unless it can
actually parse the encap. I hope I didn't bring the encap format into
this thread at this very moment ;-)
Thomas Graf Sept. 23, 2014, 9:18 a.m. UTC | #39
On 09/22/14 at 03:40pm, Tom Herbert wrote:
> On Mon, Sep 22, 2014 at 3:17 PM, Thomas Graf <tgraf@suug.ch> wrote:
> > What makes stateful offload interesting to me is that the final
> > desintation of a packet is known at RX and can be redirected to a
> > queue or VF. This allows to build packet batches on shared pages
> > while preserving the securiy model.

To put this in other words: It is equivalent to applying the snabbswitch
+ vhost-user principle to the kernel but with encap support. The SR-IOV
case would be a further optimization of that.
Thomas Graf Sept. 23, 2014, 9:52 a.m. UTC | #40
On 09/22/14 at 07:16pm, Tom Herbert wrote:
> Turn on UDP RSS on the device and I bet you'll see those differences
> go away! Once we moved to UDP encapsulation, there's really little
> value in looking at inner headers for RSS or ECMP, this should be
> sufficient. Sure someone might want to parse the inner headers for
> some sort of advanced RX steering, but again this implies rx-filtering
> and not switch functionality.

Agreed. The reason we discuss this in the context of this thread is
because the required rx-filtering capabilities seem to be introduced
in the form of (adapted) switch chip integrations onto NICs. In that
sense, OVS is essentially doing advanced RX steering in software.

I agree that switch functionality (whatever that specifically implies)
is not strictly required for the host if you consider queue
redirection as part of RX steering. The exception here would be use
of SR-IOV which could be highly interesting for corner cases if
combined with smart elephant guest detection. A classic example would
be NFV deployed in a virtualized environment, i.e. a virtual firewall
or DPI application serving a bunch of guests.

> If this is something that doesn't require any model change to the
> stack and is just a clever backend for rx-filters or tc, then I'm fine
> with that!

I haven't seen any model change proposed. I'm most certainly not
advocating that. Anyone who can live with a model change might as well
just stick to SnabbSwitch or DPDK.
Thomas Graf Sept. 23, 2014, 10:11 a.m. UTC | #41
On 09/23/14 at 12:11am, Andy Gospodarek wrote:
> There are clearly some that are most interested in how an eSwitch on an
> SR-IOV capable NIC be controlled can provide traditional forwarding help
> as well as offload the various technologies they hope to terminate
> at/inside their endpoint (host/guest/container) -- Thomas's _simple_
> use-case demonstrates this. ;)  This is a logical extention/increase in
> functionality that is offered in many eSwitches that was previously
> hidden from the user with the first generation SR-IOV capable network
> devices on hosts/servers.

I think we can define this more broadly and state that providing RX
steering capabilities to identify a guest in the NIC allows packets
to be mapped directly into a memory region shared between host and
guest. Not a new concept at all, but the existing dMAC and VLAN rx
filtering is just too limiting. We require a programmable API with
support for encap and encryption. SR-IOV is a hardware-assisted form
of that which can expedite the guest-to-guest path on a host.
Jamal Hadi Salim Sept. 23, 2014, 11:09 a.m. UTC | #42
On 09/22/14 21:36, John Fastabend wrote:

> n-tuple has some deficiencies,
>
>      - its not possible to get the capabilities to learn what
>        fields are supported by the device, what actions, etc.
>
>      - its ioctl based so we have to poll the device
>
>      - only supports a single table, where we have devices with
>        multiple tables
>
>      - sort of the same as above but it doesn't allow creating new
>        tables or destroying old tables.
>
> I probably missed a few others

A few more I can think of which are generic:
the whole event subsystem allowing multi-user sync or monitoring
offered by netlink is missing, because ethtool ioctls go
directly to the driver.
Netlink's asynchronous interface, versus the synchronous ioctl one,
offers more effective user programmability.
And there is the ioctl binary interface, whose extensibility is a pain
(don't let Stephen H hear you mention ioctls for just this one reason).

> but those are the main ones that I
> would like to address. Granted other than the ioctl line the rest could
> be solved by extending the existing interface. However I would just
> assume port it to ndo_ops and netlink then extend the existing ioctl
> seeing it needs a reasonable overall to support the above anyways.
>
> We could port the ethtool ops over to the new interface to
> simplify drivers.

Indeed.

cheers,
jamal

> .John
>

Or Gerlitz Sept. 23, 2014, 3:32 p.m. UTC | #43
On 9/23/2014 7:11 AM, Andy Gospodarek wrote:
> On Mon, Sep 22, 2014 at 07:16:47PM -0700, Tom Herbert wrote:
> [...]
>> Alexei, I believe you said previously said that SW should not dictate
>> HW models. I agree with this, but also believe the converse is true--
>> HW shouldn't dictate SW model. This is really why I'm raising the
>> question of what it means to integrate a switch into the host stack.
> Tom, when I read this I cannot help but remind myself that the
> intentions/hopes/dreams of those on this thread and how different their
> views can be on what it means to add additional 'offload support' to the
> kernel.
>
> There are clearly some that are most interested in how an eSwitch on an
> SR-IOV capable NIC be controlled can provide traditional forwarding help
> as well as offload the various technologies they hope to terminate
> at/inside their endpoint (host/guest/container) -- Thomas's _simple_
> use-case demonstrates this. ;)  This is a logical extention/increase in
> functionality that is offered in many eSwitches that was previously
> hidden from the user with the first generation SR-IOV capable network
> devices on hosts/servers.

Indeed.

The idea is to leverage OVS to manage the eSwitch (embedded NIC switch)
as well (NOT to offload OVS).

We envision a seamless integration of user environments based on OVS
with the SRIOV eSwitch, and the grounds for that were very well
supported in Jiri’s V1.

The eSwitch hardware does not need to have multiple tables and ‘enjoys’
the flat rule model of OVS. The kernel datapath does not need to be
aware of the existence of the HW nor its capabilities; it just pushes
the flow also to the switchdev which represents the eSwitch.

If the flow can be supported in HW it will be forwarded in HW, and if
not it will be forwarded by the kernel.

> [....]
>
> And now we also have the patchset that spawned what I think has been
> more excellent discussion.  Jiri and Scott's patches bring up another,
> more generic model that while not currently backed by hardware provided
> an example/vision for what could be done if such hardware existed and
> how to consider interacting with that driver/hardware (that clearly has
> been met with some resistance, but the discussion has been great).
> There ultimate goals appear to be similar to those that want full
> offload/fordwarding support for a device, but via a different method
> than what some would consider standard.
>
> I am personally hopeful that most who are passionate about this will be
> able to get together next month at LPC (or send someone to represent
> them!) so that those interested can sit in the same room and try to
> better understand each others desires and start to form some concrete
> direction towards a solution that seems to meet the needs of most while
> not being an architectural disaster.
>

Yep. LPC is the time and place to go over the multiple use-cases
(physical switch, eSwitch, eBPF, etc.) that could (should) be supported
by the basic framework.

Or.

Thomas Graf Sept. 24, 2014, 1:32 p.m. UTC | #44
On 09/23/14 at 06:32pm, Or Gerlitz wrote:
> Indeed.
> 
> The idea is to leverage OVS to manage eSwitch (embedded NIC switch) as well
> (NOT to offload OVS).
> 
> We envision a seamless integration of user environment which is based on OVS
> with SRIOV eSwitch and the grounds for that were very well supported in
> Jiri’s V1.

Please consider comparing your model with what is described here [0].
I'm trying to write down an architecture document that we can finalize
in Düsseldorf.

[0] http://goo.gl/qkzW5y

> The eSwitch hardware does not need to have multiple tables and ‘enjoys’ the
> flat rule of OVS. The kernel datapath does not need to be aware of the
> existence of HW nor its capabilities, it just pushes the flow also to the
> switchdev which represents the eSwitch.

I think you are saying that the kernel should not be required to make
the offload decision, which is fair. We definitely don't want to force
the decision to be outside, though; there are several legit reasons to
support transparent offloads within the kernel as well as outside of
OVS.

> Yep. LPC is the time and place to go over the multiple use-cases (phyiscal
> switch, eSwitch, eBPF, etc) that could (should) be supported by the basic
> framework.

For reference:
http://www.linuxplumbersconf.org/2014/ocw/proposals/2463
Or Gerlitz Sept. 26, 2014, 8:03 p.m. UTC | #45
On Wed, Sep 24, 2014 at 4:32 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 09/23/14 at 06:32pm, Or Gerlitz wrote:
>> Indeed.
>>
>> The idea is to leverage OVS to manage eSwitch (embedded NIC switch) as well
>> (NOT to offload OVS).
>>
>> We envision a seamless integration of user environment which is based on OVS
>> with SRIOV eSwitch and the grounds for that were very well supported in
>> Jiri’s V1.
>
> Please consider comparing your model with what is described here [0].
> I'm trying to write down an architecture document that we can finalize
> in Düsseldorf.
> [0] http://goo.gl/qkzW5y

Yep, this can serve us for the architecture discussion @ LPC. Re the
SRIOV case, you referred to the case where guest VF traffic goes
through HW (say) VXLAN encap/decap -- just to make sure, we also need
to support the simpler case, where guest traffic just goes through
vlan tag/strip.


>> The eSwitch hardware does not need to have multiple tables and ‘enjoys’ the
>> flat rule of OVS. The kernel datapath does not need to be aware of the
>> existence of HW nor its capabilities, it just pushes the flow also to the
>> switchdev which represents the eSwitch.

> I think you are saying that the kernel should not be required to make
> the offload decision which is fair. We definitely don't want to force
> the decision to be outside though, there are several legit reasons to
> support transparent offloads within the kernel as well outside of OVS.
>
>> Yep. LPC is the time and place to go over the multiple use-cases (phyiscal
>> switch, eSwitch, eBPF, etc) that could (should) be supported by the basic
>> framework.
>
> For reference:
> http://www.linuxplumbersconf.org/2014/ocw/proposals/2463

The SRIOV case is only mentioned here in the "Compatibility with
existing FDB ioctls for SR-IOV" bullet, so I'm a bit nervous... we
need to have it clear in the agenda. Also, this BoF needs to be
double-length, two hours; can you act to get that done?

thanks,

Or.
Thomas Graf Sept. 26, 2014, 9:02 p.m. UTC | #46
On 09/26/14 at 11:03pm, Or Gerlitz wrote:
> Yep, this can serve us for the architecture discussion @ LPC. Re the
> SRIOV case, you referred to the case where guest VF traffic goes
> through HW (say) VXLAN encap/decap -- just to make sure, we also need
> to support the simpler case, where guest traffic just goes through
> vlan tag/strip.

Agreed.

> The SRIOV case is only mentioned here in the "Compatibility with
> existing FDB ioctls for SR-IOV" bullet, so I'm a bit nervous... we
> need to have it clear in the agenda.

I think the offload API discussion should consider the SR-IOV case
but we might need to discuss additional details outside of that
BoF to ensure that the BoF can keep focus on the offload API itself.
That said, I suggest we define the specific agenda once we know that
the BoF has been accepted and 2 hours have been allocated ;-)

> Also, this BoF needs to be double-length, two hours; can you act to
> get that done?

This has already been requested.
Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index f1f26db..0fe2822 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8832,6 +8832,7 @@  L:	netdev@vger.kernel.org
 S:	Supported
 F:	net/switchdev/
 F:	include/net/switchdev.h
+F:	include/uapi/linux/switchdev.h
 
 SYNOPSYS ARC ARCHITECTURE
 M:	Vineet Gupta <vgupta@synopsys.com>
diff --git a/include/uapi/linux/switchdev.h b/include/uapi/linux/switchdev.h
new file mode 100644
index 0000000..f945b57
--- /dev/null
+++ b/include/uapi/linux/switchdev.h
@@ -0,0 +1,113 @@ 
+/*
+ * include/uapi/linux/switchdev.h - Netlink interface to Switch device
+ * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#ifndef _UAPI_LINUX_SWITCHDEV_H_
+#define _UAPI_LINUX_SWITCHDEV_H_
+
+enum {
+	SWDEV_CMD_NOOP,
+	SWDEV_CMD_FLOW_INSERT,
+	SWDEV_CMD_FLOW_REMOVE,
+};
+
+enum {
+	SWDEV_ATTR_UNSPEC,
+	SWDEV_ATTR_IFINDEX,			/* u32 */
+	SWDEV_ATTR_FLOW,			/* nest */
+
+	__SWDEV_ATTR_MAX,
+	SWDEV_ATTR_MAX = (__SWDEV_ATTR_MAX - 1),
+};
+
+enum {
+	SWDEV_ATTR_FLOW_MATCH_KEY_UNSPEC,
+	SWDEV_ATTR_FLOW_MATCH_KEY_PHY_PRIORITY,		/* u32 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_PHY_IN_PORT,		/* u32 (ifindex) */
+	SWDEV_ATTR_FLOW_MATCH_KEY_ETH_SRC,		/* ETH_ALEN */
+	SWDEV_ATTR_FLOW_MATCH_KEY_ETH_DST,		/* ETH_ALEN */
+	SWDEV_ATTR_FLOW_MATCH_KEY_ETH_TCI,		/* be16 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_ETH_TYPE,		/* be16 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IP_PROTO,		/* u8 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IP_TOS,		/* u8 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IP_TTL,		/* u8 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IP_FRAG,		/* u8 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_TP_SRC,		/* be16 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_TP_DST,		/* be16 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_TP_FLAGS,		/* be16 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ADDR_SRC,	/* be32 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ADDR_DST,	/* be32 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ARP_SHA,		/* ETH_ALEN */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ARP_THA,		/* ETH_ALEN */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ADDR_SRC,	/* struct in6_addr */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ADDR_DST,	/* struct in6_addr */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_LABEL,		/* be32 */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_TARGET,	/* struct in6_addr */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_SLL,		/* ETH_ALEN */
+	SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_TLL,		/* ETH_ALEN */
+
+	__SWDEV_ATTR_FLOW_MATCH_KEY_MAX,
+	SWDEV_ATTR_FLOW_MATCH_KEY_MAX = (__SWDEV_ATTR_FLOW_MATCH_KEY_MAX - 1),
+};
+
+enum {
+	SWDEV_FLOW_ACTION_TYPE_OUTPUT,
+	SWDEV_FLOW_ACTION_TYPE_VLAN_PUSH,
+	SWDEV_FLOW_ACTION_TYPE_VLAN_POP,
+};
+
+enum {
+	SWDEV_ATTR_FLOW_ACTION_UNSPEC,
+	SWDEV_ATTR_FLOW_ACTION_TYPE,		/* u32 */
+	SWDEV_ATTR_FLOW_ACTION_OUT_PORT,	/* u32 (ifindex) */
+	SWDEV_ATTR_FLOW_ACTION_VLAN_PROTO,	/* be16 */
+	SWDEV_ATTR_FLOW_ACTION_VLAN_TCI,	/* u16 */
+
+	__SWDEV_ATTR_FLOW_ACTION_MAX,
+	SWDEV_ATTR_FLOW_ACTION_MAX = (__SWDEV_ATTR_FLOW_ACTION_MAX - 1),
+};
+
+enum {
+	SWDEV_ATTR_FLOW_ITEM_UNSPEC,
+	SWDEV_ATTR_FLOW_ITEM_ACTION,		/* nest */
+
+	__SWDEV_ATTR_FLOW_ITEM_MAX,
+	SWDEV_ATTR_FLOW_ITEM_MAX = (__SWDEV_ATTR_FLOW_ITEM_MAX - 1),
+};
+
+enum {
+	SWDEV_ATTR_FLOW_UNSPEC,
+	SWDEV_ATTR_FLOW_MATCH_KEY,		/* nest */
+	SWDEV_ATTR_FLOW_MATCH_KEY_MASK,		/* nest */
+	SWDEV_ATTR_FLOW_LIST_ACTION,		/* nest */
+
+	__SWDEV_ATTR_FLOW_MAX,
+	SWDEV_ATTR_FLOW_MAX = (__SWDEV_ATTR_FLOW_MAX - 1),
+};
+
+/* Nested layout of flow add/remove command message:
+ *
+ *	[SWDEV_ATTR_IFINDEX]
+ *	[SWDEV_ATTR_FLOW]
+ *		[SWDEV_ATTR_FLOW_MATCH_KEY]
+ *			[SWDEV_ATTR_FLOW_MATCH_KEY_*], ...
+ *		[SWDEV_ATTR_FLOW_MATCH_KEY_MASK]
+ *			[SWDEV_ATTR_FLOW_MATCH_KEY_*], ...
+ *		[SWDEV_ATTR_FLOW_LIST_ACTION]
+ *			[SWDEV_ATTR_FLOW_ITEM_ACTION]
+ *				[SWDEV_ATTR_FLOW_ACTION_*], ...
+ *			[SWDEV_ATTR_FLOW_ITEM_ACTION]
+ *				[SWDEV_ATTR_FLOW_ACTION_*], ...
+ *			...
+ */
+
+#define SWITCHDEV_GENL_NAME "switchdev"
+#define SWITCHDEV_GENL_VERSION 0x1
+
+#endif /* _UAPI_LINUX_SWITCHDEV_H_ */
diff --git a/net/switchdev/Kconfig b/net/switchdev/Kconfig
index 20e8ed2..4470d6e 100644
--- a/net/switchdev/Kconfig
+++ b/net/switchdev/Kconfig
@@ -7,3 +7,14 @@  config NET_SWITCHDEV
 	depends on INET
 	---help---
 	  This module provides support for hardware switch chips.
+
+config NET_SWITCHDEV_NETLINK
+	tristate "Netlink interface to Switch device"
+	depends on NET_SWITCHDEV
+	default m
+	---help---
+	  This module provides a Generic Netlink interface to hardware switch
+	  chips.
+
+	  To compile this code as a module, choose M here: the
+	  module will be called switchdev_netlink.
diff --git a/net/switchdev/Makefile b/net/switchdev/Makefile
index 5ed63ed..0695b53 100644
--- a/net/switchdev/Makefile
+++ b/net/switchdev/Makefile
@@ -3,3 +3,4 @@ 
 #
 
 obj-$(CONFIG_NET_SWITCHDEV) += switchdev.o
+obj-$(CONFIG_NET_SWITCHDEV_NETLINK) += switchdev_netlink.o
diff --git a/net/switchdev/switchdev_netlink.c b/net/switchdev/switchdev_netlink.c
new file mode 100644
index 0000000..d97db8b
--- /dev/null
+++ b/net/switchdev/switchdev_netlink.c
@@ -0,0 +1,441 @@ 
+/*
+ * net/switchdev/switchdev_netlink.c - Netlink interface to Switch device
+ * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <net/switchdev.h>
+#include <net/netlink.h>
+#include <net/genetlink.h>
+#include <uapi/linux/switchdev.h>
+
+static struct genl_family swdev_nl_family = {
+	.id		= GENL_ID_GENERATE,
+	.name		= SWITCHDEV_GENL_NAME,
+	.version	= SWITCHDEV_GENL_VERSION,
+	.maxattr	= SWDEV_ATTR_MAX,
+	.netnsok	= true,
+};
+
+static const struct nla_policy swdev_nl_flow_policy[SWDEV_ATTR_FLOW_MAX + 1] = {
+	[SWDEV_ATTR_FLOW_UNSPEC]		= { .type = NLA_UNSPEC, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY]		= { .type = NLA_NESTED },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_MASK]	= { .type = NLA_NESTED },
+	[SWDEV_ATTR_FLOW_LIST_ACTION]		= { .type = NLA_NESTED },
+};
+
+#define __IN6_ALEN sizeof(struct in6_addr)
+
+static const struct nla_policy
+swdev_nl_flow_match_key_policy[SWDEV_ATTR_FLOW_MATCH_KEY_MAX + 1] = {
+	[SWDEV_ATTR_FLOW_MATCH_KEY_UNSPEC]		= { .type = NLA_UNSPEC, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_PHY_PRIORITY]	= { .type = NLA_U32, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_PHY_IN_PORT]	= { .type = NLA_U32, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_SRC]		= { .len  = ETH_ALEN, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_DST]		= { .len  = ETH_ALEN, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_TCI]		= { .type = NLA_U16, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_TYPE]		= { .type = NLA_U16, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IP_PROTO]		= { .type = NLA_U8, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IP_TOS]		= { .type = NLA_U8, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IP_TTL]		= { .type = NLA_U8, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IP_FRAG]		= { .type = NLA_U8, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_TP_SRC]		= { .type = NLA_U16, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_TP_DST]		= { .type = NLA_U16, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_TP_FLAGS]		= { .type = NLA_U16, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ADDR_SRC]	= { .type = NLA_U32, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ADDR_DST]	= { .type = NLA_U32, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ARP_SHA]	= { .len  = ETH_ALEN, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ARP_THA]	= { .len  = ETH_ALEN, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ADDR_SRC]	= { .len  = __IN6_ALEN, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ADDR_DST]	= { .len  = __IN6_ALEN, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_LABEL]	= { .type = NLA_U32, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_TARGET]	= { .len  = __IN6_ALEN, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_SLL]	= { .len  = ETH_ALEN, },
+	[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_TLL]	= { .len  = ETH_ALEN, },
+};
+
+static const struct nla_policy
+swdev_nl_flow_action_policy[SWDEV_ATTR_FLOW_ACTION_MAX + 1] = {
+	[SWDEV_ATTR_FLOW_ACTION_UNSPEC]		= { .type = NLA_UNSPEC, },
+	[SWDEV_ATTR_FLOW_ACTION_TYPE]		= { .type = NLA_U32, },
+	[SWDEV_ATTR_FLOW_ACTION_VLAN_PROTO]	= { .type = NLA_U16, },
+	[SWDEV_ATTR_FLOW_ACTION_VLAN_TCI]	= { .type = NLA_U16, },
+};
+
+static int swdev_nl_cmd_noop(struct sk_buff *skb, struct genl_info *info)
+{
+	struct sk_buff *msg;
+	void *hdr;
+	int err;
+
+	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!msg)
+		return -ENOMEM;
+
+	hdr = genlmsg_put(msg, info->snd_portid, info->snd_seq,
+			  &swdev_nl_family, 0, SWDEV_CMD_NOOP);
+	if (!hdr) {
+		err = -EMSGSIZE;
+		goto err_msg_put;
+	}
+
+	genlmsg_end(msg, hdr);
+
+	return genlmsg_unicast(genl_info_net(info), msg, info->snd_portid);
+
+err_msg_put:
+	nlmsg_free(msg);
+
+	return err;
+}
+
+static int swdev_nl_parse_flow_match_key(struct nlattr *key_attr,
+					 struct swdev_flow_match_key *key)
+{
+	struct nlattr *attrs[SWDEV_ATTR_FLOW_MATCH_KEY_MAX + 1];
+	int err;
+
+	err = nla_parse_nested(attrs, SWDEV_ATTR_FLOW_MATCH_KEY_MAX,
+			       key_attr, swdev_nl_flow_match_key_policy);
+	if (err)
+		return err;
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_PHY_PRIORITY])
+		key->phy.priority =
+			nla_get_u32(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_PHY_PRIORITY]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_PHY_IN_PORT])
+		key->phy.in_port_ifindex =
+			nla_get_u32(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_PHY_IN_PORT]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_SRC])
+		ether_addr_copy(key->eth.src,
+				nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_SRC]));
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_DST])
+		ether_addr_copy(key->eth.dst,
+				nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_DST]));
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_TCI])
+		key->eth.tci =
+			nla_get_be16(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_TCI]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_TYPE])
+		key->eth.type =
+			nla_get_be16(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_ETH_TYPE]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IP_PROTO])
+		key->ip.proto =
+			nla_get_u8(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IP_PROTO]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IP_TOS])
+		key->ip.tos =
+			nla_get_u8(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IP_TOS]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IP_TTL])
+		key->ip.ttl =
+			nla_get_u8(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IP_TTL]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IP_FRAG])
+		key->ip.frag =
+			nla_get_u8(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IP_FRAG]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_TP_SRC])
+		key->tp.src =
+			nla_get_be16(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_TP_SRC]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_TP_DST])
+		key->tp.dst =
+			nla_get_be16(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_TP_DST]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_TP_FLAGS])
+		key->tp.flags =
+			nla_get_be16(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_TP_FLAGS]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ADDR_SRC])
+		key->ipv4.addr.src =
+			nla_get_be32(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ADDR_SRC]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ADDR_DST])
+		key->ipv4.addr.dst =
+			nla_get_be32(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ADDR_DST]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ARP_SHA])
+		ether_addr_copy(key->ipv4.arp.sha,
+				nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ARP_SHA]));
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ARP_THA])
+		ether_addr_copy(key->ipv4.arp.tha,
+				nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV4_ARP_THA]));
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ADDR_SRC])
+		memcpy(&key->ipv6.addr.src,
+		       nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ADDR_SRC]),
+		       sizeof(key->ipv6.addr.src));
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ADDR_DST])
+		memcpy(&key->ipv6.addr.dst,
+		       nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ADDR_DST]),
+		       sizeof(key->ipv6.addr.dst));
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_LABEL])
+		key->ipv6.label =
+			nla_get_be32(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_LABEL]);
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_TARGET])
+		memcpy(&key->ipv6.nd.target,
+		       nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_TARGET]),
+		       sizeof(key->ipv6.nd.target));
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_SLL])
+		ether_addr_copy(key->ipv6.nd.sll,
+				nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_SLL]));
+
+	if (attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_TLL])
+		ether_addr_copy(key->ipv6.nd.tll,
+				nla_data(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_IPV6_ND_TLL]));
+
+	return 0;
+}
+
+static int swdev_nl_parse_flow_action(struct nlattr *action_attr,
+				      struct swdev_flow_action *flow_action)
+{
+	struct nlattr *attrs[SWDEV_ATTR_FLOW_ACTION_MAX + 1];
+	int err;
+
+	err = nla_parse_nested(attrs, SWDEV_ATTR_FLOW_ACTION_MAX,
+			       action_attr, swdev_nl_flow_action_policy);
+	if (err)
+		return err;
+
+	if (!attrs[SWDEV_ATTR_FLOW_ACTION_TYPE])
+		return -EINVAL;
+
+	switch (nla_get_u32(attrs[SWDEV_ATTR_FLOW_ACTION_TYPE])) {
+	case SWDEV_FLOW_ACTION_TYPE_OUTPUT:
+		if (!attrs[SWDEV_ATTR_FLOW_ACTION_OUT_PORT])
+			return -EINVAL;
+		flow_action->out_port_ifindex =
+			nla_get_u32(attrs[SWDEV_ATTR_FLOW_ACTION_OUT_PORT]);
+		flow_action->type = SW_FLOW_ACTION_TYPE_OUTPUT;
+		break;
+	case SWDEV_FLOW_ACTION_TYPE_VLAN_PUSH:
+		if (!attrs[SWDEV_ATTR_FLOW_ACTION_VLAN_PROTO] ||
+		    !attrs[SWDEV_ATTR_FLOW_ACTION_VLAN_TCI])
+			return -EINVAL;
+		flow_action->vlan.proto =
+			nla_get_be16(attrs[SWDEV_ATTR_FLOW_ACTION_VLAN_PROTO]);
+		flow_action->vlan.tci =
+			nla_get_u16(attrs[SWDEV_ATTR_FLOW_ACTION_VLAN_TCI]);
+		flow_action->type = SW_FLOW_ACTION_TYPE_VLAN_PUSH;
+		break;
+	case SWDEV_FLOW_ACTION_TYPE_VLAN_POP:
+		flow_action->type = SW_FLOW_ACTION_TYPE_VLAN_POP;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int swdev_nl_parse_flow_actions(struct nlattr *actions_attr,
+				       struct swdev_flow_action *action)
+{
+	struct swdev_flow_action *cur;
+	struct nlattr *action_attr;
+	int rem;
+	int err;
+
+	cur = action;
+	nla_for_each_nested(action_attr, actions_attr, rem) {
+		err = swdev_nl_parse_flow_action(action_attr, cur);
+		if (err)
+			return err;
+		cur++;
+	}
+	return 0;
+}
+
+static int swdev_nl_parse_flow_action_count(struct nlattr *actions_attr,
+					    unsigned *p_action_count)
+{
+	struct nlattr *action_attr;
+	int rem;
+	int count = 0;
+
+	nla_for_each_nested(action_attr, actions_attr, rem) {
+		if (nla_type(action_attr) != SWDEV_ATTR_FLOW_ITEM_ACTION)
+			return -EINVAL;
+		count++;
+	}
+	*p_action_count = count;
+	return 0;
+}
+
+static void swdev_nl_free_flow(struct swdev_flow *flow)
+{
+	kfree(flow);
+}
+
+static int swdev_nl_parse_flow(struct nlattr *flow_attr, struct swdev_flow **p_flow)
+{
+	struct swdev_flow *flow;
+	struct nlattr *attrs[SWDEV_ATTR_FLOW_MAX + 1];
+	unsigned action_count;
+	int err;
+
+	err = nla_parse_nested(attrs, SWDEV_ATTR_FLOW_MAX,
+			       flow_attr, swdev_nl_flow_policy);
+	if (err)
+		return err;
+
+	if (!attrs[SWDEV_ATTR_FLOW_MATCH_KEY] ||
+	    !attrs[SWDEV_ATTR_FLOW_MATCH_KEY_MASK] ||
+	    !attrs[SWDEV_ATTR_FLOW_LIST_ACTION])
+		return -EINVAL;
+
+	err = swdev_nl_parse_flow_action_count(attrs[SWDEV_ATTR_FLOW_LIST_ACTION],
+					       &action_count);
+	if (err)
+		return err;
+	flow = swdev_flow_alloc(action_count, GFP_KERNEL);
+	if (!flow)
+		return -ENOMEM;
+
+	err = swdev_nl_parse_flow_match_key(attrs[SWDEV_ATTR_FLOW_MATCH_KEY],
+					    &flow->match.key);
+	if (err)
+		goto out;
+
+	err = swdev_nl_parse_flow_match_key(attrs[SWDEV_ATTR_FLOW_MATCH_KEY_MASK],
+					    &flow->match.key_mask);
+	if (err)
+		goto out;
+
+	err = swdev_nl_parse_flow_actions(attrs[SWDEV_ATTR_FLOW_LIST_ACTION],
+					  flow->action);
+	if (err)
+		goto out;
+
+	*p_flow = flow;
+	return 0;
+
+out:
+	kfree(flow);
+	return err;
+}
+
+static struct net_device *swdev_nl_dev_get(struct genl_info *info)
+{
+	struct net *net = genl_info_net(info);
+	int ifindex;
+
+	if (!info->attrs[SWDEV_ATTR_IFINDEX])
+		return NULL;
+
+	ifindex = nla_get_u32(info->attrs[SWDEV_ATTR_IFINDEX]);
+	return dev_get_by_index(net, ifindex);
+}
+
+static void swdev_nl_dev_put(struct net_device *dev)
+{
+	dev_put(dev);
+}
+
+static int swdev_nl_cmd_flow_insert(struct sk_buff *skb, struct genl_info *info)
+{
+	struct net_device *dev;
+	struct swdev_flow *flow;
+	int err;
+
+	if (!info->attrs[SWDEV_ATTR_FLOW])
+		return -EINVAL;
+
+	dev = swdev_nl_dev_get(info);
+	if (!dev)
+		return -EINVAL;
+
+	err = swdev_nl_parse_flow(info->attrs[SWDEV_ATTR_FLOW], &flow);
+	if (err)
+		goto dev_put;
+
+	err = swdev_flow_insert(dev, flow);
+	swdev_nl_free_flow(flow);
+dev_put:
+	swdev_nl_dev_put(dev);
+	return err;
+}
+
+static int swdev_nl_cmd_flow_remove(struct sk_buff *skb, struct genl_info *info)
+{
+	struct net_device *dev;
+	struct swdev_flow *flow;
+	int err;
+
+	if (!info->attrs[SWDEV_ATTR_FLOW])
+		return -EINVAL;
+
+	dev = swdev_nl_dev_get(info);
+	if (!dev)
+		return -EINVAL;
+
+	err = swdev_nl_parse_flow(info->attrs[SWDEV_ATTR_FLOW], &flow);
+	if (err)
+		goto dev_put;
+
+	err = swdev_flow_remove(dev, flow);
+	swdev_nl_free_flow(flow);
+dev_put:
+	swdev_nl_dev_put(dev);
+	return err;
+}
+
+static const struct genl_ops swdev_nl_ops[] = {
+	{
+		.cmd = SWDEV_CMD_NOOP,
+		.doit = swdev_nl_cmd_noop,
+	},
+	{
+		.cmd = SWDEV_CMD_FLOW_INSERT,
+		.doit = swdev_nl_cmd_flow_insert,
+		.policy = swdev_nl_flow_policy,
+		.flags = GENL_ADMIN_PERM,
+	},
+	{
+		.cmd = SWDEV_CMD_FLOW_REMOVE,
+		.doit = swdev_nl_cmd_flow_remove,
+		.policy = swdev_nl_flow_policy,
+		.flags = GENL_ADMIN_PERM,
+	},
+};
+
+static int __init swdev_nl_module_init(void)
+{
+	return genl_register_family_with_ops(&swdev_nl_family, swdev_nl_ops);
+}
+
+static void swdev_nl_module_fini(void)
+{
+	genl_unregister_family(&swdev_nl_family);
+}
+
+module_init(swdev_nl_module_init);
+module_exit(swdev_nl_module_fini);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Jiri Pirko <jiri@resnulli.us>");
+MODULE_DESCRIPTION("Netlink interface to Switch device");
+MODULE_ALIAS_GENL_FAMILY(SWITCHDEV_GENL_NAME);