diff mbox

[net-next,02/10] udp: Expand UDP tunnel common APIs

Message ID 1406024393-6778-3-git-send-email-azhou@nicira.com
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Andy Zhou July 22, 2014, 10:19 a.m. UTC
Added create_udp_tunnel_socket(), packet receive and transmit,  and
other related common functions for UDP tunnels.

Per net open UDP tunnel ports are tracked in this common layer to
prevent sharing of a single port with more than one UDP tunnel.

Signed-off-by: Andy Zhou <azhou@nicira.com>
---
 include/net/udp_tunnel.h |   57 +++++++++-
 net/ipv4/udp_tunnel.c    |  257 +++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 312 insertions(+), 2 deletions(-)

Comments

Andy Zhou July 22, 2014, 9:02 p.m. UTC | #1
On Tue, Jul 22, 2014 at 12:52 PM, Tom Herbert <therbert@google.com> wrote:
>
>
> On Tue, Jul 22, 2014 at 3:19 AM, Andy Zhou <azhou@nicira.com> wrote:
>> Added create_udp_tunnel_socket(), packet receive and transmit,  and
>> other related common functions for UDP tunnels.
>>
>> Per net open UDP tunnel ports are tracked in this common layer to
>> prevent sharing of a single port with more than one UDP tunnel.
>>
> bind should already prevent this. I don't really see a need to track udp
> encap ports separately.

When a new network device driver is activated, does it need to get a list
of currently open UDP tunnel ports to configure its offloads?

>> --- a/include/net/udp_tunnel.h
>> +++ b/include/net/udp_tunnel.h
>> @@ -1,7 +1,10 @@
>>  #ifndef __NET_UDP_TUNNEL_H
>>  #define __NET_UDP_TUNNEL_H
>>
>> -#define UDP_TUNNEL_TYPE_VXLAN 0x01
>> +#include <net/ip_tunnels.h>
>> +
>> +#define UDP_TUNNEL_TYPE_VXLAN  0x01
>> +#define UDP_TUNNEL_TYPE_GENEVE 0x02
>>
> Why do we need to define these? Caller should know what type of port is
> being opened and provide appropriate encap_rcv.

Assume udp tunnel layer needs to keep track of open ports, should it
also keep track of the protocol associated with the port?

>> +
>> +/* Calls the ndo_add_tunnel_port of the caller in order to
>> + * supply the listening VXLAN udp ports. Callers are expected
>> + * to implement the ndo_add_tunnle_port.
>> + */
> Seems a little presumptuous that we're doing VXLAN specific things in what
> should be common and generic code...
>
You are right. Cut-and-past error. It should read "UDP tunnel ports"
instead. I will fix it.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Herbert July 22, 2014, 9:16 p.m. UTC | #2
On Tue, Jul 22, 2014 at 2:02 PM, Andy Zhou <azhou@nicira.com> wrote:
> On Tue, Jul 22, 2014 at 12:52 PM, Tom Herbert <therbert@google.com> wrote:
>>
>>
>> On Tue, Jul 22, 2014 at 3:19 AM, Andy Zhou <azhou@nicira.com> wrote:
>>> Added create_udp_tunnel_socket(), packet receive and transmit,  and
>>> other related common functions for UDP tunnels.
>>>
>>> Per net open UDP tunnel ports are tracked in this common layer to
>>> prevent sharing of a single port with more than one UDP tunnel.
>>>
>> bind should already prevent this. I don't really see a need to track udp
>> encap ports separately.
>
> When a new network device driver is activated, does it need to get a list
> of currently open UDP tunnel ports to configure its offloads?
>
If that's needed it should be driven by the UDP offload registration
mechanisms, not from UDP tunnel code. It's very conceivable that we
will have UDP offloads that don't correspond to UDP tunnels in the
kernel--QUIC comes to mind.

>>> --- a/include/net/udp_tunnel.h
>>> +++ b/include/net/udp_tunnel.h
>>> @@ -1,7 +1,10 @@
>>>  #ifndef __NET_UDP_TUNNEL_H
>>>  #define __NET_UDP_TUNNEL_H
>>>
>>> -#define UDP_TUNNEL_TYPE_VXLAN 0x01
>>> +#include <net/ip_tunnels.h>
>>> +
>>> +#define UDP_TUNNEL_TYPE_VXLAN  0x01
>>> +#define UDP_TUNNEL_TYPE_GENEVE 0x02
>>>
>> Why do we need to define these? Caller should know what type of port is
>> being opened and provide appropriate encap_rcv.
>
> Assume udp tunnel layer needs to keep track of open ports, should it
> also keep track of the protocol associated with the port?
>
For what purpose? Other than for offloads and rcv_encap functions that
provide the service function anyway, what need is there for UDP layer
to know about this. More to the point, if I add a module to the kernel
with a new flavor of UDP tunneling, I shouldn't have to touch any core
code for things to work correctly. So by this line of thinking,
neither the terms VXLAN nor GENEVE should appear in any common code.

>>> +
>>> +/* Calls the ndo_add_tunnel_port of the caller in order to
>>> + * supply the listening VXLAN udp ports. Callers are expected
>>> + * to implement the ndo_add_tunnle_port.
>>> + */
>> Seems a little presumptuous that we're doing VXLAN specific things in what
>> should be common and generic code...
>>
> You are right. Cut-and-past error. It should read "UDP tunnel ports"
> instead. I will fix it.

Given my arguments above, I'm not sure that ndo_add_tunnel_port is the
right interface either.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jesse Gross July 22, 2014, 9:56 p.m. UTC | #3
On Tue, Jul 22, 2014 at 5:16 PM, Tom Herbert <therbert@google.com> wrote:
> On Tue, Jul 22, 2014 at 2:02 PM, Andy Zhou <azhou@nicira.com> wrote:
>> On Tue, Jul 22, 2014 at 12:52 PM, Tom Herbert <therbert@google.com> wrote:
>>>> --- a/include/net/udp_tunnel.h
>>>> +++ b/include/net/udp_tunnel.h
>>>> @@ -1,7 +1,10 @@
>>>>  #ifndef __NET_UDP_TUNNEL_H
>>>>  #define __NET_UDP_TUNNEL_H
>>>>
>>>> -#define UDP_TUNNEL_TYPE_VXLAN 0x01
>>>> +#include <net/ip_tunnels.h>
>>>> +
>>>> +#define UDP_TUNNEL_TYPE_VXLAN  0x01
>>>> +#define UDP_TUNNEL_TYPE_GENEVE 0x02
>>>>
>>> Why do we need to define these? Caller should know what type of port is
>>> being opened and provide appropriate encap_rcv.
>>
>> Assume udp tunnel layer needs to keep track of open ports, should it
>> also keep track of the protocol associated with the port?
>>
> For what purpose? Other than for offloads and rcv_encap functions that
> provide the service function anyway, what need is there for UDP layer
> to know about this. More to the point, if I add a module to the kernel
> with a new flavor of UDP tunneling, I shouldn't have to touch any core
> code for things to work correctly. So by this line of thinking,
> neither the terms VXLAN nor GENEVE should appear in any common code.

The hardware will need to know what the header format is so that it
can parse the packets on receive. And since the NIC can't exactly call
into a function pointer like GRO can, I'm not sure that there is a
solution that doesn't involve an identifier that needs to be listed
somewhere. This is a pretty minimal impact - it doesn't actually
appear in the core code.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Herbert July 22, 2014, 10:38 p.m. UTC | #4
On Tue, Jul 22, 2014 at 2:56 PM, Jesse Gross <jesse@nicira.com> wrote:
> On Tue, Jul 22, 2014 at 5:16 PM, Tom Herbert <therbert@google.com> wrote:
>> On Tue, Jul 22, 2014 at 2:02 PM, Andy Zhou <azhou@nicira.com> wrote:
>>> On Tue, Jul 22, 2014 at 12:52 PM, Tom Herbert <therbert@google.com> wrote:
>>>>> --- a/include/net/udp_tunnel.h
>>>>> +++ b/include/net/udp_tunnel.h
>>>>> @@ -1,7 +1,10 @@
>>>>>  #ifndef __NET_UDP_TUNNEL_H
>>>>>  #define __NET_UDP_TUNNEL_H
>>>>>
>>>>> -#define UDP_TUNNEL_TYPE_VXLAN 0x01
>>>>> +#include <net/ip_tunnels.h>
>>>>> +
>>>>> +#define UDP_TUNNEL_TYPE_VXLAN  0x01
>>>>> +#define UDP_TUNNEL_TYPE_GENEVE 0x02
>>>>>
>>>> Why do we need to define these? Caller should know what type of port is
>>>> being opened and provide appropriate encap_rcv.
>>>
>>> Assume udp tunnel layer needs to keep track of open ports, should it
>>> also keep track of the protocol associated with the port?
>>>
>> For what purpose? Other than for offloads and rcv_encap functions that
>> provide the service function anyway, what need is there for UDP layer
>> to know about this. More to the point, if I add a module to the kernel
>> with a new flavor of UDP tunneling, I shouldn't have to touch any core
>> code for things to work correctly. So by this line of thinking,
>> neither the terms VXLAN nor GENEVE should appear in any common code.
>
> The hardware will need to know what the header format is so that it
> can parse the packets on receive. And since the NIC can't exactly call
> into a function pointer like GRO can, I'm not sure that there is a
> solution that doesn't involve an identifier that needs to be listed
> somewhere. This is a pretty minimal impact - it doesn't actually
> appear in the core code.

The hardware doesn't *need* to know this, it's must be optional and
should have no bearing on the software stack. Suggest to put them in
their own header file. Also, as HW features these should appear in
NETIF_F_* list so that we can control on a per device level rather to
enable this feature (something like how NETIF_F_GSO_* was done).

What about support for L2TP/UDP?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Duyck, Alexander H July 22, 2014, 10:55 p.m. UTC | #5
On 07/22/2014 03:38 PM, Tom Herbert wrote:
> On Tue, Jul 22, 2014 at 2:56 PM, Jesse Gross <jesse@nicira.com> wrote:
>> On Tue, Jul 22, 2014 at 5:16 PM, Tom Herbert <therbert@google.com> wrote:
>>> On Tue, Jul 22, 2014 at 2:02 PM, Andy Zhou <azhou@nicira.com> wrote:
>>>> On Tue, Jul 22, 2014 at 12:52 PM, Tom Herbert <therbert@google.com> wrote:
>>>>>> --- a/include/net/udp_tunnel.h
>>>>>> +++ b/include/net/udp_tunnel.h
>>>>>> @@ -1,7 +1,10 @@
>>>>>>  #ifndef __NET_UDP_TUNNEL_H
>>>>>>  #define __NET_UDP_TUNNEL_H
>>>>>>
>>>>>> -#define UDP_TUNNEL_TYPE_VXLAN 0x01
>>>>>> +#include <net/ip_tunnels.h>
>>>>>> +
>>>>>> +#define UDP_TUNNEL_TYPE_VXLAN  0x01
>>>>>> +#define UDP_TUNNEL_TYPE_GENEVE 0x02
>>>>>>
>>>>> Why do we need to define these? Caller should know what type of port is
>>>>> being opened and provide appropriate encap_rcv.
>>>>
>>>> Assume udp tunnel layer needs to keep track of open ports, should it
>>>> also keep track of the protocol associated with the port?
>>>>
>>> For what purpose? Other than for offloads and rcv_encap functions that
>>> provide the service function anyway, what need is there for UDP layer
>>> to know about this. More to the point, if I add a module to the kernel
>>> with a new flavor of UDP tunneling, I shouldn't have to touch any core
>>> code for things to work correctly. So by this line of thinking,
>>> neither the terms VXLAN nor GENEVE should appear in any common code.
>>
>> The hardware will need to know what the header format is so that it
>> can parse the packets on receive. And since the NIC can't exactly call
>> into a function pointer like GRO can, I'm not sure that there is a
>> solution that doesn't involve an identifier that needs to be listed
>> somewhere. This is a pretty minimal impact - it doesn't actually
>> appear in the core code.
> 
> The hardware doesn't *need* to know this, it's must be optional and
> should have no bearing on the software stack. Suggest to put them in
> their own header file. Also, as HW features these should appear in
> NETIF_F_* list so that we can control on a per device level rather to
> enable this feature (something like how NETIF_F_GSO_* was done).
> 
> What about support for L2TP/UDP?

The hardware needs some means of knowing what UDP port numbers are used
for VXLAN and/or GENEVE as the two formats contain subtle differences
that we have to be ready for on the Rx path as we have to parse out the
frames.

We already have feature flags controlling the offloads, what the port
numbers provide is a means for us to determine what Rx packets we should
parse as tunnels vs standard UDP and which tunnel type we should parse
it as.

Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jesse Gross July 22, 2014, 11:12 p.m. UTC | #6
On Tue, Jul 22, 2014 at 6:38 PM, Tom Herbert <therbert@google.com> wrote:
> On Tue, Jul 22, 2014 at 2:56 PM, Jesse Gross <jesse@nicira.com> wrote:
>> On Tue, Jul 22, 2014 at 5:16 PM, Tom Herbert <therbert@google.com> wrote:
>>> On Tue, Jul 22, 2014 at 2:02 PM, Andy Zhou <azhou@nicira.com> wrote:
>>>> On Tue, Jul 22, 2014 at 12:52 PM, Tom Herbert <therbert@google.com> wrote:
>>>>>> --- a/include/net/udp_tunnel.h
>>>>>> +++ b/include/net/udp_tunnel.h
>>>>>> @@ -1,7 +1,10 @@
>>>>>>  #ifndef __NET_UDP_TUNNEL_H
>>>>>>  #define __NET_UDP_TUNNEL_H
>>>>>>
>>>>>> -#define UDP_TUNNEL_TYPE_VXLAN 0x01
>>>>>> +#include <net/ip_tunnels.h>
>>>>>> +
>>>>>> +#define UDP_TUNNEL_TYPE_VXLAN  0x01
>>>>>> +#define UDP_TUNNEL_TYPE_GENEVE 0x02
>>>>>>
>>>>> Why do we need to define these? Caller should know what type of port is
>>>>> being opened and provide appropriate encap_rcv.
>>>>
>>>> Assume udp tunnel layer needs to keep track of open ports, should it
>>>> also keep track of the protocol associated with the port?
>>>>
>>> For what purpose? Other than for offloads and rcv_encap functions that
>>> provide the service function anyway, what need is there for UDP layer
>>> to know about this. More to the point, if I add a module to the kernel
>>> with a new flavor of UDP tunneling, I shouldn't have to touch any core
>>> code for things to work correctly. So by this line of thinking,
>>> neither the terms VXLAN nor GENEVE should appear in any common code.
>>
>> The hardware will need to know what the header format is so that it
>> can parse the packets on receive. And since the NIC can't exactly call
>> into a function pointer like GRO can, I'm not sure that there is a
>> solution that doesn't involve an identifier that needs to be listed
>> somewhere. This is a pretty minimal impact - it doesn't actually
>> appear in the core code.
>
> The hardware doesn't *need* to know this, it's must be optional and
> should have no bearing on the software stack. Suggest to put them in
> their own header file. Also, as HW features these should appear in
> NETIF_F_* list so that we can control on a per device level rather to
> enable this feature (something like how NETIF_F_GSO_* was done).

Right - I meant for hardware offload. Obviously, pure software
implementations should continue to work fine with the tunnel stack (as
it does here). I don't have any particular objection to moving them to
a different file (udp_offload.h?) but I agree with Alex that these are
slightly different than hardware feature flags.

> What about support for L2TP/UDP?

It should be possible to take advantage of the common UDP tunnel code
here as well. I believe that Andy is planning on doing it as a follow
up patch, which would be a good example of a pure software
implementation.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Herbert July 22, 2014, 11:24 p.m. UTC | #7
On Tue, Jul 22, 2014 at 3:55 PM, Alexander Duyck
<alexander.h.duyck@intel.com> wrote:
> On 07/22/2014 03:38 PM, Tom Herbert wrote:
>> On Tue, Jul 22, 2014 at 2:56 PM, Jesse Gross <jesse@nicira.com> wrote:
>>> On Tue, Jul 22, 2014 at 5:16 PM, Tom Herbert <therbert@google.com> wrote:
>>>> On Tue, Jul 22, 2014 at 2:02 PM, Andy Zhou <azhou@nicira.com> wrote:
>>>>> On Tue, Jul 22, 2014 at 12:52 PM, Tom Herbert <therbert@google.com> wrote:
>>>>>>> --- a/include/net/udp_tunnel.h
>>>>>>> +++ b/include/net/udp_tunnel.h
>>>>>>> @@ -1,7 +1,10 @@
>>>>>>>  #ifndef __NET_UDP_TUNNEL_H
>>>>>>>  #define __NET_UDP_TUNNEL_H
>>>>>>>
>>>>>>> -#define UDP_TUNNEL_TYPE_VXLAN 0x01
>>>>>>> +#include <net/ip_tunnels.h>
>>>>>>> +
>>>>>>> +#define UDP_TUNNEL_TYPE_VXLAN  0x01
>>>>>>> +#define UDP_TUNNEL_TYPE_GENEVE 0x02
>>>>>>>
>>>>>> Why do we need to define these? Caller should know what type of port is
>>>>>> being opened and provide appropriate encap_rcv.
>>>>>
>>>>> Assume udp tunnel layer needs to keep track of open ports, should it
>>>>> also keep track of the protocol associated with the port?
>>>>>
>>>> For what purpose? Other than for offloads and rcv_encap functions that
>>>> provide the service function anyway, what need is there for UDP layer
>>>> to know about this. More to the point, if I add a module to the kernel
>>>> with a new flavor of UDP tunneling, I shouldn't have to touch any core
>>>> code for things to work correctly. So by this line of thinking,
>>>> neither the terms VXLAN nor GENEVE should appear in any common code.
>>>
>>> The hardware will need to know what the header format is so that it
>>> can parse the packets on receive. And since the NIC can't exactly call
>>> into a function pointer like GRO can, I'm not sure that there is a
>>> solution that doesn't involve an identifier that needs to be listed
>>> somewhere. This is a pretty minimal impact - it doesn't actually
>>> appear in the core code.
>>
>> The hardware doesn't *need* to know this, it's must be optional and
>> should have no bearing on the software stack. Suggest to put them in
>> their own header file. Also, as HW features these should appear in
>> NETIF_F_* list so that we can control on a per device level rather to
>> enable this feature (something like how NETIF_F_GSO_* was done).
>>
>> What about support for L2TP/UDP?
>
> The hardware needs some means of knowing what UDP port numbers are used
> for VXLAN and/or GENEVE as the two formats contain subtle differences
> that we have to be ready for on the Rx path as we have to parse out the
> frames.
>
> We already have feature flags controlling the offloads, what the port
> numbers provide is a means for us to determine what Rx packets we should
> parse as tunnels vs standard UDP and which tunnel type we should parse
> it as.
>
Which feature flags control the receive side parsing in the device?

> Thanks,
>
> Alex
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alexander H Duyck July 23, 2014, 2:16 a.m. UTC | #8
On 07/22/2014 04:24 PM, Tom Herbert wrote:
> On Tue, Jul 22, 2014 at 3:55 PM, Alexander Duyck
> <alexander.h.duyck@intel.com> wrote:
>> On 07/22/2014 03:38 PM, Tom Herbert wrote:
>>> On Tue, Jul 22, 2014 at 2:56 PM, Jesse Gross <jesse@nicira.com> wrote:
>>>> On Tue, Jul 22, 2014 at 5:16 PM, Tom Herbert <therbert@google.com> wrote:
>>>>> On Tue, Jul 22, 2014 at 2:02 PM, Andy Zhou <azhou@nicira.com> wrote:
>>>>>> On Tue, Jul 22, 2014 at 12:52 PM, Tom Herbert <therbert@google.com> wrote:
>>>>>>>> --- a/include/net/udp_tunnel.h
>>>>>>>> +++ b/include/net/udp_tunnel.h
>>>>>>>> @@ -1,7 +1,10 @@
>>>>>>>>  #ifndef __NET_UDP_TUNNEL_H
>>>>>>>>  #define __NET_UDP_TUNNEL_H
>>>>>>>>
>>>>>>>> -#define UDP_TUNNEL_TYPE_VXLAN 0x01
>>>>>>>> +#include <net/ip_tunnels.h>
>>>>>>>> +
>>>>>>>> +#define UDP_TUNNEL_TYPE_VXLAN  0x01
>>>>>>>> +#define UDP_TUNNEL_TYPE_GENEVE 0x02
>>>>>>>>
>>>>>>> Why do we need to define these? Caller should know what type of port is
>>>>>>> being opened and provide appropriate encap_rcv.
>>>>>>
>>>>>> Assume udp tunnel layer needs to keep track of open ports, should it
>>>>>> also keep track of the protocol associated with the port?
>>>>>>
>>>>> For what purpose? Other than for offloads and rcv_encap functions that
>>>>> provide the service function anyway, what need is there for UDP layer
>>>>> to know about this. More to the point, if I add a module to the kernel
>>>>> with a new flavor of UDP tunneling, I shouldn't have to touch any core
>>>>> code for things to work correctly. So by this line of thinking,
>>>>> neither the terms VXLAN nor GENEVE should appear in any common code.
>>>>
>>>> The hardware will need to know what the header format is so that it
>>>> can parse the packets on receive. And since the NIC can't exactly call
>>>> into a function pointer like GRO can, I'm not sure that there is a
>>>> solution that doesn't involve an identifier that needs to be listed
>>>> somewhere. This is a pretty minimal impact - it doesn't actually
>>>> appear in the core code.
>>>
>>> The hardware doesn't *need* to know this, it's must be optional and
>>> should have no bearing on the software stack. Suggest to put them in
>>> their own header file. Also, as HW features these should appear in
>>> NETIF_F_* list so that we can control on a per device level rather to
>>> enable this feature (something like how NETIF_F_GSO_* was done).
>>>
>>> What about support for L2TP/UDP?
>>
>> The hardware needs some means of knowing what UDP port numbers are used
>> for VXLAN and/or GENEVE as the two formats contain subtle differences
>> that we have to be ready for on the Rx path as we have to parse out the
>> frames.
>>
>> We already have feature flags controlling the offloads, what the port
>> numbers provide is a means for us to determine what Rx packets we should
>> parse as tunnels vs standard UDP and which tunnel type we should parse
>> it as.
>>
> Which feature flags control the receive side parsing in the device?

The only real features that need the port info are Rx hash and Rx
checksum.  If those are disabled then there shouldn't be any need for
the port numbers.  I don't recall if you can disable them separately
from the non-tunnel case though.  I believe they are linked to the
standard offloads.

Thanks,

Alex



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Herbert July 23, 2014, 3:53 a.m. UTC | #9
>> Which feature flags control the receive side parsing in the device?
>
> The only real features that need the port info are Rx hash and Rx
> checksum.  If those are disabled then there shouldn't be any need for
> the port numbers.  I don't recall if you can disable them separately
> from the non-tunnel case though.  I believe they are linked to the
> standard offloads.
>
Rx hash is unnecessary consideration because we can derive that from
UDP header. The fact that we can deduce a reasonable hash is a major
rationale of UDP encapsulation. We will need drivers to start
enabling/supporting UDP RSS and providing RX hash to realize full
benefits of this.

Rx checksum is also an unnecessary consideration if devices return
CHECKSUM_COMPLETE instead of CHECKSUM_UNNECESSARY. Pretty much
anything can (and probably will) be encapsulated in UDP (VXLAN, GRE,
MPLS, L2TP, IPIP, SIT, etc.), so if your hardware provides
CHECKSUM_COMPLETE this immediately gives us easy calculation the
embedded checksums no matter how many encapsulation layers there are.

Another need for parsing UDP contents would be for LRO. This would
require implementation of each encapsulation format supported. I
believe that LRO pretty much deprecated, so maybe this is not an issue
either.

Are there any other cases where HW needs to know about port? Is this
needed for those devices that provide SRIOV?

Tom

> Thanks,
>
> Alex
>
>
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jesse Gross July 23, 2014, 4:35 a.m. UTC | #10
On Tue, Jul 22, 2014 at 11:53 PM, Tom Herbert <therbert@google.com> wrote:
>>> Which feature flags control the receive side parsing in the device?
>>
>> The only real features that need the port info are Rx hash and Rx
>> checksum.  If those are disabled then there shouldn't be any need for
>> the port numbers.  I don't recall if you can disable them separately
>> from the non-tunnel case though.  I believe they are linked to the
>> standard offloads.
>>
> Rx hash is unnecessary consideration because we can derive that from
> UDP header. The fact that we can deduce a reasonable hash is a major
> rationale of UDP encapsulation. We will need drivers to start
> enabling/supporting UDP RSS and providing RX hash to realize full
> benefits of this.

That's true for basic hashing but for more sophisticated things like
flow steering or sending OAM packets to control queues the hardware
still needs to be able to look into the header.

> Rx checksum is also an unnecessary consideration if devices return
> CHECKSUM_COMPLETE instead of CHECKSUM_UNNECESSARY. Pretty much
> anything can (and probably will) be encapsulated in UDP (VXLAN, GRE,
> MPLS, L2TP, IPIP, SIT, etc.), so if your hardware provides
> CHECKSUM_COMPLETE this immediately gives us easy calculation the
> embedded checksums no matter how many encapsulation layers there are.

This property only applies to ones-complement checksums though. If I
recall correctly, I believe you have a desire for something stronger
:)

> Another need for parsing UDP contents would be for LRO. This would
> require implementation of each encapsulation format supported. I
> believe that LRO pretty much deprecated, so maybe this is not an issue
> either.

I think only the old style of LRO is deprecated. Some drivers provide
"GRO" where the hardware supplies the original MSS and that works OK.

Some of these are obviously future looking but I think that means that
even if you got your desired changes, the use of the UDP port on
receive would only shift, not go away.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Herbert July 23, 2014, 3:45 p.m. UTC | #11
On Tue, Jul 22, 2014 at 9:35 PM, Jesse Gross <jesse@nicira.com> wrote:
> On Tue, Jul 22, 2014 at 11:53 PM, Tom Herbert <therbert@google.com> wrote:
>>>> Which feature flags control the receive side parsing in the device?
>>>
>>> The only real features that need the port info are Rx hash and Rx
>>> checksum.  If those are disabled then there shouldn't be any need for
>>> the port numbers.  I don't recall if you can disable them separately
>>> from the non-tunnel case though.  I believe they are linked to the
>>> standard offloads.
>>>
>> Rx hash is unnecessary consideration because we can derive that from
>> UDP header. The fact that we can deduce a reasonable hash is a major
>> rationale of UDP encapsulation. We will need drivers to start
>> enabling/supporting UDP RSS and providing RX hash to realize full
>> benefits of this.
>
> That's true for basic hashing but for more sophisticated things like
> flow steering or sending OAM packets to control queues the hardware
> still needs to be able to look into the header.
>
Flow steering (aRFS, FlowDirector, ECMP in network) will work just
fine based on UDP header-- again this is a fundamental property in UDP
encapsulation. If you need to implement mechanisms that require
parsing of the encapsulated headers, then it's better to make this
part of RX filtering.

We already have a mess with the all the GSO protocol variants for
different protocols because no one has defined a generic TSO
mechanism, let's avoid repeating that for RX.

>> Rx checksum is also an unnecessary consideration if devices return
>> CHECKSUM_COMPLETE instead of CHECKSUM_UNNECESSARY. Pretty much
>> anything can (and probably will) be encapsulated in UDP (VXLAN, GRE,
>> MPLS, L2TP, IPIP, SIT, etc.), so if your hardware provides
>> CHECKSUM_COMPLETE this immediately gives us easy calculation the
>> embedded checksums no matter how many encapsulation layers there are.
>
> This property only applies to ones-complement checksums though. If I
> recall correctly, I believe you have a desire for something stronger
> :)

True, I desire full line rate encryption of all packets :-). In order
to the do this efficiently and generically we will want to do
something like ESP/UDP to keep the flow hash visible. So this is one
valid case where we'd need to configure the HW with a UDP port if it
is to do decrypt.

btw, Geneve draft allows for non-zero UDP checksums to be ignored like
in VXLAN-- this is a violation of UDP standard :-(. We will not do
this in the stack, but it opens the possibility that HW may tell us
checksum is okay when it actually isn't. Accepting
CHECKSUM_UNNECESSARY from all these devices is quite the leap of faith
we're taking!

>
>> Another need for parsing UDP contents would be for LRO. This would
>> require implementation of each encapsulation format supported. I
>> believe that LRO pretty much deprecated, so maybe this is not an issue
>> either.
>
> I think only the old style of LRO is deprecated. Some drivers provide
> "GRO" where the hardware supplies the original MSS and that works OK.
>
> Some of these are obviously future looking but I think that means that
> even if you got your desired changes, the use of the UDP port on
> receive would only shift, not go away.

I think your hitting the major point that we have to be future
looking. When hardware hardwire specific protocols instead of using
generic mechanisms, we become pigeonholed-- this is *not* future
looking and in the long run it's a disservice to customers if we
advocate this in the stack. Consider that geneve is likely superior to
VXLAN because it is extensible, but that VXLAN may still win since it
is already "supported" in so much HW.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Herbert July 23, 2014, 7:57 p.m. UTC | #12
On Tue, Jul 22, 2014 at 3:19 AM, Andy Zhou <azhou@nicira.com> wrote:
> Added create_udp_tunnel_socket(), packet receive and transmit,  and
> other related common functions for UDP tunnels.
>
> Per net open UDP tunnel ports are tracked in this common layer to
> prevent sharing of a single port with more than one UDP tunnel.
>
> Signed-off-by: Andy Zhou <azhou@nicira.com>
> ---
>  include/net/udp_tunnel.h |   57 +++++++++-
>  net/ipv4/udp_tunnel.c    |  257 +++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 312 insertions(+), 2 deletions(-)
>
> diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
> index 3f34c65..b5e815a 100644
> --- a/include/net/udp_tunnel.h
> +++ b/include/net/udp_tunnel.h
> @@ -1,7 +1,10 @@
>  #ifndef __NET_UDP_TUNNEL_H
>  #define __NET_UDP_TUNNEL_H
>
> -#define UDP_TUNNEL_TYPE_VXLAN 0x01
> +#include <net/ip_tunnels.h>
> +
> +#define UDP_TUNNEL_TYPE_VXLAN  0x01
> +#define UDP_TUNNEL_TYPE_GENEVE 0x02
>
>  struct udp_port_cfg {
>         u8                      family;
> @@ -28,7 +31,59 @@ struct udp_port_cfg {
>                                 use_udp6_rx_checksums:1;
>  };
>
> +struct udp_tunnel_sock;
> +
> +typedef void (udp_tunnel_rcv_t)(struct udp_tunnel_sock *uts,
> +                               struct sk_buff *skb, ...);
> +
> +typedef int (udp_tunnel_encap_rcv_t)(struct sock *sk, struct sk_buff *skb);
> +
> +struct udp_tunnel_socket_cfg {
> +       u8 tunnel_type;
> +       struct udp_port_cfg port;
> +       udp_tunnel_rcv_t *rcv;
> +       udp_tunnel_encap_rcv_t *encap_rcv;

Why do you need two receive functions or udp_tunnel_rcv_t?

> +       void *data;

Similarly, why is this needed when we already have sk_user_data?

> +};
> +
> +struct udp_tunnel_sock {
> +       u8 tunnel_type;
> +       struct hlist_node hlist;
> +       udp_tunnel_rcv_t *rcv;
> +       void *data;
> +       struct socket *sock;
> +};
> +
>  int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
>                     struct socket **sockp);
>
> +struct udp_tunnel_sock *create_udp_tunnel_socket(struct net *net, size_t size,
> +                                                struct udp_tunnel_socket_cfg
> +                                                       *socket_cfg);
> +
> +struct udp_tunnel_sock *udp_tunnel_find_sock(struct net *net, __be16 port);
> +
> +int udp_tunnel_xmit_skb(struct socket *sock, struct rtable *rt,
> +                       struct sk_buff *skb, __be32 src, __be32 dst,
> +                       __u8 tos, __u8 ttl, __be16 df, __be16 src_port,
> +                       __be16 dst_port, bool xnet);
> +
> +#if IS_ENABLED(CONFIG_IPV6)
> +int udp_tunnel6_xmit_skb(struct socket *sock, struct dst_entry *dst,
> +               struct sk_buff *skb, struct net_device *dev,
> +               struct in6_addr *saddr, struct in6_addr *daddr,
> +               __u8 prio, __u8 ttl, __be16 src_port, __be16 dst_port);
> +
> +#endif
> +
> +void udp_tunnel_sock_release(struct udp_tunnel_sock *uts);
> +void udp_tunnel_get_rx_port(struct net_device *dev);
> +
> +static inline struct sk_buff *udp_tunnel_handle_offloads(struct sk_buff *skb,
> +                                                        bool udp_csum)
> +{
> +       int type = udp_csum ? SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
> +
> +       return iptunnel_handle_offloads(skb, udp_csum, type);
> +}
>  #endif
> diff --git a/net/ipv4/udp_tunnel.c b/net/ipv4/udp_tunnel.c
> index 61ec1a6..3c14b16 100644
> --- a/net/ipv4/udp_tunnel.c
> +++ b/net/ipv4/udp_tunnel.c
> @@ -7,6 +7,23 @@
>  #include <net/udp.h>
>  #include <net/udp_tunnel.h>
>  #include <net/net_namespace.h>
> +#include <net/netns/generic.h>
> +#if IS_ENABLED(CONFIG_IPV6)
> +#include <net/ipv6.h>
> +#include <net/addrconf.h>
> +#include <net/ip6_tunnel.h>
> +#include <net/ip6_checksum.h>
> +#endif
> +
> +#define PORT_HASH_BITS 8
> +#define PORT_HASH_SIZE (1 << PORT_HASH_BITS)
> +
> +static int udp_tunnel_net_id;
> +
> +struct udp_tunnel_net {
> +       struct hlist_head sock_list[PORT_HASH_SIZE];
> +       spinlock_t  sock_lock;   /* Protecting the sock_list */
> +};
>
>  int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
>                     struct socket **sockp)
> @@ -82,7 +99,6 @@ int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
>                 return -EPFNOSUPPORT;
>         }
>
> -
>         *sockp = sock;
>
>         return 0;
> @@ -97,4 +113,243 @@ error:
>  }
>  EXPORT_SYMBOL(udp_sock_create);
>
> +
> +/* Socket hash table head */
> +static inline struct hlist_head *uts_head(struct net *net, const __be16 port)
> +{
> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
> +
> +       return &utn->sock_list[hash_32(ntohs(port), PORT_HASH_BITS)];
> +}
> +
> +static int handle_offloads(struct sk_buff *skb)
> +{
> +       if (skb_is_gso(skb)) {
> +               int err = skb_unclone(skb, GFP_ATOMIC);
> +
> +               if (unlikely(err))
> +                       return err;
> +               skb_shinfo(skb)->gso_type |= SKB_GSO_UDP_TUNNEL;
> +       } else {
> +               if (skb->ip_summed != CHECKSUM_PARTIAL)
> +                       skb->ip_summed = CHECKSUM_NONE;
> +       }
> +
> +       return 0;
> +}
> +
> +struct udp_tunnel_sock *create_udp_tunnel_socket(struct net *net, size_t size,
> +                                                struct udp_tunnel_socket_cfg
> +                                                       *cfg)
> +{
> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
> +       struct udp_tunnel_sock *uts;
> +       struct socket *sock;
> +       struct sock *sk;
> +       const __be16 port = cfg->port.local_udp_port;
> +       const int ipv6 = (cfg->port.family == AF_INET6);
> +       int err;
> +
> +       uts = kzalloc(size, GFP_KERNEL);
> +       if (!uts)
> +               return ERR_PTR(-ENOMEM);
> +
> +       err = udp_sock_create(net, &cfg->port, &sock);
> +       if (err < 0) {
> +               kfree(uts);
> +               return NULL;
> +       }
> +
> +       /* Disable multicast loopback */
> +       inet_sk(sock->sk)->mc_loop = 0;
> +
> +       uts->sock = sock;
> +       sk = sock->sk;
> +       uts->rcv = cfg->rcv;
> +       uts->data = cfg->data;
> +       rcu_assign_sk_user_data(sock->sk, uts);
> +
> +       spin_lock(&utn->sock_lock);
> +       hlist_add_head_rcu(&uts->hlist, uts_head(net, port));
> +       spin_unlock(&utn->sock_lock);
> +
> +       udp_sk(sk)->encap_type = 1;
> +       udp_sk(sk)->encap_rcv = cfg->encap_rcv;
> +
> +#if IS_ENABLED(CONFIG_IPV6)
> +       if (ipv6)
> +               ipv6_stub->udpv6_encap_enable();
> +       else
> +#endif
> +               udp_encap_enable();
> +
> +       return uts;
> +}
> +EXPORT_SYMBOL_GPL(create_udp_tunnel_socket);
> +
> +int udp_tunnel_xmit_skb(struct socket *sock, struct rtable *rt,
> +                       struct sk_buff *skb, __be32 src, __be32 dst,
> +                       __u8 tos, __u8 ttl, __be16 df, __be16 src_port,
> +                       __be16 dst_port, bool xnet)
> +{
> +       struct udphdr *uh;
> +
> +       __skb_push(skb, sizeof(*uh));
> +       skb_reset_transport_header(skb);
> +       uh = udp_hdr(skb);
> +
> +       uh->dest = dst_port;
> +       uh->source = src_port;
> +       uh->len = htons(skb->len);
> +
> +       udp_set_csum(sock->sk->sk_no_check_tx, skb, src, dst, skb->len);
> +
> +       return iptunnel_xmit(sock->sk, rt, skb, src, dst, IPPROTO_UDP,
> +                            tos, ttl, df, xnet);
> +}
> +EXPORT_SYMBOL_GPL(udp_tunnel_xmit_skb);
> +
> +#if IS_ENABLED(CONFIG_IPV6)
> +int udp_tunnel6_xmit_skb(struct socket *sock, struct dst_entry *dst,
> +                        struct sk_buff *skb, struct net_device *dev,
> +                        struct in6_addr *saddr, struct in6_addr *daddr,
> +                        __u8 prio, __u8 ttl, __be16 src_port, __be16 dst_port)
> +{
> +       struct udphdr *uh;
> +       struct ipv6hdr *ip6h;
> +       int err;
> +
> +       __skb_push(skb, sizeof(*uh));
> +       skb_reset_transport_header(skb);
> +       uh = udp_hdr(skb);
> +
> +       uh->dest = dst_port;
> +       uh->source = src_port;
> +
> +       uh->len = htons(skb->len);
> +       uh->check = 0;
> +
> +       memset(&(IPCB(skb)->opt), 0, sizeof(IPCB(skb)->opt));
> +       IPCB(skb)->flags &= ~(IPSKB_XFRM_TUNNEL_SIZE | IPSKB_XFRM_TRANSFORMED
> +                           | IPSKB_REROUTED);
> +       skb_dst_set(skb, dst);
> +
> +       if (!skb_is_gso(skb) && !(dst->dev->features & NETIF_F_IPV6_CSUM)) {
> +               __wsum csum = skb_checksum(skb, 0, skb->len, 0);
> +
> +               skb->ip_summed = CHECKSUM_UNNECESSARY;
> +               uh->check = csum_ipv6_magic(saddr, daddr, skb->len,
> +                               IPPROTO_UDP, csum);
> +               if (uh->check == 0)
> +                       uh->check = CSUM_MANGLED_0;
> +       } else {
> +               skb->ip_summed = CHECKSUM_PARTIAL;
> +               skb->csum_start = skb_transport_header(skb) - skb->head;
> +               skb->csum_offset = offsetof(struct udphdr, check);
> +               uh->check = ~csum_ipv6_magic(saddr, daddr,
> +                               skb->len, IPPROTO_UDP, 0);
> +       }
> +
> +       __skb_push(skb, sizeof(*ip6h));
> +       skb_reset_network_header(skb);
> +       ip6h              = ipv6_hdr(skb);
> +       ip6h->version     = 6;
> +       ip6h->priority    = prio;
> +       ip6h->flow_lbl[0] = 0;
> +       ip6h->flow_lbl[1] = 0;
> +       ip6h->flow_lbl[2] = 0;
> +       ip6h->payload_len = htons(skb->len);
> +       ip6h->nexthdr     = IPPROTO_UDP;
> +       ip6h->hop_limit   = ttl;
> +       ip6h->daddr       = *daddr;
> +       ip6h->saddr       = *saddr;
> +
> +       err = handle_offloads(skb);
> +       if (err)
> +               return err;
> +
> +       ip6tunnel_xmit(skb, dev);
> +       return 0;
> +}
> +EXPORT_SYMBOL_GPL(udp_tunnel6_xmit_skb);
> +#endif
> +
> +struct udp_tunnel_sock *udp_tunnel_find_sock(struct net *net, __be16 port)
> +{
> +       struct udp_tunnel_sock *uts;
> +
> +       hlist_for_each_entry_rcu(uts, uts_head(net, port), hlist) {
> +               if (inet_sk(uts->sock->sk)->inet_sport == port)
> +                       return uts;
> +       }
> +
> +       return NULL;
> +}
> +EXPORT_SYMBOL_GPL(udp_tunnel_find_sock);
> +
> +void udp_tunnel_sock_release(struct udp_tunnel_sock *uts)
> +{
> +       struct sock *sk = uts->sock->sk;
> +       struct net *net = sock_net(sk);
> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
> +
> +       spin_lock(&utn->sock_lock);
> +       hlist_del_rcu(&uts->hlist);
> +       rcu_assign_sk_user_data(uts->sock->sk, NULL);
> +       spin_unlock(&utn->sock_lock);
> +}
> +EXPORT_SYMBOL_GPL(udp_tunnel_sock_release);
> +
> +/* Calls the ndo_add_tunnel_port of the caller in order to
> + * supply the listening VXLAN udp ports. Callers are expected
> + * to implement the ndo_add_tunnle_port.
> + */
> +void udp_tunnel_get_rx_port(struct net_device *dev)
> +{
> +       struct udp_tunnel_sock *uts;
> +       struct net *net = dev_net(dev);
> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
> +       sa_family_t sa_family;
> +       __be16 port;
> +       unsigned int i;
> +
> +       spin_lock(&utn->sock_lock);
> +       for (i = 0; i < PORT_HASH_SIZE; ++i) {
> +               hlist_for_each_entry_rcu(uts, &utn->sock_list[i], hlist) {
> +                       port = inet_sk(uts->sock->sk)->inet_sport;
> +                       sa_family = uts->sock->sk->sk_family;
> +                       dev->netdev_ops->ndo_add_udp_tunnel_port(dev,
> +                                       sa_family, port, uts->tunnel_type);
> +               }
> +       }
> +       spin_unlock(&utn->sock_lock);
> +}
> +EXPORT_SYMBOL_GPL(udp_tunnel_get_rx_port);
> +
> +static int __net_init udp_tunnel_init_net(struct net *net)
> +{
> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
> +       unsigned int h;
> +
> +       spin_lock_init(&utn->sock_lock);
> +
> +       for (h = 0; h < PORT_HASH_SIZE; h++)
> +               INIT_HLIST_HEAD(&utn->sock_list[h]);
> +
> +       return 0;
> +}
> +
> +static struct pernet_operations udp_tunnel_net_ops = {
> +       .init = udp_tunnel_init_net,
> +       .exit = NULL,
> +       .id = &udp_tunnel_net_id,
> +       .size = sizeof(struct udp_tunnel_net),
> +};
> +
> +static int __init udp_tunnel_init(void)
> +{
> +       return register_pernet_subsys(&udp_tunnel_net_ops);
> +}
> +late_initcall(udp_tunnel_init);
> +
>  MODULE_LICENSE("GPL");
> --
> 1.7.9.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jesse Gross July 24, 2014, 3:24 a.m. UTC | #13
On Wed, Jul 23, 2014 at 11:45 AM, Tom Herbert <therbert@google.com> wrote:
> On Tue, Jul 22, 2014 at 9:35 PM, Jesse Gross <jesse@nicira.com> wrote:
>> On Tue, Jul 22, 2014 at 11:53 PM, Tom Herbert <therbert@google.com> wrote:
>>>>> Which feature flags control the receive side parsing in the device?
>>>>
>>>> The only real features that need the port info are Rx hash and Rx
>>>> checksum.  If those are disabled then there shouldn't be any need for
>>>> the port numbers.  I don't recall if you can disable them separately
>>>> from the non-tunnel case though.  I believe they are linked to the
>>>> standard offloads.
>>>>
>>> Rx hash is unnecessary consideration because we can derive that from
>>> UDP header. The fact that we can deduce a reasonable hash is a major
>>> rationale of UDP encapsulation. We will need drivers to start
>>> enabling/supporting UDP RSS and providing RX hash to realize full
>>> benefits of this.
>>
>> That's true for basic hashing but for more sophisticated things like
>> flow steering or sending OAM packets to control queues the hardware
>> still needs to be able to look into the header.
>>
> Flow steering (aRFS, FlowDirector, ECMP in network) will work just
> fine based on UDP header-- again this is a fundamental property in UDP
> encapsulation. If you need to implement mechanisms that require
> parsing of the encapsulated headers, then it's better to make this
> part of RX filtering.

Sure, it can operate on the UDP hash but I would argue that it works
better if you actually look into the packet. Using the hash is either
going to just randomly spread traffic or require you to track hashes
and direct them to particular places for established connections.
However, depending on the situation this may not really be optimal
compared to, say, steering based on inner MAC address.

But in reality, whether it is for steering or filtering these
operations are pretty similar to me, just different goals.

> btw, Geneve draft allows for non-zero UDP checksums to be ignored like
> in VXLAN-- this is a violation of UDP standard :-(. We will not do
> this in the stack, but it opens the possibility that HW may tell us
> checksum is okay when it actually isn't. Accepting
> CHECKSUM_UNNECESSARY from all these devices is quite the leap of faith
> we're taking!

This is actually not the intention but I see that the wording of the
draft is poor. I'll see if I can improve it to avoid this situation.

>> Some of these are obviously future looking but I think that means that
>> even if you got your desired changes, the use of the UDP port on
>> receive would only shift, not go away.
>
> I think your hitting the major point that we have to be future
> looking. When hardware hardwire specific protocols instead of using
> generic mechanisms, we become pigeonholed-- this is *not* future
> looking and in the long run it's a disservice to customers if we
> advocate this in the stack. Consider that geneve is likely superior to
> VXLAN because it is extensible, but that VXLAN may still win since it
> is already "supported" in so much HW.

I understand your goal but I'm not really sure what solution you are
proposing. There are obviously ways that the stack can be made more
generic from where it is today but I think we agree that at least some
things will require protocol knowledge.

Geneve (and GUE) are trying to solve this by having a protocol that is
generic - hardware will still need to have support for a specific
protocol but at least that can support many uses. However, this
doesn't seem to be what you are getting at since it's not true of
VXLAN, particularly if you are concerned with deployed hardware.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andy Zhou July 24, 2014, 8:23 p.m. UTC | #14
The general layering I see is  tunnel_user (i.e. OVS) -> tuunel_driver
(i.e. vxlan) -> udp_tunnel.

The two receive functions are from two separate layers above
udp_tunnel. I can restructure the APIs to make it
cleaner.

On Wed, Jul 23, 2014 at 12:57 PM, Tom Herbert <therbert@google.com> wrote:
> On Tue, Jul 22, 2014 at 3:19 AM, Andy Zhou <azhou@nicira.com> wrote:
>> Added create_udp_tunnel_socket(), packet receive and transmit,  and
>> other related common functions for UDP tunnels.
>>
>> Per net open UDP tunnel ports are tracked in this common layer to
>> prevent sharing of a single port with more than one UDP tunnel.
>>
>> Signed-off-by: Andy Zhou <azhou@nicira.com>
>> ---
>>  include/net/udp_tunnel.h |   57 +++++++++-
>>  net/ipv4/udp_tunnel.c    |  257 +++++++++++++++++++++++++++++++++++++++++++++-
>>  2 files changed, 312 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
>> index 3f34c65..b5e815a 100644
>> --- a/include/net/udp_tunnel.h
>> +++ b/include/net/udp_tunnel.h
>> @@ -1,7 +1,10 @@
>>  #ifndef __NET_UDP_TUNNEL_H
>>  #define __NET_UDP_TUNNEL_H
>>
>> -#define UDP_TUNNEL_TYPE_VXLAN 0x01
>> +#include <net/ip_tunnels.h>
>> +
>> +#define UDP_TUNNEL_TYPE_VXLAN  0x01
>> +#define UDP_TUNNEL_TYPE_GENEVE 0x02
>>
>>  struct udp_port_cfg {
>>         u8                      family;
>> @@ -28,7 +31,59 @@ struct udp_port_cfg {
>>                                 use_udp6_rx_checksums:1;
>>  };
>>
>> +struct udp_tunnel_sock;
>> +
>> +typedef void (udp_tunnel_rcv_t)(struct udp_tunnel_sock *uts,
>> +                               struct sk_buff *skb, ...);
>> +
>> +typedef int (udp_tunnel_encap_rcv_t)(struct sock *sk, struct sk_buff *skb);
>> +
>> +struct udp_tunnel_socket_cfg {
>> +       u8 tunnel_type;
>> +       struct udp_port_cfg port;
>> +       udp_tunnel_rcv_t *rcv;
>> +       udp_tunnel_encap_rcv_t *encap_rcv;
>
> Why do you need two receive functions or udp_tunnel_rcv_t?
>
>> +       void *data;
>
> Similarly, why is this needed when we already have sk_user_data?
>
>> +};
>> +
>> +struct udp_tunnel_sock {
>> +       u8 tunnel_type;
>> +       struct hlist_node hlist;
>> +       udp_tunnel_rcv_t *rcv;
>> +       void *data;
>> +       struct socket *sock;
>> +};
>> +
>>  int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
>>                     struct socket **sockp);
>>
>> +struct udp_tunnel_sock *create_udp_tunnel_socket(struct net *net, size_t size,
>> +                                                struct udp_tunnel_socket_cfg
>> +                                                       *socket_cfg);
>> +
>> +struct udp_tunnel_sock *udp_tunnel_find_sock(struct net *net, __be16 port);
>> +
>> +int udp_tunnel_xmit_skb(struct socket *sock, struct rtable *rt,
>> +                       struct sk_buff *skb, __be32 src, __be32 dst,
>> +                       __u8 tos, __u8 ttl, __be16 df, __be16 src_port,
>> +                       __be16 dst_port, bool xnet);
>> +
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +int udp_tunnel6_xmit_skb(struct socket *sock, struct dst_entry *dst,
>> +               struct sk_buff *skb, struct net_device *dev,
>> +               struct in6_addr *saddr, struct in6_addr *daddr,
>> +               __u8 prio, __u8 ttl, __be16 src_port, __be16 dst_port);
>> +
>> +#endif
>> +
>> +void udp_tunnel_sock_release(struct udp_tunnel_sock *uts);
>> +void udp_tunnel_get_rx_port(struct net_device *dev);
>> +
>> +static inline struct sk_buff *udp_tunnel_handle_offloads(struct sk_buff *skb,
>> +                                                        bool udp_csum)
>> +{
>> +       int type = udp_csum ? SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
>> +
>> +       return iptunnel_handle_offloads(skb, udp_csum, type);
>> +}
>>  #endif
>> diff --git a/net/ipv4/udp_tunnel.c b/net/ipv4/udp_tunnel.c
>> index 61ec1a6..3c14b16 100644
>> --- a/net/ipv4/udp_tunnel.c
>> +++ b/net/ipv4/udp_tunnel.c
>> @@ -7,6 +7,23 @@
>>  #include <net/udp.h>
>>  #include <net/udp_tunnel.h>
>>  #include <net/net_namespace.h>
>> +#include <net/netns/generic.h>
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +#include <net/ipv6.h>
>> +#include <net/addrconf.h>
>> +#include <net/ip6_tunnel.h>
>> +#include <net/ip6_checksum.h>
>> +#endif
>> +
>> +#define PORT_HASH_BITS 8
>> +#define PORT_HASH_SIZE (1 << PORT_HASH_BITS)
>> +
>> +static int udp_tunnel_net_id;
>> +
>> +struct udp_tunnel_net {
>> +       struct hlist_head sock_list[PORT_HASH_SIZE];
>> +       spinlock_t  sock_lock;   /* Protecting the sock_list */
>> +};
>>
>>  int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
>>                     struct socket **sockp)
>> @@ -82,7 +99,6 @@ int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
>>                 return -EPFNOSUPPORT;
>>         }
>>
>> -
>>         *sockp = sock;
>>
>>         return 0;
>> @@ -97,4 +113,243 @@ error:
>>  }
>>  EXPORT_SYMBOL(udp_sock_create);
>>
>> +
>> +/* Socket hash table head */
>> +static inline struct hlist_head *uts_head(struct net *net, const __be16 port)
>> +{
>> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
>> +
>> +       return &utn->sock_list[hash_32(ntohs(port), PORT_HASH_BITS)];
>> +}
>> +
>> +static int handle_offloads(struct sk_buff *skb)
>> +{
>> +       if (skb_is_gso(skb)) {
>> +               int err = skb_unclone(skb, GFP_ATOMIC);
>> +
>> +               if (unlikely(err))
>> +                       return err;
>> +               skb_shinfo(skb)->gso_type |= SKB_GSO_UDP_TUNNEL;
>> +       } else {
>> +               if (skb->ip_summed != CHECKSUM_PARTIAL)
>> +                       skb->ip_summed = CHECKSUM_NONE;
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +struct udp_tunnel_sock *create_udp_tunnel_socket(struct net *net, size_t size,
>> +                                                struct udp_tunnel_socket_cfg
>> +                                                       *cfg)
>> +{
>> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
>> +       struct udp_tunnel_sock *uts;
>> +       struct socket *sock;
>> +       struct sock *sk;
>> +       const __be16 port = cfg->port.local_udp_port;
>> +       const int ipv6 = (cfg->port.family == AF_INET6);
>> +       int err;
>> +
>> +       uts = kzalloc(size, GFP_KERNEL);
>> +       if (!uts)
>> +               return ERR_PTR(-ENOMEM);
>> +
>> +       err = udp_sock_create(net, &cfg->port, &sock);
>> +       if (err < 0) {
>> +               kfree(uts);
>> +               return NULL;
>> +       }
>> +
>> +       /* Disable multicast loopback */
>> +       inet_sk(sock->sk)->mc_loop = 0;
>> +
>> +       uts->sock = sock;
>> +       sk = sock->sk;
>> +       uts->rcv = cfg->rcv;
>> +       uts->data = cfg->data;
>> +       rcu_assign_sk_user_data(sock->sk, uts);
>> +
>> +       spin_lock(&utn->sock_lock);
>> +       hlist_add_head_rcu(&uts->hlist, uts_head(net, port));
>> +       spin_unlock(&utn->sock_lock);
>> +
>> +       udp_sk(sk)->encap_type = 1;
>> +       udp_sk(sk)->encap_rcv = cfg->encap_rcv;
>> +
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +       if (ipv6)
>> +               ipv6_stub->udpv6_encap_enable();
>> +       else
>> +#endif
>> +               udp_encap_enable();
>> +
>> +       return uts;
>> +}
>> +EXPORT_SYMBOL_GPL(create_udp_tunnel_socket);
>> +
>> +int udp_tunnel_xmit_skb(struct socket *sock, struct rtable *rt,
>> +                       struct sk_buff *skb, __be32 src, __be32 dst,
>> +                       __u8 tos, __u8 ttl, __be16 df, __be16 src_port,
>> +                       __be16 dst_port, bool xnet)
>> +{
>> +       struct udphdr *uh;
>> +
>> +       __skb_push(skb, sizeof(*uh));
>> +       skb_reset_transport_header(skb);
>> +       uh = udp_hdr(skb);
>> +
>> +       uh->dest = dst_port;
>> +       uh->source = src_port;
>> +       uh->len = htons(skb->len);
>> +
>> +       udp_set_csum(sock->sk->sk_no_check_tx, skb, src, dst, skb->len);
>> +
>> +       return iptunnel_xmit(sock->sk, rt, skb, src, dst, IPPROTO_UDP,
>> +                            tos, ttl, df, xnet);
>> +}
>> +EXPORT_SYMBOL_GPL(udp_tunnel_xmit_skb);
>> +
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +int udp_tunnel6_xmit_skb(struct socket *sock, struct dst_entry *dst,
>> +                        struct sk_buff *skb, struct net_device *dev,
>> +                        struct in6_addr *saddr, struct in6_addr *daddr,
>> +                        __u8 prio, __u8 ttl, __be16 src_port, __be16 dst_port)
>> +{
>> +       struct udphdr *uh;
>> +       struct ipv6hdr *ip6h;
>> +       int err;
>> +
>> +       __skb_push(skb, sizeof(*uh));
>> +       skb_reset_transport_header(skb);
>> +       uh = udp_hdr(skb);
>> +
>> +       uh->dest = dst_port;
>> +       uh->source = src_port;
>> +
>> +       uh->len = htons(skb->len);
>> +       uh->check = 0;
>> +
>> +       memset(&(IPCB(skb)->opt), 0, sizeof(IPCB(skb)->opt));
>> +       IPCB(skb)->flags &= ~(IPSKB_XFRM_TUNNEL_SIZE | IPSKB_XFRM_TRANSFORMED
>> +                           | IPSKB_REROUTED);
>> +       skb_dst_set(skb, dst);
>> +
>> +       if (!skb_is_gso(skb) && !(dst->dev->features & NETIF_F_IPV6_CSUM)) {
>> +               __wsum csum = skb_checksum(skb, 0, skb->len, 0);
>> +
>> +               skb->ip_summed = CHECKSUM_UNNECESSARY;
>> +               uh->check = csum_ipv6_magic(saddr, daddr, skb->len,
>> +                               IPPROTO_UDP, csum);
>> +               if (uh->check == 0)
>> +                       uh->check = CSUM_MANGLED_0;
>> +       } else {
>> +               skb->ip_summed = CHECKSUM_PARTIAL;
>> +               skb->csum_start = skb_transport_header(skb) - skb->head;
>> +               skb->csum_offset = offsetof(struct udphdr, check);
>> +               uh->check = ~csum_ipv6_magic(saddr, daddr,
>> +                               skb->len, IPPROTO_UDP, 0);
>> +       }
>> +
>> +       __skb_push(skb, sizeof(*ip6h));
>> +       skb_reset_network_header(skb);
>> +       ip6h              = ipv6_hdr(skb);
>> +       ip6h->version     = 6;
>> +       ip6h->priority    = prio;
>> +       ip6h->flow_lbl[0] = 0;
>> +       ip6h->flow_lbl[1] = 0;
>> +       ip6h->flow_lbl[2] = 0;
>> +       ip6h->payload_len = htons(skb->len);
>> +       ip6h->nexthdr     = IPPROTO_UDP;
>> +       ip6h->hop_limit   = ttl;
>> +       ip6h->daddr       = *daddr;
>> +       ip6h->saddr       = *saddr;
>> +
>> +       err = handle_offloads(skb);
>> +       if (err)
>> +               return err;
>> +
>> +       ip6tunnel_xmit(skb, dev);
>> +       return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(udp_tunnel6_xmit_skb);
>> +#endif
>> +
>> +struct udp_tunnel_sock *udp_tunnel_find_sock(struct net *net, __be16 port)
>> +{
>> +       struct udp_tunnel_sock *uts;
>> +
>> +       hlist_for_each_entry_rcu(uts, uts_head(net, port), hlist) {
>> +               if (inet_sk(uts->sock->sk)->inet_sport == port)
>> +                       return uts;
>> +       }
>> +
>> +       return NULL;
>> +}
>> +EXPORT_SYMBOL_GPL(udp_tunnel_find_sock);
>> +
>> +void udp_tunnel_sock_release(struct udp_tunnel_sock *uts)
>> +{
>> +       struct sock *sk = uts->sock->sk;
>> +       struct net *net = sock_net(sk);
>> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
>> +
>> +       spin_lock(&utn->sock_lock);
>> +       hlist_del_rcu(&uts->hlist);
>> +       rcu_assign_sk_user_data(uts->sock->sk, NULL);
>> +       spin_unlock(&utn->sock_lock);
>> +}
>> +EXPORT_SYMBOL_GPL(udp_tunnel_sock_release);
>> +
>> +/* Calls the ndo_add_tunnel_port of the caller in order to
>> + * supply the listening VXLAN udp ports. Callers are expected
>> + * to implement the ndo_add_tunnle_port.
>> + */
>> +void udp_tunnel_get_rx_port(struct net_device *dev)
>> +{
>> +       struct udp_tunnel_sock *uts;
>> +       struct net *net = dev_net(dev);
>> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
>> +       sa_family_t sa_family;
>> +       __be16 port;
>> +       unsigned int i;
>> +
>> +       spin_lock(&utn->sock_lock);
>> +       for (i = 0; i < PORT_HASH_SIZE; ++i) {
>> +               hlist_for_each_entry_rcu(uts, &utn->sock_list[i], hlist) {
>> +                       port = inet_sk(uts->sock->sk)->inet_sport;
>> +                       sa_family = uts->sock->sk->sk_family;
>> +                       dev->netdev_ops->ndo_add_udp_tunnel_port(dev,
>> +                                       sa_family, port, uts->tunnel_type);
>> +               }
>> +       }
>> +       spin_unlock(&utn->sock_lock);
>> +}
>> +EXPORT_SYMBOL_GPL(udp_tunnel_get_rx_port);
>> +
>> +static int __net_init udp_tunnel_init_net(struct net *net)
>> +{
>> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
>> +       unsigned int h;
>> +
>> +       spin_lock_init(&utn->sock_lock);
>> +
>> +       for (h = 0; h < PORT_HASH_SIZE; h++)
>> +               INIT_HLIST_HEAD(&utn->sock_list[h]);
>> +
>> +       return 0;
>> +}
>> +
>> +static struct pernet_operations udp_tunnel_net_ops = {
>> +       .init = udp_tunnel_init_net,
>> +       .exit = NULL,
>> +       .id = &udp_tunnel_net_id,
>> +       .size = sizeof(struct udp_tunnel_net),
>> +};
>> +
>> +static int __init udp_tunnel_init(void)
>> +{
>> +       return register_pernet_subsys(&udp_tunnel_net_ops);
>> +}
>> +late_initcall(udp_tunnel_init);
>> +
>>  MODULE_LICENSE("GPL");
>> --
>> 1.7.9.5
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tom Herbert July 24, 2014, 8:47 p.m. UTC | #15
On Thu, Jul 24, 2014 at 1:23 PM, Andy Zhou <azhou@nicira.com> wrote:
> The general layering I see is  tunnel_user (i.e. OVS) -> tuunel_driver
> (i.e. vxlan) -> udp_tunnel.
>
Simpler and more efficient if you stick with UDP->UDP_encap_handler as
the most general model for RX.

> The two receive functions are from two separate layers above
> udp_tunnel. I can restructure the APIs to make it
> cleaner.
>
The only necessary function for opening the UDP encap port is the UDP
receive handler (encap receive). If you want to implement more
indirection within your handler then it should be pretty easy to
create another layer of API for that purpose.

> On Wed, Jul 23, 2014 at 12:57 PM, Tom Herbert <therbert@google.com> wrote:
>> On Tue, Jul 22, 2014 at 3:19 AM, Andy Zhou <azhou@nicira.com> wrote:
>>> Added create_udp_tunnel_socket(), packet receive and transmit,  and
>>> other related common functions for UDP tunnels.
>>>
>>> Per net open UDP tunnel ports are tracked in this common layer to
>>> prevent sharing of a single port with more than one UDP tunnel.
>>>
>>> Signed-off-by: Andy Zhou <azhou@nicira.com>
>>> ---
>>>  include/net/udp_tunnel.h |   57 +++++++++-
>>>  net/ipv4/udp_tunnel.c    |  257 +++++++++++++++++++++++++++++++++++++++++++++-
>>>  2 files changed, 312 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
>>> index 3f34c65..b5e815a 100644
>>> --- a/include/net/udp_tunnel.h
>>> +++ b/include/net/udp_tunnel.h
>>> @@ -1,7 +1,10 @@
>>>  #ifndef __NET_UDP_TUNNEL_H
>>>  #define __NET_UDP_TUNNEL_H
>>>
>>> -#define UDP_TUNNEL_TYPE_VXLAN 0x01
>>> +#include <net/ip_tunnels.h>
>>> +
>>> +#define UDP_TUNNEL_TYPE_VXLAN  0x01
>>> +#define UDP_TUNNEL_TYPE_GENEVE 0x02
>>>
>>>  struct udp_port_cfg {
>>>         u8                      family;
>>> @@ -28,7 +31,59 @@ struct udp_port_cfg {
>>>                                 use_udp6_rx_checksums:1;
>>>  };
>>>
>>> +struct udp_tunnel_sock;
>>> +
>>> +typedef void (udp_tunnel_rcv_t)(struct udp_tunnel_sock *uts,
>>> +                               struct sk_buff *skb, ...);
>>> +
>>> +typedef int (udp_tunnel_encap_rcv_t)(struct sock *sk, struct sk_buff *skb);
>>> +
>>> +struct udp_tunnel_socket_cfg {
>>> +       u8 tunnel_type;
>>> +       struct udp_port_cfg port;
>>> +       udp_tunnel_rcv_t *rcv;
>>> +       udp_tunnel_encap_rcv_t *encap_rcv;
>>
>> Why do you need two receive functions or udp_tunnel_rcv_t?
>>
>>> +       void *data;
>>
>> Similarly, why is this needed when we already have sk_user_data?
>>
>>> +};
>>> +
>>> +struct udp_tunnel_sock {
>>> +       u8 tunnel_type;
>>> +       struct hlist_node hlist;
>>> +       udp_tunnel_rcv_t *rcv;
>>> +       void *data;
>>> +       struct socket *sock;
>>> +};
>>> +
>>>  int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
>>>                     struct socket **sockp);
>>>
>>> +struct udp_tunnel_sock *create_udp_tunnel_socket(struct net *net, size_t size,
>>> +                                                struct udp_tunnel_socket_cfg
>>> +                                                       *socket_cfg);
>>> +
>>> +struct udp_tunnel_sock *udp_tunnel_find_sock(struct net *net, __be16 port);
>>> +
>>> +int udp_tunnel_xmit_skb(struct socket *sock, struct rtable *rt,
>>> +                       struct sk_buff *skb, __be32 src, __be32 dst,
>>> +                       __u8 tos, __u8 ttl, __be16 df, __be16 src_port,
>>> +                       __be16 dst_port, bool xnet);
>>> +
>>> +#if IS_ENABLED(CONFIG_IPV6)
>>> +int udp_tunnel6_xmit_skb(struct socket *sock, struct dst_entry *dst,
>>> +               struct sk_buff *skb, struct net_device *dev,
>>> +               struct in6_addr *saddr, struct in6_addr *daddr,
>>> +               __u8 prio, __u8 ttl, __be16 src_port, __be16 dst_port);
>>> +
>>> +#endif
>>> +
>>> +void udp_tunnel_sock_release(struct udp_tunnel_sock *uts);
>>> +void udp_tunnel_get_rx_port(struct net_device *dev);
>>> +
>>> +static inline struct sk_buff *udp_tunnel_handle_offloads(struct sk_buff *skb,
>>> +                                                        bool udp_csum)
>>> +{
>>> +       int type = udp_csum ? SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
>>> +
>>> +       return iptunnel_handle_offloads(skb, udp_csum, type);
>>> +}
>>>  #endif
>>> diff --git a/net/ipv4/udp_tunnel.c b/net/ipv4/udp_tunnel.c
>>> index 61ec1a6..3c14b16 100644
>>> --- a/net/ipv4/udp_tunnel.c
>>> +++ b/net/ipv4/udp_tunnel.c
>>> @@ -7,6 +7,23 @@
>>>  #include <net/udp.h>
>>>  #include <net/udp_tunnel.h>
>>>  #include <net/net_namespace.h>
>>> +#include <net/netns/generic.h>
>>> +#if IS_ENABLED(CONFIG_IPV6)
>>> +#include <net/ipv6.h>
>>> +#include <net/addrconf.h>
>>> +#include <net/ip6_tunnel.h>
>>> +#include <net/ip6_checksum.h>
>>> +#endif
>>> +
>>> +#define PORT_HASH_BITS 8
>>> +#define PORT_HASH_SIZE (1 << PORT_HASH_BITS)
>>> +
>>> +static int udp_tunnel_net_id;
>>> +
>>> +struct udp_tunnel_net {
>>> +       struct hlist_head sock_list[PORT_HASH_SIZE];
>>> +       spinlock_t  sock_lock;   /* Protecting the sock_list */
>>> +};
>>>
>>>  int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
>>>                     struct socket **sockp)
>>> @@ -82,7 +99,6 @@ int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
>>>                 return -EPFNOSUPPORT;
>>>         }
>>>
>>> -
>>>         *sockp = sock;
>>>
>>>         return 0;
>>> @@ -97,4 +113,243 @@ error:
>>>  }
>>>  EXPORT_SYMBOL(udp_sock_create);
>>>
>>> +
>>> +/* Socket hash table head */
>>> +static inline struct hlist_head *uts_head(struct net *net, const __be16 port)
>>> +{
>>> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
>>> +
>>> +       return &utn->sock_list[hash_32(ntohs(port), PORT_HASH_BITS)];
>>> +}
>>> +
>>> +static int handle_offloads(struct sk_buff *skb)
>>> +{
>>> +       if (skb_is_gso(skb)) {
>>> +               int err = skb_unclone(skb, GFP_ATOMIC);
>>> +
>>> +               if (unlikely(err))
>>> +                       return err;
>>> +               skb_shinfo(skb)->gso_type |= SKB_GSO_UDP_TUNNEL;
>>> +       } else {
>>> +               if (skb->ip_summed != CHECKSUM_PARTIAL)
>>> +                       skb->ip_summed = CHECKSUM_NONE;
>>> +       }
>>> +
>>> +       return 0;
>>> +}
>>> +
>>> +struct udp_tunnel_sock *create_udp_tunnel_socket(struct net *net, size_t size,
>>> +                                                struct udp_tunnel_socket_cfg
>>> +                                                       *cfg)
>>> +{
>>> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
>>> +       struct udp_tunnel_sock *uts;
>>> +       struct socket *sock;
>>> +       struct sock *sk;
>>> +       const __be16 port = cfg->port.local_udp_port;
>>> +       const int ipv6 = (cfg->port.family == AF_INET6);
>>> +       int err;
>>> +
>>> +       uts = kzalloc(size, GFP_KERNEL);
>>> +       if (!uts)
>>> +               return ERR_PTR(-ENOMEM);
>>> +
>>> +       err = udp_sock_create(net, &cfg->port, &sock);
>>> +       if (err < 0) {
>>> +               kfree(uts);
>>> +               return NULL;
>>> +       }
>>> +
>>> +       /* Disable multicast loopback */
>>> +       inet_sk(sock->sk)->mc_loop = 0;
>>> +
>>> +       uts->sock = sock;
>>> +       sk = sock->sk;
>>> +       uts->rcv = cfg->rcv;
>>> +       uts->data = cfg->data;
>>> +       rcu_assign_sk_user_data(sock->sk, uts);
>>> +
>>> +       spin_lock(&utn->sock_lock);
>>> +       hlist_add_head_rcu(&uts->hlist, uts_head(net, port));
>>> +       spin_unlock(&utn->sock_lock);
>>> +
>>> +       udp_sk(sk)->encap_type = 1;
>>> +       udp_sk(sk)->encap_rcv = cfg->encap_rcv;
>>> +
>>> +#if IS_ENABLED(CONFIG_IPV6)
>>> +       if (ipv6)
>>> +               ipv6_stub->udpv6_encap_enable();
>>> +       else
>>> +#endif
>>> +               udp_encap_enable();
>>> +
>>> +       return uts;
>>> +}
>>> +EXPORT_SYMBOL_GPL(create_udp_tunnel_socket);
>>> +
>>> +int udp_tunnel_xmit_skb(struct socket *sock, struct rtable *rt,
>>> +                       struct sk_buff *skb, __be32 src, __be32 dst,
>>> +                       __u8 tos, __u8 ttl, __be16 df, __be16 src_port,
>>> +                       __be16 dst_port, bool xnet)
>>> +{
>>> +       struct udphdr *uh;
>>> +
>>> +       __skb_push(skb, sizeof(*uh));
>>> +       skb_reset_transport_header(skb);
>>> +       uh = udp_hdr(skb);
>>> +
>>> +       uh->dest = dst_port;
>>> +       uh->source = src_port;
>>> +       uh->len = htons(skb->len);
>>> +
>>> +       udp_set_csum(sock->sk->sk_no_check_tx, skb, src, dst, skb->len);
>>> +
>>> +       return iptunnel_xmit(sock->sk, rt, skb, src, dst, IPPROTO_UDP,
>>> +                            tos, ttl, df, xnet);
>>> +}
>>> +EXPORT_SYMBOL_GPL(udp_tunnel_xmit_skb);
>>> +
>>> +#if IS_ENABLED(CONFIG_IPV6)
>>> +int udp_tunnel6_xmit_skb(struct socket *sock, struct dst_entry *dst,
>>> +                        struct sk_buff *skb, struct net_device *dev,
>>> +                        struct in6_addr *saddr, struct in6_addr *daddr,
>>> +                        __u8 prio, __u8 ttl, __be16 src_port, __be16 dst_port)
>>> +{
>>> +       struct udphdr *uh;
>>> +       struct ipv6hdr *ip6h;
>>> +       int err;
>>> +
>>> +       __skb_push(skb, sizeof(*uh));
>>> +       skb_reset_transport_header(skb);
>>> +       uh = udp_hdr(skb);
>>> +
>>> +       uh->dest = dst_port;
>>> +       uh->source = src_port;
>>> +
>>> +       uh->len = htons(skb->len);
>>> +       uh->check = 0;
>>> +
>>> +       memset(&(IPCB(skb)->opt), 0, sizeof(IPCB(skb)->opt));
>>> +       IPCB(skb)->flags &= ~(IPSKB_XFRM_TUNNEL_SIZE | IPSKB_XFRM_TRANSFORMED
>>> +                           | IPSKB_REROUTED);
>>> +       skb_dst_set(skb, dst);
>>> +
>>> +       if (!skb_is_gso(skb) && !(dst->dev->features & NETIF_F_IPV6_CSUM)) {
>>> +               __wsum csum = skb_checksum(skb, 0, skb->len, 0);
>>> +
>>> +               skb->ip_summed = CHECKSUM_UNNECESSARY;
>>> +               uh->check = csum_ipv6_magic(saddr, daddr, skb->len,
>>> +                               IPPROTO_UDP, csum);
>>> +               if (uh->check == 0)
>>> +                       uh->check = CSUM_MANGLED_0;
>>> +       } else {
>>> +               skb->ip_summed = CHECKSUM_PARTIAL;
>>> +               skb->csum_start = skb_transport_header(skb) - skb->head;
>>> +               skb->csum_offset = offsetof(struct udphdr, check);
>>> +               uh->check = ~csum_ipv6_magic(saddr, daddr,
>>> +                               skb->len, IPPROTO_UDP, 0);
>>> +       }
>>> +
>>> +       __skb_push(skb, sizeof(*ip6h));
>>> +       skb_reset_network_header(skb);
>>> +       ip6h              = ipv6_hdr(skb);
>>> +       ip6h->version     = 6;
>>> +       ip6h->priority    = prio;
>>> +       ip6h->flow_lbl[0] = 0;
>>> +       ip6h->flow_lbl[1] = 0;
>>> +       ip6h->flow_lbl[2] = 0;
>>> +       ip6h->payload_len = htons(skb->len);
>>> +       ip6h->nexthdr     = IPPROTO_UDP;
>>> +       ip6h->hop_limit   = ttl;
>>> +       ip6h->daddr       = *daddr;
>>> +       ip6h->saddr       = *saddr;
>>> +
>>> +       err = handle_offloads(skb);
>>> +       if (err)
>>> +               return err;
>>> +
>>> +       ip6tunnel_xmit(skb, dev);
>>> +       return 0;
>>> +}
>>> +EXPORT_SYMBOL_GPL(udp_tunnel6_xmit_skb);
>>> +#endif
>>> +
>>> +struct udp_tunnel_sock *udp_tunnel_find_sock(struct net *net, __be16 port)
>>> +{
>>> +       struct udp_tunnel_sock *uts;
>>> +
>>> +       hlist_for_each_entry_rcu(uts, uts_head(net, port), hlist) {
>>> +               if (inet_sk(uts->sock->sk)->inet_sport == port)
>>> +                       return uts;
>>> +       }
>>> +
>>> +       return NULL;
>>> +}
>>> +EXPORT_SYMBOL_GPL(udp_tunnel_find_sock);
>>> +
>>> +void udp_tunnel_sock_release(struct udp_tunnel_sock *uts)
>>> +{
>>> +       struct sock *sk = uts->sock->sk;
>>> +       struct net *net = sock_net(sk);
>>> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
>>> +
>>> +       spin_lock(&utn->sock_lock);
>>> +       hlist_del_rcu(&uts->hlist);
>>> +       rcu_assign_sk_user_data(uts->sock->sk, NULL);
>>> +       spin_unlock(&utn->sock_lock);
>>> +}
>>> +EXPORT_SYMBOL_GPL(udp_tunnel_sock_release);
>>> +
>>> +/* Calls the ndo_add_tunnel_port of the caller in order to
>>> + * supply the listening VXLAN udp ports. Callers are expected
>>> + * to implement the ndo_add_tunnle_port.
>>> + */
>>> +void udp_tunnel_get_rx_port(struct net_device *dev)
>>> +{
>>> +       struct udp_tunnel_sock *uts;
>>> +       struct net *net = dev_net(dev);
>>> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
>>> +       sa_family_t sa_family;
>>> +       __be16 port;
>>> +       unsigned int i;
>>> +
>>> +       spin_lock(&utn->sock_lock);
>>> +       for (i = 0; i < PORT_HASH_SIZE; ++i) {
>>> +               hlist_for_each_entry_rcu(uts, &utn->sock_list[i], hlist) {
>>> +                       port = inet_sk(uts->sock->sk)->inet_sport;
>>> +                       sa_family = uts->sock->sk->sk_family;
>>> +                       dev->netdev_ops->ndo_add_udp_tunnel_port(dev,
>>> +                                       sa_family, port, uts->tunnel_type);
>>> +               }
>>> +       }
>>> +       spin_unlock(&utn->sock_lock);
>>> +}
>>> +EXPORT_SYMBOL_GPL(udp_tunnel_get_rx_port);
>>> +
>>> +static int __net_init udp_tunnel_init_net(struct net *net)
>>> +{
>>> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
>>> +       unsigned int h;
>>> +
>>> +       spin_lock_init(&utn->sock_lock);
>>> +
>>> +       for (h = 0; h < PORT_HASH_SIZE; h++)
>>> +               INIT_HLIST_HEAD(&utn->sock_list[h]);
>>> +
>>> +       return 0;
>>> +}
>>> +
>>> +static struct pernet_operations udp_tunnel_net_ops = {
>>> +       .init = udp_tunnel_init_net,
>>> +       .exit = NULL,
>>> +       .id = &udp_tunnel_net_id,
>>> +       .size = sizeof(struct udp_tunnel_net),
>>> +};
>>> +
>>> +static int __init udp_tunnel_init(void)
>>> +{
>>> +       return register_pernet_subsys(&udp_tunnel_net_ops);
>>> +}
>>> +late_initcall(udp_tunnel_init);
>>> +
>>>  MODULE_LICENSE("GPL");
>>> --
>>> 1.7.9.5
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andy Zhou July 24, 2014, 8:54 p.m. UTC | #16
On Thu, Jul 24, 2014 at 1:47 PM, Tom Herbert <therbert@google.com> wrote:
> On Thu, Jul 24, 2014 at 1:23 PM, Andy Zhou <azhou@nicira.com> wrote:
>> The general layering I see is  tunnel_user (i.e. OVS) -> tuunel_driver
>> (i.e. vxlan) -> udp_tunnel.
>>
> Simpler and more efficient if you stick with UDP->UDP_encap_handler as
> the most general model for RX.

I believe this is the case now. I don't plan to change this. Just not
exposing the
higher layer callback to the udp_tunnel layer.

>
>> The two receive functions are from two separate layers above
>> udp_tunnel. I can restructure the APIs to make it
>> cleaner.
>>
> The only necessary function for opening the UDP encap port is the UDP
> receive handler (encap receive). If you want to implement more
> indirection within your handler then it should be pretty easy to
> create another layer of API for that purpose.
>

Yes, this is the direction I am going towards.

>> On Wed, Jul 23, 2014 at 12:57 PM, Tom Herbert <therbert@google.com> wrote:
>>> On Tue, Jul 22, 2014 at 3:19 AM, Andy Zhou <azhou@nicira.com> wrote:
>>>> Added create_udp_tunnel_socket(), packet receive and transmit,  and
>>>> other related common functions for UDP tunnels.
>>>>
>>>> Per net open UDP tunnel ports are tracked in this common layer to
>>>> prevent sharing of a single port with more than one UDP tunnel.
>>>>
>>>> Signed-off-by: Andy Zhou <azhou@nicira.com>
>>>> ---
>>>>  include/net/udp_tunnel.h |   57 +++++++++-
>>>>  net/ipv4/udp_tunnel.c    |  257 +++++++++++++++++++++++++++++++++++++++++++++-
>>>>  2 files changed, 312 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
>>>> index 3f34c65..b5e815a 100644
>>>> --- a/include/net/udp_tunnel.h
>>>> +++ b/include/net/udp_tunnel.h
>>>> @@ -1,7 +1,10 @@
>>>>  #ifndef __NET_UDP_TUNNEL_H
>>>>  #define __NET_UDP_TUNNEL_H
>>>>
>>>> -#define UDP_TUNNEL_TYPE_VXLAN 0x01
>>>> +#include <net/ip_tunnels.h>
>>>> +
>>>> +#define UDP_TUNNEL_TYPE_VXLAN  0x01
>>>> +#define UDP_TUNNEL_TYPE_GENEVE 0x02
>>>>
>>>>  struct udp_port_cfg {
>>>>         u8                      family;
>>>> @@ -28,7 +31,59 @@ struct udp_port_cfg {
>>>>                                 use_udp6_rx_checksums:1;
>>>>  };
>>>>
>>>> +struct udp_tunnel_sock;
>>>> +
>>>> +typedef void (udp_tunnel_rcv_t)(struct udp_tunnel_sock *uts,
>>>> +                               struct sk_buff *skb, ...);
>>>> +
>>>> +typedef int (udp_tunnel_encap_rcv_t)(struct sock *sk, struct sk_buff *skb);
>>>> +
>>>> +struct udp_tunnel_socket_cfg {
>>>> +       u8 tunnel_type;
>>>> +       struct udp_port_cfg port;
>>>> +       udp_tunnel_rcv_t *rcv;
>>>> +       udp_tunnel_encap_rcv_t *encap_rcv;
>>>
>>> Why do you need two receive functions or udp_tunnel_rcv_t?
>>>
>>>> +       void *data;
>>>
>>> Similarly, why is this needed when we already have sk_user_data?
>>>
>>>> +};
>>>> +
>>>> +struct udp_tunnel_sock {
>>>> +       u8 tunnel_type;
>>>> +       struct hlist_node hlist;
>>>> +       udp_tunnel_rcv_t *rcv;
>>>> +       void *data;
>>>> +       struct socket *sock;
>>>> +};
>>>> +
>>>>  int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
>>>>                     struct socket **sockp);
>>>>
>>>> +struct udp_tunnel_sock *create_udp_tunnel_socket(struct net *net, size_t size,
>>>> +                                                struct udp_tunnel_socket_cfg
>>>> +                                                       *socket_cfg);
>>>> +
>>>> +struct udp_tunnel_sock *udp_tunnel_find_sock(struct net *net, __be16 port);
>>>> +
>>>> +int udp_tunnel_xmit_skb(struct socket *sock, struct rtable *rt,
>>>> +                       struct sk_buff *skb, __be32 src, __be32 dst,
>>>> +                       __u8 tos, __u8 ttl, __be16 df, __be16 src_port,
>>>> +                       __be16 dst_port, bool xnet);
>>>> +
>>>> +#if IS_ENABLED(CONFIG_IPV6)
>>>> +int udp_tunnel6_xmit_skb(struct socket *sock, struct dst_entry *dst,
>>>> +               struct sk_buff *skb, struct net_device *dev,
>>>> +               struct in6_addr *saddr, struct in6_addr *daddr,
>>>> +               __u8 prio, __u8 ttl, __be16 src_port, __be16 dst_port);
>>>> +
>>>> +#endif
>>>> +
>>>> +void udp_tunnel_sock_release(struct udp_tunnel_sock *uts);
>>>> +void udp_tunnel_get_rx_port(struct net_device *dev);
>>>> +
>>>> +static inline struct sk_buff *udp_tunnel_handle_offloads(struct sk_buff *skb,
>>>> +                                                        bool udp_csum)
>>>> +{
>>>> +       int type = udp_csum ? SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
>>>> +
>>>> +       return iptunnel_handle_offloads(skb, udp_csum, type);
>>>> +}
>>>>  #endif
>>>> diff --git a/net/ipv4/udp_tunnel.c b/net/ipv4/udp_tunnel.c
>>>> index 61ec1a6..3c14b16 100644
>>>> --- a/net/ipv4/udp_tunnel.c
>>>> +++ b/net/ipv4/udp_tunnel.c
>>>> @@ -7,6 +7,23 @@
>>>>  #include <net/udp.h>
>>>>  #include <net/udp_tunnel.h>
>>>>  #include <net/net_namespace.h>
>>>> +#include <net/netns/generic.h>
>>>> +#if IS_ENABLED(CONFIG_IPV6)
>>>> +#include <net/ipv6.h>
>>>> +#include <net/addrconf.h>
>>>> +#include <net/ip6_tunnel.h>
>>>> +#include <net/ip6_checksum.h>
>>>> +#endif
>>>> +
>>>> +#define PORT_HASH_BITS 8
>>>> +#define PORT_HASH_SIZE (1 << PORT_HASH_BITS)
>>>> +
>>>> +static int udp_tunnel_net_id;
>>>> +
>>>> +struct udp_tunnel_net {
>>>> +       struct hlist_head sock_list[PORT_HASH_SIZE];
>>>> +       spinlock_t  sock_lock;   /* Protecting the sock_list */
>>>> +};
>>>>
>>>>  int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
>>>>                     struct socket **sockp)
>>>> @@ -82,7 +99,6 @@ int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
>>>>                 return -EPFNOSUPPORT;
>>>>         }
>>>>
>>>> -
>>>>         *sockp = sock;
>>>>
>>>>         return 0;
>>>> @@ -97,4 +113,243 @@ error:
>>>>  }
>>>>  EXPORT_SYMBOL(udp_sock_create);
>>>>
>>>> +
>>>> +/* Socket hash table head */
>>>> +static inline struct hlist_head *uts_head(struct net *net, const __be16 port)
>>>> +{
>>>> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
>>>> +
>>>> +       return &utn->sock_list[hash_32(ntohs(port), PORT_HASH_BITS)];
>>>> +}
>>>> +
>>>> +static int handle_offloads(struct sk_buff *skb)
>>>> +{
>>>> +       if (skb_is_gso(skb)) {
>>>> +               int err = skb_unclone(skb, GFP_ATOMIC);
>>>> +
>>>> +               if (unlikely(err))
>>>> +                       return err;
>>>> +               skb_shinfo(skb)->gso_type |= SKB_GSO_UDP_TUNNEL;
>>>> +       } else {
>>>> +               if (skb->ip_summed != CHECKSUM_PARTIAL)
>>>> +                       skb->ip_summed = CHECKSUM_NONE;
>>>> +       }
>>>> +
>>>> +       return 0;
>>>> +}
>>>> +
>>>> +struct udp_tunnel_sock *create_udp_tunnel_socket(struct net *net, size_t size,
>>>> +                                                struct udp_tunnel_socket_cfg
>>>> +                                                       *cfg)
>>>> +{
>>>> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
>>>> +       struct udp_tunnel_sock *uts;
>>>> +       struct socket *sock;
>>>> +       struct sock *sk;
>>>> +       const __be16 port = cfg->port.local_udp_port;
>>>> +       const int ipv6 = (cfg->port.family == AF_INET6);
>>>> +       int err;
>>>> +
>>>> +       uts = kzalloc(size, GFP_KERNEL);
>>>> +       if (!uts)
>>>> +               return ERR_PTR(-ENOMEM);
>>>> +
>>>> +       err = udp_sock_create(net, &cfg->port, &sock);
>>>> +       if (err < 0) {
>>>> +               kfree(uts);
>>>> +               return NULL;
>>>> +       }
>>>> +
>>>> +       /* Disable multicast loopback */
>>>> +       inet_sk(sock->sk)->mc_loop = 0;
>>>> +
>>>> +       uts->sock = sock;
>>>> +       sk = sock->sk;
>>>> +       uts->rcv = cfg->rcv;
>>>> +       uts->data = cfg->data;
>>>> +       rcu_assign_sk_user_data(sock->sk, uts);
>>>> +
>>>> +       spin_lock(&utn->sock_lock);
>>>> +       hlist_add_head_rcu(&uts->hlist, uts_head(net, port));
>>>> +       spin_unlock(&utn->sock_lock);
>>>> +
>>>> +       udp_sk(sk)->encap_type = 1;
>>>> +       udp_sk(sk)->encap_rcv = cfg->encap_rcv;
>>>> +
>>>> +#if IS_ENABLED(CONFIG_IPV6)
>>>> +       if (ipv6)
>>>> +               ipv6_stub->udpv6_encap_enable();
>>>> +       else
>>>> +#endif
>>>> +               udp_encap_enable();
>>>> +
>>>> +       return uts;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(create_udp_tunnel_socket);
>>>> +
>>>> +int udp_tunnel_xmit_skb(struct socket *sock, struct rtable *rt,
>>>> +                       struct sk_buff *skb, __be32 src, __be32 dst,
>>>> +                       __u8 tos, __u8 ttl, __be16 df, __be16 src_port,
>>>> +                       __be16 dst_port, bool xnet)
>>>> +{
>>>> +       struct udphdr *uh;
>>>> +
>>>> +       __skb_push(skb, sizeof(*uh));
>>>> +       skb_reset_transport_header(skb);
>>>> +       uh = udp_hdr(skb);
>>>> +
>>>> +       uh->dest = dst_port;
>>>> +       uh->source = src_port;
>>>> +       uh->len = htons(skb->len);
>>>> +
>>>> +       udp_set_csum(sock->sk->sk_no_check_tx, skb, src, dst, skb->len);
>>>> +
>>>> +       return iptunnel_xmit(sock->sk, rt, skb, src, dst, IPPROTO_UDP,
>>>> +                            tos, ttl, df, xnet);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(udp_tunnel_xmit_skb);
>>>> +
>>>> +#if IS_ENABLED(CONFIG_IPV6)
>>>> +int udp_tunnel6_xmit_skb(struct socket *sock, struct dst_entry *dst,
>>>> +                        struct sk_buff *skb, struct net_device *dev,
>>>> +                        struct in6_addr *saddr, struct in6_addr *daddr,
>>>> +                        __u8 prio, __u8 ttl, __be16 src_port, __be16 dst_port)
>>>> +{
>>>> +       struct udphdr *uh;
>>>> +       struct ipv6hdr *ip6h;
>>>> +       int err;
>>>> +
>>>> +       __skb_push(skb, sizeof(*uh));
>>>> +       skb_reset_transport_header(skb);
>>>> +       uh = udp_hdr(skb);
>>>> +
>>>> +       uh->dest = dst_port;
>>>> +       uh->source = src_port;
>>>> +
>>>> +       uh->len = htons(skb->len);
>>>> +       uh->check = 0;
>>>> +
>>>> +       memset(&(IPCB(skb)->opt), 0, sizeof(IPCB(skb)->opt));
>>>> +       IPCB(skb)->flags &= ~(IPSKB_XFRM_TUNNEL_SIZE | IPSKB_XFRM_TRANSFORMED
>>>> +                           | IPSKB_REROUTED);
>>>> +       skb_dst_set(skb, dst);
>>>> +
>>>> +       if (!skb_is_gso(skb) && !(dst->dev->features & NETIF_F_IPV6_CSUM)) {
>>>> +               __wsum csum = skb_checksum(skb, 0, skb->len, 0);
>>>> +
>>>> +               skb->ip_summed = CHECKSUM_UNNECESSARY;
>>>> +               uh->check = csum_ipv6_magic(saddr, daddr, skb->len,
>>>> +                               IPPROTO_UDP, csum);
>>>> +               if (uh->check == 0)
>>>> +                       uh->check = CSUM_MANGLED_0;
>>>> +       } else {
>>>> +               skb->ip_summed = CHECKSUM_PARTIAL;
>>>> +               skb->csum_start = skb_transport_header(skb) - skb->head;
>>>> +               skb->csum_offset = offsetof(struct udphdr, check);
>>>> +               uh->check = ~csum_ipv6_magic(saddr, daddr,
>>>> +                               skb->len, IPPROTO_UDP, 0);
>>>> +       }
>>>> +
>>>> +       __skb_push(skb, sizeof(*ip6h));
>>>> +       skb_reset_network_header(skb);
>>>> +       ip6h              = ipv6_hdr(skb);
>>>> +       ip6h->version     = 6;
>>>> +       ip6h->priority    = prio;
>>>> +       ip6h->flow_lbl[0] = 0;
>>>> +       ip6h->flow_lbl[1] = 0;
>>>> +       ip6h->flow_lbl[2] = 0;
>>>> +       ip6h->payload_len = htons(skb->len);
>>>> +       ip6h->nexthdr     = IPPROTO_UDP;
>>>> +       ip6h->hop_limit   = ttl;
>>>> +       ip6h->daddr       = *daddr;
>>>> +       ip6h->saddr       = *saddr;
>>>> +
>>>> +       err = handle_offloads(skb);
>>>> +       if (err)
>>>> +               return err;
>>>> +
>>>> +       ip6tunnel_xmit(skb, dev);
>>>> +       return 0;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(udp_tunnel6_xmit_skb);
>>>> +#endif
>>>> +
>>>> +struct udp_tunnel_sock *udp_tunnel_find_sock(struct net *net, __be16 port)
>>>> +{
>>>> +       struct udp_tunnel_sock *uts;
>>>> +
>>>> +       hlist_for_each_entry_rcu(uts, uts_head(net, port), hlist) {
>>>> +               if (inet_sk(uts->sock->sk)->inet_sport == port)
>>>> +                       return uts;
>>>> +       }
>>>> +
>>>> +       return NULL;
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(udp_tunnel_find_sock);
>>>> +
>>>> +void udp_tunnel_sock_release(struct udp_tunnel_sock *uts)
>>>> +{
>>>> +       struct sock *sk = uts->sock->sk;
>>>> +       struct net *net = sock_net(sk);
>>>> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
>>>> +
>>>> +       spin_lock(&utn->sock_lock);
>>>> +       hlist_del_rcu(&uts->hlist);
>>>> +       rcu_assign_sk_user_data(uts->sock->sk, NULL);
>>>> +       spin_unlock(&utn->sock_lock);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(udp_tunnel_sock_release);
>>>> +
>>>> +/* Calls the ndo_add_tunnel_port of the caller in order to
>>>> + * supply the listening VXLAN udp ports. Callers are expected
>>>> + * to implement the ndo_add_tunnle_port.
>>>> + */
>>>> +void udp_tunnel_get_rx_port(struct net_device *dev)
>>>> +{
>>>> +       struct udp_tunnel_sock *uts;
>>>> +       struct net *net = dev_net(dev);
>>>> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
>>>> +       sa_family_t sa_family;
>>>> +       __be16 port;
>>>> +       unsigned int i;
>>>> +
>>>> +       spin_lock(&utn->sock_lock);
>>>> +       for (i = 0; i < PORT_HASH_SIZE; ++i) {
>>>> +               hlist_for_each_entry_rcu(uts, &utn->sock_list[i], hlist) {
>>>> +                       port = inet_sk(uts->sock->sk)->inet_sport;
>>>> +                       sa_family = uts->sock->sk->sk_family;
>>>> +                       dev->netdev_ops->ndo_add_udp_tunnel_port(dev,
>>>> +                                       sa_family, port, uts->tunnel_type);
>>>> +               }
>>>> +       }
>>>> +       spin_unlock(&utn->sock_lock);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(udp_tunnel_get_rx_port);
>>>> +
>>>> +static int __net_init udp_tunnel_init_net(struct net *net)
>>>> +{
>>>> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
>>>> +       unsigned int h;
>>>> +
>>>> +       spin_lock_init(&utn->sock_lock);
>>>> +
>>>> +       for (h = 0; h < PORT_HASH_SIZE; h++)
>>>> +               INIT_HLIST_HEAD(&utn->sock_list[h]);
>>>> +
>>>> +       return 0;
>>>> +}
>>>> +
>>>> +static struct pernet_operations udp_tunnel_net_ops = {
>>>> +       .init = udp_tunnel_init_net,
>>>> +       .exit = NULL,
>>>> +       .id = &udp_tunnel_net_id,
>>>> +       .size = sizeof(struct udp_tunnel_net),
>>>> +};
>>>> +
>>>> +static int __init udp_tunnel_init(void)
>>>> +{
>>>> +       return register_pernet_subsys(&udp_tunnel_net_ops);
>>>> +}
>>>> +late_initcall(udp_tunnel_init);
>>>> +
>>>>  MODULE_LICENSE("GPL");
>>>> --
>>>> 1.7.9.5
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
index 3f34c65..b5e815a 100644
--- a/include/net/udp_tunnel.h
+++ b/include/net/udp_tunnel.h
@@ -1,7 +1,10 @@ 
 #ifndef __NET_UDP_TUNNEL_H
 #define __NET_UDP_TUNNEL_H
 
-#define UDP_TUNNEL_TYPE_VXLAN 0x01
+#include <net/ip_tunnels.h>
+
+#define UDP_TUNNEL_TYPE_VXLAN  0x01
+#define UDP_TUNNEL_TYPE_GENEVE 0x02
 
 struct udp_port_cfg {
 	u8			family;
@@ -28,7 +31,59 @@  struct udp_port_cfg {
 				use_udp6_rx_checksums:1;
 };
 
+struct udp_tunnel_sock;
+
+typedef void (udp_tunnel_rcv_t)(struct udp_tunnel_sock *uts,
+				struct sk_buff *skb, ...);
+
+typedef int (udp_tunnel_encap_rcv_t)(struct sock *sk, struct sk_buff *skb);
+
+struct udp_tunnel_socket_cfg {
+	u8 tunnel_type;
+	struct udp_port_cfg port;
+	udp_tunnel_rcv_t *rcv;
+	udp_tunnel_encap_rcv_t *encap_rcv;
+	void *data;
+};
+
+struct udp_tunnel_sock {
+	u8 tunnel_type;
+	struct hlist_node hlist;
+	udp_tunnel_rcv_t *rcv;
+	void *data;
+	struct socket *sock;
+};
+
 int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
 		    struct socket **sockp);
 
+struct udp_tunnel_sock *create_udp_tunnel_socket(struct net *net, size_t size,
+						 struct udp_tunnel_socket_cfg
+							*socket_cfg);
+
+struct udp_tunnel_sock *udp_tunnel_find_sock(struct net *net, __be16 port);
+
+int udp_tunnel_xmit_skb(struct socket *sock, struct rtable *rt,
+			struct sk_buff *skb, __be32 src, __be32 dst,
+			__u8 tos, __u8 ttl, __be16 df, __be16 src_port,
+			__be16 dst_port, bool xnet);
+
+#if IS_ENABLED(CONFIG_IPV6)
+int udp_tunnel6_xmit_skb(struct socket *sock, struct dst_entry *dst,
+		struct sk_buff *skb, struct net_device *dev,
+		struct in6_addr *saddr, struct in6_addr *daddr,
+		__u8 prio, __u8 ttl, __be16 src_port, __be16 dst_port);
+
+#endif
+
+void udp_tunnel_sock_release(struct udp_tunnel_sock *uts);
+void udp_tunnel_get_rx_port(struct net_device *dev);
+
+static inline struct sk_buff *udp_tunnel_handle_offloads(struct sk_buff *skb,
+							 bool udp_csum)
+{
+	int type = udp_csum ? SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
+
+	return iptunnel_handle_offloads(skb, udp_csum, type);
+}
 #endif
diff --git a/net/ipv4/udp_tunnel.c b/net/ipv4/udp_tunnel.c
index 61ec1a6..3c14b16 100644
--- a/net/ipv4/udp_tunnel.c
+++ b/net/ipv4/udp_tunnel.c
@@ -7,6 +7,23 @@ 
 #include <net/udp.h>
 #include <net/udp_tunnel.h>
 #include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#if IS_ENABLED(CONFIG_IPV6)
+#include <net/ipv6.h>
+#include <net/addrconf.h>
+#include <net/ip6_tunnel.h>
+#include <net/ip6_checksum.h>
+#endif
+
+#define PORT_HASH_BITS 8
+#define PORT_HASH_SIZE (1 << PORT_HASH_BITS)
+
+static int udp_tunnel_net_id;
+
+struct udp_tunnel_net {
+	struct hlist_head sock_list[PORT_HASH_SIZE];
+	spinlock_t  sock_lock;   /* Protecting the sock_list */
+};
 
 int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
 		    struct socket **sockp)
@@ -82,7 +99,6 @@  int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
 		return -EPFNOSUPPORT;
 	}
 
-
 	*sockp = sock;
 
 	return 0;
@@ -97,4 +113,243 @@  error:
 }
 EXPORT_SYMBOL(udp_sock_create);
 
+
+/* Socket hash table head */
+static inline struct hlist_head *uts_head(struct net *net, const __be16 port)
+{
+	struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
+
+	return &utn->sock_list[hash_32(ntohs(port), PORT_HASH_BITS)];
+}
+
+static int handle_offloads(struct sk_buff *skb)
+{
+	if (skb_is_gso(skb)) {
+		int err = skb_unclone(skb, GFP_ATOMIC);
+
+		if (unlikely(err))
+			return err;
+		skb_shinfo(skb)->gso_type |= SKB_GSO_UDP_TUNNEL;
+	} else {
+		if (skb->ip_summed != CHECKSUM_PARTIAL)
+			skb->ip_summed = CHECKSUM_NONE;
+	}
+
+	return 0;
+}
+
+struct udp_tunnel_sock *create_udp_tunnel_socket(struct net *net, size_t size,
+						 struct udp_tunnel_socket_cfg
+							*cfg)
+{
+	struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
+	struct udp_tunnel_sock *uts;
+	struct socket *sock;
+	struct sock *sk;
+	const __be16 port = cfg->port.local_udp_port;
+	const int ipv6 = (cfg->port.family == AF_INET6);
+	int err;
+
+	uts = kzalloc(size, GFP_KERNEL);
+	if (!uts)
+		return ERR_PTR(-ENOMEM);
+
+	err = udp_sock_create(net, &cfg->port, &sock);
+	if (err < 0) {
+		kfree(uts);
+		return NULL;
+	}
+
+	/* Disable multicast loopback */
+	inet_sk(sock->sk)->mc_loop = 0;
+
+	uts->sock = sock;
+	sk = sock->sk;
+	uts->rcv = cfg->rcv;
+	uts->data = cfg->data;
+	rcu_assign_sk_user_data(sock->sk, uts);
+
+	spin_lock(&utn->sock_lock);
+	hlist_add_head_rcu(&uts->hlist, uts_head(net, port));
+	spin_unlock(&utn->sock_lock);
+
+	udp_sk(sk)->encap_type = 1;
+	udp_sk(sk)->encap_rcv = cfg->encap_rcv;
+
+#if IS_ENABLED(CONFIG_IPV6)
+	if (ipv6)
+		ipv6_stub->udpv6_encap_enable();
+	else
+#endif
+		udp_encap_enable();
+
+	return uts;
+}
+EXPORT_SYMBOL_GPL(create_udp_tunnel_socket);
+
+int udp_tunnel_xmit_skb(struct socket *sock, struct rtable *rt,
+			struct sk_buff *skb, __be32 src, __be32 dst,
+			__u8 tos, __u8 ttl, __be16 df, __be16 src_port,
+			__be16 dst_port, bool xnet)
+{
+	struct udphdr *uh;
+
+	__skb_push(skb, sizeof(*uh));
+	skb_reset_transport_header(skb);
+	uh = udp_hdr(skb);
+
+	uh->dest = dst_port;
+	uh->source = src_port;
+	uh->len = htons(skb->len);
+
+	udp_set_csum(sock->sk->sk_no_check_tx, skb, src, dst, skb->len);
+
+	return iptunnel_xmit(sock->sk, rt, skb, src, dst, IPPROTO_UDP,
+			     tos, ttl, df, xnet);
+}
+EXPORT_SYMBOL_GPL(udp_tunnel_xmit_skb);
+
+#if IS_ENABLED(CONFIG_IPV6)
+int udp_tunnel6_xmit_skb(struct socket *sock, struct dst_entry *dst,
+			 struct sk_buff *skb, struct net_device *dev,
+			 struct in6_addr *saddr, struct in6_addr *daddr,
+			 __u8 prio, __u8 ttl, __be16 src_port, __be16 dst_port)
+{
+	struct udphdr *uh;
+	struct ipv6hdr *ip6h;
+	int err;
+
+	__skb_push(skb, sizeof(*uh));
+	skb_reset_transport_header(skb);
+	uh = udp_hdr(skb);
+
+	uh->dest = dst_port;
+	uh->source = src_port;
+
+	uh->len = htons(skb->len);
+	uh->check = 0;
+
+	memset(&(IPCB(skb)->opt), 0, sizeof(IPCB(skb)->opt));
+	IPCB(skb)->flags &= ~(IPSKB_XFRM_TUNNEL_SIZE | IPSKB_XFRM_TRANSFORMED
+			    | IPSKB_REROUTED);
+	skb_dst_set(skb, dst);
+
+	if (!skb_is_gso(skb) && !(dst->dev->features & NETIF_F_IPV6_CSUM)) {
+		__wsum csum = skb_checksum(skb, 0, skb->len, 0);
+
+		skb->ip_summed = CHECKSUM_UNNECESSARY;
+		uh->check = csum_ipv6_magic(saddr, daddr, skb->len,
+				IPPROTO_UDP, csum);
+		if (uh->check == 0)
+			uh->check = CSUM_MANGLED_0;
+	} else {
+		skb->ip_summed = CHECKSUM_PARTIAL;
+		skb->csum_start = skb_transport_header(skb) - skb->head;
+		skb->csum_offset = offsetof(struct udphdr, check);
+		uh->check = ~csum_ipv6_magic(saddr, daddr,
+				skb->len, IPPROTO_UDP, 0);
+	}
+
+	__skb_push(skb, sizeof(*ip6h));
+	skb_reset_network_header(skb);
+	ip6h		  = ipv6_hdr(skb);
+	ip6h->version	  = 6;
+	ip6h->priority	  = prio;
+	ip6h->flow_lbl[0] = 0;
+	ip6h->flow_lbl[1] = 0;
+	ip6h->flow_lbl[2] = 0;
+	ip6h->payload_len = htons(skb->len);
+	ip6h->nexthdr     = IPPROTO_UDP;
+	ip6h->hop_limit   = ttl;
+	ip6h->daddr	  = *daddr;
+	ip6h->saddr	  = *saddr;
+
+	err = handle_offloads(skb);
+	if (err)
+		return err;
+
+	ip6tunnel_xmit(skb, dev);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(udp_tunnel6_xmit_skb);
+#endif
+
+struct udp_tunnel_sock *udp_tunnel_find_sock(struct net *net, __be16 port)
+{
+	struct udp_tunnel_sock *uts;
+
+	hlist_for_each_entry_rcu(uts, uts_head(net, port), hlist) {
+		if (inet_sk(uts->sock->sk)->inet_sport == port)
+			return uts;
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(udp_tunnel_find_sock);
+
+void udp_tunnel_sock_release(struct udp_tunnel_sock *uts)
+{
+	struct sock *sk = uts->sock->sk;
+	struct net *net = sock_net(sk);
+	struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
+
+	spin_lock(&utn->sock_lock);
+	hlist_del_rcu(&uts->hlist);
+	rcu_assign_sk_user_data(uts->sock->sk, NULL);
+	spin_unlock(&utn->sock_lock);
+}
+EXPORT_SYMBOL_GPL(udp_tunnel_sock_release);
+
+/* Calls the ndo_add_tunnel_port of the caller in order to
+ * supply the listening VXLAN udp ports. Callers are expected
+ * to implement the ndo_add_tunnle_port.
+ */
+void udp_tunnel_get_rx_port(struct net_device *dev)
+{
+	struct udp_tunnel_sock *uts;
+	struct net *net = dev_net(dev);
+	struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
+	sa_family_t sa_family;
+	__be16 port;
+	unsigned int i;
+
+	spin_lock(&utn->sock_lock);
+	for (i = 0; i < PORT_HASH_SIZE; ++i) {
+		hlist_for_each_entry_rcu(uts, &utn->sock_list[i], hlist) {
+			port = inet_sk(uts->sock->sk)->inet_sport;
+			sa_family = uts->sock->sk->sk_family;
+			dev->netdev_ops->ndo_add_udp_tunnel_port(dev,
+					sa_family, port, uts->tunnel_type);
+		}
+	}
+	spin_unlock(&utn->sock_lock);
+}
+EXPORT_SYMBOL_GPL(udp_tunnel_get_rx_port);
+
+static int __net_init udp_tunnel_init_net(struct net *net)
+{
+	struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
+	unsigned int h;
+
+	spin_lock_init(&utn->sock_lock);
+
+	for (h = 0; h < PORT_HASH_SIZE; h++)
+		INIT_HLIST_HEAD(&utn->sock_list[h]);
+
+	return 0;
+}
+
+static struct pernet_operations udp_tunnel_net_ops = {
+	.init = udp_tunnel_init_net,
+	.exit = NULL,
+	.id = &udp_tunnel_net_id,
+	.size = sizeof(struct udp_tunnel_net),
+};
+
+static int __init udp_tunnel_init(void)
+{
+	return register_pernet_subsys(&udp_tunnel_net_ops);
+}
+late_initcall(udp_tunnel_init);
+
 MODULE_LICENSE("GPL");