diff mbox

New feature - RFC3931 L2TPv3 network transport using static Ethernet over L2TPv3 tunnels

Message ID 1394028740-710822-1-git-send-email-anton.ivanov@kot-begemot.co.uk
State New
Headers show

Commit Message

Anton Ivanov March 5, 2014, 2:12 p.m. UTC
From: Anton Ivanov <antivano@cisco.com>

This transport allows qemu to communicate with host if host
supports L2TPv3, communicate directly VM to VM (similar to
current socket transport) and VM to other device - f.e. VM to
a router.

Supported
    * Raw IPv4, Raw IPv6, UDPv4 and UDPv6 as per RFC
    * No cookie, 32 bit cookie or 64 bit cookie
    * Counter (as per RFC)
    * No counter (draft-mkonstan-keyed-ipv6-tunnel-00)
Unsupported
    * Workarounds for implementation with broken counter handling

Signed-off-by: Anton Ivanov <antivano@cisco.com>
---
 net/Makefile.objs |    1 +
 net/clients.h     |    2 +
 net/l2tpv3.c      |  554 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 net/net.c         |    3 +
 qapi-schema.json  |   56 +++++-
 5 files changed, 615 insertions(+), 1 deletion(-)
 create mode 100644 net/l2tpv3.c

Comments

Eric Blake March 5, 2014, 9:22 p.m. UTC | #1
On 03/05/2014 07:12 AM, anton.ivanov@kot-begemot.co.uk wrote:
> From: Anton Ivanov <antivano@cisco.com>
> 
> This transport allows qemu to communicate with host if host
> supports L2TPv3, communicate directly VM to VM (similar to
> current socket transport) and VM to other device - f.e. VM to
> a router.
> 
> Supported
>     * Raw IPv4, Raw IPv6, UDPv4 and UDPv6 as per RFC
>     * No cookie, 32 bit cookie or 64 bit cookie
>     * Counter (as per RFC)
>     * No counter (draft-mkonstan-keyed-ipv6-tunnel-00)
> Unsupported
>     * Workarounds for implementation with broken counter handling

Just an interface review (leaving the technical review to others more
knowledgeable on networking):

> +++ b/qapi-schema.json
> @@ -2940,6 +2940,57 @@
>      '*localaddr': 'str',
>      '*udp':       'str' } }
>  
> +# @NetdevL2TPv3Options
> +#
> +# Connect the VLAN to Ethernet over L2TPv3 Static tunnel
> +#
> +# @src :source address

s/ :/: /

> +#
> +# @dst :destination address

and again

> +#
> +# @srcport :#optional source port - mandatory for udp, optional for ip

and again (several times more)

udp is layer 3, ip is layer 2 - don't you mean "optional for tcp" as the
layer 3 counterpart of udp?

> +#
> +# @dstport :#optional destination port - mandatory for udp, optional for ip
> +#
> +# @ipv6 :#optional - force the use of ipv6 
> +#
> +# @udp :#optional - use the udp version of l2tpv3 encapsulation
> +#
> +# @cookie64:#optional - use 64 bit coookies

s/:/: /

> +#
> +# @counter :#optional have sequence counter
> +# 
> +# @txcookie :#optional 32 or 64 bit transmit cookie 
> +# 
> +# @rxcookie :#optional 32 or 64 bit receive cookie 
> +# 
> +# @txsession : 32 bit transmit session

s/ :/:/

> +# 
> +# @rxsession : 32 bit receive session - if not specified set to the same value as transmit

Long line, please wrap to keep within 80 columns

> +# 
> +# @optional : additional offset - allows the insertion of additional application-specific data before the packet payload

another long line

> +# 
> +#
> +# Since 1.2

At this point, you've pretty much missed the 2.0 feature freeze.  So the
earliest this can be is 2.1.  Specifying 1.2 is wrong either way.

> +##
> +##
> +{ 'type': 'NetdevL2TPv3Options',
> +  'data': {
> +    'src':	  'str', 

TAB damage and trailing whitespace.  Please run your patch through
scripts/checkpatch.pl, and only use spaces in the .json file.

> +    'dst':	  'str', 
> +    '*srcport':	  'str', 
> +    '*dstport':	  'str', 
> +    '*ipv6':	  'bool', 
> +    '*udp':	  'bool', 
> +    '*cookie64':  'bool',

Indentation looks funky - either use a single space after : in all
cases, or align all types to the same column (I don't care which, as
long as it isn't the mismatch of random alignments that you currently have)

But looking better than the earlier version.  On your next submission,
be sure to use 'git send-email --subject-prefix=PATCHv2' to make it
obvious that you are sending a new version.
Anton Ivanov March 5, 2014, 11 p.m. UTC | #2
On 05/03/14 21:22, Eric Blake wrote:

[snip]
>
> udp is layer 3, ip is layer 2 - don't you mean "optional for tcp" as the
> layer 3 counterpart of udp?

L2TPv3 uses either udp or ip as in "raw ip" (raw sockets) using a
special protocol - 0x73.

So no tcp at play here. The wording is incorrect - should be "ignored
for ip", thanks for noticing this.

>
>
>> +#
>> +# @dstport :#optional destination port - mandatory for udp, optional
for ip
>> +#
>> +# @ipv6 :#optional - force the use of ipv6
>> +#
>> +# @udp :#optional - use the udp version of l2tpv3 encapsulation
>> +#
>> +# @cookie64:#optional - use 64 bit coookies
>
> s/:/: /
>
>> +#
>> +# @counter :#optional have sequence counter
>> +#
>> +# @txcookie :#optional 32 or 64 bit transmit cookie
>> +#
>> +# @rxcookie :#optional 32 or 64 bit receive cookie
>> +#
>> +# @txsession : 32 bit transmit session
>
> s/ :/:/
>
>> +#
>> +# @rxsession : 32 bit receive session - if not specified set to the
same value as transmit
>
> Long line, please wrap to keep within 80 columns
>
>> +#
>> +# @optional : additional offset - allows the insertion of additional
application-specific data before the packet payload
>
> another long line

OK

>
>> +#
>> +#
>> +# Since 1.2
>
> At this point, you've pretty much missed the 2.0 feature freeze. So the
> earliest this can be is 2.1. Specifying 1.2 is wrong either way.
>
>> +##
>> +##
>> +{ 'type': 'NetdevL2TPv3Options',
>> + 'data': {
>> + 'src': 'str',
>
> TAB damage and trailing whitespace. Please run your patch through
> scripts/checkpatch.pl, and only use spaces in the .json file.

OK.

>
>
>> + 'dst': 'str',
>> + '*srcport': 'str',
>> + '*dstport': 'str',
>> + '*ipv6': 'bool',
>> + '*udp': 'bool',
>> + '*cookie64': 'bool',
>
> Indentation looks funky - either use a single space after : in all
> cases, or align all types to the same column (I don't care which, as
> long as it isn't the mismatch of random alignments that you currently have)
>
> But looking better than the earlier version. On your next submission,
> be sure to use 'git send-email --subject-prefix=PATCHv2' to make it
> obvious that you are sending a new version.
>
Stefan Hajnoczi March 6, 2014, 9:44 a.m. UTC | #3
On Wed, Mar 05, 2014 at 02:12:20PM +0000, anton.ivanov@kot-begemot.co.uk wrote:

Please don't put "New feature - " in the commit message.  "net: " or
"l2tpv3: " would be a good prefix (see git-log(1) for other examples).
The idea behind using a prefix is that you can immediately determine
which area of code is affected by a commit.  It also helps with grepping
the commit history although there are more precise ways to do that.

Anyway, "New feature" isn't a useful prefix.

> From: Anton Ivanov <antivano@cisco.com>
> 
> This transport allows qemu to communicate with host if host
> supports L2TPv3, communicate directly VM to VM (similar to
> current socket transport) and VM to other device - f.e. VM to
> a router.
> 
> Supported
>     * Raw IPv4, Raw IPv6, UDPv4 and UDPv6 as per RFC
>     * No cookie, 32 bit cookie or 64 bit cookie
>     * Counter (as per RFC)
>     * No counter (draft-mkonstan-keyed-ipv6-tunnel-00)
> Unsupported
>     * Workarounds for implementation with broken counter handling
> 
> Signed-off-by: Anton Ivanov <antivano@cisco.com>
> ---
>  net/Makefile.objs |    1 +
>  net/clients.h     |    2 +
>  net/l2tpv3.c      |  554 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  net/net.c         |    3 +
>  qapi-schema.json  |   56 +++++-
>  5 files changed, 615 insertions(+), 1 deletion(-)
>  create mode 100644 net/l2tpv3.c

Please add documentation to qemu-options.hx.  You can follow the other
-netdev documentation examples in that file.

Also please add example Linux commands for setting up L2TPv3 to the
commit description.  This will help anyone wishing to try out the code.
Something along the lines of:

  # ip l2tp add tunnel tunnel_id 1 peer_tunnel_id 1 udp_sport 5000 \
    udp_dport 5001 encap udp local 127.0.0.1 remote 127.0.0.1
  # ip l2tp add session tunnel_id 1 session_id 1 peer_session_id 1
  # ip addr add 192.168.2.1/32 peer 192.168.2.2/32 dev l2tpeth0
  # ifconfig l2tpeth0 up
  # qemu -netdev l2tpv3,src=127.0.0.1,dst=127.0.0.1,\
                 srcport=5001,dstport=5000,udp=on,...

(incomplete but should be close)

> diff --git a/net/Makefile.objs b/net/Makefile.objs
> index 4854a14..160214e 100644
> --- a/net/Makefile.objs
> +++ b/net/Makefile.objs
> @@ -2,6 +2,7 @@ common-obj-y = net.o queue.o checksum.o util.o hub.o
>  common-obj-y += socket.o
>  common-obj-y += dump.o
>  common-obj-y += eth.o
> +common-obj-$(CONFIG_LINUX) += l2tpv3.o
>  common-obj-$(CONFIG_POSIX) += tap.o
>  common-obj-$(CONFIG_LINUX) += tap-linux.o
>  common-obj-$(CONFIG_WIN32) += tap-win32.o
> diff --git a/net/clients.h b/net/clients.h
> index 7793294..bbf177c 100644
> --- a/net/clients.h
> +++ b/net/clients.h
> @@ -47,6 +47,8 @@ int net_init_tap(const NetClientOptions *opts, const char *name,
>  int net_init_bridge(const NetClientOptions *opts, const char *name,
>                      NetClientState *peer);
>  
> +int net_init_l2tpv3(const NetClientOptions *opts, const char *name,
> +                    NetClientState *peer);
>  #ifdef CONFIG_VDE
>  int net_init_vde(const NetClientOptions *opts, const char *name,
>                   NetClientState *peer);
> diff --git a/net/l2tpv3.c b/net/l2tpv3.c
> new file mode 100644
> index 0000000..3348b5d
> --- /dev/null
> +++ b/net/l2tpv3.c

Please run patches through scripts/checkpatch.pl before submitting them.
See here for how to set up git to automatically check your code as you
commit:
http://blog.vmsplice.net/2011/03/how-to-automatically-run-checkpatchpl.html

> @@ -0,0 +1,554 @@
> +/*
> + * QEMU System Emulator
> + *
> + * Copyright (c) 2003-2008 Fabrice Bellard
> + * Copyright (c) 2012-2014 Cisco Systems
> + *
> + * Permission is hereby granted, free of charge, to any person obtaining a copy
> + * of this software and associated documentation files (the "Software"), to deal
> + * in the Software without restriction, including without limitation the rights
> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> + * copies of the Software, and to permit persons to whom the Software is
> + * furnished to do so, subject to the following conditions:
> + *
> + * The above copyright notice and this permission notice shall be included in
> + * all copies or substantial portions of the Software.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> + * THE SOFTWARE.
> + */
> +
> +#include <linux/ip.h>
> +#include <netdb.h>
> +#include "config-host.h"
> +#include "net/net.h"
> +#include "clients.h"
> +#include "monitor/monitor.h"
> +#include "qemu-common.h"
> +#include "qemu/error-report.h"
> +#include "qemu/option.h"
> +#include "qemu/sockets.h"
> +#include "qemu/iov.h"
> +#include "qemu/main-loop.h"
> +
> +
> +/* The buffer size needs to be investigated for optimum numbers and
> + * optimum means of paging in on different systems. This size is
> + * chosen to be sufficient to accommodate one packet with some headers
> + */
> +
> +#define BUFFER_ALIGN sysconf(_SC_PAGESIZE)
> +#define BUFFER_SIZE 2048
> +#define IOVSIZE 2
> +#define MAX_L2TPV3_MSGCNT 32 

Please avoid trailing whitespace, git prints a warning about it when
applying patches.  There are occurrences throughout this file.  You can
remove it in vi with :%s/\s*$//g

> +#define MAX_L2TPV3_IOVCNT (MAX_L2TPV3_MSGCNT * IOVSIZE)
> +
> +#ifndef IPPROTO_L2TP
> +#define IPPROTO_L2TP 0x73
> +#endif

Not needed, see below for details.

> +typedef struct NetL2TPV3State {
> +    NetClientState nc;
> +    int fd;

From here...

> +    int state; 
> +    unsigned int index;
> +    unsigned int packet_len;

...to here should be dropped.  These fields are not used and were copied
from another netdev implementation.

(Mentioned in a previous review.)

> +
> +    /* 
> +     *	these are used for xmit - that happens packet a time
> +     *	and for first sign of life packet (easier to parse that once)
> +     */ 
> +
> +    uint8_t *header_buf;
> +    struct iovec *vec;
> +
> +    /* 
> +     * these are used for receive - try to "eat" up to 32 packets at a time
> +     */
> +
> +    struct mmsghdr *msgvec;
> +
> +    /*
> +     * peer address
> +     */
> +
> +    struct sockaddr_storage *dgram_dst; 
> +    uint32_t dst_size;
> +
> +    /*
> +     * L2TPv3 parameters
> +     */
> +
> +    uint64_t rx_cookie;
> +    uint64_t tx_cookie;
> +    uint32_t rx_session;
> +    uint32_t tx_session;
> +    uint32_t header_size;
> +    uint32_t counter;
> +
> +    /*
> +     * Precomputed offsets
> +     */
> +
> +    uint32_t offset;   
> +    uint32_t cookie_offset;
> +    uint32_t counter_offset;
> +    uint32_t session_offset;
> +
> +    /* Flags */
> +
> +    bool ipv6;
> +    bool udp;
> +    bool has_counter;
> +    bool cookie;
> +    bool cookie_is_64;
> +
> +} NetL2TPV3State;
> +
> +typedef struct NetL2TPV3ListenState {
> +    NetClientState nc;
> +    char *model;
> +    char *name;
> +    int fd;
> +} NetL2TPV3ListenState;

NetL2TPV3ListenState is not used.  Please remove.

(Mentioned in a previous review.)

> +
> +static int l2tpv3_form_header(NetL2TPV3State *s) 

I suggest making this void since there is no error return and callers
don't check.

> +{
> +    uint32_t *header;
> +    uint32_t *session;
> +    uint64_t *cookie64;
> +    uint32_t *cookie32;
> +    uint32_t *counter;
> +
> +    if (s->udp) {
> +	header = (uint32_t *) s->header_buf;
> +	stl_be_p(header, 0x30000);

I guess the 3 is the protocol version number.  A comment about magic
number would help.

> +    }

QEMU coding style is 4-space indentation.  Please reindent the code.

> +    session = (uint32_t *) (s->header_buf + s->session_offset);
> +    stl_be_p(session, s->tx_session);
> +
> +    if (s->cookie) {
> +	if (s->cookie_is_64) {
> +	    cookie64 = (uint64_t *)(s->header_buf + s->cookie_offset);
> +	    stq_be_p(cookie64, s->tx_cookie);
> +	} else {
> +	    cookie32 = (uint32_t *) (s->header_buf + s->cookie_offset);
> +	    stl_be_p(cookie32, s->tx_cookie);
> +	}
> +    }
> +
> +    if (s->has_counter) {
> +	counter = (uint32_t *)(s->header_buf + s->counter_offset);
> +	stl_be_p(counter, ++s->counter);
> +    }
> +    return 0;
> +}
> +
> +static ssize_t net_l2tpv3_receive_dgram_iov(NetClientState *nc, 
> +				    const struct iovec *iov, 
> +				    int iovcnt)
> +{
> +    NetL2TPV3State *s = DO_UPCAST(NetL2TPV3State, nc, nc);
> +
> +    struct msghdr message;
> +    int ret;
> +
> +    if (iovcnt > MAX_L2TPV3_IOVCNT - 1) {
> +	error_report("iovec too long %d > %d, change l2tpv3.h\n", iovcnt, MAX_L2TPV3_IOVCNT);

error_report() messages should not include a newline.

> +	return -1;
> +    }
> +    l2tpv3_form_header(s);
> +    memcpy(s->vec + 1, iov, iovcnt * sizeof(struct iovec));
> +    s->vec->iov_base = s->header_buf;
> +    s->vec->iov_len = s->offset;
> +    message.msg_name = s->dgram_dst;
> +    message.msg_namelen = s->dst_size;
> +    message.msg_iov = (struct iovec *) s->vec;

No need to cast here.

> +    message.msg_iovlen = iovcnt + 1; 
> +    message.msg_control = NULL;
> +    message.msg_controllen = 0;
> +    message.msg_flags = 0;
> +    do {
> +	ret = sendmsg(s->fd, &message, 0);
> +    } while ((ret == -1) && (errno == EINTR));
> +    if (ret > 0) {
> +	ret -= s->offset; 
> +    } else if (ret == 0) {
> +	ret = iov_size(iov, iovcnt);
> +    } 
> +    return ret;

If the socket buffer (in the host kernel) is full this will drop
packets.  The other netdev implementations return 0 and wait until the
file descriptor becomes writable again (see net/tap.c or net/socket.c).

Please either implement the same approach or explain why you disagree.

(Mentioned in previous review.)

> +}
> +
> +static ssize_t net_l2tpv3_receive_dgram(NetClientState *nc, 
> +					const uint8_t *buf, 
> +					size_t size)
> +{
> +    NetL2TPV3State *s = DO_UPCAST(NetL2TPV3State, nc, nc);
> +
> +    struct iovec * vec;
> +    struct msghdr message;
> +    ssize_t ret = 0;
> +
> +    l2tpv3_form_header(s);
> +    vec = s->vec;
> +    vec->iov_base = s->header_buf;
> +    vec->iov_len = s->offset;
> +    vec++;
> +    vec->iov_base = (void *) buf;

Cast is unnecessary.  In C any pointer casts to void* and vice versa
without warnings.

> +    vec->iov_len = size;
> +    message.msg_name = s->dgram_dst;
> +    message.msg_namelen = s->dst_size;
> +    message.msg_iov = (struct iovec *) s->vec;

Cast is unnecessary.

> +    message.msg_iovlen = 2;
> +    message.msg_control = NULL;
> +    message.msg_controllen = 0;
> +    message.msg_flags = 0;
> +    do {
> +	ret = sendmsg(s->fd, &message, 0);
> +    } while ((ret == -1) && (errno == EINTR));
> +    if (ret > 0) {
> +	ret -= s->offset; 
> +    } else if (ret == 0) {
> +	ret = size;
> +    } 
> +    return ret;
> +}
> +
> +static int l2tpv3_verify_header(NetL2TPV3State *s, 
> +				    uint8_t *buf) 
> +{
> +
> +    uint64_t *cookie64;
> +    uint32_t *cookie32;
> +    uint32_t *session;
> +
> +    if ((!s->udp) && (!s->ipv6)){
> +	buf += sizeof(struct iphdr) /* fix for ipv4 raw */;
> +    } 
> +    if (s->cookie) {
> +	if (s->cookie_is_64) {
> +	    /* 64 bit cookie */
> +	    cookie64 = (uint64_t *)(buf + s->cookie_offset);
> +	    if (ldq_be_p(cookie64) != s->rx_cookie) {
> +		error_report("unknown cookie id\n");
> +		return -1; 
> +	    }
> +	} else {
> +	    cookie32 = (uint32_t *)(buf + s->cookie_offset);
> +	    if (ldl_be_p(cookie32) != * (uint32_t *) &s->rx_cookie) {

Ouch, the rx_cookie cast assumes little-endian host!  See below for a
cleaner solution.

> +		error_report("unknown cookie id\n");
> +		return -1 ; 
> +	    }
> +	}

The casting and code duplication is not necessary:

uint64_t cookie;
if (s->cookie_is_64) {
    cookie = ldq_be_p(buf + s->cookie_offset);
} else {
    cookie = ldl_be_p(buf + s->cookie_offset);
}
if (cookie != s->rx_cookie) {
    error_report("unknown cookie id\n");
    return -1;
}

> +    }
> +    session = (uint32_t *) (buf + s->session_offset);
> +    if (ldl_be_p(session) != s->rx_session) {
> +	error_report("session mismatch\n");
> +	return -1;
> +    }	
> +    return 0;
> +}
> +
> +static void net_l2tpv3_send(void *opaque)
> +{
> +    NetL2TPV3State *s = opaque;
> +
> +    int i, count, offset;
> +    struct mmsghdr * msgvec;
> +    struct iovec * vec;
> +
> +    msgvec = s->msgvec;
> +    offset = s->offset;
> +    if ((!s->udp) && (!s->ipv6)){
> +	offset +=   sizeof(struct iphdr);
> +    }  
> +    count = recvmmsg(s->fd, msgvec, MAX_L2TPV3_MSGCNT, MSG_DONTWAIT, NULL);
> +    for (i=0;i<count;i++) {

QEMU coding style uses spaces:
for (i = 0; i < count; i++) {

> +	if (msgvec->msg_len > 0) {
> +	    vec = msgvec->msg_hdr.msg_iov;
> +	    if (
> +		(msgvec->msg_len > offset) && 
> +		(l2tpv3_verify_header(s, vec->iov_base) == 0)
> +		) {

if ((msgvec->msg_len > offset) &&
    (l2tcpv3_verify_header(s, vec->iov_base) == 0))

> +		vec++;
> +		qemu_send_packet(&s->nc, vec->iov_base, msgvec->msg_len - offset);
> +	    } else {
> +		error_report("l2tpv3 header verification failed\n");
> +		vec++; 

vec isn't used after this so the increment can be removed.

> +	    }
> +	} 
> +	msgvec++;
> +    }
> +}
> +
> +static void destroy_vector(struct mmsghdr * msgvec, int count, int iovcount) 
> +{
> +    int i, j;
> +    struct iovec * iov;
> +    struct mmsghdr * cleanup = msgvec;
> +    if (cleanup) {
> +	for (i=0;i<count;i++) {
> +	    if (cleanup->msg_hdr.msg_iov) {
> +		iov = cleanup->msg_hdr.msg_iov;
> +		for (j=0;j<iovcount;j++) {
> +		    if (iov->iov_base) {
> +			g_free(iov->iov_base);
> +		    }

g_free(NULL) is a nop so you don't need to check iov->iov_base before
calling g_free().

> +		    iov++;
> +		}
> +		g_free(cleanup->msg_hdr.msg_iov);
> +	    }
> +	    cleanup++;
> +	}
> +	g_free(msgvec);
> +    }
> +}
> +
> +static struct mmsghdr * build_l2tpv3_vector(NetL2TPV3State *s, int count) 
> +{
> +    int i;
> +    struct iovec * iov;
> +    struct mmsghdr * msgvec, *result;
> +
> +    msgvec = g_malloc(sizeof(struct mmsghdr) * count);
> +    result = msgvec;
> +    for (i=0;i < count ;i++) {
> +	msgvec->msg_hdr.msg_name = NULL;
> +	msgvec->msg_hdr.msg_namelen = 0;
> +	iov =  g_malloc(sizeof(struct iovec) * IOVSIZE);
> +	msgvec->msg_hdr.msg_iov = iov;
> +	if ((!s->udp) && (!s->ipv6)){
> +	    /* fix for ipv4 raw */;
> +	    iov->iov_base = g_malloc(s->offset + sizeof(struct iphdr)); 
> +	    iov->iov_len = s->offset + sizeof (struct iphdr);
> +	} else {
> +	    iov->iov_base = g_malloc(s->offset);
> +	    iov->iov_len = s->offset;
> +	}
> +	iov++ ;
> +	iov->iov_base = qemu_memalign(BUFFER_ALIGN,BUFFER_SIZE);
> +	iov->iov_len = BUFFER_SIZE;
> +	msgvec->msg_hdr.msg_iovlen = 2;
> +	msgvec->msg_hdr.msg_control = NULL;
> +	msgvec->msg_hdr.msg_controllen = 0;
> +	msgvec->msg_hdr.msg_flags = 0;
> +	msgvec++;
> +    }
> +    return result;
> +}
> +
> +static void net_l2tpv3_cleanup(NetClientState *nc) 
> +{
> +    NetL2TPV3State *s = DO_UPCAST(NetL2TPV3State, nc, nc);
> +    qemu_set_fd_handler(s->fd, NULL, NULL, NULL);
> +    close(s->fd);
> +    destroy_vector(s->msgvec, MAX_L2TPV3_MSGCNT, IOVSIZE);
> +    g_free(s->header_buf);
> +    g_free(s->dgram_dst);
> +}
> +
> +static NetClientInfo net_l2tpv3_info = {
> +    .type = NET_CLIENT_OPTIONS_KIND_L2TPV3,
> +    .size = sizeof(NetL2TPV3State),
> +    .receive = net_l2tpv3_receive_dgram,
> +    .receive_iov = net_l2tpv3_receive_dgram_iov,
> +    .cleanup = net_l2tpv3_cleanup,
> +};
> +
> +int net_init_l2tpv3(const NetClientOptions *opts,
> +                    const char *name,
> +                    NetClientState *peer) 
> +{
> +
> +
> +    const NetdevL2TPv3Options * l2tpv3;
> +    NetL2TPV3State *s;
> +    NetClientState *nc;
> +    int fd, gairet;
> +    struct addrinfo hints;
> +    struct addrinfo * result = NULL;
> +    char * srcport, * dstport;
> +
> +    nc = qemu_new_net_client(&net_l2tpv3_info, peer, "l2tpv3", name);
> +
> +    s = DO_UPCAST(NetL2TPV3State, nc, nc);
> +
> +    assert(opts->kind == NET_CLIENT_OPTIONS_KIND_L2TPV3);
> +    l2tpv3 = opts->l2tpv3;
> +
> +    if (l2tpv3->has_ipv6 && l2tpv3->ipv6) {
> +	s->ipv6 = l2tpv3->ipv6;
> +    } else {
> +	s->ipv6 = false;
> +    }
> +
> +    if (l2tpv3->has_rxcookie || l2tpv3->has_txcookie) {
> +	if (l2tpv3->has_rxcookie && l2tpv3->has_txcookie) {
> +	    s->cookie = true;
> +	} else {
> +	    return -1;

This leaks nc.  I suggest adding a goto err where nc is freed.  Other
error return paths should also free resources and can use this label.

> +	}
> +    } else {
> +	s->cookie = false;
> +    }
> +
> +    if (l2tpv3->has_cookie64 || l2tpv3->cookie64) {
> +	s->cookie_is_64  = true;
> +    } else {
> +	s->cookie_is_64  = false;
> +    }
> +
> +    if (l2tpv3->has_udp && l2tpv3->udp) {
> +	s->udp = true;
> +	if (!(l2tpv3->has_srcport && l2tpv3->has_dstport)) {
> +	    error_report("l2tpv3_open : need both src and dst port for udp\n");
> +	    return -1;

Leaks nc.

> +	} else {
> +	    srcport = l2tpv3->srcport;
> +	    dstport = l2tpv3->dstport;
> +	}
> +    } else {
> +	s->udp = false;
> +	srcport = NULL;
> +	dstport = NULL;
> +    }
> +
> +
> +    s->offset = 4;
> +    s->session_offset = 0;
> +    s->cookie_offset = 4;
> +    s->counter_offset = 4;
> +
> +    s->tx_session = l2tpv3->txsession;
> +    if (l2tpv3->has_rxsession) {
> +	s->rx_session = l2tpv3->rxsession;
> +    } else {
> +	s->rx_session = s->tx_session;
> +    }
> +
> +    if (s->cookie) {
> +	s->rx_cookie = l2tpv3->rxcookie;
> +	s->tx_cookie = l2tpv3->txcookie;
> +	if (s->cookie_is_64 == true) {
> +	    /* 64 bit cookie */
> +	    s->offset += 8;
> +	    s->counter_offset += 8;
> +	} else {
> +	    /* 32 bit cookie */
> +	    s->offset += 4;
> +	    s->counter_offset +=4;
> +	}
> +    }
> +
> +    memset(&hints, 0, sizeof(hints));
> +    
> +    if (s->ipv6) {
> +	hints.ai_family = AF_INET6;
> +    } else {
> +	hints.ai_family = AF_INET;
> +    }
> +    if (s->udp) {
> +	hints.ai_socktype = SOCK_DGRAM;
> +	hints.ai_protocol = 0;
> +	s->offset += 4;
> +	s->counter_offset += 4;
> +	s->session_offset += 4;
> +	s->cookie_offset += 4;
> +    } else {
> +	hints.ai_socktype = SOCK_RAW;
> +	hints.ai_protocol = IPPROTO_L2TP;
> +    }
> +    
> +    gairet= getaddrinfo(l2tpv3->src, srcport, &hints, &result);
> +
> +    if ((gairet !=0) || (result == NULL)) {
> +	error_report("l2tpv3_open : could not resolve src, errno = %s\n", gai_strerror(gairet));

Leaks nc.

> +	return -1;
> +    }
> +
> +    if ((fd = socket(result->ai_family, result->ai_socktype, result->ai_protocol)) == -1) {
> +	fd = -errno;
> +	error_report("l2tpv3_open : socket creation failed, errno = %d\n", -fd);
> +	freeaddrinfo(result);

Leaks nc.

> +	return fd;
> +    } 
> +    if (bind(fd, (struct sockaddr *) result->ai_addr, result->ai_addrlen)) {
> +	error_report("l2tpv3_open :  could not bind socket err=%i\n", errno);
> +	close(fd);

Leaks nc and result.

> +	return -1;
> +    }
> +
> +    freeaddrinfo(result);
> +
> +    memset(&hints, 0, sizeof(hints));
> +    
> +    if (s->ipv6) {
> +	hints.ai_family = AF_INET6;
> +    } else {
> +	hints.ai_family = AF_INET;
> +    }
> +    if (s->udp) {
> +	hints.ai_socktype = SOCK_DGRAM;
> +	hints.ai_protocol = 0;
> +    } else {
> +	hints.ai_socktype = SOCK_RAW;
> +	hints.ai_protocol = IPPROTO_L2TP;

Hang on, this is bogus.  This is a *userspace* L2TP implementation!

We don't want a kernel L2TP driver to handle this socket.  Luckily this
never happens anyway since net/l2tp/l2tp_ip.c only registers its socket
type for <AF_INET, SOCK_DGRAM, IPPROTO_L2TP> and <AF_INET6, SOCK_DGRAM,
IPPROTO_L2TP>.

When we create this socket with <AF_INET, SOCK_RAW, IPPROTO_L2TP> what
really happens is that the kernel falls back to the IPv4 raw socket
driver due to a wildcard match.

In other words, we shouldn't use IPPROTO_L2TP.  Just use 0.

> +    }
> +
> +    gairet= getaddrinfo(l2tpv3->dst, dstport, &hints, &result);
> +    if ((gairet !=0) || (result == NULL)) {
> +	error_report("l2tpv3_open : could not resolve dst, error = %s\n", gai_strerror(gairet));
> +	return -1;

More leaks.

> +    }
> +
> +    s->dgram_dst = g_malloc(sizeof(struct sockaddr_storage));
> +    memset(s->dgram_dst, '\0' , sizeof(struct sockaddr_storage));
> +    memcpy(s->dgram_dst, result->ai_addr, result->ai_addrlen);

Why is this field heap-allocated?  It doesn't need to be a pointer and
then you can avoid the g_malloc()/g_free().

> +    s->dst_size = result->ai_addrlen;
> +
> +    freeaddrinfo(result);
> +
> +    if (l2tpv3->has_counter && l2tpv3->counter) {
> +	s->has_counter = true;
> +	s->offset += 4;
> +    } else {
> +	s->has_counter = false;
> +    }
> +
> +    if (l2tpv3->has_offset) {
> +	/* extra offset */
> +	s->offset += l2tpv3->offset;
> +    }
> +
> +    s->msgvec = build_l2tpv3_vector(s, MAX_L2TPV3_MSGCNT);
> +    s->vec = g_malloc(sizeof(struct iovec) * MAX_L2TPV3_IOVCNT);    

It's up to you if you want to bother heap-allocating this.  Seems
simpler to embed it in the struct:
struct iovec vec[MAX_L2TPV3_IOVCNT];

> +    if ((!s->udp) && (!s->ipv6)){
> +	s->header_buf = g_malloc(s->offset + sizeof (struct iphdr));    
> +    } else {
> +	s->header_buf = g_malloc(s->offset);    
> +    }
> +
> +    qemu_set_nonblock(fd);
> +
> +    if (fd < 0)
> +	return -1;

QEMU coding style always uses curlies:
if (fd < 0) {
    return -1;
}

Anyway, this check is redundant because we already checked when creating
the socket.

> +
> +    s->fd = fd;
> +    s->counter = 0;
> +
> +    qemu_set_fd_handler(s->fd, net_l2tpv3_send, NULL, s);
> +
> +    if (!s) {
> +	error_report("l2tpv3_open : failed to set fd handler\n");
> +	return -1;
> +    }

Bogus check.  s is never NULL here.

> +    snprintf(s->nc.info_str, sizeof(s->nc.info_str),
> +             "l2tpv3: connected");
> +    return 0;
> +}
> +
> diff --git a/net/net.c b/net/net.c
> index 0a88e68..749d34c 100644
> --- a/net/net.c
> +++ b/net/net.c
> @@ -731,6 +731,9 @@ static int (* const net_client_init_fun[NET_CLIENT_OPTIONS_KIND_MAX])(
>          [NET_CLIENT_OPTIONS_KIND_BRIDGE]    = net_init_bridge,
>  #endif
>          [NET_CLIENT_OPTIONS_KIND_HUBPORT]   = net_init_hubport,
> +#ifdef CONFIG_LINUX
> +        [NET_CLIENT_OPTIONS_KIND_L2TPV3]    = net_init_l2tpv3,
> +#endif
>  };
>  
>  
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 83fa485..7d9f69f 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -2940,6 +2940,57 @@
>      '*localaddr': 'str',
>      '*udp':       'str' } }
>  
> +# @NetdevL2TPv3Options
> +#
> +# Connect the VLAN to Ethernet over L2TPv3 Static tunnel
> +#
> +# @src :source address
> +#
> +# @dst :destination address
> +#
> +# @srcport :#optional source port - mandatory for udp, optional for ip
> +#
> +# @dstport :#optional destination port - mandatory for udp, optional for ip
> +#
> +# @ipv6 :#optional - force the use of ipv6 
> +#
> +# @udp :#optional - use the udp version of l2tpv3 encapsulation
> +#
> +# @cookie64:#optional - use 64 bit coookies
> +#
> +# @counter :#optional have sequence counter
> +# 
> +# @txcookie :#optional 32 or 64 bit transmit cookie 
> +# 
> +# @rxcookie :#optional 32 or 64 bit receive cookie 
> +# 
> +# @txsession : 32 bit transmit session
> +# 
> +# @rxsession : 32 bit receive session - if not specified set to the same value as transmit
> +# 
> +# @optional : additional offset - allows the insertion of additional application-specific data before the packet payload
> +# 
> +#
> +# Since 1.2
> +##
> +##
> +{ 'type': 'NetdevL2TPv3Options',
> +  'data': {
> +    'src':	  'str', 
> +    'dst':	  'str', 
> +    '*srcport':	  'str', 
> +    '*dstport':	  'str', 
> +    '*ipv6':	  'bool', 
> +    '*udp':	  'bool', 
> +    '*cookie64':  'bool', 
> +    '*counter':   'bool', 
> +    '*txcookie':  'uint64',
> +    '*rxcookie':  'uint64',
> +    'txsession': 'uint32',
> +    '*rxsession': 'uint32',
> +    '*offset':  'uint32' } }
> +
> +##
>  ##
>  # @NetdevVdeOptions
>  #
> @@ -3014,13 +3065,16 @@
>  # A discriminated record of network device traits.
>  #
>  # Since 1.2
> -##
> +#
> +# Added in 2.0 - l2tpv3 
> +#
>  { 'union': 'NetClientOptions',
>    'data': {
>      'none':     'NetdevNoneOptions',
>      'nic':      'NetLegacyNicOptions',
>      'user':     'NetdevUserOptions',
>      'tap':      'NetdevTapOptions',
> +    'l2tpv3':   'NetdevL2TPv3Options',
>      'socket':   'NetdevSocketOptions',
>      'vde':      'NetdevVdeOptions',
>      'dump':     'NetdevDumpOptions',
> -- 
> 1.7.10.4
>
Anton Ivanov March 6, 2014, 10:28 a.m. UTC | #4
On 06/03/14 09:44, Stefan Hajnoczi wrote:
> On Wed, Mar 05, 2014 at 02:12:20PM +0000, anton.ivanov@kot-begemot.co.uk wrote:
>
> Please don't put "New feature - " in the commit message.  "net: " or
> "l2tpv3: " would be a good prefix (see git-log(1) for other examples).
> The idea behind using a prefix is that you can immediately determine
> which area of code is affected by a commit.  It also helps with grepping
> the commit history although there are more precise ways to do that.
>
> Anyway, "New feature" isn't a useful prefix.

Understood.

>
>> From: Anton Ivanov <antivano@cisco.com>
>>
>> This transport allows qemu to communicate with host if host
>> supports L2TPv3, communicate directly VM to VM (similar to
>> current socket transport) and VM to other device - f.e. VM to
>> a router.
>>
>> Supported
>>      * Raw IPv4, Raw IPv6, UDPv4 and UDPv6 as per RFC
>>      * No cookie, 32 bit cookie or 64 bit cookie
>>      * Counter (as per RFC)
>>      * No counter (draft-mkonstan-keyed-ipv6-tunnel-00)
>> Unsupported
>>      * Workarounds for implementation with broken counter handling
>>
>> Signed-off-by: Anton Ivanov <antivano@cisco.com>
>> ---
>>   net/Makefile.objs |    1 +
>>   net/clients.h     |    2 +
>>   net/l2tpv3.c      |  554 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   net/net.c         |    3 +
>>   qapi-schema.json  |   56 +++++-
>>   5 files changed, 615 insertions(+), 1 deletion(-)
>>   create mode 100644 net/l2tpv3.c
> Please add documentation to qemu-options.hx.  You can follow the other
> -netdev documentation examples in that file.
>
> Also please add example Linux commands for setting up L2TPv3 to the
> commit description.  This will help anyone wishing to try out the code.
> Something along the lines of:
>
>    # ip l2tp add tunnel tunnel_id 1 peer_tunnel_id 1 udp_sport 5000 \
>      udp_dport 5001 encap udp local 127.0.0.1 remote 127.0.0.1
>    # ip l2tp add session tunnel_id 1 session_id 1 peer_session_id 1
>    # ip addr add 192.168.2.1/32 peer 192.168.2.2/32 dev l2tpeth0
>    # ifconfig l2tpeth0 up
>    # qemu -netdev l2tpv3,src=127.0.0.1,dst=127.0.0.1,\
>                   srcport=5001,dstport=5000,udp=on,...
>
> (incomplete but should be close)

Understood, will do as well as basic router commands.

>
>> diff --git a/net/Makefile.objs b/net/Makefile.objs
>> index 4854a14..160214e 100644
>> --- a/net/Makefile.objs
>> +++ b/net/Makefile.objs
>> @@ -2,6 +2,7 @@ common-obj-y = net.o queue.o checksum.o util.o hub.o
>>   common-obj-y += socket.o
>>   common-obj-y += dump.o
>>   common-obj-y += eth.o
>> +common-obj-$(CONFIG_LINUX) += l2tpv3.o
>>   common-obj-$(CONFIG_POSIX) += tap.o
>>   common-obj-$(CONFIG_LINUX) += tap-linux.o
>>   common-obj-$(CONFIG_WIN32) += tap-win32.o
>> diff --git a/net/clients.h b/net/clients.h
>> index 7793294..bbf177c 100644
>> --- a/net/clients.h
>> +++ b/net/clients.h
>> @@ -47,6 +47,8 @@ int net_init_tap(const NetClientOptions *opts, const char *name,
>>   int net_init_bridge(const NetClientOptions *opts, const char *name,
>>                       NetClientState *peer);
>>   
>> +int net_init_l2tpv3(const NetClientOptions *opts, const char *name,
>> +                    NetClientState *peer);
>>   #ifdef CONFIG_VDE
>>   int net_init_vde(const NetClientOptions *opts, const char *name,
>>                    NetClientState *peer);
>> diff --git a/net/l2tpv3.c b/net/l2tpv3.c
>> new file mode 100644
>> index 0000000..3348b5d
>> --- /dev/null
>> +++ b/net/l2tpv3.c
> Please run patches through scripts/checkpatch.pl before submitting them.
> See here for how to set up git to automatically check your code as you
> commit:
> http://blog.vmsplice.net/2011/03/how-to-automatically-run-checkpatchpl.html

Eric already asked me to do that, next submission will be processed this 
way.

>
>> @@ -0,0 +1,554 @@
>> +/*
>> + * QEMU System Emulator
>> + *
>> + * Copyright (c) 2003-2008 Fabrice Bellard
>> + * Copyright (c) 2012-2014 Cisco Systems
>> + *
>> + * Permission is hereby granted, free of charge, to any person obtaining a copy
>> + * of this software and associated documentation files (the "Software"), to deal
>> + * in the Software without restriction, including without limitation the rights
>> + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
>> + * copies of the Software, and to permit persons to whom the Software is
>> + * furnished to do so, subject to the following conditions:
>> + *
>> + * The above copyright notice and this permission notice shall be included in
>> + * all copies or substantial portions of the Software.
>> + *
>> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
>> + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
>> + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
>> + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
>> + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
>> + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
>> + * THE SOFTWARE.
>> + */
>> +
>> +#include <linux/ip.h>
>> +#include <netdb.h>
>> +#include "config-host.h"
>> +#include "net/net.h"
>> +#include "clients.h"
>> +#include "monitor/monitor.h"
>> +#include "qemu-common.h"
>> +#include "qemu/error-report.h"
>> +#include "qemu/option.h"
>> +#include "qemu/sockets.h"
>> +#include "qemu/iov.h"
>> +#include "qemu/main-loop.h"
>> +
>> +
>> +/* The buffer size needs to be investigated for optimum numbers and
>> + * optimum means of paging in on different systems. This size is
>> + * chosen to be sufficient to accommodate one packet with some headers
>> + */
>> +
>> +#define BUFFER_ALIGN sysconf(_SC_PAGESIZE)
>> +#define BUFFER_SIZE 2048
>> +#define IOVSIZE 2
>> +#define MAX_L2TPV3_MSGCNT 32
> Please avoid trailing whitespace, git prints a warning about it when
> applying patches.  There are occurrences throughout this file.  You can
> remove it in vi with :%s/\s*$//g
>
>> +#define MAX_L2TPV3_IOVCNT (MAX_L2TPV3_MSGCNT * IOVSIZE)
>> +
>> +#ifndef IPPROTO_L2TP
>> +#define IPPROTO_L2TP 0x73
>> +#endif
> Not needed, see below for details.
>
>> +typedef struct NetL2TPV3State {
>> +    NetClientState nc;
>> +    int fd;
>  From here...
>
>> +    int state;
>> +    unsigned int index;
>> +    unsigned int packet_len;
> ...to here should be dropped.  These fields are not used and were copied
> from another netdev implementation.
>
> (Mentioned in a previous review.)
>
>> +
>> +    /*
>> +     *	these are used for xmit - that happens packet a time
>> +     *	and for first sign of life packet (easier to parse that once)
>> +     */
>> +
>> +    uint8_t *header_buf;
>> +    struct iovec *vec;
>> +
>> +    /*
>> +     * these are used for receive - try to "eat" up to 32 packets at a time
>> +     */
>> +
>> +    struct mmsghdr *msgvec;
>> +
>> +    /*
>> +     * peer address
>> +     */
>> +
>> +    struct sockaddr_storage *dgram_dst;
>> +    uint32_t dst_size;
>> +
>> +    /*
>> +     * L2TPv3 parameters
>> +     */
>> +
>> +    uint64_t rx_cookie;
>> +    uint64_t tx_cookie;
>> +    uint32_t rx_session;
>> +    uint32_t tx_session;
>> +    uint32_t header_size;
>> +    uint32_t counter;
>> +
>> +    /*
>> +     * Precomputed offsets
>> +     */
>> +
>> +    uint32_t offset;
>> +    uint32_t cookie_offset;
>> +    uint32_t counter_offset;
>> +    uint32_t session_offset;
>> +
>> +    /* Flags */
>> +
>> +    bool ipv6;
>> +    bool udp;
>> +    bool has_counter;
>> +    bool cookie;
>> +    bool cookie_is_64;
>> +
>> +} NetL2TPV3State;
>> +
>> +typedef struct NetL2TPV3ListenState {
>> +    NetClientState nc;
>> +    char *model;
>> +    char *name;
>> +    int fd;
>> +} NetL2TPV3ListenState;
> NetL2TPV3ListenState is not used.  Please remove.
>
> (Mentioned in a previous review.)

OK.

>
>> +
>> +static int l2tpv3_form_header(NetL2TPV3State *s)
> I suggest making this void since there is no error return and callers
> don't check.

OK.

>
>> +{
>> +    uint32_t *header;
>> +    uint32_t *session;
>> +    uint64_t *cookie64;
>> +    uint32_t *cookie32;
>> +    uint32_t *counter;
>> +
>> +    if (s->udp) {
>> +	header = (uint32_t *) s->header_buf;
>> +	stl_be_p(header, 0x30000);
> I guess the 3 is the protocol version number.  A comment about magic
> number would help.

No. It signifies a data packet. Agree - a constant will be helpful.

>
>> +    }
> QEMU coding style is 4-space indentation.  Please reindent the code.

I thought I did, I will re-run the source through a few perl one liners 
to make sure it is all indented correctly.

>
>> +    session = (uint32_t *) (s->header_buf + s->session_offset);
>> +    stl_be_p(session, s->tx_session);
>> +
>> +    if (s->cookie) {
>> +	if (s->cookie_is_64) {
>> +	    cookie64 = (uint64_t *)(s->header_buf + s->cookie_offset);
>> +	    stq_be_p(cookie64, s->tx_cookie);
>> +	} else {
>> +	    cookie32 = (uint32_t *) (s->header_buf + s->cookie_offset);
>> +	    stl_be_p(cookie32, s->tx_cookie);
>> +	}
>> +    }
>> +
>> +    if (s->has_counter) {
>> +	counter = (uint32_t *)(s->header_buf + s->counter_offset);
>> +	stl_be_p(counter, ++s->counter);
>> +    }
>> +    return 0;
>> +}
>> +
>> +static ssize_t net_l2tpv3_receive_dgram_iov(NetClientState *nc,
>> +				    const struct iovec *iov,
>> +				    int iovcnt)
>> +{
>> +    NetL2TPV3State *s = DO_UPCAST(NetL2TPV3State, nc, nc);
>> +
>> +    struct msghdr message;
>> +    int ret;
>> +
>> +    if (iovcnt > MAX_L2TPV3_IOVCNT - 1) {
>> +	error_report("iovec too long %d > %d, change l2tpv3.h\n", iovcnt, MAX_L2TPV3_IOVCNT);
> error_report() messages should not include a newline.

Ok. Understood.

>
>> +	return -1;
>> +    }
>> +    l2tpv3_form_header(s);
>> +    memcpy(s->vec + 1, iov, iovcnt * sizeof(struct iovec));
>> +    s->vec->iov_base = s->header_buf;
>> +    s->vec->iov_len = s->offset;
>> +    message.msg_name = s->dgram_dst;
>> +    message.msg_namelen = s->dst_size;
>> +    message.msg_iov = (struct iovec *) s->vec;
> No need to cast here.
>
>> +    message.msg_iovlen = iovcnt + 1;
>> +    message.msg_control = NULL;
>> +    message.msg_controllen = 0;
>> +    message.msg_flags = 0;
>> +    do {
>> +	ret = sendmsg(s->fd, &message, 0);
>> +    } while ((ret == -1) && (errno == EINTR));
>> +    if (ret > 0) {
>> +	ret -= s->offset;
>> +    } else if (ret == 0) {
>> +	ret = iov_size(iov, iovcnt);
>> +    }
>> +    return ret;
> If the socket buffer (in the host kernel) is full this will drop
> packets.  The other netdev implementations return 0 and wait until the
> file descriptor becomes writable again (see net/tap.c or net/socket.c).
>
> Please either implement the same approach or explain why you disagree.

Will implement the correct approach (it did not quite work with one of 
our orginal features which we removed - listen-only mode). Probably me 
returning incorrect error code.

>
> (Mentioned in previous review.)
>
>> +}
>> +
>> +static ssize_t net_l2tpv3_receive_dgram(NetClientState *nc,
>> +					const uint8_t *buf,
>> +					size_t size)
>> +{
>> +    NetL2TPV3State *s = DO_UPCAST(NetL2TPV3State, nc, nc);
>> +
>> +    struct iovec * vec;
>> +    struct msghdr message;
>> +    ssize_t ret = 0;
>> +
>> +    l2tpv3_form_header(s);
>> +    vec = s->vec;
>> +    vec->iov_base = s->header_buf;
>> +    vec->iov_len = s->offset;
>> +    vec++;
>> +    vec->iov_base = (void *) buf;
> Cast is unnecessary.  In C any pointer casts to void* and vice versa
> without warnings.

Bad habits die hard :)

>
>> +    vec->iov_len = size;
>> +    message.msg_name = s->dgram_dst;
>> +    message.msg_namelen = s->dst_size;
>> +    message.msg_iov = (struct iovec *) s->vec;
> Cast is unnecessary.
>
>> +    message.msg_iovlen = 2;
>> +    message.msg_control = NULL;
>> +    message.msg_controllen = 0;
>> +    message.msg_flags = 0;
>> +    do {
>> +	ret = sendmsg(s->fd, &message, 0);
>> +    } while ((ret == -1) && (errno == EINTR));
>> +    if (ret > 0) {
>> +	ret -= s->offset;
>> +    } else if (ret == 0) {
>> +	ret = size;
>> +    }
>> +    return ret;
>> +}
>> +
>> +static int l2tpv3_verify_header(NetL2TPV3State *s,
>> +				    uint8_t *buf)
>> +{
>> +
>> +    uint64_t *cookie64;
>> +    uint32_t *cookie32;
>> +    uint32_t *session;
>> +
>> +    if ((!s->udp) && (!s->ipv6)){
>> +	buf += sizeof(struct iphdr) /* fix for ipv4 raw */;
>> +    }
>> +    if (s->cookie) {
>> +	if (s->cookie_is_64) {
>> +	    /* 64 bit cookie */
>> +	    cookie64 = (uint64_t *)(buf + s->cookie_offset);
>> +	    if (ldq_be_p(cookie64) != s->rx_cookie) {
>> +		error_report("unknown cookie id\n");
>> +		return -1;
>> +	    }
>> +	} else {
>> +	    cookie32 = (uint32_t *)(buf + s->cookie_offset);
>> +	    if (ldl_be_p(cookie32) != * (uint32_t *) &s->rx_cookie) {
> Ouch, the rx_cookie cast assumes little-endian host!  See below for a
> cleaner solution.
>
>> +		error_report("unknown cookie id\n");
>> +		return -1 ;
>> +	    }
>> +	}
> The casting and code duplication is not necessary:
>
> uint64_t cookie;
> if (s->cookie_is_64) {
>      cookie = ldq_be_p(buf + s->cookie_offset);
> } else {
>      cookie = ldl_be_p(buf + s->cookie_offset);
> }
> if (cookie != s->rx_cookie) {
>      error_report("unknown cookie id\n");
>      return -1;
> }

Ok, I read the source of ldq and ldl. I probably misunderstood the finer 
points for non-x86.

>
>> +    }
>> +    session = (uint32_t *) (buf + s->session_offset);
>> +    if (ldl_be_p(session) != s->rx_session) {
>> +	error_report("session mismatch\n");
>> +	return -1;
>> +    }	
>> +    return 0;
>> +}
>> +
>> +static void net_l2tpv3_send(void *opaque)
>> +{
>> +    NetL2TPV3State *s = opaque;
>> +
>> +    int i, count, offset;
>> +    struct mmsghdr * msgvec;
>> +    struct iovec * vec;
>> +
>> +    msgvec = s->msgvec;
>> +    offset = s->offset;
>> +    if ((!s->udp) && (!s->ipv6)){
>> +	offset +=   sizeof(struct iphdr);
>> +    }
>> +    count = recvmmsg(s->fd, msgvec, MAX_L2TPV3_MSGCNT, MSG_DONTWAIT, NULL);
>> +    for (i=0;i<count;i++) {
> QEMU coding style uses spaces:
> for (i = 0; i < count; i++) {

Understood will re-format.

>
>> +	if (msgvec->msg_len > 0) {
>> +	    vec = msgvec->msg_hdr.msg_iov;
>> +	    if (
>> +		(msgvec->msg_len > offset) &&
>> +		(l2tpv3_verify_header(s, vec->iov_base) == 0)
>> +		) {
> if ((msgvec->msg_len > offset) &&
>      (l2tcpv3_verify_header(s, vec->iov_base) == 0))
>
>> +		vec++;
>> +		qemu_send_packet(&s->nc, vec->iov_base, msgvec->msg_len - offset);
>> +	    } else {
>> +		error_report("l2tpv3 header verification failed\n");
>> +		vec++;
> vec isn't used after this so the increment can be removed.

Correct. It was there because originally the code was resetting 
iovec_len. However, we found that it is not modified by recvmmsg so it 
is unnecessary.
>
>> +	    }
>> +	}
>> +	msgvec++;
>> +    }
>> +}
>> +
>> +static void destroy_vector(struct mmsghdr * msgvec, int count, int iovcount)
>> +{
>> +    int i, j;
>> +    struct iovec * iov;
>> +    struct mmsghdr * cleanup = msgvec;
>> +    if (cleanup) {
>> +	for (i=0;i<count;i++) {
>> +	    if (cleanup->msg_hdr.msg_iov) {
>> +		iov = cleanup->msg_hdr.msg_iov;
>> +		for (j=0;j<iovcount;j++) {
>> +		    if (iov->iov_base) {
>> +			g_free(iov->iov_base);
>> +		    }
> g_free(NULL) is a nop so you don't need to check iov->iov_base before
> calling g_free().

Artefact of porting code from "regular" malloc/free, apologies, will 
take into account.

>
>> +		    iov++;
>> +		}
>> +		g_free(cleanup->msg_hdr.msg_iov);
>> +	    }
>> +	    cleanup++;
>> +	}
>> +	g_free(msgvec);
>> +    }
>> +}
>> +
>> +static struct mmsghdr * build_l2tpv3_vector(NetL2TPV3State *s, int count)
>> +{
>> +    int i;
>> +    struct iovec * iov;
>> +    struct mmsghdr * msgvec, *result;
>> +
>> +    msgvec = g_malloc(sizeof(struct mmsghdr) * count);
>> +    result = msgvec;
>> +    for (i=0;i < count ;i++) {
>> +	msgvec->msg_hdr.msg_name = NULL;
>> +	msgvec->msg_hdr.msg_namelen = 0;
>> +	iov =  g_malloc(sizeof(struct iovec) * IOVSIZE);
>> +	msgvec->msg_hdr.msg_iov = iov;
>> +	if ((!s->udp) && (!s->ipv6)){
>> +	    /* fix for ipv4 raw */;
>> +	    iov->iov_base = g_malloc(s->offset + sizeof(struct iphdr));
>> +	    iov->iov_len = s->offset + sizeof (struct iphdr);
>> +	} else {
>> +	    iov->iov_base = g_malloc(s->offset);
>> +	    iov->iov_len = s->offset;
>> +	}
>> +	iov++ ;
>> +	iov->iov_base = qemu_memalign(BUFFER_ALIGN,BUFFER_SIZE);
>> +	iov->iov_len = BUFFER_SIZE;
>> +	msgvec->msg_hdr.msg_iovlen = 2;
>> +	msgvec->msg_hdr.msg_control = NULL;
>> +	msgvec->msg_hdr.msg_controllen = 0;
>> +	msgvec->msg_hdr.msg_flags = 0;
>> +	msgvec++;
>> +    }
>> +    return result;
>> +}
>> +
>> +static void net_l2tpv3_cleanup(NetClientState *nc)
>> +{
>> +    NetL2TPV3State *s = DO_UPCAST(NetL2TPV3State, nc, nc);
>> +    qemu_set_fd_handler(s->fd, NULL, NULL, NULL);
>> +    close(s->fd);
>> +    destroy_vector(s->msgvec, MAX_L2TPV3_MSGCNT, IOVSIZE);
>> +    g_free(s->header_buf);
>> +    g_free(s->dgram_dst);
>> +}
>> +
>> +static NetClientInfo net_l2tpv3_info = {
>> +    .type = NET_CLIENT_OPTIONS_KIND_L2TPV3,
>> +    .size = sizeof(NetL2TPV3State),
>> +    .receive = net_l2tpv3_receive_dgram,
>> +    .receive_iov = net_l2tpv3_receive_dgram_iov,
>> +    .cleanup = net_l2tpv3_cleanup,
>> +};
>> +
>> +int net_init_l2tpv3(const NetClientOptions *opts,
>> +                    const char *name,
>> +                    NetClientState *peer)
>> +{
>> +
>> +
>> +    const NetdevL2TPv3Options * l2tpv3;
>> +    NetL2TPV3State *s;
>> +    NetClientState *nc;
>> +    int fd, gairet;
>> +    struct addrinfo hints;
>> +    struct addrinfo * result = NULL;
>> +    char * srcport, * dstport;
>> +
>> +    nc = qemu_new_net_client(&net_l2tpv3_info, peer, "l2tpv3", name);
>> +
>> +    s = DO_UPCAST(NetL2TPV3State, nc, nc);
>> +
>> +    assert(opts->kind == NET_CLIENT_OPTIONS_KIND_L2TPV3);
>> +    l2tpv3 = opts->l2tpv3;
>> +
>> +    if (l2tpv3->has_ipv6 && l2tpv3->ipv6) {
>> +	s->ipv6 = l2tpv3->ipv6;
>> +    } else {
>> +	s->ipv6 = false;
>> +    }
>> +
>> +    if (l2tpv3->has_rxcookie || l2tpv3->has_txcookie) {
>> +	if (l2tpv3->has_rxcookie && l2tpv3->has_txcookie) {
>> +	    s->cookie = true;
>> +	} else {
>> +	    return -1;
> This leaks nc.  I suggest adding a goto err where nc is freed.  Other
> error return paths should also free resources and can use this label.

OK.

>
>> +	}
>> +    } else {
>> +	s->cookie = false;
>> +    }
>> +
>> +    if (l2tpv3->has_cookie64 || l2tpv3->cookie64) {
>> +	s->cookie_is_64  = true;
>> +    } else {
>> +	s->cookie_is_64  = false;
>> +    }
>> +
>> +    if (l2tpv3->has_udp && l2tpv3->udp) {
>> +	s->udp = true;
>> +	if (!(l2tpv3->has_srcport && l2tpv3->has_dstport)) {
>> +	    error_report("l2tpv3_open : need both src and dst port for udp\n");
>> +	    return -1;
> Leaks nc.
>
>> +	} else {
>> +	    srcport = l2tpv3->srcport;
>> +	    dstport = l2tpv3->dstport;
>> +	}
>> +    } else {
>> +	s->udp = false;
>> +	srcport = NULL;
>> +	dstport = NULL;
>> +    }
>> +
>> +
>> +    s->offset = 4;
>> +    s->session_offset = 0;
>> +    s->cookie_offset = 4;
>> +    s->counter_offset = 4;
>> +
>> +    s->tx_session = l2tpv3->txsession;
>> +    if (l2tpv3->has_rxsession) {
>> +	s->rx_session = l2tpv3->rxsession;
>> +    } else {
>> +	s->rx_session = s->tx_session;
>> +    }
>> +
>> +    if (s->cookie) {
>> +	s->rx_cookie = l2tpv3->rxcookie;
>> +	s->tx_cookie = l2tpv3->txcookie;
>> +	if (s->cookie_is_64 == true) {
>> +	    /* 64 bit cookie */
>> +	    s->offset += 8;
>> +	    s->counter_offset += 8;
>> +	} else {
>> +	    /* 32 bit cookie */
>> +	    s->offset += 4;
>> +	    s->counter_offset +=4;
>> +	}
>> +    }
>> +
>> +    memset(&hints, 0, sizeof(hints));
>> +
>> +    if (s->ipv6) {
>> +	hints.ai_family = AF_INET6;
>> +    } else {
>> +	hints.ai_family = AF_INET;
>> +    }
>> +    if (s->udp) {
>> +	hints.ai_socktype = SOCK_DGRAM;
>> +	hints.ai_protocol = 0;
>> +	s->offset += 4;
>> +	s->counter_offset += 4;
>> +	s->session_offset += 4;
>> +	s->cookie_offset += 4;
>> +    } else {
>> +	hints.ai_socktype = SOCK_RAW;
>> +	hints.ai_protocol = IPPROTO_L2TP;
>> +    }
>> +
>> +    gairet= getaddrinfo(l2tpv3->src, srcport, &hints, &result);
>> +
>> +    if ((gairet !=0) || (result == NULL)) {
>> +	error_report("l2tpv3_open : could not resolve src, errno = %s\n", gai_strerror(gairet));
> Leaks nc.
>
>> +	return -1;
>> +    }
>> +
>> +    if ((fd = socket(result->ai_family, result->ai_socktype, result->ai_protocol)) == -1) {
>> +	fd = -errno;
>> +	error_report("l2tpv3_open : socket creation failed, errno = %d\n", -fd);
>> +	freeaddrinfo(result);
> Leaks nc.
>
>> +	return fd;
>> +    }
>> +    if (bind(fd, (struct sockaddr *) result->ai_addr, result->ai_addrlen)) {
>> +	error_report("l2tpv3_open :  could not bind socket err=%i\n", errno);
>> +	close(fd);
> Leaks nc and result.
>
>> +	return -1;
>> +    }
>> +
>> +    freeaddrinfo(result);
>> +
>> +    memset(&hints, 0, sizeof(hints));
>> +
>> +    if (s->ipv6) {
>> +	hints.ai_family = AF_INET6;
>> +    } else {
>> +	hints.ai_family = AF_INET;
>> +    }
>> +    if (s->udp) {
>> +	hints.ai_socktype = SOCK_DGRAM;
>> +	hints.ai_protocol = 0;
>> +    } else {
>> +	hints.ai_socktype = SOCK_RAW;
>> +	hints.ai_protocol = IPPROTO_L2TP;
> Hang on, this is bogus.  This is a *userspace* L2TP implementation!
>
> We don't want a kernel L2TP driver to handle this socket.  Luckily this
> never happens anyway since net/l2tp/l2tp_ip.c only registers its socket
> type for <AF_INET, SOCK_DGRAM, IPPROTO_L2TP> and <AF_INET6, SOCK_DGRAM,
> IPPROTO_L2TP>.
>
> When we create this socket with <AF_INET, SOCK_RAW, IPPROTO_L2TP> what
> really happens is that the kernel falls back to the IPv4 raw socket
> driver due to a wildcard match.
>
> In other words, we shouldn't use IPPROTO_L2TP.  Just use 0.

OK. Thanks. Will fix, reread the kernel code to make sure we never clash 
with it and retest.

>
>> +    }
>> +
>> +    gairet= getaddrinfo(l2tpv3->dst, dstport, &hints, &result);
>> +    if ((gairet !=0) || (result == NULL)) {
>> +	error_report("l2tpv3_open : could not resolve dst, error = %s\n", gai_strerror(gairet));
>> +	return -1;
> More leaks.
>
>> +    }
>> +
>> +    s->dgram_dst = g_malloc(sizeof(struct sockaddr_storage));
>> +    memset(s->dgram_dst, '\0' , sizeof(struct sockaddr_storage));
>> +    memcpy(s->dgram_dst, result->ai_addr, result->ai_addrlen);
> Why is this field heap-allocated?  It doesn't need to be a pointer and
> then you can avoid the g_malloc()/g_free().

It does not need to be. Habits.

>
>> +    s->dst_size = result->ai_addrlen;
>> +
>> +    freeaddrinfo(result);
>> +
>> +    if (l2tpv3->has_counter && l2tpv3->counter) {
>> +	s->has_counter = true;
>> +	s->offset += 4;
>> +    } else {
>> +	s->has_counter = false;
>> +    }
>> +
>> +    if (l2tpv3->has_offset) {
>> +	/* extra offset */
>> +	s->offset += l2tpv3->offset;
>> +    }
>> +
>> +    s->msgvec = build_l2tpv3_vector(s, MAX_L2TPV3_MSGCNT);
>> +    s->vec = g_malloc(sizeof(struct iovec) * MAX_L2TPV3_IOVCNT);
> It's up to you if you want to bother heap-allocating this.  Seems
> simpler to embed it in the struct:
> struct iovec vec[MAX_L2TPV3_IOVCNT];
>
>> +    if ((!s->udp) && (!s->ipv6)){
>> +	s->header_buf = g_malloc(s->offset + sizeof (struct iphdr));
>> +    } else {
>> +	s->header_buf = g_malloc(s->offset);
>> +    }
>> +
>> +    qemu_set_nonblock(fd);
>> +
>> +    if (fd < 0)
>> +	return -1;
> QEMU coding style always uses curlies:
> if (fd < 0) {
>      return -1;
> }
>
> Anyway, this check is redundant because we already checked when creating
> the socket.
>
>> +
>> +    s->fd = fd;
>> +    s->counter = 0;
>> +
>> +    qemu_set_fd_handler(s->fd, net_l2tpv3_send, NULL, s);
>> +
>> +    if (!s) {
>> +	error_report("l2tpv3_open : failed to set fd handler\n");
>> +	return -1;
>> +    }
> Bogus check.  s is never NULL here.
>
>> +    snprintf(s->nc.info_str, sizeof(s->nc.info_str),
>> +             "l2tpv3: connected");
>> +    return 0;
>> +}
>> +
>> diff --git a/net/net.c b/net/net.c
>> index 0a88e68..749d34c 100644
>> --- a/net/net.c
>> +++ b/net/net.c
>> @@ -731,6 +731,9 @@ static int (* const net_client_init_fun[NET_CLIENT_OPTIONS_KIND_MAX])(
>>           [NET_CLIENT_OPTIONS_KIND_BRIDGE]    = net_init_bridge,
>>   #endif
>>           [NET_CLIENT_OPTIONS_KIND_HUBPORT]   = net_init_hubport,
>> +#ifdef CONFIG_LINUX
>> +        [NET_CLIENT_OPTIONS_KIND_L2TPV3]    = net_init_l2tpv3,
>> +#endif
>>   };
>>   
>>   
>> diff --git a/qapi-schema.json b/qapi-schema.json
>> index 83fa485..7d9f69f 100644
>> --- a/qapi-schema.json
>> +++ b/qapi-schema.json
>> @@ -2940,6 +2940,57 @@
>>       '*localaddr': 'str',
>>       '*udp':       'str' } }
>>   
>> +# @NetdevL2TPv3Options
>> +#
>> +# Connect the VLAN to Ethernet over L2TPv3 Static tunnel
>> +#
>> +# @src :source address
>> +#
>> +# @dst :destination address
>> +#
>> +# @srcport :#optional source port - mandatory for udp, optional for ip
>> +#
>> +# @dstport :#optional destination port - mandatory for udp, optional for ip
>> +#
>> +# @ipv6 :#optional - force the use of ipv6
>> +#
>> +# @udp :#optional - use the udp version of l2tpv3 encapsulation
>> +#
>> +# @cookie64:#optional - use 64 bit coookies
>> +#
>> +# @counter :#optional have sequence counter
>> +#
>> +# @txcookie :#optional 32 or 64 bit transmit cookie
>> +#
>> +# @rxcookie :#optional 32 or 64 bit receive cookie
>> +#
>> +# @txsession : 32 bit transmit session
>> +#
>> +# @rxsession : 32 bit receive session - if not specified set to the same value as transmit
>> +#
>> +# @optional : additional offset - allows the insertion of additional application-specific data before the packet payload
>> +#
>> +#
>> +# Since 1.2
>> +##
>> +##
>> +{ 'type': 'NetdevL2TPv3Options',
>> +  'data': {
>> +    'src':	  'str',
>> +    'dst':	  'str',
>> +    '*srcport':	  'str',
>> +    '*dstport':	  'str',
>> +    '*ipv6':	  'bool',
>> +    '*udp':	  'bool',
>> +    '*cookie64':  'bool',
>> +    '*counter':   'bool',
>> +    '*txcookie':  'uint64',
>> +    '*rxcookie':  'uint64',
>> +    'txsession': 'uint32',
>> +    '*rxsession': 'uint32',
>> +    '*offset':  'uint32' } }
>> +
>> +##
>>   ##
>>   # @NetdevVdeOptions
>>   #
>> @@ -3014,13 +3065,16 @@
>>   # A discriminated record of network device traits.
>>   #
>>   # Since 1.2
>> -##
>> +#
>> +# Added in 2.0 - l2tpv3
>> +#
>>   { 'union': 'NetClientOptions',
>>     'data': {
>>       'none':     'NetdevNoneOptions',
>>       'nic':      'NetLegacyNicOptions',
>>       'user':     'NetdevUserOptions',
>>       'tap':      'NetdevTapOptions',
>> +    'l2tpv3':   'NetdevL2TPv3Options',
>>       'socket':   'NetdevSocketOptions',
>>       'vde':      'NetdevVdeOptions',
>>       'dump':     'NetdevDumpOptions',
>> -- 
>> 1.7.10.4
>>
Anton Ivanov March 9, 2014, 5:06 p.m. UTC | #5
>> +	return -1;
>> +    }
>> +
>> +    freeaddrinfo(result);
>> +
>> +    memset(&hints, 0, sizeof(hints));
>> +
>> +    if (s->ipv6) {
>> +	hints.ai_family = AF_INET6;
>> +    } else {
>> +	hints.ai_family = AF_INET;
>> +    }
>> +    if (s->udp) {
>> +	hints.ai_socktype = SOCK_DGRAM;
>> +	hints.ai_protocol = 0;
>> +    } else {
>> +	hints.ai_socktype = SOCK_RAW;
>> +	hints.ai_protocol = IPPROTO_L2TP;
> Hang on, this is bogus.  This is a *userspace* L2TP implementation!
>
> We don't want a kernel L2TP driver to handle this socket.  Luckily this
> never happens anyway since net/l2tp/l2tp_ip.c only registers its socket
> type for <AF_INET, SOCK_DGRAM, IPPROTO_L2TP> and <AF_INET6, SOCK_DGRAM,
> IPPROTO_L2TP>.
>
> When we create this socket with <AF_INET, SOCK_RAW, IPPROTO_L2TP> what
> really happens is that the kernel falls back to the IPv4 raw socket
> driver due to a wildcard match.
>
> In other words, we shouldn't use IPPROTO_L2TP.  Just use 0.
>

This is not passed to socket directly - both setups share a call to 
getaddinfo() after that and use whatever it returns.

If you pass family, RAW, 0 to getaddrinfo it returns  family, DGRAM, 0. 
So when you use it later on the socket is setup incorrectly.

If you pass RAW, PROTO_L2TPV3 it returns the correct values to setup the 
socket. Bug for bug canceling each other out :(

Otherwise, as far as the kernel is concerned you are absolutely correct 
- I looked at the code, this is passed to userspace if it hits the 
wildcard.

I have added some comments to address this at the PROTO_L2TPV3 
definition so that it is not fixed by mistake later on.

I have addressed all other comments and added some examples to the .hx 
(tested vs openwrt barrier breaker/3.8.13).

I am going to retest the final version and if it passes all tests 
resubmit sometimes tomorrow.

A.
Stefan Hajnoczi March 10, 2014, 8:35 a.m. UTC | #6
On Sun, Mar 09, 2014 at 05:06:15PM +0000, Anton Ivanov wrote:
> >>+	return -1;
> >>+    }
> >>+
> >>+    freeaddrinfo(result);
> >>+
> >>+    memset(&hints, 0, sizeof(hints));
> >>+
> >>+    if (s->ipv6) {
> >>+	hints.ai_family = AF_INET6;
> >>+    } else {
> >>+	hints.ai_family = AF_INET;
> >>+    }
> >>+    if (s->udp) {
> >>+	hints.ai_socktype = SOCK_DGRAM;
> >>+	hints.ai_protocol = 0;
> >>+    } else {
> >>+	hints.ai_socktype = SOCK_RAW;
> >>+	hints.ai_protocol = IPPROTO_L2TP;
> >Hang on, this is bogus.  This is a *userspace* L2TP implementation!
> >
> >We don't want a kernel L2TP driver to handle this socket.  Luckily this
> >never happens anyway since net/l2tp/l2tp_ip.c only registers its socket
> >type for <AF_INET, SOCK_DGRAM, IPPROTO_L2TP> and <AF_INET6, SOCK_DGRAM,
> >IPPROTO_L2TP>.
> >
> >When we create this socket with <AF_INET, SOCK_RAW, IPPROTO_L2TP> what
> >really happens is that the kernel falls back to the IPv4 raw socket
> >driver due to a wildcard match.
> >
> >In other words, we shouldn't use IPPROTO_L2TP.  Just use 0.
> >
> 
> This is not passed to socket directly - both setups share a call to
> getaddinfo() after that and use whatever it returns.
> 
> If you pass family, RAW, 0 to getaddrinfo it returns  family, DGRAM,
> 0. So when you use it later on the socket is setup incorrectly.
> 
> If you pass RAW, PROTO_L2TPV3 it returns the correct values to setup
> the socket. Bug for bug canceling each other out :(

That doesn't match what I see.  Can you double-check your test and figure out
what is happening?

Here is my test:

$ grep -r SOCK_RAW /usr/include/
/usr/include/bits/socket_type.h:  SOCK_RAW = 3, /* Raw protocol interface.  */
$ ./a
src ai_family 2 ai_socketype 3 ai_protocol 0
dst ai_family 2 ai_socketype 3 ai_protocol 0
$ cat a.c
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

int main(int argc, char **argv)
{
    int fd, gairet;
    struct addrinfo hints;
    struct addrinfo * result = NULL;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_RAW;
    hints.ai_protocol = 0;

    gairet = getaddrinfo("localhost", NULL, &hints, &result);
    if ((gairet !=0) || (result == NULL)) {
        fprintf(stderr, "could not resolve src, errno = %s\n", gai_strerror(gairet));
        return 1;
    }

    printf("src ai_family %d ai_socketype %d ai_protocol %d\n",
           result->ai_family, result->ai_socktype, result->ai_protocol);

    freeaddrinfo(result);

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_INET;
    hints.ai_socktype = SOCK_RAW;
    hints.ai_protocol = 0;

    gairet = getaddrinfo("8.8.8.8", NULL, &hints, &result);
    if ((gairet !=0) || (result == NULL)) {
        fprintf(stderr, "could not resolve dst, errno = %s\n", gai_strerror(gairet));
        return 1;
    }

    printf("dst ai_family %d ai_socketype %d ai_protocol %d\n",
           result->ai_family, result->ai_socktype, result->ai_protocol);

    freeaddrinfo(result);

    return 0;
}
Anton Ivanov March 10, 2014, 8:49 a.m. UTC | #7
You are correct. My test is wrong.

However, the result is the same - it wants a non-zero proto there.

$ sudo ./gaitest
src ai_family 2 ai_socketype 3 ai_protocol 0
socket creation failed, errno = 93
src ai_family 2 ai_socketype 3 ai_protocol 115

No error on the second one, -93 on the first one.

A.
Stefan Hajnoczi March 10, 2014, 6:04 p.m. UTC | #8
On Mon, Mar 10, 2014 at 08:49:01AM +0000, Anton Ivanov wrote:
> You are correct. My test is wrong.
> 
> However, the result is the same - it wants a non-zero proto there.
> 
> $ sudo ./gaitest
> src ai_family 2 ai_socketype 3 ai_protocol 0
> socket creation failed, errno = 93

You are right!

I got confused with Linux net/l2tp/l2tp_ip.c driver which handles
socket(AF_INET, SOCK_DGRAM, IPPROTO_L2TP).  But that has nothing to do
with this raw socket code which needs to do socket(AF_INET, SOCK_RAW,
IPPROTO_L2TP).

IPPROTO_L2TP *is* needed after all since SOCK_RAW wants
to know the IP protocol number so it can receive incoming packets.
We're not trying to capture all IP packets, just the L2TP ones.

So I'm happy again with the code.

Stefan
Anton Ivanov (antivano) March 10, 2014, 7:14 p.m. UTC | #9
On 10/03/14 18:04, Stefan Hajnoczi wrote:
> On Mon, Mar 10, 2014 at 08:49:01AM +0000, Anton Ivanov wrote:
>> You are correct. My test is wrong.
>>
>> However, the result is the same - it wants a non-zero proto there.
>>
>> $ sudo ./gaitest
>> src ai_family 2 ai_socketype 3 ai_protocol 0
>> socket creation failed, errno = 93
> You are right!
>
> I got confused with Linux net/l2tp/l2tp_ip.c driver which handles
> socket(AF_INET, SOCK_DGRAM, IPPROTO_L2TP).  But that has nothing to do
> with this raw socket code which needs to do socket(AF_INET, SOCK_RAW,
> IPPROTO_L2TP).
>
> IPPROTO_L2TP *is* needed after all since SOCK_RAW wants
> to know the IP protocol number so it can receive incoming packets.
> We're not trying to capture all IP packets, just the L2TP ones.
>
> So I'm happy again with the code.

OK. In that case I will fix the few remaining issues with the interface
file, mark that as v5 and resubmit tomorrow.

I think I have addressed all other comments.

Best Regards,

A.

>
> Stefan
diff mbox

Patch

diff --git a/net/Makefile.objs b/net/Makefile.objs
index 4854a14..160214e 100644
--- a/net/Makefile.objs
+++ b/net/Makefile.objs
@@ -2,6 +2,7 @@  common-obj-y = net.o queue.o checksum.o util.o hub.o
 common-obj-y += socket.o
 common-obj-y += dump.o
 common-obj-y += eth.o
+common-obj-$(CONFIG_LINUX) += l2tpv3.o
 common-obj-$(CONFIG_POSIX) += tap.o
 common-obj-$(CONFIG_LINUX) += tap-linux.o
 common-obj-$(CONFIG_WIN32) += tap-win32.o
diff --git a/net/clients.h b/net/clients.h
index 7793294..bbf177c 100644
--- a/net/clients.h
+++ b/net/clients.h
@@ -47,6 +47,8 @@  int net_init_tap(const NetClientOptions *opts, const char *name,
 int net_init_bridge(const NetClientOptions *opts, const char *name,
                     NetClientState *peer);
 
+int net_init_l2tpv3(const NetClientOptions *opts, const char *name,
+                    NetClientState *peer);
 #ifdef CONFIG_VDE
 int net_init_vde(const NetClientOptions *opts, const char *name,
                  NetClientState *peer);
diff --git a/net/l2tpv3.c b/net/l2tpv3.c
new file mode 100644
index 0000000..3348b5d
--- /dev/null
+++ b/net/l2tpv3.c
@@ -0,0 +1,554 @@ 
+/*
+ * QEMU System Emulator
+ *
+ * Copyright (c) 2003-2008 Fabrice Bellard
+ * Copyright (c) 2012-2014 Cisco Systems
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include <linux/ip.h>
+#include <netdb.h>
+#include "config-host.h"
+#include "net/net.h"
+#include "clients.h"
+#include "monitor/monitor.h"
+#include "qemu-common.h"
+#include "qemu/error-report.h"
+#include "qemu/option.h"
+#include "qemu/sockets.h"
+#include "qemu/iov.h"
+#include "qemu/main-loop.h"
+
+
+/* The buffer size needs to be investigated for optimum numbers and
+ * optimum means of paging in on different systems. This size is
+ * chosen to be sufficient to accommodate one packet with some headers
+ */
+
+#define BUFFER_ALIGN sysconf(_SC_PAGESIZE)
+#define BUFFER_SIZE 2048
+#define IOVSIZE 2
+#define MAX_L2TPV3_MSGCNT 32 
+#define MAX_L2TPV3_IOVCNT (MAX_L2TPV3_MSGCNT * IOVSIZE)
+
+#ifndef IPPROTO_L2TP
+#define IPPROTO_L2TP 0x73
+#endif
+
+typedef struct NetL2TPV3State {
+    NetClientState nc;
+    int fd;
+    int state; 
+    unsigned int index;
+    unsigned int packet_len;
+
+    /* 
+     *	these are used for xmit - that happens packet a time
+     *	and for first sign of life packet (easier to parse that once)
+     */ 
+
+    uint8_t *header_buf;
+    struct iovec *vec;
+
+    /* 
+     * these are used for receive - try to "eat" up to 32 packets at a time
+     */
+
+    struct mmsghdr *msgvec;
+
+    /*
+     * peer address
+     */
+
+    struct sockaddr_storage *dgram_dst; 
+    uint32_t dst_size;
+
+    /*
+     * L2TPv3 parameters
+     */
+
+    uint64_t rx_cookie;
+    uint64_t tx_cookie;
+    uint32_t rx_session;
+    uint32_t tx_session;
+    uint32_t header_size;
+    uint32_t counter;
+
+    /*
+     * Precomputed offsets
+     */
+
+    uint32_t offset;   
+    uint32_t cookie_offset;
+    uint32_t counter_offset;
+    uint32_t session_offset;
+
+    /* Flags */
+
+    bool ipv6;
+    bool udp;
+    bool has_counter;
+    bool cookie;
+    bool cookie_is_64;
+
+} NetL2TPV3State;
+
+typedef struct NetL2TPV3ListenState {
+    NetClientState nc;
+    char *model;
+    char *name;
+    int fd;
+} NetL2TPV3ListenState;
+
+static int l2tpv3_form_header(NetL2TPV3State *s) 
+{
+    uint32_t *header;
+    uint32_t *session;
+    uint64_t *cookie64;
+    uint32_t *cookie32;
+    uint32_t *counter;
+
+    if (s->udp) {
+	header = (uint32_t *) s->header_buf;
+	stl_be_p(header, 0x30000);
+    }
+    session = (uint32_t *) (s->header_buf + s->session_offset);
+    stl_be_p(session, s->tx_session);
+
+    if (s->cookie) {
+	if (s->cookie_is_64) {
+	    cookie64 = (uint64_t *)(s->header_buf + s->cookie_offset);
+	    stq_be_p(cookie64, s->tx_cookie);
+	} else {
+	    cookie32 = (uint32_t *) (s->header_buf + s->cookie_offset);
+	    stl_be_p(cookie32, s->tx_cookie);
+	}
+    }
+
+    if (s->has_counter) {
+	counter = (uint32_t *)(s->header_buf + s->counter_offset);
+	stl_be_p(counter, ++s->counter);
+    }
+    return 0;
+}
+
+static ssize_t net_l2tpv3_receive_dgram_iov(NetClientState *nc, 
+				    const struct iovec *iov, 
+				    int iovcnt)
+{
+    NetL2TPV3State *s = DO_UPCAST(NetL2TPV3State, nc, nc);
+
+    struct msghdr message;
+    int ret;
+
+    if (iovcnt > MAX_L2TPV3_IOVCNT - 1) {
+	error_report("iovec too long %d > %d, change l2tpv3.h\n", iovcnt, MAX_L2TPV3_IOVCNT);
+	return -1;
+    }
+    l2tpv3_form_header(s);
+    memcpy(s->vec + 1, iov, iovcnt * sizeof(struct iovec));
+    s->vec->iov_base = s->header_buf;
+    s->vec->iov_len = s->offset;
+    message.msg_name = s->dgram_dst;
+    message.msg_namelen = s->dst_size;
+    message.msg_iov = (struct iovec *) s->vec;
+    message.msg_iovlen = iovcnt + 1; 
+    message.msg_control = NULL;
+    message.msg_controllen = 0;
+    message.msg_flags = 0;
+    do {
+	ret = sendmsg(s->fd, &message, 0);
+    } while ((ret == -1) && (errno == EINTR));
+    if (ret > 0) {
+	ret -= s->offset; 
+    } else if (ret == 0) {
+	ret = iov_size(iov, iovcnt);
+    } 
+    return ret;
+}
+
+static ssize_t net_l2tpv3_receive_dgram(NetClientState *nc, 
+					const uint8_t *buf, 
+					size_t size)
+{
+    NetL2TPV3State *s = DO_UPCAST(NetL2TPV3State, nc, nc);
+
+    struct iovec * vec;
+    struct msghdr message;
+    ssize_t ret = 0;
+
+    l2tpv3_form_header(s);
+    vec = s->vec;
+    vec->iov_base = s->header_buf;
+    vec->iov_len = s->offset;
+    vec++;
+    vec->iov_base = (void *) buf;
+    vec->iov_len = size;
+    message.msg_name = s->dgram_dst;
+    message.msg_namelen = s->dst_size;
+    message.msg_iov = (struct iovec *) s->vec;
+    message.msg_iovlen = 2;
+    message.msg_control = NULL;
+    message.msg_controllen = 0;
+    message.msg_flags = 0;
+    do {
+	ret = sendmsg(s->fd, &message, 0);
+    } while ((ret == -1) && (errno == EINTR));
+    if (ret > 0) {
+	ret -= s->offset; 
+    } else if (ret == 0) {
+	ret = size;
+    } 
+    return ret;
+}
+
+static int l2tpv3_verify_header(NetL2TPV3State *s, 
+				    uint8_t *buf) 
+{
+
+    uint64_t *cookie64;
+    uint32_t *cookie32;
+    uint32_t *session;
+
+    if ((!s->udp) && (!s->ipv6)){
+	buf += sizeof(struct iphdr) /* fix for ipv4 raw */;
+    } 
+    if (s->cookie) {
+	if (s->cookie_is_64) {
+	    /* 64 bit cookie */
+	    cookie64 = (uint64_t *)(buf + s->cookie_offset);
+	    if (ldq_be_p(cookie64) != s->rx_cookie) {
+		error_report("unknown cookie id\n");
+		return -1; 
+	    }
+	} else {
+	    cookie32 = (uint32_t *)(buf + s->cookie_offset);
+	    if (ldl_be_p(cookie32) != * (uint32_t *) &s->rx_cookie) {
+		error_report("unknown cookie id\n");
+		return -1 ; 
+	    }
+	}
+    }
+    session = (uint32_t *) (buf + s->session_offset);
+    if (ldl_be_p(session) != s->rx_session) {
+	error_report("session mismatch\n");
+	return -1;
+    }	
+    return 0;
+}
+
+static void net_l2tpv3_send(void *opaque)
+{
+    NetL2TPV3State *s = opaque;
+
+    int i, count, offset;
+    struct mmsghdr * msgvec;
+    struct iovec * vec;
+
+    msgvec = s->msgvec;
+    offset = s->offset;
+    if ((!s->udp) && (!s->ipv6)){
+	offset +=   sizeof(struct iphdr);
+    }  
+    count = recvmmsg(s->fd, msgvec, MAX_L2TPV3_MSGCNT, MSG_DONTWAIT, NULL);
+    for (i=0;i<count;i++) {
+	if (msgvec->msg_len > 0) {
+	    vec = msgvec->msg_hdr.msg_iov;
+	    if (
+		(msgvec->msg_len > offset) && 
+		(l2tpv3_verify_header(s, vec->iov_base) == 0)
+		) {
+		vec++;
+		qemu_send_packet(&s->nc, vec->iov_base, msgvec->msg_len - offset);
+	    } else {
+		error_report("l2tpv3 header verification failed\n");
+		vec++; 
+	    }
+	} 
+	msgvec++;
+    }
+}
+
+static void destroy_vector(struct mmsghdr * msgvec, int count, int iovcount) 
+{
+    int i, j;
+    struct iovec * iov;
+    struct mmsghdr * cleanup = msgvec;
+    if (cleanup) {
+	for (i=0;i<count;i++) {
+	    if (cleanup->msg_hdr.msg_iov) {
+		iov = cleanup->msg_hdr.msg_iov;
+		for (j=0;j<iovcount;j++) {
+		    if (iov->iov_base) {
+			g_free(iov->iov_base);
+		    }
+		    iov++;
+		}
+		g_free(cleanup->msg_hdr.msg_iov);
+	    }
+	    cleanup++;
+	}
+	g_free(msgvec);
+    }
+}
+
+static struct mmsghdr * build_l2tpv3_vector(NetL2TPV3State *s, int count) 
+{
+    int i;
+    struct iovec * iov;
+    struct mmsghdr * msgvec, *result;
+
+    msgvec = g_malloc(sizeof(struct mmsghdr) * count);
+    result = msgvec;
+    for (i=0;i < count ;i++) {
+	msgvec->msg_hdr.msg_name = NULL;
+	msgvec->msg_hdr.msg_namelen = 0;
+	iov =  g_malloc(sizeof(struct iovec) * IOVSIZE);
+	msgvec->msg_hdr.msg_iov = iov;
+	if ((!s->udp) && (!s->ipv6)){
+	    /* fix for ipv4 raw */;
+	    iov->iov_base = g_malloc(s->offset + sizeof(struct iphdr)); 
+	    iov->iov_len = s->offset + sizeof (struct iphdr);
+	} else {
+	    iov->iov_base = g_malloc(s->offset);
+	    iov->iov_len = s->offset;
+	}
+	iov++ ;
+	iov->iov_base = qemu_memalign(BUFFER_ALIGN,BUFFER_SIZE);
+	iov->iov_len = BUFFER_SIZE;
+	msgvec->msg_hdr.msg_iovlen = 2;
+	msgvec->msg_hdr.msg_control = NULL;
+	msgvec->msg_hdr.msg_controllen = 0;
+	msgvec->msg_hdr.msg_flags = 0;
+	msgvec++;
+    }
+    return result;
+}
+
+static void net_l2tpv3_cleanup(NetClientState *nc) 
+{
+    NetL2TPV3State *s = DO_UPCAST(NetL2TPV3State, nc, nc);
+    qemu_set_fd_handler(s->fd, NULL, NULL, NULL);
+    close(s->fd);
+    destroy_vector(s->msgvec, MAX_L2TPV3_MSGCNT, IOVSIZE);
+    g_free(s->header_buf);
+    g_free(s->dgram_dst);
+}
+
+static NetClientInfo net_l2tpv3_info = {
+    .type = NET_CLIENT_OPTIONS_KIND_L2TPV3,
+    .size = sizeof(NetL2TPV3State),
+    .receive = net_l2tpv3_receive_dgram,
+    .receive_iov = net_l2tpv3_receive_dgram_iov,
+    .cleanup = net_l2tpv3_cleanup,
+};
+
+int net_init_l2tpv3(const NetClientOptions *opts,
+                    const char *name,
+                    NetClientState *peer) 
+{
+
+
+    const NetdevL2TPv3Options * l2tpv3;
+    NetL2TPV3State *s;
+    NetClientState *nc;
+    int fd, gairet;
+    struct addrinfo hints;
+    struct addrinfo * result = NULL;
+    char * srcport, * dstport;
+
+    nc = qemu_new_net_client(&net_l2tpv3_info, peer, "l2tpv3", name);
+
+    s = DO_UPCAST(NetL2TPV3State, nc, nc);
+
+    assert(opts->kind == NET_CLIENT_OPTIONS_KIND_L2TPV3);
+    l2tpv3 = opts->l2tpv3;
+
+    if (l2tpv3->has_ipv6 && l2tpv3->ipv6) {
+	s->ipv6 = l2tpv3->ipv6;
+    } else {
+	s->ipv6 = false;
+    }
+
+    if (l2tpv3->has_rxcookie || l2tpv3->has_txcookie) {
+	if (l2tpv3->has_rxcookie && l2tpv3->has_txcookie) {
+	    s->cookie = true;
+	} else {
+	    return -1;
+	}
+    } else {
+	s->cookie = false;
+    }
+
+    if (l2tpv3->has_cookie64 || l2tpv3->cookie64) {
+	s->cookie_is_64  = true;
+    } else {
+	s->cookie_is_64  = false;
+    }
+
+    if (l2tpv3->has_udp && l2tpv3->udp) {
+	s->udp = true;
+	if (!(l2tpv3->has_srcport && l2tpv3->has_dstport)) {
+	    error_report("l2tpv3_open : need both src and dst port for udp\n");
+	    return -1;
+	} else {
+	    srcport = l2tpv3->srcport;
+	    dstport = l2tpv3->dstport;
+	}
+    } else {
+	s->udp = false;
+	srcport = NULL;
+	dstport = NULL;
+    }
+
+
+    s->offset = 4;
+    s->session_offset = 0;
+    s->cookie_offset = 4;
+    s->counter_offset = 4;
+
+    s->tx_session = l2tpv3->txsession;
+    if (l2tpv3->has_rxsession) {
+	s->rx_session = l2tpv3->rxsession;
+    } else {
+	s->rx_session = s->tx_session;
+    }
+
+    if (s->cookie) {
+	s->rx_cookie = l2tpv3->rxcookie;
+	s->tx_cookie = l2tpv3->txcookie;
+	if (s->cookie_is_64 == true) {
+	    /* 64 bit cookie */
+	    s->offset += 8;
+	    s->counter_offset += 8;
+	} else {
+	    /* 32 bit cookie */
+	    s->offset += 4;
+	    s->counter_offset +=4;
+	}
+    }
+
+    memset(&hints, 0, sizeof(hints));
+    
+    if (s->ipv6) {
+	hints.ai_family = AF_INET6;
+    } else {
+	hints.ai_family = AF_INET;
+    }
+    if (s->udp) {
+	hints.ai_socktype = SOCK_DGRAM;
+	hints.ai_protocol = 0;
+	s->offset += 4;
+	s->counter_offset += 4;
+	s->session_offset += 4;
+	s->cookie_offset += 4;
+    } else {
+	hints.ai_socktype = SOCK_RAW;
+	hints.ai_protocol = IPPROTO_L2TP;
+    }
+    
+    gairet= getaddrinfo(l2tpv3->src, srcport, &hints, &result);
+
+    if ((gairet !=0) || (result == NULL)) {
+	error_report("l2tpv3_open : could not resolve src, errno = %s\n", gai_strerror(gairet));
+	return -1;
+    }
+
+    if ((fd = socket(result->ai_family, result->ai_socktype, result->ai_protocol)) == -1) {
+	fd = -errno;
+	error_report("l2tpv3_open : socket creation failed, errno = %d\n", -fd);
+	freeaddrinfo(result);
+	return fd;
+    } 
+    if (bind(fd, (struct sockaddr *) result->ai_addr, result->ai_addrlen)) {
+	error_report("l2tpv3_open :  could not bind socket err=%i\n", errno);
+	close(fd);
+	return -1;
+    }
+
+    freeaddrinfo(result);
+
+    memset(&hints, 0, sizeof(hints));
+    
+    if (s->ipv6) {
+	hints.ai_family = AF_INET6;
+    } else {
+	hints.ai_family = AF_INET;
+    }
+    if (s->udp) {
+	hints.ai_socktype = SOCK_DGRAM;
+	hints.ai_protocol = 0;
+    } else {
+	hints.ai_socktype = SOCK_RAW;
+	hints.ai_protocol = IPPROTO_L2TP;
+    }
+
+    gairet= getaddrinfo(l2tpv3->dst, dstport, &hints, &result);
+    if ((gairet !=0) || (result == NULL)) {
+	error_report("l2tpv3_open : could not resolve dst, error = %s\n", gai_strerror(gairet));
+	return -1;
+    }
+
+    s->dgram_dst = g_malloc(sizeof(struct sockaddr_storage));
+    memset(s->dgram_dst, '\0' , sizeof(struct sockaddr_storage));
+    memcpy(s->dgram_dst, result->ai_addr, result->ai_addrlen);
+    s->dst_size = result->ai_addrlen;
+
+    freeaddrinfo(result);
+
+    if (l2tpv3->has_counter && l2tpv3->counter) {
+	s->has_counter = true;
+	s->offset += 4;
+    } else {
+	s->has_counter = false;
+    }
+
+    if (l2tpv3->has_offset) {
+	/* extra offset */
+	s->offset += l2tpv3->offset;
+    }
+
+    s->msgvec = build_l2tpv3_vector(s, MAX_L2TPV3_MSGCNT);
+    s->vec = g_malloc(sizeof(struct iovec) * MAX_L2TPV3_IOVCNT);    
+    if ((!s->udp) && (!s->ipv6)){
+	s->header_buf = g_malloc(s->offset + sizeof (struct iphdr));    
+    } else {
+	s->header_buf = g_malloc(s->offset);    
+    }
+
+    qemu_set_nonblock(fd);
+
+    if (fd < 0)
+	return -1;
+
+    s->fd = fd;
+    s->counter = 0;
+
+    qemu_set_fd_handler(s->fd, net_l2tpv3_send, NULL, s);
+
+    if (!s) {
+	error_report("l2tpv3_open : failed to set fd handler\n");
+	return -1;
+    }
+    snprintf(s->nc.info_str, sizeof(s->nc.info_str),
+             "l2tpv3: connected");
+    return 0;
+}
+
diff --git a/net/net.c b/net/net.c
index 0a88e68..749d34c 100644
--- a/net/net.c
+++ b/net/net.c
@@ -731,6 +731,9 @@  static int (* const net_client_init_fun[NET_CLIENT_OPTIONS_KIND_MAX])(
         [NET_CLIENT_OPTIONS_KIND_BRIDGE]    = net_init_bridge,
 #endif
         [NET_CLIENT_OPTIONS_KIND_HUBPORT]   = net_init_hubport,
+#ifdef CONFIG_LINUX
+        [NET_CLIENT_OPTIONS_KIND_L2TPV3]    = net_init_l2tpv3,
+#endif
 };
 
 
diff --git a/qapi-schema.json b/qapi-schema.json
index 83fa485..7d9f69f 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -2940,6 +2940,57 @@ 
     '*localaddr': 'str',
     '*udp':       'str' } }
 
+# @NetdevL2TPv3Options
+#
+# Connect the VLAN to Ethernet over L2TPv3 Static tunnel
+#
+# @src :source address
+#
+# @dst :destination address
+#
+# @srcport :#optional source port - mandatory for udp, optional for ip
+#
+# @dstport :#optional destination port - mandatory for udp, optional for ip
+#
+# @ipv6 :#optional - force the use of ipv6 
+#
+# @udp :#optional - use the udp version of l2tpv3 encapsulation
+#
+# @cookie64:#optional - use 64 bit coookies
+#
+# @counter :#optional have sequence counter
+# 
+# @txcookie :#optional 32 or 64 bit transmit cookie 
+# 
+# @rxcookie :#optional 32 or 64 bit receive cookie 
+# 
+# @txsession : 32 bit transmit session
+# 
+# @rxsession : 32 bit receive session - if not specified set to the same value as transmit
+# 
+# @optional : additional offset - allows the insertion of additional application-specific data before the packet payload
+# 
+#
+# Since 1.2
+##
+##
+{ 'type': 'NetdevL2TPv3Options',
+  'data': {
+    'src':	  'str', 
+    'dst':	  'str', 
+    '*srcport':	  'str', 
+    '*dstport':	  'str', 
+    '*ipv6':	  'bool', 
+    '*udp':	  'bool', 
+    '*cookie64':  'bool', 
+    '*counter':   'bool', 
+    '*txcookie':  'uint64',
+    '*rxcookie':  'uint64',
+    'txsession': 'uint32',
+    '*rxsession': 'uint32',
+    '*offset':  'uint32' } }
+
+##
 ##
 # @NetdevVdeOptions
 #
@@ -3014,13 +3065,16 @@ 
 # A discriminated record of network device traits.
 #
 # Since 1.2
-##
+#
+# Added in 2.0 - l2tpv3 
+#
 { 'union': 'NetClientOptions',
   'data': {
     'none':     'NetdevNoneOptions',
     'nic':      'NetLegacyNicOptions',
     'user':     'NetdevUserOptions',
     'tap':      'NetdevTapOptions',
+    'l2tpv3':   'NetdevL2TPv3Options',
     'socket':   'NetdevSocketOptions',
     'vde':      'NetdevVdeOptions',
     'dump':     'NetdevDumpOptions',