diff mbox

macvlan: add tap device backend

Message ID 1249595428-21594-1-git-send-email-arnd@arndb.de
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Arnd Bergmann Aug. 6, 2009, 9:50 p.m. UTC
This is a first prototype of a new interface into the network
stack, to eventually replace tun/tap and the bridge driver
in certain virtual machine setups.

Background
----------
The 'Edge Virtual Bridging' working group is discussing ways to overcome
the limitation of virtual bridges in hypervisors.  One important part
of this is the Virtual Ethernet Port Aggregator (VEPA), as described in
http://www.ieee802.org/1/files/public/docs2009/new-evb-congdon-vepa-modular-0709-v01.pdf

In short, the idea of VEPA is that virtual machines do not communicate
with each other through direct bridging in the hypervisor but only via
an external managed switch that is already well integrated into the data
center, including network filtering, accounting and monitoring. While
we can do most of that efficiently in the Linux bridge code, doing it
externally simplifies the overall setup.

Related work
------------
Patches to implement VEPA in the Linux bridge driver have been posted by
Anna Fischer in June, see http://patchwork.ozlabs.org/patch/28702/. Those
patches are good and hopefully get merged in 2.6.32, but I think we can
take some shortcuts with an alternative approach:

The macvlan driver already has the property of forwarding all traffic
between guests and an external interface but not between the guests, just
as VEPA needs it. Also, VEPA does explicitly not want or need advanced
filtering in the way that netfilter-bridge provides, so we can use macvlan
to replace the bridge code in this setup, reducing the code path through
the kernel.  This works fine with containers and network namespaces,
but not easily with kvm/qemu because we only have a network device.

Or Gerlitz posted a "raw" packet socket backend for qemu to deal with this,
at http://marc.info/?l=qemu-devel&m=124653801212767 and at least three
other people have done a similar functionality independently.

This driver
-----------
While the other approaches should work as well, doing it using a tap
interface should give additional benefits:

* We can keep using the optimizations for jumbo frames that we have put
into the tun/tap driver.

* No need for root permissions that packet sockets need, just use 'ip
link add link type macvtap' to create a new device and give it the right
permissions using udev (using one tap per macvlan netdev).

* support for multiqueue network adapters by opening the tap device
multiple times, using one file descriptor per guest CPU/network
queue/interrupt (if the adapter supports multiple queues on a single
MAC address).

* support for zero-copy receive/transmit using async I/O on the tap device
(if the adapter supports per MAC rx queues).

* The same framework in macvlan can be used to add a third backend
into a future kernel based virtio-net implementation.

This version of the driver does not support any of those features,
but they all appear possible to add ;).
The driver is currently called 'macvtap', but I'd be more than happy
to change that if anyone could suggest a better name. The code is
still in an early stage and I wish I had found more time to polish
it, but at this time, I'd first like to know if people agree with the
basic concept at all.

Cc: Patrick McHardy <kaber@trash.net>
Cc: Stephen Hemminger <shemminger@linux-foundation.org>
Cc: David S. Miller" <davem@davemloft.net>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Or Gerlitz <ogerlitz@voltaire.com>
Cc: "Fischer, Anna" <anna.fischer@hp.com>
Cc: netdev@vger.kernel.org
Cc: bridge@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Cc: Edge Virtual Bridging <evb@yahoogroups.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>

---

The evb mailing list eats Cc headers, please make sure to keep everybody
in your Cc list when replying there.
---
 drivers/net/Kconfig   |   12 ++
 drivers/net/Makefile  |    1 +
 drivers/net/macvlan.c |   39 +++-----
 drivers/net/macvlan.h |   37 +++++++
 drivers/net/macvtap.c |  276 +++++++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 341 insertions(+), 24 deletions(-)
 create mode 100644 drivers/net/macvlan.h
 create mode 100644 drivers/net/macvtap.c

Comments

David Miller Aug. 7, 2009, 3:20 a.m. UTC | #1
From: Arnd Bergmann <arnd@arndb.de>
Date: Thu,  6 Aug 2009 21:50:28 +0000

> This is a first prototype of a new interface into the network
> stack, to eventually replace tun/tap and the bridge driver
> in certain virtual machine setups.

I don't know enough to say how good a solution this is for
the problem, but I certainly like this driver for it's
utter simplicity and minimalness.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Daniel Robbins Aug. 7, 2009, 5:35 p.m. UTC | #2
On Thu, Aug 6, 2009 at 3:50 PM, Arnd Bergmann<arnd@arndb.de> wrote:
> This is a first prototype of a new interface into the network
> stack, to eventually replace tun/tap and the bridge driver
> in certain virtual machine setups.

I have some general questions about the intended use and benefits of
VEPA, from an IT perspective:

In which virtual machine setups and technologies do you forsee this
interface being used?
Is this new interface to be used within a virtual machine or
container, on the master node, or both?
What interface(s) would need to be configured for a single virtual
machine to use VEPA to access the network?
What are the current flexibility, security or performance limitations
of tun/tap and bridge that make this new interface necessary or
beneficial?
Is this new interface useful at all for VPN solutions or is it
*specifically* targeted for connecting virtual machines to the
network?
Is this essentially a bridge with layer-2 isolation for the virtual
machine interfaces built-in? If isolation is provided, what mechanism
is used to accomplish this, and how secure is it?
Does VEPA look like a regular ethernet interface (eth0) on the virtual
machine side?
Are there any associated user-space tools required for configuring a VEPA?

Do you have any HOWTO-style documentation that would demonstrate how
this interface would be used in production? Or a FAQ?

This seems like a very interesting effort but I don't quite have a
good grasp of VEPA's benefits and limitations -- I imagine that others
are in the same boat too.

Best Regards,

Daniel
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin Aug. 9, 2009, 8:02 a.m. UTC | #3
On Thu, Aug 06, 2009 at 09:50:28PM +0000, Arnd Bergmann wrote:
> This driver
> -----------
> While the other approaches should work as well, doing it using a tap
> interface should give additional benefits:
> 
> * We can keep using the optimizations for jumbo frames that we have put
> into the tun/tap driver.
> 
> * No need for root permissions that packet sockets need, just use 'ip
> link add link type macvtap' to create a new device and give it the right
> permissions using udev (using one tap per macvlan netdev).
> 
> * support for multiqueue network adapters by opening the tap device
> multiple times, using one file descriptor per guest CPU/network
> queue/interrupt (if the adapter supports multiple queues on a single
> MAC address).
> 
> * support for zero-copy receive/transmit using async I/O on the tap device
> (if the adapter supports per MAC rx queues).
> 
> * The same framework in macvlan can be used to add a third backend
> into a future kernel based virtio-net implementation.

Could you split the patches up, to make this last easier?
patch 1 - export framework
patch 2 - code using it


> This version of the driver does not support any of those features,
> but they all appear possible to add ;).
> The driver is currently called 'macvtap', but I'd be more than happy
> to change that if anyone could suggest a better name. The code is
> still in an early stage and I wish I had found more time to polish
> it, but at this time, I'd first like to know if people agree with the
> basic concept at all.
> 
> Cc: Patrick McHardy <kaber@trash.net>
> Cc: Stephen Hemminger <shemminger@linux-foundation.org>
> Cc: David S. Miller" <davem@davemloft.net>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Or Gerlitz <ogerlitz@voltaire.com>
> Cc: "Fischer, Anna" <anna.fischer@hp.com>
> Cc: netdev@vger.kernel.org
> Cc: bridge@lists.linux-foundation.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Edge Virtual Bridging <evb@yahoogroups.com>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> 
> ---
> 
> The evb mailing list eats Cc headers, please make sure to keep everybody
> in your Cc list when replying there.
> ---
>  drivers/net/Kconfig   |   12 ++
>  drivers/net/Makefile  |    1 +
>  drivers/net/macvlan.c |   39 +++-----
>  drivers/net/macvlan.h |   37 +++++++
>  drivers/net/macvtap.c |  276 +++++++++++++++++++++++++++++++++++++++++++++++++
>  5 files changed, 341 insertions(+), 24 deletions(-)
>  create mode 100644 drivers/net/macvlan.h
>  create mode 100644 drivers/net/macvtap.c
> 
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 5f6509a..0b9ac6a 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -90,6 +90,18 @@ config MACVLAN
>  	  To compile this driver as a module, choose M here: the module
>  	  will be called macvlan.
>  
> +config MACVTAP
> +	tristate "MAC-VLAN based tap driver (EXPERIMENTAL)"
> +	depends on MACVLAN
> +	help
> +	  This adds a specialized tap character device driver that is based
> +	  on the MAC-VLAN network interface, called macvtap. A macvtap device
> +	  can be added in the same way as a macvlan device, using 'type
> +	  macvlan', and then be accessed through the tap user space interface.
> +	
> +	  To compile this driver as a module, choose M here: the module
> +	  will be called macvtap.
> +
>  config EQUALIZER
>  	tristate "EQL (serial line load balancing) support"
>  	---help---
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index ead8cab..8a2d2d7 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -162,6 +162,7 @@ obj-$(CONFIG_XEN_NETDEV_FRONTEND) += xen-netfront.o
>  obj-$(CONFIG_DUMMY) += dummy.o
>  obj-$(CONFIG_IFB) += ifb.o
>  obj-$(CONFIG_MACVLAN) += macvlan.o
> +obj-$(CONFIG_MACVTAP) += macvtap.o
>  obj-$(CONFIG_DE600) += de600.o
>  obj-$(CONFIG_DE620) += de620.o
>  obj-$(CONFIG_LANCE) += lance.o
> diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
> index 99eed9f..9f7dc6a 100644
> --- a/drivers/net/macvlan.c
> +++ b/drivers/net/macvlan.c
> @@ -30,22 +30,7 @@
>  #include <linux/if_macvlan.h>
>  #include <net/rtnetlink.h>
>  
> -#define MACVLAN_HASH_SIZE	(1 << BITS_PER_BYTE)
> -
> -struct macvlan_port {
> -	struct net_device	*dev;
> -	struct hlist_head	vlan_hash[MACVLAN_HASH_SIZE];
> -	struct list_head	vlans;
> -};
> -
> -struct macvlan_dev {
> -	struct net_device	*dev;
> -	struct list_head	list;
> -	struct hlist_node	hlist;
> -	struct macvlan_port	*port;
> -	struct net_device	*lowerdev;
> -};
> -
> +#include "macvlan.h"
>  
>  static struct macvlan_dev *macvlan_hash_lookup(const struct macvlan_port *port,
>  					       const unsigned char *addr)
> @@ -135,7 +120,7 @@ static void macvlan_broadcast(struct sk_buff *skb,
>  			else
>  				nskb->pkt_type = PACKET_MULTICAST;
>  
> -			netif_rx(nskb);
> +			vlan->receive(nskb);
>  		}
>  	}
>  }
> @@ -180,11 +165,11 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
>  	skb->dev = dev;
>  	skb->pkt_type = PACKET_HOST;
>  
> -	netif_rx(skb);
> +	vlan->receive(skb);
>  	return NULL;
>  }
>  
> -static int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev)
> +int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  {
>  	const struct macvlan_dev *vlan = netdev_priv(dev);
>  	unsigned int len = skb->len;
> @@ -202,6 +187,7 @@ static int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev)
>  	}
>  	return NETDEV_TX_OK;
>  }
> +EXPORT_SYMBOL_GPL(macvlan_start_xmit);
>  
>  static int macvlan_hard_header(struct sk_buff *skb, struct net_device *dev,
>  			       unsigned short type, const void *daddr,
> @@ -412,7 +398,7 @@ static const struct net_device_ops macvlan_netdev_ops = {
>  	.ndo_validate_addr	= eth_validate_addr,
>  };
>  
> -static void macvlan_setup(struct net_device *dev)
> +void macvlan_setup(struct net_device *dev)
>  {
>  	ether_setup(dev);
>  
> @@ -423,6 +409,7 @@ static void macvlan_setup(struct net_device *dev)
>  	dev->ethtool_ops	= &macvlan_ethtool_ops;
>  	dev->tx_queue_len	= 0;
>  }
> +EXPORT_SYMBOL_GPL(macvlan_setup);
>  
>  static int macvlan_port_create(struct net_device *dev)
>  {
> @@ -472,7 +459,7 @@ static void macvlan_transfer_operstate(struct net_device *dev)
>  	}
>  }
>  
> -static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
> +int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
>  {
>  	if (tb[IFLA_ADDRESS]) {
>  		if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN)
> @@ -482,9 +469,10 @@ static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
>  	}
>  	return 0;
>  }
> +EXPORT_SYMBOL_GPL(macvlan_validate);
>  
> -static int macvlan_newlink(struct net_device *dev,
> -			   struct nlattr *tb[], struct nlattr *data[])
> +int macvlan_newlink(struct net_device *dev,
> +		    struct nlattr *tb[], struct nlattr *data[])
>  {
>  	struct macvlan_dev *vlan = netdev_priv(dev);
>  	struct macvlan_port *port;
> @@ -524,6 +512,7 @@ static int macvlan_newlink(struct net_device *dev,
>  	vlan->lowerdev = lowerdev;
>  	vlan->dev      = dev;
>  	vlan->port     = port;
> +	vlan->receive  = netif_rx;
>  
>  	err = register_netdevice(dev);
>  	if (err < 0)
> @@ -533,8 +522,9 @@ static int macvlan_newlink(struct net_device *dev,
>  	macvlan_transfer_operstate(dev);
>  	return 0;
>  }
> +EXPORT_SYMBOL_GPL(macvlan_newlink);
>  
> -static void macvlan_dellink(struct net_device *dev)
> +void macvlan_dellink(struct net_device *dev)
>  {
>  	struct macvlan_dev *vlan = netdev_priv(dev);
>  	struct macvlan_port *port = vlan->port;
> @@ -545,6 +535,7 @@ static void macvlan_dellink(struct net_device *dev)
>  	if (list_empty(&port->vlans))
>  		macvlan_port_destroy(port->dev);
>  }
> +EXPORT_SYMBOL_GPL(macvlan_dellink);
>  
>  static struct rtnl_link_ops macvlan_link_ops __read_mostly = {
>  	.kind		= "macvlan",
> diff --git a/drivers/net/macvlan.h b/drivers/net/macvlan.h
> new file mode 100644
> index 0000000..3f3c6c3
> --- /dev/null
> +++ b/drivers/net/macvlan.h
> @@ -0,0 +1,37 @@
> +#ifndef _MACVLAN_H
> +#define _MACVLAN_H
> +
> +#include <linux/netdevice.h>
> +#include <linux/netlink.h>
> +#include <linux/list.h>
> +
> +#define MACVLAN_HASH_SIZE	(1 << BITS_PER_BYTE)
> +
> +struct macvlan_port {
> +	struct net_device	*dev;
> +	struct hlist_head	vlan_hash[MACVLAN_HASH_SIZE];
> +	struct list_head	vlans;
> +};
> +
> +struct macvlan_dev {
> +	struct net_device	*dev;
> +	struct list_head	list;
> +	struct hlist_node	hlist;
> +	struct macvlan_port	*port;
> +	struct net_device	*lowerdev;
> +
> +	int (*receive)(struct sk_buff *skb);
> +};
> +
> +extern int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev);
> +
> +extern void macvlan_setup(struct net_device *dev);
> +
> +extern int macvlan_validate(struct nlattr *tb[], struct nlattr *data[]);
> +
> +extern int macvlan_newlink(struct net_device *dev,
> +		struct nlattr *tb[], struct nlattr *data[]);
> +
> +extern void macvlan_dellink(struct net_device *dev);
> +
> +#endif /* _MACVLAN_H */
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> new file mode 100644
> index 0000000..d99bfc0
> --- /dev/null
> +++ b/drivers/net/macvtap.c
> @@ -0,0 +1,276 @@
> +#include <linux/etherdevice.h>
> +#include <linux/nsproxy.h>
> +#include <linux/module.h>
> +#include <linux/skbuff.h>
> +#include <linux/cache.h>
> +#include <linux/sched.h>
> +#include <linux/types.h>
> +#include <linux/init.h>
> +#include <linux/wait.h>
> +#include <linux/cdev.h>
> +#include <linux/fs.h>
> +
> +#include <net/net_namespace.h>
> +#include <net/rtnetlink.h>
> +
> +#include "macvlan.h"
> +
> +struct macvtap_dev {
> +	struct macvlan_dev m;
> +	struct cdev cdev;
> +	struct sk_buff_head readq;
> +	wait_queue_head_t wait;
> +};
> +
> +/*
> + * Minor number matches netdev->ifindex, so need a large value
> + */
> +static int macvtap_major;
> +#define MACVTAP_NUM_DEVS 65536
> +
> +static int macvtap_receive(struct sk_buff *skb)
> +{
> +	struct macvtap_dev *vtap = netdev_priv(skb->dev);
> +
> +	skb_queue_tail(&vtap->readq, skb);
> +	wake_up(&vtap->wait);
> +	return 0;
> +}
> +
> +static int macvtap_open(struct inode *inode, struct file *file)
> +{
> +	struct net *net = current->nsproxy->net_ns;
> +	int ifindex = iminor(inode);
> +	struct net_device *dev = dev_get_by_index(net, ifindex);
> +	int err;
> +
> +	err = -ENODEV;
> +	if (!dev)
> +		goto out1;
> +
> +	file->private_data = netdev_priv(dev);
> +	err = 0;
> +out1:
> +	return err;
> +}
> +
> +static int macvtap_release(struct inode *inode, struct file *file)
> +{
> +	struct macvtap_dev *vtap = file->private_data;
> +
> +	if (!vtap)
> +		return 0;
> +
> +	dev_put(vtap->m.dev);
> +	return 0;
> +}
> +
> +/* Get packet from user space buffer */
> +static ssize_t macvtap_get_user(struct macvtap_dev *vtap,
> +			       const struct iovec *iv, size_t count,
> +			       int noblock)
> +{
> +	struct sk_buff *skb;
> +	size_t len = count;
> +
> +	if (unlikely(len < ETH_HLEN))
> +		return -EINVAL;
> +
> +	skb = alloc_skb(NET_IP_ALIGN + len, GFP_KERNEL);
> +
> +	if (!skb) {
> +		vtap->m.dev->stats.rx_dropped++;
> +		return -ENOMEM;
> +	}
> +
> +	skb_reserve(skb, NET_IP_ALIGN);
> +	skb_put(skb, count);
> +
> +	if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
> +		vtap->m.dev->stats.rx_dropped++;
> +		kfree_skb(skb);
> +		return -EFAULT;
> +	}
> +
> +	skb_set_network_header(skb, ETH_HLEN);
> +	skb->dev = vtap->m.lowerdev;
> +
> +	macvlan_start_xmit(skb, vtap->m.dev);
> +
> +	return count;
> +}

With tap, we discovered that not limiting the number of outstanding
skbs hurts UDP performance. And the solution was to limit
the number of outstanding packets - with hacks to work around
the fact that userspace .



> +
> +static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
> +			      unsigned long count, loff_t pos)
> +{
> +	struct file *file = iocb->ki_filp;
> +	ssize_t result;
> +	struct macvtap_dev *vtap = file->private_data;
> +
> +	result = macvtap_get_user(vtap, iv, iov_length(iv, count),
> +			      file->f_flags & O_NONBLOCK);
> +
> +	return result;
> +}
> +
> +/* Put packet to the user space buffer */
> +static ssize_t macvtap_put_user(struct macvtap_dev *vtap,
> +				       struct sk_buff *skb,
> +				       struct iovec *iv, int len)
> +{
> +	int ret;
> +
> +	skb_push(skb, ETH_HLEN);
> +	len = min_t(int, skb->len, len);
> +
> +	ret = skb_copy_datagram_iovec(skb, 0, iv, len);
> +
> +	vtap->m.dev->stats.rx_packets++;
> +	vtap->m.dev->stats.rx_bytes += len;

where does atomicity guarantee for these counters come from?

> +
> +	return ret ? ret : len;
> +}
> +
> +static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
> +			    unsigned long count, loff_t pos)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct macvtap_dev *vtap = file->private_data;
> +	DECLARE_WAITQUEUE(wait, current);
> +	struct sk_buff *skb;
> +	ssize_t len, ret = 0;
> +
> +	if (!vtap)
> +		return -EBADFD;
> +
> +	len = iov_length(iv, count);
> +	if (len < 0) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	add_wait_queue(&vtap->wait, &wait);
> +	while (len) {
> +		current->state = TASK_INTERRUPTIBLE;
> +
> +		/* Read frames from the queue */
> +		if (!(skb=skb_dequeue(&vtap->readq))) {
> +			if (file->f_flags & O_NONBLOCK) {
> +				ret = -EAGAIN;
> +				break;
> +			}
> +			if (signal_pending(current)) {
> +				ret = -ERESTARTSYS;
> +				break;
> +			}
> +			/* Nothing to read, let's sleep */
> +			schedule();
> +			continue;
> +		}
> +		ret = macvtap_put_user(vtap, skb, (struct iovec *) iv, len);

Don't cast away the constness. Instead, fix macvtap_put_user
to used skb_copy_datagram_const_iovec which does not modify the iovec.

> +		kfree_skb(skb);
> +		break;
> +	}
> +
> +	current->state = TASK_RUNNING;
> +	remove_wait_queue(&vtap->wait, &wait);
> +
> +out:
> +	return ret;
> +}
> +
> +struct file_operations macvtap_fops = {
> +	.owner = THIS_MODULE,
> +	.open = macvtap_open,
> +	.release = macvtap_release,
> +	.aio_read = macvtap_aio_read,
> +	.aio_write = macvtap_aio_write,
> +	.llseek = no_llseek,
> +};
> +
> +static int macvtap_newlink(struct net_device *dev,
> +	struct nlattr *tb[], struct nlattr *data[])
> +{
> +	struct macvtap_dev *vtap = netdev_priv(dev);
> +	int err;
> +
> +	err = macvlan_newlink(dev, tb, data);
> +	if (err)
> +		goto out1;
> +
> +	cdev_init(&vtap->cdev, &macvtap_fops);
> +	vtap->cdev.owner = THIS_MODULE;
> +	err = cdev_add(&vtap->cdev, MKDEV(MAJOR(macvtap_major), dev->ifindex), 1);
> +
> +	if (err)
> +		goto out2;
> +
> +	/*
> +	 * TODO: add class dev so device node gets created automatically
> +	 * by udev.
> +	 */
> +	pr_debug("%s:%d: added cdev %d:%d for dev %s\n",
> +		__func__, __LINE__, MAJOR(macvtap_major),
> +		dev->ifindex, dev->name);
> +
> +	skb_queue_head_init(&vtap->readq);
> +	init_waitqueue_head(&vtap->wait);
> +	vtap->m.receive = macvtap_receive;
> +
> +	return 0;
> +
> +out2:
> +	macvlan_dellink(dev);
> +out1:
> +	return err;
> +}
> +
> +static void macvtap_dellink(struct net_device *dev)
> +{
> +	struct macvtap_dev *vtap = netdev_priv(dev);
> +	cdev_del(&vtap->cdev);
> +	/* TODO: kill open file descriptors */
> +	macvlan_dellink(dev);
> +}
> +
> +static struct rtnl_link_ops macvtap_link_ops __read_mostly = {
> +	.kind = "macvtap",
> +	.priv_size = sizeof(struct macvtap_dev),
> +	.setup = macvlan_setup,
> +	.validate = macvlan_validate,
> +	.newlink = macvtap_newlink,
> +	.dellink = macvtap_dellink,
> +};
> +
> +static int macvtap_init(void)
> +{
> +	int err;
> +
> +	err = alloc_chrdev_region(&macvtap_major, 0,
> +				MACVTAP_NUM_DEVS, "macvtap");
> +	if (err)
> +		goto out1;
> +
> +	err = rtnl_link_register(&macvtap_link_ops);
> +	if (err)
> +		goto out2;
> +
> +	return 0;
> +
> +out2:
> +	unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
> +out1:
> +	return err;
> +}
> +module_init(macvtap_init);
> +
> +static void macvtap_exit(void)
> +{
> +	rtnl_link_unregister(&macvtap_link_ops);
> +	unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
> +}
> +module_exit(macvtap_exit);
> +
> +MODULE_ALIAS_RTNL_LINK("macvtap");
> +MODULE_AUTHOR("Arnd Bergmann <arnd@arndb.de>");
> +MODULE_LICENSE("GPL");
> -- 
> 1.6.0.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann Aug. 9, 2009, 8:42 p.m. UTC | #4
On Sunday 09 August 2009 08:02:16 Michael S. Tsirkin wrote:
> On Thu, Aug 06, 2009 at 09:50:28PM +0000, Arnd Bergmann wrote:
> > * The same framework in macvlan can be used to add a third backend
> > into a future kernel based virtio-net implementation.
> 
> Could you split the patches up, to make this last easier?
> patch 1 - export framework
> patch 2 - code using it

Sure, will do.

> > +/* Get packet from user space buffer */
> > +static ssize_t macvtap_get_user(struct macvtap_dev *vtap,
> > +			       const struct iovec *iv, size_t count,
> > +			       int noblock)
> > +{
> > +	struct sk_buff *skb;
> > +	size_t len = count;
> > +
> > +	if (unlikely(len < ETH_HLEN))
> > +		return -EINVAL;
> > +
> > +	skb = alloc_skb(NET_IP_ALIGN + len, GFP_KERNEL);
> > +
> > +	if (!skb) {
> > +		vtap->m.dev->stats.rx_dropped++;
> > +		return -ENOMEM;
> > +	}
> > +
> > +	skb_reserve(skb, NET_IP_ALIGN);
> > +	skb_put(skb, count);
> > +
> > +	if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
> > +		vtap->m.dev->stats.rx_dropped++;
> > +		kfree_skb(skb);
> > +		return -EFAULT;
> > +	}
> > +
> > +	skb_set_network_header(skb, ETH_HLEN);
> > +	skb->dev = vtap->m.lowerdev;
> > +
> > +	macvlan_start_xmit(skb, vtap->m.dev);
> > +
> > +	return count;
> > +}
> 
> With tap, we discovered that not limiting the number of outstanding
> skbs hurts UDP performance. And the solution was to limit
> the number of outstanding packets - with hacks to work around
> the fact that userspace .

Something seems to be missing in your last sentence here.

My driver OTOH is also missing any sort of flow control in both
RX and TX direction ;) For RX, there should probably just be
a limit of frames that get buffered in the ring.

For TX, I guess there should be a way to let the packet
scheduler handle this and give us a chance to block and
unblock at the right time. I haven't found out yet how to
do that.

Would it be enough to check the dev_queue_xmit() return
code for NETDEV_TX_BUSY?

How would I get notified when it gets free again?

> > +	ret = skb_copy_datagram_iovec(skb, 0, iv, len);
> > +
> > +	vtap->m.dev->stats.rx_packets++;
> > +	vtap->m.dev->stats.rx_bytes += len;
> 
> where does atomicity guarantee for these counters come from?

AFAIK, we never do for any driver. They are statistics only and
need not be 100% correct, so the networking stack goes for
lower overhead and 99.9% correct.

> > +static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
> > +			    unsigned long count, loff_t pos)
> > +{
> > +	struct file *file = iocb->ki_filp;
> > +	struct macvtap_dev *vtap = file->private_data;
> > +	DECLARE_WAITQUEUE(wait, current);
> > +	struct sk_buff *skb;
> > +	ssize_t len, ret = 0;
> > +
> > +	if (!vtap)
> > +		return -EBADFD;
> > +
> > +	len = iov_length(iv, count);
> > +	if (len < 0) {
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	add_wait_queue(&vtap->wait, &wait);
> > +	while (len) {
> > +		current->state = TASK_INTERRUPTIBLE;
> > +
> > +		/* Read frames from the queue */
> > +		if (!(skb=skb_dequeue(&vtap->readq))) {
> > +			if (file->f_flags & O_NONBLOCK) {
> > +				ret = -EAGAIN;
> > +				break;
> > +			}
> > +			if (signal_pending(current)) {
> > +				ret = -ERESTARTSYS;
> > +				break;
> > +			}
> > +			/* Nothing to read, let's sleep */
> > +			schedule();
> > +			continue;
> > +		}
> > +		ret = macvtap_put_user(vtap, skb, (struct iovec *) iv, len);
> 
> Don't cast away the constness. Instead, fix macvtap_put_user
> to used skb_copy_datagram_const_iovec which does not modify the iovec.

Ah, good catch. I had copied that from the tun driver before you
fixed it there and failed to fix it the right way when I adapted
it for the new interface.

Thanks for the review,

	Arnd <><
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Patrick McHardy Aug. 10, 2009, 6:47 a.m. UTC | #5
Arnd Bergmann wrote:
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> new file mode 100644
> index 0000000..d99bfc0
> --- /dev/null
> +++ b/drivers/net/macvtap.c
> +static int macvtap_open(struct inode *inode, struct file *file)
> +{
> +	struct net *net = current->nsproxy->net_ns;
> +	int ifindex = iminor(inode);
> +	struct net_device *dev = dev_get_by_index(net, ifindex);
> +	int err;
> +
> +	err = -ENODEV;
> +	if (!dev)
> +		goto out1;
> +
> +	file->private_data = netdev_priv(dev);
> +	err = 0;
> +out1:
> +	return err;
> +}

macvlan will remove all macvlan/vtap devices when the underlying
device in unregistered, at which time you need to release the
device references you're holding. I'd suggest to change the
macvlan_device_event() handler to use

vlan->dev->rtnl_link_ops->dellink(vlan->dev)

instead of macvlan_dellink() so the macvtap_dellink callback
is invoked.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin Aug. 10, 2009, 8:50 a.m. UTC | #6
On Sun, Aug 09, 2009 at 08:42:24PM +0000, Arnd Bergmann wrote:
> On Sunday 09 August 2009 08:02:16 Michael S. Tsirkin wrote:
> > On Thu, Aug 06, 2009 at 09:50:28PM +0000, Arnd Bergmann wrote:
> > > * The same framework in macvlan can be used to add a third backend
> > > into a future kernel based virtio-net implementation.
> > 
> > Could you split the patches up, to make this last easier?
> > patch 1 - export framework
> > patch 2 - code using it
> 
> Sure, will do.
> 
> > > +/* Get packet from user space buffer */
> > > +static ssize_t macvtap_get_user(struct macvtap_dev *vtap,
> > > +			       const struct iovec *iv, size_t count,
> > > +			       int noblock)
> > > +{
> > > +	struct sk_buff *skb;
> > > +	size_t len = count;
> > > +
> > > +	if (unlikely(len < ETH_HLEN))
> > > +		return -EINVAL;
> > > +
> > > +	skb = alloc_skb(NET_IP_ALIGN + len, GFP_KERNEL);
> > > +
> > > +	if (!skb) {
> > > +		vtap->m.dev->stats.rx_dropped++;
> > > +		return -ENOMEM;
> > > +	}
> > > +
> > > +	skb_reserve(skb, NET_IP_ALIGN);
> > > +	skb_put(skb, count);
> > > +
> > > +	if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
> > > +		vtap->m.dev->stats.rx_dropped++;
> > > +		kfree_skb(skb);
> > > +		return -EFAULT;
> > > +	}
> > > +
> > > +	skb_set_network_header(skb, ETH_HLEN);
> > > +	skb->dev = vtap->m.lowerdev;
> > > +
> > > +	macvlan_start_xmit(skb, vtap->m.dev);
> > > +
> > > +	return count;
> > > +}
> > 
> > With tap, we discovered that not limiting the number of outstanding
> > skbs hurts UDP performance. And the solution was to limit
> > the number of outstanding packets - with hacks to work around
> > the fact that userspace .
> 
> Something seems to be missing in your last sentence here.

Most userspace does not seem to implement software flow control for UDP,
even though it probably should.

> My driver OTOH is also missing any sort of flow control in both
> RX and TX direction ;) For RX, there should probably just be
> a limit of frames that get buffered in the ring.
> 
> For TX, I guess there should be a way to let the packet
> scheduler handle this and give us a chance to block and
> unblock at the right time. I haven't found out yet how to
> do that.
> 
> Would it be enough to check the dev_queue_xmit() return
> code for NETDEV_TX_BUSY?
> 
> How would I get notified when it gets free again?

You can do this by creating a socket. Look at how tun does
this now.

> > > +	ret = skb_copy_datagram_iovec(skb, 0, iv, len);
> > > +
> > > +	vtap->m.dev->stats.rx_packets++;
> > > +	vtap->m.dev->stats.rx_bytes += len;
> > 
> > where does atomicity guarantee for these counters come from?
> 
> AFAIK, we never do for any driver. They are statistics only and
> need not be 100% correct, so the networking stack goes for
> lower overhead and 99.9% correct.
> 
> > > +static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
> > > +			    unsigned long count, loff_t pos)
> > > +{
> > > +	struct file *file = iocb->ki_filp;
> > > +	struct macvtap_dev *vtap = file->private_data;
> > > +	DECLARE_WAITQUEUE(wait, current);
> > > +	struct sk_buff *skb;
> > > +	ssize_t len, ret = 0;
> > > +
> > > +	if (!vtap)
> > > +		return -EBADFD;
> > > +
> > > +	len = iov_length(iv, count);
> > > +	if (len < 0) {
> > > +		ret = -EINVAL;
> > > +		goto out;
> > > +	}
> > > +
> > > +	add_wait_queue(&vtap->wait, &wait);
> > > +	while (len) {
> > > +		current->state = TASK_INTERRUPTIBLE;
> > > +
> > > +		/* Read frames from the queue */
> > > +		if (!(skb=skb_dequeue(&vtap->readq))) {
> > > +			if (file->f_flags & O_NONBLOCK) {
> > > +				ret = -EAGAIN;
> > > +				break;
> > > +			}
> > > +			if (signal_pending(current)) {
> > > +				ret = -ERESTARTSYS;
> > > +				break;
> > > +			}
> > > +			/* Nothing to read, let's sleep */
> > > +			schedule();
> > > +			continue;
> > > +		}
> > > +		ret = macvtap_put_user(vtap, skb, (struct iovec *) iv, len);
> > 
> > Don't cast away the constness. Instead, fix macvtap_put_user
> > to used skb_copy_datagram_const_iovec which does not modify the iovec.
> 
> Ah, good catch. I had copied that from the tun driver before you
> fixed it there and failed to fix it the right way when I adapted
> it for the new interface.
> 
> Thanks for the review,
> 
> 	Arnd <><
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann Aug. 10, 2009, 1:29 p.m. UTC | #7
On Monday 10 August 2009, Michael S. Tsirkin wrote:
> 
> > Would it be enough to check the dev_queue_xmit() return
> > code for NETDEV_TX_BUSY?
> > 
> > How would I get notified when it gets free again?
> 
> You can do this by creating a socket. Look at how tun does
> this now.

Hmm, I was hoping to be able to avoid this, because I can
interact more directly with the outbound physical interface
using dev_queue_xmit() instead of netif_rx_ni().

I'll have a look. Thanks,

	Arnd <><
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin Aug. 10, 2009, 1:42 p.m. UTC | #8
On Mon, Aug 10, 2009 at 03:29:46PM +0200, Arnd Bergmann wrote:
> On Monday 10 August 2009, Michael S. Tsirkin wrote:
> > 
> > > Would it be enough to check the dev_queue_xmit() return
> > > code for NETDEV_TX_BUSY?
> > > 
> > > How would I get notified when it gets free again?
> > 
> > You can do this by creating a socket. Look at how tun does
> > this now.
> 
> Hmm, I was hoping to be able to avoid this, because I can
> interact more directly with the outbound physical interface
> using dev_queue_xmit() instead of netif_rx_ni().

Yea, that's what tun does. socket just notifies you when
packets are freed.

> I'll have a look. Thanks,
> 
> 	Arnd <><
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann Aug. 10, 2009, 6:43 p.m. UTC | #9
On Monday 10 August 2009, Patrick McHardy wrote:

> macvlan will remove all macvlan/vtap devices when the underlying
> device in unregistered, at which time you need to release the
> device references you're holding. I'd suggest to change the
> macvlan_device_event() handler to use
> 
> vlan->dev->rtnl_link_ops->dellink(vlan->dev)
> 
> instead of macvlan_dellink() so the macvtap_dellink callback
> is invoked.

Ok, will do that. Thanks,

	Arnd <><

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 5f6509a..0b9ac6a 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -90,6 +90,18 @@  config MACVLAN
 	  To compile this driver as a module, choose M here: the module
 	  will be called macvlan.
 
+config MACVTAP
+	tristate "MAC-VLAN based tap driver (EXPERIMENTAL)"
+	depends on MACVLAN
+	help
+	  This adds a specialized tap character device driver that is based
+	  on the MAC-VLAN network interface, called macvtap. A macvtap device
+	  can be added in the same way as a macvlan device, using 'type
+	  macvlan', and then be accessed through the tap user space interface.
+	
+	  To compile this driver as a module, choose M here: the module
+	  will be called macvtap.
+
 config EQUALIZER
 	tristate "EQL (serial line load balancing) support"
 	---help---
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index ead8cab..8a2d2d7 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -162,6 +162,7 @@  obj-$(CONFIG_XEN_NETDEV_FRONTEND) += xen-netfront.o
 obj-$(CONFIG_DUMMY) += dummy.o
 obj-$(CONFIG_IFB) += ifb.o
 obj-$(CONFIG_MACVLAN) += macvlan.o
+obj-$(CONFIG_MACVTAP) += macvtap.o
 obj-$(CONFIG_DE600) += de600.o
 obj-$(CONFIG_DE620) += de620.o
 obj-$(CONFIG_LANCE) += lance.o
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 99eed9f..9f7dc6a 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -30,22 +30,7 @@ 
 #include <linux/if_macvlan.h>
 #include <net/rtnetlink.h>
 
-#define MACVLAN_HASH_SIZE	(1 << BITS_PER_BYTE)
-
-struct macvlan_port {
-	struct net_device	*dev;
-	struct hlist_head	vlan_hash[MACVLAN_HASH_SIZE];
-	struct list_head	vlans;
-};
-
-struct macvlan_dev {
-	struct net_device	*dev;
-	struct list_head	list;
-	struct hlist_node	hlist;
-	struct macvlan_port	*port;
-	struct net_device	*lowerdev;
-};
-
+#include "macvlan.h"
 
 static struct macvlan_dev *macvlan_hash_lookup(const struct macvlan_port *port,
 					       const unsigned char *addr)
@@ -135,7 +120,7 @@  static void macvlan_broadcast(struct sk_buff *skb,
 			else
 				nskb->pkt_type = PACKET_MULTICAST;
 
-			netif_rx(nskb);
+			vlan->receive(nskb);
 		}
 	}
 }
@@ -180,11 +165,11 @@  static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
 	skb->dev = dev;
 	skb->pkt_type = PACKET_HOST;
 
-	netif_rx(skb);
+	vlan->receive(skb);
 	return NULL;
 }
 
-static int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev)
+int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	const struct macvlan_dev *vlan = netdev_priv(dev);
 	unsigned int len = skb->len;
@@ -202,6 +187,7 @@  static int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 	return NETDEV_TX_OK;
 }
+EXPORT_SYMBOL_GPL(macvlan_start_xmit);
 
 static int macvlan_hard_header(struct sk_buff *skb, struct net_device *dev,
 			       unsigned short type, const void *daddr,
@@ -412,7 +398,7 @@  static const struct net_device_ops macvlan_netdev_ops = {
 	.ndo_validate_addr	= eth_validate_addr,
 };
 
-static void macvlan_setup(struct net_device *dev)
+void macvlan_setup(struct net_device *dev)
 {
 	ether_setup(dev);
 
@@ -423,6 +409,7 @@  static void macvlan_setup(struct net_device *dev)
 	dev->ethtool_ops	= &macvlan_ethtool_ops;
 	dev->tx_queue_len	= 0;
 }
+EXPORT_SYMBOL_GPL(macvlan_setup);
 
 static int macvlan_port_create(struct net_device *dev)
 {
@@ -472,7 +459,7 @@  static void macvlan_transfer_operstate(struct net_device *dev)
 	}
 }
 
-static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
+int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
 {
 	if (tb[IFLA_ADDRESS]) {
 		if (nla_len(tb[IFLA_ADDRESS]) != ETH_ALEN)
@@ -482,9 +469,10 @@  static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[])
 	}
 	return 0;
 }
+EXPORT_SYMBOL_GPL(macvlan_validate);
 
-static int macvlan_newlink(struct net_device *dev,
-			   struct nlattr *tb[], struct nlattr *data[])
+int macvlan_newlink(struct net_device *dev,
+		    struct nlattr *tb[], struct nlattr *data[])
 {
 	struct macvlan_dev *vlan = netdev_priv(dev);
 	struct macvlan_port *port;
@@ -524,6 +512,7 @@  static int macvlan_newlink(struct net_device *dev,
 	vlan->lowerdev = lowerdev;
 	vlan->dev      = dev;
 	vlan->port     = port;
+	vlan->receive  = netif_rx;
 
 	err = register_netdevice(dev);
 	if (err < 0)
@@ -533,8 +522,9 @@  static int macvlan_newlink(struct net_device *dev,
 	macvlan_transfer_operstate(dev);
 	return 0;
 }
+EXPORT_SYMBOL_GPL(macvlan_newlink);
 
-static void macvlan_dellink(struct net_device *dev)
+void macvlan_dellink(struct net_device *dev)
 {
 	struct macvlan_dev *vlan = netdev_priv(dev);
 	struct macvlan_port *port = vlan->port;
@@ -545,6 +535,7 @@  static void macvlan_dellink(struct net_device *dev)
 	if (list_empty(&port->vlans))
 		macvlan_port_destroy(port->dev);
 }
+EXPORT_SYMBOL_GPL(macvlan_dellink);
 
 static struct rtnl_link_ops macvlan_link_ops __read_mostly = {
 	.kind		= "macvlan",
diff --git a/drivers/net/macvlan.h b/drivers/net/macvlan.h
new file mode 100644
index 0000000..3f3c6c3
--- /dev/null
+++ b/drivers/net/macvlan.h
@@ -0,0 +1,37 @@ 
+#ifndef _MACVLAN_H
+#define _MACVLAN_H
+
+#include <linux/netdevice.h>
+#include <linux/netlink.h>
+#include <linux/list.h>
+
+#define MACVLAN_HASH_SIZE	(1 << BITS_PER_BYTE)
+
+struct macvlan_port {
+	struct net_device	*dev;
+	struct hlist_head	vlan_hash[MACVLAN_HASH_SIZE];
+	struct list_head	vlans;
+};
+
+struct macvlan_dev {
+	struct net_device	*dev;
+	struct list_head	list;
+	struct hlist_node	hlist;
+	struct macvlan_port	*port;
+	struct net_device	*lowerdev;
+
+	int (*receive)(struct sk_buff *skb);
+};
+
+extern int macvlan_start_xmit(struct sk_buff *skb, struct net_device *dev);
+
+extern void macvlan_setup(struct net_device *dev);
+
+extern int macvlan_validate(struct nlattr *tb[], struct nlattr *data[]);
+
+extern int macvlan_newlink(struct net_device *dev,
+		struct nlattr *tb[], struct nlattr *data[]);
+
+extern void macvlan_dellink(struct net_device *dev);
+
+#endif /* _MACVLAN_H */
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
new file mode 100644
index 0000000..d99bfc0
--- /dev/null
+++ b/drivers/net/macvtap.c
@@ -0,0 +1,276 @@ 
+#include <linux/etherdevice.h>
+#include <linux/nsproxy.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/cache.h>
+#include <linux/sched.h>
+#include <linux/types.h>
+#include <linux/init.h>
+#include <linux/wait.h>
+#include <linux/cdev.h>
+#include <linux/fs.h>
+
+#include <net/net_namespace.h>
+#include <net/rtnetlink.h>
+
+#include "macvlan.h"
+
+struct macvtap_dev {
+	struct macvlan_dev m;
+	struct cdev cdev;
+	struct sk_buff_head readq;
+	wait_queue_head_t wait;
+};
+
+/*
+ * Minor number matches netdev->ifindex, so need a large value
+ */
+static int macvtap_major;
+#define MACVTAP_NUM_DEVS 65536
+
+static int macvtap_receive(struct sk_buff *skb)
+{
+	struct macvtap_dev *vtap = netdev_priv(skb->dev);
+
+	skb_queue_tail(&vtap->readq, skb);
+	wake_up(&vtap->wait);
+	return 0;
+}
+
+static int macvtap_open(struct inode *inode, struct file *file)
+{
+	struct net *net = current->nsproxy->net_ns;
+	int ifindex = iminor(inode);
+	struct net_device *dev = dev_get_by_index(net, ifindex);
+	int err;
+
+	err = -ENODEV;
+	if (!dev)
+		goto out1;
+
+	file->private_data = netdev_priv(dev);
+	err = 0;
+out1:
+	return err;
+}
+
+static int macvtap_release(struct inode *inode, struct file *file)
+{
+	struct macvtap_dev *vtap = file->private_data;
+
+	if (!vtap)
+		return 0;
+
+	dev_put(vtap->m.dev);
+	return 0;
+}
+
+/* Get packet from user space buffer */
+static ssize_t macvtap_get_user(struct macvtap_dev *vtap,
+			       const struct iovec *iv, size_t count,
+			       int noblock)
+{
+	struct sk_buff *skb;
+	size_t len = count;
+
+	if (unlikely(len < ETH_HLEN))
+		return -EINVAL;
+
+	skb = alloc_skb(NET_IP_ALIGN + len, GFP_KERNEL);
+
+	if (!skb) {
+		vtap->m.dev->stats.rx_dropped++;
+		return -ENOMEM;
+	}
+
+	skb_reserve(skb, NET_IP_ALIGN);
+	skb_put(skb, count);
+
+	if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
+		vtap->m.dev->stats.rx_dropped++;
+		kfree_skb(skb);
+		return -EFAULT;
+	}
+
+	skb_set_network_header(skb, ETH_HLEN);
+	skb->dev = vtap->m.lowerdev;
+
+	macvlan_start_xmit(skb, vtap->m.dev);
+
+	return count;
+}
+
+static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
+			      unsigned long count, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	ssize_t result;
+	struct macvtap_dev *vtap = file->private_data;
+
+	result = macvtap_get_user(vtap, iv, iov_length(iv, count),
+			      file->f_flags & O_NONBLOCK);
+
+	return result;
+}
+
+/* Put packet to the user space buffer */
+static ssize_t macvtap_put_user(struct macvtap_dev *vtap,
+				       struct sk_buff *skb,
+				       struct iovec *iv, int len)
+{
+	int ret;
+
+	skb_push(skb, ETH_HLEN);
+	len = min_t(int, skb->len, len);
+
+	ret = skb_copy_datagram_iovec(skb, 0, iv, len);
+
+	vtap->m.dev->stats.rx_packets++;
+	vtap->m.dev->stats.rx_bytes += len;
+
+	return ret ? ret : len;
+}
+
+static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
+			    unsigned long count, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	struct macvtap_dev *vtap = file->private_data;
+	DECLARE_WAITQUEUE(wait, current);
+	struct sk_buff *skb;
+	ssize_t len, ret = 0;
+
+	if (!vtap)
+		return -EBADFD;
+
+	len = iov_length(iv, count);
+	if (len < 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	add_wait_queue(&vtap->wait, &wait);
+	while (len) {
+		current->state = TASK_INTERRUPTIBLE;
+
+		/* Read frames from the queue */
+		if (!(skb=skb_dequeue(&vtap->readq))) {
+			if (file->f_flags & O_NONBLOCK) {
+				ret = -EAGAIN;
+				break;
+			}
+			if (signal_pending(current)) {
+				ret = -ERESTARTSYS;
+				break;
+			}
+			/* Nothing to read, let's sleep */
+			schedule();
+			continue;
+		}
+		ret = macvtap_put_user(vtap, skb, (struct iovec *) iv, len);
+		kfree_skb(skb);
+		break;
+	}
+
+	current->state = TASK_RUNNING;
+	remove_wait_queue(&vtap->wait, &wait);
+
+out:
+	return ret;
+}
+
+struct file_operations macvtap_fops = {
+	.owner = THIS_MODULE,
+	.open = macvtap_open,
+	.release = macvtap_release,
+	.aio_read = macvtap_aio_read,
+	.aio_write = macvtap_aio_write,
+	.llseek = no_llseek,
+};
+
+static int macvtap_newlink(struct net_device *dev,
+	struct nlattr *tb[], struct nlattr *data[])
+{
+	struct macvtap_dev *vtap = netdev_priv(dev);
+	int err;
+
+	err = macvlan_newlink(dev, tb, data);
+	if (err)
+		goto out1;
+
+	cdev_init(&vtap->cdev, &macvtap_fops);
+	vtap->cdev.owner = THIS_MODULE;
+	err = cdev_add(&vtap->cdev, MKDEV(MAJOR(macvtap_major), dev->ifindex), 1);
+
+	if (err)
+		goto out2;
+
+	/*
+	 * TODO: add class dev so device node gets created automatically
+	 * by udev.
+	 */
+	pr_debug("%s:%d: added cdev %d:%d for dev %s\n",
+		__func__, __LINE__, MAJOR(macvtap_major),
+		dev->ifindex, dev->name);
+
+	skb_queue_head_init(&vtap->readq);
+	init_waitqueue_head(&vtap->wait);
+	vtap->m.receive = macvtap_receive;
+
+	return 0;
+
+out2:
+	macvlan_dellink(dev);
+out1:
+	return err;
+}
+
+static void macvtap_dellink(struct net_device *dev)
+{
+	struct macvtap_dev *vtap = netdev_priv(dev);
+	cdev_del(&vtap->cdev);
+	/* TODO: kill open file descriptors */
+	macvlan_dellink(dev);
+}
+
+static struct rtnl_link_ops macvtap_link_ops __read_mostly = {
+	.kind = "macvtap",
+	.priv_size = sizeof(struct macvtap_dev),
+	.setup = macvlan_setup,
+	.validate = macvlan_validate,
+	.newlink = macvtap_newlink,
+	.dellink = macvtap_dellink,
+};
+
+static int macvtap_init(void)
+{
+	int err;
+
+	err = alloc_chrdev_region(&macvtap_major, 0,
+				MACVTAP_NUM_DEVS, "macvtap");
+	if (err)
+		goto out1;
+
+	err = rtnl_link_register(&macvtap_link_ops);
+	if (err)
+		goto out2;
+
+	return 0;
+
+out2:
+	unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
+out1:
+	return err;
+}
+module_init(macvtap_init);
+
+static void macvtap_exit(void)
+{
+	rtnl_link_unregister(&macvtap_link_ops);
+	unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
+}
+module_exit(macvtap_exit);
+
+MODULE_ALIAS_RTNL_LINK("macvtap");
+MODULE_AUTHOR("Arnd Bergmann <arnd@arndb.de>");
+MODULE_LICENSE("GPL");