diff mbox

[RFC] tun: experimental zero copy tx support

Message ID 20120513155206.GA26847@redhat.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Michael S. Tsirkin May 13, 2012, 3:52 p.m. UTC
Let vhost-net utilize zero copy tx when the experimental
zero copy mode is enabled and when used with tun.  This works on
top of the patchset 'copy aside frags with destructors' that I posted
previously. This is not using tcp so doesn't have the
issue with early skb cloning noticed by Ian.

For those that wish to test this with kvm, I intend to post a patchset +
git tree with just the necessary bits from the destructor patch
a bit later.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/net/tun.c |  125 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 120 insertions(+), 5 deletions(-)

Comments

stephen hemminger May 14, 2012, 4:54 p.m. UTC | #1
On Sun, 13 May 2012 18:52:06 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> +		/* Userspace may produce vectors with count greater than
> +		 * MAX_SKB_FRAGS, so we need to linearize parts of the skb
> +		 * to let the rest of data to be fit in the frags.
> +		 */
Rather than complex partial code, just go through slow path for
requests with too many frags (or for really small requests).
Creating mixed skb's seems too easy to get wrong.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin May 14, 2012, 5:04 p.m. UTC | #2
On Mon, May 14, 2012 at 09:54:46AM -0700, Stephen Hemminger wrote:
> On Sun, 13 May 2012 18:52:06 +0300
> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> 
> > +		/* Userspace may produce vectors with count greater than
> > +		 * MAX_SKB_FRAGS, so we need to linearize parts of the skb
> > +		 * to let the rest of data to be fit in the frags.
> > +		 */
> Rather than complex partial code, just go through slow path for
> requests with too many frags (or for really small requests).
> Creating mixed skb's seems too easy to get wrong.

I don't object in principle but macvtap has same code
so seems better to stay consistent.
Shirley Ma May 14, 2012, 5:34 p.m. UTC | #3
On Sun, 2012-05-13 at 18:52 +0300, Michael S. Tsirkin wrote:
> Let vhost-net utilize zero copy tx when the experimental
> zero copy mode is enabled and when used with tun.  This works on
> top of the patchset 'copy aside frags with destructors' that I posted
> previously. This is not using tcp so doesn't have the
> issue with early skb cloning noticed by Ian.
> 
> For those that wish to test this with kvm, I intend to post a patchset
> +
> git tree with just the necessary bits from the destructor patch
> a bit later.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> 

Hello Mike,

Have you tested this patch? I think the difference between macvtap and
tap is tap forwarding the packet to bridge. The zerocopy is disabled in
this case.

Shirley

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin May 14, 2012, 6:39 p.m. UTC | #4
On Mon, May 14, 2012 at 10:34:50AM -0700, Shirley Ma wrote:
> On Sun, 2012-05-13 at 18:52 +0300, Michael S. Tsirkin wrote:
> > Let vhost-net utilize zero copy tx when the experimental
> > zero copy mode is enabled and when used with tun.  This works on
> > top of the patchset 'copy aside frags with destructors' that I posted
> > previously. This is not using tcp so doesn't have the
> > issue with early skb cloning noticed by Ian.
> > 
> > For those that wish to test this with kvm, I intend to post a patchset
> > +
> > git tree with just the necessary bits from the destructor patch
> > a bit later.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com> 
> 
> Hello Mike,
> 
> Have you tested this patch? I think the difference between macvtap and
> tap is tap forwarding the packet to bridge. The zerocopy is disabled in
> this case.
> 
> Shirley

Testing in progress, but the patchset I pointed to enables
zerocopy with bridge.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet May 14, 2012, 7:02 p.m. UTC | #5
On Mon, 2012-05-14 at 20:04 +0300, Michael S. Tsirkin wrote:
> On Mon, May 14, 2012 at 09:54:46AM -0700, Stephen Hemminger wrote:
> > On Sun, 13 May 2012 18:52:06 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> > > +		/* Userspace may produce vectors with count greater than
> > > +		 * MAX_SKB_FRAGS, so we need to linearize parts of the skb
> > > +		 * to let the rest of data to be fit in the frags.
> > > +		 */
> > Rather than complex partial code, just go through slow path for
> > requests with too many frags (or for really small requests).
> > Creating mixed skb's seems too easy to get wrong.
> 
> I don't object in principle but macvtap has same code
> so seems better to stay consistent.
> 

If I remember well, code in vtap was buggy and still is.

Jason Wang fixes are not yet in, so maybe wait a bit, so that we dont
add a pile of new bugs ?




--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin May 14, 2012, 7:14 p.m. UTC | #6
On Mon, May 14, 2012 at 09:02:26PM +0200, Eric Dumazet wrote:
> On Mon, 2012-05-14 at 20:04 +0300, Michael S. Tsirkin wrote:
> > On Mon, May 14, 2012 at 09:54:46AM -0700, Stephen Hemminger wrote:
> > > On Sun, 13 May 2012 18:52:06 +0300
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > 
> > > > +		/* Userspace may produce vectors with count greater than
> > > > +		 * MAX_SKB_FRAGS, so we need to linearize parts of the skb
> > > > +		 * to let the rest of data to be fit in the frags.
> > > > +		 */
> > > Rather than complex partial code, just go through slow path for
> > > requests with too many frags (or for really small requests).
> > > Creating mixed skb's seems too easy to get wrong.
> > 
> > I don't object in principle but macvtap has same code
> > so seems better to stay consistent.
> > 
> 
> If I remember well, code in vtap was buggy and still is.
> 
> Jason Wang fixes are not yet in,

They seem to be in net-next, or did I miss something?

> so maybe wait a bit, so that we dont
> add a pile of new bugs ?
> 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin May 14, 2012, 7:17 p.m. UTC | #7
On Mon, May 14, 2012 at 10:14:56PM +0300, Michael S. Tsirkin wrote:
> On Mon, May 14, 2012 at 09:02:26PM +0200, Eric Dumazet wrote:
> > On Mon, 2012-05-14 at 20:04 +0300, Michael S. Tsirkin wrote:
> > > On Mon, May 14, 2012 at 09:54:46AM -0700, Stephen Hemminger wrote:
> > > > On Sun, 13 May 2012 18:52:06 +0300
> > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > 
> > > > > +		/* Userspace may produce vectors with count greater than
> > > > > +		 * MAX_SKB_FRAGS, so we need to linearize parts of the skb
> > > > > +		 * to let the rest of data to be fit in the frags.
> > > > > +		 */
> > > > Rather than complex partial code, just go through slow path for
> > > > requests with too many frags (or for really small requests).
> > > > Creating mixed skb's seems too easy to get wrong.
> > > 
> > > I don't object in principle but macvtap has same code
> > > so seems better to stay consistent.
> > > 
> > 
> > If I remember well, code in vtap was buggy and still is.
> > 
> > Jason Wang fixes are not yet in,
> 
> They seem to be in net-next, or did I miss something?
> 
> > so maybe wait a bit, so that we dont
> > add a pile of new bugs ?
> > 
> > 

Things progress smoother upstream than out of tree
is my experience.

Also everything is guarded by a mod param in vhost
which is off by default and the name
experimental_zcopytx makes it hopefully clear there's
risk involved.

So the chance of hurting someone is imo minimal.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet May 14, 2012, 7:21 p.m. UTC | #8
On Mon, 2012-05-14 at 22:14 +0300, Michael S. Tsirkin wrote:

> They seem to be in net-next, or did I miss something?

Yes, you re-introduce in this patch some bugs already fixed in macvtap



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin May 14, 2012, 7:31 p.m. UTC | #9
On Mon, May 14, 2012 at 09:21:47PM +0200, Eric Dumazet wrote:
> On Mon, 2012-05-14 at 22:14 +0300, Michael S. Tsirkin wrote:
> 
> > They seem to be in net-next, or did I miss something?
> 
> Yes, you re-introduce in this patch some bugs already fixed in macvtap

Could explain why I see some problems in testing :)
Maybe common code should go into net/core?
I couldn't decide whether the increase in kernel
size is worth it.
Shirley Ma May 15, 2012, 6 p.m. UTC | #10
On Mon, 2012-05-14 at 22:31 +0300, Michael S. Tsirkin wrote:
> On Mon, May 14, 2012 at 09:21:47PM +0200, Eric Dumazet wrote:
> > On Mon, 2012-05-14 at 22:14 +0300, Michael S. Tsirkin wrote:
> > 
> > > They seem to be in net-next, or did I miss something?
> > 
> > Yes, you re-introduce in this patch some bugs already fixed in
> macvtap
> 
> Could explain why I see some problems in testing :)
> Maybe common code should go into net/core?
> I couldn't decide whether the increase in kernel
> size is worth it.

Which problem you hit during testing?

The logic of zerocopy_sg_from_iovec should work well. Jason's fix made
the code more clear.

Thanks
Shirley

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Shirley Ma May 16, 2012, 3:16 p.m. UTC | #11
On Mon, 2012-05-14 at 21:39 +0300, Michael S. Tsirkin wrote:
> > Hello Mike,
> > 
> > Have you tested this patch? I think the difference between macvtap
> and
> > tap is tap forwarding the packet to bridge. The zerocopy is disabled
> in
> > this case.
> > 
> > Shirley
> 
> Testing in progress, but the patchset I pointed to enables
> zerocopy with bridge. 

Hello Mike,

You meant this patch or another patchset for enabling bridge zerocopy?

I remembered we disabled forward skb zerocopy since the user space
program might hold the buffers too long or forever.

In tap/bridge case, when the tx buffers will be released?

Thanks
Shirley

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Michael S. Tsirkin May 16, 2012, 4:51 p.m. UTC | #12
On Wed, May 16, 2012 at 08:16:55AM -0700, Shirley Ma wrote:
> On Mon, 2012-05-14 at 21:39 +0300, Michael S. Tsirkin wrote:
> > > Hello Mike,
> > > 
> > > Have you tested this patch? I think the difference between macvtap
> > and
> > > tap is tap forwarding the packet to bridge. The zerocopy is disabled
> > in
> > > this case.
> > > 
> > > Shirley
> > 
> > Testing in progress, but the patchset I pointed to enables
> > zerocopy with bridge. 
> 
> Hello Mike,
> 
> You meant this patch or another patchset for enabling bridge zerocopy?
> 
> I remembered we disabled forward skb zerocopy since the user space
> program might hold the buffers too long or forever.
> 
> In tap/bridge case, when the tx buffers will be released?
> 
> Thanks
> Shirley

It still fails some tests for me but maybe I'll post the whole
patchset so you can see how it works.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index fe5cd2f3..eb10ee7 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -100,6 +100,8 @@  do {								\
 } while (0)
 #endif
 
+#define GOODCOPY_LEN 128
+
 #define FLT_EXACT_COUNT 8
 struct tap_filter {
 	unsigned int    count;    /* Number of addrs. Zero means disabled */
@@ -602,8 +604,80 @@  static struct sk_buff *tun_alloc_skb(struct tun_struct *tun,
 	return skb;
 }
 
+/* set skb frags from iovec, this can move to core network code for reuse */
+static int zerocopy_sg_from_iovec(struct sk_buff *skb, const struct iovec *from,
+				  int offset, size_t count)
+{
+	int len = iov_length(from, count) - offset;
+	int copy = skb_headlen(skb);
+	int size, offset1 = 0;
+	int i = 0;
+
+	/* Skip over from offset */
+	while (count && (offset >= from->iov_len)) {
+		offset -= from->iov_len;
+		++from;
+		--count;
+	}
+
+	/* copy up to skb headlen */
+	while (count && (copy > 0)) {
+		size = min_t(unsigned int, copy, from->iov_len - offset);
+		if (copy_from_user(skb->data + offset1, from->iov_base + offset,
+				   size))
+			return -EFAULT;
+		if (copy > size) {
+			++from;
+			--count;
+		}
+		copy -= size;
+		offset1 += size;
+		offset = 0;
+	}
+
+	if (len == offset1)
+		return 0;
+
+	while (count--) {
+		struct page *page[MAX_SKB_FRAGS];
+		int num_pages;
+		unsigned long base;
+
+		len = from->iov_len - offset1;
+		if (!len) {
+			offset1 = 0;
+			++from;
+			continue;
+		}
+		base = (unsigned long)from->iov_base + offset1;
+		size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
+		num_pages = get_user_pages_fast(base, size, 0, &page[i]);
+		if ((num_pages != size) ||
+		    (num_pages > MAX_SKB_FRAGS - skb_shinfo(skb)->nr_frags))
+			/* put_page is in skb free */
+			return -EFAULT;
+		skb->data_len += len;
+		skb->len += len;
+		skb->truesize += len;
+		atomic_add(len, &skb->sk->sk_wmem_alloc);
+		while (len) {
+			int off = base & ~PAGE_MASK;
+			int size = min_t(int, len, PAGE_SIZE - off);
+			__skb_fill_page_desc(skb, i, page[i], off, size);
+			skb_shinfo(skb)->nr_frags++;
+			/* increase sk_wmem_alloc */
+			base += size;
+			len -= size;
+			i++;
+		}
+		offset1 = 0;
+		++from;
+	}
+	return 0;
+}
+
 /* Get packet from user space buffer */
-static ssize_t tun_get_user(struct tun_struct *tun,
+static ssize_t tun_get_user(struct tun_struct *tun, void *msg_control,
 			    const struct iovec *iv, size_t count,
 			    int noblock)
 {
@@ -612,6 +686,9 @@  static ssize_t tun_get_user(struct tun_struct *tun,
 	size_t len = count, align = NET_SKB_PAD;
 	struct virtio_net_hdr gso = { 0 };
 	int offset = 0;
+	int copylen;
+	bool zerocopy = false;
+	int err;
 
 	if (!(tun->flags & TUN_NO_PI)) {
 		if ((len -= sizeof(pi)) > count)
@@ -645,14 +722,47 @@  static ssize_t tun_get_user(struct tun_struct *tun,
 			return -EINVAL;
 	}
 
-	skb = tun_alloc_skb(tun, align, len, gso.hdr_len, noblock);
+	if (msg_control)
+		zerocopy = true;
+
+	if (zerocopy) {
+		/* Userspace may produce vectors with count greater than
+		 * MAX_SKB_FRAGS, so we need to linearize parts of the skb
+		 * to let the rest of data to be fit in the frags.
+		 */
+		if (count > MAX_SKB_FRAGS) {
+			copylen = iov_length(iv, count - MAX_SKB_FRAGS);
+			if (copylen < offset)
+				copylen = 0;
+			else
+				copylen -= offset;
+		} else
+				copylen = 0;
+		/* There are 256 bytes to be copied in skb, so there is enough
+		 * room for skb expand head in case it is used.
+		 * The rest of the buffer is mapped from userspace.
+		 */
+		if (copylen < gso.hdr_len)
+			copylen = gso.hdr_len;
+		if (!copylen)
+			copylen = GOODCOPY_LEN;
+	} else
+		copylen = len;
+
+	skb = tun_alloc_skb(tun, align, copylen, gso.hdr_len, noblock);
 	if (IS_ERR(skb)) {
 		if (PTR_ERR(skb) != -EAGAIN)
 			tun->dev->stats.rx_dropped++;
 		return PTR_ERR(skb);
 	}
 
-	if (skb_copy_datagram_from_iovec(skb, 0, iv, offset, len)) {
+	if (zerocopy) {
+		err = zerocopy_sg_from_iovec(skb, iv, offset, count);
+		skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
+	} else
+		err = skb_copy_datagram_from_iovec(skb, 0, iv, offset, len);
+
+	if (err) {
 		tun->dev->stats.rx_dropped++;
 		kfree_skb(skb);
 		return -EFAULT;
@@ -726,6 +836,10 @@  static ssize_t tun_get_user(struct tun_struct *tun,
 		skb_shinfo(skb)->gso_segs = 0;
 	}
 
+	/* copy skb_ubuf_info for callback when skb has no error */
+	if (zerocopy)
+		skb_shinfo(skb)->destructor_arg = msg_control;
+
 	netif_rx_ni(skb);
 
 	tun->dev->stats.rx_packets++;
@@ -746,7 +860,7 @@  static ssize_t tun_chr_aio_write(struct kiocb *iocb, const struct iovec *iv,
 
 	tun_debug(KERN_INFO, tun, "tun_chr_write %ld\n", count);
 
-	result = tun_get_user(tun, iv, iov_length(iv, count),
+	result = tun_get_user(tun, NULL, iv, iov_length(iv, count),
 			      file->f_flags & O_NONBLOCK);
 
 	tun_put(tun);
@@ -960,7 +1074,7 @@  static int tun_sendmsg(struct kiocb *iocb, struct socket *sock,
 		       struct msghdr *m, size_t total_len)
 {
 	struct tun_struct *tun = container_of(sock, struct tun_struct, socket);
-	return tun_get_user(tun, m->msg_iov, total_len,
+	return tun_get_user(tun, m->msg_control, m->msg_iov, total_len,
 			    m->msg_flags & MSG_DONTWAIT);
 }
 
@@ -1130,6 +1244,7 @@  static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 		sock_init_data(&tun->socket, sk);
 		sk->sk_write_space = tun_sock_write_space;
 		sk->sk_sndbuf = INT_MAX;
+		sock_set_flag(sk, SOCK_ZEROCOPY);
 
 		tun_sk(sk)->tun = tun;