[RFC,net-next] tun: support retrieving multiple packets in a single read with IFF_MULTI_READ

Message ID 1417752000-27171-1-git-send-email-agartrell@fb.com
State RFC, archived
Delegated to: David Miller

Commit Message

Alex Gartrell Dec. 5, 2014, 4 a.m. UTC
This patch adds the IFF_MULTI_READ flag.  This has the following behavior.

1) If a read is too short for a packet, a single stripped packet will be read

2) If a read is long enough for multiple packets, as many *full* packets
will be read as possible.  We will not return a stripped packet, so even if
there are many, many packets, we may get a short read.
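
For illustration, a minimal user-space sketch of how the flag would be
consumed, assuming this RFC is applied (IFF_MULTI_READ exists only in
this patch, the device name "tun0" is arbitrary, and the loop assumes
IPv4 traffic for brevity).  With IFF_NO_PI there is no per-packet length
prefix, so boundaries must be recovered from the packets themselves,
here via the IPv4 total-length field:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <arpa/inet.h>
#include <linux/if.h>
#include <linux/if_tun.h>
#include <linux/ip.h>

int main(void)
{
	struct ifreq ifr;
	char buf[65536];
	int fd = open("/dev/net/tun", O_RDWR);

	memset(&ifr, 0, sizeof(ifr));
	ifr.ifr_flags = IFF_TUN | IFF_NO_PI | IFF_MULTI_READ;
	strncpy(ifr.ifr_name, "tun0", IFNAMSIZ);
	if (fd < 0 || ioctl(fd, TUNSETIFF, &ifr) < 0)
		return 1;

	for (;;) {
		ssize_t n = read(fd, buf, sizeof(buf));
		size_t off = 0;

		if (n <= 0)
			break;
		while (off < (size_t)n) {
			/* packets are packed back to back; each one's own
			 * IP header says how long it is */
			struct iphdr *ip = (struct iphdr *)(buf + off);
			size_t plen = ntohs(ip->tot_len);

			if (plen == 0 || off + plen > (size_t)n)
				break;	/* malformed; bail in this sketch */
			/* count/process one packet of plen bytes here */
			off += plen;
		}
	}
	return 0;
}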

In casual performance testing with a test program that simply reads
and counts packets, IFF_MULTI_READ conservatively yielded a 30% CPU win,
as measured by top.  Load was driven by a bunch of hpings running on a
server on the same L2 network (single hop through a top-of-rack switch).

Signed-off-by: Alex Gartrell <agartrell@fb.com>
---
 drivers/net/tun.c           | 66 ++++++++++++++++++++++++++++++++++++++-------
 include/uapi/linux/if_tun.h |  3 +++
 2 files changed, 60 insertions(+), 9 deletions(-)

Comments

Stephen Hemminger Dec. 9, 2014, 10:19 p.m. UTC | #1
On Thu, 4 Dec 2014 20:00:00 -0800
Alex Gartrell <agartrell@fb.com> wrote:

>  
> +static inline size_t tun_calc_max_put_len(const struct tun_struct *tun)
> +{
> +	size_t len = 0;

No need for inline with a local static function. Let the compiler decide.
Herbert Xu Dec. 22, 2014, 12:09 p.m. UTC | #2
Alex Gartrell <agartrell@fb.com> wrote:
> This patch adds the IFF_MULTI_READ flag.  This has the following behavior.
> 
> 1) If a read is too short for a packet, a single stripped packet will be read
> 
> 2) If a read is long enough for multiple packets, as many *full* packets
> will be read as possible.  We will not return a stripped packet, so even if
> there are many, many packets, we may get a short read.
> 
> In casual performance testing with a test program that simply reads
> and counts packets, IFF_MULTI_READ conservatively yielded a 30% CPU win,
> as measured by top.  Load was driven by a bunch of hpings running on a
> server on the same L2 network (single hop through a top-of-rack switch).
> 
> Signed-off-by: Alex Gartrell <agartrell@fb.com>

As tun already has a socket interface, can we do this through
recvmmsg?

Thanks,
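
For reference, recvmmsg(2) receives a batch of datagrams in one syscall.
A minimal sketch of the call pattern on a datagram fd; the tun_sock_fd
argument is hypothetical, since (per the discussion below) tun does not
currently expose its socket to user space:

#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH	64
#define PKT_MAX	2048

static int drain_batch(int tun_sock_fd)
{
	static char bufs[BATCH][PKT_MAX];
	struct mmsghdr msgs[BATCH];
	struct iovec iovs[BATCH];
	int i, n;

	memset(msgs, 0, sizeof(msgs));
	for (i = 0; i < BATCH; i++) {
		iovs[i].iov_base = bufs[i];
		iovs[i].iov_len  = PKT_MAX;
		msgs[i].msg_hdr.msg_iov    = &iovs[i];
		msgs[i].msg_hdr.msg_iovlen = 1;
	}

	/* one syscall, up to BATCH packets */
	n = recvmmsg(tun_sock_fd, msgs, BATCH, MSG_DONTWAIT, NULL);
	for (i = 0; i < n; i++) {
		/* bufs[i] holds one whole packet of msgs[i].msg_len bytes,
		 * so no length-prefix parsing is needed */
	}
	return n;
}
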
Alex Gartrell Dec. 22, 2014, 8:18 p.m. UTC | #3
Hey Herbert,

Thanks for getting back to me

On 12/22/14 4:09 AM, Herbert Xu wrote:
> As tun already has a socket interface, can we do this through
> recvmmsg?

This just presents an easier interface (IMHO) for accomplishing that.
And I say easier because I was unable to figure out the recvmmsg way to
do it.

While fully aware that this makes me look like an idiot, I have to admit 
that I've tried and failed to figure out how to get a socket fd out of 
the tun device.

The regular fd doesn't work (which is obvious when you look at the
implementation of sock_from_file), there's a tun_get_socket function but
it's only referenced by a single file, and none of the ioctls jump out
at me as doing anything to enable this behavior.  Additionally,
tuntap.txt makes no mention of sockets specifically.
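
For reference, the helper in question is declared only in the
kernel-internal header (include/linux/if_tun.h, not the uapi one), which
is why nothing reachable from a user-space fd leads to it:

/* kernel-internal; per the above, referenced from a single file (vhost) */
struct socket *tun_get_socket(struct file *file);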

FWIW, I don't feel strongly that IFF_MULTI_READ is the right way to do 
this either.

Thanks,
Dave Taht Dec. 22, 2014, 8:51 p.m. UTC | #4
On Mon, Dec 22, 2014 at 12:18 PM, Alex Gartrell <agartrell@fb.com> wrote:
> Hey Herbert,
>
> Thanks for getting back to me
>
> On 12/22/14 4:09 AM, Herbert Xu wrote:
>>
>> As tun already has a socket interface, can we do this through
>> recvmmsg?
>
>
> This just presents an easier interface (IMHO) for accomplishing that. And I
> say easier because I was unable to figure out the recvmmsg way to do it.

The recvmsg and recvmmsg calls, and the layers above them, could use an
abstraction that allows for better passing of per-packet header
information to applications in the QUIC and WebRTC age.
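
The existing vehicle for that kind of per-packet metadata is recvmsg(2)
ancillary data; a small sketch of the stock IPv4 IP_PKTINFO path (the
IPv6 analogue is IPV6_RECVPKTINFO), shown only to make the abstraction
concrete; nothing here is tun-specific:

#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <netinet/in.h>

static void recv_one_with_pktinfo(int fd)
{
	char data[2048];
	char cbuf[CMSG_SPACE(sizeof(struct in_pktinfo))];
	struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
	struct msghdr msg = {
		.msg_iov = &iov, .msg_iovlen = 1,
		.msg_control = cbuf, .msg_controllen = sizeof(cbuf),
	};
	struct cmsghdr *cm;
	int one = 1;

	/* must be enabled, or no cmsg is delivered */
	setsockopt(fd, IPPROTO_IP, IP_PKTINFO, &one, sizeof(one));
	if (recvmsg(fd, &msg, 0) < 0)
		return;
	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		if (cm->cmsg_level == IPPROTO_IP &&
		    cm->cmsg_type == IP_PKTINFO) {
			struct in_pktinfo *pi =
				(struct in_pktinfo *)CMSG_DATA(cm);
			/* pi->ipi_addr is the address the packet was
			 * actually sent to: the hook for one-socket,
			 * many-address designs like the /64 idea below */
			(void)pi;
		}
	}
}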

> While fully aware that this makes me look like an idiot, I have to admit

I have lost several days of hair to *msg calls. So have the authors of
multipath mosh (which is WAY cool, btw: https://github.com/boutier/mosh).

So, no, trying and failing does not make you an idiot. Trying at all does
make you a mite crazy, however. :)

> that I've tried and failed to figure out how to get a socket fd out of the
> tun device.
>
> The regular fd doesn't work (which is obvious when you look at the
> implementation of sock_from_file), there's a tun_get_socket function but
> it's only referenced by a single file, and none of the ioctls jump out at
> me as doing anything to enable this behavior.  Additionally, tuntap.txt
> makes no mention of sockets specifically.
>
> FWIW, I don't feel strongly that IFF_MULTI_READ is the right way to do this
> either.

I have been thinking about how to implement multiple ways of eliminating
serialization dependencies in userspace VPNs using fair queueing and
multithreading (splitting out the seqno + address across an entire /64)...

... and of eliminating excess latency with multi-packet reads, and then
codeling the internal queues (as many VPNs bottleneck on the encap and
encode step, allowing packets to accumulate in the OS recv buffer).

See:

http://www.tinc-vpn.org/pipermail/tinc-devel/2014-December/000680.html

And especially:

https://plus.google.com/u/0/107942175615993706558/posts/QWPWLoGMtrm

and after having just suffered through making that work with recvmsg, I
was dreading trying to make it work with recvmmsg.

It appears that one of the core crazy ideas (listening on an entire
/64) doesn't work with the existing APIs, and this new interface would
help? Or recvmmsg could be generalized? Or?
Herbert Xu Dec. 22, 2014, 10:34 p.m. UTC | #5
On Mon, Dec 22, 2014 at 12:18:39PM -0800, Alex Gartrell wrote:
> 
> While fully aware that this makes me look like an idiot, I have to
> admit that I've tried and failed to figure out how to get a socket
> fd out of the tun device.

Well right now the socket is only used within the kernel by
vhost so it's not exported to user-space.  If we were to use
recvmmsg obviously we'd create a new interface based on sockets
for tun and expose the existing socket through that.

The current file-based tun interface was never designed to be
a high-performance interface.  So let's take this opportunity
and create a new interface (but still using the same underlying
code since whatever you create should be easily applicable to
the existing kernel user, vhost).

Cheers,
Alex Gartrell Dec. 22, 2014, 11:39 p.m. UTC | #6
Hey Herbert,

On 12/22/2014 02:34 PM, Herbert Xu wrote:
> On Mon, Dec 22, 2014 at 12:18:39PM -0800, Alex Gartrell wrote:
>>
>> While fully aware that this makes me look like an idiot, I have to
>> admit that I've tried and failed to figure out how to get a socket
>> fd out of the tun device.
>
> Well right now the socket is only used within the kernel by
> vhost so it's not exported to user-space.  If we were to use
> recvmmsg obviously we'd create a new interface based on sockets
> for tun and expose the existing socket through that.

Ah, that explains it then.  I was afraid I was just going insane :)

> The current file-based tun interface was never designed to be
> a high-performance interface.  So let's take this opportunity
> and create a new interface (but still using the same underlying
> code since whatever you create should be easily applicable to
> the existing kernel user, vhost).

Sounds good to me. I'll get a patch turned around soon.

Thanks,

Patch

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 6d44da1..f57d618 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1228,6 +1228,26 @@  static ssize_t tun_chr_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	return result;
 }
 
+static inline size_t tun_calc_max_put_len(const struct tun_struct *tun)
+{
+	size_t len = 0;
+
+	/* It's a pain to peek the skb, so let's assume the worst:
+	 * 1) That skb->len = mtu
+	 * 2) That there is a vlan_tx_tag present
+	 */
+
+	len += tun->dev->mtu + VLAN_HLEN;
+
+	if (tun->flags & TUN_VNET_HDR)
+		len += tun->vnet_hdr_sz;
+
+	if (!(tun->flags & TUN_NO_PI))
+		len += sizeof(struct tun_pi);
+
+	return len;
+}
+
 /* Put packet to the user space buffer */
 static ssize_t tun_put_user(struct tun_struct *tun,
 			    struct tun_file *tfile,
@@ -1343,8 +1363,10 @@  static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 			   struct iov_iter *to,
 			   int noblock)
 {
+	const size_t max_put_len = tun_calc_max_put_len(tun);
 	struct sk_buff *skb;
-	ssize_t ret;
+	ssize_t ret = 0;
+	ssize_t put_ret = 0;
 	int peeked, err, off = 0;
 
 	tun_debug(KERN_INFO, tun, "tun_do_read\n");
@@ -1355,14 +1377,31 @@  static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 	if (tun->dev->reg_state != NETREG_REGISTERED)
 		return -EIO;
 
-	/* Read frames from queue */
-	skb = __skb_recv_datagram(tfile->socket.sk, noblock ? MSG_DONTWAIT : 0,
-				  &peeked, &off, &err);
-	if (!skb)
-		return 0;
+	while (!ret || ((tun->flags & TUN_MULTI_READ) &&
+			iov_iter_count(to) >= max_put_len)) {
+		/* Read frames from queue */
+		skb = __skb_recv_datagram(tfile->socket.sk,
+					  noblock ? MSG_DONTWAIT : 0,
+					  &peeked, &off, &err);
+		if (skb) {
+			put_ret = tun_put_user(tun, tfile, skb, to);
+			kfree_skb(skb);
+			if (put_ret < 0) {
+				ret = put_ret;
+				break;
+			}
+			ret += put_ret;
+		} else {
+			if (!ret)
+				ret = err;
+			break;
+		}
 
-	ret = tun_put_user(tun, tfile, skb, to);
-	kfree_skb(skb);
+		/* Now that we've received a datagram, noblock for the
+		 * rest
+		 */
+		noblock = 1;
+	}
 
 	return ret;
 }
@@ -1537,6 +1576,9 @@  static int tun_flags(struct tun_struct *tun)
 	if (tun->flags & TUN_PERSIST)
 		flags |= IFF_PERSIST;
 
+	if (tun->flags & TUN_MULTI_READ)
+		flags |= IFF_MULTI_READ;
+
 	return flags;
 }
 
@@ -1720,6 +1762,11 @@  static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
 	else
 		tun->flags &= ~TUN_TAP_MQ;
 
+	if (ifr->ifr_flags & IFF_MULTI_READ)
+		tun->flags |= TUN_MULTI_READ;
+	else
+		tun->flags &= ~TUN_MULTI_READ;
+
 	/* Make sure persistent devices do not get stuck in
 	 * xoff state.
 	 */
@@ -1883,7 +1930,8 @@  static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
 		 * This is needed because we never checked for invalid flags on
 		 * TUNSETIFF. */
 		return put_user(IFF_TUN | IFF_TAP | IFF_NO_PI | IFF_ONE_QUEUE |
-				IFF_VNET_HDR | IFF_MULTI_QUEUE,
+				IFF_VNET_HDR | IFF_MULTI_QUEUE |
+				IFF_MULTI_READ,
 				(unsigned int __user*)argp);
 	} else if (cmd == TUNSETQUEUE)
 		return tun_set_queue(file, &ifr);
diff --git a/include/uapi/linux/if_tun.h b/include/uapi/linux/if_tun.h
index e9502dd..aaf9ddc 100644
--- a/include/uapi/linux/if_tun.h
+++ b/include/uapi/linux/if_tun.h
@@ -36,6 +36,7 @@ 
 #define TUN_PERSIST 	0x0100	
 #define TUN_VNET_HDR 	0x0200
 #define TUN_TAP_MQ      0x0400
+#define TUN_MULTI_READ	0x0800
 
 /* Ioctl defines */
 #define TUNSETNOCSUM  _IOW('T', 200, int) 
@@ -74,6 +75,8 @@ 
 #define IFF_PERSIST	0x0800
 #define IFF_NOFILTER	0x1000
 
+#define IFF_MULTI_READ	0x2000
+
 /* Socket options */
 #define TUN_TX_TIMESTAMP 1