[1/4] Add basic version of bridge helper

Message ID 1317915508-15491-2-git-send-email-rmarwah@linux.vnet.ibm.com
State New

Commit Message

Richa Marwaha Oct. 6, 2011, 3:38 p.m. UTC
This patch adds a helper that can be used to create a tap device attached to
a bridge device.  Since this helper is minimal in what it does, it can be
given CAP_NET_ADMIN which allows qemu to avoid running as root while still
satisfying the majority of what users tend to want to do with tap devices.

The way this all works is that qemu launches this helper passing a bridge
name and the name of an inherited file descriptor.  The descriptor is one
end of a socketpair() of domain sockets.  This domain socket is used to
transmit a file descriptor of the opened tap device from the helper to qemu.

The helper can then exit and let qemu use the tap device.

Signed-off-by: Richa Marwaha <rmarwah@linux.vnet.ibm.com>
---
 Makefile             |   12 +++-
 configure            |    1 +
 qemu-bridge-helper.c |  205 ++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 216 insertions(+), 2 deletions(-)
 create mode 100644 qemu-bridge-helper.c
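
The hand-off described in the commit message can be illustrated from the receiving side, i.e. what the launching process might do with its end of the socketpair(). This is only a sketch and is not part of the patch: recv_fd(), send_one_fd() and roundtrip_demo() are hypothetical names, send_one_fd() is a simplified stand-in for the patch's send_fd(), and a pipe stands in for the tap device so the example runs without privileges.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized for one descriptor; the union keeps it aligned. */
union fd_ctrl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Hypothetical receiver: read one byte plus one SCM_RIGHTS descriptor.
 * Returns the received descriptor, or -1 on error. */
static int recv_fd(int sock)
{
    char byte;
    union fd_ctrl ctrl;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    assert(sock >= 0);
    memset(&msg, 0, sizeof(msg));
    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
    return fd;
}

/* Simplified stand-in for the helper's sender. Returns 0 or -1. */
static int send_one_fd(int sock, int fd)
{
    char byte = 0;
    union fd_ctrl ctrl;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(fd));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Pass the write end of a pipe across a socketpair and check that the
 * received descriptor refers to the same pipe. Returns 0 on success. */
static int roundtrip_demo(void)
{
    int sv[2], p[2], got;
    char c = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1 || pipe(p) == -1) {
        return -1;
    }
    if (send_one_fd(sv[0], p[1]) == -1) {
        return -1;
    }
    got = recv_fd(sv[1]);
    if (got == -1) {
        return -1;
    }
    if (write(got, "x", 1) != 1 || read(p[0], &c, 1) != 1) {
        return -1;
    }
    close(got); close(p[0]); close(p[1]); close(sv[0]); close(sv[1]);
    return c == 'x' ? 0 : -1;
}
```

The received descriptor is a new number in the receiver's table that refers to the same open file, which is why the sender can close its copy and exit afterwards.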

Comments

Daniel P. Berrangé Oct. 6, 2011, 4:41 p.m. UTC | #1
On Thu, Oct 06, 2011 at 11:38:25AM -0400, Richa Marwaha wrote:
> This patch adds a helper that can be used to create a tap device attached to
> a bridge device.  Since this helper is minimal in what it does, it can be
> given CAP_NET_ADMIN which allows qemu to avoid running as root while still
> satisfying the majority of what users tend to want to do with tap devices.
> 
> The way this all works is that qemu launches this helper passing a bridge
> name and the name of an inherited file descriptor.  The descriptor is one
> end of a socketpair() of domain sockets.  This domain socket is used to
> transmit a file descriptor of the opened tap device from the helper to qemu.
> 
> The helper can then exit and let qemu use the tap device.

When QEMU is run by libvirt, we generally like to use capng to
remove the ability for QEMU to run setuid programs at all. So
obviously it will struggle to run the qemu-bridge-helper binary
in such a scenario.

With the way you transmit the TAP device FD back to the caller,
it looks like libvirt itself could execute the qemu-bridge-helper
receiving the FD, and then pass the FD onto QEMU using the
traditional tap,fd=XX syntax.

The TAP device FD is only one FD we normally pass to QEMU. How about
support for vhost net ? Is it reasonable to ask the qemu-bridge-helper
to send back a vhost net FD also. Or indeed multiple vhost net FDs
when we get multiqueue NICs.  Should we expect the bridge helper to
be strictly limited to just connecting a TAP dev to a bridge, or is
the expectation that it will grow more & more functionality over
time ?

Daniel
Anthony Liguori Oct. 6, 2011, 5:44 p.m. UTC | #2
On 10/06/2011 10:38 AM, Richa Marwaha wrote:
> This patch adds a helper that can be used to create a tap device attached to
> a bridge device.  Since this helper is minimal in what it does, it can be
> given CAP_NET_ADMIN which allows qemu to avoid running as root while still
> satisfying the majority of what users tend to want to do with tap devices.
>
> The way this all works is that qemu launches this helper passing a bridge
> name and the name of an inherited file descriptor.  The descriptor is one
> end of a socketpair() of domain sockets.  This domain socket is used to
> transmit a file descriptor of the opened tap device from the helper to qemu.
>
> The helper can then exit and let qemu use the tap device.
>
> Signed-off-by: Richa Marwaha <rmarwah@linux.vnet.ibm.com>
> ---
>   Makefile             |   12 +++-
>   configure            |    1 +
>   qemu-bridge-helper.c |  205 ++++++++++++++++++++++++++++++++++++++++++++++++++
>   3 files changed, 216 insertions(+), 2 deletions(-)
>   create mode 100644 qemu-bridge-helper.c
>
> diff --git a/Makefile b/Makefile
> index 6ed3194..f2caedc 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -34,6 +34,8 @@ $(call set-vpath, $(SRC_PATH):$(SRC_PATH)/hw)
>
>   LIBS+=-lz $(LIBS_TOOLS)
>
> +HELPERS-$(CONFIG_LINUX) = qemu-bridge-helper$(EXESUF)
> +
>   ifdef BUILD_DOCS
>   DOCS=qemu-doc.html qemu-tech.html qemu.1 qemu-img.1 qemu-nbd.8 QMP/qmp-commands.txt
>   else
> @@ -74,7 +76,7 @@ defconfig:
>
>   -include config-all-devices.mak
>
> -build-all: $(DOCS) $(TOOLS) recurse-all
> +build-all: $(DOCS) $(TOOLS) $(HELPERS-y) recurse-all
>
>   config-host.h: config-host.h-timestamp
>   config-host.h-timestamp: config-host.mak
> @@ -151,6 +153,8 @@ qemu-nbd$(EXESUF): qemu-nbd.o qemu-tool.o qemu-error.o $(oslib-obj-y) $(trace-ob
>
>   qemu-io$(EXESUF): qemu-io.o cmd.o qemu-tool.o qemu-error.o $(oslib-obj-y) $(trace-obj-y) $(block-obj-y) $(qobject-obj-y) $(version-obj-y) qemu-timer-common.o
>
> +qemu-bridge-helper$(EXESUF): qemu-bridge-helper.o
> +
>   qemu-img-cmds.h: $(SRC_PATH)/qemu-img-cmds.hx
>   	$(call quiet-command,sh $(SRC_PATH)/scripts/hxtool -h < $< > $@,"  GEN   $@")
>
> @@ -208,7 +212,7 @@ clean:
>   # avoid old build problems by removing potentially incorrect old files
>   	rm -f config.mak op-i386.h opc-i386.h gen-op-i386.h op-arm.h opc-arm.h gen-op-arm.h
>   	rm -f qemu-options.def
> -	rm -f *.o *.d *.a *.lo $(TOOLS) qemu-ga TAGS cscope.* *.pod *~ */*~
> +	rm -f *.o *.d *.a *.lo $(TOOLS) $(HELPERS-y) qemu-ga TAGS cscope.* *.pod *~ */*~
>   	rm -Rf .libs
>   	rm -f slirp/*.o slirp/*.d audio/*.o audio/*.d block/*.o block/*.d net/*.o net/*.d fsdev/*.o fsdev/*.d ui/*.o ui/*.d qapi/*.o qapi/*.d qga/*.o qga/*.d
>   	rm -f qemu-img-cmds.h
> @@ -275,6 +279,10 @@ install: all $(if $(BUILD_DOCS),install-doc) install-sysconfig
>   ifneq ($(TOOLS),)
>   	$(INSTALL_PROG) $(STRIP_OPT) $(TOOLS) "$(DESTDIR)$(bindir)"
>   endif
> +ifneq ($(HELPERS-y),)
> +	$(INSTALL_DIR) "$(DESTDIR)$(libexecdir)"
> +	$(INSTALL_PROG) $(STRIP_OPT) $(HELPERS-y) "$(DESTDIR)$(libexecdir)"
> +endif
>   ifneq ($(BLOBS),)
>   	$(INSTALL_DIR) "$(DESTDIR)$(datadir)"
>   	set -e; for x in $(BLOBS); do \
> diff --git a/configure b/configure
> index 59b1494..3e32834 100755
> --- a/configure
> +++ b/configure
> @@ -2742,6 +2742,7 @@ echo "mandir=$mandir" >> $config_host_mak
>   echo "datadir=$datadir" >> $config_host_mak
>   echo "sysconfdir=$sysconfdir" >> $config_host_mak
>   echo "docdir=$docdir" >> $config_host_mak
> +echo "libexecdir=\${prefix}/libexec" >> $config_host_mak
>   echo "confdir=$confdir" >> $config_host_mak
>
>   case "$cpu" in
> diff --git a/qemu-bridge-helper.c b/qemu-bridge-helper.c
> new file mode 100644
> index 0000000..4ac7b36
> --- /dev/null
> +++ b/qemu-bridge-helper.c
> @@ -0,0 +1,205 @@
> +/*
> + * QEMU Bridge Helper
> + *
> + * Copyright IBM, Corp. 2011
> + *
> + * Authors:
> + * Anthony Liguori<address@hidden>

Heh, fairly sure that's not my email address ;-)

> + *
> + * This work is licensed under the terms of the GNU GPL, version 2.  See
> + * the COPYING file in the top-level directory.
> + *
> + */
> +
> +#include "config-host.h"
> +
> +#include <stdio.h>
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <unistd.h>
> +#include <string.h>
> +#include <stdlib.h>
> +#include <ctype.h>
> +
> +#include <sys/types.h>
> +#include <sys/ioctl.h>
> +#include <sys/socket.h>
> +#include <sys/un.h>
> +#include <sys/prctl.h>
> +
> +#include <net/if.h>
> +
> +#include <linux/sockios.h>
> +
> +#include "net/tap-linux.h"
> +
> +static int has_vnet_hdr(int fd)
> +{
> +    unsigned int features = 0;
> +    struct ifreq ifreq;
> +
> +    if (ioctl(fd, TUNGETFEATURES, &features) == -1) {
> +        return -errno;
> +    }
> +
> +    if (!(features & IFF_VNET_HDR)) {
> +        return -ENOTSUP;
> +    }
> +
> +    if (ioctl(fd, TUNGETIFF, &ifreq) != -1 || errno != EBADFD) {
> +        return -ENOTSUP;
> +    }
> +
> +    return 1;
> +}
> +
> +static void prep_ifreq(struct ifreq *ifr, const char *ifname)
> +{
> +    memset(ifr, 0, sizeof(*ifr));
> +    snprintf(ifr->ifr_name, IFNAMSIZ, "%s", ifname);
> +}
> +
> +static int send_fd(int c, int fd)
> +{
> +    char msgbuf[CMSG_SPACE(sizeof(fd))];
> +    struct msghdr msg = {
> +        .msg_control = msgbuf,
> +        .msg_controllen = sizeof(msgbuf),
> +    };
> +    struct cmsghdr *cmsg;
> +    struct iovec iov;
> +    char req[1] = { 0x00 };
> +
> +    cmsg = CMSG_FIRSTHDR(&msg);
> +    cmsg->cmsg_level = SOL_SOCKET;
> +    cmsg->cmsg_type = SCM_RIGHTS;
> +    cmsg->cmsg_len = CMSG_LEN(sizeof(fd));
> +    msg.msg_controllen = cmsg->cmsg_len;
> +
> +    iov.iov_base = req;
> +    iov.iov_len = sizeof(req);
> +
> +    msg.msg_iov = &iov;
> +    msg.msg_iovlen = 1;
> +    memcpy(CMSG_DATA(cmsg), &fd, sizeof(fd));
> +
> +    return sendmsg(c, &msg, 0);
> +}
> +
> +int main(int argc, char **argv)
> +{
> +    struct ifreq ifr;
> +    int fd, ctlfd, unixfd;
> +    int use_vnet = 0;
> +    int mtu;
> +    const char *bridge;
> +    char iface[IFNAMSIZ];
> +    int index;
> +
> +    /* parse arguments */
> +    if (argc < 3 || argc > 4) {
> +        fprintf(stderr, "Usage: %s [--use-vnet] BRIDGE FD\n", argv[0]);
> +        return 1;
> +    }
> +
> +    index = 1;
> +    if (strcmp(argv[index], "--use-vnet") == 0) {
> +        use_vnet = 1;
> +        index++;
> +        if (argc == 3) {
> +            fprintf(stderr, "invalid number of arguments\n");
> +            return -1;
> +        }
> +    }
> +
> +    bridge = argv[index++];
> +    unixfd = atoi(argv[index++]);
> +
> +    /* open a socket to use to control the network interfaces */
> +    ctlfd = socket(AF_INET, SOCK_STREAM, 0);
> +    if (ctlfd == -1) {
> +        fprintf(stderr, "failed to open control socket\n");
> +        return -errno;
> +    }
> +
> +    /* open the tap device */
> +    fd = open("/dev/net/tun", O_RDWR);
> +    if (fd == -1) {
> +        fprintf(stderr, "failed to open /dev/net/tun\n");
> +        return -errno;
> +    }
> +
> +    /* request a tap device, disable PI, and add vnet header support if
> +     * requested and it's available. */
> +    prep_ifreq(&ifr, "tap%d");
> +    ifr.ifr_flags = IFF_TAP|IFF_NO_PI;
> +    if (use_vnet && has_vnet_hdr(fd)) {
> +        ifr.ifr_flags |= IFF_VNET_HDR;
> +    }
> +
> +    if (ioctl(fd, TUNSETIFF, &ifr) == -1) {
> +        fprintf(stderr, "failed to create tun device\n");
> +        return -errno;
> +    }
> +
> +    /* save tap device name */
> +    snprintf(iface, sizeof(iface), "%s", ifr.ifr_name);
> +
> +    /* get the mtu of the bridge */
> +    prep_ifreq(&ifr, bridge);
> +    if (ioctl(ctlfd, SIOCGIFMTU, &ifr) == -1) {
> +        fprintf(stderr, "failed to get mtu of bridge `%s'\n", bridge);
> +        return -errno;
> +    }
> +
> +    /* save mtu */
> +    mtu = ifr.ifr_mtu;
> +
> +    /* set the mtu of the interface based on the bridge */
> +    prep_ifreq(&ifr, iface);
> +    ifr.ifr_mtu = mtu;
> +    if (ioctl(ctlfd, SIOCSIFMTU, &ifr) == -1) {
> +        fprintf(stderr, "failed to set mtu of device `%s' to %d\n",
> +                iface, mtu);
> +        return -errno;
> +    }
> +
> +    /* add the interface to the bridge */
> +    prep_ifreq(&ifr, bridge);
> +    ifr.ifr_ifindex = if_nametoindex(iface);
> +
> +    if (ioctl(ctlfd, SIOCBRADDIF, &ifr) == -1) {
> +        fprintf(stderr, "failed to add interface `%s' to bridge `%s'\n",
> +                iface, bridge);
> +        return -errno;
> +    }
> +
> +    /* bring the interface up */
> +    prep_ifreq(&ifr, iface);
> +    if (ioctl(ctlfd, SIOCGIFFLAGS, &ifr) == -1) {
> +        fprintf(stderr, "failed to get interface flags for `%s'\n", iface);
> +        return -errno;
> +    }
> +
> +    ifr.ifr_flags |= IFF_UP;
> +    if (ioctl(ctlfd, SIOCSIFFLAGS, &ifr) == -1) {
> +        fprintf(stderr, "failed to set bring up interface `%s'\n", iface);
> +        return -errno;
> +    }
> +
> +    /* write fd to the domain socket */
> +    if (send_fd(unixfd, fd) == -1) {
> +        fprintf(stderr, "failed to write fd to unix socket\n");
> +        return -errno;
> +    }
> +
> +    /* ... */
> +
> +    /* profit! */

Sold!

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>

Please put my SoB before yours in the next submission.

Regards,

Anthony Liguori

> +
> +    close(fd);
> +
> +    close(ctlfd);
> +
> +    return 0;
> +}
Anthony Liguori Oct. 6, 2011, 6:04 p.m. UTC | #3
On 10/06/2011 11:41 AM, Daniel P. Berrange wrote:
> On Thu, Oct 06, 2011 at 11:38:25AM -0400, Richa Marwaha wrote:
>> This patch adds a helper that can be used to create a tap device attached to
>> a bridge device.  Since this helper is minimal in what it does, it can be
>> given CAP_NET_ADMIN which allows qemu to avoid running as root while still
>> satisfying the majority of what users tend to want to do with tap devices.
>>
>> The way this all works is that qemu launches this helper passing a bridge
>> name and the name of an inherited file descriptor.  The descriptor is one
>> end of a socketpair() of domain sockets.  This domain socket is used to
>> transmit a file descriptor of the opened tap device from the helper to qemu.
>>
>> The helper can then exit and let qemu use the tap device.
>
> When QEMU is run by libvirt, we generally like to use capng to
> remove the ability for QEMU to run setuid programs at all. So
> obviously it will struggle to run the qemu-bridge-helper binary
> in such a scenario.
>
> With the way you transmit the TAP device FD back to the caller,
> it looks like libvirt itself could execute the qemu-bridge-helper
> receiving the FD, and then pass the FD onto QEMU using the
> traditional tap,fd=XX syntax.

Exactly.  This would allow tap-based networking using libvirt session:// URIs.

>
> The TAP device FD is only one FD we normally pass to QEMU. How about
> support for vhost net ? Is it reasonable to ask the qemu-bridge-helper
> to send back a vhost net FD also.

Absolutely.

> Or indeed multiple vhost net FDs
> when we get multiqueue NICs.  Should we expect the bridge helper to
> be strictly limited to just connecting a TAP dev to a bridge, or is
> the expectation that it will grow more & more functionality over
> time ?

I would not expect it to do more than create virtual network interfaces, and add 
them to bridges.  Multiqueue virtual nics, vhost, etc. would all be in scope as 
they are part of creating a virtual network interface.
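
One possible way for a helper to hand back several descriptors at once (for example a tap FD plus one or more vhost-net FDs) is a single SCM_RIGHTS message carrying an array. The sketch below is not from the patch: send_fds(), recv_fds(), multi_fd_demo() and the MAX_FDS cap are assumptions made for illustration, and pipes stand in for the tap and vhost-net descriptors so that it runs without privileges.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define MAX_FDS 8   /* arbitrary cap chosen for this sketch */

/* Control buffer large enough for MAX_FDS descriptors, kept aligned. */
union fds_ctrl {
    char buf[CMSG_SPACE(sizeof(int) * MAX_FDS)];
    struct cmsghdr align;
};

/* Send n descriptors in one SCM_RIGHTS message. Returns 0 or -1. */
static int send_fds(int sock, const int *fds, int n)
{
    char byte = 0;
    union fds_ctrl ctrl;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(fds != NULL);
    if (n < 1 || n > MAX_FDS) {
        return -1;
    }
    memset(&msg, 0, sizeof(msg));
    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = CMSG_SPACE(sizeof(int) * n);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int) * n);
    memcpy(CMSG_DATA(cmsg), fds, sizeof(int) * n);

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive descriptors into fds[], which must hold MAX_FDS entries.
 * Returns the number received, or -1 on error. */
static int recv_fds(int sock, int *fds)
{
    char byte;
    union fds_ctrl ctrl;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int n;

    memset(&msg, 0, sizeof(msg));
    memset(&ctrl, 0, sizeof(ctrl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) != 1 || (msg.msg_flags & MSG_CTRUNC)) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS) {
        return -1;
    }
    n = (int)((cmsg->cmsg_len - CMSG_LEN(0)) / sizeof(int));
    memcpy(fds, CMSG_DATA(cmsg), sizeof(int) * n);
    return n;
}

/* Send the write ends of two pipes and verify both arrive in order. */
static int multi_fd_demo(void)
{
    int sv[2], a[2], b[2], out[2], got[MAX_FDS];
    char ca = 0, cb = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1 ||
        pipe(a) == -1 || pipe(b) == -1) {
        return -1;
    }
    out[0] = a[1];
    out[1] = b[1];
    if (send_fds(sv[0], out, 2) == -1 || recv_fds(sv[1], got) != 2) {
        return -1;
    }
    if (write(got[0], "a", 1) != 1 || write(got[1], "b", 1) != 1 ||
        read(a[0], &ca, 1) != 1 || read(b[0], &cb, 1) != 1) {
        return -1;
    }
    return (ca == 'a' && cb == 'b') ? 0 : -1;
}
```

Sending the descriptors in one message keeps their order well defined, so the receiver could treat the first as the tap device and the rest as vhost-net queues; that convention is itself an assumption here.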

Creating the bridges and managing the bridges should be done statically by an 
administrator and would be out of scope.

Regards,

Anthony Liguori

>
> Daniel
Corey Bryant Oct. 6, 2011, 6:10 p.m. UTC | #4
On 10/06/2011 01:44 PM, Anthony Liguori wrote:
> On 10/06/2011 10:38 AM, Richa Marwaha wrote:
>> This patch adds a helper that can be used to create a tap device
>> attached to
>> a bridge device. Since this helper is minimal in what it does, it can be
>> given CAP_NET_ADMIN which allows qemu to avoid running as root while
>> still
>> satisfying the majority of what users tend to want to do with tap
>> devices.
>>
>> The way this all works is that qemu launches this helper passing a bridge
>> name and the name of an inherited file descriptor. The descriptor is one
>> end of a socketpair() of domain sockets. This domain socket is used to
>> transmit a file descriptor of the opened tap device from the helper to
>> qemu.
>>
>> The helper can then exit and let qemu use the tap device.
>>
>> Signed-off-by: Richa Marwaha<rmarwah@linux.vnet.ibm.com>
>> ---
>> Makefile | 12 +++-
>> configure | 1 +
>> qemu-bridge-helper.c | 205
>> ++++++++++++++++++++++++++++++++++++++++++++++++++
>> 3 files changed, 216 insertions(+), 2 deletions(-)
>> create mode 100644 qemu-bridge-helper.c
>>
>> diff --git a/Makefile b/Makefile
>> index 6ed3194..f2caedc 100644
>> --- a/Makefile
>> +++ b/Makefile
>> @@ -34,6 +34,8 @@ $(call set-vpath, $(SRC_PATH):$(SRC_PATH)/hw)
>>
>> LIBS+=-lz $(LIBS_TOOLS)
>>
>> +HELPERS-$(CONFIG_LINUX) = qemu-bridge-helper$(EXESUF)
>> +
>> ifdef BUILD_DOCS
>> DOCS=qemu-doc.html qemu-tech.html qemu.1 qemu-img.1 qemu-nbd.8
>> QMP/qmp-commands.txt
>> else
>> @@ -74,7 +76,7 @@ defconfig:
>>
>> -include config-all-devices.mak
>>
>> -build-all: $(DOCS) $(TOOLS) recurse-all
>> +build-all: $(DOCS) $(TOOLS) $(HELPERS-y) recurse-all
>>
>> config-host.h: config-host.h-timestamp
>> config-host.h-timestamp: config-host.mak
>> @@ -151,6 +153,8 @@ qemu-nbd$(EXESUF): qemu-nbd.o qemu-tool.o
>> qemu-error.o $(oslib-obj-y) $(trace-ob
>>
>> qemu-io$(EXESUF): qemu-io.o cmd.o qemu-tool.o qemu-error.o
>> $(oslib-obj-y) $(trace-obj-y) $(block-obj-y) $(qobject-obj-y)
>> $(version-obj-y) qemu-timer-common.o
>>
>> +qemu-bridge-helper$(EXESUF): qemu-bridge-helper.o
>> +
>> qemu-img-cmds.h: $(SRC_PATH)/qemu-img-cmds.hx
>> $(call quiet-command,sh $(SRC_PATH)/scripts/hxtool -h< $< > $@," GEN $@")
>>
>> @@ -208,7 +212,7 @@ clean:
>> # avoid old build problems by removing potentially incorrect old files
>> rm -f config.mak op-i386.h opc-i386.h gen-op-i386.h op-arm.h opc-arm.h
>> gen-op-arm.h
>> rm -f qemu-options.def
>> - rm -f *.o *.d *.a *.lo $(TOOLS) qemu-ga TAGS cscope.* *.pod *~ */*~
>> + rm -f *.o *.d *.a *.lo $(TOOLS) $(HELPERS-y) qemu-ga TAGS cscope.*
>> *.pod *~ */*~
>> rm -Rf .libs
>> rm -f slirp/*.o slirp/*.d audio/*.o audio/*.d block/*.o block/*.d
>> net/*.o net/*.d fsdev/*.o fsdev/*.d ui/*.o ui/*.d qapi/*.o qapi/*.d
>> qga/*.o qga/*.d
>> rm -f qemu-img-cmds.h
>> @@ -275,6 +279,10 @@ install: all $(if $(BUILD_DOCS),install-doc)
>> install-sysconfig
>> ifneq ($(TOOLS),)
>> $(INSTALL_PROG) $(STRIP_OPT) $(TOOLS) "$(DESTDIR)$(bindir)"
>> endif
>> +ifneq ($(HELPERS-y),)
>> + $(INSTALL_DIR) "$(DESTDIR)$(libexecdir)"
>> + $(INSTALL_PROG) $(STRIP_OPT) $(HELPERS-y) "$(DESTDIR)$(libexecdir)"
>> +endif
>> ifneq ($(BLOBS),)
>> $(INSTALL_DIR) "$(DESTDIR)$(datadir)"
>> set -e; for x in $(BLOBS); do \
>> diff --git a/configure b/configure
>> index 59b1494..3e32834 100755
>> --- a/configure
>> +++ b/configure
>> @@ -2742,6 +2742,7 @@ echo "mandir=$mandir">> $config_host_mak
>> echo "datadir=$datadir">> $config_host_mak
>> echo "sysconfdir=$sysconfdir">> $config_host_mak
>> echo "docdir=$docdir">> $config_host_mak
>> +echo "libexecdir=\${prefix}/libexec">> $config_host_mak
>> echo "confdir=$confdir">> $config_host_mak
>>
>> case "$cpu" in
>> diff --git a/qemu-bridge-helper.c b/qemu-bridge-helper.c
>> new file mode 100644
>> index 0000000..4ac7b36
>> --- /dev/null
>> +++ b/qemu-bridge-helper.c
>> @@ -0,0 +1,205 @@
>> +/*
>> + * QEMU Bridge Helper
>> + *
>> + * Copyright IBM, Corp. 2011
>> + *
>> + * Authors:
>> + * Anthony Liguori<address@hidden>
>
> Heh, fairly sure that's not my email address ;-)
>

I thought that was a secret identity. :) We'll update that.

>> + *
>> + * This work is licensed under the terms of the GNU GPL, version 2. See
>> + * the COPYING file in the top-level directory.
>> + *
>> + */
>> +
>> +#include "config-host.h"
>> +
>> +#include<stdio.h>
>> +#include<errno.h>
>> +#include<fcntl.h>
>> +#include<unistd.h>
>> +#include<string.h>
>> +#include<stdlib.h>
>> +#include<ctype.h>
>> +
>> +#include<sys/types.h>
>> +#include<sys/ioctl.h>
>> +#include<sys/socket.h>
>> +#include<sys/un.h>
>> +#include<sys/prctl.h>
>> +
>> +#include<net/if.h>
>> +
>> +#include<linux/sockios.h>
>> +
>> +#include "net/tap-linux.h"
>> +
>> +static int has_vnet_hdr(int fd)
>> +{
>> + unsigned int features = 0;
>> + struct ifreq ifreq;
>> +
>> + if (ioctl(fd, TUNGETFEATURES,&features) == -1) {
>> + return -errno;
>> + }
>> +
>> + if (!(features& IFF_VNET_HDR)) {
>> + return -ENOTSUP;
>> + }
>> +
>> + if (ioctl(fd, TUNGETIFF,&ifreq) != -1 || errno != EBADFD) {
>> + return -ENOTSUP;
>> + }
>> +
>> + return 1;
>> +}
>> +
>> +static void prep_ifreq(struct ifreq *ifr, const char *ifname)
>> +{
>> + memset(ifr, 0, sizeof(*ifr));
>> + snprintf(ifr->ifr_name, IFNAMSIZ, "%s", ifname);
>> +}
>> +
>> +static int send_fd(int c, int fd)
>> +{
>> + char msgbuf[CMSG_SPACE(sizeof(fd))];
>> + struct msghdr msg = {
>> + .msg_control = msgbuf,
>> + .msg_controllen = sizeof(msgbuf),
>> + };
>> + struct cmsghdr *cmsg;
>> + struct iovec iov;
>> + char req[1] = { 0x00 };
>> +
>> + cmsg = CMSG_FIRSTHDR(&msg);
>> + cmsg->cmsg_level = SOL_SOCKET;
>> + cmsg->cmsg_type = SCM_RIGHTS;
>> + cmsg->cmsg_len = CMSG_LEN(sizeof(fd));
>> + msg.msg_controllen = cmsg->cmsg_len;
>> +
>> + iov.iov_base = req;
>> + iov.iov_len = sizeof(req);
>> +
>> + msg.msg_iov =&iov;
>> + msg.msg_iovlen = 1;
>> + memcpy(CMSG_DATA(cmsg),&fd, sizeof(fd));
>> +
>> + return sendmsg(c,&msg, 0);
>> +}
>> +
>> +int main(int argc, char **argv)
>> +{
>> + struct ifreq ifr;
>> + int fd, ctlfd, unixfd;
>> + int use_vnet = 0;
>> + int mtu;
>> + const char *bridge;
>> + char iface[IFNAMSIZ];
>> + int index;
>> +
>> + /* parse arguments */
>> + if (argc< 3 || argc> 4) {
>> + fprintf(stderr, "Usage: %s [--use-vnet] BRIDGE FD\n", argv[0]);
>> + return 1;
>> + }
>> +
>> + index = 1;
>> + if (strcmp(argv[index], "--use-vnet") == 0) {
>> + use_vnet = 1;
>> + index++;
>> + if (argc == 3) {
>> + fprintf(stderr, "invalid number of arguments\n");
>> + return -1;
>> + }
>> + }
>> +
>> + bridge = argv[index++];
>> + unixfd = atoi(argv[index++]);
>> +
>> + /* open a socket to use to control the network interfaces */
>> + ctlfd = socket(AF_INET, SOCK_STREAM, 0);
>> + if (ctlfd == -1) {
>> + fprintf(stderr, "failed to open control socket\n");
>> + return -errno;
>> + }
>> +
>> + /* open the tap device */
>> + fd = open("/dev/net/tun", O_RDWR);
>> + if (fd == -1) {
>> + fprintf(stderr, "failed to open /dev/net/tun\n");
>> + return -errno;
>> + }
>> +
>> + /* request a tap device, disable PI, and add vnet header support if
>> + * requested and it's available. */
>> + prep_ifreq(&ifr, "tap%d");
>> + ifr.ifr_flags = IFF_TAP|IFF_NO_PI;
>> + if (use_vnet&& has_vnet_hdr(fd)) {
>> + ifr.ifr_flags |= IFF_VNET_HDR;
>> + }
>> +
>> + if (ioctl(fd, TUNSETIFF,&ifr) == -1) {
>> + fprintf(stderr, "failed to create tun device\n");
>> + return -errno;
>> + }
>> +
>> + /* save tap device name */
>> + snprintf(iface, sizeof(iface), "%s", ifr.ifr_name);
>> +
>> + /* get the mtu of the bridge */
>> + prep_ifreq(&ifr, bridge);
>> + if (ioctl(ctlfd, SIOCGIFMTU,&ifr) == -1) {
>> + fprintf(stderr, "failed to get mtu of bridge `%s'\n", bridge);
>> + return -errno;
>> + }
>> +
>> + /* save mtu */
>> + mtu = ifr.ifr_mtu;
>> +
>> + /* set the mtu of the interface based on the bridge */
>> + prep_ifreq(&ifr, iface);
>> + ifr.ifr_mtu = mtu;
>> + if (ioctl(ctlfd, SIOCSIFMTU,&ifr) == -1) {
>> + fprintf(stderr, "failed to set mtu of device `%s' to %d\n",
>> + iface, mtu);
>> + return -errno;
>> + }
>> +
>> + /* add the interface to the bridge */
>> + prep_ifreq(&ifr, bridge);
>> + ifr.ifr_ifindex = if_nametoindex(iface);
>> +
>> + if (ioctl(ctlfd, SIOCBRADDIF,&ifr) == -1) {
>> + fprintf(stderr, "failed to add interface `%s' to bridge `%s'\n",
>> + iface, bridge);
>> + return -errno;
>> + }
>> +
>> + /* bring the interface up */
>> + prep_ifreq(&ifr, iface);
>> + if (ioctl(ctlfd, SIOCGIFFLAGS,&ifr) == -1) {
>> + fprintf(stderr, "failed to get interface flags for `%s'\n", iface);
>> + return -errno;
>> + }
>> +
>> + ifr.ifr_flags |= IFF_UP;
>> + if (ioctl(ctlfd, SIOCSIFFLAGS,&ifr) == -1) {
>> + fprintf(stderr, "failed to set bring up interface `%s'\n", iface);
>> + return -errno;
>> + }
>> +
>> + /* write fd to the domain socket */
>> + if (send_fd(unixfd, fd) == -1) {
>> + fprintf(stderr, "failed to write fd to unix socket\n");
>> + return -errno;
>> + }
>> +
>> + /* ... */
>> +
>> + /* profit! */
>
> Sold!
>
> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
>
> Please put my SoB before yours in the next submission.
>
> Regards,
>
> Anthony Liguori
>

Will do.

>> +
>> + close(fd);
>> +
>> + close(ctlfd);
>> +
>> + return 0;
>> +}
>
>
Corey Bryant Oct. 6, 2011, 6:38 p.m. UTC | #5
On 10/06/2011 02:04 PM, Anthony Liguori wrote:
> On 10/06/2011 11:41 AM, Daniel P. Berrange wrote:
>> On Thu, Oct 06, 2011 at 11:38:25AM -0400, Richa Marwaha wrote:
>>> This patch adds a helper that can be used to create a tap device
>>> attached to
>>> a bridge device. Since this helper is minimal in what it does, it can be
>>> given CAP_NET_ADMIN which allows qemu to avoid running as root while
>>> still
>>> satisfying the majority of what users tend to want to do with tap
>>> devices.
>>>
>>> The way this all works is that qemu launches this helper passing a
>>> bridge
>>> name and the name of an inherited file descriptor. The descriptor is one
>>> end of a socketpair() of domain sockets. This domain socket is used to
>>> transmit a file descriptor of the opened tap device from the helper
>>> to qemu.
>>>
>>> The helper can then exit and let qemu use the tap device.
>>
>> When QEMU is run by libvirt, we generally like to use capng to
>> remove the ability for QEMU to run setuid programs at all. So
>> obviously it will struggle to run the qemu-bridge-helper binary
>> in such a scenario.
>>
>> With the way you transmit the TAP device FD back to the caller,
>> it looks like libvirt itself could execute the qemu-bridge-helper
>> receiving the FD, and then pass the FD onto QEMU using the
>> traditional tap,fd=XX syntax.
>
> Exactly. This would allow tap-based networking using libvirt session://
> URIs.
>

I'll take note of this.  It seems like it would be a nice future 
addition to libvirt.

A slight tangent, but a point on DAC isolation.  The helper enables DAC 
isolation for qemu:///session but we still need some work in libvirt to 
provide DAC isolation for qemu:///system.  This could be done by 
allowing management applications to specify custom user/group IDs when 
creating guests rather than hard coding the IDs in the configuration file.

>>
>> The TAP device FD is only one FD we normally pass to QEMU. How about
>> support for vhost net ? Is it reasonable to ask the qemu-bridge-helper
>> to send back a vhost net FD also.
>
> Absolutely.
>
>> Or indeed multiple vhost net FDs
>> when we get multiqueue NICs. Should we expect the bridge helper to
>> be strictly limited to just connecting a TAP dev to a bridge, or is
>> the expectation that it will grow more & more functionality over
>> time ?
>
> I would not expect it to do more than create virtual network interfaces,
> and add them to bridges. Multiqueue virtual nics, vhost, etc. would all
> be in scope as they are part of creating a virtual network interface.
>
> Creating the bridges and managing the bridges should be done statically
> by an administrator and would be out of scope.
>
> Regards,
>
> Anthony Liguori
>
>>
>> Daniel
>
Daniel P. Berrangé Oct. 7, 2011, 9:04 a.m. UTC | #6
On Thu, Oct 06, 2011 at 02:38:56PM -0400, Corey Bryant wrote:
> 
> 
> On 10/06/2011 02:04 PM, Anthony Liguori wrote:
> >On 10/06/2011 11:41 AM, Daniel P. Berrange wrote:
> >>On Thu, Oct 06, 2011 at 11:38:25AM -0400, Richa Marwaha wrote:
> >>>This patch adds a helper that can be used to create a tap device
> >>>attached to
> >>>a bridge device. Since this helper is minimal in what it does, it can be
> >>>given CAP_NET_ADMIN which allows qemu to avoid running as root while
> >>>still
> >>>satisfying the majority of what users tend to want to do with tap
> >>>devices.
> >>>
> >>>The way this all works is that qemu launches this helper passing a
> >>>bridge
> >>>name and the name of an inherited file descriptor. The descriptor is one
> >>>end of a socketpair() of domain sockets. This domain socket is used to
> >>>transmit a file descriptor of the opened tap device from the helper
> >>>to qemu.
> >>>
> >>>The helper can then exit and let qemu use the tap device.
> >>
> >>When QEMU is run by libvirt, we generally like to use capng to
> >>remove the ability for QEMU to run setuid programs at all. So
> >>obviously it will struggle to run the qemu-bridge-helper binary
> >>in such a scenario.
> >>
> >>With the way you transmit the TAP device FD back to the caller,
> >>it looks like libvirt itself could execute the qemu-bridge-helper
> >>receiving the FD, and then pass the FD onto QEMU using the
> >>traditional tap,fd=XX syntax.
> >
> >Exactly. This would allow tap-based networking using libvirt session://
> >URIs.
> >
> 
> I'll take note of this.  It seems like it would be a nice future
> addition to libvirt.
> 
> A slight tangent, but a point on DAC isolation.  The helper enables
> DAC isolation for qemu:///session but we still need some work in
> libvirt to provide DAC isolation for qemu:///system.  This could be
> done by allowing management applications to specify custom
> user/group IDs when creating guests rather than hard coding the IDs
> in the configuration file.

Yes, this is an item on our to-do list for libvirt. There are three
work items involved:

 - Extend the XML to allow multiple <seclabel> elements, one per
   security driver in use.
 - Add a new API to allow fetching of live seclabel data per
   security driver
 - Extend the current DAC security driver to automatically allocate
   UIDs from an admin defined range, and/or pull them from the XML
   provided by app.

Technically we could do item 3 without doing items 1/2, but that would
necessitate *not* using the sVirt security driver. I don't think that's
too useful, so items 1/2 let us use both the sVirt & enhanced DAC driver
at the same time.
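
To make the third item concrete, here is a small sketch of allocating UIDs from an administrator-defined range. It is purely illustrative and does not reflect libvirt's code or any existing API: struct uid_pool, the uid_pool_*() functions and the UID_POOL_MAX cap are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>

#define UID_POOL_MAX 1024   /* arbitrary cap on range size for this sketch */

struct uid_pool {
    uid_t min;                  /* first UID of the admin-defined range */
    uid_t max;                  /* last UID of the range, inclusive */
    bool used[UID_POOL_MAX];    /* used[i] tracks UID min + i */
};

/* Set up an empty pool covering [min, max]. Returns 0 or -1. */
static int uid_pool_init(struct uid_pool *p, uid_t min, uid_t max)
{
    assert(p != NULL);
    if (max < min || max - min >= UID_POOL_MAX) {
        return -1;
    }
    p->min = min;
    p->max = max;
    memset(p->used, 0, sizeof(p->used));
    return 0;
}

/* Hand out the lowest free UID. Returns 0, or -1 if the range is full. */
static int uid_pool_alloc(struct uid_pool *p, uid_t *out)
{
    uid_t i;

    for (i = 0; i <= p->max - p->min; i++) {
        if (!p->used[i]) {
            p->used[i] = true;
            *out = p->min + i;
            return 0;
        }
    }
    return -1;
}

/* Return a UID to the pool, e.g. when its guest is destroyed.
 * Returns 0, or -1 if the UID is out of range or was not allocated. */
static int uid_pool_release(struct uid_pool *p, uid_t uid)
{
    if (uid < p->min || uid > p->max || !p->used[uid - p->min]) {
        return -1;
    }
    p->used[uid - p->min] = false;
    return 0;
}
```

A real driver would also have to persist allocations across daemon restarts and relabel image files for the chosen UID, which this sketch leaves out.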

Regards,
Daniel
Corey Bryant Oct. 7, 2011, 2:40 p.m. UTC | #7
On 10/07/2011 05:04 AM, Daniel P. Berrange wrote:
> On Thu, Oct 06, 2011 at 02:38:56PM -0400, Corey Bryant wrote:
>>
>>
>> On 10/06/2011 02:04 PM, Anthony Liguori wrote:
>>> On 10/06/2011 11:41 AM, Daniel P. Berrange wrote:
>>>> On Thu, Oct 06, 2011 at 11:38:25AM -0400, Richa Marwaha wrote:
>>>>> This patch adds a helper that can be used to create a tap device
>>>>> attached to
>>>>> a bridge device. Since this helper is minimal in what it does, it can be
>>>>> given CAP_NET_ADMIN which allows qemu to avoid running as root while
>>>>> still
>>>>> satisfying the majority of what users tend to want to do with tap
>>>>> devices.
>>>>>
>>>>> The way this all works is that qemu launches this helper passing a
>>>>> bridge
>>>>> name and the name of an inherited file descriptor. The descriptor is one
>>>>> end of a socketpair() of domain sockets. This domain socket is used to
>>>>> transmit a file descriptor of the opened tap device from the helper
>>>>> to qemu.
>>>>>
>>>>> The helper can then exit and let qemu use the tap device.
>>>>
>>>> When QEMU is run by libvirt, we generally like to use capng to
>>>> remove the ability for QEMU to run setuid programs at all. So
>>>> obviously it will struggle to run the qemu-bridge-helper binary
>>>> in such a scenario.
>>>>
>>>> With the way you transmit the TAP device FD back to the caller,
>>>> it looks like libvirt itself could execute the qemu-bridge-helper
>>>> receiving the FD, and then pass the FD onto QEMU using the
>>>> traditional tap,fd=XX syntax.
>>>
>>> Exactly. This would allow tap-based networking using libvirt session://
>>> URIs.
>>>
>>
>> I'll take note of this.  It seems like it would be a nice future
>> addition to libvirt.
>>
>> A slight tangent, but a point on DAC isolation.  The helper enables
>> DAC isolation for qemu:///session but we still need some work in
>> libvirt to provide DAC isolation for qemu:///system.  This could be
>> done by allowing management applications to specify custom
>> user/group IDs when creating guests rather than hard coding the IDs
>> in the configuration file.
>
> Yes, this is an item on our to-do list for libvirt. There are three
> work items involved:
>
>   - Extend the XML to allow multiple <seclabel> elements, one per
>     security driver in use.
>   - Add a new API to allow fetching of live seclabel data per
>     security driver
>   - Extend the current DAC security driver to automatically allocate
>     UIDs from an admin defined range, and/or pull them from the XML
>     provided by app.
>
> Technically we could do item 3 without doing items 1/2, but that would
> necessitate *not* using the sVirt security driver. I don't think that's
> too useful, so items 1/2 let us use both the sVirt & enhanced DAC driver
> at the same time.
>

I think I'm missing something here and could use some more details to 
understand 1 & 2.  Here's what I'm currently picturing.

With DAC isolation:
     QEMU A runs under userA:groupA and QEMU B runs under userB:groupB

versus currently:
     QEMU A runs under qemu:qemu and QEMU B runs under qemu:qemu

In either case, guests A and B have separate domain XML and a single 
unique seclabel, such as this dynamic SELinux label:

<seclabel type='dynamic' model='selinux'>
   <label>system_u:system_r:svirt_t:s0:c633,c712</label>
   <imagelabel>system_u:object_r:svirt_image_t:s0:c633,c712</imagelabel>
</seclabel>


> Regards,
> Daniel
Daniel P. Berrangé Oct. 7, 2011, 2:45 p.m. UTC | #8
On Fri, Oct 07, 2011 at 10:40:56AM -0400, Corey Bryant wrote:
> 
> 
> On 10/07/2011 05:04 AM, Daniel P. Berrange wrote:
> >On Thu, Oct 06, 2011 at 02:38:56PM -0400, Corey Bryant wrote:
> >>
> >>
> >>On 10/06/2011 02:04 PM, Anthony Liguori wrote:
> >>>On 10/06/2011 11:41 AM, Daniel P. Berrange wrote:
> >>>>On Thu, Oct 06, 2011 at 11:38:25AM -0400, Richa Marwaha wrote:
> >>>>>This patch adds a helper that can be used to create a tap device
> >>>>>attached to
> >>>>>a bridge device. Since this helper is minimal in what it does, it can be
> >>>>>given CAP_NET_ADMIN which allows qemu to avoid running as root while
> >>>>>still
> >>>>>satisfying the majority of what users tend to want to do with tap
> >>>>>devices.
> >>>>>
> >>>>>The way this all works is that qemu launches this helper passing a
> >>>>>bridge
> >>>>>name and the name of an inherited file descriptor. The descriptor is one
> >>>>>end of a socketpair() of domain sockets. This domain socket is used to
> >>>>>transmit a file descriptor of the opened tap device from the helper
> >>>>>to qemu.
> >>>>>
> >>>>>The helper can then exit and let qemu use the tap device.
> >>>>
> >>>>When QEMU is run by libvirt, we generally like to use capng to
> >>>>remove the ability for QEMU to run setuid programs at all. So
> >>>>obviously it will struggle to run the qemu-bridge-helper binary
> >>>>in such a scenario.
> >>>>
> >>>>With the way you transmit the TAP device FD back to the caller,
> >>>>it looks like libvirt itself could execute the qemu-bridge-helper
> >>>>receiving the FD, and then pass the FD onto QEMU using the
> >>>>traditional tap,fd=XX syntax.
> >>>
> >>>Exactly. This would allow tap-based networking using libvirt session://
> >>>URIs.
> >>>
> >>
> >>I'll take note of this.  It seems like it would be a nice future
> >>addition to libvirt.
> >>
> >>A slight tangent, but a point on DAC isolation.  The helper enables
> >>DAC isolation for qemu:///session but we still need some work in
> >>libvirt to provide DAC isolation for qemu:///system.  This could be
> >>done by allowing management applications to specify custom
> >>user/group IDs when creating guests rather than hard coding the IDs
> >>in the configuration file.
> >
> >Yes, this is an item on our todo list for libvirt. There are a couple of
> >work items involved:
> >
> >  - Extend the XML to allow multiple <seclabel> elements, one per
> >    security driver in use.
> >  - Add a new API to allow fetching of live seclabel data per
> >    security driver
> >  - Extend the current DAC security driver to automatically allocate
> >    UIDs from an admin defined range, and/or pull them from the XML
> >    provided by app.
> >
> >Technically we could do item 3 without doing items 1/2, but that would
> >necessitate *not* using the sVirt security driver. I don't think that's
> >too useful, so items 1/2 let us use both the sVirt & enhanced DAC driver
> >at the same time.
> >
> 
> I think I'm missing something here and could use some more details
> to understand 1 & 2.  Here's what I'm currently picturing.
> 
> With DAC isolation:
>     QEMU A runs under userA:groupA and QEMU B runs under userB:groupB
> 
> versus currently:
>     QEMU A runs under qemu:qemu and QEMU B runs under qemu:qemu
> 
> In either case, guests A and B have separate domain XML and a single
> unique seclabel, such as this dynamic SELinux label:
> 
> <seclabel type='dynamic' model='selinux'>
>   <label>system_u:system_r:svirt_t:s0:c633,c712</label>
>   <imagelabel>system_u:object_r:svirt_image_t:s0:c633,c712</imagelabel>
> </seclabel>

If we're going to make the DAC user ID/group ID configurable, then we
need to expose this to applications in the XML so that

 a. apps can allocate unique user/group IDs *cluster wide* when shared
    filesystems are in use. libvirt can only ensure per-host uniqueness.

 b. apps can know what user/group ID has been allocated to each guest
    and this can be reported in virsh dominfo, as with svirt info.

i.e., we'll need something like this:

  <seclabel type='dynamic' model='selinux'>
    <label>system_u:system_r:svirt_t:s0:c633,c712</label>
    <imagelabel>system_u:object_r:svirt_image_t:s0:c633,c712</imagelabel>
  </seclabel>
  <seclabel type='dynamic' model='dac'>
    <label>102:102</label>
    <imagelabel>102:102</imagelabel>
  </seclabel>


And:

# virsh dominfo f16x86_64
Id:             29
Name:           f16x86_64
UUID:           1e9f3097-0a45-ea06-d0d8-40507999a1cd
OS Type:        hvm
State:          running
CPU(s):         1
CPU time:       19.5s
Max memory:     819200 kB
Used memory:    819200 kB
Persistent:     yes
Autostart:      disable
Security model: selinux
Security DOI:   0
Security label: system_u:system_r:svirt_t:s0:c244,c424 (permissive)
Security model: dac
Security DOI:   0
Security label: 102:102 (enforcing)

Regards,
Daniel
Corey Bryant Oct. 7, 2011, 2:51 p.m. UTC | #9
On 10/07/2011 10:45 AM, Daniel P. Berrange wrote:
> On Fri, Oct 07, 2011 at 10:40:56AM -0400, Corey Bryant wrote:
>>
>>
>> On 10/07/2011 05:04 AM, Daniel P. Berrange wrote:
>>> On Thu, Oct 06, 2011 at 02:38:56PM -0400, Corey Bryant wrote:
>>>>
>>>>
>>>> On 10/06/2011 02:04 PM, Anthony Liguori wrote:
>>>>> On 10/06/2011 11:41 AM, Daniel P. Berrange wrote:
>>>>>> On Thu, Oct 06, 2011 at 11:38:25AM -0400, Richa Marwaha wrote:
>>>>>>> This patch adds a helper that can be used to create a tap device
>>>>>>> attached to
>>>>>>> a bridge device. Since this helper is minimal in what it does, it can be
>>>>>>> given CAP_NET_ADMIN which allows qemu to avoid running as root while
>>>>>>> still
>>>>>>> satisfying the majority of what users tend to want to do with tap
>>>>>>> devices.
>>>>>>>
>>>>>>> The way this all works is that qemu launches this helper passing a
>>>>>>> bridge
>>>>>>> name and the name of an inherited file descriptor. The descriptor is one
>>>>>>> end of a socketpair() of domain sockets. This domain socket is used to
>>>>>>> transmit a file descriptor of the opened tap device from the helper
>>>>>>> to qemu.
>>>>>>>
>>>>>>> The helper can then exit and let qemu use the tap device.
>>>>>>
>>>>>> When QEMU is run by libvirt, we generally like to use capng to
>>>>>> remove the ability for QEMU to run setuid programs at all. So
>>>>>> obviously it will struggle to run the qemu-bridge-helper binary
>>>>>> in such a scenario.
>>>>>>
>>>>>> With the way you transmit the TAP device FD back to the caller,
>>>>>> it looks like libvirt itself could execute the qemu-bridge-helper
>>>>>> receiving the FD, and then pass the FD onto QEMU using the
>>>>>> traditional tap,fd=XX syntax.
>>>>>
>>>>> Exactly. This would allow tap-based networking using libvirt session://
>>>>> URIs.
>>>>>
>>>>
>>>> I'll take note of this.  It seems like it would be a nice future
>>>> addition to libvirt.
>>>>
>>>> A slight tangent, but a point on DAC isolation.  The helper enables
>>>> DAC isolation for qemu:///session but we still need some work in
>>>> libvirt to provide DAC isolation for qemu:///system.  This could be
>>>> done by allowing management applications to specify custom
>>>> user/group IDs when creating guests rather than hard coding the IDs
>>>> in the configuration file.
>>>
>>> Yes, this is an item on our todo list for libvirt. There are a couple of
>>> work items involved:
>>>
>>>   - Extend the XML to allow multiple <seclabel> elements, one per
>>>     security driver in use.
>>>   - Add a new API to allow fetching of live seclabel data per
>>>     security driver
>>>   - Extend the current DAC security driver to automatically allocate
>>>     UIDs from an admin defined range, and/or pull them from the XML
>>>     provided by app.
>>>
>>> Technically we could do item 3 without doing items 1/2, but that would
>>> necessitate *not* using the sVirt security driver. I don't think that's
>>> too useful, so items 1/2 let us use both the sVirt & enhanced DAC driver
>>> at the same time.
>>>
>>
>> I think I'm missing something here and could use some more details
>> to understand 1 & 2.  Here's what I'm currently picturing.
>>
>> With DAC isolation:
>>      QEMU A runs under userA:groupA and QEMU B runs under userB:groupB
>>
>> versus currently:
>>      QEMU A runs under qemu:qemu and QEMU B runs under qemu:qemu
>>
>> In either case, guests A and B have separate domain XML and a single
>> unique seclabel, such as this dynamic SELinux label:
>>
>> <seclabel type='dynamic' model='selinux'>
>>    <label>system_u:system_r:svirt_t:s0:c633,c712</label>
>>    <imagelabel>system_u:object_r:svirt_image_t:s0:c633,c712</imagelabel>
>> </seclabel>
>
> If we're going to make the DAC user ID/group ID configurable, then we
> need to expose this to applications in the XML so that
>
>   a. apps can allocate unique user/group IDs *cluster wide* when shared
>      filesystems are in use. libvirt can only ensure per-host uniqueness.
>
>   b. apps can know what user/group ID has been allocated to each guest
>      and this can be reported in virsh dominfo, as with svirt info.
>
> i.e., we'll need something like this:
>
>    <seclabel type='dynamic' model='selinux'>
>      <label>system_u:system_r:svirt_t:s0:c633,c712</label>
>      <imagelabel>system_u:object_r:svirt_image_t:s0:c633,c712</imagelabel>
>    </seclabel>
>    <seclabel type='dynamic' model='dac'>
>      <label>102:102</label>
>      <imagelabel>102:102</imagelabel>
>    </seclabel>
>
>
> And:
>
> # virsh dominfo f16x86_64
> Id:             29
> Name:           f16x86_64
> UUID:           1e9f3097-0a45-ea06-d0d8-40507999a1cd
> OS Type:        hvm
> State:          running
> CPU(s):         1
> CPU time:       19.5s
> Max memory:     819200 kB
> Used memory:    819200 kB
> Persistent:     yes
> Autostart:      disable
> Security model: selinux
> Security DOI:   0
> Security label: system_u:system_r:svirt_t:s0:c244,c424 (permissive)
> Security model: dac
> Security DOI:   0
> Security label: 102:102 (enforcing)
>
> Regards,
> Daniel

Ah, yes.  That makes complete sense.  Thanks for the clarification.
Corey Bryant Oct. 7, 2011, 2:52 p.m. UTC | #10
On 10/07/2011 10:45 AM, Daniel P. Berrange wrote:
> On Fri, Oct 07, 2011 at 10:40:56AM -0400, Corey Bryant wrote:
>>
>>
>> On 10/07/2011 05:04 AM, Daniel P. Berrange wrote:
>>> On Thu, Oct 06, 2011 at 02:38:56PM -0400, Corey Bryant wrote:
>>>>
>>>>
>>>> On 10/06/2011 02:04 PM, Anthony Liguori wrote:
>>>>> On 10/06/2011 11:41 AM, Daniel P. Berrange wrote:
>>>>>> On Thu, Oct 06, 2011 at 11:38:25AM -0400, Richa Marwaha wrote:
>>>>>>> This patch adds a helper that can be used to create a tap device
>>>>>>> attached to
>>>>>>> a bridge device. Since this helper is minimal in what it does, it can be
>>>>>>> given CAP_NET_ADMIN which allows qemu to avoid running as root while
>>>>>>> still
>>>>>>> satisfying the majority of what users tend to want to do with tap
>>>>>>> devices.
>>>>>>>
>>>>>>> The way this all works is that qemu launches this helper passing a
>>>>>>> bridge
>>>>>>> name and the name of an inherited file descriptor. The descriptor is one
>>>>>>> end of a socketpair() of domain sockets. This domain socket is used to
>>>>>>> transmit a file descriptor of the opened tap device from the helper
>>>>>>> to qemu.
>>>>>>>
>>>>>>> The helper can then exit and let qemu use the tap device.
>>>>>>
>>>>>> When QEMU is run by libvirt, we generally like to use capng to
>>>>>> remove the ability for QEMU to run setuid programs at all. So
>>>>>> obviously it will struggle to run the qemu-bridge-helper binary
>>>>>> in such a scenario.
>>>>>>
>>>>>> With the way you transmit the TAP device FD back to the caller,
>>>>>> it looks like libvirt itself could execute the qemu-bridge-helper
>>>>>> receiving the FD, and then pass the FD onto QEMU using the
>>>>>> traditional tap,fd=XX syntax.
>>>>>
>>>>> Exactly. This would allow tap-based networking using libvirt session://
>>>>> URIs.
>>>>>
>>>>
>>>> I'll take note of this.  It seems like it would be a nice future
>>>> addition to libvirt.
>>>>
>>>> A slight tangent, but a point on DAC isolation.  The helper enables
>>>> DAC isolation for qemu:///session but we still need some work in
>>>> libvirt to provide DAC isolation for qemu:///system.  This could be
>>>> done by allowing management applications to specify custom
>>>> user/group IDs when creating guests rather than hard coding the IDs
>>>> in the configuration file.
>>>
>>> Yes, this is an item on our todo list for libvirt. There are a couple of
>>> work items involved:
>>>
>>>   - Extend the XML to allow multiple <seclabel> elements, one per
>>>     security driver in use.
>>>   - Add a new API to allow fetching of live seclabel data per
>>>     security driver
>>>   - Extend the current DAC security driver to automatically allocate
>>>     UIDs from an admin defined range, and/or pull them from the XML
>>>     provided by app.
>>>
>>> Technically we could do item 3 without doing items 1/2, but that would
>>> necessitate *not* using the sVirt security driver. I don't think that's
>>> too useful, so items 1/2 let us use both the sVirt & enhanced DAC driver
>>> at the same time.
>>>
>>
>> I think I'm missing something here and could use some more details
>> to understand 1 & 2.  Here's what I'm currently picturing.
>>
>> With DAC isolation:
>>      QEMU A runs under userA:groupA and QEMU B runs under userB:groupB
>>
>> versus currently:
>>      QEMU A runs under qemu:qemu and QEMU B runs under qemu:qemu
>>
>> In either case, guests A and B have separate domain XML and a single
>> unique seclabel, such as this dynamic SELinux label:
>>
>> <seclabel type='dynamic' model='selinux'>
>>    <label>system_u:system_r:svirt_t:s0:c633,c712</label>
>>    <imagelabel>system_u:object_r:svirt_image_t:s0:c633,c712</imagelabel>
>> </seclabel>
>
> If we're going to make the DAC user ID/group ID configurable, then we
> need to expose this to applications in the XML so that
>
>   a. apps can allocate unique user/group IDs *cluster wide* when shared
>      filesystems are in use. libvirt can only ensure per-host uniqueness.
>
>   b. apps can know what user/group ID has been allocated to each guest
>      and this can be reported in virsh dominfo, as with svirt info.
>
> i.e., we'll need something like this:
>
>    <seclabel type='dynamic' model='selinux'>
>      <label>system_u:system_r:svirt_t:s0:c633,c712</label>
>      <imagelabel>system_u:object_r:svirt_image_t:s0:c633,c712</imagelabel>
>    </seclabel>
>    <seclabel type='dynamic' model='dac'>
>      <label>102:102</label>
>      <imagelabel>102:102</imagelabel>
>    </seclabel>
>
>
> And:
>
> # virsh dominfo f16x86_64
> Id:             29
> Name:           f16x86_64
> UUID:           1e9f3097-0a45-ea06-d0d8-40507999a1cd
> OS Type:        hvm
> State:          running
> CPU(s):         1
> CPU time:       19.5s
> Max memory:     819200 kB
> Used memory:    819200 kB
> Persistent:     yes
> Autostart:      disable
> Security model: selinux
> Security DOI:   0
> Security label: system_u:system_r:svirt_t:s0:c244,c424 (permissive)
> Security model: dac
> Security DOI:   0
> Security label: 102:102 (enforcing)
>
> Regards,
> Daniel

Ah, yes.  That makes complete sense.  Thanks for the clarification.
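
The `dac` seclabel proposed above carries the IDs as a `uid:gid` pair (`102:102` in the example). A management application that wants to compare or allocate IDs cluster wide would have to split that string back into numeric IDs. A minimal sketch in C, assuming the label is always two decimal numbers separated by a colon; `parse_dac_label()` is a hypothetical name for illustration, not a libvirt API:

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper (not part of libvirt): split a "UID:GID" label,
 * such as "102:102", into numeric IDs.
 * Returns 0 on success, -1 if the label is malformed or out of range. */
int parse_dac_label(const char *label, uid_t *uid, gid_t *gid)
{
    char *end;
    unsigned long u, g;

    /* require a leading digit so that "", "-1:2" and " 1:2" are rejected */
    if (label == NULL || *label < '0' || *label > '9') {
        return -1;
    }

    errno = 0;
    u = strtoul(label, &end, 10);
    if (errno != 0 || *end != ':') {
        return -1;
    }

    /* the group part must also start with a digit */
    if (end[1] < '0' || end[1] > '9') {
        return -1;
    }

    g = strtoul(end + 1, &end, 10);
    if (errno != 0 || *end != '\0') {
        return -1;
    }

    /* reject values that do not fit in uid_t / gid_t */
    if ((unsigned long)(uid_t)u != u || (unsigned long)(gid_t)g != g) {
        return -1;
    }

    *uid = (uid_t)u;
    *gid = (gid_t)g;
    return 0;
}
```

With the label from the example, `parse_dac_label("102:102", &uid, &gid)` returns 0 and sets both IDs to 102; a label without a colon is rejected.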

Patch

diff --git a/Makefile b/Makefile
index 6ed3194..f2caedc 100644
--- a/Makefile
+++ b/Makefile
@@ -34,6 +34,8 @@  $(call set-vpath, $(SRC_PATH):$(SRC_PATH)/hw)
 
 LIBS+=-lz $(LIBS_TOOLS)
 
+HELPERS-$(CONFIG_LINUX) = qemu-bridge-helper$(EXESUF)
+
 ifdef BUILD_DOCS
 DOCS=qemu-doc.html qemu-tech.html qemu.1 qemu-img.1 qemu-nbd.8 QMP/qmp-commands.txt
 else
@@ -74,7 +76,7 @@  defconfig:
 
 -include config-all-devices.mak
 
-build-all: $(DOCS) $(TOOLS) recurse-all
+build-all: $(DOCS) $(TOOLS) $(HELPERS-y) recurse-all
 
 config-host.h: config-host.h-timestamp
 config-host.h-timestamp: config-host.mak
@@ -151,6 +153,8 @@  qemu-nbd$(EXESUF): qemu-nbd.o qemu-tool.o qemu-error.o $(oslib-obj-y) $(trace-ob
 
 qemu-io$(EXESUF): qemu-io.o cmd.o qemu-tool.o qemu-error.o $(oslib-obj-y) $(trace-obj-y) $(block-obj-y) $(qobject-obj-y) $(version-obj-y) qemu-timer-common.o
 
+qemu-bridge-helper$(EXESUF): qemu-bridge-helper.o
+
 qemu-img-cmds.h: $(SRC_PATH)/qemu-img-cmds.hx
 	$(call quiet-command,sh $(SRC_PATH)/scripts/hxtool -h < $< > $@,"  GEN   $@")
 
@@ -208,7 +212,7 @@  clean:
 # avoid old build problems by removing potentially incorrect old files
 	rm -f config.mak op-i386.h opc-i386.h gen-op-i386.h op-arm.h opc-arm.h gen-op-arm.h
 	rm -f qemu-options.def
-	rm -f *.o *.d *.a *.lo $(TOOLS) qemu-ga TAGS cscope.* *.pod *~ */*~
+	rm -f *.o *.d *.a *.lo $(TOOLS) $(HELPERS-y) qemu-ga TAGS cscope.* *.pod *~ */*~
 	rm -Rf .libs
 	rm -f slirp/*.o slirp/*.d audio/*.o audio/*.d block/*.o block/*.d net/*.o net/*.d fsdev/*.o fsdev/*.d ui/*.o ui/*.d qapi/*.o qapi/*.d qga/*.o qga/*.d
 	rm -f qemu-img-cmds.h
@@ -275,6 +279,10 @@  install: all $(if $(BUILD_DOCS),install-doc) install-sysconfig
 ifneq ($(TOOLS),)
 	$(INSTALL_PROG) $(STRIP_OPT) $(TOOLS) "$(DESTDIR)$(bindir)"
 endif
+ifneq ($(HELPERS-y),)
+	$(INSTALL_DIR) "$(DESTDIR)$(libexecdir)"
+	$(INSTALL_PROG) $(STRIP_OPT) $(HELPERS-y) "$(DESTDIR)$(libexecdir)"
+endif
 ifneq ($(BLOBS),)
 	$(INSTALL_DIR) "$(DESTDIR)$(datadir)"
 	set -e; for x in $(BLOBS); do \
diff --git a/configure b/configure
index 59b1494..3e32834 100755
--- a/configure
+++ b/configure
@@ -2742,6 +2742,7 @@  echo "mandir=$mandir" >> $config_host_mak
 echo "datadir=$datadir" >> $config_host_mak
 echo "sysconfdir=$sysconfdir" >> $config_host_mak
 echo "docdir=$docdir" >> $config_host_mak
+echo "libexecdir=\${prefix}/libexec" >> $config_host_mak
 echo "confdir=$confdir" >> $config_host_mak
 
 case "$cpu" in
diff --git a/qemu-bridge-helper.c b/qemu-bridge-helper.c
new file mode 100644
index 0000000..4ac7b36
--- /dev/null
+++ b/qemu-bridge-helper.c
@@ -0,0 +1,205 @@ 
+/*
+ * QEMU Bridge Helper
+ *
+ * Copyright IBM, Corp. 2011
+ *
+ * Authors:
+ * Anthony Liguori   <address@hidden>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include "config-host.h"
+
+#include <stdio.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <string.h>
+#include <stdlib.h>
+#include <ctype.h>
+
+#include <sys/types.h>
+#include <sys/ioctl.h>
+#include <sys/socket.h>
+#include <sys/un.h>
+#include <sys/prctl.h>
+
+#include <net/if.h>
+
+#include <linux/sockios.h>
+
+#include "net/tap-linux.h"
+
+static int has_vnet_hdr(int fd)
+{
+    unsigned int features = 0;
+    struct ifreq ifreq;
+
+    if (ioctl(fd, TUNGETFEATURES, &features) == -1) {
+        return -errno;
+    }
+
+    if (!(features & IFF_VNET_HDR)) {
+        return -ENOTSUP;
+    }
+
+    if (ioctl(fd, TUNGETIFF, &ifreq) != -1 || errno != EBADFD) {
+        return -ENOTSUP;
+    }
+
+    return 1;
+}
+
+static void prep_ifreq(struct ifreq *ifr, const char *ifname)
+{
+    memset(ifr, 0, sizeof(*ifr));
+    snprintf(ifr->ifr_name, IFNAMSIZ, "%s", ifname);
+}
+
+static int send_fd(int c, int fd)
+{
+    char msgbuf[CMSG_SPACE(sizeof(fd))];
+    struct msghdr msg = {
+        .msg_control = msgbuf,
+        .msg_controllen = sizeof(msgbuf),
+    };
+    struct cmsghdr *cmsg;
+    struct iovec iov;
+    char req[1] = { 0x00 };
+
+    cmsg = CMSG_FIRSTHDR(&msg);
+    cmsg->cmsg_level = SOL_SOCKET;
+    cmsg->cmsg_type = SCM_RIGHTS;
+    cmsg->cmsg_len = CMSG_LEN(sizeof(fd));
+    msg.msg_controllen = cmsg->cmsg_len;
+
+    iov.iov_base = req;
+    iov.iov_len = sizeof(req);
+
+    msg.msg_iov = &iov;
+    msg.msg_iovlen = 1;
+    memcpy(CMSG_DATA(cmsg), &fd, sizeof(fd));
+
+    return sendmsg(c, &msg, 0);
+}
+
+int main(int argc, char **argv)
+{
+    struct ifreq ifr;
+    int fd, ctlfd, unixfd;
+    int use_vnet = 0;
+    int mtu;
+    const char *bridge;
+    char iface[IFNAMSIZ];
+    int index;
+
+    /* parse arguments */
+    if (argc < 3 || argc > 4) {
+        fprintf(stderr, "Usage: %s [--use-vnet] BRIDGE FD\n", argv[0]);
+        return 1;
+    }
+
+    index = 1;
+    if (strcmp(argv[index], "--use-vnet") == 0) {
+        use_vnet = 1;
+        index++;
+        if (argc == 3) {
+            fprintf(stderr, "invalid number of arguments\n");
+            return -1;
+        }
+    }
+
+    bridge = argv[index++];
+    unixfd = atoi(argv[index++]);
+
+    /* open a socket to use to control the network interfaces */
+    ctlfd = socket(AF_INET, SOCK_STREAM, 0);
+    if (ctlfd == -1) {
+        fprintf(stderr, "failed to open control socket\n");
+        return -errno;
+    }
+
+    /* open the tap device */
+    fd = open("/dev/net/tun", O_RDWR);
+    if (fd == -1) {
+        fprintf(stderr, "failed to open /dev/net/tun\n");
+        return -errno;
+    }
+
+    /* request a tap device, disable PI, and add vnet header support if
+     * requested and it's available. */
+    prep_ifreq(&ifr, "tap%d");
+    ifr.ifr_flags = IFF_TAP|IFF_NO_PI;
+    if (use_vnet && has_vnet_hdr(fd) > 0) {
+        ifr.ifr_flags |= IFF_VNET_HDR;
+    }
+
+    if (ioctl(fd, TUNSETIFF, &ifr) == -1) {
+        fprintf(stderr, "failed to create tun device\n");
+        return -errno;
+    }
+
+    /* save tap device name */
+    snprintf(iface, sizeof(iface), "%s", ifr.ifr_name);
+
+    /* get the mtu of the bridge */
+    prep_ifreq(&ifr, bridge);
+    if (ioctl(ctlfd, SIOCGIFMTU, &ifr) == -1) {
+        fprintf(stderr, "failed to get mtu of bridge `%s'\n", bridge);
+        return -errno;
+    }
+
+    /* save mtu */
+    mtu = ifr.ifr_mtu;
+
+    /* set the mtu of the interface based on the bridge */
+    prep_ifreq(&ifr, iface);
+    ifr.ifr_mtu = mtu;
+    if (ioctl(ctlfd, SIOCSIFMTU, &ifr) == -1) {
+        fprintf(stderr, "failed to set mtu of device `%s' to %d\n",
+                iface, mtu);
+        return -errno;
+    }
+
+    /* add the interface to the bridge */
+    prep_ifreq(&ifr, bridge);
+    ifr.ifr_ifindex = if_nametoindex(iface);
+
+    if (ioctl(ctlfd, SIOCBRADDIF, &ifr) == -1) {
+        fprintf(stderr, "failed to add interface `%s' to bridge `%s'\n",
+                iface, bridge);
+        return -errno;
+    }
+
+    /* bring the interface up */
+    prep_ifreq(&ifr, iface);
+    if (ioctl(ctlfd, SIOCGIFFLAGS, &ifr) == -1) {
+        fprintf(stderr, "failed to get interface flags for `%s'\n", iface);
+        return -errno;
+    }
+
+    ifr.ifr_flags |= IFF_UP;
+    if (ioctl(ctlfd, SIOCSIFFLAGS, &ifr) == -1) {
+        fprintf(stderr, "failed to bring up interface `%s'\n", iface);
+        return -errno;
+    }
+
+    /* write fd to the domain socket */
+    if (send_fd(unixfd, fd) == -1) {
+        fprintf(stderr, "failed to write fd to unix socket\n");
+        return -errno;
+    }
+
+    /* ... */
+
+    /* profit! */
+
+    close(fd);
+
+    close(ctlfd);
+
+    return 0;
+}
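
For reference, the caller's side of this handshake (what qemu, or libvirt as suggested in the thread, would do once the helper has called send_fd()) could look roughly like the sketch below. It is an illustration under stated assumptions, not part of the patch: `recv_fd()`, `union fd_cmsg` and `demo_roundtrip()` are hypothetical names, the sender is restated here only so the example is self-contained, and the demo passes a pipe descriptor within one process rather than spawning the helper.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Control-message buffer sized for one descriptor; the union keeps it
 * suitably aligned for struct cmsghdr. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Sender, using the same SCM_RIGHTS layout as send_fd() in the patch. */
static int send_fd(int c, int fd)
{
    union fd_cmsg u;
    char req[1] = { 0x00 };
    struct iovec iov = { .iov_base = req, .iov_len = sizeof(req) };
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = u.buf,
        .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(fd));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(fd));

    return sendmsg(c, &msg, 0) == -1 ? -1 : 0;
}

/* Hypothetical receiver: returns the passed descriptor, or -1 on error. */
static int recv_fd(int c)
{
    union fd_cmsg u;
    char req[1];
    struct iovec iov = { .iov_base = req, .iov_len = sizeof(req) };
    struct msghdr msg = {
        .msg_iov = &iov,
        .msg_iovlen = 1,
        .msg_control = u.buf,
        .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg;
    ssize_t len;
    int fd;

    do {
        len = recvmsg(c, &msg, 0);
    } while (len == -1 && errno == EINTR);
    if (len <= 0) {
        return -1;      /* error, or the peer exited without sending */
    }

    /* accept exactly one SCM_RIGHTS message carrying one descriptor */
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(fd))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
    return fd;
}

/* Round trip inside one process: pass the read end of a pipe across a
 * socketpair, then read through the received descriptor.
 * Returns 0 on success, -1 on failure (descriptors are not cleaned up
 * on the failure paths, to keep the sketch short). */
static int demo_roundtrip(void)
{
    int sv[2], p[2], got;
    char c = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1 || pipe(p) == -1) {
        return -1;
    }
    if (send_fd(sv[0], p[0]) == -1) {
        return -1;
    }
    got = recv_fd(sv[1]);
    if (got < 0) {
        return -1;
    }
    if (write(p[1], "x", 1) != 1 || read(got, &c, 1) != 1) {
        return -1;
    }
    close(got);
    close(p[0]);
    close(p[1]);
    close(sv[0]);
    close(sv[1]);
    return c == 'x' ? 0 : -1;
}
```

Calling `demo_roundtrip()` from a `main()` exercises both directions. In the real flow the caller would create the socketpair, fork and exec the helper with one end's number as the FD argument, call `recv_fd()` on the other end, and hand the result to qemu with the tap,fd=XX syntax mentioned earlier in the thread.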