Patchwork AF_VMCHANNEL address family for guest<->host communication.

Submitter Gleb Natapov
Date Dec. 14, 2008, 11:50 a.m.
Message ID <20081214115054.4066.14557.stgit@dhcp-1-237.tlv.redhat.com>
Permalink /patch/13937/
State Rejected
Delegated to: David Miller

Comments

Gleb Natapov - Dec. 14, 2008, 11:50 a.m.
There is a need for a communication channel between the host and various
agents running inside a VM guest. The channel will be used for
statistics gathering, logging, cut & paste, notifications of host screen
resolution changes, guest configuration, etc.

It is undesirable to use TCP/IP for this purpose, since network
connectivity may not exist between host and guest, and even where it
exists the traffic may not be routable between host and guest for
security reasons, or the TCP/IP traffic may be firewalled (by mistake)
by an unsuspecting VM user.

This patch implements a new address family, AF_VMCHANNEL, that is used
for communication between guest and host. Channels are created at VM
start time. Each channel has a name. An agent running in a guest can
send/receive data to/from a channel by creating an AF_VMCHANNEL socket
and connecting to a channel using the channel's name as an address.

Only stream sockets are supported by this implementation. Also, only the
connect, sendmsg and recvmsg socket ops are implemented, which is enough
to allow an application running in a guest to connect to a channel
created by the host and read/write from/to the channel. This could be
extended to allow channel creation from inside a guest, by creating a
listening socket and accepting on it, should the need arise; that would
even allow guest<->guest communication in the future (though TCP/IP may
be preferable for that).

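Since the patch was rejected (see State above), AF_VMCHANNEL is not available in real kernels. The connect-by-name pattern the description proposes can still be illustrated with a stand-in: the sketch below uses a Linux abstract-namespace AF_UNIX socket, which is also a stream socket addressed by a name rather than an ip:port. The channel name and the host/guest roles here are invented purely for illustration.

```python
import socket
import threading

# Stand-in for an AF_VMCHANNEL channel: an abstract-namespace AF_UNIX
# socket (Linux-only; the leading NUL byte selects the abstract
# namespace).  The name is made up for this example.
CHANNEL_NAME = "\0org.example.cut-paste"

def host_side(ready):
    # "Host" end: create the named channel and echo one message back.
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(CHANNEL_NAME)
    srv.listen(1)
    ready.set()
    conn, _ = srv.accept()
    conn.sendall(conn.recv(64))
    conn.close()
    srv.close()

ready = threading.Event()
t = threading.Thread(target=host_side, args=(ready,))
t.start()
ready.wait()

# "Guest agent" end: connect to the channel by *name*, no ip:port,
# then exchange data over the resulting reliable stream.
guest = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
guest.connect(CHANNEL_NAME)
guest.sendall(b"hello, host")
reply = guest.recv(64)
guest.close()
t.join()
```

The point of the stand-in is only the addressing model: the agent names the service it wants ("cut-paste") instead of guessing a network endpoint.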
Signed-off-by: Gleb Natapov <gleb@redhat.com>
---

 include/linux/socket.h       |    4 
 include/linux/vmchannel.h    |   54 +++
 net/Kconfig                  |    1 
 net/Makefile                 |    1 
 net/vmchannel/Kconfig        |   11 +
 net/vmchannel/Makefile       |    5 
 net/vmchannel/af_vmchannel.c |  769 ++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 844 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/vmchannel.h
 create mode 100644 net/vmchannel/Kconfig
 create mode 100644 net/vmchannel/Makefile
 create mode 100644 net/vmchannel/af_vmchannel.c


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov - Dec. 14, 2008, 12:23 p.m.
Hi Gleb.

On Sun, Dec 14, 2008 at 01:50:55PM +0200, Gleb Natapov (gleb@redhat.com) wrote:
> There is a need for communication channel between host and various
> agents that are running inside a VM guest. The channel will be used
> for statistic gathering, logging, cut & paste, host screen resolution
> changes notifications, guest configuration etc.
> 
> It is undesirable to use TCP/IP for this purpose since network
> connectivity may not exist between host and guest and if it exists the
> traffic can be not routable between host and guest for security reasons
> or TCP/IP traffic can be firewalled (by mistake) by unsuspecting VM user.
> 
> This patch implement new address family AF_VMCHANNEL that is used
> for communication between guest and host. Channels are created at VM
> start time. Each channel has a name. Agent, that runs on a guest, can
> send/receive data to/from a channel by creating AF_VMCHANNEL socket and
> connecting to a channel using channels name as an address.
> 
> Only stream sockets are supported by this implementation. Also only
> connect, sendmsg and recvmsg socket ops are implemented which is enough
> to allow application running in a guest to connect to a channel created
> by a host and read/write from/to the channel. This can be extended to
> allow channel creation from inside a guest by creating listen socket and
> accepting on it if the need will arise and thus even allow guest<->guest
> communication in the future (but TCP/IP may be preferable for this).

A couple of comments on this.
First, there is only a single virtio device, initialized at probe time;
how will this work on a host system with multiple guests? Is it
possible to have multiple virtual devices?
Second, each virtual device has an array of names, and each socket can
be bound to one of them, but it is not allowed to have multiple sockets
bound to the same name, so it looks like there is no possibility of
several sockets communicating via a single channel; was this intentional?
And third, the tasklet callbacks do not use bh socket locking, and while
that is not necessarily bad, the rt folks want (dream) to replace
tasklets with process context, so this at least deserves a note in the
comments.

Apart from the questions above, this patch looks good.
Gleb Natapov - Dec. 14, 2008, 12:46 p.m.
Hi Evgeniy,

On Sun, Dec 14, 2008 at 03:23:20PM +0300, Evgeniy Polyakov wrote:
> On Sun, Dec 14, 2008 at 01:50:55PM +0200, Gleb Natapov (gleb@redhat.com) wrote:
> > There is a need for communication channel between host and various
> > agents that are running inside a VM guest. The channel will be used
> > for statistic gathering, logging, cut & paste, host screen resolution
> > changes notifications, guest configuration etc.
> > 
> > It is undesirable to use TCP/IP for this purpose since network
> > connectivity may not exist between host and guest and if it exists the
> > traffic can be not routable between host and guest for security reasons
> > or TCP/IP traffic can be firewalled (by mistake) by unsuspecting VM user.
> > 
> > This patch implement new address family AF_VMCHANNEL that is used
> > for communication between guest and host. Channels are created at VM
> > start time. Each channel has a name. Agent, that runs on a guest, can
> > send/receive data to/from a channel by creating AF_VMCHANNEL socket and
> > connecting to a channel using channels name as an address.
> > 
> > Only stream sockets are supported by this implementation. Also only
> > connect, sendmsg and recvmsg socket ops are implemented which is enough
> > to allow application running in a guest to connect to a channel created
> > by a host and read/write from/to the channel. This can be extended to
> > allow channel creation from inside a guest by creating listen socket and
> > accepting on it if the need will arise and thus even allow guest<->guest
> > communication in the future (but TCP/IP may be preferable for this).
> 
> Couple of comments on this.
> First, there is only single virtio device initialized at probe time,
> how this will work on the host system with multiple guests? Is it
> possible to have multiple virtual devices?
The module is loaded only inside a guest, not on the host, and it
manages all existing channels. What would be the value of having
multiple vmchannel PCI devices in a single guest?

> Second, each virtual device has an array of names, and each socket can
> be bound to one of them, but it is not allowed to have multiple sockets
> bound to the same name, so it looks like there is no possibility to have
> several sockets communicating via signel channel, was this intentional?
Yes, this is intentional as it matches our usage model. It is possible
to change this in the future if needed. All sockets bound to the same
channel will receive the same data.
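The fan-out semantics described in the paragraph above (every socket bound to a channel sees every message for that channel) can be sketched in a few lines. The class and names below are illustrative, not taken from the patch:

```python
from collections import defaultdict

class ChannelMux:
    """Toy model of per-name fan-out: every receiver bound to a
    channel name observes every payload delivered to that name."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # name -> receive queues

    def bind(self, name):
        # Attach a new receiver to the named channel; return its queue.
        queue = []
        self._subscribers[name].append(queue)
        return queue

    def deliver(self, name, data):
        # Copy incoming channel data to every bound receiver.
        for queue in self._subscribers[name]:
            queue.append(data)

mux = ChannelMux()
a = mux.bind("stats")
b = mux.bind("stats")
mux.deliver("stats", b"cpu=12%")
# both receivers observe the same payload
```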

> And third, tasklet callbacks do not use bh socket locking, and while it
> is not something bad, but rt folks want (dream) to replace it with
> process context, so this at least requires some note in comments.
> 
This is something I need to understand better. I thought the socket
lock guards socket state changes. The patch only accesses socket state
from bh context, in vmchannel_socket_recv(), and even if the state of
the socket changes after the function validates it, nothing bad can
happen. Is this the case? If it is, I will add a comment explaining
this.

> Except that about questions, this patch looks good.
Thanks for the review.

--
			Gleb.
David Miller - Dec. 15, 2008, 6:44 a.m.
From: Gleb Natapov <gleb@redhat.com>
Date: Sun, 14 Dec 2008 13:50:55 +0200

> It is undesirable to use TCP/IP for this purpose since network
> connectivity may not exist between host and guest and if it exists the
> traffic can be not routable between host and guest for security reasons
> or TCP/IP traffic can be firewalled (by mistake) by unsuspecting VM user.

I don't really accept this argument, sorry.

If you can't use TCP because it might be security protected or
misconfigured, adding this new stream protocol thing is not one
bit better.  It doesn't make any sense at all.

Also, if TCP could be "misconfigured" this new thing could just as
easily be screwed up too.  And I wouldn't be surprised to see a whole
bunch of SELINUX and netfilter features proposed later for this and
then we're back to square one.

You guys really need to rethink this.  Either a stream protocol is a
workable solution to your problem, or it isn't.

And don't bring up any "virtualization is special because..."
arguments into your reply because virtualization has nothing to do
with my objections stated above.
Gleb Natapov - Dec. 15, 2008, 7:48 a.m.
On Sun, Dec 14, 2008 at 10:44:36PM -0800, David Miller wrote:
> From: Gleb Natapov <gleb@redhat.com>
> Date: Sun, 14 Dec 2008 13:50:55 +0200
> 
> > It is undesirable to use TCP/IP for this purpose since network
> > connectivity may not exist between host and guest and if it exists the
> > traffic can be not routable between host and guest for security reasons
> > or TCP/IP traffic can be firewalled (by mistake) by unsuspecting VM user.
> 
> I don't really accept this argument, sorry.
> 
> If you can't use TCP because it might be security protected or
> misconfigured, adding this new stream protocol thing is not one
> bit better.  It doesn't make any sense at all.
> 
It can be _accidentally_ misconfigured. Just think of a sysadmin who has
remote access to a VM guest and doesn't even know that it is a VM.
(He could easily find out, but why should he care?) The sysadmin knows
that the first rule of firewalling is to deny everything and then allow
only what should be allowed, so that is what he does, and he cuts
communication between host and guest. The problem with networking is
that it is visible to the VM user and perceived to be under full user
control.

> Also, if TCP could be "misconfigured" this new thing could just as
> easily be screwed up too.  And I wouldn't be surprised to see a whole
> bunch of SELINUX and netfilter features proposed later for this and
> then we're back to square one.
> 
Not only can it be misconfigured, it may not exist between guest and
host at all. IP connectivity between guest and host is not mandatory,
and we don't want to make it such. It is like saying "who needs a
serial console, just use ssh". And what subnet should be used for this
purpose? Who will resolve conflicts? I can see why SELinux features may
be proposed for vmchannel, but netfilter doesn't make sense for it.
And vmchannel also has other advantages over TCP/IP: less overhead and
better "naming". By better naming I mean that the guest should not have
to guess (or require configuration to learn) which ip:port should be
used for cut&paste; it just connects to the address "cut&paste".

> You guys really need to rethink this.  Either a stream protocol is a
> workable solution to your problem, or it isn't.
> 
A stream protocol is a workable solution for us, but we need it out of
band with regard to networking, and as close to zero-config as possible.
If we use networking, how can that be done without additional
configuration? (And reconfiguration can be required after migration,
BTW.)

--
			Gleb.
David Miller - Dec. 15, 2008, 8:27 a.m.
From: Gleb Natapov <gleb@redhat.com>
Date: Mon, 15 Dec 2008 09:48:19 +0200

> On Sun, Dec 14, 2008 at 10:44:36PM -0800, David Miller wrote:
> > You guys really need to rethink this.  Either a stream protocol is a
> > workable solution to your problem, or it isn't.
>
> Stream protocol is workable solution for us, but we need it out of band
> in regard to networking and as much zero config as possible. If we will
> use networking how can it be done without additional configuration (and
> reconfiguration can be required after migration BTW)

You miss the whole point and you also missed the part where I said
(and the one part of my comments you conveniently did NOT quote):

--------------------
And don't bring up any "virtualization is special because..."
arguments into your reply because virtualization has nothing to do
with my objections stated above.
--------------------

What part of that do you not understand?  Don't give me this
junk about zero config, it's not a plausible argument against
anything I said.

You want to impose a new burden onto the kernel in the form of a whole
new socket layer, when existing ones can solve any communications
problem.

Performance is not a good argument because we have (repeatedly) made
TCP/IP go fast in just about any environment.

If you have a configuration problem, you can solve it in userspace in
a number of different ways.  Building on top of things we have and the
user need not know anything about that.

I would even be OK with features such as "permanent" links or special
attributes for devices or IP addresses that by default prevent
tampering and filtering by things like netfilter.

But not this new thing that duplicates existing functionality, no way.
Anthony Liguori - Dec. 15, 2008, 3:02 p.m.
David Miller wrote:
> From: Gleb Natapov <gleb@redhat.com>
> Date: Sun, 14 Dec 2008 13:50:55 +0200
>
>   
>> It is undesirable to use TCP/IP for this purpose since network
>> connectivity may not exist between host and guest and if it exists the
>> traffic can be not routable between host and guest for security reasons
>> or TCP/IP traffic can be firewalled (by mistake) by unsuspecting VM user.
>>     
>
> I don't really accept this argument, sorry.
>   

I couldn't agree more.  That doesn't mean I think this isn't valuable,
though.

Each of these sockets is going to be connected to a backend (to
implement guest<=>host copy/paste, for instance).  We want to implement
those backends in userspace, and preferably in QEMU.

Using some raw protocol over ethernet means you don't have reliability.  
If you use a protocol to get reliability (like TCP), you now have to 
implement a full TCP/IP stack in userspace or get the host kernel 
involved.  I'd rather not get the host kernel involved from a security 
perspective.

An inherently reliable socket transport solves the above problem while 
keeping things simple.  Note, this is not a new concept.  There is 
already an AF_IUCV for s390.  VMware is also developing an AF_VMCI 
socket family.

Regards,

Anthony Liguori
Jeremy Fitzhardinge - Dec. 15, 2008, 5:45 p.m.
Anthony Liguori wrote:
> David Miller wrote:
>   
>> From: Gleb Natapov <gleb@redhat.com>
>> Date: Sun, 14 Dec 2008 13:50:55 +0200
>>
>>   
>>     
>>> It is undesirable to use TCP/IP for this purpose since network
>>> connectivity may not exist between host and guest and if it exists the
>>> traffic can be not routable between host and guest for security reasons
>>> or TCP/IP traffic can be firewalled (by mistake) by unsuspecting VM user.
>>>     
>>>       
>> I don't really accept this argument, sorry.
>>     

Yes.  There's no reason why the management stack couldn't implement its 
own private idiot-proofing network for this kind of thing.

> Each of these sockets are going to be connected to a backend (to 
> implement guest<=>copy/paste for instance).  We want to implement those 
> backends in userspace and preferably in QEMU.
>
> Using some raw protocol over ethernet means you don't have reliability.  
> If you use a protocol to get reliability (like TCP), you now have to 
> implement a full TCP/IP stack in userspace or get the host kernel 
> involved.  I'd rather not get the host kernel involved from a security 
> perspective.
>   

There's nothing wrong with user-mode TCP, or you could run your TCP 
stack in a special-purpose guest if you're really paranoid.

> An inherently reliable socket transport solves the above problem while 
> keeping things simple.  Note, this is not a new concept.  There is 
> already an AF_IUCV for s390.  VMware is also developing an AF_VMCI 
> socket family.
>   

The trouble is that it presumes that the host and guest (or whoever the 
endpoints are) are on the same physical machine and will remain that 
way.  Given that live migration is a feature that people seem to like, 
then you'd end up needing to transport this protocol over a real network 
anyway - and at that point you may as well use proper TCP/IP.   The 
alternative is to say either "if you use this feature you can't migrate, 
and you can only resume on the same host", or "you can use this feature, 
and we'll work out a global namespace and proxy it over TCP for you".  
Neither seems very satisfactory.

    J
Itamar Heim - Dec. 15, 2008, 6:26 p.m.
> -----Original Message-----
> From: kvm-owner@vger.kernel.org [mailto:kvm-owner@vger.kernel.org] On
> Behalf Of Jeremy Fitzhardinge
 
> The trouble is that it presumes that the host and guest (or whoever the
> endpoints are) are on the same physical machine and will remain that
> way.  Given that live migration is a feature that people seem to like,
> then you'd end up needing to transport this protocol over a real network
> anyway - and at that point you may as well use proper TCP/IP.   The
> alternative is to say either "if you use this feature you can't migrate,
> and you can only resume on the same host", or "you can use this feature,
> and we'll work out a global namespace and proxy it over TCP for you".
> Neither seems very satisfactory.
[IH] When migrating a guest to another host, migration takes care of
closing/reopening the VMChannel on the target host. The VMChannel is
local to the hypervisor, not accessible via the network. Migration is
not an issue that requires the VMChannel to use TCP/IP.
Anthony Liguori - Dec. 15, 2008, 6:45 p.m.
Jeremy Fitzhardinge wrote:
>> Each of these sockets are going to be connected to a backend (to 
>> implement guest<=>copy/paste for instance).  We want to implement 
>> those backends in userspace and preferably in QEMU.
>>
>> Using some raw protocol over ethernet means you don't have 
>> reliability.  If you use a protocol to get reliability (like TCP), 
>> you now have to implement a full TCP/IP stack in userspace or get the 
>> host kernel involved.  I'd rather not get the host kernel involved 
>> from a security perspective.
>>   
>
> There's nothing wrong with user-mode TCP, or you could run your TCP 
> stack in a special-purpose guest if you're really paranoid.

That seems unnecessarily complex.

>> An inherently reliable socket transport solves the above problem 
>> while keeping things simple.  Note, this is not a new concept.  There 
>> is already an AF_IUCV for s390.  VMware is also developing an AF_VMCI 
>> socket family.
>>   
>
> The trouble is that it presumes that the host and guest (or whoever 
> the endpoints are) are on the same physical machine and will remain 
> that way.  Given that live migration is a feature that people seem to 
> like, then you'd end up needing to transport this protocol over a real 
> network anyway - and at that point you may as well use proper 
> TCP/IP.   The alternative is to say either "if you use this feature 
> you can't migrate, and you can only resume on the same host", or "you 
> can use this feature, and we'll work out a global namespace and proxy 
> it over TCP for you".  Neither seems very satisfactory.

This is why I've been pushing for the backends to be implemented in 
QEMU.  Then QEMU can marshal the backend-specific state and transfer it 
during live migration.  For something like copy/paste, this is obvious 
(the clipboard state).  A general command interface is probably 
stateless so it's a nop.

I'm not a fan of having external backends to QEMU for the very reasons 
you outline above.  You cannot marshal the state of a channel we know 
nothing about.  We're really just talking about extending virtio in a 
guest down to userspace so that we can implement paravirtual device 
drivers in guest userspace.  This may be an X graphics driver, a mouse 
driver, copy/paste, remote shutdown, etc.

A socket seems like a natural choice.  If that's wrong, then we can 
explore other options (like a char device, virtual fs, etc.).  This 
shouldn't be confused with networking though and all the talk of doing 
silly things like streaming fence traffic through it just encourages the 
confusion.

Regards,

Anthony Liguori

>    J

David Miller - Dec. 15, 2008, 7:43 p.m.
From: Anthony Liguori <anthony@codemonkey.ws>
Date: Mon, 15 Dec 2008 09:02:23 -0600

> There is already an AF_IUCV for s390.

This is a scarecrow and irrelevant to this discussion.

And this is exactly why I asked that any arguments in this thread
avoid talking about virtualization technology and why it's "special."

This proposed patch here is asking to add new infrastructure for
hypervisor facilities that will be _ADDED_ and for which we have
complete control over.

Whereas the S390 folks have to deal with existing infrastructure which
is largely outside of their control.  So if they implement access
mechanisms for that, it's fine.

I would be doing the same thing if I added a protocol socket layer for
accessing the Niagara hypervisor virtualization channels.
Anthony Liguori - Dec. 15, 2008, 8:44 p.m.
David Miller wrote:
> From: Anthony Liguori <anthony@codemonkey.ws>
> Date: Mon, 15 Dec 2008 09:02:23 -0600
>
>   
>> There is already an AF_IUCV for s390.
>>     
>
> This is a scarecrow and irrelevant to this discussion.
>
> And this is exactly why I asked that any arguments in this thread
> avoid talking about virtualization technology and why it's "special."
>   

You cannot completely avoid talking about virtualization here.  I agree
that an argument that runs "we need it for virtualization", "why?",
"virtualization!" is not sufficient.

You still didn't address my earlier question though.

What we need is a mechanism for implementing paravirtual device drivers 
in userspace.  On a modern Linux system, a lot of important things are 
done in userspace (mostly around X) so having some code in the guest's 
userspace is important.

We want this communication mechanism to be simple and reliable as we 
want to implement the backends drivers in the host userspace with 
minimum mess.

Within the guest, we need the interface to be always available and we 
need an addressing scheme that is hypervisor specific.  Yes, we can 
build this all on top of TCP/IP.  We could even build it on top of a 
serial port.  Both have their down-sides wrt reliability and complexity.

The most natural userspace interface that meets all of these 
requirements would appear to be a new socket family.  We could also use 
another userspace interface (netlink was originally proposed, a chardev 
is possible, or a virtual file system).

Do you have another recommendation?

Regards,

Anthony Liguori

> This proposed patch here is asking to add new infrastructure for
> hypervisor facilities that will be _ADDED_ and for which we have
> complete control over.
>
> Whereas the S390 folks have to deal with existing infrastructure which
> is largely outside of their control.  So if they implement access
> mechanisms for that, it's fine.
>
> I would be doing the same thing if I added a protocol socket layer for
> accessing the Niagara hypervisor virtualization channels.
>   

David Miller - Dec. 15, 2008, 10:29 p.m.
From: Anthony Liguori <anthony@codemonkey.ws>
Date: Mon, 15 Dec 2008 14:44:26 -0600

> We want this communication mechanism to be simple and reliable as we
> want to implement the backends drivers in the host userspace with
> minimum mess.

One implication of your statement here is that TCP is unreliable.
That's absolutely not true.

> Within the guest, we need the interface to be always available and
> we need an addressing scheme that is hypervisor specific.  Yes, we
> can build this all on top of TCP/IP.  We could even build it on top
> of a serial port.  Both have their down-sides wrt reliability and
> complexity.

I don't know of any zero-copy through the hypervisor mechanisms for
serial ports, but I know we do that with the various virtualization
network devices.

> Do you have another recommendation?

I don't have to make alternative recommendations until you can
show that what we have can't solve the problem acceptably, and
TCP emphatically can.
Jeremy Fitzhardinge - Dec. 15, 2008, 10:52 p.m.
Anthony Liguori wrote:
> Jeremy Fitzhardinge wrote:
>   
>>> Each of these sockets are going to be connected to a backend (to 
>>> implement guest<=>copy/paste for instance).  We want to implement 
>>> those backends in userspace and preferably in QEMU.
>>>
>>> Using some raw protocol over ethernet means you don't have 
>>> reliability.  If you use a protocol to get reliability (like TCP), 
>>> you now have to implement a full TCP/IP stack in userspace or get the 
>>> host kernel involved.  I'd rather not get the host kernel involved 
>>> from a security perspective.
>>>   
>>>       
>> There's nothing wrong with user-mode TCP, or you could run your TCP 
>> stack in a special-purpose guest if you're really paranoid.
>>     
>
> That seems unnecessarily complex.
>   

Well, the simplest thing is to let the host TCP stack do TCP.  Could you 
go into more detail about why you'd want to avoid that?

> This is why I've been pushing for the backends to be implemented in 
> QEMU.  Then QEMU can marshal the backend-specific state and transfer it 
> during live migration.  For something like copy/paste, this is obvious 
> (the clipboard state).  A general command interface is probably 
> stateless so it's a nop.
>   

Copy/paste seems like a particularly bogus example.  Surely this isn't a 
sensible way to implement it?

> I'm not a fan of having external backends to QEMU for the very reasons 
> you outline above.  You cannot marshal the state of a channel we know 
> nothing about.  We're really just talking about extending virtio in a 
> guest down to userspace so that we can implement paravirtual device 
> drivers in guest userspace.  This may be an X graphics driver, a mouse 
> driver, copy/paste, remote shutdown, etc.
>   
> A socket seems like a natural choice.  If that's wrong, then we can 
> explore other options (like a char device, virtual fs, etc.).

I think a socket is a pretty poor choice.  It's too low level, and it 
only really makes sense for streaming data, not for data storage 
(name/value pairs).  It means that everyone ends up making up their own 
serializations.  A filesystem view with notifications seems to be a 
better match for the use-cases you mention (aside from cut/paste), with 
a single well-defined way to serialize onto any given channel.  Each 
"file" may well have an application-specific content, but in general 
that's going to be something pretty simple.
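The "made up serializations" concern above can be made concrete: with only a byte stream, every user of a channel has to invent its own framing for name/value data. The length-prefixed encoding below is purely illustrative, not anything from the patch under discussion:

```python
import struct

def encode(pairs):
    """Serialize a dict of str name/value pairs into one byte string:
    each entry is a 2-byte key length, 2-byte value length (network
    byte order), then the key and value bytes."""
    out = b""
    for key, value in pairs.items():
        k, v = key.encode(), value.encode()
        out += struct.pack("!HH", len(k), len(v)) + k + v
    return out

def decode(buf):
    """Inverse of encode(): walk the buffer entry by entry."""
    pairs, off = {}, 0
    while off < len(buf):
        klen, vlen = struct.unpack_from("!HH", buf, off)
        off += 4
        key = buf[off:off + klen].decode(); off += klen
        value = buf[off:off + vlen].decode(); off += vlen
        pairs[key] = value
    return pairs

wire = encode({"resolution": "1024x768", "depth": "24"})
```

Every backend author ends up writing some variant of this pair of functions, which is exactly the duplication a higher-level (e.g. filesystem-like) interface would avoid.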

>   This 
> shouldn't be confused with networking though and all the talk of doing 
> silly things like streaming fence traffic through it just encourages the 
> confusion.

I'm not sure what you're referring to here.

    J
Anthony Liguori - Dec. 15, 2008, 11:01 p.m.
David Miller wrote:
> From: Anthony Liguori <anthony@codemonkey.ws>
> Date: Mon, 15 Dec 2008 14:44:26 -0600
>
>   
>> We want this communication mechanism to be simple and reliable as we
>> want to implement the backends drivers in the host userspace with
>> minimum mess.
>>     
>
> One implication of your statement here is that TCP is unreliable.
> That's absolutely not true.
>   

No, TCP falls under the not simple category because it requires the 
backend to have access to a TCP/IP stack.

>> Within the guest, we need the interface to be always available and
>> we need an addressing scheme that is hypervisor specific.  Yes, we
>> can build this all on top of TCP/IP.  We could even build it on top
>> of a serial port.  Both have their down-sides wrt reliability and
>> complexity.
>>     
>
> I don't know of any zero-copy through the hypervisor mechanisms for
> serial ports, but I know we do that with the various virtualization
> network devices.
>   

Yes, and I went down the road of using a dedicated network device and 
using raw ethernet as the protocol.  The thing that killed that was the 
fact that it's not reliable.  You need something like TCP to add 
reliability.

But that's a lot of work, and a bit backwards: using an unreliable
transport and then running TCP on top of it to get reliability.  Our
link (virtio) is inherently reliable, so why not just expose a reliable
interface to userspace?

>> Do you have another recommendation?
>>     
>
> I don't have to make alternative recommendations until you can
> show that what we have can't solve the problem acceptably, and
> TCP emphatically can.
>   

It can solve the problem but I don't think it's the best way to solve 
the problem mainly because the complexity it demands on the backend.

Regards,

Anthony Liguori
Anthony Liguori - Dec. 15, 2008, 11:08 p.m.
Jeremy Fitzhardinge wrote:
> Anthony Liguori wrote:
>>
>> That seems unnecessarily complex.
>>   
>
> Well, the simplest thing is to let the host TCP stack do TCP.  Could 
> you go into more detail about why you'd want to avoid that?

The KVM model is that a guest is a process.  Any IO operations
originate from the process (QEMU).  The advantage of this is that you
get very good security, because you can use things like SELinux and
simply treat the QEMU process as you would the guest.  In fact, in
general, I think we want to assume that QEMU is guest code from a
security perspective.

By passing up the network traffic to the host kernel, we now face a 
problem when we try to get the data back.  We could setup a tun device 
to send traffic to the kernel but then the rest of the system can see 
that traffic too.  If that traffic is sensitive, it's potentially unsafe.

You can use iptables to restrict who can receive traffic and possibly 
use SELinux packet tagging or whatever.  This gets extremely complex though.

It's far easier to avoid the host kernel entirely and implement the 
backends in QEMU.  Then any actions the backend takes will be on behalf 
of the guest.  You never have to worry about transport data leakage.

>> This is why I've been pushing for the backends to be implemented in 
>> QEMU.  Then QEMU can marshal the backend-specific state and transfer 
>> it during live migration.  For something like copy/paste, this is 
>> obvious (the clipboard state).  A general command interface is 
>> probably stateless so it's a nop.
>>   
>
> Copy/paste seems like a particularly bogus example.  Surely this isn't 
> a sensible way to implement it?

I think it's the most sensible way to implement it.  Would you suggest 
something different?

>> I'm not a fan of having external backends to QEMU for the very 
>> reasons you outline above.  You cannot marshal the state of a channel 
>> we know nothing about.  We're really just talking about extending 
>> virtio in a guest down to userspace so that we can implement 
>> paravirtual device drivers in guest userspace.  This may be an X 
>> graphics driver, a mouse driver, copy/paste, remote shutdown, etc.
>>   A socket seems like a natural choice.  If that's wrong, then we can 
>> explore other options (like a char device, virtual fs, etc.).
>
> I think a socket is a pretty poor choice.  It's too low level, and it 
> only really makes sense for streaming data, not for data storage 
> (name/value pairs).  It means that everyone ends up making up their 
> own serializations.  A filesystem view with notifications seems to be 
> a better match for the use-cases you mention (aside from cut/paste), 
> with a single well-defined way to serialize onto any given channel.  
> Each "file" may well have an application-specific content, but in 
> general that's going to be something pretty simple.

I had suggested a virtual file system at first and was thoroughly 
ridiculed for it :-)  There is a 9p virtio transport already so we could 
even just use that.

The main issue with a virtual file system is that it does not map well 
to other guests.  It's actually easier to implement a socket interface 
for Windows than it is to implement a new file system.

But we could find ways around this with libraries.  If we used 9p as a 
transport, we could just provide a char device in Windows that received 
it in userspace.

>>   This shouldn't be confused with networking though and all the talk 
>> of doing silly things like streaming fence traffic through it just 
>> encourages the confusion.
>
> I'm not sure what you're referring to here.

I'm just ranting, it's not important.

Regards,

Anthony Liguori

>    J

David Miller - Dec. 15, 2008, 11:10 p.m.
From: Anthony Liguori <anthony@codemonkey.ws>
Date: Mon, 15 Dec 2008 17:01:14 -0600

> No, TCP falls under the not simple category because it requires the
> backend to have access to a TCP/IP stack.

I'm at a loss for words if you need TCP in the hypervisor, if that's
what you're implying here.

You only need it in the guest and the host, which you already have,
in the Linux kernel.  Just transport that over virtio or whatever
and be done with it.
stephen hemminger - Dec. 15, 2008, 11:13 p.m.
On Mon, 15 Dec 2008 17:01:14 -0600
Anthony Liguori <anthony@codemonkey.ws> wrote:

> David Miller wrote:
> > From: Anthony Liguori <anthony@codemonkey.ws>
> > Date: Mon, 15 Dec 2008 14:44:26 -0600
> >
> >   
> >> We want this communication mechanism to be simple and reliable as we
> >> want to implement the backends drivers in the host userspace with
> >> minimum mess.
> >>     
> >
> > One implication of your statement here is that TCP is unreliable.
> > That's absolutely not true.
> >   
> 
> No, TCP falls under the not simple category because it requires the 
> backend to have access to a TCP/IP stack.
> 
> >> Within the guest, we need the interface to be always available and
> >> we need an addressing scheme that is hypervisor specific.  Yes, we
> >> can build this all on top of TCP/IP.  We could even build it on top
> >> of a serial port.  Both have their down-sides wrt reliability and
> >> complexity.
> >>     
> >
> > I don't know of any zero-copy through the hypervisor mechanisms for
> > serial ports, but I know we do that with the various virtualization
> > network devices.
> >   
> 
> Yes, and I went down the road of using a dedicated network device and 
> using raw ethernet as the protocol.  The thing that killed that was the 
> fact that it's not reliable.  You need something like TCP to add 
> reliability.
> 
> But that's a lot of work and a bit backwards.  Use a unreliable 
> transport but use TCP on top of it to get reliability.  Our link 
> (virtio) is inherently reliable so why not just expose a reliable 
> interface to userspace?
> 
> >> Do you have another recommendation?
> >>     
> >
> > I don't have to make alternative recommendations until you can
> > show that what we have can't solve the problem acceptably, and
> > TCP emphatically can.
> >   
> 
> It can solve the problem but I don't think it's the best way to solve 
> the problem mainly because the complexity it demands on the backend.

"Those who don't understand TCP are doomed to reimplement it, badly."


Anthony Liguori - Dec. 15, 2008, 11:17 p.m.
David Miller wrote:
> From: Anthony Liguori <anthony@codemonkey.ws>
> Date: Mon, 15 Dec 2008 17:01:14 -0600
>
>   
>> No, TCP falls under the not simple category because it requires the
>> backend to have access to a TCP/IP stack.
>>     
>
> I'm at a loss for words if you need TCP in the hypervisor, if that's
> what you're implying here.
>   

No.  KVM is not a traditional "hypervisor".  It's more of a userspace 
accelerator for emulators.

QEMU, a system emulator, calls in to the Linux kernel whenever it needs 
to run guest code.  Linux returns to QEMU whenever the guest has done an 
MMIO operation or something of that nature.  In this way, all of our 
device emulation (including paravirtual backends) are implemented in the 
host userspace in the QEMU process.

If we used TCP, we don't have a useful TCP/IP stack in QEMU, so we'd 
have to inject that traffic into the host Linux instance, and then 
receive the traffic in QEMU.  Besides being indirect, it has some nasty 
security implications that I outlined in my response to Jeremy's last note.

Regards,

Anthony Liguori

> You only need it in the guest and the host, which you already have,
> in the Linux kernel.  Just transport that over virtio or whatever
> and be done with it.
>   

Jeremy Fitzhardinge - Dec. 15, 2008, 11:44 p.m.
Anthony Liguori wrote:
> Jeremy Fitzhardinge wrote:
>> Anthony Liguori wrote:
>>>
>>> That seems unnecessarily complex.
>>>   
>>
>> Well, the simplest thing is to let the host TCP stack do TCP.  Could 
>> you go into more detail about why you'd want to avoid that?
>
> The KVM model is that a guest is a process.  Any IO operations 
> original from the process (QEMU).  The advantage to this is that you 
> get very good security because you can use things like SELinux and 
> simply treat the QEMU process as you would the guest.  In fact, in 
> general, I think we want to assume that QEMU is guest code from a 
> security perspective.
>
> By passing up the network traffic to the host kernel, we now face a 
> problem when we try to get the data back.  We could setup a tun device 
> to send traffic to the kernel but then the rest of the system can see 
> that traffic too.  If that traffic is sensitive, it's potentially unsafe.

Well, one could come up with a mechanism to bind an interface to be only 
visible to a particular context/container/something.

> You can use iptables to restrict who can receive traffic and possibly 
> use SELinux packet tagging or whatever.  This gets extremely complex 
> though.

Well, if you can just tag everything based on interface it's relatively 
simple.

> It's far easier to avoid the host kernel entirely and implement the 
> backends in QEMU.  Then any actions the backend takes will be on 
> behalf of the guest.  You never have to worry about transport data 
> leakage.

Well, a stream-like protocol layered over a reliable packet transport 
would get you there without the complexity of tcp.  Or just do a 
usermode tcp; it's not that complex if you really think it simplifies 
the other aspects.

>
>>> This is why I've been pushing for the backends to be implemented in 
>>> QEMU.  Then QEMU can marshal the backend-specific state and transfer 
>>> it during live migration.  For something like copy/paste, this is 
>>> obvious (the clipboard state).  A general command interface is 
>>> probably stateless so it's a nop.
>>>   
>>
>> Copy/paste seems like a particularly bogus example.  Surely this 
>> isn't a sensible way to implement it?
>
> I think it's the most sensible way to implement it.  Would you suggest 
> something different?

Well, off the top of my head I'm assuming the requirements are:

    * the goal is to unify the user's actual desktop session with a
      virtual session within a vm
    * a given user may have multiple VMs running on their desktop
    * a VM may be serving multiple user sessions
    * the VMs are not necessarily hosted by the user's desktop machine
    * the VMs can migrate at any moment

To me that looks like a daemon running within the context of each of the 
user's virtual sessions monitoring clipboard events, talking over a TCP 
connection to a corresponding daemon in their desktop session, which is 
responsible for reconciling cuts and pastes in all the various sessions.

I guess you'd say that each VM would multiplex all its cut/paste events 
via its AF_VMCHANNEL/cut+paste channel to its qemu, which would then 
demultiplex them off to the user's real desktops.  And that since the VM 
itself may have no networking, it needs to be a special magic connection.

And my counter argument to this nicely placed straw man is that the 
VM<->qemu connection can still be TCP, even if it's a private network 
with no outside access.

>
>>> I'm not a fan of having external backends to QEMU for the very 
>>> reasons you outline above.  You cannot marshal the state of a 
>>> channel we know nothing about.  We're really just talking about 
>>> extending virtio in a guest down to userspace so that we can 
>>> implement paravirtual device drivers in guest userspace.  This may 
>>> be an X graphics driver, a mouse driver, copy/paste, remote 
>>> shutdown, etc.
>>>   A socket seems like a natural choice.  If that's wrong, then we 
>>> can explore other options (like a char device, virtual fs, etc.).
>>
>> I think a socket is a pretty poor choice.  It's too low level, and it 
>> only really makes sense for streaming data, not for data storage 
>> (name/value pairs).  It means that everyone ends up making up their 
>> own serializations.  A filesystem view with notifications seems to be 
>> a better match for the use-cases you mention (aside from cut/paste), 
>> with a single well-defined way to serialize onto any given channel.  
>> Each "file" may well have an application-specific content, but in 
>> general that's going to be something pretty simple.
>
> I had suggested a virtual file system at first and was thoroughly 
> ridiculed for it :-)  There is a 9p virtio transport already so we 
> could even just use that.

You mean 9p directly over a virtio ringbuffer rather than via the 
network stack?  You could do that, but I'd still argue that using the 
network stack is a better approach.

> The main issue with a virtual file system is that it does not map 
> well to other guests.  It's actually easier to implement a socket 
> interface for Windows than it is to implement a new file system.

There's no need to put the "filesystem" into the kernel unless something 
else in the kernel needs to access it.  A usermode implementation 
talking over some stream interface would be fine.

> But we could find ways around this with libraries.  If we used 9p as a 
> transport, we could just provide a char device in Windows that 
> received it in userspace.

Or just use a tcp connection, and do it all with no kernel mods.

(Is 9p a good choice?  You need to be able to subscribe to events 
happening to files, and you'd need some kind of atomicity guarantee.  I 
dunno, maybe 9p already has this or can be cleanly adapted.)

    J
Evgeniy Polyakov - Dec. 15, 2008, 11:45 p.m.
Hi Anthony.

On Mon, Dec 15, 2008 at 05:01:14PM -0600, Anthony Liguori (anthony@codemonkey.ws) wrote:
> Yes, and I went down the road of using a dedicated network device and 
> using raw ethernet as the protocol.  The thing that killed that was the 
> fact that it's not reliable.  You need something like TCP to add 
> reliability.
> 
> But that's a lot of work and a bit backwards.  Use a unreliable 
> transport but use TCP on top of it to get reliability.  Our link 
> (virtio) is inherently reliable so why not just expose a reliable 
> interface to userspace?

I removed the original mail and did not check the archive, but don't the
rx/tx queues of the virtio device have a limited size? I do hope they
do, which means that your network either drops packets or blocks.


Having a dedicated preconfigured network device is essentially the same
as having this special socket family: guests which do not have it
(either the network device or the vmchannel socket) will not be able to
communicate with the host, so there is no difference. Except that the
usual network will just work out of the box (and you will especially
like that there is no need to hack on X to support a new network
medium).

Another approach is to implement that virtio backend with netlink based
userspace interface (like using connector or genetlink). This does not
differ too much from what you have with special socket family, but at
least it does not duplicate existing functionality of
userspace-kernelspace communications.

But IMO having special network device or running your protocol over
existing virtio network device is a cleaner solution both from technical
and convenience points of view.
Evgeniy Polyakov - Dec. 15, 2008, 11:52 p.m.
On Mon, Dec 15, 2008 at 05:08:29PM -0600, Anthony Liguori (anthony@codemonkey.ws) wrote:
> The KVM model is that a guest is a process.  Any IO operations original 
> from the process (QEMU).  The advantage to this is that you get very 
> good security because you can use things like SELinux and simply treat 
> the QEMU process as you would the guest.  In fact, in general, I think 
> we want to assume that QEMU is guest code from a security perspective.
> 
> By passing up the network traffic to the host kernel, we now face a 
> problem when we try to get the data back.  We could setup a tun device 
> to send traffic to the kernel but then the rest of the system can see 
> that traffic too.  If that traffic is sensitive, it's potentially unsafe.

You can even use unix sockets in this case, with each socket named
after its virtio channel's name. IIRC tun/tap devices can be
virtualized with recent kernels, which also solves all the problems of
shared access.

There are plenty of ways to implement this kind of functionality instead
of developing some new protocol, which is effectively a duplication of
what already exists in the kernel.
Dor Laor - Dec. 16, 2008, 12:01 a.m.
Evgeniy Polyakov wrote:
> On Mon, Dec 15, 2008 at 05:08:29PM -0600, Anthony Liguori (anthony@codemonkey.ws) wrote:
>   
>> The KVM model is that a guest is a process.  Any IO operations original 
>> from the process (QEMU).  The advantage to this is that you get very 
>> good security because you can use things like SELinux and simply treat 
>> the QEMU process as you would the guest.  In fact, in general, I think 
>> we want to assume that QEMU is guest code from a security perspective.
>>
>> By passing up the network traffic to the host kernel, we now face a 
>> problem when we try to get the data back.  We could setup a tun device 
>> to send traffic to the kernel but then the rest of the system can see 
>> that traffic too.  If that traffic is sensitive, it's potentially unsafe.
>>     
>
> You can even use unix sockets in this case, and each socket will be
> named as virtio channels names. IIRC tun/tap devices can be virtualizen
> with recent kernels, which also solves all problems of shared access.
>
> There are plenty of ways to implement this kind of functionality instead
> of developing some new protocol, which is effectively a duplication of
> what already exists in the kernel.
>
>   

Well, it is kind of a pv-unix-domain-socket.
I did not understand how a standard unix domain socket in the guest can
reach the host according to your solution.

The initial implementation was some sort of pv-serial. Serial itself is
low performing and there is no naming service whatsoever. Gleb did offer
the netlink option as a beginning, but we thought a new address family
would be more robust (you say too robust). So by suggesting a new
address family, one can think of it as a pv-unix-domain-socket.
Networking IS used since we think it is a good 'wheel'.
Indeed, David is right that instead of adding a new chunk of code we can
re-use the existing one. But we do have some 'new' (dare I say
virtualization) problems that might prevent us from using a standard
virtual nic:
    - Even if we can teach iptables to ignore this interface, other
      3rd-party firewalls might not obey: What if the VM is a Checkpoint
      firewall? What if the VM is Windows and uses a non-MS firewall?
    - Who will assign IPs for the vnic? How can I assure there is no IP
      clash? The standard dhcp for the other standard vnics might not be
      in our control.

So I do understand the idea of using a standard network interface. It's
just not that simple. Ideas to handle the above are welcome.
Otherwise we might need to go back to the serial/pv-serial approach.

btw: here are the usages/next usages of vmchannel:
VMchannel is a host-guest interface and in the future a guest-guest
interface. Currently/soon it is used for:
    - guest statistics
    - guest info
    - guest single sign-on
    - guest log-in/log-out
    - mouse channel for multiple monitors
    - cut&paste (guest-host, sometimes client-host-guest, since a
      company firewall blocks client-guest)
    - fencing (potentially)

btw2: without virtualization we wouldn't have new passionate issues to
discuss!
Cheers,
Dor
Herbert Xu - Dec. 16, 2008, 2:55 a.m.
Anthony Liguori <anthony@codemonkey.ws> wrote:
> 
> If we used TCP, we don't have a useful TCP/IP stack in QEMU, so we'd 
> have to inject that traffic into the host Linux instance, and then 
> receive the traffic in QEMU.  Besides being indirect, it has some nasty 
> security implications that I outlined in my response to Jeremy's last note.

When combined with namespaces I don't see why using the kernel TCP
stack would create any security problems that wouldn't otherwise
exist.

Cheers,
Gleb Natapov - Dec. 16, 2008, 6:57 a.m.
On Tue, Dec 16, 2008 at 02:45:11AM +0300, Evgeniy Polyakov wrote:
> Hi Anthony.
> 
> On Mon, Dec 15, 2008 at 05:01:14PM -0600, Anthony Liguori (anthony@codemonkey.ws) wrote:
> > Yes, and I went down the road of using a dedicated network device and 
> > using raw ethernet as the protocol.  The thing that killed that was the 
> > fact that it's not reliable.  You need something like TCP to add 
> > reliability.
> > 
> > But that's a lot of work and a bit backwards.  Use a unreliable 
> > transport but use TCP on top of it to get reliability.  Our link 
> > (virtio) is inherently reliable so why not just expose a reliable 
> > interface to userspace?
> 
> I removed original mail and did not check archive, but doesn't rx/tx
> queues of the virtio device have limited size? I do hope they have,
> which means that either your network drops packets or blocks.
> 
It blocks.

> Another approach is to implement that virtio backend with netlink based
> userspace interface (like using connector or genetlink). This does not
> differ too much from what you have with special socket family, but at
> least it does not duplicate existing functionality of
> userspace-kernelspace communications.
> 
I implemented vmchannel using connector initially (the downside is that
messages can be dropped). Is this more acceptable for upstream? The
implementation was 300 lines of code.

--
			Gleb.
Evgeniy Polyakov - Dec. 16, 2008, 9:25 p.m.
On Tue, Dec 16, 2008 at 08:57:27AM +0200, Gleb Natapov (gleb@redhat.com) wrote:
> > Another approach is to implement that virtio backend with netlink based
> > userspace interface (like using connector or genetlink). This does not
> > differ too much from what you have with special socket family, but at
> > least it does not duplicate existing functionality of
> > userspace-kernelspace communications.
> > 
> I implemented vmchannel using connector initially (the downside is that
> message can be dropped). Is this more expectable for upstream? The
> implementation was 300 lines of code.

Hard to tell, it depends on implementation. But if things are good, I
have no objections as connector maintainer :)

Messages in connector in particular and netlink in general are only
dropped when the receiving buffer is full (or when there is no memory);
you can tune the buffer size to match the virtual queue size or vice
versa.
Dor Laor - Dec. 16, 2008, 11:20 p.m.
Evgeniy Polyakov wrote:
> On Tue, Dec 16, 2008 at 08:57:27AM +0200, Gleb Natapov (gleb@redhat.com) wrote:
>   
>>> Another approach is to implement that virtio backend with netlink based
>>> userspace interface (like using connector or genetlink). This does not
>>> differ too much from what you have with special socket family, but at
>>> least it does not duplicate existing functionality of
>>> userspace-kernelspace communications.
>>>
>>>       
>> I implemented vmchannel using connector initially (the downside is that
>> message can be dropped). Is this more expectable for upstream? The
>> implementation was 300 lines of code.
>>     
>
> Hard to tell, it depends on implementation. But if things are good, I
> have no objections as connector maintainer :)
>
> Messages in connector in particular and netlink in general are only
> dropped, when receiving buffer is full (or when there is no memory), you
> can tune buffer size to match virtual queue size or vice versa.
>
>   
Gleb was aware of that and it's not a problem, since all of the
anticipated usages may drop msgs (guest statistics, cut&paste, mouse
movements, single sign-on commands, etc). A service that would need
reliability could use basic acks.
Gleb Natapov - Dec. 17, 2008, 2:31 p.m.
On Wed, Dec 17, 2008 at 12:25:32AM +0300, Evgeniy Polyakov wrote:
> On Tue, Dec 16, 2008 at 08:57:27AM +0200, Gleb Natapov (gleb@redhat.com) wrote:
> > > Another approach is to implement that virtio backend with netlink based
> > > userspace interface (like using connector or genetlink). This does not
> > > differ too much from what you have with special socket family, but at
> > > least it does not duplicate existing functionality of
> > > userspace-kernelspace communications.
> > > 
> > I implemented vmchannel using connector initially (the downside is that
> > message can be dropped). Is this more expectable for upstream? The
> > implementation was 300 lines of code.
> 
> Hard to tell, it depends on implementation. But if things are good, I
> have no objections as connector maintainer :)
> 
Here it is. Sorry it is not in a patch format yet, but it gives a
general idea of how it looks. The problem with connector is that
we need a different IDX for different channels and there is no way
to dynamically allocate them.

--
			Gleb.
Evgeniy Polyakov - Dec. 18, 2008, 12:30 p.m.
Hi Gleb.

On Wed, Dec 17, 2008 at 04:31:46PM +0200, Gleb Natapov (gleb@redhat.com) wrote:
> Here it is. Sorry it is not in a patch format yet, but it gives
> general idea how it looks. The problem with connector is that 
> we need different IDX for different channels and there is no way
> to dynamically allocate them.

Looks very good. Especially liked how you used idx.val pairs to
register multiple users. Please add a comment in the connector header on
how you use it, and feel free to add my ack if needed.

Patch

diff --git a/include/linux/socket.h b/include/linux/socket.h
index 20fc4bb..e65834c 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -191,7 +191,8 @@  struct ucred {
 #define AF_RXRPC	33	/* RxRPC sockets 		*/
 #define AF_ISDN		34	/* mISDN sockets 		*/
 #define AF_PHONET	35	/* Phonet sockets		*/
-#define AF_MAX		36	/* For now.. */
+#define AF_VMCHANNEL	36	/* Vmchannel sockets		*/
+#define AF_MAX		37	/* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC	AF_UNSPEC
@@ -229,6 +230,7 @@  struct ucred {
 #define PF_RXRPC	AF_RXRPC
 #define PF_ISDN		AF_ISDN
 #define PF_PHONET	AF_PHONET
+#define PF_VMCHANNEL	AF_VMCHANNEL
 #define PF_MAX		AF_MAX
 
 /* Maximum queue length specifiable by listen.  */
diff --git a/include/linux/vmchannel.h b/include/linux/vmchannel.h
new file mode 100644
index 0000000..27c1f94
--- /dev/null
+++ b/include/linux/vmchannel.h
@@ -0,0 +1,54 @@ 
+/*
+ *  Copyright 2008 Red Hat, Inc --- All Rights Reserved
+ *
+ *  Author(s): Gleb Natapov <gleb@redhat.com>
+ */
+
+#ifndef VMCHANNEL_H
+#define VMCHANNEL_H
+
+#define VMCHANNEL_NAME_MAX 80
+struct sockaddr_vmchannel {
+	sa_family_t svmchannel_family;
+	char svmchannel_name[VMCHANNEL_NAME_MAX];
+};
+
+#ifdef __KERNEL__
+
+#define VIRTIO_ID_VMCHANNEL 6
+#define VMCHANNEL_BAD_ID (~(__u32)0)
+
+#define vmchannel_sk(__sk) ((struct vmchannel_sock *) __sk)
+
+struct vmchannel_sock {
+	struct sock sk;
+	char name[VMCHANNEL_NAME_MAX];
+	__u32 id;
+	struct sk_buff_head backlog_skb_q;
+};
+
+struct vmchannel_info {
+	__u32 id;
+	char *name;
+};
+
+struct vmchannel_dev {
+	struct virtio_device *vdev;
+	struct virtqueue *rq;
+	struct virtqueue *sq;
+	struct tasklet_struct rx_tasklet;
+	struct tasklet_struct tx_tasklet;
+	__u32 channel_count;
+	struct vmchannel_info *channels;
+	struct sk_buff_head rx_skbuff_q;
+	struct sk_buff_head tx_skbuff_q;
+	atomic_t recv_posted;
+};
+
+struct vmchannel_desc {
+	__u32 id;
+	__le32 len;
+};
+
+#endif /* __KERNEL__ */
+#endif
diff --git a/net/Kconfig b/net/Kconfig
index d789d79..d01f135 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -36,6 +36,7 @@  source "net/packet/Kconfig"
 source "net/unix/Kconfig"
 source "net/xfrm/Kconfig"
 source "net/iucv/Kconfig"
+source "net/vmchannel/Kconfig"
 
 config INET
 	bool "TCP/IP networking"
diff --git a/net/Makefile b/net/Makefile
index 27d1f10..ddc89dc 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -55,6 +55,7 @@  obj-$(CONFIG_IEEE80211)		+= ieee80211/
 obj-$(CONFIG_TIPC)		+= tipc/
 obj-$(CONFIG_NETLABEL)		+= netlabel/
 obj-$(CONFIG_IUCV)		+= iucv/
+obj-$(CONFIG_VMCHANNEL)		+= vmchannel/
 obj-$(CONFIG_RFKILL)		+= rfkill/
 obj-$(CONFIG_NET_9P)		+= 9p/
 
diff --git a/net/vmchannel/Kconfig b/net/vmchannel/Kconfig
new file mode 100644
index 0000000..53f256a
--- /dev/null
+++ b/net/vmchannel/Kconfig
@@ -0,0 +1,11 @@ 
+#
+# VMCHANNEL address family
+#
+
+config VMCHANNEL
+	tristate "AF_VMCHANNEL address family (EXPERIMENTAL)"
+	depends on EXPERIMENTAL && VIRTIO
+
+	---help---
+	  AF_VMCHANNEL family is used for communication between host and guest.
+	  Say Y or M if you are going to run this kernel in a VM.
diff --git a/net/vmchannel/Makefile b/net/vmchannel/Makefile
new file mode 100644
index 0000000..f972fc4
--- /dev/null
+++ b/net/vmchannel/Makefile
@@ -0,0 +1,5 @@ 
+#
+# Makefile for the vmchannel AF.
+#
+
+obj-$(CONFIG_VMCHANNEL) += af_vmchannel.o
diff --git a/net/vmchannel/af_vmchannel.c b/net/vmchannel/af_vmchannel.c
new file mode 100644
index 0000000..ac87b31
--- /dev/null
+++ b/net/vmchannel/af_vmchannel.c
@@ -0,0 +1,769 @@ 
+/*
+ *  Copyright 2008 Red Hat, Inc --- All Rights Reserved
+ *
+ *  Author(s): Gleb Natapov <gleb@redhat.com>
+ */
+
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/errno.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/skbuff.h>
+#include <linux/init.h>
+#include <linux/poll.h>
+#include <net/sock.h>
+#include <net/tcp_states.h>
+#include <linux/kmod.h>
+#include <linux/virtio.h>
+#include <linux/virtio_config.h>
+#include <linux/vmchannel.h>
+
+static int max_ring_len = 1000;
+static int max_packet_len = 1024;
+
+module_param(max_ring_len, int, 0444);
+module_param(max_packet_len, int, 0444);
+
+static struct vmchannel_dev vmc_dev;
+
+static int vmchannel_send_skb(struct sk_buff *skb, const __u32 id);
+static __u32 vmchannel_find_channel_id(const char *name);
+
+static struct proto vmchannel_proto = {
+	.name           = "AF_VMCHANNEL",
+	.owner          = THIS_MODULE,
+	.obj_size       = sizeof(struct vmchannel_sock),
+};
+
+static struct vmchannel_sock_list {
+	struct hlist_head head;
+	spinlock_t lock;
+} vmchannel_sk_list = {
+	.lock = __SPIN_LOCK_UNLOCKED(vmchannel_sk_list.lock)
+};
+
+static void vmchannel_sock_link(struct vmchannel_sock_list *l, struct sock *sk)
+{
+	spin_lock_bh(&l->lock);
+	sk_add_node(sk, &l->head);
+	spin_unlock_bh(&l->lock);
+}
+
+static void vmchannel_sock_unlink(struct vmchannel_sock_list *l,
+		struct sock *sk)
+{
+	spin_lock_bh(&l->lock);
+	sk_del_node_init(sk);
+	spin_unlock_bh(&l->lock);
+}
+
+static struct sock *__vmchannel_get_sock_by_name(const char *nm)
+{
+	struct sock *sk;
+	struct hlist_node *node;
+
+	sk_for_each(sk, node, &vmchannel_sk_list.head) {
+		struct vmchannel_sock *vmc = vmchannel_sk(sk);
+		if (!strncmp(vmc->name, nm, VMCHANNEL_NAME_MAX))
+			return sk;
+	}
+
+	return NULL;
+}
+
+static struct sock *vmchannel_get_sock_by_id(const __u32 id)
+{
+	struct sock *sk, *found = NULL;
+	struct hlist_node *node;
+
+	spin_lock(&vmchannel_sk_list.lock);
+	/* sk is not NULL after a full walk, so track matches explicitly */
+	sk_for_each(sk, node, &vmchannel_sk_list.head) {
+		struct vmchannel_sock *vmc = vmchannel_sk(sk);
+		if (vmc->id == id)
+			found = sk;
+	}
+
+	if (found)
+		sock_hold(found);
+
+	spin_unlock(&vmchannel_sk_list.lock);
+
+	return found;
+}
+
+static int vmchannel_address_valid(struct sockaddr *addr, int alen)
+{
+	return addr && (alen >= sizeof(struct sockaddr_vmchannel)) &&
+		addr->sa_family == AF_VMCHANNEL;
+}
+
+/* vmchannel socket OPS */
+static int vmchannel_sock_release(struct socket *sock)
+{
+	struct sock *sk = sock->sk;
+	struct vmchannel_sock *vmc;
+
+	if (!sk)
+		return 0;
+
+	vmc = vmchannel_sk(sk);
+
+	vmchannel_sock_unlink(&vmchannel_sk_list, sk);
+
+	sock_orphan(sk);
+	lock_sock(sk);
+	if (sk->sk_state == TCP_ESTABLISHED) {
+		sk->sk_state = TCP_CLOSE;
+		sk->sk_shutdown |= SEND_SHUTDOWN | RCV_SHUTDOWN;
+		sk->sk_err = ECONNRESET;
+		sk->sk_state_change(sk);
+		skb_queue_purge(&vmc->backlog_skb_q);
+	}
+	release_sock(sk);
+	sock_put(sk);
+	return 0;
+}
+
+/* Bind an unbound socket */
+static int vmchannel_sock_bind(struct socket *sock, struct sockaddr *addr,
+		int alen)
+{
+	struct sockaddr_vmchannel *sa = (struct sockaddr_vmchannel *)addr;
+	struct sock *sk = sock->sk;
+	struct vmchannel_sock *vmc;
+	uint32_t id;
+	int err;
+
+	/* Verify the input sockaddr */
+	if (!vmchannel_address_valid(addr, alen))
+		return -EINVAL;
+
+	id = vmchannel_find_channel_id(sa->svmchannel_name);
+
+	if (id == VMCHANNEL_BAD_ID)
+		return -EADDRNOTAVAIL;
+
+	lock_sock(sk);
+	if (!sock_flag(sk, SOCK_ZAPPED)) {
+		err = -EBADFD;
+		goto done;
+	}
+
+	spin_lock_bh(&vmchannel_sk_list.lock);
+
+	if (__vmchannel_get_sock_by_name(sa->svmchannel_name)) {
+		err = -EADDRINUSE;
+		goto done_unlock;
+	}
+
+	vmc = vmchannel_sk(sk);
+
+	/* Bind the socket */
+	memcpy(vmc->name, sa->svmchannel_name, VMCHANNEL_NAME_MAX);
+	vmc->id = id;
+	sock_reset_flag(sk, SOCK_ZAPPED);
+	err = 0;
+
+done_unlock:
+	/* Release the socket list lock */
+	spin_unlock_bh(&vmchannel_sk_list.lock);
+done:
+	release_sock(sk);
+	return err;
+}
+
+static int vmchannel_sock_connect(struct socket *sock, struct sockaddr *addr,
+		int alen, int flags)
+{
+	struct sock *sk = sock->sk;
+	int err;
+
+	if (!vmchannel_address_valid(addr, alen))
+		return -EINVAL;
+
+	if (sk->sk_type != SOCK_STREAM)
+		return -EINVAL;
+
+	if (sock_flag(sk, SOCK_ZAPPED)) {
+		err = vmchannel_sock_bind(sock, addr, alen);
+		if (unlikely(err))
+			return err;
+	}
+
+	lock_sock(sk);
+	sk->sk_state = TCP_ESTABLISHED;
+	sock->state = SS_CONNECTED;
+	sk->sk_state_change(sk);
+	release_sock(sk);
+
+	return 0;
+}
+
+static int vmchannel_sock_getname(struct socket *sock, struct sockaddr *addr,
+		int *len, int peer)
+{
+	struct sockaddr_vmchannel *svmc = (struct sockaddr_vmchannel *)addr;
+	struct sock *sk = sock->sk;
+
+	/* Zero the whole address first so no uninitialised kernel stack
+	 * reaches userspace through padding. */
+	memset(svmc, 0, sizeof(*svmc));
+	addr->sa_family = AF_VMCHANNEL;
+	*len = sizeof(struct sockaddr_vmchannel);
+
+	memcpy(svmc->svmchannel_name, vmchannel_sk(sk)->name,
+			VMCHANNEL_NAME_MAX);
+
+	return 0;
+}
+
+static int vmchannel_sock_sendmsg(struct kiocb *iocb, struct socket *sock,
+		struct msghdr *msg, size_t len)
+{
+	struct sock *sk = sock->sk;
+	struct vmchannel_sock *vmc = vmchannel_sk(sk);
+	struct sk_buff *skb;
+	int err;
+
+	err = sock_error(sk);
+	if (err)
+		return err;
+
+	if (msg->msg_flags & MSG_OOB)
+		return -EOPNOTSUPP;
+
+	if (sk->sk_shutdown & SEND_SHUTDOWN) {
+		send_sig(SIGPIPE, current, 0);
+		return -EPIPE;
+	}
+
+	if (sk->sk_state != TCP_ESTABLISHED)
+		return -ENOTCONN;
+
+	skb = sock_alloc_send_skb(sk, len, msg->msg_flags & MSG_DONTWAIT, &err);
+	if (!skb)
+		return err;
+
+	if (memcpy_fromiovec(skb_put(skb, len), msg->msg_iov, len)) {
+		err = -EFAULT;
+		goto free_skb;
+	}
+
+	err = vmchannel_send_skb(skb, vmc->id);
+	if (err) {
+		err = -EPIPE;
+		goto free_skb;
+	}
+
+	return len;
+
+free_skb:
+	kfree_skb(skb);
+	return err;
+}
+
+static int vmchannel_sock_recvmsg(struct kiocb *iocb, struct socket *sock,
+		struct msghdr *msg, size_t len, int flags)
+{
+	int noblock = flags & MSG_DONTWAIT;
+	struct sock *sk = sock->sk;
+	struct vmchannel_sock *vmc = vmchannel_sk(sk);
+	int target, copied = 0, chunk;
+	struct sk_buff *skb;
+	int err;
+
+	if (flags & MSG_OOB)
+		return -EOPNOTSUPP;
+
+	if (sk->sk_state != TCP_ESTABLISHED)
+		return -EINVAL;
+
+	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
+
+	do {
+		spin_lock_bh(&vmc->backlog_skb_q.lock);
+		while ((skb = __skb_dequeue(&vmc->backlog_skb_q))) {
+			if (sock_queue_rcv_skb(sk, skb)) {
+				__skb_queue_head(&vmc->backlog_skb_q, skb);
+				break;
+			}
+			atomic_dec(&vmc_dev.recv_posted);
+		}
+		spin_unlock_bh(&vmc->backlog_skb_q.lock);
+
+		BUG_ON(atomic_read(&vmc_dev.recv_posted) < 0);
+
+		/* this will repost buffers */
+		if (atomic_read(&vmc_dev.recv_posted) < max_ring_len / 2)
+			tasklet_schedule(&vmc_dev.rx_tasklet);
+
+		skb = skb_recv_datagram(sk, flags, noblock, &err);
+		if (!skb) {
+			if (sk->sk_shutdown & RCV_SHUTDOWN)
+				err = 0;
+			/* Don't discard data copied on earlier iterations. */
+			return copied ? copied : err;
+		}
+
+		chunk = min_t(unsigned int, skb->len, len);
+
+		err = memcpy_toiovec(msg->msg_iov, skb->data, chunk);
+		if (err) {
+			if (!(flags & MSG_PEEK))
+				skb_queue_head(&sk->sk_receive_queue, skb);
+			else
+				kfree_skb(skb);
+
+			if (copied != 0)
+				return copied;
+			return err;
+		}
+
+		copied += chunk;
+		len -= chunk;
+
+		if (flags & MSG_PEEK) {
+			kfree_skb(skb);
+			break;
+		}
+
+		/* Mark read part of skb as used */
+		skb_pull(skb, chunk);
+
+		if (skb->len) {
+			skb_queue_head(&sk->sk_receive_queue, skb);
+			break;
+		}
+
+		kfree_skb(skb);
+	} while (copied < target);
+
+	return copied;
+}
+
+static int vmchannel_sock_shutdown(struct socket *sock, int mode)
+{
+	struct sock *sk = sock->sk;
+	int err = 0;
+
+	/* Map SHUT_RD/SHUT_WR/SHUT_RDWR (0..2) onto the RCV_SHUTDOWN and
+	 * SEND_SHUTDOWN bits. */
+	if (mode < 0 || mode > 2)
+		return -EINVAL;
+	mode = (mode + 1) & (RCV_SHUTDOWN | SEND_SHUTDOWN);
+
+	lock_sock(sk);
+	if (sk->sk_state == TCP_CLOSE) {
+		err = -ENOTCONN;
+		goto unlock;
+	}
+
+	sk->sk_shutdown |= mode;
+
+	if (mode & RCV_SHUTDOWN) {
+		skb_queue_purge(&sk->sk_receive_queue);
+		skb_queue_purge(&vmchannel_sk(sk)->backlog_skb_q);
+	}
+
+	/* Wake up anyone sleeping in poll */
+	sk->sk_state_change(sk);
+
+unlock:
+	release_sock(sk);
+	return err;
+}
+
+static const struct proto_ops vmchannel_sock_ops = {
+	.family         = PF_VMCHANNEL,
+	.owner          = THIS_MODULE,
+	.release        = vmchannel_sock_release,
+	.bind           = vmchannel_sock_bind,
+	.connect        = vmchannel_sock_connect,
+	.listen         = sock_no_listen,
+	.accept         = sock_no_accept,
+	.getname        = vmchannel_sock_getname,
+	.sendmsg        = vmchannel_sock_sendmsg,
+	.recvmsg        = vmchannel_sock_recvmsg,
+	.poll           = datagram_poll,
+	.ioctl          = sock_no_ioctl,
+	.mmap           = sock_no_mmap,
+	.socketpair     = sock_no_socketpair,
+	.shutdown       = vmchannel_sock_shutdown,
+	.setsockopt     = sock_no_setsockopt,
+	.getsockopt     = sock_no_getsockopt
+};
+
+static int vmchannel_socket_recv(struct sk_buff *skb, const __u32 id)
+{
+	struct sock *sk;
+	struct vmchannel_sock *vmc;
+	int ret = 0;
+
+	sk = vmchannel_get_sock_by_id(id);
+	if (!sk) {
+		kfree_skb(skb);
+		return 0;
+	}
+
+	if (sk->sk_state != TCP_ESTABLISHED ||
+			(sk->sk_shutdown & RCV_SHUTDOWN)) {
+		kfree_skb(skb);
+		goto unlock;
+	}
+
+	vmc = vmchannel_sk(sk);
+
+	spin_lock(&vmc->backlog_skb_q.lock);
+	if (!skb_queue_empty(&vmc->backlog_skb_q) ||
+			sock_queue_rcv_skb(sk, skb)) {
+		__skb_queue_tail(&vmc->backlog_skb_q, skb);
+		ret = 1;
+	}
+	spin_unlock(&vmc->backlog_skb_q.lock);
+unlock:
+	sock_put(sk);
+	return ret;
+}
+
+static void vmchannel_sock_destruct(struct sock *sk)
+{
+	skb_queue_purge(&sk->sk_receive_queue);
+	skb_queue_purge(&sk->sk_write_queue);
+}
+
+static struct sock *vmchannel_sock_alloc(struct socket *sock, int proto,
+		gfp_t prio)
+{
+	struct sock *sk;
+
+	sk = sk_alloc(&init_net, PF_VMCHANNEL, prio, &vmchannel_proto);
+
+	if (!sk)
+		return NULL;
+
+	sock_init_data(sock, sk);
+	skb_queue_head_init(&vmchannel_sk(sk)->backlog_skb_q);
+	sk->sk_destruct = vmchannel_sock_destruct;
+	sk->sk_protocol = proto;
+
+	vmchannel_sock_link(&vmchannel_sk_list, sk);
+
+	return sk;
+}
+
+static int vmchannel_sock_create(struct net *net, struct socket *sock,
+		int protocol)
+{
+	struct sock *sk;
+
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	if (sock->type != SOCK_STREAM)
+		return -ESOCKTNOSUPPORT;
+
+	sock->state = SS_UNCONNECTED;
+	sock->ops = &vmchannel_sock_ops;
+
+	sk = vmchannel_sock_alloc(sock, protocol, GFP_KERNEL);
+	if (!sk)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static struct net_proto_family vmchannel_sock_family_ops = {
+	.family = AF_VMCHANNEL,
+	.owner  = THIS_MODULE,
+	.create = vmchannel_sock_create,
+};
+
+/* vmchannel device functions */
+static __u32 vmchannel_find_channel_id(const char *name)
+{
+	__u32 id = VMCHANNEL_BAD_ID;
+	int i;
+
+	for (i = 0; i < vmc_dev.channel_count; i++) {
+		if (!strncmp(name, vmc_dev.channels[i].name,
+					VMCHANNEL_NAME_MAX)) {
+			id = vmc_dev.channels[i].id;
+			break;
+		}
+	}
+
+	return id;
+}
+
+static inline struct vmchannel_desc *skb_vmchannel_desc(struct sk_buff *skb)
+{
+	return (struct vmchannel_desc *)skb->cb;
+}
+
+static inline void vmchannel_desc_to_sg(struct scatterlist *sg,
+		struct sk_buff *skb)
+{
+	sg_init_one(sg, skb_vmchannel_desc(skb), sizeof(struct vmchannel_desc));
+}
+
+static int try_fill_recvq(void)
+{
+	struct sk_buff *skb;
+	struct scatterlist sg[2];
+	int err, num = 0;
+
+	sg_init_table(sg, 2);
+	for (; atomic_read(&vmc_dev.recv_posted) < max_ring_len;
+			atomic_inc(&vmc_dev.recv_posted)) {
+		/* Also called from the rx tasklet, so the allocation must
+		 * not sleep. */
+		skb = alloc_skb(max_packet_len, GFP_ATOMIC);
+		if (unlikely(!skb))
+			break;
+
+		skb_put(skb, max_packet_len);
+		vmchannel_desc_to_sg(sg, skb);
+		skb_to_sgvec(skb, sg + 1, 0, skb->len);
+		skb_queue_head(&vmc_dev.rx_skbuff_q, skb);
+
+		err = vmc_dev.rq->vq_ops->add_buf(vmc_dev.rq, sg, 0, 2, skb);
+		if (err) {
+			skb_unlink(skb, &vmc_dev.rx_skbuff_q);
+			kfree_skb(skb);
+			break;
+		}
+		num++;
+	}
+
+	if (num)
+		vmc_dev.rq->vq_ops->kick(vmc_dev.rq);
+
+	return num;
+}
+
+static void vmchannel_rx(unsigned long data)
+{
+	struct sk_buff *skb;
+	unsigned int l;
+
+	while ((skb = vmc_dev.rq->vq_ops->get_buf(vmc_dev.rq, &l))) {
+		struct vmchannel_desc *desc = skb_vmchannel_desc(skb);
+		__u32 len = le32_to_cpu(desc->len);
+
+		skb_unlink(skb, &vmc_dev.rx_skbuff_q);
+		skb_trim(skb, len);
+		if (!vmchannel_socket_recv(skb, le32_to_cpu(desc->id)))
+			atomic_dec(&vmc_dev.recv_posted);
+	}
+	try_fill_recvq();
+}
+
+static void recvq_notify(struct virtqueue *recvq)
+{
+	tasklet_schedule(&vmc_dev.rx_tasklet);
+}
+
+static int vmchannel_try_send_one(struct sk_buff *skb)
+{
+	struct scatterlist sg[2];
+
+	sg_init_table(sg, 2);
+	vmchannel_desc_to_sg(sg, skb);
+	skb_to_sgvec(skb, sg + 1, 0, skb->len);
+
+	return vmc_dev.sq->vq_ops->add_buf(vmc_dev.sq, sg, 2, 0, skb);
+}
+
+static void vmchannel_tx(unsigned long data)
+{
+	struct sk_buff *skb;
+	unsigned int len;
+	int sent = 0;
+
+	while ((skb = vmc_dev.sq->vq_ops->get_buf(vmc_dev.sq, &len)))
+		kfree_skb(skb);
+
+	spin_lock(&vmc_dev.tx_skbuff_q.lock);
+	while ((skb = skb_peek(&vmc_dev.tx_skbuff_q))) {
+		if (vmchannel_try_send_one(skb))
+			break;
+		__skb_unlink(skb, &vmc_dev.tx_skbuff_q);
+		sent++;
+	}
+	spin_unlock(&vmc_dev.tx_skbuff_q.lock);
+	if (sent)
+		vmc_dev.sq->vq_ops->kick(vmc_dev.sq);
+}
+
+static void sendq_notify(struct virtqueue *sendq)
+{
+	tasklet_schedule(&vmc_dev.tx_tasklet);
+}
+
+static int vmchannel_send_skb(struct sk_buff *skb, const __u32 id)
+{
+	struct vmchannel_desc *desc;
+
+	desc = skb_vmchannel_desc(skb);
+	desc->id = cpu_to_le32(id);
+	desc->len = cpu_to_le32(skb->len);
+
+	skb_queue_tail(&vmc_dev.tx_skbuff_q, skb);
+	tasklet_schedule(&vmc_dev.tx_tasklet);
+
+	return 0;
+}
+
+static int vmchannel_probe(struct virtio_device *vdev)
+{
+	int r, i;
+	__le32 count;
+	unsigned offset;
+
+	vdev->priv = &vmc_dev;
+	vmc_dev.vdev = vdev;
+
+	vdev->config->get(vdev, 0, &count, sizeof(count));
+
+	vmc_dev.channel_count = le32_to_cpu(count);
+	if (vmc_dev.channel_count == 0) {
+		dev_printk(KERN_ERR, &vdev->dev, "No channels present\n");
+		return -ENODEV;
+	}
+
+	pr_debug("vmchannel: %d channels detected\n", vmc_dev.channel_count);
+
+	vmc_dev.channels = kcalloc(vmc_dev.channel_count,
+			sizeof(struct vmchannel_info), GFP_KERNEL);
+	if (!vmc_dev.channels)
+		return -ENOMEM;
+
+	offset = sizeof(count);
+	for (i = 0; i < vmc_dev.channel_count; i++) {
+		__u32 len;
+		__le32 tmp;
+		vdev->config->get(vdev, offset, &tmp, 4);
+		vmc_dev.channels[i].id = le32_to_cpu(tmp);
+		offset += 4;
+		vdev->config->get(vdev, offset, &tmp, 4);
+		len = le32_to_cpu(tmp);
+		if (len > VMCHANNEL_NAME_MAX) {
+			dev_printk(KERN_ERR, &vdev->dev,
+					"Wrong device configuration. "
+					"Channel name is too long\n");
+			r = -ENODEV;
+			goto out;
+		}
+		vmc_dev.channels[i].name = kmalloc(len, GFP_KERNEL);
+		if (!vmc_dev.channels[i].name) {
+			r = -ENOMEM;
+			goto out;
+		}
+		offset += 4;
+		vdev->config->get(vdev, offset, vmc_dev.channels[i].name, len);
+		offset += len;
+		pr_debug("vmchannel: found channel '%s' id %d\n",
+				vmc_dev.channels[i].name,
+				vmc_dev.channels[i].id);
+	}
+
+	vmc_dev.rq = vdev->config->find_vq(vdev, 0, recvq_notify);
+	if (IS_ERR(vmc_dev.rq)) {
+		r = PTR_ERR(vmc_dev.rq);
+		vmc_dev.rq = NULL;
+		goto out;
+	}
+
+	vmc_dev.sq = vdev->config->find_vq(vdev, 1, sendq_notify);
+	if (IS_ERR(vmc_dev.sq)) {
+		r = PTR_ERR(vmc_dev.sq);
+		vmc_dev.sq = NULL;
+		goto out;
+	}
+
+	r = proto_register(&vmchannel_proto, 0);
+	if (r)
+		goto out;
+
+	r = sock_register(&vmchannel_sock_family_ops);
+	if (r)
+		goto out_proto;
+
+	skb_queue_head_init(&vmc_dev.rx_skbuff_q);
+	skb_queue_head_init(&vmc_dev.tx_skbuff_q);
+	tasklet_init(&vmc_dev.rx_tasklet, vmchannel_rx, 0);
+	tasklet_init(&vmc_dev.tx_tasklet, vmchannel_tx, 0);
+	atomic_set(&vmc_dev.recv_posted, 0);
+	if (try_fill_recvq())
+		return 0;
+
+	r = -ENOMEM;
+
+	tasklet_kill(&vmc_dev.rx_tasklet);
+	tasklet_kill(&vmc_dev.tx_tasklet);
+	sock_unregister(PF_VMCHANNEL);
+out_proto:
+	proto_unregister(&vmchannel_proto);
+out:
+	if (vmc_dev.sq)
+		vdev->config->del_vq(vmc_dev.sq);
+	if (vmc_dev.rq)
+		vdev->config->del_vq(vmc_dev.rq);
+
+	for (i = 0; i < vmc_dev.channel_count; i++) {
+		if (!vmc_dev.channels[i].name)
+			break;
+		kfree(vmc_dev.channels[i].name);
+	}
+
+	kfree(vmc_dev.channels);
+
+	return r;
+}
+
+static void vmchannel_remove(struct virtio_device *vdev)
+{
+	int i;
+
+	/* Stop all the virtqueues. */
+	vdev->config->reset(vdev);
+
+	tasklet_kill(&vmc_dev.rx_tasklet);
+	tasklet_kill(&vmc_dev.tx_tasklet);
+
+	sock_unregister(PF_VMCHANNEL);
+	proto_unregister(&vmchannel_proto);
+
+	vdev->config->del_vq(vmc_dev.rq);
+	vdev->config->del_vq(vmc_dev.sq);
+
+	skb_queue_purge(&vmc_dev.rx_skbuff_q);
+	skb_queue_purge(&vmc_dev.tx_skbuff_q);
+
+	for (i = 0; i < vmc_dev.channel_count; i++)
+		kfree(vmc_dev.channels[i].name);
+
+	kfree(vmc_dev.channels);
+}
+
+static struct virtio_device_id id_table[] = {
+	{ VIRTIO_ID_VMCHANNEL, VIRTIO_DEV_ANY_ID }, { 0 },
+};
+
+static struct virtio_driver virtio_vmchannel = {
+	.driver.name =  "virtio-vmchannel",
+	.driver.owner = THIS_MODULE,
+	.id_table =     id_table,
+	.probe =	vmchannel_probe,
+	.remove =       __devexit_p(vmchannel_remove),
+};
+
+static int __init init(void)
+{
+	return register_virtio_driver(&virtio_vmchannel);
+}
+
+static void __exit fini(void)
+{
+	unregister_virtio_driver(&virtio_vmchannel);
+}
+
+module_init(init);
+module_exit(fini);
+
+MODULE_AUTHOR("Gleb Natapov");
+MODULE_DEVICE_TABLE(virtio, id_table);
+MODULE_DESCRIPTION("Virtio vmchannel driver");
+MODULE_LICENSE("GPL");