
[RFC,v5,0/5] Add virtio transport for AF_VSOCK

Message ID 1460129705.1749.25.camel@docker.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Ian Campbell April 8, 2016, 3:35 p.m. UTC
On Fri, 2016-04-01 at 15:23 +0100, Stefan Hajnoczi wrote:
> This series is based on Michael Tsirkin's vhost branch (v4.5-rc6).
> 
> I'm about to process Claudio Imbrenda's locking fixes for virtio-vsock but
> first I want to share the latest version of the code.  Several people are
> playing with vsock now so sharing the latest code should avoid duplicate work.

Thanks for this, I've been using it in my project and it mostly seems
fine.

One wrinkle I came across, which I'm not sure if it is by design or a
problem is that I can see this sequence coming from the guest (with
other activity in between):

    1) OP_SHUTDOWN w/ flags == SHUTDOWN_RX
    2) OP_SHUTDOWN w/ flags == SHUTDOWN_TX
    3) OP_SHUTDOWN w/ flags == SHUTDOWN_TX|SHUTDOWN_RX

I originally had my backend close things down at #2; however, this meant
that when #3 arrived it was for a non-existent socket (or, worse, an
active one if the ports got reused). I checked v5 of the spec
proposal[0] which says:
    If these bits are set and there are no more virtqueue buffers
    pending the socket is disconnected.

but I'm not entirely sure if this behaviour contradicts this or not
(the bits have both been set at #2, but not at the same time).

BTW, how does one tell if there are no more virtqueue buffers pending
or not while processing the op?

Another thing I noticed, which is really more to do with the generic
AF_VSOCK bits than anything to do with your patches, is that there are no
limitations on which vsock ports a non-privileged user can bind to and,
relatedly, that there is no netns support, so e.g. users in unprivileged
containers can bind to any vsock port and talk to the host, which might
be undesirable. For my use for now I just went with the big-hammer
approach of denying access from anything other than the init_net
namespace[1] while I consider what the right answer is.

Ian.

[0] http://thread.gmane.org/gmane.comp.emulators.virtio.devel/1092
[1]
From 366c9c42afb9bd54f92f72518470c09e46f12e88 Mon Sep 17 00:00:00 2001
From: Ian Campbell <ian.campbell@docker.com>
Date: Mon, 4 Apr 2016 14:50:10 +0100
Subject: [PATCH] VSOCK: Only allow host network namespace to use AF_VSOCK.

The VSOCK addressing schema does not really lend itself to simply creating an
alternative end point address within a namespace.

Signed-off-by: Ian Campbell <ian.campbell@docker.com>
---
 net/vmw_vsock/af_vsock.c | 3 +++
 1 file changed, 3 insertions(+)

-- 
2.8.0.rc3

Comments

Stefan Hajnoczi April 11, 2016, 10:45 a.m. UTC | #1
On Fri, Apr 08, 2016 at 04:35:05PM +0100, Ian Campbell wrote:
> On Fri, 2016-04-01 at 15:23 +0100, Stefan Hajnoczi wrote:
> > This series is based on Michael Tsirkin's vhost branch (v4.5-rc6).
> > 
> > I'm about to process Claudio Imbrenda's locking fixes for virtio-vsock but
> > first I want to share the latest version of the code.  Several people are
> > playing with vsock now so sharing the latest code should avoid duplicate work.
> 
> Thanks for this, I've been using it in my project and it mostly seems
> fine.
> 
> One wrinkle I came across, which I'm not sure if it is by design or a
> problem is that I can see this sequence coming from the guest (with
> other activity in between):
> 
>     1) OP_SHUTDOWN w/ flags == SHUTDOWN_RX
>     2) OP_SHUTDOWN w/ flags == SHUTDOWN_TX
>     3) OP_SHUTDOWN w/ flags == SHUTDOWN_TX|SHUTDOWN_RX
> 
> I originally had my backend close things down at #2; however, this meant
> that when #3 arrived it was for a non-existent socket (or, worse, an
> active one if the ports got reused). I checked v5 of the spec
> proposal[0] which says:
>     If these bits are set and there are no more virtqueue buffers
>     pending the socket is disconnected.
> 
> but I'm not entirely sure if this behaviour contradicts this or not
> (the bits have both been set at #2, but not at the same time).
> 
> BTW, how does one tell if there are no more virtqueue buffers pending
> or not while processing the op?

#2 is odd.  The shutdown bits are sticky so they cannot be cleared once
set.  I would have expected just #1 and #3.  The behavior you observe
looks like a bug.

The spec text does not convey the meaning of OP_SHUTDOWN well.
OP_SHUTDOWN SHUTDOWN_TX|SHUTDOWN_RX means no further rx/tx is possible
for this connection.  "there are no more virtqueue buffers pending the
socket" really means that this isn't an immediate close from the
perspective of the application.  If the application still has unread rx
buffers then the socket stays readable until the rx data has been fully
read.

> Another thing I noticed, which is really more to do with the generic
> AF_VSOCK bits than anything to do with your patches is that there are no
> limitations on which vsock ports a non-privileged user can bind to and
> relatedly that there is no netns support so e.g. users in unprivileged
> containers can bind to any vsock port and talk to the host, which might
> be undesirable. For my use for now I just went with the big hammer
> approach of denying access from anything other than init_net
> namespace[1] while I consider what the right answer is.

From the vhost point of view each netns should have its own AF_VSOCK
namespace.  This way two containers could act as "the host" (CID 2) for
their respective guests.
Michael S. Tsirkin April 11, 2016, 12:54 p.m. UTC | #2
On Mon, Apr 11, 2016 at 11:45:48AM +0100, Stefan Hajnoczi wrote:
> On Fri, Apr 08, 2016 at 04:35:05PM +0100, Ian Campbell wrote:
> > On Fri, 2016-04-01 at 15:23 +0100, Stefan Hajnoczi wrote:
> > > This series is based on Michael Tsirkin's vhost branch (v4.5-rc6).
> > > 
> > > I'm about to process Claudio Imbrenda's locking fixes for virtio-vsock but
> > > first I want to share the latest version of the code.  Several people are
> > > playing with vsock now so sharing the latest code should avoid duplicate work.
> > 
> > Thanks for this, I've been using it in my project and it mostly seems
> > fine.
> > 
> > One wrinkle I came across, which I'm not sure if it is by design or a
> > problem is that I can see this sequence coming from the guest (with
> > other activity in between):
> > 
> >     1) OP_SHUTDOWN w/ flags == SHUTDOWN_RX
> >     2) OP_SHUTDOWN w/ flags == SHUTDOWN_TX
> >     3) OP_SHUTDOWN w/ flags == SHUTDOWN_TX|SHUTDOWN_RX
> > 
> > I originally had my backend close things down at #2; however, this meant
> > that when #3 arrived it was for a non-existent socket (or, worse, an
> > active one if the ports got reused). I checked v5 of the spec
> > proposal[0] which says:
> >     If these bits are set and there are no more virtqueue buffers
> >     pending the socket is disconnected.
> > 
> > but I'm not entirely sure if this behaviour contradicts this or not
> > (the bits have both been set at #2, but not at the same time).
> > 
> > BTW, how does one tell if there are no more virtqueue buffers pending
> > or not while processing the op?
> 
> #2 is odd.  The shutdown bits are sticky so they cannot be cleared once
> set.  I would have expected just #1 and #3.  The behavior you observe
> looks like a bug.
> 
> The spec text does not convey the meaning of OP_SHUTDOWN well.
> OP_SHUTDOWN SHUTDOWN_TX|SHUTDOWN_RX means no further rx/tx is possible
> for this connection.  "there are no more virtqueue buffers pending the
> socket" really means that this isn't an immediate close from the
> perspective of the application.  If the application still has unread rx
> buffers then the socket stays readable until the rx data has been fully
> read.

Yes but you also wrote:
	If these bits are set and there are no more virtqueue buffers
	pending the socket is disconnected.

how does remote know that there are no buffers pending and so it's safe
to reuse the same source/destination address now?  Maybe destination
should send RST at that point?



> > Another thing I noticed, which is really more to do with the generic
> > AF_VSOCK bits than anything to do with your patches is that there are no
> > limitations on which vsock ports a non-privileged user can bind to and
> > relatedly that there is no netns support so e.g. users in unprivileged
> > containers can bind to any vsock port and talk to the host, which might
> > be undesirable. For my use for now I just went with the big hammer
> > approach of denying access from anything other than init_net
> > namespace[1] while I consider what the right answer is.
> 
> From the vhost point of view each netns should have its own AF_VSOCK
> namespace.  This way two containers could act as "the host" (CID 2) for
> their respective guests.


I wonder how this interacts with the disconnect on migration
idea that you discussed. Specifically, socket has to stay connected
Stefan Hajnoczi April 12, 2016, 1:59 p.m. UTC | #3
On Mon, Apr 11, 2016 at 03:54:08PM +0300, Michael S. Tsirkin wrote:
> On Mon, Apr 11, 2016 at 11:45:48AM +0100, Stefan Hajnoczi wrote:
> > On Fri, Apr 08, 2016 at 04:35:05PM +0100, Ian Campbell wrote:
> > > On Fri, 2016-04-01 at 15:23 +0100, Stefan Hajnoczi wrote:
> > > > This series is based on Michael Tsirkin's vhost branch (v4.5-rc6).
> > > > 
> > > > I'm about to process Claudio Imbrenda's locking fixes for virtio-vsock but
> > > > first I want to share the latest version of the code.  Several people are
> > > > playing with vsock now so sharing the latest code should avoid duplicate work.
> > > 
> > > Thanks for this, I've been using it in my project and it mostly seems
> > > fine.
> > > 
> > > One wrinkle I came across, which I'm not sure if it is by design or a
> > > problem is that I can see this sequence coming from the guest (with
> > > other activity in between):
> > > 
> > >     1) OP_SHUTDOWN w/ flags == SHUTDOWN_RX
> > >     2) OP_SHUTDOWN w/ flags == SHUTDOWN_TX
> > >     3) OP_SHUTDOWN w/ flags == SHUTDOWN_TX|SHUTDOWN_RX

How did you trigger this sequence?  I'd like to reproduce it.

> > > I originally had my backend close things down at #2; however, this meant
> > > that when #3 arrived it was for a non-existent socket (or, worse, an
> > > active one if the ports got reused). I checked v5 of the spec
> > > proposal[0] which says:
> > >     If these bits are set and there are no more virtqueue buffers
> > >     pending the socket is disconnected.
> > > 
> > > but I'm not entirely sure if this behaviour contradicts this or not
> > > (the bits have both been set at #2, but not at the same time).
> > > 
> > > BTW, how does one tell if there are no more virtqueue buffers pending
> > > or not while processing the op?
> > 
> > #2 is odd.  The shutdown bits are sticky so they cannot be cleared once
> > set.  I would have expected just #1 and #3.  The behavior you observe
> > looks like a bug.
> > 
> > The spec text does not convey the meaning of OP_SHUTDOWN well.
> > OP_SHUTDOWN SHUTDOWN_TX|SHUTDOWN_RX means no further rx/tx is possible
> > for this connection.  "there are no more virtqueue buffers pending the
> > socket" really means that this isn't an immediate close from the
> > perspective of the application.  If the application still has unread rx
> > buffers then the socket stays readable until the rx data has been fully
> > read.
> 
> Yes but you also wrote:
> 	If these bits are set and there are no more virtqueue buffers
> 	pending the socket is disconnected.
> 
> how does remote know that there are no buffers pending and so it's safe
> to reuse the same source/destination address now?  Maybe destination
> should send RST at that point?

You are right, the source/destination address could be reused while the
remote still has the connection in their table.  Connection
establishment would fail with a RST reply.

I can think of two solutions:

1. Implementations must remove connections from their table as soon as
   SHUTDOWN_TX|SHUTDOWN_RX is received.  This way the source/destination
   address tuple can be reused immediately, i.e. new connections with
   the same source/destination would be possible while an application is
   still draining the receive buffers of an old connection.

2. Extend the connection lifecycle so that an A->B
   SHUTDOWN_TX|SHUTDOWN_RX must be followed by a B->A RST to close
   a connection.  This way the source/destination address is only in use
   once at a time.

Option #2 seems safer because there is no overlap in source/destination
address usage.
Ian Campbell April 12, 2016, 4:07 p.m. UTC | #4
Somehow Stefan's reply disappeared from my INBOX (although I did see
it), so I'm replying here.

On Mon, 2016-04-11 at 15:54 +0300, Michael S. Tsirkin wrote:
> On Mon, Apr 11, 2016 at 11:45:48AM +0100, Stefan Hajnoczi wrote:
> > 
> > On Fri, Apr 08, 2016 at 04:35:05PM +0100, Ian Campbell wrote:
> > > 
> > > On Fri, 2016-04-01 at 15:23 +0100, Stefan Hajnoczi wrote:
> > > > 
> > > > This series is based on Michael Tsirkin's vhost branch (v4.5-rc6).
> > > > 
> > > > I'm about to process Claudio Imbrenda's locking fixes for virtio-vsock but
> > > > first I want to share the latest version of the code.  Several people are
> > > > playing with vsock now so sharing the latest code should avoid duplicate work.
> > > Thanks for this, I've been using it in my project and it mostly seems
> > > fine.
> > > 
> > > One wrinkle I came across, which I'm not sure if it is by design or a
> > > problem is that I can see this sequence coming from the guest (with
> > > other activity in between):
> > > 
> > >     1) OP_SHUTDOWN w/ flags == SHUTDOWN_RX
> > >     2) OP_SHUTDOWN w/ flags == SHUTDOWN_TX
> > >     3) OP_SHUTDOWN w/ flags == SHUTDOWN_TX|SHUTDOWN_RX
> > > 
> > > I originally had my backend close things down at #2; however, this meant
> > > that when #3 arrived it was for a non-existent socket (or, worse, an
> > > active one if the ports got reused). I checked v5 of the spec
> > > proposal[0] which says:
> > >     If these bits are set and there are no more virtqueue buffers
> > >     pending the socket is disconnected.
> > > 
> > > but I'm not entirely sure if this behaviour contradicts this or not
> > > (the bits have both been set at #2, but not at the same time).
> > > 
> > > BTW, how does one tell if there are no more virtqueue buffers pending
> > > or not while processing the op?
> > #2 is odd.  The shutdown bits are sticky so they cannot be cleared once
> > set.  I would have expected just #1 and #3.  The behavior you observe
> > looks like a bug.
> > 
> > The spec text does not convey the meaning of OP_SHUTDOWN well.
> > OP_SHUTDOWN SHUTDOWN_TX|SHUTDOWN_RX means no further rx/tx is possible
> > for this connection.  "there are no more virtqueue buffers pending the
> > socket" really means that this isn't an immediate close from the
> > perspective of the application.  If the application still has unread rx
> > buffers then the socket stays readable until the rx data has been fully
> > read.

Thanks, distinguishing the application's local buffer from the vring
would make that clearer. Perhaps by not talking about "virtqueue
buffers", since they sound like a vring thing.

However, as Michael observes I'm not sure that's the whole story.

> Yes but you also wrote:
> 	If these bits are set and there are no more virtqueue buffers
> 	pending the socket is disconnected.
> 
> how does remote know that there are no buffers pending and so it's safe
> to reuse the same source/destination address now?

Indeed, this is one of the things I struggled with, e.g. if I send a
SHUTDOWN_RX to my peer am I supposed to wait for that buffer to come
back (so I know the peer has seen it) and then wait for an entire
"cycle" of the TX ring to know there is nothing still in flight? That's
some tricky book-keeping.

>   Maybe destination
> should send RST at that point?

i.e. upon receipt of SHUTDOWN_RX|SHUTDOWN_TX from the peer you are
expected to send a RST. When the peer observes that, they know
there is no further data in that connection on the ring?

That sounds like it would be helpful.

> > > Another thing I noticed, which is really more to do with the generic
> > > AF_VSOCK bits than anything to do with your patches is that there are no
> > > limitations on which vsock ports a non-privileged user can bind to and
> > > relatedly that there is no netns support so e.g. users in unprivileged
> > > containers can bind to any vsock port and talk to the host, which might
> > > be undesirable. For my use for now I just went with the big hammer
> > > approach of denying access from anything other than init_net
> > > namespace[1] while I consider what the right answer is.
> > From the vhost point of view each netns should have its own AF_VSOCK
> > namespace.  This way two containers could act as "the host" (CID 2) for
> > their respective guests.

When you say "should" you mean that's the intended design as opposed to
what the current code is actually doing, right?

Ian.
Stefan Hajnoczi April 13, 2016, 1:38 p.m. UTC | #5
On Tue, Apr 12, 2016 at 05:37:54PM +0100, Ian Campbell wrote:
> Perhaps the guest end is turning shutdown(foo) directly into a vsock
> message without or-ing in the current state?

Yes, you are right:

  lock_sock(sk);
  sk->sk_shutdown |= mode;
  sk->sk_state_change(sk);
  release_sock(sk);

  if (sk->sk_type == SOCK_STREAM) {
      sock_reset_flag(sk, SOCK_DONE);
      vsock_send_shutdown(sk, mode);

Although sk_shutdown is ORed correctly, vsock_send_shutdown() is called with
just the shutdown() argument.

Patch

diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 1e5f5ed..cdb3dd3 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -1840,6 +1840,9 @@  static const struct proto_ops vsock_stream_ops = {
 static int vsock_create(struct net *net, struct socket *sock,
 			int protocol, int kern)
 {
+	if (!net_eq(net, &init_net))
+		return -EAFNOSUPPORT;
+
 	if (!sock)
 		return -EINVAL;