[11/18] vhost-user: add shutdown support

Message ID 1459509388-6185-12-git-send-email-marcandre.lureau@redhat.com
State New

Commit Message

Marc-André Lureau April 1, 2016, 11:16 a.m. UTC
From: Marc-André Lureau <marcandre.lureau@redhat.com>

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
---
 docs/specs/vhost-user.txt | 15 +++++++++++++++
 hw/virtio/vhost-user.c    | 16 ++++++++++++++++
 2 files changed, 31 insertions(+)

Comments

Yuanhan Liu April 13, 2016, 2:49 a.m. UTC | #1
Hi Marc,

First of all, sorry again for late response!

Last time I tried your first version, I found a few issues related
to reconnect, mainly the acked_features being lost. While checking your
new code, I found that you've already solved that, which is great.

So I tried harder this time; your patches work great, except that I
found a few nits.

On Fri, Apr 01, 2016 at 01:16:21PM +0200, marcandre.lureau@redhat.com wrote:
> From: Marc-André Lureau <marcandre.lureau@redhat.com>
...
> +Slave message types
> +-------------------
> +
> + * VHOST_USER_SLAVE_SHUTDOWN:
> +      Id: 1
> +      Master payload: N/A
> +      Slave payload: u64
> +
> +      Request the master to shutdown the slave. A 0 reply is for
> +      success, in which case the slave may close all connections
> +      immediately and quit.

Assume we are using OVS + DPDK here, where we could have two
vhost-user connections. When OVS initiates a restart, it might
unregister the two connections one by one. In that case, two
VHOST_USER_SLAVE_SHUTDOWN requests will be sent, and two replies
will be received. Therefore, I don't think it's proper to ask
the backend implementation to quit here.


>  
>      switch (msg.request) {
> +    case VHOST_USER_SLAVE_SHUTDOWN: {
> +        uint64_t success = 1; /* 0 is for success */
> +        if (dev->stop) {
> +            dev->stop(dev);
> +            success = 0;
> +        }
> +        msg.payload.u64 = success;
> +        msg.size = sizeof(msg.payload.u64);
> +        size = send(u->slave_fd, &msg, VHOST_USER_HDR_SIZE + msg.size, 0);
> +        if (size != VHOST_USER_HDR_SIZE + msg.size) {
> +            error_report("Failed to write reply.");
> +        }
> +        break;

You might want to remove the slave_fd from the watch list? We
might also need to close slave_fd here, assuming that we
will no longer use it once VHOST_USER_SLAVE_SHUTDOWN is
received?

I'm asking because I sometimes see a seg fault,
due to opaque being NULL.


	--yliu
Marc-Andre Lureau April 13, 2016, 9:51 a.m. UTC | #2
Hi

----- Original Message -----
> Hi Marc,
> 
> First of all, sorry again for late response!
> 
> Last time I tried with your first version, I found few issues related
> with reconnect, mainly on the acked_feautres lost. While checking your
> new code, I found that you've already solved that, which is great.
> 
> So, I tried harder this time, your patches work great, except that I
> found few nits.
> 
> On Fri, Apr 01, 2016 at 01:16:21PM +0200, marcandre.lureau@redhat.com wrote:
> > From: Marc-André Lureau <marcandre.lureau@redhat.com>
> ...
> > +Slave message types
> > +-------------------
> > +
> > + * VHOST_USER_SLAVE_SHUTDOWN:
> > +      Id: 1
> > +      Master payload: N/A
> > +      Slave payload: u64
> > +
> > +      Request the master to shutdown the slave. A 0 reply is for
> > +      success, in which case the slave may close all connections
> > +      immediately and quit.
> 
> Assume we are using ovs + dpdk here, that we could have two
> vhost-user connections. While ovs tries to initiate a restart,
> it might unregister the two connections one by one. In such
> case, two VHOST_USER_SLAVE_SHUTDOWN request will be sent,
> and two replies will get. Therefore, I don't think it's a
> proper ask here to let the backend implementation to do quit
> here.
> 

By the time of the success reply, the master has sent all the commands needed to finish the connection, so the slave must flush/finish all pending requests first. I think this should be enough; otherwise we may need a new explicit message?

> 
> >  
> >      switch (msg.request) {
> > +    case VHOST_USER_SLAVE_SHUTDOWN: {
> > +        uint64_t success = 1; /* 0 is for success */
> > +        if (dev->stop) {
> > +            dev->stop(dev);
> > +            success = 0;
> > +        }
> > +        msg.payload.u64 = success;
> > +        msg.size = sizeof(msg.payload.u64);
> > +        size = send(u->slave_fd, &msg, VHOST_USER_HDR_SIZE + msg.size, 0);
> > +        if (size != VHOST_USER_HDR_SIZE + msg.size) {
> > +            error_report("Failed to write reply.");
> > +        }
> > +        break;
> 
> You might want to remove the slave_fd from watch list? We
> might also need to close slave_fd here, assuming that we
> will no longer use it when VHOST_USER_SLAVE_SHUTDOWN is
> received?

Makes sense, I will change that in the next update.
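
For reference, a minimal sketch of what that change could look like in
the VHOST_USER_SLAVE_SHUTDOWN case of slave_read() (assuming the slave
channel fd is watched via qemu_set_fd_handler(); the exact registration
mechanism and placement are assumptions here):

    case VHOST_USER_SLAVE_SHUTDOWN: {
        uint64_t success = 1; /* 0 is for success */
        if (dev->stop) {
            dev->stop(dev);
            success = 0;
        }
        msg.payload.u64 = success;
        msg.size = sizeof(msg.payload.u64);
        size = send(u->slave_fd, &msg, VHOST_USER_HDR_SIZE + msg.size, 0);
        if (size != VHOST_USER_HDR_SIZE + msg.size) {
            error_report("Failed to write reply.");
        }
        /* The slave is about to close all connections and quit, so stop
         * watching the channel and release the fd; otherwise slave_read()
         * can fire again after the device has been cleaned up. */
        qemu_set_fd_handler(u->slave_fd, NULL, NULL, NULL);
        close(u->slave_fd);
        u->slave_fd = -1;
        break;
    }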

> I'm asking because I found a seg fault issue sometimes,
> due to opaque is NULL.
>

I would be interested to see the backtrace or have a reproducer.

thanks
Yuanhan Liu April 13, 2016, 5:32 p.m. UTC | #3
On Wed, Apr 13, 2016 at 05:51:15AM -0400, Marc-André Lureau wrote:
> Hi
> 
> ----- Original Message -----
> > Hi Marc,
> > 
> > First of all, sorry again for late response!
> > 
> > Last time I tried with your first version, I found few issues related
> > with reconnect, mainly on the acked_feautres lost. While checking your
> > new code, I found that you've already solved that, which is great.
> > 
> > So, I tried harder this time, your patches work great, except that I
> > found few nits.
> > 
> > On Fri, Apr 01, 2016 at 01:16:21PM +0200, marcandre.lureau@redhat.com wrote:
> > > From: Marc-André Lureau <marcandre.lureau@redhat.com>
> > ...
> > > +Slave message types
> > > +-------------------
> > > +
> > > + * VHOST_USER_SLAVE_SHUTDOWN:
> > > +      Id: 1
> > > +      Master payload: N/A
> > > +      Slave payload: u64
> > > +
> > > +      Request the master to shutdown the slave. A 0 reply is for
> > > +      success, in which case the slave may close all connections
> > > +      immediately and quit.
> > 
> > Assume we are using ovs + dpdk here, that we could have two
> > vhost-user connections. While ovs tries to initiate a restart,
> > it might unregister the two connections one by one. In such
> > case, two VHOST_USER_SLAVE_SHUTDOWN request will be sent,
> > and two replies will get. Therefore, I don't think it's a
> > proper ask here to let the backend implementation to do quit
> > here.
> > 
> 
> On success reply, the master sent all the commands to finish the connection. So the slave must flush/finish all pending requests first.

Yes, that's okay. Here I just mean the "close __all__ connections"
and "quit" part.

Firstly, we should do the cleanup/flush/finish for its own connection,
but not all of them, right?

Second, as stated, quitting might not make sense, as we may
have more connections.

> I think this should be enough, otherwise we may need a new explicit message?
> 
> > 
> > >  
> > >      switch (msg.request) {
> > > +    case VHOST_USER_SLAVE_SHUTDOWN: {
> > > +        uint64_t success = 1; /* 0 is for success */
> > > +        if (dev->stop) {
> > > +            dev->stop(dev);
> > > +            success = 0;
> > > +        }
> > > +        msg.payload.u64 = success;
> > > +        msg.size = sizeof(msg.payload.u64);
> > > +        size = send(u->slave_fd, &msg, VHOST_USER_HDR_SIZE + msg.size, 0);
> > > +        if (size != VHOST_USER_HDR_SIZE + msg.size) {
> > > +            error_report("Failed to write reply.");
> > > +        }
> > > +        break;
> > 
> > You might want to remove the slave_fd from watch list? We
> > might also need to close slave_fd here, assuming that we
> > will no longer use it when VHOST_USER_SLAVE_SHUTDOWN is
> > received?
> 
> Makes sense, I will change that in next update.
> 
> > I'm asking because I found a seg fault issue sometimes,
> > due to opaque is NULL.

Oh, I was wrong: it's u that is NULL, not opaque.
> >
> 
> I would be interested to see the backtrace or have a reproducer.

They're just the normal test steps: start a vhost-user switch (I'm using
the DPDK vhost-switch example), kill it, and wait for a while (something
like 10s or even longer); then I see a seg fault:

    (gdb) p dev
    $4 = (struct vhost_dev *) 0x555556571bf0
    (gdb) p u
    $5 = (struct vhost_user *) 0x0
    (gdb) where
    #0  0x0000555555798612 in slave_read (opaque=0x555556571bf0)
        at /home/yliu/qemu/hw/virtio/vhost-user.c:539
    #1  0x0000555555a343a4 in aio_dispatch (ctx=0x55555655f560) at /home/yliu/qemu/aio-posix.c:327
    #2  0x0000555555a2738b in aio_ctx_dispatch (source=0x55555655f560, callback=0x0, user_data=0x0)
        at /home/yliu/qemu/async.c:233
    #3  0x00007ffff51032a6 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
    #4  0x0000555555a3239e in glib_pollfds_poll () at /home/yliu/qemu/main-loop.c:213
    #5  0x0000555555a3247b in os_host_main_loop_wait (timeout=29875848) at /home/yliu/qemu/main-loop.c:258
    #6  0x0000555555a3252b in main_loop_wait (nonblocking=0) at /home/yliu/qemu/main-loop.c:506
    #7  0x0000555555846e35 in main_loop () at /home/yliu/qemu/vl.c:1934
    #8  0x000055555584e6bf in main (argc=31, argv=0x7fffffffe078, envp=0x7fffffffe178)
        at /home/yliu/qemu/vl.c:4658


	--yliu
Marc-André Lureau April 13, 2016, 9:43 p.m. UTC | #4
On Wed, Apr 13, 2016 at 7:32 PM, Yuanhan Liu
<yuanhan.liu@linux.intel.com> wrote:
>>
>> > I'm asking because I found a seg fault issue sometimes,
>> > due to opaque is NULL.
>
> Oh, I was wrong, it's u being NULL, but not opaque.
>> >
>>
>> I would be interested to see the backtrace or have a reproducer.
>
> It's a normal test steps: start a vhost-user switch (I'm using DPDK
> vhost-switch example), kill it, and wait for a while (something like
> more than 10s or even longer), then I saw a seg fault:
>
>     (gdb) p dev
>     $4 = (struct vhost_dev *) 0x555556571bf0
>     (gdb) p u
>     $5 = (struct vhost_user *) 0x0
>     (gdb) where
>     #0  0x0000555555798612 in slave_read (opaque=0x555556571bf0)
>         at /home/yliu/qemu/hw/virtio/vhost-user.c:539
>     #1  0x0000555555a343a4 in aio_dispatch (ctx=0x55555655f560) at /home/yliu/qemu/aio-posix.c:327
>     #2  0x0000555555a2738b in aio_ctx_dispatch (source=0x55555655f560, callback=0x0, user_data=0x0)
>         at /home/yliu/qemu/async.c:233
>     #3  0x00007ffff51032a6 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
>     #4  0x0000555555a3239e in glib_pollfds_poll () at /home/yliu/qemu/main-loop.c:213
>     #5  0x0000555555a3247b in os_host_main_loop_wait (timeout=29875848) at /home/yliu/qemu/main-loop.c:258
>     #6  0x0000555555a3252b in main_loop_wait (nonblocking=0) at /home/yliu/qemu/main-loop.c:506
>     #7  0x0000555555846e35 in main_loop () at /home/yliu/qemu/vl.c:1934
>     #8  0x000055555584e6bf in main (argc=31, argv=0x7fffffffe078, envp=0x7fffffffe178)
>         at /home/yliu/qemu/vl.c:4658
>

This patch set doesn't try to handle crashes from the backend. That would
require a much more detailed study of the existing code paths; a lot of
places assume the backend is fully working as expected. I think
handling backend crashes should be a different, later patch set.
Yuanhan Liu April 13, 2016, 9:57 p.m. UTC | #5
On Wed, Apr 13, 2016 at 11:43:56PM +0200, Marc-André Lureau wrote:
> On Wed, Apr 13, 2016 at 7:32 PM, Yuanhan Liu
> <yuanhan.liu@linux.intel.com> wrote:
> >>
> >> > I'm asking because I found a seg fault issue sometimes,
> >> > due to opaque is NULL.
> >
> > Oh, I was wrong, it's u being NULL, but not opaque.
> >> >
> >>
> >> I would be interested to see the backtrace or have a reproducer.
> >
> > It's a normal test steps: start a vhost-user switch (I'm using DPDK
> > vhost-switch example), kill it, and wait for a while (something like
> > more than 10s or even longer), then I saw a seg fault:
> >
> >     (gdb) p dev
> >     $4 = (struct vhost_dev *) 0x555556571bf0
> >     (gdb) p u
> >     $5 = (struct vhost_user *) 0x0
> >     (gdb) where
> >     #0  0x0000555555798612 in slave_read (opaque=0x555556571bf0)
> >         at /home/yliu/qemu/hw/virtio/vhost-user.c:539
> >     #1  0x0000555555a343a4 in aio_dispatch (ctx=0x55555655f560) at /home/yliu/qemu/aio-posix.c:327
> >     #2  0x0000555555a2738b in aio_ctx_dispatch (source=0x55555655f560, callback=0x0, user_data=0x0)
> >         at /home/yliu/qemu/async.c:233
> >     #3  0x00007ffff51032a6 in g_main_context_dispatch () from /lib64/libglib-2.0.so.0
> >     #4  0x0000555555a3239e in glib_pollfds_poll () at /home/yliu/qemu/main-loop.c:213
> >     #5  0x0000555555a3247b in os_host_main_loop_wait (timeout=29875848) at /home/yliu/qemu/main-loop.c:258
> >     #6  0x0000555555a3252b in main_loop_wait (nonblocking=0) at /home/yliu/qemu/main-loop.c:506
> >     #7  0x0000555555846e35 in main_loop () at /home/yliu/qemu/vl.c:1934
> >     #8  0x000055555584e6bf in main (argc=31, argv=0x7fffffffe078, envp=0x7fffffffe178)
> >         at /home/yliu/qemu/vl.c:4658
> >
> 
> This patch set doesn't try to handle crashes from backend. This would
> require a much more detailed study of the existing code path. A lot of
> places assume the backend is fully working as expected. I think
> handling backend crashes should be a different, later, patch set.

Oh, sorry for not making it clear. I actually did the kill with "ctrl-c".
That is then caught and triggers a SLAVE_SHUTDOWN request. So I would say
it's a normal quit.

	--yliu
Yuanhan Liu April 28, 2016, 5:23 a.m. UTC | #6
On Fri, Apr 01, 2016 at 01:16:21PM +0200, marcandre.lureau@redhat.com wrote:
> From: Marc-André Lureau <marcandre.lureau@redhat.com>
> 
> Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
> ---
>  docs/specs/vhost-user.txt | 15 +++++++++++++++
>  hw/virtio/vhost-user.c    | 16 ++++++++++++++++
>  2 files changed, 31 insertions(+)
> 
> diff --git a/docs/specs/vhost-user.txt b/docs/specs/vhost-user.txt
> index 8a635fa..60d6d13 100644
> --- a/docs/specs/vhost-user.txt
> +++ b/docs/specs/vhost-user.txt
> @@ -487,3 +487,18 @@ Message types
>        request to the master. It is passed in the ancillary data.
>        This message is only sent if VHOST_USER_PROTOCOL_F_SLAVE_CHANNEL
>        feature is available.
> +
> +Slave message types
> +-------------------
> +
> + * VHOST_USER_SLAVE_SHUTDOWN:
> +      Id: 1
> +      Master payload: N/A
> +      Slave payload: u64
> +
> +      Request the master to shutdown the slave. A 0 reply is for
> +      success, in which case the slave may close all connections
> +      immediately and quit. A non-zero reply cancels the request.
> +
> +      Before a reply comes, the master may make other requests in
> +      order to flush or sync state.

Hi all,

I showed this proposal as well as DPDK's implementation to our customers
(OVS, OpenStack and some other teams) who made such requests before. I'm
sorry to say that none of them really liked that we can't handle crashes.
Making reconnect work after a vhost-user backend crash is exactly what
they are after.

To handle the crash, I was thinking of the proposal from Michael,
that is, to do a reset from the guest OS. This would fix the issue
ultimately. However, old kernels will not benefit from this, nor will
guests other than Linux, making it not that useful for current
usage.

Considering that VHOST_USER_SLAVE_SHUTDOWN just gives QEMU a chance
to get the vring base (last used idx) from the backend, Huawei suggests
that we could still reach a consistent state after the crash if
we get the vring base from vring->used->idx. That worked as expected
in my test. The only tricky thing might be how to detect a crash;
we could do a simple comparison of the vring base from QEMU with
vring->used->idx at the initialization stage. If a mismatch is found,
take the base from vring->used->idx instead.
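
For illustration only, here is a rough sketch of the kind of backend-side
check described above (the types and field names are made up for the
example; this is not actual DPDK code):

    #include <stdint.h>

    struct vring_used_hdr {
        uint16_t flags;
        uint16_t idx;              /* guest-visible used index */
    };

    struct vq_state {
        struct vring_used_hdr *used;
        uint16_t last_used_idx;    /* base restored via SET_VRING_BASE */
        uint16_t last_avail_idx;
    };

    /* If the restored vring base does not match what the guest-visible
     * used ring says, assume an unclean shutdown happened and trust
     * used->idx instead.  Only safe if descriptors are handled in order. */
    static void fixup_vring_base(struct vq_state *vq)
    {
        uint16_t used_idx = vq->used->idx;

        if (vq->last_used_idx != used_idx) {
            vq->last_used_idx  = used_idx;
            vq->last_avail_idx = used_idx;
        }
    }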

Comments/thoughts? Is it a solid enough solution for you? If so, we
could make things much simpler, and, most importantly, we would
be able to handle crashes.

	--yliu
Marc-André Lureau April 29, 2016, 10:40 a.m. UTC | #7
Hi

On Thu, Apr 28, 2016 at 7:23 AM, Yuanhan Liu
<yuanhan.liu@linux.intel.com> wrote:
> On Fri, Apr 01, 2016 at 01:16:21PM +0200, marcandre.lureau@redhat.com wrote:
>> From: Marc-André Lureau <marcandre.lureau@redhat.com>
>>
>> Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
>> ---
>>  docs/specs/vhost-user.txt | 15 +++++++++++++++
>>  hw/virtio/vhost-user.c    | 16 ++++++++++++++++
>>  2 files changed, 31 insertions(+)
>>
>> diff --git a/docs/specs/vhost-user.txt b/docs/specs/vhost-user.txt
>> index 8a635fa..60d6d13 100644
>> --- a/docs/specs/vhost-user.txt
>> +++ b/docs/specs/vhost-user.txt
>> @@ -487,3 +487,18 @@ Message types
>>        request to the master. It is passed in the ancillary data.
>>        This message is only sent if VHOST_USER_PROTOCOL_F_SLAVE_CHANNEL
>>        feature is available.
>> +
>> +Slave message types
>> +-------------------
>> +
>> + * VHOST_USER_SLAVE_SHUTDOWN:
>> +      Id: 1
>> +      Master payload: N/A
>> +      Slave payload: u64
>> +
>> +      Request the master to shutdown the slave. A 0 reply is for
>> +      success, in which case the slave may close all connections
>> +      immediately and quit. A non-zero reply cancels the request.
>> +
>> +      Before a reply comes, the master may make other requests in
>> +      order to flush or sync state.
>
> Hi all,
>
> I threw this proposal as well as DPDK's implementation to our customer
> (OVS, Openstack and some other teams) who made such request before. I'm
> sorry to say that none of them really liked that we can't handle crash.
> Making reconnect work from a vhost-user backend crash is exactly something
> they are after.

Handling crashes is not quite the same as what I propose here. I see
it as a different goal. But I doubt the usefulness and reliability
of a backend that crashes. In many ways, vhost-user was designed after
kernel vhost, and the qemu code has the same expectations of the kernel
or the vhost-user backend: many calls are synchronous and will simply
assert() on unexpected results.

> And to handle the crash, I was thinking of the proposal from Michael.
> That is to do reset from the guest OS. This would fix this issue
> ultimately. However, old kernel will not benefit from this, as well
> as other guest other than Linux, making it not that useful for current
> usage.

Yes, I hope Michael can help with that, I am not very familiar with
the kernel code.

> Thinking of that the VHOST_USER_SLAVE_SHUTDOWN just gives QEMU a chance
> to get the vring base (last used idx) from the backend, Huawei suggests

Right, but after this message, the backend should have flushed all
pending ring packets and stopped processing them. So it's also a clean
sync point.

> that we could still make it in a consistent state after the crash, if
> we get the vring base from vring->used->idx.  That worked as expected

You can have a backend that has already processed packets but
not updated the used idx. You could also have out-of-order packets in
flight (e.g. desc 1-2-3 avail, 1-3 used, 2 pending...). I can't see a
clean way to restore this other than resetting the queues and starting
over, with either packet loss or packet duplication. If the backend
guarantees to process packets in order, that may simplify things, but it
would be a special case.

> from my test. The only tricky thing might be how to detect a crash,
> and we could do a simple compare of the vring base from QEMU with
> the vring->used->idx at the initiation stage. If mismatch found, get
> it from vring->used->idx instead.

I don't follow, would the backend restore its last vring->used->idx
after a crash?

> Comments/thoughts? Is it a solid enough solution to you?  If so, we
> could make things much simpler, and what's most important, we could
> be able to handle crash.

That's not so easy: many of the vhost_ops->vhost*() calls are followed by
assert(r>0), which will be hard to change to handle failure. But I
would worry first that a backend that crashes may also corrupt the
VM memory...
Yuanhan Liu April 29, 2016, 5:48 p.m. UTC | #8
On Fri, Apr 29, 2016 at 12:40:09PM +0200, Marc-André Lureau wrote:
> Hi
> 
> On Thu, Apr 28, 2016 at 7:23 AM, Yuanhan Liu
> <yuanhan.liu@linux.intel.com> wrote:
> > On Fri, Apr 01, 2016 at 01:16:21PM +0200, marcandre.lureau@redhat.com wrote:
> >> From: Marc-André Lureau <marcandre.lureau@redhat.com>
> >> +Slave message types
> >> +-------------------
> >> +
> >> + * VHOST_USER_SLAVE_SHUTDOWN:
> >> +      Id: 1
> >> +      Master payload: N/A
> >> +      Slave payload: u64
> >> +
> >> +      Request the master to shutdown the slave. A 0 reply is for
> >> +      success, in which case the slave may close all connections
> >> +      immediately and quit. A non-zero reply cancels the request.
> >> +
> >> +      Before a reply comes, the master may make other requests in
> >> +      order to flush or sync state.
> >
> > Hi all,
> >
> > I threw this proposal as well as DPDK's implementation to our customer
> > (OVS, Openstack and some other teams) who made such request before. I'm
> > sorry to say that none of them really liked that we can't handle crash.
> > Making reconnect work from a vhost-user backend crash is exactly something
> > they are after.
> 
> Handling crashes is not quite the same as what I propose here.

Yes, I know. However, handling crashes is exactly what our customers
want. And I just want to let you know that I'm not asking you to
do that :)

> I see
> it as a different goal. But I doubt about usefulness and reliability
> of a backend that crashes.

Agreed with you on that. However, I guess you have to admit that crashes
just happen. The kernel sometimes crashes, too. So it would be great if
we could make the whole stack work again after an unexpected crash (say,
of OVS), without restarting all guests.

> In many case, vhost-user was designed after
> kernel vhost, and qemu code has the same expectation about the kernel
> or the vhost-user backend: many calls are sync and will simply
> assert() on unexpected results.

I guess we could at least try to diminish it, if we can't avoid it completely.

> > And to handle the crash, I was thinking of the proposal from Michael.
> > That is to do reset from the guest OS. This would fix this issue
> > ultimately. However, old kernel will not benefit from this, as well
> > as other guest other than Linux, making it not that useful for current
> > usage.
> 
> Yes, I hope Michael can help with that, I am not very familiar with
> the kernel code.
> 
> > Thinking of that the VHOST_USER_SLAVE_SHUTDOWN just gives QEMU a chance
> > to get the vring base (last used idx) from the backend, Huawei suggests
> 
> Right, but after this message, the backend should have flushed all
> pending ring packets and stop processing them. So it's also a clean
> sync point.
> 
> > that we could still make it in a consistent state after the crash, if
> > we get the vring base from vring->used->idx.  That worked as expected
> 
> You can have a backend that would have already processed packets and
> not updated used idx. You could also have out-of-order packets in
> flights (ex: desc 1-2-3 avail, 1-3 used, 2 pending..). I can't see a
> clean way to restore this, but to reset the queues and start over,
> with either packet loss or packet duplication.

Given that it (a crash or restart) happens so rarely, I don't think
it matters. Moreover, doesn't that happen in the real world? :)

> If the backend
> guarantees to process packets in order, it may simplify things, but it
> would be a special case.

Well, it's more of a backend thing: it's up to the backend to try to
set a saner vring base, as stated in the proposal above. Therefore, I
would not say it's a special case.

> 
> > from my test. The only tricky thing might be how to detect a crash,
> > and we could do a simple compare of the vring base from QEMU with
> > the vring->used->idx at the initiation stage. If mismatch found, get
> > it from vring->used->idx instead.
> 
> I don't follow, would the backend restore its last vring->used->idx
> after a crash?

Yes, it restores from the SET_VRING_BASE sent by QEMU. But that's a
stale value, normally 0 if no start/stop happened before. Therefore,
we can't use it after the crash; instead, we could try to detect the
mismatch and fix it at the SET_VRING_ADDR request.

> 
> > Comments/thoughts? Is it a solid enough solution to you?  If so, we
> > could make things much simpler, and what's most important, we could
> > be able to handle crash.
> 
> That's not so easy, many of the vhost_ops->vhost*() are followed by
> assert(r>0), which will be hard to change to handle failure. But, I
> would worry first about a backend that crashes that it may corrupt the
> VM memory too...

Not quite sure I follow this. But the backend only touches the
virtio-related memory, so it will do no harm to the VM?

	--yliu
Michael S. Tsirkin May 1, 2016, 11:37 a.m. UTC | #9
On Fri, Apr 29, 2016 at 10:48:35AM -0700, Yuanhan Liu wrote:
> > But, I
> > would worry first about a backend that crashes that it may corrupt the
> > VM memory too...
> 
> Not quite sure I follow this. But, backend just touches the virtio
> related memory, so it will do no harm to the VM?

It crashed so apparently it's not behaving as designed -
how can we be sure it does not harm the VM?
Yuanhan Liu May 1, 2016, 9:04 p.m. UTC | #10
On Sun, May 01, 2016 at 02:37:19PM +0300, Michael S. Tsirkin wrote:
> On Fri, Apr 29, 2016 at 10:48:35AM -0700, Yuanhan Liu wrote:
> > > But, I
> > > would worry first about a backend that crashes that it may corrupt the
> > > VM memory too...
> > 
> > Not quite sure I follow this. But, backend just touches the virtio
> > related memory, so it will do no harm to the VM?
> 
> It crashed so apparently it's not behaving as designed -
> how can we be sure it does not harm the VM?

Hi Michael,

From what I know so far, a crash could corrupt the virtio memory in two
ways:

- vring_used_elem might be in a stale state when we are in the middle of
  updating the used vring. Say, we might have updated the "id" field but
  left "len" untouched.

- a vring desc buffer might also be in a stale state, when we are copying
  data into it.

However, the two issues will not be real issues, as used->idx is not
updated in either case. Therefore, that corrupted memory will not be
processed by the guest. So, no harm, I'd say. Or am I missing anything?
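
For clarity, a simplified sketch of the used-ring update order that this
argument relies on (generic virtio-style code with illustrative names,
not taken from a particular backend):

    #include <stdint.h>

    struct vring_used_elem { uint32_t id; uint32_t len; };

    struct vring_used {
        uint16_t flags;
        uint16_t idx;                      /* published to the guest */
        struct vring_used_elem ring[];
    };

    /* Fill in the used element first, make it visible with a write
     * barrier, and only then bump used->idx.  A crash before the final
     * store leaves used->idx unchanged, so the guest never consumes the
     * half-written element. */
    static void push_used(struct vring_used *used, unsigned num,
                          uint32_t desc_id, uint32_t len)
    {
        uint16_t slot = used->idx % num;

        used->ring[slot].id  = desc_id;    /* may be left stale by a crash */
        used->ring[slot].len = len;        /* may be left stale by a crash */

        __sync_synchronize();              /* write barrier before publish */

        used->idx++;                       /* guest sees the element only now */
    }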

BTW, Michael, what's your thoughts on the whole reconnect stuff?

	--yliu
Michael S. Tsirkin May 2, 2016, 10:45 a.m. UTC | #11
On Sun, May 01, 2016 at 02:04:42PM -0700, Yuanhan Liu wrote:
> On Sun, May 01, 2016 at 02:37:19PM +0300, Michael S. Tsirkin wrote:
> > On Fri, Apr 29, 2016 at 10:48:35AM -0700, Yuanhan Liu wrote:
> > > > But, I
> > > > would worry first about a backend that crashes that it may corrupt the
> > > > VM memory too...
> > > 
> > > Not quite sure I follow this. But, backend just touches the virtio
> > > related memory, so it will do no harm to the VM?
> > 
> > It crashed so apparently it's not behaving as designed -
> > how can we be sure it does not harm the VM?
> 
> Hi Michael,
> 
> What I know so far, a crash might could corrupt the virtio memory in two
> ways:
> 
> - vring_used_elem might be in stale state when we are in the middle of
>   updating used vring. Say, we might have updated the "id" field, but
>   leaving "len" untouched.
> 
> - vring desc buff might also be in stale state, when we are copying
>   data into it.


- a random write into VM memory due to a backend bug corrupts it.

> However, the two issues will not be real issue, as used->idx is not
> updated in both case. Thefore, those corrupted memory will not be
> processed by guest. So, no harm I'd say. Or, am I missing anything?

Why is the backend crashing? It shouldn't, so maybe this means its
memory is corrupt. That is the claim.

> BTW, Michael, what's your thoughts on the whole reconnect stuff?
> 
> 	--yliu

My thoughts are that we need to split these patchsets up.

There are several issues here:


1. Graceful disconnect
- One should be able to do vmstop, disconnect, connect then vm start
  This looks like a nice intermediate step.
- Why would we always assume it's always remote initiating the disconnect?
  Monitor commands for disconnecting seem called for.
  

2. Unexpected disconnect
- If buffers are processed in order or flushed before socket close,
  then on disconnect status can be recovered
  from ring alone. If that is of interest, we might want to add
  a protocol flag for that. F_DISCONNECT_FLUSH ? Without this,
  only graceful disconnect or reset with guest's help can be supported.
- As Marc-André said, without graceful shutdown it is not enough to
  handle ring state though.  We must handle socket getting disconnected
  in the middle of send/receive.  While it's more work,
  it does seem like a nice interface to support.
- I understand the concern that existing guests do not handle NEED_RESET
  status. One way to fix this would be a new feature bit. If NEED_RESET not
  handled, request hot-unplug instead.

3. Running while disconnected
- At the moment, we wait on vm start for remote to connect,
  if we support disconnecting backend without stopping
  we probably should also support starting it without waiting
  for connection
- Guests expect tx buffers to be used in a timely manner, thus:
- If vm is running we must use them in qemu, maybe discarding packets
  in the process.
  There already are commands for link down, I'm not sure there's value
  in doing this automatically in qemu.
- Alternatively, we should also have an option to stop vm automatically (like we do on
  disk errors) to reduce number of dropped packets.

4. Reconnecting
- When acting as a server, we might want an option to go back to
  listening state, but it should not be the only option,
  there should be a monitor command for moving it back to
  listening state.
- When acting as a client, auto-reconnect might be handy sometimes, but should not be the only
  option, there should be a way to manually request connect, possibly to
  a different target, so a monitor command for re-connecting seems called for.
- We'll also need monitor events for disconnects so management knows it
  must re-connect/restart listening.
- If we stopped VM, there could be an option to restart on reconnect.
Marc-André Lureau May 2, 2016, 11:29 a.m. UTC | #12
Hi

On Mon, May 2, 2016 at 12:45 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> 1. Graceful disconnect
> - One should be able to do vmstop, disconnect, connect then vm start
>   This looks like a nice intermediate step.
> - Why would we always assume it's always remote initiating the disconnect?
>   Monitor commands for disconnecting seem called for.

Those two solutions involve VM management. It looks more complex to
communicate/synchronize between the vhost-user backend, VM management
and qemu. The use case I cover is a request from the backend to shut
down, because that's what the users wanted (it is already possible to
stop the backend and disconnect it from qemu; we would only need to
know when, with a new command..)

>
> 2. Unexpected disconnect
> - If buffers are processed in order or flushed before socket close,
>   then on disconnect status can be recovered
>   from ring alone. If that is of interest, we might want to add
>   a protocol flag for that. F_DISCONNECT_FLUSH ? Without this,
>   only graceful disconnect or reset with guest's help can be supported.

(doing all this, at this point, I don't see much difference from
having an explicit shutdown)

> - As Marc-André said, without graceful shutdown it is not enough to
>   handle ring state though.  We must handle socket getting disconnected
>   in the middle of send/receive.  While it's more work,
>   it does seem like a nice interface to support.
> - I understand the concern that existing guests do not handle NEED_RESET
>   status. One way to fix this would be a new feature bit. If NEED_RESET not
>   handled, request hot-unplug instead.
>
> 3. Running while disconnected
> - At the moment, we wait on vm start for remote to connect,
>   if we support disconnecting backend without stopping
>   we probably should also support starting it without waiting
>   for connection

That's what Tetsuya proposed in his initial series, but handling the
flags was quite tedious. I think this can easily be considered a
separate enhancement. What I proposed is to keep waiting for the
initial connect, and check that the flags remain compatible on reconnect.

> - Guests expect tx buffers to be used in a timely manner, thus:
> - If vm is running we must use them in qemu, maybe discarding packets
>   in the process.
>   There already are commands for link down, I'm not sure there's value
>   in doing this automatically in qemu.

Testing doesn't show such buffer issues when the link is down (this
can be tested with the vubr example mentioned above).

> - Alternatively, we should also have an option to stop vm automatically (like we do on
>   disk errors) to reduce number of dropped packets.

Why not, we would need to know if this is actually useful.

>
> 4. Reconnecting
> - When acting as a server, we might want an option to go back to
>   listening state, but it should not be the only option,
>   there should be a monitor command for moving it back to
>   listening state.
> - When acting as a client, auto-reconnect might be handy sometimes, but should not be the only
>   option, there should be a way to manually request connect, possibly to
>   a different target, so a monitor command for re-connecting seems called for.
> - We'll also need monitor events for disconnects so management knows it
>   must re-connect/restart listening.
> - If we stopped VM, there could be an option to restart on reconnect.

That all involves a third party, adding complexity, but the benefit
isn't so clear.
Michael S. Tsirkin May 2, 2016, 12:04 p.m. UTC | #13
On Mon, May 02, 2016 at 01:29:18PM +0200, Marc-André Lureau wrote:
> Hi
> 
> On Mon, May 2, 2016 at 12:45 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 1. Graceful disconnect
> > - One should be able to do vmstop, disconnect, connect then vm start
> >   This looks like a nice intermediate step.
> > - Why would we always assume it's always remote initiating the disconnect?
> >   Monitor commands for disconnecting seem called for.
> 
> Those two solutions involve VM management. This looks more complex to
> communicate/synchronize vhost-user backend & vm management & qemu. The
> use case I cover is request from the backend to shutdown,

Right, but flushing buffers + closing the socket looks like
a cleaner interface than a ton of messages going back and forth.

> because
> that's what the users wanted (and it is already possible to stop the
> backend and disconnect it from qemu, we would only need to know when,
> with a new command..)

You assume the backend has a monitor interface to disconnect though.
That's not a given.

> >
> > 2. Unexpected disconnect
> > - If buffers are processed in order or flushed before socket close,
> >   then on disconnect status can be recovered
> >   from ring alone. If that is of interest, we might want to add
> >   a protocol flag for that. F_DISCONNECT_FLUSH ? Without this,
> >   only graceful disconnect or reset with guest's help can be supported.
> 
> (doing all this, at this point, I don't see much difference with
> having an explicit shutdown)

With a graceful shutdown we implicitly request a flush when we ask
the backend to stop.

> > - As Marc-André said, without graceful shutdown it is not enough to
> >   handle ring state though.  We must handle socket getting disconnected
> >   in the middle of send/receive.  While it's more work,
> >   it does seem like a nice interface to support.
> > - I understand the concern that existing guests do not handle NEED_RESET
> >   status. One way to fix this would be a new feature bit. If NEED_RESET not
> >   handled, request hot-unplug instead.
> >
> > 3. Running while disconnected
> > - At the moment, we wait on vm start for remote to connect,
> >   if we support disconnecting backend without stopping
> >   we probably should also support starting it without waiting
> >   for connection
> 
> That's what Tetsuya proposed in its initial series, but handling the
> flags was quite tedious.

That would be up to management. E.g. let the backend export the list
of flags it supports in some file, and apply it to QEMU.

> I think this can be considered easily a
> seperate enhancement. What I proposed is to keep waiting for the
> initial connect, and check the flags remains compatible on reconnect.

Seems asymmetrical unless we stop the vm.

> > - Guests expect tx buffers to be used in a timely manner, thus:
> > - If vm is running we must use them in qemu, maybe discarding packets
> >   in the process.
> >   There already are commands for link down, I'm not sure there's value
> >   in doing this automatically in qemu.
> 
> Testing doesn't show such buffer issues when the link is down (this
> can be tested with vubr example above)

Not enough testing then - it's a race condition: buffers can be sent
before link down.

> > - Alternatively, we should also have an option to stop vm automatically (like we do on
> >   disk errors) to reduce number of dropped packets.
> 
> Why not, we would need to know if this is actually useful.
> 
> >
> > 4. Reconnecting
> > - When acting as a server, we might want an option to go back to
> >   listening state, but it should not be the only option,
> >   there should be a monitor command for moving it back to
> >   listening state.
> > - When acting as a client, auto-reconnect might be handy sometimes, but should not be the only
> >   option, there should be a way to manually request connect, possibly to
> >   a different target, so a monitor command for re-connecting seems called for.
> > - We'll also need monitor events for disconnects so management knows it
> >   must re-connect/restart listening.
> > - If we stopped VM, there could be an option to restart on reconnect.
> 
> That's all involving a third party, adding complexity but the benefit
> isn't so clear.

It's rather clear to me. Let's assume you want to restart the bridge,
so you trigger a disconnect.
If qemu auto-reconnects, there's a race, as it might re-connect
to the old bridge.
Management is required to make this robust; auto-reconnect
is handy for people bypassing management.

> -- 
> Marc-André Lureau
Yuanhan Liu May 2, 2016, 5:37 p.m. UTC | #14
On Mon, May 02, 2016 at 01:45:31PM +0300, Michael S. Tsirkin wrote:
> On Sun, May 01, 2016 at 02:04:42PM -0700, Yuanhan Liu wrote:
> > On Sun, May 01, 2016 at 02:37:19PM +0300, Michael S. Tsirkin wrote:
> > > On Fri, Apr 29, 2016 at 10:48:35AM -0700, Yuanhan Liu wrote:
> > > > > But, I
> > > > > would worry first about a backend that crashes that it may corrupt the
> > > > > VM memory too...
> > > > 
> > > > Not quite sure I follow this. But, backend just touches the virtio
> > > > related memory, so it will do no harm to the VM?
> > > 
> > > It crashed so apparently it's not behaving as designed -
> > > how can we be sure it does not harm the VM?
> > 
> > Hi Michael,
> > 
> > What I know so far, a crash might could corrupt the virtio memory in two
> > ways:
> > 
> > - vring_used_elem might be in stale state when we are in the middle of
> >   updating used vring. Say, we might have updated the "id" field, but
> >   leaving "len" untouched.
> > 
> > - vring desc buff might also be in stale state, when we are copying
> >   data into it.
> 
> 
> - a random write into VM memory due to backend bug corrupts it.
> 
> > However, the two issues will not be real issue, as used->idx is not
> > updated in both case. Thefore, those corrupted memory will not be
> > processed by guest. So, no harm I'd say. Or, am I missing anything?
> 
> Why is backend crashing? It shouldn't so maybe this means it's
> memory is corrupt. That is the claim.

As far as virtio memory is not corrupted (or even corrupt in above
ways), there would be no issue. But, yes, you made a good point: we
make no guarantees that it's not the virtio memory corruption caused
the crash.

> > BTW, Michael, what's your thoughts on the whole reconnect stuff?
> > 
> > 	--yliu
> 
> My thoughts are that we need to split these patchsets up.
> 
> There are several issues here:
> 
> 
> 1. Graceful disconnect
> - One should be able to do vmstop, disconnect, connect then vm start
>   This looks like a nice intermediate step.
> - Why would we always assume it's always remote initiating the disconnect?
>   Monitor commands for disconnecting seem called for.

A monitor command is a good suggestion. But I'm wondering why vmstop is
necessary. Basically, a disconnect is like unplugging the cable of a
physical NIC; we don't need to pause before unplugging it.

> 
> 2. Unexpected disconnect
> - If buffers are processed in order or flushed before socket close,
>   then on disconnect status can be recovered
>   from ring alone. If that is of interest, we might want to add
>   a protocol flag for that. F_DISCONNECT_FLUSH ?

Sorry, how is this flag supposed to work here?

> Without this,
>   only graceful disconnect or reset with guest's help can be supported.
> - As Marc-André said, without graceful shutdown it is not enough to
>   handle ring state though.  We must handle socket getting disconnected
>   in the middle of send/receive.  While it's more work,
>   it does seem like a nice interface to support.

Same as above: what can the backend (or QEMU) do in this case without
the guest's (reset) help?


> - I understand the concern that existing guests do not handle NEED_RESET
>   status. One way to fix this would be a new feature bit. If NEED_RESET not
>   handled,

I could check the code myself, but it might be trivial
for you to answer my question: how do we know that NEED_RESET is not
handled?

> request hot-unplug instead.

Same as above, may I know how to request a hot-unplug?

> 
> 3. Running while disconnected
> - At the moment, we wait on vm start for remote to connect,
>   if we support disconnecting backend without stopping
>   we probably should also support starting it without waiting
>   for connection

Agreed. I guess that would answer my first question: we don't actually
need to do vmstop before disconnect.

	--yliu

> - Guests expect tx buffers to be used in a timely manner, thus:
> - If vm is running we must use them in qemu, maybe discarding packets
>   in the process.
>   There already are commands for link down, I'm not sure there's value
>   in doing this automatically in qemu.
> - Alternatively, we should also have an option to stop vm automatically (like we do on
>   disk errors) to reduce number of dropped packets.
> 
> 4. Reconnecting
> - When acting as a server, we might want an option to go back to
>   listening state, but it should not be the only option,
>   there should be a monitor command for moving it back to
>   listening state.
> - When acting as a client, auto-reconnect might be handy sometimes, but should not be the only
>   option, there should be a way to manually request connect, possibly to
>   a different target, so a monitor command for re-connecting seems called for.
> - We'll also need monitor events for disconnects so management knows it
>   must re-connect/restart listening.
> - If we stopped VM, there could be an option to restart on reconnect.
> 
> 
> -- 
> MST
Yuanhan Liu May 2, 2016, 5:50 p.m. UTC | #15
On Mon, May 02, 2016 at 03:04:30PM +0300, Michael S. Tsirkin wrote:
> On Mon, May 02, 2016 at 01:29:18PM +0200, Marc-André Lureau wrote:
> > Hi
> > 
> > On Mon, May 2, 2016 at 12:45 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > 1. Graceful disconnect
> > > - One should be able to do vmstop, disconnect, connect then vm start
> > >   This looks like a nice intermediate step.
> > > - Why would we always assume it's always remote initiating the disconnect?
> > >   Monitor commands for disconnecting seem called for.
> > 
> > Those two solutions involve VM management. This looks more complex to
> > communicate/synchronize vhost-user backend & vm management & qemu. The
> > use case I cover is request from the backend to shutdown,
> 
> Right, but flushing buffers + closing the socket looks like
> a cleaner interface than a ton of messages going back and forth.

I'd agree with Michael on that. It needs more care when dealing with
the two stream buffers: you can't quit the backend right after the
shutdown request; instead, you have to wait for the ACK from QEMU.
Making the request from QEMU avoids that.

	--yliu
Marc-André Lureau May 4, 2016, 1:16 p.m. UTC | #16
Hi

On Mon, May 2, 2016 at 2:04 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, May 02, 2016 at 01:29:18PM +0200, Marc-André Lureau wrote:
>> Hi
>>
>> On Mon, May 2, 2016 at 12:45 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > 1. Graceful disconnect
>> > - One should be able to do vmstop, disconnect, connect then vm start
>> >   This looks like a nice intermediate step.
>> > - Why would we always assume it's always remote initiating the disconnect?
>> >   Monitor commands for disconnecting seem called for.
>>
>> Those two solutions involve VM management. This looks more complex to
>> communicate/synchronize vhost-user backend & vm management & qemu. The
>> use case I cover is request from the backend to shutdown,
>
> Right, but flushing buffers + closing the socket looks like
> a cleaner interface than a ton of messages going back and forth.

What do you mean by "a ton of messages"? It adds
VHOST_USER_SET_SLAVE_FD (generic), and VHOST_USER_SLAVE_SHUTDOWN. The
amount of work to flush and close is the same regardless. I figured
later that if we refactor vhost-user socket handling, we may be able
to accept request from the main channel socket, without extra "slave
channel".

>
>> because
>> that's what the users wanted (and it is already possible to stop the
>> backend and disconnect it from qemu, we would only need to know when,
>> with a new command..)
>
> You assume the backend has a monitor interface to disconnect though.
> That's not a given.

What do you mean? The backend must indeed have a way to request to
close/quit. How the backend gets this information (via signals or
other means) is out of scope; it's external. Having this information
come from the VM management layer is the same (someone has to trigger
this somehow).

>> > 3. Running while disconnected
>> > - At the moment, we wait on vm start for remote to connect,
>> >   if we support disconnecting backend without stopping
>> >   we probably should also support starting it without waiting
>> >   for connection
>>
>> That's what Tetsuya proposed in its initial series, but handling the
>> flags was quite tedious.
>
> That would be up to management. E.g. let backend export the list
> of flags it supports in some file, and apply to QEMU.

That makes me worry that such unfriendly connection details have to
spread outside of vhost-user to the VM management layer, making usage &
maintenance harder for no clear benefit. It's similar to the concern you
have with "the backend has a monitor interface"; here "the backend
must have an introspection interface", or at least export vhost-user
details somehow.

>
>> I think this can be considered easily a
>> seperate enhancement. What I proposed is to keep waiting for the
>> initial connect, and check the flags remains compatible on reconnect.
>
> Seems asymmetrical unless we stop the vm.

That's the point: there will be time with and without the backend if
we keep the VM running.

>> > - Guests expect tx buffers to be used in a timely manner, thus:
>> > - If vm is running we must use them in qemu, maybe discarding packets
>> >   in the process.
>> >   There already are commands for link down, I'm not sure there's value
>> >   in doing this automatically in qemu.
>>
>> Testing doesn't show such buffer issues when the link is down (this
>> can be tested with vubr example above)
>
> Not enough testing then - it's a race condition: buffers can be sent
> before link down.

Ok, I'll do more testing. In any case, it looks reasonable to discard.

>
>> > - Alternatively, we should also have an option to stop vm automatically (like we do on
>> >   disk errors) to reduce number of dropped packets.
>>
>> Why not, we would need to know if this is actually useful.
>>
>> >
>> > 4. Reconnecting
>> > - When acting as a server, we might want an option to go back to
>> >   listening state, but it should not be the only option,
>> >   there should be a monitor command for moving it back to
>> >   listening state.
>> > - When acting as a client, auto-reconnect might be handy sometimes, but should not be the only
>> >   option, there should be a way to manually request connect, possibly to
>> >   a different target, so a monitor command for re-connecting seems called for.
>> > - We'll also need monitor events for disconnects so management knows it
>> >   must re-connect/restart listening.
>> > - If we stopped VM, there could be an option to restart on reconnect.
>>
>> That's all involving a third party, adding complexity but the benefit
>> isn't so clear.
>
> It's rather clear to me. Let's assume you want to restart bridge,
> so you trigger disconnect.
> If qemu auto-reconnects there's a race as it might re-connect
> to the old bridge.

I would say that race can trivially be avoided, so it's a backend bug.

> Management is required to make this robust, auto-reconnect
> is handy for people bypassing management.

tbh, I don't like autoreconnect. My previous series didn't include
this and assumed the feature would be supported only when qemu is
configured to be the server. I added reconnect upon request by users.
Michael S. Tsirkin May 4, 2016, 7:13 p.m. UTC | #17
On Wed, May 04, 2016 at 03:16:44PM +0200, Marc-André Lureau wrote:
> Hi
> 
> On Mon, May 2, 2016 at 2:04 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Mon, May 02, 2016 at 01:29:18PM +0200, Marc-André Lureau wrote:
> >> Hi
> >>
> >> On Mon, May 2, 2016 at 12:45 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> > 1. Graceful disconnect
> >> > - One should be able to do vmstop, disconnect, connect then vm start
> >> >   This looks like a nice intermediate step.
> >> > - Why would we always assume it's always remote initiating the disconnect?
> >> >   Monitor commands for disconnecting seem called for.
> >>
> >> Those two solutions involve VM management. This looks more complex to
> >> communicate/synchronize vhost-user backend & vm management & qemu. The
> >> use case I cover is request from the backend to shutdown,
> >
> > Right, but flushing buffers + closing the socket looks like
> > a cleaner interface than a ton of messages going back and forth.
> 
> What do you mean by "a ton of messages"? It adds
> VHOST_USER_SET_SLAVE_FD (generic), and VHOST_USER_SLAVE_SHUTDOWN. The
> amount of work to flush and close is the same regardless. I figured
> later that if we refactor vhost-user socket handling, we may be able
> to accept request from the main channel socket, without extra "slave
> channel".
> 
> >
> >> because
> >> that's what the users wanted (and it is already possible to stop the
> >> backend and disconnect it from qemu, we would only need to know when,
> >> with a new command..)
> >
> > You assume the backend has a monitor interface to disconnect though.
> > That's not a given.
> 
> What do you mean? The backend must have a way to request to close/quit
> indeed. That's outside of scope how the backend get this information
> (via signals or other means). It's external, having this information
> from VM management layer is the same (someone has to trigger this
> somehow).

Correct. So, for symmetry if nothing else, besides handling the
slave close request, we should be able to initiate a close
from qemu with a new command, and get an event when not connected.

Afterwards the client can be killed with -9, as it's no longer
connected to qemu.


> >> > 3. Running while disconnected
> >> > - At the moment, we wait on vm start for remote to connect,
> >> >   if we support disconnecting backend without stopping
> >> >   we probably should also support starting it without waiting
> >> >   for connection
> >>
> >> That's what Tetsuya proposed in its initial series, but handling the
> >> flags was quite tedious.
> >
> > That would be up to management. E.g. let backend export the list
> > of flags it supports in some file, and apply to QEMU.
> 
> That makes me worry that such unfriendly connections details have to
> spread outside of vhost-user to VM management layer, making usage &
> maintenance harder for no clear benefit. It's a similar concern you
> have with "the backend has a monitor interface", here "the backend
> must have an introspection interface" or at least export vhost-user
> details somehow.

How can we start the VM before the backend connects otherwise?
Do you have better ideas?


> >
> >> I think this can be considered easily a
> >> seperate enhancement. What I proposed is to keep waiting for the
> >> initial connect, and check the flags remains compatible on reconnect.
> >
> > Seems asymmetrical unless we stop the vm.
> 
> That's the point, there will be time with and without the backend if
> we keep the VM running.
> 
> >> > - Guests expect tx buffers to be used in a timely manner, thus:
> >> > - If vm is running we must use them in qemu, maybe discarding packets
> >> >   in the process.
> >> >   There already are commands for link down, I'm not sure there's value
> >> >   in doing this automatically in qemu.
> >>
> >> Testing doesn't show such buffer issues when the link is down (this
> >> can be tested with vubr example above)
> >
> > Not enough testing then - it's a race condition: buffers can be sent
> > before link down.
> 
> Ok, I'll do more testing. In all cases, looks reasonable to discard.

Discarding has some issues - for example, processing the ring in qemu
sometimes exposes us to more security risks.

> >
> >> > - Alternatively, we should also have an option to stop vm automatically (like we do on
> >> >   disk errors) to reduce number of dropped packets.
> >>
> >> Why not, we would need to know if this is actually useful.
> >>
> >> >
> >> > 4. Reconnecting
> >> > - When acting as a server, we might want an option to go back to
> >> >   listening state, but it should not be the only option,
> >> >   there should be a monitor command for moving it back to
> >> >   listening state.
> >> > - When acting as a client, auto-reconnect might be handy sometimes, but should not be the only
> >> >   option, there should be a way to manually request connect, possibly to
> >> >   a different target, so a monitor command for re-connecting seems called for.
> >> > - We'll also need monitor events for disconnects so management knows it
> >> >   must re-connect/restart listening.
> >> > - If we stopped VM, there could be an option to restart on reconnect.
> >>
> >> That's all involving a third party, adding complexity but the benefit
> >> isn't so clear.
> >
> > It's rather clear to me. Let's assume you want to restart bridge,
> > so you trigger disconnect.
> > If qemu auto-reconnects there's a race as it might re-connect
> > to the old bridge.
> 
> I would say that race can trivially be avoided, so it's a backend bug.

How do you avoid it?

> > Management is required to make this robust, auto-reconnect
> > is handy for people bypassing management.
> 
> tbh, I don't like autoreconnect. My previous series didn't include
> this and assumed the feature would be supported only when qemu is
> configured to be the server. I added reconnect upon request by users.

I don't have better solutions so OK I guess.

> -- 
> Marc-André Lureau
Yuanhan Liu May 5, 2016, 3:44 a.m. UTC | #18
Hello,

On Wed, May 04, 2016 at 10:13:49PM +0300, Michael S. Tsirkin wrote:
> How do you avoid it?
> 
> > > Management is required to make this robust, auto-reconnect
> > > is handy for people bypassing management.
> > 
> > tbh, I don't like autoreconnect. My previous series didn't include
> > this and assumed the feature would be supported only when qemu is
> > configured to be the server. I added reconnect upon request by users.
> 
> I don't have better solutions so OK I guess.

Yes, it's a request from me :)
Well, there may be a few others who also requested that.

The reason I asked for this is that, so far, we have only one
vhost-user frontend, and that is QEMU. But we may have many backends;
I'm aware of 4 at the time of writing, including vubr from QEMU.
While we could do the vhost-user client and reconnect implementation
in all the backends, it's clear that implementing it in the only
frontend (QEMU) brings more benefits.

OTOH, I could implement DPDK vhost-user as a client and try reconnect
there, if that would make everything a bit easier.

	--yliu
Michael S. Tsirkin May 5, 2016, 1:42 p.m. UTC | #19
On Wed, May 04, 2016 at 08:44:37PM -0700, Yuanhan Liu wrote:
> Hello,
> 
> On Wed, May 04, 2016 at 10:13:49PM +0300, Michael S. Tsirkin wrote:
> > How do you avoid it?
> > 
> > > > Management is required to make this robust, auto-reconnect
> > > > is handy for people bypassing management.
> > > 
> > > tbh, I don't like autoreconnect. My previous series didn't include
> > > this and assumed the feature would be supported only when qemu is
> > > configured to be the server. I added reconnect upon request by users.
> > 
> > I don't have better solutions so OK I guess.
> 
> Yes, it's a request from me :)
> Well, there may be few others also requested that.
> 
> The reason I had this asking is that, so far, we just have only one
> vhost-user frontend, and that is QEMU. But we may have many backends,
> I'm aware of 4 at the time writing, including the vubr from QEMU.
> While we could do vhost-user client and reconnect implementation
> on all backends, it's clear that implementing it in the only backend
> (QEMU) introduces more benefits.
> 
> OTOH, I could implement DPDK vhost-user as client and try reconnect
> there, if that could makes all stuff a bit easier.
> 
> 	--yliu

In my opinion, if the slave initiates disconnecting then the slave should
also initiate connecting.

In a sense, auto-retrying connections is not a vhost-user
feature; it's a general chardev feature. It also does not
have to be related to disconnect: retrying on the first connect
failure makes exactly as much sense.

Patch

diff --git a/docs/specs/vhost-user.txt b/docs/specs/vhost-user.txt
index 8a635fa..60d6d13 100644
--- a/docs/specs/vhost-user.txt
+++ b/docs/specs/vhost-user.txt
@@ -487,3 +487,18 @@  Message types
       request to the master. It is passed in the ancillary data.
       This message is only sent if VHOST_USER_PROTOCOL_F_SLAVE_CHANNEL
       feature is available.
+
+Slave message types
+-------------------
+
+ * VHOST_USER_SLAVE_SHUTDOWN:
+      Id: 1
+      Master payload: N/A
+      Slave payload: u64
+
+      Request the master to shutdown the slave. A 0 reply is for
+      success, in which case the slave may close all connections
+      immediately and quit. A non-zero reply cancels the request.
+
+      Before a reply comes, the master may make other requests in
+      order to flush or sync state.
diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
index 890c1bd..f91baee 100644
--- a/hw/virtio/vhost-user.c
+++ b/hw/virtio/vhost-user.c
@@ -67,6 +67,8 @@  typedef enum VhostUserRequest {
 
 typedef enum VhostUserSlaveRequest {
     VHOST_USER_SLAVE_NONE = 0,
+    VHOST_USER_SLAVE_SHUTDOWN = 1,
+
     VHOST_USER_SLAVE_MAX
 }  VhostUserSlaveRequest;
 
@@ -539,6 +541,20 @@  static void slave_read(void *opaque)
     }
 
     switch (msg.request) {
+    case VHOST_USER_SLAVE_SHUTDOWN: {
+        uint64_t success = 1; /* 0 is for success */
+        if (dev->stop) {
+            dev->stop(dev);
+            success = 0;
+        }
+        msg.payload.u64 = success;
+        msg.size = sizeof(msg.payload.u64);
+        size = send(u->slave_fd, &msg, VHOST_USER_HDR_SIZE + msg.size, 0);
+        if (size != VHOST_USER_HDR_SIZE + msg.size) {
+            error_report("Failed to write reply.");
+        }
+        break;
+    }
     default:
         error_report("Received unexpected msg type.");
     }