diff mbox

[for-2.4] Revert "vhost-user: Send VHOST_RESET_OWNER on vhost stop"

Message ID 1438245296-1446-1-git-send-email-mst@redhat.com
State New
Headers show

Commit Message

Michael S. Tsirkin July 30, 2015, 8:36 a.m. UTC
This reverts commit 294ce717e0f212ed0763307f3eab72b4a1bdf4d0.

vhost stop happens e.g. when guest unloads the driver,
so closing the backend connection is not the right
thing to do here.

VHOST_RESET_OWNER should happen on vhost_dev_cleanup - it's
the counterpart of VHOST_SET_OWNER.

Cc: Luke Gorrie <luke@snabb.co>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---

I think we need this in 2.4 to avoid introducing regressions
in the protocol. We'll fix properly in 2.5.
Luke, can you comment please?

 hw/net/vhost_net.c | 7 -------
 1 file changed, 7 deletions(-)

Comments

Luke Gorrie Aug. 4, 2015, 7 p.m. UTC | #1
Hi Michael,

Sorry I didn't see this mail sooner -

On 30 July 2015 at 10:36, Michael S. Tsirkin <mst@redhat.com> wrote:

> This reverts commit 294ce717e0f212ed0763307f3eab72b4a1bdf4d0.
>
> vhost stop happens e.g. when guest unloads the driver,
> so closing the backend connection is not the right
> thing to do here.
>
> VHOST_RESET_OWNER should happen on vhost_dev_cleanup - it's
> the counterpart of VHOST_SET_OWNER.
>
> Cc: Luke Gorrie <luke@snabb.co>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>
> I think we need this in 2.4 to avoid introducing regressions
> in the protocol. We'll fix properly in 2.5.
> Luke, can you comment please?
>

Interesting. Currently we don't have a test case in Snabb Switch CI for
guests reloading drivers. We need to add that and let you know how it goes.

The reason I made this patch was to fix a problem with guests failing to
reboot. This patch seemed to be necessary for the vswitch to know that the
vring in guest memory is not valid anymore and should not be used for DMA
(since that memory is being reallocated to who knows what).

I wonder if driver reload sort of behavior can be related to this
interesting problem that we are also tracking down - ideas would be very
welcome:
https://github.com/SnabbCo/snabbswitch/issues/574
Michael S. Tsirkin Aug. 4, 2015, 8:36 p.m. UTC | #2
On Tue, Aug 04, 2015 at 09:00:47PM +0200, Luke Gorrie wrote:
> Hi Michael,
> 
> Sorry I didn't see this mail sooner -
> 
> On 30 July 2015 at 10:36, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
>     This reverts commit 294ce717e0f212ed0763307f3eab72b4a1bdf4d0.
> 
>     vhost stop happens e.g. when guest unloads the driver,
>     so closing the backend connection is not the right
>     thing to do here.
> 
>     VHOST_RESET_OWNER should happen on vhost_dev_cleanup - it's
>     the counterpart of VHOST_SET_OWNER.
> 
>     Cc: Luke Gorrie <luke@snabb.co>
>     Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>     ---
> 
>     I think we need this in 2.4 to avoid introducing regressions
>     in the protocol. We'll fix properly in 2.5.
>     Luke, can you comment please?
> 
> 
> Interesting. Currently we don't have a test case in Snabb Switch CI for guests
> reloading drivers. We need to add that and let you know how it goes.
> 
> The reason I made this patch was to fix a problem with guests failing to
> reboot. This patch seemed to be necessary for the vswitch to know that the
> vring in guest memory is not valid anymore and should not be used for DMA
> (since that memory is being reallocated to who knows what).

This makes sense, but the description says master will
close the socket. This makes no sense to me.
Maybe we should leave the code as is, but rename the message
to RESET_DEVICE then?

> I wonder if driver reload sort of behavior can be related to this interesting
> problem that we are also tracking down - ideas would be very welcome:
> https://github.com/SnabbCo/snabbswitch/issues/574
> 
>
Luke Gorrie Aug. 6, 2015, 9:33 a.m. UTC | #3
On 4 August 2015 at 22:36, Michael S. Tsirkin <mst@redhat.com> wrote:

> This makes sense, but the description says master will
> close the socket. This makes no sense to me.
> Maybe we should leave the code as is, but rename the message
> to RESET_DEVICE then?
>

That sounds reasonable to me. In my patch I adopted the existing (unused)
protocol message because it seemed like a good fit but perhaps it was not
the perfect one.

(More from me when we have added a driver reload test in our CI.)

Cheers,
-Luke
Luke Gorrie Aug. 2, 2016, 9:42 a.m. UTC | #4
Hi Michael & all,

On 4 August 2015 at 21:00, Luke Gorrie <luke@snabb.co> wrote:

> Hi Michael,
>
> Sorry I didn't see this mail sooner -
>
> On 30 July 2015 at 10:36, Michael S. Tsirkin <mst@redhat.com> wrote:
>
>> This reverts commit 294ce717e0f212ed0763307f3eab72b4a1bdf4d0.
>>
>> vhost stop happens e.g. when guest unloads the driver,
>> so closing the backend connection is not the right
>> thing to do here.
>>
>> VHOST_RESET_OWNER should happen on vhost_dev_cleanup - it's
>> the counterpart of VHOST_SET_OWNER.
>>
>> Cc: Luke Gorrie <luke@snabb.co>
>> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>> ---
>>
>> I think we need this in 2.4 to avoid introducing regressions
>> in the protocol. We'll fix properly in 2.5.
>> Luke, can you comment please?
>>
>
> Interesting. Currently we don't have a test case in Snabb Switch CI for
> guests reloading drivers. We need to add that and let you know how it goes.
>

We have this test case now. Took a year to add, almost to the day, but here
we are :-).

Specifically we have beefed up our CI for Snabb to run some tens of
thousands of end-to-end tests with VMs nightly. This includes testing a
bunch of different QEMU versions with a bunch of different Snabb (host) and
DPDK (guest) versions. It also exercises the transition where the guest
first initializes a Virtio-net device with the kernel driver and then later
hands the device over to the DPDK driver.

The data we have now shows that this test case is only working reliably
with QEMU 2.4.1. My hypothesis is that this QEMU version is sending the
VHOST_RESET_OWNER when the kernel driver shuts down - so that vswitch knows
not to process descriptors - while the other versions are skipping this
(because the patch was not introduced yet, or because the patch was
reverted).

Here is the full thread on Github, becoming more QEMU-oriented as you
scroll down:
https://github.com/snabbco/snabb/issues/976#issuecomment-236838883

So, question for QEMU upstream, is there currently a supported way for the
host vswitch to detect when the vrings are invalid? How should we operate
to be safe from serving garbage DMA requests during guest driver resets and
reboots?

Cheers and sorry for the e-mail latency :-)
-Luke
Michael S. Tsirkin Aug. 2, 2016, 5:32 p.m. UTC | #5
On Tue, Aug 02, 2016 at 11:42:01AM +0200, Luke Gorrie wrote:
> Hi Michael & all,
> 
> On 4 August 2015 at 21:00, Luke Gorrie <luke@snabb.co> wrote:
> 
>     Hi Michael,
> 
>     Sorry I didn't see this mail sooner -
> 
>     On 30 July 2015 at 10:36, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
>         This reverts commit 294ce717e0f212ed0763307f3eab72b4a1bdf4d0.
> 
>         vhost stop happens e.g. when guest unloads the driver,
>         so closing the backend connection is not the right
>         thing to do here.
> 
>         VHOST_RESET_OWNER should happen on vhost_dev_cleanup - it's
>         the counterpart of VHOST_SET_OWNER.
> 
>         Cc: Luke Gorrie <luke@snabb.co>
>         Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>         ---
> 
>         I think we need this in 2.4 to avoid introducing regressions
>         in the protocol. We'll fix properly in 2.5.
>         Luke, can you comment please?
> 
> 
>     Interesting. Currently we don't have a test case in Snabb Switch CI for
>     guests reloading drivers. We need to add that and let you know how it goes.
> 
> 
> We have this test case now. Took a year to add, almost to the day, but here we
> are :-).
> 
> Specifically we have beefed up our CI for Snabb to run some tens of thousands
> of end-to-end tests with VMs nightly. This includes testing a bunch of
> different QEMU versions with a bunch of different Snabb (host) and DPDK (guest)
> versions. It also exercises the transition where the guest first initializes a
> Virtio-net device with the kernel driver and then later hands the device over
> to the DPDK driver.
> 
> The data we have now shows that this test case is only working reliably with
> QEMU 2.4.1. My hypothesis is that this QEMU version is sending the
> VHOST_RESET_OWNER when the kernel driver shuts down - so that vswitch knows not
> to process descriptors - while the other versions are skipping this (because
> the patch was not introduced yet, or because the patch was reverted).
> 
> Here is the full thread on Github, becoming more QEMU-oriented as you scroll
> down:
> https://github.com/snabbco/snabb/issues/976#issuecomment-236838883
> 
> So, question for QEMU upstream, is there currently a supported way for the host
> vswitch to detect when the vrings are invalid? How should we operate to be safe
> from serving garbage DMA requests during guest driver resets and reboots?
> 
> Cheers and sorry for the e-mail latency :-)
> -Luke
> 
> 


I think this is adequately answered here:
http://git.qemu.org/?p=qemu.git;a=blob;f=docs/specs/vhost-user.txt;hb=HEAD#l149

HTH
Luke Gorrie Aug. 10, 2016, 1:50 p.m. UTC | #6
On 2 August 2016 at 19:32, Michael S. Tsirkin <mst@redhat.com> wrote:

> I think this is adequately answered here:
> http://git.qemu.org/?p=qemu.git;a=blob;f=docs/specs/vhost-
> user.txt;hb=HEAD#l149
>

Thanks for the gentle nudge :). Indeed, it is working great with all QEMU
versions now. I see now why my patch in QEMU 2.4 was not needed.

JFYI: Here is the line of inquiry we are following for interop
testing/benchmarking of different software components in the vhost-user
ecosystem. Goal is predictable performance with all software combinations &
quick detection of regressions anywhere in the test matrix. If anybody is
interested in this topic, is doing related work, or has a better idea :)
please feel free to drop me a line!
https://github.com/snabbco/snabb/issues/989

Cheers,
-Luke
diff mbox

Patch

diff --git a/hw/net/vhost_net.c b/hw/net/vhost_net.c
index c864237..b8d88ec 100644
--- a/hw/net/vhost_net.c
+++ b/hw/net/vhost_net.c
@@ -284,13 +284,6 @@  static void vhost_net_stop_one(struct vhost_net *net,
                                           &file);
             assert(r >= 0);
         }
-    } else if (net->nc->info->type == NET_CLIENT_OPTIONS_KIND_VHOST_USER) {
-        for (file.index = 0; file.index < net->dev.nvqs; ++file.index) {
-            const VhostOps *vhost_ops = net->dev.vhost_ops;
-            int r = vhost_ops->vhost_call(&net->dev, VHOST_RESET_OWNER,
-                                          NULL);
-            assert(r >= 0);
-        }
     }
     if (net->nc->info->poll) {
         net->nc->info->poll(net->nc, true);