diff mbox

net: delay peer host device delete

Message ID 20100920163042.GA29466@redhat.com
State New
Headers show

Commit Message

Michael S. Tsirkin Sept. 20, 2010, 4:30 p.m. UTC
With -netdev, virtio devices present offload
features to guest, depending on the backend used.
Thus, removing host ntedev peer while guest is
active leads to guest-visible inconsistency and/or crashes.
See e.g. https://bugzilla.redhat.com/show_bug.cgi?id=623735

As a solution, while guest (NIC) peer device exists,
we must prevent the host peer from being deleted.

This patch does this by adding peer_deleted flag in nic state:
if host device is going away while guest device
is around, set this flag and keep host device around
for as long as guest device exists.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 net.c |   21 ++++++++++++++++++++-
 net.h |    1 +
 2 files changed, 21 insertions(+), 1 deletions(-)

Comments

Anthony Liguori Sept. 20, 2010, 4:41 p.m. UTC | #1
On 09/20/2010 11:30 AM, Michael S. Tsirkin wrote:
> With -netdev, virtio devices present offload
> features to guest, depending on the backend used.
> Thus, removing host ntedev peer while guest is
> active leads to guest-visible inconsistency and/or crashes.
> See e.g. https://bugzilla.redhat.com/show_bug.cgi?id=623735
>
> As a solution, while guest (NIC) peer device exists,
> we must prevent the host peer from being deleted.
>
> This patch does this by adding peer_deleted flag in nic state:
> if host device is going away while guest device
> is around, set this flag and keep host device around
> for as long as guest device exists.
>    

Having an unclear life cycle really worries me.

Wouldn't the more correct solution be to avoid removing the netdev 
device until after the peer has successfully been removed?

Regards,

Anthony Liguori

> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
> ---
>   net.c |   21 ++++++++++++++++++++-
>   net.h |    1 +
>   2 files changed, 21 insertions(+), 1 deletions(-)
>
> diff --git a/net.c b/net.c
> index 3d0fde7..10855d1 100644
> --- a/net.c
> +++ b/net.c
> @@ -286,12 +286,31 @@ void qemu_del_vlan_client(VLANClientState *vc)
>       if (vc->vlan) {
>           QTAILQ_REMOVE(&vc->vlan->clients, vc, next);
>       } else {
> +        /* Even if client will not be deleted yet, remove it from list so it
> +         * does not appear in monitor.  */
> +        QTAILQ_REMOVE(&non_vlan_clients, vc, next);
> +        /* Detect that guest-visible (NIC) peer is active, and delay deletion.
> +         * */
> +        if (vc->peer&&  vc->peer->info->type == NET_CLIENT_TYPE_NIC) {
> +            NICState *nic = DO_UPCAST(NICState, nc, vc->peer);
> +            assert(!nic->peer_deleted);
> +            nic->peer_deleted = true;
> +            return;
> +        }
>           if (vc->send_queue) {
>               qemu_del_net_queue(vc->send_queue);
>           }
> -        QTAILQ_REMOVE(&non_vlan_clients, vc, next);
>           if (vc->peer) {
>               vc->peer->peer = NULL;
> +            /* If this is a guest-visible (NIC) device,
> +             * and peer has already been removed from monitor,
> +             * delete it here. */
> +            if (vc->info->type == NET_CLIENT_TYPE_NIC) {
> +                NICState *nic = DO_UPCAST(NICState, nc, vc);
> +                if (nic->peer_deleted) {
> +                    qemu_del_vlan_client(vc->peer);
> +                }
> +            }
>           }
>       }
>
> diff --git a/net.h b/net.h
> index 518cf9c..44c31a9 100644
> --- a/net.h
> +++ b/net.h
> @@ -72,6 +72,7 @@ typedef struct NICState {
>       VLANClientState nc;
>       NICConf *conf;
>       void *opaque;
> +    bool peer_deleted;
>   } NICState;
>
>   struct VLANState {
>
Michael S. Tsirkin Sept. 20, 2010, 4:47 p.m. UTC | #2
On Mon, Sep 20, 2010 at 11:41:45AM -0500, Anthony Liguori wrote:
> On 09/20/2010 11:30 AM, Michael S. Tsirkin wrote:
> >With -netdev, virtio devices present offload
> >features to guest, depending on the backend used.
> >Thus, removing host ntedev peer while guest is
> >active leads to guest-visible inconsistency and/or crashes.
> >See e.g. https://bugzilla.redhat.com/show_bug.cgi?id=623735
> >
> >As a solution, while guest (NIC) peer device exists,
> >we must prevent the host peer from being deleted.
> >
> >This patch does this by adding peer_deleted flag in nic state:
> >if host device is going away while guest device
> >is around, set this flag and keep host device around
> >for as long as guest device exists.
> 
> Having an unclear life cycle really worries me.
> 
> Wouldn't the more correct solution be to avoid removing the netdev
> device until after the peer has successfully been removed?
> 
> Regards,
> 
> Anthony Liguori

This is exactly what the patch does.


> >Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
> >---
> >  net.c |   21 ++++++++++++++++++++-
> >  net.h |    1 +
> >  2 files changed, 21 insertions(+), 1 deletions(-)
> >
> >diff --git a/net.c b/net.c
> >index 3d0fde7..10855d1 100644
> >--- a/net.c
> >+++ b/net.c
> >@@ -286,12 +286,31 @@ void qemu_del_vlan_client(VLANClientState *vc)
> >      if (vc->vlan) {
> >          QTAILQ_REMOVE(&vc->vlan->clients, vc, next);
> >      } else {
> >+        /* Even if client will not be deleted yet, remove it from list so it
> >+         * does not appear in monitor.  */
> >+        QTAILQ_REMOVE(&non_vlan_clients, vc, next);
> >+        /* Detect that guest-visible (NIC) peer is active, and delay deletion.
> >+         * */
> >+        if (vc->peer&&  vc->peer->info->type == NET_CLIENT_TYPE_NIC) {
> >+            NICState *nic = DO_UPCAST(NICState, nc, vc->peer);
> >+            assert(!nic->peer_deleted);
> >+            nic->peer_deleted = true;
> >+            return;
> >+        }
> >          if (vc->send_queue) {
> >              qemu_del_net_queue(vc->send_queue);
> >          }
> >-        QTAILQ_REMOVE(&non_vlan_clients, vc, next);
> >          if (vc->peer) {
> >              vc->peer->peer = NULL;
> >+            /* If this is a guest-visible (NIC) device,
> >+             * and peer has already been removed from monitor,
> >+             * delete it here. */
> >+            if (vc->info->type == NET_CLIENT_TYPE_NIC) {
> >+                NICState *nic = DO_UPCAST(NICState, nc, vc);
> >+                if (nic->peer_deleted) {
> >+                    qemu_del_vlan_client(vc->peer);
> >+                }
> >+            }
> >          }
> >      }
> >
> >diff --git a/net.h b/net.h
> >index 518cf9c..44c31a9 100644
> >--- a/net.h
> >+++ b/net.h
> >@@ -72,6 +72,7 @@ typedef struct NICState {
> >      VLANClientState nc;
> >      NICConf *conf;
> >      void *opaque;
> >+    bool peer_deleted;
> >  } NICState;
> >
> >  struct VLANState {
Anthony Liguori Sept. 20, 2010, 4:56 p.m. UTC | #3
On 09/20/2010 11:47 AM, Michael S. Tsirkin wrote:
> On Mon, Sep 20, 2010 at 11:41:45AM -0500, Anthony Liguori wrote:
>    
>> On 09/20/2010 11:30 AM, Michael S. Tsirkin wrote:
>>      
>>> With -netdev, virtio devices present offload
>>> features to guest, depending on the backend used.
>>> Thus, removing host ntedev peer while guest is
>>> active leads to guest-visible inconsistency and/or crashes.
>>> See e.g. https://bugzilla.redhat.com/show_bug.cgi?id=623735
>>>
>>> As a solution, while guest (NIC) peer device exists,
>>> we must prevent the host peer from being deleted.
>>>
>>> This patch does this by adding peer_deleted flag in nic state:
>>> if host device is going away while guest device
>>> is around, set this flag and keep host device around
>>> for as long as guest device exists.
>>>        
>> Having an unclear life cycle really worries me.
>>
>> Wouldn't the more correct solution be to avoid removing the netdev
>> device until after the peer has successfully been removed?
>>
>> Regards,
>>
>> Anthony Liguori
>>      
> This is exactly what the patch does.
>    

At the management layer instead of doing it magically in the backend.

IOW, if device_del returns and the device isn't actually deleted, that's 
a bug and addressing it like this just means we'll trip over it 
somewhere else.

We'll have the same problem with drive_del.

Regards,

Anthony Liguori

>    
>>> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
>>> ---
>>>   net.c |   21 ++++++++++++++++++++-
>>>   net.h |    1 +
>>>   2 files changed, 21 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/net.c b/net.c
>>> index 3d0fde7..10855d1 100644
>>> --- a/net.c
>>> +++ b/net.c
>>> @@ -286,12 +286,31 @@ void qemu_del_vlan_client(VLANClientState *vc)
>>>       if (vc->vlan) {
>>>           QTAILQ_REMOVE(&vc->vlan->clients, vc, next);
>>>       } else {
>>> +        /* Even if client will not be deleted yet, remove it from list so it
>>> +         * does not appear in monitor.  */
>>> +        QTAILQ_REMOVE(&non_vlan_clients, vc, next);
>>> +        /* Detect that guest-visible (NIC) peer is active, and delay deletion.
>>> +         * */
>>> +        if (vc->peer&&   vc->peer->info->type == NET_CLIENT_TYPE_NIC) {
>>> +            NICState *nic = DO_UPCAST(NICState, nc, vc->peer);
>>> +            assert(!nic->peer_deleted);
>>> +            nic->peer_deleted = true;
>>> +            return;
>>> +        }
>>>           if (vc->send_queue) {
>>>               qemu_del_net_queue(vc->send_queue);
>>>           }
>>> -        QTAILQ_REMOVE(&non_vlan_clients, vc, next);
>>>           if (vc->peer) {
>>>               vc->peer->peer = NULL;
>>> +            /* If this is a guest-visible (NIC) device,
>>> +             * and peer has already been removed from monitor,
>>> +             * delete it here. */
>>> +            if (vc->info->type == NET_CLIENT_TYPE_NIC) {
>>> +                NICState *nic = DO_UPCAST(NICState, nc, vc);
>>> +                if (nic->peer_deleted) {
>>> +                    qemu_del_vlan_client(vc->peer);
>>> +                }
>>> +            }
>>>           }
>>>       }
>>>
>>> diff --git a/net.h b/net.h
>>> index 518cf9c..44c31a9 100644
>>> --- a/net.h
>>> +++ b/net.h
>>> @@ -72,6 +72,7 @@ typedef struct NICState {
>>>       VLANClientState nc;
>>>       NICConf *conf;
>>>       void *opaque;
>>> +    bool peer_deleted;
>>>   } NICState;
>>>
>>>   struct VLANState {
>>>
Michael S. Tsirkin Sept. 20, 2010, 5:14 p.m. UTC | #4
On Mon, Sep 20, 2010 at 11:56:56AM -0500, Anthony Liguori wrote:
> On 09/20/2010 11:47 AM, Michael S. Tsirkin wrote:
> >On Mon, Sep 20, 2010 at 11:41:45AM -0500, Anthony Liguori wrote:
> >>On 09/20/2010 11:30 AM, Michael S. Tsirkin wrote:
> >>>With -netdev, virtio devices present offload
> >>>features to guest, depending on the backend used.
> >>>Thus, removing host ntedev peer while guest is
> >>>active leads to guest-visible inconsistency and/or crashes.
> >>>See e.g. https://bugzilla.redhat.com/show_bug.cgi?id=623735
> >>>
> >>>As a solution, while guest (NIC) peer device exists,
> >>>we must prevent the host peer from being deleted.
> >>>
> >>>This patch does this by adding peer_deleted flag in nic state:
> >>>if host device is going away while guest device
> >>>is around, set this flag and keep host device around
> >>>for as long as guest device exists.
> >>Having an unclear life cycle really worries me.
> >>
> >>Wouldn't the more correct solution be to avoid removing the netdev
> >>device until after the peer has successfully been removed?
> >>
> >>Regards,
> >>
> >>Anthony Liguori
> >This is exactly what the patch does.
> 
> At the management layer instead of doing it magically in the backend.

The amount of pain this inflicts on management would be considerable.
Hotplug commands were designed to be asynchronous
(starts the process, does not wait for it to complete), maybe that
was a mistake but we can not change semantics at will now.

Add new commands, okay, but existing ones should work and get fixed
if there's a bug.

> IOW, if device_del returns and the device isn't actually deleted,
> that's a bug and addressing it like this just means we'll trip over
> it somewhere else.
> 
> We'll have the same problem with drive_del.

Let's fix it there as well then.

> Regards,
> 
> Anthony Liguori
> 
> >>>Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
> >>>---
> >>>  net.c |   21 ++++++++++++++++++++-
> >>>  net.h |    1 +
> >>>  2 files changed, 21 insertions(+), 1 deletions(-)
> >>>
> >>>diff --git a/net.c b/net.c
> >>>index 3d0fde7..10855d1 100644
> >>>--- a/net.c
> >>>+++ b/net.c
> >>>@@ -286,12 +286,31 @@ void qemu_del_vlan_client(VLANClientState *vc)
> >>>      if (vc->vlan) {
> >>>          QTAILQ_REMOVE(&vc->vlan->clients, vc, next);
> >>>      } else {
> >>>+        /* Even if client will not be deleted yet, remove it from list so it
> >>>+         * does not appear in monitor.  */
> >>>+        QTAILQ_REMOVE(&non_vlan_clients, vc, next);
> >>>+        /* Detect that guest-visible (NIC) peer is active, and delay deletion.
> >>>+         * */
> >>>+        if (vc->peer&&   vc->peer->info->type == NET_CLIENT_TYPE_NIC) {
> >>>+            NICState *nic = DO_UPCAST(NICState, nc, vc->peer);
> >>>+            assert(!nic->peer_deleted);
> >>>+            nic->peer_deleted = true;
> >>>+            return;
> >>>+        }
> >>>          if (vc->send_queue) {
> >>>              qemu_del_net_queue(vc->send_queue);
> >>>          }
> >>>-        QTAILQ_REMOVE(&non_vlan_clients, vc, next);
> >>>          if (vc->peer) {
> >>>              vc->peer->peer = NULL;
> >>>+            /* If this is a guest-visible (NIC) device,
> >>>+             * and peer has already been removed from monitor,
> >>>+             * delete it here. */
> >>>+            if (vc->info->type == NET_CLIENT_TYPE_NIC) {
> >>>+                NICState *nic = DO_UPCAST(NICState, nc, vc);
> >>>+                if (nic->peer_deleted) {
> >>>+                    qemu_del_vlan_client(vc->peer);
> >>>+                }
> >>>+            }
> >>>          }
> >>>      }
> >>>
> >>>diff --git a/net.h b/net.h
> >>>index 518cf9c..44c31a9 100644
> >>>--- a/net.h
> >>>+++ b/net.h
> >>>@@ -72,6 +72,7 @@ typedef struct NICState {
> >>>      VLANClientState nc;
> >>>      NICConf *conf;
> >>>      void *opaque;
> >>>+    bool peer_deleted;
> >>>  } NICState;
> >>>
> >>>  struct VLANState {
Anthony Liguori Sept. 20, 2010, 6:14 p.m. UTC | #5
On 09/20/2010 12:14 PM, Michael S. Tsirkin wrote:
> On Mon, Sep 20, 2010 at 11:56:56AM -0500, Anthony Liguori wrote:
>    
>> On 09/20/2010 11:47 AM, Michael S. Tsirkin wrote:
>>      
>>> On Mon, Sep 20, 2010 at 11:41:45AM -0500, Anthony Liguori wrote:
>>>        
>>>> On 09/20/2010 11:30 AM, Michael S. Tsirkin wrote:
>>>>          
>>>>> With -netdev, virtio devices present offload
>>>>> features to guest, depending on the backend used.
>>>>> Thus, removing host ntedev peer while guest is
>>>>> active leads to guest-visible inconsistency and/or crashes.
>>>>> See e.g. https://bugzilla.redhat.com/show_bug.cgi?id=623735
>>>>>
>>>>> As a solution, while guest (NIC) peer device exists,
>>>>> we must prevent the host peer from being deleted.
>>>>>
>>>>> This patch does this by adding peer_deleted flag in nic state:
>>>>> if host device is going away while guest device
>>>>> is around, set this flag and keep host device around
>>>>> for as long as guest device exists.
>>>>>            
>>>> Having an unclear life cycle really worries me.
>>>>
>>>> Wouldn't the more correct solution be to avoid removing the netdev
>>>> device until after the peer has successfully been removed?
>>>>
>>>> Regards,
>>>>
>>>> Anthony Liguori
>>>>          
>>> This is exactly what the patch does.
>>>        
>> At the management layer instead of doing it magically in the backend.
>>      
> The amount of pain this inflicts on management would be considerable.
> Hotplug commands were designed to be asynchronous
> (starts the process, does not wait for it to complete), maybe that
> was a mistake but we can not change semantics at will now.
>
> Add new commands, okay, but existing ones should work and get fixed
> if there's a bug.
>    

But having commands that are impossible to use correctly is not very good.

Here's what makes sense to me:

1) async device remove + poll device status/removal notification + 
remove backend

The management tool needs to determine when the device is gone and 
remove the backend.

2) sync device remove + remove backend

Command does not return until device is removed

3) async device and backend removal + poll device/backend removal + 
removal notification

One command that removes the device and any associated backend.  We need 
to indicate to the management layer when this operation is complete.

I think (2) is the most elegant but also the most difficult to implement 
today.  I think (1) is the least invasive to implement but has the most 
management tool complexity.  (3) is probably the best compromise in 
terms of complexity and ease of implementation.

Just for comparison, your patch does:

4) async device removal + remove backend

Whereas remove backend may or may not cause removal depending on whether 
device removal has happened.  So it's really async removal but it 
doesn't happen deterministically on it's own.  What happens if you call 
remove backend before starting async device removal?  What if the guest 
never removes the device?  What if a reset happens?

One advantage of (1) is that there is no tricky life cycle 
considerations.  If we did (3), we would have to think through what 
happens if a guest doesn't respond to an unplug request.

Regards,

Anthony Liguori

>    
>> IOW, if device_del returns and the device isn't actually deleted,
>> that's a bug and addressing it like this just means we'll trip over
>> it somewhere else.
>>
>> We'll have the same problem with drive_del.
>>      
> Let's fix it there as well then.
>
>    
>> Regards,
>>
>> Anthony Liguori
>>
>>      
>>>>> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
>>>>> ---
>>>>>   net.c |   21 ++++++++++++++++++++-
>>>>>   net.h |    1 +
>>>>>   2 files changed, 21 insertions(+), 1 deletions(-)
>>>>>
>>>>> diff --git a/net.c b/net.c
>>>>> index 3d0fde7..10855d1 100644
>>>>> --- a/net.c
>>>>> +++ b/net.c
>>>>> @@ -286,12 +286,31 @@ void qemu_del_vlan_client(VLANClientState *vc)
>>>>>       if (vc->vlan) {
>>>>>           QTAILQ_REMOVE(&vc->vlan->clients, vc, next);
>>>>>       } else {
>>>>> +        /* Even if client will not be deleted yet, remove it from list so it
>>>>> +         * does not appear in monitor.  */
>>>>> +        QTAILQ_REMOVE(&non_vlan_clients, vc, next);
>>>>> +        /* Detect that guest-visible (NIC) peer is active, and delay deletion.
>>>>> +         * */
>>>>> +        if (vc->peer&&    vc->peer->info->type == NET_CLIENT_TYPE_NIC) {
>>>>> +            NICState *nic = DO_UPCAST(NICState, nc, vc->peer);
>>>>> +            assert(!nic->peer_deleted);
>>>>> +            nic->peer_deleted = true;
>>>>> +            return;
>>>>> +        }
>>>>>           if (vc->send_queue) {
>>>>>               qemu_del_net_queue(vc->send_queue);
>>>>>           }
>>>>> -        QTAILQ_REMOVE(&non_vlan_clients, vc, next);
>>>>>           if (vc->peer) {
>>>>>               vc->peer->peer = NULL;
>>>>> +            /* If this is a guest-visible (NIC) device,
>>>>> +             * and peer has already been removed from monitor,
>>>>> +             * delete it here. */
>>>>> +            if (vc->info->type == NET_CLIENT_TYPE_NIC) {
>>>>> +                NICState *nic = DO_UPCAST(NICState, nc, vc);
>>>>> +                if (nic->peer_deleted) {
>>>>> +                    qemu_del_vlan_client(vc->peer);
>>>>> +                }
>>>>> +            }
>>>>>           }
>>>>>       }
>>>>>
>>>>> diff --git a/net.h b/net.h
>>>>> index 518cf9c..44c31a9 100644
>>>>> --- a/net.h
>>>>> +++ b/net.h
>>>>> @@ -72,6 +72,7 @@ typedef struct NICState {
>>>>>       VLANClientState nc;
>>>>>       NICConf *conf;
>>>>>       void *opaque;
>>>>> +    bool peer_deleted;
>>>>>   } NICState;
>>>>>
>>>>>   struct VLANState {
>>>>>
Anthony Liguori Sept. 20, 2010, 6:19 p.m. UTC | #6
On 09/20/2010 01:14 PM, Anthony Liguori wrote:
> Here's what makes sense to me:
>
> 1) async device remove + poll device status/removal notification + 
> remove backend
>
> The management tool needs to determine when the device is gone and 
> remove the backend.
>
> 2) sync device remove + remove backend
>
> Command does not return until device is removed
>
> 3) async device and backend removal + poll device/backend removal + 
> removal notification
>
> One command that removes the device and any associated backend.  We 
> need to indicate to the management layer when this operation is complete.
>
> I think (2) is the most elegant but also the most difficult to 
> implement today.  I think (1) is the least invasive to implement but 
> has the most management tool complexity.  (3) is probably the best 
> compromise in terms of complexity and ease of implementation.
>
> Just for comparison, your patch does:
>
> 4) async device removal + remove backend
>
> Whereas remove backend may or may not cause removal depending on 
> whether device removal has happened.  So it's really async removal but 
> it doesn't happen deterministically on it's own.  What happens if you 
> call remove backend before starting async device removal?  What if the 
> guest never removes the device?  What if a reset happens?
>
> One advantage of (1) is that there is no tricky life cycle 
> considerations.  If we did (3), we would have to think through what 
> happens if a guest doesn't respond to an unplug request.

BTW, maybe the real problem is that device_del and pci hot unplug 
shouldn't be intimately related.

We could have a pci_unplug that requested the device be unplugged.  
However, unplugging didn't result in the device being deleted.  Instead, 
device_del had to be explicitly called after pci_unplug succeeded.

IRL, if I recall correctly, there's a button that you press that sends 
the ACPI event to the OS.  There usually is an LED associated with the 
slot too and once the device has been unplugged by the OS, the LED goes 
from red to green (or something like that).  At that point, the human 
can physically remove the card from the slot.

You can also initiate the unplug from the OS without the ACPI event ever 
happening.  I suspect that in our current implementation, that means 
that we'll automatically delete the device which may have strange 
effects on management tools.

So it probably makes sense for our interface to present the same 
procedure.  What do you think?

Regards,

Anthony Liguori

> Regards,
>
> Anthony Liguori
>
>>> IOW, if device_del returns and the device isn't actually deleted,
>>> that's a bug and addressing it like this just means we'll trip over
>>> it somewhere else.
>>>
>>> We'll have the same problem with drive_del.
>> Let's fix it there as well then.
>>
>>> Regards,
>>>
>>> Anthony Liguori
>>>
>>>>>> Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
>>>>>> ---
>>>>>>   net.c |   21 ++++++++++++++++++++-
>>>>>>   net.h |    1 +
>>>>>>   2 files changed, 21 insertions(+), 1 deletions(-)
>>>>>>
>>>>>> diff --git a/net.c b/net.c
>>>>>> index 3d0fde7..10855d1 100644
>>>>>> --- a/net.c
>>>>>> +++ b/net.c
>>>>>> @@ -286,12 +286,31 @@ void qemu_del_vlan_client(VLANClientState *vc)
>>>>>>       if (vc->vlan) {
>>>>>>           QTAILQ_REMOVE(&vc->vlan->clients, vc, next);
>>>>>>       } else {
>>>>>> +        /* Even if client will not be deleted yet, remove it 
>>>>>> from list so it
>>>>>> +         * does not appear in monitor.  */
>>>>>> +        QTAILQ_REMOVE(&non_vlan_clients, vc, next);
>>>>>> +        /* Detect that guest-visible (NIC) peer is active, and 
>>>>>> delay deletion.
>>>>>> +         * */
>>>>>> +        if (vc->peer&&    vc->peer->info->type == 
>>>>>> NET_CLIENT_TYPE_NIC) {
>>>>>> +            NICState *nic = DO_UPCAST(NICState, nc, vc->peer);
>>>>>> +            assert(!nic->peer_deleted);
>>>>>> +            nic->peer_deleted = true;
>>>>>> +            return;
>>>>>> +        }
>>>>>>           if (vc->send_queue) {
>>>>>>               qemu_del_net_queue(vc->send_queue);
>>>>>>           }
>>>>>> -        QTAILQ_REMOVE(&non_vlan_clients, vc, next);
>>>>>>           if (vc->peer) {
>>>>>>               vc->peer->peer = NULL;
>>>>>> +            /* If this is a guest-visible (NIC) device,
>>>>>> +             * and peer has already been removed from monitor,
>>>>>> +             * delete it here. */
>>>>>> +            if (vc->info->type == NET_CLIENT_TYPE_NIC) {
>>>>>> +                NICState *nic = DO_UPCAST(NICState, nc, vc);
>>>>>> +                if (nic->peer_deleted) {
>>>>>> +                    qemu_del_vlan_client(vc->peer);
>>>>>> +                }
>>>>>> +            }
>>>>>>           }
>>>>>>       }
>>>>>>
>>>>>> diff --git a/net.h b/net.h
>>>>>> index 518cf9c..44c31a9 100644
>>>>>> --- a/net.h
>>>>>> +++ b/net.h
>>>>>> @@ -72,6 +72,7 @@ typedef struct NICState {
>>>>>>       VLANClientState nc;
>>>>>>       NICConf *conf;
>>>>>>       void *opaque;
>>>>>> +    bool peer_deleted;
>>>>>>   } NICState;
>>>>>>
>>>>>>   struct VLANState {
>
Michael S. Tsirkin Sept. 20, 2010, 6:24 p.m. UTC | #7
On Mon, Sep 20, 2010 at 01:14:12PM -0500, Anthony Liguori wrote:
> On 09/20/2010 12:14 PM, Michael S. Tsirkin wrote:
> >On Mon, Sep 20, 2010 at 11:56:56AM -0500, Anthony Liguori wrote:
> >>On 09/20/2010 11:47 AM, Michael S. Tsirkin wrote:
> >>>On Mon, Sep 20, 2010 at 11:41:45AM -0500, Anthony Liguori wrote:
> >>>>On 09/20/2010 11:30 AM, Michael S. Tsirkin wrote:
> >>>>>With -netdev, virtio devices present offload
> >>>>>features to guest, depending on the backend used.
> >>>>>Thus, removing host ntedev peer while guest is
> >>>>>active leads to guest-visible inconsistency and/or crashes.
> >>>>>See e.g. https://bugzilla.redhat.com/show_bug.cgi?id=623735
> >>>>>
> >>>>>As a solution, while guest (NIC) peer device exists,
> >>>>>we must prevent the host peer from being deleted.
> >>>>>
> >>>>>This patch does this by adding peer_deleted flag in nic state:
> >>>>>if host device is going away while guest device
> >>>>>is around, set this flag and keep host device around
> >>>>>for as long as guest device exists.
> >>>>Having an unclear life cycle really worries me.
> >>>>
> >>>>Wouldn't the more correct solution be to avoid removing the netdev
> >>>>device until after the peer has successfully been removed?
> >>>>
> >>>>Regards,
> >>>>
> >>>>Anthony Liguori
> >>>This is exactly what the patch does.
> >>At the management layer instead of doing it magically in the backend.
> >The amount of pain this inflicts on management would be considerable.
> >Hotplug commands were designed to be asynchronous
> >(starts the process, does not wait for it to complete), maybe that
> >was a mistake but we can not change semantics at will now.
> >
> >Add new commands, okay, but existing ones should work and get fixed
> >if there's a bug.
> 
> But having commands that are impossible to use correctly is not very good.

So we will have to fix the existing commands so they can be used
correctly. Since the device is removed from the list
shown to the monitor, I do not really see why the user
cares that the backend is actually still around
until the device is removed.

> Here's what makes sense to me:
> 
> 1) async device remove + poll device status/removal notification +
> remove backend
> 
> The management tool needs to determine when the device is gone and
> remove the backend.
> 
> 2) sync device remove + remove backend
> 
> Command does not return until device is removed
> 
> 3) async device and backend removal + poll device/backend removal +
> removal notification
> 
> One command that removes the device and any associated backend.  We
> need to indicate to the management layer when this operation is
> complete.
> 
> I think (2) is the most elegant but also the most difficult to
> implement today.  I think (1) is the least invasive to implement but
> has the most management tool complexity.  (3) is probably the best
> compromise in terms of complexity and ease of implementation.
> 
> Just for comparison, your patch does:
> 
> 4) async device removal + remove backend
> 
> Whereas remove backend may or may not cause removal depending on
> whether device removal has happened.  So it's really async removal
> but it doesn't happen deterministically on it's own.  What happens
> if you call remove backend before starting async device removal?

It won't be removed until device is removed.

> What if the guest never removes the device?

Not really different from guest never reacting to nic hotplug.
If you want to fix this, we'll need a "force" flag to delete.

>  What if a reset
> happens?

I think reset will complete the hotplug.  If it does not we need to fix
it anyway.

> One advantage of (1) is that there is no tricky life cycle
> considerations.  If we did (3), we would have to think through what
> happens if a guest doesn't respond to an unplug request.
> 
> Regards,
> 
> Anthony Liguori

All very well, but this ignores the issue:

We have told management that a way to remove a frontend backend pair is
by giving two commands.  Management has implemented this. Now we need to
have qemu do the right thing.




> >>IOW, if device_del returns and the device isn't actually deleted,
> >>that's a bug and addressing it like this just means we'll trip over
> >>it somewhere else.
> >>
> >>We'll have the same problem with drive_del.
> >Let's fix it there as well then.
> >
> >>Regards,
> >>
> >>Anthony Liguori
> >>
> >>>>>Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
> >>>>>---
> >>>>>  net.c |   21 ++++++++++++++++++++-
> >>>>>  net.h |    1 +
> >>>>>  2 files changed, 21 insertions(+), 1 deletions(-)
> >>>>>
> >>>>>diff --git a/net.c b/net.c
> >>>>>index 3d0fde7..10855d1 100644
> >>>>>--- a/net.c
> >>>>>+++ b/net.c
> >>>>>@@ -286,12 +286,31 @@ void qemu_del_vlan_client(VLANClientState *vc)
> >>>>>      if (vc->vlan) {
> >>>>>          QTAILQ_REMOVE(&vc->vlan->clients, vc, next);
> >>>>>      } else {
> >>>>>+        /* Even if client will not be deleted yet, remove it from list so it
> >>>>>+         * does not appear in monitor.  */
> >>>>>+        QTAILQ_REMOVE(&non_vlan_clients, vc, next);
> >>>>>+        /* Detect that guest-visible (NIC) peer is active, and delay deletion.
> >>>>>+         * */
> >>>>>+        if (vc->peer&&    vc->peer->info->type == NET_CLIENT_TYPE_NIC) {
> >>>>>+            NICState *nic = DO_UPCAST(NICState, nc, vc->peer);
> >>>>>+            assert(!nic->peer_deleted);
> >>>>>+            nic->peer_deleted = true;
> >>>>>+            return;
> >>>>>+        }
> >>>>>          if (vc->send_queue) {
> >>>>>              qemu_del_net_queue(vc->send_queue);
> >>>>>          }
> >>>>>-        QTAILQ_REMOVE(&non_vlan_clients, vc, next);
> >>>>>          if (vc->peer) {
> >>>>>              vc->peer->peer = NULL;
> >>>>>+            /* If this is a guest-visible (NIC) device,
> >>>>>+             * and peer has already been removed from monitor,
> >>>>>+             * delete it here. */
> >>>>>+            if (vc->info->type == NET_CLIENT_TYPE_NIC) {
> >>>>>+                NICState *nic = DO_UPCAST(NICState, nc, vc);
> >>>>>+                if (nic->peer_deleted) {
> >>>>>+                    qemu_del_vlan_client(vc->peer);
> >>>>>+                }
> >>>>>+            }
> >>>>>          }
> >>>>>      }
> >>>>>
> >>>>>diff --git a/net.h b/net.h
> >>>>>index 518cf9c..44c31a9 100644
> >>>>>--- a/net.h
> >>>>>+++ b/net.h
> >>>>>@@ -72,6 +72,7 @@ typedef struct NICState {
> >>>>>      VLANClientState nc;
> >>>>>      NICConf *conf;
> >>>>>      void *opaque;
> >>>>>+    bool peer_deleted;
> >>>>>  } NICState;
> >>>>>
> >>>>>  struct VLANState {
>
Anthony Liguori Sept. 20, 2010, 6:39 p.m. UTC | #8
On 09/20/2010 01:24 PM, Michael S. Tsirkin wrote:
> On Mon, Sep 20, 2010 at 01:14:12PM -0500, Anthony Liguori wrote:
>    
>> On 09/20/2010 12:14 PM, Michael S. Tsirkin wrote:
>>      
>>> On Mon, Sep 20, 2010 at 11:56:56AM -0500, Anthony Liguori wrote:
>>>        
>>>> On 09/20/2010 11:47 AM, Michael S. Tsirkin wrote:
>>>>          
>>>>> On Mon, Sep 20, 2010 at 11:41:45AM -0500, Anthony Liguori wrote:
>>>>>            
>>>>>> On 09/20/2010 11:30 AM, Michael S. Tsirkin wrote:
>>>>>>              
>>>>>>> With -netdev, virtio devices present offload
>>>>>>> features to guest, depending on the backend used.
>>>>>>> Thus, removing host ntedev peer while guest is
>>>>>>> active leads to guest-visible inconsistency and/or crashes.
>>>>>>> See e.g. https://bugzilla.redhat.com/show_bug.cgi?id=623735
>>>>>>>
>>>>>>> As a solution, while guest (NIC) peer device exists,
>>>>>>> we must prevent the host peer from being deleted.
>>>>>>>
>>>>>>> This patch does this by adding peer_deleted flag in nic state:
>>>>>>> if host device is going away while guest device
>>>>>>> is around, set this flag and keep host device around
>>>>>>> for as long as guest device exists.
>>>>>>>                
>>>>>> Having an unclear life cycle really worries me.
>>>>>>
>>>>>> Wouldn't the more correct solution be to avoid removing the netdev
>>>>>> device until after the peer has successfully been removed?
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Anthony Liguori
>>>>>>              
>>>>> This is exactly what the patch does.
>>>>>            
>>>> At the management layer instead of doing it magically in the backend.
>>>>          
>>> The amount of pain this inflicts on management would be considerable.
>>> Hotplug commands were designed to be asynchronous
>>> (starts the process, does not wait for it to complete), maybe that
>>> was a mistake but we can not change semantics at will now.
>>>
>>> Add new commands, okay, but existing ones should work and get fixed
>>> if there's a bug.
>>>        
>> But having commands that are impossible to use correctly is not very good.
>>      
> So we will have to fix the existing commands so they can be used
> correctly. Since the device is removed from the list
> shown to the monitor, I do not really see why the user
> cares that the backend is actually still around
> until the device is removed.
>    

That's even more wrong and maybe I don't understand what you're saying.

But the test case is easy.  acpiphp is not loaded.  You do a device_del 
of a device.  What happens?

You do a netdev_del immediately afterwards, what are you guaranteed as a 
management tool?  If I do a info network, you're telling me I don't see 
the netdev device even though the device is still there and the guest is 
actively using it?  That can't possibly be a good thing.

>> 4) async device removal + remove backend
>>
>> Whereas remove backend may or may not cause removal depending on
>> whether device removal has happened.  So it's really async removal
>> but it doesn't happen deterministically on it's own.  What happens
>> if you call remove backend before starting async device removal?
>>      
> It won't be removed until device is removed.
>    

Which is non-deterministic and guest controlled.

>> What if the guest never removes the device?
>>      
> Not really different from guest never reacting to nic hotplug.
> If you want to fix this, we'll need a "force" flag to delete.
>    

We need to make sure management tools are aware that pci hot unplug can 
fail.  We should design our interfaces to encourage this awareness.  
Force is not necessarily needed.

>>   What if a reset
>> happens?
>>      
> I think reset will complete the hotplug.  If it does not we need to fix
> it anyway.
>    

I'm fairly sure it doesn't FWIW.

>> One advantage of (1) is that there is no tricky life cycle
>> considerations.  If we did (3), we would have to think through what
>> happens if a guest doesn't respond to an unplug request.
>>
>> Regards,
>>
>> Anthony Liguori
>>      
> All very well, but this ignores the issue:
>
> We have told management that a way to remove a frontend backend pair is
> by giving two commands.

That's the problem.  This is fundamentally broken.

>    Management has implemented this. Now we need to
> have qemu do the right thing.
>    

The only way to do this correctly is to make device_del block until the 
operation completes.  Unfortunately, this becomes a libvirt DoS which 
would cause all sorts of problems.

I don't see a lot of options that allow the management tools to continue 
doing what they're doing.  This cannot work properly unless there is a 
management interface that is explicitly aware that 1) pci hot unplug can 
and will not be successful 2) the device is still there until it's 
successful (which may be forever).

Regards,

Anthony Liguori
Michael S. Tsirkin Sept. 20, 2010, 6:59 p.m. UTC | #9
On Mon, Sep 20, 2010 at 01:19:48PM -0500, Anthony Liguori wrote:
> On 09/20/2010 01:14 PM, Anthony Liguori wrote:
> >Here's what makes sense to me:
> >
> >1) async device remove + poll device status/removal notification +
> >remove backend
> >
> >The management tool needs to determine when the device is gone and
> >remove the backend.
> >
> >2) sync device remove + remove backend
> >
> >Command does not return until device is removed
> >
> >3) async device and backend removal + poll device/backend removal
> >+ removal notification
> >
> >One command that removes the device and any associated backend.
> >We need to indicate to the management layer when this operation is
> >complete.
> >
> >I think (2) is the most elegant but also the most difficult to
> >implement today.  I think (1) is the least invasive to implement
> >but has the most management tool complexity.  (3) is probably the
> >best compromise in terms of complexity and ease of implementation.
> >
> >Just for comparison, your patch does:
> >
> >4) async device removal + remove backend
> >
> >Whereas remove backend may or may not cause removal depending on
> >whether device removal has happened.  So it's really async removal
> >but it doesn't happen deterministically on it's own.  What happens
> >if you call remove backend before starting async device removal?
> >What if the guest never removes the device?  What if a reset
> >happens?
> >
> >One advantage of (1) is that there is no tricky life cycle
> >considerations.  If we did (3), we would have to think through
> >what happens if a guest doesn't respond to an unplug request.
> 
> BTW, maybe the real problem is that device_del and pci hot unplug
> shouldn't be intimately related.
> 
> We could have a pci_unplug that requested the device be unplugged.
> However, unplugging didn't result in the device being deleted.
> Instead, device_del had to be explicitly called after pci_unplug
> succeeded.
> 
> IRL, if I recall correctly, there's a button that you press that
> sends the ACPI event to the OS.  There usually is an LED associated
> with the slot too and once the device has been unplugged by the OS,
> the LED goes from red to green (or something like that).  At that
> point, the human can physically remove the card from the slot.

PCI express spec has all that.

> You can also initiate the unplug from the OS without the ACPI event
> ever happening.  I suspect that in our current implementation, that
> means that we'll automatically delete the device which may have
> strange effects on management tools.
> 
> So it probably makes sense for our interface to present the same
> procedure.  What do you think?
> 
> Regards,
> 
> Anthony Liguori

We seem to have two discussions here. you speak about how an ideal hot plug
interface will look. This can involve new commands etc.
I speak about fixing existing ones so qemu and/or guest won't crash.
This requires fixing existing commands, unless we can't
fix them at all - which is demonstrably not the case.


> >Regards,
> >
> >Anthony Liguori
> >
> >>>IOW, if device_del returns and the device isn't actually deleted,
> >>>that's a bug and addressing it like this just means we'll trip over
> >>>it somewhere else.
> >>>
> >>>We'll have the same problem with drive_del.
> >>Let's fix it there as well then.
> >>
> >>>Regards,
> >>>
> >>>Anthony Liguori
> >>>
> >>>>>>Signed-off-by: Michael S. Tsirkin<mst@redhat.com>
> >>>>>>---
> >>>>>>  net.c |   21 ++++++++++++++++++++-
> >>>>>>  net.h |    1 +
> >>>>>>  2 files changed, 21 insertions(+), 1 deletions(-)
> >>>>>>
> >>>>>>diff --git a/net.c b/net.c
> >>>>>>index 3d0fde7..10855d1 100644
> >>>>>>--- a/net.c
> >>>>>>+++ b/net.c
> >>>>>>@@ -286,12 +286,31 @@ void qemu_del_vlan_client(VLANClientState *vc)
> >>>>>>      if (vc->vlan) {
> >>>>>>          QTAILQ_REMOVE(&vc->vlan->clients, vc, next);
> >>>>>>      } else {
> >>>>>>+        /* Even if client will not be deleted yet,
> >>>>>>remove it from list so it
> >>>>>>+         * does not appear in monitor.  */
> >>>>>>+        QTAILQ_REMOVE(&non_vlan_clients, vc, next);
> >>>>>>+        /* Detect that guest-visible (NIC) peer is
> >>>>>>active, and delay deletion.
> >>>>>>+         * */
> >>>>>>+        if (vc->peer&&    vc->peer->info->type ==
> >>>>>>NET_CLIENT_TYPE_NIC) {
> >>>>>>+            NICState *nic = DO_UPCAST(NICState, nc, vc->peer);
> >>>>>>+            assert(!nic->peer_deleted);
> >>>>>>+            nic->peer_deleted = true;
> >>>>>>+            return;
> >>>>>>+        }
> >>>>>>          if (vc->send_queue) {
> >>>>>>              qemu_del_net_queue(vc->send_queue);
> >>>>>>          }
> >>>>>>-        QTAILQ_REMOVE(&non_vlan_clients, vc, next);
> >>>>>>          if (vc->peer) {
> >>>>>>              vc->peer->peer = NULL;
> >>>>>>+            /* If this is a guest-visible (NIC) device,
> >>>>>>+             * and peer has already been removed from monitor,
> >>>>>>+             * delete it here. */
> >>>>>>+            if (vc->info->type == NET_CLIENT_TYPE_NIC) {
> >>>>>>+                NICState *nic = DO_UPCAST(NICState, nc, vc);
> >>>>>>+                if (nic->peer_deleted) {
> >>>>>>+                    qemu_del_vlan_client(vc->peer);
> >>>>>>+                }
> >>>>>>+            }
> >>>>>>          }
> >>>>>>      }
> >>>>>>
> >>>>>>diff --git a/net.h b/net.h
> >>>>>>index 518cf9c..44c31a9 100644
> >>>>>>--- a/net.h
> >>>>>>+++ b/net.h
> >>>>>>@@ -72,6 +72,7 @@ typedef struct NICState {
> >>>>>>      VLANClientState nc;
> >>>>>>      NICConf *conf;
> >>>>>>      void *opaque;
> >>>>>>+    bool peer_deleted;
> >>>>>>  } NICState;
> >>>>>>
> >>>>>>  struct VLANState {
> >
>
Michael S. Tsirkin Sept. 20, 2010, 7:15 p.m. UTC | #10
On Mon, Sep 20, 2010 at 01:39:00PM -0500, Anthony Liguori wrote:
> On 09/20/2010 01:24 PM, Michael S. Tsirkin wrote:
> >On Mon, Sep 20, 2010 at 01:14:12PM -0500, Anthony Liguori wrote:
> >>On 09/20/2010 12:14 PM, Michael S. Tsirkin wrote:
> >>>On Mon, Sep 20, 2010 at 11:56:56AM -0500, Anthony Liguori wrote:
> >>>>On 09/20/2010 11:47 AM, Michael S. Tsirkin wrote:
> >>>>>On Mon, Sep 20, 2010 at 11:41:45AM -0500, Anthony Liguori wrote:
> >>>>>>On 09/20/2010 11:30 AM, Michael S. Tsirkin wrote:
> >>>>>>>With -netdev, virtio devices present offload
> >>>>>>>features to guest, depending on the backend used.
> >>>>>>>Thus, removing host ntedev peer while guest is
> >>>>>>>active leads to guest-visible inconsistency and/or crashes.
> >>>>>>>See e.g. https://bugzilla.redhat.com/show_bug.cgi?id=623735
> >>>>>>>
> >>>>>>>As a solution, while guest (NIC) peer device exists,
> >>>>>>>we must prevent the host peer from being deleted.
> >>>>>>>
> >>>>>>>This patch does this by adding peer_deleted flag in nic state:
> >>>>>>>if host device is going away while guest device
> >>>>>>>is around, set this flag and keep host device around
> >>>>>>>for as long as guest device exists.
> >>>>>>Having an unclear life cycle really worries me.
> >>>>>>
> >>>>>>Wouldn't the more correct solution be to avoid removing the netdev
> >>>>>>device until after the peer has successfully been removed?
> >>>>>>
> >>>>>>Regards,
> >>>>>>
> >>>>>>Anthony Liguori
> >>>>>This is exactly what the patch does.
> >>>>At the management layer instead of doing it magically in the backend.
> >>>The amount of pain this inflicts on management would be considerable.
> >>>Hotplug commands were designed to be asynchronous
> >>>(starts the process, does not wait for it to complete), maybe that
> >>>was a mistake but we can not change semantics at will now.
> >>>
> >>>Add new commands, okay, but existing ones should work and get fixed
> >>>if there's a bug.
> >>But having commands that are impossible to use correctly is not very good.
> >So we will have to fix the existing commands so they can be used
> >correctly. Since the device is removed from the list
> >shown to the monitor, I do not really see why the user
> >cares that the backend is actually still around
> >until the device is removed.
> 
> That's even more wrong and maybe I don't understand what you're saying.
> 
> But the test case is easy.  acpiphp is not loaded.  You do a
> device_del of a device.  What happens?

Backend will stay active until guest reset (once we fix guest reset).
An alternative is to only lock the backend when guest driver loads,
and unlock on nic reset in addition to hotplug.
It would work as long as the backend only affects the driver, not the
guest OS itself.

I discarded this approach as it seemed even less deterministic ...
do you like this better?

> You do a netdev_del immediately afterwards, what are you guaranteed
> as a management tool?

That it will go away when frontend nic goes away.

> If I do a info network, you're telling me I
> don't see the netdev device even though the device is still there
> and the guest is actively using it?  That can't possibly be a good
> thing.

This last point could be changed: just need to be careful not
to allow the backend device to be deleted twice.
I thought it's nicer to remove it from info net as a sign
that it is going away ... what do you think?

> >>4) async device removal + remove backend
> >>
> >>Whereas remove backend may or may not cause removal depending on
> >>whether device removal has happened.  So it's really async removal
> >>but it doesn't happen deterministically on it's own.  What happens
> >>if you call remove backend before starting async device removal?
> >It won't be removed until device is removed.
> 
> Which is non-deterministic and guest controlled.
> 
> >>What if the guest never removes the device?
> >Not really different from guest never reacting to nic hotplug.
> >If you want to fix this, we'll need a "force" flag to delete.
> 
> We need to make sure management tools are aware that pci hot unplug
> can fail.  We should design our interfaces to encourage this
> awareness.  Force is not necessarily needed.

"force" is really just surprise removal.
It's needed if we want to emulate that.

> >>  What if a reset
> >>happens?
> >I think reset will complete the hotplug.  If it does not we need to fix
> >it anyway.
> 
> I'm fairly sure it doesn't FWIW.

It's a bug I think. We can't claim to wait for guest if we know it
rebooted.

> >>One advantage of (1) is that there is no tricky life cycle
> >>considerations.  If we did (3), we would have to think through what
> >>happens if a guest doesn't respond to an unplug request.
> >>
> >>Regards,
> >>
> >>Anthony Liguori
> >All very well, but this ignores the issue:
> >
> >We have told management that a way to remove a frontend backend pair is
> >by giving two commands.
> 
> That's the problem.  This is fundamentally broken.
> 
> >   Management has implemented this. Now we need to
> >have qemu do the right thing.
> 
> The only way to do this correctly is to make device_del block until
> the operation completes.  Unfortunately, this becomes a libvirt DoS
> which would cause all sorts of problems.
> 
> I don't see a lot of options that allow the management tools to
> continue doing what they're doing.  This cannot work properly unless
> there is a management interface that is explicitly aware that 1) pci
> hot unplug can and will not be successful 2) the device is still
> there until it's successful (which may be forever).
> 
> Regards,
> 
> Anthony Liguori

Let's start with disabling hotplug in 0.13 then?
It's clearly not feasible to add new commands there.
Anthony Liguori Sept. 20, 2010, 7:22 p.m. UTC | #11
On 09/20/2010 01:59 PM, Michael S. Tsirkin wrote:
>> You can also initiate the unplug from the OS without the ACPI event
>> ever happening.  I suspect that in our current implementation, that
>> means that we'll automatically delete the device which may have
>> strange effects on management tools.
>>
>> So it probably makes sense for our interface to present the same
>> procedure.  What do you think?
>>
>> Regards,
>>
>> Anthony Liguori
>>      
> We seem to have two discussions here. you speak about how an ideal hot plug
> interface will look. This can involve new commands etc.
> I speak about fixing existing ones so qemu and/or guest won't crash.
>    

To be fair, existing qemu won't crash if you do:

(qemu) device_del <device>
Use info_qtree to notice when device goes away
(qemu) netdev_del <backend>

You're trying to come up with a workaround for the fact that libvirt is 
making bad assumptions.  That's wrong.  We either need to fix libvirt to 
not make bad assumptions or we need to provide better interfaces for 
libvirt to use if the current interfaces aren't desirable.

Regards,

Anthony Liguori

> This requires fixing existing commands, unless we can't
> fix them at all - which is demonstrably not the case.
>
Anthony Liguori Sept. 20, 2010, 7:28 p.m. UTC | #12
On 09/20/2010 02:15 PM, Michael S. Tsirkin wrote:
> On Mon, Sep 20, 2010 at 01:39:00PM -0500, Anthony Liguori wrote:
>    
>> On 09/20/2010 01:24 PM, Michael S. Tsirkin wrote:
>>      
>>> On Mon, Sep 20, 2010 at 01:14:12PM -0500, Anthony Liguori wrote:
>>>        
>>>> On 09/20/2010 12:14 PM, Michael S. Tsirkin wrote:
>>>>          
>>>>> On Mon, Sep 20, 2010 at 11:56:56AM -0500, Anthony Liguori wrote:
>>>>>            
>>>>>> On 09/20/2010 11:47 AM, Michael S. Tsirkin wrote:
>>>>>>              
>>>>>>> On Mon, Sep 20, 2010 at 11:41:45AM -0500, Anthony Liguori wrote:
>>>>>>>                
>>>>>>>> On 09/20/2010 11:30 AM, Michael S. Tsirkin wrote:
>>>>>>>>                  
>>>>>>>>> With -netdev, virtio devices present offload
>>>>>>>>> features to guest, depending on the backend used.
>>>>>>>>> Thus, removing host ntedev peer while guest is
>>>>>>>>> active leads to guest-visible inconsistency and/or crashes.
>>>>>>>>> See e.g. https://bugzilla.redhat.com/show_bug.cgi?id=623735
>>>>>>>>>
>>>>>>>>> As a solution, while guest (NIC) peer device exists,
>>>>>>>>> we must prevent the host peer from being deleted.
>>>>>>>>>
>>>>>>>>> This patch does this by adding peer_deleted flag in nic state:
>>>>>>>>> if host device is going away while guest device
>>>>>>>>> is around, set this flag and keep host device around
>>>>>>>>> for as long as guest device exists.
>>>>>>>>>                    
>>>>>>>> Having an unclear life cycle really worries me.
>>>>>>>>
>>>>>>>> Wouldn't the more correct solution be to avoid removing the netdev
>>>>>>>> device until after the peer has successfully been removed?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Anthony Liguori
>>>>>>>>                  
>>>>>>> This is exactly what the patch does.
>>>>>>>                
>>>>>> At the management layer instead of doing it magically in the backend.
>>>>>>              
>>>>> The amount of pain this inflicts on management would be considerable.
>>>>> Hotplug commands were designed to be asynchronous
>>>>> (starts the process, does not wait for it to complete), maybe that
>>>>> was a mistake but we can not change semantics at will now.
>>>>>
>>>>> Add new commands, okay, but existing ones should work and get fixed
>>>>> if there's a bug.
>>>>>            
>>>> But having commands that are impossible to use correctly is not very good.
>>>>          
>>> So we will have to fix the existing commands so they can be used
>>> correctly. Since the device is removed from the list
>>> shown to the monitor, I do not really see why the user
>>> cares that the backend is actually still around
>>> until the device is removed.
>>>        
>> That's even more wrong and maybe I don't understand what you're saying.
>>
>> But the test case is easy.  acpiphp is not loaded.  You do a
>> device_del of a device.  What happens?
>>      
> Backend will stay active until guest reset (once we fix guest reset).
> An alternative is to only lock the backend when guest driver loads,
> and unlock on nic reset in addition to hotplug.
> It would work as long as the backend only affects the driver, not the
> guest OS itself.
>
> I discarded this approach as it seemed even less deterministic ...
> do you like this better?
>    

I think the only workable approach that doesn't involve new commands is 
to change the semantics of the existing ones.

Make netdev_del work regardless of whether the device is still present.

You would need to reference count the actual netdev structure and have 
each device using it unref on delete.  You make netdev_del mark the 
device as deleted and when a device is deleted, any calls into the 
device effectively become nops.

You have to go through most of the cleanup process to ensure that tap 
device gets closed even before your reference count goes to zero.

>>> have qemu do the right thing.
>>>        
>> The only way to do this correctly is to make device_del block until
>> the operation completes.  Unfortunately, this becomes a libvirt DoS
>> which would cause all sorts of problems.
>>
>> I don't see a lot of options that allow the management tools to
>> continue doing what they're doing.  This cannot work properly unless
>> there is a management interface that is explicitly aware that 1) pci
>> hot unplug can and will not be successful 2) the device is still
>> there until it's successful (which may be forever).
>>
>> Regards,
>>
>> Anthony Liguori
>>      
> Let's start with disabling hotplug in 0.13 then?
> It's clearly not feasible to add new commands there.
>    

To be fair, this is really a bug in libvirt IMHO.  libvirt shouldn't 
assume that after device_del is called, the device is gone.

Regards,

Anthony Liguori
Michael S. Tsirkin Sept. 20, 2010, 7:37 p.m. UTC | #13
On Mon, Sep 20, 2010 at 02:22:18PM -0500, Anthony Liguori wrote:
> On 09/20/2010 01:59 PM, Michael S. Tsirkin wrote:
> >>You can also initiate the unplug from the OS without the ACPI event
> >>ever happening.  I suspect that in our current implementation, that
> >>means that we'll automatically delete the device which may have
> >>strange effects on management tools.
> >>
> >>So it probably makes sense for our interface to present the same
> >>procedure.  What do you think?
> >>
> >>Regards,
> >>
> >>Anthony Liguori
> >We seem to have two discussions here. you speak about how an ideal hot plug
> >interface will look. This can involve new commands etc.
> >I speak about fixing existing ones so qemu and/or guest won't crash.
> 
> To be fair, existing qemu won't crash if you do:
> 
> (qemu) device_del <device>
> Use info_qtree to notice when device goes away
> (qemu) netdev_del <backend>

Asking libvirt to busy loop polling the monitor sounds like a really bad
situation: note that guest is blocked from doing any io while monitor is
used, so it may in fact prevent it from making progress. Right?

So why can't we let management do netdev_del and have it take effect
when this becomes possible?

> You're trying to come up with a workaround for the fact that libvirt
> is making bad assumptions.

BTW, even if it is, I don't think we should be crashing qemu or guest.

> That's wrong.  We either need to fix
> libvirt to not make bad assumptions or we need to provide better
> interfaces for libvirt to use if the current interfaces aren't
> desirable.
> 
> Regards,
> 
> Anthony Liguori
> 
> >This requires fixing existing commands, unless we can't
> >fix them at all - which is demonstrably not the case.
> 

But frankly, most command semantics are completely ad hock and not well
undefined, in my mind it's better to define them to accomodate existing
users.
Michael S. Tsirkin Sept. 20, 2010, 7:44 p.m. UTC | #14
On Mon, Sep 20, 2010 at 02:28:42PM -0500, Anthony Liguori wrote:
> On 09/20/2010 02:15 PM, Michael S. Tsirkin wrote:
> >On Mon, Sep 20, 2010 at 01:39:00PM -0500, Anthony Liguori wrote:
> >>On 09/20/2010 01:24 PM, Michael S. Tsirkin wrote:
> >>>On Mon, Sep 20, 2010 at 01:14:12PM -0500, Anthony Liguori wrote:
> >>>>On 09/20/2010 12:14 PM, Michael S. Tsirkin wrote:
> >>>>>On Mon, Sep 20, 2010 at 11:56:56AM -0500, Anthony Liguori wrote:
> >>>>>>On 09/20/2010 11:47 AM, Michael S. Tsirkin wrote:
> >>>>>>>On Mon, Sep 20, 2010 at 11:41:45AM -0500, Anthony Liguori wrote:
> >>>>>>>>On 09/20/2010 11:30 AM, Michael S. Tsirkin wrote:
> >>>>>>>>>With -netdev, virtio devices present offload
> >>>>>>>>>features to guest, depending on the backend used.
> >>>>>>>>>Thus, removing host ntedev peer while guest is
> >>>>>>>>>active leads to guest-visible inconsistency and/or crashes.
> >>>>>>>>>See e.g. https://bugzilla.redhat.com/show_bug.cgi?id=623735
> >>>>>>>>>
> >>>>>>>>>As a solution, while guest (NIC) peer device exists,
> >>>>>>>>>we must prevent the host peer from being deleted.
> >>>>>>>>>
> >>>>>>>>>This patch does this by adding peer_deleted flag in nic state:
> >>>>>>>>>if host device is going away while guest device
> >>>>>>>>>is around, set this flag and keep host device around
> >>>>>>>>>for as long as guest device exists.
> >>>>>>>>Having an unclear life cycle really worries me.
> >>>>>>>>
> >>>>>>>>Wouldn't the more correct solution be to avoid removing the netdev
> >>>>>>>>device until after the peer has successfully been removed?
> >>>>>>>>
> >>>>>>>>Regards,
> >>>>>>>>
> >>>>>>>>Anthony Liguori
> >>>>>>>This is exactly what the patch does.
> >>>>>>At the management layer instead of doing it magically in the backend.
> >>>>>The amount of pain this inflicts on management would be considerable.
> >>>>>Hotplug commands were designed to be asynchronous
> >>>>>(starts the process, does not wait for it to complete), maybe that
> >>>>>was a mistake but we can not change semantics at will now.
> >>>>>
> >>>>>Add new commands, okay, but existing ones should work and get fixed
> >>>>>if there's a bug.
> >>>>But having commands that are impossible to use correctly is not very good.
> >>>So we will have to fix the existing commands so they can be used
> >>>correctly. Since the device is removed from the list
> >>>shown to the monitor, I do not really see why the user
> >>>cares that the backend is actually still around
> >>>until the device is removed.
> >>That's even more wrong and maybe I don't understand what you're saying.
> >>
> >>But the test case is easy.  acpiphp is not loaded.  You do a
> >>device_del of a device.  What happens?
> >Backend will stay active until guest reset (once we fix guest reset).
> >An alternative is to only lock the backend when guest driver loads,
> >and unlock on nic reset in addition to hotplug.
> >It would work as long as the backend only affects the driver, not the
> >guest OS itself.
> >
> >I discarded this approach as it seemed even less deterministic ...
> >do you like this better?
> 
> I think the only workable approach that doesn't involve new commands
> is to change the semantics of the existing ones.
> 
> Make netdev_del work regardless of whether the device is still present.
> 
> You would need to reference count the actual netdev structure and
> have each device using it unref on delete.  You make netdev_del mark
> the device as deleted and when a device is deleted, any calls into
> the device effectively become nops.
> 
> You have to go through most of the cleanup process to ensure that
> tap device gets closed even before your reference count goes to
> zero.

I think you mean 'does not get closed': we need the fd to get the flags etc.

> >>>have qemu do the right thing.
> >>The only way to do this correctly is to make device_del block until
> >>the operation completes.  Unfortunately, this becomes a libvirt DoS
> >>which would cause all sorts of problems.
> >>
> >>I don't see a lot of options that allow the management tools to
> >>continue doing what they're doing.  This cannot work properly unless
> >>there is a management interface that is explicitly aware that 1) pci
> >>hot unplug can and will not be successful 2) the device is still
> >>there until it's successful (which may be forever).
> >>
> >>Regards,
> >>
> >>Anthony Liguori
> >Let's start with disabling hotplug in 0.13 then?
> >It's clearly not feasible to add new commands there.
> 
> To be fair, this is really a bug in libvirt IMHO.  libvirt shouldn't
> assume that after device_del is called, the device is gone.
> 
> Regards,
> 
> Anthony Liguori
> 

Note that it will mostly work unless when it'll crash.
Issue is we don't have any documentation so
people get the command set by trial and error.

So how can we prove it's a user bug and not qemu bug?
I guess we should blame ourselves until proven innocent.
Michael S. Tsirkin Sept. 20, 2010, 8:15 p.m. UTC | #15
On Mon, Sep 20, 2010 at 03:15:45PM -0500, Anthony Liguori wrote:
> On 09/20/2010 02:37 PM, Michael S. Tsirkin wrote:
> >On Mon, Sep 20, 2010 at 02:22:18PM -0500, Anthony Liguori wrote:
> >>On 09/20/2010 01:59 PM, Michael S. Tsirkin wrote:
> >>>>You can also initiate the unplug from the OS without the ACPI event
> >>>>ever happening.  I suspect that in our current implementation, that
> >>>>means that we'll automatically delete the device which may have
> >>>>strange effects on management tools.
> >>>>
> >>>>So it probably makes sense for our interface to present the same
> >>>>procedure.  What do you think?
> >>>>
> >>>>Regards,
> >>>>
> >>>>Anthony Liguori
> >>>We seem to have two discussions here. you speak about how an ideal hot plug
> >>>interface will look. This can involve new commands etc.
> >>>I speak about fixing existing ones so qemu and/or guest won't crash.
> >>To be fair, existing qemu won't crash if you do:
> >>
> >>(qemu) device_del<device>
> >>Use info_qtree to notice when device goes away
> >>(qemu) netdev_del<backend>
> >Asking libvirt to busy loop polling the monitor sounds like a really bad
> >situation: note that guest is blocked from doing any io while monitor is
> >used, so it may in fact prevent it from making progress. Right?
> 
> With a busy loop?  No, the guest will always make some progress
> because we drop back to the main loop.  But that's besides the point
> really.  libvirt can just do a usleep() when polling.
> 
> Yes, this interface sucks but that's the interface we have today.
> 
> >So why can't we let management do netdev_del and have it take effect
> >when this becomes possible?
> 
> You're making netdev_del be a semantic request that a network
> backend is eventually deleted after some guest controlled period of
> time.
> 
> If libvirt is trying to do useful things like manage a limit set of
> resources (maybe VFs using SR-IOV passthrough), then libvirt needs
> to know when it can reassign a VF to another guest.  But now it
> can't do that after it does netdev_del.  Is it supposed to poll to
> figure out when it can do it?
> 
> >>You're trying to come up with a workaround for the fact that libvirt
> >>is making bad assumptions.
> >BTW, even if it is, I don't think we should be crashing qemu or guest.
> 
> That's certainly true.  But that's a different patch.
> 
> >>That's wrong.  We either need to fix
> >>libvirt to not make bad assumptions or we need to provide better
> >>interfaces for libvirt to use if the current interfaces aren't
> >>desirable.
> >>
> >>Regards,
> >>
> >>Anthony Liguori
> >>
> >>>This requires fixing existing commands, unless we can't
> >>>fix them at all - which is demonstrably not the case.
> >But frankly, most command semantics are completely ad hock and not well
> >undefined, in my mind it's better to define them to accomodate existing
> >users.
> 
> Okay, let's give them an interface they actually want.  Forcing them
> to poll for when a netdev is actually removed is probably not what
> they actually want.
> 
> Regards,
> 
> Anthony Liguori


Agreed. Any libvirt guys listening in on this discussion
and have an opinion on what does libvirt actually want?
Anthony Liguori Sept. 20, 2010, 8:15 p.m. UTC | #16
On 09/20/2010 02:37 PM, Michael S. Tsirkin wrote:
> On Mon, Sep 20, 2010 at 02:22:18PM -0500, Anthony Liguori wrote:
>    
>> On 09/20/2010 01:59 PM, Michael S. Tsirkin wrote:
>>      
>>>> You can also initiate the unplug from the OS without the ACPI event
>>>> ever happening.  I suspect that in our current implementation, that
>>>> means that we'll automatically delete the device which may have
>>>> strange effects on management tools.
>>>>
>>>> So it probably makes sense for our interface to present the same
>>>> procedure.  What do you think?
>>>>
>>>> Regards,
>>>>
>>>> Anthony Liguori
>>>>          
>>> We seem to have two discussions here. you speak about how an ideal hot plug
>>> interface will look. This can involve new commands etc.
>>> I speak about fixing existing ones so qemu and/or guest won't crash.
>>>        
>> To be fair, existing qemu won't crash if you do:
>>
>> (qemu) device_del<device>
>> Use info_qtree to notice when device goes away
>> (qemu) netdev_del<backend>
>>      
> Asking libvirt to busy loop polling the monitor sounds like a really bad
> situation: note that guest is blocked from doing any io while monitor is
> used, so it may in fact prevent it from making progress. Right?
>    

With a busy loop?  No, the guest will always make some progress because 
we drop back to the main loop.  But that's besides the point really.  
libvirt can just do a usleep() when polling.

Yes, this interface sucks but that's the interface we have today.

> So why can't we let management do netdev_del and have it take effect
> when this becomes possible?
>    

You're making netdev_del be a semantic request that a network backend is 
eventually deleted after some guest controlled period of time.

If libvirt is trying to do useful things like manage a limit set of 
resources (maybe VFs using SR-IOV passthrough), then libvirt needs to 
know when it can reassign a VF to another guest.  But now it can't do 
that after it does netdev_del.  Is it supposed to poll to figure out 
when it can do it?

>> You're trying to come up with a workaround for the fact that libvirt
>> is making bad assumptions.
>>      
> BTW, even if it is, I don't think we should be crashing qemu or guest.
>    

That's certainly true.  But that's a different patch.

>> That's wrong.  We either need to fix
>> libvirt to not make bad assumptions or we need to provide better
>> interfaces for libvirt to use if the current interfaces aren't
>> desirable.
>>
>> Regards,
>>
>> Anthony Liguori
>>
>>      
>>> This requires fixing existing commands, unless we can't
>>> fix them at all - which is demonstrably not the case.
>>>        
>>      
> But frankly, most command semantics are completely ad hock and not well
> undefined, in my mind it's better to define them to accomodate existing
> users.
>    

Okay, let's give them an interface they actually want.  Forcing them to 
poll for when a netdev is actually removed is probably not what they 
actually want.

Regards,

Anthony Liguori
Anthony Liguori Sept. 20, 2010, 8:20 p.m. UTC | #17
On 09/20/2010 02:44 PM, Michael S. Tsirkin wrote:
>
>> I think the only workable approach that doesn't involve new commands
>> is to change the semantics of the existing ones.
>>
>> Make netdev_del work regardless of whether the device is still present.
>>
>> You would need to reference count the actual netdev structure and
>> have each device using it unref on delete.  You make netdev_del mark
>> the device as deleted and when a device is deleted, any calls into
>> the device effectively become nops.
>>
>> You have to go through most of the cleanup process to ensure that
>> tap device gets closed even before your reference count goes to
>> zero.
>>      
> I think you mean 'does not get closed': we need the fd to get the flags etc.
>    

No, I actually meant does get closed.

When you do netdev_del, it should result in the fd getting closed.

The actual netdev structure then becomes a zombie that's completely 
useless until the device goes away.

> Note that it will mostly work unless when it'll crash.
> Issue is we don't have any documentation so
> people get the command set by trial and error.
>
> So how can we prove it's a user bug and not qemu bug?
> I guess we should blame ourselves until proven innocent.
>    

Here's what I'm now suggesting:

device_del -> may or may not unplug a device from a guest when it 
returns.  To figure out if it does, you have to run info qdm.

netdev_del -> always destroys a netdev device when it returns.  May be 
called at any point in time.  If you destroy a netdev while the device 
is still using it, all packets go into the bit bucket and the link 
status is modified to be unplugged.

You're suggesting:

netdev_del -> may or may not destroy a netdev depending on when the 
device delete completes.  Eventually, when there's a reset, we will kill 
the device.  Even though the netdev is still active, we'll hide it from 
the management tools.

I think the suggested semantics are totally unusable.  If we can make 
something deterministic for a management tool, that should be the path 
we take.

Regards,

Anthony Liguori
Michael S. Tsirkin Sept. 20, 2010, 8:27 p.m. UTC | #18
On Mon, Sep 20, 2010 at 03:20:59PM -0500, Anthony Liguori wrote:
> On 09/20/2010 02:44 PM, Michael S. Tsirkin wrote:
> >
> >>I think the only workable approach that doesn't involve new commands
> >>is to change the semantics of the existing ones.
> >>
> >>Make netdev_del work regardless of whether the device is still present.
> >>
> >>You would need to reference count the actual netdev structure and
> >>have each device using it unref on delete.  You make netdev_del mark
> >>the device as deleted and when a device is deleted, any calls into
> >>the device effectively become nops.
> >>
> >>You have to go through most of the cleanup process to ensure that
> >>tap device gets closed even before your reference count goes to
> >>zero.
> >I think you mean 'does not get closed': we need the fd to get the flags etc.
> 
> No, I actually meant does get closed.
> 
> When you do netdev_del, it should result in the fd getting closed.
> 
> The actual netdev structure then becomes a zombie that's completely
> useless until the device goes away.
> 
> >Note that it will mostly work unless when it'll crash.
> >Issue is we don't have any documentation so
> >people get the command set by trial and error.
> >
> >So how can we prove it's a user bug and not qemu bug?
> >I guess we should blame ourselves until proven innocent.
> 
> Here's what I'm now suggesting:
> 
> device_del -> may or may not unplug a device from a guest when it
> returns.  To figure out if it does, you have to run info qdm.

I think it should also always unplug on guest reset.

> netdev_del -> always destroys a netdev device when it returns.  May
> be called at any point in time.  If you destroy a netdev while the
> device is still using it, all packets go into the bit bucket and the
> link status is modified to be unplugged.

One issue here is that we can't allow a new device with same name
to be created until the nic is destroyed.

> You're suggesting:
> 
> netdev_del -> may or may not destroy a netdev depending on when the
> device delete completes.  Eventually, when there's a reset, we will
> kill the device.  Even though the netdev is still active, we'll hide
> it from the management tools.

This last point is a bug in my patch. I now think we should not hide it,
run info net to figure out when it is removed.

> I think the suggested semantics are totally unusable.  If we can
> make something deterministic for a management tool, that should be
> the path we take.
> 
> Regards,
> 
> Anthony Liguori

So I basically propose that netdev_del and device_del behave
identically. Why is this unusable?
Michael S. Tsirkin Sept. 20, 2010, 8:37 p.m. UTC | #19
On Mon, Sep 20, 2010 at 03:38:36PM -0500, Anthony Liguori wrote:
> On 09/20/2010 03:27 PM, Michael S. Tsirkin wrote:
> >On Mon, Sep 20, 2010 at 03:20:59PM -0500, Anthony Liguori wrote:
> >>On 09/20/2010 02:44 PM, Michael S. Tsirkin wrote:
> >>>>I think the only workable approach that doesn't involve new commands
> >>>>is to change the semantics of the existing ones.
> >>>>
> >>>>Make netdev_del work regardless of whether the device is still present.
> >>>>
> >>>>You would need to reference count the actual netdev structure and
> >>>>have each device using it unref on delete.  You make netdev_del mark
> >>>>the device as deleted and when a device is deleted, any calls into
> >>>>the device effectively become nops.
> >>>>
> >>>>You have to go through most of the cleanup process to ensure that
> >>>>tap device gets closed even before your reference count goes to
> >>>>zero.
> >>>I think you mean 'does not get closed': we need the fd to get the flags etc.
> >>No, I actually meant does get closed.
> >>
> >>When you do netdev_del, it should result in the fd getting closed.
> >>
> >>The actual netdev structure then becomes a zombie that's completely
> >>useless until the device goes away.
> >>
> >>>Note that it will mostly work unless when it'll crash.
> >>>Issue is we don't have any documentation so
> >>>people get the command set by trial and error.
> >>>
> >>>So how can we prove it's a user bug and not qemu bug?
> >>>I guess we should blame ourselves until proven innocent.
> >>Here's what I'm now suggesting:
> >>
> >>device_del ->  may or may not unplug a device from a guest when it
> >>returns.  To figure out if it does, you have to run info qdm.
> >I think it should also always unplug on guest reset.
> >
> >>netdev_del ->  always destroys a netdev device when it returns.  May
> >>be called at any point in time.  If you destroy a netdev while the
> >>device is still using it, all packets go into the bit bucket and the
> >>link status is modified to be unplugged.
> >One issue here is that we can't allow a new device with same name
> >to be created until the nic is destroyed.
> 
> A new netdev device?  Why not?

Because it won't work: it will try to pair with existing nic device
(it is looked up by name) and that will fail.

> We just need the struct
> VLANClientState to remain valid, it doesn't need to be in the linked
> list of clients..
> 
> >>I think the suggested semantics are totally unusable.  If we can
> >>make something deterministic for a management tool, that should be
> >>the path we take.
> >>
> >>Regards,
> >>
> >>Anthony Liguori
> >So I basically propose that netdev_del and device_del behave
> >identically. Why is this unusable?
> 
> Because the fundamental problem is that device_del is too difficult
> to use.  You're just making netdev_del equally difficult to use.
> 
> Try your patch with libvirt, don't load acpiphp in the guest, and
> then play around with virsh device_detach and device_attach.  All
> sorts of badness will ensue as libvirt tries to manage assignment of
> PCI slot numbers and such.
> 
> Regards,
> 
> Anthony Liguori

At some level, that's right. Is yours better?
I guess the right thing is to wait for libvirt guys to tell
us what they prefer.
Anthony Liguori Sept. 20, 2010, 8:38 p.m. UTC | #20
On 09/20/2010 03:27 PM, Michael S. Tsirkin wrote:
> On Mon, Sep 20, 2010 at 03:20:59PM -0500, Anthony Liguori wrote:
>    
>> On 09/20/2010 02:44 PM, Michael S. Tsirkin wrote:
>>      
>>>        
>>>> I think the only workable approach that doesn't involve new commands
>>>> is to change the semantics of the existing ones.
>>>>
>>>> Make netdev_del work regardless of whether the device is still present.
>>>>
>>>> You would need to reference count the actual netdev structure and
>>>> have each device using it unref on delete.  You make netdev_del mark
>>>> the device as deleted and when a device is deleted, any calls into
>>>> the device effectively become nops.
>>>>
>>>> You have to go through most of the cleanup process to ensure that
>>>> tap device gets closed even before your reference count goes to
>>>> zero.
>>>>          
>>> I think you mean 'does not get closed': we need the fd to get the flags etc.
>>>        
>> No, I actually meant does get closed.
>>
>> When you do netdev_del, it should result in the fd getting closed.
>>
>> The actual netdev structure then becomes a zombie that's completely
>> useless until the device goes away.
>>
>>      
>>> Note that it will mostly work unless when it'll crash.
>>> Issue is we don't have any documentation so
>>> people get the command set by trial and error.
>>>
>>> So how can we prove it's a user bug and not qemu bug?
>>> I guess we should blame ourselves until proven innocent.
>>>        
>> Here's what I'm now suggesting:
>>
>> device_del ->  may or may not unplug a device from a guest when it
>> returns.  To figure out if it does, you have to run info qdm.
>>      
> I think it should also always unplug on guest reset.
>
>    
>> netdev_del ->  always destroys a netdev device when it returns.  May
>> be called at any point in time.  If you destroy a netdev while the
>> device is still using it, all packets go into the bit bucket and the
>> link status is modified to be unplugged.
>>      
> One issue here is that we can't allow a new device with same name
> to be created until the nic is destroyed.
>    

A new netdev device?  Why not?  We just need the struct VLANClientState 
to remain valid, it doesn't need to be in the linked list of clients..

>> I think the suggested semantics are totally unusable.  If we can
>> make something deterministic for a management tool, that should be
>> the path we take.
>>
>> Regards,
>>
>> Anthony Liguori
>>      
> So I basically propose that netdev_del and device_del behave
> identically. Why is this unusable?
>    

Because the fundamental problem is that device_del is too difficult to 
use.  You're just making netdev_del equally difficult to use.

Try your patch with libvirt, don't load acpiphp in the guest, and then 
play around with virsh device_detach and device_attach.  All sorts of 
badness will ensue as libvirt tries to manage assignment of PCI slot 
numbers and such.

Regards,

Anthony Liguori
Anthony Liguori Sept. 20, 2010, 8:50 p.m. UTC | #21
On 09/20/2010 03:37 PM, Michael S. Tsirkin wrote:
> On Mon, Sep 20, 2010 at 03:38:36PM -0500, Anthony Liguori wrote:
>    
>> On 09/20/2010 03:27 PM, Michael S. Tsirkin wrote:
>>      
>>> On Mon, Sep 20, 2010 at 03:20:59PM -0500, Anthony Liguori wrote:
>>>        
>>>> On 09/20/2010 02:44 PM, Michael S. Tsirkin wrote:
>>>>          
>>>>>> I think the only workable approach that doesn't involve new commands
>>>>>> is to change the semantics of the existing ones.
>>>>>>
>>>>>> Make netdev_del work regardless of whether the device is still present.
>>>>>>
>>>>>> You would need to reference count the actual netdev structure and
>>>>>> have each device using it unref on delete.  You make netdev_del mark
>>>>>> the device as deleted and when a device is deleted, any calls into
>>>>>> the device effectively become nops.
>>>>>>
>>>>>> You have to go through most of the cleanup process to ensure that
>>>>>> tap device gets closed even before your reference count goes to
>>>>>> zero.
>>>>>>              
>>>>> I think you mean 'does not get closed': we need the fd to get the flags etc.
>>>>>            
>>>> No, I actually meant does get closed.
>>>>
>>>> When you do netdev_del, it should result in the fd getting closed.
>>>>
>>>> The actual netdev structure then becomes a zombie that's completely
>>>> useless until the device goes away.
>>>>
>>>>          
>>>>> Note that it will mostly work unless when it'll crash.
>>>>> Issue is we don't have any documentation so
>>>>> people get the command set by trial and error.
>>>>>
>>>>> So how can we prove it's a user bug and not qemu bug?
>>>>> I guess we should blame ourselves until proven innocent.
>>>>>            
>>>> Here's what I'm now suggesting:
>>>>
>>>> device_del ->   may or may not unplug a device from a guest when it
>>>> returns.  To figure out if it does, you have to run info qdm.
>>>>          
>>> I think it should also always unplug on guest reset.
>>>
>>>        
>>>> netdev_del ->   always destroys a netdev device when it returns.  May
>>>> be called at any point in time.  If you destroy a netdev while the
>>>> device is still using it, all packets go into the bit bucket and the
>>>> link status is modified to be unplugged.
>>>>          
>>> One issue here is that we can't allow a new device with same name
>>> to be created until the nic is destroyed.
>>>        
>> A new netdev device?  Why not?
>>      
> Because it won't work: it will try to pair with existing nic device
> (it is looked up by name) and that will fail.
>    

No, netdev_del should remove the VLANClientState from the 
non_vlan_clients list.

It's no longer enumerable and it's no longer lookup-able.

The only reason it stays around it so that the device doesn't have a 
reference to a free pointer.  The only field that's ever looked at is 
is_deleted which is used by every function to turn around and implement 
a nop.

The VLANClientState is a hollow shell of it's former glorious self.  The 
remainder of it's (hopefully short) life is merely so that we can avoid 
touching every device to teach them about disconnecting backends.

>> Because the fundamental problem is that device_del is too difficult
>> to use.  You're just making netdev_del equally difficult to use.
>>
>> Try your patch with libvirt, don't load acpiphp in the guest, and
>> then play around with virsh device_detach and device_attach.  All
>> sorts of badness will ensue as libvirt tries to manage assignment of
>> PCI slot numbers and such.
>>
>> Regards,
>>
>> Anthony Liguori
>>      
> At some level, that's right. Is yours better?
>    

device_del is still busted but at least netdev_del behaves the way 
libvirt expects it to.

> I guess the right thing is to wait for libvirt guys to tell
> us what they prefer.
>    

Yeah, I think some clarity is needed.

Regards,

Anthony Liguori
Daniel P. Berrangé Sept. 21, 2010, 8:58 a.m. UTC | #22
On Mon, Sep 20, 2010 at 09:37:16PM +0200, Michael S. Tsirkin wrote:
> On Mon, Sep 20, 2010 at 02:22:18PM -0500, Anthony Liguori wrote:
> > On 09/20/2010 01:59 PM, Michael S. Tsirkin wrote:
> > >>You can also initiate the unplug from the OS without the ACPI event
> > >>ever happening.  I suspect that in our current implementation, that
> > >>means that we'll automatically delete the device which may have
> > >>strange effects on management tools.
> > >>
> > >>So it probably makes sense for our interface to present the same
> > >>procedure.  What do you think?
> > >>
> > >>Regards,
> > >>
> > >>Anthony Liguori
> > >We seem to have two discussions here. you speak about how an ideal hot plug
> > >interface will look. This can involve new commands etc.
> > >I speak about fixing existing ones so qemu and/or guest won't crash.
> > 
> > To be fair, existing qemu won't crash if you do:
> > 
> > (qemu) device_del <device>
> > Use info_qtree to notice when device goes away
> > (qemu) netdev_del <backend>
> 
> Asking libvirt to busy loop polling the monitor sounds like a really bad
> situation: note that guest is blocked from doing any io while monitor is
> used, so it may in fact prevent it from making progress. Right?

Clearly we need either an async command completion, or an async
event notification of device_del. No one wants todo polling,
nor does anyone sane want to try to parse the outout of info
qtree :-)

 
> So why can't we let management do netdev_del and have it take effect
> when this becomes possible?

That would be really unpleasant to deal with. netdev_del should
always kill the backend immediately, even if the frontend device 
still exists. If this could cause issues for the frontend, then just
connect it to a no-op backend internally so it gets no further data.
In the context of drive_del, once it returns, libvirt changes the security
labelling, so QEMU is guarenteed not to be able to use the backend
anymore, even if it tries to. We would do the same for netdev_del if
we could.

> > You're trying to come up with a workaround for the fact that libvirt
> > is making bad assumptions.
> 
> BTW, even if it is, I don't think we should be crashing qemu or guest.

Even if the libvirt is making bad assumptions about when a monitor
command can be issued, this is no excuse for QEMU crashing. If the
netdev_del can't safely be performed, it should return an error
via the monitor, not crash QEMU.

Daniel
Michael S. Tsirkin Sept. 21, 2010, 9:18 a.m. UTC | #23
On Mon, Sep 20, 2010 at 03:50:51PM -0500, Anthony Liguori wrote:
> On 09/20/2010 03:37 PM, Michael S. Tsirkin wrote:
> >On Mon, Sep 20, 2010 at 03:38:36PM -0500, Anthony Liguori wrote:
> >>On 09/20/2010 03:27 PM, Michael S. Tsirkin wrote:
> >>>On Mon, Sep 20, 2010 at 03:20:59PM -0500, Anthony Liguori wrote:
> >>>>On 09/20/2010 02:44 PM, Michael S. Tsirkin wrote:
> >>>>>>I think the only workable approach that doesn't involve new commands
> >>>>>>is to change the semantics of the existing ones.
> >>>>>>
> >>>>>>Make netdev_del work regardless of whether the device is still present.
> >>>>>>
> >>>>>>You would need to reference count the actual netdev structure and
> >>>>>>have each device using it unref on delete.  You make netdev_del mark
> >>>>>>the device as deleted and when a device is deleted, any calls into
> >>>>>>the device effectively become nops.
> >>>>>>
> >>>>>>You have to go through most of the cleanup process to ensure that
> >>>>>>tap device gets closed even before your reference count goes to
> >>>>>>zero.
> >>>>>I think you mean 'does not get closed': we need the fd to get the flags etc.
> >>>>No, I actually meant does get closed.
> >>>>
> >>>>When you do netdev_del, it should result in the fd getting closed.
> >>>>
> >>>>The actual netdev structure then becomes a zombie that's completely
> >>>>useless until the device goes away.
> >>>>
> >>>>>Note that it will mostly work unless when it'll crash.
> >>>>>Issue is we don't have any documentation so
> >>>>>people get the command set by trial and error.
> >>>>>
> >>>>>So how can we prove it's a user bug and not qemu bug?
> >>>>>I guess we should blame ourselves until proven innocent.
> >>>>Here's what I'm now suggesting:
> >>>>
> >>>>device_del ->   may or may not unplug a device from a guest when it
> >>>>returns.  To figure out if it does, you have to run info qdm.
> >>>I think it should also always unplug on guest reset.
> >>>
> >>>>netdev_del ->   always destroys a netdev device when it returns.  May
> >>>>be called at any point in time.  If you destroy a netdev while the
> >>>>device is still using it, all packets go into the bit bucket and the
> >>>>link status is modified to be unplugged.
> >>>One issue here is that we can't allow a new device with same name
> >>>to be created until the nic is destroyed.
> >>A new netdev device?  Why not?
> >Because it won't work: it will try to pair with existing nic device
> >(it is looked up by name) and that will fail.
> 
> No, netdev_del should remove the VLANClientState from the
> non_vlan_clients list.
> 
> It's no longer enumerable and it's no longer lookup-able.
> 
> The only reason it stays around it so that the device doesn't have a
> reference to a free pointer.  The only field that's ever looked at
> is is_deleted which is used by every function to turn around and
> implement a nop.
> 
> The VLANClientState is a hollow shell of it's former glorious self.
> The remainder of it's (hopefully short) life is merely so that we
> can avoid touching every device to teach them about disconnecting
> backends.

We'll have to tell them link is down, won't we?

> >>Because the fundamental problem is that device_del is too difficult
> >>to use.  You're just making netdev_del equally difficult to use.
> >>
> >>Try your patch with libvirt, don't load acpiphp in the guest, and
> >>then play around with virsh device_detach and device_attach.  All
> >>sorts of badness will ensue as libvirt tries to manage assignment of
> >>PCI slot numbers and such.
> >>
> >>Regards,
> >>
> >>Anthony Liguori
> >At some level, that's right. Is yours better?
> 
> device_del is still busted but at least netdev_del behaves the way
> libvirt expects it to.
> 
> >I guess the right thing is to wait for libvirt guys to tell
> >us what they prefer.
> 
> Yeah, I think some clarity is needed.
> 
> Regards,
> 
> Anthony Liguori
>
Michael S. Tsirkin Sept. 21, 2010, 9:20 a.m. UTC | #24
On Tue, Sep 21, 2010 at 09:58:14AM +0100, Daniel P. Berrange wrote:
> On Mon, Sep 20, 2010 at 09:37:16PM +0200, Michael S. Tsirkin wrote:
> > On Mon, Sep 20, 2010 at 02:22:18PM -0500, Anthony Liguori wrote:
> > > On 09/20/2010 01:59 PM, Michael S. Tsirkin wrote:
> > > >>You can also initiate the unplug from the OS without the ACPI event
> > > >>ever happening.  I suspect that in our current implementation, that
> > > >>means that we'll automatically delete the device which may have
> > > >>strange effects on management tools.
> > > >>
> > > >>So it probably makes sense for our interface to present the same
> > > >>procedure.  What do you think?
> > > >>
> > > >>Regards,
> > > >>
> > > >>Anthony Liguori
> > > >We seem to have two discussions here. you speak about how an ideal hot plug
> > > >interface will look. This can involve new commands etc.
> > > >I speak about fixing existing ones so qemu and/or guest won't crash.
> > > 
> > > To be fair, existing qemu won't crash if you do:
> > > 
> > > (qemu) device_del <device>
> > > Use info_qtree to notice when device goes away
> > > (qemu) netdev_del <backend>
> > 
> > Asking libvirt to busy loop polling the monitor sounds like a really bad
> > situation: note that guest is blocked from doing any io while monitor is
> > used, so it may in fact prevent it from making progress. Right?
> 
> Clearly we need either an async command completion, or an async
> event notification of device_del. No one wants todo polling,
> nor does anyone sane want to try to parse the outout of info
> qtree :-)
> 
>  
> > So why can't we let management do netdev_del and have it take effect
> > when this becomes possible?
> 
> That would be really unpleasant to deal with. netdev_del should
> always kill the backend immediately, even if the frontend device 
> still exists. If this could cause issues for the frontend, then just
> connect it to a no-op backend internally so it gets no further data.
> In the context of drive_del, once it returns, libvirt changes the security
> labelling, so QEMU is guarenteed not to be able to use the backend
> anymore, even if it tries to. We would do the same for netdev_del if
> we could.


OK, that's clear enough.
One note though: you won't be able to create another backend
with the same name until the frontend is gone.
Anthony Liguori Sept. 21, 2010, 12:42 p.m. UTC | #25
On 09/21/2010 04:18 AM, Michael S. Tsirkin wrote:
>> No, netdev_del should remove the VLANClientState from the
>> non_vlan_clients list.
>>
>> It's no longer enumerable and it's no longer lookup-able.
>>
>> The only reason it stays around it so that the device doesn't have a
>> reference to a free pointer.  The only field that's ever looked at
>> is is_deleted which is used by every function to turn around and
>> implement a nop.
>>
>> The VLANClientState is a hollow shell of it's former glorious self.
>> The remainder of it's (hopefully short) life is merely so that we
>> can avoid touching every device to teach them about disconnecting
>> backends.
>>      
> We'll have to tell them link is down, won't we?
>    

Yes, that would be a nice touch.

Regards,

Anthony Liguori
Anthony Liguori Sept. 21, 2010, 12:47 p.m. UTC | #26
On 09/21/2010 04:20 AM, Michael S. Tsirkin wrote:
>
> OK, that's clear enough.
> One note though: you won't be able to create another backend
> with the same name until the frontend is gone.
>    

If you remove it from the linked list, you'll be able to create another 
backend just fine.

Regards,

Anthony Liguori
diff mbox

Patch

diff --git a/net.c b/net.c
index 3d0fde7..10855d1 100644
--- a/net.c
+++ b/net.c
@@ -286,12 +286,31 @@  void qemu_del_vlan_client(VLANClientState *vc)
     if (vc->vlan) {
         QTAILQ_REMOVE(&vc->vlan->clients, vc, next);
     } else {
+        /* Even if client will not be deleted yet, remove it from list so it
+         * does not appear in monitor.  */
+        QTAILQ_REMOVE(&non_vlan_clients, vc, next);
+        /* Detect that guest-visible (NIC) peer is active, and delay deletion.
+         * */
+        if (vc->peer && vc->peer->info->type == NET_CLIENT_TYPE_NIC) {
+            NICState *nic = DO_UPCAST(NICState, nc, vc->peer);
+            assert(!nic->peer_deleted);
+            nic->peer_deleted = true;
+            return;
+        }
         if (vc->send_queue) {
             qemu_del_net_queue(vc->send_queue);
         }
-        QTAILQ_REMOVE(&non_vlan_clients, vc, next);
         if (vc->peer) {
             vc->peer->peer = NULL;
+            /* If this is a guest-visible (NIC) device,
+             * and peer has already been removed from monitor,
+             * delete it here. */
+            if (vc->info->type == NET_CLIENT_TYPE_NIC) {
+                NICState *nic = DO_UPCAST(NICState, nc, vc);
+                if (nic->peer_deleted) {
+                    qemu_del_vlan_client(vc->peer);
+                }
+            }
         }
     }
 
diff --git a/net.h b/net.h
index 518cf9c..44c31a9 100644
--- a/net.h
+++ b/net.h
@@ -72,6 +72,7 @@  typedef struct NICState {
     VLANClientState nc;
     NICConf *conf;
     void *opaque;
+    bool peer_deleted;
 } NICState;
 
 struct VLANState {