Patchwork [1/2] Fix Block Hotplug race with drive_del()

login
register
mail settings
Submitter Ryan Harper
Date Nov. 9, 2010, 2:25 a.m.
Message ID <1289269554-6766-2-git-send-email-ryanh@us.ibm.com>
Download mbox | patch
Permalink /patch/70488/
State New
Headers show

Comments

Ryan Harper - Nov. 9, 2010, 2:25 a.m.
Block hot unplug is racy since the guest is required to acknowlege the ACPI
unplug event; this may not happen synchronously with the device removal command

This series aims to close a gap where by mgmt applications that assume the
block resource has been removed without confirming that the guest has
acknowledged the removal may re-assign the underlying device to a second guest
leading to data leakage.

This series introduces a new montor command to decouple asynchornous device
removal from restricting guest access to a block device.  We do this by creating
a new monitor command drive_del which maps to a bdrv_unplug() command which
does a qemu_aio_flush; bdrv_flush() and bdrv_close().  Once complete, subsequent
IO is rejected from the device and the guest will get IO errors but continue to
function.  In addition to preventing further IO, we clean up state pointers
between host (BlockDriverState) and guest (DeviceInfo).

A subsequent device removal command can be issued to remove the device, to which
the guest may or maynot respond, but as long as the unplugged bit is set, no IO
will be sumbitted.

Signed-off-by: Ryan Harper <ryanh@us.ibm.com>
---
 block.c         |    7 +++++++
 block.h         |    1 +
 blockdev.c      |   36 ++++++++++++++++++++++++++++++++++++
 blockdev.h      |    1 +
 hmp-commands.hx |   18 ++++++++++++++++++
 5 files changed, 63 insertions(+), 0 deletions(-)
Markus Armbruster - Nov. 10, 2010, 12:48 p.m.
One real question, and a couple of nits.

Ryan Harper <ryanh@us.ibm.com> writes:

> Block hot unplug is racy since the guest is required to acknowlege the ACPI
> unplug event; this may not happen synchronously with the device removal command

Well, I wouldn't call unplug "racy".  It just takes an unpredictable
length of time, possibly forever.  To make a race, you need to throw in
a client assuming (incorrectly) that unplug is instantaneous, as
described in your next paragraph.

Moreover, all PCI unplug is that way, not just block.

> This series aims to close a gap where by mgmt applications that assume the
> block resource has been removed without confirming that the guest has
> acknowledged the removal may re-assign the underlying device to a second guest
> leading to data leakage.

Yes, the incorrect assumption is a problem.  But with that fixed (in the
management application), we run right into the next problem: there is no
way for the management application to reliably disconnect the guest from
a block device.  And that's the problem you're fixing.

> This series introduces a new montor command to decouple asynchornous device

Typos "montor" and "asynchornous".  You might want to use a spell
checker :)

Lines are a bit long.  Recommend wrap at column 70.

> removal from restricting guest access to a block device.  We do this by creating
> a new monitor command drive_del which maps to a bdrv_unplug() command which
> does a qemu_aio_flush; bdrv_flush() and bdrv_close().  Once complete, subsequent
> IO is rejected from the device and the guest will get IO errors but continue to
> function.  In addition to preventing further IO, we clean up state pointers
> between host (BlockDriverState) and guest (DeviceInfo).
>
> A subsequent device removal command can be issued to remove the device, to which
> the guest may or maynot respond, but as long as the unplugged bit is set, no IO

"maynot" is not a word.

> will be sumbitted.

This suggests to drive_del before device_del, which makes the device
goes through a "broken device" state on its way to unplug.  If the guest
accesses the device in that state, it gets I/O errors.  Not nice.

Instead, I'd recommend device_del, wait for the device to go away,
drive_del on time out.  If the guest reacts to the ACPI unplug promptly,
it's never exposed to the "broken device" state.  Note: if the drive_del
fails because the device doesn't exist, we lost the race with the
automatic destruction, which is harmless.  Ignore that error.

> Signed-off-by: Ryan Harper <ryanh@us.ibm.com>
> ---
>  block.c         |    7 +++++++
>  block.h         |    1 +
>  blockdev.c      |   36 ++++++++++++++++++++++++++++++++++++
>  blockdev.h      |    1 +
>  hmp-commands.hx |   18 ++++++++++++++++++
>  5 files changed, 63 insertions(+), 0 deletions(-)
>
> diff --git a/block.c b/block.c
> index 6b505fb..c76a796 100644
> --- a/block.c
> +++ b/block.c
> @@ -1328,6 +1328,13 @@ void bdrv_set_removable(BlockDriverState *bs, int removable)
>      }
>  }
>  
> +void bdrv_unplug(BlockDriverState *bs)
> +{
> +    qemu_aio_flush();
> +    bdrv_flush(bs);
> +    bdrv_close(bs);
> +}
> +

Unless we expect more users, I'd inline this into its only caller.
Matter of taste.

>  int bdrv_is_removable(BlockDriverState *bs)
>  {
>      return bs->removable;
> diff --git a/block.h b/block.h
> index 78ecfac..581414c 100644
> --- a/block.h
> +++ b/block.h
> @@ -171,6 +171,7 @@ void bdrv_set_on_error(BlockDriverState *bs, BlockErrorAction on_read_error,
>                         BlockErrorAction on_write_error);
>  BlockErrorAction bdrv_get_on_error(BlockDriverState *bs, int is_read);
>  void bdrv_set_removable(BlockDriverState *bs, int removable);
> +void bdrv_unplug(BlockDriverState *bs);
>  int bdrv_is_removable(BlockDriverState *bs);
>  int bdrv_is_read_only(BlockDriverState *bs);
>  int bdrv_is_sg(BlockDriverState *bs);
> diff --git a/blockdev.c b/blockdev.c
> index 6cb179a..ee8c2ec 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -14,6 +14,8 @@
>  #include "qemu-option.h"
>  #include "qemu-config.h"
>  #include "sysemu.h"
> +#include "hw/qdev.h"
> +#include "block_int.h"
>  
>  static QTAILQ_HEAD(drivelist, DriveInfo) drives = QTAILQ_HEAD_INITIALIZER(drives);
>  
> @@ -597,3 +599,37 @@ int do_change_block(Monitor *mon, const char *device,
>      }
>      return monitor_read_bdrv_key_start(mon, bs, NULL, NULL);
>  }
> +
> +int do_drive_del(Monitor *mon, const QDict *qdict, QObject **ret_data)
> +{
> +    const char *id = qdict_get_str(qdict, "id");
> +    BlockDriverState *bs;
> +    Property *prop;
> +
> +    bs = bdrv_find(id);
> +    if (!bs) {
> +        qerror_report(QERR_DEVICE_NOT_FOUND, id);
> +        return -1;
> +    }
> +
> +    /* quiesce block driver; prevent further io */
> +    bdrv_unplug(bs);
> +
> +    /* clean up guest state from pointing to host resource by
> +     * finding and removing DeviceState "drive" property */
> +    for (prop = bs->peer->info->props; prop && prop->name; prop++) {
> +        if ((prop->info->type == PROP_TYPE_DRIVE) && 
> +            (*(BlockDriverState **)qdev_get_prop_ptr(bs->peer, prop) == bs)) {
> +            if (prop->info->free) {
> +                prop->info->free(bs->peer, prop);
> +            }

Does this null the drive property?  I doubt it.  Quick check in the
debugger?

The free callbacks generally don't zap the properties, because they run
from qdev_free().

> +        }
> +    }
> +
> +    /* clean up host state pointing to guest resource by removing
> +     * pointers to guest device in the BlockDriverState */
> +    bdrv_delete(bs);
> +
> +    return 0;
> +}
> + 
> diff --git a/blockdev.h b/blockdev.h
> index 653affc..2a0559e 100644
> --- a/blockdev.h
> +++ b/blockdev.h
> @@ -51,5 +51,6 @@ int do_eject(Monitor *mon, const QDict *qdict, QObject **ret_data);
>  int do_block_set_passwd(Monitor *mon, const QDict *qdict, QObject **ret_data);
>  int do_change_block(Monitor *mon, const char *device,
>                      const char *filename, const char *fmt);
> +int do_drive_del(Monitor *mon, const QDict *qdict, QObject **ret_data);
>  
>  #endif
> diff --git a/hmp-commands.hx b/hmp-commands.hx
> index e5585ba..d6dc18c 100644
> --- a/hmp-commands.hx
> +++ b/hmp-commands.hx
> @@ -68,6 +68,24 @@ Eject a removable medium (use -f to force it).
>  ETEXI
>  
>      {
> +        .name       = "drive_del",
> +        .args_type  = "id:s",
> +        .params     = "device",
> +        .help       = "remove host block device",
> +        .user_print = monitor_user_noop,
> +        .mhandler.cmd_new = do_drive_del,
> +    },
> +
> +STEXI
> +@item delete @var{device}
> +@findex delete
> +Remove host block device.  The result is that guest generated IO is no longer
> +submitted against the host device underlying the disk.  Once a drive has
> +been deleted, the QEMU Block layer returns -EIO which results in IO 
> +errors in the guest for applications that are reading/writing to the device.
> +ETEXI
> +
> +    {
>          .name       = "change",
>          .args_type  = "device:B,target:F,arg:s?",
>          .params     = "device filename [format]",
Ryan Harper - Nov. 10, 2010, 4:01 p.m.
* Markus Armbruster <armbru@redhat.com> [2010-11-10 06:48]:
> One real question, and a couple of nits.
> 
> Ryan Harper <ryanh@us.ibm.com> writes:
> 
> > Block hot unplug is racy since the guest is required to acknowlege the ACPI
> > unplug event; this may not happen synchronously with the device removal command
> 
> Well, I wouldn't call unplug "racy".  It just takes an unpredictable
> length of time, possibly forever.  To make a race, you need to throw in
> a client assuming (incorrectly) that unplug is instantaneous, as
> described in your next paragraph.
> 
> Moreover, all PCI unplug is that way, not just block.
> 
> > This series aims to close a gap where by mgmt applications that assume the
> > block resource has been removed without confirming that the guest has
> > acknowledged the removal may re-assign the underlying device to a second guest
> > leading to data leakage.
> 
> Yes, the incorrect assumption is a problem.  But with that fixed (in the
> management application), we run right into the next problem: there is no
> way for the management application to reliably disconnect the guest from
> a block device.  And that's the problem you're fixing.

Yeah, that's the right way to word it; providing a method to forcibly
disconnect the guest from the host device.
> 
> > This series introduces a new montor command to decouple asynchornous device
> 
> Typos "montor" and "asynchornous".  You might want to use a spell
> checker :)
> 
> Lines are a bit long.  Recommend wrap at column 70.
> 
> > removal from restricting guest access to a block device.  We do this by creating
> > a new monitor command drive_del which maps to a bdrv_unplug() command which
> > does a qemu_aio_flush; bdrv_flush() and bdrv_close().  Once complete, subsequent
> > IO is rejected from the device and the guest will get IO errors but continue to
> > function.  In addition to preventing further IO, we clean up state pointers
> > between host (BlockDriverState) and guest (DeviceInfo).
> >
> > A subsequent device removal command can be issued to remove the device, to which
> > the guest may or maynot respond, but as long as the unplugged bit is set, no IO
> 
> "maynot" is not a word.
> 
> > will be sumbitted.
> 
> This suggests to drive_del before device_del, which makes the device
> goes through a "broken device" state on its way to unplug.  If the guest
> accesses the device in that state, it gets I/O errors.  Not nice.
> 
> Instead, I'd recommend device_del, wait for the device to go away,
> drive_del on time out.  If the guest reacts to the ACPI unplug promptly,
> it's never exposed to the "broken device" state.  Note: if the drive_del
> fails because the device doesn't exist, we lost the race with the
> automatic destruction, which is harmless.  Ignore that error.

Honestly, other than describing what happens if you sever the connection
when the guest isn't aware of it; I don't want to try to capture how the
mgmt layer implements the removal.  

One may want to force the disconnect before attempting to remove the
device; or the other way around; that's really the mgmt layer's call.

> 
> > Signed-off-by: Ryan Harper <ryanh@us.ibm.com>
> > ---
> >  block.c         |    7 +++++++
> >  block.h         |    1 +
> >  blockdev.c      |   36 ++++++++++++++++++++++++++++++++++++
> >  blockdev.h      |    1 +
> >  hmp-commands.hx |   18 ++++++++++++++++++
> >  5 files changed, 63 insertions(+), 0 deletions(-)
> >
> > diff --git a/block.c b/block.c
> > index 6b505fb..c76a796 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -1328,6 +1328,13 @@ void bdrv_set_removable(BlockDriverState *bs, int removable)
> >      }
> >  }
> >  
> > +void bdrv_unplug(BlockDriverState *bs)
> > +{
> > +    qemu_aio_flush();
> > +    bdrv_flush(bs);
> > +    bdrv_close(bs);
> > +}
> > +
> 
> Unless we expect more users, I'd inline this into its only caller.
> Matter of taste.

Works for me.

> 
> >  int bdrv_is_removable(BlockDriverState *bs)
> >  {
> >      return bs->removable;
> > diff --git a/block.h b/block.h
> > index 78ecfac..581414c 100644
> > --- a/block.h
> > +++ b/block.h
> > @@ -171,6 +171,7 @@ void bdrv_set_on_error(BlockDriverState *bs, BlockErrorAction on_read_error,
> >                         BlockErrorAction on_write_error);
> >  BlockErrorAction bdrv_get_on_error(BlockDriverState *bs, int is_read);
> >  void bdrv_set_removable(BlockDriverState *bs, int removable);
> > +void bdrv_unplug(BlockDriverState *bs);
> >  int bdrv_is_removable(BlockDriverState *bs);
> >  int bdrv_is_read_only(BlockDriverState *bs);
> >  int bdrv_is_sg(BlockDriverState *bs);
> > diff --git a/blockdev.c b/blockdev.c
> > index 6cb179a..ee8c2ec 100644
> > --- a/blockdev.c
> > +++ b/blockdev.c
> > @@ -14,6 +14,8 @@
> >  #include "qemu-option.h"
> >  #include "qemu-config.h"
> >  #include "sysemu.h"
> > +#include "hw/qdev.h"
> > +#include "block_int.h"
> >  
> >  static QTAILQ_HEAD(drivelist, DriveInfo) drives = QTAILQ_HEAD_INITIALIZER(drives);
> >  
> > @@ -597,3 +599,37 @@ int do_change_block(Monitor *mon, const char *device,
> >      }
> >      return monitor_read_bdrv_key_start(mon, bs, NULL, NULL);
> >  }
> > +
> > +int do_drive_del(Monitor *mon, const QDict *qdict, QObject **ret_data)
> > +{
> > +    const char *id = qdict_get_str(qdict, "id");
> > +    BlockDriverState *bs;
> > +    Property *prop;
> > +
> > +    bs = bdrv_find(id);
> > +    if (!bs) {
> > +        qerror_report(QERR_DEVICE_NOT_FOUND, id);
> > +        return -1;
> > +    }
> > +
> > +    /* quiesce block driver; prevent further io */
> > +    bdrv_unplug(bs);
> > +
> > +    /* clean up guest state from pointing to host resource by
> > +     * finding and removing DeviceState "drive" property */
> > +    for (prop = bs->peer->info->props; prop && prop->name; prop++) {
> > +        if ((prop->info->type == PROP_TYPE_DRIVE) && 
> > +            (*(BlockDriverState **)qdev_get_prop_ptr(bs->peer, prop) == bs)) {
> > +            if (prop->info->free) {
> > +                prop->info->free(bs->peer, prop);
> > +            }
> 
> Does this null the drive property?  I doubt it.  Quick check in the
> debugger?
> 
> The free callbacks generally don't zap the properties, because they run
> from qdev_free().

To be honest; I didn't see anything that looked like "remove this
property" in the qdev api.  Any pointers?

should I be calling qdev_free() on the dev?  I don't quite understand
the distinction between the info list of properties and the device
itself, nor specifically what we need to remove in the drive_del()
operation versus the device_del() portion.


> 
> > +        }
> > +    }
> > +
> > +    /* clean up host state pointing to guest resource by removing
> > +     * pointers to guest device in the BlockDriverState */
> > +    bdrv_delete(bs);
> > +
> > +    return 0;
> > +}
> > + 
> > diff --git a/blockdev.h b/blockdev.h
> > index 653affc..2a0559e 100644
> > --- a/blockdev.h
> > +++ b/blockdev.h
> > @@ -51,5 +51,6 @@ int do_eject(Monitor *mon, const QDict *qdict, QObject **ret_data);
> >  int do_block_set_passwd(Monitor *mon, const QDict *qdict, QObject **ret_data);
> >  int do_change_block(Monitor *mon, const char *device,
> >                      const char *filename, const char *fmt);
> > +int do_drive_del(Monitor *mon, const QDict *qdict, QObject **ret_data);
> >  
> >  #endif
> > diff --git a/hmp-commands.hx b/hmp-commands.hx
> > index e5585ba..d6dc18c 100644
> > --- a/hmp-commands.hx
> > +++ b/hmp-commands.hx
> > @@ -68,6 +68,24 @@ Eject a removable medium (use -f to force it).
> >  ETEXI
> >  
> >      {
> > +        .name       = "drive_del",
> > +        .args_type  = "id:s",
> > +        .params     = "device",
> > +        .help       = "remove host block device",
> > +        .user_print = monitor_user_noop,
> > +        .mhandler.cmd_new = do_drive_del,
> > +    },
> > +
> > +STEXI
> > +@item delete @var{device}
> > +@findex delete
> > +Remove host block device.  The result is that guest generated IO is no longer
> > +submitted against the host device underlying the disk.  Once a drive has
> > +been deleted, the QEMU Block layer returns -EIO which results in IO 
> > +errors in the guest for applications that are reading/writing to the device.
> > +ETEXI
> > +
> > +    {
> >          .name       = "change",
> >          .args_type  = "device:B,target:F,arg:s?",
> >          .params     = "device filename [format]",
Markus Armbruster - Nov. 10, 2010, 5:39 p.m.
Ryan Harper <ryanh@us.ibm.com> writes:

> * Markus Armbruster <armbru@redhat.com> [2010-11-10 06:48]:
>> One real question, and a couple of nits.
>> 
>> Ryan Harper <ryanh@us.ibm.com> writes:
>> 
>> > Block hot unplug is racy since the guest is required to acknowlege the ACPI
>> > unplug event; this may not happen synchronously with the device removal command
>> 
>> Well, I wouldn't call unplug "racy".  It just takes an unpredictable
>> length of time, possibly forever.  To make a race, you need to throw in
>> a client assuming (incorrectly) that unplug is instantaneous, as
>> described in your next paragraph.
>> 
>> Moreover, all PCI unplug is that way, not just block.
>> 
>> > This series aims to close a gap where by mgmt applications that assume the
>> > block resource has been removed without confirming that the guest has
>> > acknowledged the removal may re-assign the underlying device to a second guest
>> > leading to data leakage.
>> 
>> Yes, the incorrect assumption is a problem.  But with that fixed (in the
>> management application), we run right into the next problem: there is no
>> way for the management application to reliably disconnect the guest from
>> a block device.  And that's the problem you're fixing.
>
> Yeah, that's the right way to word it; providing a method to forcibly
> disconnect the guest from the host device.
>> 
>> > This series introduces a new montor command to decouple asynchornous device
>> 
>> Typos "montor" and "asynchornous".  You might want to use a spell
>> checker :)
>> 
>> Lines are a bit long.  Recommend wrap at column 70.
>> 
>> > removal from restricting guest access to a block device.  We do this by creating
>> > a new monitor command drive_del which maps to a bdrv_unplug() command which
>> > does a qemu_aio_flush; bdrv_flush() and bdrv_close().  Once complete, subsequent
>> > IO is rejected from the device and the guest will get IO errors but continue to
>> > function.  In addition to preventing further IO, we clean up state pointers
>> > between host (BlockDriverState) and guest (DeviceInfo).
>> >
>> > A subsequent device removal command can be issued to remove the device, to which
>> > the guest may or maynot respond, but as long as the unplugged bit is set, no IO
>> 
>> "maynot" is not a word.
>> 
>> > will be sumbitted.
>> 
>> This suggests to drive_del before device_del, which makes the device
>> goes through a "broken device" state on its way to unplug.  If the guest
>> accesses the device in that state, it gets I/O errors.  Not nice.
>> 
>> Instead, I'd recommend device_del, wait for the device to go away,
>> drive_del on time out.  If the guest reacts to the ACPI unplug promptly,
>> it's never exposed to the "broken device" state.  Note: if the drive_del
>> fails because the device doesn't exist, we lost the race with the
>> automatic destruction, which is harmless.  Ignore that error.
>
> Honestly, other than describing what happens if you sever the connection
> when the guest isn't aware of it; I don't want to try to capture how the
> mgmt layer implements the removal.  
>
> One may want to force the disconnect before attempting to remove the
> device; or the other way around; that's really the mgmt layer's call.

Fair enough.

>> > Signed-off-by: Ryan Harper <ryanh@us.ibm.com>
>> > ---
>> >  block.c         |    7 +++++++
>> >  block.h         |    1 +
>> >  blockdev.c      |   36 ++++++++++++++++++++++++++++++++++++
>> >  blockdev.h      |    1 +
>> >  hmp-commands.hx |   18 ++++++++++++++++++
>> >  5 files changed, 63 insertions(+), 0 deletions(-)
>> >
>> > diff --git a/block.c b/block.c
>> > index 6b505fb..c76a796 100644
>> > --- a/block.c
>> > +++ b/block.c
>> > @@ -1328,6 +1328,13 @@ void bdrv_set_removable(BlockDriverState *bs, int removable)
>> >      }
>> >  }
>> >  
>> > +void bdrv_unplug(BlockDriverState *bs)
>> > +{
>> > +    qemu_aio_flush();
>> > +    bdrv_flush(bs);
>> > +    bdrv_close(bs);
>> > +}
>> > +
>> 
>> Unless we expect more users, I'd inline this into its only caller.
>> Matter of taste.
>
> Works for me.
>
>> 
>> >  int bdrv_is_removable(BlockDriverState *bs)
>> >  {
>> >      return bs->removable;
>> > diff --git a/block.h b/block.h
>> > index 78ecfac..581414c 100644
>> > --- a/block.h
>> > +++ b/block.h
>> > @@ -171,6 +171,7 @@ void bdrv_set_on_error(BlockDriverState *bs, BlockErrorAction on_read_error,
>> >                         BlockErrorAction on_write_error);
>> >  BlockErrorAction bdrv_get_on_error(BlockDriverState *bs, int is_read);
>> >  void bdrv_set_removable(BlockDriverState *bs, int removable);
>> > +void bdrv_unplug(BlockDriverState *bs);
>> >  int bdrv_is_removable(BlockDriverState *bs);
>> >  int bdrv_is_read_only(BlockDriverState *bs);
>> >  int bdrv_is_sg(BlockDriverState *bs);
>> > diff --git a/blockdev.c b/blockdev.c
>> > index 6cb179a..ee8c2ec 100644
>> > --- a/blockdev.c
>> > +++ b/blockdev.c
>> > @@ -14,6 +14,8 @@
>> >  #include "qemu-option.h"
>> >  #include "qemu-config.h"
>> >  #include "sysemu.h"
>> > +#include "hw/qdev.h"
>> > +#include "block_int.h"
>> >  
>> >  static QTAILQ_HEAD(drivelist, DriveInfo) drives = QTAILQ_HEAD_INITIALIZER(drives);
>> >  
>> > @@ -597,3 +599,37 @@ int do_change_block(Monitor *mon, const char *device,
>> >      }
>> >      return monitor_read_bdrv_key_start(mon, bs, NULL, NULL);
>> >  }
>> > +
>> > +int do_drive_del(Monitor *mon, const QDict *qdict, QObject **ret_data)
>> > +{
>> > +    const char *id = qdict_get_str(qdict, "id");
>> > +    BlockDriverState *bs;
>> > +    Property *prop;
>> > +
>> > +    bs = bdrv_find(id);
>> > +    if (!bs) {
>> > +        qerror_report(QERR_DEVICE_NOT_FOUND, id);
>> > +        return -1;
>> > +    }
>> > +
>> > +    /* quiesce block driver; prevent further io */
>> > +    bdrv_unplug(bs);
>> > +
>> > +    /* clean up guest state from pointing to host resource by
>> > +     * finding and removing DeviceState "drive" property */
>> > +    for (prop = bs->peer->info->props; prop && prop->name; prop++) {
>> > +        if ((prop->info->type == PROP_TYPE_DRIVE) && 
>> > +            (*(BlockDriverState **)qdev_get_prop_ptr(bs->peer, prop) == bs)) {
>> > +            if (prop->info->free) {
>> > +                prop->info->free(bs->peer, prop);
>> > +            }

Your use of prop->info->free() in this context is wrong.  More below.

>> 
>> Does this null the drive property?  I doubt it.  Quick check in the
>> debugger?
>> 
>> The free callbacks generally don't zap the properties, because they run
>> from qdev_free().
>
> To be honest; I didn't see anything that looked like "remove this
> property" in the qdev api.  Any pointers?

The closest we have is indeed the Property method free(), but that's not
quite right.  It's really only for use by qdev_free().

> should I be calling qdev_free() on the dev?

No, because then the whole device is gone, not just the property :)

>                                              I don't quite understand
> the distinction between the info list of properties and the device
> itself, nor specifically what we need to remove in the drive_del()
> operation versus the device_del() portion.

device_del / qdev_free() destroy a qdev, such as a "virtio-blk-pci"
device (C type VirtIOPCIProxy).

drive_del destroys something else, namely the block device host part
(BlockDriverState + DeviceInfo).  Obviously, it needs to zap all
pointers to the host part along with it.  Specifically, it needs to zap
the device's pointer to it.

Example: if a "virtio-blk-pci" device is using drive "foo", then
"drive_del foo" needs to zap its member block.bs.

Complication: we don't (want to) know what kind of device exactly is
using the drive.  But we do know that a drive property must be
describing it.

So we search the properties (for (prop...)) for a drive property
(prop->info->type == PROP_TYPE_DRIVE) that points to this drive (... ==
bs).

Result:

    BlockDriverState *bs;
    Property *prop;
    BlockDriverState **ptr;
[...]
    for (prop = bs->peer->info->props; prop && prop->name; prop++) {
        if ((prop->info->type == PROP_TYPE_DRIVE)) {
            ptr = qdev_get_prop_ptr(dev, prop);
            if (*ptr == bs) {
                bdrv_detach(bs, bs->peer);
                *ptr = NULL;
                break;
            }
        }
    }

Aside: arguably, bdrv_detach() should zap *both* pointers, i.e. also do
the *ptr = NULL.  Not your problem to fix.

Only then are we ready to destroy the host part:

    drive_uninit(drive_get_by_blockdev(bs));

Does this help?
Ryan Harper - Nov. 10, 2010, 5:50 p.m.
* Markus Armbruster <armbru@redhat.com> [2010-11-10 11:40]:
> Ryan Harper <ryanh@us.ibm.com> writes:
> 
> > * Markus Armbruster <armbru@redhat.com> [2010-11-10 06:48]:
> >> One real question, and a couple of nits.
> >> 
> >> Ryan Harper <ryanh@us.ibm.com> writes:
> >> 
> >> > Block hot unplug is racy since the guest is required to acknowlege the ACPI
> >> > unplug event; this may not happen synchronously with the device removal command
> >> 
> >> Well, I wouldn't call unplug "racy".  It just takes an unpredictable
> >> length of time, possibly forever.  To make a race, you need to throw in
> >> a client assuming (incorrectly) that unplug is instantaneous, as
> >> described in your next paragraph.
> >> 
> >> Moreover, all PCI unplug is that way, not just block.
> >> 
> >> > This series aims to close a gap where by mgmt applications that assume the
> >> > block resource has been removed without confirming that the guest has
> >> > acknowledged the removal may re-assign the underlying device to a second guest
> >> > leading to data leakage.
> >> 
> >> Yes, the incorrect assumption is a problem.  But with that fixed (in the
> >> management application), we run right into the next problem: there is no
> >> way for the management application to reliably disconnect the guest from
> >> a block device.  And that's the problem you're fixing.
> >
> > Yeah, that's the right way to word it; providing a method to forcibly
> > disconnect the guest from the host device.
> >> 
> >> > This series introduces a new montor command to decouple asynchornous device
> >> 
> >> Typos "montor" and "asynchornous".  You might want to use a spell
> >> checker :)
> >> 
> >> Lines are a bit long.  Recommend wrap at column 70.
> >> 
> >> > removal from restricting guest access to a block device.  We do this by creating
> >> > a new monitor command drive_del which maps to a bdrv_unplug() command which
> >> > does a qemu_aio_flush; bdrv_flush() and bdrv_close().  Once complete, subsequent
> >> > IO is rejected from the device and the guest will get IO errors but continue to
> >> > function.  In addition to preventing further IO, we clean up state pointers
> >> > between host (BlockDriverState) and guest (DeviceInfo).
> >> >
> >> > A subsequent device removal command can be issued to remove the device, to which
> >> > the guest may or maynot respond, but as long as the unplugged bit is set, no IO
> >> 
> >> "maynot" is not a word.
> >> 
> >> > will be sumbitted.
> >> 
> >> This suggests to drive_del before device_del, which makes the device
> >> goes through a "broken device" state on its way to unplug.  If the guest
> >> accesses the device in that state, it gets I/O errors.  Not nice.
> >> 
> >> Instead, I'd recommend device_del, wait for the device to go away,
> >> drive_del on time out.  If the guest reacts to the ACPI unplug promptly,
> >> it's never exposed to the "broken device" state.  Note: if the drive_del
> >> fails because the device doesn't exist, we lost the race with the
> >> automatic destruction, which is harmless.  Ignore that error.
> >
> > Honestly, other than describing what happens if you sever the connection
> > when the guest isn't aware of it; I don't want to try to capture how the
> > mgmt layer implements the removal.  
> >
> > One may want to force the disconnect before attempting to remove the
> > device; or the other way around; that's really the mgmt layer's call.
> 
> Fair enough.
> 
> >> > Signed-off-by: Ryan Harper <ryanh@us.ibm.com>
> >> > ---
> >> >  block.c         |    7 +++++++
> >> >  block.h         |    1 +
> >> >  blockdev.c      |   36 ++++++++++++++++++++++++++++++++++++
> >> >  blockdev.h      |    1 +
> >> >  hmp-commands.hx |   18 ++++++++++++++++++
> >> >  5 files changed, 63 insertions(+), 0 deletions(-)
> >> >
> >> > diff --git a/block.c b/block.c
> >> > index 6b505fb..c76a796 100644
> >> > --- a/block.c
> >> > +++ b/block.c
> >> > @@ -1328,6 +1328,13 @@ void bdrv_set_removable(BlockDriverState *bs, int removable)
> >> >      }
> >> >  }
> >> >  
> >> > +void bdrv_unplug(BlockDriverState *bs)
> >> > +{
> >> > +    qemu_aio_flush();
> >> > +    bdrv_flush(bs);
> >> > +    bdrv_close(bs);
> >> > +}
> >> > +
> >> 
> >> Unless we expect more users, I'd inline this into its only caller.
> >> Matter of taste.
> >
> > Works for me.
> >
> >> 
> >> >  int bdrv_is_removable(BlockDriverState *bs)
> >> >  {
> >> >      return bs->removable;
> >> > diff --git a/block.h b/block.h
> >> > index 78ecfac..581414c 100644
> >> > --- a/block.h
> >> > +++ b/block.h
> >> > @@ -171,6 +171,7 @@ void bdrv_set_on_error(BlockDriverState *bs, BlockErrorAction on_read_error,
> >> >                         BlockErrorAction on_write_error);
> >> >  BlockErrorAction bdrv_get_on_error(BlockDriverState *bs, int is_read);
> >> >  void bdrv_set_removable(BlockDriverState *bs, int removable);
> >> > +void bdrv_unplug(BlockDriverState *bs);
> >> >  int bdrv_is_removable(BlockDriverState *bs);
> >> >  int bdrv_is_read_only(BlockDriverState *bs);
> >> >  int bdrv_is_sg(BlockDriverState *bs);
> >> > diff --git a/blockdev.c b/blockdev.c
> >> > index 6cb179a..ee8c2ec 100644
> >> > --- a/blockdev.c
> >> > +++ b/blockdev.c
> >> > @@ -14,6 +14,8 @@
> >> >  #include "qemu-option.h"
> >> >  #include "qemu-config.h"
> >> >  #include "sysemu.h"
> >> > +#include "hw/qdev.h"
> >> > +#include "block_int.h"
> >> >  
> >> >  static QTAILQ_HEAD(drivelist, DriveInfo) drives = QTAILQ_HEAD_INITIALIZER(drives);
> >> >  
> >> > @@ -597,3 +599,37 @@ int do_change_block(Monitor *mon, const char *device,
> >> >      }
> >> >      return monitor_read_bdrv_key_start(mon, bs, NULL, NULL);
> >> >  }
> >> > +
> >> > +int do_drive_del(Monitor *mon, const QDict *qdict, QObject **ret_data)
> >> > +{
> >> > +    const char *id = qdict_get_str(qdict, "id");
> >> > +    BlockDriverState *bs;
> >> > +    Property *prop;
> >> > +
> >> > +    bs = bdrv_find(id);
> >> > +    if (!bs) {
> >> > +        qerror_report(QERR_DEVICE_NOT_FOUND, id);
> >> > +        return -1;
> >> > +    }
> >> > +
> >> > +    /* quiesce block driver; prevent further io */
> >> > +    bdrv_unplug(bs);
> >> > +
> >> > +    /* clean up guest state from pointing to host resource by
> >> > +     * finding and removing DeviceState "drive" property */
> >> > +    for (prop = bs->peer->info->props; prop && prop->name; prop++) {
> >> > +        if ((prop->info->type == PROP_TYPE_DRIVE) && 
> >> > +            (*(BlockDriverState **)qdev_get_prop_ptr(bs->peer, prop) == bs)) {
> >> > +            if (prop->info->free) {
> >> > +                prop->info->free(bs->peer, prop);
> >> > +            }
> 
> Your use of prop->info->free() in this context is wrong.  More below.
> 
> >> 
> >> Does this null the drive property?  I doubt it.  Quick check in the
> >> debugger?
> >> 
> >> The free callbacks generally don't zap the properties, because they run
> >> from qdev_free().
> >
> > To be honest; I didn't see anything that looked like "remove this
> > property" in the qdev api.  Any pointers?
> 
> The closest we have is indeed the Property method free(), but that's not
> quite right.  It's really only for use by qdev_free().
> 
> > should I be calling qdev_free() on the dev?
> 
> No, because then the whole device is gone, not just the property :)
> 
> >                                              I don't quite understand
> > the distinction between the info list of properties and the device
> > itself, nor specifically what we need to remove in the drive_del()
> > operation versus the device_del() portion.
> 
> device_del / qdev_free() destroy a qdev, such as a "virtio-blk-pci"
> device (C type VirtIOPCIProxy).
> 
> drive_del destroys something else, namely the block device host part
> (BlockDriverState + DeviceInfo).  Obviously, it needs to zap all
> pointers to the host part along with it.  Specifically, it needs to zap
> the device's pointer to it.
> 
> Example: if a "virtio-blk-pci" device is using drive "foo", then
> "drive_del foo" needs to zap its member block.bs.
> 
> Complication: we don't (want to) know what kind of device exactly is
> using the drive.  But we do know that a drive property must be
> describing it.
> 
> So we search the properties (for (prop...)) for a drive property
> (prop->info->type == PROP_TYPE_DRIVE) that points to this drive (... ==
> bs).
> 
> Result:
> 
>     BlockDriverState *bs;
>     Property *prop;
>     BlockDriverState **ptr;
> [...]
>     for (prop = bs->peer->info->props; prop && prop->name; prop++) {
>         if ((prop->info->type == PROP_TYPE_DRIVE)) {
>             ptr = qdev_get_prop_ptr(dev, prop);
>             if (*ptr == bs) {
>                 bdrv_detach(bs, bs->peer);
>                 *ptr = NULL;
>                 break;
>             }
>         }
>     }
> 
> Aside: arguably, bdrv_detach() should zap *both* pointers, i.e. also do
> the *ptr = NULL.  Not your problem to fix.
> 
> Only then are we ready to destroy the host part:
> 
>     drive_uninit(drive_get_by_blockdev(bs));
> 
> Does this help?

Yep; lemme get a v7 out.
Ryan Harper - Nov. 10, 2010, 7:31 p.m.
* Markus Armbruster <armbru@redhat.com> [2010-11-10 11:40]:
> Ryan Harper <ryanh@us.ibm.com> writes:
> 
> > * Markus Armbruster <armbru@redhat.com> [2010-11-10 06:48]:
> >> One real question, and a couple of nits.
> >> 
> >> Ryan Harper <ryanh@us.ibm.com> writes:
> >> 
> >> > Block hot unplug is racy since the guest is required to acknowlege the ACPI
> >> > unplug event; this may not happen synchronously with the device removal command
> >> 
> >> Well, I wouldn't call unplug "racy".  It just takes an unpredictable
> >> length of time, possibly forever.  To make a race, you need to throw in
> >> a client assuming (incorrectly) that unplug is instantaneous, as
> >> described in your next paragraph.
> >> 
> >> Moreover, all PCI unplug is that way, not just block.
> >> 
> >> > This series aims to close a gap where by mgmt applications that assume the
> >> > block resource has been removed without confirming that the guest has
> >> > acknowledged the removal may re-assign the underlying device to a second guest
> >> > leading to data leakage.
> >> 
> >> Yes, the incorrect assumption is a problem.  But with that fixed (in the
> >> management application), we run right into the next problem: there is no
> >> way for the management application to reliably disconnect the guest from
> >> a block device.  And that's the problem you're fixing.
> >
> > Yeah, that's the right way to word it; providing a method to forcibly
> > disconnect the guest from the host device.
> >> 
> >> > This series introduces a new montor command to decouple asynchornous device
> >> 
> >> Typos "montor" and "asynchornous".  You might want to use a spell
> >> checker :)
> >> 
> >> Lines are a bit long.  Recommend wrap at column 70.
> >> 
> >> > removal from restricting guest access to a block device.  We do this by creating
> >> > a new monitor command drive_del which maps to a bdrv_unplug() command which
> >> > does a qemu_aio_flush; bdrv_flush() and bdrv_close().  Once complete, subsequent
> >> > IO is rejected from the device and the guest will get IO errors but continue to
> >> > function.  In addition to preventing further IO, we clean up state pointers
> >> > between host (BlockDriverState) and guest (DeviceInfo).
> >> >
> >> > A subsequent device removal command can be issued to remove the device, to which
> >> > the guest may or maynot respond, but as long as the unplugged bit is set, no IO
> >> 
> >> "maynot" is not a word.
> >> 
> >> > will be sumbitted.
> >> 
> >> This suggests to drive_del before device_del, which makes the device
> >> goes through a "broken device" state on its way to unplug.  If the guest
> >> accesses the device in that state, it gets I/O errors.  Not nice.
> >> 
> >> Instead, I'd recommend device_del, wait for the device to go away,
> >> drive_del on time out.  If the guest reacts to the ACPI unplug promptly,
> >> it's never exposed to the "broken device" state.  Note: if the drive_del
> >> fails because the device doesn't exist, we lost the race with the
> >> automatic destruction, which is harmless.  Ignore that error.
> >
> > Honestly, other than describing what happens if you sever the connection
> > when the guest isn't aware of it; I don't want to try to capture how the
> > mgmt layer implements the removal.  
> >
> > One may want to force the disconnect before attempting to remove the
> > device; or the other way around; that's really the mgmt layer's call.
> 
> Fair enough.
> 
> >> > Signed-off-by: Ryan Harper <ryanh@us.ibm.com>
> >> > ---
> >> >  block.c         |    7 +++++++
> >> >  block.h         |    1 +
> >> >  blockdev.c      |   36 ++++++++++++++++++++++++++++++++++++
> >> >  blockdev.h      |    1 +
> >> >  hmp-commands.hx |   18 ++++++++++++++++++
> >> >  5 files changed, 63 insertions(+), 0 deletions(-)
> >> >
> >> > diff --git a/block.c b/block.c
> >> > index 6b505fb..c76a796 100644
> >> > --- a/block.c
> >> > +++ b/block.c
> >> > @@ -1328,6 +1328,13 @@ void bdrv_set_removable(BlockDriverState *bs, int removable)
> >> >      }
> >> >  }
> >> >  
> >> > +void bdrv_unplug(BlockDriverState *bs)
> >> > +{
> >> > +    qemu_aio_flush();
> >> > +    bdrv_flush(bs);
> >> > +    bdrv_close(bs);
> >> > +}
> >> > +
> >> 
> >> Unless we expect more users, I'd inline this into its only caller.
> >> Matter of taste.
> >
> > Works for me.
> >
> >> 
> >> >  int bdrv_is_removable(BlockDriverState *bs)
> >> >  {
> >> >      return bs->removable;
> >> > diff --git a/block.h b/block.h
> >> > index 78ecfac..581414c 100644
> >> > --- a/block.h
> >> > +++ b/block.h
> >> > @@ -171,6 +171,7 @@ void bdrv_set_on_error(BlockDriverState *bs, BlockErrorAction on_read_error,
> >> >                         BlockErrorAction on_write_error);
> >> >  BlockErrorAction bdrv_get_on_error(BlockDriverState *bs, int is_read);
> >> >  void bdrv_set_removable(BlockDriverState *bs, int removable);
> >> > +void bdrv_unplug(BlockDriverState *bs);
> >> >  int bdrv_is_removable(BlockDriverState *bs);
> >> >  int bdrv_is_read_only(BlockDriverState *bs);
> >> >  int bdrv_is_sg(BlockDriverState *bs);
> >> > diff --git a/blockdev.c b/blockdev.c
> >> > index 6cb179a..ee8c2ec 100644
> >> > --- a/blockdev.c
> >> > +++ b/blockdev.c
> >> > @@ -14,6 +14,8 @@
> >> >  #include "qemu-option.h"
> >> >  #include "qemu-config.h"
> >> >  #include "sysemu.h"
> >> > +#include "hw/qdev.h"
> >> > +#include "block_int.h"
> >> >  
> >> >  static QTAILQ_HEAD(drivelist, DriveInfo) drives = QTAILQ_HEAD_INITIALIZER(drives);
> >> >  
> >> > @@ -597,3 +599,37 @@ int do_change_block(Monitor *mon, const char *device,
> >> >      }
> >> >      return monitor_read_bdrv_key_start(mon, bs, NULL, NULL);
> >> >  }
> >> > +
> >> > +int do_drive_del(Monitor *mon, const QDict *qdict, QObject **ret_data)
> >> > +{
> >> > +    const char *id = qdict_get_str(qdict, "id");
> >> > +    BlockDriverState *bs;
> >> > +    Property *prop;
> >> > +
> >> > +    bs = bdrv_find(id);
> >> > +    if (!bs) {
> >> > +        qerror_report(QERR_DEVICE_NOT_FOUND, id);
> >> > +        return -1;
> >> > +    }
> >> > +
> >> > +    /* quiesce block driver; prevent further io */
> >> > +    bdrv_unplug(bs);
> >> > +
> >> > +    /* clean up guest state from pointing to host resource by
> >> > +     * finding and removing DeviceState "drive" property */
> >> > +    for (prop = bs->peer->info->props; prop && prop->name; prop++) {
> >> > +        if ((prop->info->type == PROP_TYPE_DRIVE) && 
> >> > +            (*(BlockDriverState **)qdev_get_prop_ptr(bs->peer, prop) == bs)) {
> >> > +            if (prop->info->free) {
> >> > +                prop->info->free(bs->peer, prop);
> >> > +            }
> 
> Your use of prop->info->free() in this context is wrong.  More below.
> 
> >> 
> >> Does this null the drive property?  I doubt it.  Quick check in the
> >> debugger?
> >> 
> >> The free callbacks generally don't zap the properties, because they run
> >> from qdev_free().
> >
> > To be honest; I didn't see anything that looked like "remove this
> > property" in the qdev api.  Any pointers?
> 
> The closest we have is indeed the Property method free(), but that's not
> quite right.  It's really only for use by qdev_free().
> 
> > should I be calling qdev_free() on the dev?
> 
> No, because then the whole device is gone, not just the property :)
> 
> >                                              I don't quite understand
> > the distinction between the info list of properties and the device
> > itself, nor specifically what we need to remove in the drive_del()
> > operation versus the device_del() portion.
> 
> device_del / qdev_free() destroy a qdev, such as a "virtio-blk-pci"
> device (C type VirtIOPCIProxy).
> 
> drive_del destroys something else, namely the block device host part
> (BlockDriverState + DeviceInfo).  Obviously, it needs to zap all
> pointers to the host part along with it.  Specifically, it needs to zap
> the device's pointer to it.
> 
> Example: if a "virtio-blk-pci" device is using drive "foo", then
> "drive_del foo" needs to zap its member block.bs.
> 
> Complication: we don't (want to) know what kind of device exactly is
> using the drive.  But we do know that a drive property must be
> describing it.
> 
> So we search the properties (for (prop...)) for a drive property
> (prop->info->type == PROP_TYPE_DRIVE) that points to this drive (... ==
> bs).
> 
> Result:
> 
>     BlockDriverState *bs;
>     Property *prop;
>     BlockDriverState **ptr;
> [...]
>     for (prop = bs->peer->info->props; prop && prop->name; prop++) {
>         if ((prop->info->type == PROP_TYPE_DRIVE)) {
>             ptr = qdev_get_prop_ptr(dev, prop);
>             if (*ptr == bs) {
>                 bdrv_detach(bs, bs->peer);

Invoking the free method on the drive property does do detach:

free_drive
{
    BlockDriverState **ptr = qdev_get_prop_ptr(dev, prop);

    if (*ptr) {
        bdrv_detach(*ptr, dev);
        blockdev_auto_del(*ptr);
    }
}

and the bdrv_delete()

takes out the bs pointer.

> Only then are we ready to destroy the host part:
> 
>     drive_uninit(drive_get_by_blockdev(bs));

And if auto-deletion it set, then it handles the drive_uninit().  Do you think
we should explicitly invoke drive_uninit() ?
Markus Armbruster - Nov. 11, 2010, 10:48 a.m.
Ryan Harper <ryanh@us.ibm.com> writes:

> * Markus Armbruster <armbru@redhat.com> [2010-11-10 11:40]:
>> Ryan Harper <ryanh@us.ibm.com> writes:
>> 
>> > * Markus Armbruster <armbru@redhat.com> [2010-11-10 06:48]:
>> >> One real question, and a couple of nits.
>> >> 
>> >> Ryan Harper <ryanh@us.ibm.com> writes:
>> >> 
>> >> > Block hot unplug is racy since the guest is required to acknowlege the ACPI
>> >> > unplug event; this may not happen synchronously with the device removal command
>> >> 
>> >> Well, I wouldn't call unplug "racy".  It just takes an unpredictable
>> >> length of time, possibly forever.  To make a race, you need to throw in
>> >> a client assuming (incorrectly) that unplug is instantaneous, as
>> >> described in your next paragraph.
>> >> 
>> >> Moreover, all PCI unplug is that way, not just block.
>> >> 
>> >> > This series aims to close a gap where by mgmt applications that assume the
>> >> > block resource has been removed without confirming that the guest has
>> >> > acknowledged the removal may re-assign the underlying device to a second guest
>> >> > leading to data leakage.
>> >> 
>> >> Yes, the incorrect assumption is a problem.  But with that fixed (in the
>> >> management application), we run right into the next problem: there is no
>> >> way for the management application to reliably disconnect the guest from
>> >> a block device.  And that's the problem you're fixing.
>> >
>> > Yeah, that's the right way to word it; providing a method to forcibly
>> > disconnect the guest from the host device.
>> >> 
>> >> > This series introduces a new montor command to decouple asynchornous device
>> >> 
>> >> Typos "montor" and "asynchornous".  You might want to use a spell
>> >> checker :)
>> >> 
>> >> Lines are a bit long.  Recommend wrap at column 70.
>> >> 
>> >> > removal from restricting guest access to a block device.  We do this by creating
>> >> > a new monitor command drive_del which maps to a bdrv_unplug() command which
>> >> > does a qemu_aio_flush; bdrv_flush() and bdrv_close().  Once complete, subsequent
>> >> > IO is rejected from the device and the guest will get IO errors but continue to
>> >> > function.  In addition to preventing further IO, we clean up state pointers
>> >> > between host (BlockDriverState) and guest (DeviceInfo).
>> >> >
>> >> > A subsequent device removal command can be issued to remove the device, to which
>> >> > the guest may or maynot respond, but as long as the unplugged bit is set, no IO
>> >> 
>> >> "maynot" is not a word.
>> >> 
>> >> > will be sumbitted.
>> >> 
>> >> This suggests to drive_del before device_del, which makes the device
>> >> goes through a "broken device" state on its way to unplug.  If the guest
>> >> accesses the device in that state, it gets I/O errors.  Not nice.
>> >> 
>> >> Instead, I'd recommend device_del, wait for the device to go away,
>> >> drive_del on time out.  If the guest reacts to the ACPI unplug promptly,
>> >> it's never exposed to the "broken device" state.  Note: if the drive_del
>> >> fails because the device doesn't exist, we lost the race with the
>> >> automatic destruction, which is harmless.  Ignore that error.
>> >
>> > Honestly, other than describing what happens if you sever the connection
>> > when the guest isn't aware of it; I don't want to try to capture how the
>> > mgmt layer implements the removal.  
>> >
>> > One may want to force the disconnect before attempting to remove the
>> > device; or the other way around; that's really the mgmt layer's call.
>> 
>> Fair enough.
>> 
>> >> > Signed-off-by: Ryan Harper <ryanh@us.ibm.com>
>> >> > ---
>> >> >  block.c         |    7 +++++++
>> >> >  block.h         |    1 +
>> >> >  blockdev.c      |   36 ++++++++++++++++++++++++++++++++++++
>> >> >  blockdev.h      |    1 +
>> >> >  hmp-commands.hx |   18 ++++++++++++++++++
>> >> >  5 files changed, 63 insertions(+), 0 deletions(-)
>> >> >
>> >> > diff --git a/block.c b/block.c
>> >> > index 6b505fb..c76a796 100644
>> >> > --- a/block.c
>> >> > +++ b/block.c
>> >> > @@ -1328,6 +1328,13 @@ void bdrv_set_removable(BlockDriverState *bs, int removable)
>> >> >      }
>> >> >  }
>> >> >  
>> >> > +void bdrv_unplug(BlockDriverState *bs)
>> >> > +{
>> >> > +    qemu_aio_flush();
>> >> > +    bdrv_flush(bs);
>> >> > +    bdrv_close(bs);
>> >> > +}
>> >> > +
>> >> 
>> >> Unless we expect more users, I'd inline this into its only caller.
>> >> Matter of taste.
>> >
>> > Works for me.
>> >
>> >> 
>> >> >  int bdrv_is_removable(BlockDriverState *bs)
>> >> >  {
>> >> >      return bs->removable;
>> >> > diff --git a/block.h b/block.h
>> >> > index 78ecfac..581414c 100644
>> >> > --- a/block.h
>> >> > +++ b/block.h
>> >> > @@ -171,6 +171,7 @@ void bdrv_set_on_error(BlockDriverState *bs, BlockErrorAction on_read_error,
>> >> >                         BlockErrorAction on_write_error);
>> >> >  BlockErrorAction bdrv_get_on_error(BlockDriverState *bs, int is_read);
>> >> >  void bdrv_set_removable(BlockDriverState *bs, int removable);
>> >> > +void bdrv_unplug(BlockDriverState *bs);
>> >> >  int bdrv_is_removable(BlockDriverState *bs);
>> >> >  int bdrv_is_read_only(BlockDriverState *bs);
>> >> >  int bdrv_is_sg(BlockDriverState *bs);
>> >> > diff --git a/blockdev.c b/blockdev.c
>> >> > index 6cb179a..ee8c2ec 100644
>> >> > --- a/blockdev.c
>> >> > +++ b/blockdev.c
>> >> > @@ -14,6 +14,8 @@
>> >> >  #include "qemu-option.h"
>> >> >  #include "qemu-config.h"
>> >> >  #include "sysemu.h"
>> >> > +#include "hw/qdev.h"
>> >> > +#include "block_int.h"
>> >> >  
>> >> >  static QTAILQ_HEAD(drivelist, DriveInfo) drives = QTAILQ_HEAD_INITIALIZER(drives);
>> >> >  
>> >> > @@ -597,3 +599,37 @@ int do_change_block(Monitor *mon, const char *device,
>> >> >      }
>> >> >      return monitor_read_bdrv_key_start(mon, bs, NULL, NULL);
>> >> >  }
>> >> > +
>> >> > +int do_drive_del(Monitor *mon, const QDict *qdict, QObject **ret_data)
>> >> > +{
>> >> > +    const char *id = qdict_get_str(qdict, "id");
>> >> > +    BlockDriverState *bs;
>> >> > +    Property *prop;
>> >> > +
>> >> > +    bs = bdrv_find(id);
>> >> > +    if (!bs) {
>> >> > +        qerror_report(QERR_DEVICE_NOT_FOUND, id);
>> >> > +        return -1;
>> >> > +    }
>> >> > +
>> >> > +    /* quiesce block driver; prevent further io */
>> >> > +    bdrv_unplug(bs);
>> >> > +
>> >> > +    /* clean up guest state from pointing to host resource by
>> >> > +     * finding and removing DeviceState "drive" property */
>> >> > +    for (prop = bs->peer->info->props; prop && prop->name; prop++) {
>> >> > +        if ((prop->info->type == PROP_TYPE_DRIVE) && 
>> >> > +            (*(BlockDriverState **)qdev_get_prop_ptr(bs->peer, prop) == bs)) {
>> >> > +            if (prop->info->free) {
>> >> > +                prop->info->free(bs->peer, prop);
>> >> > +            }
>> 
>> Your use of prop->info->free() in this context is wrong.  More below.
>> 
>> >> 
>> >> Does this null the drive property?  I doubt it.  Quick check in the
>> >> debugger?
>> >> 
>> >> The free callbacks generally don't zap the properties, because they run
>> >> from qdev_free().
>> >
>> > To be honest; I didn't see anything that looked like "remove this
>> > property" in the qdev api.  Any pointers?
>> 
>> The closest we have is indeed the Property method free(), but that's not
>> quite right.  It's really only for use by qdev_free().
>> 
>> > should I be calling qdev_free() on the dev?
>> 
>> No, because then the whole device is gone, not just the property :)
>> 
>> >                                              I don't quite understand
>> > the distinction between the info list of properties and the device
>> > itself, nor specifically what we need to remove in the drive_del()
>> > operation versus the device_del() portion.
>> 
>> device_del / qdev_free() destroy a qdev, such as a "virtio-blk-pci"
>> device (C type VirtIOPCIProxy).
>> 
>> drive_del destroys something else, namely the block device host part
>> (BlockDriverState + DeviceInfo).  Obviously, it needs to zap all
>> pointers to the host part along with it.  Specifically, it needs to zap
>> the device's pointer to it.
>> 
>> Example: if a "virtio-blk-pci" device is using drive "foo", then
>> "drive_del foo" needs to zap its member block.bs.
>> 
>> Complication: we don't (want to) know what kind of device exactly is
>> using the drive.  But we do know that a drive property must be
>> describing it.
>> 
>> So we search the properties (for (prop...)) for a drive property
>> (prop->info->type == PROP_TYPE_DRIVE) that points to this drive (... ==
>> bs).
>> 
>> Result:
>> 
>>     BlockDriverState *bs;
>>     Property *prop;
>>     BlockDriverState **ptr;
>> [...]
>>     for (prop = bs->peer->info->props; prop && prop->name; prop++) {
>>         if ((prop->info->type == PROP_TYPE_DRIVE)) {
>>             ptr = qdev_get_prop_ptr(dev, prop);
>>             if (*ptr == bs) {
>>                 bdrv_detach(bs, bs->peer);
>
> Invoking the free method on the drive property does do detach:
>
> free_drive
> {
>     BlockDriverState **ptr = qdev_get_prop_ptr(dev, prop);
>
>     if (*ptr) {
>         bdrv_detach(*ptr, dev);
>         blockdev_auto_del(*ptr);
>     }
> }
>
> and the bdrv_delete()
>
> takes out the bs pointer.

Which pointer?  Which bdrv_delete()?

>> Only then are we ready to destroy the host part:
>> 
>>     drive_uninit(drive_get_by_blockdev(bs));
>
> And if auto-deletion it set, then it handles the drive_uninit().  Do you think
> we should explicitly invoke drive_uninit() ?

Actually, blockdev_auto_del() deletes the block device only if DriveInfo
has auto_del set.  Why is that?  Quote blockdev.c:

/*
 * We automatically delete the drive when a device using it gets
 * unplugged.  Questionable feature, but we can't just drop it.
 * Device models call blockdev_mark_auto_del() to schedule the
 * automatic deletion, and generic qdev code calls blockdev_auto_del()
 * when deletion is actually safe.
 */

Thus, you need to blockdev_mark_auto_del() before blockdev_auto_del().

However, my blockdev_add work-in-progress changes these two functions to
*only* delete block devices created the old way (-drive, drive_add).
You want them deleted regardless of how they were created.  That's why I
asked you to use drive_uninit() directly.

You could argue that Property method free() *should* work here.  Fair
point.  If you want to clean that up, you're quite welcome.  But I don't
want to burden your fix with that, so feel free to add a suitable
comment instead.
Ryan Harper - Nov. 11, 2010, 1:25 p.m.
* Markus Armbruster <armbru@redhat.com> [2010-11-11 04:48]:
> Ryan Harper <ryanh@us.ibm.com> writes:
> 
> > * Markus Armbruster <armbru@redhat.com> [2010-11-10 11:40]:
> >> Ryan Harper <ryanh@us.ibm.com> writes:
> >> 
> >> > * Markus Armbruster <armbru@redhat.com> [2010-11-10 06:48]:
> >> >> One real question, and a couple of nits.
> >> >> 
> >> >> Ryan Harper <ryanh@us.ibm.com> writes:
> >> >> 
> >> >> > Block hot unplug is racy since the guest is required to acknowlege the ACPI
> >> >> > unplug event; this may not happen synchronously with the device removal command
> >> >> 
> >> >> Well, I wouldn't call unplug "racy".  It just takes an unpredictable
> >> >> length of time, possibly forever.  To make a race, you need to throw in
> >> >> a client assuming (incorrectly) that unplug is instantaneous, as
> >> >> described in your next paragraph.
> >> >> 
> >> >> Moreover, all PCI unplug is that way, not just block.
> >> >> 
> >> >> > This series aims to close a gap where by mgmt applications that assume the
> >> >> > block resource has been removed without confirming that the guest has
> >> >> > acknowledged the removal may re-assign the underlying device to a second guest
> >> >> > leading to data leakage.
> >> >> 
> >> >> Yes, the incorrect assumption is a problem.  But with that fixed (in the
> >> >> management application), we run right into the next problem: there is no
> >> >> way for the management application to reliably disconnect the guest from
> >> >> a block device.  And that's the problem you're fixing.
> >> >
> >> > Yeah, that's the right way to word it; providing a method to forcibly
> >> > disconnect the guest from the host device.
> >> >> 
> >> >> > This series introduces a new montor command to decouple asynchornous device
> >> >> 
> >> >> Typos "montor" and "asynchornous".  You might want to use a spell
> >> >> checker :)
> >> >> 
> >> >> Lines are a bit long.  Recommend wrap at column 70.
> >> >> 
> >> >> > removal from restricting guest access to a block device.  We do this by creating
> >> >> > a new monitor command drive_del which maps to a bdrv_unplug() command which
> >> >> > does a qemu_aio_flush; bdrv_flush() and bdrv_close().  Once complete, subsequent
> >> >> > IO is rejected from the device and the guest will get IO errors but continue to
> >> >> > function.  In addition to preventing further IO, we clean up state pointers
> >> >> > between host (BlockDriverState) and guest (DeviceInfo).
> >> >> >
> >> >> > A subsequent device removal command can be issued to remove the device, to which
> >> >> > the guest may or maynot respond, but as long as the unplugged bit is set, no IO
> >> >> 
> >> >> "maynot" is not a word.
> >> >> 
> >> >> > will be sumbitted.
> >> >> 
> >> >> This suggests to drive_del before device_del, which makes the device
> >> >> goes through a "broken device" state on its way to unplug.  If the guest
> >> >> accesses the device in that state, it gets I/O errors.  Not nice.
> >> >> 
> >> >> Instead, I'd recommend device_del, wait for the device to go away,
> >> >> drive_del on time out.  If the guest reacts to the ACPI unplug promptly,
> >> >> it's never exposed to the "broken device" state.  Note: if the drive_del
> >> >> fails because the device doesn't exist, we lost the race with the
> >> >> automatic destruction, which is harmless.  Ignore that error.
> >> >
> >> > Honestly, other than describing what happens if you sever the connection
> >> > when the guest isn't aware of it; I don't want to try to capture how the
> >> > mgmt layer implements the removal.  
> >> >
> >> > One may want to force the disconnect before attempting to remove the
> >> > device; or the other way around; that's really the mgmt layer's call.
> >> 
> >> Fair enough.
> >> 
> >> >> > Signed-off-by: Ryan Harper <ryanh@us.ibm.com>
> >> >> > ---
> >> >> >  block.c         |    7 +++++++
> >> >> >  block.h         |    1 +
> >> >> >  blockdev.c      |   36 ++++++++++++++++++++++++++++++++++++
> >> >> >  blockdev.h      |    1 +
> >> >> >  hmp-commands.hx |   18 ++++++++++++++++++
> >> >> >  5 files changed, 63 insertions(+), 0 deletions(-)
> >> >> >
> >> >> > diff --git a/block.c b/block.c
> >> >> > index 6b505fb..c76a796 100644
> >> >> > --- a/block.c
> >> >> > +++ b/block.c
> >> >> > @@ -1328,6 +1328,13 @@ void bdrv_set_removable(BlockDriverState *bs, int removable)
> >> >> >      }
> >> >> >  }
> >> >> >  
> >> >> > +void bdrv_unplug(BlockDriverState *bs)
> >> >> > +{
> >> >> > +    qemu_aio_flush();
> >> >> > +    bdrv_flush(bs);
> >> >> > +    bdrv_close(bs);
> >> >> > +}
> >> >> > +
> >> >> 
> >> >> Unless we expect more users, I'd inline this into its only caller.
> >> >> Matter of taste.
> >> >
> >> > Works for me.
> >> >
> >> >> 
> >> >> >  int bdrv_is_removable(BlockDriverState *bs)
> >> >> >  {
> >> >> >      return bs->removable;
> >> >> > diff --git a/block.h b/block.h
> >> >> > index 78ecfac..581414c 100644
> >> >> > --- a/block.h
> >> >> > +++ b/block.h
> >> >> > @@ -171,6 +171,7 @@ void bdrv_set_on_error(BlockDriverState *bs, BlockErrorAction on_read_error,
> >> >> >                         BlockErrorAction on_write_error);
> >> >> >  BlockErrorAction bdrv_get_on_error(BlockDriverState *bs, int is_read);
> >> >> >  void bdrv_set_removable(BlockDriverState *bs, int removable);
> >> >> > +void bdrv_unplug(BlockDriverState *bs);
> >> >> >  int bdrv_is_removable(BlockDriverState *bs);
> >> >> >  int bdrv_is_read_only(BlockDriverState *bs);
> >> >> >  int bdrv_is_sg(BlockDriverState *bs);
> >> >> > diff --git a/blockdev.c b/blockdev.c
> >> >> > index 6cb179a..ee8c2ec 100644
> >> >> > --- a/blockdev.c
> >> >> > +++ b/blockdev.c
> >> >> > @@ -14,6 +14,8 @@
> >> >> >  #include "qemu-option.h"
> >> >> >  #include "qemu-config.h"
> >> >> >  #include "sysemu.h"
> >> >> > +#include "hw/qdev.h"
> >> >> > +#include "block_int.h"
> >> >> >  
> >> >> >  static QTAILQ_HEAD(drivelist, DriveInfo) drives = QTAILQ_HEAD_INITIALIZER(drives);
> >> >> >  
> >> >> > @@ -597,3 +599,37 @@ int do_change_block(Monitor *mon, const char *device,
> >> >> >      }
> >> >> >      return monitor_read_bdrv_key_start(mon, bs, NULL, NULL);
> >> >> >  }
> >> >> > +
> >> >> > +int do_drive_del(Monitor *mon, const QDict *qdict, QObject **ret_data)
> >> >> > +{
> >> >> > +    const char *id = qdict_get_str(qdict, "id");
> >> >> > +    BlockDriverState *bs;
> >> >> > +    Property *prop;
> >> >> > +
> >> >> > +    bs = bdrv_find(id);
> >> >> > +    if (!bs) {
> >> >> > +        qerror_report(QERR_DEVICE_NOT_FOUND, id);
> >> >> > +        return -1;
> >> >> > +    }
> >> >> > +
> >> >> > +    /* quiesce block driver; prevent further io */
> >> >> > +    bdrv_unplug(bs);
> >> >> > +
> >> >> > +    /* clean up guest state from pointing to host resource by
> >> >> > +     * finding and removing DeviceState "drive" property */
> >> >> > +    for (prop = bs->peer->info->props; prop && prop->name; prop++) {
> >> >> > +        if ((prop->info->type == PROP_TYPE_DRIVE) && 
> >> >> > +            (*(BlockDriverState **)qdev_get_prop_ptr(bs->peer, prop) == bs)) {
> >> >> > +            if (prop->info->free) {
> >> >> > +                prop->info->free(bs->peer, prop);
> >> >> > +            }
> >> 
> >> Your use of prop->info->free() in this context is wrong.  More below.
> >> 
> >> >> 
> >> >> Does this null the drive property?  I doubt it.  Quick check in the
> >> >> debugger?
> >> >> 
> >> >> The free callbacks generally don't zap the properties, because they run
> >> >> from qdev_free().
> >> >
> >> > To be honest; I didn't see anything that looked like "remove this
> >> > property" in the qdev api.  Any pointers?
> >> 
> >> The closest we have is indeed the Property method free(), but that's not
> >> quite right.  It's really only for use by qdev_free().
> >> 
> >> > should I be calling qdev_free() on the dev?
> >> 
> >> No, because then the whole device is gone, not just the property :)
> >> 
> >> >                                              I don't quite understand
> >> > the distinction between the info list of properties and the device
> >> > itself, nor specifically what we need to remove in the drive_del()
> >> > operation versus the device_del() portion.
> >> 
> >> device_del / qdev_free() destroy a qdev, such as a "virtio-blk-pci"
> >> device (C type VirtIOPCIProxy).
> >> 
> >> drive_del destroys something else, namely the block device host part
> >> (BlockDriverState + DeviceInfo).  Obviously, it needs to zap all
> >> pointers to the host part along with it.  Specifically, it needs to zap
> >> the device's pointer to it.
> >> 
> >> Example: if a "virtio-blk-pci" device is using drive "foo", then
> >> "drive_del foo" needs to zap its member block.bs.
> >> 
> >> Complication: we don't (want to) know what kind of device exactly is
> >> using the drive.  But we do know that a drive property must be
> >> describing it.
> >> 
> >> So we search the properties (for (prop...)) for a drive property
> >> (prop->info->type == PROP_TYPE_DRIVE) that points to this drive (... ==
> >> bs).
> >> 
> >> Result:
> >> 
> >>     BlockDriverState *bs;
> >>     Property *prop;
> >>     BlockDriverState **ptr;
> >> [...]
> >>     for (prop = bs->peer->info->props; prop && prop->name; prop++) {
> >>         if ((prop->info->type == PROP_TYPE_DRIVE)) {
> >>             ptr = qdev_get_prop_ptr(dev, prop);
> >>             if (*ptr == bs) {
> >>                 bdrv_detach(bs, bs->peer);
> >
> > Invoking the free method on the drive property does do detach:
> >
> > free_drive
> > {
> >     BlockDriverState **ptr = qdev_get_prop_ptr(dev, prop);
> >
> >     if (*ptr) {
> >         bdrv_detach(*ptr, dev);
> >         blockdev_auto_del(*ptr);
> >     }
> > }
> >
> > and the bdrv_delete()
> >
> > takes out the bs pointer.
> 
> Which pointer?  Which bdrv_delete()?

I suppose it's the BlockDriverState returned from bdrv_find() since I'm
invoking bdrv_delete(bs);  

And I suppose qdev_get_prop_ptr() is returning a different ptr to the
same bs; in which case we'll still need the null you had suggested?

> 
> >> Only then are we ready to destroy the host part:
> >> 
> >>     drive_uninit(drive_get_by_blockdev(bs));
> >
> > And if auto-deletion it set, then it handles the drive_uninit().  Do you think
> > we should explicitly invoke drive_uninit() ?
> 
> Actually, blockdev_auto_del() deletes the block device only if DriveInfo
> has auto_del set.  Why is that?  Quote blockdev.c:
> 
> /*
>  * We automatically delete the drive when a device using it gets
>  * unplugged.  Questionable feature, but we can't just drop it.
>  * Device models call blockdev_mark_auto_del() to schedule the
>  * automatic deletion, and generic qdev code calls blockdev_auto_del()
>  * when deletion is actually safe.
>  */
> 
> Thus, you need to blockdev_mark_auto_del() before blockdev_auto_del().
> 
> However, my blockdev_add work-in-progress changes these two functions to
> *only* delete block devices created the old way (-drive, drive_add).
> You want them deleted regardless of how they were created.  That's why I
> asked you to use drive_uninit() directly.

Gotcha.

> 
> You could argue that Property method free() *should* work here.  Fair
> point.  If you want to clean that up, you're quite welcome.  But I don't
> want to burden your fix with that, so feel free to add a suitable
> comment instead.
Markus Armbruster - Nov. 11, 2010, 2:50 p.m.
Ryan Harper <ryanh@us.ibm.com> writes:

> * Markus Armbruster <armbru@redhat.com> [2010-11-11 04:48]:
>> Ryan Harper <ryanh@us.ibm.com> writes:
>> 
>> > * Markus Armbruster <armbru@redhat.com> [2010-11-10 11:40]:
>> >> Ryan Harper <ryanh@us.ibm.com> writes:
>> >> 
>> >> > * Markus Armbruster <armbru@redhat.com> [2010-11-10 06:48]:
>> >> >> One real question, and a couple of nits.
>> >> >> 
>> >> >> Ryan Harper <ryanh@us.ibm.com> writes:
>> >> >> 
>> >> >> > Block hot unplug is racy since the guest is required to acknowlege the ACPI
>> >> >> > unplug event; this may not happen synchronously with the device removal command
>> >> >> 
>> >> >> Well, I wouldn't call unplug "racy".  It just takes an unpredictable
>> >> >> length of time, possibly forever.  To make a race, you need to throw in
>> >> >> a client assuming (incorrectly) that unplug is instantaneous, as
>> >> >> described in your next paragraph.
>> >> >> 
>> >> >> Moreover, all PCI unplug is that way, not just block.
>> >> >> 
>> >> >> > This series aims to close a gap where by mgmt applications that assume the
>> >> >> > block resource has been removed without confirming that the guest has
>> >> >> > acknowledged the removal may re-assign the underlying device to a second guest
>> >> >> > leading to data leakage.
>> >> >> 
>> >> >> Yes, the incorrect assumption is a problem.  But with that fixed (in the
>> >> >> management application), we run right into the next problem: there is no
>> >> >> way for the management application to reliably disconnect the guest from
>> >> >> a block device.  And that's the problem you're fixing.
>> >> >
>> >> > Yeah, that's the right way to word it; providing a method to forcibly
>> >> > disconnect the guest from the host device.
>> >> >> 
>> >> >> > This series introduces a new montor command to decouple asynchornous device
>> >> >> 
>> >> >> Typos "montor" and "asynchornous".  You might want to use a spell
>> >> >> checker :)
>> >> >> 
>> >> >> Lines are a bit long.  Recommend wrap at column 70.
>> >> >> 
>> >> >> > removal from restricting guest access to a block device.  We do this by creating
>> >> >> > a new monitor command drive_del which maps to a bdrv_unplug() command which
>> >> >> > does a qemu_aio_flush; bdrv_flush() and bdrv_close().  Once complete, subsequent
>> >> >> > IO is rejected from the device and the guest will get IO errors but continue to
>> >> >> > function.  In addition to preventing further IO, we clean up state pointers
>> >> >> > between host (BlockDriverState) and guest (DeviceInfo).
>> >> >> >
>> >> >> > A subsequent device removal command can be issued to remove the device, to which
>> >> >> > the guest may or maynot respond, but as long as the unplugged bit is set, no IO
>> >> >> 
>> >> >> "maynot" is not a word.
>> >> >> 
>> >> >> > will be sumbitted.
>> >> >> 
>> >> >> This suggests to drive_del before device_del, which makes the device
>> >> >> goes through a "broken device" state on its way to unplug.  If the guest
>> >> >> accesses the device in that state, it gets I/O errors.  Not nice.
>> >> >> 
>> >> >> Instead, I'd recommend device_del, wait for the device to go away,
>> >> >> drive_del on time out.  If the guest reacts to the ACPI unplug promptly,
>> >> >> it's never exposed to the "broken device" state.  Note: if the drive_del
>> >> >> fails because the device doesn't exist, we lost the race with the
>> >> >> automatic destruction, which is harmless.  Ignore that error.
>> >> >
>> >> > Honestly, other than describing what happens if you sever the connection
>> >> > when the guest isn't aware of it; I don't want to try to capture how the
>> >> > mgmt layer implements the removal.  
>> >> >
>> >> > One may want to force the disconnect before attempting to remove the
>> >> > device; or the other way around; that's really the mgmt layer's call.
>> >> 
>> >> Fair enough.
>> >> 
>> >> >> > Signed-off-by: Ryan Harper <ryanh@us.ibm.com>
>> >> >> > ---
>> >> >> >  block.c         |    7 +++++++
>> >> >> >  block.h         |    1 +
>> >> >> >  blockdev.c      |   36 ++++++++++++++++++++++++++++++++++++
>> >> >> >  blockdev.h      |    1 +
>> >> >> >  hmp-commands.hx |   18 ++++++++++++++++++
>> >> >> >  5 files changed, 63 insertions(+), 0 deletions(-)
>> >> >> >
>> >> >> > diff --git a/block.c b/block.c
>> >> >> > index 6b505fb..c76a796 100644
>> >> >> > --- a/block.c
>> >> >> > +++ b/block.c
>> >> >> > @@ -1328,6 +1328,13 @@ void bdrv_set_removable(BlockDriverState *bs, int removable)
>> >> >> >      }
>> >> >> >  }
>> >> >> >  
>> >> >> > +void bdrv_unplug(BlockDriverState *bs)
>> >> >> > +{
>> >> >> > +    qemu_aio_flush();
>> >> >> > +    bdrv_flush(bs);
>> >> >> > +    bdrv_close(bs);
>> >> >> > +}
>> >> >> > +
>> >> >> 
>> >> >> Unless we expect more users, I'd inline this into its only caller.
>> >> >> Matter of taste.
>> >> >
>> >> > Works for me.
>> >> >
>> >> >> 
>> >> >> >  int bdrv_is_removable(BlockDriverState *bs)
>> >> >> >  {
>> >> >> >      return bs->removable;
>> >> >> > diff --git a/block.h b/block.h
>> >> >> > index 78ecfac..581414c 100644
>> >> >> > --- a/block.h
>> >> >> > +++ b/block.h
>> >> >> > @@ -171,6 +171,7 @@ void bdrv_set_on_error(BlockDriverState *bs, BlockErrorAction on_read_error,
>> >> >> >                         BlockErrorAction on_write_error);
>> >> >> >  BlockErrorAction bdrv_get_on_error(BlockDriverState *bs, int is_read);
>> >> >> >  void bdrv_set_removable(BlockDriverState *bs, int removable);
>> >> >> > +void bdrv_unplug(BlockDriverState *bs);
>> >> >> >  int bdrv_is_removable(BlockDriverState *bs);
>> >> >> >  int bdrv_is_read_only(BlockDriverState *bs);
>> >> >> >  int bdrv_is_sg(BlockDriverState *bs);
>> >> >> > diff --git a/blockdev.c b/blockdev.c
>> >> >> > index 6cb179a..ee8c2ec 100644
>> >> >> > --- a/blockdev.c
>> >> >> > +++ b/blockdev.c
>> >> >> > @@ -14,6 +14,8 @@
>> >> >> >  #include "qemu-option.h"
>> >> >> >  #include "qemu-config.h"
>> >> >> >  #include "sysemu.h"
>> >> >> > +#include "hw/qdev.h"
>> >> >> > +#include "block_int.h"
>> >> >> >  
>> >> >> >  static QTAILQ_HEAD(drivelist, DriveInfo) drives = QTAILQ_HEAD_INITIALIZER(drives);
>> >> >> >  
>> >> >> > @@ -597,3 +599,37 @@ int do_change_block(Monitor *mon, const char *device,
>> >> >> >      }
>> >> >> >      return monitor_read_bdrv_key_start(mon, bs, NULL, NULL);
>> >> >> >  }
>> >> >> > +
>> >> >> > +int do_drive_del(Monitor *mon, const QDict *qdict, QObject **ret_data)
>> >> >> > +{
>> >> >> > +    const char *id = qdict_get_str(qdict, "id");
>> >> >> > +    BlockDriverState *bs;
>> >> >> > +    Property *prop;
>> >> >> > +
>> >> >> > +    bs = bdrv_find(id);
>> >> >> > +    if (!bs) {
>> >> >> > +        qerror_report(QERR_DEVICE_NOT_FOUND, id);
>> >> >> > +        return -1;
>> >> >> > +    }
>> >> >> > +
>> >> >> > +    /* quiesce block driver; prevent further io */
>> >> >> > +    bdrv_unplug(bs);
>> >> >> > +
>> >> >> > +    /* clean up guest state from pointing to host resource by
>> >> >> > +     * finding and removing DeviceState "drive" property */
>> >> >> > +    for (prop = bs->peer->info->props; prop && prop->name; prop++) {
>> >> >> > +        if ((prop->info->type == PROP_TYPE_DRIVE) && 
>> >> >> > +            (*(BlockDriverState **)qdev_get_prop_ptr(bs->peer, prop) == bs)) {
>> >> >> > +            if (prop->info->free) {
>> >> >> > +                prop->info->free(bs->peer, prop);
>> >> >> > +            }
>> >> 
>> >> Your use of prop->info->free() in this context is wrong.  More below.
>> >> 
>> >> >> 
>> >> >> Does this null the drive property?  I doubt it.  Quick check in the
>> >> >> debugger?
>> >> >> 
>> >> >> The free callbacks generally don't zap the properties, because they run
>> >> >> from qdev_free().
>> >> >
>> >> > To be honest; I didn't see anything that looked like "remove this
>> >> > property" in the qdev api.  Any pointers?
>> >> 
>> >> The closest we have is indeed the Property method free(), but that's not
>> >> quite right.  It's really only for use by qdev_free().
>> >> 
>> >> > should I be calling qdev_free() on the dev?
>> >> 
>> >> No, because then the whole device is gone, not just the property :)
>> >> 
>> >> >                                              I don't quite understand
>> >> > the distinction between the info list of properties and the device
>> >> > itself, nor specifically what we need to remove in the drive_del()
>> >> > operation versus the device_del() portion.
>> >> 
>> >> device_del / qdev_free() destroy a qdev, such as a "virtio-blk-pci"
>> >> device (C type VirtIOPCIProxy).
>> >> 
>> >> drive_del destroys something else, namely the block device host part
>> >> (BlockDriverState + DeviceInfo).  Obviously, it needs to zap all
>> >> pointers to the host part along with it.  Specifically, it needs to zap
>> >> the device's pointer to it.
>> >> 
>> >> Example: if a "virtio-blk-pci" device is using drive "foo", then
>> >> "drive_del foo" needs to zap its member block.bs.
>> >> 
>> >> Complication: we don't (want to) know what kind of device exactly is
>> >> using the drive.  But we do know that a drive property must be
>> >> describing it.
>> >> 
>> >> So we search the properties (for (prop...)) for a drive property
>> >> (prop->info->type == PROP_TYPE_DRIVE) that points to this drive (... ==
>> >> bs).
>> >> 
>> >> Result:
>> >> 
>> >>     BlockDriverState *bs;
>> >>     Property *prop;
>> >>     BlockDriverState **ptr;
>> >> [...]
>> >>     for (prop = bs->peer->info->props; prop && prop->name; prop++) {
>> >>         if ((prop->info->type == PROP_TYPE_DRIVE)) {
>> >>             ptr = qdev_get_prop_ptr(dev, prop);
>> >>             if (*ptr == bs) {
>> >>                 bdrv_detach(bs, bs->peer);
>> >
>> > Invoking the free method on the drive property does do detach:
>> >
>> > free_drive
>> > {
>> >     BlockDriverState **ptr = qdev_get_prop_ptr(dev, prop);
>> >
>> >     if (*ptr) {
>> >         bdrv_detach(*ptr, dev);
>> >         blockdev_auto_del(*ptr);
>> >     }
>> > }
>> >
>> > and the bdrv_delete()
>> >
>> > takes out the bs pointer.
>> 
>> Which pointer?  Which bdrv_delete()?
>
> I suppose it's the BlockDriverState returned from bdrv_find() since I'm
> invoking bdrv_delete(bs);  
>
> And I suppose qdev_get_prop_ptr() is returning a different ptr to the
> same bs; in which case we'll still need the null you had suggested?

Yes.

The qdev_get_prop_ptr() returns a pointer to the pointer to the bs you
started with.

In other words:

* bs->peer points from bs to the qdev using this drive

* The qdev state contains a pointer back to bs.

  Example: for virtio-blk-pci, that's VirtIOPCIProxy member block.bs.

* a drive property describes that pointer, and qdev_get_prop_ptr()
  returns a pointer to that pointer in the qdev state.

  Example: for a virtio-blk-pci, it returns
  &DO_UPCAST(VirtIOPCIProxy, pci_dev.qdev, bs->peer)->block.bs.

To disconnect drive from qdev, we need to zap both bs->peer and the
pointer to bs in the qdev state.

Clear?

[...]

Patch

diff --git a/block.c b/block.c
index 6b505fb..c76a796 100644
--- a/block.c
+++ b/block.c
@@ -1328,6 +1328,13 @@  void bdrv_set_removable(BlockDriverState *bs, int removable)
     }
 }
 
+void bdrv_unplug(BlockDriverState *bs)
+{
+    qemu_aio_flush();
+    bdrv_flush(bs);
+    bdrv_close(bs);
+}
+
 int bdrv_is_removable(BlockDriverState *bs)
 {
     return bs->removable;
diff --git a/block.h b/block.h
index 78ecfac..581414c 100644
--- a/block.h
+++ b/block.h
@@ -171,6 +171,7 @@  void bdrv_set_on_error(BlockDriverState *bs, BlockErrorAction on_read_error,
                        BlockErrorAction on_write_error);
 BlockErrorAction bdrv_get_on_error(BlockDriverState *bs, int is_read);
 void bdrv_set_removable(BlockDriverState *bs, int removable);
+void bdrv_unplug(BlockDriverState *bs);
 int bdrv_is_removable(BlockDriverState *bs);
 int bdrv_is_read_only(BlockDriverState *bs);
 int bdrv_is_sg(BlockDriverState *bs);
diff --git a/blockdev.c b/blockdev.c
index 6cb179a..ee8c2ec 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -14,6 +14,8 @@ 
 #include "qemu-option.h"
 #include "qemu-config.h"
 #include "sysemu.h"
+#include "hw/qdev.h"
+#include "block_int.h"
 
 static QTAILQ_HEAD(drivelist, DriveInfo) drives = QTAILQ_HEAD_INITIALIZER(drives);
 
@@ -597,3 +599,37 @@  int do_change_block(Monitor *mon, const char *device,
     }
     return monitor_read_bdrv_key_start(mon, bs, NULL, NULL);
 }
+
+int do_drive_del(Monitor *mon, const QDict *qdict, QObject **ret_data)
+{
+    const char *id = qdict_get_str(qdict, "id");
+    BlockDriverState *bs;
+    Property *prop;
+
+    bs = bdrv_find(id);
+    if (!bs) {
+        qerror_report(QERR_DEVICE_NOT_FOUND, id);
+        return -1;
+    }
+
+    /* quiesce block driver; prevent further io */
+    bdrv_unplug(bs);
+
+    /* clean up guest state from pointing to host resource by
+     * finding and removing DeviceState "drive" property */
+    for (prop = bs->peer->info->props; prop && prop->name; prop++) {
+        if ((prop->info->type == PROP_TYPE_DRIVE) && 
+            (*(BlockDriverState **)qdev_get_prop_ptr(bs->peer, prop) == bs)) {
+            if (prop->info->free) {
+                prop->info->free(bs->peer, prop);
+            }
+        }
+    }
+
+    /* clean up host state pointing to guest resource by removing
+     * pointers to guest device in the BlockDriverState */
+    bdrv_delete(bs);
+
+    return 0;
+}
+ 
diff --git a/blockdev.h b/blockdev.h
index 653affc..2a0559e 100644
--- a/blockdev.h
+++ b/blockdev.h
@@ -51,5 +51,6 @@  int do_eject(Monitor *mon, const QDict *qdict, QObject **ret_data);
 int do_block_set_passwd(Monitor *mon, const QDict *qdict, QObject **ret_data);
 int do_change_block(Monitor *mon, const char *device,
                     const char *filename, const char *fmt);
+int do_drive_del(Monitor *mon, const QDict *qdict, QObject **ret_data);
 
 #endif
diff --git a/hmp-commands.hx b/hmp-commands.hx
index e5585ba..d6dc18c 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -68,6 +68,24 @@  Eject a removable medium (use -f to force it).
 ETEXI
 
     {
+        .name       = "drive_del",
+        .args_type  = "id:s",
+        .params     = "device",
+        .help       = "remove host block device",
+        .user_print = monitor_user_noop,
+        .mhandler.cmd_new = do_drive_del,
+    },
+
+STEXI
+@item delete @var{device}
+@findex delete
+Remove host block device.  The result is that guest generated IO is no longer
+submitted against the host device underlying the disk.  Once a drive has
+been deleted, the QEMU Block layer returns -EIO which results in IO 
+errors in the guest for applications that are reading/writing to the device.
+ETEXI
+
+    {
         .name       = "change",
         .args_type  = "device:B,target:F,arg:s?",
         .params     = "device filename [format]",