diff mbox

[PATCHv3] piix: fix up/down races

Message ID 20120402100345.GA19689@redhat.com
State New
Headers show

Commit Message

Michael S. Tsirkin April 2, 2012, 10:03 a.m. UTC
piix acpi interface suffers from the following 2 issues:

1.
- delete device a
- quickly add device b in another slot

if we do this before guest reads the down register,
the down event is discarded and device will never
be deleted.

2.
- delete device a
- quickly reset before guest can respond

interrupt is reset and guest will never eject the device.

To fix this, we implement three changes:

1. Do not clear up/down registers on each hotplug event.
this ensures we don't lose events.

2. Make existing up/down registers write 1 to clear;
bios will now write back to these the value it read.

3. on reset, remove all devices which have DOWN bit set

For compatibility with old guests, we also clear
the DOWN bit on write to EJ0 for a device.

This patch also extends the documentation for the
ACPI interface, to make the semantics explicit.

Compatibility notes: migrating guests across qemu versions
can cause bios from one qemu realease running on another qemu
release. The following documents the behaviour in this case:

A. new bios running on old qemu
A bios implementing the change mentioned above will if running on an old
qemu introduce a race: the registers were writeable there, so if a
register changes between read and write, we'll clobber the new bits
losing events.  As that version tends to lose events anyway, and as the
better than complicating the interface. If someone cares enough, the fix
can go on the stable branch.
B. old bios running on new qemu
Down bits will get cleared on eject because of 3 above.
Up bits only get cleared on reset.
This will cause unnecessary notifications and bus rescans,
but they are harmless.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---

Changes from v2:
- Use existing register instead of adding a new one,
  and update documentation
Changes from v1:
- tweaked a comment about clearing down register on reset
- documentation update

 docs/specs/acpi_pci_hotplug.txt |   55 +++++++++++++++++++++++++++++++-----
 hw/acpi_piix4.c                 |   58 ++++++++++++++++++++++++++++-----------
 2 files changed, 89 insertions(+), 24 deletions(-)

Comments

Alex Williamson April 2, 2012, 6:59 p.m. UTC | #1
On Mon, 2012-04-02 at 13:03 +0300, Michael S. Tsirkin wrote:
> piix acpi interface suffers from the following 2 issues:
> 
> 1.
> - delete device a
> - quickly add device b in another slot
> 
> if we do this before guest reads the down register,
> the down event is discarded and device will never
> be deleted.
> 
> 2.
> - delete device a
> - quickly reset before guest can respond
> 
> interrupt is reset and guest will never eject the device.
> 
> To fix this, we implement three changes:
> 
> 1. Do not clear up/down registers on each hotplug event.
> this ensures we don't lose events.
> 
> 2. Make existing up/down registers write 1 to clear;
> bios will now write back to these the value it read.
> 
> 3. on reset, remove all devices which have DOWN bit set
> 
> For compatibility with old guests, we also clear
> the DOWN bit on write to EJ0 for a device.
> 
> This patch also extends the documentation for the
> ACPI interface, to make the semantics explicit.
> 
> Compatibility notes: migrating guests across qemu versions
> can cause bios from one qemu realease running on another qemu
> release. The following documents the behaviour in this case:
> 
> A. new bios running on old qemu
> A bios implementing the change mentioned above will if running on an old
> qemu introduce a race: the registers were writeable there, so if a
> register changes between read and write, we'll clobber the new bits
> losing events.  As that version tends to lose events anyway, and as the
> better than complicating the interface. If someone cares enough, the fix
> can go on the stable branch.

Seems like it'd be easy to have seabios probe this on boot, ex. write
<value> to 0xae00, read 0xae00 to see if value is reflected.  If it is,
old qemu, modify dsdt to avoid writes.

> B. old bios running on new qemu
> Down bits will get cleared on eject because of 3 above.
> Up bits only get cleared on reset.
> This will cause unnecessary notifications and bus rescans,
> but they are harmless.

Doesn't windows poll these bits.  How can we say that introducing a bus
rescan every poll interval is harmless?

> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
> 
> Changes from v2:
> - Use existing register instead of adding a new one,
>   and update documentation
> Changes from v1:
> - tweaked a comment about clearing down register on reset
> - documentation update
> 
>  docs/specs/acpi_pci_hotplug.txt |   55 +++++++++++++++++++++++++++++++-----
>  hw/acpi_piix4.c                 |   58 ++++++++++++++++++++++++++++-----------
>  2 files changed, 89 insertions(+), 24 deletions(-)
> 
> diff --git a/docs/specs/acpi_pci_hotplug.txt b/docs/specs/acpi_pci_hotplug.txt
> index f0f74a7..e4a9556 100644
> --- a/docs/specs/acpi_pci_hotplug.txt
> +++ b/docs/specs/acpi_pci_hotplug.txt
> @@ -7,31 +7,70 @@ describes the interface between QEMU and the ACPI BIOS.
>  ACPI GPE block (IO ports 0xafe0-0xafe3, byte access):
>  -----------------------------------------
>  
> -Generic ACPI GPE block. Bit 1 (GPE.1) used to notify PCI hotplug/eject
> +Generic ACPI GPE block. Bit 1 (GPE.1) is used to notify PCI hotplug/hotunplug
>  event to ACPI BIOS, via SCI interrupt.
>  
> -PCI slot injection notification pending (IO port 0xae00-0xae03, 4-byte access):
> +Semantics: this event occurs each time a bit in either Pending PCI slot
> +injection notification or Pending PCI slot removal notification registers
> +changes status from 0 to 1.

Or if your bios doesn't clear it, 1 to 1.

> +
> +Pending PCI slot injection notification (IO port 0xae00-0xae03, 4-byte access):
>  ---------------------------------------------------------------
> -Slot injection notification pending. One bit per slot.
> +Slot injection notification is pending. One bit per slot.
> +When written by Guest, this register implements write one to clear behaviour.

So we're writing the mask of the bits (slots) to be cleared?  Can
multiple slots be cleared in one write?

>  Read by ACPI BIOS GPE.1 handler to notify OS of injection
>  events.
> +Written to by ACPI BIOS GPE.1 handler after reading
> +and before notifying OS of injection events.
> +
> +Semantics: after a slot is populated by hotplug, it is immediately enabled
> +and a bit in this register is set.
> +Writes into this register are used to acknowledge the injection notification
> +and do not not affect the slot status.
> +Reset value: 0
>  
> -PCI slot removal notification (IO port 0xae04-0xae07, 4-byte access):
> +Pending PCI slot removal notification (IO port 0xae04-0xae07, 4-byte access):
>  -----------------------------------------------------
> -Slot removal notification pending. One bit per slot.
> +Slot removal notification is pending. One bit per slot.
> +When written by Guest, this register implements write one to clear behaviour.

Same clarification as above.
 
>  Read by ACPI BIOS GPE.1 handler to notify OS of removal
>  events.
>  
> +Written to by ACPI BIOS GPE.1 handler after reading
> +and before notifying OS of removal events.
> +
> +Semantics: upon hotunplug request, a bit in this register is set
> +and the OSPM is notified about hotunplug request, but a slot remains enabled
> +until eject.
> +Writes into this register are used to acknowledge the hotunplug request,
> +and do not not affect the slot status.
> +Reset value: 0
> +
>  PCI device eject (IO port 0xae08-0xae0b, 4-byte access):
>  ----------------------------------------
>  
>  Used by ACPI BIOS _EJ0 method to request device removal. One bit per slot.
> -Reads return 0.
> +Guests should not read this register. In practice reads return 0.

Isn't mentioning the read value creating a defacto standard?  If we're
not defining we should say reads are undefined.

> +guests should write values with exactly one bit set, selecting
> +the slot to eject.
> +In practice all bits except the lowest bit that is set are discarded.  Also in

practice = standard, define it.

> +practice writing the value 0 into this register has no effect.

ditto.

> +Semantics: selected hotunpluggable slot is disabled and powered off.
> +Has no effect on non-hotunpluggable slots.
> +
> +For compatibility with old BIOS, writes to this register
> +clear bits in pending slot injection notification register.
>  
>  PCI removability status (IO port 0xae0c-0xae0f, 4-byte access):
>  -----------------------------------------------
>  
> -Used by ACPI BIOS _RMV method to indicate removability status to OS. One
> -bit per slot.
> +Used by ACPI BIOS _RMV method to detect removability status
> +and report to OS. One bit per slot.
> +Guests should not write this register, in practice the value
> +written is discarded.

We should define writes as not supported.  "Should not, but ok if you
do" is not a specification.

> +Semantics: selected slots support hotplug. Contents of this register
> +do not change while guest is running.
> diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
> index 797ed24..389db16 100644
> --- a/hw/acpi_piix4.c
> +++ b/hw/acpi_piix4.c
> @@ -287,6 +287,28 @@ static void piix4_update_hotplug(PIIX4PMState *s)
>      }
>  }
>  
> +static void acpi_piix_eject_slot(PIIX4PMState *s, unsigned slots)
> +{
> +    DeviceState *qdev, *next;
> +    BusState *bus = qdev_get_parent_bus(&s->dev.qdev);
> +    int slot = ffs(slots) - 1;
> +
> +    /*
> +     * Clear DOWN register - this is good for a case where guest can't
> +     * write to CLR_DOWN because of VM reset, and for compatibility with
> +     * old guests which do not have CLR_DOWN.
> +     */
> +    s->pci0_status.down &= ~(1U << slot);

Should we clear "up" here too?

> +
> +    QTAILQ_FOREACH_SAFE(qdev, &bus->children, sibling, next) {
> +        PCIDevice *dev = PCI_DEVICE(qdev);
> +        PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(dev);
> +        if (PCI_SLOT(dev->devfn) == slot && !pc->no_hotplug) {
> +            qdev_free(qdev);
> +        }
> +    }
> +}
> +
>  static void piix4_reset(void *opaque)
>  {
>      PIIX4PMState *s = opaque;
> @@ -302,6 +324,19 @@ static void piix4_reset(void *opaque)
>          pci_conf[0x5B] = 0x02;
>      }
>      piix4_update_hotplug(s);
> +    /*
> +     * Guest lost remove events if any.
> +     * As it's being reset it's safe to remove the device now.
> +     */
> +    while (s->pci0_status.down) {
> +        acpi_piix_eject_slot(s, s->pci0_status.down);
> +    }
> +    /*
> +     * Guest lost add events if any.
> +     * As it's being reset and will rescan the bus we can discard
> +     * past events now.
> +     */
> +    s->pci0_status.up = 0;
>  }
>  
>  static void piix4_powerdown(void *opaque, int irq, int power_failing)
> @@ -472,10 +507,10 @@ static void pcihotplug_write(void *opaque, uint32_t addr, uint32_t val)
>      struct pci_status *g = opaque;
>      switch (addr) {
>          case PCI_BASE:
> -            g->up = val;
> +            g->up &= ~val;
>              break;
>          case PCI_BASE + 4:
> -            g->down = val;
> +            g->down &= ~val;
>              break;
>     }

Ok, so it is a bit(slot) mask and can clear multiple bits.

> @@ -490,19 +525,12 @@ static uint32_t pciej_read(void *opaque, uint32_t addr)
>  
>  static void pciej_write(void *opaque, uint32_t addr, uint32_t val)
>  {
> -    BusState *bus = opaque;
> -    DeviceState *qdev, *next;
> -    int slot = ffs(val) - 1;
> +    PIIX4PMState *s = opaque;
>  
> -    QTAILQ_FOREACH_SAFE(qdev, &bus->children, sibling, next) {
> -        PCIDevice *dev = PCI_DEVICE(qdev);
> -        PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(dev);
> -        if (PCI_SLOT(dev->devfn) == slot && !pc->no_hotplug) {
> -            qdev_free(qdev);
> -        }
> +    if (val) {
> +        acpi_piix_eject_slot(s, val);
>      }
>  
> -
>      PIIX4_DPRINTF("pciej write %x <== %d\n", addr, val);
>  }
>  
> @@ -532,8 +560,8 @@ static void piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s)
>      register_ioport_write(PCI_BASE, 8, 4, pcihotplug_write, pci0_status);
>      register_ioport_read(PCI_BASE, 8, 4,  pcihotplug_read, pci0_status);
>  
> -    register_ioport_write(PCI_EJ_BASE, 4, 4, pciej_write, bus);
> -    register_ioport_read(PCI_EJ_BASE, 4, 4,  pciej_read, bus);
> +    register_ioport_write(PCI_EJ_BASE, 4, 4, pciej_write, s);
> +    register_ioport_read(PCI_EJ_BASE, 4, 4,  pciej_read, s);
>  
>      register_ioport_write(PCI_RMV_BASE, 4, 4, pcirmv_write, s);
>      register_ioport_read(PCI_RMV_BASE, 4, 4,  pcirmv_read, s);
> @@ -567,8 +595,6 @@ static int piix4_device_hotplug(DeviceState *qdev, PCIDevice *dev,
>          return 0;
>      }
>  
> -    s->pci0_status.up = 0;
> -    s->pci0_status.down = 0;
>      if (state == PCI_HOTPLUG_ENABLED) {
>          enable_device(s, slot);
>      } else {

So if we have an old bios and do an add, followed by a remove, guest
ACPI finds both "up" and "down" set for the slot and we rely on the
ordering of the AML checking up before down to keep the duct tape and
bailing wire from exploding?  Hmm, can't say I'm excited about this hack
either.  Thanks,

Alex
Michael S. Tsirkin April 2, 2012, 7:20 p.m. UTC | #2
On Mon, Apr 02, 2012 at 12:59:54PM -0600, Alex Williamson wrote:
> On Mon, 2012-04-02 at 13:03 +0300, Michael S. Tsirkin wrote:
> > piix acpi interface suffers from the following 2 issues:
> > 
> > 1.
> > - delete device a
> > - quickly add device b in another slot
> > 
> > if we do this before guest reads the down register,
> > the down event is discarded and device will never
> > be deleted.
> > 
> > 2.
> > - delete device a
> > - quickly reset before guest can respond
> > 
> > interrupt is reset and guest will never eject the device.
> > 
> > To fix this, we implement three changes:
> > 
> > 1. Do not clear up/down registers on each hotplug event.
> > this ensures we don't lose events.
> > 
> > 2. Make existing up/down registers write 1 to clear;
> > bios will now write back to these the value it read.
> > 
> > 3. on reset, remove all devices which have DOWN bit set
> > 
> > For compatibility with old guests, we also clear
> > the DOWN bit on write to EJ0 for a device.
> > 
> > This patch also extends the documentation for the
> > ACPI interface, to make the semantics explicit.
> > 
> > Compatibility notes: migrating guests across qemu versions
> > can cause bios from one qemu realease running on another qemu
> > release. The following documents the behaviour in this case:
> > 
> > A. new bios running on old qemu
> > A bios implementing the change mentioned above will if running on an old
> > qemu introduce a race: the registers were writeable there, so if a
> > register changes between read and write, we'll clobber the new bits
> > losing events.  As that version tends to lose events anyway, and as the
> > better than complicating the interface. If someone cares enough, the fix
> > can go on the stable branch.
> 
> Seems like it'd be easy to have seabios probe this on boot, ex. write
> <value> to 0xae00, read 0xae00 to see if value is reflected.  If it is,
> old qemu, modify dsdt to avoid writes.

This won't help at all: the issue is when we migrate to
old qemu after boot.

> > B. old bios running on new qemu
> > Down bits will get cleared on eject because of 3 above.
> > Up bits only get cleared on reset.
> > This will cause unnecessary notifications and bus rescans,
> > but they are harmless.
> 
> Doesn't windows poll these bits.

No.

>  How can we say that introducing a bus
> rescan every poll interval is harmless?

It's not done every poll interval - all this does
is cause extra rescans when system interrupt is asserted.

> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> > 
> > Changes from v2:
> > - Use existing register instead of adding a new one,
> >   and update documentation
> > Changes from v1:
> > - tweaked a comment about clearing down register on reset
> > - documentation update
> > 
> >  docs/specs/acpi_pci_hotplug.txt |   55 +++++++++++++++++++++++++++++++-----
> >  hw/acpi_piix4.c                 |   58 ++++++++++++++++++++++++++++-----------
> >  2 files changed, 89 insertions(+), 24 deletions(-)
> > 
> > diff --git a/docs/specs/acpi_pci_hotplug.txt b/docs/specs/acpi_pci_hotplug.txt
> > index f0f74a7..e4a9556 100644
> > --- a/docs/specs/acpi_pci_hotplug.txt
> > +++ b/docs/specs/acpi_pci_hotplug.txt
> > @@ -7,31 +7,70 @@ describes the interface between QEMU and the ACPI BIOS.
> >  ACPI GPE block (IO ports 0xafe0-0xafe3, byte access):
> >  -----------------------------------------
> >  
> > -Generic ACPI GPE block. Bit 1 (GPE.1) used to notify PCI hotplug/eject
> > +Generic ACPI GPE block. Bit 1 (GPE.1) is used to notify PCI hotplug/hotunplug
> >  event to ACPI BIOS, via SCI interrupt.
> >  
> > -PCI slot injection notification pending (IO port 0xae00-0xae03, 4-byte access):
> > +Semantics: this event occurs each time a bit in either Pending PCI slot
> > +injection notification or Pending PCI slot removal notification registers
> > +changes status from 0 to 1.
> 
> Or if your bios doesn't clear it, 1 to 1.
> 
> > +
> > +Pending PCI slot injection notification (IO port 0xae00-0xae03, 4-byte access):
> >  ---------------------------------------------------------------
> > -Slot injection notification pending. One bit per slot.
> > +Slot injection notification is pending. One bit per slot.
> > +When written by Guest, this register implements write one to clear behaviour.
> 
> So we're writing the mask of the bits (slots) to be cleared?  Can
> multiple slots be cleared in one write?
> 
> >  Read by ACPI BIOS GPE.1 handler to notify OS of injection
> >  events.
> > +Written to by ACPI BIOS GPE.1 handler after reading
> > +and before notifying OS of injection events.
> > +
> > +Semantics: after a slot is populated by hotplug, it is immediately enabled
> > +and a bit in this register is set.
> > +Writes into this register are used to acknowledge the injection notification
> > +and do not not affect the slot status.
> > +Reset value: 0
> >  
> > -PCI slot removal notification (IO port 0xae04-0xae07, 4-byte access):
> > +Pending PCI slot removal notification (IO port 0xae04-0xae07, 4-byte access):
> >  -----------------------------------------------------
> > -Slot removal notification pending. One bit per slot.
> > +Slot removal notification is pending. One bit per slot.
> > +When written by Guest, this register implements write one to clear behaviour.
> 
> Same clarification as above.
>  
> >  Read by ACPI BIOS GPE.1 handler to notify OS of removal
> >  events.
> >  
> > +Written to by ACPI BIOS GPE.1 handler after reading
> > +and before notifying OS of removal events.
> > +
> > +Semantics: upon hotunplug request, a bit in this register is set
> > +and the OSPM is notified about hotunplug request, but a slot remains enabled
> > +until eject.
> > +Writes into this register are used to acknowledge the hotunplug request,
> > +and do not not affect the slot status.
> > +Reset value: 0
> > +
> >  PCI device eject (IO port 0xae08-0xae0b, 4-byte access):
> >  ----------------------------------------
> >  
> >  Used by ACPI BIOS _EJ0 method to request device removal. One bit per slot.
> > -Reads return 0.
> > +Guests should not read this register. In practice reads return 0.
> 
> Isn't mentioning the read value creating a defacto standard?  If we're
> not defining we should say reads are undefined.
> 
> > +guests should write values with exactly one bit set, selecting
> > +the slot to eject.
> > +In practice all bits except the lowest bit that is set are discarded.  Also in
> 
> practice = standard, define it.
> 
> > +practice writing the value 0 into this register has no effect.
> 
> ditto.
> 
> > +Semantics: selected hotunpluggable slot is disabled and powered off.
> > +Has no effect on non-hotunpluggable slots.
> > +
> > +For compatibility with old BIOS, writes to this register
> > +clear bits in pending slot injection notification register.
> >  
> >  PCI removability status (IO port 0xae0c-0xae0f, 4-byte access):
> >  -----------------------------------------------
> >  
> > -Used by ACPI BIOS _RMV method to indicate removability status to OS. One
> > -bit per slot.
> > +Used by ACPI BIOS _RMV method to detect removability status
> > +and report to OS. One bit per slot.
> > +Guests should not write this register, in practice the value
> > +written is discarded.
> 
> We should define writes as not supported.  "Should not, but ok if you
> do" is not a specification.
> 
> > +Semantics: selected slots support hotplug. Contents of this register
> > +do not change while guest is running.
> > diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
> > index 797ed24..389db16 100644
> > --- a/hw/acpi_piix4.c
> > +++ b/hw/acpi_piix4.c
> > @@ -287,6 +287,28 @@ static void piix4_update_hotplug(PIIX4PMState *s)
> >      }
> >  }
> >  
> > +static void acpi_piix_eject_slot(PIIX4PMState *s, unsigned slots)
> > +{
> > +    DeviceState *qdev, *next;
> > +    BusState *bus = qdev_get_parent_bus(&s->dev.qdev);
> > +    int slot = ffs(slots) - 1;
> > +
> > +    /*
> > +     * Clear DOWN register - this is good for a case where guest can't
> > +     * write to CLR_DOWN because of VM reset, and for compatibility with
> > +     * old guests which do not have CLR_DOWN.
> > +     */
> > +    s->pci0_status.down &= ~(1U << slot);
> 
> Should we clear "up" here too?
> 
> > +
> > +    QTAILQ_FOREACH_SAFE(qdev, &bus->children, sibling, next) {
> > +        PCIDevice *dev = PCI_DEVICE(qdev);
> > +        PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(dev);
> > +        if (PCI_SLOT(dev->devfn) == slot && !pc->no_hotplug) {
> > +            qdev_free(qdev);
> > +        }
> > +    }
> > +}
> > +
> >  static void piix4_reset(void *opaque)
> >  {
> >      PIIX4PMState *s = opaque;
> > @@ -302,6 +324,19 @@ static void piix4_reset(void *opaque)
> >          pci_conf[0x5B] = 0x02;
> >      }
> >      piix4_update_hotplug(s);
> > +    /*
> > +     * Guest lost remove events if any.
> > +     * As it's being reset it's safe to remove the device now.
> > +     */
> > +    while (s->pci0_status.down) {
> > +        acpi_piix_eject_slot(s, s->pci0_status.down);
> > +    }
> > +    /*
> > +     * Guest lost add events if any.
> > +     * As it's being reset and will rescan the bus we can discard
> > +     * past events now.
> > +     */
> > +    s->pci0_status.up = 0;
> >  }
> >  
> >  static void piix4_powerdown(void *opaque, int irq, int power_failing)
> > @@ -472,10 +507,10 @@ static void pcihotplug_write(void *opaque, uint32_t addr, uint32_t val)
> >      struct pci_status *g = opaque;
> >      switch (addr) {
> >          case PCI_BASE:
> > -            g->up = val;
> > +            g->up &= ~val;
> >              break;
> >          case PCI_BASE + 4:
> > -            g->down = val;
> > +            g->down &= ~val;
> >              break;
> >     }
> 
> Ok, so it is a bit(slot) mask and can clear multiple bits.
> 
> > @@ -490,19 +525,12 @@ static uint32_t pciej_read(void *opaque, uint32_t addr)
> >  
> >  static void pciej_write(void *opaque, uint32_t addr, uint32_t val)
> >  {
> > -    BusState *bus = opaque;
> > -    DeviceState *qdev, *next;
> > -    int slot = ffs(val) - 1;
> > +    PIIX4PMState *s = opaque;
> >  
> > -    QTAILQ_FOREACH_SAFE(qdev, &bus->children, sibling, next) {
> > -        PCIDevice *dev = PCI_DEVICE(qdev);
> > -        PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(dev);
> > -        if (PCI_SLOT(dev->devfn) == slot && !pc->no_hotplug) {
> > -            qdev_free(qdev);
> > -        }
> > +    if (val) {
> > +        acpi_piix_eject_slot(s, val);
> >      }
> >  
> > -
> >      PIIX4_DPRINTF("pciej write %x <== %d\n", addr, val);
> >  }
> >  
> > @@ -532,8 +560,8 @@ static void piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s)
> >      register_ioport_write(PCI_BASE, 8, 4, pcihotplug_write, pci0_status);
> >      register_ioport_read(PCI_BASE, 8, 4,  pcihotplug_read, pci0_status);
> >  
> > -    register_ioport_write(PCI_EJ_BASE, 4, 4, pciej_write, bus);
> > -    register_ioport_read(PCI_EJ_BASE, 4, 4,  pciej_read, bus);
> > +    register_ioport_write(PCI_EJ_BASE, 4, 4, pciej_write, s);
> > +    register_ioport_read(PCI_EJ_BASE, 4, 4,  pciej_read, s);
> >  
> >      register_ioport_write(PCI_RMV_BASE, 4, 4, pcirmv_write, s);
> >      register_ioport_read(PCI_RMV_BASE, 4, 4,  pcirmv_read, s);
> > @@ -567,8 +595,6 @@ static int piix4_device_hotplug(DeviceState *qdev, PCIDevice *dev,
> >          return 0;
> >      }
> >  
> > -    s->pci0_status.up = 0;
> > -    s->pci0_status.down = 0;
> >      if (state == PCI_HOTPLUG_ENABLED) {
> >          enable_device(s, slot);
> >      } else {
> 
> So if we have an old bios and do an add, followed by a remove, guest
> ACPI finds both "up" and "down" set for the slot and we rely on the
> ordering of the AML checking up before down to keep the duct tape and
> bailing wire from exploding?  Hmm, can't say I'm excited about this hack
> either.  Thanks,
> 
> Alex
Igor Mammedov April 3, 2012, 6:14 a.m. UTC | #3
----- Original Message -----
> From: "Michael S. Tsirkin" <mst@redhat.com>
> To: "Alex Williamson" <alex.williamson@redhat.com>
> Cc: qemu-devel@nongnu.org, "Anthony Liguori" <aliguori@us.ibm.com>, "Gerd Hoffmann" <kraxel@redhat.com>, "Isaku
> Yamahata" <yamahata@valinux.co.jp>, "Aurelien Jarno" <aurelien@aurel32.net>, "Paolo Bonzini" <pbonzini@redhat.com>,
> gleb@redhat.com, imammedo@redhat.com
> Sent: Monday, April 2, 2012 9:20:32 PM
> Subject: Re: [PATCHv3] piix: fix up/down races
> 
> On Mon, Apr 02, 2012 at 12:59:54PM -0600, Alex Williamson wrote:
> > On Mon, 2012-04-02 at 13:03 +0300, Michael S. Tsirkin wrote:
> > > piix acpi interface suffers from the following 2 issues:
> > > 
> > > 1.
> > > - delete device a
> > > - quickly add device b in another slot
> > > 
> > > if we do this before guest reads the down register,
> > > the down event is discarded and device will never
> > > be deleted.
> > > 
> > > 2.
> > > - delete device a
> > > - quickly reset before guest can respond
> > > 
> > > interrupt is reset and guest will never eject the device.
> > > 
> > > To fix this, we implement three changes:
> > > 
> > > 1. Do not clear up/down registers on each hotplug event.
> > > this ensures we don't lose events.
> > > 
> > > 2. Make existing up/down registers write 1 to clear;
> > > bios will now write back to these the value it read.
> > > 
> > > 3. on reset, remove all devices which have DOWN bit set
> > > 
> > > For compatibility with old guests, we also clear
> > > the DOWN bit on write to EJ0 for a device.
> > > 
> > > This patch also extends the documentation for the
> > > ACPI interface, to make the semantics explicit.
> > > 
> > > Compatibility notes: migrating guests across qemu versions
> > > can cause bios from one qemu realease running on another qemu
> > > release. The following documents the behaviour in this case:
> > > 
> > > A. new bios running on old qemu
> > > A bios implementing the change mentioned above will if running on
> > > an old
> > > qemu introduce a race: the registers were writeable there, so if
> > > a
> > > register changes between read and write, we'll clobber the new
> > > bits
> > > losing events.  As that version tends to lose events anyway, and
> > > as the
> > > better than complicating the interface. If someone cares enough,
> > > the fix
> > > can go on the stable branch.
> > 
> > Seems like it'd be easy to have seabios probe this on boot, ex.
> > write
> > <value> to 0xae00, read 0xae00 to see if value is reflected.  If it
> > is,
> > old qemu, modify dsdt to avoid writes.
> 
> This won't help at all: the issue is when we migrate to
> old qemu after boot.
> 
> > > B. old bios running on new qemu
> > > Down bits will get cleared on eject because of 3 above.
> > > Up bits only get cleared on reset.
> > > This will cause unnecessary notifications and bus rescans,
> > > but they are harmless.
> > 
> > Doesn't windows poll these bits.
> 
> No.
> 
> >  How can we say that introducing a bus
> > rescan every poll interval is harmless?
> 
> It's not done every poll interval - all this does
> is cause extra rescans when system interrupt is asserted.
it will do rescan only when PIIX4_PCI_HOTPLUG_STATUS in 
gpe.sts is set.
  
> 
> > > 
> > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > > ---
> > > 
> > > Changes from v2:
> > > - Use existing register instead of adding a new one,
> > >   and update documentation
> > > Changes from v1:
> > > - tweaked a comment about clearing down register on reset
> > > - documentation update
> > > 
> > >  docs/specs/acpi_pci_hotplug.txt |   55
> > >  +++++++++++++++++++++++++++++++-----
> > >  hw/acpi_piix4.c                 |   58
> > >  ++++++++++++++++++++++++++++-----------
> > >  2 files changed, 89 insertions(+), 24 deletions(-)
> > > 
> > > diff --git a/docs/specs/acpi_pci_hotplug.txt
> > > b/docs/specs/acpi_pci_hotplug.txt
> > > index f0f74a7..e4a9556 100644
> > > --- a/docs/specs/acpi_pci_hotplug.txt
> > > +++ b/docs/specs/acpi_pci_hotplug.txt
> > > @@ -7,31 +7,70 @@ describes the interface between QEMU and the
> > > ACPI BIOS.
> > >  ACPI GPE block (IO ports 0xafe0-0xafe3, byte access):
> > >  -----------------------------------------
> > >  
> > > -Generic ACPI GPE block. Bit 1 (GPE.1) used to notify PCI
> > > hotplug/eject
> > > +Generic ACPI GPE block. Bit 1 (GPE.1) is used to notify PCI
> > > hotplug/hotunplug
> > >  event to ACPI BIOS, via SCI interrupt.
> > >  
> > > -PCI slot injection notification pending (IO port 0xae00-0xae03,
> > > 4-byte access):
> > > +Semantics: this event occurs each time a bit in either Pending
> > > PCI slot
> > > +injection notification or Pending PCI slot removal notification
> > > registers
> > > +changes status from 0 to 1.
> > 
> > Or if your bios doesn't clear it, 1 to 1.
> > 
> > > +
> > > +Pending PCI slot injection notification (IO port 0xae00-0xae03,
> > > 4-byte access):
> > >  ---------------------------------------------------------------
> > > -Slot injection notification pending. One bit per slot.
> > > +Slot injection notification is pending. One bit per slot.
> > > +When written by Guest, this register implements write one to
> > > clear behaviour.
> > 
> > So we're writing the mask of the bits (slots) to be cleared?  Can
> > multiple slots be cleared in one write?
> > 
> > >  Read by ACPI BIOS GPE.1 handler to notify OS of injection
> > >  events.
> > > +Written to by ACPI BIOS GPE.1 handler after reading
> > > +and before notifying OS of injection events.
> > > +
> > > +Semantics: after a slot is populated by hotplug, it is
> > > immediately enabled
> > > +and a bit in this register is set.
> > > +Writes into this register are used to acknowledge the injection
> > > notification
> > > +and do not not affect the slot status.
> > > +Reset value: 0
> > >  
> > > -PCI slot removal notification (IO port 0xae04-0xae07, 4-byte
> > > access):
> > > +Pending PCI slot removal notification (IO port 0xae04-0xae07,
> > > 4-byte access):
> > >  -----------------------------------------------------
> > > -Slot removal notification pending. One bit per slot.
> > > +Slot removal notification is pending. One bit per slot.
> > > +When written by Guest, this register implements write one to
> > > clear behaviour.
> > 
> > Same clarification as above.
> >  
> > >  Read by ACPI BIOS GPE.1 handler to notify OS of removal
> > >  events.
> > >  
> > > +Written to by ACPI BIOS GPE.1 handler after reading
> > > +and before notifying OS of removal events.
> > > +
> > > +Semantics: upon hotunplug request, a bit in this register is set
> > > +and the OSPM is notified about hotunplug request, but a slot
> > > remains enabled
> > > +until eject.
> > > +Writes into this register are used to acknowledge the hotunplug
> > > request,
> > > +and do not not affect the slot status.
> > > +Reset value: 0
> > > +
> > >  PCI device eject (IO port 0xae08-0xae0b, 4-byte access):
> > >  ----------------------------------------
> > >  
> > >  Used by ACPI BIOS _EJ0 method to request device removal. One bit
> > >  per slot.
> > > -Reads return 0.
> > > +Guests should not read this register. In practice reads return
> > > 0.
> > 
> > Isn't mentioning the read value creating a defacto standard?  If
> > we're
> > not defining we should say reads are undefined.
> > 
> > > +guests should write values with exactly one bit set, selecting
> > > +the slot to eject.
> > > +In practice all bits except the lowest bit that is set are
> > > discarded.  Also in
> > 
> > practice = standard, define it.
> > 
> > > +practice writing the value 0 into this register has no effect.
> > 
> > ditto.
> > 
> > > +Semantics: selected hotunpluggable slot is disabled and powered
> > > off.
> > > +Has no effect on non-hotunpluggable slots.
> > > +
> > > +For compatibility with old BIOS, writes to this register
> > > +clear bits in pending slot injection notification register.
> > >  
> > >  PCI removability status (IO port 0xae0c-0xae0f, 4-byte access):
> > >  -----------------------------------------------
> > >  
> > > -Used by ACPI BIOS _RMV method to indicate removability status to
> > > OS. One
> > > -bit per slot.
> > > +Used by ACPI BIOS _RMV method to detect removability status
> > > +and report to OS. One bit per slot.
> > > +Guests should not write this register, in practice the value
> > > +written is discarded.
> > 
> > We should define writes as not supported.  "Should not, but ok if
> > you
> > do" is not a specification.
> > 
> > > +Semantics: selected slots support hotplug. Contents of this
> > > register
> > > +do not change while guest is running.
> > > diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
> > > index 797ed24..389db16 100644
> > > --- a/hw/acpi_piix4.c
> > > +++ b/hw/acpi_piix4.c
> > > @@ -287,6 +287,28 @@ static void
> > > piix4_update_hotplug(PIIX4PMState *s)
> > >      }
> > >  }
> > >  
> > > +static void acpi_piix_eject_slot(PIIX4PMState *s, unsigned
> > > slots)
> > > +{
> > > +    DeviceState *qdev, *next;
> > > +    BusState *bus = qdev_get_parent_bus(&s->dev.qdev);
> > > +    int slot = ffs(slots) - 1;
> > > +
> > > +    /*
> > > +     * Clear DOWN register - this is good for a case where guest
> > > can't
> > > +     * write to CLR_DOWN because of VM reset, and for
> > > compatibility with
> > > +     * old guests which do not have CLR_DOWN.
> > > +     */
> > > +    s->pci0_status.down &= ~(1U << slot);
> > 
> > Should we clear "up" here too?
> > 
> > > +
> > > +    QTAILQ_FOREACH_SAFE(qdev, &bus->children, sibling, next) {
> > > +        PCIDevice *dev = PCI_DEVICE(qdev);
> > > +        PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(dev);
> > > +        if (PCI_SLOT(dev->devfn) == slot && !pc->no_hotplug) {
> > > +            qdev_free(qdev);
> > > +        }
> > > +    }
> > > +}
> > > +
> > >  static void piix4_reset(void *opaque)
> > >  {
> > >      PIIX4PMState *s = opaque;
> > > @@ -302,6 +324,19 @@ static void piix4_reset(void *opaque)
> > >          pci_conf[0x5B] = 0x02;
> > >      }
> > >      piix4_update_hotplug(s);
> > > +    /*
> > > +     * Guest lost remove events if any.
> > > +     * As it's being reset it's safe to remove the device now.
> > > +     */
> > > +    while (s->pci0_status.down) {
> > > +        acpi_piix_eject_slot(s, s->pci0_status.down);
> > > +    }
> > > +    /*
> > > +     * Guest lost add events if any.
> > > +     * As it's being reset and will rescan the bus we can
> > > discard
> > > +     * past events now.
> > > +     */
> > > +    s->pci0_status.up = 0;
> > >  }
> > >  
> > >  static void piix4_powerdown(void *opaque, int irq, int
> > >  power_failing)
> > > @@ -472,10 +507,10 @@ static void pcihotplug_write(void *opaque,
> > > uint32_t addr, uint32_t val)
> > >      struct pci_status *g = opaque;
> > >      switch (addr) {
> > >          case PCI_BASE:
> > > -            g->up = val;
> > > +            g->up &= ~val;
> > >              break;
> > >          case PCI_BASE + 4:
> > > -            g->down = val;
> > > +            g->down &= ~val;
> > >              break;
> > >     }
> > 
> > Ok, so it is a bit(slot) mask and can clear multiple bits.
> > 
> > > @@ -490,19 +525,12 @@ static uint32_t pciej_read(void *opaque,
> > > uint32_t addr)
> > >  
> > >  static void pciej_write(void *opaque, uint32_t addr, uint32_t
> > >  val)
> > >  {
> > > -    BusState *bus = opaque;
> > > -    DeviceState *qdev, *next;
> > > -    int slot = ffs(val) - 1;
> > > +    PIIX4PMState *s = opaque;
> > >  
> > > -    QTAILQ_FOREACH_SAFE(qdev, &bus->children, sibling, next) {
> > > -        PCIDevice *dev = PCI_DEVICE(qdev);
> > > -        PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(dev);
> > > -        if (PCI_SLOT(dev->devfn) == slot && !pc->no_hotplug) {
> > > -            qdev_free(qdev);
> > > -        }
> > > +    if (val) {
> > > +        acpi_piix_eject_slot(s, val);
> > >      }
> > >  
> > > -
> > >      PIIX4_DPRINTF("pciej write %x <== %d\n", addr, val);
> > >  }
> > >  
> > > @@ -532,8 +560,8 @@ static void
> > > piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s)
> > >      register_ioport_write(PCI_BASE, 8, 4, pcihotplug_write,
> > >      pci0_status);
> > >      register_ioport_read(PCI_BASE, 8, 4,  pcihotplug_read,
> > >      pci0_status);
> > >  
> > > -    register_ioport_write(PCI_EJ_BASE, 4, 4, pciej_write, bus);
> > > -    register_ioport_read(PCI_EJ_BASE, 4, 4,  pciej_read, bus);
> > > +    register_ioport_write(PCI_EJ_BASE, 4, 4, pciej_write, s);
> > > +    register_ioport_read(PCI_EJ_BASE, 4, 4,  pciej_read, s);
> > >  
> > >      register_ioport_write(PCI_RMV_BASE, 4, 4, pcirmv_write, s);
> > >      register_ioport_read(PCI_RMV_BASE, 4, 4,  pcirmv_read, s);
> > > @@ -567,8 +595,6 @@ static int piix4_device_hotplug(DeviceState
> > > *qdev, PCIDevice *dev,
> > >          return 0;
> > >      }
> > >  
> > > -    s->pci0_status.up = 0;
> > > -    s->pci0_status.down = 0;
> > >      if (state == PCI_HOTPLUG_ENABLED) {
> > >          enable_device(s, slot);
> > >      } else {
> > 
> > So if we have an old bios and do an add, followed by a remove,
> > guest
> > ACPI finds both "up" and "down" set for the slot and we rely on the
> > ordering of the AML checking up before down to keep the duct tape
> > and
> > bailing wire from exploding?  Hmm, can't say I'm excited about this
> > hack
> > either.  Thanks,
> > 
> > Alex
>
Igor Mammedov April 3, 2012, 6:40 a.m. UTC | #4
----- Original Message -----
> From: "Alex Williamson" <alex.williamson@redhat.com>
> To: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: qemu-devel@nongnu.org, "Anthony Liguori" <aliguori@us.ibm.com>, "Gerd Hoffmann" <kraxel@redhat.com>, "Isaku
> Yamahata" <yamahata@valinux.co.jp>, "Aurelien Jarno" <aurelien@aurel32.net>, "Paolo Bonzini" <pbonzini@redhat.com>,
> gleb@redhat.com, imammedo@redhat.com
> Sent: Monday, April 2, 2012 8:59:54 PM
> Subject: Re: [PATCHv3] piix: fix up/down races
> 
> On Mon, 2012-04-02 at 13:03 +0300, Michael S. Tsirkin wrote:
> > piix acpi interface suffers from the following 2 issues:
> > 
> > 1.
> > - delete device a
> > - quickly add device b in another slot
> > 
> > if we do this before guest reads the down register,
> > the down event is discarded and device will never
> > be deleted.
> > 
> > 2.
> > - delete device a
> > - quickly reset before guest can respond
> > 
> > interrupt is reset and guest will never eject the device.
> > 
> > To fix this, we implement three changes:
> > 
> > 1. Do not clear up/down registers on each hotplug event.
> > this ensures we don't lose events.
> > 
> > 2. Make existing up/down registers write 1 to clear;
> > bios will now write back to these the value it read.
> > 
> > 3. on reset, remove all devices which have DOWN bit set
> > 
> > For compatibility with old guests, we also clear
> > the DOWN bit on write to EJ0 for a device.
> > 
> > This patch also extends the documentation for the
> > ACPI interface, to make the semantics explicit.
> > 
> > Compatibility notes: migrating guests across qemu versions
> > can cause bios from one qemu realease running on another qemu
> > release. The following documents the behaviour in this case:
> > 
> > A. new bios running on old qemu
> > A bios implementing the change mentioned above will if running on
> > an old
> > qemu introduce a race: the registers were writeable there, so if a
> > register changes between read and write, we'll clobber the new bits
> > losing events.  As that version tends to lose events anyway, and as
> > the
> > better than complicating the interface. If someone cares enough,
> > the fix
> > can go on the stable branch.
> 
> Seems like it'd be easy to have seabios probe this on boot, ex. write
> <value> to 0xae00, read 0xae00 to see if value is reflected.  If it
> is,
> old qemu, modify dsdt to avoid writes.
> 
> > B. old bios running on new qemu
> > Down bits will get cleared on eject because of 3 above.
> > Up bits only get cleared on reset.
> > This will cause unnecessary notifications and bus rescans,
> > but they are harmless.
> 
> Doesn't windows poll these bits.  How can we say that introducing a
> bus
> rescan every poll interval is harmless?
> 
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> > 
> > Changes from v2:
> > - Use existing register instead of adding a new one,
> >   and update documentation
> > Changes from v1:
> > - tweaked a comment about clearing down register on reset
> > - documentation update
> > 
> >  docs/specs/acpi_pci_hotplug.txt |   55
> >  +++++++++++++++++++++++++++++++-----
> >  hw/acpi_piix4.c                 |   58
> >  ++++++++++++++++++++++++++++-----------
> >  2 files changed, 89 insertions(+), 24 deletions(-)
> > 
> > diff --git a/docs/specs/acpi_pci_hotplug.txt
> > b/docs/specs/acpi_pci_hotplug.txt
> > index f0f74a7..e4a9556 100644
> > --- a/docs/specs/acpi_pci_hotplug.txt
> > +++ b/docs/specs/acpi_pci_hotplug.txt
> > @@ -7,31 +7,70 @@ describes the interface between QEMU and the ACPI
> > BIOS.
> >  ACPI GPE block (IO ports 0xafe0-0xafe3, byte access):
> >  -----------------------------------------
> >  
> > -Generic ACPI GPE block. Bit 1 (GPE.1) used to notify PCI
> > hotplug/eject
> > +Generic ACPI GPE block. Bit 1 (GPE.1) is used to notify PCI
> > hotplug/hotunplug
> >  event to ACPI BIOS, via SCI interrupt.
> >  
> > -PCI slot injection notification pending (IO port 0xae00-0xae03,
> > 4-byte access):
> > +Semantics: this event occurs each time a bit in either Pending PCI
> > slot
> > +injection notification or Pending PCI slot removal notification
> > registers
> > +changes status from 0 to 1.
> 
> Or if your bios doesn't clear it, 1 to 1.
> 
> > +
> > +Pending PCI slot injection notification (IO port 0xae00-0xae03,
> > 4-byte access):
> >  ---------------------------------------------------------------
> > -Slot injection notification pending. One bit per slot.
> > +Slot injection notification is pending. One bit per slot.
> > +When written by Guest, this register implements write one to clear
> > behaviour.
> 
> So we're writing the mask of the bits (slots) to be cleared?  Can
> multiple slots be cleared in one write?
Probably here we should refer to gpe.sts reg behaviour described in acpi spec.
and the same for other writeable regs.

> 
> >  Read by ACPI BIOS GPE.1 handler to notify OS of injection
> >  events.
> > +Written to by ACPI BIOS GPE.1 handler after reading
> > +and before notifying OS of injection events.
> > +
> > +Semantics: after a slot is populated by hotplug, it is immediately
> > enabled
> > +and a bit in this register is set.
> > +Writes into this register are used to acknowledge the injection
> > notification
maybe "clear the injection notification events in specified sots" is better?

> > +and do not not affect the slot status. 
> > +Reset value: 0
> >  
> > -PCI slot removal notification (IO port 0xae04-0xae07, 4-byte
> > access):
> > +Pending PCI slot removal notification (IO port 0xae04-0xae07,
> > 4-byte access):
> >  -----------------------------------------------------
> > -Slot removal notification pending. One bit per slot.
> > +Slot removal notification is pending. One bit per slot.
> > +When written by Guest, this register implements write one to clear
> > behaviour.
> 
> Same clarification as above.
>  
> >  Read by ACPI BIOS GPE.1 handler to notify OS of removal
> >  events.
> >  
> > +Written to by ACPI BIOS GPE.1 handler after reading
> > +and before notifying OS of removal events.
> > +
> > +Semantics: upon hotunplug request, a bit in this register is set
> > +and the OSPM is notified about hotunplug request, but a slot
> > remains enabled
> > +until eject.
> > +Writes into this register are used to acknowledge the hotunplug
> > request,
> > +and do not not affect the slot status.
> > +Reset value: 0
> > +
> >  PCI device eject (IO port 0xae08-0xae0b, 4-byte access):
> >  ----------------------------------------
> >  
> >  Used by ACPI BIOS _EJ0 method to request device removal. One bit
> >  per slot.
> > -Reads return 0.
> > +Guests should not read this register. In practice reads return 0.
> 
> Isn't mentioning the read value creating a defacto standard?  If
> we're
> not defining we should say reads are undefined.
> 
> > +guests should write values with exactly one bit set, selecting
> > +the slot to eject.
> > +In practice all bits except the lowest bit that is set are
> > discarded.  Also in
> 
> practice = standard, define it.
> 
> > +practice writing the value 0 into this register has no effect.
> 
> ditto.
> 
> > +Semantics: selected hotunpluggable slot is disabled and powered
> > off.
> > +Has no effect on non-hotunpluggable slots.
> > +
> > +For compatibility with old BIOS, writes to this register
> > +clear bits in pending slot injection notification register.
> >  
> >  PCI removability status (IO port 0xae0c-0xae0f, 4-byte access):
> >  -----------------------------------------------
> >  
> > -Used by ACPI BIOS _RMV method to indicate removability status to
> > OS. One
> > -bit per slot.
> > +Used by ACPI BIOS _RMV method to detect removability status
> > +and report to OS. One bit per slot.
> > +Guests should not write this register, in practice the value
> > +written is discarded.
> 
> We should define writes as not supported.  "Should not, but ok if you
> do" is not a specification.
> 
> > +Semantics: selected slots support hotplug. Contents of this
> > register
> > +do not change while guest is running.
> > diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
> > index 797ed24..389db16 100644
> > --- a/hw/acpi_piix4.c
> > +++ b/hw/acpi_piix4.c
> > @@ -287,6 +287,28 @@ static void piix4_update_hotplug(PIIX4PMState
> > *s)
> >      }
> >  }
> >  
> > +static void acpi_piix_eject_slot(PIIX4PMState *s, unsigned slots)
> > +{
> > +    DeviceState *qdev, *next;
> > +    BusState *bus = qdev_get_parent_bus(&s->dev.qdev);
> > +    int slot = ffs(slots) - 1;
> > +
> > +    /*
> > +     * Clear DOWN register - this is good for a case where guest
> > can't
> > +     * write to CLR_DOWN because of VM reset, and for
> > compatibility with
> > +     * old guests which do not have CLR_DOWN.
fix comment, there is no CLR_DOWN any more

> > +     */
> > +    s->pci0_status.down &= ~(1U << slot);
> 
> Should we clear "up" here too?
Is it possible to create pci dev for not yet freed pci slot?
If not then we should not care, otherwise we should fix that.

> 
> > +
> > +    QTAILQ_FOREACH_SAFE(qdev, &bus->children, sibling, next) {
> > +        PCIDevice *dev = PCI_DEVICE(qdev);
> > +        PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(dev);
> > +        if (PCI_SLOT(dev->devfn) == slot && !pc->no_hotplug) {
> > +            qdev_free(qdev);
> > +        }
> > +    }
> > +}
> > +
> >  static void piix4_reset(void *opaque)
> >  {
> >      PIIX4PMState *s = opaque;
> > @@ -302,6 +324,19 @@ static void piix4_reset(void *opaque)
> >          pci_conf[0x5B] = 0x02;
> >      }
> >      piix4_update_hotplug(s);
> > +    /*
> > +     * Guest lost remove events if any.
> > +     * As it's being reset it's safe to remove the device now.
> > +     */
> > +    while (s->pci0_status.down) {
> > +        acpi_piix_eject_slot(s, s->pci0_status.down);
> > +    }
> > +    /*
> > +     * Guest lost add events if any.
> > +     * As it's being reset and will rescan the bus we can discard
> > +     * past events now.
> > +     */
> > +    s->pci0_status.up = 0;
> >  }
> >  
> >  static void piix4_powerdown(void *opaque, int irq, int
> >  power_failing)
> > @@ -472,10 +507,10 @@ static void pcihotplug_write(void *opaque,
> > uint32_t addr, uint32_t val)
> >      struct pci_status *g = opaque;
> >      switch (addr) {
> >          case PCI_BASE:
> > -            g->up = val;
> > +            g->up &= ~val;
> >              break;
> >          case PCI_BASE + 4:
> > -            g->down = val;
> > +            g->down &= ~val;
> >              break;
> >     }
> 
> Ok, so it is a bit(slot) mask and can clear multiple bits.
> 
> > @@ -490,19 +525,12 @@ static uint32_t pciej_read(void *opaque,
> > uint32_t addr)
> >  
> >  static void pciej_write(void *opaque, uint32_t addr, uint32_t val)
> >  {
> > -    BusState *bus = opaque;
> > -    DeviceState *qdev, *next;
> > -    int slot = ffs(val) - 1;
> > +    PIIX4PMState *s = opaque;
> >  
> > -    QTAILQ_FOREACH_SAFE(qdev, &bus->children, sibling, next) {
> > -        PCIDevice *dev = PCI_DEVICE(qdev);
> > -        PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(dev);
> > -        if (PCI_SLOT(dev->devfn) == slot && !pc->no_hotplug) {
> > -            qdev_free(qdev);
> > -        }
> > +    if (val) {
> > +        acpi_piix_eject_slot(s, val);
> >      }
> >  
> > -
> >      PIIX4_DPRINTF("pciej write %x <== %d\n", addr, val);
> >  }
> >  
> > @@ -532,8 +560,8 @@ static void
> > piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s)
> >      register_ioport_write(PCI_BASE, 8, 4, pcihotplug_write,
> >      pci0_status);
> >      register_ioport_read(PCI_BASE, 8, 4,  pcihotplug_read,
> >      pci0_status);
> >  
> > -    register_ioport_write(PCI_EJ_BASE, 4, 4, pciej_write, bus);
> > -    register_ioport_read(PCI_EJ_BASE, 4, 4,  pciej_read, bus);
> > +    register_ioport_write(PCI_EJ_BASE, 4, 4, pciej_write, s);
> > +    register_ioport_read(PCI_EJ_BASE, 4, 4,  pciej_read, s);
> >  
> >      register_ioport_write(PCI_RMV_BASE, 4, 4, pcirmv_write, s);
> >      register_ioport_read(PCI_RMV_BASE, 4, 4,  pcirmv_read, s);
> > @@ -567,8 +595,6 @@ static int piix4_device_hotplug(DeviceState
> > *qdev, PCIDevice *dev,
> >          return 0;
> >      }
> >  
> > -    s->pci0_status.up = 0;
> > -    s->pci0_status.down = 0;
> >      if (state == PCI_HOTPLUG_ENABLED) {
> >          enable_device(s, slot);
> >      } else {
> 
> So if we have an old bios and do an add, followed by a remove, guest
> ACPI finds both "up" and "down" set for the slot and we rely on the
> ordering of the AML checking up before down to keep the duct tape and
> bailing wire from exploding?  Hmm, can't say I'm excited about this
> hack
> either.  Thanks,
> 
> Alex
> 
>
Michael S. Tsirkin April 3, 2012, 10:05 a.m. UTC | #5
On Tue, Apr 03, 2012 at 02:40:41AM -0400, Igor Mammedov wrote:
> > > +     */
> > > +    s->pci0_status.down &= ~(1U << slot);
> > 
> > Should we clear "up" here too?
> Is it possible to create pci dev for not yet freed pci slot?

It's not possible ATM.

> If not then we should not care, otherwise we should fix that.
> > > @@ -567,8 +595,6 @@ static int piix4_device_hotplug(DeviceState
> > > *qdev, PCIDevice *dev,
> > >          return 0;
> > >      }
> > >  
> > > -    s->pci0_status.up = 0;
> > > -    s->pci0_status.down = 0;
> > >      if (state == PCI_HOTPLUG_ENABLED) {
> > >          enable_device(s, slot);
> > >      } else {
> > 
> > So if we have an old bios and do an add, followed by a remove, guest
> > ACPI finds both "up" and "down" set for the slot and we rely on the
> > ordering of the AML checking up before down to keep the duct tape and
> > bailing wire from exploding?

In fact UP triggers a rescan - there is no injection event in ACPI.
So it seems you can check the events in any order.

> >  Hmm, can't say I'm excited about this
> > hack
> > either.  Thanks,
> > 
> > Alex
> > 
> >
Michael S. Tsirkin April 3, 2012, 12:56 p.m. UTC | #6
On Mon, Apr 02, 2012 at 12:59:54PM -0600, Alex Williamson wrote:
> On Mon, 2012-04-02 at 13:03 +0300, Michael S. Tsirkin wrote:
> > piix acpi interface suffers from the following 2 issues:
> > 
> > 1.
> > - delete device a
> > - quickly add device b in another slot
> > 
> > if we do this before guest reads the down register,
> > the down event is discarded and device will never
> > be deleted.
> > 
> > 2.
> > - delete device a
> > - quickly reset before guest can respond
> > 
> > interrupt is reset and guest will never eject the device.
> > 
> > To fix this, we implement three changes:
> > 
> > 1. Do not clear up/down registers on each hotplug event.
> > this ensures we don't lose events.
> > 
> > 2. Make existing up/down registers write 1 to clear;
> > bios will now write back to these the value it read.
> > 
> > 3. on reset, remove all devices which have DOWN bit set
> > 
> > For compatibility with old guests, we also clear
> > the DOWN bit on write to EJ0 for a device.
> > 
> > This patch also extends the documentation for the
> > ACPI interface, to make the semantics explicit.
> > 
> > Compatibility notes: migrating guests across qemu versions
> > can cause bios from one qemu realease running on another qemu
> > release. The following documents the behaviour in this case:
> > 
> > A. new bios running on old qemu
> > A bios implementing the change mentioned above will if running on an old
> > qemu introduce a race: the registers were writeable there, so if a
> > register changes between read and write, we'll clobber the new bits
> > losing events.  As that version tends to lose events anyway, and as the
> > better than complicating the interface. If someone cares enough, the fix
> > can go on the stable branch.
> 
> Seems like it'd be easy to have seabios probe this on boot, ex. write
> <value> to 0xae00, read 0xae00 to see if value is reflected.  If it is,
> old qemu, modify dsdt to avoid writes.
> 
> > B. old bios running on new qemu
> > Down bits will get cleared on eject because of 3 above.
> > Up bits only get cleared on reset.
> > This will cause unnecessary notifications and bus rescans,
> > but they are harmless.
> 
> Doesn't windows poll these bits.  How can we say that introducing a bus
> rescan every poll interval is harmless?
> 
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> > ---
> > 
> > Changes from v2:
> > - Use existing register instead of adding a new one,
> >   and update documentation
> > Changes from v1:
> > - tweaked a comment about clearing down register on reset
> > - documentation update
> > 
> >  docs/specs/acpi_pci_hotplug.txt |   55 +++++++++++++++++++++++++++++++-----
> >  hw/acpi_piix4.c                 |   58 ++++++++++++++++++++++++++++-----------
> >  2 files changed, 89 insertions(+), 24 deletions(-)
> > 
> > diff --git a/docs/specs/acpi_pci_hotplug.txt b/docs/specs/acpi_pci_hotplug.txt
> > index f0f74a7..e4a9556 100644
> > --- a/docs/specs/acpi_pci_hotplug.txt
> > +++ b/docs/specs/acpi_pci_hotplug.txt
> > @@ -7,31 +7,70 @@ describes the interface between QEMU and the ACPI BIOS.
> >  ACPI GPE block (IO ports 0xafe0-0xafe3, byte access):
> >  -----------------------------------------
> >  
> > -Generic ACPI GPE block. Bit 1 (GPE.1) used to notify PCI hotplug/eject
> > +Generic ACPI GPE block. Bit 1 (GPE.1) is used to notify PCI hotplug/hotunplug
> >  event to ACPI BIOS, via SCI interrupt.
> >  
> > -PCI slot injection notification pending (IO port 0xae00-0xae03, 4-byte access):
> > +Semantics: this event occurs each time a bit in either Pending PCI slot
> > +injection notification or Pending PCI slot removal notification registers
> > +changes status from 0 to 1.
> 
> Or if your bios doesn't clear it, 1 to 1.
> 
> > +
> > +Pending PCI slot injection notification (IO port 0xae00-0xae03, 4-byte access):
> >  ---------------------------------------------------------------
> > -Slot injection notification pending. One bit per slot.
> > +Slot injection notification is pending. One bit per slot.
> > +When written by Guest, this register implements write one to clear behaviour.
> 
> So we're writing the mask of the bits (slots) to be cleared?  Can
> multiple slots be cleared in one write?
> 
> >  Read by ACPI BIOS GPE.1 handler to notify OS of injection
> >  events.
> > +Written to by ACPI BIOS GPE.1 handler after reading
> > +and before notifying OS of injection events.
> > +
> > +Semantics: after a slot is populated by hotplug, it is immediately enabled
> > +and a bit in this register is set.
> > +Writes into this register are used to acknowledge the injection notification
> > +and do not not affect the slot status.
> > +Reset value: 0
> >  
> > -PCI slot removal notification (IO port 0xae04-0xae07, 4-byte access):
> > +Pending PCI slot removal notification (IO port 0xae04-0xae07, 4-byte access):
> >  -----------------------------------------------------
> > -Slot removal notification pending. One bit per slot.
> > +Slot removal notification is pending. One bit per slot.
> > +When written by Guest, this register implements write one to clear behaviour.
> 
> Same clarification as above.
>  
> >  Read by ACPI BIOS GPE.1 handler to notify OS of removal
> >  events.
> >  
> > +Written to by ACPI BIOS GPE.1 handler after reading
> > +and before notifying OS of removal events.
> > +
> > +Semantics: upon hotunplug request, a bit in this register is set
> > +and the OSPM is notified about hotunplug request, but a slot remains enabled
> > +until eject.
> > +Writes into this register are used to acknowledge the hotunplug request,
> > +and do not not affect the slot status.
> > +Reset value: 0
> > +
> >  PCI device eject (IO port 0xae08-0xae0b, 4-byte access):
> >  ----------------------------------------
> >  
> >  Used by ACPI BIOS _EJ0 method to request device removal. One bit per slot.
> > -Reads return 0.
> > +Guests should not read this register. In practice reads return 0.
> 
> Isn't mentioning the read value creating a defacto standard?  If we're
> not defining we should say reads are undefined.
> 
> > +guests should write values with exactly one bit set, selecting
> > +the slot to eject.
> > +In practice all bits except the lowest bit that is set are discarded.  Also in
> 
> practice = standard, define it.
> 
> > +practice writing the value 0 into this register has no effect.
> 
> ditto.
> 
> > +Semantics: selected hotunpluggable slot is disabled and powered off.
> > +Has no effect on non-hotunpluggable slots.
> > +
> > +For compatibility with old BIOS, writes to this register
> > +clear bits in pending slot injection notification register.
> >  
> >  PCI removability status (IO port 0xae0c-0xae0f, 4-byte access):
> >  -----------------------------------------------
> >  
> > -Used by ACPI BIOS _RMV method to indicate removability status to OS. One
> > -bit per slot.
> > +Used by ACPI BIOS _RMV method to detect removability status
> > +and report to OS. One bit per slot.
> > +Guests should not write this register, in practice the value
> > +written is discarded.
> 
> We should define writes as not supported.  "Should not, but ok if you
> do" is not a specification.
> 
> > +Semantics: selected slots support hotplug. Contents of this register
> > +do not change while guest is running.
> > diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
> > index 797ed24..389db16 100644
> > --- a/hw/acpi_piix4.c
> > +++ b/hw/acpi_piix4.c
> > @@ -287,6 +287,28 @@ static void piix4_update_hotplug(PIIX4PMState *s)
> >      }
> >  }
> >  
> > +static void acpi_piix_eject_slot(PIIX4PMState *s, unsigned slots)
> > +{
> > +    DeviceState *qdev, *next;
> > +    BusState *bus = qdev_get_parent_bus(&s->dev.qdev);
> > +    int slot = ffs(slots) - 1;
> > +
> > +    /*
> > +     * Clear DOWN register - this is good for a case where guest can't
> > +     * write to CLR_DOWN because of VM reset, and for compatibility with
> > +     * old guests which do not have CLR_DOWN.
> > +     */
> > +    s->pci0_status.down &= ~(1U << slot);
> 
> Should we clear "up" here too?

On eject clearing "up" is probably OK. However I'm not sure
what purpose would it serve, and it seems the less hacks we do the
better.

> > +
> > +    QTAILQ_FOREACH_SAFE(qdev, &bus->children, sibling, next) {
> > +        PCIDevice *dev = PCI_DEVICE(qdev);
> > +        PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(dev);
> > +        if (PCI_SLOT(dev->devfn) == slot && !pc->no_hotplug) {
> > +            qdev_free(qdev);
> > +        }
> > +    }
> > +}
> > +
> >  static void piix4_reset(void *opaque)
> >  {
> >      PIIX4PMState *s = opaque;
> > @@ -302,6 +324,19 @@ static void piix4_reset(void *opaque)
> >          pci_conf[0x5B] = 0x02;
> >      }
> >      piix4_update_hotplug(s);
> > +    /*
> > +     * Guest lost remove events if any.
> > +     * As it's being reset it's safe to remove the device now.
> > +     */
> > +    while (s->pci0_status.down) {
> > +        acpi_piix_eject_slot(s, s->pci0_status.down);
> > +    }
> > +    /*
> > +     * Guest lost add events if any.
> > +     * As it's being reset and will rescan the bus we can discard
> > +     * past events now.
> > +     */
> > +    s->pci0_status.up = 0;
> >  }
> >  
> >  static void piix4_powerdown(void *opaque, int irq, int power_failing)
> > @@ -472,10 +507,10 @@ static void pcihotplug_write(void *opaque, uint32_t addr, uint32_t val)
> >      struct pci_status *g = opaque;
> >      switch (addr) {
> >          case PCI_BASE:
> > -            g->up = val;
> > +            g->up &= ~val;
> >              break;
> >          case PCI_BASE + 4:
> > -            g->down = val;
> > +            g->down &= ~val;
> >              break;
> >     }
> 
> Ok, so it is a bit(slot) mask and can clear multiple bits.
> 
> > @@ -490,19 +525,12 @@ static uint32_t pciej_read(void *opaque, uint32_t addr)
> >  
> >  static void pciej_write(void *opaque, uint32_t addr, uint32_t val)
> >  {
> > -    BusState *bus = opaque;
> > -    DeviceState *qdev, *next;
> > -    int slot = ffs(val) - 1;
> > +    PIIX4PMState *s = opaque;
> >  
> > -    QTAILQ_FOREACH_SAFE(qdev, &bus->children, sibling, next) {
> > -        PCIDevice *dev = PCI_DEVICE(qdev);
> > -        PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(dev);
> > -        if (PCI_SLOT(dev->devfn) == slot && !pc->no_hotplug) {
> > -            qdev_free(qdev);
> > -        }
> > +    if (val) {
> > +        acpi_piix_eject_slot(s, val);
> >      }
> >  
> > -
> >      PIIX4_DPRINTF("pciej write %x <== %d\n", addr, val);
> >  }
> >  
> > @@ -532,8 +560,8 @@ static void piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s)
> >      register_ioport_write(PCI_BASE, 8, 4, pcihotplug_write, pci0_status);
> >      register_ioport_read(PCI_BASE, 8, 4,  pcihotplug_read, pci0_status);
> >  
> > -    register_ioport_write(PCI_EJ_BASE, 4, 4, pciej_write, bus);
> > -    register_ioport_read(PCI_EJ_BASE, 4, 4,  pciej_read, bus);
> > +    register_ioport_write(PCI_EJ_BASE, 4, 4, pciej_write, s);
> > +    register_ioport_read(PCI_EJ_BASE, 4, 4,  pciej_read, s);
> >  
> >      register_ioport_write(PCI_RMV_BASE, 4, 4, pcirmv_write, s);
> >      register_ioport_read(PCI_RMV_BASE, 4, 4,  pcirmv_read, s);
> > @@ -567,8 +595,6 @@ static int piix4_device_hotplug(DeviceState *qdev, PCIDevice *dev,
> >          return 0;
> >      }
> >  
> > -    s->pci0_status.up = 0;
> > -    s->pci0_status.down = 0;
> >      if (state == PCI_HOTPLUG_ENABLED) {
> >          enable_device(s, slot);
> >      } else {
> 
> So if we have an old bios and do an add, followed by a remove, guest
> ACPI finds both "up" and "down" set for the slot and we rely on the
> ordering of the AML checking up before down to keep the duct tape and
> bailing wire from exploding?

Sounds like an overstatement.  If bios did it the other way around, it
would lose the down event.  This is exactly what happened originally
anyway.

> Hmm, can't say I'm excited about this hack
> either.  Thanks,
> 
> Alex

No need to get excited, but I'll add this to the documentation.
Michael S. Tsirkin April 3, 2012, 1:51 p.m. UTC | #7
On Tue, Apr 03, 2012 at 03:56:18PM +0300, Michael S. Tsirkin wrote:
> > > +    /*
> > > +     * Clear DOWN register - this is good for a case where guest can't
> > > +     * write to CLR_DOWN because of VM reset, and for compatibility with
> > > +     * old guests which do not have CLR_DOWN.
> > > +     */
> > > +    s->pci0_status.down &= ~(1U << slot);
> > 
> > Should we clear "up" here too?
> 
> On eject clearing "up" is probably OK. However I'm not sure
> what purpose would it serve, and it seems the less hacks we do the
> better.

I added a comment to clarify this.
diff mbox

Patch

diff --git a/docs/specs/acpi_pci_hotplug.txt b/docs/specs/acpi_pci_hotplug.txt
index f0f74a7..e4a9556 100644
--- a/docs/specs/acpi_pci_hotplug.txt
+++ b/docs/specs/acpi_pci_hotplug.txt
@@ -7,31 +7,70 @@  describes the interface between QEMU and the ACPI BIOS.
 ACPI GPE block (IO ports 0xafe0-0xafe3, byte access):
 -----------------------------------------
 
-Generic ACPI GPE block. Bit 1 (GPE.1) used to notify PCI hotplug/eject
+Generic ACPI GPE block. Bit 1 (GPE.1) is used to notify PCI hotplug/hotunplug
 event to ACPI BIOS, via SCI interrupt.
 
-PCI slot injection notification pending (IO port 0xae00-0xae03, 4-byte access):
+Semantics: this event occurs each time a bit in either Pending PCI slot
+injection notification or Pending PCI slot removal notification registers
+changes status from 0 to 1.
+
+Pending PCI slot injection notification (IO port 0xae00-0xae03, 4-byte access):
 ---------------------------------------------------------------
-Slot injection notification pending. One bit per slot.
+Slot injection notification is pending. One bit per slot.
+When written by Guest, this register implements write one to clear behaviour.
 
 Read by ACPI BIOS GPE.1 handler to notify OS of injection
 events.
+Written to by ACPI BIOS GPE.1 handler after reading
+and before notifying OS of injection events.
+
+Semantics: after a slot is populated by hotplug, it is immediately enabled
+and a bit in this register is set.
+Writes into this register are used to acknowledge the injection notification
+and do not not affect the slot status.
+Reset value: 0
 
-PCI slot removal notification (IO port 0xae04-0xae07, 4-byte access):
+Pending PCI slot removal notification (IO port 0xae04-0xae07, 4-byte access):
 -----------------------------------------------------
-Slot removal notification pending. One bit per slot.
+Slot removal notification is pending. One bit per slot.
+When written by Guest, this register implements write one to clear behaviour.
 
 Read by ACPI BIOS GPE.1 handler to notify OS of removal
 events.
 
+Written to by ACPI BIOS GPE.1 handler after reading
+and before notifying OS of removal events.
+
+Semantics: upon hotunplug request, a bit in this register is set
+and the OSPM is notified about hotunplug request, but a slot remains enabled
+until eject.
+Writes into this register are used to acknowledge the hotunplug request,
+and do not not affect the slot status.
+Reset value: 0
+
 PCI device eject (IO port 0xae08-0xae0b, 4-byte access):
 ----------------------------------------
 
 Used by ACPI BIOS _EJ0 method to request device removal. One bit per slot.
-Reads return 0.
+Guests should not read this register. In practice reads return 0.
+guests should write values with exactly one bit set, selecting
+the slot to eject.
+In practice all bits except the lowest bit that is set are discarded.  Also in
+practice writing the value 0 into this register has no effect.
+
+Semantics: selected hotunpluggable slot is disabled and powered off.
+Has no effect on non-hotunpluggable slots.
+
+For compatibility with old BIOS, writes to this register
+clear bits in pending slot injection notification register.
 
 PCI removability status (IO port 0xae0c-0xae0f, 4-byte access):
 -----------------------------------------------
 
-Used by ACPI BIOS _RMV method to indicate removability status to OS. One
-bit per slot.
+Used by ACPI BIOS _RMV method to detect removability status
+and report to OS. One bit per slot.
+Guests should not write this register, in practice the value
+written is discarded.
+
+Semantics: selected slots support hotplug. Contents of this register
+do not change while guest is running.
diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index 797ed24..389db16 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -287,6 +287,28 @@  static void piix4_update_hotplug(PIIX4PMState *s)
     }
 }
 
+static void acpi_piix_eject_slot(PIIX4PMState *s, unsigned slots)
+{
+    DeviceState *qdev, *next;
+    BusState *bus = qdev_get_parent_bus(&s->dev.qdev);
+    int slot = ffs(slots) - 1;
+
+    /*
+     * Clear DOWN register - this is good for a case where guest can't
+     * write to CLR_DOWN because of VM reset, and for compatibility with
+     * old guests which do not have CLR_DOWN.
+     */
+    s->pci0_status.down &= ~(1U << slot);
+
+    QTAILQ_FOREACH_SAFE(qdev, &bus->children, sibling, next) {
+        PCIDevice *dev = PCI_DEVICE(qdev);
+        PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(dev);
+        if (PCI_SLOT(dev->devfn) == slot && !pc->no_hotplug) {
+            qdev_free(qdev);
+        }
+    }
+}
+
 static void piix4_reset(void *opaque)
 {
     PIIX4PMState *s = opaque;
@@ -302,6 +324,19 @@  static void piix4_reset(void *opaque)
         pci_conf[0x5B] = 0x02;
     }
     piix4_update_hotplug(s);
+    /*
+     * Guest lost remove events if any.
+     * As it's being reset it's safe to remove the device now.
+     */
+    while (s->pci0_status.down) {
+        acpi_piix_eject_slot(s, s->pci0_status.down);
+    }
+    /*
+     * Guest lost add events if any.
+     * As it's being reset and will rescan the bus we can discard
+     * past events now.
+     */
+    s->pci0_status.up = 0;
 }
 
 static void piix4_powerdown(void *opaque, int irq, int power_failing)
@@ -472,10 +507,10 @@  static void pcihotplug_write(void *opaque, uint32_t addr, uint32_t val)
     struct pci_status *g = opaque;
     switch (addr) {
         case PCI_BASE:
-            g->up = val;
+            g->up &= ~val;
             break;
         case PCI_BASE + 4:
-            g->down = val;
+            g->down &= ~val;
             break;
    }
 
@@ -490,19 +525,12 @@  static uint32_t pciej_read(void *opaque, uint32_t addr)
 
 static void pciej_write(void *opaque, uint32_t addr, uint32_t val)
 {
-    BusState *bus = opaque;
-    DeviceState *qdev, *next;
-    int slot = ffs(val) - 1;
+    PIIX4PMState *s = opaque;
 
-    QTAILQ_FOREACH_SAFE(qdev, &bus->children, sibling, next) {
-        PCIDevice *dev = PCI_DEVICE(qdev);
-        PCIDeviceClass *pc = PCI_DEVICE_GET_CLASS(dev);
-        if (PCI_SLOT(dev->devfn) == slot && !pc->no_hotplug) {
-            qdev_free(qdev);
-        }
+    if (val) {
+        acpi_piix_eject_slot(s, val);
     }
 
-
     PIIX4_DPRINTF("pciej write %x <== %d\n", addr, val);
 }
 
@@ -532,8 +560,8 @@  static void piix4_acpi_system_hot_add_init(PCIBus *bus, PIIX4PMState *s)
     register_ioport_write(PCI_BASE, 8, 4, pcihotplug_write, pci0_status);
     register_ioport_read(PCI_BASE, 8, 4,  pcihotplug_read, pci0_status);
 
-    register_ioport_write(PCI_EJ_BASE, 4, 4, pciej_write, bus);
-    register_ioport_read(PCI_EJ_BASE, 4, 4,  pciej_read, bus);
+    register_ioport_write(PCI_EJ_BASE, 4, 4, pciej_write, s);
+    register_ioport_read(PCI_EJ_BASE, 4, 4,  pciej_read, s);
 
     register_ioport_write(PCI_RMV_BASE, 4, 4, pcirmv_write, s);
     register_ioport_read(PCI_RMV_BASE, 4, 4,  pcirmv_read, s);
@@ -567,8 +595,6 @@  static int piix4_device_hotplug(DeviceState *qdev, PCIDevice *dev,
         return 0;
     }
 
-    s->pci0_status.up = 0;
-    s->pci0_status.down = 0;
     if (state == PCI_HOTPLUG_ENABLED) {
         enable_device(s, slot);
     } else {